DPUB IG Metadata Task Force Report

“Vertical” Interviews (Notes by Madi Solomon)

A small controlled survey of Pearson’s thoughts on metadata

Executive Summary

Metadata, data about data, has been a conversation piece in the publishing industry for years now, but proving its usefulness to the businesses has remained elusive. While most people can speak of the importance of metadata and how it has been successfully monetized by the likes of Facebook, Amazon and Google, in local practices it is still perceived as a labour-intensive manual effort with few redeeming benefits. Even with past efforts of applying minimal metadata, our assets remain undiscoverable, rights information is locked and difficult to obtain, and federated search provides scant returns.

This persistent reality has remained despite several enterprise deployments of different asset management systems over the past ten years. The systems were not at fault, but our process was. Many large publishing organisations have never had a global strategy. Investments in digital asset management and rights systems tended to concentrate in the US and UK leaving significant international regional offices such as Brazil, Australia, and Mexico, completely isolated, without tools or access.

This approach to asset and content management must change if we are to unite around a single suite of technologies that streamline global access, embed efficacy measures, and enable digital distributions across all devices and platforms.

In preparation for a Pearson Metadata & Taxonomy Roadmap, Madi Solomon conducted 12 anonymous interviews with education publishing representatives across many different Lines of Business (mostly from US, UK, and Canada). This report synthesizes the results of the survey.

Sponsorship

These interviews were commissioned by the Semantic Platforms & Metadata team of the Core Platforms and Enterprise Architecture (Pearson Technology) and the W3C Digital Publishing Interest Group, Metadata Taskforce, in the quest to answer the question “What are the metadata pain points for publishers as they evolve to digital distribution?”

Goal

The results of this survey inform two goals:

Results have been combined with another set of interviews that were conducted by the Co-Chairs (Madi Solomon - Pearson, Bill Kasdorf - Apex CoVantage) of the W3C Digital Publishing Interest Group, Metadata Taskforce, to provide a broad view of publishing challenges around metadata.
Inform the Metadata & Taxonomy Roadmap to be created by Ian Piper, Chief Enterprise Taxonomist and Madi Solomon, Semantic Platforms, to ensure business relevance across the enterprise.

Methodology

One-to-one half-hour interviews were conducted over a four-week period in March-April 2014. The interviews proceeded as anonymous, casual and candid conversations on experiences or observations around metadata.

Results

When the results of the W3C Metadata Task Force were combined with those from this report, differences in metadata expectations between Trade and Education publishing surfaced. The majority of those interviewed by Bill Kasdorf were trade publishers, while Pearson, still in its transformative state to digital, was more focused on modularising content. Some of these differences are illustrated in the table below.

Trade	Education Publishers
Trade publishers stated that metadata complexity, mostly with ONIX, was a challenge to their business	Not a single interviewee mentioned ONIX or any other industry standard.
Trade publishers lamented the many metadata vocabularies (ONIX, BISAC, PRISM, etc.) and the difficulty in keeping up-to-date on all of them.	One interviewee mentioned multiple metadata standards and vocabularies as an issue
“Few books are online anyway,” was a general response from Trade. Other than STEM journals and articles, traditional publishers considered books as whole products and rarely modularised or componentized content. Metadata was relegated to Title and Author and not much more.	This did not apply to any representative as modular education content, personalised content, learner outcomes and efficacy measures were all based on data. The ability for personalisation (by a student, teacher, or institution) was a top priority for education publishers.
ONIX vs. Subject Metadata was a common debate in Trade. ONIX was originally created for the supply chain (retailers, etc), primarily for physical books and has since been updated (ONIX 3.0) for eBooks. There was general resistance to ONIX 3.0 because publishers believed 2.1 was fine and what the supply chain demanded. Subject or descriptive metadata on the intellectual content of a book was not easily supported or embedded in ONIX.	Subject metadata was a key entry point for educational content and information such as Learning Outcomes/Objectives and Learner levels were considered essential.
Trade publishers recognized the need for Keywords for books, chapters and component discovery, but were not necessarily interested in a controlled vocabulary. Trade publishers were only just beginning to realise that search engines did not use Library Catalogues to find book titles.	Many interviewees stated that many controlled vocabularies were required to optimise discovery and to re-purpose existing content.

Education Publishers’ Results

Percentages refer to the number of respondents who mentioned these topics as top metadata priorities for their business.

Governance—100%

Every respondent cited the lack of metadata governance and authority as a major issue in their daily interactions with metadata. There wasn’t an authority to help dictate metadata requirements or to help embed or impose standards across the workflow. Instead, the “right to refuse” remained steeped in the traditional business culture where editorial had the authority to reject anything. The right to reject was also scattered across the workflow.

QUOTES:

“We’re nowhere on this. Every publisher right now adds and amends metadata at will so there is no cohesive approach.”

“Governance is critical in ensuring that customers have a good experience with our platforms.”

“We need to standardise on format for dates, for example, and make everyone use the ISO standards. This would have a very positive impact overall.”

“Governance is a pain point. In the past, governance boards were formed by non-metadata experts. We should be adopting industry standards and impact should not be a decision choice.”

“There are no mechanisms, governance, or levers that apply metadata, so the businesses just remain frustrated.”

Rights—60%

A majority of respondents stated that Rights Information was one of the most important metadata issues facing publishers at large. Without trusted Rights data, the businesses would rather err on the side of caution and re-create or re-commission content than re-use or re-purpose existing content. The lack of Rights information on content and assets was costing organisations a small fortune in duplication and litigation. A means for querying rights information from source content/titles to the many derivatives was a priority need. While Rights Information was available in internal rights systems, the difficulty and the long wait in getting requested information were a source of frustration to many.

QUOTES:

“Our legacy systems are intractable and need to be abandoned.”

“We are six months behind in rights clearances for digital delivery and this is a major bottleneck.”

“There really shouldn’t be a separation anymore between the technologies that handle rights information. We need a new global rights strategy.”

“U.S. Rights can be cleared in our rights system … but there is no consideration for data governance.”

“We’re getting better at this by getting better rights interactions between people. Now we need to get our systems to interact.”

Flow—60%

60% of the respondents thought that the flow of metadata was seriously compromised. There were many opportunities for metadata to be inherited, but no measures or mechanisms were created to capture them in the content lifecycle. Metadata was an afterthought in most workflows. Automation existed only in scraping exercises long after the content or assets were created, leaving distributors to guess at vital information such as rights or the identity of source originals. The onus of applying metadata, data clean-up, and format conversions was left to the receiving platforms. This could be ameliorated with a more holistic, coordinated view.

QUOTES:

“Metadata doesn’t flow upstream! We should fix this whatever we do.”

“A hybrid approach would be ideal, where humans populate some fields with controlled vocabularies and the rest auto-populated or scraped.”

“There is a general confusion between metadata, the process and flows of metadata, and what and when it’s captured.”

“The current workflow is more bothered than helped by metadata.”

“We end up being in the conversion business rather than the content delivery business. There are glaring inefficiencies, if not outright broken components, in our workflow and our ways of working.”

“Metadata is currently a myth, it simply does not exist. So what is its value? Metadata’s value can only be measured by its application so we’re stuck in a Catch 22: there is no metadata so there is no value so there is no use case.”

Lack of Skills—50%

In our current changing culture, modular content demands more disciplined data care, but the businesses have not caught up to this. In general, the businesses don’t have the knowledge or skills to design a metadata growth model and by default, expect it to be done by someone else. This option, while valid, has not been formalised in any way and consequently, no such entity exists (other than vendors). There were many recommendations for a centralised service to help with this (See Centralised Service—40%). Overall, the businesses requested more help in defining the new rules of engagement.

QUOTES:

“We are not data specialists, and this is all about data.”

“There are missing links between the workflows and information.”

“We need more education in order to unlock the potential of metadata.”

“It’s a chicken and egg thing: how to innovate and still support key business functionality?”

Lack of Authority—50%

Half of the respondents collectively wanted more authority around metadata. The businesses wanted to know who “owned” metadata. Rather than leaving it up to the businesses, or the editorial process, they requested a stronger authority that could better support and enforce the required standards and could extend this authority to influence technology measures that ensured compliance. This related directly to the Governance issues around metadata, which further substantiated the need for a centralised entity to fully manage, monitor, and govern metadata standards.

QUOTES:

“Consensus takes too much time.”

“There are no governing principles so we are capturing metadata but without a good story. By story, I mean we don’t know the worth of the effort.”

“There would be more acceptance if we had a stronger top-down mandate.”

“We need a central function of metadata, taxonomy, and vocabularies with an authority that manages it.”

“We need a balance between enforced standards through tools and extending standards for specific business needs.”

“Metadata was the responsibility of the engineers and developers of the system. The role of metadata should be shifted to a specialist.”

Standards—50%

Metadata standards were cited as something that should be identified, imposed, and managed. This included all forms of metadata from technical, structural, descriptive, search optimisation, educational standards and curriculum, all the way to online delivery standards. The use of standards was the solution for keeping content fluid enough to be shared and distributed across multiple platforms and devices. These were not solely for content, however, but also applied to consumer and learner data. These standards were recognised as essential in realising goals for personalisation and recommendations.

QUOTES:

“We need to implement minimum metadata standards and vocabularies.”

“The problem is that terms are dictated by owners so there are conflicts between systems because they are not standardised.”

“Rather than working groups coming up with standards, we should use educational standards already in existence.”

“If I had a magic wand, I’d build collaborations between product and services. Product has been isolated and really, 75% of metadata should be relevant across all platforms. I’d normalize metadata.”

“No one is aware of what standards we’re supposed to use.”

“We need to make standards adoptable. Right now it’s too difficult to get people to change.”

“Fundamental change in our process is necessary.”

Conclusion

Metadata touches many parts of the digital supply chain, yet a comprehensive approach to its application and its value has been poorly executed. Reasons for this include a long list of exhausted efforts with good intentions. The businesses, however, may now be ready to embrace changes to their traditional approaches to content creation and are particularly open to the prospect of data-driven workflows that extend to efficacy metrics and personalisation of learning objects.

“Horizontal” Interviews (Notes by Bill Kasdorf)

All the interviews are published with the interviewees' consent.

Laura Dawson, Bowker

Laura’s first observation was that there is “so much metadata available—that’s not the problem.” She said that “wrangling all the descriptions” is really the problem, because of the proliferation of vocabularies and terms.

The biggest pain point for publishers in her view is “the problem of updates.” Hachette sends metadata to X recipient; that metadata doesn’t see the light of day for three weeks; in the meantime the metadata has been updated in a weekly feed. Feeds can overwrite other feeds.
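One way a recipient might guard against a delayed feed overwriting a newer one is to compare a last-updated timestamp before applying each record. The following is only a minimal sketch under that assumption; the record fields (`isbn`, `title`, `updated`), the function name, and the sample values are hypothetical and do not come from ONIX or from any system mentioned in the interview.

```python
from datetime import datetime


def apply_feed(store, feed):
    """Merge feed records into store, keeping the most recently updated version of each record."""
    for record in feed:
        key = record["isbn"]
        current = store.get(key)
        # Only accept the incoming record if nothing is stored yet or it is strictly newer.
        if current is None or record["updated"] > current["updated"]:
            store[key] = record
    return store


# Placeholder data: the weekly feed carries a newer update than the delayed one.
store = {}
weekly = [{"isbn": "9780000000002", "title": "Revised Title", "updated": datetime(2014, 3, 10)}]
delayed = [{"isbn": "9780000000002", "title": "Original Title", "updated": datetime(2014, 2, 17)}]

apply_feed(store, weekly)
apply_feed(store, delayed)  # arrives late; the older record is ignored
print(store["9780000000002"]["title"])  # Revised Title
```

This only works if every sender supplies a reliable update timestamp, which the interview suggests cannot be taken for granted.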

Another key problem: different versions needed for different recipients: “there is no one true ONIX.”

Issues are not as much about the structure, or about the vocabularies, but how they are used.

There is also a vocabulary issue re subjects of components: BISAC doesn’t work, “not granular enough.” “If you really want to embed metadata in a meaningful way, you can’t do it.” (She mentioned IPTC codes in this context.)

She is a big fan of schema.org. Publishers need to use identifiers for people (ISNI, ORCID).

What’s needed is some combination of bibliographic ontology and relations.

May need different types for selling books vs. finding them. Library schemes (e.g., Bib Extend Community Group) primarily focus on libraries, which have properties like “holdings” and “checking out a book” that are irrelevant to publishers.

She said that “Google should be the primary audience—that’s how people find books.” Search engines and the indexing they do are key. Perhaps making this information viewable is the key, the link back to the OWP and W3C—“even if the content isn’t available on the open web.”

Fran Toolan, CEO, Firebrand

Fran characterized the biggest problem as “Publishers don’t understand the role of the web. They understand that everyone is on it, but they are confused by all the virtual storefronts, and very confused about SEO, how to determine keywords, how does book industry metadata (BISAC) fit in. They don’t have workflows around this. They don’t have somebody on staff looking at Google AdWords for every title.”

Another big problem: every retailer site has its own ingestion engine, its own specifications.

The metadata itself isn’t as bad as people think it is. The vast majority of publishers are creating the metadata they’ve been told to create.

Another key insight from Fran: “Metadata was once identified as a method by which a publisher could control the perception of the book. That is no longer the case. With all the review sites, social media, etc., it is now out of the publisher’s control. The publishers don’t know what to do about this.”

“The right question is ‘How can publishers be more successful on the Web?’”

He feels that metadata in the context of the W3C “means something completely different than what it means in the book supply chain.” “What’s ‘the product’ on the Web?”

There’s confusion regarding usage. Most people consider usage info as “metadata”: how much of the book can display, where it can be sold, who has the rights to it, etc. This is really supply chain data.

“ONIX is irrelevant to the W3C.” It doesn’t show up anywhere on the Web.

“What the publisher cares about is if the book can be discovered.” The Web page should be an aid to discovery. “But the publisher is not in control of that because they’re not the retailer.”

Biggest problem: the publisher doesn’t know what affects what. “Is the info they’ve been browbeaten to give over the past few years doing any good? No way to know, no feedback loop.”

He feels that “BISAC categories are irrelevant to the W3C.” How can this translate to the W3C? Keywords are a big deal. (Note that BISG is working on a keywords recommendation—not a vocabulary, but guidelines regarding keywords.)

“CrossRef and schema.org work because they’re each a central registration agency.” “Bowker and Books In Print used to be that in the print era. Currently there is no single repository for book metadata.” “The only way to get a central registration agency for book metadata is if the retailers agree to use it. Amazon and Apple will never do that.” “Retailers like Walmart etc. use UPC (bar codes) for which GS1 is the central registration agency (for general merchandise). This never got any legs for books.”

Thad McIlroy, Consultant and co-author, The Metadata Handbook

“ONIX and metadata in general are just way too complex for the average trade publisher.” “These people may never come along.”

“We need ONIX LITE focused on marketing and discovery.”

“Very few publishers have the resources of a Pearson to get ONIX right.”

“The metadata situation for a receiver is just crazy—TONS of inconsistency.”

“The use case is not clear—what is the benefit?”

“There’s also lots of inconsistency in what the recipients require.”

The ONIX 3.0 vs. 2.1 situation is a real problem.

“Today, in 2014, the point of metadata is marketing books and making them discoverable. We should just focus on that.”

“Maybe schema.org should become ONIX LITE.”

Renee Register, Consultant and co-author, The Metadata Handbook

“Getting people to use the same model and metadata is the big issue. Everybody—Amazon, Books In Print, B&N, Baker & Taylor, etc. etc.—all use proprietary models. There are huge proprietary databases. There is tremendous duplication of effort.”

“Publishers feel as if they don’t have control of the data.”

“There’s a big need for metadata remediation, cleanup.”

“It looks different on Ingram, Amazon, B&N, B&T, etc.—it’s all over the place.”

She “would love to see a central repository.”

“Inputting ONIX is too difficult—there are no good tools or mechanisms to create ONIX.”

Vincent Baby, Chairman of the Board, International Press Telecommunications Council (IPTC)

Vincent Baby is Chairman of the Board of the IPTC (International Press Telecommunications Council), the major source of metadata standards for the news industry, including the widely used IPTC Photo Metadata schemas, the XML-based NewsML, the RDFa-based rNews, RightsML, and others.

Mr. Baby—a journalist by training, currently in a product management role with Thomson Reuters—made it clear that his comments in our conversation were his own personal views; he was in no way speaking for the IPTC or Thomson Reuters.

Here are some links to relevant IPTC resources sent by Mr. Baby: “IPTC is constantly updating core standards like photo metadata and NewsML and crafting new standards for semantic annotation of news items on the Web, embeddable rights expressions and APIs. We have also recently revived our NewsCodes Working Party which develops and maintains provider-agnostic taxonomies.”

He began by pointing out that the IPTC and W3C overlap and intersect in many ways; currently, the main link is ODRL, the Open Digital Rights Language. IPTC has done a lot of work on a rights expression language that he thinks could be very valuable to publishers of all sorts (most others are unaware of it). Recognizing that the “free text” most such models enable is ineffective, IPTC focused on developing machine-processable metadata. Their RightsML is ODRL with a vocabulary specific to the news industry.

The Semantic News Community Group at the W3C “emanated from IPTC members.” However, this activity has been pretty dormant for the past 2-3 years. He commented that a number of IPTC members are “wary of the W3C” because of IP concerns. However, he mentioned that one of their constituent organizations, EBU (the European Broadcasting Union in Geneva) has been an active participant in the W3C.

The story of rNews is interesting in the context of the other conversations I’ve been having with people from other sides of publishing. The New York Times brought the use case to the IPTC initially: the need to preserve metadata in HTML pages. News publishers have very rich metadata in their CMSs but it is lost in the content that goes online. What was developed was rNews, an RDFa-based model that makes news more accessible to search results and to social media and allows better displays such as “rich snippets”.
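To illustrate the general idea of carrying CMS metadata into the HTML page, here is a minimal sketch that renders a few fields with RDFa Lite attributes (`vocab`, `typeof`, `property`) and schema.org terms. It is not the rNews specification itself: the function name and the dictionary keys are hypothetical, and the author is emitted as plain text for brevity.

```python
from html import escape


def article_to_rdfa(article):
    """Render CMS fields as HTML carrying RDFa Lite attributes with schema.org terms."""
    published = escape(article["published"], quote=True)
    return (
        '<article vocab="http://schema.org/" typeof="NewsArticle">\n'
        f'  <h1 property="headline">{escape(article["headline"])}</h1>\n'
        f'  <span property="author">{escape(article["author"])}</span>\n'
        f'  <time property="datePublished" datetime="{published}">{published}</time>\n'
        f'  <div property="articleBody">{escape(article["body"])}</div>\n'
        "</article>"
    )


# Hypothetical CMS record; the values are placeholders.
cms_record = {
    "headline": "Markets & metadata",
    "author": "A. Reporter",
    "published": "2014-04-02",
    "body": "Body text of the story.",
}
markup = article_to_rdfa(cms_record)
print(markup)
```

The point of the sketch is only that fields already held in the CMS can travel with the page as machine-readable attributes rather than being dropped at publication time.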

Here’s the part of the story I found most resonant: in June 2011 they became aware of schema.org, and there was a concern that schema.org would overshadow rNews. They contacted schema.org and found that schema.org was in fact very interested in collaborating and open to the input of representative organizations such as IPTC in various domains. The result was that schema.org now incorporates a small subset of rNews properties. This resonated with comments others have made about the need for an “ONIX Lite” or some similar ability to get subject metadata into schema.org. BTW, in my view it is probably Thema, the new international subject codes standard, not ONIX, that’s the best candidate for this.

The result is that many rNews properties are now widely adopted in the news industry, though mainly the subset that’s in schema.org. The big players like the BBC use the full rNews schema.

In discussing the challenges publishers face, Mr. Baby named two fundamental issues:

Maximizing engagement with the user base
The need for efficiency and economy

He pointed out that metadata can have a role in both areas.

He remarked that publishers initially focused on SEO; but that proved ineffective because just getting a user to a given web page didn’t build any lasting value or connection with that user, or knowledge about that user. Now they are more focused on social media and semantic markup.

An important aspect of the ecosystem today is the importance of rich multimedia content. Cross-media alignment between assets has become critically important. The problem is that there are different taxonomies and different levels of richness associated with different types of media assets. For example, video has great technical metadata but very poor subject metadata.

It’s not just about text anymore. It’s about text + pictures, interactivity, databases, Twitter, and on and on. This multiplicity of media formats presents a huge challenge: how do you keep track of everything?

On the subject of engagement, one thing he lamented was that embedded metadata winds up getting stripped out when it gets to Twitter, Facebook, etc.: most social networks are “cavalier about metadata—they just toss it out.” He mentioned an “Embedded Metadata Manifesto” that is hoping to counter this [Link provided below].

Another important issue: how do you measure impact? He said that “all kinds of initiatives are working on this.” Over time, publishers will want to add this information to their metadata, e.g. “this is a bestseller” or “this has been retweeted X times.” [Note: that is already intrinsic to ONIX for the book supply chain.]

He had a lot of interesting and resonant points to make in the context of efficiency. [I think a lot of what he pointed out regarding news will become increasingly relevant to most types of publishing, which are moving from a “once-and-done” model to a situation where content evolves over time.]

An area I found particularly relevant to work in other areas was his discussion of “how to automate workflows with metadata?” and “how to mark up content in a way that makes it easy to reuse chunks?” In the news industry, it is common for new versions of a story to build on previous versions, with lots of overlap. But often, when there is only a small proportion of updated content, it becomes a whole new story.

One aspect of this is the need for granular structural semantic markup: What is a quote? What is a “background paragraph”? What is a “lede”? etc. There is a need to “isolate these bits” so that the user can be presented with “just the new bits.”
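As a rough illustration of why such markup helps, the sketch below assumes each version of a story is already split into blocks labelled with a role (lede, quote, background) and returns only the blocks that did not appear in the previous version. The block structure, role names, and sample text are hypothetical, not an IPTC model.

```python
def new_blocks(previous, current):
    """Return the blocks of `current` whose (role, text) pair does not occur in `previous`."""
    seen = {(block["role"], block["text"]) for block in previous}
    return [block for block in current if (block["role"], block["text"]) not in seen]


# Two successive versions of a made-up story; only the lede and the quote are new.
version_1 = [
    {"role": "lede", "text": "Talks on the agreement have stalled."},
    {"role": "background", "text": "Negotiations began last year."},
]
version_2 = [
    {"role": "lede", "text": "Talks on the agreement have resumed."},
    {"role": "quote", "text": "\"We are hopeful,\" a spokesperson said."},
    {"role": "background", "text": "Negotiations began last year."},
]

for block in new_blocks(version_1, version_2):
    print(block["role"], "->", block["text"])
```

Without the role labels, the same comparison would have to fall back on a text diff of the whole story, which cannot tell a changed lede from unchanged background.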

Many people are putting their hopes in an “event-based” model where “unique events” can be identified in advance, with an ID, and then managed over time. Others, including for example the BBC and the Guardian, object that this does not work because this is not how journalists actually work. New stories pop up unexpectedly and then twig off in unexpected directions that can’t be anticipated in advance. (E.g., Ukraine rejects the deal with the EU and a few months later Crimea has been annexed to Russia, with lots happening in between that nobody would have expected.) Newsrooms typically identify the successive iterations of stories using “slugs” (keywords) that enable users (and journalists) to follow how a story evolves. The “storyline” model aims to organize these free-form slugs ex-post using a dedicated ontology, thereby leveraging in the background a pre-existing workflow.

In this context, archives have become increasingly important: they help get more engagement from readers. This is another issue that crosses media types. He mentioned that just this week, ProPublica has posted a very interesting draft on their work on how to archive “news applications.” This is based on the need for an interactive database that is queryable on a very local level. How do you archive the stories that make up that repository to enable this to work? (See below for further links.)

Another key issue in all this is the human dimension. He pointed out that “journalists are very creative people,” interested in the substance, not the process. They’re generally not disciplined about cataloging, labeling, etc. So “metadata becomes a struggle.” Management, on the other hand, recognizes that this metadata is crucial: it’s what makes the content useful to both users and publishers.

Having said that, there can be a “blind faith” that putting the systems and processes in place will resolve the problems: “If we fix our metadata everything will be hunky dory.” This is obviously naïve.

Finally, he raised the familiar issue of the proliferation of devices, OS’s, form factors, etc. He pointed out that during a recent weekend 55% of The Guardian’s content was viewed via mobile, where there are hundreds of different devices and form factors. Therefore, responsive design becomes crucial. A challenge: everything needs to be tested for compatibility across a broad range of browsers and devices.

He closed our conversation with an offer to provide links to a number of useful resources—which in fact he did about an hour after our conversation. Here’s what he sent:

W3C Semantic News Community Group (purely for reference)
Leading edge innovators in the news publishing industry
- FT Labs: http://labs.ft.com/
- NYT Labs: http://nytlabs.com/
- Knight Mozilla OpenNews: http://opennews.org/
- BBC http://www.bbc.co.uk/blogs/internet/
Structured news
- Circa http://www.fastcolabs.com/3008881/tracking/circas-object-oriented-approach-to-building-the-news
- Storyline Ontology http://www.bbc.co.uk/ontologies/storyline/
- Structured Journalism (Reg Chua) http://structureofnews.wordpress.com/structured-journalism/
Dealing with non standard media formats
- ProPublica http://www.propublica.org/nerds/item/a-conceptual-model-for-interactive-databases-in-news
- CMSes http://www.fastcolabs.com/3022755/whats-so-hard-about-about-building-a-cms
Semantic Web (for news)
- rNews, schema.org and New York Times http://open.blogs.nytimes.com/2012/02/16/rnews-is-here-and-this-is-what-it-means/
- rNews at the BBC http://www.bbc.co.uk/blogs/internet/posts/News-Linked-Data-Ontology
- Dynamic Semantic Publishing at the BBC http://www.bbc.co.uk/blogs/bbcinternet/2012/04/sports_dynamic_semantic.html
Embedded metadata
- Embedded Metadata Manifesto http://www.embeddedmetadata.org/
Measuring impact

Michael Steidl, Managing Director, International Press Telecommunications Council (IPTC)

My interview with Michael was an excellent complement to the interview I conducted with Vincent Baby, Chairman of the Board of the IPTC. Whereas Vincent is part of the volunteer leadership (he works for Thomson Reuters), Michael’s full time responsibility is the very wide-ranging work of the IPTC, an organization that has been deeply involved in the development and use of metadata standards for 35 years. Its standards are extensively implemented globally, primarily in the context of news (including textual, image, and multimedia content).

Although his focus is of course not primarily on books, Michael began by observing that the concept of “book” is evolving due to the emergence of digital books, along with print: a “book” really becomes the intellectual content, not just a product, and even the nature of that intellectual content is changing.

Similarly, in the IPTC’s area, the question “what is news?” is evolving for many of the same reasons. It can mean “professional news,” or it can be broadened to include blogging, news created by private individuals, etc. IPTC focuses on the former: they represent “the professional creation and distribution of news,” not everything that might be new, or “every 10th tweet would be news.”

IPTC has a long history of metadata work. Their first standard with metadata was created in 1979, and it is still in use: “IPTC 7901,” which is a sibling to “ANPA 1312” in the US. (There are only minor differences—they are “95% in common.”)

IPTC also began to focus on multimedia early on, creating IIM (their Information Interchange Model) which was adopted by Adobe in 1994. This was the origin of photo metadata. They have been “deeply involved” in image and multimedia metadata since then. All IIM properties can be expressed in XMP (the eXtensible Metadata Platform, originally an Adobe spec for embedding metadata in Photoshop, Illustrator, etc. files, which is now an international standard): IPTC Photo Metadata Standard is the metadata vocabulary, XMP is the mechanism for embedding it.

They are also deeply involved with Identifiers, recognizing that all “creative work needs an identifier.” “What makes a string an identifier?” In the book industry, identifiers like DOI and ISNI and ISBN are maintained by organizations that formally issue the identifiers. But there’s a big difference between news and books: e.g., a given book publisher might publish 5, 10, or even 1,000 books a week (just the giants), whereas a mid-sized news agency produces 1,000 items per day, and a large news agency can produce 10,000 items per day. Thus they can’t “hand pick an identifier,” this doesn’t scale for news. Instead, they need self-describing identifiers. He pointed out that URIs and URLs have very high relevance for this because they are both an identifier and a carrier of information about what is being identified (unlike the identifiers like DOI, ISNI, and ISBN, which just provide the key to obtain information about what is being identified).
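The contrast between an opaque key and a self-describing identifier can be sketched as follows. The URI layout (provider domain, item type, date, sequence number) is purely an assumption made for illustration and is not an IPTC convention; the domain is a reserved example name.

```python
from urllib.parse import urlparse


def mint_item_uri(provider_domain, item_type, date_iso, sequence):
    """Build an identifier automatically from facts the producing system already knows."""
    return f"http://{provider_domain}/{item_type}/{date_iso}/{sequence:06d}"


def describe(uri):
    """Recover the embedded information from the identifier itself, without a lookup service."""
    parts = urlparse(uri)
    item_type, date_iso, sequence = parts.path.strip("/").split("/")
    return {
        "provider": parts.netloc,
        "type": item_type,
        "date": date_iso,
        "sequence": int(sequence),
    }


uri = mint_item_uri("news.example.org", "text", "2014-04-02", 4217)
print(uri)            # http://news.example.org/text/2014-04-02/004217
print(describe(uri))
```

Because the identifier is generated rather than issued by a registration agency, it can be produced at the rate of thousands of items per day, which is the scaling concern raised above.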

Another topic he stressed was the issue of metadata schemas. There are “lots of organizations creating schemas, and many of their properties are quite similar in terms of semantics.” The problem: there are reasonable schemas for different areas of creative work in a given area, but “looking across boundaries it is hard to bring them together.” He pointed out that text, video, etc. all have different schemas. In multimedia, you are often dealing with 5, 6, 7, or 8 different metadata schemas at the same time.

The IPTC is trying not to “contribute to metadata proliferation.” Their first rule of thumb: “Is this already defined somewhere?”

Rights metadata is very important in this regard. IPTC “will not create its own rights metadata schema.” A book may use content, text, graphics, photos, videos, etc. from news agencies and so they need their metadata to be as consistent as possible. They are working with the W3C Community Groups and have been particularly involved in the development of Open Digital Rights Language (ODRL).

They are also a driver of the Linked Content Coalition (LCC), a group of 40-some organizations (including EDItEUR: both Michael and Graham Bell are directors) that is a follow-up to ACAP (Automated Content Access Protocol). They are working on creating a framework for “exchanging rights information across the silos.”

Another big topic: metadata values.

What’s easy are literal values like dates. Much harder is conceptual information, e.g. “The person in this picture is Mr. So-and-so.” There is a need for a common way to describe entities (people, companies, etc.). Now, with the Semantic Web, it is much more common for each entity to have an identifier.

He pointed out that in the context of news, proprietary identifiers get created within given news organizations because of urgency: “they have to do this right away” in order to manage their information. Now there’s a need for a layer for sharing information. One potential solution is a layer on top of Wikipedia, but each Wikipedia edition is bound to a single language, and a Wikipedia link points to a single article. Wikidata is an important initiative for extracting “entities and topics” and enabling the application of a “generic identifier” that could, e.g., provide a list of “all articles in all languages associated about this entity.”

The IPTC has done work on subject categorization (originally “Subject Codes,” now the improved “Media Topics”): over 1,000 terms of content description, which has a “clear focus on news.” IPTC is working on mapping IPTC Media Topics to proprietary vocabularies used in the news industry and then to Wikidata, thus providing a sort of “hub” between all those proprietary schemes and Wikidata.

IPTC made a formal decision in mid-March that all vocabularies will use SKOS (Simple Knowledge Organization System), which enables “matching” of identifiers at different levels. E.g., “my concept identifier relates to your concept identifier,” but as either an exact match, a subset relationship, or a superset relationship (e.g. “Book Industry” could map as a subset concept to “Economy”).
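The three kinds of matching can be sketched with the SKOS mapping properties exactMatch, broadMatch, and narrowMatch. The sketch below uses plain Python tuples rather than an RDF library, and the vocab-a and vocab-b concept URIs are hypothetical stand-ins for two organizations' vocabularies.

```python
SKOS = "http://www.w3.org/2004/02/skos/core#"
EXACT = SKOS + "exactMatch"
BROAD = SKOS + "broadMatch"    # the object is the broader concept
NARROW = SKOS + "narrowMatch"  # the object is the narrower concept

# exactMatch is symmetric; broadMatch and narrowMatch are inverses of each other.
INVERSE = {EXACT: EXACT, BROAD: NARROW, NARROW: BROAD}

# Hypothetical mappings between two vocabularies, as (subject, property, object).
mappings = {
    ("http://vocab-a.example.org/book-industry", BROAD,
     "http://vocab-b.example.org/economy"),
    ("http://vocab-a.example.org/football", EXACT,
     "http://vocab-b.example.org/soccer"),
}


def with_inverses(triples):
    """Add the mapping seen from the other vocabulary's side."""
    result = set(triples)
    for subject, prop, obj in triples:
        result.add((obj, INVERSE[prop], subject))
    return result


def matches(concept, triples):
    """List (mapping type, target concept) pairs for one concept."""
    return sorted(
        (prop.rsplit("#", 1)[1], obj)
        for subject, prop, obj in triples
        if subject == concept
    )


closed = with_inverses(mappings)
economy_matches = matches("http://vocab-b.example.org/economy", closed)
```

Here "Book Industry" is recorded as having "Economy" as a broader match, so from the other side "Economy" gains "Book Industry" as a narrower match.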

Finally, he talked about formats: how to express a vocabulary, the syntax, etc. He observed that there are “fashions” in formats. “Ten years ago, everything had to be XML; now, XML is old fashioned, everything has to be JSON.” (BTW I hear this “fashion” issue all the time!)

IPTC has a deep involvement in XML, but “with the advent of APIs, XML is too complex, JSON is much simpler.” So now IPTC “needs to reformulate in JSON.” This is an ongoing challenge for standards organizations: “following the fashions of the industry.”
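As a rough illustration of what "reformulating in JSON" involves, the sketch below converts a small XML record into JSON using only the Python standard library. The element and attribute names are invented for the example and do not come from any IPTC standard.

```python
import json
import xml.etree.ElementTree as ET

# A made-up XML news item; the structure is an assumption for illustration.
XML_ITEM = """
<item id="urn:example:item:1">
  <headline>Metadata task force publishes report</headline>
  <subject code="example:publishing"/>
  <subject code="example:standards"/>
</item>
"""


def item_to_json(xml_text: str) -> str:
    """Map the XML item onto a flat JSON object."""
    root = ET.fromstring(xml_text.strip())
    record = {
        "id": root.get("id"),
        "headline": root.findtext("headline"),
        # Repeated XML elements become a JSON array.
        "subjects": [s.get("code") for s in root.findall("subject")],
    }
    return json.dumps(record, indent=2)


as_json = item_to_json(XML_ITEM)
```

Even in a toy case, the conversion forces decisions (attributes versus elements, repeated elements versus arrays) that a standards body has to settle once for everyone.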

Carol Meyer, head of Business Development and Marketing for CrossRef and immediate Past President of the Society for Scholarly Publishing.

Carol is in an ideal position to comment on the issue of metadata because CrossRef is a receiver of an enormous amount of metadata from publishers and Carol’s job involves a lot of work directly with the publishers. And as I’ve mentioned in many of these interviews, it is really thanks to CrossRef that metadata is perceived as basically a “solved problem” in the context of scholarly and STM publishing.

Carol started by pointing out that one reason for CrossRef’s success and its near universality in the scholarly publishing realm is that it started collecting metadata for a very specific purpose: obtaining just the specific bibliographic metadata required to enable reference linking.

This was done initially just for journals, and is now literally a given in journal publishing; a journal article is considered invisible if it doesn’t have a CrossRef DOI and thus isn’t registered in CrossRef. They subsequently added many other publication types—books, conference proceedings, etc.—although one of the things that has made the adoption on the book side slower and less “a given” than on the journal side is that whereas journal content is virtually always online, book content hardly ever is (metadata may be available online but the books themselves are almost always still offline print or ebook products—though scholarly books are some of the most likely of any book content to be online, along with reference).

Carol pointed out that this initial simplicity has turned out to be both an advantage and a limitation. The plus side is that it made it quite straightforward for publishers to be able to supply the metadata CrossRef needed to make reference linking work. (Though this is not without its problems; see below.) On the other hand, the very ubiquity of CrossRef caused two other not-necessarily-valid perceptions to form:

That CrossRef could do pretty much anything with metadata. Not so: they can only do what the metadata they get enables them to do. So, for example, publishers would like CrossRef to provide email addresses, but CrossRef has not been collecting email addresses because they weren’t needed for reference linking, and of course now they have a gazillion records that are lacking that information. Extremely non-trivial problem for them to address.
That having a CrossRef DOI confers some sort of legitimacy on a journal article. Not so: it just means the article has been properly registered, it says nothing about the quality of the article or the research. But many authors rush to get DOIs purely because without them their article lacks credibility. And sometimes authors and publishers just “make up” DOIs that are not in fact even registered. Of course they don’t “work” in the system, but it’s a headache and friction in the system.
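The gap between a DOI that looks right and one that is actually registered can be shown with a short Python sketch. The pattern below is an assumption that covers the common shape of a DOI (a "10." prefix, a registrant code, a slash, and a suffix); it performs no lookup, so a made-up DOI passes it just as easily as a registered one.

```python
import re

# Assumed pattern for the usual shape of a DOI; it checks syntax only.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def looks_like_doi(value: str) -> bool:
    """Return True if the string is shaped like a DOI. Says nothing about registration."""
    return bool(DOI_PATTERN.match(value.strip()))


# Illustrative strings only; none of these is claimed to be a registered DOI.
candidates = ["10.1000/example.123", "doi-pending-2014", "10.12/too-short-prefix"]
results = {c: looks_like_doi(c) for c in candidates}
```

Only resolving the DOI against the registration agency shows whether it "works" in the system, which is why invented DOIs create friction.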

One key observation Carol made is that “there are standards and there are practicalities—and they diverge.” For example, the solution for metadata implementation is probably RDFa, Semantic Web, etc., but this is very hard for the average publisher to do. Giving CrossRef a specific small set of metadata is one thing [and often their prepress or conversion or hosting vendor does that for them anyway] but a true Semantic Web implementation is way beyond the capabilities of all but the largest and most technically savvy publishers.

She said that “CrossRef even has a problem interoperating with other registration agencies.” [This even though CrossRef and those other registration agencies are in fact very technically expert.] And this is even more a problem for publishers, especially small or medium size publishers that don’t have the technical expertise, because the tools that are available require programming or at least “a programming mindset,” they don’t “just work.”

The big publishers get all this stuff, of course; but the small publishers “are at a real disadvantage” because they still need to work with services that need metadata—e.g. Amazon.

There is also a “legacy data problem,” where much of the metadata is “locked up in PDF,” which is a real struggle to deal with.

She also pointed out that for books, there is a big frontlist/backlist issue. (Not as much the case for journals, though it depends on the discipline and market demands.) Again, it requires a publisher “of a certain size” to be able to deal with this well.

CrossRef’s philosophy has always been “Do the simplest thing that makes it work.” This works for the purpose something was built for, but it gives you legacy issues and doesn’t work for all applications (e.g., the email example mentioned above). Plus you get “nonstandard records” and “compliance issues.”

She also pointed out that for any important development, “when there’s a business need, that’s when it happens.”

Another big issue in scholarly publishing these days is Data. There is a lot of pressure for the data sets on which research is based to get published, along with the articles and books based on research on that data. We are “hitting a tipping point,” and it is a very complex issue—she characterized it as a “huge unsolved problem.”

An important new initiative at CrossRef is FundRef, which is an interesting example of how a de facto standard can come about. There is a need in scholarly publishing to document the funders of research, and there is a great “community of interest” around it. But no standard for funding bodies existed, “so CrossRef made one, and it has become a de facto standard.” It is an example of their “simplest possible solution” strategy. It involves a taxonomy of funder names, and then associates the funders with the papers. There is pressure to make it more complex but right now that’s what was needed, it could be implemented quickly, and it works. She said “at the end of the day it’s all XML tags and identifiers. Everybody using FundRef uses the same set of tags, and it starts becoming a standard.”
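The "taxonomy of funder names, associated with papers" idea can be sketched in a few lines of Python. The funder identifiers, names, and the DOI below are all hypothetical; the real FundRef registry and its tag set are not reproduced here.

```python
from collections import defaultdict

# Hypothetical funder registry: one identifier per funder, several name variants.
FUNDER_REGISTRY = {
    "funder:0001": {"Example Science Foundation", "ESF", "Example Sci. Foundation"},
    "funder:0002": {"Institute for Sample Research", "ISR"},
}

# Reverse index from a lower-cased name variant to the funder identifier.
NAME_TO_ID = {
    name.lower(): funder_id
    for funder_id, names in FUNDER_REGISTRY.items()
    for name in names
}


def tag_funders(acknowledged_names):
    """Resolve free-text funder names to identifiers; report the ones not found."""
    found, unknown = set(), []
    for name in acknowledged_names:
        funder_id = NAME_TO_ID.get(name.strip().lower())
        if funder_id:
            found.add(funder_id)
        else:
            unknown.append(name)
    return sorted(found), unknown


# Associate the resolved funders with a (placeholder) paper DOI.
papers_by_funder = defaultdict(list)
ids, unmatched = tag_funders(["ESF", "example science foundation", "Unknown Trust"])
for funder_id in ids:
    papers_by_funder[funder_id].append("10.1000/example.123")
```

Two spellings of the same funder collapse to one identifier, which is what lets everyone using the same list ask "which papers did this body fund?"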

She pointed out that NISO also very quickly developed an OpenAccess indicator at the article level, involving a “license-ref” URL and a “free-to-read” date [the embargo date] (see http://www.niso.org/workrooms/oami/), so “CrossRef is using that and it works.” Publishers were getting a lot of criticism about not complying with OpenAccess requirements, but it was because there were no systems to deal with it. Something had to be done quickly.
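A minimal Python sketch of how a system might act on those two pieces of metadata follows. The class and field names simply mirror the description above and the license URL is a placeholder; the exact element names and semantics should be taken from the NISO recommendation itself.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class AccessIndicators:
    license_ref: str                           # URL of the license that applies
    free_to_read_start: Optional[date] = None  # None: no free-to-read date supplied


def is_free_to_read(indicators: AccessIndicators, on: date) -> bool:
    """True once the embargo date has been reached."""
    start = indicators.free_to_read_start
    return start is not None and on >= start


# Placeholder license URL and dates, for illustration only.
article = AccessIndicators(
    license_ref="https://licenses.example.org/open-v1",
    free_to_read_start=date(2015, 1, 1),
)
during_embargo = is_free_to_read(article, date(2014, 6, 1))
after_embargo = is_free_to_read(article, date(2015, 1, 1))
```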

She also mentioned that although CrossRef metadata schemas are all initially based on Dublin Core, they are way more complex and rich—they just use Dublin Core as a framework. In fact their CrossMark standard is not based on Dublin Core at all. [CrossMark is the service that enables a link to be embedded in an article that returns, to the user, the information about whether the version they are using is in fact the latest version of an article, and points them to a later version if appropriate—very cool.]

Kent Anderson, CEO, Journal of Bone and Joint Surgery (JBJS)

Kent is the CEO of a major medical publisher and the current President of the Society for Scholarly Publishing (SSP); previously he pioneered the innovative digital work at the New England Journal of Medicine (NEJM). He is one of the most knowledgeable, articulate, and well-informed people in scholarly publishing (one of those brilliant people who seems never to sleep). So it was particularly telling when his first response was along the lines of “Gee, I haven’t thought about metadata in a while.”

If metadata was a big issue—or, more accurately, a big problem—for a medical publisher like this, I guarantee you Kent would be all over it. The fact is, for scholarly publishing, and particularly STM publishing, and even more particularly the M part, medical publishing, in many ways metadata is seen as a solved problem. This is in direct contrast to most other types of publishing, but it was confirmed by others whom I interviewed with this perspective. It’s not that they feel that metadata is perfect, or that it couldn’t be improved; but in contrast to most other types of publishing there is really not much pain there because in the STM sphere metadata has long been a known quantity and pretty much does what they need it to do. (Thanks mainly to CrossRef—see the interview with Carol Meyer.)

Indicative of Kent’s forward-thinking nature, he said that the one area where he probably had thought about metadata was in the context of video. That may surprise many folks who think of a scholarly journal publisher as being pretty boring and routine. In fact, publishers like Kent have actually been in the forefront of publishing online and including multimedia and interactivity. He was doing this years ago at NEJM; I’ve been using examples of Kent’s in my talks on the subject for at least the past 5+ years, and have recruited him to speak on the subject on numerous occasions. Most other types of publishers are way behind on this, compared to the leading STM and especially medical publishers.

He said that they are now doing “much more multimedia” and observed that there is a lack of good tagging [by which I’m sure he meant subject/semantic tagging] even on things like podcasts. They are also getting into publications that are “beyond journals” and don’t have a good way to manage metadata for them.

One metadata issue that Kent astutely picked up on in our conversation was “the portability of metadata.” They have recently updated their XML modeling from NLM to JATS [as are most other STM publishers—these are the XML models that are virtually universally relied upon in the STM realm], and also moved from one hosting service to another. Some of the metadata (particularly semantic metadata) that had been added by the old host was considered proprietary and not transferred, so JBJS had to recreate it for the new hosting context.

Another big metadata issue that is just coming to the fore in STM publishing is the need to disclose conflicts of interest. He observed that it is now “very clumsy and not very interoperable.” [Note that the FundRef initiative from CrossRef is an important new vehicle addressing this to some extent.]

He also pointed out that they have made investments in things that help bring more visibility into the publication process. They have just acquired PreScore, a service that provides a metric to assess the level of peer review a given article received, which provides information on pre-publication peer review that “never makes it into the article.” They are also participating in SocialCite, which rates the quality of an article’s reference list.

Kevin Hawkins, Univ. Michigan Libraries and University Press

Until a move to the University of North Texas just after our conversation, Kevin had a very interesting and relevant dual role at the University of Michigan. He has for many years been a key person in the U-M Library’s innovative and extensive Scholarly Publishing Office (now absorbed into Michigan Publishing), one of the pioneers in the trend of academic libraries moving into publishing (online journals, print-on-demand books, the Text Creation Partnership, and more). Kevin is also a true XML expert: he is my go-to guy on anything involving the Text Encoding Initiative (TEI), the XML model dominant in libraries, archives, and humanities scholarship. A couple of years ago, the U-M Library took over responsibility for the University of Michigan Press, and Kevin took major responsibility for production for both print and digital university press monographs.

In an interesting confirmation of one of my other calls (also to a scholarly publishing luminary) his first comment was that he was not really aware of any fundamental critical problems regarding metadata. Unlike almost all other segments of publishing, scholarly publishing seems to consider metadata mainly a solved problem (and it largely is).

The main problem he pointed to was the inconsistent adoption of metadata schemes throughout the publishing supply chain and the consequential work involved in customizing metadata feeds for each vendor. The Press pays Firebrand (the company run by Fran Toolan, one of my other interviews) to disseminate their metadata, customizing the ONIX feed for various vendors.

The U-M Library is a member of CrossRef and deposits DOIs for much of its online content. This requires mapping the metadata they have to the metadata CrossRef requires. While this is annoying, it is not a huge problem because the CrossRef metadata requirements are not extensive. A bigger issue, in Kevin’s view, involves digital workflows. Their publishing platform and its workflows for collections of content were designed for digitized library collections, with infrequent additions to a collection or updates to the content. Revisions are cumulative, with new and revised content not distinguished. So when a new issue of a journal is published they send metadata for all that journal’s issues to CrossRef, relying on CrossRef to screen duplicate records. This of course is an internal U-M issue, but it highlights the fact that metadata is not just about marketing for a publisher like this, it is central to how they manage their workflow.
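For contrast with the "send everything and let the receiver screen duplicates" workflow, here is a hedged Python sketch of a delta deposit: only new or changed records are selected. The record layout, the placeholder DOIs, and the idea of storing a fingerprint per deposited record are assumptions for illustration, not a description of U-M's or CrossRef's systems.

```python
import hashlib
import json


def fingerprint(record: dict) -> str:
    """Stable hash of a metadata record, independent of key order."""
    canonical = json.dumps(record, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def records_to_deposit(current: dict, already_sent: dict) -> dict:
    """Keep only records that are new or differ from what was last deposited.

    current maps DOI -> record; already_sent maps DOI -> fingerprint.
    """
    return {
        doi: record
        for doi, record in current.items()
        if already_sent.get(doi) != fingerprint(record)
    }


# Placeholder DOIs and titles.
old_issue = {"title": "Article one", "issue": 1}
already_sent = {
    "10.1000/j.1": fingerprint(old_issue),
    "10.1000/j.2": fingerprint({"title": "Article too", "issue": 1}),
}
current = {
    "10.1000/j.1": old_issue,                               # unchanged
    "10.1000/j.2": {"title": "Article two", "issue": 1},    # corrected title
    "10.1000/j.3": {"title": "Article three", "issue": 2},  # new issue
}
pending = records_to_deposit(current, already_sent)
```

Distinguishing new and revised content in this way is exactly what a platform built around cumulative revisions does not do.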

He observed that the Press has not in fact invested much to enhance metadata for discoverability (keywords in HTML, ONIX, BISAC, etc.). This needs to be more of a priority. They have done some SEO work to make their online-only content more discoverable in Google search results. But he pointed out that they don’t have “real keywords in microdata,” and they “probably should be doing that.”

Another interesting twist from this conversation: promoting discoverability through metadata access actually has no financial return for online-only content, so it gets put off in favor of work for which there is a clearer financial implication. (!!)

And finally another important observation that is true of virtually all book publishers but hardly anybody ever brings up: for most books (most of which are not online), there is no HTML to put microdata in!

Most of our conversation focused on the Library’s publishing activities, but he did have a few additional comments from the perspective of a library acquiring content and making it available to users:

Because of advances in search and discovery capabilities in library catalogs, vendor databases, and discovery systems, libraries are actually making LESS investment in detailed cataloguing than they used to.
Institutional repositories are very widely and heavily used in academia, and most allow self-deposit of content by authors. The author is asked to supply some metadata, and some institutions have a review/validation process. In his opinion, the thorough crawling of IRs by Google Scholar and other search engines “argues against laborious metadata creation.”

Len Vlahos, Executive Director, and Julie Morris, Project Manager, Standards and Best Practices, Book Industry Study Group (BISG)

Len and Julie provide a very interesting perspective on our metadata questions because the BISG, a very broad-based book industry organization comprising a wide range of publishers, retailers, distributors, and service providers across the entire book supply chain (including almost all of the biggest ones), has a number of committees that specifically focus on metadata issues. They are the US representative to EDItEUR regarding updates to ONIX; they are responsible for BISAC, the US subject vocabulary for the book supply chain; they were a major participant in creating Thema, the new international subject vocabulary; and their Identifiers Committee works closely with the groups responsible for key identifiers like ISBN and ISNI (currently participating in the revision to ISBN). Plus they are involved with rights and manufacturing-oriented metadata (EDI, RFID, etc.). They also do a lot of education as to best practices. This all is precisely the area of Julie’s responsibility, as you’ll see from her title; and Len, as Executive Director, is very actively engaged in all these things. They are an ideal “hub” for insight into the broad book publishing industry, both the creators and the recipients of metadata. I interviewed them jointly. (Julie is a member of the W3C Digital Publishing Interest Group, and BISG is a W3C member, thanks to Len’s realizing how important that is.)

Len began by saying that in his view there are two major categories of metadata issues for book publishers: (1) communication, and (2) systems and processes.

Regarding communication of metadata, he said that there is a belief from downstream partners [by which he meant the recipients of publishers’ metadata] that publishers are confused by and inconsistent in their use of metadata, whereas upstream, the publishers think the downstream partners are making changes to their metadata, which is unwelcome. [IMO, there is a kernel of truth to both perceptions, although both are exaggerated.]

Regarding systems and processes, he pointed out that there are “no clear lines of responsibility for metadata.” “It’s a giant game of telephone that’s gone awry.”

Julie specifically addressed issues with ONIX. [In this context, we are talking about ONIX for Books, which is by far the main context in which ONIX is used. There are other versions of ONIX.] Here are some of the key things she pointed out:

There are problems with using metadata to indicate relationships between things, for example a print book from which a digital version was derived.
There’s a huge issue in the identifiers sphere [this comes up in almost every group I’m involved in, btw]: the lack of a “Work” identifier as opposed to identifiers for products of that work. [One reason for this is the difference between how publishers view “the work”—they are focused on what they are trying to sell—vs. librarians’ view of “the work” (e.g., “Huckleberry Finn” or “Hamlet,” not one particular publisher’s).]
Versioning is a big issue in digital publications: what’s a new edition vs. a version of an edition.
Expressing series, relating products as part of a series or a group of titles that should be tied together.
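Several of the points above (product-to-product relationships, a "Work" identifier, and series grouping) come down to modeling links between records. The Python sketch below shows one way to picture this. The Product fields, the work and series identifiers, and the ISBN-like strings are all hypothetical placeholders and are not ONIX elements.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Product:
    isbn: str                           # placeholder product identifier
    form: str                           # e.g. "print" or "ebook"
    work_id: str                        # hypothetical identifier for the underlying work
    derived_from: Optional[str] = None  # ISBN of the product this one was made from
    series_id: Optional[str] = None     # hypothetical series identifier


products = [
    Product("9780000000001", "print", "work:hamlet", series_id="series:plays"),
    Product("9780000000002", "ebook", "work:hamlet",
            derived_from="9780000000001", series_id="series:plays"),
    Product("9780000000003", "print", "work:macbeth", series_id="series:plays"),
]


def group_by(items, key):
    """Group product ISBNs by an attribute such as work_id or series_id."""
    groups = defaultdict(list)
    for item in items:
        groups[getattr(item, key)].append(item.isbn)
    return dict(groups)


by_work = group_by(products, "work_id")
by_series = group_by(products, "series_id")
# Product-to-product relationship: which ebooks were derived from which print books.
derivations = {p.isbn: p.derived_from for p in products if p.derived_from}
```

Without an agreed work identifier, the `work_id` column is the part each organization ends up inventing for itself.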

Another big topic discussed by both Len and Julie was the volatile nature of the situation, because of changes in the industry and the types of products being produced.

How does “product metadata” relate to metadata embedded in a digital product? These are handled quite separately in most publishing organizations, and there is a lack of awareness and communication between departments (digital, production, marketing, etc.). Fundamentally there is a lack of clarity and consistency regarding “what are we saying about this thing?”

Plus, there is a need for metadata to describe changing products, which results in “mushrooming metadata.” Publishers like Pearson wind up thinking of themselves increasingly as technology companies.

Len pointed out another key problem: the book industry is no longer “all one thing” as it was in the past. There are so many bidirectional relationships between types of publishers and the entities they deal with (trade⇿retailer, trade⇿library, educational publisher⇿school system, etc.), and there is “no overriding standard that accommodates all of these.”

Len also pointed out that there have been discussions with GS1 [the organization responsible for global standards for the supply chain, best known for barcodes] and that the book industry’s metadata was actually “in relatively good shape” compared to that in other sectors like apparel, beverages, etc. The book industry is “more similar to music: so many different SKUs.”

He also mentioned [surprisingly to me] that some have begun to question the long term value of ONIX to the industry. He pointed out that lots of work has been done on the library side regarding linked data. He suggested that there was an evolution toward looking at metadata not at a “record” level [e.g., an ONIX record] but at a more distributed level.

Julie pointed out the inherent conflict—or at least the co-existence, and at present not all that good alignment—between ONIX, MARC, and schema.org.

Len pointed out that eventually the data about the book [not sure if he meant just metadata or the content itself] will reside in the cloud.

He gave the example of author vs. contributor metadata. There is important granular information connecting contacts, contracts, rights, etc. to authors and contributors and their products, and this is not best accomplished by “boxing it into either an ONIX or MARC record” [his implication being, if I can speculate, that this both makes that metadata inaccessible to updating and unnecessarily duplicates information that is really the same in a bunch of those different “boxes”]. A more “networked” approach [my word, not his] would lend itself to greater conformity, which would be all to the good.
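The contrast between "boxing" contributor details into each record and a more networked approach can be sketched as follows. This is only an interpretation of the idea, in Python, with invented identifiers and field names: contributor data lives in one place and product records refer to it by identifier.

```python
# One shared contributor entity (hypothetical identifier and fields).
contributors = {
    "contrib:1": {"name": "A. N. Author", "rights_contact": "agent@example.org"},
}

# Product records point at the contributor instead of copying the details.
products = [
    {"id": "prod:print", "contributor_ids": ["contrib:1"]},
    {"id": "prod:ebook", "contributor_ids": ["contrib:1"]},
]


def render(product: dict) -> dict:
    """Build a record-style view on demand from the linked data."""
    return {
        "id": product["id"],
        "contributors": [contributors[c]["name"] for c in product["contributor_ids"]],
    }


before = [render(p) for p in products]

# A single correction to the shared entity...
contributors["contrib:1"]["name"] = "Ann N. Author"

# ...shows up in every record view, with no per-record edits.
after = [render(p) for p in products]
```

If the name had been copied into each record, the correction would have to be repeated in every "box", which is where the duplication and inconsistency come from.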

Finally, he also pointed out that there is an inherent conflict between the needs of archivists/cataloguers and marketers/publishers [alluded to above in discussing the library vs. publisher pov on things like what “a work” is].

Henning Schoenenberger, Director, Product Data and Metadata, Springer

Henning is responsible for metadata globally for Springer, one of the world’s largest and oldest STM publishers. Springer has long been an innovator in the digital arena, pioneering the use of SGML and XML, publishing journals online, embracing open access, and—most notably in the context of this initiative—digitizing their enormous backlist of some 100,000+ books in a broad range of disciplines in their SpringerLink platform.

Because his work focuses almost entirely on metadata for a publishing program that encompasses many thousands of books and journals with decades of legacy content, Henning has an uncommon perspective and depth of knowledge in this area.

He has a very comprehensive and forward-looking view of metadata. He pushed back on the focus on “pain points” by saying that he doesn’t consider metadata to be a problem statement; rather, “it’s a central aspect of what you do as a publisher.” One of my favorite quotes from the interview:

“Metadata is not to be solved, it’s to be saved.”

If a publisher invests in metadata activities (e.g., CrossRef), it is likely to pay off. His philosophy is to “be as transparent with metadata as possible, flood the market with it, across the world. Same with Linked Open Data and the Semantic Web.” This is not just idealism, it is practical: “This will all benefit Springer in the long run.”

Metadata should be considered an ongoing issue, and should be understood to be always evolving. The development of the DOI and CrossRef was only one part of the picture, one aspect of the ecosystem. That is working well for what it was originally designed to do, but now it has evolved to include FundRef, which is an issue that publishers must now deal with, incorporate into their systems and workflows and publications.

Especially from an STM publisher’s standpoint, you must have very good metadata to position yourself properly in the marketplace and in the scientific/scholarly dialogue.

An example of how significantly the landscape changes: right now, research data is considered “an appendage” to a publication. But who knows to what extent it may become core of STM publishing? Currently, “research data is a mess: there are hundreds of different formats out there.” You can’t always know where things will go, and when. The lesson is that you need to capture and maintain all possible metadata, even if it may not have a demonstrable payback right now: you will almost surely need it in the long run.

As an example of how this philosophy and strategy paid off, Springer needed to act fast to implement ONIX-PC, the subscription format of ONIX. Because they had such good metadata, they were able to implement it on a very large scale in only three months: “It turned out to be quite simple, we needed some IT and some project management and a converter and we were done.” Other publishers struggle with this because they lack the necessary metadata.

On the subject of ONIX-PC, he said that at the ICEDIS meeting at Frankfurt, he advocated renaming the “PC” suffix from the current meaning of “Price Catalog” to “Product Catalog.” He wants it to cover Open Access, hybrid products and packages, which it currently can’t accommodate. He also wants to be able to use it to communicate with the ISSN catalog, etc.

This brought up another important general issue:

“Metadata in the past was tailored to two different markets, trade and library. This cannot continue.”

Concerning metadata these worlds are coming together rapidly. The separation in their metadata models and management systems—how they express subject classifications, the identifiers they use, how they handle linked open data, the fact that they are typically managed by separate teams, and have separate business models—all this needs to converge.

“Functionalities and formats (e.g., schema.org) need to cover library and trade under the same umbrella.”

Another huge issue he emphasized was interoperability.

As an STM publisher, Springer has MARC records; in fact, they have both OCLC MARC records and their own MARC records. The Springer-flavored MARC records have URIs that lead back to the Springer repository, as well as three or four classification schemes, a full TOC, Dewey and Library of Congress classification, and proprietary metadata.

The inconsistencies in MARC records are a problem; there must be reliable readability by different facilities. BIBFRAME is going in the right direction. But academic libraries often have limited budgets, and moving away from MARC is not trivial. “This is 80% of the story: it works okay; workarounds are common.” BIBFRAME needs to make the case for libraries for why it’s justified.

He mentioned that he was approached to collaborate with schema.org to model serial and journal metadata. He recommends not reinventing the wheel or developing Yet Another Standard: “How ONIX-PC models serials is excellent. If schema.org is looking for a deep object model for serials data, they should look at ONIX-PC.”

He pointed out that EDItEUR is based on a trade supply chain model and perspective, dominated by wholesalers. But these players are “only part of the story.” Publishers are increasing end-user sales that don’t involve the intermediaries for which the EDItEUR standards have been designed. ONIX is built for the current supply chain model; EDItEUR needs to move beyond this current focus.

This is especially relevant for STM because STM publishers have the brand recognition and the capabilities to be able to sell direct. You can go to SpringerLink today and buy an article or even a book chapter.

Finally, he placed a strong emphasis on rights: “Some still say ‘Metadata is a challenge’. But that is news from yesterday. The current challenge is Rights metadata.” A few technology companies provide decent rights modules. But there is no thoroughgoing interoperable rights standard. “This gives us a great opportunity to tie up loose ends and get it right.”

He would very much welcome the W3C setting up a WG to work on rights, and making rights interoperable. Publishers would appreciate this. He mentioned the enormous task that was involved in expanding SpringerLink to accommodate all of their books, which have been published over many decades.

“Clear rights metadata is especially important now that everything is visible: once you publish something digitally, retractions are a complex matter. This can be life-or-death for a small publisher which hasn’t done its homework.”

In an abundance of caution, although they did digitize the entire backlist going back to the late 18th century and comprising more than 100,000 books, Springer is highly scrupulous about not making public any content for which it has not cleared the rights. This was an enormous project: it involved hundreds of thousands of contracts from which rights metadata had to be extracted. And these exist in various forms in various jurisdictions.

In conclusion, what Henning characterized as “the Big Data promise”:

Linked Open Data
Named Entity Recognition
Robust Identifiers

With these, it would be possible to realize Henning’s vision:

“In a few years, robots will do research for you. Good metadata will make this possible.”

Introduction