DPUB IG Metadata Task Force Report

The Metadata Task Force of the DPUB IG found, through extensive interviews with representatives of various sectors and roles within the publishing ecosystem, that there are numerous pain points for publishers with regard to metadata but that these pain points are largely not due to deficiencies in the Open Web Platform. Instead, there is a widespread lack of understanding or implementation of the technologies that the OWP already makes available for addressing most of the issues raised. However, some of the very technologies that are little used or understood in most sectors of publishing are widely used and understood in certain other sectors (e.g., scientific publishing, libraries). Priorities that have emerged are the need for better understanding of the importance of expressing identifiers as URIs; the need for much more widespread use of RDF and its various serializations throughout the publishing ecosystem; and the need to develop a truly interoperable, cross-sector specification for the conveyance of rights metadata (while remaining agnostic as to the sector-specific vocabularies for the expression of rights). This Note documents in detail the issues that were raised; provides examples of available RDF educational resources at various levels, from the very technical to non-technical and introductory; and lists important identifiers used in the publishing ecosystem, documenting which of them are expressed as URIs, and in what sectors and contexts. It concludes that, while little new technology is called for, the W3C is in a unique position to bridge today's siloed metadata practices and help facilitate truly cross-sector exchange of interoperable metadata. This Note is thus intended to provide background and a context in which concrete work, whether by this Task Force or elsewhere within the W3C, may be undertaken.

Overview

Publishers use metadata in three fundamentally different ways:

Metadata that is incorporated within a publication (e.g., an EPUB, a website).
Metadata that is separate from the publication or publications it describes (e.g., the periodic "metadata feeds" that publishers provide to the supply chain).
Metadata that is incorporated in systems designed to provide information about publications (e.g., a publisher’s, retailer’s, aggregator’s, or library's website).

While in many cases metadata is in a system or a form outside of the Web and uses technologies outside of the Open Web Platform (OWP), such as databases, repositories, authoring and formatting software, and proprietary aggregation and dissemination platforms, OWP technologies are increasingly becoming essential to all aspects of the publishing process (including modern versions of all those mentioned).

The Metadata Task Force of the W3C Digital Publishing Interest Group (DPUB IG) was formed to identify ways in which the W3C could help address problems publishers currently have with regard to metadata. In its discovery phase, the TF found the following fundamental issues to be commonly regarded as "pain points" by publishers:

Granularity: The need to associate metadata with arbitrarily granular units of content, rather than simply at the publication level.
Complexity: The profusion of identifiers and metadata vocabularies is confusing and difficult to master.
Difficulty: There are few systems or tools that enable metadata to be provided at appropriate stages of the workflow by non-technical people.
Futility: Even when rich metadata is provided, it often isn't used by the systems disseminating or displaying the publications.

Although each sector of publishing has problems with metadata in its own ways, the causes of these problems fall into two major categories:

Problems addressable by refinements to the Open Web Platform.
Lack of implementation, and even understanding, of existing features of the OWP on the part of publishers and others in the publishing supply chain.

In its initial exploration of these issues, the Metadata Task Force of the W3C Digital Publishing Interest Group found that the vast majority of difficulties that publishers of all types have in implementing metadata more effectively are in the second category. In most cases, the OWP already has features that address these issues, if used properly by publishers and implemented properly in systems that create, disseminate, and display those publications (e.g., expressing identifiers as URIs, using RDF and RDFa, etc.). In other cases, ongoing work by the W3C will likely provide solutions or essential components to solutions (e.g., the work of the Web Annotations WG is closely related to the need to address arbitrarily granular units of content).
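As an illustration of what "expressing identifiers as URIs" and "using RDF" can mean in practice, the following is a minimal sketch in Python, not a definitive implementation. It assumes three common conventions: a DOI is made actionable through the doi.org resolver, an ISBN is written as a URN in the isbn namespace, and an ORCID iD is written as an orcid.org URI. The helper function names are hypothetical, the DOI shown is a made-up placeholder, and the ISBN and ORCID iD are widely published sample values rather than identifiers taken from the interviews.

```python
def doi_to_uri(doi: str) -> str:
    """Express a DOI as an actionable HTTPS URI via the doi.org resolver."""
    return "https://doi.org/" + doi.strip()


def isbn_to_uri(isbn: str) -> str:
    """Express an ISBN as a URN; hyphens and spaces are dropped here by choice."""
    digits = isbn.replace("-", "").replace(" ", "")
    return "urn:isbn:" + digits


def orcid_to_uri(orcid: str) -> str:
    """Express an ORCID iD as an HTTPS URI."""
    return "https://orcid.org/" + orcid.strip()


def triple(subject: str, predicate: str, obj: str) -> str:
    """Format one RDF statement whose three terms are URIs as an N-Triples line."""
    return f"<{subject}> <{predicate}> <{obj}> ."


# Dublin Core "creator" property, used here purely as an example predicate.
DCTERMS_CREATOR = "http://purl.org/dc/terms/creator"

article = doi_to_uri("10.1234/example.5678")   # placeholder DOI, not a real article
author = orcid_to_uri("0000-0002-1825-0097")   # sample ORCID iD
book = isbn_to_uri("978-3-16-148410-0")        # sample ISBN

statement = triple(article, DCTERMS_CREATOR, author)
print(book)
print(statement)
```

Once every identifier is a URI, a statement such as "this article was created by this person" can be exchanged between systems in different sectors without either side needing to know the other's local identifier scheme.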

Recommendations for Further Work at W3C

The Metadata Task Force of the W3C Digital Publishing Interest Group has developed the following general and specific recommendations to the W3C with regard to the use of metadata within the OWP.

Focus on enabling and facilitating the use of existing vocabularies rather than creating new ones. The vocabularies needed by different types of publishers vary tremendously, and in order to be effective they tend to be highly specific to certain disciplines, use cases, or sectors. A great number of these vocabularies already exist and are in wide use within the communities they were designed to serve. The role of the W3C should be to provide means to enable publishers and others in the publishing ecosystem—particularly those who develop systems used by publishers and their partners and customers to create, manage, disseminate, and access publications—to efficiently and effectively use those vocabularies in the three modes described at the beginning of this section: within a publication, in a separate document associated with a publication, or in systems that provide information about publications.
Specific recommendation: The W3C should collaborate with BISG, the Book Industry Study Group; EDItEUR, the international organization responsible for ONIX, the standard messaging format for dissemination of book supply chain metadata; schema.org, which provides the most commonly used means of embedding metadata in web content; and other appropriate parties internationally to develop optimal ways for book publishers to embed appropriate and useful metadata in Web content based on the well known and widely implemented ONIX model. The BISG has already formed a Working Group to address this issue from the point of view of its US constituency, and that WG is chaired by the Executive Director of EDItEUR, who can ensure that the resulting recommendations are globally applicable. The W3C is uniquely positioned to be a key partner to all of these organizations, providing the foundational technologies on which much of their work is based. It could serve a valuable role in facilitating collaboration internationally and addressing any technological limitations revealed in the course of that collaboration so that solutions are truly global and interoperable.
Educate publishers and their partners on how best to use existing features of the OWP. While technical specifications, sometimes supplemented by primers, are already provided on such features by the W3C and the community at large, these are often targeted at technical users. There is a need for much simpler, more user-friendly documentation aimed at non-technical people within the publishing ecosystem. There is also a need for much more aggressive dissemination of this information throughout the publishing ecosystem to demystify these features of the OWP and encourage their broad and proper use both by creators and recipients of metadata.
Specific recommendation: Focus initially on encouraging proper understanding and use of URIs and of RDF/RDFa. To this end, the Metadata Task Force developed two lists:
- An extensive (though by no means exhaustive) list of identifiers used in publishing (see Appendix ), documenting those that are recommended to be expressed as URIs. (The focus was on open, non-proprietary identifiers; in addition, there are many proprietary identifiers that may be expressed as URIs but were considered out of scope for this list.) It is clear from this research that in fact a great many identifiers either are or can be expressed as URIs. Within specific domains—typically those whose content is mostly online and whose constituents depend on the web for their work—identifiers are in fact commonly expressed as URIs. Within other domains—typically those whose content is not generally online, such as trade book publishers—identifiers are seldom expressed as URIs. It is the recommendation of the Metadata Task Force that the W3C reach out to organizations representing domains not commonly using URIs for identifiers to help them educate their members. It is through such organizations as the BISG, BIC, and EDItEUR (the book publishing supply chain), IDEAlliance (magazine publishers), and SSP and STM (scholarly publishing) that such education can most effectively be done.
- A list of a variety of resources available that document or explain RDF and related semantic technologies (see Appendix ). Again, this is not an exhaustive list; instead, it was developed to show that there are indeed a variety of resources—some targeted at developers and other technical people, but others specifically targeted at non-technical people—that serve these purposes. Most of these have been developed within a certain domain (e.g., libraries, magazine publishers, news organizations) and are virtually unknown outside that domain. As is the case with URIs mentioned above, the W3C is ideally positioned to serve as a clearinghouse or hub for such information, as well as to serve as an authoritative resource to vet such publications for accuracy.
Work to help facilitate the association of rights metadata with content. While various sectors of publishing—specifically, trade book publishers, educational publishers, scholarly/STM publishers, magazine publishers, and news organizations—have different priorities, different needs, and different vocabularies for the expression of rights information, they share a need to be able to communicate that metadata in an interoperable way, at an arbitrarily granular level, in web-based content. For example, the ODRL Community Group has developed the Open Digital Rights Language, which is currently in draft form as version 2.1, but this is not a W3C Recommendation nor on a Recommendation track. The IDEAlliance is working on the development of rights metadata for the magazine publishing industry, and intends to base it on ODRL. The BISG has developed a rights vocabulary for the book publishing industry. In today's web-based world, previously siloed content now spans domains: for example, a single educational resource could use a chapter from a book, an image from a scholarly journal, a video from the entertainment industry, an article from a magazine, and a blog or podcast from a news organization. Its need to manage the rights information for all those components is crucial, and the ability for those separate organizations to both manage the rights offered for their content and gain income from its use is compelling. It would be beneficial to the publishing ecosystem as a whole if a canonical, interoperable standard were available for the expression of such rights metadata that would span or undergird such domain-specific vocabularies.
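The composite educational resource described in the last recommendation above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not ODRL or any existing rights vocabulary: the class RightsStatement, its field names, the use labels ("display", "excerpt"), and the example.org URIs are all hypothetical. The point it shows is that once each component and each rightsholder is identified by a URI, rights can be attached at an arbitrarily granular level and checked across components that originate in different sectors.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RightsStatement:
    component: str             # URI of the content unit (may carry a fragment)
    rightsholder: str          # URI identifying the owner of that unit
    permitted_uses: frozenset  # hypothetical use labels, not a standard vocabulary


def blocked_components(statements, use):
    """Return the URIs of components whose rights do not permit the given use."""
    return [s.component for s in statements if use not in s.permitted_uses]


# One educational resource assembled from components owned by different parties.
resource = [
    RightsStatement(
        "https://example.org/book/42#chapter-3",
        "https://example.org/publisher/a",
        frozenset({"display", "excerpt"}),
    ),
    RightsStatement(
        "https://example.org/journal/7/fig2.png",
        "https://example.org/publisher/b",
        frozenset({"display"}),
    ),
]

print(blocked_components(resource, "excerpt"))
```

In this sketch the journal figure is reported as blocking the "excerpt" use, which is the kind of question a cross-sector rights standard would let systems answer without bespoke, sector-specific integration.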

Background: Interviews

In order to assess the “pain points” with regard to metadata for publishers, the co-leaders of the task force, Madi Solomon of Pearson and Bill Kasdorf of Apex, conducted a number of interviews in 2014 with publishers, service providers, and representatives from related organizations. The interviews themselves are available in a separate document.

The interviewees were selected to provide insight from a variety of perspectives, and were individuals known to the interviewers to be knowledgeable and authoritative within their spheres. Ms. Solomon took a “vertical” approach, interviewing a broad range of individuals within Pearson, a large global educational publisher. Mr. Kasdorf took a “horizontal” approach, interviewing experts from diverse types of publishing (book, journal, magazine, and news) and representing diverse roles within the digital publishing ecosystem (publishers, metadata service providers, consultants, and representatives from other organizations that are addressing the issue of metadata in publishing).

The interview strategy was to conduct casual, open-ended interviews, each with a single individual, without an agenda or a prepared set of questions. The reason for this strategy was to avoid steering the discussion in particular directions. Instead, in this initial phase, the goal was to elicit what each interviewee would perceive as the key issues and pain points with regard to metadata from their own point of view. Thus the interviews deliberately did not focus on what the W3C could do or on what changes could be made to the Open Web Platform to address those issues. Instead, the interviews stayed on the general level. Since many of the interviewees were not technical, framing the discussion in too technical a manner would have impeded the ability to obtain authentic responses. As expected, few of the interviewees felt able to identify specific “pain points” with regard to the OWP. They spoke instead of general issues of concern to them in their work. The hope was that with an understanding of these issues and pain points, the DPUB IG could then assess where the W3C and the OWP could potentially help address them, and could avoid addressing theoretical technical issues that might not in fact align with publishers’ priorities.

While the published interview reports cited above will provide the best understanding of both the common themes and diverse perspectives revealed by the interviews, this report attempts to summarize key observations and offer initial recommendations for subsequent strategies.

Primary Observations

If there is a single overarching lesson revealed by these interviews, it is that the issues with regard to metadata seen as priorities for publishers and their clients and partners differ significantly between publishing sectors (although they all share all of these issues to some extent).

Trade book publishers are primarily concerned with discovery: providing metadata to the supply chain that will attract readers to their books and result in increased sales. It was observed that at the present time book content itself is not typically online; it is available digitally only as products (e.g., EPUB), and typically only at the title (book) level, and the metadata regarding those products (primarily in ONIX) is created, maintained, and disseminated independently of the products themselves.
Educational publishers see metadata primarily in the context of asset and content management: the identification and characterization of granular components of content within a large repository of content in order to facilitate the creation of publications, the delivery of content in “chunks,” the repurposing of content to create new editions and new products, the ability to guide students and teachers to highly targeted components of content, and the ability to manage both rights and permissions associated with those components of content. The ability to personalize content to individual students, the ability to associate learning objectives with arbitrarily granular components of content, and the ability to monitor and assess the use of content by students and the learning outcomes that result are all key metadata priorities for educational publishers.
Scholarly publishers often perceive metadata as a “solved problem.” While none would assert that the current situation is perfect, the standards-based consensus in the scholarly publishing world has enabled the development of a rich ecosystem of services and platforms that has made the Web the primary mode of publication, dissemination, and access for scholarly content. That consensus consists of nearly universal participation in CrossRef and CCC (the Copyright Clearance Center), the ubiquitous use of the JATS XML model (see footnote 1) for markup and metadata, and the reliance upon the DOI as a persistent, actionable identifier (initially for journal articles but now increasingly for book chapters and components, reference content, conference proceedings, and other publications, as well as the data sets that support research). This consensus has also led to the development of other standards, such as ORCID, the Open Researcher and Contributor ID, and FundRef, a system for making public the funders of research, that continually refine the sophistication and utility of metadata in the scholarly publishing world, solving what were previously significant pain points (e.g., disambiguating contributor identities, revealing potential conflicts of interest or reliably documenting the absence of such conflicts). However, it is increasingly being recognised among scholarly and STM publishers that while they somewhat uniquely enjoy a common model (JATS) that accommodates extensive metadata, the lack of consistency in the implementation and use of that metadata is indeed a problem.
Magazine publishers and news organizations are currently focused on rights metadata: the rapid shift to the Web as a source of news, entertainment, and information has created an urgent need to identify the owners of content and the rights associated with very granular components of content, both the rights that apply broadly to a particular unit of content (e.g., copyright) and the usage rights conveyed by the rightsholder to specific parties, in specific contexts, for specific purposes, and in specific modes. While there are well established metadata models in each of these areas (PRISM is a very rich metadata framework virtually universal in the magazine world, and standards such as IPTC photo and media metadata, rNews, NewsML, and schema.org are widely used by news organizations), standards bodies in both of these areas (IDEAlliance for magazines and IPTC for news) are each actively working on developing and refining rights metadata models.

While all of these issues (discovery via subject metadata and other metadata characterizing content and products, management of content via metadata, development of and participation in cross-publisher platforms and services via metadata, and the communication of rights via metadata) cross all sectors of publishing, it is clear from the interviews that the priorities in distinct sectors diverge significantly.

Another major theme heard in virtually all of the interviews was that metadata is “too complicated.” Book publishers, for example, recognize that ONIX is the standard way to communicate supply chain metadata; as such, it is an extremely rich, complex, and useful standard. Similarly, the BISAC standard is a rich vocabulary used in the US for subject classification; there are similar such standards in most other countries or regions, and also a new global standard, Thema. While publishers recognize the value of these standards, they often characterize them as “too hard”; yet when pressed for what an individual publisher needs to communicate (to the supply chain, or about the subjects of its books), they often wind up asking for more complexity. (For example, a US publisher may want to describe a book as being about “the Battle of the Bulge,” within the topic “World War II,” which itself is in the category of military history; this can be done with BISAC but not with Thema.) The truth is that these systems are complex because what they are designed to do is complex. The desire for an “ONIX Lite” expressed by several interviewees may prove to be unrealistic, because a significantly simpler model would be significantly less expressive.

Another common theme was that in too many cases metadata may exist (or may potentially exist, if applied to a given publication) but it often “doesn’t do anything.” It is very frustrating to users if it is the case, or even if it is their perception, that going to the work of adding metadata is futile because systems are not seen as using it. (This is of course true of some types of metadata but not others: clearly trade publishers know how their ONIX metadata is used by the supply chain, and scholarly publishers know how their CrossRef metadata is used for citation linking.) This particularly surfaced in the context of the Pearson interviews because complex educational content is created by a vast team of participants, each of whom may have the ability to provide some aspect of metadata but most of whom have no clear understanding of how to do so, no systems to enable them to do that consistently, and no faith that if they “go to all that work,” it will actually be used for any purpose downstream.

In thinking about metadata, it is important to distinguish between metadata that is incorporated within a publication (an EPUB, a website); metadata that is separate from the publication or publications it describes (e.g., ONIX, which can continually change over time without requiring the publications it communicates metadata about to be altered); and metadata that is incorporated in systems designed to provide information about publications (e.g., a publisher’s, retailer’s, or aggregator’s website).

And finally, it should be noted that an important theme that did not emerge from the interviews was the importance of accessibility. Revealing this was one of the benefits of the interview strategy of not asking leading questions: when anybody is asked if accessibility is an important issue, they will almost always say it is. So it is particularly—and lamentably—of note that none of the interviewees mentioned accessibility as a priority issue with regard to metadata.

Important Themes

The key themes of the interviews conducted by Mr. Kasdorf are summarized in the following appendix. They are the following:

Complexity
Inconsistency
Sacrificing Richness for Simplicity
ONIX vs. Subject Metadata
Few Books are Online Anyway
Discovery (AKA Marketing) is the Priority
Identifiers, Identifiers, Identifiers
And Now for Something Completely Different: News

Please see the summary on the group’s wiki page for a discussion of these themes, including important comments by members of the DPUB IG.

The key themes of the interviews conducted by Ms. Solomon are summarized as follows:

Governance
Rights
Flow
Lack of Skills
Lack of Authority
Standards
Inconsistency
Lack of Incentives
Need for Learning Objectives
More Education and Guidance

Please see Ms. Solomon’s report in the interview document for a more detailed discussion of these themes.

1 JATS, the Journal Article Tag Suite, and BITS, the Book Interchange Tag Suite—which share a common markup model below the article and chapter level and which have very rich metadata models and mechanisms—are the current versions of what were previously known as the “NLM DTDs,” the markup and metadata model on which virtually all publications, platforms, and services in the area of scholarly publishing are based. This is unique to scholarly publishing: in no other sector is there such universal consensus on a single markup and metadata model.
