The Metadata Task Force of the DPUB IG found, through extensive interviews with representatives of various sectors and roles within the publishing ecosystem, that there are numerous pain points for publishers with regard to metadata but that these pain points are largely not due to deficiencies in the Open Web Platform. Instead, there is a widespread lack of understanding or implementation of the technologies that the OWP already makes available for addressing most of the issues raised. However, some of the very technologies that are little used or understood in most sectors of publishing are widely used and understood in certain other sectors (e.g., scientific publishing, libraries). Priorities that have emerged are the need for better understanding of the importance of expressing identifiers as URIs; the need for much more widespread use of RDF and its various serializations throughout the publishing ecosystem; and the need to develop a truly interoperable, cross-sector specification for the conveyance of rights metadata (while remaining agnostic as to the sector-specific vocabularies for the expression of rights). This Note documents in detail the issues that were raised; provides examples of available RDF educational resources at various levels, from the very technical to non-technical and introductory; and lists important identifiers used in the publishing ecosystem, documenting which of them are expressed as URIs, and in what sectors and contexts. It concludes that while little new technology is called for, the W3C is in a unique position to bridge today's siloed metadata practices and so facilitate truly cross-sector exchange of interoperable metadata. This Note is thus intended to provide background and a context in which concrete work, whether by this Task Force or elsewhere within the W3C, may be undertaken.
This is a work in progress. No section should be considered final, and the absence of any content does not imply that such content is out of scope, or may not appear in the future. If you feel something should be covered here, tell us!
Publishers use metadata in three fundamentally different ways:
While in many cases metadata is in a system or a form outside of the Web and uses technologies outside of the Open Web Platform (OWP), such as databases, repositories, authoring and formatting software, and proprietary aggregation and dissemination platforms, OWP technologies are increasingly becoming essential to all aspects of the publishing process (including modern versions of all those mentioned).
The Metadata Task Force of the W3C Digital Publishing Interest Group (DPUB IG) was formed to identify ways in which the W3C could help address problems publishers currently have with regard to metadata. In its discovery phase, the TF found the following fundamental issues to be commonly regarded as "pain points" by publishers:
Although each sector of publishing has problems with metadata in its own ways, the causes of these problems fall into two major categories:
In its initial exploration of these issues, the Metadata Task Force of the W3C Digital Publishing Interest Group found that the vast majority of difficulties that publishers of all types have in implementing metadata more effectively are in the second category. In most cases, the OWP already has features that address these issues, if used properly by publishers and implemented properly in systems that create, disseminate, and display those publications (e.g., expressing identifiers as URIs, using RDF and RDFa, etc.). In other cases, ongoing work by the W3C will likely provide solutions or essential components to solutions (e.g., the work of the Web Annotations WG is closely related to the need to address arbitrarily granular units of content).
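As an illustration of what "expressing identifiers as URIs" and "using RDF" can look like in practice, the following is a minimal sketch, not a recommended implementation. The helper names `identifier_to_uri` and `ntriple` are hypothetical, the ISBN is a placeholder rather than a real book, and stripping hyphens from ISBNs is a normalization choice made only for this sketch. It relies on the widely used `urn:isbn:` and `urn:issn:` URN forms, the `https://doi.org/` resolver form for DOIs, and the Dublin Core Terms `title` property.

```python
from urllib.parse import quote

# URI forms for a few common publishing identifiers.
URI_PATTERNS = {
    "isbn": "urn:isbn:{}",
    "issn": "urn:issn:{}",
    "doi": "https://doi.org/{}",
}

DCTERMS_TITLE = "http://purl.org/dc/terms/title"


def identifier_to_uri(scheme: str, value: str) -> str:
    """Hypothetical helper: express a bare identifier as a URI."""
    scheme = scheme.lower()
    if scheme not in URI_PATTERNS:
        raise ValueError(f"no URI pattern known for scheme {scheme!r}")
    value = value.strip()
    if scheme == "isbn":
        # Assumption of this sketch: drop hyphens and spaces so that the
        # same ISBN always yields the same URI.
        value = value.replace("-", "").replace(" ", "")
    elif scheme == "doi":
        # Percent-encode characters that are not safe in a URI path.
        value = quote(value, safe="/")
    return URI_PATTERNS[scheme].format(value)


def ntriple(subject: str, predicate: str, literal: str) -> str:
    """Hypothetical helper: one RDF statement in N-Triples syntax."""
    escaped = literal.replace("\\", "\\\\").replace('"', '\\"')
    return f'<{subject}> <{predicate}> "{escaped}" .'


book = identifier_to_uri("isbn", "978-0-00-000000-2")  # placeholder ISBN
print(ntriple(book, DCTERMS_TITLE, "An Example Title"))
```

Once the identifier is a URI, any system that encounters the statement can treat the subject as a globally unambiguous name, which is the property that makes metadata from different sources combinable.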
The Metadata Task Force of the W3C Digital Publishing Interest Group has developed the following general and specific recommendations to the W3C with regard to the use of metadata within the OWP.
Specific recommendation: The W3C should collaborate with BISG, the Book Industry Study Group; EDItEUR, the international organization responsible for ONIX, the standard messaging format for dissemination of book supply chain metadata; schema.org, which provides the most commonly used means of embedding metadata in web content; and other appropriate parties internationally to develop optimal ways for book publishers to embed appropriate and useful metadata in Web content based on the well-known and widely implemented ONIX model. The BISG has already formed a Working Group to address this issue from the point of view of its US constituency, and that WG is chaired by the Executive Director of EDItEUR, who can ensure that the resulting recommendations are globally applicable. The W3C is uniquely positioned to be a key partner to all of these organizations, providing the foundational technologies on which much of their work is based. It could serve a valuable role in facilitating collaboration internationally and addressing any technological limitations revealed in the course of that collaboration so that solutions are truly global and interoperable.
Specific recommendation: Focus initially on encouraging proper understanding and use of URIs and of RDF/RDFa. To this end, the Metadata Task Force developed two lists:
In order to assess the “pain points” with regard to metadata for publishers, the co-leaders of the task force, Madi Solomon of Pearson and Bill Kasdorf of Apex, conducted a number of interviews in 2014 with publishers, service providers, and representatives from related organizations. The interviews themselves are available in a separate document.
The interviewees were selected to provide insight from a variety of perspectives, and were individuals known to the interviewers to be knowledgeable and authoritative within their spheres. Ms. Solomon took a “vertical” approach, interviewing a broad range of individuals within Pearson, a large global educational publisher. Mr. Kasdorf took a “horizontal” approach, interviewing experts from diverse types of publishing (book, journal, magazine, and news) and representing diverse roles within the digital publishing ecosystem (publishers, metadata service providers, consultants, and representatives from other organizations that are addressing the issue of metadata in publishing).
The interview strategy was to conduct casual, open-ended interviews, each with a single individual, without an agenda or a prepared set of questions. The reason for this strategy was to avoid steering the discussion in particular directions. Instead, in this initial phase, the goal was to elicit what each interviewee perceived as the key issues and pain points with regard to metadata from their own point of view. Thus the interviews deliberately did not focus on what the W3C could do, or on what changes could be made to the Open Web Platform, to address those issues. They stayed at a general level. Since many of the interviewees were not technical, framing the discussion in too technical a manner would have impeded the ability to obtain authentic responses. As expected, few of the interviewees felt able to identify specific “pain points” with regard to the OWP. They spoke instead of general issues of concern to them in their work. The hope was that with an understanding of these issues and pain points, the DPUB IG could then assess where the W3C and the OWP could potentially help address them, and could avoid addressing theoretical technical issues that might not in fact align with publishers’ priorities.
While the published interview reports cited above will provide the best understanding of both the common themes and diverse perspectives revealed by the interviews, this report attempts to summarize key observations and offer initial recommendations for subsequent strategies.
If there is a single overarching lesson revealed by these interviews, it is that the issues with regard to metadata seen as priorities for publishers and their clients and partners differ significantly between publishing sectors (although they all share all of these issues to some extent).
While all of these issues (discovery via subject metadata and other metadata characterizing content and products, management of content via metadata, development of and participation in cross-publisher platforms and services via metadata, and the communication of rights via metadata) cross all sectors of publishing, it is clear from the interviews that the priorities in distinct sectors diverge significantly.
Another major theme heard in virtually all of the interviews was that metadata is “too complicated.” Book publishers, for example, recognize that ONIX is the standard way to communicate supply chain metadata; as such, it is an extremely rich, complex, and useful standard. Similarly, the BISAC standard is a rich vocabulary used in the US for subject classification; there are similar standards in most other countries or regions, and also a new global standard, Thema. While publishers recognize the value of these standards, they often characterize them as “too hard”; yet when pressed for what an individual publisher needs to communicate (to the supply chain, or about the subjects of its books), they often wind up asking for more complexity. (For example, a U.S. publisher may want to describe a book as being about “the Battle of the Bulge,” within the topic “World War II,” which itself is in the category of military history; this can be done with BISAC but not with Thema.) The truth is that these systems are complex because what they are designed to do is complex. The desire for an “ONIX Lite” expressed by several interviewees may prove to be unrealistic, because a significantly simpler model would be significantly less expressive.
Another common theme was that in too many cases metadata may exist (or may potentially exist, if applied to a given publication) but it often “doesn’t do anything.” It is very frustrating to users if it is the case, or even if it is only their perception, that going to the work of adding metadata is futile because systems are not seen as using it. (This is of course true of some types of metadata but not others: clearly trade publishers know how their ONIX metadata is used by the supply chain, and scholarly publishers know how their CrossRef metadata is used for citation linking.) This particularly surfaced in the context of the Pearson interviews because complex educational content is created by a vast team of participants, each of whom may have the ability to provide some aspect of metadata but most of whom have no clear understanding of how to do so, no systems to enable them to do so consistently, and no faith that if they “go to all that work,” it will actually be used for any purpose downstream.
In thinking about metadata, it is important to distinguish between metadata that is incorporated within a publication (an EPUB, a website); metadata that is separate from the publication or publications it describes (e.g., ONIX, which can continually change over time without requiring the publications it communicates metadata about to be altered); and metadata that is incorporated in systems designed to provide information about publications (e.g., a publisher’s, retailer’s, or aggregator’s website).
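To make the distinction concrete, the following is a minimal sketch under stated assumptions: the three records, their field names, and the function `merge_by_uri` are all hypothetical, and the identifier is a placeholder rather than a real ISBN. It shows one record of each kind and how a shared identifier URI lets them be combined, while the separate record changes without the publication itself being altered.

```python
from collections import defaultdict

BOOK = "urn:isbn:9780000000002"  # placeholder identifier, not a real ISBN

# Hypothetical records; every field name and value here is illustrative.
embedded_in_publication = {"id": BOOK, "title": "An Example Title"}
separate_record = {"id": BOOK, "price": "19.99", "currency": "USD"}
retailer_system = {"id": BOOK, "availability": "in stock"}


def merge_by_uri(*records):
    """Combine records that share the same identifier URI into one view."""
    merged = defaultdict(dict)
    for record in records:
        fields = {key: value for key, value in record.items() if key != "id"}
        merged[record["id"]].update(fields)
    return dict(merged)


combined = merge_by_uri(embedded_in_publication, separate_record, retailer_system)

# The separate record can change over time without touching the publication.
separate_record["price"] = "14.99"
updated = merge_by_uri(embedded_in_publication, separate_record, retailer_system)
```

The merge only works because all three sources name the publication with the same URI; with differently formatted bare identifiers, the records would not line up without extra matching logic.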
And finally, it should be noted that an important theme that did not emerge from the interviews was the importance of accessibility. Revealing this was one of the benefits of the interview strategy of not asking leading questions: when anybody is asked if accessibility is an important issue, they will almost always say it is. So it is particularly—and lamentably—of note that none of the interviewees mentioned accessibility as a priority issue with regard to metadata.
The key themes of the interviews conducted by Mr. Kasdorf, summarized in the following appendix, are:
Please see the summary on the group’s wiki page for a discussion of these themes, including important comments by members of the DPUB IG.
The key themes of the interviews conducted by Ms. Solomon are summarized as follows:
Please see Ms. Solomon’s report in the interview document for a more detailed discussion of these themes.
(See also BISG’s Guide to Identifiers.)
(See also W3C’s list on Semantic Web related books.)
1 JATS, the Journal Article Tag Suite, and BITS, the Book Interchange Tag Suite—which share a common markup model below the article and chapter level and which have very rich metadata models and mechanisms—are the current versions of what were previously known as the “NLM DTDs,” the markup and metadata model on which virtually all publications, platforms, and services in the area of scholarly publishing are based. This is unique to scholarly publishing: in no other sector is there such universal consensus on a single markup and metadata model.