Scholarly HTML is a domain-specific rich document format built entirely on open standards
that enables the interoperable exchange of scholarly articles in a manner that is compatible
with off-the-shelf browsers. This document describes how Scholarly HTML works and how it is
encoded.
Introduction
Scholarly articles are still primarily published as unstructured data in which most of the
information created by the research and the practice of authoring is lost. Document
technology has reached a level of maturity and universality that makes this situation no
longer tenable. Information cannot be disseminated if it is destroyed before even having
left its creator’s laptop.
According to the New York Times, adding structured information to their recipes improved
their discoverability to the point of producing an immediate rise of 52 percent in traffic
(NYT). At this point in time, cupcake recipes are reaping
greater benefits from modern data format practices than the whole scientific endeavour.
This places a great burden on tool developers and service providers as well. Anyone who
has explored the world of extracting data from inert publications has built their own
complex toolset, offering no interoperability, no opportunity for cooperative improvements,
and little or no growth in discoverability or meta-analysis in this area.
To address these issues, we have followed an approach rooted in established best practices
for the reuse of open, standard formats. We propose an "HTML Vernacular", a set of
guidelines for the creation of domain-specific data formats that make use of HTML’s
inherent extensibility (Science.AI, 2015). Using the
vernacular foundation overlaid with schema.org metadata and proposed extensions to it, we
have produced a format for the authoring and interchange of scholarly articles built on
open standards, ready for all to use. We hope that this format will be usable by rogue
scientists who choose to publish their articles on their own.
Our high-level goals are to:
Enable structured metadata, accessibility, and internationalisation.
Be fully functional in modern Web browsers.
Be customizable for inclusion in arbitrary Web sites, while remaining easy to process
and interoperable.
Build on top of open, royalty-free standards.
Ensure long-term viability as a data format.
Where semantic modeling is concerned, our approach is to stick as much as possible to
schema.org. Beyond the obvious advantages there are in reusing a vocabulary that is
supported by all the major search engines and is actively being developed towards enabling a
shared understanding of many useful concepts, it also provides a protection against
ontological drift whereby a new vocabulary is defined by a small group with
insufficient input from a broader community of practice. A language that solely a single
participant understands is of limited value.
In a limited number of cases we have had to depart from schema.org, using the
https://ns.science.ai/ namespace, prefixed with sa:. Our goal is to work
with schema.org in order to extend their vocabulary, and we will align our usage with the
outcome of these discussions.
Structure
A Scholarly HTML document is a valid HTML document that follows
some additional rules to specialize its meaning and make it predictable to processors
wishing to produce or consume scholarly articles. These rules are outlined in the following
sections. While valuable on its own, the content structure defined here is simply a stepping
stone to enable semantic enrichment, detailed in Semantics
Overlay. If you would like to write a validation tool, please join us on GitHub.
The root and head
The document must be encoded in UTF-8 and transmitted with a media type of
text/html.
The head must contain a <meta charset="utf-8"> element and a
title element.
The article
The first child of article must be header. The header should
contain an h1 with the title of the document. The following element must be a
div with the role of contentinfo containing author and
affiliation information. See The contentinfo
Region Semantics for information about the semantic decoration of this element.
Any number of section elements may be listed within the article at arbitrary
depths, but each section must begin with an hx element,
indicating a numbered section in the article. If the sections require headings that exceed
h6, aria-level must be included to indicate depth.
Each section may contain zero or more Hunk Elements and
section elements.
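The structural rules above can be sketched as a minimal document skeleton. This is an illustration under stated assumptions rather than a normative template: the title, headings, and paragraph text are placeholders, and the lang attribute on the root element is an addition that the rules above do not require.

```html
<!DOCTYPE html>
<html lang="en"> <!-- lang is an assumption, not required by the rules above -->
  <head>
    <meta charset="utf-8">
    <title>On the Care of Fish Tanks</title>
  </head>
  <body>
    <article>
      <!-- header is the first child of article and carries the h1 title -->
      <header>
        <h1>On the Care of Fish Tanks</h1>
      </header>
      <!-- author and affiliation information follows the header -->
      <div role="contentinfo">
      </div>
      <!-- each section begins with a heading, then hunks and nested sections -->
      <section>
        <h2>Introduction</h2>
        <p>A first hunk element.</p>
        <section>
          <h3>Background</h3>
          <p>A hunk element inside a nested section.</p>
        </section>
      </section>
    </article>
  </body>
</html>
```

Sections nested more deeply than h6 allows would keep using h6 and add an aria-level attribute to indicate their actual depth.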
Hunk Elements
Hunk elements are the meaningful blocks from which sections are built. They contain text
and inline elements. There are several types of hunk elements. All
content, ranging from long paragraphs to note references and footnotes can be captured
using this specified set of elements. The method for distinguishing one from another in a
machine-readable manner is specified in Semantics Overlay.
The
blockquote,
ul,
ol,
and
dl
elements can be used as they typically would and require no special treatment.
The
aside
hunk element is used to capture portions of content that stand apart from the main flow of
content. These can be separated from the article without having impact on the reader’s
understanding of the article. A common use is text boxes in print. If the
aside contains a heading element, that heading must be
the first element child and its numeric part must reflect its depth, making use of
aria-level according to the same rules that apply for section.
The other children of aside must all be hunk elements. For example, if
an aside follows a section with a level 3 heading, the top-level
heading in the aside should be h4.
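As a small illustration of the heading rule, the sketch below places an aside in a section whose heading is h3, so the heading of the aside is h4. The headings and text are placeholders.

```html
<section>
  <h3>Water Chemistry</h3>
  <p>The main flow of the section continues here.</p>
  <aside>
    <!-- the heading is the first element child, one level below the h3 -->
    <h4>Box 1: Measuring pH</h4>
    <!-- all other children of the aside are hunk elements -->
    <p>Content that can be removed without affecting the article.</p>
  </aside>
</section>
```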
Figures
The
figure
element is a general container for self-contained content units that are embedded inside
the main body of the text. It can come in several flavors that are dictated by its
typeof attribute. Common uses for figure are as a container
for images, tables, equations, and computer code.
If figure is typeof="sa:image", it is an image container. It
must contain an img child element and should contain a figcaption
labeling that image.
If figure is typeof="sa:table", it is a table
container. It must contain a table element. If there is a table caption, it
should be included using the caption child element of the table, and not
the figcaption child of the figure. Table notes may also be
included as ol with li elements with the role of
doc-footnote.
If figure is typeof="sa:formula", it is a formula container.
It must contain a math element and, optionally, a figcaption
describing the formula. The math element must be valid MathML 3.
Additionally, given the dismal state of support for MathML in Web browsers, the math
element must contain an annotation descendant with the TeX equivalent of the formula.
If figure is typeof="schema:SoftwareSourceCode", it is a
code container. It must contain a pre element and, optionally, a
figcaption. The pre element must contain a code
element as its only child.
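The four flavors of figure described above might look as follows. This is a sketch with placeholder file names, captions, and data; it assumes that the sa: prefix is declared on an enclosing element, and the encoding value on the annotation element is a common convention rather than something this document mandates.

```html
<!-- Image container: img is required, figcaption is recommended -->
<figure typeof="sa:image">
  <img src="tank.png" alt="A planted fish tank">
  <figcaption>Figure 1: A planted fish tank.</figcaption>
</figure>

<!-- Table container: the caption belongs to the table, not to the figure -->
<figure typeof="sa:table">
  <table>
    <caption>Table 1: Tank sizes (hypothetical data).</caption>
    <thead><tr><th>Tank</th><th>Volume (L)</th></tr></thead>
    <tbody><tr><td>A</td><td>60</td></tr></tbody>
  </table>
</figure>

<!-- Formula container: MathML plus a TeX annotation -->
<figure typeof="sa:formula">
  <math xmlns="http://www.w3.org/1998/Math/MathML">
    <semantics>
      <mrow><mi>E</mi><mo>=</mo><mi>m</mi><msup><mi>c</mi><mn>2</mn></msup></mrow>
      <annotation encoding="application/x-tex">E = mc^2</annotation>
    </semantics>
  </math>
  <figcaption>Equation 1: Mass-energy equivalence.</figcaption>
</figure>

<!-- Code container: pre with code as its only child -->
<figure typeof="schema:SoftwareSourceCode">
  <pre><code>print("hello, fish")</code></pre>
  <figcaption>Listing 1: A greeting.</figcaption>
</figure>
```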
Inline Elements
Inline
elements decorate, describe, and enrich text. Inline elements can be used inside of
hunk elements, heading elements, and captioning elements. Where applicable, they can nest
within one another. Inline images and inline math can be included as well. This can be
accomplished using img for images or math for formulas.
Equations can be displayed inline or as blocks within a paragraph.
References
The References section requires specific semantic overlays
(reference) as well as strict content structure. Apart from a (required) hx
element, it must contain only one ol or dl element.
If using a dl element, the contents must be alternating dt and
dd elements. The dd must contain the citation.
If using ol, the only contents are li that include citations.
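The dl variant of this content structure could be sketched as below. Only the structure is shown, without the semantic overlay; the label, id, and cited work are hypothetical placeholders.

```html
<section>
  <!-- the required heading -->
  <h2>References</h2>
  <!-- exactly one list, here a dl with alternating dt and dd -->
  <dl>
    <dt id="ref-doe2015">[Doe2015]</dt>
    <dd>Jane Doe. <cite>A Hypothetical Study of Fish Tanks</cite>. 2015.</dd>
    <dt id="ref-roe2016">[Roe2016]</dt>
    <dd>Richard Roe. <cite>Another Hypothetical Study</cite>. 2016.</dd>
  </dl>
</section>
```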
Interactive Elements
information about iframes to come
Let’s discuss details of iframes with the CG
HTML Roles
It is possible to provide information about an HTML element by decorating it with the
role attribute. The
ARIA vocabulary and its extensions provide
convenient terms that are relevant to document structure. The following roles from ARIA and
DPUB-ARIA should be applied where
appropriate:
contentinfo to apply to the
div containing author and affiliation information
definition for defining a
term, such as a keyword or a glossary term
doc-example on a hunk
element containing an illustrative concept, such as a code sample.
doc-footnote on a hunk
element containing an individual note, such as an informative note about the author or
table footnote
doc-introduction on
the section introducing the article, if relevant
doc-subtitle to indicate
that a portion of a heading is a secondary or alternate title of the work
Should we require ARIA’s table, grid, rowheader, and rowgroup?
I did not include doc-credit bc of extensive citation markup in JSON-LD
doc-endnote, doc-endnotes are not in the current published draft of DPUB-ARIA. See
March DPUB-ARIA draft
Validation
The only validation requirement for Scholarly HTML at this point is that the HTML is valid.
We are considering building a validation tool in RelaxNG or JavaScript to check compliance
with this set of rules.
Articles should be in the following basic structure:
heading with article title
0+ hunks
0+ sections
  0+ headings
  0+ hunks
  0+ sections
The document must feature a DOCTYPE as its preamble.
Semantics Overlay
HTML provides an excellent backbone with which to capture the
structure of a given text but is evidently limited when it comes to
capturing more domain-specific concepts such as people, spaceships,
Humean causation, or
sthenurines.
That is where semantic overlays with the ability to refine the meaning and relations of
HTML elements come into play. Scholarly HTML makes use of two standard mechanisms that
overlay additional semantics atop the HTML DOM: role-based semantics as defined by
WAI-ARIA and DPUB-ARIA, and
semantics rooted in structured data as captured by RDFa.
Using technologies related to the semantic web can at times feel daunting and unrelated to
everyday web development. In order to suppress this disconnect, Scholarly HTML follows a
few simple guiding principles:
The number of prefixes used for semantic properties is kept as small as possible;
There is no such thing as a URI (or URN, IRI, XRI, or whatever else confusing
contraption), everything is a URL;
Where somewhat more complex modeling is required, it is put together using a small set of
common patterns that might require some initial learning but can be applied regularly
with relative ease (notably roles and actions).
The properties that Scholarly HTML uses are naturally document-related (authorship,
keywords, license, citations, as well as specific structure types such as acknowledgements,
introduction, or funding), which additionally requires the ability to describe people and
organizations. There are numerous vocabularies that address this domain and which could be
used with RDFa; however, for reasons detailed in
Web-First
Data Citations, Scholarly HTML relies almost exclusively on
schema.org, complemented by a small number of additions from
the Scholarly Article Vocabulary.
Persons & Organizations
Marking up persons and organizations can make use of any applicable properties in
schema:Person and schema:Organization, respectively, but it is worth
pointing out some good practices for their use.
If the entity in question has a URL then it is best to use that as its identifier (using
the resource attribute) and additionally to provide it as a link using the
a element (see the person example for an
instance of this).
If you happen to have information providing both the
schema:givenName/schema:familyName/schema:additionalName triple and
the schema:name (which can be considered to contain the name as the person wishes
it to be displayed) for a person then it is (perhaps counterintuitively) best to
include all of them and then use CSS (typically sibling selectors) to hide the
extraneous ones (alternatively, they can be captured using the meta element).
The reason for this is that it exposes more information to machine consumers without
having a negative impact on human readers.
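A minimal sketch of a person marked up along these lines follows. The URL and the name are hypothetical. The example uses the meta variant to carry the name parts, so that only the display name is visible to human readers.

```html
<span typeof="schema:Person" resource="https://example.org/people/jane-doe">
  <!-- the display name, linked for human readers -->
  <span property="schema:name"><a href="https://example.org/people/jane-doe">Jane Q. Doe</a></span>
  <!-- the name parts, exposed to machine consumers only -->
  <meta property="schema:givenName" content="Jane">
  <meta property="schema:additionalName" content="Q.">
  <meta property="schema:familyName" content="Doe">
</span>
```

The property attribute sits on the span rather than on the a element so that the value of schema:name is the text and not the link target.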
Here is also an organization:
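The following sketch shows one way to mark up an organization using the same conventions. The URL and the organization name are hypothetical.

```html
<span typeof="schema:Organization" resource="https://example.org/orgs/institute">
  <span property="schema:name"><a href="https://example.org/orgs/institute">Hypothetical Institute of Ichthyology</a></span>
</span>
```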
How should we represent name transliterations? Are there language tags for transliterated
text? Or should ruby+rdf:HTML be suggested instead? If the latter we can no longer use
meta (which is acceptable).
Roles
Roles are an indirection that provides additional information about a property or
relationship. A simple overview is provided in the schema.org blog post
Introducing ’Role’.
Let’s look at the example of authorship information. Some properties of the agent who
authored the document (person or organization), such as their name, are considered to be
true outside the limited context of the document. These properties will be set directly on
the agent.
On the other hand, other properties are considered to be specific to the agent
in their role as author of the document. To give an example, were I to be
writing the document you are currently reading as a freelancer for the Illuminati, my
affiliation to them would be solely in my role as author and I should not be considered
eternally indentured to them.
When a role is used to enrich a property, the convention is to have it as the value of
that property, and then to repeat the property on the role to point to the object. At
first glance it sounds contrived, but it is a simple and powerful construct. To stay with
the authoring example, the indirection would look like:
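A sketch of this indirection in RDFa, assuming the sa: prefix is declared on an enclosing element and using a hypothetical author name: the outer element is the role, given as the value of schema:author, and the inner element repeats schema:author to point to the agent.

```html
<!-- the role is the value of schema:author on the enclosing subject -->
<span property="schema:author" typeof="sa:ContributorRole">
  <!-- the property is repeated on the role to point to the agent -->
  <span property="schema:author" typeof="schema:Person">
    <span property="schema:name">Jane Doe</span>
  </span>
</span>
```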
To demonstrate how properties can attach differently to the role and to the agent, we can
unfold the authorship example further:
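In the unfolded sketch below, the name is attached to the person, while the affiliation is attached to the role. The URLs, the name, and the fragment identifier are hypothetical, and the sa: prefix is assumed to be declared on an enclosing element.

```html
<li property="schema:author" typeof="sa:ContributorRole">
  <!-- properties of the agent: true outside the context of the document -->
  <span property="schema:author" typeof="schema:Person"
        resource="https://example.org/people/jane-doe">
    <span property="schema:name">Jane Doe</span>
  </span>
  <!-- property of the role: true only for this act of authorship -->
  <a property="sa:roleAffiliation"
     resource="https://example.org/orgs/illuminati"
     href="#aff-illuminati">1</a>
</li>
```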
Actions
Actions are a global schema.org mechanism to convey facts
about things that can be or have been done. There is an
overview document for actions but it
dives deep very fast and may be more confusing than helpful. This section intends to
convey all that one needs to know about actions in order to understand their usage in
Scholarly HTML.
Note that actions can do much more than what Scholarly HTML uses them for. For instance,
if you use an email client that supports actions (such as GMail) you may have noticed that
some emails allow for direct interactions: those are implemented using actions, and
without scripting.
Actions have a type (e.g. ReadAction, DrinkAction), a status
(completed, in progress…), an agent being whoever carries it out, and an object which is
what they are being done to. They can also have start and end times (as well as several
other properties which we won’t go into here). Scholarly articles typically feature
indications about things that people have done, which is a good fit for modeling with
actions. A few examples should help clarify the notion.
When referencing an online work, it is customary to indicate the access date for it (since
it may have changed in the meantime). This can be modeled as a schema:ReadAction,
with its schema:actionStatus set to CompletedActionStatus, and a
schema:endTime being the access date. In JSON-LD it would look like this:
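A minimal sketch of such an action, embedded in HTML as a JSON-LD script block: the access date and the URL of the accessed work are hypothetical, the agent is left out, and how the action is linked to a particular citation is not shown.

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "ReadAction",
  "actionStatus": "CompletedActionStatus",
  "endTime": "2016-01-15",
  "object": { "@id": "https://example.org/articles/fish-tanks" }
}
</script>
```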
Authors often acknowledge the contributions of others or have to disclose potential
conflicts of interest that may stem from their interactions outside of the article.
The former can be conveyed as an sa:AcknowledgeAction in which the
schema:name of the action is the verb part of the acknowledgement and the
schema:recipient is the person (or entity of any kind) being acknowledged. The
agent is typically implicitly specified as the object to which the action is attached.
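An acknowledgement along these lines could be sketched as below. The person and the wording are hypothetical, the sa: prefix is assumed to be declared on an enclosing element, and the sketch does not show how the action is attached to its agent, which the text above leaves implicit.

```html
<p typeof="sa:AcknowledgeAction">
  The authors
  <!-- the verb part of the acknowledgement -->
  <span property="schema:name">thank</span>
  <!-- the entity being acknowledged -->
  <span property="schema:recipient" typeof="schema:Person">
    <span property="schema:name">Alex Example</span>
  </span>
  for comments on an early draft.
</p>
```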
Article and Title Semantics
The article element that roots the content of the Scholarly HTML document
needs further refinement to capture the specific type of article that it encodes. The
typeof attribute should always contain schema:ScholarlyArticle as its
first item, but it can be further refined with additional article types.
Should we recommend a specific taxonomy for article (sub)typing? There are so many:
Fabio, MeSH, NPG…
In order for arbitrary parts of the document to be able to attach metadata to the
article, it also needs to have its resource attribute set to a
URL that can be referenced (it is typically sufficient to just use # for that
purpose).
While the h1 in the document’s header is sufficient to convey
the fact that it is the document’s title, some services use extraneous information in
order to assign an unambiguous title to the document. As such, it needs to have its
property attribute set to schema:name. Similarly, if a subtitle is
present in the header it needs to be decorated with both a role
of doc-subtitle (to expose its DPUB-ARIA
semantics) and a property of schema:alternateName.
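Put together, the article and title decoration might look like the sketch below. The title and subtitle are placeholders, and the choice of a p element to hold the subtitle is an assumption.

```html
<article typeof="schema:ScholarlyArticle" resource="#">
  <header>
    <h1 property="schema:name">On the Care of Fish Tanks</h1>
    <!-- the subtitle carries both the DPUB-ARIA role and the schema.org property -->
    <p role="doc-subtitle" property="schema:alternateName">A hypothetical field guide</p>
  </header>
</article>
```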
The contentinfo Region Semantics
As described in the Structure section, the contentinfo
region serves as a container for the metadata of the article. It is itself nothing more
than a div with a role of contentinfo, but its
content has rich structure.
It contains a list of section elements, each of which is identified with a
specific typeof attribute.
Authors & Contributors
If the document has authors or contributors, they are listed in a section
with typeof sa:AuthorsList. The content of that section
is an h2 title appropriate for it, followed by either a ul or
ol (depending on whether the authors are considered ordered, which is
highly dependent on the discipline’s culture).
Each li in that list must feature a typeof of
sa:ContributorRole and a property of either schema:author or
schema:contributor depending on which is applicable. Modeling with schema.org
roles is explained in the Roles section. Each li contains the following:
Exactly one span with a property of either
schema:author or schema:contributor (matching the one that points to the
role) and typeof either schema:Person (if the author is a sentient
entity) or schema:Organization (if it is a collective thereof). How to capture
persons and organizations is detailed in the creatively-named
Persons & Organizations section.
Zero or more a elements with a property of
sa:roleAffiliation, one for each affiliation of the author in producing the
article. Each of those elements needs further to have a resource
attribute matching the one on the affiliation it is pointing to and an
href attribute linking to the element on which that affiliation is
defined. The a element may contain arbitrary text (typically a number,
letter, or symbol matching that used by the target in its own list). These should not
occur if the agent is an organization.
Zero or more a elements with a property of
sa:roleAction, one for each comment describing the author’s specific
contribution to the work (e.g. "Authors contributed equally" or "Designed the study").
Each of those elements needs further to have a resource
attribute matching the one on the note it is pointing to and an
href attribute linking to the element which contains that note. The
a element may contain arbitrary text (typically a number, letter, or
symbol matching that used by the target in its own list).
Zero or one ul elements. Each of its li children has a
property of sa:roleContactPoint and a typeof set
to schema:ContactPoint. The content of each li can be anything
that describes a manner of contacting the author in question, but it will typically
involve properties such as schema:email, schema:telephone,
schema:address, schema:description (for arbitrary descriptions of the
contact method), or, for journals publishing to the Web of the early 1980s,
schema:faxNumber.
Here is an example of a complete kitchen sink authors’ section. Note that in most cases
the markup will be much simpler — this exercises far more of the features than there
is information for in a typical case.
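The sketch below follows the prose above literally for a single author. All names, URLs, and fragment identifiers are hypothetical; the targets of the affiliation and note links are assumed to be defined elsewhere in the contentinfo region. The contact point property is written with the sa: prefix as an assumption, by analogy with sa:roleAffiliation and sa:roleAction. The sketch does not attempt to verify how an RDFa processor attaches the resulting data to the article resource.

```html
<section typeof="sa:AuthorsList">
  <h2>Authors</h2>
  <ol>
    <li property="schema:author" typeof="sa:ContributorRole">
      <!-- exactly one agent, repeating the property that points to the role -->
      <span property="schema:author" typeof="schema:Person"
            resource="https://example.org/people/jane-doe">
        <span property="schema:name"><a href="https://example.org/people/jane-doe">Jane Doe</a></span>
      </span>
      <!-- zero or more affiliations held in this role -->
      <a property="sa:roleAffiliation"
         resource="https://example.org/orgs/institute"
         href="#aff-1">1</a>
      <!-- zero or more notes describing the specific contribution -->
      <a property="sa:roleAction" resource="#note-a" href="#note-a">a</a>
      <!-- zero or one list of contact points -->
      <ul>
        <li property="sa:roleContactPoint" typeof="schema:ContactPoint">
          <span property="schema:description">Corresponding author</span>:
          <span property="schema:email">jane@example.org</span>
        </li>
      </ul>
    </li>
  </ol>
</section>
```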
Affiliations
If the authors and contributors of the documents are affiliated with organizations, they
are listed in a section with typeof sa:Affiliations.
The content of that section is an h2 title appropriate for it,
followed by a ul or ol (but the order is less commonly
relevant than it is for authors).
Note that articles that feature an organization as an author should have that
organization listed in the Authors & Contributors section,
and not here.
Each li in the list is one affiliation (though multiple people can
reference it). The li needs to have an id matching that used
in the reference. Inside the li is a span with
typeof set to schema:Organization and its resource
also matching the one used in the reference. (The belt and suspenders approach is
unfortunately needed to produce both usable HTML and a viable data model.)
The content of the schema:Organization can contain any applicable property. An
example of an affiliations section, with some extra structure for the organization is
given below.
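A sketch of such a section follows, with a hypothetical organization and address. The id on the li and the resource on the span are the two values that an author's sa:roleAffiliation link would match through its href and resource attributes respectively.

```html
<section typeof="sa:Affiliations">
  <h2>Affiliations</h2>
  <ol>
    <!-- the id is the target of the href used in the reference -->
    <li id="aff-1">
      <!-- the resource matches the one used in the reference -->
      <span typeof="schema:Organization"
            resource="https://example.org/orgs/institute">
        <span property="schema:name">Hypothetical Institute of Ichthyology</span>,
        <span property="schema:address" typeof="schema:PostalAddress">
          <span property="schema:addressLocality">Springfield</span>,
          <span property="schema:addressCountry">Freedonia</span>
        </span>
      </span>
    </li>
  </ol>
</section>
```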
A link to the license for the article should be provided. The link should have the
property of schema:license and typeof="schema:CreativeWork".
Keywords should be listed in a section element with an appropriate
h2. The list of terms should be a ul with the
property of schema:keywords on every li.
The abstract should be included in a section element with
typeof attribute containing sa:Abstract. The abstract should
have the role of doc-abstract.
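The license, keywords, and abstract rules might be combined as in the sketch below. The license URL, keywords, headings, and abstract text are placeholders, and the license type is written with the schema: prefix so that it resolves like the other schema.org terms.

```html
<p>
  Licensed under
  <a property="schema:license" typeof="schema:CreativeWork"
     href="https://creativecommons.org/licenses/by/4.0/">CC BY 4.0</a>.
</p>

<section>
  <h2>Keywords</h2>
  <ul>
    <!-- the property is repeated on every li -->
    <li property="schema:keywords">fish tanks</li>
    <li property="schema:keywords">water chemistry</li>
  </ul>
</section>

<section typeof="sa:Abstract" role="doc-abstract">
  <h2>Abstract</h2>
  <p>A short summary of the article.</p>
</section>
```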
Notes
Notes that add information about the Authors and Contributors
section should be hunk elements labeled as doc-footnote.
Flavored Links
The rel
attribute should be used to add some spice to your links. The following values of
rel should be used on the link that refers to these elements:
footnote
license
stylesheet
There are many values for rel, such as glossary and bibliography that look like they are
useful, but based on the definition, it sounds like they point to the section containing
the whole biblio. Not so helpful. rel="citation" or "biblioentry" would be more valuable.
Citations & References
Citations in scholarly articles provide a way to reference the work of others upon which
one builds. In the pre-Web era, they essentially served as links by carrying sufficient
information for one to find the reference in question, in a relatively compact manner.
In a Web world, it can seem tempting to simply replace citations with links, but there is
value in keeping the limited amount of metadata about the cited object that they provide
inlined in the document. Links rot and disappear; when that happens the rest of the
information can prove crucial in finding the referenced object at some other location.
Unique identifiers with indirect resolution, such as DOIs, might seem like a solution to
this problem but, because they are opaque, humans routinely get them wrong. (DOIs additionally suffer
from a single point of failure for resolution.) All things considered, including a link
for convenience and human-readable metadata about the referenced object is likely the
most resilient way to cite another document.
In the print universe, reducing the number of pages one needs to use can be a noticeable
cost-saver. Given that scholarly articles can easily feature dozens if not hundreds of
cited references, making use of compact reference conventions (as well as smaller font
sizes) made sense. Over time, however, what was a sensible idea degenerated into a
territorial maze of gratuitously heterogeneous conventions to the point where there now
exist over 8000 citation styles.
There is no value in Scholarly HTML so much as attempting to support all citation styles.
The Web does not need the compactness. Citations and references should be both data-rich
and human-accessible, something on which the traditional formats fail, in some cases
quite spectacularly.
For accessibility purposes, Scholarly HTML recommends that references be formatted in such
a manner that they read naturally in the article’s natural language, with articulations
between the metadata parts, as below:
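A sketch of one such reference follows. The cited work, its author, its URL, and its date are hypothetical placeholders; the attributes on the li are the ones this section goes on to describe.

```html
<section>
  <h2>References</h2>
  <ol>
    <li id="ref-doe-2015" role="doc-biblioentry" property="schema:citation"
        typeof="schema:ScholarlyArticle"
        resource="https://example.org/articles/fish-tanks">
      <!-- the title is in a cite element and is linked -->
      <cite property="schema:name"><a href="https://example.org/articles/fish-tanks">A Hypothetical Study of Fish Tanks</a></cite>,
      by
      <span property="schema:author" typeof="schema:Person">
        <span property="schema:name">Jane Doe</span>
      </span>,
      published in
      <!-- the datatype expresses that only year and month are known -->
      <time property="schema:datePublished" datetime="2015-11"
            datatype="xsd:gYearMonth">November 2015</time>.
    </li>
  </ol>
</section>
```

Read aloud, the entry forms a sentence ("A Hypothetical Study of Fish Tanks, by Jane Doe, published in November 2015"), which is the articulation between metadata parts that this section recommends.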
The references section is simply a section with an appropriate heading,
containing an ol. Each li in the list follows a regular
structure: it has a role of doc-biblioentry, a
resource being the URL identifying the cited object, a property
of schema:citation, an id to make it linkable, and a
typeof capturing the kind of object that is being referenced (typically
schema:ScholarlyArticle, schema:Book, or schema:WebPage but there is
really no limit as to what may be cited).
The content of the li can be any RDFa that
matches the typeof, but some good practices should be observed.
The title or name of the cited object should be in a cite element. If a link
is available, then the title should be linked. Date and time values (such as publication
or access date) should make use of the time element (further noting that
the datatype attribute can be used to express the granularity of the date as
in the example above).
While arbitrary metadata may be used, it is highly recommended to stick to
schema.org and the
Scholarly Article vocabularies. The reason for that is
that, should one wish to convert from Scholarly HTML citations into a specific print
format then it will be desirable to be able to reliably extract information from the
citations. This could be used for instance to produce CSL
variables (as exemplified in the CSL
documentation) and then use a CSL implementation in order to produce the output.
Should we be more constraining and define more precisely the constructs that are more
likely to interoperate?
Providing a mapping to CSL would be extremely useful.
Footnotes & Endnotes
If the document has notes, they are listed in a section with the role of
doc-endnotes. The content of that section is an h2
title appropriate for it, followed by either a ul or ol. Each
li should be labeled with the role of doc-endnote.
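A sketch of a notes section, together with a link that refers to one of its notes: the heading, ids, and note text are placeholders, and the use of rel="footnote" on the referring link follows the Flavored Links section.

```html
<!-- in the body of the article -->
<p>Fish tanks require regular maintenance<a href="#note-1" rel="footnote">1</a>.</p>

<!-- the notes section -->
<section role="doc-endnotes">
  <h2>Notes</h2>
  <ol>
    <li id="note-1" role="doc-endnote">At least in the authors' experience.</li>
  </ol>
</section>
```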
Funding Information
Funding information is provided using a complex triples structure which can be summarized
as follows:
subject: receiver of funding (example: Author)
predicate: string or sponsor role (example: wasFundedBy)
object: funding organization (example: Bill & Melinda Gates Foundation)
This can be enhanced with information such as the award name and Role information. Here is
a detailed example:
Disclosures
Disclosure information is a list of disclosure actions described in a simple triples
structure.
The subject is always one of the contributor roles (example: Author)
The name of the action (nerd-talk: predicate, human-speak: verb) is a string describing
the action (example: "received beer from")
The recipient, or object, is the direct object of the sentence (example: Guinness)
Acknowledgements
XXX
Scholarly Article Vocabulary
A limited number of classes and properties are currently not available from
schema.org. In most if not all cases it would be desirable to
make them available there, but while work is progressing it is simpler to define them
ourselves.
The current URL for the Scholarly Article vocabulary is
http://ns.science.ai/. It may be desirable (should the vocabulary persist) to
use a different URL. But this issue might go away if schema.org steps up.
Acknowledgements
Scholarly HTML would like to thank Scholarly HTML
(you read that right) for blazing the trail perhaps a few years too soon. In particular,
the following people were kind and helpful:
Peter Sefton,
Richard Smith-Unna, and
Peter Murray-Rust.
Dan Brickley was kind enough to drop by the office to chat about our usage of
schema.org even though he was tired and hungry. As
always, examples involving fish tanks are the most helpful. Dave Cramer shared ideas
that we happily stole.
Patrick Johnston’s input has been crucial, notably in modeling authoring. We can only
hope that getting those details exactly right has not caused him to lose too much
sleep.
We also received very useful feedback and pointers from: Kjetil Kjernsmo (DAHUT!),
Silvio Peroni, Justin Johansson, Alf Eaton, Raniere Silvia, Kaveh Bazargan and Mike
Smith. We are very much indebted to Ivan Herman for the help he provided us.
If we somehow forgot you in this list and you are too gracious to complain, we love you
all the same.