HTML microdata [[!MICRODATA]] is an extension to HTML used to embed machine-readable data into HTML documents. Whereas the microdata specification describes a means of markup, the output format is JSON. This specification describes processing rules that may be used to extract RDF [[!RDF11-CONCEPTS]] from an HTML document containing microdata.

This document is an experimental work in progress. The concepts described herein are intended to provide guidance for a possible future Working Group chartered to provide a Recommendation for this transformation. As a consequence, implementers of this specification, either producers or consumers, should note that it may change prior to any possible publication as a Recommendation.

This Working Draft is an update of the W3C Interest Group Note, published in October 2012. This update simplifies processing using the following mechanisms:

The intention is to publish this draft as a new version of the Interest Group Note after gathering and incorporating community input.

Introduction

This document describes a means of transforming HTML containing microdata into RDF. HTML Microdata [[!MICRODATA]] is an extension to HTML used to embed machine-readable data into HTML documents. This specification describes transformation directly to RDF [[RDF11-CONCEPTS]].

There are a variety of ways in which a mapping from microdata to RDF might be configured to give a result that is closer to the required result for a particular vocabulary. This specification defines terms that can be used as hooks for vocabulary-specific behavior, which could be defined within a registry or on an implementation-defined basis.

For background on the trade-offs between these options, see http://www.w3.org/wiki/Mapping_Microdata_to_RDF and GitHub Issues.

Background

Microdata [[!MICRODATA]] is a way of embedding data in HTML documents using attributes. The HTML DOM is extended to provide an API for accessing microdata information, and the microdata specification defines how to generate a JSON representation from microdata markup.

Mapping microdata to RDF enables consumers to merge data expressed in other RDF-based formats with microdata. It facilitates the use of RDF vocabularies within microdata, and enables microdata to be used with the full RDF toolchain. Some use cases for this mapping are described in Section 1.2 below.

Microdata's data model does not align neatly with RDF.

Thus, in some places the needs of RDF consumers violate requirements of the microdata specification. This specification highlights where such violations occur and the reasons for them.

This specification allows for vocabulary-specific rules that affect the generation of property URIs and value serializations. This is facilitated by a registry that associates URIs with specific rules, based on matching itemtype values against registered URI prefixes to determine a vocabulary and potentially vocabulary-specific processing rules.

This specification also assumes that consumers of RDF generated from microdata may have to process the results in order to, for example, assign appropriate datatypes to property values.

Use Cases

During the period of the task force, a number of use cases were put forth for the use of microdata in generating RDF:

Issues

Decisions or open issues in the specification are tracked on the GitHub Issue Tracker. These include the following:

Experimental Feature

Experimental support for itemprop-reverse. This attribute is not part of [[MICRODATA]] and is included as an experimental feature. Specific feedback from the community is requested. Based on adoption, the attribute may be considered for inclusion in forthcoming versions of [[MICRODATA]] and this note.

The purpose of this specification is to provide input to a future working group that can make decisions about the need for a registry and the details of processing. Among the options investigated by the Task Force are the following:

Attributes and Syntax

The microdata specification [[!MICRODATA]] defines a number of attributes and the way in which those attributes are to be interpreted. The microdata DOM API provides methods and attributes for retrieving microdata from the HTML DOM.

For reference, attributes used for specifying and retrieving HTML microdata are referenced here:

itemid
An attribute containing a URL used to identify the subject of triples associated with this item. (See itemid in [[!MICRODATA]]).
itemprop
An attribute used to identify one or more names of an item. An itemprop contains a space-separated list of names, which may either be absolute URLs or terms associated with the type of the item as defined by the referencing item's item type. (See itemprop in [[!MICRODATA]]).
itemref
An additional attribute on an element that references additional elements containing property definitions to be applied to the referencing item. (See itemref in [[!MICRODATA]]).
itemscope
A boolean attribute identifying an element as an item. (See itemscope in [[!MICRODATA]]).
itemtype
An additional attribute on an element used to specify one or more types of an item. The item type of an item is the first value returned from element.itemType on the element. The item type is also used to resolve non-URL names to absolute URLs. Available through the Microdata DOM API as element.itemType. (See itemtype in [[!MICRODATA]]).

In RDF, it is common for people to shorten vocabulary terms via abbreviated URIs that use a 'prefix' and a 'reference'. Throughout this document, assume that the following vocabulary prefixes have been defined:

dc: http://purl.org/dc/terms/
md: http://www.w3.org/ns/md#
rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfa: http://www.w3.org/ns/rdfa#
xsd: http://www.w3.org/2001/XMLSchema#

Vocabulary Registry

In a perfect world, all processors would be able to generate the same output for a given input without regard to the requirements of a particular vocabulary. However, microdata doesn't provide sufficient syntactic help in making these decisions. Different vocabularies have different needs.

The registry is located at the namespace defined for microdata, http://www.w3.org/ns/md, in a variety of formats. Under control of a runtime option, a processor should instead use another registry, provided by reference, to affect processing.

The registry associates a URI prefix with one or more key-value pairs denoting processor behavior. A hypothetical JSON representation of such a registry might be the following:


This structure associates mappings for two URIs: http://schema.org/ and http://microformats.org/profile/hcard. Items having an item type with a URI prefix from this registry use the rules described for that prefix within the scope of that item type. For http://schema.org/, this mapping currently defines a single property, additionalType, with a value to indicate specific behavior. It also allows overrides on a per-property basis; the properties key associates an individual name with overrides for default behavior. The interpretation of these rules is defined in the following sections. If an item has no current type or the registry contains no URI prefix matching current type, a conforming processor MUST use the default values defined for these rules.
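The prefix matching and per-property lookup described above can be sketched in Python. This is a minimal illustration, not part of the specification: the function names find_vocabulary and property_rules are hypothetical, and the registry is assumed to be already loaded into a dictionary shaped like the default registry given later in this document.

```python
from typing import Optional

# Assumed in-memory registry, shaped like the default registry in this document.
REGISTRY = {
    "http://schema.org/": {
        "properties": {
            "additionalType": {
                "subPropertyOf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
            }
        }
    },
    "http://microformats.org/profile/hcard": {},
}


def find_vocabulary(registry: dict, item_type: Optional[str]) -> Optional[str]:
    """Return the registered URI prefix that item_type starts with, or None."""
    if not item_type:
        return None
    for prefix in registry:
        # Character-for-character match up to the length of the prefix.
        if item_type.startswith(prefix):
            return prefix
    return None


def property_rules(registry: dict, prefix: Optional[str], name: str) -> dict:
    """Return the overrides registered for name, or {} to signal the defaults."""
    if prefix is None:
        return {}
    return registry.get(prefix, {}).get("properties", {}).get(name, {})
```

Returning the first matching prefix is sufficient here because a vocabulary URI is not allowed to be a prefix of another vocabulary URI.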

Property URI Generation

For names which are not absolute URLs, this section defines the algorithm for generating an absolute URL given an evaluation context including a current type and current vocabulary.

The procedure for generating property URIs is defined in Generate Predicate URI.

The URI generation scheme appends names that are not absolute URLs to the URI prefix. When generating property URIs, if the URI prefix does not end with a '/' or '#', a '#' is appended to the URI prefix. (See Step 4 in Generate Predicate URI.)

URI creation uses a base URL with query parameters to indicate the in-scope type and name list. Consider the following example:


  

Given the URI prefix http://microformats.org/profile/hcard, this would generate http://microformats.org/profile/hcard#n and http://microformats.org/profile/hcard#given-name. Note that the '#' is automatically added as a separator.

Looking at another example:


  

Given the URI prefix http://schema.org/, this would generate http://schema.org/name. Note that if the itemtype were http://schema.org/Person/Teacher, this would generate the same property URI.

If the registry contains no match for current type, implementations MUST act as if there is a URI prefix made from the first itemtype value by stripping either the fragment content or, if the value has no fragment, the last path segment (See [[!RFC3986]]).

The vocabulary URI prefix is made from the first itemtype value by stripping either the fragment content or last path segment, if the value has no fragment (See [[!RFC3986]]).

Deconstructing the itemtype URL to create or identify a vocabulary URI is a violation of the microdata specification which is necessary to support the use of existing vocabularies designed for use with RDF, and shared or inherited properties within all vocabularies.


  

In this example, assuming no matching entry in the registry, the URI prefix is constructed by removing the last path segment, leaving the URI http://example.org/. The resulting property URI would be http://example.org/title.
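The fallback used when the registry has no matching entry can be sketched as follows. The function name default_vocab_prefix and the sample type URLs in the comments are assumptions made for illustration. The behavior for a type containing neither '#' nor '/' is not described in this document, so returning the value unchanged in that case is also an assumption.

```python
def default_vocab_prefix(item_type: str) -> str:
    """Derive a URI prefix from an itemtype when the registry has no match.

    Strips the fragment content if there is a fragment; otherwise strips
    the last path segment.
    """
    if "#" in item_type:
        # http://example.org/vocab#Book -> http://example.org/vocab#
        return item_type[: item_type.index("#") + 1]
    last_slash = item_type.rfind("/")
    if last_slash == -1:
        # Assumption: nothing sensible to strip, so keep the value as-is.
        return item_type
    # http://example.org/Book -> http://example.org/
    return item_type[: last_slash + 1]
```

Because the prefix keeps its trailing '/' or '#', a name such as title can be appended directly to form the property URI.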

If there is no in-scope itemtype, property URIs are generated using the base URI of the document and the name as a fragment. Consider the following example:


      

If the document is located at http://example/author, the name bar generates the URI http://example/author#bar. However, the name baz is included in an untyped item. The inherited property URI is used to create a new property URI: http://example/author#baz.

This scheme is compatible with the needs of other RDF serialization formats such as RDF/XML [[RDF-SYNTAX-GRAMMAR]], which rely on QNames for expressing properties. For example, the generated property URIs can be split as follows:


      

Value Typing

In microdata, all values are strings. In RDF, values may be resources or may be typed with an appropriate datatype.

In some cases, the type of a microdata value can be determined from the element on which it is specified. In particular:

Vocabulary Expansion

Microdata requires that all values of itemtype come from the same vocabulary. This is required as itemprop values are resolved relative to that vocabulary. However, it is often useful to define an item to have types from multiple different vocabularies.

Vocabulary expansion uses simple rules to generate additional triples based on rules and property relationships described in the registry. Within the registry, a property definition may have either equivalentProperty or subPropertyOf keys having an IRI value (or array of IRI values) of the associated property. Such an entry causes the processor to generate triples associating the source property IRI with the target property IRI using either rdfs:subPropertyOf or owl:equivalentProperty predicates.

For example, the registry definition for the additionalType property within schema.org defines additionalType to have an rdfs:subPropertyOf relationship with rdf:type.


The previous example indicates a registry rule that causes the processor to emit an extra triple when first seeing the additionalType itemprop:


After performing vocabulary expansion, an additional rdf:type triple is generated:


The owl:equivalentProperty rule is more powerful than rdfs:subPropertyOf, in that if any equivalent property matches, then the source property would also cause a triple to be generated. For example, if the registry stated that name was equivalent to rdfs:label, then any use of name in an itemprop would cause a triple using rdfs:label to be emitted, as with rdfs:subPropertyOf. However, logically, any use of label where the current vocabulary were rdfs: could also cause a triple using schema:name to be emitted. To simplify processing, this specification requires that all values of an owl:equivalentProperty registry entry have their own rules with those values as keys within the property section of their respective vocabularies.
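As a rough sketch of how a processor might apply these registry rules, the following Python function returns the extra triples for one property use, following the triple shape given in the Generate the triples algorithm (same subject and value, with the registered target as predicate). The function name expansion_triples, the tuple representation of triples, and the example subject and value in the usage note are assumptions for illustration only.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Assumed in-memory registry, shaped like the default registry in this document.
REGISTRY = {
    "http://schema.org/": {
        "properties": {"additionalType": {"subPropertyOf": RDF_TYPE}}
    }
}


def expansion_triples(registry, vocab, name, subject, value):
    """Return the additional (subject, predicate, object) triples for name."""
    rules = registry.get(vocab, {}).get("properties", {}).get(name, {})
    extra = []
    for key in ("subPropertyOf", "equivalentProperty"):
        targets = rules.get(key, [])
        # A rule value may be a single IRI or an array of IRIs.
        if isinstance(targets, str):
            targets = [targets]
        for target in targets:
            extra.append((subject, target, value))
    return extra
```

With the registry above, a use of additionalType on a blank node `_:a` with the hypothetical value `http://example.org/Employee` yields one extra triple whose predicate is rdf:type; a name with no registered rule yields no extra triples.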

Control of Microdata Processors

The external registry may be controlled by the registry option passed to the microdata processor. If specified, the registry must be loaded from the location indicated as the option value. Otherwise, the processor MUST load the default registry from http://www.w3.org/ns/md.

Setting registry is performed in a processor-specific way.

When accessed as a web service using HTTP GET, POST or similar action, processors SHOULD use the registry query parameter. An acceptable value for registry is a URI-encoded URL. Web service processors SHOULD return the resulting RDF graph using a requested format specified by HTTP Content Negotiation for an acceptable content type. Web service processors MUST support [[!N-TRIPLES]].
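For illustration, a client might build such a request URL as sketched below. The service endpoint and the registry location are hypothetical; only the registry parameter name comes from this specification. Any other parameters a given service needs (for example, one naming the document to process) are service-specific and not shown.

```python
from urllib.parse import urlencode


def service_request_url(endpoint: str, registry_url: str) -> str:
    """Append a URI-encoded registry parameter to a web service endpoint."""
    query = urlencode({"registry": registry_url})
    separator = "&" if "?" in endpoint else "?"
    return f"{endpoint}{separator}{query}"


# Hypothetical endpoint and registry location.
url = service_request_url(
    "http://example.org/microdata-service",
    "http://example.org/registry.json",
)
```

A client that wants a particular serialization would additionally send an HTTP Accept header naming the desired content type.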

Algorithm

Transformation of Microdata to RDF makes use of general processing rules described in [[!MICRODATA]] for the treatment of items.

Algorithm Terms

absolute URL
The term absolute URL is defined in [[!HTML5]].
blank node
A blank node is a node in a graph that is neither a URI reference nor a literal. Items without a global identifier have a blank node allocated to them. (See [[RDF11-CONCEPTS]]).
canonicalized fragment
The term canonicalized fragment is defined in [[!URL]]. This involves transforming elements added to URLs to ensure that the result remains a valid URL. Non-Unicode characters, and characters less than U+0020 SPACE character (" ") are subject to percent escaping.
document base
The base address of the document being processed, as defined in Resolving URLs in [[!HTML5]].
evaluation context
A data structure including the following elements:
memory
a mapping of items to subjects, initially empty;
current type
an absolute URL for the current type, used when an item does not contain an item type;
current vocabulary
an absolute URL for the current vocabulary, from the registry.
item
An item is described by an element containing an itemscope attribute. The list of top-level microdata items may be retrieved using the Microdata DOM API document.getItems method.
item properties
The mechanism for finding the properties of an item. The list of item properties may be retrieved using the Microdata DOM API element.properties attribute.
global identifier
The value of an item's itemid attribute, if it has one, resolved relative to the element on which the attribute is specified. (See itemscope in [[!MICRODATA]]).
literal
Literals are values such as strings and dates. These include typed literals, language-tagged strings and simple literals, as defined in [[RDF11-CONCEPTS]].
property
Each name identifies a property of an item. An item may have multiple elements sharing the same name, creating a multi-valued property.
property names
The tokens of an element's itemprop attribute. Each token is a name. (See property names in [[!MICRODATA]]).
property value
The property value of a name-value pair added by an element with an itemprop attribute depends on the element.
If the element has no itemprop attribute
The value is null and no triple should be generated.
If the element creates an item (by having an itemscope attribute)
The value is the URI reference or blank node returned from generate the triples for that item.
If the element is a URL property element (a, area, audio, embed, iframe, img, link, object, source, track or video)
The value is a URI reference created from element.itemValue. (See relevant attribute descriptions in [[!HTML5]]).
If the element is a meter or data element.
The value is a literal made from element.itemValue.
If the value is a valid integer having the lexical form of xsd:integer [[!XMLSCHEMA11-2]]
The value is a typed literal composed of the value and http://www.w3.org/2001/XMLSchema#integer.
If the value is a valid float number having the lexical form of xsd:double [[!XMLSCHEMA11-2]]
The value is a typed literal composed of the value and http://www.w3.org/2001/XMLSchema#double.
Otherwise
The value is a simple literal.
If the element is a meta element with a @content attribute.
If the element has a non-empty language, the value is a language-tagged string created from the value of the @content attribute with language information set from the language of the property element. Otherwise, the value is a simple literal created from the value of the @content attribute.
If the element is a time element.
The value is a literal made from element.itemValue.
If the value is a valid date string having the lexical form of xsd:date [[!XMLSCHEMA11-2]].
The value is a typed literal composed of the value and http://www.w3.org/2001/XMLSchema#date.
If the value is a valid time string having the lexical form of xsd:time [[!XMLSCHEMA11-2]].
The value is a typed literal composed of the value and http://www.w3.org/2001/XMLSchema#time.
If the value is a valid local date and time string or valid global date and time string having the lexical form of xsd:dateTime [[!XMLSCHEMA11-2]].
The value is a typed literal composed of the value and http://www.w3.org/2001/XMLSchema#dateTime.
If the value is a valid month string having the lexical form of xsd:gYearMonth [[!XMLSCHEMA11-2]].
The value is a typed literal composed of the value and http://www.w3.org/2001/XMLSchema#gYearMonth.
If the value is a valid non-negative integer having the lexical form of xsd:gYear [[!XMLSCHEMA11-2]].
The value is a typed literal composed of the value and http://www.w3.org/2001/XMLSchema#gYear.
If the value is a valid duration string having the lexical form of xsd:duration [[!XMLSCHEMA11-2]].
The value is a typed literal composed of the value and http://www.w3.org/2001/XMLSchema#duration.
Otherwise
If the element has a non-empty language, the value is a language-tagged string created from the value with language information set from the language of the property element. Otherwise, the value is a simple literal created from the value.

The HTML valid yearless date string is similar to xsd:gMonthDay, but the lexical forms differ, so it is not included in this conversion.

See The time element in [[!HTML5]].

Otherwise
If the element has a non-empty language, the value is a language-tagged string created from the value with language information set from the language of the property element. Otherwise, the value is a simple literal created from the value.

See The lang and xml:lang attributes in [[!HTML5]] for determining the language of a node.

top-level item
An item which does not contain an itemprop attribute. Available through the Microdata DOM API as document.getItems. (See top-level microdata item in [[!MICRODATA]]).
URI reference
URI references are suitable to be used in subject, predicate or object positions within an RDF triple, as opposed to a literal value that may contain a string representation of a URI. (See [[RDF11-CONCEPTS]]).

The HTML5/microdata content model for @href, @src, @data, itemtype and itemprop and itemid is that of a URL, not a URI or IRI.

A proposed mechanism for specifying the range of property values to be URI reference or IRI could allow these to be specified as subject or object using a @content attribute.

vocabulary
A vocabulary is a collection of URIs, suitable for use as an itemtype or itemprop value, that share a common URI prefix. That prefix is the vocabulary URI. A vocabulary URI is not allowed to be a prefix of another vocabulary URI.
This definition differs from the language in the HTML spec and is just for the purpose of this document. In HTML, a vocabulary is a specification, and doesn't have a URI. In our view, if one specification defines ten itemtypes, then these could be treated as one vocabulary or as ten distinct vocabularies; it is entirely up to the vocabulary creator.
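The datatype rules given above in the property value definition for time, meter and data elements can be sketched in Python as follows. This is a simplified illustration: the regular expressions only approximate the XML Schema lexical forms (for example, they do not range-check months or days), language tagging is omitted, and the function name type_value is an assumption.

```python
import re

XSD = "http://www.w3.org/2001/XMLSchema#"

# Checked in the order the rules are listed for the time element.
TIME_PATTERNS = [
    ("date", re.compile(r"\d{4,}-\d{2}-\d{2}")),
    ("time", re.compile(r"\d{2}:\d{2}:\d{2}(\.\d+)?")),
    ("dateTime", re.compile(
        r"\d{4,}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?(Z|[+-]\d{2}:\d{2})?")),
    ("gYearMonth", re.compile(r"\d{4,}-\d{2}")),
    ("gYear", re.compile(r"\d{4,}")),
    ("duration", re.compile(
        r"P(?=.)(\d+Y)?(\d+M)?(\d+D)?(T(?=.)(\d+H)?(\d+M)?(\d+(\.\d+)?S)?)?")),
]

# Checked in the order the rules are listed for meter and data elements.
NUMERIC_PATTERNS = [
    ("integer", re.compile(r"[+-]?\d+")),
    ("double", re.compile(r"[+-]?(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?")),
]


def type_value(element_name: str, value: str):
    """Return (value, datatype IRI), or (value, None) for a plain literal."""
    if element_name == "time":
        patterns = TIME_PATTERNS
    elif element_name in ("meter", "data"):
        patterns = NUMERIC_PATTERNS
    else:
        return (value, None)
    for local_name, pattern in patterns:
        # The whole value must have the lexical form, not just a prefix of it.
        if pattern.fullmatch(value):
            return (value, XSD + local_name)
    return (value, None)
```

Under these assumptions, a time value of 2012-10 is typed as xsd:gYearMonth, while a value that matches none of the patterns falls through to the plain-literal case.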

RDF Conversion Algorithm

An HTML document containing microdata MAY be converted to any other RDF-compatible document format using the algorithm specified in this section.

A conforming microdata processor implementing RDF conversion MUST implement a processing algorithm that results in the equivalent triples to those that the following algorithm generates:

  1. For each element that is also a top-level item, Generate the triples for that item using the evaluation context.

Generate the triples

When the user agent is to Generate triples for an item item, given evaluation context, it must run the following steps:

This algorithm has undergone substantial change from the original microdata specification [[!MICRODATA]].

  1. If there is an entry for item in memory, then let subject be the subject of that entry. Otherwise, if item has a global identifier and that global identifier is an absolute URL, let subject be that global identifier. Otherwise, let subject be a new blank node.
  2. Add a mapping from item to subject in memory
  3. For each type returned from element.itemType of the element defining the item.
    1. If type is an absolute URL, generate the following triple:
      subject
      subject
      predicate
      http://www.w3.org/1999/02/22-rdf-syntax-ns#type
      object
      type (as a URI reference)
  4. Set type to the first value returned from element.itemType of the element defining the item which is an absolute URL, if any.
  5. Otherwise, set type to current type from evaluation context if not empty.
  6. If the registry contains a URI prefix that is a character for character match of type up to the length of the URI prefix, set vocab as that URI prefix.
  7. Otherwise, if type is not empty, construct vocab by removing everything following the last SOLIDUS U+002F ("/") or NUMBER SIGN U+0023 ("#") from the path component of type.
  8. Update evaluation context setting current vocabulary to vocab.
  9. For each element element that has one or more property names and is one of the properties of the item item run the following substep:
    1. For each name in the element's property names, run the following substeps:
      1. Let context be a copy of evaluation context with current type set to type.
      2. Let predicate be the result of generate predicate URI using context and name.
      3. Let value be the property value of element.
      4. If value is an item, then generate the triples for value using context. Replace value by the subject returned from those steps.
      5. Generate the following triple:
        subject
        subject
        predicate
        predicate
        object
        value
      6. If an entry exists in the registry for name in the vocabulary associated with vocab having the key subPropertyOf or equivalentProperty, for each such value equiv, generate the following triple:
        subject
        subject
        predicate
        equiv
        object
        value
  10. Return subject
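The steps above can be sketched in Python over a simplified, already-extracted item model. Everything about that model is an assumption made for illustration: an item is a dictionary with optional "id", "types" and "properties" keys, each property is a (names, value) pair, a value is either a nested item dictionary or a plain string, and URI references are not distinguished from literals. Predicate generation is inlined in compact form, and triples are collected in a set.

```python
from itertools import count
from urllib.parse import urlsplit

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
_blank_ids = count()


def is_absolute(url):
    # Crude check, adequate only for this sketch: the value has a scheme.
    return bool(urlsplit(url).scheme)


def predicate_uri(name, current_type, vocab, doc_base):
    if is_absolute(name):
        return name
    if current_type is None:
        return doc_base + "#" + name
    return vocab + ("" if vocab.endswith(("/", "#")) else "#") + name


def generate_triples(item, registry, doc_base, triples,
                     memory=None, current_type=None):
    memory = {} if memory is None else memory
    # Steps 1-2: reuse a known subject, else itemid, else a new blank node.
    if id(item) in memory:
        subject = memory[id(item)]
    elif item.get("id") and is_absolute(item["id"]):
        subject = item["id"]
    else:
        subject = f"_:b{next(_blank_ids)}"
    memory[id(item)] = subject
    # Step 3: one rdf:type triple per absolute-URL type.
    types = [t for t in item.get("types", []) if is_absolute(t)]
    for t in types:
        triples.add((subject, RDF_TYPE, t))
    # Steps 4-5: first absolute type, else the inherited current type.
    type_ = types[0] if types else current_type
    # Steps 6-8: vocabulary from the registry, else derived from the type.
    vocab = None
    if type_:
        vocab = next((p for p in registry if type_.startswith(p)), None)
        if vocab is None:
            vocab = type_[: max(type_.rfind("/"), type_.rfind("#")) + 1]
    # Step 9: one triple per name, plus any registry-driven extras.
    for names, value in item.get("properties", []):
        for name in names:
            predicate = predicate_uri(name, type_, vocab, doc_base)
            obj = value
            if isinstance(value, dict):
                obj = generate_triples(value, registry, doc_base,
                                       triples, memory, type_)
            triples.add((subject, predicate, obj))
            rules = registry.get(vocab, {}).get("properties", {}).get(name, {})
            for key in ("subPropertyOf", "equivalentProperty"):
                targets = rules.get(key, [])
                for equiv in [targets] if isinstance(targets, str) else targets:
                    triples.add((subject, equiv, obj))
    # Step 10.
    return subject
```

A caller would create an empty set, invoke generate_triples once per top-level item while sharing the same memory dictionary, and then serialize the collected triples. Sharing memory is what lets an item referenced from two places resolve to a single subject.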

Generate Predicate URI

Predicate URI generation makes use of current type and current vocabulary from an evaluation context context along with name.

  1. If name is an absolute URL, return name as a URI reference.
  2. If current type from context is null, there can be no current vocabulary. Return the URI reference that is the document base with its fragment set to the canonicalized fragment value of name.
    This rule is intended to allow for the case where no type is set, and therefore there is no vocabulary from which to extract rules. For example, if there is a document base of http://example.org/doc and an itemprop of 'title', a URI will be constructed to be http://example.org/doc#title.
  3. Set expandedURI to the URI reference constructed by appending the canonicalized fragment value of name to current vocabulary, separated by a U+0023 NUMBER SIGN character ("#") unless the current vocabulary ends with either a U+0023 NUMBER SIGN character ("#") or SOLIDUS U+002F ("/").
  4. Return expandedURI.
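A minimal Python sketch of these steps follows. It makes two simplifying assumptions that are not part of this specification: a name counts as an absolute URL when it parses with a scheme, and the canonicalized fragment is approximated with percent-encoding from the standard library. The function and parameter names are illustrative.

```python
from typing import Optional
from urllib.parse import quote, urldefrag, urlsplit


def generate_predicate_uri(name: str,
                           current_type: Optional[str],
                           current_vocabulary: Optional[str],
                           document_base: str) -> str:
    # Step 1: absolute URLs are used as-is.
    if urlsplit(name).scheme:
        return name
    # Approximation of the canonicalized fragment value of name.
    fragment = quote(name, safe="")
    # Step 2: no current type, so use the document base with name as fragment.
    if current_type is None:
        base, _existing_fragment = urldefrag(document_base)
        return f"{base}#{fragment}"
    # Step 3: append to the vocabulary, adding '#' unless it already
    # ends with '#' or '/'.
    if current_vocabulary.endswith(("#", "/")):
        return current_vocabulary + fragment
    # Step 4: return the expanded URI.
    return f"{current_vocabulary}#{fragment}"
```

For example, with no current type and a document base of http://example.org/doc, the name title expands to http://example.org/doc#title, matching the note under Step 2.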

Reverse itemprop

The WebSchemas community has proposed the use of a new Microdata attribute: itemprop-reverse. Although not present in [[MICRODATA]] at this time, the attribute can be very useful in many markup examples where items are related using the reverse of a common property; this saves creating new properties which exist solely for the purpose of describing such reverse relationships. Evidence for the utility of such a feature can be seen in the RDFa @rev attribute [[RDFA-CORE]] and the JSON-LD @reverse property [[JSON-LD]].

See issue 5 for further reference.

This feature adds the following attribute:

itemprop-reverse
An attribute used to identify one or more names of an item, reversing the sense of itemprop. An itemprop-reverse contains a space-separated list of names, which may either be absolute URLs or terms associated with the type of the item as defined by the referencing item's item type.

The definition of top-level item is updated to also exclude items having itemprop-reverse.

The Algorithm is extended accordingly:

Algorithm Terms

reverse properties
The mechanism for finding the reverse properties of an item. The list of reverse properties is obtained using a variation of the Microdata properties of an item method, replacing itemprop with itemprop-reverse.
reverse property names
The tokens of an element's itemprop-reverse attribute. Each token is a name.

Generate the triples

The Triples generation algorithm is extended with the following step to take place immediately after Step 9:

  1. For each element element that has one or more reverse property names and is one of the reverse properties of the item item, run the following substep:
    1. For each name in the element's reverse property names, run the following substeps:
      1. Let context be a copy of evaluation context with current type set to type and current vocabulary set to vocab.
      2. Let predicate be the result of generate predicate URI using context and name.
      3. Let value be the property value of element.
      4. If value is an item, then generate the triples for value using context. Replace value by the subject returned from those steps.
      5. Otherwise, if value is a literal ignore the value and continue to the next name; it is an error for the value of itemprop-reverse to be a literal.
      6. Generate the following triple:
        subject
        value
        predicate
        predicate
        object
        subject
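The difference from the forward case is only in how the final triple is assembled, which the following Python sketch isolates. The Literal wrapper, the string representation of resources, and the function name reverse_triple are assumptions for illustration.

```python
from typing import NamedTuple, Optional, Tuple, Union


class Literal(NamedTuple):
    """Marks a value as a literal rather than a URI reference or blank node."""
    lexical: str


Resource = str  # a URI reference or a blank node label such as "_:b0"
Triple = Tuple[Resource, str, Resource]


def reverse_triple(subject: Resource,
                   predicate: str,
                   value: Union[Resource, Literal]) -> Optional[Triple]:
    """Build the triple for itemprop-reverse, or None if value is a literal."""
    if isinstance(value, Literal):
        # Substep 5: a literal value is an error and is ignored.
        return None
    # Substep 6: the value becomes the subject and the item becomes the object.
    return (value, predicate, subject)
```

A processor would call this once per reverse property name and skip any None result before adding the triple to the output graph.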

Simple use of itemprop-reverse:


Results in the following Turtle:


Testing

A test suite [[MICRODATA-RDF-TESTS]] is under development to help processor developers verify conformance to this specification.

Markup Examples

The microdata example below expresses book information as an FRBR Work item.


Assuming that the registry contains an entry for http://purl.org/vocab/frbr/core#, this is equivalent to the following Turtle:


The following snippet of HTML has microdata for two people with the same address. This illustrates two items referencing a third item, and how only a single RDF resource definition is created for that third item.


Assuming that the registry contains an entry for http://microformats.org/profile/hcard, it generates these triples expressed in Turtle:


The following snippet of HTML has microdata for a playlist and illustrates the use of the schema:additionalType property to relate recordings to the Music Ontology:


Assuming that the registry contains an entry for http://schema.org/, it generates these triples expressed in Turtle:


Default registry

The following is the default registry in JSON format, as of the time of publication.

  {
    "http://schema.org/": {
      "properties": {
        "additionalType": {"subPropertyOf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"}
      }
    },
    "http://microformats.org/profile/hcard": {}
  }
  

Acknowledgements

Thanks to Richard Cyganiak for property URI and vocabulary terminology and the general excellent consideration of practical problems in generating RDF from microdata.