HTML Microdata [[!MICRODATA]] is an extension to HTML used to embed machine-readable data into HTML documents. Whereas the microdata specification describes a means of markup, the output format is JSON. This specification describes processing rules that may be used to extract RDF [[!RDF11-CONCEPTS]] from an HTML document containing microdata.

The concepts described herein are intended to provide guidance for a possible future Working Group chartered to provide a Recommendation for this transformation. As a consequence, implementers of this specification, either producers or consumers, should note that it may change prior to any possible publication as a Recommendation.

This document updates the W3C Interest Group Note published in December 2014, aligning it with the 2017 update to [[!MICRODATA]].

As the Semantic Web Interest Group has expired, a new home will need to be found for this Note.

Introduction

This document describes a means of transforming HTML containing microdata into RDF. HTML Microdata [[!MICRODATA]] is an extension to HTML used to embed machine-readable data in HTML documents. This specification describes transformation directly to RDF [[RDF11-CONCEPTS]].

There are a variety of ways in which a mapping from microdata to RDF might be configured to give a result that is closer to the required result for a particular vocabulary. This specification defines terms that can be used as hooks for vocabulary-specific behavior, which could be defined within a registry or on an implementation-defined basis.

For background on the trade-offs between these options, see https://www.w3.org/wiki/Mapping_Microdata_to_RDF and GitHub Issues.

The current version of [[!MICRODATA]] does not generate URIs for properties of un-typed items, and consequently, the mechanism for generating property URIs for itemprop tokens using the document URI has been removed.

Background

Microdata [[!MICRODATA]] is a way of embedding data in HTML documents using attributes. The Microdata specification defines how to generate a JSON representation from microdata markup.

Mapping microdata to RDF enables consumers to merge data expressed in other RDF-based formats with microdata. It facilitates the use of RDF vocabularies within microdata, and enables microdata to be used with the full RDF toolchain. Some use cases for this mapping are described in Section 1.2 below.

Microdata's data model does not always align neatly with RDF.

This specification allows for vocabulary-specific rules that affect the generation of property URIs and value serializations. This is facilitated by a registry that associates URI prefixes with specific rules: itemtype values are matched against the registered URI prefixes to determine a vocabulary and, potentially, vocabulary-specific processing rules.

This specification also assumes that consumers of RDF generated from microdata may have to process the results in order to, for example, assign appropriate datatypes to property values.

Issues

Decisions or open issues in the specification are tracked on the GitHub Issue Tracker. These include the following:

Experimental Feature

Experimental support for itemprop-reverse. This attribute is not part of [[MICRODATA]] and is included as an experimental feature. Specific feedback from the community is requested. Based on adoption, the attribute may be considered for inclusion in forthcoming versions of [[MICRODATA]] and this note.

Attributes and Syntax

The Microdata specification [[!MICRODATA]] defines a number of attributes and the way in which those attributes are to be interpreted.

For reference, the attributes used for specifying and retrieving HTML microdata are listed here:

itemid
An attribute containing a URL used to identify the subject of triples associated with this item. (See itemid in [[!MICRODATA]]).
itemprop
An attribute used to identify one or more names of an item. An itemprop contains a space-separated list of names, which may either be absolute URLs or terms associated with the type of the item as defined by the referencing item's item type. (See itemprop in [[!MICRODATA]]).
itemref
An additional attribute on an element that references additional elements containing property definitions to be applied to the referencing item. (See itemref in [[!MICRODATA]]).
itemscope
A boolean attribute identifying an element as an item. (See itemscope in [[!MICRODATA]]).
itemtype
An additional attribute on an element used to specify one or more types of an item. The item type of an item is the first value defined if itemtype contains multiple values, as defined in items. (See itemtype in [[!MICRODATA]]).

In RDF, it is common to shorten vocabulary terms via abbreviated URIs that use a 'prefix' and a 'reference'. Examples throughout this document assume that the following vocabulary prefixes have been defined:

dc: http://purl.org/dc/terms/
md: http://www.w3.org/ns/md#
rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs: http://www.w3.org/2000/01/rdf-schema#
rdfa: http://www.w3.org/ns/rdfa#
xsd: http://www.w3.org/2001/XMLSchema#
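
As an illustration of the 'prefix' and 'reference' convention, the following minimal Python sketch expands an abbreviated URI using prefixes from the list above. The function name `expand` and the choice to return unknown input unchanged are assumptions made for this example, not part of this specification.

```python
# Prefixes taken from the list above.
PREFIXES = {
    "dc": "http://purl.org/dc/terms/",
    "md": "http://www.w3.org/ns/md#",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfa": "http://www.w3.org/ns/rdfa#",
    "xsd": "http://www.w3.org/2001/XMLSchema#",
}


def expand(abbreviated: str) -> str:
    """Expand 'prefix:reference' to a full URI.

    Input whose prefix is not in PREFIXES (for example an absolute URL such
    as 'http://...') is returned unchanged; this is an illustrative choice.
    """
    prefix, sep, reference = abbreviated.partition(":")
    if sep and prefix in PREFIXES:
        return PREFIXES[prefix] + reference
    return abbreviated
```

For instance, `expand("rdf:type")` yields the full `rdf:type` URI used by the conversion algorithm later in this document.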

Vocabulary Registry

In a perfect world, all processors would be able to generate the same output for a given input without regard to the requirements of a particular vocabulary. However, microdata doesn't provide sufficient syntactic help in making these decisions. Different vocabularies have different needs.

The registry is located at the namespace defined for microdata, http://www.w3.org/ns/md, and is available in a variety of formats. When a runtime option requests it, a processor should instead use another registry, provided by reference, to affect processing.

The registry associates a URI prefix with one or more key-value pairs denoting processor behavior. A hypothetical JSON representation of such a registry might be the following:


This structure associates mappings for a single URI: http://schema.org/. Items having an item type with a URI prefix from this registry use the rules described for that prefix within the scope of that item type. For http://schema.org/, this mapping currently defines a single property: additionalType with a value to indicate specific behavior. It also allows overrides on a per-property basis; the properties key associates an individual name with overrides for default behavior. The interpretation of these rules is defined in the following sections. If an item has no vocabulary identifier or the registry contains no URI prefix matching the vocabulary identifier, a conforming processor MUST use the default values defined for these rules.
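
The lookup described above can be sketched in Python as follows. This is a minimal illustration, not a normative implementation: the registry content is copied from the Default registry section of this document, while the helper names `find_prefix` and `property_rules` are hypothetical.

```python
import json

# Registry content as given in the "Default registry" section.
REGISTRY = json.loads("""
{
  "http://schema.org/": {
    "properties": {
      "additionalType": {"subPropertyOf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"}
    }
  },
  "http://microformats.org/profile/hcard": {}
}
""")


def find_prefix(registry, item_type):
    """Return the registered URI prefix that item_type starts with, or None."""
    for prefix in registry:
        if item_type.startswith(prefix):
            return prefix
    return None


def property_rules(registry, item_type, name):
    """Return the per-property overrides for name, or {} for default behavior."""
    prefix = find_prefix(registry, item_type)
    if prefix is None:
        return {}
    return registry[prefix].get("properties", {}).get(name, {})
```

An empty result from `property_rules` stands for "use the default values", which covers both an unmatched item type and a property with no overrides.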

Property URI Generation

Property URI generation is described in § 5.3 Properties of [[!MICRODATA]] with the following modification:

Consider the following example:


  

Given the URI prefix http://microformats.org/profile/hcard, this would generate http://microformats.org/profile/hcard#n and http://microformats.org/profile/hcard#given-name. Note that the '#' is automatically added as a separator.

Looking at another example:


  

Given the URI prefix http://schema.org/, this would generate http://schema.org/name. Note that if the itemtype were http://schema.org/Person/Teacher, this would generate the same property URI.

If the registry contains no match for current vocabulary, implementations MUST act as if there is a URI prefix made from the first itemtype value by stripping either the fragment content or, if the value has no fragment, the last path segment (See [[!RFC3986]]).

That is, absent a registry match, the vocabulary URI prefix is the first itemtype value with its fragment content removed or, when the value has no fragment, with its last path segment removed (See [[!RFC3986]]).


  

In this example, assuming no matching entry in the registry, the URI prefix is constructed by removing the last path segment, leaving the URI http://example.org/. The resulting property URI would be http://example.org/title.
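
The prefix derivation and property URI construction described in this section can be sketched as below. Several points are assumptions made for illustration: the function names, the item type `http://example.org/Book` (the original markup is not reproduced here), the reading that stripping "fragment content" keeps the `#`, the rule that `#` is inserted only when the prefix does not already end in `/` or `#`, and the simple `://` test for absolute URLs.

```python
def default_prefix(item_type: str) -> str:
    """Derive a URI prefix when the registry has no match.

    Strips the fragment content (keeping '#') or, if there is no fragment,
    the last path segment. Assumes the URL has a path component.
    """
    if "#" in item_type:
        return item_type[: item_type.index("#") + 1]
    return item_type[: item_type.rindex("/") + 1]


def property_uri(prefix: str, name: str) -> str:
    """Build a property URI from a vocabulary URI prefix and an itemprop name."""
    if "://" in name:
        # Names that are already absolute URLs are used as they are.
        return name
    if prefix.endswith(("/", "#")):
        return prefix + name
    # Assumed rule: '#' is added as a separator when the prefix lacks one.
    return prefix + "#" + name
```

With these helpers, the hcard prefix and the name `n` give `http://microformats.org/profile/hcard#n`, and the hypothetical type `http://example.org/Book` with the name `title` gives `http://example.org/title`, matching the results stated in the text.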

Value Typing

In microdata, all values are strings. In RDF, values may be resources or may be typed with an appropriate datatype.

In some cases, the type of a microdata value can be determined from the element on which it is specified. In particular:

Vocabulary Expansion

Microdata requires that all values of itemtype come from the same vocabulary. This is required as itemprop values are resolved relative to that vocabulary. However, it is often useful to define an item to have types from multiple different vocabularies.

Vocabulary expansion uses simple rules to generate additional triples based on rules and property relationships described in the registry. Within the registry, a property definition may have either equivalentProperty or subPropertyOf keys having an IRI value (or array of IRI values) of the associated property. Such an entry causes the processor to generate triples associating the source property IRI with the target property IRI using either rdfs:subPropertyOf or owl:equivalentProperty predicates.

For example, the registry definition for the additionalType property within schema.org defines additionalType to have an rdfs:subPropertyOf relationship with rdf:type.


The previous example indicates a registry rule that causes the processor to emit an extra triple when it first sees the additionalType itemprop:


After performing vocabulary expansion, an additional rdf:type triple is generated:


The owl:equivalentProperty rule is more powerful than rdfs:subPropertyOf, in that if any equivalent property matches, then the source property would also cause a triple to be generated. For example, if the registry stated that name was equivalent to rdfs:label, then any use of name in an itemprop would cause a triple using rdfs:label to be emitted, as with rdfs:subPropertyOf. However, logically, any use of label where the current vocabulary were rdfs: could also cause a triple using schema:name to be emitted. To simplify processing, this specification requires that all values of an owl:equivalentProperty registry entry have their own rules with those values as keys within the property section of their respective vocabularies.
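
A minimal Python sketch of vocabulary expansion follows. It assumes triples are plain `(subject, predicate, object)` tuples and that the per-vocabulary property rules have already been looked up; the function name `expand_triple` and the example object URI are hypothetical.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Property rules for one vocabulary, mirroring the schema.org registry entry.
PROPERTIES = {"additionalType": {"subPropertyOf": RDF_TYPE}}


def expand_triple(triple, name, properties):
    """Return the original triple plus one extra triple per registered target.

    Both subPropertyOf and equivalentProperty are handled the same way here:
    each target IRI (a single string or a list of strings) replaces the
    predicate of the original triple.
    """
    subject, _predicate, obj = triple
    rules = properties.get(name, {})
    result = [triple]
    for key in ("subPropertyOf", "equivalentProperty"):
        targets = rules.get(key, [])
        if isinstance(targets, str):
            targets = [targets]
        for target in targets:
            result.append((subject, target, obj))
    return result
```

Because each equivalentProperty target is required to carry its own rule in its own vocabulary, the sketch never needs to look up reverse equivalences.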

Control of Microdata Processors

The external registry may be controlled by the registry option passed to the microdata processor. If specified, the registry must be loaded from the location indicated as the option value. Otherwise, the processor MUST load the default registry from http://www.w3.org/ns/md.

Setting registry is performed in a processor-specific way.

When accessed as a web service using HTTP GET, POST or similar action, processors SHOULD use the registry query parameter. The acceptable value for registry is a URI-encoded URL. Web service processors SHOULD return the resulting RDF graph using a requested format specified by HTTP Content Negotiation for an acceptable content type. Web service processors MUST support [[!N-TRIPLES]].
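
As a rough sketch of how a client might address such a web service, the Python below builds a GET request URL carrying a URI-encoded registry parameter and asks for N-Triples through content negotiation. The endpoint URL and the `uri` parameter naming the document to process are hypothetical; only the registry parameter comes from this specification.

```python
from urllib.parse import urlencode


def service_request(endpoint, document_url, registry_url=None):
    """Return (url, headers) for a GET request to a microdata-to-RDF service.

    'uri' is an assumed name for the parameter identifying the input document.
    """
    params = {"uri": document_url}
    if registry_url is not None:
        # urlencode percent-encodes the registry URL, as the text requires.
        params["registry"] = registry_url
    url = endpoint + "?" + urlencode(params)
    # application/n-triples is the media type for N-Triples.
    headers = {"Accept": "application/n-triples"}
    return url, headers
```

When `registry_url` is omitted, the parameter is left out and the processor is expected to fall back to the default registry.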

Algorithm

Transformation of Microdata to RDF makes use of general processing rules described in [[!MICRODATA]] for the treatment of items.

Algorithm Terms

absolute URL
The term absolute URL as defined in [[!MICRODATA]].
blank node
A blank node is a node in a graph that is neither a URI nor a literal. Items without a global identifier have a blank node allocated to them. (See blank node in [[RDF11-CONCEPTS]]).
current vocabulary
An absolute URL for the current vocabulary, from the registry.
document base
The base address of the document being processed, as defined in Establishing a Base URI in [[!RFC3986]].
item
An item is described by an element containing an itemscope attribute. A top-level microdata item is an item that does not have an itemprop attribute. (See item and top-level microdata item in [[!MICRODATA]]).
item properties
The mechanism for finding the properties of an item as described in § 5.5 Associating names with items of [[!MICRODATA]]. (See item properties in [[!MICRODATA]]).
global identifier
The value of an item's itemid attribute, if it has one, resolved relative to the element on which the attribute is specified. (See global identifier in [[!MICRODATA]]).
literal
Literals are values such as strings and dates. These include typed literals, language-tagged strings and simple literals, as defined in [[RDF11-CONCEPTS]].
memory
A mapping of items to subjects, initially empty.
object
An object is a URI, blank node or literal. See object in [[RDF11-CONCEPTS]].
predicate
A predicate is a URI. See predicate in [[RDF11-CONCEPTS]].
property
Each name identifies a property of an item. An item may have multiple elements sharing the same name, creating a multi-valued property. (See property in [[!MICRODATA]]).
property names
The tokens of an element's itemprop attribute. The property names algorithm in [[!MICRODATA]] is modified as follows: each property is a URI. (See property names in [[!MICRODATA]]).
property value
The value of a property of an item is determined as described in § 5.4 Values: the content attribute of [[!MICRODATA]]. This specification extends the algorithm to account for URI, datatypes, and language, starting with the original value returned from that algorithm and the context of the element containing the property in the DOM.
  If the value is an item (by having an itemscope attribute)
    The value is the URI or blank node returned from generate the triples for that item.
  If element is a URL property element
    The value is a URI created from the value specified by the Microdata algorithm.
  If element is a data or meter element
    The value is a literal.
    If the value has the lexical form of xsd:integer [[!XMLSCHEMA11-2]]
      The value is a typed literal composed of the value and http://www.w3.org/2001/XMLSchema#integer.
    If the value has the lexical form of xsd:double [[!XMLSCHEMA11-2]]
      The value is a typed literal composed of the value and http://www.w3.org/2001/XMLSchema#double.
    Otherwise
      The value is a simple literal.
  If element is a time element
    The value is a literal.
    If the value has the lexical form of xsd:date [[!XMLSCHEMA11-2]]
      The value is a typed literal composed of the value and http://www.w3.org/2001/XMLSchema#date.
    If the value has the lexical form of xsd:time [[!XMLSCHEMA11-2]]
      The value is a typed literal composed of the value and http://www.w3.org/2001/XMLSchema#time.
    If the value has the lexical form of xsd:dateTime [[!XMLSCHEMA11-2]]
      The value is a typed literal composed of the value and http://www.w3.org/2001/XMLSchema#dateTime.
    If the value has the lexical form of xsd:gYearMonth [[!XMLSCHEMA11-2]]
      The value is a typed literal composed of the value and http://www.w3.org/2001/XMLSchema#gYearMonth.
    If the value has the lexical form of xsd:gYear [[!XMLSCHEMA11-2]]
      The value is a typed literal composed of the value and http://www.w3.org/2001/XMLSchema#gYear.
    If the value has the lexical form of xsd:duration [[!XMLSCHEMA11-2]]
      The value is a typed literal composed of the value and http://www.w3.org/2001/XMLSchema#duration.
    Otherwise
      If element has a language (as described in § 3.2.5.2 The lang and xml:lang attributes of [[!HTML52]]) the value is a language-tagged string created using the value with the language of element. Otherwise, the value is a simple literal.
    The HTML valid yearless date string is similar to xsd:gMonthDay, but the lexical forms differ, so it is not included in this conversion.

    See § 2.4.5. Dates and times in [[!HTML52]].

  Otherwise
    If element has a language (as described in § 3.2.5.2 The lang and xml:lang attributes of [[!HTML52]]) the value is a language-tagged string created using the value with the language of element. Otherwise, the value is a simple literal.

See § 3.2.5.2 The lang and xml:lang attributes in [[!HTML52]] for determining the language of a node.

subject
A subject is a URI or blank node. See subject in [[RDF11-CONCEPTS]].
top-level item
An item which does not contain an itemprop attribute. (See top-level microdata item in [[!MICRODATA]]).
typed item
An item is said to be a typed item when either it has an item type, or it is the value of a property of a typed item. The relevant types for a typed item is the item's item types, if it has any, or else is the relevant types of the item for which it is a property's value. (See typed item in [[!MICRODATA]]).
URI
URIs are suitable to be used in subject, predicate or object positions within an RDF triple, as opposed to a literal value that may contain a string representation of a URI. (See [[RDF11-CONCEPTS]]).
vocabulary
A vocabulary is a collection of URIs, suitable for use as an itemtype or itemprop value, that share a common URI prefix. That prefix is the vocabulary URI. A vocabulary URI is not allowed to be a prefix of another vocabulary URI.
This definition differs from the language in the HTML spec and is just for the purpose of this document. In HTML, a vocabulary is a specification, and doesn't have a URI. In our view, if one specification defines ten itemtypes, then these could be treated as one vocabulary or as ten distinct vocabularies; it is entirely up to the vocabulary creator.
vocabulary identifier
The first URI from item types. (See vocabulary identifier in [[!MICRODATA]]).
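
The datatype rules given under property value above can be sketched in Python as follows. The regular expressions are simplified approximations of the XML Schema lexical forms (they do not range-check months, days, or hours), and the function name `type_value` and its tuple result are assumptions made for this example. Items and URL property elements are outside the scope of the sketch.

```python
import re

XSD = "http://www.w3.org/2001/XMLSchema#"
TZ = r"(Z|[+-]\d{2}:\d{2})?"           # optional timezone
CLOCK = r"\d{2}:\d{2}:\d{2}(\.\d+)?"   # hh:mm:ss with optional fraction

# Checked in order, as in the definition: integer before double.
NUMERIC = [
    ("integer", r"[+-]?\d+"),
    ("double", r"[+-]?(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?|[+-]?INF|NaN"),
]
TEMPORAL = [
    ("date", r"-?\d{4,}-\d{2}-\d{2}" + TZ),
    ("time", CLOCK + TZ),
    ("dateTime", r"-?\d{4,}-\d{2}-\d{2}T" + CLOCK + TZ),
    ("gYearMonth", r"-?\d{4,}-\d{2}" + TZ),
    ("gYear", r"-?\d{4,}" + TZ),
    ("duration", r"-?P(?=.)(\d+Y)?(\d+M)?(\d+D)?(T(?=\d)(\d+H)?(\d+M)?(\d+(\.\d+)?S)?)?"),
]


def type_value(element, value, lang=None):
    """Return (lexical value, datatype URI or None, language tag or None)."""
    if element in ("data", "meter"):
        rules = NUMERIC
    elif element == "time":
        rules = TEMPORAL
    else:
        rules = []
    for name, pattern in rules:
        if re.fullmatch(pattern, value):
            return (value, XSD + name, None)
    if element in ("data", "meter"):
        return (value, None, None)   # simple literal, no language
    return (value, None, lang)       # language-tagged if lang is set
```

A yearless date such as `12-16` matches none of the temporal patterns, so it falls through to the untyped case, in line with the remark about xsd:gMonthDay.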

RDF Conversion Algorithm

An HTML document containing microdata MAY be converted to any other RDF-compatible document format using the algorithm specified in this section.

A conforming microdata processor implementing RDF conversion MUST implement a processing algorithm that results in the equivalent triples to those that the following algorithm generates:

  1. Create memory as an empty map.
  2. For each element that is also a top-level item, generate the triples for that item using null for current vocabulary.

Generate the triples

When the user agent is to Generate triples for an item item, given current vocabulary, it must run the following steps:

  1. If there is an entry for item in memory, then let subject be the subject of that entry. Otherwise, if item has a global identifier and that global identifier is an absolute URL, let subject be that global identifier. Otherwise, let subject be a new blank node.
  2. Add a mapping from item to subject in memory.
  3. For each type which is an item type of the item:
    1. Generate the following triple:
      subject: subject
      predicate: http://www.w3.org/1999/02/22-rdf-syntax-ns#type
      object: type (as a URI)
  4. Set vocab to the vocabulary identifier for the item, if any.
  5. If the registry contains a URI prefix that is a character for character match of vocab up to the length of the URI prefix, set vocab as that URI prefix.
  6. For each element that is one of the item properties of item, run the following substep:
    1. For each predicate in the element's property names, run the following substeps:
      1. Let value be the property value of element.
      2. If value is an item, then generate the triples for value using vocab for the current vocabulary. Replace value by the subject returned from those steps.
      3. Generate the following triple:
        subject: subject
        predicate: predicate
        object: value
      4. If an entry exists in the registry for predicate in the vocabulary associated with vocab having the key subPropertyOf or equivalentProperty, for each such value equiv, generate the following triple:
        subject: subject
        predicate: equiv
        object: value
  7. Return subject
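
The steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not a conforming processor: items are modelled as plain dictionaries with optional `id`, `types`, and `properties` keys (a hypothetical data model standing in for the DOM), string values are treated as literals, untyped items inherit the vocabulary passed in, and names with no vocabulary to resolve against are skipped.

```python
import itertools

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


class Converter:
    def __init__(self, registry):
        self.registry = registry
        self.memory = {}                    # id(item) -> subject
        self.triples = set()
        self._counter = itertools.count()   # blank node labels

    def vocab_for(self, item, current_vocab):
        types = item.get("types", [])
        if not types:
            return current_vocab            # assumption: untyped items inherit
        vocab = types[0]                    # step 4: vocabulary identifier
        for prefix in self.registry:        # step 5: registry prefix match
            if vocab.startswith(prefix):
                return prefix
        # No match: strip fragment content or last path segment.
        cut = vocab.index("#") if "#" in vocab else vocab.rindex("/")
        return vocab[: cut + 1]

    def predicate_for(self, vocab, name):
        if "://" in name:                   # already an absolute URL
            return name
        if vocab is None:
            return None
        separator = "" if vocab.endswith(("/", "#")) else "#"
        return vocab + separator + name

    def generate(self, item, current_vocab=None):
        if id(item) in self.memory:                          # step 1
            subject = self.memory[id(item)]
        elif item.get("id"):
            subject = item["id"]
        else:
            subject = f"_:b{next(self._counter)}"
        self.memory[id(item)] = subject                      # step 2
        for item_type in item.get("types", []):              # step 3
            self.triples.add((subject, RDF_TYPE, item_type))
        vocab = self.vocab_for(item, current_vocab)          # steps 4 and 5
        rules = self.registry.get(vocab, {}).get("properties", {})
        for name, value in item.get("properties", []):       # step 6
            predicate = self.predicate_for(vocab, name)
            if predicate is None:
                continue
            if isinstance(value, dict):                      # nested item
                value = self.generate(value, vocab)
            self.triples.add((subject, predicate, value))
            rule = rules.get(name, {})                       # registry expansion
            for key in ("subPropertyOf", "equivalentProperty"):
                targets = rule.get(key, [])
                if isinstance(targets, str):
                    targets = [targets]
                for equiv in targets:
                    self.triples.add((subject, equiv, value))
        return subject                                       # step 7
```

A caller would create one `Converter` per document and call `generate(item)` for each top-level item, leaving `current_vocab` as `None`. Storing triples in a set means that an item reached twice contributes its triples only once.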

Reverse itemprop

The WebSchemas community has proposed the use of a new Microdata attribute: itemprop-reverse. Although not present in [[MICRODATA]] at this time, the attribute can be very useful in many markup examples where items are related using the reverse of a common property; this saves creating new properties which exist solely for the purpose of describing such reverse relationships. Evidence for the utility of such a feature can be seen in the RDFa rev attribute [[RDFA-CORE]] and the JSON-LD reverse property [[JSON-LD]].

See issue 5 for further reference.

This feature adds the following attribute:

itemprop-reverse
An attribute used to identify one or more names of an item, reversing the sense of itemprop. An itemprop-reverse contains a space-separated list of names, which may either be absolute URLs or terms associated with the type of the item as defined by the referencing item's item type.

The definition of top-level item is updated to also exclude items having itemprop-reverse.

The Algorithm is extended accordingly:

Algorithm Terms

reverse properties
The mechanism for finding the reverse properties of an item. The list of reverse properties is obtained using a variation of the Microdata properties of an item method, replacing itemprop with itemprop-reverse.
reverse property names
The tokens of an element's itemprop-reverse attribute. Each token is a name.

Generate the triples

The Triples generation algorithm is extended with the following step to take place immediately after Step 6:

  1. For each element which has reverse property names and is one of the reverse properties of the item item, run the following substep:
    1. For each predicate in the element's reverse property names, run the following substeps:
      1. Let value be the property value of element.
      2. If value is an item, then generate the triples for value using vocab for the current vocabulary. Replace value by the subject returned from those steps.
      3. Otherwise, if value is a literal, ignore the value and continue to the next name; it is an error for the value of itemprop-reverse to be a literal.
      4. Generate the following triple:
        subject: value
        predicate: predicate
        object: subject
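
The reversed triple and the literal check can be sketched as below. The `Literal` wrapper and the function name `reverse_triple` are assumptions made for this example; URIs and blank nodes are represented as plain strings.

```python
from typing import NamedTuple, Optional, Tuple, Union


class Literal(NamedTuple):
    """Marks a value as a literal rather than a URI or blank node."""
    text: str


def reverse_triple(
    subject: str, predicate: str, value: Union[str, Literal]
) -> Optional[Tuple[str, str, str]]:
    """Return the triple for an itemprop-reverse value, or None for a literal.

    The item's subject becomes the object, and the value becomes the subject.
    Literal values are an error and are ignored.
    """
    if isinstance(value, Literal):
        return None
    return (value, predicate, subject)
```

Returning `None` lets the caller simply continue to the next name when a literal is encountered.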

Simple use of itemprop-reverse:


Results in the following Turtle:


Testing

A test suite [[MICRODATA-RDF-TESTS]] is under development to help processor developers verify conformance to this specification.

Markup Examples

The microdata example below expresses book information as an FRBR Work item.


Assuming that the registry contains an entry for http://purl.org/vocab/frbr/core#, this is equivalent to the following Turtle:


The following snippet of HTML has microdata for two people with the same address. This illustrates two items referencing a third item, and how only a single RDF resource definition is created for that third item.


Assuming that the registry contains an entry for http://microformats.org/profile/hcard, it generates these triples expressed in Turtle:


The following snippet of HTML has microdata for a playlist and illustrates the use of the schema:additionalType property to relate recordings to the Music Ontology:


Assuming that the registry contains an entry for http://schema.org/, it generates these triples expressed in Turtle:


Default registry

The following is the default registry in JSON format, as of the time of publication.

  {
    "http://schema.org/": {
      "properties": {
        "additionalType": {"subPropertyOf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"}
      }
    },
    "http://microformats.org/profile/hcard": {}
  }
  

Changes since the Second Edition of 16 December 2014

Changes to reflect recent updates to [[MICRODATA]]:

Acknowledgements

Thanks to Richard Cyganiak for property URI and vocabulary terminology and for his generally excellent consideration of practical problems in generating RDF from microdata.