HTML Microdata [[!MICRODATA]] is an extension to HTML used to embed machine-readable data into HTML documents. Whereas the microdata specification describes a means of markup, the output format is JSON. This specification describes processing rules that may be used to extract RDF [[!RDF11-CONCEPTS]] from an HTML document containing microdata.

Introduction

This document describes a means of transforming HTML containing microdata into RDF. HTML Microdata [[!MICRODATA]] is an extension to HTML used to embed machine-readable data in HTML documents. This specification describes transformation directly to RDF [[RDF11-CONCEPTS]].

There are a variety of ways in which a mapping from microdata to RDF might be configured to give a result that is closer to the required result for a particular vocabulary. This specification defines terms that can be used as hooks for vocabulary-specific behavior, which could be defined within a registry or on an implementation-defined basis.

For background on the trade-offs between these options, see https://www.w3.org/wiki/Mapping_Microdata_to_RDF and GitHub Issues.

The current version of [[!MICRODATA]] does not generate URIs for properties of un-typed items, and consequently, the mechanism for generating property URIs for itemprop tokens using the document URI has been removed.

Background

Microdata [[!MICRODATA]] is a way of embedding data in HTML documents using attributes. The Microdata specification defines how to generate a JSON representation from microdata markup.

Mapping microdata to RDF enables consumers to merge data expressed in other RDF-based formats with microdata. It facilitates the use of RDF vocabularies within microdata, and enables microdata to be used with the full RDF toolchain. Some use cases for this mapping are described in Section 1.2 below.

Microdata's data model does not always align neatly with RDF.

When an item has multiple properties with the same name, the values are always ordered; in RDF, property values are unordered unless they are explicitly listed in an RDF Collection.
Except for some specific element values, a value in microdata is always a simple string which is interpreted by the consuming application. In RDF, values can be tagged with a datatype or a language. According to the microdata specification, the HTML context of microdata markup should not change how microdata is interpreted, so although element names and HTML lang attributes could be used to provide datatype and language information for RDF data, this would be contrary to the microdata specification.

This specification allows for vocabulary-specific rules that affect the generation of property URIs and value serializations. This is facilitated by a registry that associates URIs with specific rules: itemtype values are matched against registered URI prefixes to determine a vocabulary and, potentially, vocabulary-specific processing rules.

This specification also assumes that consumers of RDF generated from microdata may have to process the results in order to, for example, assign appropriate datatypes to property values.

Issues

Decisions or open issues in the specification are tracked on the GitHub Issue Tracker. These include the following:

Experimental Feature

Experimental support for itemprop-reverse. This attribute is not part of [[MICRODATA]] and is included as an experimental feature. Specific feedback from the community is requested. Based on adoption, the attribute may be considered for inclusion in forthcoming versions of [[MICRODATA]] and this note.

Vocabulary Registry

In a perfect world, all processors would be able to generate the same output for a given input without regard to the requirements of a particular vocabulary. However, microdata doesn't provide sufficient syntactic help in making these decisions. Different vocabularies have different needs.

The registry is located at the namespace defined for microdata, http://www.w3.org/ns/md, in a variety of formats. Under control of a runtime option, a processor should use another registry, provided by reference, to affect processing.

The registry associates a URI prefix with one or more key-value pairs denoting processor behavior. A hypothetical JSON representation of such a registry might be the following:

This structure associates mappings for a single URI: http://schema.org/. Items having an item type with a URI prefix from this registry use the rules described for that prefix within the scope of that item type. For http://schema.org/, this mapping currently defines a single property: additionalType with a value to indicate specific behavior. It also allows overrides on a per-property basis; the item properties key associates an individual name with overrides for default behavior. The interpretation of these rules is defined in the following sections. If an item has no vocabulary identifier or the registry contains no URI prefix matching the vocabulary identifier, a conforming processor MUST use the default values defined for these rules.
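As a rough illustration of the structure described above, the registry could be held in memory as a mapping from URI prefix to rules. This is a minimal Python sketch, not the published registry: the `properties` key name and the `lookup_rules` helper are assumptions made for illustration.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical registry: URI prefix -> rules for that vocabulary.
# The "properties" key name is an assumption made for this sketch.
REGISTRY = {
    "http://schema.org/": {
        "properties": {
            "additionalType": {"subPropertyOf": RDF_TYPE},
        },
    },
}


def lookup_rules(vocabulary_identifier, registry=REGISTRY):
    """Return (prefix, rules) for the registry prefix that matches the
    vocabulary identifier, or (None, {}) so callers use default behavior."""
    if vocabulary_identifier is not None:
        for prefix, rules in registry.items():
            if vocabulary_identifier.startswith(prefix):
                return prefix, rules
    return None, {}
```

Calling `lookup_rules("http://schema.org/Person")` selects the `http://schema.org/` entry, while an item type with no registered prefix yields an empty rule set.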

Property URI Generation

Property URI generation is described in § 5.3 Properties of [[!MICRODATA]] with the following modification:

If the element is a typed item, a URI is formed using the current vocabulary.
Resulting properties which are not absolute URLs are discarded.

Consider the following example:

Given the URI prefix http://microformats.org/profile/hcard, this would generate http://microformats.org/profile/hcard#n and http://microformats.org/profile/hcard#given-name. Note that the '#' is automatically added as a separator.

Looking at another example:

Given the URI prefix http://schema.org/, this would generate http://schema.org/name. Note that if the itemtype were http://schema.org/Person/Teacher, this would generate the same property URI.

If the registry contains no match for current vocabulary, implementations MUST act as if there is a URI prefix made from the first itemtype value by stripping either the fragment content or, if the value has no fragment, the last path segment (see [[!RFC3986]]).

The vocabulary URI prefix is made from the first itemtype value by stripping either the fragment content or last path segment, if the value has no fragment (See [[!RFC3986]]).

In this example, assuming no matching entry in the registry, the URI prefix is constructed by removing the last path segment, leaving the URI http://example.org/. The resulting property URI would be http://example.org/title.
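The prefix and property URI rules in this section can be sketched in Python as follows. This is an illustrative reading rather than a normative implementation: the function names are hypothetical, the absolute-URL test is deliberately crude, URLs without a path are not handled, and the rule "insert '#' unless the prefix already ends in '/' or '#'" is an assumption inferred from the hcard and schema.org results described above.

```python
def default_prefix(itemtype):
    """Strip the fragment content or, if there is none, the last path segment."""
    if "#" in itemtype:
        return itemtype[: itemtype.index("#") + 1]
    return itemtype[: itemtype.rfind("/") + 1]


def vocabulary_prefix(itemtype, registry_prefixes=()):
    """Use a matching registry prefix if there is one, else the default."""
    for prefix in registry_prefixes:
        if itemtype.startswith(prefix):
            return prefix
    return default_prefix(itemtype)


def property_uri(name, itemtype, registry_prefixes=()):
    """Return the property URI for an itemprop token, or None if discarded."""
    if "://" in name:  # crude absolute-URL test (assumption)
        return name
    if itemtype is None:  # un-typed item: non-absolute names are discarded
        return None
    prefix = vocabulary_prefix(itemtype, registry_prefixes)
    separator = "" if prefix.endswith(("/", "#")) else "#"
    return prefix + separator + name
```

With the prefixes `http://schema.org/` and `http://microformats.org/profile/hcard` registered, the token `n` on an hcard item maps to `http://microformats.org/profile/hcard#n`, and `name` on `http://schema.org/Person/Teacher` maps to `http://schema.org/name`. For an unregistered type such as the hypothetical `http://example.org/Book`, `title` maps to `http://example.org/title`.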

Value Typing

In microdata, all values are strings. In RDF, values may be resources or may be typed with an appropriate datatype.

In some cases, the type of a microdata value can be determined from the element on which it is specified. In particular:

URL property elements provide URLs
the time element provides dates, times and durations
the data and meter elements provide doubles and integers

Vocabulary Expansion

Microdata requires that all values of itemtype come from the same vocabulary. This is required as itemprop values are resolved relative to that vocabulary. However, it is often useful to define an item to have types from multiple different vocabularies.

Vocabulary expansion uses simple rules to generate additional triples based on rules and property relationships described in the registry. Within the registry, a property definition may have either equivalentProperty or subPropertyOf keys having an IRI value (or array of IRI values) of the associated property. Such an entry causes the processor to generate triples associating the source property IRI with the target property IRI using either rdfs:subPropertyOf or owl:equivalentProperty predicates.

For example, the registry definition for the additionalType property within schema.org defines additionalType to have an rdfs:subPropertyOf relationship with rdf:type.

The previous example indicates a registry rule which causes the processor to emit an extra triple when first seeing the additionalType itemprop:

After performing vocabulary expansion, an additional rdf:type triple is generated:

The owl:equivalentProperty rule is more powerful than rdfs:subPropertyOf, in that if any equivalent property matches, then the source property would also cause a triple to be generated. For example, if the registry stated that name was equivalent to rdfs:label, then any use of name in an itemprop would cause a triple using rdfs:label to be emitted, as with rdfs:subPropertyOf. However, logically, any use of label where the current vocabulary were rdfs: could also cause a triple using schema:name to be emitted. To simplify processing, this specification requires that all values of an owl:equivalentProperty registry entry have their own rules with those values as keys within the property section of their respective vocabularies.
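The effect of expansion on the generated data can be sketched in Python. The sketch below applies rules directly to triples, in the way the conversion algorithm emits an extra triple per registry entry; keying the rules by full property IRI and the name `expand` are simplifications assumed for illustration, and the subject and object values are made up.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
SCHEMA = "http://schema.org/"

# Hypothetical rules, keyed by full property IRI for simplicity.
PROPERTY_RULES = {
    SCHEMA + "additionalType": {"subPropertyOf": RDF_TYPE},
}


def expand(triples, rules=PROPERTY_RULES):
    """Return the input triples followed by the triples the rules add."""
    expanded = list(triples)
    for subject, predicate, obj in triples:
        entry = rules.get(predicate, {})
        for key in ("subPropertyOf", "equivalentProperty"):
            targets = entry.get(key, [])
            if isinstance(targets, str):  # a single IRI rather than an array
                targets = [targets]
            for target in targets:
                expanded.append((subject, target, obj))
    return expanded
```

An input triple using `http://schema.org/additionalType` is kept, and a second triple with the same subject and object is added using `rdf:type` as the predicate.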

Algorithm

Transformation of Microdata to RDF makes use of general processing rules described in [[!MICRODATA]] for the treatment of items.

Algorithm Terms

absolute URL

The term absolute URL is defined in [[!MICRODATA]].

blank node

A blank node is a node in a graph that is neither a URI nor a literal. Items without a global identifier have a blank node allocated to them. (See blank node in [[RDF11-CONCEPTS]]).

current vocabulary

an absolute URL for the current vocabulary, from the registry.

document base

The base address of the document being processed, as defined in Establishing a Base URI in [[!RFC3986]].

item

An item is described by an element containing an itemscope attribute. A top-level microdata item is an item that does not have an itemprop attribute. (See item and top-level microdata item in [[!MICRODATA]]).

item properties

The mechanism for finding the properties of an item as described in § 5.5 Associating names with items of [[!MICRODATA]]. (See item properties in [[!MICRODATA]]).

global identifier

The value of an item's itemid attribute, if it has one, resolved relative to the element on which the attribute is specified. (See global identifier in [[!MICRODATA]]).

literal

Literals are values such as strings and dates. These include typed literal, language-tagged strings and simple literals, as defined in [[RDF11-CONCEPTS]].

memory

A mapping of items to subjects, initially empty.

object

An object is a URI, blank node or literal. See object in [[RDF11-CONCEPTS]].

predicate

A predicate is a URI. See predicate in [[RDF11-CONCEPTS]].

property

Each name identifies a property of an item. An item may have multiple elements sharing the same name, creating a multi-valued property. (See property in [[!MICRODATA]]).

property names

The tokens of an element's itemprop attribute. The property names algorithm in [[!MICRODATA]] is modified as follows:

If the element is a typed item, a URI is formed using the current vocabulary.
Resulting properties which are not absolute URLs are discarded.

Each property is a URI. (See property names in [[!MICRODATA]]).

property value

The value of a property of an item is determined as described in § 5.4 Values: the content attribute of [[!MICRODATA]]. This specification extends the algorithm to account for URI, datatypes, and language, starting with the original value returned from that algorithm and the context of the element containing the property in the DOM.

If the value is an item (by having an itemscope attribute)

The value is the URI or blank node returned from generate the triples for that item.

If element is a URL property element

The value is a URI created from the value specified by the Microdata algorithm.

If element is a data or meter element.

The value is a literal.

If the value has the lexical form of xsd:integer [[!XMLSCHEMA11-2]]: The value is a typed literal composed of the value and http://www.w3.org/2001/XMLSchema#integer.
If the value has the lexical form of xsd:double [[!XMLSCHEMA11-2]]: The value is a typed literal composed of the value and http://www.w3.org/2001/XMLSchema#double.
Otherwise: The value is a simple literal.

If element is a time element.

The value is a literal.

If the value has the lexical form of xsd:date [[!XMLSCHEMA11-2]]: The value is a typed literal composed of the value and http://www.w3.org/2001/XMLSchema#date.
If the value has the lexical form of xsd:time [[!XMLSCHEMA11-2]]: The value is a typed literal composed of the value and http://www.w3.org/2001/XMLSchema#time.
If the value has the lexical form of xsd:dateTime [[!XMLSCHEMA11-2]]: The value is a typed literal composed of the value and http://www.w3.org/2001/XMLSchema#dateTime.
If the value has the lexical form of xsd:gYearMonth [[!XMLSCHEMA11-2]]: The value is a typed literal composed of the value and http://www.w3.org/2001/XMLSchema#gYearMonth.
If the value has the lexical form of xsd:gYear [[!XMLSCHEMA11-2]]: The value is a typed literal composed of the value and http://www.w3.org/2001/XMLSchema#gYear.
If the value has the lexical form of xsd:duration [[!XMLSCHEMA11-2]]: The value is a typed literal composed of the value and http://www.w3.org/2001/XMLSchema#duration.
Otherwise: If element has a language (as described in § 3.2.5.2 The lang and xml:lang attributes of [[!HTML52]]) the value is a language-tagged string created using the value with the language of element. Otherwise, the value is a simple literal.

The HTML valid yearless date string is similar to xsd:gMonthDay, but the lexical forms differ, so it is not included in this conversion.

See § 2.4.5. Dates and times in [[!HTML52]].

Otherwise

If element has a language (as described in § 3.2.5.2 The lang and xml:lang attributes of [[!HTML52]]) the value is a language-tagged string created using the value with the language of element. Otherwise, the value is a simple literal.

See § 3.2.5.2 The lang and xml:lang attributes in [[!HTML52]] for determining the language of a node.
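The datatype selection for data, meter and time elements can be sketched in Python. The regular expressions below are simplified approximations of the XML Schema lexical forms (for example, they do not range-check months or hours), and the function name and the `(value, datatype, language)` tuple shape are assumptions made for illustration. Items and URL property elements are not covered.

```python
import re

XSD = "http://www.w3.org/2001/XMLSchema#"
TZ = r"(?:Z|[+-]\d{2}:\d{2})?"
DATE = r"-?\d{4,}-\d{2}-\d{2}"
TIME = r"\d{2}:\d{2}:\d{2}(?:\.\d+)?"

# Checked in order, so "42" is an integer rather than a double.
NUMERIC_FORMS = [
    ("integer", r"[+-]?\d+"),
    ("double", r"[+-]?(?:\d+(?:\.\d*)?|\.\d+)(?:[eE][+-]?\d+)?|[+-]?INF|NaN"),
]
TEMPORAL_FORMS = [
    ("date", DATE + TZ),
    ("time", TIME + TZ),
    ("dateTime", DATE + "T" + TIME + TZ),
    ("gYearMonth", r"-?\d{4,}-\d{2}" + TZ),
    ("gYear", r"-?\d{4,}" + TZ),
    ("duration", r"-?P(?!$)(?:\d+Y)?(?:\d+M)?(?:\d+D)?"
                 r"(?:T(?=\d)(?:\d+H)?(?:\d+M)?(?:\d+(?:\.\d+)?S)?)?"),
]


def typed_value(element_name, value, language=None):
    """Return (lexical value, datatype IRI or None, language or None)."""
    if element_name in ("data", "meter"):
        forms, fallback_language = NUMERIC_FORMS, None
    elif element_name == "time":
        forms, fallback_language = TEMPORAL_FORMS, language
    else:
        forms, fallback_language = [], language
    for local_name, pattern in forms:
        if re.fullmatch(pattern, value):
            return (value, XSD + local_name, None)
    # No datatype matched: a simple literal, or language-tagged where allowed.
    return (value, None, fallback_language)
```

A yearless date such as `03-18` in a time element matches none of the patterns, so it falls through to a language-tagged string or simple literal, consistent with the note on xsd:gMonthDay.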

subject

A subject is a URI or blank node. See subject in [[RDF11-CONCEPTS]].

top-level item

An item which does not contain an itemprop attribute. (See top-level microdata item in [[!MICRODATA]]).

typed item

An item is said to be a typed item when either it has an item type, or it is the value of a property of a typed item. The relevant types for a typed item are the item's item types, if it has any, or else the relevant types of the item for which it is a property's value. (See typed item in [[!MICRODATA]]).

URI

URIs are suitable to be used in subject, predicate or object positions within an RDF triple, as opposed to a literal value that may contain a string representation of a URI. (See [[RDF11-CONCEPTS]]).

vocabulary

A vocabulary is a collection of URIs, suitable for use as an itemtype or itemprop value, that share a common URI prefix. That prefix is the vocabulary URI. A vocabulary URI is not allowed to be a prefix of another vocabulary URI.

This definition differs from the language in the HTML spec and is just for the purpose of this document. In HTML, a vocabulary is a specification, and doesn't have a URI. In our view, if one specification defines ten itemtypes, then these could be treated as one vocabulary or as ten distinct vocabularies; it is entirely up to the vocabulary creator.

vocabulary identifier

The first URI from item types. (See vocabulary identifier in [[!MICRODATA]]).

RDF Conversion Algorithm

An HTML document containing microdata MAY be converted to any other RDF-compatible document format using the algorithm specified in this section.

A conforming microdata processor implementing RDF conversion MUST implement a processing algorithm that results in the equivalent triples to those that the following algorithm generates:

Create memory as an empty map.
For each element that is also a top-level item, generate the triples for that item using null for current vocabulary.

Generate the triples

When the user agent is to Generate triples for an item item, given current vocabulary, it must run the following steps:

If there is an entry for item in memory, then let subject be the subject of that entry. Otherwise, if item has a global identifier and that global identifier is an absolute URL, let subject be that global identifier. Otherwise, let subject be a new blank node.
Add a mapping from item to subject in memory
For each type which is an item type of the item:
1. Generate the following triple:
  subject: subject
  predicate: http://www.w3.org/1999/02/22-rdf-syntax-ns#type
  object: type (as a URI)
Set vocab to the vocabulary identifier for the item, if any.
If the registry contains a URI prefix that is a character for character match of vocab up to the length of the URI prefix, set vocab as that URI prefix.
For each element element which is one of the item properties of item, run the following substep:
1. For each predicate in element's property names, run the following substeps:
  1. Let value be the property value of element.
  2. If value is an item, then generate the triples for value using vocab for the current vocabulary. Replace value by the subject returned from those steps.
  3. Generate the following triple:
    subject: subject
    predicate: predicate
    object: value
  4. If an entry exists in the registry for predicate in the vocabulary associated with vocab having the key subPropertyOf or equivalentProperty, for each such value equiv, generate the following triple:
    subject: subject
    predicate: equiv
    object: value
Return subject
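The steps above can be sketched in Python over an in-memory item model rather than a parsed DOM. This is a non-normative sketch under several assumptions: the `Item` class, the registry contents and its `properties` key are hypothetical; property values are assumed to be already-computed RDF terms or nested items; un-typed nested items are assumed to inherit the caller's vocabulary; the absolute-URL test is crude; and cyclic item references are not handled.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Optional

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical registry contents, for illustration only.
REGISTRY = {
    "http://schema.org/": {
        "properties": {"additionalType": {"subPropertyOf": RDF_TYPE}}
    }
}


@dataclass(eq=False)  # identity-based hashing, so items can be keys in memory
class Item:
    types: list = field(default_factory=list)
    itemid: Optional[str] = None
    properties: list = field(default_factory=list)  # (name, value) pairs


def resolve_vocab(vocab):
    """Registry prefix if one matches, else strip fragment or last segment."""
    for prefix in REGISTRY:
        if vocab.startswith(prefix):
            return prefix
    cut = vocab.index("#") if "#" in vocab else vocab.rfind("/")
    return vocab[: cut + 1]


def generate_triples(item, memory, triples, blank_ids, current_vocabulary=None):
    # Choose the subject: remembered, global identifier, or new blank node.
    if item in memory:
        subject = memory[item]
    elif item.itemid and "://" in item.itemid:  # crude absolute-URL test
        subject = item.itemid
    else:
        subject = f"_:b{next(blank_ids)}"
    memory[item] = subject

    for item_type in item.types:
        triples.append((subject, RDF_TYPE, item_type))

    # Un-typed nested items inherit the caller's vocabulary (an assumption).
    vocab = item.types[0] if item.types else current_vocabulary
    if vocab is not None:
        vocab = resolve_vocab(vocab)

    for name, value in item.properties:
        if "://" in name:
            predicate = name
        elif vocab is None:
            continue  # non-absolute names on un-typed items are discarded
        else:
            predicate = vocab + ("" if vocab.endswith(("/", "#")) else "#") + name
        if isinstance(value, Item):
            value = generate_triples(value, memory, triples, blank_ids, vocab)
        triples.append((subject, predicate, value))
        # Vocabulary expansion: one extra triple per registered target.
        rules = REGISTRY.get(vocab, {}).get("properties", {}).get(name, {})
        for key in ("subPropertyOf", "equivalentProperty"):
            equivs = rules.get(key, [])
            for equiv in [equivs] if isinstance(equivs, str) else equivs:
                triples.append((subject, equiv, value))
    return subject
```

A caller would create an empty `memory` dict and `triples` list, then call `generate_triples(item, memory, triples, count())` once per top-level item, leaving `current_vocabulary` as `None`.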

Reverse itemprop

The WebSchemas community has proposed the use of a new Microdata attribute: itemprop-reverse. Although not present in [[MICRODATA]] at this time, the attribute can be very useful in many markup examples where items are related using the reverse of a common property; this saves creating new properties which exist solely for the purpose of describing such reverse relationships. Evidence for the utility of such a feature can be seen in the RDFa rev attribute [[RDFA-CORE]] and the JSON-LD reverse property [[JSON-LD]].

See issue 5 for further reference.

This feature adds the following attribute:

itemprop-reverse: An attribute used to identify one or more names of an item, reversing the sense of itemprop. An itemprop-reverse contains a space-separated list of names which may either be absolute URLs or terms associated with the type of the item as defined by the referencing item's item type.

The definition of top-level item is updated to also exclude items having itemprop-reverse.

The Algorithm is extended accordingly:

Algorithm Terms

reverse properties: The mechanism for finding the reverse properties of an item. The list of reverse properties is obtained using a variation of the Microdata properties of an item method, replacing itemprop with itemprop-reverse.
reverse property names: The tokens of an element's itemprop-reverse attribute. Each token is a name.

Generate the triples

The Triples generation algorithm is extended with the following step to take place immediately after Step 9:

For each element element which has reverse property names and is one of the reverse properties of the item item, run the following substep:
1. For each predicate in element's reverse property names, run the following substeps:
  1. Let value be the property value of element.
  2. If value is an item, then generate the triples for value using vocab for the current vocabulary. Replace value by the subject returned from those steps.
  3. Otherwise, if value is a literal ignore the value and continue to the next name; it is an error for the value of itemprop-reverse to be a literal.
  4. Generate the following triple:
    subject: value
    predicate: predicate
    object: subject
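The reversed triple and the literal check can be sketched in Python. The `Literal` class and `reverse_triples` function are hypothetical names introduced for illustration, and URIs and blank nodes are represented as plain strings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Literal:
    """Marks a value as an RDF literal rather than a URI or blank node."""
    lexical: str


def reverse_triples(subject, predicates, value):
    """Triples for one itemprop-reverse element; literals yield nothing."""
    if isinstance(value, Literal):
        return []  # it is an error for a reverse value to be a literal
    # The value becomes the subject and the current item becomes the object.
    return [(value, predicate, subject) for predicate in predicates]
```

Compared with the forward case, only the subject and object positions are swapped; the predicate is generated in the same way.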

Simple use of itemprop-reverse:

Results in the following Turtle:


dc:	http://purl.org/dc/terms/
md:	http://www.w3.org/ns/md#
rdf:	http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs:	http://www.w3.org/2000/01/rdf-schema#
rdfa:	http://www.w3.org/ns/rdfa#
xsd:	http://www.w3.org/2001/XMLSchema#