HTML Microdata [[!MICRODATA]] is an extension to HTML used to embed machine-readable data into HTML documents. Whereas the microdata specification describes a means of markup, the output format is JSON. This specification describes processing rules that may be used to extract RDF [[!RDF11-CONCEPTS]] from an HTML document containing microdata.
The concepts described herein are intended to provide guidance for a possible future Working Group chartered to provide a Recommendation for this transformation. As a consequence, implementers of this specification, either producers or consumers, should note that it may change prior to any possible publication as a Recommendation.
This document is an update of the W3C Interest Group Note, published in December 2014. This aligns with the 2017 update to [[!MICRODATA]].
As the Semantic Web Interest Group has expired, a new home will need to be found for this Note.
This document describes a means of transforming HTML containing microdata into RDF. HTML Microdata [[!MICRODATA]] is an extension to HTML used to embed machine-readable data into HTML documents. This specification describes transformation directly to RDF [[RDF11-CONCEPTS]].
There are a variety of ways in which a mapping from microdata to RDF might be configured to give a result that is closer to the required result for a particular vocabulary. This specification defines terms that can be used as hooks for vocabulary-specific behavior, which could be defined within a registry or on an implementation-defined basis.
For background on the trade-offs between these options, see https://www.w3.org/wiki/Mapping_Microdata_to_RDF and GitHub Issues.
The current version of [[!MICRODATA]] does not generate URIs for properties of un-typed items, and consequently, the mechanism for generating property URIs for itemprop tokens using the document URI has been removed.
Microdata [[!MICRODATA]] is a way of embedding data in HTML documents using attributes. The Microdata specification defines how to generate a JSON representation from microdata markup.
Mapping microdata to RDF enables consumers to merge data expressed in other RDF-based formats with microdata. It facilitates the use of RDF vocabularies within microdata, and enables microdata to be used with the full RDF toolchain. Some use cases for this mapping are described in Section 1.2 below.
Microdata's data model does not always align neatly with RDF.
Although lang attributes could be used to provide datatype and language information for RDF data, this would be contrary to the microdata specification.
This specification allows for vocabulary-specific rules that affect the generation of property URIs and value serializations. This is facilitated by a registry that associates URIs with specific rules, based on matching itemtype values against registered URI prefixes to determine a vocabulary and potentially vocabulary-specific processing rules.
This specification also assumes that consumers of RDF generated from microdata may have to process the results in order to, for example, assign appropriate datatypes to property values.
Decisions or open issues in the specification are tracked on the GitHub Issue Tracker. These include the following:
Experimental support for itemprop-reverse. This attribute is not part of [[MICRODATA]] and is included as an experimental feature. Specific feedback from the community is requested. Based on adoption, the attribute may be considered for inclusion in forthcoming versions of [[MICRODATA]] and this note.
The Microdata specification [[!MICRODATA]] defines a number of attributes and the way in which those attributes are to be interpreted.
For reference, attributes used for specifying and retrieving HTML microdata are referenced here:
itemtype contains multiple values, as defined in items. (See itemtype and item types in [[!MICRODATA]].)
In RDF, it is common for people to shorten vocabulary terms via abbreviated URIs that use a 'prefix' and a 'reference'. Examples throughout this document assume that the following vocabulary prefixes have been defined:
dc: | http://purl.org/dc/terms/ |
md: | http://www.w3.org/ns/md# |
rdf: | http://www.w3.org/1999/02/22-rdf-syntax-ns# |
rdfa: | http://www.w3.org/ns/rdfa# |
xsd: | http://www.w3.org/2001/XMLSchema# |
In a perfect world, all processors would be able to generate the same output for a given input without regard to the requirements of a particular vocabulary. However, microdata doesn't provide sufficient syntactic help in making these decisions, and different vocabularies have different needs.
The registry is located at the namespace defined for microdata: http://www.w3.org/ns/md in a variety of formats. Under control of a runtime option, a processor should use this registry, or another provided by reference, to affect processing.
The registry associates a URI prefix with one or more key-value pairs denoting processor behavior. A hypothetical JSON representation of such a registry might be the following:
This structure associates mappings for a single URI: http://schema.org/. Items having an item type with a URI prefix from this registry use the rules described for that prefix within the scope of that item type. For http://schema.org/, this mapping currently defines a single property: additionalType with a value to indicate specific behavior. It also allows overrides on a per-property basis; the properties key associates an individual name with overrides for default behavior. The interpretation of these rules is defined in the following sections. If an item has no vocabulary identifier or the registry contains no URI prefix matching the vocabulary identifier, a conforming processor MUST use the default values defined for these rules.
Property URI generation is described in § 5.3 Properties of [[!MICRODATA]] with the following modification:
Consider the following example:
Given the URI prefix http://microformats.org/profile/hcard, this would generate http://microformats.org/profile/hcard#n and http://microformats.org/profile/hcard#given-name. Note that the '#' is automatically added as a separator.
Looking at another example:
Given the URI prefix http://schema.org/, this would generate http://schema.org/name. Note that if the itemtype were http://schema.org/Person/Teacher, this would generate the same property URI.
If the registry contains no match for current vocabulary, implementations MUST act as if there is a URI prefix made from the first itemtype value by stripping either the fragment content or, if the value has no fragment, the last path segment (see [[!RFC3986]]).
The vocabulary URI prefix is made from the first itemtype value by stripping either the fragment content or last path segment, if the value has no fragment (See [[!RFC3986]]).
In this example, assuming no matching entry in the registry, the URI prefix is constructed by removing the last path segment, leaving the URI http://example.org/. The resulting property URI would be http://example.org/title.
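The prefix selection and property URI rules above can be illustrated with a minimal Python sketch. It is not a normative algorithm: the function names, the in-memory REGISTRY dictionary, the longest-match choice among registered prefixes, and the item type http://example.org/Book are assumptions made for illustration. The rule that '#' is added only when the prefix ends in neither '/' nor '#' is likewise an assumption inferred from the examples above.

```python
# Hypothetical in-memory copy of the registry; only the prefixes matter here.
REGISTRY = {
    "http://schema.org/": {},
    "http://microformats.org/profile/hcard": {},
}


def vocabulary_prefix(item_type, registry=REGISTRY):
    """Return the URI prefix used to build property URIs for item_type."""
    # Prefer a registered prefix; choosing the longest match is an assumption.
    matches = [prefix for prefix in registry if item_type.startswith(prefix)]
    if matches:
        return max(matches, key=len)
    # Fallback: strip the fragment content, keeping the '#'.
    if "#" in item_type:
        return item_type.split("#", 1)[0] + "#"
    # No fragment: strip the last path segment, keeping the trailing '/'.
    return item_type.rsplit("/", 1)[0] + "/"


def property_uri(item_type, name, registry=REGISTRY):
    """Build the property URI for an itemprop token that is not already a URI."""
    prefix = vocabulary_prefix(item_type, registry)
    # Assumption: '#' is added only if the prefix ends in neither '/' nor '#'.
    separator = "" if prefix.endswith(("/", "#")) else "#"
    return prefix + separator + name


if __name__ == "__main__":
    print(property_uri("http://microformats.org/profile/hcard", "given-name"))
    print(property_uri("http://schema.org/Person/Teacher", "name"))
    print(property_uri("http://example.org/Book", "title"))
```

The sketch does not handle item types without a path (for example a bare authority), which a real processor would need to consider.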
In microdata, all values are strings. In RDF, values may be resources or may be typed with an appropriate datatype.
In some cases, the type of a microdata value can be determined from the element on which it is specified. In particular:
The time element provides dates, times and durations.
The data and meter elements provide doubles and integers.
Microdata requires that all values of itemtype come from the same vocabulary. This is required as itemprop values are resolved relative to that vocabulary. However, it is often useful to define an item to have types from multiple different vocabularies.
Vocabulary expansion uses simple rules to generate additional triples based on rules and property relationships described in the registry.
Within the registry, a property definition may have either equivalentProperty or subPropertyOf keys having an IRI value (or array of IRI values) of the associated property. Such an entry causes the processor to generate triples associating the source property IRI with the target property IRI using either rdfs:subPropertyOf or owl:equivalentProperty predicates.
For example, the registry definition for the additionalType property within schema.org defines additionalType to have an rdfs:subPropertyOf relationship with rdf:type.
The previous example indicates a registry rule which causes the processor to emit an extra triple when first seeing the additionalType itemprop:
After performing vocabulary expansion, an additional rdf:type triple is generated:
The owl:equivalentProperty rule is more powerful than rdfs:subPropertyOf, in that if any equivalent property matches, then the source property would also cause a triple to be generated.
For example, if the registry stated that name was equivalent to rdfs:label, then any use of name in an itemprop would cause a triple using rdfs:label to be emitted, as with rdfs:subPropertyOf. However, logically, any use of label where the current vocabulary were rdfs: could also cause a triple using schema:name to be emitted. To simplify processing, this specification requires that all values of an owl:equivalentProperty registry entry have their own rules with those values as keys within the properties section of their respective vocabularies.
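As an illustration of the expansion described in this section, the following Python sketch derives the additional triples from registry rules. It is a simplified reading, not the normative procedure: triples are modeled as plain tuples of strings, the function names are hypothetical, the subject "_:item" and the IRI http://example.org/Recording are invented for the example, and source property IRIs are formed by simple concatenation of prefix and name. The sketch only shows the entailed triples, not the triples relating the properties themselves.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Registry content mirroring the schema.org additionalType rule.
REGISTRY = {
    "http://schema.org/": {
        "properties": {
            "additionalType": {"subPropertyOf": RDF_TYPE},
        }
    },
    "http://microformats.org/profile/hcard": {},
}


def as_list(value):
    """Registry values may be a single IRI or an array of IRIs."""
    return value if isinstance(value, list) else [value]


def expansion_rules(registry):
    """Map each source property IRI to the target IRIs named by its rules."""
    rules = {}
    for prefix, entry in registry.items():
        for name, overrides in entry.get("properties", {}).items():
            targets = []
            for key in ("subPropertyOf", "equivalentProperty"):
                if key in overrides:
                    targets.extend(as_list(overrides[key]))
            if targets:
                # Assumption: the prefix already ends in '/' or '#'.
                rules[prefix + name] = targets
    return rules


def expand(triples, registry=REGISTRY):
    """Return the input triples plus one extra triple per matching rule."""
    rules = expansion_rules(registry)
    result = list(triples)
    for subject, predicate, obj in triples:
        for target in rules.get(predicate, []):
            extra = (subject, target, obj)
            if extra not in result:
                result.append(extra)
    return result


if __name__ == "__main__":
    source = [("_:item", "http://schema.org/additionalType",
               "http://example.org/Recording")]
    for triple in expand(source):
        print(triple)
```

Because the sketch only follows rules in the forward direction, an equivalentProperty relationship works in both directions only if each side has its own registry entry, which matches the requirement stated above.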
The external registry may be controlled by the registry option passed to the microdata processor. If specified, the registry must be loaded from the location indicated as the option value. Otherwise, the processor MUST load the default registry from http://www.w3.org/ns/md.
Setting registry is performed in a processor-specific way.
When accessed as a web service using HTTP GET, POST or similar action, processors SHOULD use the registry query parameter. An acceptable value for registry is a URI-encoded URL.
Web service processors SHOULD return the resulting RDF graph using a requested format specified by HTTP Content Negotiation for an acceptable content type. Web service processors MUST support [[!N-TRIPLES]].
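A client request to such a web service might be assembled as in the following Python sketch. Only the registry query parameter and the N-Triples format come from this specification; the service endpoint, the uri parameter naming the input document, the function name, and all example.org URLs are assumptions. The request is built but not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_request(service, document, registry=None):
    """Build a GET request for a hypothetical microdata-to-RDF web service."""
    # 'uri' is a hypothetical parameter name for the document to process.
    params = {"uri": document}
    if registry is not None:
        # urlencode percent-encodes the registry URL for use in the query.
        params["registry"] = registry
    url = service + "?" + urlencode(params)
    # Ask for N-Triples through HTTP content negotiation.
    return Request(url, headers={"Accept": "application/n-triples"})


if __name__ == "__main__":
    request = build_request(
        "http://example.org/distiller",
        "http://example.org/doc.html",
        registry="http://example.org/registry.json",
    )
    print(request.full_url)
    print(request.get_header("Accept"))
```

When registry is omitted, the sketch leaves the parameter out so that the service falls back to the default registry.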
Transformation of Microdata to RDF makes use of general processing rules described in [[!MICRODATA]] for the treatment of items.
content attribute of [[!MICRODATA]].
This specification extends the algorithm to account for URI, datatypes, and language, starting with the original value returned from that algorithm and the context of the element containing the property in the DOM.
If the element is a data or meter element, the value is typed according to its lexical form as one of:
http://www.w3.org/2001/XMLSchema#integer
http://www.w3.org/2001/XMLSchema#double
If the element is a time element, the value is typed according to its lexical form as one of:
http://www.w3.org/2001/XMLSchema#date
http://www.w3.org/2001/XMLSchema#time
http://www.w3.org/2001/XMLSchema#dateTime
http://www.w3.org/2001/XMLSchema#gYearMonth
http://www.w3.org/2001/XMLSchema#gYear
http://www.w3.org/2001/XMLSchema#duration
For a time element whose value matches none of these forms, if the element has a language (see the lang and xml:lang attributes of [[!HTML52]]), the value is a language-tagged string created using the value with the language of the element. Otherwise, the value is a simple literal. See § 2.4.5. Dates and times in [[!HTML52]].
For any other element, if the element has a language (see the lang and xml:lang attributes of [[!HTML52]]), the value is a language-tagged string created using the value with the language of the element. Otherwise, the value is a simple literal.
See § 3.2.5.2 The lang and xml:lang attributes in [[!HTML52]] for determining the language of a node.
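The datatype and language rules above can be sketched in Python as follows. This is an approximation under stated assumptions: the regular expressions are simplified stand-ins for the HTML and XML Schema lexical forms, the function name and its return shape are hypothetical, and the fall-through of non-numeric data and meter values to the language rules is an assumption rather than something stated here.

```python
import re

XSD = "http://www.w3.org/2001/XMLSchema#"

ZONE = r"(Z|[+-]\d{2}:\d{2})?"
CLOCK = r"\d{2}:\d{2}(:\d{2}(\.\d+)?)?"

# Simplified patterns, tried in order; real lexical forms are stricter.
TIME_PATTERNS = [
    ("date", r"-?\d{4,}-\d{2}-\d{2}"),
    ("time", CLOCK + ZONE),
    ("dateTime", r"-?\d{4,}-\d{2}-\d{2}T" + CLOCK + ZONE),
    ("gYearMonth", r"-?\d{4,}-\d{2}"),
    ("gYear", r"-?\d{4,}"),
    ("duration",
     r"-?P(?=.)(\d+Y)?(\d+M)?(\d+D)?(T(?=\d)(\d+H)?(\d+M)?(\d+(\.\d+)?S)?)?"),
]

INTEGER = r"[+-]?\d+"
DOUBLE = r"[+-]?(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?"


def typed_value(element_name, value, language=None):
    """Return (lexical value, datatype IRI or None, language tag or None)."""
    if element_name in ("data", "meter"):
        if re.fullmatch(INTEGER, value):
            return (value, XSD + "integer", None)
        if re.fullmatch(DOUBLE, value):
            return (value, XSD + "double", None)
    elif element_name == "time":
        for local_name, pattern in TIME_PATTERNS:
            if re.fullmatch(pattern, value):
                return (value, XSD + local_name, None)
    # No datatype applies: language-tagged string, else simple literal.
    if language:
        return (value, None, language)
    return (value, None, None)


if __name__ == "__main__":
    print(typed_value("time", "2014-12-16"))
    print(typed_value("meter", "0.75"))
    print(typed_value("span", "Hello", language="en"))
```

A production processor would validate values against the full grammars in [[!HTML52]] rather than these shortened patterns.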
An HTML document containing microdata MAY be converted to any other RDF-compatible document format using the algorithm specified in this section.
A conforming microdata processor implementing RDF conversion MUST implement a processing algorithm that results in the equivalent triples to those that the following algorithm generates:
null for current vocabulary.
When the user agent is to Generate triples for an item item, given current vocabulary, it must run the following steps:
http://www.w3.org/1999/02/22-rdf-syntax-ns#type
subPropertyOf or equivalentProperty, for each such value equiv, generate the following triple:
The WebSchemas community has proposed the use of a new Microdata attribute: itemprop-reverse. Although not present in [[MICRODATA]] at this time, the attribute can be very useful in many markup examples where items are related using the reverse of a common property; this saves creating new properties which exist solely for the purpose of describing such reverse relationships. Evidence for the utility of such a feature can be seen in the RDFa rev attribute [[RDFA-CORE]] and the JSON-LD reverse property [[JSON-LD]].
See issue 5 for further reference.
This feature adds the following attribute:
The definition of top-level item is updated to also exclude items having itemprop-reverse.
The Algorithm is extended accordingly:
properties of an item method, replacing itemprop with itemprop-reverse.
The Triples generation algorithm is extended with the following step to take place immediately after Step 9:
Simple use of itemprop-reverse:
Results in the following Turtle:
A test suite [[MICRODATA-RDF-TESTS]] is under development to help processor developers verify conformance to this specification.
The microdata example below expresses book information as an FRBR Work item.
Assuming that the registry contains an entry for http://purl.org/vocab/frbr/core#, this is equivalent to the following Turtle:
The following snippet of HTML has microdata for two people with the same address. This illustrates two items referencing a third item, and how only a single RDF resource definition is created for that third item.
Assuming that the registry contains an entry for http://microformats.org/profile/hcard, it generates these triples expressed in Turtle:
The following snippet of HTML has microdata for a playlist and illustrates the use of the schema:additionalType property to relate recordings to the Music Ontology:
Assuming that the registry contains an entry for http://schema.org/, it generates these triples expressed in Turtle:
The following is the default registry in JSON format, as of the time of publication.
{
  "http://schema.org/": {
    "properties": {
      "additionalType": {"subPropertyOf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"}
    }
  },
  "http://microformats.org/profile/hcard": {}
}
Changes to reflect recent updates to [[MICRODATA]]:
The content attribute is processed on all HTML elements.
Thanks to Richard Cyganiak for property URI and vocabulary terminology and the general excellent consideration of practical problems in generating RDF from microdata.