Copyright © 2021-2024 the Contributors to the TDM Reservation Protocol (TDMRep) Specification, published by the Text and Data Mining Reservation Protocol Community Group under the W3C Community Final Specification Agreement (FSA). A human-readable summary is available.
This specification defines a simple and practical Web protocol, capable of expressing the reservation of rights relative to text & data mining (TDM) applied to lawfully accessible Web content, and to ease the discovery of TDM licensing policies associated with such content.
This initiative is a technical answer to the constraints set by the Article 4 of the new European Directive on copyright and related rights in the Digital Single Market.
This section is non-normative.
This specification was published by the Text and Data Mining Reservation Protocol Community Group. It is not a W3C Standard nor is it on the W3C Standards Track. Please note that under the W3C Community Final Specification Agreement (FSA) other conditions apply. Learn more about W3C Community and Business Groups.
GitHub Issues are preferred for discussion of this specification. Alternatively, you can send comments to our mailing list. Please send them to public-tdmrep@w3.org (subscribe, archives).
This section is non-normative.
In addition to their significance in the context of scientific research, text and data mining techniques (TDM) are widely used both by private and public entities to analyse large amounts of data (including copyright protected content like text, images, video etc.) in different areas of life and for various purposes, including for government services, complex business decisions and the development of new applications or technologies.
In a digital environment, TDM usage of copyright protected works can be subject to different terms and conditions, depending on the legal framework. In generic terms, an act of reproduction is required before TDM can be applied to content accessible on the Web; international laws stipulate that such an act of reproduction is subject to authorization by rightsholders. So far, analyzing and processing the terms and conditions of a website, contacting rightsholders, seeking permission and concluding licensing agreements have required time and resources.
In such a context, a machine-readable solution which streamlines the communication of TDM rights and licenses available for online copyrighted content is necessary to facilitate the development of TDM applications and reduce the risks of legal uncertainty for TDM actors. Such a solution, which must rely on a consensus between rightsholders and TDM actors, will optimize the capacity of TDM actors to lawfully access and process useful content at large scale.
The Directive on copyright and related rights in the Digital Single Market or EU Directive 2019/790, better known as the "CDSM Directive" (CDSM meaning Copyright in the Digital Single Market), introduces two exceptions or limitations to the rights of rightsholders on lawfully accessible content, for reproductions and extractions for the purposes of TDM:
In its Article 3, a mandatory exception for research organisations and cultural heritage institutions which carry out TDM for the purposes of scientific research.
In its Article 4, an exception for any organisation willing to carry out TDM for any purpose other than scientific research, including commercial purposes, which applies on the condition that the use of content for TDM has not been expressly reserved by their rightsholders in an appropriate manner, such as machine-readable means.
These TDM exceptions apply to TDM usage in the European Union in relation to content from European and foreign rightsholders. Outside of the EU, where the CDSM legislation does not apply, the said exceptions do not apply: exclusive rights of rightsholders to authorize acts of reproduction are maintained. In such cases, no TDM can be performed without the explicit authorisation of these rightsholders: in these countries, the absence of a reservation of rights by rightsholders cannot be considered as an implicit authorization to reproduce copyrighted content for TDM purposes, and advocating fair use or a similar rule is legally uncertain, as these actions are judged on a case-by-case basis.
The “opt-out” mechanism introduced by the CDSM Directive is therefore a real opportunity for TDM actors and publishers across countries to define a machine-readable technique able to express not only if TDM rights on specific Web content are reserved or not, but also how rightsholders can be contacted and which licenses are available, if any. This is a tremendous help for TDM actors from all countries looking for legal certainty.
This section is non-normative.
Rightsholder: Person or organization that owns the legal rights to something, in our case Web resources (source: Wiktionary).
Publisher: Person or organization that makes Web resources available to the public.
TDM Actor: Person or organization practicing TDM (on Web resources in our case).
TDM Agent: Software accessing Web resources for TDM purposes.
TDM License: Description of the terms and conditions by which a TDM Actor can process a given Web resource.
TDM Policy: Description of the kind of TDM Licenses a TDM Actor may obtain from a Rightsholder.
TDM Rights: Rights to process a Web resource via TDM techniques, for a certain purpose (e.g. scientific research, commercial).
Web resource: Identifiable thing available on the Web (source: Wikipedia). Web resources are located using URLs.
Web page: Web resource formatted in HTML.
This section is non-normative.
As well as sections marked as non-normative, all authoring guidelines, diagrams, examples, and notes in this specification are non-normative. Everything else in this specification is normative.
The key words MAY, MUST, SHOULD, and SHOULD NOT in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.
This section is non-normative.
The technical specification shall:
The goal of this protocol is to allow a rightsholder to declare their choice regarding text & data mining of Web resources they control, thereby allowing recipients of that declaration to adjust their scraping behavior, or to reach a separate agreement with the rightsholder that satisfies all parties.
Such a preference is expressed via two complementary properties, tdm-reservation and tdm-policy.
tdm-reservation is a boolean value.
tdm-reservation | meaning |
---|---|
1 | TDM rights are reserved. If a TDM Policy is set, TDM Agents MAY use it to get information on how they can acquire from the rightsholder an authorization to mine the content. |
0 | TDM rights are not reserved. TDM agents can mine the content for TDM purposes without having to contact the rightsholder. |
Other values are considered protocol errors. In such a case, TDM Agents MUST consider that tdm-reservation is unset.
The "opt-out" option specified by Article 4 of the CDSM Directive is expressed by setting tdm-reservation to 1.
tdm-policy is a URL pointing to a TDM Policy set by the rightsholder.
The presence of tdm-policy when the value of tdm-reservation is 0 is not considered a protocol error. TDM Agents SHOULD NOT process tdm-policy in this case.
A TDM Policy is considered human-readable if its content-type is text/html. It is considered machine-readable if its content-type is either application/json or application/ld+json.
Being unable to access or parse a TDM Policy is not considered a protocol error. In such a case, the TDM Agent MUST consider that there is no way to know at this time which conditions would allow it to process the resource.
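The value rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `parse_tdm_reservation` and `effective_policy` are hypothetical, and accepting both strings (as found in headers or metadata) and integers (as found in JSON) is a convenience of this sketch, not a requirement of the protocol.

```python
from typing import Optional


def parse_tdm_reservation(raw: object) -> Optional[int]:
    """Return 1 (reserved), 0 (not reserved) or None (unset)."""
    if raw is None:
        return None
    value = str(raw).strip()
    if value == "1":
        return 1
    if value == "0":
        return 0
    # Any other value is a protocol error: tdm-reservation is treated as unset.
    return None


def effective_policy(reservation: Optional[int], policy_url: Optional[str]) -> Optional[str]:
    """Drop tdm-policy when rights are not reserved (it should not be processed)."""
    return None if reservation == 0 else policy_url


print(parse_tdm_reservation("1"), parse_tdm_reservation("maybe"))
```

Keeping "unset" distinct from 0 matters: an unset value means the protocol says nothing about the resource, while 0 is an explicit statement by the rightsholder.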
This specification provides four complementary techniques for expressing rightsholders' choices. These techniques correspond to different situations and technical skills a Publisher may have. The first two techniques are independent of the nature and format of the content to which an "opt-out" is applied. The other two refer to formats seen as particularly important in the publishing industry: HTML and EPUB.
The TDM file on the origin server is a mechanism for declaring site-wide rightsholder's choices in a file hosted on the Web server a TDM Agent wishes to mine. Checking this file is the first thing a TDM Agent must do (see section Processing Priority).
An origin server that receives a valid GET request targeting this resource MUST send either a successful response containing a machine-readable representation of the site-wide rightsholder's choices, as defined below, or a sequence of redirects that leads to such a representation. Failure to provide access to such a representation implies that the origin server does not implement this protocol.
This specification defines a JSON file named tdmrep.json, which MUST be hosted in the /.well-known directory of a Web server.
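As a small illustration, a TDM Agent could derive the location of the TDM file from the URL of any resource on the same origin. The helper name `tdmrep_url` is hypothetical; the sketch only builds the URL and performs no network access.

```python
from urllib.parse import urlsplit, urlunsplit


def tdmrep_url(resource_url: str) -> str:
    """Build the tdmrep.json URL for the origin that serves resource_url."""
    parts = urlsplit(resource_url)
    # Keep scheme and authority; replace path, query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/.well-known/tdmrep.json", "", ""))


print(tdmrep_url("https://provider.com/directory-a/report.pdf?page=2"))
```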
This file contains an array of JSON objects; each object represents a rule and accepts three properties: location, tdm-reservation and tdm-policy.
In each rule, location and tdm-reservation are mandatory; tdm-policy is optional.
To evaluate if the URL of a Web resource is subject to a given pattern, a TDM Agent MUST match the paths inferred from the pattern against the URL. The matching SHOULD be case sensitive. The most specific match found MUST be used. The most specific match is the first in sequence.
If no match is found, a TDM Agent MUST consider that tdm-reservation is unset for the given URL.
If a percent-encoded US-ASCII character is encountered in the URI, it MUST be unencoded prior to comparison, unless it is a reserved character in the URI as defined by RFC3986 or the character is outside the unreserved character range. The match evaluates positively if and only if the end of the path from the rule is reached before a difference in octets is encountered.
There are many variants of regular expressions. In order to simplify the work of TDM Agents, this specification is re-using the specification and wording of the robots.txt draft-koster-rep-00 2.2.3.
TDM Agents MUST allow the following special characters:
Character | Description | Example |
---|---|---|
"$" | Designates the end of the match pattern. | "location: /this/path/exactly$" |
"*" | Designates 0 or more instances of any character. | "location: /this/*/then" |
If TDM Agents match special characters verbatim in the URI, they MUST use "%" encoding. For example:
Pattern | URI |
---|---|
/path/foo-%24 | https://provider.com/path/foo-$ |
This URL matching notation is subject to interpretation. For the sake of interoperability, TDM Agents should follow the rules detailed by Google in How Google interprets the robots.txt specification, section "URL matching based on path values".
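The two special characters can be handled by translating a pattern into a regular expression. The following Python sketch is one possible reading of the rules above, not a reference implementation: the function names are hypothetical, only the path component of the URL is compared (the query string is ignored), and the percent-encoding normalisation described earlier is deliberately left out.

```python
import re
from urllib.parse import urlsplit


def pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a location pattern into an anchored regular expression."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    # Escape literal text, then let each "*" stand for any run of characters.
    regex = ".*".join(re.escape(part) for part in body.split("*"))
    # Without a trailing "$", the pattern only has to match a prefix of the path.
    return re.compile("^" + regex + ("$" if anchored else ""))


def path_matches(pattern: str, url: str) -> bool:
    """Tell whether the path of url is covered by pattern."""
    path = urlsplit(url).path or "/"
    return pattern_to_regex(pattern).match(path) is not None


print(path_matches("/this/*/then", "https://provider.com/this/a/b/then/x"))
```

A prefix match stops as soon as the end of the pattern is reached, which mirrors the rule that a match is positive if the end of the rule's path is reached before any difference is found.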
This section is non-normative.
In the following example, a rightsholder wants to "opt-out" for every file present on a Web server.
tdmrep.json is therefore simply structured as:
[
{
"location": "/",
"tdm-reservation": 1
}
]
Note the brackets at the start and end of the file. They indicate that this is a JSON array, and they are mandatory even if you express only one rule. In this example, the provider has decided not to offer a TDM policy.
In the following example, a Web server is hosting three groups of files. The rightsholder of the first group of files (PDF documents) wants to express that TDM rights are reserved on these files with no way to acquire a TDM License. The rightsholder of the second group of files (HTML pages) wants to express that TDM rights are reserved with a TDM Policy. TDM rights are not reserved for the JPEG images contained in the third group.
In this example, the first group is a set of files stored in /directory-a; the second group is stored in /directory-b/html and the third group in /directory-b/images.
tdmrep.json is therefore structured as:
[
{
"location": "/directory-a/",
"tdm-reservation": 1
},
{
"location": "/directory-b/html/",
"tdm-reservation": 1,
"tdm-policy":"https://provider.com/policies/policy.json"
},
{
"location": "/directory-b/images/*.jpg",
"tdm-reservation": 0
}
]
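To make the example concrete, the sketch below loads this file and looks up the rule that applies to a given URL. It is a simplified illustration: `lookup` and `_matches` are hypothetical names, rules are tried in file order and the first match wins, and percent-encoding normalisation is omitted.

```python
import json
import re
from urllib.parse import urlsplit

TDMREP_JSON = """
[
  {"location": "/directory-a/", "tdm-reservation": 1},
  {"location": "/directory-b/html/", "tdm-reservation": 1,
   "tdm-policy": "https://provider.com/policies/policy.json"},
  {"location": "/directory-b/images/*.jpg", "tdm-reservation": 0}
]
"""


def _matches(location, path):
    """Prefix match with "*" as wildcard and a trailing "$" as end anchor."""
    anchored = location.endswith("$")
    body = location[:-1] if anchored else location
    regex = ".*".join(re.escape(part) for part in body.split("*"))
    return re.match("^" + regex + ("$" if anchored else ""), path) is not None


def lookup(rules, url):
    """Return (tdm-reservation, tdm-policy) for url, or (None, None) if unset."""
    path = urlsplit(url).path or "/"
    for rule in rules:
        if _matches(rule["location"], path):
            return rule["tdm-reservation"], rule.get("tdm-policy")
    return None, None


rules = json.loads(TDMREP_JSON)
print(lookup(rules, "https://provider.com/directory-b/html/index.html"))
```

A URL outside the three directories yields `(None, None)`, meaning tdm-reservation is unset for that resource.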
The TDM Header Field is a mechanism for declaring a choice in an HTTP response ([RFC7230]).
tdm-reservation and the optional tdm-policy are two header fields added to the HTTP response to a GET or HEAD request.
This is currently the preferred technique for implementing the protocol, as it is simple and already integrated in the spawning.ai API.
In the following example, the rightsholder expresses that TDM rights are reserved on these files with no way to acquire a TDM License. The server returns a tdm-reservation header field with value 1.
HTTP/1.1 200 OK
Date: Wed, 14 Jul 2021 12:07:48 GMT
Content-type: image/jpeg
tdm-reservation: 1
In the following example, a TDM License may be acquired. The server returns a tdm-reservation header field with value 1 and a tdm-policy header field pointing to a TDM Policy.
HTTP/1.1 200 OK
Date: Wed, 14 Jul 2021 12:07:48 GMT
Content-type: text/html
tdm-reservation: 1
tdm-policy: https://provider.com/policies/policy.json
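On the TDM Agent side, reading these two header fields could look like the following Python sketch. It parses a raw response head with the standard library's email parser, which gives case-insensitive header lookup; the function name `read_tdm_headers` is hypothetical, and a real agent would normally read the headers from its HTTP client instead of raw text.

```python
from email import message_from_string

RESPONSE_HEAD = """\
HTTP/1.1 200 OK
Date: Wed, 14 Jul 2021 12:07:48 GMT
Content-type: text/html
tdm-reservation: 1
tdm-policy: https://provider.com/policies/policy.json
"""


def read_tdm_headers(response_head):
    """Return (tdm-reservation, tdm-policy) found in a raw HTTP response head."""
    # Drop the status line, then parse the remaining "Name: value" lines.
    _status_line, _, header_block = response_head.partition("\n")
    headers = message_from_string(header_block)
    raw = headers.get("tdm-reservation")
    # Values other than "0" and "1" are protocol errors: treat as unset.
    reservation = int(raw) if raw is not None and raw.strip() in ("0", "1") else None
    policy = headers.get("tdm-policy")
    return reservation, policy.strip() if policy else None


print(read_tdm_headers(RESPONSE_HEAD))
```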
This technique provides a way to embed the rightsholder's choice in HTML content.
tdm-reservation is expressed as the value of the name attribute of a meta element, and tdm-policy as the value of the name attribute of a second meta element; in each case, the content attribute carries the value of the property.
In the following example, an HTML document is associated with a TDM Policy:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="tdm-reservation" content="1">
<meta name="tdm-policy" content="https://provider.com/policies/policy.json">
<title>Document title</title>
</head>
<body>
...
<!-- body content -->
...
</body>
</html>
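A TDM Agent could extract these meta elements with the standard library's HTML parser, as in the sketch below. The class and function names are hypothetical; keeping only the first occurrence of each property and comparing names case-insensitively are assumptions of this sketch, since the text above does not address either point.

```python
from html.parser import HTMLParser

SAMPLE_HTML = """<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="tdm-reservation" content="1">
<meta name="tdm-policy" content="https://provider.com/policies/policy.json">
<title>Document title</title>
</head>
<body></body>
</html>"""


class TDMMetaParser(HTMLParser):
    """Collect tdm-reservation and tdm-policy from meta elements."""

    def __init__(self):
        super().__init__()
        self.values = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        # Assumption: the first declaration of each property wins.
        if name in ("tdm-reservation", "tdm-policy") and name not in self.values:
            self.values[name] = attributes.get("content")


def read_tdm_meta(html_text):
    parser = TDMMetaParser()
    parser.feed(html_text)
    return parser.values


print(read_tdm_meta(SAMPLE_HTML))
```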
This technique provides a way to embed the rightsholder's choice in an EPUB file.
tdm:reservation and the optional tdm:policy are two properties added to the metadata section of the EPUB Package Document, using the <meta> element provided in EPUB for such extensions. They both use the tdm prefix, which is the shorthand mapping of the URL http://www.w3.org/ns/tdmrep#.
These properties cover TDM rights for every resource in the package, and this specification does not cover the definition of specific TDM rights for the different resources present in the package.
In the following example, the rightsholder signals that TDM rights are reserved on every resource of the EPUB file, but a TDM License may be acquired.
<package prefix="tdm: http://www.w3.org/ns/tdmrep#" ...>
<metadata ...>
<dc:title>Document title</dc:title>
<meta property="tdm:reservation">1</meta>
<meta property="tdm:policy">https://provider.com/policies/policy.json</meta>
</metadata>
</package>
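Reading these properties from a Package Document could be sketched as follows. The sample below completes the elided attributes of the example with the usual OPF and Dublin Core namespace declarations so that it parses as XML; the function name `read_tdm_epub` is hypothetical, and the sketch assumes the prefix is literally `tdm` instead of resolving the `prefix` attribute as a full implementation would.

```python
import xml.etree.ElementTree as ET

OPF_NS = "http://www.idpf.org/2007/opf"

PACKAGE = """<package xmlns="http://www.idpf.org/2007/opf"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         version="3.0" prefix="tdm: http://www.w3.org/ns/tdmrep#">
  <metadata>
    <dc:title>Document title</dc:title>
    <meta property="tdm:reservation">1</meta>
    <meta property="tdm:policy">https://provider.com/policies/policy.json</meta>
  </metadata>
</package>"""


def read_tdm_epub(package_xml):
    """Return the tdm:reservation and tdm:policy values found in the metadata."""
    root = ET.fromstring(package_xml)
    found = {}
    for meta in root.iter("{%s}meta" % OPF_NS):
        prop = meta.get("property")
        if prop in ("tdm:reservation", "tdm:policy"):
            found[prop] = (meta.text or "").strip()
    return found


print(read_tdm_epub(PACKAGE))
```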
Rightsholders SHOULD only use one of the techniques specified in the previous section. In case TDM rights are expressed using several techniques on a given Web Server, TDM Agents need a way to unambiguously define rightsholder's choices. This is why the following processing rules are specified.
A TDM Agent MUST check the presence of a TDM file on the origin server before it starts scraping the content of the Web server.
A TDM Agent will keep in cache the content of the TDM file, usually as an in-memory object, so that it can check its rules against every Web resource it fetches from the origin server.
A TDM Agent MUST check the presence of a TDM Header Field in every HTTP header it gets from fetching a resource on the Web server. The values of tdm-reservation and tdm-policy found in this header supersede any value inferred from a TDM file on the origin server.
A TDM Agent MUST check the presence of TDM Metadata in every HTML file fetched from the Web server. The values of tdm-reservation and tdm-policy found here supersede previous values.
A TDM Agent MUST check the presence of TDM Metadata in every EPUB file fetched from the Web server. The values of tdm:reservation and tdm:policy found here supersede previous values.
At every step, the absence of tdm-reservation and tdm-policy MUST NOT reset current values.
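The processing rules above amount to folding a sequence of declarations, ordered as TDM file, TDM Header Field, then embedded metadata. The Python sketch below is one reading of those rules: a step that declares tdm-reservation replaces the current pair of values, and a step that declares nothing leaves them untouched. The name `resolve_declarations` and the treatment of the two values as a pair are assumptions of this sketch.

```python
from typing import Iterable, Optional, Tuple

# (tdm-reservation, tdm-policy) as found at one step; None means "absent".
Declaration = Tuple[Optional[int], Optional[str]]


def resolve_declarations(steps: Iterable[Declaration]) -> Declaration:
    """Fold declarations in processing order; later declarations supersede earlier ones."""
    current: Declaration = (None, None)
    for reservation, policy in steps:
        if reservation is None:
            # Nothing declared at this step: keep the current values.
            continue
        current = (reservation, policy)
    return current


file_rule = (1, "https://provider.com/policies/policy.json")
header = (None, None)
html_meta = (0, None)
print(resolve_declarations([file_rule, header, html_meta]))
```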
Policies are machine-readable structures referenced from the tdm-policy property defined in the specification. They provide ways for TDM Actors to contact content rightsholders and they offer details about available TDM licenses. Thus, they facilitate the acquisition of TDM licenses from rightsholders by TDM Actors.
The format of policies defined in this specification is a profile of the Open Digital Rights Language 2.2 [ODRL].
This specification assumes basic knowledge of JSON-LD and the ODRL model and vocabulary.
The @context of a Policy MUST be http://www.w3.org/ns/odrl.jsonld.
A tdm alias MUST be added to the context if "tdm" prefixed properties are used in the Policy, and its value MUST be http://www.w3.org/ns/tdmrep#.
A policy identifier MUST also be added to the policy, expressed as a URI.
It is not required that the policy identifier is dereferenceable. Our recommendation is to use the domain name of the Web server, add "/policies/", and add a number to it, in case you decide later to manage different license policies. An alternative solution is to use as identifier the URL of the Terms of Use of the Web site, where the TDM policy is described in human language.
{
"@context": [
"http://www.w3.org/ns/odrl.jsonld",
{"tdm": "http://www.w3.org/ns/tdmrep#"}
],
"uid": "https://provider.com/policies/1",
...
}
The @type of a Policy MUST have Offer as value.
{
"@context": [
"http://www.w3.org/ns/odrl.jsonld",
{"tdm": "http://www.w3.org/ns/tdmrep#"}
],
"uid": "https://provider.com/policies/1",
"@type": "Offer",
...
}
A Policy MUST have a profile property with value http://www.w3.org/ns/tdmrep.
{
"@context": [
"http://www.w3.org/ns/odrl.jsonld",
{"tdm": "http://www.w3.org/ns/tdmrep#"}
],
"uid": "https://provider.com/policies/1",
"@type": "Offer",
"profile": "http://www.w3.org/ns/tdmrep",
...
}
The value of the profile property is not dereferenceable. This is simply an identifier in the form of a URL.
A Policy MUST contain one assigner property. The assigner property of the Offer MUST use a limited number of vCard properties ([vcard-rdf]):
{
"@context": [
"http://www.w3.org/ns/odrl.jsonld",
{"tdm": "http://www.w3.org/ns/tdmrep#"}
],
"uid": "https://provider.com/policies/1",
"@type": "Offer",
"profile": "http://www.w3.org/ns/tdmrep",
"assigner": {
"uid": "https://provider.com",
"vcard:fn": "Provider Name",
"vcard:nickname": "PRV",
"vcard:hasEmail": "mailto:contact@provider.com",
"vcard:hasAddress": {
"vcard:street-address": "111 Street Address",
"vcard:postal-code": "5555",
"vcard:locality": "Espérance",
"vcard:country-name": "France"
},
"vcard:hasTelephone": "tel:+61755555555",
"vcard:hasURL": "https://provider.com/tdm/licensing.html"
},
...
}
A Policy MUST contain one permission property. It SHOULD NOT contain any obligation or prohibition property.
The optional target of a permission, which is expressed via the target property, MUST be a URI identifying the collection of resources involved in the policy.
It can be used by a publisher to identify a collection of resources on which a specific action applies. If present, it should be used by TDM Agents in their messages to publishers.
The target URL is not necessarily dereferenceable. Accessing this URL may end with an HTTP error (403 in many cases): this is not a processing error.
{
"@context": [
"http://www.w3.org/ns/odrl.jsonld",
{"tdm": "http://www.w3.org/ns/tdmrep#"}
],
"@type": "Offer",
"profile": "http://www.w3.org/ns/tdmrep",
"uid": "https://provider.com/policies/1",
...
"permission": [{
"target": "https://provider.com/research-papers",
...
}
]
}
The mandatory action of a permission, which is expressed via the action property, MUST be the following:
Definition: analyse, via automated analytical technique, text and data in digital form in order to generate information which includes but is not limited to patterns, trends and correlations.
Label: Text & Data Mine
Identifier: http://www.w3.org/ns/tdmrep#mine
Included in: http://www.w3.org/ns/odrl/2/use
{
"@context": [
"http://www.w3.org/ns/odrl.jsonld",
{"tdm": "http://www.w3.org/ns/tdmrep#"}
],
"@type": "Offer",
"profile": "http://www.w3.org/ns/tdmrep",
"uid": "https://provider.com/policies/1",
...
"permission": [{
"action": "tdm:mine"
}
]
}
The duty to obtain verifiable consent before performing TDM on content is expressed by adding a duty property to the Permission. The duty is expressed as an action property with an obtainConsent value.
{
"@context": [
"http://www.w3.org/ns/odrl.jsonld",
{"tdm": "http://www.w3.org/ns/tdmrep#"}
],
"@type": "Offer",
"profile": "http://www.w3.org/ns/tdmrep",
"uid": "https://provider.com/policies/1",
...
"permission": [{
"action": "tdm:mine",
"duty": [{
"action": "obtainConsent"
}
]
}
]
}
The duty to compensate financially the mining of content is expressed by adding a duty property to the Permission. The duty is expressed as an action property with a compensate value.
{
"@context": [
"http://www.w3.org/ns/odrl.jsonld",
{"tdm": "http://www.w3.org/ns/tdmrep#"}
],
"@type": "Offer",
"profile": "http://www.w3.org/ns/tdmrep",
"uid": "https://provider.com/policies/1",
...,
"permission": [{
"action": "tdm:mine",
"duty": [{
"action": "compensate"
}
]
}
]
}
The permission to mine content for a given type of usage only is expressed by adding a constraint property to the Permission. The usage type is expressed as a purpose value on a leftOperand property, the operator property takes eq as value and the rightOperand property takes one of the following values:
The following properties, tdm:research and tdm:non-research, are experimental at this point. They have been discussed in the context of the use of the TDM Reservation Protocol outside Europe, where TDM by scientific research organizations and for research purposes is legally allowed without restriction. They may soon be replaced by other kinds of constraints on mining purposes.
Definition: designates research purposes.
Label: Research purpose
Identifier: http://www.w3.org/ns/tdmrep#research
Included in: http://www.w3.org/ns/odrl/2/rightOperand
Definition: designates non-research purposes, including commercial ones.
Label: Non-research purpose
Identifier: http://www.w3.org/ns/tdmrep#non-research
Included in: http://www.w3.org/ns/odrl/2/rightOperand
{
"@context": [
"http://www.w3.org/ns/odrl.jsonld",
{"tdm": "http://www.w3.org/ns/tdmrep#"}
],
"@type": "Offer",
"profile": "http://www.w3.org/ns/tdmrep",
"uid": "https://provider.com/policies/1",
...
"permission": [{
"action": "tdm:mine",
"constraint": [{
"leftOperand": "purpose",
"operator": "eq",
"rightOperand": "tdm:research"
}
]
}
]
}
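The requirements on Policies listed in this section can be turned into a simple structural check. The Python sketch below is illustrative only: `check_policy` and its messages are hypothetical, it works on the plain JSON structure without any JSON-LD processing, it accepts the action either as `tdm:mine` or as the full identifier, and the detection of "tdm" prefixed terms is a crude string heuristic.

```python
import json

ODRL_CONTEXT = "http://www.w3.org/ns/odrl.jsonld"
TDM_NAMESPACE = "http://www.w3.org/ns/tdmrep#"
TDM_PROFILE = "http://www.w3.org/ns/tdmrep"
MINE_ACTIONS = ("tdm:mine", TDM_NAMESPACE + "mine")


def check_policy(policy):
    """Return a list of problems; an empty list means none was found."""
    problems = []
    context = policy.get("@context")
    entries = context if isinstance(context, list) else [context]
    if ODRL_CONTEXT not in entries:
        problems.append("@context must include the ODRL context")
    # Heuristic (assumption): a serialized string starting with "tdm:" uses the prefix.
    uses_tdm_prefix = '"tdm:' in json.dumps(policy)
    has_alias = any(isinstance(e, dict) and e.get("tdm") == TDM_NAMESPACE for e in entries)
    if uses_tdm_prefix and not has_alias:
        problems.append("tdm alias missing from @context")
    if not isinstance(policy.get("uid"), str):
        problems.append("uid must be a URI string")
    if policy.get("@type") != "Offer":
        problems.append("@type must be Offer")
    if policy.get("profile") != TDM_PROFILE:
        problems.append("profile must be " + TDM_PROFILE)
    if "assigner" not in policy:
        problems.append("assigner is missing")
    permissions = policy.get("permission")
    if isinstance(permissions, dict):
        permissions = [permissions]
    if not permissions:
        problems.append("permission is missing")
    else:
        for permission in permissions:
            if permission.get("action") not in MINE_ACTIONS:
                problems.append("permission action must be tdm:mine")
    return problems


policy = {
    "@context": [ODRL_CONTEXT, {"tdm": TDM_NAMESPACE}],
    "@type": "Offer",
    "profile": TDM_PROFILE,
    "uid": "https://provider.com/policies/1",
    "assigner": {"uid": "https://provider.com"},
    "permission": [{"action": "tdm:mine"}],
}
print(check_policy(policy))
```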
This section is non-normative.
In this example, the rightsholder requires TDM Actors to contact them to obtain licensing rights. The rightsholder provides detailed contact information using the W3C vCard Ontology.
Important note: TDM Actors which benefit from Article 3 of the CDSM Directive do not have to comply with this requirement.
{
"@context": [
"http://www.w3.org/ns/odrl.jsonld",
{"tdm": "http://www.w3.org/ns/tdmrep#"}
],
"@type": "Offer",
"profile": "http://www.w3.org/ns/tdmrep",
"uid": "https://provider.com/policies/1",
"assigner": {
"uid": "https://provider.com",
"vcard:fn": "Provider",
"vcard:hasEmail": "mailto:contact@provider.com",
"vcard:hasAddress": {
"vcard:street-address": "111 Street Address",
"vcard:postal-code": "5555",
"vcard:locality": "Espérance",
"vcard:country-name": "France"
},
"vcard:hasTelephone": "tel:+61755555555",
"vcard:hasURL": "https://provider.com/tdm/licensing.html"
},
"permission": [{
"action": "tdm:mine",
"duty": [{
"action": "obtainConsent"
}
]
}
]
}
There is no mandatory property in the previous example. Just keep the properties you really see as useful. The vcard:hasURL property is especially useful if a Web page explains in human language what the publisher's TDM policy is.
In this example, the rightsholder expresses that non-research Actors from any country can mine its content if they agree to pay a fee.
{
"@context": [
"http://www.w3.org/ns/odrl.jsonld",
{"tdm": "http://www.w3.org/ns/tdmrep#"}
],
"@type": "Offer",
"profile": "http://www.w3.org/ns/tdmrep",
"uid": "https://provider.com/policies/1",
"assigner": {
"uid": "https://provider.com",
"vcard:fn": "Provider",
"vcard:hasEmail": "mailto:contact@provider.com"
},
"permission": [{
"action": "tdm:mine",
"duty": [{
"action": "compensate"
}
],
"constraint": [{
"leftOperand": "purpose",
"operator": "eq",
"rightOperand": "tdm:non-research"
}
]
}
]
}
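Once a Policy such as the one above has been fetched, a TDM Agent may want to summarise what each permission requires. The following Python sketch extracts the duties and purpose constraints from the plain JSON structure; `describe_permissions` is a hypothetical helper, and no JSON-LD expansion is performed.

```python
POLICY = {
    "@context": [
        "http://www.w3.org/ns/odrl.jsonld",
        {"tdm": "http://www.w3.org/ns/tdmrep#"},
    ],
    "@type": "Offer",
    "profile": "http://www.w3.org/ns/tdmrep",
    "uid": "https://provider.com/policies/1",
    "assigner": {
        "uid": "https://provider.com",
        "vcard:fn": "Provider",
        "vcard:hasEmail": "mailto:contact@provider.com",
    },
    "permission": [{
        "action": "tdm:mine",
        "duty": [{"action": "compensate"}],
        "constraint": [{
            "leftOperand": "purpose",
            "operator": "eq",
            "rightOperand": "tdm:non-research",
        }],
    }],
}


def describe_permissions(policy):
    """List, for each permission, its action, its duties and its allowed purposes."""
    summary = []
    for permission in policy.get("permission", []):
        duties = [duty.get("action") for duty in permission.get("duty", [])]
        purposes = [
            constraint.get("rightOperand")
            for constraint in permission.get("constraint", [])
            if constraint.get("leftOperand") == "purpose"
            and constraint.get("operator") == "eq"
        ]
        summary.append({
            "action": permission.get("action"),
            "duties": duties,
            "purposes": purposes,
        })
    return summary


print(describe_permissions(POLICY))
```

Such a summary tells the agent that mining for non-research purposes is offered, subject to compensation; the contact details in the assigner property indicate whom to approach.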