This specification defines a simple and practical Web protocol, capable of expressing the reservation of rights relative to text & data mining (TDM) applied to lawfully accessible Web content, and to ease the discovery of TDM licensing policies associated with such content.
This initiative is a technical answer to the constraints set by Article 4 of the new European Directive on copyright and related rights in the Digital Single Market.
In addition to their significance in the context of scientific research, text and data mining techniques (TDM) are widely used both by private and public entities to analyse large amounts of data (including copyright protected content like text, images, video etc.) in different areas of life and for various purposes, including for government services, complex business decisions and the development of new applications or technologies.
In a digital environment, TDM usage of copyright protected works can be subject to different terms and conditions, depending on the legal framework. In generic terms, an act of reproduction is required before TDM can be applied to content accessible on the Web; international laws stipulate that such an act of reproduction is subject to authorization by rightsholders. So far, analyzing and processing the terms and conditions of a website, contacting rightsholders, seeking permission and concluding licensing agreements have required time and resources.
In this context, a machine-readable solution which streamlines the communication of TDM rights and licenses available for online copyrighted content is necessary to facilitate the development of TDM applications and reduce the risks of legal uncertainty for TDM actors. Such a solution, which must rely on a consensus among rightsholders and TDM actors, will improve the capacity of TDM actors to lawfully access and process useful content at large scale.
The Directive on copyright and related rights in the Digital Single Market or EU Directive 2019/790, better known as the "DSM Directive" (DSM meaning Digital Single Market), introduces two exceptions or limitations to the rights of rightsholders on lawfully accessible content, for reproductions and extractions for the purposes of TDM:
In its Article 3, a mandatory exception for research organisations and cultural heritage institutions which carry out TDM for the purposes of scientific research.
In its Article 4, an exception for any organisation willing to carry out TDM for any purpose other than scientific research, including commercial purposes, which applies on the condition that the use of content for TDM has not been expressly reserved by its rightsholders in an appropriate manner, such as machine-readable means.
These TDM exceptions apply to TDM usage in the European Union in relation to content from European and foreign rightsholders. Outside of the EU, where the DSM legislation does not apply, these exceptions do not apply: the exclusive rights of rightsholders to authorize acts of reproduction are maintained. In such cases, no TDM can be performed without the explicit authorisation of these rightsholders: in these countries, the absence of a reservation of rights by rightsholders cannot be considered as an implicit authorization to reproduce copyrighted content for TDM purposes, and advocating fair use or a similar rule is legally uncertain, as these actions are judged on a case-by-case basis.
The “opt-out” mechanism introduced by the DSM Directive is therefore a real opportunity for TDM actors and publishers across countries to define a machine-readable technique able to express not only if TDM rights on specific Web content are reserved or not, but also how rightsholders can be contacted and which licenses are available, if any. This is a tremendous help for TDM actors from all countries looking for legal certainty.
Rightsholder: Person or organization that owns the legal rights to something, in our case Web resources (source: Wiktionary).
Publisher: Person or organization that makes Web resources available to the public.
TDM Actor: Person or organization practicing TDM (on Web resources in our case).
TDM Agent: Software accessing Web resources for TDM purposes.
TDM License: Description of the terms and conditions by which a TDM Actor can process a given Web resource.
TDM Policy: Description of the kind of TDM Licenses a TDM Actor may obtain from a Rightsholder.
TDM Rights: Rights to process a Web resource via TDM techniques, for a certain purpose (e.g. scientific research, commercial).
Web resource: Identifiable thing available on the Web (source: Wikipedia). Web resources are located using URLs.
Web page: Web resource formatted in HTML.
The technical specification shall:
The goal of this protocol is to allow a rightsholder to declare their choice regarding text & data mining of the Web resources they control, thereby allowing recipients of that declaration to adjust their scraping behavior, or to reach a separate agreement with the rightsholder that satisfies all parties.
Such a preference is expressed via two complementary properties, `tdm-reservation` and `tdm-policy`.
`tdm-reservation` is an integer.
tdm-reservation | meaning |
---|---|
1 | TDM rights are reserved. If a TDM Policy is set, TDM Agents MAY use it to get information on how they can acquire from the rightsholder an authorization to mine the content. |
0 | TDM rights are not reserved. TDM agents can mine the content for TDM purposes without having to contact the rightsholder. |
Other values are considered protocol errors. In such a case the TDM Agents MUST consider that `tdm-reservation` is unset.
The "opt-out" option specified by Article 4 of the DSM Directive is expressed by the use of `tdm-reservation` with a value of 1.
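The value rules above can be sketched as a small parsing helper. This is a minimal illustration, not part of the specification: the function name `parse_tdm_reservation` is hypothetical, and the choice to accept both JSON integers and header or attribute strings is an assumption.

```python
from typing import Optional


def parse_tdm_reservation(raw: object) -> Optional[int]:
    """Return 1 or 0 for a valid value, None when the value is treated as unset."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(raw, bool):
        return None
    # Integer form, as found in a JSON rule.
    if isinstance(raw, int):
        return raw if raw in (0, 1) else None
    # String form, as found in a header field or a meta element (assumption).
    if isinstance(raw, str) and raw.strip() in ("0", "1"):
        return int(raw.strip())
    # Missing value or protocol error: both are handled as "unset".
    return None
```

Returning `None` for every invalid input keeps the "protocol error means unset" rule in one place.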
`tdm-policy` is a URL pointing to a TDM Policy set by the rightsholder.
The presence of `tdm-policy` when the value of `tdm-reservation` is 0 is not considered a protocol error. TDM Agents SHOULD NOT process `tdm-policy` in this case.
A TDM Policy is considered human-readable if its content-type is `text/html`. It is considered machine-readable if its content-type is either `application/json` or `application/ld+json`.
Being unable to access or parse a TDM Policy is not considered a protocol error. In such a case, the TDM Agent MUST consider that there is no way to know at this time which conditions would allow it to process the resource.
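As an illustration of the content-type rules, a TDM Agent could classify a fetched TDM Policy as follows. The function name `classify_policy` and the `"unknown"` label are hypothetical; stripping media type parameters such as `charset` is an assumption, not something the specification states.

```python
from typing import Optional

HUMAN_READABLE = {"text/html"}
MACHINE_READABLE = {"application/json", "application/ld+json"}


def classify_policy(content_type: Optional[str]) -> str:
    """Classify a TDM Policy from the Content-Type of the response that carried it."""
    if not content_type:
        return "unknown"
    # Keep only the media type, dropping parameters such as "; charset=utf-8".
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type in MACHINE_READABLE:
        return "machine-readable"
    if media_type in HUMAN_READABLE:
        return "human-readable"
    # Not a protocol error: the conditions are simply not known at this time.
    return "unknown"
```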
This specification provides three complementary techniques for expressing rightsholders' choices. These three techniques correspond to different situations and technical skills a Publisher may have.
The TDM file on the origin server is a mechanism for declaring the rightsholder's site-wide choices in a file hosted on the origin server of the Web content a TDM Agent wishes to mine.
An origin server that receives a valid GET request targeting this resource MUST send either a successful response containing a machine-readable representation of the rightsholder's site-wide choices, as defined below, or a sequence of redirects that leads to such a representation. Failure to provide access to such a representation implies that the origin server does not implement this protocol.
This specification defines a JSON file named `tdmrep.json`, which MUST be hosted in the `/.well-known` directory of a Web server.
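A TDM Agent can derive the location of the TDM file from the URL of any resource on the same origin. The following sketch uses only the standard library; the function name `tdm_file_url` is hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def tdm_file_url(resource_url: str) -> str:
    """Build the URL of tdmrep.json for the origin serving resource_url."""
    parts = urlsplit(resource_url)
    # Keep scheme and authority; replace path, query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/.well-known/tdmrep.json", "", ""))
```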
This file contains an array of JSON objects; each object represents a rule and contains three properties:
To evaluate if the URL of a Web resource is subject to a given pattern, a TDM Agent MUST match the paths inferred from the pattern against the URL. The matching SHOULD be case sensitive. The most specific match found MUST be used. The most specific match is the first in sequence.
If no match is found, a TDM Agent MUST consider that `tdm-reservation` is unset for the given URL.
If a percent-encoded US-ASCII character is encountered in the URI, it MUST be unencoded prior to comparison, unless it is a reserved character in the URI as defined by RFC3986 or the character is outside the unreserved character range. The match evaluates positively if and only if the end of the path from the rule is reached before a difference in octets is encountered.
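One possible reading of the percent-encoding rule is that only percent-encoded unreserved characters (letters, digits, `-`, `.`, `_`, `~`) are decoded before comparison, while every other escape sequence is kept. The sketch below follows that reading; the function name `normalize_path` is hypothetical, and upper-casing the hexadecimal digits of the escapes that are kept is an additional assumption borrowed from common URI normalization practice.

```python
import re

UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)
_PERCENT_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")


def normalize_path(path: str) -> str:
    """Decode percent-encoded unreserved characters; keep every other escape."""

    def replace(match: "re.Match") -> str:
        char = chr(int(match.group(1), 16))
        if char in UNRESERVED:
            return char
        # Reserved or non-ASCII octet: keep it encoded, with upper-case hex.
        return "%" + match.group(1).upper()

    return _PERCENT_ESCAPE.sub(replace, path)
```

Applying the same normalization to both the rule path and the URL path before comparing them keeps the octet comparison consistent.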
There are many variants of regular expressions. In order to simplify the work of TDM Agents, this specification is re-using the specification and wording of the robots.txt draft-koster-rep-00 2.2.3.
TDM Agents MUST allow the following special characters:
Character | Description | Example |
---|---|---|
"$" | Designates the end of the match pattern. | "tdm-policy: /this/path/exactly$" |
"*" | Designates 0 or more instances of any character. | "tdm-policy: /this/*/then" |
If TDM Agents match special characters verbatim in the URI, they MUST use "%" encoding. For example:
Pattern | URI |
---|---|
/path/foo-%24 | https://provider.com/path/foo-$ |
This URL matching notation is subject to interpretation. For the sake of interoperability, TDM Agents should follow the rules detailed by Google in How Google interprets the robots.txt specification, section "URL matching based on path values".
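The two special characters can be handled by translating a pattern into a regular expression anchored at the start of the path. This is a sketch under the robots.txt-style interpretation described above, not a normative algorithm; the names `pattern_to_regex` and `path_matches` are hypothetical.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a location pattern using "*" and a trailing "$" into a regex."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    # "*" matches zero or more characters; everything else is literal.
    regex = ".*".join(re.escape(part) for part in body.split("*"))
    # A trailing "$" requires the path to end where the pattern ends.
    return re.compile(regex + (r"\Z" if anchored else ""))


def path_matches(pattern: str, path: str) -> bool:
    """Return True when the path matches the pattern from its first character."""
    return pattern_to_regex(pattern).match(path) is not None
```

Without a trailing `$`, a pattern behaves as a prefix: `/this/*/then` also matches `/this/a/b/then/rest`.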
In the following example, a rightsholder wants to "opt-out" for every file present on a Web server.
`tdmrep.json` is therefore simply structured as:
    [
      {
        "location": "/",
        "tdm-reservation": 1
      }
    ]

In the following example, a Web server is hosting three groups of files. The rightsholder of the first group of files (PDF documents) wants to express that TDM rights are reserved on these files with no way to acquire a TDM License. The rightsholder of the second group of files (HTML pages) wants to express that TDM rights are reserved with a TDM Policy. TDM rights are not reserved for all JPEG images contained in the third group.
In this example, the first group is a set of files stored in /directory-a; the second group is stored in /directory-b/html and the third group in /directory-b/images.
`tdmrep.json` is therefore structured as:
    [
      {
        "location": "/directory-a/",
        "tdm-reservation": 1
      },
      {
        "location": "/directory-b/html/",
        "tdm-reservation": 1,
        "tdm-policy": "https://provider.com/policies/policy.json"
      },
      {
        "location": "/directory-b/images/*.jpg",
        "tdm-reservation": 0
      }
    ]
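To illustrate how a TDM Agent might evaluate such a file, the sketch below loads the three rules of this example and returns the first rule, in sequence, whose `location` matches the path of a URL. The names `location_matches` and `find_rule` are hypothetical, and the matching follows the robots.txt-style interpretation; percent-encoding normalization is left out for brevity.

```python
import json
import re
from typing import Optional
from urllib.parse import urlsplit

TDMREP_JSON = """
[
  {"location": "/directory-a/", "tdm-reservation": 1},
  {"location": "/directory-b/html/", "tdm-reservation": 1,
   "tdm-policy": "https://provider.com/policies/policy.json"},
  {"location": "/directory-b/images/*.jpg", "tdm-reservation": 0}
]
"""


def location_matches(location: str, path: str) -> bool:
    """Match a path against a location using "*" and a trailing "$"."""
    anchored = location.endswith("$")
    body = location[:-1] if anchored else location
    regex = ".*".join(re.escape(part) for part in body.split("*"))
    return re.match(regex + (r"\Z" if anchored else ""), path) is not None


def find_rule(rules: list, url: str) -> Optional[dict]:
    """Return the first matching rule, or None when tdm-reservation is unset."""
    path = urlsplit(url).path or "/"
    for rule in rules:
        location = rule.get("location")
        if isinstance(location, str) and location_matches(location, path):
            return rule
    return None


rules = json.loads(TDMREP_JSON)
```

A result of `None` corresponds to the case where no match is found and `tdm-reservation` is considered unset for the URL.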
The TDM Header Field is a mechanism for declaring a choice in an HTTP response ([[RFC7230]]).
In the following example, the rightsholder expresses that TDM rights are reserved on these files with no way to acquire a TDM License. The server returns a `tdm-reservation` header field with value 1.
    HTTP/1.1 200 OK
    Date: Wed, 14 Jul 2021 12:07:48 GMT
    Content-type: image/jpeg
    tdm-reservation: 1
In the following example, a TDM License may be acquired. The server returns a `tdm-reservation` header field with value 1 and a `tdm-policy` header field pointing to a TDM Policy.
    HTTP/1.1 200 OK
    Date: Wed, 14 Jul 2021 12:07:48 GMT
    Content-type: text/html
    tdm-reservation: 1
    tdm-policy: https://provider.com/policies/policy.json
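On the client side, reading these two header fields could look like the sketch below. It takes the response headers as a plain mapping, so it does not depend on a particular HTTP library. The function name `read_tdm_headers` is hypothetical, and ignoring `tdm-policy` unless `tdm-reservation` equals 1 is an implementation choice derived from the rules on the two properties.

```python
from typing import Mapping, Optional, Tuple


def read_tdm_headers(headers: Mapping[str, str]) -> Tuple[Optional[int], Optional[str]]:
    """Extract (tdm-reservation, tdm-policy) from HTTP response headers."""
    # HTTP field names are case-insensitive, so compare them in lower case.
    lowered = {name.lower(): value.strip() for name, value in headers.items()}
    raw = lowered.get("tdm-reservation")
    # Any value other than "0" or "1" is a protocol error, handled as unset.
    reservation = int(raw) if raw in ("0", "1") else None
    # The policy is only meaningful when TDM rights are reserved.
    policy = lowered.get("tdm-policy") if reservation == 1 else None
    return reservation, policy
```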
TDM Metadata in HTML Content is a mechanism for declaring a choice embedded in HTML content.
`tdm-reservation` is expressed as the value of the `name` attribute of a `meta` element, and `tdm-policy` is expressed as the value of the `name` attribute of a second `meta` element. In each case, the value of the property is carried by the `content` attribute of the element.
In the following example, an HTML document is associated with a TDM Policy:
    <!DOCTYPE html>
    <html lang="en">
      <head>
        <meta charset="utf-8">
        <meta name="tdm-reservation" content="1">
        <meta name="tdm-policy" content="https://provider.com/policies/policy.json">
        <title>Document title</title>
      </head>
      <body>
        ...
        <!-- body content -->
        ...
      </body>
    </html>
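A TDM Agent can extract these two `meta` elements with the standard library HTML parser. The sketch below is illustrative only: the class name `TdmMetaParser` and the function `read_tdm_meta` are hypothetical, and keeping the first occurrence of each property when duplicates exist is an assumption.

```python
from html.parser import HTMLParser

EXAMPLE_HTML = """<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <meta name="tdm-reservation" content="1">
  <meta name="tdm-policy" content="https://provider.com/policies/policy.json">
  <title>Document title</title>
</head>
<body></body>
</html>"""


class TdmMetaParser(HTMLParser):
    """Collect the content of the tdm-reservation and tdm-policy meta elements."""

    def __init__(self) -> None:
        super().__init__()
        self.values = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        # Keep the first occurrence of each property (assumption).
        if name in ("tdm-reservation", "tdm-policy") and name not in self.values:
            self.values[name] = (attributes.get("content") or "").strip()


def read_tdm_meta(html: str) -> dict:
    """Return the raw TDM metadata values found in an HTML document."""
    parser = TdmMetaParser()
    parser.feed(html)
    parser.close()
    return parser.values
```

The values are returned as raw strings; validating `tdm-reservation` against 0 and 1 is left to the caller.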
Rightsholders SHOULD use only one of the techniques specified in the previous section. However, in case a Web server is misconfigured, TDM Agents need a way to determine the rightsholder's choices unambiguously. This is why the following processing rules are specified.
A TDM Agent MUST check the presence of a TDM file on the origin server before it starts scraping the content of the Web server.
A TDM Agent will cache the content of the TDM file, usually as an in-memory object, so that it can check its rules against every Web resource it fetches from the origin server.
A TDM Agent MUST check the presence of a TDM Header Field in every HTTP response it gets from fetching a resource on the Web server. The values of `tdm-reservation` and `tdm-policy` found in this header supersede any value inferred from a TDM file on the origin server.
A TDM Agent MUST check the presence of TDM Metadata found in HTML content fetched from the Web server. The values of `tdm-reservation` and `tdm-policy` found here supersede previous values.
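The precedence between the three techniques can be sketched as a simple fold over the sources, from the least to the most specific. The function name `resolve_choice` and the `Choice` tuple are hypothetical. Treating a source as present only when it sets `tdm-reservation`, and replacing both values together, is an assumption: the rules above do not say what to do when a source carries `tdm-policy` alone.

```python
from typing import Optional, Tuple

# (tdm-reservation, tdm-policy); None means the value is unset.
Choice = Tuple[Optional[int], Optional[str]]


def resolve_choice(file_choice: Choice, header_choice: Choice, meta_choice: Choice) -> Choice:
    """Combine the three sources: TDM file, then header field, then HTML metadata."""
    reservation, policy = None, None
    for source_reservation, source_policy in (file_choice, header_choice, meta_choice):
        # A later source supersedes earlier ones only when it declares a choice.
        if source_reservation is not None:
            reservation, policy = source_reservation, source_policy
    return reservation, policy
```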
Policies are machine-readable structures referenced from the `tdm-policy` property defined in the specification. They provide ways for TDM Actors to contact content rightsholder and they offer details about available TDM licenses. Thus, they facilitate the acquisition of TDM licenses from rightsholders by TDM Actors.
The format of policies defined in this specification is a profile of the Open Digital Rights Language 2.2 [[ODRL]].
This specification assumes basic knowledge of the ODRL model and vocabulary.
The `@context` of a Policy MUST be "http://www.w3.org/ns/odrl.jsonld".
A `tdm` alias MUST be added to the context if "tdm" prefixed properties are used in the Policy, and its value MUST be `http://www.w3.org/ns/tdmrep#`.
ODRL Policies also require an identifier, expressed as a URI.
    {
      "@context": [
        "http://www.w3.org/ns/odrl.jsonld",
        {"tdm": "http://www.w3.org/ns/tdmrep#"}
      ],
      "uid": "https://provider.com/policies/policy-a",
      ...
    }
The `@type` of a Policy MUST have `Offer` as value.
    {
      "@context": [
        "http://www.w3.org/ns/odrl.jsonld",
        {"tdm": "http://www.w3.org/ns/tdmrep#"}
      ],
      "uid": "https://provider.com/policies/policy-a",
      "@type": "Offer",
      ...
    }
A Policy MUST have a `profile` property with value `http://www.w3.org/ns/tdmrep`.

    {
      "@context": [
        "http://www.w3.org/ns/odrl.jsonld",
        {"tdm": "http://www.w3.org/ns/tdmrep#"}
      ],
      "uid": "https://provider.com/policies/policy-a",
      "@type": "Offer",
      "profile": "http://www.w3.org/ns/tdmrep",
      ...
    }
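The header requirements of a Policy (context, identifier, type and profile) lend themselves to a simple check once the JSON has been parsed. The sketch below is a hedged illustration, not a full ODRL validator: the function name `policy_problems` and the wording of the messages are hypothetical, and accepting `@context` either as a single string or as an array containing the ODRL context is an assumption based on the examples.

```python
ODRL_CONTEXT = "http://www.w3.org/ns/odrl.jsonld"
TDM_NAMESPACE = "http://www.w3.org/ns/tdmrep#"
TDM_PROFILE = "http://www.w3.org/ns/tdmrep"


def policy_problems(policy: dict) -> list:
    """Return a list of problems found in the header properties of a Policy."""
    problems = []
    context = policy.get("@context")
    entries = context if isinstance(context, list) else [context]
    if ODRL_CONTEXT not in entries:
        problems.append("@context does not reference the ODRL context")
    for entry in entries:
        # The tdm alias is only checked when it is declared.
        if isinstance(entry, dict) and "tdm" in entry and entry["tdm"] != TDM_NAMESPACE:
            problems.append("tdm alias has an unexpected value")
    if not policy.get("uid"):
        problems.append("uid is missing")
    if policy.get("@type") != "Offer":
        problems.append("@type is not Offer")
    if policy.get("profile") != TDM_PROFILE:
        problems.append("profile is missing or unexpected")
    return problems
```

An empty list means the four header properties look as expected; it says nothing about the `assigner` or `permission` properties.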
A Policy MUST contain one `assigner` property. The `assigner` property of the Offer MUST use a limited number of vCard properties ([[vcard-rdf]]):
    {
      "@context": [
        "http://www.w3.org/ns/odrl.jsonld",
        {"tdm": "http://www.w3.org/ns/tdmrep#"}
      ],
      "uid": "https://provider.com/policies/policy-a",
      "@type": "Offer",
      "profile": "http://www.w3.org/ns/tdmrep",
      "assigner": {
        "uid": "https://provider.com",
        "vcard:fn": "Provider Name",
        "vcard:nickname": "PRV",
        "vcard:hasEmail": "mailto:contact@provider.com",
        "vcard:hasAddress": {
          "vcard:street-address": "111 Street Address",
          "vcard:postal-code": "5555",
          "vcard:locality": "Espérance",
          "vcard:country-name": "France"
        },
        "vcard:hasTelephone": "tel:+61755555555",
        "vcard:hasURL": "https://provider.com/tdm/licensing.html"
      },
      ...
    }

A Policy MUST contain one `permission` property. It SHOULD NOT contain an `obligation` or `prohibition` property.
The mandatory target of a permission, which is expressed via the `target` property, MUST be a URI identifying the collection of resources involved in the policy.
TDM Agents will use this property in their messages to publishers, to identify a collection of resources they wish to mine. This identifier shall therefore properly identify a specific collection of resources and be well known to their publisher.
The target URL is not necessarily dereferenceable. Accessing this URL may end with an HTTP error (403 in many cases): this is not a processing error.
The mandatory action of a permission, which is expressed via the `action` property, MUST be the following:
Definition: analyse, via automated analytical technique, text and data in digital form in order to generate information which includes but is not limited to patterns, trends and correlations.
Label: Text & Data Mine
Identifier: http://www.w3.org/ns/tdmrep#mine
Included in: http://www.w3.org/ns/odrl/2/use
    {
      "@context": [
        "http://www.w3.org/ns/odrl.jsonld",
        {"tdm": "http://www.w3.org/ns/tdmrep#"}
      ],
      "@type": "Offer",
      "profile": "http://www.w3.org/ns/tdmrep",
      "uid": "https://provider.com/policies/policy-a",
      ...
      "permission": [{
        "target": "https://provider.com/research-papers",
        "action": "tdm:mine"
      }]
    }

The duty to obtain verifiable consent before performing TDM on content is expressed by adding a `duty` property to the Permission. The duty is expressed as an `action` property with an `obtainConsent` value.

    {
      "@context": [
        "http://www.w3.org/ns/odrl.jsonld",
        {"tdm": "http://www.w3.org/ns/tdmrep#"}
      ],
      "@type": "Offer",
      "profile": "http://www.w3.org/ns/tdmrep",
      "uid": "https://provider.com/policies/policy-a",
      ...
      "permission": [{
        "target": "https://provider.com/research-papers",
        "action": "tdm:mine",
        "duty": [{
          "action": "obtainConsent"
        }]
      }]
    }
The duty to compensate financially the mining of content is expressed by adding a `duty` property to the Permission. The duty is expressed as an `action` property with a `compensate` value.
    {
      "@context": [
        "http://www.w3.org/ns/odrl.jsonld",
        {"tdm": "http://www.w3.org/ns/tdmrep#"}
      ],
      "@type": "Offer",
      "profile": "http://www.w3.org/ns/tdmrep",
      "uid": "https://provider.com/policies/policy-a",
      ...,
      "permission": [{
        "target": "https://provider.com/research-papers",
        "action": "tdm:mine",
        "duty": [{
          "action": "compensate"
        }]
      }]
    }

The permission to mine content for a given type of usage only is expressed by adding a `constraint` property to the Permission. The usage type is expressed as a `purpose` value on a `leftOperand` property, the `operator` property takes `eq` as value and the `rightOperand` property takes one of the following values:
Definition: designates research purposes.
Label: Research purpose
Identifier: http://www.w3.org/ns/tdmrep#research
Included in: http://www.w3.org/ns/odrl/2/rightOperand
Definition: designates non-research purposes, including commercial ones.
Label: Non-research purpose
Identifier: http://www.w3.org/ns/tdmrep#non-research
Included in: http://www.w3.org/ns/odrl/2/rightOperand
    {
      "@context": [
        "http://www.w3.org/ns/odrl.jsonld",
        {"tdm": "http://www.w3.org/ns/tdmrep#"}
      ],
      "@type": "Offer",
      "profile": "http://www.w3.org/ns/tdmrep",
      "uid": "https://provider.com/policies/policy-a",
      ...
      "permission": [{
        "target": "https://provider.com/research-papers",
        "action": "tdm:mine",
        "constraint": [{
          "leftOperand": "purpose",
          "operator": "eq",
          "rightOperand": "tdm:research"
        }]
      }]
    }

In this example, the rightsholder requires TDM Actors to contact them to obtain licensing rights. The rightsholder provides detailed contact information using the W3C vCard Ontology.
Important note: TDM Actors which benefit from Article 3 of the DSM Directive do not have to comply with this requirement.

    {
      "@context": [
        "http://www.w3.org/ns/odrl.jsonld",
        {"tdm": "http://www.w3.org/ns/tdmrep#"}
      ],
      "@type": "Offer",
      "profile": "http://www.w3.org/ns/tdmrep",
      "uid": "https://provider.com/policies/policy-a",
      "assigner": {
        "uid": "https://provider.com",
        "vcard:fn": "Provider",
        "vcard:nickname": "PRV",
        "vcard:hasEmail": "mailto:contact@provider.com",
        "vcard:hasAddress": {
          "vcard:street-address": "111 Street Address",
          "vcard:postal-code": "5555",
          "vcard:locality": "Espérance",
          "vcard:country-name": "France"
        },
        "vcard:hasTelephone": "tel:+61755555555",
        "vcard:hasURL": "https://provider.com/tdm/licensing.html"
      },
      "permission": [{
        "target": "https://provider.com/research-papers",
        "action": "tdm:mine",
        "duty": [{
          "action": "obtainConsent"
        }]
      }]
    }
In this example, the rightsholder expresses that non-research Actors from any country can mine its content if they agree to pay a fee.
    {
      "@context": [
        "http://www.w3.org/ns/odrl.jsonld",
        {"tdm": "http://www.w3.org/ns/tdmrep#"}
      ],
      "@type": "Offer",
      "profile": "http://www.w3.org/ns/tdmrep",
      "uid": "https://provider.com/policies/policy-a",
      "assigner": {
        "uid": "https://provider.com",
        "vcard:fn": "Provider",
        "vcard:hasEmail": "mailto:contact@provider.com"
      },
      "permission": [{
        "target": "https://provider.com/research-papers",
        "action": "tdm:mine",
        "duty": [{
          "action": "compensate"
        }],
        "constraint": [{
          "leftOperand": "purpose",
          "operator": "eq",
          "rightOperand": "tdm:non-research"
        }]
      }]
    }
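A TDM Agent that has retrieved such a Policy may want to reduce each permission to the few facts it needs: the target, the duties and the purposes it applies to. The sketch below does this for the policy of this last example. The function name `summarize_permissions` and the shape of the returned dictionaries are hypothetical; the code reads the JSON as plain data and does not perform JSON-LD or ODRL processing.

```python
import json

POLICY_JSON = """
{
  "@context": ["http://www.w3.org/ns/odrl.jsonld",
               {"tdm": "http://www.w3.org/ns/tdmrep#"}],
  "@type": "Offer",
  "profile": "http://www.w3.org/ns/tdmrep",
  "uid": "https://provider.com/policies/policy-a",
  "assigner": {"uid": "https://provider.com", "vcard:fn": "Provider",
               "vcard:hasEmail": "mailto:contact@provider.com"},
  "permission": [{
    "target": "https://provider.com/research-papers",
    "action": "tdm:mine",
    "duty": [{"action": "compensate"}],
    "constraint": [{"leftOperand": "purpose", "operator": "eq",
                    "rightOperand": "tdm:non-research"}]
  }]
}
"""


def summarize_permissions(policy: dict) -> list:
    """List target, action, duties and purposes for each permission of a Policy."""
    summaries = []
    for permission in policy.get("permission", []):
        duties = [duty.get("action") for duty in permission.get("duty", [])]
        # Only purpose constraints using the "eq" operator are considered.
        purposes = [
            constraint.get("rightOperand")
            for constraint in permission.get("constraint", [])
            if constraint.get("leftOperand") == "purpose"
            and constraint.get("operator") == "eq"
        ]
        summaries.append({
            "target": permission.get("target"),
            "action": permission.get("action"),
            "duties": duties,
            "purposes": purposes,
        })
    return summaries


policy = json.loads(POLICY_JSON)
summary = summarize_permissions(policy)
```

An empty `purposes` list means the permission carries no purpose constraint in this simplified reading.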