RDF Dataset Canonicalization and Hash Working Group

This is a supporting document for the RDF Dataset Canonicalization specification, providing some extra explanation of the problem space and associated use cases.

Terminology

For a precise definition of the various terms and concepts, the reader should refer to the formal RDF specification [[rdf11-concepts]].

RDF Datasets

R, R' and S each denote an RDF Dataset [[rdf11-concepts]].

Identical RDF Datasets

R = S denotes that R and S are identical RDF Datasets.

Two RDF Datasets are identical if and only if they have the same default graph (under set equality) and the same set of named graphs (under set equality).

If R and S are identical, we may equivalently say that they are the same RDF Dataset.

Isomorphic RDF Datasets

R ≈ S denotes that R and S are isomorphic RDF Datasets.

In particular, R is isomorphic with S if and only if it is possible to map (i.e., relabel) the blank nodes of R to the blank nodes of S in a one-to-one manner, generating an RDF dataset R' such that R' = S.

RDF Dataset Canonicalization

RDF Dataset Canonicalization is a function C that maps an RDF Dataset to an RDF Dataset in a manner that satisfies the following two properties for all RDF Datasets R and S:

R ≈ C(R) ; and
C(R) = C(S) if and only if R ≈ S.

We may refer to C(R) as the canonical form of R (under C ).

Such a canonicalization function can be implemented, in practice, as a procedure that deterministically labels all blank nodes of an RDF Dataset in a one-to-one manner, without depending on any feature of the input serialization (blank node labels, order of the triples, etc.) of the input RDF Dataset.

It is important to emphasize that the term "canonicalization" is used here in its very generic form, described as:

In computer science, canonicalization […] is a process for converting data that has more than one possible representation into a "standard", "normal", or canonical form.
Source: Wikipedia.

Canonicalization, as used in the context of this document, is defined on an abstract data model (i.e., on an RDF Dataset [[rdf11-concepts]]), regardless of a specific serialization. (It could also be referred to as a "canonical labeling scheme"). It is therefore very different from the usage of the term in, for example, the "Canonical XML" [[xml-c14n11]] or the "JSON Canonicalization Scheme" [[rfc8785]] specifications which are, essentially, syntactic transformations of the respective documents. Any comparison with those documents can be misleading.

The General Problem Space

Though canonical labeling procedures for directed and undirected graphs have been studied for several decades, only in the past 10 years have two generalized and comprehensive approaches been proposed for RDF Graphs and RDF Datasets:

Algorithms for signing RDF data have been proposed in [[carroll-2003]] and [[kasten-et-al-2014]], both reviewed through the anonymous scholarly peer review process. However, these approaches are not sound and complete with respect to isomorphism.
The algorithms defined by Aidan Hogan in [[hogan-2017]], reviewed through the anonymous scholarly peer review process, and also implemented by the author.
The algorithm defined by Rachel Arnold and Dave Longley, see [[arnold-longley-2020]], reviewed by experts at Mirabolic Consulting, implemented and deployed via, e.g., the JSON-LD Signatures package used in several JSON-LD Signature suites.

The introduction of Aidan Hogan’s paper [[hogan-2017]] also contains a more thorough description of the underlying mathematical challenges.

Defining an RDF Dataset Hash

One possible approach to calculating the hash of an RDF Dataset R may imply the following steps:

use an RDF Dataset Canonicalization function C to calculate C(R);
serialize C(R) to quads [[n-quads]] and sort the resulting set of quads;
apply a (traditional) hashing function h on the result of the serialization to yield h(R) ; this can be considered as the cryptographic hash of the Dataset.

The main challenge for the Working Group was to provide a standard for the RDF Dataset Canonicalization function.

The immediate result of the Canonicalization Algorithm is a canonical representation of the RDF dataset represented as N-Quads, using canonical blank node identifiers, and placed in code point order. Since the media type for N-Quads is defined in [[[n-quads]]] [[n-quads]] as `application/n-quads`, no further media registration is needed for this form. An alternate output of the Canonicalization Algorithm is the issued identifiers map, which may be represented as JSON [[RFC8259]].

The Canonicalization Algorithm operates over RDF quads, rather than separately canonicalizing each RDF graph contained within a dataset. This is necessary as blank nodes may be shared among graphs within an RDF dataset, or be used as the graph name for a named graph within the dataset, making it important that the full context of each blank node be considered when assigning canonical identifiers.

When the hash of a file is transferred from one system to another and the file is used as is, there is no need for any processing other than checking the value of the hash of the file. This is true for RDF just as for any other data formats; this means that any existing hash functions may be used on the original file. However, RDF has many serializations for datasets, notably TriG, JSON-LD, N-Quads, and, informally, CBOR-LD. The space efficient verification use case points to a need in some circumstances to transform — usually to minimize — the data that is transferred. In this scenario, a hash on the original file, such as a hash calculated on a JSON-LD file, is not appropriate, as the conversion will make it invalid. A hash of the abstract dataset, on the other hand, will still be valid.

Development of the specification

The working group considered the algorithms proposed by Aidan Hogan [[hogan-2017]] and by Rachel Arnold and Dave Longley [[arnold-longley-2020]]. The latter was the basis for the final report on the topic published by the Credentials Community Group [[CCG-RDC-FINAL]]. It was largely through that effort that the Arnold/Longley approach enjoyed significant implementation experience before the working group was chartered, backed up by a test suite. For this reason, after careful consideration of the alternatives, efforts focused on formalizing the Arnold/Longley approach.

A major issue that the working group had to tackle was precisely what would the serialized output of the algorithm be that could then be hashed? The current N-Triples Recommendation [[n-triples-20140225]] includes the definition of a canonical form. However, the same is not true for N-Quads [[n-quads-20140225]]. This is expected to be added to a forthcoming version of the N-Quads specification but, for now, normative text is included in the RDF Dataset Canonicalization document. It is necessary to serialize the algorithm output as N-Quads, rather than N-Triples, as the input and output are RDF Datasets, which may include multiple graphs.

On a related issue, the RDF Dataset Canonicalization specification includes an important note about the canonical form of literals, especially language tags.

Future work

Standards are not developed in a vacuum. Rather, they are developed as one of several moving pieces that interact to some degree. The Working Group was very aware of the development of RDF 1.2 by the RDF-Star Working Group, for example. Once that large scale work is complete (or nearing completion), it may be desirable to revisit RDF Dataset Canonicalization. The work on RDF Surfaces may also entail further changes to RDF Dataset Canonicalization. Waiting until one or both of these projects had been completed would have introduced a delay to developing RDF Datasaet Canonicalization, perhaps of a year or more. To avoid such a delay, the decision was made to focus purely on the existing RDF 1.1 series of standards.

Another issue facing the community at the time of writing is the base direction of RDF Literals. The addition of language tags to literals is well-used in RDF, however, it is not always clear whether the text should be rendered left to right or right to left. Although this is clearly relevant to dataset canonicalization, the problem has much wider applicability and is, rightly, being tackled by the RDF-Star Working Group. For dataset canonicalization, we simply note that this is an issue that may require future work.

Use Cases and Requirements

Some typical use cases for RDF Dataset Canonicalization and/or signatures are:

Detecting changes in Datasets: When processing RDF Datasets over a period of time, determining if information has changed is helpful. For example, knowing if information has changed helps with data cache invalidation, detecting if expected data has been tampered with or modified, or when debugging unexpected changes in source RDF Datasets.
Space-efficient verification of the contents of Datasets: If unique identification of RDF Datasets is possible, one can cryptographically hash the information to establish a storage-efficient way to verify that the information has not changed over time. One property of cryptographic hash is that one can verify data integrity. For example, a small device sending an RDF Dataset to a remote storage location can compute a cryptographic hash for later use in verifying that all the data arrived intact and has not been tampered with.
(Contributed by Alan Karp.)
Secret confirmation of the contents of Datasets: Since a cryptographic hash is a one-way function, and serves as an abbreviation for the entire RDF Dataset, one can use it in places where secrecy is desired. For example, when ensuring that the transaction history on a distributed ledger is the same between two services, two systems could keep track of the list of transactions in their respective ledgers. Canonicalizing and cryptographically hashing the list of transactions should result in the same cryptographic hash without either party needing to share the list of transactions with the other.
(Contributed by Alan Karp.)
Annotating Datasets with digital signatures and other digital proofs: When publishing or transmitting an RDF Dataset, clearly articulating the entity that published the data and protecting it from undetected modification is useful for mission critical systems. For example, understanding the issuer of a Verifiable Credential and ensuring that it is evident when a Verifiable Presentation has been tampered with underlies the trustworthiness of the encoded information. This process can go through the calculation of the hash of the Verifiable Credential represented as an RDF Dataset and use some standard cryptographic signature method to ensure the integrity of the hash value.
Generating canonical Skolem IRIs for blank nodes: Skolem IRIs have been proposed in RDF 1.1 as a way to replace blank nodes with IRIs in application scenarios where it is preferable to avoid the use of blank nodes. Rather than using an ad hoc scheme to generate Skolem IRIs to replace blank nodes, an alternative is to generate Skolem IRIs in a deterministic manner, such that compliant implementations will generate the same IRIs to replace the same blank nodes in isomorphic copies of an RDF graph or dataset. Such a procedure will produce a canonical version of a Skolemized RDF graph or dataset that can then be used in the context of several of the use-cases mentioned previously.
Semantic consistency of multi-part datasets: The change detection and space-efficient verification use-cases above can be leveraged in situations where a graph or dataset semantically relies on one or several other graph(s), which it refers to through links. Attaching cryptographic hashes to these links would enable the verification of the overall integrity of the set of interconnected graphs. One such example is the import mechanism of OWL: the ontology consumer may wish to verify that the imported ontology is the same as the one used by the author of the importing ontology, otherwise the resulting inferences may differ. Another such example is an EARL test report: the consumer may wish to ensure that the test description pointed to by the report is the one that was actually used for the test.

Introduction