This is a supporting document for the proposed RDF Dataset Canonicalization and Hash Working Group Charter, providing some extra explanation of the problem space and associated use cases.
For a precise definition of the various terms and concepts, the reader should refer to the formal RDF specification [[rdf11-concepts]].
R, R' and S each denote an RDF Dataset [[rdf11-concepts]].
R = S denotes that R and S are identical RDF Datasets.
Two RDF Datasets are identical if and only if they have the same default graph (under set equality) and the same set of named graphs (under set equality).
If R and S are identical, we may equivalently say that they are the same RDF Dataset.
R ≈ S denotes that R and S are isomorphic RDF Datasets.
Specifically, R is isomorphic to S if and only if the blank nodes of R can be mapped (i.e., relabeled) to the blank nodes of S in a one-to-one manner, yielding an RDF Dataset R' such that R' = S.
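The difference between R = S and R ≈ S can be illustrated with a minimal sketch. It assumes a simplified model that is not part of the original text: a dataset is a Python set of `(subject, predicate, object, graph)` tuples of strings, the graph is `None` for the default graph, and blank nodes are strings starting with `_:`. The helper names are hypothetical, and the search simply tries every one-to-one relabeling, so it is only usable for very small datasets.

```python
from itertools import permutations

def is_blank(term):
    # Assumption of this sketch: blank nodes are strings such as "_:b0".
    return term is not None and term.startswith("_:")

def blank_nodes(dataset):
    return sorted({t for quad in dataset for t in quad if is_blank(t)})

def relabel(dataset, mapping):
    # Replace blank nodes according to mapping; other terms are unchanged.
    return {tuple(mapping.get(t, t) for t in quad) for quad in dataset}

def isomorphic(r, s):
    """Return True if some one-to-one blank node relabeling R' of r gives R' = s."""
    r_bnodes, s_bnodes = blank_nodes(r), blank_nodes(s)
    if len(r) != len(s) or len(r_bnodes) != len(s_bnodes):
        return False
    # Brute force: try every bijection from r's blank nodes to s's.
    for perm in permutations(s_bnodes):
        if relabel(r, dict(zip(r_bnodes, perm))) == s:
            return True
    return False

KNOWS = "<http://example.org/knows>"
NAME = "<http://example.org/name>"

r = {("_:a", KNOWS, "_:b", None), ("_:b", NAME, '"Bob"', None)}
s = {("_:y", KNOWS, "_:x", None), ("_:x", NAME, '"Bob"', None)}

print(r == s)            # False: the blank node labels differ
print(isomorphic(r, s))  # True: _:a -> _:y, _:b -> _:x
```

Set equality plays the role of R = S here, while `isomorphic` plays the role of R ≈ S.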
RDF Dataset Canonicalization is a function C that maps an RDF Dataset to an RDF Dataset in a manner that satisfies the following two properties for all RDF Datasets R and S:
We may refer to C(R) as the canonical form of R (under C).
Such a canonicalization function can be implemented, in practice, as a procedure that deterministically labels all blank nodes of an RDF Dataset in a one-to-one manner, without depending on any feature of the input serialization (blank node labels, order of the triples, etc.).
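To make the idea of a deterministic labeling procedure concrete, here is a deliberately naive sketch. It is not one of the published algorithms: it tries every assignment of fresh labels and keeps the lexicographically smallest result, which takes factorial time. It assumes the same simplified model as before (quads as tuples of strings, `None` for the default graph, blank nodes starting with `_:`), and the `_:c14n` label prefix is an arbitrary choice. The properties it aims for are the ones usually required of a canonical labeling, stated here as an assumption: the output is isomorphic to the input, and isomorphic inputs produce identical outputs.

```python
from itertools import permutations

def is_blank(term):
    # Assumption of this sketch: blank nodes are strings such as "_:b0".
    return term is not None and term.startswith("_:")

def sort_key(quad):
    # The default graph (None) sorts as an empty string.
    return tuple("" if t is None else t for t in quad)

def canonicalize(dataset):
    """Naive canonical labeling: smallest relabeled dataset over all bijections."""
    bnodes = sorted({t for quad in dataset for t in quad if is_blank(t)})
    labels = [f"_:c14n{i}" for i in range(len(bnodes))]
    best_key, best = None, None
    for perm in permutations(labels):
        mapping = dict(zip(bnodes, perm))
        candidate = sorted(
            (tuple(mapping.get(t, t) for t in quad) for quad in dataset),
            key=sort_key,
        )
        key = [sort_key(q) for q in candidate]
        if best_key is None or key < best_key:
            best_key, best = key, candidate
    return set(best)

KNOWS = "<http://example.org/knows>"
NAME = "<http://example.org/name>"

r = {("_:a", KNOWS, "_:b", None), ("_:b", NAME, '"Bob"', None)}
s = {("_:y", KNOWS, "_:x", None), ("_:x", NAME, '"Bob"', None)}

print(canonicalize(r) == canonicalize(s))  # True
```

Because the minimum is taken over all possible relabelings, the result does not depend on the input labels or on the order in which the quads are supplied. Practical algorithms reach the same goal without enumerating every permutation.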
It is important to emphasize that the term “canonicalization” is used here in its most generic sense, described as:
In computer science, canonicalization […] is a process for converting data that has more than one possible representation into a “standard”, “normal”, or canonical form.
Source: Wikipedia.
Canonicalization, as used in the context of this document and the proposed charter, is defined on an abstract data model (i.e., on RDF Datasets [[rdf11-concepts]]), independently of any specific serialization. (It could also be referred to as a “canonical labeling scheme”.) It is therefore very different from the usage of the term in, for example, the “Canonical XML” [[xml-c14n11]] or the “JSON Canonicalization Scheme” [[rfc8785]] specifications, which are essentially syntactic transformations of the respective documents. Any comparison with those specifications can be misleading.
Though canonical labeling procedures for directed and undirected graphs have been studied for several decades, only in the past 10 years have two generalized and comprehensive approaches been proposed for RDF Graphs and RDF Datasets:
The introduction of Aidan Hogan’s paper [[hogan-2017]] also contains a more thorough description of the underlying mathematical challenges.
One possible approach to calculating the hash of an RDF Dataset R may involve the following steps:
The second step, i.e., the sorting of the dataset serialized as quads, also requires the specification of what could be considered a “canonical” version of N-Quads [[n-quads]] files (handling of whitespace, specifying the exact sorting algorithm to be used, canonical representation of datatype literals, etc.). Considering the simplicity of the N-Quads format, this does not necessitate a significant specification effort, but has value in its own right. That being said, there may be other approaches to defining a hash that do not necessarily involve a sorted N-Quads representation: the Working Group will have to determine the best approach.
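A sketch of the serialize, sort, and hash part of such an approach might look as follows. Several choices are assumptions made for illustration only: the input is taken to be already in canonical form (blank nodes already labeled), each term is already an N-Quads-formatted string, lines are sorted by Unicode code point, and SHA-256 is the hash function. The function names are hypothetical.

```python
import hashlib

def to_nquads_line(quad):
    # quad = (subject, predicate, object, graph); graph is None for the default graph.
    s, p, o, g = quad
    terms = [s, p, o] + ([g] if g is not None else [])
    return " ".join(terms) + " .\n"

def hash_canonical_dataset(canonical_quads):
    # Serialize every quad to one line, sort the lines, then hash the concatenation.
    lines = sorted(to_nquads_line(q) for q in canonical_quads)
    return hashlib.sha256("".join(lines).encode("utf-8")).hexdigest()

quads = [
    ("_:c14n0", "<http://example.org/knows>", "_:c14n1", None),
    ("_:c14n1", "<http://example.org/name>", '"Bob"', "<http://example.org/g>"),
]

# The order in which the quads are supplied does not affect the hash.
print(hash_canonical_dataset(quads) == hash_canonical_dataset(reversed(quads)))  # True
```

The details the paragraph above lists as needing specification (whitespace, sort order, literal representation) are exactly the places where this sketch makes arbitrary choices.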
The main challenge for the Working Group is to provide a standard for the RDF Dataset Canonicalization function.
When the hash of a file is transferred from one system to another and the file is used as is, there is no need for any processing other than checking the value of the hash of the file. This is true for RDF just as for any other data format; this means that any existing hashing function may be used on the original file. However, RDF has many serializations for datasets, notably TriG, JSON-LD, N-Quads or, informally, CBOR-LD. The space-efficient verification use case points to a need, in some circumstances, to transform (usually to minimize) the data that is transferred. In this scenario, a hash of the original file, such as a hash calculated on a JSON-LD file, is not appropriate, as the conversion will invalidate it. A hash of the abstract dataset, on the other hand, will remain valid.
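The contrast between a hash of the file and a hash of the abstract dataset can be shown with a small sketch. To stay self-contained it uses two N-Quads documents that differ only in line order and whitespace, rather than two different serialization formats. The example data contains no blank nodes, so no relabeling is needed. The "parsing" is a crude stand-in that collapses whitespace and would not be safe for literals containing runs of spaces. The function names and the use of SHA-256 are assumptions.

```python
import hashlib

def file_hash(text):
    # Hash of the exact bytes of the document.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def dataset_hash(nquads_text):
    # Crude stand-in for parsing: one statement per non-empty line,
    # whitespace collapsed, duplicates removed, statements sorted.
    statements = {
        " ".join(line.split())
        for line in nquads_text.splitlines()
        if line.strip()
    }
    return file_hash("\n".join(sorted(statements)))

doc_a = (
    "<http://example.org/a> <http://example.org/p> <http://example.org/b> .\n"
    "<http://example.org/b> <http://example.org/p> <http://example.org/c> .\n"
)
doc_b = (
    "<http://example.org/b>   <http://example.org/p> <http://example.org/c> .\n"
    "\n"
    "<http://example.org/a> <http://example.org/p> <http://example.org/b> .\n"
)

print(file_hash(doc_a) == file_hash(doc_b))        # False: the bytes differ
print(dataset_hash(doc_a) == dataset_hash(doc_b))  # True: same set of statements
```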
Some typical use cases for RDF Dataset Canonicalization and/or signatures are: