RDF [[RDF11-CONCEPTS]] describes a graph-based data model for making claims about the world and provides the foundation for reasoning upon that graph of information. At times, it becomes necessary to compare the differences between sets of graphs, digitally sign them, or generate short identifiers for graphs via hashing algorithms. This document outlines an algorithm for normalizing RDF datasets such that these operations can be performed.

This document describes the URDNA2015 algorithm for canonicalizing RDF datasets, which was the input from the W3C Credentials Community Group published as [[CCG-RDC-FINAL]]. There are other canonicalization algorithms actively being considered by the Working Group – notably [[Hogan-Canonical-RDF]]; future versions of this document may change accordingly. See Issue 6 ("Compare the two algorithms, and decide on basis for our work") and Issue 10 ("C14N choice criteria") for further discussion.

Introduction

When data scientists discuss canonicalization, they do so in the context of achieving a particular set of goals. Since the same information may sometimes be expressed in a variety of different ways, it often becomes necessary to transform each of these different ways into a single, standard representation. With a standard representation, the differences between two different sets of data can be easily determined, a cryptographically-strong hash identifier can be generated for a particular set of data, and a particular set of data may be digitally-signed for later verification.

In particular, this specification is about normalizing RDF datasets, which are collections of graphs. Since a directed graph can express the same information in more than one way, it requires canonicalization to achieve the aforementioned goals and any others that may arise via serendipity.

Most RDF datasets can be normalized fairly quickly, in terms of algorithmic time complexity. However, those that contain nodes that do not have globally unique identifiers pose a greater challenge. Normalizing these datasets presents the graph isomorphism problem, a problem that is believed to be difficult to solve quickly. More formally, it is believed to be an NP-Intermediate problem, that is, neither known to be solvable in polynomial time nor NP-complete. Fortunately, existing real world data is rarely modeled in a way that manifests this problem and new data can be modeled to avoid it. In fact, software systems can detect a problematic dataset and may choose to assume it's an attempted denial of service attack, rather than a real input, and abort.

This document outlines an algorithm for generating a canonical serialization of an RDF dataset given an RDF dataset as input. The algorithm is called the Universal RDF Dataset Canonicalization Algorithm 2015 or URDNA2015.

Uses of Dataset Canonicalization

There are different use cases where graph or dataset canonicalization is important:

A canonicalization algorithm is necessary, but not necessarily sufficient, to handle many of these use cases. The use of blank nodes in RDF graphs and datasets has a long history and creates inevitable complexities. Blank nodes are used for different purposes:

Furthermore, RDF semantics dictate that deserializing an RDF document results in the creation of unique blank nodes, unless it can be determined that on each occasion, the blank node identifies the same resource. This is due to the fact that blank node identifiers are an aspect of a concrete RDF syntax and are not intended to be persistent or portable. Within the abstract RDF model, blank nodes do not have identifiers (although some RDF store implementations may use stable identifiers and may choose to make them portable). See Blank Nodes in [[!RDF11-CONCEPTS]] for more information.

RDF does have a provision for allowing blank nodes to be published in an externally identifiable way through the use of Skolem IRIs, which allow a given RDF store to replace the use of blank nodes in a concrete syntax with IRIs, which then serve to repeatably identify that blank node within that particular RDF store; however, this is not generally useful for talking about the same graph in different RDF stores, or other concrete representations. In any case, a stable blank node identifier defined for one RDF store or serialization is arbitrary, and typically not relatable to the context within which it is used.

This specification defines an algorithm for repeatably creating stable blank node identifiers across different serializations of the same RDF graph (dataset), each of which may use its own blank node identifiers. It does so by grounding each blank node through the nodes to which it is connected, essentially creating Skolem blank node identifiers. As a result, a graph signature can be obtained by hashing a canonical serialization of the resulting normalized dataset, allowing for the isomorphism and digital signing use cases. As blank node identifiers can be stable even with other changes to a graph (dataset), in some cases it is possible to compute the difference between two graphs (datasets), for example if changes are made only to ground triples, or if new blank nodes are introduced which do not create an automorphic confusion with other existing blank nodes. If any information changes that would alter a generated blank node identifier, a resulting diff might indicate a greater set of changes than actually exists.

Add descriptions for relevant historical discussions and prior art:

[[DesignIssues-Diff]]
TimBL's design note on problems with Diff.
[[eswc2014Kasten]]
A Framework for Iterative Signing of Graph Data on the Web.
[[Hogan-Canonical-RDF]]
Aidan Hogan's paper on canonicalizing RDF.
[[HPL-2003-142]]
Jeremy J. Carroll's paper on signing RDF graphs.

How to Read this Document

This document is a detailed specification for an RDF dataset canonicalization algorithm. The document is primarily intended for the following audiences:

To understand the basics in this specification you must be familiar with basic RDF concepts [[!RDF11-CONCEPTS]]. A working knowledge of graph theory and graph isomorphism is also recommended.

Typographical conventions

Terminology

Terms defined by this specification

input dataset
The abstract RDF dataset that is provided as input to the algorithm.
normalized dataset
The immutable, abstract RDF dataset and a set of normalized blank node identifiers that are produced as output by the algorithm. A normalized dataset is a restriction on an RDF dataset where all nodes are labeled, and blank nodes are labeled with canonical blank node identifiers consistent with running this algorithm on a base RDF dataset. A concrete serialization of a normalized dataset MUST label all blank nodes using these stable blank node identifiers.
identifier issuer
An identifier issuer is used to issue new blank node identifiers. It maintains a blank node identifier issuer state.
hash
The lowercase, hexadecimal representation of a message digest.
hash algorithm
The hash algorithm used by URDNA2015, namely, SHA-256.
mention
A node is mentioned in a quad if it is a component of that quad, as a subject, predicate, object, or graph name.
mention set
The set of all quads in a dataset that mention a node n is called the mention set of n, denoted Qn.
gossip path
The gossip path between two blank nodes contained within quads in a dataset is a sequence of nodes and quads such that the first quad includes the starting node as a component, the last quad includes the ending node as a component, and each quad in the path contains both the preceding and following nodes.

Terms defined by cited specifications

string
A string is a sequence of zero or more Unicode characters.
true and false
Values that are used to express one of two possible boolean states.
IRI
An IRI (Internationalized Resource Identifier) is a string that conforms to the syntax defined in [[RFC3987]].
subject
A subject as specified by [[!RDF11-CONCEPTS]].
predicate
A predicate as specified by [[!RDF11-CONCEPTS]].
object
An object as specified by [[!RDF11-CONCEPTS]].
RDF triple
A triple as specified by [[!RDF11-CONCEPTS]].
RDF graph
An RDF graph as specified by [[!RDF11-CONCEPTS]].
graph name
A graph name as specified by [[!RDF11-CONCEPTS]].
default graph
The default graph as specified by [[!RDF11-CONCEPTS]].
quad
A tuple composed of subject, predicate, object, and graph name. This is a generalization of an RDF triple along with a graph name.
Should be defined in RDF Concepts.
canonical n-quads form
The canonicalized representation of a quad is based on canonical N-Triples defined in [[N-Triples]]. A quad in canonical n-quads form represents a graph name in the same manner as a subject, if present, and each quad is terminated with a single new line character (`U+000A`).
At present, there is no existing definition for a canonicalized [[N-Quads]] form. There is a definition for canonical N-Triples in [[N-Triples]], which defines the form of a single triple, without the terminating EOL.
RDF dataset
A dataset as specified by [[!RDF11-CONCEPTS]]. For the purposes of this specification, an RDF dataset is considered to be a set of quads.
blank node
A blank node as specified by [[!RDF11-CONCEPTS]]. In short, it is a node in a graph that is neither an IRI, nor a literal.
blank node identifier
A blank node identifier as specified by [[!RDF11-CONCEPTS]]. In short, it is a string that begins with _: that is used as an identifier for a blank node. Blank node identifiers are typically implementation-specific local identifiers; this document specifies an algorithm for deterministically specifying them.
Concrete syntaxes, like [[Turtle]] or [[N-Quads]], prepend blank node identifiers with the _: string to differentiate them from other nodes in the graph. This affects the canonicalization algorithm, which is based on calculating a hash over the representations of quads in this format.
Unicode code point order
This refers to determining the order of two Unicode strings (`A` and `B`), using Unicode Codepoint Collation, as defined in [[XPATH-FUNCTIONS]], which defines a total ordering of strings comparing code points.
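As an informal illustration of Unicode code point order, the following Python sketch relies on the fact that Python compares `str` values code point by code point, so the built-in `sorted` already yields the ordering described here. It contrasts this with UTF-16 code unit order, which differs for characters outside the Basic Multilingual Plane. The helper names `code_point_sorted` and `utf16_code_unit_sorted` are hypothetical and exist only for this example.

```python
def code_point_sorted(strings):
    """Sort strings in Unicode code point order (Python's native str ordering)."""
    return sorted(strings)


def utf16_code_unit_sorted(strings):
    """Sort strings by UTF-16 code units; shown only for contrast."""
    return sorted(strings, key=lambda s: s.encode("utf-16-be"))


# U+1F600 lies outside the Basic Multilingual Plane; U+FF5E lies inside it.
samples = ["\U0001F600", "\uFF5E", "a"]

by_code_point = code_point_sorted(samples)
by_utf16 = utf16_code_unit_sorted(samples)

# The two orderings disagree on the relative position of U+FF5E and U+1F600.
print([hex(ord(s)) for s in by_code_point])
print([hex(ord(s)) for s in by_utf16])
```

An implementation in a language whose native string comparison uses UTF-16 code units would need to take this difference into account when sorting.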

Canonicalization

Canonicalization is the process of transforming an input dataset to a normalized dataset. That is, any two input datasets that contain the same information, regardless of their arrangement, will be transformed into identical normalized datasets. The problem requires directed graphs to be deterministically ordered into sets of nodes and edges. This is easy to do when all of the nodes have globally-unique identifiers, but can be difficult to do when some of the nodes do not. Any nodes without globally-unique identifiers must be issued deterministic identifiers.

Strictly speaking, the normalized dataset must be serialized to be stable, as within a dataset, blank node identifiers have no meaning. This specification defines a normalized dataset to include stable identifiers for blank nodes, but practical uses of this will always generate a canonical serialization of such a dataset.

In time, there may be more than one canonicalization algorithm and, therefore, for identification purposes, this algorithm is named the "Universal RDF Dataset Canonicalization Algorithm 2015" (URDNA2015).

This statement is overly prescriptive and does not include normative language. This spec should describe the theoretical basis for graph canonicalization and describe behavior using normative statements. The explicit algorithms should follow as an informative appendix.

Overview

To determine a canonical labeling, URDNA2015 considers the information connected to each blank node. Nodes with unique first degree information can immediately be issued a canonical identifier via the Issue Identifier algorithm. When a node has non-unique first degree information, it is necessary to determine all information that is transitively connected to it throughout the entire dataset. The Hash First Degree Quads algorithm defines a node’s first degree information via its first degree hash.

Hashes are computed from the information of each blank node. These hashes encode the mentions incident to each blank node. The hash of a string s is the lowercase, hexadecimal representation of the result of passing s through a cryptographic hash function. URDNA2015 uses the SHA-256 hash algorithm.
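A minimal Python sketch of this hash definition is shown below, using the standard `hashlib` module. The function name `hash_string` is hypothetical, and encoding the string as UTF-8 before hashing is an assumption made for this example; the text above does not state an encoding.

```python
import hashlib


def hash_string(s: str) -> str:
    """Return the lowercase hexadecimal SHA-256 digest of a string.

    Assumption: the string is encoded as UTF-8 before being hashed.
    """
    return hashlib.sha256(s.encode("utf-8")).hexdigest()


digest = hash_string("abc")
print(digest)  # 64 lowercase hexadecimal characters
```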

Canonicalization State

When performing the steps required by the canonicalization algorithm, it is helpful to track state in a data structure called the canonicalization state. The information contained in the canonicalization state is described below.

blank node to quads map
A map that relates a blank node identifier to the quads in which it appears in the input dataset.
hash to blank nodes map
A map that relates a hash to a list of blank node identifiers.
canonical issuer
An identifier issuer, initialized with the prefix c14n, for issuing canonical blank node identifiers.
Mapping all blank nodes to use this identifier spec means that an RDF dataset composed of two different RDF graphs will use different identifiers than that for the graphs taken independently. This may happen anyway, due to automorphisms, or overlapping statements, but an identifier based on the resulting hash along with an issue sequence number specific to that hash would stand a better chance of surviving such minor changes, and allow the resulting information to be useful for RDF Diff.

Blank Node Identifier Issuer State

During the canonicalization algorithm, it is sometimes necessary to issue new identifiers to blank nodes. The Issue Identifier algorithm uses an identifier issuer to accomplish this task. The information an identifier issuer needs to keep track of is described below.

identifier prefix
The identifier prefix is a string that is used at the beginning of a blank node identifier. It should be initialized to a string that is specified by the canonicalization algorithm. When generating a new blank node identifier, the prefix is concatenated with an identifier counter. For example, c14n is a proper initial value for the identifier prefix that would produce blank node identifiers like c14n1.
identifier counter
A counter that is appended to the identifier prefix to create a blank node identifier. It is initialized to 0.
issued identifiers map
An ordered map that relates existing identifiers to issued identifiers, to prevent issuance of more than one new identifier per existing identifier, and to allow blank nodes to be reassigned identifiers some time after issuance.

Canonicalization Algorithm

At the time of writing, there are several open issues that will determine important details of the canonicalization algorithm.

The canonicalization algorithm converts an input dataset into a normalized dataset. This algorithm will assign deterministic identifiers to any blank nodes in the input dataset.

Overview

URDNA2015 canonically labels an RDF dataset by assigning each blank node a canonical identifier. In URDNA2015, an RDF dataset D is represented as a set of quads of the form `< s, p, o, g >` where the graph component `g` is empty if and only if the triple `< s, p, o >` is in the default graph. This algorithm considers an RDF dataset to be a set of quads. Two RDF datasets are considered to be isomorphic (i.e., the same modulo blank nodes), if and only if they return the same canonically labeled list of quads via URDNA2015.

URDNA2015 consists of several sub-algorithms. These sub-algorithms are introduced in the following sub-sections. First, we give a high level summary of URDNA2015.

  1. Initialization. Initialize the state needed for the rest of the algorithm, as described in Canonicalization State.
  2. Compute first degree hashes. Compute the first degree hash for each blank node in the dataset using the Hash First Degree Quads algorithm.
  3. Canonically label unique nodes. Assign canonical identifiers via the Issue Identifier algorithm, in Unicode code point order, to each blank node whose first degree hash is unique.
  4. Compute N-degree hashes for non-unique nodes. For each repeated first degree hash (proceeding in Unicode code point order), compute the N-degree hash of every unlabeled blank node that corresponds to the given repeated hash via the Hash N-Degree Quads algorithm.
  5. Canonically label remaining nodes. In Unicode code point order of the N-degree hashes, issue canonical identifiers to each corresponding blank node using the Issue Identifier algorithm. If more than one node produces the same N-degree hash, the order in which these nodes receive a canonical identifier does not matter.
  6. Finish. Return the normalized dataset.

Examples

Algorithm

  1. Create the canonicalization state.
    Explanation

    This has the effect of initializing the blank node to quads map, and the hash to blank nodes map, as well as instantiating a new canonical issuer.

  2. For every quad Q in input dataset:
    1. For each blank node that is a component of Q, add a reference to Q from the [= map/entry | map entry =] for that blank node's identifier in the blank node to quads map, creating a new entry if necessary.
      Explanation

      This establishes the blank node to quads map, relating each blank node with the set of quads of which it is a component.

      Literal components of quads are not subject to any normalization. As noted in Section 3.3 of [[RDF11-CONCEPTS]], literal term equality is based on the lexical form, rather than the literal value, so two literals `"01"^^xs:integer` and `"1"^^xs:integer` are treated as distinct resources.

    Logging

    Log the state of the blank node to quads map:

                  
                
  3. For each [= map/key =] n in the blank node to quads map:
    Explanation

    This step creates a hash for every blank node in the input document. Some blank nodes will lead to a unique hash, while other blank nodes may share a common hash.

    1. Create a hash, hf(n), for n according to the Hash First Degree Quads algorithm.
    2. Add hf(n) and n to hash to blank nodes map, including repetitions, creating a new entry if necessary.
    Logging

    Log the results from the Hash First Degree Quads algorithm.

                  
                
  4. For each hash to identifier list [= map/entry | map entry =] in hash to blank nodes map, code point ordered by hash:
    Explanation

    This step establishes the canonical identifier for blank nodes having a unique hash, which are recorded in the canonical issuer.

    1. If identifier list has more than one entry, continue to the next mapping.
    2. Use the Issue Identifier algorithm, passing canonical issuer and the single blank node identifier, identifier in identifier list to issue a canonical replacement identifier for identifier.
    3. Remove the [= map/entry | map entry =] for hash from the hash to blank nodes map.
    Logging

    Log the assigned canonical identifiers.

                  
                
  5. For each hash to identifier list [= map/entry | map entry =] in hash to blank nodes map, code point ordered by hash:
    Explanation

    This step establishes the canonical identifier for blank nodes having a shared hash. This is done by creating unique blank node identifiers for all blank nodes traversed by the Hash N-Degree Quads algorithm, running through each blank node without a canonical identifier in the order of the hashes established in the previous step.

    Logging

    Log hash and identifier list for this iteration.

                  
                
    1. Create hash path list where each item will be a result of running the Hash N-Degree Quads algorithm.
      Explanation

      This list establishes an order for those blank nodes sharing a common first-degree hash.

    2. For each blank node identifier n in identifier list:
      1. If a canonical identifier has already been issued for n, continue to the next blank node identifier.
      2. Create temporary issuer, an identifier issuer initialized with the prefix b.
      3. Use the Issue Identifier algorithm, passing temporary issuer and n, to issue a new temporary blank node identifier bn to n.
      4. Run the Hash N-Degree Quads algorithm, passing the canonicalization state, n for identifier, and temporary issuer, appending the result to the hash path list.
        Logging

        Include logs for each call to Hash N-Degree Quads algorithm.

                              
                            
    3. For each result in the hash path list, code point ordered by the hash in result:
      Explanation

      The previous step created temporary identifiers for the blank nodes sharing a common first degree hash, which are now used to generate their canonical identifiers.

      1. For each blank node identifier, existing identifier, that was issued a temporary identifier by identifier issuer in result, issue a canonical identifier, in the same order, using the Issue Identifier algorithm, passing canonical issuer and existing identifier.
        Explanation

        In Step 5.2, hash path list was created with an ordered set of results. Each result contained a temporary issuer which recorded temporary identifiers associated with a particular blank node identifier in identifier list. This step processes each returned temporary issuer, in order, and allocates canonical identifiers to the temporary identifier mappings contained within each temporary issuer, creating a full order on the remaining blank nodes with unissued canonical identifiers.

      Logging

      Log newly issued canonical identifiers.

                        
                      
  6. For each quad, q, in input dataset:
    Explanation

    This step populates the normalized dataset with quads, substituting the original blank node identifiers with the newly established canonical blank node identifiers.

    1. Create a copy, quad copy, of q and replace any existing blank node identifier n using the canonical identifiers previously issued by canonical issuer.
    2. Add quad copy to the normalized dataset.
    Logging

    Log the state of the canonical issuer at the completion of the algorithm.

                  
                
  7. Return the normalized dataset.
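The steps above can be sketched in Python for the simple case in which every blank node has a unique first degree hash. This is an illustrative sketch under several assumptions, not the algorithm in full: quads are modeled as 4-tuples of N-Quads term strings with `None` for the default graph, blank node identifiers carry the `_:` prefix, strings are hashed as UTF-8, and step 5 (the Hash N-Degree Quads pass) is replaced by an exception. The names `canonicalize_simple`, `first_degree_hash`, and `is_blank` are hypothetical.

```python
import hashlib
from collections import defaultdict


def is_blank(term):
    """True if the term is a blank node identifier (assumed '_:' prefix)."""
    return term is not None and term.startswith("_:")


def first_degree_hash(quads, ref):
    """Hash a mention set with the reference node as _:a and other blank nodes as _:z."""
    lines = []
    for quad in quads:
        terms = [("_:a" if t == ref else "_:z") if is_blank(t) else t
                 for t in quad if t is not None]
        lines.append(" ".join(terms) + " .\n")
    return hashlib.sha256("".join(sorted(lines)).encode("utf-8")).hexdigest()


def canonicalize_simple(dataset):
    # Steps 1-2: build the blank node to quads map.
    bnode_to_quads = defaultdict(list)
    for quad in dataset:
        for term in set(filter(is_blank, quad)):
            bnode_to_quads[term].append(quad)
    # Step 3: build the hash to blank nodes map.
    hash_to_bnodes = defaultdict(list)
    for bnode, quads in bnode_to_quads.items():
        hash_to_bnodes[first_degree_hash(quads, bnode)].append(bnode)
    # Step 4: label nodes with unique hashes, in code point order of the hash.
    canonical = {}
    for h in sorted(hash_to_bnodes):
        ids = hash_to_bnodes[h]
        if len(ids) > 1:
            # Step 5 would be needed here; it is deliberately not sketched.
            raise NotImplementedError("shared first degree hash")
        canonical[ids[0]] = f"_:c14n{len(canonical)}"
    # Steps 6-7: relabel every quad and return the result.
    return {tuple(canonical.get(t, t) for t in quad) for quad in dataset}


P, Q = "<http://example.org/p>", "<http://example.org/q>"
dataset_a = {("_:b0", P, "_:b1", None), ("_:b1", Q, '"v"', None)}
dataset_b = {("_:x", P, "_:y", None), ("_:y", Q, '"v"', None)}
result_a = canonicalize_simple(dataset_a)
result_b = canonicalize_simple(dataset_b)
print(result_a == result_b)
```

The two example datasets differ only in their input blank node labels, so the sketch relabels them identically. A dataset such as two otherwise identical quads with different blank node subjects would share a first degree hash and raise the exception, since that case requires step 5.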

Issue Identifier Algorithm

This algorithm issues a new blank node identifier for a given existing blank node identifier. It also updates state information that tracks the order in which new blank node identifiers were issued. The order of issuance is important for canonically labeling blank nodes that are isomorphic to others in the dataset.

Overview

The algorithm maintains an issued identifiers map to relate an existing blank node identifier from the input dataset to a new blank node identifier using a given identifier prefix (c14n) with new identifiers issued by appending an incrementing number. For example, when called for a blank node identifier such as e3, it might result in an issued identifier of c14n1.

Algorithm

The algorithm takes an identifier issuer I and an existing identifier as inputs. The output is a new issued identifier. The steps of the algorithm are:

  1. If there is a [= map/entry | map entry =] for existing identifier in issued identifiers map of I, return it.
  2. Generate issued identifier by concatenating identifier prefix with the string value of identifier counter.
  3. Add an [= map/entry =] mapping existing identifier to issued identifier to the issued identifiers map of I.
  4. Increment identifier counter.
  5. Return issued identifier.
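The steps above can be sketched as a small Python class. The class name `IdentifierIssuer` and its attribute names are hypothetical; the sketch relies on Python's `dict` preserving insertion order to stand in for the ordered issued identifiers map.

```python
class IdentifierIssuer:
    """Illustrative identifier issuer holding the state described above."""

    def __init__(self, prefix: str):
        self.prefix = prefix   # identifier prefix, e.g. "c14n"
        self.counter = 0       # identifier counter, initialized to 0
        self.issued = {}       # issued identifiers map (insertion-ordered)

    def issue(self, existing: str) -> str:
        # Step 1: reuse the identifier if one was already issued.
        if existing in self.issued:
            return self.issued[existing]
        # Step 2: concatenate the prefix with the counter value.
        issued = f"{self.prefix}{self.counter}"
        # Step 3: record the mapping.
        self.issued[existing] = issued
        # Step 4: increment the counter.
        self.counter += 1
        # Step 5: return the new identifier.
        return issued


issuer = IdentifierIssuer("c14n")
first = issuer.issue("e3")
second = issuer.issue("e7")
repeat = issuer.issue("e3")  # returns the identifier already issued for e3
print(first, second, repeat)
```

Because the map keeps insertion order, iterating over `issuer.issued` replays the order of issuance, which later steps of the canonicalization algorithm rely on.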

Hash First Degree Quads

This algorithm calculates a hash for a given blank node across the quads in a dataset in which that blank node is a component. If the hash uniquely identifies that blank node, no further examination is necessary. Otherwise, a hash will be created for the blank node using the Hash N-Degree Quads algorithm, invoked via the Canonicalization Algorithm.

Overview

To determine whether the first degree information of a node n is unique, a hash is assigned to its mention set, Qn. The first degree hash of a blank node n, denoted hf(n), is the hash that results from the Hash First Degree Quads algorithm when passing n. Nodes with unique first degree hashes have unique first degree information.

For consistency, blank node identifiers used in Qn are replaced with placeholders in a canonical n-quads serialization of that quad. Every blank node component is replaced with either a or z, depending on whether that component is n or not.

The resulting serialized quads are then code point ordered, concatenated, and hashed. This hash is the first degree hash of n, hf(n).
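The following Python sketch illustrates this overview. It assumes quads are 4-tuples of N-Quads term strings with `None` for the default graph, that blank node identifiers carry the `_:` prefix, and that the concatenated string is encoded as UTF-8 before hashing. The function name `hash_first_degree_quads` and the simplified serialization are assumptions made for the example, not a complete canonical n-quads serializer.

```python
import hashlib


def hash_first_degree_quads(mention_set, reference_id):
    """Return the first degree hash of reference_id over its mention set."""
    nquads = []
    for quad in mention_set:
        terms = []
        for term in quad:
            if term is None:          # default graph: no graph name component
                continue
            if term.startswith("_:"):
                # Placeholder a for the reference node, z for any other blank node.
                term = "_:a" if term == reference_id else "_:z"
            terms.append(term)
        nquads.append(" ".join(terms) + " .\n")
    nquads.sort()                      # Unicode code point order
    return hashlib.sha256("".join(nquads).encode("utf-8")).hexdigest()


P, Q = "<http://example.org/p>", "<http://example.org/q>"
# The same two statements written with two different sets of input labels.
mentions_1 = [("_:e0", P, "_:e1", None), ("_:e0", Q, '"v"', None)]
mentions_2 = [("_:y", Q, '"v"', None), ("_:y", P, "_:x", None)]

h1 = hash_first_degree_quads(mentions_1, "_:e0")
h2 = hash_first_degree_quads(mentions_2, "_:y")
h_other = hash_first_degree_quads([("_:e0", P, "_:e1", None)], "_:e1")
print(h1 == h2, h1 == h_other)
```

Since the input labels are replaced by placeholders and the serialized quads are sorted, the hash depends neither on the labels nor on the order in which the quads are supplied.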

Examples

Algorithm

This algorithm takes the canonicalization state and a reference blank node identifier as inputs.

  1. Initialize nquads to an empty list. It will be used to store quads in canonical n-quads form.
  2. Get the list of quads quads from the [= map/entry | map entry =] for reference blank node identifier in the blank node to quads map.
  3. For each quad quad in quads:
    1. Serialize the quad in canonical n-quads form and append the result to nquads, applying the following special rule:
      1. If any component in quad is a blank node, then serialize it using a special identifier as follows:
        1. If the blank node's existing blank node identifier matches the reference blank node identifier then use the blank node identifier a, otherwise, use the blank node identifier z.
  4. Sort nquads in Unicode code point order.
  5. Return the hash that results from passing the sorted and concatenated nquads through the hash algorithm.
    Logging

    Log the inputs and result of running this algorithm.

                  
                

Hash N-Degree Quads

This algorithm calculates a hash for a given blank node across the quads in a dataset in which that blank node is a component for which the hash does not uniquely identify that blank node. This is done by expanding the search from quads directly referencing that blank node (the mention set), to those quads which contain nodes which are also components of quads in the mention set, called the gossip path. This process proceeds in ever greater degrees of indirection until a unique hash is obtained.

The 'path' terminology could also be changed to better indicate what a path is (a particular deterministic serialization for a subgraph/subdataset of nodes without globally-unique identifiers).

Overview

Usually, when trying to determine if two nodes in a graph are equivalent, you simply compare their identifiers. However, what if the nodes don't have identifiers? Then you must determine if the two nodes have equivalent connections to equivalent nodes all throughout the whole graph. This is called the graph isomorphism problem. This algorithm approaches this problem by considering how one might draw a graph on paper. You can test to see if two nodes are equivalent by drawing the graph twice. The first time you draw the graph the first node is drawn in the center of the page. If you can draw the graph a second time such that it looks just like the first, except the second node is in the center of the page, then the nodes are equivalent. This algorithm essentially defines a deterministic way to draw a graph where, if you begin with a particular node, the graph will always be drawn the same way. If two graphs are drawn the same way with two different nodes, then the nodes are equivalent. A hash is used to indicate a particular way that the graph has been drawn and can be used to compare nodes.

When two blank nodes have the same first degree hash, extra steps must be taken to detect global, or N-degree, distinctions. All information that is in any way connected to the blank node n through other blank nodes, even transitively, must be considered.

To consider all transitive information, the algorithm traverses and encodes all possible paths of incident mentions emanating from n, called gossip paths, that reach every unlabeled blank node connected to n. Each unlabeled blank node is assigned a temporary identifier in the order in which it is reached in the gossip path being explored. The mentions that are traversed to reach connected blank nodes are encoded in these paths via related hashes. This provides a deterministic way to order all paths coming from n that reach all blank nodes connected to n without relying on input blank node identifiers.

This algorithm works in concert with the main canonicalization algorithm to produce a unique, deterministic identifier for a particular blank node. This hash incorporates all of the information that is connected to the blank node as well as how it is connected. It does this by creating deterministic paths that emanate out from the blank node through any other adjacent blank nodes.

Ultimately, the algorithm selects a shortest gossip path, distributing canonical identifiers to the unlabeled blank nodes in the order in which they appear in this path. The hash of this encoded shortest path, called the N-degree hash of n, distinguishes n from other blank nodes in the dataset.

For clarity, we consider a gossip path encoded via the string s to be shortest provided that:

  1. The length of s is less than or equal to the length of any other gossip path string s′.
  2. If s and s′ have the same length (as strings), then s is code point ordered less than or equal to s′.

For example, abc is shorter than bbc, whereas abcd is longer than bcd.
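The two conditions above amount to comparing paths first by length and then by code point order. A minimal Python sketch follows; the names `gossip_path_key` and `is_shorter` are hypothetical, and measuring length in code points (Python's `len` on a `str`) is an assumption made for the example.

```python
def gossip_path_key(path: str):
    """Sort key: length first, then Unicode code point order."""
    return (len(path), path)


def is_shorter(s: str, t: str) -> bool:
    """True if path string s is strictly shorter than t in the sense defined above."""
    return gossip_path_key(s) < gossip_path_key(t)


candidates = ["bbc", "abcd", "abc", "bcd"]
chosen = min(candidates, key=gossip_path_key)
print(chosen, is_shorter("abc", "bbc"), is_shorter("abcd", "bcd"))
```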

The following provides a high level outline for how the N-degree hash of n is computed along the shortest gossip path. Note that the full algorithm considers all gossip paths, ultimately returning the hash of the shortest encoded path.

  1. Compute related hashes. Compute the related hash set Hn for n, i.e., all first degree mentions between n and another blank node. Note that this includes both unlabeled blank nodes and those already issued a canonical identifier (labeled blank nodes).
  2. Explore mentions. Given the related hash x in Hn, record x in the data to hash Dn. Determine whether each blank node reachable via the mention with related hash x has already received an identifier.
    1. Record the identifiers of labeled nodes. If a blank node already has an identifier, record its identifier in Dn once for every mention with related hash x. Skip to the next related hash in Hn and repeat step 2.
    2. Distribute and record temporary identifiers to unlabeled nodes. For each unlabeled blank node, assign it a temporary identifier according to the order in which it is reached in the gossip path, recording its given identifier in Dn (including repetitions). Add each unlabeled node to the recursion list Rn(x) in this same order (omitting repetitions).
    3. Recurse on newly labeled nodes. For each ni in Rn(x)
      1. Record its identifier in Dn
      2. Append < r(i) > to Dn where r(i) is the data to hash that results from returning to step 1, replacing n with ni.
  3. Compute the N-degree hash of n. Hash Dn to return the N-degree hash of n, namely hN(n). Return the updated issuer In that has now distributed temporary identifiers to all unlabeled blank nodes connected to n.

As described above in step 2.3, HN recurses on each unlabeled blank node when it is first reached along the gossip path being explored. This recursion can be visualized as moving along the path from n to the blank node ni that is receiving a temporary identifier. If, when recursing on ni, another unlabeled blank node nj is discovered, the algorithm again recurses. Such a recursion traces out the gossip path from n to nj via ni.

The recursive hash r(i) is the hash returned from the completed recursion on the node ni when computing hN(n). Just as hN(n) is the hash of Dn, we denote the data to hash in the recursion on ni as Di. So, r(i) = h(Di). For each related hash x ∈ Hn, Rn(x) is called the recursion list on which the algorithm recurses.

Examples

Algorithm

An additional input to this algorithm should be added that allows it to be optionally skipped and throw an error if any equivalent related hashes were produced that must be permuted during step 5.4.4. For practical uses of the algorithm, this step should never be encountered and could be turned off as a security measure, which would disable the canonicalization of datasets that need to run it.

The inputs to this algorithm are the canonicalization state, the identifier for the blank node to recursively hash quads for, and a path identifier issuer, which is an identifier issuer that issues temporary blank node identifiers. The output from this algorithm is a hash and the identifier issuer used to help generate it.

Logging

Log the inputs to the algorithm.

          
        
  1. Create a new map Hn for relating hashes to related blank nodes.
  2. Get a reference, quads, to the list of quads from the map entry for identifier in the blank node to quads map.
    Explanation

    quads is the mention set of identifier.

    Logging

    Log the quads from the mention set of identifier.

                  
                
  3. For each quad in quads:
    Explanation

    This loop calculates the related hash Hn for other blank nodes within the mention set of identifier.

    1. For each component in quad, where component is the subject, object, or graph name, and it is a blank node that is not identified by identifier:
      1. Set hash to the result of the Hash Related Blank Node algorithm, passing the blank node identifier for component as related, quad, issuer, and position as either s, o, or g based on whether component is a subject, object, or graph name, respectively.
      2. Add a mapping of hash to the blank node identifier for component to Hn, adding an entry as necessary.
    Logging

    Include the logs for each iteration of the Hash Related Blank Node algorithm and the resulting Hn.

                  
                
  4. Create an empty string, data to hash.
  5. For each related hash to blank node list mapping in Hn, code point ordered by related hash:
    Explanation

    This loop explores the gossip paths for each related blank node that shares a common related hash with respect to identifier, finding the shortest such path (the chosen path). This determines how canonical identifiers for otherwise commonly hashed blank nodes are chosen.

    Each path is represented by the concatenation of the identifiers for each related blank node – either the issued identifier, or a temporary identifier created using a copy of issuer. Those for which temporary identifiers were issued are later recursed over using this algorithm.

    Logging

    Log the value of related hash and state of data to hash.

                  
                
    1. Append the related hash to the data to hash.
    2. Create a string chosen path.
    3. Create an unset chosen issuer variable.
    4. For each permutation p of blank node list:
      Logging

      Log each permutation p.

                        
                      
      1. Create a copy of issuer, issuer copy.
      2. Create a string path.
      3. Create a recursion list, to store blank node identifiers that must be recursively processed by this algorithm.
      4. For each related in p:
        1. If a canonical identifier has been issued for related by canonical issuer, append the string _:, followed by the canonical identifier for related, to path.
          Explanation

          A canonical identifier may have been generated before calling this algorithm, if it was issued from an earlier call to the Hash First Degree Quads algorithm. There is no reason to recurse and apply the algorithm to any related blank node that has already been assigned a canonical identifier. Using the canonical identifier also distinguishes it from any temporary identifier, allowing for even greater efficiency in finding the chosen path.

        2. Otherwise:
          1. If issuer copy has not issued an identifier for related, append related to recursion list.
            Explanation

            Temporarily labeled nodes have identifiers recorded in issuer copy, which is later used to recursively call this algorithm, so that eventually all nodes are given canonical identifiers.

          2. Use the Issue Identifier algorithm, passing issuer copy and related, and append the string _:, followed by the result, to path.
        3. If chosen path is not empty and the length of path is greater than or equal to the length of chosen path and path is greater than chosen path when considering code point order, then skip to the next permutation p.
          Explanation

          If path is already at least as long as the prospective chosen path and sorts after it in code point order, it can no longer become the chosen path, so we can terminate this iteration early.

        Explanation

        path is used to generate a hash at a later step; in this respect, it is similar to the Hash First Degree Quads algorithm, which hashes the N-Quads serialization of quads. For the sake of consistency, the N-Quads representation of blank node identifiers is used in these steps, hence the usage of the _: string.

        Logging

        Log related and path.

                              
                            
      5. For each related in recursion list:
        Explanation

        The prospective path is extended with the hash resulting from recursively calling this algorithm on each related blank node issued a temporary identifier.

        Logging

        Log recursion list and path.

                              
                            
        1. Set result to the result of recursively executing the Hash N-Degree Quads algorithm, passing the canonicalization state, related for identifier, and issuer copy for path identifier issuer.
          Logging

          Log related and include logs for each recursive call to Hash N-Degree Quads algorithm.

                                    
                                  
        2. Use the Issue Identifier algorithm, passing issuer copy and related; append the string _:, followed by the result, to path.
        3. Append <, the hash in result, and > to path.
        4. Set issuer copy to the identifier issuer in result.
        5. If chosen path is not empty and the length of path is greater than or equal to the length of chosen path and path is greater than chosen path when considering code point order, then skip to the next p.
          Explanation

          If path is already at least as long as the prospective chosen path and sorts after it in code point order, it can no longer become the chosen path, so we can terminate this iteration early.

      6. If chosen path is empty or path is less than chosen path when considering code point order, set chosen path to path and chosen issuer to issuer copy.
    5. Append chosen path to data to hash.
      Logging

      Log chosen path and data to hash.

                        
                      
    6. Replace issuer, by reference, with chosen issuer.
  6. Return issuer and the hash that results from passing data to hash through the hash algorithm.
    Logging

    Log issuer and results from passing data to hash through the hash algorithm.
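The steps above can be sketched in Python as follows. This is a reading aid under stated assumptions, not a conforming implementation: the `Issuer` class, the dictionary layout of `state` (keys `quads` and `canonical_issuer`), quads represented as dictionaries with keys `s`, `p`, `o`, `g`, and the `hash_related` callable standing in for the Hash Related Blank Node algorithm are all hypothetical, and SHA-256 is assumed as the hash algorithm. Logging is omitted.

```python
import hashlib
from itertools import permutations


class Issuer:
    """Hypothetical identifier issuer: a prefix plus a map of issued identifiers."""

    def __init__(self, prefix="b"):
        self.prefix = prefix
        self.issued = {}  # existing identifier -> issued identifier

    def copy(self):
        clone = Issuer(self.prefix)
        clone.issued = dict(self.issued)
        return clone

    def has(self, existing):
        return existing in self.issued

    def issue(self, existing):
        # Return the identifier already issued, or issue the next one.
        if existing not in self.issued:
            self.issued[existing] = f"{self.prefix}{len(self.issued)}"
        return self.issued[existing]


def _is_worse(path, chosen_path):
    # Early-exit test used in steps 5.4.4.3 and 5.4.5.5.
    return bool(chosen_path) and len(path) >= len(chosen_path) and path > chosen_path


def hash_n_degree_quads(state, identifier, issuer, hash_related):
    # Steps 1-3: group related blank nodes by their related hash (Hn).
    h_n = {}
    for quad in state["quads"][identifier]:
        for position in ("s", "o", "g"):
            component = quad.get(position)
            if component and component.startswith("_:") and component != identifier:
                related_hash = hash_related(state, component, quad, issuer, position)
                h_n.setdefault(related_hash, []).append(component)

    data_to_hash = ""  # step 4
    for related_hash in sorted(h_n):  # step 5; Python orders str by code point
        data_to_hash += related_hash
        chosen_path, chosen_issuer = "", None
        for p in permutations(h_n[related_hash]):  # step 5.4
            issuer_copy, path, recursion_list = issuer.copy(), "", []
            skip = False
            for related in p:  # step 5.4.4
                if state["canonical_issuer"].has(related):
                    path += "_:" + state["canonical_issuer"].issue(related)
                else:
                    if not issuer_copy.has(related):
                        recursion_list.append(related)
                    path += "_:" + issuer_copy.issue(related)
                if _is_worse(path, chosen_path):
                    skip = True
                    break
            if skip:
                continue
            for related in recursion_list:  # step 5.4.5
                result_hash, result_issuer = hash_n_degree_quads(
                    state, related, issuer_copy, hash_related)
                path += "_:" + issuer_copy.issue(related)
                path += "<" + result_hash + ">"
                issuer_copy = result_issuer
                if _is_worse(path, chosen_path):
                    skip = True
                    break
            if skip:
                continue
            if not chosen_path or path < chosen_path:  # step 5.4.6
                chosen_path, chosen_issuer = path, issuer_copy
        data_to_hash += chosen_path  # step 5.5
        issuer = chosen_issuer  # step 5.6: rebinds locally; callers use the return value
    digest = hashlib.sha256(data_to_hash.encode("utf-8")).hexdigest()
    return digest, issuer  # step 6
```

The sketch returns the chosen issuer instead of replacing the input issuer by reference, so the caller's issuer object is left unchanged and the returned one must be used. The caller is expected to have issued a temporary identifier for `identifier` before the first call, which is what keeps the recursion from revisiting it.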

                  
                

Privacy Considerations

Selective Disclosure Schemes

Add text that warns implementers using this specification in selective disclosure schemes that graph structure might reveal information about the entity disclosing the information. For example, knowing that a blank node contains two triples vs. five triples might reveal that the entity that is disclosing the information is a part of a subclass of a population, which might be enough to disclose information beyond what the discloser intended to disclose.

Security Considerations

Dataset Poisoning

Add text that warns that attackers can construct datasets which are known to take large amounts of compute time to canonize. The algorithm has a mechanism to detect and prevent this sort of abuse, but implementers need to ensure that they think holistically about their system such as what happens if they don't have rate limiting on a service exposed to the Internet and they are the subject of a DDoS. Default mechanisms that prevent excessive use of compute when an attacker sends a poisoned dataset might be different from system to system.
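As one illustration of a default mechanism of this kind, an implementation might cap the number of recursive hashing calls it is willing to make for a single dataset. The sketch below is an assumption about how such a cap could look, not the mechanism defined by the algorithm; the names `WorkBudget`, `tick`, and `PoisonedDatasetError` are hypothetical, and a suitable limit depends on the deployment.

```python
class PoisonedDatasetError(Exception):
    """Raised when canonicalization exceeds its work budget (hypothetical name)."""


class WorkBudget:
    """Counts units of work and aborts once a configured limit is exceeded."""

    def __init__(self, max_calls):
        self.max_calls = max_calls
        self.calls = 0

    def tick(self):
        # Intended to be called once at the start of each recursive hashing call.
        self.calls += 1
        if self.calls > self.max_calls:
            raise PoisonedDatasetError(
                f"aborted after {self.max_calls} recursive hashing calls"
            )
```

A budget like this bounds the cost of a single request only; it does not replace rate limiting or other service-level protections.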

Formal Verification Incomplete

Add text that warns implementers that, while the algorithm has an associated mathematical proof that has had peer review, a W3C WG has reviewed the algorithms, and there are multiple interoperable implementations, a formal proof using a system such as Coq isn't available at this time. We are highly confident of the correctness of the algorithm, but we will not be able to say with 100% certainty that it is correct until we have a formal, machine-based verification of the proof. Any system that utilizes this canonicalization mechanism should have a backup canonicalization mechanism, such as JCS, or other mitigations, such as schema-based validation, ready in the event that an unrecoverable flaw is found in this algorithm.

Use Cases

TBD

Examples

Duplicate Paths

This example illustrates a more complicated case in which the same paths through blank nodes are duplicated in a graph, but use different blank node identifiers.

The image represents the graph described in the following code block.

An illustration of a graph with duplicated paths.
Image available in SVG.

The following is a summary of the more detailed execution log found here.

Double Circle

This example illustrates another complicated case: nodes that are doubly connected in opposite directions.

The image represents the graph described in the following code block.

An illustration of a graph with back-and-forth links between nodes.
Image available in SVG.

The example is not explored in detail, but the execution log found here shows examples of more complicated pathways through the algorithm.

URGNA2012

A previous version of this algorithm has light deployment. For purposes of identification, the algorithm is called the "Universal RDF Graph Canonicalization Algorithm 2012" (URGNA2012), and differs from the stated algorithm in the following ways:

Changes since the First Public Working Draft of 24 November 2022

Acknowledgements

The editors would like to thank Jeremy Carroll for his work on the graph canonicalization problem, Gavin Carothers for providing valuable feedback and testing input for the algorithm defined in this specification, Sir Tim Berners-Lee for his thoughts on graph canonicalization over the years, and Jesús Arias Fisteus for his work on a similar algorithm.

Acknowledge CCG members.