RDF Dataset Canonicalization

Abstract

RDF [RDF11-CONCEPTS] describes a graph-based data model for making claims about the world and provides the foundation for reasoning upon that graph of information. At times, it becomes necessary to compare the differences between sets of graphs, digitally sign them, or generate short identifiers for graphs via hashing algorithms. This document outlines an algorithm for normalizing RDF datasets such that these operations can be performed.

When data scientists discuss canonicalization, they do so in the context of achieving a particular set of goals. Since the same information may sometimes be expressed in a variety of different ways, it often becomes necessary to transform each of these different ways into a single, standard representation. With a standard representation, the differences between two different sets of data can be easily determined, a cryptographically-strong hash identifier can be generated for a particular set of data, and a particular set of data may be digitally-signed for later verification.

In particular, this specification is about normalizing RDF datasets, which are collections of graphs. Since a directed graph can express the same information in more than one way, it requires canonicalization to achieve the aforementioned goals and any others that may arise via serendipity.

Most RDF datasets can be normalized fairly quickly, in terms of algorithmic time complexity. However, those that contain nodes that do not have globally unique identifiers pose a greater challenge. Normalizing these datasets presents the graph isomorphism problem, a problem that is believed to be difficult to solve quickly. More formally, it is believed to be an NP-Intermediate problem, that is, neither known to be solvable in polynomial time nor NP-complete. Fortunately, existing real world data is rarely modeled in a way that manifests this problem and new data can be modeled to avoid it. In fact, software systems can detect a problematic dataset and may choose to assume it's an attempted denial of service attack, rather than a real input, and abort.

This document outlines an algorithm for generating a canonical serialization of an RDF dataset given an RDF dataset as input. The algorithm is called the Universal RDF Dataset Canonicalization Algorithm 2015 or URDNA2015.

There are different use cases where graph or dataset canonicalization are important:

Determining if one serialization is isomorphic to another.
Digital signing of graphs (datasets) independent of serialization or format.
Comparing two graphs (datasets) to find differences.
Communicating change sets when remotely updating an RDF source.

A canonicalization algorithm is necessary, but not necessarily sufficient, to handle many of these use cases. The use of blank nodes in RDF graphs and datasets has a long history and creates inevitable complexities. Blank nodes are used for different purposes:

when a well known identifier for a node is not known, or the author of a document chooses not to unambiguously name that node,
when a node is used to stitch together parts of a graph and the nodes themselves are not interesting (e.g., RDF Collections in [RDF11-MT]),
when someone is trying to create an intentionally difficult graph topology.

Furthermore, RDF semantics dictate that deserializing an RDF document results in the creation of unique blank nodes, unless it can be determined that on each occasion, the blank node identifies the same resource. This is due to the fact that blank node identifiers are an aspect of a concrete RDF syntax and are not intended to be persistent or portable. Within the abstract RDF model, blank nodes do not have identifiers (although some RDF store implementations may use stable identifiers and may choose to make them portable). See Blank Nodes in [RDF11-CONCEPTS] for more information.

RDF does have a provision for allowing blank nodes to be published in an externally identifiable way through the use of Skolem IRIs, which allow a given RDF store to replace the use of blank nodes in a concrete syntax with IRIs, which then serve to repeatably identify that blank node within that particular RDF store; however, this is not generally useful for talking about the same graph in different RDF stores, or other concrete representations. In any case, a stable blank node identifier defined for one RDF store or serialization is arbitrary, and typically not relatable to the context within which it is used.

This specification defines an algorithm for creating stable blank node identifiers repeatably for different serializations possibly using individualized blank node identifiers of the same RDF graph (dataset) by grounding each blank node through the nodes to which it is connected, essentially creating Skolem blank node identifiers. As a result, a graph signature can be obtained by hashing a canonical serialization of the resulting normalized dataset, allowing for the isomorphism and digital signing use cases. As blank node identifiers can be stable even with other changes to a graph (dataset), in some cases it is possible to compute the difference between two graphs (datasets), for example if changes are made only to ground triples, or if new blank nodes are introduced which do not create an automorphic confusion with other existing blank nodes. If any information which would change the generated blank node identifier, a resulting diff might indicate a greater set of changes than actually exists.

Editor's note

Add descriptions for relevant historical discussions and prior art:

[DesignIssues-Diff]: TimBL's design note on problems with Diff.
[eswc2014Kasten]: A Framework for Iterative Signing of Graph Data on the Web.
[Hogan-Canonical-RDF]: Aiden Hogan's paper on canonicalizing RDF
[HPL-2003-142]: Jeremy J. Carroll's paper on signing RDF graphs.

This document is a detailed specification for an RDF dataset canonicalization algorithm. The document is primarily intended for the following audiences:

Software developers that want to implement an RDF dataset canonicalization algorithm.
Masochists.

To understand the basics in this specification you must be familiar with basic RDF concepts [RDF11-CONCEPTS]. A working knowledge of graph theory and graph isomorphism is also recommended.

This section is non-normative.

The following typographic conventions are used in this specification:

markup: Markup (elements, attributes, properties), machine processable values (string, characters, media types), property names, and file names are in red-orange monospace font.
variable: A variable in pseudo-code or in an algorithm description is italicized.
definition: A definition of a term, to be used elsewhere in this or other specifications, is italicized and in bold.
definition reference: A reference to a definition in this document is underlined and is also an active link to the definition itself.
markup definition reference: References to a definition in this document, when the reference itself is also a markup, is underlined, in a red-orange monospace font, and is also an active link to the definition itself.
external definition reference: A reference to a definition in another document is underlined and italicized, and is also an active link to the definition itself.
markup external definition reference: A reference to a definition in another document, when the reference itself is also a markup, is underlined and italicized in a red-orange monospace font, and is also an active link to the definition itself.
hyperlink: A hyperlink is underlined and in blue.
[reference]: A document reference (normative or informative) is enclosed in square brackets and links to the references section.

Note

Notes are in light green boxes with a green left border and with a "Note" header in green. Notes are always informative.

Example 1

Examples are in light khaki boxes, with khaki left border,
and with a numbered "Example" header in khaki.
Examples are always informative. The content of the example is in monospace font and may be syntax colored.

Examples may have tabbed navigation buttons
to show the results of transforming an example into other representations.

Code examples are generally given in a Turtle or TriG format for brevity,
where each line represents a single triple or quad.
Additionally, have the following implied directives:

BASE <http://example.com/>
PREFIX : <#>

Canonicalization is the process of transforming an input dataset to a normalized dataset. That is, any two input datasets that contain the same information, regardless of their arrangement, will be transformed into identical normalized dataset. The problem requires directed graphs to be deterministically ordered into sets of nodes and edges. This is easy to do when all of the nodes have globally-unique identifiers, but can be difficult to do when some of the nodes do not. Any nodes without globally-unique identifiers must be issued deterministic identifiers.

Editor's note

Strictly speaking, the normalized dataset must be serialized to be stable, as within a dataset, blank node identifiers have no meaning. This specification defines a normalized dataset to include stable identifiers for blank nodes, but practical uses of this will always generate a canonical serialization of such a dataset.

In time, there may be more than one canonicalization algorithm and, therefore, for identification purposes, this algorithm is named the "Universal RDF Dataset Canonicalization Algorithm 2015" (URDNA2015).

Editor's note

This statement is overly prescriptive and does not include normative language. This spec should describe the theoretical basis for graph canonicalization and describe behavior using normative statements. The explicit algorithms should follow as an informative appendix.

This section is non-normative.

To determine a canonical labeling, URDNA2015 considers the information connected to each blank node. Nodes with unique first degree information can immediately be issued a canonical identifier via the Issue Identifier algorithm. When a node has non-unique first degree information, it is necessary to determine all information that is transitively connected to it throughout the entire dataset. 4.7 Hash First Degree Quads defines a node’s first degree information via its first degree hash.

Hashes are computed from the information of each blank node. These hashes encode the mentions incident to each blank node. The hash of a string s, is the lower-case, hexidecimal representation of the result of passing s through a cryptographic hash function. URDNA2015 uses the SHA-256 hash algorithm.

input dataset: The abstract RDF dataset that is provided as input to the algorithm.
normalized dataset: The immutable, abstract RDF dataset and set of normalized blank node identifiers that are produced as output by the algorithm. A normalized dataset is a restriction on an RDF dataset where all nodes are labeled, and blank nodes are labeled with blank node identifiers consistent with running this algorithm on a base RDF dataset. A concrete serialization of an normalized dataset MUST label all blank nodes using these stable blank node identifiers.
identifier issuer: An identifier issuer is used to issue new blank node identifier. It maintains a blank node identifier issuer state.
hash: The lowercase, hexadecimal representation of a message digest.
hash algorithm: The hash algorithm used by URDNA2015, namely, SHA-256.
Unicode code point order: This refers to determining the order of two Unicode strings (A and B), using Unicode Codepoint Collation, as defined in [XPATH-FUNCTIONS], which defines a total ordering of strings comparing code points.
mention set: The set of all quads in a dataset that mention a node n is called its mention set of n, denoted Q_n.
gossip path: The gossip path between two blank nodes contained within quads in a dataset, where the path is a sequence of nodes and quads such that the first quad includes the starting node as an component, and the last quad includes the ending node as an component with each quad in the path containing both the preceding and following nodes.

When performing the steps required by the canonicalization algorithm, it is helpful to track state in a data structure called the canonicalization state. The information contained in the canonicalization state is described below.

blank node to quads map: A data structure that maps a blank node identifier to the quads in which they appear in the input dataset.
hash to blank nodes map: A data structure that maps a hash to a list of blank node identifiers.
canonical issuer: An identifier issuer, initialized with the prefix _:c14n, for issuing canonical blank node identifiers.
Editor's note
Mapping all blank nodes to use this identifier spec means that an RDF dataset composed of two different RDF graphs will use different identifiers then that for the graphs taken independently. This may happen anyway, due to automorphisms, or overlapping statements, but an identifier based on the resulting hash along with an issue sequence number specific to that hash would stand a better chance of surviving such minor changes, and allow the resulting information to be useful for RDF Diff.

During the canonicalization algorithm, it is sometimes necessary to issue new identifiers to blank nodes. The Issue Identifier algorithm uses an identifier issuer to accomplish this task. The information an identifier issuer needs to keep track of is described below.

identifier prefix: The identifier prefix is a string that is used at the beginning of an blank node identifier. It should be initialized to a string that is specified by the canonicalization algorithm. When generating a new blank node identifier, the prefix is concatenated with a identifier counter. For example, _:c14n is a proper initial value for the identifier prefix that would produce blank node identifiers like _:c14n1.
identifier counter: A counter that is appended to the identifier prefix to create an blank node identifier. It is initialized to 0.
issued identifiers list: A list that tracks previously issued identifiers in the order in which they were issued. It also tracks the existing identifier that triggered the issuance, to prevent issuing more than one new identifier per existing identifier, and to allow blank nodes to be reassigned identifiers some time after issuance.

Editor's note

At the time of writing, there are several open issues that will determine important details of the canonicalization algorithm.

Issue 4: What is the output of the c14n algorithm?

Following the discussion that just happened at the TPAC joint meeting with the VC.

An option is to return one particular serialization.
An another option is to return a mapping from bnodes to labels.

I prefer the second solution as it is more generic. It does not preclude which serialization to use downstream. It actually does not impose that you serialize the dataset (imagine storing a dataset in a triple store with canonical blank node labels).

Issue 6: Compare the two algorithms, and decide on basis for our work

At TPAC 2022, the RCH WG had a joint meeting with the VC WG, see minutes.

During that meeting, we had two presentations about canonicalization algorithms:

Presentation by Dave Longley. See also this paper.
Presentation by Aidan Hogan. See also this paper.

The first algorithm has some track record of incubation in the W3C CCG, and it has an existing specification.

On today's RCH WG meeting, there was a proposal to use the W3C CCG specification as a basis, but OTOH there was also a desire to compare the two algorithms to better understand how they differ, and why one might be used over the other.

So this issue is an invitiation to anyone who can tell us more about the differences between the two algorithms and their potential implications.

Issue 7: Support generalized RDF

Generalized RDF is described in RDF 1.1 Concepts and Abstract Syntax.

It removes restrictions on the type of RDF term that can occur in any slot in a quad tuple - literals as subjects or predicates, blank nodes as predicates etc. By implication, that would include RDF-start quoted triples.

RDF 1.1 separately changed "RDF dataset" to allow blank nodes for in the graph slot.

Generalized RDF does arise - for example, in some rules systems.

Covering generalized RDF gives some future proofing.

Issue 8: "Herd-privacy" canonicalization

I completely agree with the importance of the "herd-privacy" canonicalization proposed in #4 (comment) by @dlongley when we use c14n with selective disclosure. However, if I understand it correctly, we would still have to improve the above algorithm; it seems to me that the following normalized datasets CX1 and CX2 are not modified via the above transformation, i.e., CX1==CY1 and CX2==CY2.

CX1 (obtained from JSON-LD Playground) (==CY1)

_:c14n0 <http://schema.org/name> "Alice" .
_:c14n0 <http://schema.org/spouse> _:c14n1 .
_:c14n1 <http://schema.org/name> "Bob" .

CX2 (obtained from JSON-LD Playground) (==CY2)

_:c14n0 <http://schema.org/name> "Carl" .
_:c14n1 <http://schema.org/name> "Alice" .
_:c14n1 <http://schema.org/spouse> _:c14n0 .

Therefore, even if Alice selectively hides the statement about her spouse, anyone can easily guess whether Bob or Carl is Alice's spouse based on the canonicalized identifiers or the order of unrevealed statement:

CY1 with selective disclosure

_:c14n0 <http://schema.org/name> "Alice" .
_:c14n0 <http://schema.org/spouse> _:c14n1 .
### 3rd statement is unrevealed ####

CY2 with selective disclosure

### 1st statement is unrevealed ####
_:c14n1 <http://schema.org/name> "Alice" .
_:c14n1 <http://schema.org/spouse> _:c14n0 .

What we actually wanted seemed like the following result:

CY1'

_:c14n0 <http://schema.org/name> "Alice" .
_:c14n1 <http://schema.org/name> "Bob" .
_:c14n0 <http://schema.org/spouse> _:c14n1 .

CY2'

_:c14n0 <http://schema.org/name> "Alice" .
_:c14n1 <http://schema.org/name> "Carl" .
_:c14n0 <http://schema.org/spouse> _:c14n1 .

I am trying to figure out a solution, but haven't found one yet so just submitting this issue at the moment...

Issue 10: C14N choice criteria documentation

In the meeting on 2022-10-12, we discussed criteria that can be used to make a choice between alternative choices in specific steps in the c14n algorithm. The initial list of suggestions is below. We need to formalize this and, IMO, include it in the explainer doc.

ease of implementation
existing incubation / use in the marketplace
time / resource complexity in solving common datasets
time / resource complexity in solving complex (or poison?) datasets
Existence of formal proofs for the algorithms
Demonstration of review of formal proofs for the algorithms
reusing existing primitives that are available on various platforms
cover real life examples

Issue 11: Recording the canonicalization and hashing applied.

In the meeting on 2022-10-12, there was mention of needing to say which choices were made in generating the hash.

We already have the existing URDNA2015 as the canonicalization algorithm. Anything this working group does that makes any change to that will need to have a way to declare which algorithm was used to produce the resultant canonical form.

There may be good reasons for different hashing and signing algorithms.

We need a naming scheme of choices made, together with a way to transmit that information.

See #4 about the output of canonicalization.

The canonicalization algorithm converts an input dataset into a normalized dataset. This algorithm will assign deterministic identifiers to any blank nodes in the input dataset.

Editor's note

Documenting the algorithm is a WIP, various steps will become more detailed in time.

Issue 23: Explore and resolve `simple` loop editor's notes

See the issue in for the 4.7 Hash First Degree Quads. We should either remove the loop based on simple here but indicate that the original design of the algorithm was to have such a loop, or leave it but inform implementers that it is safe to break after one iteration of the loop (again, indicating why). A future version of this algorithm should make the loop effectual.

This section is non-normative.

URDNA2015 canonically labels an RDF dataset. In URDNA2015, an RDF dataset D is represented as a set of quads of the form < s, p, o, g > where the graph component g is empty if and only if the triple < s, p, o > is in the default graph. This algorithm considers an RDF dataset to be a set of quads. Two RDF datasets are considered to be isomorphic (i.e., the same modulo blank nodes), if and only if they return the same canonically labeled list of quads via URDNA2015.

URDNA2015 consists of several sub-algorithms. These sub-algorithms are introduced in the following sub-sections. First, we give a high level summary of URDNA2015.

Initialization. Initialize the state needed for the rest of the algorithm using 4.3 Canonicalization State.
Compute first degree hashes. Compute the first degree hash for each blank node in the dataset using 4.7 Hash First Degree Quads.
Canonically label unique nodes. Assign canonical identifiers via 4.6 Issue Identifier Algorithm, in Unicode code point order, to each blank node whose first degree hash is unique.
Compute N-degree hashes for non-unique nodes. For each repeated first degree hash (proceeding in Unicode code point order), compute the N-degree hash via 4.9 Hash N-Degree Quads of every unlabeled blank node that corresponds to the given repeated hash.
Canonically label remaining nodes. In Unicode code point order of the N-degree hashes, issue canonical identifiers to each corresponding blank node using 4.6 Issue Identifier Algorithm. If more than one node produces the same N-degree hash, the order in which these nodes receive a canonical identifier does not matter.
Finish. Return the normalized dataset.

Create the canonicalization state.
For every quad Q in input dataset:
1. For each blank node that is a component of Q, add a reference to the Q using the blank node identifier in the blank node to quads map, creating a new entry if necessary.
  Issue 15: Normalize literals? question
  It seems that Q must be normalized, so that literals with different syntactic representations but the same semantic representations are merged, and that two graphs differing in the syntactic representation of a literal will produce the same set of blank node identifiers.
Create a list of non-normalized blank node identifiers non-normalized identifiers and populate it using the keys from the blank node to quads map.
Initialize simple, a boolean flag, to true.
While simple is true, issue canonical identifiers for blank nodes:
1. Set simple to false.
2. Clear hash to blank nodes map.
3. For each blank node identifier n in non-normalized identifiers:
  1. Create a hash, h_f(n), for n according to the Hash First Degree Quads algorithm.
  2. Add h_f(n) and n to hash to blank nodes map, including repetitions, creating a new entry if necessary.
4. For each hash to identifier list mapping in hash to blank nodes map, code point ordered by hash:
  1. If identifier list has more than one entry, continue to the next mapping.
  2. Use the Issue Identifier algorithm, passing canonical issuer and the single blank node identifier, identifier in identifier list to issue a canonical replacement identifier for identifier.
  3. Remove identifier from non-normalized identifiers.
  4. Remove hash from the hash to blank nodes map.
  5. Set simple to true.
For each hash to identifier list mapping in hash to blank nodes map, code point ordered by hash:
1. Create hash path list where each item will be a result of running the Hash N-Degree Quads algorithm.
2. For each blank node identifier n in identifier list:
  1. If a canonical identifier has already been issued for n, continue to the next blank node identifier.
  2. Create temporary issuer, an identifier issuer initialized with the prefix _:b.
  3. Use the Issue Identifier algorithm, passing temporary issuer and n, to issue a new temporary blank node identifier b_n to n.
  4. Run the Hash N-Degree Quads algorithm, passing temporary issuer, and append the result to the hash path list.
3. For each result in the hash path list, code point ordered by the hash in result:
  1. For each blank node identifier, existing identifier, that was issued a temporary identifier by identifier issuer in result, issue a canonical identifier, in the same order, using the Issue Identifier algorithm, passing canonical issuer and existing identifier.
For each quad, q, in input dataset:
1. Create a copy, quad copy, of q and replace any existing blank node identifier n using the canonical identifiers previously issued by canonical issuer.
2. Add quad copy to the normalized dataset.
Return the normalized dataset.

This algorithm issues a new blank node identifier for a given existing blank node identifier. It also updates state information that tracks the order in which new blank node identifiers were issued.

This algorithm takes an identifier issuer I and an existing identifier as inputs. The output is a new issued identifier. The steps of the algorithm are:

If there is already an issued identifier for existing identifier in issued identifiers list of I, return it.
Generate issued identifier by concatenating identifier prefix with the string value of identifier counter.
Append an item to issued identifiers list that maps existing identifier to issued identifier.
Increment identifier counter.
Return issued identifier.

This algorithm calculates a hash for a given blank node across the quads in a dataset in which that blank node is a component. If the hash uniquely identifies that blank node, no further examination is necessary. Otherwise, a hash will be created for the blank node using the algorithm in 4.9 Hash N-Degree Quads invoked via 4.5 Canonicalization Algorithm.

Issue 23: Explore and resolve `simple` loop editor's notes

The result of this algorithm for a particular blank node will always be the same. This is only true because there was a typo in the spec that has now been implemented by many implementations. The design of the algorithm was to use the assigned canonical blank node identifier, if available, instead of _:a or _:z, similar to how it is used in the related hash algorithm, but this text never made it into the spec before implementations moved forward. Therefore, the hashes here never change, making the loop based on the simple flag that calls this algorithm unnecessary; it needs to only run once. A future version of this algorithm should correct this mistake.

This section is non-normative.

To determine whether the first degree information of a node n is unique, a hash is assigned to its mention set, Q_n. The first degree hash of a blank node n, denoted h_f(n), is the hash that results from 4.7 Hash First Degree Quads when passing n. Nodes with unique first degree hashes have unique first degree information.

For consistency, blank node identifiers used in Q_n are replaced with placeholders in an [N-Quads] serialization of that quad. Every blank node component is replaced with either _:a or _:z, depending on if that component is n or not.

The resulting serialized quads are then code point ordered, concatenated, and hashed. This hash is the first degree hash of n, h_f(n).

This section is non-normative.

Example 2: Unique hashes

This example illustrates hashing quads containing blank nodes where hashing the statements mentioning those blank nodes generates unique results.

Figure 1 An illustration of a graph resulting in unique hashes.
Image available in SVG .

:p :q _:b0 .
:p :r _:b1 .
_:b0 :s :u .
_:b1 :t :u .

The algorithm will be called twice, with each blank node (_:b0 and _:b1). Running the algorithm with the reference node _:b0 results in the following quads, after replacing blank nodes:

:p :q _:a .
_:a :s :u .

These are then serialized to N-Quads, concatenated and hashed using the hash algorithm (SHA-256) resulting in 21d1dd5ba21f3dee9d76c0c00c260fa6f5d5d65315099e553026f4828d0dc77a.

The algorithm is run a second time with the reference node _:b1 resulting in the following quads:

:p :r _:a .
_:a :t :u .

These are then serialized to N-Quads, concatenated and hashed as before resulting in 6fa0b9bdb376852b5743ff39ca4cbf7ea14d34966b2828478fbf222e7c764473.

Thus the generated hashes each reference just a single blank node, allowing the canonicalization algorithm to use only the Hash First Degree Quads algorithm.

Example 3: Shared hashes

This example illustrates hashing quads containing blank nodes where hashing the statements mentioning those blank nodes have overlapping results.

Figure 2 An illustration of a graph resulting in shared hashes.
Image available in SVG .

:p :q _:b0 .
:p :q _:b1 .
_:b0 :p _:b2 .
_:b1 :p _:b3 .
_:b2 :r _:b3 .

The algorithm will be called four times, with each blank node (_:b0, _:b1, _:b2, and _:b3). Running the algorithm with the reference node _:b0 results in the following quads, after replacing blank nodes:

:p :q _:a .
_:a :p _:z .

Which hashes to: 3b26142829b8887d011d779079a243bd61ab53c3990d550320a17b59ade6ba36.

Note that using reference node _:b1 results in the same quads, and thus results in the same hash.

Using the reference node _:b2 results in the following quads:

_:z :p _:a .
_:a :r _:z .

Which hashes to: 15973d39de079913dac841ac4fa8c4781c0febfba5e83e5c6e250869587f8659.

Lastly, using the reference node _:b3 results in the following quads:

_:z :r _:a .
_:z :p _:a .

Which hashes to: 3b26142829b8887d011d779079a243bd61ab53c3990d550320a17b59ade6ba36.

The hashes for _:b2 and _:b3 are unique, but _:b0, and _:b1 share a common hash, which will require the use of the Hash N-Degree Quads Algorithm, as it is necessary to consider quads further removed from the direct mentions to determine a unique hash.

This algorithm takes the canonicalization state and a reference blank node identifier as inputs.

Initialize nquads to an empty list. It will be used to store quads in [N-Quads] format.
Get the list of quads quads associated with the reference blank node identifier in the blank node to quads map.
For each quad quad in quads:
1. Serialize the quad in [N-Quads] format with the following special rule:
  1. If any component in quad is an blank node, then serialize it using a special identifier as follows:
    1. If the blank node's existing blank node identifier matches the reference blank node identifier then use the blank node identifier _:a, otherwise, use the blank node identifier _:z.
  Issue 15: Normalize literals? question
  Note potential need to normalize literals to their canonical representation here as well, if not done on the original input dataset.
Sort nquads in Unicode code point order.
Return the hash that results from passing the sorted, joined nquads through the hash algorithm.

This algorithm calculates a hash for a given blank node across the quads in a dataset in which that blank node is a component for which the hash does not uniquely identify that blank node. This is done by expanding the search from quads directly referencing that blank node (the mention set), to those quads which contain nodes which are also components of quads in the mention set, called the gossip path. This process proceeds in every greater degrees of indirection until a unique hash is obtained.

Editor's note

There may also be better names for the two algorithms.

Editor's note

The 'path' terminology could also be changed to better indicate what a path is (a particular deterministic serialization for a subgraph/subdataset of nodes without globally-unique identifiers).

This section is non-normative.

Usually, when trying to determine if two nodes in a graph are equivalent, you simply compare their identifiers. However, what if the nodes don't have identifiers? Then you must determine if the two nodes have equivalent connections to equivalent nodes all throughout the whole graph. This is called the graph isomorphism problem. This algorithm approaches this problem by considering how one might draw a graph on paper. You can test to see if two nodes are equivalent by drawing the graph twice. The first time you draw the graph the first node is drawn in the center of the page. If you can draw the graph a second time such that it looks just like the first, except the second node is in the center of the page, then the nodes are equivalent. This algorithm essentially defines a deterministic way to draw a graph where, if you begin with a particular node, the graph will always be drawn the same way. If two graphs are drawn the same way with two different nodes, then the nodes are equivalent. A hash is used to indicate a particular way that the graph has been drawn and can be used to compare nodes.

When two blank nodes have the same first degree hash, extra steps must be taken to detect global, or N-degree, distinctions. All information that is in any way connected to the blank node n through other blank nodes, even transitively, must be considered.

To consider all transitive information, the algorithm traverses and encodes all possible paths of incident mentions emanating from n, called gossip paths, that reach every unlabeled blank node connected to n. Each unlabeled blank node is assigned a temporary label in the order in which it is reached in the gossip path being explored. The mentions that are traversed to reach connected blank nodes are encoded in these paths via related hashes. This provides a deterministic way to order all paths coming from n that reach all blank nodes connected to n without relying on input blank node labels.

This algorithm works in concert with the main canonicalization algorithm to produce a unique, deterministic identifier for a particular blank node. This hash incorporates all of the information that is connected to the blank node as well as how it is connected. It does this by creating deterministic paths that emanate out from the blank node through any other adjacent blank nodes.

Ultimately, the algorithm selects a shortest gossip path, distributing canonical labels to the unlabeled blank nodes in the order in which they appear in this path. The hash of this encoded shortest path, called the N-degree hash of n, distinguishes n from other blank nodes in the dataset.

For clarity, we consider a gossip path encoded via the string s to be shortest provided that:

The length of s is less than or equal to the length of any other gossip path string s′.
If s and s′ have the same length (as strings), then s is code point ordered less than or equal to s′.

For example, abc is shorter than bbc, whereas abcd is longer than bcd.

The following provides a high level outline for how the N-degree hash of n is computed along the shortest gossip path. Note that the full algorithm considers all gossip paths, ultimately returning the hash of the shortest encoded path.

Compute related hashes. Compute the related hash H_n set for n, i.e., all first degree mentions between n and another blank node. Note that this includes both labeled and unlabeled blank nodes.
Explore mentions. Given the related hash x in H_n, record x in the data to hash D_n. Determine whether each blank node reachable via the mention with related hash x has already received a label.
1. Record the labels of labeled nodes. If a blank node already has a label, record its label in D_n once for every mention with related hash x. Skip to the next related hash in H_n and repeat step 2.
2. Distribute and record temporary labels to unlabeled nodes. For each unlabeled blank node, assign it a temporary label according to the order in which it is reached in the gossip path, recording its given label in D_n (including repetitions). Add each unlabeled node to the recursion list R_n(x) in this same order (omitting repetitions).
3. Recurse on newly labeled nodes. For each n_i in R_n(x)
  1. Record its label in D_n
  2. Append < r(i) > to D_n where r(i) is the data to hash that results from returning to step 1, replacing n with n_i.
Compute the N-degree hash of n. Hash D_n to return the N-degree hash of n, namely h_N(n). Return the updated issuer I_n that has now distributed temporary labels to all unlabeled blank nodes connected to n.

As described above in step 2.3, HN recurses on each unlabeled blank node when it is first reached along the gossip path being explored. This recursion can be visualized as moving along the path from n to the blank node n_i that is receiving a temporary identifier. If, when recursing on n_i, another unlabeled blank node n_j is discovered, the algorithm again recurses. Such a recursion traces out the gossip path from n to n_j via n_i.

The recursive hash r(i) is the hash returned from the completed recursion on the node n_i when computing h_N(n). Just as h_N(n) is the hash of D_n, we denote the data to hash in the recursion on n_i as D_i. So, r(i) = h(D_i). For each related hash x ∈ H_n, R_n(x) is called the recursion list on which the algorithm recurses.

This section is non-normative.

Editor's note

Add some examples ranging from simple to complicated and resource consuming.

Issue 16: Optionally fail on duplicates in Hash N-Degree Quads? question

An additional input to this algorithm should be added that allows it to be optionally skipped and throw an error if any equivalent related hashes were produced that must be permuted during step 5.4.4. For practical uses of the algorithm, this step should never be encountered and could be turned off, disabling canonizing datasets that include a need to run it as a security measure.

The inputs to this algorithm are the canonicalization state, the identifier for the blank node to recursively hash quads for, and path identifier issuer which is an identifier issuer that issues temporary blank node identifiers. The output from this algorithm will be a hash and the identifier issuer used to help generate it.

Create a hash to related blank nodes map for storing hashes that identify related blank nodes.
Get a reference, quads, to the list of quads in the blank node to quads map for the key identifier.
For each quad in quads:
1. For each component in quad, where component is the subject, object, or graph name, and it is a blank node that is not identified by identifier:
  1. Set hash to the result of the Hash Related Blank Node algorithm, passing the blank node identifier for component as related, quad, path identifier issuer as issuer, and position as either s, o, or g based on whether component is a subject, object, graph name, respectively.
  2. Add a mapping of hash to the blank node identifier for component to hash to related blank nodes map, adding an entry as necessary.
Create an empty string, data to hash.
For each related hash to blank node list mapping in hash to related blank nodes map, code point ordered by related hash:
1. Append the related hash to the data to hash.
2. Create a string chosen path.
3. Create an unset chosen issuer variable.
4. For each permutation of blank node list:
  1. Create a copy of issuer, issuer copy.
  2. Create a string path.
  3. Create a recursion list, to store blank node identifiers that must be recursively processed by this algorithm.
  4. For each related in permutation:
    1. If a canonical identifier has been issued for related, append it to path.
    2. Otherwise:
      1. If issuer copy has not issued an identifier for related, append related to recursion list.
      2. Use the Issue Identifier algorithm, passing issuer copy and related and append the result to path.
    3. If chosen path is not empty and the length of path is greater than or equal to the length of chosen path and path is greater than chosen path when considering code point order, then skip to the next permutation.
  5. For each related in recursion list:
    1. Set result to the result of recursively executing the Hash N-Degree Quads algorithm, passing related for identifier and issuer copy for path identifier issuer.
    2. Use the Issue Identifier algorithm, passing issuer copy and related and append the result to path.
    3. Append <, the hash in result, and > to path.
    4. Set issuer copy to the identifier issuer in result.
    5. If chosen path is not empty and the length of path is greater than or equal to the length of chosen path and path is greater than chosen path when considering code point order, then skip to the next permutation.
  6. If chosen path is empty or path is less than chosen path when considering code point order, set chosen path to path and chosen issuer to issuer copy.
5. Append chosen path to data to hash.
6. Replace issuer, by reference, with chosen issuer.
Return issuer and the hash that results from passing data to hash through the hash algorithm.

RDF Dataset Canonicalization

A Standard RDF Dataset Canonicalization Algorithm

Abstract

Status of This Document

1. Introduction

1.1 Uses of Dataset Canonicalization

1.2 How to Read this Document

1.3 Typographical conventions

2. Conformance

3. Terminology

4. Canonicalization

4.1 Overview

4.2 Canonicalization Algorithm Terms

4.3 Canonicalization State

4.4 Blank Node Identifier Issuer State

4.5 Canonicalization Algorithm

4.5.1 Overview

4.5.2 Algorithm

4.6 Issue Identifier Algorithm

4.6.1 Algorithm

4.7 Hash First Degree Quads

4.7.1 Overview

4.7.2 Examples

4.7.3 Algorithm

4.9 Hash N-Degree Quads

4.9.1 Overview

4.9.2 Examples

4.9.3 Algorithm

5. Privacy Considerations

6. Security Considerations

7. Use Cases

8. Examples

A. URGNA2012

B. Issue summary

C. Acknowledgements

D. References

D.1 Normative references

D.2 Informative references

RDF Dataset Canonicalization

A Standard RDF Dataset Canonicalization Algorithm

Abstract

Status of This Document

1. Introduction

1.1 Uses of Dataset Canonicalization

1.2 How to Read this Document

1.3 Typographical conventions

2. Conformance

3. Terminology

4. Canonicalization

4.1 Overview

4.2 Canonicalization Algorithm Terms

4.3 Canonicalization State

4.4 Blank Node Identifier Issuer State

4.5 Canonicalization Algorithm

4.5.1 Overview

4.5.2 Algorithm

4.6 Issue Identifier Algorithm

4.6.1 Algorithm

4.7 Hash First Degree Quads

4.7.1 Overview

4.7.2 Examples

4.7.3 Algorithm

4.8 Hash Related Blank Node

4.8.1 Overview

4.8.2 Algorithm

4.9 Hash N-Degree Quads

4.9.1 Overview

4.9.2 Examples

4.9.3 Algorithm

5. Privacy Considerations

6. Security Considerations

7. Use Cases

8. Examples

A. URGNA2012

B. Issue summary

C. Acknowledgements

D. References

D.1 Normative references

D.2 Informative references