T.B.D.

Work-in-progress.

Introduction

This document describes the processing of Tabular Data [[!tabular-data-model]] to produce RDF [[!rdf11-concepts]]. It does not cover any cleaning or transformation processes needed to convert the initial CSV data into Tabular Data form. Tabular Data defines an abstract data model consisting of tables, columns, header rows, and data rows. It requires all rows to have the same number of columns. As such, Tabular Data does not cover all possible CSV files, and the processes described in this document do not apply to all possible CSV files.

This document describes the processing of Tabular Data [[!tabular-data-model]] to produce RDF [[!rdf11-concepts]]. It does not cover any cleaning or transformation processes needed to convert the initial data into Tabular Data form. Tabular Data defines an abstract data model consisting of tables, columns, possible header rows providing column names, and data rows. Each row consists of a number of cells. The model requires all rows to have the same number of columns. This specification relies on the terms (e.g., header, row, column) as defined in [[!tabular-data-model]].

This specification makes use of the CURIE Syntax for describing RDF Triples; see, for example, the CURIE Syntax Definition Section of the RDFa 1.1 Core Specification [[!rdfa-core]].

This specification makes use of the following namespaces:

csvw: http://www.w3.org/ns/csvw#
rdf:  http://www.w3.org/1999/02/22-rdf-syntax-ns#
xsd:  http://www.w3.org/2001/XMLSchema#
dc:   http://purl.org/dc/terms/

Processing Annotated Tabular Data

The processing of tabular data is based on the abstract Annotated Tabular Data format as defined in [[!tabular-data-model]]. It does not cover any details on how the initial data (i.e., a CSV file) is parsed into one of those abstract Data formats. The processing steps below define the generated RDF [[!rdf11-concepts]] triples in a serialization-syntax independent way.

Processing makes use of the metadata associated with the tabular data, and defined in the "Metadata Vocabulary for Tabular Data" [[!tabular-metadata]] document. In this section the term property refers to the properties defined in that document.

The metadata for the tabular data may originate from embedded metadata only. In practice this means that the column names MAY be provided by a header row, but no other metadata is provided in the format described in [[!tabular-metadata]]. In that case, for the purpose of this section, processing begins by (conceptually) establishing the metadata as:

[{
  "@id": "URI of the CSV file",
  "schema": {
    "columns": [{
       "name": "name1"
    },
    {
       "name": "name2"
    },
    .
    .
    .
    {
        "name": "namek"
    }]
  }
}]

where name1, name2, …, namek are the names of the columns.
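
As an illustration only, the conceptual metadata above could be derived from a header row as in the following Python sketch. The function name embedded_metadata and its arguments are assumptions made for this example; they are not defined by this specification.

```python
def embedded_metadata(csv_uri, header_names):
    """Build the conceptual default metadata from the names in a header row."""
    return [{
        "@id": csv_uri,
        "schema": {
            # One column descriptor per header cell, in column order.
            "columns": [{"name": name} for name in header_names],
        },
    }]


metadata = embedded_metadata(
    "http://www.example.org/tree-ops.csv",
    ["GID", "On Street", "Species"],
)
```

The result has the same shape as the structure shown above, with one "name" entry per column.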

Default namespace setting

For the purpose of this specification, the default namespace is used for the value of the CURIE prefix :. This default namespace is set as follows:

For example, if the value of the @id property is http://www.example.org/tree-ops.csv, and no @base is set, then :name is the CURIE abbreviation for http://www.example.org/tree-ops.csv#name.
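
The expansion in this example can be sketched in Python as follows. The sketch covers only the case where no @base is set, and it assumes that the namespace is the table @id followed by "#" and that names are normalized by percent-encoding, which is consistent with the examples later in this document. The function names are hypothetical.

```python
from urllib.parse import quote


def default_namespace(table_id):
    # Assumption: with no @base set, the namespace is the table @id plus "#".
    return table_id + "#"


def expand_default_curie(table_id, name):
    # Percent-encode the name so that the expanded CURIE is a valid URI.
    return default_namespace(table_id) + quote(name, safe="")


iri = expand_default_curie("http://www.example.org/tree-ops.csv", "name")
```

Here iri is http://www.example.org/tree-ops.csv#name, and a name such as "On Street" would be encoded as On%20Street.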

Generating RDF Triples

The processing steps are separated in the generation of table-level RDF triples, and cell-level RDF triples; these steps add RDF triples to the final RDF Graph.

Processing Steps for Table-Level RDF triples

The following RDF triples are added as table level triples (@id in the triples stands for the value of the corresponding table @id property):

  • If the @type property is set, add:
    (@id rdf:type csvw:Table)
  • If the notes property is set, its value is either an array of annotation objects, or a single annotation object, each with its own (required) @id property. For each of those annotation objects add the following triple (where @idi is the @id value for the ith annotation object):
    (@id csvw:note @idi)
  • For each (optional) property propi with value valuei corresponding to a term defined in [[!DC-TERMS]], add the triple:
    (@id dc:propi valuei)
Not clear what to do with the @type property. The current metadata specification requires it to have the value of "Table", leaving the actual values dependent on a possible JSON-LD context. Because it is a fixed value, a fixed value has been used in this specification, which may not be what was meant. On the other hand, it may be a problem to require implementations to interpret the @context structure...
The last set of triple generation may depend on whether schema.org terms are used instead of Dublin Core.
Some of the Dublin Core properties are defined in such a way that the value is understood to be a URI Reference. The mapping should follow that, but it is not clear what this specification should say about this.
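
The table-level steps above could be sketched in Python as follows. Triples are represented as plain tuples of strings, the csvw: prefix is the one from the namespace list, and the DC_TERMS set is an assumed subset of Dublin Core terms chosen for illustration. Given the open issues noted above, this is a sketch, not a definitive mapping.

```python
# Assumed subset of Dublin Core terms, for illustration only.
DC_TERMS = {"title", "license", "modified"}


def table_level_triples(meta):
    """Return the table-level triples for one table metadata object."""
    table_id = meta["@id"]
    triples = []
    if "@type" in meta:
        triples.append((table_id, "rdf:type", "csvw:Table"))
    notes = meta.get("notes", [])
    if isinstance(notes, dict):  # a single annotation object
        notes = [notes]
    for note in notes:
        triples.append((table_id, "csvw:note", note["@id"]))
    for prop, value in meta.items():
        if prop in DC_TERMS:
            triples.append((table_id, "dc:" + prop, value))
    return triples
```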

Processing Steps for Cell-Level RDF triples

The metadata specification [[!tabular-metadata]] has an inheritance structure whereby the metadata for a cell (e.g., datatype) may be derived from values specified for a column, a row, or indeed the specific cell itself. This specification considers the result of this derivation for each property.

  • The metadata [[!tabular-metadata]] defines the name property for each column (as part of the schema array). For each of those names, a corresponding IRI (referred to hereafter as Ci) is defined as follows:
    • If the ith column descriptor metadata includes the column-template property then the corresponding template rules are used to generate Ci.
    • Otherwise, Ci is set using the normalized version of name as a CURIE reference combined with the default namespace (normalization means encoding name to make the CURIE expansion a valid URI)
  • For the jth row a subjectj is established as follows:
    • If the metadata specifies a primary key via a primaryKey property, then:
      • if the primary key refers to a single column with the name property set, then subjectj is set to the normalized version of name as a CURIE reference combined with the default namespace
      • otherwise, subjectj is set to the normalized version of name1-name2-…-namek as a CURIE reference combined with the default namespace, where name1, name2,…, namek are the column names of the columns referred to by the primary keys.
    • Otherwise subjectj is set to a newly generated blank node, as defined in [[!rdf11-concepts]].
  • For the jth row, the following RDF triple is added:
    (:subjectj csvw:row "j"^^xsd:integer)
  • If the @type property is specified for the jth row, the following RDF triple is added:
    (:subjectj rdf:type csvw:Row)
  • For the ith cell of the jth row, the value of objecti,j is established as follows:
    • If the cell’s metadata includes the datatype property, then objecti,j is the literal value of the cell with the corresponding XSD datatype [[!xmlschema11-2]]. Note that this may require a transformation of the cell’s content to abide by the lexical form rules of the corresponding XSD datatype (e.g., for dates), and this may require the use of additional cell metadata (e.g., for date-like datatypes the value of the format property provides the date format to be used to parse the cell’s content).
    • Otherwise, if the cell’s metadata includes the language property, then objecti,j is an RDF literal with a language tag; the latter is set to the value of the language property.
    • Otherwise, objecti,j is an RDF Literal of type xsd:string with the value of the cell as lexical form.
    Once the value of objecti,j is established, the following triple is added:
    (:subjectj Ci objecti,j)
The primaryKey, as defined in the metadata document [[!tabular-metadata]], contains a reference to a column, but does not contain the name of the column itself. Hence the somewhat convoluted way of generating subjectj.
The specification refers to the column-template property, though that is not part of the metadata specification at this moment. This is in anticipation of a minimal URI-transformation for the predicate. Similar predicates may have to be used for each row to influence the row’s URI.
At the moment, the object for a cell is a literal. A separate property may have to be defined (or a special value of datatype could be used) to indicate that the value should be a URI in the RDF sense. That may require some sort of a cell-level template, a bit like the column-template.
The exact mapping to Dublin Core is underspecified; not sure how much details we want to put here.
The current metadata document says that the schema property is optional and that, even if it is set, the column, row, and cell properties are all optional. This means that the names of the columns may not be available. If so, the algorithm fails. One possibility would be to define (in the metadata document) a default fall-back for column naming when nothing provides the names, or to require the presence of schema/column.
The algorithm, at present, generates triples for cells that are, in fact, primary keys (or parts of primary keys). Is that necessary, or should those triples be skipped?
It is not clear whether the value of csvw:row should take into account the header row, or should refer to the data rows only. I.e., what is the number of the first data row?
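
The three-way choice for objecti,j described above could be sketched in Python as follows. A literal is represented here as a tuple of lexical form, datatype IRI, and language tag. The sketch assumes that the datatype property holds the local name of an XSD datatype (as in "integer" or "date") and that the cell value has already been converted to that datatype's lexical form. The function name is hypothetical.

```python
XSD = "http://www.w3.org/2001/XMLSchema#"


def cell_object(value, cell_meta):
    """Return (lexical form, datatype IRI or None, language tag or None)."""
    if "datatype" in cell_meta:
        # Assumption: value is already in the lexical form of the datatype.
        return (value, XSD + cell_meta["datatype"], None)
    if "language" in cell_meta:
        # Language-tagged literal; the tag comes from the language property.
        return (value, None, cell_meta["language"])
    # Default: a plain xsd:string literal.
    return (value, XSD + "string", None)
```

The datatype branch is tested first, so a cell with both datatype and language set yields a typed literal, matching the order of the steps above.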

Processing Core Tabular Data

The Core Tabular Data, as defined in [[!tabular-data-model]], does not have any column names, only raw data in terms of data rows. In that case, for the purpose of this section, processing begins by (conceptually) establishing the metadata as:

[{
  "@id": "URI of the CSV file",
  "schema": {
    "columns": [{
      "name": "col_1"
     },
     {
      "name": "col_2"
     },
     .
     .
     .
     {
      "name": "col_k"
    }]
  }
}]

Where "k" is the number of columns in the table. The tabular data is then processed as Annotated Tabular Data with this associated metadata, as described in the previous section.
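
A minimal Python sketch of this default naming follows, assuming the number of columns is already known. The function name core_metadata is an assumption made for this example.

```python
def core_metadata(csv_uri, column_count):
    """Build the conceptual default metadata for data without a header row."""
    return [{
        "@id": csv_uri,
        "schema": {
            # Columns are named col_1 .. col_k, counting from 1.
            "columns": [
                {"name": f"col_{i}"} for i in range(1, column_count + 1)
            ],
        },
    }]


default_meta = core_metadata("http://www.example.org/file.csv", 5)
```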

In fact, the default metadata, for this case, i.e., the default naming of the column names, may not be something to be defined in this document; a better place would be the data model document.

Examples

Let us consider the following example:

1,ADDISON AV,Celtis australis,Large Tree Routine Prune,10/18/2010
2,EMERSON ST,Liquidambar styraciflua,Large Tree Routine Prune,6/2/2010
3,EMERSON ST,Liquidambar styraciflua,Large Tree Routine Prune,6/2/2010

This is tabular data without a header row; the generated RDF (in Turtle [[!turtle]]) is:

@prefix :     <http://www.example.org/file.csv#> .
@prefix csvw: <http://www.w3.org/ns/csvw#> .
[ csvw:row 1 ;
  :col_1 "1";
  :col_2 "ADDISON AV";
  :col_3 "Celtis australis";
  :col_4 "Large Tree Routine Prune";
  :col_5 "10/18/2010"
] .
[ csvw:row 2 ;
  :col_1 "2";
  :col_2 "EMERSON ST";
  :col_3 "Liquidambar styraciflua";
  :col_4 "Large Tree Routine Prune";
  :col_5 "6/2/2010"
] .
[ csvw:row 3 ;
  :col_1 "3";
  :col_2 "EMERSON ST";
  :col_3 "Liquidambar styraciflua";
  :col_4 "Large Tree Routine Prune";
  :col_5 "6/2/2010"
] .
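
For illustration, output of the shape shown above could be produced by a Python sketch such as the following. It handles only data without a header row and without any metadata, writes every cell as a plain string literal, and uses one blank node per row. The function names and the base argument are assumptions made for this example.

```python
import csv
import io


def turtle_string(value):
    # Escape backslashes and double quotes for a Turtle string literal.
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def rows_to_turtle(csv_text, base):
    """Serialize headerless CSV text as Turtle, one blank node per row."""
    lines = [
        f"@prefix :     <{base}#> .",
        "@prefix csvw: <http://www.w3.org/ns/csvw#> .",
    ]
    for j, row in enumerate(csv.reader(io.StringIO(csv_text)), start=1):
        lines.append(f"[ csvw:row {j} ;")
        cells = [
            f"  :col_{i} {turtle_string(cell)}"
            for i, cell in enumerate(row, start=1)
        ]
        lines.append(";\n".join(cells))
        lines.append("] .")
    return "\n".join(lines)
```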

By adding a header row the data may be as follows:

GID,On Street,Species,Trim Cycle,Inventory Date
1,ADDISON AV,Celtis australis,Large Tree Routine Prune,10/18/2010
2,EMERSON ST,Liquidambar styraciflua,Large Tree Routine Prune,6/2/2010
3,EMERSON ST,Liquidambar styraciflua,Large Tree Routine Prune,6/2/2010
        

yielding the following RDF:

@prefix :     <http://www.example.org/file.csv#> .
@prefix csvw: <http://www.w3.org/ns/csvw#> .
[ csvw:row 2 ;
  :GID "1";
  :On%20Street "ADDISON AV";
  :Species "Celtis australis";
  :Trim%20Cycle "Large Tree Routine Prune";
  :Inventory%20Date "10/18/2010"
] .
[ csvw:row 3 ;
  :GID "2";
  :On%20Street "EMERSON ST";
  :Species "Liquidambar styraciflua";
  :Trim%20Cycle "Large Tree Routine Prune";
  :Inventory%20Date "6/2/2010"
] .
[ csvw:row 4 ;
  :GID "3";
  :On%20Street "EMERSON ST";
  :Species "Liquidambar styraciflua";
  :Trim%20Cycle "Large Tree Routine Prune";
  :Inventory%20Date "6/2/2010"
] .

The CSV Data may also have the following associated metadata:

[{
  "@id": "http://www.example.org/tree-ops.csv",
  "@type": "Table",
  "title": "Tree Operations",
  "keywords": ["tree", "street", "maintenance"],
  "license": "http://opendefinition.org/licenses/cc-by/",
  "modified": "2010-12-31",
  "columns": [{
    "@id": "_:GID",
    "name": "GID",
    "datatype": "integer"
   }, {
    "name": "On Street",
    "description": "The street that the tree is on.",
    "datatype": "string"
  }, {
    "name": "Species",
    "description": "The species of the tree.",
    "datatype": "string"
  }, {
    "name": "Trim Cycle",
    "description": "The operation performed on the tree.",
    "datatype": "string"
  }, {
    "name": "Inventory Date",
    "description": "The date of the operation that was performed.",
    "datatype": "date",
    "format": "M/D/YYYY"
  }],
  "primaryKey": "_:GID"
}]

The generated RDF is then:

@prefix :     <http://www.example.org/tree-ops.csv#> .
@prefix csvw: <http://www.w3.org/ns/csvw#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix dc:   <http://purl.org/dc/terms/> .

<http://www.example.org/tree-ops.csv>
  a csvw:Table ;
  dc:title "Tree Operations";
  dc:license <http://opendefinition.org/licenses/cc-by/>;
  dc:modified "2010-12-31" .

:1 
  csvw:row 2 ;
  :GID 1;
  :On%20Street "ADDISON AV";
  :Species "Celtis australis";
  :Trim%20Cycle "Large Tree Routine Prune";
  :Inventory%20Date "2010-10-18"^^xsd:date
.
:2
  csvw:row 3 ;
  :GID 2;
  :On%20Street "EMERSON ST";
  :Species "Liquidambar styraciflua";
  :Trim%20Cycle "Large Tree Routine Prune";
  :Inventory%20Date "2010-06-02"^^xsd:date
.
:3 
  csvw:row 4 ;
  :GID 3;
  :On%20Street "EMERSON ST";
  :Species "Liquidambar styraciflua";
  :Trim%20Cycle "Large Tree Routine Prune";
  :Inventory%20Date "2010-06-02"^^xsd:date
.

Note the value for :GID is now an integer, and the value for :Inventory%20Date is a proper date.
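
The date conversion could be sketched in Python as follows. The lexical form of xsd:date requires a four-digit year and two-digit month and day, so the cell content has to be parsed and re-serialized. The sketch assumes that the "M/D/YYYY" value of the format property corresponds to the strptime pattern "%m/%d/%Y"; this mapping is not defined by this specification.

```python
from datetime import datetime


def to_xsd_date(value):
    """Convert an M/D/YYYY cell value to the xsd:date lexical form."""
    # Assumption: "M/D/YYYY" maps to the strptime pattern "%m/%d/%Y".
    return datetime.strptime(value, "%m/%d/%Y").date().isoformat()


first = to_xsd_date("10/18/2010")
second = to_xsd_date("6/2/2010")
```

Here first is "2010-10-18" and second is "2010-06-02".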

More examples should be added here

Templates

This section was part of an earlier version of the document and has not been re-worked to fit the previous sections.

Graph Templates

An RDF graph can be used as a template for mapping cells from a row by following a couple of conventions.

A graph template is defined (as a named graph?) within a metadata mapping file as a set of RDF triples where any value may include one or more Cell References. Each record is processed to emit triples by transforming the graph template into a series of triples in which Cell References are substituted with their referenced cell values. Triples which result in any position having a value of csv:nil are excluded from output.

Cell References

A Cell Reference is a brace-surrounded value matching a column name from the CSV input. During record expansion, cell references are replaced with the value of the cell from the specific record being mapped.

          [
          schema:name "{name}"@en;
          schema:homepage <{+homepage}>;
          schema:image <{+image}>
          ] .
        

Given an input file such as the following:

name,homepage,image
Homer Simpson,http://example/homer,http://example/avatar/homer.png

The resulting output graph would be the following:

          [
            schema:name "Homer Simpson"@en;
            schema:homepage <http://example/homer>;
            schema:image <http://example/avatar/homer.png>
          ] .
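
The substitution above could be sketched in Python as follows. The template is represented as a list of triples whose terms are strings, and the sketch assumes that any term containing a Cell Reference without a value in the record becomes csv:nil, so that its triple is dropped. The names used here are hypothetical.

```python
import re

CELL_REF = re.compile(r"\{\+?([^{}]+)\}")
NIL = "csv:nil"


def substitute(term, record):
    """Replace Cell References in one term; csv:nil if any is unmapped."""
    names = CELL_REF.findall(term)
    if any(record.get(name) is None for name in names):
        return NIL
    return CELL_REF.sub(lambda m: record[m.group(1)], term)


def expand_template(template, record):
    """Expand a graph template for one record, dropping csv:nil triples."""
    triples = [
        tuple(substitute(term, record) for term in triple)
        for triple in template
    ]
    return [triple for triple in triples if NIL not in triple]


template = [
    ("_:row", "schema:name", '"{name}"@en'),
    ("_:row", "schema:homepage", "<{+homepage}>"),
    ("_:row", "schema:image", "<{+image}>"),
]
```

With a record that has no value for image, only the name and homepage triples are emitted.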
        

URI Templates

A URI Template is a URI containing one or more variables as described in [[!RFC6570]]. URI variables are treated as Cell References. The expansion of URI Templates is modified so that if the URI template contains any unmapped Cell Reference the resulting URI is replaced with csv:nil. After processing, all triples containing csv:nil in any position are removed.

A URI template having the scheme "_" (otherwise illegal) results in a blank node if all Cell References are substituted.
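
A Python sketch of this modified expansion follows. It supports only the simple {var} and reserved {+var} forms of [[!RFC6570]] and approximates their encoding rules with urllib.parse.quote; the other operators are not handled. The function name and the (kind, value) return convention are assumptions made for this example.

```python
import re
from urllib.parse import quote

VAR = re.compile(r"\{(\+?)([^{}]+)\}")
# Characters left unencoded by the reserved {+var} form (approximation).
RESERVED = ":/?#[]@!$&'()*+,;=%"


def expand_uri_template(template, record):
    """Return ("iri", value), ("blank", label) or ("nil", "csv:nil")."""
    if any(record.get(name) is None for _, name in VAR.findall(template)):
        # An unmapped Cell Reference replaces the whole URI with csv:nil.
        return ("nil", "csv:nil")

    def repl(match):
        safe = RESERVED if match.group(1) else ""
        return quote(record[match.group(2)], safe=safe)

    expanded = VAR.sub(repl, template)
    if expanded.startswith("_:"):
        # The "_" scheme yields a blank node instead of a URI.
        return ("blank", expanded[2:])
    return ("iri", expanded)
```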

{COL} value

Lookup

String functions (SPARQL? i.e. XSD functions and operators)