T.B.D.

Work-in-progress.

Introduction

This document describes the processing of Tabular Data [[!tabular-data-model]] to produce RDF [[!rdf11-concepts]]. It does not cover any cleaning or transformation processes needed to convert the initial data (for example, a CSV file) into Tabular Data form. Tabular Data defines an abstract data model consisting of tables, columns, possible header rows providing column names, and data rows. Each row consists of a number of cells, and all rows are required to have the same number of columns. As such, Tabular Data does not cover all possible CSV files, and the processes described in this document do not apply to all possible CSV files. This specification relies on the terms (e.g., header, row, column) as defined in [[!tabular-data-model]].

This specification makes use of the CURIE Syntax for describing RDF Triples; see, for example, the CURIE Syntax Definition Section of the RDFa 1.1 Core Specification [[!rdfa-core]].

This specification makes use of the following namespaces:

csvw:
http://www.w3.org/ns/csvw#
rdf:
http://www.w3.org/1999/02/22-rdf-syntax-ns#
xsd:
http://www.w3.org/2001/XMLSchema#
dc:
http://purl.org/dc/terms/

Processing Annotated Tabular Data

The processing of tabular data is based on the abstract Annotated Tabular Data format as defined in [[!tabular-data-model]]. It does not cover any details on how the initial data (e.g., a CSV file) is parsed into that abstract format. The processing steps below define the generated RDF [[!rdf11-concepts]] triples in a way that is independent of any serialization syntax.

Processing makes use of the metadata associated with the tabular data, and defined in the "Metadata Vocabulary for Tabular Data" [[!tabular-metadata]] document. In this section the term property refers to the properties defined in that document.

The metadata for the tabular data may originate from embedded metadata only. In practice this means that the column names MAY be provided by a header row, but no other metadata is provided in the format described in [[!tabular-metadata]]. In that case, for the purpose of this section, processing begins by (conceptually) establishing the metadata as:

[{
  "@id": "URI of the CSV file",
  "schema": {
    "columns": [{
       "name": "name1"
    },
    {
       "name": "name2"
    },
    .
    .
    .
    {
       "name": "namek"
    }]
  }
}]

where name1, name2, …, namek are the names of the columns.
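
As an illustration only, the conceptual default metadata could be derived from a header row along the following lines. This is a minimal sketch, not a normative algorithm: the function name embedded_metadata and the sample data are assumptions made for this example, and parsing is delegated to Python's csv module rather than to any parser defined in [[!tabular-data-model]].

```python
import csv
import io


def embedded_metadata(csv_text, csv_uri):
    """Build the conceptual default metadata from the header row only.

    Hypothetical helper: assumes the first row of csv_text is a header row.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)  # the first row supplies the column names
    return [{
        "@id": csv_uri,
        "schema": {"columns": [{"name": name} for name in header]},
    }]


sample = "GID,On Street,Species\n1,ADDISON AV,Celtis australis\n"
metadata = embedded_metadata(sample, "http://www.example.org/tree-ops.csv")
```

Only the column names are filled in; no datatype, language, or primary key information is available in this case.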

Default namespace setting

For the purpose of this specification, the default namespace is used for the value of the CURIE prefix :. This default namespace is set as follows:

For example, if the value of the @id property is http://www.example.org/tree-ops.csv, and no @base is set, then :name is the CURIE abbreviation for http://www.example.org/tree-ops.csv#name.
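
The expansion in the example above can be sketched as follows. The sketch covers only the case shown in the example, where no @base is set and the namespace is taken to be the @id value followed by "#". The function names, and the use of percent-encoding as the normalization step, are assumptions made for illustration.

```python
from urllib.parse import quote


def default_namespace(table_id):
    # Assumption: with no @base set, the namespace is the table @id plus "#".
    return table_id + "#"


def expand_default_curie(local_name, table_id):
    # Percent-encode the local name so that the expansion is a valid URI.
    return default_namespace(table_id) + quote(local_name, safe="")


table_id = "http://www.example.org/tree-ops.csv"
expanded = expand_default_curie("name", table_id)
expanded_with_space = expand_default_curie("On Street", table_id)
```

With these assumptions, :name expands to http://www.example.org/tree-ops.csv#name, and a column name containing a space is encoded using %20.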

Generating RDF Triples

The processing steps are separated into the generation of table-level RDF triples and the generation of cell-level RDF triples; both steps add RDF triples to the final RDF Graph.

Processing Steps for Table-Level RDF triples

The following RDF triples are added as table level triples (@id in the triples stands for the value of the corresponding table @id property):

  • If the @type property is set, add:
    (@id rdf:type csvw:Table)
  • For each (optional) property propi with valuei corresponding to a term defined in [[!DC-TERMS]], add the triple:
    (@id dc:propi valuei)
Relevant issue: Issue 30.
Relevant issue: Issue 29 for the dublin core related properties.
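
The table-level steps above can be sketched as follows. This is a hedged illustration: triples are represented as plain Python tuples, the class IRI is taken from the csvw: namespace declared in this document, and DC_TERMS is a small hypothetical subset of the terms in [[!DC-TERMS]], not the full vocabulary.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
CSVW = "http://www.w3.org/ns/csvw#"
DC = "http://purl.org/dc/terms/"

# Hypothetical subset of Dublin Core terms, for illustration only.
DC_TERMS = {"title", "description", "license", "modified"}


def table_level_triples(table_meta):
    """Return the table-level triples as (subject, predicate, object) tuples."""
    table_id = table_meta["@id"]
    triples = []
    if "@type" in table_meta:
        # Assumption: the table class lives in the csvw: namespace.
        triples.append((table_id, RDF_TYPE, CSVW + "Table"))
    for prop, value in table_meta.items():
        if prop in DC_TERMS:
            triples.append((table_id, DC + prop, value))
    return triples


table_meta = {
    "@id": "http://www.example.org/tree-ops.csv",
    "@type": "Table",
    "title": "Tree Operations",
    "modified": "2010-12-31",
}
table_triples = table_level_triples(table_meta)
```

Properties that do not correspond to a Dublin Core term are simply skipped in this sketch.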

Processing Steps for Cell-Level RDF triples

The metadata specification [[!tabular-metadata]] has an inheritance structure whereby the metadata for a cell (e.g., datatype) may be derived from values specified for a column, a row, or indeed the specific cell itself. This specification considers the result of this derivation for each property.

  • The metadata [[!tabular-metadata]] defines the name property for each column (as part of the schema array). For each of those names, a corresponding IRI (referred to hereafter as Ci) is defined as follows:
    • If the ith column descriptor metadata includes the column-template property then the corresponding template rules are used to generate Ci.
    • Otherwise, Ci is set using the normalized version of name as a CURIE reference combined with the default namespace (normalization means encoding name to make the CURIE expansion a valid URI)
  • For the jth row a subjectj is established as follows:
    • If the metadata specifies a primary key via a primaryKey property, then:
      • if the primary key refers to a single column with the name property set, then subjectj is set to the normalized version of name as a CURIE reference combined with the default namespace
      • otherwise, subjectj is set to the normalized version of name1-name2-…-namek as a CURIE reference combined with the default namespace, where name1, name2,…, namek are the column names of the columns referred to by the primary keys.
    • Otherwise subjectj is set to a newly generated blank node, as defined in [[!rdf11-concepts]].
  • For the jth row, the following RDF triple is added:
    (:subjectj csvw:row "j"^^xsd:integer)
  • If the @type property is specified for the jth row, the following RDF triple is added:
    (:subjectj rdf:type csvw:Row)
  • For the ith cell of the jth row, the value of objecti,j is established as follows:
    • If the cell’s metadata includes the datatype property, then objecti,j is the literal value of the cell with the corresponding XSD datatype [[!xmlschema11-2]]. Note that this may require a transformation of the cell’s content to conform to the lexical form rules of the corresponding XSD datatype (e.g., for dates), and this may require the use of additional cell metadata (e.g., for date-like datatypes the value of the format property provides the date format to be used to parse the cell’s content).
    • Otherwise, if the cell’s metadata includes the language property, then objecti,j is an RDF literal with a language tag; the latter is set to the value of the language property.
    • Otherwise, objecti,j is an RDF Literal of type xsd:string with the value of the cell as lexical form.
    Once the value of objecti,j is established, the following triple is added:
    (:subjectj Ci objecti,j)
The primaryKey, as defined in the metadata document [[!tabular-metadata]], contains a reference to a column, but does not contain the name of the column itself. Hence the somewhat convoluted way of generating subjectj.
The specification refers to the column-template property, though that is not part of the metadata specification at this moment. This is in anticipation of a minimal URI-transformation for the predicate. Similar predicates may have to be used for each row to influence the row’s URI.
Relevant issue: Issue 31.
Relevant issue: Issue 32.
Relevant issue: Issue 33.
Relevant issue: Issue 35.
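
A minimal sketch of the cell-level steps follows, under several stated assumptions: the subject is passed in by the caller (so neither the primaryKey handling nor blank node generation is shown), the column-template property is ignored, no lexical-form transformation is applied to cell values, and normalization is taken to mean percent-encoding. Literals are represented as hypothetical (lexical form, datatype IRI, language tag) tuples.

```python
from urllib.parse import quote

XSD = "http://www.w3.org/2001/XMLSchema#"
CSVW = "http://www.w3.org/ns/csvw#"


def column_iri(name, namespace):
    # Ci: the normalized column name combined with the default namespace.
    return namespace + quote(name, safe="")


def cell_object(value, cell_meta):
    """Return a (lexical form, datatype IRI or None, language tag or None) tuple."""
    if "datatype" in cell_meta:
        return (value, XSD + cell_meta["datatype"], None)
    if "language" in cell_meta:
        return (value, None, cell_meta["language"])
    return (value, XSD + "string", None)


def row_triples(subject, row_number, cells, columns, namespace):
    """Generate the triples for one row; columns carries the derived cell metadata."""
    triples = [(subject, CSVW + "row", (str(row_number), XSD + "integer", None))]
    for column, value in zip(columns, cells):
        predicate = column_iri(column["name"], namespace)
        triples.append((subject, predicate, cell_object(value, column)))
    return triples


ns = "http://www.example.org/tree-ops.csv#"
columns = [{"name": "GID", "datatype": "integer"}, {"name": "On Street"}]
triples = row_triples("_:row2", 2, ["1", "ADDISON AV"], columns, ns)
```

The blank node label _:row2 is an arbitrary placeholder chosen for this example.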

Examples

Let us consider the following example (the first row establishes the column names):

GID,On Street,Species,Trim Cycle,Inventory Date
1,ADDISON AV,Celtis australis,Large Tree Routine Prune,10/18/2010
2,EMERSON ST,Liquidambar styraciflua,Large Tree Routine Prune,6/2/2010
3,EMERSON ST,Liquidambar styraciflua,Large Tree Routine Prune,6/2/2010
        

yielding the following RDF:

@prefix :     <http://www.example.org/file.csv#> .
@prefix csvw: <http://www.w3.org/ns/csvw#> .
[ csvw:row 2 ;
  :GID "1";
  :On%20Street "ADDISON AV";
  :Species "Celtis australis";
  :Trim%20Cycle "Large Tree Routine Prune";
  :Inventory%20Date "10/18/2010"
] .
[ csvw:row 3 ;
  :GID "2";
  :On%20Street "EMERSON ST";
  :Species "Liquidambar styraciflua";
  :Trim%20Cycle "Large Tree Routine Prune";
  :Inventory%20Date "6/2/2010"
] .
[ csvw:row 4 ;
  :GID "3";
  :On%20Street "EMERSON ST";
  :Species "Liquidambar styraciflua";
  :Trim%20Cycle "Large Tree Routine Prune";
  :Inventory%20Date "6/2/2010"
] .

The CSV Data may also have the following associated metadata:

[{
  "@id": "http://www.example.org/tree-ops.csv",
  "@type": "Table",
  "title": "Tree Operations",
  "keywords": ["tree", "street", "maintenance"],
  "license": "http://opendefinition.org/licenses/cc-by/",
  "modified": "2010-12-31",
  "columns": [{
    "@id": "_:GID",
    "name": "GID",
    "datatype": "integer"
   }, {
    "name": "On Street",
    "description": "The street that the tree is on.",
    "datatype": "string"
  }, {
    "name": "Species",
    "description": "The species of the tree.",
    "datatype": "string"
  }, {
    "name": "Trim Cycle",
    "description": "The operation performed on the tree.",
    "datatype": "string"
  }, {
    "name": "Inventory Date",
    "description": "The date of the operation that was performed.",
    "datatype": "date",
    "format": "M/D/YYYY"
  }],
  "primaryKey": "_:GID"
}]

The generated RDF is then:

@prefix :     <http://www.example.org/tree-ops.csv#> .
@prefix csvw: <http://www.w3.org/ns/csvw#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix dc:   <http://purl.org/dc/terms/> .

<http://www.example.org/tree-ops.csv>
  a csvw:Table ;
  dc:title "Tree Operations";
  dc:license <http://opendefinition.org/licenses/cc-by/>;
  dc:modified "2010-12-31" .

:1 
  csvw:row 2 ;
  :GID 1;
  :On%20Street "ADDISON AV";
  :Species "Celtis australis";
  :Trim%20Cycle "Large Tree Routine Prune";
  :Inventory%20Date "2010-10-18"^^xsd:date
.
:2
  csvw:row 3 ;
  :GID 2;
  :On%20Street "EMERSON ST";
  :Species "Liquidambar styraciflua";
  :Trim%20Cycle "Large Tree Routine Prune";
  :Inventory%20Date "2010-06-02"^^xsd:date
.
:3 
  csvw:row 4 ;
  :GID 3;
  :On%20Street "EMERSON ST";
  :Species "Liquidambar styraciflua";
  :Trim%20Cycle "Large Tree Routine Prune";
  :Inventory%20Date "2010-06-02"^^xsd:date
.

Note the value for :GID is now an integer, and the value for :Inventory%20Date is a proper date.
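
The date transformation used in this example can be sketched as follows. The lexical form of xsd:date requires a four-digit year and two-digit month and day, so the cell content has to be parsed and re-serialized. The mapping of the format value "M/D/YYYY" to the strptime pattern "%m/%d/%Y" is an assumption made for this sketch; the interpretation of format strings is not defined here.

```python
from datetime import datetime


def to_xsd_date(cell_value):
    # Assumption: "M/D/YYYY" means month/day/four-digit year.
    parsed = datetime.strptime(cell_value, "%m/%d/%Y")
    # isoformat() yields the zero-padded YYYY-MM-DD form required by xsd:date.
    return parsed.date().isoformat()


first = to_xsd_date("10/18/2010")
second = to_xsd_date("6/2/2010")
```

Under this assumption, 10/18/2010 becomes 2010-10-18 and 6/2/2010 becomes 2010-06-02.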

More examples should be added here