Model for Tabular Data and Metadata on the Web

Tabular data is routinely transferred on the web as "CSV".

Work-in-progress.

Introduction

This document uses keywords from [[!RFC2119]].

Tabular data is data that is structured into rows, each of which contains information about some thing. Each row contains the same number of fields (although some of these fields may be empty), which provide values of properties of the thing described by the row. In tabular data, fields within the same column provide values for the same property of the thing described by the particular row. This is what differentiates tabular data from other line-oriented formats.

One common form of Tabular Data is CSV, as described in [[!RFC4180]] (to be revised).

This document describes the processing of Tabular Data to produce [[!RDF11-CONCEPTS]]. It does not cover any cleaning or transformation processes needed to convert the initial data into Tabular Data form. Tabular Data defines an abstract data model consisting of tables, columns, header rows and data rows, and requires all rows to have the same number of columns. As such, Tabular Data does not cover all possible CSV files, and the processes described in this document do not apply to all possible CSV files.

Examples

Example 1

GID,On Street,Species,Trim Cycle,Inventory Date
1,ADDISON AV,Celtis australis,Large Tree Routine Prune,10/18/2010
2,EMERSON ST,Liquidambar styraciflua,Large Tree Routine Prune,6/2/2010
3,EMERSON ST,Liquidambar styraciflua,Large Tree Routine Prune,6/2/2010

gives the following RDF (in Turtle) for the data rows.

# Data rows
[ csv:row 2 ;
  :GID 1;
  :On%20Street "ADDISON AV";
  :Species "Celtis australis";
  :Trim%20Cycle "Large Tree Routine Prune";
  :Inventory%20Date "10/18/2010"
] .
[ csv:row 3 ;
  :GID 2;
  :On%20Street "EMERSON ST";
  :Species "Liquidambar styraciflua";
  :Trim%20Cycle "Large Tree Routine Prune";
  :Inventory%20Date "6/2/2010"
] .
[ csv:row 4 ;
  :GID 3;
  :On%20Street "EMERSON ST";
  :Species "Liquidambar styraciflua";
  :Trim%20Cycle "Large Tree Routine Prune";
  :Inventory%20Date "6/2/2010"
] .

In the full Column Mapping, metadata about the translation applied is also included.

Processing Model

The processing of Tabular Data to RDF takes as its starting point a file meeting the requirements of Tabular Data and a metadata description of the data.

The header row (if present) is processed to generate additional metadata, then each data row is processed one at a time, in the order given. Processing of a data row is done without reference to any other data row.

Independently processed rows - is this always the case?
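As a sketch only (csv:column and csv:name are assumed, illustrative terms, not defined by this document), the header row of Example 1 might generate column metadata along the following lines:

# Hypothetical column metadata generated from the header row of Example 1
[ csv:column 1 ; csv:name "GID" ] .
[ csv:column 2 ; csv:name "On Street" ] .
[ csv:column 3 ; csv:name "Species" ] .
[ csv:column 4 ; csv:name "Trim Cycle" ] .
[ csv:column 5 ; csv:name "Inventory Date" ] .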

There are three different levels of conversion: the Minimal Mapping, the Column Mapping and Graph Templates, described in the sections below.

Column Mapping is defined in terms of an implicit template formed from the metadata.

Annotations

Possible annotations.

Table

Row

Column Annotations

conditional forms? "skip if blank else ..."

Graph Templates

An RDF graph can be used as a template for mapping fields from a row by following a couple of conventions.

A graph template is defined (as a named graph?) within a metadata mapping file as a set of RDF triples in which any value may include one or more Field References. Each record is processed by substituting each Field Reference in the template with the value of the referenced field and emitting the resulting triples. Triples which result in any position having a value of csv:nil are excluded from the output.

Field References

A Field Reference is a brace-surrounded value matching a column name from the CSV input. During record expansion, Field References are replaced with the value of the named field from the specific record being mapped. For example, consider the graph template:

          [
          schema:name "{name}"@en;
          schema:homepage <{+homepage}>;
          schema:image <{+image}>
          ] .

Given an input file such as the following:

name,homepage,image
Homer Simpson,http://example/homer,http://example/avatar/homer.png

The resulting output graph would be the following:

          [
            schema:name "Homer Simpson"@en;
            schema:homepage <http://example/homer>;
            schema:image <http://example/avatar/homer.png>
          ] .

URI Templates

A URI Template is a URI containing one or more variables as described in [[!RFC6570]]. URI variables are treated as Field References. The expansion of URI Templates is modified so that if the URI template contains any unmapped Field Reference the resulting URI is replaced with csv:nil. After processing, all triples containing csv:nil in any position are removed.

A URI template having the scheme "_" (otherwise illegal) results in a blank node if all Field References are substituted.
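As an illustration combining both behaviours (the column names reuse the template example above):

# Sketch: a graph template using URI Templates.
# <_:{name}> has the scheme "_", so it yields a blank node once {name}
# is substituted. If the homepage Field Reference cannot be mapped for
# a record, <{+homepage}> expands to csv:nil and the triple is removed.
<_:{name}> schema:homepage <{+homepage}> .

For the Homer Simpson record above, this yields a blank node linked to <http://example/homer>; for a record where homepage cannot be mapped, no triple is emitted.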

{COL} value

Lookup

String functions (SPARQL? i.e. XSD functions and operators)

Datatype by Appearance

If guessing datatypes: a field whose value matches the lexical form of an integer, decimal, double or boolean is given the corresponding XSD datatype, and a field that appears to be an http URI is treated as a link; otherwise the field value is processed as a string.
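As a sketch of the guessing rule (the column names and field values here are purely illustrative), a record with the fields 3, 37.50, true, http://example/tree/3 and N/A might convert as:

# Illustrative output only: each datatype is guessed from the
# appearance of the field value
[ :count 3 ;                        # lexical form of an integer
  :height 37.50 ;                   # decimal
  :native true ;                    # boolean
  :link <http://example/tree/3> ;   # http URI
  :note "N/A"                       # anything else: a string
] .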

Outline of the Mapping Process

The process takes place in the following steps:

For each row:
  Subject is template or bnode
  For each field in the row:
    Predicate is column predicate
    If skip, skip.
    If a datatype is given :dt, the object is "string"^^:dt
    If a language tag is given @tag, the object is "string"@tag
    If a URI template is given ...
    If guess, guess integer/decimal/double/boolean/http URI (URN?)
    ...
    Else xsd:string
    Emit (s,p,o)
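As a sketch of the object-forming cases above, reusing the Example 1 column names (the annotations are assumed, and the date is shown already converted to the xsd:date lexical form):

# Sketch: objects produced for one row under different column annotations
[ :Inventory%20Date "2010-10-18"^^xsd:date ;    # datatype annotation, :dt = xsd:date
  :Trim%20Cycle "Large Tree Routine Prune"@en ; # language tag annotation, @en
  :Species "Celtis australis"                   # no annotation: a plain string
] .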

Data Conversion

More details on guessing datatypes.

Minimal Mapping

If there is no header row and no annotations, then a fixed process is applied to generate sufficient annotations to produce output. A header row is assumed, with column names "col1", "col2", ....; the fields of a CSV record (row) are inspected to determine whether to generate numbers, booleans, links or strings from the character value of each field. Fields have white space trimmed.
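As a sketch, applying the minimal mapping to the header-less record

 ADDISON AV ,1,true

might produce the following (the prefix used for the generated properties is assumed):

# Minimal mapping sketch: column names col1, col2, ... are assumed,
# white space is trimmed and datatypes are guessed from the field values
[ :col1 "ADDISON AV" ;
  :col2 1 ;
  :col3 true
] .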

Column Mapping

In column mapping, a process is applied to the CSV data that takes each row of the input tabular data and translates it into a number of triples, all with the same subject. In addition, some table-wide RDF triples can also be added to the output.

Define the template used to process each data row
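As a sketch, the implicit template formed from the metadata for Example 1 might look like the following ({_row}, standing for the row number, is an assumed placeholder, not defined by this document):

# Sketch: the implicit template for Example 1; expanding it against
# each data row gives the Turtle output shown earlier
[ csv:row {_row} ;
  :GID {GID} ;
  :On%20Street "{On Street}" ;
  :Species "{Species}" ;
  :Trim%20Cycle "{Trim Cycle}" ;
  :Inventory%20Date "{Inventory Date}"
] .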

CSV-LD

A further procedure mapping to JSON-LD.

Better as a separate document?