Abstract

This document defines the procedures and rules to be applied when mapping tabular data into RDF. Tabular data may be complemented with metadata annotations that describe its structure, the meaning of its content and how it may form part of a collection of interrelated tabular data. This document specifies the effect of this metadata on the resulting RDF.

Status of This Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

The CSV on the Web Working Group was chartered to produce Recommendations for "Access methods for CSV Metadata", "Metadata vocabulary for CSV data" and "Mapping mechanism to transforming CSV into various Formats (e.g., RDF, JSON, or XML)". This document aims to satisfy the RDF variant of the mapping Recommendation.

Due to the limited resources available within the CSV on the Web Working Group, this document describes only a simple mapping—that is, where each row of tabular data describes a single resource and a single RDF triple is created per cell. The Working Group solicits input on the value of mapping a single row of tabular data into multiple inter-related resources.

This document was published by the CSV on the Web Working Group as a First Public Working Draft. This document is intended to become a W3C Recommendation. If you wish to make comments regarding this document, please send them to public-csv-wg@w3.org (subscribe, archives). All comments are welcome.

Publication as a First Public Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.

This document is governed by the 1 August 2014 W3C Process Document.

1. Introduction

This document describes the processing of tabular data to create an RDF graph comprising subject-predicate-object triples [rdf11-concepts] referred to as the output graph. Since RDF is an abstract syntax, the output graph MUST be serialized in a concrete RDF syntax such as N-Triples [n-triples], Turtle [turtle], RDFa [rdfa-primer], JSON-LD [json-ld] or TriG [trig].

The Tabular Data Model [tabular-data-model] defines a core tabular data model consisting of tables, columns, rows and cells.

Tabular data may be enriched with metadata that describes its structure and the meaning of its content. These metadata annotations are described in [tabular-metadata] and may be embedded within the CSV encoding itself as a header line or provided within a separate metadata document. The resulting annotated table conforms to the annotated tabular data model.

The metadata annotations may describe how a table relates to a group of tables. Such collections conform to the grouped tabular data model.

The mapping procedure operates on the abstract tabular data model: core, annotated or grouped. This document does not discuss the processes needed to convert CSV-encoded data into tabular data form; refer to [tabular-data-model] for details of parsing tabular data. Further details on parsing cells within tabular data are provided in [tabular-metadata].

Note

Adopting terminology from the Data Catalog Vocabulary [vocab-dcat], the tabular data is considered to be a dataset, whilst the CSV file within which that tabular data is encoded is considered to be a distribution of that tabular data.

Are the abstract tabular data and the CSV that encodes it the same thing? (Is DCAT distribution appropriate?)

The mapping procedure is intended to be simple, to encourage the provision of compliant mapping applications. The limitation of this simple mapping is that each row of tabular data is inferred to describe a single resource and that a single RDF triple is created for each cell.

Note

An annotated table may include a reference to a template specification (see [tabular-metadata]) that describes how tabular data can be transformed into another format using a template-based approach. Templating facilitates far more sophisticated transformations than are possible using the simple mapping.

There is no standard template syntax, therefore template specifications may be written using existing template languages, such as Mustache, [r2rml] and SPARQL CONSTRUCT queries (as defined in [sparql11-query]).

The processing of template specifications during the mapping is yet to be determined by the Working Group and is, at least for the interim, beyond the scope of this document.

Finally, note that the mapping procedure is considered to be entirely textual. There is no requirement on compliant mapping applications to check the semantic consistency of the data during the mapping, nor validate the cell values against RDF syntax rules. Where cell values within CSV encoded content are improperly formatted, the output from the mapping is likely to include syntax errors. Downstream applications should be aware of this and take appropriate action.

Should the RDF/JSON transformation check the values?

2. Conformance

As well as sections marked as non-normative, all authoring guidelines, diagrams, examples, and notes in this specification are non-normative. Everything else in this specification is normative.

The key words MAY, MUST, SHALL, and SHOULD are to be interpreted as described in [RFC2119].

Tabular data MUST conform to the description from [tabular-data-model]. In particular note that each row MUST contain the same number of cells (although some of these cells may be empty). Given this constraint, not all CSV-encoded data can be considered to be tabular data. As such, the procedures and rules defined in this document cannot be applied to all CSV files.
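
The equal-cell-count constraint can be checked before any mapping is attempted. The following is a minimal sketch in Python, assuming the CSV content has already been read into a string; the function name is_tabular is illustrative and is not defined by this specification.

```python
import csv
import io


def is_tabular(csv_text: str) -> bool:
    """Return True if every row contains the same number of cells."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    # An empty cell still counts as a cell; only the number of cells matters.
    return len({len(row) for row in rows}) <= 1


print(is_tabular("1,Jill,Smith,50\n2,Eve,,94\n"))  # True: the empty cell is kept
print(is_tabular("1,Jill,Smith,50\n2,Eve,94\n"))   # False: the second row is short
```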

This document relies on terms (e.g. group, table, column, row, cell) defined in [tabular-data-model].

This specification makes use of the CURIE Syntax for describing RDF Triples; see, for example, the CURIE Syntax Definition Section of the RDFa 1.1 Core Specification [rdfa-core].

This specification makes use of the following namespaces:

csvw:
http://www.w3.org/ns/csvw#
rdf:
http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs:
http://www.w3.org/2000/01/rdf-schema#
xsd:
http://www.w3.org/2001/XMLSchema#
dc:
http://purl.org/dc/terms/
dcat:
http://www.w3.org/ns/dcat#
prov:
http://www.w3.org/ns/prov#

3. Mapping Core Tabular Data

The procedures and rules for mapping tabular data compliant with the core tabular data model are described below.

Note

Core tabular data lacks any annotation, whether from a header line within the CSV file or from a separate metadata document.

3.1 Generating RDF

3.1.1 Table-level processing

  1. The output graph SHALL contain a resource that describes the table: the table resource.

    The table resource SHALL be of type csvw:Table.

    Note

    csvw:Table, an RDFS Class, is a sub-class of dcat:Dataset (as defined in [vocab-dcat]).

  2. The table resource SHALL be explicitly identified as [CSV Location]#table, where [CSV Location] is the absolute URL of the source CSV file.

    Should the table resource in the RDF mapping of core tabular data be explicitly identified (e.g. as [CSV Location]#table)?

  3. The output graph SHALL contain a resource of type dcat:Distribution (as defined in [vocab-dcat]) that describes the CSV-encoded distribution of the tabular data.

    The distribution SHALL be related to the table resource using the predicate dcat:distribution and SHALL state the absolute URL of the source CSV file using a triple with the predicate dcat:downloadURL.

    Are the abstract tabular data and the CSV that encodes it the same thing? (Is DCAT distribution appropriate?)

  4. The output graph SHALL contain one resource for each of the rows within the tabular data.

    Each row-level resource SHALL be related to the table resource using the predicate csvw:row.

    Note

    csvw:row, an RDF Property, is a sub-property of rdfs:member (as defined in [rdf-schema]).

    What should the name of the property be that relates rows to the table? row is one option; hasRow is another.

    Refer to Section 3.1.2 Row-level processing for further details on the description of row-level entities.

  5. Optionally, the output graph MAY contain information describing how and when the output graph was created using terms from the PROV Ontology [prov-o].

    If provenance information is to be included in the output graph then the table resource SHALL use the predicate prov:activity to refer to a resource of type prov:Activity that describes the mapping activity.

    The prov:Activity resource SHOULD provide information on the start and end time of the mapping activity and refer to the original CSV file. The prov:Activity resource MAY indicate the location of the generated RDF output graph if known.

    The example below provides an illustration of provenance information, where [CSV Location] is the absolute URL of the source CSV file, [Start Time] is the start time of the mapping activity, [End Time] is the finish time of the mapping activity (both expressed as xsd:dateTime) and [RDF Output Location] is the location of the generated output graph.

    Example 1: Provenance information
    <> a csvw:Table;
      prov:activity [
        a prov:Activity;
        prov:startedAtTime [Start Time];
        prov:endedAtTime [End Time];
        prov:generated <[RDF Output Location]>;
        prov:qualifiedUsage [
          a prov:Usage ;
          prov:entity <[CSV Location]> ;
          prov:hadRole csvw:csvEncodedTabularData
        ]
      ];
      ...
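
The mandatory table-level triples (steps 1 to 4) can be sketched as follows. This is an illustration only, not a reference implementation: the function name and the blank node labels _:dist and _:rowN are assumptions made for this sketch, and the output is written as N-Triples lines for simplicity. Optional provenance information is omitted.

```python
from __future__ import annotations

RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"
CSVW = "http://www.w3.org/ns/csvw#"
DCAT = "http://www.w3.org/ns/dcat#"


def table_level_triples(csv_url: str, row_count: int) -> list[str]:
    """Build N-Triples lines for the table resource, its distribution and its rows."""
    table = f"<{csv_url}#table>"  # step 2: [CSV Location]#table
    triples = [
        f"{table} {RDF_TYPE} <{CSVW}Table> .",          # step 1
        f"{table} <{DCAT}distribution> _:dist .",       # step 3
        f"_:dist {RDF_TYPE} <{DCAT}Distribution> .",
        f"_:dist <{DCAT}downloadURL> <{csv_url}> .",
    ]
    # Step 4: one row resource per row, related to the table by csvw:row.
    for n in range(1, row_count + 1):
        triples.append(f"{table} <{CSVW}row> _:row{n} .")
    return triples


for line in table_level_triples("http://example.org/people-and-points.csv", 4):
    print(line)
```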

3.1.2 Row-level processing

Each row in the tabular data is processed sequentially to produce a resource in the output graph corresponding to that row: the row resource.

  1. A row resource SHALL contain one triple for each column where the cell value is not null (e.g. where the cell does not contain an empty string; "").

  2. The cell value SHALL be related to the row resource using a triple with the predicate [CSV Location]#_col=[N], where [CSV Location] is the absolute URL of the source CSV file and [N] is the column number.

    What to do with conversion if no column name is given?

  3. Where the cell value is null, the triple SHALL be omitted from the output graph.

  4. Given the absence of metadata annotations to indicate the type of data present in a given column, all cell values SHALL be treated as strings.

Should the mapping output for a given row include a reference to the CSV source row?
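
The row-level rules above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: rows are represented as blank nodes labelled _:rowN (a label chosen for this sketch), output is written as N-Triples lines, and only the most common string escapes are handled.

```python
from __future__ import annotations

import csv
import io


def row_triples(csv_text: str, csv_url: str) -> list[str]:
    """Emit one string-valued triple per non-empty cell, keyed by column number."""
    triples = []
    for row_number, row in enumerate(csv.reader(io.StringIO(csv_text)), start=1):
        subject = f"_:row{row_number}"
        for col_number, cell in enumerate(row, start=1):
            if cell == "":
                continue  # rule 3: a null cell produces no triple
            predicate = f"<{csv_url}#_col={col_number}>"  # rule 2
            escaped = (
                cell.replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n")
                .replace("\r", "\\r")
            )
            triples.append(f'{subject} {predicate} "{escaped}" .')  # rule 4: plain string
    return triples


data = "1,Jill,Smith,50\n2,Eve,,94\n3,Adam,Johnson,\n4,John,Doe,80\n"
for line in row_triples(data, "http://example.org/people-and-points.csv"):
    print(line)
```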

3.2 Examples

The following example provides a numeric score for four fictional people. A row number is included for convenience. There are four columns and four rows. Given that no metadata annotations are provided, it is very difficult to ascertain the subject of the tabular data without additional insight.

1  Jill  Smith    50
2  Eve            94
3  Adam  Johnson
4  John  Doe      80

The CSV input (published at http://example.org/people-and-points.csv):

Example 2: CSV input
1,Jill,Smith,50
2,Eve,,94
3,Adam,Johnson,
4,John,Doe,80

The resulting RDF output graph:

Example 3: RDF output (Turtle syntax)
@prefix :     <http://example.org/people-and-points.csv#> .
@prefix csvw: <http://www.w3.org/ns/csvw#> .
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

:table a csvw:Table ;
  dcat:distribution [
    a dcat:Distribution ;
    dcat:downloadURL <http://example.org/people-and-points.csv> 
  ] ;
  csvw:row [
    :_col\=1 "1" ;
    :_col\=2 "Jill" ;
    :_col\=3 "Smith" ;
    :_col\=4 "50" ;
  ] , [
    :_col\=1 "2" ;
    :_col\=2 "Eve" ;
    :_col\=4 "94" ;
  ] , [
    :_col\=1 "3" ;
    :_col\=2 "Adam" ;
    :_col\=3 "Johnson" ;
  ] , [
    :_col\=1 "4" ;
    :_col\=2 "John" ;
    :_col\=3 "Doe" ;
    :_col\=4 "80" ;
  ] ;
  prov:activity [
    a prov:Activity ;
    prov:startedAtTime "2014-12-15T12:44:42"^^xsd:dateTime ;
    prov:endedAtTime "2014-12-15T12:44:42"^^xsd:dateTime ;
    prov:qualifiedUsage [
      a prov:Usage ;
      prov:entity <http://example.org/people-and-points.csv> ;
      prov:hadRole csvw:csvEncodedTabularData ;
    ] ;
  ] ;
  .

4. Mapping Annotated Tabular Data

The procedures and rules for mapping annotated tabular data compliant with the annotated tabular data model are described below.

The metadata for annotated tabular data MAY be provided by either or both of the following sources:

  • A header line embedded within the CSV file.

  • A separate metadata document.

Mapping applications SHALL establish a column description object for each column within the annotated tabular data. The column description object contains the aggregated set of metadata properties for a given column that affect how the cell values within the associated column are expressed in the output graph. Metadata properties are sourced from the header line in the CSV file and column description in the metadata document and enriched with inherited properties from the table description and schema.

The output graph MAY include some Direct Annotations sourced from the metadata document. Where these are natural language properties, a language tag (as defined in [rdf11-concepts]) SHALL be provided for those properties where locale information is specified within the metadata description. Where multiple locale-specific values of a natural language property have been defined using a language map (as defined in [json-ld]), each value SHALL be expressed as a separate triple with the appropriate language tag.
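
As an illustration of the language map rule, the sketch below emits one language-tagged triple per entry. The function name, the example subject and the French title are assumptions made for this sketch; output is written as N-Triples lines.

```python
from __future__ import annotations


def language_map_triples(subject: str, predicate: str, language_map: dict) -> list[str]:
    """Emit one language-tagged literal triple per entry of a language map."""
    triples = []
    for language, text in language_map.items():
        escaped = text.replace("\\", "\\\\").replace('"', '\\"')
        triples.append(f'{subject} {predicate} "{escaped}"@{language} .')
    return triples


titles = {"en": "Tree Operations", "fr": "Opérations sur les arbres"}
for line in language_map_triples(
    "<http://example.org/tree-ops>", "<http://purl.org/dc/terms/title>", titles
):
    print(line)
```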

Clearly, in order to process annotated tabular data, a mapping application MUST have access to the full metadata description associated with the tabular data.

URL expansion behaviour of relative URLs SHALL be consistent with Section 6.3 IRI Expansion in [json-ld-api]. The base URL provides the URL against which relative URLs from annotated tabular data are resolved. The base URL SHALL be that of the source CSV file.
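
The effect of resolving relative URLs against the base URL can be approximated with Python's standard library. Note the assumption: urljoin performs generic RFC 3986 reference resolution and does not implement the full IRI Expansion algorithm of [json-ld-api] (for example, it knows nothing about terms or compact IRIs), so this is a sketch of the expected results rather than a conforming implementation.

```python
from urllib.parse import urljoin


def resolve(base_url: str, reference: str) -> str:
    """Resolve a possibly relative URL against the URL of the source CSV file."""
    return urljoin(base_url, reference)


base = "http://example.org/tree-ops.csv"
print(resolve(base, "tree-ops"))              # http://example.org/tree-ops
print(resolve(base, "#GID"))                  # http://example.org/tree-ops.csv#GID
print(resolve(base, "http://example.com/x"))  # absolute URLs are returned unchanged
```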

What is default value if @base is not defined in the metadata description?

4.1 Generating RDF

4.1.1 Table-level processing

  1. The output graph SHALL contain a table resource that describes the annotated table.

    The table resource SHALL be of type csvw:Table.

  2. Where provided in the table description, the value of metadata property @id SHALL be used to identify the table resource.

  3. The output graph SHALL contain a resource of type dcat:Distribution (as defined in [vocab-dcat]) that describes the CSV-encoded distribution of the tabular data.

    The distribution SHALL be related to the table resource using the predicate dcat:distribution and SHALL state the absolute URL of the source CSV file using a triple with the predicate dcat:downloadURL.

  4. Where a header line is present in the CSV file then, for each column, the cell value from the column header SHALL be assigned to metadata properties name and title within the column description object for that column.

    Note

    The language of the column header is inferred to be that of the data within the column as specified by inherited property language.

    Where the column header is null, the value assigned to name SHALL be _col=[N], where [N] is the column number, and no value SHALL be assigned to title.

  5. Where present in the table description, the following metadata properties SHALL be included in the output graph as properties of the table resource:

    • notes - the array of entities representing structured annotations on the tabular data SHALL be included verbatim.

      Note

      The Web Annotation Working Group is developing a vocabulary for expressing annotations which we anticipate referencing from this specification. Issues likely to be covered therein include: how to anchor the annotation to a target in the tabular data and/or CSV file, what form the annotations themselves may take (e.g. a simple literal annotation body, or whether additional formatting properties are required to indicate that the annotation is expressed in, say, Markdown or HTML).

      Exact handling of annotations.

      Additionally, the mechanism to reference the annotation target (within tabular data) is still unclear - especially given the confusion on identifying row numbers (ISSUE #68 refers).

    • Any Common Properties (as defined in Section 3.3 Common Properties of [tabular-metadata]).

  6. Any of the inherited properties null, language, separator, format, datatype, default defined within the table description and/or schema SHALL be added to the column description object for each column.

    Where the same property is defined in both the table description and the schema, the value from the schema SHALL take precedence.

  7. Each column description SHALL be matched to a column in the tabular data based on the order that the description is listed in the columns array of the schema.

    For each column description in the metadata document, the following metadata properties SHALL be added to the relevant column description object established by the mapping application:

    • name.

      Where metadata property name is also provided via the header line the value from the column description in the metadata document SHALL take precedence.

    • title.

      More than one value of title MAY be provided in the column description, in which case an array of values SHALL be stored along with any assertions regarding the language of each value.

      Where metadata property title is also provided via the header line the value from the column header SHALL occupy the first position in an array of title values.

    • predicateUrl.

    • urlTemplate.

    • Inherited properties null, language, separator, format, datatype, and default are added to the column description object, overwriting values added in the previous step where properties are duplicated.

  8. For each column description object where metadata property predicateUrl has not been assigned within the column description, the value of predicateUrl SHALL be set as the value of metadata property name.

    Note

    Here, the value of metadata property name is treated as a fragment identifier relative to the base URL. As a URL fragment, the value of name is subject to percent encoding (as defined in [rfc3986]).

    Mapping applications MAY assert that the resource identified by the value of predicateUrl is of type rdf:Property.

  9. Where the following Direct Annotations are provided for columns within the tabular data, these SHALL be included in the output graph using triples whose subject is the RDF Property identified by the value of metadata property predicateUrl for the associated column:

    • title.

    • Any Common Properties (as defined in Section 3.3 Common Properties of [tabular-metadata]).

    Where language information about values of title or Common Properties is known, the appropriate language tag SHALL be appended to the triple.

  10. The output graph SHALL contain one resource for each of the rows within the tabular data.

    Each row-level resource SHALL be related to the table resource using the predicate csvw:row.

    Refer to Section 4.1.2 Row-level processing for further details on the description of row-level entities.

  11. Optionally, the output graph MAY contain information describing how and when the output graph was created using terms from the PROV Ontology [prov-o].

    If provenance information is to be included in the output graph then the table resource SHALL use the predicate prov:activity to refer to a resource of type prov:Activity that describes the mapping activity.

    The prov:Activity resource SHOULD provide information on the start and end time of the mapping activity and refer to the original CSV file. The prov:Activity resource MAY indicate the location of the generated RDF output graph if known.

    Furthermore, if the metadata annotations are provided in one or more metadata documents (e.g. as table description, schema and column descriptions) then the provenance information SHOULD also include information about each of those metadata documents.

    The example below provides an illustration of provenance information, where [CSV Location] is the absolute URL of the source CSV file, [Metadata Location] is the location of the metadata document, [Start Time] is the start time of the mapping activity, [End Time] is the finish time of the mapping activity (both expressed as xsd:dateTime) and [RDF Output Location] is the location of the generated output graph.

    Example 4: Provenance information
    <> a csvw:Table;
      prov:activity [
        a prov:Activity;
        prov:startedAtTime [Start Time];
        prov:endedAtTime [End Time];
        prov:generated <[RDF Output Location]>;
        prov:qualifiedUsage [
          a prov:Usage ;
          prov:entity <[CSV Location]> ;
          prov:hadRole csvw:csvEncodedTabularData
        ];
        prov:qualifiedUsage [
          a prov:Usage ;
          prov:entity <[Metadata Location]> ;
          prov:hadRole csvw:tabularMetadata
        ]
      ];
      ...
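
The construction of column description objects (steps 4, 6, 7 and 8) can be sketched as follows. This is a minimal illustration under several assumptions: the metadata document has already been parsed into a Python dictionary, duplicate titles are dropped when merging (the text above does not say whether they should be), language assertions on titles are ignored, and the name is percent-encoded using the characters that [rfc3986] permits in a fragment. The function and constant names are hypothetical.

```python
from __future__ import annotations

from urllib.parse import quote

INHERITED = ("null", "language", "separator", "format", "datatype", "default")
# Characters RFC 3986 allows unescaped in a fragment, beyond the unreserved set.
FRAGMENT_SAFE = "!$&'()*+,;=:@/?"


def build_column_descriptions(header: list[str], table_desc: dict, base_url: str) -> list[dict]:
    schema = table_desc.get("schema", {})
    described = schema.get("columns", [])
    columns = []
    for index, header_cell in enumerate(header):
        col: dict = {}
        # Step 4: a non-null header cell supplies both name and title.
        if header_cell:
            col["name"] = header_cell
            col["title"] = [header_cell]
        else:
            col["name"] = f"_col={index + 1}"
        # Step 6: table description first, then schema, so the schema value wins.
        for source in (table_desc, schema):
            for prop in INHERITED:
                if prop in source:
                    col[prop] = source[prop]
        # Step 7: column descriptions are matched to columns by position.
        desc = described[index] if index < len(described) else {}
        if "name" in desc:
            col["name"] = desc["name"]
        if "title" in desc:
            extra = desc["title"] if isinstance(desc["title"], list) else [desc["title"]]
            titles = col.get("title", [])  # the header value stays in first position
            col["title"] = titles + [t for t in extra if t not in titles]
        for prop in ("predicateUrl", "urlTemplate") + INHERITED:
            if prop in desc:
                col[prop] = desc[prop]
        # Step 8: default predicateUrl is the name as a fragment of the base URL.
        if "predicateUrl" not in col:
            col["predicateUrl"] = f"{base_url}#{quote(col['name'], safe=FRAGMENT_SAFE)}"
        columns.append(col)
    return columns


metadata = {
    "language": "en",
    "schema": {"columns": [{"name": "GID", "title": ["GID", "Generic Identifier"]}]},
}
for column in build_column_descriptions(
    ["GID", "On Street"], metadata, "http://example.org/tree-ops.csv"
):
    print(column)
```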

4.1.2 Row-level processing

Each row in the tabular data is processed sequentially to produce a row resource. The behaviour exhibited when processing a given cell within the current row depends on the metadata properties of the column description object for the column in which that cell resides. The effect of each metadata property is defined in Section 4.1.3 Metadata property effects on row-level mapping behaviour.

  1. A row resource SHALL contain one triple for each column where the cell value is not null.

  2. Where the metadata property urlTemplate is provided in the schema, the row resource SHALL be explicitly identified using the value resulting from the expansion of the [uri-template] specified in the urlTemplate property.

    In the absence of metadata property urlTemplate, the row resource SHALL be treated as a blank node [rdf11-concepts].

    Note

    The variables in the URI Template expression relate to the name property specified for each column. During template expansion, the variables evaluate to the cell value within the row being processed that is associated with the named column.

    The variable _row evaluates to the number of the row being processed.

    Once the URL has been generated via the template expansion, relative URLs are resolved against the base URL to create an absolute URL.

  3. Where the cell value is null, the triple SHALL be omitted from the output graph unless a default value is specified for that column (see metadata properties null and default).

  4. For each column, the value of metadata property predicateUrl from the column description object SHALL be used to relate the row resource to the cell value within the column.

  5. When included in the output graph, the cell value SHALL be subject to the effect of the metadata properties for that column (if any are specified).
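
Identification of the row resource from urlTemplate (step 2) can be sketched as below. This is a deliberate simplification: only the simplest form of [uri-template] expression, {name}, is expanded, and variable names are limited to letters, digits, underscore and dot. The function name and the example templates are assumptions made for this sketch. A return value of None indicates that no template was supplied and the row resource is a blank node.

```python
from __future__ import annotations

import re
from urllib.parse import quote, urljoin


def row_identifier(url_template: str | None, row_values: dict,
                   row_number: int, base_url: str) -> str | None:
    """Expand simple {name} expressions, then resolve against the base URL."""
    if url_template is None:
        return None  # no template: the row resource is a blank node
    # Variables are the named columns of this row, plus _row for the row number.
    variables = dict(row_values, _row=str(row_number))

    def substitute(match: re.Match) -> str:
        # Undefined variables expand to nothing; values are percent-encoded.
        return quote(variables.get(match.group(1), ""), safe="")

    expanded = re.sub(r"\{([A-Za-z0-9_.]+)\}", substitute, url_template)
    return urljoin(base_url, expanded)


base = "http://example.org/tree-ops.csv"
print(row_identifier("#tree-{GID}", {"GID": "1"}, 1, base))
print(row_identifier("row/{_row}", {"GID": "2"}, 2, base))
```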

4.1.3 Metadata property effects on row-level mapping behaviour

The following metadata properties modify the way that cell values are incorporated into the output graph:

null

By default, a cell value is deemed to be null if it contains an empty string. If specified, the metadata property null provides a token (string) that can be used to identify null values.

language

Where metadata property language is specified, the value of that property SHALL be used as a language tag (as specified in [rdf11-concepts]) for simple literal values (e.g. those values whose datatype is http://www.w3.org/2001/XMLSchema#string).

separator

Where metadata property separator is defined, the cell value SHALL be parsed into an ordered list of values, using the value of separator as the delimiter.

Given that RDF does not provide any implicit ordering of triples, the list of values shall be expressed in the output graph as an RDF List (as described in Section 5.2.1 rdf:List of [rdf-schema]).
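
The expansion of a separated cell value into an RDF List can be sketched as follows. The function name, the blank node labels and the example predicate are assumptions made for this sketch; list members are written as plain string literals and are assumed to contain no characters that need escaping. Output is written as N-Triples lines.

```python
from __future__ import annotations

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def rdf_list_triples(subject: str, predicate: str, cell_value: str,
                     separator: str, label: str = "list") -> list[str]:
    """Split a cell value and chain the parts with rdf:first / rdf:rest."""
    items = cell_value.split(separator)
    nodes = [f"_:{label}{i}" for i in range(1, len(items) + 1)]
    triples = [f"{subject} {predicate} {nodes[0]} ."]
    for i, item in enumerate(items):
        # The last node points to rdf:nil, which terminates the list.
        rest = nodes[i + 1] if i + 1 < len(nodes) else f"<{RDF}nil>"
        triples.append(f'{nodes[i]} <{RDF}first> "{item}" .')
        triples.append(f"{nodes[i]} <{RDF}rest> {rest} .")
    return triples


for line in rdf_list_triples("_:row1", "<http://example.org/data.csv#tags>", "a;b;c", ";"):
    print(line)
```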

Should there be an option for unordered lists in RDF mapping?

datatype and format

Where metadata property datatype is defined, the triple included in the output graph for this cell SHALL assert the datatype of the cell value using the value of the datatype property.

Where metadata property datatype is undefined, the column SHALL be inferred to hold values of datatype string.

Note

Where the metadata property separator is specified (e.g. to indicate that a cell value is to be parsed into a list of values), the datatype specified by datatype SHALL be inferred to apply to the members of the resulting list.
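
The combined effect of datatype and language on a literal can be sketched as follows. The sketch assumes that the datatype name coincides with a local name in the xsd: namespace (true for names such as integer or date, but not for every datatype named in this document) and writes the literal in N-Triples syntax. The function name is hypothetical.

```python
from __future__ import annotations

XSD = "http://www.w3.org/2001/XMLSchema#"


def literal(value: str, datatype: str | None = None, language: str | None = None) -> str:
    """Format a cell value as a plain, language-tagged or datatyped literal."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    if datatype in (None, "string"):
        # An undefined datatype is inferred to be string; only strings take a language tag.
        return f'"{escaped}"@{language}' if language else f'"{escaped}"'
    return f'"{escaped}"^^<{XSD}{datatype}>'


print(literal("Celtis australis", language="la"))
print(literal("50", datatype="integer"))
```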

The following datatypes are given special attention:

  • Datatypes with embedded syntax: xml, json and html.

    These datatypes are treated as literal values; no attempt SHOULD be made to 'unpack' the structured syntax to create sub-objects within the output graph.

  • Booleans: boolean.

    Metadata property format MAY be provided for a boolean-typed column, providing non-standard tokens for true and false (e.g. Y|N); see Section 3.12.3 Formats for booleans in [tabular-metadata].

    If a boolean type is declared, the cell value SHALL be processed as follows:

    1. if the value is true, 1 or, if the format property is defined, the value of true, then the output graph SHALL include the value true;
    2. else if the value is false, 0 or, if the format property is defined, the value of false, then the output graph SHALL include the value false;
    3. else the output graph SHALL include the cell value verbatim.
  • Numbers: number, decimal, integer, nonPositiveInteger, negativeInteger, long, int, short, nonNegativeInteger, unsignedLong, unsignedInt, unsignedShort, positiveInteger, float and double.

    Cell values that are asserted to be numeric shall be expressed in the output graph as numbers.

    It is not uncommon for numbers within tabular data to be formatted for human consumption, which may involve using commas for decimal points, grouping digits in the number using commas, or adding currency symbols or percent signs to the number.

    Metadata property format MAY be provided to describe the formatting of the cell values and so help the mapping application convert the cell value to a number format readily consumable by downstream applications.

    Describing the formatting of numbers is currently unresolved and is likely to require information on decimal separator characters, grouping characters and possibly others such as Infinity, NaN, currency tokens, negative numbers appearing in parentheses etc.

    In the interim, mapping applications are not required to undertake any reformatting and may simply pass the cell value to the output graph verbatim.

  • Dates, times and durations: date, time, datetime, dateTime and duration.

    A standard syntax for dates and times is defined by [iso8601]. This format can be readily consumed by software applications. However, dates and times are often provided in a locale-specific format, or use alternate calendars and/or eras.

    Metadata property format MAY be provided to describe the formatting of cell values and so help the mapping application convert the cell value to a date, time, date-time or duration format readily consumable by downstream applications.

    Note

    Where possible, data publishers SHOULD provide dates and times in the [iso8601] format. However, where data publishers choose to use locale-specific date and time formatting, they SHOULD also provide equivalent values in [iso8601] format (e.g. in a complementary column).

    Describing the formatting of dates and times is currently unresolved. The favoured option is to defer the parsing of dates and times to implementations based on a picture string provided in the metadata. Unfortunately, there is no standard syntax for picture strings, so an array of picture strings relating to common implementations seems like the best option. For example:

    "datatype": "date",
    "format": {
      "picture-strings": [
        { "unicode": "dd MMM yyyy" },
        { "xpath": "[D01] [MN,*-3] [Y0001]" }
      ]
    }

    Where an implementation is able to interpret one of the provided picture strings, the date-time value reformatted in [iso8601] format shall be included in the output graph, else the original cell value shall be included verbatim.

    In the interim, mapping applications are not required to undertake any reformatting and may simply pass the cell value to the output graph verbatim.

    A list of potential date-time formatting implementations needs to be defined.

  • Uniform Resource Identifiers: anyURI.

    Where the datatype is specified as anyURI, the cell value is inferred to provide a URI (as described in [rfc3986]) rather than a literal value.
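
The three-step boolean processing described above can be sketched as follows. The sketch assumes that format, when given, has the form trueToken|falseToken (as in the Y|N example); the function name is hypothetical.

```python
from __future__ import annotations


def map_boolean(cell_value: str, fmt: str | None = None) -> str:
    """Map a boolean-typed cell value to 'true' or 'false', else pass it through."""
    true_tokens = {"true", "1"}
    false_tokens = {"false", "0"}
    if fmt:
        # Assumed format syntax: the true token, a vertical bar, the false token.
        true_token, _, false_token = fmt.partition("|")
        true_tokens.add(true_token)
        if false_token:
            false_tokens.add(false_token)
    if cell_value in true_tokens:
        return "true"
    if cell_value in false_tokens:
        return "false"
    return cell_value  # step 3: unrecognised values are included verbatim


print(map_boolean("Y", "Y|N"))  # true
print(map_boolean("0"))         # false
print(map_boolean("maybe"))     # maybe
```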

urlTemplate

If metadata property urlTemplate is specified, the value used in the output graph SHALL be the result of the URI Template expansion, as defined in Section 3.1 Property Syntax of [tabular-metadata].

Once the URL has been generated via the template expansion, relative URLs are resolved against the base URL to create an absolute URL.

Should the cell-value URL Template be treated as a datatype?

default

If metadata property default is specified and the cell value is deemed to be null, then the value of default SHALL be used in the output graph.
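
The interaction of null and default can be sketched as follows. One assumption is made explicit here: when a null token is specified, it replaces the empty string as the test for null rather than supplementing it; the text above does not settle this point. A return value of None means that no triple is generated for the cell.

```python
from __future__ import annotations


def effective_value(cell_value: str, null_token: str = "",
                    default: str | None = None) -> str | None:
    """Return the value to place in the output graph, or None to omit the triple."""
    if cell_value == null_token:
        return default  # None when no default is specified
    return cell_value


print(effective_value("80"))                     # 80
print(effective_value("", default="unknown"))    # unknown
print(effective_value("NA", null_token="NA"))    # None: the triple is omitted
```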

4.2 Examples

Issue

These examples don't really show the edge cases - probably need to rework them

The first example illustrates how a CSV file with metadata annotations drawn only from a header line is processed. The tabular data lists countries, giving their country code and name. There are two columns, named country and name, and four rows.

country  name
AD       Andorra
AF       Afghanistan
AI       Anguilla
AL       Albania

The CSV input (published at http://example.org/country-codes-and-names.csv):

Example 5: CSV input
country,name
AD,Andorra
AF,Afghanistan
AI,Anguilla
AL,Albania

The resulting RDF output graph (published to http://example.org/country-codes-and-names.ttl):

Example 6: RDF output (Turtle syntax)
@prefix :     <http://example.org/country-codes-and-names.csv#> .
@prefix csvw: <http://www.w3.org/ns/csvw#> .
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

:country a rdf:Property ;
  rdfs:label "country" .

:name a rdf:Property ;
  rdfs:label "name" .

:table a csvw:Table ;
  dcat:distribution [
    a dcat:Distribution ;
    dcat:downloadURL <http://example.org/country-codes-and-names.csv> 
  ] ;
  csvw:row [
    :country "AD" ;
    :name "Andorra" ;
  ] , [
    :country "AF" ;
    :name "Afghanistan" ;
  ] , [
    :country "AI" ;
    :name "Anguilla" ;
  ] , [
    :country "AL" ;
    :name "Albania" ;
  ] ;
  prov:activity [
    a prov:Activity ;
    prov:startedAtTime "2014-12-16T12:15:06"^^xsd:dateTime ;
    prov:endedAtTime "2014-12-16T12:15:07"^^xsd:dateTime ;
    prov:generated <http://example.org/country-codes-and-names.ttl> ;
    prov:qualifiedUsage [
      a prov:Usage ;
      prov:entity <http://example.org/country-codes-and-names.csv> ;
      prov:hadRole csvw:csvEncodedTabularData ;
    ] ;
  ] ;
  .

In the example output above we see Turtle's shorthand syntax for dealing with blank nodes. Should the recommended output form explicitly identify the blank nodes using the row number? e.g.

  csvw:row _:1 , _:2 , _:3 , _:4 .

  _:1 :country "AD" ; :name "Andorra" .

etc.

The second example illustrates how the mapping is modified by the addition of metadata annotations in a metadata document. The CSV file is a small extract from a much larger Tree Inventory dataset from the City of Palo Alto, which supports maintaining and tracking the city's public trees and urban forest. There are five columns, named GID, On Street, Species, Trim Cycle and Inventory Date, and three rows.

GID | On Street | Species | Trim Cycle | Inventory Date
1 | ADDISON AV | Celtis australis | Large Tree Routine Prune | 10/18/2010
2 | EMERSON ST | Liquidambar styraciflua | Large Tree Routine Prune | 6/2/2010
3 | EMERSON ST | Liquidambar styraciflua | Large Tree Routine Prune | 6/2/2010

The CSV input (published at http://example.org/tree-ops.csv):

Example 7: CSV input
GID,On Street,Species,Trim Cycle,Inventory Date
1,ADDISON AV,Celtis australis,Large Tree Routine Prune,10/18/2010
2,EMERSON ST,Liquidambar styraciflua,Large Tree Routine Prune,6/2/2010
3,EMERSON ST,Liquidambar styraciflua,Large Tree Routine Prune,6/2/2010

The metadata description (published at http://example.org/tree-ops.csv-metadata.json):

Example 8: Metadata description
{
  "@id": "tree-ops",
  "@context": {
    "@language": "en"
  },
  "dcat:distribution": {
    "dcat:downloadURL": "tree-ops.csv"
  },
  "dc:title": "Tree Operations",
  "dc:keywords": ["tree", "street", "maintenance"],
  "dc:publisher": [{
    "schema:name": "Example Municipality",
    "schema:web": "http://example.org"
  }],
  "dc:license": "http://opendefinition.org/licenses/cc-by/",
  "dc:modified": {
    "@value": "2010-12-31",
    "@type": "http://www.w3.org/2001/XMLSchema#date"
  },
  "schema": {
    "columns": [{
      "name": "GID",
      "title": [
        "GID",
        "Generic Identifier"
      ],
      "dc:description": "An identifier for the operation on a tree.",
      "datatype": "string",
      "required": true,
      "unique": true
    }, {
      "name": "on-street",
      "title": "On Street",
      "dc:description": "The street that the tree is on.",
      "datatype": "string"
    }, {
      "name": "species",
      "title": "Species",
      "dc:description": "The species of the tree.",
      "datatype": "string"
    }, {
      "name": "trim-cycle",
      "title": "Trim Cycle",
      "dc:description": "The operation performed on the tree.",
      "datatype": "string"
    }, {
      "name": "inventory-date",
      "title": "Inventory Date",
      "dc:description": "The date of the operation that was performed.",
      "datatype": "date",
      "format": {
        "picture-strings": {
          "unicode": "M/d/yyyy"
        }
      }
    }],
    "primaryKey": "GID",
    "urlTemplate": "#gid-{GID}"
  }
}

The resulting RDF output graph (published to http://example.org/tree-ops.ttl):

Example 9: RDF output (Turtle syntax)
@prefix :       <http://example.org/tree-ops.csv#> .
@prefix csvw:   <http://www.w3.org/ns/csvw#> .
@prefix dc:     <http://purl.org/dc/terms/> .
@prefix dcat:   <http://www.w3.org/ns/dcat#> .
@prefix prov:   <http://www.w3.org/ns/prov#> .
@prefix rdf:    <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs:   <http://www.w3.org/2000/01/rdf-schema#> .
@prefix schema: <http://schema.org/> .
@prefix xsd:    <http://www.w3.org/2001/XMLSchema#> .

:GID a rdf:Property ;
  rdfs:label "GID" , "GID" , "Generic Identifier" ;
  dc:description "An identifier for the operation on a tree." .

:on-street a rdf:Property ;
  rdfs:label "On Street" , "On Street" ;
  dc:description "The street that the tree is on." .

:species a rdf:Property ;
  rdfs:label "Species" , "Species" ;
  dc:description "The species of the tree." .

:trim-cycle a rdf:Property ;
  rdfs:label "Trim Cycle" , "Trim Cycle" ;
  dc:description "The operation performed on the tree." .

:inventory-date a rdf:Property ;
  rdfs:label "Inventory Date" , "Inventory Date" ;
  dc:description "The date of the operation that was performed." .

:table a csvw:Table ;
  dcat:distribution [
    a dcat:Distribution ;
    dcat:downloadURL <http://example.org/tree-ops.csv> 
  ] ;
  dc:title "Tree Operations" ;
  dc:keywords "tree" , "street", "maintenance" ;
  dc:publisher [
    schema:name "Example Municipality" ;
    schema:web "http://example.org"
  ] ;
  dc:license <http://opendefinition.org/licenses/cc-by/> ;
  dc:modified "2010-12-31"^^xsd:date ;
  csvw:row :gid-1 , :gid-2 , :gid-3 ;
  prov:activity [
    a prov:Activity ;
    prov:startedAtTime "2014-12-16T12:15:06"^^xsd:dateTime ;
    prov:endedAtTime "2014-12-16T12:15:07"^^xsd:dateTime ;
    prov:generated <http://example.org/tree-ops.ttl> ;
    prov:qualifiedUsage [
      a prov:Usage ;
      prov:entity <http://example.org/tree-ops.csv> ;
      prov:hadRole csvw:csvEncodedTabularData 
    ] ;
    prov:qualifiedUsage [
      a prov:Usage ;
      prov:entity <http://example.org/tree-ops.csv-metadata.json> ;
      prov:hadRole csvw:tabularMetadata 
    ] ;
  ] ;
  .

:gid-1
  :GID "1"^^xsd:string ;
  :on-street "ADDISON AV"^^xsd:string ;
  :species "Celtis australis"^^xsd:string ;
  :trim-cycle "Large Tree Routine Prune"^^xsd:string ;
  :inventory-date "2010-10-18"^^xsd:date ;
  .

:gid-2
  :GID "2"^^xsd:string ;
  :on-street "EMERSON ST"^^xsd:string ;
  :species "Liquidambar styraciflua"^^xsd:string ;
  :trim-cycle "Large Tree Routine Prune"^^xsd:string ;
  :inventory-date "2010-06-02"^^xsd:date ;
  .

:gid-3
  :GID "3"^^xsd:string ;
  :on-street "EMERSON ST"^^xsd:string ;
  :species "Liquidambar styraciflua"^^xsd:string ;
  :trim-cycle "Large Tree Routine Prune"^^xsd:string ;
  :inventory-date "2010-06-02"^^xsd:date ;
  .
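
The row subjects :gid-1, :gid-2 and :gid-3 in the output above appear to result from expanding the urlTemplate "#gid-{GID}" for each row and resolving the result against the URL of the CSV file. A minimal Python sketch of that step follows. It assumes only simple {name} variables, the most basic form defined in [uri-template], and the function name `expand_url_template` is hypothetical.

```python
import re
from urllib.parse import quote, urljoin


def expand_url_template(template, row, base):
    """Expand simple {name} variables and resolve the result against base."""
    def substitute(match):
        # Percent-encode everything outside the RFC 3986 unreserved set.
        return quote(str(row[match.group(1)]), safe="")

    relative = re.sub(r"\{([^{}]+)\}", substitute, template)
    return urljoin(base, relative)


base = "http://example.org/tree-ops.csv"
rows = [{"GID": "1"}, {"GID": "2"}, {"GID": "3"}]
subjects = [expand_url_template("#gid-{GID}", row, base) for row in rows]
print(subjects)
```

Operators, lists and the other higher-level features of URI Templates are not handled by this sketch.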
Note

Note that, in the example RDF output, the values of the Inventory Date column have been converted from the format M/d/yyyy (as described by the Unicode Date Format Pattern string [tr35]) into [iso8601] formatted date strings that comply with xsd:date syntax.
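
The conversion can be sketched in Python as follows. The translation of the Unicode pattern M/d/yyyy into the strptime directive string "%m/%d/%Y" is done by hand here and is an assumption of this sketch; a general processor would need to translate [tr35] patterns itself. The function name `to_xsd_date` is hypothetical.

```python
from datetime import datetime


def to_xsd_date(value, strptime_format="%m/%d/%Y"):
    """Parse a date cell and return its xsd:date lexical form (YYYY-MM-DD)."""
    return datetime.strptime(value, strptime_format).date().isoformat()


for cell in ["10/18/2010", "6/2/2010"]:
    print(cell, "->", to_xsd_date(cell))
```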

Issue

Note the repeated rdfs:label values for the column predicates. They arise because the CSV header and the metadata both supply the same title. Should the mapping application attempt to deduplicate them?
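
Should deduplication be adopted, one possible approach is sketched below in Python: the title from the CSV header and the titles from the metadata are merged in first-seen order. The function name `column_labels` is hypothetical and this is not a specified behaviour.

```python
def column_labels(header_title, metadata_titles):
    """Merge the CSV header title with metadata titles, dropping repeats."""
    if isinstance(metadata_titles, str):
        # A single title may be given as a plain string rather than an array.
        metadata_titles = [metadata_titles]
    # dict.fromkeys preserves first-seen order while removing duplicates.
    return list(dict.fromkeys([header_title, *metadata_titles]))


print(column_labels("GID", ["GID", "Generic Identifier"]))
print(column_labels("On Street", "On Street"))
```

Because an RDF graph is a set of triples, identical label triples collapse when the output is parsed in any case; the question concerns only the serialised form.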

Issue

The assertions about the column predicates could include rdfs:range to specify the data type. Is this desirable?
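
To illustrate what such an assertion might look like, the Python sketch below derives an rdfs:range statement from a column description. It assumes that the datatype name in the metadata is also the local name of the corresponding XML Schema datatype, which holds for "string" and "date" in Example 8 but is not guaranteed in general. The function name `range_triple` is hypothetical.

```python
def range_triple(column):
    """Return a Turtle rdfs:range statement for a column, or None."""
    datatype = column.get("datatype")
    if datatype is None:
        # No datatype annotation: nothing to assert about the range.
        return None
    return f':{column["name"]} rdfs:range xsd:{datatype} .'


print(range_triple({"name": "inventory-date", "datatype": "date"}))
print(range_triple({"name": "GID", "datatype": "string"}))
```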

5. Mapping Grouped Tabular Data

The procedures and rules for mapping a collection of tabular data compliant with the grouped tabular data model are described below.

The metadata for a group of tables SHALL be provided by a table group description (as defined in [tabular-metadata]) within the associated metadata document.

5.1 Generating RDF

5.1.1 Group-level processing

  1. The output graph SHALL contain a resource that describes the table group: the table group resource.

    The table group resource SHALL be of type csvw:TableGroup.

  2. Where present in the table group description, any Common Properties (as defined in Section 3.3 Common Properties of [tabular-metadata]) SHALL be included in the output graph as properties of the table group resource.

  3. The output graph SHALL contain one resource for each of the tables listed in the resources array of the table group description.

    Each table SHALL be processed sequentially according to the appropriate set of rules for mapping core or annotated tabular data. Refer to Section 3 Mapping Core Tabular Data and Section 4 Mapping Annotated Tabular Data for further details.

    Each table-level resource resulting from processing the tables SHALL be related to the table group resource using the predicate csvw:table.

  4. Any of the inherited properties null, language, separator, format, datatype, or default defined within the table group description SHALL be used to pre-populate the column description objects for each table in the group.

    Where the same property is defined in more than one of the table group description, table description, schema or column description, the order of precedence (highest first) SHALL be:

    1. column description
    2. schema description
    3. table description
    4. table group description
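
The group-level rules above can be sketched in Python as follows. This is a simplified illustration rather than a normative algorithm: each description is modelled as a plain dictionary, triples are modelled as tuples of strings, and the function names `resolve_inherited` and `table_group_triples` and the sample values are hypothetical.

```python
INHERITED = ("null", "language", "separator", "format", "datatype", "default")


def resolve_inherited(group, table, schema, column):
    """Return the effective inherited properties for one column."""
    resolved = {}
    # Apply from lowest to highest precedence so later levels override.
    for description in (group, table, schema, column):
        for name in INHERITED:
            if name in description:
                resolved[name] = description[name]
    return resolved


def table_group_triples(group_subject, table_subjects):
    """Type the table group resource and link each table with csvw:table."""
    triples = [(group_subject, "rdf:type", "csvw:TableGroup")]
    triples += [(group_subject, "csvw:table", t) for t in table_subjects]
    return triples


# Hypothetical descriptions; only inherited properties are considered.
group = {"language": "en", "null": "", "separator": ","}
table = {"null": "NA"}
schema = {}
column = {"name": "inventory-date", "datatype": "date", "null": "-"}

effective = resolve_inherited(group, table, schema, column)
print(effective)
print(table_group_triples(":group", [":table-1", ":table-2"]))
```

Here the column description's value for null overrides the values given at table and group level, while language and separator are inherited unchanged from the table group description.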
Issue

The presence of foreign-key references within the table descriptions may affect the way the data is packaged in the output graph. Reviewers are invited to comment on how grouped tabular data with foreign-key references might best be organised.

5.2 Examples

Use Case 4: Publication of public sector roles and salaries likely provides a good source of material for these examples. To be added.

A. References

A.1 Normative references

[RFC2119]
S. Bradner. Key words for use in RFCs to Indicate Requirement Levels. March 1997. Best Current Practice. URL: https://tools.ietf.org/html/rfc2119
[iso8601]
ISO 8601:2004 Representation of dates and times. International Standard (IS).
[rdf11-concepts]
Richard Cyganiak; David Wood; Markus Lanthaler. RDF 1.1 Concepts and Abstract Syntax. 25 February 2014. W3C Recommendation. URL: http://www.w3.org/TR/rdf11-concepts/
[rdfa-core]
Ben Adida; Mark Birbeck; Shane McCarron; Ivan Herman et al. RDFa Core 1.1 - Third Edition. 16 December 2014. W3C Proposed Edited Recommendation. URL: http://www.w3.org/TR/rdfa-core/
[tabular-data-model]
Jeni Tennison; Gregg Kellogg. Model for Tabular Data and Metadata on the Web. W3C Working Draft. URL: http://www.w3.org/TR/tabular-data-model/
[tabular-metadata]
Rufus Pollock; Jeni Tennison. Metadata Vocabulary for Tabular Data. W3C Working Draft. URL: http://www.w3.org/TR/tabular-metadata/
[tr35]
Mark Davis; CLDR committee members. TR35, Unicode Locale Data Markup Language (LDML). Report. URL: http://unicode.org/reports/tr35/
[uri-template]
Joe Gregorio; Roy T. Fielding; Marc Hadley; Mark Nottingham; David Orchard. URI Template. March 2012. RFC 6570. URL: http://www.rfc-editor.org/rfc/rfc6570.txt

A.2 Informative references

[json-ld]
Manu Sporny; Gregg Kellogg; Markus Lanthaler. JSON-LD 1.0. 16 January 2014. W3C Recommendation. URL: http://www.w3.org/TR/json-ld/
[json-ld-api]
Markus Lanthaler; Gregg Kellogg; Manu Sporny. JSON-LD 1.0 Processing Algorithms and API. 16 January 2014. W3C Recommendation. URL: http://www.w3.org/TR/json-ld-api/
[n-triples]
Gavin Carothers; Andy Seaborne. RDF 1.1 N-Triples. 25 February 2014. W3C Recommendation. URL: http://www.w3.org/TR/n-triples/
[prov-o]
Timothy Lebo; Satya Sahoo; Deborah McGuinness. PROV-O: The PROV Ontology. 30 April 2013. W3C Recommendation. URL: http://www.w3.org/TR/prov-o/
[r2rml]
Souripriya Das; Seema Sundara; Richard Cyganiak. R2RML: RDB to RDF Mapping Language. 27 September 2012. W3C Recommendation. URL: http://www.w3.org/TR/r2rml/
[rdf-schema]
Dan Brickley; Ramanathan Guha. RDF Schema 1.1. 25 February 2014. W3C Recommendation. URL: http://www.w3.org/TR/rdf-schema/
[rdfa-primer]
Ivan Herman; Ben Adida; Manu Sporny; Mark Birbeck. RDFa 1.1 Primer - Second Edition. 22 August 2013. W3C Note. URL: http://www.w3.org/TR/rdfa-primer/
[rfc3986]
T. Berners-Lee; R. Fielding; L. Masinter. Uniform Resource Identifier (URI): Generic Syntax. January 2005. Internet Standard. URL: https://tools.ietf.org/html/rfc3986
[sparql11-query]
Steven Harris; Andy Seaborne. SPARQL 1.1 Query Language. 21 March 2013. W3C Recommendation. URL: http://www.w3.org/TR/sparql11-query/
[trig]
Gavin Carothers; Andy Seaborne. RDF 1.1 TriG. 25 February 2014. W3C Recommendation. URL: http://www.w3.org/TR/trig/
[turtle]
Eric Prud'hommeaux; Gavin Carothers. RDF 1.1 Turtle. 25 February 2014. W3C Recommendation. URL: http://www.w3.org/TR/turtle/
[vocab-dcat]
Fadi Maali; John Erickson. Data Catalog Vocabulary (DCAT). 16 January 2014. W3C Recommendation. URL: http://www.w3.org/TR/vocab-dcat/