Validation, conversion, display and search of tabular data on the web requires additional metadata that describes how the data should be interpreted. This document defines a vocabulary for metadata that annotates tabular data. This can be used to provide metadata at various levels, from collections of data from CSV documents and how they relate to each other down to individual cells within a table.

The CSV on the Web Working Group was chartered to produce a Recommendation "Access methods for CSV Metadata" as well as Recommendations for "Metadata vocabulary for CSV data" and "Mapping mechanism to transforming CSV into various Formats (e.g., RDF, JSON, or XML)". This document aims to primarily satisfy the second of those Recommendations.

Introduction

Interpreting tabular data that is available on the web, particularly as CSV, usually requires additional metadata. As an example, say that the following CSV file were available at http://example.org/tree-ops.csv

      

A human consumer of this data might be able to figure out the meaning of the different columns, particularly if there were some additional human-readable documentation made available. Automated processors would have a much harder time; realistically they would be limited to displaying the information in a table. Making available machine-readable metadata helps with the interpretation of the tabular data. For example, say that the following metadata file were available at http://example.org/tree-ops.csv-metadata.json:

      

Given the location of the CSV file, this metadata document can be located by appending -metadata.json to the URL (as described in Model for Tabular Data and Metadata on the Web). It provides information for different types of applications:

Implementations may fulfil one or more of these functions. In particular, Converters may or may not act as a Validator (perhaps through the setting of a flag), and check the data that they are converting to ensure that it is complant with the schema. If a Converter does not also act as a Validator it may produce invalid output.

The Model for Tabular Data and Metadata on the Web specification defines an Annotated Tabular Data Model in which tables, columns, rows and cells can be annotated with annotations, and a group of tables when is comprised of a set of annotated tables. That specification also describes how to locate metadata about a given CSV file.

This document defines the format and structure of metadata documents, and how these are interpreted to create an Annotated Tabular Data Model. It also defines how to validate tabular data based on some of these annotations.

The metadata format is based on a dialect of [[!JSON-LD]] as defined in . This metadata can be expressed as an RDF graph. However, all applications that conform to this specification (including validators and applications that read or convert tabular data) MUST read the JSON-based format described in this document. It is not necessary for conformant applications to be able to process all JSON-LD, only the dialect defined in this specification.

Tabular data MUST conform to the description from [[!tabular-data-model]]. In particular note that each row MUST contain the same number of cells (although some of these cells may be empty). Given this constraint, not all CSV-encoded data can be considered to be tabular data. As such, the metadata format described in this specification cannot be applied to all CSV files.

This specification makes use of the compact IRI Syntax; please refer to the Compact IRIs from [[!JSON-LD]].

This specification makes use of the following namespaces:

csvw:
http://www.w3.org/ns/csvw#
dc:
http://purl.org/dc/terms/
dcat:
http://www.w3.org/ns/dcat#
foaf:
http://xmlns.com/foaf/0.1/
rdf:
http://www.w3.org/1999/02/22-rdf-syntax-ns#
schema:
http://schema.org/
xsd:
http://www.w3.org/2001/XMLSchema#

Annotating Tables

The metadata defined in this specification is used to provide annotations on an annotated table or group of tables, as defined in [[!tabular-data-model]]. Annotated tables form the basis for all further processing, such as validating, converting or displaying the tables.

All compliant applications MUST create annotated tables based on the algorithm defined here. All compliant applications MUST generate errors if a metadata document does not conform to this specification, such as:

Metadata documents contain descriptions of groups of tables, tables, columns, rows, cells and regions which are used to create annotations on a tabular data model. There are two types of description objects:

The description objects themselves contain a number of properties. These are:

For example, in the column description

{
  "name": "inventory_date",
  "title": "Inventory Date",
  "dc:description": "The date of the operation that was performed.",
  "datatype": {
    "base": "date",
    "format": "M/d/yyyy"
  }
}
      

the properties name, title and dc:description are used to create the name, titles and dc:description annotations on the column in the data model. The datatype and format properties are inherited properties that affect the value of each cell in that column.

The property value of an annotation is that defined in the metadata, unless otherwise noted.

Metadata Format

This section defines a set of properties and permitted values for annotating tabular data, and how these properties should be interpreted by applications.

A metadata document is a JSON document which holds an object at the top level. This object is a description object of either a table group or a single table. A description object is a JSON object that describes a component of the tabular data model (a table group, a table, a column, a row or a cell) and has one or more properties are mapped into properties on that component.

Metadata Properties
Diagram showing the properties of different metadata descriptors (see the diagram in SVG or PNG formats)

Property Syntax

There are different types of properties on description objects:

Array Properties

Array properties hold an array of one or more objects, which are usually description objects.

For example, the resources property is an array property. A table group description might contain:

"resources": [{
  "url": "https://example.org/countries.csv",
  "tableSchema": "https://example.org/countries.json"
}, {
  "url": "https://example.org/country_slice.csv",
  "tableSchema": "https://example.org/country_slice.json"
}]
          

in which case the resources property has a value that is an array of two table description objects.

Link Properties

Link properties hold a single reference to another resource by URL. Their value is a string — resolved as a URL against the base URL.

For example, the url property is a link property. A table description might contain:

"url": "example-2014-01-03.csv"
          

in which case the url property on the table would have a single value, a link to example-2014-01-03.csv, resolved against the base URL of the metadata document in which this was located.

URI Template Properties

A URI template property contains a [[!URI-TEMPLATE]] which can be used to generate a URI. These URI templates are expanded in the context of each row by combining the template with a set of variables with values. The variables that are set are:

column names
a variable is set for each column within the schema; the name of the variable is the name of the column and the value is derived from the value of the cell in that column in the row that is currently being processed, namely one of:
_column
_column is set to the column number of the column from the annotated table that is currently being processed
_sourceColumn
_sourceColumn is set to the source number of the column that is currently being processed; this usually varies from _column by skip columns and header columns
_row
_row is set to the row number of the row from the annotated table that is currently being processed
_sourceRow
_sourceRow is set to the source number of the row that is currently being processed; this usually varies from _row by skip rows and header rows
_name
_name is set to the URI decoded property value of the name property on the cell column that is currently being processed. URI decoding is necessary as name may have been encoded if taken from title; this prevents double percent encoding.

The property value of a URI template property is only available when processing individual cells in a table, usually while converting tables as defined in [[!tabular-data-model]]. The value is the result of:

  1. applying the template against the cell in that column in the row that is currently being processed
  2. expanding any prefixes as if the value were the name of a common property, as described in
  3. resolving the resulting URL against the base URL of the table url if not null

For example, the aboutUrl property holds a URI template that is used to generate a URL identifier for each row, which might look like:

"aboutUrl": "http://example.org/example.csv#row.{_row}"
          

The identifiers that are generated for the rows would then look like http://example.org/example.csv#row.1, http://example.org/example.csv#row.2 and so on.

Alternatively, with the CSV and metadata in the , the aboutUrl might look like:

"aboutUrl": "http://example.org/tree/{on_street}/{GID}"
          

This would generate URIs such as http://example.org/tree/ADDISON%20AV/1 and http://example.org/tree/EMERSON%20ST/2.

If the value of the on_street or GID column were null, the URL would still be generated with the null value generating an empty string in the URL. For example if on_street were null and GID were 3, the generated URL would be http://example.org/tree//3.

Once the URI has been generated, it is resolved against the location of the resource (eg the CSV file) to create an absolute URI. For example, given a aboutUrl within a schema such as:

"aboutUrl": "#row.{_row}"
          

and given a CSV file at http://example.com/temp.csv, the URL for the first row will be http://example.com/temp.csv#row.1.

The propertyUrl property might be defined as "{#_name}", meaning that it resolves as a fragment identifier relative to the URL of the source of the table. For example, accessing it from a column with the column name GID would look like:

"http://example.org/example.csv#GID"
          

A value defined within the data is also subject to expansion. For example, consider the following table:

project_name,project_type,keywords
CSVW,foaf:Project,table;data;conversion
          

The project_type column might have a valueUrl specified as "{project_type}". In the first row the cell value is "foaf:Project". The foaf prefix is understood, as described in to expand to http://xmlns.com/foaf/0.1/Project.

Similarly, the keywords column might have a valueUrl specified as "https://duckduckgo.com/?q={keywords}". If the column also specifies "separator": ";", then the cell value of the keywords column would be an array of the three values table, data and conversion. This is set as the value of the keywords variable within the URI template, which means the result would be https://duckduckgo.com/?q=table,data,conversion.

If the value in the keywords column were an empty sequence (created from an empty cell in the original data), the reference to that column would be expanded to an empty string, generating https://duckduckgo.com/?q=.

When a cell's value is not a string, the canonical representation of that value is used within the expanded URL. For example, the data may include dates such as those in:

          

The Inventory Date column description would indicate that these were dates with the format M/d/yyyy:

{
  "name": "inventory_date",
  "title": "Inventory Date",
  "datatype": {
    "base": "date",
    "format": "M/d/yyyy"
  }
}
          

The string value of the inventory_date column in the first row is parsed to create the date 18th October 2010. When the inventory_date column is referenced within a URI template such as http://example.org/event/{inventory_date}, the canonical representation of that date is used within the URL, giving the result http://example.org/event/2010-10-18.

Column Reference Properties

Column reference properties hold one or more references to other column description objects. The referenced description object must have an name property. Column reference properties can then reference column description objects through values that are:

  • strings — which MUST match the name on a column description object within the metadata document
  • arrays — lists of strings as above

For example, the primaryKey property is an column reference property on the schema. It has to hold references to columns defined elsewhere in the schema, and the descriptions of those columns must have name properties. It can hold a single reference, like this:

"tableSchema": {
  "columns": [{
    "name": "GID"
  }, ... ],
  "primaryKey": "GID"
}
          

or it can contain an array of references, like this:

"tableSchema": {
  "columns": [{
    "name": "givenName"
  }, {
    "name": "familyName"
  }, ... ],
  "primaryKey": [ "givenName", "familyName" ]
}
          

Object Properties

Object properties hold a single object or reference to an object by URL. Their values may be:

  • strings — resolved as URLs against the base URL
  • objects — interpreted as structured objects

Object properties are often used when the values can be or should be values within controlled vocabularies, or structured information which may be held elsewhere. For example, the dialect of a table is an object property. It could be provided as a URL that indicates a commonly used dialect, like this:

"dialect": "http://example.org/tab-separated-values"
          

or a structured object, like this:

"dialect": {
  "delimiter": "\t",
  "encoding": "utf-8"
}
          

Natural Language Properties

Natural language properties hold natural language strings. Their values may be:

  • strings — interpreted as natural language strings in the default language
  • arrays — interpreted as alternative natural language strings in the default language
  • objects whose properties MUST be language codes as defined by [[!BCP47]] and whose values are either strings or arrays, providing natural language strings in that language

Natural language properties are used for things like descriptions and titles. For example, the title property provides a natural language label for a column. If it's a plain string like this:

"title": "Project title"
          

then that string is assumed to be in the language provided through the @language property of the @context (or have no assumed language, if there is no such property). Multiple alternative values can be given in an array:

"title": [
  "Project title",
  "Project"
]
          

It's also possible to provide multiple values in different languages, using an object structure. For example:

"title": {
  "en": "Project title",
  "fr": "Titre du projet"
}
          

and within such an object, the values of the properties can themselves be arrays:

"title": {
  "en": [ "Project title", "Project" ],
  "fr": "Titre du projet"
}
          

The property value of a natural language property is an object whose properties are language codes and where the values of those properties are an array of strings (see Language Maps in [[!JSON-LD]]).

When extracting a property value metadata will have already been merged, a natural language property will already have this form.

Atomic Properties

Atomic properties hold atomic values. Their values may be:

  • numbers — interpreted as integers or doubles
  • booleans — interpreted as booleans (true or false)
  • strings — interpreted as defined by the property
  • objects — interpreted as defined by the property
  • arrays — lists of numbers, booleans, strings, or objects

The property value of a boolean atomic property is false if unset; otherwise, the property value of an atomic property is that set in metadata or null, if unset. Processors MUST raise an error if a property is set to an invalid value type, such as a boolean atomic property being set to the number 1 or a numeric atomic property being set to the string "3.1415".

Top-Level Properties

The top-level object (whether it is a table group description or a table description) MUST have a @context property, an array property, as defined in Section 8.7 of [[!JSON-LD]]. The @context MUST contain one of the following:

@language

an atomic property that indicates the default language for the values of properties in the metadata document; if present, its value MUST be a language code [[!BCP47]] which is the default language for the values of other properties in the metadata document

Note that the @language property of the @context object, which gives the default language used within the metadata file, is distinct from the lang property on a description object, which gives the language used in the data within a group of tables, table or column.

@base

an atomic property that indicates the base URL against which other URLs within the description are resolved; if present, its value MUST be a URL which is resolved against the location of the metadata document to provide the base URL for other URLs in the metadata document; if unspecified, the base URL used for interpreting relative URLs within the metadata document is the location of the metadata document itself

Note that the @base property of the @context object provides the base URL used for URLs within the metadata document, not the URLs that appear as data within the group of tables or table it describes.

Common Properties

Descriptions of groups of tables, tables, schemas, columns, rows and cells MAY contain any properties whose names are either absolute URLs or prefixed names. For example, a table description may contain dc:description, dcat:keyword or schema:copyrightHolder properties to provide a description, keywords or the name of the copyright holder, as defined in Dublin Core Terms, DCAT or schema.org.

Names of Common Properties

The names of common properties are prefixed names, in the syntax prefix:name.

Prefixed names can be expanded to provide a URI, by replacing the prefix and following colon with the URI with the URI that the prefix is associated with. Expansion is intended to be entirely consistent with Section 6.3 IRI Expansion in [[!JSON-LD-API]] and implementations MAY use a JSON-LD processor for performing prefixed name and IRI expansion.

The prefixes that are recognized are those defined for [[rdfa-core]] within the RDFa 1.1 Initial Context and these MUST NOT be overridden. Properties from other vocabularies MUST be named using absolute URLs.

Forbidding the declaration of new prefixes ensures consistent processing between JSON-LD-aware and non-JSON-LD-aware processors.

Implementations SHOULD treat the prefixed names for common properties and the URLs that they expand into in the same way. For example, if an implementation recognises and displays the value of the dc:description property, it should also recognise and display the value of the http://purl.org/dc/terms/description property in the same way.

Values of Common Properties

Common properties can take any JSON value, so long as it adheres to the following restrictions, which are designed to ensure compatibility between JSON-LD-aware and non-JSON-LD-aware processors:

  • If a @value property is used on an object, that object MUST NOT have any other properties aside from either @type or @language, and MUST NOT have both @type and @language as properties. The value of the @value property MUST be a string, number or boolean value.

    If @type is used, its value MUST be one of:

    • a built-in datatype, as defined in
    • a prefixed name using any of the pre-defined prefixes as described in
    • a string, which will be interpreted as an absolute URL

    If a @language property is used, it MUST have a string value that adheres to the syntax defined in [[!BCP47]], or be null.

  • If a @type property is used on an object without a @value property, its value MUST be one of:

    • one of the legitimate values for @type as defined for any of the description objects in this specification
    • a prefixed name using any of the pre-defined prefixes as described in
    • a string, which will be interpreted as an absolute URL

    A @type property can also have a value that is an array of such values.

  • The values of @id properties are link properties and are treated as to be URLs and will have any prefix expanded and the result resolved against the base URL during normalization. Therefore, if a @id property is used on an object, it MUST have a value that is a string and that string MUST NOT start with _:.

  • A @language property MUST NOT be used on an object unless it also has a @value property.

  • Aside from @value, @type, @language and @id, the properties used on an object MUST NOT start with @.

These restrictions are also described in , from the perspective of a processor that otherwise supports JSON-LD.

Table Groups

A table group description is a JSON object that describes a group of tables.

Required Properties

resources

An array property of table descriptions for the tables in the group, namely those listed in the resources annotation on the group of tables being described.

When an array of table descriptions B is merged into an original array of table descriptions A, each table description within B is combined into the original array A by:

  • if there is a table description with the same url in A, the table description from B is merged into the matching table description in A
  • otherwise, the table description from B is appended to the array of table descriptions A

Optional Properties

The description of a group of tables MAY also contain:

tableSchema

An object property that provides a single schema description as described in , used as the default for all the tables in the group. This may be provided as an embedded object within the JSON metadata or as a URL reference to a separate JSON schema document.

tableDirection

An atomic property that MUST have a single string value that is one of "rtl", "ltr" or "default". Indicates whether the tables in the group should be displayed with the first column on the right, on the left, or based on the first character in the table that has a specific direction. The value of this property becomes the value of the direction annotation for all the tables in the table group.

dialect

An object property that provides a single dialect description. If provided, dialect provides hints to processors about how to parse the referenced files for to create tabular data models for the tables in the group. This may be provided as an embedded object or as a URL reference. See for more details.

transformations

An array property of transformation definitions that provide mechanisms to transform the tabular data into other formats. The value of this property becomes the value of the transformations annotation for all the tables in the table group.

When an array of transformation definitions B is merged into an original array of transformation definitions A, each transformation definition within B is combined into the original array A by:

  • if there is a transformation definition with the same url in A, the transformation definition from B is merged into the matching transformation definition in A
  • otherwise, the transformation definition from B is appended to the array of transformation definitions A
notes

An array property that provides an array of objects representing arbitrary annotations on the annotated group of tables. The value of this property becomes the value of the notes annotation for the group of tables. The properties on these objects are interpreted equivalently to common properties as described in . When an array of note objects B is merged into an original array of note objects A, each note object from B is appended into the array A.

@id
If included, @id is a link property that identifies the group of tables, as defined by [[!tabular-data-model]], described by this table group description. The value of this property becomes the value of the id annotation for the table group. Publishers MAY include this to provide additional information to JSON-LD based toolchains.
@type

If included, @type is an atomic property that MUST be set to "TableGroup". Publishers MAY include this to provide additional information to JSON-LD based toolchains.

The description MAY contain any common properties to provide extra metadata about the set of tables as a whole.

The description MAY contain inherited properties to describe cells within the tables.

Tables

A table description is a JSON object that describes a table within a CSV file.

Required Properties

url

This link property gives the single URL of the CSV file that the table is held in, relative to the location of the metadata document. The value of this property is the value of the url annotation for the table this table description describes.

Optional Properties

The description of a table MAY also contain:

tableSchema

An object property that provides a single schema description as described in . This may be provided as an embedded object within the JSON metadata or as a URL reference to a separate JSON schema document. If a table description is within a table group description, the tableSchema from that table group acts as the default for this property.

The property value of a tableSchema is that defined in the table description, if any, or that defined in the table group description, if any, or null.

notes

An array property that provides an array of objects representing arbitrary annotations on the annotated tabular data model. The value of this property becomes the value of the notes annotation for the table. The properties on these objects are interpreted equivalently to common properties as described in . When an array of note objects B is merged into an original array of note objects A, each note object from B is appended into the array A.

The Web Annotation Working Group is developing a vocabulary for expressing annotations. In future versions of this specification, we anticipate referencing that vocabulary.

tableDirection

As defined for table groups. The value of this property becomes the value of the direction annotation for this table.

transformations

As defined for table groups. The value of this property becomes the value of the transformations annotation for this table.

dialect

As defined for table groups.

suppressOutput

A boolean atomic property. If true, suppresses any output that would be generated when converting this table. The value of this property becomes the value of the suppress output annotation for this table.

@id

If included, @id is a link property that identifies the table, as defined in [[!tabular-data-model]], described by this table description. The value of this property becomes the value of the id annotation for this table. Publishers MAY include this to provide additional information to JSON-LD based toolchains.

@type

If included, @type is an atomic property that MUST be set to "Table". Publishers MAY include this to provide additional information to JSON-LD based toolchains.

The description MAY contain any common properties to provide extra metadata about the table as a whole.

The description MAY contain inherited properties to describe cells within the table.

Dialect Descriptions

Much of the tabular data that is published on the web is messy, and CSV parsers frequently need to be configured in order to correctly read in CSV. A dialect description provides hints to parsers about how to parse the file linked to from the url property. It can have any of the following properties, which relate to the flags described in Section 5 Parsing Tabular Data within [[!tabular-data-model]]:

encoding

An atomic property that sets the encoding flag to the single provided string value, which MUST be a defined [[!encoding]].

lineTerminator

An atomic property that sets the line terminator flag to the single provided string value.

quoteChar

An atomic property that sets the quote character flag to the single provided value, which MUST be a single character.

doubleQuote

A boolean atomic property that, if true, sets the escape character flag to ". If false, to \.

skipRows

An numeric atomic property that sets the skip rows flag to the single provided numeric value, which MUST be a non-negative integer.

commentPrefix

An atomic property that sets the comment prefix flag to the single provided value, which MUST be a single character string.

header

A boolean atomic property that, if true, sets the header row count flag to 1, and if false to 0, unless headerRowCount is provided, in which case the value provided for the header property is ignored.

headerRowCount

An numeric atomic property that sets the header row count flag to the single provided value, which MUST be a non-negative integer.

delimiter

An atomic property that sets the delimiter flag to the single provided value, which MUST be a single character string.

skipColumns

An numeric atomic property that sets the skip columns flag to the single provided numeric value, which MUST be a non-negative integer.

headerColumnCount

An numeric atomic property that sets the header column count flag to the single provided value, which MUST be non-negative integer.

skipBlankRows

An boolean atomic property that sets the skip blank rows flag to the single provided boolean value.

skipInitialSpace

A boolean atomic property that, if true, sets the trim flag to "start". If false, to false. If the trim property is provided, the skipInitialSpace property is ignored.

trim

An atomic property that, if the boolean true, sets the trim flag to true and if the boolean false to false. If the value provided is a string, sets the trim flag to the provided value, which MUST be one of "true", "false", "start" or "end".

@id

If included, @id is a link property that identifies the dialect described by this dialect description. Publishers MAY include this to provide additional information to JSON-LD based toolchains.

@type

If included, @type is an atomic property that MUST be set to "Dialect". Publishers MAY include this to provide additional information to JSON-LD based toolchains.

Dialect descriptions do not provide a mechanism for handling CSV files in which there are multiple tables within a single file (eg separated by empty lines).

The property value of a dialect description is that defined in the table description, if any, or that defined in the table group description, if any, or the default defined for that property.

The default dialect description for CSV files is:

{
  "encoding": "utf-8",
  "lineTerminator": "\r\n",
  "quoteChar": "\"",
  "doubleQuote": true,
  "skipRows": 0,
  "commentPrefix": "#",
  "header": true,
  "headerRowCount": 1,
  "delimiter": ",",
  "skipColumns": 0,
  "headerColumnCount": 0,
  "skipBlankRows": false,
  "skipInitialSpace": false,
  "trim": false
}
          

Transformation Definitions

A transformation definition is a definition of how tabular data can be transformed into another format using a script or template. It has the following properties:

Required Properties

Transformation definitions MUST have the following properties:

url

A link property giving the single URL of the file that the script or template is held in, relative to the location of the metadata document.

targetFormat

A link property giving the single URL for the format that will be created through the transformation. If one has been defined, this should be a URL for a media type, in the form http://www.iana.org/assignments/media-types/media-type such as http://www.iana.org/assignments/media-types/text/calendar. Otherwise, it can be any URL that describes the target format.

The targetFormat URL is intended as an informative identifier for the target format, and applications SHOULD NOT access the URL.

scriptFormat

A link property giving the single URL for the format that is used by the script or template. If one has been defined, this should be a URL for a media type, in the form http://www.iana.org/assignments/media-types/media-type such as http://www.iana.org/assignments/media-types/application/javascript. Otherwise, it can be any URL that describes the script or template format.

The scriptFormat URL is intended as an informative identifier for the template format, and applications SHOULD NOT access the URL. The template formats that an application supports are implementation defined.

Optional Properties

Transformation definitions MAY have the following properties:

title

A natural language property that describes the format that will be generated from the transformation. This is useful if the target format is a generic format (such as application/json) and the transformation is creating a specific profile of that format.

source

A single string atomic property that provides, if specified, the format to which the tabular data should be transformed prior to the transformation using the script or template. If the value is json, the tabular data MUST first be transformed to JSON as defined by [[!csv2json]] using standard mode. If the value is rdf, the tabular data MUST first be transformed to an RDF graph as defined by [[!csv2rdf]] using standard mode. If the source property is missing or null then the source of the transformation is the annotated tabular data model. No other values are valid.

@id

If included, @id is a link property that identifies the transformation described by this transformation definition. Publishers MAY include this to provide additional information to JSON-LD based toolchains.

@type

If included, @type is an atomic property that MUST be set to "Template". Publishers MAY include this to provide additional information to JSON-LD based toolchains.

The transformation definition MAY contain any common properties to provide extra metadata about the transformation.

Example

The following transformation definition will enable a processor that supports it to generate an iCalendar document using a Mustache template based on the JSON created from the simple mapping to JSON.

{
  "url": "templates/ical.txt",
  "title": "iCalendar",
  "targetFormat": "http://www.iana.org/assignments/media-types/text/calendar",
  "scriptFormat": "https://mustache.github.io/",
  "source": "json"
} 
          

Schemas

A tableSchema is a definition of a tabular format that may be common to multiple tables. For example, multiple tables from different sources may have the same columns and be designed such that they can be aggregated together.

A schema description is a JSON object that encodes the information about a schema, which describes the structure of a table. All the properties of a schema description are optional.

columns

An array property of column descriptions as described in . These are matched to columns in tables that use the schema by position: the first column description in the array applies to the first column in the table, the second to the second and so on.

The name properties of the column descriptions MUST be unique within a given table description.

When an array of column descriptions B is merged into an original array of column descriptions A, each column description within B is combined into the original array A, based on the index of each column description, as follows:

  1. if the column descriptions at the same index within A and B have the same name, the column description from B is merged into the matching column description in A
  2. otherwise, if the column descriptions at the same index within A and B have matching titles (both titles are the same and either they both have the same language or one of them has an undefined language (und)), the column description from B is merged into the matching column description in A
  3. otherwise, if there are column descriptions at the same index within A and B, validators and converters MUST generate an error
  4. otherwise, if at a given index there is no column description within A, but there is a column description within B, then:
    1. if the virtual property of the column description in B is true, then the column description is appended to A
    2. otherwise, validators and converters MUST generate an error
  5. otherwise, if at a given index there is no column description within B, but there is a column description within A, then:
    1. if the virtual property of the column description in A is true, then the column description is retained
    2. otherwise, validators and converters MUST generate an error
primaryKey

A column reference property that holds either a single reference to a column description object or an array of references. The value of this property is used to create the value of the primary key annotation for each row within a table that uses this schema by creating a list of the cells in that row that are in the referenced columns.

As defined in [[!tabular-data-model]], validators MUST check that each row has a unique combination of cells in the indicated columns. For example, if primaryKey is set to ["familyName", "givenName"] then every row must have a unique value for the combination of the familyName and givenName columns.

foreignKeys

An array property of foreign key definitions that define how the values from specified columns within this table link to rows within this table or other tables. A foreign key definition is a JSON object that MUST contain only the following properties:

columnReference

A column reference property that holds either a single reference to a column description object within this schema, or an array of references. These form the referencing columns for the foreign key definition.

reference

An object property that identifies a referenced table and a set of referenced columns within that table. Its properties are:

resource

A link property holding a URL that is the identifier for a specific table that is being referenced. If this property is present then tableSchema MUST NOT be present. The table group MUST contain a table whose url annotation is identical to the property value of this property. That table is the referenced table.

schemaReference

A link property holding a URL that is the identifier for a schema that is being referenced. If this property is present then resource MUST NOT be present. The table group MUST contain a table with a tableSchema having a @id that is identical to the property value of this property, and there MUST NOT be more than one such table. That table is the referenced table.

columnReference

A column reference property that holds either a single reference to a column description object within the tableSchema of the referenced table, or an array of such references.

The value of this property is used to create the value of the foreign keys annotation on the table using this schema by creating a list of foreign keys comprising a list of columns in the table and a list of columns in the referenced table. The value of this property is also used to create the value of the referenced rows annotation on each of the rows in the table that uses this schema, which is a pair of the relevant foreign key and the referenced row in the referenced table.

As defined in [[!tabular-data-model]], validators MUST check that, for each row, the combination of cells in the referencing columns reference a unique row within the referenced table through a combination of cells in the referenced columns. For examples, see and .

It is not required for the resource or schema referenced from a foreignKeys property to have a similarly defined primaryKey, though frequently it will.

When an array of foreign key definitions B is merged into an original array of foreign key definitions A, each foreign key definition within B which does not appear within A is appended to the original array A.

@id

If included, @id is a link property that identifies the schema described by this schema description. Publishers MAY include this to provide additional information to JSON-LD based toolchains.

@type

If included, @type is an atomic property that MUST be set to "Schema". Publishers MAY include this to provide additional information to JSON-LD based toolchains.

The description MAY contain any common properties to provide extra metadata about the schema as a whole.

The description MAY contain inherited properties to describe cells within the table.

Examples

Foreign Key Reference Between Resources

A list of countries is published at http://example.org/countries.csv with the structure:

            

Another file contains information about the population in some countries each year, at http://example.com/country_slice.csv with the structure:

            

The following metadata for the group of tables links the two together by defining a foreignKeys property:

            

When the population data in country_slice.csv is validated, the validator must check that every countryRef within country_slice.csv has a matching countryCode within countries.csv.

Foreign Key Reference Between Schemas

When publishing information about public sector roles and salaries, as in Use Case 4, the UK government requires departments to publish two files which are interlinked. The first lists senior grades (simplified here) eg at HEFCE_organogram_senior_data_31032011.csv:

Post Unique Reference,              Name,Grade,             Job Title,Reports to Senior Post
                90115,        Steve Egan,SCS1A,Deputy Chief Executive,                 90334
                90250,     David Sweeney,SCS1A,              Director,                 90334
                90284,       Heather Fry,SCS1A,              Director,                 90334
                90334,Sir Alan Langlands, SCS4,       Chief Executive,                    xx
            

The second provides information about the number of junior positions that report to those individuals (simplified here) eg at HEFCE_organogram_junior_data_31032011.csv:

Reporting Senior Post,Grade,Payscale Minimum (£),Payscale Maximum (£),Generic Job Title,Number of Posts in FTE,          Profession
                90284,    4,               17426,               20002,    Administrator,                     2,Operational Delivery
                90284,    5,               19546,               22478,    Administrator,                     1,Operational Delivery
                90115,    4,               17426,               20002,    Administrator,                  8.67,Operational Delivery
                90115,    5,               19546,               22478,    Administrator,                   0.5,Operational Delivery
            

The schemas are reused by multiple departments and for multiple pairs of files. The schemas are therefore defined in separate files, and they need to define links between the schemas which are then picked up as applying between tables that use those schemas.

The metadata file for the particular publication of the files above is:

{
  "@context": "http://www.w3.org/ns/csvw",
  "resources": [{
    "url": "HEFCE_organogram_senior_data_31032011.csv",
    "tableSchema": "http://example.org/schema/senior-roles.json"
  }, {
    "url": "HEFCE_organogram_junior_data_31032011.csv",
    "tableSchema": "http://example.org/schema/junior-roles.json"
  }]
}
            

The schema for the senior role CSV (at http://example.org/schema/senior-roles.json) is as follows:

{
  "@id": "http://example.org/schema/senior-roles.json",
  "@context": "http://www.w3.org/ns/csvw",
  "columns": [{
    "name": "ref",
    "title": "Post Unique Reference"
  }, {
    "name": "name",
    "title": "Name"
  }, {
    "name": "grade",
    "title": "Grade"
  }, {
    "name": "job",
    "title": "Job Title"
  }, {
    "name": "reportsTo",
    "title": "Reports to Senior Post"
  }],
  "primaryKey": "ref"
}
            

The schema for the junior role CSV (at http://example.org/schema/junior-roles.json) is as follows; it includes a foreign key reference to the senior roles schema:

{
  "@id": "http://example.org/schema/junior-roles.json",
  "@context": "http://www.w3.org/ns/csvw",
  "columns": [{
    "name": "reportsTo",
    "title": "Reporting Senior Post"
  }, 
  ...
  ],
  "foreignKeys": [{
    "columnReference": "reportsTo",
    "reference": {
      "schemaReference": "http://example.org/schema/senior-roles.json",
      "columnReference": "ref"
    }
  }]
}
            

In the first line of HEFCE_organogram_junior_data_31032011.csv, the reportsTo (Reporting Senior Post) column contains the value 90284. When validating that file, validators will check that there is a single row within the table generated from HEFCE_organogram_senior_data_31032011.csv whose ref column contains the value 90284.

Weak Linking between Resources

Foreign key definitions provide for strong linking between resources. It is also possible to provide weak linking between resources when converting tabular data into other formats, using aboutUrl and valueUrl.

Taking the example above as a starting point, the schema for HEFCE_organogram_senior_data_31032011.csv could use aboutUrl to provide a URL for each row, which can similarly be created as a valueUrl for the reportsTo column:

{
  "@id": "http://example.org/schema/senior-roles.json",
  "@context": "http://www.w3.org/ns/csvw",
  "aboutUrl": "#role-{ref}",
  "columns": [{
    "name": "ref",
    "title": "Post Unique Reference"
  }, {
    "name": "name",
    "title": "Name"
  }, {
    "name": "grade",
    "title": "Grade"
  }, {
    "name": "job",
    "title": "Job Title"
  }, {
    "name": "reportsTo",
    "title": "Reports to Senior Post",
    "valueUrl": "#role-{reportsTo}"
  }],
  "primaryKey": "ref"
}
            

The URLs generated for the values of the reportsTo will (if the data is correct) match the URLs generated for each row within the table. There will be no error, however, if there is a value in the reportsTo column that does not match a value in the ref column.

Columns

A column description is a simple JSON object that describes a single column. The description provides additional human-readable documentation for a column, as well as additional information that may be used to validate the cells within the column, create a user interface for data entry, or inform conversion into other formats.

name

An atomic property that gives a single canonical name for the column. The value of this property is used to create the value of the name annotation for the described column. This MUST be a string. Conversion specifications MUST use this property as the basis for the names of properties/elements/attributes in the results of conversions.

The property value of name is that defined within metadata, if it exists. Otherwise, it is the first value from the property value of title having the same language tag as default language or und of not specified percent-encoded as necessary to conform to the syntactic requirements described below, as a string without language, if any. Otherwise, it is the string "_col.[N]" where [N] is the column number.

For ease of reference within URI template properties, column names are restricted as defined in Variables in [[!URI-TEMPLATE]] with the additional provision that names beginning with "_" are reserved by this specification and MUST NOT be used.

title

A natural language property that provides possible alternative names for the column. The value of this property is used to create the value of the titles annotation for the described column.

required

A boolean atomic property taking a single value which indicates whether every cell within the column must have a non-null value. The value of this property is used to create the value of the required annotation for the described column.

suppressOutput

A boolean atomic property. If true, suppresses any output that would be generated when converting cells in this column. The value of this property is used to create the value of the suppress output annotation for the described column.

virtual

A boolean atomic property taking a single value which indicates whether the column is a virtual column not present in the original source. The value of this property is used to create the value of the virtual annotation for the described column. If present, a virtual column MUST appear after all other non-virtual column definitions.

Virtual columns are useful for inserting cells with default values into an annotated table to control the results of conversions.

We invite comment on whether virtual columns are useful enough to include in the final recommendation in spite of the added complexity.

@id

If included, @id is a link property that identifies the columns, as defined in [[!tabular-data-model]] and potentially appearing across separate tables, described by this column description. Publishers MAY include this to provide additional information to JSON-LD based toolchains.

@type

If included, @type is an atomic property that MUST be set to "Column". Publishers MAY include this to provide additional information to JSON-LD based toolchains.

The description MAY contain any common properties to provide extra metadata about the column as a whole, such as a full description.

The description MAY contain inherited properties to describe cells within the column.

Examples

Use of virtual columns

Virtual columns are useful when data needs to be added as part of an output transformation that doesn't exist in the source file. This may be to add type information to a column, or to relate different columns having different aboutUrl. For example, the http://example.org/tree-ops.csv example used in the introduction can be used with the following metadata:

            

This metadata creates a relationship model between data in each column by different combinations of aboutUrl, propertyUrl, and valueUrl on existing columns, and defining new virtual columns to supply additional information. In this case, the on_street and inventory_date values are split into separate entities, each having their own aboutUrl. New virtual columns are defined to provide a location type, and to relate the main row entity to the event and location associated with it. The result of converting the table to RDF would include the following, for the first row, with the contributions from the virtual columns highlighted:

<#gid-1>
  schema:url <#gid-1> ;
  schema:name "Celtis australis" ;
  :trim_cycle "Large Tree Routine Prune" ;
  schema:event <#event-2010-10-18> ;
  schema:location <#location-1> ;
  .

<#event-1> a schema:Event ;
  schema:startDate "2010-10-18"^^xsd:date ;
  .

<#location-1> a schema:PostalAddress ;
  schema:streetAddress "ADDISON AV" ;
  .
            

The JSON would similarly include, again with the contributions from the virtual columns highlighted:

{
  "@id": "#gid-1",
  "schema:url": "#gid-1",
  "schema:name": "Celtis australis",
  "trim_cycle": "Large Tree Routine Prune",
  "schema:event": {
    "@id": "#event-1",
    "@type": "schema:Event",
    "schema:startDate": "2010-10-18"
  },
  "schema:location": {
    "@id": "#location-1",
    "@type": "schema:PostalAddress",
    "schema:streetAddress": "ADDISON AV"
  }
}
            

Inherited Properties

A cell may be assigned annotations based on properties on the description objects for the group of tables, table, schema or column that it appears in. These properties are known as inherited properties and are listed below. To ascertain a value for certain annotations on cells, an application MUST identify the relevant property in the descriptions of the group of tables, table, schema or column.

Applications MUST raise an error if the value of a property in a table description is not compatible with the value of that property on the table group description it appears in. Applications MUST raise an error if the value of a property in a schema description is not compatible with the value of that property on the table description and table group description it appears in. Applications MUST raise an error if the value of a property in a column description is not compatible with the value of that property in the schema description, table description and/or table group description it appears in.

A value for an inherited property on a column or table description is compatible with with an inherited value on the same property on a table or table group description if they are the same value or if the first value is a sub-value of the second value. The definitions of individual inherited properties indicate what values count as sub-values of others.

null

An atomic property giving the string or strings used for null values within the data. If the string value of the cell is equal to any one of these values, the cell value is null. If not specified, the default for the null property is the empty string. A value for this property is compatible with an inherited value if it is a subset of the inherited value.

lang

An atomic property giving a single string language code as defined by [[!BCP47]]. Indicates the language of the value within the cell. A value for this property is compatible with an inherited value if it is a sub-language of the inherited value; for example en-US is compatible with en but not fr.

textDirection

An atomic property that MUST have a single string value that is one of "rtl" or "ltr" (the default). Indicates whether the text within cells should be displayed by default as left-to-right or right-to-left text. The value of this property is used to create the value of the text direction annotation for the cell. A value for this property is compatible with an inherited value only if they are identical.

separator

An atomic property that MUST have a single string value that is the character used to separate items in the string value of the cell. If null or unspecified, the cell does not contain a list. Otherwise, application MUST split the string value of the cell on the specified separator character and parse each of the resulting strings separately. The cell's value will then be a list. Conversion specifications MUST use the separator to determine the conversion of a cell into the target format. See for more details. A value for this property is compatible with an inherited value only if they are identical.

ordered

A boolean atomic property taking a single value which indicates whether a list that is the value of the cell is ordered (if true) or unordered (if false). The default is false. This property is irrelevant if the separator is null or undefined, but this is not an error. A value for this property is compatible with an inherited value only if they are identical.

default

An atomic property holding a single string that is used to create a default value for the cell in cases where the original string value is an empty string. This default value MAY be used when converting the table into other formats, or when the table is displayed. If not specified, the default for the default property is the empty string, "". A value for this property is compatible with an inherited value only if they are identical.

datatype

An atomic property that contains either a single string that is the main datatype of the values of the cell or a datatype description object. If the value of this property is a string, it is a term defined in the default context representing a built-in datatype URL; if it is an object then it describes a more specialised datatype. If a cell contains a sequence (ie the separator property is specified and not null) then this property specifies the datatype of each value within that sequence. See for more details.

A value for this property is compatible with an inherited value if they are identical, or if the value is a subtype within the datatype hierarchy defined in , including if the inherited value is explicitly specified as the base of this value.

aboutUrl

A URI template property that MAY be used to create a unique identifier for each cell within a row when mapping data to other formats. There are no compatibility restrictions on this property. The value of this property is used to create the value of the about URL annotation for the cell.

aboutUrl is typically defined on a schema description or table description. If defined on individual column descriptions, care must be taken to ensure that transformed cell values maintain a semantic relationship.

propertyUrl

An URI template property that MAY be used to create a URI for a property if the table is mapped to another format. There are no compatibility restrictions on this property. The value of this property is used to create the value of the property URL annotation for the cell.

propertyUrl is typically defined on a column description. If defined on a schema description, table description or table group description, care must be taken to ensure that transformed cell values maintain an appropriate semantic relationship, for example by including the name of the column in the generated URL by using _name in the template.

valueUrl

An URI template property that is used to map the values to the cells into URLs. See for details. There are no compatibility restrictions on this property. The value of this property is used to create the value of the value URL annotation for the cell.

This allows a cell value to define one or more RDF resources value of a cell instead of a literal values, as defined in [[rdf-concepts]]. For example, if the value were "{#reference}", each cell value of a column named reference would be used to create a URI such as http://example.com/#1234, if 1234 were a cell value of that column.

valueUrl is typically defined on a column description. If defined on a schema description, table description or table group description, care must be taken to ensure that transformed cell values maintain an appropriate semantic relationship.

The property value of valueUrl is null if the cell value is null and the cell column is not a virtual column.

The value of an inherited property is the first value, if any, found by looking in the current description object through all of its containing objects: a inherited property defined in a column description takes precedence of one defined in a schema description, which in turn takes precedence of one defined in a table description, which in turn takes precedence of one defined in a table group description.

Datatypes

Cells within tables may be annotated with a datatype which indicates the type of the values obtained by parsing the string value of the cell. See for details of how string values are parsed against datatypes.

Built-in Datatypes

The possible built-in datatypes are:

  • the datatypes defined in [[!xmlschema11-2]] as derived from and including anyAtomicType, as shown in
  • the datatype number which is exactly equivalent to double
  • the datatype binary which is exactly equivalent to base64Binary
  • the datatype datetime which is exactly equivalent to dateTime
  • the datatype any which is exactly equivalent to anyAtomicType
  • the datatype xml, a sub-type of string which indicates the value is an XML fragment
  • the datatype html, a sub-type of string which indicates the value is an HTML fragment
  • the datatype json, a sub-type of string which indicates the value is serialized JSON
Built-in Datatype Hierarchy diagram
Diagram showing the built-in datatypes, based on [[!xmlschema11-2]]; names in paranthesis denote aliases to the [[!xmlschema11-2]] terms (see the diagram in SVG or PNG formats)

Derived Datatypes

More specialised datatypes can be defined through a datatype description. A datatype description may have any of the following properties, all of which are optional.

base

An atomic property that contains a single string: a term defined in the default context representing a built-in datatype URL, as listed above. If this property is missing, its default is string. All values of the datatype MUST be valid values of the base datatype.

format

An atomic property that contains either a single string or an object that defines the format of a value of this type, used when parsing a string value as described in .

length

A numeric atomic property that contains a single integer that is the exact length of the value. See for details.

minLength

An atomic property that contains a single integer that is the minimum length of the value. See for details.

maxLength

A numeric atomic property that contains a single integer that is the maximum length of the value. See for details.

minimum

An atomic property that contains a single number that is the minimum valid value (inclusive); equivalent to minInclusive. See for details.

maximum

An atomic property that contains a single number that is the maximum valid value (inclusive); equivalent to maxInclusive. See for details.

minInclusive

An atomic property that contains a single number that is the minimum valid value (inclusive). See for details.

maxInclusive

An atomic property that contains a single number that is the maximum valid value (inclusive). See for details.

minExclusive

An atomic property that contains a single number that is the minimum valid value (exclusive). See for details.

maxExclusive

An atomic property that contains a single number that is the maximum valid value (exclusive). See for details.

The datatype description MAY contain any common properties to provide extra metadata about the datatype, such as a title or description.

Length Constraints

The length, minLength and maxLength properties indicate the exact, minimum and maximum lengths of values of a datatype.

Applications MUST raise an error if both length and minLength are specified and they do not have the same value. Similarly, applications MUST raise an error if both length and maxLength are specified and they do not have the same value. Applications MUST raise an error if length, maxLength or minLength are specified and the base datatype is not string or one of its subtypes, or a binary type.

The length of a value is determined as follows:

  • if the value is null, its length is zero
  • if the value is a string or one of its subtypes, its length is the number of characters in the value
  • if the value is of a binary type, its length is the number of bytes in the binary value

Value Constraints

The minimum, maximum, minInclusive, maxInclusive, minExclusive and maxExclusive properties indicate limits on values of a datatype. These apply to numeric, date/time and duration types.

In all ways, including the errors described below, the minimum property is equivalent to the minInclusive property and the maximum property is equivalent to the maxInclusive property. Applications MUST raise an error if both minimum and minInclusive are specified and they do not have the same value. Similarly, applications MUST raise an error if both maximum and maxInclusive are specified and they do not have the same value.

Applications MUST raise an error if both minInclusive and minExclusive are specified, or if both maxInclusive and maxExclusive are specified. Applications MUST raise an error if both minInclusive and maxInclusive are specified and maxInclusive is less than minInclusive, or if both minInclusive and maxExclusive are specified and maxExclusive is less than or equal to minInclusive. Similarly, applications MUST raise an error if both minExclusive and maxExclusive are specified and maxExclusive is less than minExclusive, or if both minExclusive and maxInclusive are specified and maxInclusive is less than or equal to minExclusive.

Applications MUST raise an error if minimum, minInclusive, maximum, maxInclusive, minExclusive or maxExclusive are specified and the base datatype is not a numeric, date/time or duration type.

Validation against these properties is as defined in [[!xmlschema11-2]].

Parsing cells

Unlike many other data formats, tabular data is designed to be read by humans. For that reason, it's common for data to be represented within tabular data in a human-readable way. The null, default, separator, datatype, format and lang properties provide the information needed to parse the string value of a cell into its (semantic) value. This is used:

After parsing, the cell value can be:

The process of parsing the string value of a cell into a single value or a list of values is as follows:

  1. unless the datatype is string, json, xml, html, anyAtomicType or any, replace all carriage return (#xD), line feed (#xA) and tab (#x9) characters with space characters
  2. unless the datatype is string, json, xml, html, anyAtomicType, any or normalizedString, strip leading and trailing whitespace from the string value and replace all instances of two or more whitespace characters with a single space character
  3. if the resulting string is an empty string, apply the remaining steps to the string given by the default property
  4. if the separator property is not null and the resulting string is an empty string, the cell value is an empty list
  5. if the separator property is not null, the cell value is a list of values created by:
    1. if the normalized string is an empty string, apply the remaining steps to the string given by the default property
    2. if the normalized string is the same as any one of the values of the null property, then the resulting value is null
    3. split the normalized string at the character specified by the separator property
    4. unless the datatype is string or anyAtomicType or any, strip leading and trailing whitespace from these strings
    5. applying the remaining steps to each of the strings in turn
  6. if the string is an empty string, apply the remaining steps to the string given by the default property
  7. if the string is the same as any one of the values of the null property, then the resulting value is null
  8. validate the string based on the datatype, using the format property if one is specified, as described below, and then against the constraints described in ; if there are any errors, add them to the list of errors for the cell; the resulting value is typed as a string with the language provided by the lang property
  9. otherwise, if there are no errors, parse the string using the format if one is specified, as described below; the resulting value is typed according to the datatype and if the datatype is string, or there is no datatype, it has the language provided by the lang property

Parsing examples

When no metadata is available, the value of a cell is the same as its string value. For example, a cell with a string value of "99" would similarly have the (semantic) value "99".

If a datatype is provided for the cell, that is used to create a (semantic) value for the cell. For example, if the metadata contains:

"datatype": "integer"
          

for the cell with the string value "99" then the value of that cell will be the integer 99. A cell whose string value was not a valid integer (such as "one" or "1.0") would be assigned that string value as its (semantic) value, but also have a validation error listed in its errors annotation.

Sometimes data uses special codes to indicate unknown or null values. For example, a particular column might contain a number that is expected to be between 1 and 10, with the string 99 used in the original tabular data file to indicate a null value. The metadata for such a column would include:

"datatype": {
  "base": "integer",
  "minimum": 1,
  "maximum": 10
},
"null": "99"
          

In this case, a cell with a string value of "5" would have the (semantic) value of the integer 5; a cell with a string value of "99" would have the value null.

Similarly, a cell may be assigned a default value if the string value for the cell is empty. A configuration such as:

"datatype": {
  "base": "integer",
  "minimum": 1,
  "maximum": 10
},
"default": "5"
          

In this case, a cell whose string value is "" would be assigned the value of the integer 5. A cell whose string value contains whitespace, such as a single tab character, would also be assigned the value of the integer 5: when the datatype is something other than string or anyAtomicType or any, leading and trailing whitespace is stripped from string values before the remainder of the processing is carried out.

Cells can contain sequences of values. For example, a cell might have the string value "1 5 7.0". In this case, the separator is a space character. The appropriate configuration would be:

"datatype": {
  "base": "integer",
  "minimum": 1,
  "maximum": 10
},
"default": "5",
"separator": " "
          

and this would mean that the cell's value would be an array containing two integers and a string: [1, 5, "7.0"]. The final value of the array is a string because it is not a valid integer; the cell's errors annotation will also contain a validation error.

Also, with this configuration, if the string value of the cell were "" (ie it was an empty cell) the value of the cell would be an empty list.

A cell value can be inserted into a URL created using a URI template property such as valueUrl. For example, if a cell with the string value "1 5 7.0" were in a column named values, defined with:

"datatype": "decimal",
"separator": " ",
"valueUrl": "{?values}"
          

then after expansion of the URI template, the resulting valueUrl would be ?values=1.0,5.0,7.0. The canonical representations of the decimal values are used within the URL.

Formats for numeric types

It is not uncommon for numbers within tabular data to be formatted for human consumption, which may involve using commas for decimal points, grouping digits in the number using commas, or adding currency symbols or percent signs to the number.

If the datatype is a numeric type, the format property indicates the expected format for that number. Its value MUST be either a single string or an object with one or more of the properties:

decimalChar
An atomic property containing a single character string whose value is used to represent a decimal point within the number. The default value is ".".
groupChar
An atomic property containing a single character string whose value is used to group digits within the number. The default value is ",".
pattern
An atomic property containing a regular expression string, in the syntax and interpreted as defined by [[!ECMASCRIPT]].

Authors are encouraged to be conservative in the regular expressions that they use, sticking to the basic features of regular expressions that are likely to be supported across implementations.

If the format property is a single string, this is interpreted in the same way as if it were an object with a pattern property whose value is that string.

When parsing the string value of a cell against this format specification, implementations MUST recognise and parse numbers that consist of:

  1. an optional + or - sign
  2. followed by a decimal digit (0-9)
  3. followed by any number of decimal digits (0-9) and the character specified as the groupChar
  4. followed by an optional decimalChar followed by one or more decimal digits (0-9)
  5. followed by an optional exponent, consisting of an E followed by an optional + or - sign followed by one or more decimal digits (0-9)
  6. followed by an optional percent (%) or per-mille () sign

or that are one of the special values:

  1. NaN
  2. INF
  3. -INF

Implementations MUST add a validation error to the errors annotation for the cell if the string being parsed:

  • does not meet the numeric format defined above
  • contains two consecutive groupChar characters
  • does not match the regular expression defined in the pattern property, if there is one
  • contains the decimalChar, if the datatype is integer or one of its sub-values
  • contains an exponent, if the datatype is decimal or one of its sub-values
  • is one of the special values NaN, INF or -INF, if the datatype is decimal or one of its sub-values

Implementations MUST use the sign, exponent, percent and per-mille signs when parsing the string value of a cell to provide the value of the cell. For example, the string value "-25%" must be interpreted as -0.25 and the string value "1E6" as 1000000.

Formats for booleans

Boolean values may be represented in many ways aside from the standard 1 and 0 or true and false.

If the datatype for a cell is boolean, the format property provides the true and false values expected, separated by |. For example if format is Y|N then cells must hold either Y or N with Y meaning true and N meaning false.

The resulting cell value will be one or more boolean true or false values.

Formats for dates and times

Dates and times are commonly represented in tabular data in formats other than those defined in [[!xmlschema11-2]].

If the datatype is a date or time type, the format property indicates the expected format for that date or time.

The supported date and time formats listed here are expressed in terms of the date field symbols defined in [[!UAX35]] and MUST be interpreted by implementations as defined in that specification.

The following date formats MUST be recognised by implementations:

  • yyyy-MM-dd eg 2015-03-22
  • yyyyMMdd eg 20150322
  • dd-MM-yyyy eg 22-03-2015
  • d-M-yyyy eg 22-3-2015
  • MM-dd-yyyy eg 03-22-2015
  • M-d-yyyy eg 3-22-2015
  • dd/MM/yyyy eg 22/03/2015
  • d/M/yyyy eg 22/3/2015
  • MM/dd/yyyy eg 03/22/2015
  • M/d/yyyy eg 3/22/2015
  • dd.MM.yyyy eg 22.03.2015
  • d.M.yyyy eg 22.3.2015
  • MM.dd.yyyy eg 03.22.2015
  • M.d.yyyy eg 3.22.2015

The following time formats MUST be recognised by implementations:

  • HH:mm:ss eg 15:02:37
  • HHmmss eg 150237
  • HH:mm eg 15:02
  • HHmm eg 1502

The following date/time formats MUST be recognised by implementations:

  • yyyy-MM-ddTHH:mm:ss eg 2015-03-15T15:02:37
  • yyyy-MM-ddTHH:mm eg 2015-03-15T15:02
  • any of the date formats above, followed by a single space, followed by any of the time formats above, eg M/d/yyyy HH:mm for 3/22/2015 15:02 or dd.MM.yyyy HH:mm:ss for 22.03.2015 15:02:37

Implementations MUST also recognise date, time and date/time formats that end with timezone markers consisting of between one and three xs or Xs, possibly after a single space. These MUST be interpreted as follows:

  • X eg -08, +0530 or Z (minutes are optional)
  • XX eg -0800, +0530 or Z
  • XXX eg -08:00, +05:30 or Z
  • x eg -08 or +0530 (Z is not permitted)
  • xx eg -0800 or +0530 (Z is not permitted)
  • xxx eg -08:00 or +05:30 (Z is not permitted)

For example, formats could include yyyy-MM-ddTHH:mm:ssXXX for 2015-03-15T15:02:37Z or 2015-03-15T15:02:37-05:00, or HH:mm x for 15:02 -05.

The cell value will one or more dates/time values extracted using the format.

For simplicity, this version of this standard does not support abbreviated or full month or day names, or double digit years. Future versions of this standard may support other date and time formats, or general purpose date/time pattern strings. Authors of schemas SHOULD use appropriate regular expressions, along with the string datatype, for dates and times that use a format other than that specified here.

Formats for durations

Durations MUST be formatted and interpreted as defined in [[!xmlschema11-2]], using the [[!ISO8601]] format -?PnYnMnDTnHnMnS. For example, the duration P1Y1D is used for a year and a day; the duration PT2H30M for 2 hours and 30 minutes.

If the datatype is a duration type, the format property provides a regular expression for the string values, in the syntax and processed as defined by [[!ECMASCRIPT]].

Authors are encouraged to be conservative in the regular expressions that they use, sticking to the basic features of regular expressions that are likely to be supported across implementations.

The cell value will be one or more durations extracted using the format.

Formats for other types

If the datatype is not numeric, boolean, a date/time type or a duration type, the format property provides a regular expression for the string values, in the syntax and processed as defined by [[!ECMASCRIPT]].

Authors are encouraged to be conservative in the regular expressions that they use, sticking to the basic features of regular expressions that are likely to be supported across implementations.

Values that are labelled as html, xml or json are not validated against those formats.

Metadata creators who wish to check the syntax of HTML, XML or JSON within tabular data should use the format property to specify a regular expression against which such values will be tested.

Merging Metadata

When processing a tabular data file, the Locating Metadata in [[!tabular-data-model]] describes different locations for locating metadata. To property transform a tabular data file, such as a CSV, processors MUST merge metadata from these separate sources to create a single metadata document in a manner consistent with this algorithm.

Implementations MUST merge metadata documents in a manner consistent with these rules. Validators MUST check and issue warnings where merge issues are found as noted below and in the relevant property definitions.

Merging of metadata happens in order from highest priority to lowest priority by merging the first two metadata files (A and B) together to create new merged metadata AB'. This is then used to merge in the next metadata file until all metadata have been processed to create a table group description. If the top-level object of either of the metadata files are table descriptions, these are turned into table group descriptions containing a single table description (ie having a single resource property whose value is the same as the original table description). Ensure that @context definitions are moved from the table description to the table group description.

Merging has two stages: the normalization of metadata documents, described in and the merging of those normalized documents, described in .

Normalization

Prior to merging, each description object is expanded relative to its @context and values are normalized as follows:

  1. If the property is a common property or notes the value MUST be normalized as follows:
    1. If the value is an array, each value within the array is normalized in place as described here.
    2. If the value is a string, replace it with an object with a @value property whose value is that string. If a default language is specified, add a @language property whose value is that default language.
    3. If the value is an object with a @value property, it remains as is.
    4. If the value is any other object, normalize each property of that object as follows:
      1. If the property is @id, expand any prefixed names and resolve its value against the base URL.
      2. If the property is @type, then its value remains as is.
      3. Otherwise, normalize the value of the property according to this algorithm.
    5. Otherwise, the value remains as is.
  2. If the property is an array property each element of the value is normalized using this algorithm.
  3. If the property is a link property the value is turned into an absolute URL using the base URL.
  4. If the property is an object property with a string value, the string is a URL referencing a JSON document containing a single object. Dereference this URL and replace the string value with that object, adding @id using the original URL unless it already exists. Normalize each value in the resulting or original object recursively using this algorithm.
  5. If the property is a natural language property and the value is not already an object, it is turned into an object whose properties are language codes and where the values of those properties are arrays. The suitable language code for the values is determined through the default language; if it can't be determined the language code und MUST be used.
  6. If the property is an atomic property that can be a string or an object, normalize to the object form as described for that property.

Following this normalization process, the @base and @language properties within the @context are no longer relevant; the normalized metadata can have its @context set to http://www.w3.org/ns/csvw.

Examples

The following are examples of how common properties are normalized.

In this example, a simple string is used as the title for a table.

{
  "@context": { "http://www.w3.org/ns/csvw", { "@language": "en" } },
  "@type": "Table",
  "url": "http://example.com/table.csv",
  "tableSchema": [...],
  "dc:title": "The title of this Table"
}
          

Since there is a default language, this is equivalent to explicitly specifying the language of that title:

{
  "@type": "Table",
  "url": "http://example.com/table.csv",
  "tableSchema": [...],
  "dc:title": {"@value": "The title of this Table", "@language": "en"}
}
          

It is also possible to use a simple value object to give a title. However, in this case the default language is not applied to the title:

{
  "@context": { "http://www.w3.org/ns/csvw", { "@language": "en" } },
  "@type": "Table",
  "url": "http://example.com/table.csv",
  "tableSchema": [...],
  "dc:title": {"@value": "The title of this Table"}
}
          

The next example uses an array of a string and a value object to give two titles with different languages:

{
  "@context": { "http://www.w3.org/ns/csvw", { "@language": "en" } },
  "@type": "Table",
  "url": "http://example.com/table.csv",
  "tableSchema": [...],
  "dc:title": [
    "The title of this Table",
    {"@value": "Der Titel dieser Tabelle", "@language": "de"}
  ]
}
          

The normalized version of this is:

{
  "@type": "Table",
  "url": "http://example.com/table.csv",
  "tableSchema": [...],
  "dc:title": [
    {"@value": "The title of this Table", "@language": "en"}
    {"@value": "Der Titel dieser Tabelle", "@language": "de"}
  ]
}
          

The next example demonstrates a node object, in which the value of the schema:url property is a reference to another resource:

{
  "@context": [ "http://www.w3.org/ns/csvw", { "@base": "http://example.com/" } ],
  "@type": "Table",
  "url": "table.csv",
  "tableSchema": [...],
  "schema:url": {"@id": "table.csv"}
}
          

The value of the @id property is normalized as described in against the base URL provided through the @base property, which means the above example is equivalent to:

{
  "@context": "http://www.w3.org/ns/csvw",
  "@type": "Table",
  "url": "http://example.com/table.csv",
  "tableSchema": [...],
  "schema:url": {"@id": "http://example.com/table.csv"}
}
          

The following example shows the dc:publisher property as an array that contains a single node object:

{
  "@context": "http://www.w3.org/ns/csvw",
  "@type": "Table",
  "url": "http://example.com/table.csv",
  "tableSchema": [...],
  "dc:publisher": [{
    "schema:name": "Example Municipality",
    "schema:url": {"@id": "http://example.org"}
  }],
}
          

Merging

A description object B is merged into an original description object A by merging each property of B into A. If the property from B does not exist on A, it is simply added to A. If A does have the property, the way the values are merged depends on the type of the property, as follows:

Example

For example, consider the following two Metadata documents to be merged (located at http://example.com/metadata.json and http://example.com/doc1.csv-metadata.json):

{
  "@context": ["http://www.w3.org/ns/csvw", {"@language": "en"}],
  "resources": [{
    "url": "doc1.csv",
    "dc:title": "foo",
    "tableDirection": "ltr",
    "tableSchema": {
      "aboutUrl": "{#foo}",
      "columns": [{
        "name": "foo", 
        "title": "Foo", 
        "required": true
      }, {
        "name": "bar"
      }]
    }
  }, {
    "url": "doc2.csv"
  }]
}
      
{
  "@context": "http://www.w3.org/ns/csvw",
  "url": "http://example.com/doc1.csv",
  "dc:description": "bar",
  "tableSchema": {
    "propertyUrl": "{#_name}",
    "columns": [{
      "title": "Foo",
      "required": false
    }, {
      "name": "bar"
    }, {
    }]
  }
}
      

The process of merging performs the following steps:

  1. Normalize A to use the language specified in the @context within the natural language property title and to expand the link property url against the base URL for A, http://example.com/metadata.json:
    {
      "resources": [{
        "url": "http://example.com/doc1.csv",
        "dc:title": {"@value": "foo", "@language": "en"},
        "tableDirection": "ltr",
        "tableSchema": {
          "aboutUrl": "{#foo}",
          "columns": [{
            "name": "foo", 
            "title": { "en": [ "Foo" ] }, 
            "required": true
          }, {
            "name": "bar"
          }]
        }
      }, {
        "url": "http://example.com/doc2.csv"
      }]
    }
          
  2. Normalize B from a table description to a table group description by embedding the table description in a resources property, resolve the link property url (which is already an absolute URL), and normalize the title property to use the und language:
    {
      "resources": [{
        "url": "http://example.com/doc1.csv",
        "dc:description": {"@value": "bar"},
        "tableSchema": {
          "propertyUrl": "{#_name}",
          "columns": [{
            "title": { "und": [ "Foo" ] },
            "required": false
          }, {
            "name": "bar"
          }, {
          }]
        }
      }]
    }
          
  3. resources is an array property with rules specified in and each value is merged accordingly:
    1. The first value from A and B are now the following:
      {
        "url": "http://example.com/doc1.csv",
        "dc:title": {"@value": "foo", "@language": "en"},
        "tableDirection": "ltr",
        "tableSchema": {
          "aboutUrl": "{#foo}",
          "columns": [{
            "name": "foo", 
            "title": { "en": [ "Foo" ] }, 
            "required": true
          }, {
            "name": "bar"
          }]
        }
      }
      
      {
        "url": "http://example.com/doc1.csv",
        "dc:description": {"@value": "bar"},
        "tableSchema": {
          "propertyUrl": "{#_name}",
          "columns": [{
            "title": { "und": [ "Foo" ] },
            "required": false
          }, {
            "name": "bar"
          }, {
          }]
        }
      }
                    
      As these have the same url, these are merged. Each property from B is considered:
      1. url is the same (otherwise the two table descriptions would not be being merged)
      2. dc:description does not exist in A so it is added to A
      3. The objects held by the object property tableSchema properties are merged:
        1. B has a propertyUrl which is added to A.
        2. Each has columns which is merged as described in :
          1. The first column matches on title because they each have the value "Foo" and the language tag en matches und. Because the und value is present in the en array, that value is removed from the und array, and because the array is now empty the und property is removed. The value of required in A is retained, as is the value of name
          2. The second column has a match on name
      The merged table description is now the following:
      {
        "url": "http://example.com/doc1.csv",
        "dc:title": {"@value": "foo", "@language": "en"},
        "dc:description": {"@value": "bar"},
        "tableDirection": "ltr",
        "tableSchema": {
          "aboutUrl": "{#foo}",
          "propertyUrl": "{#_name}",
          "columns": [{
            "name": "foo", 
            "title": { "en": [ "Foo" ]}, 
            "required": true
          },{
            "name": "bar"
          }]
        }
      }
                    
    2. The second resources value from A is retained as is.

The resulting merged metadata is now the following:

{
  "resources": [{
    "url": "http://example.com/doc1.csv",
    "dc:title": {"@value": "foo", "@language": "en"},
    "dc:description": {"@value": "bar"},
    "tableDirection": "ltr",
    "tableSchema": {
      "aboutUrl": "{#foo}",
      "propertyUrl": "{#_name}",
      "columns": [{
        "name": "foo", 
        "title": { "en": [ "Foo" ]}, 
        "required": true
      },{
        "name": "bar"
      }]
    }
  }, {
    "url": "http://example.com/doc2.csv"
  }]
}
      

Security Considerations

Applications that process tabular data may use that data to drive other actions, which may have security implications. These behaviours are outside the scope of this specification.

Third party metadata provided about a tabular data file (such as a CSV file) may rename or ignore headers, or exclude rows or columns, which may lead to data being misinterpreted by applications that process it.

Transformation definitions are a possible security risk as they enable the creators of metadata to reference arbitrary code that may be executed to convert tabular data into other formats. Implementations should run this arbitrary code in a sandboxed environment to reduce the security risk.

JSON-LD Dialect

The Metadata Vocabulary for Tabular Data uses a format based on JSON-LD [[!JSON-LD]] with some restrictions.

URL Compaction

When normalizing metadata, prefixed names used in common properties and notes are expanded to absolute URLs. For some serializations, these are more appropriately presented using prefixed names or terms. This algorithm compacts an absolute URL to a prefixed name or term.

  1. If the URL exactly matches the absolute IRI associated with a term in [[csvw-context]], replace the URL with that term.
  2. Otherwise, if the URL starts with the absolute IRI associated with a term in [[csvw-context]], replace the matched part of that URL with the term separated with a : (U+0040) to create a prefixed name. If the resulting prefixed name is rdf:type, replace with @type.

Acknowledgements

This document is influenced by Data Package specification and the JSON Table Schema, which are maintained as part of Data Protocols. Particular contributors to that work are Rufus Pollock, Paul Fitzpatrick, Andrew Berkeley, Francis Irving, Benoit Chesneau, Leigh Dodds, Martin Keegan, and Gunnlaugur Thor Briem.

IANA Considerations

This section has not yet been submitted to IANA for review, approval, and registration.

application/csvm+json

Type name:
application
Subtype name:
csvm+json
Required parameters:
N/A
Optional parameters:
N/A
Encoding considerations:
See [[RFC6839]], section 3.1
Security considerations:
See [[RFC4627]] and of this specification
Interoperability considerations:
Note that this format is not the same as the existing text/csv and text/tab-delimited-values mediatypes, but a JSON-based format used to annotate such documents
Published specification:
This specification.
Applications that use this media type:
It is anticipated that there is a broad need for data validators and converters to alternate structured representations of tabular data.
Fragment identifier considerations:
See [[RFC6839]], section 3.1
Additional information:
Magic number(s):
n/a
File extension(s):
".json"
Macintosh file type code(s):
"TEXT"
Person & email address to contact for further information:
Ivan Herman <ivan@w3.org>
Intended usage:
COMMON
Restrictions on usage:
None
Author/Change controller:
The Tabular metadata specification is the product of the CSV on the Web Working Group. The W3C reserves change control over this specification.

JSON-LD Context

The JSON-LD context, located at http://www.w3.org/ns/csvw.jsonld is used with metadata documents. When used within a metadata document, the context can be referenced as http://www.w3.org/ns/csvw. See [[csvw-context]] for a full description of defined terms and prefixes. This context may be updated from time-to-time to define new terms and prefixes.