Copyright © 2015 W3C® (MIT, ERCIM, Keio, Beihang). W3C liability, trademark and document use rules apply.
Validation, conversion, display, and search of tabular data on the web requires additional metadata that describes how the data should be interpreted. This document defines a vocabulary for metadata that annotates tabular data. This can be used to provide metadata at various levels, from groups of tables and how they relate to each other down to individual cells within a table.
The metadata defined in this specification is used to provide annotations on an annotated table or group of tables, as defined in [tabular-data-model]. Annotated tables form the basis for all further processing, such as validating, converting, or displaying the tables.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.
The CSV on the Web Working Group was chartered to produce a Recommendation "Access methods for CSV Metadata" as well as Recommendations for "Metadata vocabulary for CSV data" and "Mapping mechanism to transforming CSV into various Formats (e.g., RDF, JSON, or XML)". This document aims to primarily satisfy the second of those Recommendations.
This document was published by the CSV on the Web Working Group as a Working Draft. This document is intended to become a W3C Recommendation. If you wish to make comments regarding this document, please send them to public-csv-wg@w3.org (subscribe, archives). All comments are welcome.
Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.
This document is governed by the 1 August 2014 W3C Process Document.
Interpreting tabular data that is available on the web, particularly as CSV, usually requires additional metadata. As an example, say that the following CSV file were available at http://example.org/tree-ops.csv
GID,On Street,Species,Trim Cycle,Inventory Date
1,ADDISON AV,Celtis australis,Large Tree Routine Prune,10/18/2010
2,EMERSON ST,Liquidambar styraciflua,Large Tree Routine Prune,6/2/2010
A human consumer of this data might be able to figure out the meaning of the different columns, particularly if there were some additional human-readable documentation made available. Automated processors would have a much harder time; realistically they would be limited to displaying the information in a table. Making available machine-readable metadata helps with the interpretation of the tabular data. For example, say that the following metadata file were available at http://example.org/tree-ops.csv-metadata.json:
{
  "@context": ["http://www.w3.org/ns/csvw", {"@language": "en"}],
  "url": "tree-ops.csv",
  "dc:title": "Tree Operations",
  "dcat:keyword": ["tree", "street", "maintenance"],
  "dc:publisher": {
    "schema:name": "Example Municipality",
    "schema:url": {"@id": "http://example.org"}
  },
  "dc:license": {"@id": "http://opendefinition.org/licenses/cc-by/"},
  "dc:modified": {"@value": "2010-12-31", "@type": "xsd:date"},
  "tableSchema": {
    "columns": [{
      "name": "GID",
      "titles": ["GID", "Generic Identifier"],
      "dc:description": "An identifier for the operation on a tree.",
      "datatype": "string",
      "required": true
    }, {
      "name": "on_street",
      "titles": "On Street",
      "dc:description": "The street that the tree is on.",
      "datatype": "string"
    }, {
      "name": "species",
      "titles": "Species",
      "dc:description": "The species of the tree.",
      "datatype": "string"
    }, {
      "name": "trim_cycle",
      "titles": "Trim Cycle",
      "dc:description": "The operation performed on the tree.",
      "datatype": "string"
    }, {
      "name": "inventory_date",
      "titles": "Inventory Date",
      "dc:description": "The date of the operation that was performed.",
      "datatype": {"base": "date", "format": "M/d/yyyy"}
    }],
    "primaryKey": "GID",
    "aboutUrl": "#gid-{GID}"
  }
}
Given the location of the CSV file, this metadata document can be located by appending -metadata.json to the URL (as described in [tabular-data-model]). It provides information for different types of applications:
GID column are all present and unique.
Implementations may fulfil one or more of these functions. In particular, a Converter may also act as a Validator (perhaps through the setting of a flag), checking that the data it converts is compliant with the schema. If a Converter does not also act as a Validator it may produce invalid output.
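The following non-normative Python sketch illustrates two of the ideas above: deriving the metadata location by appending -metadata.json to the CSV URL, and the kind of check a validator might run on the GID column. The function names and the shape of the returned problem list are assumptions made for illustration; they are not defined by this specification.

```python
import csv
import io

CSV_TEXT = """GID,On Street,Species,Trim Cycle,Inventory Date
1,ADDISON AV,Celtis australis,Large Tree Routine Prune,10/18/2010
2,EMERSON ST,Liquidambar styraciflua,Large Tree Routine Prune,6/2/2010
"""


def metadata_url_for(csv_url):
    """Append the conventional suffix to the URL of the CSV file."""
    return csv_url + "-metadata.json"


def check_primary_key(csv_text, title):
    """Report rows whose value in the given column is missing or repeated."""
    problems = []
    seen = set()
    rows = csv.DictReader(io.StringIO(csv_text))
    for number, row in enumerate(rows, start=1):
        value = row.get(title) or ""
        if value == "":
            problems.append(f"row {number}: missing {title}")
        elif value in seen:
            problems.append(f"row {number}: duplicate {title} {value!r}")
        else:
            seen.add(value)
    return problems


print(metadata_url_for("http://example.org/tree-ops.csv"))
print(check_primary_key(CSV_TEXT, "GID"))
```

A real validator would take the column to check from the primaryKey and required properties in the metadata rather than from a hard-coded title.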
[tabular-data-model] defines an annotated tabular data model in which groups of tables, individual tables, columns, rows, and cells can be annotated with annotations. That specification also describes how to locate metadata about a given tabular data file.
This document defines the format and structure of metadata documents, and how these are interpreted to create an Annotated Tabular Data Model. It also defines how to validate tabular data based on some of these annotations.
As well as sections marked as non-normative, all authoring guidelines, diagrams, examples, and notes in this specification are non-normative. Everything else in this specification is normative.
The key words MAY, MUST, MUST NOT, SHOULD, and SHOULD NOT are to be interpreted as described in [RFC2119].
The metadata format is based on a dialect of [JSON-LD] as defined in section A. JSON-LD Dialect. This metadata can therefore be expressed as an RDF graph. It is not necessary for conformant applications to be able to process all JSON-LD, only the dialect defined in this specification. All applications that conform to this specification (including validators and applications that read or convert tabular data) MUST read the JSON-based format described in this document.
Tabular data MUST conform to the description from [tabular-data-model]. In particular note that each row MUST contain the same number of cells (although some of these cells may be empty). Parsers might not be able to map all CSV-encoded data to such a table. As such, the metadata format described in this specification cannot be applied to all CSV files.
This specification makes use of the compact IRI Syntax; please refer to the Compact IRIs from [JSON-LD].
This specification makes use of the following namespaces:
csvw
:http://www.w3.org/ns/csvw#
dc
:http://purl.org/dc/terms/
dcat
:http://www.w3.org/ns/dcat#
foaf
:http://xmlns.com/foaf/0.1/
rdf
:http://www.w3.org/1999/02/22-rdf-syntax-ns#
schema
:http://schema.org/
xsd
:http://www.w3.org/2001/XMLSchema#
The following typographic conventions are used in this specification:
markup
markup definition reference
markup external definition reference
Notes are in light green boxes with a green left border and with a "Note" header in green. Notes are normative or informative depending on whether they are in a normative or informative section, respectively.
Examples are in light khaki boxes, with khaki left border, and with a numbered "Example" header in khaki. Examples are always informative. The content of the example is in monospace font and may be syntax colored.
The metadata defined in this specification is used to provide annotations on an annotated table or group of tables, as defined in [tabular-data-model]. Annotated tables form the basis for all further processing, such as validating, converting, or displaying the tables.
All compliant applications MUST create annotated tables based on the algorithm defined here. All compliant applications MUST generate errors and stop processing if a metadata document:
Compliant applications MUST ignore properties (aside from common properties) which are not defined in this specification and MUST generate a warning when they are encountered.
If a property has a value that is not permitted by this specification, then if a default value is provided for that property, compliant applications MUST use that default value and MUST generate a warning. If no default value is provided for that property, compliant applications MUST generate a warning and behave as if the property had not been specified.
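The handling of unknown properties and invalid values described above can be sketched as follows. This is a non-normative illustration: the RULES table, the function name, and the treatment of separator as a property without a default are assumptions chosen only to exercise each branch.

```python
# Hypothetical rule table: a validity check and, where one exists, a default.
RULES = {
    "tableDirection": {"valid": lambda v: v in ("rtl", "ltr", "default"),
                       "default": "default"},
    "suppressOutput": {"valid": lambda v: isinstance(v, bool),
                       "default": False},
    # Assumed here to have no default, purely to show that branch.
    "separator": {"valid": lambda v: isinstance(v, str)},
}


def clean_description(description):
    """Return (cleaned properties, warnings) for one description object."""
    cleaned, warnings = {}, []
    for key, value in description.items():
        if key.startswith("@") or ":" in key:
            # Keywords and common properties (prefixed names) pass through.
            cleaned[key] = value
        elif key not in RULES:
            warnings.append(f"unknown property {key!r} ignored")
        elif RULES[key]["valid"](value):
            cleaned[key] = value
        elif "default" in RULES[key]:
            warnings.append(f"invalid {key!r}; using default")
            cleaned[key] = RULES[key]["default"]
        else:
            warnings.append(f"invalid {key!r}; treated as unspecified")
    return cleaned, warnings


cleaned, warnings = clean_description(
    {"tableDirection": "sideways", "colour": "red", "dc:title": "Trees"})
print(cleaned)
print(warnings)
```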
Metadata documents contain descriptions of groups of tables, tables, columns, rows, and cells, which are used to create annotations on an annotated tabular data model. A description object is a JSON object that describes a component of the annotated tabular data model (a group of tables, a table or a column) and has one or more properties that are mapped into properties on that component. There are two types of description objects:
The description objects contain a number of properties. These are:
name of a column or the dc:provenance of a table
For example, in the column description
{ "name": "inventory_date", "titles": "Inventory Date", "dc:description": "The date of the operation that was performed.", "datatype": { "base": "date", "format": "M/d/yyyy" } }
the properties name, titles, and dc:description are used to create the name, titles, datatype and dc:description annotations on the column in the data model. The datatype property is an inherited property that also affects the value of each cell in that column (see section 5.7 Inherited Properties for more on inherited properties).
This section defines a set of properties and permitted values for annotating tabular data, and how these properties should be interpreted by applications.
A metadata document is a JSON document which holds an object at the top level. This object is a description object of either a group of tables or a single table. A metadata document may contain other referenced or embedded description objects, description objects for tables and columns. Additional JSON objects, not part of the annotated tabular data model, are used to describe schemas, dialect descriptions, foreign key definitions and transformation definitions.
There are different types of properties on description objects:
Array properties hold an array of one or more objects, which are usually description objects.
For example, the tables property is an array property. A table group description might contain:
"tables": [{ "url": "https://example.org/countries.csv", "tableSchema": "https://example.org/countries.json" }, { "url": "https://example.org/country_slice.csv", "tableSchema": "https://example.org/country_slice.json" }]
in which case the tables property has a value that is an array of two table description objects.
Any items within an array that are not valid objects of the type expected are ignored. If the supplied value of an array property is not an array (eg if it is an integer), compliant applications MUST issue a warning and proceed as if the property had been supplied with an empty array.
Link properties hold a single reference to another resource by URL. Their value is a string — resolved as a URL against the base URL. If the supplied value of a link property is not a string (eg if it is an integer), compliant applications MUST issue a warning and proceed as if the property had been supplied with an empty string.
For example, the url
property is a link property. A table description might contain:
"url": "example-2014-01-03.csv"
in which case the url
property on the table would have a single value, a link to example-2014-01-03.csv
, resolved against the base URL of the metadata document in which this was located. For example if the metadata document contained:
"@context": [ "http://www.w3.org/ns/csvw", { "@base": "http://example.org/" }]
this is equivalent to specifying:
"url": "http://example.org/example-2014-01-03.csv"
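A non-normative sketch of this resolution, using the URL joining function from the Python standard library. The function names and the metadata location http://example.com/meta/tree.json are hypothetical; the fallback to an empty string for non-string values follows the rule for link properties given above.

```python
from urllib.parse import urljoin

CSVW = "http://www.w3.org/ns/csvw"


def base_url(metadata_location, context):
    """Base URL for the document: @base if given, else its own location."""
    if isinstance(context, list) and len(context) > 1:
        local = context[1]
        if isinstance(local, dict) and isinstance(local.get("@base"), str):
            # @base is itself resolved against the metadata location.
            return urljoin(metadata_location, local["@base"])
    return metadata_location


def resolve_link(value, base):
    """Resolve a link property value against the base URL."""
    if not isinstance(value, str):
        value = ""  # a warning would be issued here
    return urljoin(base, value)


base = base_url("http://example.com/meta/tree.json",
                [CSVW, {"@base": "http://example.org/"}])
print(resolve_link("example-2014-01-03.csv", base))
```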
URI template properties contain a [URI-TEMPLATE] which can be used to generate a URI. These URI templates are expanded in the context of each row by combining the template with a set of variables with values as defined in [URI-TEMPLATE]. The variables that are set are:
null
The languages of cell values are ignored.
_column
_column is set to the column number of the column from the annotated table that is currently being processed
_sourceColumn
_sourceColumn is set to the source number of the column that is currently being processed; this usually varies from _column by skip columns
_row
_row is set to the row number of the row from the annotated table that is currently being processed
_sourceRow
_sourceRow is set to the source number of the row that is currently being processed; this usually varies from _row by skip rows and header rows
_name
_name is set to the URI decoded column name annotation, as defined in [tabular-data-model], for the column that is currently being processed. (Percent-decoding is necessary as name may have been encoded if taken from titles; this prevents double percent-encoding.)
The annotation value is the result of:
If the supplied value of a URI template property is not a string (eg if it is an integer), compliant applications MUST issue a warning and proceed as if the property had been supplied with an empty string.
For example, the aboutUrl
property holds a URI template that is used to generate a URL identifier for each row, which might look like:
"aboutUrl": "http://example.org/example.csv#row.{_row}"
The about URL annotations that are generated and used as identifiers for the rows would then look like http://example.org/example.csv#row.1
, http://example.org/example.csv#row.2
and so on.
Alternatively, with the CSV and metadata in the section 1. Introduction, the aboutUrl
might look like:
"aboutUrl": "http://example.org/tree/{on_street}/{GID}"
This would generate URIs such as http://example.org/tree/ADDISON%20AV/1 and http://example.org/tree/EMERSON%20ST/2.
If the value of the on_street or GID column were null, the URL would still be generated with the null value generating an empty string in the URL. For example if on_street were null and GID were 3, the generated URL would be http://example.org/tree//3.
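The expansion of the aboutUrl template shown above, including the treatment of null values, can be sketched as below. This non-normative example handles only the simplest form of expression from [URI-TEMPLATE], a single variable name in braces; the function name and the use of a dictionary to hold the row's values are assumptions.

```python
import re
from urllib.parse import quote


def expand_simple(template, row):
    """Expand {name} expressions; null (None) values become empty strings."""
    def substitute(match):
        value = row.get(match.group(1))
        if value is None:
            return ""
        # Percent-encode everything except unreserved characters.
        return quote(str(value), safe="-._~")
    return re.sub(r"\{([A-Za-z0-9_]+)\}", substitute, template)


template = "http://example.org/tree/{on_street}/{GID}"
print(expand_simple(template, {"on_street": "ADDISON AV", "GID": "1"}))
print(expand_simple(template, {"on_street": None, "GID": 3}))
```

A complete implementation would need to support the other expression types in [URI-TEMPLATE], which this sketch leaves untouched.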
Once the URI has been generated, it is resolved against the url of the table (eg the CSV file) to create an absolute URI. For example, given an aboutUrl within a schema such as:
"aboutUrl": "#row.{_row}"
and given a CSV file at http://example.com/temp.csv, the URL for the first row will be http://example.com/temp.csv#row.1.
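As a non-normative illustration, the two steps (expanding the template for a row, then resolving the result against the URL of the table) might look like this. The function name is hypothetical, and only the _row variable is handled.

```python
from urllib.parse import urljoin


def about_url_for_row(template, table_url, row_number):
    """Expand {_row} and resolve the result against the table's URL."""
    # Only the {_row} variable is handled in this sketch.
    expanded = template.replace("{_row}", str(row_number))
    return urljoin(table_url, expanded)


print(about_url_for_row("#row.{_row}", "http://example.com/temp.csv", 1))
```

Because resolution leaves absolute URLs unchanged, a template such as http://example.org/example.csv#row.{_row} yields the same result whatever the URL of the table.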
The propertyUrl
property might be defined as "{#_name}"
, meaning that it resolves as a fragment identifier relative to the URL of the source of the table. For example, accessing it from a column with the column name GID would look like:
"http://example.org/example.csv#GID"
A value defined within the data is also subject to expansion. For example, consider the following table:
project_name,project_type,keywords
CSVW,foaf:Project,table;data;conversion
The project_type column might have a valueUrl specified as "{project_type}". In the first row the cell value is "foaf:Project". The foaf prefix is understood, as described in section 5.8 Common Properties, to expand to http://xmlns.com/foaf/0.1/Project.
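A non-normative sketch of this prefix expansion, using the namespaces listed earlier in this document. The function name and the decision to leave values with unrecognized prefixes unchanged are assumptions.

```python
PREFIXES = {
    "csvw": "http://www.w3.org/ns/csvw#",
    "dc": "http://purl.org/dc/terms/",
    "dcat": "http://www.w3.org/ns/dcat#",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "schema": "http://schema.org/",
    "xsd": "http://www.w3.org/2001/XMLSchema#",
}


def expand_prefixed(value):
    """Expand a value such as foaf:Project when its prefix is recognized."""
    prefix, colon, local = value.partition(":")
    if colon and prefix in PREFIXES:
        return PREFIXES[prefix] + local
    # Absolute URLs and unknown prefixes are returned unchanged.
    return value


print(expand_prefixed("foaf:Project"))
```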
Similarly, the keywords column might have a valueUrl specified as "https://duckduckgo.com/?q={keywords}". If the column also specifies "separator": ";", then the cell value of the keywords column would be an array of the three values table, data, and conversion. This is set as the value of the keywords variable within the URI template, which means the result would be https://duckduckgo.com/?q=table,data,conversion.
If the value in the keywords column were an empty sequence (created from an empty cell in the original data), the reference to that column would be expanded to an empty string, generating https://duckduckgo.com/?q=.
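The effect of separator on the expanded URL can be sketched as follows. This is a non-normative illustration with hypothetical function names; it assumes that the items of a list value are percent-encoded individually and joined with commas, as in the example above.

```python
from urllib.parse import quote


def split_cell(raw, separator):
    """Split a raw cell into a list; an empty cell gives an empty list."""
    return raw.split(separator) if raw else []


def expand_list(values):
    """Percent-encode each item and join the items with commas."""
    return ",".join(quote(value, safe="-._~") for value in values)


prefix = "https://duckduckgo.com/?q="
print(prefix + expand_list(split_cell("table;data;conversion", ";")))
print(prefix + expand_list(split_cell("", ";")))
```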
When a cell's value is not a string, the canonical representation of that value is used within the expanded URL. For example, the data may include dates such as those in:
GID,On Street,Species,Trim Cycle,Inventory Date
1,ADDISON AV,Celtis australis,Large Tree Routine Prune,10/18/2010
2,EMERSON ST,Liquidambar styraciflua,Large Tree Routine Prune,6/2/2010
The Inventory Date
column description would indicate that these were dates with the format M/d/yyyy
:
{ "name": "inventory_date", "titles": "Inventory Date", "datatype": { "base": "date", "format": "M/d/yyyy" } }
The string value of the inventory_date column in the first row is parsed to create the date 18th October 2010. When the inventory_date column is referenced within a URI template such as http://example.org/event/{inventory_date}, the canonical representation of that date, as defined in [xmlschema11-2], is used within the URL, giving the result http://example.org/event/2010-10-18.
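A non-normative sketch of this parse-then-canonicalize step. The mapping from the format M/d/yyyy to a Python parsing pattern is a hypothetical table covering only the pattern used in this example; a real implementation would need to handle the full range of date formats.

```python
from datetime import datetime

# Hypothetical mapping from the one format used here to a strptime pattern.
STRPTIME_PATTERNS = {"M/d/yyyy": "%m/%d/%Y"}


def canonical_date(text, date_format):
    """Parse a cell's string value and return its canonical form."""
    parsed = datetime.strptime(text, STRPTIME_PATTERNS[date_format]).date()
    return parsed.isoformat()


print("http://example.org/event/" + canonical_date("10/18/2010", "M/d/yyyy"))
```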
Column reference properties hold one or more references to other column description objects. The referenced description object must have a name property. Column reference properties can then reference column description objects through values that are:
name on a column description object within the metadata document
If the supplied value of a column reference property is not a string or array (eg if it is an integer), or if any of the values in the supplied array are not strings, or if any of the supplied strings do not reference one or more columns, compliant applications MUST issue a warning and proceed as if the property had not been specified (which may mean that an error is issued, if the property was required).
For example, the primaryKey
property is a column reference property on the schema. It has to hold references to columns defined elsewhere in the schema, and the descriptions of those columns must have name
properties. It can hold a single reference, like this:
"tableSchema": { "columns": [{ "name": "GID" }, ... ], "primaryKey": "GID" }
or it can contain an array of references, like this:
"tableSchema": { "columns": [{ "name": "givenName" }, { "name": "familyName" }, ... ], "primaryKey": [ "givenName", "familyName" ] }
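As a non-normative illustration, resolving a column reference property such as primaryKey against the columns of its schema might be sketched like this. The function name and the convention of returning None alongside a list of warnings are assumptions.

```python
def resolve_column_reference(value, schema):
    """Return (referenced column descriptions, warnings)."""
    names = [value] if isinstance(value, str) else value
    if not isinstance(names, list) or not all(
            isinstance(name, str) for name in names):
        return None, ["value must be a string or an array of strings"]
    by_name = {column["name"]: column
               for column in schema.get("columns", []) if "name" in column}
    missing = [name for name in names if name not in by_name]
    if missing:
        return None, [f"no column named {name!r}" for name in missing]
    return [by_name[name] for name in names], []


schema = {
    "columns": [{"name": "givenName"}, {"name": "familyName"}],
    "primaryKey": ["givenName", "familyName"],
}
print(resolve_column_reference(schema["primaryKey"], schema))
```

When None is returned, the caller proceeds as described above for an invalid column reference property.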
Object properties hold either a single object or a reference to an object by URL. Their values may be:
If the supplied value of an object property is not a string or object (eg if it is an integer), compliant applications MUST issue a warning and proceed as if the property had been specified as an object with no properties.
Object properties are often used when the values can be or should be values within controlled vocabularies, or structured information which may be held elsewhere. For example, the dialect
of a table is an object property. It could be provided as a URL that indicates a commonly used dialect, like this:
"dialect": "http://example.org/tab-separated-values"
or a structured object, like this:
"dialect": { "delimiter": "\t", "encoding": "utf-8" }
When specified as a string, the resolved URL is used to fetch the referenced object during normalization as described in section 6.1 Normalization. For example, if http://example.org/tab-separated-values
resolved to:
{ "@context": "http://www.w3.org/ns/csvw", "quoteChar": null, "header": true, "delimiter": "\t" }
Following normalization, the value of the dialect
property would then be:
"dialect": { "@id": "http://example.org/tab-separated-values", "quoteChar": null, "header": true, "delimiter": "\t" }
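The replacement of a URL reference by the object it refers to can be sketched as below. This is non-normative: the function name is hypothetical, fetch stands for whatever mechanism the application uses to retrieve the referenced document, and the URL is assumed to have been resolved already.

```python
import json


def normalize_object_property(value, fetch):
    """Return an embedded object for a string-or-object property value.

    `fetch` is a caller-supplied function (hypothetical here) that takes a
    URL and returns the JSON text found there.
    """
    if isinstance(value, dict):
        return value
    if not isinstance(value, str):
        return {}  # a warning would be issued here
    fetched = json.loads(fetch(value))
    # The referenced document's own @context is not carried over.
    fetched.pop("@context", None)
    return {"@id": value, **fetched}


documents = {
    "http://example.org/tab-separated-values":
        '{"@context": "http://www.w3.org/ns/csvw", "quoteChar": null,'
        ' "header": true, "delimiter": "\\t"}',
}
print(normalize_object_property(
    "http://example.org/tab-separated-values", documents.__getitem__))
```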
Natural language properties hold natural language strings. Their values may be:
Natural language properties are used for titles. For example, the titles
property on a column description provides a natural language label for a column. If it's a plain string like this:
"titles": "Project title"
then that string is assumed to be in the default language (or have an undefined language, und
, if there is no such property). Multiple alternative values can be given in an array:
"titles": [ "Project title", "Project" ]
It's also possible to provide multiple values in different languages, using an object structure. For example:
"titles": { "en": "Project title", "fr": "Titre du projet" }
and within such an object, the values of the properties can themselves be arrays:
"titles": { "en": [ "Project title", "Project" ], "fr": "Titre du projet" }
The annotation value of a natural language property is an object whose properties are language codes and where the values of those properties are an array of strings (see Language Maps in [JSON-LD]).
When extracting an annotation value from metadata that has already been merged, a natural language property will already have this form.
If the supplied value of a natural language property is not a string, array or object (eg if it is an integer), compliant applications MUST issue a warning and proceed as if the property had been specified as an empty array. If the supplied value is an array, any items in that array that are not strings MUST be ignored. If the supplied value is an object, any properties that are not valid language codes as defined by [BCP47] MUST be ignored, as must any properties whose value is not a string or an array, and any items that are not strings within array values of these properties.
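The rules above for turning a natural language property into a language map can be sketched as follows. This is non-normative: the function names are hypothetical, and the regular expression is only a rough approximation of the language codes permitted by [BCP47].

```python
import re

# Rough approximation of a language code; not a full [BCP47] check.
LANGUAGE_CODE = re.compile(r"[A-Za-z]{2,8}(-[A-Za-z0-9]{1,8})*")


def _strings(value):
    """Return the strings in a string-or-array value, or None if neither."""
    if isinstance(value, str):
        return [value]
    if isinstance(value, list):
        return [item for item in value if isinstance(item, str)]
    return None


def normalize_natural_language(value, default_language="und"):
    """Return a map from language codes to arrays of strings."""
    if isinstance(value, (str, list)):
        return {default_language: _strings(value)}
    if isinstance(value, dict):
        result = {}
        for code, strings in value.items():
            items = _strings(strings)
            if items is not None and LANGUAGE_CODE.fullmatch(code):
                result[code] = items
        return result
    return {default_language: []}  # a warning would be issued here


print(normalize_natural_language(
    {"en": ["Project title", "Project"], "fr": "Titre du projet"}))
```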
Atomic properties hold atomic values. Their values may be:
true
or false
)
The annotation value of a boolean atomic property is false if unset; otherwise, the annotation value of an atomic property is the normalized value of that property or, if unset, the defined default value or null. Processors MUST issue a warning if a property is set to an invalid value type, such as a boolean atomic property being set to the number 1 or a numeric atomic property being set to the string "3.1415", and act as if the property had not been specified (which may mean using the default value for the property, or may mean raising an error and halting processing if the property is a required property).
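A non-normative sketch of the type check described above. The function name and return convention are assumptions. The explicit exclusion of booleans from the numeric branch reflects a detail of Python, where True and False are also integers, rather than anything in this specification.

```python
def check_atomic(value, expected, default=None):
    """Return (value to use, warning or None) for an atomic property."""
    if expected is bool:
        valid = isinstance(value, bool)
    elif expected in (int, float):
        # bool is a subclass of int in Python, so exclude it explicitly.
        valid = isinstance(value, (int, float)) and not isinstance(value, bool)
    else:
        valid = isinstance(value, expected)
    if valid:
        return value, None
    return default, f"expected {expected.__name__}, got {value!r}"


print(check_atomic(1, bool, default=False))
print(check_atomic("3.1415", float))
```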
The top-level object of a metadata document or object referenced through an object property (whether it is a table group description, table description, schema, dialect description or transformation definition) MUST have a @context
property. This is an array property, as defined in Section 8.7 of [JSON-LD]. The @context
MUST have one of the following values:
a string, which is http://www.w3.org/ns/csvw, or
an array composed of a string followed by an object, where the string is http://www.w3.org/ns/csvw and the object represents a local context definition, which is restricted to contain either or both of the following members:
@base
an atomic property that provides the base URL against which other URLs within the metadata file are resolved. If present, its value MUST be a string that is interpreted as a URL which is resolved against the location of the metadata document to provide the base URL for other URLs in the metadata document; if unspecified, the base URL used for interpreting relative URLs within the metadata document is the location of the metadata document itself.
Note that the @base
property of the @context
object provides the base URL used for URLs within the metadata document, not the URLs that appear as data within the group of tables or table it describes. URI template properties are not resolved against this base URL: they are resolved against the URL of the table.
@language
an atomic property that indicates the default language for the values of natural language or string-valued common properties in the metadata document; if present, its value MUST be a language code [BCP47]. The default is und
.
Note that the @language
property of the @context
object, which gives the default language used within the metadata file, is distinct from the lang
property on a description object, which gives the language used in the data within a group of tables, table, or column.
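As a non-normative illustration, a check of the @context value might be sketched as below. It assumes, based on the examples in this document, that the permitted forms are the string http://www.w3.org/ns/csvw on its own or a two-item array of that string followed by a local context definition; the function name is hypothetical, and validation of the @language value as a language code is omitted.

```python
CSVW = "http://www.w3.org/ns/csvw"
ALLOWED_MEMBERS = {"@base", "@language"}


def context_is_valid(context):
    """Check the two forms of @context assumed in this sketch."""
    if context == CSVW:
        return True
    if isinstance(context, list) and len(context) == 2:
        first, local = context
        return (first == CSVW
                and isinstance(local, dict)
                and set(local) <= ALLOWED_MEMBERS
                and all(isinstance(v, str) for v in local.values()))
    return False


print(context_is_valid([CSVW, {"@language": "en"}]))
print(context_is_valid([CSVW, {"@vocab": "http://example.org/"}]))
```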
A table group description is a JSON object that describes a group of tables.
tables
An array property of table descriptions for the tables in the group, namely those listed in the tables annotation on the group of tables being described. Compliant applications MUST raise an error if this array does not contain one or more table descriptions.
When an array of table descriptions B is merged into an original array of table descriptions A, each table description within B is combined into the original array A by:
If there is a table description with the same url in A, the table description from B is merged into the matching table description in A.
Otherwise, the table description from B is appended to the array of table descriptions A.
The description of a group of tables MAY also contain:
dialect
An object property that provides a single dialect description. If provided, dialect
provides hints to processors about how to parse the referenced files to create tabular data models for the tables in the group. This may be provided as an embedded object or as a URL reference. See section 5.9 Dialect Descriptions for more details.
notes
An array property that provides an array of objects representing arbitrary annotations on the annotated group of tables. The value of this property becomes the value of the notes annotation for the group of tables. The properties on these objects are interpreted equivalently to common properties as described in section 5.8 Common Properties. When an array of note objects B
is merged into an original array of note objects A
, each note object from B
is appended into the array A
.
The Web Annotation Working Group is developing a vocabulary for expressing annotations. In future versions of this specification, we anticipate referencing that vocabulary.
tableDirection
An atomic property that MUST have a single string value that is one of "rtl"
, "ltr"
or "default"
. Indicates whether the tables in the group should be displayed with the first column on the right, on the left, or based on the first character in the table that has a specific direction. The value of this property becomes the value of the direction annotation for all the tables in the table group. See Bidirectional Tables in [tabular-data-model] for details. The default value for this property is "default"
.
tableSchema
An object property that provides a single schema description as described in section 5.5 Schemas, used as the default for all the tables in the group. This may be provided as an embedded object within the JSON metadata or as a URL reference to a separate JSON object that is a schema description.
transformations
An array property of transformation definitions that provide mechanisms to transform the tabular data into other formats. The value of this property becomes the value of the transformations annotation for all the tables in the table group.
When an array of transformation definitions B is merged into an original array of transformation definitions A, each transformation definition within B is combined into the original array A by:
If there is a transformation definition with the same url in A, the transformation definition from B is merged into the matching transformation definition in A.
Otherwise, the transformation definition from B is appended to the array of transformation definitions A.
@id
@id is a link property that identifies the group of tables, as defined by [tabular-data-model], described by this table group description. It MUST NOT start with _:. The value of this property becomes the value of the id annotation for the group of tables.
@type
If included, @type
is an atomic property that MUST be set to "TableGroup"
. Publishers MAY include this to provide additional information to JSON-LD based toolchains.
The description MAY contain any common properties to provide extra metadata about the group of tables as a whole.
The description MAY contain inherited properties to describe cells within the tables.
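The merging behaviour described in this section for the tables and notes properties can be sketched as follows. This is a non-normative illustration that reads the rules as: table descriptions with the same url are merged, all others are appended, and note objects are always appended. The function names are hypothetical, and merge_one stands for the procedure that merges one description object into another, which this sketch does not define.

```python
def merge_tables(a, b, merge_one):
    """Merge table descriptions from b into a copy of a, matching on url."""
    merged = list(a)
    position = {table.get("url"): i for i, table in enumerate(merged)}
    for table in b:
        url = table.get("url")
        if url in position:
            index = position[url]
            merged[index] = merge_one(merged[index], table)
        else:
            position[url] = len(merged)
            merged.append(table)
    return merged


def merge_notes(a, b):
    """Append every note object from b to the notes from a."""
    return list(a) + list(b)


def keep_original(original, incoming):
    """Stand-in merge for this example: values already present win."""
    return {**incoming, **original}


a = [{"url": "countries.csv", "tableDirection": "ltr"}]
b = [{"url": "countries.csv", "suppressOutput": True},
     {"url": "country_slice.csv"}]
print(merge_tables(a, b, keep_original))
```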
A table description is a JSON object that describes a table within a CSV file.
url
This link property gives the single URL of the CSV file that the table is held in, relative to the location of the metadata document. The value of this property is the value of the url annotation for the annotated table this table description describes.
The description of a table MAY also contain:
dialect
As defined for table groups.
notes
An array property that provides an array of objects representing arbitrary annotations on the annotated tabular data model. The value of this property becomes the value of the notes annotation for the table. The properties on these objects are interpreted equivalently to common properties as described in section 5.8 Common Properties. When an array of note objects B
is merged into an original array of note objects A
, each note object from B
is appended into the array A
.
The Web Annotation Working Group is developing a vocabulary for expressing annotations. In future versions of this specification, we anticipate referencing that vocabulary.
suppressOutput
A boolean atomic property. If true
, suppresses any output that would be generated when converting this table. The value of this property becomes the value of the suppress output annotation for this table. The default is false
.
tableDirection
As defined for table groups. The value of this property becomes the value of the direction annotation for this table.
tableSchema
An object property that provides a single schema description as described in section 5.5 Schemas. This may be provided as an embedded object within the JSON metadata or as a URL reference to a separate JSON schema document. If a table description is within a table group description, the tableSchema
from that table group acts as the default for this property.
If a tableSchema is not declared in a table description, it may be declared on the table group description, which is then used as the schema for this table description.
The @id
property of the tableSchema
, if there is one, becomes the value of the schema annotation for this table.
When a schema is referenced by URL, this URL becomes the value of the @id
property in the normalized schema description, and thus the value of the schema annotation on the table.
transformations
As defined for table groups. The value of this property becomes the value of the transformations annotation for this table.
@id
If included, @id
is a link property that identifies the table, as defined in [tabular-data-model], described by this table description. It MUST NOT start with _:
. The value of this property becomes the value of the id annotation for this table.
@type
If included, @type
is an atomic property that MUST be set to "Table"
. Publishers MAY include this to provide additional information to JSON-LD based toolchains.
The description MAY contain any common properties to provide extra metadata about the table as a whole.
The description MAY contain inherited properties to describe cells within the table.
A schema is a definition of a tabular format that may be common to multiple tables. For example, multiple tables from different sources may have the same columns and be designed such that they can be aggregated together.
A schema description is a JSON object that encodes the information about a schema, which describes the structure of a table. All the properties of a schema description are optional.
columns
An array property of column descriptions as described in section 5.6 Columns. These are matched to columns in tables that use the schema by position: the first column description in the array applies to the first column in the table, the second to the second and so on.
The name
properties of the column descriptions MUST be unique within a given table description.
When an array of column descriptions B
is merged into an original array of column descriptions A
, each column description within B
is combined into the original array A
, based on the index of each column description, as follows:
name and titles values for the column description at the same index within A and B, the column description from B is merged into the matching column description in A.
A and B, implementations MUST generate an error.
A, but there is a column description within B, then:
virtual property of the column description in B is true, then the column description is appended to A.
B, but there is a column description within A, then:
virtual property of the column description in A is true, then the column description is retained.
foreignKeys
An array property of foreign key definitions that define how the values from specified columns within this table link to rows within this table or other tables. A foreign key definition is a JSON object that MUST contain only the following properties:
columnReference
A column reference property that holds either a single reference to a column description object within this schema, or an array of references. These form the referencing columns for the foreign key definition.
reference
An object property that identifies a referenced table and a set of referenced columns within that table. Its properties are:
resource
A link property holding a URL that is the identifier for a specific table that is being referenced. If this property is present then schemaReference
MUST NOT be present. The table group MUST contain a table whose url annotation is identical to the expanded value of this property. That table is the referenced table.
schemaReference
A link property holding a URL that is the identifier for a schema that is being referenced. If this property is present then resource
MUST NOT be present. The table group MUST contain a table with a tableSchema
having a @id
that is identical to the expanded value of this property, and there MUST NOT be more than one such table. That table is the referenced table.
columnReference
A column reference property that holds either a single reference (by name) to a column description object within the tableSchema
of the referenced table, or an array of such references.
The value of this property becomes the foreign keys annotation on the table using this schema by creating a list of foreign keys comprising a list of columns in the table and a list of columns in the referenced table. The value of this property is also used to create the value of the referenced rows annotation on each of the rows in the table that uses this schema, which is a pair of the relevant foreign key and the referenced row in the referenced table.
As defined in [tabular-data-model], validators MUST check that, for each row, the combination of cells in the referencing columns references a unique row within the referenced table through a combination of cells in the referenced columns. For examples, see section 5.5.1.1 Foreign Key Reference Between Tables and section 5.5.1.2 Foreign Key Reference Between Schemas.
It is not required for the table or schema referenced from a foreignKeys
property to have a similarly defined primaryKey
, though frequently it will.
When an array of foreign key definitions B
is merged into an original array of foreign key definitions A
, each foreign key definition within B
which does not appear within A
is appended to the original array A
.
primaryKey
A column reference property that holds either a single reference to a column description object or an array of references. The value of this property becomes the primary key annotation for each row within a table that uses this schema by creating a list of the cells in that row that are in the referenced columns.
As defined in [tabular-data-model], validators MUST check that each row has a unique combination of values of cells in the indicated columns. For example, if primaryKey
is set to ["familyName", "givenName"]
then every row must have a unique value for the combination of values of cells in the familyName
and givenName
columns.
@id
If included, @id
is a link property that identifies the schema described by this schema description. It MUST NOT start with _:
.
@type
If included, @type
is an atomic property that MUST be set to "Schema"
. Publishers MAY include this to provide additional information to JSON-LD based toolchains.
The description MAY contain any common properties to provide extra metadata about the schema as a whole.
The description MAY contain inherited properties to describe cells within tables that use this schema.
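The primaryKey check described above can be sketched as follows. This is only an illustration, not a normative algorithm: it assumes rows have already been parsed into dictionaries keyed by column name, and the function name duplicate_primary_keys is hypothetical.

```python
from collections import Counter


def duplicate_primary_keys(rows, primary_key):
    """Return the primary key value combinations that occur in more than one row."""
    # primaryKey holds either a single column reference or an array of them.
    names = [primary_key] if isinstance(primary_key, str) else list(primary_key)
    counts = Counter(tuple(row[name] for name in names) for row in rows)
    return [key for key, count in counts.items() if count > 1]


# Hypothetical rows, assumed to be already parsed into dictionaries.
rows = [
    {"familyName": "Smith", "givenName": "Ann"},
    {"familyName": "Smith", "givenName": "Bob"},
    {"familyName": "Smith", "givenName": "Ann"},
]
duplicates = duplicate_primary_keys(rows, ["familyName", "givenName"])
```

A validator following this sketch would report an error whenever the returned list is not empty.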
This section is non-normative.
A list of countries is published at http://example.org/countries.csv
with the structure:
countryCode,latitude,longitude,name
AD,42.546245,1.601554,Andorra
AE,23.424076,53.847818,"United Arab Emirates"
AF,33.93911,67.709953,Afghanistan
Another file contains information about the population in some countries each year, at http://example.org/country_slice.csv
with the structure:
countryRef,year,population
AF,1960,9616353
AF,1961,9799379
AF,1962,9989846
The following metadata for the group of tables links the two together by defining a foreignKeys
property:
{
  "@context": "http://www.w3.org/ns/csvw",
  "tables": [{
    "url": "http://example.org/countries.csv",
    "tableSchema": {
      "columns": [{
        "name": "countryCode",
        "datatype": "string",
        "propertyUrl": "http://www.geonames.org/ontology{#_name}"
      }, {
        "name": "latitude",
        "datatype": "number"
      }, {
        "name": "longitude",
        "datatype": "number"
      }, {
        "name": "name",
        "datatype": "string"
      }],
      "aboutUrl": "http://example.org/countries.csv{#countryCode}",
      "propertyUrl": "http://schema.org/{_name}",
      "primaryKey": "countryCode"
    }
  }, {
    "url": "http://example.org/country_slice.csv",
    "tableSchema": {
      "columns": [{
        "name": "countryRef",
        "valueUrl": "http://example.org/countries.csv{#countryRef}"
      }, {
        "name": "year",
        "datatype": "gYear"
      }, {
        "name": "population",
        "datatype": "integer"
      }],
      "foreignKeys": [{
        "columnReference": "countryRef",
        "reference": {
          "resource": "http://example.org/countries.csv",
          "columnReference": "countryCode"
        }
      }]
    }
  }]
}
Within the annotated table generated for countries.csv
, each row will have a primary key annotation whose value is a list containing the cell from the first column of that row (countryCode
).
The annotated table generated for country_slice.csv
will have a foreign keys annotation whose value is a list containing a single foreign key referencing the first column from the table generated from country_slice.csv
(countryRef
) and the first column from the table generated from countries.csv
(countryCode
). Each row within that table will have a referenced row annotation referencing this foreign key and the third row in the table generated from countries.csv
.
When the population data in country_slice.csv
is validated, the validator must check that every countryRef
within country_slice.csv
has a matching countryCode
within countries.csv
.
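A minimal sketch of that check is shown below. It assumes both tables have already been parsed into lists of dictionaries keyed by column name; the function name unresolved_references and the ZZ row are hypothetical and only serve the illustration.

```python
def unresolved_references(rows, referencing, target_rows, referenced):
    """Return (row number, key) pairs that do not match exactly one target row."""
    # Count how often each combination of referenced cell values occurs.
    index = {}
    for target in target_rows:
        key = tuple(target[name] for name in referenced)
        index[key] = index.get(key, 0) + 1
    problems = []
    for number, row in enumerate(rows, start=1):
        key = tuple(row[name] for name in referencing)
        if index.get(key, 0) != 1:
            problems.append((number, key))
    return problems


countries = [
    {"countryCode": "AD", "name": "Andorra"},
    {"countryCode": "AE", "name": "United Arab Emirates"},
    {"countryCode": "AF", "name": "Afghanistan"},
]
slices = [
    {"countryRef": "AF", "year": "1960", "population": "9616353"},
    {"countryRef": "ZZ", "year": "1960", "population": "0"},  # hypothetical bad row
]
problems = unresolved_references(slices, ["countryRef"], countries, ["countryCode"])
```

Only the second row is reported here, because ZZ has no matching countryCode.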
When publishing information about public sector roles and salaries, as in Use Case 4, the UK government requires departments to publish two files which are interlinked. The first lists senior grades (simplified here) e.g., at HEFCE_organogram_senior_data_31032011.csv
:
Post Unique Reference, Name,Grade, Job Title,Reports to Senior Post
90115, Steve Egan,SCS1A,Deputy Chief Executive, 90334
90250, David Sweeney,SCS1A, Director, 90334
90284, Heather Fry,SCS1A, Director, 90334
90334,Sir Alan Langlands, SCS4, Chief Executive, xx
The second provides information about the number of junior positions that report to those individuals (simplified here) e.g., at HEFCE_organogram_junior_data_31032011.csv
:
Reporting Senior Post,Grade,Payscale Minimum (£),Payscale Maximum (£),Generic Job Title,Number of Posts in FTE, Profession
90284, 4, 17426, 20002, Administrator, 2,Operational Delivery
90284, 5, 19546, 22478, Administrator, 1,Operational Delivery
90115, 4, 17426, 20002, Administrator, 8.67,Operational Delivery
90115, 5, 19546, 22478, Administrator, 0.5,Operational Delivery
The schemas are reused by multiple departments and for multiple pairs of files. The schemas are therefore defined in separate files, and they need to define links between the schemas which are then picked up as applying between tables that use those schemas.
The metadata file for the particular publication of the files above is:
{
  "@context": "http://www.w3.org/ns/csvw",
  "tables": [{
    "url": "HEFCE_organogram_senior_data_31032011.csv",
    "tableSchema": "http://example.org/schema/senior-roles.json"
  }, {
    "url": "HEFCE_organogram_junior_data_31032011.csv",
    "tableSchema": "http://example.org/schema/junior-roles.json"
  }]
}
The schema for the senior role CSV (at http://example.org/schema/senior-roles.json
) is as follows:
{
  "@id": "http://example.org/schema/senior-roles.json",
  "@context": "http://www.w3.org/ns/csvw",
  "columns": [{
    "name": "ref",
    "titles": "Post Unique Reference"
  }, {
    "name": "name",
    "titles": "Name"
  }, {
    "name": "grade",
    "titles": "Grade"
  }, {
    "name": "job",
    "titles": "Job Title"
  }, {
    "name": "reportsTo",
    "titles": "Reports to Senior Post"
  }],
  "primaryKey": "ref"
}
The schema for the junior role CSV (at http://example.org/schema/junior-roles.json
) is as follows; it includes a foreign key reference to the senior roles schema:
{
  "@id": "http://example.org/schema/junior-roles.json",
  "@context": "http://www.w3.org/ns/csvw",
  "columns": [{
    "name": "reportsTo",
    "titles": "Reporting Senior Post"
  }, ... ],
  "foreignKeys": [{
    "columnReference": "reportsTo",
    "reference": {
      "schemaReference": "http://example.org/schema/senior-roles.json",
      "columnReference": "ref"
    }
  }]
}
The foreign key definition here contains a schemaReference
to senior-roles.json
. Implementations will look for the table referenced within the original metadata file whose tableSchema
is senior-roles.json
, which is HEFCE_organogram_senior_data_31032011.csv
. The implementation will therefore look for a relationship between the reportsTo
column in HEFCE_organogram_junior_data_31032011.csv
and the ref
column in HEFCE_organogram_senior_data_31032011.csv
.
For example, in the first line of HEFCE_organogram_junior_data_31032011.csv
, the reportsTo
(Reporting Senior Post
) column contains the value 90284
. When validating that file, validators will check that there is a single row within the table generated from HEFCE_organogram_senior_data_31032011.csv
whose ref
column contains the value 90284
.
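The lookup of the referenced table can be sketched as follows. This is an illustration under stated assumptions: the metadata is taken to be already normalized, so that each tableSchema is an object rather than a URL, and the function name find_referenced_table is hypothetical. Treating a reference with neither resource nor schemaReference as an error is also an assumption of this sketch.

```python
def find_referenced_table(tables, reference):
    """Return the table description identified by a foreign key reference object."""
    has_resource = "resource" in reference
    has_schema = "schemaReference" in reference
    if has_resource == has_schema:
        # resource and schemaReference must not both be present.
        raise ValueError("expected exactly one of resource or schemaReference")
    if has_resource:
        matches = [t for t in tables if t.get("url") == reference["resource"]]
    else:
        wanted = reference["schemaReference"]
        matches = [t for t in tables
                   if t.get("tableSchema", {}).get("@id") == wanted]
    if not matches:
        raise ValueError("the table group contains no referenced table")
    if has_schema and len(matches) > 1:
        raise ValueError("more than one table uses the referenced schema")
    return matches[0]


tables = [
    {"url": "HEFCE_organogram_senior_data_31032011.csv",
     "tableSchema": {"@id": "http://example.org/schema/senior-roles.json"}},
    {"url": "HEFCE_organogram_junior_data_31032011.csv",
     "tableSchema": {"@id": "http://example.org/schema/junior-roles.json"}},
]
reference = {
    "schemaReference": "http://example.org/schema/senior-roles.json",
    "columnReference": "ref",
}
referenced = find_referenced_table(tables, reference)
```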
Foreign key definitions provide for strong linking between tables that guarantees (through validation) the existence of a referenced row. It is also possible to provide weak linking between tables, which is not tested during validation but which may be useful when converting tabular data into other formats, using aboutUrl
and valueUrl
.
Taking the example above as a starting point, the schema for HEFCE_organogram_senior_data_31032011.csv
could use aboutUrl
to provide a URL for each row, which can similarly be created as a valueUrl
for the reportsTo
column:
{
  "@id": "http://example.org/schema/senior-roles.json",
  "@context": "http://www.w3.org/ns/csvw",
  "aboutUrl": "#role-{ref}",
  "columns": [{
    "name": "ref",
    "titles": "Post Unique Reference"
  }, {
    "name": "name",
    "titles": "Name"
  }, {
    "name": "grade",
    "titles": "Grade"
  }, {
    "name": "job",
    "titles": "Job Title"
  }, {
    "name": "reportsTo",
    "titles": "Reports to Senior Post",
    "valueUrl": "#role-{reportsTo}"
  }],
  "primaryKey": "ref"
}
The URLs generated for the values of the reportsTo
column will (if the data is correct) match the URLs generated for each row within the table. There will be no validation error, however, if there is a value in the reportsTo
column that does not match a value in the ref
column. In contrast, if a foreign key had been specified with:
"foreignKeys": [{
  "columnReference": "reportsTo",
  "reference": {
    "schemaReference": "http://example.org/schema/senior-roles.json",
    "columnReference": "ref"
  }
}]
then validators would raise an error if a value in the reportsTo
column did not match any value in the ref
column.
A column description is a JSON object that describes a single column. The description provides additional human-readable documentation for a column, as well as additional information that may be used to validate the cells within the column, create a user interface for data entry, or inform conversion into other formats. All properties are optional.
name
An atomic property that gives a single canonical name for the column. The value of this property becomes the name annotation for the described column. This MUST be a string and this property has no default value, which means it MUST be ignored if the supplied value is not a string.
For ease of reference within URI template properties, column names are restricted as defined in Variables in [URI-TEMPLATE] with the additional provision that names beginning with "_"
are reserved by this specification and MUST NOT be used within metadata documents.
suppressOutput
A boolean atomic property. If true
, suppresses any output that would be generated when converting cells in this column. The value of this property becomes the suppress output annotation for the described column. The default is false
.
titles
A natural language property that provides possible alternative names for the column. The string values of this property, along with their associated language tags, become the titles annotation for the described column.
If there is no name
property defined on this column, the first titles
value having the same language tag as the default language, or und
if no default language is specified, becomes the name annotation for the described column. This annotation MUST be percent-encoded as necessary to conform to the syntactic requirements defined in [RFC3986].
virtual
A boolean atomic property taking a single value which indicates whether the column is a virtual column not present in the original source. The default value is false
. The normalized value of this property becomes the virtual annotation for the described column. If present, a virtual column MUST appear after all other non-virtual column definitions.
Virtual columns are useful for inserting cells with default values into an annotated table to control the results of conversions.
We invite comment on whether virtual columns are useful enough to include in the final recommendation in spite of the added complexity.
@id
If included, @id
is a link property that identifies the columns, as defined in [tabular-data-model], and potentially appearing across separate tables, described by this column description. It MUST NOT start with _:
.
@type
If included, @type
is an atomic property that MUST be set to "Column"
. Publishers MAY include this to provide additional information to JSON-LD based toolchains.
If the column description has neither name
nor titles
properties, the string "_col.[N]"
where [N]
is the column number, becomes the name annotation for the described column.
The description MAY contain any common properties to provide extra metadata about the column as a whole, such as a full description.
The description MAY contain inherited properties to describe cells within the column.
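The rules above for deriving the name annotation can be sketched as follows. This is a rough illustration rather than a normative algorithm. The function name column_name is hypothetical, a plain string or array of titles is assumed to carry the default language, and percent-encoding everything outside the [RFC3986] unreserved set is one possible reading of "as necessary".

```python
from urllib.parse import quote


def column_name(description, number, default_language="und"):
    """Derive the name annotation for a column description."""
    name = description.get("name")
    if isinstance(name, str):
        return name
    titles = description.get("titles")
    # Assumption: a bare string or array is in the default language.
    if isinstance(titles, (str, list)):
        titles = {default_language: titles}
    if isinstance(titles, dict):
        candidates = titles.get(default_language, [])
        if isinstance(candidates, str):
            candidates = [candidates]
        if candidates:
            # Percent-encode every character that is not unreserved.
            return quote(candidates[0], safe="")
    # Fallback; also used here when no title matches the default language,
    # a case the text above does not spell out.
    return "_col.{}".format(number)
```

For instance, a column description with only "titles": "Post Unique Reference" yields Post%20Unique%20Reference under this sketch, and an empty description in the third position yields _col.3.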
This section is non-normative.
virtual
columns
Virtual columns are useful when data needs to be added as part of an output transformation that doesn't exist in the source file. This may be to add type information to a column, or to relate different columns having different aboutUrl
. For example, the http://example.org/tree-ops.csv
example used in the introduction can be used with the following metadata:
{
  "url": "tree-ops.csv",
  "@context": ["http://www.w3.org/ns/csvw", {"@language": "en"}],
  "tableSchema": {
    "columns": [{
      "name": "GID",
      "titles": "GID",
      "datatype": "string",
      "propertyUrl": "schema:url",
      "valueUrl": "#gid-{GID}"
    }, {
      "name": "on_street",
      "titles": "On Street",
      "datatype": "string",
      "aboutUrl": "#location-{GID}",
      "propertyUrl": "schema:streetAddress"
    }, {
      "name": "species",
      "titles": "Species",
      "datatype": "string",
      "propertyUrl": "schema:name"
    }, {
      "name": "trim_cycle",
      "titles": "Trim Cycle",
      "datatype": "string"
    }, {
      "name": "inventory_date",
      "titles": "Inventory Date",
      "datatype": {"base": "date", "format": "M/d/yyyy"},
      "aboutUrl": "#event-{inventory_date}",
      "propertyUrl": "schema:startDate"
    }, {
      "propertyUrl": "schema:event",
      "valueUrl": "#event-{inventory_date}",
      "virtual": true
    }, {
      "propertyUrl": "schema:location",
      "valueUrl": "#location-{GID}",
      "virtual": true
    }, {
      "aboutUrl": "#location-{GID}",
      "propertyUrl": "rdf:type",
      "valueUrl": "schema:PostalAddress",
      "virtual": true
    }],
    "aboutUrl": "#gid-{GID}"
  }
}
This metadata creates a relationship model between data in each column by different combinations of aboutUrl
, propertyUrl
, and valueUrl
on existing columns, and defining new virtual columns to supply additional information. In this case, the on_street
and inventory_date
values are split into separate entities, each having their own aboutUrl
. New virtual columns are defined to provide a location type, and to relate the main row entity to the event and location associated with it. The result of converting the table to RDF would include the following, for the first row, with the contributions from the virtual columns highlighted:
<#gid-1> schema:url <#gid-1> ;
  schema:name "Celtis australis" ;
  :trim_cycle "Large Tree Routine Prune" ;
  schema:event <#event-2010-10-18> ;
  schema:location <#location-1> ;
  .

<#event-2010-10-18> a schema:Event ;
  schema:startDate "2010-10-18"^^xsd:date ;
  .

<#location-1> a schema:PostalAddress ;
  schema:streetAddress "ADDISON AV" ;
  .
The JSON would similarly include, again with the contributions from the virtual columns highlighted:
{
  "@id": "#gid-1",
  "schema:url": "#gid-1",
  "schema:name": "Celtis australis",
  "trim_cycle": "Large Tree Routine Prune",
  "schema:event": {
    "@id": "#event-2010-10-18",
    "@type": "schema:Event",
    "schema:startDate": "2010-10-18"
  },
  "schema:location": {
    "@id": "#location-1",
    "@type": "schema:PostalAddress",
    "schema:streetAddress": "ADDISON AV"
  }
}
A cell may be assigned annotations based on properties on the description objects for the group of tables, table, schema, or column that it appears in. These properties are known as inherited properties and are listed below. To ascertain a value for certain annotations on cells, an application MUST identify the relevant property in the descriptions of the group of tables, table, schema, or column.
aboutUrl
A URI template property that MAY be used to indicate what a cell contains information about. The value of this property becomes the about URL annotation for the described column.
aboutUrl
is typically defined on a schema description or table description to indicate what each row is about. If defined on individual column descriptions, care must be taken to ensure that transformed cell values maintain a semantic relationship.
datatype
An atomic property that contains either a single string that is the main datatype of the values of the cell or a datatype description object. If the value of this property is a string, it MUST be one of the built-in datatypes defined in section 5.11.1 Built-in Datatypes; if it is an object then it describes a more specialised datatype. If a cell contains a sequence (i.e., the separator
property is specified and not null
) then this property specifies the datatype of each value within that sequence. See 5.11 Datatypes and Parsing Cells in [tabular-data-model] for more details.
The normalized value of this property becomes the datatype annotation for the described column.
We invite comment on whether datatype
should allow for a "union" of types for a cell; this would allow for a set of datatypes that could be matched against the string value of a cell, choosing the first match; e.g., to match either a date
or datetime
.
default
An atomic property holding a single string that is used to create a default value for the cell in cases where the original string value is an empty string. See Parsing Cells in [tabular-data-model] for more details. If not specified, the default for the default
property is the empty string, ""
. The value of this property becomes the default annotation for the described column.
lang
An atomic property giving a single string language code as defined by [BCP47]. Indicates the language of the value within the cell. See Parsing Cells in [tabular-data-model] for more details. The value of this property becomes the lang annotation for the described column. The default is und
.
null
An atomic property giving the string or strings used for null values within the data. If the string value of the cell is equal to any one of these values, the cell value is null
. See Parsing Cells in [tabular-data-model] for more details. If not specified, the default for the null
property is the empty string ""
. The value of this property becomes the null annotation for the described column.
ordered
A boolean atomic property taking a single value which indicates whether a list that is the value of the cell is ordered (if true
) or unordered (if false
). The default is false
. This property is irrelevant if the separator
is null
or undefined, but this is not an error. The value of this property becomes the ordered annotation for the described column, and the ordered annotation for the cells within that column.
propertyUrl
A URI template property that MAY be used to create a URI for a property if the table is mapped to another format. The value of this property becomes the property URL annotation for the described column.
propertyUrl
is typically defined on a column description. If defined on a schema description, table description or table group description, care must be taken to ensure that transformed cell values maintain an appropriate semantic relationship, for example by including the name of the column in the generated URL by using _name
in the template.
required
A boolean atomic property taking a single value which indicates whether the cell must have a non-null value. The default is false
. The value of this property becomes the required annotation for the described column.
separator
An atomic property that MUST have a single string value that is the character used to separate items in the string value of the cell. If null
(the default) or unspecified, the cell does not contain a list. Otherwise, applications MUST split the string value of the cell on the specified separator character and parse each of the resulting strings separately. The cell's value will then be a list. See Parsing Cells in [tabular-data-model] for more details. The value of this property becomes the separator annotation for the described column.
textDirection
An atomic property that MUST have a single string value that is one of "rtl"
or "ltr"
(the default). Indicates whether the text within cells should be displayed by default as left-to-right or right-to-left text. The value of this property becomes the text direction annotation for the column. See Bidirectional Tables in [tabular-data-model] for details.
valueUrl
A URI template property that is used to map the values of cells into URLs. The value of this property becomes the value URL annotation for the described column.
This allows processors to build URLs from cell values, for example to reference RDF resources, as defined in [rdf-concepts]. For example, if the value URL were "{#reference}"
, each cell value of a column named reference would be used to create a URI such as http://example.com/#1234
, if 1234
were a cell value of that column.
valueUrl
is typically defined on a column description. If defined on a schema description, table description or table group description, care must be taken to ensure that transformed cell values maintain an appropriate semantic relationship.
The value of an inherited property is the first value, if any, found by looking in the current description object and then through each of its containing objects in turn: an inherited property defined in a column description takes precedence over one defined in a schema description, which in turn takes precedence over one defined in a table description, which in turn takes precedence over one defined in a table group description.
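The precedence order for inherited properties can be sketched as a simple lookup. The function name inherited_value and the sample descriptions are hypothetical; the description objects are assumed to be plain dictionaries.

```python
def inherited_value(prop, column, schema, table, group, default=None):
    """Return the first value found, looking from the column outwards."""
    for description in (column, schema, table, group):
        if description is not None and prop in description:
            return description[prop]
    return default


# Hypothetical description objects for illustration.
group = {"lang": "en", "null": "NA"}
table = {"null": "-"}
schema = {"aboutUrl": "#gid-{GID}"}
column = {"name": "GID", "lang": "fr"}

lang = inherited_value("lang", column, schema, table, group)
null = inherited_value("null", column, schema, table, group)
required = inherited_value("required", column, schema, table, group, default=False)
```

Here lang comes from the column description, null from the table description, and required falls back to the default because no description defines it.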
This section is non-normative.
In the following example, the aboutUrl
property is defined on the tableSchema
, and therefore affects all cells for that table.
{
  "@context": ["http://www.w3.org/ns/csvw", {"@language": "en"}],
  "url": "tree-ops.csv",
  "dc:title": "Tree Operations",
  "dcat:keyword": ["tree", "street", "maintenance"],
  "dc:publisher": {
    "schema:name": "Example Municipality",
    "schema:url": {"@id": "http://example.org"}
  },
  "dc:license": {"@id": "http://opendefinition.org/licenses/cc-by/"},
  "dc:modified": {"@value": "2010-12-31", "@type": "xsd:date"},
  "tableSchema": {
    "columns": [{
      "name": "GID",
      "titles": ["GID", "Generic Identifier"],
      "dc:description": "An identifier for the operation on a tree.",
      "datatype": "string",
      "required": true
    }, {
      "name": "on_street",
      "titles": "On Street",
      "dc:description": "The street that the tree is on.",
      "datatype": "string"
    }, {
      "name": "species",
      "titles": "Species",
      "dc:description": "The species of the tree.",
      "datatype": "string"
    }, {
      "name": "trim_cycle",
      "titles": "Trim Cycle",
      "dc:description": "The operation performed on the tree.",
      "datatype": "string"
    }, {
      "name": "inventory_date",
      "titles": "Inventory Date",
      "dc:description": "The date of the operation that was performed.",
      "datatype": {"base": "date", "format": "M/d/yyyy"}
    }],
    "primaryKey": "GID",
    "aboutUrl": "#gid-{GID}"
  }
}
The equivalent effect could be achieved by using the aboutUrl
property on each column:
{
  "@context": ["http://www.w3.org/ns/csvw", {"@language": "en"}],
  "url": "tree-ops.csv",
  "dc:title": "Tree Operations",
  "dcat:keyword": ["tree", "street", "maintenance"],
  "dc:publisher": {
    "schema:name": "Example Municipality",
    "schema:url": {"@id": "http://example.org"}
  },
  "dc:license": {"@id": "http://opendefinition.org/licenses/cc-by/"},
  "dc:modified": {"@value": "2010-12-31", "@type": "xsd:date"},
  "tableSchema": {
    "columns": [{
      "name": "GID",
      "titles": ["GID", "Generic Identifier"],
      "aboutUrl": "#gid-{GID}",
      "dc:description": "An identifier for the operation on a tree.",
      "datatype": "string",
      "required": true
    }, {
      "name": "on_street",
      "titles": "On Street",
      "aboutUrl": "#gid-{GID}",
      "dc:description": "The street that the tree is on.",
      "datatype": "string"
    }, {
      "name": "species",
      "titles": "Species",
      "aboutUrl": "#gid-{GID}",
      "dc:description": "The species of the tree.",
      "datatype": "string"
    }, {
      "name": "trim_cycle",
      "titles": "Trim Cycle",
      "aboutUrl": "#gid-{GID}",
      "dc:description": "The operation performed on the tree.",
      "datatype": "string"
    }, {
      "name": "inventory_date",
      "titles": "Inventory Date",
      "aboutUrl": "#gid-{GID}",
      "dc:description": "The date of the operation that was performed.",
      "datatype": {"base": "date", "format": "M/d/yyyy"}
    }],
    "primaryKey": "GID"
  }
}
Descriptions of groups of tables, tables, schemas and columns MAY contain any common properties whose names are either absolute URLs or prefixed names. For example, a table description may contain dc:description
, dcat:keyword
, or schema:copyrightHolder
properties to provide a description, keywords, or the name of the copyright holder, as defined in Dublin Core Terms, DCAT, or schema.org.
The names of common properties are prefixed names, in the syntax prefix:name
.
Prefixed names can be expanded to provide a URI, by replacing the prefix and following colon with the URI that the prefix is associated with. Expansion is intended to be entirely consistent with Section 6.3 IRI Expansion in [JSON-LD-API] and implementations MAY use a JSON-LD processor for performing prefixed name and IRI expansion.
The prefixes that are recognized are those defined for [rdfa-core] within the RDFa 1.1 Initial Context and other prefixes defined within [csvw-context] and these MUST NOT be overridden. These prefixes are periodically extended; refer to [csvw-context] for details. Properties from other vocabularies MUST be named using absolute URLs.
Forbidding the declaration of new prefixes ensures consistent processing between JSON-LD-aware and non-JSON-LD-aware processors.
This specification does not define how common properties are interpreted by implementations. Implementations SHOULD treat the prefixed names for common properties and the URLs that they expand into in the same way. For example, if an implementation recognises and displays the value of the dc:description
property, it should also recognise and display the value of the http://purl.org/dc/terms/description
property in the same way.
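Prefixed name expansion can be sketched as follows. The PREFIXES table is a small illustrative subset of the RDFa 1.1 Initial Context, not the full list that implementations are expected to recognise, and the function name expand is hypothetical.

```python
# Illustrative subset of the recognised prefixes.
PREFIXES = {
    "dc": "http://purl.org/dc/terms/",
    "dcat": "http://www.w3.org/ns/dcat#",
    "schema": "http://schema.org/",
}


def expand(name):
    """Expand a prefixed name; leave absolute URLs and unknown prefixes unchanged."""
    prefix, colon, local = name.partition(":")
    if colon and prefix in PREFIXES:
        return PREFIXES[prefix] + local
    return name


expanded = expand("dc:description")
```

Because both spellings expand to the same URL, an implementation can key its handling of common properties on the expanded form and treat dc:description and http://purl.org/dc/terms/description alike.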
Common properties can take any JSON value, so long as any objects within the value (for example as items of an array or values of properties on other objects) adhere to the following restrictions, which are designed to ensure compatibility between JSON-LD-aware and non-JSON-LD-aware processors:
If a @value
property is used on an object, that object MUST NOT have any other properties aside from either @type
or @language
, and MUST NOT have both @type
and @language
as properties. The value of the @value
property MUST be a string, number, or boolean value.
If @type
is also used, its value MUST be one of:
If a @language
property is used, it MUST have a string value that adheres to the syntax defined in [BCP47], or be null
.
If a @type
property is used on an object without a @value
property, its value MUST be one of:
@type
as defined for any of the description objects in this specification.
A @type
property can also have a value that is an array of such values.
The values of @id
properties are link properties and are treated as URLs. During normalization, as described in section 6.1 Normalization, they will have any prefix expanded and the result resolved against the base URL. Therefore, if an @id
property is used on an object, it MUST have a value that is a string and that string MUST NOT start with _:
.
A @language
property MUST NOT be used on an object unless it also has a @value
property.
Aside from @value
, @type
, @language
, and @id
, the properties used on an object MUST NOT start with @
.
These restrictions are also described in section A. JSON-LD Dialect, from the perspective of a processor that otherwise supports JSON-LD. Examples of common property values and the impact of normalization are given in section 6.1.1 Examples.
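Some of the restrictions on objects within common property values can be checked mechanically. The sketch below is partial and hypothetical: the function name check_object is an assumption, it inspects a single object without recursing into nested values, and it does not validate @type values or the [BCP47] syntax of @language.

```python
KEYWORDS = {"@value", "@type", "@language", "@id"}


def check_object(obj):
    """Return a list of problems found in one object of a common property value."""
    problems = []
    keys = set(obj)
    if "@value" in keys:
        if keys - {"@value", "@type", "@language"}:
            problems.append("@value object has other properties")
        if {"@type", "@language"} <= keys:
            problems.append("@value object has both @type and @language")
        if not isinstance(obj["@value"], (str, int, float, bool)):
            problems.append("@value is not a string, number, or boolean")
    elif "@language" in keys:
        problems.append("@language used without @value")
    if "@id" in keys:
        identifier = obj["@id"]
        if not isinstance(identifier, str) or identifier.startswith("_:"):
            problems.append("@id is not a string or starts with _:")
    for key in sorted(keys):
        if key.startswith("@") and key not in KEYWORDS:
            problems.append("unsupported property " + key)
    return problems


ok = check_object({"@value": "2010-12-31", "@type": "xsd:date"})
bad = check_object({"@value": "x", "@type": "xsd:string", "@language": "en"})
```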
Much of the tabular data that is published on the web is messy, and CSV parsers frequently need to be configured in order to correctly read in CSV. A dialect description provides hints to parsers about how to parse the file linked to from the url
property in a table description. It can have any of the following properties, which relate to the flags described in Section 5 Parsing Tabular Data within the [tabular-data-model]:
commentPrefix
An atomic property that sets the comment prefix flag to the single provided value, which MUST be a single character string. The default is "#"
.
delimiter
An atomic property that sets the delimiter flag to the single provided value, which MUST be a single character string. The default is ","
.
doubleQuote
A boolean atomic property that, if true
, sets the escape character flag to "
. If false
, to \
. The default is true
.
encoding
An atomic property that sets the encoding flag to the single provided string value, which MUST be defined in [encoding]. The default is "utf-8"
.
header
A boolean atomic property that, if true
, sets the header row count flag to 1
, and if false
to 0
, unless headerRowCount
is provided, in which case the value provided for the header
property is ignored. The default is true
.
headerRowCount
A numeric atomic property that sets the header row count flag to the single provided value, which MUST be a non-negative integer. The default is 1
.
lineTerminators
An atomic property that sets the line terminators flag to either an array containing the single provided string value, or the provided array. The default is ["\r\n", "\n"]
.
quoteChar
An atomic property that sets the quote character flag to the single provided value, which MUST be a single character or null
. If the value is null
, the escape character flag is also set to null
. The default is "
.
skipBlankRows
A boolean atomic property that sets the skip blank rows flag to the single provided boolean value. The default is false
.
skipColumns
A numeric atomic property that sets the skip columns flag to the single provided numeric value, which MUST be a non-negative integer. The default is 0
.
skipInitialSpace
A boolean atomic property that, if true
, sets the trim flag to "start"
. If false
, to false
. If the trim
property is provided, the skipInitialSpace
property is ignored. The default is false
.
skipRows
An numeric atomic property that sets the skip rows flag to the single provided numeric value, which MUST be a non-negative integer. The default is 0
.
trim
An atomic property that, if the boolean true
, sets the trim flag to true
and if the boolean false
to false
. If the value provided is a string, sets the trim flag to the provided value, which MUST be one of "true"
, "false"
, "start"
, or "end"
. The default is false
.
@id
If included, @id
is a link property that identifies the dialect described by this dialect description. It MUST NOT start with _:
.
@type
If included, @type
is an atomic property that MUST be set to "Dialect"
. Publishers MAY include this to provide additional information to JSON-LD based toolchains.
Dialect descriptions do not provide a mechanism for handling CSV files in which there are multiple tables within a single file (e.g., separated by empty lines).
The default dialect description for CSV files is:
{ "encoding": "utf-8", "lineTerminators": ["\r\n", "\n"], "quoteChar": "\"", "doubleQuote": true, "skipRows": 0, "commentPrefix": "#", "header": true, "headerRowCount": 1, "delimiter": ",", "skipColumns": 0, "skipBlankRows": false, "skipInitialSpace": false, "trim": false }
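The precedence rules among the dialect properties (headerRowCount over header, trim over skipInitialSpace, and quoteChar null clearing the escape character) can be illustrated with a minimal Python sketch. The function name dialect_to_flags and the snake_case flag names are hypothetical and chosen only for illustration; the defaults are copied from the default dialect description above.

```python
# Defaults copied from the default dialect description for CSV files.
DEFAULT_DIALECT = {
    "encoding": "utf-8",
    "lineTerminators": ["\r\n", "\n"],
    "quoteChar": "\"",
    "doubleQuote": True,
    "skipRows": 0,
    "commentPrefix": "#",
    "header": True,
    "headerRowCount": 1,
    "delimiter": ",",
    "skipColumns": 0,
    "skipBlankRows": False,
    "skipInitialSpace": False,
    "trim": False,
}


def dialect_to_flags(dialect):
    """Resolve a dialect description (a dict) into parser flags.

    The flag names are hypothetical; only the precedence rules
    come from the property definitions.
    """
    d = {**DEFAULT_DIALECT, **dialect}
    flags = {
        "encoding": d["encoding"],
        "delimiter": d["delimiter"],
        "comment_prefix": d["commentPrefix"],
        "skip_rows": d["skipRows"],
        "skip_columns": d["skipColumns"],
        "skip_blank_rows": d["skipBlankRows"],
        "quote_char": d["quoteChar"],
    }

    # A single string becomes a one-element array of line terminators.
    terminators = d["lineTerminators"]
    flags["line_terminators"] = (
        [terminators] if isinstance(terminators, str) else list(terminators)
    )

    # headerRowCount, when provided, takes precedence over header.
    if "headerRowCount" in dialect:
        flags["header_row_count"] = dialect["headerRowCount"]
    elif "header" in dialect:
        flags["header_row_count"] = 1 if dialect["header"] else 0
    else:
        flags["header_row_count"] = DEFAULT_DIALECT["headerRowCount"]

    # A null quoteChar also sets the escape character flag to null.
    if d["quoteChar"] is None:
        flags["escape_char"] = None
    else:
        flags["escape_char"] = "\"" if d["doubleQuote"] else "\\"

    # trim, when provided, takes precedence over skipInitialSpace.
    if "trim" in dialect:
        flags["trim"] = dialect["trim"]
    elif dialect.get("skipInitialSpace"):
        flags["trim"] = "start"
    else:
        flags["trim"] = False
    return flags
```

For instance, dialect_to_flags({"header": False, "headerRowCount": 2}) yields a header row count of 2, because the header value is ignored when headerRowCount is present.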
A transformation definition is a definition of how tabular data can be transformed into another format using a script or template.
For example, the following transformation definition will enable a processor that supports it to generate an iCalendar document using a Mustache template based on the JSON created from the simple mapping to JSON.
{ "url": "templates/ical.txt", "titles": "iCalendar", "targetFormat": "http://www.iana.org/assignments/media-types/text/calendar", "scriptFormat": "https://mustache.github.io/", "source": "json" }
A processor that recognises templates in the Mustache format indicated by "https://mustache.github.io/" and that could convert tables into JSON based on [csv2json] would retrieve the template from "templates/ical.txt" and apply this to the resulting JSON.
Transformation definitions have the following properties:
Transformation definitions MUST have the following properties:
url
A link property giving the single URL of the file that the script or template is held in, relative to the location of the metadata document.
scriptFormat
A link property giving the single URL for the format that is used by the script or template. If one has been defined, this should be a URL for a media type, in the form http://www.iana.org/assignments/media-types/media-type such as http://www.iana.org/assignments/media-types/application/javascript. Otherwise, it can be any URL that describes the script or template format.
The scriptFormat URL is intended as an informative identifier for the template format, and applications SHOULD NOT access the URL. The template formats that an application supports are implementation defined.
targetFormat
A link property giving the single URL for the format that will be created through the transformation. If one has been defined, this should be a URL for a media type, in the form http://www.iana.org/assignments/media-types/media-type such as http://www.iana.org/assignments/media-types/text/calendar. Otherwise, it can be any URL that describes the target format.
The targetFormat URL is intended as an informative identifier for the target format, and applications SHOULD NOT access the URL.
Transformation definitions MAY have the following properties:
source
A single string atomic property that provides, if specified, the format to which the tabular data should be transformed prior to the transformation using the script or template. If the value is json, the tabular data MUST first be transformed to JSON as defined by [csv2json] using standard mode. If the value is rdf, the tabular data MUST first be transformed to an RDF graph as defined by [csv2rdf] using standard mode. If the source property is missing or null (the default) then the source of the transformation is the annotated tabular data model. No other values are valid.
titles
A natural language property that describes the format that will be generated from the transformation. This is useful if the target format is a generic format (such as application/json) and the transformation is creating a specific profile of that format.
@id
If included, @id is a link property that identifies the transformation described by this transformation definition. It MUST NOT start with _:.
@type
If included, @type is an atomic property that MUST be set to "Template". Publishers MAY include this to provide additional information to JSON-LD based toolchains.
The transformation definition MAY contain any common properties to provide extra metadata about the transformation.
Implementations MAY present users with options for transformations based on the available transformation definitions and their properties. Implementations SHOULD filter this list to only include those transformations whose scriptFormat they understand and can apply, and whose source property, if present, specifies a format that the implementation can convert to. Users may find the targetFormat and titles properties useful in deciding which transformation to apply.
When directed by a user to transform a table using a transformation definition, implementations MUST:
Convert the table to the format specified by the source property, if this is specified and not null.
Retrieve the script or template from the url property and raise an error if this does not exist.
Use the scriptFormat property to determine how to interpret that script or template, and apply it to the table (or the result of converting the table).
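The filtering that implementations SHOULD perform before offering transformations to a user can be sketched as follows. This is an illustrative sketch only: the function name and its two capability parameters are hypothetical, and how an implementation records which script formats it supports is implementation defined.

```python
def applicable_transformations(definitions, script_formats, convertible_sources):
    """Return the transformation definitions an implementation could offer.

    definitions: list of transformation definitions (dicts).
    script_formats: set of scriptFormat URLs the implementation can apply
        (hypothetical capability set).
    convertible_sources: subset of {"json", "rdf"} that the implementation
        can convert tables to (hypothetical capability set).
    """
    offered = []
    for definition in definitions:
        # Skip templates or scripts in a format that is not understood.
        if definition.get("scriptFormat") not in script_formats:
            continue
        # A missing or null source means the annotated tabular data model
        # is used directly, so no prior conversion is needed.
        source = definition.get("source")
        if source is not None and source not in convertible_sources:
            continue
        offered.append(definition)
    return offered
```

A user interface built on this list might then display the titles and targetFormat of each remaining definition to help the user choose.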
Cells within tables may be annotated with a datatype which indicates the type of the values obtained by parsing the string value of the cell. See [tabular-data-model] for a description of annotations on a datatype.
The possible built-in datatypes, as shown on the diagram, are:
anyAtomicType
number, which is mapped to double in the data model
binary, which is mapped to base64Binary in the data model
datetime, which is mapped to dateTime in the data model
any, which is mapped to anyAtomicType in the data model
xml, a sub-type of string, which indicates the value is an XML fragment
html, a sub-type of string, which indicates the value is an HTML fragment
json, a sub-type of string, which indicates the value is serialized JSON
More specialised datatypes can be defined through a datatype description. A datatype description may have any of the following properties, all of which are optional.
base
An atomic property that contains a single string: a term defined in the default context representing a built-in datatype URL, as listed above. Its default is string. All values of the datatype MUST be valid values of the base datatype. The value of this property becomes the base annotation for the described datatype.
format
An atomic property that contains either a single string or an object that defines the format of a value of this type, used when parsing a string value as described in Parsing Cells in [tabular-data-model]. The value of this property becomes the format annotation for the described datatype.
length
A numeric atomic property that contains a single integer that is the exact length of the value. The value of this property becomes the length annotation for the described datatype. See Length Constraints in [tabular-data-model] for details.
minLength
An atomic property that contains a single integer that is the minimum length of the value. The value of this property becomes the minimum length annotation for the described datatype. See Length Constraints in [tabular-data-model] for details.
maxLength
A numeric atomic property that contains a single integer that is the maximum length of the value. The value of this property becomes the maximum length annotation for the described datatype. See Length Constraints in [tabular-data-model] for details.
minimum
An atomic property that contains a single number or string that is the minimum valid value (inclusive); equivalent to minInclusive. The value of this property becomes the minimum annotation for the described datatype. See Value Constraints in [tabular-data-model] for details.
maximum
An atomic property that contains a single number or string that is the maximum valid value (inclusive); equivalent to maxInclusive. The value of this property becomes the maximum annotation for the described datatype. See Value Constraints in [tabular-data-model] for details.
minInclusive
An atomic property that contains a single number or string that is the minimum valid value (inclusive). The value of this property becomes the minimum annotation for the described datatype. See Value Constraints in [tabular-data-model] for details.
maxInclusive
An atomic property that contains a single number or string that is the maximum valid value (inclusive). The value of this property becomes the maximum annotation for the described datatype. See Value Constraints in [tabular-data-model] for details.
minExclusive
An atomic property that contains a single number or string that is the minimum valid value (exclusive). The value of this property becomes the minimum exclusive annotation for the described datatype. See Value Constraints in [tabular-data-model] for details.
maxExclusive
An atomic property that contains a single number or string that is the maximum valid value (exclusive). The value of this property becomes the maximum exclusive annotation for the described datatype. See Value Constraints in [tabular-data-model] for details.
The datatype description MAY contain any common properties to provide extra metadata about the datatype, such as a title or description.
Applications MUST raise an error if both length and minLength are specified and they do not have the same value. Similarly, applications MUST raise an error if both length and maxLength are specified and they do not have the same value. Applications MUST raise an error if length, maxLength, or minLength are specified and the base datatype is not string or one of its subtypes, or a binary type.
In all ways, including the errors described below, the minimum property is equivalent to the minInclusive property and the maximum property is equivalent to the maxInclusive property. Applications MUST raise an error if both minimum and minInclusive are specified and they do not have the same value. Similarly, applications MUST raise an error if both maximum and maxInclusive are specified and they do not have the same value.
Applications MUST raise an error if both minInclusive and minExclusive are specified, or if both maxInclusive and maxExclusive are specified. Applications MUST raise an error if both minInclusive and maxInclusive are specified and maxInclusive is less than minInclusive, or if both minInclusive and maxExclusive are specified and maxExclusive is less than or equal to minInclusive. Similarly, applications MUST raise an error if both minExclusive and maxExclusive are specified and maxExclusive is less than minExclusive, or if both minExclusive and maxInclusive are specified and maxInclusive is less than or equal to minExclusive.
Applications MUST raise an error if minimum, minInclusive, maximum, maxInclusive, minExclusive, or maxExclusive are specified and the base datatype is not a numeric, date/time, or duration type.
Validation against these properties is as defined in [xmlschema11-2].
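The consistency checks on a datatype description can be illustrated with a short Python sketch. It is a simplification under stated assumptions: the function name is hypothetical, only numeric bounds are compared (date/time and duration values would need parsing first), and the checks that depend on the base datatype are omitted. A conforming application would raise an error whenever the returned list is non-empty.

```python
def find_constraint_errors(datatype):
    """Return messages for conflicting constraints in a datatype description.

    Assumes numeric bound values; checks against the base datatype
    are deliberately left out of this sketch.
    """
    errors = []

    def differ(a, b):
        # True when both properties are present with different values.
        return a in datatype and b in datatype and datatype[a] != datatype[b]

    for a, b in (("length", "minLength"), ("length", "maxLength"),
                 ("minimum", "minInclusive"), ("maximum", "maxInclusive")):
        if differ(a, b):
            errors.append(f"{a} and {b} do not have the same value")

    # minimum/maximum are equivalent to minInclusive/maxInclusive.
    min_inc = datatype.get("minInclusive", datatype.get("minimum"))
    max_inc = datatype.get("maxInclusive", datatype.get("maximum"))
    min_exc = datatype.get("minExclusive")
    max_exc = datatype.get("maxExclusive")

    if min_inc is not None and min_exc is not None:
        errors.append("both minInclusive and minExclusive are specified")
    if max_inc is not None and max_exc is not None:
        errors.append("both maxInclusive and maxExclusive are specified")
    if min_inc is not None and max_inc is not None and max_inc < min_inc:
        errors.append("maxInclusive is less than minInclusive")
    if min_inc is not None and max_exc is not None and max_exc <= min_inc:
        errors.append("maxExclusive is less than or equal to minInclusive")
    if min_exc is not None and max_exc is not None and max_exc < min_exc:
        errors.append("maxExclusive is less than minExclusive")
    if min_exc is not None and max_inc is not None and max_inc <= min_exc:
        errors.append("maxInclusive is less than or equal to minExclusive")
    return errors
```

Collecting messages rather than raising on the first conflict is a design choice of this sketch; it lets a validator report every problem in one pass.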
When processing a tabular data file, the Locating Metadata section in [tabular-data-model] describes different locations for locating metadata. To properly transform a tabular data file, such as a CSV file, processors MUST merge metadata from these separate sources to create a single metadata document in a manner consistent with this algorithm.
Implementations MUST check and issue warnings where merge issues are found as noted below and in the relevant property definitions.
Merging of metadata happens in order from highest priority to lowest priority by merging the first two metadata files (A and B) together to create new merged metadata AB'. This is then used to merge in the next metadata file until all metadata have been processed to create a table group description.
If the top-level object of either metadata file is a table description, it is turned into a table group description containing a single table description (i.e., having a tables property whose value is an array containing the original table description). Any @context definitions are moved from the table description to the table group description.
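The overall shape of this process, wrapping table descriptions and then folding the documents together in priority order, can be sketched in Python. The names to_table_group and merge_all are hypothetical, the pairwise merge is passed in as a parameter rather than implemented, and recognising a table group description by the presence of a tables property is an assumption of this sketch.

```python
from functools import reduce


def to_table_group(doc):
    """Wrap a table description in a table group description.

    Assumption: a document without a "tables" property is a table
    description. Any @context moves up to the table group description.
    """
    if "tables" in doc:
        return doc
    table = {key: value for key, value in doc.items() if key != "@context"}
    group = {"tables": [table]}
    if "@context" in doc:
        group = {"@context": doc["@context"], "tables": [table]}
    return group


def merge_all(docs, merge_pair):
    """Merge metadata documents ordered from highest to lowest priority.

    merge_pair(a, b) must return the result of merging b into a; the
    result is then used as the A for the next document.
    """
    return reduce(merge_pair, (to_table_group(doc) for doc in docs))
```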
Merging has two stages: the normalization of metadata documents, described in section 6.1 Normalization and the merging of those normalized documents, described in section 6.2 Merging.
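Prior to merging, each description object is expanded relative to its @context and values are normalized as follows:
If the property is a common property or notes, the value MUST be normalized as follows:
If the value is a string, replace it with an object with a @value property whose value is that string. If a default language is specified, add a @language property whose value is that default language.
If the value is an object with a @value property, it remains as is.
If the value is any other object, normalize each of its properties: if the property is @id, expand any prefixed names and resolve its value against the base URL; if the property is @type, then its value remains as is.
If the property is an object property whose value is a string, treat that string as a URL and fetch it to retrieve an object, which may have a local @context. Raise an error if fetching this URL does not result in a JSON object. Normalize each property in the resulting object recursively using this algorithm and with its local @context then remove the local @context property. If the resulting object does not have an @id property, add an @id whose value is the original URL. This object becomes the value of the original object property.
If the property is a natural language property and no language can be determined for a value, the language code und MUST be used.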
Following this normalization process, the @base and @language properties within the @context are no longer relevant; the normalized metadata can have its @context set to http://www.w3.org/ns/csvw.
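A minimal Python sketch of the normalization of common property values described above follows. It is illustrative only: the function name is hypothetical, expansion of prefixed names in @id values is omitted, and nested property values are assumed to be normalized in the same way as top-level common property values.

```python
from urllib.parse import urljoin


def normalize_common_value(value, base_url, default_language=None):
    """Normalize the value of a common property or notes (sketch)."""
    if isinstance(value, list):
        # Each member of an array is normalized in place.
        return [normalize_common_value(v, base_url, default_language)
                for v in value]
    if isinstance(value, str):
        # A plain string becomes a value object, tagged with the
        # default language when one is specified.
        obj = {"@value": value}
        if default_language is not None:
            obj["@language"] = default_language
        return obj
    if isinstance(value, dict):
        if "@value" in value:
            return value  # value objects remain as they are
        normalized = {}
        for key, inner in value.items():
            if key == "@id":
                # Resolve against the base URL (prefix expansion omitted).
                normalized[key] = urljoin(base_url, inner)
            elif key == "@type":
                normalized[key] = inner
            else:
                normalized[key] = normalize_common_value(
                    inner, base_url, default_language)
        return normalized
    return value  # numbers, booleans and null are left untouched
```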
This section is non-normative.
The following are examples of how common properties are normalized.
In this example, a simple string is used as the title for a table using the dc:title common property:
{ "@context": [ "http://www.w3.org/ns/csvw", { "@language": "en" } ], "@type": "Table", "url": "http://example.com/table.csv", "tableSchema": [...], "dc:title": "The title of this Table" }
Since there is a default language, this is equivalent to explicitly specifying the language of that title; the original string value becomes the value of the @value property within a value object:
{ "@type": "Table", "url": "http://example.com/table.csv", "tableSchema": [...], "dc:title": {"@value": "The title of this Table", "@language": "en"} }
It is also possible to use a simple value object to give a title. However, in this case the default language is not applied to the title:
{ "@context": [ "http://www.w3.org/ns/csvw", { "@language": "en" } ], "@type": "Table", "url": "http://example.com/table.csv", "tableSchema": [...], "dc:title": {"@value": "The title of this Table"} }
The next example uses an array of a string and a value object to give two titles with different languages:
{ "@context": [ "http://www.w3.org/ns/csvw", { "@language": "en" } ], "@type": "Table", "url": "http://example.com/table.csv", "tableSchema": [...], "dc:title": [ "The title of this Table", {"@value": "Der Titel dieser Tabelle", "@language": "de"} ] }
The normalized version of this is:
{ "@type": "Table", "url": "http://example.com/table.csv", "tableSchema": [...], "dc:title": [ {"@value": "The title of this Table", "@language": "en"}, {"@value": "Der Titel dieser Tabelle", "@language": "de"} ] }
The next example demonstrates a node object, in which the value of the schema:url property is a reference to another resource:
{ "@context": [ "http://www.w3.org/ns/csvw", { "@base": "http://example.com/" } ], "@type": "Table", "url": "table.csv", "tableSchema": [...], "schema:url": {"@id": "table.csv"} }
The value of the @id property is normalized as described in section 6.1 Normalization against the base URL provided through the @base property, which means the above example is equivalent to:
{ "@context": "http://www.w3.org/ns/csvw", "@type": "Table", "url": "http://example.com/table.csv", "tableSchema": [...], "schema:url": {"@id": "http://example.com/table.csv"} }
The following example shows the dc:publisher property as an array that contains a single node object:
{ "@context": "http://www.w3.org/ns/csvw", "@type": "Table", "url": "http://example.com/table.csv", "tableSchema": [...], "dc:publisher": [{ "schema:name": "Example Municipality", "schema:url": {"@id": "http://example.org"} }] }
Following normalization, the schema:name property of the dc:publisher is expanded as shown:
"dc:publisher": [{ "schema:name": { "@value": "Example Municipality" }, "schema:url": { "@id": "http://example.org" } }]
A description object B is merged into an original description object A by merging each property of B into A. If the property from B does not exist on A, it is simply added to A. If A does have the property, the way the values are merged depends on the type of the property, as follows:
If the property is a natural language property, the result is an object whose properties are language codes and whose values are arrays containing the values from A followed by those from B that were not already a value in A. If a value exists in the array for the undefined language (und) and in the array for any other language, the value from the array for und MUST be removed, ensuring that the und property is removed entirely if its value becomes an empty array.
If the property is an atomic property, the value from A overrides that from B.
This section is non-normative.
For example, consider the following two metadata documents to be merged (located at http://example.com/metadata.json and http://example.com/doc1.csv-metadata.json):
{ "@context": ["http://www.w3.org/ns/csvw", {"@language": "en"}], "tables": [{ "url": "doc1.csv", "dc:title": "foo", "tableDirection": "ltr", "tableSchema": { "aboutUrl": "{#foo}", "columns": [{ "name": "foo", "titles": "Foo", "required": true }, { "name": "bar" }] } }, { "url": "doc2.csv" }] }
{ "@context": "http://www.w3.org/ns/csvw", "url": "http://example.com/doc1.csv", "dc:description": "bar", "tableSchema": { "propertyUrl": "{#_name}", "columns": [{ "titles": "Foo", "required": false }, { "name": "bar" }, { }] } }
The process of merging performs the following steps:
Normalize A to use the language specified in the @context within the natural language property titles and to expand the link property url against the base URL for A, http://example.com/metadata.json:
{ "tables": [{ "url": "http://example.com/doc1.csv", "dc:title": {"@value": "foo", "@language": "en"}, "tableDirection": "ltr", "tableSchema": { "aboutUrl": "{#foo}", "columns": [{ "name": "foo", "titles": { "en": [ "Foo" ] }, "required": true }, { "name": "bar" }] } }, { "url": "http://example.com/doc2.csv" }] }
Convert B from a table description to a table group description by embedding the table description in a tables property, resolve the link property url (which is already an absolute URL), and normalize the titles property to use the und language:
{ "tables": [{ "url": "http://example.com/doc1.csv", "dc:description": {"@value": "bar"}, "tableSchema": { "propertyUrl": "{#_name}", "columns": [{ "titles": { "und": [ "Foo" ] }, "required": false }, { "name": "bar" }, { }] } }] }
tables is an array property with rules specified in section 5.3 Table Groups and each value is merged accordingly:
The first table descriptions in A and B are now the following:
{ "url": "http://example.com/doc1.csv", "dc:title": {"@value": "foo", "@language": "en"}, "tableDirection": "ltr", "tableSchema": { "aboutUrl": "{#foo}", "columns": [{ "name": "foo", "titles": { "en": [ "Foo" ] }, "required": true }, { "name": "bar" }] } } { "url": "http://example.com/doc1.csv", "dc:description": {"@value": "bar"}, "tableSchema": { "propertyUrl": "{#_name}", "columns": [{ "titles": { "und": [ "Foo" ] }, "required": false }, { "name": "bar" }, { }] } }
Because these share the same url, these are merged. Each property from B is considered:
url is the same (otherwise the two table descriptions would not be being merged).
dc:description does not exist in A so it is added to A.
The tableSchema properties are merged:
B has a propertyUrl which is added to A.
Both have columns, which is merged as described in 5.5 Schemas:
The first column descriptions match on titles because they each have the value "Foo" and the language tag en matches und. Because the und value is present in the en array, that value is removed from the und array, and because the array is now empty the und property is removed. The value of required in A is retained, as is the value of name.
The second column descriptions match on name.
{ "url": "http://example.com/doc1.csv", "dc:title": {"@value": "foo", "@language": "en"}, "dc:description": {"@value": "bar"}, "tableDirection": "ltr", "tableSchema": { "aboutUrl": "{#foo}", "propertyUrl": "{#_name}", "columns": [{ "name": "foo", "titles": { "en": [ "Foo" ]}, "required": true },{ "name": "bar" }] } }
The second table description in A is retained as is.
The resulting merged metadata is now the following:
{ "tables": [{ "url": "http://example.com/doc1.csv", "dc:title": {"@value": "foo", "@language": "en"}, "dc:description": {"@value": "bar"}, "tableDirection": "ltr", "tableSchema": { "aboutUrl": "{#foo}", "propertyUrl": "{#_name}", "columns": [{ "name": "foo", "titles": { "en": [ "Foo" ]}, "required": true },{ "name": "bar" }] } }, { "url": "http://example.com/doc2.csv" }] }
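The handling of the titles property in this example can be reproduced with a small Python sketch of merging two normalized natural language values. The function name is hypothetical, and both inputs are assumed to already be objects mapping language codes to arrays of strings, as produced by normalization.

```python
def merge_natural_language(a, b):
    """Merge normalized natural language value b into a (sketch).

    a and b map language codes to arrays of strings.
    """
    merged = {lang: list(values) for lang, values in a.items()}

    # Values from A come first, followed by values from B that are
    # not already present for the same language.
    for lang, values in b.items():
        target = merged.setdefault(lang, [])
        for value in values:
            if value not in target:
                target.append(value)

    # Remove und values that also appear under another language, and
    # drop the und property entirely if its array becomes empty.
    if "und" in merged:
        tagged = {value
                  for lang, values in merged.items() if lang != "und"
                  for value in values}
        merged["und"] = [v for v in merged["und"] if v not in tagged]
        if not merged["und"]:
            del merged["und"]
    return merged
```

Applied to the first column of the example, merging {"und": ["Foo"]} into {"en": ["Foo"]} leaves {"en": ["Foo"]}, matching the merged metadata shown above.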
Applications that process tabular data may use that data to drive other actions, which may have security implications. These behaviors are outside the scope of this specification.
Third party metadata provided about a tabular data file (such as a CSV file) may rename or ignore headers, or exclude rows or columns, which may lead to data being misinterpreted by applications that process it.
Transformation definitions are a possible security risk as they enable the creators of metadata to reference arbitrary code that may be executed to convert tabular data into other formats. Implementations should run this arbitrary code in a sandboxed environment to reduce the security risk.
The Metadata Vocabulary for Tabular Data uses a format based on JSON-LD [JSON-LD] with some restrictions.
The value of any @id or @type contained within a metadata document MUST NOT be a blank node.
A metadata document MUST NOT add a new context (i.e., include a @context property except at the top level), or extend the top-level context in any way other than as specifically allowed in section 5.2 Top-Level Properties.
Common properties and notes may contain arbitrary JSON-LD with the following restrictions:
The value of any member of @type MUST be either a term defined in [csvw-context], a prefixed name where the prefix is a term defined in [csvw-context], or an absolute URL.
Values MAY be a string, native JSON type (such as number, true, or false), value object, node object or an array of zero or more of any of these.
Values MUST NOT use list objects or set objects.
Keys of node objects MUST NOT include @graph, @context, terms, or blank nodes.
When normalizing metadata, prefixed names used in common properties and notes are expanded to absolute URLs. For some serializations, these are more appropriately presented using prefixed names or terms. This algorithm compacts an absolute URL to a prefixed name or term.
If the URL starts with the URL associated with a prefix defined in [csvw-context], replace that part of the URL with the prefix followed by a : (U+003A) to create a prefixed name. If the resulting prefixed name is rdf:type, replace with @type.
This document is influenced by the Data Package specification and the JSON Table Schema, which are maintained as part of Data Protocols. Particular contributors to that work are Rufus Pollock, Paul Fitzpatrick, Andrew Berkeley, Francis Irving, Benoit Chesneau, Leigh Dodds, Martin Keegan, and Gunnlaugur Thor Briem.
This section has not yet been submitted to IANA for review, approval, and registration.
text/csv and text/tab-delimited-values mediatypes, but a JSON-based format used to annotate such documents
The JSON-LD context, located at http://www.w3.org/ns/csvw.jsonld, is used with metadata documents. When used within a metadata document, the context can be referenced as http://www.w3.org/ns/csvw.
See [csvw-context] for a full description of defined terms and prefixes. This context may be updated from time-to-time to define new terms and prefixes.
The document has undergone substantial changes since the last working draft. Below are some of the changes made:
notes and common properties defined.
The resources property was changed to tables.
foreignKeys.
The templates property was changed to transformations.
The urlTemplate property was changed from a schema property to the aboutUrl common property, and propertyUrl and valueUrl were added as common properties.
virtual columns to allow data to be inserted into a row.