Copyright © 2014 W3C® (MIT, ERCIM, Keio, Beihang), All Rights Reserved. W3C liability, trademark and document use rules apply.
Validation, conversion, display and search of tabular data on the web require additional metadata that describes how the data should be interpreted. This document defines a vocabulary for metadata that annotates tabular data. This can be used to provide metadata at various levels, from collections of data from CSV documents and how they relate to each other down to individual cells within a table.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.
The CSV on the Web Working Group was chartered to produce a Recommendation "Access methods for CSV Metadata" as well as Recommendations for "Metadata vocabulary for CSV data" and "Mapping mechanism to transforming CSV into various Formats (e.g., RDF, JSON, or XML)". This document aims to primarily satisfy the second of those Recommendations.
This document was published by the CSV on the Web Working Group as a First Public Working Draft. This document is intended to become a W3C Recommendation. If you wish to make comments regarding this document, please send them to public-csv-wg@w3.org (subscribe, archives). All comments are welcome.
Publication as a First Public Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.
Interpreting tabular data that is available on the web, particularly as CSV, usually requires additional metadata. As an example, say that the following CSV file were available at http://example.org/tree-ops.csv
GID,On Street,Species,Trim Cycle,Inventory Date
1,ADDISON AV,Celtis australis,Large Tree Routine Prune,10/18/2010
2,EMERSON ST,Liquidambar styraciflua,Large Tree Routine Prune,6/2/2010
3,EMERSON ST,Liquidambar styraciflua,Large Tree Routine Prune,6/2/2010
A human consumer of this data might be able to figure out the meaning of the different columns, particularly if additional human-readable documentation were available. Automated processors would have a much harder time; realistically they would be limited to displaying the information in a table. Machine-readable metadata helps with the interpretation of the tabular data. For example, say that the following metadata file were available at http://example.org/tree-ops.csv.csvm:
{
  "@id": "tree-ops.csv",
  "@context": { "@language": "en" },
  "title": "Tree Operations",
  "keywords": ["tree", "street", "maintenance"],
  "publisher": [{
    "name": "Example Municipality",
    "web": "http://example.org"
  }],
  "license": "http://opendefinition.org/licenses/cc-by/",
  "modified": "2010-12-31",
  "schema": {
    "columns": [{
      "@id": "_:GID",
      "name": "GID",
      "title": [ "GID", "Generic Identifier" ],
      "description": "An identifier for the operation on a tree.",
      "datatype": "string",
      "required": true,
      "unique": true
    }, {
      "name": "on-street",
      "title": "On Street",
      "description": "The street that the tree is on.",
      "datatype": "string"
    }, {
      "name": "species",
      "title": "Species",
      "description": "The species of the tree.",
      "datatype": "string"
    }, {
      "name": "trim-cycle",
      "title": "Trim Cycle",
      "description": "The operation performed on the tree.",
      "datatype": "string"
    }, {
      "name": "inventory-date",
      "title": "Inventory Date",
      "description": "The date of the operation that was performed.",
      "datatype": "date",
      "format": "M/D/YYYY"
    }],
    "primaryKey": "_:GID"
  }
}
Given the location of the CSV file, this metadata file can be located by appending .csvm to the URL (as described in Model for Tabular Data and Metadata on the Web). It provides information for different types of applications:
For example, a validator can check that the values in the GID column are all present and unique.
The Model for Tabular Data and Metadata on the Web specification defines an Annotated Tabular Data Model in which tables, columns, rows and cells can be annotated with properties and values, and a Grouped Tabular Data Model in which a group of tables is annotated. That specification also describes how to locate metadata about a given CSV file.
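As a minimal sketch of the lookup convention used in the example above, the following Python function derives the metadata location by appending .csvm to the CSV URL. The function name `metadata_url` is hypothetical, and the sketch assumes the CSV URL carries no query string or fragment; the authoritative lookup rules are those in Model for Tabular Data and Metadata on the Web.

```python
def metadata_url(csv_url: str) -> str:
    """Return the assumed metadata location for a CSV file.

    Assumption: the URL has no query string or fragment, so a plain
    suffix append is enough for this illustration.
    """
    return csv_url + ".csvm"


if __name__ == "__main__":
    print(metadata_url("http://example.org/tree-ops.csv"))
```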
This document defines the format and structure of metadata documents, and how these are interpreted to create an Annotated Tabular Data Model. It also defines how to validate tabular data based on some of these annotations. This metadata can be expressed as an RDF graph. However, all applications that conform to this specification (including validators and applications that read or convert tabular data) MUST read the JSON-based format described in this document.
We are aiming for the JSON format to be interpretable as JSON-LD, but without any requirement to include context within the JSON itself (to save people from having to do boilerplate). We invite comments on the utility of this approach: is it useful for CSV metadata to be interpretable as JSON-LD? Is it helpful to be able to map it to RDF? Would it be better to rename some of the JSON-LD keywords, such as @id and @type?
This section describes how particular types of applications should use the metadata supplied about a CSV file when they process that CSV file.
The metadata defined in this specification is used to annotate an existing annotated table as defined in [tabular-data-model]. Annotated tables form the basis for all further processing, such as validating or displaying the table. All compliant applications MUST create annotated tables based on the algorithm defined here.
Metadata documents contain descriptions of tables, columns, rows, cells and regions which are used to create annotations. There are two types of description objects:
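The description objects themselves contain a number of properties. These include direct annotations, such as the name of a column or the provenance of a table, and inherited properties, which apply to the cells within the described table, column or row.
For example, in the column description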
{ "name": "inventory-date", "title": "Inventory Date", "description": "The date of the operation that was performed.", "datatype": "date", "format": "M/D/YYYY" }
the properties name, title and description are direct annotations that become name, title and description properties on the column in the data model. The datatype and format properties are inherited properties that become datatype and format properties on the cells within the column.
Direct annotations are properties on the description object for a given table, column, row or cell which map directly to properties on the described table, column, row or cell. The name of the annotation is the same as the name of the property on the description object. The value of the annotation is the same as the value of the property on the description object.
A cell may be assigned annotations based on properties on the description objects for the table, column or row that it appears in. These properties are known as inherited properties and are listed in section 3.8 Cells. To ascertain a value for these annotations, an application MUST identify the relevant property in the descriptions of the table, column and row.
Applications MUST raise an error if the value of a property in a column or row description is not compatible with the value of that property on the table. Applications MUST raise an error if the value of a property on a row is not compatible with the values of that property on all the columns in the table. Applications MUST raise an error if the value of a property on a cell is not compatible with the values of that property on both the column and the row that the cell is associated with.
A value for a cell, column or row is compatible with a value on a row, column or table if they are the same value or if the first value is a sub-value of the second value. The definitions of individual inherited properties indicate what values count as sub-values of others.
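The compatibility rule can be sketched in Python as follows. This is an illustration only: the names `SUB_VALUE_OF`, `is_sub_value` and `is_compatible` are hypothetical, and the parent table holds just a small subset of the datatype sub-value relationships listed under the datatype property in this document.

```python
# Parent of each datatype: a small, illustrative subset of the
# sub-value relationships given for the datatype property.
SUB_VALUE_OF = {
    "string": "anySimpleType",
    "normalizedString": "string",
    "token": "normalizedString",
    "decimal": "anySimpleType",
    "integer": "decimal",
    "long": "integer",
    "int": "long",
    "date": "anySimpleType",
}


def is_sub_value(value, ancestor, parents=SUB_VALUE_OF):
    """True if `value` is a direct or indirect sub-value of `ancestor`."""
    current = parents.get(value)
    while current is not None:
        if current == ancestor:
            return True
        current = parents.get(current)
    return False


def is_compatible(value, inherited, parents=SUB_VALUE_OF):
    """Same value, or the first value is a sub-value of the second."""
    return value == inherited or is_sub_value(value, inherited, parents)
```

Under these assumptions, a column declared as int is compatible with a table declared as decimal, whereas a column declared as string is not compatible with a table declared as integer, so an application would raise an error in the second case.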
We intend to include other sections here about:
Much of this is likely to be non-normative. We invite comment on whether it's useful to provide this kind of guidance.
There are two levels of bidirectionality to consider when displaying tables: the directionality of the table (ie whether the columns should be arranged left-to-right or right-to-left) and the directionality of the content of individual cells.
The table-direction property provides information about the desired display of the table. If table-direction=ltr then the first column SHOULD be displayed on the left and the last column on the right. If table-direction=rtl then the first column SHOULD be displayed on the right and the last column on the left.
If table-direction=default then tables SHOULD be displayed with attention to the bidirectionality of the content of the file. Specifically, the values of the cells in the table should be scanned breadth first: from the first cell in the first column through to the last cell in the first column, down to the last cell in the last column. If the first character in the table with a strong type as defined in [UNICODE-BIDI] indicates a RTL directionality, the table should be displayed with the first column on the right and the last column on the left. Otherwise, the table should be displayed with the first column on the left and the last column on the right. Characters such as whitespace, quotes, commas and numbers do not have a strong type, and therefore are skipped when identifying the character that determines the directionality of the table.
Implementations SHOULD enable user preferences to override the indicated metadata about the directionality of the table.
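The scan described for table-direction=default might look like the following Python sketch. The function name `resolve_table_direction` and the column-major input shape are assumptions made for illustration; the strong bidirectional classes L, R and AL come from the standard library's unicodedata module.

```python
import unicodedata

STRONG_LTR = {"L"}
STRONG_RTL = {"R", "AL"}


def resolve_table_direction(table_direction, columns):
    """Return "ltr" or "rtl" for display.

    `columns` is assumed to be a list of columns, each a list of cell
    strings, so cells are visited column by column, top to bottom.
    """
    if table_direction in ("ltr", "rtl"):
        return table_direction
    for column in columns:
        for cell in column:
            for char in cell:
                bidi_class = unicodedata.bidirectional(char)
                if bidi_class in STRONG_RTL:
                    return "rtl"
                if bidi_class in STRONG_LTR:
                    return "ltr"
    # No strongly typed character found: fall back to left-to-right.
    return "ltr"
```

Digits, whitespace, quotes and commas have weak or neutral classes, so the loop skips them, as the text above requires. A user preference, where supported, would simply replace the returned value.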
Once the directionality of the table has been determined, each cell within the table should be considered as a separate paragraph, as defined by the UBA in [UNICODE-BIDI]. The default directionality for the cell is determined by looking at the text-direction property, which is an inherited property.
Thus, as defined by the UBA, if a cell contains no characters with a strong type (if it's a number or date for example) then the way the cell is displayed should be determined by the text-direction property of the cell. However, when the cell contains characters with a strong type (such as letters) then they MUST be displayed according to the Unicode Bidirectional Algorithm as described in [UNICODE-BIDI].
We intend to detail how to validate a CSV file against metadata. This would be normative: compliant validators would have to report the errors and warnings that we define. We invite comment on whether this is a useful thing to specify.
Conversions of tabular data to other formats operate over an annotated table constructed as defined in section 2.1 Annotating Tables. The mechanics of these conversions to other formats are defined in other specifications.
Conversion specifications MUST define a default mapping from an annotated table that lacks any annotations (ie that is equivalent to an un-annotated table).
Conversion specifications MUST use the name of a column as the basis for naming machine-readable fields in the target format, such as the name of the equivalent element or attribute in XML, property in JSON or property URI in RDF.
Conversion specifications MAY use any of the properties defined in this specification to adjust the mapping of an annotated table into another format.
Conversion specifications MAY define additional properties, not defined in this specification, which are specifically used when converting to the target format of the conversion. For example, a conversion to XML might specify an element-or-attribute property on columns that determines whether a particular column is represented through an element or an attribute in the data.
Conversion specifications SHOULD specify format-specific properties specifying external processing steps to provide more control to people defining conversions. If these are specified, the conversion specification MUST specify at what point in the processing this external processing takes place, and what it takes place on. Examples might be:
This section defines a set of properties and permitted values for annotating tabular data, and how these annotations should be interpreted by applications.
We intend to support metadata for packages. In this version of this specification, we are scoping to single metadata files defining single CSV files.
A metadata document is a JSON document which holds an object at the top level. This object is a description object of a table. A description object is a JSON object that describes a component of a table (a table, a column, a row or a cell) and has one or more properties that are mapped into properties on that component. There are different types of properties on description objects:
Link properties hold one or more references to other resources by URL. Their values may be:
For example, the hasVersion property is a link property. A table description might contain:
"hasVersion": "example-2014-01-03.csv"
in which case the hasVersion property on the table would have a single value, a link to example-2014-01-03.csv, or it might contain:
"hasVersion": [ "example-2014-01-03.csv", "example-2014-01-17.csv", "example-2014-01-25.csv" ]
in which case the hasVersion property on the table would have three values, links to other versions of the table.
Internal reference properties hold one or more references to other description objects. The referenced description object must have an @id property whose value looks like _:name. Internal reference properties can then reference other description objects through values that are:
_:name, which MUST match the @id on another description object within the metadata document
For example, the primaryKey property is an internal reference property on the schema. It has to hold references to columns defined elsewhere in the schema, and the descriptions of those columns must have @id properties. It can hold a single reference, like this:
"schema": { "columns": [{ "@id": "_:GID", "name": "GID" }, ... ], "primaryKey": "_:GID" }
or it can contain an array of references, like this:
"schema": { "columns": [{ "@id": "_:givenName", "name": "givenName" }, { "@id": "_:familyName", "name": "familyName" }, ... ], "primaryKey": [ "_:givenName", "_:familyName" ] }
Object properties hold one or more objects or references to objects by URL. Their values may be:
Object properties are often used when the values can be or should be values within controlled vocabularies, or structured information which may be held elsewhere. For example, the creator of a table is an object property. It could be provided as a URL that indicates the creator, like this:
"creator": "http://ons.gov.uk"
or a structured object, like this:
"creator": { "name": "Office of National Statistics", "url": "http://ons.gov.uk", "email": "info@ons.gsi.gov.uk" }
Natural language properties hold natural language strings. Their values may be:
Natural language properties are used for things like descriptions and titles. For example, the title property provides a natural language label for a column. If it's a plain string like this:
"title": "Project title"
then that string is assumed to be in the language provided through the @language property of the nearest @context (or have no assumed language, if there is no such property). Multiple alternative values can be given in an array:
"title": [ "Project title", "Project" ]
It's also possible to provide multiple values in different languages, using an object structure. For example:
"title": { "en": "Project title", "fr": "Titre du projet" }
We invite comment on whether it would be useful to enable some markup in natural language strings, for example by stating that they are interpreted as HTML or Markdown.
Atomic properties hold atomic values. Their values may be:
booleans (true or false)
JSON does not have date or time types. Where a property takes a date as a value, this MUST be a string in the format YYYY-MM-DD.
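A validator of metadata files could check date-valued properties along the lines of the Python sketch below. The function name `parse_metadata_date` is hypothetical; the sketch assumes that YYYY-MM-DD means exactly four, two and two digits and that the result must be a real calendar date.

```python
import re
from datetime import date

_DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")


def parse_metadata_date(value):
    """Parse a YYYY-MM-DD string, raising ValueError if it is malformed."""
    if not isinstance(value, str) or not _DATE_PATTERN.fullmatch(value):
        raise ValueError(f"not a YYYY-MM-DD string: {value!r}")
    year, month, day = (int(part) for part in value.split("-"))
    # date() itself rejects impossible dates such as 2010-02-30.
    return date(year, month, day)
```

Note that this applies to dates inside the metadata document, such as the modified property. Dates inside the CSV data are parsed according to the format property of the cell instead.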
The top-level object MAY have a @context property. This holds an object that provides metadata for interpreting other properties, namely:
@language indicates the default language for the values of properties in the description; if present, its value MUST be a language code [RFC3066] which is the default language for the values of other properties in the metadata document
Note that the @language property of the @context object, which gives the default language used within the metadata file, is distinct from the language property on a description object, which gives the language used in the data within the table.
@base indicates the base URL against which other URLs within the description are resolved; if present, its value MUST be a URL which is resolved against the base URL of the metadata document (the location from which it was retrieved) to provide the base URL for other URLs in the metadata document
Note that the @base property of the @context object provides the base URL used for URLs within the metadata document, not the URLs that appear within the table.
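The two-step resolution described for @base can be sketched with Python's standard urljoin. The function names `effective_base` and `resolve_url` and the example locations in the usage note are assumptions for illustration.

```python
from urllib.parse import urljoin


def effective_base(metadata_location, metadata):
    """Base URL for URLs in the metadata document.

    @base, if present, is itself resolved against the location from
    which the metadata document was retrieved.
    """
    context = metadata.get("@context") or {}
    base = context.get("@base")
    if base is None:
        return metadata_location
    return urljoin(metadata_location, base)


def resolve_url(metadata_location, metadata, url):
    """Resolve a URL from the metadata document, such as the table @id."""
    return urljoin(effective_base(metadata_location, metadata), url)
```

For a metadata document retrieved from http://example.org/meta/tree-ops.csv.csvm with "@base": "../data/", the @id value tree-ops.csv would resolve to http://example.org/data/tree-ops.csv under this sketch.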
The properties listed here may be applied to any structure within the tabular data model: tables, columns, rows or cells.
We invite comment on whether there are other standard metadata vocabularies that should be reused within this specification.
Descriptions MAY contain any properties defined by [DC-TERMS] to describe the table. This specification does not define any application behaviour associated with these properties being present, except that validation of metadata files MUST check that, if they are present, they adhere to the syntax defined here.
Property | Type | Details |
---|---|---|
abstract | natural language property | |
accessRights | object property | |
accrualMethod | object property | |
accrualPeriodicity | object property | |
accrualPolicy | object property | |
alternative | natural language property | |
audience | object property | |
available | atomic property | dates in the format YYYY-MM-DD |
bibliographicCitation | natural language property | |
conformsTo | object property | |
contributor | object property | |
coverage | object property | |
created | atomic property | dates in the format YYYY-MM-DD |
creator | object property | |
date | atomic property | dates in the format YYYY-MM-DD |
dateAccepted | atomic property | dates in the format YYYY-MM-DD |
dateCopyrighted | atomic property | dates in the format YYYY-MM-DD |
dateSubmitted | atomic property | dates in the format YYYY-MM-DD |
description | natural language property | |
educationLevel | object property | |
extent | object property | |
format | object property | |
hasFormat | object property | |
hasPart | link property | |
hasVersion | link property | |
identifier | atomic property | a URL |
instructionalMethod | object property | |
isFormatOf | link property | |
isPartOf | link property | |
isReferencedBy | link property | |
isReplacedBy | link property | |
isRequiredBy | link property | |
issued | atomic property | dates in the format YYYY-MM-DD |
isVersionOf | link property | |
language | atomic property | a language code as defined by [RFC3066]; this is an inherited property |
license | object property | |
mediator | object property | |
medium | object property | |
modified | atomic property | dates in the format YYYY-MM-DD |
provenance | object property | |
publisher | object property | |
references | link property | |
relation | link property | |
replaces | link property | |
requires | link property | |
rights | object property | |
rightsHolder | object property | |
source | link property | |
spatial | object property | |
subject | object property | |
tableOfContents | natural language property | |
temporal | object property | |
title | natural language property | |
type | object property | |
valid | atomic property | dates in the format YYYY-MM-DD |
Descriptions MAY include properties for registered link relations, prefixed by link:. This specification does not define any application behaviour associated with these properties being present, except that validation of metadata files MUST check that, if they are present, they have values that are URLs or arrays of URLs. The following properties are particularly relevant to tabular data:
link:alternate
link:canonical
link:collection
link:duplicate
link:glossary
link:help
link:icon
link:last
link:latest-version
link:next
link:original
link:predecessor-version
link:prev or link:previous
link:preview
link:profile
link:related
link:search
link:self
link:start
link:successor-version
link:terms-of-service
link:up
link:version-history
link:working-copy
link:working-copy-of
Unlike the Dublin Core terms, link relations are an ever-expanding list and there may eventually be clashes between link relation terms and those defined above. That's why the above list uses QNames for all link relations, so that they look like link:relation rather than plain relation.
text-direction
One of "rtl" or "ltr" (the default). Indicates whether the text within cells should be displayed by default as left-to-right or right-to-left text. See section 2.2.1 Bidirectional Tables for more details.
A table description is a JSON object that describes a table within a CSV file.
A CSV file might not be the same as the table that it contains. For example, a given CSV file might contain two tables (in different regions of the CSV file), or might contain a table that isn't positioned at the top left of the CSV file. We invite comment about whether we should assume that pre-processing is used to extract tables where there isn't a 1:1 correspondence between CSV file and table, or not.
@id
This gives the URL of the CSV file that the table is held in, relative to the location of the metadata document.
The description of a table MAY also contain:
@type
If included, @type MUST be set to "Table". Publishers MAY include this to provide additional information to JSON-LD based toolchains.
table-direction
One of "rtl", "ltr" or "default". Indicates whether the table should be displayed with the first column on the right, on the left, or based on the first character in the table that has a specific direction. See section 2.2.1 Bidirectional Tables for more details.
This should be a defined controlled vocabulary in JSON-LD, so that the values map on to URIs in the RDF version rather than strings. We invite comment on how to configure the JSON-LD context to enable these values to be interpreted in this way.
schema
notes
@id property that references the relevant column, row, cell or region of the table using a fragment identifier. It MAY have any other common properties as described in section 3.3 Common Properties.
We intend to add a small subset of properties that indicate how a CSV file should be parsed, specifically those that mirror the existing distinction between the media types for text/csv and text/tab-separated-values, and the media type parameters that they allow, namely:
separator to give the character used as the separator in the tabular data file
encoding to specify the encoding used in the file
header to specify whether or not a header line is present
We invite comment about whether these are the right properties to specify.
We invite comment on whether we should include properties that help in checking the integrity of the file: datapackage includes bytes and hash. We could reuse the Subresource Integrity work here.
The description MAY contain any of the properties defined in section 3.3 Common Properties to describe the table. As well as links to other related tables, the following common properties are particularly suitable for tables:
created
creator
description
language
license
modified
provenance
publisher
rights
rightsHolder
source
spatial
subject
temporal
title
A schema is a definition of a tabular format that may be common to multiple tables. For example, multiple tables from different sources may have the same columns and be designed such that they can be aggregated together.
A schema description is a JSON object that encodes the information about a schema. All the properties of a schema description are optional.
@type
If included, @type MUST be set to "Schema". Publishers MAY include this to provide additional information to JSON-LD based toolchains.
columns
An array of column descriptions as described in section 3.6 Columns. These are matched to columns in tables that use the schema by position: the first column description in the array applies to the first column in the table, the second to the second and so on.
The name properties of the column descriptions MUST be unique within a given table description.
rows
An array of row descriptions as described in section 3.7 Rows. These are matched to rows by the value of the row property in the row description. The values of the row properties MUST be unique within a given table description (i.e. no row can have more than one description).
cells
An array of cell descriptions as described in section 3.8 Cells. These are matched to cells by the values of the row and column properties in the cell description. The combination of values of the row and column properties MUST be unique within a given table description (i.e. no cell can have more than one description).
primaryKey
An internal reference property that holds either a single reference to a column description object or an array of references.
Validators MUST check that each row has a unique combination of cells in the indicated columns. For example, if primaryKey is set to ["_:familyName", "_:givenName"] then every row must have a unique value for the combination of the familyName and givenName columns.
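A validator might implement the primary key check in two steps: resolve the internal references to column names, then look for repeated key combinations. The Python sketch below is illustrative only; the function names and the representation of rows as dictionaries keyed by column name are assumptions, not part of this specification.

```python
def primary_key_names(schema):
    """Resolve primaryKey references (such as "_:GID") to column names."""
    by_id = {
        column["@id"]: column.get("name")
        for column in schema.get("columns", [])
        if "@id" in column
    }
    refs = schema.get("primaryKey", [])
    if isinstance(refs, str):  # a single reference rather than an array
        refs = [refs]
    missing = [ref for ref in refs if ref not in by_id]
    if missing:
        raise ValueError(f"primaryKey references unknown @id values: {missing}")
    return [by_id[ref] for ref in refs]


def duplicate_key_rows(rows, key_names):
    """Return (first_row, repeated_row) pairs, numbering rows from 1."""
    seen = {}
    duplicates = []
    for number, row in enumerate(rows, start=1):
        key = tuple(row.get(name) for name in key_names)
        if key in seen:
            duplicates.append((seen[key], number))
        else:
            seen[key] = number
    return duplicates
```

An empty result from `duplicate_key_rows` means every row has a unique combination of values in the key columns.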
When referencing columns for a primary key, it is a lot clearer to reference them by name rather than by number. For JSON-LD compatibility, we have to assign a blank node identifier to each column even though they each have a name property that could be used instead. We invite comment on how to make this easier for people to use while maintaining JSON-LD compatibility.
The description MAY contain any of the properties defined in section 3.3 Common Properties to describe the schema. As well as links to other related schemas, the following common properties are particularly suitable for schemas:
created
creator
description
license
modified
publisher
rights
rightsHolder
subject
title
The description MAY contain any of the inherited properties defined for cells in section 2.1.2 Inherited Properties.
A column description is a simple JSON object that describes a single column. The description provides additional human-readable documentation for a column, as well as additional information that may be used to validate the cells within the column, create a user interface for data entry, or inform conversion into other formats.
name
An atomic property that gives a canonical name for the column. This MUST be a string. Conversion specifications MUST use this property as the basis for the names of properties/elements/attributes in the results of conversions.
We invite comment on what the syntactic limitations should be on column names to make them most useful when used as the basis of conversion into other formats, bearing in mind that different target languages such as JSON, RDF and XML have different syntactic limitations and common naming conventions.
During validation, if there is no title property and the column already has a title annotation then a validator MUST issue a warning if the existing title annotation does not match the name specified in the column description.
title
A natural language property that provides possible alternative names for the column. The possible column titles are defined as:
if title is a string, that string
if title is an array, the strings in that array
if title is an object, the string or strings that are the value of the property of that object whose name is the column language
where the column language is the value of the language property on the column description, or (if there is no such property), the value of the language property on the table description.
If the column already has a title annotation (because a header row has been included in the original CSV file) then a validator MUST issue a warning if the existing title annotation is not the same as any of the possible column titles.
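The rules for possible column titles and the associated warnings could be sketched in Python as follows. The function names are hypothetical, and the sketch assumes that "the same as" and "match" both mean exact string equality.

```python
def possible_titles(column, table_language=None):
    """List the possible titles for a column description."""
    title = column.get("title")
    if title is None:
        return []
    if isinstance(title, str):
        return [title]
    if isinstance(title, list):
        return list(title)
    if isinstance(title, dict):
        # Column language: the column's own language, else the table's.
        language = column.get("language", table_language)
        value = title.get(language, [])
        return [value] if isinstance(value, str) else list(value)
    raise TypeError("title must be a string, an array or an object")


def title_warning(column, existing_title, table_language=None):
    """Return a warning message, or None if the header is acceptable."""
    if "title" in column:
        expected = possible_titles(column, table_language)
    else:
        # With no title property, the header is compared to the name.
        expected = [column["name"]] if "name" in column else []
    if expected and existing_title not in expected:
        return f"header {existing_title!r} is not one of {expected!r}"
    return None
```

With the GID column from the example metadata, a header of either "GID" or "Generic Identifier" produces no warning under this sketch.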
The facility to specify multiple potential titles for a column is important when the same column description is used for multiple CSVs, through a mechanism yet to be defined by this specification.
@type
If included, @type MUST be set to "Column". Publishers MAY include this to provide additional information to JSON-LD based toolchains.
required
The description MAY contain any of the inherited properties defined for cells in section 2.1.2 Inherited Properties.
Rows can be described using row description objects. A row description object is a JSON object within a metadata file that includes properties that describe an individual row.
The following properties MUST appear on a row description:
row
@type
If included, @type MUST be set to "Row". Publishers MAY include this to provide additional information to JSON-LD based toolchains.
The description MAY contain any of the inherited properties defined for cells in section 2.1.2 Inherited Properties.
Cells can be described using cell description objects. A cell description object is a JSON object within a metadata file that includes properties that describe an individual cell.
The following properties MUST appear on a cell description:
row
column
@type
If included, @type MUST be set to "Cell". Publishers MAY include this to provide additional information to JSON-LD based toolchains.
The description MAY contain any of the inherited properties defined for cells in section 2.1.2 Inherited Properties.
Cell descriptions may override inherited properties, as described in section 2.1 Annotating Tables. It is good practice to define these properties on columns, so that all cells within a given column are handled in the same way. These properties are:
null
The string used for null values. If not specified, the default for this is the empty string.
separator
The character used to separate items in the string value of the cell. If null, the cell does not contain a list. Otherwise, applications MUST split the string value of the cell on the specified separator character and parse each of the resulting strings separately. The cell's value will then be a list. Conversion specifications MUST use the separator to determine the conversion of a cell into the target format. See 3.8.5 Parsing cells for more details.
format
A definition of the format of the cell, used when parsing the cell as described in 3.8.5 Parsing cells.
datatype
The main datatype of the values of the cell. If the cell contains a list (i.e. separator is not null) then this is the datatype of each value within the list. Conversion specifications MUST use the datatype of the value to determine the conversion of a cell into the target format. See 3.8.4 Datatypes for more details.
length
The exact length of the value of the cell. See section 3.8.4.1 Length Constraints for details.
minLength
The minimum length of the value of the cell. See section 3.8.4.1 Length Constraints for details.
maxLength
The maximum length of the value of the cell. See section 3.8.4.1 Length Constraints for details.
minimum
The minimum value for the cell (inclusive); equivalent to minInclusive. See section 3.8.4.2 Value Constraints for details.
maximum
The maximum value for the cell (inclusive); equivalent to maxInclusive. See section 3.8.4.2 Value Constraints for details.
minInclusive
The minimum value for the cell (inclusive). See section 3.8.4.2 Value Constraints for details.
maxInclusive
The maximum value for the cell (inclusive). See section 3.8.4.2 Value Constraints for details.
minExclusive
The minimum value for the cell (exclusive). See section 3.8.4.2 Value Constraints for details.
maxExclusive
The maximum value for the cell (exclusive). See section 3.8.4.2 Value Constraints for details.
Cells within tables may be annotated with a datatype, which indicates the type of the value obtained by parsing the value of the cell. The format expected in the cell is determined by the format annotation, if there is one; otherwise a default format determined by the type is used.
The possible datatypes are:
the datatypes defined in [xmlschema-2] with the exception of those that rely on XML mechanisms for definition, namely:
anySimpleType
string; a sub-value of anySimpleType
normalizedString; a sub-value of string
token; a sub-value of normalizedString
language; a sub-value of token
Name; a sub-value of token
NCName; a sub-value of Name
boolean; a sub-value of anySimpleType
decimal; a sub-value of anySimpleType
integer; a sub-value of decimal
nonPositiveInteger; a sub-value of integer
negativeInteger; a sub-value of nonPositiveInteger
long; a sub-value of integer
int; a sub-value of long
short; a sub-value of int
byte; a sub-value of short
nonNegativeInteger; a sub-value of integer
unsignedLong; a sub-value of nonNegativeInteger
unsignedInt; a sub-value of unsignedLong
unsignedShort; a sub-value of unsignedInt
unsignedByte; a sub-value of unsignedShort
positiveInteger; a sub-value of nonNegativeInteger
float; a sub-value of anySimpleType
double; a sub-value of anySimpleType
duration; a sub-value of anySimpleType
dateTime; a sub-value of anySimpleType
time; a sub-value of anySimpleType
date; a sub-value of anySimpleType
gYearMonth; a sub-value of anySimpleType
gYear; a sub-value of anySimpleType
gMonthDay; a sub-value of anySimpleType
gDay; a sub-value of anySimpleType
gMonth; a sub-value of anySimpleType
hexBinary; a sub-value of anySimpleType
base64Binary; a sub-value of anySimpleType
anyURI; a sub-value of anySimpleType
number, which is exactly equivalent to double
binary, which is exactly equivalent to base64Binary
datetime, which is exactly equivalent to dateTime
the datatype geopoint, which indicates a comma-separated longitude and latitude (i.e. values that after stripping leading and trailing whitespace are in the format longitude\s*,\s*latitude); a sub-value of anySimpleType
In JSON Table Schema, geopoint permits values in JSON representations of points, namely { lon: longitude, lat: latitude } and [longitude, latitude]. We invite comment about whether these types are suitable for CSV files. If they are, we suggest that these additional formats for geopoint are supported through the format property.
any, which is exactly equivalent to anySimpleType
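The alias and sub-value relationships in the list above can be sketched as a small lookup. This is only an illustration: the names PARENT, ALIASES, canonical and is_subtype are hypothetical, and the PARENT table copies just part of the list.

```python
# Parent of each datatype, copied from part of the list above (illustrative subset).
PARENT = {
    "string": "anySimpleType",
    "normalizedString": "string",
    "token": "normalizedString",
    "language": "token",
    "decimal": "anySimpleType",
    "integer": "decimal",
    "long": "integer",
    "int": "long",
    "short": "int",
    "byte": "short",
    "nonNegativeInteger": "integer",
    "positiveInteger": "nonNegativeInteger",
    "double": "anySimpleType",
    "dateTime": "anySimpleType",
    "base64Binary": "anySimpleType",
    "geopoint": "anySimpleType",
}

# Names the list describes as "exactly equivalent" to another datatype.
ALIASES = {
    "number": "double",
    "binary": "base64Binary",
    "datetime": "dateTime",
    "any": "anySimpleType",
}


def canonical(name):
    """Replace an alias with the datatype it is equivalent to."""
    return ALIASES.get(name, name)


def is_subtype(name, ancestor):
    """Return True if name is ancestor or descends from it in PARENT."""
    current = canonical(name)
    target = canonical(ancestor)
    while current is not None:
        if current == target:
            return True
        current = PARENT.get(current)
    return False
```

With this table, is_subtype("byte", "decimal") holds because byte descends from decimal through short, int, long and integer.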
The JSON Table Schema also includes object, array and geojson. We invite comment on whether we should support the inclusion of JSON-based structures within CSV files.
We invite comment on whether the any type is useful.
We invite comment on whether there should be types for formats like XML, HTML and markdown which may appear within CSV cells.
The length, minLength and maxLength properties indicate the exact, minimum and maximum lengths of the values of cells.
Applications MUST raise an error if both length and minLength are specified and they do not have the same value. Similarly, applications MUST raise an error if both length and maxLength are specified and they do not have the same value. Applications MUST raise an error if length, maxLength or minLength are specified and the cell value is not a list (i.e. separator is not specified), a string or one of its subtypes, or a binary value.
The length of a value of a cell is determined as follows:
if the value is null, its length is zero
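The length rules above can be sketched as follows. This is a minimal illustration rather than a normative algorithm: the function names are hypothetical, and the only length rule taken from the text is that null has length zero. Using len() for lists, strings and binary values is an assumption.

```python
def value_length(value):
    """Length of a cell value; None stands for a null cell."""
    if value is None:
        return 0  # stated in the text: null has length zero
    return len(value)  # assumption: items, characters or bytes


def check_length_properties(props):
    """Raise ValueError if length disagrees with minLength or maxLength."""
    length = props.get("length")
    for other in ("minLength", "maxLength"):
        if length is not None and other in props and props[other] != length:
            raise ValueError(f"length and {other} must have the same value")


def length_is_valid(value, props):
    """Check a cell value against length, minLength and maxLength."""
    if not (value is None or isinstance(value, (list, str, bytes))):
        raise ValueError("length constraints need a list, string or binary value")
    check_length_properties(props)
    n = value_length(value)
    if "length" in props and n != props["length"]:
        return False
    if "minLength" in props and n < props["minLength"]:
        return False
    if "maxLength" in props and n > props["maxLength"]:
        return False
    return True
```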
The minimum, maximum, minInclusive, maxInclusive, minExclusive and maxExclusive properties indicate limits on the values of cells. These apply to numeric and date/time types. The minimum property is equivalent to the minInclusive property and the maximum property is equivalent to the maxInclusive property.
Validation against these properties is as defined in [xmlschema-2].
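As an informal illustration of these limits, the sketch below checks an already parsed value against the six properties, treating minimum and maximum as aliases of minInclusive and maxInclusive. The function name within_limits is hypothetical, and the sketch assumes the value and the limits are comparable Python objects such as numbers or dates.

```python
def within_limits(value, props):
    """Check a parsed numeric or date/time value against the limit properties."""
    # minimum and maximum are equivalent to minInclusive and maxInclusive.
    min_inclusive = props.get("minInclusive", props.get("minimum"))
    max_inclusive = props.get("maxInclusive", props.get("maximum"))
    min_exclusive = props.get("minExclusive")
    max_exclusive = props.get("maxExclusive")

    if min_inclusive is not None and value < min_inclusive:
        return False
    if max_inclusive is not None and value > max_inclusive:
        return False
    if min_exclusive is not None and value <= min_exclusive:
        return False
    if max_exclusive is not None and value >= max_exclusive:
        return False
    return True
```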
Unlike many other data formats, tabular data is designed to be read by humans. For that reason, it's common for data to be represented within tabular data in a human-readable way. The separator and format properties indicate the format used to represent data within the table. This is used:
The process of parsing the string value of a cell into a single value or a list of values is as follows:
unless the datatype is string or anySimpleType or any, strip leading and trailing whitespace from the value
if the value is the same as the null value, then the value is null
if the separator property is not null, create a list of values by splitting the string at the character specified by the separator property
validate the value(s) against the format, if one is specified, as described below; raise an error if any of the values do not match the specified format
parse the value(s) based on the format, as described below
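One reading of the parsing steps above can be sketched as follows. It assumes that whitespace is left intact for string, anySimpleType and any, and stripped for every other datatype. The function parse_cell and its matches_format and convert hooks are hypothetical stand-ins for the datatype-specific format handling described in the following paragraphs.

```python
KEEP_WHITESPACE = {"string", "anySimpleType", "any"}


def parse_cell(raw, datatype="string", null="", separator=None,
               matches_format=None, convert=None):
    """Parse the string value of a cell into a value, a list, or None."""
    # Strip whitespace, except for string-like datatypes (assumed reading).
    if datatype not in KEEP_WHITESPACE:
        raw = raw.strip()
    # A cell equal to the null string has a null value.
    if raw == null:
        return None
    # Split into a list when a separator is given.
    values = raw.split(separator) if separator is not None else [raw]
    # Validate each value against the format, if one is supplied.
    if matches_format is not None:
        for item in values:
            if not matches_format(item):
                raise ValueError(f"value {item!r} does not match the format")
    # Convert each value according to its datatype.
    if convert is not None:
        values = [convert(item) for item in values]
    return values if separator is not None else values[0]
```

For example, parse_cell(" 1;2;3 ", datatype="integer", separator=";", convert=int) yields a list of three integers, while an empty cell yields None under the default null string.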
If the datatype is a string type, the format property provides a regular expression for the string values, in the syntax defined by [ECMASCRIPT].
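A minimal sketch of checking a string value against such a format is shown below. It makes two assumptions that the text does not settle: that the whole value must match the expression, and that the pattern stays within the subset of syntax shared by ECMAScript and Python's re module. The function name matches_string_format is hypothetical.

```python
import re


def matches_string_format(value, pattern):
    """Return True if the whole string value matches the format pattern."""
    # Assumption: the pattern must match the entire value, not a substring.
    return re.fullmatch(pattern, value) is not None
```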
We invite comment about which reference to use for regular expression syntax. Other possibilities are to use that defined by XML Schema or XPath.
It is not uncommon for numbers within tabular data to be formatted for human consumption, which may involve using commas for decimal points, grouping digits in the number using commas, or adding currency symbols or percent signs to the number.
If the datatype is a numeric type, the format property indicates the expected format for that number. Validators MUST check that the numbers in the column adhere to the specified format. Converters MUST use the format property to parse the number when mapping it into a suitable type in the target language of the conversion.
When the datatype is a numeric type, the format property's value MUST be a number format as specified in [xslt-21].
We invite comment on the best format to specify how to parse numbers.
Boolean values may be represented in many ways aside from the standard 1 and 0 or true and false.
If the datatype is boolean, the format property provides the true and false values expected, separated by |. For example if format is Y|N then cells must hold either Y or N with Y meaning true and N meaning false.
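The boolean format described above can be sketched as follows. The function name parse_boolean is hypothetical, and falling back to 1, 0, true and false when no format is given is an assumption based on the "standard" representations mentioned in the text.

```python
def parse_boolean(value, format=None):
    """Parse a cell string as a boolean using a 'true|false' format such as Y|N."""
    if format is None:
        # Assumption: without a format, accept the standard representations.
        true_values, false_values = {"1", "true"}, {"0", "false"}
    else:
        true_text, bar, false_text = format.partition("|")
        if not bar:
            raise ValueError("boolean format must contain '|'")
        true_values, false_values = {true_text}, {false_text}
    if value in true_values:
        return True
    if value in false_values:
        return False
    raise ValueError(f"{value!r} is not a valid boolean for this format")
```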
Dates and times are commonly represented in tabular data in formats other than those defined in [xmlschema-2].
If the datatype is a date or time type, the format property indicates the expected format for that date or time. Validators MUST check that the dates or times in the column adhere to the specified format. Converters MUST use the format property to parse the date or time when mapping it into a suitable type in the target language of the conversion.
When the datatype is a date or time type, the format property's value MUST be a date/time format as specified in [xslt-21].
We invite comment on which format to use when parsing dates and times.
We invite comment on whether there are standard formats to use when parsing durations.
A set of constraints can be associated with a cell. These constraints can be used to validate data against a JSON Table Schema. The constraints might be used by consumers to validate, for example, the contents of a data package, or as a means to validate data being collected or updated via a data entry interface.
A constraints descriptor is a JSON hash. It MAY contain any of the following keys.
minLength – An integer that specifies the minimum number of characters for a value
maxLength – An integer that specifies the maximum number of characters for a value
unique – A boolean. If true, then all values for that cell MUST be unique within the data file in which it is found. This defines a unique key for a row although a row could potentially have several such keys.
pattern – A regular expression that can be used to test cell values. If the regular expression matches then the value is valid. Values will be treated as a string of characters. It is recommended that values of this cell conform to the standard XML Schema regular expression syntax. See also this reference.
minimum – specifies a minimum value for a cell. This is different to minLength which checks number of characters. A minimum value constraint checks whether a cell value is greater than or equal to the specified value. The range checking depends on the type of the cell. E.g. an integer cell may have a minimum value of 100; a date cell might have a minimum date. If a minimum value constraint is specified then the cell descriptor MUST contain a type key
maximum – as above, but specifies a maximum value for a cell.
A constraints descriptor may contain multiple constraints, in which case a consumer MUST apply all the constraints when determining if a cell value is valid.
A data file, e.g. an entry in a data package, is considered to be valid if all of its cells are valid according to their declared type and constraints.
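A consumer applying a constraints descriptor might look like the sketch below. The function names cell_is_valid and column_is_valid are hypothetical, and the convert argument is a stand-in for the declared type of the cell. Python's re module only approximates XML Schema regular expression syntax, and requiring a whole-value match is an assumption.

```python
import re


def cell_is_valid(value, constraints, convert=str):
    """Apply every per-value constraint in the descriptor to one cell string."""
    if "minLength" in constraints and len(value) < constraints["minLength"]:
        return False
    if "maxLength" in constraints and len(value) > constraints["maxLength"]:
        return False
    if "pattern" in constraints and re.fullmatch(constraints["pattern"], value) is None:
        return False
    # Range checks depend on the cell's type, represented here by convert.
    if "minimum" in constraints and convert(value) < constraints["minimum"]:
        return False
    if "maximum" in constraints and convert(value) > constraints["maximum"]:
        return False
    return True


def column_is_valid(values, constraints, convert=str):
    """Check all cell strings of one column, including the unique constraint."""
    if constraints.get("unique") and len(set(values)) != len(values):
        return False
    return all(cell_is_valid(v, constraints, convert) for v in values)
```

Here every constraint present in the descriptor is applied, so a value fails as soon as any one of them is not satisfied.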
This document is largely a copy of content from the Data Package specification and the JSON Table Schema, which are maintained as part of Data Protocols. Particular contributors to that work are Rufus Pollock, Paul Fitzpatrick, Andrew Berkeley, Francis Irving, Benoit Chesneau, Leigh Dodds, Martin Keegan, and Gunnlaugur Thor Briem.
application/csvm+json
We intend to include a registration for a new media type, namely application/csvm+json. We invite comment about how to indicate that this is consistent with application/ld+json, or whether we should just use application/json or application/ld+json and not create a specific media type for the metadata files defined in this document.
The following JSON document is the JSON-LD context document that can be used to interpret metadata documents as RDF.
See csvm-context.json.