Abstract

CSV is one of the most popular formats for publishing data on the web. It is concise, easy to understand by both humans and computers, and aligns nicely to the tabular nature of most data.

But CSV is also a poor format for data. There is no mechanism within CSV to indicate the type of data in a particular column, or whether values in a particular column must be unique. It is therefore hard to validate and prone to errors such as missing values or differing data types within a column.

The CSV on the Web Working Group has developed standard ways to express useful metadata about CSV files and other kinds of tabular data. This primer takes you through the ways in which these standards work together, covering:

What we mean by "tabular data" and "CSV"
Where files that provide metadata about CSV live
How to create a schema to validate the content of a CSV file
How to specify how a CSV file should be converted to RDF or JSON
How to provide other documentation and metadata about a CSV file

Where possible, this primer links back to the normative definitions of terms and properties in the standards. Nothing in this primer overrides those normative definitions.

	column 1	column 2	column 3
row 1	cell in column 1 and row 1	cell in column 2 and row 1	cell in column 3 and row 1
row 2	cell in column 1 and row 2	cell in column 2 and row 2	cell in column 3 and row 2
row 3	cell in column 1 and row 3	cell in column 2 and row 3	cell in column 3 and row 3

	column 1	column 2	column 3	column 4	column 5	column 6	column 7
titles	country	country group	name (en)	name (fr)	name (de)	latitude	longitude
row 1	AT	eu	Austria	Autriche	Österreich	47.6965545	13.34598005
row 2	BE	eu	Belgium	Belgique	Belgien	50.501045	4.47667405
row 3	BG	eu	Bulgaria	Bulgarie	Bulgarien	42.72567375	25.4823218

Documenting CSVs

Providing metadata about CSVs can be useful simply in providing extra information to anyone who wants to work with them.

How can you provide documentation about a CSV file?

Here's an example that includes some extra descriptive documentation about a number of CSV files:

Example 8

{
  "@context": "http://www.w3.org/ns/csvw",
  "dc:title": "Unemployment in Europe (monthly)"
  "dc:description": "Harmonized unemployment data for European countries."
  "dc:creator": "Eurostat",
  "tables": [{
    "url": "countries.csv",
    "dc:title": "Countries"
  }, {
    "url": "country-groups.csv",
    "dc:title": "Country groups"
  }, {
    "url": "unemployment.csv",
    "dc:title": "Unemployment (monthly)",
    "dc:description": "The total number of people unemployed"
  }]
}

This example uses Dublin Core as a vocabulary for providing metadata. You can tell that's the vocabulary that's being used because the terms like dc:title and dc:description begin with the prefix dc, which stands for Dublin Core.

There are several different metadata vocabularies in common use around the web. Some people use Dublin Core. Some people use schema.org. Some people use DCAT. All of these vocabularies can be used independently or together. A publisher could alternatively use:

Example 9

{
  "@context": "http://www.w3.org/ns/csvw",
  "schema:name": "Unemployment in Europe (monthly)",
  "schema:description": "Harmonized unemployment data for European countries."
  "schema:creator": { "schema:name": "Eurostat" },
  "tables": [{
    "url": "countries.csv",
    "schema:name": "Countries"
  }, {
    "url": "country-groups.csv",
    "schema:name": "Country groups"
  }, {
    "url": "unemployment.csv",
    "schema:name": "Unemployment (monthly)",
    "schema:description": "The total number of people unemployed"
  }]
}

Note

It's not clear at the moment which metadata vocabulary will give publishers the most benefits. Search engines are likely to recognise schema.org. RDF-based systems are more likely to recognise Dublin Core.

More generally, you can use prefixed properties like these on any of the objects in a metadata document. The prefixes that are recognised are those used in the RDFa 1.1 Initial Context. Other properties must be named with full URLs.

See also:

Common Properties in [tabular-metadata]

How can you provide documentation about the columns in a CSV file?

You can use metadata properties like the ones used for tables for individual columns as well. For example:

Example 10

{
  "@context": "http://www.w3.org/ns/csvw",
  "url": "countries.csv"
  "tableSchema": {
    "columns": [{
      "titles": "country",
      "dc:description": "The ISO two-letter code for a country, in lowercase."
    },{
      "titles": "country group",
      "dc:description": "A lowercase two-letter code for a group of countries."
    },{
      "titles": "name (en)",
      "dc:description": "The official name of the country in English."
    },{
      "titles": "name (fr)",
      "dc:description": "The official name of the country in French."
    },{
      "titles": "name (de)",
      "dc:description": "The official name of the country in German."
    },{
      "titles": "latitude",
      "dc:description": "The latitude of an indicative point in the country."
    },{
      "titles": "longitude",
      "dc:description": "The longitude of an indicative point in the country."
    }]
  }
}

See also:

Columns in [tabular-metadata]
Common Properties in [tabular-metadata]

What about when metadata like this is structured?

We've already seen an example where metadata supplied about a table is structured. Look at the schema:creator here:

Example 11

{
  "@context": "http://www.w3.org/ns/csvw",
  "schema:name": "Unemployment in Europe (monthly)",
  "schema:description": "Harmonized unemployment data for European countries."
  "schema:creator": { "schema:name": "Eurostat" },
  "tables": [{
    "url": "countries.csv"
  }, {
    "url": "country-groups.csv"
  }, {
    "url": "unemployment.csv"
  }]
}

The metadata supplied for a table or group of tables can have nested objects within it. You can provide arrays of values. It will be interpreted as if it is (a minimal version of) [json-ld]. Particular patterns that are useful are:

Use nested objects to create structure; for example you could supply the website and telephone number of the creator of a table as well as their name:

Example 12

"schema:creator": {
  "schema:name": "Eurostat",
  "schema:url": "http://ec.europa.eu/eurostat",
  "schema:contactPoint": [{
    "schema:telephone": "+44 20 300 63103",
    "schema:availableLanguage": "English"
  }, {
    "schema:telephone": "+49 30 223 86 029",
    "schema:availableLanguage": "Deutsch"
  }, {
    "schema:telephone": "+33 1 77 68 89 07"
    "schema:availableLanguage": "français"
  }]
}

Use @type to indicate the type of objects if there's an appropriate type to use, for example in this case Eurostat is a governmental organisation:
Example 13
```
"schema:creator": {
  "schema:name": "Eurostat",
  "@type": "schema:GovernmentOrganization"
}
```
Use an @id property to provide values that are URLs for things about which more information is available at that URL. For example, rather than providing all the information about Eurostat embedded within the metadata, you could just reference out to it:
Example 14
```
"schema:creator": { "@id": "http://ec.europa.eu/eurostat" }
```
or you could include the @id along with other basic properties to make it more readable for humans.

See also:

Common Properties in [tabular-metadata]

How should you annotate individual cells?

There's no standardised facility in the CSV on the Web specifications for annotating individual cells, but there is a hook that will enable best practice about how to do that to emerge: the notes property on a table description can contain objects that represent annotations.

The W3C Web Annotation Working Group is working on a vocabulary for annotations themselves. This vocabulary includes the concept of a target for an annotation and its body (the content of the annotation).

If you are annotating a single cell, the target needs to point to that cell. The easiest way to do that is to use fragment identifiers for CSV files as defined in [RFC7111]. These fragment identifiers reference cells based on their position within the original CSV file, with the first row in the CSV file (usually the header row) counted as 1. For example, with the CSV file:

Example 15

"country","country group","name (en)","name (fr)","name (de)","latitude","longitude"
"at","eu","Austria","Autriche","Österreich","47.6965545","13.34598005"
"be","eu","Belgium","Belgique","Belgien","50.501045","4.47667405"
"bg","eu","Bulgaria","Bulgarie","Bulgarien","42.72567375","25.4823218"

the cell containing Belgique is at #cell=3,4. It's also possible to refer to ranges of cells with this syntax and to use * to refer to the last row in the file. For example, to target a comment on all the locations in the CSV file you could use the fragment identifier #cell=2,6-*,7.

To create comments, then, the notes property can hold an array of objects that use the Web Annotation structure. For example:

Example 16

{
  "@context": "http://www.w3.org/ns/csvw",
  "url": "countries.csv",
  "notes": [{
    "type": "Annotation",
    "target": "countries.csv#cell=2,6-*,7",
    "body": "These locations are of representative points.",
    "motivation": "commenting"
  }, {
    "type": "Annotation",
    "target": "countries.csv#cell=3,4",
    "body": "Corrected.",
    "motivation": "editing"
  }]
}

See also:

Table notes in [tabular-metadata]

Validating CSVs

Validation is all about checking whether a file contains what you expect it to contain. For CSV files, this can be about:

the number of columns there are and what they're called (see )
what kind of values columns contain
whether any of those values are unique
whether they match values in other CSV files

How can you say what kinds of values are expected in a column?

There's lots more that you can say about the expected content of columns in a CSV file. The most obvious thing is to indicate the data type. For example, with the CSV file:

Example 17

"country","country group","name (en)","name (fr)","name (de)","latitude","longitude"
"at","eu","Austria","Autriche","Österreich","47.6965545","13.34598005"
"be","eu","Belgium","Belgique","Belgien","50.501045","4.47667405"
"bg","eu","Bulgaria","Bulgarie","Bulgarien","42.72567375","25.4823218"

The first five columns are strings and the last two are numbers. You can indicate this with the datatype property for each column:

Example 18

{
  "@context": "http://www.w3.org/ns/csvw",
  "url": "countries.csv"
  "tableSchema": {
    "columns": [{
      "titles": "country",
      "datatype": "string"
    },{
      "titles": "country group",
      "datatype": "string"
    },{
      "titles": "name (en)",
      "datatype": "string"
    },{
      "titles": "name (fr)",
      "datatype": "string"
    },{
      "titles": "name (de)",
      "datatype": "string"
    },{
      "titles": "latitude",
      "datatype": "number"
    },{
      "titles": "longitude",
      "datatype": "number"
    }]
  }
}

Note

You don't have to include "datatype": "string" for columns that are strings — columns are assumed to hold strings if no datatype is explicitly specified.

There are a number of different datatypes supported by CSV on the Web implementations, based on the set defined in [xmlschema11-2]. The complete set is shown in the following diagram:

Built-in Datatype Hierarchy diagram — Figure 1 Diagram showing the built-in datatypes, based on [xmlschema11-2]; names in parentheses denote aliases to the [xmlschema11-2] terms (see the diagram in SVG or PNG formats)

See also:

Datatypes in [tabular-metadata]
Datatypes in [tabular-data-model]

How do you define new datatypes?

You can define new datatypes based on the built-in datatypes using an object as the value of the datatype property rather than a string. For example:

Example 19

"datatype": {
  "base": "integer",
  "minimum": "1",
  "maximum": "5"
}

The base property must be an existing datatype. The other properties on the new datatype define extra restrictions on values of the new datatype. You can give the new datatype a name and description to provide extra documentation for people using the data:

Example 20

"datatype": {
  "dc:title": "Star Rating",
  "dc:description": "A star rating between 1 and 5."
  "base": "integer",
  "minimum": "1",
  "maximum": "5"
}

See also:

Derived Datatypes in [tabular-metadata]
Datatypes in [tabular-data-model]

How do you restrict what kind of strings a column contains?

In the example we've been using, the first column always contains a country code consisting of two lowercase letters. This is a new datatype based on string. You can specify the format for string values in a column using a regular expression, like this:

Example 21

{
  "titles": "country",
  "datatype": {
    "dc:title": "Country Code",
    "dc:description": "Country codes as specified in ISO 3166.",
    "base": "string",
    "format": "[a-z]{2}"
  }
}

It's also possible to restrict the length of a string-based datatype using the length or minLength and/or maxLength properties. For example the following says that the column holding the English names of countries must have values between 3 and 128 characters long:

Example 22

{
  "titles": "name (en)",
  "datatype": {
    "base": "string",
    "minLength": "3",
    "maxLength": "128"
  }
}

See also:

Formats for other types in [tabular-data-model]
Length Constraints in [tabular-data-model]

How do you restrict the size of numbers a column contains?

The size of numbers in a column can be restricted using the minimum and maximum properties and/or the minExclusive and maxExclusive properties. In our example, one column contains latitudes, which can range between -90 and +90:

Example 23

{
  "titles": "latitude",
  "datatype": {
    "base": "number",
    "minimum": "-90",
    "maximum": "90"
  }
}

See also:

Value Constraints in [tabular-data-model]

How do you ensure that decimal numbers have a particular precision or leading zeros?

In the example we're using, the latitudes are provided to between six and eight decimal places and there are no leading or trailing zeros. You can use the format property to provide a pattern that matches these numbers. In the pattern, 0 represents a required digit and # represents an optional digit. For the latitude, the definition looks like:

Example 24

{
  "titles": "latitude",
  "datatype": {
    "base": "number",
    "minimum": "-90",
    "maximum": "90",
    "format": "#0.000000##"
  }
}

The format property can also be used to indicate that values in a column should have leading zeros. For example, if a column were supposed to hold a three digit number you could use the pattern 000.

See also:

Formats for numeric types in [tabular-data-model]

How do you validate numbers that aren't in standard numeric formats?

Sometimes numbers within a CSV file won't be in a standard numeric format. For example, they might include commas as grouping characters (eg 12,345,678) or as decimal points (eg 12,3). In these cases, you can use the format property with an object value.

To match numbers with grouping separators as in 12,345,678 you should specify "," as the groupChar for the format. The pattern property then holds the pattern that indicates how many digits should be in each group. This example validates numbers with groups of three digits separated by commas:

Example 25

"datatype": {
  "base": "integer",
  "format": {
    "groupChar": ",",
    "pattern": "#,##0"
  }
}

To match numbers with decimal separators other than ., as in 12,3, you should specify "," as the decimalChar for the format. This example validates numbers with commas as decimal points:

Example 26

"datatype": {
  "base": "integer",
  "format": {
    "decimalChar": ","
  }
}

You can mix and match decimal and grouping characters and patterns. For example, in France it's standard to use commas for decimal points and spaces for grouping characters, so CSV files produced in France might contain numbers like 1 234 567,89. These could be validated with a datatype like:

Example 27

"datatype": {
  "base": "integer",
  "format": {
    "decimalChar": ",",
    "groupChar": " ",
    "pattern": "# ##0,0#"
  }
}

See also:

Formats for numeric types in [tabular-data-model]

How do you restrict what kind of dates and times a column contains?

Dates and times are treated similarly to numbers. You can use the minimum, maximum, minExclusive and/or maxExclusive properties to restrict their values.

For example, to indicate that the column should contain dates later than 1st January 2000, you can use the datatype:

Example 28

"datatype": {
  "base": "date",
  "minimum": "2000-01-01"
}

To indicate that the column should contain times before midday (exclusive), you can use the datatype:

Example 29

"datatype": {
  "base": "time",
  "minExclusive": "12:00:00"
}

Note

The format of dates or times used for minimum or maximum values is always the ISO 8601 format: yyyy-MM-dd for dates, HH:mm:ss.S for times and yyyy-MM-ddTHH:mm:ss.S for date/times.

See also:

Value Constraints in [tabular-data-model]

How do you validate dates that aren't in standard date or time formats?

Dates and times in CSV files often come in formats other than the standard ISO 8601 format. You can use the format property to indicate the expected format of the date or time.

For example, to recognise dates in the usual UK format such as 31/10/2015 for 31st October 2015, you could use:

Example 30

"datatype": {
  "base": "date",
  "format": "dd/MM/yyyy"
}

Implementations are only required to understand a particular set of common formats for dates and times. These formats are, for dates:

yyyy-MM-dd e.g. 2015-03-22
yyyyMMdd e.g. 20150322
dd-MM-yyyy e.g. 22-03-2015
d-M-yyyy e.g. 22-3-2015
MM-dd-yyyy e.g. 03-22-2015
M-d-yyyy e.g. 3-22-2015
dd/MM/yyyy e.g. 22/03/2015
d/M/yyyy e.g. 22/3/2015
MM/dd/yyyy e.g. 03/22/2015
M/d/yyyy e.g. 3/22/2015
dd.MM.yyyy e.g. 22.03.2015
d.M.yyyy e.g. 22.3.2015
MM.dd.yyyy e.g. 03.22.2015
M.d.yyyy e.g. 3.22.2015

For times:

HH:mm:ss.S with one or more trailing S characters indicating the maximum number of fractional seconds e.g. HH:mm:ss.SSS for 15:02:37.143
HH:mm:ss e.g. 15:02:37
HHmmss e.g. 150237
HH:mm e.g. 15:02
HHmm e.g. 1502

And for date/times:

yyyy-MM-ddTHH:mm:ss.S with one or more trailing S characters indicating the maximum number of fractional seconds e.g. yyyy-MM-ddTHH:mm:ss.SSS for 2015-03-15T15:02:37.143
yyyy-MM-ddTHH:mm:ss e.g. 2015-03-15T15:02:37
yyyy-MM-ddTHH:mm e.g. 2015-03-15T15:02
any of the date formats above, followed by a single space, followed by any of the time formats above, e.g. M/d/yyyy HH:mm for 3/22/2015 15:02 or dd.MM.yyyy HH:mm:ss for 22.03.2015 15:02:37

Note

None of these formats include names or abbreviations for months or days. The implementation you use might support other date and time formats as well, including specialised formats for the other date and time datatypes such as gMonthYear. Check your implementation's documentation to see what it supports.

See also:

Formats for dates and times in [tabular-data-model]

How do you validate boolean values that aren't `true` or `false`?

By default, validators will recognise boolean values that are 1 or 0 or true or false. If a CSV file contains boolean values like T and F or Yes and No then you can create a derived boolean datatype that uses that syntax using the format property, with the two possible values separated by a |, for example:

Example 31

"datatype": {
  "base": "boolean",
  "format": "Yes|No"
}

See also:

Formats for booleans in [tabular-data-model]

How do you specify a list of valid values for a column?

The example CSV file we're using is this:

Example 32

"country","country group","name (en)","name (fr)","name (de)","latitude","longitude"
"at","eu","Austria","Autriche","Österreich","47.6965545","13.34598005"
"be","eu","Belgium","Belgique","Belgien","50.501045","4.47667405"
"bg","eu","Bulgaria","Bulgarie","Bulgarien","42.72567375","25.4823218"

In it, the second column, country group, contains either the value eu or the value non-eu. Despite there being two values, this isn't a boolean column. Instead, it's a column that has only two valid values.

There are two ways to specify that a column contains one of a list of values: using a regular expression to list the values and using a reference to a separate CSV file that contains the values.

Using a regular expression to give a list of valid values

Using a regular expression to list values works best if those values are strings, if there are only a few of them, and if they are self-explanatory such that you don't want to provide any additional information about them.

In this example, the country group column could be specified as:

Example 33

{
  "titles": "country group",
  "datatype": {
    "base": "string",
    "format": "eu|non-eu"
  }
}

As described in , the format property contains a regular expression. List the options separated by | and ensure that you escape any of the characters in the options that have special meaning in regular expressions.

See also:

Formats for other types in [tabular-data-model]

Using a separate CSV file to give a list of valid values

A more powerful method of listing the valid values in a particular column is to list those values in a separate CSV file. The CSV file can be very simple, containing just a single column that lists the valid values. In this example, we can create country-groups.csv containing:

Example 34

group
eu
non-eu

We can then provide definitions for both the countries.csv and country-groups.csv files, and state that the country group column in countries.csv references the group column in country_groups.csv. This reference from one file to another is called a foreign key.

To use a foreign key, both files must be referenced in the same metadata document, and both columns must be given names. Column names are only used inside the metadata document and you can only use (ASCII) letters, numbers, . and _ within them. So the basic metadata document, before adding the foreign key, should look like:

Example 35

{
  "@context": "http://www.w3.org/ns/csvw",
  "tables": [{
    "url": "countries.csv",
    "tableSchema": {
      "columns": [{
        "titles": "country"
      },{
        "name": "country_group",
        "titles": "country group"
      },{
        "titles": "name (en)"
      },{
        "titles": "name (fr)"
      },{
        "titles": "name (de)"
      },{
        "titles": "latitude"
      },{
        "titles": "longitude"
      }]
    }
  }, {
    "url": "country-groups.csv",
    "tableSchema": {
      "columns": [{
        "name": "group",
        "titles": "group"
      }]
    }
  }]
}

The foreign key is defined in the schema for the countries.csv table, as follows:

Example 36

{
  "@context": "http://www.w3.org/ns/csvw",
  "tables": [{
    "url": "countries.csv",
    "tableSchema": {
      "columns": [{
        "titles": "country"
      },{
        "name": "country_group",
        "titles": "country group"
      },{
        "titles": "name (en)"
      },{
        "titles": "name (fr)"
      },{
        "titles": "name (de)"
      },{
        "titles": "latitude"
      },{
        "titles": "longitude"
      }],
      "foreignKeys": [{
        "columnReference": "country_group",
        "reference": {
          "resource": "country-groups.csv",
          "columnReference": "group"
        }
      }]
    }
  }, {
    "url": "country-groups.csv",
    "tableSchema": {
      "columns": [{
        "name": "group",
        "titles": "group"
      }]
    }
  }]
}

The foreignKeys property can hold several foreign keys. Each contains a columnReference to a column or list of columns in one CSV file, and a reference which defines a column or list of columns in another CSV file.

The advantage of this method of listing the values allowed in a column is that the CSV file that contains the list of possible values can also provide additional information about those values. For example, we can provide expansions of what eu and non-eu mean in different languages:

Example 37

group,name (en),name (fr),name (de)
eu,"European Union","Union européenne","Europäische Union"
non-eu,"Non EU countries","Pays hors Union européenne",Nicht-EU-Länder

See also:

Foreign key examples in [tabular-metadata]

How do you enable a column to have a mix of value types?

Sometimes a column that contains numbers will contain special values, such as X or NK, when a value is unknown or redacted. If these columns are simply classified as numeric then the non-numeric values will be classed as errors.

To avoid values being classified as errors when they are being used to indicate missing values, list those values as null values using the null property. This can take either a single string or an array of strings. For example, the latitude column might usually be numeric but hold an X if there is no indicative point for the country:

Example 38

{
  "titles": "latitude",
  "null": "X",
  "datatype": {
    "base": "number",
    "minimum": "-90",
    "maximum": "90"
  }
}

Note

The null property can also be useful when a column contains values that are of the right type but used to indicate a missing value. It's not uncommon, for example, for publishers to use the value 99 in a column that contains integers to indicate that a value is missing.

See also:

null property in [tabular-metadata]
Parsing Cells in [tabular-data-model]

What if the cells in a column contain lists of values?

Cells may contain lists of values with spaces, semi-colons or other characters acting as separators. For example, instead of using separate latitude and longitude columns, the CSV that we're looking at could contain a single latlong column consisting of the latitude and longitude separated by a space:

Example 39

"country","country group","name (en)","name (fr)","name (de)","latlong"
"at","eu","Austria","Autriche","Österreich","47.6965545 13.34598005"
"be","eu","Belgium","Belgique","Belgien","50.501045 4.47667405"
"bg","eu","Bulgaria","Bulgarie","Bulgarien","42.72567375 25.4823218"

In this scenario, the separator property can be used to indicate that the values in a column are lists themselves, and what separator is used between the items in the list. For example:

Example 40

{
  "titles": "latlong",
  "separator": " ",
  "datatype": {
    "base": "number",
    "minimum": "-180",
    "maximum": "180"
  }
}

When separator is specified, the datatype property applies to each of the values in the list. There's no way to indicate that the values in the list have different datatypes, or set limits on the length of the list.

See also:

separator property in [tabular-metadata]
Parsing Cells in [tabular-data-model]

How do you ensure every row has a value for a column?

By default, a validator won't give any errors if a value is missing in a column. If you want to ensure that a value is provided for every row in the column, use the required property for that column, with the value true.

In our example, we might say that all the columns are required except the French and German names (applications being expected to default to the English name if the translation is missing):

Example 41

"tableSchema": {
  "columns": [{
    "titles": "country",
    "required": true
  },{
    "titles": "country group",
    "required": true
  },{
    "titles": "name (en)",
    "required": true
  },{
    "titles": "name (fr)"
  },{
    "titles": "name (de)"
  },{
    "titles": "latitude",
    "required": true
  },{
    "titles": "longitude",
    "required": true
  }]
}

Note

Setting required to true means that you can't have any null values in a column. If, in this example, latitude and longitude had null set to X then those columns couldn't contain an X. It doesn't usually make sense to specify both null and required.

See also:

required property in [tabular-metadata]

How do you indicate all the values in a column are unique?

In our example, the country column must contain unique values: each row should define a different country. To specify this, you can use a primary key to refer to the name of the column:

Example 42

"tableSchema": {
  "columns": [{
    "name": "country",
    "titles": "country"
  },{
    "titles": "country group"
  },{
    "titles": "name (en)"
  },{
    "titles": "name (fr)"
  },{
    "titles": "name (de)"
  },{
    "titles": "latitude"
  },{
    "titles": "longitude"
  }],
  "primaryKey": "country"
}

Each CSV file can only have one primary key. A primary key can be made up of a number of columns that have to be unique in combination: the classic example would be ["firstName", "lastName"].

See also:

primaryKey property in [tabular-metadata]

Transforming CSVs

CSV is great for transferring data around the place, but it's often not as useful for processing that data. Other formats can be better at providing structure and meaning for data. So a means for transforming data out of CSV and into other formats is a common requirement.

What can you transform CSV into?

You can of course transform CSV into anything you like using your favourite programming language. However, the CSV on the Web specs provide standardised ways of mapping CSV into two other formats:

JSON, in [csv2json]
RDF, in [csv2rdf]

These specifications make use of the metadata described in this primer during the transformation to decide what to include in the transformed output and how to include it. Processors that support these standardised transformations can be used by people who can't program.

These specifications describe what output you get if you don't supply any metadata. Given a CSV file like this:

Example 43

"country","country group","name (en)","name (fr)","name (de)","latitude","longitude"
"at","eu","Austria","Autriche","Österreich","47.6965545","13.34598005"
"be","eu","Belgium","Belgique","Belgien","50.501045","4.47667405"
"bg","eu","Bulgaria","Bulgarie","Bulgarien","42.72567375","25.4823218"

the usual, minimal JSON output would be:

Example 44

[{
  "country": "at",
  "country group": "eu",
  "name (en)": "Austria",
  "name (fr)": "Autriche",
  "name (de)": "Österreich",
  "latitude": "47.6965545",
  "longitude": "13.34598005"
}, {
  "country": "be",
  "country group": "eu",
  "name (en)": "Belgium",
  "name (fr)": "Belgique",
  "name (de)": "Belgien",
  "latitude": "50.501045",
  "longitude": "4.47667405"
}, {
  "country": "bg",
  "country group": "eu",
  "name (en)": "Bulgaria",
  "name (fr)": "Bulgarie",
  "name (de)": "Bulgarien",
  "latitude": "42.72567375",
  "longitude": "25.4823218"
}]

and the RDF output would be:

Example 45

@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

[
  <#country> "at";
  <#country%20group> "eu";
  <#latitude> "47.6965545";
  <#longitude> "13.34598005";
  <#name%20%28de%29> "Österreich";
  <#name%20%28en%29> "Austria";
  <#name%20%28fr%29> "Autriche"
] .

[
  <#country> "be";
  <#country%20group> "eu";
  <#latitude> "50.501045";
  <#longitude> "4.47667405";
  <#name%20%28de%29> "Belgien";
  <#name%20%28en%29> "Belgium";
  <#name%20%28fr%29> "Belgique"
] .

[
  <#country> "bg";
  <#country%20group> "eu";
  <#latitude> "42.72567375";
  <#longitude> "25.4823218";
  <#name%20%28de%29> "Bulgarien";
  <#name%20%28en%29> "Bulgaria";
  <#name%20%28fr%29> "Bulgarie"
] .

Note

The specifications define how to transform CSV into RDF. In this Primer all the examples use Turtle as the serialisation for that RDF. Implementations may generate other serialisations for RDF such as RDF/XML or JSON-LD.

See also:

Minimal Mode in [csv2json]
Converting Tabular Data to RDF in [csv2rdf]

What values get used in the output of a transformation?

The result of a transformation will include typed values based on the datatype and language of each column. So if we state that the longitude and latitude are numbers and the names are strings in the given language, as in this metadata:

Example 46

{
  "@context": "http://www.w3.org/ns/csvw",
  "url": "countries.csv",
  "tableSchema": {
    "columns": [{
      "titles": "country"
    },{
      "titles": "country group"
    },{
      "titles": "name (en)",
      "lang": "en"
    },{
      "titles": "name (fr)",
      "lang": "fr"
    },{
      "titles": "name (de)",
      "lang": "de"
    },{
      "titles": "latitude",
      "datatype": "number"
    },{
      "titles": "longitude",
      "datatype": "number"
    }]
  }
}

then the JSON will look like:

Example 47

[{
  "country": "at",
  "country group": "eu",
  "name (en)": "Austria",
  "name (fr)": "Autriche",
  "name (de)": "Österreich",
  "latitude": 47.6965545,
  "longitude": 13.34598005
},{
  "country": "be",
  "country group": "eu",
  "name (en)": "Belgium",
  "name (fr)": "Belgique",
  "name (de)": "Belgien",
  "latitude": 50.501045,
  "longitude": 4.47667405
},{
  "country": "bg",
  "country group": "eu",
  "name (en)": "Bulgaria",
  "name (fr)": "Bulgarie",
  "name (de)": "Bulgarien",
  "latitude": 42.72567375,
  "longitude": 25.4823218
}]

and the RDF will look like:

Example 48

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

[
  <#country> "at";
  <#country%20group> "eu";
  <#latitude> 4.76965545e1;
  <#longitude> 1.334598005e1;
  <#name%20%28de%29> "Österreich"@de;
  <#name%20%28en%29> "Austria"@en;
  <#name%20%28fr%29> "Autriche"@fr
] .

[
  <#country> "be";
  <#country%20group> "eu";
  <#latitude> 5.0501045e1;
  <#longitude> 4.47667405e0;
  <#name%20%28de%29> "Belgien"@de;
  <#name%20%28en%29> "Belgium"@en;
  <#name%20%28fr%29> "Belgique"@fr
] .

[
  <#country> "bg";
  <#country%20group> "eu";
  <#latitude> 4.272567375e1;
  <#longitude> 2.54823218e1;
  <#name%20%28de%29> "Bulgarien"@de;
  <#name%20%28en%29> "Bulgaria"@en;
  <#name%20%28fr%29> "Bulgarie"@fr
] .

See also:

Datatypes in [csv2json]
Datatypes in [csv2rdf]

What value gets used in the output if it's missing in the CSV?

If there's a missing value in the CSV, then usually the property will be omitted in the result as well. So if the latitude and longitude are missing for the first row of the CSV file we've been using, the equivalent JSON won't include those properties either:

Example 49

{
  "country": "at",
  "country group": "eu",
  "name (en)": "Austria",
  "name (fr)": "Autriche",
  "name (de)": "Österreich"
}

and nor will the RDF:

Example 50

[
  <#country> "at";
  <#country%20group> "eu";
  <#name%20%28de%29> "Österreich"@de;
  <#name%20%28en%29> "Austria"@en;
  <#name%20%28fr%29> "Autriche"@fr
] .

If you want a value to appear even when the value is missing in the CSV, you can provide that value as the default for the column. This value must be supplied as a string but will be treated exactly as if it had appeared within the CSV file. For example, if you supply a non-numeric string for a numeric column, as in:

Example 51

{
  "titles": "latitude",
  "datatype": "number",
  "default": "NOT KNOWN"
}

The default value will be seen as an invalid value and therefore represented as a string in the output.

See also:

Minimal Mode in [csv2json]
Converting Tabular Data to RDF in [csv2rdf]

How are the properties in the output of a transformation named?

By default, the properties in the JSON and RDF come from the titles of the columns (the headers in the CSV file). In RDF, since the properties are URIs, the names are URL-encoded and turned into fragment identifiers in a URL based on the location of the CSV file being transformed. Hence in the previous examples there have been property names like "name (en)" in JSON and #name%20%28en%29 in RDF.

You can override this default by supplying a name for the column. That name will be used instead of the title of the column when creating the property. So if you have:

Example 52

{
  "titles": "name (en)",
  "name": "english_name"
}

then the property will be called english_name in the JSON output and #english_name in the RDF output.

You can also use the propertyUrl property to supply a prefixed name or a URL for the property. For example, to use schema:latitude as the name for the latitude property in both the JSON and the RDF output, you could use:

Example 53

{
  "titles": "latitude",
  "propertyUrl": "schema:latitude"
  "datatype": "number"
}

The propertyUrl property can be used to map several columns in the CSV file onto properties with the same name. In our example, each country has several names which are all really the same property; the schema could look like:

Example 54

{
  "titles": "name (en)",
  "propertyUrl": "schema:name"
  "lang": "en"
},{
  "titles": "name (fr)",
  "propertyUrl": "schema:name"
  "lang": "fr"
},{
  "titles": "name (de)",
  "propertyUrl": "schema:name"
  "lang": "de"
}

With that schema, the JSON output will contain:

Example 55

"schema:name": ["Belgium", "Belgique", "Belgien"]

and the RDF output will contain:

Example 56

schema:name "Belgium"@en, "Belgique"@fr, "Belgien"@de

If there isn't a relevant property in one of the vocabularies that is built-in to CSV on the Web (those listed as part of the RDFa 1.1 initial context), the propertyUrl can hold a URL template. Usually this template won't include any substitutable parts because it's generally the case that the property should be the same for the whole column. For example, you might have:

Example 57

{
  "titles": "country group",
  "propertyUrl": "http://example.org/def/countryGroup"
}

In this case, the result of a transformation to JSON will contain:

Example 58

"http://example.org/def/countryGroup": "eu"

and the RDF similarly:

Example 59

<http://example.org/def/countryGroup> "eu"

See also:

name property in [tabular-metadata]
propertyUrl property in [tabular-metadata]

How do you map values into URLs?

Sometimes a column contains a value that can be programmatically mapped into a URL. In this case, the valueUrl property contains a template for the URL that it should be mapped into.

For example, say that there were pages for each country on the web at e.g. http://example.org/country/at. In that case, the URL for the country could be generated with the URL template http://example.org/country/{country}. Within this template, {country} inserts the value from the column named country into the URL. So the metadata could contain:

Example 60

{
  "titles": "country",
  "name": "country",
  "valueUrl": "http://example.org/country/{country}",
  "propertyUrl": "schema:url"
}

The JSON output would then contain:

Example 61

"schema:url": "http://example.org/country/at"

and the RDF output:

Example 62

schema:url <http://example.org/country/at>

If you want to preserve the original value from the column and use it to create a URL, you may want to introduce a virtual column. For example, with the latitude and longitude of each country available, you might want to provide a link to a map centered on the country within Google Maps. The URLs for these look like https://www.google.com/maps/@50.501045,4.476674,7z, and a template like https://www.google.com/maps/@{lat},{long},7z.

To add a property that points to this URL, add a virtual column at the end of the column definitions within the schema. A virtual column definition looks just like a normal column definition but with the virtual property set to true:

Example 63

{
  "@context": "http://www.w3.org/ns/csvw",
  "url": "countries.csv",
  "tableSchema": {
    "columns": [{
      "titles": "country"
    },{
      "titles": "country group"
    },{
      "titles": "name (en)",
      "lang": "en"
    },{
      "titles": "name (fr)",
      "lang": "fr"
    },{
      "titles": "name (de)",
      "lang": "de"
    },{
      "titles": "latitude",
      "name": "lat",
      "datatype": "number"
    },{
      "titles": "longitude",
      "name": "long",
      "datatype": "number"
    },{
      "virtual": true,
      "propertyUrl": "schema:hasMap",
      "valueUrl": "https://www.google.com/maps/@{lat},{long},7z"
    }]
  }
}

In JSON, this will result in the output:

Example 64

"schema:hasMap": "https://www.google.com/maps/@42.72567375,25.4823218,7z"

and in RDF, the output:

Example 65

schema:hasMap <https://www.google.com/maps/@42.72567375,25.4823218,7z>

See also:

valueUrl property in [tabular-metadata]

How do you include an identifier for the thing described by each row?

By default, the things described by each row don't have identifiers associated with them in either JSON or RDF outputs. You can add an identifier for the row by setting the aboutUrl property. Usually that's done at the top level of the schema.

For example, say each row in countries.csv was about a country whose identifier looked like http://example.org/country/{code} where code was the value within the first column of the CSV file (the country column). The aboutUrl could be set to the generate this URL for each row using a URL template:

Example 66

{
  "@context": "http://www.w3.org/ns/csvw",
  "url": "countries.csv",
  "tableSchema": {
    "aboutUrl": "http://example.org/country/{code}",
    "columns": [{
      "titles": "country",
      "name": "code"
    },{
      "titles": "country group"
    },{
      "titles": "name (en)",
      "lang": "en"
    },{
      "titles": "name (fr)",
      "lang": "fr"
    },{
      "titles": "name (de)",
      "lang": "de"
    },{
      "titles": "latitude",
      "datatype": "number"
    },{
      "titles": "longitude",
      "datatype": "number"
    }]
  }
}

In the JSON, these identifiers are turned into @id properties on the objects generated for each row:

Example 67

[{
  "@id": "http: //example.org/country/at",
  "code": "at",
  "country group": "eu",
  "name (en)": "Austria",
  "name (fr)": "Autriche",
  "name (de)": "Österreich"
},{
  "@id": "http: //example.org/country/be",
  "code": "be",
  "country group": "eu",
  "name (en)": "Belgium",
  "name (fr)": "Belgique",
  "name (de)": "Belgien",
  "latitude": 50.501045,
  "longitude": 4.47667405
},{
  "@id": "http: //example.org/country/bg",
  "code": "bg",
  "country group": "eu",
  "name (en)": "Bulgaria",
  "name (fr)": "Bulgarie",
  "name (de)": "Bulgarien",
  "latitude": 42.72567375,
  "longitude": 25.4823218
}]

In the RDF, these identifiers become the identifiers for the entities that the properties relate to, rather than those being blank nodes:

Example 68

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<http://example.org/country/at>
   <#code> "at";
   <#country%20group> "eu";
   <#latitude> 4.76965545e1;
   <#longitude> 1.334598005e1;
   <#name%20%28de%29> "Österreich"@de;
   <#name%20%28en%29> "Austria"@en;
   <#name%20%28fr%29> "Autriche"@fr .

<http://example.org/country/be>
   <#code> "be";
   <#country%20group> "eu";
   <#latitude> 5.0501045e1;
   <#longitude> 4.47667405e0;
   <#name%20%28de%29> "Belgien"@de;
   <#name%20%28en%29> "Belgium"@en;
   <#name%20%28fr%29> "Belgique"@fr .

<http://example.org/country/bg>
   <#code> "bg";
   <#country%20group> "eu";
   <#latitude> 4.272567375e1;
   <#longitude> 2.54823218e1;
   <#name%20%28de%29> "Bulgarien"@de;
   <#name%20%28en%29> "Bulgaria"@en;
   <#name%20%28fr%29> "Bulgarie"@fr .

See also:

aboutUrl property in [tabular-metadata]

How do you indicate the type of the thing described by each row?

Whether generating JSON or RDF it can be useful to indicate that each row contains data about a particular type of thing, such as a Person or a Country. There isn't usually a column within tabular data that indicates the type of the row (because it's generally the same for every row, so including it would be superfluous), so you have to add it as a virtual column.

The virtual column needs to come after the descriptions of columns actually within the data. It should have its virtual property set to true and its propertyUrl set to rdf:type to indicate that the virtual column will indicate the type of entity the row is about. The valueUrl property can then be set to the prefixed name or URL of the type of the entity. For example, when each row represents a Country, you might use:

Example 69

{
  "virtual": true,
  "propertyUrl": "rdf:type",
  "valueUrl": "schema:Country"
}

In the JSON this value will be transformed into the value of the @type property on the relevant object:

Example 70

"@type": "schema:Country"

In RDF output, it becomes the class for the entity:

Example 71

a schema:Country

See also:

virtual property in [tabular-metadata]
virtual columns in [tabular-data-model]

How do you include extra metadata in the result of the transformation?

Transformations into JSON or RDF can be carried out in one of two modes. In minimal mode, which is what we've looked at so far, the output only contains data from the rows within the CSV file. In full mode, the output also contains metadata about the CSV file, including metadata from the metadata file.

For example, say that our metadata file looked like:

Example 72

{
  "@context": "http://www.w3.org/ns/csvw",
  "url": "countries.csv",
  "schema:name": "Countries",
  "schema:description": "European countries for which comparative statistics are collected by Eurostat.",
  "schema:creator": { "schema:name": "Eurostat" },
  "tableSchema": {
    ...
  }
}

The output of a full JSON transformation would look like:

Example 73

{
  "tables": [{
    "url":"countries.csv",
    "schema:name":"Countries",
    "schema:description":"European countries for which comparative statistics are collected by Eurostat.",
    "schema:creator": {
      "schema:name": "Eurostat"
    },
    "row": [{
      "url": "countries.csv#row=2",
      "rownum":1,
      "describes": [{
        "country": "at",
        "country group": "eu",
        "name (en)": "Austria",
        "name (fr)": "Autriche",
        "name (de)": "Österreich",
        "latitude": 47.6965545,
        "longitude": 13.34598005
      }]
    }, {
      "url": "countries.csv#row=3",
      "rownum": 2,
      "describes": [{
        "country": "be",
        "country group": "eu",
        "name (en)": "Belgium",
        "name (fr)": "Belgique",
        "name (de)": "Belgien",
        "latitude": 50.501045,
        "longitude": 4.47667405
      }]
    }, {
      "url": "countries.csv#row=4",
      "rownum": 3,
      "describes": [{
        "country": "bg",
        "country group": "eu",
        "name (en)": "Bulgaria",
        "name (fr)": "Bulgarie",
        "name (de)": "Bulgarien",
        "latitude": 42.72567375,
        "longitude": 25.4823218
      }]
    }]
  }]
}

Similarly, the output of the full RDF would look like:

Example 74

@prefix csvw: <http://www.w3.org/ns/csvw#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix schema: <http://schema.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

 [
    a csvw:TableGroup;
    csvw:table [
      a csvw:Table;
      schema:creator [ schema:name "Eurostat" ];
      schema:description "European countries for which comparative statistics are collected by Eurostat.";
      schema:name "Countries";
      csvw:row [
        a csvw:Row;
        csvw:describes [
          <#country> "at";
          <#country%20group> "eu";
          <#latitude> 4.76965545e1;
          <#longitude> 1.334598005e1;
          <#name%20%28de%29> "Österreich"@de;
          <#name%20%28en%29> "Austria"@en;
          <#name%20%28fr%29> "Autriche"@fr
        ];
        csvw:rownum 1;
        csvw:url <#row=2>
      ],  [
        a csvw:Row;
        csvw:describes [
          <#country> "be";
          <#country%20group> "eu";
          <#latitude> 5.0501045e1;
          <#longitude> 4.47667405e0;
          <#name%20%28de%29> "Belgien"@de;
          <#name%20%28en%29> "Belgium"@en;
          <#name%20%28fr%29> "Belgique"@fr
        ];
        csvw:rownum 2;
        csvw:url <#row=3>
      ],  [
        a csvw:Row;
        csvw:describes [
          <#country> "bg";
          <#country%20group> "eu";
          <#latitude> 4.272567375e1;
          <#longitude> 2.54823218e1;
          <#name%20%28de%29> "Bulgarien"@de;
          <#name%20%28en%29> "Bulgaria"@en;
          <#name%20%28fr%29> "Bulgarie"@fr
        ];
        csvw:rownum 3;
        csvw:url <#row=4>
      ];
      csvw:url <>
    ]
 ] .

Metadata provided about tables is interpreted based on the rules for [json-ld] which means that you can provide as much structure within that metadata as you like, including providing structured values, languages and datatypes, so that the data in the output includes what you need it to.

See also:

Common Properties in [tabular-metadata]
Standard Mode in [csv2json]
Converting Tabular Data to RDF in [csv2rdf]

How can you remove output from a transformation result?

By default, the output from a JSON or RDF transformation will include all the data from all the columns of all the tables in the metadata document. It may be that you're not interested in some of that within the output of your transformation. In that case, you can use the suppressOutput property in the metadata to exclude the data that you're not interested in.

For example, perhaps I'm not interested in the non-English names of countries in my output. In that case, I could suppress them as follows:

Example 75

{
  "titles": "name (fr)",
  "lang": "fr",
  "suppressOutput": true
}

Similarly, when generating the output for a set of tables, you can suppress the output from a whole table by adding the suppressOutput property to the description of that table:

Example 76

{
  "@context": "http://www.w3.org/ns/csvw",
  "tables": [{
    "url": "countries.csv"
  }, {
    "url": "country-groups.csv",
    "suppressOutput": true
  }, {
    "url": "unemployment.csv"
  }]
}

See also:

How do you transform into nested structures in JSON?

While tabular data is by necessity flat, it often holds data that is actually structured. For example, the data that we have been looking at:

Example 77

"country","country group","name (en)","name (fr)","name (de)","latitude","longitude"
"at","eu","Austria","Autriche","Österreich","47.6965545","13.34598005"
"be","eu","Belgium","Belgique","Belgien","50.501045","4.47667405"
"bg","eu","Bulgaria","Bulgarie","Bulgarien","42.72567375","25.4823218"

if modelled according to the schema.org vocabulary, would look like:

Example 78

[{
  "@type": "schema:Country",
  "schema:name": ["Austria", "Autriche", "Österreich"],
  "schema:geo": {
    "@type": "schema:GeoCoordinates",
    "schema:latitude":47.6965545,
    "schema:longitude":13.34598005
  }
}, {
  "@type": "schema:Country",
  "schema:name": ["Belgium", "Belgique", "Belgien"],
  "schema:geo": {
    "@type": "schema:GeoCoordinates",
    "schema:latitude":50.501045,
    "schema:longitude":4.47667405
  }
}, {
  "@type": "schema:Country",
  "schema:name": ["Bulgaria", "Bulgarie", "Bulgarien"],
  "schema:geo": {
    "@type": "schema:GeoCoordinates",
    "schema:latitude":42.72567375,
    "schema:longitude":25.4823218
  }
}]

Generating JSON in this shape requires the judicious use of virtual columns, aboutUrl and valueUrl: if you create a column whose valueUrl corresponds to the aboutUrl of another column, you will create nested properties.

In this example, we can use two aboutUrls: one in the form http://example.org/country/{code} for the Country and one in the form http://example.org/country/{code}#geo for the geo-coordinates of the country. The names are properties of the former while the longitude and latitude are properties of the latter. A virtual column can add the association between the two objects, with a propertyUrl of schema:geo, like so:

Example 79

{
  "@context": "http://www.w3.org/ns/csvw",
  "url": "countries.csv",
  "tableSchema": {
    "aboutUrl": "http://example.org/country/{code}",
    "columns": [{
      "titles": "country",
      "name": "code",
      "suppressOutput": true
    },{
      "titles": "country group",
      "suppressOutput": true
    },{
      "titles": "name (en)",
      "lang": "en",
      "propertyUrl": "schema:name"
    },{
      "titles": "name (fr)",
      "lang": "fr",
      "propertyUrl": "schema:name"
    },{
      "titles": "name (de)",
      "lang": "de",
      "propertyUrl": "schema:name"
    },{
      "titles": "latitude",
      "datatype": "number",
      "aboutUrl": "http://example.org/country/{code}#geo",
      "propertyUrl": "schema:latitude"
    },{
      "titles": "longitude",
      "datatype": "number",
      "aboutUrl": "http://example.org/country/{code}#geo",
      "propertyUrl": "schema:longitude"
    },{
      "virtual": true,
      "propertyUrl": "rdf:type",
      "valueUrl": "schema:Country"
    },{
      "virtual": true,
      "propertyUrl": "schema:geo",
      "valueUrl": "http://example.org/country/{code}#geo"
    },{
      "virtual": true,
      "aboutUrl": "http://example.org/country/{code}#geo",
      "propertyUrl": "rdf:type",
      "valueUrl": "schema:GeoCoordinates"
    }]
  }
}

Note

Note also in this example the use of suppressOutput to remove properties that we're not interested in, and the use of virtual columns to add type information to both types of generated object.

The result of this transformation is close to what we were aiming for, though with the addition of @id properties:

Example 80

[{
  "@id": "http://example.org/country/at",
  "@type": "schema:Country",
  "schema:name": ["Austria", "Autriche", "Österreich"],
  "schema:geo": {
    "@id": "http://example.org/country/at#geo",
    "@type": "schema:GeoCoordinates",
    "schema:latitude": 47.6965545,
    "schema:longitude": 13.34598005
  }
}, {
  "@id": "http://example.org/country/be",
  "@type": "schema:Country",
  "schema:name": ["Belgium", "Belgique", "Belgien"],
  "schema:geo": {
    "@id":"http://example.org/country/be#geo",
    "@type": "schema:GeoCoordinates",
    "schema:latitude": 50.501045,
    "schema:longitude": 4.47667405
  }
}, {
  "@id": "http://example.org/country/bg",
  "@type": "schema:Country",
  "schema:name": ["Bulgaria", "Bulgarie", "Bulgarien"],
  "schema:geo": {
    "@id": "http://example.org/country/bg#geo",
    "@type": "schema:GeoCoordinates",
    "schema:latitude": 42.72567375,
    "schema:longitude": 25.4823218
  }
}]

The same metadata will generate similar RDF, though the nesting structure is not so obvious because of the way RDF works:

Example 81

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix schema: <http://schema.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<http://example.org/country/at> a schema:Country;
   schema:geo <http://example.org/country/at#geo>;
   schema:name "Austria"@en, "Autriche"@fr, "Österreich"@de .

<http://example.org/country/be> a schema:Country;
   schema:geo <http://example.org/country/be#geo>;
   schema:name "Belgium"@en, "Belgique"@fr, "Belgien"@de .

<http://example.org/country/bg> a schema:Country;
   schema:geo <http://example.org/country/bg#geo>;
   schema:name "Bulgaria"@en, "Bulgarie"@fr, "Bulgarien"@de .

<http://example.org/country/at#geo> a schema:GeoCoordinates;
   schema:latitude 4.76965545e1;
   schema:longitude 1.334598005e1 .

<http://example.org/country/be#geo> a schema:GeoCoordinates;
   schema:latitude 5.0501045e1;
   schema:longitude 4.47667405e0 .

<http://example.org/country/bg#geo> a schema:GeoCoordinates;
   schema:latitude 4.272567375e1;
   schema:longitude 2.54823218e1 .

See also:

Generating Nested Objects in [csv2json]

How do you indicate that values should be mapped to a list rather than repeating properties in RDF?

If you are used to using RDF you'll know that there's a big difference between having a property that has multiple values (ie multiple triples with the same subject and property) and a property that has a rdf:List as a value. Sometimes one is appropriate, sometimes the other. The ordered property enables you to indicate which to use for values that are sequences in the original data.

Let's use as an example the version of our data in which the latitude and longitude are in the same property:

Example 82

"country","country group","name (en)","name (fr)","name (de)","latlong"
"at","eu","Austria","Autriche","Österreich","47.6965545 13.34598005"
"be","eu","Belgium","Belgique","Belgien","50.501045 4.47667405"
"bg","eu","Bulgaria","Bulgarie","Bulgarien","42.72567375 25.4823218"

In this example, if we state that the latlong column is space-separated like so:

Example 83

{
  "titles": "latlong",
  "separator": " ",
  "datatype": "number"
}

we'll end up with output like this:

Example 84

[
  <#country> "at";
  <#country%20group> "eu";
  <#name%20%28en%29> "Austria"@en;
  <#name%20%28fr%29> "Autriche"@fr;
  <#name%20%28de%29> "Österreich"@de;
  <#latlong> 4.76965545e1, 1.334598005e1;
] .

This shows the #latlong property having two values: 4.76965545e1 and 1.334598005e1. These values could easily get mixed up such that the first was taken to be the longitude and the second the latitude rather than the other way around.

To avoid this mix-up occurring, set the ordered property to true:

Example 85

{
  "titles": "latlong",
  "separator": " ",
  "ordered": true,
  "datatype": "number"
}

This will make the processor use a rdf:List for the value of the #latlong property instead, which in Turtle looks like:

Example 86

[
  <#country> "at";
  <#country%20group> "eu";
  <#name%20%28en%29> "Austria"@en;
  <#name%20%28fr%29> "Autriche"@fr;
  <#name%20%28de%29> "Österreich"@de;
  <#latlong> ( 4.76965545e1, 1.334598005e1 );
] .

Note

The ordered property makes no difference to JSON output because sequences in CSV are always transformed into arrays in JSON.

See also:

ordered property in [tabular-metadata]

How should you transform CSV into JSON-LD?

As illustrated above, it is possible to transform CSV into something that looks like JSON-LD by transforming it into JSON. You can add @id properties for identifiers using aboutUrl and add @type properties for types using virtual columns.

However, if you're really after JSON-LD as an output, the best route is to transform into RDF and emit that RDF as JSON-LD. This will give you more control over the context that's used to determine the properties and structure of the output.

See also:

Converting Tabular Data to RDF in [csv2rdf]

How should you display CSV tables in HTML?

There's no single specified way of displaying CSV tables in HTML. If you are writing code to do so, it's good practice to:

use the locale of the user of the browser to determine which titles to display for columns, and how to format numbers and dates within the table
respect the directionality of the table and the text within the table that's indicated through the tableDirection and textDirection properties
use the aboutUrl and valueUrl properties to link out to other pages from cells within the table, and the propertyUrl to link out to information about the property a column represents; these properties could also be used to embed RDFa into the HTML table, as described in [html-rdfa]
display the metadata that's available about the table so that people get a full idea of the context of the data
provide a way to access notes and annotations that have been provided about individual cells, rows or columns
highlight cells that contain errors so that they're easy to spot and correct
provide links to other related resources indicated through the metadata supplied for the table
keep columns that are used to provide a title for each row (using rowTitles) visible; for screen readers, use these to label the rows so that people can navigate through the table easily
link out to the original CSV file so that people can download it

See also:

Presenting Tables in [tabular-data-model]

How can you transform CSV into the DataCube vocabulary?

The Data Cube vocabulary [vocab-data-cube] is a vocabulary for statistical data based on SDMX (Statistical Data and Metadata eXchange). Statistical data is often expressed in tables, sometimes with one row per Observation and sometimes with each cell containing a different Observation.

Generating data in the Data Cube vocabulary requires the use of techniques that have been described above, such as adding identifiers to entities, adding structure and types through virtual columns, and using metadata to supply additional static context.

There is a fully worked out example of transforming to DataCube, using meterological data as its basis, available on the Github repository for the Working Group.

See also:

Generating Data Cube Example

How can you transform CSV into other formats?

While there are only specifications for transforming CSV into JSON and RDF, there is an extension mechanism within CSV metadata to indicate other transformations that could be applied to CSV files. The transformations property on a table description holds an array of descriptions of transformations that processors could carry out. There's no guarantees that a given processor will recognise them, but over time it might be that there begins to be recognised practices for how such transformations might work.

The transformations must have the following properties:

targetFormat: gives a URL for the format that the transformation transforms into, for example http://www.iana.org/assignments/media-types/application/xml for XML
url: points to a script or template that can be used to transform the CSV into that format
scriptFormat: gives a URL for the format that the script is in, for example https://mustache.github.io/ for Mustache or http://www.iana.org/assignments/media-types/application/javascript for Javascript

You can also supply a titles property to provide a human-readable description of the output of the transformation, and a source property to indicate that the input to the transformation isn't the original CSV or tabular data, but json or rdf.

For example, if I wanted to convert the data that we've been using into XML, I could create a Mustache template like this at xml-template.mustache:

Example 87

{{#tables}}
  <countries>
    {{#row}}
      {{#describes}}
        <country id="{{country}}" group="{{country_group}}">
          <name xml:lang="en">{{name_en}}</name>
          <name xml:lang="fr">{{name_fr}}</name>
          <name xml:lang="de">{{name_de}}</name>
          <geo lat="{{latitude}}" long="{{longitude}}" />
        </country>
      {{/describes}}
    {{/row}}
  </countries>
{{/tables}}

In the metadata for the CSV file I could then include:

Example 88

{
  "@context": "http://www.w3.org/ns/csvw",
  "url": "countries.csv",
  "transformations": [{
    "targetFormat": "http://www.iana.org/assignments/media-types/application/xml",
    "titles": "Simple XML version",
    "url": "xml-template.mustache",
    "scriptFormat": "https://mustache.github.io/",
    "source": "json"
  }]
}

Processors that recognised the URL for Mustache could offer users the option of passing the JSON output to a Mustache processor, which would generate:

Example 89

<countries>
  <country id="at" group="eu">
    <name xml:lang="en">Austria</name>
    <name xml:lang="fr">Autriche</name>
    <name xml:lang="de">Österreich</name>
    <geo lat="47.6965545" long="13.34598005" />
  </country>
  <country id="be" group="eu">
    <name xml:lang="en">Belgium</name>
    <name xml:lang="fr">Belgique</name>
    <name xml:lang="de">Belgien</name>
    <geo lat="50.501045" long="4.47667405" />
  </country>
  <country id="bg" group="eu">
    <name xml:lang="en">Bulgaria</name>
    <name xml:lang="fr">Bulgarie</name>
    <name xml:lang="de">Bulgarien</name>
    <geo lat="42.72567375" long="25.4823218" />
  </country>
</countries>

Processors should document the kind of transformation scripts that they can recognise and how they process them.

See also:

Transformation Definitions in [tabular-metadata]

Handling language in CSVs

There are a number of features in the CSV on the Web metadata documents that support scenarios encountered in CSV files that use different languages. We already discussed using varying number formats in and date formats in . Here we'll look at how to create metadata files, schemas and CSV files that take account of and work across multiple languages.

How do you indicate the language used by the metadata file?

The metadata file will often contain natural language text, such as titles and descriptions of columns and tables. Unless you specify otherwise, implementations will assume all this text is in an undefined language (und). If you want to specify what natural language is in use within the metadata file, you have to change the way the @context is specified. Instead of the normal value:

Example 90

"@context": "http://www.w3.org/ns/csvw"

The @context should take an array, where the first value is the usual URL as a string and the second is an object with a @language property set to the language being used within the metadata file. This example, which has English-language titles and descriptions, illustrates:

Example 91

{
  "@context": [ "http://www.w3.org/ns/csvw", { "@language": "en "} ],
  "dc:title": "Countries"
  "url": "countries.csv"
  "tableSchema": {
    "columns": [{
      "titles": "country",
      "dc:description": "The ISO two-letter code for a country, in lowercase."
    },{
      "titles": "country group",
      "dc:description": "A lowercase two-letter code for a group of countries."
    },{
      "titles": "name (en)",
      "dc:description": "The official name of the country in English."
    },{
      "titles": "name (fr)",
      "dc:description": "The official name of the country in French."
    },{
      "titles": "name (de)",
      "dc:description": "The official name of the country in German."
    },{
      "titles": "latitude",
      "dc:description": "The latitude of an indicative point in the country."
    },{
      "titles": "longitude",
      "dc:description": "The longitude of an indicative point in the country."
    }]
  }
}

See also:

default language in [tabular-metadata]

How do you provide metadata such as descriptions in different languages?

Metadata such as descriptions can be objects rather than strings. Using objects is useful when you want to provide the language for a value. In this case, the object should have two properties: a @value property holding the natural-language string and a @lang value indicating what language that string is in. For example:

Example 92

{
  "titles": "name (en)",
  "dc:description": {
    "@value": "The official name of the country in English.",
    "@lang": "en"
  }
}

You can use an array to provide the same metadata in many different languages, for example:

Example 93

"dc:title": [{
  "@lang": "en",
  "@value": "Unemployment in Europe (monthly)"
},{
  "@lang": "de",
  "@value": "Arbeitslosigkeit in Europa (monatlich)"
},{
  "@lang": "fr",
  "@value": "Le Chômage en Europe (mensuel)"
}]

Note

If you don't indicate the language used for metadata, processors will assume it's the default language used in the metadata as a whole.

See also:

Values of Common Properties in [tabular-metadata]

How do you provide titles for columns in different languages?

You can use an object as the value for the titles property for a column to provide titles in different languages. Within the object, each property is a language and the value is the title in that language:

Example 94

"titles": {
  "en": "Country",
  "de": "Land",
  "fr": "Pays"
}

Note

If you don't indicate the language used for the title of a column, processors will assume it's the default language used in the metadata as a whole.

See also:

Natural Language Properties in [tabular-metadata]

How do you specify the language of the values in a column?

Within an individual CSV file, it may be that different columns contain values that are in different languages. In the example we're using, there are three columns that each contain the name of a country, in English, French and German:

Example 95

"country","country group","name (en)","name (fr)","name (de)","latlong"
"at","eu","Austria","Autriche","Österreich","47.6965545 13.34598005"
"be","eu","Belgium","Belgique","Belgien","50.501045 4.47667405"
"bg","eu","Bulgaria","Bulgarie","Bulgarien","42.72567375 25.4823218"

Use the lang property on the column description to indicate the language of text in that column:

Example 96

{
  "@context": [ "http://www.w3.org/ns/csvw", { "@language": "en "} ],
  "dc:title": "Countries"
  "url": "countries.csv"
  "tableSchema": {
    "columns": [{
      "titles": "country",
      "dc:description": "The ISO two-letter code for a country, in lowercase."
    },{
      "titles": "country group",
      "dc:description": "A lowercase two-letter code for a group of countries."
    },{
      "titles": "name (en)",
      "dc:description": "The official name of the country in English.",
      "lang": "en"
    },{
      "titles": "name (fr)",
      "dc:description": "The official name of the country in French.",
      "lang": "fr"
    },{
      "titles": "name (de)",
      "dc:description": "The official name of the country in German.",
      "lang": "de"
    },{
      "titles": "latitude",
      "dc:description": "The latitude of an indicative point in the country."
    },{
      "titles": "longitude",
      "dc:description": "The longitude of an indicative point in the country."
    }]
  }
}

Note

There's no relationship between the language used in a metadata file and that used in the CSV file that it describes.

See also:

lang property in [tabular-metadata]

How do you indicate that tables should be displayed right-to-left?

Implementations that display tables according to the specs should mostly be able to guess whether a table should be displayed left-to-right or right-to-left based on the content of the table. Implementations will look at the content of the cells to work out which way to display their content and will look at the content of the table as a whole to work out whether to display the first column on the right or left of the page.

If you want to override the display of a particular column then you can use the textDirection property on a column description to explicitly be rtl or ltr:

Example 97

{
  "titles": "name (ar)",
  "lang": "ar",
  "textDirection": "rtl"
}

If you want to override the display of the table overall then you can use the tableDirection property on the description of the table, or for all tables in the group.

Example 98

{
  "@context": "http://www.w3.org/ns/csvw",
  "url": "results.csv",
  "tableDirection": "rtl"
}

The value of the tableDirection property is inherited to all columns in the table, so any text within this table will similarly be displayed right-to-left. This can be overridden by setting textDirection to ltr or auto (in which case the direction of the text within each cell will be determined by its contents).

See also:

textDirection property in [tabular-metadata]
tableDirection property in [tabular-metadata]
Bidirectional Tables in [tabular-data-model]

Advanced Use

How do you support units of measure?

There is no native support for expressing the units of measure for a particular column. You can, however, use documentation to tell people who are using the data what unit of measure is used for that particular column. This can be informal within the description of the column:

Example 99

{
  "titles": "distance",
  "dc:description": "Distance (kilometres)"
}

Alternatively, it can be more explicit using an existing units-of-measure property and vocabulary, such as:

Example 100

{
  "titles": "distance",
  "http://purl.org/linked-data/sdmx/2009/attribute#unitMeasure": {
    "@id": "http://qudt.org/vocab/unit#Kilometer"
  }
}

to which you could even add more detail if you wanted (this is replicating the canonical definition of definition of unit:Kilometer from the Quanitites, Units, Dimensions and Data Types (QUDT) Ontologies):

Example 101

 {
  "titles": "distance",
  "http://purl.org/linked-data/sdmx/2009/attribute#unitMeasure": {
    "@id": "http://qudt.org/vocab/unit#Kilometre",
    "@type": [
      "http://qudt.org/schema/qudt#SIUnit",
      "http://qudt.org/schema/qudt#DerivedUnit",
      "http://qudt.org/schema/qudt#LengthUnit"
    ],
    "rdfs:label": "Kilometer",
    "http://qudt.org/schema/qudt#abbreviation": "km",
    "http://qudt.org/schema/qudt#code": "1091",
    "http://qudt.org/schema/qudt#conversionMultiplier": 1000,
    "http://qudt.org/schema/qudt#conversionOffset": 0.0,
    "http://qudt.org/schema/qudt#symbol": "km",
    "skos:exactMatch": { "@id": "http://dbpedia.org/resource/Kilometre" }
  }
}

Supporting units of measure by transforming to structured values

If you are generating JSON or RDF from CSV, you may want to generate structured values that include the units of each value from the CSV file. This is a little complicated, but useful if different rows contain values that use different units. In this case, the output that you're aiming for in RDF would look something like:

Example 102

[] :distance <#row-1-distance> .

<#row-1-distance>
  schema:value 3.5 ;
  schema:unitCode <http://qudt.org/vocab/unit#Kilometer> ;
  .

and in JSON something like this:

Example 103

"distance": {
  "@id": "#row-1-distance",
  "schema:value": 3.5
  "schema:unitCode": "http://qudt.org/vocab/unit#Kilometer"
}

Note

You may want to use different properties than schema:value and schema:unitCode; if so, just use different propertyUrls.

You need to decide on a pattern for the URLs for the values themselves, and set the aboutUrl for the relevant column create that URL. In this example, the URLs that look like #row-1-distance can be generated with the pattern #row-{_row}-distance. The propertyUrl for the column needs to be schema:value as the value in the column provides the value for that property. So the column description looks like:

Example 104

{
  "name": "distance_value",
  "titles": "distance",
  "datatype": "number",
  "aboutUrl": "#row-{_row}-distance",
  "propertyUrl": "schema:value"
}

You then need to use virtual columns (descriptions of additional columns that don't exist in the source CSV) to generate the relationship between the thing whose distance is being measured and the structured value, and the additional property providing the unit for the structured value.

To generate the relationship being the thing that has the distance and the structured value, the virtual column's valueUrl needs to hold the same URL template as you used before:

Example 105

{
  "name": "distance",
  "virtual": true,
  "valueUrl": "#row-{_row}-distance"
}

To create the units property, you need another virtual column where the aboutUrl of that virtual column generates the URL for the structured value, the propertyUrl is schema:unitCode and the valueUrl is the URL representing the unit (in this case http://qudt.org/vocab/unit#Kilometre):

Example 106

{
  "name": "distance_unit",
  "aboutUrl": "#row-{_row}-distance",
  "propertyUrl": "schema:unitCode",
  "valueUrl": "http://qudt.org/vocab/unit#Kilometer"
}

If it's necessary to add more detail about the unit (e.g. the fact that it's a unit of length) this can be done with additional virtual columns:

Example 107

{
  "name": "kilometer_abbreviation",
  "virtual": true,
  "aboutUrl": "http://qudt.org/vocab/unit#Kilometer",
  "propertyUrl": "rdf:type",
  "valueUrl": "http://qudt.org/schema/qudt#LengthUnit"
}

Usually, however, processors should recognise or be able to resolve the URL for the unit to understand that it's a unit of length, if this is important for onward processing.

See also:

Quanitites, Units, Dimensions and Data Types (QUDT) Ontologies

Supporting units of measure with named datatypes in RDF

If you are generating RDF from CSV, you may want to define a datatype for a column and then provide additional information about that datatype as properties. For example, the column description could look like:

Example 108

{
  "titles": "distance",
  "datatype": {
    "@id": "http://example.org/unit/kilometre",
    "rdfs:label": "Kilometre",
    "base": "number"
  }
}

When values are generated in RDF for this column, they will be assigned the relevant datatype, for example:

Example 109

[] :distance "3.5"^^<http://example.org/unit/kilometre> .

<http://example.org/unit/kilometre> rdfs:label "Kilometre" .

Again, it is possible to include additional information about the unit being used as the datatype within the definition of the datatype:

Example 110

{
  "titles": "distance",
  "datatype": {
    "@id": "http://example.org/unit/kilometre",
    "@type": "http://example.org/quantity/length",
    "rdfs:label": "Kilometre",
    "base": "number",
    "skos:notation": "km"
  }
}

See also:

Interpreting Datatypes in [csv2rdf]

How do you support geospatial data?

There are many different ways of representing geospatial data within a CSV file, and no single best practice for doing so.

At the simplest level, it's possible to reference geospatial coordinates as points with separate columns for latitude, longitude and if necessary altitude (or using a different spatial reference system). This enables separate validation for the separate coordinates. The example used throughout this primer uses this setup:

Example 111

"country","country group","name (en)","name (fr)","name (de)","latitude","longitude"
"at","eu","Austria","Autriche","Österreich","47.6965545","13.34598005"
"be","eu","Belgium","Belgique","Belgien","50.501045","4.47667405"
"bg","eu","Bulgaria","Bulgarie","Bulgarien","42.72567375","25.4823218"

Metadata can be used to provide specialist types for the values of these coordinates, to indicate that they are latitude and longitude (by mapping to the well-known schema:latitude and schema:longitude properties which specify the use of WGS84), to group the coordinates together, and to provide a link that uses the coordinates to provide a map:

Example 112

{
  "titles": "latitude",
  "name": "lat",
  "datatype": {
    "base": "number",
    "minimum": "-90",
    "maximum": "90"
  },
  "aboutUrl": "http://example.org/country/{code}#geo",
  "propertyUrl": "schema:latitude"
}, {
  "titles": "longitude",
  "name": "long",
  "datatype": {
    "base": "number",
    "minimum": "-180",
    "maximum": "180"
  },
  "aboutUrl": "http://example.org/country/{code}#geo",
  "propertyUrl": "schema:longitude"
}, {
  "virtual": true,
  "propertyUrl": "schema:geo",
  "valueUrl": "http://example.org/country/{code}#geo"
}, {
  "virtual": true,
  "aboutUrl": "http://example.org/country/{code}#geo",
  "propertyUrl": "rdf:type",
  "valueUrl": "schema:GeoCoordinates"
}, {
  "virtual": true,
  "propertyUrl": "schema:hasMap",
  "valueUrl": "https://www.google.com/maps/@{lat},{long},7z"
}

Note

You can put latitude and longitude into a single column, with a character separator between the numbers as shown in . However, this makes it harder to accurately validate the individual coordinates. They also cannot be separated out into separate property values when converting to JSON or RDF. So this is a more restrictive method and best avoided.

CSV files may also need to contain geometries beyond individual points. There are no built-in formats for geometries recognised by implementations of CSV on the Web. Geometries may be expressed using GeoJSON, GML, KML or OGC Well-Known Text (WKT) representations. In each case, schemas may indicate that columns containing geometries adhere to a particular type:

json if the format is JSON-based such as GeoJSON
xml if the format is XML-based such as GML or KML
string if the format is text-based such as OGC WKT

You can use the format property to further constrain the content of columns containing these values. For example, you could use:

Example 113

"datatype": {
  "@id": "http://geojson.org/",
  "base": "json",
  "format": "\\{ ?\"type\": ?\"Polygon\",.+\\}"
}

to indicate that a column contains a Polygon in GeoJSON, or:

Example 114

"datatype": {
  "@id": "http://www.iana.org/assignments/media-types/application/gml+xml",
  "base": "xml",
  "format": ".*\\<gml:Point xmlns:gml=\"http://www\.opengis\.net/ont/gml" srsName=\"([^\"])+\".*\\>.+\\</gml:Point\\>"
}

to indicate that a column contains a GML Point, or:

Example 115

"datatype": {
  "@id": "http://www.iana.org/assignments/media-types/application/vnd.google-earth.kml+xml",
  "base": "xml",
  "format": ".*\\<kml xmlns=\"http://www\\.opengis\\.net/kml/2.2\"\\>.+\\</kml\\>"
}

to indicate that a column contains KML, or:

Example 116

"datatype": {
  "base": "string",
  "format": "POLYGON \\(\\(\\d+(\\.\\d+)? \\d+(\\.\\d+)?(, \\d+(\\.\\d+)? \\d+(\\.\\d+)?)+\\)\\)"
  "rdfs:seeAlso": "http://www.opengeospatial.org/standards/sfa"
}

to indicate that a column contains a very basic OGC WKT Polygon. (More sophisticated regular expressions could be used if properties held different types of objects such as points or lines or more complex polygons expressed in OGC WKT.)

Note

There is no standard way of expressing which coordinate reference system (CRS) is being used within CSV metadata, but GeoJSON, GML and OGC WKT values can include information about which CRS they use. The W3C Spatial Data on the Web Working Group intend to make recommendations in this area within [sdw-bp].

See also:

Datatypes in [tabular-metadata]
Datatypes in [tabular-data-model]

How can you specify a single schema for multiple CSV files?

CSV on the Web is designed to enable you to reuse the same schema when publishing multiple CSV files, even if those files are created by different organisations and therefore reside in different places. Rather than embedding a schema within the description of a table, the tableSchema property can point out to a schema held somewhere else.

For example, if you were a statistical agency and wanted municipalities to publish their unemployment figures using the same schema, you could specify the columns that you wanted included within the schema you specified at http://example.org/schema/unemployment.json:

Example 117

{
  "columns": [{
    "name": "municipality"
  }, {
    "name": "month",
    "datatype": "gYearMonth"
  }, {
    "name": "unemployment",
    "datatype": "number"
  }]
}

Experience shows that publishers of data in CSV files often use their own headings for the columns. So long as these don't change the meaning of the column, as a statistical agency you probably don't care. The titles property allows you to provide multiple alternative titles that people may use in an array:

Example 118

{
  "columns": [{
    "name": "municipality",
    "titles": [ "Municipality", "City", "Area" ]
  }, {
    "name": "month",
    "titles": [ "Month", "Period" ],
    "datatype": "gYearMonth"
  }, {
    "name": "unemployment",
    "titles": [ "Unemployment Rate", "Unemployment", "Number unemployed" ]
    "datatype": "number"
  }]
}

Note

In cases where the data is being gathered from multiple countries, it may also be useful to specify multiple possible titles in different languages, as described in .

It will help consistency and processing of the data if the municipalities use a consistent set of codes to indicate which municipality the data relates to. As the statistical agency, you can supply the relevant codes in a CSV file, e.g. http://example.org/ref/municipalities.csv:

Example 119

code,name
0101,Absecon City
0102,Atlantic City
0103,Brigantine City
...

This file will of course have its own simple schema, http://example.org/schema/municipalities.json, that provides the datatype for the municipality codes and indicates that they are unique through a primary key:

Example 120

{
  "columns": [{
    "name": "code",
    "datatype": { "format": "\d{4}" }
  }, {
    "name": "name"
  }],
  "primaryKey": "code"
}

A foreign key in the schema supplied by the statistical authority ensures that the codes used in the unemployment data match up with the standard set supplied by the statistical authority:

Example 121

{
  "columns": [{
    "name": "municipality",
    "titles": [ "Municipality", "City", "Area" ]
  }, {
    "name": "month",
    "titles": [ "Month", "Period" ],
    "datatype": "gYearMonth"
  }, {
    "name": "unemployment",
    "titles": [ "Unemployment Rate", "Unemployment", "Number unemployed" ]
    "datatype": "number"
  }],
  "foreignKeys": [{
    "columnReference": "municipality",
    "reference": {
      "resource": "http://example.org/ref/municipalities.csv",
      "columnReference": "code"
    }
  }]
}

When a municipality describes their unemployment CSV, they will need to point to your schema for their data and to the centrally provided municipalities reference data and schema:

Example 122

{
  "@context": "http://www.w3.org/ns/csvw",
  "tables": [{
    "url": "http://local.example.org/data/unemployment.csv",
    "tableSchema": "http://example.org/schema/unemployment.json"
  }, {
    "url": "http://example.org/ref/municipalities.csv",
    "tableSchema": "http://example.org/schema/municipalities.json"
  }]
}

Note

A more complex example in which there is linking between pairs of files where the schemas are provided by a central authority is provided in the section Foreign Key Reference Between Schemas within [tabular-metadata].

See also:

Examples of Foreign Keys in [tabular-metadata]

How can you provide a title for a row?

It is useful to have titles for rows both for screen readers and in other displays where it may be difficult to view the complete context of a table and therefore to understand the content of a row.

The rowTitles property within a schema provides an array of columns that provide sufficient context to label the row. These may sometimes be the same as the columns used for the primaryKey in the table, but are more likely to be columns that contain human readable text.

For example, with the CSV:

Example 123

"country","country group","name (en)","name (fr)","name (de)","latitude","longitude"
"at","eu","Austria","Autriche","Österreich","47.6965545","13.34598005"
"be","eu","Belgium","Belgique","Belgien","50.501045","4.47667405"
"bg","eu","Bulgaria","Bulgarie","Bulgarien","42.72567375","25.4823218"

The rowTitles property could be set to reference the columns containing the name of each country, in the different languages available:

Example 124

"tableSchema": {
  "columns": [{
    "titles": "country"
  },{
    "titles": "country group"
  },{
    "name": "name_en",
    "titles": "name (en)",
    "lang": "en"
  },{
    "name": "name_fr",
    "titles": "name (fr)",
    "lang": "fr"
  },{
    "name": "name_de",
    "titles": "name (de)",
    "lang": "de"
  },{
    "titles": "latitude"
  },{
    "titles": "longitude"
  }],
  "rowTitles": ["name_en", "name_fr", "name_de"]
}

In this case, a screen reader or other display of the table could choose to read or display only the row title that matched the user's preferred language.

In other cases, the rowTitles property may be set to an array of columns that together provided sufficient context to understand the column (eg ["firstName", "lastName"]).

See also:

rowTitles property in [tabular-metadata]
Column and Row Labelling in [tabular-data-model]

What about CSV that isn't standard CSV?

A lot of what's called "CSV" that's published on the web isn't actually CSV. It might use something other than commas (such as tabs or semi-colons) as separators between values, or might have multiple header lines.

The specification for CSV as a format is [RFC4180]. However, this is an informational specification and and not a formal standard. Therefore, applications may deviate from it.

The metadata that's described here can be used with files that contain tabular data but that aren't CSV. You can provide guidance to processors that are trying to parse those files through the dialect property on a table description. For example, say we were dealing with a tab-separated file that contains multiple header lines at http://example.org/data/unemployment.tsv:

Example 125

"country"	"country group"	"name (en)"	"name (fr)"	"name (de)"	"latitude"	"longitude"
"Land"	"Ländergruppe"	"Name (en)"	"Name (fr)"	"Name (de)"	"Breite"	"Länge"
"pays"	"groupe de pays"	"nom (en)"	"nom (fr)"	"nom (de)"	"latitude"	"longitude"
"at"	"eu"	"Austria"	"Autriche"	"Österreich"	"47.6965545"	"13.34598005"
"be"	"eu"	"Belgium"	"Belgique"	"Belgien"	"50.501045"	"4.47667405"
"bg"	"eu"	"Bulgaria"	"Bulgarie"	"Bulgarien"	"42.72567375"	"25.4823218"

The metadata for this file could be:

Example 126

{
  "@context": "http://www.w3.org/ns/csvw",
  "url": "http://example.org/data/unemployment.tsv",
  "dialect": {
    "delimiter": "\t",
    "headerRowCount": 3
  }
}

There are a number of other properties that you can set within the dialect to cater for the large range of weird things that people do in CSV files. They are:

commentPrefix: If the file contains comment lines, set this to the character used at the start of the lines that are comments (usually that's #).
delimiter: If the file doesn't use commas as separators between values, set this to the separator that it uses.
doubleQuote: If the file uses \ to escape double quotes within values, set this to false.
encoding: If the encoding of the file is not UTF-8, set this to the encoding.
header: If the file doesn't have a header line, set this to false.
headerRowCount: If the file has more than one header line, set this to the number of header lines it has.
lineTerminators: If the file uses an unusual character at the end of its lines, set this to that character.
quoteChar: If the file doesn't use double quotes (") around values that contain commas, set this to the character that it does use.
skipBlankRows: If the file contains blank rows that should just be ignored, set this to true.
skipColumns: If the file has some columns at the start that don't contain useful information, set this to that number of columns.
skipInitialSpace: If values in the file sometimes start with whitespace that should be ignored, set this to true.
skipRows: If the file has some rows at the start that don't contain useful information, set this to that number of rows. (Sometimes people put metadata at the start of a CSV file, before the header lines.)
trim: If you don't want to ignore whitespace around values, set this to false. If you want to only ignore whitespace at the beginning of values, set it to start and if you want to only ignore whitespace at the end of values, to end. By default whitespace at the start and the end of a value will be stripped away.

See also:

Parsing Tabular Data in [tabular-data-model]
Dialect Descriptions in [tabular-metadata]

What about tables in HTML?

Tables in HTML are a bit more complicated than tables in CSV files. However, it is possible to use the metadata described here and in [tabular-metadata] to described HTML tables by embedding the metadata within a script element in the header of the HTML page. A description and example of how to do this is provided in the [csvw-html] Note.

See also:

[csvw-html]

What if you want to put metadata files somewhere else?

By default, if you point a processor at a CSV file on the web, they will look for a Link header within the response with rel="describedby" and specifying a content type of application/csvm+json, application/ld+json or application/json. If you're publishing CSV files and it's possible to set the Link header, this is the best way of telling processors where to look for metadata files.

If processors don't see an appropriate Link header, they will append -metadata.json to the end of the URL of the CSV file to try to find metadata for it. If they can't find a metadata file there, they will look in the directory containing the CSV file for a file called csv-metadata.json and use that file.

Looking for files in this way isn't appropriate on all servers. As a publisher, it might be that you have URLs for CSV files that are automatically generated using queries, such as http://example.org/data?format=csv&x=15&y=53 or that you want metadata to live in a separate subdirectory or central location.

As a publisher, if you can't or don't want to use the Link header, you can control where processors look for metadata for your CSV files by listing the locations to look at within the /.well-known/csvm file on your server. (This is a well-known location as defined in [RFC5785].) This file should contain a list of URL patterns which will be expanded by substituting url for the URL of the CSV file, and then resolving against the location of the CSV file. Lines might look like:

Example 127

/metadata?for={url}
?format=metadata
{+url}m

With a CSV file at http://example.org/data.csv this would lead the processor to search for metadata at:

Example 128

http://example.org/metadata?for=http://example.org/data.csv
http://example.org/data.csv?format=metadata
http://example.org/data.csvm

See also:

Locating Metadata in [tabular-data-model]

CSV on the Web: A Primer

W3C Editor's Draft 02 November 2022

Abstract

Status of This Document

Introduction

What is tabular data and CSV?

How can you provide extra information about CSV data?

How can you provide extra information about the columns in a CSV file?

What tools implement or use CSV on the Web?

Documenting CSVs

How can you provide documentation about a CSV file?

How can you provide documentation about the columns in a CSV file?

What about when metadata like this is structured?

How should you annotate individual cells?

Validating CSVs

How can you say what kinds of values are expected in a column?

How do you define new datatypes?

How do you restrict what kind of strings a column contains?

How do you restrict the size of numbers a column contains?

How do you ensure that decimal numbers have a particular precision or leading zeros?

How do you validate numbers that aren't in standard numeric formats?

How do you restrict what kind of dates and times a column contains?

How do you validate dates that aren't in standard date or time formats?

How do you validate boolean values that aren't true or false?

How do you specify a list of valid values for a column?

Using a regular expression to give a list of valid values

Using a separate CSV file to give a list of valid values

How do you enable a column to have a mix of value types?

What if the cells in a column contain lists of values?

How do you ensure every row has a value for a column?

How do you indicate all the values in a column are unique?

Transforming CSVs

What can you transform CSV into?

What values get used in the output of a transformation?

What value gets used in the output if it's missing in the CSV?

How are the properties in the output of a transformation named?

How do you map values into URLs?

How do you include an identifier for the thing described by each row?

How do you indicate the type of the thing described by each row?

How do you include extra metadata in the result of the transformation?

How can you remove output from a transformation result?

How do you transform into nested structures in JSON?

How do you indicate that values should be mapped to a list rather than repeating properties in RDF?

How should you transform CSV into JSON-LD?

How should you display CSV tables in HTML?

How can you transform CSV into the DataCube vocabulary?

How can you transform CSV into other formats?

Handling language in CSVs

How do you indicate the language used by the metadata file?

How do you provide metadata such as descriptions in different languages?

How do you provide titles for columns in different languages?

How do you specify the language of the values in a column?

How do you indicate that tables should be displayed right-to-left?

Advanced Use

How do you support units of measure?

Supporting units of measure by transforming to structured values

Supporting units of measure with named datatypes in RDF

How do you support geospatial data?

How can you specify a single schema for multiple CSV files?

How can you provide a title for a row?

What about CSV that isn't standard CSV?

What about tables in HTML?

What if you want to put metadata files somewhere else?

How do you validate boolean values that aren't `true` or `false`?