Tabular data is routinely transferred on the web as "CSV", but the definition of "CSV" in practice is very loose. This document outlines a basic data model or infoset for tabular data and metadata about that tabular data. It also contains some non-normative information about a best practice syntax for tabular data, for mapping into that data model, to contribute to the standardisation of CSV syntax by the IETF. Various methods of locating metadata are also provided.

The CSV on the Web Working Group was chartered to produce a Recommendation "Access methods for CSV Metadata" as well as Recommendations for "Metadata vocabulary for CSV data" and "Mapping mechanism to transforming CSV into various Formats (e.g., RDF, JSON, or XML)". This document aims to primarily satisfy the first of those Recommendations (see ), though it also specifies an underlying model and therefore starting point for the other chartered Recommendations.

This document is based on IETF's [[!RFC4180]] which is an Informational RFC. The working group's expectation is that future suggestions to refine RFC 4180 will be relayed to the IETF (e.g. around I18N and multi-part packaging) and contribute to its discussions about moving CSV to the Standards track.

Many "CSV" files embed metadata, for example in lines before the header row of the CSV document. This specification does not define any formats for embedding metadata within CSV files, aside from the names of columns in the header row.

Introduction

Tabular data is data that is structured into rows, each of which contains information about some thing. Each row contains the same number of cells (although some of these cells may be empty), which provide values of properties of the thing described by the row. In tabular data, cells within the same column provide values for the same property of the thing described by the particular row. This is what differentiates tabular data from other line-oriented formats.

Tabular data is routinely transferred on the web in a textual format called "CSV", but the definition of "CSV" in practice is very loose. Some people use the term to mean any delimited text file. Others stick more closely to the most standard definition of CSV that there is, [[!RFC4180]]. Appendix A describes the various ways in which CSV is defined.

Tabular Data Models

There are different levels of data models for tabular data:

Core Tabular Data Model

The core tabular data model can be used to describe a table that lacks any annotations, whether those annotations are embedded within a CSV file or arise from a separate metadata document.

Data is held in a table. The properties of a table are:

A column represents a vertical arrangement of cells within a table. The properties of a column are:

A row represents a horizontal arrangement of cells within a table. The properties of a row are:

A cell represents a cell at the intersection of a row and a column within a table. The properties of a cell are:

Should an empty string value always count as a null semantic value within the core data model (i.e. when there is no other metadata telling the processor how to interpret that value), or should the semantic value be an empty string?

The presence or absence of quotes around a value within a CSV file is a syntactic detail that is not reflected in the tabular data model. In other words, there is no distinction in the model between the second value in a,,z and the second value in a,"",z.
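Python's standard csv module behaves in exactly this way, and can be used to illustrate the point (a sketch for illustration, not a normative reference):

```python
import csv
import io

# The two rows below differ only in whether the middle cell is quoted
# "on the wire"; in the tabular data model they are indistinguishable.
unquoted = next(csv.reader(io.StringIO('a,,z')))
quoted = next(csv.reader(io.StringIO('a,"",z')))

print(unquoted)  # ['a', '', 'z']
print(quoted)    # ['a', '', 'z']
```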

Annotated Tabular Data Model

An annotated table is a table that is annotated with additional metadata. The table MAY have any number of properties in addition to those provided in the core tabular data model described in which provide information about the table as a whole. The values of these properties may be lists, structured objects, or atomic values. Annotations on a table may include:

The columns within an annotated table are all annotated columns: columns which MAY have any number of properties in addition to those provided in the core tabular data model described in . The annotations on a column might provide information about how to interpret the cells in the column or information about the column as a whole. Examples of annotations might be:

The rows within an annotated table are all annotated rows: rows which MAY have any number of properties in addition to those provided in the core tabular data model described in . The annotations on a row provide additional metadata about the information held in the row, such as:

The cells within an annotated row are all annotated cells: cells which MAY have any number of properties in addition to those provided in the core tabular data model described in . The annotations on a cell provide metadata about the value held in the cell, particularly when this overrides the information provided for the annotated column and annotated row that the cell falls within. Annotations on a cell might be:

The value of an annotated cell MAY be of a datatype other than a string. For example, annotations might enable a processor to understand the string value of the cell as representing a number or a date.

The permitted types of names and values of annotations need to be spelled out here.

Grouped Tabular Data Model

A group of tables comprises a set of tables (which may be annotated tables) and a set of annotations (properties and values) that relate to the set.

Tables can be loosely related to each other simply through annotations; not all tables that are related to each other need to be grouped together. Groups of tables are useful because they can be annotated with metadata that applies to all the tables in the group.

Locating Metadata

As described in , tabular data may have a number of annotations associated with it. Here we describe the different methods that can be used to locate those annotations given a link to a CSV file.

In most methods of locating metadata described here, metadata is provided within a separate document. The syntax of this document is defined in the Metadata Vocabulary for Tabular Data specification. These documents can include things like:

When creating a set of annotations from metadata, if the same property is specified in two locations then information "closer" to the end user or interpreter of the document should override other information, and information "closer" to the document itself should override information "further" from the document. Explicitly, the order of preference is:

  1. metadata supplied by the user of the implementation that is processing the tabular data, see
  2. metadata embedded within the tabular data file itself, see
  3. metadata in a document linked to using a Link header associated with the tabular data file, see
  4. file-specific metadata in a document located based on the location of the tabular data file, see
  5. directory-specific metadata in a document located based on the location of the tabular data file, see

Processors MUST attempt to locate metadata documents based on each of these locations in order, and use all metadata documents that are successfully located in this way, merging them in the order found as defined in Merging Metadata in [[!tabular-metadata]].
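The precedence rules above can be sketched as follows. This is a deliberate simplification in which a property set by a higher-priority source simply wins; the normative merge algorithm is defined in [[!tabular-metadata]], and the function name here is illustrative only:

```python
def merge_metadata(sources):
    """Merge metadata dicts given in priority order (highest first).

    A property set by a higher-priority source is never overridden by
    a lower-priority one. This is a simplification of the merge
    algorithm defined in the Metadata Vocabulary for Tabular Data.
    """
    merged = {}
    for source in sources:
        for key, value in source.items():
            merged.setdefault(key, value)
    return merged

user = {"delimiter": ";"}                      # supplied by the user
embedded = {"delimiter": ",", "header": True}  # embedded in the file
linked = {"dc:title": "Example"}               # from a Link header

# The user's delimiter wins over the embedded one.
print(merge_metadata([user, embedded, linked]))
# {'delimiter': ';', 'header': True, 'dc:title': 'Example'}
```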

What should processors do if they have been passed a metadata file and have located a CSV file from that metadata file? Should they still check for metadata files related specifically to the CSV file itself? For example, say that an application has been pointed at a metadata file at http://example.org/metadata.json which references http://example.org/toilets.csv, but there is also a metadata file at http://example.org/toilets.csv-metadata.json. If the processor had originally been pointed to http://example.org/toilets.csv then it would have located the file-specific metadata at http://example.org/toilets.csv-metadata.json, but coming via http://example.org/metadata.json means that file-specific metadata is skipped.

If no metadata is supplied or found, then the following metadata is used as the default for extracted embedded metadata:

{
  "@type": "TableGroup",
  "resources": []
}
      

This particular value for default metadata will allow a header line to be extracted from the tabular data file to create simple metadata using header cells as column names and titles.
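That extraction can be sketched in Python using the csv module; the function name and the shape of the returned object are illustrative only:

```python
import csv
import io

def embedded_metadata(csv_text):
    """Sketch: with the default (empty) metadata, treat the first line
    as a header and use its cells as column titles."""
    header = next(csv.reader(io.StringIO(csv_text)))
    return {
        "@type": "Table",
        "tableSchema": {"columns": [{"title": title} for title in header]},
    }

meta = embedded_metadata(
    "GID,On Street,Species\n1,ADDISON AV,Celtis australis\n")
print(meta["tableSchema"]["columns"])
# [{'title': 'GID'}, {'title': 'On Street'}, {'title': 'Species'}]
```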

Overriding Metadata

Processors SHOULD provide users with the facility to provide their own metadata for tabular data files that they process.

Embedded Metadata

Most syntaxes for tabular data provide a facility for embedding metadata within the tabular data file itself. The definition of a syntax for tabular data SHOULD include a description of how the syntax maps to an annotated data model. See for a description of this for CSV files.

Link Header

When retrieving a tabular data file via HTTP, the response MAY include a Link header with rel=describedby and type=application/csv-metadata+json that points to a metadata file that describes the tabular data file. If there is more than one valid metadata file linked to through multiple Link headers, then the metadata referenced by Link headers that appear later in the response are used in preference to those referenced by earlier Link headers.
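A rough sketch of selecting the preferred metadata link is below; a production client would use a full Link header (RFC 5988) parser rather than this simplified pattern match, and the function name is illustrative:

```python
import re

def metadata_links(link_headers):
    """Extract target URLs from Link headers whose rel is "describedby"
    and whose type is application/csv-metadata+json. Later links take
    precedence, so the last entry in the returned list is preferred."""
    found = []
    for header in link_headers:
        match = re.match(r'\s*<([^>]*)>(.*)', header)
        if not match:
            continue
        url, params = match.groups()
        if ('rel="describedby"' in params and
                'type="application/csv-metadata+json"' in params):
            found.append(url)
    return found

headers = [
    '<a.json>; rel="describedby"; type="application/csv-metadata+json"',
    '<b.json>; rel="describedby"; type="application/csv-metadata+json"',
]
print(metadata_links(headers)[-1])  # b.json, the later (preferred) link
```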

Standard File Metadata

If no metadata document has been located through higher-priority mechanisms, processors MUST attempt to locate a file-specific metadata document. This method is only used if the URL of the tabular data file does not contain a query component.

In this case, the file-specific metadata document should be located at location-metadata.json, where location is the base URL of the tabular data file with the fragment component of the URL removed, if applicable. For example, if the tabular data file is at http://example.org/south-west/devon.csv then processors should attempt to locate a metadata document at http://example.org/south-west/devon.csv-metadata.json. If the metadata file found at this location does not explicitly point to the relevant tabular data file then it MUST be ignored.
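The construction of the file-specific metadata URL can be sketched as follows (the function name is illustrative):

```python
from urllib.parse import urlsplit, urlunsplit

def file_metadata_url(csv_url):
    """Return the file-specific metadata URL for a tabular data file,
    or None when the URL has a query component (in which case this
    location method is not used). Any fragment is removed first."""
    parts = urlsplit(csv_url)
    if parts.query:
        return None
    base = urlunsplit((parts.scheme, parts.netloc, parts.path, '', ''))
    return base + '-metadata.json'

print(file_metadata_url('http://example.org/south-west/devon.csv'))
# http://example.org/south-west/devon.csv-metadata.json
print(file_metadata_url('http://example.org/devon.csv?page=2'))
# None
```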

Standard Directory Metadata

If no metadata document has been located through higher-priority mechanisms, processors MUST attempt to locate a directory-level metadata document. This method is only used if the URL of the tabular data file does not contain a query component.

The common metadata document for a directory can be located by resolving the relative URL metadata.json against the base URL of the tabular data file. For example, if the tabular data file is at http://example.org/south-west/devon.csv then processors should attempt to locate a metadata file at http://example.org/south-west/metadata.json. If the metadata file found at this location does not explicitly point to the relevant tabular data file then it MUST be ignored.
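The directory-level location is a straightforward relative URL resolution, for example (the function name is illustrative):

```python
from urllib.parse import urljoin, urlsplit

def directory_metadata_url(csv_url):
    """Resolve the relative URL metadata.json against the base URL of
    the tabular data file; skipped when there is a query component."""
    if urlsplit(csv_url).query:
        return None
    return urljoin(csv_url, 'metadata.json')

print(directory_metadata_url('http://example.org/south-west/devon.csv'))
# http://example.org/south-west/metadata.json
```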

CSV+ Syntax

There is no standard for CSV, and there are many variants of CSV used on the web today. This section defines a method for outputting tabular data adhering to the core tabular data model described in into a standard, CSV-based, syntax. Compliant applications that output this format must meet each of the constraints defined in this section.

We are actively working with the IETF to develop a standard for CSV, which is outside the scope of the Working Group. The details here aim to help shape that standard based on our requirements.

This section does not seek to describe how applications that input textual tabular data should interpret it, except that any data that is in the format defined here should be understood as defined here.

This syntax is not compliant with text/csv as defined in [[!RFC4180]] in that it permits line endings other than CRLF. Support for LF line endings is important for data formats used on non-Windows platforms. However, all files that adhere to [[!RFC4180]]'s definition of CSV are compliant CSV+ files.

Content Type

The appropriate content type for a CSV+ file is text/csv. For example, when a CSV+ file is transmitted via HTTP, the HTTP response MUST include a Content-Type header with the value text/csv:

Content-Type: text/csv
        

See below for issues relating to whether we should instead define a different content type.

Encoding

CSV+ files SHOULD be encoded using UTF-8. If a CSV+ file is not encoded using UTF-8, the encoding MUST be specified through the charset parameter in the Content-Type header:

Content-Type: text/csv;charset=ISO-8859-1
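For example, the charset parameter can be read with the standard library MIME parsing available in Python (a convenience for illustration, not a requirement of this specification):

```python
from email.message import Message

# email.message.Message provides a stdlib parser for Content-Type
# headers, including media type parameters such as charset.
m = Message()
m['Content-Type'] = 'text/csv;charset=ISO-8859-1'

print(m.get_content_type())     # text/csv
print(m.get_content_charset())  # iso-8859-1 (normalised to lowercase)
```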
        

Line Endings

The ends of rows in a CSV+ file MUST be either CRLF (U+000D U+000A) or LF (U+000A). Line endings within escaped cells are not normalised.

Section 4.1.1 of RFC2046 specifies that "The canonical form of any MIME "text" subtype MUST always represent a line break as a CRLF sequence. Similarly, any occurrence of CRLF in MIME "text" MUST represent a line break. Use of CR and LF outside of line break sequences is also forbidden."

Should we be defining application/csv instead, to prevent having to adhere to this rule, or should we stick to the CRLF rule?

Lines

Each line of a CSV+ file MUST contain the same number of comma-separated values.

Values that contain commas, line endings or double quotes MUST be escaped by having the entire value wrapped in double quotes. There MUST NOT be whitespace before or after the double quotes. Within these escaped cells, any double quotes MUST be escaped with two double quotes ("").
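A sketch of this escaping rule (the function name is illustrative):

```python
def escape_cell(value):
    """Escape a single cell value: wrap it in double quotes when it
    contains a comma, a double quote or a line ending, doubling any
    embedded double quotes."""
    if any(c in value for c in ',"\r\n'):
        return '"' + value.replace('"', '""') + '"'
    return value

print(escape_cell('plain'))     # plain
print(escape_cell('a,b'))       # "a,b"
print(escape_cell('say "hi"'))  # "say ""hi"""
```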

Headers

The first line of a CSV+ file SHOULD contain a comma-separated list of names of columns. This is known as the header line and provides names for the columns. There are no constraints on these names.

If a CSV+ file does not include a header line, this MUST be specified using the header parameter of the media type:

Content-Type: text/csv;header=absent
          

Bidirectionality in CSV+ Files

Bidirectional content does not alter the definition of rows or the assignment of cells to columns. Whether or not a CSV+ file contains right-to-left characters, the first column's content is the first cell of each row, which is the text prior to the first occurrence of a comma within that row.

For example, Egyptian Referendum results are available as a CSV file at https://egelections-2011.appspot.com/Referendum2012/results/csv/EG.csv. Over the wire and in non-Unicode-aware text editors, the CSV looks like:

              
‌ا‌ل‌م‌ح‌ا‌ف‌ظ‌ة‌,‌ن‌س‌ب‌ة‌ ‌م‌و‌ا‌ف‌ق‌,‌ن‌س‌ب‌ة‌ ‌غ‌ي‌ر‌ ‌م‌و‌ا‌ف‌ق‌,‌ع‌د‌د‌ ‌ا‌ل‌ن‌ا‌خ‌ب‌ي‌ن‌,‌ا‌ل‌أ‌ص‌و‌ا‌ت‌ ‌ا‌ل‌ص‌ح‌ي‌ح‌ة‌,‌ا‌ل‌أ‌ص‌و‌ا‌ت‌ ‌ا‌ل‌ب‌ا‌ط‌ل‌ة‌,‌ن‌س‌ب‌ة‌ ‌ا‌ل‌م‌ش‌ا‌ر‌ك‌ة‌,‌م‌و‌ا‌ف‌ق‌,‌غ‌ي‌ر‌ ‌م‌و‌ا‌ف‌ق‌
‌ا‌ل‌ق‌ل‌ي‌و‌ب‌ي‌ة‌,60.0,40.0,"2,639,808","853,125","15,224",32.9,"512,055","341,070"
‌ا‌ل‌ج‌ي‌ز‌ة‌,66.7,33.3,"4,383,701","1,493,092","24,105",34.6,"995,417","497,675"
‌ا‌ل‌ق‌ا‌ه‌ر‌ة‌,43.2,56.8,"6,580,478","2,254,698","36,342",34.8,"974,371","1,280,327"
‌ق‌ن‌ا‌,84.5,15.5,"1,629,713","364,509","6,743",22.8,"307,839","56,670"
...
              
            

Within this CSV file, the first column appears as the content of each line before the first comma and is named المحافظة (appearing at the start of each row as ‌ا‌ل‌م‌ح‌ا‌ف‌ظ‌ة‌ in the example, which is displaying the relevant characters from left to right in the order they appear "on the wire").

The CSV translates to a table model that looks like:

Column / Row   | column 1  | column 2   | column 3       | column 4     | column 5        | column 6        | column 7      | column 8 | column 9
row 1 (header) | المحافظة | نسبة موافق | نسبة غير موافق | عدد الناخبين | الأصوات الصحيحة | الأصوات الباطلة | نسبة المشاركة | موافق | غير موافق
row 2          | القليوبية | 60.0 | 40.0 | 2,639,808 | 853,125 | 15,224 | 32.9 | 512,055 | 341,070
row 3          | الجيزة | 66.7 | 33.3 | 4,383,701 | 1,493,092 | 24,105 | 34.6 | 995,417 | 497,675
row 4          | القاهرة | 43.2 | 56.8 | 6,580,478 | 2,254,698 | 36,342 | 34.8 | 974,371 | 1,280,327
row 5          | قنا | 84.5 | 15.5 | 1,629,713 | 364,509 | 6,743 | 22.8 | 307,839 | 56,670

The fragment identifier #col=3 identifies the third of the columns, named نسبة غير موافق (appearing as ‌ن‌س‌ب‌ة‌ ‌غ‌ي‌ر‌ ‌م‌و‌ا‌ف‌ق‌ in the example).

[[!tabular-metadata]] defines how this table model should be displayed by compliant applications, and how metadata can affect the display. The default is for the display to be determined by the content of the table. For example, if this CSV were turned into an HTML table for display in a web page, it should be displayed with the first column on the right and the last on the left, as follows:

غير موافق موافق نسبة المشاركة الأصوات الباطلة الأصوات الصحيحة عدد الناخبين نسبة غير موافق نسبة موافق المحافظة
341,070 512,055 32.9 15,224 853,125 2,639,808 40.0 60.0 القليوبية
497,675 995,417 34.6 24,105 1,493,092 4,383,701 33.3 66.7 الجيزة
1,280,327 974,371 34.8 36,342 2,254,698 6,580,478 56.8 43.2 القاهرة
56,670 307,839 22.8 6,743 364,509 1,629,713 15.5 84.5 قنا

The fragment identifier #col=3 still identifies the third of the columns, named نسبة غير موافق, which appears in the HTML display as the third column from the right and is what those who read right-to-left would think of as the third column.

Note that this display matches that shown on the original website.

An alternative approach is for the CSV to be parsed into a table model in which the columns are numbered in reverse, for tables that are either marked as or detected to be right-to-left tables. For example, we could introduce a bidi=rtl or similar media type parameter, and use this to determine whether the first column in the table generated from the CSV is the text before the first comma in each line or the text after the last comma in the line.

In the example above, if the CSV were served with bidi=rtl, or the table was detected as being a right-to-left table, then the column numbering in the model would be reversed:

Column / Row   | column 9  | column 8   | column 7       | column 6     | column 5        | column 4        | column 3      | column 2 | column 1
row 1 (header) | المحافظة | نسبة موافق | نسبة غير موافق | عدد الناخبين | الأصوات الصحيحة | الأصوات الباطلة | نسبة المشاركة | موافق | غير موافق
row 2          | القليوبية | 60.0 | 40.0 | 2,639,808 | 853,125 | 15,224 | 32.9 | 512,055 | 341,070
row 3          | الجيزة | 66.7 | 33.3 | 4,383,701 | 1,493,092 | 24,105 | 34.6 | 995,417 | 497,675
row 4          | القاهرة | 43.2 | 56.8 | 6,580,478 | 2,254,698 | 36,342 | 34.8 | 974,371 | 1,280,327
row 5          | قنا | 84.5 | 15.5 | 1,629,713 | 364,509 | 6,743 | 22.8 | 307,839 | 56,670

This would require a change to [[RFC7111]] but that might be required by updates to the definition of text/csv in any case. With the change, the fragment identifier #col=3 would then refer to the third column from the right, named نسبة المشاركة.

If the model were defined in this way, there would be no need to determine the order of the columns when displayed using a metadata property. Columns would always be displayed with the first column (numbered 1 in the model) on the left. The final display in HTML, for example, would be exactly as above. The only difference would be that #col=3 would refer to the third column from the left.

We note that using media type parameters is problematic because publishers might not have the ability to set them on their servers, and because they can easily get lost as a file is republished or emailed between people.

We invite comment on the best way to approach bidirectionality in CSV files.

Grammar

This grammar is a generalization of that defined in [[!RFC4180]] and is included for reference only.

The EBNF used here is defined in XML 1.0 [[EBNF-NOTATION]].

We should probably place further restrictions on QCHAR and SCHAR to avoid control characters. If header weren’t optional, it would be better defined as in RFC4180, but if the syntax allows it to be optional, this would make it not an LL(1) grammar, which isn’t too much of an issue.

Parsing Tabular Data

As described in , there may be many formats which an application might interpret into the tabular data model described in , including using different separators or fixed format tables, multiple tables within a single file, or ones that have metadata lines before a table header.

Standardising the parsing of CSV is outside the chartered scope of the Working Group. This non-normative section is intended to help the creators of parsers handle the wide variety of CSV-based formats that they may encounter due to the current lack of standardisation of the format.

This section describes an algorithm for parsing formats other than the plain CSV+ format specified in and for extracting embedded metadata. The parsing algorithm depends on metadata properties found while Locating Metadata, which may come from user input (see Overriding Metadata) or from the inclusion of a dialect description within a metadata file:

encoding
The character encoding for the file, one of the encodings listed in [[!encoding]]. The default is utf-8.
line terminator
The character that is used at the end of a row. The default is CRLF.
quote character
The character that is used around escaped cells. The default is ".
escape character
The character that is used to escape the quote character within escaped cells. The default is " (such that "" is used to escape " within an escaped cell).
skip rows
The number of rows to skip at the beginning of the file, before a header row or tabular data. The default is 0.
comment prefix
A character that, when it appears at the beginning of a skipped row, indicates a comment that should be associated as a comment annotation to the table. The default is #.
header row count
The number of header rows (following the skipped rows) in the file. The default is 1.
delimiter
The separator between cells. The default is ,.
skip columns
The number of columns to skip at the beginning of each row, before any header columns. The default is 0.
header column count
The number of header columns (following the skipped columns) in each row. The default is 0.
skip blank rows
Indicates whether to ignore wholly empty rows (i.e. rows in which all the cells are empty). The default is false.
trim
Indicates whether to trim whitespace around cells; may be true, false, start or end. The default is false.
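These parameters and their defaults might be captured in code along the following lines; the property names here are illustrative, and the normative dialect description vocabulary is defined in [[!tabular-metadata]]:

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class Dialect:
    """The dialect parameters described above, with their defaults.
    Names are illustrative, not normative."""
    encoding: str = 'utf-8'
    line_terminator: str = '\r\n'
    quote_char: str = '"'
    escape_char: str = '"'          # "" escapes " within an escaped cell
    skip_rows: int = 0
    comment_prefix: str = '#'
    header_row_count: int = 1
    delimiter: str = ','
    skip_columns: int = 0
    header_column_count: int = 0
    skip_blank_rows: bool = False
    trim: Union[bool, str] = False  # True, False, 'start' or 'end'

# A tab-separated file with two lines of notes before the header:
tsv_dialect = Dialect(delimiter='\t', skip_rows=2)
print(tsv_dialect.delimiter, tsv_dialect.skip_rows)
```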

When parsing, should we:

The algorithm for parsing a document containing tabular data is as follows:

  1. Create merged metadata by locating all sources of metadata, other than embedded, as defined in Locating Metadata and merge together in priority order as defined in Merging Metadata in [[!tabular-metadata]].
  2. Read the file using the specified encoding.
  3. Find the rows. Each row ends with a line terminator, but values that are enclosed within the quote character may contain the line terminator without it indicating the end of the row. The quote character may be escaped using the escape character where it appears within cells. If the escape character is not the same as the quote character then the escape character is also used to escape the character that immediately follows it.
  4. Skip the number of rows indicated by the skip rows parameter. These form rdfs:comment annotations on the table. If a skipped row begins with the comment prefix, strip that prefix from the beginning of the row to create the content of the rdfs:comment annotation.
  5. Gather the number of header rows indicated by the header row count parameter; the remaining rows are data rows.
  6. Split the header and data rows into cells using the delimiter. Values that are enclosed within the quote character may contain the delimiter. The quote character may be escaped using the escape character where it appears within cells. If the escape character is not the same as the quote character then the escape character is also used to escape the character that immediately follows it.

    If trim is true or start then whitespace from the start of values that are not enclosed within the quote character is removed from the value. If trim is true or end then whitespace from the end of values that are not enclosed within the quote character is removed from the value.

  7. Construct embedded metadata as Table Metadata (see in [[!tabular-metadata]]) using the @context from merged metadata and setting the lang annotation to the default language from the @context.
    This changes as a result of processing the CSV and deriving embedded metadata. If the asserted metadata has @language: en, the column titles are then interpreted as being in English; this won't match the actual column headings from the CSV unless they are also in English, hence both @language: en and language: en are added to the resulting embedded metadata.
  8. In each row, ignore the number of columns indicated by the skip columns parameter. Always start from the first character in the row when counting columns (see ).
  9. Gather the number of header columns indicated by the header column count parameter. Always start from the first character in the row when counting columns (see ).
  10. Each cell within a header row that is not in a skipped or header column is a title annotation on that column.
    How are titles from multiple header rows handled? Presume that multiple header cells in the same column are joined together with a newline separator.
  11. Each cell within a header column is an annotation on the row it appears in; if there is a header row then that provides the type of the annotation for the row, otherwise it is a title annotation.
  12. If skip blank rows is true then ignore any rows in which all the cell values are empty strings.
  13. Re-create merged metadata by locating all sources of metadata, including the embedded metadata constructed above, as defined in Locating Metadata, and merge together in priority order as defined in Merging Metadata in [[!tabular-metadata]].
  14. Generate row data for each row using merged metadata by parsing the cells from each row as defined in Parsing Cells in [[!tabular-metadata]].
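A much-simplified sketch of steps 2 to 12 follows, assuming the default quote and escape characters and no skipped or header columns, and leaning on Python's csv module for quoting-aware splitting; the function name and return shape are illustrative only:

```python
import csv
import io

def parse(text, skip_rows=0, comment_prefix='#', header_row_count=1,
          skip_blank_rows=False, trim=False):
    """A much-simplified sketch of the parsing algorithm above."""
    rows = list(csv.reader(io.StringIO(text)))
    # Steps 3-4: skipped rows become comment annotations, with any
    # comment prefix stripped from the beginning.
    comments = [','.join(row).lstrip(comment_prefix)
                for row in rows[:skip_rows]]
    body = rows[skip_rows:]
    # Step 5: gather header rows; the rest are data rows.
    headers = body[:header_row_count]
    data = body[header_row_count:]
    # Step 6 (trim) and step 12 (skip blank rows), both simplified.
    if trim:
        data = [[cell.strip() for cell in row] for row in data]
    if skip_blank_rows:
        data = [row for row in data if any(cell != '' for cell in row)]
    # Step 10: header cells become title annotations on the columns.
    titles = headers[0] if headers else []
    return {'comments': comments, 'titles': titles, 'rows': data}

result = parse("# a comment\nGID,Species\n1,Celtis australis\n,\n",
               skip_rows=1, skip_blank_rows=True)
print(result['titles'])  # ['GID', 'Species']
print(result['rows'])    # [['1', 'Celtis australis']]
```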

Examples

Simple Example

A simple CSV+ file might look like:

GID,On Street,Species,Trim Cycle,Inventory Date
1,ADDISON AV,Celtis australis,Large Tree Routine Prune,10/18/2010
2,EMERSON ST,Liquidambar styraciflua,Large Tree Routine Prune,6/2/2010
          

Parsing this file results in an annotated tabular data model of a single table T with five columns and two rows. The columns have the properties shown in the following table:

id | table | number | cells      | title
C1 | T     | 1      | C1.1, C2.1 | GID
C2 | T     | 2      | C1.2, C2.2 | On Street
C3 | T     | 3      | C1.3, C2.3 | Species
C4 | T     | 4      | C1.4, C2.4 | Trim Cycle
C5 | T     | 5      | C1.5, C2.5 | Inventory Date

Expressed as Table Metadata:

{
  "@type": "Table",
  "url": "http://example.org/tree-ops.csv",
  "tableSchema": {
    "columns": [
      {"title": "GID"},
      {"title": "On Street"},
      {"title": "Species"},
      {"title": "Trim Cycle"},
      {"title": "Inventory Date"}
    ]
  }
}
          

The rows have the properties shown in the following table:

id | table | number | cells
R1 | T     | 1      | C1.1, C1.2, C1.3, C1.4, C1.5
R2 | T     | 2      | C2.1, C2.2, C2.3, C2.4, C2.5

The cells have the properties shown in the following table (note that the values of all the cells in the table are strings, denoted by the double quotes in the table below):

id   | table | column | row | string value               | value
C1.1 | T     | C1     | R1  | "1"                        | "1"
C1.2 | T     | C2     | R1  | "ADDISON AV"               | "ADDISON AV"
C1.3 | T     | C3     | R1  | "Celtis australis"         | "Celtis australis"
C1.4 | T     | C4     | R1  | "Large Tree Routine Prune" | "Large Tree Routine Prune"
C1.5 | T     | C5     | R1  | "10/18/2010"               | "10/18/2010"
C2.1 | T     | C1     | R2  | "2"                        | "2"
C2.2 | T     | C2     | R2  | "EMERSON ST"               | "EMERSON ST"
C2.3 | T     | C3     | R2  | "Liquidambar styraciflua"  | "Liquidambar styraciflua"
C2.4 | T     | C4     | R2  | "Large Tree Routine Prune" | "Large Tree Routine Prune"
C2.5 | T     | C5     | R2  | "6/2/2010"                 | "6/2/2010"
Using Overriding Metadata

The tools that the consumer of this data uses may provide a mechanism for overriding the metadata that has been provided within the file itself. For example, they might enable the consumer to add machine-readable names to the columns, or to mark the fifth column as holding a date in the format M/D/YYYY. These facilities are implementation-defined; the code for a JavaScript-based parser might look like:

data.parse({
  "column-names": ["GID", "on-street", "species", "trim-cycle", "inventory-date"],
  "datatypes": ["string", "string", "string", "string", "date"],
  "formats": [null,null,null,null,"M/D/YYYY"]
});
          	

The parser could then create an annotated tabular data model that included name annotations on the columns, datatype and format annotations on the cells, and created cells whose values were of appropriate types (in the case of this Javascript implementation, the cells in the last column would be Date objects, for example).

Assuming this kind of implementation-defined parsing, the columns would then have the properties shown in the following table:

id | table | number | cells      | name           | title
C1 | T     | 1      | C1.1, C2.1 | GID            | GID
C2 | T     | 2      | C1.2, C2.2 | on-street      | On Street
C3 | T     | 3      | C1.3, C2.3 | species        | Species
C4 | T     | 4      | C1.4, C2.4 | trim-cycle     | Trim Cycle
C5 | T     | 5      | C1.5, C2.5 | inventory-date | Inventory Date

After creating embedded metadata and merging with overriding metadata:

{
  "@type": "Table",
  "url": "http://example.org/tree-ops.csv",
  "tableSchema": {
    "columns": [
      {"name": "GID", "title": "GID", "datatype": "string"},
      {"name": "on-street", "title": "On Street", "datatype": "string"},
      {"name": "species", "title": "Species", "datatype": "string"},
      {"name": "trim-cycle", "title": "Trim Cycle", "datatype": "string"},
      {
        "name": "inventory-date",
        "title": "Inventory Date",
        "datatype": "date",
        "format": "M/D/YYYY"
      }
    ]
  }
}
          

The cells have the properties shown in the following table. Because of the overrides provided by the consumer to guide the parsing, and the way the parser works, the cells include additional annotations; the cells in the Inventory Date column (C1.5 and C2.5) also have a value that is a parsed date rather than an unparsed string.

id   | table | column | row | string value               | value                      | datatype | format
C1.1 | T     | C1     | R1  | "1"                        | "1"                        | string   |
C1.2 | T     | C2     | R1  | "ADDISON AV"               | "ADDISON AV"               | string   |
C1.3 | T     | C3     | R1  | "Celtis australis"         | "Celtis australis"         | string   |
C1.4 | T     | C4     | R1  | "Large Tree Routine Prune" | "Large Tree Routine Prune" | string   |
C1.5 | T     | C5     | R1  | "10/18/2010"               | 2010-10-18                 | date     | M/D/YYYY
C2.1 | T     | C1     | R2  | "2"                        | "2"                        | string   |
C2.2 | T     | C2     | R2  | "EMERSON ST"               | "EMERSON ST"               | string   |
C2.3 | T     | C3     | R2  | "Liquidambar styraciflua"  | "Liquidambar styraciflua"  | string   |
C2.4 | T     | C4     | R2  | "Large Tree Routine Prune" | "Large Tree Routine Prune" | string   |
C2.5 | T     | C5     | R2  | "6/2/2010"                 | 2010-06-02                 | date     | M/D/YYYY
Using a Metadata File

A similar set of annotations could be provided through a metadata file, located as discussed in and defined in [[!tabular-metadata]]. For example, this might look like:

{
  "@context": ["http://www.w3.org/ns/csvw", {"@language": "en"}],
  "url": "tree-ops.csv",
  "dc:title": "Tree Operations",
  "dc:keywords": ["tree", "street", "maintenance"],
  "dc:publisher": [{
    "schema:name": "Example Municipality",
    "schema:web": "http://example.org"
  }],
  "dc:license": "http://opendefinition.org/licenses/cc-by/",
  "dc:modified": "2010-12-31",
  "tableSchema": {
    "columns": [{
      "name": "GID",
      "title": [
        "GID",
        "Generic Identifier"
      ],
      "dc:description": "An identifier for the operation on a tree.",
      "datatype": "string",
      "required": true
    }, {
      "name": "on-street",
      "title": "On Street",
      "dc:description": "The street that the tree is on.",
      "datatype": "string"
    }, {
      "name": "species",
      "title": "Species",
      "dc:description": "The species of the tree.",
      "datatype": "string"
    }, {
      "name": "trim-cycle",
      "title": "Trim Cycle",
      "dc:description": "The operation performed on the tree.",
      "datatype": "string"
    }, {
      "name": "inventory-date",
      "title": "Inventory Date",
      "dc:description": "The date of the operation that was performed.",
      "datatype": "date",
      "format": "M/D/YYYY"
    }],
    "primaryKey": "GID"
  }
}
			      

The annotated tabular data model generated from this would be more sophisticated again. The table itself would have the following annotations:

dc:title
"Tree Operations"
dc:keywords
["tree", "street", "maintenance"]
dc:publisher
[{ "schema:name": "Example Municipality", "schema:web": "http://example.org" }]
dc:license
"http://opendefinition.org/licenses/cc-by/"
dc:modified
"2010-12-31"
primaryKey
C1

Note that the value of the primaryKey annotation is a pointer to the column C1 within the annotated tabular data model.

The columns would have the properties shown in the following table:

id | table | number | cells | name | title | dc:description
C1 | T | 1 | C1.1, C2.1 | GID | GID, Generic Identifier | An identifier for the operation on a tree.
C2 | T | 2 | C1.2, C2.2 | on-street | On Street | The street that the tree is on.
C3 | T | 3 | C1.3, C2.3 | species | Species | The species of the tree.
C4 | T | 4 | C1.4, C2.4 | trim-cycle | Trim Cycle | The operation performed on the tree.
C5 | T | 5 | C1.5, C2.5 | inventory-date | Inventory Date | The date of the operation that was performed.

(id, table, number, and cells are core properties; name, title, and dc:description are annotations.)

Thanks to the provided metadata, the cells again have the properties shown in the following table. The metadata file supplies the information to supplement the model with additional annotations; it also means that the cells in the Inventory Date column (C1.5 and C2.5) have a value that is a parsed date rather than an unparsed string.

id | table | column | row | string value | value | datatype | format
C1.1 | T | C1 | R1 | "1" | "1" | string |
C1.2 | T | C2 | R1 | "ADDISON AV" | "ADDISON AV" | string |
C1.3 | T | C3 | R1 | "Celtis australis" | "Celtis australis" | string |
C1.4 | T | C4 | R1 | "Large Tree Routine Prune" | "Large Tree Routine Prune" | string |
C1.5 | T | C5 | R1 | "10/18/2010" | 2010-10-18 | date | M/D/YYYY
C2.1 | T | C1 | R2 | "2" | "2" | string |
C2.2 | T | C2 | R2 | "EMERSON ST" | "EMERSON ST" | string |
C2.3 | T | C3 | R2 | "Liquidambar styraciflua" | "Liquidambar styraciflua" | string |
C2.4 | T | C4 | R2 | "Large Tree Routine Prune" | "Large Tree Routine Prune" | string |
C2.5 | T | C5 | R2 | "6/2/2010" | 2010-06-02 | date | M/D/YYYY
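The date conversion shown for the Inventory Date cells can be sketched in Python; the mapping from the metadata's "M/D/YYYY" format pattern to a strptime format string is an assumption made for illustration, not a mechanism defined here:

```python
from datetime import datetime

# Assumed mapping from the metadata's date format pattern to strptime.
FORMAT_MAP = {"M/D/YYYY": "%m/%d/%Y"}

def parse_cell(string_value, datatype, fmt=None):
    """Turn a cell's string value into its (typed) value."""
    if datatype == "date":
        return datetime.strptime(string_value, FORMAT_MAP[fmt]).date()
    return string_value  # "string" cells pass through unchanged

print(parse_cell("10/18/2010", "date", "M/D/YYYY"))  # → 2010-10-18
print(parse_cell("6/2/2010", "date", "M/D/YYYY"))    # → 2010-06-02
```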

Empty and Quoted Cells

The following slightly amended CSV file contains quoted and missing cell values:

GID,On Street,Species,Trim Cycle,Inventory Date
1,ADDISON AV,"Celtis australis","Large Tree Routine Prune",10/18/2010
2,,"Liquidambar styraciflua","Large Tree Routine Prune",
          

Parsing this file similarly results in an annotated tabular data model of a single table T with five columns and two rows. The columns and rows have exactly the same properties as previously, but there are two null cell values for C2.2 and C2.5. Note that the quoting of values within the CSV makes no difference to either the string value or value of the cell.

id | table | column | row | string value | value
C1.1 | T | C1 | R1 | "1" | "1"
C1.2 | T | C2 | R1 | "ADDISON AV" | "ADDISON AV"
C1.3 | T | C3 | R1 | "Celtis australis" | "Celtis australis"
C1.4 | T | C4 | R1 | "Large Tree Routine Prune" | "Large Tree Routine Prune"
C1.5 | T | C5 | R1 | "10/18/2010" | "10/18/2010"
C2.1 | T | C1 | R2 | "2" | "2"
C2.2 | T | C2 | R2 | "" | null
C2.3 | T | C3 | R2 | "Liquidambar styraciflua" | "Liquidambar styraciflua"
C2.4 | T | C4 | R2 | "Large Tree Routine Prune" | "Large Tree Routine Prune"
C2.5 | T | C5 | R2 | "" | null
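As a rough illustration of this behaviour, Python's csv module strips quoting during parsing, and a processor might then map empty strings to the model's null value:

```python
import csv
import io

data = '''GID,On Street,Species,Trim Cycle,Inventory Date
1,ADDISON AV,"Celtis australis","Large Tree Routine Prune",10/18/2010
2,,"Liquidambar styraciflua","Large Tree Routine Prune",
'''

reader = csv.reader(io.StringIO(data))
header = next(reader)
# Map empty strings to None, standing in for the model's null value.
rows = [[cell if cell != "" else None for cell in row] for row in reader]

print(rows[1])  # → ['2', None, 'Liquidambar styraciflua', 'Large Tree Routine Prune', None]
```

Note that the quoted and unquoted cells come out identically; the quoting is purely syntactic.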

Tabular Data Embedding Annotations

The following example illustrates some of the complexities that can be involved in parsing tabular data, how the flags described above can be used, and how new tabular data formats could be defined that embed additional annotations into the tabular data model.

In this example, the publishers of the data are using an internal convention to supply additional metadata about the tabular data embedded within the file itself. They are also using a tab as a separator rather than a comma.

#	publisher	City of Palo Alto
#	updated	12/31/2010
#name	GID	on-street	species	trim-cycle	inventory-date
#datatype	string	string	string	string	date:M/D/YYYY
	GID	On Street	Species	Trim Cycle	Inventory Date
	1	ADDISON AV	Celtis australis	Large Tree Routine Prune	10/18/2010
	2	EMERSON ST	Liquidambar styraciflua	Large Tree Routine Prune	6/2/2010
	        
Naive Parsing

Naive parsing of the above data, assuming a comma separator, results in a single table T with a single column and six rows: the first line is treated as a header row, and its content becomes the column's title. The column has the properties shown in the following table:

id | table | number | cells | title
C1 | T | 1 | C1.1, C2.1, C3.1, C4.1, C5.1, C6.1 | # publisher City of Palo Alto

The rows have the properties shown in the following table:

id | table | number | cells
R1 | T | 1 | C1.1
R2 | T | 2 | C2.1
R3 | T | 3 | C3.1
R4 | T | 4 | C4.1
R5 | T | 5 | C5.1
R6 | T | 6 | C6.1

The cells have the properties shown in the following table (note that the values of all the cells in the table are strings, denoted by the double quotes in the table below):

id | table | column | row | string value | value
C1.1 | T | C1 | R1 | "# updated 12/31/2010" | "# updated 12/31/2010"
C2.1 | T | C1 | R2 | "#name GID on-street species trim-cycle inventory-date" | "#name GID on-street species trim-cycle inventory-date"
C3.1 | T | C1 | R3 | "#datatype string string string string date:M/D/YYYY" | "#datatype string string string string date:M/D/YYYY"
C4.1 | T | C1 | R4 | " GID On Street Species Trim Cycle Inventory Date" | " GID On Street Species Trim Cycle Inventory Date"
C5.1 | T | C1 | R5 | " 1 ADDISON AV Celtis australis Large Tree Routine Prune 10/18/2010" | " 1 ADDISON AV Celtis australis Large Tree Routine Prune 10/18/2010"
C6.1 | T | C1 | R6 | " 2 EMERSON ST Liquidambar styraciflua Large Tree Routine Prune 6/2/2010" | " 2 EMERSON ST Liquidambar styraciflua Large Tree Routine Prune 6/2/2010"
Parsing with Flags

The consumer of the data may use the flags described above to create a more useful set of data from this file. Specifically, they could set the delimiter flag to a tab character, the skip rows flag to 4, and the skip columns flag to 1.

Setting these is done in an implementation-defined way. It could be done, for example, by sniffing the contents of the file itself, through command-line options, or by embedding a dialect description into a metadata file associated with the tabular data, which would look like:

{
  "delimiter": "\t",
  "skipRows": 4,
  "skipColumns": 1
}
	          

With these flags in operation, parsing this file results in an annotated tabular data model of a single table T with five columns and two rows which is largely the same as that created from the original simple example described in . The only difference is that because the four skipped rows began with the default comment prefix, the table itself now has four rdfs:comment annotations, with the values:

  1. publisher City of Palo Alto
  2. updated 12/31/2010
  3. name GID on-street species trim-cycle inventory-date
  4. datatype string string string string date:M/D/YYYY
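The effect of these flags can be sketched by applying them by hand; this is illustrative rather than a conformant processor, and the comment handling assumes the default "#" comment prefix:

```python
import csv
import io

data = ("#\tpublisher\tCity of Palo Alto\n"
        "#\tupdated\t12/31/2010\n"
        "#name\tGID\ton-street\tspecies\ttrim-cycle\tinventory-date\n"
        "#datatype\tstring\tstring\tstring\tstring\tdate:M/D/YYYY\n"
        "\tGID\tOn Street\tSpecies\tTrim Cycle\tInventory Date\n"
        "\t1\tADDISON AV\tCeltis australis\tLarge Tree Routine Prune\t10/18/2010\n"
        "\t2\tEMERSON ST\tLiquidambar styraciflua\tLarge Tree Routine Prune\t6/2/2010\n")

skip_rows, skip_columns = 4, 1   # values from the dialect description
comments, table = [], []
for i, row in enumerate(csv.reader(io.StringIO(data), delimiter="\t")):
    if i < skip_rows:
        # Skipped rows starting with the comment prefix become comments.
        if row[0].startswith("#"):
            text = " ".join(cell for cell in row if cell)
            comments.append(text.lstrip("# "))
        continue
    table.append(row[skip_columns:])

print(comments[0])  # → publisher City of Palo Alto
print(table[0])     # → ['GID', 'On Street', 'Species', 'Trim Cycle', 'Inventory Date']
```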
Recognising Tabular Data Formats

The conventions used in this data (invented for the purpose of this example) are in fact intended to create an annotated tabular data model which includes named annotations on the table itself, on the columns, and on the cells. The creator of these conventions could create a specification for this particular tabular data syntax and register a media type for it. The specification would include statements like:

  • A tab delimiter is always used.
  • The first column is always ignored.
  • When the first column of a row has the value "#", the second column is the name of an annotation on the table and the values of the remaining columns are concatenated to create the value of that annotation.
  • When the first column of a row has the value #name, the remaining cells in the row provide a name annotation for each column in the table.
  • When the first column of a row has the value #datatype, the remaining cells in the row provide datatype and format annotations for the cells within the relevant column, and these are interpreted to create the value for each cell in that column.
  • The first row where the first column is empty is a row of headers; these provide title annotations on the columns in the table.
  • The remaining rows make up the data of the table.
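The rules above can be sketched as a small parser; this is an illustration of the invented conventions only, with datatype/format interpretation omitted:

```python
import csv
import io

# Illustrative parser for the invented format: "#" rows annotate the
# table, "#name"/"#datatype" rows annotate columns, the first row with
# an empty first cell supplies titles, and the remaining rows are data.
def parse_custom(text):
    annotations, rows = {}, []
    names = datatypes = titles = None
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        first, rest = row[0], row[1:]
        if first == "#":
            annotations[rest[0]] = " ".join(rest[1:])
        elif first == "#name":
            names = rest
        elif first == "#datatype":
            datatypes = rest
        elif first == "" and titles is None:
            titles = rest
        else:
            rows.append(rest)
    columns = [{"name": n, "title": t, "datatype": d}
               for n, t, d in zip(names, titles, datatypes)]
    return {"annotations": annotations, "columns": columns, "rows": rows}

table = parse_custom("#\tpublisher\tCity of Palo Alto\n"
                     "#name\tGID\ton-street\n"
                     "#datatype\tstring\tstring\n"
                     "\tGID\tOn Street\n"
                     "\t1\tADDISON AV\n")
print(table["annotations"])  # → {'publisher': 'City of Palo Alto'}
```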

Parsers that recognise the format could then build a more sophisticated annotated tabular data model using only the embedded information in the tabular data file. As before, the result would be a single table T with five columns and two rows. The table itself would have two annotations:

publisher: City of Palo Alto
updated: 12/31/2010

The columns have the properties shown in the following table:

id | table | number | cells | name | title
C1 | T | 1 | C1.1, C2.1 | GID | GID
C2 | T | 2 | C1.2, C2.2 | on-street | On Street
C3 | T | 3 | C1.3, C2.3 | species | Species
C4 | T | 4 | C1.4, C2.4 | trim-cycle | Trim Cycle
C5 | T | 5 | C1.5, C2.5 | inventory-date | Inventory Date

The rows have the properties shown in the following table, exactly as in previous examples:

id | table | number | cells
R1 | T | 1 | C1.1, C1.2, C1.3, C1.4, C1.5
R2 | T | 2 | C2.1, C2.2, C2.3, C2.4, C2.5

The cells have the properties shown in the following table. Because of the way this particular tabular data format has been specified, the cells carry additional annotations; the cells in the Inventory Date column (C1.5 and C2.5) also have a value that is a parsed date rather than an unparsed string.

id | table | column | row | string value | value | datatype | format
C1.1 | T | C1 | R1 | "1" | "1" | string |
C1.2 | T | C2 | R1 | "ADDISON AV" | "ADDISON AV" | string |
C1.3 | T | C3 | R1 | "Celtis australis" | "Celtis australis" | string |
C1.4 | T | C4 | R1 | "Large Tree Routine Prune" | "Large Tree Routine Prune" | string |
C1.5 | T | C5 | R1 | "10/18/2010" | 2010-10-18 | date | M/D/YYYY
C2.1 | T | C1 | R2 | "2" | "2" | string |
C2.2 | T | C2 | R2 | "EMERSON ST" | "EMERSON ST" | string |
C2.3 | T | C3 | R2 | "Liquidambar styraciflua" | "Liquidambar styraciflua" | string |
C2.4 | T | C4 | R2 | "Large Tree Routine Prune" | "Large Tree Routine Prune" | string |
C2.5 | T | C5 | R2 | "6/2/2010" | 2010-06-02 | date | M/D/YYYY

Existing Standards

This appendix outlines various ways in which CSV is defined.

RFC 4180

[[!RFC4180]] defines CSV with the following ABNF grammar:

file = [header CRLF] record *(CRLF record) [CRLF]
header = name *(COMMA name)
record = field *(COMMA field)
name = field
field = (escaped / non-escaped)
escaped = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE
non-escaped = *TEXTDATA
COMMA = %x2C
CR = %x0D
DQUOTE =  %x22
LF = %x0A
CRLF = CR LF
TEXTDATA =  %x20-21 / %x23-2B / %x2D-7E
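As an informal check of this grammar's consequences, Python's csv module (whose default dialect is close to RFC 4180) shows that an escaped field may contain commas, CRLF, and doubled double quotes:

```python
import csv
import io

# An escaped (double-quoted) field may contain commas, CRLF line
# breaks, and doubled double quotes, per the "escaped" rule above.
data = 'a,"say ""hi""","line1\r\nline2"\r\nb,c,d\r\n'
rows = list(csv.reader(io.StringIO(data, newline='')))
print(rows[0])  # → ['a', 'say "hi"', 'line1\r\nline2']
print(rows[1])  # → ['b', 'c', 'd']
```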
        

Of particular note here are:

Excel

Excel is a common tool for both creating and reading CSV documents, and therefore the CSV that it produces is a de facto standard.

The following describes the behaviour of Microsoft Excel for Mac 2011 with an English locale. Further testing is needed to see the behaviour of Excel in other situations.

Saved CSV

Excel generates CSV files encoded using Windows-1252 with LF line endings. Characters that cannot be represented within Windows-1252 are replaced by underscores. Only those cells that need escaping (e.g. because they contain commas or double quotes) are escaped, and double quotes are escaped with two double quotes.

Dates and numbers are formatted as displayed, which means that formatting can lead to information being lost or becoming inconsistent.

Opened CSV

When opening CSV files, Excel interprets CSV files saved in UTF-8 as being encoded as Windows-1252 (whether or not a BOM is present). It correctly deals with double quoted cells, except that it converts line breaks within cells into spaces. It understands CRLF as a line break. It detects dates (formatted as YYYY-MM-DD) and formats them in the default date formatting for files.

Imported CSV

Excel provides more control when importing CSV files. However, it does not properly understand UTF-8 (with or without a BOM). It does properly understand UTF-16 and can read non-ASCII characters from a UTF-16-encoded file.

A particular quirk in the importing of CSV is that if a cell contains a line break, the final double quote that escapes the cell will be included within it.

Copied Tabular Data

When tabular data is copied from Excel, it is copied in a tab-delimited format, with LF line breaks.

Google Spreadsheets

Downloading CSV

Downloaded CSV files are encoded in UTF-8, without a BOM, and with LF line endings. Dates and numbers are formatted as they appear within the spreadsheet.

Importing CSV

CSV files can be imported as UTF-8 (with or without BOM). CRLF line endings are correctly recognised. Dates are reformatted to the default date format on load.

CSV Files in a Tabular Data Package

Tabular Data Packages place the following restrictions on CSV files:

As a starting point, CSV files included in a Tabular Data Package must conform to the RFC for CSV (RFC 4180, Common Format and MIME Type for Comma-Separated Values (CSV) Files). In addition:

  • File names MUST end with .csv
  • Files MUST be encoded as UTF-8
  • Files MUST have a single header row. This row MUST be the first row in the file.

    • Terminology: each column in the CSV file is termed a field and its name is the string in that column in the header row.
    • The name MUST be unique amongst fields and MUST contain at least one character
    • There are no further restrictions on the form of the name but it is RECOMMENDED that it contain only alphanumeric characters together with “ .-_”
  • Rows in the file MUST NOT contain more fields than are in the header row (though they may contain fewer)
  • Each file MUST have an entry in the resources array in the datapackage.json file
  • The resource metadata MUST include a tableSchema attribute whose value MUST conform to the JSON Table Schema
  • All fields in the CSV files MUST be described in the tableSchema
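The field-naming rules above can be sketched as a simple check; the regular expression below encodes the RECOMMENDED character set, which is advisory, while the uniqueness and non-emptiness checks encode the MUST rules:

```python
import re

# RECOMMENDED character set for field names: alphanumerics plus " .-_".
RECOMMENDED = re.compile(r'^[A-Za-z0-9 .\-_]+$')

def check_field_names(names):
    if len(set(names)) != len(names):
        raise ValueError("field names must be unique")
    if any(name == "" for name in names):
        raise ValueError("field names must contain at least one character")
    # Return names that merely violate the RECOMMENDED character set.
    return [n for n in names if not RECOMMENDED.match(n)]

print(check_field_names(["GID", "On Street", "Espèce"]))  # → ['Espèce']
```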

CSV files generated by different applications often vary in their syntax, e.g. in their use of quoting characters, delimiters, etc. To encourage conformance, CSV files in a Tabular Data Package SHOULD

  • Use “,” as field delimiters
  • Use CRLF (U+000D U+000A) or LF (U+000A) as line terminators

If a CSV file does not follow these rules then its specific CSV dialect MUST be documented. The resource hash for the resource in the datapackage.json descriptor MUST:

Applications processing the CSV file SHOULD use the dialect of the CSV file to guide parsing.

More details of behaviour of other tools should go here. This should include the most popular CSV parsing/generating libraries in common programming languages. Test files which include non-ASCII characters, double quotes and line breaks within cells are: