Tabular data is routinely transferred on the web in a variety of formats, including variants on CSV, tab-delimited files, fixed field formats, spreadsheets, HTML tables, and SQL dumps. This document outlines a data model or infoset for tabular data and metadata about that tabular data that can be used as a basis for validation, display, or creating other formats. It also contains some non-normative guidance for publishing tabular data as CSV and how that maps into the tabular data model.

Creating an annotated model of tabular data relies on the provision of separate metadata about the table. This specification defines how implementations should locate that metadata, given a file containing tabular data. The syntax for that metadata is defined in [[!tabular-metadata]].

The CSV on the Web Working Group was chartered to produce a Recommendation "Access methods for CSV Metadata" as well as Recommendations for "Metadata vocabulary for CSV data" and "Mapping mechanism to transforming CSV into various Formats (e.g., RDF, JSON, or XML)". This document aims to primarily satisfy the first of those Recommendations (see ), though it also specifies an underlying model and is therefore a basis for the other chartered Recommendations.

The definition of CSV used in this document is based on IETF's [[!RFC4180]], which is an Informational RFC. The working group's expectation is that future suggestions to refine RFC 4180 (e.g. around I18N and multi-part packaging) will be relayed to the IETF and contribute to its discussions about moving CSV to the Standards track.

Many files containing tabular data embed metadata, for example in lines before the header row of an otherwise standard CSV document. This specification does not define any formats for embedding metadata within CSV files, aside from the titles of columns in the header row which is defined in CSV. We would encourage groups that define tabular data formats to also define a mapping into the annotated tabular data model defined in this document.

Introduction

Tabular data is data that is structured into rows, each of which contains information about some thing. Each row contains the same number of cells (although some of these cells may be empty), which provide values of properties of the thing described by the row. In tabular data, cells within the same column provide values for the same property of the thing described by the particular row. This is what differentiates tabular data from other line-oriented formats.

Tabular data is routinely transferred on the web in a textual format called "CSV", but the definition of "CSV" in practice is very loose. Some people use the term to mean any delimited text file. Others stick more closely to the most standard definition of CSV that there is, [[!RFC4180]]. Appendix A describes the various ways in which CSV is defined. This specification refers to such files as tabular data files.

Tabular data is also transferred on the web in tab-delimited files, fixed field formats, spreadsheets, HTML tables, and SQL dumps.

In , this document defines a model for tabular data that abstracts away from the varying syntaxes that are used when exchanging tabular data. The model includes annotations, or metadata, about groups of tables, individual tables, rows, columns, and cells. These annotations are typically supplied through separate metadata files; defines how these metadata files can be located, while [[!tabular-metadata]] defines what they contain.

Once an annotated table has been created, it can be processed in various ways, such as display, validation, or conversion into other formats. This processing is described in .

This specification does not normatively define a format for exchanging tabular data. However, it does provide some best practice guidelines for publishing tabular data as CSV, in section , and for parsing both this syntax and those similar to it, in .

This specification makes use of the compact IRI syntax; please refer to the section on Compact IRIs in [[!JSON-LD]].

This specification makes use of the following namespaces:

dc:
http://purl.org/dc/terms/
rdfs:
http://www.w3.org/2000/01/rdf-schema#
schema:
http://schema.org/
xsd:
http://www.w3.org/2001/XMLSchema#

Tabular Data Models

This section defines an annotated tabular data model: a model for tables that are annotated with metadata. Annotations provide information about the cells, rows, columns, tables, and groups of tables with which they are associated. The values of these annotations may be lists, structured objects, or atomic values. Core annotations are those that affect the behavior of processors defined in this specification, but other annotations may also be present on any of the components of the model.

Annotations may be described directly in [[!tabular-metadata]], be embedded in a tabular data file, or created during the process of generating an annotated table.

In this document, the term annotation refers to any metadata associated with an object in the annotated tabular data model. These are not necessarily web annotations in the sense of [[annotation-model]].

Table groups

A group of tables comprises a set of annotated tables and a set of annotations that relate to those tables. The core annotations of a group of tables are:

Groups of tables MAY in addition have any number of annotations which provide information about the group of tables. Annotations on a group of tables may include:

Tables

An annotated table is a table that is annotated with additional metadata. The core annotations of a table are:

The table MAY in addition have any number of other annotations. Annotations on a table may include:

These arise from common properties defined within metadata documents, as defined in [[!tabular-metadata]].

Columns

A column represents a vertical arrangement of cells within a table. The core annotations of a column are:

Columns MAY in addition have any number of other annotations. These arise from common properties defined within metadata documents, as defined in [[!tabular-metadata]].

Rows

A row represents a horizontal arrangement of cells within a table. The core annotations of a row are:

Rows MAY have any number of additional annotations. The annotations on a row provide additional metadata about the information held in the row, such as:

Cells

A cell represents a cell at the intersection of a row and a column within a table. The core annotations of a cell are:

The presence or absence of quotes around a value within a CSV file is a syntactic detail that is not reflected in the tabular data model. In other words, there is no distinction in the model between the second value in a,,z and the second value in a,"",z.
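
To illustrate (a non-normative sketch using Python's standard csv module; the variable names are ours), both forms parse to the same cell values:

import csv

# Both rows yield identical cell values: the model does not record quoting.
rows = list(csv.reader(['a,,z', 'a,"",z']))
assert rows[0] == rows[1] == ['a', '', 'z']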

Cells MAY have any number of additional annotations. The annotations on a cell provide metadata about the value held in the cell, particularly when this overrides the information provided for the column and row that the cell falls within. Annotations on a cell might be:

Locating Metadata

As described in , tabular data may have a number of annotations associated with it. Here we describe the different methods that can be used to locate metadata that provides those annotations.

In most methods of locating metadata described here, metadata is provided within separate documents. The syntax of these documents is defined in [[!tabular-metadata]]. These documents can include things like:

When creating a set of annotations from metadata, if the same property is specified in two locations then information "closer" to the end user or interpreter of the document overrides other information, and information "closer" to the document itself overrides information "further" from the document. Explicitly, the order of precedence is:

  1. metadata supplied by the user of the implementation that is processing the tabular data, see
  2. metadata embedded within the tabular data file itself, see
  3. metadata in a document linked to using a Link header associated with the tabular data file, see
  4. file-specific metadata in a document located based on the location of the tabular data file, see
  5. directory-specific metadata in a document located based on the location of the tabular data file, see

Processors MUST attempt to locate the first metadata document found in these locations, in order; that document is used in addition to any metadata supplied by the user and any metadata embedded within the tabular data file itself, and all of these are merged as defined in [[!tabular-metadata]].
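
A non-normative sketch of this behaviour (in Python; the loader arguments are hypothetical stand-ins for the linked, file-specific, and directory-specific location mechanisms described below) might look like:

def locate_first_metadata(link_header_loader, file_specific_loader, directory_loader):
    """Try the linked, file-specific, and directory-specific metadata locations
    in order and return the first document found, or None if there is none.
    User-supplied and embedded metadata are handled separately and merged in
    as defined in [tabular-metadata]. Non-normative sketch."""
    for load in (link_header_loader, file_specific_loader, directory_loader):
        metadata = load()
        if metadata is not None:
            return metadata
    return None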

If no metadata is supplied or found, then the following metadata is used as the default for extracted embedded metadata:

{
  "@type": "TableGroup",
  "resources": []
}
      

Overriding Metadata

Processors SHOULD provide users with the facility to provide their own metadata for tabular data files that they process. This might be provided:

For example, a processor might be invoked with:

$ csvlint data.csv --datatypes:string,float,string,string
        

to enable the testing of the types of values in the columns of a CSV file, or with:

$ csvlint data.csv --schema:schema.json
        

to supply a schema that describes the contents of the file, against which it can be validated.

Metadata supplied in this way is called overriding, or user-supplied, metadata. Implementations should define how this is mapped into the vocabulary defined in [[!tabular-metadata]]. If such metadata describes a table or group of tables then it is the only metadata used for processing tabular data files, other than the data embedded within those files as defined in .
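
For example (a non-normative sketch; the mapping below is one possible implementation-defined choice, not mandated by this specification), a processor might translate the --datatypes option shown above into a minimal table description:

def user_datatypes_to_metadata(datatypes):
    """Map a user-supplied list of column datatypes (e.g. from a command-line
    option) into a minimal table description using the metadata vocabulary.
    One possible implementation-defined mapping; non-normative sketch."""
    return {
        "@type": "Table",
        "tableSchema": {
            "columns": [{"datatype": datatype} for datatype in datatypes]
        },
    }

print(user_datatypes_to_metadata(["string", "float", "string", "string"]))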

Embedded Metadata

Most syntaxes for tabular data provide a facility for embedding metadata within the tabular data file itself. The definition of a syntax for tabular data SHOULD include a description of how the syntax maps to an annotated data model, and in particular how any embedded metadata is mapped into the vocabulary defined in [[!tabular-metadata]]. The default dialect for CSV, as described in , will extract column titles from the first row of a CSV file. For example, consider the following CSV file:

GID,On Street,Species,Trim Cycle,Inventory Date
1,ADDISON AV,Celtis australis,Large Tree Routine Prune,10/18/2010
2,EMERSON ST,Liquidambar styraciflua,Large Tree Routine Prune,6/2/2010

The results of this can be found in .

For another example, the following tab-delimited file contains embedded metadata where it is assumed that comments may be added using a #, and that the column types may be indicated using a #datatype annotation:

# publisher City of Palo Alto
# updated 12/31/2010
#name GID on_street species trim_cycle  inventory_date
#datatype string  string  string  string  date:M/D/YYYY
  GID On Street Species Trim Cycle  Inventory Date
  1 ADDISON AV  Celtis australis  Large Tree Routine Prune  10/18/2010
  2 EMERSON ST  Liquidambar styraciflua Large Tree Routine Prune  6/2/2010
        

A processor that recognises this format may be able to extract and make sense of this embedded metadata. provides a non-normative guide on how embedded metadata commonly used within "CSV" files can be extracted.

Link Header

When retrieving a tabular data file via HTTP, the response MAY include a Link header with rel="describedby" and type="application/csvm+json", type="application/ld+json" or type="application/json" that points to a metadata file describing the tabular data file. If more than one valid metadata file is linked to through multiple Link headers, then the metadata referenced by a Link header that appears later in the response is used instead of that referenced by earlier Link headers.

For example, when the response to requesting a tab-separated file looks like:

HTTP/1.1 200 OK
Content-Type: text/tab-separated-values
...
Link: <metadata.ttl>; rel="describedby"; type="text/turtle"
Link: <metadata.json>; rel="describedby"; type="application/csvm+json"
        

an implementation should use the referenced metadata.json to supply metadata for processing the file.

If the metadata file found at this location does not explicitly include a reference to the relevant tabular data file then it MUST be ignored.
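
The following non-normative sketch (in Python; metadata_link is a name we introduce here, and a full implementation would use a proper Link header parser) illustrates selecting the last matching link when several Link headers are combined into a single header value:

import re

ACCEPTED_TYPES = {"application/csvm+json", "application/ld+json", "application/json"}

def metadata_link(link_header_value):
    """Return the target of the last link with rel="describedby" and an
    accepted type, or None. Non-normative sketch."""
    candidate = None
    for link in re.split(r',(?=\s*<)', link_header_value):
        match = re.match(r'\s*<([^>]*)>(.*)', link)
        if not match:
            continue
        target, params = match.groups()
        rel = re.search(r'rel\s*=\s*"?([^";]+)"?', params)
        typ = re.search(r'type\s*=\s*"?([^";]+)"?', params)
        if rel and rel.group(1).lower() == "describedby" and typ and typ.group(1) in ACCEPTED_TYPES:
            candidate = target  # links that appear later override earlier ones
    return candidate

header = ('<metadata.ttl>; rel="describedby"; type="text/turtle", '
          '<metadata.json>; rel="describedby"; type="application/csvm+json"')
assert metadata_link(header) == "metadata.json"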

Standard File Metadata

If metadata has not already been found, processors MUST also attempt to locate a file-specific metadata document. This method MUST NOT be used if the URL of the tabular data file contains a query component.

In this case, the file-specific metadata document should be located at location-metadata.json, where location is the base URL of the tabular data file with the fragment component of the URL removed, if applicable.

For example, if the tabular data file is at http://example.org/south-west/devon.csv then processors should attempt to locate a metadata document at http://example.org/south-west/devon.csv-metadata.json. On the other hand, if the tabular data file is at http://example.com/index.php?file=43178.csv then, because the URL of the tabular data file includes a query component, the processor should not attempt to look for a file-specific metadata document.

If the metadata file found at this location does not explicitly include a reference to the relevant tabular data file then it MUST be ignored.

Standard Directory Metadata

If metadata has not already been found, processors MUST attempt to locate a directory-level metadata document. This method MUST NOT be used if the URL of the tabular data file contains a query component.

The common metadata document for a directory can be located by resolving the relative URL metadata.json against the base URL of the tabular data file.

For example, if the tabular data file is at http://example.org/south-west/devon.csv then processors should attempt to locate a metadata file at http://example.org/south-west/metadata.json. On the other hand, if the tabular data file is at http://example.com/index.php?file=43178.csv then, because the URL of the tabular data file includes a query component, the processor should not attempt to look for a directory-level metadata document.

If the metadata file found at this location does not explicitly include a reference to the relevant tabular data file then it MUST be ignored.
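
Both of these locations can be derived from the URL of the tabular data file. A non-normative sketch (in Python; default_metadata_locations is a name we introduce here) covering the file-specific and directory-specific cases, including the query-component restriction, is:

from urllib.parse import urljoin, urlsplit

def default_metadata_locations(tabular_data_url):
    """Return the file-specific and directory-specific metadata URLs for a
    tabular data file, or an empty list if its URL has a query component.
    Non-normative sketch."""
    parts = urlsplit(tabular_data_url)
    if parts.query:
        return []
    base = parts._replace(fragment="").geturl()
    return [base + "-metadata.json", urljoin(base, "metadata.json")]

assert default_metadata_locations("http://example.org/south-west/devon.csv") == [
    "http://example.org/south-west/devon.csv-metadata.json",
    "http://example.org/south-west/metadata.json",
]
assert default_metadata_locations("http://example.com/index.php?file=43178.csv") == []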

Processing Tables

This section describes how particular types of applications should process tabular data and metadata files.

In many cases, an application will start from a metadata file. In that case, the initial metadata file is treated as overriding metadata and the application MUST NOT continue to retrieve any other available metadata about the tabular data files referenced by that initial metadata file, other than the metadata embedded within those files.

In other cases, applications will start from a tabular data file, such as a CSV file, and locate metadata from that file. This metadata will inform the processing of the tabular data file.

For example, if a validator is passed a locally authored metadata file spending.json, which contains:

{
  "tableSchema": "government-spending.csv",
  "resources": [{
    "url": "http://example.org/east-sussex-2015-03.csv",
  }, {
    "url": "http://example.org/east-sussex-2015-02.csv"
  }, ... 
  ]
}
      

the validator would validate all the listed resources, using the locally defined schema at government-spending.csv. It would still use the metadata located from the referenced CSV files; for example, when processing http://example.org/east-sussex-2015-03.csv, it would use embedded metadata within that file only, without attempting to locate any additional metadata.

If a validator is passed a tabular data file http://example.org/east-sussex-2015-03.csv, the validator would use the metadata located from the CSV file: the metadata embedded within that file, metadata referenced from any Link header found when retrieving that file, or metadata at http://example.org/east-sussex-2015-03.csv-metadata.json or http://example.org/metadata.json.

Starting with a metadata file can remove the need to perform additional requests to locate linked, file-specific, or directory-specific metadata.

Creating Annotated Tables

After locating metadata, the metadata is merged into a single description containing all annotations. When starting with a metadata file, this involves merging any overriding metadata with the provided metadata file, along with the embedded metadata for each tabular data file referenced from a table. When starting with a tabular data file, this involves locating the first metadata file as described in and merging all relevant metadata into a single descriptor.

If processing starts with a tabular data file, implementations:

  1. Retrieve the tabular data file.
  2. Retrieve any metadata supplied by the user (UMT) (see ).
  3. If UMT does not describe a table or group of tables describing the tabular data file, retrieve the first metadata file (FM) as described in :
    1. metadata referenced from a Link Header that may be returned when retrieving the tabular data file (see ).
    2. file-specific metadata in a document located based on the location of the tabular data file (see ).
    3. directory-specific metadata in a document located based on the location of the tabular data file (see ).
  4. Proceed as if the process starts with a metadata file using the merged metadata UMM = merge(UMT, FM) or UMT if no FM is loaded.

If the process starts with a metadata file:

  1. Retrieve the metadata file yielding the metadata UMM (which is treated as overriding metadata, see ).
  2. For each table in UMM in order, create one or more annotated tables:
    1. Extract the dialect description (DD) from UMM for the table associated with the tabular data file. If there is no such dialect description, extract the first available dialect description from a group of tables in which the tabular data file is described. If no dialect description is available, use the default dialect description.
    2. If the tabular data file was retrieved with a Content-Type that includes the header=absent parameter, set header to false in DD.
    3. Parse the tabular data file, using DD as a guide, to create a basic tabular data model (T) and extract embedded metadata (EM), for example from the header line.

      This specification provides a non-normative definition for parsing tabular data files, including the extraction of embedded metadata, in . No definition of the format of embedded metadata is given here; whatever syntax is used, it's assumed that metadata can be mapped to the vocabulary defined in [[!tabular-metadata]].

    4. Create the merged metadata M = merge(UMM, EM) using the merge algorithm defined in [[!tabular-metadata]].
    5. Use the metadata M to add annotations to the tabular data model T as described in Section 2 Annotating Tables in [[!tabular-metadata]].

In the case of starting with a metadata file, UMM will describe a table or group of tables, and no other metadata files will be retrieved. Thus the metadata file must provide all applicable metadata aside from that embedded within the tabular data files themselves.

Displaying Tables

We intend to include other sections here about:

  • displaying metadata about groups of tables, tables, columns, rows, and cells
  • what headings to use for columns when displaying tabular content
  • how to format values in cells

These sections are planned to be non-normative. We invite comment on whether it's useful to provide this kind of guidance.

Bidirectional Tables

There are two levels of bidirectionality to consider when displaying tables: the directionality of the table (i.e., whether the columns should be arranged left-to-right or right-to-left) and the directionality of the content of individual cells.

The tableDirection annotation provides information about the desired display of the table. If tableDirection=ltr then the first column SHOULD be displayed on the left and the last column on the right. If tableDirection=rtl then the first column SHOULD be displayed on the right and the last column on the left.

If tableDirection=default then tables SHOULD be displayed with attention to the bidirectionality of the content of the file. Specifically, the values of the cells in the table should be scanned breadth first: from the first cell in the first row, across the remaining cells in that row, and on, row by row, through to the last cell in the last row. If the first character in the table with a strong type as defined in [[!BIDI]] indicates an RTL directionality, the table should be displayed with the first column on the right and the last column on the left. Otherwise, the table should be displayed with the first column on the left and the last column on the right. Characters such as whitespace, quotes, commas, and numbers do not have a strong type, and therefore are skipped when identifying the character that determines the directionality of the table.
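
A non-normative sketch of this detection (in Python, using the standard unicodedata module; table_direction is a name we introduce here, and the cell values are passed in the scanning order just described):

import unicodedata

def table_direction(cell_values):
    """Return "rtl" or "ltr" for a table with tableDirection=default, based on
    the first character with a strong bidirectional type. Non-normative sketch."""
    for value in cell_values:
        for char in value:
            bidi = unicodedata.bidirectional(char)
            if bidi in ("R", "AL"):   # strong right-to-left character
                return "rtl"
            if bidi == "L":           # strong left-to-right character
                return "ltr"
    return "ltr"                      # no strong characters found

assert table_direction(["60.0", "40.0", "القليوبية"]) == "rtl"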

Implementations SHOULD enable user preferences to override the indicated metadata about the directionality of the table.

Once the directionality of the table has been determined, each cell within the table should be considered as a separate paragraph, as defined by the UBA in [[!BIDI]]. The default directionality for the cell is determined by looking at the textDirection property, which is an inherited property.

Thus, as defined by the UBA, if a cell contains no characters with a strong type (if it's a number or date for example) then the way the cell is displayed should be determined by the textDirection property of the cell. However, when the cell contains characters with a strong type (such as letters) then they MUST be displayed according to the Unicode Bidirectional Algorithm as described in [[!BIDI]].

Validating Tables

Validators test whether given tabular data files adhere to the structure defined within a schema. Validators MUST raise errors as defined in [[!tabular-metadata]] and in the following situations:

There is no obligation for processors to halt processing at the first error.

Converting Tables

Conversions of tabular data to other formats operate over an annotated table constructed as defined in Annotating Tables in [[!tabular-metadata]]. The mechanics of these conversions to other formats are defined in other specifications.

Conversion specifications MUST define a default mapping from an annotated table that lacks any annotations (i.e., that is equivalent to an un-annotated table).

Conversion specifications MUST use the value of the propertyUrl annotation of a column as the basis for naming machine-readable fields in the target format, such as the name of the equivalent element or attribute in XML, property in JSON, or property URI in RDF.

Conversion specifications MAY use any of the annotations defined in [[!tabular-metadata]] to adjust the mapping of an annotated table into another format.

Conversion specifications MAY define additional annotations, not defined in [[!tabular-metadata]], which are specifically used when converting to the target format of the conversion. For example, a conversion to XML might specify a http://example.org/conversion/xml/element-or-attribute property on columns that determines whether a particular column is represented through an element or an attribute in the data.

Best Practice CSV

There is no standard for CSV, and there are many variants of CSV used on the web today. This section defines a method for expressing, in CSV, tabular data that adheres to the annotated tabular data model defined in . Authors are encouraged to adhere to the constraints described in this section, as implementations should be able to process such CSV files consistently.

This syntax is not compliant with text/csv as defined in [[!RFC4180]] in that it permits line endings other than CRLF. Support for LF line endings is important for data formats that are used on non-Windows platforms. However, all files that adhere to [[!RFC4180]]'s definition of CSV meet the constraints described in this section.

We are actively working with the IETF to develop a standard for CSV, which is outside the scope of the Working Group. The details here aim to help shape that standard based on our requirements.

Content Type

The appropriate content type for a CSV file is text/csv. For example, when a CSV file is transmitted via HTTP, the HTTP response MUST include a Content-Type header with the value text/csv:

Content-Type: text/csv
        

Encoding

CSV files SHOULD be encoded using UTF-8. If a CSV file is not encoded using UTF-8, the encoding MUST be specified through the charset parameter in the Content-Type header:

Content-Type: text/csv;charset=ISO-8859-1
        

Line Endings

The ends of rows in a CSV file MUST be either CRLF (U+000D U+000A) or LF (U+000A). Line endings within escaped cells are not normalised.

Lines

Each line of a CSV file MUST contain the same number of comma-separated values.

Values that contain commas, line endings, or double quotes MUST be escaped by having the entire value wrapped in double quotes. There MUST NOT be whitespace before or after the double quotes. Within these escaped cells, any double quotes MUST be escaped with two double quotes ("").
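
The escaping rule can be summarised in a short non-normative sketch (in Python; write_cell is a name we introduce here):

def write_cell(value):
    """Serialise a single cell value: wrap it in double quotes when it contains
    a comma, a double quote, or a line ending, doubling any embedded double
    quotes. Non-normative sketch of the rule above."""
    if any(c in value for c in ',"\r\n'):
        return '"' + value.replace('"', '""') + '"'
    return value

assert write_cell('say "hi", please') == '"say ""hi"", please"'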

Headers

The first line of a CSV file SHOULD contain a comma-separated list of names of columns. This is known as the header line and provides titles for the columns. There are no constraints on these titles.

If a CSV file does not include a header line, this MUST be specified using the header parameter of the media type:

Content-Type: text/csv;header=absent
          

Bidirectionality in CSV Files

Bidirectional content does not alter the definition of rows or the assignment of cells to columns. Whether or not a CSV file contains right-to-left characters, the first column's content is the first cell of each row, which is the text prior to the first occurrence of a comma within that row.

For example, Egyptian Referendum results are available as a CSV file at https://egelections-2011.appspot.com/Referendum2012/results/csv/EG.csv. Over the wire and in non-Unicode-aware text editors, the CSV looks like:

              
‌ا‌ل‌م‌ح‌ا‌ف‌ظ‌ة‌,‌ن‌س‌ب‌ة‌ ‌م‌و‌ا‌ف‌ق‌,‌ن‌س‌ب‌ة‌ ‌غ‌ي‌ر‌ ‌م‌و‌ا‌ف‌ق‌,‌ع‌د‌د‌ ‌ا‌ل‌ن‌ا‌خ‌ب‌ي‌ن‌,‌ا‌ل‌أ‌ص‌و‌ا‌ت‌ ‌ا‌ل‌ص‌ح‌ي‌ح‌ة‌,‌ا‌ل‌أ‌ص‌و‌ا‌ت‌ ‌ا‌ل‌ب‌ا‌ط‌ل‌ة‌,‌ن‌س‌ب‌ة‌ ‌ا‌ل‌م‌ش‌ا‌ر‌ك‌ة‌,‌م‌و‌ا‌ف‌ق‌,‌غ‌ي‌ر‌ ‌م‌و‌ا‌ف‌ق‌
‌ا‌ل‌ق‌ل‌ي‌و‌ب‌ي‌ة‌,60.0,40.0,"2,639,808","853,125","15,224",32.9,"512,055","341,070"
‌ا‌ل‌ج‌ي‌ز‌ة‌,66.7,33.3,"4,383,701","1,493,092","24,105",34.6,"995,417","497,675"
‌ا‌ل‌ق‌ا‌ه‌ر‌ة‌,43.2,56.8,"6,580,478","2,254,698","36,342",34.8,"974,371","1,280,327"
‌ق‌ن‌ا‌,84.5,15.5,"1,629,713","364,509","6,743",22.8,"307,839","56,670"
...
              
            

Within this CSV file, the first column appears as the content of each line before the first comma and is named المحافظة (appearing at the start of each row as ‌ا‌ل‌م‌ح‌ا‌ف‌ظ‌ة‌ in the example, which is displaying the relevant characters from left to right in the order they appear "on the wire").

The CSV translates to a table model that looks like:

Column / Row | column 1 | column 2 | column 3 | column 4 | column 5 | column 6 | column 7 | column 8 | column 9
column names | المحافظة | نسبة موافق | نسبة غير موافق | عدد الناخبين | الأصوات الصحيحة | الأصوات الباطلة | نسبة المشاركة | موافق | غير موافق
row 1 | القليوبية | 60.0 | 40.0 | 2,639,808 | 853,125 | 15,224 | 32.9 | 512,055 | 341,070
row 2 | الجيزة | 66.7 | 33.3 | 4,383,701 | 1,493,092 | 24,105 | 34.6 | 995,417 | 497,675
row 3 | القاهرة | 43.2 | 56.8 | 6,580,478 | 2,254,698 | 36,342 | 34.8 | 974,371 | 1,280,327
row 4 | قنا | 84.5 | 15.5 | 1,629,713 | 364,509 | 6,743 | 22.8 | 307,839 | 56,670

The fragment identifier #col=3 identifies the third of the columns, named نسبة غير موافق (appearing as ‌ن‌س‌ب‌ة‌ ‌غ‌ي‌ر‌ ‌م‌و‌ا‌ف‌ق‌ in the example).

defines how this table model should be displayed by compliant applications, and how metadata can affect the display. The default is for the display to be determined by the content of the table. For example, if this CSV were turned into an HTML table for display in a web page, it should be displayed with the first column on the right and the last on the left, as follows:

غير موافق | موافق | نسبة المشاركة | الأصوات الباطلة | الأصوات الصحيحة | عدد الناخبين | نسبة غير موافق | نسبة موافق | المحافظة
341,070 | 512,055 | 32.9 | 15,224 | 853,125 | 2,639,808 | 40.0 | 60.0 | القليوبية
497,675 | 995,417 | 34.6 | 24,105 | 1,493,092 | 4,383,701 | 33.3 | 66.7 | الجيزة
1,280,327 | 974,371 | 34.8 | 36,342 | 2,254,698 | 6,580,478 | 56.8 | 43.2 | القاهرة
56,670 | 307,839 | 22.8 | 6,743 | 364,509 | 1,629,713 | 15.5 | 84.5 | قنا

The fragment identifier #col=3 still identifies the third of the columns, named نسبة غير موافق, which appears in the HTML display as the third column from the right and is what those who read right-to-left would think of as the third column.

Note that this display matches that shown on the original website.

An alternative approach is for the CSV to be parsed into a table model in which the columns are numbered in reverse, for tables which are either marked as or detected to be right-to-left tables. For example, we could introduce a bidi=rtl or similar media type parameter, and use this to determine whether the first column in the table generated from the CSV is the text before the first comma in each line or the text after the last comma in the line.

In the example above, if the CSV were served with bidi=rtl, or the table was detected as being a right-to-left table, then the column numbering in the model would be reversed:

Column / Row | column 9 | column 8 | column 7 | column 6 | column 5 | column 4 | column 3 | column 2 | column 1
row 1 (header) | المحافظة | نسبة موافق | نسبة غير موافق | عدد الناخبين | الأصوات الصحيحة | الأصوات الباطلة | نسبة المشاركة | موافق | غير موافق
row 2 | القليوبية | 60.0 | 40.0 | 2,639,808 | 853,125 | 15,224 | 32.9 | 512,055 | 341,070
row 3 | الجيزة | 66.7 | 33.3 | 4,383,701 | 1,493,092 | 24,105 | 34.6 | 995,417 | 497,675
row 4 | القاهرة | 43.2 | 56.8 | 6,580,478 | 2,254,698 | 36,342 | 34.8 | 974,371 | 1,280,327
row 5 | قنا | 84.5 | 15.5 | 1,629,713 | 364,509 | 6,743 | 22.8 | 307,839 | 56,670

This would require a change to [[RFC7111]] but that might be required by updates to the definition of text/csv in any case. With the change, the fragment identifier #col=3 would then refer to the third column from the right, named نسبة المشاركة.

If the model were defined in this way, there would be no need to determine the order of the columns when displayed using a metadata property. Columns would always be displayed with the first column (numbered 1 in the model) on the left. The final display in HTML, for example, would be exactly as above. The only difference would be that #col=3 would refer to the third column from the left.

We note that using media type parameters is problematic because publishers might not have the ability to set them on their servers, and because they can easily get lost as a file is republished or emailed between people.

We invite comment on the best way to approach bidirectionality in CSV files.

Grammar

This grammar is a generalization of that defined in [[!RFC4180]] and is included for reference only.

The EBNF used here is defined in XML 1.0 [[EBNF-NOTATION]].

We should probably place further restrictions on QCHAR and SCHAR to avoid control characters. If header weren’t optional, it would be better defined as in RFC4180, but if the syntax allows it to be optional, this would make it not an LL(1) grammar, which isn’t too much of an issue.

Parsing Tabular Data

As described in , there may be many formats which an application might interpret into the tabular data model described in , including using different separators or fixed format tables, multiple tables within a single file, or ones that have metadata lines before a table header.

Standardising the parsing of CSV is outside the chartered scope of the Working Group. This non-normative section is intended to help the creators of parsers handle the wide variety of CSV-based formats that they may encounter due to the current lack of standardisation of the format.

This section describes an algorithm for parsing formats that do not adhere to the constraints described in , as well as those that do, and extracting embedded metadata. The parsing algorithm uses the following flags. These may be set by metadata properties found while Locating Metadata, including through user input (see Overriding Metadata), or through the inclusion of a dialect description within a metadata file:

comment prefix
A character that, when it appears at the beginning of a row, indicates that the row is a comment that should be associated as a rdfs:comment annotation to the table. This is set by the commentPrefix property of a dialect description. The default is null, which means no rows are treated as comments. A value other than null may mean that the source numbers of rows are different from their numbers.
delimiter
The separator between cells, set by the delimiter property of a dialect description. The default is ,.
encoding
The character encoding for the file, one of the encodings listed in [[!encoding]], set by the encoding property of a dialect description. The default is utf-8.
escape character
The character that is used to escape the quote character within escaped cells, set by the doubleQuote property of a dialect description. The default is " (such that "" is used to escape " within an escaped cell).
header column count
The number of header columns (following the skipped columns) in each row, set by the headerColumnCount property of a dialect description. The default is 0. A value other than 0 will mean that the source numbers of columns will be different from their numbers.
header row count
The number of header rows (following the skipped rows) in the file, set by the header or headerRowCount property of a dialect description. The default is 1. A value other than 0 will mean that the source numbers of rows will be different from their numbers.
line terminator
The character that is used at the end of a row, set by the lineTerminator property of a dialect description. The default is CRLF.
quote character
The character that is used around escaped cells, set by the quoteChar property of a dialect description. The default is ".
skip blank rows
Indicates whether to ignore wholly empty rows (i.e. rows in which all the cells are empty), set by the skipBlankRows property of a dialect description. The default is false. A value other than false may mean that the source numbers of rows are different from their numbers.
skip columns
The number of columns to skip at the beginning of each row, before any header columns, set by the skipColumns property of a dialect description. The default is 0. A value other than 0 will mean that the source numbers of columns will be different from their numbers.
skip rows
The number of rows to skip at the beginning of the file, before a header row or tabular data, set by the skipRows property of a dialect description. The default is 0. A value greater than 0 will mean that the source numbers of rows will be different from their numbers.
trim
Indicates whether to trim whitespace around cells; may be true, false, start, or end, set by the skipInitialSpace or trim property of a dialect description. The default is false.
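
As a non-normative sketch, the flags and their defaults could be collected into a simple dialect structure (in Python; the Dialect class and its field names are ours, not part of the metadata vocabulary):

from dataclasses import dataclass

@dataclass
class Dialect:
    """Parsing flags described above, with their default values. Non-normative sketch."""
    comment_prefix: str | None = None
    delimiter: str = ","
    encoding: str = "utf-8"
    escape_character: str = '"'
    header_column_count: int = 0
    header_row_count: int = 1
    line_terminator: str = "\r\n"
    quote_character: str = '"'
    skip_blank_rows: bool = False
    skip_columns: int = 0
    skip_rows: int = 0
    trim: bool | str = False   # True, False, "start", or "end"

# e.g. a tab-delimited file with four leading comment rows and one skipped column:
tsv = Dialect(delimiter="\t", skip_rows=4, skip_columns=1, comment_prefix="#")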

The algorithm for using these flags to parse a document containing tabular data to create a basic annotated tabular data model and to extract embedded metadata is as follows:

  1. Create a new table T with the annotations:
  2. Create a metadata document structure M that looks like:
    {
      "@context": "http://www.w3.org/ns/csvw",
      "rdfs:comment": []
      "tableSchema": {
        "columns": []
      }
    }
              
  3. If the URL of the tabular data file being parsed is known, set the url property on M to that URL.
  4. Set source row number to 1.
  5. Read the file using the encoding.
  6. Repeat the following the number of times indicated by skip rows:
    1. Read a row to provide the row content.
    2. If the comment prefix is not null and the row content begins with the comment prefix, strip that prefix from the row content, and add the resulting string to the M.rdfs:comment array.
    3. Otherwise, if the row content is not an empty string, add the row content to the M.rdfs:comment array.
    4. Add 1 to the source row number.
  7. Repeat the following the number of times indicated by header row count:
    1. Read a row to provide the row content.
    2. If the comment prefix is not null and the row content begins with the comment prefix, strip that prefix from the row content, and add the resulting string to the M.rdfs:comment array.
    3. Otherwise, parse the row to provide a list of cell values, and:
      1. Remove the first skip columns number of values from the list of cell values.
      2. Remove the first header column count number of values from the list of cell values.
      3. For each of the remaining values at index i in the list of cell values:
        1. If the value at index i in the list of cell values is an empty string or consists only of whitespace, do nothing.
        2. Otherwise, if there is no column description object at index i in M.tableSchema.columns, create a new one with a title property whose value is an array containing a single value that is the value at index i in the list of cell values.
        3. Otherwise, add the value at index i in the list of cell values to the array at M.tableSchema.columns[i].title.
    4. Add 1 to the source row number.
  8. Set row number to 1.
  9. While it is possible to read another row, do the following:
    1. Set the source column number to 1.
    2. Read a row to provide the row content.
    3. If the comment prefix is not null and the row content begins with the comment prefix, strip that prefix from the row content, and add the resulting string to the M.rdfs:comment array.
    4. Otherwise, parse the row to provide a list of cell values, and:
      1. If all of the values in the list of cell values are empty strings, and skip blank rows is true, add 1 to the source row number and move on to process the next row.
      2. Otherwise, create a new row R, with:
      3. Append R to the rows of table T.
      4. Remove the first skip columns number of values from the list of cell values and add that number to the source column number.
      5. Remove the first header column count number of values from the list of cell values and add that number to the source column number.
      6. For each of the remaining values at index i in the list of cell values (where i starts at 1):
        1. Identify the column C at index i within the columns of table T. If there is no such column:
          1. Create a new column C with:
          2. Append C to the columns of table T (at index i).
        2. Create a new cell D, with:
        3. Append cell D to the cells of column C.
        4. Append cell D to the cells of row R (at index i).
        5. Add 1 to the source column number.
    5. Add 1 to the source row number.
  10. If M.rdfs:comment is an empty array, remove the rdfs:comment property from M.
  11. Return the table T and the embedded metadata M.

To read a row to provide row content, perform the following steps:

  1. Set the row content to an empty string.
  2. Read a character and process as follows:
    1. If the character is an escape character followed by the quote character, append both characters to the row content, and move on to process the character following the quote character.
    2. Otherwise, if the character is an escape character and the escape character is not the same as the quote character, append the escape character and the character following it to the row content and move on to process the next character.
    3. Otherwise, if the character is a quote character, append the quoted value obtained by reading a quoted value to the row content and move on to process the character following the quoted value.
    4. Otherwise, if the character is a line terminator, return the row content.
    5. Otherwise, append the character to the row content and move on to process the next character.
  3. If there are no more characters to read, return the row content.

To read a quoted value to provide a quoted value, perform the following steps:

  1. Set the quoted value to an empty string.
  2. Read the initial quote character and add a quote character to the quoted value.
  3. Read a character and process as follows:
    1. If the character is an escape character followed by the quote character, append both characters to the quoted value, and move on to process the character following the quote character.
    2. Otherwise, if the character is an escape character and the escape character is not the same as the quote character, append the escape character and the character following it to the quoted value and move on to process the next character.
    3. Otherwise, if the character is a quote character, return the quoted value.
    4. Otherwise, append the character to the quoted value and move on to process the next character.

To parse a row to provide a list of cell values, perform the following steps:

  1. Set the list of cell values to an empty list and the current cell value to an empty string.
  2. Set the quoted flag to false.
  3. Read a character and process as follows:
    1. If the character is an escape character followed by the quote character, append a quote character to the current cell value, and move on to process the character following the quote character.
    2. Otherwise, if the character is an escape character and the escape character is not the same as the quote character, append the character following the escape character to the current cell value and move on to process the next character.
    3. Otherwise, if the character is a quote character then:
      1. If quoted is false, set the quoted flag to true, and move on to process the next character. If the current cell value is not an empty string, raise an error.
      2. Otherwise, set quoted to false, and move on to process the next character. If the next character is not the delimiter, raise an error.
    4. Otherwise, if the character is a delimiter, then:
      1. If quoted is true, append the delimiter character to the current cell value and move on to process the next character.
      2. Otherwise, trim the current cell value, add the resulting trimmed cell value to the list of cell values and move on to process the next character.
    5. Otherwise, append the character to the current cell value and move on to process the next character.
  4. If there are no more characters to read, trim the current cell value, add the resulting trimmed cell value to the list of cell values and return the list of cell values.

To trim a cell value to provide a trimmed cell value, perform the following steps:

  1. Set the trimmed cell value to the provided cell value.
  2. If trim is true or start then remove any leading whitespace from the start of the trimmed cell value and move on to the next step.
  3. If trim is true or end then remove any trailing whitespace from the end of the trimmed cell value and move on to the next step.
  4. Return the trimmed cell value.
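
The row-splitting and trimming steps above can be sketched non-normatively (in Python; parse_row and trim_cell are names we introduce, and the sketch assumes the common case where the escape character is the same as the quote character):

def parse_row(row_content, delimiter=",", quote='"', trim=False):
    """Split one row's content into a list of trimmed cell values.
    Simplified, non-normative sketch of the steps above."""
    cells, current, quoted = [], "", False
    i = 0
    while i < len(row_content):
        char = row_content[i]
        if quoted and char == quote and row_content[i + 1:i + 2] == quote:
            current += quote                   # escaped quote inside a quoted cell
            i += 2
            continue
        if char == quote:
            quoted = not quoted                # opening or closing quote
        elif char == delimiter and not quoted:
            cells.append(trim_cell(current, trim))
            current = ""
        else:
            current += char
        i += 1
    cells.append(trim_cell(current, trim))
    return cells

def trim_cell(value, trim):
    """Trim whitespace from a cell value according to the trim flag."""
    if trim in (True, "start"):
        value = value.lstrip()
    if trim in (True, "end"):
        value = value.rstrip()
    return value

assert parse_row('1,"a, b",  c ', trim=True) == ["1", "a, b", "c"]
assert parse_row('a,"",z') == ["a", "", "z"]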

This parsing algorithm does not account for the possibility of there being more than one area of tabular data within a single CSV file.

Examples

Simple Example

A simple CSV file that complies with the constraints described in , at http://example.org/tree-ops.csv, might look like:

GID,On Street,Species,Trim Cycle,Inventory Date
1,ADDISON AV,Celtis australis,Large Tree Routine Prune,10/18/2010
2,EMERSON ST,Liquidambar styraciflua,Large Tree Routine Prune,6/2/2010

Parsing this file results in an annotated tabular data model of a single table T with five columns and two rows. The columns have the annotations shown in the following table:

id | table | number | source number | cells | titles
C1 | T | 1 | 1 | C1.1, C2.1 | GID
C2 | T | 2 | 2 | C1.2, C2.2 | On Street
C3 | T | 3 | 3 | C1.3, C2.3 | Species
C4 | T | 4 | 4 | C1.4, C2.4 | Trim Cycle
C5 | T | 5 | 5 | C1.5, C2.5 | Inventory Date

The extracted embedded metadata, as defined in [[!tabular-metadata]], would look like:

{
  "@type": "Table",
  "url": "http://example.org/tree-ops.csv",
  "tableSchema": {
    "columns": [
      {"title": [ "GID" ]},
      {"title": [ "On Street" ]},
      {"title": [ "Species" ]},
      {"title": [ "Trim Cycle" ]},
      {"title": [ "Inventory Date" ]}
    ]
  }
}
          

The rows have the annotations shown in the following table:

id | table | number | source number | cells
R1 | T | 1 | 2 | C1.1, C1.2, C1.3, C1.4, C1.5
R2 | T | 2 | 3 | C2.1, C2.2, C2.3, C2.4, C2.5

The source number of each row is offset by one from the number of each row because in the source CSV file, the header line is the first line. It is possible to reconstruct a [[RFC7111]] compliant reference to the first record in the original CSV file (http://example.org/tree-ops.csv#row=2) using the value of the row's source number. This enables implementations to retain provenance between the table model and the original file.
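
For example (a non-normative sketch in Python; rfc7111_row_reference is a name we introduce here):

def rfc7111_row_reference(csv_url, source_row_number):
    """Build an RFC 7111 fragment identifier for a row in the original CSV file
    from the row's source number annotation. Non-normative sketch."""
    return f"{csv_url}#row={source_row_number}"

assert rfc7111_row_reference("http://example.org/tree-ops.csv", 2) == \
    "http://example.org/tree-ops.csv#row=2"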

The cells have the annotations shown in the following table (note that the values of all the cells in the table are strings, denoted by the double quotes in the table below):

id | table | column | row | string value | value
C1.1 | T | C1 | R1 | "1" | "1"
C1.2 | T | C2 | R1 | "ADDISON AV" | "ADDISON AV"
C1.3 | T | C3 | R1 | "Celtis australis" | "Celtis australis"
C1.4 | T | C4 | R1 | "Large Tree Routine Prune" | "Large Tree Routine Prune"
C1.5 | T | C5 | R1 | "10/18/2010" | "10/18/2010"
C2.1 | T | C1 | R2 | "2" | "2"
C2.2 | T | C2 | R2 | "EMERSON ST" | "EMERSON ST"
C2.3 | T | C3 | R2 | "Liquidambar styraciflua" | "Liquidambar styraciflua"
C2.4 | T | C4 | R2 | "Large Tree Routine Prune" | "Large Tree Routine Prune"
C2.5 | T | C5 | R2 | "6/2/2010" | "6/2/2010"

Using Overriding Metadata

The tools that the consumer of this data uses may provide a mechanism for overriding the metadata that has been provided within the file itself. For example, they might enable the consumer to add machine-readable names to the columns, or to mark the fifth column as holding a date in the format M/D/YYYY. These facilities are implementation defined; the code for a Javascript-based parser might look like:

data.parse({
  "column-names": ["GID", "on_street", "species", "trim_cycle", "inventory_date"],
  "datatypes": ["string", "string", "string", "string", "date"],
  "formats": [null,null,null,null,"M/D/YYYY"]
});
          	

This is equivalent to a metadata file expressed in the syntax defined in [[!tabular-metadata]], looking like:

{
  "@type": "Table",
  "url": "http://example.org/tree-ops.csv",
  "tableSchema": {
    "columns": [{
      "name": "GID",
      "datatype": "string"
    }, {
      "name": "on_street",
      "datatype": "string"
    }, {
      "name": "species",
      "datatype": "string"
    }, {
      "name": "trim_cycle",
      "datatype": "string"
    }, {
      "name": "inventory_date",
      "datatype": {
        "base": "date",
        "format": "M/d/yyyy"
      }
    }]
  }
}
            

This would be merged with the embedded metadata found in the CSV file, providing the titles for the columns to create:

{
  "@type": "Table",
  "url": "http://example.org/tree-ops.csv",
  "tableSchema": {
    "columns": [{
      "name": "GID", 
      "title": "GID", 
      "datatype": "string"
    }, {
      "name": "on_street", 
      "title": "On Street", 
      "datatype": "string"
    }, {
      "name": "species", 
      "title": "Species", 
      "datatype": "string"
    }, {
      "name": "trim_cycle", 
      "title": "Trim Cycle", 
      "datatype": "string"
    }, {
      "name": "inventory_date",
      "title": "Inventory Date",
      "datatype": {
        "base": "date", 
        "format": "M/d/yyyy"
      }
    }]
  }
}
            

The processor can then create an annotated tabular data model that includes name annotations on the columns and datatype annotations on the cells, with cell values of the appropriate types (in the case of this Javascript implementation, the cells in the last column would be Date objects, for example).

Assuming this kind of implementation-defined parsing, the columns would then have the annotations shown in the following table:

id | table | number | source number | cells | name | title
C1 | T | 1 | 1 | C1.1, C2.1 | GID | GID
C2 | T | 2 | 2 | C1.2, C2.2 | on_street | On Street
C3 | T | 3 | 3 | C1.3, C2.3 | species | Species
C4 | T | 4 | 4 | C1.4, C2.4 | trim_cycle | Trim Cycle
C5 | T | 5 | 5 | C1.5, C2.5 | inventory_date | Inventory Date

The cells have the annotations shown in the following table. Because of the overrides provided by the consumer to guide the parsing, and the way the parser works, the cells in the Inventory Date column (cells C1.5 and C2.5) have values that are parsed dates rather than unparsed strings.

id | table | column | row | string value | value
C1.1 | T | C1 | R1 | "1" | "1"
C1.2 | T | C2 | R1 | "ADDISON AV" | "ADDISON AV"
C1.3 | T | C3 | R1 | "Celtis australis" | "Celtis australis"
C1.4 | T | C4 | R1 | "Large Tree Routine Prune" | "Large Tree Routine Prune"
C1.5 | T | C5 | R1 | "10/18/2010" | 2010-10-18
C2.1 | T | C1 | R2 | "2" | "2"
C2.2 | T | C2 | R2 | "EMERSON ST" | "EMERSON ST"
C2.3 | T | C3 | R2 | "Liquidambar styraciflua" | "Liquidambar styraciflua"
C2.4 | T | C4 | R2 | "Large Tree Routine Prune" | "Large Tree Routine Prune"
C2.5 | T | C5 | R2 | "6/2/2010" | 2010-06-02

Using a Metadata File

A similar set of annotations could be provided through a metadata file, located as discussed in and defined in [[!tabular-metadata]]. For example, this might look like:

{
  "@type": "Table",
  "url": "http://example.org/tree-ops.csv",
  "dc:title": {"@value": "Tree Operations", "@language": "en"},
  "dc:keywords": [
    {"@value": "tree", "@language": "en"},
    {"@value": "street", "@language": "en"},
    {"@value": "maintenance", "@language": "en"}
  ],
  "dc:publisher": [{
    "schema:name": "Example Municipality",
    "schema:url": {"@id": "http://example.org"}
  }],
  "dc:license": {"@id": "http://opendefinition.org/licenses/cc-by/"},
  "dc:modified": {"@value": "2010-12-31", "@type": "date"},
  "tableSchema": {
    "columns": [{
      "name": "GID",
      "title": ["GID", "Generic Identifier"],
      "dc:description": "An identifier for the operation on a tree.",
      "datatype": "string"
    }, {
      "name": "on_street",
      "title": "On Street",
      "dc:description": "The street that the tree is on.",
      "datatype": "string"
    }, {
      "name": "species",
      "title": "Species",
      "dc:description": "The species of the tree.",
      "datatype": "string"
    }, {
      "name": "trim_cycle",
      "title": "Trim Cycle",
      "dc:description": "The operation performed on the tree.",
      "datatype": "string"
    }, {
      "name": "inventory_date",
      "title": "Inventory Date",
      "dc:description": "The date of the operation that was performed.",
      "datatype": {
        "base": "date",
        "format": "M/d/yyyy"
      }
    }],
    "primaryKey": "GID"
  }
}

The annotated tabular data model generated from this would be more sophisticated again. The table itself would have the following annotations:

dc:title
{"@value": "Tree Operations", "@language": "en"}
dc:keywords
[{"@value": "tree", "@language", "en"}, {"@value": "street", "@language": "en"}, {"@value": "maintenance", "@language": "en"}]
dc:publisher
[{ "schema:name": "Example Municipality", "schema:url": {"@id": "http://example.org"} }]
dc:license
{"@id": "http://opendefinition.org/licenses/cc-by/"}
dc:modified
{"@value": "2010-12-31", "@type": "date"}

The columns would have the annotations shown in the following table:

id | table | number | source number | cells | name | titles | dc:description
C1 | T | 1 | 1 | C1.1, C2.1 | GID | GID, Generic Identifier | An identifier for the operation on a tree.
C2 | T | 2 | 2 | C1.2, C2.2 | on_street | On Street | The street that the tree is on.
C3 | T | 3 | 3 | C1.3, C2.3 | species | Species | The species of the tree.
C4 | T | 4 | 4 | C1.4, C2.4 | trim_cycle | Trim Cycle | The operation performed on the tree.
C5 | T | 5 | 5 | C1.5, C2.5 | inventory_date | Inventory Date | The date of the operation that was performed.

The rows have an additional primary key annotation, as shown in the following table:

id | table | number | source number | cells | primary key
R1 | T | 1 | 2 | C1.1, C1.2, C1.3, C1.4, C1.5 | C1.1
R2 | T | 2 | 3 | C2.1, C2.2, C2.3, C2.4, C2.5 | C2.1

Thanks to the provided metadata, the cells again have the annotations shown in the following table. The metadata file has provided the information to supplement the model with additional annotations; it also means that the cells in the Inventory Date column (C1.5 and C2.5) have a value that is a parsed date rather than an unparsed string.

id | table | column | row | string value | value
C1.1 | T | C1 | R1 | "1" | "1"
C1.2 | T | C2 | R1 | "ADDISON AV" | "ADDISON AV"
C1.3 | T | C3 | R1 | "Celtis australis" | "Celtis australis"
C1.4 | T | C4 | R1 | "Large Tree Routine Prune" | "Large Tree Routine Prune"
C1.5 | T | C5 | R1 | "10/18/2010" | 2010-10-18
C2.1 | T | C1 | R2 | "2" | "2"
C2.2 | T | C2 | R2 | "EMERSON ST" | "EMERSON ST"
C2.3 | T | C3 | R2 | "Liquidambar styraciflua" | "Liquidambar styraciflua"
C2.4 | T | C4 | R2 | "Large Tree Routine Prune" | "Large Tree Routine Prune"
C2.5 | T | C5 | R2 | "6/2/2010" | 2010-06-02

Empty and Quoted Cells

The following slightly amended CSV file contains quoted and missing cell values:

GID,On Street,Species,Trim Cycle,Inventory Date
1,ADDISON AV,"Celtis australis","Large Tree Routine Prune",10/18/2010
2,,"Liquidambar styraciflua","Large Tree Routine Prune",
          

Parsing this file similarly results in an annotated tabular data model of a single table T with five columns and two rows. The columns and rows have exactly the same annotations as previously, but there are two null cell values for C2.2 and C2.5. Note that the quoting of values within the CSV makes no difference to either the string value or value of the cell.

id | table | column | row | string value | value
C1.1 | T | C1 | R1 | "1" | "1"
C1.2 | T | C2 | R1 | "ADDISON AV" | "ADDISON AV"
C1.3 | T | C3 | R1 | "Celtis australis" | "Celtis australis"
C1.4 | T | C4 | R1 | "Large Tree Routine Prune" | "Large Tree Routine Prune"
C1.5 | T | C5 | R1 | "10/18/2010" | "10/18/2010"
C2.1 | T | C1 | R2 | "2" | "2"
C2.2 | T | C2 | R2 | "" | null
C2.3 | T | C3 | R2 | "Liquidambar styraciflua" | "Liquidambar styraciflua"
C2.4 | T | C4 | R2 | "Large Tree Routine Prune" | "Large Tree Routine Prune"
C2.5 | T | C5 | R2 | "" | null

Tabular Data Embedding Annotations

The following example illustrates some of the complexities that can be involved in parsing tabular data, how the flags described above can be used, and how new tabular data formats could be defined that embed additional annotations into the tabular data model.

In this example, the publishers of the data are using an internal convention to supply additional metadata about the tabular data embedded within the file itself. They are also using a tab as a separator rather than a comma.

#	publisher	City of Palo Alto
#	updated	12/31/2010
#name	GID	on_street	species	trim_cycle	inventory_date
#datatype	string	string	string	string	date:M/D/YYYY
	GID	On Street	Species	Trim Cycle	Inventory Date
	1	ADDISON AV	Celtis australis	Large Tree Routine Prune	10/18/2010
	2	EMERSON ST	Liquidambar styraciflua	Large Tree Routine Prune	6/2/2010
	        
Naive Parsing

Naive parsing of the above data will assume a comma separator and thus results in a single table T with a single column and six rows. The column has the annotations shown in the following table:

id | table | number | source number | cells | titles
C1 | T | 1 | 1 | C1.1, C2.1, C3.1, C4.1, C5.1, C6.1 | # publisher City of Palo Alto

The rows have the annotations shown in the following table:

id | table | number | source number | cells
R1 | T | 1 | 2 | C1.1
R2 | T | 2 | 3 | C2.1
R3 | T | 3 | 4 | C3.1
R4 | T | 4 | 5 | C4.1
R5 | T | 5 | 6 | C5.1
R6 | T | 6 | 7 | C6.1

The cells have the annotations shown in the following table (note that the values of all the cells in the table are strings, denoted by the double quotes in the table below):

id | table | column | row | string value | value
C1.1 | T | C1 | R1 | "# updated 12/31/2010" | "# updated 12/31/2010"
C2.1 | T | C1 | R2 | "#name GID on_street species trim_cycle inventory_date" | "#name GID on_street species trim_cycle inventory_date"
C3.1 | T | C1 | R3 | "#datatype string string string string date:M/D/YYYY" | "#datatype string string string string date:M/D/YYYY"
C4.1 | T | C1 | R4 | " GID On Street Species Trim Cycle Inventory Date" | " GID On Street Species Trim Cycle Inventory Date"
C5.1 | T | C1 | R5 | " 1 ADDISON AV Celtis australis Large Tree Routine Prune 10/18/2010" | " 1 ADDISON AV Celtis australis Large Tree Routine Prune 10/18/2010"
C6.1 | T | C1 | R6 | " 2 EMERSON ST Liquidambar styraciflua Large Tree Routine Prune 6/2/2010" | " 2 EMERSON ST Liquidambar styraciflua Large Tree Routine Prune 6/2/2010"

Parsing with Flags

The consumer of the data may use the flags described above to create a more useful set of data from this file. Specifically, they could set:

  • delimiter to a tab character
  • skip rows to 4
  • skip columns to 1
  • comment prefix to #

Setting these is done in an implementation-defined way. It could be done, for example, by sniffing the contents of the file itself, through command-line options, or by embedding a dialect description into a metadata file associated with the tabular data, which would look like:

{
  "delimiter": "\t",
  "skipRows": 4,
  "skipColumns": 1,
  "commentPrefix": "#"
}
	          

With these flags in operation, parsing this file results in an annotated tabular data model of a single table T with five columns and two rows which is largely the same as that created from the original simple example described in . There are three differences.

First, because the four skipped rows began with the comment prefix, the table itself now has four rdfs:comment annotations, with the values:

  1. publisher City of Palo Alto
  2. updated 12/31/2010
  3. name GID on_street species trim_cycle inventory_date
  4. datatype string string string string date:M/D/YYYY

Second, because the first column has been skipped, the source number of each of the columns is offset by one from the number of each column:

id | table | number | source number | cells | titles
C1 | T | 1 | 2 | C1.1, C2.1 | GID
C2 | T | 2 | 3 | C1.2, C2.2 | On Street
C3 | T | 3 | 4 | C1.3, C2.3 | Species
C4 | T | 4 | 5 | C1.4, C2.4 | Trim Cycle
C5 | T | 5 | 6 | C1.5, C2.5 | Inventory Date

Finally, because four additional rows have been skipped, the source number of each of the rows is offset by five from the row number (the four skipped rows plus the single header row):

id | table | number | source number | cells
R1 | T | 1 | 6 | C1.1, C1.2, C1.3, C1.4, C1.5
R2 | T | 2 | 7 | C2.1, C2.2, C2.3, C2.4, C2.5

Recognising Tabular Data Formats

The conventions used in this data (invented for the purpose of this example) are in fact intended to create an annotated tabular data model which includes named annotations on the table itself, on the columns, and on the cells. The creator of these conventions could create a specification for this particular tabular data syntax and register a media type for it. The specification would include statements like:

  • A tab delimiter is always used.
  • The first column is always ignored.
  • When the first column of a row has the value "#", the second column is the name of an annotation on the table and the values of the remaining columns are concatenated to create the value of that annotation.
  • When the first column of a row has the value #name, the remaining cells in the row provide a name annotation for each column in the table.
  • When the first column of a row has the value #datatype, the remaining cells in the row provide datatype/format annotations for the cells within the relevant column, and these are interpreted to create the value for each cell in that column.
  • The first row where the first column is empty is a row of headers; these provide title annotations on the columns in the table.
  • The remaining rows make up the data of the table.
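
For illustration only, a parser following these conventions might classify the parsed rows along these lines (a Python sketch with invented function names; the concatenation of annotation values is an assumption):

def parse_with_conventions(rows):
    """Interpret parsed rows according to the conventions above (illustrative sketch)."""
    annotations, names, datatypes, titles, data = {}, [], [], [], []
    for first, *rest in rows:
        if first == "#":            # table annotation: name in column 2, value in the rest
            annotations[rest[0]] = " ".join(rest[1:])
        elif first == "#name":      # column name annotations
            names = rest
        elif first == "#datatype":  # column datatype/format annotations
            datatypes = rest
        elif not titles:            # first row with an empty first column: column titles
            titles = rest
        else:                       # the remaining rows make up the data of the table
            data.append(rest)
    return annotations, names, datatypes, titles, data
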

Parsers that recognise the format could then build a more sophisticated annotated tabular data model using only the information embedded in the tabular data file. They would extract embedded metadata looking like:

{
  "@context": "http://www.w3.org/ns/csvw",
  "url": "tree-ops.csv",
  "dc:publisher": "City of Palo Alto",
  "dc:updated": "12/31/2010",
  "tableSchema": {
    "columns": [{
      "name": "GID",
      "title": "GID",
      "datatype": "string",
    }, {
      "name": "on_street",
      "title": "On Street",
      "datatype": "string"
    }, {
      "name": "species",
      "title": "Species",
      "datatype": "string"
    }, {
      "name": "trim_cycle",
      "title": "Trim Cycle",
      "datatype": "string"
    }, {
      "name": "inventory_date",
      "title": "Inventory Date",
      "datatype": {
        "base": "date", 
        "format": "M/d/yyyy"
      }
    }]
  }
}
            

As before, the result would be a single table T with five columns and two rows. The table itself would have two annotations:

dc:publisher
{"@value": "City of Palo Alto"}
dc:updated
{"@value": "12/31/2010"}

The columns have the annotations shown in the following table:

id   core annotations
     table   number   source number   cells        name             titles
C1   T       1        2               C1.1, C2.1   GID              GID
C2   T       2        3               C1.2, C2.2   on_street        On Street
C3   T       3        4               C1.3, C2.3   species          Species
C4   T       4        5               C1.4, C2.4   trim_cycle       Trim Cycle
C5   T       5        6               C1.5, C2.5   inventory_date   Inventory Date

The rows have the annotations shown in the following table, exactly as in previous examples:

id   core annotations
     table   number   source number   cells
R1   T       1        6               C1.1, C1.2, C1.3, C1.4, C1.5
R2   T       2        7               C2.1, C2.2, C2.3, C2.4, C2.5

The cells have the annotations shown in the following table. Because of the way the particular tabular data format has been specified, these include additional annotations but also, for the Inventory Date column (cells C1.5 and C2.5), have a value that is a parsed date rather than an unparsed string.

id     core annotations
       table   column   row   string value                 value
C1.1   T       C1       R1    "1"                          "1"
C1.2   T       C2       R1    "ADDISON AV"                 "ADDISON AV"
C1.3   T       C3       R1    "Celtis australis"           "Celtis australis"
C1.4   T       C4       R1    "Large Tree Routine Prune"   "Large Tree Routine Prune"
C1.5   T       C5       R1    "10/18/2010"                 2010-10-18
C2.1   T       C1       R2    "2"                          "2"
C2.2   T       C2       R2    "EMERSON ST"                 "EMERSON ST"
C2.3   T       C3       R2    "Liquidambar styraciflua"    "Liquidambar styraciflua"
C2.4   T       C4       R2    "Large Tree Routine Prune"   "Large Tree Routine Prune"
C2.5   T       C5       R2    "6/2/2010"                   2010-06-02

Parsing Multiple Header Lines

The following example shows a CSV file with multiple header lines:

Who,What,,Where,
Organisation,Sector,Subsector,Department,Municipality
#org,#sector,#subsector,#adm1,#adm2
UNICEF,Education,Teacher training,Chocó,Quibdó
UNICEF,Education,Teacher training,Chocó,Bojayá
          

Here, the first line contains some grouping titles, which are not particularly helpful. The lines following those contain useful titles for the columns. Thus the appropriate configuration for a dialect description is:

{
  "skipRows": 1,
  "headerRowCount": 2
}
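
A sketch of how a parser might combine the two header rows into per-column lists of titles under this configuration (Python; the helper name column_titles is illustrative):

import csv

def column_titles(lines, skip_rows=1, header_row_count=2):
    """Collect one list of titles per column from the header rows."""
    header_rows = list(csv.reader(lines[skip_rows:skip_rows + header_row_count]))
    # Zip the header rows together so each column gets all of its titles.
    return [list(titles) for titles in zip(*header_rows)]

lines = [
    "Who,What,,Where,",
    "Organisation,Sector,Subsector,Department,Municipality",
    "#org,#sector,#subsector,#adm1,#adm2",
]
# -> [['Organisation', '#org'], ['Sector', '#sector'], ...]
print(column_titles(lines))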
          

With this configuration, the table model contains five columns, each of which has two titles, summarised in the following table:

id   core annotations
     table   number   source number   cells        titles
C1   T       1        1               C1.1, C2.1   Organisation, #org
C2   T       2        2               C1.2, C2.2   Sector, #sector
C3   T       3        3               C1.3, C2.3   Subsector, #subsector
C4   T       4        4               C1.4, C2.4   Department, #adm1
C5   T       5        5               C1.5, C2.5   Municipality, #adm2

As metadata, this would look like:

{
  "tableSchema": {
    "columns": [
      { "title": ["Organisation", "#org"] }, 
      { "title": ["Sector", "#sector"] }, 
      { "title": ["Subsector", "#subsector"] }, 
      { "title": ["Department", "#adm1"] }, 
      { "title": ["Municipality", "#adm2"] }, 
    ]
  }
}
          

A separate metadata file could contain just the second of each of these titles, for example:

{
  "tableSchema": {
    "columns": [
      { "name": "org", "title": #org" }, 
      { "name": "sector", "title": #sector" }, 
      { "name": "subsector", "title": #subsector" }, 
      { "name": "adm1", "title": #adm1" }, 
      { "name": "adm2", "title": #adm2" }, 
    ]
  }
}
          

This enables people from multiple jurisdictions to use the same tabular data structures without having to use exactly the same titles within their documents.
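
For example, a consumer working from such a metadata file might match the header titles found in a particular file against any of the supplied titles in order to recover the shared column names, roughly as follows (a Python sketch; exact string comparison is assumed for simplicity):

def map_header_to_names(header_row, columns):
    """Map each header title in a file to the corresponding column name.

    `columns` is a list of {"name": ..., "title": ...} descriptions, where the
    title may be a single string or a list of strings.
    """
    mapping = {}
    for title in header_row:
        for column in columns:
            titles = column["title"]
            if title in (titles if isinstance(titles, list) else [titles]):
                mapping[title] = column["name"]
    return mapping

columns = [{"name": "org", "title": "#org"}, {"name": "sector", "title": "#sector"}]
print(map_header_to_names(["#org", "#sector"], columns))   # {'#org': 'org', '#sector': 'sector'}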

Existing Standards

This appendix outlines various ways in which CSV is defined.

RFC 4180

[[!RFC4180]] defines CSV with the following ABNF grammar:

file = [header CRLF] record *(CRLF record) [CRLF]
header = name *(COMMA name)
record = field *(COMMA field)
name = field
field = (escaped / non-escaped)
escaped = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE
non-escaped = *TEXTDATA
COMMA = %x2C
CR = %x0D
DQUOTE =  %x22
LF = %x0A
CRLF = CR LF
TEXTDATA =  %x20-21 / %x23-2B / %x2D-7E
        

Of particular note here are:

  • the header line is optional,
  • fields that contain commas, double quotes, or line breaks must be escaped by wrapping them in double quotes, with embedded double quotes doubled,
  • CRLF is used as the line ending, and
  • TEXTDATA is restricted to printable ASCII characters, which excludes non-ASCII characters entirely.
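
As an informal illustration of the escaping rule in this grammar: a field must be written in the escaped form whenever it contains a comma, a double quote, or a line break, and any double quotes within it are doubled. For example (a Python sketch, not a replacement for the grammar):

def write_field(value):
    """Serialise a single field according to the RFC 4180 grammar above."""
    if any(ch in value for ch in ',"\r\n'):
        return '"' + value.replace('"', '""') + '"'
    return value

assert write_field('plain') == 'plain'
assert write_field('a,b') == '"a,b"'
assert write_field('say "hi"') == '"say ""hi"""'
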

Excel

Excel is a common tool for both creating and reading CSV documents, and therefore the CSV that it produces is a de facto standard.

The following describes the behaviour of Microsoft Excel for Mac 2011 with an English locale. Further testing is needed to see the behaviour of Excel in other situations.

Saved CSV

Excel generates CSV files encoded using Windows-1252 with LF line endings. Characters that cannot be represented within Windows-1252 are replaced by underscores. Only those cells that need escaping (e.g. because they contain commas or double quotes) are escaped, and double quotes are escaped with two double quotes.

Dates and numbers are formatted as displayed, which means that formatting can lead to information being lost or becoming inconsistent.
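
Consumers reading CSV saved from this version of Excel therefore need to decode it as Windows-1252 rather than UTF-8, for example (a Python sketch; the file name is a placeholder):

import csv

# "excel-export.csv" is a placeholder name for a file saved from Excel for Mac 2011.
with open("excel-export.csv", encoding="windows-1252", newline="") as f:
    rows = list(csv.reader(f))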

Opened CSV

When opening CSV files, Excel interprets files saved in UTF-8 as being encoded as Windows-1252 (whether or not a BOM is present). It correctly deals with double-quoted cells, except that it converts line breaks within cells into spaces. It understands CRLF as a line break. It detects dates (formatted as YYYY-MM-DD) and formats them in the default date formatting for files.

Imported CSV

Excel provides more control when CSV files are imported. However, it does not properly understand UTF-8 (with or without a BOM). It does properly understand UTF-16 and can read non-ASCII characters from a UTF-16-encoded file.

A particular quirk in the importing of CSV is that if a cell contains a line break, the final double quote that escapes the cell will be included within it.

Copied Tabular Data

When tabular data is copied from Excel, it is copied in a tab-delimited format, with LF line breaks.

Google Spreadsheets

Downloading CSV

Downloaded CSV files are encoded in UTF-8, without a BOM, and with LF line endings. Dates and numbers are formatted as they appear within the spreadsheet.

Importing CSV

CSV files can be imported as UTF-8 (with or without BOM). CRLF line endings are correctly recognised. Dates are reformatted to the default date format on load.

CSV Files in a Tabular Data Package

Tabular Data Packages place the following restrictions on CSV files:

As a starting point, CSV files included in a Tabular Data Package must conform to the RFC for CSV (4180 - Common Format and MIME Type for Comma-Separated Values (CSV) Files). In addition:

  • File names MUST end with .csv.

  • Files MUST be encoded as UTF-8.

  • Files MUST have a single header row. This row MUST be the first row in the file.

    • Terminology: each column in the CSV file is termed a field and its name is the string in that column in the header row.

    • The name MUST be unique amongst fields, MUST contain at least one character, and MUST conform to the character restrictions defined for the name property.

  • Rows in the file MUST NOT contain more fields than are in the header row (though they may contain less).

  • Each file MUST have an entry in the resources array in the datapackage.json file.

  • The resource metadata MUST include a tableSchema attribute whose value MUST be a valid schema description.

  • All fields in the CSV files MUST be described in the schema description.

CSV files generated by different applications often vary in their syntax, e.g. in their use of quoting characters, delimiters, etc. To encourage conformance, CSV files in a Tabular Data Package SHOULD:

  • Use "," as field delimiters.
  • Use CRLF (U+000D U+000A) or LF (U+000A) as line terminators.

If a CSV file does not follow these rules then its specific CSV dialect MUST be documented. The resource hash for the resource in the datapackage.json descriptor MUST:

  • include a dialect description of the CSV dialect used by the file.

Applications processing the CSV file SHOULD use the dialect of the CSV file to guide parsing.
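
A rough sketch of checks for a few of these requirements (Python; the function and messages are illustrative and not defined by the Tabular Data Package specification):

import csv

def check_csv(path):
    """Check a few of the Tabular Data Package rules for a single CSV file."""
    problems = []
    if not path.endswith(".csv"):
        problems.append("file name does not end with .csv")
    with open(path, encoding="utf-8", newline="") as f:
        rows = list(csv.reader(f))
    header = rows[0]
    if len(set(header)) != len(header) or any(not name for name in header):
        problems.append("header names are not unique and non-empty")
    for number, row in enumerate(rows[1:], start=2):
        if len(row) > len(header):
            problems.append(f"row {number} has more fields than the header row")
    return problems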

More details of behaviour of other tools should go here. This should include the most popular CSV parsing/generating libraries in common programming languages. Test files which include non-ASCII characters, double quotes, and line breaks within cells are:

Acknowledgements