Copyright © 2014 W3C ® ( MIT , ERCIM , Keio , Beihang ), All Rights Reserved. W3C liability , trademark and document use rules apply.
Tabular data is routinely transferred on the web as "CSV", but the definition of "CSV" in practice is very loose. This document outlines a basic data model or infoset for tabular data and metadata about that tabular data. It also contains some non-normative information about a best practice syntax for tabular data, for mapping into that data model, to contribute to the standardisation of CSV syntax by IETF. Various methods of locating metadata are also provided.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.
The CSV on the Web Working Group was chartered to produce a Recommendation "Access methods for CSV Metadata" as well as Recommendations for "Metadata vocabulary for CSV data" and "Mapping mechanism to transforming CSV into various Formats (e.g., RDF, JSON, or XML)". This document aims to primarily satisfy the first of those Recommendations (see section 3. Locating Metadata ), though it also specifies an underlying model and therefore starting point for the other chartered Recommendations.
This document is based on IETF's [ RFC4180 ] which is an Informational RFC. The working group's expectation is that future suggestions to refine RFC 4180 will be relayed to the IETF (e.g. around I18N and multi-part packaging) and contribute to its discussions about moving CSV to the Standards track.
Many "CSV" files embed metadata, for example in lines before the header row of the CSV document. This specification does not define any formats for embedding metadata within CSV files, aside from the names of columns in the header row.
This
document
was
published
by
the
CSV
on
the
Web
Working
Group
as
a
First
Public
Working
Draft.
This
document
is
intended
to
become
a
W3C
Recommendation.
If
you
wish
to
make
comments
regarding
this
document,
please
send
them
to
public-csv-wg@w3.org
(
subscribe
,
archives
).
All
comments
are
welcome.
Publication
as
a
First
Public
Working
Draft
does
not
imply
endorsement
by
the
W3C
Membership.
This
is
a
draft
document
and
may
be
updated,
replaced
or
obsoleted
by
other
documents
at
any
time.
It
is
inappropriate
to
cite
this
document
as
other
than
work
in
progress.
This document was produced by a group operating under the 5 February 2004 W3C Patent Policy . W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy .
Tabular
data
is
data
that
is
structured
into
rows,
each
of
which
contains
information
about
some
thing.
Each
row
contains
the
same
number
of
fields
cells
(although
some
of
these
fields
cells
may
be
empty),
which
provide
values
of
properties
of
the
thing
described
by
the
row.
In
tabular
data,
fields
cells
within
the
same
column
provide
values
for
the
same
property
of
the
thing
described
by
the
particular
row.
This
is
what
differentiates
tabular
data
from
other
line-oriented
formats.
Tabular data is routinely transferred on the web in a textual format called "CSV", but the definition of "CSV" in practice is very loose. Some people use the term to mean any delimited text file. Others stick more closely to the most standard definition of CSV that there is, [ RFC4180 ]. Appendix A describes the various ways in which CSV is defined.
There are different levels of data models for tabular data:
The core tabular data model can be used to describe a table that lacks any annotations, whether those annotations are embedded within a CSV file or arise from a separate metadata document.
Data is held in a table . The properties of a table are:
Each
A
column
represents
a
vertical
arrangement
of
cells
within
a
table
has
.
The
properties
of
a
column
are:
A
row
contains
represents
a
field
horizontal
arrangement
of
cells
within
a
table
.
The
properties
of
a
row
are:
A cell represents a cell at the intersection of a row and a column within a table . The properties of a cell are:
null
fields
.
,
if
the
string
value
is
an
empty
string
Should
there
be
a
distinction
made
(in
the
core
tabular
data
model)
between
empty
cells
and
cells
whose
value
is
an
empty
string?
With
a
lack
of
other
metadata,
the
only
way
to
make
the
distinction
would
be
to
say
there
was
a
difference
between
a
missing
value
in
a
CSV
file
(eg
the
second
value
in
a,,z
)
and
a
quoted
value
(eg
the
second
value
in
a,"",z
).
This
seems
dangerous
as
I
don't
think
many
clients
will
make
a
distinction
between
the
two,
so
the
semantics
will
be
lost
on
round-tripping.
An
annotated
table
is
a
table
that
is
annotated
with
additional
metadata.
The
table
has
a
MAY
have
any
number
of
properties
in
addition
to
those
provided
in
the
core
tabular
data
model
described
in
section
2.1
Core
Tabular
Data
Model
which
provide
additional
information
about
the
table
as
a
whole.
The
values
of
these
properties
may
be
lists,
structured
objects,
or
atomic
values.
Annotations
on
a
table
may
include:
The
columns
within
an
annotated
table
are
all
annotated
columns
.
:
columns
which
MAY
have
any
number
of
properties
in
addition
to
those
provided
in
the
core
tabular
data
model
described
in
section
2.1
Core
Tabular
Data
Model
.
The
annotations
on
a
column
might
provide
information
about
how
to
interpret
the
fields
cells
in
the
column.
column
or
information
about
the
column
as
a
whole.
Examples
of
annotations
might
be:
The
rows
within
an
annotated
table
are
all
annotated
rows
.
:
rows
which
MAY
have
any
number
of
properties
in
addition
to
those
provided
in
the
core
tabular
data
model
described
in
section
2.1
Core
Tabular
Data
Model
.
The
annotations
on
a
row
provide
additional
metadata
about
the
information
held
in
the
row,
such
as:
The
fields
cells
within
an
annotated
row
are
all
annotated
fields
.
cells
:
cells
which
MAY
have
any
number
of
properties
in
addition
to
those
provided
in
the
core
tabular
data
model
described
in
section
2.1
Core
Tabular
Data
Model
.
The
annotations
on
a
field
cell
provide
metadata
about
the
value
held
in
the
field,
cell,
particularly
when
this
overrides
the
information
provided
for
the
annotated
column
and
annotated
row
that
the
field
cell
falls
within.
Annotations
on
a
field
cell
might
be:
The value of an annotated cell MAY be of a datatype other than a string. For example, annotations might enable a processor to understand the string value of the cell as representing a number or a date.
The permitted types of names and values of annotations need to be spelled out here.
It might be useful to define annotated regions as follows:
An annotated table may also contain a number of annotated regions . Regions are themselves tabular structures comprised of selected rows and columns and thefieldscells within those rows for those columns. Annotated regions are regions that have annotations associated with them. Annotated columns and annotated rows are special types of annotated regions where the region is the entirety of a single column or single row.
But it's not currently clear that there are use cases or examples that justify it. Input is welcome on this.
A group of tables comprises a set of tables (which may be annotated tables ) and a set of annotations ( properties and values ) that relate to the set.
Tables can be loosely related to each other simply through annotations; not all tables that are related to each other need to grouped together. Groups of tables are useful because they can be annotated with metadata that applies to all the tables in the group.
As described in section 2.2 Annotated Tabular Data Model , tabular data may have a number of annotations associated with it. Here we describe the different methods that can be used to locate those annotations given a link to a CSV file.
In
most
methods
of
locating
metadata
described
here,
metadata
is
provided
within
a
separate
document.
The
syntax
of
this
document
is
not
defined
here,
but
these
in
the
Metadata
Vocabulary
for
Tabular
Data
specification.
These
documents
can
include
things
like:
When
creating
a
set
of
annotations
from
metadata,
if
the
same
property
is
specified
in
two
locations
then
information
closer
"closer"
to
the
document
itself
should
override
information
further
"further"
from
the
document.
Explicitly,
the
order
of
preference
is:
Link
header
when
retrieving
the
tabular
data
file,
see
section
What
should
the
syntax
be
for
embedding
metadata
within
We
probably
need
to
add
some
rules
about
conflicts
as
well.
For
example,
if
a
metadata
file
says
that
the
CSV
file?
An
example
approach
is
shown
should
contain
certain
columns
but
the
names
of
the
columns
are
different
in
Linked
the
CSV
.
If
this
file
itself,
is
allowed
we
should
specify
a
new
media
type
for
the
syntax.
this
an
error?
A
link
to
the
metadata
to
be
used
with
The
first
line
of
a
CSV
may
CSV+
file
MUST
be
indicated
within
processed
as
a
header
line
unless
the
CSV
CSV+
file
itself.
is
served
with
a
header=absent
parameter
on
the
media
type.
Each
cell
in
the
header
line
that
includes
non-whitespace
characters
provides
a
title
annotation
on
the
column
in
which
it
appears.
The
title
annotation
needs
to
be
embedded?
Perhaps
something
like:
as
linked
to
the
first
line?
If
this
is
allowed
we
should
specify
a
new
media
type
for
relevant
annotation
in
the
syntax.
metadata
vocabulary.
Rather than providing CSV files directly on the web, they can be packaged up with a metadata file that includes annotations, and any other relevant CSV files.
What should that package look like? Just a zip? A multipart document? How is the metadata file within it identified? If this is allowed we should specify a new media type for the package.
See Packaging on the Web Editor's Draft for a proposed generic approach for packaging on the web that could be used for packaging CSV files and metadata, and discovering those packages.
When
retrieving
a
CSV
file
via
HTTP,
the
response
can
include
a
Link
header
with
rel=describedby
that
points
to
a
metadata
file
that
describes
the
CSV
file.
If,
by
inspection,
the
referenced
file
is
not
a
valid
metadata
file
then
it
MUST
be
ignored.
If
there
is
more
than
one
valid
metadata
file
linked
to
through
multiple
Link
headers,
then
the
metadata
referenced
by
Link
headers
that
appear
later
in
the
response
override
that
referenced
by
earlier
Link
headers.
Is
rel=describedby
the
right
link
relation
to
use?
When
retrieving
Given
a
CSV
file
via
HTTP,
file,
the
default
location
for
a
metadata
file
that
describes
that
CSV
file
is
set
to
in
the
same
directory.
If
csv-metadata
filename
.csvm
this
that
file
does
not
exist,
then
the
application
should
look
for
a
metadata
file
at
metadata.csvm
in
the
same
directory.
In
both
cases,
if
the
metadata
file
does
not
explicitly
point
to
the
relevant
CSV
file
then
it
MUST
be
ignored.
Deliberately
not
included
Used
a
suffix
here,
because
on
filenames
to
find
metadata
about
them,
though
we
haven't
decided
what
format
metadata
documents
should
be
in,
or
even
if
they
should
be
conneg'd.
Probably
best
for
the
file
to
be
in
the
same
directory
as
the
CSV
file,
so
that
you
can
have
one
metadata
document
that
describes
a
lot
of
them.
Should there be a default navigational thing of continuing up the path hierarchy until you find a metadata document?
We
have
discussed
using
a
.well-known
location
or
something
within
a
sitemap
file
to
provide
the
location
of
a
metadata
file
about
a
given
CSV
file,
but
these
are
just
as
unlikely
to
be
editable
as
a
Link
header,
so
probably
don't
address
the
use
case
that
the
standard
path
method
addresses,
of
being
a
really
simple
way
to
provide
metadata
about
a
CSV
file.
This section is non-normative.
There is no standard for CSV, and there are many variants of CSV used on the web today. This section defines a method for outputting tabular data adhering to the core tabular data model described in section 2.1 Core Tabular Data Model into a standard, CSV-based, syntax. Compliant applications that output this format must meet each of the constraints.
We are actively working with the IETF to develop a standard for CSV, which is outside the scope of the Working Group. The details here aim to help shape that standard based on our requirements.
This section does not seek to describe how applications that input textual tabular data should interpret it, except that any data that is in the format defined here should be understood as defined here.
This
syntax
is
not
compliant
with
text/csv
as
defined
in
[
RFC4180
]:
it
permits
characters
other
than
ASCII,
and
it
permits
line
endings
other
than
CRLF
.
Supporting
the
full
set
of
Unicode
characters
by
using
UTF-8
and
supporting
LF
line
endings
are
important
characteristics
for
data
formats
that
are
used
internationally
and
on
non-Windows
platforms.
However,
all
files
that
adhere
to
[
RFC4180
]'s
definition
of
CSV
are
compliant
CSV+
files.
The
appropriate
content
type
for
a
CSV+
file
is
text/csv
.
For
example,
when
a
CSV+
file
is
transmitted
via
HTTP,
the
HTTP
response
MUST
include
a
Content-Type
header
with
the
value
text/csv
:
Content-Type: text/csv
See below for issues relating to whether we should instead define a different content type.
CSV+
files
SHOULD
be
encoded
using
UTF-8.
If
a
CSV+
file
is
not
encoded
using
UTF-8,
the
encoding
MUST
be
specified
through
the
charset
parameter
in
the
Content-Type
header:
Content-Type: text/csv;charset=ISO-8859-1
RFC4180
defines
the
default
charset
as
US-ASCII
because
that
was
(at
the
time
RFC4180
was
written)
the
default
charset
for
all
text/*
media
types.
This
has
been
superseded
with
RFC6657.
Section
3
of
RFC6657
states
"new
subtypes
of
the
"text"
media
type
SHOULD
NOT
define
a
default
"charset"
value.
If
there
is
a
strong
reason
to
do
so
despite
this
advice,
they
SHOULD
use
the
"UTF-8"
[RFC3629]
charset
as
the
default."
Do
we
have
a
strong
reason
to
specify
a
default
charset?
Should
we
be
defining
application/csv
instead,
to
avoid
doing
unrecommended
things
with
a
text/*
media
type.
The
ends
of
rows
in
a
CSV+
file
MUST
be
either
CRLF
(
U+000D
U+000A
)
or
LF
(
U+000A
).
Line
endings
within
escaped
fields
cells
are
not
normalised.
Section 4.1.1 of RFC2046 specifies that "The canonical form of any MIME "text" subtype MUST always represent a line break as a CRLF sequence. Similarly, any occurrence of CRLF in MIME "text" MUST represent a line break. Use of CR and LF outside of line break sequences is also forbidden."
Should
we
be
defining
application/csv
instead,
to
prevent
having
to
adhere
to
this
rule,
or
should
we
stick
to
the
CRLF
rule?
Each line of a CSV+ file MUST contain the same number of comma-separated values.
Values
that
contain
commas,
line
endings
or
double
quotes
MUST
be
escaped
by
having
the
entire
value
wrapped
in
double
quotes.
There
MUST
NOT
be
whitespace
before
or
after
the
double
quotes.
Within
these
escaped
fields,
cells,
any
double
quotes
MUST
be
escaped
with
two
double
quotes
(
""
).
The first line of a CSV+ file SHOULD contain a comma-separated list of names of columns . This is known as the header line and provides names for the columns. There are no constraints on these names.
If
a
CSV+
file
does
not
include
a
header
line,
this
MUST
be
specified
using
the
header
parameter
of
the
media
type:
Content-Type: text/csv;header=absent
Bidirectional content does not alter the definition of rows or the assignment of cells to columns. Whether or not a CSV+ file contains right-to-left characters, the first column's content is the first cell of each row, which is the text prior to the first occurrence of a comma within that row.
For example, Egyptian Referendum results are available as a CSV file at https://egelections-2011.appspot.com/Referendum2012/results/csv/EG.csv . Over the wire and in non-Unicode-aware text editors, the CSV looks like:
المحافظة,نسبة موافق,نسبة غير موافق,عدد الناخبين,الأصوات الصحيحة,الأصوات الباطلة,نسبة المشاركة,موافق,غير موافق القليوبية,60.0,40.0,"2,639,808","853,125","15,224",32.9,"512,055","341,070" الجيزة,66.7,33.3,"4,383,701","1,493,092","24,105",34.6,"995,417","497,675" القاهرة,43.2,56.8,"6,580,478","2,254,698","36,342",34.8,"974,371","1,280,327" قنا,84.5,15.5,"1,629,713","364,509","6,743",22.8,"307,839","56,670" ...
Within
this
CSV
file,
the
first
column
appears
as
the
content
of
each
line
before
the
first
comma
and
is
named
المحافظة
(appearing
at
the
start
of
each
row
as
المحافظة
in
the
example,
which
is
displaying
the
relevant
characters
from
left
to
right
in
the
order
they
appear
"on
the
wire").
The CSV translates to a table model that looks like:
Column / Row | column 1 | column 2 | column 3 | column 4 | column 5 | column 6 | column 7 | column 8 | column 9 |
---|---|---|---|---|---|---|---|---|---|
row 1 (header) | المحافظة | نسبة موافق | نسبة غير موافق | عدد الناخبين | الأصوات الصحيحة | الأصوات الباطلة | نسبة المشاركة | موافق | غير موافق |
row 2 | القليوبية | 60.0 | 40.0 | 2,639,808 | 853,125 | 15,224 | 32.9 | 512,055 | 341,070 |
row 3 | الجيزة | 66.7 | 33.3 | 4,383,701 | 1,493,092 | 24,105 | 34.6 | 995,417 | 497,675 |
row 4 | القاهرة | 43.2 | 56.8 | 6,580,478 | 2,254,698 | 36,342 | 34.8 | 974,371 | 1,280,327 |
row 5 | قنا | 84.5 | 15.5 | 1,629,713 | 364,509 | 6,743 | 22.8 | 307,839 | 56,670 |
The
fragment
identifier
#col=3
identifies
the
third
of
the
columns,
named
نسبة
غير
موافق
(appearing
as
نسبة
غير
موافق
in
the
example).
[ CSV-METADATA ] defines how this table model should be displayed by compliant applications, and how metadata can affect the display. The default is for the display to be determined by the content of the table. For example, if this CSV were turned into an HTML table for display into a web page, it should be displayed with the first column on the right and the last on the left, as follows:
غير موافق | موافق | نسبة المشاركة | الأصوات الباطلة | الأصوات الصحيحة | عدد الناخبين | نسبة غير موافق | نسبة موافق | المحافظة |
---|---|---|---|---|---|---|---|---|
341,070 | 512,055 | 32.9 | 15,224 | 853,125 | 2,639,808 | 40.0 | 60.0 | القليوبية |
497,675 | 995,417 | 34.6 | 24,105 | 1,493,092 | 4,383,701 | 33.3 | 66.7 | الجيزة |
1,280,327 | 974,371 | 34.8 | 36,342 | 2,254,698 | 6,580,478 | 56.8 | 43.2 | القاهرة |
56,670 | 307,839 | 22.8 | 6,743 | 364,509 | 1,629,713 | 15.5 | 84.5 | قنا |
The
fragment
identifier
#col=3
still
identifies
the
third
of
the
columns,
named
نسبة
غير
موافق
,
which
appears
in
the
HTML
display
as
the
third
column
from
the
right
and
is
what
those
who
read
right-to-left
would
think
of
as
the
third
column.
Note that this display matches that shown on the original website .
An
alternative
approach
is
for
the
CSV
to
be
parsed
into
a
table
model
in
which
the
columns
are
numbered
in
the
reverse,
for
tables
which
are
either
marked
as
or
detected
to
be
right-to-left
tables.
For
example,
we
could
introduce
a
bidi=rtl
or
similar
media
type
parameter,
and
use
this
to
determine
whether
the
first
column
in
table
generated
from
the
CSV
is
the
text
before
the
first
comma
in
each
line
or
the
text
after
the
last
comma
in
the
line.
In
the
example
above,
if
the
CSV
were
served
with
bidi=rtl
,
or
the
table
was
detected
as
being
a
right-to-left
table,
then
the
column
numbering
in
the
model
would
be
reversed:
Column / Row | column 9 | column 8 | column 7 | column 6 | column 5 | column 4 | column 3 | column 2 | column 1 |
---|---|---|---|---|---|---|---|---|---|
row 1 (header) | المحافظة | نسبة موافق | نسبة غير موافق | عدد الناخبين | الأصوات الصحيحة | الأصوات الباطلة | نسبة المشاركة | موافق | غير موافق |
row 2 | القليوبية | 60.0 | 40.0 | 2,639,808 | 853,125 | 15,224 | 32.9 | 512,055 | 341,070 |
row 3 | الجيزة | 66.7 | 33.3 | 4,383,701 | 1,493,092 | 24,105 | 34.6 | 995,417 | 497,675 |
row 4 | القاهرة | 43.2 | 56.8 | 6,580,478 | 2,254,698 | 36,342 | 34.8 | 974,371 | 1,280,327 |
row 5 | قنا | 84.5 | 15.5 | 1,629,713 | 364,509 | 6,743 | 22.8 | 307,839 | 56,670 |
This
would
require
a
change
to
[
RFC7111
]
but
that
might
be
required
by
updates
to
the
definition
of
text/csv
in
any
case.
With
the
change,
the
fragment
identifier
#col=3
would
then
refer
to
the
third
column
from
the
right,
named
نسبة
المشاركة
.
If
the
model
were
defined
in
this
way,
there
would
be
no
need
to
determine
the
order
of
the
columns
when
displayed
using
a
metadata
property.
Columns
would
always
be
displayed
with
the
first
column
(numbered
1
in
the
model)
on
the
left.
The
final
display
in
HTML,
for
example,
would
be
exactly
as
above.
The
only
difference
would
be
that
#col=3
would
refer
to
the
third
column
from
the
left.
We note that using media type parameters is problematic because publishers might not have the ability to set them on their servers, and because they can easily get lost as a file is republished or emailed between people.
We invite comment on the best way to approach bidirectionality in CSV files.
This grammar is a generalization of that defined in [ RFC4180 ] and is included for reference only.
The EBNF used here is defined in XML 1.0 [ EBNF-NOTATION ].
[1] |
csv
|
::= |
header
record
+
|
[2] |
header
|
::= | record |
[3] |
record
|
::= |
fields
#x0D
?
#x0A
|
[4] |
fields
|
::= |
field
("
,
"
fields
)
*
|
[5] |
field
|
::= |
WS
*
rawfield
WS
*
|
[6] |
rawfield
|
::= |
'
"
'
QCHAR
*
'
"
'
|
SCHAR
*
|
[7] |
QCHAR
|
::= |
[
^"
]
|
'
""
'
|
[8] |
SCHAR
|
::= |
[
^",
#x0A
#x0D
]
|
[9] |
WS
|
::= |
[
#x20
#x09
]
|
We should probably place further restrictions on QCHAR and SCHAR to avoid control characters. If header weren’t optional, it would be better defined as in RFC4180, but if the syntax allows it to be optional, this would make it not an LL(1) grammar, which isn’t too much of an issue.
This section is non-normative.
As
described
in
section
4.
CSV+
Syntax
,
there
may
be
many
formats
which
an
application
might
interpret
into
the
tabular
data
model
described
in
section
2.
Tabular
Data
Model
Models
,
including
using
different
separators
or
fixed
format
tables,
multiple
tables
within
a
single
file,
or
ones
that
have
metadata
lines
before
a
table
header.
Standardising the parsing of CSV is outside the chartered scope of the Working Group. This non-normative section is intended to help the creators of parsers handle the wide variety of CSV-based formats that they may encounter due to the current lack of standardisation of the format.
This section describes an algorithm for parsing formats other than the plain CSV+ format specified in section 4. CSV+ Syntax . It is impossible to do this in a fully automated manner, so this algorithm depends on the following flags being set externally (eg through user input):
utf-8
.
CRLF
.
"
.
"
.
0
.
#
.
1
.
,
.
0
.
0
.
false
.
When parsing, should we:
The algorithm for parsing a document containing tabular data is as follows:
Split
the
header
and
data
rows
into
fields
cells
using
the
delimiter
.
Values
that
are
enclosed
within
the
enclosure
character
may
contain
the
delimiter
.
The
enclosure
character
may
be
escaped
using
the
escape
character
where
it
appears
within
fields.
cells.
If
trim
is
true
or
start
then
whitespace
from
the
start
of
values
that
are
not
enclosed
must
be
removed
from
the
value.
If
trim
is
true
or
end
then
whitespace
from
the
end
of
values
that
are
not
enclosed
must
be
removed
from
the
value.
true
then
ignore
any
rows
in
which
all
the
This appendix outlines various ways in which CSV is defined.
[ RFC4180 ] defines CSV with the following ABNF grammar:
file = [header CRLF] record *(CRLF record) [CRLF] header = name *(COMMA name) record = field *(COMMA field) name = field field = (escaped / non-escaped) escaped = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE non-escaped = *TEXTDATA COMMA = %x2C CR = %x0D DQUOTE = %x22 LF = %x0A CRLF = CR LF TEXTDATA = %x20-21 / %x23-2B / %x2D-7E
Of particular note here are:
TEXTDATA
indicates
that
only
non-control
ASCII
characters
are
permitted
within
a
CSV
file.
This
restriction
is
routinely
ignored
in
practice,
and
is
impractical
on
the
international
web.
CRLF
.
This
makes
it
harder
to
produce
CSV
files
on
Unix-based
systems
where
the
usual
line
ending
is
LF
.
header
parameter
on
the
media
type
indicates
whether
the
header
is
present
or
not.
""
).
Excel is a common tool for both creating and reading CSV documents, and therefore the CSV that it produces is a de facto standard.
The following describes the behaviour of Microsoft Excel for Mac 2011 with an English locale. Further testing is needed to see the behaviour of Excel in other situations.
Excel
generates
CSV
files
encoded
using
Windows-1252
with
LF
line
endings.
Characters
that
cannot
be
represented
within
Windows-1252
are
replaced
by
underscores.
Only
those
fields
cells
that
need
escaping
(eg
because
they
contain
commas
or
double
quotes)
are
escaped,
and
double
quotes
are
escaped
with
two
double
quotes.
Dates and numbers are formatted as displayed, which means that formatting can lead to information being lost or becoming inconsistent.
When
opening
CSV
files,
Excel
interprets
CSV
files
saved
in
UTF-8
as
being
encoded
as
Windows-1252
(whether
or
not
a
BOM
is
present).
It
correctly
deals
with
double
quoted
fields,
cells,
except
that
it
converts
line
breaks
within
fields
cells
into
spaces.
It
understands
CRLF
as
a
line
break.
It
detects
dates
(formatted
as
YYYY-MM-DD
)
and
formats
them
in
the
default
date
formatting
for
files.
Excel provides more control when importing CSV files into Excel. However, it does not properly understand UTF-8 (with or without BOM ). It does however properly understand UTF-16 and can read non-ASCII characters from a UTF-16-encoded file.
A
particular
quirk
in
the
importing
of
CSV
is
that
if
a
field
cell
contains
a
line
break,
the
final
double
quote
that
escapes
the
field
cell
will
be
included
within
it.
When
tabular
data
is
copied
from
Excel,
it
is
copied
in
a
tab-delimited
format,
with
LF
line
breaks.
Downloaded
CSV
files
are
encoded
in
UTF-8,
without
a
BOM
,
and
with
LF
line
endings.
Dates
and
numbers
are
formatted
as
they
appear
within
the
spreadsheet.
CSV
files
can
be
imported
as
UTF-8
(with
or
without
BOM
).
CRLF
line
endings
are
correctly
recognised.
Dates
are
reformatted
to
the
default
date
format
on
load.
Ths
Simple
Tabular
Data
Format
Packages
places
place
the
following
restrictions
on
CSV
files:
As a starting point, CSV files included in a
SimpleTabular DataFormatPackage package must conform to the RFC for CSV (4180 - Common Format and MIME Type for Comma-Separated Values (CSV) Files). In addition:
- File names MUST end with
.csv
- Files MUST be encoded as UTF-8
Files MUST have a single header row. This row MUST be the first row in the file.
- Terminology: each column in the CSV file is termed a field and its
name
is the string in that column in the header row.- The
name
MUST be unique amongst fields and MUST contain at least one character- There are no further restrictions on the form of the
name
but it is RECOMMENDED that it contain only alphanumeric characters together with “ .-_”- Rows in the file MUST NOT contain more fields than are in the header row (though they may contain less)
- Each file MUST have an entry in the
resources
array in thedatapackage.json
file- The resource metadata MUST include a
schema
attribute whose value MUST conform to the JSON Table Schema- All fields in the CSV files MUST be described in the
schema
CSV files generated by different applications often vary in their syntax, e.g. use of quoting characters, delimiters, etc. To encourage conformance, CSV files in a
SimpleCSV files in a Tabular DataFormatPackage SHOULD
- Use “,” as field delimiters
- Use “rn” or “n” as line terminators
If a CSV file does not follow these rules then its specific CSV dialect MUST be documented. The resource hash for the resource in the
datapackage.json
descriptor MUST :
- Include a
dialect
key that conforms to that described in the CSV Dialect Description FormatApplications processing the CSV file SHOULD read use the
dialect
of the CSV file to guide parsing.
More
details
of
behaviour
of
other
tools
should
go
here.
This
should
include
the
most
popular
CSV
parsing/generating
libraries
in
common
programming
languages.
Test
files
which
include
non-ASCII
characters,
double
quotes
and
line
breaks
within
fields
cells
are: