Copyright © 2015 W3C® (MIT, ERCIM, Keio, Beihang), All Rights Reserved. W3C liability, trademark and document use rules apply.
Tabular data is routinely transferred on the web in a variety of formats, including variants on CSV, tab-delimited files, fixed field formats, spreadsheets, HTML tables, and SQL dumps.
This document outlines a data model, or infoset, for tabular data and metadata about that tabular data that can be used as a basis for validation, display, or creating other formats. It also contains some non-normative guidance for publishing tabular data as CSV and how that maps into the tabular data model.

An annotated model of tabular data can be supplemented by separate metadata about the table. This specification defines how implementations should locate that metadata, given a file containing tabular data. The standard syntax for that metadata is defined in [tabular-metadata]. Note, however, that applications may have other means to create annotated tables, e.g., through some application-specific APIs; this model does not depend on the specificities described in [tabular-metadata].
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.
The CSV on the Web Working Group was chartered to produce a Recommendation "Access methods for CSV Metadata" as well as Recommendations for "Metadata vocabulary for CSV data" and "Mapping mechanism to transforming CSV into various Formats (e.g., RDF, JSON, or XML)". This document aims to primarily satisfy the first of those Recommendations (see section 5. Locating Metadata), though it also specifies an underlying model for tabular data and is therefore a basis for the other chartered Recommendations.

The definition of CSV used in this document is based on IETF's [RFC4180], which is an Informational RFC. The working group's expectation is that future suggestions to refine RFC 4180 will be relayed to the IETF (e.g. around encoding and line endings) and contribute to its discussions about moving CSV to the Standards track.
Many "CSV" files containing tabular data embed metadata, for example in lines before the header row of an otherwise standard CSV document. This specification does not define any formats for embedding metadata within CSV files, aside from the titles of columns in the header row. We would encourage groups that define tabular data formats to also define a mapping into the annotated tabular data model defined in this document.
This document was published by the CSV on the Web Working Group as a Working Draft. This document is intended to become a W3C Recommendation. If you wish to make comments regarding this document, please send them to public-csv-wg@w3.org ( subscribe , archives ). All comments are welcome.
Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
This document was produced by a group operating under the 5 February 2004 W3C Patent Policy . W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy .
This document is governed by the 1 August 2014 W3C Process Document .
Tabular data is data that is structured into rows, each of which contains information about some thing. Each row contains the same number of cells (although some of these cells may be empty), which provide values of properties of the thing described by the row. In tabular data, cells within the same column provide values for the same property of the things described by each row. This is what differentiates tabular data from other line-oriented formats.
Tabular data is routinely transferred on the web in a textual format called CSV, but the definition of CSV in practice is very loose. Some people use the term to mean any delimited text file. Others stick more closely to the most standard definition of CSV that there is, [RFC4180]. Appendix A describes the various ways in which CSV is defined. This specification refers to such files, as well as tab-delimited files, fixed field formats, spreadsheets, HTML tables, and SQL dumps, as tabular data files.
In section 4. Tabular Data Models, this document defines a model for tabular data that abstracts away from the varying syntaxes that are used when exchanging tabular data. The model includes annotations, or metadata, about collections of individual tables, rows, columns, and cells. These annotations are typically supplied through separate metadata files; section 5. Locating Metadata defines how these metadata files can be located, while [tabular-metadata] defines what they contain. Once an annotated table has been created, it can be processed in various ways, such as display, validation, or conversion into other formats. This processing is described in section 6. Processing Tables.
This specification does not normatively define a format for exchanging tabular data. However, it does provide some best practice guidelines for publishing tabular data as CSV, in section 7. Best Practice CSV, and for parsing both this syntax and those similar to it, in section 8. Parsing Tabular Data.
As well as sections marked as non-normative, all authoring guidelines, diagrams, examples, and notes in this specification are non-normative. Everything else in this specification is normative.
The key words MAY, MUST, MUST NOT, SHOULD, and SHOULD NOT are to be interpreted as described in [RFC2119].
This specification makes use of the compact IRI Syntax ; please refer to the Compact IRIs from [ JSON-LD ].
Within examples, this specification makes use of the following namespaces:
dc: http://purl.org/dc/terms/
rdfs: http://www.w3.org/2000/01/rdf-schema#
schema: http://schema.org/
xsd: http://www.w3.org/2001/XMLSchema#
The following typographic conventions are used in this specification:
markup
definition reference
external definition reference
Notes are in light green boxes with a green left border and with a "Note" header in green. Notes are normative or informative depending on whether they are in a normative or informative section, respectively.
Examples are in light khaki boxes, with khaki left border, and with a numbered "Example" header in khaki. Examples are always informative. The content of the example is in monospace font and may be syntax colored.
This section defines an annotated tabular data model: a model for tables that are annotated with metadata. Annotations provide information about the cells, rows, columns, tables, and groups of tables with which they are associated. The values of these annotations may be lists, structured objects, or atomic values. Core annotations are those that affect the behavior of processors defined in this specification, but other annotations may also be present on any of the components of the model. Annotations may be described directly in [tabular-metadata], be embedded in a tabular data file, or created during the process of generating an annotated table.
In this document, the term annotation refers to any metadata associated with an object in the annotated tabular data model. These are not necessarily web annotations in the sense of [ annotation-model ].
A group of tables comprises a set of annotated tables and a set of annotations that relate to that group of tables. The core annotations of a group of tables include an identifier, which is null if this is undefined, and the list of tables in the group. Groups of tables MAY in addition have any number of annotations which provide information about the group of tables. When originating from [tabular-metadata], these annotations arise from common properties defined on table group descriptions within metadata documents.
An annotated table is a table that is annotated with additional metadata. The core annotations of a table include an identifier, which is null if this is undefined.
The table MAY in addition have any number of other annotations . Annotations on a table may include:
When originating from [ tabular-metadata ], these annotations arise from common properties defined on table descriptions within metadata documents.
A column represents a vertical arrangement of cells within a table. Among the core annotations of a column is the virtual annotation: if this is false, the column is a real column, which exists in the source data for the table. Columns MAY in addition have any number of other annotations, such as a description. When originating from [tabular-metadata], these annotations arise from common properties defined on column descriptions within metadata documents.
A row represents a horizontal arrangement of cells within a table . The core annotations of a row are:
Rows MAY have any number of additional annotations. The annotations on a row provide additional metadata about the information held in the row.
Neither this specification nor [ tabular-metadata ] defines a method to specify such annotations. Implementations MAY define a method for adding annotations to rows by interpreting notes on the table.
A cell represents a cell at the intersection of a row and a column within a table. Several of the core annotations of a cell are null if undefined. The presence or absence of quotes around a value within a CSV file is a syntactic detail that is not reflected in the tabular data model. In other words, there is no distinction in the model between the second value in a,,z and the second value in a,"",z.
Cells MAY have any number of additional annotations. The annotations on a cell provide metadata about the value held in the cell, particularly when this overrides the information provided for the column and row that the cell falls within.
Neither this specification nor [tabular-metadata] defines a method to specify such annotations. Implementations MAY define a method for adding annotations to cells by interpreting notes on the table.
Columns and cells within tables may be annotated with a datatype which indicates the type of the values obtained by parsing the string value of the cell. Datatypes are based on a subset of those defined in [xmlschema11-2], with aliases. The annotated tabular data model limits cell values to have datatypes as shown on the diagram:
anyAtomicType
xml, a sub-type of string, which indicates the value is an XML fragment
html, a sub-type of string, which indicates the value is an HTML fragment
json, a sub-type of string, which indicates the value is serialized JSON
Datatypes MAY have any number of additional annotations. The annotations on a datatype provide metadata about the datatype, such as a title or description. These arise from common properties defined on datatype descriptions within metadata documents, as defined in [tabular-metadata].
The length, minimum length, and maximum length annotations indicate the exact, minimum, and maximum lengths for cell string values. If a value is null, its length is zero.
The minimum, maximum, minimum exclusive, and maximum exclusive annotations indicate limits on cell values. These apply to numeric, date/time, and duration types. Validation of cell values against these datatypes is as defined in [xmlschema11-2].
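As a non-normative illustration, checking a parsed numeric cell value against these limit annotations might look like the following sketch (the function name and error strings are hypothetical, not defined by this specification):

```python
# Non-normative sketch: checking a parsed numeric cell value against
# minimum / maximum annotations (inclusive limits) and their exclusive
# counterparts. Returns a list of validation errors, empty if valid.
def check_limits(value, minimum=None, maximum=None,
                 minimum_exclusive=None, maximum_exclusive=None):
    errors = []
    if minimum is not None and value < minimum:
        errors.append("below minimum")
    if maximum is not None and value > maximum:
        errors.append("above maximum")
    if minimum_exclusive is not None and value <= minimum_exclusive:
        errors.append("not above exclusive minimum")
    if maximum_exclusive is not None and value >= maximum_exclusive:
        errors.append("not below exclusive maximum")
    return errors
```

Inclusive limits admit the boundary value itself; exclusive limits do not, which is why `<=` and `>=` are used for the exclusive checks.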
As described in section 4. Tabular Data Models, tabular data may have a number of annotations associated with it. Here we describe the different methods that can be used to locate metadata that provides those annotations. In most methods of locating metadata described here, metadata is provided within separate documents. The syntax of these documents is defined in [tabular-metadata].
When creating a set of annotations from metadata, if the same property is specified in two locations then information "closer" to the end user or interpreter of the document overrides other information, and information "closer" to the document itself overrides information "further" from the document. Explicitly, the order of precedence is:

1. metadata supplied by the user
2. metadata referenced from a Link header associated with the tabular data
3. file-specific metadata
4. directory-specific metadata
5. metadata embedded within the tabular data file
Processors MUST always use overriding metadata and embedded metadata. If the overriding metadata does not originate from a metadata file (for example, if it is generated from command-line options when invoking the implementation), processors MUST attempt to locate the first metadata document from the Link header, the file-specific metadata, or the directory-specific metadata. The metadata located from these locations MUST be merged as defined in [tabular-metadata]. If no metadata is supplied or found, then the following metadata is used as the default metadata:
{ "@type": "TableGroup", "tables": [] }
Processors SHOULD provide users with the facility to provide their own metadata for tabular data files that they process. This might be provided:
For example, a processor might be invoked with:
$ csvlint data.csv --datatypes:string,float,string,string
to enable the testing of the types of values in the columns of a CSV file, or with:
$ csvlint data.csv --schema:schema.json
to supply a schema that describes the contents of the file, against which it can be validated.
Metadata supplied in this way is called overriding, or user-supplied, metadata. Implementations SHOULD define how any options they define are mapped into the vocabulary defined in [ tabular-metadata ]. If the user selects existing metadata files, implementations MUST NOT use metadata located through the link header (as described in section 5.3 Link Header ), file-specific metadata (as described in section 5.4 Standard File Metadata ) or directory-specific metadata (as described in section 5.5 Standard Directory Metadata ).
Most syntaxes for tabular data provide a facility for embedding some metadata within the tabular data file itself. The definition of a syntax for tabular data SHOULD include a description of how the syntax maps to an annotated data model, and in particular how any embedded metadata is mapped into the vocabulary defined in [tabular-metadata]. Parsing based on the default dialect for CSV, as described in 8. Parsing Tabular Data, will extract column titles from the first row of a CSV file.
GID,On Street,Species,Trim Cycle,Inventory Date
1,ADDISON AV,Celtis australis,Large Tree Routine Prune,10/18/2010
2,EMERSON ST,Liquidambar styraciflua,Large Tree Routine Prune,6/2/2010
The results of this can be found in section 8.2.1 Simple Example.
For another example, the following tab-delimited file contains embedded metadata where it is assumed that comments may be added using a #, and that the column types may be indicated using a #datatype annotation:
# publisher	City of Palo Alto
# updated	12/31/2010
#name	GID	on_street	species	trim_cycle	inventory_date
#datatype	string	string	string	string	date:M/D/YYYY
GID	On Street	Species	Trim Cycle	Inventory Date
1	ADDISON AV	Celtis australis	Large Tree Routine Prune	10/18/2010
2	EMERSON ST	Liquidambar styraciflua	Large Tree Routine Prune	6/2/2010
A processor that recognises this format may be able to extract and make sense of this embedded metadata.
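As an illustration only, a processor that recognises this comment convention might extract the embedded metadata along the following lines. The convention itself (# comments and a #datatype line) is an assumption of the example above, not a format defined by this specification:

```python
# Illustrative sketch: split a tab-delimited file into embedded metadata
# and data rows, using the hypothetical "#" comment convention shown above.
def extract_embedded_metadata(lines):
    metadata = {"columns": {}}
    rows = []
    for line in lines:
        if line.startswith("#datatype"):
            metadata["columns"]["datatypes"] = line.split("\t")[1:]
        elif line.startswith("#name"):
            metadata["columns"]["names"] = line.split("\t")[1:]
        elif line.startswith("#"):
            # "# publisher<TAB>City of Palo Alto" style annotations
            key, _, value = line[1:].lstrip().partition("\t")
            metadata[key.strip()] = value.strip()
        else:
            rows.append(line.split("\t"))
    return metadata, rows

lines = [
    "# publisher\tCity of Palo Alto",
    "# updated\t12/31/2010",
    "#name\tGID\ton_street\tspecies\ttrim_cycle\tinventory_date",
    "#datatype\tstring\tstring\tstring\tstring\tdate:M/D/YYYY",
    "GID\tOn Street\tSpecies\tTrim Cycle\tInventory Date",
    "1\tADDISON AV\tCeltis australis\tLarge Tree Routine Prune\t10/18/2010",
]
meta, rows = extract_embedded_metadata(lines)
```

A real processor would additionally map the extracted keys into the vocabulary defined in [tabular-metadata].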
If the user has not supplied a metadata file as overriding metadata, described in section 5.1 Overriding Metadata, then when retrieving a tabular data file via HTTP, processors MUST retrieve the metadata file referenced by any Link header with rel="describedby" and type="application/csvm+json", type="application/ld+json" or type="application/json", so long as this referenced metadata file describes the retrieved tabular data file (ie, contains a table description whose url matches the request URL). If there is more than one valid metadata file linked to through multiple Link headers, then implementations MUST use the metadata file referenced by the last Link header. For example, when the response to requesting a tab-separated file looks like:
HTTP/1.1 200 OK
Content-Type: text/tab-separated-values
...
Link: <metadata.ttl>; rel="describedBy"; type="text/turtle"
Link: <metadata.json>; rel="describedBy"; type="application/csvm+json"
an implementation must use the referenced metadata.json to supply metadata for processing the file. If the metadata file found at this location does not explicitly include a reference to the requested tabular data file then it MUST be ignored.
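The selection rule above can be sketched as follows. This is a simplified illustration, not a full Link-header parser; among rel="describedby" links with one of the accepted media types, the one referenced by the last Link header wins:

```python
import re

# Accepted media types for metadata documents, per the text above.
ACCEPTED = {"application/csvm+json", "application/ld+json", "application/json"}

def select_metadata_link(link_headers):
    """Return the target of the last acceptable describedby Link header.

    The regexes below are a simplification for illustration, not a full
    parser for the Link header syntax.
    """
    chosen = None
    for header in link_headers:
        m = re.match(r'\s*<([^>]*)>(.*)', header)
        if not m:
            continue
        target, params = m.groups()
        rel = re.search(r'rel\s*=\s*"?([^";]+)"?', params, re.I)
        typ = re.search(r'type\s*=\s*"?([^";]+)"?', params, re.I)
        if rel and rel.group(1).lower() == "describedby" \
                and typ and typ.group(1).lower() in ACCEPTED:
            chosen = target  # later headers override earlier ones
    return chosen

headers = [
    '<metadata.ttl>; rel="describedBy"; type="text/turtle"',
    '<metadata.json>; rel="describedBy"; type="application/csvm+json"',
]
```

With the example response above, the turtle link is skipped (unacceptable type) and metadata.json is selected. A real processor would then check that the retrieved document describes the requested tabular data file before using it.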
If the user has not supplied a metadata file as overriding metadata, described in section 5.1 Overriding Metadata, and no applicable metadata file has been discovered through a Link header, described in section 5.3 Link Header, processors MUST attempt to locate a file-specific metadata document. This method MUST NOT be used if the URL of the tabular data file contains a query component. Processors MUST construct the URL for the file-specific metadata document by appending -metadata.json to the base URL of the tabular data file (with any fragment component of that URL removed) and attempt to retrieve a metadata document from that location.

For example, if the tabular data file is at http://example.org/south-west/devon.csv then processors must attempt to locate a metadata document at http://example.org/south-west/devon.csv-metadata.json. On the other hand, if the tabular data file is at http://example.com/index.php?file=43178.csv then, because the URL of the tabular data file includes a query component, the processor must not attempt to look for a file-specific metadata document.

If the metadata file found at this location does not explicitly include a reference to the relevant tabular data file then it MUST be ignored.
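The URL construction rule above can be sketched as follows (a non-normative illustration; the function name is hypothetical):

```python
from urllib.parse import urlsplit

def file_specific_metadata_url(tabular_url):
    """Build the file-specific metadata URL, or None if the rule doesn't apply.

    Appends "-metadata.json" to the tabular data file's URL with any
    fragment removed; returns None when the URL has a query component,
    since this method MUST NOT be used in that case.
    """
    parts = urlsplit(tabular_url)
    if parts.query:
        return None
    base = parts._replace(fragment="").geturl()
    return base + "-metadata.json"
```

This reproduces both examples above: the devon.csv URL yields devon.csv-metadata.json, while the index.php?file=43178.csv URL yields nothing.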
If the user has not supplied a metadata file as overriding metadata, described in section 5.1 Overriding Metadata, and no applicable metadata file has been discovered through a Link header, described in section 5.3 Link Header, or from the standard file metadata location, described in section 5.4 Standard File Metadata, processors MUST attempt to locate a directory-level metadata document. This method MUST NOT be used if the URL of the tabular data file contains a query component. Processors MUST locate and retrieve the common metadata document for a directory by resolving the relative URL metadata.json against the base URL of the tabular data file and fetching the resulting URL.

For example, if the tabular data file is at http://example.org/south-west/devon.csv then processors must attempt to locate a metadata file at http://example.org/south-west/metadata.json. On the other hand, if the tabular data file is at http://example.com/index.php?file=43178.csv then, because the URL of the tabular data file includes a query component, the processor must not attempt to look for a directory-level metadata document.

If the metadata file found at this location does not explicitly include a reference to the relevant tabular data file then it MUST be ignored.
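The directory-level rule amounts to standard relative-URL resolution, which can be sketched as (non-normative; the function name is hypothetical):

```python
from urllib.parse import urljoin, urlsplit

def directory_metadata_url(tabular_url):
    """Resolve "metadata.json" against the tabular data file's URL.

    Returns None when the URL has a query component, since this method
    MUST NOT be used in that case.
    """
    if urlsplit(tabular_url).query:
        return None
    return urljoin(tabular_url, "metadata.json")
```

Resolving the relative URL replaces the last path segment, so a file in south-west/ maps to south-west/metadata.json, matching the example above.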
This section describes how particular types of applications should process tabular data and metadata files.
In many cases, an application will start processing from a metadata file. In that case, the initial metadata file is treated as overriding metadata and the application MUST NOT continue to retrieve other available metadata about each of the tabular data files referenced by that initial metadata file, other than embedded metadata.
In other cases, applications will start from a tabular data file , such as a CSV file, and locate metadata from that file. This metadata will be used to process the file as if the processor were starting from that metadata file.
For example, if a validator is passed a locally authored metadata file spending.json, which contains:
{
  "tableSchema": "government-spending.csv",
  "tables": [
    { "url": "http://example.org/east-sussex-2015-03.csv" },
    { "url": "http://example.org/east-sussex-2015-02.csv" },
    ...
  ]
}
the validator would validate all the listed tables, using the locally defined schema at government-spending.csv. It would also use the metadata embedded in the referenced CSV files; for example, when processing http://example.org/east-sussex-2015-03.csv, it would use embedded metadata within that file.
If a validator is passed a tabular data file http://example.org/east-sussex-2015-03.csv, the validator would use the metadata located from the CSV file: the first metadata file found through the Link headers found when retrieving that file, or at http://example.org/east-sussex-2015-03.csv-metadata.json or http://example.org/metadata.json.
Starting with a metadata file can remove the need to perform additional requests to locate linked, file-specific, or directory-specific metadata.
After locating metadata, the metadata is merged into a single table group description. When starting with a metadata file, this involves merging the provided metadata file with the embedded metadata for each tabular data file referenced by that file. When starting with a tabular data file, this involves locating the first metadata file as described in section 5. Locating Metadata and merging all relevant metadata into a single descriptor.
If processing starts with a tabular data file , implementations:
1. Use any user-supplied metadata (UMT) (see section 5.1 Overriding Metadata).
2. If UMT does not describe a table or group of tables, locate the first metadata file (FM) as described in section 5. Locating Metadata, creating a tables entry where the url property is set from that of the tabular data file.
3. Set UMM = merge(UMT, FM), or UMT if no FM is loaded.
If the process starts with a metadata file:
1. Use the supplied metadata as user metadata (UMM) (which is treated as overriding metadata, see section 5.1 Overriding Metadata).
2. For each table referenced from UMM, in order, create one or more annotated tables:
   1. Extract the dialect description (DD) from UMM, adjusting DD based on HTTP headers found when retrieving the tabular data file.
   2. Parse the tabular data file, using DD as a guide, to create a basic tabular data model (T) and extract embedded metadata (EM), for example from the header line.
This specification provides a non-normative definition for parsing CSV-based files, including the extraction of embedded metadata, in section 8. Parsing Tabular Data.
This specification does not define any syntax for embedded metadata beyond this; whatever syntax is used, it's assumed that metadata can be mapped to the vocabulary defined in [tabular-metadata].
If a Content-Language HTTP header was found when retrieving the tabular data file, and the value provides a single language, set the lang inherited property to this value in EM.
Set M = merge(UMM, EM), using the merge algorithm defined in [tabular-metadata].
Use M to add annotations to the tabular data model T, as described in Section 2 Annotating Tables in [tabular-metadata].
In the case of starting with a metadata file, UMM will describe a table or group of tables, and no other metadata files will be retrieved. Thus the metadata file must provide all applicable metadata aside from that embedded within the tabular data files themselves.
Unlike many other data formats, tabular data is designed to be read by humans. For that reason, it's common for data to be represented within tabular data in a human-readable way. The datatype , default , lang , null , required , and separator annotations provide the information needed to parse the string value of a cell into its (semantic) value annotation. This is used:
The process of parsing a cell creates a cell with annotations based on the original string value, parsed value and other column annotations and adds the cell to the list of cells in a row and cells in a column :
After parsing, the cell value can be:
null, a single value, or a list of values.

The process of parsing the string value into a single value or a list of values is as follows:
1. Unless the datatype base is string, json, xml, html, or anyAtomicType, replace all carriage return (#xD), line feed (#xA), and tab (#x9) characters with space characters.
2. Unless the datatype base is string, json, xml, html, anyAtomicType, or normalizedString, strip leading and trailing whitespace from the string value and replace all instances of two or more whitespace characters with a single space character.
3. If the column separator annotation is not null and the normalized string is an empty string, the cell value is an empty list. If the column required annotation is true, add an error to the list of errors for the cell.
4. If the column separator annotation is not null, the cell value is a list of values; set the list annotation on the cell to true, and create the cell value by applying the remaining steps to each of the strings obtained by splitting the normalized string at the separator. Unless the datatype base is string or anyAtomicType, strip leading and trailing whitespace from these strings.
5. If the string is the same as any one of the values of the column null annotation, then the resulting value is null. If the column separator annotation is null and the column required annotation is true, add an error to the list of errors for the cell.
6. Parse the string, using the datatype format if one is specified. If there are any errors, add them to the list of errors for the cell; in this case the value has a datatype of string. If the datatype base is string, or there is no datatype, the value has an associated language from the column lang annotation.
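A condensed, non-normative sketch of these cell-parsing steps, for a column with datatype base, default, null, and separator annotations (the function name and argument defaults are illustrative assumptions):

```python
# Datatype bases that keep their whitespace untouched during normalization.
STRINGLIKE = {"string", "json", "xml", "html", "anyAtomicType"}

def parse_cell(s, base="string", default="", null=("",), separator=None):
    """Normalize a cell's string value and apply default/null/separator rules."""
    if base not in STRINGLIKE:
        # replace CR, LF and TAB with spaces
        s = s.replace("\r", " ").replace("\n", " ").replace("\t", " ")
    if base not in STRINGLIKE | {"normalizedString"}:
        # strip and collapse runs of whitespace
        s = " ".join(s.split())
    if separator is not None and s == "":
        return []  # empty string with a separator: empty list
    if s == "":
        s = default
    if separator is not None:
        items = []
        for part in s.split(separator):
            if base not in {"string", "anyAtomicType"}:
                part = part.strip()
            items.append(None if part in null else part)
        return items
    return None if s in null else s
```

Datatype-specific parsing (e.g. converting "99" to the integer 99 and recording validation errors) would follow these steps and is omitted here.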
The final value (or values) become the value annotation on the cell .
If there is a about URL annotation on the column , it becomes the about URL annotation on the cell , after being transformed into an absolute URL as described in URI Template Properties of [ tabular-metadata ].
If there is a property URL annotation on the column , it becomes the property URL annotation on the cell , after being transformed into an absolute URL as described in URI Template Properties of [ tabular-metadata ].
If there is a value URL annotation on the column, it becomes the value URL annotation on the cell, after being transformed into an absolute URL as described in URI Template Properties of [tabular-metadata]. The value URL annotation is null if the cell value is null and the column virtual annotation is false.
This section is non-normative.
When no datatype annotation is available, the value of a cell is the same as its string value. For example, a cell with a string value of "99" would similarly have the (semantic) value "99".
If a datatype base is provided for the cell, that is used to create a (semantic) value for the cell. For example, if the metadata contains:

"datatype": "integer"

for the cell with the string value "99" then the value of that cell will be the integer 99.
A cell whose string value was not a valid integer (such as "one" or "1.0") would be assigned that string value as its (semantic) value annotation, but also have a validation error listed in its errors annotation.
Sometimes data uses special codes to indicate unknown or null values. For example, a particular column might contain a number that is expected to be between 1 and 10, with the string 99 used in the original tabular data file to indicate a null value. The metadata for such a column would include:
"datatype": {
  "base": "integer",
  "minimum": 1,
  "maximum": 10
},
"null": "99"
In this case, a cell with a string value of "5" would have the (semantic) value of the integer 5; a cell with a string value of "99" would have the value null.
Similarly, a cell may be assigned a default value if the string value for the cell is empty. A configuration such as:
"datatype": {
  "base": "integer",
  "minimum": 1,
  "maximum": 10
},
"default": "5"
In this case, a cell whose string value is "" would be assigned the value of the integer 5. A cell whose string value contains whitespace, such as a single tab character, would also be assigned the value of the integer 5: when the datatype is something other than string or anyAtomicType, leading and trailing whitespace is stripped from string values before the remainder of the processing is carried out.
Cells can contain sequences of values. For example, a cell might have the string value "1 5 7.0". In this case, the separator is a space character. The appropriate configuration would be:
"datatype": { "base": "integer", "minimum": 1, "maximum": 10 }, "default": "5", "separator" : " "
and this would mean that the cell's value would be an array containing two integers and a string: [1, 5, "7.0"] . The final value of the array is a string because it is not a valid integer; the cell's errors annotation will also contain a validation error.
Also, with this configuration, if the string value of the cell were "" (i.e. it was an empty cell) the value of the cell would be an empty list.
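Splitting on the separator and typing each item individually can be sketched as follows (non-normative; the function name is invented for this example):

```python
def sequence_value(string_value, separator=" "):
    """Sketch: derive a list-valued cell from a separator-delimited string."""
    if string_value == "":
        # An empty cell yields an empty list, not a list of one empty string.
        return []
    values = []
    for item in string_value.split(separator):
        try:
            values.append(int(item))
        except ValueError:
            # Items that are not valid integers keep their string value;
            # a validation error would also be recorded for the cell.
            values.append(item)
    return values
```

So `sequence_value("1 5 7.0")` yields `[1, 5, "7.0"]` and `sequence_value("")` yields `[]`.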
A cell value can be inserted into a URL created using a URI template property such as valueUrl . For example, if a cell with the string value "1 5 7.0" were in a column named values , defined with:
"datatype": "decimal", "separator": " ", "valueUrl" : "{?values}"
then after expansion of the URI template, the resulting valueUrl would be ?values=1.0,5.0,7.0 . The canonical representations of the decimal values are used within the URL.
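This expansion can be sketched as follows. The sketch is non-normative: it hard-codes the form-style query expansion of a `{?name}` template from RFC 6570 rather than implementing URI templates in full, and the function names are invented for this example.

```python
from decimal import Decimal

def canonical_decimal(s):
    """Canonical xsd:decimal form: always a digit after the decimal point."""
    d = Decimal(s)
    if d == d.to_integral_value():
        # e.g. "1" and "7.0" both canonicalise to one decimal place.
        d = d.quantize(Decimal("1.0"))
    return str(d)

def expand_value_url(column_name, string_value, separator=" "):
    """Sketch: expand a "{?name}" valueUrl template for a list-valued cell."""
    items = [canonical_decimal(v) for v in string_value.split(separator)]
    # Form-style expansion joins list values with commas.
    return "?" + column_name + "=" + ",".join(items)
```

So `expand_value_url("values", "1 5 7.0")` yields `?values=1.0,5.0,7.0`, matching the example above.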
It is not uncommon for numbers within tabular data to be formatted for human consumption, which may involve using commas for decimal points, grouping digits in the number using commas, or adding currency symbols or percent signs to the number.
If the datatype base is a numeric type, the datatype format annotation indicates the expected format for that number. Its value MUST be either a single string or an object with one or more of the following properties:

- decimalChar: a string whose value is used to represent a decimal point within the number. The default value is "." .
- groupChar: a string whose value is used to group digits within the number. The default value is "," .
- pattern: a number format pattern as defined in [ UAX35 ].
Authors are encouraged to be conservative in the regular expressions that they use, sticking to the basic features of regular expressions that are likely to be supported across implementations.
If the datatype format annotation is a single string, this is interpreted in the same way as if it were an object with a pattern property whose value is that string.
When parsing the string value of a cell against this format specification, implementations MUST recognise and parse numbers that consist of:

- an optional + or - sign,
- followed by a decimal digit ( 0-9 ),
- followed by any number of decimal digits ( 0-9 ) and the character specified as the groupChar ,
- followed by an optional decimal point (the decimalChar ) and any number of decimal digits ( 0-9 ),
- followed by an optional exponent, consisting of an E followed by an optional + or - sign followed by one or more decimal digits ( 0-9 ),
- or followed by an optional percent ( % ) or per-mille ( ‰ ) sign.
or that are one of the special values:
NaN , INF , or -INF .
Implementations MUST add a validation error to the errors annotation for the cell if the string being parsed:

- contains the decimal character, if the datatype base is integer or one of its sub-values,
- contains an exponent, if the datatype base is decimal or one of its sub-values, or
- is one of the special values NaN , INF , or -INF , if the datatype base is decimal or one of its sub-values.
Implementations MUST use the sign, exponent, percent, and per-mille signs when parsing the string value of a cell to provide the value of the cell. For example, the string value "-25%" must be interpreted as -0.25 and the string value "1E6" as 1000000 .
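A minimal sketch of this numeric parsing follows. It is non-normative and deliberately simplified: it handles the group character, a trailing percent or per-mille sign, and exponents, but does not validate against the full format rules above; the function name is invented for this example.

```python
def parse_number(s, group_char=",", decimal_char="."):
    """Sketch: parse a formatted number, honouring %, per-mille, and exponents."""
    factor = 1.0
    if s.endswith("%"):
        s, factor = s[:-1], 0.01      # "-25%" -> -0.25
    elif s.endswith("‰"):
        s, factor = s[:-1], 0.001     # per-mille sign
    # Strip digit grouping and normalise the decimal character.
    s = s.replace(group_char, "").replace(decimal_char, ".")
    return float(s) * factor
```

So `parse_number("-25%")` yields -0.25 and `parse_number("1E6")` yields 1000000.0.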
Boolean values may be represented in many ways aside from the standard 1 and 0 or true and false .
If the datatype base for a cell is boolean , the datatype format annotation provides the true and false values expected, separated by | . For example, if format is Y|N then cells must hold either Y or N , with Y meaning true and N meaning false .
The resulting cell value will be one or more boolean true or false values.
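Parsing a boolean against such a format can be sketched as follows (non-normative; the function name and error message are invented for this example):

```python
def parse_boolean(s, fmt="Y|N"):
    """Sketch: parse a boolean cell against a "true|false" format string."""
    true_value, false_value = fmt.split("|")
    if s == true_value:
        return True
    if s == false_value:
        return False
    # Any other string would produce a validation error on the cell.
    raise ValueError(f"{s!r} is neither {true_value!r} nor {false_value!r}")
```

So with the format `Y|N`, `parse_boolean("Y")` yields True and `parse_boolean("N")` yields False.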
Dates and times are commonly represented in tabular data in formats other than those defined in [ xmlschema11-2 ].
If the datatype base is a date or time type, the datatype format annotation indicates the expected format for that date or time.
The supported date and time formats listed here are expressed in terms of the date field symbols defined in [ UAX35 ] and MUST be interpreted by implementations as defined in that specification.
The following date formats MUST be recognised by implementations:
- yyyy-MM-dd e.g., 2015-03-22
- yyyyMMdd e.g., 20150322
- dd-MM-yyyy e.g., 22-03-2015
- d-M-yyyy e.g., 22-3-2015
- MM-dd-yyyy e.g., 03-22-2015
- M-d-yyyy e.g., 3-22-2015
- dd/MM/yyyy e.g., 22/03/2015
- d/M/yyyy e.g., 22/3/2015
- MM/dd/yyyy e.g., 03/22/2015
- M/d/yyyy e.g., 3/22/2015
- dd.MM.yyyy e.g., 22.03.2015
- d.M.yyyy e.g., 22.3.2015
- MM.dd.yyyy e.g., 03.22.2015
- M.d.yyyy e.g., 3.22.2015
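Mapping a subset of these patterns onto a date parser can be sketched as follows. This is non-normative: the mapping table covers only three of the patterns above, and Python's strptime is more lenient than the spec (for example, `%m` also accepts zero-padded months for the `M/d/yyyy` pattern).

```python
from datetime import datetime

# Hypothetical mapping from UAX35-style patterns to strptime directives.
PATTERNS = {
    "yyyy-MM-dd": "%Y-%m-%d",
    "dd/MM/yyyy": "%d/%m/%Y",
    "M/d/yyyy": "%m/%d/%Y",
}

def parse_date(value, pattern):
    """Sketch: parse a date cell against one of the supported patterns."""
    return datetime.strptime(value, PATTERNS[pattern]).date()
```

So `parse_date("3/22/2015", "M/d/yyyy")` and `parse_date("2015-03-22", "yyyy-MM-dd")` both yield the date 2015-03-22.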
The following time formats MUST be recognised by implementations:
- HH:mm:ss e.g., 15:02:37
- HHmmss e.g., 150237
- HH:mm e.g., 15:02
- HHmm e.g., 1502
The following date/time formats MUST be recognised by implementations:
- yyyy-MM-ddTHH:mm:ss e.g., 2015-03-15T15:02:37
- yyyy-MM-ddTHH:mm e.g., 2015-03-15T15:02

Implementations MUST also recognise date/time formats that combine one of the date formats above with one of the time formats above, separated by a single space: for example, M/d/yyyy HH:mm for 3/22/2015 15:02 or dd.MM.yyyy HH:mm:ss for 22.03.2015 15:02:37 .
Implementations MUST also recognise date, time, and date/time formats that end with timezone markers consisting of between one and three x s or X s, possibly after a single space. These MUST be interpreted as follows:
- X e.g., -08 , +0530 , or Z (minutes are optional)
- XX e.g., -0800 , +0530 , or Z
- XXX e.g., -08:00 , +05:30 , or Z
- x e.g., -08 or +0530 ( Z is not permitted)
- xx e.g., -0800 or +0530 ( Z is not permitted)
- xxx e.g., -08:00 or +05:30 ( Z is not permitted)
For example, formats could include yyyy-MM-ddTHH:mm:ssXXX for 2015-03-15T15:02:37Z or 2015-03-15T15:02:37-05:00 , or HH:mm x for 15:02 -05 .
The cell value will be one or more date/time values extracted using the format .
For simplicity, this version of this standard does not support abbreviated or full month or day names, or double digit years. Future versions of this standard may support other date and time formats, or general purpose date/time pattern strings.
Authors of schemas SHOULD use appropriate regular expressions, along with the string datatype, for dates and times that use a format other than that specified here.
Durations MUST be formatted and interpreted as defined in [ xmlschema11-2 ], using the [ ISO8601 ] format -?PnYnMnDTnHnMnS .
For example, the duration P1Y1D is used for a year and a day; the duration PT2H30M for 2 hours and 30 minutes.
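Extracting the components of such a duration can be sketched with a regular expression (non-normative; the function name and the returned dictionary keys are invented for this example):

```python
import re

# One optional sign, then P, then optional Y/M/D fields, then an
# optional T section with H/M/S fields, per the ISO 8601 lexical form.
DURATION = re.compile(
    r"^(-)?P(?:(\d+)Y)?(?:(\d+)M)?(?:(\d+)D)?"
    r"(?:T(?:(\d+)H)?(?:(\d+)M)?(?:(\d+(?:\.\d+)?)S)?)?$"
)

def parse_duration(s):
    """Sketch: split an ISO 8601 duration such as P1Y1D into components."""
    m = DURATION.match(s)
    if not m or s in ("P", "-P"):
        raise ValueError(f"not a duration: {s!r}")
    sign, y, mo, d, h, mi, sec = m.groups()
    return {"negative": sign == "-",
            "years": int(y or 0), "months": int(mo or 0), "days": int(d or 0),
            "hours": int(h or 0), "minutes": int(mi or 0),
            "seconds": float(sec or 0)}
```

So `parse_duration("P1Y1D")` yields one year and one day, and `parse_duration("PT2H30M")` yields 2 hours and 30 minutes.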
If the datatype base is a duration type, the datatype format annotation provides a regular expression for the string values, in the syntax and processed as defined by [ ECMASCRIPT ].
Authors are encouraged to be conservative in the regular expressions that they use, sticking to the basic features of regular expressions that are likely to be supported across implementations.
The cell value will be one or more durations extracted using the format .
If the datatype base is not numeric, boolean , a date/time type, or a duration type, the datatype format annotation provides a regular expression for the string values, in the syntax and processed as defined by [ ECMASCRIPT ].
Authors are encouraged to be conservative in the regular expressions that they use, sticking to the basic features of regular expressions that are likely to be supported across implementations.
Values that are labelled as html , xml , or json SHOULD NOT be validated against those formats. Metadata creators who wish to check the syntax of HTML, XML, or JSON within tabular data SHOULD use the datatype format annotation to specify a regular expression against which such values will be tested.
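Testing a string value against such a format regular expression can be sketched as follows. This is non-normative, and it uses Python's re module as a stand-in for [ ECMASCRIPT ] regular expressions; the basic features shown behave identically in both.

```python
import re

def check_format(string_value, pattern):
    """Sketch: return a list of validation errors for a regex format."""
    errors = []
    if re.search(pattern, string_value) is None:
        # A non-matching value gets a validation error on the cell.
        errors.append(f"{string_value!r} does not match {pattern!r}")
    return errors
```

For example, `check_format("AB-123", r"^[A-Z]{2}-\d+$")` yields an empty error list, while `check_format("abc", ...)` with the same pattern yields one error.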
This section is non-normative.
When displaying tables, implementations should:
There are two levels of bidirectionality to consider when displaying tables: the directionality of the table (i.e., whether the columns should be arranged left-to-right or right-to-left) and the directionality of the content of individual cells.
The direction annotation on the table provides information about the desired display of the table. If direction is ltr then the first column should be displayed on the left and the last column on the right. If direction is rtl then the first column should be displayed on the right and the last column on the left.

If direction is default then tables should be displayed with attention to the bidirectionality of the content of the file.
Specifically, the values of the cells in the table should be scanned breadth first: from the first cell in the first column through to the last cell in the first row, down to the last cell in the last column.
If the first character in the table with a strong type as defined in [ BIDI ] indicates a RTL directionality, the table should be displayed with the first column on the right and the last column on the left. Otherwise, the table should be displayed with the first column on the left and the last column on the right.
Characters such as whitespace, quotes, commas, and numbers do not have a strong type, and therefore are skipped when identifying the character that determines the directionality of the table.
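The first-strong-character scan can be sketched as follows (non-normative; the function name is invented, and scanning the cell values in the breadth-first order described above is assumed to happen elsewhere):

```python
import unicodedata

def first_strong_is_rtl(text):
    """Sketch: True if the first strongly typed character is right-to-left."""
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi == "L":
            return False          # strong left-to-right character
        if bidi in ("R", "AL"):
            return True           # strong right-to-left character
    # Whitespace, digits, quotes, and commas have no strong type,
    # so a string containing only those gives no direction.
    return False
```

So `first_strong_is_rtl("مرحبا")` yields True, while `first_strong_is_rtl("hello")` and `first_strong_is_rtl('123, "')` yield False.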
Implementations should enable user preferences to override the indicated metadata about the directionality of the table.
Once the directionality of the table has been determined, each cell within the table should be considered as a separate paragraph , as defined by the Unicode Bidirectional Algorithm ( UBA ) in [ BIDI ].
The default directionality for the cell is determined by looking at the text direction annotation for the cell.
Thus, as defined by the UBA , if a cell contains no characters with a strong type (if it's a number or date for example) then the way the cell is displayed should be determined by the text direction annotation of the cell. However, when the cell contains characters with a strong type (such as letters) then they should be displayed according to the Unicode Bidirectional Algorithm as described in [ BIDI ].
Validators test whether given tabular data files adhere to the structure defined within a schema . Validators MUST raise errors as defined in [ tabular-metadata ] and in the following situations:
Processors MAY continue validating after encountering the first error.
Conversions of tabular data to other formats operate over a annotated table constructed as defined in Annotating Tables in [ tabular-metadata ]. The mechanics of these conversions to other formats are defined in other specifications such as [ csv2json ] and [ csv2rdf ].
Conversion specifications MUST define a default mapping from an annotated table that lacks any annotations (i.e., that is equivalent to an un-annotated table).
Conversion specifications MUST use the value of the propertyUrl of a column as the basis for naming machine-readable fields in the target format, such as the name of the equivalent element or attribute in XML, property in JSON, or property URI in RDF.
Conversion specifications MAY use any of the annotations found on an annotated table group, table, column, row or cell, including non-core annotations, to adjust the mapping into another format.
Conversion specifications MAY define additional annotations, not defined in this specification, which are specifically used when converting to the target format of the conversion. For example, a conversion to XML might specify a http://example.org/conversion/xml/element-or-attribute property on columns that determines whether a particular column is represented through an element or an attribute in the data.
This section is non-normative.
There is no standard for CSV, and there are many variants of CSV used on the web today. This section defines a method for expressing tabular data adhering to the annotated tabular data model in CSV. Authors are encouraged to adhere to the constraints described in this section as implementations should process such CSV files consistently.
This syntax is not compliant with text/csv as defined in [ RFC4180 ] in that it permits line endings other than CRLF . Supporting LF line endings is important for data formats that are used on non-Windows platforms. However, all files that adhere to [ RFC4180 ]'s definition of CSV meet the constraints described in this section.
Developing a standard for CSV is outside the scope of the Working Group. The details here aim to help shape any future standard.
The appropriate content type for a CSV file is text/csv .
For example, when a CSV file is transmitted via HTTP, the HTTP response should include a Content-Type header with the value text/csv :
Content-Type: text/csv
CSV files should be encoded using UTF-8. If a CSV file is not encoded using UTF-8, the encoding should be specified through the charset parameter in the Content-Type header:
Content-Type: text/csv;charset=ISO-8859-1
The ends of rows in a CSV file should be either CRLF ( U+000D U+000A ) or LF ( U+000A ). Line endings within escaped cells are not normalised.
Each line of a CSV file should contain the same number of comma-separated values.
Values that contain commas, line endings, or double quotes should be escaped by having the entire value wrapped in double quotes. There should not be whitespace before or after the double quotes. Within these escaped cells, any double quotes should be escaped with two double quotes ( "" ).
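The escaping rule can be sketched as follows (non-normative; the function name is invented for this example):

```python
def escape_value(value):
    """Sketch: escape a single CSV value per the constraints above."""
    if any(ch in value for ch in ',"\r\n'):
        # Wrap the whole value in double quotes, doubling any
        # double quotes inside it.
        return '"' + value.replace('"', '""') + '"'
    return value
```

So `escape_value('say "hi"')` yields `"say ""hi"""`, `escape_value("a,b")` yields `"a,b"` (quoted), and values without special characters pass through unchanged.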
The first line of a CSV file should contain a comma-separated list of names of columns . This is known as the header line and provides titles for the columns. There are no constraints on these titles. If a CSV file does not include a header line, this should be specified using the header parameter of the media type:
Content-Type: text/csv;header=absent
This grammar is a generalization of that defined in [ RFC4180 ] and is included for reference only.
The EBNF used here is defined in XML 1.0 [ EBNF-NOTATION ].
This section is non-normative.
As described in section 7. Best Practice CSV , there may be many formats which an application might interpret into the tabular data model described in section 4. Tabular Data Models , including using different separators or fixed format tables, multiple tables within a single file, or ones that have metadata lines before a table header.
Standardising the parsing of CSV is outside the chartered scope of the Working Group. This non-normative section is intended to help the creators of parsers handle the wide variety of CSV-based formats that they may encounter due to the current lack of standardisation of the format.
This section describes an algorithm for parsing formats that do not adhere to the constraints described in section 7. Best Practice CSV , as well as those that do, and extracting embedded metadata . The parsing algorithm uses the following flags. These may be set by metadata properties found while Locating Metadata , including through user input (see Overriding Metadata ), or through the inclusion of a dialect description within a metadata file:
- comment prefix: a string that, when it appears at the beginning of a row, indicates that the row is a comment that should be associated as an rdfs:comment annotation to the table. This is set by the commentPrefix property of a dialect description . The default is null , which means no rows are treated as comments. A value other than null may mean that the source numbers of rows are different from their numbers .
- delimiter: the separator between cells, set by the delimiter property of a dialect description . The default is , .
- encoding: the character encoding for the file, set by the encoding property of a dialect description . The default is utf-8 .
- escape character: the string that is used to escape the quote character within escaped cells, or null , set by the doubleQuote property of a dialect description . The default is " (such that "" is used to escape " within an escaped cell).
- header row count: the number of header rows (following the skipped rows) in the file, set by the header or headerRowCount property of a dialect description . The default is 1 . A value other than 0 will mean that the source numbers of rows will be different from their numbers .
- line terminators: the strings that can be used at the end of a row, set by the lineTerminators property of a dialect description . The default is [CRLF, LF] .
- quote character: the string that is used around escaped cells, or null , set by the quoteChar property of a dialect description . The default is " .
- skip blank rows: an indicator of whether to ignore wholly empty rows, set by the skipBlankRows property of a dialect description . The default is false . A value other than false may mean that the source numbers of rows are different from their numbers .
- skip columns: the number of columns to skip at the beginning of each row, set by the skipColumns property of a dialect description . The default is 0 . A value other than 0 will mean that the source numbers of columns will be different from their numbers .
- skip rows: the number of rows to skip at the beginning of the file, before a header row or tabular data, set by the skipRows property of a dialect description . The default is 0 . A value greater than 0 will mean that the source numbers of rows will be different from their numbers .
- trim: an indicator of whether to trim whitespace around cells; may be true , false , start , or end , set by the skipInitialSpace or trim property of a dialect description . The default is false .

The algorithm for using these flags to parse a document containing tabular data to create a basic annotated tabular data model and to extract embedded metadata is as follows:
1. Create a new table T , with its core annotations set to their default values: id and url set to null , table direction set to ltr , suppress output set to false , and its list-valued annotations set to empty lists.
2. Create an embedded metadata document structure M that looks like:

{ "@context": "http://www.w3.org/ns/csvw", "rdfs:comment": [], "tableSchema": { "columns": [] } }

3. If the URL of the tabular data file being parsed is known, set the url property on M to that URL.
4. Set the source row number to 1 .
5. Repeat the following the number of times indicated by skip rows :
5.1. Read a row to provide the row content .
5.2. If the comment prefix is not null and the row content begins with the comment prefix , strip that prefix from the row content , and add the resulting string to the M .rdfs:comment array.
5.3. Otherwise, if the row content is not an empty string, add the row content to the M .rdfs:comment array.
5.4. Add 1 to the source row number .
6. Repeat the following the number of times indicated by header row count :
6.1. Read a row to provide the row content .
6.2. If the comment prefix is not null and the row content begins with the comment prefix , strip that prefix from the row content , and add the resulting string to the M .rdfs:comment array.
6.3. Otherwise, parse the row to provide a list of cell values ; skip the first skip columns values, and for each remaining value at index i : if there is no column description object at index i in M .tableSchema.columns , create a new one with a title property whose value is an array containing a single value that is the value at index i in the list of cell values ; otherwise, add the value at index i to the array at M .tableSchema.columns[ i ].title .
6.4. Add 1 to the source row number .
7. Set the row number to 1 and, while it is possible to read another row:
7.1. Set the source column number to 1 .
7.2. Read a row to provide the row content .
7.3. If the comment prefix is not null and the row content begins with the comment prefix , strip that prefix from the row content , and add the resulting string to the M .rdfs:comment array.
7.4. Otherwise, parse the row to provide a list of cell values ; if all of the values are empty strings and skip blank rows is true , add 1 to the source row number and move on to process the next row.
7.5. Otherwise, create a new row whose number is the row number and whose source number is the source row number , and, for each cell value after the first skip columns values, create a new cell with default annotations ( id , about URL , property URL , and value URL set to null , the string value set to the cell value, errors set to an empty list, and text direction set to ltr ), adding 1 to the source column number as each value is processed; then add 1 to the row number and 1 to the source row number .
8. If M .rdfs:comment is an empty array, remove the rdfs:comment property from M .

To read a row to provide row content , perform the following steps:
To read a quoted value to provide a quoted value , perform the following steps:
To parse a row to provide a list of cell values , perform the following steps:
1. Set the quoted flag to false .
2. If the current character is the quote character and the quoted flag is false , set the quoted flag to true , and move on to process the next character. If the current cell value is not an empty string, raise an error.
3. Otherwise, if the current character is the quote character , set the quoted flag to false , and move on to process the next character. If the next character is not the delimiter , raise an error.
4. If the current character is the delimiter and the quoted flag is true , append the delimiter character to the current cell value and move on to process the next character.
To trim a cell value to provide a trimmed cell value , perform the following steps:
1. If trim is true or start , remove any leading whitespace from the start of the trimmed cell value and move on to the next step.
2. If trim is true or end , remove any trailing whitespace from the end of the trimmed cell value .
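The conditional trimming steps above can be sketched as follows (non-normative; the function name is invented for this example):

```python
def trim_cell(value, trim="false"):
    """Sketch: conditionally trim a cell value per the trim flag."""
    # trim is one of "true", "false", "start", or "end".
    if trim in ("true", "start"):
        value = value.lstrip()   # remove leading whitespace
    if trim in ("true", "end"):
        value = value.rstrip()   # remove trailing whitespace
    return value
```

So `trim_cell("  a  ", "true")` yields `"a"`, `trim_cell("  a  ", "start")` yields `"a  "`, and with the default `"false"` the value is unchanged.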
This parsing algorithm does not account for the possibility of there being more than one area of tabular data within a single CSV file.
This section is non-normative.
Bidirectional content does not alter the definition of rows or the assignment of cells to columns. Whether or not a CSV file contains right-to-left characters, the first column's content is the first cell of each row, which is the text prior to the first occurrence of a comma within that row.
For example, Egyptian Referendum results are available as a CSV file at https://egelections-2011.appspot.com/Referendum2012/results/csv/EG.csv . Over the wire and in non-Unicode-aware text editors, the CSV looks like:
المحافظة,نسبة موافق,نسبة غير موافق,عدد الناخبين,الأصوات الصحيحة,الأصوات الباطلة,نسبة المشاركة,موافق,غير موافق القليوبية,60.0,40.0,"2,639,808","853,125","15,224",32.9,"512,055","341,070" الجيزة,66.7,33.3,"4,383,701","1,493,092","24,105",34.6,"995,417","497,675" القاهرة,43.2,56.8,"6,580,478","2,254,698","36,342",34.8,"974,371","1,280,327" قنا,84.5,15.5,"1,629,713","364,509","6,743",22.8,"307,839","56,670" ...
Within this CSV file, the first column appears as the content of each line before the first comma and is named المحافظة (appearing at the start of each row as المحافظة in the example, which is displaying the relevant characters from left to right in the order they appear "on the wire").
The CSV translates to a table model that looks like:
Column / Row | column 1 | column 2 | column 3 | column 4 | column 5 | column 6 | column 7 | column 8 | column 9 |
---|---|---|---|---|---|---|---|---|---|
row 1 (header) | المحافظة | نسبة موافق | نسبة غير موافق | عدد الناخبين | الأصوات الصحيحة | الأصوات الباطلة | نسبة المشاركة | موافق | غير موافق |
row 2 | القليوبية | 60.0 | 40.0 | 2,639,808 | 853,125 | 15,224 | 32.9 | 512,055 | 341,070 |
row 3 | الجيزة | 66.7 | 33.3 | 4,383,701 | 1,493,092 | 24,105 | 34.6 | 995,417 | 497,675 |
row 4 | القاهرة | 43.2 | 56.8 | 6,580,478 | 2,254,698 | 36,342 | 34.8 | 974,371 | 1,280,327 |
row 5 | قنا | 84.5 | 15.5 | 1,629,713 | 364,509 | 6,743 | 22.8 | 307,839 | 56,670 |
The fragment identifier #col=3 identifies the third of the columns, named نسبة غير موافق (appearing as نسبة غير موافق in the example).
[ tabular-metadata ] defines how this table model should be displayed by compliant applications, and how metadata can affect the display. The default is for the display to be determined by the content of the table. For example, if this CSV were turned into an HTML table for display into a web page, it should be displayed with the first column on the right and the last on the left, as follows:
غير موافق | موافق | نسبة المشاركة | الأصوات الباطلة | الأصوات الصحيحة | عدد الناخبين | نسبة غير موافق | نسبة موافق | المحافظة |
---|---|---|---|---|---|---|---|---|
341,070 | 512,055 | 32.9 | 15,224 | 853,125 | 2,639,808 | 40.0 | 60.0 | القليوبية |
497,675 | 995,417 | 34.6 | 24,105 | 1,493,092 | 4,383,701 | 33.3 | 66.7 | الجيزة |
1,280,327 | 974,371 | 34.8 | 36,342 | 2,254,698 | 6,580,478 | 56.8 | 43.2 | القاهرة |
56,670 | 307,839 | 22.8 | 6,743 | 364,509 | 1,629,713 | 15.5 | 84.5 | قنا |
The fragment identifier #col=3 still identifies the third of the columns, named نسبة غير موافق , which appears in the HTML display as the third column from the right and is what those who read right-to-left would think of as the third column.
Note that this display matches that shown on the original website .
An alternative approach is for the CSV to be parsed into a table model in which the columns are numbered in the reverse, for tables which are either marked as or detected to be right-to-left tables.
For example, we could introduce a bidi=rtl or similar media type parameter, and use this to determine whether the first column in the table generated from the CSV is the text before the first comma in each line or the text after the last comma in the line.
In the example above, if the CSV were served with bidi=rtl , or the table was detected as being a right-to-left table, then the column numbering in the model would be reversed:
Column / Row | column 9 | column 8 | column 7 | column 6 | column 5 | column 4 | column 3 | column 2 | column 1 |
---|---|---|---|---|---|---|---|---|---|
row 1 (header) | المحافظة | نسبة موافق | نسبة غير موافق | عدد الناخبين | الأصوات الصحيحة | الأصوات الباطلة | نسبة المشاركة | موافق | غير موافق |
row 2 | القليوبية | 60.0 | 40.0 | 2,639,808 | 853,125 | 15,224 | 32.9 | 512,055 | 341,070 |
row 3 | الجيزة | 66.7 | 33.3 | 4,383,701 | 1,493,092 | 24,105 | 34.6 | 995,417 | 497,675 |
row 4 | القاهرة | 43.2 | 56.8 | 6,580,478 | 2,254,698 | 36,342 | 34.8 | 974,371 | 1,280,327 |
row 5 | قنا | 84.5 | 15.5 | 1,629,713 | 364,509 | 6,743 | 22.8 | 307,839 | 56,670 |
This would require a change to [ RFC7111 ], but that might be required by updates to the definition of text/csv in any case.
With the change, the fragment identifier #col=3 would then refer to the third column from the right, named نسبة المشاركة .
If the model were defined in this way, there would be no need to determine the order of the columns when displayed using a metadata property. Columns would always be displayed with the first column (numbered 1 in the model) on the left.
The final display in HTML, for example, would be exactly as above. The only difference would be that #col=3 would refer to the third column from the left.
We note that using media type parameters is problematic because publishers might not have the ability to set them on their servers, and because they can easily get lost as a file is republished or emailed between people.
We invite comment on the best way to approach bidirectionality in CSV files.
A simple CSV file that complies with the constraints described in section 7. Best Practice CSV , at http://example.org/tree-ops.csv , might look like:
GID,On Street,Species,Trim Cycle,Inventory Date 1,ADDISON AV,Celtis australis,Large Tree Routine Prune,10/18/2010 2,EMERSON ST,Liquidambar styraciflua,Large Tree Routine Prune,6/2/2010
Parsing this file results in an annotated tabular data model of a single table T with five columns and two rows. The columns have the annotations shown in the following table:
id | core annotations | ||||
---|---|---|---|---|---|
| table | number | source number | cells | titles |
C1 | T | 1 | 1 | C1.1 , C2.1 | GID |
C2 | T | 2 | 2 | C1.2 , C2.2 | On Street |
C3 | T | 3 | 3 | C1.3 , C2.3 | Species |
C4 | T | 4 | 4 | C1.4 , C2.4 | Trim Cycle |
C5 | T | 5 | 5 | C1.5 , C2.5 | Inventory Date |
The extracted embedded metadata , as defined in [ tabular-metadata ], would look like:
{ "@type": "Table", "url": "http://example.org/tree-ops.csv", "tableSchema": { "columns": [ {"titles": [ "GID" ]}, {"titles": [ "On Street" ]}, {"titles": [ "Species" ]}, {"titles": [ "Trim Cycle" ]}, {"titles": [ "Inventory Date" ]} ] } }
The rows have the annotations shown in the following table:
id | core annotations | |||
---|---|---|---|---|
table | number | source number | cells | |
R1 | T | 1 | 2 | C1.1 , C1.2 , C1.3 , C1.4 , C1.5 |
R2 | T | 2 | 3 | C2.1 , C2.2 , C2.3 , C2.4 , C2.5 |
The source number of each row is offset by one from the number of each row because, in the source CSV file, the header line is the first line. It is possible to reconstruct a [ RFC7111 ] compliant reference to the first record in the original CSV file ( http://example.org/tree-ops.csv#row=2 ) using the value of the row's source number. This enables implementations to retain provenance between the table model and the original file.
The cells have the annotations shown in the following table (note that the values of all the cells in the table are strings, denoted by the double quotes in the table below):
id | core annotations | ||||
---|---|---|---|---|---|
| table | column | row | string value | value |
C1.1 | T | C1 | R1 | "1" | "1" |
C1.2 | T | C2 | R1 | "ADDISON AV" | "ADDISON AV" |
C1.3 | T | C3 | R1 | "Celtis australis" | "Celtis australis" |
C1.4 | T | C4 | R1 | "Large Tree Routine Prune" | "Large Tree Routine Prune" |
C1.5 | T | C5 | R1 | "10/18/2010" | "10/18/2010" |
C2.1 | T | C1 | R2 | "2" | "2" |
C2.2 | T | C2 | R2 | "EMERSON ST" | "EMERSON ST" |
C2.3 | T | C3 | R2 | "Liquidambar styraciflua" | "Liquidambar styraciflua" |
C2.4 | T | C4 | R2 | "Large Tree Routine Prune" | "Large Tree Routine Prune" |
C2.5 | T | C5 | R2 | "6/2/2010" | "6/2/2010" |
The tools that the consumer of this data uses may provide a mechanism for overriding the metadata that has been provided within the file itself. For example, they might enable the consumer to add machine-readable names to the columns, or to mark the fifth column as holding a date in the format M/D/YYYY . These facilities are implementation defined; the code for invoking a Javascript-based parser might look like:
data.parse({ "column-names": ["GID", "on_street", "species", "trim_cycle", "inventory_date"], "datatypes": ["string", "string", "string", "string", "date"], "formats": [null,null,null,null,"M/D/YYYY"] });
This is equivalent to a metadata file expressed in the syntax defined in [ tabular-metadata ], looking like:
{ "@type": "Table", "url": "http://example.org/tree-ops.csv", "tableSchema": { "columns": [{ "name": "GID", "datatype": "string" }, { "name": "on_street", "datatype": "string" }, { "name": "species", "datatype": "string" }, { "name": "trim_cycle", "datatype": "string" }, { "name": "inventory_date", "datatype": { "base": "date", "format": "M/d/yyyy" } }] } }
This would be merged with the embedded metadata found in the CSV file, providing the titles for the columns to create:
{ "@type": "Table", "url": "http://example.org/tree-ops.csv", "tableSchema": { "columns": [{ "name": "GID", "titles": "GID", "datatype": "string" }, { "name": "on_street", "titles": "On Street", "datatype": "string" }, { "name": "species", "titles": "Species", "datatype": "string" }, { "name": "trim_cycle", "titles": "Trim Cycle", "datatype": "string" }, { "name": "inventory_date", "titles": "Inventory Date", "datatype": { "base": "date", "format": "M/d/yyyy" } }] } }
The processor can then create an annotated tabular data model that included name and datatype annotations on the columns, and created cells whose values were of appropriate types (in the case of this Javascript implementation, the cells in the last column would be Date objects, for example).
Assuming this kind of implementation-defined parsing, the columns would then have the annotations shown in the following table:
id | core annotations | ||||||
---|---|---|---|---|---|---|---|
| table | number | source number | cells | name | titles | datatype |
C1 | T | 1 | 1 | C1.1 , C2.1 | GID | GID | string |
C2 | T | 2 | 2 | C1.2 , C2.2 | on_street | On Street | string |
C3 | T | 3 | 3 | C1.3 , C2.3 | species | Species | string |
C4 | T | 4 | 4 | C1.4 , C2.4 | trim_cycle | Trim Cycle | string |
C5 | T | 5 | 5 | C1.5 , C2.5 | inventory_date | Inventory Date | { "base": "date", "format": "M/d/yyyy" } |
The cells have the annotations shown in the following table. Because of the overrides provided by the consumer to guide the parsing, and the way the parser works, the cells in the Inventory Date column (cells C1.5 and C2.5 ) have values that are parsed dates rather than unparsed strings.
id | core annotations | ||||
---|---|---|---|---|---|
| table | column | row | string value | value |
C1.1 | T | C1 | R1 | "1" | "1" |
C1.2 | T | C2 | R1 | "ADDISON AV" | "ADDISON AV" |
C1.3 | T | C3 | R1 | "Celtis australis" | "Celtis australis" |
C1.4 | T | C4 | R1 | "Large Tree Routine Prune" | "Large Tree Routine Prune" |
C1.5 | T | C5 | R1 | "10/18/2010" | 2010-10-18 |
C2.1 | T | C1 | R2 | "2" | "2" |
C2.2 | T | C2 | R2 | "EMERSON ST" | "EMERSON ST" |
C2.3 | T | C3 | R2 | "Liquidambar styraciflua" | "Liquidambar styraciflua" |
C2.4 | T | C4 | R2 | "Large Tree Routine Prune" | "Large Tree Routine Prune" |
C2.5 | T | C5 | R2 | "6/2/2010" | 2010-06-02 |
A similar set of annotations could be provided through a metadata file, located as discussed in section 5. Locating Metadata and defined in [ tabular-metadata ]. For example, this might look like:
{ "@context": ["http://www.w3.org/ns/csvw", {"@language": "en"}], "url": "tree-ops.csv", "dc:title": "Tree Operations", "dcat:keyword": ["tree", "street", "maintenance"], "dc:publisher": { "schema:name": "Example Municipality", "schema:url": {"@id": "http://example.org"} }, "dc:license": {"@id": "http://opendefinition.org/licenses/cc-by/"}, "dc:modified": {"@value": "2010-12-31", "@type": "xsd:date"}, "tableSchema": { "columns": [{ "name": "GID", "titles": ["GID", "Generic Identifier"], "dc:description": "An identifier for the operation on a tree.", "datatype": "string", "required": true }, { "name": "on_street", "titles": "On Street", "dc:description": "The street that the tree is on.", "datatype": "string" }, { "name": "species", "titles": "Species", "dc:description": "The species of the tree.", "datatype": "string" }, { "name": "trim_cycle", "titles": "Trim Cycle", "dc:description": "The operation performed on the tree.", "datatype": "string" }, { "name": "inventory_date", "titles": "Inventory Date", "dc:description": "The date of the operation that was performed.", "datatype": {"base": "date", "format": "M/d/yyyy"} }], "primaryKey": "GID", "aboutUrl": "#gid-{GID}" } }
The annotated tabular data model generated from this metadata file would be more sophisticated again. The table itself would have the following annotations:

- dc:title: {"@value": "Tree Operations", "@language": "en"}
- dcat:keyword: [{"@value": "tree", "@language": "en"}, {"@value": "street", "@language": "en"}, {"@value": "maintenance", "@language": "en"}]
- dc:publisher: [{"schema:name": "Example Municipality", "schema:url": {"@id": "http://example.org"}}]
- dc:license: {"@id": "http://opendefinition.org/licenses/cc-by/"}
- dc:modified: {"@value": "2010-12-31", "@type": "date"}
The columns would have the annotations shown in the following table:
| id | table | number | source number | cells | name | titles | datatype | dc:description |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| C1 | T | 1 | 1 | C1.1, C2.1 | GID | GID, Generic Identifier | string | An identifier for the operation on a tree. |
| C2 | T | 2 | 2 | C1.2, C2.2 | on_street | On Street | string | The street that the tree is on. |
| C3 | T | 3 | 3 | C1.3, C2.3 | species | Species | string | The species of the tree. |
| C4 | T | 4 | 4 | C1.4, C2.4 | trim_cycle | Trim Cycle | string | The operation performed on the tree. |
| C5 | T | 5 | 5 | C1.5, C2.5 | inventory_date | Inventory Date | {"base": "date", "format": "M/d/yyyy"} | The date of the operation that was performed. |
The rows have an additional primary key annotation, as shown in the following table:
id | core annotations | ||||
---|---|---|---|---|---|
table | number | source number | cells | primary key | |
R1 | T | 1 | 2 | C1.1 , C1.2 , C1.3 , C1.4 , C1.5 | C1.1 |
R2 | T | 2 | 3 | C2.1 , C2.2 , C2.3 , C2.4 , C2.5 | C2.1 |
Thanks to the provided metadata, the cells again have the annotations shown in the following table. The metadata file has provided the information to supplement the model with additional annotations but also, for the Inventory Date column (cells C1.5 and C2.5), to have a value that is a parsed date rather than an unparsed string.
| id | table | column | row | string value | value | about URL |
| --- | --- | --- | --- | --- | --- | --- |
| C1.1 | T | C1 | R1 | "1" | "1" | http://example.org/tree-ops.csv#gid-1 |
| C1.2 | T | C2 | R1 | "ADDISON AV" | "ADDISON AV" | http://example.org/tree-ops.csv#gid-1 |
| C1.3 | T | C3 | R1 | "Celtis australis" | "Celtis australis" | http://example.org/tree-ops.csv#gid-1 |
| C1.4 | T | C4 | R1 | "Large Tree Routine Prune" | "Large Tree Routine Prune" | http://example.org/tree-ops.csv#gid-1 |
| C1.5 | T | C5 | R1 | "10/18/2010" | 2010-10-18 | http://example.org/tree-ops.csv#gid-1 |
| C2.1 | T | C1 | R2 | "2" | "2" | http://example.org/tree-ops.csv#gid-2 |
| C2.2 | T | C2 | R2 | "EMERSON ST" | "EMERSON ST" | http://example.org/tree-ops.csv#gid-2 |
| C2.3 | T | C3 | R2 | "Liquidambar styraciflua" | "Liquidambar styraciflua" | http://example.org/tree-ops.csv#gid-2 |
| C2.4 | T | C4 | R2 | "Large Tree Routine Prune" | "Large Tree Routine Prune" | http://example.org/tree-ops.csv#gid-2 |
| C2.5 | T | C5 | R2 | "6/2/2010" | 2010-06-02 | http://example.org/tree-ops.csv#gid-2 |
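The about URL values in the table above come from expanding the aboutUrl template "#gid-{GID}" against each row's cells and resolving the result against the URL of the CSV file. A minimal sketch of that expansion, using simple string substitution rather than full URI template processing (the helper name is invented for illustration):

```python
from urllib.parse import urljoin

def about_url(csv_url: str, template: str, row: dict) -> str:
    # Substitute cell values into the template, then resolve the
    # resulting fragment against the URL of the CSV file.
    return urljoin(csv_url, template.format(**row))

url = about_url("http://example.org/tree-ops.csv", "#gid-{GID}", {"GID": "1"})
assert url == "http://example.org/tree-ops.csv#gid-1"
```

Real URI templates are richer than Python's str.format; this sketch only handles plain {name} substitutions like the one in the example.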
The following slightly amended CSV file contains quoted and missing cell values:
```
GID,On Street,Species,Trim Cycle,Inventory Date
1,ADDISON AV,"Celtis australis","Large Tree Routine Prune",10/18/2010
2,,"Liquidambar styraciflua","Large Tree Routine Prune",
```
Parsing this file similarly results in an annotated tabular data model of a single table T with five columns and two rows. The columns and rows have exactly the same annotations as previously, but there are two null cell values for C2.2 and C2.5. Note that the quoting of values within the CSV makes no difference to either the string value or value of the cell.
| id | table | column | row | string value | value |
| --- | --- | --- | --- | --- | --- |
| C1.1 | T | C1 | R1 | "1" | "1" |
| C1.2 | T | C2 | R1 | "ADDISON AV" | "ADDISON AV" |
| C1.3 | T | C3 | R1 | "Celtis australis" | "Celtis australis" |
| C1.4 | T | C4 | R1 | "Large Tree Routine Prune" | "Large Tree Routine Prune" |
| C1.5 | T | C5 | R1 | "10/18/2010" | "10/18/2010" |
| C2.1 | T | C1 | R2 | "2" | "2" |
| C2.2 | T | C2 | R2 | "" | null |
| C2.3 | T | C3 | R2 | "Liquidambar styraciflua" | "Liquidambar styraciflua" |
| C2.4 | T | C4 | R2 | "Large Tree Routine Prune" | "Large Tree Routine Prune" |
| C2.5 | T | C5 | R2 | "" | null |
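The behaviour described above can be sketched with an off-the-shelf CSV parser. This is a simplified illustration under the default flags, not the normative parsing algorithm: empty string values become null cell values, and quoting leaves the string value untouched.

```python
import csv
import io

data = (
    'GID,On Street,Species,Trim Cycle,Inventory Date\n'
    '1,ADDISON AV,"Celtis australis","Large Tree Routine Prune",10/18/2010\n'
    '2,,"Liquidambar styraciflua","Large Tree Routine Prune",\n'
)

rows = list(csv.reader(io.StringIO(data)))
header, records = rows[0], rows[1:]

# Empty string values map to null; the parser has already stripped the
# quotes, so quoting makes no difference to the string value.
cells = [[v if v != "" else None for v in record] for record in records]

assert cells[1][1] is None                 # C2.2: missing On Street
assert cells[1][4] is None                 # C2.5: missing Inventory Date
assert cells[0][2] == "Celtis australis"   # quotes are not part of the value
```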
The following example illustrates some of the complexities that can be involved in parsing tabular data, how the flags described above can be used, and how new tabular data formats could be defined that embed additional annotations into the tabular data model.
In this example, the publishers of the data are using an internal convention to supply additional metadata about the tabular data embedded within the file itself. They are also using a tab as a separator rather than a comma.
```
# publisher	City of Palo Alto
# updated	12/31/2010
#name	GID	on_street	species	trim_cycle	inventory_date
#datatype	string	string	string	string	date:M/D/YYYY
	GID	On Street	Species	Trim Cycle	Inventory Date
	1	ADDISON AV	Celtis australis	Large Tree Routine Prune	10/18/2010
	2	EMERSON ST	Liquidambar styraciflua	Large Tree Routine Prune	6/2/2010
```
Naive parsing of the above data will assume a comma separator, and thus results in a single table T with a single column and six rows. The column has the annotations shown in the following table:
| id | table | number | source number | cells | titles |
| --- | --- | --- | --- | --- | --- |
| C1 | T | 1 | 1 | C1.1, C2.1, C3.1, C4.1, C5.1, C6.1 | # publisher	City of Palo Alto |
The rows have the annotations shown in the following table:
id | core annotations | |||
---|---|---|---|---|
table | number | source number | cells | |
R1 | T | 1 | 2 | C1.1 |
R2 | T | 2 | 3 | C2.1 |
R3 | T | 3 | 4 | C3.1 |
R4 | T | 4 | 5 | C4.1 |
R5 | T | 5 | 6 | C5.1 |
R6 | T | 6 | 7 | C6.1 |
The cells have the annotations shown in the following table (note that the values of all the cells in the table are strings, denoted by the double quotes in the table below):
| id | table | column | row | string value | value |
| --- | --- | --- | --- | --- | --- |
| C1.1 | T | C1 | R1 | "# updated	12/31/2010" | "# updated	12/31/2010" |
| C2.1 | T | C1 | R2 | "#name	GID	on_street	species	trim_cycle	inventory_date" | "#name	GID	on_street	species	trim_cycle	inventory_date" |
| C3.1 | T | C1 | R3 | "#datatype	string	string	string	string	date:M/D/YYYY" | "#datatype	string	string	string	string	date:M/D/YYYY" |
| C4.1 | T | C1 | R4 | "	GID	On Street	Species	Trim Cycle	Inventory Date" | "	GID	On Street	Species	Trim Cycle	Inventory Date" |
| C5.1 | T | C1 | R5 | "	1	ADDISON AV	Celtis australis	Large Tree Routine Prune	10/18/2010" | "	1	ADDISON AV	Celtis australis	Large Tree Routine Prune	10/18/2010" |
| C6.1 | T | C1 | R6 | "	2	EMERSON ST	Liquidambar styraciflua	Large Tree Routine Prune	6/2/2010" | "	2	EMERSON ST	Liquidambar styraciflua	Large Tree Routine Prune	6/2/2010" |
The consumer of the data may use the flags described above to create a more useful set of data from this file. Specifically, they could set:

- the delimiter to a tab character
- skip rows to 4
- skip columns to 1
- the comment prefix to #
Setting these is done in an implementation-defined way. It could be done, for example, by sniffing the contents of the file itself, through command-line options, or by embedding a dialect description into a metadata file associated with the tabular data, which would look like:
{ "delimiter": "\t", "skipRows": 4, "skipColumns": 1, "commentPrefix": "#" }
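A sketch of applying this dialect with a stock CSV parser. This is an illustration, not the normative algorithm, and it assumes (as in the example file) that the skipped first column is empty on the non-comment lines:

```python
import csv

dialect = {"delimiter": "\t", "skipRows": 4, "skipColumns": 1, "commentPrefix": "#"}

raw = (
    "# publisher\tCity of Palo Alto\n"
    "# updated\t12/31/2010\n"
    "#name\tGID\ton_street\tspecies\ttrim_cycle\tinventory_date\n"
    "#datatype\tstring\tstring\tstring\tstring\tdate:M/D/YYYY\n"
    "\tGID\tOn Street\tSpecies\tTrim Cycle\tInventory Date\n"
    "\t1\tADDISON AV\tCeltis australis\tLarge Tree Routine Prune\t10/18/2010\n"
    "\t2\tEMERSON ST\tLiquidambar styraciflua\tLarge Tree Routine Prune\t6/2/2010\n"
)

lines = raw.splitlines()
# Skipped rows that begin with the comment prefix become comment
# annotations on the table, with the prefix stripped.
comments = [
    line[len(dialect["commentPrefix"]):].strip()
    for line in lines[: dialect["skipRows"]]
    if line.startswith(dialect["commentPrefix"])
]
# Parse the remaining lines with the tab delimiter, dropping the skipped column.
rows = [
    row[dialect["skipColumns"]:]
    for row in csv.reader(lines[dialect["skipRows"]:], delimiter=dialect["delimiter"])
]
header, records = rows[0], rows[1:]

assert len(comments) == 4
assert header == ["GID", "On Street", "Species", "Trim Cycle", "Inventory Date"]
assert records[1][1] == "EMERSON ST"
```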
With these flags in operation, parsing this file results in an annotated tabular data model of a single table T with five columns and two rows which is largely the same as that created from the original simple example described in section 8.2.1 Simple Example.
There are three differences. First, because the four skipped rows began with the comment prefix, the table itself now has four rdfs:comment annotations, with the values:

- publisher	City of Palo Alto
- updated	12/31/2010
- name	GID	on_street	species	trim_cycle	inventory_date
- datatype	string	string	string	string	date:M/D/YYYY
Second, because the first column has been skipped, the source number of each of the columns is offset by one from the number of each column:
| id | table | number | source number | cells | titles |
| --- | --- | --- | --- | --- | --- |
| C1 | T | 1 | 2 | C1.1, C2.1 | GID |
| C2 | T | 2 | 3 | C1.2, C2.2 | On Street |
| C3 | T | 3 | 4 | C1.3, C2.3 | Species |
| C4 | T | 4 | 5 | C1.4, C2.4 | Trim Cycle |
| C5 | T | 5 | 6 | C1.5, C2.5 | Inventory Date |
Finally, because four additional rows have been skipped, the source number of each of the rows is offset by five from the row number (the four skipped rows plus the single header row):
id | core annotations | |||
---|---|---|---|---|
table | number | source number | cells | |
R1 | T | 1 | 6 | C1.1 , C1.2 , C1.3 , C1.4 , C1.5 |
R2 | T | 2 | 7 | C2.1 , C2.2 , C2.3 , C2.4 , C2.5 |
The conventions used in this data (invented for the purpose of this example) are in fact intended to create an annotated tabular data model which includes named annotations on the table itself, on the columns, and on the cells.
The
creator
of
these
conventions
could
create
a
specification
for
this
particular
tabular
data
syntax
and
register
a
media
type
for
it.
The
specification
would
include
statements
like:
- Lines beginning with "#" provide metadata about the table.
- In a line beginning with #name, the remaining cells provide the name annotation for each column in the table.
- In a line beginning with #datatype, the remaining cells provide the datatype and format annotations for the cells in each column.
- The first non-comment line provides the title annotations on the columns.
Parsers that recognised the format could then build a more sophisticated annotated tabular data model using only the embedded information in the tabular data file. They would extract embedded metadata looking like:
```json
{
  "@context": "http://www.w3.org/ns/csvw",
  "url": "tree-ops.csv",
  "dc:publisher": "City of Palo Alto",
  "dc:updated": "12/31/2010",
  "tableSchema": {
    "columns": [{
      "name": "GID",
      "titles": "GID",
      "datatype": "string"
    }, {
      "name": "on_street",
      "titles": "On Street",
      "datatype": "string"
    }, {
      "name": "species",
      "titles": "Species",
      "datatype": "string"
    }, {
      "name": "trim_cycle",
      "titles": "Trim Cycle",
      "datatype": "string"
    }, {
      "name": "inventory_date",
      "titles": "Inventory Date",
      "datatype": {
        "base": "date",
        "format": "M/d/yyyy"
      }
    }]
  }
}
```
As before, the result would be a single table T with five columns and two rows. The table itself would have two annotations:

- dc:publisher: {"@value": "City of Palo Alto"}
- dc:updated: {"@value": "12/31/2010"}
The columns have the annotations shown in the following table:
| id | table | number | source number | cells | name | titles |
| --- | --- | --- | --- | --- | --- | --- |
| C1 | T | 1 | 2 | C1.1, C2.1 | GID | GID |
| C2 | T | 2 | 3 | C1.2, C2.2 | on_street | On Street |
| C3 | T | 3 | 4 | C1.3, C2.3 | species | Species |
| C4 | T | 4 | 5 | C1.4, C2.4 | trim_cycle | Trim Cycle |
| C5 | T | 5 | 6 | C1.5, C2.5 | inventory_date | Inventory Date |
The rows have the annotations shown in the following table, exactly as in the previous examples:
id | core annotations | |||
---|---|---|---|---|
table | number | source number | cells | |
R1 | T | 1 | 6 | C1.1 , C1.2 , C1.3 , C1.4 , C1.5 |
R2 | T | 2 | 7 | C2.1 , C2.2 , C2.3 , C2.4 , C2.5 |
The cells have the annotations shown in the following table. Because of the way the particular tabular data format has been specified, these include additional annotations but also, for the Inventory Date column (cells C1.5 and C2.5), have a value that is a parsed date rather than an unparsed string.
| id | table | column | row | string value | value |
| --- | --- | --- | --- | --- | --- |
| C1.1 | T | C1 | R1 | "1" | "1" |
| C1.2 | T | C2 | R1 | "ADDISON AV" | "ADDISON AV" |
| C1.3 | T | C3 | R1 | "Celtis australis" | "Celtis australis" |
| C1.4 | T | C4 | R1 | "Large Tree Routine Prune" | "Large Tree Routine Prune" |
| C1.5 | T | C5 | R1 | "10/18/2010" | 2010-10-18 |
| C2.1 | T | C1 | R2 | "2" | "2" |
| C2.2 | T | C2 | R2 | "EMERSON ST" | "EMERSON ST" |
| C2.3 | T | C3 | R2 | "Liquidambar styraciflua" | "Liquidambar styraciflua" |
| C2.4 | T | C4 | R2 | "Large Tree Routine Prune" | "Large Tree Routine Prune" |
| C2.5 | T | C5 | R2 | "6/2/2010" | 2010-06-02 |
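The conversion of the Inventory Date strings into parsed dates, as implied by the embedded datatype annotation date:M/D/YYYY, can be sketched as follows. The mapping from that pattern to a strptime directive string is an assumption for illustration; a real parser would translate date format patterns generally.

```python
from datetime import date, datetime

# Illustrative, hard-coded mapping for the single pattern used in the
# example; only "M/D/YYYY" is handled here.
PATTERNS = {"M/D/YYYY": "%m/%d/%Y"}

def parse_date(string_value: str, pattern: str) -> date:
    """Turn an unparsed cell string into a date value using the column's format."""
    return datetime.strptime(string_value, PATTERNS[pattern]).date()

assert parse_date("10/18/2010", "M/D/YYYY") == date(2010, 10, 18)
assert parse_date("6/2/2010", "M/D/YYYY") == date(2010, 6, 2)
```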
The following example shows a CSV file with multiple header lines:
```
Who,What,,Where,
Organisation,Sector,Subsector,Department,Municipality
#org,#sector,#subsector,#adm1,#adm2
UNICEF,Education,Teacher training,Chocó,Quidbó
UNICEF,Education,Teacher training,Chocó,Bojayá
```
Here, the first line contains some grouping titles, which are not particularly helpful. The lines following those contain useful titles for the columns. Thus the appropriate configuration for a dialect description is:
{ "skipRows": 1, "headerRowCount": 2 }
With this configuration, the table model contains five columns, each of which have two titles, summarised in the following table:
| id | table | number | source number | cells | titles |
| --- | --- | --- | --- | --- | --- |
| C1 | T | 1 | 1 | C1.1, C2.1 | Organisation, #org |
| C2 | T | 2 | 2 | C1.2, C2.2 | Sector, #sector |
| C3 | T | 3 | 3 | C1.3, C2.3 | Subsector, #subsector |
| C4 | T | 4 | 4 | C1.4, C2.4 | Department, #adm1 |
| C5 | T | 5 | 5 | C1.5, C2.5 | Municipality, #adm2 |
As metadata, this would look like:
```json
{
  "tableSchema": {
    "columns": [
      { "titles": ["Organisation", "#org"] },
      { "titles": ["Sector", "#sector"] },
      { "titles": ["Subsector", "#subsector"] },
      { "titles": ["Department", "#adm1"] },
      { "titles": ["Municipality", "#adm2"] }
    ]
  }
}
```
A separate metadata file could contain just the second of each of these titles, for example:
```json
{
  "tableSchema": {
    "columns": [
      { "name": "org", "titles": "#org" },
      { "name": "sector", "titles": "#sector" },
      { "name": "subsector", "titles": "#subsector" },
      { "name": "adm1", "titles": "#adm1" },
      { "name": "adm2", "titles": "#adm2" }
    ]
  }
}
```
This enables people from multiple jurisdictions to use the same tabular data structures without having to use exactly the same titles within their documents.
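The effect can be sketched as follows: with skipRows 1 and headerRowCount 2, each column accumulates one title per header line, and a consumer can locate a column by whichever title it knows (the helper function is invented for illustration):

```python
# Titles collected from the two header lines of the example above.
header_lines = [
    ["Organisation", "Sector", "Subsector", "Department", "Municipality"],
    ["#org", "#sector", "#subsector", "#adm1", "#adm2"],
]
titles = [list(pair) for pair in zip(*header_lines)]

def column_index(title: str) -> int:
    # Match a column by any of its titles, regardless of which
    # vocabulary the consumer uses.
    for i, ts in enumerate(titles):
        if title in ts:
            return i
    raise KeyError(title)

assert titles[0] == ["Organisation", "#org"]
assert column_index("Department") == column_index("#adm1") == 3
```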
This section is non-normative.
This appendix outlines various ways in which CSV is defined.
[ RFC4180 ] defines CSV with the following ABNF grammar:
```
file        = [header CRLF] record *(CRLF record) [CRLF]
header      = name *(COMMA name)
record      = field *(COMMA field)
name        = field
field       = (escaped / non-escaped)
escaped     = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE
non-escaped = *TEXTDATA
COMMA       = %x2C
CR          = %x0D
DQUOTE      = %x22
LF          = %x0A
CRLF        = CR LF
TEXTDATA    = %x20-21 / %x23-2B / %x2D-7E
```
Of particular note here are:
- TEXTDATA indicates that only non-control ASCII characters are permitted within a CSV file. This restriction is routinely ignored in practice, and is impractical on the international web.
- Lines are terminated by CRLF. This makes it harder to produce CSV files on Unix-based systems where the usual line ending is LF.
- The header parameter on the media type indicates whether the header is present or not.
- Within an escaped field, a double quote is represented by a pair of double quotes ( "" ).
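For illustration, the 2DQUOTE escaping rule from the grammar above, as produced by a typical CSV writer. This is a sketch using Python's csv module, which follows the same convention by default:

```python
import csv
import io

# A field containing the quote character is wrapped in double quotes,
# and each embedded double quote is doubled (2DQUOTE); a field
# containing the delimiter is also wrapped in double quotes.
buf = io.StringIO()
csv.writer(buf, lineterminator="\r\n").writerow(['he said "hi"', "a,b"])

assert buf.getvalue() == '"he said ""hi""","a,b"\r\n'
```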
Excel is a common tool for both creating and reading CSV documents, and therefore the CSV that it produces is a de facto standard.
The following describes the behaviour of Microsoft Excel for Mac 2011 with an English locale. Further testing is needed to see the behaviour of Excel in other situations.
Excel
generates
CSV
files
encoded
using
Windows-1252
with
LF
line
endings.
Characters
that
cannot
be
represented
within
Windows-1252
are
replaced
by
underscores.
Only
those
cells
that
need
escaping
(e.g.
because
they
contain
commas
or
double
quotes)
are
escaped,
and
double
quotes
are
escaped
with
two
double
quotes.
Dates and numbers are formatted as displayed, which means that formatting can lead to information being lost or becoming inconsistent.
When
opening
CSV
files,
Excel
interprets
CSV
files
saved
in
UTF-8
as
being
encoded
as
Windows-1252
(whether
or
not
a
BOM
is
present).
It
correctly
deals
with
double
quoted
cells,
except
that
it
converts
line
breaks
within
cells
into
spaces.
It
understands
CRLF
as
a
line
break.
It
detects
dates
(formatted
as
YYYY-MM-DD
)
and
formats
them
in
the
default
date
formatting
for
files.
Excel provides more control when importing CSV files into Excel. However, it does not properly understand UTF-8 (with or without BOM ). It does however properly understand UTF-16 and can read non-ASCII characters from a UTF-16-encoded file.
A particular quirk in the importing of CSV is that if a cell contains a line break, the final double quote that escapes the cell will be included within it.
When
tabular
data
is
copied
from
Excel,
it
is
copied
in
a
tab-delimited
format,
with
LF
line
breaks.
Downloaded
CSV
files
are
encoded
in
UTF-8,
without
a
BOM
,
and
with
LF
line
endings.
Dates
and
numbers
are
formatted
as
they
appear
within
the
spreadsheet.
CSV
files
can
be
imported
as
UTF-8
(with
or
without
BOM
).
CRLF
line
endings
are
correctly
recognised.
Dates
are
reformatted
to
the
default
date
format
on
load.
Tabular Data Packages place the following restrictions on CSV files:
As a starting point, CSV files included in a Tabular Data Package package must conform to the RFC for CSV (4180 - Common Format and MIME Type for Comma-Separated Values (CSV) Files). In addition:
- File names MUST end with .csv.
- Files MUST be encoded as UTF-8.
- Files MUST have a single header row. This row MUST be the first row in the file.
  - Terminology: each column in the CSV file is termed a field and its name is the string in that column in the header row.
  - The name MUST be unique amongst fields, MUST contain at least one character, and MUST conform to the character restrictions defined for the name property.
- Rows in the file MUST NOT contain more fields than are in the header row (though they may contain less).
- Each file MUST have an entry in the tables array in the datapackage.json file.
- The resource metadata MUST include a tableSchema attribute whose value MUST be a valid schema description.
- All fields in the CSV files MUST be described in the schema description.
CSV files generated by different applications often vary in their syntax, e.g. in their use of quoting characters, delimiters, etc. To encourage conformance, CSV files in a Tabular Data Package SHOULD:
- Use "," as the field delimiter.
- Use CRLF (U+000D U+000A) or LF (U+000A) as the line terminator.

If a CSV file does not follow these rules then its specific CSV dialect MUST be documented. The resource hash for the resource in the datapackage.json descriptor MUST:

- Include a dialect key that conforms to that described in the CSV Dialect Description Format.

Applications processing the CSV file SHOULD use the dialect of the CSV file to guide parsing.
More details of the behaviour of other tools should go here. This should include the most popular CSV parsing/generating libraries in common programming languages. To replicate the findings above, test files which include non-ASCII characters, double quotes, and line breaks within cells are:
This section is non-normative.
The document has undergone substantial changes since the last working draft. Below are some of the changes made: