Copyright © 2016 W3C ® ( MIT , ERCIM , Keio , Beihang ). W3C liability , trademark and document use rules apply.
This document provides Best Practices related to the publication and usage of data on the Web designed to help support a self-sustaining ecosystem. Data should be discoverable and understandable by humans and machines. Where data is used in some way, whether by the originator of the data or by an external party, such usage should also be discoverable and the efforts of the data publisher recognized. In short, following these Best Practices will facilitate interaction between publishers and consumers.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.
The Working Group believes that this document is now complete and ready to advance to Candidate Recommendation (call for implementations). A template is used to show the "what", "why" and "how" of each Best Practice. If you have comments to make before that step is taken, please make them before Sunday 12 June 2016.
This document was published by the Data on the Web Best Practices Working Group as a Working Draft. This document is intended to become a W3C Recommendation. If you wish to make comments regarding this document, please send them to public-dwbp-comments@w3.org (subscribe, archives). All comments are welcome.
Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
This document was produced by a group operating under the 5 February 2004 W3C Patent Policy . W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy .
This document is governed by the 1 September 2015 W3C Process Document .
This section is non-normative.
The Best Practices described below have been developed to encourage and enable the continued expansion of the Web as a medium for the exchange of data. The growth in online sharing of open data by governments across the world [OKFN-INDEX] [ODB], the increasing online publication of research data encouraged by organizations like the Research Data Alliance [RDA], the harvesting, analysis and online publishing of social media data, crowd-sourcing of information, the increasing presence on the Web of important cultural heritage collections such as at the Bibliothèque nationale de France [BNF] and the sustained growth in the Linked Open Data Cloud [LODC], provide some examples of this growth in the use of the Web for publishing data.
However, this growth is not consistent in style and in many cases does not make use of the full potential of the Open Web Platform's ability to link one fact to another, to discover related resources and to create interactive visualizations.
In broad terms, data publishers aim to share data either openly or with controlled access. Data consumers (who may also be producers themselves) want to be able to find, use and link to the data, especially if it is accurate, regularly updated and guaranteed to be available at all times. This creates a fundamental need for a common understanding between data publishers and data consumers. Without this agreement, data publishers' efforts may be incompatible with data consumers' desires.
The openness and flexibility of the Web create new challenges for data publishers and data consumers, such as how to represent, describe and make data available in a way that it will be easy to find and to understand. In contrast to conventional databases, for example, where there is a single data model to represent the data and a database management system (DBMS) to control data access, data on the Web allows for the existence of multiple ways to represent and to access data. For more details about the challenges see the section Data on the Web Challenges.
In this context, it becomes crucial to provide guidance to publishers that will improve consistency in the way data is managed. Such guidance will promote the reuse of data and also foster trust in the data among developers, whatever technology they choose to use, increasing the potential for genuine innovation.
Not all data and metadata should be shared openly, however. Security, commercial sensitivity and, above all, individuals' privacy need to be taken into account. It is for data publishers to determine policy on which data should be shared and under what circumstances. Data sharing policies are likely to assess the exposure risk and determine the appropriate security measures to be taken to protect sensitive data, such as secure authentication and authorization.

Depending on circumstances, sensitive information about individuals might include full name, home address, email address, national identification number, IP address, vehicle registration plate number, driver's license number, face, fingerprints or handwriting, credit card numbers, digital identity, date of birth, birthplace, genetic information, telephone number, login name, screen name, nickname, health records etc. Although it is likely to be safe to share some of that information openly, and even more within a controlled environment, publishers should bear in mind that combining data from multiple sources may allow inadvertent identification of individuals.
A general Best Practice to publish Data on the Web is to use standards. Different types of organizations specify standards that are specific to the publishing of datasets related to particular domains & applications, involving communities of users interested in that data. These standards define a common way of communicating information among the users of these communities. For example, there are two standards that can be used to publish transport timetables: the General Transit Feed Specification [GTFS] and the Service Interface for Real Time Information [SIRI]. These specify, in a mixed way, standardized terms, standardized data formats and standardized data access.

The Best Practices set out in this document serve the general purpose of publishing and using Data on the Web and are domain & application independent. They can be extended or complemented by other Best Practices documents or standards that cover more specialized contexts.
Best Practices cover different aspects related to data publishing and consumption, like data formats, data access, data identifiers and metadata. In order to delimit the scope and elicit the required features for Data on the Web Best Practices, the DWBP working group compiled a set of use cases [DWBP-UCR] that represent scenarios of how data is commonly published on the Web and how it is used. The set of requirements derived from these use cases was used to guide the development of the Best Practices.
The Best Practices proposed in this document are intended to serve a more general purpose than the practices suggested in, for example, Best Practices for Publishing Linked Data [LD-BP] since DWBP is domain-independent. Whilst DWBP recommends the use of Linked Data, it also promotes best practices for data on the Web in other open formats such as CSV and JSON. The Best Practices related to the use of vocabularies incorporate practices that stem from Best Practices for Publishing Linked Data where appropriate.

In order to encourage data publishers to adopt the DWBP, a number of distinct benefits were identified: comprehension; processability; discoverability; reuse; trust; linkability; access; and interoperability. They are described in the section Best Practices Benefits.
This section is non-normative.
This document sets out Best Practices tailored primarily for those who publish data on the Web. The Best Practices are designed to meet the needs of information management staff, developers, and wider groups such as scientists interested in sharing and reusing research data on the Web. While data publishers are our primary audience, we encourage all those engaged in related activities to become familiar with it. Every attempt has been made to make the document as readable and usable as possible while still retaining the accuracy and clarity needed in a technical specification.

Readers of this document are expected to be familiar with some fundamental concepts of the architecture of the Web [WEBARCH], such as resources and URIs, as well as a number of data formats. The normative element of each Best Practice is the intended outcome. Possible implementations are suggested and, where appropriate, these recommend the use of a particular technology. A basic knowledge of vocabularies and data models would be helpful to better understand some aspects of this document.
This section is non-normative.
This document is concerned solely with Best Practices that:

As noted above, whether a Best Practice has or has not been followed should be judged against the intended outcome, not the possible approach to implementation, which is offered as guidance. A Best Practice is always subject to improvement as we learn and evolve the Web together.
This section is non-normative.
The following diagram illustrates the context considered in this document. In general, the Best Practices proposed for publication and usage of Data on the Web refer to datasets and distributions. Data is published in different distributions, which are specific physical forms of a dataset. By data, "we mean known facts that can be recorded and that have implicit meaning" [Navathe]. These distributions facilitate the sharing of data on a large scale, which allows datasets to be used by several groups of data consumers, without regard to purpose, audience, interest, or license.

Given this heterogeneity and the fact that data publishers and data consumers may be unknown to each other, it is necessary to provide some information about the datasets and distributions that may also contribute to trustworthiness and reuse, such as: structural metadata, descriptive metadata, access information, data quality information, provenance information, license information and usage information.
An important aspect of publishing and sharing data on the Web concerns the architectural basis of the Web, as discussed in [WEBARCH]. A relevant aspect of this is the identification principle, which says that URIs should be used to identify resources. In our context, a resource may be a whole dataset or a specific item of a given dataset. All resources should be published with stable URIs, so that they can be referenced and so that links can be made, via URIs, between two or more resources.
The following diagram illustrates the dataset composition (data values and metadata) together with other components related to dataset publication and usage. Data values correspond to the data itself and may be available in one or more distributions, which should be defined by the publisher considering data consumers' expectations. The Metadata component corresponds to the additional information that describes the dataset and dataset distributions, helping the manipulation and reuse of the data. In order to allow easy access to the dataset and its corresponding distributions, multiple dataset access mechanisms should be available. Finally, to promote interoperability among datasets it is important to adopt data vocabularies and standards.
This section is non-normative.
The openness and flexibility of the Web creates new challenges for data publishers and data consumers. In contrast to conventional databases, for example, where there is a single data model to represent the data and a database management system (DBMS) to control data access, data on the Web allows for the existence of multiple ways to represent and to access data.
The following diagram summarizes some of the main challenges faced when publishing or consuming data on the Web. These challenges were identified from the DWBP Use Cases and Requirements [UCR] and, as presented in the diagram, each is addressed by one or more Best Practices. Each of these challenges gave rise to one or more requirements, as documented in the use cases document. The development of the Data on the Web Best Practices was guided by these requirements, in such a way that each Best Practice should have at least one of these requirements as evidence of its relevance.
6. Best Practices Benefits

This section is non-normative.
In order to encourage data publishers to adopt the DWBP, the list below describes the main benefits of applying the DWBP. Each benefit represents an improvement in the way datasets are available on the Web.
Comprehension: humans will have a better understanding about the data structure, the data meaning, the metadata and the nature of the dataset.

Processability: machines will be able to automatically process and manipulate the data within a dataset.

Discoverability: machines will be able to automatically discover a dataset or data within a dataset.

Reuse: the chances of dataset reuse by different groups of data consumers will increase.

Trust: the confidence that consumers have in the dataset will improve.

Linkability: it will be possible to create links between data resources (datasets and data items).

Access: humans and machines will be able to access up-to-date data in a variety of forms.

Interoperability: it will be easier to reach consensus among data publishers and consumers.
The figure below shows the benefits that data publishers will gain with adoption of the Best Practices. Section 17 presents a table that relates Best Practices to Benefits.
Reuse
Best Practice 1: Provide metadata
Best Practice 2: Provide descriptive metadata
Best Practice 3: Provide locale parameters metadata
Best Practice 4: Provide structural metadata
Best Practice 5: Provide data license information
Best Practice 6: Provide data provenance information
Best Practice 7: Provide data quality information
Best Practice 8: Provide versioning information
Best Practice 9: Provide version history
Best Practice 11: Use persistent URIs as identifiers of datasets
Best Practice 12: Use persistent URIs as identifiers within datasets
Best Practice 13: Assign URIs to dataset versions and series
Best Practice 14: Use machine-readable standardized data formats
Best Practice 15: Provide data in multiple formats
Best Practice 16: Use standardized terms
Best Practice 17: Reuse vocabularies
Best Practice 19: Provide data unavailability reference
Best Practice 20: Provide bulk download
Best Practice 22: Serving data and resources with different formats
Best Practice 23: Provide real-time access
Best Practice 24: Provide data up to date
Best Practice 27: Assess dataset coverage
Best Practice 28: Use a trusted serialisation format for preserved data dumps
Best Practice 29: Update the status of identifiers
Best Practice 30: Gather feedback from data consumers
Best Practice 31: Provide information about feedback
Best Practice 32: Enrich data by generating new metadata

Access
Best Practice 20: Provide bulk download
Best Practice 22: Serving data and resources with different formats
Best Practice 23: Provide real-time access
Best Practice 24: Provide data up to date

Discoverability
Best Practice 1: Provide metadata
Best Practice 2: Provide descriptive metadata
Best Practice 11: Use persistent URIs as identifiers of datasets
Best Practice 12: Use persistent URIs as identifiers within datasets
Best Practice 13: Assign URIs to dataset versions and series
Processability

The following namespace prefixes are used throughout this document.
Prefix | Namespace IRI |
---|---|
cnt | http://www.w3.org/2011/content# |
dcat | http://www.w3.org/ns/dcat# |
dct | http://purl.org/dc/terms/ |
dqv | http://www.w3.org/ns/dqv# |
duv | http://www.w3.org/ns/duv# |
foaf | http://xmlns.com/foaf/0.1/ |
oa | http://www.w3.org/ns/oa# |
owl | http://www.w3.org/2002/07/owl# |
pav | http://pav-ontology.github.io/pav/ |
prov | http://www.w3.org/ns/prov# |
rdf | http://www.w3.org/1999/02/22-rdf-syntax-ns# |
rdfs | http://www.w3.org/2000/01/rdf-schema# |
skos | http://www.w3.org/2004/02/skos/core# |
This section presents the template used to describe Data on the Web Best Practices.
Best Practice Template
Short description of the BP
This section answers two crucial questions:
A full text description of the problem addressed by the Best Practice may also be provided. It can be any length but is likely to be no more than a few sentences.

What it should be possible to do when a data publisher follows the Best Practice.
A description of a possible implementation strategy is provided. This represents the best advice available at the time of writing but specific circumstances and future developments may mean that alternative implementation methods are more appropriate to achieve the intended outcome.
Information on how to test the BP has been met. This might or might not be machine testable.
Information about the relevance of the BP. It is described by one or more relevant requirements as documented in the Data on the Web Best Practices Use Cases & Requirements document [ DWBP-UCR ].
This section contains the Best Practices to be used by data publishers in order to help them and data consumers to overcome the different challenges faced when publishing and consuming data on the Web. One or more Best Practices were proposed for each of the challenges, which are described in the section Data on the Web Challenges.
Each BP is related to one or more requirements from the Data on the Web Best Practices Use Cases & Requirements document [DWBP-UCR], which guided their development. Each Best Practice has at least one of these requirements as evidence of its relevance.
RDF examples are used to show the result of the application of some Best Practices. They are shown using Turtle syntax [TURTLE] and JSON-LD [JSON-LD].

Note: in this current version, examples are presented just in Turtle syntax.
The Web is an open information space, where the absence of a specific context, such a company's internal information system, means that the provision of metadata is a fundamental requirement. Data will not be discoverable or reusable by anyone other than the publisher if insufficient metadata is provided. Metadata provides additional information that helps data consumers better understand the meaning of data, its structure, and to clarify other issues, such as rights and license terms, the organization that generated the data, data quality, data access methods and the update schedule of datasets.
Metadata can be used to help tasks such as dataset discovery and reuse, and can be assigned considering different levels of granularity from a single property of a resource to a whole dataset, or all datasets from a specific organization.
Metadata can be of different types. These types can be classified in different taxonomies, with different grouping criteria. For example, a specific taxonomy could define three metadata types according to descriptive, structural and administrative features. Descriptive metadata serves to identify a dataset, structural metadata serves to understand the structure in which the dataset is distributed, and administrative metadata serves to provide information about the version, the update schedule etc. A different taxonomy could define metadata types with a scheme according to the tasks where metadata are used, for example, discovery and reuse.
Best Practice 1: Provide metadata
Provide metadata for both human users and computer applications.
Providing metadata is a fundamental requirement when publishing data on the Web because data publishers and data consumers may be unknown to each other. It is therefore essential to provide information that helps human users and computer applications to understand the data, as well as other important aspects that describe a dataset or a distribution.
Humans will be able to understand the metadata and computer applications, notably user agents, will be able to process it.
Possible approaches to provide human readable metadata:
Possible approaches to provide machine-readable metadata:
Check if human-readable metadata is available.

Check if the metadata is available in a valid machine-readable format and without syntax errors.
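For illustration, a minimal sketch of machine-readable metadata in Turtle, using DCAT and Dublin Core terms; the dataset URI and values are hypothetical:

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

# Hypothetical dataset described with basic metadata for humans and machines
<http://data.example.org/datasets/bus-stops>
    a dcat:Dataset ;
    dct:title       "Bus stops of Example City"@en ;
    dct:description "Locations and identifiers of the bus stops operated in Example City."@en ;
    dct:modified    "2016-06-01"^^xsd:date .
```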
Relevant requirements : R-MetadataAvailable, R-MetadataDocum, R-MetadataMachineRead
Best Practice 2: Provide descriptive metadata
Provide metadata that describes the overall features of datasets and distributions.
Explicitly providing dataset descriptive information allows user agents to automatically discover datasets available on the Web and it allows humans to understand the nature of the dataset and its distributions.
Humans will be able to interpret the nature of the dataset and its distributions, and software agents will be able to automatically discover datasets and distributions.

Descriptive metadata can include the following overall features of a dataset:

Descriptive metadata can include the following overall features of a distribution:
The machine-readable version of the descriptive metadata can be provided using the vocabulary recommended by W3C to describe datasets, i.e. the Data Catalog Vocabulary [VOCAB-DCAT]. This provides a framework in which datasets can be described as abstract entities.
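For illustration, a minimal sketch of descriptive metadata using DCAT; all URIs and values are hypothetical:

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

# Hypothetical descriptive metadata for a dataset and one of its distributions
<http://data.example.org/datasets/bus-stops>
    a dcat:Dataset ;
    dct:title         "Bus stops of Example City"@en ;
    dct:description   "Locations and identifiers of the bus stops operated in Example City."@en ;
    dcat:keyword      "bus", "public transport" ;
    dct:publisher     <http://data.example.org/org/transport-agency> ;
    dcat:distribution <http://data.example.org/datasets/bus-stops.csv> .

<http://data.example.org/org/transport-agency>
    a foaf:Organization ;
    foaf:name "Example Transport Agency"@en .

<http://data.example.org/datasets/bus-stops.csv>
    a dcat:Distribution ;
    dct:title        "CSV distribution of the bus stops dataset"@en ;
    dcat:downloadURL <http://data.example.org/datasets/bus-stops.csv> ;
    dcat:mediaType   "text/csv" .
```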
Check if the metadata for the dataset itself includes the overall features of the dataset in a human-readable format.

Check if the descriptive metadata is available in a valid machine-readable format.
Relevant requirements : R-MetadataAvailable , R-MetadataMachineRead , R-MetadataStandardized
Best Practice 3: Provide locale parameters metadata
Provide metadata about locale parameters (date, time, and number formats, language).
Providing locale parameters metadata helps humans and computer applications to work accurately with things like dates, currencies and numbers that may look similar but have different meanings in different locales. For example, the 'date' 4/7 can be read as the 7th of April or the 4th of July depending on where the data was created. Similarly, €2,000 is either two thousand Euros or an over-precise representation of two Euros. Making the locale and language explicit allows users to determine how readily they can work with the data and may enable automated translation services.
Humans and software agents will be able to interpret the meaning of strings representing dates, times, currencies, numbers etc. accurately.
Locale parameters metadata can include the following information:

The machine-readable version of the locale parameters metadata may be provided according to the vocabulary recommended by W3C to describe datasets, i.e. the Data Catalog Vocabulary [VOCAB-DCAT].
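For illustration, a minimal sketch of locale metadata; the dataset URI and the conventions described are hypothetical, while the language URI is an existing Library of Congress identifier:

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .

# Hypothetical locale metadata: the language of the data and a human-readable note
# stating the date and number conventions used
<http://data.example.org/datasets/bus-stops>
    a dcat:Dataset ;
    dct:language    <http://id.loc.gov/vocabulary/iso639-1/en> ;
    dct:description "Dates are given as ISO 8601 (YYYY-MM-DD); decimal numbers use '.' as the decimal separator."@en .
```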
Check if the metadata for the dataset itself includes information about locale parameters (i.e. date, time, and number formats, language) in a human-readable format.

Check if the locale information is available in a valid machine-readable format and without syntax errors.
Relevant requirements : R-FormatLocalize , R-MetadataAvailable , R-GeographicalContext
Best Practice 4: Provide structural metadata
Provide metadata that describes the schema and internal structure of a distribution.
Providing information about the internal structure of a distribution is essential for others wishing to explore or query the dataset. It also helps people to understand the meaning of the data.
Humans will be able to interpret the schema of a dataset and software agents will be able to automatically process the structural metadata about distributions.
Human-readable structural metadata usually provides the properties or columns of the dataset schema.
Machine-readable structural metadata is available according to the format of a specific distribution and may be provided within separate documents or embedded into the document. For more details see the links below.
Check if the structural metadata of the dataset is provided in a human-readable format.

Check if the metadata of the distribution includes structural information about the dataset in a machine-readable format and without syntax errors.
Relevant requirements : R-MetadataAvailable
A license is a very useful piece of information to be attached to data on the Web. According to the type of license adopted by the publisher, there might be more or fewer restrictions on sharing and reusing data. In the context of data on the Web, the license of a dataset can be specified within the data or its metadata, or outside of it, in a separate document to which it is linked.
Best Practice 5: Provide data license information
Provide a link to or copy of the license agreement that controls use of the data.
The presence of license information is essential for data consumers to assess the usability of data. User agents may use the presence or absence of license information as a trigger for inclusion or exclusion of data presented to a potential consumer.
Humans will be able to understand data license information describing possible restrictions placed on the use of a given distribution, and software agents will be able to automatically detect the data license of a distribution.
Data license information can be made available via a link to, or an embedded copy of, a human-readable license agreement. It can also be made available for processing via a link to, or an embedded copy of, a machine-readable license agreement. One of the following vocabulary properties for linking to a license can be used:

dct:license
cc:license
schema:license
xhtml:license
There are also a number of machine-readable rights languages, including:
Check if the metadata for the dataset itself includes the data license information in a human-readable format.
Check if a user agent can automatically detect/discover the data license of the dataset.
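For illustration, a machine-readable license statement that such a user agent could detect; the distribution URI is hypothetical, while the license URI points to a real Creative Commons license:

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .

# Hypothetical distribution whose license is given as a link to a license URI
<http://data.example.org/datasets/bus-stops.csv>
    a dcat:Distribution ;
    dct:license <http://creativecommons.org/licenses/by/4.0/> .
```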
Relevant requirements : R-LicenseAvailable, R-MetadataMachineRead, R-LicenseLiability
Data provenance becomes particularly important when data is shared between collaborators who might not have direct contact with one another, either due to proximity or because the published data outlives the lifespan of the data provider projects or organizations. The Web brings together business, engineering, and scientific communities, creating collaborative opportunities that were previously unimaginable. The challenge in publishing data on the Web is providing an appropriate level of detail about its origin. The data producer may not necessarily be the data provider, and so collecting and conveying this corresponding metadata is particularly important. Without provenance, consumers have no inherent way to trust the integrity and credibility of the data being shared. Data publishers in turn need to be aware of the needs of prospective consumer communities to know how much provenance detail is appropriate.
Best Practice 6: Provide data provenance information
Provide complete information about the origins of the data and any changes you have made.
Provenance is one means by which consumers of a dataset judge its quality. Understanding its origin and history helps one determine whether to trust the data and provides important interpretive context.
Humans will know the origin or history of the dataset and software agents will be able to automatically process the provenance information.
The machine-readable version of the data provenance can be provided using an ontology recommended to describe provenance information, such as W3C's Provenance Ontology [PROV-O].
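For illustration, a minimal provenance sketch using PROV-O terms; the URIs and dates are hypothetical:

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

# Hypothetical provenance: who produced the dataset and when it was generated
<http://data.example.org/datasets/bus-stops>
    a dcat:Dataset, prov:Entity ;
    prov:wasAttributedTo <http://data.example.org/org/transport-agency> ;
    prov:generatedAtTime "2016-06-01T09:00:00Z"^^xsd:dateTime ;
    dct:issued           "2016-06-01"^^xsd:date .
```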
Check that the metadata for the dataset itself includes the provenance information about the dataset in a human-readable format.
Check if a computer application can automatically process the provenance information about the dataset.
Relevant requirements : R-ProvAvailable , R-MetadataAvailable
The quality of a dataset can have a big impact on the quality of applications that use it. As a consequence, the inclusion of data quality information in data publishing and consumption pipelines is of primary importance. Usually, the assessment of quality involves different kinds of quality dimensions, each representing groups of characteristics that are relevant to publishers and consumers.
The Data Quality Vocabulary defines concepts such as measures and metrics to assess the quality for each quality dimension [VOCAB-DQV]. There are heuristics designed to fit specific assessment situations that rely on quality indicators, namely, pieces of data content, pieces of data meta-information, and human ratings that give indications about the suitability of data for some intended use.
Best Practice 7: Provide data quality information
Provide information about data quality and fitness for particular purposes.
Data quality might seriously affect the suitability of data for specific applications, including applications very different from the purpose for which it was originally generated. Documenting data quality significantly eases the process of dataset selection, increasing the chances of reuse. Independently from domain-specific peculiarities, the quality of data should be documented and known quality issues should be explicitly stated in metadata.
Humans and software agents will be able to assess the quality, and therefore the suitability, of a dataset for their application.
The machine-readable version of the dataset quality metadata may be provided using the Data Quality Vocabulary developed by the DWBP working group [VOCAB-DQV].
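For illustration, a minimal sketch using the Data Quality Vocabulary; the URIs, metric and value are hypothetical:

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dqv:  <http://www.w3.org/ns/dqv#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

# Hypothetical quality measurement attached to a distribution
<http://data.example.org/datasets/bus-stops.csv>
    a dcat:Distribution ;
    dqv:hasQualityMeasurement <http://data.example.org/quality/bus-stops-csv-completeness> .

<http://data.example.org/quality/bus-stops-csv-completeness>
    a dqv:QualityMeasurement ;
    dqv:isMeasurementOf <http://data.example.org/metrics/completenessRatio> ;
    dqv:value "0.96"^^xsd:decimal .
```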
Check that the metadata for the dataset itself includes quality information about the dataset.
Check if a computer application can automatically process the quality information about the dataset.
Relevant requirements: R-QualityMetrics , R-DataMissingIncomplete , R-QualityOpinions
Datasets published on the Web may change over time. Some datasets are updated on a scheduled basis, and other datasets are changed as improvements in collecting the data make updates worthwhile. In order to deal with these changes, new versions of a dataset may be created. Unfortunately, there is no consensus about when changes to a dataset should cause it to be considered a different dataset altogether rather than a new version. In the following, we present some scenarios where most publishers would agree that the revision should be considered a new version of the existing dataset.
In general, multiple datasets that represent time series or spatial series, e.g. the same kind of data for different regions or for different years, are not considered multiple versions of the same dataset. In this case, each dataset covers a different set of observations about the world and should be treated as a new dataset rather than a new version of an existing dataset. This is also the case with a dataset that collects data about weekly weather forecasts for a given city, where every week a new dataset is created to store data about that specific week.
Scenarios 1 and 2 might trigger a major version, whereas Scenario 3 would likely trigger only a minor version. But how you decide whether versions are minor or major is less important than that you avoid making changes without incrementing the version indicator.
Even for small changes, it is important to keep track of the different dataset versions to make the dataset trustworthy. Publishers should remember that a given dataset may be in use by one or more data consumers, and they should take reasonable steps to inform those consumers when a new version is released. For real-time data, an automated timestamp can serve as a version identifier. For each dataset, the publisher should take a consistent, informative approach to versioning, so data consumers can understand and work with the changing data.
Best Practice 8: Provide versioning information

Assign and indicate a version number or date for each dataset.
Version information makes a revision of a dataset uniquely identifiable. Uniqueness can be used by data consumers to determine whether and how data has changed over time and to determine specifically which version of a dataset they are working with. Good data versioning enables consumers to understand if a newer version of a dataset is available. Explicit versioning allows for repeatability in research, enables comparisons, and prevents confusion. Using unique version numbers that follow a standardized approach can also set consumer expectations about how the versions differ.
Humans and software agents will easily be able to determine which version of a dataset they are working with.
The best method for providing versioning information will vary according to the context; however, there are some basic guidelines that can be followed, for example:

The Web Ontology Language [OWL2-QUICK-REFERENCE] and the Provenance, Authoring and Versioning Ontology [PAV] provide a number of annotation properties for version information.
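For illustration, a minimal version indicator expressed with one of these annotation properties; the dataset URI and values are hypothetical:

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

# Hypothetical version indicator: a version number and the date it was issued
<http://data.example.org/datasets/bus-stops>
    a dcat:Dataset ;
    owl:versionInfo "1.2" ;
    dct:issued      "2016-06-01"^^xsd:date .
```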
Check if the metadata for the dataset/distribution provides a unique version number or date in a human-readable format.

Check if a computer application can automatically detect/discover the unique version number or date of a dataset or distribution.
Relevant requirements : R-DataVersion
Best Practice 9: Provide version history
Provide a complete version history that explains the changes made in each version.
In creating applications that use data, it can be helpful to understand the variability of that data over time. Interpreting the data is also enhanced by an understanding of its dynamics. Determining how the various versions of a dataset differ from each other is typically very laborious unless a summary of the differences is provided.
Humans and software agents will be able to understand how the dataset typically changes from version to version and how any two specific versions differ.
Provide a list of published versions and a description for each version that explains how it differs from the previous version. An API can expose a version history with a single dedicated URL that retrieves the latest version of the complete history.
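For illustration, one way a version history entry might be expressed; the URIs and change description are hypothetical:

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

# Hypothetical version history: each release links to the one it replaces and summarizes the changes
<http://data.example.org/datasets/bus-stops/1.2>
    a dcat:Dataset ;
    dct:replaces <http://data.example.org/datasets/bus-stops/1.1> ;
    rdfs:comment "Version 1.2: corrects the coordinates of three stops and adds a 'wheelchair accessible' column."@en .
```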
Check that a list of published versions is available, as well as a change log describing precisely how each version differs from the previous one.
Relevant requirements : R-DataVersion
Identifiers take many forms and are used extensively in every information system. Data discovery, usage and citation on the Web depends fundamentally on the use of HTTP (or HTTPS) URIs: globally unique identifiers that can be looked up by dereferencing them over the Internet [ RFC3986 ]. It is perhaps worth emphasizing some key points about URIs in the current context.
Best Practice 11: Use persistent URIs as identifiers of datasets

Identify each dataset by a carefully chosen, persistent URI.
Adopting a common identification system enables basic data identification and comparison processes by any stakeholder in a reliable way. They are an essential pre-condition for proper data management and reuse.
Developers may build URIs into their code and so it is important that those URIs persist and that they dereference to the same resource over time without the need for human intervention.
Datasets, or information about datasets, will be discoverable and citable through time, regardless of the status, availability or format of the data.

To be persistent, URIs must be designed as such.
A lot has been written on this topic; see, for example, the European Commission's Study on Persistent URIs [PURI], which in turn links to many other resources.
Where a data publisher is unable or unwilling to manage a URI space directly for persistence, an alternative approach is to use a redirection service such as Permanent Identifiers for the Web or purl.org. These provide persistent URIs that can be redirected as required, so that the eventual location can be ephemeral. The software behind such services is freely available so that it can be installed and managed locally if required.
Digital Object Identifiers (DOIs) offer a similar alternative. These identifiers are defined independently of any Web technology but can be appended to a 'URI stub'. DOIs are an important part of the digital infrastructure for research data and libraries.
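For illustration, a minimal sketch of a dataset identified by a persistent URI minted under a redirection service such as Permanent Identifiers for the Web; the URIs are hypothetical:

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .

# Hypothetical persistent identifier minted under a redirection service,
# decoupled from the current, possibly ephemeral, hosting location
<https://w3id.org/example/bus-stops>
    a dcat:Dataset ;
    dct:title        "Bus stops of Example City"@en ;
    dcat:landingPage <http://data.example.org/datasets/bus-stops> .
```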
Check that each dataset in question is identified using a URI designed for persistence. Ideally the relevant Web site includes a description of the design scheme and a credible pledge of persistence should the publisher no longer be able to maintain the URI space themselves.
Relevant requirements : R-UniqueIdentifier , R-Citable
Best Practice 12: Use persistent URIs as identifiers within datasets

Reuse other people's URIs as identifiers within datasets where possible.
The power of the Web lies in the Network effect . The first telephone only became useful when the second telephone meant there was someone to call; the third telephone made both of them more useful yet. Data becomes more valuable if it refers to other people's data about the same thing, the same place, the same concept, the same event, the same person, and so on. That means using the same identifiers across datasets and making sure that your identifiers can be referred to by other datasets. When those identifiers are HTTP URIs, they can be looked up and more data discovered.
These ideas are at the heart of the 5 Stars of Linked Data, where one data point links to another, and of Hypermedia, where links may be to further data or to services (or more generally 'affordances') that can act on or relate to the data in some way. Examples include bug reporting mechanisms, processors, a visualization engine, a sensor, an actuator etc. In both Linked Data and Hypermedia, the emphasis is put on the ability for machines to traverse from one resource to another following links that express relationships.
That's the Web of Data.
Data items will be related to others across the Web, creating a global information space accessible to humans and machines alike.
This is a topic in itself and a general document such as this can only include superficial detail.
Developers know that very often the problem they're trying to solve will have already been solved by other people. In the same way, if you're looking for a set of identifiers for obvious things like countries, currencies, subjects, species, proteins, cities and regions, Nobel prize winners and products – someone's done it already. The steps described for discovering existing vocabularies [ LD-BP ] can readily be adapted.
If you can't find an existing set of identifiers that meet your needs then you'll need to create your own, following the patterns for URI persistence so that others will add value to your data by linking to it.
URIs can be long. In a dataset of even moderate size, storing each URI is likely to be repetitive and obviously wasteful. Instead, define locally unique identifiers for each element and provide data that allows them to be converted to globally unique URIs programmatically. The Metadata Vocabulary for Tabular Data [ Tabular-Metadata ] provides mechanisms for doing this within tabular data such as CSV files, in particular using URI template properties such as the about URL property.
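For illustration, a dataset description that reuses an existing, widely referenced URI rather than minting a new one; the example.org URIs are hypothetical, and the GeoNames URI stands in for any existing identifier that others already link to:

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .

# Hypothetical dataset reusing an existing identifier (here a GeoNames URI)
# for its spatial coverage instead of minting a new one
<http://data.example.org/datasets/bus-stops>
    a dcat:Dataset ;
    dct:spatial <http://sws.geonames.org/2988507/> .
```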
Check that, within the dataset, things that don't change or that change slowly, such as countries, regions, organizations and people, are referred to by URIs or by short identifiers that can be appended to a URI stub. Ideally the URIs should resolve; however, they have value as globally scoped variables whether they resolve or not.
Relevant requirements : R-UniqueIdentifier
Best Practice 13: Assign URIs to dataset versions and series

Assign URIs to individual versions of datasets as well as to the overall series.
Like documents, many datasets fall into natural series or groups. For example:
In different circumstances, it will be appropriate to refer separately to the current situation (the current set of bus stops, the current elected officials etc.). In others, it may be appropriate to refer to the situation as it existed at a specific time.
Humans and software agents will be able to refer to specific versions of a dataset and to concepts such as a 'dataset series' and 'the latest version'.
The W3C provides a good example of how to do this. The (persistent) URI for this document is http://www.w3.org/TR/2016/WD-dwbp-20160112/. That identifier points to an immutable snapshot of the document on the day of its publication. The URI for the 'latest version' of this document is http://www.w3.org/TR/dwbp/, which is an identifier for a series of closely related documents that are subject to change over time. At the time of publication, these two URIs both resolve to this document. However, when the next version of this document is published, the 'latest version' URI will be changed to point to that, but the dated URI remains unchanged.
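Translating this pattern to datasets, a minimal sketch with hypothetical URIs might look like this:

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .

# Hypothetical URIs: one for the dataset series and one for a dated version belonging to it
<http://data.example.org/datasets/bus-stops>
    a dcat:Dataset ;
    dct:hasVersion <http://data.example.org/datasets/bus-stops/2016-06-01> .

<http://data.example.org/datasets/bus-stops/2016-06-01>
    a dcat:Dataset ;
    dct:isVersionOf <http://data.example.org/datasets/bus-stops> .
```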
Check that each version of a dataset has its own URI, and that there is also a 'latest version' URI.
Relevant requirements : R-UniqueIdentifier , R-Citable
The format in which data is made available to consumers is a key aspect of making that data usable. The best, most flexible access mechanism in the world is pointless unless it serves data in formats that enable use and reuse.
Below we detail Best Practices in selecting formats for your data, both at the level of files and that of individual fields. W3C encourages use of formats that can be used by the widest possible audience and processed most readily by computing systems. Source formats, such as database dumps or spreadsheets, used to generate the final published format, are out of scope. This document is concerned with what is actually published rather than with the internal systems used to generate the published data.
Best Practice 14: Use machine-readable standardized data formats

Make data available in a machine-readable, standardized data format that is well suited to its intended or potential use.
As data becomes more ubiquitous and datasets become larger and more complex, processing by computers becomes ever more crucial. Posting data in a format that is not machine-readable places severe limitations on the continuing usefulness of the data. Data becomes useful when it has been processed and transformed into information. Note that there is an important distinction between formats that can be read and edited by humans using a computer and formats that are machine-readable. The latter term implies that the data is readily extracted, transformed and processed by a computer.
Using non-standard data formats is costly and inefficient, and the data may lose meaning as it is transformed. On the other hand, standardized data formats enable interoperability as well as future uses, such as remixing or visualization, many of which cannot be anticipated when the data is first published. The use of non-proprietary data formats should also be considered, since it increases the possibilities for use and reuse of data.
Machines will easily be able to read and process data published on the Web, and humans will be able to use computational tools typically available in the relevant domain to work with the data.
Make data available in a machine-readable, standardized data format that is easily parseable, including but not limited to CSV, XML, HDF5, JSON and RDF serialization syntaxes like RDF/XML, JSON-LD or Turtle.
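For illustration, the following shows a single record in one such standardized, machine-readable serialization (Turtle); the URI and values are hypothetical, and the same data could equally be offered as CSV or JSON:

```turtle
@prefix dct: <http://purl.org/dc/terms/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

# Hypothetical record expressed in a standardized, machine-readable serialization
<http://data.example.org/stops/42>
    dct:identifier "42" ;
    dct:title      "Central Station"@en ;
    dct:modified   "2016-06-01"^^xsd:date .
```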
Check if the data format conforms to a known machine-readable data format specification.
Relevant requirements : R-FormatMachineRead , R-FormatStandardized , R-FormatOpen
Best Practice 15: Provide data in multiple formats

Make data available in multiple formats when more than one format suits its intended or potential use.
Providing data in more than one format reduces costs incurred in data transformation. It also minimizes the possibility of introducing errors in the process of transformation. If many users need to transform the data into a specific data format, publishing the data in that format from the beginning saves time and money and prevents errors many times over. Lastly it increases the number of tools and applications that can process the data.
As many users as possible will be able to use the data without first having to transform it into their preferred format.
Consider
the
data
formats
most
likely
to
be
needed
by
intended
users,
and
consider
alternatives
that
are
likely
to
be
useful
in
the
future.
Data
publishers
must
balance
the
effort
required
to
make
the
data
available
in
many
formats,
formats
against
the
cost
of
doing
so,
but
providing
at
least
one
alternative
will
greatly
increase
the
usability
of
the
data.
In
order
to
serve
data
in
more
than
one
format
you
can
use
content
negotiation
as
described
in
Best
Practice
Use
content
negotiation
for
serving
data
available
in
multiple
formats.
A word of warning: local identifiers within the dataset, which may be exposed as fragment identifiers in URIs, must be consistent across the various formats.
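For illustration, a dataset offered in two formats can be described with one distribution per format; the URIs are hypothetical and the media types are taken from the IANA registry:

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .

# Hypothetical dataset offered as two distributions in different formats
<http://data.example.org/datasets/bus-stops>
    a dcat:Dataset ;
    dcat:distribution <http://data.example.org/datasets/bus-stops.csv> ,
                      <http://data.example.org/datasets/bus-stops.ttl> .

<http://data.example.org/datasets/bus-stops.csv>
    a dcat:Distribution ;
    dcat:mediaType   "text/csv" ;
    dcat:downloadURL <http://data.example.org/datasets/bus-stops.csv> .

<http://data.example.org/datasets/bus-stops.ttl>
    a dcat:Distribution ;
    dcat:mediaType   "text/turtle" ;
    dcat:downloadURL <http://data.example.org/datasets/bus-stops.ttl> .
```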
Check if the complete dataset is available in more than one data format.
Relevant requirements : R-FormatMultiple
Data is often represented in a structured and controlled way, making reference to a range of vocabularies, for example by defining types of nodes and links in a data graph or types of values for columns in a table, such as the subject of a book, or a relationship "knows" between two persons. Additionally, the values used may come from a limited set of pre-existing values or resources: for example object types, roles of a person, countries in a geographic area, or possible subjects for books. Such vocabularies ensure a level of control, standardization and interoperability in the data. They can also serve to improve the usability of datasets. Say a dataset contains a reference to a concept described in several languages. Such a reference allows applications to localize their display or their search depending on the language of the user.
Vocabularies define the concepts and relationships (also referred to as "terms" or "attributes") used to describe and represent an area of interest. They are used to classify the terms that can be used in a particular application, characterize possible relationships, and define possible constraints on using those terms. Several near-synonyms for 'vocabulary' have been coined, for example: ontology, controlled vocabulary, thesaurus, taxonomy, code list, semantic network.
There is no strict division between the artifacts referred to by these names. “Ontology” tends however to denote the vocabularies of classes and properties that structure the descriptions of resources in (linked) datasets. In relational databases, these correspond to the names of tables and columns; in XML, they correspond to the elements defined by an XML Schema. Ontologies are the key building blocks for inference techniques on the Semantic Web. The first means offered by W3C for creating ontologies is the RDF Schema [ RDF-SCHEMA ] language. It is possible to define more expressive ontologies with additional axioms using languages such as those in The Web Ontology Language [ OWL2-OVERVIEW ].
On the other hand, "controlled vocabularies", "concept schemes" and "knowledge organization systems" enumerate and define resources that can be employed in the descriptions made with the former kind of vocabulary, i.e. vocabularies that structure the descriptions of resources in (linked) datasets. A concept from a thesaurus, say "architecture", will for example be used in the subject field of a book description (where "subject" has been defined in an ontology for books). For defining the terms in these vocabularies, complex formalisms are most often not needed. Simpler models have thus been proposed to represent and exchange them, such as the ISO 25964 data model [ISO-25964] or W3C's Simple Knowledge Organization System [SKOS-PRIMER].
Best Practice 16: Reuse vocabularies, preferably standardized ones

Use terms from shared vocabularies, preferably standardized ones, to encode data and metadata.
Use of vocabularies already in use by others captures and facilitates consensus in communities. It increases interoperability and reduces redundancies, thereby encouraging reuse of your own data. In particular, the use of shared vocabularies for describing metadata (especially structural, provenance, quality and versioning metadata) helps the comparison and automatic processing of both data and metadata. In addition, referring to codes and terms from standards helps to avoid ambiguity and clashes between similar elements or values.

Interoperability and consensus among data publishers and consumers will be enhanced.
The Vocabularies section of the Best Practices for Publishing Linked Data [ LD-BP ] provides guidance on the discovery, evaluation and selection of existing vocabularies. Organizations such as the Open Geospatial Consortium (OGC), ISO, W3C, WMO, libraries, research data services, etc. provide lists of codes, terminologies and Linked Data vocabularies that can be used by everyone. A key point is to make sure the dataset, or its documentation, provides enough (human- and machine-readable) context so that data consumers can retrieve and exploit the standardized meaning of the values. In the context of the Web, using unambiguous, Web-based identifiers (URIs) for standardized vocabulary resources is an efficient way to do this.
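As an illustration of this approach, the sketch below is a minimal, hypothetical example using the Python rdflib library: it describes a dataset with terms reused from DCAT, Dublin Core Terms and FOAF rather than with newly invented properties. The dataset URI, publisher URI and literal values are placeholders.

    # Minimal sketch: describing a dataset by reusing shared vocabularies
    # (DCAT, Dublin Core Terms, FOAF) instead of inventing new properties.
    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import DCTERMS, FOAF, RDF, XSD

    DCAT = Namespace("http://www.w3.org/ns/dcat#")

    g = Graph()
    g.bind("dcat", DCAT)
    g.bind("dct", DCTERMS)
    g.bind("foaf", FOAF)

    dataset = URIRef("http://example.org/dataset/bus-stops")  # placeholder URI
    g.add((dataset, RDF.type, DCAT.Dataset))
    g.add((dataset, DCTERMS.title, Literal("Bus stops of MyCity", lang="en")))
    g.add((dataset, DCTERMS.issued, Literal("2016-05-01", datatype=XSD.date)))
    g.add((dataset, DCAT.keyword, Literal("transport")))

    publisher = URIRef("http://example.org/transport-agency")  # placeholder URI
    g.add((dataset, DCTERMS.publisher, publisher))
    g.add((publisher, FOAF.name, Literal("MyCity Transport Agency")))

    print(g.serialize(format="turtle"))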
Vocabulary repositories like the Linked Open Vocabularies repository, or the lists of services mentioned in technology-specific Best Practices such as the W3C Best Practices for Publishing Linked Data [ LD-BP ] or the Core Initial Context for RDFa and JSON-LD, can help with finding suitable existing vocabularies.
Check that classes, properties, terms, elements or attributes used to represent a dataset do not replicate those defined by vocabularies in common use within the same domain.
Check if the terms or codes of the vocabulary to be used are defined by a standards development organization such as IETF, OGC or W3C, or are published by a suitable authority, such as a government agency.
Relevant requirements : R-MetadataStandardized , R-MetadataDocum , R-QualityComparable , R-VocabOpen , R-VocabReference
Best Practice 16: Choose the right formalization level
When reusing a vocabulary, opt for a level of formal semantics that fits both the data and the most likely applications.
As Albert Einstein may or may not have said: everything should be made as simple as possible, but not simpler.
Formal semantics may help to establish precise specifications that convey detailed meaning, and using a complex vocabulary (ontology) may serve as a basis for tasks such as automated reasoning. On the other hand, such complex vocabularies require more effort to produce and understand, which could hamper their reuse, as well as the comparison and linking of datasets that use them.
If the data is sufficiently rich to support detailed research questions (the fact that A, B and C are true, and that D is not true, leads to the conclusion E) then something like an OWL Profile would clearly be appropriate [ OWL2-PROFILES ].
But there is nothing complicated about a list of bus stops.
Choosing a very simple vocabulary is always attractive, but there is a danger: the drive for simplicity might lead the publisher to omit some data that provides important information, such as the geographical location of the bus stops, which would prevent showing them on a map. Therefore, a balance has to be struck, remembering that the goal is not simply to share your data, but for others to reuse it.
The most likely application cases will be supported with no more complexity than necessary.
Look at what your peers do already. It's likely you'll see that there is a commonly used vocabulary that matches, or nearly matches, your current needs. That's probably the one to use. You may find a vocabulary that you'd like to use but you notice a semantic constraint that makes it difficult to do so, such as a domain or range restriction that doesn't apply to your case. In that scenario, it's often worth contacting the vocabulary publisher and talking to them about it. They may well be able to lift that restriction and provide further guidance on how the vocabulary is used more broadly.
W3C operates a mailing list at public-vocabs@w3.org [ archive ] where issues around vocabulary usage and development can be discussed.
If you are creating a vocabulary of your own, keep the semantic restrictions to the minimum that works for you, again so as to increase the possibility of reuse by others.
As an example, the designers of the (very widely used) SKOS ontology itself have minimized its ontological commitment by questioning all formal axioms that were suggested for its classes and properties. Often they were rejected because their use, while beneficial to many applications, would have created formal inconsistencies for the data from other applications, making SKOS unusable for these.
As an example, the property skos:broader was not defined as a transitive property, even though this would have fitted the way hierarchical links between concepts are created for many thesauri [ SKOS-DESIGN ].
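To make the contrast concrete, here is a minimal, hypothetical sketch (again using rdflib) of a small code list published as a SKOS concept scheme: only the simple SKOS relations the expected applications need are asserted, with no additional OWL axioms, in line with the minimal-commitment design discussed above. All URIs and labels are placeholders.

    # Minimal sketch: a small code list modeled with SKOS only,
    # avoiding OWL axioms that the expected applications do not need.
    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import RDF, SKOS

    EX = Namespace("http://example.org/transport-modes/")  # placeholder namespace

    g = Graph()
    g.bind("skos", SKOS)

    scheme = EX["scheme"]
    g.add((scheme, RDF.type, SKOS.ConceptScheme))
    g.add((scheme, SKOS.prefLabel, Literal("Transport modes", lang="en")))

    bus = EX["bus"]
    g.add((bus, RDF.type, SKOS.Concept))
    g.add((bus, SKOS.prefLabel, Literal("Bus", lang="en")))
    g.add((bus, SKOS.inScheme, scheme))
    g.add((bus, SKOS.broader, EX["road-transport"]))  # hierarchy link, no transitivity implied

    print(g.serialize(format="turtle"))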
Look for evidence of this kind of "design for wide use" when selecting a vocabulary.
Another example of this "design for wide use" can be seen in schema.org. Launched in June 2011, schema.org was massively adopted in a very short time, in part because of its informative rather than normative approach to defining the types of objects that properties can be used with. For instance, the values of the property author are only "expected" to be of type Organization or Person. author "can be used" on the type CreativeWork, but this is not a strict constraint. Again, that approach to design makes schema.org a good choice as a vocabulary to use when encoding data for sharing.
This is almost always a matter of subjective judgement with no objective test. As a general guideline, check for the kind of "design for wide use" described above.
Relevant requirements : R-VocabReference , R-VocabDocum , R-QualityComparable
Providing easy access to data on the Web enables both humans and machines to take advantage of the benefits of sharing data using the Web infrastructure. By default, the Web offers access using Hypertext Transfer Protocol (HTTP) methods. This provides access to data at an atomic transaction level. When data is distributed across multiple files or requires more sophisticated retrieval methods, different approaches can be adopted to enable data access, including bulk download and APIs.
In the bulk download approach, data is generally pre-processed server side, where multiple files or directory trees of files are provided as one downloadable file.
When bulk data is being retrieved from non-file system solutions, depending on the data user communities, the data publisher can offer APIs to support a series of retrieval operations representing a single transaction.
For data that is generated in real time or near real time, data publishers should use an automated system to enable immediate access to time-sensitive data, such as emergency information, weather forecasting data, or system monitoring metrics. In general, APIs should be available to allow third parties to automatically search and retrieve such data.
Aside from helping to automate real-time data pipelines, APIs are suitable for all kinds of data on the Web. Though they generally require more work than posting files for download, publishers are increasingly finding that delivering a well documented, standards-based, stable API is worth the effort.
Best Practice 17: Provide bulk download
Enable consumers to retrieve the full dataset with a single request.
When Web data is distributed across many URIs but might logically be organized as one container, accessing the data in bulk can be useful. Bulk access provides a consistent means to handle the data as one dataset. Individually accessing data over many retrievals can be cumbersome and, if used to reassemble the complete dataset, can lead to inconsistent approaches to handling the data.
Large file transfers that would require more time than a typical user would consider reasonable will be possible via dedicated file-transfer protocols.
Depending on the nature of the data and consumer needs, possible approaches include pre-processed bulk files and APIs that support bulk retrieval.
The bulk download should include the metadata describing the dataset. Discovery metadata [ VOCAB-DCAT ] should also be available outside the bulk download.
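The following is a minimal sketch of the server-side pre-processing step, assuming a hypothetical directory data/ that holds the distribution files and a metadata.ttl description alongside it: it packages the directory tree into a single compressed archive and records a checksum that can be published next to the download link.

    # Minimal sketch: package a directory tree (data files plus metadata)
    # into one downloadable archive and compute its checksum.
    import hashlib
    import tarfile

    ARCHIVE = "bus-stops-bulk.tar.gz"  # hypothetical output file name

    with tarfile.open(ARCHIVE, "w:gz") as tar:
        tar.add("data/", arcname="bus-stops")                       # distribution files
        tar.add("metadata.ttl", arcname="bus-stops/metadata.ttl")   # dataset metadata

    sha256 = hashlib.sha256()
    with open(ARCHIVE, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            sha256.update(chunk)

    # Publish the digest next to the download link so consumers can verify the file.
    print(ARCHIVE, sha256.hexdigest())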
Check if the full dataset can be retrieved with a single request.
Relevant requirements : R-AccessBulk
Best Practice 18: Provide Subsets for Large Datasets
If your dataset is large, enable users and applications to readily work with useful subsets of your data.
Large datasets can be difficult to move from place to place. It can also be inconvenient for users to store or parse a large dataset. Users should not have to download a complete dataset if they only need a subset of it.
Moreover, Web applications that tap into large datasets will perform better if their developers can take advantage of "lazy loading", working with smaller pieces of a whole and pulling in new pieces only as needed. The ability to work with subsets of the data also enables offline processing to work more efficiently. Real-time applications benefit in particular, as they can update more quickly.
Humans and applications will be able to access subsets of a dataset, rather than the entire thing, with a high ratio of needed to unneeded data for the largest number of users. Static datasets that users in the domain would consider to be too large will be downloadable in smaller pieces. APIs will make slices or filtered subsets of the data available, the granularity depending on the needs of the domain and the demands of performance in a Web application.
Consider the expected use cases for your dataset and determine what types of subsets are likely to be most useful. An API is usually the most flexible approach to serving subsets of data, as it allows customization of what data is transferred, making the available subsets much more likely to provide the needed data, and little unneeded data, for any given situation. The granularity should be suitable for Web application access speeds. (An API call that returns within one second enables an application to deliver interactivity that feels natural. Data that takes more than ten seconds to deliver will likely cause users to suspect failure.)
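The following hypothetical sketch, written with the Flask micro-framework, shows one way an API can serve subsets: consumers request a slice of a larger collection with limit and offset parameters, and can optionally filter by a field, so only the needed records are transferred. The endpoint path, field names and in-memory records are illustrative assumptions.

    # Minimal sketch: serving subsets of a large dataset through an API
    # using limit/offset paging and a simple filter parameter.
    from flask import Flask, jsonify, request

    app = Flask(__name__)

    # Stand-in for a large dataset; in practice this would come from a database.
    RECORDS = [{"id": i, "route": "A" if i % 2 else "B"} for i in range(10000)]

    @app.route("/stops")
    def stops():
        limit = min(request.args.get("limit", default=100, type=int), 1000)
        offset = request.args.get("offset", default=0, type=int)
        route = request.args.get("route")

        subset = [r for r in RECORDS if route is None or r["route"] == route]
        page = subset[offset:offset + limit]
        return jsonify({"count": len(subset), "offset": offset,
                        "limit": limit, "items": page})

    if __name__ == "__main__":
        app.run()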
Another way to subset a dataset is to simply split it into smaller units and make those units individually available for download or viewing. It can also be helpful to mark up a dataset so that individual sections through the data (or even smaller pieces, if expected use cases warrant it) can be processed separately. One way to do that is by indicating "slices" with the RDF Data Cube Vocabulary.
Check that the entire dataset can be recovered by making multiple requests that retrieve smaller units.
Relevant requirements : R-Citable , R-GranularityLevels , R-UniqueIdentifier , R-AccessRealTime
Best Practice 19: Use content negotiation for serving data available in multiple formats
Use content negotiation in addition to file extensions for serving data available in multiple formats.
It is possible to serve data in an HTML page mixed with human-readable and machine-readable data, using RDFa for example. However, as the Architecture of the Web [ WEBARCH ] and DCAT [ VOCAB-DCAT ] make clear, a resource, such as a dataset, can have many representations. The same data might be available as JSON, XML, RDF, CSV and HTML. These multiple representations can be made available via an API, but should be made available from the same URL using content negotiation to return the appropriate representation (what DCAT calls a distribution). Specific URIs can be used to identify individual representations of the data directly, by-passing content negotiation.
Content negotiation will enable different resources, or different representations of the same resource, to be served according to the request made by the client.
A possible approach to implementation is to configure the Web server to deal with content negotiation of the requested resource. The specific format of the resource's representation can be accessed by the URI or by the Content-type of the HTTP Request.
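As a sketch of this approach (hypothetical, again using Flask), the single resource URL below returns either a JSON or a CSV representation depending on the Accept header sent by the client; in a production deployment the same behavior is more commonly configured in the Web server itself. The path and record fields are placeholders.

    # Minimal sketch: one URL, several representations, selected by the
    # Accept header of the request (HTTP content negotiation).
    import csv
    import io

    from flask import Flask, Response, jsonify, request

    app = Flask(__name__)

    STOPS = [{"id": 1, "name": "Central Station"}, {"id": 2, "name": "Market Square"}]

    @app.route("/dataset/bus-stops")
    def bus_stops():
        best = request.accept_mimetypes.best_match(["application/json", "text/csv"])
        if best == "text/csv":
            buf = io.StringIO()
            writer = csv.DictWriter(buf, fieldnames=["id", "name"])
            writer.writeheader()
            writer.writerows(STOPS)
            return Response(buf.getvalue(), mimetype="text/csv")
        # Default to JSON when the client accepts it (or states no preference).
        return jsonify(STOPS)

    if __name__ == "__main__":
        app.run()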
Check the available representations of the resource and try to get them specifying the accepted content on the HTTP Request header.
Relevant requirements :
Best Practice 20: Provide real-time access
When data is produced in real time, make it available on the Web in real time or near real time.
The presence of real-time data on the Web enables access to critical time-sensitive data, and encourages the development of real-time Web applications. Real-time access is dependent on real-time data producers making their data readily available to the data publisher. The necessity of providing real-time access for a given application will need to be evaluated on a case by case basis, considering refresh rates, latency introduced by data post-processing steps, infrastructure availability, and the data needed by consumers. In addition to making data accessible, data publishers may provide additional information describing data gaps, data errors and anomalies, and publication delays.
Applications will be able to access time-critical data in real time or near real time, where real time means a range from milliseconds to a few seconds after the data creation.
A possible approach to implementation is for publishers to configure a Web Service that provides a connection so that, as real-time data is received by the Web Service, it can be instantly made available to consumers by polling or streaming. If data is checked infrequently by consumers, real-time data can be polled upon consumer request for the most recent data through an API. The data publishers will provide an API to facilitate these read-only requests.
If data is checked frequently by consumers, a streaming data implementation may be more appropriate, where data is pushed through an API. While streaming techniques are beyond the scope of this best practice, there are many standard protocols and technologies available (for example Server-sent Events, WebSocket, EventSourceAPI) for clients receiving automatic updates from the server.
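Below is a minimal consumer-side sketch of the polling option, assuming a hypothetical endpoint that returns the most recent observation as JSON; it uses the requests library and a fixed polling interval. A streaming alternative would instead keep a connection open using one of the protocols mentioned above.

    # Minimal sketch: polling an API for near real-time data at a fixed interval.
    import time

    import requests

    ENDPOINT = "https://example.org/api/observations/latest"  # hypothetical URL
    POLL_SECONDS = 30

    def poll_forever():
        last_seen = None
        while True:
            resp = requests.get(ENDPOINT, timeout=10)
            resp.raise_for_status()
            observation = resp.json()
            # Only act when the observation actually changed since the last poll.
            if observation != last_seen:
                last_seen = observation
                print("new data:", observation)
            time.sleep(POLL_SECONDS)

    if __name__ == "__main__":
        poll_forever()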
Relevant requirements : R-AccessRealTime
Best Practice 21: Provide data up to date
Make data available in an up-to-date manner, and make the update frequency explicit.
The availability of data on the Web should closely match the data at creation time or collection time, or perhaps after it has been processed or changed. Carefully synchronizing data publication to the update frequency encourages data consumer confidence and data reuse.
Data on the Web will be updated in a timely manner so that the most recent data available online generally reflects the most recent data released via any other channel. When new data becomes available, it will be published on the Web as soon as practical thereafter.
New versions of the dataset can be posted to the Web on a regular schedule, following the Best Practices for Data Versioning. Posting to the Web can be made a part of the release process for new versions of the data. Making Web publication a deliverable item in the process and assigning an individual person as responsible for the task can help prevent data becoming out of date. To set consumer expectations for updates going forward, you can include human-readable text stating the expected publication frequency, and you can provide machine-readable metadata indicating the frequency as well.
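For the machine-readable part, one common option is to state the update frequency in the dataset's metadata with dct:accrualPeriodicity. The sketch below uses rdflib and points at a term from the Dublin Core Collection Description frequency vocabulary; the dataset URI is a placeholder.

    # Minimal sketch: stating the expected update frequency in machine-readable
    # metadata using dct:accrualPeriodicity and the Dublin Core frequency vocabulary.
    from rdflib import Graph, Namespace, URIRef
    from rdflib.namespace import DCTERMS, RDF

    DCAT = Namespace("http://www.w3.org/ns/dcat#")
    FREQ = Namespace("http://purl.org/cld/freq/")

    g = Graph()
    g.bind("dcat", DCAT)
    g.bind("dct", DCTERMS)

    dataset = URIRef("http://example.org/dataset/bus-stops")  # placeholder URI
    g.add((dataset, RDF.type, DCAT.Dataset))
    g.add((dataset, DCTERMS.accrualPeriodicity, FREQ.daily))  # updated once a day

    print(g.serialize(format="turtle"))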
Check that the update frequency is stated and that the most recently published copy on the Web is no older than the date predicted by the stated update frequency.
Relevant requirements : R-AccessUptodate
Best Practice 22: Provide an explanation for data that is not available
For data that is not available, provide an explanation about how the data can be accessed and who can access it.
Publishing online documentation about unavailable data due to sensitivity issues provides a means for publishers to explicitly identify knowledge gaps. This provides a contextual explanation for consumer communities, thus encouraging use of the data that is available.
Consumers will know that data that is referred to from the current dataset is unavailable, or only available under different conditions.
Depending on the machine/human context there are a variety of ways to indicate data unavailability. Data publishers may publish an HTML document that gives a human-readable explanation for data unavailability. From a machine application interface perspective, appropriate HTTP status codes with customized human-readable messages can be used. Examples of status codes include: 303 (see other), 410 (permanently removed), 503 (service *providing data* unavailable).
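A small, hypothetical Flask sketch of the machine-readable side: a request for a withdrawn distribution gets a 410 status code together with a short human-readable explanation and a pointer to whom to contact. The URL path and contact address are placeholders.

    # Minimal sketch: answering requests for withdrawn data with an appropriate
    # HTTP status code plus a human-readable explanation.
    from flask import Flask, jsonify

    app = Flask(__name__)

    @app.route("/dataset/passenger-records")
    def passenger_records():
        body = {
            "error": "This dataset has been permanently removed for privacy reasons.",
            "contact": "mailto:data@example.org",  # placeholder contact address
        }
        return jsonify(body), 410  # 410 Gone: intentionally and permanently unavailable

    if __name__ == "__main__":
        app.run()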
Where the dataset includes references to data that is no longer available, or is not available to all users, check that an explanation of what is missing and instructions for obtaining access (if possible) are given. Check if a legitimate HTTP response code in the 400 or 500 range is returned when trying to get unavailable data.
Relevant requirements : R-AccessLevel , R-SensitivePrivacy , R-SensitiveSecurity
Best Practice 23: Make data available through an API
Offer an API to serve data if you have the resources to do so.
An API offers the greatest flexibility and processability for consumers of your data. It can enable real-time data usage, filtering on request, and the ability to work with the data at an atomic level. If your dataset is large, frequently updated, or highly complex, an API is likely to be the best option for publishing your data.
Developers will have programmatic access to the data for use in their own applications, with data updated without requiring effort on the part of consumers. Web applications will be able to obtain specific data by querying a programmatic interface.
Creating an API is a little more involved than posting data for download. It requires some understanding of how to build a Web application. One need not necessarily build from scratch, however. If you use a data management platform, such as CKAN, you may be able to simply enable an existing API. Many Web development frameworks include support for APIs, and there are also frameworks written specifically for building custom APIs. Rails, Django, and Express are some example Web development frameworks that offer support for building APIs. Examples of API frameworks include Swagger, Apigility, Apache CXF, Restify, and Restlet.
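From the consumer side, programmatic access can be as simple as the hedged sketch below: it queries a hypothetical endpoint of the kind such frameworks expose, requesting JSON and passing a filter. The URL, query parameters and response fields are assumptions for illustration.

    # Minimal sketch: a consumer retrieving specific data from a published API.
    import requests

    API = "https://example.org/api/stops"  # hypothetical API endpoint

    resp = requests.get(API,
                        params={"route": "A", "limit": 10},  # assumed query parameters
                        headers={"Accept": "application/json"},
                        timeout=10)
    resp.raise_for_status()

    for stop in resp.json().get("items", []):  # assumed response structure
        print(stop["id"], stop["name"])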
Check if a test client can simulate calls and the API returns the expected responses.
Relevant requirements : R-AccessRealTime , R-AccessUpToDate
Best Practice 24: Use Web Standards as the foundation of APIs
When designing APIs, use an architectural style that is founded on the technologies of the Web itself.
APIs that are built on Web standards leverage the strengths of the Web. For example, using HTTP verbs as methods and URIs that map directly to individual resources helps to avoid tight coupling between requests and responses, making for an API that is easy to maintain and can readily be understood and used by many developers. The statelessness of the Web can be a strength in enabling quick scaling, and using hypermedia enables rich interactions with your API.
Developers who have some experience with APIs based on Web standards, such as REST, will have an initial understanding of how to use the API . The API will also be easier to maintain.
REST (REpresentational State Transfer)[ Fielding ][ Richardson ] is an architectural style that, when used in a Web API , takes advantage of the architecture of the Web itself. A full discussion of how to build a RESTful API is beyond the scope of this document, but there are many resources and a strong community that can help in getting started. There are also many RESTful development frameworks available. If you are already using a Web development framework that supports building REST APIs, consider using that. If not, consider an API -only framework that uses REST.
Another aspect of implementation to consider is making a hypermedia API, one that responds with links as well as data. Links are what make the Web a web, and data APIs can be more useful and usable by including links in their responses. The links can offer additional resources, documentation, and navigation. Even for an API that does not meet all the constraints of REST, returning links in responses can make for a service that is rich and self-documenting.
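As a sketch of the hypermedia idea (hypothetical, Flask again), each response below carries links to the next page of results and to the documentation, so a client can navigate the API by following links rather than hard-coding URLs. Paths, page size and the documentation URL are placeholders.

    # Minimal sketch: including links in API responses (hypermedia style).
    from flask import Flask, jsonify, request, url_for

    app = Flask(__name__)

    RECORDS = [{"id": i} for i in range(500)]

    @app.route("/stops")
    def stops():
        offset = request.args.get("offset", default=0, type=int)
        limit = 50
        page = RECORDS[offset:offset + limit]
        links = {"documentation": "https://example.org/docs"}  # placeholder docs URL
        if offset + limit < len(RECORDS):
            links["next"] = url_for("stops", offset=offset + limit, _external=True)
        return jsonify({"items": page, "links": links})

    if __name__ == "__main__":
        app.run()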
Check that the service avoids using http as a tunnel for calls to custom methods, and check that URIs do not contain method names.
Relevant requirements : R-APIDocumented , R-UniqueIdentifier
Best Practice 25: Provide complete documentation for your API
Provide complete information about your API. Update documentation as you add features or make changes.
Developers are the primary consumers of an API and the documentation is the first clue about its quality and usefulness. When API documentation is complete and easy to understand, developers are more willing to continue their journey to use it. Providing comprehensive documentation in one place allows developers to code efficiently. Highlighting changes enables your users to take advantage of new features and adapt their code if needed.
Developers will be able to obtain detailed information about each call to the API, including the parameters it takes and what it is expected to return, i.e., the whole set of information related to the API: how to use it, notices of recent changes, contact information, and so on, described and easily browsable on the Web. It will also enable machines to access the API documentation in order to help developers build API client software.
A typical API reference provides a comprehensive list of the calls the API can handle, describing the purpose of each one, detailing the parameters it allows and what it returns, and giving one or more examples of its use. One nice trend in API documentation is to provide a form in which developers can enter specific calls for testing, to see what the API returns for their use case. There are now tools available for quickly creating this type of documentation, such as Swagger , io-docs , OpenApis , and others. It is important to say that the API should be self-documenting as well, so that calls return helpful information about errors and usage. API users should be able to contact the maintainers with questions, suggestions, or bug reports.
The quality of documentation is also related to usage and feedback from developers. Try to get constant feedback from your users about the documentation.
Check that every call enabled by your API is described in your documentation. Make sure you provide details of what parameters are required or optional and what each call returns.
Check the Time To First Successful Call (i.e. being capable of doing a successful request to the API within a few minutes will increase the chances that the developer will stick to your API ).
Relevant requirements : R-APIDocumented
Best Practice 26: Avoid Breaking Changes to Your API
Avoid changes to your API that break client code, and communicate any changes in your API to your developers when evolution happens.
When developers implement a client for your API , they may rely on specific characteristics that you have built into it, such as the schema or the format of a response. Avoiding breaking changes in your API minimizes breakage to client code. Communicating changes when they do occur enables developers to take advantage of new features and, in the rare case of a breaking change, take action.
Developer code will continue to work. Developers will know of improvements you make and be able to make use of them. Breaking changes to your API will be rare, and if they occur, developers will have sufficient time and information to adapt their code. That will enable them to avoid breakage, enhancing trust. Changes to the API will be announced on the API 's documentation site.
When improving your API , focus on adding new calls or new options rather than changing how existing calls work. Existing clients can ignore such changes and will continue functioning.
If using a fully RESTful style, you should be able to avoid changes that affect developers by keeping resource URIs constant and changing only elements that your users do not code to directly. If you need to change your data in ways that are not compatible with the extension points that you initially designed, then a completely new design is called for, and that means changes that break client code. In that case, it’s best to implement the changes as a new REST API , with a different resource URI.
If using an architectural style that does not allow you to make moderately significant changes without breaking client code, use versioning. Indicate the version in the response header. Version numbers should be reflected in your URIs or in request "accept" headers (using content negotiation). When versioning in URIs, include the version number as far to the left as possible. Keep the previous version available for developers whose code has not yet been adapted to the new version.
To notify users directly of changes, it's a good idea to create a mailing list and encourage developers to join. You can then announce changes there, and this provides a nice mechanism for feedback as well. It also allows your users to help each other.
Release changes initially to a test version of your API before applying them to the production version. Invite developers to test their applications on the test version and provide feedback.
Relevant requirements : R-PersistentIdentification , R-APIDocumented
The working group recognizes that it is unrealistic to assume that all data on the Web will be available on demand at all times into the indefinite future. For a wide variety of reasons, data publishers are likely to want or need to remove data from the live Web, at which point it moves out of scope for the current work and into the scope of data archivists. What is in scope here, however, is what is left behind, that is, what steps publishers should take to indicate that data has been removed or archived. Simply deleting a resource from the Web is bad practice. In that circumstance, dereferencing the URI would lead to an HTTP Response code of 404 that tells the user nothing other than that the resource was not found. The following Best Practices offer more productive approaches.
Best Practice 27: Preserve identifiers
When removing data from the Web, preserve the identifier and provide information about the archived resource.
URI dereferencing is the primary interface to data on the Web. If dereferencing a URI leads to the infamous 404 response code (Not Found), the user will not know whether the lack of availability is permanent or temporary, planned or accidental. If the publisher, or a third party, has archived the data, that archived copy is much less likely to be found if the original URI is effectively broken.
The URI of a dataset will always dereference to the dataset or redirect to information about it.
There are two scenarios to consider:
In the first of these cases, the server should be configured to respond with an HTTP Response code of 410 (Gone) . From the specification:
The 410 response is primarily intended to assist the task of Web maintenance by notifying the recipient that the resource is intentionally unavailable and that the server owners desire that remote links to that resource be removed.
In the second case, where data has been archived, it is more appropriate to redirect requests to a Web page giving information about the archive that holds the data and how a potential user can access it.
In both cases, the original URI continues to identify the dataset and leads to useful information, even though that dataset is no longer directly available.
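A hypothetical sketch of both scenarios in one small Flask app: one dataset URI answers with 410 Gone, while another redirects with 303 See Other to a page describing the archive that now holds the data. All URLs are placeholders, and the same behavior is more often configured directly in the Web server.

    # Minimal sketch: keeping dataset URIs dereferenceable after removal,
    # using 410 (Gone) or 303 (See Other) as appropriate.
    from flask import Flask, redirect

    app = Flask(__name__)

    @app.route("/dataset/old-timetable")
    def removed_dataset():
        # Data intentionally removed, with no archived copy available.
        return ("This dataset was intentionally withdrawn and is no longer available.",
                410, {"Content-Type": "text/plain"})

    @app.route("/dataset/bus-stops-2012")
    def archived_dataset():
        # Data archived elsewhere: point the client at information about the archive.
        return redirect("https://archive.example.org/info/bus-stops-2012", code=303)

    if __name__ == "__main__":
        app.run()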
Check that dereferencing the URI of a dataset that is no longer available returns information about its current status and availability, using either a 410 or 303 Response Code as appropriate.
Relevant requirements : R-AccessLevel , R-PersistentIdentification
Best Practice 28: Assess dataset coverage
Assess the coverage of a dataset prior to its preservation.
A chunk of Web data is by definition dependent on the rest of the global graph. This global context influences the meaning of the description of the resources found in the dataset. Ideally, the preservation of a particular dataset would involve preserving all its context. That is the entire Web of Data.
At the time of archiving, the linkage of the dataset to already preserved resources, and the vocabularies and target resources it uses, needs to be assessed. Datasets for which very few of the vocabularies used and/or resources pointed to are already preserved somewhere should be flagged as being at risk.
Users will be able to make use of archived data well into the future.
The assessment can be performed by the digital preservation institute or the dataset depositor. It essentially consists in checking whether all the resources used are either already preserved somewhere or need to be provided along with the new dataset being considered for preservation. It is impossible to determine what will be available in, say, 50 years' time. However, one can check that an archived dataset depends only on widely used external resources and vocabularies.
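A rough sketch of such a check using rdflib: it loads a (hypothetical) dataset dump and lists the namespaces of the properties and classes it uses, so a curator can judge which dependencies are widely used and which need to be archived alongside the dataset. The file name and the namespace heuristic are illustrative assumptions.

    # Minimal sketch: inventory the external vocabularies a dataset dump depends on.
    from collections import Counter

    from rdflib import Graph
    from rdflib.namespace import RDF

    def namespace_of(uri):
        uri = str(uri)
        # Crude split at the last '#' or '/' to approximate the vocabulary namespace.
        for sep in ("#", "/"):
            if sep in uri:
                return uri.rsplit(sep, 1)[0] + sep
        return uri

    g = Graph()
    g.parse("dataset-dump.ttl", format="turtle")  # hypothetical dump file

    counts = Counter()
    for s, p, o in g:
        counts[namespace_of(p)] += 1              # vocabularies of properties used
        if p == RDF.type:
            counts[namespace_of(o)] += 1          # vocabularies of classes used

    for ns, n in counts.most_common():
        print(n, ns)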
Check that unique or lesser-used dependencies are preserved as part of the archive.
Relevant requirements : R-VocabReference
Publishing data on the Web enables data sharing on a large scale to a wide range of audiences with different levels of expertise. Data publishers want to ensure that the data published is meeting the data consumer needs and, for this purpose, user feedback is crucial.
Feedback has benefits for both data publishers and data consumers, helping data publishers to improve the integrity of their published data as well as encouraging the publication of new data. Feedback allows data consumers to have a voice describing usage experiences (e.g. applications using data), preferences and needs. When possible, feedback should also be publicly available for other data consumers to examine.
Making feedback publicly available allows users to become aware of other data consumers, supports a collaborative environment, and allows users to see whether community experiences, concerns or questions are currently being addressed.
From a user interface perspective there are different ways to gather feedback from data consumers, including site registration, contact forms, quality rating selection, surveys and comment boxes for blogging. From a machine perspective the data publisher can also record metrics on data usage or information about specific applications that use the data.
Feedback such as this establishes a communication channel between data publishers and data consumers. In order to quantify and analyze usage feedback, it should be recorded in a machine-readable format. Publicly available feedback should be displayed in a human-readable form.
This section provides some Best Practices to be followed by data publishers in order to enable data consumers to provide feedback. This feedback can be for humans or machines.
Best Practice 29: Gather feedback from data consumers
Provide a readily discoverable means for consumers to offer feedback.
Obtaining feedback helps data publishers understand the needs of their data consumers and can help them improve the quality of their published data. It also enhances trust by showing consumers that the publisher cares about addressing their needs. Specifying a clear feedback mechanism removes the barrier of having to search for a way to provide feedback.
Data consumers will be able to provide feedback and ratings about datasets and distributions.
Provide data consumers with one or more feedback mechanisms including, but not limited to, a registration form, contact form, point-and-click data quality rating buttons, or a comment box.
In order to make the most of feedback received from consumers, it's a good idea to collect the feedback with a tracking system that captures each item in a database, enabling quantification and analysis. It is also a good idea to capture the type of each item of feedback, i.e., its motivation (editing, classifying [rating], commenting or questioning), so that each item can be expressed using the Dataset Usage Vocabulary [ VOCAB-DUV ].
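A minimal sketch of such a tracking system, using Python's built-in sqlite3 module: each feedback item is stored with the dataset it refers to and its motivation, which makes later quantification, analysis and export (for example to the Dataset Usage Vocabulary) straightforward. Table and column names, and the sample record, are illustrative.

    # Minimal sketch: capturing feedback items in a database for later analysis.
    import sqlite3

    conn = sqlite3.connect("feedback.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS feedback (
            id INTEGER PRIMARY KEY,
            dataset_uri TEXT NOT NULL,
            motivation TEXT NOT NULL,      -- editing, classifying, commenting, questioning
            body TEXT,
            submitted_at TEXT DEFAULT CURRENT_TIMESTAMP
        )""")

    def record_feedback(dataset_uri, motivation, body):
        conn.execute(
            "INSERT INTO feedback (dataset_uri, motivation, body) VALUES (?, ?, ?)",
            (dataset_uri, motivation, body))
        conn.commit()

    record_feedback("http://example.org/dataset/bus-stops",  # placeholder URI
                    "commenting",
                    "Stop 42 is missing its geographic coordinates.")

    for row in conn.execute(
            "SELECT motivation, COUNT(*) FROM feedback GROUP BY motivation"):
        print(row)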
Check that at least one feedback mechanism is provided and readily discoverable by data consumers.
Relevant requirements : R-UsageFeedback , R-QualityOpinions
Best Practice 30: Make feedback available
Make consumer feedback about datasets and distributions publicly available.
By sharing feedback with consumers, publishers can demonstrate to users that their concerns are being addressed, and they can avoid submission of duplicate bug reports. Sharing feedback also helps consumers understand any issues that may affect their ability to use the data, and it can foster a sense of community among them.
Consumers will be able to assess the kinds of errors that affect the dataset, review other users' experiences with it, and be reassured that the publisher is actively addressing issues as needed. Consumers will also be able to determine whether other users have already provided similar feedback, saving them the trouble of submitting unnecessary bug reports and sparing the maintainers from having to deal with duplicates.
Feedback can be available as part of an HTML Web page, but it can also be provided in a machine-readable format using the Dataset Usage Vocabulary [ VOCAB-DUV ].
Check that any feedback given by data consumers for a specific dataset or distribution is publicly available.
Relevant requirements : R-UsageFeedback , R-QualityOpinions
Data enrichment refers to a set of processes that can be used to enhance, refine or otherwise improve raw or previously processed data. This idea and other similar concepts contribute to making data a valuable asset for almost any modern business or enterprise. It is a diverse topic in itself, details of which are beyond the scope of the current document. However, it is worth noting that some of these techniques should be approached with caution, as ethical concerns may arise. In scientific research, care must be taken to avoid enrichment that distorts results or statistical outcomes. For data about individuals, privacy issues may arise when combining datasets. That is, enriching one dataset with another, when neither contains sufficient information about any individual to identify them, may yield a combined dataset that compromises privacy. Furthermore, these techniques can be carried out at scale, which in turn highlights the need for caution.
This section provides some advice to be followed by data publishers in order to enable data consumers to enrich data.
Best Practice 31: Enrich data by generating new data
Enrich your data by generating new data from the raw data when doing so will enhance its value.
Enrichment can greatly enhance processability, particularly for unstructured data. Under some circumstances, missing values can be filled in, and new attributes and measures can be added from the raw data. Publishing more complete datasets can enhance trust, if done properly and ethically. Deriving additional values that are of general utility saves users time and encourages more kinds of reuse. There are many intelligent techniques that can be used to enrich data, making the dataset an even more valuable asset.
Datasets with missing values will be enhanced by filling those values. Structure will be conferred and utility enhanced if relevant measures or attributes are added, but only if the addition does not distort analytical results, significance, or statistical power.
Techniques for data enrichment are complex and go well beyond the scope of this document, which can only highlight the possibilities.
Machine learning can readily be applied to the enrichment of data. Methods include those focused on data categorization, disambiguation, entity recognition, sentiment analysis and topification, among others. New data values may be derived as simply as performing a mathematical calculation across existing columns. Other examples include visual inspection to identify features in spatial data and cross-referencing external databases for demographic information.
Values generated by inference-based techniques should be labeled as such, and it should be possible to retrieve any original values replaced by enrichment.
Whenever licensing permits, the code used to enrich the data should be made available along with the dataset. Sharing such code is particularly important for scientific data.
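As a very small example of deriving a new value from existing columns, the sketch below computes an average speed from distance and travel time with pandas and labels the column as derived, keeping the original values untouched. The file and column names are hypothetical.

    # Minimal sketch: enriching a dataset with a value derived from existing columns.
    import pandas as pd

    df = pd.read_csv("trips.csv")  # hypothetical input with the columns used below

    # Derived, not observed: computed from distance_km and duration_h.
    df["derived_avg_speed_kmh"] = df["distance_km"] / df["duration_h"]

    df.to_csv("trips_enriched.csv", index=False)  # original columns are preserved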
Look for missing values in the dataset, or additional fields likely to be needed by others. Check that any data added by inferential enrichment techniques is identified as such and that any replaced data is still available. Check that the code used to enrich the data is available.
Relevant requirements: R-DataEnrichment , R-FormatMachineRead , R-ProvAvailable
Best Practice 32: Provide Complementary Presentations
Enrich data by presenting it in complementary, immediately informative ways, such as visualizations, tables, Web applications, or summaries.
Data published online is meant to inform others about its subject. But only posting datasets for download or API access puts the burden on consumers to interpret it. The Web offers unparalleled opportunities for presenting data in ways that let users learn and explore without having to create their own tools.
Complementary data presentations will enable human consumers to have immediate insight into the data by presenting it in ways that are readily understood.
One very simple way to provide immediate insight is to publish an analytical summary in an HTML Web page. Including summative data in graphs or tables can help users scan the summary and quickly understand the meaning of the data. If you have the means to create interactive visualizations or Web applications that use the data, you can give consumers of your data greater ability to understand it and discover patterns in it. These approaches also demonstrate its suitability for processing and encourage reuse.
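One very lightweight way to produce such a summary page is sketched below with pandas and hypothetical file names: it writes an HTML page containing descriptive statistics of the dataset that a consumer can scan before deciding whether to download the data itself.

    # Minimal sketch: publishing an analytical summary of a dataset as an HTML page.
    import pandas as pd

    df = pd.read_csv("bus-stops.csv")  # hypothetical dataset file

    summary = df.describe(include="all").to_html()
    page = "<html><body><h1>Bus stops: summary</h1>{}</body></html>".format(summary)

    with open("summary.html", "w", encoding="utf-8") as f:
        f.write(page)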
Check that the dataset is accompanied by some additional interpretive content that can be perceived without downloading the data or invoking an API.
Relevant requirements: R-DataEnrichment
Reusing data is another way of publishing data; it's simply republishing. It can take the form of combining existing data with other datasets, creating Web applications or visualizations, or repackaging the data in a new form, such as a translation. Data republishers have some responsibilities that are unique to that form of publishing on the Web. This section provides advice to be followed when republishing data.
Best Practice 33: Provide Feedback to the Original Publisher
Let the original publisher know when you are reusing their data. If you find an error or have suggestions or compliments, let them know.
Publishers generally want to know whether the data they publish has been useful. Moreover, they may be required to report usage statistics in order to allocate resources to data publishing activities. Reporting your usage helps them justify putting effort toward data releases. Providing feedback repays the publishers for their efforts by directly helping them to improve their dataset for future users.
When you begin using a dataset in a new product, make a note of the publisher's contact information, the URI of the dataset you used, and the date on which you contacted them. This can be done in comments within your code where the dataset is used. Follow the publisher's preferred route to provide feedback. If they do not provide a route, look for contact information for the Web site hosting the data.
Check that you have a record of at least one communication informing the publisher of your use of the data.
Relevant requirements: R-TrackDataUsage , R-UsageFeedback , R-QualityOpinions
Best Practice 34: Follow Licensing Terms
Find and follow the licensing requirements from the original publisher of the dataset.
Licensing provides a legal framework for using someone else’s work. By adhering to the original publisher’s requirements, you keep the relationship between yourself and the publisher friendly. You don’t need to worry about legal action from the original publisher if you are following their wishes. Understanding the initial license will help you determine what license to select for your reuse.
Data publishers will be able to trust that their work is being reused in accordance with their licensing requirements, which will make them more likely to continue to publish data. Reusers of data will themselves be able to properly license their derivative works.
Read the original license and adhere to its requirements. If the license calls for specific licensing of derivative works, choose your license to be compatible with that requirement. If no license is given, contact the original publisher and ask what the license is.
Read through the original license and check that your use of the data does not violate any of the terms.
Relevant requirements: R-LicenseAvailable , R-LicenseLiability ,
Best Practice 35: Cite the Original Publication
Acknowledge the source of your data in metadata. If you provide a user interface, include the citation visibly in the interface.
Data is only useful when it is trustworthy. Identifying the source is a major indicator of trustworthiness in two ways: first, the user can judge the trustworthiness of the data from the reputation of the source, and second, citing the source suggests that you yourself are trustworthy as a republisher. In addition to informing the end user, citing helps publishers by crediting their work. Publishers who make data available on the Web deserve acknowledgment and are more likely to continue to share data if they find they are credited. Citation also maintains provenance and helps still others to work with the data.
End users will be able to assess the trustworthiness of the data they see and the efforts of the original publishers will be recognized. The chain of provenance for data on the Web will be traceable back to its original publisher.
You can present the citation to the original source in a user interface by providing bibliographic text and a working link.
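For example, a minimal sketch (the title, publisher and URI below are invented placeholders) of such a citation rendered as bibliographic text with a working link might be:

```python
# Minimal sketch: build a human-readable citation with a working link
# for display in a user interface. All values are invented placeholders.
DATASET_TITLE = "Bus stops of MyCity"
PUBLISHER = "Transport Agency of MyCity"
DATASET_URI = "http://data.example.org/dataset/bus-stops"
ISSUED_YEAR = 2016

def citation_html() -> str:
    """Return an HTML fragment citing the original source of the reused data."""
    return (f'Source: {PUBLISHER} ({ISSUED_YEAR}), '
            f'<a href="{DATASET_URI}">{DATASET_TITLE}</a>.')

print(citation_html())
```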
Check that the original source of any reused data is cited in the metadata provided. Check that a human-readable citation is readily visible in any user interface.
Relevant requirements: R-Citable, R-ProvAvailable, R-MetadataAvailable, R-TrackDataUsage
This section is non-normative.
A dataset is defined as a collection of data, published or curated by a single agent, and available for access or download in one or more formats. A dataset does not have to be available as a downloadable file.
A Citation may be either direct and explicit (as in the reference list of a journal article), indirect (e.g. a citation to a more recent paper by the same research group on the same topic), or implicit (e.g. as in artistic quotations or parodies, or in cases of plagiarism).
From: CiTO, the Citation Typing Ontology.
For the purposes of this WG, a Data Consumer is a person or group accessing, using, and potentially performing post-processing steps on data.
From: Strong, Diane M., Yang W. Lee, and Richard Y. Wang. "Data quality in context." Communications of the ACM 40.5 (1997): 103-110.
Data Format is defined as a specific convention for data representation, i.e. the way that information is encoded and stored for use in a computer system, possibly constrained by a formal data type or set of standards.
From: Digital Humanities Curation Guide
Data Producer is a person or group responsible for generating and maintaining data.
From: Strong, Diane M., Yang W. Lee, and Richard Y. Wang. "Data quality in context." Communications of the ACM 40.5 (1997): 103-110.
A distribution represents a specific available form of a dataset. Each dataset might be available in different forms; these forms might represent different formats of the dataset or different endpoints. Examples of distributions include a downloadable CSV file, an API or an RSS feed.
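As a rough sketch only (the structure below is loosely inspired by, but not an exact representation of, vocabularies such as DCAT; all field names and URLs are invented), a dataset with two distributions might be modeled like this:

```python
# Sketch of a dataset offered through two distributions: a downloadable
# CSV file and a JSON API. Field names and URLs are invented placeholders.
dataset = {
    "title": "Bus stops of MyCity",
    "distributions": [
        {"mediaType": "text/csv", "downloadURL": "http://data.example.org/bus-stops.csv"},
        {"mediaType": "application/json", "accessURL": "http://api.example.org/bus-stops"},
    ],
}

for dist in dataset["distributions"]:
    print(dist["mediaType"], dist.get("downloadURL") or dist.get("accessURL"))
```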
A feedback forum is used to collect messages posted by consumers about a particular topic. Messages can include replies to other consumers. Datetime stamps are associated with each message and the messages can be associated with a person or submitted anonymously.
From: Semantically-Interlinked Online Communities (SIOC) and the Annotation Model [ Annotation-Model ]

To better understand why an annotation was created, a SKOS Concept Scheme [ SKOS-PRIMER ] is used to show inter-related annotations between communities with more meaningful distinctions than a simple class/subclass tree.
Data Preservation is defined by the Alliance for Permanent Access Network as "The processes and operations in ensuring the technical and intellectual survival of objects through time". This is part of a data management plan focusing on preservation planning and metadata. Whether it is worthwhile to put effort into preservation depends on the (future) value of the data, the resources available and the opinion of the community of stakeholders.
Data Archiving is the set of practices around the storage and monitoring of the state of digital material over the years.
These tasks are the responsibility of a Trusted Digital Repository (TDR), also sometimes referred to as a Long-Term Archive Service (LTA). Often such services follow the Open Archival Information System [ OAIS ], which defines the archival process in terms of ingest, monitoring and reuse of data.
Provenance originates from the French term "provenir" (to come from), which is used to describe the curation process of artwork as art is passed from owner to owner. Data provenance, in a similar way, is metadata that allows data providers to pass details about the data history to data users.
Data quality is commonly defined as “fitness for use” for a specific application or use case.
File Format is a standard way that information is encoded for storage in a computer file. It specifies how bits are used to encode information in a digital storage medium. File formats may be either proprietary or free and may be either unpublished or open.
Examples of file formats include: plain text (in a specified character encoding, ideally UTF-8), Comma-Separated Values (CSV) [ RFC4180 ], Portable Document Format (PDF), XML, JSON [ RFC4627 ], Turtle [ Turtle ] and HDF5.
A license is a legal document giving official permission to do something with the data with which it is associated.
From: DCTERMS
A locale is a set of parameters that clarifies aspects of the data that may be interpreted differently in different geographic locations, such as the language and the formatting used for numeric values or dates.
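As an illustration only (this assumes a Python environment in which the named locales are installed), the same number and date are rendered differently under different locales:

```python
# Render the same number and date under two locales to show locale-dependent
# formatting. Assumes the en_US and de_DE locales are installed on the system.
import locale
from datetime import date

amount = 1234567.89
when = date(2016, 6, 12)

for loc in ("en_US.UTF-8", "de_DE.UTF-8"):
    locale.setlocale(locale.LC_ALL, loc)
    print(loc,
          locale.format_string("%.2f", amount, grouping=True),  # e.g. 1,234,567.89 vs 1.234.567,89
          when.strftime("%x"))                                   # e.g. 06/12/2016 vs 12.06.2016
```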
Machine-readable data is data in a standard format that can be read and processed automatically by a computing system. Traditional word processing documents and portable document format (PDF) files are easily read by humans but typically are difficult for machines to interpret and manipulate. Formats such as XML, JSON, HDF5, RDF and CSV are machine-readable data formats.
From: Adapted from Wikipedia
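As an illustration only (the sample records are invented), machine-readable formats such as CSV and JSON can be parsed directly with standard libraries, with no human interpretation step:

```python
# Parse small CSV and JSON samples with Python's standard library to show
# what "machine-readable" means in practice. Sample data is invented.
import csv
import io
import json

csv_text = "stop_id,name\n1,Central Station\n2,Market Square\n"
json_text = '{"stop_id": 1, "name": "Central Station"}'

rows = list(csv.DictReader(io.StringIO(csv_text)))  # CSV -> list of dicts
record = json.loads(json_text)                      # JSON -> dict

print(rows[1]["name"], record["stop_id"])
```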
The term "near real-time" or "nearly real-time" (NRT), in telecommunications and computing, refers to the time delay introduced, by automated data processing or network transmission, between the occurrence of an event and the use of the processed data, such as for display or feedback and control purposes. For example, a near-real-time display depicts an event or situation as it existed at the current time minus the processing time, as nearly the time of the live event.

From: Wikipedia
Sensitive data is any designated data or metadata that is used in limited ways and/or intended for limited audiences. Sensitive data may include personal data and corporate or government data; mishandling of published sensitive data may lead to damage to individuals or organizations.
A vocabulary is a collection of "terms" for a particular purpose. Vocabularies can range from simple, such as the widely used RDF Schema [ RDF-SCHEMA ], FOAF [ FOAF ] and Dublin Core Metadata Element Set [ DCTERMS ], to complex vocabularies with thousands of terms, such as those used in healthcare to describe symptoms, diseases and treatments. Vocabularies play a very important role in Linked Data, specifically to help with data integration. The use of this term overlaps with Ontology.
From: Linked Data Glossary
Structured Data refers to data that conforms to a fixed schema. Relational databases and spreadsheets are examples of structured data.
This section is non-normative.
The following diagram summarizes some of the main challenges faced when publishing or consuming data on the Web. These challenges were identified from the DWBP Use Cases and Requirements [ DWBP-UCR ] and, as presented in the diagram, each one is addressed by one or more Best Practices.
This section is non-normative.
The list below describes the main benefits of applying the DWBP. Each benefit represents an improvement in the way datasets are made available on the Web.
The following table relates Best Practices and Benefits.
The figure below shows the benefits that data publishers will gain with adoption of the Best Practices.
The benefits considered are: Reuse, Access, Discoverability, Processability, Trust, Interoperability, Linkability and Comprehension.
This section is non-normative.
The editors gratefully acknowledge the contributions made to this document by all members of the Working Group, especially Annette Greiner's great effort and the contributions received from Antoine Isaac, Eric Stephan and Phil Archer.
This document has benefited from inputs from many members of the Spatial Data on the Web Working Group. Specific thanks are due to Andrea Perego, Dan Brickley, Linda van den Brink and Jeremy Tandy.
The editors would also like to thank the following people for their comments: Adriano Machado, Adriano Veloso, Andreas Kuckartz, Augusto Herrmann, Bart van Leeuwen, Erik Wilde, Giancarlo Guizzardi, Gisele Pappa, Gregg Kellogg, Herbert Van de Sompel, Ivan Herman, Leigh Dodds, Lewis John McGibbney, Makx Dekkers, Manuel Tomas Carrasco-Benitez, Maurino Andrea, Michel Dumontier, Nandana Mihindukulasooriya, Nathalia Sautchuk Patrício, Peter Winstanley, Renato Iannella, Steven Adler, Vagner Diniz and Wagner Meira.
The editors also gratefully acknowledge the chairs of this Working Group: Deirdre Lee, Hadley Beeman, Steve Adler, Yaso Córdova, and the staff contact, Phil Archer.
Changes since the previous version include: