Copyright © 2015 W3C® (MIT, ERCIM, Keio, Beihang). W3C liability, trademark and document use rules apply.
This document provides a framework in which the quality of a dataset can be described, whether by the dataset publisher or by a broader community of users. It does not provide a formal, complete definition of quality; rather, it sets out a consistent means by which information can be provided such that a potential user of a dataset can make his/her own judgment about its fitness for purpose.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.
The model for the Data Quality Vocabulary is nearing maturity, but the Working Group is seeking feedback on a number of specific issues highlighted in the document below.
This document was published by the Data on the Web Best Practices Working Group as a First Public Working Draft. If you wish to make comments regarding this document, please send them to public-dwbp-comments@w3.org (subscribe, archives). All comments are welcome.
Publication as a First Public Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. The group does not expect this document to become a W3C Recommendation. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.
This document is governed by the 1 September 2015 W3C Process Document.
This section is non-normative.
The Data on the Web Best Practices Working Draft has pointed out the relevance of publishing information about the quality of data published on the Web. Accordingly, the Data on the Web Best Practices Working Group has been chartered to create a vocabulary for expressing data quality. The Data Quality Vocabulary (DQV) presented in this document is foreseen as an extension to DCAT [vocab-dcat] to cover the quality of the data, how frequently it is updated, whether it accepts user corrections, persistence commitments, etc. When used by publishers, this vocabulary will foster trust in the data amongst developers.
This vocabulary does not seek to determine what "quality" means. We believe that quality lies in the eye of the beholder; that there is no objective, ideal definition of it. Some datasets will be judged as low-quality resources by some data consumers, while they will perfectly fit others' needs. Accordingly, we attach a lot of importance to allowing many actors to assess the quality of datasets and publish their annotations, certificates and opinions about a dataset. A dataset's publisher should seek to publish metadata that helps data consumers determine whether they can use the dataset to their benefit. However, publishers should not be the only ones to have a say on the quality of data published in an open environment like the Web. Certification agencies, data aggregators and data consumers can make relevant quality assessments, too.
We want to stimulate this by making it easier to publish, exchange and consume quality metadata, for every step of a dataset's lifecycle. This is why, next to rather expected constructs like quality measures, the Data Quality Vocabulary puts a lot of emphasis on feedback, annotation, agreements and the provenance of the metadata that describes them.
The namespace for DQV is provisionally set as http://www.w3.org/ns/dqv#. DQV, however, seeks to re-use elements from other vocabularies, following the best practices for data vocabularies identified by the Data on the Web Best Practices Working Group.
The Working Group is considering putting all new classes and properties defined in the DWBP Vocabularies in the DCAT namespace. As an attempt to stimulate reactions which might help in taking a decision, the Dataset Usage Vocabulary will be moved under the DCAT namespace. In case of positive reactions to the DUV choice, the Data Quality Vocabulary might go in the same direction. (Issue-179)
The table below indicates the full list of namespaces and prefixes used in this document.
Prefix | Namespace |
---|---|
daq | http://purl.org/eis/vocab/daq# |
dcat | http://www.w3.org/ns/dcat# |
dcterms | http://purl.org/dc/terms/ |
dqv | http://www.w3.org/ns/dqv# |
duv | http://www.w3.org/ns/duv# |
oa | http://www.w3.org/ns/oa# |
prov | http://www.w3.org/ns/prov# |
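For readers who want to try the examples in this document, the table above can be written as Turtle prefix declarations, as sketched below. The first group restates the table. The second group is an assumption on our part: it lists the usual namespace URIs for prefixes that the examples use but the table does not declare, and the ":" prefix follows the convention of the examples section.

```turtle
# Prefixes listed in the table above
@prefix daq:     <http://purl.org/eis/vocab/daq#> .
@prefix dcat:    <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix dqv:     <http://www.w3.org/ns/dqv#> .
@prefix duv:     <http://www.w3.org/ns/duv#> .
@prefix oa:      <http://www.w3.org/ns/oa#> .
@prefix prov:    <http://www.w3.org/ns/prov#> .

# Prefixes used in the examples but not listed in the table (assumed here)
@prefix :        <http://example.org/> .
@prefix qb:      <http://purl.org/linked-data/cube#> .
@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix skos:    <http://www.w3.org/2004/02/skos/core#> .
@prefix void:    <http://rdfs.org/ns/void#> .
@prefix xsd:     <http://www.w3.org/2001/XMLSchema#> .
```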
The following vocabulary is based on DCAT [vocab-dcat], which it extends with a number of additional properties and classes suitable for expressing the quality of a dataset.
The quality of a given dataset or distribution is assessed via a number of observed properties. For instance, one may consider a dataset to be of high quality because it complies with a specific standard, while for other use cases the quality of the data will depend on its level of interlinking with other datasets. To express these properties, an instance of dcat:Dataset or dcat:Distribution can be related to four different classes:
Textual description of the diagram will be added.
N.B.: "graph containment" refers to the inclusion of quality statements in (RDF) graphs, e.g. for capturing the provenance of quality statements (see later example).
What is the relation between duv:Feedback and dqv:UserFeedback? (Issue-165)
Should we have only the existing class dqv:QualityMeasureDataset, or define a new class dqv:QualityMetadata to represent a set of statements providing quantitative and/or qualitative information about the dataset or distribution? One could be a sub-class of the other. (Issue-181)
Is dqv:QualityPolicy a subclass of dcterms:Standard? The wording in the Dublin Core specification is very open ("A basis for comparison; a reference point against which other things can be evaluated"), but the label is quite restrictive. At the time of discussion a majority of WG members is ok with subclassing, but we welcome public feedback before making a final decision. (Issue-199)
This section is work in progress. We will later include more tables specifying individual classes and properties.
DQV defines quality measures as specific instances of DQV observations, adapting the DAQ quality metrics framework [DaQ], [DaQ-RDFCUBE]:
For example, a dimension could be "multilinguality" and two metrics could be "ratio of literals with language tags" and "number of different language tags".
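The multilinguality example can be written down in Turtle roughly as follows. This is only a sketch: the resource names under http://example.org/ are hypothetical, and it assumes the dqv:Dimension and dqv:Metric classes and the dqv:hasDimension property described later in this section.

```turtle
@prefix :     <http://example.org/> .
@prefix dqv:  <http://www.w3.org/ns/dqv#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

# One dimension (hypothetical name)
:multilinguality a dqv:Dimension ;
    rdfs:label "Multilinguality"@en .

# Two metrics attached to that dimension (hypothetical names)
:languageTagRatio a dqv:Metric ;
    rdfs:label "Ratio of literals with language tags"@en ;
    dqv:hasDimension :multilinguality .

:distinctLanguageTags a dqv:Metric ;
    rdfs:label "Number of different language tags"@en ;
    dqv:hasDimension :multilinguality .
```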
The following properties should be used on this class: dqv:hasMetric, dqv:value, qb:dataSet.
Should (and if yes, how) DQV represent multiple/derived values for a metric (e.g., average or normalized value)? ( Issue-222 )
Should (and if yes, how) DQV represent parameters for a metric applied for computing a specific quality measure (e.g., a specific setting of weights)? (Issue-223)
RDF Class: | dqv:QualityMeasure |
---|---|
Definition: | A quality measure represents the evaluation of a given dataset (or dataset distribution) against a specific quality metric. |
Subclass of: | qb:Observation |
Equivalent class | |
RDF Property: | dqv:hasMetric |
---|---|
Definition: | Indicates the metric being observed. |
Instance of: | qb:DimensionProperty |
Domain: | qb:Observation |
Range: | dqv:Metric |
RDF Property: | qb:dataSet |
---|---|
Definition: | Indicates the |
Domain: | qb:Observation |
Range: | qb:DataSet |
RDF Property: | dqv:computedOn |
---|---|
Definition: | Refers to the resource (e.g., a dataset, a linkset, a graph, a set of triples) on which the quality measurement is performed. In the DQV context, this property is generally expected to be used in statements in which objects are instances of dcat:Dataset and dcat:Distribution . |
Instance of: | qb:DimensionProperty |
Domain: | dqv:QualityMeasure |
Equivalent property: | daq:computedOn |
Inverse property: | dqv:hasQualityMeasure |
RDF Property: | dqv:value |
---|---|
Definition: | Refers to the values computed by a metric. |
Instance of: | qb:MeasureProperty , owl:DatatypeProperty |
Domain: | dqv:QualityMeasure |
Equivalent property: | daq:value |
The following properties should be used on this class: dqv:hasDimension .
In daQ, the property daq:expectedDataType associates each metric to the expected data type for its observed value. Data types for observed values are restricted to xsd:anySimpleType (e.g. xsd:boolean, xsd:double etc…). Is the current practice of using daq:expectedDataType in daQ appropriate? Isn't the restriction to xsd:anySimpleType too narrow? ( Issue-224 )
RDF Class: | dqv:Metric |
---|---|
Definition: | A standard to measure a quality dimension. An observation (instance of dqv:QualityMeasure) assigns a value in a given unit to a Metric. |
Equivalent class | daq:Metric |
RDF Property: | dqv:hasDimension |
---|---|
Definition: | |
Domain: | dqv:Metric |
Range: | dqv:Dimension |
Inverse: | daq:hasMetric |
Usage note: | Dimensions are meant to systematically organize metrics. The Data Quality Vocabulary defines no specific cardinality constraints for dqv:hasDimension, since distinct quality |
The following properties should be used on this class: dqv:hasCategory.
RDF Class: | dqv:Dimension |
---|---|
Definition: | Represents criteria relevant for assessing quality. Each quality |
Equivalent class | daq:Dimension |
RDF Property: | dqv:hasCategory |
---|---|
Definition: | Represents the |
Domain: | dqv:Dimension |
Range: | dqv:Category |
Inverse: | |
RDF Class: | dqv:Category |
---|---|
Definition: | Represents a group of quality dimensions in which a common type of information is used as quality indicator. |
RDF Class: | dqv:QualityMeasureDataset |
---|---|
Definition: | Represents a dataset of quality measures, evaluations of a given dataset (or dataset distribution) against a specific quality metric. |
Subclass of: | qb:DataSet |
Equivalent class | daq:QualityGraph |
RDF Class: | dqv:QualityAnnotation |
---|---|
Definition: | Represents quality annotations, including rating, quality certificate, feedback that can be associated to datasets or distributions. Quality annotations must have one oa:motivatedBy statement with an instance of oa:Motivation (and skos:Concept), which reflects a quality assessment purpose. We define this instance as dqv:qualityAssessment. |
Subclass of: | oa:Annotation |
Equivalent class | EquivalentClasses( dqv:QualityAnnotation ObjectHasValue( oa:motivatedBy dqv:qualityAssessment ) ) |
To make the document more self-contained, we might describe some properties of oa:Annotation, such as oa:hasBody and oa:hasTarget.
RDF Class: | dqv:UserQualityFeedback |
---|---|
Definition: | Represents feedback users might want to associate to datasets or distributions. |
Subclass of: | dqv:QualityAnnotation, duv:UserFeedback |
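As an illustration of the two annotation classes defined above, a piece of user feedback on a dataset might be expressed as in the following sketch. The resources :myDataset, :feedback1 and :feedbackText1 are hypothetical; oa:hasTarget and oa:hasBody come from the Open Annotation vocabulary, and the oa:motivatedBy statement follows the definition of dqv:QualityAnnotation.

```turtle
@prefix :     <http://example.org/> .
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dqv:  <http://www.w3.org/ns/dqv#> .
@prefix oa:   <http://www.w3.org/ns/oa#> .

:myDataset a dcat:Dataset .

# A user's feedback, expressed as a quality annotation
:feedback1 a dqv:UserQualityFeedback ;
    oa:hasTarget :myDataset ;        # the dataset the feedback is about
    oa:hasBody :feedbackText1 ;      # the resource holding the feedback itself
    oa:motivatedBy dqv:qualityAssessment .
```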
Should we exploit predefined instances of oa:Motivation to further characterize a user's feedback purposes? (Issue-201)
Combining the predefined instances of oa:Motivation with dqv:qualityAssessment, we could distinguish different kinds of user feedback, for example:
RDF Property: | dqv:hasQualityMeasure |
---|---|
Definition: | Refers to the performed quality measurements. Quality measurements can be performed on any kind of resource (e.g., a dataset, a linkset, a graph, a set of triples). However, in the DQV context, this property is generally expected to be used in statements in which subjects are instances of dcat:Dataset and dcat:Distribution. |
Range: | dqv:QualityMeasure |
Inverse property: | dqv:computedOn |
This section is non-normative.
This section shows some examples to illustrate the application of the Data Quality Vocabulary. This section is still work in progress. Further examples will be provided as soon as some of the pending issues are resolved. We invite the public to contact the editors and submit relevant examples of quality data, even if not yet represented in DQV. We welcome your input!
NB: in the remainder of this section, the prefix ":" refers to http://example.org/.
Let us consider a dataset myDataset, and its distribution myDatasetDistribution:
:myDataset a dcat:Dataset ;
    dcterms:title "My dataset" ;
    dcat:distribution :myDatasetDistribution .

:myDatasetDistribution a dcat:Distribution ;
    dcat:downloadURL <http://www.example.org/files/mydataset.csv> ;
    dcterms:title "CSV distribution of dataset" ;
    dcat:mediaType "text/csv" ;
    dcat:byteSize "87120"^^xsd:decimal .
An automated quality checker has provided a quality assessment with two (CSV) quality measures for myDatasetDistribution.
:myDatasetDistribution
    dqv:hasQualityMeasure :measure1, :measure2 .

:measure1 a dqv:QualityMeasure ;
    dqv:computedOn :myDatasetDistribution ;
    dqv:hasMetric :csvAvailabilityMetric ;
    dqv:value "1.0"^^xsd:double .

:measure2 a dqv:QualityMeasure ;
    dqv:computedOn :myDatasetDistribution ;
    dqv:hasMetric :csvConsistencyMetric ;
    dqv:value "0.5"^^xsd:double .

# definition of dimensions and metrics
:availability a dqv:Dimension ;
    dqv:hasCategory :category1 .

:consistency a dqv:Dimension ;
    dqv:hasCategory :category2 .

:csvAvailabilityMetric a dqv:Metric ;
    dqv:hasDimension :availability .

:csvConsistencyMetric a dqv:Metric ;
    dqv:hasDimension :consistency .
Categories and dimensions might be more extensively defined; see the section 'Dimensions and metrics hints' for further examples. Any quality framework is free to define its own dimensions and categories.
Should we represent dimensions and categories as instances of skos:Concept? This would allow publishers of quality frameworks to express (hierarchical) relations between dimensions or categories. It could also make it possible to align with quality-focused categorizations that are less focused on metrics, including the DWBP Best Practices dimensions, or even the parts of DQV about annotations. (Issue-205)
The results of metrics obtained in the previous assessment are stored in the myQualityMetadata graph.
# myQualityMetadata is a graph
:myQualityMetadata {
    :myDatasetDistribution dqv:hasQualityMeasure :measure1, :measure2 .
    # The graph contains the rest of the statements presented in the previous example.
}

# myQualityMetadata has been created by :qualityChecker and it is the result of the :qualityChecking activity
:myQualityMetadata a dqv:QualityMetadata ;
    prov:wasAttributedTo :qualityChecker ;
    prov:generatedAtTime "2015-05-27T02:52:02Z"^^xsd:dateTime ;
    prov:wasGeneratedBy :qualityChecking .

# qualityChecker is a service computing some quality metrics
:qualityChecker a prov:SoftwareAgent ;
    rdfs:label "a quality assessment service"^^xsd:string
    # We should probably suggest to add more info about the services
    .

# qualityChecking is the activity that has generated myQualityMetadata starting from myDatasetDistribution
:qualityChecking a prov:Activity ;
    rdfs:label "the checking of myDatasetDistribution's quality"^^xsd:string ;
    prov:wasAssociatedWith :qualityChecker ;
    prov:used :myDatasetDistribution ;
    prov:generated :myQualityMetadata ;
    prov:endedAtTime "2015-05-27T02:52:02Z"^^xsd:dateTime ;
    prov:startedAtTime "2015-05-27T00:52:02Z"^^xsd:dateTime .
The group has discussed provenance at different levels of granularity (dqv:QualityMeasure and dqv:QualityMetadata), so we might add an example of provenance for dqv:QualityMeasure.
Statements similar to the ones applied to the resource myQualityMetadata above can be applied to the resource myDataset to indicate the provenance of the dataset. I.e., a dataset can be generated by a specific software agent, be generated at a certain time, etc. The HCLS Community Profile for describing datasets provides further examples.
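A minimal sketch of such dataset-level provenance statements is given below. The agent :datasetGenerator, the activity :datasetGeneration and the timestamp are hypothetical and serve only to mirror the pattern used for myQualityMetadata; the properties are standard PROV-O terms.

```turtle
@prefix :     <http://example.org/> .
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

# The dataset is attributed to an agent and linked to the activity that produced it
:myDataset a dcat:Dataset ;
    prov:wasAttributedTo :datasetGenerator ;
    prov:generatedAtTime "2015-05-20T10:00:00Z"^^xsd:dateTime ;   # illustrative value
    prov:wasGeneratedBy :datasetGeneration .

# Hypothetical software agent that produced the dataset
:datasetGenerator a prov:SoftwareAgent ;
    rdfs:label "a hypothetical dataset export tool" .

# Hypothetical activity during which the dataset was produced
:datasetGeneration a prov:Activity ;
    prov:wasAssociatedWith :datasetGenerator .
```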
Let us express that an ODI certificate for the "City of Raleigh Open Government Data" dataset is available at the URL <https://certificates.theodi.org/en/datasets/393/certificate>.
<https://certificates.theodi.org/en/datasets/393> a dcat:Dataset ;
    dqv:hasQualityAnnotation :myDatasetQA .

:myDatasetQA a dqv:QualityCertificate ;
    oa:hasTarget <https://certificates.theodi.org/en/datasets/393> ;
    oa:hasBody <https://certificates.theodi.org/en/datasets/393/certificate> ;
    oa:motivatedBy dqv:qualityAssessment .
Let’s consider myControlledVocabulary, a controlled vocabulary made available on the Web using SKOS [SKOS-reference] and DCAT [vocab-dcat].
:myControlledVocabulary a dcat:Dataset ;
    dcterms:title "My controlled vocabulary" .

:myControlledVocabularyDistribution a dcat:Distribution ;
    dcat:downloadURL <http://www.example.org/files/myControlledVocabulary.csv> ;
    dcterms:title "SKOS/RDF distribution of my controlled vocabulary" ;
    dcat:mediaType "text/turtle" ;
    dcat:byteSize "190120"^^xsd:decimal .
qSKOS is an open source tool, which detects quality issues affecting SKOS vocabularies [ qSKOS ]. It considers 26 quality issues including, for example, “Incomplete Language Coverage” and “Label Conflicts” which are grouped in the category “Labeling and Documentation issues”. Quality issues addressed by qSKOS can be considered as DQV quality dimensions, whilst the number of concepts in which a quality issue occurs can be the metric deployed for each quality dimension.
# definition of instances for some of the metrics, dimensions and categories deployed in qSKOS.
:numOfConceptsWithLabelConflicts a dqv:Metric ;
    rdfs:label "Conflicting concepts"@en ;
    rdfs:comment "Number of concepts having conflicting labels"@en ;
    dqv:hasDimension :LabelConflicts .

:numOfConceptsWithIncompleteLanguageCoverage a dqv:Metric ;
    rdfs:label "Language incomplete concepts"@en ;
    rdfs:comment "Number of concepts having an incomplete language coverage"@en ;
    dqv:hasDimension :incompleteLanguageCoverage .

:LabelConflicts a dqv:Dimension ;
    rdfs:label "Label Conflicts"@en ;
    rdfs:comment "Dimension corresponding to the label conflicts quality issue"@en ;
    dqv:hasCategory :labelingDocumentationIssues .

:incompleteLanguageCoverage a dqv:Dimension ;
    rdfs:label "Incomplete Language Coverage"@en ;
    rdfs:comment "Dimension corresponding to the incomplete language coverage issue"@en ;
    dqv:hasCategory :labelingDocumentationIssues .

:labelingDocumentationIssues a dqv:Category ;
    rdfs:label "Labeling and Documentation Issues"@en ;
    rdfs:comment "Category grouping labeling and documentation issues"@en .
DQV represents the qSKOS quality assessment on myControlledVocabulary for the dimensions “Incomplete Language Coverage” and “Label Conflicts”.
:myControlledVocabulary
    dqv:hasQualityMeasure :measure1, :measure2 .

:measure1 a dqv:QualityMeasure ;
    dqv:computedOn :myControlledVocabulary ;
    dqv:hasMetric :numOfConceptsWithLabelConflicts ;
    dqv:value "1500"^^xsd:integer .

:measure2 a dqv:QualityMeasure ;
    dqv:computedOn :myControlledVocabulary ;
    dqv:hasMetric :numOfConceptsWithIncompleteLanguageCoverage ;
    dqv:value "450"^^xsd:integer .
(VoID) linksets are collections of (RDF) links between two datasets. Linksets are as important as datasets when it comes to the joint exploitation of independently served datasets in linked data. The representation of quality for a linkset offers a further example of how DQV can be exploited.
Let’s define three DCAT datasets, including one VoID linkset, which connects the two others:
:myDataset1 a dcat:Dataset ;
    dcterms:title "My dataset 1" .

:myDataset2 a dcat:Dataset ;
    dcterms:title "My dataset 2" .

:myLinkset a dcat:Dataset, void:Linkset ;
    dcterms:title "A Linkset between My dataset 1 and My dataset 2" ;
    void:linkPredicate skos:exactMatch ;
    void:target :myDataset1 ;
    void:target :myDataset2 .
We can represent information about the quality of :myLinkset using the “Multilingual importing” [MultilingualImporting] linkset quality metric. This metric works on linksets between datasets that include SKOS concepts [SKOS-reference]. It quantifies the information gain when adding the preferred labels or the alternative labels of the concepts from a linked dataset to the descriptions of the concepts from the other dataset, to which these concepts have been matched with a skos:exactMatch statement from the linkset.
We must first define the proper metric, dimension and category.
# Definition of instances for Metric, Dimension and Category.
:importingForPropertyPercentage a dqv:Metric ;
    dqv:hasDimension :completeness .

:completeness a dqv:Dimension ;
    dqv:hasCategory :complementationGain .

:complementationGain a dqv:Category .
The quality assessment of the "label importing" can be made dependent on two extra parameters: property and language, respectively the SKOS property and the language tag. We extend DQV to represent these parameters.
We need to further evaluate the way we add extra parameters for the metric and extend the DAQ RDF-CUBE data structure (postponed issue)
:language a qb:DimensionProperty, owl:DatatypeProperty ;
    rdfs:comment "language on which label importing is assessed."@en ;
    rdfs:domain dqv:QualityMeasure ;
    rdfs:label "label import assessment language"@en .

:property a qb:DimensionProperty, rdf:Property ;
    rdfs:comment "property on which label importing is assessed."@en ;
    rdfs:domain dqv:QualityMeasure ;
    rdfs:label "label import assessment property"@en ;
    rdfs:range rdf:Property .
Let us add actual quality assessments:
:qualityMeasureDataset a dqv:QualityMeasureDataset ;
qb:structure :dsd .
:importingForPropertyPercentage
    # should dqv:hasObservation be added as inverse of dqv:hasMetric?
    dqv:hasObservation :measure_exactMatchAltLabelItDataset1 , :measure_exactMatchAltLabelItDataset2 ,
        :measure_exactMatchAltLabelEnDataset1 , :measure_exactMatchAltLabelEnDataset2 ,
        :measure_exactMatchPrefLabelItDataset1 , :exactMatchprefLabelit2 .
#Adding quality observations
## for Italian alternative labels
:measure_exactMatchAltLabelItDataset1
a dqv:QualityMeasure;
dqv:computedOn :myLinkset ;
dqv:value "1.0"^^xsd:double ;
dqv:hasMetric :importingForPropertyPercentage ;
qb:dataSet :qualityMeasureDataset;
:language "it" ;
:property skos:altLabel .
:measure_exactMatchAltLabelItDataset2
a dqv:QualityMeasure;
dqv:computedOn :myLinkset ;
dqv:value "1.0"^^xsd:double ;
dqv:hasMetric :importingForPropertyPercentage ;
qb:dataSet :qualityMeasureDataset;
:language "it" ;
:property skos:altLabel .
## for English alternative labels
:measure_exactMatchAltLabelEnDataset1
a dqv:QualityMeasure;
dqv:computedOn :myLinkset ;
dqv:value "0.1"^^xsd:double ;
dqv:hasMetric :importingForPropertyPercentage ;
qb:dataSet :qualityMeasureDataset;
:language "en" ;
:property skos:altLabel .
:measure_exactMatchAltLabelEnDataset2
a dqv:QualityMeasure;
dqv:computedOn :myLinkset ;
dqv:value "1.0"^^xsd:double ;
dqv:hasMetric :importingForPropertyPercentage ;
qb:dataSet :qualityMeasureDataset;
:language "en" ;
:property skos:altLabel .
## for Italian preferred labels
:measure_exactMatchPrefLabelItDataset1
a dqv:QualityMeasure;
dqv:computedOn :myLinkset ;
dqv:value "0.5"^^xsd:double ;
dqv:hasMetric :importingForPropertyPercentage ;
qb:dataSet :qualityMeasureDataset;
:language "it" ;
:property skos:prefLabel .
:exactMatchprefLabelit2
a dqv:QualityMeasure;
dqv:computedOn :myLinkset ;
dqv:value "0.5"^^xsd:double ;
dqv:hasMetric :importingForPropertyPercentage ;
qb:dataSet :qualityMeasureDataset;
:language "it" ;
:property skos:prefLabel .
Let us specify the RDF Data Cube data structure:
:dsd a qb:DataStructureDefinition ;
    ## Copying the structure of daq:dsq
    qb:component [ qb:dimension dqv:computedOn ; qb:order 2 ] ;
    qb:component [ qb:measure dqv:value ] ;
    qb:component [ qb:dimension <http://purl.org/linked-data/sdmx/2009/dimension#timePeriod> ; qb:order 3 ] ;
    qb:component [ qb:dimension dqv:hasMetric ; qb:order 1 ] ;
    # Attribute (here: unit of measurement)
    qb:component [ qb:attribute sdmx-attribute:unitMeasure ;
                   qb:componentRequired false ;
                   qb:componentAttachment qb:DataSet ] ;
    ## Extending the structure of lds:dsq with two new dimensions
    qb:component [ qb:dimension :property ; qb:order 4 ] ;
    qb:component [ qb:dimension :language ; qb:order 5 ] .
It is often desirable to indicate that metadata about datasets in a catalogue are compliant with a metadata standard, or an application profile of an existing metadata standard. A typical example is the GeoDCAT Application Profile [GeoDCAT-AP], an extension of the DCAT vocabulary [vocab-dcat] to represent metadata for geospatial data portals. GeoDCAT-AP makes it possible to express that a dataset's metadata conforms to an existing standard, following the recommendations of ISO 19115, ISO 19157 and the EU INSPIRE directive. DCAT partly supports the expression of such metadata conformance statements. The following example illustrates how a (DCAT) catalog record can be said to be conformant with the GeoDCAT-AP standard itself.
ex:myDataset a dcat:Dataset .

ex:myDatasetRecord a dcat:CatalogRecord ;
    foaf:primaryTopic ex:myDataset ;
    dcterms:conformsTo ex:geoDCAT-AP .

ex:geoDCAT-AP a dcterms:Standard ;
    dcterms:title "GeoDCAT Application Profile" ;
    rdfs:comment "GeoDCAT-AP is developed in the context of the Interoperability Solutions for European Public Administrations (ISA) Programme"@en ;
    dcterms:issued "201X-XX-XX"^^xsd:date .
Note that this example does not include the metadata about the dataset ex:myDataset itself. We assume this is present in an RDF data source accessible via the URI ex:myDatasetRecord. We also assume that ex:geoDCAT-AP is a reference URI that denotes the GeoDCAT-AP standard, which can be re-used across many catalog record descriptions, not just a locally introduced URI.
Relation between DQV, ISO 19115/19157 and GeoDCAT-AP: DQV is already able to express the notion of "conformance" to a standard using the property dcterms:conformsTo. However, there were suggestions to be further compatible with ISO 19157:2013 and INSPIRE by adding respectively "Not conformant" and "Not evaluated" as possible properties or values. Should DQV be this expressive? (Issue-202)
This section is non-normative.
This section will be refined as soon as Issue-204 and Issue-205 are solved. In particular, following the discussion on Issue-200, we plan to align the DQV dimension classification with ISO 25012 [ISOIEC25012] and to provide the classification proposed in Zaveri et al. [ZaveriEtAl] as a further example. Suggestions on possible mappings between ISO 25012 and Zaveri et al.'s dimensions, as well as any other well-known classification, are welcome.
This section gathers relevant quality dimensions and ideas for corresponding metrics, which might eventually be represented as instances of daq:Dimension and daq:Metric. The goal is not to define a normative list of dimensions and metrics; rather, the section provides a set of examples starting from use cases included in the Use Cases & Requirements document and from the following sources:
Are the levels of granularity of dqv:Dimension and dqv:Category well-defined enough and fit for purpose? ( Issue-225 )
The following table gives examples of statistics that can be computed on a dataset and interpreted as quality indicators by the data consumer. Some of them can be relevant for the dimensions listed in the rest of this section. The properties come from the VoID extension created for the Aether tool.
Observation | Suggested term |
---|---|
Number of distinct external resources linked to | http://ldf.fi/void-ext#distinctIRIReferenceObjects |
Number of distinct external resources used (including schema terms) | http://ldf.fi/void-ext#distinctIRIReferences |
Number of distinct literals | http://ldf.fi/void-ext#distinctLiterals |
Number of languages used | http://ldf.fi/void-ext#languages |
The Aether VoID extension represents statistics as direct statements that have a dataset as subject and an integer as object. This pattern, which can be expected to be rather common, is different from the pattern that DQV inherits from DAQ. Guidance on how DQV/daQ can work with other quality statistics vocabularies will be provided.
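The difference between the two patterns can be sketched as follows, using the "Number of languages used" statistic from the table above. The dataset, the measure, the metric name and the value 3 are hypothetical; only the void-ext property and the DQV terms are taken from this document.

```turtle
@prefix :         <http://example.org/> .
@prefix dcat:     <http://www.w3.org/ns/dcat#> .
@prefix dqv:      <http://www.w3.org/ns/dqv#> .
@prefix void-ext: <http://ldf.fi/void-ext#> .
@prefix xsd:      <http://www.w3.org/2001/XMLSchema#> .

# Direct pattern (Aether VoID extension): dataset as subject, integer as object
:myDataset a dcat:Dataset ;
    void-ext:languages "3"^^xsd:integer .

# DQV pattern: the same figure expressed as a quality measure of a metric
:languagesMeasure a dqv:QualityMeasure ;
    dqv:computedOn :myDataset ;
    dqv:hasMetric :numberOfLanguagesMetric ;
    dqv:value "3"^^xsd:integer .

# Hypothetical metric corresponding to the statistic
:numberOfLanguagesMetric a dqv:Metric .
```

The direct pattern is more compact, while the DQV pattern makes room for attaching a metric definition, a dimension and provenance to each value.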
Can the data be accessed now and over time?
Since a dcat:Dataset is an abstract thing, it might be available at any point in time: past, present or future. We already have dcterms:issued, so two properties come to mind:
Other questions that come to mind: how do we indicate that the dataset is expected to be available 'for the foreseeable future?'
Is the data machine readable?
Is the data correctly representing the real-world entity or event?
Is the data free of contradictions?
Can I use it readily in an analysis tool? Can I open the dataset in R and do some statistical manipulations? Can I open it in Tableau and make a visualization without doing a lot of cleaning?
There could be some overlap with accuracy.
Does the dataset include an appropriate amount of data?
It might be useful to include some information about the context (e.g., why was the data created and what purpose is it supposed to serve).
Does the data include all data items representing the entity or event?
Is the data following accepted standards?
Is the data based on trustworthy sources?
This is described using the provenance vocabulary PROV-O.
Does the data represent the actual situation, and is it published soon enough?
This section is non-normative.
The UCR document lists relevant requirements for data quality and granularity:
The aforementioned requirements are going to be further elaborated considering on-going discussions and materials from these two wiki pages: Requirements from FPWD_BP and Quality Requirements From UCR .
We have to confirm whether the scope of DQV work is indeed these "official" DQV reqs or if we should go beyond, e.g., reflecting the quality of the vocabulary (re-)used, access to datasets, metadata and more generally the implementation of our best practices (cf. the "5 stars" thread ).
The distinction between intrinsic and extrinsic metadata may help in making choices here. For example, DQV could be defined with respect to intrinsic properties of the datasets, not extrinsic properties (let alone properties of the metadata for a dataset!). (Issue-190)
Backward compatibility with DAQ and RDF Data Cube: DAQ exploits Data Cube to make metric results consumable by visualisers such as CubeViz (see Jeremy's paper). This may be useful to preserve in DQV. (Issue-191)
The W3C Health Care and Life Sciences Community Group has created a DCAT profile for describing datasets. This work is well visible and used in the HCLS community. DQV should be aligned with this profile if there are overlapping areas. Are there such areas? (Issue-221)
Changes since the previous version include: