Copyright © 2014-2015 W3C® (MIT, ERCIM, Keio, Beihang). W3C liability, trademark and document use rules apply.
This document provides a framework in which the quality of a dataset can be described, whether by the dataset publisher or by a broader community of users. It does not provide a formal, complete definition of quality; rather, it sets out a consistent means by which information can be provided such that a potential user of a dataset can make his/her own judgement about its fitness for purpose.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.
This is an early draft of the DQV. Its publication is designed to provoke comment on the overall direction foreseen as much as the specific details.
This document was published by the Data on the Web Best Practices Working Group as a First Public Working Draft. If you wish to make comments regarding this document, please send them to public-dwbp-comments@w3.org (subscribe, archives). All comments are welcome.
Publication as a First Public Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. The group does not expect this document to become a W3C Recommendation. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.
This document is governed by the 1 August 2014 W3C Process Document.
This section is non-normative.
The Data on the Web Best Practices document points out the relevance of publishing information about the quality of data published on the Web. Accordingly, the Data on the Web Best Practices Working Group has been chartered to create a vocabulary for expressing data quality. The Data Quality Vocabulary (DQV) presented in this document is foreseen as an extension to DCAT [vocab-dcat] to cover the quality of the data, how frequently it is updated, whether it accepts user corrections, persistence commitments, etc. When used by publishers, this vocabulary will foster trust in the data amongst developers.
This vocabulary does not seek to determine what "quality" means. We believe that quality lies in the eye of the beholder; that there is no objective, ideal definition of it. Some datasets will be judged as low-quality resources by some data consumers, while they will perfectly fit others' needs. Accordingly, we attach a lot of importance to allowing many actors to assess the quality of datasets and publish their annotations, certificates and opinions about a dataset. A dataset's publisher should publish metadata that helps data consumers determine whether they can use the dataset to their benefit. However, publishers should not be the only ones to have a say on the quality of data published in an open environment like the Web. Certification agencies, data aggregators and data consumers can make relevant quality assessments too.
We want to stimulate this by making it easier to publish, exchange and consume quality metadata at every step of a dataset's lifecycle. This is why, alongside rather expected constructs such as quality measures, the Data Quality Vocabulary puts particular emphasis on feedback, annotations, agreements and the provenance of the metadata.
As well as sections marked as non-normative, all authoring guidelines, diagrams, examples, and notes in this specification are non-normative. Everything else in this specification is normative.
The namespace for DQV is provisionally set as http://www.w3.org/ns/dqv#.
DQV, however, seeks to re-use elements from other vocabularies, following the best practices for data vocabularies identified by the Data on the Web Best Practices Working Group.
The Working Group is considering putting all new classes and properties (together with the ones of the Dataset Usage Vocabulary) in the DCAT namespace. (Issue-179).
The table below indicates the full list of namespaces and prefixes used in this document.
Prefix | Namespace |
---|---|
daq | http://purl.org/eis/vocab/daq# |
dcat | http://www.w3.org/ns/dcat# |
dcterms | http://purl.org/dc/terms/ |
dqv | http://www.w3.org/ns/dqv# |
duv | http://www.w3.org/ns/duv# |
oa | http://www.w3.org/ns/oa# |
prov | http://www.w3.org/ns/prov# |
qb | http://purl.org/linked-data/cube# |
rdfs | http://www.w3.org/2000/01/rdf-schema# |
xsd | http://www.w3.org/2001/XMLSchema# |
Are we actually allowed by W3C to re-use elements from DAQ? This vocabulary is not a community standard and its guarantee of sustainability may be judged not good enough. A possible way forward would be to declare all relevant classes in the DQV namespace but then declare them all as owl:equivalentClass/property with their DAQ counterparts. (Issue-180).
The following vocabulary is based on DCAT [vocab-dcat], which it extends with a number of additional properties and classes suitable for expressing the quality of a dataset.
The quality of a given dataset or distribution is assessed via a number of observed properties. For instance, one may consider a dataset to be of high quality because it complies with a specific standard, while for other use cases the quality of the data will depend on its level of interlinking with other datasets. To express these properties, an instance of dcat:Dataset or dcat:Distribution can be related to four different classes:
What is the relation between duv:Feedback and dqv:UserFeedback? (Issue-165)
Should we have only the existing class daq:QualityGraph or keep the new class dqv:QualityMetadata to represent a set of statements providing quantitative and/or qualitative information about the dataset or distribution? One could be a sub-class of the other. (Issue-181)
The label of daq:QualityGraph does not fit well with the current model. DAQ graphs are meant to contain measures. In our context a "quality graph" has a wider scope: actually, the role of representing overall quality graphs is currently played by dqv:QualityMetadata. (Issue-182)
We may want to consider a revision of DCAT to make dcat:Dataset and dcat:Distribution subclasses of prov:Entity. (Issue-183)
Is a dqv:ServiceLevelAgreement a kind of certificate, or a standard? (Issue-184)
dqv:QualityAnnotation is foreseen as a subclass of oa:Annotation. The instances of this class should have an oa:motivatedBy statement with an instance of oa:Motivation, which reflects a quality assessment purpose. We plan to define it as dqv:qualityAssessment. (Issue-185)
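For instance, a quality annotation could then be expressed as in the following minimal sketch, where :myQualityAnnotation and :myFeedbackText are hypothetical resources and dqv:qualityAssessment is the motivation instance foreseen above:

:myQualityAnnotation a dqv:QualityAnnotation ;
  oa:hasTarget :myDataset ;              # the dataset or distribution being assessed
  oa:hasBody :myFeedbackText ;           # e.g. a textual review (hypothetical resource)
  oa:motivatedBy dqv:qualityAssessment . # motivation reflecting a quality assessment purpose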
This section is work in progress. We will later include more tables with specifications of individual classes and properties.
DQV defines quality measures as specific instances of DAQ observations, adapting the DAQ quality metrics framework:
For example, a dimension could be "multilinguality" and two metrics could be "ratio of literals with language tags" and "number of different language tags".
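As a non-normative illustration, such a dimension and its metrics could be declared as follows (the identifiers and labels are purely illustrative and are not defined by DQV):

:multilinguality a daq:Dimension ;
  rdfs:label "Multilinguality" .

:languageTagRatioMetric a daq:Metric ;
  rdfs:label "Ratio of literals with language tags" ;
  dqv:hasDimension :multilinguality .

:languageTagCountMetric a daq:Metric ;
  rdfs:label "Number of different language tags" ;
  dqv:hasDimension :multilinguality .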
The following property should be used on this class: daq:metric. The following property may be used for this class: qb:dataSet.
RDF Class: | dqv:QualityMeasure |
---|---|
Definition: | A quality measure represents the evaluation of a given dataset (or dataset distribution) against a specific quality metric. |
Subclass of: | daq:Observation (itself a subclass of qb:Observation) |
There might be no need for a subclass link between dqv:QualityMeasure and daq:Observation. I.e., we could re-use daq:Observation directly. (Issue-186)
RDF Property: | daq:metric |
---|---|
Definition: | Indicates the metric being observed. |
Domain: | qb:Observation |
Range: | daq:Metric |
Minimum cardinality: | 1 |
RDF Property: | qb:dataSet |
---|---|
Definition: | Indicates the data set of which this observation is a part. |
Domain: | qb:Observation |
Range: | qb:DataSet |
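A minimal sketch of a quality measure using both properties might look as follows; :exampleMeasure, :exampleMetric and :myQualityObservations are hypothetical resources:

:exampleMeasure a dqv:QualityMeasure ;
  daq:metric :exampleMetric ;           # required: the metric being observed (minimum cardinality 1)
  qb:dataSet :myQualityObservations ;   # optional: the qb:DataSet this observation is part of
  daq:value "0.8"^^xsd:double .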
The following property should be used on this class: dqv:hasDimension.
RDF Class: | daq:Metric |
---|---|
Definition: | The smallest unit of measuring a quality dimension is a metric. A metric belongs to exactly one dimension. |
Do we want to keep the same occurrence constraints as defined in DAQ (for example, that every metric should belong to exactly one dimension)? In this specific case this may demand too much of quality data publishers: it could be that a metric does not clearly belong to a dimension, or that a metric is in scope for several dimensions. (Issue-187)
RDF Property: | dqv:hasDimension |
---|---|
Definition: | Represents the dimension a metric allows a measurement of. |
Domain: | daq:Metric |
Range: | daq:Dimension |
Inverse: | daq:hasMetric |
Minimum cardinality: | 1 |
Maximum cardinality: | 1 |
This section is non-normative.
This section shows some examples to illustrate the application of the Data Quality Vocabulary. This section is still work in progress. Further examples will be provided as soon as some of the pending issues are resolved.
NB: in the remainder of this section, the prefix ":" refers to http://example.org/. The examples describe a dataset, myDataset, and its distribution, myDatasetDistribution:
:myDataset a dcat:Dataset ;
  dcterms:title "My dataset" ;
  dcat:distribution :myDatasetDistribution .

:myDatasetDistribution a dcat:Distribution ;
  dcat:downloadURL <http://www.example.org/files/mydataset.csv> ;
  dcterms:title "CSV distribution of dataset" ;
  dcat:mediaType "text/csv" ;
  dcat:byteSize "87120"^^xsd:decimal .
An automated quality checker has provided a quality assessment with two quality measures (related to the CSV distribution) for myDatasetDistribution.
:myDatasetDistribution dqv:hasQualityMeasure :measure1, :measure2 .

# When daq:computedOn ranges over a dcat:Distribution/dcat:Dataset,
# dqv:hasQualityMeasure is likely to be the inverse of daq:computedOn?
# In any case we could remove the daq:computedOn statements below.
:measure1 a dqv:QualityMeasure ;
  daq:computedOn :myDatasetDistribution ;
  daq:metric :csvAvailabilityMetric ;
  daq:value "1.0"^^xsd:double .

:measure2 a dqv:QualityMeasure ;
  daq:computedOn :myDatasetDistribution ;
  daq:metric :csvConsistencyMetric ;
  daq:value "0.5"^^xsd:double .

:csvAvailabilityMetric a daq:Metric ;
  dqv:hasDimension :availability .

:csvConsistencyMetric a daq:Metric ;
  dqv:hasDimension :consistency .

:availability a daq:Dimension ;
  dqv:hasCategory :category1 .

:consistency a daq:Dimension ;
  dqv:hasCategory :category2 .

# Categories and dimensions might be more extensively defined, for example,
# by grounding them in the section 'Dimensions and metrics hints'.
# However, any quality framework is free to define its own dimensions and categories.
The results of the metrics obtained in the previous assessment are stored in the myQualityMetadata graph.
# myQualityMetadata is a graph.
# We are assuming dqv:QualityMetadata is an extension of daq:QualityGraph,
# otherwise we should probably define a proper qb:DataStructureDefinition.
:myQualityMetadata {
  :myDatasetDistribution dqv:hasQualityMeasure :measure1, :measure2 .
  # The graph contains the rest of the statements presented in the previous example.
}

# myQualityMetadata has been created by :qualityChecker and is the result of the :qualityChecking activity.
:myQualityMetadata a dqv:QualityMetadata ;
  prov:wasAttributedTo :qualityChecker ;
  prov:generatedAtTime "2015-05-27T02:52:02Z"^^xsd:dateTime ;
  prov:wasGeneratedBy :qualityChecking .

# :qualityChecker is a service computing some quality metrics.
# We should probably suggest adding more information about such services.
:qualityChecker a prov:SoftwareAgent ;
  rdfs:label "a quality assessment service"^^xsd:string .

# :qualityChecking is the activity that has generated myQualityMetadata starting from myDatasetDistribution.
:qualityChecking a prov:Activity ;
  rdfs:label "the checking of myDatasetDistribution's quality"^^xsd:string ;
  prov:wasAssociatedWith :qualityChecker ;
  prov:used :myDatasetDistribution ;
  prov:generated :myQualityMetadata ;
  prov:startedAtTime "2015-05-27T00:52:02Z"^^xsd:dateTime ;
  prov:endedAtTime "2015-05-27T02:52:02Z"^^xsd:dateTime .
Statements similar to the ones applied to the resource myQualityMetadata above can be applied to the resource myDataset to indicate the provenance of the dataset, i.e., a dataset can be generated by a specific software agent, be generated at a certain time, etc. The HCLS Community Profile for describing datasets provides further examples.
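For instance, a sketch with hypothetical agents, activities and dates could look as follows:

:myDataset prov:wasAttributedTo :myDatasetPublisher ;
  prov:generatedAtTime "2015-04-01T10:00:00Z"^^xsd:dateTime ;
  prov:wasGeneratedBy :myDatasetCreation .

:myDatasetPublisher a prov:Agent .

:myDatasetCreation a prov:Activity ;
  prov:wasAssociatedWith :myDatasetPublisher .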
This section will be completed by examples coming from Riccardo's work on measuring the quality of linksets between SKOS concept schemes, from the perspective of adding multilingual labels to these schemes. On the Web, linksets are an especially interesting case of datasets! We could also add examples from qSKOS. (Issue-188)
This section is non-normative.
This section will be refined, especially considering public feedback.
This section gathers relevant quality dimensions and ideas for corresponding metrics, which might eventually be represented as instances of daq:Dimension and daq:Metric. The goal is not to define a normative list of dimensions and metrics; rather, the section provides a set of examples starting from use cases included in the Use Cases & Requirements document and from the following sources:
The following table gives examples of statistics that can be computed on a dataset and interpreted as quality indicators by the data consumer. Some of them can be relevant to the dimensions listed in the rest of this section. The properties come from the VoID extension created for the Aether tool.
Observation | Suggested term |
---|---|
Number of distinct external resources linked to | http://ldf.fi/void-ext#distinctIRIReferenceObjects |
Number of distinct external resources used (including schema terms) | http://ldf.fi/void-ext#distinctIRIReferences |
Number of distinct literals | http://ldf.fi/void-ext#distinctLiterals |
Number of languages used | http://ldf.fi/void-ext#languages |
Are statistics about a dataset a kind of quality info we need to include in the data quality vocabulary? (Issue-164)
The Aether VoID extension represents statistics as direct statements that have a dataset as subject and an integer as object. This pattern, which can be expected to be rather common, is different from the pattern that DQV inherits from DAQ (see examples). This document will probably have to explain how the different patterns can be reconciled, if indeed both should exist alongside each other. (Issue-189)
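For illustration, the two patterns could look as follows for the "number of languages used" statistic; the value and the :languagesMetric resource are hypothetical:

# Pattern 1: a direct statement on the dataset (Aether VoID extension)
:myDataset <http://ldf.fi/void-ext#languages> "5"^^xsd:integer .

# Pattern 2: the same information expressed as a DQV/DAQ observation
:languagesMeasure a dqv:QualityMeasure ;
  daq:computedOn :myDataset ;
  daq:metric :languagesMetric ;
  daq:value "5"^^xsd:integer .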
Can the data be accessed now and over time?
Since a dcat:Dataset is an abstract thing, it might be available at any point in time, past, present or future. We already have dcterms:issued, so two properties come to mind:
Other questions come to mind: how do we indicate that the dataset is expected to be available 'for the foreseeable future'?
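As a minimal sketch, issuance can already be expressed with dcterms:issued (the date below is hypothetical); dedicated properties for availability over time remain an open question:

:myDataset a dcat:Dataset ;
  dcterms:issued "2015-04-01"^^xsd:date .
  # Properties expressing from/until when the dataset is expected to remain
  # available have not been defined yet.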
Is the data machine-readable?
Is the data correctly representing the real-world entity or event?
Is the data free of contradictions?
Can I use it readily in an analysis tool? Can I open the dataset in R and do some statistical manipulations? Can I open it in Tableau and make a visualization without doing a lot of cleaning?
There could be some overlap with accuracy.
Does the dataset include an appropriate amount of data?
It might be useful to include some information about the context (e.g., why was the data created and what purpose is it supposed to serve).
Does the data include all data items representing the entity or event?
Is the data following accepted standards?
Is the data based on trustworthy sources?
This is described using the provenance vocabulary PROV-O.
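A minimal sketch, with hypothetical source and agent resources:

:myDataset prov:wasDerivedFrom :aTrustedSourceDataset ;
  prov:wasAttributedTo :aTrustedOrganisation .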
Is the data representing the actual situation, and is it published soon enough?
This section is non-normative.
The UCR document lists the relevant requirements for data quality and granularity:
We have to confirm whether the scope of the DQV work is indeed limited to these "official" DQV requirements or whether we should go beyond them, e.g., reflecting the quality of the vocabularies (re-)used, access to datasets, metadata and, more generally, the implementation of our best practices (cf. the "5 stars" thread).
The distinction between intrinsic and extrinsic metadata may help in making choices here. For example, DQV could be defined with respect to intrinsic properties of the datasets, not extrinsic properties (let alone properties of the metadata for a dataset!). (Issue-190)
Backward compatibility with DAQ and Data Cube: DAQ exploits Data Cube to make metric results consumable by visualisers such as CubeViz (see Jeremy's paper). This may be useful to preserve in DQV. (Issue-191)