Copyright © 2015 W3C® (MIT, ERCIM, Keio, Beihang). W3C liability, trademark and document use rules apply.
This document provides a framework in which the quality of a dataset can be described, whether by the dataset publisher or by a broader community of users. It does not provide a formal, complete definition of quality; rather, it sets out a consistent means by which information can be provided such that a potential user of a dataset can make their own judgment about its fitness for purpose.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.
The model for the Data Quality Vocabulary is nearing maturity, but the Working Group is seeking feedback on a number of specific issues highlighted in the document below.
This document was published by the Data on the Web Best Practices Working Group as a Working Draft. If you wish to make comments regarding this document, please send them to public-dwbp-comments@w3.org (subscribe, archives). All comments are welcome.
Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. The group does not expect this document to become a W3C Recommendation. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.
This document is governed by the 1 September 2015 W3C Process Document.
This section is non-normative.
The Data on the Web Best Practices Working Draft has pointed out the relevance of publishing information about the quality of data published on the Web. Accordingly, the Data on the Web Best Practices Working Group has been chartered to create a vocabulary for expressing data quality. The Data Quality Vocabulary (DQV) presented in this document is foreseen as an extension to DCAT [vocab-dcat] to cover the quality of the data, how frequently it is updated, whether it accepts user corrections, persistence commitments etc. When used by publishers, this vocabulary will foster trust in the data amongst developers.
This vocabulary does not seek to determine what "quality" means. We believe that quality lies in the eye of the beholder and that there is no objective, ideal definition of it. Some datasets will be judged as low-quality resources by some data consumers, while they will perfectly fit others' needs. Accordingly, we attach great importance to allowing many actors to assess the quality of datasets and publish their annotations, certificates, or opinions about a dataset. A dataset's publisher should seek to publish metadata that helps data consumers determine whether they can use the dataset to their benefit. However, publishers should not be the only ones to have a say on the quality of data published in an open environment like the Web. Certification agencies, data aggregators, and data consumers can make relevant quality assessments, too.
We want to stimulate this by making it easier to publish, exchange and consume quality metadata at every step of a dataset's lifecycle. This is why, next to expected constructs like quality measures, the Data Quality Vocabulary puts a lot of emphasis on feedback, annotation, agreements and the provenance of the metadata that describes them.
The namespace for DQV is provisionally set as http://www.w3.org/ns/dqv#.
DQV, however, seeks to re-use elements from other vocabularies, following the best practices for data vocabularies identified by the Data on the Web Best Practices Working Group.
The Working Group is considering putting all new classes and properties defined in the DWBP Vocabularies in the DCAT namespace. To stimulate reactions that might help in reaching a decision, the Dataset Usage Vocabulary will be moved under the DCAT namespace. If reactions to the DUV choice are positive, the Data Quality Vocabulary might move in the same direction.
The table below indicates the full list of namespaces and prefixes used in this document.
Prefix | Namespace |
---|---|
daq | http://purl.org/eis/vocab/daq# |
dcat | http://www.w3.org/ns/dcat# |
dcterms | http://purl.org/dc/terms/ |
dqv | http://www.w3.org/ns/dqv# |
duv | http://www.w3.org/ns/duv# |
oa | http://www.w3.org/ns/oa# |
prov | http://www.w3.org/ns/prov# |
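For readers who want to run the examples in this document through a Turtle parser, the table above corresponds to the prefix declarations sketched below. The second group of prefixes is not listed in the table; it is an assumption on our part, based on the well-known namespaces of the vocabularies that the examples appear to use.

```turtle
# Prefixes listed in the table above
@prefix daq:     <http://purl.org/eis/vocab/daq#> .
@prefix dcat:    <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix dqv:     <http://www.w3.org/ns/dqv#> .
@prefix duv:     <http://www.w3.org/ns/duv#> .
@prefix oa:      <http://www.w3.org/ns/oa#> .
@prefix prov:    <http://www.w3.org/ns/prov#> .

# Prefixes used in the examples but not listed in the table (assumed)
@prefix qb:      <http://purl.org/linked-data/cube#> .
@prefix skos:    <http://www.w3.org/2004/02/skos/core#> .
@prefix void:    <http://rdfs.org/ns/void#> .
@prefix foaf:    <http://xmlns.com/foaf/0.1/> .
@prefix owl:     <http://www.w3.org/2002/07/owl#> .
@prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd:     <http://www.w3.org/2001/XMLSchema#> .
@prefix sdmx-attribute: <http://purl.org/linked-data/sdmx/2009/attribute#> .
```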
The following vocabulary is based on DCAT [vocab-dcat], which it extends with a number of additional properties and classes suitable for expressing the quality of a dataset.
The quality of a given dataset or distribution is assessed via a number of observed properties. For instance, one may consider a dataset to be of high quality because it complies with a specific standard, while for other use cases the quality of the data will depend on its level of interlinking with other datasets. To express these properties an instance of a dcat:Dataset or dcat:Distribution can be related to four different classes:
Textual description of the diagram will be added.
N.B.: "graph containment" refers to the inclusion of quality statements in (RDF) graphs, e.g. for capturing the provenance of quality statements (see later example).
Should we have only the existing class dqv:QualityMeasureDataset, or define a new class dqv:QualityMetadata to represent a set of statements providing quantitative and/or qualitative information about the dataset or distribution? One could be a sub-class of the other. (Issue-181)
Is dqv:QualityPolicy a subclass of dcterms:Standard? The wording in the Dublin Core specification is very open ("A basis for comparison; a reference point against which other things can be evaluated"), but the label is quite restrictive. At the time of discussion, a majority of WG members accept subclassing, but we welcome public feedback before making a final decision. (Issue-199)
This section is work in progress. We will later include more tables specifying individual classes and properties.
DQV defines quality measures as specific instances of DQV observations, adapting the DAQ quality metrics framework [DaQ], [DaQ-RDFCUBE]:
For example, a dimension could be "multilinguality" and two metrics could be "ratio of literals with language tags" and "number of different language tags".
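As a rough illustration of the multilinguality example, the dimension and its two metrics might be declared as follows. The resource names in the example namespace are hypothetical and are not defined by DQV.

```turtle
@prefix :     <http://example.org/> .
@prefix dqv:  <http://www.w3.org/ns/dqv#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

# Hypothetical dimension, for illustration only
:multilinguality a dqv:Dimension ;
    rdfs:label "Multilinguality"@en .

# Two hypothetical metrics that measure the same dimension
:languageTagRatio a dqv:Metric ;
    rdfs:label "Ratio of literals with language tags"@en ;
    dqv:hasDimension :multilinguality .

:distinctLanguageTagCount a dqv:Metric ;
    rdfs:label "Number of different language tags"@en ;
    dqv:hasDimension :multilinguality .
```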
The following properties should be used on this class: dqv:hasMetric, dqv:value, qb:dataSet.
Should (and if yes, how) DQV represent multiple/derived values for a metric (e.g., average or normalized value)? (Issue-222)
Should (and if yes, how) DQV represent parameters for a metric applied for computing a specific quality measure (e.g., a specific setting of weights)? (Issue-223)
RDF Class: | dqv:QualityMeasure |
---|---|
Definition: | A quality measure represents the evaluation of a given dataset (or dataset distribution) against a specific quality metric. |
Subclass of: | qb:Observation |
Equivalent class: | daq:Observation |
RDF Property: | dqv:hasMetric |
---|---|
Definition: | Indicates the metric being observed. |
Instance of: | qb:DimensionProperty |
Domain: | qb:Observation |
Range: | dqv:Metric |
Equivalent property: | daq:metric |
RDF Property: | qb:dataSet |
---|---|
Definition: | Indicates the dataset to which a quality measure (which is an RDF Data Cube observation) belongs. |
Domain: | qb:Observation |
Range: | qb:DataSet |
RDF Property: | dqv:computedOn |
---|---|
Definition: | Refers to the resource (e.g., a dataset, a linkset, a graph, a set of triples) on which the quality measurement is performed. In the DQV context, this property is generally expected to be used in statements in which objects are instances of dcat:Dataset or dcat:Distribution. |
Instance of: | qb:DimensionProperty |
Domain: | dqv:QualityMeasure |
Equivalent property: | daq:computedOn |
Inverse property: | dqv:hasQualityMeasure |
RDF Property: | dqv:value |
---|---|
Definition: | Refers to the value computed by a metric. |
Instance of: | qb:MeasureProperty, owl:DatatypeProperty |
Domain: | dqv:QualityMeasure |
Equivalent property: | daq:value |
The following properties should be used on this class: dqv:hasDimension.
In daQ, the property daq:expectedDataType associates each metric to the expected data type for its observed value. Data types for observed values are restricted to xsd:anySimpleType (e.g. xsd:boolean, xsd:double etc…). Is the current practice of using daq:expectedDataType in daQ appropriate? Isn't the restriction to xsd:anySimpleType too narrow? (Issue-224)
RDF Class: | dqv:Metric |
---|---|
Definition: | A standard to measure a quality dimension. An observation (instance of dqv:QualityMeasure) assigns a value in a given unit to a Metric. |
Equivalent class: | daq:Metric |
RDF Property: | dqv:hasDimension |
---|---|
Definition: | Represents the dimension a metric allows a measurement of. |
Domain: | dqv:Metric |
Range: | dqv:Dimension |
Inverse: | daq:hasMetric |
Usage note: | Dimensions are meant to systematically organize metrics. The Data Quality Vocabulary defines no specific cardinality constraints for dqv:hasDimension, since distinct quality frameworks might have different perspectives over a metric. A metric may therefore be associated to more than one dimension. However, those who define new quality measures should try to avoid this as much as possible and assign only one dimension to the metrics they define. |
The following properties should be used on this class: dqv:hasCategory.
RDF Class: | dqv:Dimension |
---|---|
Definition: | Represents criteria relevant for assessing quality. Each quality dimension must have one or more metrics to measure it. A dimension is linked with a category using the dqv:hasCategory property. |
Equivalent class: | daq:Dimension |
RDF Property: | dqv:hasCategory |
---|---|
Definition: | Represents the category a dimension is grouped in. |
Domain: | dqv:Dimension |
Range: | dqv:Category |
Inverse: | daq:hasDimension |
Usage note: | Categories are meant to systematically organize dimensions. The Data Quality Vocabulary defines no specific cardinality constraints for dqv:hasCategory, since distinct quality frameworks might have different perspectives over a dimension. A dimension may therefore be associated to more than one category. However, those who define new quality measures should try to avoid this as much as possible and assign only one category to the dimensions they define. |
RDF Class: | dqv:Category |
---|---|
Definition: | Represents a group of quality dimensions in which a common type of information is used as quality indicator. |
Equivalent class: | daq:Category |
RDF Class: | dqv:QualityMeasureDataset |
---|---|
Definition: | Represents a dataset of quality measures, evaluations of a given dataset (or dataset distribution) against a specific quality metric. |
Subclass of: | qb:DataSet |
Equivalent class: | daq:QualityGraph |
RDF Class: | dqv:QualityAnnotation |
---|---|
Definition: | Represents quality annotations, including ratings, quality certificates, and feedback that can be associated to datasets or distributions. Quality annotations must have one oa:motivatedBy statement with an instance of oa:Motivation (and skos:Concept), which reflects a quality assessment purpose. We define this instance as dqv:qualityAssessment. |
Subclass of: | oa:Annotation |
Equivalent class: | EquivalentClasses( dqv:QualityAnnotation ObjectHasValue( oa:motivatedBy dqv:qualityAssessment ) ) |
To make the document more self-contained we might consider describing some properties of oa:Annotation, such as oa:hasBody and oa:hasTarget.
RDF Class: | dqv:UserQualityFeedback |
---|---|
Definition: | Represents feedback users might want to associate to datasets or distributions. |
Subclass of: | dqv:QualityAnnotation, duv:UserFeedback |
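A minimal sketch of a user feedback annotation, assuming it follows the same pattern as other quality annotations in this document (oa:hasTarget, oa:hasBody and oa:motivatedBy dqv:qualityAssessment). The resources :feedback1 and :myDataset and the body URL are hypothetical.

```turtle
@prefix :     <http://example.org/> .
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dqv:  <http://www.w3.org/ns/dqv#> .
@prefix oa:   <http://www.w3.org/ns/oa#> .

:myDataset a dcat:Dataset .

# Hypothetical feedback left by a user about :myDataset
:feedback1 a dqv:UserQualityFeedback ;
    oa:hasTarget :myDataset ;
    oa:hasBody <http://example.org/feedback/1/comment> ;
    oa:motivatedBy dqv:qualityAssessment .
```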
Should we exploit predefined instances of oa:Motivation to further characterize a user's feedback purposes? (Issue-201)
Combining the predefined instances of oa:Motivation with dqv:qualityAssessment, we could distinguish different kinds of user feedback, for example:
RDF Property: | dqv:hasQualityMeasure |
---|---|
Definition: | Refers to the performed quality measurements. Quality measurements can be performed on any kind of resource (e.g., a dataset, a linkset, a graph, a set of triples). However, in the DQV context, this property is generally expected to be used in statements in which subjects are instances of dcat:Dataset or dcat:Distribution. |
Range: | dqv:QualityMeasure |
Inverse property: | dqv:computedOn |
This section is non-normative.
This section shows some examples to illustrate the application of the Data Quality Vocabulary. This section is still work in progress. Further examples will be provided as soon as some of the pending issues are resolved. We invite the public to contact the editors and submit relevant examples of quality data, even if not yet represented in DQV. We welcome your input!
NB: in the remainder of this section, the prefix ":" refers to http://example.org/
Let's consider a dataset myDataset, and its distribution myDatasetDistribution:
:myDataset a dcat:Dataset ;
    dcterms:title "My dataset" ;
    dcat:distribution :myDatasetDistribution .

:myDatasetDistribution a dcat:Distribution ;
    dcat:downloadURL <http://www.example.org/files/mydataset.csv> ;
    dcterms:title "CSV distribution of dataset" ;
    dcat:mediaType "text/csv" ;
    dcat:byteSize "87120"^^xsd:decimal .
An automated quality checker has provided a quality assessment with two (CSV) quality measures for myDatasetDistribution.
:myDatasetDistribution dqv:hasQualityMeasure :measure1, :measure2 .

:measure1 a dqv:QualityMeasure ;
    dqv:computedOn :myDatasetDistribution ;
    dqv:hasMetric :csvAvailabilityMetric ;
    dqv:value "1.0"^^xsd:double .

:measure2 a dqv:QualityMeasure ;
    dqv:computedOn :myDatasetDistribution ;
    dqv:hasMetric :csvConsistencyMetric ;
    dqv:value "0.5"^^xsd:double .

# definition of dimensions and metrics
:availability a dqv:Dimension ;
    dqv:hasCategory :category1 .

:consistency a dqv:Dimension ;
    dqv:hasCategory :category2 .

:csvAvailabilityMetric a dqv:Metric ;
    dqv:hasDimension :availability .

:csvConsistencyMetric a dqv:Metric ;
    dqv:hasDimension :consistency .
Categories and dimensions might be more extensively defined; see the section 'Dimensions and metrics hints' for further examples. Any quality framework is free to define its own dimensions and categories.
Should we represent dimensions and categories as instances of skos:Concept? This would allow publishers of quality frameworks to express (hierarchical) relations between dimensions or categories. It could also make it possible to align with quality-focused categorizations that are less focused on metrics, including the DWBP Best Practices dimensions, or even the parts of DQV about annotations. (Issue-205)
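To make the question concrete, the sketch below types two dimensions as SKOS concepts and relates them hierarchically. It only illustrates the option under discussion; DQV does not currently prescribe this, and the dimension names are hypothetical.

```turtle
@prefix :     <http://example.org/> .
@prefix dqv:  <http://www.w3.org/ns/dqv#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .

# Hypothetical dimensions, typed both as DQV dimensions and SKOS concepts
:completeness a dqv:Dimension, skos:Concept ;
    skos:prefLabel "Completeness"@en .

:schemaCompleteness a dqv:Dimension, skos:Concept ;
    skos:prefLabel "Schema completeness"@en ;
    skos:broader :completeness .
```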
The results of metrics obtained in the previous assessment are stored in the myQualityMetadata graph.
# myQualityMetadata is a graph
:myQualityMetadata {
    :myDatasetDistribution dqv:hasQualityMeasure :measure1, :measure2 .
    # The graph contains the rest of the statements presented in the previous example.
}

# myQualityMetadata has been created by :qualityChecker and is the result of the :qualityChecking activity
:myQualityMetadata a dqv:QualityMetadata ;
    prov:wasAttributedTo :qualityChecker ;
    prov:generatedAtTime "2015-05-27T02:52:02Z"^^xsd:dateTime ;
    prov:wasGeneratedBy :qualityChecking .

# qualityChecker is a service computing some quality metrics
:qualityChecker a prov:SoftwareAgent ;
    rdfs:label "a quality assessment service"^^xsd:string
    # We should probably suggest to add more info about the services
    .

# qualityChecking is the activity that has generated myQualityMetadata starting from myDatasetDistribution
:qualityChecking a prov:Activity ;
    rdfs:label "the checking of myDatasetDistribution's quality"^^xsd:string ;
    prov:wasAssociatedWith :qualityChecker ;
    prov:used :myDatasetDistribution ;
    prov:generated :myQualityMetadata ;
    prov:endedAtTime "2015-05-27T02:52:02Z"^^xsd:dateTime ;
    prov:startedAtTime "2015-05-27T00:52:02Z"^^xsd:dateTime .
The group has discussed provenance at different levels of granularity (dqv:QualityMeasure and dqv:QualityMetadata), so we might consider adding an example of provenance for dqv:QualityMeasure.
Statements similar to the ones applied to the resource myQualityMetadata above can be applied to the resource myDataset to indicate the provenance of the dataset. That is, a dataset can be generated by a specific software agent, be generated at a certain time, etc. The HCLS Community Profile for describing datasets provides further examples.
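A minimal sketch of such dataset-level provenance, reusing the PROV-O properties from the previous example. The agent :datasetGenerator and the timestamp are hypothetical and serve only as illustration.

```turtle
@prefix :     <http://example.org/> .
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

# Provenance attached directly to the dataset (hypothetical values)
:myDataset a dcat:Dataset ;
    prov:wasAttributedTo :datasetGenerator ;
    prov:generatedAtTime "2015-05-20T10:00:00Z"^^xsd:dateTime .

# Hypothetical software agent that produced the dataset
:datasetGenerator a prov:SoftwareAgent ;
    rdfs:label "a dataset generation service" .
```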
Let us express that an ODI certificate for the "City of Raleigh Open Government Data" dataset is available at the URL <https://certificates.theodi.org/en/datasets/393/certificate>.
<https://certificates.theodi.org/en/datasets/393> a dcat:Dataset ;
    dqv:hasQualityAnnotation :myDatasetQA .

:myDatasetQA a dqv:QualityCertificate ;
    oa:hasTarget <https://certificates.theodi.org/en/datasets/393> ;
    oa:hasBody <https://certificates.theodi.org/en/datasets/393/certificate> ;
    oa:motivatedBy dqv:qualityAssessment .
Let’s consider myControlledVocabulary, a controlled vocabulary made available on the Web using SKOS [SKOS-reference] and DCAT [vocab-dcat].
:myControlledVocabulary a dcat:Dataset ;
    dcterms:title "My controlled vocabulary" .

:myControlledVocabularyDistribution a dcat:Distribution ;
    dcat:downloadURL <http://www.example.org/files/myControlledVocabulary.csv> ;
    dcterms:title "SKOS/RDF distribution of my controlled vocabulary" ;
    dcat:mediaType "text/turtle" ;
    dcat:byteSize "190120"^^xsd:decimal .
qSKOS is an open source tool, which detects quality issues affecting SKOS vocabularies [qSKOS]. It considers 26 quality issues including, for example, “Incomplete Language Coverage” and “Label Conflicts” which are grouped in the category “Labeling and Documentation issues”. Quality issues addressed by qSKOS can be considered as DQV quality dimensions, whilst the number of concepts in which a quality issue occurs can be the metric deployed for each quality dimension.
# definition of instances for some of the metrics, dimensions and categories deployed in qSKOS.
:numOfConceptsWithLabelConflicts a dqv:Metric ;
    rdfs:label "Conflicting concepts"@en ;
    rdfs:comment "Number of concepts having conflicting labels"@en ;
    dqv:hasDimension :LabelConflicts .

:numOfConceptsWithIncompleteLanguageCoverage a dqv:Metric ;
    rdfs:label "Language incomplete concepts"@en ;
    rdfs:comment "Number of concepts having an incomplete language coverage"@en ;
    dqv:hasDimension :incompleteLanguageCoverage .

:LabelConflicts a dqv:Dimension ;
    rdfs:label "Label Conflicts"@en ;
    rdfs:comment "Dimension corresponding to the label conflicts quality issue"@en ;
    dqv:hasCategory :labelingDocumentationIssues .

:incompleteLanguageCoverage a dqv:Dimension ;
    rdfs:label "Incomplete Language Coverage"@en ;
    rdfs:comment "Dimension corresponding to the incomplete language coverage issue"@en ;
    dqv:hasCategory :labelingDocumentationIssues .

:labelingDocumentationIssues a dqv:Category ;
    rdfs:label "Labeling and Documentation Issues"@en ;
    rdfs:comment "Category grouping labeling and documentation issues"@en .
The following DQV statements represent the qSKOS quality assessment of myControlledVocabulary for the dimensions “Incomplete Language Coverage” and “Label Conflicts”.
:myControlledVocabulary dqv:hasQualityMeasure :measure1, :measure2 .

:measure1 a dqv:QualityMeasure ;
    dqv:computedOn :myControlledVocabulary ;
    dqv:hasMetric :numOfConceptsWithLabelConflicts ;
    dqv:value "1500"^^xsd:integer .

:measure2 a dqv:QualityMeasure ;
    dqv:computedOn :myControlledVocabulary ;
    dqv:hasMetric :numOfConceptsWithIncompleteLanguageCoverage ;
    dqv:value "450"^^xsd:integer .
(VoID) linksets are collections of (RDF) links between two datasets. Linksets are as important as datasets when it comes to the joint exploitation of independently served datasets in linked data. The representation of quality for a linkset offers a further example of how DQV can be exploited.
Let’s define three DCAT datasets, including one VoID linkset, which connects the two others:
:myDataset1 a dcat:Dataset ;
    dcterms:title "My dataset 1" .

:myDataset2 a dcat:Dataset ;
    dcterms:title "My dataset 2" .

:myLinkset a dcat:Dataset, void:Linkset ;
    dcterms:title "A Linkset between My dataset 1 and My dataset 2" ;
    void:linkPredicate skos:exactMatch ;
    void:target :myDataset1 ;
    void:target :myDataset2 .
We can represent information about the quality of :myLinkset using the “Multilingual importing” [MultilingualImporting] linkset quality metric. This metric works on linksets between datasets that include SKOS concepts [SKOS-reference]. It quantifies the information gain when adding the preferred labels or the alternative labels of the concepts from a linked dataset to the descriptions of the concepts from the other dataset, with which these concepts have been matched by a skos:exactMatch statement from the linkset. We must first define the proper metric, dimension and category.
# Definition of instances for Metric, Dimension and Category.
:importingForPropertyPercentage a dqv:Metric ;
    dqv:hasDimension :completeness .

:completeness a dqv:Dimension ;
    dqv:hasCategory :complementationGain .

:complementationGain a dqv:Category .
The quality assessment of the "label importing" can be made dependent on two extra parameters: property and language, respectively the SKOS property and the language tag. We extend DQV to represent these parameters.
We need to further evaluate the way we add extra parameters for the metric and extend the DAQ RDF-CUBE data structure (postponed issue).
:language a qb:DimensionProperty, owl:DatatypeProperty ;
    rdfs:comment "language on which label importing is assessed."@en ;
    rdfs:domain dqv:QualityMeasure ;
    rdfs:label "label import assessment language"@en .

:property a qb:DimensionProperty, rdf:Property ;
    rdfs:comment "property on which label importing is assessed."@en ;
    rdfs:domain dqv:QualityMeasure ;
    rdfs:label "label import assessment property"@en ;
    rdfs:range rdf:Property .
Let us add actual quality assessments:
:qualityMeasureDataset a dqv:QualityMeasureDataset ;
    qb:structure :dsd .

:importingForPropertyPercentage
    # should dqv:hasObservation be added as inverse of dqv:hasMetric?
    dqv:hasObservation
        :measure_exactMatchAltLabelItDataset1 , :measure_exactMatchAltLabelItDataset2 ,
        :measure_exactMatchAltLabelEnDataset1 , :measure_exactMatchAltLabelEnDataset2 ,
        :measure_exactMatchPrefLabelItDataset1 , :measure_exactMatchPrefLabelItDataset2 .

# Adding quality observations
## for Italian alternative labels
:measure_exactMatchAltLabelItDataset1
    a dqv:QualityMeasure ;
    dqv:computedOn :myLinkset ;
    dqv:value "1.0"^^xsd:double ;
    dqv:hasMetric :importingForPropertyPercentage ;
    qb:dataSet :qualityMeasureDataset ;
    :language "it" ;
    :property skos:altLabel .

:measure_exactMatchAltLabelItDataset2
    a dqv:QualityMeasure ;
    dqv:computedOn :myLinkset ;
    dqv:value "1.0"^^xsd:double ;
    dqv:hasMetric :importingForPropertyPercentage ;
    qb:dataSet :qualityMeasureDataset ;
    :language "it" ;
    :property skos:altLabel .

## for English alternative labels
:measure_exactMatchAltLabelEnDataset1
    a dqv:QualityMeasure ;
    dqv:computedOn :myLinkset ;
    dqv:value "0.1"^^xsd:double ;
    dqv:hasMetric :importingForPropertyPercentage ;
    qb:dataSet :qualityMeasureDataset ;
    :language "en" ;
    :property skos:altLabel .

:measure_exactMatchAltLabelEnDataset2
    a dqv:QualityMeasure ;
    dqv:computedOn :myLinkset ;
    dqv:value "1.0"^^xsd:double ;
    dqv:hasMetric :importingForPropertyPercentage ;
    qb:dataSet :qualityMeasureDataset ;
    :language "en" ;
    :property skos:altLabel .

## for Italian preferred labels
:measure_exactMatchPrefLabelItDataset1
    a dqv:QualityMeasure ;
    dqv:computedOn :myLinkset ;
    dqv:value "0.5"^^xsd:double ;
    dqv:hasMetric :importingForPropertyPercentage ;
    qb:dataSet :qualityMeasureDataset ;
    :language "it" ;
    :property skos:prefLabel .

:measure_exactMatchPrefLabelItDataset2
    a dqv:QualityMeasure ;
    dqv:computedOn :myLinkset ;
    dqv:value "0.5"^^xsd:double ;
    dqv:hasMetric :importingForPropertyPercentage ;
    qb:dataSet :qualityMeasureDataset ;
    :language "it" ;
    :property skos:prefLabel .
Let us specify the RDF Data Cube data structure:
:dsd a qb:DataStructureDefinition ;
    ## Copying the structure of daq:dsq
    qb:component [ qb:dimension dqv:computedOn ; qb:order 2 ] ;
    qb:component [ qb:measure dqv:value ] ;
    qb:component [ qb:dimension <http://purl.org/linked-data/sdmx/2009/dimension#timePeriod> ; qb:order 3 ] ;
    qb:component [ qb:dimension dqv:hasMetric ; qb:order 1 ] ;
    # Attribute (here: unit of measurement)
    qb:component [ qb:attribute sdmx-attribute:unitMeasure ;
                   qb:componentRequired false ;
                   qb:componentAttachment qb:DataSet ] ;
    ## Extending the copied structure with two new dimensions
    qb:component [ qb:dimension :property ; qb:order 4 ] ;
    qb:component [ qb:dimension :language ; qb:order 5 ] .
It is often desirable to indicate that metadata about datasets in a catalogue are compliant with a metadata standard, or an application profile of an existing metadata standard. A typical example is the GeoDCAT Application Profile [GeoDCAT-AP], an extension of the DCAT vocabulary [vocab-dcat] to represent metadata for geospatial data portals. GeoDCAT-AP makes it possible to express that a dataset's metadata conforms to an existing standard, following the recommendations of ISO 19115, ISO 19157 and the EU INSPIRE directive. DCAT partly supports the expression of such metadata conformance statements. The following example illustrates how a (DCAT) catalog record can be said to be conformant with the GeoDCAT-AP standard itself.
ex:myDataset a dcat:Dataset .

ex:myDatasetRecord a dcat:CatalogRecord ;
    foaf:primaryTopic ex:myDataset ;
    dcterms:conformsTo ex:geoDCAT-AP .

ex:geoDCAT-AP a dcterms:Standard ;
    dcterms:title "GeoDCAT Application Profile" ;
    rdfs:comment "GeoDCAT-AP is developed in the context of the Interoperability Solutions for European Public Administrations (ISA) Programme"@en ;
    dcterms:issued "201X-XX-XX"^^xsd:date .
Note that this example does not include the metadata about the dataset ex:myDataset itself. We assume this is present in an RDF data source accessible via the URI ex:myDatasetRecord. We also assume that ex:geoDCAT-AP is a reference URI that denotes the GeoDCAT-AP standard, which can be re-used across many catalog record descriptions, not just a locally introduced URI.
Relation between DQV, ISO 19115/19157 and GeoDCAT-AP: DQV is already able to express the notion of "conformance" to a standard using the property dcterms:conformsTo. However, there were suggestions to be further compatible with ISO 19157:2013 and INSPIRE by adding respectively "Not conformant" and "Not evaluated" as possible properties or values. Should DQV be this expressive? (Issue-202)
This section is non-normative.
This section will be refined as soon as Issue-204 and Issue-205 are solved. In particular, following the discussion on Issue-200, we plan to align the DQV dimension classification with ISO 25012 [ISOIEC25012] and to provide the classification proposed in Zaveri et al. [ZaveriEtAl] as a further example. Suggestions on possible mappings between ISO 25012 and Zaveri et al.'s dimensions, as well as any other well-known classification, are welcome.
This section gathers relevant quality dimensions and ideas for corresponding metrics, which might eventually be represented as instances of daq:Dimension and daq:Metric. The goal is not to define a normative list of dimensions and metrics; rather, the section provides a set of examples starting from use cases included in the Use Cases & Requirements document and from the following sources:
Are the levels of granularity of dqv:Dimension and dqv:Category well-defined enough and fit for purpose? (Issue-225)
The following table gives examples of statistics that can be computed on a dataset and interpreted as quality indicators by the data consumer. Some of them can be relevant for the dimensions listed in the rest of this section. The properties come from the VoID extension created for the Aether tool.
Observation | Suggested term |
---|---|
Number of distinct external resources linked to | http://ldf.fi/void-ext#distinctIRIReferenceObjects |
Number of distinct external resources used (including schema terms) | http://ldf.fi/void-ext#distinctIRIReferences |
Number of distinct literals | http://ldf.fi/void-ext#distinctLiterals |
Number of languages used | http://ldf.fi/void-ext#languages |
The Aether VoID extension represents statistics as direct statements that have a dataset as subject and an integer as object. This pattern, which can be expected to be rather common, is different from the pattern that DQV inherits from DAQ. Guidance on how DQV/daQ can work with other quality statistics vocabularies will be provided.
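The difference between the two patterns can be sketched as follows. The prefix name void-ext, the resource names in the example namespace and all values are hypothetical; only the property URIs are taken from the table above.

```turtle
@prefix :         <http://example.org/> .
@prefix void-ext: <http://ldf.fi/void-ext#> .
@prefix dqv:      <http://www.w3.org/ns/dqv#> .
@prefix xsd:      <http://www.w3.org/2001/XMLSchema#> .

# Aether-style pattern: the dataset is the subject, the statistic is an integer object
:myDataset void-ext:languages 3 ;
    void-ext:distinctLiterals 1250 .

# DQV/daQ-style pattern: the same kind of information as a quality measure
# observed against a (hypothetical) metric
:languagesMeasure a dqv:QualityMeasure ;
    dqv:computedOn :myDataset ;
    dqv:hasMetric :numberOfLanguagesMetric ;
    dqv:value "3"^^xsd:integer .

:numberOfLanguagesMetric a dqv:Metric .
```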
Can the data be accessed now and over time?
Since a dcat:Dataset is an abstract thing, it might be available at any point in time, past, present or future. We already have dcterms:issued so two properties come to mind:
Other questions that come to mind: how do we indicate that the dataset is expected to be available 'for the foreseeable future'?
Is the data machine readable?
Is the data correctly representing the real-world entity or event?
Is the data free of contradictions?
Can I use it readily in an analysis tool? Can I open the dataset in R and do some statistical manipulations? Can I open it in Tableau and make a visualization without doing a lot of cleaning?
There could be some overlap with accuracy.
Does the dataset include an appropriate amount of data?
It might be useful to include some information about the context (e.g., why was the data created and what purpose is it supposed to serve).
Does the data include all data items representing the entity or event?
Is the data following accepted standards?
Is the data based on trustworthy sources?
This is described using the provenance vocabulary PROV-O.
Does the data represent the actual situation, and is it published soon enough?
This section is non-normative.
The UCR document lists relevant requirements for data quality and granularity:
The aforementioned requirements are going to be further elaborated considering on-going discussions and materials from these two wiki pages: Requirements from FPWD_BP and Quality Requirements From UCR.
We have to confirm whether the scope of DQV work is indeed these "official" DQV requirements or if we should go beyond, e.g., reflecting the quality of the vocabulary (re-)used, access to datasets, metadata and more generally the implementation of our best practices (cf. the "5 stars" thread).
The distinction between intrinsic and extrinsic metadata may help in making choices here. For example, DQV could be defined with respect to intrinsic properties of the datasets, not extrinsic properties (let alone properties of the metadata for a dataset!). (Issue-190)
Backward compatibility with DAQ and RDF Data Cube: DAQ exploits Data Cube to make metric results consumable by visualisers such as CubeViz (see Jeremy's paper). This may be useful to preserve in DQV. (Issue-191)
The W3C Health Care and Life Sciences Community Group has created a DCAT profile for describing datasets. This work is highly visible and used in the HCLS community. DQV should be aligned with this profile if there are overlapping areas. Are there such areas? (Issue-221)
Changes since the previous version include: