This document lists use cases iteratively compiled by the Dataset Exchange Working Group. They identify current shortcomings and motivate the extension of the Data Catalog Vocabulary (DCAT). Further, they motivate the creation of guidelines for and a formalisation of the concept of (application) profiles and how to describe those, and the need for a mechanism to exchange information about those profiles including profile-based content-negotiation.

Introduction

The provision of metadata describing datasets is crucial to enable data sharing, openly or not, among researchers, governments and citizens. There is a variety of metadata standards used by different communities to describe their datasets, some of which are highly specialized. W3C’s Data Catalog Vocabulary (DCAT) is in widespread use, but so too are CKAN’s native schema, schema.org's dataset description vocabulary, ISO 19115, DDI, SDMX, CERIF, VoID, INSPIRE and, in the healthcare and life sciences domain, the Dataset Description vocabulary and DATS (ref), among others.

This wealth of metadata standards indicates that there is no complete and universally accepted solution. It is also recognized that DCAT does not provide sufficient vocabulary for some aspects of dataset descriptions (e.g. ways of describing APIs to access datasets, dataset versions, temporal aspects of datasets).

In addition to the variety of standard vocabularies, there have been multiple definitions of application profiles, which define how a vocabulary is used, for example by providing cardinality constraints and/or enumerated lists of allowed values such that data can be validated.

To enable interoperability between services, e-infrastructures and virtual research environments, mechanisms are needed for metadata standards and application profiles to be exposed and ingested through transparent and sustainable interfaces. Thus, we need a mechanism for servers to indicate the available standards and application profiles, and for clients to choose an appropriate one. This leads to the concept of content negotiation by application profile, which is orthogonal to the content negotiation by data format and language that is already part of HTTP.

Within this context, the mission of the Dataset Exchange Working Group as described in its charter is to:

This document represents the results of the Working Group's initial efforts to identify use cases and requirements for all of the above activities. It contains use cases reflecting situations that members of the Working Group and other stakeholders have identified as relevant to these goals, and a minimal set of requirements derived from the use cases. The use cases and requirements will be used to develop the three key deliverables described below.

Deliverables

The deliverables for this Working Group as described in the charter are noted below.

DCAT 1.1

An update and expansion of the current DCAT Recommendation. The new version may deprecate, but MUST NOT delete, any existing terms.

Guidance on publishing application profiles of vocabularies

A definition of what is meant by an application profile and an explanation of one or more methods for publishing and sharing them.

Content Negotiation by Application Profile

An explanation of how to implement the expected RFC and suitable fallback mechanisms as discussed at the SDSVoc workshop.

Methodology

This Working Group was formed as an outcome of the Smart Descriptions & Smarter Vocabularies (SDSVoc) workshop held in Amsterdam from November 30 to December 1, 2016. The first Working Group meeting was held May 18, 2017. At that meeting the Working Group Chairs called for Working Group members and other stakeholders to submit use cases on the Working Group's wiki.

Use cases were written using a template, which required use case authors to provide a problem statement, list of stakeholders experiencing the problem, and requirements suggested given the problem. It was also recommended that use case authors provide references to existing approaches that might be useful for DCAT, links to documents and projects their use cases referenced, related use cases, and editorial comments. In addition, the list of tags below was created and applied to the use cases to reorganize them on demand.

Working Group Chairs grouped related use cases for discussion. Use cases were discussed during the Working Group's weekly meetings and an intensive two-day face-to-face meeting held at the Oxford e-Research Centre, University of Oxford, July 17-18, 2017. After the discussion of each use case, a proposal to accept the use case as-is, accept the use case with changes, or reject the use case was put before Working Group members in attendance for a vote. During voting, Working Group Chairs sought consensus as outlined in Section 3.3 of the W3C Process Document.

The requirements were derived from the accepted use cases. One of the key tasks for the editors of this document was removing duplicate requirements, editing those that remained and adding missing ones. The editors also ensured links from requirements to use cases were maintained.

Tags

A tag set has been defined to label the use cases. Tags are grouped by deliverable, resource type, modeling aspect and meta-modeling aspect.

Use Cases

DCAT packaged distributions [ID1]

Makx Dekkers

dcat distribution documentation packaging representation

Data publisher

In practice, distributions are sometimes made available in a packaged or compressed format. For example, a group of files may be packaged in a ZIP file, or a single large file may be compressed. The current specification of DCAT allows the package format to be expressed in dct:format or dcat:mediaType but it is currently not possible to specify what types of files are contained in the package.

An example of an approach is the way ADMS defines Representation Technique which could be used to describe the type of data in a ZIP file, e.g. dcat:mediaType="https://www.iana.org/assignments/media-types/application/zip"; adms:representationTechnique="https://www.iana.org/assignments/media-types/text/csv".
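The ADMS-style approach mentioned above can be sketched in Turtle as follows. This is only an illustration: the ex: namespace, the distribution name and the download URL are hypothetical, and the use of adms:representationTechnique for the package contents is the idea floated in this use case, not an established DCAT pattern.

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix adms: <http://www.w3.org/ns/adms#> .
@prefix ex:   <http://example.org/> .   # hypothetical namespace

ex:observations-package a dcat:Distribution ;
    # Hypothetical location of the packaged file
    dcat:downloadURL <http://example.org/files/observations.zip> ;
    # Format of the package itself
    dcat:mediaType <https://www.iana.org/assignments/media-types/application/zip> ;
    # Format of the files contained in the package
    adms:representationTechnique <https://www.iana.org/assignments/media-types/text/csv> .
```

A consumer could then select distributions by the format of the packaged files rather than by the format of the package.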

Detailing and requesting additional constraints (profiles) beyond content types [ID2]

Ruben Verborgh

content_negotiation profile representation

data consumer, data producer, data publisher

While a content type such as application/json identifies the kind of parser a client needs for a given representation, it does not cover all assumptions of the server. In practice, the server will often follow a much stricter pattern than “everything that is valid JSON”, restricting itself to one or more subsets of JSON. For the purpose of this use case, we refer to such subsets generically as “profiles”. A profile captures additional structural and/or semantic constraints in addition to the media type. Note that one profile might be used across different media types: for instance, a profile could be applied to multiple RDF syntaxes.

In order to inform clients that a representation conforms to a certain profile, servers should be able to explicitly indicate which profile(s) a response conforms to. This then allows the client to make the additional structural and/or semantic interpretations that are allowed within that profile.

Clients and servers should be able to indicate their compatibility and/or preference for certain profiles. This enables clients to request a resource in a specific profile, in addition to the specific content type it requests. A client should be able to determine which profiles a server supports, and with which content types. A client should be able to look up more information about a certain profile.

One example of such a profile is a specific DCAT Application Profile, but many other profiles can be created. For example, another profile could indicate that the representation uses a certain vocabulary.

  • HTTP content negotiation (by media type, language, …)

Responses can conform to multiple, modular profiles [ID3]

Ruben Verborgh

content_negotiation profile representation

data consumer, data producer, data publisher

A response of a server can conform to multiple content types. For example, a JSON-LD response conforms to the following content types: application/octet-stream, application/json, application/ld+json (even though only one of them will typically be indicated).

Similarly, the response of a server can conform to multiple profiles. For example, a profile X could demand that all persons are described with the FOAF vocabulary, and a profile Y could demand that all books are described with the Schema.org vocabulary. A response which uses FOAF for people and Schema.org for books then clearly conforms to both profiles. In contrast to content types, it is informative to list both profiles, as their conformance is independent.

Therefore, servers should be able to indicate, if they wish to do so, that a response conforms to multiple profiles. Clients should also be able to specify their preference for one or multiple profiles.

This enables a modular design of profiles, which can be combined when appropriate. With content types, only hierarchical combinations are possible. For example, a JSON-LD document is always a JSON document. However, with profiles, this is not necessarily the case: some of them might allow orthogonal combinations (as is the case in the vocabulary example above).

  • conformance to multiple content types

Dataset Versioning Information [ID4]

Nandana Mihindukulasooriya

dcat status version provenance dataset aggregate

Most datasets that are maintained long-term and evolve over time have distributions of multiple versions. However, the current DCAT model does not cover versioning in sufficient detail. Being able to publish dataset version information in a standard way will help both producers who publish their data on data catalogues or archive data, and dataset consumers who want to discover new versions of a given dataset.

There are also similarities between software versioning and dataset versioning: for instance, some data projects release daily dataset distributions, major/minor releases, etc. Some of the lessons learned from software versioning can probably be reused. There are several existing dataset description models that extend DCAT to provide versioning information, for example, the HCLS Community Profile.
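As a rough illustration of what such version information could look like, the following Turtle sketch combines Dublin Core and PAV terms in the spirit of the HCLS Community Profile. The ex: resources and the version strings are hypothetical, and whether DCAT should adopt these or other terms is exactly the open question raised by this use case.

```turtle
@prefix dcat:    <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix pav:     <http://purl.org/pav/> .
@prefix ex:      <http://example.org/> .   # hypothetical namespace

# Version-independent description of the dataset
ex:dataset a dcat:Dataset ;
    pav:hasCurrentVersion ex:dataset-v2 .

# A specific release, linked to its predecessor
ex:dataset-v2 a dcat:Dataset ;
    dcterms:isVersionOf ex:dataset ;
    pav:version "2.0" ;
    pav:previousVersion ex:dataset-v1 .

ex:dataset-v1 a dcat:Dataset ;
    dcterms:isVersionOf ex:dataset ;
    pav:version "1.0" .
```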

Discover available content profiles [ID5]

Rob Atkinson

content_negotiation profile referencing representation resolution semantics service

There are multiple reasons to provide different information about the same concept, so if Linked Data is to exist based on URI object identifiers, and these are to relate to the real world entity, rather than specific implementations (i.e. information records), then it is inevitable that different sets of information will be required for different purposes.

Consider a request for the boundary of a country, with a coastline. If the coastline is included as a property, this may be many megabytes of detail. Alternatively, a generalised simple coastline may be provided, or a single point, or may not be required at all. (In reality there may be many different versions of coastline based on different legal definitions, or practices, or approximation methods).

Furthermore, in any graph-based response, the depth of traversal of the graph is always a choice. Consider a request to the GBIF taxonomy service to search for a biological species. A response typically includes not just the species, but potentially more information about the hierarchy of the taxonomy (kingdom, phylum, family, genus, etc.), possibly synonyms, and possibly a wide range of metadata about name sources, usages and history. There is a need to offer different choices of how deep such a traversal of relationships should go and how much of it should be returned.

Different information models (response schemas), and different choices of content within the same schema, constitute necessary options, and there may be a large number of these.

Thus there is a need for discovering which profiles a service will offer for a given resource, and a canonical machine readable graph of metadata about what such offerings consist of and how they may be invoked. This may be as simple as providing a profile name, or content profile, schema choice, encoding and languages.

Note that the Linked Data API implementation used by the UK Linked Data effort includes the notion of _view parameters in URI requests. These are "named collections of properties", but the API does not provide a means to attach metadata about what such views consist of. An equivalent HTTP header based profile negotiation would still need to address this requirement in the same way as agent-driven negotiation (https://www.w3.org/Protocols/rfc2616/rfc2616-sec12.html): what is required is a minimal set of metadata and extension mechanisms for this.

Support for a specific profile is also a powerful search axis, potentially encompassing the full suite of semantic specification and resource interoperability requirements. Thus metadata about profile support can be used for both discovery and mediated traversal via forms of content negotiation.

DCAT Distribution to describe web services [ID6]

Jonathan Yu, CSIRO

dcat distribution representation service

Data provider, data consumer

Users often access datasets via web services. DCAT provides constructs for associating a resource described by dcat:Dataset with dcat:Distribution descriptions. However, the Distribution class provides only the dcat:accessURL and dcat:downloadURL properties for users to access/download something. It would be useful for users to gain more information about the web service endpoint and how they can interact with the data. If information about the web service is known, with appropriate identifiers for the data, then users can understand the additional context and then invoke a call to the web service to access/download the dataset resource or subsets of it.

Support associating fine-grained semantics for datasets and resources within a dataset [ID7]

Jonathan Yu, Simon Cox (CSIRO)

dcat profile content negotiation semantics

Data provider, data consumer

We want to be able to describe a dataset record using properties appropriate to the dataset type. This is especially the case in a dataset in the geoscience domain, e.g. an observation of a "feature" in the real world captured using a sensor about some property. There are emerging practices on how to represent these semantics for the data, however, DCAT currently only supports association of a dcat:Dataset with dcat:theme to a skos:Concept. Data providers could extend/specialise dcat:theme to provide specific semantics about the association between dcat:Dataset and the ‘theme’ but is this enough? Furthermore, there are broad/aggregated semantics at the dataset level (e.g. observations in the Great Barrier Reef) and then fine grained semantics for elements within a dataset (e.g. sea surface temperature observations in the Great Barrier Reef). Users need a way to view the aggregated collection level metadata and the associated semantics and then they need a way to view record level metadata to obtain/filter on specific information, e.g. instrument/sensor used, spatial feature, observable property, quantity kind, etc.

Properties from the W3C Semantic Sensor Network SOSA ontology and QUDT may be useful in this context.
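The two levels of semantics can be sketched as follows, using dcat:theme at the collection level and SOSA terms at the record level. All ex: resources and the observation values are made up for illustration, and the use of dcterms:hasPart to link the dataset to a record is an assumption: DCAT does not currently prescribe how to do this, which is the gap this use case describes.

```turtle
@prefix dcat:    <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix skos:    <http://www.w3.org/2004/02/skos/core#> .
@prefix sosa:    <http://www.w3.org/ns/sosa/> .
@prefix xsd:     <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:      <http://example.org/> .   # hypothetical namespace

# Collection level: broad semantics via dcat:theme
ex:gbr-observations a dcat:Dataset ;
    dcat:theme ex:sea-surface-temperature ;
    dcterms:hasPart ex:obs-1 .             # assumed link to a record

ex:sea-surface-temperature a skos:Concept ;
    skos:prefLabel "Sea surface temperature"@en .

# Record level: fine-grained semantics for one observation (invented values)
ex:obs-1 a sosa:Observation ;
    sosa:hasFeatureOfInterest ex:great-barrier-reef ;
    sosa:observedProperty ex:sea-surface-temperature ;
    sosa:madeBySensor ex:sensor-7 ;
    sosa:resultTime "2017-07-17T00:00:00Z"^^xsd:dateTime ;
    sosa:hasSimpleResult "26.4"^^xsd:decimal .
```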

  • Examples of representing dataset metadata at the collection level and at the fine grained record level via the THREDDS server here:

http://dapds00.nci.org.au/thredds/catalogs/fx3/catalog.html

Scope or type of dataset with a DCAT description [ID8]

Simon Cox (CSIRO)

dataset dcat representation

Data catalogue

Some users of DCAT may want to apply it to resources that not everyone would consider a 'dataset'. Some examples are text documents, source code, controlled vocabularies, and ontologies. It's not clear what kinds of resources may be described with DCAT or how one would describe the types listed. Users need guidance about the expected scope for DCAT and on the use of whatever terms DCAT chooses to use or recommend for assigning types (e.g., dc:type). There also needs to be a way for a DCAT description to indicate the 'type' of dataset involved (the semantic type, not the media-type).
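One possible way to indicate the semantic type is sketched below, assuming dcterms:type with values from the DCMI Type Vocabulary. The ex: resources are hypothetical, and the choice of type vocabulary is an assumption; this use case asks for guidance on precisely that choice.

```turtle
@prefix dcat:    <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix dctype:  <http://purl.org/dc/dcmitype/> .
@prefix ex:      <http://example.org/> .   # hypothetical namespace

# Source code catalogued with DCAT
ex:analysis-scripts a dcat:Dataset ;
    dcterms:type dctype:Software .

# A text document catalogued with DCAT
ex:technical-report a dcat:Dataset ;
    dcterms:type dctype:Text .
```

The DCMI Type Vocabulary has no dedicated terms for controlled vocabularies or ontologies, so a richer type scheme would be needed for those cases.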

Common requirements for scientific data [ID9]

Andrea Perego - European Commission, Joint Research Centre (JRC)

dcat documentation meta provenance quality referencing roles content_negotiation

data consumer, data producer, data publisher

The European Commission's Joint Research Centre (JRC) is a multidisciplinary research organization with the mission of supporting EU policies with independent evidence throughout the whole policy life-cycle.

In order to provide a single access and discovery point to JRC data, a corporate data catalog was launched in 2016, where datasets are documented by using a modular metadata schema, consisting of a core profile, defining the elements that should be common to all metadata records, and a set of domain-specific extensions.

The reference metadata standard used is the DCAT application profile for European data portals [[DCAT-AP]] (the de facto EU standard metadata interchange format), and the related domain-specific extensions - namely, [[GeoDCAT-AP]], for geospatial metadata, and [[StatDCAT-AP]], for statistical metadata. The core profile of JRC metadata is however not using [[DCAT-AP]] as is, but it complements it with a number of metadata elements that have been identified as most relevant across scientific domains, and which are required in order to support data citation.

More precisely, the most common, cross-domain requirements identified at JRC are the following:

  • Ability to indicate dataset authors.
  • Ability to describe data lineage.
  • Ability to give potential data consumers information on how to use the data ("usage notes").
  • Ability to link to scientific publications about a dataset.
  • Ability to link to input data (i.e., data used to create a dataset).

[[VOCAB-DCAT]] does not provide guidance on how to model this information. [[DCAT-AP]] and [[GeoDCAT-AP]] partially support these requirements - namely, the specification of dataset authors (dcterms:creator [[DCTerms]]), data lineage (dcterms:provenance [[DCTerms]]), and input data (dcterms:source [[DCTerms]]). For the two remaining requirements, the JRC metadata schema makes use of dcterms:isReferencedBy [[DCTerms]] (publications) and vann:usageNote [[VANN]] (usage notes).

These solutions allow a simplified description of the dataset context that can be used for multiple purposes, such as assessing the quality and fitness for use of a dataset, or identifying the datasets most commonly used as input data. Additional details could be provided by representing these relationships more precisely with "qualified" forms, using vocabularies such as [[PROV-O]], [[VOCAB-DQV]], or [[VOCAB-DUV]]: for instance, the relationship between a dataset and input data can be complemented with the model used for processing them, and possibly with additional information on the data generation workflow.
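The simplified, non-qualified description of the five requirements listed above can be sketched as follows. The properties are those named in this use case; the ex: resources, the author name and the provenance text are hypothetical.

```turtle
@prefix dcat:    <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix foaf:    <http://xmlns.com/foaf/0.1/> .
@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix vann:    <http://purl.org/vocab/vann/> .
@prefix ex:      <http://example.org/> .   # hypothetical namespace

ex:dataset a dcat:Dataset ;
    # Dataset author
    dcterms:creator [ a foaf:Person ; foaf:name "Jane Doe" ] ;
    # Data lineage, as human-readable text
    dcterms:provenance [ a dcterms:ProvenanceStatement ;
        rdfs:label "Derived from survey data by spatial aggregation."@en ] ;
    # Input data
    dcterms:source ex:input-dataset ;
    # Scientific publication about the dataset
    dcterms:isReferencedBy ex:publication ;
    # Usage notes
    vann:usageNote ex:usage-notes .
```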

Requirements for data citation [ID10]

Andrea Perego - European Commission, Joint Research Centre (JRC)

dcat provenance quality referencing roles time

data consumer, data producer, data publisher

Data citation is gaining more and more importance as a way to recognize the scientific value of research data, by treating them as traditionally done for scientific publications.

Requirements for data citation include:

  • Describing data with the information that is typically used to create a bibliographic entry (e.g., authors, publication year, publisher).
  • Associating data and, whenever possible, the related resources (authors, publisher, input data, publications), with persistent identifiers.
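A minimal sketch of a dataset description carrying the information needed for a bibliographic entry is shown below, using Dublin Core terms and HTTP URIs as persistent identifiers. The DOI, the agent URIs, the title and the date are placeholders, not real identifiers or data.

```turtle
@prefix dcat:    <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix foaf:    <http://xmlns.com/foaf/0.1/> .
@prefix xsd:     <http://www.w3.org/2001/XMLSchema#> .

# Placeholder DOI used as the dataset URI
<https://doi.org/10.1234/example> a dcat:Dataset ;
    dcterms:title "Example dataset"@en ;
    dcterms:creator <http://example.org/agent/jane-doe> ;
    dcterms:publisher <http://example.org/agent/example-publisher> ;
    dcterms:issued "2017-06-01"^^xsd:date .

<http://example.org/agent/jane-doe> a foaf:Person ;
    foaf:name "Jane Doe" .

<http://example.org/agent/example-publisher> a foaf:Organization ;
    foaf:name "Example Publisher" .
```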

A study has been carried out at the European Commission's Joint Research Centre (JRC), in order to create mappings between [[DataCite]] (the current de facto standard for data citation) and [[DCAT-AP]].

The results show that [[DCAT-AP]] covers most of the required [[DataCite]] metadata elements, but some of them are missing. In particular:

  • Mandatory elements:
    • Dataset creator (but [[GeoDCAT-AP]] supports it)
  • Recommended elements:
    • [[DCAT-AP]] does not cover all the types of identifiers, dates, contributors and resources supported by [[DataCite]]
  • Optional elements:
    • Funding reference

Guidance should be provided on how to model this information in order to enable data citation also in records represented with [[VOCAB-DCAT]] and related application profiles.

Modeling identifiers and making them actionable [ID11]

Andrea Perego - European Commission, Joint Research Centre (JRC)

dcat referencing

data consumer, data producer, data publisher

A number of different (possibly persistent) identifiers are widely used in the scientific community, especially for publications, but now increasingly for authors and data.

Different approaches are used for representing them in RDF – best practices are needed to enable their effective use across platforms. But more importantly, they need to be made actionable, irrespective of the platforms they are used in.

Encoding identifiers as HTTP URIs seems to be the most effective way of making them actionable. Notably, quite a few identifier schemes can be encoded as dereferenceable HTTP URIs, and some of them also return machine-readable metadata (e.g., DOIs, ORCIDs). Moreover, they can still be encoded as literals, especially if there is a need to know the identifier “type”. In such a case, a common identifier type registry would ensure interoperability.

Another issue concerns the ability to specify primary and secondary identifiers. This may be a requirement when resources are associated with multiple identifiers.

When encoded as HTTP URIs, the usual approach to model primary and alternative identifiers is to use the former as the resource URI, whereas the latter are specified by using owl:sameAs. In this case, the information about the identifier type is not explicitly specified, and can be derived only from the URI syntax - although this is not always possible.
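This pattern can be sketched as follows, with a placeholder DOI as the primary identifier and a hypothetical local URI as the alternative one.

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .

# Primary identifier used as the resource URI (placeholder DOI)
<https://doi.org/10.1234/example> a dcat:Dataset ;
    # Alternative identifier (hypothetical local URI)
    owl:sameAs <http://example.org/id/dataset/42> .
```

Nothing in the graph states that the first URI is a DOI; a client can only guess this from the URI syntax.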

To model identifiers as literals, [[VOCAB-DCAT]] uses dcterms:identifier, but it makes no distinction between primary / alternative identifiers, or the identifier type. For alternative identifiers, [[DCAT-AP]] recommends class adms:Identifier [[VOCAB-ADMS]], which can be used to specify the identifier type, plus additional information - namely, the identifier scheme agency and the identifier issue date. It is worth noting that the adms:Identifier has the primary purpose of describing the identifier itself, which makes it less suitable for linking purposes.
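The two literal-based options can be sketched as follows. The identifier values, the ex:doi datatype used to denote the identifier type, the agency name and the issue date are all hypothetical.

```turtle
@prefix adms:    <http://www.w3.org/ns/adms#> .
@prefix dcat:    <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix skos:    <http://www.w3.org/2004/02/skos/core#> .
@prefix xsd:     <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:      <http://example.org/> .   # hypothetical namespace

ex:dataset a dcat:Dataset ;
    # Plain identifier: no type, no primary/alternative distinction
    dcterms:identifier "dataset-42" ;
    # Alternative identifier described as a resource in its own right
    adms:identifier [ a adms:Identifier ;
        skos:notation "10.1234/example"^^ex:doi ;   # type given as datatype
        adms:schemeAgency "Example Agency" ;
        dcterms:issued "2017-06-01"^^xsd:date ] .
```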

Finally, a number of vocabularies have defined specific properties for modeling identifier types, such as prism:doi [[PRISM]] and bibo:doi [[BIBO]] for DOIs. Moreover, starting from version 3.2, [[SCHEMA-ORG]] has defined a super-property schema:identifier for all the identifier-specific properties already used in [[SCHEMA-ORG]].

An alternative approach is to denote the identifier type with an RDF datatype. In such a case, the same property can be used to specify the identifier - e.g., dcterms:identifier. This solution has the advantage of being able to easily identify all literals used as identifiers (you just have to look up / search for the same property), whereas the datatype can be used to filter specific identifier types.
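A sketch of the datatype-based approach is given below. The ex:doi and ex:local-id datatypes are hypothetical; in practice they would come from the kind of common identifier type registry mentioned earlier in this use case.

```turtle
@prefix dcat:    <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix ex:      <http://example.org/> .   # hypothetical namespace

ex:dataset a dcat:Dataset ;
    # One property for all identifiers; the datatype carries the type
    dcterms:identifier "10.1234/example"^^ex:doi ,
                       "dataset-42"^^ex:local-id .
```

A query for all identifiers then only needs to match dcterms:identifier, and can filter on the literal datatype to select, for example, DOIs only.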

KC: Note that the libraries/archives community has identifiers that are not (yet) actionable, like ISBNs, ISSNs. These can be coded as dcterms:identifier strings but the string itself is not unique. Not sure how these fit into the overall picture but perhaps we can task someone to bring a specific proposal.

Modeling data lineage [ID12]

Andrea Perego - European Commission, Joint Research Centre (JRC)

dcat provenance quality referencing roles

data consumer, data producer, data publisher

Documentation on data lineage is crucial to ensure transparency on how data are created and to facilitate their reproducibility. These have been traditional requirements for scientific data, but are currently becoming relevant also in other communities, especially in the public sector when data are used in support of policy making.

Data lineage is typically specified with a more or less detailed human-targeted documentation. In very few cases, this information is represented in a formal, machine-readable way, enabling a (semi)automated data processing workflow that can be used to re-run the experiment from which the data were produced.

[[DCAT-AP]] uses property dcterms:provenance [[DCTerms]] to specify a human-readable documentation of data lineage, that can be either embedded in metadata or described in a document linked from the metadata record itself. Moreover, dcterms:source can be used to refer to the input data.

[[PROV-O]] can be used in order to provide a machine-readable description of data lineage, but best practices on how to use it consistently are missing.
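The difference between the two levels of description can be sketched as follows, with the human-readable form and a possible machine-readable PROV-O form side by side. The ex: resources, the provenance text and the timestamp are hypothetical, and the PROV-O part is only one of several plausible ways of using the vocabulary.

```turtle
@prefix dcat:    <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix prov:    <http://www.w3.org/ns/prov#> .
@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd:     <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:      <http://example.org/> .   # hypothetical namespace

ex:dataset a dcat:Dataset, prov:Entity ;
    # Human-readable lineage, as in DCAT-AP
    dcterms:provenance [ a dcterms:ProvenanceStatement ;
        rdfs:label "Aggregated from the input dataset with model M."@en ] ;
    dcterms:source ex:input-dataset ;
    # Machine-readable lineage with PROV-O
    prov:wasDerivedFrom ex:input-dataset ;
    prov:wasGeneratedBy ex:processing-run .

ex:processing-run a prov:Activity ;
    prov:used ex:input-dataset ;
    prov:endedAtTime "2017-06-01T12:00:00Z"^^xsd:dateTime .
```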

Modeling agent roles [ID13]

Andrea Perego - European Commission, Joint Research Centre (JRC)

dcat provenance referencing roles

data consumer, data producer, data publisher

Each metadata standard has its own set of agent roles, and they all use their own vocabularies / code lists. E.g., the latest version (2014) of [[ISO-19115-1]] has 20 roles, and [[DataCite]] even more.

Two of the main issues concern (a) how to ensure interoperability across roles defined in different standards, and (b) whether it makes sense to support all of them across platforms. The latter point follows from a common issue in metadata standards supporting multiple roles with overlapping semantics (e.g., the difference between a data distributor and a data publisher is not always clear). In these scenarios, whenever metadata are not created by specialists, roles are frequently used inconsistently.

As far as research data are concerned, agent roles are important to denote the type of contribution provided by each individual / organization in producing data.

Moreover, in some cases, an additional requirement is to specify the temporal dimension of a role – i.e., the time frame during which an individual / organisation played a given role - and, maybe, also other information – e.g., the organisation where the individual held a given position while playing that role.

[[DCTerms]] defines a limited number of agent roles as properties. [[VOCAB-DCAT]] re-uses some of them (in particular, dcterms:publisher), and defines a new one, namely, dcat:contactPoint. [[DCAT-AP]] and [[GeoDCAT-AP]] provide guidance on the use of other [[DCTerms]] roles - in particular, dcterms:creator and dcterms:rightsHolder. However, the role properties defined in [[DCTerms]] and [[VOCAB-DCAT]] model just a subset of the agent roles defined in other standards. Moreover, they cannot be used to associate a role with other information concerning its temporal / organizational context.

[[PROV-O]] could be used for this purpose by using a “qualified attribution”. This is, for instance, the approach used in [[GeoDCAT-AP]] to model agent roles defined in [[ISO-19115-1]] but not supported in [[DCTerms]] and [[VOCAB-DCAT]]:

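@prefix a:       <http://example.org/> .   # placeholder namespace (assumption)
@prefix dcat:    <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix foaf:    <http://xmlns.com/foaf/0.1/> .
@prefix prov:    <http://www.w3.org/ns/prov#> .

a:Dataset a dcat:Dataset ;
  prov:qualifiedAttribution [ a prov:Attribution ;
    # The agent role, as per ISO 19115
    dcterms:type <http://inspire.ec.europa.eu/metadata-codelist/ResponsiblePartyRole/owner> ;
    # The agent playing that role
    prov:agent [ a foaf:Organization ;
      foaf:name "European Union"@en ] ] .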

However, to address the different use cases, such “qualified roles” should be compatible with the corresponding non-qualified forms, and both should be mutually inferable. For instance, the example above in [[GeoDCAT-AP]] is considered as equivalent to:

a:Dataset a dcat:Dataset;
  dcterms:rightsHolder [ a foaf:Organization ;
    foaf:name "European Union"@en ] .

Data quality modeling patterns [ID14]

Andrea Perego - European Commission, Joint Research Centre (JRC)

dcat documentation meta provenance quality referencing

data consumer, data producer, data publisher

Used in its broader sense, the notion of "data quality" covers different aspects that may vary depending on the domain.

They include, but are not limited to:

  • Fitness for purpose.
  • Data precision / accuracy.
  • Compliance with given quality benchmarks, standards, specifications.
  • Quality assessments based on data review / users' feedback.

In order to provide a mechanism for the consistent representation of data quality, the most frequently used data quality aspects should be identified, based on existing standards (e.g., [[ISO-19115-1]]) and practices. Such aspects should also be used to identify possible common modeling patterns.

Solutions for modeling data quality have been defined in [[DCAT-AP]], [[GeoDCAT-AP]], [[StatDCAT-AP]], [[VOCAB-DQV]], and [[VOCAB-DUV]]. They cover the following aspects:

  • Metadata conformance with a metadata standard.
  • Data conformance with a given data schema/model.
  • Data conformance with a given reference system (spatial or temporal).
  • Data conformance with a given quality specification / benchmark.
  • Associating data with a quality report.
  • Spatial / temporal resolution.
  • Data quality assessments expressed with quantitative test results.
  • Data quality assessments via users’ feedback.

Notably, the first four aspects (those related to "conformance") follow a common pattern, in that the reference vocabularies model all of them by using the property dcterms:conformsTo [[DCTerms]].
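The common conformance pattern can be sketched as follows. The ex: resources are hypothetical, and the choice of which resource carries each statement (catalog record or dataset) is an assumption made for illustration; the individual application profiles differ in where they attach dcterms:conformsTo.

```turtle
@prefix dcat:    <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix foaf:    <http://xmlns.com/foaf/0.1/> .
@prefix ex:      <http://example.org/> .   # hypothetical namespace

# Metadata conformance with a metadata standard
ex:record a dcat:CatalogRecord ;
    foaf:primaryTopic ex:dataset ;
    dcterms:conformsTo ex:metadata-standard .

# Data conformance with a schema, a reference system and a quality specification
ex:dataset a dcat:Dataset ;
    dcterms:conformsTo ex:data-schema ,
                       ex:reference-system ,
                       ex:quality-specification .

ex:metadata-standard     a dcterms:Standard .
ex:data-schema           a dcterms:Standard .
ex:reference-system      a dcterms:Standard .
ex:quality-specification a dcterms:Standard .
```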

Modeling data precision and accuracy [ID15]

Andrea Perego - European Commission, Joint Research Centre (JRC)

dcat quality referencing resolution

data consumer, data producer, data publisher

Understanding the level of precision and accuracy of a dataset is fundamental to verify its fitness for purpose. This is typically denoted in terms of spatial or temporal resolution, but other dimensions are also possible.

Some metadata standards include elements for specifying precision. For instance the latest version (2014) of [[ISO-19115-1]] supports the possibility of specifying spatial resolution in terms of scale (e.g., 1:1,000,000), distance - further split into horizontal ground distance, vertical distance, and angular distance - and level of detail. However, [[VOCAB-DCAT]] does not provide guidance on how to model this information.

Actually, for some time, [[VOCAB-DCAT]] included a property dcat:granularity to model precision, which was dropped in the final version of the vocabulary (see ISSUE-10, and, in particular, the mail proposing to drop this property).

This issue was raised during the development of [[VOCAB-DQV]], and a solution has been proposed on how to model data precision in terms of spatial resolution - expressed as equivalent scale (e.g., 1:1,000,000) or distance (e.g., 1km) - and data accuracy as percentage - see [[VOCAB-DQV]], Section 6.13 Express dataset precision and accuracy. Notably, the same approach can be followed to model temporal resolution.
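A rough sketch of a DQV-style statement of spatial resolution as distance is shown below. The DQV classes and properties are real, but the metric, the dimension and the other ex: resources are hypothetical and are not copied from [[VOCAB-DQV]]; the value assumes metres as the unit.

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dqv:  <http://www.w3.org/ns/dqv#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:   <http://example.org/> .   # hypothetical namespace

ex:dataset a dcat:Dataset ;
    dqv:hasQualityMeasurement ex:resolution-measurement .

# The measurement: a resolution of 1 km, given in metres
ex:resolution-measurement a dqv:QualityMeasurement ;
    dqv:isMeasurementOf ex:spatialResolutionAsDistance ;
    dqv:value "1000"^^xsd:decimal .

# The metric being measured (hypothetical definition)
ex:spatialResolutionAsDistance a dqv:Metric ;
    skos:definition "Spatial resolution expressed as distance in metres."@en ;
    dqv:expectedDataType xsd:decimal ;
    dqv:inDimension ex:precision .

ex:precision a dqv:Dimension .
```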

[[SDW-BP]] also addresses this problem, re-using the approach defined in [[VOCAB-DQV]]. Additionally, it provides an example of how to specify accuracy by stating conformance with a quality standard - see [[SDW-BP]], Best Practice 14: Describe the positional accuracy of spatial data.

Modeling conformance test results on data quality [ID16]

Andrea Perego - European Commission, Joint Research Centre (JRC)

dcat quality referencing

data consumer, data producer, data publisher

One way of expressing data quality is to verify whether or not a given dataset conforms to a given quality standard / benchmark.

[[ISO-19115-1]] supports a way of modeling this information, by making it possible to state whether or not a given dataset passed a given test. Moreover, [[INSPIRE-MD]] extends this approach by supporting an additional possible result, namely, "not evaluated".

Another approach is provided by the [[EARL10]] vocabulary, which provides a generic mechanism to model test results. More precisely, [[EARL10]] supports the following possible outcome values (quoting from Section 2.7 OutcomeValue Class):

earl:passed
Passed - the subject passed the test.
earl:failed
Failed - the subject failed the test.
earl:cantTell
Cannot tell - it is unclear if the subject passed or failed the test.
earl:inapplicable
Inapplicable - the test is not applicable to the subject.
earl:untested
Untested - the test has not been carried out.

[[VOCAB-DQV]] allows the specification of data conformance with a reference quality standard / benchmark. However, this can model only one of the possible scenarios - i.e., when data are conformant.

[[GeoDCAT-AP]] provides an alternative and extended way of expressing "conformance" by using [[PROV-O]]. It allows the specification of additional information about conformance tests (when a test was carried out, by whom, etc.), as well as of different conformance test results (namely, conformant, not conformant, not evaluated).
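The difference in expressiveness between these result sets can be illustrated with a small Python sketch. It aligns the five [[EARL10]] outcome values with the three INSPIRE degrees of conformity used by [[GeoDCAT-AP]]. The alignment itself is an assumption made for illustration only and is not defined by any of the cited specifications; the helper name can_use_conforms_to is likewise hypothetical.

```python
# Namespaces of the EARL 1.0 vocabulary and of the INSPIRE degree-of-conformity code list.
EARL = "http://www.w3.org/ns/earl#"
INSPIRE_DOC = "http://inspire.ec.europa.eu/metadata-codelist/DegreeOfConformity/"

# ASSUMPTION: a lossy, illustrative alignment; no cited specification defines it.
EARL_TO_INSPIRE = {
    EARL + "passed": INSPIRE_DOC + "conformant",
    EARL + "failed": INSPIRE_DOC + "notConformant",
    EARL + "cantTell": INSPIRE_DOC + "notEvaluated",
    EARL + "inapplicable": INSPIRE_DOC + "notEvaluated",
    EARL + "untested": INSPIRE_DOC + "notEvaluated",
}


def can_use_conforms_to(earl_outcome: str) -> bool:
    """True only if the result can be stated with a plain dcterms:conformsTo."""
    return EARL_TO_INSPIRE[earl_outcome] == INSPIRE_DOC + "conformant"


for outcome, degree in EARL_TO_INSPIRE.items():
    print(outcome, "->", degree, "| dcterms:conformsTo:", can_use_conforms_to(outcome))
```

Under this alignment, only one of the five outcomes can be captured by stating conformance directly, which is the limitation noted above for [[VOCAB-DQV]].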

An example of the [[GeoDCAT-AP]] [[PROV-O]]-based representation of conformance is provided by the following code snippet:

a:Dataset a dcat:Dataset ;
  prov:wasUsedBy a:TestingActivity .
a:TestingActivity a prov:Activity ;
  prov:generated a:TestResult ;
  prov:qualifiedAssociation [ a prov:Association ;
# Here you can specify which agent carried out the test, when, etc.
    prov:hadPlan a:ConformanceTest ] .
# Conformance test result
a:TestResult a prov:Entity ;
  dcterms:type <http://inspire.ec.europa.eu/metadata-codelist/DegreeOfConformity/conformant> .
a:ConformanceTest a prov:Plan ;
# Here you can specify additional information on the test
  prov:wasDerivedFrom <http://data.europa.eu/eli/reg/2010/1089/oj> .
# Reference standard / specification
<http://data.europa.eu/eli/reg/2010/1089/oj> a prov:Entity, dcterms:Standard ;
  dcterms:title "Commission Regulation (EU) No 1089/2010 of 23 November 2010 implementing Directive 2007/2/EC of the European Parliament and of the Council as regards interoperability of spatial data sets and services"@en ;
  dcterms:issued "2010-11-23"^^xsd:date .

The example states that the reference dataset is conformant with the Commission Regulation (EU) No 1089/2010 of 23 November 2010 implementing Directive 2007/2/EC of the European Parliament and of the Council as regards interoperability of spatial data sets and services. Since this case corresponds to the scenario supported in [[VOCAB-DQV]], the [[PROV-O]]-based representation above is equivalent to:

a:Dataset a dcat:Dataset ;
  dcterms:conformsTo <http://data.europa.eu/eli/reg/2010/1089/oj> .
# Reference standard / specification
<http://data.europa.eu/eli/reg/2010/1089/oj> a prov:Entity, dcterms:Standard ;
  dcterms:title "Commission Regulation (EU) No 1089/2010 of 23 November 2010 implementing Directive 2007/2/EC of the European Parliament and of the Council as regards interoperability of spatial data sets and services"@en ;
  dcterms:issued "2010-11-23"^^xsd:date .

Data access restrictions [ID17]

Andrea Perego - European Commission, Joint Research Centre (JRC)

dcat documentation usage_control

data consumer, data producer, data publisher

The types of possible access restrictions of a dataset are one of the key filtering criteria for data consumers. For instance, while searching in a data catalogue, I may not be interested in data I cannot access (closed data), or in data that require me to provide personal information (such as data that can be accessed by anyone, but only after registration).

Moreover, it is often the case that different distributions of the same dataset are released with different access restrictions. For instance, a dataset containing sensitive information (such as personal data) should not be publicly accessible, although it would be possible to openly release a distribution where these data are aggregated and/or anonymized.

Finally, whenever data are not publicly available, an explanation of why they are closed should be provided - especially when these data are maintained by public authorities, or are the outcomes of publicly funded research activities.

[[DCAT-AP]] models this information at the dataset level by using property dcterms:accessRights [[DCTerms]], and defines three possible values:

Public
Definition: Publicly accessible by everyone.
Usage note/comment: Permissible obstacles include: registration and request for API keys, as long as anyone can request such registration and/or API keys.
Restricted
Definition: Only available under certain conditions.
Usage note/comment: This category may include: resources that require payment, resources shared under non-disclosure agreements, resources for which the publisher or owner has not yet decided if they can be publicly released.
Non-public
Definition: Not publicly accessible for privacy, security or other reasons.
Usage note/comment: This category may include resources that contain sensitive or personal information.

In addition to this, the JRC extension to [[DCAT-AP]] uses property dcterms:accessRights also at the distribution level, with the following possible values:

no limitations
The distribution can be anonymously accessed
registration required
The distribution can be accessed by anyone, but only after registration
authorization required
The distribution can be accessed only by authorized users
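How a catalogue client might use the distribution-level values as a filtering criterion is sketched below in Python. The record layout, the identifiers, and the ordering of the three values from least to most restrictive are assumptions made for this illustration; they are not part of [[DCAT-AP]] or of the JRC extension.

```python
# Distribution-level values listed above.
# ASSUMPTION: they are ordered here from least to most restrictive.
LEVELS = ["no limitations", "registration required", "authorization required"]


def accessible_distributions(datasets, max_level="no limitations"):
    """Return (dataset id, distribution id) pairs whose restriction is at most max_level."""
    limit = LEVELS.index(max_level)
    return [
        (ds["id"], dist["id"])
        for ds in datasets
        for dist in ds["distributions"]
        if LEVELS.index(dist["accessRights"]) <= limit
    ]


# Hypothetical records: a non-public dataset with an openly released,
# aggregated distribution, and a public dataset served only after registration.
catalog = [
    {"id": "survey", "accessRights": "Non-public", "distributions": [
        {"id": "survey-raw", "accessRights": "authorization required"},
        {"id": "survey-aggregated", "accessRights": "no limitations"},
    ]},
    {"id": "air-quality", "accessRights": "Public", "distributions": [
        {"id": "air-quality-api", "accessRights": "registration required"},
    ]},
]

print(accessible_distributions(catalog))
print(accessible_distributions(catalog, max_level="registration required"))
```

The sketch shows why dataset-level access rights alone are not sufficient for filtering: the openly released distribution of the non-public dataset would be missed by a consumer who filters on the dataset-level value only.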

Modeling service-based data access [ID18]

Andrea Perego - European Commission, Joint Research Centre (JRC)

dcat distribution documentation service

data consumer, data producer, data publisher

This concerns how to model dataset distributions available via services / APIs (e.g., a SPARQL endpoint), and not via direct file download. In such cases, it is necessary to know how to query the service / API to get the data. Moreover, an additional issue is that a service may provide access to more than one dataset. As a consequence, users do not know how to get access to the relevant subset of data accessible via a service / API.

Although this is a domain-independent issue, it is a key one in the geospatial domain, where data are typically made accessible via services (e.g., a view or download service) that require specific clients to be used. In metadata, the link to such a service usually points to an XML document describing the service's "capabilities". This of course puzzles non-expert users, who expect instead to get the actual "data".

Some catalogue platforms (such as GeoNetwork and, to some extent, CKAN) are able to make this transparent for some services (typically, view services), but not for all. It would therefore be desirable to agree on a cross-domain and cross-platform approach to deal with this issue.

In [[VOCAB-DCAT]], the option of accessing data via a service / API is explicitly mentioned, recommending the use of dcat:accessURL to point to it. However, this property is meant to be used, generically, for indirect data download, so it is not enough to know that the URL points to a service endpoint rather than to a download page.

Actually, for some time, [[VOCAB-DCAT]] included a class dcat:WebService (subclass of dcat:Distribution) to specify that data is available via a service / API. Other subclasses of dcat:Distribution were also defined to specify direct data access (dcat:Download), and data access via an RSS/Atom feed (dcat:Feed). All these subclasses were dropped in the final version of the vocabulary (see ISSUE-8 / ISSUE-9, and related discussion).

A proposal to address this issue has been elaborated in the framework of the DCAT-AP implementation guidelines (see issue DT2: Service-based data access), where two main requirements have been identified:

  1. Denote distributions as pointing to a service / API, and not directly to the actual data.
  2. Provide a description of the API / service interface, along with the relevant query parameters, that can be directly used by software agents - either to access the data, or to make data access transparent to end users.

As far as point (1) is concerned, the proposal is to associate with distributions the following information:

  • Whether the access / download URL of a distribution points to data or to a service / API (dcterms:type).
  • In the latter case, we include the specification the service/API conforms to (dcterms:conformsTo).

An example is provided by the following code snippet. Here, the distribution's access URL points to a service, implemented by using the [[WMS]] standard of the Open Geospatial Consortium (OGC):

a:Dataset a dcat:Dataset ;
  dcat:distribution [ a dcat:Distribution ;
    dcterms:title "GMIS - WMS (9km)"@en ;
    dcterms:description "Web Map Service (WMS) - GetCapabilities"@en ;
    dcterms:license <http://publications.europa.eu/resource/authority/licence/COM_REUSE> ;
    dcat:accessURL <http://gmis.jrc.ec.europa.eu/webservices/9km/wms/meris/?dataset=kd490> ;
# The distribution points to a service
    dcterms:type <http://publications.europa.eu/resource/authority/distribution-type/WEB_SERVICE> ;
# The service conforms to the WMS specification
    dcterms:conformsTo <http://www.opengis.net/def/serviceType/ogc/wms> ] .

About (2) (i.e., provide a description of the API / service interface), a number of options have been discussed (e.g., describe a service/API by using an OpenSearch document), but no final decision has been taken.
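As an illustration of requirement (1), the following Python sketch shows how a software agent could use the type and conformance information of a distribution to decide how to access it. The dictionary layout and the function name wms_capabilities_url are assumptions made for this example; the service and request query parameters are the ones a WMS client uses to ask for the service capabilities.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

WEB_SERVICE = "http://publications.europa.eu/resource/authority/distribution-type/WEB_SERVICE"
WMS = "http://www.opengis.net/def/serviceType/ogc/wms"


def wms_capabilities_url(distribution):
    """Return a GetCapabilities URL if the distribution is a WMS endpoint, else None."""
    if distribution.get("dcterms:type") != WEB_SERVICE:
        return None  # not flagged as a service: treat as a download or landing page
    if distribution.get("dcterms:conformsTo") != WMS:
        return None  # a service, but of a type this client does not understand
    parts = urlsplit(distribution["dcat:accessURL"])
    # Keep the existing query parameters and add the WMS ones.
    query = parse_qsl(parts.query) + [("service", "WMS"), ("request", "GetCapabilities")]
    return urlunsplit(parts._replace(query=urlencode(query)))


# Hypothetical in-memory form of a distribution described as a WMS service.
distribution = {
    "dcat:accessURL": "http://gmis.jrc.ec.europa.eu/webservices/9km/wms/meris/?dataset=kd490",
    "dcterms:type": WEB_SERVICE,
    "dcterms:conformsTo": WMS,
}
print(wms_capabilities_url(distribution))
```

Without the two statements, the agent would have no basis for choosing between downloading the URL and querying it as a service.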

Guidance on the use of qualified forms [ID19]

Andrea Perego - European Commission, Joint Research Centre (JRC)

dcat documentation meta provenance quality referencing roles

data consumer, data producer, data publisher

In most cases, the relationships between datasets and related resources (e.g., author, publisher, contact point, publications / documentation, input data, model(s) / software used to create the dataset) can be specified with simple, binary properties available from widely used vocabularies, such as [[DCTerms]] and [[VOCAB-DCAT]].

As an example, dcterms:source can be used to specify a relationship between a dataset (output:Dataset), and the dataset it was derived from (input:Dataset):

output:Dataset a dcat:Dataset ;
  dcterms:source input:Dataset .
  
input:Dataset a dcat:Dataset .
          

However, there may be a need to provide additional information concerning, e.g., the temporal context of a relationship, which requires the use of a more sophisticated representation, similar to the "qualified" forms used in [[PROV-O]]. For instance, the previous example may be further detailed by saying that the output dataset is an anonymized version of the input dataset, and that the anonymization process started at time t and ended at time t′. By using [[PROV-O]], this information can be expressed as follows:

output:Dataset a dcat:Dataset ;
  prov:qualifiedDerivation [
    a prov:Derivation ;
    prov:entity input:Dataset ; 
    prov:hadActivity   :data_anonymization 
] .

input:Dataset a dcat:Dataset .

# The process of anonymizing the data (load the data, process it, and generate the anonymized version)

:data_anonymization
  a prov:Activity ;
# When the process started  
  prov:startedAtTime  "2018-01-23T01:52:02Z"^^xsd:dateTime;
# When the process ended  
  prov:endedAtTime "2018-01-23T02:00:02Z"^^xsd:dateTime .
          

Besides [[PROV-O]], vocabularies such as [[VOCAB-DQV]] and [[VOCAB-DUV]] can be used to specify relationships between datasets and related resources. However, guidance is needed on how to use them consistently, since the lack of modeling patterns makes it difficult to aggregate this information across metadata records and catalogs.

Moreover, it is important to define mappings between qualified and non-qualified forms (e.g., along the lines of what is done in [[PROV-DC]]), not only to make their semantic relationships clear (e.g., dcterms:source is the non-qualified form of prov:qualifiedDerivation), but also to enable metadata sharing and re-use across catalogs that may support only one of the two forms (qualified / non-qualified).
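A minimal Python sketch of such a mapping is given below, for the single pair mentioned above (prov:qualifiedDerivation to dcterms:source). Triples are represented as plain tuples of prefixed names, and the function name add_unqualified_sources is hypothetical; a real implementation would more likely use an RDF library or a SPARQL CONSTRUCT query.

```python
def add_unqualified_sources(triples):
    """Add a dcterms:source triple for every prov:qualifiedDerivation found."""
    # Map each derivation node to the entity that was derived.
    derived = {o: s for s, p, o in triples if p == "prov:qualifiedDerivation"}
    sources = {
        (derived[s], "dcterms:source", o)
        for s, p, o in triples
        if p == "prov:entity" and s in derived
    }
    return set(triples) | sources


# The qualified form of the anonymization example, with "_:d1" as the blank node.
qualified = {
    ("output:Dataset", "prov:qualifiedDerivation", "_:d1"),
    ("_:d1", "prov:entity", "input:Dataset"),
    ("_:d1", "prov:hadActivity", ":data_anonymization"),
}

for triple in sorted(add_unqualified_sources(qualified)):
    print(triple)
```

The mapping is lossy in one direction only: the non-qualified statement can always be derived from the qualified one, whereas details such as the activity and its timing cannot be recovered from dcterms:source.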

[[GeoDCAT-AP]] makes use of both qualified and non-qualified forms to model agent roles and data quality conformance test results.

Modelling resources different from datasets [ID20]

Andrea Perego - European Commission, Joint Research Centre (JRC)

dcat meta service

data consumer, data producer, data publisher

[[VOCAB-DCAT]] makes use of quite a general definition of dataset (quoting from [[VOCAB-DCAT]], Section 5.3 Class: Dataset: "A collection of data, published or curated by a single agent, and available for access or download in one or more formats."), which is meant to be used as broadly as possible (as stated in ISSUE-62).

As such, it could be theoretically used to model a variety of resources - including documents, software, images and audio-visual content. However, the solution adopted in [[VOCAB-DCAT]] is not able to address the following scenarios:

  1. Let's suppose that a data catalog includes records about data, as well as documents and software. If all are modeled just with dcat:Dataset, it is not possible for users to restrict their search to the specific type of resource they are interested in.
  2. In addition to this, let's suppose that the catalog also includes records about Web-based services (e.g., a SPARQL endpoint or any of the services used for geospatial data): can a service be considered a "dataset"? How should it be modeled?

These two scenarios are not hypothetical, but they reflect what is typically included, e.g., in catalogs following the [[ISO-19115-1]] or the [[DataCite]] standards, which model in different ways the documented resources, and both support records about services.

[[GeoDCAT-AP]] provides a mechanism to model three out of the more than 20 resource types supported in [[ISO-19115-1]] - namely, dataset, dataset series, and service.

The adopted approach is as follows:

  • Datasets and dataset series are modeled with dcat:Dataset.
  • The specific dataset "type" (dataset and dataset series) is denoted by using dcterms:type [[DCTerms]].
  • Services are modeled with dcat:Catalog in the case of a catalog service, and with dctype:Service [[DCTerms]] in all other cases.
  • The type of service (discovery, download, view, etc.) is modeled by using dcterms:type.

A similar approach has been adopted in the study carried out by the European Commission's Joint Research Centre (JRC) to map [[DataCite]] to [[DCAT-AP]].

[[DataCite]] supports 14 resource types. Most of them fall into the generic [[VOCAB-DCAT]] definition of "dataset", so they are modeled with dcat:Dataset. Moreover, the DCMI Type Vocabulary [[DCTerms]] is used to model both the dataset "type", and those resource types that cannot be modeled as datasets (events, physical objects, services).

Machine actionable link for a mapping client [ID21]

Stephen Richard, Columbia University

dcat distribution

A geologic unit dataset has various service distributions, e.g. an OGC v1.1.1 WFS offering GeoSciML v3 GeologicUnit, GeoSciML portrayal GeologicUnit, and GeoSciML v4 GeologicUnit; an OGC v1.3.0 WMS with layers portrayed according to stratigraphic age, lithology, or stratigraphic unit; and an ESRI feature service. A user's map client software has a catalog search capability, and requires GeoSciML v4 encoding in order to function correctly.

The metadata must provide sufficient information about the distributions for the catalog client to filter for only services that offer such a distribution in the results offered to the user.
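The filtering step can be sketched in Python as follows. The record structure and the "protocol" and "encodings" keys are hypothetical, since how this information should be expressed in the metadata is exactly what this use case asks for.

```python
def matching_distributions(dataset, required_encoding):
    """Return the titles of the distributions that offer the required encoding."""
    return [
        dist["title"]
        for dist in dataset["distributions"]
        if required_encoding in dist["encodings"]
    ]


# Hypothetical description of the geologic unit dataset and its service distributions.
geologic_units = {
    "title": "Geologic units",
    "distributions": [
        {"title": "Feature service", "protocol": "OGC WFS",
         "encodings": ["GeoSciML v3", "GeoSciML portrayal", "GeoSciML v4"]},
        {"title": "Map service", "protocol": "OGC WMS", "encodings": []},
        {"title": "ESRI service", "protocol": "ESRI feature service", "encodings": []},
    ],
}

print(matching_distributions(geologic_units, "GeoSciML v4"))
```

The check is only possible because each distribution states its encodings in a machine-readable way; a free-text description would not allow the client to perform it reliably.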

Template link in metadata [ID22]

Stephen Richard, Columbia University

dcat distribution

A dataset is offered via an OData endpoint, and the distribution link is a template with several parameters for which the user must provide values to obtain a valid response. The client must have a means to know the valid value domains for the parameters. This could be via a link to an OpenSearch or URI template description document, or via metadata elements associated with the link that define the parameters and their domains.
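The following Python sketch illustrates what a client could do once the template and the parameter domains are available. The endpoint URL, the {name}-style placeholder syntax, and the chosen domains are assumptions made for this example; $top and $format are standard OData query options, used here only as plausible parameters.

```python
from urllib.parse import quote


def expand(template, domains, **values):
    """Fill a {name}-style URL template after checking each value against its domain."""
    for name, domain in domains.items():
        if name not in values:
            raise ValueError(f"missing parameter: {name}")
        if values[name] not in domain:
            raise ValueError(f"invalid value for {name}: {values[name]!r}")
    # Percent-encode every value before substituting it into the template.
    return template.format(**{k: quote(str(v), safe="") for k, v in values.items()})


# Hypothetical template link and parameter domains taken from the metadata.
TEMPLATE = "http://data.example.org/odata/GeologicUnits?$top={top}&$format={format}"
DOMAINS = {"top": range(1, 101), "format": {"json", "xml"}}

print(expand(TEMPLATE, DOMAINS, top=10, format="json"))
```

Validating against the declared domains lets the client reject a request such as top=0 before it is sent, instead of relying on an error response from the service.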

Data Quality Vocabulary (DQV) Wish List left by the DWBP WG [ID23]

Riccardo Albertoni (Consiglio Nazionale delle Ricerche), Antoine Isaac (VU University Amsterdam and Europeana)

dcat meta quality

Data publisher, data consumer, catalog maintainer, application profile publisher

As discussed in the recent W3C Recommendation DWBP, “The quality of a dataset can have a big impact on the quality of applications that use it. As a consequence, the inclusion of data quality information in data publishing and consumption pipelines is of primary importance.” DQV is a new RDF vocabulary which extends DCAT with additional properties and classes suitable for expressing the quality of DCAT datasets and distributions. It defines concepts such as measures and metrics to assess the quality of user-defined quality dimensions, but it also places great importance on allowing many actors to assess the quality of datasets and publish their annotations, certificates, and opinions about a dataset. The W3C DWBP Working Group left a list of possible topics to be developed which were not in the scope of, or could not be covered by, the DWBP group. In particular, some of the wishes left for the Data Quality Vocabulary (DQV) seem to be related to the activity of this group.

The list below groups some of the DQV wishes by the most likely impacted DXWG deliverable. Each requirement in the list might be expanded into a separate use case after a first scrutiny by the group. Some of the DQV wishes might be included either as Use Cases or as group issues. The choice of the most appropriate way of inclusion is affected by the level of commitment that DCAT 1.1 will make about quality documentation, and by how much DCAT will rely on DQV for documenting dataset quality.

VOCAB-DQV

DWBP Wish List

Harmonising INSPIRE-obligations and DCAT-distribution [ID24]

Thomas D'haenens, Informatie Vlaanderen

dcat profile out_of_scope

Within our government agency we are struggling to combine two targets. On one side, we have a European obligation to share datasets about a wide range of topics (going from environment to transport to ...), following the INSPIRE guidelines. These are for the most part geared towards georeferenceable datasets, are based on ISO standards, and go into much more detail than DCAT does. On the other side, we also have an open data policy and implementations based on DCAT (a much leaner metadata vocabulary).

We're now working on a way to map the INSPIRE-based descriptions to DCAT-based descriptions. Since INSPIRE is European legislation (and thus binding for all European countries), this work ought to be done on a supranational level. At the least, I believe guidelines and mapping rules should be defined within both DCAT(-AP) and INSPIRE to enhance interoperability. The starting point should be that a dataset must be described only once (of course).

Publication of catalog information [ID25]

Jaroslav Pullmann, Christian Mader (Fraunhofer)

dcat catalog dataset publication usage_control

Data publisher, data portal operator

While the operation and co-existence of data portals hosting DCAT descriptions is out of the group's scope, the standard should support an explicit regulation of their (re)distribution and hosting. This use case refers to scenarios where individual Datasets or entire Catalogs are copied among data portals.

An explicit reference to the original resource should be maintained within any copy even when both share the same URI (i.e. are local copies of an identical resource). The reference to the original resource should indicate the resource's publication context (data portal) in a way that is accessible for search engine indexing and browser navigation.

Usage policies might further regulate the handling of the distributed entities, e.g. a duty to keep the copies updated or to display the provenance information, or a prohibition of commercial exploitation.

Extension points to 3rd party vocabularies for modeling significant aspects of data exchange [ID26]

Jaroslav Pullmann, Christian Mader (Fraunhofer)

meta

DXWG members

Considering DCAT as a high-level model for data exchange, agree on significant aspects missing so far and define extension points (typically properties) for the re-use and integration of existing standards, application profiles etc. The reference listing of aspects deemed relevant is based on an evaluation of DXWG use cases and ISO 19115:2014:

  • Identification of Data(sub)sets (e.g. for purposes of citation)
  • Lineage, provenance and versioning (sources and processes applied)
  • Content description (internal data structure and semantics)
  • Context (spatial, temporal, socio-economic)
  • Reference system (spatial, temporal, socio-economic)
  • Quality, ratings and recommendations (?)
  • Distribution options (dynamic distribution, coverage of representation-, packaging- and compression formats)
  • Usage control, licensing (usage constraints and obligations, e.g. pricing)
  • Maintenance (scope and frequency of maintenance)

Modeling temporal coverage [ID27]

Andrea Perego - European Commission, Joint Research Centre (JRC)

dcat coverage documentation time

data consumer, data producer, data publisher

[[VOCAB-DCAT]] uses dcterms:temporal [[DCTerms]] to specify the temporal coverage of a dataset, but does not provide guidance on how to specify the start / end date.

Actually, the only relevant example provided in [[VOCAB-DCAT]] makes use of a URI operated by reference.data.gov.uk, denoting a time interval modeled by using [[OWL-TIME]]. Such sophisticated representation could be relevant for some use cases, but it is quite cumbersome when the requirement is to specify simply a start / end date, and it makes it difficult to use temporal coverage as a filtering mechanism during the discovery phase.

To address this issue, [[VOCAB-ADMS]] makes use of properties schema:startDate and schema:endDate [[SCHEMA-ORG]]. [[DCAT-AP]] follows the same approach. Other existing approaches are: DATS, DataCite and Google datasets.

  • [[VOCAB-ADMS]], Section 5.2.10 Period of time
  • [[DCAT-AP]]
  • OWL-Time (2017 revision) includes the following general-purpose predicates
    • time:hasTime to associate a time:TemporalEntity with anything
    • time:hasBeginning to associate a time:Instant with anything, though with the entailment that the subject is itself a time:TemporalEntity (or a member of a subclass, which is generally not hard)
    • time:hasEnd to associate a time:Instant with anything, though with the entailment that the subject is itself a time:TemporalEntity (or a member of a subclass, which is generally not hard)
  • SOSA/SSN Ontology defines sosa:phenomenonTime and sosa:resultTime, adapted from ISO 19156 (Observations and measurements)
    • sosa:phenomenonTime refers to world time
    • sosa:resultTime refers to data acquisition time
      • also briefly discussed was stimulusTime - being the time the act of data acquisition started (complementing resultTime which is when it finished)
  • ISO 19156 also has om:validTime - being the time interval during which use of the result is recommended (important for forecasting applications)
  • OGC Met Ocean working group / UK MetOffice recognize the following temporal properties of ensemble forecast data
    • simulation event time
    • analysis time (aka run time or reference time)
    • assimilation window
    • datum time
    • forecast computation time
    • validity time
    • partial forecast time
    • re-analysis event time
    • forecast model run collection time
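To show why a plain start / end date representation, as adopted in [[VOCAB-ADMS]] and [[DCAT-AP]], is convenient for filtering during the discovery phase, a minimal Python sketch follows. The dataset names and dates are invented for the example, and treating a missing end date as an open-ended interval is an assumption of the sketch, not something prescribed by those specifications.

```python
from datetime import date


def covers(start, end, query_start, query_end):
    """True if the coverage [start, end] overlaps the queried period; None means open-ended."""
    return (start is None or start <= query_end) and (end is None or end >= query_start)


# Hypothetical temporal coverages, as they could be read from start / end date values.
coverages = {
    "census-2011": (date(2011, 1, 1), date(2011, 12, 31)),
    "air-quality": (date(2015, 6, 1), None),  # no end date: still being updated
}

hits = [
    name
    for name, (start, end) in coverages.items()
    if covers(start, end, date(2016, 1, 1), date(2016, 12, 31))
]
print(hits)
```

With two comparable dates per dataset, the filter reduces to two comparisons, whereas an interval identified only by a URI would first have to be dereferenced and interpreted.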

Modeling reference systems [ID28]

Andrea Perego - European Commission, Joint Research Centre (JRC)

dcat documentation quality representation space

data consumer, data producer, data publisher

One of the key pieces of information needed to correctly interpret geospatial data is the spatial coordinate reference system used. For instance, a coordinate reference system can denote the order in which the coordinates are specified (latitude / longitude, longitude / latitude), whether coordinates denote points, lines, surfaces, or volumes, and which unit of measurement is used.

This information is normally included in geospatial metadata since, depending on the coordinate reference system used, a dataset can or cannot be used for specific use cases. So, users can filter the relevant datasets during the discovery phase.

Used more broadly, the notion of "reference system" can be applied to other data as well. For instance, suppose a dataset consisting of a set of measurements expressed as numbers. Are they percentages or quantities using a specific unit of measurement?

[[SDW-BP]] addresses this issue in Best Practice 8, and illustrates a number of options that can be followed.

[[GeoDCAT-AP]] models this information by specifying data conformance with a given standard, as done in [[VOCAB-DQV]], which, in this case, is a spatial or temporal reference system. As far as spatial reference systems are concerned, they are denoted by the HTTP URIs operated by the OGC CRS register (see [[SDW-BP]], Example 22):

@prefix ex:      <http://data.example.org/datasets/> .
@prefix dcat:    <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix skos:    <http://www.w3.org/2004/02/skos/core#> .
@prefix xsd:     <http://www.w3.org/2001/XMLSchema#> .
ex:ExampleDataset 
  a dcat:Dataset ;
  dcterms:conformsTo <http://www.opengis.net/def/crs/EPSG/0/32630> .
<http://www.opengis.net/def/crs/EPSG/0/32630> 
  a dcterms:Standard, skos:Concept ;
  dcterms:type <http://inspire.ec.europa.eu/glossary/SpatialReferenceSystem> ;
  dcterms:identifier "http://www.opengis.net/def/crs/EPSG/0/32630"^^xsd:anyURI ;
  skos:prefLabel "WGS 84 / UTM zone 30N"@en ;
  skos:inScheme <http://www.opengis.net/def/crs/EPSG/0/> .

Modeling spatial coverage [ID29]

Andrea Perego - European Commission, Joint Research Centre (JRC)

dcat coverage documentation space

data consumer, data producer, data publisher

The "spatial" or "geographic coverage" of a dataset denotes the geographic area of the phenomena described in the dataset itself.

How dataset spatial coverage is specified varies depending on the domain and metadata standards used. However, the different solutions make basically use of two different approaches (not mutually exclusive):

  1. The geographic area is denoted by a geographical name, possibly by using an identifier from a gazetteer (e.g., Geonames) or a registry concerning, e.g., administrative units (see, e.g., the NUTS).
  2. The geographic area is denoted by its "geometry", i.e., the geographic coordinates denoting its boundaries, its representative point (as its centroid) or its bounding box.

Geometries are typically used when it is necessary to denote an arbitrary geographic area, which may not correspond to a specific geographical name. Examples include (but are not limited to) satellite images and data from sensors. Geometries are also used in existing data catalogs for discovery and filtering purposes (e.g., this feature is supported in GeoNetwork and CKAN). Moreover, spatial queries are supported by the majority of the existing triple stores (including those not supporting [[GeoSPARQL]]).

[[VOCAB-DCAT]] allows the specification of the spatial coverage of a dataset by using dcterms:spatial [[DCTerms]], and includes an example making use of an HTTP URI from Geonames denoting a geographical area.

However, no guidance is provided on how to denote arbitrary regions with a "geometry" (i.e., a point, a bounding box, a polygon), which is the typical way spatial coverage is specified in geospatial metadata.

The issue is particularly problematic since the existing vocabularies model this information in very different ways. Moreover, geometries can be expressed in a number of formats (e.g., [[GML]], WKT, GeoJSON [[RFC7946]]). This situation makes it difficult to use information on spatial coverage effectively, e.g., to support spatial search and filtering.

[[SDW-BP]] provides comprehensive guidance on how to specify geometries in the Best Practices under Section 12.2.2 Geometries and coordinate reference systems.

As far as metadata are concerned, one of the documented approaches is the solution adopted in [[GeoDCAT-AP]], which models spatial coverage by using property locn:geometry [[LOCN]], and recommends encoding the geometry by using [[GML]] and/or WKT - see [[SDW-BP]], Example 15:

@prefix dcat:      <http://www.w3.org/ns/dcat#> .
@prefix dcterms:   <http://purl.org/dc/terms/> .
@prefix geosparql: <http://www.opengis.net/ont/geosparql#> .
@prefix locn:      <http://www.w3.org/ns/locn#> .
<http://www.ldproxy.net/bag/inspireadressen/> a dcat:Dataset ;
  dcterms:title "Adressen"@nl ;
  dcterms:title "Addresses"@en ;
  dcterms:description """INSPIRE Adressen afkomstig uit de basisregistratie Adressen,
                   beschikbaar voor heel Nederland"""@nl ;
  dcterms:description """INSPIRE addresses derived from the Addresses base registry,
                   available for the Netherlands"""@en ;
  dcterms:isPartOf <http://www.ldproxy.net/bag/> ;
  dcat:theme <http://inspire.ec.europa.eu/theme/ad> ;
  dcterms:spatial [
    a dcterms:Location ;
    locn:geometry
# Bounding box in WKT
      "POLYGON((3.053 47.975,7.24 47.975,7.24 53.504,3.053 53.504,3.053 47.975))"^^geosparql:wktLiteral ,
# Bounding box in GML
      """<gml:Envelope srsName="http://www.opengis.net/def/crs/OGC/1.3/CRS84">
         <gml:lowerCorner>3.053 47.975</gml:lowerCorner>
         <gml:upperCorner>7.24  53.504</gml:upperCorner>
       </gml:Envelope>"""^^geosparql:gmlLiteral ,
# Bounding box in GeoJSON
      """{ "type":"Polygon","coordinates":[[
           [3.053,47.975],[7.24,47.975],[7.24,53.504],[3.053,53.504],[3.053,47.975]
         ]] }"""^^<https://www.iana.org/assignments/media-types/application/geo+json>
  ] .
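How a geometry of this kind supports spatial search and filtering can be illustrated with a short Python sketch. It reads the bounding box from a WKT polygon like the one in the example above and tests it against a query area. The parsing handles only simple, single-ring polygons, and the function names and query boxes are assumptions made for this illustration; a real catalog would rely on a spatial library or index.

```python
import re


def bbox_from_wkt(wkt):
    """Return (min_x, min_y, max_x, max_y) of a simple single-ring WKT POLYGON."""
    ring = re.search(r"\(\((.*)\)\)", wkt).group(1)
    coords = [tuple(map(float, pair.split())) for pair in ring.split(",")]
    xs, ys = zip(*coords)
    return min(xs), min(ys), max(xs), max(ys)


def intersects(a, b):
    """True if two (min_x, min_y, max_x, max_y) boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


# The WKT bounding box used for the dataset's spatial coverage.
COVERAGE = "POLYGON((3.053 47.975,7.24 47.975,7.24 53.504,3.053 53.504,3.053 47.975))"

coverage_box = bbox_from_wkt(COVERAGE)
# Hypothetical query areas, given as longitude / latitude boxes.
print(intersects(coverage_box, (4.7, 52.3, 5.1, 52.5)))
print(intersects(coverage_box, (10.0, 40.0, 11.0, 41.0)))
```

The same test cannot be run against a coverage expressed only as a geographical name, unless the name is first resolved to a geometry through a gazetteer.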

Standard APIs for metadata profile negotiation [ID30]

Andrea Perego - European Commission, Joint Research Centre (JRC)

content_negotiation profile service

data consumer, data producer, data publisher

Cross-catalog harvesting is not a recent practice. Standard catalog services, such as [[OAI-PMH]] and [[CSW]], have been designed to support this functionality. However, in the past, this was typically done across catalogs of homogeneous resources, usually pertaining to the same domain.

This has changed in recent years, especially with the publication of cross-sector catalogs of government data. A notable example is the European Data Portal, which harvests metadata from both cross-sector and thematic catalogs across EU Member States. In this scenario, one of the issues to be addressed is the heterogeneity of the metadata standards and harvesting protocols used across catalogs.

A partial solution is provided by the development of harmonized mappings between metadata standards (see, e.g., the geospatial and statistical extensions to [[DCAT-AP]]), and by enabling catalog platforms, such as CKAN and GeoNetwork, to support multiple harvesting protocols and to map different metadata standards into their internal representation.

An alternative approach is to enable catalogs to provide metadata in different profiles, using a standard harvesting protocol. Notably, standard protocols such as [[OAI-PMH]] and [[CSW]] already support the possibility of serving records in different metadata schemas and serializations, by using specific query parameters. So, what is needed is an API-independent mechanism that can be used by clients with the existing catalog service protocols.

HTTP content negotiation may be the most viable solution, since HTTP is the protocol that Web-based catalog services make use of. However, although the HTTP protocol would allow metadata to be served in different formats, it does not support the ability to negotiate the metadata profile.

The GeoDCAT-AP API was designed to enable [[CSW]] endpoints to serve [[ISO-19115-1]] metadata based on the [[GeoDCAT-AP]] profile, by using the standard [[CSW]] interface - i.e., parameters outputSchema (for the metadata profile) and outputFormat (for the metadata format).

HTTP content negotiation is supported to determine the returned metadata format, without the need to use the parameter outputFormat. The ability to also negotiate the profile would enable a client to query a [[CSW]] endpoint without needing to know the supported harvesting protocol.
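
The difference between the two request styles can be sketched as follows. This is an illustration only: the endpoint URL and the profile URI are placeholders, the helper names are hypothetical, and further parameters that a real CSW request needs are omitted.

```python
from urllib.parse import urlencode

# Hypothetical endpoint and profile identifier, used for illustration only.
CSW_ENDPOINT = "http://example.org/csw"
GEODCAT_AP = "http://example.org/profiles/geodcat-ap"


def csw_request_url(profile: str, media_type: str) -> str:
    """Build a CSW query that names both the profile and the format."""
    params = {
        "service": "CSW",
        "request": "GetRecords",
        "outputSchema": profile,     # the metadata profile
        "outputFormat": media_type,  # the metadata format
    }
    return f"{CSW_ENDPOINT}?{urlencode(params)}"


def negotiated_headers(media_type: str) -> dict:
    """HTTP headers for the same request using content negotiation.

    Only the format can be expressed; plain HTTP offers no counterpart
    for the profile, which is the gap this use case describes.
    """
    return {"Accept": media_type}


print(csw_request_url(GEODCAT_AP, "application/rdf+xml"))
print(negotiated_headers("application/rdf+xml"))
```

The first style ties the client to the CSW query interface, whereas the second is protocol-independent but currently covers only the format.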

Besides the resulting RDF serialisation of the source [[ISO-19115-1]] records, the API returns a set of HTTP Link headers, using the following relationship types:

  • derivedfrom: The URL of the source document, containing the ISO 19139 records.
  • profile: The URI of the metadata schema used in the returned document.
  • self: The URL of the document returned by the API.
  • alternate: The URL of the current document, encoded in an alternative serialization.

It is worth noting that, in its current definition, relationship type alternate denotes just a different serialization, and so it cannot be used to list the possibly alternative metadata schemas.
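
A client could group the returned links by relationship type along the following lines. This is a simplified sketch, not a complete Link header parser: it assumes each link carries a single rel parameter, and all URLs in the sample value are placeholders.

```python
import re
from collections import defaultdict

# Matches one link-value of the simple form <target>; rel="type"
LINK_VALUE = re.compile(r'<([^>]*)>\s*;\s*rel="?([^";,]+)"?')


def links_by_rel(header_value: str) -> dict:
    """Group the target URLs of a Link header value by relationship type."""
    grouped = defaultdict(list)
    for target, rel in LINK_VALUE.findall(header_value):
        grouped[rel.strip().lower()].append(target)
    return dict(grouped)


# Hypothetical header value; all URLs are placeholders.
example = (
    '<http://example.org/csw?id=1>; rel="derivedfrom", '
    '<http://example.org/profiles/geodcat-ap>; rel="profile", '
    '<http://example.org/api?id=1&format=rdf>; rel="self", '
    '<http://example.org/api?id=1&format=ttl>; rel="alternate", '
    '<http://example.org/api?id=1&format=jsonld>; rel="alternate"'
)
links = links_by_rel(example)
print(links["profile"])
print(links["alternate"])
```

In this sample the alternate links differ only in serialization, so a client still has no way to discover other available profiles from the response.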

Modeling funding sources [ID31]

Alejandra Gonzalez-Beltran

dcat provenance

data consumer, data producer, data publisher

Many datasets (or catalogs) are produced with support from a sponsor/funder (e.g. scientific datasets that result from a study funded by a funding organisation, or datasets produced by governmental organisations), and the ability to describe and group them by funder is important across domains.

Relationships between Datasets [ID32]

Alejandra Gonzalez-Beltran

dcat publication

data consumer, data producer, data publisher

Datasets are related in many different ways, e.g. the relationships between the different versions of a dataset, 'has part' relationships between datasets, derivation, aggregation.

Examples of relationships:

  • aggregation
    - the Dryad repository defines the concept of a collection of datasets, for example for datasets related by topic; see the collection about Galapagos finches: http://datadryad.org/handle/10255/dryad.148
    - the Gene Expression Omnibus repository (GEO) has the concept of series for related data
  • derivation
    - in the Investigation/Study/Assay (ISA) model, it is possible to represent the workflow from raw data to processed data and to indicate the process that yielded the new data
  • citation
    - to represent data citation

See the list of relationTypes given in the DataCite schema: http://schema.datacite.org/meta/kernel-4.0/include/datacite-relationType-v4.xsd

(Makx Dekkers) Specific cases of relationships that I have come across:

  • a dataset that contains multi-annual budget data (e.g. for a multi-annual programme) but also contains the data for individual years -- this could be as a spreadsheet with worksheets for each year and a sheet with the sum for the whole period
  • two datasets that contain the same data but differ in aspects other than format, for example currency, measurement units, resolution, projection -- this case arises if we adopt an understanding that distributions of a dataset may only differ in format

Summarization/Characterization of datasets [ID33]

Alejandra Gonzalez-Beltran Deliverable(s): DCAT1.1, AP Guidelines

dcat

Data consumer, data producer, data publisher

Summary/descriptive statistics that characterize a dataset are important elements to have a high-level overview of the dataset. This is particularly important for datasets that are not publicly accessible, but whose access could be requested under certain conditions.

The HCLS dataset description included a number of statistics for RDF datasets: https://www.w3.org/TR/hcls-dataset/

For healthcare data, there is the Automated Characterization of Health Information at Large-scale Longitudinal Evidence System (ACHILLES): https://www.ohdsi.org/analytic-tools/achilles-for-data-characterization/

Relationships between Distributions of a Dataset [ID34]

Makx Dekkers

dcat distribution documentation packaging semantics

Data publishers

DCAT defines a Distribution as "Represents a specific available form of a dataset. Each dataset might be available in different forms, these forms might represent different formats of the dataset or different endpoints. Examples of distributions include a downloadable CSV file, an API or an RSS feed". It turns out that people read this differently. The main interpretations are that (a) the data in different Distributions of the same Dataset *only* differ in format, i.e. the data contains the same data points in different representations, and (b) the data in different Distributions might be related in other ways, for example by containing different data points for similar observations, as in the same kind of data for different years.

In the current situation, a variety of approaches can be observed. In an analysis of the data in the DataHub (see link), at least five different approaches could be observed.

Datasets and catalogues [ID35]

Makx Dekkers

dcat documentation

Data publisher

The DCAT model contains a hierarchy of the main entities: a catalogue contains datasets and a dataset has associated distributions. This model does not contemplate a situation in which datasets exist outside of a catalogue, while in practice datasets may be exposed on the Web as individual entities without the description of a catalogue. Also, it may be inferred from the current model that a dataset, if it is defined as part of a catalogue, is part of only one catalogue; no consideration is given to the practice that datasets may be aggregated – for example when the European Data Portal aggregates datasets from national data portals.

Cross-vocabulary relationships [ID36]

Makx Dekkers

dcat documentation void data_cube

Data publisher

In the context of W3C working and interest groups (e.g. SWIG, GLD, DWBP) several overlapping vocabularies have been developed for the description of datasets: DCAT, VoID and Data Cube. These vocabularies define similar concepts, but it is not entirely clear how these concepts are related. For example, all three vocabularies define a notion of ‘dataset’ – dcat:Dataset, void:Dataset and qb:DataSet. These notions are similar but not entirely equivalent. For example, it has been argued that void:Dataset and qb:DataSet are more like a dcat:Distribution than a dcat:Dataset.

Europeana profile ecosystem: representing, publishing and consuming application profiles of the Europeana Data Model (EDM) [ID37]

Valentine Charles, Antoine Isaac

content_negotiation documentation profile publication semantics

The metadata aggregated by Europeana is described using the Europeana Data Model (EDM), whose goal is to ensure interoperability between various cultural heritage data sources. EDM has been developed to be as re-usable as possible. It can be seen as an anchor to which various finer-grained models can be attached, ensuring their interoperability at a semantic level. The alignments done between EDM and other models such as CIDOC-CRM allow the definition of adequate application profiles that enable the transition from one model to another without hindering the interoperability of the data. Currently, Europeana itself maintains data in two specific flavours of EDM, each with a specific XML Schema (for RDF/XML data):

  • "EDM external": The metadata aggregated by Europeana from its data providers is being validated against the EDM external XML schema prior to being loaded into the Europeana database.
  • "EDM internal": This schema is meant for validation and enrichment inside the Europeana ingestion workflow where data is reorganised to add "proxies" to distinguish provider data from Europeana data and certain properties are added to support the portal or the API. It is not meant to be used by data providers. The metadata complying with this schema is outputted via the Europeana APIs.

Both "external" and "internal" schemas are available at https://github.com/europeana/corelib/tree/master/corelib-solr-definitions/src/main/resources/eu

Because XML can’t capture all the constraints expressed in the EDM, an additional set of rules was defined using Schematron and embedded in the XML schema. These technical choices limit the constraints that can be checked and lead to a validation approach that is less suitable for Linked Data (XML imposes a document-centric approach).

Europeana is not the only one designing and consuming different profiles of EDM in its ecosystem.

  • The Digital Public Library of America has created its Metadata Application Profile (MAP), based on EDM
  • Intermediate domain metadata aggregators have explored developing profiles of EDM that represent the specificity of their domain. One of the main motivations is that they can use these profiles to ingest, exploit and/or re-publish datasets with less data loss than if they used the 'basic' EDM ingested by Europeana. Some Europeana data providers and aggregators have started to experiment with Semantic Web technologies to represent their own application profiles of EDM:
    • Europeana Sounds
    • Digital Manuscripts to Europeana
    • Performing Arts
    • etc

Finally, some third party sources of interest (esp. authority data, thesauri, gazetteers) use models that are building blocks of EDM, like SKOS (i.e. EDM can itself be seen as an application profile / extension of SKOS). Sometimes these sources publish their data in different flavours at once (e.g. http://viaf.org), which makes data consumption both easier (a consumer can find the data elements it can consume) and more difficult (a consumer has to separate elements of interest from irrelevant ones).

Europeana has identified two types of AP:

  • A refinement of EDM is any kind of specialisation of EDM to meet specific needs of the data provider (typically with specific constraints on the existing EDM elements).
  • An extension to EDM is required when existing EDM classes and properties cannot represent the semantics of providers’ data in sufficient detail.

Currently, data providers who would like to provide their data to Europeana using their own profiles are unable to do so, even when these profiles would be 'compatible' with the Europeana one for ingestion (which typically happens in the case of a basic EDM extension that adds fields on top of the Europeana profile). This is chiefly because of XML rigidities: Europeana ingestion expects a reference to only one profile/schema. It will not recognize profiles that are compatible with it.

Time-related aspects [ID38]

Jaroslav Pullmann with contributions by Andrea Perego, Simon Cox et al.

meta time coverage quality resolution status version usage_control

Data authors, data publishers, data consumers

There is an evident demand for capturing various types of time-related information in DCAT. This meta use case provides a topic overview and summary of general requirements on temporal statements shared among detailed use cases each dealing with an individual aspect.

There are two basic layers where temporal modeling applies, the content (a) and the publication life-cycle layer (b). The former refers to the different time dimensions of the data and its elicitation process, i.e. occurrence (phenomenon), overall coverage (scope) and observation time etc. The latter considers stages of the DCAT publication process independently of any domain or content.

While the use cases differ with regard to the purpose and interpretation of the temporal expressions, some general patterns become apparent. There are references to singular or recurrent, named (last week, Middle Ages, Thanksgiving Day) or formal, numeric expressions (e.g. ISO 8601). These might be relative (today, P15M) or absolute, and represent an instant or an interval.

The description of evidence and motivation in context for these expressions is delegated to sub-use cases.

Possible use cases at content level (a)

  • Temporal coverage, done - ID27 exists, expresses the boundaries of the dataset's phenomenon times (first, last)
  • Temporal resolution of a time series (sampling/observation rate), implied by ID15
  • Profile recommendation on a standardized annotation of phenomenon time for single values (e.g. sosa:phenomenonTime) etc.

Possible use cases at life-cycle and publishing level (b)

  • Creation, modification time already covered by dct:issued, dct:modified
  • Data Retention (related to usage control): The copy of Dataset should be removed after this date
  • Expiry of data: The data is considered outdated / unsupported after this date
  • Expiry of record: The record will become obsolete after this point (and e.g. should be removed from catalog) etc.
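
The life-cycle dates in list (b) could drive simple catalog housekeeping, as in the sketch below. The field names are hypothetical, since DCAT does not currently define properties for these dates, and "after this date" is read as strictly later than the given date.

```python
from datetime import date

# Hypothetical field names paired with the action described in the list above.
RULES = [
    ("retention_until", "remove the local copy of the dataset"),
    ("data_expires", "treat the data as outdated or unsupported"),
    ("record_expires", "remove the record from the catalog"),
]


def due_actions(record: dict, today: date) -> list:
    """List the life-cycle actions whose date has passed."""
    return [
        action
        for field, action in RULES
        if field in record and today > record[field]
    ]


record = {
    "retention_until": date(2018, 6, 30),
    "data_expires": date(2019, 12, 31),
}
print(due_actions(record, date(2019, 1, 1)))
```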

DCAT Metadata profile integration [ID39]

Lieven Raes, Thomas D'haenens

dcat meta publication referencing out_of_scope

Data authors, data publishers, data consumers

In the field, we see people who describe their datasets being confronted with different regulations, profiles etc., each with its own target or goal. Slowly we are starting to cross domain boundaries (especially between geo and open, at a high level), but the process is still hard. This is partly due to the lack of guidelines and recommendations at a higher level (W3C, OGC).

E.g. within the OTN (Open TransportNet) project, harmonization work has been done on different levels (more info: https://www.slideshare.net/plan4all/white-paper-data-harmonization-interoperability-in-opentransportnet). The risk is that when everyone starts to do so, we lose interoperability along the way.

GeoDCAT-AP is a first attempt at bridging the gap between Geo and Open: https://joinup.ec.europa.eu/node/139283

Within Informatie Vlaanderen, a project is running that combines the two worlds in one catalogue with an automated mapping: https://www.w3.org/2016/11/sdsvoc/SDSVoc16_PPT_v02

See above

Discoverability by mainstream search engines [ID40]

Rob Atkinson

dcat profile content_negotiation

data publishers, search engines, data users

Major search engines use mechanisms formalised via schema.org to extract structured metadata from Web resources. It is possible, but not given, that some may directly support DCAT in future. Regardless, consideration should be given to exposing DCAT content using equivalent schema.org elements - and this may perhaps be a case for content negotiation by profile, where equivalent schema.org properties are entailed in a DCAT graph.

Schema.org defines a range of equivalent properties

Vocabulary constraints [ID41]

Karen Coyle

profile

Data producers, data consumers. In particular this facilitates sharing between different data consumers.

When considering using data produced by someone else, it is necessary to know not only what their vocabulary terms are, but how those terms are used. This means that you need to know

  • which terms are mandatory
  • what the cardinality rules are
  • what the valid values are
  • what dependencies exist between elements of the vocabulary
  • etc.

It would be ideal if the profile could be translated into a validation language (such as ShEx or SHACL). If not, it should at least be able to link to such a language.
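
To make the kinds of rules listed above concrete, the sketch below checks a record against a toy profile. It is not ShEx or SHACL: the profile structure, the term names and the validate function are hypothetical, and dependencies between elements are left out for brevity.

```python
# Hypothetical profile: per term, minimum and maximum number of values
# and, optionally, an enumerated list of allowed values.
PROFILE = {
    "title":    {"min": 1, "max": None},
    "license":  {"min": 1, "max": 1},
    "language": {"min": 0, "max": None, "allowed": {"en", "nl", "fr"}},
}


def validate(record: dict, profile: dict) -> list:
    """Return a list of human-readable violations (empty if none)."""
    problems = []
    for term, rule in profile.items():
        values = record.get(term, [])
        if len(values) < rule["min"]:
            problems.append(f"{term}: expected at least {rule['min']} value(s)")
        if rule["max"] is not None and len(values) > rule["max"]:
            problems.append(f"{term}: expected at most {rule['max']} value(s)")
        allowed = rule.get("allowed")
        if allowed is not None:
            for value in values:
                if value not in allowed:
                    problems.append(f"{term}: value {value!r} is not allowed")
    return problems


record = {"title": ["Addresses"], "language": ["en", "de"]}
for problem in validate(record, PROFILE):
    print(problem)
```

Here the record lacks the mandatory license and uses a language value outside the enumerated list, so two violations are reported.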

Metadata Guidance Rules [ID42]

Karen Coyle

profile

Data consumers

The GLAM communities (galleries, libraries, archives, museums) produce metadata based on a small set of known guidance rules. These rules determine choices made in creating the metadata such as: form of names for people, families and organizations; selection of primary titles; use of vocabularies like language lists, subject lists, genre and form lists, geographic designators. There needs to be a place in a profile to indicate which of the relevant standards was used in producing the metadata.

The primary metadata format used by libraries includes these, but that is a very narrow case.

Description of dataset compliance with standards [ID43]

Alejandra Gonzalez-Beltran

dcat profile

data consumer, data producer, data publisher

Dataset distributions may or may not comply with different types of standards, e.g. they may be represented in specific formats, follow specific content guidelines, be annotated with specific ontologies, or comply with standards for describing their use of identifiers.

Compliance with specific standards is useful information for data consumers, data producers and data publishers, and it may help to identify how to use a dataset, what tools may be needed, etc. DCAT currently supports describing the file format of a dataset distribution, but it is not possible to indicate compliance with other types of standards.

Identification of versioned datasets and subsets [ID44]

Jaroslav Pullmann, Keith Jeffery

dataset distribution status version publication referencing

Data consumers

A prerequisite of communicating, annotating or linking a dataset (or a defined part of it) is its unambiguous identification. Since a dataset and its distributions might evolve over time the identification method has to take into account their versioning. The respective distributions might significantly differ in terms of media type and further serialization properties and should therefore have distinct identifiers.

While DCAT currently does not support resource versioning, subsets (slices) and derivations of a Dataset might be specified as separate, related Dataset instances. Each one is exposed by a set of dedicated Distribution resources identified by a resolvable URI. These Distribution URIs are used to refer to a particular materialization of the (abstract) Dataset. Their design preferably follows the RESTful URI naming conventions. Referencing Distribution metadata has the benefit of providing access to related properties, e.g. usage restrictions and licensing. Conversely, when there are multiple independent copies of a Dataset's metadata across Catalogs, this method suffers from generating alternative identifiers for the same resource (i.e. the same access/download target).

Annotating data quality [ID45]

Makx Dekkers

dcat quality publication

data producer, data publisher, data consumer (of statistical data)

In many cases, data producers and data publishers may want to inform the data consumers about the quality aspects of the data so that consumers better understand the possibilities and risks of using and reusing the data.

Data producers may have human-readable, textual information or more precise machine-readable information either as part of their publication process or as external resources that they can attach to the description of the dataset.

The European StatDCAT application profile for data portals in Europe specifies the optional use of the property dqv:hasQualityAnnotation with a range of a subclass of oa:Annotation from the Open Annotation Model, which allows annotations to be either embedded text or an external resource identified by a URI.

Profile support for input functions [ID46]

Karen Coyle

profile semantics documentation

user interface developers, data input staff

Profiles can be used to drive input forms for staff creating the data. To facilitate this, as many features as possible of a good input environment need to be supported. Profiles need to have suitable rules for the validation of values, such as date forms and pick lists. There need to be human-readable definitions of terms and, if needed, instructions for input that would accompany a property and its value.

Define update method [ID47]

Karen Coyle

dcat

data consumers

In the library environment, datasets are issued as periodic aggregated (and up-to-date) files with daily or weekly changes to that file as supplements. The change files have new records that are additions to the file, changed records that must replace the record with the same identifier in the file, and deleted records that must result in the matching record being removed from the local copy of the file.
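
The update procedure described above can be sketched as follows. The change-file structure (a list of action, identifier and record tuples) and the function name are assumptions made for illustration; actual library change files use their own formats.

```python
def apply_changes(local: dict, changes: list) -> dict:
    """Apply additions, replacements and deletions to a local copy.

    `local` maps record identifiers to records; each change is an
    (action, identifier, record) tuple, where record is None for deletions.
    """
    updated = dict(local)  # leave the caller's copy untouched
    for action, identifier, record in changes:
        if action == "add":
            if identifier in updated:
                raise ValueError(f"record {identifier} already exists")
            updated[identifier] = record
        elif action == "change":
            if identifier not in updated:
                raise KeyError(f"record {identifier} not found")
            updated[identifier] = record  # replace the record with the same identifier
        elif action == "delete":
            updated.pop(identifier, None)  # remove the matching record, if present
        else:
            raise ValueError(f"unknown action: {action}")
    return updated


local_copy = {"r1": {"title": "A"}, "r2": {"title": "B"}}
change_file = [
    ("add", "r3", {"title": "C"}),
    ("change", "r1", {"title": "A, 2nd ed."}),
    ("delete", "r2", None),
]
print(apply_changes(local_copy, change_file))
```

A catalog description would need to tell the consumer which of these actions a supplement file can contain and in what order supplements must be applied.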

Profile relation to validation [ID48]

Karen Coyle

profile

data producer, data publisher, validation program(s)

Many of the functions needed for a profile are also ones that will be targeted by validation routines, such as cardinality of properties, valid values, etc. To define these redundantly in profiles and in validation routines risks the creation of contradictory rules relating to the profile.

There needs to be a way to coordinate the profile and the validation function. This could be a matter of basing the profile on a defined validation language (SHACL or ShEx or Schematron...), or of deriving the validation rules from the profile. Note, though, that the existing validation languages are quite atomistic and so far there has not been a demonstration of creating a usable profile from a validation language. In any case, the relationship between these two related languages needs to be clarified.

Dataset business context [ID49]

Peter Brenton, Simon Cox (CSIRO)

dcat documentation status version profile provenance roles semantics service usage control

data consumer, data producer, data publisher

It is helpful and often essential to know the business context in which one or more datasets are created and managed, in particular concerning the project, program, initiative through which the dataset was generated. These are typically associated with funding or policy.

The business context links associated entities participating in a project. Projects can be an umbrella or unifying entity for one or many datasets which share the same project context.

DCAT or users of DCAT have often used externally defined classes for associated concepts from FOAF and the W3C Organization ontology, but there is currently no slot or guidance on how to relate a dataset to its business context. Moreover, there is no general agreement on a class for 'Project'. The class might include spatial, temporal, social, descriptive and financial information. There are a number of discipline or domain specific Project classes (see Links below), but there does not appear to be anything available which is sufficiently expressive and generic.

As part of the DXWG there might be an opportunity to define a basic ontology for projects and related concepts. This should have a tight scope and few dependencies, similar to the approach used in W3C Organization ontology.

VIVO Project

Annotating changes that do not change the information content [ID50]

Peter Winstanley

dcat status version provenance

data producer, data publisher, validation program(s)

Many events in the life cycle of a dataset change the information content - data is added or removed as different versions of the dataset are created. Other events do not alter the information content of the dataset. An example of the latter is deduplication. Perhaps encryption and compression are similar examples. There are issues to be considered relating to the type of deduplication (e.g. file vs block), but in the main these events do not reduce the information content.

The need of this use case is to be able to record these events in the provenance data, but to have some way to indicate that although something was done to the dataset, there was actually no change in information content. In this way it is slightly different to the regular interpretation of a "version".

Dataset Versioning Information [ID4]

Describing distribution subsetting and container mechanism separately from form of individual items [ID51]

Rob Atkinson

dcat profile representation service

data user

When considering Linked Data applications viewing datasets, there are potentially multiple ways to access items. Large datasets in particular may support access to specific items, queries, subsets and optimally packaged downloads of data. Because the implications for user and agent interaction vary greatly between these modes of access, there is a need to distinguish between distributions that package the entire dataset and those that deliver specific items, queries or subsets. Furthermore, access methods delivering queries and subsets may introduce container elements, independently of the dataset itself.

So if we had a metadata catalog that supported a query service that returned a set of DCAT records (using, say, the BotDCAT-AP profile), and also an API that delivered a specific record using this same profile, we would need to be able to specify that the query service supports both the BotDCAT-AP profile and a MyDCATCatalogSearchFunction profile (that specifies how the container structure is implemented?)

  • DCAT packaged distributions [ID1]
  • Detailing and requesting additional constraints (profiles) beyond content types [ID2]
  • DCAT Distribution to describe web services [ID6]
  • Modeling service-based data access [ID18]
  • Machine actionable link for a mapping client [ID21]

Distribution query type

Requirements

This chapter lists the requirements for the Working Group deliverables.

In some requirements the expression 'recommended way' is used. This means that a single best way of doing something is sought. It does not say anything about the form this recommended way should have, or who should make the recommendation.

In some requirements the expression 'canonical property' is used. This identifies a specific requirement to provide a property with the required semantics to meet the requirement, together with guidance on the usage of this property. (Note that a 'recommended way' may also involve such canonical properties - requirements described this way reflect cases where such properties have been implemented by communities and the identified requirement is to consolidate these properties and make them generally used by the wider community.)

Many of these requirements depend on interpretation of keywords such as "profile". At this stage the definitions for these terms are defined in the Tags section. These terms should be cross-referenced as links - but may be best to choose appropriate format - i.e. a formal glossary?

Identification

Dereferenceable identifiers [RDID]

Encode identifiers as dereferenceable HTTP URIs
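
For identifier schemes that have a public resolver, such an encoding can be a simple mapping. The sketch below handles DOIs via the doi.org resolver; the function name, the prefix table and the sample DOI are illustrative assumptions, and other schemes would need resolvers of their own.

```python
# Known resolver prefixes; only DOI is covered in this sketch.
RESOLVERS = {"doi": "https://doi.org/"}


def to_http_uri(identifier: str) -> str:
    """Map a prefixed identifier to an HTTP URI where a resolver is known."""
    if identifier.startswith(("http://", "https://")):
        return identifier  # already an HTTP URI
    scheme, _, local_part = identifier.partition(":")
    base = RESOLVERS.get(scheme.lower())
    if base is None or not local_part:
        raise ValueError(f"no resolver known for {identifier!r}")
    return base + local_part


# Made-up DOI, for illustration only.
print(to_http_uri("doi:10.1234/example"))
```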

Identifier type [RIDT]

Indicate type of identifier (e.g. prism:doi, bibo:doi, ISBN etc.).

Primary and alternative identifier [RIDALT]

Provide means to distinguish the primary and alternative (legacy) identifiers.

Versioning

Version subject [RVSS]

Identify DCAT resources that are subject to versioning, i.e. Catalog, Dataset, Distribution.

Version definition [RVSDF]

Provide clear guidance on conditions, type and severity of a resource's update that might motivate the creation of a new version in scenarios such as dataset evolution, conversion, translations etc, including how this may assist change management processes for consumers (e.g. semantic versioning techniques)
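
As an illustration of the semantic versioning technique mentioned above, the following sketch derives the next version number from the severity of an update. How dataset changes map to a severity (for example, a schema change as "major", added records as "minor", a corrected value as "patch") is exactly the guidance this requirement asks for, so that mapping is an assumption here.

```python
def next_version(current: str, severity: str) -> str:
    """Return the next MAJOR.MINOR.PATCH version for a given severity."""
    major, minor, patch = (int(part) for part in current.split("."))
    if severity == "major":
        return f"{major + 1}.0.0"
    if severity == "minor":
        return f"{major}.{minor + 1}.0"
    if severity == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown severity: {severity}")


print(next_version("1.4.2", "minor"))
```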

Version identifier [RVSID]

Provide a means to identify a version (URI-segment, property etc.). Clarify relationship to identifier of the subject resource.

Version release date [RVSDT]

It must be possible to assign a date to a version. The version identifier might refer to the release date.

Version delta [RVSDA]

Provide a way to indicate the change delta or other change information from the previous version.
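
One possible form of such change information is a list of added, changed and removed records. The sketch below computes it for two versions held as dictionaries keyed by record identifier; this representation and the function name are assumptions for illustration.

```python
def version_delta(previous: dict, current: dict) -> dict:
    """Compare two versions, each mapping record identifiers to records."""
    added = sorted(current.keys() - previous.keys())
    removed = sorted(previous.keys() - current.keys())
    changed = sorted(
        key
        for key in previous.keys() & current.keys()
        if previous[key] != current[key]
    )
    return {"added": added, "changed": changed, "removed": removed}


v1 = {"r1": "A", "r2": "B", "r3": "C"}
v2 = {"r1": "A", "r2": "B (revised)", "r4": "D"}
print(version_delta(v1, v2))
```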

Qualitative information

Human readable information to evaluate/understand data and its context.

Usage notes [RUN]

Ability to provide information on how to use the data.

Should this also apply to specific distributions?

Summary statistics [RSS]

Express summary statistics and descriptive metrics to characterize a Dataset.
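
Which statistics to express is left open by this requirement. As a sketch, the following computes a few common descriptive metrics for one numeric column; the choice of metrics and the key names are assumptions.

```python
import statistics


def summarize(values: list) -> dict:
    """Compute simple descriptive statistics, treating None as missing."""
    present = [value for value in values if value is not None]
    summary = {"count": len(values), "missing": len(values) - len(present)}
    if present:  # min, max, mean and median are undefined for an empty column
        summary.update(
            min=min(present),
            max=max(present),
            mean=statistics.mean(present),
            median=statistics.median(present),
        )
    return summary


print(summarize([2, 4, None, 6, 8]))
```

Publishing such figures in the metadata would let a consumer assess a dataset before requesting access to it.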

Provenance information [RPIF]

Provide a way to link to structured information about the provenance of a dataset including:

  • the input data used to create the dataset
  • the software used to produce the dataset
  • an extensible model of different types of agent roles
  • funders

dct:creator, dct:publisher etc. are special cases which require guidance; further roles may be defined in provenance or other richer models. The requirement is to establish an extensible mechanism, and for profiles to specify canonical equivalents for the special case properties of dcat:Dataset.

Funding source [RFS]

Provide means to describe the funding (amount and source) of a Dataset (or entire Catalog).

Project context [RPCX]

Provide a means to define a "project" as a research, funding or work organization context of a dataset.

Data quality model [RDQM]

Identify common modeling patterns for different aspects of data quality based on frequently referenced data quality attributes found in existing standards and practices.

This includes potential use and revision of DQV

Aspects include:

  • the degree of a dataset's precision (i.e. measure of resolution or variability).
  • the degree of a dataset's accuracy (i.e. measure of correctness).
  • the degree a dataset conforms to a stated quality standard.
  • details of data quality conformance test results.
  • Quality-related information [RDQIF]

    Define a way to associate quality-related information with Datasets.

    Formal description

    Requirements relating to machine readable descriptions of data and distributions.

    Dataset type [RDST]

    Provide a mechanism to indicate the type of data being described and recommend vocabularies to use given the dataset type indicated.

    Providing examples of scope will provide guidance, without being unnecessarily restrictive. The key requirement is interoperability, achieved by using standardised vocabulary terms. It it unclear whether a canonical registry is required or whether communities should constrain choice via DCAT profiles.

    Dataset aspects [RDSAT]

    Provide recommendations and mechanisms for data providers to describe datasets with a formal description of aspects (e.g. instrument/sensor used, spatial feature, observable property, quantity kind).

    Finer-grained semantics will also allow dataset dimensions to be described, and distributions to be described using these semantics, for example how a dataset is composed of multiple subsets, such as a set of image bands or tiles, or parameterised filtering/subsetting services.

    This requirement applies to catalogues of DCAT records, and is thus related to the concept of profiles, which are expected to define classification dimensions (use of controlled vocabularies in mandatory properties).

    Reference system [RRS]

    Provide means to specify the reference system(s) used in a dataset.

    Spatial coverage [RSC]

    Provide means to specify spatial coverage with geometries.
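
One common way to carry a geometry in metadata is a Well-Known Text (WKT) literal. The sketch below is an assumption made for illustration, not a serialization chosen by this document: it turns a bounding box into a WKT polygon, assuming longitude/latitude axis order. The function name is hypothetical.

```python
def bbox_to_wkt(west, south, east, north):
    """Return a WKT POLYGON for a bounding box (x = longitude, y = latitude).

    The ring is closed by repeating the first corner. Boxes that cross the
    antimeridian (west > east) are not handled by this sketch.
    """
    if west > east or south > north:
        raise ValueError("expected west <= east and south <= north")
    corners = [
        (west, south),
        (east, south),
        (east, north),
        (west, north),
        (west, south),
    ]
    ring = ", ".join(f"{x} {y}" for x, y in corners)
    return f"POLYGON(({ring}))"


print(bbox_to_wkt(3.3, 50.7, 7.2, 53.6))
```

The axis order is only meaningful together with a stated reference system, which is why this requirement is closely tied to [RRS].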

    Temporal coverage [RTC]

    Allow for specification of the start and/or end date of temporal coverage.
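
The "and/or" in this requirement implies that a coverage may be open-ended on either side. A minimal sketch of such a value, with hypothetical class and method names, could look as follows.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class TemporalCoverage:
    """Temporal coverage where either bound may be left open (None)."""

    start: Optional[date] = None
    end: Optional[date] = None

    def __post_init__(self):
        if self.start is None and self.end is None:
            raise ValueError("at least one of start or end is required")
        if self.start is not None and self.end is not None and self.start > self.end:
            raise ValueError("start must not be after end")

    def covers(self, day: date) -> bool:
        """True if the day falls within the (possibly open-ended) coverage."""
        after_start = self.start is None or self.start <= day
        before_end = self.end is None or day <= self.end
        return after_start and before_end


ongoing = TemporalCoverage(start=date(2010, 1, 1))  # no end date: still being collected
print(ongoing.covers(date(2017, 6, 30)))
```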

    Distribution

    Requirements related to the Distribution, sharable Dataset materialization and underlying media.

    Distribution definition [RDIDF]

    Revise definition of Distribution. Make clearer what a Distribution is and what it is not. Provide better guidance for data publishers.

    Distribution schema [RDIS]

    Define a way to include identification of the schema the described data conforms to.

    This may include rich information via extension points, URI templates and parameters, dimensions and subsetting operations, dereferenceable identifiers of service behaviour profiles and canonical identifiers of well-known web service interfaces (e.g. OGC WFS and WMS, OPeNDAP, REST APIs).

    Such a description may be provided through the identifier of a suitable profile that defines the interoperability conditions the distribution conforms to.

    Distribution service [RDISV]

    Ability to 1) describe the type of distribution and 2) provide information about the type of service.

    Such a description may be provided through a suitable profile identifier that defines a profile of the relevant service type.

    Distribution container [RDIC]

    Provide a means to specify the container structure of a distribution for access methods that return lists, independently of the specification of the profile the list items conform to.

    Related to the distinction between dcat:accessURL and dcat:downloadURL. May be covered by service type, but specifically supports identification of lists vs. items (items have no container). Lists may or may not be wrapped in a structural element, so this also needs to be described.

    Distribution package [RDIP]

    Define a way to specify the content of packaged files in a Distribution. For example, a set of files may be organised in an archive format and then compressed, but the dct:format property only indicates the encoding type of the outer layer of packaging.
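
To make the gap concrete, the sketch below records each packaging layer separately, outermost first, together with the formats of the files inside. The class and field names are hypothetical and do not correspond to existing DCAT properties; the media types are commonly used values chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PackagedDistribution:
    """A distribution whose packaging layers are described explicitly."""

    download_url: str
    packaging: List[str]  # media types of the layers, outermost first
    content_formats: List[str] = field(default_factory=list)  # formats of the packaged files

    def outer_format(self) -> str:
        """The only layer a single format property could describe."""
        return self.packaging[0]

    def describe(self) -> str:
        layers = " > ".join(self.packaging)
        contents = ", ".join(self.content_formats) or "unspecified"
        return f"{layers} containing {contents}"


dist = PackagedDistribution(
    download_url="https://example.org/data.tar.gz",
    packaging=["application/gzip", "application/x-tar"],
    content_formats=["text/csv"],
)
print(dist.describe())
```

A client that only sees `outer_format()` learns that the file is compressed, but not that it holds an archive of CSV files.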

    Relationships

    Requirements on description of relationships among identified objects.

    Related datasets [RRDS]

    Ability to represent the different relationships between datasets, including: versions of a dataset; collections of datasets (describing their inclusion criteria and defining the 'hasPart'/'partOf' relationship); and derivation, e.g. processed data that is derived from raw data.

    The following requirement is to be rolled in here: Update method of Dataset. Indicate the update method of a Dataset description, e.g. whether each new dataset entirely supersedes previous ones (is stand-alone), or whether there is a base dataset with files that effect updates to that base.

    Project relation [RPR]

    Provide a means to indicate the relation of Datasets to a project.

    Datasets vs. Catalog relation [RDSCR]

    Clarify the relationships between Datasets and zero, one or multiple Catalogs, e.g. in scenarios of copying, harvesting and aggregation of Dataset descriptions among Catalogs.

    Federation and citation

    Requirements covering aspects of publication, exploitation and control over copies.

    Dataset access [RDSA]

    Provide a way to specify access restrictions for both a dataset and a distribution.

    Provide a way to define the meaning of the access restrictions for a dataset or distribution and to specify what is required to access a dataset and distribution.

    Dataset citation [RDSC]

    Provide a way to specify information required for data citation (e.g., dataset authors, title, publication year, publisher, persistent identifier).
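
As a rough sketch, a consumer could check that the citation fields named above are present before rendering a citation. The field names and the output layout below are assumptions made for illustration; they do not follow any particular citation style.

```python
REQUIRED_FIELDS = ("authors", "title", "year", "publisher", "identifier")


def format_citation(metadata):
    """Render a citation string, failing if a required field is missing or empty."""
    missing = [name for name in REQUIRED_FIELDS if not metadata.get(name)]
    if missing:
        raise ValueError("missing citation fields: " + ", ".join(missing))
    authors = "; ".join(metadata["authors"])
    return (
        f"{authors} ({metadata['year']}). {metadata['title']}. "
        f"{metadata['publisher']}. {metadata['identifier']}"
    )


citation = format_citation(
    {
        "authors": ["Doe, J.", "Roe, R."],
        "title": "Example Dataset",
        "year": 2017,
        "publisher": "Example Data Centre",
        "identifier": "https://example.org/id/123",
    }
)
print(citation)
```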

    Dataset publications [RDSP]

    Provide a way to link publications about a dataset to the dataset.

    Publication source [RPS]

    Provide a way to cite the original metadata with a dereferenceable identifier.

    Publication control [RPC]

    Provide means to express rights relating to reuse of DCAT metadata.

    Profile

    Profile definition [RPFDF]

    Create a sufficiently wide definition of an application profile to address declaration of interoperability profiles data may conform to, and through this mechanism provide the means for DCAT instances and collections to also declare the profiles of DCAT they conform to.

    The following use-case-specific requirements apply to the required scope of this definition. Where appropriate, these are also captured as additional requirements.

    1. Clarify any relationship between profiles and validation languages
    2. Profiles have URI identifiers that resolve to more detailed descriptions
    3. Each application profile needs to be documented, preferably by showing/reusing what is common across profiles
    4. Profiles should provide both machine and human readable views.
    5. Profiles may inherit clauses from one or more parent profiles
    6. Profiles must support a view that includes all inherited constraints clearly identifying the parent profile the constraint is defined in.
    7. Profiles may be used to describe the metadata standard a description conforms to, the standards the described resource (e.g. dataset) conforms to, and the standards each distribution conforms to.
    8. Conceptually, profiles can extend other vocabularies or profiles, or can be refinements of other vocabularies or profiles
    9. Responses can conform to multiple, modular profiles
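
Items 5 and 6 can be sketched as a merge over the profile hierarchy that records, for every constraint, the profile that defines it. The data layout, the ex: profile identifiers and the rule that a profile overrides its parents are all assumptions made for this illustration; the sketch also assumes the hierarchy contains no cycles.

```python
PROFILES = {
    "ex:core": {
        "parents": [],
        "constraints": {"dct:title": "required", "dct:license": "optional"},
    },
    "ex:geo": {
        "parents": ["ex:core"],
        "constraints": {"dct:spatial": "required"},
    },
    "ex:national": {
        "parents": ["ex:geo"],
        "constraints": {"dct:license": "required"},
    },
}


def effective_constraints(profile_id, profiles):
    """Return {property: (rule, defining_profile)} including inherited constraints.

    Parents are merged first, so a constraint restated in a more specific
    profile replaces the inherited one and is attributed to that profile.
    """
    merged = {}
    for parent in profiles[profile_id]["parents"]:
        merged.update(effective_constraints(parent, profiles))
    for prop, rule in profiles[profile_id]["constraints"].items():
        merged[prop] = (rule, profile_id)
    return merged


view = effective_constraints("ex:national", PROFILES)
for prop, (rule, origin) in sorted(view.items()):
    print(f"{prop}: {rule} (defined in {origin})")
```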

    Profile representation [RPFRP]

    Create a way to retrieve more information about a profile. This must be flexible enough to support human and machine readable resources, such as input and editing guidance, validation resources, usage notes etc.

    The following additional requirements consider the representation of a profile (document), expressing concrete constraints, compared to the definition of the concept in

    1. Machine-readable specifications of application profiles need to be easily publishable, and optimize re-use of existing specifications.
    2. Application profiles need a rich expression for the validation of metadata
    3. Profiles must have properties for at least two levels of documentation: 1) short definition 2) input and editing guidance
    4. Profiles must support declaration of vocabulary constraints
    5. A mechanism must be available to identify conformance to each inherited profile given conformance to a profile that specialises it.
    6. Profiles list valid vocabulary terms for a metadata usage environment
    7. Profile vocabulary lists may be defined as closed (no other terms are allowed) or open (other terms are allowed)
    8. Profiles should reuse vocabulary terms defined elsewhere (Dublin Core profiles; no use case)
    9. Profiles must be able to support information that can drive data creation functions, including brief and detailed documentation
    10. Profiles must support discoverability via search engines
    11. Profiles must have identifiers that can be used to link the DCAT description to the relevant profile
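
The difference between closed and open vocabulary lists (items 6 and 7) can be illustrated with a small check. The profile layout and the function name are hypothetical; a real profile would more likely be expressed in a validation language.

```python
def check_terms(used_terms, profile):
    """Return a list of problems found when checking terms against a profile.

    profile["terms"]    -- vocabulary terms listed by the profile
    profile["required"] -- subset of terms that must be present (optional key)
    profile["closed"]   -- if True, terms outside the list are not allowed
    """
    used = set(used_terms)
    allowed = set(profile["terms"])
    required = set(profile.get("required", ()))
    problems = [f"missing required term {term}" for term in sorted(required - used)]
    if profile.get("closed", False):
        problems += [
            f"term {term} is not allowed by this closed profile"
            for term in sorted(used - allowed)
        ]
    return problems


open_profile = {"terms": ["dct:title", "dct:description"], "required": ["dct:title"]}
closed_profile = dict(open_profile, closed=True)
description = ["dct:title", "ex:customNote"]

print(check_terms(description, open_profile))
print(check_terms(description, closed_profile))
```

The same description passes the open profile and fails the closed one, which is the distinction item 7 asks profiles to be able to declare.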

    Profile negotiation [RPFN]

    Create a way to negotiate the choice of profile between clients and servers.
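
One possible shape for such a negotiation, by analogy with HTTP content negotiation, is for the client to send weighted preferences and for the server to pick the best match it supports. The sketch below is an assumption for illustration only: it defines no header or protocol, and the function name and ex: profile identifiers are hypothetical.

```python
def negotiate_profile(preferences, supported, default=None):
    """Pick the supported profile that the client prefers most.

    preferences -- (profile_uri, q) pairs with q in [0, 1]; q == 0 means "not acceptable"
    supported   -- profile URIs the server can deliver
    default     -- value returned when no acceptable profile is supported
    """
    acceptable = [(uri, q) for uri, q in preferences if q > 0]
    # sorted() is stable, so equally weighted profiles keep the client's order.
    for uri, _q in sorted(acceptable, key=lambda pair: pair[1], reverse=True):
        if uri in supported:
            return uri
    return default


choice = negotiate_profile(
    [("ex:profile-a", 0.5), ("ex:profile-b", 1.0), ("ex:profile-c", 0.9)],
    supported={"ex:profile-a", "ex:profile-c"},
)
print(choice)
```

Returning a default instead of failing is a design choice; a server could equally report that no acceptable profile is available.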

    Profiles listing [RPFL]

    Create a way to list the profiles implemented by a dataset or a specific distribution.

    Some subset of profile metadata that can be included in a DCAT context should have a canonical set of properties recommended. Initial requirements are 1) short definition 2) input and editing guidance. Links should be consistent with

    Alignment

    Requirements on alignment and relation of DCAT to other vocabularies.

    Related vocabularies comparison [RVC]

    Analyse and compare similar concepts defined by vocabularies related to DCAT (e.g. VoID, Data Cube dataset).

    Related vocabularies mapping [RVM]

    Define guidelines on how to create a DCAT description of a VoID or Data Cube dataset.

    Entailment of schema.org [RES]

    Define schema.org equivalents for DCAT properties to support entailment of schema.org compliant profiles of DCAT records.
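
A property-level equivalence table would be enough to derive schema.org statements from a DCAT record mechanically. The pairs in the sketch below are illustrative assumptions rather than a normative alignment, and the function name is hypothetical.

```python
# Illustrative property equivalences; not a normative mapping.
DCAT_TO_SCHEMA = {
    "dct:title": "schema:name",
    "dct:description": "schema:description",
    "dcat:keyword": "schema:keywords",
    "dcat:distribution": "schema:distribution",
    "dcat:downloadURL": "schema:contentUrl",
}


def to_schema_org(triples):
    """Return the schema.org triples entailed by DCAT triples.

    Properties without an entry in the table are skipped rather than guessed.
    """
    return [
        (subject, DCAT_TO_SCHEMA[prop], value)
        for subject, prop, value in triples
        if prop in DCAT_TO_SCHEMA
    ]


record = [
    ("ex:dataset", "dct:title", "Example Dataset"),
    ("ex:dataset", "dcat:keyword", "example"),
    ("ex:dataset", "ex:internalCode", "42"),
]
for triple in to_schema_org(record):
    print(triple)
```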

    Meta-level and methodology

    Metadata guidance rules [RMDG]

    Ability to express "guidance rules" or "creation rules" in DCAT.

    Qualified forms [RQF]

    Define qualified forms to specify additional attributes of appropriate binary relations (e.g. temporal context).

    This requirement is still under review.

    Mapping of qualified and non-qualified forms [RQFM]

    Specify mapping of qualified to non-qualified forms (lowering mapping). The reverse requires information (qualification) that might not be present/evident.

    This requirement is still under review.
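
The asymmetry between the two directions can be shown with a small sketch: lowering simply drops the qualifiers, so two different qualified statements can collapse into the same non-qualified one, and the qualification cannot be recovered afterwards. The dictionary layout, the "during" qualifier and the ex: identifiers are assumptions made for this example.

```python
def lower(statement):
    """Map a qualified statement to its non-qualified (subject, property, object) form."""
    return (statement["subject"], statement["property"], statement["object"])


first_term = {
    "subject": "ex:dataset",
    "property": "dct:creator",
    "object": "ex:alice",
    "qualifiers": {"during": "2015/2016"},  # hypothetical temporal context
}
second_term = dict(first_term, qualifiers={"during": "2017/2018"})

lowered = {lower(first_term), lower(second_term)}
print(lowered)
```

Both qualified statements yield the single triple linking the dataset to its creator; nothing in that triple indicates which period applied.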

    Contributors

    The use cases and requirements were contributed by the following members of the Dataset Exchange Working Group. Further group members actively participated in their analysis and discussion: Annette Greiner, Dave Raggett, Dan Brickley, David Browning, Lars G. Svensson and Phil Archer.