DWBP Implementation Report

This document reports on evidence and implementations of the Data on the Web Best Practices Candidate Recommendation. In particular, it demonstrates that the DWBP are already in use and are also implementable.

Introduction

One of the main goals of the Data on the Web Best Practices (DWBP) is to facilitate interaction between publishers and consumers of data on the Web. A set of 35 Best Practices was created to cover different challenges related to data publishing and consumption, such as Metadata, Data licenses, Data provenance, Data quality, Data versioning, Data identification, Data formats, Data vocabularies, Data access and APIs, Data preservation, Feedback, Data enrichment and Data republication.

To show that the DWBP are implementable as well as broadly adopted and referenced by well-known organizations, we collected evidence in the form of datasets, data portals, documents, references and guidelines (Section 2). We used two forms to collect this evidence: the DWBP evidence form and the DWBP template form. The results are summarized in this report.

Besides the results collected from the surveys, in order to strengthen the DWBP adoption evidence, we also present our evaluation of how DWBP are currently being adopted by the major data catalog solutions, including CKAN, Socrata, DKAN, JUNAR, ArcGIS Open Data and OPENDATASOFT (Section 3). Finally, we also present some examples to illustrate that each one of the DWBP is implementable (Section 4).

Methodology

We followed the steps described below to collect evidence for the DWBP:

A standard email was sent to several organizations around the world asking for contributions to DWBP implementations.
Implementations of the DWBP were collected using the standard forms: DWBP evidence form and DWBP template form.
A review of the collected implementations was made in order to check which best practices needed more implementations.
A detailed review of the implementations, as well as of the comments received through the surveys, was made in order to prepare the implementation report.
The Implementation Report was developed.

As noted, to have a broader coverage of the DWBP adoption we considered different types of evidence:

Datasets, Data Portals and Vocabularies: this type of evidence shows that the DWBP were already considered by organizations responsible for publishing data on the Web.
Documents and References: this category includes Web sites, Web pages, blogs, published papers, APIs documentation, projects and wikis. This type of evidence shows the adoption of DWBP in more general scenarios.
Guidelines: these guidelines were proposed by governmental organizations to help data publishers make data available on the Web. Each guideline discusses and proposes practices that either make an explicit reference to a DWBP or offer advice that is fully consistent with the relevant Best Practice.

Meeting the exit criteria

As described in the DWBP charter, to move on to Proposed Recommendation, evidence will be adduced in order to demonstrate that each of the best practices has been recommended or adopted in at least two environments, such as data portals and formal policies. Evidence of implementation was gathered from existing datasets and data portals, which already implement the proposed best practices, as well as from national or sector-specific guidelines that reference the DWBP and documents available on the Web.
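The exit criterion can be read as a simple check over the collected evidence. The sketch below is only illustrative: the record layout, the function names and the sample records are assumptions made for this example and are not data taken from the surveys.

```python
from collections import defaultdict

# Hypothetical evidence records: (best practice, environment where it was observed).
SAMPLE_EVIDENCE = [
    ("BP1", "example-portal-a"),
    ("BP1", "example-guideline-b"),
    ("BP1", "example-portal-a"),   # same environment again, counted only once
    ("BP27", "example-portal-a"),
]

def environments_per_bp(evidence):
    """Group the distinct environments that provide evidence for each BP."""
    grouped = defaultdict(set)
    for bp, environment in evidence:
        grouped[bp].add(environment)
    return grouped

def meets_exit_criteria(evidence, all_bps, minimum=2):
    """Map each BP to True when it has evidence in at least `minimum` environments."""
    grouped = environments_per_bp(evidence)
    return {bp: len(grouped.get(bp, ())) >= minimum for bp in all_bps}

result = meets_exit_criteria(SAMPLE_EVIDENCE, ["BP1", "BP27", "BP28"])
```

Counting distinct environments rather than raw submissions keeps repeated evidence from a single portal from satisfying the criterion on its own.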

DWBP Evidence

The table below shows the evidence collected for each one of the DWBP.

BP	Evidence	Total

Datasets, Data portals and Vocabularies

The following table shows organizations and implementers that contributed DWBP evidence in the form of Datasets, Data Portals and Vocabularies.

ID	Organization Name	Evidence URI	Category	Domain	Data Catalog?*

* This column indicates if a data catalog solution is used to provide the data. The data catalog can be based on an existing solution like CKAN or can be a proprietary one.

Documents and References

The following table shows organizations and implementers that contributed DWBP evidence in the form of Documents and References.

ID	Organization	Evidence URI	Category

Guidelines

The following table shows organizations and implementers that contributed DWBP evidence in the form of Guidelines.

ID	Guide	Creator	Country	Year

General analysis

One of our main concerns when we started to collect evidence for each one of the DWBP was to have implementations from well-known organizations as well as from high-profile datasets and data portals worldwide, like DBpedia, Data.gov.uk, Data.gov and World Bank. Analyzing the tables presented in the previous section, we can say that we accomplished this goal. The DWBP evidence was collected from well-known organizations and projects, including the ones mentioned before as well as BBC, Twitter, Europeana, Pacific Northwest National Laboratory and OpenStreetMap. In terms of geographical coverage, we collected implementations from several countries, including Brazil, France, Ireland, New Zealand, Spain, UK, USA and Italy. It is also important to note that the evidence in the form of guidelines comes from several governmental organizations in Europe. Another important characteristic of the DWBP implementations is their broad domain coverage, i.e. they refer to different domains, like Government, Environment and Healthcare, as shown in the graphic below.

Evidence count per domain

As we can observe in the graphic below, there is broad adoption of the DWBP related to Metadata (BP1 and BP2), Data Licenses (BP4), Data Identification (BP9 and BP10), Data Formats (BP12 and BP14), Vocabularies (BP15 and BP16), Data Access (BP23, BP24, BP25 and BP26) and Feedback (BP29). On the other hand, for others, such as Preserve identifiers (BP27), Assess dataset coverage (BP28), Provide real-time access (BP20) and Provide an explanation for data that is not available (BP22), collecting evidence was more difficult, especially from datasets and data portals. This is explained by comments received during the evidence gathering process, which are also available in the DWBP evidence form. Bill Roberts from SWIRRL, for example, made the following comment about one of the Data Preservation best practices: "Too difficult to test in a meaningful way. In this system, no datasets have yet been taken offline, so the archiving process has not been developed." Similarly, he commented on the Best Practice Provide real-time access: "The system does not currently hold dataset collected in 'real time'. Generally the data is statistical in nature and goes through a slower collection and processing cycle."

Evidence count per Best Practices

DWBP and Data Catalogs

In this section we present some more evidence that shows the adoption of the DWBP. Rather than specific datasets or data portals, we use the following data catalog solutions as evidence: CKAN, Socrata, DKAN, JUNAR, ArcGIS Open Data and OPENDATASOFT. For each one of the DWBP, we show the list of data catalog solutions that implement it.

BP	Data Catalogs	Total

As can be seen, there is no evidence for some of the DWBP. This happens because these Best Practices do not concern the solution used for making the data available on the Web, i.e. the data catalog solution, as explained below.

BP10, BP16, BP22, BP28: these BPs apply to the data itself rather than the data catalog solution used to publish the data.
BP33, BP34, BP35: these BPs apply to situations of data republication, i.e. they depend on the consumer rather than the data catalog solution used to publish the data.
BP31: this BP concerns processes that can be used to enhance, refine or otherwise improve raw or previously processed data, which are not part of the basic data catalog functions.

None of the data catalog solutions implements BP27. In general, when a dataset is no longer available, only a 404 error message is returned.
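For contrast, a catalog that preserved identifiers could answer a request for a withdrawn dataset with something more informative than a 404. The following is a minimal sketch, assuming a hypothetical in-memory registry; the paths, the archive URL, the `resolve` function and the choice of the 303 and 410 status codes are illustrative assumptions, not the behaviour of any of the catalog solutions reviewed here.

```python
from http import HTTPStatus

# Hypothetical registry: identifier path -> current state of the dataset.
REGISTRY = {
    "/dataset/bus-stops": {"state": "live"},
    "/dataset/bus-stops-2014": {
        "state": "archived",
        "archive": "https://archive.example.org/bus-stops-2014",
    },
    "/dataset/old-survey": {"state": "removed"},
}

def resolve(path):
    """Return (status, headers) for a request to a dataset identifier."""
    entry = REGISTRY.get(path)
    if entry is None:
        # The identifier was never assigned.
        return HTTPStatus.NOT_FOUND, {}
    if entry["state"] == "archived":
        # The identifier is preserved and points to the archived copy.
        return HTTPStatus.SEE_OTHER, {"Location": entry["archive"]}
    if entry["state"] == "removed":
        # The identifier is preserved, but the data was deliberately withdrawn.
        return HTTPStatus.GONE, {}
    return HTTPStatus.OK, {}
```

The point of the distinction is that a consumer can tell an identifier that never existed apart from one whose data has been archived or withdrawn.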

Some Best Practices related to metadata are only partially implemented by the data catalog solutions. Note that almost all data catalog solutions are compatible with DCAT, which means that metadata covered by DCAT may be completely or partially available in both human-readable and machine-readable formats. In general, partial implementation means that only a human-readable or only a machine-readable version of the metadata is available, as detailed below.

BP3 is partially implemented by CKAN and JUNAR because they do not offer an explicit way to present human-readable structural metadata.
BP4 and BP5 are partially implemented by ArcGIS Open Data and OPENDATASOFT because they do not offer a way to represent machine-readable license metadata and machine-readable provenance metadata.
BP8 is partially implemented by Socrata because it does not offer a way to represent machine-readable version history metadata.
BP13 is partially implemented by OPENDATASOFT because it does not offer a way to represent machine-readable language metadata.
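The distinction between full and partial implementation used above can be made explicit. The sketch below is an assumption about how such a classification could be coded; the function name and its boolean inputs are hypothetical and do not reproduce the evaluation procedure actually followed.

```python
def implementation_level(human_readable: bool, machine_readable: bool) -> str:
    """Classify metadata support for one best practice in one data catalog."""
    if human_readable and machine_readable:
        return "full"
    if human_readable or machine_readable:
        return "partial"
    return "none"

# Example: license metadata shown on the dataset page but not exposed
# in a machine-readable form.
level = implementation_level(human_readable=True, machine_readable=False)
```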

As a general analysis with regard to the Data on the Web challenges, we can say that the Metadata, Data Licenses and Data Formats challenges are a main concern of the data catalog solutions. The Data Access challenge has also been recognized as an important one, except where real-time data is concerned, and there is consensus on the use of Data Access APIs. The major data catalog solutions also deal with the Data Identification challenge, but only part of the problem has been solved. The Data Vocabularies challenge has also been treated as important, since data catalog solutions reuse existing vocabularies, e.g. DCAT, when publishing metadata about the data catalogs. Other challenges, like Data Provenance, Data Versioning and Feedback, have been dealt with only superficially. In general, Data Quality, Data Preservation, Data Enrichment and Data Republication are challenges not yet explored by the major data catalog solutions.

Set of Best Practices

The following list shows the set of best practices linked to the DWBP document:

Best Practice 1: Provide metadata
Best Practice 2: Provide descriptive metadata
Best Practice 3: Provide structural metadata
Best Practice 4: Provide data license information
Best Practice 5: Provide data provenance information
Best Practice 6: Provide data quality information
Best Practice 7: Provide a version indicator
Best Practice 8: Provide version history
Best Practice 9: Use persistent URIs as identifiers of datasets
Best Practice 10: Use persistent URIs as identifiers within datasets
Best Practice 11: Assign URIs to dataset versions and series
Best Practice 12: Use machine-readable standardized data formats
Best Practice 13: Use locale-neutral data representations
Best Practice 14: Provide data in multiple formats
Best Practice 15: Reuse vocabularies, preferably standardized ones
Best Practice 16: Choose the right formalization level
Best Practice 17: Provide bulk download
Best Practice 18: Provide Subsets for Large Datasets
Best Practice 19: Use content negotiation for serving data available in multiple formats
Best Practice 20: Provide real-time access
Best Practice 21: Provide data up to date
Best Practice 22: Provide an explanation for data that is not available
Best Practice 23: Make data available through an API
Best Practice 24: Use Web Standards as the foundation of APIs
Best Practice 25: Provide complete documentation for your API
Best Practice 26: Avoid Breaking Changes to Your API
Best Practice 27: Preserve identifiers
Best Practice 28: Assess dataset coverage
Best Practice 29: Gather feedback from data consumers
Best Practice 30: Make feedback available
Best Practice 31: Enrich data by generating new data
Best Practice 32: Provide Complementary Presentations
Best Practice 33: Provide Feedback to the Original Publisher
Best Practice 34: Follow Licensing Terms
Best Practice 35: Cite the Original Publication

Acknowledgements

The editors gratefully acknowledge the contributions made to gathering evidence for the DWBP by all members of the working group, especially Annette Greiner, Antoine Isaac, Carlos Laufer, Christophe Guéret, Deirdre Lee, Eric Stephan, Makx Dekkers, Martin Alvarez-Espinar, Peter Winstanley, Phil Archer and Riccardo Albertoni.

The editors would also like to thank the following people for the evidence they contributed: Bill Roberts, Christophe Guéret, Diogo Cortiz, Fábio Rodrigues, Eduardo Rodrigues Vasconcelos, Gregor Boyd, Herbert Van de Sompel, Jefferson Rafael Silva, João Victor Pacheco Dias, José Marcio Martins Junior, Laura Manley, Markus Freudenberg, Milos Jovanovik, Rafael Sá Anselmo, Reinaldo Ferraz and Williams Alcântara.