Abstract

A large percentage of the data published on the Web is tabular data, commonly published as comma separated values (CSV) files. The CSV on the Web Working Group aim to specify technologies that provide greater interoperability for data dependent applications on the Web when working with tabular datasets comprising single or multiple files using CSV, or similar, format.

This document lists the use cases compiled by the Working Group that are considered representative of how tabular data is commonly used within data dependent applications. The use cases observe existing common practice undertaken when working with tabular data, often illustrating shortcomings or limitations of existing formats or technologies. This document also provides a set of requirements derived from these use cases that have been used to guide the specification design.

Status of This Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

This is a draft document which may be merged into another document or eventually make its way into being a standalone Working Draft.

This document was published by the CSV on the Web Working Group as a First Public Working Draft. If you wish to make comments regarding this document, please send them to public-csv-wg@w3.org ( subscribe , archives ). All comments are welcome.

Publication as a First Public Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

This document was produced by a group operating under the 5 February 2004 W3C Patent Policy . The group does not expect this document to become a W3C Recommendation. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy .

Table of Contents

1. Introduction

A large percentage of the data published on the Web is tabular data, commonly published as comma separated values (CSV) files. CSV files may be of a significant size but they can be generated and manipulated easily, and there is a significant body of software available to handle them. Indeed, popular spreadsheet applications (Microsoft Excel, iWork’s Number, or OpenOffice.org) as well as numerous other applications can produce and consume these files. However, although these tools make conversion to CSV easy, it is resisted by some publishers because CSV is a much less rich format that can't express important detail that the publishers want to express, such as annotations, the meaning of identifier codes etc.

Existing formats for tabular data are format-oriented and hard to process (e.g. Excel); un-extensible (e.g. CSV/TSV); or they assume the use of particular technologies (e.g. SQL dumps). None of these formats allow developers to pull in multiple data sets, manipulate, visualize and combine them in flexible ways. Other information relevant to these datasets, such as access rights and provenance, is not easy to find. CSV is a very useful and simple format, but to unlock the data and make it portable to environments other than the one in which it was created, there needs to be a means of encoding and associating relevant metadata.

To address these issues, the CSV on the Web Working Group seeks to provide:

In order to determine the scope of and elicit the requirements for this extended CSV format (CSV+) a set of use cases have been compiled. Each use case provides a narrative describing how a representative user works with tabular data to achieve their goal, supported, where possible, with example datasets. The use cases observe existing common practice undertaken when working with tabular data, often illustrating shortcomings or limitations of existing formats or technologies. It is anticipated that the additional metadata provided within the CSV+ format, when coupled with metadata-aware tools, will simplify how users work with tabular data. As a result, the use cases seek to identify where user effort may be reduced.

A set of requirements, used to guide the development of the CSV+ specification, have been derived from the compiled use cases.

2. Terminology Use Cases

... The use cases below describe many applications of tabular data. Whilst there are many different variations of tabular data, all the examples conform to the definition of tabular data defined in the Model for Tabular Data and Metadata on the Web [ tabular-data-model ]:

3. Use Cases Tabular data is data that is structured into rows, each of which contains information about some thing. Each row contains the same number of fields (although some of these fields may be empty), which provide values of properties of the thing described by the row. In tabular data, fields within the same column provide values for the same property of the thing described by the particular row.

In selecting the use cases we have reviewed a number of row oriented data formats that, at first glance, appear to be tabular data. However, closer inspection indicates that one or other of the characteristics of tabular data were not present. For example, the HL7 format , from the health informatics domain defines a separate schema for each row (known as a "segment" in that format) which means that HL7 messages do not have a regular number of columns for each row.

3.1 2.1 Use Case #1 - Digital preservation of government records

(Contributed by Adam Retter) Retter; supplemental information about use of XML provided by Liam Quin)

The laws of England and Wales place obligations upon departments and The National Archives for the collection, disposal and preservation of records. Government departments are obliged within the Public Records Act 1958 sections 3, 4 and 5 to select, transfer, preserve and make available those records that have been defined as public records. These obligations apply to records in all formats and media, including paper and digital records. Details concerning the selection and transfer of records can be found here .

Departments transferring records to TNA must catalogue or list the selected records according to The National Archives' defined cataloguing principles and standards. Cataloguing is the process of writing a description, or Transcriptions of Records for the records being transferred. Once each Transcription of Records is added to the Records Catalogue, records can be subsequently discovered and accessed using the supplied descriptions and titles.

TNA specifies what information should be provided within a Transcriptions of Records and how that information should be formatted. A number of formats and syntaxes are supported, including RDF. However, the predominant format used for the exchange of Transcriptions of Records is CSV as the government departments providing the Records lack either the technology or resources to provide metadata in the XML and RDF formats preferred by the TNA.

A CSV-encoded Transcriptions of Records typically describes a set of Records, often organised within a hierarchy. As a result, it is necessary to describe the interrelationships between Records within a single CSV file.

Each row within a CSV file relates to a particular Record and is allocated a unique identifier. This unique identifier behaves as a primary key for the Record within the scope of the CSV file and is used when referencing that Record from within other Record transcriptions. The unique identifier is unique within the scope of the datafile; in order for the Record to be referenced from outside this datafile, the local identifier must be mapped to a globally unique identifier such as a URI.

Requires: PrimaryKey , URIMapping and ForeignKeyReferences .

Upon receipt by TNA, each of the Transcriptions of Records is validated against the (set of) centrally published data definition(s); it is essential that received CSV metadata comply with these specifications to ensure efficient and error free ingest into the Records Catalogue.

The validation applied is dependent the type of entity described in each row. Entity type is specified in a specific column (e.g. type ).

The data definition file, or CSV Schema, used by the CSV Validation Tool effectively forms the basis of a formal contract between TNA and supplying organisations. For more information on the CSV Validation Tool and CSV Schema developed by TNA please refer to the online documentation .

The CSV Validation Tool is written in Scala version 2.10.

Requires: ExternalDataDefinitionResource WellFormedCsvCheck and CsvValidation .

Following validation, the CSV-encoded Transcriptions of Records are transformed into RDF for insertion into the triple store that underpins the Records Catalogue. The CSV is initially transformed into an interim XML format using XSLT and then processed further using a mix of XSLT, Java and Scala to create RDF/XML. The CSV files do not include all the information required to undertake the transformation, e.g. defining which RDF properties are to be used when creating triples for the data value in each cell. As a result, bespoke software has been created by TNA to supply the necessary additional information during the CSV to RDF transformation process. The availability of generic mechanisms to transform CSV to RDF would reduce the burden of effort within TNA when working with CSV files.

Requires: SyntacticTypeDefinition , SemanticTypeDefinition and CsvToRdfTransformation .

Issue 1

Inclusion In this particular case, RDF is the target format for the conversiono f the CSV-encoded Transcriptions of Records. However, the conversion of CSV Schema examples (e.g. Data Definition Resource DDR to XML (in this case used as an interim conversion step) is illustrative of a common data conversion workflow.

The transformation outlined above is typical of common practice in that it uses a freely-available XSLT transformation or XQuery parser (in this case Andrew Wlech's CSV to XML converter in XSLT 2.0 ) which is anticipated. then modified to meet the specific usage requirements.

The resulting XML document can then be used include further transformed using XSLTto create XHTML documention - perhaps including charts such histograms to present summary data.

Requires: CsvToXmlTransformation .

3.2 2.2 Use Case #2 - Publication of National Statistics

(Contributed by Jeni Tennison)

The Office for National Statistics (ONS) is the UK’s largest independent producer of official statistics and is the recognised national statistical institute for the UK. It is responsible for collecting and publishing statistics related to the economy, population and society at national, regional and local levels.

Sets of statistics are typically grouped together into datasets comprising of collections of related tabular data. Within their underlying information systems, ONS maintains a clear separation between the statistical data itself and the metadata required for interpretation. ONS classify the metadata into two categories:

These datasets are published on-line in both CSV format and as Microsoft Excel Workbooks that have been manually assembled from the underlying data.

For example, refer to dataset QS601EW Economic activity , derived from the 2011 Census, is available as a precompiled Microsoft Excel Workbook for several sets of administrative geographies, e.g. 2011 Census: QS601EW Economic activity, local authorities in England and Wales , and in CSV form via the ONS Data Explorer .

The ONS Data Explorer presents the user with a list of available datasets. A user may choose to browse through the entire list or filter that list by topic. To enable the user to determine whether or not a dataset meets their need, summary information is available for each dataset.

QS601EW Economic activity provides the following summary information:

Requires: AnnotationAndSupplementaryInfo .

Once the required dataset has been selected, the user is prompted to choose how they would like the statistical data to be aggregated. In the case of QS601EW Economic activity , the user is required to choose between the two mutually exclusive geography types: 2011 Administrative Hierarchy and 2011 Westminster Parliamentary Constituency Hierarchy. Effectively, the QS601EW Economic activity dataset is partitioned into two separate tables for publication.

Requires: GroupingOfMultipleTables .

The user is also provided with an option to sub-select only the elements of the dataset that they deem pertinent for their needs. In the case of QS601EW Economic activity the user may select data from upto 200 geographic areas within the dataset to create a data subset that meets their needs. The data subset may be viewed on-line (presented as an HTML table) or downloaded in CSV or Microsoft Excel formats.

Requires: CsvAsSubsetOfLargerDataset .

An example extract of data for England and Wales in CSV form is provided below. The data subset is provided as a compressed file containing both a csv CSV formatted data file and a complementary html file containing the reference metadata. White space has been added for clarity. File = CSV_QS601EW2011WARDH_151277.zip

"QS601EW" "Economic activity" "19/10/13" , , "Count", "Count", "Count", "Count", "Count", "Count", "Count", "Count", "Count", "Count", "Count", "Count", "Count", "Count", "Count", "Count" , , "Person", "Person", "Person", "Person", "Person", "Person", "Person", "Person", "Person", "Person", "Person", "Person", "Person", "Person", "Person", "Person" , , "Economic activity (T016A)", "Economic activity (T016A)", "Economic activity (T016A)", "Economic activity (T016A)", "Economic activity (T016A)", "Economic activity (T016A)", "Economic activity (T016A)", "Economic activity (T016A)", "Economic activity (T016A)", "Economic activity (T016A)", "Economic activity (T016A)", "Economic activity (T016A)", "Economic activity (T016A)", "Economic activity (T016A)", "Economic activity (T016A)", "Economic activity (T016A)" "Geographic ID","Geographic Area","Total: All categories: Economic activity","Total: Economically active: Total","Economically active: Employee: Part-time","Economically active: Employee: Full-time","Economically active: Self-employed with employees: Part-time","Economically active: Self-employed with employees: Full-time","Economically active: Self-employed without employees: Part-time","Economically active: Self-employed without employees: Full-time","Economically active: Unemployed","Economically active: Full-time student","Total: Economically inactive: Total","Economically inactive: Retired","Economically inactive: Student (including full-time students)","Economically inactive: Looking after home or family","Economically inactive: Long-term sick or disabled","Economically inactive: Other" "E92000001", "England", "38881374", "27183134", "5333268", "15016564", "148074", "715271", "990573", "1939714", "1702847", "1336823", "11698240", "5320691", "2255831", "1695134", "1574134", "852450"
Example 1
"QS601EW"
"Economic activity"
"19/10/13"

               ,                 ,                                   "Count",                            "Count",                                   "Count",                                   "Count",                                                       "Count",                                                       "Count",                                                          "Count",                                                          "Count",                          "Count",                                 "Count",                              "Count",                         "Count",                                                        "Count",                                              "Count",                                            "Count",                       "Count"
               ,                 ,                                  "Person",                           "Person",                                  "Person",                                  "Person",                                                      "Person",                                                      "Person",                                                         "Person",                                                         "Person",                         "Person",                                "Person",                             "Person",                        "Person",                                                       "Person",                                             "Person",                                           "Person",                      "Person"
               ,                 ,               "Economic activity (T016A)",        "Economic activity (T016A)",               "Economic activity (T016A)",               "Economic activity (T016A)",                                   "Economic activity (T016A)",                                   "Economic activity (T016A)",                                      "Economic activity (T016A)",                                      "Economic activity (T016A)",      "Economic activity (T016A)",             "Economic activity (T016A)",          "Economic activity (T016A)",     "Economic activity (T016A)",                                    "Economic activity (T016A)",                          "Economic activity (T016A)",                        "Economic activity (T016A)",   "Economic activity (T016A)"
"Geographic ID","Geographic Area","Total: All categories: Economic activity","Total: Economically active: Total","Economically active: Employee: Part-time","Economically active: Employee: Full-time","Economically active: Self-employed with employees: Part-time","Economically active: Self-employed with employees: Full-time","Economically active: Self-employed without employees: Part-time","Economically active: Self-employed without employees: Full-time","Economically active: Unemployed","Economically active: Full-time student","Total: Economically inactive: Total","Economically inactive: Retired","Economically inactive: Student (including full-time students)","Economically inactive: Looking after home or family","Economically inactive: Long-term sick or disabled","Economically inactive: Other"
    "E92000001",        "England",                                "38881374",                         "27183134",                                 "5333268",                                "15016564",                                                      "148074",                                                      "715271",                                                         "990573",                                                        "1939714",                        "1702847",                               "1336823",                           "11698240",                       "5320691",                                                      "2255831",                                            "1695134",                                          "1574134",                      "852450"

"W92000004",
"Wales",
"2245166",
"1476735",
"313022",
"799348",
"7564",
"42107",
"43250",
"101108",
"96689",
"73647",
"768431",
"361501",
"133880",
"86396",
"140760",
"45894"

Key characteristics of the CSV file are:

Requires: MultipleHeadingRows , and AnnotationAndSupplementaryInfo .

Issue 2 1

Do we want to be able to assert within the CSV+ metadata that the "data" exists at a particular region within the CSV? When talking about multiple tables within a single CSV file, AndyS stated :

" Maybe put the location of the data table within a single CSV file into the associated metadata: a package description for a single file. "

Correct interpretation of the statistics requires additional qualification or awareness of context. To achieve this the complementary html file includes supplementary information and annotations pertinent to the data published in the accompanying csv CSV file. Annotation or references may be applied to:

Requires: AnnotationAndSupplementaryInfo .

Furthermore, these statistical data sets make frequent use of predefined category codes and geographic regions. Dataset QS601EW Economic activity includes two examples:

At present there is no standardised mechanism to associate the catagory codes, provided as plain text, with their authoritative definitions.

Requires: AssociationOfCodeValuesWithExternalDefinitions .

Finally, reuse of the statistical data is also inhibited by a lack of explicit definition of the meaning of column headings.

Requires: SemanticTypeDefinition .

3.3 2.3 Use Case #3 - Creation of consolidated global land surface temperature climate databank

(Contributed by Jeremy Tandy)

Climate change and global warming have become one of the most pressing environmental concerns in society today. Crucial to predicting future change is an understanding of how the world’s historical climate, with long duration instrumental records of climate being central to that goal. Whilst there is an abundance of data recording the climate at locations the world over, the scrutiny under which climate science is put means that much of this data remains unused leading to a paucity of data in some regions with which to verify our understanding of climate change.

The International Surface Temperature Initiative seeks to create a consolidated global land surface temperatures databank as an open and freely available resource to climate scientists.

To achieve this goal, climate datasets, known as “decks”, are gathered from participating organisations and merged into a combined dataset using a scientifically peer reviewed method which assesses the data records for inclusion against a variety of criteria.

Given the need for openness and transparency in creating the databank, it is essential that the provenance of the source data is clear. Original source data, particularly for records captured prior to the mid-twentieth century, may be in hard-copy form. In order to incorporate the widest possible scope of source data, the International Surface Temperature Initiative is supported by data rescue activities to digitise hard copy records.

The data is, where possible, published in the following four stages:

The Stage 1 data is typically provided in tabular form - the most common variant is white-space delimited ASCII files. Each data deck comprises multiple files which are packaged as a compressed tar ball ( .tar.gz ). Included within the compressed tar ball package, and provided alongside, is a read-me file providing unstructured supplementary information. Summary information is often embedded at the top of each file.

For example, see the Ugandan Stage 1 data deck ( local copy ) and associated readme file . ( local copy ).

The Ugandan Stage 1 data deck appears to be comprised of two discrete datasets, each partitioned into a sub-directory within the tar ball: uganda-raw and uganda-bestguess . Each sub-directory includes a Microsoft Word document providing supplementary information about the provenance of the dataset; of particular note is that uganda-raw is collated from 9 source datasets whilst uganda-bestguess provides what is considered by the data publisher to be the best set of values with duplicate values discarded.

Requires: AnnotationAndSupplementaryInfo .

Dataset uganda-raw is split into 96 discrete files, each providing maximum, minimum or mean monthly air temperature for one of the 32 weather observation stations (sites) included in the data set. Similarly, dataset uganda-bestguess is partitioned into discrete files; this case just 3 files each of which provide maximum, minimum or mean monthly air temperature data for all sites. The mapping from data file to data sub-set is described in the Microsoft Word document.

Requires: CsvAsSubsetOfLargerDataset .

A snippet of the data indicating maximum monthly temperature for Entebbe, Uganda, from uganda-raw is provided below. File = 637050_ENTEBBE_tmx.txt

637050 ENTEBBE 5 ENTEBBE BEA 0.05 32.45 3761F ENTEBBE GHCNv3G 0.05 32.45 1155M ENTEBBE ColArchive 0.05 32.45 1155M ENTEBBE GSOD 0.05 32.45 1155M ENTEBBE NCARds512 0.05 32.755 1155M Tmax {snip} 1935.04 27.83 27.80 27.80 -999.00 -999.00 1935.12 25.72 25.70 25.70 -999.00 -999.00 1935.21 26.44 26.40 26.40 -999.00 -999.00 1935.29 25.72 25.70 25.70 -999.00 -999.00 1935.37 24.61 24.60 24.60 -999.00 -999.00 1935.46 24.33 24.30 24.30 -999.00 -999.00 1935.54 24.89 24.90 24.90 -999.00 -999.00
Example 2
637050  ENTEBBE
5
ENTEBBE BEA     0.05    32.45   3761F
ENTEBBE GHCNv3G 0.05    32.45   1155M
ENTEBBE ColArchive      0.05    32.45   1155M
ENTEBBE GSOD    0.05    32.45   1155M
ENTEBBE NCARds512       0.05    32.755  1155M

Tmax
{snip}
1935.04	27.83	27.80	27.80	-999.00	-999.00
1935.12	25.72	25.70	25.70	-999.00	-999.00
1935.21	26.44	26.40	26.40	-999.00	-999.00
1935.29	25.72	25.70	25.70	-999.00	-999.00
1935.37	24.61	24.60	24.60	-999.00	-999.00
1935.46	24.33	24.30	24.30	-999.00	-999.00
1935.54	24.89	24.90	24.90	-999.00	-999.00

{snip}

The key characteristics are:

A snippet of the data indicating maximum monthly temperature for all stations in Uganda from uganda-bestguess is provided below (truncated to 9 columns). File = ug_tmx_jrc_bg_v1.0.txt

ARUA BOMBO BUKALASA BUTIABA DWOLI ENTEBBE AIR FT PORTAL GONDOKORO […] {snip} 1935.04 -99.00 -99.00 -99.00 -99.00 -99.00 27.83 -99.00 -99.00 […] 1935.12 -99.00 -99.00 -99.00 -99.00 -99.00 25.72 -99.00 -99.00 […] 1935.21 -99.00 -99.00 -99.00 -99.00 -99.00 26.44 -99.00 -99.00 […] 1935.29 -99.00 -99.00 -99.00 -99.00 -99.00 25.72 -99.00 -99.00 […] 1935.37 -99.00 -99.00 -99.00 -99.00 -99.00 24.61 -99.00 -99.00 […] 1935.46 -99.00 -99.00 -99.00 -99.00 -99.00 24.33 -99.00 -99.00 […] 1935.54 -99.00 -99.00 -99.00 -99.00 -99.00 24.89 -99.00 -99.00 […]
Example 3
ARUA	BOMBO	BUKALASA	BUTIABA	DWOLI	ENTEBBE AIR	FT PORTAL	GONDOKORO	[…]
{snip}
1935.04	-99.00	-99.00	-99.00	-99.00	-99.00	27.83	-99.00	-99.00	[…]
1935.12	-99.00	-99.00	-99.00	-99.00	-99.00	25.72	-99.00	-99.00	[…]
1935.21	-99.00	-99.00	-99.00	-99.00	-99.00	26.44	-99.00	-99.00	[…]
1935.29	-99.00	-99.00	-99.00	-99.00	-99.00	25.72	-99.00	-99.00	[…]
1935.37	-99.00	-99.00	-99.00	-99.00	-99.00	24.61	-99.00	-99.00	[…]
1935.46	-99.00	-99.00	-99.00	-99.00	-99.00	24.33	-99.00	-99.00	[…]
1935.54	-99.00	-99.00	-99.00	-99.00	-99.00	24.89	-99.00	-99.00	[…]

{snip}

Many of the characteristics concerning the “raw” file are exhibited here too. Additionally, we see that:

At present, the global surface temperature databank comprises 25 Stage 1 data decks for monthly temperature observations. These are provided by numerous organisations in heterogeneous forms. In order to merge these data decks into a single combined dataset, each data deck has to be converted into a standard form. Columns consist of: station name , latitude , longitude , altitude , date , maximum monthly temperature , minimum monthly temperature , mean monthly temperature plus additional provenance information.

An example Stage 2 data file is given for Entebbe, Uganda, below. File = uganda_000000000005_monthly_stage2

{snip} ENTEBBE 0.0500 32.4500 1146.35 193501XX 2783 1711 2247 301/109/101/104/999/999/999/000/000/000/102 ENTEBBE 0.0500 32.4500 1146.35 193502XX 2572 1772 2172 301/109/101/104/999/999/999/000/000/000/102 ENTEBBE 0.0500 32.4500 1146.35 193503XX 2644 1889 2267 301/109/101/104/999/999/999/000/000/000/102 ENTEBBE 0.0500 32.4500 1146.35 193504XX 2572 1817 2194 301/109/101/104/999/999/999/000/000/000/102 ENTEBBE 0.0500 32.4500 1146.35 193505XX 2461 1722 2092 301/109/101/104/999/999/999/000/000/000/102 ENTEBBE 0.0500 32.4500 1146.35 193506XX 2433 1706 2069 301/109/101/104/999/999/999/000/000/000/102 ENTEBBE 0.0500 32.4500 1146.35 193507XX 2489 1628 2058 301/109/101/104/999/999/999/000/000/000/102
Example 4
{snip}
ENTEBBE                            0.0500    32.4500  1146.35 193501XX  2783  1711  2247 301/109/101/104/999/999/999/000/000/000/102
ENTEBBE                            0.0500    32.4500  1146.35 193502XX  2572  1772  2172 301/109/101/104/999/999/999/000/000/000/102
ENTEBBE                            0.0500    32.4500  1146.35 193503XX  2644  1889  2267 301/109/101/104/999/999/999/000/000/000/102
ENTEBBE                            0.0500    32.4500  1146.35 193504XX  2572  1817  2194 301/109/101/104/999/999/999/000/000/000/102
ENTEBBE                            0.0500    32.4500  1146.35 193505XX  2461  1722  2092 301/109/101/104/999/999/999/000/000/000/102
ENTEBBE                            0.0500    32.4500  1146.35 193506XX  2433  1706  2069 301/109/101/104/999/999/999/000/000/000/102
ENTEBBE                            0.0500    32.4500  1146.35 193507XX  2489  1628  2058 301/109/101/104/999/999/999/000/000/000/102

{snip}

Because of the heterogeneity of the Stage 1 data decks, bespoke data processing programs were required for each data deck consuming valuable effort and resource in simple data pre-processing. If the semantics, structure and other supplementary metadata pertinent to the Stage 1 data decks had been machine readable, then this data homogenisation stage could have been avoided altogether. Data provenance is crucial to this initiative, therefore it would be beneficial to be able to associate the supplementary metadata without needing to edit the original data files.

Requires: LinksToExternallyManagedDefinitions R-AssociationOfCodeValuesWithExternalDefinitions , SyntacticTypeDefinition , SemanticTypeDefinition , MissingValueDefinition , NonStandardFieldDelimiter NonStandardCellDelimiter and ZeroEditAdditionOfSupplementaryMetadata .

The data pre-processing tools created to parse each Stage 1 data deck into the standard Stage 2 format and the merge process to create the consolidated Stage 3 data set were written using the software most familiar to the participating scientists: Fortran 95 . The merge software source code is available online . It is worth noting that this sector of the scientific community also commonly uses IDL and is gradually adopting Python as the default software language choice.

The resulting merged dataset is published in several formats – including tabular text. The GHCN-format merged dataset comprises of several files: merged data and withheld data (e.g. those data that did not meet the merge criteria) each with an associated “inventory” file.

A snippet of the inventory for merged data is provided below; each row describing one of the 31,427 sites in the dataset. File = merged.monthly.stage3.v1.0.0-beta4.inv

{snip} REC41011874 0.0500 32.4500 1155.0 ENTEBBE_AIRPO
Example 5
{snip}
REC41011874   0.0500  32.4500 1155.0 ENTEBBE_AIRPO

{snip}

The columns are: station identifier , latitude , longitude , altitude (m) and station name . The data is fixed format rather than delimited.

Similarly, a snippet of the merged data itself is provided. Given that the original .dat file is a largely unmanageable 422.6 MB in size, a subset is provided. File = merged.monthly.stage3.v1.0.0-beta4.snip

{snip} REC410118741935TAVG 2245 2170 2265 2195 2090 2070 2059 2080 2145 2190 2225 2165 REC410118741935TMAX 2780 2570 2640 2570 2460 2430 2490 2520 2620 2630 2660 2590 REC410118741935TMIN 1710 1770 1890 1820 1720 1710 1629 1640 1670 1750 1790 1740
Example 6
{snip}
REC410118741935TAVG 2245    2170    2265    2195    2090    2070    2059    2080    2145    2190    2225    2165   
REC410118741935TMAX 2780    2570    2640    2570    2460    2430    2490    2520    2620    2630    2660    2590   
REC410118741935TMIN 1710    1770    1890    1820    1720    1710    1629    1640    1670    1750    1790    1740   

{snip}

The columns are: station identifier , year , quantity kind and the quantity values for months January to December in that year. Again, the data is fixed format rather than delimited.

Here we see the station identifier REC41011874 being used as a foreign key to refer to the observing station details; in this case Entebbe Airport. Once again, there is no metadata provided within the file to describe how to interpret each of the data values.

Requires: ForeignKeyReferences .

The resulting merged dataset provides time series of how the observed climate has changed over a long duration at approximately 32000 locations around the globe. Such instrumental climate records provide a basis for climate research. However, it is well known that these climate records are usually affected by inhomogeneities (artifical shifts) due to changes in the measurement conditions (e.g. relocation, modification or recalibration of the instrument etc.). As these artificial shifts often have the same magnitude as the climate signal, such as long-term variations, trends or cycles, a direct analysis of the raw time-series data can lead to wrong conclusions about climate change.

Statistical homogenisation procedures are used to detect and correct these artificial shifts. Once detected, the raw time-series data is annotated to indicate the presence of artifical shifts in the data, details of the homogenisation procedure undertaken and, where possible, the reasons for those shifts.

Requires: AnnotationAndSupplementaryInfo .

Future iterations of the global land surface temperatures databank are aniticipated to include quality controlled (Stage 4) and homogenised (Stage 5) datasets derived from the merged dataset (Stage 3) outlined above.

3.4 2.4 Use Case #4 - Publication of public sector roles and salaries

(Contributed by Jeni Tennison)

In line with the G8 open data charter Principle 4: Releasing data for improved governance ,the UK Government publishes information about public sector roles and salaries.

The collection of this information is managed by the Cabinet Office and subsequently published via the UK Government data portal at DATA.GOV.UK data.gov.uk .

In order to ensure a consistent return from submitting departments and agencies, the Cabinet Office mandated that each response conform to a data definition schema. schema, which is described within a narrative PDF document . Each submission comprises a pair of CSV files - one for senior roles and another for junior roles.

Requires: GroupingOfMultipleTables , WellFormedCsvCheck and CsvValidation Issue 3 Is there a reference to the Cabinet Office's data definition schema for "Public Roles and Salaries"? .

The submission for senior roles from the Higher Education Funding Council for England (HEFCE) is provided below to illustrate. White space has been added for clarity. File = HEFCE_organogram_senior_data_31032011.csv

Post Unique Reference, Name,Grade, Job Title, Job/Team Function, Parent Department, Organisation, Unit, Contact Phone, Contact E-mail,Reports to Senior Post,Salary Cost of Reports (£),FTE,Actual Pay Floor (£),Actual Pay Ceiling (£),,Profession,Notes,Valid? 90115, Steve Egan,SCS1A,Deputy Chief Executive, Finance and Corporate Resources,Department for Business Innovation and Skills,Higher Education Funding Council for England, Finance and Corporate Resources, 0117 931 7408, s.egan@hefce.ac.uk, 90334, 5883433, 1, 120000, 124999,, Finance, , 1 90250, David Sweeney,SCS1A, Director,"Research, Innovation and Skills",Department for Business Innovation and Skills,Higher Education Funding Council for England,"Research, Innovation and Skills", 0117 931 7304, d.sweeeney@hefce.ac.uk, 90334, 1207171, 1, 110000, 114999,, Policy, , 1 90284, Heather Fry,SCS1A, Director, Education and Participation,Department for Business Innovation and Skills,Higher Education Funding Council for England, Education and Participation, 0117 931 7280, h.fry@hefce.ac.uk, 90334, 1645195, 1, 100000, 104999,, Policy, , 1
Example 7
Post Unique Reference,              Name,Grade,             Job Title,                Job/Team Function,                            Parent Department,                                Organisation,                             Unit,     Contact Phone,         Contact E-mail,Reports to Senior Post,Salary Cost of Reports (£),FTE,Actual Pay Floor (£),Actual Pay Ceiling (£),,Profession,Notes,Valid?
                90115,        Steve Egan,SCS1A,Deputy Chief Executive,  Finance and Corporate Resources,Department for Business Innovation and Skills,Higher Education Funding Council for England,  Finance and Corporate Resources,     0117 931 7408,     s.egan@hefce.ac.uk,                 90334,                   5883433,  1,              120000,                124999,,   Finance,     ,     1
                90250,     David Sweeney,SCS1A,              Director,"Research, Innovation and Skills",Department for Business Innovation and Skills,Higher Education Funding Council for England,"Research, Innovation and Skills",     0117 931 7304, d.sweeeney@hefce.ac.uk,                 90334,                   1207171,  1,              110000,                114999,,    Policy,     ,     1
                90284,       Heather Fry,SCS1A,              Director,      Education and Participation,Department for Business Innovation and Skills,Higher Education Funding Council for England,      Education and Participation,     0117 931 7280,      h.fry@hefce.ac.uk,                 90334,                   1645195,  1,              100000,                104999,,    Policy,     ,     1

90334,Sir
Alan
Langlands,
SCS4,
Chief
Executive,
Chief
Executive,Department
for
Business
Innovation
and
Skills,Higher
Education
Funding
Council
for
England,
HEFCE,0117
931
7300/7341,a.langlands@hefce.ac.uk,
xx,
0,
1,
230000,
234999,,
Policy,
,
1

Similarly, a snippet of the junior role submission from HEFCE is provided. Again, white space has been added for clarity. File = HEFCE_organogram_junior_data_31032011.csv

. Parent Department, Organisation, Unit,Reporting Senior Post,Grade,Payscale Minimum (£),Payscale Maximum (£),Generic Job Title,Number of Posts in FTE, Profession Department for Business Innovation and Skills,Higher Education Funding Council for England, Education and Participation, 90284, 4, 17426, 20002, Administrator, 2,Operational Delivery Department for Business Innovation and Skills,Higher Education Funding Council for England, Education and Participation, 90284, 5, 19546, 22478, Administrator, 1,Operational Delivery Department for Business Innovation and Skills,Higher Education Funding Council for England,Finance and Corporate Resources, 90115, 4, 17426, 20002, Administrator, 8.67,Operational Delivery Department for Business Innovation and Skills,Higher Education Funding Council for England,Finance and Corporate Resources, 90115, 5, 19546, 22478, Administrator, 0.5,Operational Delivery
Example 8
.                           Parent Department,                                Organisation,                           Unit,Reporting Senior Post,Grade,Payscale Minimum (£),Payscale Maximum (£),Generic Job Title,Number of Posts in FTE,          Profession
Department for Business Innovation and Skills,Higher Education Funding Council for England,    Education and Participation,                90284,    4,               17426,               20002,    Administrator,                     2,Operational Delivery
Department for Business Innovation and Skills,Higher Education Funding Council for England,    Education and Participation,                90284,    5,               19546,               22478,    Administrator,                     1,Operational Delivery
Department for Business Innovation and Skills,Higher Education Funding Council for England,Finance and Corporate Resources,                90115,    4,               17426,               20002,    Administrator,                  8.67,Operational Delivery
Department for Business Innovation and Skills,Higher Education Funding Council for England,Finance and Corporate Resources,                90115,    5,               19546,               22478,    Administrator,                   0.5,Operational Delivery

{snip}

Key characteristics of the CSV files are:

Within the senior role CSV the field cell Post Unique Reference provides a primary key within the data file for each row. In addition, it provides a unique identifier for the entity described within a given row. In order for the entity to be referenced from outside this datafile, the local identifier must be mapped to a globally unique identifier such as a URI.

Requires: PrimaryKey and URIMapping .

This primary key unique identifier is referenced both from within the senior post dataset, Reports to Senior Post , and within the junior post dataset, Reporting Senior Post in order to determine the relationships within the organisational structure.

Requires: ForeignKeyReferences .

For the most senior role in a given organisation, the Reports to Senior Post field cell is expressed as xx denoting that this post does not report to anyone within the organisation.

Requires: MissingValueDefinition .

The public sector roles and salaries information is published at DATA.GOV.UK data.gov.uk using an interactive "Organogram Viewer" widget implemented using javascript. The HEFCE data can be accessed visualized here . For convenience, a screenshot is provided in Fig. 1 Screenshot of Organogram Viewer web application showing HEFCE data .

data.gov.uk-roles-and-salaries-browser.png
Fig. 1 Screenshot of Organogram Viewer web application showing HEFCE data

In order to create this visualisation, visualization, each pair of tabular datasets were transformed into RDF and uploaded into a triple store exposing a SPARQL end-point which the interactive widget then queries to acquire the necessary data. An example of the derived RDF is provided in file HEFCE_organogram_31032011.rdf .

The transformation from CSV to RDF required bespoke software, supplementing the content in the CSV files with additional information such as the RDF properties for each column. The need to create and maintain bespoke software incurs costs that may be avoided through use of a generic CSV-to-RDF transformation mechanism.

Requires: CsvToRdfTransformation .

3.5 2.5 Use Case #5 - Publication of property transaction data

(Contributed by Andy Seaborne)

The Land Registry is the government department with responsibility to register the ownership of land and property within England and Wales. Once land or property is entered to the Land Register any ownership changes, mortgages or leases affecting that land or property are recorded.

Their Price paid data , dating from 1995 and consisting of more than 18.5 million records, tracks the residential property sales in England and Wales that are lodged for registration. This dataset is one of the most reliable sources of house price information in England and Wales.

Residential property transaction details are extracted from a data warehouse system and collated into a tabular dataset for each month. The current monthly dataset is available online in both .txt and .csv formats. Snippets of data for January 2014 are provided below. White space has been added for clarity.

pp-monthly-update.txt ( local copy )

{C6428808-DC2A-4CE7-8576-0000303EF81B},137000,2013-12-13 00:00, "B67 5HE","T","N","F","130","", "WIGORN ROAD", "", "SMETHWICK", "SANDWELL", "WEST MIDLANDS","A" {16748E59-A596-48A0-B034-00007533B0C1}, 99950,2014-01-03 00:00, "PE3 8QR","T","N","F", "11","", "RISBY","BRETTON","PETERBOROUGH","CITY OF PETERBOROUGH","CITY OF PETERBOROUGH","A" {F10C5B50-92DD-4A69-B7F1-0000C3899733},355000,2013-12-19 00:00,"BH24 1SW","D","N","F", "55","","NORTH POULNER ROAD", "", "RINGWOOD", "NEW FOREST", "HAMPSHIRE","A"
Example 9
{C6428808-DC2A-4CE7-8576-0000303EF81B},137000,2013-12-13 00:00, "B67 5HE","T","N","F","130","",       "WIGORN ROAD",       "",   "SMETHWICK",            "SANDWELL",       "WEST MIDLANDS","A"
{16748E59-A596-48A0-B034-00007533B0C1}, 99950,2014-01-03 00:00, "PE3 8QR","T","N","F", "11","",             "RISBY","BRETTON","PETERBOROUGH","CITY OF PETERBOROUGH","CITY OF PETERBOROUGH","A"
{F10C5B50-92DD-4A69-B7F1-0000C3899733},355000,2013-12-19 00:00,"BH24 1SW","D","N","F", "55","","NORTH POULNER ROAD",       "",    "RINGWOOD",          "NEW FOREST",           "HAMPSHIRE","A"

{snip}

pp-monthly-update-new-version.csv ( local copy )

"{C6428808-DC2A-4CE7-8576-0000303EF81B}","137000","2013-12-13 00:00", "B67 5HE","T","N","F","130","", "WIGORN ROAD", "", "SMETHWICK", "SANDWELL", "WEST MIDLANDS","A" "{16748E59-A596-48A0-B034-00007533B0C1}", "99950","2014-01-03 00:00", "PE3 8QR","T","N","F", "11","", "RISBY","BRETTON","PETERBOROUGH","CITY OF PETERBOROUGH","CITY OF PETERBOROUGH","A" "{F10C5B50-92DD-4A69-B7F1-0000C3899733}","355000","2013-12-19 00:00","BH24 1SW","D","N","F", "55","","NORTH POULNER ROAD", "", "RINGWOOD", "NEW FOREST", "HAMPSHIRE","A"
Example 10
"{C6428808-DC2A-4CE7-8576-0000303EF81B}","137000","2013-12-13 00:00", "B67 5HE","T","N","F","130","",       "WIGORN ROAD",       "",   "SMETHWICK",            "SANDWELL",       "WEST MIDLANDS","A"
"{16748E59-A596-48A0-B034-00007533B0C1}", "99950","2014-01-03 00:00", "PE3 8QR","T","N","F", "11","",             "RISBY","BRETTON","PETERBOROUGH","CITY OF PETERBOROUGH","CITY OF PETERBOROUGH","A"
"{F10C5B50-92DD-4A69-B7F1-0000C3899733}","355000","2013-12-19 00:00","BH24 1SW","D","N","F", "55","","NORTH POULNER ROAD",       "",    "RINGWOOD",          "NEW FOREST",           "HAMPSHIRE","A"

{snip}

There seems to be little difference between the two formats with the exception that all fields cells within the .csv file are escaped with a pair of double quotes ( "" ).

The header row is absent. Information regarding the meaning of each column and the abbreviations used within the dataset are provided in a complementary FAQ document . The column headings are provided below along with some supplemental detail:

  1. Transaction unique identifier
  2. Price - sale price stated on the Transfer deed
  3. Date of Transfer - date when the sale was completed, as stated on the Transfer deed
  4. Postcode
  5. Property Type - D (detatched), S (semi-detatched), T (terraced), F (flats/maisonettes)
  6. Old/New - Y (newly built property) and N (established residential building)
  7. Duration - relates to tenure; F (freehold) and L (leasehold)
  8. PAON - Primary Addressable Object Name
  9. SAON - Secondary Addressable Object Name
  10. Street
  11. Locality
  12. Town/City
  13. Local Authority
  14. County
  15. Record status - indicates status of the transaction; A (addition of a new transaction), C (correction of an existing transaction) and D (deleted transaction)

Requires: AnnotationAndSupplementaryInfo .

Each row, or record, within the tabular dataset describes a property transaction. The Transaction unique identifier column provides a unique identifier for that property transaction. Given that transactions may be amended, this identifier cannot be treated as a primary key for rows within the dataset as the identifier may occur more than once. the primary key for each record. In order for the property transaction to be referenced from outside this dataset, the local identifier must be mapped to a globally unique identifier such as a URI.

Requires: PrimaryKey URIMapping .

Each transaction record makes use of predefined category codes as outlined above; e.g. Duration may be F (freehold) or L (leasehold). Furthermore, geographic descriptors are commonly used. Whilst there is no attempt to link these descriptors to specific geographic identifiers, such a linkage is likely to provide additional utility when aggregating transaction data by location or region for further analysis. At present there is no standardised mechanism to associate the catagory codes, provided as plain text, or geographic identifiers with their authoritative definitions.

Requires: AssociationOfCodeValuesWithExternalDefinitions .

The collated monthly transaction dataset is used as the basis for updating the Land Registry's information systems; in this case the data is persisted as RDF triples within a triple store. A SPARQL end-point and accompanying data definitions are provided by the Land Registry allowing users to query the content of the triple store.

In order to update the triple store, the monthly transaction dataset is converted into RDF. The value of the Record status field cell for a given row informs the update process: add, update or delete. Bespoke software has been created by the Land Registry to transformation from CSV to RDF. The transformation requires supplementary information not present in the CSV, such as the RDF properties for each column specified in the data definitions . The need to create and maintain bespoke software incurs costs that may be avoided through use of a generic CSV-to-RDF transformation mechanism.

Requires: CsvToRdfTransformation .

Note

The monthly transaction dataset contains in the order of 100,000 records; any transformation will need to scale accordingly.

In parallel to providing access via the SPARQL end-point , the Land Registry also provides aggregated sets of transaction data. Data is available as a single file containing all transactions since 1995, or partitioned by year. Given that the complete dataset is approaching 3GB in size, the annual partitions provide a far more manageable method to download the property transaction data. However, each annual partition is only a subset of the complete dataset. It is important to be able to both make assertions about the complete dataset (e.g. publication date, license etc.) and to be able to understand how an annual partition relates to the complete dataset and other partitions.

Requires: CsvAsSubsetOfLargerDataset .

3.6 2.6 Use Case #6 - Journal Article Solr Search Results

(Contributed by Alf Eaton)

When performing literature searches researchers need to retain a persisted collection of journal articles of interest in a local database compiled from on-line publication websites. In this use case a researcher wants to retain a local personal journal article publication database based on the search results from Public Library of Science . PLOS One is a nonprofit open access scientific publishing project aimed at creating a library of open access journals and other scientific literature under an open content license.

In general this use case also illustrates the utility of CSV as a convenient exchange format for pushing tabular data between software components:

The PLOS website features a Solr index search engine (Live Search) which can return query results in JSON Live Search (JSON Screen dump) XML Live Search (XML Screen dump) , JSON or in a more concise format CSV format. The output from the CSV Live Search (CSV Screen dump) : is illustrated below:

id,publication_date,title_display,author 10.1371/journal.pone.0087584,2014-02-14T00:00:00Z,Healthcare Worker Perceived Barriers to Early Initiation of Antiretroviral and Tuberculosis Therapy among Tanzanian Inpatients,"Bahati M K Wajanga,Robert N Peck,Samuel Kalluvya,Daniel W Fitzgerald,Luke R Smart,Jennifer A Downs" 10.1371/journal.pone.0087286,2014-02-14T00:00:00Z,"Subditine, a New Monoterpenoid Indole Alkaloid from Bark of <i>Nauclea subdita</i> (Korth.) Steud. Induces Apoptosis in Human Prostate Cancer Cells","Sook Yee Liew,Chung Yeng Looi,Mohammadjavad Paydar,Foo Kit Cheah,Kok Hoong Leong,Won Fen Wong,Mohd Rais Mustafa,Marc Litaudon,Khalijah Awang" 10.1371/journal.pone.0087630,2014-02-14T00:00:00Z,Spatiotemporal Characterizations of Dengue Virus in Mainland China: Insights into the Whole Genome from 1978 to 2011,"Hao Zhang,Yanru Zhang,Rifat Hamoudi,Guiyun Yan,Xiaoguang Chen,Yuanping Zhou" 10.1371/journal.pone.0086587,2014-02-14T00:00:00Z,<i>optGpSampler</i>: An Improved Tool for Uniformly Sampling the Solution-Space of Genome-Scale Metabolic Networks,"Wout Megchelenbrink,Martijn Huynen,Elena Marchiori" 10.1371/journal.pone.0082694,2014-02-14T00:00:00Z,Prophylactic Antibiotics to Prevent Cellulitis
Example 11
id,doi,publication_date,title_display,author
10.1371/journal.pone.0095131,10.1371/journal.pone.0095131,2014-06-05T00:00:00Z,"Genotyping of French <i>Bacillus anthracis</i> Strains Based on 31-Loci Multi Locus VNTR Analysis: Epidemiology, Marker Evaluation, and Update of the Internet Genotype Database","Simon Thierry,Christophe Tourterel,Philippe Le Flèche,Sylviane Derzelle,Neira Dekhil,Christiane Mendy,Cécile Colaneri,Gilles Vergnaud,Nora Madani"
10.1371/journal.pone.0095156,10.1371/journal.pone.0095156,2014-06-05T00:00:00Z,Pathways Mediating the Interaction between Endothelial Progenitor Cells (EPCs) and Platelets,"Oshrat Raz,Dorit L Lev,Alexander Battler,Eli I Lev"
10.1371/journal.pone.0095275,10.1371/journal.pone.0095275,2014-06-05T00:00:00Z,Identification of Divergent Protein Domains by Combining HMM-HMM Comparisons and Co-Occurrence Detection,"Amel Ghouila,Isabelle Florent,Fatma Zahra Guerfali,Nicolas Terrapon,Dhafer Laouini,Sadok Ben Yahia,Olivier Gascuel,Laurent Bréhélin"
10.1371/journal.pone.0096098,10.1371/journal.pone.0096098,2014-06-05T00:00:00Z,Baseline CD4 Cell Counts of Newly Diagnosed HIV Cases in China: 2006–2012,"Houlin Tang,Yurong Mao,Cynthia X Shi,Jing Han,Liyan Wang,Juan Xu,Qianqian Qin,Roger Detels,Zunyou Wu"
10.1371/journal.pone.0097475,10.1371/journal.pone.0097475,2014-06-05T00:00:00Z,Crystal
Structure

of
the
Leg:
Economic
Analysis

Open
State

of
the
PATCH
I
&
II
Trials,"James
M
Mason,Kim
S
Thomas,Angela
M
Crook,Katharine

<i>Neisseria
gonorrhoeae</i>
MtrE
Outer
Membrane
Channel,"Hsiang-Ting
Lei,Tsung-Han
Chou,Chih-Chia
Su,Jani
Reddy
Bolla,Nitin
Kumar,Abhijith
Radhakrishnan,Feng
Long,Jared

A
Foster,Joanne

Delmar,Sylvia
V
Do,Kanagalaghatta

R
Chalmers,Andrew
J
Nunn,Hywel
C
Williams"

Rajashankar,William
M
Shafer,Edward
W
Yu"

Versions of the search results provided at time of writing are available locally in XML , JSON and CSV formats for reference.

A significant difference between the CSV formatted results and those of JSON and XML is the absence of information about how the set of results provided in the HTTP response fit within the complete set of results that match the Live Search request. The information provided in the JSON and XML search results states both the total number of "hits" for the Live Search request and the start index within the complete set (zero for the example provided here as the ?start={offset} query parameter is absent from the request).

Note

Other common methods of splitting up large datasets into manageable chunks include partitioning by time (e.g. all the records added to a dataset in a given day may be exported in a CSV file). Such partitioning allows regular updates to be shared. However, in order to recombine those time-based partitions into the complete set, one needs to know the datetime range for which that dataset partition is valid. Such information should be available within a CSV metadata description.

Requires: CsvAsSubsetOfLargerDataset .

To be useful to a user maintaining a PLOS One search results need to be returned in an organized and consistent tabular format. This includes:

Lastly because the research researcher may use different search criteria the header row plays an important role later for the researcher wanting to combine multiple literature searches into their database. The researcher will use the header column names returned in the first row as a way to identify each column type.

Requires: WellFormedCsvCheck and CsvValidation .

Search results returned in a tabular format can contain cell values that organized in data structures also known as micro formats. In example above the publication_date and authors list represent two micro formats that are represented in a recognizable pattern that can be parsed by software or by the human reader. In the case of the author column, microformats provide the advantage of being able to store a single author's name or multiple authors names separated by a comma delimiter. Because each author cell value is surrounded by quotes a parser can choose to ignore the data structure or address it.

Requires: CellValueMicroSyntax Issue 4 Is this information in scope? How can it be incorporated better into the use case? The following additional observations were made while interacting with the search engine: As the "rows" parameter is set to "5", the first 5 rows are returned. To paginate through the collection, add "&start={offset}" (e.g. "&start=5") to the query. It would be useful to receive, at the start of the response, an indication of the total number of results (as is present in Furthermore, note that the JSON version values of the response, as OpenSearch fields). It would also be useful to title_display column contain markup. Whilst these values may be able to continue fetching rows using the last known identifier with "&before={id}" or "&since={id}", to avoid issues of pagination if the result set is altered while while it is being fetched. To allow incremental updates to the local collection, treated as pure text, it would again be useful to support "&since={id}" (or provides an If-Modified-Since header), to fetch all the items since the query was last update. For archiving, it might also be desirable to break the data up into separate CSV files - one per day - that can easily be synchronized between machines and recombined when needed. For converting to linked data, it would be useful to be able to specify the data type example of each cell, and to map the column headers how structure or syntax may be embedded within a cell.

Requires: CellMicrosyntax and values to URLs. RepeatedProperties .

3.7 2.7 Use Case #7 - Reliability Analyses Analyzes of Police Open Data

(Contributed by Davide Ceolin)

Several Web sources expose datasets about UK crime statistics. These datasets vary in format (e.g. maps vs. CSV files), timeliness, aggregation level, etc. Before being published on the Web, these data are processed to preserve the privacy of the people involved, but again the processing policy varies from source to source.

Every month, the UK Police Home Office publishes (via data.police.uk) data.police.uk ) CSV files that report crime counts, aggregated on geographical basis (per address or police neighbourhood) and on type basis. Before publishing, data are smoothed, that is, grouped in predefined areas and assigned to the mid point of each area. Each area has to contain a minimum number of physical addresses. The goal of this procedure is to prevent the reconstruction of the identity of the people involved in the crimes.

Over time, the policies adopted for preprocessing these data have changed, but data previously published have not been recomputed. Therefore, datasets about different months present relevant differences in terms of crime types reported and geographical aggregation (e.g. initially, each geographical area for aggregation had to include at least 12 physical addresses. Later, this limit was lowered to 8).

These policies introduce a controlled error in the data for privacy reasons, but these changes in the policies imply the fact that different datasets adhere differently to the real data, i.e. they present different reliability levels. Previous work provided two procedures for measuring and comparing the reliability of the datasets, but in order to automate and improve these procedures, it is crucial to understand the meaning of the columns, the relationships between columns, and how the data rows have been computed.

For instance, here is a snippet from a dataset about crime happened in Hampshire in April 2012:

Month Force Neighbourhood Burglary Robbery Vehicle crime Violent crime Anti-social behaviour Other crime {snip} 2011-04 Hampshire Constabulary 2LE11 2 0 1 6 14 6 2011-04 Hampshire Constabulary 2LE10 1 0 2 4 15 6 2011-04 Hampshire Constabulary 2LE12 3 0 0 4 25 21
Example 12
Month,	Force,			Neighbourhood,	Burglary,	Robbery,	Vehicle crime,	Violent crime,	Anti-social behaviour,	Other crime
{snip}
2011-04	Hampshire Constabulary,	2LE11,		2,		0,		1,		6,		14,			6
2011-04	Hampshire Constabulary,	2LE10,		1,		0,		2,		4,		15,			6
2011-04	Hampshire Constabulary,	2LE12,		3,		0,		0,		4,		25,			21

{snip}

and that dataset reports 248 entries, while in October 2012, the crime types we can see are increased to 11:

Month Force Neighbourhood Burglary Robbery Vehicle crime Violent crime ASB CDA Shoplifting OT Drugs PDW OC 2012-10 Hampshire Constabulary 2LE11 1 0 1 2 8 0 0 1 1 0 1 2012-10 Hampshire Constabulary 1SY01 9 1 12 8 87 17 12 14 13 7 4 2012-10 Hampshire Constabulary 1SY02 11 0 11 20 144 39 2 12 9 8 5
Example 13
Month,	Force,			Neighbourhood,	Burglary,	Robbery,	Vehicle crime,	Violent crime,	Anti-social behaviour,	Criminal damage and arson,	Shoplifting,	Other theft,	Drugs,	Public disorder and weapons,	Other crime
{snip}
2012-10,Hampshire Constabulary,	2LE11,		1,		0,		1,		2,		8,			0,				0,		1,		1,	0,				1
2012-10,Hampshire Constabulary,	1SY01,		9,		1,		12,		8,		87,			17,				12,		14,		13,	7,				4
2012-10,Hampshire Constabulary,	1SY02,		11,		0,		11,		20,		144,			39,				2,		12,		9,	8,				5
{snip}

ASB = Anti-social behaviour PDW = Public Disorder and Weapons CDA = Criminal damage and arson OT = Other Theft OC = Other Crime

This dataset reports 232 entries.

In order to properly handle the columns, it is crucial to understand the type of the data therein contained. contained therein. Given the context, knowing this information would reveal an important part of the column meaning (e.g. to identify dates).

Requires: SyntacticTypeDefinition .

Also, it is important to understand the precise semantics of each column. This is relevant for two reasons. First, to identify relations between columns (e.g. some crime types are siblings, while other are less semantically related). Second, to identify semantic relations between columns in heterogeneous datasets (e.g. a column in one dataset may correspond to the sum of two or more columns in others).

Requires: SemanticTypeDefinition .

Lastly, datasets with different row numbers are the result of different smoothing procedures. Therefore, it would be important to trace and access their provenance, in order to facilitate their comparison.

Requires: AnnotationAndSupplementaryInfo .

3.8 2.8 Use Case #8 - Analyzing Scientific Spreadsheets

(Contributed by Alf Eaton, Davide Ceolin, Martine de Vos)

A paper published in Nature Immunology in December 2012 compared changes in expression of a range of genes in response to treatment with two different cytokines. The results were published in the paper as graphic figures, and the raw data was presented in the form of two supplementary spreadsheets, as Excel files . ( local copy ).

Having at disposal both the paper and the results, a scientist may wish to reproduce the experiment, check if the results he obtains coincide with those published, and compare those results with others, provided by different studies about the same issues.

Because of the size of the datasets and of the complexity of the computations, it could be necessary to perform such analyses and comparisons by means of properly defined software, typically by means of an R, Python or Matlab script. Such software would require as input the data contained in the Excel file. However, it would be difficult to write a parser to extract the information, for the reasons described below.

To clarify the issues related to the spreadsheet parsing and analysis, we first present an example extrapolated from it. The example below shows a CSV encoding of the original Excel speadsheet converted using Mircosoft Excel 2007. White space has been added to aid clarity. (file = ni.2449-S3.csv )

Supplementary Table 2. Genes more potently regulated by IL-15 gene_name symbol RPKM Fold Change 4 hour 24 hour 4 hour ... Cont IL2_1nM IL2_500nM IL15_1nM IL15_500nM IL2_1nM IL2_500nM IL15_1nM IL15_500nM IL2_1nM ... {snip} NM_001033122 Cd69 15,67 46,63 216,01 30,71 445,58 9,21 77,32 4,56 77,21 ... NM_026618 Ccdc56 9,07 12,55 9,25 5,88 14,33 20,08 20,91 11,97 22,69 ... NM_008637 Nudt1 9,31 7,51 8,60 11,21 6,84 15,85 25,14 7,56 22,77 ... NM_008638 Mthfd2 58,67 33,99 245,87 44,66 167,87 55,62 204,50 24,52 176,51 ... NM_178185 Hist1h2ao 7,13 16,52 7,82 7,79 16,99 75,04 290,72 21,99 164,93 ...
Example 14
Supplementary Table 2. Genes more potently regulated by IL-15,,,,,,,,,,,,,,,,,,
            ,         ,     ,       ,         ,        ,          ,       ,         ,        ,          ,           ,         ,        ,          ,       ,         ,        ,
   gene_name,   symbol, RPKM,       ,         ,        ,          ,       ,         ,        ,          ,Fold Change,         ,        ,          ,       ,         ,        ,
            ,         ,     , 4 hour,         ,        ,          ,24 hour,         ,        ,          ,     4 hour,         ,        ,          ,24 hour,         ,        ,
            ,         , Cont,IL2_1nM,IL2_500nM,IL15_1nM,IL15_500nM,IL2_1nM,IL2_500nM,IL15_1nM,IL15_500nM,    IL2_1nM,IL2_500nM,IL15_1nM,IL15_500nM,IL2_1nM,IL2_500nM,IL15_1nM,IL15_500nM
NM_001033122,     Cd69,15.67,  46.63,   216.01,   30.71,    445.58,   9.21,    77.32,    4.56,     77.21,       2.98,    13.78,    1.96,     28.44,   0.59,     4.93,    0.29,      4.93
   NM_026618,   Ccdc56, 9.07,  12.55,     9.25,    5.88,     14.33,  20.08,    20.91,   11.97,     22.69,       1.38,     1.02,    0.65,      1.58,   2.21,     2.31,    1.32,      2.50
   NM_008637,    Nudt1, 9.31,   7.51,     8.60,   11.21,      6.84,  15.85,    25.14,    7.56,     22.77,       0.81,     0.92,    1.20,      0.73,   1.70,     2.70,    0.81,      2.45
   NM_008638,   Mthfd2,58.67,  33.99,   245.87,   44.66,    167.87,  55.62,   204.50,   24.52,    176.51,       0.58,     4.19,    0.76,      2.86,   0.95,     3.49,    0.42,      3.01
   NM_178185,Hist1h2ao, 7.13,  16.52,     7.82,    7.79,     16.99,  75.04,   290.72,   21.99,    164.93,       2.32,     1.10,    1.09,      2.38,  10.52,    40.78,    3.08,     23.13

{snip}

As we can see from the example, the table contains several columns of data that are measurements of gene expression in cells after treatment with two concentrations of two cytokines, measured after two periods of time, presented as both actual values and fold change. This can be represented in a table, but needs 3 levels of headings and several merged cells. In fact, the first row is the title of the table, the second to fourth rows are the table headers, and headers.

We also see that the first two columns are also headers. column gene_name provides a unique identifier for the gene described in each row, with the second column symbol providing a human readable notation for each gene - albeit a scientific human! It would be useful is necessary to determine which column, if any, provides the unique identifier for the entity which each row describes. In order for the gene to be able referenced from outside the datafile, e.g. to describe this reconcile the information in a way that a parser could understand. Cell borders are also used to communicate information, though this table would still with other information about the gene, the local identifier must be understandable without them. mapped to a globally unique identifier such as a URI.

Requires: MultipleHeadingRows HeadingColumns and URIMapping .

The first column contains a GenBank identifier for each gene, with the column name "gene_name". The first column contains a GenBank identifier for each gene, with the column name "gene_name". The GenBank identifier provides a local identifier for each gene. This local identifier, e.g. “NM_008638”, can be converted to a fully qualified URI by adding a URI prefix, e.g. “http://www.ncbi.nlm.nih.gov/nuccore/NM_008638” allowing the gene to be uniquely and unambiguously identified.

The second column contains the standard symbol for each gene, labelled as "symbol". These appear to be HUGO gene nomenclature symbols, but as there's no mapping it's hard to be sure which namespace these symbols are from.

Requires: URIMapping .

As this spreadsheet was published as supplemental data for a journal article, there is little description of what the columns represent, even as text. There is a column labelled as "Cont", which has no description anywhere, but is presumably the background level of expression for each gene.

Requires: SyntacticTypeDefinition Requires: and SemanticTypeDefinition .

Half of the cells represent measurements, but the details of what those measurements are can only be found in the article text. The other half of the cells represent the change in expression over the background level. It is difficult to tell the difference without annotation that describes the relationship between the cells (or understanding of the nested headings). In this particular spreadsheet, only the values are published, and not the formulae that were used to calculate the derived values. The units of each cell are "expression levels relative to the expression level of a constant gene, Rpl7", described in the text of the methods section of the full article.

Requires: UnitMeasureDefinition .

The heading rows contain details of the treatment that each cell received, e.g. "4 hour, IL2_1nM". It would be useful to be able to make this machine readable (i.e. to represent treatment with 1nM IL-2 for 4 hours).

All the details of the experiment (which cells were used, how they were treated, when they were measured) are described in the methods section of the article. To be able to compare data between multiple experiments, a parser would also need to be able to understand all these parameters that may have affected the outcome of the experiment.

Requires: AnnotationAndSupplementaryInfo .

3.9 2.9 Use Case #9 - Chemical Imaging

(Contributed by Mathew Thomas)

Chemical imaging experimental work makes use of CSV formats to record its measurements. In this use case two examples are shown to depict scans from a mass spectrometer and corresponding FTIR corrected files that are saved into a CSV format automatically.

Mass Spectrometric Imaging (MSI) allows the generation of 2D ion density maps that help visualize molecules present in sections of tissues and cells. The combination of spatial resolution and mass resolution results in very large and complex data sets. The following is generated using the software Decon Tools, a tool to de-isotope MS spectra and to detect features from MS data using isotopic signatures of expected compounds, available freely at omins.pnnl.gov. The raw files generated by the mass spec instrument are read in and the processed output files are saved as csv CSV files for each line.

Fourier transform (FTIR) spectroscopy is a measurement technique whereby spectra are collected based on measurements of the coherence of a radiative source, using time-domain or space-domain measurements of the electromagnetic radiation or other type of radiation.

In general this use case also illustrates the utility of CSV as a means for scientists to collect and process their experimental results:

The key characteristics are:

Requires: WellFormedCsvCheck , CsvValidation , PrimaryKey and UnitMeasureDefinition .

Lastly, for Mass Spectrometry multiple CSV files need to be examined to view the sample image in its entirety.

Requires: CsvAsSubsetOfLargerDataset .

Below are Mass Spectrometry instrument measurements (3 of 316 CSV rows) for a single line on a sample. It gives the mass-to-charge ranges, peak values, acquisition times and total ion current.

scan_num,scan_time,type,bpi,bpi_mz,tic,num_peaks,num_deisotoped,info 1,0,1,4.45E+07,576.27308,1.06E+09,132,0,FTMS + p NSI Full ms [100.00-2000.00] 2,0.075,1,1.26E+08,576.27306,2.32E+09,86,0,FTMS + p NSI Full ms [100.00-2000.00]
Example 15
scan_num,scan_time,type,bpi,bpi_mz,tic,num_peaks,num_deisotoped,info 
1,0,1,4.45E+07,576.27308,1.06E+09,132,0,FTMS + p NSI Full ms [100.00-2000.00] 
2,0.075,1,1.26E+08,576.27306,2.32E+09,86,0,FTMS + p NSI Full ms [100.00-2000.00] 

3,0.1475,1,9.53E+07,576.27328,1.66E+09,102,0,FTMS
+
p
NSI
Full
ms
[100.00-2000.00]

Below is a example FTIR data. The files from the instrument are baseline corrected, normalized and saved as csv CSV files automatically. Column 1 represents the wavelength # or range and the represent different formations like bound eps (extracellular polymeric substance), lose eps, shewanella etc. Below are (5 of 3161 rows) is a example:

,wt beps,wt laeps,so16533 beps,so167333 laeps,so31 beps,so313375 lAPS,so3176345 bEPS,so313376 laEPS,so3193331 bEPS,so3191444 laeps,so3195553beps,so31933333 laeps 1999.82,-0.0681585,-0.04114415,-0.001671781,0.000589855,0.027188073,0.018877371,-0.066532177,-0.016899697,-0.077690018,0.001594551,-0.086573831,-0.08155035 1998.855,-0.0678255,-0.0409804,-0.001622611,0.000552989,0.027188073,0.01890847,-0.066132737,-0.016857071,-0.077346835,0.001733207,-0.086115107,-0.081042424 1997.89,-0.067603,-0.0410459,-0.001647196,0.000423958,0.027238845,0.018955119,-0.065904461,-0.016750515,-0.077101756,0.001733207,-0.085656382,-0.080590934
Example 16
,wt beps,wt laeps,so16533 beps,so167333 laeps,so31 beps,so313375 lAPS,so3176345 bEPS,so313376 laEPS,so3193331 bEPS,so3191444 laeps,so3195553beps,so31933333 laeps
 1999.82,-0.0681585,-0.04114415,-0.001671781,0.000589855,0.027188073,0.018877371,-0.066532177,-0.016899697,-0.077690018,0.001594551,-0.086573831,-0.08155035
 1998.855,-0.0678255,-0.0409804,-0.001622611,0.000552989,0.027188073,0.01890847,-0.066132737,-0.016857071,-0.077346835,0.001733207,-0.086115107,-0.081042424
 1997.89,-0.067603,-0.0410459,-0.001647196,0.000423958,0.027238845,0.018955119,-0.065904461,-0.016750515,-0.077101756,0.001733207,-0.085656382,-0.080590934

1996.925,-0.0673255,-0.04114415,-0.001647196,0.000258061,0.027289616,0.018970669,-0.065790412,-0.01664396,-0.076856677,0.001629215,-0.085281062,-0.080365189

3.10 2.10 Use Case #10 - OpenSpending Data

(Contributed by Stasinos Konstantopoulos)

The OpenSpending and the Budgit platforms provide plenty of useful datasets providing figures of national budget and spending of several countries. A journalist willing to investigate about public spending fallacies can use these data as a basis for his research, and possibly compare them against different sources. Similarly, a politician that is interested in developing new policies for development can, for instance, combine these data with those from the World Bank to identify correlations and, possibly, dependencies to leverage.

Nevertheless, these uses of these datasets are possibly undermined by the following obstacles.

Note

The datahub.io platform that collects both OpenSpending and Budgit data allows publishing data in Simple Data Format (SDF), RDF and other formats providing explicit semantics. Nevertheless, the datasets mentioned above present either implicit semantics and/or additional metadata files provided only as attachment.

3.11 2.11 Use Case #11 - City of Palo Alto Tree Data

(Contributed by Eric Stephan)

The City of Palo Alto, California Urban Forest Section is responsible for maintaining and tracking the cities public trees and urban forest. In a W3C Data on the Web Best Practices (DWBP) use case discussion with Jonathan Reichental City of Palo Alto CIO, he brought to the working groups attention a Tree Inventory maintained by the city in a spreadsheet form using Google Fusion. This use case represents use of tabular data to be representative of geophysical tree locations also provided in Google Map form where the user can point and click on trees to look up row information about the tree.

Example output from The example below illustrates the first three few rows of CSV file. data:

GID,Private,Tree ID,Admin Area,Side of Street,On Street,From Street,To Street,Street_Name,Situs Number,Address Estimated,Lot Side,Serial Number,Tree Site,Species,Trim Cycle,Diameter at Breast Ht,Trunk Count,Height Code,Canopy Width,Trunk Condition,Structure Condition,Crown Condition,Pest Condition,Condition Calced,Condition Rating,Vigor,Cable Presence,Stake Presence,Grow Space,Utility Presence,Distance from Property,Inventory Date,Staff Name,Comments,Zip,City Name,Longitude,Latitude,Protected,Designated,Heritage,Appraised Value,Hardscape,Identifier,Location Feature ID,Install Date,Feature Name,KML,FusionMarkerIcon 1,True,29,,,ADDISON AV,EMERSON ST,RAMONA ST,ADDISON AV,203,,Front,,2,Celtis australis,Large Tree Routine Prune,11,1,25-30,15-30,,Good,5,,,Good,2,False,False,Planting Strip,,44,10/18/2010,BK,,,Palo Alto,-122.1565172,37.4409561,False,False,False,,None,40,13872,,"Tree: 29 site 2 at 203 ADDISON AV, on ADDISON AV 44 from pl","<Point><coordinates>-122.156485,37.440963</coordinates></Point>",small_green 2,True,30,,,EMERSON ST,CHANNING AV,ADDISON AV,ADDISON AV,203,,Left,,1,Liquidambar styraciflua,Large Tree Routine Prune,11,1,50-55,15-30,Good,Good,5,,,Good,2,False,False,Planting Strip,,21,6/2/2010,BK,,,Palo Alto,-122.1567812,37.440951,False,False,False,,None,41,13872,,"Tree: 30 site 1 at 203 ADDISON AV, on EMERSON ST 21 from pl","<Point><coordinates>-122.156749,37.440958</coordinates></Point>",small_green 3,True,31,,,EMERSON ST,CHANNING AV,ADDISON AV,ADDISON AV,203,,Left,,2,Liquidambar styraciflua,Large Tree Routine Prune,11,1,40-45,15-30,Good,Good,5,,,Good,2,False,False,Planting Strip,,54,6/2/2010,BK,,,Palo Alto,-122.1566921,37.4408948,False,False,False,,Low,42,13872,,"Tree: 31 site 2 at 203 ADDISON AV, on EMERSON ST 54 from pl","<Point><coordinates>-122.156659,37.440902</coordinates></Point>",small_green
Example 17
GID,Private,Tree ID,Admin Area,Side of Street,On Street,From Street,To Street,Street_Name,Situs Number,Address Estimated,Lot Side,Serial Number,Tree Site,Species,Trim Cycle,Diameter at Breast Ht,Trunk Count,Height Code,Canopy Width,Trunk Condition,Structure Condition,Crown Condition,Pest Condition,Condition Calced,Condition Rating,Vigor,Cable Presence,Stake Presence,Grow Space,Utility Presence,Distance from Property,Inventory Date,Staff Name,Comments,Zip,City Name,Longitude,Latitude,Protected,Designated,Heritage,Appraised Value,Hardscape,Identifier,Location Feature ID,Install Date,Feature Name,KML,FusionMarkerIcon
1,True,29,,,ADDISON AV,EMERSON ST,RAMONA ST,ADDISON AV,203,,Front,,2,Celtis australis,Large Tree Routine Prune,11,1,25-30,15-30,,Good,5,,,Good,2,False,False,Planting Strip,,44,10/18/2010,BK,,,Palo Alto,-122.1565172,37.4409561,False,False,False,,None,40,13872,,"Tree: 29 site 2 at 203 ADDISON AV, on ADDISON AV 44 from pl","<Point><coordinates>-122.156485,37.440963</coordinates></Point>",small_green
2,True,30,,,EMERSON ST,CHANNING AV,ADDISON AV,ADDISON AV,203,,Left,,1,Liquidambar styraciflua,Large Tree Routine Prune,11,1,50-55,15-30,Good,Good,5,,,Good,2,False,False,Planting Strip,,21,6/2/2010,BK,,,Palo Alto,-122.1567812,37.440951,False,False,False,,None,41,13872,,"Tree: 30 site 1 at 203 ADDISON AV, on EMERSON ST 21 from pl","<Point><coordinates>-122.156749,37.440958</coordinates></Point>",small_green
3,True,31,,,EMERSON ST,CHANNING AV,ADDISON AV,ADDISON AV,203,,Left,,2,Liquidambar styraciflua,Large Tree Routine Prune,11,1,40-45,15-30,Good,Good,5,,,Good,2,False,False,Planting Strip,,54,6/2/2010,BK,,,Palo Alto,-122.1566921,37.4408948,False,False,False,,Low,42,13872,,"Tree: 31 site 2 at 203 ADDISON AV, on EMERSON ST 54 from pl","<Point><coordinates>-122.156659,37.440902</coordinates></Point>",small_green
4,True,32,,,ADDISON AV,EMERSON ST,RAMONA ST,ADDISON AV,209,,Front,,1,Ulmus parvifolia,Large Tree Routine Prune,18,1,35-40,30-45,Good,Good,5,,,Good,2,False,False,Planting Strip,,21,6/2/2010,BK,,,Palo Alto,-122.1564595,37.4410143,False,False,False,,Medium,43,13873,,"Tree: 32 site 1 at 209 ADDISON AV, on ADDISON AV 21 from pl","<Point><coordinates>-122.156427,37.441022</coordinates></Point>",small_green
5,True,33,,,ADDISON AV,EMERSON ST,RAMONA ST,ADDISON AV,219,,Front,,1,Eriobotrya japonica,Large Tree Routine Prune,7,1,15-20,0-15,Good,Good,3,,,Good,1,False,False,Planting Strip,,16,6/1/2010,BK,,,Palo Alto,-122.1563676,37.441107,False,False,False,,None,44,13874,,"Tree: 33 site 1 at 219 ADDISON AV, on ADDISON AV 16 from pl","<Point><coordinates>-122.156335,37.441114</coordinates></Point>",small_green
6,True,34,,,ADDISON AV,EMERSON ST,RAMONA ST,ADDISON AV,219,,Front,,2,Robinia pseudoacacia,Large Tree Routine Prune,29,1,50-55,30-45,Poor,Poor,5,,,Good,2,False,False,Planting Strip,,33,6/1/2010,BK,cavity or decay; trunk decay; codominant leaders; included bark; large leader or limb decay; previous failure root damage; root decay;  beware of BEES.,,Palo Alto,-122.1563313,37.4411436,False,False,False,,None,45,13874,,"Tree: 34 site 2 at 219 ADDISON AV, on ADDISON AV 33 from pl","<Point><coordinates>-122.156299,37.441151</coordinates></Point>",small_green
{snip}

The complete CSV file of Palo Alto tree data is available locally - but please note that it is approximately 18MB in size.

Google Fusion allows a user to download the tree data either from a filtered view or the entire spreadsheet. The exported spreadsheet is organized and consistent tabular format. This includes:

In order for information about a given tree to be reconciled with information about the same tree originating from other sources, the local identifier for that tree must be mapped to a globally unique identifier such as a URI.

Also note that in row 6, a series of statements describing the condition of the tree and other important information are provided in the comments cell. These statements are delimited using the semi-colon " ; " character.

Requires: WellFormedCsvCheck , CsvValidation , PrimaryKey , URIMapping , MissingValueDefinition and , UnitMeasureDefinition , CellMicrosyntax CellValueMicroSyntax and RepeatedProperties .

3.12 2.12 Use Case #12 - Chemical Structures

(Contributed by Eric Stephan)

The purpose of this use case is to illustrate how 3-D molecular structures such as the Protein Data Bank and XYZ formats are conveyed in tabular formats. These files be archived to be used informatics analysis or as part of an input deck to be used in experimental simulation. Scientific communities rely heavily on tabular formats such as these to conduct their research and share each others results in platform independent formats.

The Protein Data Bank (pdb) file format is a tabular file describing the three dimensional structures of molecules held in the Protein Data Bank. The pdb format accordingly provides for description and annotation of protein and nucleic acid structures including atomic coordinates, observed sidechain rotamers, secondary structure assignments, as well as atomic connectivity.

The XYZ file format is a chemical file format. There is no formal standard and several variations exist, but a typical XYZ format specifies the molecule geometry by giving the number of atoms with Cartesian coordinates that will be read on the first line, a comment on the second, and the lines of atomic coordinates in the following lines.

In general this use case also illustrates the utility of CSV as a means for scientists to collect and process their experimental results:

The key characteristics of the XYZ format are:

Requires: WellFormedCsvCheck , CsvValidation , MultipleHeadingRows , PrimaryKey and UnitMeasureDefinition .

Below is a Methane molecular structure organized in an XYZ format.

5 methane molecule (in angstroms) C 0.000000 0.000000 0.000000 H 0.000000 0.000000 1.089000 H 1.026719 0.000000 -0.363000 H -0.513360 -0.889165 -0.363000
Example 18
5
methane molecule (in angstroms)
C        0.000000        0.000000        0.000000
H        0.000000        0.000000        1.089000
H        1.026719        0.000000       -0.363000
H       -0.513360       -0.889165       -0.363000

H
-0.513360
0.889165
-0.363000

The key characteristics of the PDB format are:

Requires: GroupingOfMultipleTables .

Below is a example PDB file:

HEADER EXTRACELLULAR MATRIX 22-JAN-98 1A3I TITLE X-RAY CRYSTALLOGRAPHIC DETERMINATION OF A COLLAGEN-LIKE TITLE 2 PEPTIDE WITH THE REPEATING SEQUENCE (PRO-PRO-GLY) ... EXPDTA X-RAY DIFFRACTION AUTHOR R.Z.KRAMER,L.VITAGLIANO,J.BELLA,R.BERISIO,L.MAZZARELLA, AUTHOR 2 B.BRODSKY,A.ZAGARI,H.M.BERMAN ... REMARK 350 BIOMOLECULE: 1 REMARK 350 APPLY THE FOLLOWING TO CHAINS: A, B, C REMARK 350 BIOMT1 1 1.000000 0.000000 0.000000 0.00000 REMARK 350 BIOMT2 1 0.000000 1.000000 0.000000 0.00000 ... SEQRES 1 A 9 PRO PRO GLY PRO PRO GLY PRO PRO GLY SEQRES 1 B 6 PRO PRO GLY PRO PRO GLY SEQRES 1 C 6 PRO PRO GLY PRO PRO GLY ... ATOM 1 N PRO A 1 8.316 21.206 21.530 1.00 17.44 N ATOM 2 CA PRO A 1 7.608 20.729 20.336 1.00 17.44 C ATOM 3 C PRO A 1 8.487 20.707 19.092 1.00 17.44 C ATOM 4 O PRO A 1 9.466 21.457 19.005 1.00 17.44 O ATOM 5 CB PRO A 1 6.460 21.723 20.211 1.00 22.26 C ... HETATM 130 C ACY 401 3.682 22.541 11.236 1.00 21.19 C HETATM 131 O ACY 401 2.807 23.097 10.553 1.00 21.19 O
Example 19
HEADER    EXTRACELLULAR MATRIX                    22-JAN-98   1A3I
TITLE     X-RAY CRYSTALLOGRAPHIC DETERMINATION OF A COLLAGEN-LIKE
TITLE    2 PEPTIDE WITH THE REPEATING SEQUENCE (PRO-PRO-GLY)
...
EXPDTA    X-RAY DIFFRACTION
AUTHOR    R.Z.KRAMER,L.VITAGLIANO,J.BELLA,R.BERISIO,L.MAZZARELLA,
AUTHOR   2 B.BRODSKY,A.ZAGARI,H.M.BERMAN
...
REMARK 350 BIOMOLECULE: 1
REMARK 350 APPLY THE FOLLOWING TO CHAINS: A, B, C
REMARK 350   BIOMT1   1  1.000000  0.000000  0.000000        0.00000
REMARK 350   BIOMT2   1  0.000000  1.000000  0.000000        0.00000
...
SEQRES   1 A    9  PRO PRO GLY PRO PRO GLY PRO PRO GLY
SEQRES   1 B    6  PRO PRO GLY PRO PRO GLY
SEQRES   1 C    6  PRO PRO GLY PRO PRO GLY
...
ATOM      1  N   PRO A   1       8.316  21.206  21.530  1.00 17.44           N
ATOM      2  CA  PRO A   1       7.608  20.729  20.336  1.00 17.44           C
ATOM      3  C   PRO A   1       8.487  20.707  19.092  1.00 17.44           C
ATOM      4  O   PRO A   1       9.466  21.457  19.005  1.00 17.44           O
ATOM      5  CB  PRO A   1       6.460  21.723  20.211  1.00 22.26           C
...
HETATM  130  C   ACY   401       3.682  22.541  11.236  1.00 21.19           C
HETATM  131  O   ACY   401       2.807  23.097  10.553  1.00 21.19           O

HETATM
132
OXT
ACY
401
4.306
23.101
12.291
1.00
21.19
O

3.13 2.13 Use Case #13 - Representing Entities and Facts Extracted From Text

(Contributed by Tim Finin)

The US National Institute of Standards and Technology (NIST) has run various conferences on extracting information from text centered around challenge problems. Participants submit the output of their systems on an evaluation dataset to NIST for scoring, typically in the form of tab-separated format.

The 2013 NIST Cold Start Knowledge Base Population Task , for example, asks participants to extract facts from text and to represent these as triples along with associated metadata that include provenance and certainty values. A line in the submission format consists of a triple (subject-predicate-object) and, for some predicates, provenance information. Provenance includes a document ID and, depending on the predicate, one or three pairs of string offsets within the document. For predicates that are relations, an optional second set of provenance values can be provided. Each line can also have an optional float as a final column to represent a certainty measure.

The following lines show examples of possible triples of varying length. In the second line, D00124 is the ID of a document and the strings like 283-286 refer to strings in a document using the offsets of the first and last characters. The final floating point value on some lines is the optional certainty value.

{snip} :e4 type PER :e4 mention "Bart" D00124 283-286 :e4 mention "JoJo" D00124 145-149 0.9 :e4 per:siblings :e7 D00124 283-286 173-179 274-281 :e4 per:age "10" D00124 180-181 173-179 182-191 0.9 :e4 per:parent :e9 D00124 180-181 381-380 399-406 D00101 220-225 230-233 201-210
Example 20
{snip}
:e4 type         PER
:e4 mention      "Bart"  D00124 283-286
:e4 mention      "JoJo"  D00124 145-149 0.9
:e4 per:siblings :e7     D00124 283-286 173-179 274-281
:e4 per:age      "10"    D00124 180-181 173-179 182-191 0.9
:e4 per:parent   :e9     D00124 180-181 381-380 399-406 D00101 220-225 230-233 201-210

{snip}

The submission format does not require that each line have the same number of columns. The expected provenance information for a triple depends on the predicate. For example, “type” typically has no provenance, “mention” has a document ID and offset pair, and domain predicates like “per:age” have one or two provenance records each of which has a document ID and three offset pairs.

The file format exemplified above opens up for a number of issues described as follows.

Each row is intended to describe an entity (“:e4”) (e.g. the subject of the triple, “:e4”). The unique identifier for that thus represents a primary key and entity is provided in the first column. In order for information about this entity to be reconcilled with information from other sources about the same entity, the local identifier needs to be identified mapped to a globally unique identifier such as such. a URI.

Requires: PrimaryKey URIMapping .

After each triple, there is a variable number of annotations representing the provenance of the triple and, occasionally, its certainty. This information has to be properly identified and managed.

Requires: AnnotationAndSupplementaryInfo .

Entities “:e4”, “:e7” and “:e9” appear to be (foreign key) references to other entities described in this or in external tables. Likewise, also the identifiers “D00124” and “D00101” are ambiguous identifiers. It would be useful to identify the resources that these references represent.

Moreover, “per” appears to be a term from a controlled vocabulary. How do we know which controlled vocabulary it is a member of and what its authoritative definition is?

Requires: ForeignKeyReferences , AssociationOfCodeValuesWithExternalDefinitions , LinksToExternallyManagedDefinitions , and SemanticTypeDefinition .

The identifiers used for the entities (“:e4”, “:e7” and “:e9”), as well as those used for the predicates (e.g. “type”, “mention”, “per:siblings” etc.), are ambiguous local identifiers. How can one make the identifier an unambiguous URI? A similar requirement regards the provenance annotations. These are composed by document (e.g. “D00124”) and page number ranges. (e.g. “180-181”). Page number ranges are clearly valid only in the context of the preceding document identifier. The interesting assertion about provenance is the reference (document plus page range). Thus we might want to give the reference a unique identifier comprising from document ID and page range (e.g. D00124#180-181).

Requires: URIMapping .

Besides the entities, the table presents also some values. Some of these are strings (e.g. “10”, “Bart”), some of them are probably floating point values (e.g. “0.9”). It would be useful to have an explicit syntactic type definition for these values.

Requires: SyntacticTypeDefinition .

Entity “:e4” is the subject of many rows, meaning that many rows can be combined to make a composite set of statements about this entity.

Moreover, a single row in the table comprises a triple (subject-predicate-object), one or more provenance references and an optional certainty measure. The provenance references have been normalised for compactness (e.g. so they fit on a single row). However, each provenance statement has the same target triple so one could unbundle the composite row into multiple simple statements that have a regular number of columns (see the two equivalent examples below). {snip} :e4 per:age "10" D00124 180-181 173-179 182-191 0.9 :e4 per:parent :e9 D00124 180-181 381-380 399-406 D00101 220-225 230-233 201-210

Example 21
{snip}
:e4 per:age      "10"    D00124 180-181 173-179 182-191 0.9
:e4 per:parent   :e9     D00124 180-181 381-380 399-406 D00101 220-225 230-233 201-210

{snip}
{snip} :e4 per:age "10" D00124 180-181 0.9 :e4 per:age "10" D00124 173-179 0.9 :e4 per:age "10" D00124 182-191 0.9 :e4 per:parent :e9 D00124 180-181 :e4 per:parent :e9 D00124 381-380 :e4 per:parent :e9 D00124 399-406 :e4 per:parent :e9 D00101 220-225 :e4 per:parent :e9 D00101 230-233 :e4 per:parent :e9 D00101 201-210
Example 22
{snip}
:e4 per:age      "10"    D00124 180-181 0.9
:e4 per:age      "10"    D00124 173-179 0.9
:e4 per:age      "10"    D00124 182-191 0.9
:e4 per:parent   :e9     D00124 180-181
:e4 per:parent   :e9     D00124 381-380
:e4 per:parent   :e9     D00124 399-406
:e4 per:parent   :e9     D00101 220-225
:e4 per:parent   :e9     D00101 230-233
:e4 per:parent   :e9     D00101 201-210

{snip}

Requires: TableNormalization .

Lastly, since we already observed that rows comprise triples, that there is a frequent reference to externally defined vocabularies, that values are defined as text (literals), and that triples are also composed by entities, for which we aim to obtain a URI (as described above), it may be useful to be able to convert such a table in RDF.

Requires: CsvToRdfTransformation .

3.14 2.14 Use Case #14 - Displaying Locations of Care Homes on a Map

(Contributed by Jeni Tennison)

NHS Choices makes available a number of (what it calls) CSV files for different aspects of NHS data on its website at http://www.nhs.uk/aboutnhschoices/contactus/pages/freedom-of-information.aspx

One of the files (file = SCL.csv ) contains information about the locations of care homes. homes, as illustrated in the example below:

Example 23
OrganisationID¬OrganisationCode¬OrganisationType¬SubType¬OrganisationStatus¬IsPimsManaged¬OrganisationName¬Address1¬Address2¬Address3¬City¬County¬Postcode¬Latitude¬Longitude¬ParentODSCode¬ParentName¬Phone¬Email¬Website¬Fax¬LocalAuthority

220153¬1-303541019¬Care homes and care at home¬UNKNOWN¬Visible¬False¬Bournville House¬Furnace Lane¬Lightmoor Village¬¬Telford¬Shropshire¬TF4 3BY¬0¬0¬1-101653596¬Accord Housing Association Limited¬01952739284¬¬www.accordha.org.uk¬01952588949¬
220154¬1-378873485¬Care homes and care at home¬UNKNOWN¬Visible¬True¬Ashcroft¬Milestone House¬Wicklewood¬¬Wymondham¬Norfolk¬NR18 9QL¬52.577003479003906¬1.0523598194122314¬1-377665735¬Julian Support Limited¬01953 607340¬ashcroftresidential@juliansupport.org¬http://www.juliansupport.org¬01953 607365¬
220155¬1-409848410¬Care homes and care at home¬UNKNOWN¬Visible¬False¬Quorndon Care Limited¬34 Bakewell Road¬¬¬Loughborough¬Leicestershire¬LE11 5QY¬52.785675048828125¬-1.219469428062439¬1-101678101¬Quorndon Care Limited¬01509219024¬¬www.quorndoncare.co.uk¬01509413940¬
{snip}

The file has two interesting syntactic features:

Requires: WellFormedCsvCheck , SyntacticTypeDefinition , NonStandardFieldDelimiter and NonStandardCellDelimiter .

I want Our user wants to be able to embed a map of these locations easily into my web page using a web component , such that I she can use markup like: <emap src="http://media.nhschoices.nhs.uk/data/foi/SCL.csv" latcol="Latitude" longcol="Longitude">

	<emap src="http://media.nhschoices.nhs.uk/data/foi/SCL.csv" latcol="Latitude" longcol="Longitude">
and see a map similar to that shown at https://github.com/JeniT/nhs-choices/blob/master/SCP.geojson , without converting the CSV file into GeoJSON.

To make the web component easy to define, there should be a native API on to the data in the CSV file within the browser.

Requires: CsvToJsonTransformation .

3.15 2.15 Use Case #15 - Intelligently Previewing CSV files

(Contributed by Jeni Tennison)

All of the data repositories based on the CKAN software, such as data.gov.uk , data.gov , and many others, use JSON as the representation of the data when providing a preview of CSV data within a browser. Server side pre-processing of the CSV files is performed to try and determine column types, clean the data and transform the CSV-encoded data to JSON in order to provide the preview. JSON has many features which make it ideal for delivering a preview of the data, originally in CSV format, to the browser.

Javascript is a hard dependency for interacting with data in the browser and as such JSON was used as the serialization format because it was the most appropriate format for delivering those data. As the object notation for Javascript JSON is natively understood by Javascript it is therefore possible to use the data without any external dependencies. The values in the data delivered map directly to common Javascript types and libraries for processing and generating JSON, with appropriate type conversion, are widely available for many programming languages.

Beyond basic knowledge of how to work with JSON, there is no further burden on the user to understand complex semantics around how the data should be interpreted. The user of the data can be assured that the data is correctly encoded as UTF-8 and it is easily queryable using common patterns used in everyday Javascript. None of the encoding and serialization flaws with CSV are apparent, although badly structured CSV files will be mirrored in the JSON.

Requires: WellFormedCsvCheck , and CsvToJsonTransformation .

When providing the in-browser previews of CSV-formatted data, the utility of the preview application is limited because the server-side processing of the CSV is not always able to determine the data types (e.g. date-time) associated with data columns. As a result it is not possible for the in-browser preview to offer functions such as sorting rows by date.

As an example, see the Spend over £25,000 in The Royal Wolverhampton Hospitals NHS Trust example. Note that the underlying data begins with:

"Expenditure over £25,000- Payment made in January 2014",,,,,,,, ,,,,,,,, Department Family,Entity,Date,Expense Type,Expense Area,Supplier,Transaction Number,Amount in Sterling, Department of Health,The Royal Wolverhampton Hospitals NHS Trust RL4,31/01/2014,Capital Project,Capital,STRYKER UK LTD,0001337928,31896.06, Department of Health,The Royal Wolverhampton Hospitals NHS Trust RL4,17/01/2014,SERVICE AGREEMENTS,Pathology,ABBOTT LABORATORIES LTD,0001335058,77775.13,
Example 24
"Expenditure over £25,000- Payment made in January 2014",,,,,,,,
,,,,,,,,
Department Family,Entity,Date,Expense Type,Expense Area,Supplier,Transaction Number,Amount in Sterling,
Department of Health,The Royal Wolverhampton Hospitals NHS Trust RL4,31/01/2014,Capital Project,Capital,STRYKER UK LTD,0001337928,31896.06,
Department of Health,The Royal Wolverhampton Hospitals NHS Trust RL4,17/01/2014,SERVICE AGREEMENTS,Pathology,ABBOTT LABORATORIES LTD,0001335058,77775.13,

...

A local copy of this dataset is available: file = mth-10-january-2014.csv

The header line here comes below an empty row, and there is metadata about the table in the row above the empty row. The preview code manages to identify the headers from the CSV, and displays the metadata as the value in the first cell of the first row.

Requires: MultipleHeadingRows , and AnnotationAndSupplementaryInfo .

It would be good if the preview could recognise that the Date column contains a date and that the Amount in Sterling column contains a number, so that it could offer options to filter/sort these by date/numerically.

Requires: SemanticTypeDefinition , SyntacticTypeDefinition , and UnitMeasureDefinition .

Moreover, some of the values reported may refer to external definitions (from dictionaries or other sources). It would be useful to know where it is possible to find such resources, to be able to properly handle and visualize the data, by linking to them.

Requires: ExternalDataDefinitionResource AssociationOfCodeValuesWithExternalDefinitions .

Lastly, the web page where the CSV is published presents also useful metadata about it. It would be useful to be able to know and access these metadata even though they are not included in the file.

These include:

Requires: AnnotationAndSupplementaryInfo .

3.16 2.16 Use Case #16 - Tabular Representations of NetCDF data Using CDL Syntax

(Contributed by Eric Stephan)

NetCDF is a set of binary data formats, programming interfaces, and software libraries that help read and write scientific data files. NetCDF provides scientists a means to share measured or simulated experiments with one another across the web. What makes NetCDF useful is its ability to be self describing and provide a means for scientists to rely on existing data model as opposed to needing to write their own. The classic netCDF NetCDF data model consists of variables, dimensions, and attributes. This way of thinking about data was introduced with the very first netCDF NetCDF release, and is still the core of all netCDF NetCDF files.

Among the tools available to the NetCDF community, two tools: ncdump and ncgen. The ncdump tool is used by scientists wanting to inspect variables and attributes (metadata) contained in the NetCDF file. It also can provide a full text extraction of data including blocks of tabular data representing by variables. While NetCDF files are typically written by a software client, it is possible to generate NetCDF files using ncgen and ncgen3 from a text format. The ncgen tool parses the text file and stores it in a binary format.

Both ncdump and ncgen rely on a text format to represent the NetCDF file called network Common Data form Language (CDL). The CDL syntax as shown below contains annotation along with blocks of data denoted by the "data:" key. For the results to be legible for visual inspection the measurement data is written as delimited blocks of scalar values. As shown in the example below CDL supports multiple variables or blocks of data. The blocks of data while delimited need to be thought of as a vector or single column of tabular data wrapped around to the next line in a similar way that characters can be wrapped around in a single cell block of a spreadsheet to make the spreadsheet more visually appealing to the user.

netcdf foo { // example netCDF specification in CDL dimensions: lat = 10, lon = 5, time = unlimited; variables: int lat(lat), lon(lon), time(time); float z(time,lat,lon), t(time,lat,lon); double p(time,lat,lon); int rh(time,lat,lon); lat:units = "degrees_north"; lon:units = "degrees_east"; time:units = "seconds"; z:units = "meters"; z:valid_range = 0., 5000.; p:_FillValue = -9999.; rh:_FillValue = -1; data: lat = 0, 10, 20, 30, 40, 50, 60, 70, 80, 90; lon = -140, -118, -96, -84, -52;
Example 25
netcdf foo {    // example NetCDF specification in CDL

dimensions:
lat = 10, lon = 5, time = unlimited;

variables:
  int     lat(lat), lon(lon), time(time);
  float   z(time,lat,lon), t(time,lat,lon);
  double  p(time,lat,lon);
  int     rh(time,lat,lon);

  lat:units = "degrees_north";
  lon:units = "degrees_east";
  time:units = "seconds";
  z:units = "meters";
  z:valid_range = 0., 5000.;
  p:_FillValue = -9999.;
  rh:_FillValue = -1;

data:
  lat   = 0, 10, 20, 30, 40, 50, 60, 70, 80, 90;
  lon   = -140, -118, -96, -84, -52;

}

The next example shows a small subset of data block taken from an actual NetCDF file. The blocks of data while delimited need to be thought of as a vector or single column of tabular data wrapped around to the next line in a similar way that characters can be wrapped around in a single cell block of a spreadsheet to make the spreadsheet more visually appealing to the user.

data: base_time = 1020770640 ; time_offset = 0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 46, 48, 50, 52, 54, 56, 58, 60, 62, 64, 66, 68, 70, 72, 74, 76, 78, 80, 82, 84, 86, 88, 90, 92, 94, 96, 98, 100, 102, 104, 106, 108, 110, 112, 114, 116, 118, 120, 122, 124, 126, 128, 130, 132, 134, 136, 138, 140, 142, 144, 146, 148, 150, 152, 154, 156, 158, 160, 162, 164, 166, 168, 170, 172, 174, 176, 178, 180, 182, 184, 186, 188, 190, 192, 194, 196, 198, 200, 202, 204, 206, 208, 210, 212, 214, 216, 218, 220, 222, 224, 226, 228, 230, 232, 234, 236, 238, 240, 242, 244, 246, 248, 250, 252, 254, 256, 258, 260, 262, 264, 266, 268, 270, 272, 274, 276, 278, 280, 282, 284, 286, 288, 290, 292, 294, 296, 298, 300, 302, 304, 306, 308, 310, 312, 314, 316, 318, 320, 322, 324, 326, 328, 330, 332, 334, 336, 338, 340, 342, 344, 346, 348, 350, 352, 354, 356, 358, 360, 362, 364, 366, 368, 370, 372, 374, 376, 378, 380, 382, 384, 386, 388, 390, 392, 394, 396, 398, 400, 402, 404, 406, 408, 410, 412, 414, 416, 418, 420, 422, 424, 426, 428, 430, 432, 434, 436, 438, 440, 442, 444, 446, 448, 450, 452, 454, 456, 458, 460, 462, 464, 466, 468, 470, 472, 474, 476, 478, 480, 482, 484, 486, 488, 490, 492, 494,
Example 26
data:

 base_time = 1020770640 ;

 time_offset = 0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 
    34, 36, 38, 40, 42, 44, 46, 48, 50, 52, 54, 56, 58, 60, 62, 64, 66, 68, 
    70, 72, 74, 76, 78, 80, 82, 84, 86, 88, 90, 92, 94, 96, 98, 100, 102, 
    104, 106, 108, 110, 112, 114, 116, 118, 120, 122, 124, 126, 128, 130, 
    132, 134, 136, 138, 140, 142, 144, 146, 148, 150, 152, 154, 156, 158, 
    160, 162, 164, 166, 168, 170, 172, 174, 176, 178, 180, 182, 184, 186, 
    188, 190, 192, 194, 196, 198, 200, 202, 204, 206, 208, 210, 212, 214, 
    216, 218, 220, 222, 224, 226, 228, 230, 232, 234, 236, 238, 240, 242, 
    244, 246, 248, 250, 252, 254, 256, 258, 260, 262, 264, 266, 268, 270, 
    272, 274, 276, 278, 280, 282, 284, 286, 288, 290, 292, 294, 296, 298, 
    300, 302, 304, 306, 308, 310, 312, 314, 316, 318, 320, 322, 324, 326, 
    328, 330, 332, 334, 336, 338, 340, 342, 344, 346, 348, 350, 352, 354, 
    356, 358, 360, 362, 364, 366, 368, 370, 372, 374, 376, 378, 380, 382, 
    384, 386, 388, 390, 392, 394, 396, 398, 400, 402, 404, 406, 408, 410, 
    412, 414, 416, 418, 420, 422, 424, 426, 428, 430, 432, 434, 436, 438, 
    440, 442, 444, 446, 448, 450, 452, 454, 456, 458, 460, 462, 464, 466, 
    468, 470, 472, 474, 476, 478, 480, 482, 484, 486, 488, 490, 492, 494, 

496,
498,
500,
502,
504,
506,
508,
510,
512,
514,
516,
518,
520,
522;

The format allows for error codes and missing values to be included.

Requires: WellFormedCsvCheck , CsvValidation , UnitMeasureDefinition , MissingValueDefinition and GroupingOfMultipleTables .

Lastly, NetCDF files are typically collected together in larger datasets where they can be analyzed, so the Csv CSV data can be thought of a subset of a larger dataset.

Requires: CsvAsSubsetOfLargerDataset Requires: and AnnotationAndSupplementaryInfo .

3.17 2.17 Use Case #17 - Canonical mapping of CSV

(Contributed by David Booth and Jeremy Tandy)

CSV is by far the commonest format within which open data is published, and is thus typical of the data that application developers need to work with.

However, an object / object graph serialisation (of open data) is easier to consume within software applications. For example, web applications (using HTML5 & Javascript) require no extra libraries to work with data in JSON format. Similarly, RDF-encoded data in from multiple sources can be simply combined or merged using SPARQL queries once persisted within a triple store.

The UK Government policy paper "Open Data: unleashing the potential" outlines a set of principles for publishing open data . Within this document, principle 9 states:

Release data quickly, and then work to make sure that it is available in open standard formats, including linked data formats.

The open data principles recognise how the additional utility to be gained from publishing in linked data formats must be balanced against the additional effort incurred by the data publisher to do so and the resulting delay to publication of the data. Data publishers are required to release data quickly - which means making the data available in a format convenient for them such as CSV dumps from databases or spread sheets.

One of the hindrances to publishing in linked data formats is the difficulty in determining the ontology or vocabulary (e.g. the classes, predicates, namespaces and other usage patterns) that should be used to describe the data. Whilst it is only reasonable to assume that a data publisher best knows the intended meaning of their data, they cannot be expected to determine the ontology or vocabulary most applicable to to a consuming application!

Furthermore, in lieu of agreed de facto standard vocabularies or ontologies for a given application domain, it is highly likely that disparate applications will conform to different data models. How should the data publisher choose which of the available vocabularies or ontologies to use when publishing (if indeed they are aware of those applications at all)!

In order to assist data publishers provide data in linked data formats without the need to determine ontologies or vocabularies, it is necessary to separate the syntactic mapping (e.g. changing format from CSV to JSON) from the semantic mapping (e.g. defining the transformations required to achieve semantic alignment with a target data model).

As a result of such separation, it will be possible to establish a canonical transformation from CSV conforming to the core tabular data model [ tabular-data-model ] to an object graph serialisation such as JSON.

Requires: WellFormedCsvCheck , CsvToJsonTransformation , and CanonicalMappingInLieuOfAnnotation .

This use case assumes that JSON is the target serialisation for application developers given the general utility of that format. However, by considering JSON-LD [ json-ld ], it becomes trivial to map CSV-encoded tabular data via JSON into a canonical RDF model. In doing so this enables CSV-encoded tabular data to be published in linked data formats as required in the open data principle 9 at no extra effort to the data publisher as standard mechanisms are available for a data user to transform the data from CSV to RDF.

Requires: CsvToRdfTransformation .

In addition, open data principle 14 requires that:

Public bodies should publish relevant metadata about their datasets […]; and they should publish supporting descriptions of the format, provenance and meaning of the data.

To achieve this, data publishers need to be able to publish supplementary metadata concerning their tabular datasets, such as title, usage license and description.

Requires: AnnotationAndSupplementaryInfo .

Applications may automatically determine the data type (e.g. date-time, number) associated with fields cells in a CSV file by parsing the data values. However, on occasion, this is prone to mistakes where data appears to resemble something else. This is especially prevalent for dates. For example, 1/4 is often confused with 1 April rather than 0.25 . In such situations, it is beneficial if guidance can be given to the transformation process indicating the data type for given columns.

Requires: SyntacticTypeDefinition .

Provision of CSV data coupled with a canonical mapping provides significant utility by itself. However, there is nothing stopping a data publisher from adding annotation defining data semantics once, say, an appropriate de facto standard vocabulary has been agreed within the community of use. Similarly, a data consumer may wish to work directly with the canonical mapping and wish to ignore any semantic annotations provided by the publisher.

Issue 5 2

This use case lacks a concrete example to further illustrate the concerns. Whilst the concerns outlined above are relevant for a gamut of open data publication, it would be useful to include an example dataset here - which might be subsequently used in test suites relating to the motivating requirements of this use case.

It may be appropriate to merge this use case with UC-IntelligentlyPreviewingCSVFiles as that is concerned with using an interim JSON encoding to support intelligent preview of unannotated tabular data.

3.18 2.18 Use Case #18 - Supporting Semantic-based Recommendations

(Contributed by Davide Ceolin and Valentina Maccatrozzo)

In the ESWC-14 Challenge: Linked Open Data-enabled Recommender Systems , participants are provided with a series of datasets about books in TSV format.

A first dataset contains a set of user identifiers and their ratings for a bunch of books each. Each book is represented by means of a numeric identifier.

DBbook_userID DBbook_itemID rate {snip} 6873 5950 1 6873 8010 1 6873 5232 1
Example 27
DBbook_userID,	DBbook_itemID,	rate
{snip}
6873,		5950,		1
6873,		8010,		1
6873,		5232,		1

{snip}

Ratings can be boolean (0,1) or Likert scale values (from 1 to 5), depending on the challenge task considered.

Requires: SyntacticTypeDefinition , SemanticTypeDefinition , NonStandardFieldDelimiter and NonStandardCellDelimiter .

A second file provides a mapping between book ids and their names and dbpedia URIs:

DBbook_ItemID name DBpedia_uri {snip} 1 Dragonfly in Amber http://dbpedia.org/resource/Dragonfly_in_Amber 10 Unicorn Variations http://dbpedia.org/resource/Unicorn_Variations 100 A Stranger in the Mirror http://dbpedia.org/resource/A_Stranger_in_the_Mirror 1000 At All Costs http://dbpedia.org/resource/At_All_Costs
Example 28
DBbook_ItemID	name				DBpedia_uri
{snip}
1		Dragonfly in Amber		http://dbpedia.org/resource/Dragonfly_in_Amber
10		Unicorn Variations		http://dbpedia.org/resource/Unicorn_Variations
100		A Stranger in the Mirror	http://dbpedia.org/resource/A_Stranger_in_the_Mirror
1000		At All Costs			http://dbpedia.org/resource/At_All_Costs

{snip}

Requires: ForeignKeyReferences .

Participants are requested to estimate the ratings or relevance scores (depending on the task) that users would attribute to a set of books reported in an evaluation dataset:

DBbook_userID DBbook_itemID {snip} 6873 5946 6873 5229 6873 3151
Example 29
DBbook_userID	DBbook_itemID
{snip}
6873		5946
6873		5229
6873		3151

{snip}

Requires: LinksToExternallyManagedDefinitions , R-AssociationOfCodeValuesWithExternalDefinitions .

The challenge mandates the use of Linked Open Data resources in the recommendations.

An effective manner to satisfy this requirement is to make use of undirected semantic paths. An undirected semantic path is a sequence of entities (subject or object) and properties that link two items, for instance:

{Book1 property1 Object1 property2 Book2}
	{Book1 property1 Object1 property2 Book2}

This sequence results from considering the triples (subject-predicate-object) in a given Linked Open Data resource (e.g. DBpedia), independently of their direction, such that the starting and the ending entities are the desired items and that the subject (or object) of a triple is the object (or subject) of the following triple. For example, the sequence above may result from the following triples:

Book1 property1 Object1 Book2 property1 Object1
	Book1 property1 Object1
	Book2 property1 Object1

Undirected semantic paths are classified according to their length. Fixed a length, one can extract all the undirected semantic paths of that length that link two items within a Linked Open Data resource by running a set of SPARQL queries. This is necessary because an undirected semantic path actually corresponds to the union of a set of directed semantic paths. In the source, data are stored in terms of directed triples (subject-predicate-object).

The number of queries that is necessary to run in order to obtain all the undirected semantic paths that link to items is exponential of the length of the path itself (2 n ). Because of the complexity of this task and of the possible latency times deriving from it, it might be useful to cache these results.

CSV is a good candidate for caching undirected semantic paths, because of its ease of use, sharing, reuse. However, there are some open issues related to this. First, since paths may present a variable number of components, one might want to represent paths in a single cell, while being able to separate the path elements when necessary.

For example, in this file , undirected semantic paths are grouped by means of double quotes, and path components are separated by commas. The starting and ending elements of the undirected semantic paths (Book1 and Book2) are represented in two separate columns by means of the book identifiers used in the challenge (see the example below).

Book1 Book2 Path {snip} 1 7680 "http://dbpedia.org/ontology/language,http://dbpedia.org/resource/English_language,http://dbpedia.org/ontology/language" 1 2 "http://dbpedia.org/ontology/author,http://dbpedia.org/resource/Diana_Gabaldon,http://dbpedia.org/ontology/author" 1 2 "http://dbpedia.org/ontology/country,http://dbpedia.org/resource/United_States,http://dbpedia.org/ontology/country"
Example 30
Book1	Book2	Path
{snip}
1	7680	"http://dbpedia.org/ontology/language,http://dbpedia.org/resource/English_language,http://dbpedia.org/ontology/language"
1	2	"http://dbpedia.org/ontology/author,http://dbpedia.org/resource/Diana_Gabaldon,http://dbpedia.org/ontology/author"
1	2	"http://dbpedia.org/ontology/country,http://dbpedia.org/resource/United_States,http://dbpedia.org/ontology/country"

{snip}

Requires: CellValueMicroSyntax CellMicrosyntax and RepeatedProperties .

Second, the size of these caching files may be remarkable. For example, the size of this file described above is ~2GB, and that may imply prohibitive loading times, especially when making a limited number of recommendations.

Since rows are sorted according to the starting and the ending book of the undirected semantic path, then all the undirected semantic paths that link two books are present in a region of the table formed by consecutive rows.

By having at our disposal an annotation of such regions indicating which book they describe, one might be able to select the "slice" of the file he needs to make a recommendation, without having to load it entirely.

Requires: AnnotationAndSupplementaryInfo , and RandomAccess .

2.19 Use Case #19 - Supporting Right to Left (RTL) Directionality

(Contributed by Yakov Shafranovich)

Writing systems affect the way in which information is displayed. In some cases, these writing systems affect the order in which characters are displayed. Latin based languages display text left-to-right across a page (LTR). Languages such as Arabic and Hebrew are written in scripts whose dominant direction is right to left (RTL) when displayed, however when it involves non-native text or numbers it is actually bidirectional.

Irrespective of the LTR or RTL display of characters in a given language, data is serialised such that the bytes are ordered in one sequential order.

Content published in Hebrew and Arabic provide examples of RTL display behaviour.

Note

Tabular data from originating from countries where vertical writing is the norm (e.g. China, Japan) appear to be published with rows and columns as defined in [ RFC4180 ] (e.g. each horizontal line in the data file conveys a row of data, with the first line optionally providing a header with column names). Rows are published in the left to right topology.

The results from the Egyptian Referendum of 2012 illustrate the problem, as can be seen in Fig. 2 Snippet of web page displaying Egyptian Referendum results (2012) .

egypt-referendum-2012-result-web-page-snip.PNG
Fig. 2 Snippet of web page displaying Egyptian Referendum results (2012)

The content in the CSV data file is serialised in the order as illustrated below (assuming LTR rendering):

Example 31


‌ا‌ل‌م‌ح‌ا‌ف‌ظ‌ة‌,‌ن‌س‌ب‌ة‌ ‌م‌و‌ا‌ف‌ق‌,‌ن‌س‌ب‌ة‌ ‌غ‌ي‌ر‌ ‌م‌و‌ا‌ف‌ق‌,‌ع‌د‌د‌ ‌ا‌ل‌ن‌ا‌خ‌ب‌ي‌ن‌,‌ا‌ل‌أ‌ص‌و‌ا‌ت‌ ‌ا‌ل‌ص‌ح‌ي‌ح‌ة‌,‌ا‌ل‌أ‌ص‌و‌ا‌ت‌ ‌ا‌ل‌ب‌ا‌ط‌ل‌ة‌,‌ن‌س‌ب‌ة‌ ‌ا‌ل‌م‌ش‌ا‌ر‌ك‌ة‌,‌م‌و‌ا‌ف‌ق‌,‌غ‌ي‌ر‌ ‌م‌و‌ا‌ف‌ق‌
‌ا‌ل‌ق‌ل‌ي‌و‌ب‌ي‌ة‌,60.0,40.0,"2,639,808","853,125","15,224",32.9,"512,055","341,070"
‌ا‌ل‌ج‌ي‌ز‌ة‌,66.7,33.3,"4,383,701","1,493,092","24,105",34.6,"995,417","497,675"
‌ا‌ل‌ق‌ا‌ه‌ر‌ة‌,43.2,56.8,"6,580,478","2,254,698","36,342",34.8,"974,371","1,280,327"
‌ق‌ن‌ا‌,84.5,15.5,"1,629,713","364,509","6,743",22.8,"307,839","56,670"
{snip}          

A copy of the referendum results data file is also available locally .

Note

Readers should be aware that both the right-to-left text direction and the cursive nature of Arabic text has been explicitly overridden in the example above in order to display each individual character in sequential left-to-right order.

The directionality of the content as displayed does not affect the logical structure of the tabular data; i.e. the cell at index zero is followed by the cell at index 1, and then index 2 etc.

However, without awareness of the directionality of the content, an application may display data in a way that is unintuitive for the a RTL reader. For example, viewing the CSV file using Libre Office Calc (tested using version 3 configured with English (UK) locale) demonstrates the challenge in rendering the content correctly. Fig. 3 CSV data file containing Egyptian Referendum results (2012) displayed in Libre Office Calc shows how the content is incorrectly rendered; cells progress from left-to-right yet, on the positive side, the Arabic text within a given field runs from right-to-left. Similar behaviour is observed in Microsoft Office Excel 2007.

egypt-referendum-2012-result-csv-in-libre-office-3.png
Fig. 3 CSV data file containing Egyptian Referendum results (2012) displayed in Libre Office Calc

By contrast, we can see Fig. 4 CSV data file containing Egyptian Referendum results (2012) displayed in TextWrangler . The simple TextWrangler text editor is not aware that the overall direction is right-to-left, but does apply the Unicode bidirectional algorithm such that lines starting with an Arabic character have a direction base of right-to-left. However, as a result, the numeric digits are also displayed right to left, which is incorrect.

egypt-referendum-2012-result-csv-in-textwrangler.png
Fig. 4 CSV data file containing Egyptian Referendum results (2012) displayed in TextWrangler

It is clear that a mechanism needs to be provided such that one can explicitly declare the directionality which applies when parsing and rendering the content of CSV files.

Note

From Unicode version 6.3 onwards, the Unicode Standard contains new control codes (RLI, LRI, FSI, PDI) to enable authors to express isolation at the same time as direction in inline bidirectional text. The Unicode Consortium recommends that isolation be used as the default for all future inline bidirectional text embeddings. To use these new control codes, however, it will be necessary to wait until the browsers support them. The new control codes are:

  • RLI (RIGHT-TO-LEFT ISOLATE) U+2067 to set direction right-to-left
  • LRI (LEFT-TO-RIGHT ISOLATE) U+2066 to set direction left-to-right
  • FSI (FIRST STRONG ISOLATE) U+2068 to set direction according to the first strong character
  • PDI (POP DIRECTIONAL ISOLATE) U+2069 to terminate the range set by RLI, LRI or FSI

More information on setting the directionality of text without markup can be found here

Requires: RightToLeftCsvDeclaration .

2.20 Use Case #20 - Integrating components with the TIBCO Spotfire platform using tabular data

(Contributed Yakov Shafranovich)

A systems integrator seeks to integrate a new component into the TIBCO Spotfire analytics platform . Reviewing the documentation that describes how to extend the platform indicates that Spotfire employs a common tabular file format for all products: the Spotfire Text Data Format (STDF).

The example from the STDF documentation (below) illustrates a number of the key differences with the standard CSV format defined in [ RFC4180 ].

Example 32
<bom>\! filetype=Spotfire.DataFormat.Text; version=1.0;

\* ich bin ein berliner 
Column A;Column #14B;Kolonn Ö;The n:th column; 
Real;String;Blob;Date;
-123.45;i think there\r\nshall never be;\#aaXzD;2004-06-18;
1.0E-14;a poem\r\nlovely as a tree;\#ADB12=;\?lost in time;
222.2;\?invalid text;\?;2004-06-19;
\?error11;\\förstår ej\\;\#aXzCV==;\?1979;
3.14;hej
å
hå\seller?;\?NIL;\?#ERROR;

Although not shown in this example, STDF also supports list types :

Requires: CellMicrosyntax .

2.21 Use Case #21 - Publication of Biodiversity Information from GBIF using the Darwin Core Archive Standard

(Contributed by Tim Robertson, GBIF, and Jeremy Tandy)

A citizen scientist investigating biodiversity in the Parque Nacional de Sierra Nevada, Spain, aims to create a compelling web application that combines biodiversity information with other environmental factors - displaying this information on a map and as summary statistics.

The Global Biodiversity Information Facility (GBIF) , a government funded open data initiative that spans over 600 institutions worldwide, has mobilised more that 435 million records describing the occurrence of flora and fauna .

Included in their data holdings is "Sinfonevada: Dataset of Floristic diversity in Sierra Nevada forest (SE Spain)" , containing around 8000 records belonging to 270 taxa collected between January 2004 and December 2005.

As with the majority of datasets published via GBIF, the Sinfonevada dataset is available in the Darwin Core Archive format (DwC-A).

In accordance with the DwC-A specification, the Sinfonevada dataset is packaged as a zip file containing:

The metadata file included in the zip package must always be named meta.xml , whilst the tabular data file and supplementary metadata are explicitly identified within the main metadata file.

A copy of the zip package is provided for reference. Snippets of the tab delimited tabular data file and the full metdata file "meta.xml" are provided below.

Example 33
"occurrence.txt"

----------------

id	modified	institutionCode	collectionCode	basisOfRecord	catalogNumber	eventDate	fieldNumber	continent	countryCode	stateProvince	county	locality	minimumElevationInMeters	maximumElevationInMeters	decimalLatitude	decimalLongitude	coordinateUncertaintyInMeters	scientificName	kingdom	phylum	class	order	family	genus	specificEpithet	infraspecificEpithet	scientificNameAuthorship
OBSNEV:SINFONEVADA:SINFON-100-005717-20040930	2013-06-20T11:18:18	OBSNEV	SINFONEVADA	HumanObservation	SINFON-100-005717-20040930	2004-09-30 & 2004-09-30		Europe	ESP	GR	ALDEIRE		1992	1992	37.12724018	-3.116135071	1	Pinus sylvestris Lour.	Plantae	Pinophyta	Pinopsida	Pinales	Pinaceae	Pinus	sylvestris		Lour.
OBSNEV:SINFONEVADA:SINFON-100-005966-20040930	2013-06-20T11:18:18	OBSNEV	SINFONEVADA	HumanObservation	SINFON-100-005966-20040930	2004-09-30 & 2004-09-30		Europe	ESP	GR	ALDEIRE		1992	1992	37.12724018	-3.116135071	1	Berberis hispanica Boiss. & Reut.	Plantae	Magnoliophyta	Magnoliopsida	Ranunculales	Berberidaceae	Berberis	hispanica		Boiss. & Reut.
OBSNEV:SINFONEVADA:SINFON-100-008211-20040930	2013-06-20T11:18:18	OBSNEV	SINFONEVADA	HumanObservation	SINFON-100-008211-20040930	2004-09-30 & 2004-09-30		Europe	ESP	GR	ALDEIRE		1992	1992	37.12724018	-3.116135071	1	Genista versicolor Boiss. ex Steud.	Plantae	Magnoliophyta	Magnoliopsida	Fabales	Fabaceae	Genista	versicolor		Boiss. ex Steud.
{snip}

The key variances of this tabular data file with RFC 4180 is the use of TAB %x09 as the cell delimiter and LF %x0A as the row terminator.

Also note the use of two adjacent TAB characters to indicate an empty cell.

Example 34
"meta.xml"

----------

<archive xmlns="http://rs.tdwg.org/dwc/text/" metadata="eml.xml">
  <core encoding="utf-8" fieldsTerminatedBy="\t" linesTerminatedBy="\n" fieldsEnclosedBy="" ignoreHeaderLines="1" rowType="http://rs.tdwg.org/dwc/terms/Occurrence">
    <files>
      <location>occurrence.txt</location>
    </files>
    <id index="0" />
    <field index="1" term="http://purl.org/dc/terms/modified"/>
    <field index="2" term="http://rs.tdwg.org/dwc/terms/institutionCode"/>
    <field index="3" term="http://rs.tdwg.org/dwc/terms/collectionCode"/>
    <field index="4" term="http://rs.tdwg.org/dwc/terms/basisOfRecord"/>
    <field index="5" term="http://rs.tdwg.org/dwc/terms/catalogNumber"/>
    <field index="6" term="http://rs.tdwg.org/dwc/terms/eventDate"/>
    <field index="7" term="http://rs.tdwg.org/dwc/terms/fieldNumber"/>
    <field index="8" term="http://rs.tdwg.org/dwc/terms/continent"/>
    <field index="9" term="http://rs.tdwg.org/dwc/terms/countryCode"/>
    <field index="10" term="http://rs.tdwg.org/dwc/terms/stateProvince"/>
    <field index="11" term="http://rs.tdwg.org/dwc/terms/county"/>
    <field index="12" term="http://rs.tdwg.org/dwc/terms/locality"/>
    <field index="13" term="http://rs.tdwg.org/dwc/terms/minimumElevationInMeters"/>
    <field index="14" term="http://rs.tdwg.org/dwc/terms/maximumElevationInMeters"/>
    <field index="15" term="http://rs.tdwg.org/dwc/terms/decimalLatitude"/>
    <field index="16" term="http://rs.tdwg.org/dwc/terms/decimalLongitude"/>
    <field index="17" term="http://rs.tdwg.org/dwc/terms/coordinateUncertaintyInMeters"/>
    <field index="18" term="http://rs.tdwg.org/dwc/terms/scientificName"/>
    <field index="19" term="http://rs.tdwg.org/dwc/terms/kingdom"/>
    <field index="20" term="http://rs.tdwg.org/dwc/terms/phylum"/>
    <field index="21" term="http://rs.tdwg.org/dwc/terms/class"/>
    <field index="22" term="http://rs.tdwg.org/dwc/terms/order"/>
    <field index="23" term="http://rs.tdwg.org/dwc/terms/family"/>
    <field index="24" term="http://rs.tdwg.org/dwc/terms/genus"/>
    <field index="25" term="http://rs.tdwg.org/dwc/terms/specificEpithet"/>
    <field index="26" term="http://rs.tdwg.org/dwc/terms/infraspecificEpithet"/>
    <field index="27" term="http://rs.tdwg.org/dwc/terms/scientificNameAuthorship"/>
  </core>
</archive>

The metadata file specifies:

Requires: NonStandardCellDelimiter , ZeroEditAdditionOfSupplementaryMetadata and AnnotationAndSupplementaryInfo .

The ignoreHeaderLines attribute can be used to ignore files with column headings or preamble comments.

In this particular case, the tabular data file is packaged within the zip file, and is referenced locally. However, the DwC-A specification also supports annotation of remote tabular data files, and thus does not require any modification of the source datafiles themselves.

Requires: LinkFromMetadataToData and IndependentMetadataPublication .

Although not present in this example, DwC-A also supports the ability to specify a property-value pair that is applied to every row in the tabular data file, or, in the case of sparse data, for that property-value pair to be added where the property is absent from the data file (e.g. providing a default value for a property).

Requires: SpecificationOfPropertyValuePairForEachRow .

Future releases of DwC-A also seek to provide stronger typing of data formats; at present only date formats are validated.

Requires: SyntacticTypeDefinition .

Whilst the DwC-A format is embedded in many software platforms, including web based tools, none of these seem to fit the needs of the citizen scientist. They want to use existing javascript libraries such as Leaflet , an open-Source javascript library for interactive maps, where possible to simplify their web development effort.

Leaflet has good support for GeoJSON , a JSON format for encoding a variety of geographic data structures.

In the absence of standard tooling, the citizen scientist needs to write a custom parser to convert the tab delimited data into GeoJSON. An example GeoJSON object resulting from this transformation is provided below.

Example 35
{

    "type": "Feature",
    "id": "OBSNEV:SINFONEVADA:SINFON-100-005717-20040930",
    "properties": {
        "modified": "2013-06-20T11:18:18",
        "institutionCode": "OBSNEV",
        "collectionCode": "SINFONEVADA",
        "basisOfRecord": "HumanObservation",
        "catalogNumber": "SINFON-100-005717-20040930",
        "eventDate": "2004-09-30 & 2004-09-30",
        "fieldNumber": "",
        "continent": "Europe",
        "countryCode": "ESP",
        "stateProvince": "GR",
        "county": "ALDEIRE",
        "locality": "",
        "minimumElevationInMeters": "1992",
        "maximumElevationInMeters": "1992",
        "coordinateUncertaintyInMeters": "1",
        "scientificName": "Pinus sylvestris Lour.",
        "kingdom": "Plantae",
        "phylum": "Pinophyta",
        "class": "Pinopsida",
        "order": "Pinales",
        "family": "Pinaceae",
        "genus": "Pinus",
        "specificEpithet": "sylvestris",
        "infraspecificEpithet": "",
        "scientificNameAuthorship": "Lour."
    },
    "geometry": {
        "type": "Point",
        "coordinates": [-3.116135071, 37.12724018, 1992]
    }
}
Note

GeoJSON coordinates are specified in order of longitude, latitude and, optionally, altitude.

Requires: CsvToJsonTransformation .

The citizen scientist notes that many of the terms in a given row are drawn from controlled vocabularies; geographic names and taxonomies. For the application, they want to be able to refer to the authoritative definitions for these controlled vocabularies, say, to provide easy access for users of the application to the defintions of scientific terms such as "Pinophyta".

Requires: AssociationOfCodeValuesWithExternalDefinitions .

Thinking to the future of their application, our citizen scientist anticipates the need to aggregate data across multiple datasets; each of which might use different column headings depending on who compiled the tabular dataset. Furthermore, how can one be sure they are comparing things of equivalent type?

To remedy this, they want to use the definitions from the metadata file meta.xml . The easiest approach to achieve this is to modify their parser to export [ json-ld ] and transform the tabular data into RDF that can be easily reconciled.

The resultant "GeoJSON-LD" takes the form (edited for brevity):

Example 36
{

    "@context": {
        "base": "http://www.gbif.org/dataset/db6cd9d7-7be5-4cd0-8b3c-fb6dd7446472/",
        "Feature": "http://example.com/vocab#Feature",
        "Point": "http://example.com/vocab#Point",
        "modified": "http://purl.org/dc/terms/modified",
        "institutionCode": "http://rs.tdwg.org/dwc/terms/institutionCode",
        "collectionCode": "http://rs.tdwg.org/dwc/terms/collectionCode",
        "basisOfRecord": "http://rs.tdwg.org/dwc/terms/basisOfRecord",
{snip}
    },
    "type": "Feature",
    "@type": "http://rs.tdwg.org/dwc/terms/Occurrence",
    "id": "OBSNEV:SINFONEVADA:SINFON-100-005717-20040930",
    "@id": "base:OBSNEV:SINFONEVADA:SINFON-100-005717-20040930",
    "properties": {
        "modified": "2013-06-20T11:18:18",
        "institutionCode": "OBSNEV",
        "collectionCode": "SINFONEVADA",
        "basisOfRecord": "HumanObservation",
{snip}
    },
    "geometry": {
        "type": "Point",
        "coordinates": [-3.116135071, 37.12724018, 1992]
    }
}

The complete JSON object may be retrieved here .

The unique identifier for each "occurence" record has been mapped to a URI by appending the local identifier (from column id ) to the URI of the dataset within which the recond occurs.

Requires: URIMapping SemanticTypeDefinition and CsvToRdfTransformation .

Note

The @type of the entity is taken from the rowType attribute within the metadata file.

Note

The amendment of the GeoJSON specification to include JSON-LD is a work in progress at the time of writing. Details can be found on the GeoJSON GitHub .

Issue 3

The example "GeoJSON-LD" needs to be verified as correct.

Note

It is the hope of the DwC-A format specification authors that the availability of general metadata vocabulary for describing CSV files, or indeed any tabular text datasets, will mean that DwC-A can be deprecated. This would allow the biodiversity community, and initiatives such as GBIF, to spend their efforts developing tools that support the generic standard rather than their own domain specific conventions and specifications, thus increasing the accessibility of biodiversity data.

To achieve this goal, it essential that the key characteristics of the DwC-A format can be adequately described, thus enabling the general metadata vocabulary to be adopted without needing to modify the existing DwC-A encoded data holdings.

2.22 Use Case #22 - Making sense of other people's data

(Contributed by Steve Peters via Phil Archer with input from Ian Makgill)

spendnetwork.com harvests spending data from multiple UK local and central government CSV files. It adds new metadata and annotations to the data and cross-links suppliers to OpenCorporates and, elsewhere, is beginning to map transaction types to different categories of spending.

For example, East Sussex County Council publishes its spending data as Excel spreadsheets .

A snippet of data from East Sussex County Council indicating payments over £500 for the second financial quarter of 2011 is below to illustrate. White space has been added for clarity. The full data file for that period (saved in CSV format from Microsoft Excel 2007) is provided here: ESCC-payment-data-Q2281011.csv

Example 37
Transparency Q2 - 01.07.11 to 30.09.11 as at 28.10.11,,,,,

                         Name,          Payment category,   Amount,                        Department,Document no.,Post code
{snip}          
               MARTELLO TAXIS,   Education HTS Transport,     £620,"Economy, Transport & Environment",  7000785623,     BN25
               MARTELLO TAXIS,   Education HTS Transport, "£1,425","Economy, Transport & Environment",  7000785624,     BN25
MCL TRANSPORT CONSULTANTS LTD,        Passenger Services, "£7,134","Economy, Transport & Environment",  4500528162,     BN25
MCL TRANSPORT CONSULTANTS LTD,Concessionary Fares Scheme,"£10,476","Economy, Transport & Environment",  4500529102,     BN25          
{snip}

This data is augmented by spendnetwork.com and presented in a Web page . The web page for East Sussex County Council is illustrated in Fig. 5 Payments over £500 for East Sussex County Council July-Sept 2011, illustrated by spendnetwork

spendnetwork1.png
Fig. 5 Payments over £500 for East Sussex County Council July-Sept 2011, illustrated by spendnetwork

Notice the Linked Data column that links to OpenCorporates data on MCL Transport Consultants Ltd . If we follow the 'more' link we see many more cells that spendnetwork would like to include (see Fig. 6 Payment transaction details, illustrated by spendnetwork ). Where data is available from the original spreadsheet it has been included.

spendnetwork2.png
Fig. 6 Payment transaction details, illustrated by spendnetwork

The schema here is defined by a third party (spendnetwork.com) to make sense of the original data within their own model (only some of which is shown here, spendnetwork.com also tries to categorize transactions and more). This model exists independently of multiple source datasets and entails a mechanism for reusers to link to the original data from the metadata. Published metadata can be seen variously as feedback, advertising, enrichment or annotations. Such information could help the publisher to improve the quality of the original source, however, for the community at large it reduces the need for repetition of the work done to make sense of the data and facilitates a network effect . It may also be the case that the metadata creator is better able to put the original data into a wider context with more accuracy and commitment than the original publisher.

Another (similar) scenario is LG-Inform . This harvests government statistics from multiple sources, many in CSV format, and calculate rates, percentages & trends etc. and packages them as a set of performance metrics/measures. Again, it would be very useful for the original publisher to know, through metadata, that their source has been defined and used (potentially alongside someone else's data) in this way.

See http://standards.esd.org.uk/ and the "Metrics" tab therein; e.g. percentage of measured children in reception year classified as obese (3333) .

The analysis of datasets undertaken by both spendnetwork.com and LG-Inform to make sense of other people's tabular data is time-consuming work. Making that metadata available is a potential help to the original data publisher as well as other would-be reusers of it.

Requires: WellFormedCsvCheck , IndependentMetadataPublication , ZeroEditAdditionOfSupplementaryMetadata , AnnotationAndSupplementaryInfo , AssociationOfCodeValuesWithExternalDefinitions , SemanticTypeDefinition , URIMapping and LinkFromMetadataToData .

2.23 Use Case #23 - Collating humanitarian information for crisis response

(Contributed by Tim Davies)

During a crisis response, information managers within the humanitarian community face a significant challenge in trying to collate data regarding humanitarian needs and response activities conducted by a large number of humanitarian actors. The schemas for these data sets are generally not standardized across different actors nor are the mechanisms for sharing the data. In the best case, this results in a significant delay between the collection of data and the formulation of that data into a common operational picture. In the worst case, information is simply not shared at all, leaving gaps in the understanding of the field situation.

The Humanitarian eXchange Language (HXL) project seeks to address this concern; enabling information from diverse parties to be collated into a single "Humanitarian Data Registry". Supporting tools are provided to assist participants in a given response initiative in finding information within this registry to meet their needs.

The HXL standard is designed to be a common publishing format for humanitarian data. A key design principle of the HXL project is that the data publishers are able to continue publication of their data using their existing systems. Unsurprisingly, data publishers often provide their data in tabular formats such as CSV, having exported the content from spreadsheet applications. As a result, the HXL standard is entirely based on tabular data.

During their engagement with the humanitarian response community, the HXL project team have identified two major concerns when working with tabular data:

To address these issues, the HXL project have developed a number of conventions for publishing tabular data in CSV format.

Instead of relying on matching text strings, a data dictionary provides a set of short (3-character) identifier for each field type. These field codes are provided on the first row of the CSV file. The second row of the CSV file provides a textual column name - as determined in the natural language of the data publisher. The field code is normative, whilst the textual column name is merely informative.

Example 38
.         100,          101,            A00

Location name,Location code,People affected
       Town A,     01000001,           2000
       Town B,     01000002,            750
Town
C,
01000003,
1920

(whitespace included for clarity)

Requires: MultipleHeadingRows .

Note

The HXL tabular data conventions also propose an alternate where the 3-character field code is pre-pended to the textual column name; e.g. 100 Location name .

The data dictionary provides both a semantic meaning for the column and the data type for values, as illustrated in the snippet below:

Example 39
Code,                Title,Data type,                                                             Description,   Example

 000,        Date reported,     Date,                            The date that this data record was released.,2014-06-23
 001,Start date applicable,     Date,             The date when the data in this report was first applicable.,2014-05-01
 002,  End date applicable,     Date,              The date when the data in this report was last applicable.,2014-05-31
 010,   Source description,     Text,A free-text description of the source organisation for this data record.,  OXFAM GB
 011,            Source id,     Code,   A unique identifier for the source organisation for this data record.,      5120          
{snip}

(whitespace included for clarity)

Requires: SyntacticTypeDefinition and SemanticTypeDefinition .

Note

At time of writing, the data dictionary does not provide an explicit mapping to the HXL data vocabulary from the earlier proof of concept work.

For multilingual content, this document proposes that the field-identifiers allow an optional "/" followed by an ISO 639 language code. The language code applies to the content contained within that column and may also apply to the column label.

Example 40
.           B00/en,                      B00/fr

     Project title,             Titre du projet
Malaria treatments,     Traitement du paludisme
Teacher
training,Formation
des
enseignant(e)s

(whitespace included for clarity)

Requires: MultilingualContent .

Where multiple data-values for a given field code are provided in a single row, the field code is repeated - as illustrated in the example below that provides geocodes for multiple locations pertaining to the subject of the record.

Example 41
.     101,      101,      101

P-code #1,P-code #2,P-code #3
   020503,         ,
   060107,   060108,
   173219,         ,
   530012,   530013,   530015
279333,
,

(whitespace included for clarity)

Requires: RepeatedProperties .

A snippet of an example of a tabular HXL data file is provided below. A local copy of the HXL data file is also available: HXL_3W_samples_draft_Multilingual.csv .

Example 42
.             000,         010,               B30,             B41,               B40/es,           B40/en,   B50/en,   105/es,                126,   125/en,120/en,                 131,      130/en

Fecha del informe,      Fuente,     Implementador,Código de sector,       Sector / grupo,   Sector / group,Subsector,     País,Código de provincia, Province,Region,Código del municipio,Municipality
       2013-11-19,Mapaction OP,      World VISION,             S01,Refugio de emergencia,Emergency Shelter,         ,Filipinas,           60400000,    Aklan,    VI,                    ,
       2013-11-19,   DHNetwork,DFID Medical Teams,             S02,                Salud,           Health,         ,         ,           60400000,    Aklan,    VI,                    ,
       2013-11-19,   DHNetwork,               MSF,             S02,                Salud,           Health,         ,         ,           60400000,    Aklan,    VI,                    ,
       2013-11-19,  Cluster 3W,     LDS Charities,             S03,                 WASH,             WASH,  Hygiene,Filipinas,           60400000,    Aklan,    VI,                    ,
{snip}

(whitespace included for clarity)

Finally, note that the humanitarian data often includes codes to reference some authoritative term, definition or other resource; e.g. the Province code 60400000 . In order to establish links between data sets, these codes must be reconciled with their official definitions.

Requires: AssociationOfCodeValuesWithExternalDefinitions .

The HXL proof of concept project (from 2012-2013) developed an RDF/OWL vocabulary to describe the resources and properties pertinent to this domain. Whilst the current implementation of HXL build on a tabular data model, rather than a graph data model, the HXL project are reviewing the need to provide an RDF export format. Publication of HXL data as RDF allows that information to be augmented with additional triples to, say, provide additional content or establish links between data sets. Exporting HXL as RDF would greatly benefit from the availability of standard tooling.

Requires: CsvToRdfTransformation .

2.24 Use Case #24 - Expressing a hierarchy within occupational listings

(Contributed by Dan Brickley)

Our user intends to analyze the current state of the job market using information gleaned from job postings that are published using schema.org markup.

Note

schema.org defines a schema for a listing that describes a job opening within an organization: JobPosting .

One of the things our user wants to do is to organise the job postings into categories based on the occupationalCategory property of each JobPosting .

The occupationalCategory property is used to categorize the described job. The O*NET-SOC Taxonomy is schema.org's recommended controlled vocabulary for the occupational categories.

The schema.org documentation notes that value of the occupationalCategory property should include both the textual label and the formal code from the O*NET-SOC Taxonomy, as illustrated below in the following RDFa snippet:

Example 43

<br><strong>Occupational
Category:</strong>
<span
property="occupationalCategory">15-1199.03
Web
Administrators</span>

The O*NET-SOC Taxonomy is republished every few years; the occupational listing for 2010 is the most recent version available. This listing is also available in CSV format . An extract from this file is provided below. A local copy of this CSV file is also available: file = 2010_Occupations.csv .

Example 44
O*NET-SOC 2010 Code,O*NET-SOC 2010 Title,O*NET-SOC 2010 Description

{snip}
15-1199.00,"Computer Occupations, All Other",All computer occupations not listed separately.
15-1199.01,Software Quality Assurance Engineers and Testers,Develop and execute software test plans in order to identify software problems and their causes.
15-1199.02,Computer Systems Engineers/Architects,"Design and develop solutions to complex applications problems, system administration issues, or network concerns. Perform systems management and integration functions."
15-1199.03,Web Administrators,"Manage web environment design, deployment, development and maintenance activities. Perform testing and quality assurance of web sites and web applications."
15-1199.04,Geospatial Information Scientists and Technologists,"Research or develop geospatial technologies. May produce databases, perform applications programming, or coordinate projects. May specialize in areas such as agriculture, mining, health care, retail trade, urban planning, or military intelligence."
15-1199.05,Geographic Information Systems Technicians,"Assist scientists, technologists, or related professionals in building, maintaining, modifying, or using geographic information systems (GIS) databases. May also perform some custom application development or provide user support."
15-1199.06,Database Architects,"Design strategies for enterprise database systems and set standards for operations, programming, and security. Design and construct large relational databases. Integrate new systems with existing warehouse structure and refine system performance and functionality."
15-1199.07,Data Warehousing Specialists,"Design, model, or implement corporate data warehousing activities. Program and configure warehouses of database information and provide support to warehouse users."
15-1199.08,Business Intelligence Analysts,Produce financial and market intelligence by querying data repositories and generating periodic reports. Devise methods for identifying data patterns and trends in available information sources.
15-1199.09,Information Technology Project Managers,"Plan, initiate, and manage information technology (IT) projects. Lead and guide the work of technical staff. Serve as liaison between business and technical aspects of projects. Plan project stages and assess business implications for each stage. Monitor progress to assure deadlines, standards, and cost targets are met."
15-1199.10,Search Marketing Strategists,"Employ search marketing tactics to increase visibility and engagement with content, products, or services in Internet-enabled devices or interfaces. Examine search query behaviors on general or specialty search engines or other Internet-based content. Analyze research, data, or technology to understand user intent and measure outcomes for ongoing optimization."
15-1199.11,Video Game Designers,"Design core features of video games. Specify innovative game and role-play mechanics, story lines, and character biographies. Create and maintain design documentation. Guide and collaborate with production staff to produce games as designed."
15-1199.12,Document Management Specialists,"Implement and administer enterprise-wide document management systems and related procedures that allow organizations to capture, store, retrieve, share, and destroy electronic records and documents."
{snip}

The CSV file follows the specification outlined in [ RFC4180 ] - including the use of pairs of double quotes ( "" ) to escape cells that themselves contain commas.

Also note that each row provides a unique identifier for the occupation it describes. This unique identifier is given in the O*NET-SOC 2010 Code column. This code can be considered as the primary key for each row in the listing as it is unique for every row. Furthermore, the value of the O*NET-SOC 2010 Code column serves as the unique identifier for the occupation.

Requires: PrimaryKey .

Closer inspection of the O*NET-SOC 2010 code illustrates the hierarchical classification within the taxonomy. The first six digits are based on the Standard Occupational Classification (SOC) code from the US Bureau of Labor Statistics, with further subcategorization thereafter where necessary. The first and second digits represent the major group; the third digit represents the minor group; the fourth and fifth digits represent the broad occupation; and the sixth digit represents the detailed occupation.

The SOC structure (2010) is available in Microsoft Excel 97-2003 Workbook format . An extract of this structure, in CSV format (exported from Microsoft Excel 2007), is provided below. A local copy of the SOC structure in CSV is also available: file = soc_structure_2010.csv .

Example 45
Bureau of Labor Statistics,,,,,,,,,

On behalf of the Standard Occupational Classification Policy Committee (SOCPC),,,,,,,,,
,,,,,,,,,
January 2009,,,,,,,,,
*** This is the final structure for the 2010 SOC.   Questions should be emailed to soc@bls.gov***,,,,,,,,,
,,,,,,,,,
,,,,,,,,,
,,,,,,,,,
,,,,,,,,,
,2010 Standard Occupational Classification,,,,,,,,
,,,,,,,,,
Major Group,Minor Group,Broad Group,Detailed Occupation,,,,,,
,,,,,,,,,
11-0000,,,,Management Occupations,,,,,
,11-1000,,,Top Executives,,,,,
,,11-1010,,Chief Executives,,,,,
,,,11-1011,Chief Executives,,,,,
{snip}
,,,13-2099,"Financial Specialists, All Other",,,,,
15-0000,,,,Computer and Mathematical Occupations,,,,,
,15-1100,,,Computer Occupations,,,,,
,,15-1110,,Computer and Information Research Scientists,,,,,
,,,15-1111,Computer and Information Research Scientists,,,,,
,,15-1120,,Computer and Information Analysts,,,,,
,,,15-1121,Computer Systems Analysts,,,,,
,,,15-1122,Information Security Analysts,,,,,
,,15-1130,,Software Developers and Programmers,,,,,
,,,15-1131,Computer Programmers,,,,,
,,,15-1132,"Software Developers, Applications",,,,,
,,,15-1133,"Software Developers, Systems Software",,,,,
,,,15-1134,Web Developers,,,,,
,,15-1140,,Database and Systems Administrators and Network Architects,,,,,
,,,15-1141,Database Administrators,,,,,
,,,15-1142,Network and Computer Systems Administrators,,,,,
,,,15-1143,Computer Network Architects,,,,,
,,15-1150,,Computer Support Specialists,,,,,
,,,15-1151,Computer User Support Specialists,,,,,
,,,15-1152,Computer Network Support Specialists,,,,,
,,15-1190,,Miscellaneous Computer Occupations,,,,,
,,,15-1199,"Computer Occupations, All Other",,,,,
,15-2000,,,Mathematical Science Occupations,,,,,
{snip}

The header line here comes below an empty row and is separated from the data by another empty row. There is metadata about the table in the rows above the header line.

Requires: MultipleHeadingRows and AnnotationAndSupplementaryInfo .

Being familiar with SKOS , our user decides to map both the O*NET-SOC and SOC taxonomies into a single hierarchy expressed using RDF/OWL and the SKOS vocabulary.

Note that in order to express the two taxonomies in SKOS, the local identifiers used in the CSV files (e.g. 15-1199.03 ) must be mapped to URIs.

Requires: URIMapping .

Each of the five levels used across the occupation classification schemes are assigned to a particular OWL class - each of which is a sub-class of skos:Concept :

The SOC taxonomy contains four different types of entities, and so requires several different passes to extract each of those from the CSV file. Depending on which kind of entity is being extracted, a different column provides the unique identifier for the entity. Data in a given row is only processed if the value for the cell designated as the unique identifier is not blank. For example, if the Detailed Occupation column is designated as providing the unique identifier (e.g. to extract entities of type ex:SOC-DetailedOccupation ), then the only rows to be processed in the snippet below would be "Financial Specialists, All Other", "Computer and Information Research Scientists" and "Computer Occupations, All Other". All other rows would be ignored.

Example 46
{snip}

Major Group,Minor Group,Broad Group,Detailed Occupation,                                            ,,,,,
           ,           ,           ,                   ,                                            ,,,,,
{snip}
           ,           ,           ,            13-2099,          "Financial Specialists, All Other",,,,,
    15-0000,           ,           ,                   ,       Computer and Mathematical Occupations,,,,,
           ,    15-1100,           ,                   ,                        Computer Occupations,,,,,
           ,           ,    15-1110,                   ,Computer and Information Research Scientists,,,,,
           ,           ,           ,            15-1111,Computer and Information Research Scientists,,,,,
{snip}
           ,           ,    15-1190,                   ,          Miscellaneous Computer Occupations,,,,,
           ,           ,           ,            15-1199,           "Computer Occupations, All Other",,,,,
           ,    15-2000,           ,                   ,            Mathematical Science Occupations,,,,,
{snip}

(whitespace added for clarity)

Requires: ConditionalProcessingBasedOnCellValues .

The hierarchy in the SOC structure is implied by inheritance from the preceeding row(s). For example, the row describing SOC minor group "Computer Occupations" ( Minor Group = 15-1100 (above) has an empty cell value for column Major Group . The value for SOC major group is provided by the preceeding row. In the case of SOC detailed occupation "Computer Occupations, All Other" ( Detailed Occupation = 15-1199 ), the value of value for column Major Group is provided 20 lines previously when a value in that column was most recently provided. The example snippet below illustrates what the CSV would look like if the inherited cell values were present:

Example 47
{snip}

Major Group,Minor Group,Broad Group,Detailed Occupation,                                            ,,,,,
           ,           ,           ,                   ,                                            ,,,,,
{snip}
    13-0000,    13-2000,    13-2090,            13-2099,          "Financial Specialists, All Other",,,,,
    15-0000,           ,           ,                   ,       Computer and Mathematical Occupations,,,,,
    15-0000,    15-1100,           ,                   ,                        Computer Occupations,,,,,
    15-0000,    15-1100,    15-1110,                   ,Computer and Information Research Scientists,,,,,
    15-0000,    15-1100,    15-1110,            15-1111,Computer and Information Research Scientists,,,,,
{snip}
    15-0000,    15-1100,    15-1190,                   ,          Miscellaneous Computer Occupations,,,,,
    15-0000,    15-1100,    15-1190,            15-1199,           "Computer Occupations, All Other",,,,,
    15-0000,    15-2000,           ,                   ,            Mathematical Science Occupations,,,,,
{snip}

(whitespace added for clarity)

It is difficult to programatically describe how the inherited values should be implemented. It is not as simple as infering the value for a blank cell from the most recent preceeding row when a non-blank value was provided for that column. For example, the last row in the example above describing "Mathematical Science Occupations" does not inherit the values from columns Broad Group and Detailed Occupation in the preceeding row because it describes a new level in the hierarchy.

However, given that the SOC code is a string value with regular structure that reflects the position of a given concept within the hierarchy, it is possible to determine the identifier of each of the broader concepts by parsing the identifier string. For example, the regular expression /^(\d{2})-(\d{2})(\d)\d$/ could be used to split the identifier for a detailed occupation code into its constituent parts from which the identifiers for the associated broader concepts could be constructed.

Issue 4

The REGEXP needs to be verified as correct.

Requires: CellMicrosyntax .

The same kind of processing applies to the O*NET-SOC taxonomy; in this case also extracting a description for the occupation. There is also an additional complication: where a O*NET-SOC code ends in " .00 ", that occupation is a direct mapping to the occupation defined in the SOC taxonomy. For example, the O*NET-SOC code 15-1199.00 refers to the same occupation category as the SOC code 15-1199 : "Computer Occupations, All Other"

To implement this complication, we need to use conditional processing.

If the final two digits of the O*NET-SOC code are " 00 ", then:

else:

The example below illustrates the conditional behaviour:

Example 48
row:

----

15-1199.00,"Computer Occupations, All Other",All computer occupations not listed separately.

resulting RDF (in Turtle syntax):
---------------------------------

ex:15-1199 a ex:SOC-DetailedOccupation ;
    skos:notation "15-1199" ;
    skos:prefLabel "Computer Occupations, All Other" ;
    dct:description "All computer occupations not listed separately." .

row:
----

15-1199.03,Web Administrators,"Manage web environment design, deployment, development and maintenance activities. Perform testing and quality assurance of web sites and web applications."

resulting RDF (in Turtle syntax):
---------------------------------

ex:15-1199.03 a ex:ONETSOC-Occupation ;
    skos:notation "15-1199.03" ;
    skos:prefLabel "Web Administrators" ;
    dct:description "Manage web environment design, deployment, development and maintenance activities. Perform testing and quality assurance of web sites and web applications." ;
skos:broader
ex:15-1199
.

Requires: ConditionalProcessingBasedOnCellValues .

A snippet of the final SKOS concept scheme, expressed in RDF using Turtle [ turtle ] syntax, resulting from transformation of the O*NET-SOC and SOC taxonomies into RDF is provided below. Ideally, all duplicate triples will be removed - such as the skos:prefLabel property for concept ex:15-1190 which would be provided by both the O*NET-SOC and SOC CSV files.

Example 49
ex:15-0000 a ex:SOC-MajorGroup ;

    skos:notation "15-0000" ;
    skos:prefLabel "Computer and Mathematical Occupations" .
ex:15-1100 a ex:SOC-MinorGroup ;
    skos:notation "15-1100" ;
    skos:prefLabel "Computer Occupations" ;
    skos:broader ex:15-0000 .
ex:15-1190 a ex:SOC-BroadGroup ;
    skos:notation "15-1190" ;
    skos:prefLabel "Miscellaneous Computer Occupations" ;
    skos:broader ex:15-0000, ex:15-1100 .
ex:15-1199 a ex:SOC-DetailedOccupation ;
    skos:notation "15-1199" ;
    skos:prefLabel "Computer Occupations, All Other" ;
    dct:description "All computer occupations not listed separately." ;
    skos:broader ex:15-0000, ex:15-1100, ex:15-1190 .
ex:15-1199.03 a ex:ONETSOC-Occupation ;
    skos:notation "15-1199.03" ;
    skos:prefLabel "Web Administrators" ;
    dct:description "Manage web environment design, deployment, development and maintenance activities. Perform testing and quality assurance of web sites and web applications." ;
skos:broader
ex:15-0000,
ex:15-1100,
ex:15-1190,
ex:15-1199
.

Once the SKOS concept scheme has been defined, it is possible for our user to group job postings by SOC Major Group, SOC Minor Group, SOC Broad Group, SOC Detailed Occupation and O*NET-SOC Occupation to provide summary statistics about the job market.

For example, we can use the SKOS concept scheme to group job postings for "Web Administrators" (code 15-1199.03 ) as follows:

4. 3. Requirements

4.1 3.1 Accepted Requirements

4.2 3.2 Candidate Requirements

4.2.1 3.2.1 Requirements relating to parsing of CSV

R-WellFormedCsvCheck
Ability to determine that a CSV is syntactically well formed

In order to automate the parsing of information published in CSV form, it is essential that that content be well-formed with respect to the syntax for tabular data . [ tabular-data-model ].

Motivation: DigitalPreservationOfGovernmentRecords , OrganogramData , ChemicalImaging , ChemicalStructures , UC-NetCdFcDl NetCdFcDl , UC-PaloAltoTreeData PaloAltoTreeData , CanonicalMappingOfCSV , IntelligentlyPreviewingCSVFiles and MakingSenseOfOtherPeoplesData

R-MultipleHeadingRows
R-RightToLeftCsvDeclaration
Ability to handle headings spread across multiple initial rows, as well as determine that a CSV should be rendered using RTL column ordering.

It shall be possible to distinguish between single declare whether a given tabular data file should be rendered with column headings order direction Right-to-Left (RTL); e.g. the first column on the far right, with subsequent columns displayed to the left of the preceeding column.

A "RTL aware" application should use the RTL declaration to determine how to display the a given data file. Left-to-Right (LTR) rendering shall be the default behaviour (in absence of any such declaration).

Note

The directionality of the content does not affect the logical structure of the tabular data; i.e. the cell at index zero is followed by the cell at index 1, and file headings. then index 2 etc. As a result, parsing of RTL tabular data is anticipated to be identical to LTR content.

Motivation: SupportingRightToLeftDirectionality .

3.2.2 Requirements relating to applications

R-CsvValidation
Ability to validate a CSV for conformance with a specified metadata definition

Row headings should The content of a CSV often needs to be distinguished from file headings (if present). Also, validated for conformance against a specification. A specification may be expressed in case subheadings machine-readable format as defined in the Metadata Vocabulary for Tabular Data [ tabular-metadata ].

Validation shall assess conformance against structural definitions such as number of columns and the datatype for a given column. Further validation needs are present, it should to be possible determined. It is anticipated that validation may vary based on row-specific attributes such as the type of entity described in that row.

Dependency: R-WellFormedCsvCheck

Motivation: DigitalPreservationOfGovernmentRecords , OrganogramData , ChemicalImaging , ChemicalStructures , DisplayingLocationsOfCareHomesOnAMap , NetCdFcDl and PaloAltoTreeData .

Issue 5

David Booth suggests :

" It sounds like the R-CsvValidation requirement may need to define their coverage (i.e. how many be split into two separate validation requirements:

R-CsvOpenValidation: Does the data in the CSV conform to the metadata, ignoring inapplicable metadata? For example, is every column in the CSV described by some metadata?

R-CsvClosedValidation: Does the metadata describe anything that does NOT appear in the CSV?

I suppose if the metadata had a notion of optional columns they refer to). then both of these cases could be covered at once. "

However, at this point the use cases only appear to relate to the closed validation case. Do we need another use case to support open validation?

R-CsvToRdfTransformation
Ability to transform a CSV into RDF

Standardised CSV to RDF transformation mechanisms mitigate the need for bespoke transformation software to be developed by CSV data consumers, thus simplifying the exploitation of CSV data. Local identifiers for the entity described in a given row or used to reference some other entity need to be converted to URIs. RDF properties (or property paths ) need to be determined to relate the entity described within a given row to the corresponding data values for that row. Where available, the type of a data value should be incorporated in the resulting RDF. Built-in types defined in RDF 1.1 [ rdf11-concepts ] (e.g. xsd:dateTime , xsd:integer etc.) and types defined in other RDF vocabularies / OWL ontologies (e.g. geo:wktLiteral , GeoSPARQL [ geosparql ] section 8.5.1 RDFS Datatypes refers) shall be supported.

Dependency: R-SemanticTypeDefinition , R-SyntacticTypeDefinition and R-URIMapping .

Motivation: PublicationOfNationalStatistics DigitalPreservationOfGovernmentRecords , AnalyzingScientificSpreadsheets OrganogramData , PublicationOfPropertyTransactionData , RepresentingEntitiesAndFactsExtractedFromText , CanonicalMappingOfCSV , PublicationOfBiodiversityInformation , CollatingHumanitarianResponseInformation and ExpressingHierarchyWithinOccupationalListings .

R-CsvToJsonTransformation
Ability to transform a CSV into JSON

Standardised CSV to JSON transformation mechanisms mitigate the need for bespoke transformation software to be developed by CSV data consumers, thus simplifying the exploitation of CSV data.

Motivation: DisplayingLocationsOfCareHomesOnAMap , IntelligentlyPreviewingCSVFiles , CanonicalMappingOfCSV and PublicationOfBiodiversityInformation .

R-HeadingColumns
R-CsvToXmlTransformation
Ability to handle columns transform a CSV into XML

Standardised CSV to XML transformation mechanisms mitigate the need for bespoke transformation software to be developed by CSV data consumers, thus simplifying the exploitation of CSV data.

Motivation: DigitalPreservationOfGovernmentRecords .

R-CanonicalMappingInLieuOfAnnotation
Ability to transform CSV conforming to the core tabular data model yet lacking further annotation into a object / object graph serialisation

A CSV conforming with the core tabular data model [ tabular-data-model ], yet lacking any annotation that defines rich semantics for that data, shall be able to be transformed into an object / object graph serialisation such as row headers. JSON, XML or RDF using systematic rules - a "canonical" mapping.

The canonical mapping should provide automatic scoping of local identifiers (e.g. conversion to URI), identification of primary keys and detection of data types.

Motivation: CanonicalMappingOfCSV .

R-IndependentMetadataPublication
Ability to publish metadata independently from the tabular data resource it describes

Heading columns, if present, Commonly, tabular datasets are published without the supplementary metadata that enables a third party to correctly interpret the published information. An independent party - in addition to the data publisher - shall be able to publish metadata about such a dataset, thus enabling a community of users to benefit from the efforts of that third party to understand that dataset.

Dependency: R-LinkFromMetadataToData and R-ZeroEditAdditionOfSupplementaryMetadata .

Motivation: MakingSenseOfOtherPeoplesData and PublicationOfBiodiversityInformation .

R-SpecificationOfPropertyValuePairForEachRow
Ability to define a property-value pair for inclusion in each row.

When annotating tabular data, it should be distinguished possible for one to define within the metadata a property-value pair that is repeated for every row in the tabular dataset; for example, the location ID for a set of weather observations, or the dataset ID for a set of biodiversity observations.

In the case of sparsely populated data, this property-value pair must be applied as a default only where that property is absent from the data.

As an illustration, the Darwin Core Archive standard provides the ability to specify such a property value pair within its metadata description file meta.xml .

Example 50
http://data.gbif.org/download/specimens.csv

-------------------------------------------

ID,Species,Count
123,"Cryptantha gypsophila Reveal & C.R. Broome",12
124,"Buxbaumia piperi",2 

meta.xml
--------

<archive xmlns="http://rs.tdwg.org/dwc/text/">
  <core ignoreHeaderLines="1" rowType="http://rs.tdwg.org/dwc/xsd/simpledarwincore/SimpleDarwinRecord">
    <files>
      <location>http://data.gbif.org/download/specimens.csv</location>
    </files>
    <field index="0" term="http://rs.tdwg.org/dwc/terms/catalogNumber" />
    <field index="1" term="http://rs.tdwg.org/dwc/terms/scientificName" />
    <field index="2" term="http://rs.tdwg.org/dwc/terms/individualCount" />
    <field term="http://rs.tdwg.org/dwc/terms/datasetID" default="urn:lsid:tim.lsid.tdwg.org:collections:1"/>
  </core>
</archive>

Thus the original tabular data columns. file specimens.csv is interpreted as:

Example 51
catalogNumber,scientificName,individualCount,datasetID
123,"Cryptantha gypsophila Reveal & C.R. Broome",12,urn:lsid:tim.lsid.tdwg.org:collections:1
124,"Buxbaumia
piperi",2,urn:lsid:tim.lsid.tdwg.org:collections:1

Motivation: AnalyzingScientificSpreadsheets PublicationOfBiodiversityInformation .

R-ConditionalProcessingBasedOnCellValues
Ability to apply conditional processing based on the value of a specific cell

When transforming CSV content into XML, JSON or RDF it shall be possible to vary the transformation of the information in a particular row based on the values within a cell, or element within a cell, contained within that row.

To vary the transformation based on an element within a cell, the value of that cell must be well structured. See CellMicrosyntax for more information.

Motivation: ExpressingHierarchyWithinOccupationalListings .

R-TableNormalization
R-CommentLines
Ability to normalize identify comment lines within a CSV file and skip over them during parsing, format conversion or other processing

A tabular datafile may include comment lines. It shall be possible to declare how to recognize a comment line within the data (e.g. by specifying a sequence of characters that is are found at the beginning of every comment line).

Comment lines shall not be treated as data when parsing, converting or processing the CSV file. During format conversion, the application may try to include the comment in normal form and possibly vice-versa. the conversion.

Motivation: PlatformIntegrationUsingSTDF .

3.2.3 Non-functional requirements

R-ZeroEditAdditionOfSupplementaryMetadata
Ability to add supplementary metadata to an existing CSV file without requiring modification of that file

Tables could presented It may not be possible for a tabular data file to be modified to include the supplementary metadata required to adequately describe the content of the data file. For example, the data may be published by a third party or the user may be constrained in normal form their workflow by choice of tools that do not support or not. We should even recognize the supplementary metadata.

It shall be able possible to handle both and add provide annotations about a given tabular data file without requiring that file to transform one be modified in the other, when necessary. any way; "zero-edit" addition.

Dependency: R-LinkFromMetadataToData .

Motivation: RepresentingEntitiesAndFactsExtractedFromText PublicationOfNationalStatistics , SurfaceTemperatureDatabank , MakingSenseOfOtherPeoplesData and PublicationOfBiodiversityInformation .

Note

Please refer to R-CanonicalMappingInLieuOfAnnotation for details of the requirement to transform a tabular data lacking any supplementary metadata.

R-CellValueMicroSyntax

3.2.4 Data Model Requirements

R-CellMicrosyntax
Ability to parse internal data structure within a cell value

Cell values may represent more complex data structures for a given column such as lists and time stamps. The presence of complex data structures within a given cell is referred to as microsyntax.

If present parsers should have the option of handling the MicroSyntax microsyntax or ignoring it and treating it as a scalar value.

Looking in further detail at the uses of microsyntax, four types of usage are prevalent:

  1. various date/time syntaxes (not just ISO-8601 ones)
  2. delimited lists of literal values to express multiple values of the same property (typically comma " , " delimited, but other delimiters are also used)
  3. embedded structured data such as XML, JSON or well-known-text (WKT) literals
  4. semi structured text

The following requirements pertain to describing and parsing microsyntax:

  • to document microsyntax so that humans can understand what it is conveying; e.g. to provide human-readable annotation
  • to validate the cell values to ensure they conform to the expected microsyntax
  • to label the value as being in a particular microsyntax when converting into JSON/XML/RDF; e.g. marking an XML value as an XMLLiteral or a datetime value as xsd:dateTime
  • to process the microsyntax into an appropriate data structure when converting into JSON/XML/RDF

The ability to declare that a column within a tabular data file carries values of a particular type, and the potential validation of the cell against the declared type, is covered in R-SyntacticTypeDefinition and is not discussed further here.

We can consider cell values with microsyntax to be annotated strings. The annotation (which might include a definition of the format of the string - such as defining the delimiter used for a list) can be used to validate the string and (in some cases) convert it into a suitable value or data structure.

Microsyntax, therefore, requires manipulation of the text if processed. Typically, this will relate to conversion of lists into multiple-valued entries, but may also include reformatting of text to convert between formats (e.g. to convert a datetime value to a date, or locale dates to ISO 8601 compliant syntax).

Note

We assume that lists of values within a given cell all refer to items of the same type - such as a list of authors for a journal article.

Note

At this time, there is no expectation for CSV parsers to be able to take cells with embedded structure (e.g. XML, JSON, WKT etc.) or arbitrarily semi-structured text and convert them into JSON/XML/RDF.

Validation of the embedded structure is assumed to be limited to validation of syntax rather than structure (e.g. ensuring that what is declared as XML is valid XML). No attempt should be made to, say, validate XML content against a XML Schema Document. Please refer to R-SyntacticTypeDefinition for more details on validation of data types.

Issue 6

Do we want to provide a mechanism to hook user-defined call-back functions (or Promises ) into the CSV parser to validate and process / convert embedded structured or semi-structured content?

Motivation: JournalArticleSearch , PaloAltoTreeData , SupportingSemantic-basedRecommendations , ExpressingHierarchyWithinOccupationalListings and PlatformIntegrationUsingSTDF .

R-NonStandardFieldDelimiter
R-NonStandardCellDelimiter
Ability to parse tabular data with field cell delimiters other than comma ( , )

Tabular data is often provided with field cell delimiters other than comma ( , ). Fixed width field formatting is also commonly used.

If a non-standard field cell delimiter is used, it shall be possible to inform the CSV parser about the field cell delimiter or fixed-width formatting.

Motivation: DisplayingLocationsOfCareHomesOnAMap , SurfaceTemperatureDatabank , SupportingSemantic-basedRecommendations , PublicationOfBiodiversityInformation and PlatformIntegrationUsingSTDF .

4.2.2 Requirements relating to annotation of CSV
R-PrimaryKey
Ability to determine the primary key for entities described rows within a CSV tabular data file

Each It shall be possible to uniquely identify every row within a CSV file typically relates to a single entity. In many cases that entity tabular data file. The default behaviour for uniquely identifying rows is to use the object of references from other entities described within the CSV file - or perhaps even from entities described in other CSV files or data resources. Typically within row number. However, some datasets already include a CSV file, primary key identifiers are only unique within the scope of the CSV file within which they are stated (e.g. a local identifier). In order identifier for each row in the entity to be unambiguously identified, the local identifier needs to dataset. In such cases, it shall be converted to a URI (as defined in [ RFC3986 ]). need possible to determine declare which cell column provides the primary key in a given row. where the primary key is defined as a local identfier, need to determine how to convert the local identifier to a URI. Note Assumption that a row within a CSV file describes a single entity for which a primary key can be assigned. key.

Motivation: DigitalPreservationOfGovernmentRecords , OrganogramData , PublicationOfPropertyTransactionData , RepresentingEntitiesAndFactsExtractedFromText , ChemicalImaging , ChemicalStructures PaloAltoTreeData and ExpressingHierarchyWithinOccupationalListings .

R-ForeignKeyReferences
Ability to cross reference between CSV files

To interpret data in a given row of a CSV file, one may need to be able to refer to information provided in supplementary CSV files or elsewhere within the same CSV file; e.g. using a foreign key type reference. The cross-referenced CSV files may, or may not, be packaged together.

Motivation: DigitalPreservationOfGovernmentRecords , OrganogramData , SurfaceTemperatureDatabank , RepresentingEntitiesAndFactsExtractedFromText , SupportingSemantic-basedRecommendations

R-ExternalDataDefinitionResource Ability to reference a Data Definition Resource defining supplementary metadata external to the CSV file
Issue 7

To allow automated processing of a CSV file additional metadata is required to describe the structure and semantics of that file. This additional metadata is termed a AndyS suggests that: " Data Definition Resource ( DDR ). The DDR may cross reference between files should be defined outside the scope of the CSV file with which it is associated; for example, if the DDR is common limited to many CSV files or from one publisher - else they are just web links with no guarantee of whether the DDR is used to drive CSV file validation. In such cases it must be possible to associate target of the DDR link exists which 'foreign key' might imply. "

This seems like a sensible recommendation - but needs confirmation from the CSV file. group.

Motivation: DigitalPreservationOfGovernmentRecords , OrganogramData , SurfaceTemperatureDatabank , RepresentingEntitiesAndFactsExtractedFromText , IntelligentlyPreviewingCSVFiles and SupportingSemantic-basedRecommendations .

R-AnnotationAndSupplementaryInfo
Ability to add annotation and supplementary information to CSV file

Annotations and supplementary information may be associated with:

  • a group of tables
  • an entire table
  • a row
  • a column
  • an individual cell
  • range (or region) of cells within a table

Annotations and supplementary information may be literal values or references to a remote resource. The presence of annotations or supplementary information must not adversely impact parsing of the tabular data (e.g. the annotations and supplementary information must be logically separate).

Note

This requirement refers to provision of human-readable annotation providing additional context to a group of tables, table, column, row, cell or other region within a table. For example, the publication of national statistics use case adds the following annotations to a table:

  • title: Economic activity
  • dimensions: Economic activity (T016A), 2011 Administrative Hierarchy, 2011 Westminster Parliamentary Constituency Hierarchy
  • dataset population: All usual residents aged 16 to 74
  • coverage: England and Wales
  • area types (list omitted here for brevity)
  • textual description of dataset
  • publication information
  • contact details

This is disjoint from the requirements regarding the provision of supplementary metadata to describe the content and structure of a tabular data file in a machine readable form.

Motivation: PublicationOfNationalStatistics , SurfaceTemperatureDatabank , PublicationOfPropertyTransactionData , AnalyzingScientificSpreadsheets , ReliabilityAnalysesOfPoliceOpenData ReliabilityAnalyzesOfPoliceOpenData , OpenSpendingData , RepresentingEntitiesAndFactsExtractedFromText , IntelligentlyPreviewingCSVFiles , CanonicalMappingOfCSV , SupportingSemantic-basedRecommendations , MakingSenseOfOtherPeoplesData , PublicationOfBiodiversityInformation , ExpressingHierarchyWithinOccupationalListings and PlatformIntegrationUsingSTDF .

R-AssociationOfCodeValuesWithExternalDefinitions
Ability to associate a code value with externally managed definition

CSV files make frequent use of code values when describing data. Examples include: geographic regions, status codes and category codes. In some cases, names are used as a unique identifier for a resource (e.g. company name wihtin a transaction audit). It is difficult to interpret the tabular data with out an unambiguous definition of the code values or (local) identifiers used.

It must be possible to unambiguously associate the notation used within a CSV file with the appropriate external definition.

Note

We cannot assume that the publisher of the CSV file will use a URI to reference the code value; most likely they will use a local identifier that is unique within the scope of a particular code list. For example, the Land Registry use the codes "A", "C" and "D" to denote their transactions rather than a fully qualified URI reference to the concept that these codes identify. Thus the requirement here is two fold: Dependency: URIMapping .

to refer to the code list (aka thesaurus) wherein the code value is defined, and to determine the relevant entity within the code list using the notation provided in the cell value.

Motivation: PublicationOfNationalStatistics , PublicationOfPropertyTransactionData , SurfaceTemperatureDatabank , OpenSpendingData , RepresentingEntitiesAndFactsExtractedFromText , IntelligentlyPreviewingCSVFiles , SupportingSemantic-basedRecommendations , MakingSenseOfOtherPeoplesData , PublicationOfBiodiversityInformation and CollatingHumanitarianResponseInformation .

R-CsvAsSubsetOfLargerDataset
Ability to assert how a single CSV file is a facet or subset of a larger dataset

description A large tabular dataset may be split into several files for publication; perhaps to ensure that each file is a manageable size or to publish the updates to a dataset during the (re-)publishing cycle. It shall be added here possible to declare that each of the files is part of the larger dataset and to describe what content can be found within each file in order to allow users to rapidly find the particular file containing the information they are interested in.

Motivation: SurfaceTemperatureDatabank , PublicationOfPropertyTransactionData , JournalArticleSearch , ChemicalImaging , ChemicalStructures , UC-NetCdFcDl and NetCdFcDl .

R-LinksToExternallyManagedDefinitions Ability to provide (hyper)links to externally managed definitions from with a CSV file
Issue 8

description to be added here Ivan says :

Motivation: SurfaceTemperatureDatabank , OpenSpendingData , RepresentingEntitiesAndFactsExtractedFromText , SupportingSemantic-basedRecommendations " Hm. Do we define CSV packaging in this Working Group? Is this in scope? We could decide it is, though we should check this with the charter. "

R-SyntacticTypeDefinition
Ability to declare syntactic type for data values cells within a specified column.

supporting automated recognition syntactic Whilst it is possible to automatically detect the type e.g. of data (e.g. date, number etc. ... further description number) in a given cell, this can be error prone. For example, the date April 1st if written as 1/4 may be interpreted as a decimal fraction.

It shall be possible to declare the data type for the cells in a given column of a tabular data file. Only one data type can be added declared for a given column.

Note

An application may still attempt to automatically detect the data type for a given cell. However, the explicit declaration shall always take precedent.

Note

The data type declaration will typically be used to declare that a column contains integers, floating point numbers or text. However, it may be used to assert that a cell contains, say, embedded XML content ( rdf:XMLLiteral ), datetime values ( xsd:dateTime ) or geometry expressed as well-known-text ( geo:wktLiteral , GeoSPARQL [ geosparql ] section 8.5.1 RDFS Datatypes refers).

Issue 9

Is there any intent for the CSV parser to validate that a given cell conforms to the type declaration?

Motivation: SurfaceTemperatureDatabank , DigitalPreservationOfGovernmentRecords , ReliabilityAnalysesOfPoliceOpenData ReliabilityAnalyzesOfPoliceOpenData , AnalyzingScientificSpreadsheets , RepresentingEntitiesAndFactsExtractedFromText , DisplayingLocationsOfCareHomesOnAMap , IntelligentlyPreviewingCSVFiles , CanonicalMappingOfCSV , SupportingSemantic-basedRecommendations , PublicationOfBiodiversityInformation , CollatingHumanitarianResponseInformation and PlatformIntegrationUsingSTDF .

R-SemanticTypeDefinition
Ability to declare semantic type for data values cells within a specified column.

supporting automated recognition of semantic type, typically expressed Each row in a tabular data set describes a given resource or entity. The properties for each that entity are described in the cells of that row. All the cells in a given column ... further description are anticipated to provide the same property.

It shall be added possible to declare the semantic relationship between the entity that a given row describes and a cell in a given column.

The following example of an occupational listing illustrates how a row of tabular data can be mapped to equivalent content expressed in RDF (Turtle).

The mappings are:

Example 52
CSV

---

O*NET-SOC 2010 Code,O*NET-SOC 2010 Title,O*NET-SOC 2010 Description
         11-1011.00,    Chief Executives,"Determine and formulate policies and provide overall direction of companies [...]."
{snip}                

RDF (Turtle)
------------

ex:11-1011.00
    skos:notation "11-1011.00" ;
    rdfs:label "Chief Executives" ;
dc:description
"Determine
and
formulate
policies
and
provide
overall
direction
of
companies
[...]."
.

A copy of the occupational listing CSV is available locally .

Note

To express semantics in a machine readable form, RDF seems the appropriate choice. Furthermore, best practice indicates that one should adopt common and widely adopted patterns (e.g. RDF vocabularies, OWL ontologies) when publishing data to enable a wide audience to consume and understand the data. Existing (de facto) standard patterns may add complexity when defining the semantics associated with a particular row such that a single RDF predicate is insufficient.

For example, to express a quantity value using QUDT we use an instance of qudt:QuantityValue to relate the numerical value with the quantity kind (e.g. air temperature) and unit of measurement (e.g. Celsius). Thus the semantics needed for a column containing temperature values might be: qudt:value/qudt:numericValue – more akin to a LDPath .

Furthermore, use of OWL axioms when defining a sub-property of qudt:value would allow the quantity type and unit of measurement to be inferred, with the column semantics then being specified as ex:temperature_Cel/qudt:numericValue .

Motivation: DigitalPreservationOfGovernmentRecords , PublicationOfNationalStatistics , SurfaceTemperatureDatabank , ReliabilityAnalysesOfPoliceOpenData ReliabilityAnalyzesOfPoliceOpenData , AnalyzingScientificSpreadsheets , RepresentingEntitiesAndFactsExtractedFromText , IntelligentlyPreviewingCSVFiles , SupportingSemantic-basedRecommendations , MakingSenseOfOtherPeoplesData , PublicationOfBiodiversityInformation and CollatingHumanitarianResponseInformation .

R-MissingValueDefinition
Ability to declare a "missing value" token and, optionally, a reason for the value to be missing

Significant amounts of existing tabular text data include values such as -999 . Typically, these are outside the normal expected range of values and are meant to infer that the value for that cell is missing. Automated parsing of CSV files needs to recognise such missing value tokens and behave accordingly. Furthermore, it is often useful for a data publisher to declare why a value is missing; e.g. withheld or aboveMeasurementRange

Motivation: SurfaceTemperatureDatabank , OrganogramData , OpenSpendingData , UC-NetCdFcDl NetCdFcDl , UC-PaloAltoTreeData PaloAltoTreeData and PlatformIntegrationUsingSTDF .

R-URIMapping
Ability to map the cell values of within a CSV row/column given column into corresponding URI

Tabular data often makes use of local identifiers to uniquely identify an entity described within a tabular data file or to reference an entity described in the same data file or elsewhere (e.g. reference data, code lists, etc.). The local identifier will often be unique within a particular scope (e.g. a code list or data set), but cannot be guaranteed to be globally unique. In order to make these local identifiers globally unique (e.g. so that the entity described by concatenating a row in a tabular data file can be referred to from an external source, or to establish links between the tabular data and the related reference data) it is necessary to map those values local identifiers to URIs.

It shall be possible to declare how local identifiers used within a column of a particular dataset can be mapped to their respective URI. Typically, this may be achieved by concatenating the local identifier with a prefix). prefix - although more complex mappings are anticipated such as removal of "special characters" that are not permitted in URIs (as defined in [ RFC3986 ]) or CURIEs [ curie ]).

Furthermore, where the local identifier is part of a controlled vocabulary, code list or thesaurus, it should be possible to specify the URI for the controlled vocabulary within which the local identfier is defined.

Motivation: DigitalPreservationOfGovernmentRecords , OrganogramData , PublicationOfPropertyTransactionData , AnalyzingScientificSpreadsheets , RepresentingEntitiesAndFactsExtractedFromText , PaloAltoTreeData , PublicationOfBiodiversityInformation , MakingSenseOfOtherPeoplesData and ExpressingHierarchyWithinOccupationalListings .

R-UnitMeasureDefinition
Ability identify/express the unit of measure for the values reported in a given column.

Motivation: AnalyzingScientificSpreadsheets , OpenSpendingData , IntelligentlyPreviewingCSVFiles , ChemicalImaging , ChemicalStructures , UC-NetCdFcDl , NetCdFcDl and UC-PaloAltoTreeData PaloAltoTreeData

4.2.3 Requirements relating to metadata discovery
R-GroupingOfMultipleTables
Ability to group multiple data tables into a single package for publication

When publishing sets of related data tables, it shall be possible to provide annotation for the group of related tables. Annotation concerning a group of tables may include summary information about the composite dataset (or "group") that the individual tabular datasets belong too, such as the license under which the dataset is made available.

The implication is that the group shall be identified as an entity in its own right, thus enabling assertions to be made about that group. The relationship between the group and the associated tabular datasets will need to be made explicit.

Furthermore, where appropriate, it shall be possible to describe the interrelationships between the tabular datasets within the group.

The tabular datasets comprising a group need not be hosted at the same URL. As such, a group does not necessarily to be published as a single package (e.g. as a zip) - although we note that this is a common method of publication.

Motivation: PublicationOfNationalStatistics , OrganogramData , ChemicalStructures , UC-NetCdFcDl and NetCdFcDl .

4.2.4 Requirements relating to applications R-CsvValidation
R-LinkFromMetadataToData
Ability to validate a CSV for conformance with a specified DDR metadata description to explicitly cite the tabular dataset it describes

The content of a CSV often needs to Metadata resources may be validated for conformance against published independently from the tabular dataset(s) it describes; e.g. a specification (e.g. third party may publish metadata in their own domain that describes how they have interpreted the data for their application or community. In such a DDR ). case, the relationship between the metadata and data resources cannot be inferred - it must be stated explicitly.

Validation shall assess conformance against structural definitions such as number of columns Such a link between metadata and the datatype for data resources should be discoverable, thus enabling a given column. Further validation needs are data publisher to be determined. It determine who is anticipated that validation may vary based on row-specific attributes such as referring to their data leading to the type data publisher gaining a better understanding of entity described in that row. their user community.

Dependency: R-WellFormedCsvCheck Motivation: DigitalPreservationOfGovernmentRecords , OrganogramData , ChemicalImaging , ChemicalStructures , DisplayingLocationsOfCareHomesOnAMap , UC-NetCdFcDl , UC-PaloAltoTreeData MakingSenseOfOtherPeoplesData and PublicationOfBiodiversityInformation .

R-CsvToRdfTransformation
R-MultilingualContent
Ability to automatically transform declare a CSV into RDF locale / language for content in a specified column

Standardised CSV to RDF transformation mechanisms mitigate the need for bespoke transformation software to be developed by CSV Tabular data consumers, thus simplifying may contain literal values for a given property in multiple languages. For example, the exploitation name of CSV data. Identifiers used as primary and foreign keys within a CSV file need to town in English, French and Arabic. It shall be converted to URIs. RDF properties (or possible to:

  • specify the property paths ) need for which the literal values are supplied; and
  • specify the language / locale relevant to all data values in a given column.

Additionally, it should be determined possible to relate the provide supplementary labels for column headings in multiple languages.

Motivation: CollatingHumanitarianResponseInformation .

R-RepeatedProperties
Ability to provide multiple values of a given property for a single entity described within a given row to the corresponding tabular data values file

It is commonplace for that row. Where available, the type of a tabular data value should file to provide multiple values of a given property for a single entity. This may be incorporated achieved in a number of ways.

First, the resulting RDF. Built-in types defined in [ multiple rows may be used to describe the same entity; each such row using the same unique identifier for the entity. For example, a country, identified using its two-letter country code , may have more than one name:

Example 53
CSV:

----

country,name
AD,     Andorra
AD,     Principality of Andorra
AF,     Afghanistan
AF,     Islamic Republic of Afghanistan    
{snip}

Equivalent JSON:
----------------

[{
  "country": "AD",
  "name": [ "Andorra", "Principality of Andorra" ]
},{
  "country": "AF",
  "name": [ "Afghanistan", "Islamic Republic of Afghanistan" ]
}]
rdf11-concepts

Second, a single row within a tabular data set may contain multiple values for a given property by declaring that multiple columns map to the same property. For example, multiple locations:

Example 54
CSV:

----

geocode #1,geocode #2,geocode #3
    020503,          ,
    060107,    060108,
    173219,          ,
    530012,    530013,    530015
    279333,          ,
    
Equivalent RDF (in Turtle syntax):
----------------------------------

row:1 admingeo:gssCode ex:020503 .
row:2 admingeo:gssCode ex:060107, ex:060108 .
row:3 admingeo:gssCode ex:173219 .
row:4 admingeo:gssCode ex:530012, ex:530013, ex:530015 .
row:5
admingeo:gssCode
ex:279333
.
] (e.g.
xsd:dateTime
, Note xsd:integer

In this case, it is essential to declare that each of the columns refer to the same property. In the example above, all the geocode etc.) and types defined columns in other RDF vocabularies / OWL ontologies (e.g. the example above map to geo:wktLiteral admin:gssCode from GeoSPARQL ) shall be supported. Dependency: R-SemanticTypeDefinition , R-SyntacticTypeDefinition , R-PrimaryKey .

Motivation: DigitalPreservationOfGovernmentRecords , OrganogramData , PublicationOfPropertyTransactionData , RepresentingEntitiesAndFactsExtractedFromText , CanonicalMappingOfCSV Finally, microsyntax R-CsvToJsonTransformation Ability to automatically transform may provide a CSV into JSON Standardised CSV to JSON transformation mechanisms mitigate the need for bespoke transformation software to be developed by CSV data consumers, thus simplifying list of values within a single cell. For example, a semi-colon " ; " delimited list of comments about the exploitation characteristics of CSV data a tree within a municipal database:

Example 55
CSV:
----

GID,Tree ID, On Street,From Street,To Street,             Species,[...],Comments
  6,     34,ADDISON AV, EMERSON ST,RAMONA ST,Robinia pseudoacacia,[...],cavity or decay; trunk decay; codominant leaders; included bark; large leader or limb decay; previous failure root damage; root decay;  beware of BEES.             
{snip}

Equivalent JSON:
----------------

[{
  "GID": "6",
  "Tree_ID": "34",
  "On_Street": "ADDISON AV",
  "From_Street": "EMERSON ST",
  "To_Street": "RAMONA ST",
  "Species": "Robinia pseudoacacia",
  "Comments": [ "cavity or decay", "trunk decay", "codominant leaders", "included bark", "large leader or limb decay", "previous failure root damage", "root decay", "beware of BEES."]
}]

Note that the example above is based on the Palo Alto tree data use case ; albeit truncated for clarity.

Motivation: DisplayingLocationsOfCareHomesOnAMap JournalArticleSearch , CanonicalMappingOfCSV PaloAltoTreeData , IntelligentlyPreviewingCSVFiles SupportingSemantic-basedRecommendations and CollatingHumanitarianResponseInformation .

Issue 6 Note

Apparently, we have In writing this requirement, no assumption has been made regarding how the repeated values should be implemented in RDF, JSON or XML.

In RDF, it may be more appropriate to use case requiring a CSV Collections to group the property values, rather than simple repeated properties as illustrated above.

Should XML conversion, yet. wrap the repeated elements in an enclosing tag or simply provide the list of repeated elements?

Should JSON use arrays or some other mechanism?

etc.

R-CanonicalMappingInLieuOfAnnotation

3.3 Deferred requirements

R-MultipleHeadingRows
Ability to transform CSV conforming handle headings spread across multiple initial rows, as well as to the core tabular data model yet lacking further annotation into a object / object graph serialisation distinguish between single column headings and file headings.

Establish a canonical mapping Row headings should be distinguished from CSV conforming with the core tabular data model , yet lacking any annotation that defines rich semantics for that data, into an object / object graph serialisation such as JSON or RDF. The canonical mapping file headings (if present). Also, in case subheadings are present, it should provide automatic scoping of local identifiers (e.g. conversion be possible to URI), identification of primary keys and detection of data types. define their coverage (i.e. how many columns they refer to).

Motivation: CanonicalMappingOfCSV PublicationOfNationalStatistics , AnalyzingScientificSpreadsheets , IntelligentlyPreviewingCSVFiles , CollatingHumanitarianResponseInformation , ExpressingHierarchyWithinOccupationalListings and PlatformIntegrationUsingSTDF .

R-RandomAccess
Ability to access and/or extract part of a CSV file in a non-sequential manner.

Large datasets may be hard to process in a sequential manner. It may be useful to have the possibility to directly access part of them, possibly by means of a pointer to a given row, cell or region.

Motivation: SupportingSemantic-basedRecommendations .

4.2.5 Non-functional requirements R-ZeroEditCompatibility
R-TableNormalization
Compatibility of Ability to transform data analysis tools that is published in common usage with CSV+ a normalized form into tabular data.

CSV+ format should Textual data may be compatible, at least at published in a basic level, with normalized form; often improving human readability by reducing the data analysis tools number of lines in common usage. At the data file. As a minimum, existing tools should result, such a normalized data file will no longer be able to interpret CSV+ regular as though it were CSV (as defined additional informtion is included in [ RFC4180 ]). each row (e.g. the number of columns will vary because more cells are provided for some rows).

Issue 7 Note

Whilst we have a list Use of CSV tools emerging on the wiki we are yet to define term normalized is meant in a general sense, rather than the list of tools with which we expect, or aspire, specific meaning relavant to retain compatibility with. We also do not have an authoritative set of conformance tests yet. relational databases .

Such a normalized data file must be transformed into a tabular data file, as defined by the model for tabular data [ tabular-data-model ], prior to applying any further transformation.

Motivation: RepresentingEntitiesAndFactsExtractedFromText .

Issue 8 Note

Orphaned requirement as a result of re-editing The motivating use case: PublicationOfNationalStatistics case is an example where we have a CSV file that is not well-formed - in this particular case, the number of columns varies row by row amd there fore does not conform to the model for tabular data [ tabular-data-model ].

This use case used The ability to assert: transform a data file into a Statistical tabular data file is currently published in Microsoft Excel Workbook format in order overcome limitations inherent in simple data formats a necessary prerequisite for any subsequent transformation. That said, such as CSV. However, it a transformation is also important to remember that the data user community also have preferred tools for consuming outside the statistical data. Spreadsheet applications such scope of this working group as Microsoft Excel, Numbers for Mac and LibreOffice Calc are it requires a mainstay of the ubiquitously deployed desktop productivity software, enabling parsing a large swath of users to work with data provided in Microsoft Excel Workbook format. It is important that compatibility, at least at a basic level, is maintained file with the data analysis tools in common usage. any structure.

R-ZeroEditAdditionOfSupplementaryMetadata Ability to add supplementary metadata

Such pre-processing to an existing CSV file without requiring modification of that create a tabular data file description from a given structure is likely to be added here Motivation: PublicationOfNationalStatistics , SurfaceTemperatureDatabank reasonably simple for a programmer to implement, but it cannot be generalised.

A. References

A.1 Normative references

[json]
The JSON Data Interchange Format, ECMA-404 . ECMA Standard. URL: http://www.ecma-international.org/publications/files/ECMA-ST/ECMA-404.pdf
[rdf11-concepts]
Richard Cyganiak; David Wood; Markus Lanthaler. RDF 1.1 Concepts and Abstract Syntax . W3C Recommendation. URL: http://www.w3.org/TR/rdf11-concepts/
[tabular-data-model]
Jeni Tennison; Gregg Kellogg. Model for Tabular Data and Metadata on the Web . W3C Editor's Draft. URL: http://w3c.github.io/csvw/syntax/index.html
[tabular-metadata]
Rufus Pollock; Jeni Tennison. Metadata Vocabulary for Tabular Data . W3C Editor's Draft. URL: http://w3c.github.io/csvw/metadata/index.html
[xml]
Tim Bray; Jean Paoli; Michael Sperberg-McQueen; Eve Maler; François Yergeau et al. Extensible Markup Language (XML) 1.0 (Fifth Edition) . 26 November 2008. W3C Recommendation. URL: http://www.w3.org/TR/xml

A.2 Informative references

[RFC3986]
T. Berners-Lee; R. Fielding; L. Masinter. Uniform Resource Identifier (URI): Generic Syntax (RFC 3986) . January 2005. RFC. Internet Standard. URL: http://www.ietf.org/rfc/rfc3986.txt
[RFC4180]
Y. Shafranovich. Common Format and MIME Type for Comma-Separated Values (CSV) Files . October 2005. Internet RFC 3987. Informational. URL: http://www.ietf.org/rfc/rfc4180.txt
[curie]
Mark Birbeck; Shane McCarron. CURIE Syntax 1.0 . 16 December 2010. W3C Note. URL: http://www.w3.org/TR/curie
[geosparql]
OGC GeoSPARQL - A Geographic Query Language for RDF Data . OpenGIS Implementation Specification. URL: https://portal.opengeospatial.org/files/?artifact_id=47664
[json-ld]
Manu Sporny; Gregg Kellogg; Markus Lanthaler. JSON-LD 1.0 . 16 January 2014. W3C Recommendation. URL: http://www.w3.org/TR/json-ld/
[rdf11-concepts]
[turtle]
Richard Cyganiak; David Wood; Markus Lanthaler. Eric Prud'hommeaux; Gavin Carothers. RDF 1.1 Concepts and Abstract Syntax Turtle . 25 February 2014. W3C Recommendation. URL: http://www.w3.org/TR/rdf11-concepts/ http://www.w3.org/TR/turtle/