The Model for Tabular Data and Metadata on the Web describes mechanisms for extracting metadata from CSV documents starting with either a tabular data file, or a metadata description. In the case of starting with a CSV document, a procedure is followed to locate metadata describing that CSV (see Locating Metadata in [[tabular-data-model]]). Alternatively, processing may begin with a metadata file directly, which references the tabular data file(s). However, in some cases, it is preferred to publish datasets using HTML rather than starting with either CSV or metadata files.

Secondly, tabular data is often contained within HTML in the form of HTML table elements (see [[html5]]). This document describes a means of identifying such tables from [[tabular-metadata]] and extracting annotated tabular data from HTML tables.

This document does not attempt to address the full range of ways in which tabular datasets can be used within browser based applications, e.g. related Javascript efforts such as IndexedDB and Web Components. It is concerned primarily with providing additional information about tabular data. Discussion of deeper integration into Web-based apps is encouraged via the CSVW Community Group.

The CSV on the Web Working Group was chartered to produce a recommendation "Access methods for CSV Metadata" as well as recommendations for "Metadata vocabulary for CSV data" and "Mapping mechanism to transforming CSV into various formats (e.g., RDF, JSON, or XML)". This non-normative document describes extensions for discovering [[tabular-metadata]] within HTML documents, and for extracting annotated tables from HTML tables. The normative standards are:

Embedding Tabular Metadata within HTML Documents

Metadata may be exposed in an HTML document in a couple of different ways.

Embedding Metadata in a script Element

This section describes mechanisms similar to Embedding JSON-LD in HTML Documents (see [[json-ld]]) for embedding metadata within an HTML document.

HTML script elements can be used to embed data blocks in documents (see Scripting in [[html5]]). Metadata [[tabular-metadata]] describing one or more tabular data files can be embedded in HTML, which can be used as an alternative way to publish datasets.

The content should be placed in a script element with the type set to application/csvm+json. The character encoding of the embedded metadata will match the HTML documents encoding.

<html>
  <head>
    <script type="application/csvm+json">
    {
      "@context": "http://www.w3.org/ns/csvw",
      "tables": [{
        "url": "countries.csv",
        "tableSchema": {
          "columns": [{
            "name": "countryCode",
            "titles": "countryCode",
            "datatype": "string",
            "propertyUrl": "http://www.geonames.org/ontology{#_name}"
          }, {
            "name": "latitude",
            "titles": "latitude",
            "datatype": "number"
          }, {
            "name": "longitude",
            "titles": "longitude",
            "datatype": "number"
          }, {
            "name": "name",
            "titles": "name",
            "datatype": "string"
          }],
          "aboutUrl": "countries.csv{#countryCode}",
          "propertyUrl": "http://schema.org/{_name}",
          "primaryKey": "countryCode"
        }
      }, {
        "url": "country_slice.csv",
        "tableSchema": {
          "columns": [{
            "name": "countryRef",
            "titles": "countryRef",
            "valueUrl": "countries.csv{#countryRef}"
          }, {
            "name": "year",
            "titles": "year",
            "datatype": "gYear"
          }, {
            "name": "population",
            "titles": "population",
            "datatype": "integer"
          }],
          "foreignKeys": [{
            "columnReference": "countryRef",
            "reference": {
              "resource": "countries.csv",
              "columnReference": "countryCode"
            }
          }]
        }
      }]
    }
    </script>
    ...
  </head>
  <body>
    ...
  </body>
</html>
        

Depending on how the HTML document is served, script content may need to be escaped. See Restrictions for contents of script elements in [[html5]] for more information.

Processing embedded metadata is the same as processing Overriding Metadata where the retrieved document type is text/html or application/xhtml+xml instead of a JSON document type. The base URI of the encapsulating HTML document provides a "Base URI Embedded in Content" per [[RFC3986]] section 5.1.1; metadata is extracted from the first script element having @type application/csvm+json. Metadata documents parsed from an HTML DOM will be a stream of character data rather than a stream of UTF-8 encoded bytes. No decoding is necessary if the HTML document has already been parsed into DOM. Each matching script data block is considered to be it's own metadata document.

Linking to Metadata

An alternative to embedding metadata within a script element is linking to the metadata using an HTTP Link header and/or an HTML link element using the equivalent mechanism described for CSV files by Link Header in [[tabular-data-model]]. Linked metadata provides an alternate mechanism for referencing metadata that would otherwise be discovered by Locating Metadata as defined in [[tabular-data-model]]. See The link element in [[html5]] for more information.

HTTP/1.1 200 OK
Link: <metadata.jsonld>; rel="describedby"
Content-Type: text/html

<html>
  <head>
    <link rel="describedby" type="application/csvm+json" href="metadata.json"/>
    ...
  </head>
</html>
        

The preceding example shows an HTTP response for an HTML document containing a link element referencing external metadata, along with an HTTP Link header referencing the same metadata.

HTML and HTTP Link references must be consistent

If using both HTML link and HTTP Link it is important to reference the same metadata URI.

Prefer embedded metadata

To avoid inconsistencies, do not both embed metadata and link metadata as differences in the embedded representation and the linked representation can cause processing inconsistencies.

Extracting Tabular Data from HTML Tables

This section describes a mechanism for locating tabular data within an HTML document, extracting tabular data from an identified table element, and processing the tabular data to create annotated tables.

In addition to tabular data files, a metadata table id may reference an HTML table within an HTML document. A reference within an HTML document is described using a document-relative fragment identifier which is defined using the @id attribute on an HTML table element.

Include metadata and referenced HTML tables in a single HTML document

HTML documents which are self contained, including both embedded metadata which references HTML tables contained within the same document, are preferred to HTML tables or CSV files defined externally.

Consideration must be given to the generation of URLs. The standard forms of both JSON [[csv2json]] and RDF [[csv2rdf]] generate URLs by appending a fragment identifier to the table URL to identify rows. Also, unless an explicit propertyUrl is defined, RDF properties are also generated using a fragment of the table URL.

Avoid automatically generated URLs

Explicitly define aboutUrl, propertyUrl, and valueUrl, where appropriate, to avoid using automatically generated URL fragments which conflict with using fragments to identify tables.

Extracting HTML Tables

Raw tabular data may be extracted from HTML tables with use of the dialect description as with CSV tables.

Processing extracted tables is otherwise handled in a similar manner to CSV as defined in Parsing Tabular Data in [[tabular-data-model]].

Processors using a Document Object Model Model [[DOM]] may have their content coerced to a normalized including optional elements such as tbody.

Header rows proceed content rows

Tables should be organized with the first rows containing only th elements to describe column headers. Subsequent rows should contain only td elements to describe table data.

Avoid use of @colspan and @rowspan attributes

The processing algorithm for tabular data does not account for differences in column counts and row counts that might be present in HTML tables using the @colspan and/or @rowspan attributes; use of these attributes should be avoided. Note that a header row containing @colspan, or a data column containing @rowspan may be ignored using appropriate dialect descriptions.

Example

The following tables are identified using #countries and #country_slice:

Countries
countryCodelatitudelongitudename
AD42.51.6Andorra
AE23.453.8United Arab Emirates
AF33.967.7Afghanistan
Country Slice
countryRefyearpopulation
AF19609616353
AF19619799379
AF19629989846
<table id="countries">
  <caption>Countries</caption>
  <tr><th>countryCode</th><th>latitude</th><th>longitude</th><th>name</th></tr>
  <tr><td>AD</td><td>42.5</td><td>1.6</td><td>Andorra</td></tr>
  <tr><td>AE</td><td>23.4</td><td>53.8</td><td>United Arab Emirates</td></tr>
  <tr><td>AF</td><td>33.9</td><td>67.7</td><td>Afghanistan</td></tr>
</table>
<table id="country_slice">
  <caption>Country Slice</caption>
  <tr><th>countryRef</th><th>year</th><th>population</th></tr>
  <tr><td>AF</td><td>1960</td><td>9616353</td></tr>
  <tr><td>AF</td><td>1961</td><td>9799379</td></tr>
  <tr><td>AF</td><td>1962</td><td>9989846</td></tr>
</table>
        

The metadata is describe here in a script element:

          
        

Generating Minimal JSON from this document should result in the following:

        

Generating Minimal RDF from this document should result in the following:

        

Extracting Tabular Data from embedded CSV

This section describes a mechanism for locating tabular data within an HTML document, extracting tabular data from an identified script element, and processing the tabular data to create annotated tables.

In addition to embedded metadata, CSV data may also be embedded within HTML using a script element. The general provisions and access patterns described in apply for embedded CSV data.

Example

The following CSV script elements are identified using #countries and #country_slice:

<script id="countries" type="text/csv">
countryCode,latitude,longitude,name
AD,42.5,1.6,Andorra
AE,23.4,53.8,"United Arab Emirates"
AF,33.9,67.7,Afghanistan
</script>

<script id="country_slice" type="text/csv">
countryRef,year,population
AF,1960,9616353
AF,1961,9799379
AF,1962,9989846
</script>
        

The metadata shown in can be used to access embedded CSV as well as HTML tables.