This document discusses the following issue: because Portable Web Publications (PWP) can exist in different states, it should be possible to locate a PWP, or a resource within a PWP, using a transparent locator that does not depend on which state the PWP is in, in line with the goals of the PWP draft specification.

This is an early draft; expect this document to change substantially and often.

Definitions

The Problem

A PWP is published on the server. This PWP includes the resource for an image of the Mona Lisa, and it is published in at least one of the different states:

There may be other states, mostly similarly packed states that use a different archiving technology (e.g., tar.gz). Their treatment is similar; we do not consider them in what follows because they do not change the considerations below.

The published PWP is assigned a canonical locator, e.g., the URL https://example.org/published-books/1. Reflecting the (file) structure of the unpacked state, the canonical locator of the Mona Lisa image is https://example.org/published-books/1/img/mona_lisa.jpg.

As the example shows, the exact URL string of the canonical locator may be structurally different from the PWP locators, i.e., it is not necessarily a substring of one or the other. The canonical locator may be identical to the PWP locator of the unpacked state, but not necessarily.

In practical terms, this means that, whenever possible, the canonical locator should be used when referring to the published PWP in, e.g., annotations. The same holds for URLs derived from the canonical locator, like https://example.org/published-books/1/img/mona_lisa.jpg.

Functionalities of the PWP processor

The functionalities of the processor can be divided into two steps:

  1. finding, based on the canonical locator, the right values of the various state locators
  2. extracting, based on a specific state locator, the exact locator for additional, internal resources

Access the state locators based on the canonical locator

The complete manifest of a PWP (which contains a number of items that are required by the PWP to operate properly) MUST include the canonical locator as well as all available state locators. Consequently, in order to retrieve the state locators, the PWP Processor must first retrieve the PWP manifest using the canonical locator; once this has been achieved, the state locators are readily available.

The essential steps of the retrieval process are described in a separate section (see "Algorithm to retrieve the PWP manifest"). A high-level description of the steps is as follows:

  1. The PWP Processor issues an HTTP GET request using the value of the canonical locator.
  2. Depending on the type of the returned resource, an initial value for the PWP manifest is established. The simplest case is when the returned resource is itself a manifest; otherwise the PWP manifest has to be extracted from the returned payload, possibly by combining manifests of different origins.
  3. Furthermore, the HTTP response header may also refer to yet another manifest. If so, this is combined with the manifest retrieved from the payload in the previous step, yielding the final PWP manifest.

This algorithm is typically performed by the PWP Processor when initialized with the canonical locator L of a particular PWP instance.

An important consequence of the algorithm is that it defines a priority for the values of the state locators in case they are contained in different manifests. In particular, locator values retrieved via the HTTP response header have the highest priority.

Access internal resources

Once the full manifest is constructed, the published PWP can be retrieved, or a resource within that PWP can be retrieved. Based on the canonical locator, the PWP processor can derive a relative locator for the image, i.e., img/mona_lisa.jpg. Then, the PWP processor can access the image:

There may be “smarter” PWP Processors that make use of local facilities like caching, but those do not modify these conceptual approaches.
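As an illustration of the derivation described above, the following is a minimal sketch, not a normative procedure. It assumes that the resource URL is a simple extension of the canonical locator and that the unpacked state is reachable at a hypothetical state locator (https://example.org/hypothetical-unpacked/1/); the function names are likewise invented for this example.

```python
from urllib.parse import urljoin


def relative_locator(canonical: str, resource: str) -> str:
    """Return the locator of `resource` relative to the canonical locator."""
    base = canonical.rstrip("/") + "/"
    if not resource.startswith(base):
        raise ValueError("resource is not derived from the canonical locator")
    return resource[len(base):]


def resolve_in_unpacked_state(state_locator: str, relative: str) -> str:
    """Resolve a relative locator against the locator of an unpacked state."""
    return urljoin(state_locator.rstrip("/") + "/", relative)


canonical = "https://example.org/published-books/1"
image = "https://example.org/published-books/1/img/mona_lisa.jpg"

rel = relative_locator(canonical, image)
# Hypothetical state locator, used only for illustration.
target = resolve_in_unpacked_state("https://example.org/hypothetical-unpacked/1", rel)
```

For a packed state, the same relative locator would instead be used to look up the entry inside the downloaded package.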

Local (downloaded) PWPs

Algorithm to retrieve the PWP manifest

The goal of this algorithm is to obtain the PWP manifest based on the value of the canonical locator L. This algorithm is performed by the PWP Processor, typically when it is initialized with the canonical locator L of a particular PWP instance. The core of the algorithm consists of retrieving the PWP manifest from the HTTP(S) responses to an HTTP GET request on L.

If the PWP Processor already has the publication cached, then the cached copy will probably prevail (modulo cache state) and there may be no HTTP request in the first place. This section really refers to the situation of a first access.

In what follows, as an abuse of notation, HTTP GET U, for a URL U, refers to an HTTP or HTTPS request issued to the domain part of U, using the path from U. For example, if U is http://www.ex.org/a/b/c, then HTTP GET U stands for:

GET /a/b/c HTTP/1.1
Host: www.ex.org

See [[rfc2616]] for further details.
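The notation can be made concrete with a small sketch that splits U into its domain part and path and builds the corresponding request text. The function name is hypothetical, and the sketch only produces the request text; it does not open a connection.

```python
from urllib.parse import urlsplit


def http_get_request_text(url: str) -> str:
    """Build the request text that "HTTP GET U" stands for."""
    parts = urlsplit(url)
    path = parts.path or "/"          # an empty path is requested as "/"
    if parts.query:
        path += "?" + parts.query
    # netloc keeps an explicit port, which the Host header should carry too.
    return f"GET {path} HTTP/1.1\r\nHost: {parts.netloc}\r\n\r\n"


request_text = http_get_request_text("http://www.ex.org/a/b/c")
```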

As another abuse of notation, the algorithm refers to a manifest retrieved via HTTP and then manipulated via, e.g., the combination of manifests; that step, in fact, involves parsing the serialized manifest file and manipulating the abstract content in an implementation-specific way.

With these prerequisites, the algorithm is as follows (see also the figure as a visual aid to the algorithm). The input to the algorithm is the canonical locator of the PWP instance, L.

  1. Create two, initially empty, manifests, denoted respectively M2 and M3.
  2. Issue an HTTP GET L request.
  3. If the response is not successful (e.g., the response code is a 404), the process fails with no results.
  4. Otherwise, perform the two, independent processing steps below, yielding possibly new values to M2 and M3, respectively.
    1. Consider the resource returned by the HTTP request to, possibly, provide a new value to M2 as follows. Depending on the media type of the response, take the following actions:
      1. If the response is a packaged PWP instance (identified via the media type to be specified for the packed state of a PWP):
        1. Unpack the package, and retrieve the manifest embedded in the package as (to be) specified by the packed state of a PWP.
        2. M2 is set to the retrieved manifest.
      2. Otherwise, if the resource is a PWP manifest, as identified by its media type, set M2 to this resource.
      3. Otherwise the resource is an HTML file. Take the following actions:
        1. Create two, initially empty, manifests, respectively M1,0 and M1,1
        2. Perform the two, independent processing steps below, yielding possibly new values to M1,0 and M1,1, respectively.
          1. If the HTML content includes a <link rel="pwp_manifest" href="URI"> in the header:
            1. Issue an HTTP GET URI request
            2. If the response is successful, M1,0 is set to the content returned in the response
          2. If the HTML content includes manifest content embedded in a <script> element, serialized in one of the accepted serializations for PWP manifests:
            1. Retrieve and parse the content of the <script> element
            2. If parsing is successful, M1,1 is set to the parsed manifest
        3. Set M2 = M1,0 ⊕ M1,1
    2. Consider the HTTP Response header to, possibly, provide a new value to M3 as follows.
      1. If the response header includes a header of the form Link: <URI>; rel="pwp_manifest" (see [[rfc5988]]) then
        1. Issue an HTTP GET URI request
        2. If the response is successful, M3 is set to the content returned in the response.
  5. The final PWP manifest is set to M = M2 ⊕ M3 and returned.
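The steps above can be sketched in code as follows. This is an illustrative sketch under several assumptions that the draft does not fix: manifests are modelled as dictionaries serialized in JSON, the media types are invented placeholders, the embedded <script> is recognized by its type attribute, relative URIs are resolved against L, a ⊕ b is read as "on conflict, b wins", and HTTP access and unpacking are passed in as the hypothetical callables fetch and unpack_manifest.

```python
import json
import re
from dataclasses import dataclass, field
from html.parser import HTMLParser
from typing import Callable, Dict, Optional
from urllib.parse import urljoin

# Hypothetical media types: the draft leaves the real ones "to be specified".
PACKED_TYPE = "application/x-pwp-package"
MANIFEST_TYPE = "application/x-pwp-manifest+json"


@dataclass
class Response:
    ok: bool
    media_type: str = ""
    body: bytes = b""
    headers: Dict[str, str] = field(default_factory=dict)


def combine(low: dict, high: dict) -> dict:
    """Assumed reading of low ⊕ high: on conflict, the value in `high` wins."""
    return {**low, **high}


class _ManifestFinder(HTMLParser):
    """Collects the <link rel="pwp_manifest"> target and embedded <script> text."""

    def __init__(self) -> None:
        super().__init__()
        self.link_href: Optional[str] = None
        self.script_text: Optional[str] = None
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "link" and a.get("rel") == "pwp_manifest":
            self.link_href = a.get("href")
        elif tag == "script" and a.get("type") == MANIFEST_TYPE:
            self._in_script, self.script_text = True, ""

    def handle_data(self, data):
        if self._in_script:
            self.script_text += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False


def _parse(text) -> dict:
    """Parse a serialized manifest; an unparsable one counts as empty."""
    try:
        parsed = json.loads(text)
    except ValueError:
        return {}
    return parsed if isinstance(parsed, dict) else {}


def _fetch_manifest(fetch, uri: str) -> dict:
    resp = fetch(uri)
    return _parse(resp.body) if resp.ok else {}


def retrieve_manifest(L: str, fetch: Callable[[str], Response],
                      unpack_manifest: Callable[[bytes], dict]) -> Optional[dict]:
    m2, m3 = {}, {}                                   # step 1
    resp = fetch(L)                                   # step 2
    if not resp.ok:                                   # step 3
        return None
    if resp.media_type == PACKED_TYPE:                # step 4.1.1
        m2 = unpack_manifest(resp.body)
    elif resp.media_type == MANIFEST_TYPE:            # step 4.1.2
        m2 = _parse(resp.body)
    else:                                             # step 4.1.3: HTML
        m10, m11 = {}, {}
        finder = _ManifestFinder()
        finder.feed(resp.body.decode("utf-8", errors="replace"))
        if finder.link_href:
            m10 = _fetch_manifest(fetch, urljoin(L, finder.link_href))
        if finder.script_text:
            m11 = _parse(finder.script_text)
        m2 = combine(m10, m11)
    # step 4.2: manifest referred to by the HTTP Link header
    link = re.search(r'<([^>]*)>\s*;\s*rel="?pwp_manifest"?',
                     resp.headers.get("Link", ""))
    if link:
        m3 = _fetch_manifest(fetch, urljoin(L, link.group(1)))
    return combine(m2, m3)                            # step 5
```

Passing fetch in as a parameter keeps the sketch independent of any particular HTTP library and makes it possible to exercise the algorithm against canned responses.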

An important point of the algorithm is that it defines a priority for the manifest items in case several manifest instances contain values for the same item (see the definition for the combination of manifests). At present, the priority is as follows, in decreasing priority order:

  1. Manifest referred to by the HTTP Link Header
  2. Manifest extracted from the payload; if that means retrieving the manifest from the HTML content, then:
    1. Manifest embedded in the HTML
    2. Manifest referred to by a <link> element.

    otherwise, access to the manifest via a package and access directly through the payload are mutually exclusive, i.e., no priority among them applies
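The priority order can be illustrated with a short sketch that also records which source supplied each item. It assumes that manifests are simple key/value collections and that a higher-priority source overrides a lower-priority one item by item; the actual definition of the combination of manifests is given elsewhere, and the item names and values below are hypothetical.

```python
from typing import Dict, List, Tuple


def combine_by_priority(
    sources: List[Tuple[str, Dict[str, str]]]
) -> Dict[str, Tuple[str, str]]:
    """Combine manifests listed from highest to lowest priority.

    Returns, for each item, the winning value and the name of its source.
    """
    result: Dict[str, Tuple[str, str]] = {}
    for name, manifest in sources:
        for key, value in manifest.items():
            # setdefault keeps the first (highest-priority) value seen.
            result.setdefault(key, (value, name))
    return result


# Hypothetical item names and values, listed in decreasing priority order.
combined = combine_by_priority([
    ("HTTP Link header", {"packed": "H"}),
    ("embedded in HTML", {"packed": "E", "unpacked": "E2"}),
    ("<link> element", {"unpacked": "L", "extra": "X"}),
])
```

In this example the "packed" item comes from the manifest referred to by the HTTP Link header, "unpacked" from the manifest embedded in the HTML, and "extra" from the manifest referred to by the <link> element.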

[Figure: flowchart of what happens when a Portable Web Publication is retrieved by the user and how this leads to the PWP manifest.]
Visual representation of the algorithm to retrieve the PWP manifest. The figure is also available in PNG format.