PWP Locators

This document discusses the following issue: as Portable Web Publications (PWP) can exist in different states, it should be possible to locate a PWP or resource within a PWP using a transparent locator that doesn't need to know which state the PWP is in, thus keeping in line with the goal of the PWP draft specification.

Definitions

A Portable Web Publication (PWP) is defined in a separate document [[pwp]]. That document also defines the states a PWP can be in.
A reading system is responsible for rendering the contents of a PWP.
A PWP processor is a piece of functionality that has the responsibility of interpreting a PWP and its different locators, to pass the correct requested resources to the reading system. All mapping between canonical locators and specific, non-canonical locators should be done by the PWP Processor.
On a conceptual level the PWP processor ensures that the reading system can consider the content of a PWP as if all its resources were available via the canonical locator of the PWP online.
The server has the responsibility of serving the PWP-s in their different states. As such, the server is always aware of all possible published states of the PWP.
A server can be configured in multiple ways to serve the various states of a PWP, possibly responding to content negotiation, etc. This specification does not require any particular configuration.
A locator is a URL that points to either an entire PWP or a resource within a PWP.
To be more exact, a locator is always a resource locator. This resource part is implied, but omitted for brevity.
A PWP locator is a locator to the PWP as a whole. This PWP Locator can be dereferenced via an HTTP(S) request and should return some information. As such, it is not the same as an identifier, although, it could be the same.
A relative locator is a locator to a resource within a PWP, without specifying the location of the PWP.
An absolute locator is a combination of a PWP locator and a relative locator.
The term canonical is defined in the mathematical sense: distinguished among entities of its kind, so that it can be picked out in a way that does not depend on any arbitrary choices.
The canonical locator is an absolute, state-independent PWP locator. This is purely conceptual, i.e., the PWP does not have to be published unpacked online. For the purposes of this document, the canonical locator of a PWP is denoted by L.
A state locator is a PWP locator that refers to a PWP in a particular state. For the purposes of this document, L_u denotes the state locator referring to an unpacked state, whereas L_p refers to a state locator referring to a packed state.
There is no requirement whereby the canonical locator must be different from the locator of the published, unpacked state of the PWP.

In practice, it may be possible to define several different PWP packaging formats, in which case there may be several state locators referring to packages. Their treatment, from the point of view of this document, is identical; hence this document is restricted to a single locator of this type. Also, the precise treatment of specific packaging formats (unpacking algorithms, etc.) is not relevant for this document.

There is no requirement that the publisher publishes all these states; it may choose to publish only one. I.e., there can be situations where the PWP is only published unpacked (so only the locator of the published unpacked PWP exists), or only published packed (so only that locator exists).
A manifest (file) is a special resource containing fundamental information about a PWP. The serialization syntax and the content of a manifest are not defined in this document. A media type (see [[rfc6838]]) is also associated with a manifest (or a particular serialization thereof), which can be used when a particular instance is retrieved through the standard HTTP(S) protocol.
A manifest item is an information item within the manifest file that can be retrieved by, e.g., a PWP Processor. These may include metadata, information on rendering order and rendition, etc. For the purpose of this document it is important that state locators, as well as the canonical locator are such manifest items, i.e., they can be part of a manifest.
A PWP manifest or complete manifest is a manifest that contains all manifest items that are required by a PWP processor. The complete list of required items is not defined here; for the purpose of this document it suffices to state that the canonical locator, as well as all state locators for all available states, MUST be part of a PWP manifest.
The combination of manifests, denoted by the ⊕ operation (as in M = M_a ⊕ M_b) is the creation of a new manifest, consisting of manifest items originating from M_a and M_b.
The precise definition of a manifest should also specify how different manifest items are combined during this operation. For the purpose of this document it is only required that:
- the value of L MUST be identical, if present, in both M_a and M_b; and
- the value of L_u (resp. L_p) in M_b MUST have a higher priority. I.e., if present in M_b, this is the value added to M, regardless of whether a similar value is also specified in M_a or not.
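
The two requirements on ⊕ can be illustrated with a minimal sketch. It assumes, purely for illustration, that a manifest is modelled as a flat dictionary keyed by item name ("L", "L_u", "L_p"); the function name combine and the mirror URL are hypothetical, and letting M_b override every other item as well is an assumption, since this document only constrains the locators.

```python
from typing import Dict

# Hypothetical model: a manifest is a flat mapping from item name to value.
Manifest = Dict[str, str]


def combine(m_a: Manifest, m_b: Manifest) -> Manifest:
    """Return M = M_a ⊕ M_b under the two requirements listed above."""
    # Requirement 1: if L is present in both manifests, the values must be identical.
    if "L" in m_a and "L" in m_b and m_a["L"] != m_b["L"]:
        raise ValueError("canonical locator L differs between the manifests")
    # Requirement 2: values from M_b have the higher priority.
    # Assumption: the same override rule is applied to all other items too.
    merged = dict(m_a)
    merged.update(m_b)
    return merged


m_a = {
    "L": "https://example.org/published-books/1",
    "L_u": "https://example.org/books/1/",
}
m_b = {
    "L_u": "https://example.org/mirror/books/1/",  # hypothetical second location
    "L_p": "https://example.org/packed-books/1/package.pwp",
}
m = combine(m_a, m_b)
```

In this sketch m keeps the value of L from M_a, takes L_p from M_b, and the value of L_u from M_b replaces the one from M_a.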

The Problem

A PWP is published on the server. This PWP includes the resource for an image of the Mona Lisa, and it is published in at least one of the different states:

In an “unpacked” state, i.e., as a hierarchy of files on the server. The “top level” of this unpacked state is available through the URL https://example.org/books/1/: in this setup, the image of the Mona Lisa has the URL https://example.org/books/1/img/mona_lisa.jpg. This is the absolute locator of the image in this state.
In a “packed” state, namely as part of a, say, zip file (but with a PWP-specific media type). This is available through the URL https://example.org/packed-books/1/package.pwp. This is the absolute locator of the PWP in this state.

There may be other states, mostly similarly packed states but using a different archiving technology (e.g., tar.gz). Their treatments are similar; we will not consider them in what follows because it does not change the various considerations.

The published PWP is assigned a canonical locator, e.g., the URL https://example.org/published-books/1. Reflecting the unpacked state in terms of (file) structure, the canonical locator of the Mona Lisa image is https://example.org/published-books/1/img/mona_lisa.jpg.

As the example shows, the exact URL string of the canonical locator may be structurally different from the PWP locators, i.e., it is not necessarily a substring of one or the other. The canonical locator may be identical to the PWP locator of the unpacked state, but not necessarily.

The translation, in practical terms, is that, whenever possible, the canonical locator should be used when referring to the published PWP in, e.g., annotations. This is also true for URL-s derived from the canonical locator, like https://example.org/published-books/1/img/mona_lisa.jpg.

Functionalities of the PWP processor

The functionalities of the processor can be divided into two steps:

finding, based on the canonical locator, the right values of the various state locators
extracting, based on a specific state locator, the exact locator for additional, internal resources

Access the state locators based on the canonical locator

The complete manifest of a PWP (which contains a number of items that are required by the PWP processor to operate properly) MUST include both the canonical locator and all available state locators. Consequently, in order to retrieve the state locators, the PWP processor must first retrieve the PWP manifest using the canonical locator; once this has been achieved, the state locators are readily available.

The essential steps of the retrieval process are described in a separate section (see Algorithm to retrieve the PWP manifest). A high level description of the steps is as follows:

The PWP Processor issues an HTTP GET request using the value of the canonical locator.
Depending on the type of the returned resource, an initial value for the PWP manifest is established. The simplest case is when the returned resource is itself a manifest; otherwise the PWP manifest has to be extracted from the returned payload by possibly combining manifests of different origins.
Furthermore, the response header of the HTTP request may also refer to yet another manifest. If so, this is combined with the manifest retrieved from the payload in the previous step, yielding the final PWP manifest.

This algorithm is typically performed by the PWP Processor when initialized with the canonical locator L of a particular PWP instance.

An important consequence of the algorithm is that it defines a priority for the values of the state locators in case they are contained in different manifests. In particular, locator values from the manifest referred to by the HTTP response header have the highest priority.

Access internal resources

Once the full manifest is constructed, the published PWP can be retrieved, or a resource within that PWP can be retrieved. Based on the canonical locator, the PWP processor can derive a relative locator for the image, i.e., img/mona_lisa.jpg. Then, the PWP processor can access the image:

If the preferred state is the packed one, then the PWP locator of the packed state is accessed, the package is unpacked, and the image is located within the unpacked content (and that usually means using the relative locator as some sort of a file system path within that unpacked content).
If the preferred state is the unpacked one, then the locator of the image is constructed by using the PWP locator of the unpacked state and the relative locator, yielding https://example.org/books/1/img/mona_lisa.jpg.
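
The two access paths can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the package is assumed to be a plain zip archive, and the relative locator is assumed to be usable directly as a member path inside that archive.

```python
import io
import zipfile
from urllib.parse import urljoin


def relative_locator(canonical: str, absolute: str) -> str:
    """Derive the relative locator by stripping the canonical locator L."""
    base = canonical.rstrip("/") + "/"
    if not absolute.startswith(base):
        raise ValueError("locator is not inside the PWP identified by L")
    return absolute[len(base):]


def unpacked_locator(l_u: str, relative: str) -> str:
    """Unpacked state: combine the state locator L_u with the relative locator."""
    return urljoin(l_u.rstrip("/") + "/", relative)


def read_from_package(package: bytes, relative: str) -> bytes:
    """Packed state: use the relative locator as a path inside the archive."""
    with zipfile.ZipFile(io.BytesIO(package)) as archive:
        return archive.read(relative)


L = "https://example.org/published-books/1"
L_u = "https://example.org/books/1/"
rel = relative_locator(L, "https://example.org/published-books/1/img/mona_lisa.jpg")
url = unpacked_locator(L_u, rel)
```

Here rel is img/mona_lisa.jpg and url is https://example.org/books/1/img/mona_lisa.jpg, matching the example above. For the packed state, the bytes retrieved from L_p would be passed to read_from_package together with rel.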

There may be “smarter” PWP Processors that make use of local facilities like caching, but those do not modify these conceptual approaches.

Algorithm to retrieve the PWP manifest

The goal of this algorithm is to obtain the PWP manifest based on the value of the canonical locator L. This algorithm is performed by the PWP Processor, typically when it is initialized with the canonical locator L of a particular PWP instance. The core of the algorithm consists of retrieving the PWP manifest based on the HTTP(S) response to an HTTP GET request on L.

If the PWP processor already has the cached publication, then that will probably prevail (modulo cache state) and there may be no HTTP request in the first place. This section really refers to the situation of a first access.

In what follows, as an abuse of notation, HTTP GET U, for a URL U, refers to an HTTP or HTTPS request issued to the domain part of U, using the path from U. I.e., if U is http://www.ex.org/a/b/c, then HTTP GET U stands for:

GET /a/b/c HTTP/1.1
Host: www.ex.org

See [[rfc2616]] for further details.
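
As a small illustration of this notation, the sketch below splits a URL U into the request line and Host header shown above. The function name http_get_message is hypothetical, and the sketch only builds the message text; it does not open a connection.

```python
from urllib.parse import urlsplit


def http_get_message(u: str) -> str:
    """Build the request line and Host header that "HTTP GET U" stands for."""
    parts = urlsplit(u)
    # The request target is the path of U (plus the query, if there is one).
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    # The Host header carries the domain part of U (and a port, if given).
    host = parts.hostname or ""
    if parts.port is not None:
        host += f":{parts.port}"
    return f"GET {target} HTTP/1.1\r\nHost: {host}\r\n\r\n"


message = http_get_message("http://www.ex.org/a/b/c")
```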

As another abuse of notation, the algorithm refers to a manifest retrieved via HTTP and then manipulated via, e.g., the combination of manifests; that step, in fact, involves parsing the serialized manifest file and manipulating the abstract content instead, in an implementation-specific way.

With these prerequisites, the algorithm is as follows (see also the figure as a visual aid to the algorithm). The input to the algorithm is the canonical locator of the PWP instance, L.

Create two, initially empty, manifests, denoted respectively M₂ and M₃.
Issue an HTTP GET L request.
If the response is not successful (e.g., the response code is a 404), the process fails with no results.
Otherwise, perform the two, independent processing steps below, yielding possibly new values to M₂ and M₃, respectively.
1. Consider the resource returned by the HTTP request to, possibly, provide a new value to M₂ as follows. Depending on the media type of the response, take the following actions:
  1. If the response is a packaged PWP instance (identified via the media type to be specified for the packed state of a PWP):
    1. Unpack the package, and retrieve the manifest embedded in the package as (to be) specified by the packed state of a PWP.
    2. M₂ is set to the retrieved manifest.
  2. Otherwise, if the resource is a PWP manifest, as identified by its media type, set M₂ to this resource.
  3. Otherwise the resource is an HTML file. Take the following actions:
    1. Create two, initially empty, manifests, respectively M_1,0 and M_1,1
    2. Perform the two, independent processing steps below, yielding possibly new values to M_1,0 and M_1,1, respectively.
      1. If the HTML content includes a <link rel="pwp_manifest" href="URI"> in its head element:
        
        Issue an HTTP GET URI request
        
        If the response is successful, M_1,0 is set to the content returned in the response
      2. If the HTML content includes manifest content embedded in a <script> element, serialized in one of the accepted serializations for PWP manifests
        
        Retrieve and parse the content of the <script> element
        
        If parsing is successful, M_1,1 is set to the parsed manifest
    3. Set M₂ = M_1,0 ⊕ M_1,1
2. Consider the HTTP Response header to, possibly, provide a new value to M₃ as follows.
  1. If the response header includes a header of the form LINK <URI>; rel="pwp_manifest" (see [[rfc5988]]) then
    1. Issue an HTTP GET URI request
    2. If the response is successful, M₃ is set to the content returned in the response.
The final PWP manifest is set to M = M₂ ⊕ M₃ and returned.
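
The steps above can be sketched in code. The sketch below is not a normative implementation and rests on several assumptions that this document leaves open: manifests are serialized as JSON, the media types and the manifest path inside a package are hypothetical placeholders, the package is a plain zip archive, only the first <link> and first matching <script> are considered, and the Link header parsing is deliberately simplified. The fetch parameter is a hypothetical callable standing in for HTTP GET U, so no particular HTTP library is implied.

```python
import io
import json
import re
import zipfile
from dataclasses import dataclass, field
from html.parser import HTMLParser
from typing import Callable, Dict, List, Optional
from urllib.parse import urljoin

# Hypothetical values: the source leaves all four to a later specification.
REL = "pwp_manifest"
PACKAGE_TYPE = "application/pwp+zip"
MANIFEST_TYPE = "application/pwp-manifest+json"
PACKAGE_MANIFEST_PATH = "manifest.json"

Manifest = Dict[str, str]


@dataclass
class Response:
    ok: bool
    media_type: str = ""
    body: bytes = b""
    headers: Dict[str, str] = field(default_factory=dict)


Fetch = Callable[[str], Response]  # stands in for "HTTP GET U"


def combine(m_a: Manifest, m_b: Manifest) -> Manifest:
    """M_a ⊕ M_b: L must agree where present; items of M_b take priority."""
    if "L" in m_a and "L" in m_b and m_a["L"] != m_b["L"]:
        raise ValueError("conflicting canonical locators")
    return {**m_a, **m_b}


class ManifestFinder(HTMLParser):
    """Finds the first linked manifest URL and the first embedded manifest."""

    def __init__(self) -> None:
        super().__init__()
        self.href: Optional[str] = None
        self.embedded: Optional[str] = None
        self._buf: Optional[List[str]] = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "link" and REL in (a.get("rel") or "").split():
            self.href = self.href or a.get("href")
        elif tag == "script" and a.get("type") == MANIFEST_TYPE and self.embedded is None:
            self._buf = []

    def handle_data(self, data):
        if self._buf is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._buf is not None:
            self.embedded, self._buf = "".join(self._buf), None


def fetch_manifest(fetch: Fetch, url: Optional[str]) -> Manifest:
    """GET a manifest; an absent URL or a failed response yields an empty one."""
    resp = fetch(url) if url else Response(ok=False)
    return json.loads(resp.body) if resp.ok else {}


def retrieve_manifest(fetch: Fetch, L: str) -> Optional[Manifest]:
    resp = fetch(L)
    if not resp.ok:
        return None  # the process fails with no results
    # Step 1: derive M2 from the payload, depending on its media type.
    if resp.media_type == PACKAGE_TYPE:
        with zipfile.ZipFile(io.BytesIO(resp.body)) as package:
            m2 = json.loads(package.read(PACKAGE_MANIFEST_PATH))
    elif resp.media_type == MANIFEST_TYPE:
        m2 = json.loads(resp.body)
    else:  # otherwise the resource is treated as HTML
        finder = ManifestFinder()
        finder.feed(resp.body.decode("utf-8"))
        finder.close()
        m_1_0 = fetch_manifest(fetch, finder.href and urljoin(L, finder.href))
        try:
            m_1_1 = json.loads(finder.embedded) if finder.embedded else {}
        except ValueError:  # parsing failed, so M_1,1 stays empty
            m_1_1 = {}
        m2 = combine(m_1_0, m_1_1)
    # Step 2: derive M3 from a Link response header, if any (simplified parsing).
    link = re.search(r'<([^>]*)>\s*;\s*rel="?%s"?' % re.escape(REL),
                     resp.headers.get("Link", ""))
    m3 = fetch_manifest(fetch, urljoin(L, link.group(1))) if link else {}
    return combine(m2, m3)
```

A caller would supply its own fetch, for example a thin wrapper around an HTTP client that fills in Response from the status code, the Content-Type header, the body, and the response headers. Because combine gives its second argument priority, the order of the two combine calls is what yields the priority described after the algorithm.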

An important point of the algorithm is that it defines a priority for the manifest items in case several manifest instances contain respective values (see the definition for the combination of manifests). At present, the priority is as follows, in decreasing priority order:

Manifest referred to by the HTTP Link Header
Manifest extracted from the payload; if that means retrieving the manifest from the HTML content, then:
1. Manifest embedded in the HTML
2. Manifest referred to by a <link> element.
otherwise the access to the manifest via a package or directly through the payload are mutually exclusive, i.e., priority among them does not apply

The algorithm considers only HTML as a possible non-packaged and non-manifest response format. It may become possible to allow, for example, SVG as another, possible format for a PWP; this depends on the final specification of a PWP. The algorithm should then be adapted accordingly by adding a relevant branch (e.g., the specification of SVG includes <script> element that can be used to embed a manifest, but does not have a <link> element).
It may become possible for an HTML file to include several <link> elements, each referring to a manifest. If that becomes allowed by a PWP specification, the corresponding step could be modified by taking all link elements into account, and sequentially combining the manifest files in document order to yield M_1,0. The same note is valid for (possible) several <script> elements and M_1,1, respectively.
Similarly, if a PWP specification allows for several different serialization syntaxes for manifests, the processor should be able to recognize and parse them accordingly. The expectation is that the various possible serializations MUST serialize the same content, i.e., these do not influence the final result.
The algorithm is silent on the details on how a manifest should be retrieved from a package. This depends on the detailed specification of packaging, on whether a manifest would have to be at a known location within a package, on whether there might be several manifest instances within a package, etc. It is also possible that the details would follow a similar approach as described in this algorithm, i.e., relying on embedded and linked manifests of a top level HTML file, for example. As far as the algorithm described in this section is concerned, these details do not influence the final result.
The algorithm makes use of the constant pwp_manifest; the exact value of this constant must be defined, and registered, through a more precise specification of PWP-s. It is used here for illustrative purpose only.
The PWP Processor MAY include an Accept header (see [[rfc7231]]) when issuing an HTTP GET to express its preference for, e.g., a packed state of a PWP over manifest payload, or in favor of a particular serialization of the manifest content. Whether this is done or not, and whether the server honors this preference, does not influence the details of the algorithm.
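
As an illustration of such a preference, the sketch below builds (but does not send) a GET request whose Accept header prefers a packed PWP, then a manifest, then HTML. Both PWP media types are hypothetical placeholders, since no media type is registered by this document, and the quality values are arbitrary.

```python
from urllib.request import Request

# Hypothetical media types; the real values await a PWP specification.
PACKAGE_TYPE = "application/pwp+zip"
MANIFEST_TYPE = "application/pwp-manifest+json"


def build_request(canonical_locator: str) -> Request:
    """Prepare an HTTP GET on L that prefers the packed state of the PWP."""
    accept = f"{PACKAGE_TYPE}, {MANIFEST_TYPE};q=0.8, text/html;q=0.5"
    return Request(canonical_locator, headers={"Accept": accept}, method="GET")


request = build_request("https://example.org/published-books/1")
```

Whatever the server returns, the processor still has to inspect the media type of the response, since the server is free to ignore the stated preference.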

Figure: Visual representation of the algorithm to retrieve the PWP manifest (a flowchart of what happens when a Portable Web Publication is retrieved by the user and how this leads to the PWP manifest). The figure is also available in PNG format.