Copyright © 2016-2017 W3C® (MIT, ERCIM, Keio, Beihang). W3C liability, trademark and permissive document license rules apply.
This document discusses the following issue: as Portable Web Publications (PWP) can exist in different states, it should be possible to locate a PWP or resource within a PWP using a transparent locator that doesn't need to know which state the PWP is in, thus keeping in line with the goal of the PWP draft specification.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at https://www.w3.org/TR/.
This is an early draft; expect this document to change substantially and often.
This document was published by the Digital Publishing Interest Group as an Editor's Draft.
Comments regarding this document are welcome. Please send them to public-digipub-ig@w3.org (archives).
Publication as an Editor's Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
This document was produced by a group operating under the W3C Patent Policy. The group does not expect this document to become a W3C Recommendation. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.
This document is governed by the 1 March 2019 W3C Process Document.
On a conceptual level the PWP processor ensures that the reading system can consider the content of a PWP as if all its resources were available via the canonical locator of the PWP online.
A server can be configured in multiple ways to serve the various states of a PWP, possibly responding to content negotiation, etc. This specification does not require any particular configuration.
To be more exact, a locator is always a resource locator. This resource part is implied, but omitted for brevity.
There is no requirement whereby the canonical locator must be different from the locator of the published, unpacked state of the PWP.
In practice, it may be possible to define several different PWP packaging formats, in which case there may be several state locators referring to packages. Their treatment, from the point of view of this document, is identical; hence this document is restricted to a single locator of this type. Also, the precise treatment of specific packaging formats (unpacking algorithms, etc.) is not relevant for this document.
The precise definition of a manifest should also specify how different manifest items are combined during this operation. For the purpose of this document it is only required that:
A PWP is published on the server. This PWP includes the resource for an image of the Mona Lisa, and it is published in at least one of the different states:
Unpacked, at the URL https://example.org/books/1/: in this setup, the image of the Mona Lisa has the URL https://example.org/books/1/img/mona_lisa.jpg. This is the absolute locator of the image in this state.
Packed, at the URL https://example.org/packed-books/1/package.pwp. This is the absolute locator of the PWP in this state.
There may be other states, mostly similarly packed states but using a different archiving technology (e.g., tar.gz). Their treatment is similar; we do not consider them in what follows because they do not change the various considerations.
The published PWP is assigned a canonical locator, e.g., the URL https://example.org/published-books/1.
Reflecting the unpacked state in terms of (file) structure, the canonical locator of the Mona Lisa image is https://example.org/published-books/1/img/mona_lisa.jpg.
As the example shows, the exact URL string of the canonical locator may be structurally different from the PWP locators, i.e., it is not necessarily a substring of one or the other. The canonical locator may be identical to the PWP locator of the unpacked state, but not necessarily.
In practical terms, this means that, whenever possible, the canonical locator should be used when referring to the published PWP in, e.g., annotations. This is also true for URLs derived from the canonical locator, like https://example.org/published-books/1/img/mona_lisa.jpg.
The functionalities of the processor can be divided into two steps:
The complete manifest of a PWP (which contains a number of items that are required by the PWP to operate properly) MUST include both the canonical locator and all available state locators. Consequently, in order to retrieve the state locators, the PWP processor must first retrieve the PWP manifest using the canonical locator; once this has been achieved, the state locators are readily available.
The essential steps of the retrieval process are described in a separate section. A high-level description of the steps is as follows:
Issue an HTTP GET request using the value of the canonical locator.
The response header of the HTTP request may also refer to yet another manifest. If so, this is combined with the manifest retrieved from the payload in the previous step, yielding the final PWP manifest.
This algorithm is typically performed by the PWP Processor when initialized with the canonical locator L of a particular PWP instance.
An important consequence of the algorithm is that it defines a priority for the values of the state locators when different manifests contain them. In particular, locator values retrieved via the Link header of the HTTP response have the highest priority.
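The priority rule can be illustrated with a minimal sketch. It assumes, purely for illustration, that a manifest is a flat mapping from item names to values and that combination is a key-wise override; the key names (`canonical`, `unpacked`) and the `cdn.example.org` URL are hypothetical and are not defined by this document, which leaves the exact combination rules to the manifest definition.

```python
def combine(*manifests: dict) -> dict:
    """Combine manifests listed in increasing priority order.

    Assumption: a manifest is a flat dict; when the same item appears in
    several manifests, the value from the later (higher priority) one wins.
    """
    combined: dict = {}
    for manifest in manifests:
        combined.update(manifest)
    return combined


# Hypothetical manifest obtained from the payload of the response.
payload_manifest = {
    "canonical": "https://example.org/published-books/1",
    "unpacked": "https://example.org/books/1/",
}
# Hypothetical manifest referred to by the HTTP Link header.
link_header_manifest = {
    "unpacked": "https://cdn.example.org/books/1/",
}

# The Link header manifest is passed last because it has the highest priority.
final_manifest = combine(payload_manifest, link_header_manifest)
```

In this sketch the state locator from the Link header manifest replaces the one found in the payload, while items present in only one manifest are carried over unchanged.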
Once the full manifest is constructed, the published PWP, or a resource within that PWP, can be retrieved. Based on the canonical locator, the PWP processor can derive a relative locator for the image, i.e., img/mona_lisa.jpg. Then the PWP processor can access the image, e.g., at https://example.org/books/1/img/mona_lisa.jpg in the unpacked state.
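The derivation of the relative locator, and its resolution against the unpacked state, might be sketched as follows. The function names are illustrative only, and the sketch assumes that the resource URL literally starts with the canonical locator; a real PWP processor may need more careful URL normalization.

```python
from urllib.parse import urljoin


def _as_directory(url: str) -> str:
    """Ensure the URL ends with a slash so it acts as a base directory."""
    return url if url.endswith("/") else url + "/"


def relative_locator(canonical: str, resource: str) -> str:
    """Return the locator of `resource` relative to the canonical locator."""
    base = _as_directory(canonical)
    if not resource.startswith(base):
        raise ValueError("resource is not located under the canonical locator")
    return resource[len(base):]


def locator_in_unpacked_state(state_locator: str, relative: str) -> str:
    """Resolve a relative locator against the locator of the unpacked state."""
    return urljoin(_as_directory(state_locator), relative)


canonical = "https://example.org/published-books/1"
unpacked = "https://example.org/books/1/"

relative = relative_locator(
    canonical, "https://example.org/published-books/1/img/mona_lisa.jpg"
)
image_url = locator_in_unpacked_state(unpacked, relative)
print(relative)   # img/mona_lisa.jpg
print(image_url)  # https://example.org/books/1/img/mona_lisa.jpg
```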
There may be “smarter” PWP Processors that make use of local facilities like caching, but those do not modify these conceptual approaches.
The goal of this algorithm is to obtain the PWP manifest based on the value of the canonical locator L. This algorithm is performed by the PWP Processor, typically when it is initialized with the canonical locator L of a particular PWP instance. The core of the algorithm consists of retrieving the PWP manifest based on the HTTP(S) responses to an HTTP GET request on L.
If the PWP processor already has the publication cached, then the cached copy will probably prevail (modulo cache state) and there may be no HTTP request in the first place. This section refers to the situation of a first access.
In what follows, as an abuse of notation, HTTP GET U, for a URL U, refers to an HTTP or HTTPS request issued to the domain part of U, using the path from U. I.e., if U is http://www.ex.org/a/b/c, then HTTP GET U stands for:
GET /a/b/c HTTP/1.1
Host: www.ex.org
See [rfc2616] for further details.
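As an illustration of this notation, the following sketch expands a URL U into the corresponding request text. The function name `http_get_request` is an assumption made for this example; the sketch only builds the request line and the Host header and does not send anything over the network.

```python
from urllib.parse import urlsplit


def http_get_request(u: str) -> str:
    """Build the text of the request that `HTTP GET U` stands for."""
    parts = urlsplit(u)
    if parts.scheme not in ("http", "https"):
        raise ValueError("U must be an HTTP or HTTPS URL")
    # An empty path is requested as "/"; a query string, if any, is kept.
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return f"GET {target} HTTP/1.1\r\nHost: {parts.netloc}\r\n\r\n"


request_text = http_get_request("http://www.ex.org/a/b/c")
print(request_text)
```

For http://www.ex.org/a/b/c this yields the two lines shown above, terminated by the empty line that ends the header section of the request.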
As another abuse of notation, the algorithm refers to a manifest retrieved via HTTP and then manipulated via, e.g., the combination of manifests; that step, in fact, involves parsing the serialized manifest file and manipulating the abstract content in an implementation-specific way.
With these prerequisites, the algorithm is as follows (see also the figure as a visual aid to the algorithm). The input to the algorithm is the canonical locator of the PWP instance, L.
Issue an HTTP GET L request.
HTTP request to, possibly, provide a new value to M2 as follows. Depending on the media type of the response, take the following actions:
<link rel="pwp_manifest" href="URI"> in the header:
issue an HTTP GET URI request
<script> element, serialized in one of the accepted serializations for PWP manifests
<script> element
LINK <URI>; rel="pwp_manifest" (see [rfc5988]) then
issue an HTTP GET URI request
An important point of the algorithm is that it defines a priority for the manifest items in case several manifest instances contain the respective values (see the definition for the combination of manifests). At present, the priority is as follows, in decreasing priority order:
HTTP Link Header
<link> element.
Otherwise, access to the manifest via a package and access directly through the payload are mutually exclusive, i.e., priority among them does not apply.
<script> element that can be used to embed a manifest, but does not have a <link> element).
<link> elements referring to a manifest each. If that becomes allowed by a PWP specification, the corresponding step could be modified by taking all link elements into account, and sequentially combining the manifest files in document order to yield M1,0. The same note is valid for (possibly) several <script> elements and M1,1, respectively.
pwp_manifest; the exact value of this constant must be defined, and registered, through a more precise specification of PWPs. It is used here for illustrative purposes only.
Accept header (see [rfc7231]) when issuing an HTTP GET to express its preference for, e.g., a packed state of a PWP over a manifest payload, or in favor of a particular serialization of the manifest content. Whether this is done or not, and whether the server honors this preference, does not influence the details of the algorithm.
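The two ways of referring to a manifest by URI, the HTTP Link header and the HTML <link> element, might be detected as in the following sketch. It is not a complete implementation of [rfc5988]: the header parsing is a simplification that assumes no commas inside parameter values, the function names and the file name manifest.json are hypothetical, and pwp_manifest is the illustrative constant used in this document.

```python
import re
from html.parser import HTMLParser

REL = "pwp_manifest"  # illustrative relation value, not a registered one


def manifest_uri_from_link_header(value: str, rel: str = REL):
    """Return the URI of the first link in a Link header value with `rel`."""
    # Each link looks like: <URI>; param=value; ...  (links separated by commas)
    for match in re.finditer(r"<([^>]*)>([^,]*)", value):
        uri, params = match.groups()
        rel_match = re.search(r';\s*rel\s*=\s*"?([^";]*)"?', params)
        # A rel parameter may hold several space-separated relation types.
        if rel_match and rel in rel_match.group(1).split():
            return uri
    return None


class _LinkFinder(HTMLParser):
    """Record the href of the first <link> element carrying the given rel."""

    def __init__(self, rel: str):
        super().__init__()
        self.rel = rel
        self.href = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.href is not None:
            return
        attributes = dict(attrs)
        if self.rel in (attributes.get("rel") or "").split():
            self.href = attributes.get("href")


def manifest_uri_from_html(document: str, rel: str = REL):
    """Return the href of the first matching <link> element, or None."""
    finder = _LinkFinder(rel)
    finder.feed(document)
    return finder.href


header = '<https://example.org/next>; rel="next", <manifest.json>; rel="pwp_manifest"'
html = '<html><head><link rel="pwp_manifest" href="manifest.json"></head></html>'
print(manifest_uri_from_link_header(header))  # manifest.json
print(manifest_uri_from_html(html))           # manifest.json
```

A processor following the priority order described above would consult the value from the Link header before the one found in the HTML document; both URIs would still need to be resolved against the URL of the response before issuing the next request.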