This document discusses the following issue:
as Portable Web Publications (PWP) can exist in different states,
it should be possible to locate a PWP or resource within a PWP using a transparent locator
that doesn't need to know which state the PWP is in,
thus keeping in line with the goal of the PWP draft specification.
This is an early draft; expect this document to change substantially and often.
Definitions
A Portable Web Publication (PWP) is defined in a separate document [[pwp]]. That document also defines the states a PWP can be in.
A reading system has the responsibility of rendering the contents of a PWP.
A PWP processor is a piece of functionality
that has the responsibility of interpreting a PWP and its different locators,
to pass the correct requested resources to the reading system.
All mapping to and from canonical locators to specific, non-canonical locators should be done by the PWP Processor.
The server has the responsibility of serving PWPs in their different states. As such, the server is always aware of all possible published states of the PWP.
A server can be configured in multiple ways to serve the various states of a PWP, possibly responding to content negotiation, etc. This specification does not require any particular configuration.
A locator is a URL that points to either an entire PWP or a resource within a PWP.
To be more exact, a locator is always a resource locator. This resource part is implied, but omitted for brevity.
A PWP locator is a locator to the PWP as a whole.
This PWP Locator can be dereferenced via an HTTP(S) request and should return some information.
As such, it is not the same as an identifier, although the two may coincide.
A relative locator is a locator to a resource within a PWP, without specifying the location of the PWP.
The term canonical is defined in the mathematical sense: distinguished among entities of its kind, so that it can be picked out in a way that does not depend on any arbitrary choices.
The canonical locator is an absolute, and state independent PWP locator. This is purely conceptual, i.e., the PWP does not have to be published unpacked online. For the purposes of this document, the canonical locator of a PWP is denoted by L.
There is no requirement whereby the canonical locator must be different from the locator of the published, unpacked state of the PWP.
In practice, it may be possible to define several different PWP packaging formats, in which case there may be several state locators referring to packages. Their treatment, from the point of view of this document, is identical, hence this document is restricted to a single locator of this type. Also, the precise treatment of specific packaging formats (unpacking algorithms, etc.) is not relevant for this document.
There is no requirement that the publisher publishes all these states; it may choose to publish only one. I.e., there can be situations where the PWP is only published unpacked (so only the locator of the published unpacked PWP exists), or only published packed (so only that locator exists).
A manifest (file) is a special resource containing fundamental information about a PWP. The serialization syntax and the content of a manifest are not defined in this document. A media type (see [[rfc6838]]) is also associated with a manifest (or a particular serialization thereof), which can be used when a particular instance is retrieved through the standard HTTP(S) protocol.
A manifest item is an information item within the manifest file that can be retrieved by, e.g., a PWP Processor. These may include metadata, information on rendering order and rendition, etc. For the purpose of this document it is important that
state locators, as well as the canonical locator, are such manifest items, i.e., they can be part of a manifest.
A PWP manifest or complete manifest is a manifest that contains all manifest items that are required by a PWP processor. The complete list of required items is not defined here; for the purpose of this document it suffices to state that the canonical locator, as well as all state locators for all available states, MUST be part of a PWP manifest.
The combination of manifests, denoted by the ⊕ operation (as in M = Ma ⊕ Mb) is the creation of a new manifest, consisting of manifest items originating from Ma and Mb.
The precise definition of a manifest should also specify how different manifest items are combined during this operation. For the purpose of this document it is only required that:
the value of L MUST be identical, if present, in both Ma and Mb; and
the value of Lu (resp. Lp) in Mb MUST have a higher priority. I.e., if present in Mb, this is the value added to M, regardless of whether a similar value is also specified in Ma or not.
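The two requirements above can be illustrated with a minimal sketch. It assumes, purely for illustration, that a manifest is a flat dictionary whose keys "L", "Lu" and "Lp" hold the canonical locator and the state locators; reading Lu and Lp as the unpacked and packed state locators is itself an assumption here, and so is the rule that Mb also wins for every other item, which this document leaves open. The function name `combine` and the URL under /old/ are hypothetical.

```python
def combine(ma: dict, mb: dict) -> dict:
    """Return M = Ma ⊕ Mb under the two requirements stated above."""
    # Requirement 1: if L is present in both manifests, it must be identical.
    if "L" in ma and "L" in mb and ma["L"] != mb["L"]:
        raise ValueError("canonical locator L differs between manifests")
    merged = dict(ma)
    # Requirement 2: values from Mb have priority (notably Lu and Lp).
    # Applying the same rule to all other items is an assumption of this sketch.
    merged.update(mb)
    return merged


ma = {"L": "https://example.org/published-books/1",
      "Lu": "https://example.org/old/1/"}  # hypothetical stale value
mb = {"Lu": "https://example.org/books/1/",
      "Lp": "https://example.org/packed-books/1/package.pwp"}

m = combine(ma, mb)
print(m["L"])   # taken from Ma, since Mb does not define L
print(m["Lu"])  # taken from Mb, which has priority
```

Note that the operation is not commutative under this reading: swapping the arguments would keep the value of Lu from Ma instead.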
The Problem
A PWP is published on the server.
This PWP includes the resource for an image of the Mona Lisa,
and it is published in at least one of the different states:
In an “unpacked” state, i.e., as a hierarchy of files on the server.
The “top level” of this unpacked state is available through the URL https://example.org/books/1/:
in this setup, the image of the Mona Lisa has the URL https://example.org/books/1/img/mona_lisa.jpg.
This is the absolute locator of the image in this state.
In a “packed” state, e.g., as a zip file (but with a PWP-specific media type).
This is available through the URL https://example.org/packed-books/1/package.pwp.
This is the absolute locator of the PWP in this state.
There may be other states, mostly similarly packed states but using a different archiving technology (e.g., tar.gz).
Their treatments are similar; we will not consider them in what follows because it does not change the various considerations.
The published PWP is assigned a canonical locator, e.g., the URL https://example.org/published-books/1.
Reflecting the unpacked state in terms of (file) structure the canonical locator of the Mona Lisa image is https://example.org/published-books/1/img/mona_lisa.jpg.
As the example shows, the exact URL string of the canonical locator may be structurally different from the PWP locators, i.e., it is not necessarily a substring of one or the other. The canonical locator may be identical to the PWP locator of the unpacked state, but not necessarily.
The translation, in practical terms, is that, whenever possible,
the canonical locator should be used when referring to the published PWP in, e.g., annotations.
This is also true for URLs derived from the canonical locator, like https://example.org/published-books/1/img/mona_lisa.jpg.
Depending on the type of the returned resource, an initial value for the PWP manifest is established. The simplest case is when the returned resource is itself a manifest; otherwise the PWP manifest has to be extracted from the returned payload by possibly combining manifests of different origins.
Furthermore, the response header of the HTTP request may also refer to yet another manifest. If so, this is combined with the manifest retrieved from the payload in the previous step, yielding the final PWP manifest.
This algorithm is typically performed by the PWP Processor when initialized with the canonical locator L of a particular PWP instance.
An important consequence of the algorithm is that it defines a priority for the values of the state locators in case they are contained in different manifests. In particular, locator values retrieved via the header of the HTTP response have the highest priority.
If the preferred state is the packed one, then the PWP locator of the packed state is accessed, the package is unpacked, and the image is located within the unpacked content (and that usually means using the relative locator as some sort of a file system path within that unpacked content).
If the preferred state is the unpacked one, then the locator of the image is constructed by using the PWP locator of the unpacked state and the relative locator,
yielding https://example.org/books/1/img/mona_lisa.jpg.
There may be “smarter” PWP Processors that make use of local facilities like caching, but those do not modify these conceptual approaches.
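The two cases above can be sketched as follows. This is an illustration under stated assumptions, not a definition: the helper names are hypothetical, the package is assumed to be a zip archive, and the relative locator is assumed to double as the member path inside that archive.

```python
import io
import zipfile
from urllib.parse import urljoin


def relative_locator(canonical_pwp: str, canonical_resource: str) -> str:
    """Strip the canonical PWP locator from a canonical resource locator."""
    base = canonical_pwp.rstrip("/") + "/"
    if not canonical_resource.startswith(base):
        raise ValueError("resource is not inside this PWP")
    return canonical_resource[len(base):]


def unpacked_locator(lu: str, relative: str) -> str:
    """Unpacked state: join the state locator with the relative locator."""
    return urljoin(lu.rstrip("/") + "/", relative)


def read_from_package(package_bytes: bytes, relative: str) -> bytes:
    """Packed state: treat the relative locator as a path inside the archive."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as archive:
        return archive.read(relative)


L = "https://example.org/published-books/1"
Lu = "https://example.org/books/1/"
rel = relative_locator(
    L, "https://example.org/published-books/1/img/mona_lisa.jpg")
print(rel)                        # img/mona_lisa.jpg
print(unpacked_locator(Lu, rel))  # https://example.org/books/1/img/mona_lisa.jpg
```

In the packed case, the bytes passed to `read_from_package` would be the payload retrieved from the PWP locator of the packed state.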
Local (downloaded) PWPs
Breadcrumbs
The following discussion makes the following assumptions:
You cannot alter an existing PWP. You can, however, create a new PWP that is derived from an existing PWP.
You can only create a derivative of a PWP (i.e., edit the contents and/or manifest) if you have sufficient rights to do so.
When I download a PWP (for this example denoted as P, with canonical locator L),
and copy that PWP to a different location, then we still talk about the same PWP.
However, when, e.g., a teacher adds annotations inside the PWP, he or she creates a new, altered PWP, e.g., P'.
To keep the connections between P and P',
we can allow for a trail of breadcrumbs, to be persisted inside the PWP.
E.g., P might contain canonical: L, breadcrumbs: [],
as it is an unaltered version of the original PWP, published with canonical locator L.
P' however might be published at the school of the teacher,
under https://example.org/course/french/frenchbook (and thus with a different canonical locator, e.g., L'),
and contain canonical: L', breadcrumbs: [L].
A student may then, in turn, copy P' to his or her own web space
and make annotations (P'' under https://example.org/student/1/frenchbook, L'').
P'' may then contain canonical: L'', breadcrumbs: [L, L']
References to P' or P'' might thus
be interpreted by a PWP processor to link back to the original L.
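The breadcrumb trail can be sketched as below. The record type `PwpInfo` and the functions `derive` and `origin` are hypothetical names; the sketch also assumes that the publication URLs given above serve as the canonical locators L' and L''. The record is immutable, mirroring the assumption that an existing PWP cannot be altered.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PwpInfo:
    canonical: str
    breadcrumbs: tuple = ()


def derive(source: PwpInfo, new_canonical: str) -> PwpInfo:
    """Create a derived PWP record; the source record is left untouched."""
    return PwpInfo(new_canonical, source.breadcrumbs + (source.canonical,))


def origin(info: PwpInfo) -> str:
    """Follow the breadcrumbs back to the original canonical locator."""
    return info.breadcrumbs[0] if info.breadcrumbs else info.canonical


p = PwpInfo("https://example.org/published-books/1")            # P, with L
p1 = derive(p, "https://example.org/course/french/frenchbook")  # P', with L'
p2 = derive(p1, "https://example.org/student/1/frenchbook")     # P'', with L''

print(p2.breadcrumbs)  # (L, L')
print(origin(p2))      # L
```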
The goal of this algorithm is to obtain the PWP manifest based on the value of the canonical locator L. This algorithm is performed by the PWP Processor, typically when it is initialized with the canonical locator L of a particular PWP instance. The core of the algorithm consists of retrieving the PWP manifest based on the HTTP(S) responses to an HTTP GET request on L.
If the PWP processor already has the publication cached, then that will probably prevail (modulo cache state) and there may be no HTTP request in the first place. This section really refers to the situation of a first access.
In what follows, as an abuse of notation, HTTP GET U, for a URL U, refers to an HTTP or HTTPS request issued to the domain part of U, using the path from U. I.e., if U is http://www.ex.org/a/b/c, then HTTP GET U stands for:
GET /a/b/c HTTP/1.1
Host: www.ex.org
See [[rfc2616]] for further details.
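As a small illustration of this notation, the request text can be derived from U with the standard URL parser. The function name `request_text` is hypothetical, and the sketch ignores everything beyond the request line and the Host header.

```python
from urllib.parse import urlsplit


def request_text(url: str) -> str:
    """Build the minimal request that 'HTTP GET U' stands for."""
    parts = urlsplit(url)
    path = parts.path or "/"       # an empty path is requested as "/"
    if parts.query:
        path += "?" + parts.query  # keep a query string, if any
    return f"GET {path} HTTP/1.1\r\nHost: {parts.netloc}\r\n\r\n"


print(request_text("http://www.ex.org/a/b/c"))
```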
As another abuse of notation, the algorithm refers to a manifest retrieved via HTTP and then manipulated via, e.g., the combination of manifests; that step, in fact, involves parsing the serialized manifest file and manipulating the abstract content instead, in an implementation-specific way.
With these prerequisites, the algorithm is as follows (see also the figure as a visual aid to the algorithm). The input to the algorithm is the canonical locator of the PWP instance, L.
Create two, initially empty, manifests, denoted respectively M2 and M3.
Issue an HTTP GET L request.
If the response is not successful (e.g., the response code is a 404), the process fails with no results.
Otherwise, perform the two, independent processing steps below, yielding possibly new values to M2 and M3, respectively.
Consider the resource returned by the HTTP request to, possibly, provide a new value to M2 as follows. Depending on the media type of the response, take the following actions:
If the response is a packaged PWP instance (identified via the media type to be specified for the packed state of a PWP):
Unpack the package, and retrieve the manifest embedded in the package as (to be) specified by the packed state of a PWP.
M2 is set to the retrieved manifest.
Otherwise, if the resource is a PWP manifest, as identified by its media type, set M2 to this resource.
Otherwise the resource is an HTML file. Take the following actions:
Create two, initially empty, manifests, respectively M1,0 and M1,1
Perform the two, independent processing steps below, yielding possibly new values to M1,0 and M1,1, respectively.
If the HTML content includes a <link rel="pwp_manifest" href="URI"> in the header:
Issue an HTTP GET URI request
If the response is successful, M1,0 is set to the content returned in the response
If the HTML content includes manifest content embedded in a <script> element, serialized in one of the accepted serializations for PWP manifests
Retrieve and parse the content of the <script> element
If parsing is successful, M1,1 is set to the parsed manifest
Set M2 = M1,0 ⊕ M1,1
Consider the HTTP Response header to, possibly, provide a new value to M3 as follows.
If the response header includes a header of the form Link: <URI>; rel="pwp_manifest" (see [[rfc5988]]) then
Issue an HTTP GET URI request
If the response is successful, M3 is set to the content returned in the response.
The final PWP manifest is set to M = M2 ⊕ M3 and returned.
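The steps above can be sketched as follows. Everything not named in this document is an assumption made for illustration: manifests are modeled as dictionaries serialized as JSON, the two media types are placeholders, and the network access, the unpacking and the HTML scanning are injected as callables (`fetch`, `unpack`, `scan_html`) so that the control flow stays visible. Resolving a relative URI against L is omitted.

```python
import json
from dataclasses import dataclass
from typing import Callable, Optional

PACKAGE_TYPE = "application/pwp+zip"             # hypothetical media type
MANIFEST_TYPE = "application/pwp-manifest+json"  # hypothetical media type


@dataclass
class Response:
    ok: bool
    media_type: str = ""
    body: bytes = b""
    link_manifest: Optional[str] = None  # URI from Link; rel="pwp_manifest"


def combine(ma: dict, mb: dict) -> dict:
    """M = Ma ⊕ Mb: L must agree, values from Mb have priority."""
    if "L" in ma and "L" in mb and ma["L"] != mb["L"]:
        raise ValueError("canonical locator L differs between manifests")
    return {**ma, **mb}


def fetch_manifest(fetch: Callable[[str], Response], uri: str) -> dict:
    """HTTP GET URI; an unsuccessful response leaves the manifest empty."""
    response = fetch(uri)
    return json.loads(response.body) if response.ok else {}


def obtain_manifest(L, fetch, unpack, scan_html) -> Optional[dict]:
    m2, m3 = {}, {}                      # two initially empty manifests
    response = fetch(L)                  # HTTP GET L
    if not response.ok:
        return None                      # the process fails with no results
    # Payload branch: provides a value for M2.
    if response.media_type == PACKAGE_TYPE:
        m2 = unpack(response.body)
    elif response.media_type == MANIFEST_TYPE:
        m2 = json.loads(response.body)
    else:                                # otherwise the resource is HTML
        link_href, embedded = scan_html(response.body)
        m1_0 = fetch_manifest(fetch, link_href) if link_href else {}
        m1_1 = embedded or {}
        m2 = combine(m1_0, m1_1)
    # Header branch: provides a value for M3.
    if response.link_manifest:
        m3 = fetch_manifest(fetch, response.link_manifest)
    return combine(m2, m3)


# Usage with an in-memory stand-in for the server (all URLs hypothetical).
L = "https://example.org/published-books/1"
PAGES = {
    L: Response(True, "text/html", b"<html></html>",
                link_manifest="https://example.org/m3.json"),
    "https://example.org/m1.json": Response(True, MANIFEST_TYPE, json.dumps(
        {"L": L, "Lu": "https://example.org/stale/"}).encode()),
    "https://example.org/m3.json": Response(True, MANIFEST_TYPE, json.dumps(
        {"Lu": "https://example.org/books/1/"}).encode()),
}
manifest = obtain_manifest(
    L,
    fetch=lambda url: PAGES.get(url, Response(False)),
    unpack=lambda body: {},
    scan_html=lambda body: (
        "https://example.org/m1.json",
        {"Lp": "https://example.org/packed-books/1/package.pwp"}),
)
print(manifest["Lu"])  # value from the manifest referred to by the Link header
```

In this run the value of Lu from the linked manifest is overridden by the one referred to in the response header, which is the priority the algorithm is meant to establish.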
An important point of the algorithm is that it defines a priority for the manifest items in case several manifest instances contain respective values (see the definition for the combination of manifests). At present, the priority is as follows, in decreasing priority order:
Manifest referred to by the HTTP Link Header
Manifest extracted from the payload; if that means retrieving the manifest from the HTML content, then:
Manifest embedded in the HTML
Manifest referred to by a <link> element.
otherwise, access to the manifest via a package and access directly through the payload are mutually exclusive, i.e., no priority applies between them
The algorithm considers only HTML as a possible non-packaged and non-manifest response format. It may become possible to allow, for example, SVG as another, possible format for a PWP; this depends on the final specification of a PWP. The algorithm should then be adapted accordingly by adding a relevant branch (e.g., the specification of SVG includes <script> element that can be used to embed a manifest, but does not have a <link> element).
It may become possible for an HTML file to include several <link> elements, each referring to a manifest. If that becomes allowed by a PWP specification, the corresponding step could be modified by taking all link elements into account, and sequentially combining the manifest files in document order to yield M1,0. The same note is valid for (possibly) several <script> elements and M1,1, respectively.
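Collecting the candidates in document order could look like the sketch below. It assumes that embedded manifests are JSON inside a <script> element carrying a placeholder type attribute; that type, the class name `ManifestScanner` and the file names are hypothetical, and `pwp_manifest` is the illustrative constant used in the algorithm.

```python
import json
from html.parser import HTMLParser

MANIFEST_REL = "pwp_manifest"
MANIFEST_SCRIPT_TYPE = "application/pwp-manifest+json"  # hypothetical


class ManifestScanner(HTMLParser):
    """Collect linked and embedded manifests in document order."""

    def __init__(self):
        super().__init__()
        self.link_hrefs = []   # candidates for M1,0
        self.embedded = []     # candidates for M1,1
        self._in_script = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        rel_tokens = (a.get("rel") or "").split()
        if tag == "link" and MANIFEST_REL in rel_tokens and a.get("href"):
            self.link_hrefs.append(a["href"])
        elif tag == "script" and a.get("type") == MANIFEST_SCRIPT_TYPE:
            self._in_script = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_script:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_script:
            self._in_script = False
            try:
                self.embedded.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # unsuccessful parsing contributes nothing


html = """<html><head>
<link rel="pwp_manifest" href="m-a.json">
<link rel="stylesheet" href="style.css">
<link rel="pwp_manifest" href="m-b.json">
<script type="application/pwp-manifest+json">{"Lp": "package.pwp"}</script>
</head><body></body></html>"""

scanner = ManifestScanner()
scanner.feed(html)
print(scanner.link_hrefs)  # ['m-a.json', 'm-b.json']
print(scanner.embedded)    # [{'Lp': 'package.pwp'}]
```

The manifests retrieved from `link_hrefs` would then be combined left to right, so that later elements take priority over earlier ones.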
Similarly, if a PWP specification allows for several different serialization syntaxes for manifests, the processor should be able to recognize and parse them accordingly. The expectation is that the various possible serializations MUST serialize the same content, i.e., these do not influence the final result.
The algorithm is silent on the details on how a manifest should be retrieved from a package. This depends on the detailed specification of packaging, on whether a manifest would have to be at a known location within a package, on whether there might be several manifest instances within a package, etc. It is also possible that the details would follow a similar approach as described in this algorithm, i.e., relying on embedded and linked manifests of a top level HTML file, for example. As far as the algorithm described in this section is concerned, these details do not influence the final result.
The algorithm makes use of the constant pwp_manifest; the exact value of this constant must be defined, and registered, through a more precise specification of PWPs. It is used here for illustrative purposes only.
The PWP Processor MAY include an Accept header (see [[rfc7231]]) when issuing an HTTP GET to express its preference for, e.g., a packed state of a PWP over a manifest payload, or in favor of a particular serialization of the manifest content. Whether this is done or not, and whether the server honors this preference, does not influence the details of the algorithm.
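Such a preference could be expressed as in the sketch below. Both media types are placeholders, since none have been defined for PWPs, and the function name `build_request` is hypothetical. The request object is only constructed here, not sent.

```python
from urllib.request import Request

PACKAGE_TYPE = "application/pwp+zip"             # hypothetical media type
MANIFEST_TYPE = "application/pwp-manifest+json"  # hypothetical media type


def build_request(locator: str, prefer_packed: bool = True) -> Request:
    """Prepare an HTTP GET on the locator with a weighted Accept header."""
    if prefer_packed:
        accept = f"{PACKAGE_TYPE}, {MANIFEST_TYPE};q=0.8, text/html;q=0.5"
    else:
        accept = f"{MANIFEST_TYPE}, text/html;q=0.8, {PACKAGE_TYPE};q=0.5"
    return Request(locator, headers={"Accept": accept}, method="GET")


request = build_request("https://example.org/published-books/1")
print(request.get_method())
print(request.get_header("Accept"))
```

Whatever the server returns, the processor still dispatches on the media type of the response, so the algorithm itself is unchanged.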