Synchronized Media for Publications CG: SyncMediaLite

This document shows how to use WebVTT, together with HTML, CSS, and JavaScript, to synchronize text highlights with timed media playback in a form that a web browser can consume directly.

File Details

The WebVTT file

The synchronization data must follow WebVTT's file structure, which consists of a series of cues, following the WEBVTT declaration at the top.

Each cue has an optional identifier (a number, in the example) and contains timing information for the media segment. It also includes a JSON-formatted custom cue payload with these properties:

Name	Required	Type	Description
`selector`	Required	Selector	Selects the text in the document that corresponds with this cue.

WEBVTT

1
00:00:00.000 --> 00:00:03.187
{"selector":{"type":"FragmentSelector","value":"sentence1"}}

2
00:00:03.187 --> 00:00:07.184
{"selector":{"type":"FragmentSelector","value":"sentence2"}}

3
00:00:07.184 --> 00:00:10.945
{"selector":{"type":"FragmentSelector","value":"sentence3"}}

Because of how WebVTT files are parsed, the cue payload must not contain any blank lines:
"Note that you cannot provide blank lines inside a metadata block, because the blank line signifies the end of the WebVTT cue."

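One way to avoid accidental blank lines is to generate the file with a script, since JSON.stringify called without an indent argument emits the payload on a single line. The following is a minimal sketch rather than part of this proposal; formatTimestamp, buildVtt, and the startMs/endMs property names are hypothetical.

```javascript
// Hypothetical helper: format integer milliseconds as a WebVTT timestamp (hh:mm:ss.ttt).
function formatTimestamp(ms) {
    const pad = (n, width) => String(n).padStart(width, "0");
    const hours = Math.floor(ms / 3600000);
    const minutes = Math.floor((ms % 3600000) / 60000);
    const seconds = Math.floor((ms % 60000) / 1000);
    return `${pad(hours, 2)}:${pad(minutes, 2)}:${pad(seconds, 2)}.${pad(ms % 1000, 3)}`;
}

// Hypothetical helper: build a WebVTT file from an array of
// { startMs, endMs, selector } objects (property names are assumptions).
function buildVtt(cues) {
    const blocks = cues.map((cue, index) => {
        // Without an indent argument, JSON.stringify never emits a line break,
        // so the payload cannot contain a blank line.
        const payload = JSON.stringify({ selector: cue.selector });
        const timing = `${formatTimestamp(cue.startMs)} --> ${formatTimestamp(cue.endMs)}`;
        return `${index + 1}\n${timing}\n${payload}`;
    });
    // Cues are separated from the header and from each other by one blank line.
    return "WEBVTT\n\n" + blocks.join("\n\n") + "\n";
}
```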
This example uses FragmentSelectors, which link to element ids, but the selector property allows other selector types too. Other examples in this document use CssSelectors and TextPositionSelectors. See Word-level selectors for notes on referencing sub-element ranges.

<span id="sentence1">"Lorem ipsum!" she yelled.</span> 
<span id="sentence2">"Dolor sit amet," her friend replied, laughing.</span>
<span id="sentence3">She shook her head, "Qui sit voluptate."</span>
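
As a rough illustration of how a script might map a cue's selector to an element in markup like this, the sketch below handles only FragmentSelector (treated as a plain element id) and CssSelector. The function name resolveElement and its doc parameter are hypothetical, not part of any API.

```javascript
// Hypothetical helper: find the element that a cue's selector points to.
// `doc` is the Document to search; passing it in keeps the function easy to test.
function resolveElement(selector, doc) {
    switch (selector.type) {
        case "FragmentSelector":
            // Assumption: the value is a bare element id, as in the example cues.
            return doc.getElementById(selector.value);
        case "CssSelector":
            return doc.querySelector(selector.value);
        default:
            // Other selector types are not handled in this sketch.
            return null;
    }
}
```

In a page, resolveElement({type: "FragmentSelector", value: "sentence1"}, document) would return the first span above.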

Associating the WebVTT file with the HTMLMediaElement

The WebVTT synchronization data is associated with an HTMLMediaElement by referencing it as a metadata track on that element.

<audio controls autoplay src="chapter.mp3">
 <track default src="highlight.vtt" kind="metadata" id="highlights">
</audio>

The entire audio file can be used, or a portion of it can be used via media fragments.
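
A temporal media fragment is written as a #t=start,end suffix on the media URL, with times in seconds. The helper below is a small sketch of building such a URL; the name withTemporalFragment is hypothetical.

```javascript
// Hypothetical helper: append a temporal media fragment (#t=start,end) to a media URL.
// Times are in seconds; omit endSeconds to play from the start time to the end of the file.
function withTemporalFragment(src, startSeconds, endSeconds) {
    const base = src.split("#")[0]; // drop any fragment already on the URL
    const end = endSeconds === undefined ? "" : "," + endSeconds;
    return `${base}#t=${startSeconds}${end}`;
}
```

For example, withTemporalFragment("chapter.mp3", 10, 20) gives "chapter.mp3#t=10,20", which could be used as the src of the audio element.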

Styling the highlight

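Style the highlight in CSS using the ::highlight() pseudo-element, passing the name under which the highlight is registered ("sync" in the Text highlights example):

::highlight(sync) {
    background-color: yellow;
}

The Text highlights section shows how to turn a cue into a CSS custom highlight so that it can be styled this way.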

There are limitations when styling highlight pseudo-elements. Refer to Styling Highlights.

Processing

This section covers how to link media playback with text highlighting, using browser APIs.

Cue events

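As the HTMLMediaElement plays, its associated TextTrackCues fire events when playback reaches their start and end times. For example, this is how to listen for the enter event, which fires when a cue starts:

// Assumes the page contains the <audio> element shown earlier.
let audio = document.querySelector("audio");
let track = audio.textTracks[0];
for (let cue of track.cues) {
    cue.onenter = e => {
        // The cue text is the JSON payload from the WebVTT file.
        let cuePayload = JSON.parse(cue.text);
        doHighlighting(cuePayload.selector);
    };
}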
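
Because the WebVTT file is fetched asynchronously, the cue list may still be empty when a script first runs. The following sketch waits for the track element to finish loading before handing over its cues. The names onCuesReady, onReady, and onError are hypothetical; readyState and the load and error events belong to HTMLTrackElement.

```javascript
// Numeric values of HTMLTrackElement.LOADED and HTMLTrackElement.ERROR.
const TRACK_LOADED = 2;
const TRACK_ERROR = 3;

// Hypothetical helper: call onReady(cues) once the <track> element's cues are available.
function onCuesReady(trackElement, onReady, onError = () => {}) {
    if (trackElement.readyState === TRACK_LOADED) {
        onReady(trackElement.track.cues);
        return;
    }
    if (trackElement.readyState === TRACK_ERROR) {
        onError();
        return;
    }
    // Still loading (or not started): wait for the outcome.
    trackElement.addEventListener("load", () => onReady(trackElement.track.cues), { once: true });
    trackElement.addEventListener("error", () => onError(), { once: true });
}

// Usage in a page, assuming the <track id="highlights"> markup shown earlier:
// onCuesReady(document.getElementById("highlights"), cues => {
//     for (const cue of cues) { /* attach onenter handlers here */ }
// });
```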

Text highlights

This is how to programmatically create a Custom Highlight from a selector.

Create a StaticRange from a selector.

{
    "selector":{
        "type": "CssSelector",
        "value": ":nth-child(1 of .stanza) > :nth-child(1 of .line)",
        "refinedBy": {
            "type": "TextPositionSelector",
            "start": 0,
            "end": 4
        }
    }
}

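function createRange(selector) {
    let node = document.querySelector(selector.value);
    if (!selector.refinedBy) {
        // No refinement: cover the whole element.
        return new StaticRange({
            startContainer: node, startOffset: 0,
            endContainer: node, endOffset: node.childNodes.length
        });
    }
    // Assumes the element's first child is a single text node, and that
    // "end" is the index of the last character to include (hence the + 1).
    return new StaticRange({
        startContainer: node.firstChild,
        startOffset: selector.refinedBy.start,
        endContainer: node.firstChild,
        endOffset: selector.refinedBy.end + 1
    });
}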

Then create a Custom Highlight using that range.

function doHighlighting(selector) {
    let range = createRange(selector);
    let highlight = new Highlight(range);
    CSS.highlights.set("sync", highlight); // "sync" is chosen arbitrarily here
}
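
Writing the start and end offsets by hand is error-prone, so an authoring script might derive them from the element's text. The sketch below assumes, as createRange above does, that end is the index of the last character to highlight. The function name wordPositions is hypothetical, and splitting on whitespace is only a rough stand-in for real word segmentation.

```javascript
// Hypothetical authoring helper: list a TextPositionSelector-style object for each
// whitespace-separated word in `text`, with an inclusive `end` offset.
function wordPositions(text) {
    const positions = [];
    const wordPattern = /\S+/g;
    let match;
    while ((match = wordPattern.exec(text)) !== null) {
        positions.push({
            type: "TextPositionSelector",
            start: match.index,
            // Inclusive end: index of the word's last character.
            end: match.index + match[0].length - 1
        });
    }
    return positions;
}
```

For the text "Once upon a time" this yields the offset pairs 0 to 3, 5 to 8, 10 to 10, and 12 to 15. Offsets are counted in UTF-16 code units, which is also how offsets into a text node are counted.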

Layering highlights

Multiple simultaneous highlights are possible when cues overlap. Here is an example using multiple WebVTT files to provide highlight layers for the words, lines, and stanzas of a poem. The track order gives the highlight priority order: in this example, stanzas are highlighted first, lines are drawn on top of them, and words are drawn on top of the lines.

<audio controls autoplay src="chapter.mp3">
 <track id="stanzas" kind="metadata" src="stanza.vtt" label="Stanzas" default/>
 <track id="lines" kind="metadata" src="line.vtt" label="Lines" default/>
 <track id="words" kind="metadata" src="word.vtt" label="Words" default/>
</audio>

WEBVTT

00:00:07.800 --> 00:00:08.200
{"selector": {"type": "CssSelector", "value": ":nth-child(1 of .stanza) > :nth-child(1 of .line)", "refinedBy": {"type": "TextPositionSelector", "start": 0, "end": 3}}}

00:00:08.200 --> 00:00:08.800
{"selector": {"type": "CssSelector", "value": ":nth-child(1 of .stanza) > :nth-child(1 of .line)", "refinedBy": {"type": "TextPositionSelector", "start": 5, "end": 8}}}

00:00:08.800 --> 00:00:09.200
{"selector": {"type": "CssSelector", "value": ":nth-child(1 of .stanza) > :nth-child(1 of .line)", "refinedBy": {"type": "TextPositionSelector", "start": 10, "end": 10}}}

WEBVTT

00:00:07.800 --> 00:00:12.600
{"selector":{"type":"CssSelector","value":":nth-child(1 of .stanza) > :nth-child(1 of .line)"}}

00:00:12.600 --> 00:00:16.200
{"selector":{"type":"CssSelector","value":":nth-child(1 of .stanza) > :nth-child(2 of .line)"}}

00:00:17.200 --> 00:00:21.000
{"selector":{"type":"CssSelector","value":":nth-child(1 of .stanza) > :nth-child(3 of .line)"}}

WEBVTT

00:00:07.800 --> 00:00:31.100
{"selector":{"type":"CssSelector","value":":nth-child(1 of .stanza)"}}

00:00:32.300 --> 00:00:55.800
{"selector":{"type":"CssSelector","value":":nth-child(2 of .stanza)"}}

In that example, a cue from each WebVTT file starts at the same time. Their associated highlights can all be different, for example the stanza can have a light background, the line can have a stronger background, and the word can be underlined. All the highlights can co-exist simultaneously because each one has its own key in the HighlightRegistry. The link between the highlight's key in the HighlightRegistry and the track is given by the track's ID.

let range = createRange(cuePayload.selector);
let highlight = new Highlight(range);
CSS.highlights.set(cue.track.id, highlight);

::highlight(stanzas) {
    background-color: lightyellow;
}

::highlight(lines) {
    text-decoration: none;
    background-color: yellow;
}
::highlight(words) {
    text-decoration: underline;
    text-decoration-style: dotted;
    text-decoration-thickness: 2px;
}
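
Putting these pieces together, each track can set its own registry entry when a cue starts and remove it when the cue ends. This is a sketch under assumptions: bindTrack and makeHighlight are hypothetical names, and the registry is passed in so that CSS.highlights (or a plain Map in a test) can be used.

```javascript
// Hypothetical helper: keep one highlight per track, keyed by the track's id.
// `registry` is expected to behave like CSS.highlights (set, get, delete).
// `makeHighlight` turns a selector into a highlight object.
function bindTrack(track, registry, makeHighlight) {
    for (const cue of track.cues) {
        let highlight = null;
        cue.onenter = () => {
            const payload = JSON.parse(cue.text);
            highlight = makeHighlight(payload.selector);
            registry.set(track.id, highlight);
        };
        cue.onexit = () => {
            // Only clear the entry if a later cue has not already replaced it.
            if (registry.get(track.id) === highlight) {
                registry.delete(track.id);
            }
        };
    }
}

// Usage in a page, assuming createRange from the Text highlights section:
// for (const track of audio.textTracks) {
//     bindTrack(track, CSS.highlights, sel => new Highlight(createRange(sel)));
// }
```

Checking the current entry before deleting it means that a cue ending at the same moment the next one begins does not remove the newer highlight, whichever order the two events arrive in.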

Content navigation

Previous/next navigation

This user interface feature lets a user move between cues. When paired with multiple simultaneous tracks, it enables variations on the previous/next navigation feature, for example allowing a way to move to the next word, sentence, or paragraph.

Cues may appear in a WebVTT file in any order, not necessarily in document flow order. They can be sorted by timestamp when it is necessary to calculate previous and next cues.
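
A minimal sketch of that calculation follows. It works on any list of objects with a startTime property, such as the cues of a text track; the function names are hypothetical.

```javascript
// Return a copy of the cues sorted by start time.
function sortedByStart(cues) {
    return Array.from(cues).sort((a, b) => a.startTime - b.startTime);
}

// Start time of the first cue that begins after currentTime, or null if there is none.
function nextCueTime(cues, currentTime) {
    const next = sortedByStart(cues).find(cue => cue.startTime > currentTime);
    return next ? next.startTime : null;
}

// Start time of the last cue that begins before currentTime, or null if there is none.
function previousCueTime(cues, currentTime) {
    const earlier = sortedByStart(cues).filter(cue => cue.startTime < currentTime);
    return earlier.length ? earlier[earlier.length - 1].startTime : null;
}
```

A "next" button could then assign the result of nextCueTime(track.cues, audio.currentTime) to audio.currentTime when it is not null. Note that in the middle of a cue, previousCueTime returns the start of the current cue, so a first press restarts the current item; whether that is the desired behavior is a design choice.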

Skip/escape

A user listening to a narrated document may have certain elements "set to skip" (for example, page number announcements might be turned off), or may want to "escape" from a complex structure, such as a table or nested list, and return to the main content flow.

The HTML document's semantics provide all the necessary information. The user agent only needs to reposition the audio timeline past any cues that reference the construct being skipped or escaped, so skip/escape semantics and nesting do not need to be included in the WebVTT file.
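
One possible way to compute the new position is sketched below. It assumes each cue has already been paired with the element its selector resolves to, and that the cues inside the escaped structure are contiguous in time. The function name escapeTime and the entry shape are hypothetical.

```javascript
// Hypothetical helper: find where to resume playback after escaping `container`.
// `entries` is an array of { startTime, endTime, element } objects, one per cue.
// `container` is the element being escaped, for example a <table>.
function escapeTime(entries, container, currentTime) {
    const remaining = entries.filter(entry =>
        entry.element &&
        container.contains(entry.element) &&
        entry.endTime > currentTime
    );
    if (remaining.length === 0) {
        return currentTime; // nothing left to skip
    }
    // Resume at the end of the last cue that points inside the container.
    return Math.max(...remaining.map(entry => entry.endTime));
}
```

The result can be assigned to audio.currentTime. The same function covers "skip" if it is called when playback is about to enter an element that the user has set to skip.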

Incorporating into other formats

In cases where the audio narration file is not referenced in the HTML document, more scaffolding is required to make the association among the HTML document, the audio narration, and the WebVTT cues.

This relationship then has to be made programmatically by the user agent handling the format.

These are ideas, nothing official yet!

EPUB

Similar to how an EPUB manifest associates a Media Overlay document with a Content Document:

<item href="chapter1.xhtml" id="ch1" media-type="application/xhtml+xml" media-overlay="ch1-overlay"/>
<item href="audio/ch1.mp3" id="ch1-audio" media-type="audio/mpeg"/>
<item href="chapter1.smil" id="ch1-overlay" media-type="application/smil+xml"/>

We can expand the media-overlay attribute to allow multiple values, and determine the purpose of each via its media-type:

<item href="chapter1.xhtml" id="ch1" media-type="application/xhtml+xml" media-overlay="ch1-audio ch1-vtt"/>
<item href="audio/ch1.mp3" id="ch1-audio" media-type="audio/mpeg"/>
<item href="ch1.vtt" id="ch1-vtt" media-type="text/vtt"/>

The EPUB Reading System then dynamically loads the audio and associates the WebVTT file with it. It uses a small amount of scripting to synchronize highlights with cue events.
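
To make the idea concrete, here is a sketch of how a Reading System might interpret the multi-valued media-overlay attribute proposed above. It follows this unofficial idea only; the function name resolveOverlay and the parsed item shape (id, href, mediaType, mediaOverlay) are hypothetical.

```javascript
// Hypothetical helper: given parsed manifest items, find the audio and WebVTT
// resources referenced by a content document's media-overlay attribute.
function resolveOverlay(items, contentId) {
    const byId = new Map(items.map(item => [item.id, item]));
    const result = { audio: null, vtt: null };
    const content = byId.get(contentId);
    if (!content || !content.mediaOverlay) {
        return result;
    }
    // The proposed attribute holds a space-separated list of item ids.
    for (const idref of content.mediaOverlay.trim().split(/\s+/)) {
        const item = byId.get(idref);
        if (!item) continue;
        // The purpose of each referenced item is decided by its media type.
        if (item.mediaType === "text/vtt") {
            result.vtt = item.href;
        } else if (item.mediaType.startsWith("audio/")) {
            result.audio = item.href;
        }
    }
    return result;
}
```

For the manifest fragment above, this would return the hrefs audio/ch1.mp3 and ch1.vtt, which the Reading System could then attach to an audio element and a metadata track.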

Alternatively, the audio for a chapter and its WebVTT file could live directly in the chapter's HTML file.

Audiobooks

The audio file is already in the manifest; the HTML file is its alternate; and the WebVTT file can be another alternate.

"alternate" : [{
        "type" : "LinkedResource",
        "url" : "text/part001-1.html",
        "encodingFormat" : "text/html"
    },{
        "type" : "LinkedResource",
        "url" : "text/part001-1.vtt",
        "encodingFormat" : "text/vtt"
    }
]

This could raise the question of how the user agent presents alternate options. In this case, the desired approach is to use both alternates together rather than choosing one or the other.
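
A user agent following this idea might pick out both alternates by their encoding format, as in the sketch below. The function name findAlternates is hypothetical, and the input is assumed to be a parsed resource object that carries the alternate array shown above.

```javascript
// Hypothetical helper: return the URLs of the HTML and WebVTT alternates of a resource.
function findAlternates(resource) {
    const alternates = resource.alternate || [];
    const urlFor = format => {
        const match = alternates.find(alt => alt.encodingFormat === format);
        return match ? match.url : null;
    };
    // Both alternates are meant to be used together, not chosen between.
    return { html: urlFor("text/html"), vtt: urlFor("text/vtt") };
}
```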

Relationship to Other Specifications

WebVTT

HTML

Timed media

CSS

JavaScript

Selectors and states