This document shows how to use WebVTT, along with HTML, CSS, and JavaScript, to synchronize text highlights with timed media playback, consumable directly in a web browser.
The synchronization data is in a WebVTT file; specifically, a file using metadata content.
The text to be synchronized with timed media playback is in an HTML document, consumed in a web browser.
This document does not specify which format(s) of timed media are allowed. The media must be renderable via an HTMLMediaElement.
The highlight style is specified via the ::highlight pseudo-element.
A small amount of JavaScript is used to pair TextTrackCue events with Custom Highlights.
Text to be highlighted is identified with a selector.
The synchronization data must follow WebVTT's file structure, which consists of a series of cues following the WEBVTT declaration at the top.
Each cue has an optional identifier (a number, in the example) and contains timing information for the media segment. It also includes a JSON-formatted custom cue payload with these properties:
Name | Required | Type | Description |
---|---|---|---|
selector | Required | Selector | Selects the text in the document that corresponds with this cue. |
WEBVTT

1
00:00:00.000 --> 00:00:03.187
{"selector":{"type":"FragmentSelector","value":"sentence1"}}

2
00:00:03.187 --> 00:00:07.184
{"selector":{"type":"FragmentSelector","value":"sentence2"}}

3
00:00:07.184 --> 00:00:10.945
{"selector":{"type":"FragmentSelector","value":"sentence3"}}
Because of how WebVTT files are parsed, it is important not to have a blank line anywhere in the cue payload: "Note that you cannot provide blank lines inside a metadata block, because the blank line signifies the end of the WebVTT cue."
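When cue payloads are generated programmatically, one way to avoid this problem is to serialize the JSON on a single line; a minimal sketch:

// JSON.stringify, with no spacing argument, emits the payload as one line,
// so it cannot contain a blank line that would end the cue prematurely.
let payload = JSON.stringify({
  selector: { type: "FragmentSelector", value: "sentence1" }
});
// payload === '{"selector":{"type":"FragmentSelector","value":"sentence1"}}'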
This example uses FragmentSelectors, which link to element ids, but the selector property allows other selectors too. Other examples in this document use CssSelectors and TextRangeSelectors. See Word-level selectors for notes on referencing sub-element ranges.
<span id="sentence1">"Lorem ipsum!" she yelled.</span>
<span id="sentence2">"Dolor sit amet," her friend replied, laughing.</span>
<span id="sentence3">She shook her head, "Qui sit voluptate."</span>
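For comparison, the first cue's payload could target the same element with a CssSelector rather than a FragmentSelector (a sketch, not taken from the example files):

{"selector":{"type":"CssSelector","value":"#sentence1"}}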
The WebVTT synchronization data is associated with an HTMLMediaElement by referencing the WebVTT file as a metadata track on that media element.
<audio controls autoplay src="chapter.mp3">
  <track default src="highlight.vtt" kind="metadata" id="highlights">
</audio>
The entire audio file can be used, or a portion of it can be used via media fragments.
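For example, a temporal media fragment restricts playback to a time range within the same file (the offsets here are illustrative):

<audio controls autoplay src="chapter.mp3#t=30,90">
  <track default src="highlight.vtt" kind="metadata" id="highlights">
</audio>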
Style the highlight in CSS using the ::highlight pseudo-element:
::highlight(sync) {
  background-color: yellow;
}
Text highlights shows how to turn the cue into a CSS Highlight, registered here under the name "sync", and thereby make it available for styling in this way.
There are limitations when styling highlight pseudo-elements. Refer to Styling Highlights.
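As a sketch, only highlight-applicable properties (such as color, background-color, text-decoration, and text-shadow) take effect on highlight pseudo-elements; layout properties are ignored:

::highlight(sync) {
  background-color: yellow;   /* applies */
  text-decoration: underline; /* applies */
  color: black;               /* applies */
  /* properties such as margin or font-size would have no effect here */
}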
This section covers how to link media playback with text highlighting, using browser APIs.
As the HTMLMediaElement plays, associated TextTrackCues will trigger events when their timestamps are reached. For example, this is how to listen to the enter event, which fires when a cue starts:
let track = audio.textTracks[0];
for (let cue of track.cues) {
  cue.onenter = e => {
    let cuePayload = JSON.parse(cue.text);
    doHighlighting(cuePayload.selector);
  };
}
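An alternative sketch uses the track-level cuechange event and the track's activeCues list instead of per-cue handlers:

// cuechange fires whenever the set of currently active cues changes
track.oncuechange = () => {
  for (let i = 0; i < track.activeCues.length; i++) {
    let cuePayload = JSON.parse(track.activeCues[i].text);
    doHighlighting(cuePayload.selector);
  }
};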
This is how to programmatically create a Custom Highlight from a selector.
Create a StaticRange from a selector.
{ "selector":{ "type": "CssSelector", "value": "nth-child(1 of .stanza) > :nth-child(1 of .line)", "refinedBy": { "type": "TextPositionSelector", "start": 0, "end": 4 } } }
function createRange(selector) {
  let node = document.querySelector(selector.value);
  let textNode = node.firstChild;
  // If the selector has no refinement, cover the element's whole text content
  let start = selector.refinedBy ? selector.refinedBy.start : 0;
  let end = selector.refinedBy ? selector.refinedBy.end + 1 : textNode.length;
  return new StaticRange({
    startContainer: textNode,
    startOffset: start,
    endContainer: textNode,
    endOffset: end
  });
}
Then create a Custom Highlight using that range.
function doHighlighting(selector) {
  let range = createRange(selector);
  let highlight = new Highlight(range);
  CSS.highlights.set("sync", highlight); // "sync" is chosen arbitrarily here
}
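The exit event is the counterpart of enter and fires when a cue's end time is reached; a minimal sketch for clearing the highlight at that point, assuming the same track and the "sync" registry key used above:

for (let cue of track.cues) {
  cue.onexit = e => {
    // Remove the "sync" entry so the highlight disappears when its cue ends
    CSS.highlights.delete("sync");
  };
}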
Multiple simultaneous highlights are possible when cues overlap. Here is an example using multiple WebVTT files to provide highlight layers for words, lines, and stanzas of a poem. The track order gives the highlight priority order. In this example, stanzas are highlighted, followed by lines appearing on top of that, followed by words.
<audio controls autoplay src="chapter.mp3">
  <track id="stanzas" kind="metadata" src="stanza.vtt" label="Stanzas" default/>
  <track id="lines" kind="metadata" src="line.vtt" label="Lines" default/>
  <track id="words" kind="metadata" src="word.vtt" label="Words" default/>
</audio>
WEBVTT

00:00:07.800 --> 00:00:08.200
{"selector": {"type": "CssSelector", "value": ":nth-child(1 of .stanza) > :nth-child(1 of .line)", "refinedBy": {"type": "TextPositionSelector", "start": 0, "end": 3}}}

00:00:08.200 --> 00:00:08.800
{"selector": {"type": "CssSelector", "value": ":nth-child(1 of .stanza) > :nth-child(1 of .line)", "refinedBy": {"type": "TextPositionSelector", "start": 5, "end": 8}}}

00:00:08.800 --> 00:00:09.200
{"selector": {"type": "CssSelector", "value": ":nth-child(1 of .stanza) > :nth-child(1 of .line)", "refinedBy": {"type": "TextPositionSelector", "start": 10, "end": 13}}}
WEBVTT

00:00:07.800 --> 00:00:12.600
{"selector":{"type":"CssSelector","value":":nth-child(1 of .stanza) > :nth-child(1 of .line)"}}

00:00:12.600 --> 00:00:16.200
{"selector":{"type":"CssSelector","value":":nth-child(1 of .stanza) > :nth-child(2 of .line)"}}

00:00:17.200 --> 00:00:21.000
{"selector":{"type":"CssSelector","value":":nth-child(1 of .stanza) > :nth-child(3 of .line)"}}
WEBVTT

00:00:07.800 --> 00:00:31.100
{"selector":{"type":"CssSelector","value":":nth-child(1 of .stanza)"}}

00:00:32.300 --> 00:00:55.800
{"selector":{"type":"CssSelector","value":":nth-child(2 of .stanza)"}}
In that example, a cue from each WebVTT file starts at the same time. Their associated highlights can all be different; for example, the stanza can have a light background, the line can have a stronger background, and the word can be underlined. All the highlights can co-exist simultaneously because each one has its own key in the HighlightRegistry. The link between the highlight's key in the HighlightRegistry and the track is given by the track's ID.
let range = createRange(cuePayload.selector);
let highlight = new Highlight(range);
CSS.highlights.set(cue.track.id, highlight);
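A sketch of the surrounding wiring, attaching that logic to the cues of every track (it assumes the tracks have finished loading and reuses createRange from earlier):

for (let i = 0; i < audio.textTracks.length; i++) {
  let track = audio.textTracks[i];
  for (let cue of track.cues) {
    cue.onenter = e => {
      let cuePayload = JSON.parse(cue.text);
      let range = createRange(cuePayload.selector);
      // One registry entry per track: "stanzas", "lines", and "words"
      CSS.highlights.set(cue.track.id, new Highlight(range));
    };
    // Clear that layer's highlight when its cue ends
    cue.onexit = e => CSS.highlights.delete(cue.track.id);
  }
}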
::highlight(stanzas) {
  background-color: lightyellow;
}
::highlight(lines) {
  text-decoration: none;
  background-color: yellow;
}
::highlight(words) {
  text-decoration: underline;
  text-decoration-style: dotted;
  text-decoration-thickness: 2px;
}
It is possible to use a CssSelector with a TextPositionSelector to reference a sub-element character range.
{ "selector":{ "type": "CssSelector", "value": "nth-child(1 of .stanza) > :nth-child(1 of .line)", "refinedBy": { "type": "TextPositionSelector", "start": 0, "end": 4 } } }
See the multi-level highlights example for a WebVTT excerpt using this type of selector.
The text of the poem in the example did not need to be marked up with ids!
At the time of writing, autoplay policy implementations were not working sufficiently for proper testing, but many browsers offer another way to do this: a setting to allow autoplay per domain. This is not necessarily the most user-friendly approach, but it does work.
To be able to start playback of HTML media elements automatically, the HTML document to be synchronized should be served with a compatible autoplay policy.
For example, in the context of a multi-HTML document presentation (like a book), this would enable a chapter to start playing as soon as it loads. When it's done, the next HTML file gets loaded automatically and starts playing as soon as it loads.
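A minimal sketch of that hand-off (nextChapterUrl is a hypothetical placeholder; a real presentation would derive it from its reading order):

const nextChapterUrl = "chapter2.html"; // hypothetical: supplied by the presentation's reading order

// When this chapter's narration finishes, load the next document;
// with a compatible autoplay policy, its audio starts playing on load.
audio.addEventListener("ended", () => {
  window.location.href = nextChapterUrl;
});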
The page ultimately controls the request to enable/disable autoplay, so allowing autoplay at the server level is reasonable.
In cases where the audio narration file is not referenced in the HTML document, more scaffolding is required to make the association among the HTML document, the audio narration, and the WebVTT cues.
This relationship then has to be made programmatically by the user agent handling the format.
These are ideas, nothing official yet!
Similar to how an EPUB manifest associates a Media Overlay document with a Content Document:
<item href="chapter1.xhtml" id="ch1" media-type="application/xhtml+xml" media-overlay="ch1-overlay"/> <item href="audio/ch1.mp3" id="ch1-audio" media-type="audio/mpeg"/> <item href="chapter1.smil" id="ch1-overlay" media-type="application/smil+xml"/>
We can expand the media-overlay attribute to allow multiple values, and determine the purpose of each via its media-type:
<item href="chapter1.xhtml" id="ch1" media-type="application/xhtml+xml" media-overlay="ch1-audio ch1-vtt"/> <item href="audio/ch1.mp3" id="ch1-audio" media-type="audio/mpeg"/> <item href="ch1.vtt" id="ch1-vtt" media-type="text/vtt"/>
The EPUB Reading System then dynamically loads the audio and associates the WebVTT file with it. It uses a small amount of scripting to synchronize highlights with cue events.
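A rough sketch of what a Reading System might do, under these assumptions, using the manifest entries above and the cue wiring shown earlier:

// Inject the chapter audio and its WebVTT metadata track into the rendered chapter
let audio = document.createElement("audio");
audio.src = "audio/ch1.mp3"; // href of the ch1-audio manifest item
audio.controls = true;

let trackEl = document.createElement("track");
trackEl.src = "ch1.vtt";     // href of the ch1-vtt manifest item
trackEl.kind = "metadata";
trackEl.default = true;      // a default metadata track is loaded in "hidden" mode
audio.appendChild(trackEl);
document.body.appendChild(audio);

// Once the track has loaded, attach the same cue handlers used earlier
trackEl.addEventListener("load", () => {
  for (let cue of trackEl.track.cues) {
    cue.onenter = e => doHighlighting(JSON.parse(cue.text).selector);
    cue.onexit = e => CSS.highlights.delete("sync");
  }
});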
Alternately, the audio for a chapter and its WebVTT file could live directly in the chapter's HTML file.
The audio file is already in the manifest; the HTML file is its alternate; and the WebVTT file can be another alternate.
"alternate" : [{ "type" : "LinkedResource", "url" : "text/part001-1.html", "encodingFormat" : "text/html" },{ "type" : "LinkedResource", "url" : "text/part001-1.vtt", "encodingFormat" : "text/vtt" } ]
This could raise the question of how the user agent presents alternate options. In this case, the desired approach is to use both alternates together rather than choosing one or the other.
At the time of publication, the members of the Synchronized Multimedia for Publications Community Group were:
Avneesh Singh (DAISY Consortium), Ben Dugas (Rakuten, Inc.), Chris Needham (British Broadcasting Corporation), Daniel Weck (DAISY Consortium), Didier Gehrer, Farrah Little (BC Libraries Cooperative), George Kerscher (DAISY Consortium), Ivan Herman (W3C), James Donaldson, Lars Wallin (Colibrio), Livio Mondini, Lynn McCormack (CAST, Inc), Marisa DeMeglio (DAISY Consortium, chair), Markku Hakkinen (Educational Testing Service), Matt Garrish (DAISY Consortium), Michiel Westerbeek (Tella), Nigel Megitt (British Broadcasting Corporation), Romain Deltour (DAISY Consortium), Wendy Reid (Rakuten, Inc.), Zheng Xu (Rakuten, Inc.)