This document shows how to use WebVTT, along with HTML, CSS, and JavaScript, to synchronize text highlights with timed media playback directly in a web browser.
The synchronization data is in a WebVTT file; specifically, a file using metadata content.
The text to be synchronized with timed media playback is in an HTML document, consumed in a web browser.
This document does not specify which format(s) of timed media are allowed; the media must simply be renderable via an HTMLMediaElement.
The highlight style is specified via the ::highlight pseudo-element.
A small amount of JavaScript is used to pair TextTrackCue events with Custom Highlights.
Identify text to be highlighted with a selector.
The synchronization data must follow WebVTT's file structure, which consists of a series of cues following the WEBVTT declaration at the top.
Each cue has an identifier (a number, in the example) and contains timing information for the media segment. It also includes a JSON-formatted custom cue payload with these properties:
Name | Required | Type | Description
---|---|---|---
selector | Required | Selector | Selects the text in the document that corresponds with this cue.
group | Optional | String or array of strings | Name(s) of the group(s) the cue belongs to.
WEBVTT

1
00:00:00.000 --> 00:00:03.187
{"selector":{"type":"FragmentSelector","value":"dtb1"},"group":"phrase"}

2
00:00:03.187 --> 00:00:07.184
{"selector":{"type":"FragmentSelector","value":"dtb2"},"group":"phrase"}

3
00:00:07.184 --> 00:00:10.945
{"selector":{"type":"FragmentSelector","value":"dtb3"},"group":"phrase"}
Because of how WebVTT files are parsed, it is important not to have a blank line anywhere in the cue payload: "Note that you cannot provide blank lines inside a metadata block, because the blank line signifies the end of the WebVTT cue."
This example uses FragmentSelectors, which link to element ids, but the selector property allows other selectors too. Other examples in this document use CssSelectors and TextPositionSelectors. See Word-level selectors for notes on referencing sub-element ranges.
The synchronization data of WebVTT is associated with an HTMLMediaElement by being referenced as a metadata track on that media object.
<audio controls autoplay src="chapter.mp3">
    <track default src="highlight.vtt" kind="metadata">
</audio>
The entire audio file can be used, or a portion of it can be used via media fragments.
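As a hypothetical illustration (the fragment values are not from the original example), a temporal media fragment can be appended to the audio URL to restrict playback to a portion of the file:

<audio controls autoplay src="chapter.mp3#t=0,180">
    <track default src="highlight.vtt" kind="metadata">
</audio>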
Style the highlight in CSS using the ::highlight pseudo-element; its argument is the name under which the Custom Highlight is registered (sync, in this document's examples):
::highlight(sync) {
    background-color: yellow;
}
The Text highlights section shows how to make the cue into a CSS Highlight and therefore available to style in this way.
There are limitations on which CSS properties can be applied to highlight pseudo-elements. Refer to Styling Highlights.
This section covers how to link media playback with text highlighting, using browser APIs.
As the HTMLMediaElement plays, associated TextTrackCues will trigger events when their timestamps are reached. For example, this is how to listen to the enter event, for when a cue starts:
let track = Array.from(document.querySelector('audio').textTracks)[0];
Array.from(track.cues).forEach(cue => {
    cue.onenter = e => {
        // The cue's text is the JSON payload described above
        let cuePayload = JSON.parse(cue.text);
        doHighlighting(cuePayload.selector);
    };
});
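Note that, depending on load timing, track.cues may still be empty when this code runs, and cue events only fire while the track is not disabled. A minimal sketch (an addition to the original snippet, assuming the same markup as above) that waits for the track to load before attaching the handlers:

let trackElement = document.querySelector('audio track');
trackElement.addEventListener('load', () => {
    let track = trackElement.track;
    track.mode = 'hidden'; // cue events fire, but the metadata cues are never rendered
    Array.from(track.cues).forEach(cue => {
        cue.onenter = () => {
            doHighlighting(JSON.parse(cue.text).selector);
        };
    });
});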
This is how to programmatically create a Custom Highlight from a selector. First, create a StaticRange from a selector such as the following:
{ "selector":{ "type": "CssSelector", "value": "nth-child(1 of .stanza) > :nth-child(1 of .line)", "refinedBy": { "type": "TextPositionSelector", "start": 0, "end": 4 } } }
function createRange(selector) {
    let node = document.querySelector(selector.value);
    // The cue payload's "end" is treated as inclusive, so add 1 for the exclusive endOffset
    return new StaticRange({
        startContainer: node.firstChild,
        startOffset: selector.refinedBy.start,
        endContainer: node.firstChild,
        endOffset: selector.refinedBy.end + 1
    });
}
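The function above assumes the selector has a refinedBy TextPositionSelector and that the selected element contains a single text node. As a hedged sketch (createRangeForSelector is a hypothetical name, not part of the original), a variant that also accepts selectors without refinedBy by covering the element's whole contents:

function createRangeForSelector(selector) {
    let node = document.querySelector(selector.value);
    if (selector.refinedBy) {
        // Same logic as createRange above
        return new StaticRange({
            startContainer: node.firstChild,
            startOffset: selector.refinedBy.start,
            endContainer: node.firstChild,
            endOffset: selector.refinedBy.end + 1
        });
    }
    // No text position refinement: highlight the element's entire contents
    let range = document.createRange();
    range.selectNodeContents(node);
    return range;
}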
Then create a Custom Highlight using that range.
function doHighlighting(selector) {
    let range = createRange(selector);
    let highlight = new Highlight(range);
    CSS.highlights.set("sync", highlight); // "sync" is chosen arbitrarily here
}
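The snippets above only react to a cue's enter event. A complementary sketch (an addition, not part of the original snippets) clears the highlight when a cue's exit event fires, using the same track variable and registry key as above:

Array.from(track.cues).forEach(cue => {
    cue.onexit = () => {
        // Remove the "sync" highlight when the cue's time range ends
        CSS.highlights.delete("sync");
    };
});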
Style the highlight using the ::highlight pseudo-element.
::highlight(sync) {
    background-color: yellow;
}
Multiple simultaneous highlights are possible when cues overlap. Here is an example with cues for a nested document structure of stanzas, lines, and words.
Cues with overlapping timing work best when they belong to different groups.
WEBVTT

1
00:00:07.800 --> 00:00:08.200
{"group":"word", "selector": {"type": "CssSelector", "value": ":nth-child(1 of .stanza) > :nth-child(1 of .line)", "refinedBy": {"type": "TextPositionSelector", "start": 0, "end": 3}}}

2
00:00:08.200 --> 00:00:08.800
{"group":"word", "selector": {"type": "CssSelector", "value": ":nth-child(1 of .stanza) > :nth-child(1 of .line)", "refinedBy": {"type": "TextPositionSelector", "start": 5, "end": 8}}}

3
00:00:08.800 --> 00:00:09.200
{"group":"word", "selector": {"type": "CssSelector", "value": ":nth-child(1 of .stanza) > :nth-child(1 of .line)", "refinedBy": {"type": "TextPositionSelector", "start": 10, "end": 10}}}

4
00:00:07.800 --> 00:00:12.600
{"group":"line", "selector": {"type": "CssSelector", "value": ":nth-child(1 of .stanza) > :nth-child(1 of .line)"}}

5
00:00:12.600 --> 00:00:16.200
{"group":"line", "selector": {"type": "CssSelector", "value": ":nth-child(1 of .stanza) > :nth-child(2 of .line)"}}

6
00:00:17.200 --> 00:00:21.000
{"group":"line", "selector": {"type": "CssSelector", "value": ":nth-child(1 of .stanza) > :nth-child(3 of .line)"}}

7
00:00:07.800 --> 00:00:31.100
{"group":"stanza", "selector": {"type": "CssSelector", "value": ":nth-child(1 of .stanza)"}}
In that example, cues #1, #4, and #7 all start at the same time. Their purposes can be distinguished by their group values, and therefore their highlights can all be different: for example, the stanza can have a light background, the line a stronger background, and the word an underline.
A highlight is registered with the HighlightRegistry with a key. Use the group name for this key to style cues based on their group value(s).
let range = createRange(cuePayload.selector);
let highlight = new Highlight(range);
CSS.highlights.set(cuePayload.group, highlight);
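Because group may also be an array of strings (see the cue payload table above), a small sketch (registerHighlight is a hypothetical helper, not part of the original) that registers the highlight under each group name:

function registerHighlight(cuePayload) {
    let range = createRange(cuePayload.selector);
    // "group" may be a single string or an array of strings
    let groups = Array.isArray(cuePayload.group) ? cuePayload.group : [cuePayload.group];
    for (let group of groups) {
        CSS.highlights.set(group, new Highlight(range));
    }
}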
::highlight(stanza) {
    background-color: lightyellow;
}
::highlight(line) {
    text-decoration: none;
    background-color: yellow;
}
::highlight(word) {
    text-decoration: underline;
    text-decoration-style: dotted;
    text-decoration-thickness: 2px;
}
It is possible to use a CssSelector with a TextPositionSelector to reference a sub-element character range.
{ "selector":{ "type": "CssSelector", "value": "nth-child(1 of .stanza) > :nth-child(1 of .line)", "refinedBy": { "type": "TextPositionSelector", "start": 0, "end": 4 } } }
See the multi-level highlights example for a WebVTT excerpt using this type of selector.
The text of the poem in the example did not need to be marked up with ids!
At the time of writing, autoplay policy implementations did not work well enough for proper testing, but many browsers offer a per-domain setting to allow autoplay. This is not necessarily the most user-friendly approach, but it does work.
To allow playback of HTML media elements to start automatically, the HTML document to be synchronized should be served with a compatible autoplay policy.
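As an illustration only (the header and allowlist below are assumptions, not taken from the original), one way a server can express such a policy is a Permissions-Policy response header on the HTML document; note that browsers still apply their own autoplay heuristics, as mentioned above:

Permissions-Policy: autoplay=(self)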
For example, in the context of a multi-HTML document presentation (like a book), this would enable a chapter to start playing as soon as it loads; when it finishes, the next HTML file is loaded automatically and playback continues.
The page ultimately controls the request to enable/disable autoplay, so allowing autoplay at the server level is reasonable.
In cases where the audio narration file is not referenced in the HTML document, more scaffolding is required to make the association among the HTML document, the audio narration, and the WebVTT cues.
This relationship then has to be made programmatically by the user agent handling the format.
These are ideas, nothing official yet!
Similar to how an EPUB manifest associates a Media Overlay document with a Content Document:
<item href="chapter1.xhtml" id="ch1" media-type="application/xhtml+xml" media-overlay="ch1-overlay"/> <item href="audio/ch1.mp3" id="ch1-audio" media-type="audio/mpeg"/> <item href="chapter1.smil" id="ch1-overlay" media-type="application/smil+xml"/>
We can expand the media-overlay attribute to allow multiple values, and determine the purpose of each via its media-type:
<item href="chapter1.xhtml" id="ch1" media-type="application/xhtml+xml" media-overlay="ch1-audio ch1-vtt"/> <item href="audio/ch1.mp3" id="ch1-audio" media-type="audio/mpeg"/> <item href="ch1.vtt" id="ch1-vtt" media-type="text/vtt"/>
The EPUB Reading System then dynamically loads the audio and associates the WebVTT file with it. It uses a small amount of scripting to synchronize highlights with cue events.
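A minimal sketch of the scripting a Reading System might inject into the Content Document, assuming the manifest above and reusing doHighlighting from earlier; the injection point and variable names are illustrative assumptions:

// Create the audio element and attach the WebVTT file as a metadata track
let audio = document.createElement('audio');
audio.src = 'audio/ch1.mp3';
audio.controls = true;

let trackElement = document.createElement('track');
trackElement.src = 'ch1.vtt';
trackElement.kind = 'metadata';
trackElement.default = true;
audio.appendChild(trackElement);
document.body.appendChild(audio);

// Once the track has loaded, wire cue events to the highlighting code
trackElement.addEventListener('load', () => {
    let track = trackElement.track;
    track.mode = 'hidden';
    Array.from(track.cues).forEach(cue => {
        cue.onenter = () => doHighlighting(JSON.parse(cue.text).selector);
    });
});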
Alternatively, the audio for a chapter and its WebVTT file could live directly in the chapter's HTML file.
The audio file is already in the manifest; the HTML file is its alternate; and the WebVTT file can be another alternate.
"alternate" : [{ "type" : "LinkedResource", "url" : "text/part001-1.html", "encodingFormat" : "text/html" },{ "type" : "LinkedResource", "url" : "text/part001-1.vtt", "encodingFormat" : "text/vtt" } ]
This could raise the question of how the user agent presents alternate options. In this case, the desired approach is to use both alternates together rather than choosing one or the other.
Highlight priority controls the overlay order of highlights. How could a document or publication indicate its priorities? E.g. if the groups are "word" and "paragraph", their highlights should be prioritized accordingly so that a "word" highlight is not hidden behind a "paragraph" highlight.
One simple idea is to add a priority property to the custom cue payload. Its value would be a non-negative integer.
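A sketch of how such a property could be applied (the helper name and the default value are assumptions): the Highlight interface exposes a priority attribute, and the highlight with the higher priority is painted on top when highlights overlap:

function registerPrioritizedHighlight(cuePayload) {
    let highlight = new Highlight(createRange(cuePayload.selector));
    // Map the cue payload's "priority" property onto the Highlight; default to 0
    highlight.priority = cuePayload.priority ?? 0;
    CSS.highlights.set(cuePayload.group, highlight);
}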
The user pauses audio playback and wants to inspect the currently highlighted text of the document with a screen reader. How does the screen reader know where to start from? How can it be made aware of where the highlight is currently?
This discussion about programmatically setting the focus navigation start point is relevant.
At the time of publication, the members of the Synchronized Multimedia for Publications Community Group were:
Avneesh Singh (DAISY Consortium), Ben Dugas (Rakuten, Inc.), Chris Needham (British Broadcasting Corporation), Daniel Weck (DAISY Consortium), Didier Gehrer, Farrah Little (BC Libraries Cooperative), George Kerscher (DAISY Consortium), Ivan Herman (W3C), James Donaldson, Lars Wallin (Colibrio), Livio Mondini, Lynn McCormack (CAST, Inc), Marisa DeMeglio (DAISY Consortium, chair), Markku Hakkinen (Educational Testing Service), Matt Garrish (DAISY Consortium), Michiel Westerbeek (Tella), Nigel Megitt (British Broadcasting Corporation), Romain Deltour (DAISY Consortium), Wendy Reid (Rakuten, Inc.), Zheng Xu (Rakuten, Inc.)