Copyright © 2019 W3C® (MIT, ERCIM, Keio, Beihang). W3C liability, trademark and permissive document license rules apply.
This document collects use cases and requirements for improved support for timed events related to audio or video media on the web, where synchronization to a playing audio or video media stream is needed, and makes recommendations for new or changed web APIs to realize these requirements. The goal is to extend the existing support in HTML for text track cue events to add support for dynamic content replacement cues and generic metadata events that drive synchronized interactive media experiences, and improve the timing accuracy of rendering of web content intended to be synchronized with audio or video media playback.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at https://www.w3.org/TR/.
The Media & Entertainment Interest Group may update these
use cases and requirements over time. Development of new web
APIs based on the requirements described here, for example,
DataCue
, will proceed in the Web Platform Incubator Community Group
(WICG), with the goal of eventual standardization within a
W3C Working
Group. Contributors to this document are encouraged to
participate in the WICG. Where the requirements described here
affect the HTML specification, contributors will follow up with
WHATWG. The Interest Group
will continue to track these developments and provide input and
review feedback on how any proposed API meets these
requirements.
This document was published by the Media & Entertainment Interest Group as an Interest Group Note.
GitHub Issues are preferred for discussion of this specification.
Publication as an Interest Group Note does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
The disclosure obligations of the Participants of this group are described in the charter.
This document is governed by the 1 March 2019 W3C Process Document.
There is a need in the media industry for an API to support
metadata events synchronized to audio or video media,
specifically for both out-of-band event
streams and in-band discrete events (for example,
MPD and emsg
events in MPEG-DASH). These media timed events can be used to support use cases
such as dynamic content replacement, ad insertion, or
presentation of supplemental content alongside the audio or
video, or more generally, making changes to a web page, or
executing application code triggered from JavaScript events, at
specific points on the
media timeline of an audio or video media stream.
The following terms are used in this document:
The following terms are defined in [HTML]:
activeCues
currentTime
enter
exit
oncuechange
onenter
onexit
TextTrack
TextTrackCue
timeupdate
setTimeout()
setInterval()
requestAnimationFrame()
The following term is defined in [HR-TIME]:
Performance.now()
The following term is defined in [WEBVTT]:
VTTCue
Media timed events carry metadata that is related to points in time, or regions of time on the media timeline, which can be used to trigger retrieval and/or rendering of web resources synchronized with media playback. Such resources can be used to enhance user experience in the context of media that is being rendered. Some examples include display of social media feeds corresponding to a live video stream such as a sporting event, banner advertisements for sponsored content, accessibility-related assets such as large print rendering of captions, and display of track titles or images alongside an audio stream.
The following sections describe a few use cases in more detail.
A media content provider wants to allow insertion of content, such as personalised video, local news, or advertisements, into a video media stream that contains the main program content. To achieve this, media timed events can be used to describe the points on the media timeline, known as splice points, where switching playback to inserted content is possible.
The Society of Cable Telecommunications Engineers (SCTE) specification "Digital Program Insertion Cueing for Cable" [SCTE35] defines a data cue format for describing such insertion points. Use of these cues in MPEG-DASH and HLS streams is described in [SCTE35], sections 12.1 and 12.2.
A media content provider wants to provide visual information alongside an audio stream, such as an image of the artist and title of the current playing track, to give users live information about the content they are listening to.
HLS timed metadata [HLS-TIMED-METADATA] uses in-band ID3 metadata to carry the artist and title information, and image content. RadioVIS in DVB ([DVB-DASH], section 9.1.7) defines in-band event messages that contain image URLs and text messages to be displayed, with information about when the content should be displayed in relation to the media timeline.
A media streaming server uses media timed events to send control messages to a media client library, such as dash.js. Typically, segmented streaming protocols such as HLS and MPEG-DASH use a manifest document that informs the client of the available encodings of a media stream, e.g., the Media Presentation Description (MPD) document in [MPEGDASH].
Should any of the content in the manifest document need to
change, the client should refresh it by requesting an updated
copy from the server. Section 5.10.4 of [MPEGDASH] describes
an MPEG-DASH specific event that is used to notify a client
application. An in-band emsg
event is
used as an alternative to setting a cache duration in the
response to the HTTP request for the manifest, so the client
can refresh the MPD when it actually changes, as opposed to
waiting for a cache duration expiry period to elapse. This
also has the benefit of reducing the load on HTTP servers
caused by frequent server requests.
Reference: M&E IG call 1 Feb 2018: Minutes, [DASH-EVENTING].
A user records footage with metadata, including geolocation, on a mobile video device, e.g., drone or dashcam, to share on the web alongside a map, e.g., OpenStreetMap.
[WEBVMT] is an open format for metadata cues, synchronized with a timed media file, that can be used to drive an online map rendered in a separate HTML element alongside the media element on the web page. The media playhead position controls presentation and animation of the map, e.g., pan and zoom, and allows annotations to be added and removed, e.g., markers, at specified times during media playback. Control can also be overridden by the user with the usual interactive features of the map at any time, e.g., zoom. Concrete examples are provided by the tech demos at the WebVMT website.
A video image analysis system processes a media stream to detect and recognize objects shown in the video. This system generates metadata describing the objects, including timestamps that describe when the objects are visible, together with position information (e.g., bounding boxes). A web application then uses this timed metadata to overlay labels and annotations on the video using HTML and CSS.
During a live media presentation, dynamic and unpredictable events may occur which cause temporary suspension of the media presentation. During that suspension interval, auxiliary content, such as the presentation of UI controls and media files, may be unavailable. Depending on the specific user engagement (or not) with the UI controls and the time at which any such engagement occurs, specific web resources may be rendered at defined times in a synchronized manner. For example, a multimedia A/V clip along with subtitles corresponding to an advertisement, previously downloaded and cached by the UA, is played out.
This section describes gaps in existing web platform capabilities needed to support the use cases and requirements described in this document. Where applicable, this section also describes how existing web platform features can be used as workarounds, and any associated limitations.
The DataCue
API has been previously discussed
as a means to deliver in-band event data to
web applications, but this is not implemented in all of the
main browser engines. It is
included in the 18 October 2018 HTML 5.3 draft
[HTML53-20181018], but is
not included in [HTML]. See discussion
here and notes on implementation status
here.
WebKit
supports a DataCue
interface that extends
HTML5 DataCue
with two attributes to support
non-text metadata, type
and
value
.
interface DataCue : TextTrackCue {
attribute ArrayBuffer data; // Always empty
// Proposed extensions.
attribute any value;
readonly attribute DOMString type;
};
type
is a string identifying the type of
metadata:
WebKit DataCue metadata types:
  "com.apple.quicktime.udta"   QuickTime User Data
  "com.apple.quicktime.mdta"   QuickTime Metadata
  "com.apple.itunes"           iTunes metadata
  "org.mp4ra"                  MPEG-4 metadata
  "org.id3"                    ID3 metadata
and value
is an object with the metadata item
key, data, and optionally a locale:
value = {
key: String
data: String | Number | Array | ArrayBuffer | Object
locale: String
}
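For example, an application running on WebKit might read ID3 metadata from such cues as follows (a minimal sketch, assuming metadataTrack is the in-band metadata TextTrack; displayTrackInfo() is illustrative):

// Read WebKit DataCue metadata, e.g., in-band ID3 from an HLS stream.
metadataTrack.mode = 'hidden'; // activate the track without rendering it
metadataTrack.oncuechange = () => {
  for (let i = 0; i < metadataTrack.activeCues.length; i++) {
    const cue = metadataTrack.activeCues[i];
    if (cue.type === 'org.id3') {
      // cue.value holds { key, data, locale } as described above.
      displayTrackInfo(cue.value.key, cue.value.data);
    }
  }
};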
Neither [MSE-BYTE-STREAM-FORMAT-ISOBMFF]
nor [INBANDTRACKS] describe
handling of emsg
boxes.
On resource constrained devices such as smart TVs and streaming sticks, parsing media segments to extract event information leads to a significant performance penalty, which can have an impact on UI rendering updates if this is done on the UI thread. There can also be an impact on the battery life of mobile devices. Given that the media segments will be parsed anyway by the user agent, parsing in JavaScript is an expensive overhead that could be avoided.
[HBBTV] section 9.3.2 describes a
mapping between the emsg
fields described
above and the
TextTrack
and
DataCue
APIs. A
TextTrack
instance is created for each event
stream signalled in the MPD document (as identified by the
schemeIdUri
and value
), and the
inBandMetadataTrackDispatchType
TextTrack
attribute contains the
scheme_id_uri
and value
values.
Because HbbTV devices include a native DASH client, parsing
of the MPD document and creation of the
TextTrack
s is done by the user agent, rather
than by application JavaScript code.
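On such devices, a web application can locate the track for a given event stream and listen for its cues along these lines (a minimal sketch; the scheme URI and handleEmsg() are illustrative):

// Find the metadata TextTrack created by the user agent for a DASH event
// stream, identified by its inBandMetadataTrackDispatchType.
for (let i = 0; i < video.textTracks.length; i++) {
  const track = video.textTracks[i];
  if (track.kind === 'metadata' &&
      track.inBandMetadataTrackDispatchType.startsWith('urn:example:scheme')) {
    track.mode = 'hidden';
    track.oncuechange = () => {
      for (let j = 0; j < track.activeCues.length; j++) {
        handleEmsg(track.activeCues[j]);
      }
    };
  }
}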
In browsers, rendering of non-media web content is handled through
repaint operations at a rate that generally matches the
display refresh rate (e.g., 60 times per second), following
the user's wall clock. A web application can schedule actions
and render web content at specific points on the user's wall
clock, notably through Performance.now()
,
setTimeout()
,
setInterval()
, and
requestAnimationFrame()
.
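For example, a web application might schedule a DOM update at a target point on the wall clock as follows (a minimal sketch; targetTime and render() are illustrative):

// Schedule a rendering update at a target time on the user's wall clock,
// expressed in the Performance.now() timebase (milliseconds).
function scheduleRender(targetTime, render) {
  const delay = Math.max(0, targetTime - performance.now());
  setTimeout(() => {
    // Align the actual DOM update with the next repaint.
    requestAnimationFrame(render);
  }, delay);
}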
In most cases, media rendering follows a different path, whether because it is handled by a dedicated background process or by dedicated hardware circuitry. As a result, progress along the media timeline may follow a clock different from the user's wall clock. [HTML] recommends that the media clock approximate the user's wall clock but does not require it to match the user's wall clock.
To synchronize rendering of web content to a video with frame accuracy, a web application needs to know, in terms of the user's wall clock, when a given media frame will be rendered, so that it can schedule its own rendering updates accordingly.
The following sub-sections discuss mechanisms currently available to web applications to track progress on the media timeline and render content at frame boundaries.
Cues (e.g.,
TextTrackCue
, or VTTCue
)
are units of time-sensitive data on a
media timeline [HTML]. The
time marches on steps in [HTML] control the firing of cue
events during media playback.
Time marches on requires a
timeupdate
event to be fired at the
media element between 15 and 250 milliseconds after the
last such event, and this requirement therefore specifies
the rate at which
time marches on is executed during playback. In
practice it has
been found that the timing varies between browser
implementations.
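This variability can be observed directly, for example (a minimal sketch, assuming a playing video element on the page):

// Log the interval between successive timeupdate events to observe how
// often the user agent runs the time marches on steps during playback.
const video = document.querySelector('video');
let lastTime = performance.now();
video.addEventListener('timeupdate', () => {
  const now = performance.now();
  console.log(`timeupdate interval: ${(now - lastTime).toFixed(1)} ms`);
  lastTime = now;
});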
There are two methods a web application can use to handle cues (both are illustrated in the sketch following this list):
1. Add an oncuechange handler function to the TextTrack and inspect the track's activeCues list. Because activeCues contains the list of cues that are active at the time that time marches on is run, it is possible for a web application using this method to miss cues that appear on the media timeline between successive executions of time marches on during media playback. This may occur if the cues have a short duration, or if a long-running event handler function delays the next execution.
2. Add onenter and onexit handler functions to each cue. The time marches on steps guarantee that enter and exit events will be fired for all cues, including those that appear on the media timeline between successive executions of time marches on during media playback. This method is only possible for cues created by the web application, i.e., VTTCue objects, and not cue objects created by the user agent.
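A minimal sketch of both methods (assuming video is a media element with an in-band metadata text track; handleCue(), handleCueStart(), and handleCueEnd() are illustrative):

const metadataTrack = video.textTracks[0]; // e.g., an in-band metadata track
metadataTrack.mode = 'hidden'; // activate the track without rendering it

// Method 1: inspect activeCues whenever the set of active cues changes.
// Short-lived cues may never appear in activeCues and can be missed.
metadataTrack.oncuechange = () => {
  for (let i = 0; i < metadataTrack.activeCues.length; i++) {
    handleCue(metadataTrack.activeCues[i]);
  }
};

// Method 2: attach enter/exit handlers to application-created cues.
// enter and exit are guaranteed to fire, but only for cues the
// application creates itself (VTTCue objects).
const cue = new VTTCue(10, 15, JSON.stringify({ type: 'ad-break' }));
cue.onenter = () => handleCueStart(cue);
cue.onexit = () => handleCueEnd(cue);
metadataTrack.addCue(cue);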
An issue with handling of text track and data cue events in HbbTV was reported in 2013. HbbTV requires the user agent to implement an MPEG-DASH client, and so applications must use the first of the above methods for cue handling, which means that applications can miss cues as described above.
timeupdate events from the media element
Another approach to synchronizing rendering of web
content to media playback is to use the
timeupdate
event, and for the web
application to manage the media timed
events to be triggered, rather than use the text track
cue APIs in [HTML]. This approach has the same
synchronization limitations as described above due to the
250 millisecond update rate specified in
time marches on, and so is
explicitly discouraged in [HTML]. In addition, the timing
variability of
timeupdate
events between browser engines
makes them unreliable for the purpose of synchronized
rendering of web content.
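For illustration, such an application-managed approach might look like the following (a minimal sketch that ignores seeking; the events array and trigger() function are illustrative):

// Application-managed media timed events, triggered from timeupdate.
// Events whose start times fall between two successive timeupdate events,
// or whose duration is shorter than the update interval, may fire late.
const events = [
  { startTime: 30.0, data: { type: 'splice-point' } },
  { startTime: 95.5, data: { type: 'track-change' } }
];
let nextIndex = 0;
video.addEventListener('timeupdate', () => {
  while (nextIndex < events.length &&
         video.currentTime >= events[nextIndex].startTime) {
    trigger(events[nextIndex].data);
    nextIndex++;
  }
});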
Synchronization accuracy can be improved by polling the
media element's
currentTime
property from a
setInterval()
callback, or by using
requestAnimationFrame()
for greater
accuracy. This technique can be useful where content
should be animated smoothly in synchrony with the
media, for example, rendering a playhead position marker in
an audio waveform visualization, or displaying web content
at specific points on the
media timeline. However, the use of
setInterval()
or
requestAnimationFrame()
for media
synchronized rendering is CPU intensive.
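A polling approach using requestAnimationFrame() might look like this (a minimal sketch; updatePlayhead() is illustrative):

// Poll the media playback position on every animation frame to drive
// synchronized rendering, e.g., a playhead marker over an audio waveform.
function pollPlaybackPosition() {
  updatePlayhead(video.currentTime);
  if (!video.paused && !video.ended) {
    requestAnimationFrame(pollPlaybackPosition);
  }
}
video.addEventListener('play', () => {
  requestAnimationFrame(pollPlaybackPosition);
});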
[HTML] does not expose any precise
mechanism to assess the time, from a user's wall clock
perspective, at which a particular media frame is going to
be rendered. A web application can only infer this
information from the
media element's
currentTime
property, using it to estimate which frame is
being rendered and the time at which the user will see the
next frame. This has several limitations (see also the example
following this list):
currentTime
is represented as a
double
value, which does not allow individual frames to be
identified due to rounding errors. This
is a known
issue.
currentTime
is updated at a user-agent
defined rate (typically the rate at which
time marches on runs), and is kept stable while
scripts are running. When a web application reads
currentTime
, it cannot tell when this
property was last updated, and thus cannot reliably
assess whether this property still represents the frame
currently being rendered.
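For example, a common workaround maps currentTime to a frame number, but the result cannot be trusted near frame boundaries (a sketch assuming a known, fixed frame rate):

// Estimate the frame being rendered from currentTime. Because currentTime
// is a rounded double and may be stale when the script reads it, the
// computed frame number can be off by one near frame boundaries.
const frameRate = 25; // assumed; not exposed by HTMLMediaElement
function estimateCurrentFrame(video) {
  return Math.floor(video.currentTime * frameRate);
}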
This section describes recommendations from the Media & Entertainment Interest Group for the development of a generic media timed event API, and associated synchronization considerations.
The API should allow web applications to subscribe to
receive specific event streams by event type. For example, to
support MPEG-DASH emsg
and MPD events, the API
should allow subscription by id
and (optional)
value
. This is to make receiving events opt-in
from the application point of view. The user agent should
deliver only those events to a web application for which the
application has subscribed. The API should also allow web
applications to unsubscribe from specific event streams by
event type.
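By way of illustration only, such a subscription mechanism might take a shape along these lines (a hypothetical sketch: no such methods currently exist, and the names are invented for this example):

// Hypothetical API: subscribe to MPEG-DASH emsg events by scheme id and value.
video.subscribeToTimedEvents('urn:example:ad-insertion', '1', (event) => {
  // The event would carry start time, end time, and the data payload.
  handleAdEvent(event);
});

// Hypothetical API: unsubscribe when the events are no longer needed.
video.unsubscribeFromTimedEvents('urn:example:ad-insertion', '1');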
To be able to handle out of band events, including
MPEG-DASH MPD events, the API should allow web applications
to create events to be added to the
media timeline, to be triggered by the user agent. The
API should allow the web application to provide all necessary
parameters to define the event, including start and end
times, event type, and data payload. The payload should be
any data type (e.g., the set of types supported by the WebKit
DataCue
). For MPEG-DASH MPD events, the event
type is defined by the id
and (optional)
value
fields.
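As an illustration, adding an out-of-band MPD event to the media timeline might look like this (a sketch that assumes a DataCue constructor taking start time, end time, value, and type, along the lines of the WebKit extension described earlier; handleEvent() is illustrative):

// Create a metadata track and add an application-defined timed event to it,
// assuming a constructible DataCue with an any-typed value and a type string.
const track = video.addTextTrack('metadata');
const cue = new DataCue(30.0, 30.5, { messageData: 'example payload' },
                        'urn:example:mpd-event');
cue.onenter = () => handleEvent(cue.value);
track.addCue(cue);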
For those events that the application has subscribed to receive, the API should:
The API should provide guarantees that no events can be missed during linear playback of the media.
We recommend updating [INBANDTRACKS] to describe handling of in-band media timed events supported on the web platform, following a registry approach with one specification per media format that describes the event details for that format.
We recommend that browser engines support MPEG-DASH
emsg
in-band events and MPD out-of-band events, as part of their support for
the MPEG Common Media Application Format (CMAF)
[MPEGCMAF].
In order to achieve greater synchronization accuracy between media playback and web content rendered by an application, the time marches on steps in [HTML] should be modified to allow delivery of media timed event start time and end time notifications within 20 milliseconds of their positions on the media timeline.
Additionally, to allow such synchronization to happen at
frame boundaries, we recommend introducing a mechanism that
would allow a web application to accurately predict, using
the user's wall clock, when the next frame will be rendered
(e.g., as done in the
Web Audio API). The same outcome could perhaps be
achieved through a mechanism similar to
requestAnimationFrame()
that would allow a web application to
couple rendering of non-media web content with rendering of
the next media frame.
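For comparison, the Web Audio API already exposes such a mapping between its media clock and the user's wall clock via AudioContext.getOutputTimestamp() (a minimal sketch):

// Map the audio context's media clock to the Performance.now() timeline.
const audioContext = new AudioContext();
const { contextTime, performanceTime } = audioContext.getOutputTimestamp();

// A point t (in seconds) on the context's timeline is expected to reach the
// audio output at approximately this value of Performance.now(), in ms.
function estimatedOutputTime(t) {
  return performanceTime + (t - contextTime) * 1000;
}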
Thanks to François Daoust, Charles Lo, Nigel Megitt, Jon Piesing, Rob Smith, and Mark Vickers for their contributions and feedback on this document.