Audio Description, also known as Video Description, is an audio service that assists viewers who cannot fully see a visual presentation to understand its content. It is usually achieved by mixing a ‘description’ audio track with the main programme audio, at moments when this does not clash with dialogue, to deliver an audio description mixed audio track. More information about what Audio Description is and how it works can be found in [[WHP051]].
Audio Description is usually delivered as audio, either pre-recorded or synthesised, but (until now) has not been deliverable as accessible text using an open standard. This report describes the requirements for text documents that can support audio description script exchange throughout the workflow from production (scripting) through to distribution (mixing either at the broadcaster or at the viewer's device).
This document is a Community Group Report, including requirements and a proposed specification that meets the requirements. The proposed specification is a profile of the Timed Text Markup Language version 2.0 [[TTML2]].
This specification defines a single text-based profile of the Timed Text Markup Language version 2.0 [[TTML2]]. This profile is intended to support audio description workflows worldwide, including description creation, script delivery and exchange, and distribution of generated audio description. The proposed profile is a syntactic subset of [[TTML2]], and a document can simultaneously conform to both the base standard and the proposed profile.
This document defines no extensions to [[TTML2]].
This document is the first version of this proposal.
This report specifies the requirements for documents needed to support audio description script exchange throughout the workflow from production (scripting) to distribution (mixing), and includes a proposed specification, a profile of [[TTML2]], that meets those requirements; the requirements serve as a basis for verifying that the proposed specification is suitable for supporting that process. It is anticipated that recent additions to Timed Text Markup Language version 2.0 [[TTML2]] are sufficient to meet all of the requirements. Furthermore, the requirements do not assume or require that a single document format or instance be used for every step of an Audio Description workflow; however, a single format would appear to be desirable, since it would reduce the number of conversion steps required.
The Timed Text Markup Language version 2.0 [[TTML2]] format is an [[XML]]-based markup language for timed text, so it can be used for storing audio description scripts. This recent version of the TTML standard introduces audio mixing and text to speech semantics into the previous versions of the TTML specification. These new semantics support the main requirements for audio description: synchronised audio playback, continuous animation, control of pan and gain for audio mixing, and speech rate and pitch attributes for controlling text to speech. Browsers can support these new TTML features using [[WEBAUDIO]] and WebSpeech respectively.
A requirements document has been published and updated on GitHub for approximately two years [[ADPTREQS]]. The captured requirements are considered valid at the time of writing. Those requirements are included in this document as a rationale to support the proposed profile.
This document proposes an Audio Description profile of TTML that will allow the delivery of an audio description script, pre-recorded audio and mixing data in a single file using an open standard format. The proposed profile should allow client implementations to provide real time mixing of the Audio Description, perhaps with some user customisation (like changing the relative volumes of the main programme audio and the AD audio). Additionally, the Audio Description script text could be presented on a completely different device, such as a braille display. A client that is hosted 'server side' would be able to create a “broadcaster mix”, whereas a “receiver mix” could be implemented by hosting the client at the viewer's device.
The following example shows an audio description script, with times and text, before any audio has been added.
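A minimal sketch of such a script (the times and description text here are illustrative):

```xml
<tt xmlns="http://www.w3.org/ns/ttml" xml:lang="en">
  <body>
    <div>
      <p begin="25s" end="28s">A woman climbs into a small sailing boat.</p>
      <p begin="34s" end="37s">The woman pulls the tiller and the boat turns.</p>
    </div>
  </body>
</tt>
```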
References to audio recordings of the voiced words can be added:
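For example, as a sketch (the file name is illustrative):

```xml
<p begin="25s" end="28s">
  <span>
    <audio src="description1.wav"/>
    A woman climbs into a small sailing boat.
  </span>
</p>
```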
If the audio recording is long and just a snippet needs to be played, that can be done using clipBegin and clipEnd.
If we just want to play the part of the audio file from 5s to 8s, it would look like:
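A sketch, assuming the recording is in a file named description.wav:

```xml
<p begin="25s" end="28s">
  <span>
    <audio src="description.wav" clipBegin="5s" clipEnd="8s"/>
    A woman climbs into a small sailing boat.
  </span>
</p>
```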
Or audio attributes can be added to trigger the text to be spoken:
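For example, using the tta:speak attribute (a sketch; it assumes the tta prefix is bound to the TT Audio Style namespace listed in the namespaces table below):

```xml
<p begin="25s" end="28s">
  <span tta:speak="normal">A woman climbs into a small sailing boat.</span>
</p>
```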
The gain of "received" audio can be changed before mixing in
the audio played from inside the span
, smoothly
animating the value on the way in and returning it on the way out:
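A sketch consistent with the walkthrough that follows; the div's end time and the file name are illustrative assumptions:

```xml
<div begin="25s" end="28s">
  <!-- dip the incoming programme audio over the first 0.3s -->
  <animate begin="0s" end="0.3s" tta:gain="1;0.39" fill="freeze"/>
  <!-- restore it over the final 0.3s -->
  <animate begin="2.7s" end="3s" tta:gain="0.39;1"/>
  <span begin="0.3s">
    <audio src="description.wav"/>
    A woman climbs into a small sailing boat.
  </span>
</div>
```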
In the above example, the div element's begin time becomes the "syncbase" for its children, so the times on the animate and span elements are relative to 25s here. The first animate element drops the gain from 1 to 0.39 over 0.3s, and the second one raises it back in the final 0.3s of this description. Then the span is timed to begin only after the first audio dip has finished.
This document uses the same conventions as [[TTML2]] for the specification of parameter attributes, styling attributes and metadata elements. In particular:
Section 2.3 of [[TTML2]] specifies conventions used in the [[XML]] representation of elements; and Sections 6.2 and 8.2 of [[TTML2]] specify conventions used when specifying the syntax of attribute values.
All content of this specification that is not explicitly marked as non-normative is considered to be normative. If a section or appendix header contains the expression "non-normative", then the entirety of the section or appendix is considered non-normative.
This specification uses Feature designations as defined in Appendix E of [[TTML2]]: when making reference to content conformance, these designations refer to the syntactic expression or the semantic capability associated with each designated Feature; and when making reference to processor conformance, these designations refer to processing requirements associated with each designated Feature. If the name of an element referenced in this specification is not namespace qualified, then the TT namespace applies (see 9.3 Namespaces.)
The following terms are used in this proposal:
Audio description: An audio rendition of a Description or a set of Descriptions.
Audio description mixed audio track: The output of an audio mixer incorporating the main programme audio and the audio description.
Description: A set of words that describe an aspect of the programme presentation, suitable for rendering into audio by means of vocalisation and recording, or used as a Text Alternative source for text to speech translation.
Default Region: See Section 11.3.1.1 at [[TTML2]].
Document Instance: As defined by [[TTML2]].
Document Interchange Context: As defined by [[TTML2]].
Document Processing Context: See Section 2.2 at [[TTML2]].
Feature: See Section 2.2 at [[TTML2]].
Intermediate Synchronic Document: See Section 11.3.1.3 at [[TTML2]].
Linear White-Space: See Section 2.3 at [[TTML2]].
Main programme audio: The audio associated with the programme prior to any mixing with audio description.
Markup Language: A human-readable computer language that uses tags to define elements within a document. Markup files contain ordinary words rather than typical programming syntax.
Presentation processor: See Section 2.2 at [[TTML2]].
Processor: Either a Presentation processor or a Transformation processor.
Profile: A TTML profile specification is a document that lists all the features of TTML that are required, optional or prohibited within “document instances” (files) and “processors” (things that process the files), together with any extensions or constraints.
Related Media Object: See Section 2.2 at [[TTML2]].
Related Video Object: A Related Media Object that consists of a sequence of image frames, each a rectangular array of pixels.
Root Container Region: See Section 2.2 at [[TTML2]].
Text Alternative: As defined in [[WCAG20]].
Timed Text: Text media that is presented in synchrony with other media such as audio and video, optionally with specified text presentation styling information such as font, colour and position.
Transformation processor: See Section 2.2 at [[TTML2]].
TTML: A Markup Language designed for the storage and delivery of Timed Text, as defined in [[TTML2]], used primarily in the television industry. TTML is used for authoring, transcoding and exchanging timed text information, and for delivering captions, subtitles and other metadata for television material repurposed for the Web or, more generally, the Internet.
A Document Instance that conforms to the profile defined herein:
A Document Instance, by definition, satisfies the requirements of Section 3.1 at [[TTML2]], and hence a Document Instance that conforms to a profile defined herein is also a conforming TTML2 Document Instance.
A presentation processor that conforms to the profile defined in this specification:
A transformation processor that conforms to the profile defined in this specification:
The use of the terms presentation processor and transformation processor within this document does not imply conformance per se to any of the Standard Profiles defined in [[TTML2]]. In other words, it is not considered an error for a presentation processor or transformation processor to conform to the profile defined in this document without also conforming to the TTML2 Presentation Profile or the TTML2 Transformation Profile.
This document does not specify presentation processor or transformation processor behavior when processing or transforming a non-conformant Document Instance.
The permitted and prohibited dispositions do not refer to the specification of a ttp:feature or ttp:extension element as being permitted or prohibited within a ttp:profile element.
The Profile consists of Constraints.
For the purpose of content processing, the determination of the resolved profile SHOULD take into account both the signaled profile and profile metadata, as designated by either or both of the Document Interchange Context and the Document Processing Context, which MAY entail inspecting document content.
If the resolved profile is not the Profile supported by the Processor but is feasibly interoperable with the Profile, then the resolved profile is considered to be the Profile. If the resolved profile is undetermined or not supported by the Processor, then the Processor SHOULD nevertheless process the Document Instance using the Profile; otherwise, processing MAY be aborted. Processing of a Document Instance whose resolved profile is not the proposed Profile is otherwise outside the scope of this specification.
If the resolved profile is the profile supported by the Processor, then the Processor SHOULD process the Document Instance according to the Profile.
A Document Instance SHALL use UTF-8 character encoding as specified in [[UNICODE]].
A Document Instance MAY contain elements and attributes that are neither specifically permitted nor forbidden by a profile.
A transformation processor SHOULD preserve such elements or attributes whenever possible.
Document Instances remain subject to the content conformance requirements specified at Section 3.1 of [[TTML2]]. In particular, a Document Instance can contain elements and attributes not in any TT namespace, i.e. in foreign namespaces, since such elements and attributes are pruned by the algorithm at Section 4 of [[TTML2]] prior to evaluating content conformance.
For validation purposes it is good practice to define and use a content specification for all foreign namespace elements and attributes used within a Document Instance.
The following namespaces (see [[xml-names]]) are used in this specification:
Name | Prefix | Value | Defining Specification |
---|---|---|---|
XML | xml | http://www.w3.org/XML/1998/namespace | [[xml-names]] |
TT | tt | http://www.w3.org/ns/ttml | [[TTML2]] |
TT Parameter | ttp | http://www.w3.org/ns/ttml#parameter | [[TTML2]] |
TT Feature | none | http://www.w3.org/ns/ttml/feature/ | [[TTML2]] |
TT Audio Style | tta | http://www.w3.org/ns/ttml#audio | [[TTML2]] |
ADPT 1.0 Profile Designator | none | | This specification |
The namespace prefix values defined above are for convenience, and Document Instances MAY use any prefix value that conforms to [[xml-names]].
The namespaces defined by this proposal document are mutable [[namespaceState]]; all undefined names in these namespaces are reserved for future standardization by the W3C.
Each intermediate synchronic document of the Document Instance is intended to be presented (audible) starting on a specific frame and removed (inaudible) by a specific frame of the Related Video Object.
When mapping a media time expression M to a frame F of a Related Video Object (or Related Media Object), e.g. for the purpose of mixing audio sources signalled by a Document Instance into the main programme audio of the Related Video Object, the presentation processor SHALL map M to the frame F with the presentation time that is the closest to, but not less than, M.
EXAMPLE 1 A media time expression of 00:00:05.1 corresponds to frame ceiling( 5.1 × ( 1000 / 1001 × 30) ) = 153 of a Related Video Object with a frame rate of 1000 / 1001 × 30 ≈ 29.97.
In a typical scenario, the same video programme (the Related Video Object) will be used for Document Instance authoring, delivery and user playback. The mapping from media time expression to Related Video Object above allows the author to associate audio description content precisely with video frames, e.g. around existing audio dialogue and sound effects. In circumstances where the video programme is downsampled during delivery, the application can specify that, at playback, the Related Video Object be considered to be the delivered video programme upsampled to its original frame rate, thereby allowing audio content to be presented at the same temporal locations at which it was authored.
The ttp:profile attribute SHOULD be present on the tt element and equal to the designator of the ADPT 1.0 profile to which the Document Instance conforms.
See Conformance for a definition of permitted, prohibited and optional.
Feature | Disposition | Additional provision |
---|---|---|
Relative to the TT Feature namespace | | |
#animation-version-2 | permitted | |
#audio | permitted | |
#audio-description | permitted | |
#audio-speech | permitted | |
#backgroundColor-block | prohibited | |
#backgroundColor-region | prohibited | |
#cellResolution | prohibited | |
#chunk | permitted | |
#clockMode | prohibited | |
#clockMode-gps | prohibited | |
#clockMode-local | prohibited | |
#clockMode-utc | prohibited | |
#content | permitted | |
#contentProfiles | permitted | |
#core | permitted | |
#data | permitted | |
#display-block | prohibited | |
#display-inline | prohibited | |
#display-region | prohibited | |
#display | prohibited | Consider display="none" in relation to AD content |
#dropMode | prohibited | |
#dropMode-dropNTSC | prohibited | |
#dropMode-dropPAL | prohibited | |
#dropMode-nonDrop | prohibited | |
#embedded-audio | permitted | |
#embedded-data | permitted | |
#extent-root | prohibited | |
#extent | prohibited | |
#frameRate | permitted | If the Document Instance includes any time expression that uses the frames term or any offset time expression that uses the f metric, the ttp:frameRate attribute SHALL be present on the tt element. |
#frameRateMultiplier | permitted | |
#gain | permitted | |
#layout | prohibited | |
#length-cell | prohibited | |
#length-integer | prohibited | |
#length-negative | prohibited | |
#length-percentage | prohibited | |
#length-pixel | prohibited | |
#length-positive | prohibited | |
#length-real | prohibited | |
#length | prohibited | |
#markerMode | prohibited | |
#markerMode-continuous | prohibited | |
#markerMode-discontinuous | prohibited | |
#metadata | permitted | |
#opacity | prohibited | |
#origin | prohibited | |
#overflow | prohibited | |
#overflow-visible | prohibited | |
#pan | permitted | |
#pitch | permitted | |
#pixelAspectRatio | prohibited | |
#presentation | prohibited | |
#processorProfiles | permitted | |
#profile | permitted | See . |
#region-timing | prohibited | |
#resources | permitted | |
#showBackground | prohibited | |
#source | permitted | |
#speak | permitted | |
#speech | permitted | |
#structure | permitted | |
#styling | permitted | |
#styling-chained | permitted | |
#styling-inheritance-content | permitted | |
#styling-inheritance-region | prohibited | |
#styling-inline | permitted | |
#styling-nested | permitted | |
#styling-referential | permitted | |
#subFrameRate | permitted | |
#tickRate | permitted | ttp:tickRate SHALL be present on the tt element if the document contains any time expression that uses the t metric. |
#timeBase-clock | prohibited | |
#timeBase-media | permitted | NOTE: [[TTML1]] specifies that the default timebase is media. |
#timeBase-smpte | prohibited | |
#time-clock-with-frames | permitted | |
#time-clock | permitted | |
#time-offset-with-frames | permitted | |
#time-offset-with-ticks | permitted | |
#time-offset | permitted | |
#timeContainer | permitted | |
#timing | permitted | |
#transformation | permitted | See constraints at #profile. |
#visibility-block | prohibited | |
#visibility-region | prohibited | |
#writingMode-horizontal-lr | prohibited | |
#writingMode-horizontal-rl | prohibited | |
#writingMode-horizontal | prohibited | |
#zIndex | prohibited | |
The following diagram illustrates the workflow related to the proposed profile described by this document:
It is proposed that after each process in this workflow the output data may be either inserted into a manifest document such as a TTML document or referenced by it.
Process step | Description |
---|---|
1. Identify gaps in programme dialogue | Automatically or manually process the programme audio track to identify intervals within which description audio may be inserted. |
2. Write script | Write a set of descriptions to fit within the identified gaps. |
3. 'Voice' the script or synthesise audio | Generate an audio rendition of the script, either by using an actor or voicer and recording the speech, or by synthesising the audio description using a text to speech system. This is typically a mono audio track that may be delivered as a single track with the same duration as the programme, or as a set of short audio tracks each beginning at a defined time. |
4. Define AD track left/right pan data | Select a horizontal pan position to apply to the audio rendition of the description when mixing with the main programme audio. This is typically a single value that applies to each description. |
5. Define main programme audio levels during AD | Select the amount by which to lower the main programme audio prior to mixing in the description audio. This is typically defined as a curve defined by a set of moments in time and fade levels to apply, with an interpolation algorithm to vary the levels between each moment in time. |
6. Mix programme audio with descriptions | Mix the programme audio with the rendered descriptions. This may be pre-mixed (also known as “broadcaster mix”) prior to delivery to the audience, or mixed in real time (also known as “receiver mix”) at playback time; mixing at playback time is a requirement to enable user customisation of the relative levels of main programme audio and descriptions. See [[WHP051]] for the reference model for this. |
The following table lists the requirements at each stage of the workflow:
Requirement ID | Process step | Requirement |
---|---|---|
ADR1 | 1 | |
ADR2 | 2 | |
ADR3 | 2 | |
ADR4 | 2 | |
ADR12 | 3 | |
ADR5 | 3 | |
ADR6 | 3 | |
ADR7 | 4 | |
ADR8 | 5 | |
ADR9 | 6 | |
ADR10 | 6 | |
ADR11 | 6 | |
[[WEBAUDIO]] specifies three interpolation mechanisms for traversing from one parameter value to another: linear, exponential and linear interpolation between points on a curve; the default ramp, e.g. for setTargetAtTime, uses an exponential interpolation.
Every content element in the audio tree creates a mixer that adds the audio from its parent and its audio element children, and optionally applies a pan and gain to the output.
Every audio element provides an audio input from some audio resource, with its own pan and gain.
The output of every content element’s audio is passed to each of its children.
The audio output of all the leaf content elements is mixed together on the “master bus” in Web Audio terms.
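As an illustrative sketch of this model (the element content and file names are invented for the example), consider the following fragment and the mixing it implies:

```xml
<div>                                 <!-- mixer: parent audio plus child audio inputs -->
  <audio src="programme.wav"/>        <!-- audio input with its own pan and gain -->
  <p tta:gain="0.39">                 <!-- mixer: receives the div's output audio -->
    <span>
      <audio src="description.wav"/>  <!-- audio input mixed with the inherited audio -->
      A description.
    </span>                           <!-- leaf: its output is routed to the master bus -->
  </p>
</div>
```

In [[WEBAUDIO]] terms, one possible realisation is a GainNode (and, where pan is applied, a StereoPannerNode) per mixer, with the outputs of all leaf content elements connected to the destination.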
This section includes part of a real world audio description script generated by the BBC for its EastEnders television programme.
We have a div element that wraps all the other content. Crucially it includes two audio element children, which make the main programme audio available within the div for gain adjustment and mixing with the descriptions.
We need a convention to identify "tracks that are provided to us from somewhere else", and in this case we've defined ;track=n to do that.
Then there are a bunch of child p elements that each have a begin and end time. They each represent a snippet of audio description and the time during which some stuff happens. The text of the audio description is contained in a child span element, which itself has begin and end times. The span's begin and end times are relative to the parent p element's begin time.
There is some metadata there too, which might be helpful during the authoring process, for example.
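Putting those pieces together, the structure being described looks roughly like the following sketch (the times, file names and text are illustrative, not the actual BBC data):

```xml
<div begin="0s">
  <!-- main programme audio, identified using the ;track=n convention described above -->
  <audio src="programmeAudio;track=0"/>
  <audio src="programmeAudio;track=1"/>

  <p begin="25s" end="28s">
    <metadata><!-- authoring metadata, e.g. an identifier --></metadata>
    <!-- fade the programme audio down, and back up at the end -->
    <animate begin="0s" end="0.3s" tta:gain="1;0.39" fill="freeze"/>
    <animate begin="2.7s" end="3s" tta:gain="0.39;1"/>
    <!-- the description itself, played from a snippet of a recorded file -->
    <span begin="0.3s" end="2.7s">
      <audio src="descriptions.wav" clipBegin="5s" clipEnd="7.4s"/>
      A woman climbs into a small sailing boat.
    </span>
  </p>
  <!-- ... further p elements, one per description ... -->
</div>
```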
A few things need to happen for each snippet of audio description: the main programme audio needs to be faded down, the description audio needs to be played, and the main programme audio needs to be faded back up.
The fade up and down are both achieved by placing animate elements as children of the p element. They smoothly change ("continuously animate") the tta:gain value between the values in a semicolon-separated list, where the begin and end times of the animation are specified on the animate element and are relative to the parent p element's begin time. The audio that they modify is the audio that is available to that element, i.e. the programme audio that comes down to the p from the parent div (remember that the div specified some audio? This is where it goes).
Playing the audio description is done by adding a new audio child to the span. The playback begins in the presentation at the span's begin time, and the clipBegin and clipEnd mark the in and out points of the referenced audio resource to play, which is specified by the src attribute. If we wanted to specify a left/right pan value, we could do that by setting a tta:pan attribute on the audio element itself. Similarly we could vary the level of the audio by setting a tta:gain value.
This structure is implemented in the mixing code by constructing a Web Audio Graph, where the outputs of all the spans are, in the end, mixed together.
The editor acknowledges the current and former members of the W3C Timed Text Working Group (TTWG), the members of other W3C Working Groups, and industry experts in other forums who have contributed directly or indirectly to the process or content of this document.
The editors wish to especially acknowledge the contributions by: