Audio Description, also known as Video Description, is an audio service that assists viewers who cannot fully see a visual presentation to understand the content. It is usually achieved by mixing a ‘description’ audio track with the main programme audio, at moments when this does not clash with dialogue, to deliver an audio description mixed audio track. More information about what Audio Description is and how it works can be found at [[WHP051]].

Audio Description is usually delivered as audio, either pre-recorded or synthesised, but (until now) has not been deliverable as accessible text using an open standard. This report describes the requirements for text documents that can support audio description script exchange throughout the workflow, from production (scripting) through to distribution (mixing either at the broadcaster or at the viewer's device).

This document is a Community Group Report, including requirements and a proposed specification that meets the requirements. The proposed specification is a profile of the Timed Text Markup Language version 2.0 [[TTML2]].

Scope

This specification defines a single text-based profile of the Timed Text Markup Language version 2.0 [[TTML2]]. This profile is intended to support audio description workflows worldwide, including description creation, script delivery and exchange, and distribution of generated audio description. The proposed profile is a syntactic subset of the Timed Text Markup Language version 2.0 [[TTML2]], and a document can simultaneously conform to both the base standard and the proposed profile.

This document defines NO extensions to [[TTML2]].

This document is the first version of this proposal.

Introduction

This report specifies the requirements for documents needed to support audio description script exchange throughout the workflow from production (scripting) to distribution (mixing), and includes a proposed specification, a profile of [[TTML2]], that meets those requirements; the requirements also serve as a basis for verifying that the proposed specification is suitable for that process. It is anticipated that recent additions to Timed Text Markup Language version 2.0 [[TTML2]] are sufficient to meet all of the requirements. Furthermore, the requirements do not assume or require that a single document format or instance be used for every step of an Audio Description workflow; however, that would appear to be desirable, since it would reduce the number of conversion steps needed.

The Timed Text Markup Language version 2.0 [[TTML2]] format is an XML-based [[XML]] markup language that specifies timed text, and so can be used for storing audio description scripts. This recent version of the TTML standard introduces audio mixing and text to speech semantics that were absent from previous versions of the TTML specification. These new semantics support the main requirements for audio description: synchronised audio playback, continuous animation, control of pan and gain for audio mixing, and speech rate and pitch attributes for controlling text to speech. These new TTML features can be supported in browsers using [[WEBAUDIO]] and WebSpeech respectively.

A requirements document has been published and updated on GitHub for approximately two years [[ADPTREQS]]. The captured requirements are considered valid at this point, and are included in this document as a rationale to support the proposed profile.

This document proposes an Audio Description profile of TTML that will allow the delivery of an audio description script, pre-recorded audio and mixing data in a single file using an open standard format. The proposed profile should allow client implementations to provide real time mixing of the Audio Description, perhaps with some user customisation (such as changing the relative volumes of the main programme audio and the AD audio). Additionally, presentation of the Audio Description script text on a completely different device, such as a braille display, may be possible. A client that is hosted 'server side' would be able to create a “broadcaster mix”, whereas a “receiver mix” could be implemented by hosting the client at the viewer's device.

Example documents

The following example shows an audio description script, with times and text, before any audio has been added.
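A minimal sketch of such a script document might look like the following (all times and description text here are invented for illustration):

        <tt xmlns="http://www.w3.org/ns/ttml" xml:lang="en">
          <body>
            <div>
              <p begin="25s" end="30s">
                <span>A man walks into the room.</span>
              </p>
              <p begin="42s" end="45s">
                <span>She looks out of the window.</span>
              </p>
            </div>
          </body>
        </tt>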

        

References to audio recordings of the voiced words can be added:
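For example, each span can gain an audio child element that references a recording of that description (the file names here are placeholders):

        <p begin="25s" end="30s">
          <span><audio src="description1.wav"/>A man walks into the room.</span>
        </p>
        <p begin="42s" end="45s">
          <span><audio src="description2.wav"/>She looks out of the window.</span>
        </p>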

        

If the audio recording is long and just a snippet needs to be played, that can be done using clipBegin and clipEnd. For example, to play only the part of the audio file from 5s to 8s:
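A sketch of this, assuming a single long recording in a placeholder file descriptions.wav, might be:

        <p begin="25s" end="30s">
          <span><audio src="descriptions.wav" clipBegin="5s" clipEnd="8s"/>A man walks into the room.</span>
        </p>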

        

Or audio attributes can be added to trigger the text to be spoken:
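For instance, the tta:speak attribute can request that the span text be synthesised, with tta:rate and tta:pitch also available for tuning the voice. In this sketch the tta prefix is assumed to be bound to the TT Audio Style namespace:

        <p begin="25s" end="30s">
          <span tta:speak="normal">A man walks into the room.</span>
        </p>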

        

The gain of "received" audio can be changed before mixing in the audio played from inside the span, smoothly animating the value on the way in and returning it on the way out:
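A sketch of that behaviour (values invented; the tta prefix assumed bound to the TT Audio Style namespace) might be:

        <div begin="25s" end="30s">
          <animate begin="0s" end="0.3s" tta:gain="1;0.39" fill="freeze"/>
          <animate begin="4.7s" end="5s" tta:gain="0.39;1"/>
          <p>
            <span begin="0.3s" end="4.7s">
              <audio src="description1.wav"/>
              A man walks into the room.
            </span>
          </p>
        </div>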

        

In the above example, the div element's begin time becomes the "syncbase" for its child, so the times on the animate and span elements are relative to 25s here. The first animate element drops the gain from 1 to 0.39 over 0.3s, and the second one raises it back in the final 0.3s of this description. Then the span is timed to begin only after the first audio dip has finished.

Documentation Conventions

This document uses the same conventions as [[TTML2]] for the specification of parameter attributes, styling attributes and metadata elements. In particular:

Section 2.3 of [[TTML2]] specifies conventions used in the [[XML]] representation of elements; and Sections 6.2 and 8.2 of [[TTML2]] specify conventions used when specifying the syntax of attribute values.

All content of this specification that is not explicitly marked as non-normative is considered to be normative. If a section or appendix header contains the expression "non-normative", then the entirety of the section or appendix is considered non-normative.

This specification uses Feature designations as defined in Appendix E of [[TTML2]]: when making reference to content conformance, these designations refer to the syntactic expression or the semantic capability associated with each designated Feature; and when making reference to processor conformance, these designations refer to processing requirements associated with each designated Feature. If the name of an element referenced in this specification is not namespace qualified, then the TT namespace applies (see the Namespaces section).

Definitions

The following terms are used in this proposal:

Audio description An audio rendition of a Description or a set of Descriptions.

Audio description mixed audio track The output of an audio mixer incorporating the main programme audio and the audio description.

Description A set of words that describe an aspect of the programme presentation, suitable for rendering into audio either by means of vocalisation and recording or by text to speech synthesis.

Default Region See Section 11.3.1.1 at [[TTML2]].

Document Instance See Section 2.2 at [[TTML2]].

Document Interchange Context See Section 2.2 at [[TTML2]].

Document Processing Context See Section 2.2 at [[TTML2]].

Feature See Section 2.2 at [[TTML2]].

Intermediate Synchronic Document See Section 11.3.1.3 at [[TTML2]].

Linear White-Space See Section 2.3 at [[TTML2]].

Main programme audio The audio associated with the programme prior to any mixing with audio description.

Markup Language A human-readable computer language that uses tags to define elements within a document. Markup files contain ordinary readable text annotated with tags, rather than typical programming syntax.

Presentation processor See Section 2.2 at [[TTML2]].

Processor Either a Presentation processor or a Transformation processor.

Profile A TTML profile specification is a document that lists all the features of TTML that are required / optional / prohibited within “document instances” (files) and “processors” (things that process the files), and any extensions or constraints.

Related Media Object See Section 2.2 at [[TTML2]].

Related Video Object A Related Media Object that consists of a sequence of image frames, each a rectangular array of pixels.

Root Container Region See Section 2.2 at [[TTML2]].

Text Alternative As defined in [[WCAG20]].

Timed Text Text media that is presented in synchrony with other media, such as audio and video with (optional) specified text presentation styling information such as font, colour and position.

Transformation processor See Section 2.2 at [[TTML2]].

TTML A Markup Language designed for the storage and delivery of Timed Text, primarily used in the television industry. TTML is used for authoring, transcoding and exchanging timed text information and for delivering captions, subtitles, and other metadata for television material repurposed for the Web or, more generally, the Internet.

A Document Instance that conforms to the profile defined herein:

A Document Instance, by definition, satisfies the requirements of Section 3.1 at [[TTML2]], and hence a Document Instance that conforms to a profile defined herein is also a conforming TTML2 Document Instance.

A presentation processor that conforms to the profile defined in this specification:

A transformation processor that conforms to the profile defined in this specification:


The use of the terms presentation processor and transformation processor within this document does not imply conformance per se to any of the Standard Profiles defined in [[TTML2]]. In other words, it is not considered an error for a presentation processor or transformation processor to conform to the profile defined in this document without also conforming to the TTML2 Presentation Profile or the TTML2 Transformation Profile.

This document does not specify presentation processor or transformation processor behavior when processing or transforming a non-conformant Document Instance.

The permitted and prohibited dispositions do not refer to the specification of a ttp:feature or ttp:extension element as being permitted or prohibited within a ttp:profile element.

Profile

General

The Profile consists of Constraints.

Profile Resolution Semantics

For the purpose of content processing, the determination of the resolved profile SHOULD take into account both the signaled profile, as defined in Profile Signaling, and profile metadata, as designated by the Document Interchange Context, the Document Processing Context, or both, which MAY entail inspecting document content.

If the resolved profile is not the Profile supported by the Processor but is feasibly interoperable with the Profile, then the resolved profile is considered to be the Profile. If the resolved profile is undetermined or not supported by the Processor, then the Processor SHOULD nevertheless process the Document Instance using the Profile; otherwise, processing MAY be aborted. Processing of a Document Instance whose resolved profile is not the proposed Profile is outside the scope of this specification.

If the resolved profile is the profile supported by the Processor, then the Processor SHOULD process the Document Instance according to the Profile.

Constraints

Document Encoding

A Document Instance SHALL use UTF-8 character encoding as specified in [[UNICODE]].

Foreign Elements and Attributes

A Document Instance MAY contain elements and attributes that are neither specifically permitted nor forbidden by a profile.

A transformation processor SHOULD preserve such elements or attributes whenever possible.

Document Instances remain subject to the content conformance requirements specified at Section 3.1 of [[TTML2]]. In particular, a Document Instance can contain elements and attributes not in any TT namespace, i.e. in foreign namespaces, since such elements and attributes are pruned by the algorithm at Section 4 of [[TTML2]] prior to evaluating content conformance.

For validation purposes it is good practice to define and use a content specification for all foreign namespace elements and attributes used within a Document Instance.

Namespaces

The following namespaces (see [[xml-names]]) are used in this specification:

Name Prefix Value Defining Specification
XML xml http://www.w3.org/XML/1998/namespace [[xml-names]]
TT tt http://www.w3.org/ns/ttml [[TTML2]]
TT Parameter ttp http://www.w3.org/ns/ttml#parameter [[TTML2]]
TT Feature none http://www.w3.org/ns/ttml/feature/ [[TTML2]]
TT Audio Style tta http://www.w3.org/ns/ttml#audio [[TTML2]]
ADPT 1.0 Profile Designator none This specification

The namespace prefix values defined above are for convenience and Document Instances MAY use any prefix value that conforms to [[xml-names]].

The namespaces defined by this proposal document are mutable [[namespaceState]]; all undefined names in these namespaces are reserved for future standardization by the W3C.

Synchronization

Each intermediate synchronic document of the Document Instance is intended to be presented (audible) starting on a specific frame and removed (inaudible) by a specific frame of the Related Video Object.

When mapping a media time expression M to a frame F of a Related Video Object (or Related Media Object), e.g. for the purpose of mixing audio sources signalled by a Document Instance into the main programme audio of the Related Video Object, the presentation processor SHALL map M to the frame F with the presentation time that is the closest to, but not less than, M.

EXAMPLE 1 A media time expression of 00:00:05.1 corresponds to frame ceiling( 5.1 × ( 1000 / 1001 × 30 ) ) = 153 of a Related Video Object with a frame rate of 1000 / 1001 × 30 ≈ 29.97.

In a typical scenario, the same video programme (the Related Video Object) will be used for Document Instance authoring, delivery and user playback. The mapping from media time expression to Related Video Object above allows the author to precisely associate audio description content with video frames, e.g. around existing audio dialogue and sound effects. In circumstances where the video programme is downsampled during delivery, the application can specify that, at playback, the Related Video Object be considered to be the delivered video programme upsampled to its original rate, thereby allowing audio content to be presented at the same temporal locations at which it was authored.

Profile Signaling

The ttp:profile attribute SHOULD be present on the tt element and equal to the designator of the ADPT 1.0 profile to which the Document Instance conforms.

Features

See Conformance for a definition of permitted, prohibited and optional.

Feature Disposition Additional provision
Relative to the TT Feature namespace
#animation-version-2 permitted
#audio permitted
#audio-description permitted
#audio-speech permitted
#backgroundColor-block prohibited
#backgroundColor-region prohibited
#cellResolution prohibited
#chunk permitted
#clockMode prohibited
#clockMode-gps prohibited
#clockMode-local prohibited
#clockMode-utc prohibited
#content permitted
#core permitted
#data permitted
#display-block prohibited
#display-inline prohibited
#display-region prohibited
#display prohibited


#dropMode prohibited
#dropMode-dropNTSC prohibited
#dropMode-dropPAL prohibited
#dropMode-nonDrop prohibited
#embedded-audio permitted
#embedded-data permitted
#extent-root prohibited
#extent prohibited
#frameRate permitted If the Document Instance includes any time expression that uses the frames term or any offset time expression that uses the f metric, the ttp:frameRate attribute SHALL be present on the tt element.
#frameRateMultiplier permitted
#gain permitted
#layout prohibited
#length-cell prohibited
#length-integer prohibited
#length-negative prohibited
#length-percentage prohibited
#length-pixel prohibited
#length-positive prohibited
#length-real prohibited
#length prohibited
#markerMode prohibited
#markerMode-continuous prohibited
#markerMode-discontinuous prohibited
#metadata permitted
#opacity prohibited
#origin prohibited
#overflow prohibited
#overflow-visible prohibited
#pan permitted
#pitch permitted
#pixelAspectRatio prohibited
#presentation prohibited
#profile permitted See Profile Signaling.
#region-timing prohibited
#resources permitted
#showBackground prohibited
#source permitted
#speak permitted
#speech permitted
#structure permitted
#styling permitted
#styling-chained permitted
#styling-inheritance-content permitted
#styling-inheritance-region prohibited
#styling-inline permitted
#styling-nested permitted
#styling-referential permitted
#subFrameRate permitted
#tickRate permitted ttp:tickRate SHALL be present on the tt element if the document contains any time expression that uses the t metric.
#timeBase-clock prohibited
#timeBase-media permitted

NOTE: [[TTML1]] specifies that the default timebase is "media" if ttp:timeBase is not specified on tt.

#timeBase-smpte prohibited
#time-clock-with-frames permitted
#time-clock permitted
#time-offset-with-frames permitted
#time-offset-with-ticks permitted
#time-offset permitted
#timeContainer permitted
#timing permitted
  • All time expressions within a Document Instance SHOULD use the same syntax, either clock-time or offset-time.
  • For any content element that contains br elements or text nodes or a smpte:backgroundImage attribute, both the begin attribute and one of either the end or dur attributes SHOULD be specified on the content element or at least one of its ancestors.
#transformation permitted See constraints at #profile.
#visibility-block prohibited
#visibility-region prohibited
#writingMode-horizontal-lr prohibited
#writingMode-horizontal-rl prohibited
#writingMode-horizontal prohibited
#zIndex prohibited

Workflow

The following diagram illustrates the workflow related to the proposed profile described by this document:

Audio Description Workflow diagram

Figure 1 Diagram showing Audio Description Workflow

It is proposed that after each process in this workflow the output data may be either inserted into a manifest document such as a TTML document or referenced by it.

Workflow Processes

Process step Description
1. Identify gaps in programme dialogue Automatically or manually process the programme audio track to identify intervals within which description audio may be inserted.
2. Write script Write a set of descriptions to fit within the identified gaps.
3. 'Voice' the script or
synthesise audio
Generate an audio rendition of the script, either by using an actor or voicer and recording the speech, or by synthesising the audio description using a text to speech system. This is typically a mono audio track that may be delivered as a single track with the same duration as the programme, or as a set of short audio tracks each beginning at a defined time.
4. Define AD track
left/right pan data
Select a horizontal pan position to apply to the audio rendition of the description when mixing with the main programme audio.
This is typically a single value that applies to each description.
5. Define main programme
audio levels during AD
Select the amount by which to lower the main programme audio prior to mixing in the description audio.
This is typically defined as a curve defined by a set of moments in time and fade levels to apply, with an interpolation algorithm to vary the levels between each moment in time.
6. Mix programme audio with descriptions Mix the programme audio with the rendered descriptions.
This may be pre-mixed (also known as “broadcaster mix”) prior to delivery to the audience, or mixed in real time (also known as “receiver mix”) at playback time; mixing at playback time is a requirement for enabling user customisation of the relative levels of main programme audio and descriptions. See [[WHP051]] for the reference model for this.

Requirements

The following table lists the requirements at each stage of the workflow:

Requirement ID Process step Requirement
ADR1 1
  • The document must be able to define a list of intervals, each defined by a begin time and an end time that are opportunities for adding descriptions.
  • [[MAUR]] DV-2 Render descriptions in a time-synchronized manner, using the primary media resource as the timebase master.
ADR2 2
  • The document must be able to incorporate description text to be voiced, each description located within a timed interval defined by a begin time and an end time.
  • [[MAUR]] TVD-2 TVDs need to be provided in a format that contains the following information:
    1. start time, text per description cue (the duration is determined dynamically, though an end time could provide a cut point)
    2. possibly a speech-synthesis markup to improve quality of the description (existing speech synthesis markups include SSML and CSS 3 Speech Module)
    3. accompanying metadata providing labeling for speakers, language, etc. and
    4. visual style markup.
ADR3 2
  • The document must be able to incorporate additional user defined metadata associated with each description; metadata schemes may be user defined or centrally defined. For example the language of the description may be stored, notes made by the script writer.
  • [[MAUR]] DV-10 Allow the user to select from among different languages of descriptions, if available, even if they are different from the language of the main soundtrack.
  • [[MAUR]] DV-13 Support metadata, such as copyright information, usage rights, language, etc.
ADR4 2
  • The document must be extensible to allow incorporation of data required to achieve the desired quality of audio presentation, whether manual or automated. For example it is typical to include information about what gender and age voice would be appropriate to voice the descriptions; it is also feasible to include data used to improve the quality of text to speech synthesis, such as phonetic descriptions of the text, intonation and emotion data etc.
  • The format of any extensions for this purpose need not be defined.
ADR12 3
  • The document must be able to reference audio tracks either included as binary data within the document or separately.
  • [[MAUR]] DV-4 Support recordings of high quality speech as a track of the media resource, or as an external file.
  • [[MAUR]] DV-9 Allow the author to use a codec which is optimized for voice only, rather than requiring the same codec as the original soundtrack.
ADR5 3
  • The document must be able to associate a begin time with the beginning of playback of each audio track, for the case that multiple audio tracks are created, one per description.
  • [[MAUR]] DV-2 Render descriptions in a time-synchronized manner, using the primary media resource as the timebase master.
ADR6 3
  • The document must be able to associate a begin time with a playback entry time within an audio track, for the case that a single audio track is generated that is the same duration as the main programme audio.
  • The begin time and the playback entry time may be required to be synchronous (coincident values) within the document structure.
  • [[MAUR]] DV-2 Render descriptions in a time-synchronized manner, using the primary media resource as the timebase master.
ADR7 4
  • The document must be able to associate a left/right pan value with playback of each or every audio description. This value applies to the audio description prior to mixing with the main programme audio.
  • [[MAUR]] DV-8 Allow the author to provide fade and pan controls to be accurately synchronized with the original soundtrack.
ADR8 5
  • The document must be able to define a fade level curve that applies to the main programme audio prior to mixing with the audio description, where that fade level curve is defined by a set of pairs of level and times and an interpolation algorithm.
  • [[MAUR]] DV-5 Allow the author to independently adjust the volumes of the audio description and original soundtracks where these are available as separate audio channel resources.
  • [[MAUR]] DV-7 Permit smooth changes in volume rather than stepped changes. The degree and speed of volume change should be under user control.
  • [[MAUR]] DV-8 Allow the author to provide fade and pan controls to be accurately synchronized with the original soundtrack.
ADR9 6
  • The processor must be able to generate a set of directives to control an audio mixer to generate the desired audio description mixed audio track honouring the pan and fade information within the document.
  • The format of those directives may be implementation dependent.
ADR10 6
  • The processor may modify the audio mixer control directives under user control to customise the relative levels of main programme audio and audio description, and the pan information.
  • [[MAUR]] DV-6 Allow the user to independently adjust the volumes of the audio description and original soundtracks (where these are available as separate audio channel resources), with the user's settings overriding the author's.
  • [[MAUR]] DV-12 Allow the user to relocate the pan location of the various audio tracks within the audio field, with the user setting overriding the author setting.
    The setting should be re-adjustable as the media plays.
ADR11 6
  • The audio mixing transitions and semantics must be implementable using [[WEBAUDIO]], specifically relating to the application of gain and pan as defined therein and the interpolation between values.

[[WEBAUDIO]] specifies three interpolation mechanisms for traversing from one parameter value to another: linear ramps, exponential ramps, and linear interpolation between points on a value curve; in addition, setTargetAtTime approaches a target value along an exponential curve.

Web Audio Mixing

An image illustrating the effect of pan and gain on audio in a WEBAUDIO process

Every content element in the audio tree creates a mixer that adds the audio from its parent and its audio element children, and optionally applies a pan and gain to the output.

Every audio element provides an audio input from some audio resource, with its own pan and gain.

The output of every content element’s audio is passed to each of its children.

The audio output of all the leaf content elements is mixed together on the “master bus” in Web Audio terms.
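As a sketch of how these rules apply to a document fragment (values invented; the tta prefix assumed bound to the TT Audio Style namespace), the div below receives the programme audio from its audio child, and the nested span mixes in its own description audio, applying a pan and gain to its output:

        <div>
          <audio src="programme.wav"/>
          <p begin="25s" end="30s">
            <span tta:pan="-0.5" tta:gain="0.8">
              <audio src="description1.wav"/>
              A man walks into the room.
            </span>
          </p>
        </div>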

      

An image illustrating the effect of gain on audio levels

Real world worked example: BBC

This section includes part of a real world audio description script generated by the BBC for its EastEnders television programme.
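The original BBC script is not reproduced here; a structural sketch following the description below, with invented times, text and resource identifiers (and the tta and ttm prefixes assumed bound to the TT Audio Style and TT Metadata namespaces), might look like:

      <div begin="0s">
        <audio src="urn:example:programme-audio;track=1"/>
        <audio src="urn:example:programme-audio;track=2"/>
        <p begin="00:00:10.000" end="00:00:15.000">
          <metadata><ttm:desc>Scene 1</ttm:desc></metadata>
          <animate begin="0s" end="0.3s" tta:gain="1;0.39" fill="freeze"/>
          <animate begin="4.7s" end="5s" tta:gain="0.39;1"/>
          <span begin="0.3s" end="4.7s">
            <audio src="snippets.wav" clipBegin="10.3s" clipEnd="14.7s"/>
            A man walks into the room.
          </span>
        </p>
      </div>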

      

We have a div element that wraps all the other content. Crucially, it includes two audio element children, which reference the main programme audio tracks that form the basis of the mix.

We need a convention to identify "tracks that are provided to us from somewhere else", and in this case we've defined ;track=n to do that.

Then there are a bunch of child p elements that each have a begin and end time. Each represents a snippet of audio description and the interval during which the associated mixing and playback happen. The text of the audio description is contained in a child span element, which itself has begin and end times. The span's begin and end times are relative to the parent p element's begin time.

There is some metadata there too, which might be helpful during the authoring process, for example.

A few things need to happen for each snippet of audio description:

The fade up and down are both achieved by placing animate elements as children of the p element. They smoothly change ("continuously animate") the tta:gain value between the values in a semicolon-separated list, where the begin and end times of the animation are specified on the element and are relative to the parent p element's begin time. The audio that they modify is the audio available to that element, i.e. the programme audio that comes down to the p from the parent div (recall that the div specified the programme audio; this is where it takes effect).

Playing the audio description is done by adding a new audio child to the span. The playback begins in the presentation at the span's begin time, and the clipBegin and clipEnd mark the in and out points of the referenced audio resource to play, which is specified by the src attribute. If we wanted to specify a left/right pan value, we could do that by setting a tta:pan attribute on the audio element itself. Similarly we could vary the level of the audio by setting a tta:gain value.

This structure is implemented in the mixing code by constructing a Web Audio Graph, where the outputs of all the spans are, in the end, mixed together.

Acknowledgements

The editor acknowledges the current and former members of the W3C Timed Text Working Group (TTWG), the members of other W3C Working Groups, and industry experts in other forums who have contributed directly or indirectly to the process or content of this document.

The editors wish to especially acknowledge the contributions by: