Audio Description, also known as Video Description, is an audio service that assists viewers who cannot fully see a visual presentation to understand the content. It is usually achieved by mixing a ‘description’ audio track with the main programme audio, at moments when this does not clash with dialogue, to deliver an audio description mixed audio track. More information about what Audio Description is and how it works can be found at [[WHP051]].

Audio Description is usually delivered as audio, either pre-recorded or synthesised, but (until now) has not been deliverable as accessible text using an open standard. This report describes the requirements for text documents that can support audio description script exchange throughout the workflow, from production (scripting) through to distribution (mixing either at the broadcaster or at the viewer's device).

This document is a Community Group Report, including requirements and a proposed specification that meets the requirements. The proposed specification is a profile of the Timed Text Markup Language version 2.0 [[TTML2]].

Scope

This specification defines a single text-based profile of the Timed Text Markup Language version 2.0 [[TTML2]]. This profile is intended to support audio description workflows worldwide, including description creation, script delivery and exchange, and distribution of generated audio description. The proposed profile is a syntactic subset of the Timed Text Markup Language version 2.0 [[TTML2]], and a document can simultaneously conform to both the base standard and the proposed profile.

This document defines NO extensions to [[TTML2]].

This document is the first version of this proposal.

Introduction

This report specifies the requirements for documents needed to support audio description script exchange throughout the workflow from production (scripting) to distribution (mixing), and includes a proposed specification, a profile of [[TTML2]], that meets those requirements; the requirements also serve as a basis for verifying that the proposed specification is suitable for that process. It is anticipated that recent additions to Timed Text Markup Language version 2.0 [[TTML2]] are sufficient to meet all of the requirements. Furthermore, the requirements do not assume or require that a single document format or instance be used for every step of an Audio Description workflow; however, that would appear to be desirable, since it would reduce the number of conversion steps needed.

The Timed Text Markup Language version 2.0 [[TTML2]] format is an XML-based [[XML]] markup language that specifies timed text, and so can be used for storing audio description scripts. This recent version of the TTML standard introduces audio mixing and text to speech semantics that were absent from previous versions of the TTML specification. These new semantics support the main requirements for audio description: synchronised audio playback, continuous animation, control of pan and gain for audio mixing, and speech rate and pitch attributes for controlling text to speech. These new TTML features can be supported in browsers using [[WEBAUDIO]] and WebSpeech respectively.

A requirements document has been published and updated on GitHub for approximately two years [[ADPTREQS]]. The captured requirements are considered valid at this point, and are included in this document as a rationale to support the proposed profile.

This document proposes an Audio Description profile of TTML that will allow the delivery of an audio description script, pre-recorded audio and mixing data in a single file using an open standard format. The proposed profile should allow client implementations to provide real time mixing of the Audio Description, perhaps with some user customisation (such as changing the relative volumes of the main programme audio and the AD audio). Additionally, presentation of the Audio Description script text on a completely different device, such as a braille display, may be possible. A client that is hosted 'server side' would be able to create a “broadcaster mix”, whereas a “receiver mix” could be implemented by hosting the client at the viewer's device.

Example documents

The following example shows an audio description script, with times and text, before any audio has been added.
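A minimal sketch of such a script document might look like the following (all times and description text here are invented for illustration):

        <tt xmlns="http://www.w3.org/ns/ttml" xml:lang="en">
          <body>
            <div>
              <p begin="25s" end="30s">
                <span>A man walks into the room.</span>
              </p>
              <p begin="42s" end="45s">
                <span>She looks out of the window.</span>
              </p>
            </div>
          </body>
        </tt>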

        

References to audio recordings of the voiced words can be added:
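For example, each span can gain an audio child element that references a recording of that description (the file names here are placeholders):

        <p begin="25s" end="30s">
          <span><audio src="description1.wav"/>A man walks into the room.</span>
        </p>
        <p begin="42s" end="45s">
          <span><audio src="description2.wav"/>She looks out of the window.</span>
        </p>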

        

If the audio recording is long and just a snippet needs to be played, that can be done using clipBegin and clipEnd. For example, to play only the part of the audio file from 5s to 8s:
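A sketch of this, assuming a single long recording in a placeholder file descriptions.wav, might be:

        <p begin="25s" end="30s">
          <span><audio src="descriptions.wav" clipBegin="5s" clipEnd="8s"/>A man walks into the room.</span>
        </p>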

        

Or audio attributes can be added to trigger the text to be spoken:
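For instance, the tta:speak attribute can request that the span text be synthesised, with tta:rate and tta:pitch also available for tuning the voice. In this sketch the tta prefix is assumed to be bound to the TT Audio Style namespace:

        <p begin="25s" end="30s">
          <span tta:speak="normal">A man walks into the room.</span>
        </p>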

        

The gain of "received" audio can be changed before mixing in the audio played from inside the span, smoothly animating the value on the way in and returning it on the way out:
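A sketch of that behaviour (values invented; the tta prefix assumed bound to the TT Audio Style namespace) might be:

        <div begin="25s" end="30s">
          <animate begin="0s" end="0.3s" tta:gain="1;0.39" fill="freeze"/>
          <animate begin="4.7s" end="5s" tta:gain="0.39;1"/>
          <p>
            <span begin="0.3s" end="4.7s">
              <audio src="description1.wav"/>
              A man walks into the room.
            </span>
          </p>
        </div>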

        

In the above example, the div element's begin time becomes the "syncbase" for its child, so the times on the animate and span elements are relative to 25s here. The first animate element drops the gain from 1 to 0.39 over 0.3s, and the second one raises it back in the final 0.3s of this description. Then the span is timed to begin only after the first audio dip has finished.

Documentation Conventions

This document uses the same conventions as [[TTML2]] for the specification of parameter attributes, styling attributes and metadata elements. In particular:

Section 2.3 of [[TTML2]] specifies conventions used in the [[XML]] representation of elements; and Sections 6.2 and 8.2 of [[TTML2]] specify conventions used when specifying the syntax of attribute values.

All content of this specification that is not explicitly marked as non-normative is considered to be normative. If a section or appendix header contains the expression "non-normative", then the entirety of the section or appendix is considered non-normative.

This specification uses Feature designations as defined in Appendix E of [[TTML2]]: when making reference to content conformance, these designations refer to the syntactic expression or the semantic capability associated with each designated Feature; and when making reference to processor conformance, these designations refer to processing requirements associated with each designated Feature. If the name of an element referenced in this specification is not namespace qualified, then the TT namespace applies (see the Namespaces section).

Definitions

The following terms are used in this proposal:

Audio description An audio rendition of a Description or a set of Descriptions.

Audio description mixed audio track The output of an audio mixer incorporating the main programme audio and the audio description.

Description A set of words that describe an aspect of the programme presentation, suitable for rendering into audio either by means of vocalisation and recording or by text to speech synthesis.

Default Region See Section 11.3.1.1 at [[TTML2]].

Document Instance See Section 2.2 at [[TTML2]].

Document Interchange Context See Section 2.2 at [[TTML2]].

Document Processing Context See Section 2.2 at [[TTML2]].

Feature See Section 2.2 at [[TTML2]].

Intermediate Synchronic Document See Section 11.3.1.3 at [[TTML2]].

Linear White-Space See Section 2.3 at [[TTML2]].

Main programme audio The audio associated with the programme prior to any mixing with audio description.

Markup Language A human-readable computer language that uses tags to define elements within a document. Markup files contain ordinary readable text annotated with tags, rather than typical programming syntax.

Presentation processor See Section 2.2 at [[TTML2]].

Processor Either a Presentation processor or a Transformation processor.

Profile A TTML profile specification is a document that lists all the features of TTML that are required / optional / prohibited within “document instances” (files) and “processors” (things that process the files), and any extensions or constraints.

Related Media Object See Section 2.2 at [[TTML2]].

Related Video Object A Related Media Object that consists of a sequence of image frames, each a rectangular array of pixels.

Root Container Region See Section 2.2 at [[TTML2]].

Text Alternative As defined in [[WCAG20]].

Timed Text Text media that is presented in synchrony with other media, such as audio and video with (optional) specified text presentation styling information such as font, colour and position.

Transformation processor See Section 2.2 at [[TTML2]].

TTML A Markup Language designed for the storage and delivery of Timed Text, primarily used in the television industry. TTML is used for authoring, transcoding and exchanging timed text information and for delivering captions, subtitles, and other metadata for television material repurposed for the Web or, more generally, the Internet.

A Document Instance that conforms to the profile defined herein:

A Document Instance, by definition, satisfies the requirements of Section 3.1 at [[TTML2]], and hence a Document Instance that conforms to a profile defined herein is also a conforming TTML2 Document Instance.

A presentation processor that conforms to the profile defined in this specification:

A transformation processor that conforms to the profile defined in this specification:


The use of the terms presentation processor and transformation processor within this document does not imply conformance per se to any of the Standard Profiles defined in [[TTML2]]. In other words, it is not considered an error for a presentation processor or transformation processor to conform to the profile defined in this document without also conforming to the TTML2 Presentation Profile or the TTML2 Transformation Profile.

This document does not specify presentation processor or transformation processor behavior when processing or transforming a non-conformant Document Instance.

The permitted and prohibited dispositions do not refer to the specification of a ttp:feature or ttp:extension element as being permitted or prohibited within a ttp:profile element.

Profile

General

The Profile consists of Constraints.

Profile Resolution Semantics

For the purpose of content processing, the determination of the resolved profile SHOULD take into account both the signaled profile, as defined in Profile Signaling, and profile metadata, as designated by the Document Interchange Context, the Document Processing Context, or both, which MAY entail inspecting document content.

If the resolved profile is not the Profile supported by the Processor but is feasibly interoperable with the Profile, then the resolved profile is considered to be the Profile. If the resolved profile is undetermined or not supported by the Processor, then the Processor SHOULD nevertheless process the Document Instance using the Profile; otherwise, processing MAY be aborted. Processing of a Document Instance whose resolved profile is not the proposed Profile is outside the scope of this specification.

If the resolved profile is the profile supported by the Processor, then the Processor SHOULD process the Document Instance according to the Profile.

Constraints

Document Encoding

A Document Instance SHALL use UTF-8 character encoding as specified in [[UNICODE]].

Foreign Elements and Attributes

A Document Instance MAY contain elements and attributes that are neither specifically permitted nor forbidden by a profile.

A transformation processor SHOULD preserve such elements or attributes whenever possible.

Document Instances remain subject to the content conformance requirements specified at Section 3.1 of [[TTML2]]. In particular, a Document Instance can contain elements and attributes not in any TT namespace, i.e. in foreign namespaces, since such elements and attributes are pruned by the algorithm at Section 4 of [[TTML2]] prior to evaluating content conformance.

For validation purposes it is good practice to define and use a content specification for all foreign namespace elements and attributes used within a Document Instance.

Namespaces

The following namespaces (see [[xml-names]]) are used in this specification:

Name Prefix Value Defining Specification
XML xml http://www.w3.org/XML/1998/namespace [[xml-names]]
TT tt http://www.w3.org/ns/ttml [[TTML2]]
TT Parameter ttp http://www.w3.org/ns/ttml#parameter [[TTML2]]
TT Feature none http://www.w3.org/ns/ttml/feature/ [[TTML2]]
TT Audio Style tta http://www.w3.org/ns/ttml#audio [[TTML2]]
ADPT 1.0 Profile Designator none This specification

The namespace prefix values defined above are for convenience and Document Instances MAY use any prefix value that conforms to [[xml-names]].

The namespaces defined by this proposal document are mutable [[namespaceState]]; all undefined names in these namespaces are reserved for future standardization by the W3C.

Synchronization

Each intermediate synchronic document of the Document Instance is intended to be presented (audible) starting on a specific frame and removed (inaudible) by a specific frame of the Related Video Object.

When mapping a media time expression M to a frame F of a Related Video Object (or Related Media Object), e.g. for the purpose of mixing audio sources signalled by a Document Instance into the main programme audio of the Related Video Object, the presentation processor SHALL map M to the frame F with the presentation time that is the closest to, but not less than, M.

EXAMPLE 1 A media time expression of 00:00:05.1 corresponds to frame ceiling( 5.1 × ( 1000 / 1001 × 30 ) ) = 153 of a Related Video Object with a frame rate of 1000 / 1001 × 30 ≈ 29.97.

In a typical scenario, the same video programme (the Related Video Object) will be used for Document Instance authoring, delivery and user playback. The mapping from media time expression to Related Video Object above allows the author to precisely associate audio description content with video frames, e.g. around existing audio dialogue and sound effects. In circumstances where the video programme is downsampled during delivery, the application can specify that, at playback, the Related Video Object be considered to be the delivered video programme upsampled to its original rate, thereby allowing audio content to be presented at the same temporal locations at which it was authored.

Profile Signaling

The ttp:profile attribute SHOULD be present on the tt element and equal to the designator of the ADPT 1.0 profile to which the Document Instance conforms.

Features

See Conformance for a definition of permitted, prohibited and optional.

Feature Disposition Additional provision
Relative to the TT Feature namespace
#animation-version-2 permitted
#audio permitted
#audio-description permitted
#audio-speech permitted
#backgroundColor-block prohibited
#backgroundColor-region prohibited
#cellResolution prohibited
#chunk permitted
#clockMode prohibited
#clockMode-gps prohibited
#clockMode-local prohibited
#clockMode-utc prohibited
#content permitted
#core permitted
#data permitted
#display-block prohibited
#display-inline prohibited
#display-region prohibited
#display prohibited


#dropMode prohibited
#dropMode-dropNTSC prohibited
#dropMode-dropPAL prohibited
#dropMode-nonDrop prohibited
#embedded-audio permitted
#embedded-data permitted
#extent-root prohibited
#extent prohibited
#frameRate permitted If the Document Instance includes any time expression that uses the frames term or any offset time expression that uses the f metric, the ttp:frameRate attribute SHALL be present on the tt element.
#frameRateMultiplier permitted
#gain permitted
#layout prohibited
#length-cell prohibited
#length-integer prohibited
#length-negative prohibited
#length-percentage prohibited
#length-pixel prohibited
#length-positive prohibited
#length-real prohibited
#length prohibited
#markerMode prohibited
#markerMode-continuous prohibited
#markerMode-discontinuous prohibited
#metadata permitted
#opacity prohibited
#origin prohibited
#overflow prohibited
#overflow-visible prohibited
#pan permitted
#pitch permitted
#pixelAspectRatio prohibited
#presentation prohibited
#profile permitted See Profile Signaling.
#region-timing prohibited
#resources permitted
#showBackground prohibited
#source permitted
#speak permitted
#speech permitted
#structure permitted
#styling permitted
#styling-chained permitted
#styling-inheritance-content permitted
#styling-inheritance-region prohibited
#styling-inline permitted
#styling-nested permitted
#styling-referential permitted
#subFrameRate permitted
#tickRate permitted ttp:tickRate SHALL be present on the tt element if the document contains any time expression that uses the t metric.
#timeBase-clock prohibited
#timeBase-media permitted

NOTE: [[TTML1]] specifies that the default timebase is "media" if ttp:timeBase is not specified on tt.

#timeBase-smpte prohibited
#time-clock-with-frames permitted
#time-clock permitted
#time-offset-with-frames permitted
#time-offset-with-ticks permitted
#time-offset permitted
#timeContainer permitted
#timing permitted
  • All time expressions within a Document Instance SHOULD use the same syntax, either clock-time or offset-time.
  • For any content element that contains br elements or text nodes or a smpte:backgroundImage attribute, both the begin attribute and one of either the end or dur attributes SHOULD be specified on the content element or at least one of its ancestors.
#transformation permitted See constraints at #profile.
#visibility-block prohibited
#visibility-region prohibited
#writingMode-horizontal-lr prohibited
#writingMode-horizontal-rl prohibited
#writingMode-horizontal prohibited
#zIndex prohibited

Workflow

The following diagram illustrates the workflow related to the proposed profile described by this document:

Audio Description Workflow diagram

Figure 1 Diagram showing Audio Description Workflow

It is proposed that after each process in this workflow the output data may be either inserted into a manifest document such as a TTML document or referenced by it.

Workflow Processes

Process step Description
1. Identify gaps in programme dialogue Automatically or manually process the programme audio track to identify intervals within which description audio may be inserted.
2. Write script Write a set of descriptions to fit within the identified gaps.
3. 'Voice' the script or
synthesise audio
Generate an audio rendition of the script, either by using an actor or voicer and recording the speech, or by synthesising the audio description using a text to speech system. This is typically a mono audio track that may be delivered as a single track with the same duration as the programme, or as a set of short audio tracks each beginning at a defined time.
4. Define AD track
left/right pan data
Select a horizontal pan position to apply to the audio rendition of the description when mixing with the main programme audio.
This is typically a single value that applies to each description.
5. Define main programme
audio levels during AD
Select the amount by which to lower the main programme audio prior to mixing in the description audio.
This is typically defined as a curve defined by a set of moments in time and fade levels to apply, with an interpolation algorithm to vary the levels between each moment in time.
6. Mix programme audio with descriptions Mix the programme audio with the rendered descriptions.
This may be pre-mixed (also known as “broadcaster mix”) prior to delivery to the audience, or mixed in real time (also known as “receiver mix”) at playback time; mixing at playback time is a requirement for enabling user customisation of the relative levels of main programme audio and descriptions. See [[WHP051]] for the reference model for this.

Requirements

The following table lists the requirements at each stage of the workflow:

Requirement ID Process step Requirement
ADR1 1
  • The document must be able to define a list of intervals, each defined by a begin time and an end time that are opportunities for adding descriptions.
  • [[MAUR]] DV-2 Render descriptions in a time-synchronized manner, using the primary media resource as the timebase master.
ADR2 2
  • The document must be able to incorporate description text to be voiced, each description located within a timed interval defined by a begin time and an end time.
  • [[MAUR]] TVD-2 TVDs need to be provided in a format that contains the following information:
    1. start time, text per description cue (the duration is determined dynamically, though an end time could provide a cut point)
    2. possibly a speech-synthesis markup to improve quality of the description (existing speech synthesis markups include SSML and CSS 3 Speech Module)
    3. accompanying metadata providing labeling for speakers, language, etc. and
    4. visual style markup.
ADR3 2
  • The document must be able to incorporate additional user defined metadata associated with each description; metadata schemes may be user defined or centrally defined. For example the language of the description may be stored, notes made by the script writer.
  • [[MAUR]] DV-10 Allow the user to select from among different languages of descriptions, if available, even if they are different from the language of the main soundtrack.
  • [[MAUR]] DV-13 Support metadata, such as copyright information, usage rights, language, etc.
ADR4 2
  • The document must be extensible to allow incorporation of data required to achieve the desired quality of audio presentation, whether manual or automated. For example it is typical to include information about what gender and age voice would be appropriate to voice the descriptions; it is also feasible to include data used to improve the quality of text to speech synthesis, such as phonetic descriptions of the text, intonation and emotion data etc.
  • The format of any extensions for this purpose need not be defined.
ADR12 3
  • The document must be able to reference audio tracks either included as binary data within the document or separately.
  • [[MAUR]] DV-4 Support recordings of high quality speech as a track of the media resource, or as an external file.
  • [[MAUR]] DV-9 Allow the author to use a codec which is optimized for voice only, rather than requiring the same codec as the original soundtrack.
ADR5 3
  • The document must be able to associate a begin time with the beginning of playback of each audio track, for the case that multiple audio tracks are created, one per description.
  • [[MAUR]] DV-2 Render descriptions in a time-synchronized manner, using the primary media resource as the timebase master.
ADR6 3
  • The document must be able to associate a begin time with a playback entry time within an audio track, for the case that a single audio track is generated that is the same duration as the main programme audio.
  • The begin time and the playback entry time may be required to be synchronous (coincident values) within the document structure.
  • [[MAUR]] DV-2 Render descriptions in a time-synchronized manner, using the primary media resource as the timebase master.
ADR7 4
  • The document must be able to associate a left/right pan value with playback of each or every audio description. This value applies to the audio description prior to mixing with the main programme audio.
  • [[MAUR]] DV-8 Allow the author to provide fade and pan controls to be accurately synchronized with the original soundtrack.
ADR8 5
  • The document must be able to define a fade level curve that applies to the main programme audio prior to mixing with the audio description, where that fade level curve is defined by a set of pairs of level and times and an interpolation algorithm.
  • [[MAUR]] DV-5 Allow the author to independently adjust the volumes of the audio description and original soundtracks where these are available as separate audio channel resources.
  • [[MAUR]] DV-7 Permit smooth changes in volume rather than stepped changes. The degree and speed of volume change should be under user control.
  • [[MAUR]] DV-8 Allow the author to provide fade and pan controls to be accurately synchronized with the original soundtrack.
ADR9 6
  • The processor must be able to generate a set of directives to control an audio mixer to generate the desired audio description mixed audio track honouring the pan and fade information within the document.
  • The format of those directives may be implementation dependent.
ADR10 6
  • The processor may modify the audio mixer control directives under user control to customise the relative levels of main programme audio and audio description, and the pan information.
  • [[MAUR]] DV-6 Allow the user to independently adjust the volumes of the audio description and original soundtracks (where these are available as separate audio channel resources), with the user's settings overriding the author's.
  • [[MAUR]] DV-12 Allow the user to relocate the pan location of the various audio tracks within the audio field, with the user setting overriding the author setting.
    The setting should be re-adjustable as the media plays.
ADR11 6
  • The audio mixing transitions and semantics must be implementable using [[WEBAUDIO]], specifically relating to the application of gain and pan as defined therein and the interpolation between values.

[[WEBAUDIO]] specifies three interpolation mechanisms for traversing from one parameter value to another: linear ramps, exponential ramps, and linear interpolation between points on a value curve; in addition, setTargetAtTime approaches a target value along an exponential curve.

Web Audio Mixing

An image illustrating the effect of pan and gain on audio in a WEBAUDIO process

Every content element in the audio tree creates a mixer that adds the audio from its parent and its audio element children, and optionally applies a pan and gain to the output.

Every audio element provides an audio input from some audio resource, with its own pan and gain.

The output of every content element’s audio is passed to each of its children.

The audio output of all the leaf content elements is mixed together on the “master bus” in Web Audio terms.
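As a sketch of how these rules apply to a document fragment (values invented; the tta prefix assumed bound to the TT Audio Style namespace), the div below receives the programme audio from its audio child, and the nested span mixes in its own description audio, applying a pan and gain to its output:

        <div>
          <audio src="programme.wav"/>
          <p begin="25s" end="30s">
            <span tta:pan="-0.5" tta:gain="0.8">
              <audio src="description1.wav"/>
              A man walks into the room.
            </span>
          </p>
        </div>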

      

An image illustrating the effect of gain on audio levels

Real world worked example: BBC

This section includes part of a real world audio description script generated by the BBC for its EastEnders television programme.
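The original BBC script is not reproduced here; a structural sketch following the description below, with invented times, text and resource identifiers (and the tta and ttm prefixes assumed bound to the TT Audio Style and TT Metadata namespaces), might look like:

      <div begin="0s">
        <audio src="urn:example:programme-audio;track=1"/>
        <audio src="urn:example:programme-audio;track=2"/>
        <p begin="00:00:10.000" end="00:00:15.000">
          <metadata><ttm:desc>Scene 1</ttm:desc></metadata>
          <animate begin="0s" end="0.3s" tta:gain="1;0.39" fill="freeze"/>
          <animate begin="4.7s" end="5s" tta:gain="0.39;1"/>
          <span begin="0.3s" end="4.7s">
            <audio src="snippets.wav" clipBegin="10.3s" clipEnd="14.7s"/>
            A man walks into the room.
          </span>
        </p>
      </div>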

      

We have a div element that wraps all the other content. Crucially, it includes two audio element children, which reference the main programme audio tracks that form the basis of the mix.

We need a convention to identify "tracks that are provided to us from somewhere else", and in this case we've defined ;track=n to do that.

Then there are a bunch of child p elements that each have a begin and end time. Each represents a snippet of audio description and the interval during which the associated mixing and playback happen. The text of the audio description is contained in a child span element, which itself has begin and end times. The span's begin and end times are relative to the parent p element's begin time.

There is some metadata there too, which might be helpful during the authoring process, for example.

A few things need to happen for each snippet of audio description:

The fade up and down are both achieved by placing animate elements as children of the p element. They smoothly change ("continuously animate") the tta:gain value between the values in a semicolon-separated list, where the begin and end times of the animation are specified on the element and are relative to the parent p element's begin time. The audio that they modify is the audio available to that element, i.e. the programme audio that comes down to the p from the parent div (recall that the div specified the programme audio; this is where it takes effect).

Playing the audio description is done by adding a new audio child to the span. The playback begins in the presentation at the span's begin time, and the clipBegin and clipEnd mark the in and out points of the referenced audio resource to play, which is specified by the src attribute. If we wanted to specify a left/right pan value, we could do that by setting a tta:pan attribute on the audio element itself. Similarly we could vary the level of the audio by setting a tta:gain value.

This structure is implemented in the mixing code by constructing a Web Audio Graph, where the outputs of all the spans are, in the end, mixed together.

Acknowledgements

The editor acknowledges the current and former members of the W3C Timed Text Working Group (TTWG), the members of other W3C Working Groups, and industry experts in other forums who have contributed directly or indirectly to the process or content of this document.

The editors wish to especially acknowledge the contributions by: