This document captures technical requirements for a profile of TTML2 for use in workflows related to dubbing and audio description of movies and videos, known as the Dubbing and Audio description Profile of TTML2 (DAPT).

Introduction

W3C members have identified the need for a profile of TTML for the exchange of timed text content in the context of audio description or dubbing applications. This document is based on the work of the Audio Description Community Group including [[ADPT]] and [[TTAL]].

Workflow

The following diagram illustrates a hypothetical combined workflow for the authoring and exchange of timed text content for each of 1) audio description and 2) dubbing applications.

The contents of the diagram are fully described in the paragraphs that follow: it is provided as a visual summary only.

The blue rectangles with labels preceded by a number represent manual or automated processes. Clicking on a process rectangle takes the reader to the detailed description of that process in the table below. Processes specific to the Audio Description or Dubbing workflow are marked AD or DUB respectively; unmarked processes are common to both workflows. The green boxes with a wavy lower edge represent timed text scripts produced by the processes, which are expected to conform to the profile definition. The white boxes with a wavy lower edge represent media data used or produced in the workflows, whose format is out of scope for the profile.

Hypothetical combined workflow for audio description and dubbing.

Identify key times in programme dialogue, mark gaps: Process Step 1.

This step is in both the Audio Description and Dubbing workflows.

The data output of this step is a set of events.

In this step, the programme audio track is processed, automatically or manually, to identify intervals that either contain spoken dialogue to be translated or within which description audio may be inserted.

This is the first process step.

The next step is PS2a for audio description or PS2b for dubbing.
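As an illustration of the automated variant of this step, the following minimal sketch derives candidate description gaps from a list of dialogue intervals. The `Interval` type, the `find_gaps` function, the use of seconds as the time unit and the `min_gap` threshold are all assumptions made for this example; none of them is defined by this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """A timed interval on the programme timeline, in seconds (hypothetical type)."""
    begin: float
    end: float


def find_gaps(dialogue, programme_end, min_gap=1.0):
    """Return the intervals not covered by any dialogue interval.

    Gaps shorter than min_gap seconds are dropped, on the assumption
    that they are too short to hold a description.
    """
    gaps = []
    cursor = 0.0  # latest end time of the dialogue seen so far
    for item in sorted(dialogue, key=lambda i: i.begin):
        if item.begin - cursor >= min_gap:
            gaps.append(Interval(cursor, item.begin))
        # max() handles overlapping dialogue intervals
        cursor = max(cursor, item.end)
    if programme_end - cursor >= min_gap:
        gaps.append(Interval(cursor, programme_end))
    return gaps


# Two overlapping dialogue intervals followed by a later one.
dialogue = [Interval(2.0, 5.5), Interval(5.0, 8.0), Interval(12.0, 15.0)]
gaps = find_gaps(dialogue, programme_end=20.0)
```

In this sketch the dialogue intervals and the gaps together would form the set of events that is passed on to PS2a or PS2b.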

Describe images (AD): Process Step 2a.

This step is in the Audio Description workflow only.

The data input to this step is a set of timed events.

The data output of this step is a pre-recording timed text script corresponding to the descriptions.

In this step, a set of descriptions is written to fit within the identified times.

The previous step is PS1.

The next step is PS3.

Transcribe original text (DUB): Process Step 2b.

This step is in the Dubbing workflow only.

The data input to this step is a set of timed events.

The data output of this step is timed text corresponding to the original language.

In this step, the following actions are performed:

The previous step is PS1.

The next step is PS2c.

Translate text (DUB): Process Step 2c.

This step is in the Dubbing workflow only.

The data input to this step is timed text in the original language, including character information, annotations and localization notes.

The data output of this step is timed text corresponding to both the original language and the dubbing (translation) language.

In this step, the dialogue is translated to a target language.

The previous step is PS2b.

The next step is PS2d.

Adapt text (DUB): Process Step 2d.

This step is in the Dubbing workflow only.

The data input to this step is timed text in the original language, including character information, annotations and localization notes, together with the translated dialogue in the dubbing (translation) language.

The data output of this step is a pre-recording timed text script corresponding to both the original language and the dubbing (translation) language adapted as needed for recording.

In this step, the translation is adapted for dubbing, for example by matching the actors' lip movements and taking account of reading speeds and shot changes.

The previous step is PS2c.

The next step is PS3.

Record voice or synthesize audio: Process Step 3.

This step is in both the Audio Description and Dubbing workflows.

The data input to this step is a pre-recording timed text script.

The data output of this step is an audio rendering of the script.

In this step, an audio rendition of the script is generated, either by recording an actor or voicer speaking it or by synthesizing the speech with a text-to-speech system. For audio descriptions, this is typically mono audio that may be delivered as a single track of the same duration as the programme, as a set of short audio tracks each beginning at a defined time, or as a single track in which the short descriptions are concatenated to remove unnecessary silences.

The previous step is PS2a for audio description or PS2d for dubbing.

The next steps are PS4 and/or PS5.
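The concatenated delivery form described in this step requires a mapping between programme time and a position within the single audio track. The following is a minimal sketch of that mapping, assuming times in seconds; the `Clip` type and the `concatenate` and `track_position` functions are hypothetical names introduced for this example only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Clip:
    """Audio for one description (hypothetical type); all times in seconds."""
    begin: float     # programme time at which playback starts
    duration: float  # length of the recorded audio
    entry: float     # offset of this audio within the concatenated track


def concatenate(timed):
    """Lay out (begin, duration) pairs back to back in a single track."""
    clips, offset = [], 0.0
    for begin, duration in timed:
        clips.append(Clip(begin, duration, offset))
        offset += duration
    return clips


def track_position(clips, programme_time) -> Optional[float]:
    """Return the position in the concatenated track to play at
    programme_time, or None if no description is active then."""
    for clip in clips:
        if clip.begin <= programme_time < clip.begin + clip.duration:
            return clip.entry + (programme_time - clip.begin)
    return None


# Three descriptions starting at 10 s, 25 s and 60 s of programme time.
clips = concatenate([(10.0, 3.0), (25.0, 4.5), (60.0, 2.0)])
```

With a set of short audio tracks, one per description, only `begin` would be needed; the `entry` offset is what distinguishes the concatenated form.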

Define audio mixing instructions (AD): Process Step 4.

This step is in the Audio Description workflow only.

The data inputs to this step are the pre-recording timed text script and the rendered audio.

The data output of this step is a post-recording script including audio mixing instructions.

In this step, the following actions are performed:

The previous step is PS3.

This step can be followed by, or be done in parallel with, PS5, or it can be the final step.

Edit script to match performance: Process Step 5.

This step is in both the Audio Description and Dubbing workflows.

The data inputs to this step are the pre-recording timed text script and the rendered audio.

The data output of this step is an accurate post-recording script matching the recorded audio, sometimes called an As-recorded script.

In this step, the script is adjusted based on changes made during the audio recording.

The previous step is PS3.

This step can be followed by, or be done in parallel with, PS4, or it can be the final step.

Requirements

The following table lists the requirements at each stage of the workflow:

TODO: Add [[media-accessibility-reqs]]

Table of requirements, each having a unique requirement number, links to the workflow process stages from which it arises, and a detailed description.
Requirement number Process step Requirement
R1 all

A document shall identify the type of document it corresponds to in typical dubbing and audio description workflows, chosen from at least the following: "Audio Description Script", "Original Language Dialogue List", "Translated Dialogue List" (a.k.a. "Pivot Language Dialogue List"), "Pre-Recording Dub Script", "Pre-Recording Audio Description Script", "As-recorded Dub Script", "As-recorded Audio Description Script".

R2 PS1

A document must be able to define a list of intervals, each called an event and defined by a begin time and an end time. Each event either matches a dialogue spoken by a single character, matches on-screen text from the programme, or is an opportunity for adding descriptions.

In ADPT, the requirement for intervals was distinct from the requirements for their contents. Is there any reason not to do the same here?

Is "event" the best name? Do nested events need a different term?

R3 PS2a

A document must be able to incorporate description text to be voiced, each description located within a timed interval defined by a begin time and an end time.

R4 PS2a

A document must be able to incorporate additional user-defined metadata associated with each description; metadata schemes may be user-defined or centrally defined. For example, the language of the description may be stored, or notes made by the script writer.

R5 PS2a

A document must be extensible to allow incorporation of data required to achieve the desired quality of audio presentation, whether manual or automated. For example, it is typical to include information about the gender and age of the voice that would be appropriate for the descriptions; it is also feasible to include data used to improve the quality of text-to-speech synthesis, such as phonetic descriptions of the text, or intonation and emotion data. The format of any extensions for this purpose need not be defined.

R6 PS2b

A document must be able to describe the characters participating in a programme, defined by a name in the programme, optionally a name in the real world and optional private metadata (e.g. images, text).

R7 PS2b

A document must be able to associate a unique character with a given event, when the event corresponds to a dialogue.

R8 PS2b

A document must be able to associate text content, called original content, with each event, corresponding to the content of the dialogue or on-screen text. The language of the original content shall also be identified.

R9 PS2b

A document must be able to associate optional rendering styles with a character.

R10 PS2b

A document shall support associating rendering styles with an event.

R11 PS2b

A document must be able to optionally associate a human-readable string with an event, providing annotation of the event.

R12 PS2b

A document must be able, for an event corresponding to a dialogue, to optionally indicate if the character is in the picture, out of the picture or transitioning in or out of the picture.

R13 PS2b

A document shall support optionally annotating each event with an indication of the type of event (e.g. for an event corresponding to on-screen text, whether the text is "Credits" or a "Title"; for an event corresponding to dialogue, whether the dialogue is audible or not), and optionally with additional human-readable text.

R14 PS2c

A document must be able to optionally associate with an event a translation of the original content in a different, identified language.

R15 PS3

A document must be able to reference audio tracks either included as binary data within the document or separately.

R16 PS3

A document must be able to associate a begin time with the beginning of playback of each audio track, for the case where multiple audio tracks are created, one per description.

R17 PS3

A document must be able to associate a begin time with a playback entry time within an audio track, for the case where a single audio track is generated that contains multiple segments of audio to be played back at different times.

R18 PS4

A document must be able to associate a left/right pan value with the playback of each individual audio description or of all audio descriptions. This value applies to the audio description prior to mixing with the main programme audio.

R19 PS4

A document must be able to define a fade level curve that applies to the main programme audio prior to mixing with the audio description, where that fade level curve is defined by a set of pairs of time and level together with an interpolation algorithm.

R20 all

A document shall support storing private metadata associated with the document, such as a title, an episode identifier, a season identifier, etc.
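To make the fade level curve of R19 concrete, the following minimal sketch evaluates a curve given as pairs of time and level. It assumes linear interpolation, times in seconds and a level of 1.0 meaning unattenuated programme audio; the requirement itself does not fix the interpolation algorithm, the units or the level scale, and the `fade_level` function is a hypothetical name used only here.

```python
from bisect import bisect_right


def fade_level(curve, t):
    """Return the main programme level at time t by linear interpolation.

    curve is a list of (time_seconds, level) pairs sorted by strictly
    increasing time. Before the first point and after the last point,
    the nearest level is held.
    """
    times = [time for time, _ in curve]
    i = bisect_right(times, t)  # index of the first point later than t
    if i == 0:
        return curve[0][1]
    if i == len(curve):
        return curve[-1][1]
    (t0, l0), (t1, l1) = curve[i - 1], curve[i]
    return l0 + (l1 - l0) * (t - t0) / (t1 - t0)


# Duck the programme audio to 0.25 while a description plays
# from about 10.5 s to 14 s, with half-second ramps either side.
curve = [(10.0, 1.0), (10.5, 0.25), (14.0, 0.25), (14.5, 1.0)]
```

A different interpolation algorithm, such as a step function or a logarithmic ramp, would replace only the final expression in `fade_level`; the representation of the curve as pairs would stay the same.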