DAPT Requirements

This document captures technical requirements for a profile of TTML2 for use in workflows related to dubbing and audio description of movies and videos, known as the Dubbing and Audio description Profile of TTML2 (DAPT).

Workflow

The following diagram illustrates a hypothetical combined workflow for the authoring and exchange of timed text content for each of 1) audio description and 2) dubbing applications.

The contents of the diagram are fully described in the paragraphs that follow: it is provided as a visual summary only.

The blue rectangles with labels preceded by a number represent various manual or automated processes. Clicking on a process rectangle will take the reader to the detailed description of that process below. Processes specific to Audio Description or Dubbing workflows are marked respectively with AD or DUB. Those not marked as AD or DUB are common to both workflows. The green boxes with a wave underside represent timed text scripts that are produced by the various processes and are expected to conform to the profile definition. The white boxes with a wave underside are media data used or produced in the workflows but whose format is out of scope for the profile.

Hypothetical combined workflow for audio description and dubbing.

Identify key times in programme dialogue, mark gaps: Process Step 1.

This step is in both the Audio Description and Dubbing workflows.

The data output of this step is a set of script events.

In this step, the programme audio track is automatically or manually processed to identify intervals that either contain spoken dialogue to be translated or within which description audio may be inserted.
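
As an illustration only, the gap-marking part of this step could be sketched as follows. The ScriptEvent class, the kind labels, the find_gaps function and the two-second minimum gap are all assumptions made for this sketch; none of them is defined by the profile.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ScriptEvent:
    """Hypothetical in-memory model of a script event (not a profile data type)."""
    begin: float  # seconds from the start of the programme
    end: float    # seconds from the start of the programme
    kind: str     # assumed labels: "dialogue", "on-screen-text" or "gap"
    text: Optional[str] = None


def find_gaps(dialogue: List[ScriptEvent], programme_end: float,
              min_gap: float = 2.0) -> List[ScriptEvent]:
    """Return the intervals without dialogue that last at least min_gap seconds."""
    gaps = []
    cursor = 0.0  # end time of the latest dialogue seen so far
    for event in sorted(dialogue, key=lambda e: e.begin):
        if event.begin - cursor >= min_gap:
            gaps.append(ScriptEvent(cursor, event.begin, "gap"))
        cursor = max(cursor, event.end)
    if programme_end - cursor >= min_gap:
        gaps.append(ScriptEvent(cursor, programme_end, "gap"))
    return gaps


# Example data invented for this sketch.
dialogue = [
    ScriptEvent(1.0, 4.0, "dialogue", "Hello."),
    ScriptEvent(10.0, 12.5, "dialogue", "Goodbye."),
]
gaps = find_gaps(dialogue, programme_end=20.0)
```

With this example data, the one-second interval before the first line of dialogue is shorter than the assumed minimum and is therefore not marked as a gap.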

This is the first process step.

The next step is PS2a for audio description or PS2b for dubbing.

Describe images (AD): Process Step 2a.

This step is in the Audio Description workflow only.

The data input to this step is a set of timed script events.

The data output of this step is a pre-recording timed text script corresponding to the descriptions.

In this step, a set of descriptions is written to fit within the identified times.

The previous step is PS1.

The next step is PS3.

Transcribe original text (DUB): Process Step 2b.

This step is in the Dubbing workflow only.

The data input to this step is a set of script events.

The data output of this step is timed text corresponding to the original language.

In this step, the following actions are performed:

Dialogue is transcribed in the original language to create a source transcription text.
The dialogue script events are notated with character information and other annotations.
Localization notes are generated to guide further adaptation.

The previous step is PS1.

The next step is PS2c.

Translate text (DUB): Process Step 2c.

This step is in the Dubbing workflow only.

The data input to this step is timed text in the original language, including character information, annotations and localization notes.

The data output of this step is timed text corresponding to both the original language and the dubbing (translation) language.

In this step, the dialogue is translated to a target language.

The previous step is PS2b.

The next step is PS2d.

Adapt text (DUB): Process Step 2d.

This step is in the Dubbing workflow only.

The data input to this step is timed text containing both the dialogue in the original language, including character information, annotations and localization notes, and the translated dialogue in the dubbing (translation) language.

The data output of this step is a pre-recording timed text script corresponding to both the original language and the dubbing (translation) language adapted as needed for recording.

In this step, the translation is adapted for dubbing, for example by matching the actors' lip movements and by taking reading speeds and shot changes into account.

The previous step is PS2c.

The next step is PS3.

Record voice or synthesize audio: Process Step 3.

This step is in both the Audio Description and Dubbing workflows.

The data input to this step is a pre-recording timed text script.

The data output of this step is an audio rendering of the script.

In this step, an audio rendition of the script is generated, either by recording an actor or voicer speaking the script or by synthesizing the speech with a text-to-speech system. For audio descriptions, this is typically a mono audio track that may be delivered as a single track with the same duration as the programme, or as a set of short audio tracks each beginning at a defined time. Alternatively, it can be a single track in which the short descriptions are concatenated to remove unnecessary silences.
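
For the concatenated delivery option, each description needs both a begin time in the programme and an entry time within the single audio track. A minimal sketch of that bookkeeping follows; the function name and the (begin, duration) tuple layout are assumptions made for illustration, not part of the profile.

```python
from typing import List, Tuple


def concatenated_entry_times(
    clips: List[Tuple[float, float]]
) -> List[Tuple[float, float]]:
    """Map (programme begin, clip duration) pairs, in track order,
    to (programme begin, entry time within the concatenated track) pairs."""
    entries = []
    offset = 0.0  # running position within the concatenated track, in seconds
    for begin, duration in clips:
        entries.append((begin, offset))
        offset += duration
    return entries


# Three descriptions (values invented for this sketch): begin time and duration.
clips = [(4.0, 3.0), (12.5, 5.0), (30.0, 2.5)]
entries = concatenated_entry_times(clips)
```

Here the second description begins at 12.5 s in the programme but is read from 3.0 s into the concatenated track, because only the first clip precedes it.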

The previous step is PS2a or PS2d.

The next steps are PS4 and/or PS5.

Define audio mixing instructions: Process Step 4.

This step is in both the Audio Description and Dubbing workflows.

In Dubbing workflows, this step applies when no original audio track that excludes the original language dialogue is available, so the dubbing track must be mixed over the original language speech.

The data inputs to this step are the pre-recording timed text script and the rendered audio.

The data output of this step is a post-recording script including audio mixing instructions.

In this step, the following actions are performed:

Select a horizontal pan position to apply to the audio rendition of the description when mixing with the main programme audio. This is typically a single value that applies to each description.
Select the amount by which to lower the main programme audio prior to mixing in the description audio. This is typically defined as a curve described by a set of moments in time and fade levels to apply, with an interpolation algorithm to vary the levels between each moment in time.
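
The fade curve described above can be sketched as a list of (time, gain) points with an interpolation function. The sketch assumes linear interpolation and a linear gain scale from 0.0 to 1.0; the workflow only calls for some interpolation algorithm, so both choices, and the name gain_at, are illustrative assumptions.

```python
from bisect import bisect_right
from typing import List, Tuple


def gain_at(curve: List[Tuple[float, float]], t: float) -> float:
    """Linearly interpolate the main programme gain at time t.

    curve holds (time in seconds, gain) points sorted by time; the gain is
    held constant before the first point and after the last point.
    """
    times = [point[0] for point in curve]
    if t <= times[0]:
        return curve[0][1]
    if t >= times[-1]:
        return curve[-1][1]
    i = bisect_right(times, t)  # index of the first point later than t
    (t0, g0), (t1, g1) = curve[i - 1], curve[i]
    return g0 + (g1 - g0) * (t - t0) / (t1 - t0)


# Assumed example: lower the programme audio to 25% around a description
# that plays from 4.5 s to 9.5 s, with half-second fades either side.
curve = [(4.0, 1.0), (4.5, 0.25), (9.5, 0.25), (10.0, 1.0)]
mid_fade = gain_at(curve, 4.25)
during_description = gain_at(curve, 7.0)
```

A pan position, being typically a single value per description, would need no curve and could be stored alongside the points as one number.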

The previous step is PS3.

This step can be followed by, or be done in parallel with, PS5, or it can be the final step.

Edit script to match performance: Process Step 5.

This step is in both the Audio Description and Dubbing workflows.

The data inputs to this step are the pre-recording timed text script and the rendered audio.

The data output of this step is an accurate post-recording script matching the recorded audio, sometimes called an As-recorded script.

In this step, the script is adjusted based on changes made during the audio recording.

The previous step is PS3.

This step can be followed by, or be done in parallel with, PS4, or it can be the final step.
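
The ordering of the process steps described above can be summarised as a small directed graph. The sketch below is a simplification made for illustration: it treats PS4 and PS5 as alternative final steps and does not model running them in parallel or one after the other. The names NEXT_STEPS, previous_steps and paths_from are assumptions, not defined terms.

```python
from typing import Dict, List

# Next steps for each process step, as stated in the step descriptions.
NEXT_STEPS: Dict[str, List[str]] = {
    "PS1": ["PS2a", "PS2b"],   # PS2a for audio description, PS2b for dubbing
    "PS2a": ["PS3"],
    "PS2b": ["PS2c"],
    "PS2c": ["PS2d"],
    "PS2d": ["PS3"],
    "PS3": ["PS4", "PS5"],
    "PS4": [],                 # simplification: treated as a final step
    "PS5": [],                 # simplification: treated as a final step
}


def previous_steps(step: str) -> List[str]:
    """Return the steps that list the given step as a next step."""
    return [s for s, followers in NEXT_STEPS.items() if step in followers]


def paths_from(step: str) -> List[List[str]]:
    """Enumerate every path from the given step to a final step."""
    followers = NEXT_STEPS[step]
    if not followers:
        return [[step]]
    return [[step] + rest for nxt in followers for rest in paths_from(nxt)]


all_paths = paths_from("PS1")
```

Under this simplification there are four paths from PS1: two through PS2a for audio description and two through PS2b, PS2c and PS2d for dubbing.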

Requirements

The following table lists the requirements at each stage of the workflow:

TODO: Add [[media-accessibility-reqs]]

Table of requirements, each having a unique requirement number, links to the workflow process stages from which it arises, and a detailed description.
Requirement number	Process step	Requirement
R1	all	A document shall identify the type of document it corresponds to, from typical dubbing and audio description workflows, at least among: "Audio Description Script", "Original Language Dialogue List", "Translated Dialogue List" (a.k.a. "Pivot Language Dialogue List"), "Pre-Recording Dub Script", "Pre-Recording Audio Description Script", "As-recorded Dub Script", "As-recorded Audio Description Script".
R2	PS1	A document must be able to define a list of intervals, each called a script event and defined by a begin time and an end time. Each script event either matches dialogue spoken by a single character, matches on-screen text from the programme, or is an opportunity for adding descriptions. Within a script event, there may be other timed content, whose times are relative to the beginning of the script event.
R3	PS2a	A document must be able to incorporate description text to be voiced, each description located within a timed interval defined by a begin time and an end time.
R4	PS2a	A document must be able to incorporate additional user-defined metadata associated with each description; metadata schemes may be user-defined or centrally defined. For example, the language of the description may be stored, or notes made by the script writer.
R5	PS2a	A document must be extensible to allow incorporation of data required to achieve the desired quality of audio presentation, whether manual or automated. For example, it is typical to include information about the gender and age of voice that would be appropriate for the descriptions; it is also feasible to include data used to improve the quality of text-to-speech synthesis, such as phonetic descriptions of the text, or intonation and emotion data. The format of any extensions for this purpose need not be defined.
R6	PS2b	A document must be able to describe the characters participating in a programme, defined by a name in the programme, optionally a name in the real world and optional private metadata (e.g. images, text).
R7	PS2b	A document must be able to associate a unique character with a given script event, when the script event corresponds to a dialogue.
R8	PS2b	A document must be able to associate text content, called original content, with each script event, corresponding to the content of the dialogue or on-screen text. The language of the original content shall also be identified.
R9	PS2b	A document must be able to associate optional rendering styles with a character.
R10	PS2b	A document shall support associating rendering styles with a script event.
R11	PS2b	A document must be able to optionally associate a human-readable string with a script event, providing annotation of the script event.
R12	PS2b	A document must be able, for a script event corresponding to a dialogue, to optionally indicate if the character is in the picture, out of the picture or transitioning in or out of the picture.
R13	PS2b	A document shall support optionally annotating each script event with an indication of the type of script event (e.g. when the script event corresponds to on-screen text, whether the text corresponds to "Credits" or "Title"; or whether the dialogue text is audible or not), and optionally with additional human-readable text.
R14	PS2c	A document must be able to optionally associate with a script event a translation of the original content in a different, identified language.
R15	PS3	A document must be able to reference audio tracks either included as binary data within the document or separately.
R16	PS3	A document must be able to associate a begin time with the beginning of playback of each audio track, for the case that multiple audio tracks are created, one per description.
R17	PS3	A document must be able to associate a begin time with a playback entry time within an audio track, for the case that a single audio track is generated that contains multiple segments of audio to be played back at different times.
R18	PS4	A document must be able to associate a left/right pan value with playback of each or every audio description. This value applies to the audio description prior to mixing with the main programme audio.
R19	PS4	A document must be able to define a fade level curve that applies to the main programme audio prior to mixing with the audio description, where that fade level curve is defined by a set of (time, level) pairs and an interpolation algorithm.
R20	all	A document shall support storing private metadata associated with the document, such as a title, an episode identifier, a season identifier, etc.
