TTML2 Karaoke Extension

This document is an extension to [[TTML2]]. It specifies how to represent karaoke or sing-along applications in a TTML document, defines the associated processing, and defines a corresponding feature designator that can be used in TTML profile definitions.

Introduction

Karaoke or sing-along applications are well-known timed text applications: song lyrics are displayed on top of a corresponding video clip, with timed emphasis or highlights on words or characters to indicate to the viewer which words/characters have been sung, are being sung or will be sung.

Simple karaoke content can already be represented in [[TTML2]], for example using the following approach:

<p begin="2s" end="12s">
<span><set begin="2s" tts:color="yellow"/>Twinkle</span>
<span><set begin="4s" tts:color="yellow"/>Twinkle</span>
<span><set begin="6s" tts:color="yellow"/>Little star</span>
</p>

This approach has limitations. The style properties available in [[TTML2]] cannot represent all the complex effects that may be used in Karaoke. For example, using a continuous animation to fill the text with a color sweep is not possible. Animating a bouncing ball on top of the text to represent the word being sung would not be easy, if at all possible. Additionally, the above syntax does not carry semantics enabling a presentation processor to apply user-specific or implementation-specific karaoke styles.

This specification defines how to signal semantically that parts of a timed text document are karaoke content. It also defines additional styling properties that enable richer karaoke presentations.

Support for the features defined in this specification is identified by the following feature designator: #karaoke.

Karaoke Model

This section defines the karaoke model. The semantics and processing defined in this specification are only applicable when timed text content is within a karaoke section. Elements and attributes defined in this specification should (must ?) be ignored when outside of a karaoke section.

A karaoke section is composed of the element on which the ttp:karaoke attribute is specified with the value auto, and its descendants. There may be multiple karaoke sections within the same timed text document. Within a karaoke section, timed text content is organized as follows.

Each word, group of words or part of words meant to be highlighted or emphasized during the same period of time should be wrapped in a span element, called a karaoke span. If not, the anonymous span created by the [construct anonymous spans] procedure is the associated karaoke span. Each karaoke span should contain at least one animation element, typically animating the tts:karaokeMode attribute. Its timing attributes (i.e. begin, dur and/or end) are assumed to specify the interval for the highlight or emphasis of the corresponding text. Other set or animate elements may be present.

A set of karaoke spans contained in the same paragraph is called a karaoke group.

Animation elements (set or animate) may be present at the karaoke section, karaoke span or karaoke group level to represent the timed behavior of the highlight or emphasis within the karaoke content, or to represent transitions between karaoke groups, or at the beginning or end of a karaoke section. They may use any style properties, including those defined in this specification or defined in [[!TTML2]].

All animations at any level within a karaoke section can be overridden by the presentation processor based on either user settings or processor settings.

Styling Attributes

tts:imageEmphasis

The tts:imageEmphasis attribute must conform to the following:

It could be renamed karaokeEmphasis and be applicable to karaoke only, but by making it non-karaoke specific it can be used in other applications.

Values:	`none | <image>`
Initial:	`none`
Applies to:	`span`
Inherited:	yes
Percentages:	N/A
Animatable:	discrete

This attribute modifies the semantics of the tts:textEmphasis attribute when the value of the emphasis-style component of the tts:textEmphasis attribute is set to auto. It must be ignored otherwise. When not ignored, its value must not be none. In this case, the referenced image must be used instead of the default style indicated in [[!TTML2]], with a positioning indicated by the emphasis-position component of the tts:textEmphasis attribute. If an emphasis-color is specified in the value of tts:textEmphasis, it must be ignored.

This attribute is defined as an additional attribute that builds on an existing attribute and modifies its behavior, in order to remain backwards-compatible: an old parser that simply ignores the image emphasis would apply a default text emphasis.

<tt xmlns="http://www.w3.org/ns/ttml" xmlns:ttp="http://www.w3.org/ns/ttml#parameter" ...>
    <head>
        <layout>
            <region xml:id="karaoke-region1"
                    tts:origin="20px 215px"
                    tts:extent="180px 20px"
                    tts:textEmphasis="auto before"
                    tts:imageEmphasis="http://example.org/star.png"/>
        </layout>
    </head>
</tt>
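
The interaction between tts:textEmphasis and tts:imageEmphasis described above can be sketched as follows. This is a minimal, non-normative Python sketch: the function name `resolve_image_emphasis` and the returned dictionary shape are assumptions made for illustration, the tokenization is deliberately naive (it does not handle quoted-string emphasis styles containing spaces), and the `outside` default for an omitted emphasis-position reflects our reading of [[TTML2]] rather than anything stated in this document.

```python
POSITIONS = {"outside", "before", "after"}


def resolve_image_emphasis(text_emphasis, image_emphasis):
    """Return the image mark to render, or None when tts:imageEmphasis is ignored.

    text_emphasis: computed value of tts:textEmphasis, e.g. "auto before".
    image_emphasis: computed value of tts:imageEmphasis ("none" or an image URL).
    """
    tokens = text_emphasis.split()
    # tts:imageEmphasis only applies when the emphasis-style component is auto.
    if "auto" not in tokens:
        return None
    # Assumption: a value of none leaves the default text emphasis untouched.
    if image_emphasis == "none":
        return None
    # Assumption: the position defaults to "outside" when omitted.
    position = next((t for t in tokens if t in POSITIONS), "outside")
    # Any emphasis-color token is simply never read, so it is ignored.
    return {"image": image_emphasis, "position": position}


star = "http://example.org/star.png"
print(resolve_image_emphasis("auto before", star))
print(resolve_image_emphasis("filled circle before", star))
```

With the region from the example above, the first call yields the star image positioned before the text, while the second call returns None because the emphasis style is not auto.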

Karaoke Vocabulary

ttp:karaoke

The ttp:karaoke attribute is used to identify a karaoke section. If specified, it must adhere to the following syntax:

Should this be a style property?

ttp:karaoke
    : "none"
    | "auto"

If not specified, the value of this parameter must be considered to be none.

A ttp:karaoke attribute is considered to be significant only when specified on the body or div element. If specified on other elements it must be ignored. If specified on an element, it must not be specified on any descendant element and must be ignored on the latter if present.

<tt xmlns="http://www.w3.org/ns/ttml" xmlns:ttp="http://www.w3.org/ns/ttml#parameter" ...>
    ...
    <body>
      <div>
        ...
      </div>
      <div ttp:karaoke="auto"> <!-- the karaoke section starts here -->
        ...
      </div> <!-- the karaoke section ends here -->
      <div>
        ...
      </div>
    </body>
</tt>
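
The significance rules for ttp:karaoke can be sketched as a tree walk. The following is a minimal, non-normative Python sketch: the function name `find_karaoke_sections` and the sample document are illustrative assumptions. It also assumes one particular reading of the last rule, namely that the attribute is ignored on a descendant whenever an ancestor body or div specifies ttp:karaoke, whatever the ancestor's value.

```python
import xml.etree.ElementTree as ET

TT_NS = "http://www.w3.org/ns/ttml"
TTP_NS = "http://www.w3.org/ns/ttml#parameter"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
KARAOKE_ATTR = f"{{{TTP_NS}}}karaoke"
# ttp:karaoke is significant only on body and div.
SIGNIFICANT = {f"{{{TT_NS}}}body", f"{{{TT_NS}}}div"}


def find_karaoke_sections(root):
    """Return the elements that start a karaoke section, in document order."""
    sections = []

    def walk(element, ancestor_specified):
        specified = element.tag in SIGNIFICANT and KARAOKE_ATTR in element.attrib
        if (specified and not ancestor_specified
                and element.get(KARAOKE_ATTR) == "auto"):
            sections.append(element)
        for child in element:
            walk(child, ancestor_specified or specified)

    walk(root, False)
    return sections


# Hypothetical document exercising each rule.
DOC = """
<tt xmlns="http://www.w3.org/ns/ttml"
    xmlns:ttp="http://www.w3.org/ns/ttml#parameter">
  <body>
    <div xml:id="d1"/>
    <div xml:id="d2" ttp:karaoke="auto">
      <div xml:id="d3" ttp:karaoke="auto"/>
      <p xml:id="p1" ttp:karaoke="auto"/>
    </div>
    <div xml:id="d4" ttp:karaoke="none"/>
    <div xml:id="d5" ttp:karaoke="none">
      <div xml:id="d6" ttp:karaoke="auto"/>
    </div>
  </body>
</tt>
"""

section_ids = [e.get(XML_ID) for e in find_karaoke_sections(ET.fromstring(DOC))]
print(section_ids)
```

Only `d2` starts a karaoke section here: `d3` and `d6` are ignored because an ancestor already specifies the attribute, and `p1` is ignored because p is not one of the significant elements.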

tts:karaokeMode

The tts:karaokeMode attribute is used to specify a style property that defines the desired animation type of karaoke animations. It must conform to the following:

Values:	`auto | emphasis | color`
Initial:	`auto`
Applies to:	`karaoke span`, `karaoke group`
Inherited:	yes
Percentages:	N/A
Animatable:	discrete

The semantics of the values are as follows:

auto: This value can be used at any level within a karaoke section. It indicates that an animation is desired but that any animation effect is suitable, including a combination of the other values.
emphasis: This value can be used at any level within a karaoke section. It indicates that an animation of the position of the emphasis mark within a karaoke span or across karaoke spans is desired. The actual mark may be specified by the tts:textEmphasis attribute alone or in combination with tts:imageEmphasis. The animation of the position should be consistent with the value of emphasis-position, e.g. it should remain towards the before edge of the affected glyph areas if before is applicable.
color: This value should be used for animations only at karaoke span level. If used elsewhere in a karaoke section, it should be ignored and the value auto should be used instead. It indicates that during the activation of the associated karaoke span, a color animation is desired, e.g. using the tts:color, tts:backgroundColor or tts:textOutline attributes.
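
The fallback described for the color value can be sketched as a small helper. This is a minimal, non-normative Python sketch: the function name `effective_karaoke_mode` and the level labels `"section"`, `"group"` and `"span"` are hypothetical names introduced for illustration only.

```python
KARAOKE_MODES = ("auto", "emphasis", "color")
LEVELS = ("section", "group", "span")  # hypothetical labels for the three levels


def effective_karaoke_mode(specified, level):
    """Return the tts:karaokeMode value a processor would act on at this level."""
    if specified not in KARAOKE_MODES:
        raise ValueError(f"unknown tts:karaokeMode value: {specified!r}")
    if level not in LEVELS:
        raise ValueError(f"unknown level: {level!r}")
    # color is meaningful only at karaoke span level; elsewhere fall back to auto.
    if specified == "color" and level != "span":
        return "auto"
    return specified


print(effective_karaoke_mode("color", "span"))
print(effective_karaoke_mode("color", "group"))
print(effective_karaoke_mode("emphasis", "section"))
```

The auto and emphasis values pass through unchanged at every level, so only color needs the level check.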

The actual animation effect depends on the animation element that is used (set or animate) as follows:

If a set element is used, it indicates that a discrete animation is desired.
- When color is used, the color is expected to be constant during the time interval of the animation.
```
<tt xmlns="http://www.w3.org/ns/ttml" xmlns:ttp="http://www.w3.org/ns/ttml#parameter" ...>
 ...
 <body ttp:karaoke="auto">
   <p begin="2s" end="12s">
     <span>
       <!-- [4s, 6s): a discrete color change should be applied to this span -->
       <set begin="2s" tts:karaokeMode="color"/>Twinkle
     </span>
     <span>
       <!-- [6s, 8s): a discrete color change should be applied to this span -->
       <set begin="4s" tts:karaokeMode="color"/>Twinkle
     </span>
     <span>
       <!-- [8s, 12s): a discrete color change should be applied to this span -->
       <set begin="6s" tts:karaokeMode="color"/>Little Star
     </span>
   </p>
 </body>
</tt>
 
```
 Note the following in this example:
 - the use of the ttp:karaoke attribute, which provides semantics enabling the presentation processor to override the styles and apply its own styles;
 - the use of the tts:karaokeMode attribute in the animation to indicate semantically that a color animation is desired, enabling any override to be color-specific;
 - and the absence of the tts:color attribute. Note that tts:color could still be specified, either as a fallback for processors that do not support this feature, or to indicate to the presentation processor which colors to use and which type of color animation to apply (outline, background, ...).
- When emphasis is used, the presentation processor should choose a single position to place the emphasis (e.g. on the middle glyph), and the emphasis is not expected to move during the time interval of the animation. Continuous animations (either explicitly specified or due to an implementation-specific behavior) may still happen outside of this interval, e.g. between active intervals of karaoke spans or at the start or end of karaoke groups or karaoke sections.
```
<tt xmlns="http://www.w3.org/ns/ttml" xmlns:ttp="http://www.w3.org/ns/ttml#parameter" ...>
 ...
 <body ttp:karaoke="auto">
   <p begin="2s" end="12s">
     <span>
       <!-- [4s, 6s): the text emphasis should be on this span -->
       <set begin="2s" tts:karaokeMode="emphasis"/>Twinkle
     </span>
     <span>
       <!-- [6s, 8s): the text emphasis should be on this span -->
       <set begin="4s" tts:karaokeMode="emphasis"/>Twinkle
     </span>
     <span>
       <!-- [8s, 12s): the text emphasis should be on this span -->
       <set begin="6s" tts:karaokeMode="emphasis"/>Little Star
     </span>
   </p>
 </body>
</tt>
 
```
 It is assumed that there is only one karaoke emphasis at a time. Thus, during the interval [8s, 12s), where all three set animations are active, only the last activated one actually affects the karaoke emphasis.
Note also that if a dur or end attribute were specified on the set elements, the animations would end and the karaoke effect would be removed, unless the fill attribute is set to freeze.
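
The "only one karaoke emphasis at a time" rule can be sketched as follows. This is a minimal, non-normative Python sketch: the names `SetAnimation` and `emphasized_index` are hypothetical. It assumes that each set begin is an offset from the paragraph begin (matching the intervals given in the example comments), that a set without dur or end stays active until the paragraph ends, and that ties between equal begin times go to the later span in document order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SetAnimation:
    text: str      # text of the karaoke span containing the set element
    begin: float   # begin offset in seconds, relative to the paragraph begin


def emphasized_index(paragraph_begin, paragraph_end, animations, t):
    """Return the index of the span carrying the emphasis at time t, or None."""
    best = None
    for index, animation in enumerate(animations):
        start = paragraph_begin + animation.begin
        # Without dur/end, the animation is active from start to paragraph end.
        if start <= t < paragraph_end:
            # The most recently activated animation wins.
            if best is None or start >= best[0]:
                best = (start, index)
    return None if best is None else best[1]


# The paragraph from the example above: <p begin="2s" end="12s">.
animations = [
    SetAnimation("Twinkle", 2),
    SetAnimation("Twinkle", 4),
    SetAnimation("Little Star", 6),
]
for t in (3, 5, 7, 9):
    print(t, emphasized_index(2, 12, animations, t))
```

At 9s all three animations are active, but the emphasis is reported only on the third span, consistent with the interval [8s, 12s) discussed above.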

An end attribute combined with fill="freeze" could be used to provide an end-time indication to processors that override the set with a continuous animation when the animate element cannot be used (e.g. in the IMSC1.1 profile).

If an animate element is used, it indicates that a continuous animation is desired.

When color is used, the presentation processor should transition between the animation colors with any continuous effect, such as a color sweep.

<tt xmlns="http://www.w3.org/ns/ttml" xmlns:ttp="http://www.w3.org/ns/ttml#parameter" ...>
    ...
    <body ttp:karaoke="auto">
      <p begin="2s" end="12s">
        <span>
          <!-- [4s, 5.5s): a continuous color change should be applied to this span -->
          <animate begin="2s" end="3.5s" tts:karaokeMode="color"/>Twinkle
        </span>
        <span>
          <!-- [6s, 7.5s): a continuous color change should be applied to this span -->
          <animate begin="4s" end="5.5s" tts:karaokeMode="color"/>Twinkle
        </span>
        <span>
          <!-- [8s, 10s): a continuous color change should be applied to this span -->
          <animate begin="6s" end="8s" tts:karaokeMode="color"/>Little Star
        </span>
      </p>
    </body>
</tt>

Note that color properties could also be specified on the animate element, as a list of values.

When emphasis is used, the presentation processor should continuously transition the emphasis from the first glyph to the last glyph during the time interval of the animation.

<tt xmlns="http://www.w3.org/ns/ttml" xmlns:ttp="http://www.w3.org/ns/ttml#parameter" ...>
    ...
    <body ttp:karaoke="auto">
      <p begin="2s" end="12s">
        <span>
          <!-- [4s, 5.5s): the text emphasis should be continuously moving on this span -->
          <animate begin="2s" end="3.5s" tts:karaokeMode="emphasis"/>Twinkle
        </span>
        <span>
          <!-- [6s, 7.5s): the text emphasis should be continuously moving on this span -->
          <animate begin="4s" end="5.5s" tts:karaokeMode="emphasis"/>Twinkle
        </span>
        <span>
          <!-- [8s, 10s): the text emphasis should be continuously moving on this span -->
          <animate begin="6s" end="8s" tts:karaokeMode="emphasis"/>Little Star
        </span>
      </p>
    </body>
</tt>
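
The continuous movement of the emphasis from the first glyph to the last glyph can be sketched as a simple interpolation. This is a minimal, non-normative Python sketch: the function name `emphasis_glyph_index` is hypothetical, pacing is assumed to be linear, and each character is assumed to map to exactly one glyph, which does not hold for all scripts.

```python
def emphasis_glyph_index(text, begin, end, t):
    """Return the index of the glyph carrying the emphasis at time t.

    begin and end are absolute times in seconds for the active interval
    [begin, end). Returns None outside that interval or for empty text.
    """
    if not text or not (begin <= t < end):
        return None
    progress = (t - begin) / (end - begin)  # 0.0 at begin, approaches 1.0 at end
    # Split the interval into one equal slot per glyph.
    return min(int(progress * len(text)), len(text) - 1)


# First span of the example above: "Twinkle" emphasized over [4s, 5.5s).
for t in (4.0, 4.75, 5.49, 5.5):
    print(t, emphasis_glyph_index("Twinkle", 4.0, 5.5, t))
```

A real presentation processor would more likely interpolate a position along the glyph areas rather than a discrete index, and could shape the pacing using the animate element's timing attributes.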

Similar to the example of the set element, given that there is only one emphasis at a time, overlap in time between the different emphasis animations would have no effect.

Note that other attributes of the animate element could be used to control the speed of the animation, such as calcMode, repeatCount, keyTimes or keySplines.

Clarify what happens when multiple animations of type emphasis are used within the same span.

Examples

Do we need more examples? Maybe a complete one? Maybe one with multiple animations of different types at the same time, or one with group-level animations for offscreen emphasis control.

Acknowledgments

The editors acknowledge the current and former members of the Timed Text Working Group, the members of other W3C Working Groups, and industry experts in other forums who have contributed directly or indirectly to the process or content of this document as follows: XXX