This document is an extension to [[TTML2]]. It specifies how to represent karaoke or sing-along applications in a TTML document, the associated processing, and a corresponding feature designator that can be used in TTML profile definitions.


Karaoke or sing-along applications are well-known timed text applications: song lyrics are displayed on top of a corresponding video clip, with timed emphasis or highlights on words or characters to indicate to the viewer which words/characters have been sung, are being sung or will be sung.

Simple karaoke content can already be represented in [[TTML2]], for example with the following approach:

<p begin="2s" end="12s">
<span><set begin="2s" tts:color="yellow"/>Twinkle</span>
<span><set begin="4s" tts:color="yellow"/>Twinkle</span>
<span><set begin="6s" tts:color="yellow"/>Little star</span>
</p>

This approach has limitations. The style properties available in [[TTML2]] cannot represent all the complex effects that may be used in Karaoke. For example, using a continuous animation to fill the text with a color sweep is not possible. Animating a bouncing ball on top of the text to represent the word being sung would not be easy, if at all possible. Additionally, the above syntax does not carry semantics enabling a presentation processor to apply user-specific or implementation-specific karaoke styles.

This specification defines how to signal semantically that parts of a timed text document are karaoke content, and defines additional styling properties that enable richer karaoke presentations.

Support for the features defined in this specification is identified by the following feature designator: #karaoke.

Karaoke Model

This section defines the karaoke model. The semantics and processing defined in this specification are only applicable when timed text content is within a karaoke section. Elements and attributes defined in this specification should (must ?) be ignored when outside of a karaoke section.

A karaoke section is composed of the element on which the ttp:karaoke attribute is specified with the value auto, and its descendants. There may be multiple karaoke sections within the same timed text document. Within a karaoke section, timed text content is organized as follows.

Each word, group of words, or part of a word meant to be highlighted or emphasized during the same period of time should be wrapped in a span element, called a karaoke span. Otherwise, the anonymous span created by the [construct anonymous spans] procedure is the associated karaoke span. Each karaoke span should contain at least one animation element, typically animating the tts:karaokeMode attribute. Its timing attributes (i.e. begin, dur and/or end) are assumed to specify the interval for the highlight or emphasis of the corresponding text. Other set or animate elements may be present.

A set of karaoke spans contained in the same paragraph is called a karaoke group.
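
The model above can be illustrated with a minimal sketch. It assumes that the tts and ttp prefixes are bound to the TTML2 styling and parameter namespaces, and the value color for tts:karaokeMode is chosen for illustration only. The timings and lyrics are hypothetical.

```xml
<tt xmlns="http://www.w3.org/ns/ttml"
    xmlns:tts="http://www.w3.org/ns/ttml#styling"
    xmlns:ttp="http://www.w3.org/ns/ttml#parameter"
    xml:lang="en">
  <body>
    <!-- karaoke section: this div and all of its descendants -->
    <div ttp:karaoke="auto">
      <!-- the three spans of this paragraph form one karaoke group -->
      <p begin="2s" end="12s">
        <!-- each karaoke span holds one animation element; the timing
             of that element gives the highlight interval of the text -->
        <span><set begin="0s" end="2s" tts:karaokeMode="color"/>Twinkle </span>
        <span><set begin="2s" end="4s" tts:karaokeMode="color"/>twinkle </span>
        <span><set begin="4s" end="8s" tts:karaokeMode="color"/>little star</span>
      </p>
    </div>
  </body>
</tt>
```

In this sketch the begin and end values on each set element are offsets relative to the enclosing span, which is assumed to start together with the paragraph.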

Animation elements (set or animate) may be present at the karaoke section, karaoke span or karaoke group level to represent the timed behavior of the highlight or emphasis within the karaoke content, or to represent transitions between karaoke groups, or at the beginning or end of a karaoke section. They may use any style properties, including those defined in this specification or defined in [[!TTML2]].

All animations at any level within a karaoke section can be overridden by the presentation processor based on either user settings or processor settings.

Styling Attributes


The tts:imageEmphasis attribute must conform to the following:

It could be renamed karaokeEmphasis and be applicable to karaoke only, but by making it non-karaoke specific it can be used in other applications.

Values: none | <image>
Initial: none
Applies to: span
Inherited: yes
Percentages: N/A
Animatable: discrete

This attribute modifies the semantics of the tts:textEmphasis attribute when the value of the emphasis-style component of the tts:textEmphasis attribute is set to auto. It must be ignored otherwise. When not ignored, its value must not be none. In this case, the referenced image must be used instead of the default style indicated in [[!TTML2]], with a positioning indicated by the emphasis-position component of the tts:textEmphasis attribute. If an emphasis-color is specified in the value of tts:textEmphasis, it must be ignored.

This attribute is defined as an additional attribute that builds on an existing attribute and modifies its behavior, in order to remain backwards-compatible: an old parser that simply ignores the image emphasis would apply a default text emphasis.
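
The interaction between tts:imageEmphasis and tts:textEmphasis might look as follows on individual spans. This is a sketch only: the url(...) notation and the file name ball.png are assumptions, since this document states only that the value is an <image> expression, and the namespace bindings are assumed to be those of TTML2.

```xml
<tt xmlns="http://www.w3.org/ns/ttml"
    xmlns:tts="http://www.w3.org/ns/ttml#styling"
    xml:lang="en">
  <body>
    <div>
      <p>
        <!-- emphasis-style is auto: the referenced image is used in place
             of the default emphasis mark, positioned before the text -->
        <span tts:textEmphasis="auto before"
              tts:imageEmphasis="url(ball.png)">Twinkle </span>
        <!-- emphasis-style is not auto: tts:imageEmphasis is ignored
             and the dot mark is used -->
        <span tts:textEmphasis="dot before"
              tts:imageEmphasis="url(ball.png)">twinkle </span>
        <!-- emphasis-style is auto: the image is used and the
             emphasis-color (yellow) is ignored -->
        <span tts:textEmphasis="auto yellow before"
              tts:imageEmphasis="url(ball.png)">little star</span>
      </p>
    </div>
  </body>
</tt>
```

A processor that does not support tts:imageEmphasis would be expected to fall back to the default text emphasis for the first and third spans.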

<tt xmlns="" xmlns:ttp="" ...>
            <region xml:id="karaoke-region1"
                    tts:origin="20px 215px"
                    tts:extent="180px 20px"
                    tts:textEmphasis="auto before"

Karaoke Vocabulary


The ttp:karaoke attribute is used to identify a karaoke section. If specified, it must adhere to the following syntax:

Should this be a style property?

ttp:karaoke
    : "none"
    | "auto"

If not specified, the value of this parameter must be considered to be none.

A ttp:karaoke attribute is considered to be significant only when specified on the body or div element. If specified on other elements it must be ignored. If specified on an element, it must not be specified on any descendant element and must be ignored on the latter if present.

<tt xmlns="" xmlns:ttp="" ...>
      <div ttp:karaoke="auto"> <!-- the karaoke section starts here -->
      </div> <!-- the karaoke section ends here -->
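
The significance rules for ttp:karaoke can be illustrated with a deliberately non-conforming sketch that shows which occurrences a processor would honor and which it would ignore. The namespace bindings are assumed to be those of TTML2, and the text content is hypothetical.

```xml
<tt xmlns="http://www.w3.org/ns/ttml"
    xmlns:ttp="http://www.w3.org/ns/ttml#parameter"
    xml:lang="en">
  <body>
    <!-- significant: specified on a div, so a karaoke section starts here -->
    <div ttp:karaoke="auto">
      <!-- ignored: specified on a descendant of an element that already
           specifies ttp:karaoke; this div stays inside the karaoke section -->
      <div ttp:karaoke="none">
        <p>Inside the karaoke section</p>
      </div>
    </div>
    <div>
      <!-- ignored: ttp:karaoke is not significant on p, so the default
           value none applies and this is not karaoke content -->
      <p ttp:karaoke="auto">Not karaoke content</p>
    </div>
  </body>
</tt>
```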


The tts:karaokeMode attribute is used to specify a style property that defines the desired animation type of karaoke animations. It must conform to the following:

Values: auto | emphasis | color
Initial: auto
Applies to: karaoke span, karaoke group
Inherited: yes
Percentages: N/A
Animatable: discrete
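
Because the property is inherited, a value specified for a karaoke group would be expected to apply to its karaoke spans unless a span specifies its own value. The sketch below shows only this inheritance. It assumes that a karaoke group is addressed through its p element and that the prefixes are bound to the TTML2 namespaces. The timings are hypothetical.

```xml
<tt xmlns="http://www.w3.org/ns/ttml"
    xmlns:tts="http://www.w3.org/ns/ttml#styling"
    xmlns:ttp="http://www.w3.org/ns/ttml#parameter"
    xml:lang="en">
  <body>
    <div ttp:karaoke="auto">
      <!-- value specified for the karaoke group (assumed to be set on p) -->
      <p begin="0s" end="6s" tts:karaokeMode="emphasis">
        <!-- no value of its own: inherits emphasis from the group -->
        <span><set begin="0s" end="3s" tts:color="yellow"/>Twinkle </span>
        <!-- overrides the inherited value for this karaoke span only -->
        <span tts:karaokeMode="color"><set begin="3s" end="6s" tts:color="yellow"/>little star</span>
      </p>
    </div>
  </body>
</tt>
```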

The semantics of the values are as follows:

The actual animation effect depends on the animation element that is used (set or animate) as follows:

Clarify what happens when multiple animations of type emphasis are used within the same span.


Do we need more examples? Maybe a complete one? Maybe one with multiple animations of different types at the same time. Maybe one with group-level animations for offscreen emphasis control.


The editors acknowledge the current and former members of the Timed Text Working Group, the members of other W3C Working Groups, and industry experts in other forums who have contributed directly or indirectly to the process or content of this document as follows: XXX