SyncMediaLite caveats

Last updated Tue Oct 01 2024

Caveats when going from EPUB Media Overlays to SyncMediaLite

Adopting a more modern synchronization strategy, as described in SyncMediaLite, requires some adaptation. Existing .smil files may have to be transformed into .vtt files before being distributed to a system that expects .vtt. Alternatively, the user agent may load .smil files and internally transform them into TextTrackCues, which avoids having to write a SMIL engine.
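As a rough illustration of the producer-side transformation, the following Python sketch maps each SMIL <par> to one WebVTT cue. It is a minimal sketch, not a reference converter: the function names (parse_clock, vtt_timestamp, smil_to_vtt) are hypothetical, the cue identifiers are simply sequential, and the JSON payload shape is assumed to be the FragmentSelector form used in this document's examples. It performs a naive one-to-one mapping and does not handle the differences discussed in the rest of this page.

```python
import json
import xml.etree.ElementTree as ET

# Namespace used by EPUB Media Overlays documents.
SMIL_NS = "{http://www.w3.org/ns/SMIL}"


def parse_clock(value: str) -> float:
    """Convert a SMIL clock value such as "1s", "250ms" or "0:01:02.5" to seconds."""
    value = value.strip()
    if ":" in value:
        # Full or partial clock value: [hh:]mm:ss[.fraction]
        seconds = 0.0
        for part in value.split(":"):
            seconds = seconds * 60 + float(part)
        return seconds
    # Timecount value with a unit; "ms" and "min" must be tested before "s".
    for suffix, factor in (("ms", 0.001), ("min", 60.0), ("h", 3600.0), ("s", 1.0)):
        if value.endswith(suffix):
            return float(value[: -len(suffix)]) * factor
    return float(value)  # a bare number means seconds


def vtt_timestamp(seconds: float) -> str:
    """Format seconds as a WebVTT timestamp (hh:mm:ss.ttt)."""
    total_ms = round(seconds * 1000)
    h, rem = divmod(total_ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d}.{ms:03d}"


def smil_to_vtt(smil_xml: str) -> str:
    """Naively map every <par> in a SMIL document to one WebVTT cue."""
    root = ET.fromstring(smil_xml)
    blocks = ["WEBVTT"]
    for index, par in enumerate(root.iter(f"{SMIL_NS}par"), start=1):
        audio = par.find(f"{SMIL_NS}audio")
        text = par.find(f"{SMIL_NS}text")
        # Assumption: clips without clipEnd are skipped, because resolving
        # them needs the audio duration, which this sketch does not know.
        if audio is None or text is None or audio.get("clipEnd") is None:
            continue
        fragment = text.get("src", "").partition("#")[2]
        payload = json.dumps(
            {"selector": {"type": "FragmentSelector", "value": fragment}}
        )
        begin = vtt_timestamp(parse_clock(audio.get("clipBegin", "0s")))
        end = vtt_timestamp(parse_clock(audio.get("clipEnd")))
        blocks.append(f"{index}\n{begin} --> {end}\n{payload}")
    return "\n\n".join(blocks) + "\n"
```

Calling smil_to_vtt() with the text of a .smil file returns the text of a .vtt file whose cues appear in SMIL document order.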

In any case where EPUB Media Overlays need to be transformed to work in a WebVTT-based playback scenario, there are some differences to be aware of.

Multiple audio files.

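EPUB Media Overlays theoretically permit sync points in the same SMIL file to reference different audio files, although in practice this isn't common. A WebVTT file's cue timings refer to a single media timeline, so such a SMIL file cannot be converted one-to-one into a single .vtt file.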
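A converter can detect this situation before attempting a transformation. The sketch below lists the distinct audio files referenced by a SMIL document; the function name audio_sources is hypothetical, and what to do when more than one source is found is left to the implementer.

```python
import xml.etree.ElementTree as ET

# Namespace used by EPUB Media Overlays documents.
SMIL_NS = "{http://www.w3.org/ns/SMIL}"


def audio_sources(smil_xml: str) -> list[str]:
    """Return the distinct audio src values in a SMIL document, in first-seen order."""
    root = ET.fromstring(smil_xml)
    seen: dict[str, None] = {}  # dict keys keep insertion order
    for audio in root.iter(f"{SMIL_NS}audio"):
        src = audio.get("src")
        if src is not None:
            seen.setdefault(src, None)
    return list(seen)
```

If the returned list has more than one entry, the SMIL file mixes audio files and needs the special handling discussed under Solutions.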

Non-contiguous audio segments.

Say we have an audio file of someone saying "Three one two", while our HTML text says "1 2 3". Theoretically, SMIL can handle this, though this type of content is not commonly found:

<par>
    <audio src="audio.mp3" clipBegin="1s" clipEnd="2s"/>
    <text src="file.html#one"/>
</par>
<par>
    <audio src="audio.mp3" clipBegin="2s" clipEnd="3s"/>
    <text src="file.html#two"/>
</par>
<par>
    <audio src="audio.mp3" clipBegin="0s" clipEnd="1s"/>
    <text src="file.html#three"/>
</par>

Playback would start with the highlight and audio for "1", then proceed to "2", then "3", since each <par> indicates which portion of the audio to render.

Now if you try to represent this in WebVTT, you would get:

10
00:00:01.000 --> 00:00:02.000
{"selector":{"type": "FragmentSelector", "value": "one"}}

20
00:00:02.000 --> 00:00:03.000
{"selector":{"type": "FragmentSelector", "value": "two"}}

30
00:00:00.000 --> 00:00:01.000
{"selector":{"type": "FragmentSelector", "value": "three"}}

But you would hear "Three", then "one", then "two", with the highlight moving over "3", "1", and "2" in that order, because audio playback simply follows the audio file from start to end and each cue becomes active when its time range is reached.
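One way to spot such content ahead of time is to compare consecutive clips in SMIL document order. The following sketch is an assumption-laden illustration: the function name find_timeline_issues, the issue labels, and the tolerance value are all hypothetical. It flags a clip that starts before the previous clip ended, and also a clip that starts after a gap, on the reasoning that linear playback would render the skipped audio even though SMIL would not. It only compares neighbouring clips and does not check whether the first clip starts at zero.

```python
def find_timeline_issues(clips, tolerance=0.001):
    """Compare consecutive clips given in SMIL document order.

    clips: list of (clip_begin, clip_end) tuples in seconds.
    Returns a list of (index, kind) tuples, where kind is "jumps-back"
    when a clip starts before the previous one ended, or "gap" when
    audio is skipped between the two clips.
    """
    issues = []
    for i in range(1, len(clips)):
        previous_end = clips[i - 1][1]
        begin = clips[i][0]
        if begin < previous_end - tolerance:
            issues.append((i, "jumps-back"))
        elif begin > previous_end + tolerance:
            issues.append((i, "gap"))
    return issues
```

For the example above, the clips are (1, 2), (2, 3), (0, 1), and only the third clip (index 2) is reported, as "jumps-back".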

Solutions

In both cases, resolving the difference requires either additional special handling by the user agent, or audio file reformulation by the producer.
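For the producer-side option, the reformulation can be thought of as laying the clips end to end in SMIL document order on a new timeline. The sketch below only computes that mapping; plan_reformulation is a hypothetical name, and actually cutting and concatenating the audio is out of scope and would need an audio tool of the producer's choosing.

```python
def plan_reformulation(clips):
    """Map clips in SMIL document order onto a new, contiguous timeline.

    clips: list of (clip_begin, clip_end) tuples in seconds.
    Returns one dict per clip: "source" is the range to cut from the
    original audio, "target" is where it lands in the reformulated audio
    (and therefore the timing of the corresponding WebVTT cue).
    """
    plan = []
    cursor = 0.0  # current end of the new timeline
    for begin, end in clips:
        duration = end - begin
        plan.append({"source": (begin, end), "target": (cursor, cursor + duration)})
        cursor += duration
    return plan
```

Applied to the "Three one two" example, the clips (1, 2), (2, 3), (0, 1) land at (0, 1), (1, 2), (2, 3) on the new timeline, so linear playback of the reformulated audio matches the order of the text.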