Publishing Guide to Audio Playback and Text-To-Speech

This guide provides a brief introduction to the different types of audio playback typically found in publications. It covers the differences between the audio publications that publishers create and the audio playback that user agents generate automatically. It also explains how screen readers and other Assistive Technologies (AT) produce audio through Text-To-Speech (TTS), which often blurs the line between a title prepared for audio consumption and a text-based title that readers consume as audio through Assistive Technologies.

Overview

Publishing has a sometimes dizzying array of terms for audio playback. Audiobook, talking book, read aloud book, text-to-speech playback, media overlays, and full audio are some of the most common.

Compounding the confusion, all of these are often referred to as audio books (two words), even though there is a distinct form of publication called an audiobook (one word). Moreover, Text-To-Speech synthesis is often talked about as though it is an audio format, even though it is a feature of the device or the Assistive Technology a reader is using.

Despite the many terms used to describe audio playback, there are only two primary models for classifying publications with recorded audio:

Audiobook: A publication designed primarily, and typically only, to be listened to.
Full Audio: A publication whose content is available in auditory form, but whose audio is often structured together with textual or visual content.

Slightly tangential to these models, but equally important, is Text-To-Speech synthesis. Speech synthesis is not technically a form of audio playback but a way to have the reader's device synthetically voice the text content of a publication. In this way, speech synthesis is like an on-demand rendering of content, but because of the limitations of text-to-speech rendering engines, the result is often not as precise and clear as professional human narration.

The rest of this guide explores each of these technologies in more detail and demystifies the terminology used to describe them.

Audiobooks

An audiobook provides prerecorded narration of a work. Publishers typically structure audiobooks as a series of one or more audio files that readers will listen to in sequence.

Readers may stream the audio directly from an audiobook provider, such as Audible, or may obtain a set of audio files that they can play back on any device or application with audio playback capabilities.

The key feature of an audiobook is that it is designed to be listened to. If there is any text content, it is typically quite minimal (e.g., a playlist to aid in playback on specific devices or a table of contents).

A recent W3C standard, aptly named Audiobooks, seeks to bring greater structure to the world of audiobooks by introducing a formal syntax for describing audiobook metadata, listing the resources of the book, and defining a play order, among other features.
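As a rough illustration of the kind of structure such a manifest provides, the following Python sketch assembles a minimal manifest and serializes it as JSON. The property names follow our reading of the Audiobooks specification, while the helper name `build_manifest`, the title, and the file paths are hypothetical and chosen only for this example. It is a sketch under those assumptions, not a complete or validated manifest.

```python
import json


def build_manifest(title, tracks):
    """Return a minimal Audiobooks-style manifest as a dict.

    `tracks` is a list of (url, duration_in_seconds) pairs, given in the
    order the audio files should be played. (Hypothetical helper.)
    """
    return {
        "@context": ["https://schema.org", "https://www.w3.org/ns/pub-context"],
        "conformsTo": "https://www.w3.org/TR/audiobooks/",
        "type": "Audiobook",
        "name": title,
        # The reading order lists the audio resources in play order.
        "readingOrder": [
            {
                "url": url,
                "encodingFormat": "audio/mpeg",
                # Durations are expressed as ISO 8601 durations in seconds.
                "duration": f"PT{seconds}S",
            }
            for url, seconds in tracks
        ],
    }


# Example data: the title and paths are placeholders, not real resources.
manifest = build_manifest(
    "Example Title",
    [("audio/part1.mp3", 1200), ("audio/part2.mp3", 950)],
)
manifest_json = json.dumps(manifest, indent=2)
print(manifest_json)
```

A real manifest would normally carry more metadata (for example, authors, narrators, and a language), which is omitted here to keep the play order easy to see.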

Although audiobooks have historically been referred to as talking books, the term "talking book" is now more commonly associated with DAISY Digital Talking Books. These are a hybrid between audiobooks and full audio publications, designed for readers who are blind, have low vision, or have other print disabilities, such as dyslexia.

Full Audio Publications

A publication with full audio differs from an audiobook in that the formats that support full audio also allow publishers to include the full text content and synchronize it with the audio (even if the full text is not always available).

EPUB is an example of a format that allows publishers to include full audio. Although EPUB is a text-first format, it includes a technology called Media Overlays that allows publishers to synchronize audio with the text for automatic playback.
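To make the idea of synchronization more concrete, the Python sketch below parses a small, hand-written Media Overlay fragment and lists which audio clip belongs to which piece of text. The fragment, file names, and the function name `list_sync_points` are assumptions made up for this example. A real overlay document would be referenced from the EPUB package document and typically contains many more entries.

```python
import xml.etree.ElementTree as ET

SMIL_NS = "http://www.w3.org/ns/SMIL"

# Illustrative overlay: each <par> pairs a text fragment with an audio clip.
SAMPLE_OVERLAY = """<smil xmlns="http://www.w3.org/ns/SMIL" version="3.0">
  <body>
    <par id="par1">
      <text src="chapter1.xhtml#sentence1"/>
      <audio src="audio/chapter1.mp3" clipBegin="0s" clipEnd="3.5s"/>
    </par>
    <par id="par2">
      <text src="chapter1.xhtml#sentence2"/>
      <audio src="audio/chapter1.mp3" clipBegin="3.5s" clipEnd="7.25s"/>
    </par>
  </body>
</smil>"""


def list_sync_points(smil_source):
    """Return (text_src, audio_src, clip_begin, clip_end) for each <par>."""
    root = ET.fromstring(smil_source)
    points = []
    for par in root.iter(f"{{{SMIL_NS}}}par"):
        text = par.find(f"{{{SMIL_NS}}}text")
        audio = par.find(f"{{{SMIL_NS}}}audio")
        points.append(
            (
                text.get("src"),
                audio.get("src"),
                audio.get("clipBegin"),
                audio.get("clipEnd"),
            )
        )
    return points


sync_points = list_sync_points(SAMPLE_OVERLAY)
for text_src, audio_src, begin, end in sync_points:
    print(f"{text_src} <- {audio_src} [{begin} to {end}]")
```

A Reading System could use pairs like these to highlight each sentence while the matching audio clip plays.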

Some Reading Systems will omit the text, making it appear as though an EPUB is an audiobook, but, unlike audiobooks, there is always a minimal amount of text that publishers must provide. Publications with full audio are never as simple to load and play as pure audiobooks because they contain control files that a simple audio playback device does not understand.

Full audio publications are often referred to as "read aloud books" because publishers commonly use synchronized text and audio playback in children's works. Children can follow the text as the Reading System plays back the audio.

Text-To-Speech Synthesis

Text-To-Speech (TTS) synthesis is a form of audio rendering typically produced on demand by a Reading System or Assistive Technology. VoiceOver on Apple devices and TalkBack on Android are two of the more commonly known examples of TTS engines built into mainstream phones and tablets, while JAWS and NVDA are examples of Assistive Technologies that can translate text to speech for users who need an auditory interface to their Windows computers.

Text-To-Speech synthesis is commonly associated with users who are blind, but many readers benefit from being able to render text content auditorily. Even sighted readers will turn to speech synthesis when reading visually is not practical (e.g., when in a moving vehicle). Many Reading Systems now provide TTS rendering as a built-in function of the application.

Unlike audiobooks and publications with full audio, however, Text-To-Speech synthesis is not an audio format. It is a way of listening to text content the reader has already obtained. Readers can use a TTS application to read their EPUB publications.

The names some Reading Systems give their Text-To-Speech playback feature can be confusingly similar to "read aloud books". For example, a button named "Read Now" might initiate Text-To-Speech playback. The distinguishing feature is that Text-To-Speech playback synthesizes the voice on the fly, whereas a read aloud book plays back prerecorded audio.

The applications that provide Text-To-Speech synthesis for persons with disabilities are usually not designed specifically for reading publications, however. They are general tools that aid navigation across the reader's device and any other apps and content on it. As a result, they often have only a very limited built-in vocabulary of pronunciations and so cannot provide the same quality of playback as prerecorded human narration.

When a user enables Text-To-Speech playback, the Reading System or Assistive Technology feeds the text content of the publication to an underlying TTS engine that voices each word. For most general language, the rendering is reasonably good, but the engines struggle with works that contain complex terms, jargon, uncommon names, etc. Consequently, readers typically also have the option to have individual words spelled out, both to help disambiguate similar-sounding words and to make sense of complex words and heteronyms (words spelled the same but with different pronunciations) that Text-To-Speech engines mispronounce. This ability to explore the spelling of words, which readers cannot do with human-narrated titles, is a particular advantage of Text-To-Speech synthesis.

Although Text-To-Speech synthesis is not audio playback provided by the publisher, it is sometimes possible for publishers to help improve the quality of the rendering. Technologies such as the Speech Synthesis Markup Language (SSML) and Pronunciation Lexicon Specification (PLS) allow publishers to provide the proper phonetic pronunciation of complex words and heteronyms. Unfortunately, support for these technologies is not yet widespread.
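As one possible illustration, the Python sketch below generates a small PLS lexicon that maps written words (graphemes) to phonetic pronunciations (phonemes). The element names and namespace follow our understanding of PLS 1.0. The function name `build_lexicon`, the chosen words, and their IPA transcriptions are assumptions for this example only, and whether a given Reading System or TTS engine honors such a lexicon is a separate question.

```python
import xml.etree.ElementTree as ET

PLS_NS = "http://www.w3.org/2005/01/pronunciation-lexicon"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def build_lexicon(entries, lang="en"):
    """Serialize (grapheme, ipa_phoneme) pairs as a PLS lexicon string.

    Hypothetical helper: it covers only the simplest one-word,
    one-pronunciation case.
    """
    # Emit PLS elements in the default namespace rather than with a prefix.
    ET.register_namespace("", PLS_NS)
    lexicon = ET.Element(
        f"{{{PLS_NS}}}lexicon",
        {"version": "1.0", "alphabet": "ipa", f"{{{XML_NS}}}lang": lang},
    )
    for grapheme, phoneme in entries:
        lexeme = ET.SubElement(lexicon, f"{{{PLS_NS}}}lexeme")
        ET.SubElement(lexeme, f"{{{PLS_NS}}}grapheme").text = grapheme
        ET.SubElement(lexeme, f"{{{PLS_NS}}}phoneme").text = phoneme
    return ET.tostring(lexicon, encoding="unicode")


# Illustrative entries; the transcriptions are approximate examples.
pls_xml = build_lexicon(
    [
        ("quinoa", "ˈkiːnwɑː"),
        ("Siobhan", "ʃɪˈvɔːn"),
    ]
)
print(pls_xml)
```

A lexicon of this kind suits words that have a single correct pronunciation. Heteronyms, where the pronunciation depends on context, generally need inline markup such as SSML at the point of use instead.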

Text-To-Speech synthesis provides a reasonable alternative to audio playback formats and allows readers to speed up playback to high rates. Readers can also synchronize Text-To-Speech playback with refreshable braille displays. However, Text-To-Speech remains an imperfect means of reading publications, especially when compared to the talented narrators who produce audiobooks.