The objective of the Pronunciation Task Force is to develop normative specifications and best practices guidance, collaborating with other W3C groups as appropriate, to provide for proper pronunciation in HTML content when using text to speech (TTS) synthesis. This document defines a standard mechanism that allows content authors to include spoken presentation guidance in HTML content. It also describes two identified approaches and enumerates their advantages and disadvantages.

Introduction

Accurate, consistent pronunciation and presentation of content spoken by text to speech (TTS) synthesis is an essential requirement in education, communication, entertainment, and other domains. From helping to teach spelling and pronunciation in different languages, to reading learning materials or news stories, TTS has become a vital technology for providing access to digital content on the web, through mobile devices, and now via voice-based assistants. Organizations such as educational publishers and assessment vendors are looking for a standards-based solution to enable authoring of spoken presentation guidance in HTML which can then be consumed by assistive technologies and other applications that utilize TTS to render content. Historically, efforts at standardization (e.g. SSML or CSS Speech) have not led to broad adoption of any standard by user agents, authors, or assistive technologies; what has arisen instead is a variety of non-interoperable approaches that meet specific needs for some applications. This explainer document presents the case for improving spoken presentation on the Web and how a standards-based approach can address the requirements.

What is this?

This is a proposal for a mechanism to allow content authors to include spoken presentation guidance in HTML content. Such guidance can be used by assistive technologies (including screen readers and read aloud tools) and voice assistants to control text to speech synthesis. A key requirement is to ensure the spoken presentation content matches the author's intent and user expectations.

Currently, the W3C SSML standard is seen as an important piece of a solution. The challenge is integrating SSML into HTML so that it is easy to author, does not "break" content, and is straightforward for consumption by assistive technologies, voice assistants, and other tools that produce spoken presentation of content.

This proposal has emerged from the work of the Accessible Platform Architecture Pronunciation Task Force and represents a decision point arising from two differing approaches for integrating SSML (or SSML-like characteristics) into HTML. Each of the approaches differs in authoring and consumption models (specifically for assistive technologies).

Why do we care?

Several classes of assistive technology users depend upon spoken rendering of web content by text to speech synthesis (TTS). In contexts such as education, there are specific expectations for accuracy of spoken presentation in terms of pronunciation, emphasis, prosody, pausing, etc.

Correct pronunciation is also important in the context of language learning, where incorrect pronunciation can confuse learners.

In practice, the ecosystem of devices used in classrooms is broad, and each vendor generally provides their own text to speech engines for their platforms. Ensuring consistent spoken presentation across devices is a very real challenge. For many educational assessment vendors, the problem necessitates non-interoperable hacks to tune pronunciation and other presentation features, such as pausing, which can in turn introduce new problems through inconsistent representation of text across speech and braille.

It could be argued that continual advances in machine learning will improve the quality of synthesized speech, reducing the need for this proposal. Waiting for a robust solution that will likely still not fully address our needs is risky, especially when an authorable, declarative approach may be within reach (and wouldn't preclude or conflict with continual improvement in TTS technology).

The current situation

With the growing consumer adoption of voice assistants, user expectations for high-quality spoken presentation are growing. Google and Amazon both encourage application developers to utilize SSML to enhance the user experience on their platforms, yet web content authors do not have the same opportunity to enhance the spoken presentation of their content.

A solution to this need can have the broader benefit of allowing authors to create web content that delivers a better user experience when the content is presented by voice assistants.

Goals

Non-Goals

Approaches considered

A variety of approaches have been identified thus far by the Task Force, but two are considered front runners:

  1. In-line SSML within Web Content
  2. Attribute-based Model of SSML

Both approaches have advantages and disadvantages, and these are briefly summarized below.

In-line SSML

Advantages include the existence of the SSML standard, which is directly consumable by many speech synthesizers, and the precedent for in-lining non-HTML markup such as SVG and MathML. This approach may also be more easily consumed by voice assistants.

A key disadvantage is that in-line SSML appears to be more difficult for assistive technologies to implement, particularly for screen readers.

A simple example of in-line SSML in an HTML fragment is shown below:

According to the 2010 US Census, the population of <speak><say-as interpret-as="digits">90274</say-as></speak> increased to 25209 from 24976 over the past 10 years.

Attribute-based Model of SSML

Advantages include that variants of the attribute model are already in use by educational assessment vendors, that these variants are supported by custom read aloud tools, and that the attribute model appears to be more easily implementable by screen reader vendors. For example, the EPUB3 standard implements the SSML phoneme element as a pair of namespaced attributes, an approach used by publishers in Japan.
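
For illustration, a fragment using the EPUB3 attribute pair might look like the following (the ssml:ph and ssml:alphabet attribute names come from EPUB3, which assumes the SSML namespace is declared on the document root; the alphabet and reading values shown here are illustrative):

<span ssml:alphabet="x-JEITA" ssml:ph="トウキョウ">東京</span>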

Disadvantages largely stem from the use of JSON. JSON may add a level of complexity to authoring, although this could be mitigated by authoring tools. The approach also requires the consumer (screen reader, read aloud tool, voice assistant, etc.) to transform the attribute content from JSON into SSML, and the JSON approach raises possible security concerns. In addition, the EPUB approach would lead to a large number of attributes if all the SSML elements were implemented in that manner.

Furthermore, no other standard uses JSON strings as attribute values in HTML. This may cause problems for implementers, who must parse the JSON values before processing. The browser, which normally attempts to repair malformed HTML, can make no such guarantees about the JSON strings; implementers must decide how to handle malformed JSON.
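
As a minimal sketch (not a normative algorithm), a consumer might handle this along the following lines; the data-ssml attribute and say-as mapping match the example later in this section, while the function itself is an assumption for illustration:

  // Hypothetical consumer-side transformation of a data-ssml attribute
  // into an SSML fragment; only the say-as case from this document is shown.
  function toSsml(el: Element): string {
    const text = el.textContent ?? "";
    const raw = el.getAttribute("data-ssml");
    if (raw === null) return text;
    let parsed: unknown;
    try {
      parsed = JSON.parse(raw);
    } catch {
      // Malformed JSON: the browser makes no guarantees here, so this
      // consumer chooses to fall back to the element's plain text.
      return text;
    }
    if (typeof parsed !== "object" || parsed === null) return text;
    const sayAs = (parsed as Record<string, any>)["say-as"];
    if (sayAs && typeof sayAs["interpret-as"] === "string") {
      return `<say-as interpret-as="${sayAs["interpret-as"]}">${text}</say-as>`;
    }
    return text; // Unrecognized properties: degrade gracefully.
  }

Each additional SSML feature would need a similar mapping, which illustrates the transformation burden described above.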

Finally, the schema for `data-ssml` values has not been set. Competing standards for this format, such as SpeakableSpecification, as well as any issues converting SSML to a proper JSON schema, could cause confusion for implementers and authors. Often such conversions are "...not exactly 1:1 transformation, but very very close".

A simple example of the attribute-based model of SSML is shown below:

According to the 2010 US Census, the population of <span data-ssml='{"say-as" : {"interpret-as":"digits"}}'>90274</span> increased to 25209 from 24976 over the past 10 years.
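
A consumer supporting this model might then generate SSML along these lines (one possible rendering, assuming the surrounding sentence is wrapped in a speak element):

<speak>According to the 2010 US Census, the population of <say-as interpret-as="digits">90274</say-as> increased to 25209 from 24976 over the past 10 years.</speak>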

Open Questions

  1. From the TAG/WHATWG perspective, what disadvantages/challenges have we missed with either approach?
  2. Whichever approach makes sense from the web standards perspective, will/can it be adopted by assistive technologies? Particularly for screen readers, does it fit the accessibility API model?
Acknowledgements