The objective of the Pronunciation Task Force is to develop normative specifications and best practices guidance collaborating with other W3C groups as appropriate, to provide for proper pronunciation in HTML content when using text to speech (TTS) synthesis. This document provides various use cases highlighting the need for standardization of pronunciation markup, to ensure that consistent and accurate representation of the content. The requirements from the user scenarios provide the basis for these technical requirements/specifications.

Use Case aria-ssml

Background and Current Practice

A new aria attribute could be used to include pronunciation content.

Goal

Embed SSML in an HTML document.

Target Audience

Assistive Technology
Browser Extensions
Search Engines

Implementation Options

aria-ssml as embedded JSON

When AT encounters an element with aria-ssml, the AT should enhance the UI by processing the pronunciation content and passing it to the Web Speech API or an external API (e.g., Google's Text to Speech API).

I say <span aria-ssml='{"phoneme":{"ph":"pɪˈkɑːn","alphabet":"ipa"}}'>pecan</span>.
You say <span aria-ssml='{"phoneme":{"ph":"ˈpi.kæn","alphabet":"ipa"}}'>pecan</span>.

Client will convert JSON to SSML and pass the XML string a speech API.

var msg = new SpeechSynthesisUtterance();
msg.text = convertJSONtoSSML(element.getAttribute('aria-ssml'));
speechSynthesis.speak(msg);

aria-ssml referencing XML by template ID

<!-- ssml must appear inside a template to be valid -->
<template id="pecan">
<?xml version="1.0"?>
<speak version="1.1"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                   http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US">
    You say, <phoneme alphabet="ipa" ph="pɪˈkɑːn">pecan</phoneme>.
    I say, <phoneme alphabet="ipa" ph="ˈpi.kæn">pecan</phoneme>.
</speak>
</template>

<p aria-ssml="#pecan">You say, pecan. I say, pecan.</p>

Client will parse XML and serialize it before passing to a speech API:

var msg = new SpeechSynthesisUtterance();
var xml = document.getElementById('pecan').content.firstElementChild;
msg.text = serialize(xml);
speechSynthesis.speak(msg);

aria-ssml referencing an XML string as script tag

<script id="pecan" type="application/ssml+xml">
<speak version="1.1"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                   http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US">
    You say, <phoneme alphabet="ipa" ph="pɪˈkɑːn">pecan</phoneme>.
    I say, <phoneme alphabet="ipa" ph="ˈpi.kæn">pecan</phoneme>.
</speak>
</script>

<p aria-ssml="#pecan">You say, pecan. I say, pecan.</p>

Client will pass the XML string raw to a speech API.

var msg = new SpeechSynthesisUtterance();
msg.text = document.getElementById('pecan').textContent;
speechSynthesis.speak(msg);

aria-ssml referencing an external XML document by URL

<p aria-ssml="http://example.com/pronounce.ssml#pecan">You say, pecan. I say, pecan.</p>

Client will pass the string payload to a speech API.

var msg = new SpeechSynthesisUtterance();
var response = await fetch(el.dataset.ssml)
msg.txt = await response.text();
speechSynthesis.speak(msg);

Existing Work

Problems and Limitations

aria-ssml is not a valid aria-* attribute.
OS/Browsers combinations that do not support the serialized XML usage of the Web Speech API.

Use Case data-ssml

Background and Current Practice

As an existing attribute, data-* could be used, with some conventions, to include pronunciation content.

Goal

Support repeated use within the page context
Support external file references
Reuse existing techniques without expanding specifications

Target Audience

Hearing users

Implementation Options

data-ssml as embedded JSON

When an element with data-ssml is encountered by an SSML-aware AT, the AT should enhance the user interface by processing the referenced SSML content and passing it to the Web Speech API or an external API (e.g., Google's Text to Speech API).

<h2>The Pronunciation of Pecan</h2>
<p><speak>
I say <span data-ssml='{"phoneme":{"ph":"pɪˈkɑːn","alphabet":"ipa"}}'>pecan</span>.
You say <span data-ssml='{"phoneme":{"ph":"ˈpi.kæn","alphabet":"ipa"}}'>pecan</span>.

Client will convert JSON to SSML and pass the XML string a speech API.

var msg = new SpeechSynthesisUtterance();
msg.text = convertJSONtoSSML(element.dataset.ssml);
speechSynthesis.speak(msg);

data-ssml referencing XML by template ID

<!-- ssml must appear inside a template to be valid -->
<template id="pecan">
<?xml version="1.0"?>
<speak version="1.1"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                   http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US">
    You say, <phoneme alphabet="ipa" ph="pɪˈkɑːn">pecan</phoneme>.
    I say, <phoneme alphabet="ipa" ph="ˈpi.kæn">pecan</phoneme>.
</speak>
</template>

<p data-ssml="#pecan">You say, pecan. I say, pecan.</p>

Client will parse XML and serialize it before passing to a speech API:

var msg = new SpeechSynthesisUtterance();
var xml = document.getElementById('pecan').content.firstElementChild;
msg.text = serialize(xml);
speechSynthesis.speak(msg);

data-ssml referencing an XML string as script tag

<script id="pecan" type="application/ssml+xml">
<speak version="1.1"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                   http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US">
    You say, <phoneme alphabet="ipa" ph="pɪˈkɑːn">pecan</phoneme>.
    I say, <phoneme alphabet="ipa" ph="ˈpi.kæn">pecan</phoneme>.
</speak>
</script>

<p data-ssml="#pecan">You say, pecan. I say, pecan.</p>

Client will pass the XML string raw to a speech API.

var msg = new SpeechSynthesisUtterance();
msg.text = document.getElementById('pecan').textContent;
speechSynthesis.speak(msg);

data-ssml referencing an external XML document by URL

<p data-ssml="http://example.com/pronounce.ssml#pecan">You say, pecan. I say, pecan.</p>

Client will pass the string payload to a speech API.

var msg = new SpeechSynthesisUtterance();
var response = await fetch(el.dataset.ssml)
msg.txt = await response.text();
speechSynthesis.speak(msg);

Existing Work

Problems and Limitations

Does not assume or suggest visual pronunciation help for deaf or hard of hearing
Use of data-* requires input from AT vendors
XML data is not indexed by search engines

Use Case HTML5

Background and Current Practice

HTML5 includes the XML namespaces for MathML and SVG. So, using either's elements in an HTML5 document is valid. Because SSML's implementation is non-visual in nature, browser implementation could be slow or non-existent without affecting how authors use SSML in HTML. Expansion of HTML5 to include SSML namespace would allow valid use of SSML in the HTML5 document. Browsers would treat the element like any other unknown element, as HTMLUnknownElement.

Goal

Support valid use of SSML in HTML5 documents
Allow visual pronunciation support

Target Audience

SSML-aware technologies and browser extensions
Search indexers

Implementation Options

SSML

<h2>The Pronunciation of Pecan</h2>
  <p><speak>
  You say, <phoneme alphabet="ipa" ph="pɪˈkɑːn">pecan</phoneme>.
  I say, <phoneme alphabet="ipa" ph="ˈpi.kæn">pecan</phoneme>.
</speak></p>

Existing Work

Problems and Limitations

SSML is not valid HTML5

Use Case Custom Element

Background and Current Practice

Embed valid SSML in HTML using custom elements registered as ssml-* where * is the actual SSML tag name (except for p which expects the same treatment as an HTML p in HTML layout).

Goal

Support use of SSML in HTML documents.

Target Audience

SSML-aware technologies and browser extensions
Search indexers

Implementation Options

ssml-speak: see demo

Only the <ssml-speak> component requires registration. The component code lifts the SSML by getting the innerHTML and removing the ssml- prefix from the interior tags and passing it to the web speech API. The <p> tag from SSML is not given the prefix because we still want to start a semantic paragraph within the content. The other tags used in the example have no semantic meaning. Tags like <em> in HTML could be converted to <emphasis> in SSML. In that case, CSS styles will come from the browser's default styles or the page author.

<ssml-speak>
  Here are <ssml-say-as interpret-as="characters">SSML</ssml-say-as> samples.
  I can pause<ssml-break time="3s"></ssml-break>.
  I can speak in cardinals.
  Your number is <ssml-say-as interpret-as="cardinal">10</ssml-say-as>.
  Or I can speak in ordinals.
  You are <ssml-say-as interpret-as="ordinal">10</ssml-say-as> in line.
  Or I can even speak in digits.
  The digits for ten are <ssml-say-as interpret-as="characters">10</ssml-say-as>.
  I can also substitute phrases, like the <ssml-sub alias="World Wide Web Consortium">W3C</ssml-sub>.
  Finally, I can speak a paragraph with two sentences.
  <p>
    <ssml-s>You say, <ssml-phoneme alphabet="ipa" ph="pɪˈkɑːn">pecan</ssml-phoneme>.</ssml-s>
    <ssml-s>I say, <ssml-phoneme alphabet="ipa" ph="ˈpi.kæn">pecan</ssml-phoneme>.</ssml-s>
  </p>
</ssml-speak>
<template id="ssml-controls">
  <style>
    [role="switch"][aria-checked="true"] :first-child,
    [role="switch"][aria-checked="false"] :last-child {
      background: #000;
      color: #fff;
    }
  </style>
  <slot></slot>
  <p>
    <span id="play">Speak</span>
    <button role="switch" aria-checked="false" aria-labelledby="play">
      <span>on</span>
      <span>off</span>
    </button>
  </p>
</template>

class SSMLSpeak extends HTMLElement {
  constructor() {
    super();
    const template = document.getElementById('ssml-controls');
    const templateContent = template.content;
    this.attachShadow({mode: 'open'})
      .appendChild(templateContent.cloneNode(true));
  }
  connectedCallback() {
    const button = this.shadowRoot.querySelector('[role="switch"][aria-labelledby="play"]')
    const ssml = this.innerHTML.replace(/ssml-/gm, '')
    const msg = new SpeechSynthesisUtterance();
    msg.lang = document.documentElement.lang;
    msg.text = `<speak version="1.1"
      xmlns="http://www.w3.org/2001/10/synthesis"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
        http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
      xml:lang="${msg.lang}">
    ${ssml}
    </speak>`;
    msg.voice = speechSynthesis.getVoices().find(voice => voice.lang.startsWith(msg.lang));
    msg.onstart = () => button.setAttribute('aria-checked', 'true');
    msg.onend = () => button.setAttribute('aria-checked', 'false');
    button.addEventListener('click', () => speechSynthesis[speechSynthesis.speaking ? 'cancel' : 'speak'](msg))
  }
}

customElements.define('ssml-speak', SSMLSpeak);

Existing Work

Problems and Limitations

OS/Browsers combinations that do not support the serialized XML usage of the Web Speech API.
Browsers may need to map SSML tags with CSS styles for default user agent styles.
Without an extension or AT, only user interaction can start the Web Speech API.
Authors or parsing may need to remove HTML content with unintended SSML semantics before serialization.

Use Case JSON-LD

Background and Current Practice

JSON-LD provides an established standard for embedding data in HTML. Unlike other microdata approaches, JSON-LD helps to reuse standardized annotations through external references.

Goal

Support use of SSML in HTML documents.

Target Audience

SSML-aware technologies and browser extensions
Search indexers

Implementation Options

JSON-LD

<script type="application/ld+json">
{
  "@context": "http://schema.org/",
  "@id": "/pronunciation#WKRP",
  "@type": "RadioStation",
  "name": ["WKRP",
    "@type": "PronounceableText",
    "textValue": "WKRP",
    "speechToTextMarkup": "SSML",
    "phoneticText": "<speak><say-as interpret-as=\"characters\">WKRP</say-as>"
  ]
}
</script>
<p>
  Do you listen to <span itemscope
    itemtype="http://schema.org/PronounceableText"
    itemid="/pronunciation#WKRP">WKRP</span>?
</p>

Existing Work

Problems and Limitations

not an established "type"/published schema

Use Case Ruby

Background and Current Practice

<Ruby> annotations are short runs of text presented alongside base text, primarily used in East Asian typography as a guide for pronunciation or to include other annotations.

ruby guides pronunciation visually. This seems like a natural fit for text-to-speech.

Goal

Support use of SSML in HTML documents.
Offer visual pronunciation support.

Target Audience

AT and browser extensions
Search indexers

Implementation Options

ruby with microdata

Microdata can augment the ruby element and its descendants.

<p>
  You say,
  <span itemscope="" itemtype="http://example.org/Pronunciation">
    <ruby itemprop="phoneme" content="pecan">
      pecan
      <rt itemprop="ph">pɪˈkɑːn</rt>
      <meta itemprop="alphabet" content="ipa">
    </ruby>.
  </span>
  I say,
  <span itemscope="" itemtype="http://example.org/Pronunciation">
    <ruby itemprop="phoneme" content="pecan">
      pe
      <rt itemprop="ph">ˈpi</rt>
      can
      <rt itemprop="ph">kæn</rt>
      <meta itemprop="alphabet" content="ipa">
    </ruby>.
  </span>
</p>

Existing Work

Problems and Limitations

AT may process annotations as content
AT "double reading" words instead of choosing either the content or the annotation
Only offers for a few SSML expressions
Difficult to reuse by reference

User Scenarios

The purpose of developing user scenarios is to facilitate discussion and further requirements definition for pronunciation standards developed within the PTF prior to review of the APA. There are numerous interpretations of what form user scenarios adopt. Within the user experience research (UXR) body of practice, a user scenario is a written narrative related to the use of a service from the perspective of a user or user group. Importantly, the context of use is emphasized as is the desired outcome of use. There are potentially thousands of user scenarios for a technology such as TTS, however, the focus for the PTF is on the core scenarios that relate to the kinds of users who will engage with TTS.

User scenarios, like Personas, represent a composite of real-world experiences. In the case of the PTF, the scenarios were derived from interviews of people who were end-consumers of TTS, as well as submitted narratives and industry examples from practitioners. There are several formats of scenarios. Several are general goal or task-oriented scenarios. Others elaborate on richer context, for example, educational assessment.

The following user scenarios are organized on the three perspectives of TTS use derived from analysis of the qualitative data collected from the discovery work:

End-Consumers of TTS: Encompasses those with a visual disability or other need to have TTS operational when using assistive technologies (ATs).
Digital Content Managers: Addresses activities related to those responsible for producing content that needs to be accessible to ATs and W3C-WAI Guidelines.
Software Engineers: Includes developers and architects required to put TTS into an application or service.

Need to add the other categories, or remove the list above and just rely on the ToC.

Augmentative and Alternative Communication (AAC)

Names

As an AAC User I want my name to be pronounced correctly and I want to pronounce others names correctly using my AAC device.

Storing others' names

As an AAC user, I want to be able to input and store the correct pronunciation of others’ names, so I can address people respectfully and build meaningful relationships.

For instance, when meeting someone named “Nguyễn,” the AAC user wants to ensure their device pronounces the name correctly, using IPA or SSML markup, to foster respectful communication and avoid embarrassment.

Pronouncing my name

As an AAC user, I want my name to be pronounced correctly by my device, so that I can confidently introduce myself in social, educational, and professional settings.

For example, a user named “Siobhán” may find that default TTS engines mispronounce her name. She wants to input a phonetic or SSML-based pronunciation so that her name is spoken accurately every time.

End-Consumer of TTS

Ultimately, the quality and variation of TTS rendering by assistive technologies vary widely according to a user's context. The following user scenarios reinforce the necessity for accurate pronunciation from the perspective of those who consume digitally generated content.

Traveller

As a traveler who uses assistive technology (AT) with TTS to help navigate through websites, I need to hear arrival and destination codes pronounced accurately so I can select the desired travel itinerary. For example, a user with a visual impairment attempts to book a flight to Ottawa, Canada and so goes to a travel website. The user already knows the airport code and enters "YOW". The site produces the result in a drop-down list as "Ottawa, CA" but the AT does not pronounce the text accurately to help the user make the correct association between their data entry and the list item.

Test Taker

As a test taker (tester) with a visual impairment who may use assistive technology to access the test content with speech software, screen reader or refreshable braille device, I want the content to be presented as intended, with accurate pronunciation and articulation, so that my assessment accurately reflects my knowledge of the content.

Student

As a student/learner with auditory and cognitive processing issues, it is difficult to distinguish sounds, inflections, and variations in pronunciation as rendered through synthetic voice, such as text-to-speech or screen reader technologies. Consistent and accurate pronunciation whether human-provided, external, or embedded is needed to support working executive processing, auditory processing and memory that facilitates comprehension in literacy and numeracy for learning and for assessments.

English learner

As an English Learner (EL) or a visually impaired early learner using speech synthesis for reading comprehension that includes decoding words from letters as part of the learning construct (intent of measurement), pronunciation accuracy is vital to successful comprehension, as it allows the learner to distinguish sounds at the sentence, word, syllable, and phoneme level.

Digital Content Management for TTS

The advent of graphical user interfaces (GUIs) for the management and editing of text content has given rise to content creators not requiring technical expertise beyond the ability to operate a text editing application such as Microsoft Word. The following scenario summarizes the general use, accompanied by a hypothetical application.

As a content creator, I want to create content that can readily be delivered through assistive technology, can convey the correct meaning, and ensure that screen readers render the right pronunciation based on the surrounding context.
As a content producer for a global commercial site that is inclusive, I need to be able to provide accessible culture-specific content for different geographic regions.

Educational Assessment

In the educational assessment field, providing accurate and concise pronunciation for students with auditory accommodations, such as text-to-speech (TTS) or students with screen readers, is vital for ensuring content validity and alignment with the intended construct, which objectively measures a test takers knowledge and skills. For test administrators/educators, pronunciations must be consistent across instruction and assessment in order to avoid test bias or impact effects for students. Some additional requirements for the test administrators, include, but are not limited to, such scenarios:

Test Administrator—Read-aloud intonation, expression

As a test administrator, I want to ensure that students with the read-aloud accommodation, who are using assistive technology or speech synthesis as an alternative to a human reader, have the same speech quality (e.g., intonation, expression, pronunciation, and pace, etc.) as a spoken language.

This may be simlar to the other Test Administrator case below?

Math educator

As a math educator, I want to ensure that speech accuracy with mathematical expressions, including numbers, fractions, and operations have accurate pronunciation for those who rely on TTS. Some mathematical expressions require special pronunciations to ensure accurate interpretation while maintaining test validity and construct. Specific examples include:

Formulas

Mathematical formulas written in simple text with special formatting should convey the correct meaning of the expression to identify changes from normal text to super- or to sub-script text. For example, without the proper formatting, the equation:a³-b³=(a-b)(a²+ab+b²) may incorrectly render through some technologies and applications as a3-b3=(a-b)(a2+ab+b2).

Distinctions in writing

Distinctions made in writing are often not made explicit in speech; For example, “fx” may be interpreted as fx, f(x), fx, F X, F X. The distinction depends on the context; requiring the author to provide consistent and accurate semantic markup.

Greek letters

For math equations with Greek letters, it is important that the speech synthesizer be able to distinguish the phonetic differences between them, whether in the natural language or phonetic equivalents. For example, ε (epsilon) υ (upsilon) φ (phi) χ (chi) ξ(xi).

Test Administrator—consistent pronunciation

As a test administrator/educator, pronunciations must be consistent across instruction and assessment, in order to avoid test bias and pronunciation effects on performance for students with disabilities (SWD) in comparison to students without disabilities (SWOD). Examples include:

Spelling out rhyming words

If a test question is measuring rhyming of words or sounds of words, the speech synthesis should not read aloud the words, but rather spell out the words in the answer options.

Questions measuring spelling

If a test question is measuring spelling and the student needs to consider spelling correctness/incorrectness, the speech synthesis should not read aloud the misspelt words, especially for words, such as:

Heteronyms/homographs: same spelling, different pronunciation, different meanings, such as lead (to go in front of) or lead (a metal); wind (to follow a course that is not straight) or wind (a gust of air); bass (low, deep sound) or bass (a type of fish), etc.
Homophone: words that sound alike, such as, to/two/too; there/their/they're; pray/prey; etc.
Homonyms: multiple meaning words, such as scale (measure) or scale (climb, mount); fair (reasonable) or fair (carnival); suit (outfit) or suit (harmonize); etc.

Academic and Linguistic Practitioners

The extension of content management in TTS is one as a means of encoding and preserving spoken text for academic analyses; irrespective of discipline, subject domain, or research methodology.

Linguist

A. As a linguist, I want to represent all the pronunciation variations of a given word in any language, for future analyses.

Speech Language Pathologist, Speech Therapist

As a speech language pathologist or speech therapists, I want TTS functionality to include components of speech and language that include dialectal and individual differences in pronunciation; identify differences in intonation, syntax, and semantics, and; allow for enhanced comprehension, language processing and support phonological awareness.

Introduction

Use Case aria-ssml

Background and Current Practice

Goal

Target Audience

Implementation Options

Existing Work

Problems and Limitations

Use Case data-ssml

Background and Current Practice

Goal

Target Audience

Implementation Options

Existing Work

Problems and Limitations

Use Case HTML5

Background and Current Practice

Goal

Target Audience

Implementation Options

Existing Work

Problems and Limitations

Use Case Custom Element

Background and Current Practice

Goal

Target Audience

Implementation Options

Existing Work

Problems and Limitations

Use Case JSON-LD

Background and Current Practice

Goal

Target Audience

Implementation Options

Existing Work

Problems and Limitations

Use Case Ruby

Background and Current Practice

Goal

Target Audience

Implementation Options

Existing Work

Problems and Limitations

User Scenarios

Augmentative and Alternative Communication (AAC)

Names

Storing others' names

Pronouncing my name

End-Consumer of TTS

Traveller

Test Taker

Student

English learner

Digital Content Management for TTS

Educational Assessment

Test Administrator—Read-aloud intonation, expression

Math educator

Formulas

Distinctions in writing

Greek letters

Test Administrator—consistent pronunciation

Spelling out rhyming words

Questions measuring spelling

Academic and Linguistic Practitioners

Linguist

Speech Language Pathologist, Speech Therapist

Software Application Development

Product owner

Client-side User Interface Developer