The objective of the Pronunciation Task Force is to develop normative specifications and best-practices guidance, collaborating with other W3C groups as appropriate, to provide for proper pronunciation in HTML content when using text-to-speech (TTS) synthesis. This document provides various use cases highlighting the need for standardization of pronunciation markup, to ensure consistent and accurate representation of the content. The requirements from the user scenarios provide the basis for these technical requirements/specifications.
This document provides use cases which describe specific implementation approaches for introducing pronunciation and spoken presentation authoring markup into HTML5. They are based on the two primary approaches that have evolved among the Pronunciation Task Force members. Other approaches may appear in subsequent working drafts.
Successful use cases will be those that provide ease of authoring and consumption by assistive technologies and user agents that utilize synthetic speech for spoken presentation of web content. The most challenging aspect of consumption may be alignment of the markup approach with the standard mechanisms by which assistive technologies, specifically screen readers, obtain content via platform accessibility APIs.
A new aria attribute could be used to include pronunciation content.
Embed SSML in an HTML document.
aria-ssml as embedded JSON
When AT encounters an element with aria-ssml, the AT should enhance the UI by processing the pronunciation content and passing it to the Web Speech API or an external API (e.g., Google's Text to Speech API).
I say <span aria-ssml='{"phoneme":{"ph":"pɪˈkɑːn","alphabet":"ipa"}}'>pecan</span>. You say <span aria-ssml='{"phoneme":{"ph":"ˈpi.kæn","alphabet":"ipa"}}'>pecan</span>.
Client will convert JSON to SSML and pass the XML string to a speech API.
var msg = new SpeechSynthesisUtterance(); msg.text = convertJSONtoSSML(element.getAttribute('aria-ssml')); speechSynthesis.speak(msg);
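The snippet above names convertJSONtoSSML without defining it. The following is a minimal sketch of what such a helper might look like, not a specified algorithm. It assumes the JSON has a single top-level key naming the SSML element, whose value is a flat object of attributes. The second and third parameters (the element's text content and a language tag) and the escapeXml helper are additions made for illustration.

```javascript
// Hypothetical helper: escape characters that are unsafe in XML text and attributes.
function escapeXml(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Sketch of convertJSONtoSSML. Assumption: the JSON has one top-level key
// (the SSML element name) mapping to a flat object of attribute values.
function convertJSONtoSSML(json, text, lang = 'en-US') {
  const parsed = JSON.parse(json);
  const [tagName] = Object.keys(parsed);
  if (!tagName || !/^[a-z][a-z-]*$/.test(tagName)) {
    throw new Error('Expected one top-level key naming an SSML element');
  }
  const attributes = Object.entries(parsed[tagName])
    .map(([name, value]) => ` ${name}="${escapeXml(value)}"`)
    .join('');
  return (
    `<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="${escapeXml(lang)}">` +
    `<${tagName}${attributes}>${escapeXml(text)}</${tagName}>` +
    '</speak>'
  );
}

// Usage with the first span from the example above.
const ssml = convertJSONtoSSML('{"phoneme":{"ph":"pɪˈkɑːn","alphabet":"ipa"}}', 'pecan');
console.log(ssml);
```

In a browser, the text argument would typically come from element.textContent and the language from the nearest lang attribute.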
aria-ssml referencing XML by template ID
<!-- ssml must appear inside a template to be valid --> <template id="pecan"> <?xml version="1.0"?> <speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/2001/10/synthesis http://www.w3.org/TR/speech-synthesis11/synthesis.xsd" xml:lang="en-US"> You say, <phoneme alphabet="ipa" ph="pɪˈkɑːn">pecan</phoneme>. I say, <phoneme alphabet="ipa" ph="ˈpi.kæn">pecan</phoneme>. </speak> </template> <p aria-ssml="#pecan">You say, pecan. I say, pecan.</p>
Client will parse XML and serialize it before passing to a speech API:
var msg = new SpeechSynthesisUtterance(); var xml = document.getElementById('pecan').content.firstElementChild; msg.text = new XMLSerializer().serializeToString(xml); speechSynthesis.speak(msg);
aria-ssml referencing an XML string as script tag
<script id="pecan" type="application/ssml+xml"> <speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/2001/10/synthesis http://www.w3.org/TR/speech-synthesis11/synthesis.xsd" xml:lang="en-US"> You say, <phoneme alphabet="ipa" ph="pɪˈkɑːn">pecan</phoneme>. I say, <phoneme alphabet="ipa" ph="ˈpi.kæn">pecan</phoneme>. </speak> </script> <p aria-ssml="#pecan">You say, pecan. I say, pecan.</p>
Client will pass the XML string raw to a speech API.
var msg = new SpeechSynthesisUtterance(); msg.text = document.getElementById('pecan').textContent; speechSynthesis.speak(msg);
aria-ssml referencing an external XML document by URL
<p aria-ssml="http://example.com/pronounce.ssml#pecan">You say, pecan. I say, pecan.</p>
Client will pass the string payload to a speech API.
var msg = new SpeechSynthesisUtterance(); var response = await fetch(el.getAttribute('aria-ssml')); msg.text = await response.text(); speechSynthesis.speak(msg);
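The examples above show several value forms for the attribute: embedded JSON, a same-document ID reference, and an external URL. A client would need to tell these apart before choosing a handling path. The sketch below is one possible way to do that, assuming the forms can be distinguished by syntax alone. The function name classifySsmlReference and the default base URL are hypothetical. A browser client would more likely pass document.baseURI.

```javascript
// Sketch: classify an aria-ssml (or data-ssml) value by its syntax.
// Assumptions: JSON values start with "{", local references start with "#",
// and anything else is a URL resolved against baseUrl.
function classifySsmlReference(value, baseUrl = 'http://example.com/') {
  const trimmed = value.trim();
  if (trimmed.startsWith('{')) {
    return { kind: 'json', data: JSON.parse(trimmed) };
  }
  if (trimmed.startsWith('#')) {
    // Points at a <template> or <script> element in the same document.
    return { kind: 'local', id: trimmed.slice(1) };
  }
  const url = new URL(trimmed, baseUrl);
  const fragment = url.hash.slice(1);
  url.hash = ''; // fetch the document itself; keep the fragment separately
  return { kind: 'external', url: url.href, fragment };
}

console.log(classifySsmlReference('#pecan'));
console.log(classifySsmlReference('http://example.com/pronounce.ssml#pecan'));
```

A local reference does not say whether the target is a template or a script element, so the client would still need to inspect the referenced element before extracting the SSML.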
As an existing attribute, data-* could be used, with some conventions, to include pronunciation content.
Hearing users
data-ssml as embedded JSON
When an element with data-ssml is encountered by an SSML-aware AT, the AT should enhance the user interface by processing the referenced SSML content and passing it to the Web Speech API or an external API (e.g., Google's Text to Speech API).
<h2>The Pronunciation of Pecan</h2> <p> I say <span data-ssml='{"phoneme":{"ph":"pɪˈkɑːn","alphabet":"ipa"}}'>pecan</span>. You say <span data-ssml='{"phoneme":{"ph":"ˈpi.kæn","alphabet":"ipa"}}'>pecan</span>. </p>
Client will convert JSON to SSML and pass the XML string to a speech API.
var msg = new SpeechSynthesisUtterance(); msg.text = convertJSONtoSSML(element.dataset.ssml); speechSynthesis.speak(msg);
data-ssml referencing XML by template ID
<!-- ssml must appear inside a template to be valid --> <template id="pecan"> <?xml version="1.0"?> <speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/2001/10/synthesis http://www.w3.org/TR/speech-synthesis11/synthesis.xsd" xml:lang="en-US"> You say, <phoneme alphabet="ipa" ph="pɪˈkɑːn">pecan</phoneme>. I say, <phoneme alphabet="ipa" ph="ˈpi.kæn">pecan</phoneme>. </speak> </template> <p data-ssml="#pecan">You say, pecan. I say, pecan.</p>
Client will parse XML and serialize it before passing to a speech API:
var msg = new SpeechSynthesisUtterance(); var xml = document.getElementById('pecan').content.firstElementChild; msg.text = new XMLSerializer().serializeToString(xml); speechSynthesis.speak(msg);
data-ssml referencing an XML string as script tag
<script id="pecan" type="application/ssml+xml"> <speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/2001/10/synthesis http://www.w3.org/TR/speech-synthesis11/synthesis.xsd" xml:lang="en-US"> You say, <phoneme alphabet="ipa" ph="pɪˈkɑːn">pecan</phoneme>. I say, <phoneme alphabet="ipa" ph="ˈpi.kæn">pecan</phoneme>. </speak> </script> <p data-ssml="#pecan">You say, pecan. I say, pecan.</p>
Client will pass the XML string raw to a speech API.
var msg = new SpeechSynthesisUtterance(); msg.text = document.getElementById('pecan').textContent; speechSynthesis.speak(msg);
data-ssml referencing an external XML document by URL
<p data-ssml="http://example.com/pronounce.ssml#pecan">You say, pecan. I say, pecan.</p>
Client will pass the string payload to a speech API.
var msg = new SpeechSynthesisUtterance(); var response = await fetch(el.dataset.ssml); msg.text = await response.text(); speechSynthesis.speak(msg);
HTML5 includes the XML namespaces for MathML and SVG. So, using either's elements in an HTML5 document is valid. Because SSML's implementation is non-visual in nature, browser implementation could be slow or non-existent without affecting how authors use SSML in HTML. Expansion of HTML5 to include SSML namespace would allow valid use of SSML in the HTML5 document. Browsers would treat the element like any other unknown element, as HTMLUnknownElement.
SSML
When an SSML element is encountered by an SSML-aware AT, the AT should enhance the user interface by processing the SSML content and passing it to the Web Speech API or an external API (e.g., Google's Text to Speech API).
<h2>The Pronunciation of Pecan</h2> <p><speak> You say, <phoneme alphabet="ipa" ph="pɪˈkɑːn">pecan</phoneme>. I say, <phoneme alphabet="ipa" ph="ˈpi.kæn">pecan</phoneme>. </speak></p>
SSML is not valid HTML5
Embed valid SSML in HTML using custom elements registered as ssml-* where * is the actual SSML tag name (except for p which expects the same treatment as an HTML p in HTML layout).
Support use of SSML in HTML documents.
ssml-speak: see demo
Only the <ssml-speak> component requires registration. The component code lifts the SSML by getting the innerHTML and removing the ssml- prefix from the interior tags and passing it to the web speech API. The <p> tag from SSML is not given the prefix because we still want to start a semantic paragraph within the content. The other tags used in the example have no semantic meaning. Tags like <em> in HTML could be converted to <emphasis> in SSML. In that case, CSS styles will come from the browser's default styles or the page author.
<ssml-speak> Here are <ssml-say-as interpret-as="characters">SSML</ssml-say-as> samples. I can pause<ssml-break time="3s"></ssml-break>. I can speak in cardinals. Your number is <ssml-say-as interpret-as="cardinal">10</ssml-say-as>. Or I can speak in ordinals. You are <ssml-say-as interpret-as="ordinal">10</ssml-say-as> in line. Or I can even speak in digits. The digits for ten are <ssml-say-as interpret-as="characters">10</ssml-say-as>. I can also substitute phrases, like the <ssml-sub alias="World Wide Web Consortium">W3C</ssml-sub>. Finally, I can speak a paragraph with two sentences. <p> <ssml-s>You say, <ssml-phoneme alphabet="ipa" ph="pɪˈkɑːn">pecan</ssml-phoneme>.</ssml-s> <ssml-s>I say, <ssml-phoneme alphabet="ipa" ph="ˈpi.kæn">pecan</ssml-phoneme>.</ssml-s> </p> </ssml-speak> <template id="ssml-controls"> <style> [role="switch"][aria-checked="true"] :first-child, [role="switch"][aria-checked="false"] :last-child { background: #000; color: #fff; } </style> <slot></slot> <p> <span id="play">Speak</span> <button role="switch" aria-checked="false" aria-labelledby="play"> <span>on</span> <span>off</span> </button> </p> </template>
class SSMLSpeak extends HTMLElement { constructor() { super(); const template = document.getElementById('ssml-controls'); const templateContent = template.content; this.attachShadow({mode: 'open'}) .appendChild(templateContent.cloneNode(true)); } connectedCallback() { const button = this.shadowRoot.querySelector('[role="switch"][aria-labelledby="play"]') const ssml = this.innerHTML.replace(/ssml-/gm, '') const msg = new SpeechSynthesisUtterance(); msg.lang = document.documentElement.lang; msg.text = `<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/2001/10/synthesis http://www.w3.org/TR/speech-synthesis11/synthesis.xsd" xml:lang="${msg.lang}"> ${ssml} </speak>`; msg.voice = speechSynthesis.getVoices().find(voice => voice.lang.startsWith(msg.lang)); msg.onstart = () => button.setAttribute('aria-checked', 'true'); msg.onend = () => button.setAttribute('aria-checked', 'false'); button.addEventListener('click', () => speechSynthesis[speechSynthesis.speaking ? 'cancel' : 'speak'](msg)) } } customElements.define('ssml-speak', SSMLSpeak);
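The component above removes every occurrence of "ssml-" with a global replace, which would also alter that string if it appeared in spoken text. The sketch below is one possible refinement that rewrites only tag names, and it also applies the em to emphasis mapping mentioned in the description. The function name liftSsml is hypothetical. The regular expressions assume well-formed markup and do not guard against tag-like text inside attribute values.

```javascript
// Sketch: lift SSML out of custom-element markup by renaming tags only.
function liftSsml(markup) {
  return markup
    // <ssml-say-as ...> and </ssml-say-as> become <say-as ...> and </say-as>.
    .replace(/<(\/?)ssml-([a-z][a-z-]*)/gi, '<$1$2')
    // Assumed mapping suggested by the text: HTML <em> becomes SSML <emphasis>.
    // The lookahead keeps tags such as <embed> untouched.
    .replace(/<(\/?)em(?=[\s>])/gi, '<$1emphasis');
}

const lifted = liftSsml(
  'The <ssml-sub alias="World Wide Web Consortium">W3C</ssml-sub> uses the ssml- prefix. ' +
  'I can <em>really</em> pause<ssml-break time="3s"></ssml-break>.'
);
console.log(lifted);
```

The p element passes through unchanged, which matches the stated intent of leaving it unprefixed.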
JSON-LD provides an established standard for embedding data in HTML. Unlike other microdata approaches, JSON-LD helps to reuse standardized annotations through external references.
Support use of SSML in HTML documents.
JSON-LD
<script type="application/ld+json"> { "@context": "http://schema.org/", "@id": "/pronunciation#WKRP", "@type": "RadioStation", "name": ["WKRP", { "@type": "PronounceableText", "textValue": "WKRP", "speechToTextMarkup": "SSML", "phoneticText": "<speak><say-as interpret-as=\"characters\">WKRP</say-as></speak>" }] } </script> <p> Do you listen to <span itemscope itemtype="http://schema.org/PronounceableText" itemid="/pronunciation#WKRP">WKRP</span>? </p>
not an established "type"/published schema
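For the JSON-LD approach, a client would need to match the itemid on the inline element against the @id of a JSON-LD node and pull out the SSML string. The sketch below illustrates one way this lookup might work, assuming the node's name property holds a PronounceableText object alongside the plain string. The function name findPhoneticText and the inline data object are illustrative only. In a page, the object would come from parsing the script element's text with JSON.parse.

```javascript
// Illustrative data, shaped like the example above with the
// PronounceableText entry written as an object inside the name array.
const jsonLd = {
  '@context': 'http://schema.org/',
  '@id': '/pronunciation#WKRP',
  '@type': 'RadioStation',
  name: [
    'WKRP',
    {
      '@type': 'PronounceableText',
      textValue: 'WKRP',
      speechToTextMarkup: 'SSML',
      phoneticText: '<speak><say-as interpret-as="characters">WKRP</say-as></speak>',
    },
  ],
};

// Sketch: return the SSML string for the node whose @id matches, or null.
function findPhoneticText(node, id) {
  if (node['@id'] !== id) return null;
  const names = Array.isArray(node.name) ? node.name : [node.name];
  const match = names.find(
    (entry) =>
      entry !== null &&
      typeof entry === 'object' &&
      entry['@type'] === 'PronounceableText' &&
      entry.speechToTextMarkup === 'SSML'
  );
  return match ? match.phoneticText : null;
}

console.log(findPhoneticText(jsonLd, '/pronunciation#WKRP'));
```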
Ruby annotations are short runs of text presented alongside base text, primarily used in East Asian typography as a guide for pronunciation or to include other annotations.
ruby guides pronunciation visually. This seems like a natural fit for text-to-speech.
ruby with microdata
Microdata can augment the ruby element and its descendants.
<p> You say, <span itemscope="" itemtype="http://example.org/Pronunciation"> <ruby itemprop="phoneme" content="pecan"> pecan <rt itemprop="ph">pɪˈkɑːn</rt> <meta itemprop="alphabet" content="ipa"> </ruby>. </span> I say, <span itemscope="" itemtype="http://example.org/Pronunciation"> <ruby itemprop="phoneme" content="pecan"> pe <rt itemprop="ph">ˈpi</rt> can <rt itemprop="ph">kæn</rt> <meta itemprop="alphabet" content="ipa"> </ruby>. </span> </p>
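In the second ruby example, the base text and its pronunciation are split across two rt annotations. A client that produces SSML would need to recombine them into one phoneme element. The following sketch shows one possible way, assuming the client has already extracted the base/annotation pairs and the alphabet from the microdata. The function name rubyToPhoneme, the segment shape, and the choice to join syllables without a separator are all assumptions.

```javascript
// Hypothetical helper: escape characters that are unsafe in XML.
function escapeXml(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Sketch: merge ruby base/annotation pairs into a single SSML phoneme element.
// Each segment is { base, ph }, where ph is the text of the matching rt element.
function rubyToPhoneme(segments, alphabet, separator = '') {
  const base = segments.map((segment) => segment.base).join('');
  const ph = segments.map((segment) => segment.ph).join(separator);
  return `<phoneme alphabet="${escapeXml(alphabet)}" ph="${escapeXml(ph)}">${escapeXml(base)}</phoneme>`;
}

const segments = [
  { base: 'pe', ph: 'ˈpi' },
  { base: 'can', ph: 'kæn' },
];
console.log(rubyToPhoneme(segments, 'ipa'));
console.log(rubyToPhoneme(segments, 'ipa', '.')); // with an IPA syllable break
```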