This document describes user requirements for text to speech of electronic documents containing ruby.

Purpose

This document addresses concerns related to the text-to-speech functionality in HTML documents and EPUB publications that contain ruby annotations. While typographical aspects of ruby are covered by [[?JLREQ]], text-to-speech issues in this context have not received widespread recognition. The primary focus of this document is to outline user requirements.

In Section 2, we enumerate the various roles of ruby annotations in relation to their associated ruby bases. Section 3 provides an overview of potential options for using ruby bases and/or ruby annotations in text-to-speech, along with a discussion of the advantages and disadvantages of each option. Section 4 addresses markup issues related to the text-to-speech of ruby annotations. Section 5 introduces alternative mechanisms, such as SSML and PLS. Section 6 delves into the use of ruby annotations in translating HTML or EPUB content to braille.

Roles of ruby annotations

Furigana, background

The primary purpose of ruby annotations is to indicate how to pronounce CJK ideographic characters, a practice known as Furigana (see also JLReq terminology).

In contemporary usage, it is uncommon to attach ruby annotations to all CJK ideographic characters (fully-annotated ruby). Instead, it is common to attach ruby annotations to some CJK ideographic characters (partially-annotated ruby).

Ruby annotations find their application in various contexts, including trade books, newspapers, textbooks, teaching materials, and more, but are rarely utilized in business documents.

Even for simple CJK ideographic characters, ruby annotations may be added for users who have particular difficulty with CJK ideographic characters (in electronic documents, it is easy to make ruby annotations visible or invisible based on user preferences). Such ruby annotations are called furigana-added-for-enhanced-accessibility.

Some simple CJK ideographic characters have more than one possible reading and thus require ruby annotations for disambiguation. This is common for names of people and places. For example, 山崎 (a person's name) may be read as YAMAZAKI or YAMASAKI.

In the case of partially-annotated ruby, ruby annotations are often attached to the first occurrence of each CJK ideographic character and omitted from the second and subsequent occurrences of the same character, presumably because readers are expected to learn the reading from the first occurrence.

Gikun, background

Especially in Japan, ruby annotations are also used to indicate something different from the reading of a CJK ideographic character. Such ruby is referred to as Gikun. Gikun is commonly employed in light novels and comics.

Here are some examples of Gikun:

Even when Gikun is used for a compound word, it is unlikely to be repeated for later occurrences of the same word. Moreover, different [=GIKUN=] may be added for subsequent occurrences of the same word. For example, the next occurrence of 生命 may well be 生命ライフ where ライフ (life) is an English translation.

Unusual names of people and places, background

Unusual names of people in Japan are typically written using CJK ideographic characters but are pronounced quite differently from the standard reading of these characters. For instance, 男あだむ is an unusual name, where 男 (usually read as OTOKO) means 'man', and あだむ represents 'Adam' in Kana.

Character names in comics, animations, and light novels can sometimes be extremely challenging to pronounce. Many of the character names in Demon Slayer (Kimetsu no Yaiba) fall into this category. For example, almost no one can read 不死川 玄弥 as SHINAZUGAWA GENYA without assistance.

Names of places can also be difficult to read due to historical reasons. For instance, 神居古潭かむいこたん, 温根沼おんねとう, 音威子府おといねっぷ are names of places in Hokkaido (the northern island of Japan). These names are challenging to pronounce because they originated from the Ainu language, which is entirely different from the Japanese language.

In many instances, the first occurrence of an unusual name is accompanied by a ruby annotation, but subsequent occurrences are not.

Interlinear notes, background

Interlinear notes resemble ruby annotations in appearance. A note in JLreq introduces interlinear notes as follows:

In the example shown in a figure referenced in the quoted note ("An example of a note in inter lines"), 徳川慶喜 (Tokugawa Yoshinobu) is accompanied by an interlinear note "1837-1913 江戸幕府最後の将軍" (1837-1913 the last shogun of the Edo shogunate). Other examples are: a modern kana phrase as an interlinear note for a historical kana phrase, a standard Japanese expression as an interlinear note for an expression in a dialect, a modern CJK ideographic character as an interlinear note for a traditional CJK ideographic character, an English text chunk as an interlinear note for a Japanese text chunk, and an official name as an interlinear note for an abbreviated name.

One could argue that HTML ruby elements should not be used for representing interlinear notes (see Kobayashi Sensei's mail in Japanese). However, it is not difficult to imagine that ruby elements are actually used for representing interlinear notes.

Ruby annotations for indicating the pronunciation of foreign phrases in language textbooks, background

In language textbooks, ruby annotations written in hiragana or katakana are sometimes used to indicate the pronunciation of foreign phrases. For example, a Chinese phrase 我去学校 may include ウオ チュー シュエシャオ as a ruby annotation.

Double-sided Ruby, background

A sequence of characters can be accompanied by two ruby annotations, typically consisting of [=Furigana=] and either [=GIKUN=] or an [=interlinear note=]. In an example provided in JLreq ("An example of ruby attached to both sides of the base characters"), 東南 is accompanied by たつみ and とうなん. Here 東南 means 'southeast', with とうなん (TOUNAN) serving as [=Furigana=], and たつみ (TATSUMI) as [=GIKUN=], as 辰巳 (read as TATSUMI) indicates the same direction as 東南.

We offer two additional illustrative examples.

Double-sided ruby example 1
東洋 features an upper-side ruby annotation オリエント and a lower-side ruby annotation とうよう

In this example, とうよう serves as [=Furigana=], while オリエント is used as [=Gikun=].

Double-sided ruby example 2
織田信長 features an upper-side ruby annotation "1534〜82" and a lower-side ruby annotation おだのぶなが

In this example, おだのぶなが serves as [=Furigana=], while "1534〜82" is presented as an [=interlinear note=].

Which should be read aloud, ruby bases or ruby annotations, or both?

There are three possible options: (1) both ruby bases and ruby annotations, (2) ruby annotations only, and (3) ruby bases only.
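The three options can be made concrete with a small sketch. It is only an illustration under our own assumptions: the names `Option`, `Segment`, and `spoken_text` are hypothetical, a ruby element is modelled as a base with a list of annotations (two for double-sided ruby), and the returned string stands in for whatever is handed to a text-to-speech engine.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Option(Enum):
    BOTH = 1              # option (1): ruby bases and ruby annotations
    ANNOTATIONS_ONLY = 2  # option (2): ruby annotations only
    BASES_ONLY = 3        # option (3): ruby bases only


@dataclass
class Segment:
    """A run of text; `annotations` is empty for text without ruby."""
    base: str
    annotations: List[str] = field(default_factory=list)


def spoken_text(segments: List[Segment], option: Option) -> str:
    """Return the text that would be passed to a text-to-speech engine."""
    parts = []
    for seg in segments:
        if not seg.annotations:
            parts.append(seg.base)         # plain text is always read
        elif option is Option.BOTH:
            parts.append(seg.base)         # base first ...
            parts.extend(seg.annotations)  # ... then every annotation
        elif option is Option.ANNOTATIONS_ONLY:
            parts.extend(seg.annotations)
        else:
            parts.append(seg.base)
    # Parts are joined with spaces only to keep the output easy to inspect.
    return " ".join(parts)


single = [Segment("foo", ["bar"])]
double_sided = [Segment("東洋", ["オリエント", "とうよう"])]
```

With `single`, the three options yield 'foo bar', 'bar', and 'foo' respectively. With `double_sided`, option (1) produces three spoken items for one ruby base.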

Reading aloud both ruby bases and ruby annotations

In this option, ruby bases are read aloud first and ruby annotations are then read aloud. Many implementations (screen readers, in particular) support only this option. For example, the ruby base 'foo' with the ruby annotation 'bar' is read aloud as 'foo bar'.

Furigana, when both read aloud

The option of reading aloud both interferes with readers' understanding significantly. This is true for both group ruby (see also JLReq terminology) and mono ruby (see also JLReq terminology).

Consider an example from "The Rich Man and the Chicken" by 小川未明 (OGAWA Mimei). Note that the mono ruby for 新鮮 is expressed by two rt (ruby annotation) elements: one ruby annotation for 新 and the other for 鮮.

Original text

鶏(にわとり)でも飼(か)って、新(しん)鮮(せん)な卵(たまご)を産(う)まして食(た)べようと思(おも)いました。

If there are no ruby annotations, this should be read aloud as:
にわとりでもかって、しんせんなたまごをうましてたべようとおもいました。 (Niwatori demo katte shinsenna tamagowo umashite tabeyouto omoimashita.)

Translation in English: I thought that I should raise a hen so that I can eat fresh eggs.

Reading

The option of reading aloud both provides:
にわとりにわとりでもかかって、しんしんせんせんなたまごたまごをううましてたたべようとおもおもいました。 (Niwatoriniwatori demo kakatte shinshinsensenna tamagotamagowo uumashite tatabeyouto omoomoimashita.)

This reading does not make any sense at all.

Moreover, in some cases, reading both completely changes the meaning (see examples).

Gikun, when both read aloud

The option of reading aloud both is sensible.

敵とも is read aloud as TEKI TOMO, which means 'enemy friend' (equal to 'frenemy').

生命いのち is read aloud as SEIMEI INOCHI, where SEIMEI is a loan word from Chinese and INOCHI is a native Japanese word. Both mean 'life'.

Unusual names of people and places, when both read aloud

The option of reading aloud both interferes with readers' understanding significantly.

不死川玄弥しなずがわげんや is read aloud as FUSHIKAWA GENYA SHINAZUGAWA GENYA, which suggests two persons rather than one person.

Interlinear notes, when both read aloud

The option of reading aloud both is sensible.

For example, 徳川慶喜1837-1913 江戸幕府最後の将軍 is read aloud as TOKUGAWA YOSHINOBU 1837-1913 EDO BAKUFU SAIGONO SHOUGUN, which means 'Tokugawa Yoshinobu 1837-1913, the last shogun of the Edo shogunate'.

Ruby annotations for indicating the pronunciation of foreign phrases in language books, when both read aloud

The option of reading aloud both interferes with readers' understanding significantly.

In the example of 我去学校, even if ウオ チュー シュエシャオ is read aloud using a Japanese text-to-speech engine, the result will not be helpful to learners because the pronunciation is inaccurate and the four tones of Chinese are lost. Katakana pronunciation is equally unhelpful for languages such as English.

Double-sided ruby, when both read aloud

Since there are two ruby annotations, double-sided ruby is read aloud three times: once for the ruby base and once for each ruby annotation. One of the ruby annotations is typically furigana, so the description for furigana above applies. If the other ruby annotation is a Gikun, the description for Gikun applies; if it is an interlinear note, the description for interlinear notes applies.

Reading aloud ruby annotations only

In this option, ruby annotations are read aloud but ruby bases are not. For example, the ruby base 'foo' with the ruby annotation 'bar' is read aloud as 'bar'.

Furigana, when ruby annotations read aloud

The option of reading aloud ruby annotations only usually provides results that are not incorrect but sound unnatural. In some cases, it causes mistakes in deciding whether へ should be read aloud as (E) or (HE) and whether は should be read aloud as (WA) or (HA). This is because morphological analysis does not work properly and pronunciation dictionaries for compound words cannot be used, as kana characters are used instead of CJK ideographic characters. As an example, consider 今後は発展はってん. Text-to-speech of 今後は発展 typically works fine but that of 今後ははってん does not. The first occurrence of は should be read aloud as (WA) but is mistakenly read aloud as (HA).

Even when this option is used, it might be wise to ignore furigana-added-for-enhanced-accessibility and rely on the ruby bases instead.

If furigana is assigned only for the first occurrence of a word, there is a risk that the first occurrence and the others are read aloud differently.

Gikun, when ruby annotations read aloud

The option of reading aloud ruby annotations only provides an understandable result but does not properly convey the author's intention.

敵とも is read aloud as TOMO, which means 'friend', but 'frenemy' is intended.

生命いのち will be read aloud as INOCHI(いのち).

Unusual names of people and places, when ruby annotations read aloud

The option of reading aloud ruby annotations only works correctly. However, if the first occurrence of a name is accompanied by a ruby annotation and the other occurrences are not, the first occurrence is read aloud differently from the others, which suggests different persons or places.

For example, 不死川玄弥しなずがわげんや is read aloud as SHINAZUGAWA GENYA correctly. But later occurrences of 不死川玄弥 are read aloud as FUSHIKAWA GENYA if they do not have ruby annotations.
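One conceivable mitigation is to remember the ruby annotation of the first annotated occurrence and reuse it for later unannotated occurrences. The following is a minimal sketch under our own assumptions: the function name `propagate_readings` is hypothetical, and the input is assumed to be already segmented into (base, annotation) pairs, with `None` standing for text without ruby.

```python
from typing import Dict, List, Optional, Tuple

Pair = Tuple[str, Optional[str]]


def propagate_readings(segments: List[Pair]) -> List[Pair]:
    """Reuse the first annotation seen for a base on later bare occurrences.

    `segments` lists (base, annotation) pairs in document order.
    """
    known: Dict[str, str] = {}
    result: List[Pair] = []
    for base, annotation in segments:
        if annotation is not None:
            # Only the first annotated occurrence defines the reading.
            known.setdefault(base, annotation)
        else:
            # Stays None when the base has not been annotated earlier.
            annotation = known.get(base)
        result.append((base, annotation))
    return result


text = [
    ("不死川玄弥", "しなずがわげんや"),
    ("は", None),
    ("不死川玄弥", None),
]
resolved = propagate_readings(text)
```

Here the third segment receives しなずがわげんや even though it carries no ruby annotation in the source. Such propagation is only plausible for furigana and names; since different Gikun may be attached to later occurrences of the same word, it should not be applied to Gikun.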

Interlinear notes, when ruby annotations read aloud

The option of reading aloud ruby annotations only often provides incomprehensible results.

If "1837-1913 江戸幕府最後の将軍" is attached to 徳川慶喜 as a ruby annotation, it will be read aloud as 1837-1913 EDO BAKUFU SAIGONO SHOUGUN (1837-1913 the last shogun of the Edo shogunate), which is reasonable. But if only "1837-1913" is attached as a ruby annotation, the result is 1837-1913, which does not make any sense.

Ruby annotations for indicating the pronunciation of foreign phrases in language books, when ruby annotations read aloud

The option of reading aloud ruby annotations only interferes with readers' understanding significantly.

In the example of 我去学校 (ウオ チュー シュエシャオ), even if ウオ チュー シュエシャオ is read aloud in the Japanese style, it will not be helpful to learners because the pronunciation is inaccurate and the four tones of Chinese are lost. Katakana pronunciation is equally unhelpful for languages such as English.

Double-sided ruby, when ruby annotations read aloud

The option of reading aloud ruby annotations only causes both ruby annotations to be read aloud while their ruby base is ignored. Since one of the two ruby annotations is typically furigana, the description for furigana above applies. If the other ruby annotation is a Gikun, the description for Gikun applies; if it is an interlinear note, the description for interlinear notes applies.

Reading aloud ruby bases only

In this option, ruby bases are read aloud but ruby annotations are not. For example, the ruby base 'foo' with the ruby annotation 'bar' is read aloud as 'foo'.

Furigana, when bases read aloud

The option of reading aloud ruby bases only may or may not provide good results, depending on the text-to-speech engine.

The following is a quote from [[?ACCESSIBLE_E_BOOKS]].

Furthermore, compound words made up from CJK ideographic characters in JIS X 0208 are sometimes read aloud incorrectly.

As the importance of accessibility becomes well recognized and text-to-speech engines improve, more and more words will be read aloud correctly. However, there are some words, such as the aforementioned YAMAZAKI, that cannot be read aloud correctly by text-to-speech engines or even by native Japanese speakers.

Gikun, when bases read aloud

The option of reading aloud ruby bases only results in a perfectly understandable result. However, since gikun is ignored, the author's intent is not completely conveyed.

敵とも is read aloud as TEKI, which means 'enemy', but 'frenemy' is intended.

生命いのち is read aloud as SEIMEI.

Unusual names of people and places, when bases read aloud

The option of reading ruby bases only leads to incorrect results. However, since every occurrence of a name is read aloud in the same way, users will not be confused.

Every occurrence of 不死川 玄弥しなずがわ げんや will be incorrectly read aloud as ふしかわ げんや, regardless of the presence or absence of ruby annotations.

Interlinear notes, when bases read aloud

The option of reading ruby bases only provides a perfectly understandable result. However, since interlinear notes are ignored, the author's intention is not conveyed well.

徳川慶喜1837-1913 江戸幕府最後の将軍 (Tokugawa Yoshinobu 1837-1913, the last shogun of the Edo shogunate) will be read aloud as とくがわよしのぶ (Tokugawa Yoshinobu).

Ruby annotations for indicating the pronunciation of foreign phrases in language books, when bases read aloud

The option of reading ruby bases only is most appropriate when natural languages are correctly identified and ruby bases are read aloud by a text-to-speech engine for that language. On the other hand, if the natural language cannot be identified or the text-to-speech engine for that language is not available, the result is not understandable.

Double-sided ruby, when bases read aloud

The option of reading ruby bases only ignores the two ruby annotations and reads aloud their ruby base alone. Since one of the two ruby annotations is typically furigana, the description for furigana above applies. If the other ruby annotation is a Gikun, the description for Gikun applies; if it is an interlinear note, the description for interlinear notes applies.

Miscellaneous issues around ruby markup

Conversion from small kana characters to full-size kana characters

Small kana characters ゃ, ゅ, ょ, and っ are too small when they appear in ruby annotations. For this reason, instead of these small characters, full-size kana characters や, ゆ, よ, and つ are used in ruby annotations.

However, since full-size kana characters are pronounced differently from small kana, ruby annotations containing full-size kana are read aloud differently.

CSS has a mechanism for overcoming this problem. Value 'full-size-kana' of the text-transform property as specified in CSS Text converts small kana characters to full-size kana. It is thus possible to use small kana in ruby annotations while rendering them using full-size kana. Text-to-speech engines can provide correct results even when ruby annotations are read aloud.
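The effect of this conversion can be sketched as follows. This is an illustration only, not the CSS algorithm: the function name `to_full_size_kana` is hypothetical, and the mapping table covers just a common subset of small kana, whereas CSS Text defines the complete list.

```python
# Small kana and their full-size counterparts (a subset, for illustration).
SMALL = "ぁぃぅぇぉっゃゅょゎァィゥェォッャュョヮ"
FULL = "あいうえおつやゆよわアイウエオツヤユヨワ"
SMALL_TO_FULL = str.maketrans(SMALL, FULL)


def to_full_size_kana(text: str) -> str:
    """Return `text` as it would be displayed with small kana enlarged."""
    return text.translate(SMALL_TO_FULL)


annotation = "しょうがっこう"            # what the author writes in the rt element
displayed = to_full_size_kana(annotation)  # what the reader sees
```

The point of the design is that only `displayed` changes: the stored ruby annotation keeps its small kana, so a text-to-speech engine that reads the annotation still receives the correct pronunciation.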

A single ruby element or multiple ruby elements per one compound word

Okayama-san of Hitachi has argued that, even in the case of mono ruby, creating a single ruby element per compound word is better than creating a ruby element for each character of the ruby base in a compound word. For example, to attach mono ruby to 生命, he recommends a single ruby element containing two sets of rb and rt elements, one for 生 and another for 命, rather than creating two ruby elements.

A single ruby element per compound word can be rendered as mono ruby or jukugo ruby by CSS. Moreover, it is also easy for the text-to-speech engine to maintain a correspondence table between ruby bases and ruby annotations.
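To illustrate how such a correspondence table might be obtained, the sketch below extracts (ruby base, ruby annotation) pairs from a single ruby element using Python's standard `html.parser` module. The class name `RubyPairs`, the helper `ruby_pairs`, and the sample markup are our own assumptions; the sketch expects explicit end tags and ignores rp elements and nested ruby.

```python
from html.parser import HTMLParser
from typing import List, Tuple


class RubyPairs(HTMLParser):
    """Collect (base, annotation) pairs from ruby elements."""

    def __init__(self) -> None:
        super().__init__()
        self.pairs: List[Tuple[str, str]] = []
        self._in_ruby = False
        self._in_rt = False
        self._base = ""
        self._annotation = ""

    def handle_starttag(self, tag, attrs):
        if tag == "ruby":
            self._in_ruby = True
        elif tag == "rt":
            self._in_rt = True

    def handle_endtag(self, tag):
        if tag == "rt":
            # An rt end tag closes one base/annotation pair.
            self.pairs.append((self._base, self._annotation))
            self._base, self._annotation = "", ""
            self._in_rt = False
        elif tag == "ruby":
            self._in_ruby = False

    def handle_data(self, data):
        if not self._in_ruby:
            return
        if self._in_rt:
            self._annotation += data
        else:
            self._base += data  # text inside rb, or bare base text


def ruby_pairs(markup: str) -> List[Tuple[str, str]]:
    parser = RubyPairs()
    parser.feed(markup)
    parser.close()
    return parser.pairs


pairs = ruby_pairs("<ruby><rb>生</rb><rt>せい</rt><rb>命</rb><rt>めい</rt></ruby>")
word = "".join(base for base, _ in pairs)
reading = "".join(annotation for _, annotation in pairs)
```

Because both pairs live in one ruby element, the compound word 生命 and its full reading せいめい can be recovered by simple concatenation, while the per-character pairs remain available for mono ruby rendering.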

Markup for indicating furigana-added-for-enhanced-accessibility

Although furigana-added-for-enhanced-accessibility is necessary for those readers who have particular difficulties with CJK ideographic characters, it is unnecessary or slightly disturbing for others. If furigana-added-for-enhanced-accessibility is distinguishable from normal furigana, it can be made visible or invisible depending on user preferences. It is thus necessary to standardize a markup mechanism for indicating furigana-added-for-enhanced-accessibility.

Markup for indicating ruby annotations used as gikun or interlinear note

In Section 3, we have seen that ruby annotations used as gikun or interlinear notes should be read aloud differently from the other cases. It is thus necessary to standardize a markup mechanism for clearly indicating ruby annotations used as gikun or interlinear notes.
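If such markup existed, a text-to-speech front end could choose a reading option per role. The sketch below is purely hypothetical: no role vocabulary has been standardized, so the role names, the `POLICY` table, and the function `spoken_parts` are our own assumptions. The choices for gikun, interlinear notes, and foreign phrases follow the discussion in Section 3; the choices for furigana and names are one possible judgment, not a recommendation.

```python
from typing import List

# Hypothetical role names mapped to a reading option.
POLICY = {
    "furigana": "annotation",          # judgment call; bases may work better
    "name": "annotation",              # unusual names of people and places
    "gikun": "both",                   # base and annotation both carry meaning
    "interlinear-note": "both",
    "foreign-pronunciation": "base",   # leave it to an engine for that language
}


def spoken_parts(base: str, annotation: str, role: str) -> List[str]:
    """Return the strings to read aloud for one ruby element."""
    choice = POLICY.get(role, "base")  # unknown roles fall back to the base
    if choice == "both":
        return [base, annotation]
    if choice == "annotation":
        return [annotation]
    return [base]


gikun = spoken_parts("生命", "いのち", "gikun")
foreign = spoken_parts("我去学校", "ウオ チュー シュエシャオ", "foreign-pronunciation")
```

Without role information, an implementation has to apply one option to every ruby element, which is what leads to the problems described in Section 3.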

Alternatives to ruby

[[?SSML]] and [[?PRONUNCIATION-LEXICON]] offer alternatives for conveying phonemic and phonetic pronunciations of CJK ideographic characters to speech synthesis engines. These methods are not intended for visual presentations but can offer superior control over text-to-speech compared to using ruby.

SSML

[[?SSML]] employs symbol collections (such as IPA and [[?JEITA_IT-4006]]) to represent the sounds of human languages. Phonemic and phonetic pronunciations are conveyed through sequences of these symbols.

[[?epub-32]] allows the use of SSML attributes within XHTML content documents in EPUB publications. In [[?epub-33]], these attributes are relocated to [[?epub-tts-10]]. Meanwhile, the W3C Accessible Platform Architectures Working Group is developing [[?spoken-html]], which outlines two potential methods for incorporating SSML attributes into HTML elements.
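As an illustration of what such attributes look like, the following sketch builds a span element carrying the `ssml:ph` and `ssml:alphabet` attributes with Python's standard `xml.etree.ElementTree`. The function name `span_with_pronunciation` is hypothetical, the fragment omits the XHTML namespace for brevity, and the IPA string "jamasaki" is only an approximate transcription of YAMASAKI chosen for the example.

```python
import xml.etree.ElementTree as ET

# Namespace of the SSML attributes used in EPUB content documents.
SSML_NS = "http://www.w3.org/2001/10/synthesis"
ET.register_namespace("ssml", SSML_NS)


def span_with_pronunciation(text: str, phonemes: str, alphabet: str = "ipa") -> str:
    """Serialize a span whose pronunciation is given by ssml:ph."""
    span = ET.Element("span")
    span.set(f"{{{SSML_NS}}}alphabet", alphabet)
    span.set(f"{{{SSML_NS}}}ph", phonemes)
    span.text = text
    return ET.tostring(span, encoding="unicode")


markup = span_with_pronunciation("山崎", "jamasaki")
```

The resulting markup keeps 山崎 as the visible text while telling the speech synthesis engine which of the two readings is intended, without any ruby annotation being displayed.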

In Japan, SSML is used extensively in digital textbooks and has been adopted by the largest textbook publisher in the country. However, it has been noted that attaching SSML attributes to CJK ideographic characters significantly raises authoring costs. DAISY textbooks in Japan do not use SSML, as they contain recorded voice. Trade books in Japan do not typically employ SSML either.

PLS

PLS ([[PRONUNCIATION-LEXICON]]) enables the use of pronunciation lexicons, which map words to sequences of symbols drawn from symbol collections such as IPA or [[?JEITA_IT-4006]].

While SSML attributes are embedded within XHTML content documents in EPUB publications, PLS lexicons in EPUB publications are stored externally to and referenced by XHTML content documents (see Pronunciation Lexicons section in [[?epub-tts-10]]). At present, [[spoken-html]] does not offer a mechanism for associating PLS lexicons with HTML documents.

PLS is a robust tool for rendering unusual names of people and places in text-to-speech applications. In particular, PLS allows every occurrence of a word or phrase to be consistently pronounced, regardless of the presence of ruby. At the time of this writing, PLS is used by at least one digital textbook publisher in Japan.
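The consistency benefit can be sketched with a tiny lexicon. The example below is a simplification under our own assumptions: the lexicon content is invented for illustration, it uses the PLS alias element to give a kana spelling instead of a phoneme string, and the hypothetical helpers `load_lexicon` and `apply_lexicon` perform plain string substitution, whereas a real speech synthesis engine would consult the lexicon internally.

```python
import xml.etree.ElementTree as ET
from typing import Dict

PLS_NS = "http://www.w3.org/2005/01/pronunciation-lexicon"

# An illustrative lexicon with a single entry.
PLS_DOC = """\
<lexicon version="1.0" alphabet="ipa" xml:lang="ja"
         xmlns="http://www.w3.org/2005/01/pronunciation-lexicon">
  <lexeme>
    <grapheme>不死川玄弥</grapheme>
    <alias>しなずがわげんや</alias>
  </lexeme>
</lexicon>
"""


def load_lexicon(pls_xml: str) -> Dict[str, str]:
    """Map each grapheme to the alias of its lexeme."""
    root = ET.fromstring(pls_xml)
    lexicon: Dict[str, str] = {}
    for lexeme in root.findall(f"{{{PLS_NS}}}lexeme"):
        alias = lexeme.findtext(f"{{{PLS_NS}}}alias")
        for grapheme in lexeme.findall(f"{{{PLS_NS}}}grapheme"):
            if alias and grapheme.text:
                lexicon[grapheme.text] = alias
    return lexicon


def apply_lexicon(text: str, lexicon: Dict[str, str]) -> str:
    """Replace every occurrence of each grapheme, longest graphemes first."""
    for grapheme in sorted(lexicon, key=len, reverse=True):
        text = text.replace(grapheme, lexicon[grapheme])
    return text


lexicon = load_lexicon(PLS_DOC)
spoken = apply_lexicon("不死川玄弥は不死川玄弥だ", lexicon)
```

Both occurrences of the name are resolved in the same way, whether or not either of them carries a ruby annotation in the content document.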

Use of Ruby for Automatic Braille Translation

The conversion of HTML documents and EPUB publications to braille is expected to become increasingly important in the near future.

Japanese braille lacks CJK ideographic characters and does not distinguish between hiragana and katakana. (Note: Han braille in Japan does include CJK ideographic characters, but it is not widely used.)

Braille exhibits some syntactical differences from the Japanese writing system. First, space characters are inserted as delimiters between words. Second, two Japanese particles, は and へ, are transcribed as they are pronounced, meaning は and へ are represented as if they were わ and え, respectively. Third, う pronounced as an elongated sound is represented using the long vowel character. For example, to translate たいよう to braille, たいよう is first converted to たいよー and then translated to braille.
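The three differences can be sketched as a kana normalization step that precedes the actual mapping to braille cells. The sketch rests on several assumptions of ours: the function name `to_braille_kana` is hypothetical, the input is already segmented into spacing units and tagged for particles (the job of natural language processing), and every う that follows an o-row or u-row kana is treated as a long vowel, which is a simplification that would wrongly lengthen verb endings such as the final う of おもう.

```python
from typing import List, Tuple

# Kana after which a following う is treated as a long vowel (simplified).
O_OR_U_ROW = set("おこそとのほもよろごぞどぼぽょうくすつぬふむゆるぐずづぶぷゅ")

# Particles written as they are pronounced: は as わ, へ as え.
PARTICLES = {"は": "わ", "へ": "え"}

Morpheme = Tuple[str, bool]  # (kana, is_particle)


def to_braille_kana(units: List[List[Morpheme]]) -> str:
    """Normalize kana for braille; units are separated by spaces."""
    out = []
    for unit in units:
        pieces = []
        for kana, is_particle in unit:
            if is_particle:
                pieces.append(PARTICLES.get(kana, kana))
                continue
            chars: List[str] = []
            for ch in kana:
                if ch == "う" and chars and chars[-1] in O_OR_U_ROW:
                    chars.append("ー")  # long vowel character
                else:
                    chars.append(ch)
            pieces.append("".join(chars))
        out.append("".join(pieces))
    return " ".join(out)


sentence = [
    [("たいよう", False), ("は", True)],
    [("のぼる", False)],
]
normalized = to_braille_kana(sentence)
```

For the sample input the result is たいよーわ のぼる: the long vowel and the particle are rewritten, and a space separates the two units.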

Natural language processing is required to handle these differences during the conversion to braille. However, unlike the case of text-to-speech, intonation is not relevant.

When converting HTML or EPUB content to braille, it is essential to select the correct reading for each CJK ideographic character. Choosing an incorrect reading can result in erroneous braille output. Similar to text-to-speech, ruby provides valuable hints, while [[?SSML]] and PLS ([[?PRONUNCIATION-LEXICON]]) serve as effective alternatives.

For furigana and the transcription of unusual names of people and places, natural language processing is more effective when using ruby bases (typically containing CJK ideographic characters) as the foundation. In contrast, the correct readings are chosen when using ruby annotations as the basis. It is also possible to combine both ruby bases and ruby annotations.