This document describes user requirements for the text-to-speech of electronic documents containing ruby.
We are concerned about issues around the text-to-speech of HTML documents and EPUB publications containing ruby. Although the typographical characteristics of ruby are covered by [[?JLREQ]] and [[?simple-ruby]], text-to-speech issues have not been widely recognized. This document focuses on user requirements, while a companion document (currently available in Japanese only) focuses on implementation issues.
Section 2 enumerates the roles that ruby annotations play in relation to their base text. Section 3 describes the possible options for using base text and/or ruby annotation in text-to-speech and discusses the pros and cons of each option. Section 4 discusses ruby markup issues related to text-to-speech. Section 5 introduces alternative mechanisms (SSML and PLS). Section 6 describes the use of ruby in translating HTML or EPUB to braille. Section 7 provides a brief summary of the text-to-speech of Word documents and PDF documents containing ruby.
The primary use of ruby annotation is to indicate how to read CJK ideographic characters (furigana, see also JLReq terminology).
Nowadays, it is not common to attach ruby annotations to every CJK ideographic character (general ruby, see also JLReq terminology). Ruby annotations are typically attached to difficult CJK ideographic characters only (para ruby, see also JLReq terminology).
Ruby is used in trade books, newspapers, textbooks, teaching materials, etc., but is rarely used in business documents.
Even for simple CJK ideographic characters, ruby annotations may be added for users who have particular difficulties with CJK ideographic characters (in electronic documents, it is easy to make ruby annotations visible or invisible based on user preferences). Such ruby annotations are called furigana-added-for-enhanced-accessibility.
Some simple CJK ideographic characters have more than one possible reading and thus require ruby annotations for disambiguation. This is common for names of people and places. For example, "山崎" (a person's name) may be read as "Yamazaki" or "Yamasaki".
In the case of para ruby, a ruby annotation is often attached to the first occurrence of each CJK ideographic character but not to the second and subsequent occurrences of the same character, presumably because readers are expected to learn the reading from the first occurrence.
Especially in Japan, ruby annotation is also used for indicating something different from the reading of a CJK ideographic character. Such ruby is called Gikun. Gikun tends to be used in light novels and comics.
Here are some examples of Gikun:
Even when Gikun is used for a compound word, it is unlikely to be repeated for later occurrences of the same word.
Moreover, different [=GIKUN=] may be added for subsequent occurrences of the same word.
For example, the next occurrence of 生命 may well be
Unusual names of people in Japan are written in CJK ideographic characters but read as something completely
different from the typical reading of the CJK ideographic characters.
Character names in comics, animation and light novels are sometimes extremely difficult to read. Many of the character names in Demon Slayer (Kimetsu no Yaiba) fall into this category. For example, almost no one can read 不死川 玄弥 as "Shinazugawa Gennya" at first sight.
Names of places are sometimes hard to read for historical reasons.
In many cases, the first occurrence of an unusual name is accompanied by ruby annotation but the other occurrences are not.
Interlinear notes look similar to ruby annotations. A note in JLReq introduces interlinear notes:
In the example shown in a figure referenced in the quoted note ("An example of a note in inter lines"), 徳川慶喜 (Tokugawa Yoshinobu) is accompanied by an interlinear note "1837-1913 江戸幕府最後の将軍" (1837-1913 the last shogun of the Edo shogunate). Other examples are: a modern kana phrase as an interlinear note for a historical kana phrase, a standard Japanese expression as an interlinear note for an expression in a dialect, a modern CJK ideographic character as an interlinear note for a traditional CJK ideographic character, an English text chunk as an interlinear note for a Japanese text chunk, and an official name as an interlinear note for an abbreviated name.
One could argue that HTML ruby elements should not be used for representing interlinear notes (see Kobayashi Sensei's mail in Japanese). However, it is not difficult to imagine that ruby elements are actually used for representing interlinear notes.
In language textbooks, ruby annotation is sometimes used to indicate the reading of a foreign phrase in hiragana or katakana. For example, a Chinese phrase 我去学校 may have ウオ チュー シュエシャオ as ruby annotation.
A sequence of base characters may be accompanied by two ruby annotations. Typically, one of them is [=Furigana=] and the other is either a [=GIKUN=] or an [=interlinear note=]. In an example in JLReq ("An example of ruby attached to both sides of the base characters"), 東南 is accompanied by たつみ and とうなん. Here 東南 means "southeast", とうなん (TOUNAN) is a [=furigana=], and たつみ (Tatsumi) is a [=GIKUN=], since 辰巳 (read as たつみ) means the same direction as 東南.
Here とうよう is a [=furigana=] and オリエント is a [=Gikun=].
Here おだのぶなが is a [=furigana=] and "1534-82" is an [=interlinear note=].
There are three possible options: (1) both base text and ruby annotation, (2) ruby annotation only, and (3) base text only.
In this option, base text is read aloud first and ruby annotation is then read aloud.
Many implementations (screen readers, in particular) support this option only.
The option of reading aloud both interferes with readers' understanding significantly. This is true for both group ruby (see also JLReq terminology) and mono ruby (see also JLReq terminology).
Consider an example from "The Rich Man and the Chicken" by 小川未明 (OGAWA Mimei). Note that the mono ruby for 新鮮 is expressed by two rt (ruby annotation) elements: one for 新 and the other for 鮮.
If there is no ruby annotation, this should be read aloud as:
にわとりでもかって、しんせんなたまごをうましてたべようとおもいました。 (Niwatori demo katte shinsenna tamagowo umashite tabeyouto omoimashita.)
Translation in English: I thought that I should raise a hen so that I can eat fresh eggs.
The option of reading aloud both provides:
にわとりにわとりでもかかって、しんしんせんせんなたまごたまごをううましてたたべようとおもおもいました。 (Niwatoriniwatori demo kakatte shinshinsensenna tamagotamagowo uumashite tatabeyouto omoomoimashita.)
This reading does not make any sense at all.
Moreover, in some cases, reading both completely changes the meaning (see examples).
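The three options can be illustrated with a small sketch. It assumes the sentence has already been split into segments, each with base text and an optional ruby annotation, and it stands in for a text-to-speech engine with a hypothetical lookup table (`ENGINE_READING`) that gives the reading of each base character in isolation. The segment list, the table, and the function names are assumptions made for illustration; they do not describe any actual implementation.

```python
from typing import List, Optional, Tuple

# (base text, ruby annotation or None when the segment has no ruby)
Segment = Tuple[str, Optional[str]]

# Hypothetical stand-in for an engine's reading of isolated base characters.
ENGINE_READING = {
    "鶏": "にわとり", "飼": "か", "新": "しん", "鮮": "せん",
    "卵": "たまご", "産": "う", "食": "た", "思": "おも",
}

# Assumed segmentation of the example sentence (mono ruby on 新 and 鮮).
SENTENCE: List[Segment] = [
    ("鶏", "にわとり"), ("でも", None), ("飼", "か"), ("って、", None),
    ("新", "しん"), ("鮮", "せん"), ("な", None), ("卵", "たまご"),
    ("を", None), ("産", "う"), ("まして", None), ("食", "た"),
    ("べようと", None), ("思", "おも"), ("いました。", None),
]


def engine_read(text: str) -> str:
    """Return the engine's reading; kana passes through unchanged."""
    return ENGINE_READING.get(text, text)


def read_aloud(segments: List[Segment], option: str) -> str:
    """Simulate option 'both', 'ruby', or 'base' over the segments."""
    if option not in ("both", "ruby", "base"):
        raise ValueError(f"unknown option: {option}")
    spoken = []
    for base, ruby in segments:
        if ruby is None or option == "base":
            spoken.append(engine_read(base))
        elif option == "ruby":
            spoken.append(ruby)
        else:  # "both": base text first, then the ruby annotation
            spoken.append(engine_read(base) + ruby)
    return "".join(spoken)


if __name__ == "__main__":
    for option in ("both", "ruby", "base"):
        print(option, read_aloud(SENTENCE, option))
```

Under these assumptions, the "both" option reproduces the doubled reading shown above, while the "ruby" and "base" options both yield the intended sentence, because the toy engine happens to read every base character correctly.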
The option of reading aloud both is sensible.
The option of reading aloud both interferes with readers' understanding significantly.
The option of reading aloud both is sensible.
The option of reading aloud both interferes with readers' understanding significantly.
In the example of 我去学校, even if ウオ チュー シュエシャオ is read aloud using a Japanese text-to-speech engine, the result will not be helpful to learners because neither the pronunciation nor the four tones are reproduced correctly. Katakana-based pronunciation is equally unhelpful for languages such as English.
Since there are two chunks of ruby annotation, double-sided ruby leads to the same text being read aloud three times. One of the ruby annotations is typically furigana, so the description in 1) applies. If the other ruby annotation is a Gikun, the description in 2) applies; if it is an interlinear note, the description in 4) applies.
In this option, ruby annotation is read aloud but base text is not.
The option of reading aloud ruby annotation only usually provides results that are not incorrect but sound unnatural.
In some cases, it causes mistakes in deciding whether へ should be read aloud as え (/e/) or へ (/he/) and whether は should be read aloud as わ (/wa/) or は (/ha/). This is because morphological analysis does not work properly and pronunciation dictionaries for compound words cannot be used when kana characters take the place of CJK ideographic characters.
As an example, consider 今後は
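The dependence on CJK ideographic characters can be illustrated with a deliberately simplified sketch. It is not a morphological analyzer: it assumes a tiny, hypothetical pronunciation dictionary (`LEXICON`) keyed on ideographic spellings and treats は or へ as a particle only when it directly follows a dictionary word. All names and entries are made up for illustration.

```python
# Hypothetical pronunciation dictionary keyed on CJK ideographic spellings.
LEXICON = {"今後": "こんご", "学校": "がっこう"}

# Particle spellings and the way they are pronounced as particles.
PARTICLE_READINGS = {"は": "わ", "へ": "え"}


def toy_read(text: str) -> str:
    """Read text with longest-match dictionary lookup (toy model only)."""
    words = sorted(LEXICON, key=len, reverse=True)
    spoken = []
    i = 0
    prev_was_word = False
    while i < len(text):
        match = next((w for w in words if text.startswith(w, i)), None)
        if match is not None:
            spoken.append(LEXICON[match])
            i += len(match)
            prev_was_word = True
            continue
        ch = text[i]
        # は/へ counts as a particle only right after a dictionary word.
        if prev_was_word and ch in PARTICLE_READINGS:
            spoken.append(PARTICLE_READINGS[ch])
        else:
            spoken.append(ch)
        prev_was_word = False
        i += 1
    return "".join(spoken)


if __name__ == "__main__":
    print(toy_read("今後は"))    # base text: dictionary word found
    print(toy_read("こんごは"))  # kana only: no dictionary word found
```

In this toy model, 今後は is read with は as わ because 今後 is found in the dictionary, whereas the kana-only string こんごは matches no entry and は is left as は. Real engines are far more sophisticated, but they face the same loss of information when only kana is available.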
Even when this option is used, it might be wise to ignore furigana-added-for-enhanced-accessibility and rely on the base text instead.
If furigana is assigned only for the first occurrence of a word, there is a risk that the first occurrence and the others are read aloud differently.
The option of reading aloud ruby annotation only provides an understandable result but does not properly convey the author's intention.
The option of reading aloud ruby annotation only works correctly. However, if the first occurrence of a name is accompanied by ruby annotation and the other occurrences are not, the first occurrence is read aloud differently from the others thus suggesting different persons or places.
The option of reading aloud ruby annotation only often provides incomprehensible results.
If "1837-1913 江戸幕府最後の将軍" is attached to 徳川慶喜 as ruby annotation, it will be read aloud as "1837-1913 エドバクフサイゴノショウグン" (1837-1913 the last shogun of the Edo shogunate), which is reasonable. But if only "1837-1913" is attached as ruby annotation, the result is "1837-1913" which does not make any sense.
The option of reading aloud ruby annotation only interferes with readers' understanding significantly.
In the example of 我去学校 (ウオ チュー シュエシャオ), even if ウオ チュー シュエシャオ is read aloud in the Japanese style, it will not be helpful to learners because neither the pronunciation nor the four tones are reproduced accurately. Katakana-based pronunciation is equally unhelpful for languages such as English.
The option of reading aloud ruby annotation only causes the two chunks of ruby annotation to be read aloud while their base text is ignored. Since one of the ruby annotations is typically furigana, the description in 1) applies. If the other ruby annotation is a Gikun, the description in 2) applies; if it is an interlinear note, the description in 4) applies.
In this option, base text is read aloud but ruby annotation is not.
The option of reading aloud base text only may or may not provide good results, depending on text-to-speech engines.
The following is a quote from [[?ACCESSIBLE_E_BOOKS]].
Furthermore, compound words made up from CJK ideographic characters in JIS X 0208 are sometimes read aloud incorrectly.
As the importance of accessibility becomes well recognized and text-to-speech engines improve, more and more words will be read aloud correctly. However, there are some words, such as the aforementioned 山崎 ("Yamazaki" or "Yamasaki"), that cannot be read aloud correctly by text-to-speech engines or even by native Japanese speakers.
The option of reading aloud base text only results in a perfectly understandable result. However, since gikun is ignored, the author's intent is not completely conveyed.
The option of reading base text only leads to incorrect results. However, since every occurrence of a name is read aloud in the same way, users will not be confused.
The option of reading base text only provides a perfectly understandable result. However, since interlinear notes are ignored, the author's intention is not conveyed well.
The option of reading base text only is most appropriate when the natural language is correctly identified and the base text is read aloud by a text-to-speech engine for that language. On the other hand, if the natural language cannot be identified or a text-to-speech engine for that language is not available, the result is not understandable.
The option of reading base text only will ignore the two chunks of ruby annotation and read their base text only. When one of the ruby annotations is furigana, the description in 1) applies. If the other is a gikun, the description in 2) applies, and if it is an interlinear note, the description in 4) applies.
Small kana characters ゃ, ゅ, ょ, and っ become too small to read comfortably when they appear in ruby annotation. For this reason, full-size kana characters や, ゆ, よ, and つ are used in ruby annotation instead of these small characters.
However, since full-size kana characters are pronounced differently from small kana, ruby annotation containing full-size kana is read aloud differently.
CSS has a mechanism for overcoming this problem. Value 'full-size-kana' of the text-transform property as specified in CSS Text converts small kana characters to full-size kana. It is thus possible to use small kana in ruby annotation while rendering ruby annotation using full-size kana. Text-to-speech engines can provide correct results even when ruby annotation is read aloud.
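The effect of 'full-size-kana' can be mimicked in a few lines of Python. This is only a sketch: the mapping covers just the four small kana named above, whereas CSS Text defines the conversion for a larger set of characters, and the function name `to_full_size_kana` is made up for illustration.

```python
# Small kana named in this section and their full-size counterparts.
# (CSS Text's 'full-size-kana' covers more characters than these four.)
SMALL_TO_FULL = str.maketrans({"ゃ": "や", "ゅ": "ゆ", "ょ": "よ", "っ": "つ"})


def to_full_size_kana(annotation: str) -> str:
    """Return the annotation as it would be displayed with full-size kana."""
    return annotation.translate(SMALL_TO_FULL)


if __name__ == "__main__":
    authored = "しょうがっこう"          # what the author writes in the markup
    displayed = to_full_size_kana(authored)
    print("spoken:   ", authored)        # text-to-speech keeps the small kana
    print("displayed:", displayed)       # rendering uses full-size kana
```

The point of the design is that the conversion happens only at rendering time: the markup keeps the small kana, so a text-to-speech engine reading the annotation still receives the correct pronunciation.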
Okayama-san of Hitachi has argued that, even in the case of mono ruby, creating a single ruby element per compound word is better than creating a ruby element for each character of base text in a compound word. For example, to attach mono ruby to 生命, he recommends a single ruby element containing two pairs of rb and rt elements, one for 生 and another for 命, rather than two ruby elements.
A single ruby element per compound word can be rendered as mono ruby or jukugo ruby by CSS. Moreover, it is also easy for the text-to-speech engine to maintain a correspondence table between base text and ruby annotation.
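A correspondence table of this kind might look like the following sketch. It assumes the text has already been segmented so that each compound word is one segment (which is what a single ruby element per compound word provides), and it uses 山崎 with the reading やまさき purely as sample data. The function names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# (base text, ruby annotation or None when the segment has no ruby)
Segment = Tuple[str, Optional[str]]


def build_table(segments: List[Segment]) -> Dict[str, str]:
    """Map each annotated base text to its first ruby annotation."""
    table: Dict[str, str] = {}
    for base, ruby in segments:
        if ruby is not None and base not in table:
            table[base] = ruby
    return table


def fill_missing(segments: List[Segment], table: Dict[str, str]) -> List[Segment]:
    """Give unannotated occurrences the reading recorded in the table.

    Explicit annotations are kept as they are, so a later occurrence that
    carries a different annotation is not overwritten.
    """
    return [
        (base, ruby if ruby is not None else table.get(base))
        for base, ruby in segments
    ]


if __name__ == "__main__":
    text: List[Segment] = [
        ("山崎", "やまさき"), ("は", None), ("山崎", None), ("の", None),
    ]
    print(fill_missing(text, build_table(text)))
```

Keying the table on whole compound words matters: a table keyed on single characters such as 生 or 命 would be ambiguous, because the same character often has several readings depending on the word it appears in.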
Although furigana-added-for-enhanced-accessibility is necessary for those readers who have particular difficulties with CJK ideographic characters, it is unnecessary or slightly disturbing for others. If furigana-added-for-enhanced-accessibility is distinguishable from normal furigana, it can be made visible or invisible depending on user preferences. It is thus necessary to standardize a markup mechanism for indicating furigana-added-for-enhanced-accessibility.
In Section 3, we have seen that ruby annotation used as gikun or interlinear notes should be read aloud differently from the other cases. It is thus necessary to standardize a markup mechanism for clearly indicating ruby annotation used as gikun or interlinear note.
[[?SSML]] and [[?PRONUNCIATION-LEXICON]] can be used for providing phonemic/phonetic pronunciation of CJK ideographic characters to speech synthesis engines. They are not for visual presentations but can control text-to-speech much better than ruby.
[[?SSML]] uses symbol collections (such as IPA and [[?JEITA_IT-4006]]) to represent the sounds of human languages. Phonemic/phonetic pronunciation is represented by sequences of such symbols.
[[?epub-32]] allows SSML attributes to be used within XHTML content documents in EPUB publications. In the upcoming version, [[?epub-33]], these attributes are moved to [[?epub-tts-10]]. Meanwhile, the W3C Accessible Platform Architectures Working Group is developing [[?spoken-html]], which describes two possible approaches for adding SSML attributes to HTML elements.
SSML is widely used for digital textbooks by more than one textbook publisher in Japan. Meanwhile, it has been reported that attaching SSML attributes to CJK ideographic characters significantly increases the authoring cost. DAISY textbooks in Japan do not use SSML, since they include recorded voice. Trade books in Japan do not use SSML either.
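As a rough illustration of what attaching SSML attributes involves, the sketch below builds an XHTML `span` carrying the `ssml:ph` and `ssml:alphabet` attributes used in EPUB content documents. The helper name `span_with_ph` is hypothetical, and the IPA string is a broad, approximate transcription chosen only for illustration. In a real content document the `ssml` namespace would normally be declared once on the root element rather than on each span.

```python
from xml.sax.saxutils import escape, quoteattr

# Namespace of the SSML attributes used in EPUB content documents.
SSML_NS = "http://www.w3.org/2001/10/synthesis"


def span_with_ph(text: str, phonemes: str, alphabet: str = "ipa") -> str:
    """Wrap text in a span that carries its pronunciation as ssml:ph."""
    return (
        f'<span xmlns:ssml="{SSML_NS}" '
        f"ssml:alphabet={quoteattr(alphabet)} "
        f"ssml:ph={quoteattr(phonemes)}>{escape(text)}</span>"
    )


if __name__ == "__main__":
    # "jamasaki" is an approximate IPA rendering of "Yamasaki".
    print(span_with_ph("山崎", "jamasaki"))
```

Every annotated word needs its own pronunciation string, which gives some sense of why the authoring cost mentioned above grows quickly for documents with many CJK ideographic characters.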
PLS ([[PRONUNCIATION-LEXICON]]) allows for pronunciation lexicons, which map words to sequences of symbols drawn from collections such as IPA or [[?JEITA_IT-4006]].
While [[?SSML]] attributes are embedded within XHTML content documents in EPUB publications, PLS dictionaries (see [[?PRONUNCIATION-LEXICON]]) in EPUB publications are stored externally to and referenced by XHTML content documents (see Pronunciation Lexicons section in [[?epub-tts-10]]). As of now, [[spoken-html]] does not have a mechanism for associating PLS lexicons with HTML documents.
PLS is a powerful mechanism for the text-to-speech of unusual names of people and places. In particular, every occurrence of a word or phrase is read aloud in the same way regardless of the existence of ruby. As of this writing, PLS is used by at least one digital textbook publisher in Japan.
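A minimal PLS lexicon can be generated with Python's standard library, as sketched below. The function name `build_lexicon` is hypothetical, and the IPA string is a broad, approximate transcription used only as sample data; a production lexicon would need carefully reviewed pronunciations.

```python
import xml.etree.ElementTree as ET
from typing import Dict

PLS_NS = "http://www.w3.org/2005/01/pronunciation-lexicon"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def build_lexicon(entries: Dict[str, str], lang: str = "ja",
                  alphabet: str = "ipa") -> str:
    """Serialize grapheme-to-phoneme entries as a PLS 1.0 lexicon."""
    # Serialize PLS elements in the default namespace (no prefix).
    ET.register_namespace("", PLS_NS)
    root = ET.Element(
        f"{{{PLS_NS}}}lexicon",
        {"version": "1.0", "alphabet": alphabet, f"{{{XML_NS}}}lang": lang},
    )
    for grapheme, phoneme in entries.items():
        lexeme = ET.SubElement(root, f"{{{PLS_NS}}}lexeme")
        ET.SubElement(lexeme, f"{{{PLS_NS}}}grapheme").text = grapheme
        ET.SubElement(lexeme, f"{{{PLS_NS}}}phoneme").text = phoneme
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # "jamasaki" is an approximate IPA rendering of "Yamasaki".
    print(build_lexicon({"山崎": "jamasaki"}))
```

Because the lexicon is keyed on the written word rather than on individual occurrences, one entry covers every occurrence of 山崎 in the publication, whether or not that occurrence carries ruby.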
Conversion of HTML documents and EPUB publications to braille is expected to become important in the near future.
Japanese braille does not have CJK ideographic characters and does not distinguish hiragana and katakana. (Note: Han braille has CJK ideographic characters, but it is not widely used.)
Braille has some syntactical differences from the Japanese writing system. First, space characters are inserted as delimiters between words. Second, the two Japanese particles は and へ are written as they are pronounced; that is, は and へ are represented as if they were わ and え. Third, う pronounced as an elongated sound is represented by the long vowel character.
Natural language processing is required for handling these differences in the conversion to braille. But, unlike in the case of text-to-speech, intonation is not relevant.
To convert HTML or EPUB to braille, it is crucial to choose the correct reading of each CJK ideographic character. If an incorrect reading is chosen, the generated braille becomes incorrect. As in the case of text-to-speech, ruby provides useful hints while [[?SSML]] and PLS are good alternatives.
For furigana and unusual names of people and places, natural language processing will work better when CJK ideographic characters are used as a basis, while correct reading will be chosen when ruby annotation is used as a basis. It is even possible to use both base text and ruby annotation.
Microsoft Word reads aloud neither base text nor ruby annotation. Therefore, text-to-speech does not work when ruby is used.
Ruby in PDF documents is represented as separate lines containing tiny characters. The relationship between base text and ruby annotation is not explicitly represented.
Some implementations read aloud a line of ruby annotations first and then read the corresponding original line, which contains the base text. Such implementations provide incomprehensible results. Other implementations simply ignore lines of ruby annotation. Subsection 3.3 applies to these implementations.