This document points browser implementers and specification developers to information about how to support features of scripts or writing systems from around the world, and also points to relevant information in specifications, to tests, and to useful articles and papers. It is not exhaustive, and will be added to from time to time.

Sending comments on this document

If you wish to make comments regarding this document, please raise them as GitHub issues. Only send comments by email if you are unable to raise issues on GitHub (see links below). All comments are welcome.

To make it easier to track comments, please raise a separate issue or send a separate email for each comment, and point to the section you are commenting on using a URL for the dated version of the document.

Introduction

The W3C needs to make sure that the text layout and typographic needs of scripts and languages around the world are built into technologies such as HTML, CSS, and SVG, so that Web pages and eBooks can look and behave as people expect around the world.

To that end we have experts in various parts of the world documenting layout and typographic requirements and gaps between what is needed and what is currently supported in browsers and ebook readers. See a list of relevant work in this area.

Additional information and references are welcome; please suggest additions, clarifications, corrections, and other improvements using the GitHub issues list.

Characters & phrases

Punctuation

Many scripts use native punctuation marks in addition to or instead of those used in Latin script text. In other cases, such as Greek, common Latin punctuation marks may mean something different from what they mean in English. It may be important to understand what needs to be supported, how these punctuation marks function, and how they interact with other operations applied to the text.

Quotations

Quotation marks vary from language to language, not just from script to script. Also, you should expect variations in behavior when quotation marks are nested. Furthermore, the quotation marks used for vertical Japanese text are not the same as those typically used for the same text when horizontally laid out.
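
As a rough illustration of language-dependent and nested quotation marks, the following Python sketch selects marks by language and nesting depth. The QUOTES table and the quote helper are hypothetical names introduced here; the table covers only three languages, and real conventions (regional variants, vertical text, spacing) are more detailed than this.

```python
# Illustrative table: (open, close) pairs for outer and nested quotations.
# Only a few languages are listed; this is not a complete data set.
QUOTES = {
    "en": [("\u201c", "\u201d"), ("\u2018", "\u2019")],  # “ ” then ‘ ’
    "de": [("\u201e", "\u201c"), ("\u201a", "\u2018")],  # „ “ then ‚ ‘
    "ja": [("\u300c", "\u300d"), ("\u300e", "\u300f")],  # 「 」 then 『 』
}


def quote(text, lang, depth=0):
    """Wrap text in the marks for the given language and nesting depth."""
    primary = lang.split("-")[0].lower()
    pairs = QUOTES.get(primary, QUOTES["en"])  # assumption: fall back to "en"
    # Depths beyond the table reuse the innermost pair.
    open_mark, close_mark = pairs[min(depth, len(pairs) - 1)]
    return open_mark + text + close_mark
```

For example, quote("Zitat", "de") uses the outer German pair, while quote("引用", "ja", depth=1) uses the nested Japanese pair.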

Identifying boundaries of graphemes, words and larger groupings

A browser or application needs to correctly apply functions to the basic units of text, be they characters, character sequences, syllables, or words. Some scripts, such as those used in South and South-East Asia, require clusters of characters to be treated as a single unit for most editing operations. Many other scripts use combining characters such as accents, vowel signs, length markers, etc. that must be kept with the base character they are associated with.
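
The idea of keeping combining characters with their base can be sketched in Python as follows. This is a deliberately simplified approximation that attaches any character in a Unicode Mark category (Mn, Mc, Me) to the preceding character; it is not the full grapheme cluster algorithm, and it does not handle cases such as conjuncts formed with a virama. The function name simple_clusters is hypothetical.

```python
import unicodedata


def simple_clusters(text):
    """Group each base character with the combining marks that follow it."""
    clusters = []
    for ch in text:
        # General categories starting with "M" are combining marks.
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters
```

With this sketch, an "e" followed by U+0301 COMBINING ACUTE ACCENT is treated as one unit, as is a Devanagari consonant followed by a vowel sign, so an editing operation such as deletion or cursor movement could operate on the whole unit.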

When a user double-clicks on some text, the appropriate units should be selected. In scripts such as Chinese and Thai, 'words' should be selected even though they are not separated by spaces. In scripts such as Tibetan and Ethiopic, the word separator may be a visible character, rather than a space. It is important to understand how they should be treated when a 'word' is highlighted, or when text wraps, etc.

Glyph controls

In some scripts, such as Arabic, it may be desirable to allow the content author to control the placement of glyphs such as diacritics, or to control ligation, etc.

Transforming characters

Conversion between lower, upper and title case only applies to a few scripts; most scripts are unicameral. Where it does apply, the rules can vary by language.
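
A well-known case of language-dependent rules is the dotted and dotless i of Turkish and Azerbaijani. The Python sketch below shows one way to layer that rule over a language-neutral uppercase operation. The to_upper helper is a hypothetical name, and only this one tailoring is handled; other language-specific rules are out of scope.

```python
def to_upper(text, lang):
    """Uppercase text, applying the Turkish/Azerbaijani dotted-i rule."""
    primary = lang.split("-")[0].lower()
    if primary in ("tr", "az"):
        # In these languages "i" uppercases to U+0130 (capital I with dot).
        text = text.replace("i", "\u0130")
    # str.upper() is language-neutral; it already maps dotless U+0131 to "I".
    return text.upper()
```

The same input therefore gives different results depending on the language tag: to_upper("istanbul", "tr") begins with U+0130, while to_upper("istanbul", "en") begins with a plain "I".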

In other cases, a particular script may require a different type of transform. For example, in Japanese it is important to be able to convert between half-width and full-width presentation forms.
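
A minimal Python sketch of width conversion follows. It relies on two facts about Unicode: the fullwidth forms U+FF01 to U+FF5E mirror ASCII U+0021 to U+007E at a fixed offset, and NFKC normalization folds fullwidth Latin and halfwidth katakana to their ordinary forms. The function names are hypothetical, and NFKC also changes other compatibility characters, so a production transform would likely need a narrower mapping.

```python
import unicodedata

FULLWIDTH_OFFSET = 0xFEE0  # distance from "!" (U+0021) to "！" (U+FF01)


def to_fullwidth(text):
    """Map ASCII characters to their fullwidth presentation forms."""
    out = []
    for ch in text:
        if ch == " ":
            out.append("\u3000")  # ideographic space
        elif 0x21 <= ord(ch) <= 0x7E:
            out.append(chr(ord(ch) + FULLWIDTH_OFFSET))
        else:
            out.append(ch)
    return "".join(out)


def fold_width(text):
    """Fold width variants to ordinary forms (NFKC; affects more than width)."""
    return unicodedata.normalize("NFKC", text)
```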

Letter spacing

Many scripts create emphasis or other effects by spacing out the letters or syllables in a word. There are questions about how this should work in Indic and SE Asian scripts, and in Arabic-based scripts which join up adjacent letters.

Ruby annotation

Ruby is used for phonetic and semantic annotations of East Asian text, including furigana, pinyin and zhuyin fuhao systems. In addition to positioning annotations along the correct side of the base text, there are many fine adjustments of the annotation and base text to support.

Text decoration

Some aspects related to the drawing of lines alongside or through text involve local typographic considerations. For example, underlines need to be broken in special ways for some scripts, and the height of underlines, strike-through and overlines may vary depending on the script. For vertical text the placement needs to be to the right or left of the line of text, rather than under or over.

Emphasis

Bold and italic are not always appropriate for expressing emphasis, and some scripts have their own ways of expressing it that are not part of the Western tradition at all.

Initial letter styling

Does the browser or ereader correctly handle special styling of the initial letter of a line or paragraph, such as for drop caps?

Fonts

Some scripts require special handling with regard to how font properties are specified and how font resources are loaded dynamically.

Lines & paragraphs

Line breaking

There are some specific rules about how scripts such as Chinese, Japanese and Korean behave when a line is wrapped. For example, these scripts tend to break a line in the middle of a word (with no hyphenation) – even in Korean, which has spaces between words.

It is common for certain characters to be forbidden at the start or end of a line, but which characters these are, and which rules are applied in which circumstances, depends on the script or language. In some cases, such as Japanese, there may be different rules according to the type of content or the user's preference.
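
The following Python sketch illustrates one simple way to honor such prohibitions when wrapping text that has no spaces: if a break would leave a forbidden character at the start or end of a line, the break is moved earlier. The character sets are a small illustrative subset for Japanese, and the wrap function and its fixed character-count width are assumptions made for this example. Real implementations have other options as well, such as compressing or hanging punctuation.

```python
# Illustrative subsets only; real rule sets are larger and vary by style.
NO_LINE_START = set("、。，．」』）")
NO_LINE_END = set("「『（")


def wrap(text, width):
    """Break text into lines of at most `width` characters (width >= 1)."""
    lines = []
    start = 0
    while len(text) - start > width:
        end = start + width
        # Move the break earlier while it violates a start/end prohibition,
        # but always keep at least one character on the line.
        while end > start + 1 and (
            text[end] in NO_LINE_START or text[end - 1] in NO_LINE_END
        ):
            end -= 1
        lines.append(text[start:end])
        start = end
    lines.append(text[start:])
    return lines
```

For instance, wrapping "今日は晴れ。明日は雨。" at five characters would place the first ideographic full stop at the start of the second line, so the sketch ends the first line one character earlier instead.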

Hyphenation

Some scripts don't use hyphenation; those that do have particular rules about how it should be applied, and these rules are typically language-specific.

Justification & line-end alignment

Since the amount of content on a line tends to vary, even if minutely, from line to line within a paragraph, typographers have come up with various methods for effective full justification (causing the text to completely fill the line) in order to create visual alignment on both edges of a paragraph.

Typographic conventions for full text justification depend on the writing system, the content language, and the calligraphic style of the text. Results also tend to vary based on the capabilities of the layout engine and a given typographer’s preferences for weighing its various detrimental effects on typographic color and readability.

The overview that follows hints at some of the different strategies used in different writing systems, but the devil is very much in the detail. Furthermore, other factors, such as line-break rules, hyphenation, and other inline features, have to be taken into account during justification.

High-level overview of various approaches to full justification

Chinese Writing System (Han Ideographs)

Historically, Chinese was written as Han ideographs, with no punctuation. Under this system, justification was automatic, as the characters fit perfectly into a square grid. However, the introduction of punctuation in recent centuries, plus the increase in mixed-script text (such as the inclusion of European numbers and/or words, phrases, names, and trademarks) has created a need for adjustments within a line.

Chinese notably does not use word spaces, so these do not provide a justification opportunity within the lines; thus justification techniques focus on adjustments to spacing around punctuation, script-change boundaries, and inter-character spacing.

Japanese Writing System

Like Chinese, Japanese was historically written in Han ideographs; however it has since developed its own phonetic scripts Hiragana and Katakana (collectively, Kana). While pure kana texts do exist, particularly in children’s literature, Han ideographs (Kanji, in Japanese) continue to be an integral part of normal Japanese text, and are interspersed with kana within a sentence.

Like Chinese, Japanese has embraced European-inspired punctuation, numerals, and other foreign snippets that don’t conform to the standard full-width character grid. The Japanese writing system also does not use word spaces, and similarly focuses on adjustments to spacing around punctuation, script-change boundaries, and inter-character spacing, with a notable preference for compression of intra-glyph spacing over expansion between glyphs.

Korean Writing System

Like Japanese, Korean was historically written in pure Han ideographs, and has since developed its own phonetic script, Hangul. Also like Japanese, it has adopted punctuation and numerals. However, unlike Japanese, Korean has also adopted word spaces, and tends towards narrow (Western-style, rather than full-width) punctuation. This allows it to use inter-word justification: as in English publications, this method stretches the spaces between words in order to fill the line.

While Han ideographs (Hanja, in Korean) were kept as part of the writing system, they have become increasingly scarce over time such that many documents are written in pure Hangul, and some only use Hanja as inline annotations for disambiguation among homophones rather than as part of the main text. However, Hanja and Hangul together remain important components of Korean writing.

Latin (Roman) Writing System

Quite possibly the writing system familiar to more people than any other, the Latin writing system derives from the Roman alphabet, including a few additional characters and diacritic marks to accommodate languages such as Icelandic and modern Vietnamese. Thanks to the Europeans in the Age of Exploration, their missionaries, and the Western-dominated global scholastic culture of the modern age, most languages in the world have one or more Latin transcriptions, even those that do not use it as their primary writing system.

The Latin alphabet is a phonetic system with disjoint letterforms, and typically uses spaces between words. This allows it to use inter-word justification, although it can and sometimes does increase the spacing between individual letters as well. Since it is frequently adopted into other writing systems, it can sometimes adopt characteristics of that system; for example, some styles of Japanese typesetting treat Latin letters the same as Japanese characters for the purpose of line-breaking and justification.
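
Inter-word justification can be sketched in Python as below. The sketch assumes a monospaced model in which every character and space occupies one cell, which is a simplification: a real layout engine distributes fractional widths and may also adjust letter spacing. The justify function is a hypothetical helper.

```python
def justify(words, width):
    """Stretch the gaps between words so the line is `width` cells long."""
    gaps = len(words) - 1
    total_chars = sum(len(word) for word in words)
    if gaps <= 0 or total_chars + gaps > width:
        # Nothing to stretch, or the line is already too long to justify.
        return " ".join(words)
    # Share the free cells among the gaps; leftmost gaps absorb the remainder.
    spaces, extra = divmod(width - total_chars, gaps)
    parts = []
    for index, word in enumerate(words[:-1]):
        parts.append(word + " " * (spaces + (1 if index < extra else 0)))
    parts.append(words[-1])
    return "".join(parts)
```

Giving the remainder to the leftmost gaps is an arbitrary choice made here for brevity; typographers generally prefer an even distribution.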

Ethiopic Writing System

Like Latin, the Ethiopic writing system uses an alphabet of disjoint letters and uses punctuation to indicate the break between words. Unlike Latin, Ethiopic traditionally uses a visible word separator, the Ethiopic Word Space U+1361 “፡”, although modern documents sometimes use a regular space U+0020 “ ” instead. Justification strategies are as for Latin: increasing the space at the word separator, and/or distributing space between letters.

Arabic Writing System (and Other Cursive Systems)

Arabic is a cursive script, meaning its letters are typically joined together within a word. This creates additional challenges, as the usual method for stretching out text (inserting spaces between glyphs) does not work.

Since Arabic uses spaces between words, one method for justification is inter-word justification: stretching out the spaces within the line to fill it. However, most styles of Arabic writing prefer calligraphic elongation or compression, distorting the shapes and connections between letters in order to fill the line while preserving its typographic color. This is often called “kashida”, meaning “stretched”. A simplistic variant of this technique inserts elongation marks (sometimes represented with U+0640 “ـ” TATWEEL) at appropriate points in the text.

Syriac and Mongolian have properties similar to Arabic, and in the absence of additional information should be given similar treatment for justification.

Tibetan Writing System

Tibetan is a Brahmic writing system related to Indic scripts like Devanagari and Gujarati; however, unlike these systems, it uses neither Western-style punctuation nor spaces between words. Instead it uses the Tibetan Tsheg Mark U+0F0B “་” between syllables and its own punctuation marks, such as the Tibetan Shad U+0F0D “།” and Tibetan Nyis Shad U+0F0E “༎”, which indicate the end of longer segments.

Justification techniques used in Tibetan include stretching the space after a shad, minutely increasing the spaces after tsheg marks, and simply filling the remaining space on a line with tsheg marks.

Southeast Asian Writing Systems

In Southeast Asian systems such as Thai and Lao, letters are merged together into “clusters”. There are no spaces between words (lines must be broken by dictionary), but spaces serve to separate larger units of text.

Techniques for justification include stretching spaces on the line (if it happens to have any) and interspersing extra space between clusters.

Scripts in this category include Khmer, Myanmar, Lao, and Thai.

Other Writing Systems

Most (but not all) writing systems not mentioned here have discrete letters, like Latin, and in the absence of more specific information may be assumed to justify in a similar manner.

Note: Readers who wish to provide such “more specific information” are invited (and strongly encouraged) to contact the W3C Internationalization Working Group so that this document may be updated.

Advice for implementers and authors

In this section we provide additional advice for implementers related to justification.

Tagging Content By Writing System

While most languages have a preferred writing system, many can be transcribed into a different system. As a common example, most languages have a Latin transcription, and can thus be written in the Latin writing system. In these cases the document typically adopts the typographic conventions of the Latin writing system: for example Japanese “romaji” and Chinese Pinyin use word spaces and justify accordingly. As another example, historical ideographic Korean (ko-Hant) does not use word spaces, and should therefore be justified as for Chinese.

Authors can indicate the use of the Latin writing system with the -Latn language subtag, e.g. ja-Latn for Japanese romaji. Other subtags exist for other writing systems, see ????. Some common/historical examples follow:

zh-Latn
Chinese, written in Latin transcription
ko-Hant
Korean, written in Hanja (Chinese ideographic characters)
??-Arab
Turkish, written in Arabic script
??-???
Mongolian, written in Cyrillic
??-???
Mongolian, written in traditional Mongolian script.

User agents (UAs) should assume the most common writing system for a given language when choosing a justification strategy, but must not assume that writing system if the author has explicitly indicated a different one.
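
The rule above can be sketched in Python: use an explicit script subtag if the language tag has one, and otherwise fall back to a default for the language. The DEFAULT_SCRIPT table is a small hypothetical sample rather than an authoritative mapping, and the parsing is simplified (it treats the first four-letter alphabetic subtag after the language as the script and stops at an extension or private-use singleton).

```python
# Hypothetical sample of default scripts per language; not authoritative.
DEFAULT_SCRIPT = {
    "ja": "Jpan",
    "zh": "Hani",
    "ko": "Kore",
    "en": "Latn",
    "ar": "Arab",
}


def script_of(tag):
    """Return the explicit script subtag of a language tag, else a default."""
    subtags = tag.replace("_", "-").split("-")
    language = subtags[0].lower()
    for sub in subtags[1:]:
        if len(sub) == 1:
            break  # singleton: extension or private-use section begins
        if len(sub) == 4 and sub.isalpha():
            return sub.title()  # script subtags are conventionally title case
    return DEFAULT_SCRIPT.get(language)  # None if the language is not listed
```

A justification routine could then choose its strategy from the returned script, so that text tagged ja-Latn is handled as Latin rather than as Japanese.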

Justifying Untagged Content

Web browsers frequently have to deal with untagged, potentially mixed-script content. The following are some guidelines for designing a strategy to deal with such content.

  • Since Chinese and Japanese do not use spaces to provide justification opportunities, CJK content (Han, Hiragana, Katakana, and Hangul) should be allowed to accept inter-character spacing.
  • Since Japanese content prefers compression, CJK fullwidth punctuation characters, if present on a line, should be compressed at a higher priority (if possible) than expanding spaces or letter-spacing.
  • Since Korean prefers expanding spaces to expanding between characters, spaces should be expanded at a higher priority (if possible) than letter-spacing.
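
The guidelines above can be read as an ordered list of adjustments to try on a line. The Python sketch below inspects an untagged line and reports which of the three adjustments are available, in the priority order the guidelines suggest. The function names and the strategy labels are hypothetical, and detecting CJK characters by Unicode character name is a shortcut chosen for brevity, not a recommended production technique.

```python
import unicodedata

CJK_NAME_PREFIXES = ("CJK UNIFIED IDEOGRAPH", "HIRAGANA", "KATAKANA", "HANGUL")


def is_cjk(ch):
    """True for Han, Hiragana, Katakana and Hangul characters (by name)."""
    return unicodedata.name(ch, "").startswith(CJK_NAME_PREFIXES)


def is_fullwidth_punct(ch):
    """True for punctuation with East Asian Width Fullwidth or Wide."""
    return (
        unicodedata.category(ch).startswith("P")
        and unicodedata.east_asian_width(ch) in ("F", "W")
    )


def justification_plan(line):
    """List the available adjustments for a line, highest priority first."""
    plan = []
    if any(is_fullwidth_punct(ch) for ch in line):
        plan.append("compress-fullwidth-punctuation")
    if " " in line:
        plan.append("expand-spaces")
    # Inter-character spacing applies between two adjacent CJK characters.
    if any(is_cjk(a) and is_cjk(b) for a, b in zip(line, line[1:])):
        plan.append("inter-character-spacing")
    return plan
```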

Authors should use (correct) language tags in order to get the best possible typographic behavior. For example, if Japanese text is tagged as Japanese, the UA knows to preferentially compress the space rather than expand it.

Counters, lists, etc

The CSS specification describes a set of simple and complex styles for counters to be used in list numbering, chapter heading numbering, etc. It also provides a generic mechanism for content authors to create their own counter styles. One has to consider not only the characters and algorithms to be used (numeric, alphabetic, additive, etc), but also what the separator or other associated marks look like.
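
Two of the counter algorithms mentioned above, alphabetic and additive, can be sketched in Python as follows. The function names are hypothetical, the symbol lists are passed in by the caller, and details such as ranges, fallbacks, prefixes and suffixes are omitted.

```python
def alphabetic(value, symbols):
    """Bijective numbering: a, b, ..., z, aa, ab, ... (value must be >= 1)."""
    if value < 1:
        raise ValueError("alphabetic counters cannot represent values below 1")
    out = ""
    base = len(symbols)
    while value:
        value -= 1  # shift so the first symbol stands for 1, not 0
        out = symbols[value % base] + out
        value //= base
    return out


def additive(value, weighted_symbols):
    """Greedy additive numbering; pairs must be sorted by descending weight."""
    out = ""
    for weight, symbol in weighted_symbols:
        count, value = divmod(value, weight)
        out += symbol * count
    if value:
        raise ValueError("value cannot be represented with these symbols")
    return out


ROMAN = [
    (1000, "M"), (900, "CM"), (500, "D"), (400, "CD"), (100, "C"),
    (90, "XC"), (50, "L"), (40, "XL"), (10, "X"), (9, "IX"),
    (5, "V"), (4, "IV"), (1, "I"),
]
```

The same two functions can serve other scripts by swapping the symbol list, for example the 24 lowercase Greek letters for an alphabetic style.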

Bidirectional text direction

Scripts whose characters are typically written right-to-left, like Arabic, Hebrew, Thaana, and so on, become bidirectional when they include numbers or text from other scripts (such as Latin acronyms). Browsers and applications need to support bidirectionality. This means supporting the Unicode Bidirectional Algorithm, but also different visual locations of line start and end, isolation of embedded strings, correct line alignment, and so forth.
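
One small piece of bidirectional support, estimating the base direction of a string from its first strongly directional character, can be sketched in Python with the standard unicodedata module. This is a simplification: it ignores isolates and embeddings, and the base_direction name and the "ltr" default are assumptions made for this example.

```python
import unicodedata


def base_direction(text, default="ltr"):
    """Return "ltr" or "rtl" based on the first strong directional character."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):  # Hebrew-like and Arabic-like letters
            return "rtl"
    # No strong character found (e.g. digits or punctuation only).
    return default
```

Digits are not strongly directional, so a Hebrew string that begins with a number is still detected as right-to-left.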

Baselines & inline alignment

Browsers and applications must accurately and comprehensively cover requirements for baseline alignment between mixed scripts. For example, Arabic script descenders go far below those of the Latin script, and Armenian characters need to be aligned appropriately with Chinese ideographic characters with regard to comparative heights and baselines. European, Far Eastern and South Asian scripts tend to use different baselines, which must be aligned correctly.

Other paragraph features

Some scripts have particular rules about indenting text at the start of a paragraph, or about whether indentation is usual at all. Some allow punctuation to hang outside the text box at the start or end of a line. There may be other aspects of how paragraphs are presented that vary from script to script, or need to be controlled by the content author.

Layout & pages

Vertical text

There are special requirements for vertically oriented text. For example, it's common for content authors to want to mix short horizontal runs of text, such as 2-digit numbers, in a vertical column (tate chu yoko). It's also important to provide appropriate support for text in scripts that are normally only horizontal.
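
As an illustration of the tate chu yoko case, the Python sketch below finds the short digit runs in a string that would be candidates for horizontal setting within a vertical line. It selects maximal runs of ASCII digits no longer than a given limit; the function name and the default limit of two digits are assumptions made for this example.

```python
import re


def combinable_digit_runs(text, max_digits=2):
    """Return maximal ASCII digit runs of at most `max_digits` characters."""
    # Lookarounds ensure a run is not part of a longer digit sequence.
    pattern = re.compile(r"(?<![0-9])[0-9]{1,%d}(?![0-9])" % max_digits)
    return pattern.findall(text)
```

In a date such as 平成20年4月1日 the runs 20, 4 and 1 qualify, whereas a four-digit year does not and would be laid out by some other rule.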

Notes, footnotes, etc

Support for notes, footnotes, endnotes or other necessary annotations of this kind may vary in other cultures. In some cases, a script may use a very idiosyncratic approach to represent notes inline or to link to footnotes.

Page numbering, running headers, etc

This section concerns conventions for managing the content that appears outside the main text block, for example page numbering, or the way that running headers and the like are handled.

More page layout and pagination

Some cultures define page areas and page progression direction very differently from those in the West. For example, the size of the Japanese kihon-hanmen, or main text block, is traditionally established by counting character cells, and margin space is then defined by the remaining space. In right-to-left scripts, pages also progress from right to left.

Changes Since the Last Published Version

See the GitHub commit log for more details.