This document describes the best practices for identifying language and base direction in data formats used on the Web.

Sending comments on this document

If you wish to make comments regarding this document, please raise them as GitHub issues. Only send comments by email if you are unable to raise issues on GitHub (see links below). All comments are welcome.

To make it easier to track comments, please raise separate issues or emails for each comment, and point to the section you are commenting on using a URL for the dated version of the document.

Introduction

This document was developed as a result of observations by the Internationalization Working Group over a series of specification reviews related to formats based on JSON, WebIDL, and other non-markup data languages. Unlike markup formats, such as XML, these data languages generally do not provide extensible attributes and were not conceived with built-in language or direction metadata.

Natural language information on the Web depends on and benefits from the presence of language and direction metadata. Along with support for Unicode, mechanisms for including and specifying the base direction and the natural language of spans of text are among the key internationalization considerations when developing new formats and technologies for the Web.

Markup formats, such as HTML and XML, as well as related styling languages, such as CSS and XSL, are reasonably mature and provide support for the interchange and presentation of the world's languages via built-in features. Data formats need similar mechanisms in order to ensure a complete and consistent support for the world's languages and cultures.

Terminology

This section defines terminology necessary to understand the contents of this document. The terms defined here are specific to this document.

A producer is any process where natural language string data is created for later storage, processing, or interchange.

A consumer is any process that receives natural language strings, either for display or processing.

A serialization agreement (or "agreement" for short) is the common understanding between a producer and consumer about the serialization of string metadata: how it is to be understood, serialized, read, transmitted, removed, etc.

Language negotiation is any process which selects or filters content based on language. Usually this implies selecting content in a single language (or falling back to some meaningful default language that is available) by finding the best matching values when several languages or locales [[LTLI]] are present in the content. Some common language negotiation algorithms include the Lookup algorithm in [[BCP47]] or the BestFitMatcher in [[ECMA-402]].
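
As a rough illustration of the Lookup approach, matching proceeds by truncating subtags from the requested range until an available tag matches. The sketch below is a simplification, not a conformant [[BCP47]] implementation (it ignores wildcard ranges, extensions, and canonicalization):

// Simplified BCP 47 Lookup-style matching: lowercase for comparison,
// then repeatedly drop the last subtag of the requested range until
// one of the available tags matches.
function lookup(priority: string[], available: string[], defaultTag: string): string {
  const avail = new Set(available.map(t => t.toLowerCase()));
  for (const range of priority) {
    let tag = range.toLowerCase();
    while (tag.length > 0) {
      if (avail.has(tag)) return tag;
      const cut = tag.lastIndexOf("-");
      if (cut < 0) break;
      tag = tag.slice(0, cut);
      // Also drop an orphaned single-letter subtag (e.g. the "x" in "de-x").
      if (tag.length > 1 && tag[tag.length - 2] === "-") tag = tag.slice(0, tag.length - 2);
    }
  }
  return defaultTag;
}

lookup(["zh-Hant-TW", "ja"], ["ja", "zh-Hans", "en"], "en"); // returns "ja"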

LTR stands for "left-to-right" and refers to the inline base direction of left-to-right [[!UAX9]]. This is the base text direction used by languages whose starting character progression begins on the left side of the page in horizontal text. It's used for scripts such as Latin, Cyrillic, Devanagari, and many others.

RTL stands for "right-to-left" and refers to the inline base direction of right-to-left [[!UAX9]]. This is the base text direction used by languages whose starting character progression begins on the right side of the page in horizontal text. It's used for scripts such as Arabic, Hebrew, Syriac, and a few others.

If you are unfamiliar with bidirectional or right-to-left text, there is a basic introduction here. Additional materials can be found in the Internationalization Working Group's Techniques Index.

The String Lifecycle

It's not possible to consider alternatives for handling string metadata in a vacuum: we need to establish a framework for talking about string handling and data formats.

Producers

A string can be created in a number of ways, including a content author typing strings into a plain text editor, text message, or editing tool; a script scraping text from web pages; or the acquisition of an existing set of strings from another application or repository. In the data formats under consideration in this document, many strings come from back-end data repositories or databases of various kinds. Sources of strings often provide an interface, API, or metadata that includes information about the base direction and language of the data. Some also provide a suitable default for when the direction or language is not provided or specified. In this document, the producer of a string is the source, whether human or mechanical, that creates or provides a string for storage or transmission.

When a string is created, it's necessary to (a) detect or capture the appropriate language and base direction to be associated with the string, and (b) take steps, where needed, to set the string up in a way that stores and communicates the language and base direction.

For example, in the case of a string that is extracted from an HTML form, the base direction can be detected from the computed direction of the form's field. Such a value could be inherited from an earlier element, such as the html element, or set using markup or styling on the input element itself. The user could also set the direction of the text by using keyboard shortcut keys to change the direction of the form field. The dirname attribute provides a way of automatically communicating that value with a form submission.
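
For illustration, a form using dirname might look like the following sketch (the field names and action URL are hypothetical); the browser submits the field's computed direction alongside its value:

   <form method="post" action="/submit">
     <!-- dirname asks the browser to submit the field's computed
          direction ("ltr" or "rtl") together with its value,
          here as the parameter comment.dir -->
     <input type="text" name="comment" dirname="comment.dir">
     <button>Send</button>
   </form>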

Similarly, language information in an HTML form would most likely be inherited from the lang attribute on the html tag, or any element in the tree with a lang attribute.

If the producer of the string is receiving the string from a location where it was stored by another producer, and where the base direction and language have already been established, the producer needs to understand that the language and base direction have already been set, and convert or encode that information for its consumers.

Consumers

A consumer is an application or process that receives a string for processing and possibly places it into a context where it will be exposed to a user. For display purposes, it must ensure that the base direction and language of the string is correctly applied to the string in that context. For processing purposes, it must at least persist the language and direction and may need to use the language and direction data in order to perform language-specific operations.

Displaying the string usually involves applying the base direction and language by constructing additional markup, adding control codes, or setting display properties. This indicates to rendering software the base direction or language that should be applied to the string in this display context to get the string to appear correctly. For text direction, it must also isolate embedded strings from the surrounding text to avoid spill-over effects of the bidi algorithm [[UAX9]]. For language, it must make clear the boundaries for the range of text to which the language applies.

Note that a consumer of one document format might be a producer of another document format.

Serialization Agreements

Between any producer and consumer, there needs to be an agreement about what the document format contains and what the data in each field or attribute means. Any time a producer of a string takes special steps to collect and communicate information about the base direction or language of that string, it must do so with the expectation that the consumer of the string will understand how the producer encoded this information. If no action is taken by the producer, the consumer must still decide what rules to follow in order to decide on the appropriate base direction and language, even if it is only to provide some form of default value.

In some systems or document formats, the necessary behaviour of the producers and consumers of a string is fully specified. In others, such agreements are not available; it is up to users to provide an agreement for how to encode, transmit, and later decode the necessary language or direction information. Low-level specifications, such as JSON, do not provide a string metadata structure by default, so any document formats based on these need to provide the "agreement" themselves.

Why is this important?

Information about the language of content is important when processing and presenting natural language data for a variety of reasons. When language information is not present, the resulting degradation in appearance or functionality can frustrate users, render the content unintelligible, or disable important features. Affected processes include font selection, hyphenation and line breaking, case conversion, spell checking, voice selection in text-to-speech, and language-sensitive searching and indexing.

Similarly, direction metadata is important to the Web. When a string contains text in a script that runs right-to-left (RTL), it must be possible to eventually display that string correctly when it reaches an end user. For that to happen, it is necessary to establish what base direction needs to be applied to the string as a whole. The appropriate base direction cannot always be deduced by simply looking at the string; even if it were possible, the producer and consumer of the string would need to use the same heuristics to interpret its direction.

Static content, such as the body of a Web page or the contents of an e-book, often has language or direction information provided by the document format or as part of the content metadata. Data formats found on the Web generally do not supply this metadata. Base specifications such as Microformats, WebIDL, JSON, and others have tended to store natural language text in plain string objects, without additional metadata.

This places a burden on application authors and data format designers to provide the metadata on their own initiative. When standardized formats do not address the resulting issues, the result can be that, while the data arrives intact, its processing or presentation cannot be wholly recovered.

In a distributed Web, any consumer can also be a producer for some other process or system. Thus, a given consumer might need to pass language and direction metadata from one document format (and using one agreement) to another consumer using a different document format. Lack of consistency in representing language and direction metadata in serialization agreements poses a threat to interoperability and a barrier to consistent implementation.

An Example

Suppose that you are building a Web page to show a customer's library of e-books. The e-books exist in a catalog of data and consist of the usual data values. A JSON file for a single entry might look something like:

{
    "id": "978-0-1234-5678-X",
    "title": "Moby Dick",
    "authors": [ "Herman Melville" ],
    "language": "en-US",
    "pubDate": "1851-10-18",
    "publisher": "Mark Twain Press",
    "coverImage": "https://example.com/images/mobidick_cover.jpg",
    // etc.
}

Each of the above is a data field in a database somewhere. There is even language information about the contents of the book ("language": "en-US").

A well-internationalized catalog would include metadata in addition to what is shown above. That is, for each of the fields containing natural language text, such as the title and authors fields, there should be language and base direction information stored as metadata. (There may be other values as well, such as pronunciation metadata used when sorting East Asian text.) These metadata values are used by consumers of the data to influence the processing and enable the display of the items in a variety of ways. Because the JSON data structure provides no place to store or exchange these values, it is more difficult to construct internationalized applications.

One work-around might be to encode the values using a mix of HTML and Unicode bidi controls, so that a data value might look like one of the following:

   "title": "<span lang='en-US' dir='ltr'>Mobi Dick</span>"
   "authors": [ "\u200eHerman Melville" ], // contains LRM as first character

But JSON is a data interchange format: the content might not end up with the title field being displayed in an HTML context. The JSON above might very well be used to populate, say, a local data store for an application that shows the title using native controls, and those controls will treat the HTML as literal string contents. Producers and consumers of the data might not expect to introspect the data in order to supply or remove the extra data or to expose it as metadata. Most JSON libraries don't know anything about the structure of the content that they are serializing. Producers want to generate the JSON file directly from a local data store, such as a database. Consumers want to store or retrieve the value for use without additional consideration of the content of each string. In addition, either producers or consumers can have other considerations, such as field length restrictions, that are affected by the insertion of additional controls or markup. Each of these considerations places a special burden on implementers to create arbitrary means of serializing, deserializing, managing, and exchanging the necessary metadata, with interoperability as a casualty along the way.

Additional Requirements for Localization

The above example shows a data record available in a single language. Some applications might require the ability to send multiple languages for the same field, such as when localizing an application or when multilingual data is available. This is particularly true when the producer needs to support consumers that perform their own language negotiation or when the consumer cannot know which language or languages will be selected for display.

Serialization agreements to support this therefore need to represent several different language variations of the same field. For instance, in the example above the values title or description might each have translations available for display to users who speak a language other than English. Or an application might have localized strings that the consumer can select at runtime. In some cases, all language variations might be shown to the user. In other cases, the different language values might be matched to user preferences as part of language negotiation to select the most appropriate language to show.

When multiple language representations are possible, a serialization might provide a means (defined in the specification for that document format) for setting a default value for language or direction for the whole of the document. This allows the serialized document to omit language and direction metadata from individual fields in cases where they match the default.

Isn't Unicode Enough?

[[!Unicode]] and its character encodings (such as UTF-8) are key elements of the Web and its formats. They provide the ability to encode and exchange text in any language consistently throughout the Internet. However, Unicode by itself does not guarantee perfect presentation and processing of natural language text, even though it does guarantee perfect interchange.

Several features of Unicode are sometimes suggested as part of the solution to providing language and direction metadata. Specifically, Unicode bidi controls are suggested for handling direction metadata. In addition, there are "tag" characters in the U+E0000 block of Unicode originally intended for use as language tags (although this use is now deprecated).

There are a variety of reasons why the addition of characters to data in an interchange format is not a good idea. These include:

  • the added characters change the identity and content of the string, affecting operations such as comparison and truncation as well as any length restrictions
  • producers must introspect each string in order to insert the characters, and consumers must know to find, interpret, and possibly remove them
  • support for the characters in display and processing software is inconsistent, so they may be rendered or counted when they should not be
  • the layers of code that build and serialize document formats are expected to store and retrieve the data they are passed faithfully, without alteration

This last consideration is important to call out: document formats are often built and serialized using several layers of code. Libraries, such as general purpose JSON libraries, are expected to store and retrieve faithfully the data that they are passed. Higher-level implementations also generally concern themselves with faithful serialization and de-serialization of the values that they are passed. Any process that alters the data itself introduces variability that is undesirable. For example, consider an application's unit test that checks if the string returned from the document is identical to the one in the data catalog used to generate the document. If bidi controls, HTML markup, or Unicode language tags have been inserted, removed, or changed, the strings might not compare as equal, even though they would be expected to be the same.

Best Practices for Communicating Language and Direction

This section contains the Best Practices as identified by the Internationalization Working Group. [[!RFC2119]] keywords have their usual meaning.


The main issue is how the producer of a string encodes, and the consumer of that string finds and interprets, the language-related information that ought to be applied when the string is eventually processed or displayed to the user. This section describes the current best practices, as well as several alternatives that were considered (with reasons why they are not considered best practice).

Current Best Practices

Specifying Metadata in Document Formats

The TAG and I18N WG are currently discussing what the best practice recommendations should be. This subsection represents our current understanding.

Make each natural language string field Localizable. For JSON documents, this uses the WebIDL "dictionary" defined below: a pre-built extension that can be used commonly across different document formats using the same base field names and values.

By defining the language and direction in a WebIDL dictionary form, specifications can incorporate language and direction metadata for a given String value succinctly. Implementations can reuse the dictionary definition straightforwardly.
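
A sketch of what such a dictionary might look like follows; the member names value, lang, and dir are illustrative, pending the outcome of the discussion noted above:

dictionary Localizable {
  DOMString value;  // the natural language string itself
  DOMString lang;   // a [[BCP47]] language tag
  DOMString dir;    // the base direction: "ltr", "rtl", or "auto"
};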

Serialized files that use the dictionary and its data values will contain additional fields and can be more difficult to read as a result. Here's a sketch of what an entry using the Localizable fields might look like:
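
   "title": {
       "value": "Moby Dick",
       "lang": "en",
       "dir": "ltr"
   }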

If an application needs to provide for language negotiation or if the data is available in multiple languages, the Localizable strings can further be organized into arrays with multiple languages for the same value. A simple example might look like this:
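
   "title": [
       { "value": "Moby Dick", "lang": "en", "dir": "ltr" },
       { "value": "موبي ديك", "lang": "ar", "dir": "rtl" }   // illustrative Arabic translation
   ]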

A common use for multiple language values is to enable runtime selection of the language. Altering the format above slightly allows for fast selection of the appropriate language from the array of available languages:
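
   "title": {
       "en": { "value": "Moby Dick", "dir": "ltr" },
       "ar": { "value": "موبي ديك", "dir": "rtl" }   // entries indexed by language tag
   }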

Specifications and document formats MAY provide for a default language and default direction for an overall document. This might be helpful when performing language negotiation on several levels and there is a desire to use the same default language. It can also be helpful when a document is known to be in a single language or have a single expected base direction and the additional serialization complexity of using Localizable can thus be avoided.

Following text is new and still speculative.

Interoperability is enhanced when specifications all use the same attribute name for the default language and default base direction. The name @language is RECOMMENDED as the name of the default language value and @dir as the default direction value. [[!JSON-LD]] defines a mechanism for indicating the default language for a given scope or @context in a JSON document. The name @language was chosen for consistency with JSON-LD.
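
For example, a document using these defaults might look like this (a sketch reusing fields from the earlier catalog entry):

   {
       "@language": "en-US",
       "@dir": "ltr",
       "id": "978-0-1234-5678-X",
       "title": "Moby Dick",
       "authors": [ "Herman Melville" ]
   }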

Bidi Isolation and Isolating Controls in Strings

Do not produce or require bidi isolating controls by default. Bidi isolating controls are plain text control characters that can be used to indicate that a span of text should be "isolated" from the surrounding bidirectional context. A frequent question is whether content management systems or document formats should include these characters by default around strings that can appear in multiple contexts. An example of this would be a localizable string table for an application. Since the content author cannot know the bidi context in which the string will be displayed in advance, providing the controls could help insulate the strings from improper display later.

HTML5 [[HTML5]] introduced isolation at the element level by default. This allows text to be inserted into an HTML context without the need for isolating controls. However, not all strings appear in an HTML context, and strings used in plain text or other display contexts are not guaranteed the isolating behavior that HTML provides.

While some strings can certainly benefit from using isolating bidi controls, consistent usage can produce layers of overhead, processing, and validation that are unnecessary. Use of these controls should be reserved for cases in which the assembly and presentation of the text depends on runtime directional determination. For example, isolating controls can be included around a variable name in a string whose contents will be determined at runtime.
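
For example, a message template might use FSI (U+2068) and PDI (U+2069) to isolate a runtime-supplied value (a sketch; the {username} placeholder syntax is hypothetical):

   "message": "Signed in as \u2068{username}\u2069"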

The advantages of using isolating controls around a string are that the string can then be inserted into any context (that understands isolating controls) without additional processing.

The disadvantages of using isolating controls around a string are several. For most text the controls are superfluous and contribute to storage and processing overhead. The controls also affect the length of the string. Operations such as string truncation need to keep the controls paired.

Requirements and Use Cases

This section of the document describes in depth the need for language and direction metadata and various use cases helpful in understanding the best practices and alternatives listed above.

Identifying the Language of Content

Definitions

Language metadata typically indicates the intended linguistic audience or user of the resource as a whole, and it's possible to imagine that this could, for a multilingual resource, involve a property value that is a list of languages. A property that is about language metadata may have more than one value, since it aims to describe all potential users of the information.

The text-processing language is the language of a particular range of text (which could be a whole resource or just part of it). A property that represents the text-processing language needs to have a single value, because it describes the text content in such a way that tools such as spell-checkers, default font applicators, hyphenation and line breakers, case converters, voice browsers, and other language-sensitive applications know which set of rules or resources to apply to a specific range of text. Such applications generally need an unambiguous statement about the language they are working on.

Language Tagging Use Cases

Kensuke is reading an old Tibetan manuscript from the Dunhuang collection. The tool he is using to read the manuscript has access to annotations created by scholars working in the various languages of the International Dunhuang Project, who are commenting on the text. The section of the manuscript he is currently looking at has commentaries by people writing in Chinese, Japanese, and Russian. Each of these commentaries is stored in a separate annotation, but the annotations point to the same point in the target document. Each commentary is mainly written in the language of the scholar, but may contain excerpts from the manuscript and other sources written in Tibetan, as well as quoted text in Chinese and English. Some commentaries may contain parallel annotations, each in a different language. For example, there are some with the same text translated into Japanese, Chinese and Tibetan.

Kensuke speaks Japanese, so he generally wants to be presented with the Japanese commentary.

Capturing the language of the audience

The annotations containing the Japanese commentary have a language property set to "ja" (Japanese). The tool he is using knows that he wants to read Japanese commentaries, and it uses this information to select and present to him the text contained in that body. This is language information being used as metadata about the intended audience – it indicates to the application doing the retrieval that the intended consumer of the information wants Japanese.

Some of the annotations contain text in more than one language. For example, there are several with commentary in Chinese, Japanese and Tibetan. For these annotations, it's appropriate to set the language property to "ja,zh,bo", indicating that Japanese, Chinese, and Tibetan readers may all want to find it.

The language tagging that is happening here is likely to be at the resource level, rather than the string level. It's possible, however, that the text-processing language for strings inside the resource may be assumed by looking at the resource level language tag – but only if it is a single language tag. If the tag contains "ja,zh,bo" it's not clear which strings are in Japanese, which are in Chinese, and which are in Tibetan.

Capturing the text-processing language

Having identified the relevant annotation text to present to Kensuke, his application then has to display it so that he can read it. It's important to apply the correct font to the text. In the following example, the first line is labeled ja (Japanese) and the second zh-Hant (Traditional Chinese). The characters on both lines are the same code points, but they demonstrate systematic differences between how those and similar code points are rendered in Japanese vs. Chinese fonts. It's important to associate the right forms with the right language, otherwise you can make the reader uncomfortable or possibly unhappy.

ja:      雪, 刃, 直, 令, 垔
zh-Hant: 雪, 刃, 直, 令, 垔

So, it's important to apply a Japanese font to the Japanese text that Kensuke is reading. There are also language-specific differences in the way text is wrapped at the end of a line. For these reasons we need to identify the actual language of the text to which the font or the wrapping algorithm will be applied.

Another consideration that might apply is the use of text-to-speech. A voice browser will need to know whether to use Japanese or Chinese pronunciations, voices, and dictionaries for the ideographic characters contained in the annotation body text.

Various other text rendering or analysis tools need to know the language of the text they are dealing with. Many different types of text processing depend on information about the language of the content in order to provide the proper processing or results and this goes beyond mere presentation of the text. For example, if Kensuke wanted to search for an annotation, the application might provide a full text search capability. In order to index the words in the annotations, the application would need to split the text according to word boundaries. In Japanese and Chinese, which do not use spaces in-between words, this often involves using dictionaries and heuristics that are language specific.

We also need a way to indicate the change of language to Chinese and Tibetan later in the commentary for some annotations, so that appropriate fonts and wrapping algorithms can be applied there.

Other Approaches Considered

The above Best Practices are based on discussion with TAG, implementation in several W3C Specifications, and the recommendations of the Internationalization Working Group. Other approaches to identifying language in document formats have been used occasionally or have been proposed in the past. Each is described below.

Require HTML or XML for content

One proposal from members of the Annotation WG was to require HTML/XML formats for such annotation bodies, and use the lang or xml:lang attributes in markup to denote the language changes. (This approach would also apply to base direction by using HTML's built-in dir attribute; there is no built-in attribute in XML.)

This proposal can be useful for agreements that support the interchange of HTML or XML markup data.

The benefit for content that already uses markup is clear. The content will already provide complete markup necessary for the display and processing of the text or it can be extracted from the source page context. HTML and XML processors already know how to deal with this markup and provide ready validation.

The downside of this approach is that many data values are just strings. As with adding Unicode tags or Unicode bidi controls, the addition of markup to strings alters the original string content. Producers are required to introspect strings and add markup as needed. Consumers must likewise remove any additional markup introduced by the producer.

The addition of markup also requires consumers to guard against the usual problems with markup insertion, such as XSS attacks.

Create a new datatype

If a new datatype were added to JSON to support natural language strings, then specifications could easily specify that type for use in document formats. Since the format is standardized, producers and consumers would not need to guess about direction or language information when it is encoded. Such a serialization might look like the following:

myLocalizedString: "Hello World!"@en^ltr          // language and direction
myLocalizedString_ar: "مرحبا بالعالم!"@ar-EG^rtl  // right-to-left example
myLocalizedString_fr: "Bonjour monde !"@fr        // language only
myLocalizedString_und: "שלום עולם!"^rtl           // direction information only
myLanguageNeutralString: "978-0-123-4567-X"       // language-neutral string

The downside of adding a datatype is that JSON is a widely implemented format, including many ad-hoc implementations. Any new serialization form would likely break or cause interoperability problems with these existing implementations. JSON is not designed to be a "versioned" format. Any serialization form used would need to be transparent to existing JSON processors and thus could introduce unwanted data or data corruption to existing fields and formats.

[[JSON-LD]] includes some data structures that are partially helpful. Notably, it defines string internationalization in the form of a context-scoped @language value which can be associated with blocks of JSON or within individual objects. There is no definition of base direction, so this is incomplete. The @context concept can be used by specifications as a means of indicating the default language metadata where omitted from individual strings.

The concept of language indexing in JSON-LD is used in the Best Practices in this document as a means for localizing a data value.

Unicode tag characters

Unicode tag characters are strongly deprecated by the Unicode Consortium. These tag characters were intended for use in language tagging within plain text contexts and are often suggested as an alternate means of providing in-band non-markup language tagging. We are unaware of any implementations that use them as language tags.

Here is how Unicode tags are supposed to work:

A [[!BCP47]] language tag is just one of the potential tags that could be applied using this system, so each language tag begins with a tag identification character, in this case U+E0001. The remainder of the Unicode block for forming tags mirrors the printable ASCII characters. That is, U+E0020 is space (mirroring U+0020), U+E0041 is capital A (mirroring U+0041), and so forth. Following the tag identification character, you use each tag character to spell out a [[!BCP47]] language tag using the upper/lowercase letters, digits, and the hyphen character. Normal language tags, which are composed from ASCII letters, digits and hyphens, can be transmogrified into tags by adding 0xE0000 to each character's code point. Additional structure, such as a language priority list (see [[RFC4647]]) might be constructed using other characters such as comma or semi-colon, although Unicode does not define or even necessarily permit this.

The end of a tag's scope is signalled by the end of the string, or can be signalled explicitly using the cancel tag character U+E007F, either alone (to cancel all tags) or preceded by the language tag identification character U+E0001 (i.e. the sequence <U+E0001,U+E007F> to end only language tags).
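
For illustration, tagging the text "Hola" as Spanish ("es") would mean prefixing it with the following sequence (hypothetical, since no implementations are known to use the characters this way):

  • U+E0001 [LANGUAGE TAG]
  • U+E0065 [TAG LATIN SMALL LETTER E]
  • U+E0073 [TAG LATIN SMALL LETTER S]

followed by the ordinary text "Hola".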

Tags therefore have a minimum of three characters, and can easily be 12 or more. Furthermore, these characters are supplementary characters. That is, they are encoded using 4-bytes per character in UTF-8 and they are encoded as a surrogate pair (two 16-bit code units) in UTF-16. Surrogate pairs are needed to encode these characters in string types for languages such as Java and JavaScript that use UTF-16 internally. The use of surrogates makes the strings somewhat opaque. For example, U+E0020 is encoded in UTF-16 as 0xDB40.DC20 and in UTF-8 as the byte sequence 0xF3.A0.80.A0.

Applications that treat the characters as unknown Unicode characters will display them as tofu (hollow box replacement characters) and may count them towards length limits, etc. So they are only useful when applications or interchange mechanisms are fully aware of them and can remove them or disregard them appropriately. Although the characters are not supposed to be displayed or have any effect on text processing, in practice they can interfere with normal text processes such as truncation, line wrapping, hyphenation, spell-checking and so forth.

By design, [[!BCP47]] language tags are intended to be ASCII case-insensitive. Applications handling Unicode tag characters would have to apply similar case-insensitivity to ensure correct identification of the language. (The Unicode data doesn't specify case conversion pairings for these characters; this complicates the processing and matching of language tag values encoded using the tag characters.)

Moreover, language tags need to be formed from valid subtags to conform to [[!BCP47]]. Valid subtags are kept in an IANA registry and new subtags are added regularly, so applications dealing with this kind of tagging would need to always check each subtag against the latest version of the registry.

Another issue with these tag characters is that they do not allow nesting of language tags. For example, if a string contains two languages, such as a quote in French inside an English sentence, Unicode tag characters can only indicate where one language starts. To indicate nested languages, tags would need to be embedded into the text not just prefixed to the front.

In addition, although never implemented, other types of tags could be embedded into a string or document using Unicode tag characters. It is possible for these tags to overlap sections of text tagged with a language tag.

Finally, Unicode has recently "recycled" these characters for use in forming sub-regional flags, such as the flag of Scotland, which is made of the sequence:

  • 🏴 [U+1F3F4 WAVING BLACK FLAG]
  • 󠁧 [U+E0067 TAG LATIN SMALL LETTER G]
  • 󠁢 [U+E0062 TAG LATIN SMALL LETTER B]
  • 󠁳 [U+E0073 TAG LATIN SMALL LETTER S]
  • 󠁣 [U+E0063 TAG LATIN SMALL LETTER C]
  • 󠁴 [U+E0074 TAG LATIN SMALL LETTER T]
  • 󠁿 [U+E007F CANCEL TAG]

The above is a new feature of emoji added in Unicode 10.0 (version 5.0 of UTS #51) in June 2017. Proper display depends on your system's adoption of this version.

Identifying the Base Direction of Content

Bidirectional Use Cases

In order for a consumer to correctly display bidirectional text, such as the strings in the following use cases, there must be a way to determine the required base direction for each string when it reaches the point of display to the user.

Each use case below shows characters from left-to-right in the order they are stored in memory. We use Hebrew text so as to avoid issues related to the display of cursive characters in Arabic. The use cases also serve as examples for the concepts on this page.

The first four use cases need to be displayed using an RTL base direction. The last use case needs to be displayed as LTR text.

Approaches Considered

The fundamental problem is how a consumer of a string will know what base direction should be used for that string when it is eventually displayed to a user. A number of alternatives are considered below. Note that, unlike some of the language tagging alternatives considered above, each of these mechanisms for identifying or estimating the base direction has utility in specific applications, and each is in use in different specifications such as [[HTML5]].

Metadata

Summary

Recommended?
yes

Pros:

  • simple, effective & efficient
  • doesn’t affect the content of the string
  • no need to parse the string or know how to interpret it

Cons:

  • out-of-band information needs to be associated with and kept with strings

To note:

  • best used only where necessary, and rely on first-strong heuristics otherwise
  • producers need to know when to attach metadata because first-strong doesn’t work
  • it must be possible to associate metadata with any string, but it may also be useful to additionally set a default for all strings

Using metadata external to the string is the RECOMMENDED best practice for indicating the base direction of text. Passing metadata as a separate data value from the string provides a simple, effective, and efficient method of communicating the intended base direction without affecting the actual content of the string. This requires that the consumer know how to retrieve and process the meaning of that metadata.

Metadata not only removes the problem of whether or not, and how, to parse markup in a string to determine the direction, but even in the simplest strings, without markup, it avoids the need to inspect and run heuristics on the string to determine its base direction.

There needs to be metadata available for each individual string. Alternatively, metadata can be inherited, but some mechanism must be available to override the inherited direction for a particular string which differs in direction from the inherited value.

Metadata is probably most effective, however (especially for the original creator of the strings), if it is only passed with a string in those cases where first-strong detection is otherwise going to produce a wrong result. This would mean that consumers of strings should not only recognise the metadata, but should also expect to rely on first-strong heuristics for strings without metadata. It also means that producers of strings need to recognise situations where directional information is needed and set the metadata.

First-strong

Summary

Recommended?
no

Pros:

  • where it is reliable, information about direction can be obtained without any changes to the string

Cons:

  • the base direction applied is unreliable, because the first strong character is not always indicative of the necessary base direction for the string
  • any string containing HTML bounded by an element with a dir attribute makes the direction undetectable, since dir isolates
  • the same goes for strings that begin with RLI, etc. and end with PDI
  • it’s not clear how to establish whether markup at the start of a string should be considered when checking for first-strong characters
  • consumers need to know the semantics of any markup vocabulary used if embedded markup contains the directional information

To note:

  • the consumer must know to check the string for first-strong heuristics
  • needs to skip characters at start of string without strong directional property, and internal isolated sequences
  • if no directional character is found in the string, there must be an agreement on the default direction
  • if a string is bounded by markup (eg. <cite>…</cite>) the directionality of the characters in the markup must be ignored when checking for the first-strong character if, and only if, the markup is going to be handled as markup by the consumer; if this is, say, just some example code, then the direction of the markup characters counts; it’s not clear how to tell the difference
  • if a string is bounded by markup with directional information (eg. <cite dir="rtl">…</cite>) which indicates the base direction to be used, the directional properties of the characters in the string must be ignored

First-strong detection looks for the first character with a strong Unicode directional property in a string, and sets the base direction to match it. Many developers assume that this provides a robust solution, but first-strong detection alone is not always adequate to communicate base direction.

Note that, if the producer is relying on the consumer using first-strong character detection to establish the contextual base direction of a string, the consumer needs to be aware that it is supposed to use that approach. Although first-strong detection is outlined in the Unicode Bidirectional Algorithm (UBA) [[!UAX9]], it is not the only possible higher-level protocol mentioned for estimating string direction. For example, Twitter and Facebook currently use different default heuristics for guessing the base direction of text; neither uses just simple first-strong detection, and one uses a completely different method.

The first-strong detection algorithm needs to skip characters at the start of the string that don't have a strong directional property. It also needs to skip embedded runs of text that are directionally isolated from the text around them, if it is to follow the UBA. Isolation may be achieved by Unicode formatting characters, such as RLI, LRI and FSI, or by using markup in the string if that markup is to be interpreted as actual markup by the consumer. For example, elements with the dir attribute in [[HTML5]] are isolating. An element such as <span dir="rtl"> and all of the text it contains should be skipped by first-strong detection.

If no strong directional character is found in the string, the direction should be assumed to be LTR.
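
The following sketch illustrates this kind of detection for plain text without markup. It is a simplification, not a conformant implementation: the hard-coded character ranges stand in for the Unicode Bidi_Class property [[!UAX9]] that a real implementation would consult.

// A simplified first-strong base direction detector. Skips isolated
// runs (LRI/RLI/FSI ... PDI) and characters without a strong
// directional property; returns "ltr" when no strong character is found.
function firstStrongDirection(s: string): "ltr" | "rtl" {
  let isolateDepth = 0;
  for (const ch of s) {
    const cp = ch.codePointAt(0)!;
    if (cp === 0x2066 || cp === 0x2067 || cp === 0x2068) { isolateDepth++; continue; } // LRI, RLI, FSI
    if (cp === 0x2069) { if (isolateDepth > 0) isolateDepth--; continue; }             // PDI
    if (isolateDepth > 0) continue; // inside a directionally isolated run
    // Strong RTL (simplified): Hebrew, Arabic, Syriac, Thaana blocks, etc.
    if ((cp >= 0x0590 && cp <= 0x08ff) || (cp >= 0xfb1d && cp <= 0xfdff) || (cp >= 0xfe70 && cp <= 0xfefc)) return "rtl";
    // Strong LTR (simplified): Latin letters only.
    if ((cp >= 0x41 && cp <= 0x5a) || (cp >= 0x61 && cp <= 0x7a) || (cp >= 0xc0 && cp <= 0x24f)) return "ltr";
  }
  return "ltr"; // default when no strong character is found
}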

The principal problem encountered with first-strong detection is that the first strong character is not always representative of the base direction that needs to be applied to that string, such as in use case #2 above.

If a string contains markup that will be parsed by the consumer as markup, there are additional problems. Any such markup at the start of the string must also be skipped when searching for the first strong directional character. If, however, there is angle bracket content that is intended to be an example of markup, rather than actual markup, the markup must not be skipped. It isn't clear how a consumer of the string would know the difference between this case and the previous one.

If parseable markup in the string contains information about the intended direction of the string, that information should be used rather than relying on first-strong heuristics. This is problematic in a couple of ways: (a) it assumes that the consumer of the string understands the semantics of the markup, which may be ok if there is an agreement between all parties to use, say, HTML markup only, but would be problematic, for example, when dealing with random XML vocabularies, and (b) the consumer must be able to recognise and handle a situation where only the initial part of the string has markup, ie. the markup applies to an inline span of text rather than the string as a whole.

Augmenting first-strong by inserting RLM/LRM markers

Summary

Recommended?
no

Pros:

  • it provides a reliable way of indicating base direction, as long as the producer can reliably apply markers
  • in theory, it should be easier to spot the first-strong character in strings that begin with markup, as long as the correct RLM/LRM is prepended to the string

Cons:

  • it is not clear that the producer of a string would always apply RLM/LRM when appropriate; a machine is not able to identify cases where those characters would be needed
  • this approach changes the identity and content of the string
  • consumers may need to remove the RLM/LRM marker, but may not be able to determine when that is or is not appropriate, since the string may start with an RLM/LRM character intentionally

To note:

  • applications must ensure that they do not accumulate markers

It is possible for a producer of a string to attach an RLM or LRM character to the beginning of the string when the wrong base direction would otherwise be assumed by a process using a simple first-strong heuristic.

If the producer is a human, they could theoretically apply one of these characters when creating a string in order to signal the directionality. One problem, especially on mobile devices, is the availability or inconvenience of inputting an RLM/LRM character. In addition, because the characters are invisible and because Unicode bidi is complicated, it can be difficult for the user to know that a bidi control will be necessary.

However, humans often do create text that will later become strings in environments where the bidi algorithm will need help. For example, if a person types information into an HTML form and relies on the form's base direction or use of shortcut keys to make the string look correct in the form field, they would not need to add RLM/LRM to make the string 'look correct' for themselves, but outside of that context the string would look incorrect unless an appropriate strong character was added to it. Similarly, strings scraped from a web page that has dir=rtl set in the html element would not normally have or need an RLM/LRM character at the start of the string in HTML.

This approach is therefore only appropriate for general use if it is acceptable to change the value of the string.

Apart from changing the identity of the string, adding characters to it may have an effect on things such as string length or pointer positions, which may become problematic.

Even when an LRM or RLM character has been inserted, the consumer still depends on applying a first-strong heuristic to get the proper direction; consumers that don't apply first-strong heuristics can still get the direction wrong.

If directional information is contained in markup that will be parsed as such by the consumer (for example, dir=rtl in HTML), the producer of the string needs to understand that markup in order to set or not set an RLM/LRM character as appropriate. If the producer always adds RLM/LRM to the start of such strings, the consumer is expected to know that. If the producer relies instead on the markup being understood, the consumer is expected to understand the markup.

The producer of a string should not automatically apply RLM or LRM to the start of the string, but should test whether it is needed. For example, if there's already an RLM in the text, there is no need to add another. If the context is correctly conveyed by first-strong heuristics, there is no need to add additional characters either. Note, however, that testing whether supplementary directional information of this kind is needed is only possible if the producer has access, and knows that it has access, to the original context of the string. Many document formats are generated from data stored away from the original context. For example, the catalog of books in the original example above is disconnected from the user inputting the bidirectional text.
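
As a sketch, a producer that knows the intended base direction of a string (for example, from the computed direction of the form field it was typed into) might test and prepend a marker like this, reusing the simplified firstStrongDirection function sketched earlier:

const LRM = "\u200e";
const RLM = "\u200f";

// Prepend a strong marker only when first-strong detection would
// otherwise guess the wrong base direction for this string.
function withDirectionHint(text: string, intended: "ltr" | "rtl"): string {
  if (firstStrongDirection(text) === intended) return text;
  return (intended === "rtl" ? RLM : LRM) + text;
}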

Paired formatting characters

Summary

Recommended?
no

Pros:

  • none

Cons:

  • isolating formatting characters must be used, but they are not yet well supported by consumers
  • consumers that use first-strong heuristics, rather than recognising this approach, would fail
  • Unicode limits for embedding levels may be exceeded

This approach inserts paired Unicode formatting characters at the start and end of a string to indicate the base direction.

If paired formatting characters are used, they should be isolating, ie. starting with RLI, LRI, or FSI, and not with RLE or LRE.
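
For illustration, a JSON value wrapped in RLI (U+2067) and PDI (U+2069), using the Hebrew example text from earlier, might look like:

   "title": "\u2067שלום עולם!\u2069"   // RLI ... PDI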

However, it would not be enough to simply apply the UBA first-strong heuristics to such a string, because the Unicode bidi algorithm is unable to ascertain the base direction for a string that starts with RLI/LRI/FSI and ends with PDI. This is because the algorithm skips over isolated sequences and treats them as a neutral character. A consumer of the string would have to take special steps, in this case, to uncover the first-strong character.

This approach is also only appropriate if it is acceptable to change the value of the string. In addition to possible issues such as changed string length or pointer positions, this approach runs the risk of one of the paired characters getting lost, either through handling errors, or through text truncation, etc.

A producer and a consumer of a string would need to recognise and handle a situation where a string begins with a paired formatting character but doesn't end with it because the formatting characters only describe a part of the string.

Unicode specifies a limit to the number of embeddings that are effective, and embeddings could build up over time to exceed that limit.

Consuming applications would need to recognise and appropriately handle the isolating formatting characters. At the moment such support for RLI/LRI/FSI is not pervasive.

Need to describe non-isolating controls here.

Script subtags

This section is currently a first draft and needs review.

Summary

Recommended?
(provisionally, pending deeper investigation) yes, but only when the metadata approach above is not possible

Pros:

  • no need to change the string
  • no need to inspect the string
  • reliable
  • no complications when dealing with markup in strings

Cons:

  • only works where it is possible to associate separate language metadata with each string
  • some scripts in archaic usage switched between LTR and RTL according to the preference of the author or the context of the content; the language tag is unable to handle non-default approaches for such strings, but this is expected to be an edge-case
  • new script tags may be coined, and these will need to be added to the lists used by consumers
  • there is no clear way to indicate that an LTR base direction must be applied to strings, such as MAC addresses, that are not in any particular language (see below)

To note:

  • may be more efficient to assume a default, in the absence of a script subtag, and use first-strong heuristics in non-problematic cases

The W3C Internationalization Working Group recommends that formats and applications should associate dedicated metadata relating to base text direction with strings wherever possible. In cases where that is not possible due to legacy constraints, but where language metadata can be associated with each string, it may be possible to use the language metadata as a fallback method of identifying the direction for a string (eg. JSON-LD, RDF, etc.).

Note, however, that this is only appropriate when declaring information about the overall base direction to be associated with a string. We do not recommend generalised use of language data to indicate text direction, especially within strings, since the usage patterns are not interchangeable.

Note, secondly, that language information must use BCP 47 subtags, and that the tag that carries the information should be the script subtag, not the language subtag. For example, Azeri may be written LTR (with the Latin or Cyrillic scripts) or RTL (with the Arabic script). Therefore, the subtag az is insufficient to clarify intended direction. A language tag such as az-Arab, however, can generally be relied upon to indicate that the overall base direction should be RTL.

There are many strings which are not language-specific but which absolutely need to be wrapped by a mechanism that explicitly associates them with a particular base direction for correct consumption. For example, MAC addresses inserted into an RTL context need to be displayed with an LTR overall base direction and isolation from the surrounding text. It's not clear how to distinguish these cases from others (in a way that would be feasible when using direction metadata).

The expected way in which this information is used is as follows. It may be reasonable to assume a default of LTR for all strings unless marked with a script subtag that indicates RTL. Any string that needs to have an overall base direction of RTL should be labelled for language by the producer using a script subtag. If a script subtag is associated with a string, the consumer would check the script against a list of script subtags that indicate a RTL base direction, and if found would take appropriate action.
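
A sketch of such a consumer-side check follows. The list of RTL script subtags and the tag parsing are both deliberately simplified, and the list would need maintenance as new subtags are registered:

// Script subtags whose default base direction is RTL (illustrative subset).
const RTL_SCRIPTS = new Set(["Arab", "Hebr", "Syrc", "Thaa", "Nkoo", "Adlm", "Rohg"]);

function directionFromLanguageTag(tag: string): "ltr" | "rtl" {
  // BCP 47 tags are case-insensitive; normalize the script subtag to
  // title case (e.g. "az-arab" -> "Arab") before checking.
  const m = /^[a-z]{2,3}-([a-z]{4})\b/i.exec(tag);
  if (!m) return "ltr"; // no script subtag: assume the LTR default
  const script = m[1][0].toUpperCase() + m[1].slice(1).toLowerCase();
  return RTL_SCRIPTS.has(script) ? "rtl" : "ltr";
}

directionFromLanguageTag("az-Arab"); // "rtl"
directionFromLanguageTag("az-Latn"); // "ltr"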

The list of script subtags may be added to in future. In that case, any subtags that indicate a default RTL direction need to be added to the lists used by the consumers of the strings.

It is perhaps possible to limit the use of script subtag metadata to situations where first-strong heuristics are expected to fail - provided that such cases can be identified, and appropriate action taken by the producer (not always reliable). Consumers would then need to use first-strong heuristics in the absence of a script subtag in order to identify the appropriate base direction. The use of script subtags should not, however, be restricted to strings that need to indicate direction; it is perfectly valid to associate a script subtag with any string.

This approach avoids the issues associated with first-strong detection when the first-strong character is not indicative of the necessary base direction for the string, and avoids issues relating to the interpretation of markup.

Note that a string that begins with markup that sets a language for the string text content (eg. <cite lang="en-Latn">) is not problematic here, since that language declaration is not expected to play into the setting of the base direction.

There are some rare situations where the base direction cannot necessarily be identified from the script subtag, but these are really limited to archaic usage of text. For example, Japanese and Chinese text prior to World War 2 was often written RTL, rather than LTR. Languages such as those written using Egyptian Hieroglyphs, or the Tifinagh Berber script, could formerly be written either LTR or RTL; however, the default in scholarly research tends to be LTR.

Acknowledgements

The Internationalization (I18N) Working Group would like to thank the following contributors to this document: Mati Allouche, David Baron, Tobie Langel, Sangwhan Moon, Felix Sasaki, Najib Tounsi, and many others.

The following pages formed the initial basis of this document: