This document describes the best practices for identifying the language and direction for strings used on the Web.

We welcome comments on this document, but to make it easier to track them, please raise separate issues for each comment, and point to the section you are commenting on using a URL.

Introduction

This document was developed as a result of observations by the Internationalization Working Group over a series of specification reviews related to formats based on JSON, WebIDL, and other non-markup data languages. Unlike markup formats, such as XML, these data languages generally do not provide extensible attributes and were not conceived with built-in language or direction metadata.

The concepts in this document are applicable any time strings are used on the Web, whether as part of a formalised data structure or where they simply originate from JavaScript scripting or any stored list of strings.

Natural language information on the Web depends on and benefits from the presence of language and direction metadata. Along with support for Unicode, mechanisms for including and specifying the block direction and the natural language of spans of text are among the key internationalization considerations when developing new formats and technologies for the Web.

Markup formats, such as HTML and XML, as well as related styling languages, such as CSS and XSL, are reasonably mature and provide support for the interchange and presentation of the world's languages via built-in features. Strings and string-based data formats need similar mechanisms in order to ensure complete and consistent support for the world's languages and cultures.

Document Conventions

In this document [[RFC2119]] keywords in uppercase italics have their usual meaning. We also use these stylistic conventions:

Definitions appear with a different background color and decoration like this.

Best practices appear with a different background color and decoration like this.

Terminology

This section provides short definitions of key terminology necessary to understand the contents of this document. Most of the terms found here are taken from the [[I18N-GLOSSARY]]: they are repeated here for convenience.

If you are unfamiliar with bidirectional or right-to-left text, there is a basic introduction here. This will give you a basic grasp of how the Unicode Bidirectional Algorithm works and the interplay between it and the block direction, which will stand you in good stead for reading this document. Additional materials can be found in the Internationalization Working Group's Best Practices for Spec Developers.

Metadata is data about data: it is information included in a data structure that provides additional context, meaning, or presentation. In this document, the function of metadata is to express information about direction and language. [[I18N-GLOSSARY]]

A producer is any process where natural language string data is created for later storage, processing, or interchange. [[I18N-GLOSSARY]]

A consumer is any process that receives natural language strings, either for display or processing. [[I18N-GLOSSARY]]

A serialization agreement is the common understanding between a producer and consumer about the serialization of string metadata: how it is to be understood, serialized, read, transmitted, removed, etc. [[I18N-GLOSSARY]]

The Unicode Bidirectional Algorithm [[UAX9]], also known as UBA, defines the concept of a [=paragraph direction=]. This is the initial base direction of a "paragraph", and resolves to either left-to-right or right-to-left. The term "paragraph" has a specific meaning internal to UBA. In the context of this document, the term is misleading, because generally strings and other data on the Web are not "paragraphs of text" in some document format. In this document, we generally use the following two more specific terms:

Block direction. The initial base direction of a block of text, which resolves to either left-to-right or right-to-left. A block refers to a unit of text as a whole, such as a paragraph in a document or a string in a data file. The name "block" is chosen as a contrast to inline direction. Unicode calls this value the [=paragraph direction=]. [[I18N-GLOSSARY]]

String direction. The overall direction of a specific string, which indicates the presentation order of string-internal directional runs. Strings transmitted inside various data structures are often inserted into a block (such as a paragraph). In such a case, the string direction is needed as part of the [=bidi isolation=] of the string.

In this document we are concerned with identifying the string direction of a whole string and how to transmit and apply the string direction when displaying strings in various contexts. We do not talk about how to determine the direction or display of runs of text within a string.

The bidi algorithm is primarily focused on arranging adjacent characters, based on character properties. The block direction dictates (a) the visual order and direction in which runs of strongly-typed LTR and RTL characters are displayed, and (b) where there are weakly-directional or neutral characters, such as punctuation, the placement of those items relative to the other content.

The String Lifecycle

It's not possible to consider alternatives for handling string metadata in a vacuum: we need to establish a framework for talking about string handling and data formats.

Producers

A string can be created in a number of ways, including a content author typing strings into a plain text editor, text message, or editing tool; or a script scraping text from web pages; or acquisition of an existing set of strings from another application or repository. In the data formats under consideration in this document, many strings come from back end data repositories or databases of various kinds. Sources of strings may provide an interface, API, or metadata that includes information about the string direction and language of the data. Some also provide a suitable default for when the direction or language is not provided or specified. In this document, the producer of a string is the source, be it a human or a mechanism, that creates or provides a string for storage or transmission.

When a string is created, it's necessary to (a) detect or capture the appropriate language and string direction to be associated with the string, and (b) take steps, where needed, to set the string up in a way that stores and communicates the language and string direction.

For example, in the case of a string that is extracted from an HTML form, the string direction can be detected from the computed value of the form's field. Such a value could be inherited from an earlier element, such as the html element, or set using markup or styling on the input element itself. The user could also set the direction of the text by using keyboard shortcut keys to change the direction of the form field. The dirname attribute provides a way of automatically communicating that value with a form submission.
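As a minimal sketch of how a producer might capture the direction communicated by dirname, the following parses a URL-encoded form body. The field names comment and comment.dir, and the function name, are assumptions made for illustration; they are not defined by this document.

```python
from urllib.parse import parse_qs


def read_field_with_direction(body: str, field: str, dir_field: str) -> dict:
    """Return a form field's text plus the direction submitted for it, if any."""
    params = parse_qs(body, keep_blank_values=True)
    value = params.get(field, [""])[0]
    direction = params.get(dir_field, [None])[0]
    if direction not in ("ltr", "rtl"):
        # Anything else is treated as unknown rather than guessed.
        direction = None
    return {"value": value, "dir": direction}


# Hypothetical submission from <input name="comment" dirname="comment.dir">
body = "comment=%D9%85%D8%B1%D8%AD%D8%A8%D8%A7&comment.dir=rtl"
captured = read_field_with_direction(body, "comment", "comment.dir")
```

The producer can then store the captured direction alongside the string instead of discarding it at the point of collection.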

Similarly, language information in an HTML form would typically be inherited from the lang attribute on the html tag, or an ancestor element in the tree with a lang attribute.

If the producer of the string is receiving the string from a location where it was stored by another producer, and where the string direction and language have already been established, the producer needs to understand that the language and string direction have already been set, and understand how to convert or encode that information for its consumers.

Consumers

A consumer is an application or process that receives a string for processing and possibly places it into a context where it will be exposed to a user. For display purposes, it must ensure that the block direction and language of the string are correctly applied to the string in that context. For processing purposes, it must at least persist the language and direction and may need to use the language and direction data in order to perform language-specific operations.

Proper display of the string involves supplying the string direction and language to the rendering document or process by applying additional markup, adding control codes, or setting display properties. This indicates to rendering software the string direction or language that should be applied to the string in this display context to get the string to appear correctly. For both language and direction, it must make clear the boundaries for the range of text to which the language or direction applies. For text direction, it must also isolate embedded strings from the surrounding text to avoid spill-over effects of the bidi algorithm [[UAX9]].
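One way a consumer could apply this when the display context is HTML is to wrap the string in an isolating element that carries lang and dir attributes. The sketch below is illustrative only: the function name is hypothetical, and it assumes the consumer has already obtained the language and direction from metadata.

```python
from html import escape
from typing import Optional


def to_html_isolate(value: str, lang: Optional[str] = None,
                    direction: Optional[str] = None) -> str:
    """Wrap a string in <bdi> so its direction cannot spill into nearby text."""
    attrs = []
    if lang:
        attrs.append(f'lang="{escape(lang, quote=True)}"')
    # When no direction is known, dir="auto" asks the browser to detect it.
    dir_value = direction if direction in ("ltr", "rtl") else "auto"
    attrs.append(f'dir="{dir_value}"')
    return f"<bdi {' '.join(attrs)}>{escape(value)}</bdi>"


markup = to_html_isolate("HTML & CSS", "en", "ltr")
```

The element boundaries mark the range of text to which the metadata applies, and the bdi element supplies the isolation.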

Note that a consumer of one document format might be a producer of another document format.

Serialization Agreements

Between any producer and consumer, there needs to be an agreement about what the document format contains and what the data in each field or attribute means. Any time a producer of a string takes special steps to collect and communicate information about the string direction or language of that string, it must do so with the expectation that the consumer of the string will understand how the producer encoded this information.

If no action is taken by the producer, the consumer must still decide what rules to follow in order to decide on the appropriate string direction and language, even if it is only to provide some form of default value.

In some systems or document formats, the necessary behaviour of the producers and consumers of a string is fully specified. In others, such agreements are not available; it is up to users to provide an agreement for how to encode, transmit, and later decode the necessary language or direction information. Low level specifications, such as JSON, do not provide a string metadata structure by default, so any document formats based on these need to provide the "agreement" themselves.

Strings that are not localizable text

The Web uses strings and character sequences to encode most data. Leaving aside different data types (such as numbers, time values, or binary data serializations such as base64), there are still values that are defined as using a string data type but which are not intended for use as natural language data values. For example, the syntactic content defined by a specification, such as the reserved keywords in CSS or the names of the various definitions in a WebIDL document, are not part of the localizable text of their respective document formats or protocols.

Many specifications also allow users to provide user-supplied values inside of a given namespace or document format. For example, SSIDs on a Wi-Fi network are user-defined. So too are class names in a CSS stylesheet. Most specifications allow (and are encouraged to allow) a wide range of Unicode characters in these names. Most users choose values that are recognizable as words in one or another natural language, as doing so makes the values easier to work with. However, even though these strings consist of words in a natural language, these types of strings are not considered localizable text and do not need to be encumbered with additional metadata related to language or string direction. Usually they are merely identifiers that enable a computer to match the values.

A sometimes-useful test is that if replacing the identifier with an arbitrary string such as tK0001.37B would still be allowed, functional, and "normal", then it's not localizable text.

For example, in the base example below, all of the keys in the JSON document (id, title, authors, language, publisher, and so on) are syntactic content. The data values, such as the ISBN, the language tag, and the publication date are also syntactic content. Only the actual book title, the author's name, and the publisher's name are natural language data values and thus localizable text.

Best Practices, Recommendations, and Gaps

This section consists of the Internationalization (I18N) Working Group's set of best practices for identifying language and string direction in data formats on the Web. In some cases, there are gaps in existing standards, where the recommendations of the I18N WG require additional standardization or there might be barriers to full adoption.

The main issue is how to establish a common serialization agreement between producers and consumers of data values so that each knows how to encode, find, and interpret the language and string direction of each data field. The use of metadata for supplying both the language and string direction of natural language string fields ensures that the necessary information is present, can be supplied and extracted with the minimal amount of processing, and does not require producers or consumers to scan or alter the data.

The most basic best practice, which the Internationalization Working Group looks for in every specification, is:

For any string field containing natural language text, it MUST be possible to determine the language and string direction of that specific string. Such determination SHOULD use metadata at the string or document level and SHOULD NOT depend on heuristics.

Recommended Serializations

This section describes four approaches to serialization for string values. Specifications are intended to use these together to form a complete solution to managing language and direction metadata in document formats and protocols.

Non-Linguistic Fields

Avoid assigning or requiring language or direction metadata for non-linguistic fields (that is, strings that contain data that is not human language). Note that this includes application-internal data values [[INTERNATIONAL-SPECS]].

While the value of a syntactic content item or user-supplied value will often use a word-like token that conveys meaning to humans (as an aid in debugging, for example), the values need to consistently be wrapped with localizable display strings for presentation to the user.

Specifications SHOULD NOT specify or require the use of language metadata for syntactic content or for the value of fields that cannot contain natural language text.

If a consumer is required to assign a language tag to some non-linguistic data, the language tag zxx (Non-Linguistic) SHOULD be used. If a consumer is required to assign a string direction to such data, the value auto SHOULD be used.

Specifications SHOULD be careful to distinguish syntactic content, including user-supplied values, from localizable text.

Specifications MUST NOT treat syntactic content values as "displayable".

Single-Language Localizable Text Field

Use field-based metadata or string datatypes to indicate the language and the string direction for individual localizable text values.

For localizable text fields that appear in a single language, use a data structure to represent the value. The recommended representation is an object with three fields. The field value contains the actual string. The field lang contains a valid [[BCP47]] language tag. The field dir contains the string's string direction (one of the values ltr, rtl, or auto).
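The three-field object described above could be modelled as follows. This is a sketch rather than a normative definition: the class name mirrors the Localizable term used in this document, while the validation and the to_json_object helper are assumptions added for illustration.

```python
from dataclasses import asdict, dataclass
from typing import Optional

ALLOWED_DIRECTIONS = ("ltr", "rtl", "auto")


@dataclass
class Localizable:
    value: str                  # the actual string
    lang: Optional[str] = None  # a BCP 47 language tag, if known
    dir: Optional[str] = None   # "ltr", "rtl", or "auto", if known

    def __post_init__(self) -> None:
        if self.dir is not None and self.dir not in ALLOWED_DIRECTIONS:
            raise ValueError(f"unsupported direction: {self.dir!r}")

    def to_json_object(self) -> dict:
        """Omit unknown metadata instead of inventing a default."""
        return {k: v for k, v in asdict(self).items() if v is not None}


title = Localizable("HTML و CSS: تصميم و إنشاء مواقع الويب", "ar", "rtl")
```

Leaving out a field when its value is unknown keeps "unknown" distinguishable from an explicit value.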

Use of heuristics to determine language or string direction will always fail for certain cases, and there needs to be a way to provide the correct outcome for those strings. Assignment of metadata (either as a resource-wide default, or in a string-specific label) is an intentional act that removes the need to guess the outcome by applying heuristics.

The use of metadata for indicating block direction is preferred because it avoids requiring the consumer to interpolate the direction using methods such as first strong or use of methods which require modification of the data itself (such as the insertion of RLM/LRM markers or bidirectional controls).

For [[WebIDL]]-defined data structures, define each localizable text (natural language text) field as a Localizable.

This combines both language and direction metadata and, if consistently adopted, makes interchange between different formats easier. Consistency between different specifications and document formats allows for the easy interchange of string data. By naming field attributes in the same way and adopting the same semantics, different specifications can more easily extract values from or add values into resources from other data sources.

Resource-wide Defaults

When a resource contains a number of natural language strings (and particularly if those strings are all in the same language), using the localized string representation described above can become inefficient. To reduce the complexity of encoding these strings, specifications can establish a resource-level default for language and [=string direction=]. These are separate values, as language does not imply direction. There should still be the ability to override either language or direction on any given string value by using the representation found above.

Specifications MAY define a mechanism to provide the default language and the default [=string direction=] for all strings in a given resource. However, specifications MUST NOT assume that a resource-wide default is sufficient. Even if a resource-wide setting is available, it must be possible to use string-specific metadata to override that default.

If your specification defines its own document level defaults, provide two optional fields:

A document-level default language field SHOULD be called language and SHOULD be specified to contain a valid [[BCP47]] language tag. Specifications SHOULD specify that implementations are only required to check whether a [[BCP47]] language tag is well-formed.

A document-level default block direction field SHOULD be called direction and support the values ltr, rtl, or auto.

Exceptions to the default are always a possibility, so it needs to be possible for users to override the default on a string-by-string basis.

First-strong heuristics are not applied to strings when the direction has been set externally using metadata. Even if a strongly directional character, such as U+200F RIGHT-TO-LEFT MARK (RLM), has been prepended to a string, resource-wide default metadata can override the presentation of the string in ways that result in spillover effects. Therefore content needs to be able to provide string-level metadata to override the default for strings whose string direction does not match the resource-wide default.
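The interaction between resource-wide defaults and string-level overrides can be sketched as below. The function name and the sample record are hypothetical; the language and direction field names follow the recommendation above, and value, lang, and dir follow the object representation described earlier.

```python
def resolve_metadata(resource: dict, field: str) -> dict:
    """Combine a field's own metadata with the resource-wide defaults."""
    raw = resource[field]
    if isinstance(raw, str):
        # A bare string carries no metadata of its own.
        raw = {"value": raw}
    return {
        "value": raw["value"],
        # String-level metadata wins; otherwise use the resource default.
        "lang": raw.get("lang", resource.get("language")),
        "dir": raw.get("dir", resource.get("direction")),
    }


catalog_entry = {
    "language": "ar",
    "direction": "rtl",
    "title": "HTML و CSS: تصميم و إنشاء مواقع الويب",
    "author": {"value": "Jon Duckett", "lang": "en", "dir": "ltr"},
}
title = resolve_metadata(catalog_entry, "title")
author = resolve_metadata(catalog_entry, "author")
```

Here the title inherits the defaults, while the author name overrides both language and direction.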

For specifications that can make use of the [[JSON-LD]] @context mechanism, use the @language and @direction fields to supply the document level defaults.
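For illustration, a JSON-LD 1.1 document with context-level defaults and one string-level override might be built like this. The vocabulary URLs and the publisher value are hypothetical; only the @context, @language, @direction, and @value keywords come from [[JSON-LD]].

```python
import json

document = {
    "@context": {
        "@language": "ar",    # default language for plain strings
        "@direction": "rtl",  # default base direction for plain strings
        "title": "https://example.com/vocab#title",          # hypothetical term
        "publisher": "https://example.com/vocab#publisher",  # hypothetical term
    },
    "title": "HTML و CSS: تصميم و إنشاء مواقع الويب",
    # A value object overrides the defaults for this one string.
    "publisher": {
        "@value": "Example Press",
        "@language": "en",
        "@direction": "ltr",
    },
}

serialized = json.dumps(document, ensure_ascii=False, indent=2)
```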

Language Maps

The world is not monolingual. Having documents that contain only a single language would mean providing many iterations of the document, one for each language, in order to localize the content. This also might require language negotiation when requesting the content.

One way to address this is to allow multilingual values for each localizable text field inside the document.

Language selection is not merely the exact matching of language tag string values to the user's preferred locale. The usual object representation of a localizable text field requires that the object be deserialized in order to discover the language tag associated with the value. This can be inefficient when there are many values in a given file. In these cases, the best practice is to use a language map to organize localizable text values. Such a map exposes the language tag for the purposes of selection, but still uses an object representation on the value side of the map, since both language and direction might need to be overridden for a given string value.
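A language map and a simple selection routine could look like the following sketch. The lookup function applies progressive truncation in the style of the RFC 4647 Lookup scheme; the function name, the sample map, and its English value are assumptions for illustration.

```python
from typing import List, Optional


def lookup(language_map: dict, preferred: List[str]) -> Optional[dict]:
    """Pick the best entry by progressively truncating each preferred tag."""
    lowered = {tag.lower(): (tag, entry) for tag, entry in language_map.items()}
    for language_range in preferred:
        parts = language_range.lower().split("-")
        while parts:
            candidate = "-".join(parts)
            if candidate in lowered:
                tag, entry = lowered[candidate]
                # The map key supplies the language unless the entry overrides it.
                return {"lang": tag, **entry}
            parts.pop()
            # A single-character subtag is removed together with what followed it.
            if parts and len(parts[-1]) == 1:
                parts.pop()
    return None


titles = {
    "en": {"value": "HTML and CSS", "dir": "ltr"},  # hypothetical English value
    "ar": {"value": "HTML و CSS: تصميم و إنشاء مواقع الويب", "dir": "rtl"},
}
selected = lookup(titles, ["fr-CA", "ar-EG"])
```

The keys can be matched without deserializing every value, while each value still carries its own direction.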

When Language and Direction are Unknown

Specify that, in the absence of other information, the default direction and default language are unknown.

Explicit metadata, if available, trumps the need for heuristics to be applied. This is logical, since the heuristic method cannot reliably deduce the necessary direction on its own, and if metadata has been explicitly provided there is an indication that it is intended to be authoritative.

It is essential for a consumer to know that language and direction are unknown quantities in order for them to know when to apply fallback strategies to the data (this could include language-detection, or first-strong heuristics for direction). In particular, the default direction should not be set to LTR, since that would override the need for first-strong detection, which is more appropriate for strings written in an RTL script.

For the case where the [=string direction=] is not known, specify that consumers should use first-strong heuristics to identify the [=string direction=] of each string.

If metadata is not available, consumers of strings should use heuristics, preferably based on the Unicode Standard's first-strong detection algorithm, to detect the base direction of a string.

The first-strong algorithm looks for the first strongly-directional character in a string (skipping certain preliminary substrings), and assumes that it represents the [=string direction=] of the string as a whole. However, the first strong directional character doesn't always coincide with the actual or desired [=string direction=] of the string as a whole, so it should be possible to provide metadata, where needed, to address this problem.
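A simplified sketch of first-strong detection, in the spirit of rules P2 and P3 of [[UAX9]], is shown here. It relies on the bidirectional classes exposed by Python's unicodedata module and skips text inside isolates; the function name is an assumption, and the sketch is not a complete implementation of the bidi algorithm.

```python
import unicodedata
from typing import Optional

ISOLATE_INITIATORS = {"LRI", "RLI", "FSI"}


def first_strong_direction(text: str) -> Optional[str]:
    """Return "ltr" or "rtl" from the first strong character, else None."""
    depth = 0  # nesting level of isolates currently being skipped
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in ISOLATE_INITIATORS:
            depth += 1
        elif bidi_class == "PDI":
            if depth > 0:
                depth -= 1
        elif depth == 0:
            if bidi_class == "L":
                return "ltr"
            if bidi_class in ("R", "AL"):
                return "rtl"
    return None  # no strong character found: direction stays unknown


# An Arabic title that happens to begin with Latin letters.
detected = first_strong_direction("HTML و CSS: تصميم و إنشاء مواقع الويب")
```

The title begins with a Latin-script word, so the heuristic reports ltr even though the string as a whole is Arabic. This is the kind of case where metadata is needed.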

If relying on first-strong heuristics, allow content developers to use RLM/LRM at the beginning of a string where it is necessary to force a particular base direction, but do not prepend one of these characters to existing strings.

Do not rely on the availability of RLM/LRM formatting characters in most cases.

If string data is being provided by users or content developers in web forms or other simple environments, users may not be able to enter these formatting characters. In fact, most users will probably be unaware that such characters exist, or how to use them. A web form can render their use unnecessary for immediate inspection if it sets the block direction for the input (which it should).

Specifications SHOULD NOT allow a string direction to be interpolated from available language metadata unless direction metadata is not available and cannot otherwise be provided.

Not all resources make use of the available metadata mechanisms. The script subtag of a language tag (or the "likely" script subtag based on [[BCP47]] and [[LDML]]) can sometimes be used to infer a [=block direction=] or [=string direction=] when other data is not available. Using language information is a "last resort" and specifications SHOULD NOT use it as the primary way of indicating [=block direction=]: make the effort to provide for metadata.
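As a last-resort fallback, inference from a language tag might be sketched as follows. Both tables are small, non-exhaustive excerpts chosen for illustration, the tag parsing is deliberately simplified, and treating every script outside the RTL table as left-to-right is an assumption of this sketch.

```python
from typing import Optional

# Non-exhaustive: a few scripts that are written right-to-left.
RTL_SCRIPTS = {"Arab", "Hebr", "Thaa", "Syrc", "Nkoo", "Adlm"}
# Tiny excerpt of "likely script" data for tags with no script subtag.
LIKELY_SCRIPT = {"ar": "Arab", "fa": "Arab", "ur": "Arab", "he": "Hebr"}


def direction_from_language_tag(tag: str) -> Optional[str]:
    """Guess a direction from a language tag, or None if nothing is known."""
    subtags = tag.split("-")
    language = subtags[0].lower()
    if len(language) == 1:
        return None  # private-use or grandfathered tag: no inference
    script = None
    for subtag in subtags[1:]:
        if len(subtag) == 1:
            break  # extensions and private-use subtags start here
        if len(subtag) == 4 and subtag.isalpha():
            script = subtag.title()
            break
    if script is None:
        script = LIKELY_SCRIPT.get(language)
    if script is None:
        return None
    return "rtl" if script in RTL_SCRIPTS else "ltr"
```

A consumer would call this only when neither string-level nor resource-level direction metadata is available.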

JSON-LD

Use of [[JSON-LD]] @context and the built-in @language attribute is RECOMMENDED as a document level default.

For document formats that use it, [[JSON-LD]] includes some data structures that are helpful in assigning language (but not paragraph direction) metadata to collections of strings (including entire resources). Notably, it defines what it calls "string internationalization" in the form of a context-scoped @language value which can be associated with blocks of JSON or within individual objects. There is no definition for base direction, so the @context mechanism does not currently address all concerns raised by this document.

Specifications SHOULD use the i18n Namespace feature for RDF literals, as defined in [[JSON-LD]] 1.1.

Where the i18n Namespace is not available or is inappropriate to use, specifications SHOULD require [[JSON-LD]] plain string literals for natural language values to provide string-specific language information.

Some datatypes, such as [[RDF-PLAIN-LITERAL]], already exist that allow for language metadata to be serialized as part of a string value.

Strings that are part of a legacy protocol or format

For strings that cannot specify direction due to legacy format reasons, specifications SHOULD specify that the string direction of each string depends on first-strong heuristics.

For string values and string fields that are not localizable text, specifications SHOULD specify that the field is non-linguistic in nature and recommend the language tag zxx ("No linguistic content") be associated with each string value.

For string values and string fields that are known to contain localizable text but for which there is no possibility of language metadata from the underlying format, specifications SHOULD specify that the language of the content is unknown and recommend the language tag und ("Undetermined") be associated with each string. Specifications MAY allow the use of heuristics or the inference of the language from other field values where appropriate and as a last resort.

Many protocols or formats make use of values that are meant to be human-decipherable tokens, while not being intended as natural language text. This allows people to make use of the value, such as using it for debugging. These can include common protocol elements which humans expect to view and interact with.

Common examples of these include domain names and email addresses. With greater availability of Unicode in these sorts of value spaces, display of these values might vary between systems and environments. For example, font selection, which can vary depending on language, might be different on systems with different default locales.

Some specifications interact with string values defined by existing protocols or formats. Often these strings are not associated with or do not provide language or direction metadata. For example, many HTTP headers define their contents as if their contents were not localizable text, even when those contents are expected to be natural language text. Specifications that act as consumers or producers of these string values have no way to discover what the language or direction metadata is, nor will they have a mechanism to attach such metadata.

Defining Bidirectional Keywords in Specifications

A specification for a document format or protocol that includes natural language text values will need to define a data field or attribute to store the block direction for each natural language content value. These definitions need to be consistent across the Web in order to ensure interoperability, because consumers of one document format will need to map the block direction for values they receive to fields that they produce or will need to control the string direction of each string when displaying the content. This section describes how to provide such a definition along with the specific content to use.

There are two common use cases for defining content direction: (i) defining a directional metadata field for storing and transmitting the string direction as a field in a data structure or (ii) defining a direction attribute to associate a block direction with a given piece of natural language content.

Directional metadata field. A directional metadata field (or direction field for short) is a field in a data structure used to associate a [=string direction=] with a given natural language string field or data value.

Direction attribute. A direction attribute is a field or value, usually represented by an attribute in markup languages, that provides the [=string direction=] of the associated natural language string content.

Use the field name direction when defining a directional metadata field in a data structure or protocol.

The name direction is preferred for data values. The name dir is an acceptable alternative.

Use the field name dir when defining a direction attribute.

The name dir is preferred for an attribute, such as in markup languages. Using direction for an attribute is not recommended, since it is long and relatively uncommon for this use case. Note that both [[HTML]] and [[XML10]] have a built-in dir attribute. A dir attribute should have scope within a document and should be defined to provide bidi isolation.

Define the values of a directional metadata field or a direction attribute to include and be limited to the values ltr, rtl, and auto.

The value ltr indicates a direction of left-to-right, in exactly the same manner indicated by CSS writing modes [[CSS-WRITING-MODES-4]].

The value rtl indicates a direction of right-to-left, in exactly the same manner indicated by CSS writing modes [[CSS-WRITING-MODES-4]].

The value auto indicates that the user agent uses the algorithm for auto defined by [[HTML]] to determine the [=block direction=] ("[=paragraph direction=]"). This heuristic looks for the first character with a strong directionality, in a manner analogous to the Paragraph Level determination in the bidirectional algorithm [[UAX9]].

When auto is applied to multiple fields or to a document as a whole, it means that the direction should be individually derived for each field (with string-specific metadata providing an override for cases that cannot be determined automatically). It can be useful for labelling a group of mixed direction strings, when the string direction of most strings can be reliably determined using the first-strong heuristics. Whenever possible, the actual string direction (ltr or rtl) of individual strings should be stored or exchanged instead of auto. Omitting the direction field is preferable when the value is truly unknown.

Additional Best Practices

Specifications SHOULD NOT use the Unicode "language tag" characters (code points U+E0000 to U+E007F) for language identification.

[[Unicode]] says that the ... use of tag characters to convey language tags is strongly discouraged and that the use of the character U+E0001 LANGUAGE TAG is strongly discouraged.

Specifications MUST NOT require the production or use of paired bidi controls.

Another way to say this is: do not require implementations to modify data passing through them. Unicode bidi control characters might be found in a particular piece of string content, where the producer or data source has used them to make the text display properly. That is, they might already be part of the data. Implementations should not disturb any controls that they find—but they shouldn't be required to produce additional controls on their own.

Specifications SHOULD recommend the use of language indexing when Localizable strings can be supplied in multiple languages for the same value.

Producers sometimes need to supply multiple language values (see Localization Considerations) for the same content item or data record. One use for this is language negotiation by the consumer.

Requirements and Use Cases

Please read the article Use cases for bidi and language metadata on the Web for detailed use cases, including a clear illustration of issues such as spillover or locale-based rendering. This section summarises some key points from that document related to the need for language and direction metadata.

Why is this important?

Information about the language of content is important when processing and presenting localizable text for a variety of reasons. When language information is not present, the resulting degradation in appearance or functionality can frustrate users, render the content unintelligible, or disable important features. Some of the affected processes include:

Similarly, direction metadata is important to the Web. When a string contains text in a script that runs right-to-left (RTL), it must be possible to eventually display that string correctly when it reaches an end user. For that to happen, it is necessary to establish what string direction needs to be applied to the string as a whole. The appropriate [=string direction=] cannot always be deduced by simply looking at the string; even where it is possible, the producer and consumer of the string need to use the same heuristics to interpret the direction.

Static content, such as the body of a Web page or the contents of an e-book, often has language or direction information provided by the document format or as part of the content metadata. Data formats found on the Web generally do not supply this metadata. Base specifications such as Microformats, WebIDL, JSON, and more, have tended to store natural language text in string objects, without additional metadata.

This places a burden on application authors and data format designers to provide the metadata on their own initiative. When standardized formats do not address the resulting issues, the result can be that, while the data arrives intact, its processing or presentation cannot be wholly recovered.

In a distributed Web, any consumer can also be a producer for some other process or system. Thus, a given consumer might need to pass language and direction metadata from one document format (and using one serialization agreement) to another consumer using a different document format. Lack of consistency in representing language and direction metadata in serialization agreements poses a threat to interoperability and a barrier to consistent implementation.

An example

Suppose that you are building a Web page to show a customer's library of e-books. The e-books exist in a catalog of data and consist of the usual data values. A JSON file for a single entry might look something like:

{
    "id": "978-111887164-5",
    "title": "HTML و CSS: تصميم و إنشاء مواقع الويب",
    "authors": [ "Jon Duckett" ],
    "language": "ar",
    "pubDate": "2008-01-01",
    "publisher": "مكتبة",
    "coverImage": "https://example.com/images/html_and_css_cover.jpg",
    // etc.
},

Each of the above is a data field in a database somewhere. There is even information about what language the book is in: ("language": "ar").

A well-internationalized catalog would include additional metadata to what is shown above. That is, for each of the fields containing localizable text, such as the title and authors fields, there should be language and string direction information stored as metadata. (There may be other values as well, such as pronunciation metadata for sorting East Asian language information.) These metadata values are used by consumers of the data to influence the processing and enable the display of the items in a variety of ways. As the JSON data structure provides no place to store or exchange these values, it is more difficult to construct internationalized applications.

One work-around might be to encode the values using a mix of HTML and Unicode bidi controls, so that a data value might look like one of the following:

// following examples are NOT recommended
// contains HTML markup
"title": "<span lang='ar' dir='rtl'>HTML و CSS: تصميم و إنشاء مواقع الويب</span>",
// contains LRM as first character
"authors": [ "\u200eJon Duckett" ], 

But JSON is a data interchange format: the content might not end up with the title field being displayed in an HTML context. The JSON above might very well be used to populate, say, a local data store which uses native controls to show the title and these controls will treat the HTML as string contents. Producers and consumers of the data might not expect to introspect the data in order to supply or remove the extra data or to expose it as metadata. Most JSON libraries don't know anything about the structure of the content that they are serializing. Producers want to generate the JSON file directly from a local data store, such as a database. Consumers want to store or retrieve the value for use without additional consideration of the content of each string. In addition, either producers or consumers can have other considerations, such as field length restrictions, that are affected by the insertion of additional controls or markup. Each of these considerations places special burden on implementers to create arbitrary means of serializing, deserializing, managing, and exchanging the necessary metadata, with interoperability as a casualty along the way.

(As an aside, note that the markup shown in the above example is actually needed to make the title as well as the inserted markup display correctly in the browser.)

Isn't Unicode enough?

[[Unicode]] and its character encodings (such as UTF-8) are key elements of the Web and its formats. They provide the ability to encode and exchange text in any language consistently throughout the Internet. However, Unicode by itself does not guarantee perfect presentation and processing of natural language text, even though it does guarantee perfect interchange.

Several features of Unicode are sometimes suggested as part of the solution to providing language and direction metadata. Specifically, Unicode bidi controls are suggested for handling direction metadata. In addition, there are "tag" characters in the U+E0000 block of Unicode originally intended for use as language tags (although this use is now deprecated).

There are a variety of reasons why the addition of characters to data in an interchange format is not a good idea. These include:

This last consideration is important to call out: document formats are often built and serialized using several layers of code. Libraries, such as general purpose JSON libraries, are expected to store and retrieve faithfully the data that they are passed. Higher-level implementations also generally concern themselves with faithful serialization and de-serialization of the values that they are passed. Any process that alters the data itself introduces variability that is undesirable. For example, consider an application's unit test that checks if the string returned from the document is identical to the one in the data catalog used to generate the document. If bidi controls, HTML markup, or Unicode language tags have been inserted, removed, or changed, the strings might not compare as equal, even though they would be expected to be the same.
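The unit-test scenario just described can be made concrete. The following is a minimal Python sketch; the variable names are hypothetical, and the inserted character stands in for any control or markup added by an intermediate layer.

```python
# Value as stored in the data catalog.
catalog_value = "Jon Duckett"

# Value after some layer prepended U+200E LEFT-TO-RIGHT MARK (LRM).
document_value = "\u200e" + catalog_value

# The two strings look identical when displayed, but they are no longer
# equal, and the length has grown by one code point.
strings_equal = catalog_value == document_value
length_delta = len(document_value) - len(catalog_value)
```

A test that compares the catalog value with the value read back from the document would fail here, even though nothing about the visible text has changed.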

What consumers need to do to support direction

Given the use cases for bidirectional text, it will be clear that a consumer cannot simply insert a string into a target location without some additional work or preparation taking place, first to establish the appropriate string direction for the string being inserted, and secondly to apply bidi isolation around the string.

This requires the presence of markup or Unicode formatting controls around the string. If the string's actual direction is opposite that of the content into which it is being inserted, the markup or control codes need to tightly wrap the string. Strings that are inserted adjacent to each other all need to be individually wrapped in order to avoid the spillover issues we saw in the previous section.

[[HTML]] provides base direction controls and isolation for any inline element when the dir attribute is used, or when the bdi element is used. When inserting strings into plain text environments, isolating Unicode formatting characters need to be used. (Unfortunately, support for the isolating characters, which the Unicode Standard recommends as the default for plain text/non-markup applications, is still not universal.)

The trick is to ensure that the direction information provided by the markup or control characters reflects the string direction of the string.
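As an illustration of the wrapping step described above, the following Python sketch shows one way a consumer might isolate a string once its string direction is known. The function names are hypothetical, and the sketch assumes the direction has already been established by some other means.

```python
import html

LRI, RLI, FSI, PDI = "\u2066", "\u2067", "\u2068", "\u2069"


def isolate_plain(text: str, direction: str) -> str:
    """Wrap text in Unicode isolating controls for a plain text target."""
    # Fall back to FIRST STRONG ISOLATE when the direction is unknown.
    opener = {"ltr": LRI, "rtl": RLI}.get(direction, FSI)
    return opener + text + PDI


def isolate_html(text: str, direction: str) -> str:
    """Wrap text in a bdi element for an HTML target.

    The text is escaped, so it is treated as string content, not markup.
    """
    dir_value = direction if direction in ("ltr", "rtl") else "auto"
    return f'<bdi dir="{dir_value}">{html.escape(text)}</bdi>'
```

Each string inserted next to another would be wrapped individually, so that neither affects the ordering of its neighbour.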

Approaches Considered for Identifying the [=String Direction=]

The fundamental problem for bidirectional text values is how a consumer of a string will know what [=string direction=] to use for that string when it is eventually displayed to a user. Note that some of these approaches for identifying or estimating the direction have utility in specific applications and are in use in different specifications such as [[HTML]]. The issue here is which are appropriate to adopt generally and specify for use as a best practice in document formats.

First-strong property detection

This approach is NOT recommended when used alone, but IS recommended as a fallback in combination with other approaches.

How it works

A producer doesn't need to do anything.

The string is stored as it is.

Consumers must look for the first character in the string with a strong Unicode directional property, and set the [=string direction=] to match it. They then take appropriate action to ensure that the string will be displayed as needed. This is not quite so simple as it may appear, for the following reasons:

  1. Characters at the start of a string without a strong direction (eg. punctuation, numbers, etc) and isolated sequences (ie. sequences of characters surrounded by RLI/LRI/FSI...PDI formatting characters) within a string must be skipped in order to find the first strong character.
  2. The detection algorithm needs to be able to handle markup at the start of the string. It needs to be able to tell whether the markup is just string text, or whether the markup needs to be parsed in the target location – in which case it must understand the markup, and understand any direction-related information that is carried in the markup.

First-strong detection is only needed where the required [=string direction=] is not already known. If direction is indicated for a string by metadata, either string-specific or via a resource-wide declaration, then first-strong heuristics should not be invoked. For example, first-strong heuristics would produce the wrong result for a string such as "HTML و CSS: تصميم و إنشاء مواقع الويب". This can be corrected using metadata, the use of which signifies informed intention, and you would not need or want to apply heuristics that would then make the result incorrect.

However, if there is no mechanism for the application of metadata, or if there is such a mechanism but the content developer omitted to use it, then first-strong heuristics can be helpful to establish base direction in many, though not all, cases. The application of strongly-directional formatting characters can help produce correct results for plain text strings such as the example just quoted, but it is not always possible to apply those (see [[[#rlm]]]).
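The detection step described above might be sketched as follows in Python. This is a minimal illustration rather than a complete implementation: the function name is hypothetical, it relies on the bidi classes reported by the standard `unicodedata` module, and it makes no attempt to handle markup at the start of the string.

```python
import unicodedata

ISOLATE_INITIATORS = {"\u2066", "\u2067", "\u2068"}  # LRI, RLI, FSI
PDI = "\u2069"


def first_strong_direction(text):
    """Return "ltr", "rtl", or None if no strong character is found.

    Characters without a strong direction are skipped, as are all
    characters inside isolated sequences.
    """
    depth = 0  # nesting depth of open isolate initiators
    for ch in text:
        if ch in ISOLATE_INITIATORS:
            depth += 1
        elif ch == PDI:
            if depth > 0:
                depth -= 1
        elif depth == 0:
            bidi_class = unicodedata.bidirectional(ch)
            if bidi_class == "L":
                return "ltr"
            if bidi_class in ("R", "AL"):
                return "rtl"
    return None
```

Applied to the book title used earlier, this function returns "ltr" because the first strong character is the Latin letter "H", which is exactly the case where metadata is needed to obtain the correct result.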

Advantages

Where it is reliable, information about direction can be obtained without any changes to the string, and without the agreements and structures that would be needed to support out-of-band metadata.

Issues

The main problem with this approach is that it produces the wrong result for

  1. strings that begin with a strong character with a different directionality than that needed for the string overall (eg. an Arabic tweet that starts with a hashtag)
  2. strings that don't have a strong directional character (such as a telephone number), which are likely to be displayed incorrectly in a RTL context.
  3. strings that begin with markup, such as span, since the first strong character is always going to be LTR.

In cases where the entire string starts and ends with RLI/LRI/FSI...PDI formatting characters, it is not possible to detect the first strong character by following the Unicode Bidirectional Algorithm. This is because the algorithm requires that bidi-isolated text be excluded from the detection.

If no strong directional character is found in the string, the direction should probably be assumed to be LTR, and the consumer should act on that basis. This has not been tested fully, however.

If a string contains markup that will be parsed by the consumer as markup, there are additional problems. Any such markup at the start of the string must also be skipped when searching for the first strong directional character.

If parseable markup in the string contains information about the intended direction of the string (for example, a dir attribute with the value rtl in HTML), that information should be used rather than relying on first-strong heuristics. This is problematic in a couple of ways: (a) it assumes that the consumer of the string understands the semantics of the markup, which may be ok if there is an agreement between all parties to use, say, HTML markup only, but would be problematic, for example, when dealing with random XML vocabularies, and (b) the consumer must be able to recognise and handle a situation where only the initial part of the string has markup, ie. the markup applies to an inline span of text rather than the string as a whole.

It's not clear where the example with the broken link in the following paragraph is or used to be.

If, however, there is angle bracket content that is intended to be an example of markup, rather than actual markup, the markup must not be skipped – trying to display markup source code in a RTL context yields very confusing results! It isn't clear, however, how a consumer of the string would always know the difference between examples and parseable strings.

Additional notes

Although first-strong detection is outlined in the Unicode Bidirectional Algorithm (UBA) [[UAX9]], it is not the only possible higher-level protocol mentioned for estimating string direction. For example, X (formerly known as Twitter) and Facebook currently use different default heuristics for guessing the base direction of text: neither uses just simple first-strong detection, and one uses a completely different method.

Metadata

This approach is recommended.

By 'metadata' we mean field-based information associated with a specific string or a set of strings in a data format, or information built into a string datatype (see also [[[#dir-approach-new-datatype]]]).

An example would be:

{
    "title": "HTML و CSS: تصميم و إنشاء مواقع الويب",
    "direction": "rtl",
    "language": "ar",
},

Metadata indicating the default direction for all the strings in a resource could also be set using an appropriate field.

How it works

A producer ascertains the [=string direction=] of the string and adds that to a metadata field that accompanies the string when it is stored or transmitted.

There are several approaches to using metadata:

  1. Label every string with a string direction.
  2. Provide a document-level default for block direction and only include metadata for strings whose value is different. The value auto is used when the direction of a string is not known.
  3. Rely on the consumer to do first-strong detection, and label only those strings which would produce the wrong result (that is, a right-to-left string that starts with left-to-right strong characters).

If storing or transmitting a set of strings at a time, it helps to have a field for the resource as a whole that sets a global, default string direction which can be inherited by all strings in the resource. Note that in addition to a global field, you still need the possibility of attaching string-specific metadata fields in cases where a string's string direction is not the same as the default value. The [=string direction=] set on an individual string must always override the default.

Consumers would need to understand how to read the metadata sent with a string, and would need to apply first-strong heuristics in the absence of metadata.
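The order of precedence described above (string-specific metadata, then the resource-wide default, then first-strong heuristics) could be sketched in Python as follows. The function and parameter names are hypothetical, the first-strong helper is deliberately simplified (it ignores isolated sequences and markup), and the final fallback to LTR is an assumption rather than a requirement.

```python
import unicodedata


def first_strong(text):
    """Simplified first-strong detection; returns "ltr", "rtl", or None."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):
            return "rtl"
    return None


def resolve_direction(text, direction=None, resource_default=None):
    """Choose the string direction a consumer should apply to text.

    direction is the string-specific metadata, if any;
    resource_default is the resource-wide default, if any.
    """
    # String-specific metadata always overrides the resource default.
    chosen = direction if direction is not None else resource_default
    if chosen in ("ltr", "rtl"):
        return chosen
    # No usable metadata (missing or "auto"): fall back to heuristics,
    # and assume LTR when no strong character is present.
    return first_strong(text) or "ltr"
```

For example, `resolve_direction("HTML و CSS", direction="rtl")` yields "rtl", whereas the same call without metadata yields "ltr", which is the incorrect result that the metadata exists to prevent.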

The use of the Localizable dictionary structure is RECOMMENDED for individual values in JSON-based document formats, as it combines both language and direction metadata and, if consistently adopted, makes interchange between different formats easier.

As noted here, [[JSON-LD]] includes some data structures that are helpful in assigning language (but not direction) metadata to collections of strings (including entire resources). These gaps in support for pre-built metadata at the resource or item level are one of the key reasons for this document's development.

Advantages

Passing metadata as a separate data value from the string provides a simple, effective and efficient method of communicating the intended [=string direction=] without affecting the actual content of the string.

If every string is labelled for direction, or the direction for all strings can be ascertained by applying the global setting and any string-specific deviations, it avoids the need to inspect and run heuristics to determine each separate string's [=string direction=].

Issues

Out-of-band information needs to be associated with and kept with strings. This may be problematic for some sets of string data which are not part of a defined framework.

In particular, JSON-LD doesn't allow direction to be associated with individual strings in the same way as it works for language.

Augmenting first-strong by inserting RLM/LRM markers

This approach is NOT workable for all situations.

How it works

A producer ascertains the [=string direction=] of the string and adds a marker character (either U+200F RIGHT-TO-LEFT MARK (RLM) or U+200E LEFT-TO-RIGHT MARK (LRM)) to the beginning of the string. The marker is not functional, ie. it will not automatically apply a base direction to the string that can be used by the consumer; it is simply a marker.

There are a number of possible approaches:

  1. Add a marker to every string (not recommended).
  2. Rely on the consumer to do first-strong detection, and add a marker to only those strings which would produce the wrong result (eg. a RTL string that starts with LTR strong characters).
  3. Assume a default of LTR (no marker), and apply only RLM markers.

Consumers apply first-strong heuristics to detect the [=string direction=] for the string. The RLM and LRM characters are strongly typed directionally, and should therefore result in detecting the appropriate base direction.

As described in [[[#firststrong]]], this approach is not relevant if directional information is provided via metadata.
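The second approach in the list above, where a marker is added only when first-strong detection would otherwise give the wrong answer, might look like the following Python sketch. The function names are hypothetical, the producer is assumed to already know the intended direction ("ltr" or "rtl"), and the first-strong helper is simplified (it ignores isolated sequences and markup).

```python
import unicodedata

LRM, RLM = "\u200e", "\u200f"


def first_strong(text):
    """Simplified first-strong detection; returns "ltr", "rtl", or None."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):
            return "rtl"
    return None


def add_marker_if_needed(text, direction):
    """Prepend RLM or LRM only if heuristics would not detect direction."""
    if first_strong(text) == direction:
        return text  # heuristics already give the intended result
    return (RLM if direction == "rtl" else LRM) + text
```

Note that the returned value is a different string from the input whenever a marker is added: it is one code point longer and will not compare as equal to the original.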

Advantages

It provides a reliable way of indicating base direction, as long as the producer can reliably apply markers.

In theory, it should be easier to spot the first-strong character in strings that begin with markup, as long as the correct RLM/LRM is prepended to the string.

Issues

If the producer is a human, they could theoretically apply one of these characters when creating a string in order to signal the directionality.

A significant problem with this, especially on mobile devices, is the availability or inconvenience of inputting an RLM/LRM character. The keyboards of mobile devices generally do not provide keys for RLM/LRM characters. Perhaps more important, because the characters are invisible and because Unicode bidi is complicated, it can be difficult for the user to know how to use the character effectively. In fact, a large percentage of users don't actually know what these characters are or what they do.

Furthermore, if a person types information into, say, an HTML form in a RTL page or uses shortcut keys to set the direction for the form field, the strings will look correct without the need to add RLM/LRM. However, used outside of that context, the string would look incorrect unless it is associated with information about the required [=block direction=]. Similarly, strings scraped from a web page that has dir=rtl set in the html element would not normally have or need an RLM/LRM character at the start of the string in HTML.

It may be possible for the steps used by a producer to include an examination of the original context of the string for directional information (for example, by testing the computed direction of an HTML form field), followed by automatic insertion of an RLM/LRM mark into the beginning of the string where necessary. An issue with this approach is that it changes the string value and identity. This may also create problems for working with string length or pointer positions, especially if some producers add markers and others don't.

If directional information is contained in markup that will be parsed as such by the consumer (for example, dir=rtl in HTML), the producer of the string needs to understand that markup in order to set or not set an RLM/LRM character as appropriate. If the producer always adds RLM/LRM to the start of such strings, the consumer is expected to know that. If the producer relies instead on the markup being understood, the consumer is expected to understand the markup.

The producer of a string should not automatically apply RLM or LRM to the start of the string, but should test whether it is needed. For example, if there's already an RLM in the text, there is no need to add another. If the context is correctly conveyed by first-strong heuristics, there is no need to add additional characters either. Note, however, that testing whether supplementary directional information of this kind is needed is only possible if the producer has access, and knows that it has access, to the original context of the string. Many document formats are generated from data stored away from the original context. For example, the catalog of books in the original example above is disconnected from the user inputting the bidirectional text.

Paired formatting characters

This approach is NOT recommended.

How it works

A producer ascertains the [=string direction=] of the string and adds a directional formatting character (one of U+2066 LEFT-TO-RIGHT ISOLATE (LRI), U+2067 RIGHT-TO-LEFT ISOLATE (RLI), U+2068 FIRST STRONG ISOLATE (FSI), U+202A LEFT-TO-RIGHT EMBEDDING (LRE), or U+202B RIGHT-TO-LEFT EMBEDDING (RLE)) to the beginning of the string, and U+2069 POP DIRECTIONAL ISOLATE (PDI) or U+202C POP DIRECTIONAL FORMATTING (PDF) to the end.

There are a number of possible approaches:

  1. Add the formatting codes to every string.
  2. Rely on the consumer to do first-strong detection, and add a marker to only those strings which would produce the wrong result (eg. a RTL string that starts with LTR strong characters).

Consumers would theoretically just insert the string in the place it will be displayed, and rely on the formatting codes to manage directionality. However, things are not quite so simple (see below).

There are two types of paired formatting characters. The original set of controls provide the ability to add an additional level of bidirectional "embedding" to the Unicode Bidirectional Algorithm. More recently, Unicode added a complementary set of "isolating" controls. Isolating controls are used to surround a string. The inside of the string is treated as its own bidirectional sequence, and the string is protected against spill-over effects related to any surrounding text. The enclosing string treats the entire surrounded string as a single unit that is ignored for bidi reordering. This issue is described here.

Code Point | Abbreviation | Description | Code Point | Abbreviation | Description
U+202A | LRE | Left to Right Embedding | U+2066 | LRI | Left to Right Isolate
U+202B | RLE | Right to Left Embedding | U+2067 | RLI | Right to Left Isolate
 | | | U+2068 | FSI | First Strong Isolate
U+202C | PDF | Pop Directional Formatting (ending an embedding) | U+2069 | PDI | Pop Directional Isolate (ending an isolate)

If paired formatting characters are used, they should be isolating, ie. starting with RLI, LRI, FSI, and not with RLE or LRE.

Advantages

There are no real advantages to using this approach.

Issues

This approach is only appropriate if it is acceptable to change the value of the string. In addition to possible issues such as changed string length or pointer positions, this approach runs a real and serious risk of one of the paired characters getting lost, either through handling errors, or through text truncation, etc.

A producer and a consumer of a string would need to recognise and handle a situation where a string begins with a paired formatting character but doesn't end with it because the formatting characters only describe a part of the string.

Unicode specifies a limit to the number of embeddings that are effective, and embeddings could build up over time to exceed that limit.

Consuming applications would need to recognise and appropriately handle the isolating formatting characters. At the moment such support for RLI/LRI/FSI is far from pervasive.

This approach would disqualify the string from being amenable to UBA first-strong heuristics if used by a non-aware consumer, because the Unicode bidi algorithm is unable to ascertain the base direction for a string that starts with RLI/LRI/FSI and ends with PDI. This is because the algorithm skips over isolated sequences and treats them as a neutral character. A consumer of the string would have to take special steps in such a case to locate the first-strong character.
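The risk of a paired character being lost, mentioned above, can be illustrated with a short Python sketch. The helper name and the 20-character limit are hypothetical; the limit stands in for any field length restriction or truncation step.

```python
RLI, PDI = "\u2067", "\u2069"
ISOLATE_INITIATORS = {"\u2066", "\u2067", "\u2068"}  # LRI, RLI, FSI


def isolates_balanced(text):
    """Return True if every isolate initiator has a matching PDI."""
    depth = 0
    for ch in text:
        if ch in ISOLATE_INITIATORS:
            depth += 1
        elif ch == PDI:
            if depth == 0:
                return False  # PDI with no open initiator
            depth -= 1
    return depth == 0


title = "HTML و CSS: تصميم و إنشاء مواقع الويب"
wrapped = RLI + title + PDI

# A process that truncates the value drops the closing PDI.
truncated = wrapped[:20]
```

Here `wrapped` is balanced but two code points longer than the original title, while `truncated` still opens an isolate that is never closed.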

Script subtags

This approach is only recommended as a workaround for situations that prevent the use of metadata.

How it works

A producer supplies language metadata for strings, specifying, where necessary, the script in use.

There are a number of possible approaches:

  1. Label every string for language, including a script subtag as needed. Consumers may need to compute the script subtag when the producer does not provide one.
  2. It might be reasonable to assume a default of LTR for all strings unless marked with a language tag whose script subtag (either present or implied) indicates RTL.
  3. Alternatively, limit the use of script subtag metadata to situations where first-strong heuristics are expected to fail — provided that such cases can be identified, and appropriate action taken by the producer (not always reliable). Consumers would then need to use first-strong heuristics in the absence of a script subtag in order to identify the appropriate [=string direction=]. The use of script subtags should not, however, be restricted to strings that need to indicate direction; it is perfectly valid to associate a script subtag with any string.
  4. Set a default language for a set of strings at a higher level, but provide a mechanism to override that default for a given string where needed.

Consumers extract the script subtag from the language tag associated with each string, computing the string's [=string direction=] as necessary. Script subtags associated with RTL scripts are used to assign a direction of RTL to their associated strings.

Language information MUST use [[BCP47]] language tags. The portion of the language tag that carries the information is the script subtag, not the primary language subtag. For example, Azeri may be written LTR (with the Latin or Cyrillic scripts) or RTL (with the Arabic script). Thus, the subtag az is insufficient to clarify intended [=block direction=]. A language tag such as az-Arab (Azeri as written in the Arabic script), however, can generally be relied upon to indicate that the [=block direction=] should be RTL.
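A consumer following this approach might extract the script subtag and map it to a direction roughly as in the Python sketch below. The function names are hypothetical. The set of RTL scripts is a non-exhaustive sample, and the table of implied scripts is a small illustrative assumption; a real implementation would need a maintained data source for both.

```python
# Non-exhaustive sample of script subtags for right-to-left scripts.
RTL_SCRIPTS = {"Adlm", "Arab", "Hebr", "Mand", "Nkoo", "Rohg", "Samr", "Syrc", "Thaa"}

# Illustrative assumption: implied scripts for a few language subtags.
IMPLIED_SCRIPT = {"ar": "Arab", "fa": "Arab", "he": "Hebr", "ur": "Arab"}


def script_subtag(tag):
    """Return the script subtag of a language tag, or None if absent."""
    for sub in tag.split("-")[1:]:
        if len(sub) == 3 and sub.isalpha():
            continue  # extended language subtag, e.g. "cmn" in zh-cmn-Hans
        if len(sub) == 4 and sub.isalpha():
            return sub.title()  # script subtags are written in title case
        break  # region, variant, extension, or private use: no script
    return None


def direction_from_language_tag(tag):
    """Return "rtl", "ltr", or None if the script cannot be determined."""
    script = script_subtag(tag)
    if script is None:
        script = IMPLIED_SCRIPT.get(tag.split("-")[0].lower())
    if script is None:
        return None
    return "rtl" if script in RTL_SCRIPTS else "ltr"
```

With these tables, `az-Arab` maps to "rtl" and `az-Latn` to "ltr", while the bare tag `az` yields None, reflecting the point above that the primary language subtag alone is insufficient.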

Advantages

There is no need to inspect or change the string itself.

This approach avoids the issues associated with first-strong detection when the first-strong character is not indicative of the necessary [=string direction=] for the string, and avoids issues relating to the interpretation of markup.

Note that a string that begins with markup that sets a language for the string text content (eg. <cite lang="zh-Hans">) is not problematic here, since that language declaration is not expected to play into the setting of the [=string direction=].

Issues

The use of metadata as outlined above is a much better approach, if it is available. This script-related approach is only for use where that approach is unavailable, for legacy reasons.

There are many strings which are not language-specific but which absolutely need to be associated with a particular [=block direction=] for correct consumption. For example, MAC addresses inserted into a RTL context need to be displayed with a LTR overall base direction and also be isolated from the surrounding text. It's not clear how to distinguish these cases from others (in a way that would be feasible when using direction metadata). Special language tags, such as zxx (Non-Linguistic), exist for identifying this type of content, but usually data fields of this type omit language information altogether, since it is not applicable.

The list of script subtags may be added to in future. In that case, any subtags that indicate a default RTL direction need to be added to the lists used by the consumers of the strings.

There are some rare situations where the appropriate [=paragraph direction=] cannot be identified from the script subtag, but these are really limited to archaic usage of text. For example, Japanese and Chinese text prior to World War 2 was often written RTL, rather than LTR. Languages such as those written using Egyptian Hieroglyphs, or the Tifinagh Berber script, could formerly be written either LTR or RTL; however, the default for scholastic research tends to be LTR.

Other comments

The approach outlined here is only appropriate when declaring information about the overall string direction to be associated with a string. We do not recommend use of language data to indicate text direction within strings, since the usage patterns are not interchangeable.

Require bidi markup for content

This approach is NOT recommended, except under serialization agreements that expect to exclusively interchange HTML or XML markup data.

How it works

The producer ensures that all strings begin and end with markup which indicates the appropriate base direction for that string. This requires the producer to examine the string. If the string is not bounded by markup with directional information, the producer must wrap the string with elements that have the dir or its:direction [[ITS20]] attributes, or other markup appropriate to a given XML application. If the string is bounded by markup, but it is something such as an HTML h1 element, the producer needs to introduce directional information into the existing markup, rather than simply surround the string with a span.

This example uses HTML markup. (Simply to make the example easier to read, it shows the text content of the string as it should be displayed, rather than in the order in which the characters are stored.)

The consumer then relies on the markup to set the base direction around the text content of the string when it is displayed. (Note that, unless additional metadata is provided, the consumer cannot remove the markup before integrating the string in the target location, because it cannot tell what markup has been added by the producer and what was already there. In general, however, such added markup is harmless.)

Advantages

The benefit for content that already uses markup is clear. The content will already provide complete markup necessary for the display and processing of the text or it can be extracted from the source page context. HTML and XML processors already know how to deal with this markup and provide ready validation.

For HTML, the dir attribute bidirectionally isolates the content from the surrounding text, which removes spillover conflicts. This reduces the work of the consumer.

Markup can also be used for string-internal directional information, something string direction on its own cannot solve.

Issues

Effectively, all levels of the implementation stack have to participate in understanding the markup (or ensure that they do no harm).

If the system uses HTML, end to end, then appropriate markup is available and its semantics are understood (ie. the dir attribute, and the bdi and bdo elements). For XML applications, however, there is no standard markup for bidi support. Such markup would need to first be defined, and then understood by both the producer and consumer.

A key downside of this approach is that many data values are just strings. As with adding Unicode tags or Unicode bidi controls, the addition of markup to strings alters the original string content. Altering the length of the content can cause problems with processes that enforce arbitrary limits or with processes that "sanitize" content by escaping HTML/XML unsafe characters such as angle brackets.
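The effect of added markup on string length and on sanitizing processes can be seen in a short Python sketch. The variable names are hypothetical, and `html.escape` from the standard library stands in for whatever sanitizing step a downstream process applies.

```python
import html

title = "HTML و CSS: تصميم و إنشاء مواقع الويب"
with_markup = f"<span lang='ar' dir='rtl'>{title}</span>"

# A downstream process that escapes HTML-unsafe characters turns the
# markup into literal text, so the direction information is lost and
# the tags themselves would be shown to the user.
sanitized = html.escape(with_markup)

# The markup also makes the stored value longer than the original.
added_length = len(with_markup) - len(title)
```

In this case the wrapper adds 33 characters to the value, which may matter to a process that enforces a field length limit.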

Another issue is the work and sophistication required for producers to examine strings and add markup as needed.

There are limits to the number of embeddings allowed by the Unicode bidirectional algorithm. Consumers would need to ensure that this limit is not passed when embedding strings into a wider context.

The addition of markup also requires consumers to guard against the usual problems with markup insertion, such as XSS attacks.

Create a new bidi datatype

This approach was added to [[JSON-LD]] 1.1.

How it works

This is similar to the idea of sending metadata with a string as discussed previously. However, the metadata is not stored in a completely separate field (as in the metadata approach), or inserted into the string itself (as with formatting characters or markup), but is associated with the string as part of the string's serialization format.

Some datatypes, such as [[RDF-PLAIN-LITERAL]], already exist that allow for language metadata to be serialized as part of a string value. However, these do not include a consideration for direction. This might be addressed by defining a new datatype (or extending an existing one) that document formats could then use to serialize natural language strings that includes both language and direction metadata.

[[JSON-LD]] 1.1 added the i18n Namespace to permit JSON documents to serialize language and direction metadata directly with a string value. It provides a deserialization to RDF for specifications that need it.

Note that the last string does not include language information because it is an internal data value, but does include direction information because strings of this kind must be presented in the LTR order.

A producer would need to attach the string direction to each string as needed.

Each consumer should use first-strong heuristics for those strings that don't use this approach or do not contain string direction. The producer would then only add string direction information if the first-strong approach would otherwise produce the wrong result. This might simplify the management of strings and reduce the amount of data to be transmitted, because the number of strings requiring metadata is relatively small.

The consumer would look to see whether the string has metadata associated with it, in which case it would set the indicated [=string direction=]. Otherwise, it would use first-strong heuristics to determine the string direction of the string.
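The consumer logic described above might be sketched in Python as follows. This is a simplified illustration, not a full implementation of the Unicode bidirectional algorithm: it ignores isolates and embeddings, the function names are hypothetical, and falling back to LTR when no strong character is found is an assumption.

```python
import unicodedata
from typing import Optional


def first_strong_direction(text: str) -> Optional[str]:
    """Return 'ltr' or 'rtl' based on the first strongly directional
    character in the text, or None if the text contains none."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):
            return "rtl"
    return None


def resolve_direction(value: str, metadata_dir: Optional[str] = None) -> str:
    """Prefer explicit metadata; otherwise use the first-strong heuristic."""
    if metadata_dir in ("ltr", "rtl"):
        return metadata_dir
    # Assumption: default to LTR when there is no strong character at all.
    return first_strong_direction(value) or "ltr"


# A right-to-left title that begins with a Latin-script word: the heuristic
# alone guesses LTR, so the producer needs to supply the direction.
title = "HTML و CSS: تصميم و إنشاء مواقع الويب"
guessed = resolve_direction(title)
declared = resolve_direction(title, "rtl")
```

Here `guessed` is `"ltr"` while `declared` is `"rtl"`, which is the kind of string for which the producer would need to attach metadata.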

Advantages

If a new datatype were added to JSON to support natural language strings, then specifications could easily specify that type for use in document formats. Since the format is standardized, producers and consumers would not need to guess about direction or language information when it is encoded.

Issues

Apart from the fact that this currently doesn't work, the downside of adding a datatype is that JSON is a widely implemented format, including many ad-hoc implementations. Any new serialization form would likely break or cause interoperability problems with these existing implementations. JSON is not designed to be a "versioned" format. Any serialization form used would need to be transparent to existing JSON processors and thus could introduce unwanted data or data corruption to existing strings and formats.

Approaches Considered for Identifying the Language of Content

This section deals with different means of determining or conveying the language of string values.

Metadata

This approach is recommended.

How it works

A producer ascertains the language of the string (generally from metadata supplied upstream) and includes this information in a metadata field that accompanies the string when it is stored or transmitted.

When storing or transmitting a set of strings at a time, it helps to have a field for the resource as a whole that sets a language which can be inherited by all strings in the resource. Note that in addition to a global field, you still need the possibility of attaching string-specific metadata fields in cases where a string's language is not that of the default. The language set on an individual string must override any resource-level value.

A consumer needs to understand how to read the metadata associated with a string and apply it to the display, processing, or data structures that it generates. Note that this might include the need to apply a resource-level default language when serializing or exchanging an individual value.
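The inheritance and override behavior described above might look like the following Python sketch. The resource layout, the `strings` field, the helper names, and the German title are all hypothetical. Only the idea of a resource-level default overridden by string-level metadata comes from the text.

```python
from typing import Optional

# Hypothetical resource: one default language plus per-string overrides.
resource = {
    "lang": "en",
    "strings": [
        {"value": "Learning Web Design"},             # inherits "en"
        {"value": "Webdesign lernen", "lang": "de"},  # overrides the default
    ],
}


def effective_language(item: dict, resource_lang: Optional[str]) -> Optional[str]:
    """String-level metadata overrides the resource-level default."""
    return item.get("lang", resource_lang)


def detach(item: dict, resource_lang: Optional[str]) -> dict:
    """Copy a string out of its resource, making any inherited language
    explicit so that it survives exchange as an individual value."""
    result = dict(item)
    lang = effective_language(item, resource_lang)
    if lang is not None:
        result["lang"] = lang
    return result


detached = [detach(item, resource.get("lang")) for item in resource["strings"]]
```

After detaching, each value carries its own language, so nothing is lost when a string is serialized apart from the resource that supplied its default.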

Advantages

Using a consistent and well-defined data structure makes it more likely that different standards are composable and will work together seamlessly.

Metadata can be supplied without affecting the content itself.

Where metadata is unavailable, it can be omitted.

Consumers and producers do not have to introspect the data outside of their normal processing.

Issues

Serialized files utilizing the dictionary and its data values will contain additional fields and can be more difficult to read as a result.

For existing document formats, it represents a change to the values being exchanged.

Require markup for content

This approach is NOT recommended except in special cases where the content being exchanged is expected to consist of and is restricted to literal values in a given markup language.

How it works

When a document is expected to consist of HTML or XML fragments and will be processed and displayed strictly in a markup context, the producer can use markup to convey the language of the content by wrapping strings with elements that have the lang or xml:lang attributes.

Advantages

This approach, and thus its advantages, are effectively the same as those described for using markup to convey string direction.

Issues

See above.

Use Unicode language tag characters

This approach is NOT recommended.

How it works

Producers insert Unicode tag characters into the data to tag strings with a language.

Consumers process the Unicode tag characters and use them to assign the language.

Unicode defines special characters that can be used as language tags. These characters are "default ignorable" and should have no visual appearance. Here is how Unicode tags are supposed to work:

  • Each tag is a character sequence.
  • The sequence begins with a tag identification character. The only one currently defined is U+E0001, which identifies [[BCP47]] language tags. Other types of tags are possible, via private agreement.
  • The remainder of the Unicode block for forming tags mirrors the printable ASCII characters. That is, U+E0020 is space (mirroring U+0020), U+E0041 is capital A (mirroring U+0041), and so forth.
  • Following the tag identification character, producers use each tag character to spell out a [[BCP47]] language tag using the upper/lowercase letters, digits, and the hyphen character.
  • A given source language tag, which is composed from ASCII letters, digits and hyphens, can be transmogrified into tags by adding 0xE0000 to each character's code point.
  • Additional structure, such as a language priority list (see [[RFC4647]]) might be constructed using other characters such as comma or semi-colon, although Unicode does not define or even necessarily permit this.

The end of a tag's scope is signalled by the end of the string, or can be signalled explicitly using the cancel tag character U+E007F, either alone (to cancel all tags) or preceded by the language tag identification character U+E0001 (i.e. the sequence <U+E0001,U+E007F> to end only language tags).

Tags therefore have a minimum of three characters, and can easily be 12 or more. Furthermore, these characters are supplementary characters. That is, they are encoded using 4 bytes per character in UTF-8 and as a surrogate pair (two 16-bit code units) in UTF-16. Surrogate pairs are needed to encode these characters in string types for languages such as Java and JavaScript that use UTF-16 internally. The use of surrogates makes the strings somewhat opaque. For example, U+E0020 is encoded in UTF-16 as 0xDB40.DC20 and in UTF-8 as the byte sequence 0xF3.A0.80.A0.
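The encoding described above can be sketched in a few lines of Python. The function name is hypothetical and the sketch is for illustration only, since this approach is not recommended.

```python
TAG_OFFSET = 0xE0000
LANGUAGE_TAG = "\U000E0001"  # tag identification character for language tags
CANCEL_TAG = "\U000E007F"


def to_tag_characters(language_tag: str) -> str:
    """Spell out a language tag using Unicode tag characters by adding
    0xE0000 to the code point of each ASCII letter, digit, or hyphen."""
    for ch in language_tag:
        if not (ch.isascii() and (ch.isalnum() or ch == "-")):
            raise ValueError(f"unexpected character in language tag: {ch!r}")
    return LANGUAGE_TAG + "".join(chr(ord(ch) + TAG_OFFSET) for ch in language_tag)


prefix = to_tag_characters("en-US")
tagged = prefix + "Hello"

# Each tag character is a supplementary character: a surrogate pair in
# UTF-16 and four bytes in UTF-8.
tag_space = "\U000E0020"
utf16_hex = tag_space.encode("utf-16-be").hex()  # "db40dc20"
utf8_hex = tag_space.encode("utf-8").hex()       # "f3a080a0"
```

The five-character tag `en-US` becomes six invisible characters, which occupy 24 bytes in UTF-8 before any of the visible text.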

Advantages

These language tag characters could be used as part of normal Unicode text without modification to the structure of the document format.

Issues

Use of Unicode tag characters for language identification is strongly discouraged by the Unicode Consortium (and thus deprecated). These tag characters were intended for use in language tagging within plain text contexts and are often suggested as an alternate means of providing in-band non-markup language tagging. We are unaware of any implementations that use them as language tags.

Applications that treat the characters as unknown Unicode characters will display them as tofu (hollow box replacement characters) and may count them towards length limits, etc. So they are only useful when applications or interchange mechanisms are fully aware of them and can remove them or disregard them appropriately. Although the characters are not supposed to be displayed or have any effect on text processing, in practice they can interfere with normal text processes such as truncation, line wrapping, hyphenation, spell-checking and so forth.

By design, [[BCP47]] language tags are intended to be ASCII case-insensitive. Applications handling Unicode tag characters would have to apply similar case-insensitivity to ensure correct identification of the language. (The Unicode data doesn't specify case conversion pairings for these characters; this complicates the processing and matching of language tag values encoded using the tag characters.)
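A consumer that wanted to honor these characters would have to do something like the following Python sketch, which converts the first language tag back to ASCII before comparing it. The function names are hypothetical, and the sketch deliberately ignores nested or multiple tags and the scoping rules for the cancel tag.

```python
from typing import Optional, Tuple

TAG_OFFSET = 0xE0000
LANGUAGE_TAG = 0xE0001
CANCEL_TAG = 0xE007F
TAG_FIRST, TAG_LAST = 0xE0020, 0xE007E


def split_language_tag(text: str) -> Tuple[Optional[str], str]:
    """Return (first language tag as ASCII or None, text without tag characters)."""
    tag, visible = [], []
    reading_tag = False
    found = False
    for ch in text:
        cp = ord(ch)
        if cp == LANGUAGE_TAG and not found:
            reading_tag, found = True, True
        elif TAG_FIRST <= cp <= TAG_LAST:
            if reading_tag:
                tag.append(chr(cp - TAG_OFFSET))
        elif cp in (LANGUAGE_TAG, CANCEL_TAG):
            reading_tag = False
        else:
            reading_tag = False
            visible.append(ch)
    return ("".join(tag) or None), "".join(visible)


def same_language_tag(a: str, b: str) -> bool:
    """Language tags compare ASCII case-insensitively once decoded."""
    return a.lower() == b.lower()


# "EN" spelled with tag characters, followed by ordinary text.
encoded = "\U000E0001\U000E0045\U000E004E" + "Hello"
language, plain = split_language_tag(encoded)
```

The comparison has to happen after decoding because ordinary case conversion leaves the tag characters untouched, so lowercasing the encoded string directly would not make `EN` match `en`.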

Moreover, language tags need to be formed from valid subtags to conform to [[BCP47]]. Valid subtags are kept in an IANA registry and new subtags are added regularly, so applications dealing with this kind of tagging would need to always check each subtag against the latest version of the registry.

The language tag characters do not allow nesting of language tags. For example, if a string contains two languages, such as a quote in French inside an English sentence, Unicode tag characters can only indicate where one language starts. To indicate nested languages, tags would need to be embedded into the text not just prefixed to the front.

Although never implemented, other types of tags could be embedded into a string or document using Unicode tag characters. It is possible for these tags to overlap sections of text tagged with a language tag.

Finally, Unicode has recently "recycled" these characters for use in forming sub-regional flags, such as the flag of Scotland (🏴󠁧󠁢󠁳󠁣󠁴󠁿), which is made of the sequence:

  • 🏴 [U+1F3F4 WAVING BLACK FLAG]
  • 󠁧 [U+E0067 TAG LATIN SMALL LETTER G]
  • 󠁢 [U+E0062 TAG LATIN SMALL LETTER B]
  • 󠁳 [U+E0073 TAG LATIN SMALL LETTER S]
  • 󠁣 [U+E0063 TAG LATIN SMALL LETTER C]
  • 󠁴 [U+E0074 TAG LATIN SMALL LETTER T]
  • 󠁿 [U+E007F CANCEL TAG]

The above is a new feature of emoji added in Unicode 10.0 (version 5.0 of UTR#51) in June 2017. Proper display depends on your system's adoption of this version.

Use a language detection heuristic

This approach is NOT recommended.

How it works

Producers do nothing.

Consumers run a language detection algorithm to determine the language of the text. These are usually statistically based heuristics, such as using n-gram frequency in a language, possibly coupled with other data.

Advantages

There are no fundamental advantages to this approach.

Issues

Heuristics are more accurate the longer and more representative the text being scanned is. Detection is often unreliable for short strings.

Language detection is limited to the languages for which one has a detector.

Inclusions, such as personal or brand names in another language or script, can throw off the detection.

Language detection tends to be slow and can be memory intensive. Simple consumers probably can't afford the complexity needed to determine the language.

Localization Considerations

Sometimes a producer can supply localized values for a given content item or data record by performing some type of language negotiation between the producer and the consumer. Localization then takes place in the producer using the negotiated language to select the content returned. Such an approach can save on file size, which affects latency, and complexity, since only the language or languages needed by the consumer need be returned.

However, since this is not always possible, specifications sometimes allow multiple different language values to be returned for a given field. This might be to support runtime localization or because the producer has multiple different language values and cannot pre-select them appropriately.

In these cases, localization of a content item is done by having the producer return multiple language representations for the item and letting the consumer choose the value to display. Such an approach is helpful when the producer cannot negotiate the language (such as when the resulting file is cached for multiple users) and when the number of languages is relatively small. Large collections of languages can result in overly large documents that are cumbersome to work with.

One approach a specification might provide for returning multiple languages of a given field is called language indexing. In language indexing, a given field's value is an array of key-value pairs. The keys in the array are language tags. The values of each language tag are strings or, ideally, Localizable objects. Here's an example of what a language indexed field title might look like:

Using the language tag as a key to the value array allows for rapid selection of the correct value for a given request. Notice that, if the value of the language tag is a Localizable, the language might be repeated in the data structure.
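As an illustration of language indexing, the following Python sketch models a language-indexed title field as a dictionary and selects a value using a simplified form of the [[RFC4647]] lookup algorithm (progressively truncating the requested tag). The French entry, the `lookup` function name, and the decision to return `None` when nothing matches are assumptions made for this example.

```python
from typing import Optional

# Hypothetical language-indexed field: keys are language tags, and values
# are objects that carry the string together with its own language.
title = {
    "en": {"value": "Learning Web Design", "lang": "en"},
    "fr": {"value": "Apprendre la conception Web", "lang": "fr"},
}


def lookup(indexed: dict, requested: str) -> Optional[dict]:
    """Simplified RFC 4647 lookup: try the requested tag, then drop
    trailing subtags until a key matches or nothing is left."""
    # Language tags are case-insensitive, so compare in lowercase.
    by_key = {key.lower(): value for key, value in indexed.items()}
    candidate = requested.lower()
    while candidate:
        if candidate in by_key:
            return by_key[candidate]
        # Remove the last subtag ("en-us" becomes "en", "en" becomes "").
        candidate = candidate.rpartition("-")[0]
        # Also remove a trailing single-character subtag ("de-x" becomes "de").
        if len(candidate) >= 2 and candidate[-2] == "-":
            candidate = candidate[:-2]
    return None


best = lookup(title, "en-US")
```

Because the keys are language tags, the selection needs only key comparisons and never has to inspect the values themselves.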

For example, if the language requested were U.S. English (en-US), this format makes it easier to match and extract the best fitting title object {"value": "Learning Web Design", "lang": "en"}. An additional potential advantage is that the indexed language tag can indicate the intended audience of the value separately from the language tag of the actual data value. An example of this might be the use of language ranges from [[RFC4647]], as in the following example, where a more specific language value might be wrapped with a less-specific language tag. In this example, the content has been labeled with a specific language tag (de-DE), but is available and applicable to users who speak other variants of German, such as de-CH or de-AT:

A less common example would be when a system supplies a specific value in a different ("wrong") language from the indexing language tag, perhaps because the actual translated value is missing:

The primary issue with this approach is the need to extract the indexing language tag from the content in order to generate the index. Producers might also need to have a serialization agreement with consumers about whether the indexing language tag will be in any way canonicalized. For example, the language tag cel-gaulish is one of the [[BCP47]] grandfathered language tags. Some implementations, such as those following the rules in [[CLDR]], would prefer that this tag be replaced with a modern equivalent (xtg-x-cel-gaulish in this case) for the purposes of language negotiation.

[[JSON-LD]] defines a specific implementation of language indexing, which depends on the use of the @context structure. This structure does not support the use of Localizable values (only strings or arrays of strings are supported), so changes would be needed to allow some of the above capabilities in [[JSON-LD]] documents.

The Localizable WebIDL Dictionary

This section contains a WebIDL definition for a Localizable dictionary.

To be effective, specification authors should consistently use the same formats and data structures so that the majority of data formats are interoperable (in other words, so that data can be copied between many formats without having to apply additional processing). We recommend adoption of the Localizable WebIDL "dictionary" as the best available format for JSON-derived formats to do that.

By defining the language and direction in a WebIDL dictionary form, specifications can incorporate language and direction metadata for a given String value succinctly. Implementations can recycle the dictionary implementation straightforwardly.
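To show how an implementation might reuse such a structure, here is a minimal Python sketch of a string paired with language and direction metadata. It is not the WebIDL definition itself. The class name, the `value` field, and the choice to omit missing metadata from the serialized form are assumptions, and the `lang` and `dir` names are chosen to mirror the HTML attributes.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class LocalizableString:
    """Hypothetical in-memory counterpart of a localizable string value."""
    value: str
    lang: Optional[str] = None  # a BCP 47 language tag, if known
    dir: Optional[str] = None   # "ltr", "rtl", or "auto", if known

    def to_json_object(self) -> dict:
        # Assumption: metadata that is unavailable is simply omitted.
        return {k: v for k, v in asdict(self).items() if v is not None}


title = LocalizableString("Learning Web Design", lang="en", dir="ltr")
serialized = json.dumps(title.to_json_object(), ensure_ascii=False)
```

Using one shared shape for every natural language field means the same serialization and parsing code can serve each specification that adopts it.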

Acknowledgements

The Internationalization (I18N) Working Group would like to thank the following contributors to this document: Mati Allouche, David Baron, Ivan Herman, Tobie Langel, Sangwhan Moon, Felix Sasaki, Najib Tounsi, and many others.

The following pages formed the initial basis of this document: