Language Detection in User-Agents

ACTION-555

This page is in response to a request at TPAC 2016 to document different ways in which a user-agent can derive or "figure out" which language tag to apply to content, usually keyboard input from the user.

These considerations apply only to situations where the language cannot be marked up by the content author. Where possible, including and using the language metadata provided by the document format is much safer. None of the language detection methods described in this document is a foolproof or reliable indicator of the language.

Why language metadata is needed

Language metadata is important to the Web for a variety of reasons. Whether supplied by the system, as part of the content itself, or by information entered by end-users, the language of content can be used for a variety of useful purposes, such as selecting fonts, spell checking, hyphenation, and text-to-speech.

Static content, such as the body of a Web page or the contents of an e-book, often has language information provided by the document format or as part of the content metadata. Other content, particularly user input, is constrained only by the limits of the user's ingenuity. Even very exotic characters or languages can be entered on most systems via a character picker or by copying from some other source. As a result, none of the techniques presented here are perfect indicators of the user's intentions.

Environmental Hints

When processing keyboard input, the runtime environment might provide a number of hints about the user and the user's intentions. In descending order of relevance, here are several that a browser or other user-agent might use in determining the language of a given piece of input:

Keyboard

One important hint about a user's input language is their keyboard. Particularly on mobile devices, whose input systems often feature input prediction and correction, users often set their keyboard to the language they are inputting--if only to avoid constantly having to override the auto-correction. Mobile browsers and Webviews sometimes have programmatic access to the current keyboard language (the layout or the auto-correction language), which can serve as a hint of the user's intended language.

Note that some keyboards are multilingual and the language of the keyboard that is accessible via the API is not always the one that is currently active in the input method.

Even with this as a hint, there is no guarantee that the user is actually typing the language in question: this wiki page, for example, was typed in English using a Japanese keyboard set to "romaji" input mode.

In most cases, the keyboard language is *not* available inside the browser (for example, to JavaScript).

Locale

A second environmental hint is the runtime locale. This is not the same thing as the localization of the browser: it is the locale setting that APIs use for date and number formatting, list sorting, and other locale-affected behavior. It is usually a strong indicator of the localization of the operating environment and browser as well, but this isn't always the case.

Unlike the keyboard language, the runtime locale is accessible from JavaScript in browsers that support the Intl extension. Note that locale support in browser implementations of Intl varies by functional area. In particular, the Collator (sorting) APIs tend to support a narrower range of locales than the various formatters. Also bear in mind that the Unicode locale extensions (-u-) to BCP 47, which are used to tailor number, date, and collation settings, are probably not appropriate for use when tagging user-entered text for general interchange on the Web. While these subtags don't hurt anything, implementations may wish to strip them off when providing natural language text identification.
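As an illustration, here is a minimal sketch (assuming a browser with Intl support) of reading the runtime locale from a formatter and stripping any -u- extension subtags before using the value as a language-tag hint:

  // Read the locale that Intl actually resolved at runtime.
  function runtimeLocaleHint() {
    var locale = new Intl.DateTimeFormat().resolvedOptions().locale;
    // Drop BCP 47 Unicode extension subtags such as "-u-ca-buddhist".
    return locale.replace(/-u-.*$/, '');
  }

  // e.g. "th-TH-u-ca-buddhist-nu-thai" becomes "th-TH"
  console.log(runtimeLocaleHint());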

Also, obviously the runtime locale is only a hint: there is no guarantee that the user is typing in the same language.

Accept-Language

If the user is using their own computer, the Accept-Language header often indicates the user's preferred list of languages. Most users never set this value, so by default most browsers send the default locale of the operating environment (at least the one in use at the time the browser was installed).

This is also accessible in some cases via JavaScript as window.navigator.languages (most modern browsers) or as window.navigator.userLanguage (older versions of Internet Explorer).
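A minimal sketch of reading these values, falling back through the older single-value properties when the full preference list is not available (which properties exist varies by browser):

  function preferredLanguages() {
    var nav = window.navigator;
    // Newer browsers expose the full preference list.
    if (nav.languages && nav.languages.length) {
      return nav.languages.slice();        // e.g. ["fr-CA", "fr", "en"]
    }
    // Older browsers expose only a single value.
    var single = nav.language || nav.userLanguage || nav.browserLanguage;
    return single ? [single] : [];
  }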

Page Language

One other hint that exists is the language of the content where the user is entering text. This might be the computed language of the form element or that of the input element where the user is entering their text. This is often a weaker hint, since the page will generally be in a single language, while users can have much broader linguistic needs. For example, many users understand English, but may still need to type their name, address, or some other text in their preferred language. There is no link between the language of the page and what they input in that case.

Note that the Content-Language HTTP header might also serve as a useful hint about the intended audience of the page where the user is entering their text. If the form or input element does not have a useful lang attribute, the page's Content-Language might be a useful hint.
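For example, a sketch of computing this hint for an input element might look for the nearest ancestor with a lang attribute and fall back to a meta element mirroring Content-Language (the HTTP header itself is not directly visible to scripts):

  function pageLanguageHint(inputElement) {
    // The element itself, or its nearest ancestor, carrying a lang attribute.
    var tagged = inputElement.closest('[lang]');
    if (tagged && tagged.getAttribute('lang')) {
      return tagged.getAttribute('lang');
    }
    // Fall back to <meta http-equiv="Content-Language" content="...">, if present.
    var meta = document.querySelector('meta[http-equiv="Content-Language"]');
    return meta ? meta.getAttribute('content') : '';
  }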

Direct Detection

Other than environmental factors, the other option is to detect the input language directly. Heuristic language detection is, at best, imperfect, since most detection methods are based on statistics about character distribution. Many languages are very similar, and the more languages that can be detected, the poorer the separation between them. Various libraries provide language detection of this sort.

Statistical Detection

For most languages, the statistical distribution of n-grams is used to perform detection. Depending on the language, the length of the n-grams used varies. For languages with relatively small character sets, such as those written in the Latin, Greek, Cyrillic, or Arabic scripts (for example), n-grams of 3 (and sometimes more) characters are highly effective. For languages with large character sets, notably those that use Han ideographs (Chinese and Japanese for example), multi-character n-grams are less effective, since character sequences do not repeat frequently enough to serve as a strong signal.
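A minimal sketch of this approach, assuming precomputed trigram frequency profiles for each candidate language (the profiles below are truncated placeholders; real profiles are built from large corpora):

  var PROFILES = {
    en: { ' th': 0.012, 'the': 0.011, 'he ': 0.009 /* ... */ },
    de: { 'en ': 0.012, 'er ': 0.010, 'ch ': 0.009 /* ... */ }
  };

  // Compute relative trigram frequencies for a piece of text.
  function trigrams(text) {
    var counts = {}, total = 0;
    var s = ' ' + text.toLowerCase().replace(/\s+/g, ' ') + ' ';
    for (var i = 0; i + 3 <= s.length; i++) {
      var g = s.slice(i, i + 3);
      counts[g] = (counts[g] || 0) + 1;
      total++;
    }
    Object.keys(counts).forEach(function (g) { counts[g] /= total; });
    return counts;
  }

  // Score each candidate language by how closely the text's trigram
  // frequencies match the language's profile (lower distance is better).
  function detectLanguage(text, profiles) {
    var tg = trigrams(text);
    var best = null, bestScore = Infinity;
    Object.keys(profiles).forEach(function (lang) {
      var profile = profiles[lang], score = 0;
      Object.keys(tg).forEach(function (g) {
        score += Math.abs(tg[g] - (profile[g] || 0));
      });
      if (score < bestScore) { bestScore = score; best = lang; }
    });
    return best;
  }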

In some cases, the script of the text provides a strong hint about the language. For example, text in the Georgian, Armenian, or Cherokee script is probably in that script's eponymous language. In other cases, the presence of specific characters can serve as a signal: the appearance of either of the kana scripts hints that the text is Japanese, while Simplified and Traditional Chinese each have characters specific to that variant of the script.
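For example, a sketch of script-based hinting using Unicode script property escapes in regular expressions (available in more recent JavaScript engines; older engines would need explicit code point ranges):

  function scriptHint(text) {
    if (/\p{Script=Hiragana}|\p{Script=Katakana}/u.test(text)) return 'ja';
    if (/\p{Script=Hangul}/u.test(text)) return 'ko';
    if (/\p{Script=Georgian}/u.test(text)) return 'ka';
    if (/\p{Script=Armenian}/u.test(text)) return 'hy';
    if (/\p{Script=Cherokee}/u.test(text)) return 'chr';
    // Distinguishing Simplified from Traditional Chinese requires lists of
    // characters specific to each variant and is omitted from this sketch.
    return undefined; // no strong script-based signal
  }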

Challenges

Statistical language detection faces a number of challenges, and adding languages to the set being detected can produce diminishing returns for a variety of reasons.

Foreknowledge and Hinting

If the range of languages to be detected can be limited, the accuracy can be greatly improved. As mentioned above, the script of the input may be one way to initially limit the range of languages needing detection.
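Continuing the earlier sketches, the hypothetical detectLanguage function could be handed only the profiles that match the environmental hints, for example:

  // preferred: language tags gathered from the environmental hints above.
  function detectWithHints(text, profiles, preferred) {
    var candidates = {};
    Object.keys(profiles).forEach(function (lang) {
      if (preferred.indexOf(lang) !== -1) {
        candidates[lang] = profiles[lang];
      }
    });
    // If the hints eliminate every profile, fall back to the full set.
    if (Object.keys(candidates).length === 0) {
      candidates = profiles;
    }
    return detectLanguage(text, candidates);
  }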

More Sophisticated Schemes

Beyond statistical approaches, larger systems can make use of natural language processing, dictionaries, word frequency lists, and other means to address the weaknesses of statistical detection or to identify the language of content more thoroughly. Because these methods usually require large data sets, they are usually performed on a server rather than in the user agent.