This page is in response to a request at TPAC 2016 to document different ways in which a user-agent can derive or "figure out" which language tag to apply to content, usually keyboard input from the user.
These considerations apply only to situations where the language cannot be marked up by the content author. Where possible, including language metadata in the document format and using it is much safer. None of the methods of language detection described in this document is foolproof, and none is a reliable indicator of the language on its own.
Language metadata is important to the Web for a variety of reasons. When supplied by the system, as part of the content itself, or by information entered by end-users, the language of content can be used for many useful purposes, such as selecting appropriate fonts, spell checking, hyphenation, and choosing a voice for text-to-speech.
Static content, such as the body of a Web page or the contents of an e-book, often has language information provided by the document format or as part of the content metadata. Other content, particularly user input, is constrained only by the limits of the user's ingenuity. Even very exotic characters or languages can be entered on most systems via a character picker or by copying from some other source. As a result, none of the techniques presented here are perfect indicators of the user's intentions.
When processing keyboard input, the runtime environment might provide a number of hints about the user and the user's intentions. In descending order of relevance, here are several that a browser or other user-agent might use in determining the language of a given piece of input:
One important source for a user's input language is their keyboard. Particularly on mobile devices, whose input systems often feature prediction and auto-correction, users often set their keyboard to the language they are inputting, if only to avoid constantly overriding the auto-correction. Mobile browsers and Webviews sometimes have programmatic access to the current keyboard language (the layout or the auto-correction language), which can serve as a hint of the user's intended language.
Note that some keyboards are multilingual and the language of the keyboard that is accessible via the API is not always the one that is currently active in the input method.
Even with this as a hint, there is no guarantee that the user is actually typing the language in question: this wiki was typed using a Japanese keyboard set to "romaji" input mode.
In most cases, the keyboard language is *not* available inside the browser (for example, to JavaScript).
A second environmental hint is the runtime locale. This is not the same thing as the localization of the browser. It is the API setting of the locale that produces date and number formatting, list sorting, and other locale-affected behavior. It's usually a strong indicator of the localization of the operating environment and browser as well, but this isn't always the case.
Unlike the keyboard language, the runtime locale is accessible from JavaScript in browsers that support the Intl extension (the ECMAScript Internationalization API). Note that locale support in browser implementations of Intl varies by functional area; in particular, the Collator (sorting) APIs tend to support a narrower range of locales than the various formatters. Also bear in mind that the Unicode locale extensions to BCP 47 (the -u- subtags), which are used to tailor number, date, and collation settings, are probably not appropriate for tagging user-entered text for general interchange on the Web. While these subtags don't hurt anything, implementations may wish to strip them off when providing natural language text identification.
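For example, a script might read the runtime locale through Intl and strip the extension subtags before using the result as a hint. This is a minimal sketch assuming Intl support; Intl.Locale (and its baseName property) is used where available, with a simple string fallback otherwise.

```js
// Sketch: reading the runtime locale via the Intl API and removing the
// Unicode locale extension subtags before using the tag as a hint.
const runtimeLocale = Intl.DateTimeFormat().resolvedOptions().locale;

// Where Intl.Locale is available, baseName drops the extension subtags,
// e.g. "de-DE-u-ca-gregory" becomes "de-DE".
const baseTag = typeof Intl.Locale === 'function'
  ? new Intl.Locale(runtimeLocale).baseName
  : runtimeLocale.split('-u-')[0];   // simple string fallback

console.log(runtimeLocale, baseTag);
```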
As with the keyboard, the runtime locale is only a hint: there is no guarantee that the user is typing in the same language.
If the user is using their own computer, the Accept-Language header often indicates the user's preferred list of languages. Most users never set this value, so by default most browsers send the default locale of the operating environment (at least the one in use at the time the browser was installed).
The preferred language list is also accessible in some cases via JavaScript, as window.navigator.languages (supported by most modern browsers, including Chrome, Firefox, and Safari), window.navigator.language (a single tag), or the legacy window.navigator.userLanguage (Internet Explorer).
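A minimal sketch of reading these properties with fallbacks might look like the following; the trailing 'en' default is purely illustrative.

```js
// Sketch: reading the user's preferred languages in the browser,
// falling back through older, vendor-specific properties.
const preferredLanguages =
  navigator.languages && navigator.languages.length
    ? navigator.languages                    // ordered list in modern browsers
    : [navigator.language ||                 // older browsers: single tag
       navigator.userLanguage ||             // legacy IE
       'en'];                                // illustrative last resort

console.log(preferredLanguages[0]);
```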
One other hint that exists is the language of the content where the user is entering text. This might be the computed language of the form element or that of the input element where the user is entering their text. This is often a weaker hint, since the page will generally be in a single language, while users can have much broader linguistic needs. For example, many users understand English, but may still need to type their name, address, or some other text in their preferred language. There is no link between the language of the page and what they input in that case.
Note that the Content-Language HTTP header might also serve as a useful hint about the intended audience of the page where the user is entering their text. If neither the form nor the input element has a useful lang attribute, the page's Content-Language may be the best remaining hint.
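The sketch below shows one way a script might derive such a hint for a particular field. The helper name contentLanguageHint is illustrative, and because a server-sent Content-Language header is not directly visible to scripts, it would need to be reflected into the markup (or retrieved separately) to contribute here.

```js
// Sketch: deriving a content-language hint for a given input element by
// finding the nearest ancestor with a lang attribute, then falling back
// to the <html> element.
function contentLanguageHint(el) {
  const tagged = el.closest('[lang]');          // nearest element carrying lang=
  if (tagged && tagged.lang) {
    return tagged.lang;
  }
  return document.documentElement.lang || null; // page-level language, if any
}

// Illustrative usage:
// contentLanguageHint(document.querySelector('input'));
```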
Other than environmental factors, the other option is to detect the input language directly. Heuristic language detection is, at best, imperfect, since most detection methods are based on statistics about character distribution. Many languages are very similar, and the more languages that need to be distinguished, the poorer the separation between them becomes. Various libraries provide language detection of this sort.
For most languages, the statistical distribution of n-grams is used to perform detection. Depending on the language, the length of the n-grams used varies. For languages with relatively small character sets, such as those written in the Latin, Greek, Cyrillic, or Arabic scripts (for example), n-grams of 3 (and sometimes more) characters are highly effective. For languages with large character sets, notably those that use Han ideographs (Chinese and Japanese for example), multi-character n-grams are less effective, since character sequences do not repeat frequently enough to serve as a strong signal.
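The following sketch illustrates the kind of rank-ordered trigram profile such detectors typically build and then compare against per-language reference profiles; it is illustrative only, not a usable detector.

```js
// Sketch: building a rank-ordered character-trigram profile of a text.
// Real detectors compare this against reference profiles for each language.
function trigramProfile(text) {
  const counts = new Map();
  const s = ' ' + text.toLowerCase().replace(/\s+/g, ' ') + ' ';
  for (let i = 0; i + 3 <= s.length; i++) {
    const gram = s.slice(i, i + 3);
    counts.set(gram, (counts.get(gram) || 0) + 1);
  }
  // The most frequent trigrams form the text's "fingerprint".
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, 300)
    .map(([gram]) => gram);
}
```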
In some cases, the script of the text provides a strong hint about the language. For example, text in the Georgian, Armenian, or Cherokee script is probably in the language most closely associated with that script. In other cases, the presence of specific characters can serve as a signal: the appearance of either of the kana scripts hints that the text is Japanese, while certain characters occur only in Simplified or only in Traditional Chinese, indicating which of those two variants is in use.
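Modern JavaScript regular expressions support Unicode script properties, which makes this kind of check straightforward. The mapping from scripts to language tags below is a simplified assumption for illustration.

```js
// Sketch: using Unicode script properties (the /u flag and \p{Script=...})
// to collect language hints from the characters present in the text.
function scriptHints(text) {
  const hints = [];
  if (/\p{Script=Georgian}/u.test(text)) hints.push('ka');
  if (/\p{Script=Armenian}/u.test(text)) hints.push('hy');
  if (/\p{Script=Cherokee}/u.test(text)) hints.push('chr');
  if (/[\p{Script=Hiragana}\p{Script=Katakana}]/u.test(text)) hints.push('ja');
  return hints;
}
```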
Statistical language detection faces a number of challenges, and adding languages to the set being detected tends to reduce accuracy for a variety of reasons.
If the range of languages to be detected can be limited, the accuracy can be greatly improved. As mentioned above, the script of the input may be one way to initially limit the range of languages needing detection.
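As a rough illustration, a detector that accepts a candidate list could be fed the union of the hints discussed above. The detectLanguage function and its options are hypothetical, and scriptHints and contentLanguageHint refer to the sketches earlier on this page.

```js
// Sketch: limiting detection to candidates drawn from the earlier hints.
// detectLanguage is a hypothetical detector; scriptHints and
// contentLanguageHint are the sketches shown earlier on this page.
function constrainedDetect(text, inputEl) {
  const candidates = new Set(
    [
      ...scriptHints(text),                               // script of the text
      ...(navigator.languages || [navigator.language]),   // user preferences
      contentLanguageHint(inputEl),                       // page/element language
    ]
      .filter(Boolean)
      .map(tag => tag.split('-')[0])                      // compare base languages
  );
  return detectLanguage(text, { only: [...candidates] }); // hypothetical API
}
```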