This document provides definitions and best practices related to the identification of the natural language of content in document formats, specifications, and implementations on the Web. It describes how language tags are used to indicate a user's locale preferences which, in turn, are used to process, format, and display data values and other information.

This is an updated Public Working Draft of "Language Tags and Locale Identifiers for the World Wide Web". The Working Group expects this to become a Working Group Note.

If you wish to make comments regarding this document, please raise a GitHub issue. You may also send email to the list www-international@w3.org (subscribe, archives) as mentioned below. Please include [ltli] at the start of your email's subject. To make it easier to track comments, please raise separate issues or send separate emails for each comment. All comments are welcome.

Introduction

Language tags and locales are among the fundamental building blocks of internationalization of the Web. In this document you will find definitions for much of the basic terminology related to this aspect of I18N.

This document also provides terminology and best practices needed by specification authors for the identification of natural language values in document formats or protocols and which are recommended by the Internationalization (I18N) Working Group. These (and many other) best practices, along with links to supporting materials, can also be found in the Internationalization Best Practices for Spec Developers [[INTERNATIONAL-SPECS]]. In addition to the best practices found here, additional best practices relating to language metadata on the Web can be found in [[STRING-META]].

Locales and Internationalization

This section defines basic terminology related to internationalization and localization.

Users who speak different languages or come from different cultural backgrounds usually require software and services that are adapted to correctly process information using their native languages, writing systems, measurement systems, calendars, and other linguistic rules and cultural conventions.

International Preferences. A user's particular set of language and formatting preferences and associated cultural conventions that software can employ to correctly process or present information exchanged with that user.

There are many kinds of international preferences that may be offered on the Web in order for the content or service to be considered usable and acceptable by users around the world. Some of these preferences might include:

... and many more.

Internationalization. The design and development of a product that is enabled for target audiences that vary in culture, region, or language. Internationalization is sometimes abbreviated I18N because there are eighteen letters between the "i" and the "n" in the English word.

Localization. The tailoring of a system to the individual cultural expectations of a specific target market or group of individuals. Localization includes, but is not limited to, the translation of user-facing text and messages. Localization is sometimes abbreviated as L10N because there are ten letters between the "L" and the "N" in the English word. When a particular set of content and preferences corresponding to a specific set of international preferences is operationally available, then the system is said to be localized.

Locale. A collection of international preferences, generally related to a language and geographic region, that is passed in APIs or set in the operating environment to get culturally affected behavior from a system or process. Usually a locale is identified by an id or shorthand token, such as a language tag.

Locale-aware (or Enabled). A system that can respond to changes in the locale with culturally and language-specific behavior or content. Generally, systems that are internationalized can support a wide range of locales in order to meet the international preferences of many kinds of users.

Language tags can provide information about the language, script, region, and various specially-registered variants using subtags. But sometimes there are international preferences that do not correlate directly with any of these. For example, many cultures have more than one way of sorting content items, and so the appropriate sort ordering cannot always be inferred from the language tag by itself. Thus a German language user might want to choose between the sort ordering used in a dictionary versus that used in a phone book.

Historically, locales were identified by the programming language or operating environment of the user. This application-specific identifier was often inferred from language tags. For example, an implementation could map a language tag from an existing protocol, such as HTTP's Accept-Language header, to its locale model.

Common Locale Data Repository (or [[CLDR]]). The Common Locale Data Repository is a Unicode Consortium project that defines, collects, and curates sets of locale data needed to enable systems or operating environments. CLDR data and its locale model are widely adopted, particularly in browsers.

Unicode Locale. A combination of language tag extensions ([[RFC6067]], [[RFC6497]]) and additional processing rules defined by [[CLDR]] to support locales.

A Unicode locale provides the ability to specify in a language tag international preference variations that go beyond linguistic or regional variation or to select formatting behavior or content when there are multiple options. Unicode locale identifiers are identical to language tags, but apply additional rules about the content of certain language tags. Unicode Locales increasingly form the basis for internationalization on the Web, particularly as part of the Intl locale framework [[ECMA-402]] in JavaScript [[ECMASCRIPT]].

Unicode's [[CLDR]] project maintains both [[BCP47]] extensions related to Unicode locales. The Unicode locale language tag extension [[RFC6067]] uses the -u- subtag, and provides subtags for selecting different locale-based formats and behaviors.
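As a rough illustration of how the -u- extension carries such preferences, the following Python sketch pulls the key/type pairs out of a tag such as de-u-co-phonebk (the phone book sort ordering for German). The function name is hypothetical, the parsing is simplified (attributes are ignored and no registry checks are made), and it is not how any particular implementation works.

```python
def unicode_extension_keywords(language_tag):
    """Return the -u- extension keywords of a tag as a {key: type} dict.

    Simplified sketch: keys are the two-character subtags inside the -u-
    extension; the longer subtags that follow a key form its type.
    """
    keywords = {}
    key = None
    in_u_extension = False
    # Skip the first subtag (the primary language) and scan the rest.
    for subtag in language_tag.lower().split("-")[1:]:
        if len(subtag) == 1:
            if subtag == "x":
                break  # private use: everything after -x- is opaque
            in_u_extension = subtag == "u"
            key = None
            continue
        if not in_u_extension:
            continue
        if len(subtag) == 2:
            key = subtag
            keywords[key] = ""  # a key with no type subtags
        elif key is not None:
            # Append this subtag to the type of the current key.
            keywords[key] = f"{keywords[key]}-{subtag}" if keywords[key] else subtag
        # Subtags before the first key are attributes; this sketch ignores them.
    return keywords
```

For example, unicode_extension_keywords("de-u-co-phonebk") yields a dictionary mapping "co" to "phonebk", which a collation service could then use to pick the sort ordering.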

The Transformed Content extension [[RFC6497]], which uses the -t- subtag, provides subtags for text transformations, such as transliteration between scripts.

It is important to remember that every Unicode locale identifier is also a well-formed [[BCP47]] language tag.

Some preferences are individual and are left to content authors, service providers, operating environments, or user agents to define and manage on behalf of the user.

Data value. In this document, data values are any data type used in a document format or application other than natural language string values. These often correspond to data types such as numbers, dates, booleans, etc. Note that on the Web many data values are serialized as strings.

Locale-neutral. A data value is said to be locale-neutral when it is stored or exchanged in a format that is not specifically appropriate to any given language, locale, or culture and which can be interpreted unambiguously for presentation in a locale-aware way.

A locale-neutral representation might itself be linked to a specific cultural preference, but such linkages should be minimized. An example of this is the set of ISO 8601 serializations of date/time values. Many of these are linked to the Gregorian calendar, but the format, field order, separators, and visual appearance are not specifically suitable to any locale (they are intended to be machine readable) and the value can be converted for display into any calendar or locale.
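The separation between a locale-neutral stored value and its localized presentation can be sketched as follows. This is only an illustration: the DISPLAY_PATTERNS table and both function names are hypothetical, and the patterns are hand-written for the example rather than taken from [[CLDR]] data, which a real implementation would consult instead.

```python
from datetime import date

# Hypothetical, hand-written display patterns used only for illustration.
DISPLAY_PATTERNS = {
    "en-US": "%m/%d/%Y",
    "de-DE": "%d.%m.%Y",
    "ja-JP": "%Y/%m/%d",
}


def to_interchange(value):
    """Serialize a date in the locale-neutral ISO 8601 form (YYYY-MM-DD)."""
    return value.isoformat()


def to_display(serialized, language_tag):
    """Parse the locale-neutral value and format it for the given context."""
    value = date.fromisoformat(serialized)
    # Fall back to the ISO 8601 layout when the sketch has no pattern.
    pattern = DISPLAY_PATTERNS.get(language_tag, "%Y-%m-%d")
    return value.strftime(pattern)
```

The stored string never changes; only the presentation step depends on the language tag of the context in which the value is shown.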

Language negotiation. The process of matching a user's international preferences to available locales, localized resources, content, or processing.

Locale fallback. The process of searching for translated content, locale data, or other resources by "falling back" from more-specific resources to more-general ones following a deterministic pattern.

A user's preferences are usually expressed as a locale or prioritized list of locales. When negotiating the language, the system follows some sort of algorithm to get the best matching content or functionality from the available resources. In many cases the language negotiation algorithm uses locale fallback.
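One common deterministic pattern for locale fallback is to drop subtags from the end of the tag, one at a time. The sketch below assumes that approach (it mirrors the truncation used by lookup in [[RFC4647]]), assumes tags are already consistently cased, and uses hypothetical function and parameter names; actual systems may use different or richer fallback rules.

```python
def fallback_chain(language_tag):
    """List a tag followed by its progressively more general forms."""
    subtags = language_tag.split("-")
    chain = []
    while subtags:
        chain.append("-".join(subtags))
        subtags.pop()
        # A single-character subtag (such as "x" or "u") cannot end a tag,
        # so it is removed together with the subtag that followed it.
        while subtags and len(subtags[-1]) == 1:
            subtags.pop()
    return chain


def find_resource(language_tag, bundles, default_tag):
    """Return the most specific bundle available for the tag."""
    for candidate in fallback_chain(language_tag):
        if candidate in bundles:
            return bundles[candidate]
    # Nothing matched: use the bundle designated as the default.
    return bundles[default_tag]
```

Here fallback_chain("zh-Hant-TW") produces zh-Hant-TW, then zh-Hant, then zh, and find_resource returns the first of those for which a bundle exists.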

Specifications that present data values in a document format SHOULD require that data is formatted according to the language of the surrounding content.

When data values are presented to the user as part of a document or application, the document or application forms the "context" where the data is being viewed. Content authors or application developers need a way to make the data values seem like a natural part of the experience and need a way to control the presentation. This is indicated by the language tag of the context in which the content appears: usually enabled implementations interpret the tag as a locale in order to accomplish this. Using the runtime locale or localization of the user-agent as the locale for presenting data values should only be a last resort.

Specifications that present forms or receive input of data values in a document format or application SHOULD require that the values be presented to the user localized in the format of the language of the content or markup immediately surrounding the value.

Specifications that present, exchange, or allow the input of data values MUST use a locale-neutral format for storage and interchange.

Implementations SHOULD present data values in a document format or application using a format consistent with the language of the surrounding content and are encouraged to provide controls which are localized to the same locale for input or editing.

Users expect form fields and other data inputs to use a presentation for data values that is consistent with the document or application where the values appear. Users usually expect their input to match the document's context rather than that of the user-agent or operating environment, and they expect input validation, prompting, and controls to be consistent with the content as well. This gives content authors the ability to create a wholly localized customer experience and is generally in keeping with customer expectations.

Languages and Language Tags

Tags for identifying the natural language of content or the international preferences of users are one of the fundamental building blocks of the Web. The language tags found in Web and Internet formats and protocols are defined by [[BCP47]]. Consistent use of language tags provides applications the ability to perform language-specific formatting or processing. For example, a user-agent might use the language to select an appropriate font for displaying text or a Web page designer might style text differently in one language than in another.

Many of the core standards for the Web include support for language tags; these include the xml:lang attribute in [[XML10]], the lang and hreflang attributes in [[HTML]], the language property in [[XSL10]], and the :lang pseudo-class in CSS [[CSS3-SELECTORS]].

Language tags can also be used to identify international preferences associated with a given piece of content or user because these preferences are linked to the natural language, regional association, or culture of the end user. Such preferences are applied to processes such as presenting numbers, dates, or times; sorting lists linguistically; providing defaults for items such as the presentation of a calendar, or common units of measurement; selecting between 12- vs. 24-hour time presentation; and many other details that users might find too tedious to set individually. Collectively, these preferences are usually called a locale. The extensions to [[BCP47]] that define Unicode locales [[CLDR]] provide the basis for internationalization APIs on the Web. Notably, the JavaScript language [[ECMASCRIPT]] uses Unicode locales as the basis for the APIs found in [[ECMA-402]].

Natural Language (or, in this document, just language). The spoken, written, or signed communications used by human beings.

There are many ways that languages might be identified and many reasons that software might need to identify the language of content on the Web. Document formats and protocols on the Web generally use the identifiers used in most other parts of the Internet, consisting of the language tags defined in [[BCP47]].

[[BCP47]] is a multipart document consisting, at the time this document was published, of two separate RFCs. The first part, called Tags for Identifying Languages [[RFC5646]], defines the grammar, form, and terminology of language tags. The second part, called Matching of Language Tags [[RFC4647]], describes several schemes for matching, comparing, and selecting content using language tags and includes useful terminology related to comparison of language preferences to tagged content.

Language tag. A string used as an identifier for a language. In this document, the term language tag always refers explicitly to a [[BCP47]] language tag. These language tags consist of one or more subtags.

Specifications for the Web that require language identification MUST refer to [[BCP47]].

Specifications SHOULD NOT refer to specific component RFCs.

The "BCP" nomenclature refers to the current set of RFCs that form the "best current practice". At the time this document was published, [[BCP47]] consisted of two RFCs: Tags for Identifying Languages [[RFC5646]] and Matching of Language Tags [[RFC4647]].

Formulations such as "RFC 5646 or its successor" MAY be used, but only in cases where the specific document version is necessary.

While this style of reference was once popular, using the BCP reference is more accurate. Since the grammar of language tags has been fixed since [[RFC4646]], referring to the BCP will not incur additional compliance risk to most implementations.

Specifications MUST NOT reference obsolete versions of [[BCP47]], such as [[RFC1766]] or [[RFC3066]].

Specifications that need to preserve compatibility with obsolete versions of [[BCP47]] MUST reference the production obs-language-tag in [[BCP47]].

Beginning with [[RFC4646]], [[BCP47]] defined a more complex, machine-readable syntax for language tags. Some specifications might desire or require compatibility with the older language tag grammar found in previous versions of BCP47 (specifically [[RFC1766]] and [[RFC3066]]). This grammar was more permissive and is described in [[BCP47]] as the ABNF production obs-language-tag. [[RFC4646]], which introduced the current grammar for language tags, is itself obsolete.

Applications that provide language information as part of URIs (e.g. in the realm of RDF) SHOULD use [[BCP47]].

Currently, URIs expressing language information often use values from parts of ISO 639. This leads to ambiguity about what the proper value should be: for German, for example, either de from ISO 639-1 or ger from ISO 639-2. Using BCP 47 and its language subtag registry avoids such ambiguities; for German, the registry contains only de.

Specifications SHOULD NOT restrict the length of language tags or permit or encourage the removal of extensions.

Subtag. A sequence of ASCII letters or digits separated from other subtags by the hyphen-minus character and identifying a specific element of meaning within the overall language tag. In [[BCP47]], subtags can consist of upper or lowercase ASCII letters (the case carries no distinction) or ASCII digits. Subtags are limited to no more than eight characters (although additional length restrictions apply depending on the specific use of the subtag).
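Because case carries no distinction, comparisons between tags need to be case-insensitive, and many systems rewrite tags into the conventional casing recommended by [[BCP47]] (lowercase language, titlecase script, uppercase region). The sketch below applies that convention; the function names are hypothetical and the code only adjusts case, it does not check that the tag is well-formed.

```python
def format_case(language_tag):
    """Apply the conventional BCP 47 casing to each subtag of a tag."""
    formatted = []
    after_singleton = False
    for position, subtag in enumerate(language_tag.split("-")):
        subtag = subtag.lower()
        # Two- and four-character subtags are re-cased unless they start
        # the tag or follow a single-character (singleton) subtag.
        if position > 0 and not after_singleton:
            if len(subtag) == 2:
                subtag = subtag.upper()
            elif len(subtag) == 4:
                subtag = subtag.title()
        if len(subtag) == 1:
            after_singleton = True
        formatted.append(subtag)
    return "-".join(formatted)


def same_tag(first, second):
    """Compare two tags without regard to case."""
    return first.lower() == second.lower()
```

With this sketch, format_case("EN-latn-us") gives en-Latn-US, and same_tag("en-US", "EN-us") is true.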

Selecting content or behavior based on the language tag requires a few additional concepts defined by [[RFC4647]]. In this document, we adopt the following terminology:

IANA Language Subtag Registry. A machine-readable text file available via IANA which contains a comprehensive list of all of the subtags valid in language tags. (Link: Registry)

Specifications SHOULD NOT reference [[BCP47]]'s underlying standards that contribute to the IANA Language Subtag Registry, such as ISO 639, ISO 15924, ISO 3166, or UN M.49.

Some standards might directly consume one of [[BCP47]]'s contributory standards, in which case a reference is wholly appropriate. However, in most cases, the purpose of the reference is to specify a valid list of codes and their meanings. [[BCP47]]'s subtag registry is stabilized and resolves ambiguity in a number of useful ways and so should be the preferred source for this type of reference.

[[BCP47]] defines two different levels of conformance. See classes of conformance in [[BCP47]] for specifics. For language tags, the levels of conformance correspond to the type of checking that an implementation applies to language tag values.

Well-formed language tag. A language tag that follows the grammar defined in [[BCP47]]. That is, it is structurally correct, consisting of ASCII letters and digit subtags of the prescribed length, separated by hyphens.
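A structural check of this kind can be approximated with a regular expression. The following sketch is a transcription of the langtag grammar from [[RFC5646]] as the editor of this example understands it; the names are hypothetical, and the ABNF in [[BCP47]] remains the authority if the two disagree.

```python
import re

# Grandfathered tags that do not fit the regular grammar (lowercased).
_IRREGULAR = {
    "en-gb-oed", "i-ami", "i-bnn", "i-default", "i-enochian", "i-hak",
    "i-klingon", "i-lux", "i-mingo", "i-navajo", "i-pwn", "i-tao",
    "i-tay", "i-tsu", "sgn-be-fr", "sgn-be-nl", "sgn-ch-de",
}

_LANGTAG = re.compile(r"""
    (?: [a-z]{2,3} (?: -[a-z]{3} ){0,3}           # language, optional extlangs
      | [a-z]{4,8} )                              # longer language subtags
    (?: -[a-z]{4} )?                              # script
    (?: -(?: [a-z]{2} | [0-9]{3} ) )?             # region
    (?: -(?: [a-z0-9]{5,8} | [0-9][a-z0-9]{3} ) )*  # variants
    (?: -[0-9a-wy-z] (?: -[a-z0-9]{2,8} )+ )*     # extensions
    (?: -x (?: -[a-z0-9]{1,8} )+ )?               # private use
""", re.IGNORECASE | re.VERBOSE)

# A tag consisting only of private-use subtags, such as "x-whatever".
_PRIVATE_USE = re.compile(r"x(?:-[a-z0-9]{1,8})+", re.IGNORECASE)


def is_well_formed(language_tag):
    """Check structure only; no subtag is looked up in the registry."""
    if language_tag.lower() in _IRREGULAR:
        return True
    return bool(_LANGTAG.fullmatch(language_tag)
                or _PRIVATE_USE.fullmatch(language_tag))
```

Note that a tag can pass this check without being meaningful: the check confirms shape and order of subtags, not that any of them is registered.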

Valid language tag. A language tag that is well-formed and has also been checked to ensure that each of the subtags appears in the IANA Language Subtag Registry.

Specifications SHOULD require that language tags be well-formed.

Specifications MAY require that language tags be valid.

Specifications SHOULD require content authors use valid language tags.

Note that this is stricter than what is recommended for implementations.

Content validators SHOULD check if content uses valid language tags where feasible.

Checking if a tag is valid requires access to or a copy of the registry plus additional runtime logic. While content authors are advised to choose, generate, and exchange only valid values, language tag matching and other common language tag operations are designed so that validity checking is not needed. Features or functions that need to understand the specific semantic content of subtags are the main reason that a specification would normatively require valid tags as part of the protocol or document format.
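To show why validity checking needs registry data, here is a deliberately small sketch. REGISTRY_EXCERPT is a hypothetical, hand-picked handful of subtags standing in for the full IANA Language Subtag Registry, and the function (also hypothetical) only understands tags of the form language, optional script, optional region; a real validator would load the whole registry and handle variants, extensions, and the other rules in [[BCP47]].

```python
# Hypothetical, hand-picked excerpt; the real registry has thousands of records.
REGISTRY_EXCERPT = {
    "language": {"de", "en", "fr", "ja", "zh"},
    "script": {"cyrl", "hans", "hant", "latn"},
    "region": {"419", "cn", "de", "gb", "tw", "us"},
}


def is_valid_simple(language_tag):
    """Check language[-script][-region] tags against the excerpt above."""
    subtags = language_tag.lower().split("-")
    if subtags[0] not in REGISTRY_EXCERPT["language"]:
        return False
    rest = subtags[1:]
    # Optional script: four letters.
    if rest and len(rest[0]) == 4 and rest[0].isalpha():
        if rest[0] not in REGISTRY_EXCERPT["script"]:
            return False
        rest = rest[1:]
    # Optional region: two letters or three digits.
    if rest and ((len(rest[0]) == 2 and rest[0].isalpha())
                 or (len(rest[0]) == 3 and rest[0].isdigit())):
        if rest[0] not in REGISTRY_EXCERPT["region"]:
            return False
        rest = rest[1:]
    # Anything left over is outside the scope of this sketch,
    # so the sketch cannot vouch for it.
    return not rest
```

A tag such as en-Abcd is well-formed (four letters in the script position) but fails this check because no such script subtag is registered, which is exactly the distinction between the two conformance levels.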

Canonical Unicode locale identifier. A well-formed language tag that also conforms to the additional rules for Unicode locale identifiers found in [[CLDR]] (see Section 3). Unicode locales define additional conformance criteria and normalization steps beyond those found in [[BCP47]] that help make language tags more consistent and interoperable.

Specifications MAY reference registered extensions to [[BCP47]] as necessary.

In particular, [[RFC6067]] defines the BCP 47 Extension U, also known as "Unicode Locales". This extension to [[BCP47]] provides additional subtag sequences for selecting specific locale variations.

Content authors SHOULD choose language tags that are canonical Unicode locale identifiers.

The additional content restrictions and normalization steps found in Section 3 of [[LDML]] provide for better interoperability and consistency than that afforded by [[BCP47]] directly.

Implementations SHOULD only emit language tags that are canonical Unicode locale identifiers and SHOULD normalize language tags that they consume using the rules for producing canonical tags.

As above, the additional content restrictions and normalization steps found in Section 3 of [[LDML]] provide for better interoperability and consistency than that afforded by [[BCP47]] directly. This best practice should not be interpreted as meaning that implementations need to support, generate, process, or understand either of [[CLDR]]'s extensions.

Language range. A string similar in structure to a language tag that is used for "identifying sets of language tags that share specific attributes".

Language priority list. A collection of one or more language ranges identifying the user's language preferences for use in matching. As the name suggests, such lists are normally ordered or weighted according to the user's preferences. The HTTP [[RFC2616]] Accept-Language [[RFC3282]] header is an example of one kind of language priority list.

Basic language range. A language range consisting of a sequence of subtags separated by hyphens. That is, it is identical in appearance to a language tag.

Extended language range. A language range consisting of a sequence of hyphen-separated subtags. In an extended language range, a subtag can either be a valid subtag or the wildcard subtag *, which matches any value.
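The practical difference between the two kinds of range shows up when they are matched against tags. The sketch below follows the basic and extended filtering rules of [[RFC4647]] as this example's author reads them; the function names are hypothetical.

```python
def basic_filter_match(language_range, language_tag):
    """Basic range: matches an equal tag or a tag it is a prefix of."""
    range_lower = language_range.lower()
    tag_lower = language_tag.lower()
    return (range_lower == "*"
            or tag_lower == range_lower
            or tag_lower.startswith(range_lower + "-"))


def extended_filter_match(language_range, language_tag):
    """Extended range: subtags may be skipped and "*" matches any value."""
    range_subtags = language_range.lower().split("-")
    tag_subtags = language_tag.lower().split("-")
    # The first subtags must match unless the range starts with a wildcard.
    if range_subtags[0] != "*" and range_subtags[0] != tag_subtags[0]:
        return False
    r, t = 1, 1
    while r < len(range_subtags):
        if range_subtags[r] == "*":
            r += 1                      # wildcard: nothing to check
        elif t >= len(tag_subtags):
            return False                # tag ran out before the range did
        elif range_subtags[r] == tag_subtags[t]:
            r += 1
            t += 1
        elif len(tag_subtags[t]) == 1:
            return False                # never skip past a singleton
        else:
            t += 1                      # skip an unmatched tag subtag
    return True
```

For instance, the range de-DE matches the tag de-Latn-DE under extended filtering, because the script subtag can be skipped, but not under basic filtering, which requires the range to be a prefix of the tag.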

Some language priority lists, such as the Accept-Language [[RFC3282]] header mentioned earlier, provide "weights" for values appearing in the list. Such weighting cannot be depended on for anything other than ordering the list.
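A minimal sketch of turning an Accept-Language value into an ordered language priority list follows. It uses the weights only to order the ranges, as described above, and it assumes the usual HTTP convention that a weight of zero marks a range as not acceptable. The function name is hypothetical and the parsing is lenient rather than a complete implementation of the header grammar.

```python
def parse_accept_language(header_value):
    """Return the language ranges of a header, most preferred first."""
    entries = []
    for position, item in enumerate(header_value.split(",")):
        parts = [part.strip() for part in item.split(";")]
        language_range = parts[0]
        if not language_range:
            continue
        weight = 1.0  # a range without a q parameter has full weight
        for parameter in parts[1:]:
            name, _, value = parameter.partition("=")
            if name.strip().lower() == "q":
                try:
                    weight = float(value)
                except ValueError:
                    weight = 0.0  # unreadable weight: drop the range
        if weight > 0:
            # Sort by descending weight, then by original position.
            entries.append((-weight, position, language_range))
    entries.sort()
    return [language_range for _, _, language_range in entries]
```

Given "da, en-gb;q=0.8, en;q=0.7", the result is the list da, en-gb, en; the numeric weights are discarded once the order is established.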

Specifications that define language tag matching or language negotiation MUST specify whether language ranges used are a basic language range or an extended language range.

Specifications that define language tag matching MUST specify whether a matching operation produces a single result (lookup as defined in [[RFC4647]]) or a possibly-empty (zero or more) set of results (filtering as defined in [[RFC4647]]).
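The difference in result shape can be seen by running both operations over the same inputs. This is a sketch under the assumption that lookup truncates ranges from the end and falls back to a caller-supplied default, and that filtering uses basic language ranges; the function and parameter names are hypothetical.

```python
def lookup(priority_list, available_tags, default_tag):
    """Lookup: always returns exactly one tag."""
    by_lower = {tag.lower(): tag for tag in available_tags}
    for language_range in priority_list:
        if language_range == "*":
            continue  # a bare wildcard carries no information for lookup
        subtags = language_range.lower().split("-")
        while subtags:
            candidate = "-".join(subtags)
            if candidate in by_lower:
                return by_lower[candidate]
            # Truncate from the end, never leaving a trailing singleton.
            subtags.pop()
            while subtags and len(subtags[-1]) == 1:
                subtags.pop()
    return default_tag


def filter_basic(priority_list, available_tags):
    """Filtering: returns every matching tag, possibly none."""
    results = []
    for language_range in priority_list:
        range_lower = language_range.lower()
        for tag in available_tags:
            tag_lower = tag.lower()
            matches = (range_lower == "*"
                       or tag_lower == range_lower
                       or tag_lower.startswith(range_lower + "-"))
            if matches and tag not in results:
                results.append(tag)
    return results
```

With the available tags en, en-GB, de, de-CH, and fr, looking up the priority list (de-CH-1996, en) returns the single tag de-CH, while filtering the list (de, en-GB) returns de, de-CH, and en-GB. A range such as ja produces the default from lookup and an empty list from filtering.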

Specifications that define language tag matching MUST specify the matching algorithms available and the selection mechanism.

For example, JavaScript internationalization [[ECMA-402]] and [[CLDR]] provide a "best fit" algorithm which can be tailored by implementers.

Further Reading

The Internationalization WG has additional best practices and other references, such as articles on language tag choice. These include:

Revision Log

Changes to this document following the Working Draft of 2015-04-23 are available via the GitHub commit log. This document was significantly restructured since that revision. Notably:

The following changes were made since the revision of 2006-06-20.

The following log records changes that have been made to this document since the publication in April 2006.

Acknowledgements

The Internationalization Working Group would like to acknowledge the following contributors to this specification: