Types of language declaration

On the Web it is important to associate content with language information. On the one hand, this allows content to be processed or presented correctly to the reader; on the other, it may be important to know the language(s) of the intended audience for the resource as a whole. These are two different things: technologies should provide separate ways of expressing each, and content authors should use them appropriately.

This article describes how these two types of language information ('metadata' and 'text-processing') differ.

The text-processing language

A browser or application generally needs to take language into account when it displays or manipulates content. This includes speaking the text in a voice browser, running a spell checker, styling line-breaks, applying hyphenation, choosing default fonts, and many other tasks that need to be done in a language-appropriate way. For this, it needs to know which specific language it is dealing with for a given range of text.

So we are, by necessity, talking about associating a single language with the text, or some range of text, within the resource. Whereas the intended audience can be speakers of more than one language, a specific range of text can only be in one language at a time.

In HTML the lang attribute is used for specifying the text-processing language. It can be used to set a default for the page as a whole, and for internal fragments where the language changes.

<html lang="en">

...

<p>The title of the book is "<cite lang="el">Κάνοντας τον Παγκόσμιο Ιστό πραγματικά Παγκόσμιο</cite>".</p>

This need for specificity has implications for how one declares the language for text-processing, which is why the lang attribute only allows a single language value.

Metadata: the language of the intended audience

Metadata that describes the language or languages of the intended audience is about the document/resource as a whole. Such metadata may be used for searching, serving the right language version, workflow management, classification, etc.

The language of the intended audience does not necessarily include every language used in a document. Many documents on the Web contain embedded fragments of content in different languages, whereas the page itself is clearly aimed at speakers of one particular language. For example, a German city-guide for Beijing may contain useful phrases in Chinese, but it is aimed at a German-speaking audience, not a Chinese one.
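As a rough illustration of the city-guide case, the hypothetical page below sets German as the default text-processing language and marks only the embedded Chinese phrase separately. The title, sentence and phrase are invented for this sketch. The audience for the page as a whole remains German-speaking, even though some of the text is Chinese.

```html
<!DOCTYPE html>
<!-- Hypothetical German city guide: the default text-processing language is German -->
<html lang="de">
<head>
  <meta charset="utf-8">
  <title>Reiseführer Peking</title>
</head>
<body>
  <!-- Only the embedded Chinese phrase is marked as Chinese -->
  <p>Zur Begrüßung sagt man <span lang="zh">你好</span>.</p>
</body>
</html>
```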

On the other hand, it is also possible for a page to contain the same or parallel content in more than one language. For example, a Canadian web page may welcome readers with French content in the left column, and the same content in English in the right-hand column. Here the document is equally targeted at speakers of both languages, so there are two audience languages. This situation is not as common on the Web as in printed material since it is easy to link to separate pages on the Web for different audiences, but it does occur where there are multilingual communities. Another use case is a blog or a news page aimed at a multilingual community, where some articles on a page are in one language and some in another. For example, a forum used by a Punjabi community may contain posts in English, Hindi and Punjabi in a single thread.
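A minimal sketch of the two-column case might look like the following. The content is invented, and the choice of French as the default on the html element is an assumption of this sketch (it simply follows the language of the title). The point is that each column still carries exactly one text-processing language, while the audience metadata for the page as a whole would list both French and English.

```html
<!DOCTYPE html>
<!-- Hypothetical bilingual welcome page; the default follows the title's language (an assumption) -->
<html lang="fr">
<head>
  <meta charset="utf-8">
  <title>Bienvenue</title>
</head>
<body>
  <!-- Each column carries exactly one text-processing language -->
  <div lang="fr"><p>Bienvenue sur notre site.</p></div>
  <div lang="en"><p>Welcome to our site.</p></div>
</body>
</html>
```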

There are also pages where the navigational information, including the page title, is in one language but the real content of the page is in another. While this is not necessarily good practice, it doesn't change the fact that the language of the intended audience is usually that of the content, regardless of the language at the top of the document source.

For an HTML page, metadata about the audience can be expressed in an HTTP Content-Language header. This header can take multiple values.

Content-Language: en, hi, pa

HTML pages sometimes contain a meta element that declares language in a similar way, e.g. <meta http-equiv="content-language" content="en, fr"/>, but this construct is now deprecated and should not be used. (For more details, see HTTP headers, meta elements and language information.)

Inferring the text-processing language from metadata

In some cases, it may be possible to infer the text-processing language from the metadata for the resource, but not always.

If the metadata value is a list of more than one language, there is no way to tell from the list alone which language to use when processing any given piece of content.

Furthermore, where the language changes inside a document, information about the language of the intended audience can't be associated with the particular part of the page or document where each language is used, as text-processing requires (i.e. for the correct application of language-specific text-to-speech, styling, automatic font assignment, etc. to different parts of the document).
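To make this concrete, consider a hypothetical forum thread served with the header value en, hi, pa shown earlier. The header says nothing about which post is in which language, so each post needs its own lang attribute. The markup below is only a sketch: the greetings are invented, and the English default on the html element assumes that the surrounding page text is in English.

```html
<!DOCTYPE html>
<!-- Hypothetical forum thread; English is assumed as the default for the surrounding page text -->
<html lang="en">
<head>
  <meta charset="utf-8">
  <title>Community forum</title>
</head>
<body>
  <!-- Each post declares its own text-processing language -->
  <article><p>Hello everyone.</p></article>
  <article lang="hi"><p>नमस्ते</p></article>
  <article lang="pa"><p>ਸਤਿ ਸ੍ਰੀ ਅਕਾਲ</p></article>
</body>
</html>
```

With this markup, a voice browser or font-selection routine can treat each post appropriately, which the audience metadata alone could not support.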

When developing a new technology or format for data, developers should therefore provide separate methods for expressing the language of the intended audience vs. the text-processing language.

Content developers, in turn, should use each of the available constructs for its intended purpose rather than relying on one to stand in for the other.

For information about how to set language in HTML, see Declaring language in HTML.