Use cases for language information in web annotations

After discussion in the Web Annotation FTF and on the I18n WG telecon, I was asked by the latter to summarise the requirements for indicating language in annotations by assembling a simple set of use cases that illustrate likely needs. This is my attempt to do that, based on my (perhaps limited) understanding.

Definitions

Language metadata typically indicates the intended linguistic audience or user of the resource as a whole. For a multilingual resource, it's possible to imagine the property value being a list of languages. A property that expresses language metadata may therefore have more than one value, since it aims to describe all potential users of the information.

Text-processing language is the language of a particular range of text (which could be a whole resource or just part of it). A property that represents the text-processing language needs to have a single value, because it describes the text content in such a way that tools such as spell-checkers, default font applicators, hyphenation and line breakers, case converters, voice browsers, and other language-sensitive applications know which set of rules or resources to apply to a specific range of text. Such applications generally need an unambiguous statement about the language they are working on.

An annotation typically contains a target (the thing you are annotating), and a body (the thing you are saying about the target). Note that the body of an annotation can also become the target of another annotation, especially in scientific discourse, where people comment on commentaries.

Currently, the target and the body can each have a single language property, and the value of that property can be zero or more language tags.
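
As a minimal sketch of that arrangement, the following Python models a target and a body that each carry one language property holding a list of tags. The class and field names (Resource, Annotation, language) and the example.org URLs are assumptions made for illustration, not names taken from the Web Annotation data model.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Resource:
    """A target or a body. `language` holds zero or more language tags."""
    id: str
    language: List[str] = field(default_factory=list)


@dataclass
class Annotation:
    target: Resource
    body: Resource


# A French audio commentary on an English patent (hypothetical URLs).
annotation = Annotation(
    target=Resource("http://example.org/patent.pdf", ["en"]),
    body=Resource("http://example.org/analysis.mp3", ["fr"]),
)

# A resource with no declared language simply has an empty list.
unlabelled = Resource("http://example.org/unknown")
```

Note that nothing in this shape says whether a given tag is meant as metadata or as the text-processing language, which is the question the rest of this article explores.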

The question is how the recorded language information will be used: do we need separate properties for language metadata and text-processing language, or not?

Language(s) of the target

When applied to a target, language information is almost certainly metadata, since it isn't likely that the target will be operated on or changed in a language-sensitive fashion. If an annotation becomes a target for another annotation and there are no separate fields for text-processing and metadata languages, all language information is used as metadata when the original annotation becomes the target.

Section 3.2.1 has a use case where Béatrice records a long analysis of a patent and publishes the audio on her website as an mp3. She then creates an annotation with the mp3 as the body and the PDF of the patent as the target. Her target is labelled as en, whereas her body is fr.

(Felix's use case) Later someone may want to look through a number of annotations, including those of Béatrice, and retrieve a list of all the English targets that were annotated. In this case a language annotation on the target would be useful.

If Béatrice is annotating a patent that is available in more than one language, i.e. the same content is available in both (it's a translation), then she might point to more than one target (with appropriate differences in the selector). In this case, someone going from the annotation to the target may be able to choose which target to refer to(?).

(Another Felix use case) If the patent that Béatrice is annotating contains the same content in both English and French (i.e. within the same resource), then someone who is retrieving a list of targets that are in French from a set of annotations which includes that of Béatrice would want to retrieve this document also. In this case it would make sense to declare the language of the target patent to be both en and fr.
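
The two retrieval scenarios above can be sketched as a simple filter over declared target languages. This is only an illustration: the dictionary layout, the function name targets_in_language, and the example.org URLs are assumptions, and matching a requested tag against longer tags such as en-GB is a design choice rather than something the use cases require.

```python
def targets_in_language(annotations, wanted):
    """Return ids of targets whose declared languages include `wanted`."""
    wanted = wanted.lower()
    hits = []
    for anno in annotations:
        tags = [t.lower() for t in anno["target"].get("language", [])]
        # Accept the exact tag, or a longer tag that starts with it (en-GB for en).
        if any(t == wanted or t.startswith(wanted + "-") for t in tags):
            hits.append(anno["target"]["id"])
    return hits


# Hypothetical annotations: one English target, one bilingual, one German.
annotations = [
    {"target": {"id": "http://example.org/patent-en.pdf", "language": ["en"]}},
    {"target": {"id": "http://example.org/patent-bilingual.pdf", "language": ["en", "fr"]}},
    {"target": {"id": "http://example.org/patent-de.pdf", "language": ["de"]}},
]

english_targets = targets_in_language(annotations, "en")
french_targets = targets_in_language(annotations, "fr")
```

The bilingual target is returned for both queries only because both tags were declared on it, which is the reason for listing en and fr together.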

In the latter case, the question arises about how much content in another language the resource needs to contain for that language to be added to the list of languages declared. If this is a book or document containing the same content twice, then it seems clear that it is worth annotating for both languages. If there is simply one foreign phrase in a document, it's less clear to me that you'd want to do that. Nor is it clear to me where the cut-off point is. If this is a Spanish to Thai phrase book, would you need to label the content with both es and th? The value of labelling with es is clear, since that's the intended reader, but although the document indubitably contains large amounts of Thai also, is there a use case where it's necessary to know that, other than simply looking for resources that contain some amount of Thai text - which seems more like a general search query than a reason for annotating a target?

(Off topic: I'm also curious to know how implementations would obtain language information in the case of an ordinary HTML page. If you were annotating this article (in English, but with some Arabic examples), would they use the value of the lang attribute in the html tag? Would they scan further for other lang attributes?)

Language(s) of the body

Kensuke is reading an old Tibetan manuscript from the Dunhuang collection. The tool he is using to read the manuscript has access to annotations created by scholars working in the various languages of the International Dunhuang Project, who are commenting on the text. The section of the manuscript he is currently looking at has commentaries by people writing in Chinese, Japanese and Russian. Each of these commentaries is stored in a separate annotation, but the annotations point to the same point in the target document. Each commentary is mainly written in the language of the scholar, but may contain excerpts from the manuscript and other sources written in Tibetan, as well as quoted text in Chinese and English.

Kensuke speaks Japanese, so he wants to be presented with the Japanese commentary.

The body containing the Japanese commentary has a language property set to ja (Japanese). The tool he is using knows that he wants to read Japanese commentaries, and it uses this information to select and present to him the text contained in that body. This is language information being used as metadata – it indicates to the application doing the retrieval that the intended consumer of the information wants Japanese.
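
A sketch of that selection step, assuming the tool holds an ordered list of the reader's preferred languages. The function name pick_body, the dictionary layout, and the commentary ids are hypothetical.

```python
def pick_body(bodies, preferred_languages):
    """Return the first body that matches the most preferred language
    available, or None if no body matches any preference."""
    for wanted in preferred_languages:
        for body in bodies:
            if wanted in body.get("language", []):
                return body
    return None


# Three commentaries on the same point in the manuscript.
commentaries = [
    {"id": "commentary-zh", "language": ["zh"]},
    {"id": "commentary-ja", "language": ["ja"]},
    {"id": "commentary-ru", "language": ["ru"]},
]

# Kensuke prefers Japanese, with English as a fallback.
chosen = pick_body(commentaries, ["ja", "en"])
```

Here the language values are consulted purely as metadata: the tool never looks at the text itself to make the choice.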

The Japanese commentary for this particular annotation starts with a sentence in Japanese, but later contains some excerpts from Chinese and Tibetan sources. It's possible for the value of the language property, when used as metadata, to contain three language tags, ja,zh,bo (Japanese, Chinese, and Tibetan, respectively), but I'm not sure how useful that is in this particular use case.

Having identified the relevant annotation text to present to Kensuke, his application has to then display it so that he can read it. It's important to apply the correct font to the text. Ideographic characters such as

雪, 刃, 直, 令, 垔

have slight but important differences in Japanese vs Chinese fonts¹, and it's important not to apply a Chinese font to the Japanese text that Kensuke is reading. There are also language-specific differences in the way text is wrapped at the end of a line. For these reasons we need to identify the actual language of the text to which the font or the wrapping algorithm will be applied. Also, a voice browser will need to know whether to use Japanese or Chinese pronunciations for the ideographic characters contained in the annotation body text, and as mentioned before, various other text rendering or analysis tools need to know the language of the text they are dealing with.

If the language property value contains only ja, that's a good indicator that the application should expect the first sentence and the annotation in general to be in Japanese, unless instructed otherwise. If, however, the language property has the value bo,ja,zh, it's not clear what the default font, etc, should be. In that case, we need a way to indicate that the first sentence in the text presented to Kensuke is actually in Japanese.
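
The ambiguity can be made concrete with a small sketch. It assumes an application that treats a single-valued language property as the default text-processing language and refuses to guess otherwise; the function name default_text_language is hypothetical.

```python
from typing import List, Optional


def default_text_language(tags: List[str]) -> Optional[str]:
    """Return the tag to use for fonts, line breaking, etc.,
    or None when the declared value does not identify one."""
    if len(tags) == 1:
        return tags[0]
    # Zero tags: nothing declared. Several tags: no way to tell which
    # one applies to the first sentence of the text.
    return None


single = default_text_language(["ja"])
ambiguous = default_text_language(["bo", "ja", "zh"])
```

With only ja declared the application can proceed; with bo,ja,zh it gets None and needs some other signal.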

We also need a way to indicate the change of language to Chinese and Tibetan later in the commentary for this annotation, so that appropriate fonts and wrapping algorithms can be applied there. One proposal from members of the Annotation WG was to require HTML/XML formats for such annotation bodies, and use the lang or xml:lang attributes in markup to denote the language changes.
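
To illustrate the markup-based proposal, here is a rough sketch that walks an HTML body with Python's standard html.parser module and reports which language applies to each run of text. It assumes the body is well-formed HTML with explicit end tags, that language changes are marked with the lang attribute, and that the default language (ja here) is supplied from the annotation's language property. The class name LangRuns and the sample text are invented for the example.

```python
from html.parser import HTMLParser

# Elements that never have an end tag, so they must not affect the stack.
VOID = {"br", "hr", "img", "input", "link", "meta", "wbr"}


class LangRuns(HTMLParser):
    """Collect (language, text) pairs, inheriting lang from ancestors."""

    def __init__(self, default_lang):
        super().__init__()
        self.stack = [default_lang]  # language in effect at each depth
        self.runs = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        # Use the element's own lang if present, else inherit the parent's.
        lang = dict(attrs).get("lang") or self.stack[-1]
        self.stack.append(lang)

    def handle_endtag(self, tag):
        if tag not in VOID and len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.runs.append((self.stack[-1], text))


body = (
    '<p>日本語の注釈 <span lang="zh">中文引文</span> と '
    '<span lang="bo">བོད་ཡིག</span></p>'
)
parser = LangRuns(default_lang="ja")
parser.feed(body)
parser.close()
languages = [lang for lang, _ in parser.runs]
```

Each run can then be handed to font selection or line breaking with a single, unambiguous language, which is what text-processing tools need.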

(Use case from Felix) If the body Kensuke is reading contains quoted text in Chinese and Tibetan, it would be useful to know that if you were someone who wanted to locate all annotations containing text in more than one language.

Tentative conclusions

The language declarations used for the body and target may end up being the same where annotations are chained, if there are not separate fields for metadata and text-processing language information.

The problems only arise where there are multilingual targets/bodies and it is worth calling out the multiple languages. If a target or annotation is in a single language, there is no real need to distinguish text-processing language from metadata language.

Where the language property has a list of languages as its value, the issue is to know which of those languages should be taken to be the default text-processing language (e.g. to know that it's ja in our example bo,ja,zh above).

I'm still not clear how the language property values are derived for a target or body, but if it's possible to know what the default text-processing language is, it would be simple to move that to the beginning of the list (e.g. instead of bo,ja,zh use ja,bo,zh).
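
A sketch of that reordering, under the assumption that a consumer would then read the first tag in the list as the default text-processing language. The function name promote_default is hypothetical, and the convention itself is only the suggestion made above, not an agreed rule.

```python
def promote_default(tags, default):
    """Return a copy of `tags` with `default` moved to the front,
    keeping the other tags in their original order."""
    if default not in tags:
        raise ValueError(f"{default!r} is not among the declared languages")
    return [default] + [t for t in tags if t != default]


reordered = promote_default(["bo", "ja", "zh"], "ja")
```

The full list still serves as metadata for retrieval, while its first entry doubles as the default for fonts and line breaking.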

The only solution so far proposed for handling in-body language changes is to require the use of HTML or XML for the body and use the lang or xml:lang attributes on elements to mark the boundaries. (If we cannot expect a tool to analyse the markup, we cannot expect that to work, of course.)

Author: Richard Ishida