Proposal to add base direction information to RDF Literals.

Introduction

There is a problem at the intersection of RDF Literals and internationalization. When using string Literals for natural language texts, the langString datatype of RDF 1.1 [[rdf11-concepts]] provides the possibility to add the language information via a [[bcp47]] language tag. However, it is not possible to express the base text direction (i.e., right-to-left or left-to-right) of the same literal.

While, in many cases, this information can be deduced from a text itself by looking at the first “strongly typed” Unicode character (i.e., not a punctuation, for example) [[uba-basics]], this may be misleading at times. An example for a problematic situation is when this first character happens to be a Latin character in a Hebrew text; the algorithm may deduce that the text is to be rendered in a left-to-right order, which would be wrong.
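
The heuristic described above can be illustrated with a short Python sketch. It is a simplification of the first-strong rule (it ignores directional isolates and embeddings), and the function name first_strong_direction is only an illustrative choice, not something defined by any specification.

```python
import unicodedata

def first_strong_direction(text):
    """Guess the base direction from the first strongly directional character."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":            # strong left-to-right (e.g., Latin)
            return "ltr"
        if bidi_class in ("R", "AL"):    # strong right-to-left (Hebrew, Arabic)
            return "rtl"
    return None                          # no strong character found

# The Hebrew example used throughout this document starts with a Hebrew letter.
print(first_strong_direction("פעילות הבינאום, W3C"))   # rtl
# A Hebrew text that happens to start with a Latin acronym is misdetected.
print(first_strong_direction("W3C פעילות הבינאום"))    # ltr
```

The second call shows the problematic situation: the text is Hebrew, but the heuristic reports left-to-right because the first strong character is Latin.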

There are a number of documents that describe the problem in more detail for various usages, like [[string-meta]], [[inline-bid-markup]], or [[uba-basics]]. The bottom line is that any specification that expresses natural language texts SHOULD have the means to provide additional information on the base direction of a text (e.g., set to right-to-left in our example above). While this issue is solved in [[html]] through the dir attribute, and can be done when defining, e.g., a set of terms in simple JSON (see [[string-meta]] for some viable approaches), this is not possible in the abstract model of RDF and, consequently, in the serialization syntaxes of RDF like Turtle [[turtle]] or JSON-LD [[json-ld]].

This problem has affected a number of specifications that define vocabularies expressed in, e.g., JSON-LD. These include Activity Streams [[activitystreams-core]], Verifiable Credentials [[verifiable-claims-data-model]], Web Publications [[wpub]], or Web of Things [[wot-thing-description]]. There have been parallel discussions in the various Working Groups (see, e.g., VC issue #436, WPUB issue #354, or WoT issue #643), and the problem was also discussed during the editorial work of [[string-meta]] (see, e.g., issue #12). All these discussions led to the same conclusion: because RDF cannot express this information, neither can JSON-LD or Turtle. In other words, this problem must be solved on the RDF level.

This document collects some of the possible solutions that came to the fore in various discussions. All these approaches have their pros and cons; and the goal is to have a unified approach for all W3C specifications instead of each of those having to come up with some heuristics separately.

The problem of setting the base direction for a string is, actually, the tip of the iceberg of internationalization issues around RDF literals. This document focuses so far only on establishing the overall directional context for a given string. There are additional problems related to handling direction and language changes within a string. For bidirectional text, this requires the use of Unicode formatting characters within plain text strings, or the use of HTML format strings, i.e., the HTML datatype of RDF 1.1 [[rdf11-concepts]]. While both of these can be done in today’s RDF, there are still problems with those approaches (see also [[string-meta]] for some of the issues).

Overview of the proposals

This document describes several approaches that have been discussed. All of these may require more detailed specification work and they all have their pros and cons in terms of specifications, serializations, and deployments.

For more details on some of the problems, the reader may refer to the issue discussions related to the present document, the [[string-meta]] document put together by the I18N Working Group at W3C, as well as the various W3C I18N articles on the subject, like [[inline-bid-markup]] or [[uba-basics]].

RDF based solutions

These solutions are based on modifying or extending the current RDF infrastructure.

Extend langString

The core definition

The current definition for langString is in section 3.3 of the RDF 1.1 document [[rdf11-concepts]]. It is based upon assigning a language tag, using [[bcp47]], to a string. It is somewhat of an odd case in the RDF 1.1 model insofar as it has, in contrast to all other data types in RDF, an additional structure: the lexical form comprises a string and the additional (language) tag (in contrast to other datatypes consisting of a single string, albeit possibly with a restricted content). The value space of the datatype is also defined as a set of tuples consisting of the literal string and the language tag. This peculiarity means that langString literals have to be treated separately by all RDF implementations.

Extending this core datatype would mean keeping this unique structure and adding a third element for a base direction to the tuple, both in the lexical and the value spaces. This base direction tag MUST have the value of ltr or rtl, with the semantics as defined for the dir attribute in [[html]].

A separate document contains the first draft for the necessary changes in the relevant section of RDF 1.1 [[rdf11-concepts]].

Note that there are use cases when a string SHOULD have a base direction but has no language; an example is an ISBN number which, if not specified as being ltr, may be displayed erroneously in a rtl context. This also means that langString may be considered to be a misnomer, although it may be retained for historical reasons.

Serializations

The serializations of a renewed langString SHOULD be defined in such a way that the language tag and the base direction tag are each OPTIONAL, but at least one of them MUST be present. This is important to ensure that current RDF datasets remain valid.

RDF/XML

RDF/XML can adopt a new rdf:dir attribute, based on the dir attribute defined in [[html]]. For example:

                            <rdf:Description >
                                <ex:example xml:lang="he" rdf:dir="rtl">פעילות הבינאום, W3C</ex:example>
                                <ex:isbn rdf:dir="ltr">978-2-290-16543-0</ex:isbn>
                            </rdf:Description>
                        

Turtle/SPARQL

Turtle has a special syntax for language tagged literals which could be extended, e.g.:

                            [] ex:example "פעילות הבינאום, W3C"@he^rtl ;
                               ex:isbn "978-2-290-16543-0"@^ltr .
                        

JSON-LD

JSON-LD uses special JSON objects (value objects) to express RDF literals; these can be extended via the introduction of a new JSON-LD keyword, @direction:

                            "ex:example" : {
                                "@value" : "פעילות הבינאום, W3C",
                                "@language" : "he",
                                "@direction" : "rtl"
                            },
                            "ex:isbn" : {
                                "@value" : "978-2-290-16543-0",
                                "@direction" : "ltr"
                            }
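
As a rough illustration, the following Python sketch maps such a value object to the extended (string, language, direction) tuple, and enforces the rule that at least one of the two tags is present. The helper name value_object_to_tuple is hypothetical, and @direction is the keyword proposed in this section, not an existing JSON-LD feature assumed to be implemented anywhere.

```python
def value_object_to_tuple(obj):
    """Map a JSON-LD-like value object to (string, language, direction)."""
    value = obj["@value"]
    language = obj.get("@language")    # a BCP47 tag, or None if absent
    direction = obj.get("@direction")  # "ltr", "rtl", or None if absent
    if direction not in (None, "ltr", "rtl"):
        raise ValueError("direction must be 'ltr' or 'rtl'")
    if language is None and direction is None:
        raise ValueError("at least one of @language or @direction is required")
    return (value, language, direction)

# The ISBN example has a base direction but no language.
isbn = {"@value": "978-2-290-16543-0", "@direction": "ltr"}
print(value_object_to_tuple(isbn))
```

Absent tags are represented by None here; this is merely one possible way to model the "undefined" entries of the tuple.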
                        

Pros and Cons

Extending the langString can be considered as the “ideal” solution: it takes care of what could be considered (from an internationalization point of view) as a missing feature (some would call it a bug) in the original RDF specification. It also fits the current specification very well. Its effect on the RDF Semantics [[rdf11-mt]] is only editorial: the tuples for langString, for literals in general and for D-entailment in particular, should simply include a new tag. Indeed, the RDF Semantics does not make use of the semantics of [[bcp47]] values, nor should it use the semantics of the base direction tag.

A major problem with the approach is that, because it touches the “core” of RDF, such a change would affect a large number of recommendations that all rely on that core: SPARQL, SHACL, RDFa, R2RML, etc. That means that if the core RDF specification is updated, all other documents may have to be updated at the same time, which becomes a significant endeavor. Similarly, deployment is also a major problem: because language tagged literals are treated separately in RDF, all implementations (Jena, RDFLib, Sesame, various triple stores, Turtle and RDF/XML parsers and serializers, etc.) MUST be updated. That may take a long time, and it may not be easy to convince the community to do so.

Define new datatype(s)

A new RDF datatype can be defined on top of the current RDF definition to cover the missing features.

Core definition

As a reminder, defining a new datatype means (per the RDF specification):

  • Define a lexical space.
  • Define a value space.
  • Define a lexical-to-value mapping.
  • Assign a unique URL to the new datatype.

There may be different possibilities to define a new datatype (or datatypes). These are described in the following two sections.

Define a single LocalizableString datatype

These requirements can be defined, for this case, as follows:

Lexical space
Unicode [[UNICODE]] strings which MUST follow the pattern value@lang^dir, where lang is a [[bcp47]] language tag and dir MUST have the value of ltr or rtl, with the semantics as defined for the dir attribute in [[html]]. At least one of the two tags is REQUIRED; either one (but not both) MAY be missing.
Value space
Triples consisting of the string, the language tag, and a base direction tag. One of the language or base direction tags MAY have an undefined value.
Lexical to value mapping
If the lexical value is value@lang^dir, the mapping is the identity mapping. If lang or ^dir is missing, then the corresponding tuple value in the value space is undefined.
The URL uniquely defining the new datatype
The obvious URL would be to put this into the RDF namespace, i.e., http://www.w3.org/1999/02/22-rdf-syntax-ns#LocalizableString.
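
The lexical-to-value mapping above could be sketched in Python as follows. This is only an illustration under stated assumptions: the text does not say how an "@" inside the string itself is handled, so the sketch assumes that the last "@" separates the string from the tags; the function name parse_localizable_string is hypothetical, and the language tag is checked only superficially.

```python
import re

# value@lang^dir, where lang and ^dir are each optional (but not both).
# The greedy ".*" makes the LAST "@" act as the separator (an assumption).
_LEXICAL = re.compile(
    r"(?P<value>.*)@(?P<lang>[A-Za-z0-9-]*)(?:\^(?P<dir>ltr|rtl))?",
    re.DOTALL,
)

def parse_localizable_string(lexical):
    """Map a lexical form to the (string, language, direction) triple."""
    match = _LEXICAL.fullmatch(lexical)
    if match is None:
        raise ValueError("not of the form value@lang^dir")
    language = match.group("lang") or None   # None stands for "undefined"
    direction = match.group("dir") or None
    if language is None and direction is None:
        raise ValueError("one of the language or direction tags is required")
    return (match.group("value"), language, direction)

print(parse_localizable_string("פעילות הבינאום, W3C@he^rtl"))
print(parse_localizable_string("978-2-290-16543-0@^ltr"))
```

The need for this kind of parsing, with its corner cases, is exactly the "micro-syntax" concern raised for this approach.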

Define a family of language datatypes

A (very large, albeit finite) family of datatypes can be defined using the following URL pattern: https://www.w3.org/i18n#XX_YY, where:

  • XX: is a [[bcp47]] language tag
  • YY: is ltr or rtl, with the semantics as defined for the dir attribute in [[html]].

Either XX or YY MAY be missing, but not both.

For each of these datatypes there is an identical lexical and value space:
Lexical space
Unicode [[UNICODE]] strings.
Value space
Triples consisting of the string, the language tag, and a base direction tag. One of the language or base direction tags MAY have an undefined value.
Lexical to value mapping
The mapping means parsing the URL and mapping it to the corresponding triple.
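
A minimal Python sketch of this mapping is shown here. It assumes that the fragment is split at its last underscore (BCP47 tags use hyphens, so an underscore cannot occur inside the language tag), and that a fragment without any underscore is a bare language tag; the latter is an assumption, since the text does not fix the form used when YY is missing. The function names are hypothetical.

```python
I18N_NS = "https://www.w3.org/i18n#"

def tags_from_datatype(iri):
    """Extract (language, direction) from a datatype URL of the XX_YY form."""
    if not iri.startswith(I18N_NS):
        raise ValueError("not a datatype of the i18n family")
    fragment = iri[len(I18N_NS):]
    language, sep, direction = fragment.rpartition("_")
    if not sep:
        # Assumption: no underscore means a language tag without direction.
        language, direction = fragment, ""
    if direction not in ("", "ltr", "rtl"):
        raise ValueError("direction must be 'ltr' or 'rtl'")
    if not language and not direction:
        raise ValueError("either the language or the direction is required")
    return (language or None, direction or None)

def literal_to_value(lexical, datatype_iri):
    """Map a typed literal to the (string, language, direction) triple."""
    language, direction = tags_from_datatype(datatype_iri)
    return (lexical, language, direction)

print(literal_to_value("פעילות הבינאום, W3C", I18N_NS + "he_rtl"))
print(literal_to_value("978-2-290-16543-0", I18N_NS + "_ltr"))
```

In contrast to the single-datatype variant, the string itself is left untouched; all the parsing happens on the datatype URL.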

Serializations

Because this is a new datatype, strictly speaking there is no need for a new serialization; all concrete syntaxes have a way to express literals with a datatype. E.g., in Turtle, the literal would have the form of

                        [] ex:example "פעילות הבינאום, W3C@he^rtl"^^rdf:LocalizableString ;
                           ex:isbn "978-2-290-16543-0@^ltr"^^rdf:LocalizableString .
                    

or, respectively

                        [] ex:example "פעילות הבינאום, W3C"^^i18n:he_rtl ;
                           ex:isbn "978-2-290-16543-0"^^i18n:_ltr .
                    

and in JSON-LD

                        "ex:example" : {
                            "@value" : "פעילות הבינאום, W3C@he^rtl",
                            "@type" : "rdf:LocalizableString"
                        },
                        "ex:isbn" : {
                            "@value": "978-2-290-16543-0@^ltr",
                            "@type" : "rdf:LocalizableString"
                        }
                    

or, respectively

                        "ex:example" : {
                            "@value" : "פעילות הבינאום, W3C",
                            "@type" : "i18n:he_rtl"
                        },
                        "ex:isbn" : {
                            "@value": "978-2-290-16543-0",
                            "@type" : "i18n:_ltr"
                        }
                    

However, newer versions of the serialization syntaxes MAY introduce the syntactic facilities to express this new datatype by adopting the serialization formats as shown for the langString extension case; these would be considered as syntactic sugar to generate these datatypes. E.g.,

                        "ex:example" : {
                            "@value" : "פעילות הבינאום, W3C",
                            "@language" : "he",
                            "@direction" : "rtl"
                        }
                    

could be considered as valid JSON-LD, generating a LocalizableString, or i18n:*, respectively, when mapped upon RDF. An advantage of introducing such a shorthand in JSON-LD is that vanilla JSON may adopt the same pattern, which can be used generally (see also the definition of Localizable in [[string-meta]]).

(Care should be taken of the fact that adopting such extra syntactic facilities means that the same localizable string could be expressed in two different ways. While this may not create too many problems in most cases, it may require some extra considerations if and when a canonical RDF format is defined.)

Pros and Cons

The major advantage of these approaches is that they are mostly transparent to current RDF deployments, and thus affect only application layers that display the literal values for humans. They also work for parsers “out of the box”, unless new serialization versions are defined.

The major disadvantage is that it is never a good idea to introduce yet another micro-syntax into a specification; it is the source of confusion and possible errors. (This can be mitigated by introducing new syntactic sugars.) Also, applications that want to make use of base direction values would have to handle some sort of a duality, insofar as the data may include both langString and LocalizableString (or, respectively, i18n:*) language strings.

(See also some further issues.)

Hybrid solution

This approach considers the “ideal” solution of extending `langString` (taking care of the core problem in RDF) but adding some feasibility considerations. The goal is to avoid the necessity to update all RDF related specifications at the same time. This could be achieved as follows.

It is worth noting that, in all solutions proposed above, the value space of the updated/new datatype is identical: triples consisting of the string, the language tag, and a base direction tag, where at most one of the tags may be undefined.

  • One of the new datatypes (or family thereof), as described above, is adopted, with the explicit provision that it is to be deprecated eventually. I.e., it is considered to be of a temporary usage only.
  • The core RDF 1.1 specification is updated, as described in the section above.
    • The serializations of the new datatype(s) (in N-Triple, N-Quads, Turtle, TriG, etc.) would remain valid (e.g., "value@XX^YY"^^rdf:LocalizableString, or "value"^^i18n:XX_YY, respectively). However, they would be specially parsed and mapped to the renewed rdf:langString datatype.
    • New syntaxes (akin to what has been described above) MAY also be introduced. However, to ensure interoperability with older RDF implementations, serializers MAY keep using the special syntax described above.

What this hybrid approach brings is that, as the first step, it is enough to update RDF 1.1 (as described in the separate section). The other specifications, as well as their deployments, are not under the pressure to be updated right away. Graphs using the new datatypes can be defined using the traditional syntaxes. At a later point, when the usage becomes widespread, the datatypes might be rescinded, alongside the updates of SPARQL, SHACL, etc., incorporating the new langString.

In view of their different usage patterns, new versions of RDFa and JSON-LD may be updated, though, with the appropriate adaptations on the RDF Graphs they generate (e.g., if they generate N-Triples then the new temporary datatypes should be used; if they are built on top of an RDF 1.2 compliant environment they would generate the proper value space entries).

Pros and Cons

The major advantage of this approach is that it combines the ideal solution of taking care of an RDF “bug” while ensuring a somewhat smoother deployment.

A disadvantage is that it creates a special case in several concrete syntaxes (e.g., Turtle), where the rdf:LocalizableString or i18n:XX_YY family of IRIs are used as "magic" terms, and end up being interpreted as a different IRI (namely rdf:langString). Another disadvantage is that very careful consideration is needed when the datatypes are deprecated, rescinded, etc., i.e., the evolution of the RDF ecosystem will need careful monitoring.

Compound literal

This solution is inspired by a separate discussion of the RDF community on “Language Tagged Strings” (that issue summarizes a large number of mails related to this topic that have been accumulated over the years). The essence of the discussion is to separate the “string”, as a simple data, from all the various characterizations that may be added to it. Language is one of those, but one can refer to, say, pronunciation issues, translation-related terms (like in [[its]]) and, of course direction. We should not aim at solving all the issues raised in that discussion, but we can get inspired and provide a basis for the community to carry on further, if it wishes.

Core definition

The following terms are defined in RDF:

  • rdf:CompoundLiteral: is a class representing a compound literal.
  • rdf:language: an RDF property. The range of the property is an rdfs:Literal, whose value must be a well-formed [[bcp47]] tag. The domain of the property is rdf:CompoundLiteral.
  • rdf:direction: an RDF property. The range of the property must be an rdfs:Literal, whose value must be either "ltr" or "rtl". The domain of the property is rdf:CompoundLiteral.

Semantically, the following triples represent a compound literal, whose language and base direction are specified:

                        _:a rdf:type rdf:CompoundLiteral .
                        _:a rdf:value "פעילות הבינאום, W3C" .
                        _:a rdf:language "he" . 
                        _:a rdf:direction "rtl" .
                    

The (blank) node _:a stands for the literal, whose language is Hebrew, and whose direction is right-to-left.
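
To illustrate how an application might read such a structure, here is a minimal Python sketch that collects the triples around a node into a (string, language, direction) tuple. It deliberately avoids any RDF library: triples are modeled as plain tuples of strings, which is a simplifying assumption, and the helper name read_compound_literal is hypothetical. The rdf:CompoundLiteral, rdf:language, and rdf:direction terms are the ones proposed in this section.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

def read_compound_literal(triples, node):
    """Collect the properties of a compound literal node into a tuple."""
    props = {p: o for (s, p, o) in triples if s == node}
    if props.get(RDF + "type") != RDF + "CompoundLiteral":
        raise ValueError("node is not typed as a compound literal")
    direction = props.get(RDF + "direction")
    if direction not in (None, "ltr", "rtl"):
        raise ValueError("direction must be 'ltr' or 'rtl'")
    return (props[RDF + "value"], props.get(RDF + "language"), direction)

graph = [
    ("_:a", RDF + "type", RDF + "CompoundLiteral"),
    ("_:a", RDF + "value", "פעילות הבינאום, W3C"),
    ("_:a", RDF + "language", "he"),
    ("_:a", RDF + "direction", "rtl"),
]
print(read_compound_literal(graph, "_:a"))
```

Because the language and the direction are ordinary properties, no special literal handling is needed; the same lookup would work for any further characterization attached to the node.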

The rdf: namespace has been used in the definition for the sake of simplicity; it is a separate decision whether that, or another namespace should be used.

A more “RDF-like” variant

From a strict RDF point of view, the definition above is sloppy; it attaches extra semantics to the values of literals instead of using URLs for identifying distinct notions such as languages and directions. A cleaner definition may be to define the following terms:

  • rdf:CompoundLiteral: is a class representing a compound literal.
  • the set of individuals of the form https://www.w3.org/i18n/lang#XX, where XX is a well-formed [[bcp47]] tag (the class of language tags)
  • the set of the two individuals of the form https://www.w3.org/i18n/dir#ltr and https://www.w3.org/i18n/dir#rtl (the class of direction tags)
  • rdf:language: an RDF property. The range of the property is the class of language tag URL, the domain is rdf:CompoundLiteral
  • rdf:direction: an RDF property. The range of the property is the class of direction tag URL, the domain is rdf:CompoundLiteral

Using the i18n: namespace the example representation would look slightly different:

                            _:a rdf:type rdf:CompoundLiteral .
                            _:a rdf:value "פעילות הבינאום, W3C" .
                            _:a rdf:language i18n:lang#he . 
                            _:a rdf:direction i18n:dir#rtl .
                        

Serializations

(The serialization of the RDF-like variant is left as an exercise for the reader.)

Turtle

                                [] ex:example [
                                    rdf:value "פעילות הבינאום, W3C" ;
                                    rdf:language "he" ; 
                                    rdf:direction "rtl" ;
                                ] .
                        

Note that, compared to previous examples the text does not appear directly as the object of a triple, but is encapsulated in a separate (blank) node that “represents” the text with all attributes attached.

JSON-LD

Interestingly, due to the specificities of the JSON-LD syntax, the serialization of compound literals is extremely straightforward. Indeed, a syntax of the form:

                            "ex:example" : {
                                "@value" : "פעילות הבינאום, W3C",
                                "@language" : "he",
                                "@direction" : "rtl"
                            }
                        

may be mapped onto compound literals directly.

Pros and Cons

The major advantage of this approach is that it is fully transparent to current RDF deployments, and affects only application layers that display the literal values for humans. It also works for parsers “out of the box”. As an extra bonus (as mentioned in the “Language Tagged Strings” discussion), it may actually be easier to query such compound literals via, say, SPARQL than using the current RDF approach.

The obvious disadvantage is that it duplicates a mechanism for expressing strings with a language tag only: one can use the core langString datatype or the compound literals. The relationship between the two forms should be specified.

General comments: Pros and Cons

The major advantage of these solutions is that they keep the changes “confined” to the realm of RDF, with no danger of interference with other technologies (in contrast to the approach below).

The major disadvantage of these solutions is that they touch the “core” of the RDF family of specifications, which comprises, by now, a rather large number of technologies (SPARQL, SHACL, conversion standards like R2RML or CSVW, etc.), and this means that, eventually, all these standards must be refined in one way or another. This is particularly true for the langString extension, which directly extends the RDF model. The other solutions in this section, in particular the hybrid approach, mitigate the problem somewhat. But even in that case, applications SHOULD, eventually, be adapted to the new structure of literals, which may require, per W3C process, setting up dedicated Working Groups.

BCP47 based solutions

These approaches are based on the extensions, or the usage of the BCP47 [[bcp47]] language tags, without touching the core RDF structures.

Extend language tags with -d-*

The core [[bcp47]] standard can be extended to include information on the base direction.

Core definition

Add -d-ltr and -d-rtl to the [[bcp47]] language tags. I.e., a full language tag could look like lang-d-dir, where lang is a valid [[bcp47]] language tag as of today, and dir MUST have the value of ltr or rtl with the semantics as defined for the dir attribute in [[html]]. See also [[d-langtag]].

Serializations

In terms of serializations, this change could be completely transparent for RDF. I.e., the examples used before would become, in Turtle:

                            [] ex:example "פעילות הבינאום, W3C"@he-d-rtl .
                        

and in JSON-LD:

                        "ex:example" : {
                            "@value" : "פעילות הבינאום, W3C",
                            "@language" : "he-d-rtl"
                        }
                    

Pros and Cons

Beyond the general issues described below, this approach has the additional problem that it requires a formal extension of the current [[bcp47]] standard. The first reactions of that community (see, e.g., the relevant email thread on the IETF mailing list) were certainly not in favor…

Private-use subtag -x-d-*

The core [[bcp47]] standard allows the usage of “private-use” subtags. These are chosen and maintained by private agreement amongst parties.

Core definition

Define the -x-d-ltr and -x-d-rtl private-use language subtags. I.e., the language tag could look like lang-x-d-dir, where lang is a valid [[bcp47]] language tag, and dir MUST have the value of ltr or rtl, with the semantics as defined for the dir attribute in [[html]]. By definition, the usage of this private-use tag is restricted to a specific family of specifications (at this moment one could restrict it to RDF, OWL, SPARQL, SKOS, etc., their various serializations, plus possibly JSON and CBOR).

Serializations

In terms of serializations, this change could be completely transparent for RDF. I.e., the examples used before would become, in Turtle:

                            [] ex:example "פעילות הבינאום, W3C"@he-x-d-rtl .
                        

and in JSON-LD:

                            "ex:example" : {
                                "@value" : "פעילות הבינאום, W3C",
                                "@language" : "he-x-d-rtl"
                            }
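
An application that wants to display such a literal would have to split the extended tag back into a language tag and a base direction. A minimal Python sketch of this step follows; the function name split_direction is hypothetical, and the sketch assumes that the direction is always the last part of the private-use sequence.

```python
import re

# lang-x-d-dir: a language tag followed by the private-use direction subtags.
_PRIVATE_DIR = re.compile(r"(?P<lang>.+)-x-d-(?P<dir>ltr|rtl)", re.IGNORECASE)

def split_direction(tag):
    """Split 'lang-x-d-dir' into (lang, dir); dir is None if not present."""
    match = _PRIVATE_DIR.fullmatch(tag)
    if match is None:
        return (tag, None)   # an ordinary language tag, left untouched
    return (match.group("lang"), match.group("dir").lower())

print(split_direction("he-x-d-rtl"))   # ('he', 'rtl')
print(split_direction("az-Arab"))      # ('az-Arab', None)
```

Software that is not aware of the convention would simply treat the whole tag as a language tag with an unknown private-use extension.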
                        

Pros and Cons

See the section below for a general discussion.

Rely on current BCP47 only

A different approach stems from the observation that, in certain contexts, the current BCP47 can be used, unchanged, to express the necessary base direction using, if necessary, the “script” subtag. The latter is necessary if the language itself does not uniquely identify the script (and, in consequence, the base direction) of the text. This is the case, for example, for Azerbaijani, which may be written in the Arabic, Cyrillic, or Latin script. However, the combination of these two pieces of information, plus the basic algorithms defined for Unicode, provides what is needed. For example,

                    "ex:example" : {
                        "@value" : "פעילות הבינאום, W3C",
                        "@language" : "he"
                    }
                

may, in fact, include the necessary information to deduce the right-to-left nature of the text by virtue of declaring the text to be in Hebrew, whereas in

                    [] ex:example "آذربايجانجا ديلي"@az-Arab ;
                                  "Азәрбајҹан дили"@az-Cyrl .
                

both strings need the extra information on script. However, the two together have enough information to establish the base direction.
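
The deduction could be sketched in Python as follows. The two lookup tables are small, illustrative excerpts and not a complete or authoritative mapping; a real implementation would have to rely on the full language subtag registry and script metadata. The function name direction_from_language_tag is hypothetical.

```python
# Illustrative excerpts only (assumptions for this sketch, far from complete).
RTL_SCRIPTS = {"Arab", "Hebr", "Syrc", "Thaa", "Nkoo"}
DEFAULT_SCRIPT = {"he": "Hebr", "ar": "Arab", "fa": "Arab", "ur": "Arab",
                  "en": "Latn", "ru": "Cyrl"}

def direction_from_language_tag(tag):
    """Deduce 'ltr' or 'rtl' from a language tag, or None if undecidable."""
    subtags = tag.split("-")
    language = subtags[0].lower()
    script = None
    for subtag in subtags[1:]:
        if len(subtag) == 1:      # singleton: extensions or private use follow
            break
        if len(subtag) == 4 and subtag.isalpha():
            script = subtag.title()   # script subtags are four letters
            break
    if script is None:
        # Fall back on the usual script of the language, if it is unambiguous.
        script = DEFAULT_SCRIPT.get(language)
    if script is None:
        return None
    return "rtl" if script in RTL_SCRIPTS else "ltr"

for tag in ("he", "az-Arab", "az-Cyrl", "az"):
    print(tag, direction_from_language_tag(tag))
```

The last call returns None: for Azerbaijani without a script subtag the base direction cannot be deduced from the tag alone.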

Pros and Cons

Mark Davis, in his GitHub comment, formalized the approach further, describing what a user agent may have to do. However, further discussions revealed that the algorithmic approach is significantly more complicated (see a separate wiki page) and is at odds with the current deployment and usage of language tags.

See the section below for a general discussion.

General comments: Pros and Cons

The major advantage of all these approaches is that they are fully transparent to current RDF deployments, and affect only application layers that do display the literal values for humans. They also work for parsers “out of the box”.

A major issue is that [[bcp47]] is already a very complex specification, in terms of the rich "metadata" that it assigns to languages. The notion of base direction does not fit, per current usage and understanding, the purpose of [[bcp47]], which is to describe metadata about language, scripts used, etc., and not a feature that is relevant for the display of that string. (See also the relevant email thread on the IETF mailing list.)

Another major issue is that [[bcp47]] is widely used and deployed in various environments, for example in HTML or CSS. Any change to the core [[bcp47]] specification, or its usage, may therefore have far-reaching consequences because all these environments would have to be updated, considering also that there would be a redundancy in terms of functionality (the renewed language tag format vs. the existing dir attribute). See, e.g., an issue comment outlining the possible difficulties.

Note that none of the examples in this section contained an example for an ISBN term, which needs a base direction but no language tag. One might consider using, e.g., und-x-d-ltr (where und is the BCP47 tag for an “undefined” language), but that may be at odds with the surrounding context and, therefore, is semantically not clean. In other words, it is not clear how the direction of an ISBN string could be defined in any of those schemes.

Finally, the interplay between current data deployment and the assignment of such extended language tags may not be obvious and may create new practical difficulties. See, e.g., the issue comment and the separate Wiki page outlining the possible difficulties.

UNICODE-based solution

The base direction of a string may also be controlled by the Unicode formatting characters U+200E LEFT-TO-RIGHT MARK or U+200F RIGHT-TO-LEFT MARK. This means that user agents receiving such a string could identify the base direction of any given natural language value by scanning the text for the first strong directional character that may include these formatting characters; no further information is strictly necessary. (See, e.g., the Web Publication draft for some examples.)

Serializations

In terms of serializations, this change is completely transparent for RDF. I.e., the example used before would become, in Turtle:

                        [] ex:example "\u200Fפעילות הבינאום, W3C"@he ;
                           ex:isbn "\u200E978-2-290-16543-0" .
					

and in JSON-LD:

						"ex:example" : {
						    "@value" : "\u200Fפעילות הבינאום, W3C",
						    "@language" : "he"
                        },
                        "ex:isbn" : "\u200E978-2-290-16543-0"
					

Pros and Cons

The major advantage of this approach is that there is no specification work to be done on the RDF, JSON, etc., side.

The major disadvantage is that this approach requires a change of the string data proper; a change that relies on some additional expertise the author/editor/etc. of that data does not necessarily have. Tools, like screen editors, WYSIWYG tools, etc., rarely offer such facilities. It also raises issues in terms of search, sorting, etc., of string data; see also the relevant set of problems in [[string-meta]] that are closely related to this approach.
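
The effect on the string data can be illustrated with a short Python sketch; the helper names add_direction_mark and strip_direction_mark are hypothetical. It shows that the mark sets the direction seen by a first-strong scan, but also that the marked string no longer compares equal to the original, which is the root of the search and sorting issues mentioned above.

```python
import unicodedata

LRM = "\u200E"   # LEFT-TO-RIGHT MARK
RLM = "\u200F"   # RIGHT-TO-LEFT MARK

def add_direction_mark(text, direction):
    """Prepend the formatting character for the given base direction."""
    marks = {"ltr": LRM, "rtl": RLM}
    return marks[direction] + text

def strip_direction_mark(text):
    """Remove a leading direction mark, e.g., before comparing strings."""
    return text[1:] if text[:1] in (LRM, RLM) else text

isbn = "978-2-290-16543-0"
marked = add_direction_mark(isbn, "ltr")

print(unicodedata.bidirectional(marked[0]))   # L: the mark is a strong character
print(marked == isbn)                         # False: the data itself has changed
print(strip_direction_mark(marked) == isbn)   # True
```

Any consumer that searches, sorts, or compares such values has to be aware of the extra character and normalize the strings accordingly.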

Also, just as for the BCP47 solutions, the interplay between current data deployment and the assignment of such extra formatting characters may not be obvious. Although the issue comment and the separate Wiki page outline the problems for the case when the language tags are used, the same considerations apply to this approach as well.

Note that this is the solution adopted in several of the aforementioned specifications as a suboptimal approach.

Acknowledgements

This document is a synopsis of a series of discussions, email contributions, etc., of a number of people, including Manu Sporny, Gregg Kellogg, David Longley, Rob Sanderson, Benjamin Young, Charles Neville, Richard Ishida, Addison Phillips, Martin Dürst, Andy Seaborne, and Mark Davis.