Notes on JSON strings and text direction

This is a draft intended to promote discussion and to work through ideas, and may be updated at any time.

This page gathers observations about handling of text direction and language in JSON data. Many, if not all, of the observations are also relevant to other aspects of string handling. It doesn't make recommendations; it just aims to draw together useful background information, questions and lines of thought to help determine the best course of action to support text direction and language metadata associated with strings in JSON objects. It was written by r12a, and these are personal thoughts, not endorsed by the i18n WG.

Note: it is difficult to represent bidi examples satisfactorily. For example, "פעילות הבינאום, W3C" doesn't actually show the expected position of 'W3C' when displayed, but also doesn't represent the order of characters in memory, since the first such character is פ. Where code examples below show characters from left-to-right in the order in which they are stored in memory, we'll use Hebrew to avoid the confusing effects of Arabic letters joining backwards.

When talking about markup in JSON strings we will assume, for the purposes of this discussion, that the markup is HTML5.

Why and what information is needed about the base direction for a string

For a simple introduction to how the Unicode bidirectional algorithm works, and where it needs additional help, see Unicode Bidirectional Algorithm basics. You will need a grasp of these basics to understand what follows. If you feel more adventurous, read the Unicode Bidirectional Algorithm (UBA).

To support correct display of text in right-to-left scripts when it is eventually displayed to a human, it is necessary to be able to:

establish the overall base direction for a paragraph
change the base direction used for a range of text within the string where needed.

These notes are only about the former, ie. establishing the paragraph direction to be associated with a string. We will refer to this as the paragraph direction, although it does not apply only to strings that will be treated as paragraphs when eventually displayed to a human. The string may be injected into an existing paragraph, such as when a telephone number is added to the end of a line which starts with the equivalent of 'Tel:' in, say, Arabic, Hebrew or Thaana.

The Unicode Bidirectional Algorithm specifies that the paragraph direction can be established by seeking out the first strong directional character in the paragraph. While doing so, an application must ignore non-strong characters at the start of the paragraph, as well as any characters inside an isolated range. If the first strong directional character is a right-to-left character (bidirectional class R or AL), then the paragraph direction is RTL; if it is a left-to-right character (class L), the paragraph direction is LTR. If no strong character is found, the default is LTR.
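To make the rule concrete, here is a minimal Python sketch of first-strong detection as just described. The function name first_strong_direction is purely illustrative, and the sketch assumes that Python's unicodedata module is an acceptable source for the bidirectional class of each character.

```python
import unicodedata

ISOLATE_INITIATORS = {"\u2066", "\u2067", "\u2068"}  # LRI, RLI, FSI
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE


def first_strong_direction(text):
    """Guess the paragraph direction of the first paragraph in text."""
    depth = 0  # number of isolates currently open
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi == "B":
            break  # paragraph separator: stop looking
        if ch in ISOLATE_INITIATORS:
            depth += 1
        elif ch == PDI:
            if depth > 0:
                depth -= 1
        elif depth == 0:
            # Only characters outside any isolated range are considered.
            if bidi == "L":
                return "ltr"
            if bidi in ("R", "AL"):
                return "rtl"
    return "ltr"  # no strong character found: default to LTR
```

Note that a string with no strong characters, such as a telephone number, falls through to the LTR default, and so does a string whose only strong characters sit inside an isolated range.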

The importance of establishing the paragraph direction can be illustrated with a couple of examples which show how it affects the order in which elements in the string will be rendered to a user. Without this information, users may be unable to understand a message. For example, the following shows a string presented with a RTL base direction.

Here is the same string presented with a LTR base direction.

Here is another example, this time a sequence of numbers, such as you may find in a telephone number, presented in a LTR context.

And now, exactly the same sequence, just the surrounding base context has changed to RTL.

For passing around strings this is not an issue, since characters are stored in logical order. It is only an issue when the text needs to be eventually displayed to a human. The display algorithm needs to know the paragraph direction before it starts to display the string.

This document is mostly about cases where the first strong directional character does not indicate the paragraph direction that the originator of the string intended.

Sometimes finding the first strong character is actually misleading. Take a string typed into an HTML form input field such as the following. The image below represents what the user sees as they type, and what they expect others to see later. Imagine this as something like a tweet beginning with a hash tag.

[Image: Rtl hash he.png]

The form field shown above is in a document where the direction of the input field has been set to RTL.

The sequence of characters stored in memory follows the logical order in which the characters were typed and indeed the order in which they would be pronounced, and is shown just below, progressing from left to right. The point to note is that the sequence starts with LTR characters.

‭#bidi פעילות הבינאום, W3C‬

If the consumer of this string were to assume that the text needs a LTR base direction, based on detecting the first strong directional character, the result would be incorrect when displayed to another human later.

[Image: Ltr hash he.png]

Typically, the information that indicates that this phrase should be displayed using a RTL paragraph direction is not contained in the string when the form is submitted. It may be contained in the computed direction of the form, which may be set directly on the form itself, via markup or via context menu selections or keystrokes, or it may be inherited from a parent element. (It can be passed to the receiver using the dirname attribute, but that is carried as separate information.)

In cases where the first strong directional character would give the wrong result, the question is how to associate the intended direction with the string for future use.

It is possible that the computed direction of the input element is set to auto. In this case, the browser would look for the first strong character on each line to determine the paragraph direction, and the user would be obliged to provide Unicode control characters at the start of the line shown above to produce the right effect. On the face of it, this doesn't seem such a bad thing – it gets the information we need into the string – but there are practical problems, not least because the user's keyboard is likely to not have the needed control characters. On the desktop browser, users would be more likely to use the context menu or keystrokes to set the direction for the field.

As mentioned above, these notes are only about establishing the paragraph direction.

If a user produces changes in the direction of inline ranges within the paragraph they will need to apply Unicode controls or markup to do so, and those items will remain as part of the string. The base direction of the initial (or only) paragraph, however, may not be included in the string.

Ascertaining the base direction of strings outside of JSON

Determining base direction from the string itself

As we saw above, it is often possible to look at the beginning of a string to determine whether the paragraph direction needs to be RTL or LTR. That would work for the first (Arabic) example above. The first strong character is RTL, so applying a RTL base direction to the string for display will produce the desired result.

But, as we also saw, sometimes finding the first strong character is actually misleading.

A similar problem arises if the JSON string starts with markup and you want to determine the paragraph direction for the text content enclosed by that markup. HTML tag names are always in Latin text, so to identify the first strong character in HTML markup you need to skip the markup, including attributes and their values.

Here is an example (again the characters are shown left-to-right as stored in memory) of a string which has been picked up with its surrounding markup. The content inside the markup needs to be displayed with a RTL base direction, so that the displayed text puts the 'W3C' (and the comma) to the left of the Hebrew text. (In this case there is no indication in the markup about the direction required because this is a RTL page and that information was set on the html tag.)

‭<p>פעילות הבינאום, W3C</p>‬

If the string is inserted into an HTML document's source code, where the display direction for the page is set to RTL, this should produce no problems. If, however, it is injected into a document with a LTR direction, it would.

Of course, if the tags here are not actually markup – say, for instance, that this is an example showing some source code – then you'd probably expect the first strong character 'p' to be taken into account to indicate that the paragraph direction for this string should be LTR. It's not clear to me how you would know the difference, just looking at a string, without some intelligence being applied by the consumer. Having said that, if the source of the string is markup based, the angle brackets should probably be escaped already in the case of example source code.
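The markup-skipping step can be sketched roughly as follows, using the HTML parser from the Python standard library. The names _TextCollector and first_strong_in_markup are hypothetical, and the sketch assumes that the string really is HTML and that isolated ranges can be ignored. Escaped angle brackets are treated as ordinary text, so example source code is judged by its first visible letter.

```python
import unicodedata
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects text content only; tag names and attributes are dropped."""

    def __init__(self):
        # convert_charrefs defaults to True, so &lt; arrives as '<' in the text.
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def first_strong_in_markup(fragment):
    """Apply first-strong detection to the text content of an HTML fragment."""
    collector = _TextCollector()
    collector.feed(fragment)
    collector.close()  # flush any text still buffered by the parser
    for ch in "".join(collector.chunks):
        bidi = unicodedata.bidirectional(ch)
        if bidi == "L":
            return "ltr"
        if bidi in ("R", "AL"):
            return "rtl"
    return "ltr"  # no strong character in the text content
```

A consumer that does not know the string contains markup would of course not know to call something like this, which is the underlying problem.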

If (true) markup comes with the base direction already specified, it would be important to try to understand that markup. In the following example the markup tells you that the base direction should be RTL, regardless of any first strong directional character:

‭<p dir="rtl">#bidi פעילות הבינאום, W3C</p>

While talking about heuristics-based approaches above, we mentioned first-strong detection as a way to assess the base direction for a string. There are, however, other algorithms in use, which the Unicode Bidirectional Algorithm allows for higher level protocols. For example, although Facebook relies, indeed, on first-strong detection to build markup with the right base direction around posts, Twitter instead counts the relative number of LTR vs RTL characters in a tweet to determine the base direction. Which way is better is beside the point. The point is that it would help to have a generally agreed way to signal the base direction of strings to promote interoperability.
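For comparison with first-strong detection, here is a small sketch of a counting heuristic of the kind described. It is not taken from any real service's code; the function name majority_direction and the choice to fall back to LTR on a tie are assumptions made for illustration.

```python
import unicodedata


def majority_direction(text):
    """Guess the base direction from the balance of strong characters."""
    ltr = rtl = 0
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi == "L":
            ltr += 1
        elif bidi in ("R", "AL"):
            rtl += 1
    # Assumption: a tie (including no strong characters at all) means LTR.
    return "rtl" if rtl > ltr else "ltr"
```

For the hash tag example used earlier, this heuristic yields RTL where first-strong detection yields LTR, which illustrates why producers and consumers need to agree on the method.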

Multiple paragraphs

As mentioned earlier, in the case where the string input by the user contains multiple paragraphs (ie. multiple lines separated by line breaks in a form input or other plain text, and markup containing block level constructs), we only need to know the base direction of the string as a whole. Any differences in base direction introduced between such paragraphs in a single input field, such as in

[Image: Multiple base directions.png]

would need to be introduced by the user anyway, and the mechanism used for that would be part of the captured string. In other words, we only need to capture and store the base direction set for the input as a whole.

Note, however, that the base direction is not only specified as RTL or LTR. If the input field is, say, a textarea with direction set to auto, it is expected that the base direction will be determined for each line (paragraph) on the basis of the first strong character in that line.
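As a rough illustration of the auto case, the following sketch assigns a direction to each line of a plain text string separately. The function name line_directions is hypothetical, and the sketch is simplified: it ignores isolated ranges and relies on Python's own notion of a line break.

```python
import unicodedata


def line_directions(text):
    """Return a list of (line, direction) pairs, one per line of text."""
    result = []
    for line in text.splitlines():
        direction = "ltr"  # default when the line has no strong character
        for ch in line:
            bidi = unicodedata.bidirectional(ch)
            if bidi == "L":
                break
            if bidi in ("R", "AL"):
                direction = "rtl"
                break
        result.append((line, direction))
    return result
```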

Storing base direction with JSON strings

There needs to be a way for producers of JSON strings to signal the base direction for a string the first time it is encoded as JSON, that can be recognised and used by consumers of the JSON data. The fewer the decisions, the conventions, and the efforts required to encode and interpret the base direction, the better.

Relying on first-strong characters

One way to store the base direction for a string is to follow the convention that the first strong directional character in a string indicates the base direction for the whole string.

Plain text

In the case of a string such as the following (shown in memory order),

"summary" : "‭פעילות הבינאום, W3C"

which starts with a RTL character, this is straightforward (as long as the intended base direction is indeed RTL).

Consumers of the JSON string would have to scan the string far enough to detect the first strong character.

In the case where there is no strong character (for example, a telephone number), consumers would need to follow the UBA convention that the default is LTR. Note that this does not mean that consumers of the JSON don't need to do anything. If a consumer is inserting the string into a RTL context, it would need to ensure that the LTR base direction was preserved for that string. This will involve building something around the string that announces the change in base direction.

In the case of this example (where the characters are also shown in memory order), which is a RTL phrase that starts with a LTR strong character,

"content" : "‭#bidi פעילות הבינאום, W3C‬"

the producer of the JSON string could add an invisible strong RTL character, ie. RLM, to the start of the string to indicate the expected base direction of the string when consumed later.

Note that you cannot expect humans creating the original string to use RLM in this situation, since the string would look perfectly fine to them if the surrounding content was RTL. Furthermore, RLM characters are not commonly available on user keyboards, especially for mobile devices.

An application that constructs a JSON string should only add RLM if there isn't already a strong RTL character at the start. Indiscriminate addition of RLM to RTL strings can cause a build up of redundant RLMs at the start of a string.
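A producer could guard against that build-up along the following lines. This is only a sketch: the function name flag_as_rtl is hypothetical, and it assumes the producer already knows, from the original context, that the intended base direction is RTL.

```python
import unicodedata

RLM = "\u200f"  # RIGHT-TO-LEFT MARK


def flag_as_rtl(text):
    """Prefix RLM only if first-strong detection would not already give RTL."""
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi in ("R", "AL"):
            return text  # already starts RTL (RLM itself has class R)
        if bidi == "L":
            break  # a strong LTR character comes first: a flag is needed
    return RLM + text
```

Because RLM is itself a strong RTL character, applying the function to a string that has already been flagged leaves it unchanged.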

One of the problems with this approach is that it modifies the string itself. This may not always be desirable. In some cases it may be important to preserve the string exactly as it was created.

Another potential issue is that the consumer still needs to work at preparing the string for use in its destination. It has to search the string until it finds the first strong character, and then it needs to build control codes or markup around the string in its final destination, before presentation to a user. The actual RLM character is likely to be redundant at this point, since an RLM character cannot establish an embedded range of text with a changed base direction. In this scenario, it only acts as a flag to the consumer.
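The consumer's side of this convention might look something like the sketch below, for a plain text string destined for HTML. The function name wrap_for_html and the choice of a span element are assumptions; the sketch also ignores isolated ranges when scanning.

```python
import html
import unicodedata


def wrap_for_html(text):
    """Wrap a plain text string in a span carrying its first-strong direction."""
    direction = "ltr"  # UBA default when no strong character is found
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi == "L":
            break
        if bidi in ("R", "AL"):  # includes an RLM flag added by the producer
            direction = "rtl"
            break
    # Escape the text so that it cannot be mistaken for markup.
    return '<span dir="{}">{}</span>'.format(direction, html.escape(text))
```

The RLM flag is passed through untouched here. It does no harm in the output, but it no longer does anything useful either.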

Furthermore, this way of indicating the intended base direction only works, as mentioned above, if all consumers of the JSON strings know that they should look for the first strong character to determine the base direction that should be applied. (Bear in mind that first-strong is only one of the heuristic methods used to try to guess the base direction for text.)

Strings that are enclosed in markup

Some strings may start with markup. Here we are looking at a string that begins and ends with HTML markup. For example,

"summary" : "‭<p>פעילות הבינאום, W3C</p>"

A consumer application that looks for the first strong character in such a string will always encounter a LTR character first, and the producer of the JSON string could either:

require the consumer to skip the markup when trying to detect the first character, or
put a directional control character at the start of the string as a flag to represent the base direction.

It is worth repeating that putting a control character such as RLM before a p tag would have no effect on the rendering of a string where its destination is HTML, since RLM doesn't create a base direction in HTML. For that you would need to use markup. Embedding/isolating control code pairs would not work either, since Unicode controls are only effective within the current paragraph (ie. they are inline constructs), and the p tag immediately initiates a new paragraph.

However, if the producer is dealing with a string marked up with HTML that happens to already contain a dir attribute on the surrounding tags, eg.

"summary" : "‭<p dir=\"rtl\">#bidi פעילות הבינאום, W3C</p>"

we have a different problem. It would not be necessary, nor appropriate, to scan for the first strong directional character. The directional information is already contained in the markup, and other than constituting a redundant flag, an RLM character has no function in markup of this kind.

However, it is also a big assumption to make that the consumer of the JSON string will know enough about the markup to recognise that dir="rtl" already specifies the direction, and use that. The markup may not represent HTML, for example, and an XML application may express direction with an idiosyncratic element or attribute.

Wrapping the string with Unicode characters

Aside from the strong directional characters, RLM and LRM, mentioned above, Unicode also provides paired control characters. Let's look at some of the implications of using those. Again, there are differences between dealing with plain text and markup.

Plain text

If you use an RLM single character to indicate the base direction for a plain text string, the consumer of the string will need to look at the context into which the string will be inserted, and if the surrounding context has a different base direction from the string itself it will need to add characters around the JSON string as it inserts it in order to apply the required base direction to the string itself. For plain text, the consumer will need to wrap the inserted JSON string in paired Unicode control characters.

If one wraps the JSON string in paired characters while producing the JSON, it may appear that one can eliminate the need for the consumer to examine the target context or add extra characters while inserting the string into its destination. Nor does the consumer need to scan the JSON string to find the appropriate first-strong character.

To completely eliminate the need for the consumer to do this extra work, however, you would need to wrap all plain text strings with these characters, whether the string already starts with the right first-strong directional character, such as,

"summary" : "‭פעילות הבינאום, W3C"

or whether it doesn't, as in,

"content" : "‭#bidi פעילות הבינאום, W3C‬"

You'd also need to wrap strings where there is no strong character (for example, the telephone number).

The reason for this is that the string may be inserted into a context which has an opposite base direction, and that cannot be known in advance.

Another advantage of this approach, apart from reducing the work of the consumer, is that there would be no need for an agreement or convention to make things work interoperably, as there was for the previous approach, where consumers needed to uniformly agree to look for RLM flags or first-strong characters to correctly interpret the base direction they should apply.

This all sounds promising, but of course there are some flies in the ointment.

Although it should do no harm to insert JSON strings wrapped in paired controls into the destination, there is a limit to the number of embeddings supported by the Unicode bidi algorithm (the UBA sets a maximum embedding depth of 125).

Furthermore, if the string is to be injected into an HTML file, the consumer may still need to build markup around it.

Perhaps more importantly, inserted strings need to be protected from the surrounding context to avoid spillover effects. Until recently, the control characters used for embedding new base direction were LRE/RLE...PDF. For example, suppose you want to list review numbers for film titles, such as,

The Dressmaker : 4 reviews

Once you add a RTL film title, such as בופור, you will produce unacceptable display results even if you wrap the inserted text in the RLE...PDF control characters. The displayed result will look like,

בופור : 4 reviews

You need to isolate the embedded text from the surroundings at the same time as applying the base direction. To do so, there are now some new paired control characters, which the Unicode Consortium recommends in preference to the aforementioned. These are called RLI/LRI...PDI (there's also an FSI for first-strong detection). These would fix the issue, but the problem is that they aren't yet widely supported.
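Here is a brief sketch of isolating an inserted string with these controls, using the film review example. The helper names isolate and review_line are hypothetical, and the sketch assumes that the direction of the title is already known to the code doing the insertion (FSI is used when it is not).

```python
LRI = "\u2066"  # LEFT-TO-RIGHT ISOLATE
RLI = "\u2067"  # RIGHT-TO-LEFT ISOLATE
FSI = "\u2068"  # FIRST STRONG ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE


def isolate(text, direction=None):
    """Wrap text in an isolate; fall back to FSI if the direction is unknown."""
    opener = {"ltr": LRI, "rtl": RLI}.get(direction, FSI)
    return opener + text + PDI


def review_line(title, count, title_direction=None):
    """Build a line such as 'The Dressmaker : 4 reviews' with an isolated title."""
    return "{} : {} reviews".format(isolate(title, title_direction), count)
```

With the title isolated, the colon and the number that follow it are no longer drawn into the RTL run, provided that the rendering software supports the isolate controls.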

There is, however, another issue with this approach. The UBA says that you should skip embedded isolation ranges when determining the paragraph direction. This means that if you receive a string that starts with RLI and ends with PDI you cannot detect that it's actually a RTL string by using the normal first strong detection algorithm. The text inside those controls should be rendered in the appropriate order, wherever the string is dropped, but if you wanted to build HTML markup around the string using a dir attribute, or determine whether to align the string on the right, you would not get the right information using first strong detection.

Producers of the JSON will also need to determine what to do for any string that already starts with one of these paired characters. If the string has already been wrapped, you don't want to keep wrapping it again. It may, however, be the case that a user input string starts with a paired control character that doesn't wrap the whole string. The producer would need to recognise this and wrap the whole string.
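One way a producer (or a consumer that cannot rely on first-strong detection) might recognise a string that is already wrapped as a whole is sketched below. The names wrapping_direction and ensure_wrapped are hypothetical, and for brevity the sketch only handles the isolate controls, not LRE/RLE...PDF.

```python
LRI, RLI, FSI, PDI = "\u2066", "\u2067", "\u2068", "\u2069"
DIRECTION = {LRI: "ltr", RLI: "rtl", FSI: "auto"}


def wrapping_direction(text):
    """Return the direction of an isolate enclosing the whole string, else None."""
    if len(text) < 2 or text[0] not in DIRECTION:
        return None
    depth = 0
    for index, ch in enumerate(text):
        if ch in DIRECTION:
            depth += 1
        elif ch == PDI and depth > 0:
            depth -= 1
            if depth == 0:
                # The opening isolate has just closed. It wraps the whole
                # string only if this PDI is the very last character.
                return DIRECTION[text[0]] if index == len(text) - 1 else None
    return None  # the opening isolate is never closed


def ensure_wrapped(text, direction):
    """Wrap text in RLI/LRI...PDI unless one isolate already encloses it all."""
    if wrapping_direction(text) is not None:
        return text
    return (RLI if direction == "rtl" else LRI) + text + PDI
```

A string that merely begins with an isolated range, followed by other text, is wrapped again as a whole, whereas a string that is already fully wrapped is left alone.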

Furthermore, the issue that you are changing every string is actually greater with this approach, since all strings are affected.

Again, you cannot expect humans creating the original string to use paired controls, not least because these characters are much less likely to be available on user keyboards than the RLM/LRM characters.

Strings that are enclosed in markup

Not much changes here compared with the section about using first-strong detection. Here is our previous example,

"summary" : "‭<p>פעילות הבינאום, W3C</p>"

Paired control characters at the start and end of such a string will perform no function when inserted into a markup destination such as an HTML file. The RLE/LRE/RLI/LRI/FSI...PDF/PDI characters are for inline use only. Their effect is terminated by paragraph boundaries. In the case of HTML, the p element tags constitute those paragraph boundaries, and so paired controls outside those tags will have no effect whatsoever.

In other words, such characters can act only as flags and the consumer will still need to apply the base direction by changing the markup.

If the paired controls are added to the JSON string by the producer, this now produces a situation where the destination HTML is littered with useless extra characters (particularly where a string contains multiple p tags) unless the consumer removes them while inserting. It's also feasible that in a scenario where a user edits the resulting file without adding block level markup, this could lead to unexpected effects if the new text were added to the wrong side of a control character.

In the light of this, it doesn't really seem sensible to use paired control characters for strings containing markup. As before, the producer will need to examine the markup.

However, as we mentioned before it is also a big assumption to make that the consumer of the JSON string will know enough about the markup to recognise that dir="rtl" already specifies the direction, and use that. The markup may not represent HTML, for example, and an XML application may express direction with an idiosyncratic element or attribute.

Using a direction property

If direction information is stored as a property, there needs to be a property for each string. In the following example, a direction property at the same level as name and content would NOT work, since it can't serve both of those strings.

{
  "@context": {
    "@value": "http://www.w3.org/ns/activitystreams"
  },
  "name": "r12a posted a note!",
  "type": "Note",
  "content": "פעילות הבינאום, W3C"
}

What may work better, however, is a new string type that allows for direction to be optionally stored with each string. For example,

{
  "@context": {
    "@value": "http://www.w3.org/ns/activitystreams"
  },
  "name": "r12a posted a note",
  "type": "Note",
  "content": { "str" : "פעילות הבינאום, W3C", "dir" : "rtl" }
}

This is similar to the concept of LanguageString. For this to work, all JSON applications would need to recognise the structure of these strings, and be able to extract at least the string itself.

A scenario where each string is part of an object which has a direction property would also work, as long as consumers are able to understand the logic of the object. That approach is more dependent on conformance to format specifications, which may not be a general solution to the problem of attaching base direction information to strings.

If the direction information is omitted, the convention must be that the direction is LTR. This is important where LTR strings are injected into a RTL environment.
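A consumer of such a structure could extract the string and its direction along these lines. This is only a sketch: the function name read_string is hypothetical, the property names str and dir are taken from the example above, and the JSON document is trimmed to the relevant properties.

```python
import json


def read_string(value):
    """Return (text, direction) for a plain string or a {str, dir} object."""
    if isinstance(value, str):
        return value, "ltr"  # no metadata at all: LTR by convention
    # Direction is optional on the object form; default to LTR when omitted.
    return value["str"], value.get("dir", "ltr")


document = json.loads("""
{
  "name": "r12a posted a note",
  "type": "Note",
  "content": { "str": "פעילות הבינאום, W3C", "dir": "rtl" }
}
""")

name_text, name_direction = read_string(document["name"])
content_text, content_direction = read_string(document["content"])
```

The inspection is straightforward because no scanning of the string itself is needed.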

There may be some logic to this approach, if it were possible to define a string type that is adopted widely, since the base direction really is metadata about the string, viz. it is often expressed separately from the string in the original source and the final destination of the string when visible to humans.

Storing the base direction in a property also avoids cluttering strings with additional characters (see above), or altering strings that are passed around. It also simplifies the process of determining the base direction of the JSON string, since the inspection procedure is straightforward. The crux, of course, is getting it recognised as a standard approach.

It still requires the consumer to build a context around a string which is inserted into a destination, in order to preserve the appropriate base direction.

It also still requires the JSON producer to examine the surrounding context of an incoming string. However, one of the problems with the first-strong approach is that, when storing a string in the JSON format, you must know whether the string is coming from a non-JSON context (in which case you need to examine the context and decide whether or not to add RLM at the start of the string, etc.), or from a JSON context where those decisions have already been taken. If a string arrives with direction metadata attached, it's pretty clear that the initial process has already been done.
