This document describes string searching operations on the Web in order to allow greater interoperability. String searching refers to natural language string matching such as the "find" command in a Web browser. This document builds upon the concepts found in Character Model for the World Wide Web 1.0: Fundamentals [[CHARMOD]] and Character Model for the World Wide Web 1.0: String Matching [[CHARMOD-NORM]] to provide authors of specifications, software developers, and content developers the information they need to describe and implement search features suitable for global audiences.

Sending comments on this document

If you wish to make comments regarding this document, please raise them as github issues against the latest editor's copy. Only send comments by email if you are unable to raise issues on github (see links below). All comments are welcome.

To make it easier to track comments, please raise separate issues or emails for each comment, and point to the section you are commenting on using a URL.

Introduction

Goals and Scope

This document describes the problems, requirements, and considerations for specification or implementations of string searching operations. A common example of string searching is the "find" command in a Web browser, but there are many other forms of searching that a specification might wish to define.

This document builds on Character Model for the World Wide Web: Fundamentals [[CHARMOD]] and Character Model for the Word Wide Web: String Matching [[CHARMOD-NORM]]. Understanding the concepts in those documents are important to being able to understand and apply this document successfully.

The main target audience of this specification is W3C specification developers who need to define some form of search or find algorithm: the goal is to provide a stable reference to the concepts, terms, and requirements needed.

The concepts described in this document provide authors of specifications, software developers, and content developers with a common reference for consistent, interoperable text searching on the World Wide Web. Working together, these three groups can build a globally accessible Web.

This document contains best practices and requirements for other specifications, as well as recommendations for implementations and content authors. These best practices for specifications (and others) can also be found in the Internationalization Working Group's document Internationalization Best Practices for Spec Developers [[INTERNATIONAL-SPECS]], which is intended to serve as a general reference for all Internationalization best practices in W3C specifications.

Document Conventions

In this document [[RFC2119]] keywords in uppercase italics have their usual meaning. We also use these stylistic conventions:

Definitions appear with a different background color and decoration like this.

Best practices appear with a different background color and decoration like this.

Issues, gaps, and recommendations for future work appear with a different background color and decoration like this.

Terminology

This section contains terminology specific to this document.

Much of the terminology needed to understand this document is provided by the Internationalization Glossary [[I18N-GLOSSARY]]. Some terms are also defined by [[CHARMOD-NORM]] and can be found in the Terminology and Notation section of that document.

Unicode, also known as the Universal Character Set, allows Web documents to be authored in any of the world's writing systems, scripts, or languages, on any computing platforms and then to be exchanged, read, and searched by the Web's users around the world. The first few chapters of the Unicode Standard [[Unicode]] provide useful background reading. Also see the Unicode Collation Algorithm [[UTS10]], which contains a chapter on searching.

Corpus The natural language text contained by a document or set of documents which the user would like to search.

Segmentation The process of breaking natural language text up into distinct words and phrases. This often includes operations such as "named entity recognition" (such as recognizing that the three word sequence Dr. Jonas Salk is a person's name).

Stemming A process or operation that reduces words to their "stem" or root. For example, the words runs, ran, and running all share the stem run. This some sometimes called (more formally) lemmatization and the stem is sometimes called the lemma.

Full-Text Search refers to searches that process the entire contents of the textual document or set of documents. Full-text queries perform linguistic searches against text data in full-text indexes by operating on words and phrases based on the rules of a particular language such as English or Japanese. Full-text queries can include simple words and phrases or multiple forms of a word or phrase.

Frequently this means that a full-text search employs indexes and natural language processing. When you are using a search engine, you are using a form of full text search. Full text search often breaks natural language text into words or phrases (this is called segmentation) and may apply complex processing to get at the semantic "root" values of words (this is called stemming). These processes are sensitive to language, context, and many other aspects of textual variation.

String Searching in Natural Language Content

String searching is widely implemented in browsers and other user agents, but has not historically been well documented. Various W3C working groups have attempted to provide such documentation in the past. The most recent effort produced these issues.

Users of the Web often want to search documents for particular words or phrases within the natural language text of a given document. This is different from the sorts of programmatic matching needed by formal languages (such as markup languages such as [[HTML]]; style sheets [[CSS21]]; or data formats such as [[TURTLE]] or [[JSON-LD]]), and which are described by our document [[CHARMOD-NORM]].

There are different types of string searching.

One limited form of full-text search—and the topic of this document—is sub-string matching. One familiar form of sub-string matching is the "find" feature of your browser. A sub-string match searches the body ("corpus") of a document with the user's input, seeking a match.

Find operations can have different options or implementation details, such as the addition or removal of case sensitivity, or whether the feature supports different aspects of a regular expression language or "wildcards".

One way that sub-string matching usually differs from other types of full-text search is that, while it may use algorithms in an attempt to suppress or ignore textual variations, it usually does not produce matches that contain additional or unspecified character sequences, words, or phrases.

Quite often, the user's input does not use a sequence of code points identical to that in the text being searched. This can happen for a variety of reasons. Sometimes it is because the text varies in ways the user cannot predict. In other cases it is because the user's keyboard or input method does not provide ready access to the textual variations needed—or because the user cannot be bothered to input the text accurately. In this section, we examine various common cases known to us.

Additional Types of Equivalence

When searching text, the concept of "grapheme boundaries" and "user-perceived characters" can be important. See Section 3 of Character Model for the World Wide Web: Fundamentals [[CHARMOD]] for a description. For example, if the user has entered a capital "A" into a search box, should the software find the character À (U+00C0 LATIN CAPITAL LETTER A WITH ACCENT GRAVE)? What about the character "A" followed by U+0300 COMBINING ACCENT GRAVE? What about writing systems, such as Devanagari, which use combining marks to suppress or express certain vowels?

In order to describe or implement sub-string matching, it is necessary to understand the types of textual variation that users expect the search feature to pay attention to (or ignore) and the types of features that the implementation will need to consider when building the searching algorithm.

The Character Model for the World-Wide Web: String Matching [[CHARMOD-NORM]] describes several textual equivalences which also apply to sub-string matching. These include case folding and different Unicode normalization forms.

There are other types of equivalence that are interesting when performing sub-string matching. Some forms of equivalence, such as those mentioned above, are based on character properties assigned by Unicode or due to the mapping of legacy character encodings to the Unicode character set. Other "interesting equivalences" go outside of those defined by Unicode. Some of these potential "text normalizations" are application, natural language, or domain specific and should not be overlooked by specifications or implementations.

Case Folding

A user might expect a term entered in lowercase to match uppercase equivalents (and perhaps vice-versa). Most sub-string matching feature, such as the browser "find" command, offer a user-selectable option for matching the case of the input to that of the text.

For a survey of case folding, see the discussion here in [[CHARMOD-NORM]].

Unicode Normalization and character equivalence

Unicode defines canonical and compatibility relationships between characters which can impact user perceptions of string searching. For a detailed discussion of Unicode Normalization forms see Section 2.2 of [[CHARMOD-NORM]] as well as the definitions found in Unicode Normalization Forms [[UAX15]].

In many complex scripts it is possible to encode letters or vowel-signs in more than one way, but the alternatives are canonically equivalent.

Script Equivalence

Some languages are written in more than one script. A user searching a document might type in text in one script, but wish to find equivalent text in both scripts.

East Asian Width

Some compatibility characters were encoded into Unicode to account for single- or multibyte representation in legacy character encodings or for compatibility with certain layout behaviors in East Asian languages.

Digit Shaping

Many scripts have their own digit characters for the numbers from 0 to 9. In some Web applications, the familiar ASCII digits are replaced for display purposes with the local digit shapes. In other cases, the text actually might contain the Unicode characters for the local digits. Users attempting to search a document might expect that typing one form of digit will find the eqivalent digits.

Orthographic or Dialectical Variation

Some languages have different orthographic traditions that vary by region or dialect or allow different spellings of the same word. Searches and spell-checking may need to know about these variations.

South Asian (Indic script) languages

Indic script languages have many instances of this kind of problem. Sometimes these are spelling errors, but in other cases multiple spellings are acceptable.

For example, the Bengali language (language tag bn) is notorious for having a wide range of spelling variations permitted by the language: nearly 80% of Bengali words have at least two spellings. Many words have 3, 4, or more variations—with at least one word having 16 different valid spellings.

Other Indic scripts provide alternative mechanisms for representing particular sounds, and in most cases either representation is considered equally valid. The most common instance of this involves representation of syllable-final nasals.

For example, the /n/ sound in the word for snake in Hindi can be written using either [U+0901 DEVANAGARI SIGN CANDRABINDU] and [U+0902 DEVANAGARI SIGN ANUSVARA] Both of the following are possible valid spellings:

In an additional twist to this story, two diacritics with different code points could be used here. In our previous example we used [U+0902 DEVANAGARI SIGN ANUSVARA ] to represent the nasal sound because the accompanying vowel-sign rises above the hanging baseline. If the vowel-sign was one that didn't rise above the hanging baseline, we would normally use [U+0901 DEVANAGARI SIGN CANDRABINDU ] instead. The function of both of these diacritics is the same, but their code points are different.

The alternative use of either a letter or a diacritic for syllable-final nasals is common to several other Indian languages. In addition to Devanagari, used to write languages such as Hindi (language tag hi) or Marathi (language tag mr, scripts such as Malayalam, Gujarati, Odia, and others provide similar spelling options.

Whitespace Normalization

Some languages use whitespace to separate words, sentences, or paragraphs while others do not. When performing sub-string matching, different forms of whitespace found in [[Unicode]] must be normalized so that the match succeeds.

Accents and diacritic marks

Users will sometimes vary their input when dealing with letters that contain accents or diacritic marks when entering search terms in scripts (such as the Latin script) that use various diacritics, even though the text they are searching includes the additional marks. This is particularly true on mobile keyboards, where input of these characters can require additional effort. In these cases, users generally expect the search operation to be more "promiscuous" to make up for their failure to make the additional effort needed.

This effect might vary depending on context as well. For example, a person using a physical keyboard may have direct access to accented letters, while a virtual or on-screen keyboard may require extra effort to access and select the same letters.

Optional characters

In some orthographies it is necessary to match strings with different numbers of characters.

A prime example of this involves vowel diacritics in abjads. For example, some languages that use the Arabic and Hebrew scripts do not require (but optionally allow) the user to input short vowels. (For some other languages in these scripts, the inclusion of the short vowels is not optional.) The presence or absence of vowels in the text being input or searched might impede a match if the user doesn't enter or know to enter them.

Visually identical text that is not canonically equivalent

In some cases, visually similar or identical glyph patterns can be made from different sequences of code points. Sometimes this is intentional and variations can be removed via Unicode normalization. But there are other cases in which similar-appearing [= graphemes =] are not made the same by normalisation, and they are not semantically equivalent.

Some languages which use the Arabic script also have [= graphemes =] which can be encoded in more than one way. In some cases, these variations are handled by Unicode Normalization, but in other cases they are not considered equivalent by Unicode, even if they appear visually to be identical. Sometimes these variations are considered to be valid spelling variations. In other cases they are the result of user's mistaken perception.

Considerations for Searching

This section was identified as a new area needing document as part of the overall rearchitecting of the document. The text here is incomplete and needs further development. Contributions from the community are invited.

Implementers often need to provide simple "find text" algorithms and specifications often try to define APIs to support these needs. Find operations on text generate different user expectations and thus have different requirements from the need for absolute identity matching needed by document formats and protocols. It is important to note that domain-specific requirements may impose additional restrictions or alter the considerations presented here.

Increasing input effort from the user SHOULD be mirrored by more selective matching.

When the user expends more effort on the input—by using the shift key to produce uppercase or by entering a letter with diacritics instead of just the base letter—they might expect their search results to match (only) their more-specific input.

Types of Search Option

When creating a string search API or algorithm, the following textual options might be useful to users:

Acknowledgements

The W3C Internationalization Working Group and Interest Group, as well as others, provided many comments and suggestions. The Working Group would like to thank: all of the contributors to the Character Model series of documents over the many years of their development.