This document describes string searching operations on the Web in order to allow greater interoperability. String searching refers to natural language string matching such as the "find" command in a Web browser. This document builds upon the concepts found in Character Model for the World Wide Web 1.0: Fundamentals [[CHARMOD]] and Character Model for the World Wide Web 1.0: String Matching [[CHARMOD-NORM]] to provide authors of specifications, software developers, and content developers the information they need to describe and implement search features suitable for global audiences.

Sending comments on this document

If you wish to make comments regarding this document, please raise them as github issues against the latest editor's copy. Only send comments by email if you are unable to raise issues on github (see links below). All comments are welcome.

To make it easier to track comments, please raise separate issues or emails for each comment, and point to the section you are commenting on using a URL.

Introduction

Goals and Scope

This document describes the problems, requirements, and considerations for specification or implementations of string searching operations. A common example of string searching is the "find" command in a Web browser, but there are many other forms of searching that a specification might wish to define.

This document builds on Character Model for the World Wide Web: Fundamentals [[CHARMOD]] and Character Model for the Word Wide Web: String Matching [[CHARMOD-NORM]]. Understanding the concepts in those documents are important to being able to understand and apply this document successfully.

The main target audience of this specification is W3C specification developers who need to define some form of search or find algorithm: the goal is to provide a stable reference to the concepts, terms, and requirements needed.

The concepts described in this document provide authors of specifications, software developers, and content developers with a common reference for consistent, interoperable text searching on the World Wide Web. Working together, these three groups can build a globally accessible Web.

This document contains best practices and requirements for other specifications, as well as recommendations for implementations and content authors. These best practices for specifications (and others) can also be found in the Internationalization Working Group's document [[INTERNATIONAL-SPECS]], which is intended to serve as a general reference for all Internationalization best practices in W3C specifications.

Background

At the core of the character model is the Universal Character Set (UCS), defined jointly by the Unicode Standard [[Unicode]] and ISO/IEC 10646 [[ISO10646]]. In this document, Unicode is used as a synonym for the Universal Character Set. A successful character model allows Web documents authored in the world's writing systems, scripts, and languages (and on different platforms) to be exchanged, read, and searched by the Web's users around the world.

The first few chapters of the Unicode Standard [[Unicode]] provide useful background reading. In addition, the Unicode Collation Algorithm [[UTS10]] contains a chapter on searching.

Terminology and Notation

This section contains terminology and notation specific to this document.

Much of the special terminology needed to understand this document is provided by [[CHARMOD-NORM]] and can be found in the Terminology and Notation section of that document.

String Searching in Natural Language Content

String searching is widely implemented in browsers and other user agents, but has not historically been well documented. Various W3C working groups have attempted to provide such documentation in the past. The most recent effort produced these issues.

Many Web implementations and applications have a need for users or programs to search documents for particular words or phrases of natural language text. This is different from the sorts of programmatic matching needed by formal languages (such as markup languages such as [[HTML]]; style sheets [[CSS21]]; or data formats such as [[TURTLE]] or [[JSON-LD]]).

There are many types of string searching. The form of string searching which we'll concern ourselves with here is sub-string matching or "find" operations. This is the direct searching of the body or "corpus" of a document with the user's input. Find operations can have different options or implementation details, such as the addition or removal of case sensitivity, or whether the feature supports different aspects of a regular expression language or "wildcards".

A different type of string searching, which is outside the scope of this document, is Full Text Search. When you are using a search engine, you are generally using a form of full text search. Full text search generally breaks natural language text into word segments and may apply complex processing to get at the semantic "root" values of words. For example, if the user searches for "run", you might want to find words like "running", "ran", or "runs" in addition to the actual search term "run". This process, naturally, is sensitive to language, context, and many other aspects of textual variation.

Other Types of Equivalence

In addition to the forms of character equivalence described in [[CHARMOD-NORM]], there are other types of equivalence that are interesting when performing string searching. The forms of equivalence found in the String Matching document are all based on character properties assigned by Unicode or due to the mapping of legacy character encodings to the Unicode character set. The "interesting equivalences" in this section that are outside of those defined by Unicode.

For example, Japanese uses two syllabic scripts, hiragana and katakana. A user searching a document might type in text in one script, but wish to find equivalent text in both scripts. These additional "text normalizations" are sometimes application, natural language, or domain specific and shouldn't be overlooked by specifications or implementations as an additional consideration.

Another similar example is called digit shaping. Some scripts, such as Arabic or Thai, have their own digit characters for the numbers from 0 to 9. In some Web applications, the familiar ASCII digits are replaced for display purposes with the local digit shapes. In other cases, the text actually might contain the Unicode characters for the local digits. Users attempting to search a document might expect that typing one form of digit will find the eqivalent digits.

Considerations for Searching

This section was identified as a new area needing document as part of the overall rearchitecting of the document. The text here is incomplete and needs further development. Contributions from the community are invited.

Implementers often need to provide simple "find text" algorithms and specifications often try to define APIs to support these needs. Find operations on text generate different user expectations and thus have different requirements from the need for absolute identity matching needed by document formats and protocols. It is important to note that domain-specific requirements may impose additional restrictions or alter the considerations presented here.

Variations in User Input

One of the primary considerations for string searching is that, quite often, the user's input is not identical to the way that the text being searched is encoded.

One primary reason this happens is because the text can vary in ways the user cannot predict. In other cases it is because the user's keyboard or input method does not provide ready access to the textual variations needed—or because the user cannot be bothered to input the text accurately. For example, users often omit accents when entering Latin-script languages, particularly on mobile keyboards, even though the text they are searching includes the accents. In these cases, users generally expect the search operation to be more "promiscuous" to make up for the failure to add additional effort to their input.

For example, a user might expect a term entered in lowercase to match uppercase equivalents. Conversely, when the user expends more effort on the input—by using the shift key to produce uppercase or by entering a letter with diacritics instead of just the base letter—they might expect their search results to match (only) their more-specific input.

A different case is where the text can vary in multiple ways, but the user can only type a single search term in. For example, the Japanese language uses two different phonetic scripts, hiragana and katakana. These scripts encode the same phonemes; thus the user might expect that typing in a search term in hiragana would find the exact same word spelled out in katakana.

A different example might be the presence or absence of short vowels in the Arabic and Hebrew scripts. For most languages in these scripts, the inclusion of the short vowels is entirely optional, but the presence of vowels in text being searched might impede a match if the user doesn't enter or know to enter them.

This effect might vary depending on context as well. For example, a person using a physical keyboard may have direct access to accented letters, while a virtual or on-screen keyboard may require extra effort to access and select the same letters.

Consider a document containing these strings: "re-resume", "RE-RESUME", "re-résumé", and "RE-RÉSUMÉ".

In the table below, the user's input (on the left) might be considered a match for the above items as follows:

User Input Matched Strings
e (lowercase 'e') "re-resume", "RE-RESUME", "re-résumé", and "RE-RÉSUMÉ"
E (uppercase 'E') "RE-RESUME" and "RE-RÉSUMÉ"
é (lowercase 'e' with acute accent) "re-résumé" and "RE-RÉSUMÉ"
É (uppercase 'E' with acute accent) "RE-RÉSUMÉ"

In addition to variations of case or the use of accents, Unicode also has an array of canonical equivalents or compatibility characters (as described in the sections above) that might impact string searching.

For example, consider the letter "K". Characters with a compatibility mapping to U+004B LATIN CAPITAL LETTER K include:

  1. Ķ U+0136
  2. Ǩ U+01E8
  3. ᴷ U+1D37
  4. Ḱ U+1E30
  5. Ḳ U+1E32
  6. Ḵ U+1E34
  7. K U+212A
  8. Ⓚ U+24C0
  9. ㎅ U+3385
  10. ㏍ U+33CD
  11. ㏎ U+33CE
  12. K U+FF2B
  13. (a variety of mathematical symbols such as U+1D40A,U+1D43E,U+1D472,U+1D4A6,U+1D4DA)
  14. 🄚 U+1F11A
  15. 🄺 U+1F13A.

Other differences include Unicode Normalization forms (or lack thereof). There are also ignorable characters (such as the variation selectors), whitespace differences, bidirectional controls, and other code points that can interfere with a match.

Users might also expect certain kinds of equivalence to be applied to matching. For example, a Japanese user might expect that hiragana, katakana, and half-width compatibility katakana equivalents all match each other (regardless of which is used to perform the selection or encoded in the text).

When searching text, the concept of "grapheme boundaries" and "user-perceived characters" can be important. See Section 3 of Character Model for the World Wide Web: Fundamentals [[CHARMOD]] for a description. For example, if the user has entered a capital "A" into a search box, should the software find the character À (U+00C0 LATIN CAPITAL LETTER A WITH ACCENT GRAVE)? What about the character "A" followed by U+0300 (a combining accent grave)? What about writing systems, such as Devanagari, which use combining marks to suppress or express certain vowels?

Types of Search Option

When creating a string search API or algorithm, the following textual options might be useful to users:

  • Case-sensitive vs. case-insensitive
  • Kana folding
  • Unicode normalization form
  • etc.

Acknowledgements

The W3C Internationalization Working Group and Interest Group, as well as others, provided many comments and suggestions. The Working Group would like to thank: all of the contributors to the Character Model series of documents over the many years of their development.