This document describes string searching operations on the Web in order to allow greater interoperability. String searching refers to natural language string matching such as the "find" command in a Web browser. This document builds upon the concepts found in Character Model for the World Wide Web 1.0: Fundamentals [[!CHARMOD]] and Character Model for the World Wide Web 1.0: String Matching [[!CHARMOD-NORM]] to provide authors of specifications, software developers, and content developers the information they need to describe and implement search features suitable for global audiences.

Sending comments on this document

If you wish to make comments regarding this document, please raise them as GitHub issues against the latest editor's draft. Only send comments by email if you are unable to raise issues on GitHub (see links below). All comments are welcome.

To make it easier to track comments, please raise separate issues or emails for each comment, and point to the section you are commenting on using a URL.

Introduction

Goals and Scope

This document describes string searching—the process by which a specification or implementation matches a natural language string fragment against a specific document or series of documents. A common example of string searching is the "find" command in a Web browser, but there are many other forms of searching that a specification might wish to define.

This document builds on Character Model for the World Wide Web: Fundamentals [[!CHARMOD]] and Character Model for the World Wide Web: String Matching [[!CHARMOD-NORM]]. Understanding the concepts in those documents is important to understanding and applying this document successfully.

The main target audience of this specification is W3C specification developers who need to define some form of search or find algorithm: the goal is to provide a stable reference to the concepts, terms, and requirements needed.

The concepts described in this document provide authors of specifications, software developers, and content developers with a common reference for consistent, interoperable text searching on the World Wide Web. Working together, these three groups can build a globally accessible Web.

Background

At the core of the character model is the Universal Character Set (UCS), defined jointly by the Unicode Standard [[!Unicode]] and ISO/IEC 10646 [[!ISO10646]]. In this document, Unicode is used as a synonym for the Universal Character Set. A successful character model allows Web documents authored in the world's writing systems, scripts, and languages (and on different platforms) to be exchanged, read, and searched by the Web's users around the world.

The first few chapters of the Unicode Standard [[!Unicode]] provide useful background reading. In particular, the Unicode Collation Algorithm [[!UTS10]] contains a chapter on searching.

Terminology and Notation

This section contains terminology and notation specific to this document.

The Web is built on text-based formats and protocols. In order to describe string matching or searching effectively, it is necessary to establish terminology that allows us to talk about the different kinds of text within a given format or protocol, as the requirements and details vary significantly.

Unicode code points are denoted as U+hhhh, where hhhh is a sequence of at least four and at most six hexadecimal digits. For example, the character € (U+20AC EURO SIGN) has the code point U+20AC.
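As an illustrative sketch (the function name is ours, not part of any specification), the U+hhhh notation can be produced from a character's code point like this:

```python
def code_point_notation(ch: str) -> str:
    # Format a character's code point in the U+hhhh notation used in
    # this document: at least four, and at most six, hexadecimal digits.
    return f"U+{ord(ch):04X}"
```

For example, `code_point_notation("\u20AC")` yields "U+20AC", while a supplementary-plane character such as U+1F11A naturally produces five digits.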

Some characters that are used in the various examples might not appear as intended unless you have the appropriate font. Care has been taken to ensure that the examples nevertheless remain understandable.

A legacy character encoding is a character encoding not based on the Unicode character set.

A transcoder is a process that converts text between two character encodings. Most commonly in this document it refers to a process that converts from a legacy character encoding to a Unicode encoding form, such as UTF-8.
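A minimal sketch of such a transcoder, using windows-1252 purely as an example legacy encoding (the function name is illustrative):

```python
def transcode_to_utf8(data: bytes, legacy_encoding: str) -> bytes:
    # Decode the bytes using the legacy encoding's mapping to Unicode,
    # then re-encode the resulting string in the UTF-8 encoding form.
    return data.decode(legacy_encoding).encode("utf-8")
```

For instance, the byte 0xE9 ("é" in windows-1252) becomes the two-byte UTF-8 sequence 0xC3 0xA9.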

Syntactic content is any text in a document format or protocol that belongs to the structure of the format or protocol. This definition includes values that are typically thought of as "markup" but can also include other values, such as the name of a field in an HTTP header. Syntactic content consists of all of the characters that form the structure of a format or protocol. For example, < and > (as well as the element name and various attributes they surround) are part of the syntactic content in an HTML document.

Syntactic content usually is defined by a specification or specifications and includes both the defined, reserved keywords for the given protocol or format as well as string tokens and identifiers that are defined by document authors to form the structure of the document (rather than the "content" of the document).

Natural language content refers to the language-bearing content in a document and not to any of the surrounding or embedded syntactic content that forms part of the document structure. You can think of it as the actual "content" of the document or the "message" in a given protocol. Note that syntactic content can contain natural language content, such as when an [[HTML]] img element has an alt attribute containing a description of the image.

A resource, in the context of this document, is a given document, file, or protocol "message" which includes both the natural language content as well as the syntactic content such as identifiers surrounding or containing it. For example, in an HTML document that also has some CSS and a few script tags with embedded JavaScript, the entire HTML document, considered as a file, is a resource. This term is intentionally similar to the term 'resource' as used in [[RFC3986]], although here the term is applied loosely.

A user-supplied value is unreserved syntactic content in a vocabulary that is assigned by users, as distinct from reserved keywords in a given format or protocol. For example, CSS class names are part of the syntax of a CSS style sheet: they are not reserved keywords predefined by any CSS specification, they are subject to the syntactic rules of CSS, and they may (or may not) consist of natural language tokens.

A vocabulary provides the list of reserved names as well as the set of rules and specifications controlling how user-supplied values (such as identifiers) can be assigned in a format or protocol. This can include restrictions on range, order, or type of characters that can appear in different places. For example, HTML defines the names of its elements and attributes, as well as enumerated attribute values, which defines the "vocabulary" of HTML syntactic content. Another example would be ECMAScript, which restricts the range of characters that can appear at the start or in the body of an identifier or variable name. It applies different rules for other cases, such as to the values of string literals.

A grapheme is a sequence of one or more characters in a visual representation of some text that a typical user would perceive as being a single unit (character). Graphemes are important for a number of text operations, such as sorting or text selection, so it is necessary to be able to compute the boundaries between each user-perceived character. Unicode defines a default mechanism for computing graphemes in Unicode Standard Annex #29: Text Segmentation [[!UAX29]] and calls this approximation a grapheme cluster. Two types of default grapheme cluster are defined. Unless otherwise noted, grapheme cluster in this document refers to an extended default grapheme cluster. (A discussion of grapheme clusters is also given in Section 2 of the Unicode Standard [[!Unicode]]; see near the end of Section 2.11 in version 8.0.)
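The full UAX #29 rules are more involved than can be shown here, but the following deliberately simplified sketch conveys the core idea: combining marks attach to the preceding base character. A real implementation should use a library that implements the complete segmentation rules (for example ICU, or the third-party `regex` module's `\X` pattern), since this approximation ignores Hangul jamo, emoji ZWJ sequences, and other cases.

```python
import unicodedata

def approximate_graphemes(text: str):
    # Very simplified approximation of UAX #29 grapheme clusters:
    # attach combining marks (nonzero combining class) to the
    # preceding base character. NOT a complete implementation.
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters
```

With this sketch, "e" followed by U+0301 COMBINING ACUTE ACCENT is treated as one cluster, matching what a user perceives as a single character.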

Because different natural languages have different needs, grapheme clusters can also sometimes require tailoring. For example, a Slovak user might wish to treat the default pair of grapheme clusters "ch" as a single grapheme cluster. Note that the interaction between the language of string content and the end-user's preferences might be complex.

Terminology Examples

This section illustrates some of the terminology defined above. For illustration purposes we'll use the following small HTML file as an example (line numbers added for reference):

1  <html lang="en" dir="ltr">
2  <head>
3    <meta charset="UTF-8">
4    <title>Shakespeare</title>
5  </head>
6  <body>
7    <img src="shakespeare.jpg" alt="William Shakespeare" id="shakespeare_image">
8    <p>What&#x2019;s in a name? That which we call a rose by any other name would smell as sweet.</p>
9  </body>
10 </html>

  • Everything in this HTML file is part of the resource.
  • Syntactic content includes all of the HTML markup. Only two strings are not part of the syntactic content: the word "Shakespeare" on line 4 and the sentence "What’s in a name? That which we call a rose by any other name would smell as sweet." on line 8. (The HTML character reference &#x2019; embedded in the sentence on line 8 is part of the syntactic content.)
  • Natural language content consists of those two strings plus the value of the alt attribute on line 7 ("William Shakespeare").
  • User-supplied values appear in three places on line 7: the values of the src, alt, and id attributes of the img tag. In addition, the value of the lang attribute on line 1 and the charset attribute on line 3 are user-supplied values.
  • Vocabulary consists of the elements and attributes (as well as some of the attribute values, such as the value ltr for the attribute dir on line 1) defined in [[HTML]].

All of the text in the file makes up the resource. It's possible that a given resource will contain no natural language content at all (consider an HTML document consisting of four empty div elements styled to be orange rectangles). It's also possible that a resource will contain no syntactic content and consist solely of natural language content: for example, a plain text file with a soliloquy from Hamlet in it. Notice too that the HTML character reference &#x2019; appears within the natural language content on line 8: as an escape it is syntactic content, but the character it represents belongs to the natural language content, so it is part of both in this resource.

This document describes best practices and requirements for other specifications, as well as recommendations for implementations and content authors. These best practices for specifications (and others) can also be found in the Internationalization Working Group's document [[!INTERNATIONAL-SPECS]], which is intended to serve as a general reference for all Internationalization best practices in W3C specifications.

When a best practice or requirement appears in this document, it has been styled like this paragraph. Recommendations for specifications and spec authors are preceded by [S]. Recommendations for implementations and software developers are preceded by [I]. Recommendations for content and content authors are preceded by [C].

Specifications can claim conformance to this document if they:

  1. do not violate any conformance criteria preceded by [S] where the imperative is MUST or MUST NOT
  2. document the reason for any deviation from criteria where the imperative is SHOULD, SHOULD NOT, or RECOMMENDED
  3. make it a conformance requirement for implementations to conform to this document
  4. make it a conformance requirement for content to conform to this document

Requirements placed on specifications might indirectly cause requirements to be placed on implementations or content that claim to conform to those specifications.


Conformance

Where this specification contains a procedural description, it is to be understood as a way to specify the desired external behavior. Implementations MAY use other means of achieving the same results, as long as observable behavior is not affected.

String Searching in Natural Language Content

Many Web implementations and applications have a need for users or programs to search documents for particular words or phrases of natural language text. This is different from the sorts of programmatic matching needed by formal languages (such as markup languages like [[HTML]], style sheets [[CSS21]], or data formats such as [[TURTLE]] or [[JSON-LD]]).

There are many types of string searching. This document concerns itself with sub-string matching, or "find" operations: directly searching the body or "corpus" of a document for the user's input. Find operations can differ in their options or implementation details, such as whether matching is case-sensitive, or whether the feature supports a regular expression language or "wildcards".

A different type of string searching, which is outside the scope of this document, is full text search. A search engine generally performs a form of full text search: it breaks natural language text into word segments and may apply complex processing to get at the semantic "root" values of words. For example, a search for "run" might be expected to find words like "running", "ran", or "runs" in addition to the actual search term. This process is sensitive to language, context, and many other aspects of textual variation.

Other Types of Equivalence

In addition to the forms of character equivalence described in [[!CHARMOD-NORM]], there are other types of equivalence that are interesting when performing string searching. The forms of equivalence found in the String Matching document are all based on character properties assigned by Unicode or due to the mapping of legacy character encodings to the Unicode character set. The equivalences in this section are outside of those defined by Unicode.

For example, Japanese uses two syllabic scripts, hiragana and katakana. A user searching a document might type text in one script but wish to find equivalent text in both scripts. These additional "text normalizations" are sometimes specific to an application, a natural language, or a domain, and should not be overlooked by specifications or implementations.
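One way such a kana equivalence could be sketched (this is an illustration, not a normative algorithm) relies on the fact that most katakana letters sit exactly 0x60 code points above their hiragana counterparts:

```python
def fold_katakana_to_hiragana(text: str) -> str:
    # For most kana, the katakana letter (U+30A1..U+30F6) is exactly
    # 0x60 code points above its hiragana counterpart (U+3041..U+3096).
    # This sketch ignores cases such as the prolonged sound mark and
    # half-width katakana, which a real implementation must also handle.
    return "".join(
        chr(ord(ch) - 0x60) if 0x30A1 <= ord(ch) <= 0x30F6 else ch
        for ch in text
    )
```

Folding both the query and the text this way makes a hiragana query match the same word written in katakana.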

Another similar example is called digit shaping. Some scripts, such as Arabic or Thai, have their own digit characters for the numbers 0 to 9. In some Web applications, the familiar ASCII digits are replaced for display purposes with the local digit shapes. In other cases, the text might actually contain the Unicode characters for the local digits. Users searching a document might expect that typing one form of digit will find the equivalent digits in the other form.
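A digit-folding step of this kind might be sketched as follows (the function name is ours); Unicode's character properties already record the numeric value of every decimal digit:

```python
import unicodedata

def fold_digits_to_ascii(text: str) -> str:
    # Map any Unicode decimal digit (category Nd: Arabic-Indic, Thai,
    # fullwidth, etc.) to the corresponding ASCII digit; leave all
    # other characters unchanged.
    return "".join(
        str(unicodedata.decimal(ch)) if unicodedata.category(ch) == "Nd" else ch
        for ch in text
    )
```

Applying the fold to both the query and the text lets "٠١٢" (Arabic-Indic digits) match "012".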

Considerations for Searching

This section was identified as a new area needing documentation as part of the overall rearchitecting of the document. The text here is incomplete and needs further development. Contributions from the community are invited.

Implementers often need to provide simple "find text" algorithms, and specifications often try to define APIs to support these needs. Find operations on text generate different user expectations, and thus have different requirements, from the absolute identity matching needed by document formats and protocols. It is important to note that domain-specific requirements may impose additional restrictions or alter the considerations presented here.

Variations in User Input

One of the primary considerations for string searching is that, quite often, the user's input is not identical to the way that the text being searched is encoded.

One primary reason this happens is that text can vary in ways the user cannot predict. In other cases it is because the user's keyboard or input method does not provide ready access to the textual variations needed, or because the user does not make the effort to input the text accurately. For example, users often omit accents when entering Latin-script languages, particularly on mobile keyboards, even though the text they are searching includes the accents. In these cases, users generally expect the search operation to be more "promiscuous" to compensate for the reduced precision of their input.

For example, a user might expect a term entered in lowercase to match uppercase equivalents. Conversely, when the user expends more effort on the input—by using the shift key to produce uppercase or by entering a letter with diacritics instead of just the base letter—they might expect their search results to match (only) their more-specific input.
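The case-insensitive half of this behavior is commonly implemented with Unicode case folding rather than simple lowercasing; a minimal sketch (the function name is illustrative):

```python
def matches_caseless(query: str, text: str) -> bool:
    # Unicode case folding handles characters that simple lowercasing
    # misses: for example, U+00DF LATIN SMALL LETTER SHARP S ("ß")
    # case-folds to "ss".
    return query.casefold() in text.casefold()
```

With this, the lowercase query "strasse" matches "Straße", which `.lower()` alone would not achieve.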

A different case is where the text can vary in multiple ways, but the user can type only a single search term. For example, the Japanese language uses two different phonetic scripts, hiragana and katakana. These scripts encode the same phonemes, so the user might expect that typing a search term in hiragana would find the exact same word spelled out in katakana.

A different example might be the presence or absence of short vowels in the Arabic and Hebrew scripts. For most languages in these scripts, the inclusion of the short vowels is entirely optional, but the presence of vowels in text being searched might impede a match if the user doesn't enter or know to enter them.

This effect might vary depending on context as well. For example, a person using a physical keyboard may have direct access to accented letters, while a virtual or on-screen keyboard may require extra effort to access and select the same letters.

Consider a document containing these strings: "re-resume", "RE-RESUME", "re-résumé", and "RE-RÉSUMÉ".

In the table below, the user's input (on the left) might be considered a match for the above items as follows:

User Input                            Matched Strings
e (lowercase 'e')                     "re-resume", "RE-RESUME", "re-résumé", and "RE-RÉSUMÉ"
E (uppercase 'E')                     "RE-RESUME" and "RE-RÉSUMÉ"
é (lowercase 'e' with acute accent)   "re-résumé" and "RE-RÉSUMÉ"
É (uppercase 'E' with acute accent)   "RE-RÉSUMÉ"
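This asymmetric behavior (less specific input matches more variants; more specific input narrows the results) could be sketched as follows. This is one possible interpretation of the table, not a normative algorithm, and the function names are ours:

```python
import unicodedata

def _strip_marks(s: str) -> str:
    # Decompose, then drop combining marks (accents and the like).
    return "".join(
        ch for ch in unicodedata.normalize("NFD", s)
        if not unicodedata.combining(ch)
    )

def asymmetric_match(query: str, text: str) -> bool:
    q = unicodedata.normalize("NFC", query)
    t = unicodedata.normalize("NFC", text)
    if query == query.lower():          # no uppercase typed: ignore case
        q, t = q.casefold(), t.casefold()
    if _strip_marks(query) == query:    # no accents typed: ignore accents
        q, t = _strip_marks(q), _strip_marks(t)
    return q in t
```

Run against the table above, the query "e" matches all four strings, "E" matches only the uppercase pair, "é" matches only the accented pair, and "É" matches only "RE-RÉSUMÉ".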

In addition to variations of case or the use of accents, Unicode also has an array of canonically equivalent sequences and compatibility characters (as described in the sections above) that might impact string searching.

For example, consider the letter "K". Characters with a compatibility mapping to U+004B LATIN CAPITAL LETTER K include:

  1. Ķ U+0136
  2. Ǩ U+01E8
  3. ᴷ U+1D37
  4. Ḱ U+1E30
  5. Ḳ U+1E32
  6. Ḵ U+1E34
  7. K U+212A
  8. Ⓚ U+24C0
  9. ㎅ U+3385
  10. ㏍ U+33CD
  11. ㏎ U+33CE
  12. K U+FF2B
  13. (a variety of mathematical symbols, such as U+1D40A, U+1D43E, U+1D472, U+1D4A6, and U+1D4DA)
  14. 🄚 U+1F11A
  15. 🄺 U+1F13A
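NFKC normalization exposes many of these compatibility mappings, which an implementation could use to treat such variants as equivalent during matching (a sketch; whether to apply compatibility folding is a policy choice, not a requirement):

```python
import unicodedata

def nfkc_fold(text: str) -> str:
    # NFKC applies both canonical and compatibility decompositions,
    # mapping e.g. U+212A KELVIN SIGN and U+FF2B FULLWIDTH LATIN
    # CAPITAL LETTER K back to the ASCII letter K.
    return unicodedata.normalize("NFKC", text)
```

Note that NFKC can also change more than a caller might expect (for example, ㎅ U+3385 becomes the two letters "KB"), so it is rarely appropriate to apply it blindly to stored content.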

Other differences include Unicode Normalization forms (or lack thereof). There are also ignorable characters (such as the variation selectors), whitespace differences, bidirectional controls, and other code points that can interfere with a match.

Users might also expect certain kinds of equivalence to be applied to matching. For example, a Japanese user might expect that hiragana, katakana, and half-width compatibility katakana equivalents all match each other (regardless of which is used to perform the selection or encoded in the text).

When searching text, the concepts of "grapheme boundaries" and "user-perceived characters" can be important. See Section 3 of Character Model for the World Wide Web: Fundamentals [[!CHARMOD]] for a description. For example, if the user has entered a capital "A" into a search box, should the software find the character À (U+00C0 LATIN CAPITAL LETTER A WITH GRAVE)? What about the character "A" followed by U+0300 COMBINING GRAVE ACCENT? What about writing systems, such as Devanagari, which use combining marks to suppress or express certain vowels?
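The precomposed and combining-sequence forms of À in this example are canonically equivalent, so normalizing both sides before comparison makes them match. A minimal sketch (the function name is ours):

```python
import unicodedata

def canonical_equal(a: str, b: str) -> bool:
    # U+00C0 and the sequence "A" + U+0300 COMBINING GRAVE ACCENT are
    # canonically equivalent; comparing NFC (or NFD) forms treats
    # canonically equivalent strings as identical.
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)
```

Whether "A" alone should also match À is a separate, language- and policy-dependent question that normalization does not answer.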

Types of Search Option

When creating a string search API or algorithm, the following textual options might be useful to users:

  • Case-sensitive vs. case-insensitive
  • Kana folding
  • Unicode normalization form
  • etc.
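A search API exposing such options might be shaped as follows. This is a hypothetical sketch: the option names and defaults are illustrative, not drawn from any specification.

```python
import unicodedata
from dataclasses import dataclass

@dataclass
class SearchOptions:
    # Hypothetical option set for a "find" API, illustrating the kinds
    # of switches listed above; names are illustrative, not normative.
    case_insensitive: bool = True
    normalize: bool = True      # apply Unicode normalization (NFC)

def prepare(text: str, opts: SearchOptions) -> str:
    # Apply the selected foldings to a string before comparison.
    if opts.normalize:
        text = unicodedata.normalize("NFC", text)
    if opts.case_insensitive:
        text = text.casefold()
    return text

def find(query: str, corpus: str, opts: SearchOptions = SearchOptions()) -> bool:
    # Both sides must receive identical preparation or matches are lost.
    return prepare(query, opts) in prepare(corpus, opts)
```

The key design point is that the same preparation pipeline must be applied to both the query and the corpus; applying an option to only one side silently breaks matching.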

Acknowledgements

The W3C Internationalization Working Group and Interest Group, as well as others, provided many comments and suggestions. The Working Group would like to thank all of the [[!CHARMOD]] contributors.