This specification describes an API for finding ranges of text or DOM nodes in a document or part of a document, using a variety of selection critiera.

This specification is a strawman, intended to collect feedback. It is not stable, and radical ideas for improvement are welcome. Feedback on searching and ranking algorithms are especially needed.

The API described in this document is still experimental.

Feedback on this specification can be made via annotations. All annotations are archived on the public-annotation@w3.org mailing list (archives), and on the searchable WebPlatform Notes stream; see more details on Notes.WebPlatform.org. Note: To create annotations, you must have a WebPlatform.org account; do not use your W3C credentials.

Introduction

Finding a text selection is a basic operation of a Web browser, but the functionality is not exposed to Web developers. There are many different use cases for a custom mechanism for finding and navigating to a particular range of text in a Web document, such as [[!HTML5]] or [[!SVG]]. A developer may wish to design a webapp in which only the text in the active editing subtreee, and not the controls or other application text, is returned as a result of a search, overriding the native find-text functionality. A user may wish to highlight a particular passage, and link to it, even if that selection doesn't have a id attribute to serve as a fragment identifier. A designer may wish to find and style a range of text without changing the markup. A developer may wish to provide a particular text-matching algorithm for searching their specific type of content, to return more meaningful results (such as fuzzy matching or term substitution). A CMS developer may wish to allow advanced review and commenting on text selections for a collaborative document. All of these use cases share the a fundamental requirement for a API to find text.

There are many challenges to finding text in a modern Web document. The content may have changed; the change could be a document restructuring, the addition of content before the target selection, or even a change to the target string itself. In dynamic documents, such as real-time collaborative editing webapps, the content may change during the same session as the selection; in linking or annotation scenarios, the state is lost between sessions, so other search and anchoring strategies must be used. Highly repetitive content, such as poetry or lyrics, may duplicate the same passage several times, making it difficult to indicate a specific instance of that passage. Large documents may have many possible matches, such that search-ahead mechanism may be computationally expensive. In dynamic lazy-loaded (or “infinite-scroll”) content, the text being searched for may not have been loaded into the DOM yet.

There are also quirks in some native browser find-text mechanisms that could be improved through this API. For example, most current browsers do not let the user search for text across element boundaries (such as a string broken across different paragraphs or list items), which may be unintuitive and inconvenient to the end-user. Not all browsers allow the user to specify case-sensitive or whole-word search parameters, and not all browsers indicate the number of returned results. Browsers do not typically match language patterns that may be found in non-Latin character sets, including collapsed Unicode character sequences, optional diacritical marks, or similar features, such as matching o to ó, ö, ø, and oe.

Sometimes, a selection of a document may contain more than text, or no text at all. Such selections include a picture and its caption, the picture alone, a specific area of a picture, a point on a rendered map, or a range of time within timed media such as audio or video. Some of these examples can be fully expressed as a DOM range; others need supplementary characteristics, which may be application-specific, in order to be addressable.

The RangeFinder API is intended to be flexible and performant. To ensure interoperability, this specification describes a default find-text algorithm, but this can be overriden by the developer. For flexibility in searches, this specification defines several parameterized selector methods, which can be applied in different ordered sequences to best match particular kinds of content, via method chaining, which will allow the developer to provide refined matching algorithms to customize the results. For performance reasons, the RangeFinder API can be evoked iteratively, to find the next match in the document, rather than attempting to search the entire document for all matches.

The result of a RangeFinder search is an object that contains the found range, any additional addressing characteristics, and a confidence interval to indicate how closely the search result matches the various selection criteria. By running a search multiple times and comparing the confidence interval for each result, a script can determine the best match for that set of search criteria. A script can also add or subtract selectors based on the results and confidence interval to refine a search.

Should the return object also contain the CSS and/or XPath selector that is the closest match for the selection? Can you get that easily from another Web Platform feature?

This specification defines what is returned, but not the specific algorithms used to find that result, or the speed or optimizations taken. Such optimizations are implementation-specific.

Dependencies

The following concepts and interfaces are defined in [[!HTML5]]:

Promise objects are defined in [[!ECMASCRIPT]].

The following concepts and interfaces are defined in [[!DOM4]]:

The following conformance classes are defined by this specification:

conforming implementation

A user agent is considered to be a conforming implementation if it satisfies all of the MUST-, REQUIRED- and SHALL-level criteria in this specification that apply to implementations.

User agents that use ECMAScript to implement the APIs defined in this specification must implement them in a manner consistent with the ECMAScript Bindings defined in the Web IDL specification [[!WEBIDL]] as this specification uses that specification and terminology.

Terminology

search tree
The search tree is the subset of the document that the RangeFinder object is scoped to search in.
search content
The search content is the normalized element text content from the search tree.
edit distance

The edit distance tolerance is the number of operations between the supplied target text attribute string and the candidate text content of the document. The edit distance is measured as the number of operations, of the set delete, insert, replace, and transpose, required to transform one string into another.

Decide which algorithm to use to calculate this value: Levenshtein; Damerau–Levenshtein; Hamming; Jaro–Winkler (unlikely); or some other edit distance algorithm.

Should this edit distance apply to the values of the prefix and suffix attributes as well?

Expose edit-distance calculator as low-level API?

What about "close key" calculations (keyboard-specific)

The RangeFinder API

The RangeFinder API enables a client-side script to incrementally search the DOM of the current document for matching strings, elements, or other DOM structures, based on criteria set with dedicated methods.

The RangeFinder Interface

This interface represents an object which allows for the discovery of arbitrary selections of a document based on criteria set out in its attributes. The constructor takes an optional property object to set the selecotr attributes. The results of the RangeFinder object's search are obtained through iterative execution of the search() method.

Currently, this API uses attributes to set selectors. I considered using chaining setter methods, which would allow for multiple parameters for each selector, and specific confidence ratings for each selector. However, that approach wouldn't let you initialize the object with a property object, nor allow easy object introspection. Feedback welcome.

attribute DOMString text

A text string to be searched for. Note: This string is equivalent to the textContent of a DOM node.

The default value of the text attribute MUST be the empty string.

attribute unsigned short textDistance

The edit distance tolerance for the text attribute string.

The default value of the textDistance attribute MUST be 0.

If the value of the text attribute is the empty string, the textDistance attribute MUST use the default value.

attribute DOMString prefix

A text string of arbitrary length (often 32 characters) that precedes the target string.

The default value of the prefix attribute MUST be the empty string.

Include link to study that shows optimal distance for character prefixes?

attribute unsigned short prefixDistance

The edit distance tolerance for the prefix attribute string.

The default value of the prefixDistance attribute MUST be 0.

If the value of the prefix attribute is the empty string, the prefixDistance attribute MUST use the default value.

attribute DOMString suffix

A text string of arbitrary length (often 32 characters) that follows the target string.

The default value of the suffix attribute MUST be the empty string.

Include link to study that shows optimal distance for character suffixes?

attribute unsigned short suffixDistance

The edit distance tolerance for the suffix attribute string.

The default value of the suffixDistance attribute MUST be 0.

If the value of the suffix attribute is the empty string, the suffixDistance attribute MUST use the default value.

attribute Range scopeRange

The DOM range which delineates the bounds of the search operation. The default range MUST be a range comprising the document's body element.

Should this be an element?

attribute Range startRange

A range object, as the starting point for the selection search.

attribute boolean caseFolding

A boolean representing whether a search MUST ignore the case of the characters when doing match comparison. The default value MUST be true.

attribute boolean unicodeFolding

A boolean representing whether a search MUST match canonically equivalent characters. The default value MUST be false.

Canonical Equivalence is described in [[!UTS10]]. If the value of this attribute is true, the User Agent MUST follow the algorthim defined in section 8 Searching and Matching, with collations and contractions formed per section 8.1 Collation Folding, and secondary strength asymmetric search per section 8.2 Asymmetric Search.

We should require that the UA support the Character Model for the World Wide Web: String Matching and Searching [[!charmod-norm]] spec (and urge the I18n WG to complete that spec).

This still needs a lot of work!

Should we allow the author to supply their own tailoring mapping?

This needs to be balanced with other kinds of fuzzy matches, such as case-folding and edit distance.

Define how these matches affect the rank or confidence of the return result.

attribute boolean wholeWord

A boolean representing whether a search MUST require that the search string be surrounded by non-word characters. The default value MUST be false.

attribute boolean wrap

A boolean representing whether a search MUST begin again at the beginning (or end, depending on search direction) of the search tree. The default value MUST be false.

Promise≺RangeFinderResult≻ search(optional boolean forward)

Executes the search, based on the current selection criteria attributes, from the current search index position within the document serialization. The return result MUST be a Promise of a RangeFinderResult object with properties defined as follows:

  • If a match is found for the current selection criteria, the value of the range property MUST be a Range object that contains the target string or DOM structure, and the value of the confidence property MUST be a postive value as described in the upcoming section on magical ways to determine a confidence value on the results, probably something to do with the edit distance and how many criteria failed or something.
  • If no match is found for the current selection criteria, the value of the range property MUST be null., and the value of the confidence property MUST be 0.

TODO: Add algorithm that describes the search operation, the setting of the search index position, evaluation of each search criterion, and so on. Maybe that's in a later section of the spec?

optional boolean forward

This parameter indicates whether the search should proceed in forward (positive) direction from the last index, or in a reverse (negative) direction. The default value MUST be true.

If the value is true, the search MUST continue in a positive direction along the character or DOM serialization, from the start of the document toward the end, beginning at the index of the last search.

If the value is false, the search MUST continue in a negative direction along the character or DOM serialization, from the end of the document toward the start, beginning at the index of the last search.

DOMException error
The error that occurred while searching.
DOMString text
unsigned short textDistance
DOMString prefix
unsigned short prefixDistance
DOMString suffix
unsigned short suffixDistance
Range scope
Range startRange
boolean caseFolding
boolean unicodeFolding
boolean wholeWord
boolean wrap
Range? range

A DOM Range object, as defined in [[!DOM4]]. If no match is found, the value MUST be null.

dictionary RangeFinderArgs

A dictionary object which contains all the parameters used in the search.

RangeFinderArgs MUST include entries for all available parameters, and MUST include default values for any parameter not explicitly set in the RangeFinder object.

The EditDistance Interface

The EditDistance API defines a method to calculate the minimum edit distance between any two strings. This method uses the Levenshtein Distance algorithm, with the following operations:

The EditDistance Interface is a lower-level API that acts as a helper API to the RangeFinder API. It may also be used as a standalone method.

Given two strings s1 and s2, the edit distance between s1 and s2 is the minimum number of addition, deletion, and replacement operations required to convert string s1 to s2.

The EditDistance Method

unsigned short findEditDistance (in DOMString targetString, in DOMString comparisonString)

The findEditDistance method calculates the edit distance between the targetString and the comparisonString.

DOMString comparisonString
Specifies the comparison string, which the target string is compared to. The default value MUST be the empty string.
DOMString targetString
Specifies the target string, which is compared to the comparison string. The default value MUST be the empty string.

Open Issues

These are open issues that don't yet have a location in the document.

Explain how to tell when a search is complete, e.g. when all the matches have been found.

How to deal with search matches that are dynamically inserted before the current search index position? LEave it up to the webapp?

Should this find "meta" content, such as text in a script, style, body, or CDATA block?

Should this find "hidden" content, such as text that is display:none or visibility:none?

What are the minimum criteria for performing a search? You might usefully look for what's between the prefix and suffix, or what is at a certain range. Should we allow that, or require that the text attribute not be the empty string?

Should we include an option to ignore punctuation? Apparently use of punctuation is widely variable and often changed, so this improves search efficiency.

Concepts

Algorithms defined in this specification assume that for each document there is a pending promise, which is initially set to null, which is a Promise object whose associated operation is to search the document.

Performing a search operation

A conforming user agent MUST perform a search operation in a manner that produces results consistent with this following algorithm. A conforming user agent MAY optimize the search algorithm in any way that produces these consistent results.

Algorithm summary

The RangeFinder search() method initiates an incremental text search of that portion of the document set as the scope parameter of the RangeFinder object, starting at its startRange parameter.

The search process starts by finding the scope of the search; given the scope range, it first finds the parent start and end elements, and all elements between them; these element comprise the search tree. It then extracts the text content of all these elements, and truncates everything outside the start and end character positions, using the algorithm defined for the textContent attribute in [[!DOM4]]; this is referred to as the search content.

In the search content, target text, and prefix and suffix text, all multiple consecutive whitespace characters, including newline characters, are collapsed into single space characters. If the caseFolding attribute is set to true, all characters are converted into their lowercase equivalents (if any). If the unicodeFolding attribute is set to true, all characters are converted into their decomposed equivalents (if any), per the [[!charmod-norm]] spec. If the ignorePunctuation attribute is set to true, all punctuation characters are removed. (How should wholeWord be modeled?)

Once the search content, target text, and prefix and suffix text are normalized according to the required parameters, the search operation is begun. The starting point of the search is the character position of the startRange attribute; if this character position lies outside the scoped range, then the starting point of the search is the 0 position of the search content string.

The target text is then incrementally compared with the search content, within the tolerance of the specified edit distance. Each such operation is performed at the current index character, starting with the first character and proceeding to each subsequent character. If there is a match, the prefix and suffix strings are also tested against the surrounding text and within their respective specified edit distance tolerances. If this also matches, then the process stops and the return results are prepared. If no appropriate match is found, the index character is incremented by a single character until a match is found, or until the end of the search content is reached if the wrap attribute is set to false. If the wrap attribute is set to true, and the start position was not the 0 position of the search content string, the search only concludes when the start position has been reached.

The return results are prepared by first reconciling the matched string with its equivalent DOM range in the document. This range object is created, and bundled in an object with the original search parameters. This return object is in the form of a Promise, so the results can be analyzed and subsquent search operations can be performed by incrementing the start range position based on the matched range (if any). By performing this search multiple times, an application can incrementally find all matching ranges in the document.

The steps to perform a search operation to a document using RangeFinder are as follows:

  1. If the user agent does not support searching the document using RangeFinder, return a Promise rejected with a DOMException whose name is NotSupportedError and abort these steps.
  2. The following sub-steps MAY be run asynchronously for performance reasons, for example, if the user agent has browsing contexts living in different processes:
    1. Let browsing contexts be the list of the descendant browsing contexts of the top-level browsing context's document.
    2. If one of the browsing contexts's document's pending promise is not null:
      1. Let doc be the document which has a not null pending promise.
      2. Reject doc's pending promise with DOMException whose name is AbortError.
      3. Set doc's pending promise to null .
  3. Finish the algorithm.

Define the EditDistance algorithm.

Examples

This examples section is out of date, and uses syntax from an earlier draft of this spec. The general ideas are similar, but the syntax may be very different.

Poetry, lyrics, and some prose often include significant repetition. A simple find-text operation would return only the first instance of a repetition, while the selection in question might be for the second or third instance.

Example 1

In the following poem, there are four instances of the phrase “Rage, rage”; the desired selection is the third instance. Assuming an HTML page consisting only of the poem, there are several ways to use the RangeFinder API to achieve this, a few of which are demonstrated below. Any or all of these techniques can be applied to ensure robust anchoring.

Do Not Go Gentle Into That Good Night

by Dylan Thomas

Do not go gentle into that good night,
Old age should burn and rave at close of day;
Rage, rage against the dying of the light.

Though wise men at their end know dark is right,
Because their words had forked no lightning they
Do not go gentle into that good night.

Good men, the last wave by, crying how bright
Their frail deeds might have danced in a green bay,
Rage, rage against the dying of the light.

Wild men who caught and sang the sun in flight,
And learn, too late, they grieved it on its way,
Do not go gentle into that good night.

Grave men, near death, who see with blinding sight
Blind eyes could blaze like meteors and be gay,
Rage, rage against the dying of the light.

And you, my father, there on the sad height,
Curse, bless, me now with your fierce tears, I pray.
Do not go gentle into that good night.
Rage, rage against the dying of the light.

Method 1: Iterative search


var rf = new RangeFinder({ text: "Rage, rage" });
var result = rf.search(); // result is 1st instance of string
    result = rf.search(); // result is 2nd instance of string
    result = rf.search(); // result is 3rd instance of string, the target instance
        

Method 2: Context search with prefix


var rf = new RangeFinder({ text: "Rage, rage" });
    rf.prefix = "blaze like meteors and be gay, "; // the 32 characters preceding the selection
var result = rf.search(); // result is 1st instance of string after prefix
        

Method 3: Character count


var rf = new RangeFinder();
    rf.characterStart = 696; // the character count of the normalized textContent preceding the selection
    rf.characterLength = 11; // the character count of the selection
var result = rf.search(); // result is the 11 characters starting at the position 696
        

Method 4: Edit distance and query selector

@@ YOU ARE HERE! @@

        var range = document.createRange();
            range.setStart(startNode, startOffset);
            range.setEnd(endNode, endOffset);
var rf = new RangeFinder();
    rf.text = "Range, range"; // the selection, misspelled
    rf.editDistance = 2; // the Levenshtein distance tolerance between "Rage, rage" and "Range, range"
    rf.querySelector = "p:nth-of-type(5)"; // the 5th paragraph (the one containing the selection)
var result = rf.search(); // result is 1st instance of string starting at query selector, 
                          //   with a confidence reflecting the expected edit distance
        

Method 5: Edit distance and query selector


var rf = new RangeFinder();
    rf.text = "Range, range"; // the selection, misspelled
    rf.editDistance = 2; // the Levenshtein distance tolerance between "Rage, rage" and "Range, range"
    rf.querySelector = "p:nth-of-type(5)"; // the 5th paragraph (the one containing the selection)
var result = rf.search(); // result is 1st instance of string starting at query selector, 
                          //   with a confidence reflecting the expected edit distance
        

Method 6: Element scope

The main difference here between the querySelector attribute (used in Method 4) and the element attribute is that querySelector sets the starting point for the search, while element limits the scope. In other words, running search() again on the previous example will find the next instance in the next paragraph, while in this example, it will only find the single instance in the target paragraph.


var rf = new RangeFinder();
    rf.text = "Rage, rage"; // the selection
    rf.element = document.querySelector("p:nth-of-type(5)"); // limit scope to the 5th paragraph
                                                             //   (the one containing the selection)
var result = rf.search(); // result is 1st instance of string within query selector, 
                          //   with a confidence reflecting the expected edit distance
        

Acknowledgements

Special thanks to Cameron McCormack (Mozilla), Kristof Csillag (Hypothes.is), Randall Leeds (Hypothes.is), Dan Whaley (Hypothes.is), Mitar Milutinovic (UC-Berkeley), Jake Hartnell (UC-Berkeley, EPUB.js), Nick Stenning (Hypothes.is), Benjamin Young (Hypothes.is), Mike de Boer (Mozilla), Travis Leithead (Microsoft), Ryosuke Niwa (Apple), Josh Soref (Blackberry), Rob Sanderson (Stanford), Paolo Ciccarese (Mass General Hospital / Harvard), Takeshi Kanai (Sony), Ivan Herman (W3C), Robin Berjon (W3C), Mike™ Smith (W3C), Mounir Lamouri (Google) and Marcos Cáceres (Mozilla) for writing a spec I could plunder mechanics, Peter Brantley (for putting on the Books in Browsers conference), and the entire Web Annotation Working Group and Open Annotations Community Group for discussion and insight. If I've forgotten anyone, please remind me; I'm truly grateful for all the discussions I've had on this topic.