This specification describes an API for finding ranges of text or DOM nodes in a document or part of a document, using a variety of selection critiera.
Note
This specification is a strawman, intended to collect feedback. It is not stable, and radical ideas for improvement are welcome. Feedback on searching and ranking algorithms are especially needed.
Status of This Document
This section describes the status of this
document at the time of its publication. Other documents may supersede
this document. A list of current W3C publications and the latest revision
of this technical report can be found in the
W3C technical reports index at
https://www.w3.org/TR/.
The API described in this document is still experimental.
Publication as an Editor's Draft does not imply endorsement by the W3C
Membership. This is a draft document and may be updated, replaced or
obsoleted by other documents at any time. It is inappropriate to cite this
document as other than work in progress.
This document was produced by a group
operating under the
W3C Patent Policy.
W3C maintains a
public list of any patent disclosures
made in connection with the deliverables of
the group; that page also includes
instructions for disclosing a patent. An individual who has actual
knowledge of a patent which the individual believes contains
Essential Claim(s)
must disclose the information in accordance with
section 6 of the W3C Patent Policy.
Finding a text selection is a basic operation of a Web browser, but the functionality is not exposed to Web developers. There are many different use cases for a custom mechanism for finding and navigating to a particular range of text in a Web document, such as [HTML5] or [SVG]. A developer may wish to design a webapp in which only the text in the active editing subtreee, and not the controls or other application text, is returned as a result of a search, overriding the native find-text functionality. A user may wish to highlight a particular passage, and link to it, even if that selection doesn't have a id attribute to serve as a fragment identifier. A designer may wish to find and style a range of text without changing the markup. A developer may wish to provide a particular text-matching algorithm for searching their specific type of content, to return more meaningful results (such as fuzzy matching or term substitution). A CMS developer may wish to allow advanced review and commenting on text selections for a collaborative document. All of these use cases share the a fundamental requirement for a API to find text.
There are many challenges to finding text in a modern Web document.
The content may have changed; the change could be a document restructuring, the addition of content before the target selection, or even a change to the target string itself. In dynamic documents, such as real-time collaborative editing webapps, the content may change during the same session as the selection; in linking or annotation scenarios, the state is lost between sessions, so other search and anchoring strategies must be used.
Highly repetitive content, such as poetry or lyrics, may duplicate the same passage several times, making it difficult to indicate a specific instance of that passage.
Large documents may have many possible matches, such that search-ahead mechanism may be computationally expensive.
In dynamic lazy-loaded (or “infinite-scroll”) content, the text being searched for may not have been loaded into the DOM yet.
There are also quirks in some native browser find-text mechanisms that could be improved through this API. For example, most current browsers do not let the user search for text across element boundaries (such as a string broken across different paragraphs or list items), which may be unintuitive and inconvenient to the end-user. Not all browsers allow the user to specify case-sensitive or whole-word search parameters, and not all browsers indicate the number of returned results. Browsers do not typically match language patterns that may be found in non-Latin character sets, including collapsed Unicode character sequences, optional diacritical marks, or similar features, such as matching o to ó, ö, ø, and oe.
Sometimes, a selection of a document may contain more than text, or no text at all. Such selections include a picture and its caption, the picture alone, a specific area of a picture, a point on a rendered map, or a range of time within timed media such as audio or video. Some of these examples can be fully expressed as a DOM range; others need supplementary characteristics, which may be application-specific, in order to be addressable.
The RangeFinder API is intended to be flexible and performant. To ensure interoperability, this specification describes a default find-text algorithm, but this can be overriden by the developer. For flexibility in searches, this specification defines several parameterized selector methods, which can be applied in different ordered sequences to best match particular kinds of content, via method chaining, which will allow the developer to provide refined matching algorithms to customize the results.
For performance reasons, the RangeFinder API can be evoked iteratively, to find the next match in the document, rather than attempting to search the entire document for all matches.
The result of a RangeFinder search is an object that contains the found range, any additional addressing characteristics, and a confidence interval to indicate how closely the search result matches the various selection criteria. By running a search multiple times and comparing the confidence interval for each result, a script can determine the best match for that set of search criteria. A script can also add or subtract selectors based on the results and confidence interval to refine a search.
Issue 1
Should the return object also contain the CSS and/or XPath selector that is the closest match for the selection? Can you get that easily from another Web Platform feature?
Note
This specification defines what is returned, but not the specific algorithms used to find that result, or the speed or optimizations taken. Such optimizations are implementation-specific.
Dependencies
The following concepts and interfaces are defined in [HTML5]:
As well as sections marked as non-normative, all authoring guidelines, diagrams, examples, and notes in this specification are non-normative. Everything else in this specification is normative.
The key words MAY, MUST, REQUIRED, and SHALL in this document
are to be interpreted as described in
BCP 14
[RFC2119] [RFC8174]
when, and only when, they appear in all capitals, as shown here.
The following conformance classes are defined by this specification:
conforming implementation
A user agent is considered to be a conforming implementation if it satisfies all of the MUST-, REQUIRED- and SHALL-level criteria in this specification that apply to implementations.
User agents that use ECMAScript to implement the APIs defined in this
specification must implement them in a manner consistent with the ECMAScript
Bindings defined in the Web IDL specification [WEBIDL] as this
specification uses that specification and terminology.
Terminology
search tree
The search tree is the subset of the document that the RangeFinder object is scoped to search in.
The edit distance tolerance is the number of operations between the supplied target text attribute string and the candidate text content of the document. The edit distance is measured as the number of operations, of the set delete, insert, replace, and transpose, required to transform one string into another.
Issue 2
Decide which algorithm to use to calculate this value: Levenshtein; Damerau–Levenshtein; Hamming; Jaro–Winkler (unlikely); or some other edit distance algorithm.
Issue 3
Should this edit distance apply to the values of the prefix and suffix attributes as well?
Issue 4
Expose edit-distance calculator as low-level API?
Issue 5
What about "close key" calculations (keyboard-specific)
The RangeFinder API
The RangeFinder API enables a client-side script to incrementally search the DOM of the current document for matching strings, elements, or other DOM structures, based on criteria set with dedicated methods.
The RangeFinder Interface
This interface represents an object which allows for the discovery of arbitrary selections of a document based on criteria set out in its attributes. The constructor takes an optional property object to set the selecotr attributes. The results of the RangeFinder object's search are obtained through iterative execution of the search() method.
Note
Currently, this API uses attributes to set selectors. I considered using chaining setter methods, which would allow for multiple parameters for each selector, and specific confidence ratings for each selector. However, that approach wouldn't let you initialize the object with a property object, nor allow easy object introspection. Feedback welcome.
attribute DOMString text
A text string to be searched for. Note: This string is equivalent to the textContent of a DOM node.
The default value of the text attribute MUST be the empty string.
The default value of the suffixDistance attribute MUST be 0.
If the value of the suffix attribute is the empty string, the suffixDistance attribute MUST use the default value.
attribute Range scopeRange
The DOM range which delineates the bounds of the search operation. The default range MUST be a range comprising the document's body element.
Issue 8
Should this be an element?
attribute Range startRange
A range object, as the starting point for the selection search.
attribute boolean caseFolding
A boolean representing whether a search MUST ignore the case of the characters when doing match comparison. The default value MUST be true.
attribute boolean unicodeFolding
A boolean representing whether a search MUST match canonically equivalent characters. The default value MUST be false.
Canonical Equivalence is described in [UTS10]. If the value of this attribute is true, the User Agent MUST follow the algorthim defined in section 8 Searching and Matching, with collations and contractions formed per section 8.1 Collation Folding, and secondary strength asymmetric search per section 8.2 Asymmetric Search.
Issue 9
We should require that the UA support the Character Model for the World Wide Web: String Matching and Searching [charmod-norm] spec (and urge the I18n WG to complete that spec).
Issue 10
This still needs a lot of work!
Issue 11
Should we allow the author to supply their own tailoring mapping?
Issue 12
This needs to be balanced with other kinds of fuzzy matches, such as case-folding and edit distance.
Issue 13
Define how these matches affect the rank or confidence of the return result.
attribute boolean wholeWord
A boolean representing whether a search MUST require that the search string be surrounded by non-word characters. The default value MUST be false.
attribute boolean wrap
A boolean representing whether a search MUST begin again at the beginning (or end, depending on search direction) of the search tree. The default value MUST be false.
Executes the search, based on the current selection criteria attributes, from the current search index position within the document serialization. The return result MUST be a Promise of a RangeFinderResult object with properties defined as follows:
If a match is found for the current selection criteria, the value of the range property MUST be a Range object that contains the target string or DOM structure, and the value of the confidence property MUST be a postive value as described in the upcoming section on magical ways to determine a confidence value on the results, probably something to do with the edit distance and how many criteria failed or something.
If no match is found for the current selection criteria, the value of the range property MUST be null., and the value of the confidence property MUST be 0.
Issue 14
TODO: Add algorithm that describes the search operation, the setting of the search index position, evaluation of each search criterion, and so on. Maybe that's in a later section of the spec?
optional boolean forward
This parameter indicates whether the search should proceed in forward (positive) direction from the last index, or in a reverse (negative) direction. The default value MUST be true.
If the value is true, the search MUST continue in a positive direction along the character or DOM serialization, from the start of the document toward the end, beginning at the index of the last search.
If the value is false, the search MUST continue in a negative direction along the character or DOM serialization, from the end of the document toward the start, beginning at the index of the last search.
DOMException error
The error that occurred while searching.
DOMString text
unsigned short textDistance
DOMString prefix
unsigned short prefixDistance
DOMString suffix
unsigned short suffixDistance
Range scope
Range startRange
boolean caseFolding
boolean unicodeFolding
boolean wholeWord
boolean wrap
Range? range
A DOM Range object, as defined in [DOM4]. If no match is found, the value MUST be null.
dictionary RangeFinderArgs
A dictionary object which contains all the parameters used in the search.
RangeFinderArgsMUST include entries for all available parameters, and MUST include default values for any parameter not explicitly set in the RangeFinder object.
The EditDistance Interface
The EditDistance API defines a method to calculate the minimum edit distance between any two strings. This method uses the Levenshtein Distance algorithm, with the following operations:
Adding one character to the target string.
Deleting one character from the target string.
Replacing one character of the target string with another character.
The EditDistance Interface is a lower-level API that acts as a helper API to the RangeFinder API. It may also be used as a standalone method.
Given two strings s1 and s2, the edit distance between s1 and s2 is the minimum number of addition, deletion, and replacement operations required to convert string s1 to s2.
The EditDistance Method
unsigned short findEditDistance (in DOMString targetString, in DOMString comparisonString)
The findEditDistance method calculates the edit distance between the targetString and the comparisonString.
DOMString comparisonString
Specifies the comparison string, which the target string is compared to. The default value MUST be the empty string.
DOMString targetString
Specifies the target string, which is compared to the comparison string. The default value MUST be the empty string.
Open Issues
These are open issues that don't yet have a location in the document.
Issue 15
Explain how to tell when a search is complete, e.g. when all the matches have been found.
Issue 16
How to deal with search matches that are dynamically inserted before the current search index position? LEave it up to the webapp?
Issue 17
Should this find "meta" content, such as text in a script, style, body, or CDATA block?
Issue 18
Should this find "hidden" content, such as text that is display:none or visibility:none?
Issue 19
What are the minimum criteria for performing a search? You might usefully look for what's between the prefix and suffix, or what is at a certain range. Should we allow that, or require that the text attribute not be the empty string?
Issue 20
Should we include an option to ignore punctuation? Apparently use of punctuation is widely variable and often changed, so this improves search efficiency.
Concepts
Algorithms defined in this specification assume that for each
document there is a pending promise, which is
initially set to null, which is a Promise object
whose associated operation is to search the document.
Performing a search operation
A conforming user agentMUST perform a search operation in a manner that produces results consistent with this following algorithm. A conforming user agentMAY optimize the search algorithm in any way that produces these consistent results.
Note
Algorithm summary
The RangeFindersearch() method initiates an incremental text search of that portion
of the document set as the scope parameter of the RangeFinder object, starting at its startRange parameter.
The search process starts by finding the scope of the search; given the scope range, it first finds the parent start and end elements, and all elements between them; these element comprise the search tree. It then extracts the text content of all these elements, and truncates everything outside the start and end character positions, using the algorithm defined for the textContent attribute in [DOM4]; this is referred to as the search content.
In the search content, target text, and prefix and suffix text, all multiple consecutive whitespace characters, including newline characters, are collapsed into single space characters. If the caseFolding attribute is set to true, all characters are converted into their lowercase equivalents (if any). If the unicodeFolding attribute is set to true, all characters are converted into their decomposed equivalents (if any), per the [charmod-norm] spec. If the ignorePunctuation attribute is set to true, all punctuation characters are removed. (How should wholeWord be modeled?)
Once the search content, target text, and prefix and suffix text are normalized according to the required parameters, the search operation is begun. The starting point of the search is the character position of the startRange attribute; if this character position lies outside the scoped range, then the starting point of the search is the 0 position of the search content string.
The target text is then incrementally compared with the search content, within the tolerance of the specified edit distance. Each such operation is performed at the current index character, starting with the first character and proceeding to each subsequent character. If there is a match, the prefix and suffix strings are also tested against the surrounding text and within their respective specified edit distance tolerances. If this also matches, then the process stops and the return results are prepared. If no appropriate match is found, the index character is incremented by a single character until a match is found, or until the end of the search content is reached if the wrap attribute is set to false. If the wrap attribute is set to true, and the start position was not the 0 position of the search content string, the search only concludes when the start position has been reached.
The return results are prepared by first reconciling the matched string with its equivalent DOM range in the document. This range object is created, and bundled in an object with the original search parameters. This return object is in the form of a Promise, so the results can be analyzed and subsquent search operations can be performed by incrementing the start range position based on the matched range (if any). By performing this search multiple times, an application can incrementally find all matching ranges in the document.
The steps to perform a search operation to a
document using RangeFinder are as follows:
If the user agent does not support searching the document using RangeFinder,
return a Promise rejected with a
DOMException whose name is
NotSupportedError and abort these steps.
The following sub-steps MAY be run asynchronously for performance
reasons, for example, if the user agent has browsing contexts living in different
processes:
This examples section is out of date, and uses syntax from an earlier draft of this spec. The general ideas are similar, but the syntax may be very different.
Poetry, lyrics, and some prose often include significant repetition. A simple find-text operation would return only the first instance of a repetition, while the selection in question might be for the second or third instance.
Example 1
In the following poem, there are four instances of the phrase “Rage, rage”; the desired selection is the third instance. Assuming an HTML page consisting only of the poem, there are several ways to use the RangeFinder API to achieve this, a few of which are demonstrated below. Any or all of these techniques can be applied to ensure robust anchoring.
Do Not Go Gentle Into That Good Night
by Dylan Thomas
Do not go gentle into that good night,
Old age should burn and rave at close of day;
Rage, rage against the dying of the light.
Though wise men at their end know dark is right,
Because their words had forked no lightning they
Do not go gentle into that good night.
Good men, the last wave by, crying how bright
Their frail deeds might have danced in a green bay,
Rage, rage against the dying of the light.
Wild men who caught and sang the sun in flight,
And learn, too late, they grieved it on its way,
Do not go gentle into that good night.
Grave men, near death, who see with blinding sight
Blind eyes could blaze like meteors and be gay, Rage, rage against the dying of the light.
And you, my father, there on the sad height,
Curse, bless, me now with your fierce tears, I pray.
Do not go gentle into that good night.
Rage, rage against the dying of the light.
var rf = new RangeFinder({ text: "Rage, rage" });
var result = rf.search(); // result is 1st instance of string
result = rf.search(); // result is 2nd instance of string
result = rf.search(); // result is 3rd instance of string, the target instance
var rf = new RangeFinder({ text: "Rage, rage" });
rf.prefix = "blaze like meteors and be gay, "; // the 32 characters preceding the selection
var result = rf.search(); // result is 1st instance of string after prefix
var rf = new RangeFinder();
rf.characterStart = 696; // the character count of the normalized textContent preceding the selection
rf.characterLength = 11; // the character count of the selection
var result = rf.search(); // result is the 11 characters starting at the position 696
var range = document.createRange();
range.setStart(startNode, startOffset);
range.setEnd(endNode, endOffset);
var rf = new RangeFinder();
rf.text = "Range, range"; // the selection, misspelled
rf.editDistance = 2; // the Levenshtein distance tolerance between "Rage, rage" and "Range, range"
rf.querySelector = "p:nth-of-type(5)"; // the 5th paragraph (the one containing the selection)
var result = rf.search(); // result is 1st instance of string starting at query selector,
// with a confidence reflecting the expected edit distance
var rf = new RangeFinder();
rf.text = "Range, range"; // the selection, misspelled
rf.editDistance = 2; // the Levenshtein distance tolerance between "Rage, rage" and "Range, range"
rf.querySelector = "p:nth-of-type(5)"; // the 5th paragraph (the one containing the selection)
var result = rf.search(); // result is 1st instance of string starting at query selector,
// with a confidence reflecting the expected edit distance
Method 6: Element scope
Note
The main difference here between the querySelector attribute (used in Method 4)
and the element attribute is that querySelector sets the starting point for the search,
while element limits the scope. In other words, running search() again on the previous example
will find the next instance in the next paragraph, while in this example, it will only find the single instance in the target paragraph.
var rf = new RangeFinder();
rf.text = "Rage, rage"; // the selection
rf.element = document.querySelector("p:nth-of-type(5)"); // limit scope to the 5th paragraph
// (the one containing the selection)
var result = rf.search(); // result is 1st instance of string within query selector,
// with a confidence reflecting the expected edit distance
Acknowledgements
Special thanks to Cameron McCormack (Mozilla), Kristof Csillag (Hypothes.is), Randall Leeds (Hypothes.is), Dan Whaley (Hypothes.is), Mitar Milutinovic (UC-Berkeley), Jake Hartnell (UC-Berkeley, EPUB.js), Nick Stenning (Hypothes.is), Benjamin Young (Hypothes.is), Mike de Boer (Mozilla), Travis Leithead (Microsoft), Ryosuke Niwa (Apple), Josh Soref (Blackberry), Rob Sanderson (Stanford), Paolo Ciccarese (Mass General Hospital / Harvard), Takeshi Kanai (Sony), Ivan Herman (W3C), Robin Berjon (W3C), Mike™ Smith (W3C), Mounir Lamouri (Google) and Marcos Cáceres (Mozilla) for writing a spec I could plunder mechanics, Peter Brantley (for putting on the Books in Browsers conference), and the entire Web Annotation Working Group and Open Annotations Community Group for discussion and insight. If I've forgotten anyone, please remind me; I'm truly grateful for all the discussions I've had on this topic.