Introduction
Finding a text selection is a basic operation of a Web browser, but the functionality is not exposed to Web developers. There are many different use cases for a custom mechanism for finding and navigating to a particular range of text in a Web document, such as [[!HTML5]] or [[!SVG]]. A developer may wish to design a webapp in which only the text in the active editing subtreee, and not the controls or other application text, is returned as a result of a search, overriding the native find-text functionality. A user may wish to highlight a particular passage, and link to it, even if that selection doesn't have a id
attribute to serve as a fragment identifier. A designer may wish to find and style a range of text without changing the markup. A developer may wish to provide a particular text-matching algorithm for searching their specific type of content, to return more meaningful results (such as fuzzy matching or term substitution). A CMS developer may wish to allow advanced review and commenting on text selections for a collaborative document. All of these use cases share the a fundamental requirement for a API to find text.
There are many challenges to finding text in a modern Web document.
The content may have changed; the change could be a document restructuring, the addition of content before the target selection, or even a change to the target string itself. In dynamic documents, such as real-time collaborative editing webapps, the content may change during the same session as the selection; in linking or annotation scenarios, the state is lost between sessions, so other search and anchoring strategies must be used.
Highly repetitive content, such as poetry or lyrics, may duplicate the same passage several times, making it difficult to indicate a specific instance of that passage.
Large documents may have many possible matches, such that search-ahead mechanism may be computationally expensive.
In dynamic lazy-loaded (or “infinite-scroll”) content, the text being searched for may not have been loaded into the DOM yet.
There are also quirks in some native browser find-text mechanisms that could be improved through this API. For example, most current browsers do not let the user search for text across element boundaries (such as a string broken across different paragraphs or list items), which may be unintuitive and inconvenient to the end-user. Not all browsers allow the user to specify case-sensitive or whole-word search parameters, and not all browsers indicate the number of returned results. Browsers do not typically match language patterns that may be found in non-Latin character sets, including collapsed Unicode character sequences, optional diacritical marks, or similar features, such as matching o
to ó
, ö
, ø
, and oe
.
Sometimes, a selection of a document may contain more than text, or no text at all. Such selections include a picture and its caption, the picture alone, a specific area of a picture, a point on a rendered map, or a range of time within timed media such as audio or video. Some of these examples can be fully expressed as a DOM range; others need supplementary characteristics, which may be application-specific, in order to be addressable.
The RangeFinder API is intended to be flexible and performant. To ensure interoperability, this specification describes a default find-text algorithm, but this can be overriden by the developer. For flexibility in searches, this specification defines several parameterized selector methods, which can be applied in different ordered sequences to best match particular kinds of content, via method chaining, which will allow the developer to provide refined matching algorithms to customize the results.
For performance reasons, the RangeFinder API can be evoked iteratively, to find the next match in the document, rather than attempting to search the entire document for all matches.
The result of a RangeFinder search is an object that contains the found range, any additional addressing characteristics, and a confidence interval to indicate how closely the search result matches the various selection criteria. By running a search multiple times and comparing the confidence interval for each result, a script can determine the best match for that set of search criteria. A script can also add or subtract selectors based on the results and confidence interval to refine a search.
Should the return object also contain the CSS and/or XPath selector that is the closest match for the selection? Can you get that easily from another Web Platform feature?
This specification defines what is returned, but not the specific algorithms used to find that result, or the speed or optimizations taken. Such optimizations are implementation-specific.
Dependencies
The following concepts and interfaces are defined in [[!HTML5]]:
Promise objects are defined in [[!ECMASCRIPT]].
The following concepts and interfaces are defined in [[!DOM4]]:
Terminology
-
search tree
-
The search tree is the subset of the document that the RangeFinder object is scoped to search in.
-
search content
-
The search content is the normalized element text content from the search tree.
-
edit distance
-
The edit distance tolerance is the number of operations between the supplied target text attribute string and the candidate text content of the document. The edit distance is measured as the number of operations, of the set delete, insert, replace, and transpose, required to transform one string into another.
Decide which algorithm to use to calculate this value: Levenshtein; Damerau–Levenshtein; Hamming; Jaro–Winkler (unlikely); or some other edit distance algorithm.
Should this edit distance apply to the values of the prefix
and suffix
attributes as well?
Expose edit-distance calculator as low-level API?
What about "close key" calculations (keyboard-specific)
The RangeFinder API
The RangeFinder API enables a client-side script to incrementally search the DOM of the current document for matching strings, elements, or other DOM structures, based on criteria set with dedicated methods.
The RangeFinder Interface
This interface represents an object which allows for the discovery of arbitrary selections of a document based on criteria set out in its attributes. The constructor takes an optional property object to set the selecotr attributes. The results of the RangeFinder object's search are obtained through iterative execution of the search()
method.
Currently, this API uses attributes to set selectors. I considered using chaining setter methods, which would allow for multiple parameters for each selector, and specific confidence ratings for each selector. However, that approach wouldn't let you initialize the object with a property object, nor allow easy object introspection. Feedback welcome.
-
attribute DOMString text
-
A text string to be searched for. Note: This string is equivalent to the textContent
of a DOM node.
The default value of the text attribute MUST be the empty string.
-
attribute unsigned short textDistance
-
The edit distance tolerance for the text attribute string.
The default value of the textDistance attribute MUST be 0
.
If the value of the text attribute is the empty string, the textDistance attribute MUST use the default value.
-
attribute DOMString prefix
-
A text string of arbitrary length (often 32 characters) that precedes the target string.
The default value of the prefix attribute MUST be the empty string.
Include link to study that shows optimal distance for character prefixes?
-
attribute unsigned short prefixDistance
-
The edit distance tolerance for the prefix attribute string.
The default value of the prefixDistance attribute MUST be 0
.
If the value of the prefix attribute is the empty string, the prefixDistance attribute MUST use the default value.
-
attribute DOMString suffix
-
A text string of arbitrary length (often 32 characters) that follows the target string.
The default value of the suffix attribute MUST be the empty string.
Include link to study that shows optimal distance for character suffixes?
-
attribute unsigned short suffixDistance
-
The edit distance tolerance for the suffix attribute string.
The default value of the suffixDistance attribute MUST be 0
.
If the value of the suffix attribute is the empty string, the suffixDistance attribute MUST use the default value.
-
attribute Range scopeRange
-
The DOM range which delineates the bounds of the search operation. The default range MUST be a range comprising the document's body element.
Should this be an element?
-
attribute Range startRange
-
A range object, as the starting point for the selection search.
-
attribute boolean caseFolding
-
A boolean representing whether a search MUST ignore the case of the characters when doing match comparison. The default value MUST be true.
-
attribute boolean unicodeFolding
-
A boolean representing whether a search MUST match canonically equivalent characters. The default value MUST be false.
Canonical Equivalence is described in [[!UTS10]]. If the value of this attribute is true, the User Agent MUST follow the algorthim defined in section 8 Searching and Matching, with collations and contractions formed per section 8.1 Collation Folding, and secondary strength asymmetric search per section 8.2 Asymmetric Search.
We should require that the UA support the Character Model for the World Wide Web: String Matching and Searching [[!charmod-norm]] spec (and urge the I18n WG to complete that spec).
This still needs a lot of work!
Should we allow the author to supply their own tailoring mapping?
This needs to be balanced with other kinds of fuzzy matches, such as case-folding and edit distance.
Define how these matches affect the rank or confidence of the return result.
-
attribute boolean wholeWord
-
A boolean representing whether a search MUST require that the search string be surrounded by non-word characters. The default value MUST be false.
-
attribute boolean wrap
-
A boolean representing whether a search MUST begin again at the beginning (or end, depending on search direction) of the search tree. The default value MUST be false.
-
Promise≺RangeFinderResult≻ search(optional boolean forward)
-
Executes the search, based on the current selection criteria attributes, from the current search index position within the document serialization. The return result MUST be a Promise
of a RangeFinderResult object with properties defined as follows:
-
If a match is found for the current selection criteria, the value of the range property MUST be a
Range
object that contains the target string or DOM structure, and the value of the confidence property MUST be a postive value as described in the upcoming section on magical ways to determine a confidence value on the results, probably something to do with the edit distance and how many criteria failed or something.
-
If no match is found for the current selection criteria, the value of the
range
property MUST be null
., and the value of the confidence
property MUST be 0
.
TODO: Add algorithm that describes the search operation, the setting of the search index position, evaluation of each search criterion, and so on. Maybe that's in a later section of the spec?
-
optional boolean forward
-
This parameter indicates whether the search should proceed in forward (positive) direction from the last index, or in a reverse (negative) direction. The default value MUST be true.
If the value is true
, the search MUST continue in a positive direction along the character or DOM serialization, from the start of the document toward the end, beginning at the index of the last search.
If the value is false
, the search MUST continue in a negative direction along the character or DOM serialization, from the end of the document toward the start, beginning at the index of the last search.
-
DOMException error
-
The error that occurred while searching.
-
DOMString text
-
unsigned short textDistance
-
DOMString prefix
-
unsigned short prefixDistance
-
DOMString suffix
-
unsigned short suffixDistance
-
Range scope
-
Range startRange
-
boolean caseFolding
-
boolean unicodeFolding
-
boolean wholeWord
-
boolean wrap
-
Range? range
-
A DOM Range object, as defined in [[!DOM4]]. If no match is found, the value MUST be null
.
-
dictionary RangeFinderArgs
-
A dictionary object which contains all the parameters used in the search.
RangeFinderArgs MUST include entries for all available parameters, and MUST include default values for any parameter not explicitly set in the RangeFinder object.
The EditDistance Interface
The EditDistance API defines a method to calculate the minimum edit distance between any two strings. This method uses the Levenshtein Distance algorithm, with the following operations:
- Adding one character to the target string.
- Deleting one character from the target string.
- Replacing one character of the target string with another character.
The EditDistance Interface is a lower-level API that acts as a helper API to the RangeFinder API. It may also be used as a standalone method.
Given two strings s1 and s2, the edit distance between s1 and s2 is the minimum number of addition, deletion, and replacement operations required to convert string s1 to s2.
The EditDistance Method
- unsigned short findEditDistance (in DOMString targetString, in DOMString comparisonString)
-
The findEditDistance method calculates the edit distance between the targetString and the comparisonString.
- DOMString comparisonString
-
Specifies the comparison string, which the target string is compared to. The default value MUST be the empty string.
- DOMString targetString
-
Specifies the target string, which is compared to the comparison string. The default value MUST be the empty string.
Open Issues
These are open issues that don't yet have a location in the document.
Explain how to tell when a search is complete, e.g. when all the matches have been found.
How to deal with search matches that are dynamically inserted before the current search index position? LEave it up to the webapp?
Should this find "meta" content, such as text in a script
, style
, body
, or CDATA
block?
Should this find "hidden" content, such as text that is display:none
or visibility:none
?
What are the minimum criteria for performing a search? You might usefully look for what's between the prefix and suffix, or what is at a certain range. Should we allow that, or require that the text
attribute not be the empty string?
Should we include an option to ignore punctuation? Apparently use of punctuation is widely variable and often changed, so this improves search efficiency.
Concepts
Algorithms defined in this specification assume that for each
document there is a pending promise, which is
initially set to null
, which is a Promise object
whose associated operation is to search the document.
Performing a search operation
A conforming user agent MUST perform a search operation in a manner that produces results consistent with this following algorithm. A conforming user agent MAY optimize the search algorithm in any way that produces these consistent results.
Algorithm summary
The RangeFinder
search()
method initiates an incremental text search of that portion
of the document set as the scope
parameter of the RangeFinder
object, starting at its startRange
parameter.
The search process starts by finding the scope of the search; given the scope
range, it first finds the parent start and end elements, and all elements between them; these element comprise the search tree. It then extracts the text content of all these elements, and truncates everything outside the start and end character positions, using the algorithm defined for the textContent
attribute in [[!DOM4]]; this is referred to as the search content.
In the search content, target text
, and prefix and suffix text
, all multiple consecutive whitespace characters, including newline characters, are collapsed into single space characters. If the caseFolding
attribute is set to true, all characters are converted into their lowercase equivalents (if any). If the unicodeFolding
attribute is set to true, all characters are converted into their decomposed equivalents (if any), per the [[!charmod-norm]] spec. If the ignorePunctuation
attribute is set to true, all punctuation characters are removed. (How should wholeWord
be modeled?)
Once the search content, target text
, and prefix and suffix text
are normalized according to the required parameters, the search operation is begun. The starting point of the search is the character position of the startRange
attribute; if this character position lies outside the scoped range, then the starting point of the search is the 0 position of the search content string.
The target text
is then incrementally compared with the search content, within the tolerance of the specified edit distance. Each such operation is performed at the current index character, starting with the first character and proceeding to each subsequent character. If there is a match, the prefix and suffix strings are also tested against the surrounding text and within their respective specified edit distance tolerances. If this also matches, then the process stops and the return results are prepared. If no appropriate match is found, the index character is incremented by a single character until a match is found, or until the end of the search content is reached if the wrap
attribute is set to false. If the wrap
attribute is set to true, and the start position was not the 0 position of the search content string, the search only concludes when the start position has been reached.
The return results are prepared by first reconciling the matched string with its equivalent DOM range in the document. This range object is created, and bundled in an object with the original search parameters. This return object is in the form of a Promise, so the results can be analyzed and subsquent search operations can be performed by incrementing the start range position based on the matched range (if any). By performing this search multiple times, an application can incrementally find all matching ranges in the document.
The steps to perform a search operation to a
document using RangeFinder are as follows:
- If the user agent does not support searching the document using RangeFinder,
return a Promise rejected with a
DOMException
whose name is
NotSupportedError
and abort these steps.
- The following sub-steps MAY be run asynchronously for performance
reasons, for example, if the user agent has browsing contexts living in different
processes:
- Let browsing contexts be the list of the
descendant browsing contexts of the top-level browsing
context's document.
- If one of the browsing contexts's
document's pending promise is not
null
:
- Let doc be the document which has a not
null
pending promise.
- Reject doc's pending promise with
DOMException
whose name is
AbortError
.
- Set doc's pending promise to
null
.
-
Finish the algorithm.
Define the EditDistance algorithm.
Acknowledgements
Special thanks to Cameron McCormack (Mozilla), Kristof Csillag (Hypothes.is), Randall Leeds (Hypothes.is), Dan Whaley (Hypothes.is), Mitar Milutinovic (UC-Berkeley), Jake Hartnell (UC-Berkeley, EPUB.js), Nick Stenning (Hypothes.is), Benjamin Young (Hypothes.is), Mike de Boer (Mozilla), Travis Leithead (Microsoft), Ryosuke Niwa (Apple), Josh Soref (Blackberry), Rob Sanderson (Stanford), Paolo Ciccarese (Mass General Hospital / Harvard), Takeshi Kanai (Sony), Ivan Herman (W3C), Robin Berjon (W3C), Mike™ Smith (W3C), Mounir Lamouri (Google) and Marcos Cáceres (Mozilla) for writing a spec I could plunder mechanics, Peter Brantley (for putting on the Books in Browsers conference), and the entire Web Annotation Working Group and Open Annotations Community Group for discussion and insight. If I've forgotten anyone, please remind me; I'm truly grateful for all the discussions I've had on this topic.