The FindText API specification describes an API for finding ranges of text in a document or part of a document, using a variety of selection criteria.

This specification is a strawman, intended to collect feedback. It is not stable, and radical ideas for improvement are welcome.

The API described in this document is still experimental.

Feedback on this specification can be made via annotations. All annotations are archived on the public-annotation@w3.org mailing list (archives), and on the searchable WebPlatform Notes stream; see more details on Notes.WebPlatform.org. Note: To create annotations, you must have a WebPlatform.org account; do not use your W3C credentials.

Introduction

Finding a text selection is a basic operation of a Web browser, but the functionality is not exposed to Web developers. The FindText API provides a way for developers to find text selections, with several options to refine more targeted and flexible searches, including fuzzy matching, in a way that works well with a variety of human languages.

There are many different use cases for a custom mechanism for finding and navigating to a particular range of text in a Web document, such as [[!HTML5]] or [[!SVG]]. A developer may wish to design a webapp in which only the text in the active editing subtreee, and not the controls or other application text, is returned as a result of a search, overriding the native find-text functionality. A user may wish to highlight a particular passage, and link to it, even if that selection doesn't have a id attribute to serve as a fragment identifier. A designer may wish to find and style a range of text without changing the markup. A developer may wish to provide a particular text-matching algorithm for searching their specific type of content, to return more meaningful results (such as fuzzy matching or term substitution). A CMS developer may wish to allow advanced review and commenting on text selections for a collaborative document. All of these use cases share the a fundamental requirement for a API to find text.

There are many challenges to finding text in a modern Web document. The content may have changed; the change could be a document restructuring, the addition of content before the target selection, or even a change to the target string itself. In dynamic documents, such as real-time collaborative editing webapps, the content may change during the same session as the selection; in linking or annotation scenarios, the state is lost between sessions, so other search and anchoring strategies must be used. Highly repetitive content, such as poetry or lyrics, may duplicate the same passage several times, making it difficult to indicate a specific instance of that passage. Large documents may have many possible matches, such that search-ahead mechanism may be computationally expensive. In dynamic lazy-loaded (or “infinite-scroll”) content, the text being searched for may not have been loaded into the DOM yet.

There are also quirks in some native browser find-text mechanisms that could be improved through this API. For example, most current browsers do not let the user search for text across element boundaries (such as a string broken across different paragraphs or list items), which may be unintuitive and inconvenient to the end-user. Not all browsers allow the user to specify case-sensitive or whole-word search parameters, and not all browsers indicate the number of returned results. Browsers do not typically match language patterns that may be found in multilingual text, including collapsed Unicode character sequences, optional diacritical marks, or similar features, such as matching o to ó, ö, ø, and oe.

Sometimes, a selection of a document may contain more than text, or no text at all. Such selections include a picture and its caption, the picture alone, a specific area of a picture, a point on a rendered map, or a range of time within timed media such as audio or video. Some of these examples can be fully expressed as a DOM range; others need supplementary characteristics, which may be application-specific, in order to be addressable.

The FindText API is intended to be flexible and performant. To ensure interoperability, this specification describes a default find-text algorithm, but this can be overriden by the developer. For flexibility in searches, this specification defines several parameterized selector methods, which can be applied in different ordered sequences to best match particular kinds of content, via method chaining, which will allow the developer to provide refined matching algorithms to customize the results. For performance reasons, the FindText API can be evoked iteratively, to find the next match in the document, rather than attempting to search the entire document for all matches.

The result of a FindText search is an object that contains the found range, and any additional addressing characteristics. By running a search multiple times and comparing the results against the search criteria, a script can determine the best match for that set of search criteria. A script can also add or subtract selectors based on the results to refine a search.

Should the return object also contain the CSS and/or XPath selector that is the closest match for the selection? Can you get that easily from another Web Platform feature?

This specification defines what is returned, but not the specific algorithms used to find that result, or the speed or optimizations taken. Such optimizations are implementation-specific.

For character counts in ranges, what exactly would be counted as a character? Unicode code points? Graphemes?

Address the issue of multiple concurrent FindText objects executing, especially when they are from different code bases (e.g. a custom find-in-page dialog running at the same time as an annotation client library). Presumably, the implementation could optimize by having them use the same search content?

What's the expectation for updating the search content to reflect the state of the dynamic DOM, while a FindText object is executing?

This specification shares some conceptual similarities to the Selection API. Both deal with collections of ranges; while the Selection API deals with user actions in a live DOM, the FindText API deals with programmatic selection of ranges based on text criteria. The main interrelation between them is that one user might make a selection in a document, which can be serialized as text in such a way that another user could recreate that selection on their own instance of the document (e.g. you can store and share selections across devices). The use cases are otherwise very different.

Dependencies

The following concepts and interfaces are defined in [[!HTML5]]:

Promise objects are defined in [[!ECMASCRIPT]].

The following concepts and interfaces are defined in [[!DOM4]]:

The following conformance classes are defined by this specification:

conforming implementation

A user agent is considered to be a conforming implementation if it satisfies all of the MUST-, REQUIRED- and SHALL-level criteria in this specification that apply to implementations.

User agents that use ECMAScript to implement the APIs defined in this specification must implement them in a manner consistent with the ECMAScript Bindings defined in the Web IDL specification [[!WEBIDL]] as this specification uses that specification and terminology.

Terminology

edit distance

The edit distance tolerance is the number of operations between the supplied target text attribute string and the candidate text content of the document. The edit distance is measured as the number of operations, of the set delete, insert, replace, and transpose, required to transform one string into another.

Decide which algorithm to use to calculate this value: Levenshtein; Damerau–Levenshtein; Hamming; Jaro–Winkler (unlikely); or some other edit distance algorithm.

Should this edit distance apply to the values of the prefix and suffix attributes as well?

Expose edit-distance calculator as low-level API?

What about "close key" calculations (keyboard-specific)

search tree
The search tree is the subset of the document that the FindText object is scoped to search in.
search content
The search content is the normalized element text content from the search tree.
search cursor
The search cursor is the character position of the search operation in the search content.

Should search cursor be exposed as a property of the FindText interface? That might allow more fine-grained control of the interface, and provide a way to detect the current search progress state (which might mean also exposing the total character length of the search content).

Would this be clearer if it were named search index?

search operation
The search operation is the current execution of the search method of the FindText interface.
whitespace character

A whitespace character is any character with the Unicode character property "WSpace=Y", including the following codepoints: U+0009 (character tabulation); U+000A (line feed); U+000B (line tabulation); U+000C (form feed); U+000D (carriage return); U+0020 (space); U+0085 (next line); U+00A0 (no-break space); U+1680 (ogham space mark); U+2000 (en quad); U+2001 (em quad); U+2002 (en space); U+2003 (em space); U+2004 (three-per-em space); U+2005 (four-per-em space); U+2006 (six-per-em space); U+2007 (figure space); U+2008 (punctuation space); U+2009 (thin space); U+200A (hair space); U+2028 (line separator); U+2029 (paragraph separator); U+202F (narrow no-break space); U+205F (medium mathematical space); U+3000 (ideographic space)..

The FindText API

The FindText API enables a client-side script to incrementally search the DOM of the current document for matching strings, elements, or other DOM structures, based on criteria set with dedicated methods.

The FindText Interface

This interface represents an object which allows for the discovery of arbitrary selections of a document based on criteria set out in its attributes. The constructor takes an optional property object to set the selecotr attributes. The results of the FindText object's search are obtained through iterative execution of the search() method.

attribute DOMString text

A text string to be searched for. Note: This string is equivalent to the textContent of a DOM node.

The default value of the text attribute MUST be the empty string.

attribute unsigned short textDistance

The edit distance tolerance for the text attribute string.

The default value of the textDistance attribute MUST be 0.

If the value of the text attribute is the empty string, the textDistance attribute MUST use the default value.

attribute DOMString prefix

A text string of arbitrary length (often 32 characters) that precedes the target string.

The default value of the prefix attribute MUST be the empty string.

Include link to study that shows optimal distance for character prefixes?

attribute unsigned short prefixDistance

The edit distance tolerance for the prefix attribute string.

The default value of the prefixDistance attribute MUST be 0.

If the value of the prefix attribute is the empty string, the prefixDistance attribute MUST use the default value.

attribute DOMString suffix

A text string of arbitrary length (often 32 characters) that follows the target string.

The default value of the suffix attribute MUST be the empty string.

Include link to study that shows optimal distance for character suffixes?

attribute unsigned short suffixDistance

The edit distance tolerance for the suffix attribute string.

The default value of the suffixDistance attribute MUST be 0.

If the value of the suffix attribute is the empty string, the suffixDistance attribute MUST use the default value.

attribute Element scope

The DOM range which delineates the bounds of the search operation. The default range MUST be a range comprising the document's body element.

This was originally a range, not an an element. I'm not sure why I did that, because ranges are a PITA to work with if you don't need to… surely there must be a reason I made it a range? Maybe because I wanted to give the developer the power to select non-element start and stop points?

attribute Range startRange

A range object, as the starting point for the selection search.

Do we really need both cursor and startRange? Working with an index would be easier than working with a range, and I'm not sure what having a range gets us.

attribute caseFoldingType caseFolding

A boolean representing whether a search MUST ignore the case of the characters when doing match comparison. The default value MUST be true.

attribute UnicodeEquivalenceType unicodeNormalization

A enumerated value representing whether a search MUST match exact characters, or use a particular Unicode character equivalence. The default value MUST be none.

Canonical Equivalence is described in [[!UTS10]]. If the value of this attribute is true, the User Agent MUST follow the algorthim defined in section 8 Searching and Matching, with collations and contractions formed per section 8.1 Collation Folding, and secondary strength asymmetric search per section 8.2 Asymmetric Search.

We should require that the UA support the Character Model for the World Wide Web: String Matching and Searching [[!charmod-norm]] spec (and urge the I18n WG to complete that spec).

This still needs a lot of work!

Should we allow the author to supply their own tailoring mapping?

This needs to be balanced with other kinds of fuzzy matches, such as case-folding and edit distance.

attribute boolean wholeWord

A boolean representing whether a search MUST require that the search string be surrounded by non-word characters. The default value MUST be false.

Define how wholeWord interacts with edit distance. Probably wholeWord needs an exact match, regardless of edit distance tolerance.

Define how wholeWord works in languages without word breaks. Probably any match there satisfies wholeWord (e.g. it doesn't matter how wholeWord is set).

attribute boolean wrap

A boolean representing whether a search MUST begin again at the beginning (or end, depending on search direction) of the search tree. The default value MUST be false.

attribute unsigned long cursor

An unsigned long indicating the current search index position in the searchContent. The default value MUST be 0.

I think this should be exposed, for extensibility and low-level access, but I'm open to arguments against it.

readonly attribute DOMString searchContent

The stringified and normalized content of the search tree.

I think this should be exposed, for extensibility and low-level access, but I'm open to arguments against it. Note that this would be a really long string, in many cases!

readonly attribute DOMString sourceMap

The mapping between the document and the searchContent.

I think this should be exposed, for extensibility and low-level access, but I'm open to arguments against it.

I have no idea what this would actually look like! Probably JSON (which I don't know how to indicate in WebIDL), but otherwise completely undefined for now. :sadpanda:

Promise≺ResultMatch≻ search(optional boolean forward)

Executes the search, based on the current selection criteria attributes, from the current search cursor position within the document serialization. The return result MUST be a Promise of a ResultMatch object with properties defined as follows:

  • If a match is found for the current selection criteria, the value of the range property MUST be a Range object that contains the target string or DOM structure.
  • If no match is found for the current selection criteria, the value of the range property MUST be null.
Promise≺ResultMatchAll≻ searchAll(optional boolean forward)

Executes the search, based on the current selection criteria attributes, from the current search cursor position within the document serialization. The return result MUST be a Promise of a ResultMatchAll object with properties defined as follows:

  • If a match is found for the current selection criteria, the value of the range property MUST be an array of Range objects that contains the target string or DOM structure.
  • If no match is found for the current selection criteria, the value of the range property MUST be null.
optional boolean forward

This parameter indicates whether the search should proceed in forward (positive) direction from the current search cursor, or in a reverse (negative) direction. The default value MUST be true.

If the value is true, the search MUST continue in a positive direction along the character or DOM serialization, from the start of the document toward the end, beginning at the current search cursor.

If the value is false, the search MUST continue in a negative direction along the character or DOM serialization, from the end of the document toward the start, beginning at the current search cursor.

DOMException error
The error that occurred while searching.
none
The search MUST NOT use any case folding of characters.
ascii
The search MUST map all code points in the range 0x41 to 0x5A (A to Z) to the corresponding code points in the range 0x61 to 0x7A (a to z).

Should we defer to [[!charmod-norm]] for this definition?

unicode
The search MUST map all code points to their Unicode C+F case fold equivalents. Note that this can change the length of the string.

Should we defer to [[!charmod-norm]] for this definition?

language-sensitive
The search MUST map all code points to their case-fold mappings defined in the Unicode Common Locale Data Repository [[!UAX35].

This is defined in [[!charmod-norm]], but should we defer value this to a later version of the spec?

none
The search MUST NOT use any Unicode equivalence of characters.
canonical
The search MUST use canonical (NFC/NFD) Unicode equivalence of characters.
compatibility
The search MUST use compatibility (NFKC/NFKD) Unicode equivalence of characters.

Does compatibility equivalence match digraph forms, as in the German ö to oe? If not, should we somehow include that anyway?

Does compatibility equivalence match elided vowels, as in Hebrew? Or will we have to rely on edit distance? Note that there's only so far we can take this, with abbreviations demonstrating the extreme, such as matching versus and vs or vs.).

all
The search MUST use either canonical (NFC/NFD) or compatibility (NFKC/NFKD) Unicode equivalence of characters, whichever produces the more liberal match.

Does compatibility already include canonical by definition?

Performance issues?

DOMString text
unsigned short textDistance
DOMString prefix
unsigned short prefixDistance
DOMString suffix
unsigned short suffixDistance
Range scope
Range startRange
caseFoldingType caseFolding
UnicodeEquivalenceType unicodeNormalization
boolean wholeWord
boolean wrap
Range? result

A DOM Range object, as defined in [[!DOM4]]. If no match is found, the value MUST be null.

dictionary FindTextArgs

A dictionary object which contains all the parameters used in the search.

FindTextArgs MUST include entries for all available parameters, and MUST include default values for any parameter not explicitly set in the FindText object.

Array result

An array of DOM Range objects, as defined in [[!DOM4]]. If no match is found, the array must be empty.

dictionary FindTextArgs

A dictionary object which contains all the parameters used in the search.

FindTextArgs MUST include entries for all available parameters, and MUST include default values for any parameter not explicitly set in the FindText object.

The EditDistance Interface

The EditDistance API defines a method to calculate the minimum edit distance between any two strings. This method uses the Levenshtein Distance algorithm, with the following operations:

The EditDistance Interface is a lower-level API that acts as a helper API to the FindText API. It may also be used as a standalone method.

Given two strings s1 and s2, the edit distance between s1 and s2 is the minimum number of addition, deletion, and replacement operations required to convert string s1 to s2.

The EditDistance Method

unsigned short findEditDistance (in DOMString targetString, in DOMString comparisonString)

The findEditDistance method calculates the edit distance between the targetString and the comparisonString.

DOMString comparisonString
Specifies the comparison string, which the target string is compared to. The default value MUST be the empty string.
DOMString targetString
Specifies the target string, which is compared to the comparison string. The default value MUST be the empty string.

Open Issues

These are open issues that don't yet have a location in the document.

Explain how to tell when a search is complete, e.g. when all the matches have been found.

How to deal with search matches that are dynamically inserted before the current search cursor position? LEave it up to the webapp?

Should this find "meta" content, such as text in a script, style, body, or CDATA block?

Should this find "hidden" content, such as text that is display:none or visibility:none?

What are the minimum criteria for performing a search? You might usefully look for what's between the prefix and suffix, or what is at a certain range. Should we allow that, or require that the text attribute not be the empty string?

Should we include an option to ignore punctuation? Apparently use of punctuation is widely variable and often changed, so this improves search efficiency.

Concepts

Algorithms defined in this specification assume that for each document there is a pending promise, which is initially set to null, which is a Promise object whose associated operation is to search the document.

When queuing a task, the find text task source is used.

Performing a search operation

A conforming user agent MUST perform a search operation in a manner that produces results consistent with this following algorithm. A conforming user agent MAY optimize the search algorithm in any way that produces these consistent results.

This algorithm is not intended to serve as pseudocode, and would not yeild performant results. The bitap algorithm or other efficient mechanism should be used instead, in a manner consistent with the results of this algorithm.

Should this algorithm take into account transposed words, where the prefix and suffix are different because of the transposition? For example, "The boat was filled the with fishing gear." Or, should this be handled at the application (webapp) level?

Search Algorithm summary

The FindText search() method initiates an incremental text search of that portion of the document set as the scope parameter of the FindText object, starting at its startRange parameter.

The search process starts by finding the scope of the search; given the scope range, it first finds the parent start and end elements, and all elements between them; these element comprise the search tree. It then extracts the text content of all these elements, and truncates everything outside the start and end character positions, using the algorithm defined for the textContent attribute in [[!DOM4]]; this is referred to as the search content.

In the search content, target text, and prefix and suffix text, all multiple consecutive whitespace characters, including newline characters, are collapsed into single space characters. If the caseFolding attribute is set to true, all characters are converted into their lowercase equivalents (if any). If the unicodeNormalization attribute is set to true, all characters are converted into their decomposed equivalents (if any), per the [[!charmod-norm]] spec. If the ignorePunctuation attribute is set to true, all punctuation characters are removed. (How should wholeWord be modeled?)

Once the search content, target text, and prefix and suffix text are normalized according to the required parameters, the search operation is begun. The starting point of the search is the character position of the startRange attribute; if this character position lies outside the scoped range, then the starting point of the search is the 0 position of the search content string.

The target text is then incrementally compared with the search content, within the tolerance of the specified edit distance. Each such operation is performed at the current search cursorx character, starting with the first character and proceeding to each subsequent character. If there is a match, the prefix and suffix strings are also tested against the surrounding text and within their respective specified edit distance tolerances. If this also matches, then the process stops and the return results are prepared. If no appropriate match is found, the cursor character is incremented by a single character until a match is found, or until the end of the search content is reached if the wrap attribute is set to false. If the wrap attribute is set to true, and the start position was not the 0 position of the search content string, the search only concludes when the start position has been reached.

The return results are prepared by first reconciling the matched string with its equivalent DOM range in the document. This range object is created, and bundled in an object with the original search parameters. This return object is in the form of a Promise, so the results can be analyzed and subsquent search operations can be performed by incrementing the start range position based on the matched range (if any). By performing this search multiple times, an application can incrementally find all matching ranges in the document.

Search Algorithm

The steps to perform a search operation on a document using FindText are as follows:

This algorithm is still a rough draft, and may change based on concrete feedback.

  1. If the user agent does not support searching the document using FindText, return a Promise rejected with a DOMException whose name is NotSupportedError and halt these steps.
  2. The following sub-steps MAY be run asynchronously for performance reasons, for example, if the user agent has browsing contexts living in different processes:
    1. Let browsing contexts be the list of the descendant browsing contexts of the top-level browsing context's document.
    2. For each browsing context in browsing contexts, run the following sub-steps:
      1. Let doc be the browsing context's active document.
      2. If doc is not visible per [PAGE-VISIBILITY], halt these steps.
      3. If doc's document's pending promise is not null:
        1. Let doc be the document which has a not null pending promise.
        2. Reject doc's pending promise with DOMException whose name is EndError.
        3. Set doc's pending promise to null.
      4. Let search tree be the active target portion of doc.
      5. Set initial value of search tree to doc.
      6. If scope is not null:
        1. If the value of scope is found within doc, set the value of search tree to the value of scope.
        2. If the value of scope is not found within doc, resolve doc's pending promise with NOT FOUND.

          Define NOT FOUND, or find a better return value.

        This conflates the notion of range and DOM tree. Find a better way to set the search tree to the scope.

      7. Let search content be the result of applying the [[!DOM4]] textContent attribute algorithm algorithm to the search tree.

        Note that this will not allow users to search for CSS-injected content, as currently defined. This is unintuitive, and may not match browser find-in-page behavior.

      8. Let source map be the mapping between each element boundary in doc and the corresponding character position in search content.
      9. Let target text be the discrete values of text, prefix, and suffix.
      10. Collapse all consecutive sequences of any whitespace character into a single U+0020 (space) character, in the search content and target text.
      11. If caseFolding is not null or the value is not none:
        1. If the value of caseFolding is ascii, map all code points in the range 0x41 to 0x5A (A to Z) to the corresponding code points in the range 0x61 to 0x7A (a to z), in the search content and target text.
        2. If the value of caseFolding is unicode, map all code points to their Unicode C+F case fold equivalents, in the search content and target text.

        Should we defer to [[!charmod-norm]] for this part of the algorithm?

      12. Update source map to reflect the new character positions in search content.
      13. If unicodeNormalization is not null or the value is not none:
        1. If the value of unicodeNormalization is canonical, convert all characters in the search content and target text their Unicode equivalents using canonical (NFC/NFD) Unicode equivalence of characters.
        2. If the value of unicodeNormalization is compatibility, convert all characters in the search content and target text their Unicode equivalents using compatibility (NFKC/NFKD) Unicode equivalence of characters.
        3. If the value of unicodeNormalization is all, …

          Is there a good way to handle this?

        Consider exposing unicode normalization operations as a separate method.

      14. Update source map to reflect the new character positions in search content.
      15. Let search cursor be the current character position within search content.
      16. If this is the first initialization of the search execution:
        1. Set search cursor to 0.
        2. If startRange is not null:
          1. If the starting point of startRange is found within search content, set the search cursor to the start of the start range.

        Find a clearer way to explain that the cursor position remains active between executions of the search() method, e.g. it's iterable. We don't want this to be idempotent.

      17. Let search cursor starting position be search cursor.
      18. Let text match length be the number of Unicode codepoints in text.
      19. Let prefix match length be the number of Unicode codepoints in prefix.
      20. Let suffix match length be the number of Unicode codepoints in suffix.
      21. Increment search cursor by 1 character until the search cursor is positioned on the last character in search content; for each increment:
        1. If wrap is true and search cursor is search cursor starting position:
          1. If execution method is searchAll, resolve doc's pending promise with result.
        2. Halt this algorithm.
        3. If execution method is searchAll:
          1. Resolve doc's pending promise with result.
          2. Halt this algorithm.
        4. Let candidate text subset be the string consisting of the number of Unicode codepoints in search content equal to text match length plus the value of textDistance, starting at the current search cursor.
        5. Apply text match algorithm:
          1. Set target text to text.
          2. Set candidate text to candidate text subset.
          3. If text and candidate text subset are a match:
            1. Let matching candidate text be candidate text subset.
            2. Let candidate text subset be the string consisting of the number of Unicode codepoints in search content equal to prefix match length plus the value of prefixDistance, starting at the current search cursor minus the prefix match length.
            3. Apply text match algorithm:
              1. Set target text to prefix.
              2. Set candidate text to candidate text subset.
              3. If prefix and candidate text subset are a match:
                1. Let candidate text subset be the string consisting of the number of Unicode codepoints in search content equal to suffix match length plus the value of prefixDistance, starting at the current search cursor minus the suffix match length.
                2. Apply text match algorithm:
                  1. Set target text to suffix.
                  2. Set candidate text to candidate text subset.
                  3. If suffix and candidate text subset are a match:
                    1. If execution method is search, let result be dom range object.
                    2. If execution method is searchAll, let result be an array of dom range objects.
                    3. Let range be a dom range object in result.
                    4. Set range to a range in doc corresponding to matching candidate text, using source map.

                      Probably needs more detail. :P

                    5. If execution method is search:
                      1. Resolve doc's pending promise with result.

                        Should return cursor position also.

                      2. Halt this algorithm.
          4. Resolve doc's pending promise with MATCH NOT FOUND.

            Define MATCH NOT FOUND, or find a better return value.

          5. If wrap is true and search cursor starting position is not 0, set search cursor to 0. Otherwise:
            1. If execution method is searchAll, resolve doc's pending promise with result.
            2. Halt this algorithm.
    3. Improve this algorithm!

    Text Match Algorithm

    The steps to perform a text match operation on a two strings using FindText are as follows:

    1. Let target text be the string to match.
    2. Let target text match length be the number of Unicode codepoints in target text.
    3. Let edit distance be the character count variance.
    4. Let candidate text be the string to compare to target text.
    5. Let match text be the empty string.
    6. If candidate text and text contain an identical sequence of codepoints for the length of target text match length:
      1. Set match text to candidate text.
      2. Return match text and halt this algorithm.
    7. If wholeWord is true, halt this algorithm.
    8. If candidate text and text match in a comparison using the Calculating Edit Distance Algorithm:
      1. Set match text to candidate text.
      2. Return match text and halt this algorithm.

Calculating edit distance

A conforming user agent MUST calculate the edit distance of 2 strings in a manner that produces results consistent with this following algorithm. A conforming user agent MAY optimize the search algorithm in any way that produces these consistent results.

Calculating Edit Distance Algorithm summary

The edit distance is calculated by first finding and comparing the matching sequence of codepoints in both strings, regardless of intervening extra characters. If the count of the mutual sequence of codepoint matches is within the user-defined tolerance limit (of textDistance, prefixDistance, and suffixDistance, respectively), then the string is considered a match using the change, insertion, or deletion operations.

For example, the words complete and computer have the following mutual sequence of letters: c, o, m, p, t, and e. This pair has an edit distance of 3, because you must perform 3 operations: delete the l, change the e to an u, and add an r. So, a search for complete with an edit distance of 3 would also match computer (as well as more obvious matches like completes, completed, and compete).

Note that in practice, all three of these operations can be combined, and are typically performed simultaneously.

Calculating Edit Distance Algorithm

  1. Define the EditDistance algorithm.

Examples

This examples section is very much in flux, and the syntx may be wrong. The general ideas should be similar, but the syntax may be quite different. There may even be changes to the API based on feedback on the examples.

Poetry, lyrics, and some prose often include significant repetition. A simple find-text operation would return only the first instance of a repetition, while the selection in question might be for the second or third instance.

Example 1

In the following poem, there are four instances of the phrase “Rage, rage”; the desired selection is the third instance. Assuming an HTML page consisting only of the poem, there are several ways to use the FindText API to achieve this, a few of which are demonstrated below. Any or all of these techniques can be applied to ensure robust anchoring.

Do Not Go Gentle Into That Good Night

by Dylan Thomas

Do not go gentle into that good night,
Old age should burn and rave at close of day;
Rage, rage against the dying of the light.

Though wise men at their end know dark is right,
Because their words had forked no lightning they
Do not go gentle into that good night.

Good men, the last wave by, crying how bright
Their frail deeds might have danced in a green bay,
Rage, rage against the dying of the light.

Wild men who caught and sang the sun in flight,
And learn, too late, they grieved it on its way,
Do not go gentle into that good night.

Grave men, near death, who see with blinding sight
Blind eyes could blaze like meteors and be gay,
Rage, rage against the dying of the light.

And you, my father, there on the sad height,
Curse, bless, me now with your fierce tears, I pray.
Do not go gentle into that good night.
Rage, rage against the dying of the light.

Method 1: Search and filter all results


var find = new FindText({ "text": "Rage, rage" });
find.searchAll().then( 
  function( ResultMatchAll ) {          // ResultMatchAll returns an array of all matching instances, as ranges
    var match = ResultMatch.result[2];  // target instance is is 3rd member of result array
  }
);
          

Method 2: Iterative search


function handleMatch( result ) {

  find.search().then( handleMatch );
  // do things with match check if it does what i need
  findText( string, result.position, ... ).search().then( handleMatch );
}
var find = new FindText({ 
                          "text": "Rage, rage", // the selection
                          "cursor": 0 // the 32 characters preceding the selection 
                        });
var resultCount = 0;
handleMatch();
function handleMatch( result ) {
  if (3 > resultCount) {
    var match = find.search().then( handleMatch );
    if ( match.result) {
      resultCount++;
    }
  }
}
          

I know this iterator is wrong, but I don't grok promises well enough yet to fit it, and I need more sleep.

Method 3: Context search with prefix


var find = new FindText({ 
                          "text": "Rage, rage", // the selection
                          "prefix": "blaze like meteors and be gay, " // the 32 characters preceding the selection 
                        });
var match = find.search().result; // result is 1st instance of string after prefix
          

Method 4: Character count


var start_range = document.createRange();
    start_range.setStart(document, 696); // the character count of the normalized searchContent preceding the selection
    start_range.setEnd(document, 11);    // the character count of the selection
var find = new FindText({ "startRange": start_range }); 
var match = find.search().result;            // result is the 11 characters starting at the position 696
          

I don't think the algorithm currently accounts for the "character count" behavior, but you should have a way to make a match based only on character count and offsets in the searchContent, without a target string, for increased robustness (using multiple fallback selection criteria) and flexibility. Ranges only do this in the document, and we might want finer control for searchContent. But I'm not sure if this is the right way to do it. When scope was a range, that might have been another way.

Method 5: Edit distance and scope selector


var scope = document.querySelector("p:nth-of-type(5)"); // the 5th paragraph (the one containing the selection)  
var find = new FindText({ 
                          "scope": scope,
                          "text": "Range, range", // the selection, misspelled
                          "editDistance": 2; // the edit distance between "Rage, rage" and "Range, range"
                        });
var result = find.search().result; // result is 1st instance of string within scope element
          

Method 6: Element scope

The main difference here between the scope attribute (used in Method 5) and the startRange attribute is that startRange sets the starting point for the search, while scope limits the scope. In other words, running search() again on this example will find the next instance in the next paragraph, while in the previous example, it will only find the single instance in the target paragraph.


var start = document.querySelector("p:nth-of-type(5)"); // the 5th paragraph (the one containing the selection)  
var start_range = document.createRange();
    start_range.setStart(scope, 0);
    start_range.setEnd(scope, 1);    // the character count of the selection
var find = new FindText({ 
                          "startRange": start_range,
                          "text": "Rage, rage"
                        });
var result = find.search().result; // result is 1st instance of string, starting at location query selector
          

Actually, this might not work, because the range is on the document, while it's being applied to the normalized searchContent. Maybe there should be an intervening step.

Notes

Acknowledgements

Special thanks to Cameron McCormack (Mozilla); Kristof Csillag (Hypothes.is); Randall Leeds (Hypothes.is); Dan Whaley (Hypothes.is); Mitar Milutinovic (UC-Berkeley); Jake Hartnell (UC-Berkeley, EPUB.js); Nick Stenning (Hypothes.is); Benjamin Young (Hypothes.is); Mike de Boer (Mozilla); Travis Leithead (Microsoft); Ryosuke Niwa (Apple); Josh Soref (Blackberry); Rob Sanderson (Stanford); Paolo Ciccarese (Mass General Hospital / Harvard); Takeshi Kanai (Sony); Ivan Herman (W3C); Robin Berjon (W3C); Mike™ Smith (W3C); Addison Phillips (Amazon), Richard Ishida (W3C), and the rest of the I18n WG for internationalization insight; Alex Schmidtz (jQuery), Bill Hunt (OpenGov Foundation), and Chris Birk (OpenGov Foundation) for expertise on Promises; Mounir Lamouri (Google) and Marcos Cáceres (Mozilla) for writing a spec I could plunder mechanics; Peter Brantley (for putting on the Books in Browsers conference); and the entire Web Annotation Working Group and Open Annotations Community Group for discussion and insight. If I've forgotten anyone, please remind me; I'm truly grateful for all the discussions I've had on this topic.