
String Sampling with Bidirectional String Anchors

Grigorios Loukides, Department of Informatics, King's College London, London, UK
[email protected]
Solon P. Pissis, CWI, Amsterdam, The Netherlands and Vrije Universiteit, Amsterdam, The Netherlands, [solon.pissis,michelle.sweering]@cwi.nl
Michelle Sweering, CWI, Amsterdam, The Netherlands, [solon.pissis,michelle.sweering]@cwi.nl
Abstract

The minimizers sampling mechanism is a popular mechanism for string sampling introduced independently by Schleimer et al. [SIGMOD 2003] and by Roberts et al. [Bioinf. 2004]. Given two positive integers $w$ and $k$, it selects the lexicographically smallest length-$k$ substring in every fragment of $w$ consecutive length-$k$ substrings (in every sliding window of length $w+k-1$). Minimizers samples are approximately uniform, locally consistent, and computable in linear time. Although they do not have good worst-case guarantees on their size, they are often small in practice. They have thus been successfully employed in several string processing applications. Two main disadvantages of minimizers sampling mechanisms are: first, they do not have good guarantees on the expected size of their samples for every combination of $w$ and $k$; and, second, indexes that are constructed over their samples do not have good worst-case guarantees for on-line pattern searches.

To alleviate these disadvantages, we introduce bidirectional string anchors (bd-anchors), a new string sampling mechanism. Given a positive integer $\ell$, our mechanism selects the lexicographically smallest rotation in every length-$\ell$ fragment (in every sliding window of length $\ell$). We show that bd-anchors samples are also approximately uniform, locally consistent, and computable in linear time. In addition, our experiments using several datasets demonstrate that the bd-anchors sample sizes decrease proportionally to $\ell$; and that these sizes are competitive to or smaller than the minimizers sample sizes using the analogous sampling parameters. We provide theoretical justification for these results by analyzing the expected size of bd-anchors samples. As a negative result, we show that computing a total order $\leq$ on the input alphabet, which minimizes the bd-anchors sample size, is NP-hard.

We also show that by using any bd-anchors sample, we can construct, in near-linear time, an index which requires linear (extra) space in the size of the sample and answers on-line pattern searches in near-optimal time. We further show, using several datasets, that a simple implementation of our index is consistently faster for on-line pattern searches than an analogous implementation of a minimizers-based index [Grabowski and Raniszewski, Softw. Pract. Exp. 2017].

Finally, we highlight the applicability of bd-anchors by developing an efficient and effective heuristic for top-$K$ similarity search under edit distance. We show, using synthetic datasets, that our heuristic is more accurate and more than one order of magnitude faster in top-$K$ similarity searches than the state-of-the-art tool for the same purpose [Zhang and Zhang, KDD 2020].

1 Introduction

The notion of minimizers, introduced independently by Schleimer et al. [60] and by Roberts et al. [58], is a mechanism to sample a set of positions over an input string. The goal of this sampling mechanism is, given a string $T$ of length $n$ over an alphabet $\Sigma$ of size $\sigma$, to simultaneously satisfy the following properties:

Property 1 (approximately uniform sampling):

Every sufficiently long fragment of $T$ has a representative position sampled by the mechanism.

Property 2 (local consistency):

Exact matches between sufficiently long fragments of $T$ are preserved unconditionally by having the same (relative) representative positions sampled by the mechanism.

In most practical scenarios, sampling the smallest number of positions is desirable, as long as Properties 1 and 2 are satisfied. This is because it leads to small data structures or fewer computations. Indeed, the minimizers sampling mechanism satisfies the property of approximately uniform sampling: given two positive integers $w$ and $k$, it selects at least one length-$k$ substring in every fragment of $w$ consecutive length-$k$ substrings (Property 1). Specifically, this is achieved by selecting the starting positions of the smallest length-$k$ substrings in every $(w+k-1)$-long fragment, where smallest is defined by a choice of a total order on the universe of length-$k$ strings. These positions are called the “minimizers”. Thus from similar fragments, similar length-$k$ substrings are sampled (Property 2). In particular, if two strings have a fragment of length $w+k-1$ in common, then they have at least one minimizer corresponding to the same length-$k$ substring. Let us denote by $\mathcal{M}_{w,k}(T)$ the set of minimizers of string $T$. The following example illustrates the sampling.

Example 1.

The set $\mathcal{M}_{w,k}$ of minimizers for $w=k=3$ for string $T=\texttt{aabaaabcbda}$ (using a 1-based index) is $\mathcal{M}_{3,3}(T)=\{1,4,5,6,7\}$ and for string $Q=\texttt{abaaa}$ is $\mathcal{M}_{3,3}(Q)=\{3\}$. Indeed $Q$ occurs at position $2$ in $T$; and $Q$ and $T[2..6]$ have the minimizers $3$ and $4$, respectively, which both correspond to the string aaa of length $k=3$.
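For concreteness, the following brute-force sketch (in C++, the language of our implementations) computes $\mathcal{M}_{w,k}(T)$ exactly as defined above; the function name minimizers is ours and is used here for illustration only, and the sketch is quadratic rather than the linear-time construction of Theorem 1.

    #include <cstddef>
    #include <iostream>
    #include <set>
    #include <string>

    // Brute-force (w,k)-minimizers: for every window T[i..i+w+k-2] (0-based i here),
    // record every position in [i, i+w) where a lexicographically smallest
    // length-k substring of the window starts; positions are reported 1-based.
    std::set<std::size_t> minimizers(const std::string& T, std::size_t w, std::size_t k) {
        std::set<std::size_t> M;
        if (T.size() < w + k - 1) return M;
        for (std::size_t i = 0; i + w + k - 1 <= T.size(); ++i) {
            std::string best = T.substr(i, k);
            for (std::size_t j = i + 1; j < i + w; ++j)
                best = std::min(best, T.substr(j, k));
            for (std::size_t j = i; j < i + w; ++j)
                if (T.compare(j, k, best) == 0) M.insert(j + 1);
        }
        return M;
    }

    int main() {
        for (std::size_t p : minimizers("aabaaabcbda", 3, 3)) std::cout << p << ' ';
        std::cout << '\n'; // prints: 1 4 5 6 7 (cf. Example 1)
    }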

The minimizers sampling mechanism is very versatile, and it has been employed in various ways in many different applications [47, 67, 19, 10, 31, 35, 36, 48, 37]. Since its inception, the minimizers sampling mechanism has undergone numerous theoretical and practical improvements [56, 10, 54, 53, 15, 22, 72, 37, 74], with a particular focus on minimizing the size of the residual sample; see Section 6 for a summary on this line of research. Although minimizers have been extensively and successfully used, especially in bioinformatics, we observe several inherent problems with setting the parameters $w$ and $k$. In particular, although the notion of length-$k$ substrings (known as $k$-mers or $k$-grams) is a widely-used string processing tool, we argue that, in the context of minimizers, it may be causing many more problems than it solves: it is not clear to us why one should use an extra sampling parameter $k$ to effectively characterize a fragment of length $\ell=w+k-1$ of $T$. In what follows, we describe some problems that may arise when setting the parameters $w$ and $k$.

Indexing:

The most widely-used approach is to index the selected minimizers using a hash table. The key is the selected length-$k$ substring and the value is the list of positions at which it occurs. If one would like to use length-$k^{\prime}$ substrings for the minimizers, with $\ell=w+k-1=w^{\prime}+k^{\prime}-1$, for some $w^{\prime}\neq w$ and $k^{\prime}\neq k$, they should compute the new set $\mathcal{M}_{w^{\prime},k^{\prime}}(T)$ of minimizers and construct their new index based on $\mathcal{M}_{w^{\prime},k^{\prime}}$ from scratch.

Querying:

To the best of our knowledge, no index based on minimizers can return in optimal or near-optimal time all occurrences of a pattern $Q$ of length $|Q|\geq\ell=w+k-1$ in $T$.

Sample Size:

If one would like to minimize the number of selected minimizers, they should consider different total orders on the universe of length-$k$ strings, which may complicate practical implementations, often scaling only up to a small $k$ value, e.g., $k=16$ [22]. On the other hand, when $k$ is fixed and $w$ increases, the length-$k$ substrings in a fragment become increasingly decoupled from each other, and this regardless of the total order we may choose. Unfortunately, this interplay phenomenon is inherent to minimizers. It is known that $k\geq\log_{\sigma}(w)+c$, for a fixed constant $c$, is a necessary condition for the existence of minimizers samples with expected size in $\mathcal{O}(n/w)$ [72]; see Section 6.

We propose the notion of bidirectional string anchors (bd-anchors) to alleviate these disadvantages. Bd-anchors form a mechanism that drops the sampling parameter $k$ and its corresponding disadvantages. We only fix a parameter $\ell$, which can be viewed as the length $w+k-1$ of the fragments in the minimizers sampling mechanism. The bd-anchor of a string $X$ of length $\ell$ is the lexicographically smallest rotation (cyclic shift) of $X$. We unambiguously characterize this rotation by its leftmost starting position in string $XX$. The set $\mathcal{A}_{\ell}(T)$ of the order-$\ell$ bd-anchors of string $T$ is the set of bd-anchors of all length-$\ell$ fragments of $T$. It can be readily verified that bd-anchors satisfy Properties 1 and 2.

Example 2.

The set $\mathcal{A}_{\ell}(T)$ of bd-anchors for $\ell=5$ for string $T=\texttt{aabaaabcbda}$ (using a 1-based index) is $\mathcal{A}_{5}(T)=\{4,5,6,11\}$ and for string $Q=\texttt{abaaa}$ it is $\mathcal{A}_{5}(Q)=\{3\}$. Indeed $Q$ occurs at position $2$ in $T$; and $Q$ and $T[2..6]$ have the bd-anchors $3$ and $4$, respectively, which both correspond to the rotation aaaab.
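Analogously, a brute-force sketch of the bd-anchors sampling: for each window it finds the leftmost lexicographically minimal rotation by comparing all rotations explicitly. The function name bd_anchors is ours; a linear-time per-window computation (Booth's algorithm) is discussed in Section 3.

    #include <cstddef>
    #include <iostream>
    #include <set>
    #include <string>

    // Brute-force order-ell bd-anchors: for every window X = T[i..i+ell-1], find the
    // leftmost lexicographically minimal rotation of X (by comparing all ell rotations
    // inside XX) and report its 1-based starting position in T.
    std::set<std::size_t> bd_anchors(const std::string& T, std::size_t ell) {
        std::set<std::size_t> A;
        for (std::size_t i = 0; i + ell <= T.size(); ++i) {
            std::string X = T.substr(i, ell), XX = X + X;
            std::size_t best = 0;
            for (std::size_t j = 1; j < ell; ++j)
                if (XX.compare(j, ell, XX, best, ell) < 0) best = j; // strict '<' keeps the leftmost
            A.insert(i + best + 1);
        }
        return A;
    }

    int main() {
        for (std::size_t p : bd_anchors("aabaaabcbda", 5)) std::cout << p << ' ';
        std::cout << '\n'; // prints: 4 5 6 11 (cf. Example 2)
    }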

Let us remark that string synchronizing sets, introduced by Kempa and Kociumaka [43], are another string sampling mechanism which may be employed to resolve the disadvantages of minimizers. Yet, it appears to be too complicated to be efficient in practice. For instance, in [20], the authors used a simplified and specific definition of string synchronizing sets to design a space-efficient data structure for answering longest common extension queries.

We consider the word RAM model of computation with $w$-bit machine words, where $w=\Omega(\log n)$, for stating our results. We also assume throughout that string $T$ is over alphabet $\Sigma=\{1,2,\ldots,n^{\mathcal{O}(1)}\}$, which captures virtually any real-world scenario. We measure space in terms of $w$-bit machine words. We make the following three specific contributions:

  1.

    In Section 3 we show that the set $\mathcal{A}_{\ell}(T)$, for any $\ell>0$ and any $T$ of length $n$, can be constructed in $\mathcal{O}(n)$ time. We generalize this result showing that, for any constant $\epsilon\in(0,1]$, $\mathcal{A}_{\ell}(T)$ can be constructed in $\mathcal{O}(n+n^{1-\epsilon}\ell)$ time using $\mathcal{O}(n^{\epsilon}+\ell+|\mathcal{A}_{\ell}|)$ space. Furthermore, we show that the expected size of $\mathcal{A}_{\ell}$ for strings of length $n$, randomly generated by a memoryless source with identical letter probabilities, is in $\mathcal{O}(n/\ell)$, for any integer $\ell>0$. The latter is in contrast to minimizers, which achieve the expected bound of $\mathcal{O}(n/w)$ only when $k\geq\log_{\sigma}w+c$, for some constant $c$ [72]. We then show, using five real datasets, that indeed the size of $\mathcal{A}_{\ell}$ decreases proportionally to $\ell$; that it is competitive to or smaller than $\mathcal{M}_{w,k}$, when $\ell=w+k-1$; and that it is much smaller than $\mathcal{M}_{w,k}$ for small $w$ values, which is practically important, as widely-used aligners that are based on minimizers will require less space and computation time if bd-anchors are used instead. Finally, we show a negative result using a reduction from minimum feedback arc set: computing a total order $\leq$ on $\Sigma$ which minimizes $|\mathcal{A}_{\ell}(T)|$ is NP-hard.

  2.

    In Section 4 we show an index based on $\mathcal{A}_{\ell}(T)$, for any string $T$ of length $n$ and any integer $\ell>0$, which answers on-line pattern searches in near-optimal time. In particular, for any constant $\epsilon>0$, we show that our index supports the following space/query-time trade-offs:

    • it occupies $\mathcal{O}(|\mathcal{A}_{\ell}(T)|)$ extra space and reports all $k$ occurrences of any pattern $Q$ of length $|Q|\geq\ell$ given on-line in $\mathcal{O}(|Q|+(k+1)\log^{\epsilon}(|\mathcal{A}_{\ell}(T)|))$ time; or

    • it occupies $\mathcal{O}(|\mathcal{A}_{\ell}(T)|\log^{\epsilon}(|\mathcal{A}_{\ell}(T)|))$ extra space and reports all $k$ occurrences of any pattern $Q$ of length $|Q|\geq\ell$ given on-line in $\mathcal{O}(|Q|+\log\log(|\mathcal{A}_{\ell}(T)|)+k)$ time.

    We also show that our index can be constructed in $\mathcal{O}(n+|\mathcal{A}_{\ell}(T)|\sqrt{\log(|\mathcal{A}_{\ell}(T)|)})$ time. We then show, using five real datasets, that a simple implementation of our index is consistently faster in on-line pattern searches than an analogous implementation of the minimizers-based index proposed by Grabowski and Raniszewski in [31].

  3.

    In Section 5 we highlight the applicability of bd-anchors by developing an efficient and effective heuristic for top-$K$ similarity search under edit distance. This is a fundamental and extensively studied problem [38, 9, 11, 46, 68, 71, 57, 65, 64, 17, 34, 69, 70] with applications in areas including bioinformatics, databases, data mining, and information retrieval. We show, using synthetic datasets, that our heuristic, which is based on the bd-anchors index, is more accurate and more than one order of magnitude faster in top-$K$ similarity searches than the state-of-the-art tool proposed by Zhang and Zhang in [70].

In Section 2, we provide some preliminaries; and in Section 6 we discuss works related to minimizers. Let us stress that, although other works may be related to our contributions, we focus on comparing to minimizers because they are extensively used in applications. The source code of our implementations is available at https://github.com/solonas13/bd-anchors.

A preliminary version of this paper appeared as [49].

2 Preliminaries

We start with some basic definitions and notation following [13]. An alphabet $\Sigma$ is a finite nonempty set of elements called letters. A string $X=X[1]\ldots X[n]$ is a sequence of length $|X|=n$ of letters from $\Sigma$. The empty string, denoted by $\varepsilon$, is the string of length $0$. The fragment $X[i..j]$ of $X$ is an occurrence of the underlying substring $S=X[i]\ldots X[j]$. We also say that $S$ occurs at position $i$ in $X$. A prefix of $X$ is a fragment of $X$ of the form $X[1..j]$ and a suffix of $X$ is a fragment of $X$ of the form $X[i..n]$. The set of all strings over $\Sigma$ (including $\varepsilon$) is denoted by $\Sigma^{*}$. The set of all length-$k$ strings over $\Sigma$ is denoted by $\Sigma^{k}$. Given two strings $X$ and $Y$, the edit distance $d_{\mathrm{E}}(X,Y)$ is the minimum number of edit operations (letter insertion, deletion, or substitution) transforming one string into the other.

Let $M$ be a finite nonempty set of strings over $\Sigma$ of total length $m$. We define the trie of $M$, denoted by $\textsf{TR}(M)$, as a deterministic finite automaton that recognizes $M$. Its set of states (nodes) is the set of prefixes of the elements of $M$; the initial state (root node) is $\varepsilon$; the set of terminal states (leaf nodes) is $M$; and edges are of the form $(u,\alpha,u\alpha)$, where $u$ and $u\alpha$ are nodes and $\alpha\in\Sigma$. The size of $\textsf{TR}(M)$ is thus $\mathcal{O}(m)$. The compacted trie of $M$, denoted by $\textsf{CT}(M)$, contains the root node, the branching nodes, and the leaf nodes of $\textsf{TR}(M)$. The term compacted refers to the fact that $\textsf{CT}(M)$ reduces the number of nodes by replacing each maximal branchless path segment with a single edge, and that it uses a fragment of a string $s\in M$ to represent the label of this edge in $\mathcal{O}(1)$ machine words. The size of $\textsf{CT}(M)$ is thus $\mathcal{O}(|M|)$. When $M$ is the set of suffixes of a string $Y$, then $\textsf{CT}(M)$ is called the suffix tree of $Y$, and we denote it by $\textsf{ST}(Y)$. The suffix tree of a string of length $n$ over an alphabet $\Sigma=\{1,\ldots,n^{\mathcal{O}(1)}\}$ can be constructed in $\mathcal{O}(n)$ time [23].

Let us fix throughout a string $T=T[1..n]$ of length $|T|=n$ over an ordered alphabet $\Sigma$. Recall that we make the standard assumption of an integer alphabet $\Sigma=\{1,2,\ldots,n^{\mathcal{O}(1)}\}$.

We start by defining the notion of minimizers of $T$ from [58] (the definition in [60] is slightly different). Given an integer $k>0$, an integer $w>0$, and the $i$th length-$(w+k-1)$ fragment $F=T[i..i+w+k-2]$ of $T$, we define the $(w,k)$-minimizers of $F$ as the positions $j\in[i,i+w)$ where a lexicographically minimal length-$k$ substring of $F$ occurs. The set $\mathcal{M}_{w,k}(T)$ of $(w,k)$-minimizers of $T$ is defined as the set of $(w,k)$-minimizers of $T[i..i+w+k-2]$, for all $i\in[1,n-w-k+2]$. The density of $\mathcal{M}_{w,k}(T)$ is defined as the quantity $|\mathcal{M}_{w,k}(T)|/n$. The following bounds are obtained trivially. The density of any minimizer scheme is at least $1/w$, since at least one $(w,k)$-minimizer is selected in each fragment, and at most $1$, when every $(w,k)$-minimizer is selected.

If we waive the lexicographic order assumption, the set $\mathcal{M}_{w,k}(T)$ can be computed on-line in $\mathcal{O}(n)$ time, provided that we assume a function which gives us an arbitrary rank for each length-$k$ substring in $\Sigma^{k}$ in constant amortized time [37]. This can be implemented, for instance, using a rolling hash function (e.g., Karp-Rabin fingerprints [41]), and the rank (total order) is defined by this function. We also provide here, for completeness, a simple off-line $\mathcal{O}(n)$-time algorithm that uses a lexicographic order.

Theorem 1.

The set $\mathcal{M}_{w,k}(T)$, for any integers $w,k>0$ and any string $T$ of length $n$, can be constructed in $\mathcal{O}(n)$ time.

Proof.

The underlying algorithm has two main steps. In the first step, we construct $\textsf{ST}(T)$, the suffix tree of $T$, in $\mathcal{O}(n)$ time [23]. Using a depth-first search traversal of $\textsf{ST}(T)$, we assign to every position $i$ of $T$ in $[1,n-k+1]$ the lexicographic rank of $T[i..i+k-1]$ among all the length-$k$ strings occurring in $T$. This process clearly takes $\mathcal{O}(n)$ time as $\textsf{ST}(T)$ is an ordered structure; it yields an array $R$ of size $n-k+1$ with lexicographic ranks. In the second step, we apply a folklore algorithm, which computes the minimum elements in a sliding window of size $w$ (cf. [37]) over $R$. The set of reported indices is $\mathcal{M}_{w,k}(T)$. ∎
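The second step of the proof is the standard sliding-window-minimum technique; a minimal sketch is given below, assuming the rank array $R$ has already been computed (here it is simply passed in and, for brevity, only the leftmost minimum of each window is reported; tied positions would also be minimizers).

    #include <cstddef>
    #include <deque>
    #include <iostream>
    #include <vector>

    // Folklore sliding-window minimum over the rank array R: the deque stores
    // positions with strictly increasing ranks; its front is the leftmost
    // minimum of the current window of size w.
    std::vector<std::size_t> window_minima(const std::vector<int>& R, std::size_t w) {
        std::vector<std::size_t> out;
        std::deque<std::size_t> dq;
        for (std::size_t i = 0; i < R.size(); ++i) {
            while (!dq.empty() && R[dq.back()] > R[i]) dq.pop_back(); // strict '>' keeps the leftmost on ties
            dq.push_back(i);
            if (dq.front() + w <= i) dq.pop_front();   // front fell out of window [i-w+1, i]
            if (i + 1 >= w) out.push_back(dq.front()); // report once the first window is complete
        }
        return out;
    }

    int main() {
        // Lexicographic ranks of the length-3 substrings of T = aabaaabcbda (cf. Example 1).
        std::vector<int> R = {2, 3, 5, 1, 2, 4, 6, 8, 7};
        for (std::size_t p : window_minima(R, 3)) std::cout << p + 1 << ' ';
        std::cout << '\n'; // prints: 1 4 4 4 5 6 7, i.e., the set {1,4,5,6,7}
    }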

3 Bidirectional String Anchors

We introduce the notion of bidirectional string anchors (bd-anchors). Given a string $W$, a string $R$ is a rotation (or cyclic shift or conjugate) of $W$ if and only if there exists a decomposition $W=UV$ such that $R=VU$, for a string $U$ and a nonempty string $V$. We often characterize $R$ by its starting position $|U|+1$ in $WW=UVUV$. We use the term rotation interchangeably to refer to string $R$ or to its identifier $|U|+1$.

Definition 1 (Bidirectional anchor).

Given a string $X$ of length $\ell>0$, the bidirectional anchor (bd-anchor) of $X$ is the lexicographically minimal rotation $j\in[1,\ell]$ of $X$ with minimal $j$. The set of order-$\ell$ bd-anchors of a string $T$ of length $n>\ell$, for some integer $\ell>0$, is defined as the set $\mathcal{A}_{\ell}(T)$ of bd-anchors of $T[i..i+\ell-1]$, for all $i\in[1,n-\ell+1]$.

The density of $\mathcal{A}_{\ell}(T)$ is defined as the quantity $|\mathcal{A}_{\ell}(T)|/n$. It can be readily verified that the bd-anchors sampling mechanism satisfies Properties 1 (approximately uniform sampling) and 2 (local consistency).

Example 3.

Let $\ell=5$, $T=\texttt{aabaaabcbda}$, and $T^{\prime}=\texttt{aacaaaccbda}$. Strings $T$ and $T^{\prime}$ (which are at Hamming distance $2$) have the same set of bd-anchors of order $5$: $\mathcal{A}_{5}(T)=\mathcal{A}_{5}(T^{\prime})=\{4,5,6,11\}$. The reader can probably share the intuition that the bd-anchors sampling mechanism is suitable for sequence comparison due to Properties 1 and 2, in particular, when the parameter $\ell$ is set accordingly.

Linear-Time Construction of $\mathcal{A}_{\ell}$.

Importantly, we show that $\mathcal{A}_{\ell}$ admits an efficient construction. One can use the linear-time algorithm by Booth [6] to compute the lexicographically minimal rotation of each length-$\ell$ fragment of $T$, resulting in an $\mathcal{O}(n\ell)$-time algorithm, which is reasonably fast for modest $\ell$. (Booth's algorithm gives the leftmost minimal rotation by construction.) We instead give an optimal $\mathcal{O}(n)$-time algorithm for the construction of $\mathcal{A}_{\ell}$, which is mostly of theoretical interest.
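For reference, here is a compact linear-time least-rotation routine. It is the folklore two-pointer ("minimum expression") variant rather than Booth's original failure-function formulation, but it likewise returns the leftmost minimal rotation in $\mathcal{O}(|X|)$ time; running it on every window yields the $\mathcal{O}(n\ell)$-time construction mentioned above.

    #include <algorithm>
    #include <cstddef>
    #include <iostream>
    #include <string>

    // Returns the 0-based starting position of the leftmost lexicographically
    // minimal rotation of X. Two candidate start positions i and j are compared
    // character by character; the losing candidate and the k positions following
    // it cannot start a minimal rotation and are skipped.
    std::size_t least_rotation(const std::string& X) {
        const std::size_t n = X.size();
        std::size_t i = 0, j = 1, k = 0;
        while (i < n && j < n && k < n) {
            char a = X[(i + k) % n], b = X[(j + k) % n];
            if (a == b) { ++k; continue; }
            if (a > b) i += k + 1; else j += k + 1;
            if (i == j) ++j;
            k = 0;
        }
        return std::min(i, j);
    }

    int main() {
        std::cout << least_rotation("abaaa") + 1 << '\n'; // prints 3 (cf. Example 2: rotation aaaab)
    }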

For every string $X$ and every natural number $m$, we define the $m$th power of the string $X$, denoted by $X^{m}$, by $X^{0}=\varepsilon$ and $X^{k}=X^{k-1}X$, for $k=1,2,\ldots,m$. A nonempty string is primitive, if it is not the power of any other string. Let us state two well-known combinatorial lemmas.

Lemma 1 ([13]).

A nonempty string $X$ is primitive if and only if it occurs as a substring in $XX$ only as a prefix and as a suffix.
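Lemma 1 immediately yields a simple primitivity test, sketched below for a nonempty string (is_primitive is our name for this illustrative helper).

    #include <string>

    // Lemma 1: a nonempty string X is primitive iff X occurs in XX only as a
    // prefix and as a suffix, i.e., its first occurrence at a (0-based)
    // position >= 1 inside XX is exactly at position |X|.
    bool is_primitive(const std::string& X) {
        return (X + X).find(X, 1) == X.size();
    }
    // Examples: is_primitive("aabaa") == true; is_primitive("cbacbacba") == false (cf. Example 4).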

Lemma 2 ([61]).

Let $X=UV$ and $R=VU$, for two strings $U,V$. If $X$ is primitive then $R$ is also primitive.

A substring $U$ of a string $X$ is called an infix of $X$ if and only if $U=X[i..j]$ with $i>1$ and $j<|X|$.

Lemma 3.

A string $X$ has more than one minimal lexicographic rotation if and only if $X$ is a power of some string.

Proof.

($\Rightarrow$) Let $X=U_{1}V_{1}$, and let $R=V_{1}U_{1}$ be the leftmost minimal lexicographic rotation of $X$. Suppose towards a contradiction that $X$ has another minimal lexicographic rotation but $X$ is primitive. In particular, there exists $H=V_{2}U_{2}=R$, with $X=U_{2}V_{2}$ and $|U_{1}|<|U_{2}|$. If $X$ is primitive, then $R$ is also primitive by Lemma 2, but then $RR=V_{1}U_{1}V_{1}U_{1}$ has $H$ occurring as an infix. In particular, in $RR$, $V_{2}$ is a suffix of the first occurrence of $V_{1}$ and $U_{2}$ is a prefix of $U_{1}V_{1}$, and thus $H=R$ occurs as an infix. By Lemma 1 we obtain a contradiction.

($\Leftarrow$) Let $X=UU\cdots U$ be a power of some string $U$ with at least two occurrences of $U$, and let $i\in[1,|X|]$ be a minimal lexicographic rotation of $X$. Then either $i+|U|$ or $i-|U|$ is also a minimal lexicographic rotation of $X$. ∎

Example 4 (Illustration of Lemma 3).

Let $X=\texttt{cbacbacba}$, $R=\texttt{acbacba}\cdot\texttt{cb}$ with $V_{1}=\texttt{acbacba}$ and $U_{1}=\texttt{cb}$, and $H=\texttt{acba}\cdot\texttt{cbacb}=R$ with $V_{2}=\texttt{acba}$ and $U_{2}=\texttt{cbacb}$. Observe that $H$ occurs as an infix (underlined) in $RR=\texttt{acb\underline{acbacbacb}acbacb}$, hence $X$ is a power of some string.

Lemma 4.

Let $X$ be a string of length $n$ and set $Y=XX\#$, for some letter $\#$ not occurring in $X$ that is the lexicographically maximal letter occurring in $Y$. Further let $Y[i..2n+1]$ be the lexicographically minimal suffix of $Y$, for some $i\in[1,2n]$. The leftmost lexicographically minimal rotation of $X$ is $i$.

Proof.

First note that $i\in[1,n]$, because $\#$ is the lexicographically maximal letter occurring in $Y$.

We consider two cases: (i) $X$ is primitive; and (ii) $X$ is a power of some string. In the first case, $X$ has exactly one lexicographically minimal rotation by Lemma 3, and thus this rotation is $i$. In the second case, $X$ has more than one lexicographically minimal rotation, but because $X$ is a power of some string and $\#$ is the lexicographically maximal letter occurring in $Y$, $i$ is the leftmost lexicographically minimal rotation of $X$. ∎

We employ the data structure of Kociumaka [44, Theorem 20] to obtain the following result.

Theorem 2.

The set $\mathcal{A}_{\ell}(T)$, for any $\ell>0$ and any $T$ of length $n$, can be constructed in $\mathcal{O}(n)$ time.

Proof.

The data structure of Kociumaka [44, Theorem 20] gives the minimal lexicographic suffix of any concatenation $Y$ of $k$ arbitrary fragments of a string $S$ in $\mathcal{O}(k^{2})$ time, after an $\mathcal{O}(|S|)$-time preprocessing.

We set $S=T\#$, for some letter $\#$ that does not occur in $T$ and is the lexicographically maximal letter occurring in $S$. For each fragment $T[i..i+\ell-1]$, we compute the minimal lexicographic suffix of string

Y=S[i..i+\ell-1]\cdot S[i..i+\ell-1]\cdot S[n+1]=T[i..i+\ell-1]\cdot T[i..i+\ell-1]\cdot\#,

where $k=3$, in $\mathcal{O}(k^{2})=\mathcal{O}(1)$ time. This suffix of $Y$ is the minimal lexicographic rotation by Lemma 4. ∎

Space-Efficient Construction of $\mathcal{A}_{\ell}$.

It should be clear that, in the best case, the size of $\mathcal{A}_{\ell}$ is in $\mathcal{O}(n/\ell)$ and this bound is tight. The construction of Theorem 2 requires $\mathcal{O}(n)$ space. Ideally, we would thus like to compute $\mathcal{A}_{\ell}$ efficiently using (strongly) sublinear space. We generalize Theorem 2 to the following result.

Theorem 3.

The set $\mathcal{A}_{\ell}(T)$, for any $\ell>0$, any $T$ of length $n$, and any constant $\epsilon\in(0,1]$, can be constructed in $\mathcal{O}(n+n^{1-\epsilon}\ell)$ time using $\mathcal{O}(n^{\epsilon}+\ell+|\mathcal{A}_{\ell}|)$ space.

Proof.

We compute $\mathcal{A}_{\ell}(T[\lceil n^{\epsilon}(i-1)\rceil+1..\min(\lceil n^{\epsilon}i\rceil+\ell,n)])$, for all $i\in[1,\lceil n^{1-\epsilon}\rceil]$, using the algorithm from Theorem 2, and output their union. For any constant $\epsilon\in(0,1]$, the alphabet size $|\Sigma|=n^{\mathcal{O}(1)}=(n^{\epsilon}+\ell)^{\mathcal{O}(1)}$ is still polynomial in the length $n^{\epsilon}+\ell$ of the fragments, so computing one such anchor set takes $\mathcal{O}(n^{\epsilon}+\ell)$ time and space by Theorem 2. We delete each fragment (and the associated data structure) before processing the subsequent anchor set: it thus takes $\mathcal{O}(n+n^{1-\epsilon}\ell)$ time and $\mathcal{O}(n^{\epsilon}+\ell)$ additional space to construct $\mathcal{A}_{\ell}(T)$. ∎
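The blocking idea behind Theorem 3 can be sketched as follows, with any per-fragment routine (here the earlier brute-force bd_anchors) standing in for the linear-time algorithm of Theorem 2; the block length B plays the role of $n^{\epsilon}$, and each block is extended by $\ell-1$ positions so that every length-$\ell$ window lies entirely inside some block.

    #include <algorithm>
    #include <cstddef>
    #include <set>
    #include <string>

    // Any routine computing the order-ell bd-anchors of a string works here,
    // e.g., the brute-force sketch given earlier.
    std::set<std::size_t> bd_anchors(const std::string& T, std::size_t ell);

    // Blockwise computation of A_ell(T): process overlapping blocks of length at
    // most B + ell - 1, shift the per-block (1-based) anchors by the block offset,
    // and take the union.
    std::set<std::size_t> bd_anchors_blockwise(const std::string& T, std::size_t ell, std::size_t B) {
        std::set<std::size_t> A;
        for (std::size_t start = 0; start + ell <= T.size(); start += B) {
            std::size_t len = std::min(B + ell - 1, T.size() - start);
            for (std::size_t a : bd_anchors(T.substr(start, len), ell))
                A.insert(a + start);
        }
        return A;
    }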

Expected Size of $\mathcal{A}_{\ell}$.

We next analyze the expected size of $\mathcal{A}_{\ell}(T)$. We first show that if $\ell$ grows no faster than the size $\sigma$ of the alphabet, then the expected size of $\mathcal{A}_{\ell}$ is in $\mathcal{O}(n/\ell)$. Otherwise, if $\ell$ grows faster than $\sigma$, we slightly amend the sampling process to ensure that the expected size of the sample is in $\mathcal{O}(n/\ell)$.

Lemma 5.

If $T$ is a string of length $n$, randomly generated by a memoryless source over an alphabet of size $\sigma\geq 2$ with identical letter probabilities, then, for any integer $\ell>0$, the expected size of $\mathcal{A}_{\ell}(T)$ is in $\mathcal{O}(\frac{n\log\ell}{\ell\log\sigma}+\frac{n}{\ell})$.

Proof.

If $\ell=1$, then $|\mathcal{A}_{\ell}(T)|=n$. If $\ell=2$, then $\mathbb{E}[|\mathcal{A}_{\ell}(T)|]=1+(n-2)(2\sigma^{2}+1)/(3\sigma^{2})$. Now suppose $\ell\geq 3$. We say that $T[i..i+\ell-1]$ introduces a new bd-anchor if there exists $j\in[1,\ell]$ such that $j$ is the bd-anchor of $T[i..i+\ell-1]$, but $j+k$ is not the bd-anchor of $T[i-k..i-k+\ell-1]$ for all $k\in[1,\min(\ell-j,i-1)]$. Let $N_{i}(T)$ denote the event that $T[i..i+\ell-1]$ introduces a new bd-anchor. Since the letters are independent and identically distributed, the probability $\mathbb{P}[N_{i}(T)]$ only depends on, and is non-increasing in, the number of preceding overlapping length-$\ell$ fragments. Therefore

\mathbb{E}[|\mathcal{A}_{\ell}(T)|]=\mathbb{P}[N_{1}(T)]+\cdots+\mathbb{P}[N_{n-\ell+1}(T)]\leq 1+(n-1)\mathbb{P}[N_{2}(T)].

Let $p$ be the length of the shortest prefix of the lexicographically minimal rotation of $T[2..\ell+1]$ which is strictly smaller than the same-length prefix of any other rotation of $T[2..\ell+1]$.

Note that

\mathbb{P}[N_{2}] \;\leq\; \mathbb{P}[T[1..\ell]\text{ or }T[2..\ell+1]\text{ is a power of some string}]
\;+\; \mathbb{P}[T[1..\ell]\text{ is primitive with bd-anchor }1]
\;+\; \mathbb{P}[T[2..\ell+1]\text{ is primitive with bd-anchor}>\ell-3\log\ell/\log\sigma]
\;+\; \mathbb{P}[T[2..\ell+1]\text{ is primitive and }p\geq 3\log\ell/\log\sigma].

To bound the first of these probabilities, note that

\mathbb{P}[\text{a length-}\ell\text{ string is a power of some string}]\leq\sum_{d<\ell,\,d\mid\ell}\sigma^{-(\ell-d)}\leq\sum_{d\leq\ell/2}\sigma^{-(\ell-d)}\leq\sigma^{1-\ell/2}/(\sigma-1).

The second probability is bounded by $1/\ell$, since each letter of a primitive length-$\ell$ string is equally likely to be the anchor. Similarly, the third probability is bounded by $(3\log\ell/\log\sigma+1)/\ell$. Finally, the fourth probability is bounded by the probability that two prefixes of length $\lceil 3\log\ell/\log\sigma\rceil$ of rotations of $T[2..\ell+1]$ are equal, which is at most $\ell^{2}\cdot\sigma^{-3\log\ell/\log\sigma}=1/\ell$. It follows that

\mathbb{P}[N_{2}] \;\leq\; 2\sigma^{1-\ell/2}/(\sigma-1)+1/\ell+(3\log\ell/\log\sigma+1)/\ell+1/\ell \;=\; \mathcal{O}\left(\frac{\log\ell}{\ell\log\sigma}+\frac{1}{\ell}\right).

We conclude that for any $\ell>0$ the expected size of $\mathcal{A}_{\ell}(T)$ is in $\mathcal{O}(\frac{n\log\ell}{\ell\log\sigma}+\frac{n}{\ell})$. ∎

We define a reduced version of bd-anchors to ensure that the expected size of the sample is in $\mathcal{O}(n/\ell)$.

Definition 2.

Given a string $X$ of length $\ell>0$ and an integer $0\leq r\leq\ell-1$, we define the reduced bidirectional anchor of $X$ as the lexicographically minimal rotation $j\in[1,\ell-r]$ of $X$ with minimal $j$. The set of order-$\ell$ reduced bd-anchors of a string $T$ of length $n>\ell$ is defined as the set $\mathcal{A}_{\ell}^{\text{red}}(T)$ of reduced bd-anchors of $T[i..i+\ell-1]$, for all $i\in[1,n-\ell+1]$.

Lemma 6.

If $T$ is a string of length $n$, randomly generated by a memoryless source over an alphabet of size $\sigma\geq 2$ with identical letter probabilities, then, for any integer $\ell>0$, the expected size of $\mathcal{A}_{\ell}^{\text{red}}(T)$ with $r=\lceil 4\log\ell/\log\sigma\rceil$ is in $\mathcal{O}(n/\ell)$.

Proof.

If $\ell\in\{1,2\}$, then $|\mathcal{A}_{\ell}^{\text{red}}(T)|\leq n$. Now suppose $\ell\geq 3$. Analogously to $N_{i}$ in Lemma 5, we denote by $N^{\text{red}}_{i}(T)$ the event that $T[i..i+\ell-1]$ introduces a new reduced bd-anchor. Again we find

\mathbb{E}\left[\left|\mathcal{A}_{\ell}^{\text{red}}(T)\right|\right]=\mathbb{P}[N^{\text{red}}_{1}(T)]+\cdots+\mathbb{P}[N^{\text{red}}_{n-\ell+1}(T)]\leq 1+(n-\ell)\mathbb{P}[N^{\text{red}}_{2}(T)].

Let $p^{\text{red}}$ be the length of the shortest prefix of the lexicographically minimal rotation $j_{1}\in[1,\ell-r]$ of $T[2..\ell+1]$ which is strictly smaller than the same-length prefix of any other rotation $j_{2}\in[1,\ell-r]\setminus\{j_{1}\}$. Using a proof similar to that of Lemma 5, we find that

\mathbb{P}[N^{\text{red}}_{2}(T)] \;\leq\; \mathbb{P}[T[1..\ell]\text{ or }T[2..\ell+1]\text{ is a power of some string}]
\;+\; \mathbb{P}[T[1..\ell]\text{ is primitive with bd-anchor }1]
\;+\; \mathbb{P}[T[2..\ell+1]\text{ is primitive with bd-anchor }\ell-r+1]
\;+\; \mathbb{P}[T[2..\ell+1]\text{ is primitive and }p^{\text{red}}\geq r]
\;\leq\; 2\sigma^{1-\ell/2}/(\sigma-1)+1/(\ell-r)+1/(\ell-r)+\ell^{2}/\sigma^{r} \;=\; 2/\ell+o\left(1/\ell\right).

We conclude that for any $\ell>0$ the expected size of $\mathcal{A}^{\text{red}}_{\ell}(T)$ is in $\mathcal{O}(n/\ell)$. ∎

In particular, if $\ell=\mathcal{O}(\sigma)$, we employ the sampling mechanism underlying $\mathcal{A}_{\ell}(T)$; otherwise ($\ell=\Omega(\sigma)$), we employ the sampling mechanism underlying $\mathcal{A}_{\ell}^{\text{red}}(T)$ with $r=\lceil 4\log\ell/\log\sigma\rceil$ to ensure that the expected size of the resulting sample is always in $\mathcal{O}(n/\ell)$.

Constructing $\mathcal{A}_{\ell}^{\text{red}}(T)$ in $\mathcal{O}(n)$ time requires a trivial modification in Theorem 2. For each fragment $T[i..i+\ell-1]$, instead of computing the minimal lexicographic suffix of string

Y=S[i..i+\ell-1]\cdot S[i..i+\ell-1]\cdot S[n+1]=T[i..i+\ell-1]\cdot T[i..i+\ell-1]\cdot\#,

in $\mathcal{O}(1)$ time, we compute the minimal lexicographic suffix of string

Y^{\text{red}}=S[i..i+\ell-1]\cdot S[i..i+\ell-1-r]\cdot S[n+1]=T[i..i+\ell-1]\cdot T[i..i+\ell-1-r]\cdot\#,

in $\mathcal{O}(1)$ time. We then directly obtain the trade-off in Theorem 3 for constructing $\mathcal{A}_{\ell}^{\text{red}}(T)$.
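A brute-force sketch of the reduced sampling (Definition 2), mirroring the earlier bd_anchors sketch: the only change is that the minimal rotation is chosen among the first $\ell-r$ starting positions of each window (the code assumes $r<\ell$; Lemma 6 suggests $r=\lceil 4\log\ell/\log\sigma\rceil$).

    #include <cstddef>
    #include <set>
    #include <string>

    // Brute-force order-ell reduced bd-anchors: the leftmost minimal rotation is
    // taken only over starting positions 1..ell-r of each window (Definition 2).
    std::set<std::size_t> reduced_bd_anchors(const std::string& T, std::size_t ell, std::size_t r) {
        std::set<std::size_t> A;
        for (std::size_t i = 0; i + ell <= T.size(); ++i) {
            std::string X = T.substr(i, ell), XX = X + X;
            std::size_t best = 0;
            for (std::size_t j = 1; j + r < ell; ++j)          // candidates 0..ell-r-1 (0-based)
                if (XX.compare(j, ell, XX, best, ell) < 0) best = j;
            A.insert(i + best + 1);                            // 1-based position in T
        }
        return A;
    }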

Density Evaluation.

We compare the density of bd-anchors and reduced bd-anchors, denoted by BDA and rBDA, respectively, to the density of minimizers, for different values of $w$ and $k$ such that $\ell=w+k-1$. This is a fair comparison because $\ell=w+k-1$ is the length of the fragments considered by both mechanisms. We implemented bd-anchors, the standard minimizers mechanism from [58], and the minimizers mechanism with robust winnowing from [60]. The standard minimizers and those with robust winnowing are referred to as STD and WIN, respectively.

For bd-anchors, we used Booth's algorithm, which is easy to implement and reasonably fast. For minimizers, we used Karp-Rabin fingerprints [41]. (Note that such "random" minimizers tend to perform even better than the ones based on a lexicographic total order in terms of density [72].) For the reduced version of bd-anchors, we used $r=\lceil 3\log\ell/\log\sigma\rceil$, because the $r$ value suggested by Lemma 6 is relatively large for the small $\ell$ values tested; e.g., for $\ell=15$ and $\sigma=4$, $\lceil 4\log\ell/\log\sigma\rceil=8$. Throughout, we do not evaluate construction times, as all implementations are reasonably fast, and we make the standard assumption that preprocessing is only required once. We used five string datasets from the popular Pizza & Chili corpus [25] (see Table 1 for the datasets' characteristics). All implementations referred to in this paper have been written in C++ and compiled at optimization level -O3. All experiments reported in this paper were conducted using a single core of an AMD Opteron 6386 SE 2.8GHz CPU and 252GB RAM running GNU/Linux.

Dataset    Length $n$     Alphabet size $|\Sigma|$
DNA        200,000,000    4
XML        200,000,000    95
ENGLISH    200,000,000    224
PROTEINS   200,000,000    27
SOURCES    200,000,000    229
Table 1: Datasets characteristics.
Figure 1 (figure omitted): Density vs. $w,k$ for $\ell=w+k-1$ and the first two datasets of Table 1; panels: (a) DNA, (b) XML.
Figure 2 (figure omitted): Density vs. $w,k$ for $\ell=w+k-1$ and the last three datasets of Table 1; panels: (a) ENGLISH, (b) PROTEINS, (c) SOURCES.

As can be seen by the results depicted in Figures 1 and 2, the density of both BDA and rBDA is either significantly smaller than or competitive to the STD and WIN minimizers density, especially for small $w$. This is useful because a lower density results in smaller indexes and less computation (see Section 4), and because small $w$ is of practical interest (see Section 5). For instance, the widely-used long-read aligner Minimap2 [48] stores the selected minimizers of a reference genome in a hash table to find exact matches as anchors for seed-and-extend alignment. The parameters $w$ and $k$ are set based on the required sensitivity of the alignment, and thus $w$ and $k$ cannot be too large for high sensitivity. Thus, a lower sampling density reduces the size of the hash table, as well as the computation time, by lowering the average number of selected minimizers to consider when performing an alignment. Furthermore, although the datasets are not uniformly random, rBDA performs better than BDA as $\ell$ grows, as suggested by Lemmas 5 and 6.

There exists a long line of research on improving the density of minimizers in special regimes (see Section 6 for details). We stress that most of these algorithms are designed, implemented, or optimized only for the DNA alphabet. We have tested against two state-of-the-art tools employing such algorithms: Miniception [72] and PASHA [22]. The former did not give better results than STD or WIN for the tested values of $w$ and $k$; and the latter does not scale beyond $k=16$ or with large alphabets. We have thus omitted these results.

We next report the average number (AVG) of bd-anchors of order $\ell\in\{4,8,12,16\}$ over all strings of length $n=20$ (see Table 2(a)) and over all strings of length $n=32$ (see Table 2(b)), both over a binary alphabet. Notably, the results show that AVG always lies in $[n/\ell,2n/\ell]$, even if not using the reduced version of bd-anchors (see Lemma 6). As expected by Lemma 5, the analogous AVG values using a ternary alphabet (not reported here) were always lower than the corresponding ones with a binary alphabet.

(a)
$(n,\ell)$   (20,4)   (20,8)   (20,12)   (20,16)
$n/\ell$     5        2.5      1.66      1.25
AVG          8.53     4.37     2.77      1.76
$2n/\ell$    10       5        3.33      2.5

(b)
$(n,\ell)$   (32,4)   (32,8)   (32,12)   (32,16)
$n/\ell$     8        4        2.66      2
AVG          14.16    7.67     5.26      3.85
$2n/\ell$    16       8        5.33      4

Table 2: Average number of bd-anchors for varying $\ell$ and: (a) $n=20$; (b) $n=32$.

Minimizing the Size of $\mathcal{A}_{\ell}$ is NP-hard.

The number of bd-anchors depends on the lexicographic total order defined on $\Sigma$. We now prove that finding the total order which minimizes the number of bd-anchors is NP-hard, using a reduction from minimum feedback arc set [40]. Let us start by defining this problem. Given a directed graph $G(V,E)$, a feedback arc set in $G$ is a subset of $E$ that contains at least one edge from every cycle in $G$. Removing these edges from $G$ breaks all of the cycles, producing a directed acyclic graph. In the minimum feedback arc set problem, we are given $G(V,E)$, and we are asked to compute a smallest feedback arc set in $G$. The decision version of the problem takes an additional parameter $k$ as input, and it asks whether all cycles can be broken by removing at most $k$ edges from $E$. The decision version is NP-complete [40] and the optimization version is APX-hard [21].

Theorem 4.

Let $T$ be a string of length $n$ over alphabet $\Sigma$. Further let $\ell>0$ be an integer. Computing a total order $\leq$ on $\Sigma$ which minimizes $|\mathcal{A}_{\ell}(T)|$ is NP-hard.

Proof.

Let $G=(\Sigma,A)$ be any instance of the minimum feedback arc set problem. We will construct a string $S\in\Sigma^{*}$, in polynomial time in the size of $G$, such that finding a total order $\leq$ on $\Sigma$ which minimizes $|\mathcal{A}_{4}(S)|$ corresponds to finding a minimum feedback arc set in $G$.

We start with an empty string $S$. For each edge $(a,b)\in A$, we append $(\texttt{aabab})^{2|A|+1}$ to $S$. Observe that, if $a<b$, all the $a$'s of $(\texttt{aabab})^{2|A|+1}$ and none of its $b$'s are order-$4$ bd-anchors, except possibly the first $a$ and last $b$, depending on the preceding and subsequent letters in $S$. Thus there are $6|A|+2$ to $6|A|+4$ order-$4$ bd-anchors in $(\texttt{aabab})^{2|A|+1}$. If on the other hand $a>b$, we analogously find that there will be $4|A|+1$ to $4|A|+3$ order-$4$ bd-anchors in $(\texttt{aabab})^{2|A|+1}$.

Let $d_{\leq}$ be the number of edges $(a,b)\in A$ such that $a<b$. The total number of order-$4$ bd-anchors in $S$ is

|\mathcal{A}_{4}(S)|=|A|\cdot 2\cdot(2|A|+1)+d_{\leq}\cdot(2|A|+1)+\epsilon,

with $\epsilon\in[-|A|,|A|]$. Therefore minimizing the total number of order-$4$ bd-anchors in $S$ is equivalent to finding an order $\leq$ on the set $\Sigma$ of vertices of $G$ which minimizes $d_{\leq}$.

Note that if we delete all edges $(a,b)\in A$ such that $a<b$, then the residual graph is acyclic. Moreover, for each acyclic graph there exists an order on the vertices such that $a>b$ for all $(a,b)\in A$. Therefore the minimal $d_{\leq}$ equals the size of the minimum feedback arc set.

We conclude that, since finding the size of the minimum feedback arc set is NP-hard, so is finding a total order $\leq$ on $\Sigma$ which minimizes the total number of order-$4$ bd-anchors. ∎
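The string built in the reduction is easy to generate; the sketch below (reduction_string is our name for this helper) constructs $S$ from an edge list whose endpoints are letters, exactly as in the proof.

    #include <cstddef>
    #include <string>
    #include <utility>
    #include <vector>

    // Proof of Theorem 4: for every edge (a, b), append (aabab)^(2|A|+1) to S.
    // Minimizing |A_4(S)| over total orders on the alphabet then amounts to
    // finding a minimum feedback arc set of the input graph.
    std::string reduction_string(const std::vector<std::pair<char, char>>& edges) {
        std::string S;
        const std::size_t reps = 2 * edges.size() + 1;
        for (const auto& e : edges) {
            const std::string gadget{e.first, e.first, e.second, e.first, e.second}; // "aabab"
            for (std::size_t t = 0; t < reps; ++t) S += gadget;
        }
        return S;
    }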

4 Indexing Using Bidirectional Anchors

Before presenting our index, let us start with a basic definition that is central to our querying process.

Definition 3 ($(\alpha,\beta)$-hit).

Given an order-$\ell$ bd-anchor $j_{Q}\in\mathcal{A}_{\ell}(Q)$, for some integer $\ell>0$, of a query string $Q$, two integers $\alpha>0,\beta>0$, with $\alpha+\beta\geq\ell+1$, and an order-$\ell$ bd-anchor $j_{T}\in\mathcal{A}_{\ell}(T)$ of a target string $T$, the ordered pair $(j_{Q},j_{T})$ is called an $(\alpha,\beta)$-hit if and only if $T[j_{T}-\alpha+1..j_{T}]=Q[j_{Q}-\alpha+1..j_{Q}]$ and $T[j_{T}..j_{T}+\beta-1]=Q[j_{Q}..j_{Q}+\beta-1]$.

Intuitively, the parameters $\alpha$ and $\beta$ let us choose a fragment of $Q$ that is anchored at $j_{Q}$.

Example 5.

Let $T=\texttt{aabaaabcbda}$, $Q=\texttt{aacabaaaae}$, and $\ell=5$. Consider that we would like to find the common fragment $Q[4..8]=T[2..6]=\texttt{abaaa}$. We know that the bd-anchor of order $5$ corresponding to $Q[4..8]$ is $6\in\mathcal{A}_{5}(Q)$, and thus to find it we set $\alpha=3$ and $\beta=3$. The ordered pair $(6,4)$ is a $(3,3)$-hit because, for $4\in\mathcal{A}_{5}(T)$, we have: $T[4-3+1..4]=Q[6-3+1..6]=\texttt{aba}$ and $T[4..4+3-1]=Q[6..6+3-1]=\texttt{aaa}$.
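Definition 3 can be checked directly, as in the sketch below (is_hit is our name for this helper; positions are 1-based as in the paper). For Example 5, is_hit(T, 4, Q, 6, 3, 3) returns true.

    #include <cstddef>
    #include <string>

    // (alpha,beta)-hit test (Definition 3): the alpha characters ending at the two
    // anchors and the beta characters starting at them must coincide.
    bool is_hit(const std::string& T, std::size_t jT,
                const std::string& Q, std::size_t jQ,
                std::size_t alpha, std::size_t beta) {
        if (jT < alpha || jQ < alpha) return false;                             // left contexts fit
        if (jT + beta - 1 > T.size() || jQ + beta - 1 > Q.size()) return false; // right contexts fit
        return T.compare(jT - alpha, alpha, Q, jQ - alpha, alpha) == 0          // T[jT-a+1..jT] = Q[jQ-a+1..jQ]
            && T.compare(jT - 1, beta, Q, jQ - 1, beta) == 0;                   // T[jT..jT+b-1] = Q[jQ..jQ+b-1]
    }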

We would like to construct a data structure over $T$, which is based on $\mathcal{A}_{\ell}(T)$, such that, when we are given an order-$\ell$ bd-anchor $j_{Q}$ over $Q$ as an on-line query, together with parameters $\alpha$ and $\beta$, we can report all $(\alpha,\beta)$-hits efficiently. To this end, we present an efficient data structure, denoted by $\mathcal{I}_{\ell}(T)$, which is constructed on top of $T$ and answers $(\alpha,\beta)$-hit queries in near-optimal time. We prove the following result.

Theorem 5.

Given a string $T$ of length $n$ and an integer $\ell>0$, the $\mathcal{I}_{\ell}(T)$ index can be constructed in $\mathcal{O}(n+|\mathcal{A}_{\ell}(T)|\sqrt{\log(|\mathcal{A}_{\ell}(T)|)})$ time. For any constant $\epsilon>0$, $\mathcal{I}_{\ell}(T)$:

  • occupies $\mathcal{O}(|\mathcal{A}_{\ell}(T)|)$ extra space and reports all $k$ $(\alpha,\beta)$-hits in $\mathcal{O}(\alpha+\beta+(k+1)\log^{\epsilon}(|\mathcal{A}_{\ell}(T)|))$ time; or

  • occupies $\mathcal{O}(|\mathcal{A}_{\ell}(T)|\log^{\epsilon}(|\mathcal{A}_{\ell}(T)|))$ extra space and reports all $k$ $(\alpha,\beta)$-hits in $\mathcal{O}(\alpha+\beta+\log\log(|\mathcal{A}_{\ell}(T)|)+k)$ time.

Let us denote by $\overleftarrow{X}=X[|X|]\ldots X[1]$ the reversal of string $X$. We now describe our data structure.

Construction of $\mathcal{I}_{\ell}(T)$.

Given $\mathcal{A}_{\ell}(T)$, we construct two sets $\mathcal{S}^{L}_{\ell}(T)$ and $\mathcal{S}^{R}_{\ell}(T)$ of strings: conceptually, the reversed prefixes going left from $j$ to $1$, and the suffixes going right from $j$ to $n$, for all $j$ in $\mathcal{A}_{\ell}(T)$. In particular, for the bd-anchor $j$, we construct two strings: $\overleftarrow{T[1..j]}\in\mathcal{S}^{L}_{\ell}(T)$ and $T[j..n]\in\mathcal{S}^{R}_{\ell}(T)$. Note that $|\mathcal{S}^{L}_{\ell}(T)|=|\mathcal{S}^{R}_{\ell}(T)|=|\mathcal{A}_{\ell}(T)|$, since for every bd-anchor in $\mathcal{A}_{\ell}(T)$ we have a distinct string in $\mathcal{S}^{L}_{\ell}(T)$ and in $\mathcal{S}^{R}_{\ell}(T)$.

We construct two compacted tries $\mathcal{T}^{L}_{\ell}(T)$ and $\mathcal{T}^{R}_{\ell}(T)$ over $\mathcal{S}^{L}_{\ell}(T)$ and $\mathcal{S}^{R}_{\ell}(T)$, respectively, to index all strings. Every string is concatenated with some special letter $\$$ not occurring in $T$, which is lexicographically minimal, to make $\mathcal{S}^{L}_{\ell}(T)$ and $\mathcal{S}^{R}_{\ell}(T)$ prefix-free (this is standard for conceptual convenience). The leaf nodes of the compacted tries are labeled with the corresponding $j$: there is a one-to-one correspondence between a leaf node and a bd-anchor $j$. In $\mathcal{O}(|\mathcal{A}_{\ell}(T)|)$ time, we also enhance the nodes of the tries with a perfect static dictionary [27] to ensure constant-time retrieval of edges by the first letter of their label. Let $\mathcal{L}^{L}_{\ell}(T)$ denote the list of the leaf labels of $\mathcal{T}^{L}_{\ell}(T)$ as they are visited using a depth-first search traversal. $\mathcal{L}^{L}_{\ell}(T)$ corresponds to the (labels of the) lexicographically sorted list of $\mathcal{S}^{L}_{\ell}(T)$ in increasing order. For each node $u$ in $\mathcal{T}^{L}_{\ell}(T)$, we also store the corresponding interval $[x_{u},y_{u}]$ over $\mathcal{L}^{L}_{\ell}(T)$. Analogously for $R$: $\mathcal{L}^{R}_{\ell}(T)$ denotes the list of the leaf labels of $\mathcal{T}^{R}_{\ell}(T)$ as they are visited using a depth-first search traversal and corresponds to the (labels of the) lexicographically sorted list of $\mathcal{S}^{R}_{\ell}(T)$ in increasing order. For each node $v$ in $\mathcal{T}^{R}_{\ell}(T)$, we also store the corresponding interval $[x_{v},y_{v}]$ over $\mathcal{L}^{R}_{\ell}(T)$.

The total size occupied by the tries is $\Theta(|\mathcal{A}_{\ell}(T)|)$ because they are compacted: we label the edges with intervals over $[1,n]$ from $T$.

We also construct a 2D range reporting data structure over the following points in set $\mathcal{R}_{\ell}(T)$:

(x,y)\in\mathcal{R}_{\ell}(T)\iff\mathcal{L}^{L}_{\ell}(T)[x]=\mathcal{L}^{R}_{\ell}(T)[y].

Note that $|\mathcal{R}_{\ell}(T)|=|\mathcal{A}_{\ell}(T)|$ because the set of leaf labels stored in both tries is precisely the set $\mathcal{A}_{\ell}(T)$. Let us remark that the idea of employing 2D range reporting for bidirectional pattern searches has been introduced by Amir et al. [2] for text indexing and dictionary matching with one error; see also [50].

This completes the construction of $\mathcal{I}_{\ell}(T)$. We next explain how we can query $\mathcal{I}_{\ell}(T)$.

Querying.

Given a bd-anchor $j_{Q}$ over a string $Q$ as an on-line query and parameters $\alpha,\beta>0$, we spell $\overleftarrow{Q[j_{Q}-\alpha+1..j_{Q}]}$ in $\mathcal{T}^{L}_{\ell}(T)$ and $Q[j_{Q}..j_{Q}+\beta-1]$ in $\mathcal{T}^{R}_{\ell}(T)$, starting from the root nodes. If any of the two strings is not spelled fully, we return no $(\alpha,\beta)$-hits. If both strings are fully spelled, we arrive at a node $u$ in $\mathcal{T}^{L}_{\ell}(T)$ (resp. $v$ in $\mathcal{T}^{R}_{\ell}(T)$), which corresponds to an interval over $\mathcal{L}^{L}_{\ell}(T)$ stored in $u$ (resp. over $\mathcal{L}^{R}_{\ell}(T)$ stored in $v$). We obtain the two intervals $[x_{u},y_{u}]$ and $[x_{v},y_{v}]$ forming a rectangle and ask the corresponding 2D range reporting query. It can be readily verified that this query returns all $(\alpha,\beta)$-hits.
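To make the querying process concrete, here is a deliberately simplified sketch of the index (the class name SimpleBDAIndex is ours): instead of compacted tries and a 2D range reporting structure, it stores the reversed prefixes and the suffixes explicitly in two sorted lists and intersects the two anchor sets, so it illustrates the bidirectional search but does not achieve the space or query bounds of Theorem 5.

    #include <algorithm>
    #include <cstddef>
    #include <set>
    #include <string>
    #include <utility>
    #include <vector>

    std::set<std::size_t> bd_anchors(const std::string& T, std::size_t ell); // earlier sketch

    struct SimpleBDAIndex {
        std::vector<std::pair<std::string, std::size_t>> L, R; // (reversed T[1..j] / T[j..n], anchor j)

        SimpleBDAIndex(const std::string& T, std::size_t ell) {
            for (std::size_t j : bd_anchors(T, ell)) {
                std::string left = T.substr(0, j);  // T[1..j]
                std::reverse(left.begin(), left.end());
                L.emplace_back(left, j);
                R.emplace_back(T.substr(j - 1), j); // T[j..n]
            }
            std::sort(L.begin(), L.end());
            std::sort(R.begin(), R.end());
        }

        // anchors whose stored string has P as a prefix (binary search on the sorted list)
        static std::set<std::size_t> prefixed_by(const std::vector<std::pair<std::string, std::size_t>>& V,
                                                 const std::string& P) {
            std::set<std::size_t> out;
            auto it = std::lower_bound(V.begin(), V.end(), std::make_pair(P, std::size_t(0)));
            for (; it != V.end() && it->first.compare(0, P.size(), P) == 0; ++it) out.insert(it->second);
            return out;
        }

        // all (alpha,beta)-hits: left_rev is the reversal of Q[jQ-alpha+1..jQ], right is Q[jQ..jQ+beta-1]
        std::set<std::size_t> hits(const std::string& left_rev, const std::string& right) const {
            std::set<std::size_t> A = prefixed_by(L, left_rev), B = prefixed_by(R, right), H;
            for (std::size_t j : A) if (B.count(j)) H.insert(j);
            return H;
        }
    };

On $T=\texttt{aabaaabcbda}$ with $\ell=5$, querying with left_rev = aba and right = aaa (the contexts of Example 5) returns the single hit $\{4\}$, matching Figure 3.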

Example 6.

Let $T=\texttt{aabaaabcbda}$ and $\mathcal{A}_{5}(T)=\{4,5,6,11\}$. We have the following strings in $\mathcal{S}^{L}_{5}(T)$: $\overleftarrow{T[1..4]}=\texttt{abaa}$; $\overleftarrow{T[1..5]}=\texttt{aabaa}$; $\overleftarrow{T[1..6]}=\texttt{aaabaa}$; and $\overleftarrow{T[1..11]}=\texttt{adbcbaaabaa}$. We have the following strings in $\mathcal{S}^{R}_{5}(T)$: $T[4..11]=\texttt{aaabcbda}$; $T[5..11]=\texttt{aabcbda}$; $T[6..11]=\texttt{abcbda}$; and $T[11..11]=\texttt{a}$. Inspect Figure 3.

Figure 3 (figure omitted): Let $T=\texttt{aabaaabcbda}$ and $\ell=5$. Further let $Q=\texttt{aacabaaaae}$, the bd-anchor $6\in\mathcal{A}_{5}(Q)$ of order $5$ corresponding to $Q[4..8]$, $\alpha=3$ and $\beta=3$. The figure illustrates the $\mathcal{I}_{\ell}(T)$ index and how we find that $Q[4..8]=T[2..6]=\texttt{abaaa}$: the fragment $T[2..6]$ is anchored at position $4$. Panels: (a) the $\mathcal{I}_{\ell}(T)$ index, (b) querying abaaa.
Proof of Theorem 5.

We use the $\mathcal{O}(n)$-time algorithm underlying Theorem 2 to construct $\mathcal{A}_{\ell}(T)$. We use the $\mathcal{O}(n)$-time algorithm from [3, 8] to construct the compacted tries from $\mathcal{A}_{\ell}(T)$. We extract the $|\mathcal{A}_{\ell}(T)|$ points $(x,y)\in\mathcal{R}_{\ell}(T)$ using the compacted tries in $\mathcal{O}(|\mathcal{A}_{\ell}(T)|)$ time. For the first trade-off of the statement, we use the $\mathcal{O}(|\mathcal{A}_{\ell}(T)|\sqrt{\log(|\mathcal{A}_{\ell}(T)|)})$-time algorithm from [5] to construct the 2D range reporting data structure over $\mathcal{R}_{\ell}(T)$ from [7]. For the second trade-off, we use the $\mathcal{O}(|\mathcal{A}_{\ell}(T)|\sqrt{\log(|\mathcal{A}_{\ell}(T)|)})$-time algorithm from [29] to construct the 2D range reporting data structure over $\mathcal{R}_{\ell}(T)$ from the same paper. ∎

We obtain the following corollary for the fundamental problem of text indexing [66, 51, 23, 39, 24, 32, 33, 4, 12, 55, 43, 28].

Corollary 6.

Given $\mathcal{I}_{\ell}(T)$, constructed for some integer $\ell>0$ and some constant $\epsilon>0$ over string $T$, we can report all $k$ occurrences of any pattern $Q$, $|Q|\geq\ell$, in $T$ in time:

  • $\mathcal{O}(|Q|+(k+1)\log^{\epsilon}(|\mathcal{A}_{\ell}(T)|))$ when $\mathcal{I}_{\ell}(T)$ occupies $\mathcal{O}(|\mathcal{A}_{\ell}(T)|)$ extra space; or

  • $\mathcal{O}(|Q|+\log\log(|\mathcal{A}_{\ell}(T)|)+k)$ when $\mathcal{I}_{\ell}(T)$ occupies $\mathcal{O}(|\mathcal{A}_{\ell}(T)|\log^{\epsilon}(|\mathcal{A}_{\ell}(T)|))$ extra space.

Proof.

Every occurrence of $Q$ in $T$ is prefixed by string $P=Q[1..\ell]$. We first compute the bd-anchor of $P$ in $\mathcal{O}(\ell)$ time using Booth's algorithm. Let this bd-anchor be $j$. We set $\alpha=j$ and $\beta=|Q|-j+1$. The result follows by applying Theorem 5. ∎

Querying Multiple Fragments.

In the case of approximate pattern matching, we may want to query multiple length-$\ell$ fragments of a string $Q$ given as an on-line query, and not only its length-$\ell$ prefix. We show that such an operation can be done efficiently using the bd-anchors of $Q$ and the $\mathcal{I}_{\ell}(T)$ index.

Corollary 7.

Given $\mathcal{I}_{\ell}(T)$, constructed for some $\ell>0$ and some constant $\epsilon>0$ over $T$, for any sequence (not necessarily consecutive) of $d>0$ length-$\ell$ fragments of a pattern $Q$, $|Q|\geq\ell$, corresponding to the same order-$\ell$ bd-anchor of $Q$, we can report all $k_{d}$ occurrences of all $d$ fragments in $T$ in time:

  • $\mathcal{O}(\ell+(d+k_{d})\log^{\epsilon}(|\mathcal{A}_{\ell}(T)|))$ when $\mathcal{I}_{\ell}(T)$ occupies $\mathcal{O}(|\mathcal{A}_{\ell}(T)|)$ space; or

  • $\mathcal{O}(\ell+d\log\log(|\mathcal{A}_{\ell}(T)|)+k_{d})$ when $\mathcal{I}_{\ell}(T)$ occupies $\mathcal{O}(|\mathcal{A}_{\ell}(T)|\log^{\epsilon}(|\mathcal{A}_{\ell}(T)|))$ space.

Proof.

Let the order-$\ell$ bd-anchor over $Q$ be $j_{Q}$ and the corresponding parameters be $(\alpha_{1},\beta_{1}),\ldots,(\alpha_{d},\beta_{d})$, with $\alpha_{i}+\beta_{i}=\ell+1$. Observe that $\alpha_{i}>\alpha_{i+1}$ and $\beta_{i}<\beta_{i+1}$. Starting from $j_{Q}$, the string $\overleftarrow{Q[j_{Q}-\alpha_{i}+1..j_{Q}]}$ we spell for fragment $i$ is a prefix of $\overleftarrow{Q[j_{Q}-\alpha_{i-1}+1..j_{Q}]}$ for fragment $i-1$. The analogous property holds for the other direction: the string $Q[j_{Q}..j_{Q}+\beta_{i}-1]$ we spell for fragment $i$ is a prefix of $Q[j_{Q}..j_{Q}+\beta_{i+1}-1]$ for fragment $i+1$. Thus it takes only $\mathcal{O}(\ell)$ time to construct all $d$ rectangles. Finally, we ask the $d$ corresponding 2D range reporting queries to obtain all $k_{d}$ occurrences in the claimed time complexities. ∎

Index Evaluation.

Consider a hash table with the following (key, value) pairs: the key is the hash value $h(S)$ of a length-$k$ string $S$; and the value (satellite data) is the list of occurrences of $S$ in $T$. It should be clear that such a hash table indexing the minimizers of $T$ does not perform well for on-line pattern searches of arbitrary length, because it would need to verify the remaining prefix and suffix of the pattern using letter comparisons for all occurrences of a minimizer in $T$. We thus opted for comparing our index to the one of [31], which addresses this specific problem by sampling the suffix array [51] with minimizers to reduce the number of letter comparisons during verification.

To ensure a fair comparison, we have implemented the basic index from [31]; we denote it by GR Index. We used Karp-Rabin [41] fingerprints for computing the minimizers of $T$. We also used the array-based version of the suffix tree that consists of the suffix array (SA) and the longest common prefix (LCP) array [51]; the SA was constructed using SDSL [30] and the LCP array using the Kasai et al. [42] algorithm.

We sampled the SA using the minimizers. Given a pattern $Q$, we searched $Q[j..|Q|]$, starting with the minimizer $Q[j..j+k-1]$, using the Manber and Myers [51] algorithm on the sampled SA. For verifying the remaining prefix $Q[1..j-1]$ of $Q$, we used letter comparisons, as described in [31]. The space complexity of this implementation is $\mathcal{O}(n)$ and the extra space for the index is $\mathcal{O}(|\mathcal{M}_{w,k}(T)|)$. The query time is not bounded. We have implemented two versions of our index. We used Booth's algorithm for computing the bd-anchors of $T$. We used SDSL for SA construction and the Kasai et al. algorithm for LCP array construction. We sampled the SA using the bd-anchors, thus constructing $\mathcal{L}^{L}_{\ell}(T)$ and $\mathcal{L}^{R}_{\ell}(T)$. Then, the two versions of our index are:

  1. BDA Index v1: Let $j$ be the bd-anchor of $Q[1\mathinner{.\,.}\ell]$. For $\overleftarrow{Q[1\mathinner{.\,.}j]}$ (resp. $Q[j\mathinner{.\,.}|Q|]$) we used the Manber and Myers algorithm for searching over $\mathcal{L}^{L}_{\ell}(T)$ (resp. $\mathcal{L}^{R}_{\ell}(T)$). We used range trees [14] implemented in CGAL [63] for 2D range reporting as per the described querying process. The space complexity of this implementation is $\mathcal{O}(n+|\mathcal{A}_{\ell}(T)|\log(|\mathcal{A}_{\ell}(T)|))$ and the extra space for the index is $\mathcal{O}(|\mathcal{A}_{\ell}(T)|\log(|\mathcal{A}_{\ell}(T)|))$. The query time is $\mathcal{O}(|Q|+\log^{2}(|\mathcal{A}_{\ell}(T)|)+k)$, where $k$ is the total number of occurrences of $Q$ in $T$.

  2. BDA Index v2: Let $j$ be the bd-anchor of $Q[1\mathinner{.\,.}\ell]$. If $|Q|-j+1\geq j$ (resp. $|Q|-j+1<j$), we search for $Q[j\mathinner{.\,.}|Q|]$ (resp. $\overleftarrow{Q[1\mathinner{.\,.}j]}$) using the Manber and Myers algorithm on $\mathcal{L}^{R}_{\ell}(T)$ (resp. $\mathcal{L}^{L}_{\ell}(T)$). For verifying the remaining part of the pattern we used letter comparisons (see the sketch after this list). The space complexity of this implementation is $\mathcal{O}(n)$ and the extra space for the index is $\mathcal{O}(|\mathcal{A}_{\ell}(T)|)$. The query time is not bounded.
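
The following is a minimal, naive sketch (C++17) of the BDA Index v2 query strategy referenced above: compute the bd-anchor of $Q[1\mathinner{.\,.}\ell]$, match the longer arm of the pattern at the sampled anchors of $T$ (a plain scan stands in for the binary search over $\mathcal{L}^{R}_{\ell}(T)$ or $\mathcal{L}^{L}_{\ell}(T)$), and verify the shorter arm by letter comparisons. The leftmost tie-breaking in the smallest-rotation computation and all names are our own choices; the sketch does not reproduce the complexities discussed in Section 4.

    // Minimal sketch of the BDA Index v2 query strategy.
    #include <iostream>
    #include <set>
    #include <string>

    // Starting position of the lexicographically smallest rotation of s (naive).
    static size_t smallest_rotation(const std::string& s) {
        std::string d = s + s;
        size_t best = 0;
        for (size_t i = 1; i < s.size(); ++i)
            if (d.compare(i, s.size(), d, best, s.size()) < 0) best = i;
        return best;
    }

    int main() {
        const std::string T = "cabbcabcabb";
        const size_t ell = 4;

        // A_ell(T): bd-anchors of all length-ell windows of T.
        std::set<size_t> A;
        for (size_t i = 0; i + ell <= T.size(); ++i)
            A.insert(i + smallest_rotation(T.substr(i, ell)));

        const std::string Q = "bcabca";                       // |Q| >= ell
        const size_t j = smallest_rotation(Q.substr(0, ell)); // bd-anchor of Q[0..ell-1]
        const bool right_longer = Q.size() - j >= j + 1;      // |Q[j..)| vs |Q[0..j]|

        for (size_t a : A) {
            if (a < j || a - j + Q.size() > T.size()) continue;
            const size_t p = a - j;                           // candidate occurrence of Q
            // Stand-in for binary-searching the longer arm in L^R (resp. L^L):
            bool hit = right_longer ? T.compare(a, Q.size() - j, Q, j, Q.size() - j) == 0
                                    : T.compare(p, j + 1, Q, 0, j + 1) == 0;
            if (!hit) continue;
            // Verify the shorter arm by letter comparisons.
            bool ok = true;
            if (right_longer) { for (size_t t = 0; t < j; ++t)            ok &= T[p + t] == Q[t]; }
            else              { for (size_t t = j + 1; t < Q.size(); ++t) ok &= T[p + t] == Q[t]; }
            if (ok) std::cout << "occurrence of Q at position " << p << '\n';
        }
    }

Replacing the plain scan with binary searches over the sorted arrays $\mathcal{L}^{R}_{\ell}(T)$ and $\mathcal{L}^{L}_{\ell}(T)$ (as sketched earlier) yields the strategy actually implemented.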

For each of the five real datasets of Table 1 and each query string length $\ell$, we randomly extracted 500,000 substrings from the text and treated each substring as a query, following [31]. We plot the average query time in Figure 4. As can be seen, BDA Index v2 consistently outperforms GR Index across all datasets and all $\ell$ values. The better performance of BDA Index v2 has two main reasons. First, the verification strategy exploits the fact that the index is bidirectional to apply the Manber and Myers algorithm to the larger part of the pattern, which results in fewer letter comparisons. Second, bd-anchors generally have smaller density than minimizers; see Figure 5. We also plot the peak memory usage in Figure 6. As can be seen, BDA Index v2 requires a similar amount of memory to GR Index.

Figure 4: Average query time (ms) vs. $w,k$ for $\ell=w+k-1$ and the datasets of Table 1. Panels: (a) DNA, (b) XML, (c) ENGLISH, (d) PROTEINS, (e) SOURCES.

BDA Index v1 was slower than GR Index for small $\ell$ but faster for large $\ell$ in three out of the five datasets used, and it had by far the highest memory usage. Let us stress that the inefficiency of BDA Index v1 is not inherent to the query time or space of our algorithm. It is merely because the range tree implementation of CGAL, a standard off-the-shelf library, is unfortunately inefficient in terms of both query time and memory usage; see also [62, 26].

Figure 5: Density vs. $w,k$ for $\ell=w+k-1$ and the datasets of Table 1. Panels: (a) DNA, (b) XML, (c) ENGLISH, (d) PROTEINS, (e) SOURCES.

Figure 6: Peak memory usage (GB) vs. $w,k$ for $\ell=w+k-1$ and the datasets of Table 1. Panels: (a) DNA, (b) XML, (c) ENGLISH, (d) PROTEINS, (e) SOURCES.

Discussion.

The proposed $\mathcal{I}_{\ell}(T)$ index, which is based on bd-anchors, has the following attributes:

  1. Construction: $\mathcal{A}_{\ell}(T)$ is constructed in $\mathcal{O}(n)$ worst-case time and $\mathcal{I}_{\ell}(T)$ is constructed in $\mathcal{O}(n+|\mathcal{A}_{\ell}(T)|\sqrt{\log(|\mathcal{A}_{\ell}(T)|)})$ worst-case time. These time complexities are near-linear in $n$ and do not depend on the alphabet $\Sigma$ as long as $|\Sigma|=n^{\mathcal{O}(1)}$, which is true for virtually any real scenario.

  2. Index Size: By Theorem 5, $\mathcal{I}_{\ell}(T)$ can occupy $\mathcal{O}(|\mathcal{A}_{\ell}(T)|)$ space. By Lemma 6, the size of $\mathcal{A}_{\ell}(T)$ is $\mathcal{O}(n/\ell)$ in expectation, and so $\mathcal{I}_{\ell}(T)$ can also be of size $\mathcal{O}(n/\ell)$ in expectation. In practice, the size depends on $T$ and on the implementation of the 2D range reporting data structure.

  3. Querying: The $\mathcal{I}_{\ell}(T)$ index answers on-line pattern searches in near-optimal time.

  4. Flexibility: Note that one would have to reconstruct a (hash-based) index over the set of $(w,k)$-minimizers in order to increase its specificity or sensitivity: increasing $k$ increases the specificity and decreases the sensitivity. Our $\mathcal{I}_{\ell}(T)$ index, conceptually truncated at string depth $k$, is essentially an index over $(w,k)$-minimizers which additionally wrap around. We can thus increase specificity by considering larger $\alpha,\beta$ values or increase sensitivity by considering smaller $\alpha,\beta$ values. This effect can be realized without reconstructing our $\mathcal{I}_{\ell}(T)$ index: we simply adapt $\alpha$ and $\beta$ at query time.

5 Similarity Search under Edit Distance

We show how bd-anchors can be applied to speed up similarity search under edit distance. This is a fundamental problem with myriad applications in bioinformatics, databases, data mining, and information retrieval. It has thus been studied extensively in the literature, both from a theoretical and from a practical point of view [38, 9, 11, 46, 68, 71, 57, 65, 64, 17, 34, 69, 70]. Let $\mathcal{D}$ be a collection of strings called dictionary. We focus, in particular, on indexing $\mathcal{D}$ for answering the following type of top-$K$ queries: Given a query string $Q$ and an integer $K$, return $K$ strings from the dictionary that are closest to $Q$ with respect to edit distance. We follow a typical seed-chain-align approach, as used by several bioinformatics applications [1, 16, 47, 48]. The main new ingredients we inject, with respect to this classic approach, are that we use: (1) bd-anchors as seeds; and (2) $\mathcal{I}_{\ell}$ to index the dictionary $\mathcal{D}$, for some integer parameter $\ell>0$.

Construction.

We require an integer parameter $\ell>0$ defining the order of the bd-anchors. We set $T=S_{1}\cdots S_{|\mathcal{D}|}$, where $S_{i}\in\mathcal{D}$, compute the bd-anchors of order $\ell$ of $T$, and construct the $\mathcal{I}_{\ell}(T)$ index (see Section 4) using the bd-anchors.
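
A minimal sketch (C++17) of this construction step follows: it concatenates the dictionary strings, records where each one starts, and maps a position of $T$ back to the dictionary string containing it, which is needed when collecting hits per string in the querying phase. The bd-anchor computation and the index of Section 4 are omitted, and all names are ours.

    // Minimal sketch of concatenating the dictionary D into T.
    #include <algorithm>
    #include <iostream>
    #include <string>
    #include <vector>

    struct Dictionary {
        std::string T;                  // T = S_1 S_2 ... S_{|D|}
        std::vector<size_t> start;      // start[i] = starting position of S_{i+1} in T

        explicit Dictionary(const std::vector<std::string>& D) {
            for (const std::string& S : D) {
                start.push_back(T.size());
                T += S;
            }
        }
        // Index (0-based) of the dictionary string covering position pos of T.
        size_t owner(size_t pos) const {
            return std::upper_bound(start.begin(), start.end(), pos) - start.begin() - 1;
        }
    };

    int main() {
        Dictionary dict({"GATTACA", "ACGT", "TTAGGG"});
        // A hit reported by the index at position 9 of T falls inside S_2 = "ACGT".
        std::cout << "position 9 belongs to string #" << dict.owner(9) + 1 << '\n';
    }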

Querying.

We require two parameters $\tau\geq 0$ and $\delta\geq 0$. The former parameter controls the sensitivity of our filtering step (Step 2 below); and the latter one controls the sensitivity of our verification step (Step 3 below). Both parameters trade accuracy for speed.

  1. For each query string $Q$, we compute the bd-anchors of order $\ell$. For every bd-anchor $j_{Q}$, we take an arbitrary fragment (e.g., the leftmost) of length $\ell$ anchored at $j_{Q}$ as the seed. Let this fragment start at position $i_{Q}$. This implies a value for $\alpha$ and $\beta$, with $\alpha+\beta=\ell+1$; specifically, for $Q[i_{Q}\mathinner{.\,.}i_{Q}+\ell-1]$ we have $Q[i_{Q}\mathinner{.\,.}j_{Q}]=Q[j_{Q}-\alpha+1\mathinner{.\,.}j_{Q}]$ and $Q[j_{Q}\mathinner{.\,.}i_{Q}+\ell-1]=Q[j_{Q}\mathinner{.\,.}j_{Q}+\beta-1]$. For every bd-anchor $j_{Q}$, we query $\overleftarrow{Q[j_{Q}-\alpha+1\mathinner{.\,.}j_{Q}]}$ in $\mathcal{T}^{L}_{\ell}(T)$ and $Q[j_{Q}\mathinner{.\,.}j_{Q}+\beta-1]$ in $\mathcal{T}^{R}_{\ell}(T)$, and collect all $(\alpha,\beta)$-hits.

  2. Let $\tau\geq 0$ be an input parameter and let $L_{Q,S}=(q_{1},s_{1}),\ldots,(q_{h},s_{h})$ be the list of all $(\alpha,\beta)$-hits between the queried fragments of string $Q$ and fragments of a string $S\in\mathcal{D}$. If $h<\tau$, we consider string $S$ as not found. The intuition here is that if $Q$ and $S$ are sufficiently close with respect to edit distance, they would have a relatively long $L_{Q,S}$ [16]. If $h\geq\tau$, we sort the elements of $L_{Q,S}$ with respect to their first component. (This comes for free because we process $Q$ from left to right.) We then compute a longest increasing subsequence (LIS) in $L_{Q,S}$ with respect to the second component, which chains the $(\alpha,\beta)$-hits, in $\mathcal{O}(h\log h)$ time [59] per $L_{Q,S}$ list (see the sketch after this list). We use the LIS of $L_{Q,S}$ to estimate the identity score (total number of matching letters in a fixed alignment) for $Q$ and $S$, which we denote by $E_{Q,S}$, based on the occurrences of the $(\alpha,\beta)$-hits in the LIS.

  3. Let $\delta\geq 0$ be an input parameter and let $E_{K}$ be the $K$th largest estimated identity score. We extract, as candidates, the strings whose estimated identity score is at least $E_{K}-\delta$. For every candidate string $S$, we close the gaps between the occurrences of the $(\alpha,\beta)$-hits in the LIS using dynamic programming [45], thus computing an upper bound on the edit distance between $Q$ and $S$ (UB score). In particular, closing the gaps consists in summing up the exact edit distance for all pairs of fragments (one from $S$ and one from $Q$) that lie in between the $(\alpha,\beta)$-hits. We return the $K$ candidate strings with the lowest UB scores. If $\delta=0$, we return the $K$ strings with the highest $E_{Q,S}$ scores.
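
Step 2 above chains the $(\alpha,\beta)$-hits by computing a longest increasing subsequence of their positions in $S$, after sorting by their positions in $Q$. A minimal $\mathcal{O}(h\log h)$ sketch (C++17) of this chaining follows; the naming and the choice of strictly increasing positions are our own.

    // Minimal sketch of O(h log h) chaining of hits via a longest increasing subsequence.
    #include <algorithm>
    #include <iostream>
    #include <utility>
    #include <vector>

    // hits: (position in Q, position in S), sorted by position in Q.
    // Returns the indices (into hits) of one longest chain.
    static std::vector<size_t> chain(const std::vector<std::pair<size_t, size_t>>& hits) {
        std::vector<size_t> tail_val, tail_idx, parent(hits.size());
        for (size_t i = 0; i < hits.size(); ++i) {
            size_t s = hits[i].second;
            size_t pos = std::lower_bound(tail_val.begin(), tail_val.end(), s) - tail_val.begin();
            parent[i] = pos ? tail_idx[pos - 1] : hits.size();   // predecessor in the chain
            if (pos == tail_val.size()) { tail_val.push_back(s); tail_idx.push_back(i); }
            else                        { tail_val[pos] = s;     tail_idx[pos] = i;     }
        }
        std::vector<size_t> lis;
        if (!tail_idx.empty())
            for (size_t i = tail_idx.back(); i != hits.size(); i = parent[i])
                lis.push_back(i);
        std::reverse(lis.begin(), lis.end());
        return lis;
    }

    int main() {
        std::vector<std::pair<size_t, size_t>> L = {{2, 30}, {5, 10}, {9, 40}, {14, 25}, {20, 50}};
        for (size_t i : chain(L))
            std::cout << "(" << L[i].first << "," << L[i].second << ") ";
        std::cout << '\n';   // prints one longest chain: (5,10) (14,25) (20,50)
    }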

Index Evaluation.

We compared our algorithm, called BDA Search, to Min Search, the state-of-the-art tool for top-$K$ similarity search under edit distance proposed by Zhang and Zhang in [70]. The main concept used in Min Search is the rank of a letter in a string, defined as the size of the neighborhood of the string in which the letter has the minimum hash value. Based on this concept, Min Search partitions each string in the dictionary $\mathcal{D}$ into a hierarchy of substrings and then builds an index comprised of a set of hash tables, so that strings having common substrings and thus small edit distance are grouped into the same hash table. To find the top-$K$ closest strings to a query string, Min Search partitions the query string based on the ranks of its letters and then traverses the hash tables comprising the index. Thanks to the index and the use of several filtering tricks, Min Search is at least one order of magnitude faster with respect to query time than popular alternatives [69, 71, 18].

We implemented two versions of BDA Search: BDA Search v1, which is based on BDA Index v1; and BDA Search v2, which is based on BDA Index v2. For Min Search, we used the C++ implementation from https://github.com/kedayuge/Search.

We constructed synthetic datasets, referred to as SYN, in a way that enables us to study the impact of different parameters and to efficiently identify the ground truth (the top-$K$ closest strings to a query string with respect to edit distance). Specifically, we first generated 50 query strings and then constructed a cluster of $K$ strings around each query string. To generate the query strings, we started from an arbitrary string $Q$ of length $|Q|=1000$ from a real dataset of protein sequences, used in [70], and generated a string $Q^{\prime}$ that is at edit distance $e$ from $Q$, by performing $e$ edit distance operations, each with equal probability. Then, we treated $Q^{\prime}$ as $Q$ and repeated the process to generate the next query string. To create the clusters, we first added each query string into an initially empty cluster and then added $K-1$ strings, each at edit distance at most $e^{\prime}<e$ from the query string. The strings were generated by performing at most $e^{\prime}$ edit distance operations, each with equal probability. Thus, each cluster contains the top-$K$ closest strings to the query string of the cluster. We used $K\in\{5,10,15,20,25\}$, $d=\frac{e}{|Q|}\in\{0.1,0.15,0.2,0.25,0.3\}$, and $d^{\prime}=\frac{e^{\prime}}{|Q|}=d-0.05$. We evaluated query answering accuracy using the F1 score [52], expressed as the harmonic mean of precision and recall. (Precision is the ratio between the number of returned strings that are among the top-$K$ closest strings to a query string and the number of all returned strings. Recall is the ratio between the number of returned strings that are among the top-$K$ closest strings to a query string and $K$. Since all tested algorithms return $K$ strings, the F1 score in our experiments is equal to precision and equal to recall.) For BDA Search, we report results for $\tau=0$ (full sensitivity during filtering) and $\delta=0$ (no sensitivity during verification), as this setting was empirically determined to be a reasonable trade-off between accuracy and speed. For Min Search, we report results using its default parameters from [70].
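
For concreteness, the following is a minimal sketch (C++17) of the mutation step used in this kind of data generation: derive a string from $Q$ by applying $e$ edit operations (substitution, insertion, deletion), each type chosen with equal probability. The alphabet, the random positions, and the seed are our own illustrative choices; note that, in this naive sketch, $e$ random operations guarantee edit distance at most $e$.

    // Minimal sketch of generating a string by e random edit operations.
    #include <iostream>
    #include <random>
    #include <string>

    static std::string mutate(std::string s, size_t e, std::mt19937& gen) {
        const std::string alphabet = "ACDEFGHIKLMNPQRSTVWY";   // protein letters
        std::uniform_int_distribution<int> op(0, 2);
        std::uniform_int_distribution<size_t> letter(0, alphabet.size() - 1);
        for (size_t t = 0; t < e && !s.empty(); ++t) {
            std::uniform_int_distribution<size_t> pos(0, s.size() - 1);
            size_t p = pos(gen);
            switch (op(gen)) {
                case 0:  s[p] = alphabet[letter(gen)]; break;                   // substitution
                case 1:  s.insert(s.begin() + p, alphabet[letter(gen)]); break; // insertion
                default: s.erase(s.begin() + p); break;                         // deletion
            }
        }
        return s;
    }

    int main() {
        std::mt19937 gen(42);
        std::string Q(1000, 'A');                 // stand-in for a real protein string
        std::string Qprime = mutate(Q, 100, gen); // at edit distance at most 100 from Q
        std::cout << "|Q'| = " << Qprime.size() << '\n';
    }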

We plot the F1 scores and average query times in Figures 7 and 8, respectively. All methods achieved almost perfect accuracy in all tested cases. BDA Search slightly outperformed Min Search (by up to $1.1\%$), remaining accurate even for large $\ell$; the changes in the F1 score of Min Search as $\ell$ varies occur because the underlying method is randomized. However, both versions of BDA Search were more than one order of magnitude faster than Min Search on average (over all results of Figure 8), with BDA Search v1 being 2.9 times slower than BDA Search v2 on average, due to the inefficiency of the range tree implementation of CGAL. Furthermore, both versions of BDA Search scaled better with respect to $K$. For example, the average query time of BDA Search v1 became 2 times larger when $K$ increased from 5 to 25 (on average over $\ell$ values), while that of Min Search became 5.4 times larger on average. The reason is that verification in Min Search, which increases the accuracy of this method, becomes increasingly expensive as $K$ grows. The peak memory usage for these experiments is reported in Figure 9. Although Min Search outperformed BDA Search in terms of memory usage, BDA Search v2 still required a very small amount of memory (less than 1 GB). BDA Search v1 required more memory for the reasons discussed in Section 4.

Figure 7: F1 score vs. (a) $d$, $d^{\prime}$, $\ell$, for $K=20$, and (b) $\ell$, $K$, for $d=0.15$ and $d^{\prime}=0.1$ (dataset SYN).

Figure 8: Average query time (ms) vs. (a) $d$, $d^{\prime}$, $\ell$, for $K=20$, and (b) $\ell$, $K$, for $d=0.15$ and $d^{\prime}=0.1$ (dataset SYN).

Figure 9: Peak memory usage (GB) vs. (a) $d$, $d^{\prime}$, $\ell$, for $K=20$, and (b) $\ell$, $K$, for $d=0.15$ and $d^{\prime}=0.1$ (dataset SYN).

Discussion.

BDA Search outperforms Min Search in accuracy while being more than one order of magnitude faster in query time. These results are very encouraging because the efficiency of BDA Search is entirely due to injecting bd-anchors and not due to any further filtering tricks, such as those employed by Min Search. Min Search clearly outperforms BDA Search in memory usage, although the memory usage of BDA Search v2 is still quite modest. We defer an experimental evaluation using real datasets to the journal version of our work.

6 Other Works on Improving Minimizers

Although every sampling mechanism based on minimizers primarily aims at satisfying Properties 1 and 2, different mechanisms employ total orders that lead to substantially different total numbers of selected minimizers. Thus, research on minimizers has focused on determining total orders which lead to the lowest possible density (recall that the density is defined as the number of selected length-$k$ substrings over the length of the input string). In fact, much of the literature focuses on the average case [56, 54, 53, 22, 72]; namely, the lowest expected density when the input string is random. In practice, many works use a “random minimizer” where the order is defined by choosing a permutation of all the length-$k$ strings at random (e.g., by using a hash function, such as the Karp-Rabin fingerprints [41], on the length-$k$ strings). Such a randomized mechanism has the benefit of being easy to implement and providing good expected performance in practice.
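
A minimal sketch (C++17) of such a random minimizer follows; it uses std::hash on the $k$-mers as a stand-in for a Karp-Rabin fingerprint and recomputes each hash from scratch per window, which are our own simplifications.

    // Minimal sketch of a "random minimizer": in every window of w consecutive
    // k-mers, select the k-mer with the smallest hash value (leftmost on ties).
    #include <functional>
    #include <iostream>
    #include <set>
    #include <string>

    int main() {
        const std::string T = "ACGTACGGTACGT";
        const size_t w = 4, k = 3;
        std::hash<std::string> h;

        std::set<size_t> selected;   // positions of the selected k-mers
        for (size_t i = 0; i + w + k - 1 <= T.size(); ++i) {
            size_t best = i;
            for (size_t j = i + 1; j < i + w; ++j)
                if (h(T.substr(j, k)) < h(T.substr(best, k))) best = j;
            selected.insert(best);
        }
        for (size_t p : selected) std::cout << p << ":" << T.substr(p, k) << ' ';
        std::cout << '\n';
    }

A rolling fingerprint would avoid recomputing each hash; the window scan itself stays the same.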

Minimizers and Universal Hitting Sets.

A universal hitting set (UHS) is an unavoidable set of length-$k$ strings, i.e., it is a set of length-$k$ strings that “hits” every $(w+k-1)$-long fragment of every possible string. The theory of universal hitting sets [56, 53, 43, 73] plays an important role in the current theory of minimizers with low density on average. In particular, if a UHS has small size, it generates minimizers with a provable upper bound on their density. However, UHSs are less useful in the string-specific case for two reasons [74]: (1) the requirement that a UHS has to hit every $(w+k-1)$-long fragment of every possible string is too strong; and (2) UHSs are too large to provide a meaningful upper bound on the density in the string-specific case. Therefore, since in many practical scenarios the input string is known and does not change frequently, we try to optimize the density for one particular string instead of optimizing the average density over a random input.

String-Specific Minimizers.

In the string-specific case, minimizers sampling mechanisms may employ frequency-based orders [10, 37]. In these orders, length-$k$ strings occurring less frequently in the string are smaller than the ones occurring more frequently. The intuition [74] is to obtain a sparse sampling by selecting infrequent length-$k$ strings, which should be spread apart in the string. However, there is no theoretical guarantee that a frequency-based order gives low-density minimizers (there are many counter-examples). Furthermore, frequency-based orders do not always give minimizers with lower density in practice. For instance, the two-tier classification (very frequent vs. less frequent length-$k$ strings) in the work of [37] outperforms an order that strictly follows the frequency of occurrence.
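
The following is a minimal sketch (C++17) of our own rendering of such a frequency-based order: length-$k$ strings occurring less often in $T$ compare smaller, with lexicographic tie-breaking, and minimizers are then selected under this order.

    // Minimal sketch of a string-specific, frequency-based minimizer order.
    #include <iostream>
    #include <set>
    #include <string>
    #include <unordered_map>

    int main() {
        const std::string T = "ACGTACGTTTTTACGA";
        const size_t w = 4, k = 3;

        // Count the occurrences of every k-mer of T.
        std::unordered_map<std::string, size_t> freq;
        for (size_t i = 0; i + k <= T.size(); ++i) ++freq[T.substr(i, k)];

        // less(a, b): true if the k-mer at a precedes the k-mer at b in the order.
        auto less = [&](size_t a, size_t b) {
            const std::string x = T.substr(a, k), y = T.substr(b, k);
            if (freq[x] != freq[y]) return freq[x] < freq[y];
            return x < y;                          // lexicographic tie-breaking
        };

        std::set<size_t> selected;
        for (size_t i = 0; i + w + k - 1 <= T.size(); ++i) {
            size_t best = i;
            for (size_t j = i + 1; j < i + w; ++j)
                if (less(j, best)) best = j;
            selected.insert(best);
        }
        for (size_t p : selected) std::cout << p << ":" << T.substr(p, k) << ' ';
        std::cout << '\n';
    }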

A different approach to constructing string-specific minimizers is to start from a UHS and to remove elements from it, as long as it still hits every $(w+k-1)$-long fragment of the input string [15]. Since this approach starts with a UHS that is not related to the string, the improvement in density may not be significant [74]. Additionally, current methods [22] employing this approach are computationally limited to using $k\leq 16$, as the size of the UHS increases exponentially with $k$. Using such small $k$ values may not be appropriate in some applications.

Other Improvements.

When $k\approx w$, minimizers with expected density of $1.67/w+o(1/w)$ on a random string can be constructed using the approach of [72]. Such minimizers have guaranteed expected density less than $2/(w+1)$ and work for infinitely many $w$ and $k$. The approach of [72] also does not require the use of expensive heuristics to precompute and store a large set of length-$k$ strings, unlike some methods [56, 15, 22] with low density in practice.

The notion of a polar set, which can be seen as complementary to that of a UHS, was recently introduced in [74]. While a UHS is a set of length-$k$ strings that intersect with every $(w+k-1)$-long fragment at least once, a polar set is a set of length-$k$ strings that intersect with any fragment at most once. The construction of a polar set builds upon sets of length-$k$ strings that are sparse in the input string. Thus, the minimizers derived from these polar sets have provably tight bounds on their density. Unfortunately, computing optimal polar sets is NP-hard, as shown in [74]. Thus, the work of [74] also proposed a heuristic for computing feasible “good enough” polar sets. A main disadvantage of this approach is that, when each length-$k$ string occurs frequently in the input string, it becomes hard to select many length-$k$ strings without violating the polar set condition.

References

  • [1] Stephen F. Altschul, Warren Gish, Webb Miller, Eugene W. Myers, and David J. Lipman. Basic local alignment search tool. Journal of Molecular Biology, 215(3):403–410, 1990. doi:10.1016/S0022-2836(05)80360-2.
  • [2] Amihood Amir, Dmitry Keselman, Gad M. Landau, Moshe Lewenstein, Noa Lewenstein, and Michael Rodeh. Text indexing and dictionary matching with one error. J. Algorithms, 37(2):309–325, 2000. URL: https://doi.org/10.1006/jagm.2000.1104, doi:10.1006/jagm.2000.1104.
  • [3] Carl Barton, Tomasz Kociumaka, Chang Liu, Solon P. Pissis, and Jakub Radoszewski. Indexing weighted sequences: Neat and efficient. Inf. Comput., 270, 2020. URL: https://doi.org/10.1016/j.ic.2019.104462, doi:10.1016/j.ic.2019.104462.
  • [4] Djamal Belazzougui. Linear time construction of compressed text indices in compact space. In Symposium on Theory of Computing, STOC 2014, New York, NY, USA, May 31 - June 03, 2014, pages 148–193, 2014. doi:10.1145/2591796.2591885.
  • [5] Djamal Belazzougui and Simon J. Puglisi. Range predecessor and lempel-ziv parsing. In Robert Krauthgamer, editor, Proceedings of the Twenty-Seventh Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2016, Arlington, VA, USA, January 10-12, 2016, pages 2053–2071. SIAM, 2016. URL: https://doi.org/10.1137/1.9781611974331.ch143, doi:10.1137/1.9781611974331.ch143.
  • [6] Kellogg S. Booth. Lexicographically least circular substrings. Inf. Process. Lett., 10(4/5):240–242, 1980. URL: https://doi.org/10.1016/0020-0190(80)90149-0, doi:10.1016/0020-0190(80)90149-0.
  • [7] Timothy M. Chan, Kasper Green Larsen, and Mihai Patrascu. Orthogonal range searching on the RAM, revisited. In Ferran Hurtado and Marc J. van Kreveld, editors, Proceedings of the 27th ACM Symposium on Computational Geometry, Paris, France, June 13-15, 2011, pages 1–10. ACM, 2011. URL: https://doi.org/10.1145/1998196.1998198, doi:10.1145/1998196.1998198.
  • [8] Panagiotis Charalampopoulos, Costas S. Iliopoulos, Chang Liu, and Solon P. Pissis. Property suffix array with applications in indexing weighted sequences. ACM J. Exp. Algorithmics, 25, April 2020. URL: https://doi.org/10.1145/3385898, doi:10.1145/3385898.
  • [9] Surajit Chaudhuri, Kris Ganjam, Venkatesh Ganti, and Rajeev Motwani. Robust and efficient fuzzy match for online data cleaning. In Alon Y. Halevy, Zachary G. Ives, and AnHai Doan, editors, Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, San Diego, California, USA, June 9-12, 2003, pages 313–324. ACM, 2003. URL: https://doi.org/10.1145/872757.872796, doi:10.1145/872757.872796.
  • [10] Rayan Chikhi, Antoine Limasset, and Paul Medvedev. Compacting de Bruijn graphs from sequencing data quickly and in low memory. Bioinform., 32(12):201–208, 2016. URL: https://doi.org/10.1093/bioinformatics/btw279, doi:10.1093/bioinformatics/btw279.
  • [11] Richard Cole, Lee-Ad Gottlieb, and Moshe Lewenstein. Dictionary matching and indexing with errors and don’t cares. In László Babai, editor, Proceedings of the 36th Annual ACM Symposium on Theory of Computing, Chicago, IL, USA, June 13-16, 2004, pages 91–100. ACM, 2004. URL: https://doi.org/10.1145/1007352.1007374, doi:10.1145/1007352.1007374.
  • [12] Richard Cole, Tsvi Kopelowitz, and Moshe Lewenstein. Suffix trays and suffix trists: Structures for faster text indexing. Algorithmica, 72(2):450–466, 2015. doi:10.1007/s00453-013-9860-6.
  • [13] Maxime Crochemore, Christophe Hancart, and Thierry Lecroq. Algorithms on strings. Cambridge University Press, 2007.
  • [14] Mark de Berg, Otfried Cheong, Marc J. van Kreveld, and Mark H. Overmars. Computational geometry: algorithms and applications, 3rd Edition. Springer, 2008. URL: https://www.worldcat.org/oclc/227584184.
  • [15] Dan F. DeBlasio, Fiyinfoluwa Gbosibo, Carl Kingsford, and Guillaume Marçais. Practical universal k-mer sets for minimizer schemes. In Xinghua Mindy Shi, Michael Buck, Jian Ma, and Pierangelo Veltri, editors, Proceedings of the 10th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, BCB 2019, Niagara Falls, NY, USA, September 7-10, 2019, pages 167–176. ACM, 2019. URL: https://doi.org/10.1145/3307339.3342144, doi:10.1145/3307339.3342144.
  • [16] Arthur L. Delcher, Simon Kasif, Robert D. Fleischmann, Jeremy Peterson, Owen White, and Steven L. Salzberg. Alignment of whole genomes. Nucleic Acids Research, 27(11):2369–2376, 1999. URL: https://doi.org/10.1093/nar/27.11.2369, doi:10.1093/nar/27.11.2369.
  • [17] Dong Deng, Guoliang Li, and Jianhua Feng. A pivotal prefix based filtering algorithm for string similarity search. In Curtis E. Dyreson, Feifei Li, and M. Tamer Özsu, editors, International Conference on Management of Data, SIGMOD 2014, Snowbird, UT, USA, June 22-27, 2014, pages 673–684. ACM, 2014. URL: https://doi.org/10.1145/2588555.2593675, doi:10.1145/2588555.2593675.
  • [18] Dong Deng, Guoliang Li, Jianhua Feng, and Wen-Syan Li. Top-k string similarity search with edit-distance constraints. In Christian S. Jensen, Christopher M. Jermaine, and Xiaofang Zhou, editors, 29th IEEE International Conference on Data Engineering, ICDE 2013, Brisbane, Australia, April 8-12, 2013, pages 925–936. IEEE Computer Society, 2013. URL: https://doi.org/10.1109/ICDE.2013.6544886, doi:10.1109/ICDE.2013.6544886.
  • [19] Sebastian Deorowicz, Marek Kokot, Szymon Grabowski, and Agnieszka Debudaj-Grabysz. KMC 2: fast and resource-frugal k-mer counting. Bioinform., 31(10):1569–1576, 2015. URL: https://doi.org/10.1093/bioinformatics/btv022, doi:10.1093/bioinformatics/btv022.
  • [20] Patrick Dinklage, Johannes Fischer, Alexander Herlez, Tomasz Kociumaka, and Florian Kurpicz. Practical Performance of Space Efficient Data Structures for Longest Common Extensions. In Fabrizio Grandoni, Grzegorz Herman, and Peter Sanders, editors, 28th Annual European Symposium on Algorithms (ESA 2020), volume 173 of Leibniz International Proceedings in Informatics (LIPIcs), pages 39:1–39:20, Dagstuhl, Germany, 2020. Schloss Dagstuhl–Leibniz-Zentrum für Informatik. URL: https://drops.dagstuhl.de/opus/volltexte/2020/12905, doi:10.4230/LIPIcs.ESA.2020.39.
  • [21] Irit Dinur and Shmuel Safra. The importance of being biased. In John H. Reif, editor, Proceedings on 34th Annual ACM Symposium on Theory of Computing, May 19-21, 2002, Montréal, Québec, Canada, pages 33–42. ACM, 2002. URL: https://doi.org/10.1145/509907.509915, doi:10.1145/509907.509915.
  • [22] Baris Ekim, Bonnie Berger, and Yaron Orenstein. A randomized parallel algorithm for efficiently finding near-optimal universal hitting sets. In Russell Schwartz, editor, Research in Computational Molecular Biology - 24th Annual International Conference, RECOMB 2020, Padua, Italy, May 10-13, 2020, Proceedings, volume 12074 of Lecture Notes in Computer Science, pages 37–53. Springer, 2020. URL: https://doi.org/10.1007/978-3-030-45257-5_3, doi:10.1007/978-3-030-45257-5\_3.
  • [23] Martin Farach. Optimal suffix tree construction with large alphabets. In 38th Annual Symposium on Foundations of Computer Science, FOCS ’97, Miami Beach, Florida, USA, October 19-22, 1997, pages 137–143, 1997. URL: https://doi.org/10.1109/SFCS.1997.646102, doi:10.1109/SFCS.1997.646102.
  • [24] Paolo Ferragina and Giovanni Manzini. Indexing compressed text. J. ACM, 52(4):552–581, 2005. doi:10.1145/1082036.1082039.
  • [25] Paolo Ferragina and Gonzalo Navarro. Pizza&Chili corpus – compressed indexes and their testbeds. http://pizzachili.dcc.uchile.cl/texts.html.
  • [26] Vissarion Fisikopoulos. An implementation of range trees with fractional cascading in C++. CoRR, abs/1103.4521, 2011. URL: http://arxiv.org/abs/1103.4521, arXiv:1103.4521.
  • [27] Michael L. Fredman, János Komlós, and Endre Szemerédi. Storing a sparse table with $\mathcal{O}(1)$ worst case access time. J. ACM, 31(3):538–544, 1984. doi:10.1145/828.1884.
  • [28] Travis Gagie, Gonzalo Navarro, and Nicola Prezza. Fully functional suffix trees and optimal text searching in BWT-runs bounded space. J. ACM, 67(1):2:1–2:54, 2020. doi:10.1145/3375890.
  • [29] Younan Gao, Meng He, and Yakov Nekrich. Fast preprocessing for optimal orthogonal range reporting and range successor with applications to text indexing. In Fabrizio Grandoni, Grzegorz Herman, and Peter Sanders, editors, 28th Annual European Symposium on Algorithms (ESA 2020), volume 173 of Leibniz International Proceedings in Informatics (LIPIcs), pages 54:1–54:18, Dagstuhl, Germany, 2020. Schloss Dagstuhl–Leibniz-Zentrum für Informatik. URL: https://drops.dagstuhl.de/opus/volltexte/2020/12920, doi:10.4230/LIPIcs.ESA.2020.54.
  • [30] Simon Gog, Timo Beller, Alistair Moffat, and Matthias Petri. From theory to practice: Plug and play with succinct data structures. In Joachim Gudmundsson and Jyrki Katajainen, editors, Experimental Algorithms - 13th International Symposium, SEA 2014, Copenhagen, Denmark, June 29 - July 1, 2014. Proceedings, volume 8504 of Lecture Notes in Computer Science, pages 326–337. Springer, 2014. URL: https://doi.org/10.1007/978-3-319-07959-2_28, doi:10.1007/978-3-319-07959-2\_28.
  • [31] Szymon Grabowski and Marcin Raniszewski. Sampled suffix array with minimizers. Softw. Pract. Exp., 47(11):1755–1771, 2017. URL: https://doi.org/10.1002/spe.2481, doi:10.1002/spe.2481.
  • [32] Roberto Grossi and Jeffrey Scott Vitter. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM J. Comput., 35(2):378–407, 2005. doi:10.1137/S0097539702402354.
  • [33] Wing-Kai Hon, Kunihiko Sadakane, and Wing-Kin Sung. Breaking a time-and-space barrier in constructing full-text indices. SIAM J. Comput., 38(6):2162–2178, 2009. doi:10.1137/070685373.
  • [34] Huiqi Hu, Guoliang Li, Zhifeng Bao, Jianhua Feng, Yongwei Wu, Zhiguo Gong, and Yaoqiang Xu. Top-k spatio-textual similarity join. IEEE Trans. Knowl. Data Eng., 28(2):551–565, 2016. URL: https://doi.org/10.1109/TKDE.2015.2485213, doi:10.1109/TKDE.2015.2485213.
  • [35] Chirag Jain, Alexander T. Dilthey, Sergey Koren, Srinivas Aluru, and Adam M. Phillippy. A fast approximate algorithm for mapping long reads to large reference databases. J. Comput. Biol., 25(7):766–779, 2018. URL: https://doi.org/10.1089/cmb.2018.0036, doi:10.1089/cmb.2018.0036.
  • [36] Chirag Jain, Sergey Koren, Alexander T. Dilthey, Adam M. Phillippy, and Srinivas Aluru. A fast adaptive algorithm for computing whole-genome homology maps. Bioinform., 34(17):i748–i756, 2018. URL: https://doi.org/10.1093/bioinformatics/bty597, doi:10.1093/bioinformatics/bty597.
  • [37] Chirag Jain, Arang Rhie, Haowen Zhang, Claudia Chu, Brian Walenz, Sergey Koren, and Adam M. Phillippy. Weighted minimizer sampling improves long read mapping. Bioinform., 36(Supplement-1):i111–i118, 2020. URL: https://doi.org/10.1093/bioinformatics/btaa435, doi:10.1093/bioinformatics/btaa435.
  • [38] Tamer Kahveci and Ambuj K. Singh. Efficient index structures for string databases. In Peter M. G. Apers, Paolo Atzeni, Stefano Ceri, Stefano Paraboschi, Kotagiri Ramamohanarao, and Richard T. Snodgrass, editors, VLDB 2001, Proceedings of 27th International Conference on Very Large Data Bases, September 11-14, 2001, Roma, Italy, pages 351–360. Morgan Kaufmann, 2001. URL: http://www.vldb.org/conf/2001/P351.pdf.
  • [39] Juha Kärkkäinen, Peter Sanders, and Stefan Burkhardt. Linear work suffix array construction. J. ACM, 53(6):918–936, 2006. doi:10.1145/1217856.1217858.
  • [40] Richard M. Karp. Reducibility among combinatorial problems. In Raymond E. Miller and James W. Thatcher, editors, Proceedings of a symposium on the Complexity of Computer Computations, held March 20-22, 1972, at the IBM Thomas J. Watson Research Center, Yorktown Heights, New York, USA, The IBM Research Symposia Series, pages 85–103. Plenum Press, New York, 1972. URL: https://doi.org/10.1007/978-1-4684-2001-2_9, doi:10.1007/978-1-4684-2001-2\_9.
  • [41] Richard M. Karp and Michael O. Rabin. Efficient randomized pattern-matching algorithms. IBM J. Res. Dev., 31(2):249–260, 1987. URL: https://doi.org/10.1147/rd.312.0249, doi:10.1147/rd.312.0249.
  • [42] Toru Kasai, Gunho Lee, Hiroki Arimura, Setsuo Arikawa, and Kunsoo Park. Linear-time longest-common-prefix computation in suffix arrays and its applications. In Amihood Amir and Gad M. Landau, editors, Combinatorial Pattern Matching, 12th Annual Symposium, CPM 2001 Jerusalem, Israel, July 1-4, 2001 Proceedings, volume 2089 of Lecture Notes in Computer Science, pages 181–192. Springer, 2001. URL: https://doi.org/10.1007/3-540-48194-X_17, doi:10.1007/3-540-48194-X\_17.
  • [43] Dominik Kempa and Tomasz Kociumaka. String synchronizing sets: sublinear-time BWT construction and optimal LCE data structure. In Moses Charikar and Edith Cohen, editors, Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, STOC 2019, Phoenix, AZ, USA, June 23-26, 2019, pages 756–767. ACM, 2019. URL: https://doi.org/10.1145/3313276.3316368, doi:10.1145/3313276.3316368.
  • [44] Tomasz Kociumaka. Minimal suffix and rotation of a substring in optimal time. In Roberto Grossi and Moshe Lewenstein, editors, 27th Annual Symposium on Combinatorial Pattern Matching, CPM 2016, June 27-29, 2016, Tel Aviv, Israel, volume 54 of LIPIcs, pages 28:1–28:12. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2016. URL: https://doi.org/10.4230/LIPIcs.CPM.2016.28, doi:10.4230/LIPIcs.CPM.2016.28.
  • [45] Vladimir I. Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10:707, 1966.
  • [46] Chen Li, Bin Wang, and Xiaochun Yang. VGRAM: improving performance of approximate queries on string collections using variable-length grams. In Christoph Koch, Johannes Gehrke, Minos N. Garofalakis, Divesh Srivastava, Karl Aberer, Anand Deshpande, Daniela Florescu, Chee Yong Chan, Venkatesh Ganti, Carl-Christian Kanne, Wolfgang Klas, and Erich J. Neuhold, editors, Proceedings of the 33rd International Conference on Very Large Data Bases, University of Vienna, Austria, September 23-27, 2007, pages 303–314. ACM, 2007. URL: http://www.vldb.org/conf/2007/papers/research/p303-li.pdf.
  • [47] Heng Li. Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences. Bioinformatics, 32(14):2103–2110, 2016. URL: https://doi.org/10.1093/bioinformatics/btw152, doi:10.1093/bioinformatics/btw152.
  • [48] Heng Li. Minimap2: pairwise alignment for nucleotide sequences. Bioinform., 34(18):3094–3100, 2018. URL: https://doi.org/10.1093/bioinformatics/bty191, doi:10.1093/bioinformatics/bty191.
  • [49] Grigorios Loukides and Solon P. Pissis. Bidirectional string anchors: A new string sampling mechanism. In Petra Mutzel, Rasmus Pagh, and Grzegorz Herman, editors, 29th Annual European Symposium on Algorithms, ESA 2021, September 6-8, 2021, Lisbon, Portugal (Virtual Conference), volume 204 of LIPIcs, pages 64:1–64:21. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2021. URL: https://doi.org/10.4230/LIPIcs.ESA.2021.64, doi:10.4230/LIPIcs.ESA.2021.64.
  • [50] Veli Mäkinen and Gonzalo Navarro. Position-restricted substring searching. In José R. Correa, Alejandro Hevia, and Marcos A. Kiwi, editors, LATIN 2006: Theoretical Informatics, 7th Latin American Symposium, Valdivia, Chile, March 20-24, 2006, Proceedings, volume 3887 of Lecture Notes in Computer Science, pages 703–714. Springer, 2006. URL: https://doi.org/10.1007/11682462_64, doi:10.1007/11682462\_64.
  • [51] Udi Manber and Eugene W. Myers. Suffix arrays: A new method for on-line string searches. SIAM J. Comput., 22(5):935–948, 1993. URL: https://doi.org/10.1137/0222058, doi:10.1137/0222058.
  • [52] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, USA, 2008.
  • [53] Guillaume Marçais, Dan F. DeBlasio, and Carl Kingsford. Asymptotically optimal minimizers schemes. Bioinform., 34(13):i13–i22, 2018. URL: https://doi.org/10.1093/bioinformatics/bty258, doi:10.1093/bioinformatics/bty258.
  • [54] Guillaume Marçais, David Pellow, Daniel Bork, Yaron Orenstein, Ron Shamir, and Carl Kingsford. Improving the performance of minimizers and winnowing schemes. Bioinform., 33(14):i110–i117, 2017. URL: https://doi.org/10.1093/bioinformatics/btx235, doi:10.1093/bioinformatics/btx235.
  • [55] J. Ian Munro, Gonzalo Navarro, and Yakov Nekrich. Space-efficient construction of compressed indexes in deterministic linear time. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2017, Barcelona, Spain, Hotel Porta Fira, January 16-19, pages 408–424, 2017. doi:10.1137/1.9781611974782.26.
  • [56] Yaron Orenstein, David Pellow, Guillaume Marçais, Ron Shamir, and Carl Kingsford. Compact universal k-mer hitting sets. In Martin C. Frith and Christian Nørgaard Storm Pedersen, editors, Algorithms in Bioinformatics - 16th International Workshop, WABI 2016, Aarhus, Denmark, August 22-24, 2016. Proceedings, volume 9838 of Lecture Notes in Computer Science, pages 257–268. Springer, 2016. URL: https://doi.org/10.1007/978-3-319-43681-4_21, doi:10.1007/978-3-319-43681-4\_21.
  • [57] Jianbin Qin, Wei Wang, Chuan Xiao, Yifei Lu, Xuemin Lin, and Haixun Wang. Asymmetric signature schemes for efficient exact edit similarity query processing. ACM Trans. Database Syst., 38(3):16:1–16:44, 2013. URL: https://doi.org/10.1145/2508020.2508023, doi:10.1145/2508020.2508023.
  • [58] Michael Roberts, Wayne Hayes, Brian R. Hunt, Stephen M. Mount, and James A. Yorke. Reducing storage requirements for biological sequence comparison. Bioinform., 20(18):3363–3369, 2004. URL: https://doi.org/10.1093/bioinformatics/bth408, doi:10.1093/bioinformatics/bth408.
  • [59] Craige Schensted. Longest increasing and decreasing subsequences. Canadian Journal of Mathematics, 13:179–191, 1961. doi:10.4153/CJM-1961-015-3.
  • [60] Saul Schleimer, Daniel Shawcross Wilkerson, and Alexander Aiken. Winnowing: Local algorithms for document fingerprinting. In Alon Y. Halevy, Zachary G. Ives, and AnHai Doan, editors, Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, San Diego, California, USA, June 9-12, 2003, pages 76–85. ACM, 2003. URL: https://doi.org/10.1145/872757.872770, doi:10.1145/872757.872770.
  • [61] Huei-Jan Shyr and Gabriel Thierrin. Disjunctive languages and codes. In Marek Karpinski, editor, Fundamentals of Computation Theory, Proceedings of the 1977 International FCT-Conference, Poznan-Kórnik, Poland, September 19-23, 1977, volume 56 of Lecture Notes in Computer Science, pages 171–176. Springer, 1977. URL: https://doi.org/10.1007/3-540-08442-8_83, doi:10.1007/3-540-08442-8\_83.
  • [62] Yihan Sun and Guy E. Blelloch. Parallel range, segment and rectangle queries with augmented maps. In Stephen G. Kobourov and Henning Meyerhenke, editors, Proceedings of the Twenty-First Workshop on Algorithm Engineering and Experiments, ALENEX 2019, San Diego, CA, USA, January 7-8, 2019, pages 159–173. SIAM, 2019. URL: https://doi.org/10.1137/1.9781611975499.13, doi:10.1137/1.9781611975499.13.
  • [63] The CGAL Project. CGAL User and Reference Manual. CGAL Editorial Board, 5.2.1 edition, 2021. URL: https://doc.cgal.org/5.2.1/Manual/packages.html.
  • [64] Jiannan Wang, Guoliang Li, and Jianhua Feng. Can we beat the prefix filtering?: an adaptive framework for similarity join and search. In K. Selçuk Candan, Yi Chen, Richard T. Snodgrass, Luis Gravano, and Ariel Fuxman, editors, Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2012, Scottsdale, AZ, USA, May 20-24, 2012, pages 85–96. ACM, 2012. URL: https://doi.org/10.1145/2213836.2213847, doi:10.1145/2213836.2213847.
  • [65] Xiaoli Wang, Xiaofeng Ding, Anthony K. H. Tung, and Zhenjie Zhang. Efficient and effective KNN sequence search with approximate n-grams. Proc. VLDB Endow., 7(1):1–12, 2013. URL: http://www.vldb.org/pvldb/vol7/p1-wang.pdf, doi:10.14778/2732219.2732220.
  • [66] Peter Weiner. Linear pattern matching algorithms. In 14th Annual Symposium on Switching and Automata Theory, Iowa City, Iowa, USA, October 15-17, 1973, pages 1–11, 1973. doi:10.1109/SWAT.1973.13.
  • [67] Derrick E. Wood and Steven L. Salzberg. Kraken: Ultrafast metagenomic sequence classification using exact alignments. Genome Biology, 15(3), March 2014. doi:10.1186/gb-2014-15-3-r46.
  • [68] Zhenglu Yang, Jianjun Yu, and Masaru Kitsuregawa. Fast algorithms for top-k approximate string matching. In Maria Fox and David Poole, editors, Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2010, Atlanta, Georgia, USA, July 11-15, 2010. AAAI Press, 2010. URL: http://www.aaai.org/ocs/index.php/AAAI/AAAI10/paper/view/1939.
  • [69] Minghe Yu, Jin Wang, Guoliang Li, Yong Zhang, Dong Deng, and Jianhua Feng. A unified framework for string similarity search with edit-distance constraint. VLDB J., 26(2):249–274, 2017. URL: https://doi.org/10.1007/s00778-016-0449-y, doi:10.1007/s00778-016-0449-y.
  • [70] Haoyu Zhang and Qin Zhang. Minsearch: An efficient algorithm for similarity search under edit distance. In Rajesh Gupta, Yan Liu, Jiliang Tang, and B. Aditya Prakash, editors, KDD ’20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Virtual Event, CA, USA, August 23-27, 2020, pages 566–576. ACM, 2020. URL: https://doi.org/10.1145/3394486.3403099, doi:10.1145/3394486.3403099.
  • [71] Zhenjie Zhang, Marios Hadjieleftheriou, Beng Chin Ooi, and Divesh Srivastava. Bed-tree: an all-purpose index structure for string similarity search based on edit distance. In Ahmed K. Elmagarmid and Divyakant Agrawal, editors, Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2010, Indianapolis, Indiana, USA, June 6-10, 2010, pages 915–926. ACM, 2010. URL: https://doi.org/10.1145/1807167.1807266, doi:10.1145/1807167.1807266.
  • [72] Hongyu Zheng, Carl Kingsford, and Guillaume Marçais. Improved design and analysis of practical minimizers. Bioinform., 36(Supplement-1):i119–i127, 2020. URL: https://doi.org/10.1093/bioinformatics/btaa472, doi:10.1093/bioinformatics/btaa472.
  • [73] Hongyu Zheng, Carl Kingsford, and Guillaume Marçais. Lower density selection schemes via small universal hitting sets with short remaining path length. In Russell Schwartz, editor, Research in Computational Molecular Biology - 24th Annual International Conference, RECOMB 2020, Padua, Italy, May 10-13, 2020, Proceedings, volume 12074 of Lecture Notes in Computer Science, pages 202–217. Springer, 2020. URL: https://doi.org/10.1007/978-3-030-45257-5_13, doi:10.1007/978-3-030-45257-5\_13.
  • [74] Hongyu Zheng, Carl Kingsford, and Guillaume Marçais. Sequence-specific minimizers via polar sets. bioRxiv, 2021. URL: https://www.biorxiv.org/content/early/2021/02/10/2021.02.01.429246, doi:10.1101/2021.02.01.429246.