
String Sampling with Bidirectional String Anchors

Grigorios Loukides, Department of Informatics, King's College London, London, UK
[email protected]
Solon P. Pissis, CWI, Amsterdam, The Netherlands and Vrije Universiteit, Amsterdam, The Netherlands, [solon.pissis,michelle.sweering]@cwi.nl
Michelle Sweering, CWI, Amsterdam, The Netherlands, [solon.pissis,michelle.sweering]@cwi.nl
Abstract

The minimizers sampling mechanism is a popular mechanism for string sampling introduced independently by Schleimer et al. [SIGMOD 2003] and by Roberts et al. [Bioinf. 2004]. Given two positive integers $w$ and $k$, it selects the lexicographically smallest length-$k$ substring in every fragment of $w$ consecutive length-$k$ substrings (in every sliding window of length $w+k-1$). Minimizers samples are approximately uniform, locally consistent, and computable in linear time. Although they do not have good worst-case guarantees on their size, they are often small in practice. They have thus been successfully employed in several string processing applications. Two main disadvantages of minimizers sampling mechanisms are: first, they do not have good guarantees on the expected size of their samples for every combination of $w$ and $k$; and, second, indexes that are constructed over their samples do not have good worst-case guarantees for on-line pattern searches.

To alleviate these disadvantages, we introduce bidirectional string anchors (bd-anchors), a new string sampling mechanism. Given a positive integer $\ell$, our mechanism selects the lexicographically smallest rotation in every length-$\ell$ fragment (in every sliding window of length $\ell$). We show that bd-anchors samples are also approximately uniform, locally consistent, and computable in linear time. In addition, our experiments using several datasets demonstrate that the bd-anchors sample sizes decrease proportionally to $\ell$; and that these sizes are competitive to or smaller than the minimizers sample sizes using the analogous sampling parameters. We provide theoretical justification for these results by analyzing the expected size of bd-anchors samples. As a negative result, we show that computing a total order $\leq$ on the input alphabet, which minimizes the bd-anchors sample size, is NP-hard.

We also show that by using any bd-anchors sample, we can construct, in near-linear time, an index which requires linear (extra) space in the size of the sample and answers on-line pattern searches in near-optimal time. We further show, using several datasets, that a simple implementation of our index is consistently faster for on-line pattern searches than an analogous implementation of a minimizers-based index [Grabowski and Raniszewski, Softw. Pract. Exp. 2017].

Finally, we highlight the applicability of bd-anchors by developing an efficient and effective heuristic for top-$K$ similarity search under edit distance. We show, using synthetic datasets, that our heuristic is more accurate and more than one order of magnitude faster in top-$K$ similarity searches than the state-of-the-art tool for the same purpose [Zhang and Zhang, KDD 2020].

1 Introduction

The notion of minimizers, introduced independently by Schleimer et al. [60] and by Roberts et al. [58], is a mechanism to sample a set of positions over an input string. The goal of this sampling mechanism is, given a string $T$ of length $n$ over an alphabet $\Sigma$ of size $\sigma$, to simultaneously satisfy the following properties:

Property 1 (approximately uniform sampling):

Every sufficiently long fragment of $T$ has a representative position sampled by the mechanism.

Property 2 (local consistency):

Exact matches between sufficiently long fragments of $T$ are preserved unconditionally by having the same (relative) representative positions sampled by the mechanism.

In most practical scenarios, sampling the smallest number of positions is desirable, as long as Properties 1 and 2 are satisfied. This is because it leads to small data structures or fewer computations. Indeed, the minimizers sampling mechanism satisfies the property of approximately uniform sampling: given two positive integers $w$ and $k$, it selects at least one length-$k$ substring in every fragment of $w$ consecutive length-$k$ substrings (Property 1). Specifically, this is achieved by selecting the starting positions of the smallest length-$k$ substrings in every $(w+k-1)$-long fragment, where smallest is defined by a choice of a total order on the universe of length-$k$ strings. These positions are called the “minimizers”. Thus from similar fragments, similar length-$k$ substrings are sampled (Property 2). In particular, if two strings have a fragment of length $w+k-1$ in common, then they have at least one minimizer corresponding to the same length-$k$ substring. Let us denote by $\mathcal{M}_{w,k}(T)$ the set of minimizers of string $T$. The following example illustrates the sampling.

Example 1.

The set $\mathcal{M}_{w,k}$ of minimizers for $w=k=3$ for string $T=\texttt{aabaaabcbda}$ (using a 1-based index) is $\mathcal{M}_{3,3}(T)=\{1,4,5,6,7\}$ and for string $Q=\texttt{abaaa}$ is $\mathcal{M}_{3,3}(Q)=\{3\}$. Indeed $Q$ occurs at position $2$ in $T$; and $Q$ and $T[2..6]$ have the minimizers $3$ and $4$, respectively, which both correspond to the string aaa of length $k=3$.
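For concreteness, the following brute-force sketch (in C++, the language of our implementations) computes $\mathcal{M}_{w,k}(T)$ exactly as defined above; the function name minimizers is ours and is used here for illustration only, and the sketch is quadratic rather than the linear-time construction of Theorem 1.

    #include <cstddef>
    #include <iostream>
    #include <set>
    #include <string>

    // Brute-force (w,k)-minimizers: for every window T[i..i+w+k-2] (0-based i here),
    // record every position in [i, i+w) where a lexicographically smallest
    // length-k substring of the window starts; positions are reported 1-based.
    std::set<std::size_t> minimizers(const std::string& T, std::size_t w, std::size_t k) {
        std::set<std::size_t> M;
        if (T.size() < w + k - 1) return M;
        for (std::size_t i = 0; i + w + k - 1 <= T.size(); ++i) {
            std::string best = T.substr(i, k);
            for (std::size_t j = i + 1; j < i + w; ++j)
                best = std::min(best, T.substr(j, k));
            for (std::size_t j = i; j < i + w; ++j)
                if (T.compare(j, k, best) == 0) M.insert(j + 1);
        }
        return M;
    }

    int main() {
        for (std::size_t p : minimizers("aabaaabcbda", 3, 3)) std::cout << p << ' ';
        std::cout << '\n'; // prints: 1 4 5 6 7 (cf. Example 1)
    }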

The minimizers sampling mechanism is very versatile, and it has been employed in various ways in many different applications [47, 67, 19, 10, 31, 35, 36, 48, 37]. Since its inception, the minimizers sampling mechanism has undergone numerous theoretical and practical improvements [56, 10, 54, 53, 15, 22, 72, 37, 74], with a particular focus on minimizing the size of the residual sample; see Section 6 for a summary on this line of research. Although minimizers have been extensively and successfully used, especially in bioinformatics, we observe several inherent problems with setting the parameters $w$ and $k$. In particular, although the notion of length-$k$ substrings (known as $k$-mers or $k$-grams) is a widely-used string processing tool, we argue that, in the context of minimizers, it may be causing many more problems than it solves: it is not clear to us why one should use an extra sampling parameter $k$ to effectively characterize a fragment of length $\ell=w+k-1$ of $T$. In what follows, we describe some problems that may arise when setting the parameters $w$ and $k$.

Indexing:

The most widely-used approach is to index the selected minimizers using a hash table. The key is the selected length-$k$ substring and the value is the list of positions at which it occurs. If one would like to use length-$k^{\prime}$ substrings for the minimizers, with $\ell=w+k-1=w^{\prime}+k^{\prime}-1$, for some $w^{\prime}\neq w$ and $k^{\prime}\neq k$, they should compute the new set $\mathcal{M}_{w^{\prime},k^{\prime}}(T)$ of minimizers and construct their new index based on $\mathcal{M}_{w^{\prime},k^{\prime}}$ from scratch.

Querying:

To the best of our knowledge, no index based on minimizers can return in optimal or near-optimal time all occurrences of a pattern $Q$ of length $|Q|\geq\ell=w+k-1$ in $T$.

Sample Size:

If one would like to minimize the number of selected minimizers, they should consider different total orders on the universe of length-$k$ strings, which may complicate practical implementations, often scaling only up to a small $k$ value, e.g., $k=16$ [22]. On the other hand, when $k$ is fixed and $w$ increases, the length-$k$ substrings in a fragment become increasingly decoupled from each other, and this regardless of the total order we may choose. Unfortunately, this interplay phenomenon is inherent to minimizers. It is known that $k\geq\log_{\sigma}(w)+c$, for a fixed constant $c$, is a necessary condition for the existence of minimizers samples with expected size in $\mathcal{O}(n/w)$ [72]; see Section 6.

We propose the notion of bidirectional string anchors (bd-anchors) to alleviate these disadvantages. Bd-anchors form a mechanism that drops the sampling parameter $k$ and its corresponding disadvantages. We only fix a parameter $\ell$, which can be viewed as the length $w+k-1$ of the fragments in the minimizers sampling mechanism. The bd-anchor of a string $X$ of length $\ell$ is the lexicographically smallest rotation (cyclic shift) of $X$. We unambiguously characterize this rotation by its leftmost starting position in string $XX$. The set $\mathcal{A}_{\ell}(T)$ of the order-$\ell$ bd-anchors of string $T$ is the set of bd-anchors of all length-$\ell$ fragments of $T$. It can be readily verified that bd-anchors satisfy Properties 1 and 2.

Example 2.

The set $\mathcal{A}_{\ell}(T)$ of bd-anchors for $\ell=5$ for string $T=\texttt{aabaaabcbda}$ (using a 1-based index) is $\mathcal{A}_{5}(T)=\{4,5,6,11\}$ and for string $Q=\texttt{abaaa}$ it is $\mathcal{A}_{5}(Q)=\{3\}$. Indeed $Q$ occurs at position $2$ in $T$; and $Q$ and $T[2..6]$ have the bd-anchors $3$ and $4$, respectively, which both correspond to the rotation aaaab.
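Analogously, a brute-force sketch of the bd-anchors sampling: for each window it finds the leftmost lexicographically minimal rotation by comparing all rotations explicitly. The function name bd_anchors is ours; a linear-time per-window computation (Booth's algorithm) is discussed in Section 3.

    #include <cstddef>
    #include <iostream>
    #include <set>
    #include <string>

    // Brute-force order-ell bd-anchors: for every window X = T[i..i+ell-1], find the
    // leftmost lexicographically minimal rotation of X (by comparing all ell rotations
    // inside XX) and report its 1-based starting position in T.
    std::set<std::size_t> bd_anchors(const std::string& T, std::size_t ell) {
        std::set<std::size_t> A;
        for (std::size_t i = 0; i + ell <= T.size(); ++i) {
            std::string X = T.substr(i, ell), XX = X + X;
            std::size_t best = 0;
            for (std::size_t j = 1; j < ell; ++j)
                if (XX.compare(j, ell, XX, best, ell) < 0) best = j; // strict '<' keeps the leftmost
            A.insert(i + best + 1);
        }
        return A;
    }

    int main() {
        for (std::size_t p : bd_anchors("aabaaabcbda", 5)) std::cout << p << ' ';
        std::cout << '\n'; // prints: 4 5 6 11 (cf. Example 2)
    }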

Let us remark that string synchronizing sets, introduced by Kempa and Kociumaka [43], are another string sampling mechanism which may be employed to resolve the disadvantages of minimizers. Yet, it appears to be too complicated to be efficient in practice. For instance, in [20], the authors used a simplified and specific definition of string synchronizing sets to design a space-efficient data structure for answering longest common extension queries.

We consider the word RAM model of computation with $w$-bit machine words, where $w=\Omega(\log n)$, for stating our results. We also assume throughout that string $T$ is over alphabet $\Sigma=\{1,2,\ldots,n^{\mathcal{O}(1)}\}$, which captures virtually any real-world scenario. We measure space in terms of $w$-bit machine words. We make the following three specific contributions:

  1.

    In Section 3 we show that the set $\mathcal{A}_{\ell}(T)$, for any $\ell>0$ and any $T$ of length $n$, can be constructed in $\mathcal{O}(n)$ time. We generalize this result showing that, for any constant $\epsilon\in(0,1]$, $\mathcal{A}_{\ell}(T)$ can be constructed in $\mathcal{O}(n+n^{1-\epsilon}\ell)$ time using $\mathcal{O}(n^{\epsilon}+\ell+|\mathcal{A}_{\ell}|)$ space. Furthermore, we show that the expected size of $\mathcal{A}_{\ell}$ for strings of length $n$, randomly generated by a memoryless source with identical letter probabilities, is in $\mathcal{O}(n/\ell)$, for any integer $\ell>0$. The latter is in contrast to minimizers, which achieve the expected bound of $\mathcal{O}(n/w)$ only when $k\geq\log_{\sigma}w+c$, for some constant $c$ [72]. We then show, using five real datasets, that indeed the size of $\mathcal{A}_{\ell}$ decreases proportionally to $\ell$; that it is competitive to or smaller than $\mathcal{M}_{w,k}$, when $\ell=w+k-1$; and that it is much smaller than $\mathcal{M}_{w,k}$ for small $w$ values, which is practically important, as widely-used aligners that are based on minimizers will require less space and computation time if bd-anchors are used instead. Finally, we show a negative result using a reduction from minimum feedback arc set: computing a total order $\leq$ on $\Sigma$ which minimizes $|\mathcal{A}_{\ell}(T)|$ is NP-hard.

  2.

    In Section 4 we show an index based on $\mathcal{A}_{\ell}(T)$, for any string $T$ of length $n$ and any integer $\ell>0$, which answers on-line pattern searches in near-optimal time. In particular, for any constant $\epsilon>0$, we show that our index supports the following space/query-time trade-offs:

    • it occupies $\mathcal{O}(|\mathcal{A}_{\ell}(T)|)$ extra space and reports all $k$ occurrences of any pattern $Q$ of length $|Q|\geq\ell$ given on-line in $\mathcal{O}(|Q|+(k+1)\log^{\epsilon}(|\mathcal{A}_{\ell}(T)|))$ time; or

    • it occupies $\mathcal{O}(|\mathcal{A}_{\ell}(T)|\log^{\epsilon}(|\mathcal{A}_{\ell}(T)|))$ extra space and reports all $k$ occurrences of any pattern $Q$ of length $|Q|\geq\ell$ given on-line in $\mathcal{O}(|Q|+\log\log(|\mathcal{A}_{\ell}(T)|)+k)$ time.

    We also show that our index can be constructed in $\mathcal{O}(n+|\mathcal{A}_{\ell}(T)|\sqrt{\log(|\mathcal{A}_{\ell}(T)|)})$ time. We then show, using five real datasets, that a simple implementation of our index is consistently faster in on-line pattern searches than an analogous implementation of the minimizers-based index proposed by Grabowski and Raniszewski in [31].

  3.

    In Section 5 we highlight the applicability of bd-anchors by developing an efficient and effective heuristic for top-$K$ similarity search under edit distance. This is a fundamental and extensively studied problem [38, 9, 11, 46, 68, 71, 57, 65, 64, 17, 34, 69, 70] with applications in areas including bioinformatics, databases, data mining, and information retrieval. We show, using synthetic datasets, that our heuristic, which is based on the bd-anchors index, is more accurate and more than one order of magnitude faster in top-$K$ similarity searches than the state-of-the-art tool proposed by Zhang and Zhang in [70].

In Section 2, we provide some preliminaries; and in Section 6 we discuss works related to minimizers. Let us stress that, although other works may be related to our contributions, we focus on comparing to minimizers because they are extensively used in applications. The source code of our implementations is available at https://github.com/solonas13/bd-anchors.

A preliminary version of this paper appeared as [49].

2 Preliminaries

We start with some basic definitions and notation following [13]. An alphabet $\Sigma$ is a finite nonempty set of elements called letters. A string $X=X[1]\ldots X[n]$ is a sequence of length $|X|=n$ of letters from $\Sigma$. The empty string, denoted by $\varepsilon$, is the string of length $0$. The fragment $X[i..j]$ of $X$ is an occurrence of the underlying substring $S=X[i]\ldots X[j]$. We also say that $S$ occurs at position $i$ in $X$. A prefix of $X$ is a fragment of $X$ of the form $X[1..j]$ and a suffix of $X$ is a fragment of $X$ of the form $X[i..n]$. The set of all strings over $\Sigma$ (including $\varepsilon$) is denoted by $\Sigma^{*}$. The set of all length-$k$ strings over $\Sigma$ is denoted by $\Sigma^{k}$. Given two strings $X$ and $Y$, the edit distance $d_{\mathrm{E}}(X,Y)$ is the minimum number of edit operations (letter insertion, deletion, or substitution) transforming one string into the other.

Let $M$ be a finite nonempty set of strings over $\Sigma$ of total length $m$. We define the trie of $M$, denoted by $\textsf{TR}(M)$, as a deterministic finite automaton that recognizes $M$. Its set of states (nodes) is the set of prefixes of the elements of $M$; the initial state (root node) is $\varepsilon$; the set of terminal states (leaf nodes) is $M$; and edges are of the form $(u,\alpha,u\alpha)$, where $u$ and $u\alpha$ are nodes and $\alpha\in\Sigma$. The size of $\textsf{TR}(M)$ is thus $\mathcal{O}(m)$. The compacted trie of $M$, denoted by $\textsf{CT}(M)$, contains the root node, the branching nodes, and the leaf nodes of $\textsf{TR}(M)$. The term compacted refers to the fact that $\textsf{CT}(M)$ reduces the number of nodes by replacing each maximal branchless path segment with a single edge, and that it uses a fragment of a string $s\in M$ to represent the label of this edge in $\mathcal{O}(1)$ machine words. The size of $\textsf{CT}(M)$ is thus $\mathcal{O}(|M|)$. When $M$ is the set of suffixes of a string $Y$, then $\textsf{CT}(M)$ is called the suffix tree of $Y$, and we denote it by $\textsf{ST}(Y)$. The suffix tree of a string of length $n$ over an alphabet $\Sigma=\{1,\ldots,n^{\mathcal{O}(1)}\}$ can be constructed in $\mathcal{O}(n)$ time [23].

Let us fix throughout a string $T=T[1..n]$ of length $|T|=n$ over an ordered alphabet $\Sigma$. Recall that we make the standard assumption of an integer alphabet $\Sigma=\{1,2,\ldots,n^{\mathcal{O}(1)}\}$.

We start by defining the notion of minimizers of $T$ from [58] (the definition in [60] is slightly different). Given an integer $k>0$, an integer $w>0$, and the $i$th length-$(w+k-1)$ fragment $F=T[i..i+w+k-2]$ of $T$, we define the $(w,k)$-minimizers of $F$ as the positions $j\in[i,i+w)$ where a lexicographically minimal length-$k$ substring of $F$ occurs. The set $\mathcal{M}_{w,k}(T)$ of $(w,k)$-minimizers of $T$ is defined as the set of $(w,k)$-minimizers of $T[i..i+w+k-2]$, for all $i\in[1,n-w-k+2]$. The density of $\mathcal{M}_{w,k}(T)$ is defined as the quantity $|\mathcal{M}_{w,k}(T)|/n$. The following bounds are obtained trivially. The density of any minimizer scheme is at least $1/w$, since at least one $(w,k)$-minimizer is selected in each fragment, and at most $1$, when every $(w,k)$-minimizer is selected.

If we waive the lexicographic order assumption, the set $\mathcal{M}_{w,k}(T)$ can be computed on-line in $\mathcal{O}(n)$ time, provided that we assume a function which gives us an arbitrary rank for each length-$k$ substring in $\Sigma^{k}$ in constant amortized time [37]. This can be implemented, for instance, using a rolling hash function (e.g., Karp-Rabin fingerprints [41]), and the rank (total order) is defined by this function. We also provide here, for completeness, a simple off-line $\mathcal{O}(n)$-time algorithm that uses a lexicographic order.

Theorem 1.

The set $\mathcal{M}_{w,k}(T)$, for any integers $w,k>0$ and any string $T$ of length $n$, can be constructed in $\mathcal{O}(n)$ time.

Proof.

The underlying algorithm has two main steps. In the first step, we construct $\textsf{ST}(T)$, the suffix tree of $T$, in $\mathcal{O}(n)$ time [23]. Using a depth-first search traversal of $\textsf{ST}(T)$, we assign to every position $i$ of $T$ in $[1,n-k+1]$ the lexicographic rank of $T[i..i+k-1]$ among all the length-$k$ strings occurring in $T$. This process clearly takes $\mathcal{O}(n)$ time as $\textsf{ST}(T)$ is an ordered structure; it yields an array $R$ of size $n-k+1$ with lexicographic ranks. In the second step, we apply a folklore algorithm, which computes the minimum elements in a sliding window of size $w$ (cf. [37]) over $R$. The set of reported indices is $\mathcal{M}_{w,k}(T)$. ∎
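The second step of the proof is the standard sliding-window-minimum technique; a minimal sketch is given below, assuming the rank array $R$ has already been computed (here it is simply passed in and, for brevity, only the leftmost minimum of each window is reported; tied positions would also be minimizers).

    #include <cstddef>
    #include <deque>
    #include <iostream>
    #include <vector>

    // Folklore sliding-window minimum over the rank array R: the deque stores
    // positions with strictly increasing ranks; its front is the leftmost
    // minimum of the current window of size w.
    std::vector<std::size_t> window_minima(const std::vector<int>& R, std::size_t w) {
        std::vector<std::size_t> out;
        std::deque<std::size_t> dq;
        for (std::size_t i = 0; i < R.size(); ++i) {
            while (!dq.empty() && R[dq.back()] > R[i]) dq.pop_back(); // strict '>' keeps the leftmost on ties
            dq.push_back(i);
            if (dq.front() + w <= i) dq.pop_front();   // front fell out of window [i-w+1, i]
            if (i + 1 >= w) out.push_back(dq.front()); // report once the first window is complete
        }
        return out;
    }

    int main() {
        // Lexicographic ranks of the length-3 substrings of T = aabaaabcbda (cf. Example 1).
        std::vector<int> R = {2, 3, 5, 1, 2, 4, 6, 8, 7};
        for (std::size_t p : window_minima(R, 3)) std::cout << p + 1 << ' ';
        std::cout << '\n'; // prints: 1 4 4 4 5 6 7, i.e., the set {1,4,5,6,7}
    }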

3 Bidirectional String Anchors

We introduce the notion of bidirectional string anchors (bd-anchors). Given a string $W$, a string $R$ is a rotation (or cyclic shift or conjugate) of $W$ if and only if there exists a decomposition $W=UV$ such that $R=VU$, for a string $U$ and a nonempty string $V$. We often characterize $R$ by its starting position $|U|+1$ in $WW=UVUV$. We use the term rotation interchangeably to refer to string $R$ or to its identifier $|U|+1$.

Definition 1 (Bidirectional anchor).

Given a string $X$ of length $\ell>0$, the bidirectional anchor (bd-anchor) of $X$ is the lexicographically minimal rotation $j\in[1,\ell]$ of $X$ with minimal $j$. The set of order-$\ell$ bd-anchors of a string $T$ of length $n>\ell$, for some integer $\ell>0$, is defined as the set $\mathcal{A}_{\ell}(T)$ of bd-anchors of $T[i..i+\ell-1]$, for all $i\in[1,n-\ell+1]$.

The density of $\mathcal{A}_{\ell}(T)$ is defined as the quantity $|\mathcal{A}_{\ell}(T)|/n$. It can be readily verified that the bd-anchors sampling mechanism satisfies Properties 1 (approximately uniform sampling) and 2 (local consistency).

Example 3.

Let $\ell=5$, $T=\texttt{aabaaabcbda}$, and $T^{\prime}=\texttt{aacaaaccbda}$. Strings $T$ and $T^{\prime}$ (which are at Hamming distance $2$) have the same set of bd-anchors of order $5$: $\mathcal{A}_{5}(T)=\mathcal{A}_{5}(T^{\prime})=\{4,5,6,11\}$. The reader can probably share the intuition that the bd-anchors sampling mechanism is suitable for sequence comparison due to Properties 1 and 2, in particular, when the parameter $\ell$ is set accordingly.

Linear-Time Construction of $\mathcal{A}_{\ell}$.

Importantly, we show that $\mathcal{A}_{\ell}$ admits an efficient construction. One can use the linear-time algorithm by Booth [6] to compute the lexicographically minimal rotation of each length-$\ell$ fragment of $T$, resulting in an $\mathcal{O}(n\ell)$-time algorithm, which is reasonably fast for modest $\ell$. (Booth's algorithm gives the leftmost minimal rotation by construction.) We instead give an optimal $\mathcal{O}(n)$-time algorithm for the construction of $\mathcal{A}_{\ell}$, which is mostly of theoretical interest.
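For reference, here is a compact linear-time least-rotation routine. It is the folklore two-pointer ("minimum expression") variant rather than Booth's original failure-function formulation, but it likewise returns the leftmost minimal rotation in $\mathcal{O}(|X|)$ time; running it on every window yields the $\mathcal{O}(n\ell)$-time construction mentioned above.

    #include <algorithm>
    #include <cstddef>
    #include <iostream>
    #include <string>

    // Returns the 0-based starting position of the leftmost lexicographically
    // minimal rotation of X. Two candidate start positions i and j are compared
    // character by character; the losing candidate and the k positions following
    // it cannot start a minimal rotation and are skipped.
    std::size_t least_rotation(const std::string& X) {
        const std::size_t n = X.size();
        std::size_t i = 0, j = 1, k = 0;
        while (i < n && j < n && k < n) {
            char a = X[(i + k) % n], b = X[(j + k) % n];
            if (a == b) { ++k; continue; }
            if (a > b) i += k + 1; else j += k + 1;
            if (i == j) ++j;
            k = 0;
        }
        return std::min(i, j);
    }

    int main() {
        std::cout << least_rotation("abaaa") + 1 << '\n'; // prints 3 (cf. Example 2: rotation aaaab)
    }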

For every string $X$ and every natural number $m$, we define the $m$th power of the string $X$, denoted by $X^{m}$, by $X^{0}=\varepsilon$ and $X^{k}=X^{k-1}X$, for $k=1,2,\ldots,m$. A nonempty string is primitive, if it is not the power of any other string. Let us state two well-known combinatorial lemmas.

Lemma 1 ([13]).

A nonempty string $X$ is primitive if and only if it occurs as a substring in $XX$ only as a prefix and as a suffix.
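Lemma 1 immediately yields a simple primitivity test, sketched below for a nonempty string (is_primitive is our name for this illustrative helper).

    #include <string>

    // Lemma 1: a nonempty string X is primitive iff X occurs in XX only as a
    // prefix and as a suffix, i.e., its first occurrence at a (0-based)
    // position >= 1 inside XX is exactly at position |X|.
    bool is_primitive(const std::string& X) {
        return (X + X).find(X, 1) == X.size();
    }
    // Examples: is_primitive("aabaa") == true; is_primitive("cbacbacba") == false (cf. Example 4).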

Lemma 2 ([61]).

Let $X=UV$ and $R=VU$, for two strings $U,V$. If $X$ is primitive then $R$ is also primitive.

A substring $U$ of a string $X$ is called an infix of $X$ if and only if $U=X[i..j]$ with $i>1$ and $j<|X|$.

Lemma 3.

A string $X$ has more than one minimal lexicographic rotation if and only if $X$ is a power of some string.

Proof.

($\Rightarrow$) Let $X=U_{1}V_{1}$, and let $R=V_{1}U_{1}$ be the leftmost minimal lexicographic rotation of $X$. Suppose towards a contradiction that $X$ has another minimal lexicographic rotation but $X$ is primitive. In particular, there exists $H=V_{2}U_{2}=R$, with $X=U_{2}V_{2}$ and $|U_{1}|<|U_{2}|$. If $X$ is primitive, then $R$ is also primitive by Lemma 2, but then $RR=V_{1}U_{1}V_{1}U_{1}$ has $H$ occurring as an infix. In particular, in $RR$, $V_{2}$ is a suffix of the first occurrence of $V_{1}$ and $U_{2}$ is a prefix of $U_{1}V_{1}$, and thus $H=R$ occurs as an infix. By Lemma 1 we obtain a contradiction.

($\Leftarrow$) Let $X=UU\cdots U$ be a power of some string $U$ with at least two occurrences of $U$, and let $i\in[1,|X|]$ be a minimal lexicographic rotation of $X$. Then either $i+|U|$ or $i-|U|$ is also a minimal lexicographic rotation of $X$. ∎

Example 4 (Illustration of Lemma 3).

Let $X=\texttt{cbacbacba}$, $R=\texttt{acbacba}\cdot\texttt{cb}$ with $V_{1}=\texttt{acbacba}$ and $U_{1}=\texttt{cb}$, and $H=\texttt{acba}\cdot\texttt{cbacb}=R$ with $V_{2}=\texttt{acba}$ and $U_{2}=\texttt{cbacb}$. Observe that $H$ occurs as an infix (underlined) in $RR=\texttt{acb\underline{acbacbacb}acbacb}$, hence $X$ is a power of some string.

Lemma 4.

Let $X$ be a string of length $n$ and set $Y=XX\#$, for some letter $\#$ not occurring in $X$ that is the lexicographically maximal letter occurring in $Y$. Further let $Y[i..2n+1]$ be the lexicographically minimal suffix of $Y$, for some $i\in[1,2n]$. The leftmost lexicographically minimal rotation of $X$ is $i$.

Proof.

First note that $i\in[1,n]$, because $\#$ is the lexicographically maximal letter occurring in $Y$.

We consider two cases: (i) $X$ is primitive; and (ii) $X$ is a power of some string. In the first case, $X$ has exactly one lexicographically minimal rotation by Lemma 3, and thus this rotation is $i$. In the second case, $X$ has more than one lexicographically minimal rotation, but because $X$ is a power of some string and $\#$ is the lexicographically maximal letter occurring in $Y$, $i$ is the leftmost lexicographically minimal rotation of $X$. ∎

We employ the data structure of Kociumaka [44, Theorem 20] to obtain the following result.

Theorem 2.

The set $\mathcal{A}_{\ell}(T)$, for any $\ell>0$ and any $T$ of length $n$, can be constructed in $\mathcal{O}(n)$ time.

Proof.

The data structure of Kociumaka [44, Theorem 20] gives the minimal lexicographic suffix of any concatenation $Y$ of $k$ arbitrary fragments of a string $S$ in $\mathcal{O}(k^{2})$ time, after an $\mathcal{O}(|S|)$-time preprocessing.

We set $S=T\#$, for some letter $\#$ that does not occur in $T$ and is the lexicographically maximal letter occurring in $S$. For each fragment $T[i..i+\ell-1]$, we compute the minimal lexicographic suffix of string

Y=S[i..i+\ell-1]\cdot S[i..i+\ell-1]\cdot S[n+1]=T[i..i+\ell-1]\cdot T[i..i+\ell-1]\cdot\#,

where $k=3$, in $\mathcal{O}(k^{2})=\mathcal{O}(1)$ time. This suffix of $Y$ is the minimal lexicographic rotation by Lemma 4. ∎

Space-Efficient Construction of $\mathcal{A}_{\ell}$.

It should be clear that, in the best case, the size of $\mathcal{A}_{\ell}$ is in $\mathcal{O}(n/\ell)$ and this bound is tight. The construction of Theorem 2 requires $\mathcal{O}(n)$ space. Ideally, we would thus like to compute $\mathcal{A}_{\ell}$ efficiently using (strongly) sublinear space. We generalize Theorem 2 to the following result.

Theorem 3.

The set $\mathcal{A}_{\ell}(T)$, for any $\ell>0$, any $T$ of length $n$, and any constant $\epsilon\in(0,1]$, can be constructed in $\mathcal{O}(n+n^{1-\epsilon}\ell)$ time using $\mathcal{O}(n^{\epsilon}+\ell+|\mathcal{A}_{\ell}|)$ space.

Proof.

We compute $\mathcal{A}_{\ell}(T[\lceil n^{\epsilon}(i-1)\rceil+1..\min(\lceil n^{\epsilon}i\rceil+\ell,n)])$, for all $i\in[1,\lceil n^{1-\epsilon}\rceil]$, using the algorithm from Theorem 2, and output their union. For any constant $\epsilon\in(0,1]$, the alphabet size $|\Sigma|=n^{\mathcal{O}(1)}=(n^{\epsilon}+\ell)^{\mathcal{O}(1)}$ is still polynomial in the length $n^{\epsilon}+\ell$ of the fragments, so computing one such anchor set takes $\mathcal{O}(n^{\epsilon}+\ell)$ time and space by Theorem 2. We delete each fragment (and the associated data structure) before processing the subsequent anchor set: it thus takes $\mathcal{O}(n+n^{1-\epsilon}\ell)$ time and $\mathcal{O}(n^{\epsilon}+\ell)$ additional space to construct $\mathcal{A}_{\ell}(T)$. ∎
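The blocking idea behind Theorem 3 can be sketched as follows, with any per-fragment routine (here the earlier brute-force bd_anchors) standing in for the linear-time algorithm of Theorem 2; the block length B plays the role of $n^{\epsilon}$, and each block is extended by $\ell-1$ positions so that every length-$\ell$ window lies entirely inside some block.

    #include <algorithm>
    #include <cstddef>
    #include <set>
    #include <string>

    // Any routine computing the order-ell bd-anchors of a string works here,
    // e.g., the brute-force sketch given earlier.
    std::set<std::size_t> bd_anchors(const std::string& T, std::size_t ell);

    // Blockwise computation of A_ell(T): process overlapping blocks of length at
    // most B + ell - 1, shift the per-block (1-based) anchors by the block offset,
    // and take the union.
    std::set<std::size_t> bd_anchors_blockwise(const std::string& T, std::size_t ell, std::size_t B) {
        std::set<std::size_t> A;
        for (std::size_t start = 0; start + ell <= T.size(); start += B) {
            std::size_t len = std::min(B + ell - 1, T.size() - start);
            for (std::size_t a : bd_anchors(T.substr(start, len), ell))
                A.insert(a + start);
        }
        return A;
    }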

Expected Size of $\mathcal{A}_{\ell}$.

We next analyze the expected size of $\mathcal{A}_{\ell}(T)$. We first show that if $\ell$ grows no faster than the size $\sigma$ of the alphabet, then the expected size of $\mathcal{A}_{\ell}$ is in $\mathcal{O}(n/\ell)$. Otherwise, if $\ell$ grows faster than $\sigma$, we slightly amend the sampling process to ensure that the expected size of the sample is in $\mathcal{O}(n/\ell)$.

Lemma 5.

If $T$ is a string of length $n$, randomly generated by a memoryless source over an alphabet of size $\sigma\geq 2$ with identical letter probabilities, then, for any integer $\ell>0$, the expected size of $\mathcal{A}_{\ell}(T)$ is in $\mathcal{O}(\frac{n\log\ell}{\ell\log\sigma}+\frac{n}{\ell})$.

Proof.

If $\ell=1$, then $|\mathcal{A}_{\ell}(T)|=n$. If $\ell=2$, then $\mathbb{E}[|\mathcal{A}_{\ell}(T)|]=1+(n-2)(2\sigma^{2}+1)/(3\sigma^{2})$. Now suppose $\ell\geq 3$. We say that $T[i..i+\ell-1]$ introduces a new bd-anchor if there exists $j\in[1,\ell]$ such that $j$ is the bd-anchor of $T[i..i+\ell-1]$, but $j+k$ is not the bd-anchor of $T[i-k..i-k+\ell-1]$ for all $k\in[1,\min(\ell-j,i-1)]$. Let $N_{i}(T)$ denote the event that $T[i..i+\ell-1]$ introduces a new bd-anchor. Since the letters are independent and identically distributed, the probability $\mathbb{P}[N_{i}(T)]$ only depends on, and is non-increasing in, the number of preceding overlapping length-$\ell$ fragments. Therefore

\mathbb{E}[|\mathcal{A}_{\ell}(T)|]=\mathbb{P}[N_{1}(T)]+\cdots+\mathbb{P}[N_{n-\ell+1}(T)]\leq 1+(n-1)\mathbb{P}[N_{2}(T)].

Let $p$ be the length of the shortest prefix of the lexicographically minimal rotation of $T[2..\ell+1]$ which is strictly smaller than the same-length prefix of any other rotation of $T[2..\ell+1]$.

Note that

\mathbb{P}[N_{2}] \;\leq\; \mathbb{P}[T[1..\ell]\text{ or }T[2..\ell+1]\text{ is a power of some string}]
\;+\; \mathbb{P}[T[1..\ell]\text{ is primitive with bd-anchor }1]
\;+\; \mathbb{P}[T[2..\ell+1]\text{ is primitive with bd-anchor}>\ell-3\log\ell/\log\sigma]
\;+\; \mathbb{P}[T[2..\ell+1]\text{ is primitive and }p\geq 3\log\ell/\log\sigma].

To bound the first of these probabilities, note that

\mathbb{P}[\text{a length-}\ell\text{ string is a power of some string}]\leq\sum_{d<\ell,\,d\mid\ell}\sigma^{-(\ell-d)}\leq\sum_{d\leq\ell/2}\sigma^{-(\ell-d)}\leq\sigma^{1-\ell/2}/(\sigma-1).

The second probability is bounded by $1/\ell$, since each letter of a primitive length-$\ell$ string is equally likely to be the anchor. Similarly, the third probability is bounded by $(3\log\ell/\log\sigma+1)/\ell$. Finally, the fourth probability is bounded by the probability that two prefixes of length $\lceil 3\log\ell/\log\sigma\rceil$ of rotations of $T[2..\ell+1]$ are equal, which is at most $\ell^{2}\cdot\sigma^{-3\log\ell/\log\sigma}=1/\ell$. It follows that

\mathbb{P}[N_{2}] \;\leq\; 2\sigma^{1-\ell/2}/(\sigma-1)+1/\ell+(3\log\ell/\log\sigma+1)/\ell+1/\ell \;=\; \mathcal{O}\left(\frac{\log\ell}{\ell\log\sigma}+\frac{1}{\ell}\right).

We conclude that for any $\ell>0$ the expected size of $\mathcal{A}_{\ell}(T)$ is in $\mathcal{O}(\frac{n\log\ell}{\ell\log\sigma}+\frac{n}{\ell})$. ∎

We define a reduced version of bd-anchors to ensure that the expected size of the sample is in $\mathcal{O}(n/\ell)$.

Definition 2.

Given a string $X$ of length $\ell>0$ and an integer $0\leq r\leq\ell-1$, we define the reduced bidirectional anchor of $X$ as the lexicographically minimal rotation $j\in[1,\ell-r]$ of $X$ with minimal $j$. The set of order-$\ell$ reduced bd-anchors of a string $T$ of length $n>\ell$ is defined as the set $\mathcal{A}_{\ell}^{\text{red}}(T)$ of reduced bd-anchors of $T[i..i+\ell-1]$, for all $i\in[1,n-\ell+1]$.

Lemma 6.

If $T$ is a string of length $n$, randomly generated by a memoryless source over an alphabet of size $\sigma\geq 2$ with identical letter probabilities, then, for any integer $\ell>0$, the expected size of $\mathcal{A}_{\ell}^{\text{red}}(T)$ with $r=\lceil 4\log\ell/\log\sigma\rceil$ is in $\mathcal{O}(n/\ell)$.

Proof.

If $\ell\in\{1,2\}$, then $|\mathcal{A}_{\ell}^{\text{red}}(T)|\leq n$. Now suppose $\ell\geq 3$. Analogously to $N_{i}$ in Lemma 5, we denote by $N^{\text{red}}_{i}(T)$ the event that $T[i..i+\ell-1]$ introduces a new reduced bd-anchor. Again we find

\mathbb{E}\left[\left|\mathcal{A}_{\ell}^{\text{red}}(T)\right|\right]=\mathbb{P}[N^{\text{red}}_{1}(T)]+\cdots+\mathbb{P}[N^{\text{red}}_{n-\ell+1}(T)]\leq 1+(n-\ell)\mathbb{P}[N^{\text{red}}_{2}(T)].

Let $p^{\text{red}}$ be the length of the shortest prefix of the lexicographically minimal rotation $j_{1}\in[1,\ell-r]$ of $T[2..\ell+1]$ which is strictly smaller than the same-length prefix of any other rotation $j_{2}\in[1,\ell-r]\setminus\{j_{1}\}$. Using a proof similar to that of Lemma 5, we find that

\mathbb{P}[N^{\text{red}}_{2}(T)] \;\leq\; \mathbb{P}[T[1..\ell]\text{ or }T[2..\ell+1]\text{ is a power of some string}]
\;+\; \mathbb{P}[T[1..\ell]\text{ is primitive with bd-anchor }1]
\;+\; \mathbb{P}[T[2..\ell+1]\text{ is primitive with bd-anchor }\ell-r+1]
\;+\; \mathbb{P}[T[2..\ell+1]\text{ is primitive and }p^{\text{red}}\geq r]
\;\leq\; 2\sigma^{1-\ell/2}/(\sigma-1)+1/(\ell-r)+1/(\ell-r)+\ell^{2}/\sigma^{r} \;=\; 2/\ell+o\left(1/\ell\right).

We conclude that for any $\ell>0$ the expected size of $\mathcal{A}^{\text{red}}_{\ell}(T)$ is in $\mathcal{O}(n/\ell)$. ∎

In particular, if $\ell=\mathcal{O}(\sigma)$, we employ the sampling mechanism underlying $\mathcal{A}_{\ell}(T)$; otherwise ($\ell=\Omega(\sigma)$), we employ the sampling mechanism underlying $\mathcal{A}_{\ell}^{\text{red}}(T)$ with $r=\lceil 4\log\ell/\log\sigma\rceil$ to ensure that the expected size of the resulting sample is always in $\mathcal{O}(n/\ell)$.

Constructing $\mathcal{A}_{\ell}^{\text{red}}(T)$ in $\mathcal{O}(n)$ time requires a trivial modification in Theorem 2. For each fragment $T[i..i+\ell-1]$, instead of computing the minimal lexicographic suffix of string

Y=S[i..i+\ell-1]\cdot S[i..i+\ell-1]\cdot S[n+1]=T[i..i+\ell-1]\cdot T[i..i+\ell-1]\cdot\#,

in $\mathcal{O}(1)$ time, we compute the minimal lexicographic suffix of string

Y^{\text{red}}=S[i..i+\ell-1]\cdot S[i..i+\ell-1-r]\cdot S[n+1]=T[i..i+\ell-1]\cdot T[i..i+\ell-1-r]\cdot\#,

in $\mathcal{O}(1)$ time. We then directly obtain the trade-off in Theorem 3 for constructing $\mathcal{A}_{\ell}^{\text{red}}(T)$.
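A brute-force sketch of the reduced sampling (Definition 2), mirroring the earlier bd_anchors sketch: the only change is that the minimal rotation is chosen among the first $\ell-r$ starting positions of each window (the code assumes $r<\ell$; Lemma 6 suggests $r=\lceil 4\log\ell/\log\sigma\rceil$).

    #include <cstddef>
    #include <set>
    #include <string>

    // Brute-force order-ell reduced bd-anchors: the leftmost minimal rotation is
    // taken only over starting positions 1..ell-r of each window (Definition 2).
    std::set<std::size_t> reduced_bd_anchors(const std::string& T, std::size_t ell, std::size_t r) {
        std::set<std::size_t> A;
        for (std::size_t i = 0; i + ell <= T.size(); ++i) {
            std::string X = T.substr(i, ell), XX = X + X;
            std::size_t best = 0;
            for (std::size_t j = 1; j + r < ell; ++j)          // candidates 0..ell-r-1 (0-based)
                if (XX.compare(j, ell, XX, best, ell) < 0) best = j;
            A.insert(i + best + 1);                            // 1-based position in T
        }
        return A;
    }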

Density Evaluation.

We compare the density of bd-anchors and reduced bd-anchors, denoted by BDA and rBDA, respectively, to the density of minimizers, for different values of $w$ and $k$ such that $\ell=w+k-1$. This is a fair comparison because $\ell=w+k-1$ is the length of the fragments considered by both mechanisms. We implemented bd-anchors, the standard minimizers mechanism from [58], and the minimizers mechanism with robust winnowing from [60]. The standard minimizers and those with robust winnowing are referred to as STD and WIN, respectively.

For bd-anchors, we used Booth's algorithm, which is easy to implement and reasonably fast. For minimizers, we used Karp-Rabin fingerprints [41]. (Note that such "random" minimizers tend to perform even better than the ones based on a lexicographic total order in terms of density [72].) For the reduced version of bd-anchors, we used $r=\lceil 3\log\ell/\log\sigma\rceil$, because the $r$ value suggested by Lemma 6 is relatively large for the small $\ell$ values tested; e.g., for $\ell=15$ and $\sigma=4$, $\lceil 4\log\ell/\log\sigma\rceil=8$. Throughout, we do not evaluate construction times, as all implementations are reasonably fast, and we make the standard assumption that preprocessing is only required once. We used five string datasets from the popular Pizza & Chili corpus [25] (see Table 1 for the datasets' characteristics). All implementations referred to in this paper have been written in C++ and compiled at optimization level -O3. All experiments reported in this paper were conducted using a single core of an AMD Opteron 6386 SE 2.8GHz CPU and 252GB RAM running GNU/Linux.

Dataset    Length $n$     Alphabet size $|\Sigma|$
DNA        200,000,000    4
XML        200,000,000    95
ENGLISH    200,000,000    224
PROTEINS   200,000,000    27
SOURCES    200,000,000    229
Table 1: Datasets characteristics.
Figure 1 (figure omitted): Density vs. $w,k$ for $\ell=w+k-1$ and the first two datasets of Table 1; panels: (a) DNA, (b) XML.
Figure 2 (figure omitted): Density vs. $w,k$ for $\ell=w+k-1$ and the last three datasets of Table 1; panels: (a) ENGLISH, (b) PROTEINS, (c) SOURCES.

As can be seen by the results depicted in Figures 1 and 2, the density of both BDA and rBDA is either significantly smaller than or competitive to the STD and WIN minimizers density, especially for small $w$. This is useful because a lower density results in smaller indexes and less computation (see Section 4), and because small $w$ is of practical interest (see Section 5). For instance, the widely-used long-read aligner Minimap2 [48] stores the selected minimizers of a reference genome in a hash table to find exact matches as anchors for seed-and-extend alignment. The parameters $w$ and $k$ are set based on the required sensitivity of the alignment, and thus $w$ and $k$ cannot be too large for high sensitivity. Thus, a lower sampling density reduces the size of the hash table, as well as the computation time, by lowering the average number of selected minimizers to consider when performing an alignment. Furthermore, although the datasets are not uniformly random, rBDA performs better than BDA as $\ell$ grows, as suggested by Lemmas 5 and 6.

There exists a long line of research on improving the density of minimizers in special regimes (see Section 6 for details). We stress that most of these algorithms are designed, implemented, or optimized only for the DNA alphabet. We have tested against two state-of-the-art tools employing such algorithms: Miniception [72] and PASHA [22]. The former did not give better results than STD or WIN for the tested values of $w$ and $k$; and the latter does not scale beyond $k=16$ or with large alphabets. We have thus omitted these results.

We next report the average number (AVG) of bd-anchors of order $\ell\in\{4,8,12,16\}$ over all strings of length $n=20$ (see Table 2(a)) and over all strings of length $n=32$ (see Table 2(b)), both over a binary alphabet. Notably, the results show that AVG always lies in $[n/\ell,2n/\ell]$, even if not using the reduced version of bd-anchors (see Lemma 6). As expected by Lemma 5, the analogous AVG values using a ternary alphabet (not reported here) were always lower than the corresponding ones with a binary alphabet.

(a)
$(n,\ell)$   (20,4)   (20,8)   (20,12)   (20,16)
$n/\ell$     5        2.5      1.66      1.25
AVG          8.53     4.37     2.77      1.76
$2n/\ell$    10       5        3.33      2.5

(b)
$(n,\ell)$   (32,4)   (32,8)   (32,12)   (32,16)
$n/\ell$     8        4        2.66      2
AVG          14.16    7.67     5.26      3.85
$2n/\ell$    16       8        5.33      4

Table 2: Average number of bd-anchors for varying $\ell$ and: (a) $n=20$; (b) $n=32$.

Minimizing the Size of $\mathcal{A}_{\ell}$ is NP-hard.

The number of bd-anchors depends on the lexicographic total order defined on $\Sigma$. We now prove that finding the total order which minimizes the number of bd-anchors is NP-hard, using a reduction from minimum feedback arc set [40]. Let us start by defining this problem. Given a directed graph $G(V,E)$, a feedback arc set in $G$ is a subset of $E$ that contains at least one edge from every cycle in $G$. Removing these edges from $G$ breaks all of the cycles, producing a directed acyclic graph. In the minimum feedback arc set problem, we are given $G(V,E)$, and we are asked to compute a smallest feedback arc set in $G$. The decision version of the problem takes an additional parameter $k$ as input, and it asks whether all cycles can be broken by removing at most $k$ edges from $E$. The decision version is NP-complete [40] and the optimization version is APX-hard [21].

Theorem 4.

Let $T$ be a string of length $n$ over alphabet $\Sigma$. Further let $\ell>0$ be an integer. Computing a total order $\leq$ on $\Sigma$ which minimizes $|\mathcal{A}_{\ell}(T)|$ is NP-hard.

Proof.

Let $G=(\Sigma,A)$ be any instance of the minimum feedback arc set problem. We will construct a string $S\in\Sigma^{*}$, in polynomial time in the size of $G$, such that finding a total order $\leq$ on $\Sigma$ which minimizes $|\mathcal{A}_{4}(S)|$ corresponds to finding a minimum feedback arc set in $G$.

We start with an empty string $S$. For each edge $(a,b)\in A$, we append $(\texttt{aabab})^{2|A|+1}$ to $S$. Observe that, if $a<b$, all the $a$'s of $(\texttt{aabab})^{2|A|+1}$ and none of its $b$'s are order-$4$ bd-anchors, except possibly the first $a$ and last $b$, depending on the preceding and subsequent letters in $S$. Thus there are $6|A|+2$ to $6|A|+4$ order-$4$ bd-anchors in $(\texttt{aabab})^{2|A|+1}$. If on the other hand $a>b$, we analogously find that there will be $4|A|+1$ to $4|A|+3$ order-$4$ bd-anchors in $(\texttt{aabab})^{2|A|+1}$.

Let $d_{\leq}$ be the number of edges $(a,b)\in A$ such that $a<b$. The total number of order-$4$ bd-anchors in $S$ is

|\mathcal{A}_{4}(S)|=|A|\cdot 2\cdot(2|A|+1)+d_{\leq}\cdot(2|A|+1)+\epsilon,

with $\epsilon\in[-|A|,|A|]$. Therefore minimizing the total number of order-$4$ bd-anchors in $S$ is equivalent to finding an order $\leq$ on the set $\Sigma$ of vertices of $G$ which minimizes $d_{\leq}$.

Note that if we delete all edges $(a,b)\in A$ such that $a<b$, then the residual graph is acyclic. Moreover, for each acyclic graph there exists an order on the vertices such that $a>b$ for all $(a,b)\in A$. Therefore the minimal $d_{\leq}$ equals the size of the minimum feedback arc set.

We conclude that, since finding the size of the minimum feedback arc set is NP-hard, so is finding a total order $\leq$ on $\Sigma$ which minimizes the total number of order-$4$ bd-anchors. ∎
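The string built in the reduction is easy to generate; the sketch below (reduction_string is our name for this helper) constructs $S$ from an edge list whose endpoints are letters, exactly as in the proof.

    #include <cstddef>
    #include <string>
    #include <utility>
    #include <vector>

    // Proof of Theorem 4: for every edge (a, b), append (aabab)^(2|A|+1) to S.
    // Minimizing |A_4(S)| over total orders on the alphabet then amounts to
    // finding a minimum feedback arc set of the input graph.
    std::string reduction_string(const std::vector<std::pair<char, char>>& edges) {
        std::string S;
        const std::size_t reps = 2 * edges.size() + 1;
        for (const auto& e : edges) {
            const std::string gadget{e.first, e.first, e.second, e.first, e.second}; // "aabab"
            for (std::size_t t = 0; t < reps; ++t) S += gadget;
        }
        return S;
    }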

4 Indexing Using Bidirectional Anchors

Before presenting our index, let us start with a basic definition that is central to our querying process.

Definition 3 ($(\alpha,\beta)$-hit).

Given an order-$\ell$ bd-anchor $j_{Q}\in\mathcal{A}_{\ell}(Q)$, for some integer $\ell>0$, of a query string $Q$, two integers $\alpha>0,\beta>0$, with $\alpha+\beta\geq\ell+1$, and an order-$\ell$ bd-anchor $j_{T}\in\mathcal{A}_{\ell}(T)$ of a target string $T$, the ordered pair $(j_{Q},j_{T})$ is called an $(\alpha,\beta)$-hit if and only if $T[j_{T}-\alpha+1..j_{T}]=Q[j_{Q}-\alpha+1..j_{Q}]$ and $T[j_{T}..j_{T}+\beta-1]=Q[j_{Q}..j_{Q}+\beta-1]$.

Intuitively, the parameters $\alpha$ and $\beta$ let us choose a fragment of $Q$ that is anchored at $j_{Q}$.

Example 5.

Let $T=\texttt{aabaaabcbda}$, $Q=\texttt{aacabaaaae}$, and $\ell=5$. Consider that we would like to find the common fragment $Q[4..8]=T[2..6]=\texttt{abaaa}$. We know that the bd-anchor of order $5$ corresponding to $Q[4..8]$ is $6\in\mathcal{A}_{5}(Q)$, and thus to find it we set $\alpha=3$ and $\beta=3$. The ordered pair $(6,4)$ is a $(3,3)$-hit because, for $4\in\mathcal{A}_{5}(T)$, we have: $T[4-3+1..4]=Q[6-3+1..6]=\texttt{aba}$ and $T[4..4+3-1]=Q[6..6+3-1]=\texttt{aaa}$.
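Definition 3 can be checked directly, as in the sketch below (is_hit is our name for this helper; positions are 1-based as in the paper). For Example 5, is_hit(T, 4, Q, 6, 3, 3) returns true.

    #include <cstddef>
    #include <string>

    // (alpha,beta)-hit test (Definition 3): the alpha characters ending at the two
    // anchors and the beta characters starting at them must coincide.
    bool is_hit(const std::string& T, std::size_t jT,
                const std::string& Q, std::size_t jQ,
                std::size_t alpha, std::size_t beta) {
        if (jT < alpha || jQ < alpha) return false;                             // left contexts fit
        if (jT + beta - 1 > T.size() || jQ + beta - 1 > Q.size()) return false; // right contexts fit
        return T.compare(jT - alpha, alpha, Q, jQ - alpha, alpha) == 0          // T[jT-a+1..jT] = Q[jQ-a+1..jQ]
            && T.compare(jT - 1, beta, Q, jQ - 1, beta) == 0;                   // T[jT..jT+b-1] = Q[jQ..jQ+b-1]
    }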

We would like to construct a data structure over $T$, which is based on $\mathcal{A}_{\ell}(T)$, such that, when we are given an order-$\ell$ bd-anchor $j_{Q}$ over $Q$ as an on-line query, together with parameters $\alpha$ and $\beta$, we can report all $(\alpha,\beta)$-hits efficiently. To this end, we present an efficient data structure, denoted by $\mathcal{I}_{\ell}(T)$, which is constructed on top of $T$ and answers $(\alpha,\beta)$-hit queries in near-optimal time. We prove the following result.

Theorem 5.

Given a string $T$ of length $n$ and an integer $\ell>0$, the $\mathcal{I}_{\ell}(T)$ index can be constructed in $\mathcal{O}(n+|\mathcal{A}_{\ell}(T)|\sqrt{\log(|\mathcal{A}_{\ell}(T)|)})$ time. For any constant $\epsilon>0$, $\mathcal{I}_{\ell}(T)$:

  • occupies $\mathcal{O}(|\mathcal{A}_{\ell}(T)|)$ extra space and reports all $k$ $(\alpha,\beta)$-hits in $\mathcal{O}(\alpha+\beta+(k+1)\log^{\epsilon}(|\mathcal{A}_{\ell}(T)|))$ time; or

  • occupies $\mathcal{O}(|\mathcal{A}_{\ell}(T)|\log^{\epsilon}(|\mathcal{A}_{\ell}(T)|))$ extra space and reports all $k$ $(\alpha,\beta)$-hits in $\mathcal{O}(\alpha+\beta+\log\log(|\mathcal{A}_{\ell}(T)|)+k)$ time.

Let us denote by $\overleftarrow{X}=X[|X|]\ldots X[1]$ the reversal of string $X$. We now describe our data structure.

Construction of $\mathcal{I}_{\ell}(T)$.

Given $\mathcal{A}_{\ell}(T)$, we construct two sets $\mathcal{S}^{L}_{\ell}(T)$ and $\mathcal{S}^{R}_{\ell}(T)$ of strings: conceptually, the reversed prefixes going left from $j$ to $1$, and the suffixes going right from $j$ to $n$, for all $j$ in $\mathcal{A}_{\ell}(T)$. In particular, for the bd-anchor $j$, we construct two strings: $\overleftarrow{T[1..j]}\in\mathcal{S}^{L}_{\ell}(T)$ and $T[j..n]\in\mathcal{S}^{R}_{\ell}(T)$. Note that $|\mathcal{S}^{L}_{\ell}(T)|=|\mathcal{S}^{R}_{\ell}(T)|=|\mathcal{A}_{\ell}(T)|$, since for every bd-anchor in $\mathcal{A}_{\ell}(T)$ we have a distinct string in $\mathcal{S}^{L}_{\ell}(T)$ and in $\mathcal{S}^{R}_{\ell}(T)$.

We construct two compacted tries $\mathcal{T}^{L}_{\ell}(T)$ and $\mathcal{T}^{R}_{\ell}(T)$ over $\mathcal{S}^{L}_{\ell}(T)$ and $\mathcal{S}^{R}_{\ell}(T)$, respectively, to index all strings. Every string is concatenated with some special letter $\$$ not occurring in $T$, which is lexicographically minimal, to make $\mathcal{S}^{L}_{\ell}(T)$ and $\mathcal{S}^{R}_{\ell}(T)$ prefix-free (this is standard for conceptual convenience). The leaf nodes of the compacted tries are labeled with the corresponding $j$: there is a one-to-one correspondence between a leaf node and a bd-anchor $j$. In $\mathcal{O}(|\mathcal{A}_{\ell}(T)|)$ time, we also enhance the nodes of the tries with a perfect static dictionary [27] to ensure constant-time retrieval of edges by the first letter of their label. Let $\mathcal{L}^{L}_{\ell}(T)$ denote the list of the leaf labels of $\mathcal{T}^{L}_{\ell}(T)$ as they are visited using a depth-first search traversal. $\mathcal{L}^{L}_{\ell}(T)$ corresponds to the (labels of the) lexicographically sorted list of $\mathcal{S}^{L}_{\ell}(T)$ in increasing order. For each node $u$ in $\mathcal{T}^{L}_{\ell}(T)$, we also store the corresponding interval $[x_{u},y_{u}]$ over $\mathcal{L}^{L}_{\ell}(T)$. Analogously for $R$: $\mathcal{L}^{R}_{\ell}(T)$ denotes the list of the leaf labels of $\mathcal{T}^{R}_{\ell}(T)$ as they are visited using a depth-first search traversal and corresponds to the (labels of the) lexicographically sorted list of $\mathcal{S}^{R}_{\ell}(T)$ in increasing order. For each node $v$ in $\mathcal{T}^{R}_{\ell}(T)$, we also store the corresponding interval $[x_{v},y_{v}]$ over $\mathcal{L}^{R}_{\ell}(T)$.

The total size occupied by the tries is $\Theta(|\mathcal{A}_{\ell}(T)|)$ because they are compacted: we label the edges with intervals over $[1,n]$ from $T$.

We also construct a 2D range reporting data structure over the following points in set $\mathcal{R}_{\ell}(T)$:

(x,y)\in\mathcal{R}_{\ell}(T)\iff\mathcal{L}^{L}_{\ell}(T)[x]=\mathcal{L}^{R}_{\ell}(T)[y].

Note that $|\mathcal{R}_{\ell}(T)|=|\mathcal{A}_{\ell}(T)|$ because the set of leaf labels stored in both tries is precisely the set $\mathcal{A}_{\ell}(T)$. Let us remark that the idea of employing 2D range reporting for bidirectional pattern searches has been introduced by Amir et al. [2] for text indexing and dictionary matching with one error; see also [50].

This completes the construction of $\mathcal{I}_{\ell}(T)$. We next explain how we can query $\mathcal{I}_{\ell}(T)$.

Querying.

Given a bd-anchor $j_{Q}$ over a string $Q$ as an on-line query and parameters $\alpha,\beta>0$, we spell $\overleftarrow{Q[j_{Q}-\alpha+1..j_{Q}]}$ in $\mathcal{T}^{L}_{\ell}(T)$ and $Q[j_{Q}..j_{Q}+\beta-1]$ in $\mathcal{T}^{R}_{\ell}(T)$, starting from the root nodes. If any of the two strings is not spelled fully, we return no $(\alpha,\beta)$-hits. If both strings are fully spelled, we arrive at a node $u$ in $\mathcal{T}^{L}_{\ell}(T)$ (resp. $v$ in $\mathcal{T}^{R}_{\ell}(T)$), which corresponds to an interval over $\mathcal{L}^{L}_{\ell}(T)$ stored in $u$ (resp. over $\mathcal{L}^{R}_{\ell}(T)$ stored in $v$). We obtain the two intervals $[x_{u},y_{u}]$ and $[x_{v},y_{v}]$ forming a rectangle and ask the corresponding 2D range reporting query. It can be readily verified that this query returns all $(\alpha,\beta)$-hits.
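To make the querying process concrete, here is a deliberately simplified sketch of the index (the class name SimpleBDAIndex is ours): instead of compacted tries and a 2D range reporting structure, it stores the reversed prefixes and the suffixes explicitly in two sorted lists and intersects the two anchor sets, so it illustrates the bidirectional search but does not achieve the space or query bounds of Theorem 5.

    #include <algorithm>
    #include <cstddef>
    #include <set>
    #include <string>
    #include <utility>
    #include <vector>

    std::set<std::size_t> bd_anchors(const std::string& T, std::size_t ell); // earlier sketch

    struct SimpleBDAIndex {
        std::vector<std::pair<std::string, std::size_t>> L, R; // (reversed T[1..j] / T[j..n], anchor j)

        SimpleBDAIndex(const std::string& T, std::size_t ell) {
            for (std::size_t j : bd_anchors(T, ell)) {
                std::string left = T.substr(0, j);  // T[1..j]
                std::reverse(left.begin(), left.end());
                L.emplace_back(left, j);
                R.emplace_back(T.substr(j - 1), j); // T[j..n]
            }
            std::sort(L.begin(), L.end());
            std::sort(R.begin(), R.end());
        }

        // anchors whose stored string has P as a prefix (binary search on the sorted list)
        static std::set<std::size_t> prefixed_by(const std::vector<std::pair<std::string, std::size_t>>& V,
                                                 const std::string& P) {
            std::set<std::size_t> out;
            auto it = std::lower_bound(V.begin(), V.end(), std::make_pair(P, std::size_t(0)));
            for (; it != V.end() && it->first.compare(0, P.size(), P) == 0; ++it) out.insert(it->second);
            return out;
        }

        // all (alpha,beta)-hits: left_rev is the reversal of Q[jQ-alpha+1..jQ], right is Q[jQ..jQ+beta-1]
        std::set<std::size_t> hits(const std::string& left_rev, const std::string& right) const {
            std::set<std::size_t> A = prefixed_by(L, left_rev), B = prefixed_by(R, right), H;
            for (std::size_t j : A) if (B.count(j)) H.insert(j);
            return H;
        }
    };

On $T=\texttt{aabaaabcbda}$ with $\ell=5$, querying with left_rev = aba and right = aaa (the contexts of Example 5) returns the single hit $\{4\}$, matching Figure 3.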

Example 6.

Let $T=\texttt{aabaaabcbda}$ and $\mathcal{A}_{5}(T)=\{4,5,6,11\}$. We have the following strings in $\mathcal{S}^{L}_{5}(T)$: $\overleftarrow{T[1..4]}=\texttt{abaa}$; $\overleftarrow{T[1..5]}=\texttt{aabaa}$; $\overleftarrow{T[1..6]}=\texttt{aaabaa}$; and $\overleftarrow{T[1..11]}=\texttt{adbcbaaabaa}$. We have the following strings in $\mathcal{S}^{R}_{5}(T)$: $T[4..11]=\texttt{aaabcbda}$; $T[5..11]=\texttt{aabcbda}$; $T[6..11]=\texttt{abcbda}$; and $T[11..11]=\texttt{a}$. Inspect Figure 3.

Figure 3 (figure omitted): Let $T=\texttt{aabaaabcbda}$ and $\ell=5$. Further let $Q=\texttt{aacabaaaae}$, the bd-anchor $6\in\mathcal{A}_{5}(Q)$ of order $5$ corresponding to $Q[4..8]$, $\alpha=3$ and $\beta=3$. The figure illustrates the $\mathcal{I}_{\ell}(T)$ index and how we find that $Q[4..8]=T[2..6]=\texttt{abaaa}$: the fragment $T[2..6]$ is anchored at position $4$. Panels: (a) the $\mathcal{I}_{\ell}(T)$ index, (b) querying abaaa.
Proof of Theorem 5.

We use the $\mathcal{O}(n)$-time algorithm underlying Theorem 2 to construct $\mathcal{A}_{\ell}(T)$. We use the $\mathcal{O}(n)$-time algorithm from [3, 8] to construct the compacted tries from $\mathcal{A}_{\ell}(T)$. We extract the $|\mathcal{A}_{\ell}(T)|$ points $(x,y)\in\mathcal{R}_{\ell}(T)$ using the compacted tries in $\mathcal{O}(|\mathcal{A}_{\ell}(T)|)$ time. For the first trade-off of the statement, we use the $\mathcal{O}(|\mathcal{A}_{\ell}(T)|\sqrt{\log(|\mathcal{A}_{\ell}(T)|)})$-time algorithm from [5] to construct the 2D range reporting data structure over $\mathcal{R}_{\ell}(T)$ from [7]. For the second trade-off, we use the $\mathcal{O}(|\mathcal{A}_{\ell}(T)|\sqrt{\log(|\mathcal{A}_{\ell}(T)|)})$-time algorithm from [29] to construct the 2D range reporting data structure over $\mathcal{R}_{\ell}(T)$ from the same paper. ∎

We obtain the following corollary for the fundamental problem of text indexing [66, 51, 23, 39, 24, 32, 33, 4, 12, 55, 43, 28].

Corollary 6.

Given $\mathcal{I}_{\ell}(T)$, constructed for some integer $\ell>0$ and some constant $\epsilon>0$ over string $T$, we can report all $k$ occurrences of any pattern $Q$, $|Q|\geq\ell$, in $T$ in time:

  • $\mathcal{O}(|Q|+(k+1)\log^{\epsilon}(|\mathcal{A}_{\ell}(T)|))$ when $\mathcal{I}_{\ell}(T)$ occupies $\mathcal{O}(|\mathcal{A}_{\ell}(T)|)$ extra space; or

  • $\mathcal{O}(|Q|+\log\log(|\mathcal{A}_{\ell}(T)|)+k)$ when $\mathcal{I}_{\ell}(T)$ occupies $\mathcal{O}(|\mathcal{A}_{\ell}(T)|\log^{\epsilon}(|\mathcal{A}_{\ell}(T)|))$ extra space.

Proof.

Every occurrence of $Q$ in $T$ is prefixed by string $P=Q[1..\ell]$. We first compute the bd-anchor of $P$ in $\mathcal{O}(\ell)$ time using Booth's algorithm. Let this bd-anchor be $j$. We set $\alpha=j$ and $\beta=|Q|-j+1$. The result follows by applying Theorem 5. ∎

Querying Multiple Fragments.

In the case of approximate pattern matching, we may want to query multiple length-$\ell$ fragments of a string $Q$ given as an on-line query, and not only its length-$\ell$ prefix. We show that such an operation can be done efficiently using the bd-anchors of $Q$ and the $\mathcal{I}_{\ell}(T)$ index.

Corollary 7.

Given $\mathcal{I}_{\ell}(T)$, constructed for some $\ell>0$ and some constant $\epsilon>0$ over $T$, for any sequence (not necessarily consecutive) of $d>0$ length-$\ell$ fragments of a pattern $Q$, $|Q|\geq\ell$, corresponding to the same order-$\ell$ bd-anchor of $Q$, we can report all $k_{d}$ occurrences of all $d$ fragments in $T$ in time:

  • $\mathcal{O}(\ell+(d+k_{d})\log^{\epsilon}(|\mathcal{A}_{\ell}(T)|))$ when $\mathcal{I}_{\ell}(T)$ occupies $\mathcal{O}(|\mathcal{A}_{\ell}(T)|)$ space; or

  • $\mathcal{O}(\ell+d\log\log(|\mathcal{A}_{\ell}(T)|)+k_{d})$ when $\mathcal{I}_{\ell}(T)$ occupies $\mathcal{O}(|\mathcal{A}_{\ell}(T)|\log^{\epsilon}(|\mathcal{A}_{\ell}(T)|))$ space.

Proof.

Let the order-$\ell$ bd-anchor over $Q$ be $j_{Q}$ and the corresponding parameters be $(\alpha_{1},\beta_{1}),\ldots,(\alpha_{d},\beta_{d})$, with $\alpha_{i}+\beta_{i}=\ell+1$. Observe that $\alpha_{i}>\alpha_{i+1}$ and $\beta_{i}<\beta_{i+1}$. Starting from $j_{Q}$, the string $\overleftarrow{Q[j_{Q}-\alpha_{i}+1..j_{Q}]}$ we spell for fragment $i$ is a prefix of $\overleftarrow{Q[j_{Q}-\alpha_{i-1}+1..j_{Q}]}$ for fragment $i-1$. The analogous property holds for the other direction: the string $Q[j_{Q}..j_{Q}+\beta_{i}-1]$ we spell for fragment $i$ is a prefix of $Q[j_{Q}..j_{Q}+\beta_{i+1}-1]$ for fragment $i+1$. Thus it takes only $\mathcal{O}(\ell)$ time to construct all $d$ rectangles. Finally, we ask the $d$ corresponding 2D range reporting queries to obtain all $k_{d}$ occurrences in the claimed time complexities. ∎

Index Evaluation.

Consider a hash table with the following (key, value) pairs: the key is the hash value $h(S)$ of a length-$k$ string $S$; and the value (satellite data) is the list of occurrences of $S$ in $T$. It should be clear that such a hash table indexing the minimizers of $T$ does not perform well for on-line pattern searches of arbitrary length, because it would need to verify the remaining prefix and suffix of the pattern using letter comparisons for all occurrences of a minimizer in $T$. We thus opted for comparing our index to the one of [31], which addresses this specific problem by sampling the suffix array [51] with minimizers to reduce the number of letter comparisons during verification.

To ensure a fair comparison, we have implemented the basic index from [31]; we denote it by GR Index. We used Karp-Rabin [41] fingerprints for computing the minimizers of $T$. We also used the array-based version of the suffix tree that consists of the suffix array (SA) and the longest common prefix (LCP) array [51]; the SA was constructed using SDSL [30] and the LCP array using the Kasai et al. [42] algorithm.

We sampled the SA using the minimizers. Given a pattern $Q$, we searched $Q[j..|Q|]$, starting with the minimizer $Q[j..j+k-1]$, using the Manber and Myers [51] algorithm on the sampled SA. For verifying the remaining prefix $Q[1..j-1]$ of $Q$, we used letter comparisons, as described in [31]. The space complexity of this implementation is $\mathcal{O}(n)$ and the extra space for the index is $\mathcal{O}(|\mathcal{M}_{w,k}(T)|)$. The query time is not bounded. We have implemented two versions of our index. We used Booth's algorithm for computing the bd-anchors of $T$. We used SDSL for SA construction and the Kasai et al. algorithm for LCP array construction. We sampled the SA using the bd-anchors, thus constructing $\mathcal{L}^{L}_{\ell}(T)$ and $\mathcal{L}^{R}_{\ell}(T)$. Then, the two versions of our index are:

  1. BDA Index v1: Let $j$ be the bd-anchor of $Q[1\mathinner{.\,.}\ell]$. For $\overleftarrow{Q[1\mathinner{.\,.}j]}$ (resp. $Q[j\mathinner{.\,.}|Q|]$) we used the Manber and Myers algorithm for searching over $\mathcal{L}^{L}_{\ell}(T)$ (resp. $\mathcal{L}^{R}_{\ell}(T)$). We used range trees [14] implemented in CGAL [63] for 2D range reporting as per the described querying process. The space complexity of this implementation is $\mathcal{O}(n+|\mathcal{A}_{\ell}(T)|\log(|\mathcal{A}_{\ell}(T)|))$ and the extra space for the index is $\mathcal{O}(|\mathcal{A}_{\ell}(T)|\log(|\mathcal{A}_{\ell}(T)|))$. The query time is $\mathcal{O}(|Q|+\log^{2}(|\mathcal{A}_{\ell}(T)|)+k)$, where $k$ is the total number of occurrences of $Q$ in $T$.

  2. BDA Index v2: Let $j$ be the bd-anchor of $Q[1\mathinner{.\,.}\ell]$. If $|Q|-j+1\geq j$ (resp. $|Q|-j+1<j$), we search for $Q[j\mathinner{.\,.}|Q|]$ (resp. $\overleftarrow{Q[1\mathinner{.\,.}j]}$) using the Manber and Myers algorithm on $\mathcal{L}^{R}_{\ell}(T)$ (resp. $\mathcal{L}^{L}_{\ell}(T)$). For verifying the remaining part of the pattern we used letter comparisons (see the sketch after this list). The space complexity of this implementation is $\mathcal{O}(n)$ and the extra space for the index is $\mathcal{O}(|\mathcal{A}_{\ell}(T)|)$. The query time is not bounded.
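
The following is a minimal, naive sketch (C++17) of the BDA Index v2 query strategy referenced above: compute the bd-anchor of $Q[1\mathinner{.\,.}\ell]$, match the longer arm of the pattern at the sampled anchors of $T$ (a plain scan stands in for the binary search over $\mathcal{L}^{R}_{\ell}(T)$ or $\mathcal{L}^{L}_{\ell}(T)$), and verify the shorter arm by letter comparisons. The leftmost tie-breaking in the smallest-rotation computation and all names are our own choices; the sketch does not reproduce the complexities discussed in Section 4.

    // Minimal sketch of the BDA Index v2 query strategy.
    #include <iostream>
    #include <set>
    #include <string>

    // Starting position of the lexicographically smallest rotation of s (naive).
    static size_t smallest_rotation(const std::string& s) {
        std::string d = s + s;
        size_t best = 0;
        for (size_t i = 1; i < s.size(); ++i)
            if (d.compare(i, s.size(), d, best, s.size()) < 0) best = i;
        return best;
    }

    int main() {
        const std::string T = "cabbcabcabb";
        const size_t ell = 4;

        // A_ell(T): bd-anchors of all length-ell windows of T.
        std::set<size_t> A;
        for (size_t i = 0; i + ell <= T.size(); ++i)
            A.insert(i + smallest_rotation(T.substr(i, ell)));

        const std::string Q = "bcabca";                       // |Q| >= ell
        const size_t j = smallest_rotation(Q.substr(0, ell)); // bd-anchor of Q[0..ell-1]
        const bool right_longer = Q.size() - j >= j + 1;      // |Q[j..)| vs |Q[0..j]|

        for (size_t a : A) {
            if (a < j || a - j + Q.size() > T.size()) continue;
            const size_t p = a - j;                           // candidate occurrence of Q
            // Stand-in for binary-searching the longer arm in L^R (resp. L^L):
            bool hit = right_longer ? T.compare(a, Q.size() - j, Q, j, Q.size() - j) == 0
                                    : T.compare(p, j + 1, Q, 0, j + 1) == 0;
            if (!hit) continue;
            // Verify the shorter arm by letter comparisons.
            bool ok = true;
            if (right_longer) { for (size_t t = 0; t < j; ++t)            ok &= T[p + t] == Q[t]; }
            else              { for (size_t t = j + 1; t < Q.size(); ++t) ok &= T[p + t] == Q[t]; }
            if (ok) std::cout << "occurrence of Q at position " << p << '\n';
        }
    }

Replacing the plain scan with binary searches over the sorted arrays $\mathcal{L}^{R}_{\ell}(T)$ and $\mathcal{L}^{L}_{\ell}(T)$ (as sketched earlier) yields the strategy actually implemented.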

For each of the five real datasets of Table 1 and each query string length $\ell$, we randomly extracted 500,000 substrings from the text and treated each substring as a query, following [31]. We plot the average query time in Figure 4. As can be seen, BDA Index v2 consistently outperforms GR Index across all datasets and all $\ell$ values. The better performance of BDA Index v2 has two main reasons. First, the verification strategy exploits the fact that the index is bidirectional to apply the Manber and Myers algorithm to the larger part of the pattern, which results in fewer letter comparisons. Second, bd-anchors generally have smaller density than minimizers; see Figure 5. We also plot the peak memory usage in Figure 6. As can be seen, BDA Index v2 requires a similar amount of memory to GR Index.

Figure 4: Average query time (ms) vs. $w,k$ for $\ell=w+k-1$ and the datasets of Table 1. Panels: (a) DNA, (b) XML, (c) ENGLISH, (d) PROTEINS, (e) SOURCES.

BDA Index v1 was slower than GR Index for small $\ell$ but faster for large $\ell$ in three out of the five datasets used, and it had by far the highest memory usage. Let us stress that the inefficiency of BDA Index v1 is not inherent to the query time or space of our algorithm. It is merely because the range tree implementation of CGAL, a standard off-the-shelf library, is unfortunately inefficient in terms of both query time and memory usage; see also [62, 26].

Figure 5: Density vs. $w,k$ for $\ell=w+k-1$ and the datasets of Table 1. Panels: (a) DNA, (b) XML, (c) ENGLISH, (d) PROTEINS, (e) SOURCES.

Figure 6: Peak memory usage (GB) vs. $w,k$ for $\ell=w+k-1$ and the datasets of Table 1. Panels: (a) DNA, (b) XML, (c) ENGLISH, (d) PROTEINS, (e) SOURCES.

Discussion.

The proposed $\mathcal{I}_{\ell}(T)$ index, which is based on bd-anchors, has the following attributes:

  1. Construction: $\mathcal{A}_{\ell}(T)$ is constructed in $\mathcal{O}(n)$ worst-case time and $\mathcal{I}_{\ell}(T)$ is constructed in $\mathcal{O}(n+|\mathcal{A}_{\ell}(T)|\sqrt{\log(|\mathcal{A}_{\ell}(T)|)})$ worst-case time. These time complexities are near-linear in $n$ and do not depend on the alphabet $\Sigma$ as long as $|\Sigma|=n^{\mathcal{O}(1)}$, which is true for virtually any real scenario.

  2. Index Size: By Theorem 5, $\mathcal{I}_{\ell}(T)$ can occupy $\mathcal{O}(|\mathcal{A}_{\ell}(T)|)$ space. By Lemma 6, the size of $\mathcal{A}_{\ell}(T)$ is $\mathcal{O}(n/\ell)$ in expectation, and so $\mathcal{I}_{\ell}(T)$ can also be of size $\mathcal{O}(n/\ell)$ in expectation. In practice, the size depends on $T$ and on the implementation of the 2D range reporting data structure.

  3. Querying: The $\mathcal{I}_{\ell}(T)$ index answers on-line pattern searches in near-optimal time.

  4. Flexibility: Note that one would have to reconstruct a (hash-based) index over the set of $(w,k)$-minimizers in order to increase its specificity or sensitivity: increasing $k$ increases the specificity and decreases the sensitivity. Our $\mathcal{I}_{\ell}(T)$ index, conceptually truncated at string depth $k$, is essentially an index over $(w,k)$-minimizers which additionally wrap around. We can thus increase specificity by considering larger $\alpha,\beta$ values or increase sensitivity by considering smaller $\alpha,\beta$ values. This effect can be realized without reconstructing our $\mathcal{I}_{\ell}(T)$ index: we simply adapt $\alpha$ and $\beta$ at query time.

5 Similarity Search under Edit Distance

We show how bd-anchors can be applied to speed up similarity search under edit distance. This is a fundamental problem with myriad applications in bioinformatics, databases, data mining, and information retrieval. It has thus been studied extensively in the literature, both from a theoretical and from a practical point of view [38, 9, 11, 46, 68, 71, 57, 65, 64, 17, 34, 69, 70]. Let $\mathcal{D}$ be a collection of strings called dictionary. We focus, in particular, on indexing $\mathcal{D}$ for answering the following type of top-$K$ queries: Given a query string $Q$ and an integer $K$, return $K$ strings from the dictionary that are closest to $Q$ with respect to edit distance. We follow a typical seed-chain-align approach, as used by several bioinformatics applications [1, 16, 47, 48]. The main new ingredients we inject, with respect to this classic approach, are that we use: (1) bd-anchors as seeds; and (2) $\mathcal{I}_{\ell}$ to index the dictionary $\mathcal{D}$, for some integer parameter $\ell>0$.

Construction.

We require an integer parameter $\ell>0$ defining the order of the bd-anchors. We set $T=S_{1}\cdots S_{|\mathcal{D}|}$, where $S_{i}\in\mathcal{D}$, compute the bd-anchors of order $\ell$ of $T$, and construct the $\mathcal{I}_{\ell}(T)$ index (see Section 4) using the bd-anchors.
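
A minimal sketch (C++17) of this construction step follows: it concatenates the dictionary strings, records where each one starts, and maps a position of $T$ back to the dictionary string containing it, which is needed when collecting hits per string in the querying phase. The bd-anchor computation and the index of Section 4 are omitted, and all names are ours.

    // Minimal sketch of concatenating the dictionary D into T.
    #include <algorithm>
    #include <iostream>
    #include <string>
    #include <vector>

    struct Dictionary {
        std::string T;                  // T = S_1 S_2 ... S_{|D|}
        std::vector<size_t> start;      // start[i] = starting position of S_{i+1} in T

        explicit Dictionary(const std::vector<std::string>& D) {
            for (const std::string& S : D) {
                start.push_back(T.size());
                T += S;
            }
        }
        // Index (0-based) of the dictionary string covering position pos of T.
        size_t owner(size_t pos) const {
            return std::upper_bound(start.begin(), start.end(), pos) - start.begin() - 1;
        }
    };

    int main() {
        Dictionary dict({"GATTACA", "ACGT", "TTAGGG"});
        // A hit reported by the index at position 9 of T falls inside S_2 = "ACGT".
        std::cout << "position 9 belongs to string #" << dict.owner(9) + 1 << '\n';
    }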

Querying.

We require two parameters $\tau\geq 0$ and $\delta\geq 0$. The former parameter controls the sensitivity of our filtering step (Step 2 below); and the latter one controls the sensitivity of our verification step (Step 3 below). Both parameters trade accuracy for speed.

  1. For each query string $Q$, we compute the bd-anchors of order $\ell$. For every bd-anchor $j_{Q}$, we take an arbitrary fragment (e.g., the leftmost) of length $\ell$ anchored at $j_{Q}$ as the seed. Let this fragment start at position $i_{Q}$. This implies a value for $\alpha$ and $\beta$, with $\alpha+\beta=\ell+1$; specifically, for $Q[i_{Q}\mathinner{.\,.}i_{Q}+\ell-1]$ we have $Q[i_{Q}\mathinner{.\,.}j_{Q}]=Q[j_{Q}-\alpha+1\mathinner{.\,.}j_{Q}]$ and $Q[j_{Q}\mathinner{.\,.}i_{Q}+\ell-1]=Q[j_{Q}\mathinner{.\,.}j_{Q}+\beta-1]$. For every bd-anchor $j_{Q}$, we query $\overleftarrow{Q[j_{Q}-\alpha+1\mathinner{.\,.}j_{Q}]}$ in $\mathcal{T}^{L}_{\ell}(T)$ and $Q[j_{Q}\mathinner{.\,.}j_{Q}+\beta-1]$ in $\mathcal{T}^{R}_{\ell}(T)$, and collect all $(\alpha,\beta)$-hits.

  2. Let $\tau\geq 0$ be an input parameter and let $L_{Q,S}=(q_{1},s_{1}),\ldots,(q_{h},s_{h})$ be the list of all $(\alpha,\beta)$-hits between the queried fragments of string $Q$ and fragments of a string $S\in\mathcal{D}$. If $h<\tau$, we consider string $S$ as not found. The intuition here is that if $Q$ and $S$ are sufficiently close with respect to edit distance, they would have a relatively long $L_{Q,S}$ [16]. If $h\geq\tau$, we sort the elements of $L_{Q,S}$ with respect to their first component. (This comes for free because we process $Q$ from left to right.) We then compute a longest increasing subsequence (LIS) in $L_{Q,S}$ with respect to the second component, which chains the $(\alpha,\beta)$-hits, in $\mathcal{O}(h\log h)$ time [59] per $L_{Q,S}$ list (see the sketch after this list). We use the LIS of $L_{Q,S}$ to estimate the identity score (total number of matching letters in a fixed alignment) for $Q$ and $S$, which we denote by $E_{Q,S}$, based on the occurrences of the $(\alpha,\beta)$-hits in the LIS.

  3. Let $\delta\geq 0$ be an input parameter and let $E_{K}$ be the $K$th largest estimated identity score. We extract, as candidates, the strings whose estimated identity score is at least $E_{K}-\delta$. For every candidate string $S$, we close the gaps between the occurrences of the $(\alpha,\beta)$-hits in the LIS using dynamic programming [45], thus computing an upper bound on the edit distance between $Q$ and $S$ (UB score). In particular, closing the gaps consists in summing up the exact edit distance for all pairs of fragments (one from $S$ and one from $Q$) that lie in between the $(\alpha,\beta)$-hits. We return the $K$ candidate strings with the lowest UB scores. If $\delta=0$, we return the $K$ strings with the highest $E_{Q,S}$ scores.
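
Step 2 above chains the $(\alpha,\beta)$-hits by computing a longest increasing subsequence of their positions in $S$, after sorting by their positions in $Q$. A minimal $\mathcal{O}(h\log h)$ sketch (C++17) of this chaining follows; the naming and the choice of strictly increasing positions are our own.

    // Minimal sketch of O(h log h) chaining of hits via a longest increasing subsequence.
    #include <algorithm>
    #include <iostream>
    #include <utility>
    #include <vector>

    // hits: (position in Q, position in S), sorted by position in Q.
    // Returns the indices (into hits) of one longest chain.
    static std::vector<size_t> chain(const std::vector<std::pair<size_t, size_t>>& hits) {
        std::vector<size_t> tail_val, tail_idx, parent(hits.size());
        for (size_t i = 0; i < hits.size(); ++i) {
            size_t s = hits[i].second;
            size_t pos = std::lower_bound(tail_val.begin(), tail_val.end(), s) - tail_val.begin();
            parent[i] = pos ? tail_idx[pos - 1] : hits.size();   // predecessor in the chain
            if (pos == tail_val.size()) { tail_val.push_back(s); tail_idx.push_back(i); }
            else                        { tail_val[pos] = s;     tail_idx[pos] = i;     }
        }
        std::vector<size_t> lis;
        if (!tail_idx.empty())
            for (size_t i = tail_idx.back(); i != hits.size(); i = parent[i])
                lis.push_back(i);
        std::reverse(lis.begin(), lis.end());
        return lis;
    }

    int main() {
        std::vector<std::pair<size_t, size_t>> L = {{2, 30}, {5, 10}, {9, 40}, {14, 25}, {20, 50}};
        for (size_t i : chain(L))
            std::cout << "(" << L[i].first << "," << L[i].second << ") ";
        std::cout << '\n';   // prints one longest chain: (5,10) (14,25) (20,50)
    }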

Index Evaluation.

We compared our algorithm, called BDA Search, to Min Search, the state-of-the-art tool for top-$K$ similarity search under edit distance proposed by Zhang and Zhang in [70]. The main concept used in Min Search is the rank of a letter in a string, defined as the size of the neighborhood of the string in which the letter has the minimum hash value. Based on this concept, Min Search partitions each string in the dictionary $\mathcal{D}$ into a hierarchy of substrings and then builds an index comprised of a set of hash tables, so that strings having common substrings and thus small edit distance are grouped into the same hash table. To find the top-$K$ closest strings to a query string, Min Search partitions the query string based on the ranks of its letters and then traverses the hash tables comprising the index. Thanks to the index and the use of several filtering tricks, Min Search is at least one order of magnitude faster with respect to query time than popular alternatives [69, 71, 18].

We implemented two versions of BDA Search: BDA Search v1, which is based on BDA Index v1; and BDA Search v2, which is based on BDA Index v2. For Min Search, we used the C++ implementation from https://github.com/kedayuge/Search.

We constructed synthetic datasets, referred to as SYN, in a way that enables us to study the impact of different parameters and to efficiently identify the ground truth (the top-$K$ closest strings to a query string with respect to edit distance). Specifically, we first generated 50 query strings and then constructed a cluster of $K$ strings around each query string. To generate the query strings, we started from an arbitrary string $Q$ of length $|Q|=1000$ from a real dataset of protein sequences, used in [70], and generated a string $Q^{\prime}$ that is at edit distance $e$ from $Q$, by performing $e$ edit distance operations, each with equal probability. Then, we treated $Q^{\prime}$ as $Q$ and repeated the process to generate the next query string. To create the clusters, we first added each query string into an initially empty cluster and then added $K-1$ strings, each at edit distance at most $e^{\prime}<e$ from the query string. The strings were generated by performing at most $e^{\prime}$ edit distance operations, each with equal probability. Thus, each cluster contains the top-$K$ closest strings to the query string of the cluster. We used $K\in\{5,10,15,20,25\}$, $d=\frac{e}{|Q|}\in\{0.1,0.15,0.2,0.25,0.3\}$, and $d^{\prime}=\frac{e^{\prime}}{|Q|}=d-0.05$. We evaluated query answering accuracy using the F1 score [52], expressed as the harmonic mean of precision and recall. (Precision is the ratio between the number of returned strings that are among the top-$K$ closest strings to a query string and the number of all returned strings. Recall is the ratio between the number of returned strings that are among the top-$K$ closest strings to a query string and $K$. Since all tested algorithms return $K$ strings, the F1 score in our experiments is equal to precision and equal to recall.) For BDA Search, we report results for $\tau=0$ (full sensitivity during filtering) and $\delta=0$ (no sensitivity during verification), as this setting was empirically determined to be a reasonable trade-off between accuracy and speed. For Min Search, we report results using its default parameters from [70].
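
For concreteness, the following is a minimal sketch (C++17) of the mutation step used in this kind of data generation: derive a string from $Q$ by applying $e$ edit operations (substitution, insertion, deletion), each type chosen with equal probability. The alphabet, the random positions, and the seed are our own illustrative choices; note that, in this naive sketch, $e$ random operations guarantee edit distance at most $e$.

    // Minimal sketch of generating a string by e random edit operations.
    #include <iostream>
    #include <random>
    #include <string>

    static std::string mutate(std::string s, size_t e, std::mt19937& gen) {
        const std::string alphabet = "ACDEFGHIKLMNPQRSTVWY";   // protein letters
        std::uniform_int_distribution<int> op(0, 2);
        std::uniform_int_distribution<size_t> letter(0, alphabet.size() - 1);
        for (size_t t = 0; t < e && !s.empty(); ++t) {
            std::uniform_int_distribution<size_t> pos(0, s.size() - 1);
            size_t p = pos(gen);
            switch (op(gen)) {
                case 0:  s[p] = alphabet[letter(gen)]; break;                   // substitution
                case 1:  s.insert(s.begin() + p, alphabet[letter(gen)]); break; // insertion
                default: s.erase(s.begin() + p); break;                         // deletion
            }
        }
        return s;
    }

    int main() {
        std::mt19937 gen(42);
        std::string Q(1000, 'A');                 // stand-in for a real protein string
        std::string Qprime = mutate(Q, 100, gen); // at edit distance at most 100 from Q
        std::cout << "|Q'| = " << Qprime.size() << '\n';
    }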

We plot the F1 scores and average query times in Figures 7 and 8, respectively. All methods achieved almost perfect accuracy in all tested cases. BDA Search slightly outperformed Min Search (by up to $1.1\%$), remaining accurate even for large $\ell$; the changes in the F1 score of Min Search as $\ell$ varies occur because the underlying method is randomized. However, both versions of BDA Search were more than one order of magnitude faster than Min Search on average (over all results of Figure 8), with BDA Search v1 being 2.9 times slower than BDA Search v2 on average, due to the inefficiency of the range tree implementation of CGAL. Furthermore, both versions of BDA Search scaled better with respect to $K$. For example, the average query time of BDA Search v1 became 2 times larger when $K$ increased from 5 to 25 (on average over $\ell$ values), while that of Min Search became 5.4 times larger on average. The reason is that verification in Min Search, which increases the accuracy of this method, becomes increasingly expensive as $K$ grows. The peak memory usage for these experiments is reported in Figure 9. Although Min Search outperformed BDA Search in terms of memory usage, BDA Search v2 still required a very small amount of memory (less than 1 GB). BDA Search v1 required more memory for the reasons discussed in Section 4.

Figure 7: F1 score vs. (a) $d$, $d^{\prime}$, $\ell$, for $K=20$, and (b) $\ell$, $K$, for $d=0.15$ and $d^{\prime}=0.1$ (dataset SYN).

Figure 8: Average query time (ms) vs. (a) $d$, $d^{\prime}$, $\ell$, for $K=20$, and (b) $\ell$, $K$, for $d=0.15$ and $d^{\prime}=0.1$ (dataset SYN).

Figure 9: Peak memory usage (GB) vs. (a) $d$, $d^{\prime}$, $\ell$, for $K=20$, and (b) $\ell$, $K$, for $d=0.15$ and $d^{\prime}=0.1$ (dataset SYN).

Discussion.

BDA Search outperforms Min Search in accuracy while being more than one order of magnitude faster in query time. These results are very encouraging because the efficiency of BDA Search is entirely due to injecting bd-anchors and not due to any further filtering tricks, such as those employed by Min Search. Min Search clearly outperforms BDA Search in memory usage, although the memory usage of BDA Search v2 is still quite modest. We defer an experimental evaluation using real datasets to the journal version of our work.

6 Other Works on Improving Minimizers

Although every sampling mechanism based on minimizers primarily aims at satisfying Properties 1 and 2, different mechanisms employ total orders that lead to substantially different total numbers of selected minimizers. Thus, research on minimizers has focused on determining total orders which lead to the lowest possible density (recall that the density is defined as the number of selected length-$k$ substrings over the length of the input string). In fact, much of the literature focuses on the average case [56, 54, 53, 22, 72]; namely, the lowest expected density when the input string is random. In practice, many works use a “random minimizer” where the order is defined by choosing a permutation of all the length-$k$ strings at random (e.g., by using a hash function, such as the Karp-Rabin fingerprints [41], on the length-$k$ strings). Such a randomized mechanism has the benefit of being easy to implement and providing good expected performance in practice.
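
A minimal sketch (C++17) of such a random minimizer follows; it uses std::hash on the $k$-mers as a stand-in for a Karp-Rabin fingerprint and recomputes each hash from scratch per window, which are our own simplifications.

    // Minimal sketch of a "random minimizer": in every window of w consecutive
    // k-mers, select the k-mer with the smallest hash value (leftmost on ties).
    #include <functional>
    #include <iostream>
    #include <set>
    #include <string>

    int main() {
        const std::string T = "ACGTACGGTACGT";
        const size_t w = 4, k = 3;
        std::hash<std::string> h;

        std::set<size_t> selected;   // positions of the selected k-mers
        for (size_t i = 0; i + w + k - 1 <= T.size(); ++i) {
            size_t best = i;
            for (size_t j = i + 1; j < i + w; ++j)
                if (h(T.substr(j, k)) < h(T.substr(best, k))) best = j;
            selected.insert(best);
        }
        for (size_t p : selected) std::cout << p << ":" << T.substr(p, k) << ' ';
        std::cout << '\n';
    }

A rolling fingerprint would avoid recomputing each hash; the window scan itself stays the same.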

Minimizers and Universal Hitting Sets.

A universal hitting set (UHS) is an unavoidable set of length-$k$ strings, i.e., it is a set of length-$k$ strings that “hits” every $(w+k-1)$-long fragment of every possible string. The theory of universal hitting sets [56, 53, 43, 73] plays an important role in the current theory of minimizers with low density on average. In particular, if a UHS has small size, it generates minimizers with a provable upper bound on their density. However, UHSs are less useful in the string-specific case for two reasons [74]: (1) the requirement that a UHS has to hit every $(w+k-1)$-long fragment of every possible string is too strong; and (2) UHSs are too large to provide a meaningful upper bound on the density in the string-specific case. Therefore, since in many practical scenarios the input string is known and does not change frequently, we try to optimize the density for one particular string instead of optimizing the average density over a random input.

String-Specific Minimizers.

In the string-specific case, minimizers sampling mechanisms may employ frequency-based orders [10, 37]. In these orders, length-$k$ strings occurring less frequently in the string are smaller than the ones occurring more frequently. The intuition [74] is to obtain a sparse sampling by selecting infrequent length-$k$ strings, which should be spread apart in the string. However, there is no theoretical guarantee that a frequency-based order gives low-density minimizers (there are many counter-examples). Furthermore, frequency-based orders do not always give minimizers with lower density in practice. For instance, the two-tier classification (very frequent vs. less frequent length-$k$ strings) in the work of [37] outperforms an order that strictly follows the frequency of occurrence.
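
The following is a minimal sketch (C++17) of our own rendering of such a frequency-based order: length-$k$ strings occurring less often in $T$ compare smaller, with lexicographic tie-breaking, and minimizers are then selected under this order.

    // Minimal sketch of a string-specific, frequency-based minimizer order.
    #include <iostream>
    #include <set>
    #include <string>
    #include <unordered_map>

    int main() {
        const std::string T = "ACGTACGTTTTTACGA";
        const size_t w = 4, k = 3;

        // Count the occurrences of every k-mer of T.
        std::unordered_map<std::string, size_t> freq;
        for (size_t i = 0; i + k <= T.size(); ++i) ++freq[T.substr(i, k)];

        // less(a, b): true if the k-mer at a precedes the k-mer at b in the order.
        auto less = [&](size_t a, size_t b) {
            const std::string x = T.substr(a, k), y = T.substr(b, k);
            if (freq[x] != freq[y]) return freq[x] < freq[y];
            return x < y;                          // lexicographic tie-breaking
        };

        std::set<size_t> selected;
        for (size_t i = 0; i + w + k - 1 <= T.size(); ++i) {
            size_t best = i;
            for (size_t j = i + 1; j < i + w; ++j)
                if (less(j, best)) best = j;
            selected.insert(best);
        }
        for (size_t p : selected) std::cout << p << ":" << T.substr(p, k) << ' ';
        std::cout << '\n';
    }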

A different approach to constructing string-specific minimizers is to start from a UHS and to remove elements from it, as long as it still hits every $(w+k-1)$-long fragment of the input string [15]. Since this approach starts with a UHS that is not related to the string, the improvement in density may not be significant [74]. Additionally, current methods [22] employing this approach are computationally limited to using $k\leq 16$, as the size of the UHS increases exponentially with $k$. Using such small $k$ values may not be appropriate in some applications.

Other Improvements.

When $k\approx w$, minimizers with expected density of $1.67/w+o(1/w)$ on a random string can be constructed using the approach of [72]. Such minimizers have guaranteed expected density less than $2/(w+1)$ and work for infinitely many $w$ and $k$. The approach of [72] also does not require the use of expensive heuristics to precompute and store a large set of length-$k$ strings, unlike some methods [56, 15, 22] with low density in practice.

The notion of a polar set, which can be seen as complementary to that of a UHS, was recently introduced in [74]. While a UHS is a set of length-$k$ strings that intersect with every $(w+k-1)$-long fragment at least once, a polar set is a set of length-$k$ strings that intersect with any fragment at most once. The construction of a polar set builds upon sets of length-$k$ strings that are sparse in the input string. Thus, the minimizers derived from these polar sets have provably tight bounds on their density. Unfortunately, computing optimal polar sets is NP-hard, as shown in [74]. Thus, the work of [74] also proposed a heuristic for computing feasible “good enough” polar sets. A main disadvantage of this approach is that, when each length-$k$ string occurs frequently in the input string, it becomes hard to select many length-$k$ strings without violating the polar set condition.

References

  • [1] Stephen F. Altschul, Warren Gish, Webb Miller, Eugene W. Myers, and David J. Lipman. Basic local alignment search tool. Journal of Molecular Biology, 215(3):403–410, 1990. doi:10.1016/S0022-2836(05)80360-2.
  • [2] Amihood Amir, Dmitry Keselman, Gad M. Landau, Moshe Lewenstein, Noa Lewenstein, and Michael Rodeh. Text indexing and dictionary matching with one error. J. Algorithms, 37(2):309–325, 2000. URL: https://doi.org/10.1006/jagm.2000.1104, doi:10.1006/jagm.2000.1104.
  • [3] Carl Barton, Tomasz Kociumaka, Chang Liu, Solon P. Pissis, and Jakub Radoszewski. Indexing weighted sequences: Neat and efficient. Inf. Comput., 270, 2020. URL: https://doi.org/10.1016/j.ic.2019.104462, doi:10.1016/j.ic.2019.104462.
  • [4] Djamal Belazzougui. Linear time construction of compressed text indices in compact space. In Symposium on Theory of Computing, STOC 2014, New York, NY, USA, May 31 - June 03, 2014, pages 148–193, 2014. doi:10.1145/2591796.2591885.
  • [5] Djamal Belazzougui and Simon J. Puglisi. Range predecessor and lempel-ziv parsing. In Robert Krauthgamer, editor, Proceedings of the Twenty-Seventh Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2016, Arlington, VA, USA, January 10-12, 2016, pages 2053–2071. SIAM, 2016. URL: https://doi.org/10.1137/1.9781611974331.ch143, doi:10.1137/1.9781611974331.ch143.
  • [6] Kellogg S. Booth. Lexicographically least circular substrings. Inf. Process. Lett., 10(4/5):240–242, 1980. URL: https://doi.org/10.1016/0020-0190(80)90149-0, doi:10.1016/0020-0190(80)90149-0.
  • [7] Timothy M. Chan, Kasper Green Larsen, and Mihai Patrascu. Orthogonal range searching on the RAM, revisited. In Ferran Hurtado and Marc J. van Kreveld, editors, Proceedings of the 27th ACM Symposium on Computational Geometry, Paris, France, June 13-15, 2011, pages 1–10. ACM, 2011. URL: https://doi.org/10.1145/1998196.1998198, doi:10.1145/1998196.1998198.
  • [8] Panagiotis Charalampopoulos, Costas S. Iliopoulos, Chang Liu, and Solon P. Pissis. Property suffix array with applications in indexing weighted sequences. ACM J. Exp. Algorithmics, 25, April 2020. URL: https://doi.org/10.1145/3385898, doi:10.1145/3385898.
  • [9] Surajit Chaudhuri, Kris Ganjam, Venkatesh Ganti, and Rajeev Motwani. Robust and efficient fuzzy match for online data cleaning. In Alon Y. Halevy, Zachary G. Ives, and AnHai Doan, editors, Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, San Diego, California, USA, June 9-12, 2003, pages 313–324. ACM, 2003. URL: https://doi.org/10.1145/872757.872796, doi:10.1145/872757.872796.
  • [10] Rayan Chikhi, Antoine Limasset, and Paul Medvedev. Compacting de Bruijn graphs from sequencing data quickly and in low memory. Bioinform., 32(12):201–208, 2016. URL: https://doi.org/10.1093/bioinformatics/btw279, doi:10.1093/bioinformatics/btw279.
  • [11] Richard Cole, Lee-Ad Gottlieb, and Moshe Lewenstein. Dictionary matching and indexing with errors and don’t cares. In László Babai, editor, Proceedings of the 36th Annual ACM Symposium on Theory of Computing, Chicago, IL, USA, June 13-16, 2004, pages 91–100. ACM, 2004. URL: https://doi.org/10.1145/1007352.1007374, doi:10.1145/1007352.1007374.
  • [12] Richard Cole, Tsvi Kopelowitz, and Moshe Lewenstein. Suffix trays and suffix trists: Structures for faster text indexing. Algorithmica, 72(2):450–466, 2015. doi:10.1007/s00453-013-9860-6.
  • [13] Maxime Crochemore, Christophe Hancart, and Thierry Lecroq. Algorithms on strings. Cambridge University Press, 2007.
  • [14] Mark de Berg, Otfried Cheong, Marc J. van Kreveld, and Mark H. Overmars. Computational geometry: algorithms and applications, 3rd Edition. Springer, 2008. URL: https://www.worldcat.org/oclc/227584184.
  • [15] Dan F. DeBlasio, Fiyinfoluwa Gbosibo, Carl Kingsford, and Guillaume Marçais. Practical universal k-mer sets for minimizer schemes. In Xinghua Mindy Shi, Michael Buck, Jian Ma, and Pierangelo Veltri, editors, Proceedings of the 10th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, BCB 2019, Niagara Falls, NY, USA, September 7-10, 2019, pages 167–176. ACM, 2019. URL: https://doi.org/10.1145/3307339.3342144, doi:10.1145/3307339.3342144.
  • [16] Arthur L. Delcher, Simon Kasif, Robert D. Fleischmann, Jeremy Peterson, Owen White, and Steven L. Salzberg. Alignment of whole genomes. Nucleic Acids Research, 27(11):2369–2376, 1999. URL: https://doi.org/10.1093/nar/27.11.2369, doi:10.1093/nar/27.11.2369.
  • [17] Dong Deng, Guoliang Li, and Jianhua Feng. A pivotal prefix based filtering algorithm for string similarity search. In Curtis E. Dyreson, Feifei Li, and M. Tamer Özsu, editors, International Conference on Management of Data, SIGMOD 2014, Snowbird, UT, USA, June 22-27, 2014, pages 673–684. ACM, 2014. URL: https://doi.org/10.1145/2588555.2593675, doi:10.1145/2588555.2593675.
  • [18] Dong Deng, Guoliang Li, Jianhua Feng, and Wen-Syan Li. Top-k string similarity search with edit-distance constraints. In Christian S. Jensen, Christopher M. Jermaine, and Xiaofang Zhou, editors, 29th IEEE International Conference on Data Engineering, ICDE 2013, Brisbane, Australia, April 8-12, 2013, pages 925–936. IEEE Computer Society, 2013. URL: https://doi.org/10.1109/ICDE.2013.6544886, doi:10.1109/ICDE.2013.6544886.
  • [19] Sebastian Deorowicz, Marek Kokot, Szymon Grabowski, and Agnieszka Debudaj-Grabysz. KMC 2: fast and resource-frugal k-mer counting. Bioinform., 31(10):1569–1576, 2015. URL: https://doi.org/10.1093/bioinformatics/btv022, doi:10.1093/bioinformatics/btv022.
  • [20] Patrick Dinklage, Johannes Fischer, Alexander Herlez, Tomasz Kociumaka, and Florian Kurpicz. Practical Performance of Space Efficient Data Structures for Longest Common Extensions. In Fabrizio Grandoni, Grzegorz Herman, and Peter Sanders, editors, 28th Annual European Symposium on Algorithms (ESA 2020), volume 173 of Leibniz International Proceedings in Informatics (LIPIcs), pages 39:1–39:20, Dagstuhl, Germany, 2020. Schloss Dagstuhl–Leibniz-Zentrum für Informatik. URL: https://drops.dagstuhl.de/opus/volltexte/2020/12905, doi:10.4230/LIPIcs.ESA.2020.39.
  • [21] Irit Dinur and Shmuel Safra. The importance of being biased. In John H. Reif, editor, Proceedings on 34th Annual ACM Symposium on Theory of Computing, May 19-21, 2002, Montréal, Québec, Canada, pages 33–42. ACM, 2002. URL: https://doi.org/10.1145/509907.509915, doi:10.1145/509907.509915.
  • [22] Baris Ekim, Bonnie Berger, and Yaron Orenstein. A randomized parallel algorithm for efficiently finding near-optimal universal hitting sets. In Russell Schwartz, editor, Research in Computational Molecular Biology - 24th Annual International Conference, RECOMB 2020, Padua, Italy, May 10-13, 2020, Proceedings, volume 12074 of Lecture Notes in Computer Science, pages 37–53. Springer, 2020. URL: https://doi.org/10.1007/978-3-030-45257-5_3, doi:10.1007/978-3-030-45257-5\_3.
  • [23] Martin Farach. Optimal suffix tree construction with large alphabets. In 38th Annual Symposium on Foundations of Computer Science, FOCS ’97, Miami Beach, Florida, USA, October 19-22, 1997, pages 137–143, 1997. URL: https://doi.org/10.1109/SFCS.1997.646102, doi:10.1109/SFCS.1997.646102.
  • [24] Paolo Ferragina and Giovanni Manzini. Indexing compressed text. J. ACM, 52(4):552–581, 2005. doi:10.1145/1082036.1082039.
  • [25] Paolo Ferragina and Gonzalo Navarro. Pizza&Chili corpus – compressed indexes and their testbeds. http://pizzachili.dcc.uchile.cl/texts.html.
  • [26] Vissarion Fisikopoulos. An implementation of range trees with fractional cascading in C++. CoRR, abs/1103.4521, 2011. URL: http://arxiv.org/abs/1103.4521, arXiv:1103.4521.
  • [27] Michael L. Fredman, János Komlós, and Endre Szemerédi. Storing a sparse table with $\mathcal{O}(1)$ worst case access time. J. ACM, 31(3):538–544, 1984. doi:10.1145/828.1884.
  • [28] Travis Gagie, Gonzalo Navarro, and Nicola Prezza. Fully functional suffix trees and optimal text searching in BWT-runs bounded space. J. ACM, 67(1):2:1–2:54, 2020. doi:10.1145/3375890.
  • [29] Younan Gao, Meng He, and Yakov Nekrich. Fast preprocessing for optimal orthogonal range reporting and range successor with applications to text indexing. In Fabrizio Grandoni, Grzegorz Herman, and Peter Sanders, editors, 28th Annual European Symposium on Algorithms (ESA 2020), volume 173 of Leibniz International Proceedings in Informatics (LIPIcs), pages 54:1–54:18, Dagstuhl, Germany, 2020. Schloss Dagstuhl–Leibniz-Zentrum für Informatik. URL: https://drops.dagstuhl.de/opus/volltexte/2020/12920, doi:10.4230/LIPIcs.ESA.2020.54.
  • [30] Simon Gog, Timo Beller, Alistair Moffat, and Matthias Petri. From theory to practice: Plug and play with succinct data structures. In Joachim Gudmundsson and Jyrki Katajainen, editors, Experimental Algorithms - 13th International Symposium, SEA 2014, Copenhagen, Denmark, June 29 - July 1, 2014. Proceedings, volume 8504 of Lecture Notes in Computer Science, pages 326–337. Springer, 2014. URL: https://doi.org/10.1007/978-3-319-07959-2_28, doi:10.1007/978-3-319-07959-2\_28.
  • [31] Szymon Grabowski and Marcin Raniszewski. Sampled suffix array with minimizers. Softw. Pract. Exp., 47(11):1755–1771, 2017. URL: https://doi.org/10.1002/spe.2481, doi:10.1002/spe.2481.
  • [32] Roberto Grossi and Jeffrey Scott Vitter. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM J. Comput., 35(2):378–407, 2005. doi:10.1137/S0097539702402354.
  • [33] Wing-Kai Hon, Kunihiko Sadakane, and Wing-Kin Sung. Breaking a time-and-space barrier in constructing full-text indices. SIAM J. Comput., 38(6):2162–2178, 2009. doi:10.1137/070685373.
  • [34] Huiqi Hu, Guoliang Li, Zhifeng Bao, Jianhua Feng, Yongwei Wu, Zhiguo Gong, and Yaoqiang Xu. Top-k spatio-textual similarity join. IEEE Trans. Knowl. Data Eng., 28(2):551–565, 2016. URL: https://doi.org/10.1109/TKDE.2015.2485213, doi:10.1109/TKDE.2015.2485213.
  • [35] Chirag Jain, Alexander T. Dilthey, Sergey Koren, Srinivas Aluru, and Adam M. Phillippy. A fast approximate algorithm for mapping long reads to large reference databases. J. Comput. Biol., 25(7):766–779, 2018. URL: https://doi.org/10.1089/cmb.2018.0036, doi:10.1089/cmb.2018.0036.
  • [36] Chirag Jain, Sergey Koren, Alexander T. Dilthey, Adam M. Phillippy, and Srinivas Aluru. A fast adaptive algorithm for computing whole-genome homology maps. Bioinform., 34(17):i748–i756, 2018. URL: https://doi.org/10.1093/bioinformatics/bty597, doi:10.1093/bioinformatics/bty597.
  • [37] Chirag Jain, Arang Rhie, Haowen Zhang, Claudia Chu, Brian Walenz, Sergey Koren, and Adam M. Phillippy. Weighted minimizer sampling improves long read mapping. Bioinform., 36(Supplement-1):i111–i118, 2020. URL: https://doi.org/10.1093/bioinformatics/btaa435, doi:10.1093/bioinformatics/btaa435.
  • [38] Tamer Kahveci and Ambuj K. Singh. Efficient index structures for string databases. In Peter M. G. Apers, Paolo Atzeni, Stefano Ceri, Stefano Paraboschi, Kotagiri Ramamohanarao, and Richard T. Snodgrass, editors, VLDB 2001, Proceedings of 27th International Conference on Very Large Data Bases, September 11-14, 2001, Roma, Italy, pages 351–360. Morgan Kaufmann, 2001. URL: http://www.vldb.org/conf/2001/P351.pdf.
  • [39] Juha Kärkkäinen, Peter Sanders, and Stefan Burkhardt. Linear work suffix array construction. J. ACM, 53(6):918–936, 2006. doi:10.1145/1217856.1217858.
  • [40] Richard M. Karp. Reducibility among combinatorial problems. In Raymond E. Miller and James W. Thatcher, editors, Proceedings of a symposium on the Complexity of Computer Computations, held March 20-22, 1972, at the IBM Thomas J. Watson Research Center, Yorktown Heights, New York, USA, The IBM Research Symposia Series, pages 85–103. Plenum Press, New York, 1972. URL: https://doi.org/10.1007/978-1-4684-2001-2_9, doi:10.1007/978-1-4684-2001-2\_9.
  • [41] Richard M. Karp and Michael O. Rabin. Efficient randomized pattern-matching algorithms. IBM J. Res. Dev., 31(2):249–260, 1987. URL: https://doi.org/10.1147/rd.312.0249, doi:10.1147/rd.312.0249.
  • [42] Toru Kasai, Gunho Lee, Hiroki Arimura, Setsuo Arikawa, and Kunsoo Park. Linear-time longest-common-prefix computation in suffix arrays and its applications. In Amihood Amir and Gad M. Landau, editors, Combinatorial Pattern Matching, 12th Annual Symposium, CPM 2001 Jerusalem, Israel, July 1-4, 2001 Proceedings, volume 2089 of Lecture Notes in Computer Science, pages 181–192. Springer, 2001. URL: https://doi.org/10.1007/3-540-48194-X_17, doi:10.1007/3-540-48194-X\_17.
  • [43] Dominik Kempa and Tomasz Kociumaka. String synchronizing sets: sublinear-time BWT construction and optimal LCE data structure. In Moses Charikar and Edith Cohen, editors, Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, STOC 2019, Phoenix, AZ, USA, June 23-26, 2019, pages 756–767. ACM, 2019. URL: https://doi.org/10.1145/3313276.3316368, doi:10.1145/3313276.3316368.
  • [44] Tomasz Kociumaka. Minimal suffix and rotation of a substring in optimal time. In Roberto Grossi and Moshe Lewenstein, editors, 27th Annual Symposium on Combinatorial Pattern Matching, CPM 2016, June 27-29, 2016, Tel Aviv, Israel, volume 54 of LIPIcs, pages 28:1–28:12. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2016. URL: https://doi.org/10.4230/LIPIcs.CPM.2016.28, doi:10.4230/LIPIcs.CPM.2016.28.
  • [45] Vladimir I. Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10:707, 1966.
  • [46] Chen Li, Bin Wang, and Xiaochun Yang. VGRAM: improving performance of approximate queries on string collections using variable-length grams. In Christoph Koch, Johannes Gehrke, Minos N. Garofalakis, Divesh Srivastava, Karl Aberer, Anand Deshpande, Daniela Florescu, Chee Yong Chan, Venkatesh Ganti, Carl-Christian Kanne, Wolfgang Klas, and Erich J. Neuhold, editors, Proceedings of the 33rd International Conference on Very Large Data Bases, University of Vienna, Austria, September 23-27, 2007, pages 303–314. ACM, 2007. URL: http://www.vldb.org/conf/2007/papers/research/p303-li.pdf.
  • [47] Heng Li. Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences. Bioinformatics, 32(14):2103–2110, 2016. URL: https://doi.org/10.1093/bioinformatics/btw152, doi:10.1093/bioinformatics/btw152.
  • [48] Heng Li. Minimap2: pairwise alignment for nucleotide sequences. Bioinform., 34(18):3094–3100, 2018. URL: https://doi.org/10.1093/bioinformatics/bty191, doi:10.1093/bioinformatics/bty191.
  • [49] Grigorios Loukides and Solon P. Pissis. Bidirectional string anchors: A new string sampling mechanism. In Petra Mutzel, Rasmus Pagh, and Grzegorz Herman, editors, 29th Annual European Symposium on Algorithms, ESA 2021, September 6-8, 2021, Lisbon, Portugal (Virtual Conference), volume 204 of LIPIcs, pages 64:1–64:21. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2021. URL: https://doi.org/10.4230/LIPIcs.ESA.2021.64, doi:10.4230/LIPIcs.ESA.2021.64.
  • [50] Veli Mäkinen and Gonzalo Navarro. Position-restricted substring searching. In José R. Correa, Alejandro Hevia, and Marcos A. Kiwi, editors, LATIN 2006: Theoretical Informatics, 7th Latin American Symposium, Valdivia, Chile, March 20-24, 2006, Proceedings, volume 3887 of Lecture Notes in Computer Science, pages 703–714. Springer, 2006. URL: https://doi.org/10.1007/11682462_64, doi:10.1007/11682462\_64.
  • [51] Udi Manber and Eugene W. Myers. Suffix arrays: A new method for on-line string searches. SIAM J. Comput., 22(5):935–948, 1993. URL: https://doi.org/10.1137/0222058, doi:10.1137/0222058.
  • [52] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, USA, 2008.
  • [53] Guillaume Marçais, Dan F. DeBlasio, and Carl Kingsford. Asymptotically optimal minimizers schemes. Bioinform., 34(13):i13–i22, 2018. URL: https://doi.org/10.1093/bioinformatics/bty258, doi:10.1093/bioinformatics/bty258.
  • [54] Guillaume Marçais, David Pellow, Daniel Bork, Yaron Orenstein, Ron Shamir, and Carl Kingsford. Improving the performance of minimizers and winnowing schemes. Bioinform., 33(14):i110–i117, 2017. URL: https://doi.org/10.1093/bioinformatics/btx235, doi:10.1093/bioinformatics/btx235.
  • [55] J. Ian Munro, Gonzalo Navarro, and Yakov Nekrich. Space-efficient construction of compressed indexes in deterministic linear time. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2017, Barcelona, Spain, Hotel Porta Fira, January 16-19, pages 408–424, 2017. doi:10.1137/1.9781611974782.26.
  • [56] Yaron Orenstein, David Pellow, Guillaume Marçais, Ron Shamir, and Carl Kingsford. Compact universal k-mer hitting sets. In Martin C. Frith and Christian Nørgaard Storm Pedersen, editors, Algorithms in Bioinformatics - 16th International Workshop, WABI 2016, Aarhus, Denmark, August 22-24, 2016. Proceedings, volume 9838 of Lecture Notes in Computer Science, pages 257–268. Springer, 2016. URL: https://doi.org/10.1007/978-3-319-43681-4_21, doi:10.1007/978-3-319-43681-4\_21.
  • [57] Jianbin Qin, Wei Wang, Chuan Xiao, Yifei Lu, Xuemin Lin, and Haixun Wang. Asymmetric signature schemes for efficient exact edit similarity query processing. ACM Trans. Database Syst., 38(3):16:1–16:44, 2013. URL: https://doi.org/10.1145/2508020.2508023, doi:10.1145/2508020.2508023.
  • [58] Michael Roberts, Wayne Hayes, Brian R. Hunt, Stephen M. Mount, and James A. Yorke. Reducing storage requirements for biological sequence comparison. Bioinform., 20(18):3363–3369, 2004. URL: https://doi.org/10.1093/bioinformatics/bth408, doi:10.1093/bioinformatics/bth408.
  • [59] Craige Schensted. Longest increasing and decreasing subsequences. Canadian Journal of Mathematics, 13:179–191, 1961. doi:10.4153/CJM-1961-015-3.
  • [60] Saul Schleimer, Daniel Shawcross Wilkerson, and Alexander Aiken. Winnowing: Local algorithms for document fingerprinting. In Alon Y. Halevy, Zachary G. Ives, and AnHai Doan, editors, Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, San Diego, California, USA, June 9-12, 2003, pages 76–85. ACM, 2003. URL: https://doi.org/10.1145/872757.872770, doi:10.1145/872757.872770.
  • [61] Huei-Jan Shyr and Gabriel Thierrin. Disjunctive languages and codes. In Marek Karpinski, editor, Fundamentals of Computation Theory, Proceedings of the 1977 International FCT-Conference, Poznan-Kórnik, Poland, September 19-23, 1977, volume 56 of Lecture Notes in Computer Science, pages 171–176. Springer, 1977. URL: https://doi.org/10.1007/3-540-08442-8_83, doi:10.1007/3-540-08442-8\_83.
  • [62] Yihan Sun and Guy E. Blelloch. Parallel range, segment and rectangle queries with augmented maps. In Stephen G. Kobourov and Henning Meyerhenke, editors, Proceedings of the Twenty-First Workshop on Algorithm Engineering and Experiments, ALENEX 2019, San Diego, CA, USA, January 7-8, 2019, pages 159–173. SIAM, 2019. URL: https://doi.org/10.1137/1.9781611975499.13, doi:10.1137/1.9781611975499.13.
  • [63] The CGAL Project. CGAL User and Reference Manual. CGAL Editorial Board, 5.2.1 edition, 2021. URL: https://doc.cgal.org/5.2.1/Manual/packages.html.
  • [64] Jiannan Wang, Guoliang Li, and Jianhua Feng. Can we beat the prefix filtering?: an adaptive framework for similarity join and search. In K. Selçuk Candan, Yi Chen, Richard T. Snodgrass, Luis Gravano, and Ariel Fuxman, editors, Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2012, Scottsdale, AZ, USA, May 20-24, 2012, pages 85–96. ACM, 2012. URL: https://doi.org/10.1145/2213836.2213847, doi:10.1145/2213836.2213847.
  • [65] Xiaoli Wang, Xiaofeng Ding, Anthony K. H. Tung, and Zhenjie Zhang. Efficient and effective KNN sequence search with approximate n-grams. Proc. VLDB Endow., 7(1):1–12, 2013. URL: http://www.vldb.org/pvldb/vol7/p1-wang.pdf, doi:10.14778/2732219.2732220.
  • [66] Peter Weiner. Linear pattern matching algorithms. In 14th Annual Symposium on Switching and Automata Theory, Iowa City, Iowa, USA, October 15-17, 1973, pages 1–11, 1973. doi:10.1109/SWAT.1973.13.
  • [67] Derrick E. Wood and Steven L. Salzberg. Kraken: Ultrafast metagenomic sequence classification using exact alignments. Genome Biology, 15(3), March 2014. doi:10.1186/gb-2014-15-3-r46.
  • [68] Zhenglu Yang, Jianjun Yu, and Masaru Kitsuregawa. Fast algorithms for top-k approximate string matching. In Maria Fox and David Poole, editors, Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2010, Atlanta, Georgia, USA, July 11-15, 2010. AAAI Press, 2010. URL: http://www.aaai.org/ocs/index.php/AAAI/AAAI10/paper/view/1939.
  • [69] Minghe Yu, Jin Wang, Guoliang Li, Yong Zhang, Dong Deng, and Jianhua Feng. A unified framework for string similarity search with edit-distance constraint. VLDB J., 26(2):249–274, 2017. URL: https://doi.org/10.1007/s00778-016-0449-y, doi:10.1007/s00778-016-0449-y.
  • [70] Haoyu Zhang and Qin Zhang. Minsearch: An efficient algorithm for similarity search under edit distance. In Rajesh Gupta, Yan Liu, Jiliang Tang, and B. Aditya Prakash, editors, KDD ’20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Virtual Event, CA, USA, August 23-27, 2020, pages 566–576. ACM, 2020. URL: https://doi.org/10.1145/3394486.3403099, doi:10.1145/3394486.3403099.
  • [71] Zhenjie Zhang, Marios Hadjieleftheriou, Beng Chin Ooi, and Divesh Srivastava. Bed-tree: an all-purpose index structure for string similarity search based on edit distance. In Ahmed K. Elmagarmid and Divyakant Agrawal, editors, Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2010, Indianapolis, Indiana, USA, June 6-10, 2010, pages 915–926. ACM, 2010. URL: https://doi.org/10.1145/1807167.1807266, doi:10.1145/1807167.1807266.
  • [72] Hongyu Zheng, Carl Kingsford, and Guillaume Marçais. Improved design and analysis of practical minimizers. Bioinform., 36(Supplement-1):i119–i127, 2020. URL: https://doi.org/10.1093/bioinformatics/btaa472, doi:10.1093/bioinformatics/btaa472.
  • [73] Hongyu Zheng, Carl Kingsford, and Guillaume Marçais. Lower density selection schemes via small universal hitting sets with short remaining path length. In Russell Schwartz, editor, Research in Computational Molecular Biology - 24th Annual International Conference, RECOMB 2020, Padua, Italy, May 10-13, 2020, Proceedings, volume 12074 of Lecture Notes in Computer Science, pages 202–217. Springer, 2020. URL: https://doi.org/10.1007/978-3-030-45257-5_13, doi:10.1007/978-3-030-45257-5\_13.
  • [74] Hongyu Zheng, Carl Kingsford, and Guillaume Marçais. Sequence-specific minimizers via polar sets. bioRxiv, 2021. URL: https://www.biorxiv.org/content/early/2021/02/10/2021.02.01.429246, doi:10.1101/2021.02.01.429246.