This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

11institutetext: GSSI, Italy 11email: [email protected] 22institutetext: Dalhousie University, Canada 22email: [email protected], [email protected]
33institutetext: University of Münster, Germany 33email: [email protected]
44institutetext: University Ca’ Foscari, Venice, Italy 44email: [email protected]

Space-time Trade-offs for the LCP Array of Wheeler DFAs

Nicola Cotumaccio 11 2 2 0000-0002-1402-5298    Travis Gagie 22 0000-0003-3689-327X    Dominik Köppl 33 0000-0002-8721-4444    Nicola Prezza 44 0000-0003-3553-4953
Abstract

Recently, Conte et al. generalized the longest-common prefix (LCP) array from strings to Wheeler DFAs, and they showed that it can be used to efficiently determine matching statistics on a Wheeler DFA [DCC 2023]. However, storing the LCP array requires O(nlogn)O(n\log n) bits, nn being the number of states, while the compact representation of Wheeler DFAs often requires much less space. In particular, the BOSS representation of a de Bruijn graph only requires a linear number of bits, if the size of alphabet is constant.

In this paper, we propose a sampling technique that allows to access an entry of the LCP array in logarithmic time by only storing a linear number of bits. We use our technique to provide a space-time trade-off to compute matching statistics on a Wheeler DFA. In addition, we show that by augmenting the BOSS representation of a kk-th order de Bruijn graph with a linear number of bits we can navigate the underlying variable-order de Bruijn graph in time logarithmic in kk, thus improving a previous bound by Boucher et al. which was linear in kk [DCC 2015].

Keywords:
Wheeler graphs LCP array de Bruijn graphs Matching statistics Variable-order de Bruijn graphs.

1 Introduction

In 1973, Weiner invented the suffix tree of a string [28], a versatile data structure which allows to efficiently handle a variety of problems, including solving pattern matching queries, determining matching statistics, identifying combinatorial properties of the string and computing its Lempel-Ziv decomposition. However, the space consumption of a suffix tree can be too high for some applications (including bioinformatics), so over the past 30 years a number of compressed data structures simulating the behavior of a suffix tree have been designed, thus leading to compressed suffix trees [26]. In many applications, one does not need the full functionality of a suffix tree, so it may be sufficient to store only some of these data structures. Among the most popular data structures, we have the suffix array [21], the longest common prefix (LCP) array [21], the Burrows-Wheeler transform (BWT) [6] and the FM-index [13].

In the past 20 years, the ideas behind the suffix array, the BWT and the FM-index have been generalized to trees [12, 14], de Bruijn graphs [5], Wheeler graphs [17, 1] and arbitrary graphs and automata [9, 8]. Broadly speaking, Wheeler graphs concisely capture the intuition behind these data structures in a graph setting; thus, they can be regarded as a benchmark for extending suffix tree functionality to graphs. In particular, the LCP array of a string remarkably extends the functionality of the suffix array, and a recent paper [7] shows that the LCP array can also be generalized to Wheeler DFAs, which represents a remarkable step toward fully simulating suffix-tree functionality in a graph setting. However, the solution in [7] is not space efficient: storing the LCP array of a Wheeler DFA requires O(nlogn)O(n\log n) bits, nn being the number of states. If the size σ\sigma of the alphabet is small, this space can be considerably larger than the space required to store the Wheeler DFA itself. As we will see, if σlogσ=o(logn)\sigma\log\sigma=o(\log n) , then the space required to store the Wheeler DFA is o(nlogn)o(n\log n), and if σ=O(1)\sigma=O(1), then the space required to store the Wheeler DFA is O(n)O(n). The latter case is especially relevant in practice, because de Bruijn graphs are the prototypes of Wheeler graphs, and in bioinformatics de Bruijn graphs are defined over the constant-size alphabet Σ={A,C,G,T}\Sigma=\{A,C,G,T\}.

In this paper, we show that we can sample entries of the LCP array in such a way that, by storing only a linear number of additional bits on top of the Wheeler graph, we can compute each entry of the LCP array in logarithmic time, thus providing a space-time trade-off. More precisely:

Theorem 1.1

We can augment the compact representation of a Wheeler DFA 𝒜\mathcal{A} with O(n)O(n) bits (O(nloglogσ)O(n\log\log\sigma) bits, respectively), where nn is the number of states and σ\sigma is the size of the alphabet, in such a way that we can compute each entry of the LCP array of 𝒜\mathcal{A} in O(lognloglogσ)O(\log n\log\log\sigma) time (O(logn)O(\log n) time, respectively).

We present two applications of our result: computing matching statistics on Wheeler DFAs and navigating varriable-order de Bruijn graphs.

Matching Statistics on Wheeler DFAs

The problem of computing matching statistics on a Wheeler DFA is defined as follows: given a pattern of length mm and a Wheeler DFA with nn states, determine the longest suffix of each prefix of mm that occurs in the graph (that is, that can be read by following some edges on the graph and concatenating the labels). This problem is a natural generalization of the problem of computing matching statistics on strings. Conte et al. [7] proved the following result:

Theorem 1.2

We can augment the compact representation of a Wheeler DFA 𝒜\mathcal{A} with O(nlogn)O(n\log n) bits, where nn is the number of states and σ\sigma is the size of the alphabet, in such a way that we can compute the matching statistics of a pattern of length mm w.r.t to the Wheeler DFA in O(mlogn)O(m\log n) time.

We will show that if we only want to use linear space, then we can use Theorem 1.1 to obtain the following trade-off.

Theorem 1.3

We can augment the compact representation of a Wheeler DFA 𝒜\mathcal{A} with O(nloglogσ)O(n\log\log\sigma) bits, where nn is the number of states and σ\sigma is the size of the alphabet, in such a way that we can compute the matching statistics of a pattern of length mm w.r.t to the Wheeler DFA in O(mlog2n)O(m\log^{2}n) time.

Variable-order de Bruijn Graphs

Wheeler graphs are a generalization of de Bruijn graphs; in particular, the compact representation of a Wheeler graph is a generalization of the BOSS representation of a de Bruijn graph [5], and our results on the LCP array also apply to a de Bruijn graph. Many assemblers [3, 24, 19, 27] consider all kk-mers occurring in a set of reads and build a kk-th order de Bruijn graph (on the alphabet Σ={A,C,G,T}\Sigma=\{A,C,G,T\}) to perform Eulerian sequence assembly [18, 25]. However, the choice of the parameter kk impacts the assembly quality, so some assemblers try several choices for kk [3, 24], which slows down the process because several de Bruijn graphs need to be built. In [4] it was shown that the kk-order de Bruijn graph of 𝒮\mathcal{S} can be used to implicitly store the kk^{\prime}-th order de Bruijn graph of 𝒮\mathcal{S} for every kkk^{\prime}\leq k, thus leading to a variable-order de Bruijn graph. The challenge is to navigate this implicit representation (that is, how to follow edges in a forward or backward fashion). In [4], it was shown that the navigation is possible by storing or by simulating an array 𝖫𝖢𝖯¯G\overline{\mathsf{LCP}}_{G} which can be seen as a simplification of the LCP array of the Wheeler graph GG. More precisely, we have the following result (see [4]; we assume σ=O(1)\sigma=O(1)).

Theorem 1.4
  1. 1.

    We can augment the BOSS representation of a kk-th order de Bruijn graph with O(nlogk)O(n\log k) bits, where nn is the number of nodes, so that the underlying variable-order de Bruijn graph can be navigated in O(logk)O(\log k) time per visited node.

  2. 2.

    We can augment the BOSS representation of a kk-th order de Bruijn graph with O(n)O(n) bits, where nn is the number of nodes, so that the underlying variable-order de Bruijn graph can be navigated in O(klogn)O(k\log n) time per visited node.

Essentially, the first solution in Theorem 1.4 explicitly stores 𝖫𝖢𝖯¯G\overline{\mathsf{LCP}}_{G}, while the second solution in Theorem 1.4 computes the entries of 𝖫𝖢𝖯¯G\overline{\mathsf{LCP}}_{G} by exploiting the BOSS representation. In general, a big kk (close to the size of the reads) allows to retrieve the expressive power on an overlap graph [11], so in Theorem 1.4 we cannot assume that kk is small. On the one hand, the space required for the first solution can be too large, because a de Bruijn graph can be stored by using only O(n)O(n) bits. On the other hand, the time bound in the second solution increases substantially. We can now improve the second solution by providing a data structure that achieves the best of both worlds. As we did in Theorem 1.1, we can conveniently sample some entries of 𝖫𝖢𝖯¯G\overline{\mathsf{LCP}}_{G}. We will prove the following result.

Theorem 1.5

We can augment the BOSS representation of a kk-th order de Bruijn graph with O(n)O(n) bits, where nn is the number of nodes, so that the underlying variable-order de Bruijn graph can be navigated in O(logklogn)O(\log k\log n) time per visited node.

2 Definitions

Sets and Relations

Let VV be a set. A total order on VV is a binary relation \leq which is reflexive, antisymmetric and transitive. We say that UU is a \leq-interval (or simply an interval) if for all v1,v2,v3Vv_{1},v_{2},v_{3}\in V, if v1,v3Uv_{1},v_{3}\in U and v1<v2<v3v_{1}<v_{2}<v_{3}, then v2Uv_{2}\in U. If u,vVu,v\in V, with uvu\leq v, we denote by [u,v][u,v] the smallest interval containing uu and vv, that is [u,v]={zV|uzv[u,v]=\{z\in V\;|\;u\leq z\leq v }. In particular, if VV is the set of integers, then we assume that \leq is the standard total order, hence [u,v]={u,u+1,,v1,v}[u,v]=\{u,u+1,\dots,v-1,v\}.

Strings

Let Σ\Sigma be a finite alphabet, with σ=|Σ|\sigma=|\Sigma|. Let Σ\Sigma^{*} be the set of all finite strings on Σ\Sigma and let Σω\Sigma^{\omega} be the set of all (countably) infinite strings on Σ\Sigma. If αΣ\alpha\in\Sigma^{*}, then αR\alpha^{R} is the reverse string of α\alpha. If α,βΣΣω\alpha,\beta\in\Sigma^{*}\cup\Sigma^{\omega}, we denote by 𝗅𝖼𝗉(α,β)\mathop{\mathsf{lcp}}(\alpha,\beta) the length of longest common prefix between α\alpha and β\beta. In particular, if αΣ\alpha\in\Sigma^{*}, then 𝗅𝖼𝗉(α,β)|α|\mathop{\mathsf{lcp}}(\alpha,\beta)\leq|\alpha| and if α,βΣω\alpha,\beta\in\Sigma^{\omega} with α=β\alpha=\beta, then 𝗅𝖼𝗉(α,β)=\mathop{\mathsf{lcp}}(\alpha,\beta)=\infty. Let \preceq be a fixed total order on Σ\Sigma. We extend the total order \preceq from Σ\Sigma to ΣΣω\Sigma^{*}\cup\Sigma^{\omega} lexicographically.

DFAs

Throughout the paper, let 𝒜=(Q,E,s0,F)\mathcal{A}=(Q,E,s_{0},F) be a deterministic finite automaton (DFA), where QQ is the set of states, EQ×Q×ΣE\subseteq Q\times Q\times\Sigma is the set of labeled edges, s0Qs_{0}\in Q is the initial state and FQF\subseteq Q is the set of final states. The alphabet Σ\Sigma is effective, that is, every cΣc\in\Sigma labels some edge. Since 𝒜\mathcal{A} is deterministic, for every uQu\in Q and for every aΣa\in\Sigma there exists at most one edge labeled aa leaving uu. Following [1], we assume that (i) s0s_{0} has no incoming edges, (ii) every state is reachable from the initial state and (iii) all edges entering the same state have the same label (input-consistency). For every uQ{s0}u\in Q\setminus\{s_{0}\}, let λ(u)Σ\lambda(u)\in\Sigma be the label of all edges entering uu. We define λ(s0)=#\lambda(s_{0})=\#, where #Σ\#\not\in\Sigma is a special character such that #a\#\prec a for every aΣa\in\Sigma (the character #\# plays the same role as the termination character $\$ in suffix arrays, suffix trees and Burrows-Wheeler transforms). As a consequence, an edge (u,u,a)(u^{\prime},u,a) can be simply written as (u,u)(u^{\prime},u), because it must be a=λ(u)a=\lambda(u).

Compact Data Structures

Let AA be an array of length nn containing elements from a finite totally-ordered set. A range minimum query on AA is defined as follows: given 1ijn1\leq i\leq j\leq n, return one of the indices kk with 1kn1\leq k\leq n such that (i) ikji\leq k\leq j and A[k]=min{A[i],A[i+1],,A[j1],A[j]}A[k]=\min\{A[i],A[i+1],\dots,A[j-1],A[j]\}. We write k=RMQA(i,j)k=RMQ_{A}(i,j). Then, there exists a data structure of 2n+o(n)2n+o(n) such that in O(1)O(1) time we can compute RMQA(i,j)RMQ_{A}(i,j) for every 1in1\leq i\leq n, without the need to access AA [15, 16]. This result is essentially optimal, because every data structure solving range minimum queries on AA requires at least 2nΘ(logn)2n-\Theta(\log n) bits [16, 20].

Let AA be a bitvector of length nn. Let rank(A,i)=|{j{1,2,,i1,i}|A[j]=1}|rank(A,i)=|\{j\in\{1,2,\dots,i-1,i\}\;|\;A[j]=1\}| be the number of 11’s among the first ii bits of AA. Then, there exists a data structure of n+o(n)n+o(n) bits such that in O(1)O(1) time we can compute rank(A,i)rank(A,i) for 1in1\leq i\leq n [23].

3 Wheeler DFAs

Let us recall the definition of Wheeler DFA [7].

Definition 1

Let 𝒜=(Q,E,s0,F)\mathcal{A}=(Q,E,s_{0},F) be a DFA. A Wheeler order on 𝒜\mathcal{A} is a total order \leq on Q such that s0us_{0}\leq u for every uQu\in Q and:

  1. 1.

    (Axiom 1) If u,vQu,v\in Q and u<vu<v, then λ(u)λ(v)\lambda(u)\preceq\lambda(v).

  2. 2.

    (Axiom 2) If (u,u),(v,v)E(u^{\prime},u),(v^{\prime},v)\in E, λ(u)=λ(v)\lambda(u)=\lambda(v) and u<vu<v, then u<vu^{\prime}<v^{\prime}.

A DFA 𝒜\mathcal{A} is Wheeler if it admits a Wheeler order.

Every DFA admits at most one Wheeler order [1], so the total order \leq in Definition 1 is the Wheeler order on 𝒜\mathcal{A}. In the following, we fix a Wheeler DFA 𝒜=(Q,E,s0,F)\mathcal{A}=(Q,E,s_{0},F), with n=|Q|n=|Q| and e=|E|e=|E|, and we write Q={u1,,un}Q=\{u_{1},\dots,u_{n}\}, with u1<u2<<unu_{1}<u_{2}<\dots<u_{n} in the Wheeler order. In particular, u1=s0u_{1}=s_{0}. Following [7], we assume that s0s_{0} has a self-loop labeled #\#, which is consistent with Axiom 1, because #a\#\prec a for every aΣa\in\Sigma). This implies that every state has at least one incoming edge, so for every state uiu_{i} there exists at least one infinite string αΣω\alpha\in\Sigma^{\omega} that can be read starting from uiu_{i} and following edges in a backward fashion. We denote by IuiI_{u_{i}} the nonempty set of all such strings. Formally:

Definition 2

Let 1in1\leq i\leq n. Define:

Iui={αΣω|there exist integers f1,f2, in [1,n] such that (i) f1=i,(ii) (ufk+1,ufk)E for every k1 and (iii) α=λ(uf1)λ(uf2)}.\begin{split}I_{u_{i}}=\{\alpha\in\Sigma^{\omega}\;|\;\text{there exist integers $f_{1},f_{2},\dots$ in $[1,n]$ such that (i) $f_{1}=i$,}\\ \text{(ii) $(u_{f_{k+1}},u_{f_{k}})\in E$ for every $k\geq 1$ and (iii) $\alpha=\lambda(u_{f_{1}})\lambda(u_{f_{2}})\dots$}\}.\end{split}

For every 1in1\leq i\leq n, let pmin(i)p_{\min}(i) be the smallest 1in1\leq i^{\prime}\leq n such that (ui,ui)E(u_{i^{\prime}},u_{i})\in E and let pmax(i)p_{\max}(i) be the largest 1i′′n1\leq i^{\prime\prime}\leq n such that (ui′′,ui)E(u_{i^{\prime\prime}},u_{i})\in E. Both pmin(i)p_{\min}(i) and pmax(i)p_{\max}(i) are well-defined because every state has at least one incoming edge. For every 1in1\leq i\leq n, define pmin1(i)=pmin(i)p^{1}_{\min}(i)=p_{\min}(i) and recursively, for j2j\geq 2, let pminj(i)=pmin(pminj1(i))p_{\min}^{j}(i)=p_{\min}(p_{\min}^{j-1}(i)). Then, λ(ui)λ(pmin(i))λ(pmin2(i))λ(pmin3(i))\lambda(u_{i})\lambda(p_{\min}(i))\lambda(p_{\min}^{2}(i))\lambda(p_{\min}^{3}(i))\dots is the lexicographically smallest string in IuiI_{u_{i}}, which we denote by mini\min_{i} [7]. Analogously, one can define the lexicographically largest string in IuiI_{u_{i}} by using pmaxp_{\max}. Moreover, in [7] it was shown that:

min1max1min2max2maxn1minnmaxn.\mathrm{min}_{1}\preceq\mathrm{max}_{1}\preceq\mathrm{min}_{2}\preceq\mathrm{max}_{2}\preceq\dots\preceq\mathrm{max}_{n-1}\preceq\mathrm{min}_{n}\preceq\mathrm{max}_{n}.

Intuitively, the previous equation shows that the permutation of the set of all states of 𝒜\mathcal{A} induced by the Wheeler order can be seen as a generalization of the permutation of positions induced by the prefix array of a string α\alpha (or equivalently, the suffix array of the reverse string of αR\alpha^{R}). Indeed, a string α\alpha can also be seen as a DFA 𝒜=(Q,E,s0,F)\mathcal{A}^{\prime}=(Q^{\prime},E^{\prime},s_{0}^{\prime},F^{\prime}), where Q={q0,q1,q|α|}Q^{\prime}=\{q^{\prime}_{0},q^{\prime}_{1}\dots,q^{\prime}_{|\alpha|}\}, s0=q0s_{0}^{\prime}=q^{\prime}_{0}, F={q|α|}F^{\prime}=\{q^{\prime}_{|\alpha|}\} (the set FF plays no role here), λ(qi)\lambda(q^{\prime}_{i}) is the ii-th character of α\alpha for 1in1\leq i\leq n and E={(qi1,qi)| 1in}E^{\prime}=\{(q^{\prime}_{i-1},q^{\prime}_{i})\;|\;1\leq i\leq n\} (every state is reached by exactly one string so the minimum and the maximum string reaching each state are equal).

Let 1rsn1\leq r\leq s\leq n and let cΣc\in\Sigma. Let Er,s,cE_{r,s,c} be the set of all states that can be reached from a state in [r,s][r,s] by following edges labeled cc; formally, Er,s,c={1jn|λ(uj)=c and (ui,uj)E for some i[r,s] }E_{r,s,c}=\{1\leq j\leq n\;|\;\text{$\lambda(u_{j})=c$ and $(u_{i},u_{j})\in E$ for some $i\in[r,s]$ }\}. Then, Er,s,cE_{r,s,c} is again an interval, that is, there exist 1rsn1\leq r^{\prime}\leq s^{\prime}\leq n such that Er,s,c=[r,s]E_{r,s,c}=[r^{\prime},s^{\prime}] [17]. This property enables a compression mechanism that generalizes the Burrows-Wheeler transform [6] and the FM-index [13] to Wheeler DFAs. The Wheeler DFA 𝒜\mathcal{A} can be stored by using only 2e+4n+elogσ+σloge2e+4n+e\log\sigma+\sigma\log e bits (up to lower order terms), including nn bits to mark the set FF of final states and nn bits to mark all 1in1\leq i\leq n such that λ(ui)λ(ui1)\lambda(u_{i})\not=\lambda(u_{i-1}), which allows us to retrieve each λ(ui)\lambda(u_{i}) in O(1)O(1) time by using a rank query [17] (recall that nn is the number of states and ee is the number of edges). Since 𝒜\mathcal{A} is a DFA, we have enσe\leq n\sigma, so the required space is O(nσlogσ)O(n\sigma\log\sigma). If the alphabet is small — that is, if σlogσ=o(logn)\sigma\log\sigma=o(\log n) — then the number of required bits is o(nlogn)o(n\log n); if σ=O(1)\sigma=O(1), then the number of required bits is O(n)O(n). This compact representation supports efficient navigation of the graph and it allows to solve pattern matching queries. More precisely, by resorting to state-of-the art select queries [23] in O(loglogσ)O(\log\log\sigma) time (i) for 1in1\leq i\leq n, we can compute pmin(i)p_{\min}(i) and pmax(i)p_{\max}(i) and (ii) given 1rsn1\leq r\leq s\leq n and cΣc\in\Sigma, we can compute [r,s]=Er,s,c[r^{\prime},s^{\prime}]=E_{r,s,c} [17]. In particular, query (ii) is the so-called forward-search, which generalizes the analogous mechanism of the FM-index, thus allowing to solve pattern matching queries on the graph.

The Wheeler order generalizes the notion of suffix array from strings to DFA. It is also possible to generalize LCP-arrays from strings to graph [7].

Definition 3

The LCP-array of the Wheeler DFA 𝒜\mathcal{A} is the array 𝖫𝖢𝖯𝒜=𝖫𝖢𝖯𝒜[2,2n]\mathsf{LCP}_{\mathcal{A}}=\mathsf{LCP}_{\mathcal{A}}[2,2n] which contains the following 2n12n-1 values in the following order: 𝗅𝖼𝗉(min1,max1)\mathop{\mathsf{lcp}}(\min_{1},\max_{1}), 𝗅𝖼𝗉(max1,min2)\mathop{\mathsf{lcp}}(\max_{1},\min_{2}), 𝗅𝖼𝗉(min2,max2)\mathop{\mathsf{lcp}}(\min_{2},\max_{2}), \dots, 𝗅𝖼𝗉(maxn1,minn)\mathop{\mathsf{lcp}}(\max_{n-1},\min_{n}), 𝗅𝖼𝗉(minn,maxn)\mathop{\mathsf{lcp}}(\min_{n},\max_{n}). In other words, 𝖫𝖢𝖯[2i]=𝗅𝖼𝗉(mini,maxi)\mathsf{LCP}[2i]=\mathop{\mathsf{lcp}}(\min_{i},\max_{i}) for 1in1\leq i\leq n and 𝖫𝖢𝖯𝒜[2i1]=𝗅𝖼𝗉(maxi1,mini)\mathsf{LCP}_{\mathcal{A}}[2i-1]=\mathop{\mathsf{lcp}}(\max_{i-1},\min_{i}) for 2in2\leq i\leq n.

It can be proved that for every 2in2\leq i\leq n, if 𝖫𝖢𝖯𝒜[i]\mathsf{LCP}_{\mathcal{A}}[i] is finite, then 𝖫𝖢𝖯𝒜[i]<3n\mathsf{LCP}_{\mathcal{A}}[i]<3n [7]. As a consequence, 𝖫𝖢𝖯𝒜\mathsf{LCP}_{\mathcal{A}} can be stored by using O(nlogn)O(n\log n) bits.

4 A Space-time Trade-off for the LCP Array

By storing an LCP array on top of the compact representation of a Wheeler graph, we have additional information that we can use to efficiently solve problems such as computing the matching statistics; however, we need to store O(nlogn)O(n\log n) bits. As we have seen, O(nlogn)O(n\log n) dominates the number of bits required to store 𝒜\mathcal{A} itself, if the alphabet is small. In this section, we show that we can store a data structure of only O(nloglogσ)O(n\log\log\sigma) bits which allows to compute every entry 𝖫𝖢𝖯𝒜[i]\mathsf{LCP}_{\mathcal{A}}[i] in O(logn)O(\log n) time, thus proving Theorem 1.1. This will be possible by sampling some entries of 𝖫𝖢𝖯𝒜\mathsf{LCP}_{\mathcal{A}}. The sampling mechanism is obtained by conveniently defining an auxiliary graph from the entries of the LCP array. We will immediately describe our technique, our sampling mechanism being general-purpose.

Algorithm 1 Building V(h)V(h)
V(h)V(h)\leftarrow\emptyset
UU\leftarrow\emptyset
while there exists vVv\in V such that (a) v(i)v(i) is defined for 0ih10\leq i\leq h-1, (b) v(i)v(j)v(i)\not=v(j) for 0j<ih10\leq j<i\leq h-1, (c) v(i)Uv(i)\not\in U for 0ih10\leq i\leq h-1 do
    Pick such a vv, add v(h1)v(h-1) to V(h)V(h) and add v(i)v(i) to UU for every 0ih10\leq i\leq h-1
end while
Algorithm 2 Input: h[2,2n]h\in[2,2n]. Output: 𝖫𝖢𝖯𝒜[h]\mathsf{LCP}_{\mathcal{A}}[h].
procedure main_function(hh)
    Initialize a global bit array D[2,2n]D[2,2n] to zero \triangleright D[2,2n]D[2,2n] marks the entries already considered
    return lcp(hh)
end procedure
procedure lcp(hh)
    D[h]1D[h]\leftarrow 1
    if C[h]=1C[h]=1 then \triangleright The desired value has been sampled
       return 𝖫𝖢𝖯𝒜[rank(C,h)]\mathsf{LCP}^{*}_{\mathcal{A}}[rank(C,h)]
    else if hh is odd then
       ih/2i\leftarrow\lceil h/2\rceil
       if λ(ui1)λ(ui)\lambda(u_{i-1})\not=\lambda(u_{i}) then
          return 0
       else
          kpmax(i1)k\leftarrow p_{\max}(i-1)
          kpmin(i)k^{\prime}\leftarrow p_{\min}(i)
          jRMQ𝖫𝖢𝖯𝒜(2k+1,2k1)j\leftarrow RMQ_{\mathsf{LCP}_{\mathcal{A}}}(2k+1,2k^{\prime}-1)
          if D[j]=1D[j]=1 then \triangleright We have already considered this entry before, so there is a cycle
             return \infty
          else
             return 1+lcp(j)1+\textsc{lcp}(j)
          end if
       end if
    else
       ih/2i\leftarrow h/2
       kpmin(i)k\leftarrow p_{\min}(i)
       kpmax(i)k^{\prime}\leftarrow p_{\max}(i)
       jRMQ𝖫𝖢𝖯𝒜(2k,2k)j\leftarrow RMQ_{\mathsf{LCP}_{\mathcal{A}}}(2k,2k^{\prime})
       if D[j]=1D[j]=1 then \triangleright We have already considered this entry before, so there is a cycle
          return \infty
       else
          return 1+lcp(j)1+\textsc{lcp}(j)
       end if
    end if
end procedure

Sampling

Let G=(V,H)G=(V,H) be a finite (unlabeled) directed graph such that every node has at most one incoming edge. For every vVv\in V and for every i0i\geq 0, there exists at most one node vVv^{\prime}\in V such that there exists a directed path from vv^{\prime} to vv having ii edges; if vv^{\prime} exists, we denote it by v(i)v(i). Fix a parameter h1h\geq 1. Let us prove that there exists V(h)VV(h)\subseteq V such that (i) |V(h)||V|h|V(h)|\leq\frac{|V|}{h} and (ii) for every vVv\in V there exists 0i2h20\leq i\leq 2h-2 such that v(i)v(i) is defined and either v(i)V(h)v(i)\in V(h) or v(i)v(i) has no incoming edges or v(i)=v(j)v(i)=v(j) for some 0j<i0\leq j<i. We build V(h)V(h) incrementally following Algorithm 1. Let us prove that, at the end of the algorithm, properties (i) and (ii) are true. For every vV(h)v\in V(h), define Sv={v,v(1),v(2),v(h1)}S_{v}=\{v,v(1),v(2)\dots,v(h-1)\}, which is possible because by construction if vV(h)v\in V(h), then v(i)v(i) is defined for every 0ih10\leq i\leq h-1. It must be v(i)v(j)v(i)\not=v(j) for 0i<jh10\leq i<j\leq h-1, so |Sv|=h|S_{v}|=h. If v,vV(h)v,v^{\prime}\in V(h) and vvv\not=v^{\prime}, then by construction SvS_{v} and SvS_{v^{\prime}} are disjoint. As a consequence, |V|vV(h)|Sv|=vV(h)h=h|Vh||V|\geq\sum_{v\in V(h)}|S_{v}|=\sum_{v\in V(h)}h=h|V_{h}| and so |Vh||V|h|V_{h}|\leq\frac{|V|}{h}, which proves property (i). Let us prove property (ii). Pick vVv\in V; we must prove that there exists 0i2h20\leq i\leq 2h-2 such that v(i)v(i) is defined and either v(i)V(h)v(i)\in V(h) or v(i)v(i) has no incoming edges or v(i)=v(j)v(i)=v(j) for some 0j<i0\leq j<i. We distinguish three cases:

  1. 1.

    there exists ii with 1ih11\leq i\leq h-1 such that v(i1)v(i-1) is defined but v(i)v(i) is not defined. Then, v(i1)v(i-1) has no incoming edges.

  2. 2.

    there exist i,ji,j with 0j<ih10\leq j<i\leq h-1 such that v(j)v(j) and v(i)v(i) are defined and v(i)=v(j)v(i)=v(j). In this case, the conclusion is immediate.

  3. 3.

    v(i)v(i) is defined for every 0ih0\leq i\leq h and v(i)v(j)v(i)\not=v(j) for 0j<ih10\leq j<i\leq h-1. Since Algorithm 1 has terminated, then there exists 0jh10\leq j\leq h-1 such that v(j)Uv(j)\in U. The construction of UU implies that there exists vVv^{\prime}\in V and 0jh10\leq j\leq h-1 such that v(j)=v(j)v(j)=v^{\prime}(j^{\prime}) and v(h1)V(h)v^{\prime}(h-1)\in V(h). As a consequence v(h1+jj)=v(j)(h1j)=(v(j))(h1j)=v(h1)V(h)v(h-1+j-j^{\prime})=v(j)(h-1-j^{\prime})=(v^{\prime}(j^{\prime}))(h-1-j^{\prime})=v^{\prime}(h-1)\in V(h). Since jh1j\leq h-1 and j0j^{\prime}\geq 0, we conclude h1+jj2h2h-1+j-j^{\prime}\leq 2h-2 and we are done.

u1u_{1}startu3u_{3}u4u_{4}u2u_{2}u4u_{4}u6u_{6}u7u_{7}u9u_{9}u10u_{10}u8u_{8}u11u_{11}u12u_{12}u13u_{13}u14u_{14}u15u_{15}u16u_{16}abcdadeeeeffgghijkl
(a)
State ii 𝖫𝖢𝖯𝒜[i]\mathsf{LCP}_{\mathcal{A}}[i] kk kk^{\prime} (i)\mathcal{R}(i)
1 1
2 \infty 1 1 2
2 3 0 - - -
4 11 1 2 3
3 5 0 - - -
6 \infty 1 1 2
4 7 0 - - -
8 \infty 1 1 2
5 9 0 - - -
10 1 1 5 9
6 11 0 - - -
12 1 2 3 5
7 13 1 3 4 7
14 1 4 5 9
8 15 0 - - -
16 2 6 6 12
9 17 2 6 7 13
18 2 7 7 14
10 19 0 - - -
20 2 6 6 12
11 21 2 6 7 13
22 2 7 7 14
12 23 0 - - -
24 3 8 8 16
13 25 0 - - -
26 4 12 12 24
14 27 0 - - -
28 5 13 13 26
15 29 0 - - -
30 6 14 14 28
16 31 0 - - -
32 7 15 15 30
(b)
v5v_{5}v12v_{12}v16v_{16}v20v_{20}v24v_{24}v26v_{26}v28v_{28}v30v_{30}v32v_{32}v7v_{7}v13v_{13}v17v_{17}v21v_{21}v9v_{9}v10v_{10}v14v_{14}v18v_{18}v22v_{22}v2v_{2}v8v_{8}v6v_{6}v3v_{3}v4v_{4}v11v_{11}v15v_{15}v19v_{19}v23v_{23}v25v_{25}v27v_{27}v29v_{29}v31v_{31}
(c)
ii C[i]C[i] 𝖫𝖢𝖯𝒜\mathsf{LCP}^{*}_{\mathcal{A}}
1 3
2 0 7
3 0
4 0
5 0
6 0
7 0
8 0
9 0
10 0
11 0
12 0
13 0
14 0
15 0
16 0
17 0
18 0
19 0
20 0
21 0
22 0
23 0
24 1
25 0
26 0
27 0
28 0
29 0
30 0
31 0
32 1
(d)
Figure 1: (a) A Wheeler DFA. States are numbered according to the Wheeler order. (b) The array 𝖫𝖢𝖯𝒜\mathsf{LCP}_{\mathcal{A}}, and the values needed to compute G=(V,H)G=(V,H). We assume that a range minimum query returns the largest position of a minimum value. (c) The graph G=(V,H)G=(V,H), with V(logn)=V(4)={v24,v32}V(\lceil\log n\rceil)=V(4)=\{v_{24},v_{32}\} (yellow states). (d) The data structures that we store.

Computing the LCP Array Using a Linear Number of Bits

First, let us store a data structure of O(n)O(n) bits which in O(1)O(1) time determines RMQ𝖫𝖢𝖯𝒜(i,j)RMQ_{\mathsf{LCP}_{\mathcal{A}}}(i,j) for every 2ij2n2\leq i\leq j\leq 2n.

Notice that 𝖫𝖢𝖯𝒜[2i]1\mathsf{LCP}_{\mathcal{A}}[2i]\geq 1 for 1in1\leq i\leq n because the first character of mini\min_{i} and the first character of maxi\max_{i} are equal to λ(ui)\lambda(u_{i}). Moreover, we have 𝖫𝖢𝖯𝒜[2i1]1\mathsf{LCP}_{\mathcal{A}}[2i-1]\geq 1 if and only if λ(ui1)=λ(ui)\lambda(u_{i-1})=\lambda(u_{i}), for 2in2\leq i\leq n.

Consider the entry 𝖫𝖢𝖯𝒜[2i1]=𝗅𝖼𝗉(maxi1,mini)\mathsf{LCP}_{\mathcal{A}}[2i-1]=\mathop{\mathsf{lcp}}(\max_{i-1},\min_{i}), for 2in2\leq i\leq n, and assume that 𝖫𝖢𝖯𝒜[2i1]1\mathsf{LCP}_{\mathcal{A}}[2i-1]\geq 1. Let k=pmax(i1)k=p_{\max}(i-1) and k=pmin(i)k^{\prime}=p_{\min}(i). Since 𝖫𝖢𝖯𝒜[2i1]1\mathsf{LCP}_{\mathcal{A}}[2i-1]\geq 1, then there exists aΣa\in\Sigma such that maxi1=amaxk\max_{i-1}=a\max_{k} and mini1=amink\min_{i-1}=a\min_{k^{\prime}}. In particular, (uk,ui1,a)E(u_{k},u_{i-1},a)\in E and (uk,ui,a)E(u_{k^{\prime}},u_{i},a)\in E, so from Axiom 2 we obtain k<kk<k^{\prime}. Moreover, we have 𝖫𝖢𝖯𝒜[2i1]=𝗅𝖼𝗉(maxi1,mini)=𝗅𝖼𝗉(amaxk,amink)=1+𝗅𝖼𝗉(maxk,mink)\mathsf{LCP}_{\mathcal{A}}[2i-1]=\mathop{\mathsf{lcp}}(\max_{i-1},\min_{i})=\mathop{\mathsf{lcp}}(a\max_{k},a\min_{k^{\prime}})=1+\mathop{\mathsf{lcp}}(\max_{k},\min_{k^{\prime}}). Notice that:

𝗅𝖼𝗉(maxk,mink)=min{𝗅𝖼𝗉(maxk,mink+1),𝗅𝖼𝗉(mink+1,maxk+1),,=𝗅𝖼𝗉(mink1,maxk1),𝗅𝖼𝗉(maxk1,mink)}==min{𝖫𝖢𝖯𝒜[2k+1],𝖫𝖢𝖯𝒜[2k+2],,𝖫𝖢𝖯𝒜[2k2],𝖫𝖢𝖯𝒜[2k1]}.\begin{split}&\mathop{\mathsf{lcp}}(\mathrm{max}_{k},\mathrm{min}_{k^{\prime}})=\mathrm{min}\{\mathop{\mathsf{lcp}}(\mathrm{max}_{k},\mathrm{min}_{k+1}),\mathop{\mathsf{lcp}}(\mathrm{min}_{k+1},\mathrm{max}_{k+1}),\dots,\\ &=\mathop{\mathsf{lcp}}(\mathrm{min}_{k^{\prime}-1},\mathrm{max}_{k^{\prime}-1}),\mathop{\mathsf{lcp}}(\mathrm{max}_{k^{\prime}-1},\mathrm{min}_{k^{\prime}})\}=\\ &=\min\{\mathsf{LCP}_{\mathcal{A}}[2k+1],\mathsf{LCP}_{\mathcal{A}}[2k+2],\dots,\mathsf{LCP}_{\mathcal{A}}[2k^{\prime}-2],\mathsf{LCP}_{\mathcal{A}}[2k^{\prime}-1]\}.\end{split}

Let j=RMQ𝖫𝖢𝖯𝒜(2k+1,2k1)j=RMQ_{\mathsf{LCP}_{\mathcal{A}}}(2k+1,2k^{\prime}-1). Then, 𝖫𝖢𝖯𝒜[j]=min{𝖫𝖢𝖯𝒜[2k+1],𝖫𝖢𝖯𝒜[2k+2],,𝖫𝖢𝖯𝒜[2k2],𝖫𝖢𝖯𝒜[2k1]}\mathsf{LCP}_{\mathcal{A}}[j]=\min\{\mathsf{LCP}_{\mathcal{A}}[2k+1],\mathsf{LCP}_{\mathcal{A}}[2k+2],\dots,\mathsf{LCP}_{\mathcal{A}}[2k^{\prime}-2],\mathsf{LCP}_{\mathcal{A}}[2k^{\prime}-1]\}, so 𝖫𝖢𝖯𝒜[2i1]=1+𝖫𝖢𝖯𝒜[j]\mathsf{LCP}_{\mathcal{A}}[2i-1]=1+\mathsf{LCP}_{\mathcal{A}}[j] (we assume t+=t+\infty=\infty for every t0t\geq 0), and we have reduced the problem of computing 𝖫𝖢𝖯𝒜[2i1]\mathsf{LCP}_{\mathcal{A}}[2i-1] to the problem of computing 𝖫𝖢𝖯𝒜[j]\mathsf{LCP}_{\mathcal{A}}[j]. In the following, let (2i1)=j\mathcal{R}(2i-1)=j. Given 2in2\leq i\leq n, we can compute j=(2i1)j=\mathcal{R}(2i-1) in O(loglogσ)O(\log\log\sigma) time, because we can compute k=pmax(i1)k=p_{\max}(i-1) and k=pmin(i)k^{\prime}=p_{\min}(i) in O(loglogσ)O(\log\log\sigma) time and we can compute jj in O(1)O(1) time by means of a range minimum query.

We proceed analogously with the entries 𝖫𝖢𝖯𝒜[2i]=𝗅𝖼𝗉(mini,maxi)\mathsf{LCP}_{\mathcal{A}}[2i]=\mathop{\mathsf{lcp}}(\min_{i},\max_{i}), for 1in1\leq i\leq n (it must necessarily be 𝖫𝖢𝖯𝒜[2i]1\mathsf{LCP}_{\mathcal{A}}[2i]\geq 1). Let k=pmin(i)k=p_{\min}(i) and k=pmax(i)k^{\prime}=p_{\max}(i); by the definitions of pminp_{\min} and pmaxp_{\max} it must be kkk\leq k^{\prime}. Hence, 𝖫𝖢𝖯𝒜[2i]=1+𝗅𝖼𝗉(mink,maxk)\mathsf{LCP}_{\mathcal{A}}[2i]=1+\mathop{\mathsf{lcp}}(\min_{k},\max_{k^{\prime}}) and similarly 𝗅𝖼𝗉(mink,maxk)=min{𝖫𝖢𝖯𝒜[2k],𝖫𝖢𝖯𝒜[2k+1],,𝖫𝖢𝖯𝒜[2k1],𝖫𝖢𝖯𝒜[2k]}\mathop{\mathsf{lcp}}(\min_{k},\max_{k^{\prime}})=\min\{\mathsf{LCP}_{\mathcal{A}}[2k],\mathsf{LCP}_{\mathcal{A}}[2k+1],\dots,\mathsf{LCP}_{\mathcal{A}}[2k^{\prime}\leavevmode\nobreak\ -1],\mathsf{LCP}_{\mathcal{A}}[2k^{\prime}]\}. Let j=RMQ𝖫𝖢𝖯𝒜(2k,2k)j=RMQ_{\mathsf{LCP}_{\mathcal{A}}}(2k,2k^{\prime}). In the following, let (2i)=j\mathcal{R}(2i)=j. Given 1in1\leq i\leq n, we can compute j=(2i)j=\mathcal{R}(2i) in O(loglogσ)O(\log\log\sigma) time. See Figure 1 for an example.

Now, consider the (unlabeled) directed graph G=(V,H)G=(V,H) defined as follows. Let VV be a set of 2n12n-1 nodes v2v_{2}, v3v_{3}, \dots, v2nv_{2n}. Moreover, viVv_{i}\in V has no incoming edge in GG if (i)\mathcal{R}(i) is not defined, which happens if 𝖫𝖢𝖯𝒜[i]=0\mathsf{LCP}_{\mathcal{A}}[i]=0 (and so ii is odd and λ(ui1)λ(ui)\lambda(u_{i-1})\not=\lambda(u_{i})); viVv_{i}\in V has exactly one incoming edge if (i)\mathcal{R}(i) is defined, namely, (v(i),vi)(v_{\mathcal{R}(i)},v_{i}). Note that v2iv_{2i} has an incoming edge for every 1in1\leq i\leq n. Let h1h\geq 1 be a parameter. We know that there exists V(h)VV(h)\subseteq V such that (i) |V(h)||V|h|V(h)|\leq\frac{|V|}{h} and (ii) for every viVv_{i}\in V there exists 0k2h20\leq k\leq 2h-2 such that vi(k)v_{i}(k) is defined and either vi(h)V(h)v_{i}(h)\in V(h) or vi(h)v_{i}(h) has no incoming edges or vi(h)=vi(l)v_{i}(h)=v_{i}(l) for some 0l<h0\leq l<h. Notice that if vi(h)=vi(l)v_{i}(h)=v_{i}(l) for some 0l<h0\leq l<h, then 𝖫𝖢𝖯𝒜[i]=\mathsf{LCP}_{\mathcal{A}}[i]=\infty (because there is a cycle and so vi(h)v_{i}(h^{\prime}) is defined for every h0h^{\prime}\geq 0). Let n=|V(h)|n^{\prime}=|V(h)|, and let 𝖫𝖢𝖯𝒜[1,n]\mathsf{LCP}^{*}_{\mathcal{A}}[1,n^{\prime}] an array storing the value 𝖫𝖢𝖯𝒜[i]\mathsf{LCP}_{\mathcal{A}}[i] for each viV(h)v_{i}\in V(h), sorted by increasing ii. Since n|V|h=2n1hn^{\prime}\leq\frac{|V|}{h}=\frac{2n-1}{h}, storing 𝖫𝖢𝖯𝒜[1,n]\mathsf{LCP}^{*}_{\mathcal{A}}[1,n^{\prime}] takes nO(logn)=O(nlognh)n^{\prime}O(\log n)=O(\frac{n\log n}{h}) bits. We store a bitvector C[2,2n]C[2,2n] such that C[i]=1C[i]=1 if and only if viV(h)v_{i}\in V(h) for every 2i2n2\leq i\leq 2n; we augment CC with o(n)o(n) bits so that it supports rank queries in O(1)O(1) time. For every 2i2n2\leq i\leq 2n, in O(1)O(1) time we can check whether 𝖫𝖢𝖯𝒜[i]\mathsf{LCP}_{\mathcal{A}}[i] has been stored in 𝖫𝖢𝖯𝒜\mathsf{LCP}^{*}_{\mathcal{A}} by checking whether C[i]=1C[i]=1, and if C[i]=1C[i]=1 it must be 𝖫𝖢𝖯𝒜[i]=𝖫𝖢𝖯𝒜[rank(C,i)]\mathsf{LCP}_{\mathcal{A}}[i]=\mathsf{LCP}^{*}_{\mathcal{A}}[rank(C,i)].

From our discussion, it follows that Algorithm 2 correctly computes 𝖫𝖢𝖯𝒜[i]\mathsf{LCP}_{\mathcal{A}}[i] for every 2in2\leq i\leq n. Property (ii) ensures that the function lcplcp is called at most hh times. Every call requires O(loglogσ)O(\log\log\sigma) time, so the running time of our algorithm is O(hloglogσ)O(h\log\log\sigma) (the initialization of D[2,2n]D[2,2n] in Algorithm 2 can be simulated in O(1)O(1) time [22]). We conclude that we store O(n+nlognh)O(n+\frac{n\log n}{h}) bits, and in O(hloglogσ)O(h\log\log\sigma) time we can compute 𝖫𝖢𝖯𝒜[i]\mathsf{LCP}_{\mathcal{A}}[i] for every 2in2\leq i\leq n.

By choosing h=lognloglogσh=\lceil\frac{\log n}{\log\log\sigma}\rceil, we conclude that our data structure can be stored using O(nloglogσ)O(n\log\log\sigma) bits and it allows to compute 𝖫𝖢𝖯𝒜[i]\mathsf{LCP}_{\mathcal{A}}[i] for every 2in2\leq i\leq n in O(logn)O(\log n) time. By choosing h=lognh=\lceil\log n\rceil we conclude that our data structure can be stored using O(n)O(n) bits and it allows to compute 𝖫𝖢𝖯𝒜[i]\mathsf{LCP}_{\mathcal{A}}[i] for every 2in2\leq i\leq n in O(lognloglogσ)O(\log n\log\log\sigma) time.

5 Applications

Matching Statistics

Let us recall how the bounds in Theorem 1.2 are obtained. The space bound is O(nlogn)O(n\log n) bits because we need to store 𝖫𝖢𝖯𝒜\mathsf{LCP}_{\mathcal{A}}. We also store a data structure to solve range minimum queries on 𝖫𝖢𝖯𝒜\mathsf{LCP}_{\mathcal{A}}, which only takes O(n)O(n) bits. The time bound O(mlogn)O(m\log n) follows from performing O(m)O(m) steps to compute all matching statistics. In each of these O(m)O(m) steps, we may need to perform a binary search on 𝖫𝖢𝖯𝒜\mathsf{LCP}_{\mathcal{A}}. In each step of the binary search, we need to solve a range minimum query once and we need to access 𝖫𝖢𝖯𝒜\mathsf{LCP}_{\mathcal{A}} once, so the binary search takes O(logn)O(\log n) time per step. By Theorem 1.1, if we store only O(nloglogσ)O(n\log\log\sigma) bits, we can access 𝖫𝖢𝖯𝒜\mathsf{LCP}_{\mathcal{A}} in O(logn)O(\log n) time, so the time for the binary search becomes O(log2n)O(\log^{2}n) per step and Theorem 1.3 follows.

Variable-order de Bruijn Graphs

$$$\$\$\$$$T\$\$T$TA\$TATACTACACGACGCGTCGTGTCGTCTCGTCGCGACGAGACGACACTACTTTAACCGGTTCCGGAACCTTAAGG
(a)
ii Node 𝖫𝖢𝖯¯G[i]\overline{\mathsf{LCP}}_{G}[i] kk kk^{\prime} (i)\mathcal{R}(i)
1 $$$
2 CGA 0 - - -
3 $TA 1 9 9 9
4 GAC 0 - - -
5 TAC 2 3 3 3
6 GTC 1 4 11 9
7 ACG 0 - - -
8 TCG 2 6 6 6
9 $$T 0 - - -
10 ACT 1 2 4 4
11 CGT 1 6 7 7
(b)
v9v_{9}v3v_{3}v6v_{6}v5v_{5}v8v_{8}v4v_{4}v10v_{10}v7v_{7}v11v_{11}v2v_{2}
(c)
ii C[i]C[i] 𝖫𝖢𝖯¯G\overline{\mathsf{LCP}}^{*}_{G}
1 1
2 0 2
3 1 1
4 0 1
5 0
6 0
7 0
8 1
9 0
10 1
11 1
(d)
Figure 2: The 33-rd order de Bruijn graph for the set 𝒮={CGAC,GACG,GACT,TACG,GTCG,ACGA,ACGT,TCGA,CGTC}\mathcal{S}=\{CGAC,GACG,GACT,TACG,GTCG,ACGA,ACGT,TCGA,CGTC\} from [4]. We proceed like in Figure 1 (now we only consider odd entries of 𝖫𝖢𝖯G\mathsf{LCP}_{G}, and h=logk=2h=\lceil\log k\rceil=2).

Let k0k\geq 0 be a parameter, and let 𝒮\mathcal{S} be a set of strings on the alphabet Σ={A,C,G,T}\Sigma=\{A,C,G,T\} (in this application we always assume σ=O(1)\sigma=O(1)).

The kk-th order de Bruijn graph of 𝒮\mathcal{S} is defined as follows. The set of nodes is the set of all strings of Σ\Sigma of length kk that occur as a substring of some string in 𝒮\mathcal{S}. There is an edge from node α\alpha to node β\beta labeled cΣc\in\Sigma if and only if (i) the suffix of α\alpha of length k1k-1 is equal to the prefix of β\beta of length k1k-1 and (ii) the last character of β\beta is cc. If some node α\alpha has no incoming edges, then we add nodes $iαki\$^{i}\alpha_{k-i} for 1ik1\leq i\leq k, where αj\alpha_{j} is the prefix of α\alpha of length jj and $\$ is a special character, and we add edges as above; see Figure 2 for an example. Wheeler DFAs are a generalization of de Bruijn graphs (we do not need to define an initial state and a set of final states, because here we are not interested in studying the applications of de Bruijn graphs and Wheeler automata to automata theory [2, 10]); the Wheeler order is the one such that node α\alpha comes before node β\beta if and only if the string αR\alpha^{R} is lexicographically smaller than the string βR\beta^{R} [17].

Notice that, in a kk-th order de Bruijn graph GG, all strings that can be read from node α\alpha by following edges in a backward fashion start with αR\alpha^{R} (as usual, we assume that node $$$\$\$\$ has a self-loop labeled $\$). As a consequence, it holds 𝖫𝖢𝖯G[2i]k\mathsf{LCP}_{G}[2i]\geq k for every 1in1\leq i\leq n and 𝖫𝖢𝖯G[2i1]k1\mathsf{LCP}_{G}[2i-1]\leq k-1 for every 2in2\leq i\leq n (so any value in an odd entry is smaller than any value in an even entry).

As we saw in the introduction, in [4] it was shown that the kk-order de Bruijn graph of 𝒮\mathcal{S} can be used to implicitly store the kk^{\prime}-th order de Bruijn graph of 𝒮\mathcal{S} for every kkk^{\prime}\leq k, thus leading to a variable-order de Bruijn graph. The navigation of a variable-order de Bruijn graph is possible by storing or by simulating the values in the odd entries of the LCP array. Formally, in order to avoid confusion, we define 𝖫𝖢𝖯¯G[i]=𝖫𝖢𝖯G[2i1]\overline{\mathsf{LCP}}_{G}[i]=\mathsf{LCP}_{G}[2i-1] for every 2in2\leq i\leq n; see Figure 2. Note that 𝖫𝖢𝖯¯G[i]k1\overline{\mathsf{LCP}}_{G}[i]\leq k-1 for every 2in2\leq i\leq n, so 𝖫𝖢𝖯¯G\overline{\mathsf{LCP}}_{G} can be stored by using O(nlogk)O(n\log k) bits. Notice that Theorem 1.1 also applies to 𝖫𝖢𝖯¯G[i]\overline{\mathsf{LCP}}_{G}[i] (we do not need to store values in the even entries because a value in an odd entry is smaller than a value in an even entry, so even entries are never selected in the sampling process when answering a range minimum query on 𝖫𝖢𝖯G\mathsf{LCP}_{G}). However, we can now choose a better parameter h1h\geq 1 in our sampling process. Indeed, each entry of 𝖫𝖢𝖯¯G\overline{\mathsf{LCP}}_{G} can be stored by using O(logk)O(\log k) bits (not O(lognO(\log n) bits), so if we choose h=logkh=\lceil\log k\rceil, we conclude that we can augment the BOSS representation of a de Bruijn graph with O(n)O(n) bits such that for every 2in2\leq i\leq n we can compute 𝖫𝖢𝖯¯G[i]\overline{\mathsf{LCP}}_{G}[i] in O(logk)O(\log k) time.

The first solution in Theorem 1.4 consists in storing a wavelet tree on 𝖫𝖢𝖯¯G\overline{\mathsf{LCP}}_{G}, which requires O(nlogk)O(n\log k) bits and allows to navigate the graph in O(logk)O(\log k) time per visited node. The second solution in Theorem 1.4 does not store 𝖫𝖢𝖯¯G\overline{\mathsf{LCP}}_{G} at all; whenever needed, an entry of 𝖫𝖢𝖯¯G\overline{\mathsf{LCP}}_{G} is computed in O(k)O(k) time by exploiting the BOSS representation of the de Bruijn graph. The second solution only stores a data structures of O(n)O(n) bits to solve range minimum queries. The details can be found in [4]. Essentially, the time bound O(klogn)O(k\log n) comes from performing binary searches on 𝖫𝖢𝖯¯G\overline{\mathsf{LCP}}_{G} while explicitly computing an entry of 𝖫𝖢𝖯¯G\overline{\mathsf{LCP}}_{G} at each step in O(k)O(k) time. However, we have seen that, while staying within the O(n)O(n) space bound, we can augment the BOSS representation so that we can compute the entries of 𝖫𝖢𝖯¯G\overline{\mathsf{LCP}}_{G} in O(logk)O(\log k) time, so the time bound O(klogn)O(k\log n) becomes O(logklogn)O(\log k\log n), which implies Theorem 1.5.

Acknowledgements

Travis Gagie: funded by National Institutes of Health (NIH) NIAID (grant no. HG011392), the National Science Foundation NSF IIBR (grant no. 2029552) and a Natural Science and Engineering Research Council (NSERC) Discovery Grant (grant no. RGPIN-07185-2020). Dominik Köppl: supported by JSPS KAKENHI with No. JP21K17701 and JP23H04378. Nicola Prezza: funded by the European Union (ERC, REGINDEX, 101039208). Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or the European Research Council. Neither the European Union nor the granting authority can be held responsible for them.

References

  • [1] Alanko, J., D’Agostino, G., Policriti, A., Prezza, N.: Regular languages meet prefix sorting. In: Proc. of the 31st Symposium on Discrete Algorithms, (SODA’20). pp. 911–930. SIAM (2020). https://doi.org/10.1137/1.9781611975994.55, https://doi.org/10.1137/1.9781611975994.55
  • [2] Alanko, J., D’Agostino, G., Policriti, A., Prezza, N.: Wheeler languages. Information and Computation 281, 104820 (2021)
  • [3] Bankevich, A., Nurk, S., Antipov, D., Gurevich, A.A., Dvorkin, M., Kulikov, A.S., Lesin, V.M., Nikolenko, S.I., Pham, S., Prjibelski, A.D., Pyshkin, A.V., Sirotkin, A.V., Vyahhi, N., Tesler, G., Alekseyev, M.A., Pevzner, P.A.: SPAdes: A new genome assembly algorithm and its applications to single-cell sequencing. Journal of Computational Biology 19(5), 455–477 (2012). https://doi.org/10.1089/cmb.2012.0021, https://doi.org/10.1089/cmb.2012.0021, pMID: 22506599
  • [4] Boucher, C., Bowe, A., Gagie, T., Puglisi, S.J., Sadakane, K.: Variable-order de Bruijn graphs. In: 2015 Data Compression Conference. pp. 383–392 (2015). https://doi.org/10.1109/DCC.2015.70
  • [5] Bowe, A., Onodera, T., Sadakane, K., Shibuya, T.: Succinct de Bruijn graphs. In: Algorithms in Bioinformatics. pp. 225–235. Springer Berlin Heidelberg, Berlin, Heidelberg (2012)
  • [6] Burrows, M., Wheeler, D.J.: A block-sorting lossless data compression algorithm. Tech. rep. (1994)
  • [7] Conte, A., Cotumaccio, N., Gagie, T., Manzini, G., Prezza, N., Sciortino, M.: Computing matching statistics on Wheeler DFAs. In: 2023 Data Compression Conference (DCC). pp. 150–159 (2023). https://doi.org/10.1109/DCC55655.2023.00023
  • [8] Cotumaccio, N.: Graphs can be succinctly indexed for pattern matching in O(|E|2+|V|5/2)O(|E|^{2}+|V|^{5/2}) time. In: 2022 Data Compression Conference (DCC). pp. 272–281 (2022). https://doi.org/10.1109/DCC52660.2022.00035
  • [9] Cotumaccio, N., Prezza, N.: On indexing and compressing finite automata. In: Proc. of the 32nd Symposium on Discrete Algorithms, (SODA’21). pp. 2585–2599. SIAM (2021). https://doi.org/10.1137/1.9781611976465.153, https://doi.org/10.1137/1.9781611976465.153
  • [10] Cotumaccio, N., D’Agostino, G., Policriti, A., Prezza, N.: Co-lexicographically ordering automata and regular languages – part i (2023)
  • [11] Díaz-Domínguez, D., Gagie, T., Navarro, G.: Simulating the DNA Overlap Graph in Succinct Space. In: Pisanti, N., Pissis, S.P. (eds.) 30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019). Leibniz International Proceedings in Informatics (LIPIcs), vol. 128, pp. 26:1–26:20. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, Dagstuhl, Germany (2019). https://doi.org/10.4230/LIPIcs.CPM.2019.26, http://drops.dagstuhl.de/opus/volltexte/2019/10497
  • [12] Ferragina, P., Luccio, F., Manzini, G., Muthukrishnan, S.: Structuring labeled trees for optimal succinctness, and beyond. In: proc. 46th Annual IEEE Symposium on Foundations of Computer Science (FOCS’05). pp. 184–193 (2005). https://doi.org/10.1109/SFCS.2005.69
  • [13] Ferragina, P., Manzini, G.: Opportunistic data structures with applications. In: Proc. 41st Annual Symposium on Foundations of Computer Science (FOCS’00). pp. 390–398 (2000). https://doi.org/10.1109/SFCS.2000.892127
  • [14] Ferragina, P., Luccio, F., Manzini, G., Muthukrishnan, S.: Compressing and indexing labeled trees, with applications. J. ACM 57(1) (nov 2009). https://doi.org/10.1145/1613676.1613680, https://doi.org/10.1145/1613676.1613680
  • [15] Fischer, J.: Optimal succinctness for range minimum queries. In: López-Ortiz, A. (ed.) LATIN 2010: Theoretical Informatics. pp. 158–169. Springer Berlin Heidelberg, Berlin, Heidelberg (2010)
  • [16] Fischer, J., Heun, V.: Space-efficient preprocessing schemes for range minimum queries on static arrays. SIAM Journal on Computing 40(2), 465–492 (2011). https://doi.org/10.1137/090779759, https://doi.org/10.1137/090779759
  • [17] Gagie, T., Manzini, G., Sirén, J.: Wheeler graphs: A framework for BWT-based data structures. Theoretical Computer Science 698, 67–78 (2017). https://doi.org/https://doi.org/10.1016/j.tcs.2017.06.016, https://www.sciencedirect.com/science/article/pii/S0304397517305285, algorithms, Strings and Theoretical Approaches in the Big Data Era (In Honor of the 60th Birthday of Professor Raffaele Giancarlo)
  • [18] Idury, R.M., Waterman, M.S.: A new algorithm for DNA sequence assembly. Journal of computational biology : a journal of computational molecular cell biology 2 2, 291–306 (1995)
  • [19] Li, R., Zhu, H., Ruan, J., Qian, W., Fang, X., Shi, Z., Li, Y., Li, S., Shan, G., Kristiansen, K., Li, S., Yang, H., Wang, J., Wang, J.: De novo assembly of human genomes with massively parallel short read sequencing. Genome research 20, 265–72 (12 2009). https://doi.org/10.1101/gr.097261.109
  • [20] Liu, M., Yu, H.: Lower bound for succinct range minimum query. In: Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing. p. 1402–1415. STOC 2020, Association for Computing Machinery, New York, NY, USA (2020). https://doi.org/10.1145/3357713.3384260, https://doi.org/10.1145/3357713.3384260
  • [21] Manber, U., Myers, G.: Suffix arrays: A new method for on-line string searches. SIAM J. Comput. 22(5), 935–948 (1993). https://doi.org/10.1137/0222058, https://doi.org/10.1137/0222058
  • [22] Navarro, G.: Spaces, trees, and colors: The algorithmic landscape of document retrieval on sequences. ACM Comput. Surv. 46(4) (mar 2014). https://doi.org/10.1145/2535933, https://doi.org/10.1145/2535933
  • [23] Navarro, G.: Compact Data Structures - A Practical Approach. Cambridge University Press (2016), http://www.cambridge.org/de/academic/subjects/computer-science/algorithmics-complexity-computer-algebra-and-computational-g/compact-data-structures-practical-approach?format=HB
  • [24] Peng, Y., Leung, H.C.M., Yiu, S.M., Chin, F.Y.L.: IDBA – a practical iterative de Bruijn graph de novo assembler. In: Berger, B. (ed.) Research in Computational Molecular Biology. pp. 426–440. Springer Berlin Heidelberg, Berlin, Heidelberg (2010)
  • [25] Pevzner, P.A., Tang, H., Waterman, M.S.: An Eulerian path approach to DNA fragment assembly. Proceedings of the National Academy of Sciences 98(17), 9748–9753 (2001). https://doi.org/10.1073/pnas.171285098, https://www.pnas.org/doi/abs/10.1073/pnas.171285098
  • [26] Sadakane, K.: Compressed suffix trees with full functionality. Theor. Comp. Sys. 41(4), 589–607 (2007). https://doi.org/10.1007/s00224-006-1198-x, https://doi.org/10.1007/s00224-006-1198-x
  • [27] Simpson, J., Wong, K., Jackman, S., Schein, J., Jones, S., Birol, I.: ABySS: A parallel assembler for short read sequence data. Genome research 19, 1117–23 (02 2009). https://doi.org/10.1101/gr.089532.108
  • [28] Weiner, P.: Linear pattern matching algorithms. In: Proc. 14th IEEE Annual Symposium on Switching and Automata Theory. pp. 1–11 (1973). https://doi.org/10.1109/SWAT.1973.13