Computing matching statistics on Wheeler DFAs
Abstract
Matching statistics were introduced to solve the approximate string matching problem, which is a recurrent subroutine in bioinformatics applications. In 2010, Ohlebusch et al. [SPIRE 2010] proposed a time and space efficient algorithm for computing matching statistics which relies on some components of a compressed suffix tree - notably, the longest common prefix (LCP) array. In this paper, we show how their algorithm can be generalized from strings to Wheeler deterministic finite automata. Most importantly, we introduce a notion of LCP array for Wheeler automata, thus establishing a first clear step towards extending (compressed) suffix tree functionalities to labeled graphs.
1 Introduction
Given a string and a pattern , the classical formulation of the pattern matching problem requires deciding whether the pattern occurs in the string and, possibly, counting the number of such occurrences and reporting the positions where they occur. The invention of the FM-index [1], which is based on the Burrows-Wheeler transform [2], opened a new line of research in the pattern matching field. The indexing and compression techniques behind the FM-index deeply rely on the idea of suffix sorting, and over the years have been generalized from strings to trees [3], de Bruijn graphs [4, 5], Wheeler graphs [6, 7] and arbitrary graphs [8, 9]. In particular, the class of Wheeler graphs is probably the one that captures the intuition behind the FM-index in the simplest way, and indeed the notion of Wheeler order has relevant consequences in automata theory [7, 10].
However, in bioinformatics we are not only interested in exact pattern matching, but also in a myriad of variations of the pattern matching problem [11]. In particular, matching statistics were introduced to solve the approximate pattern matching problem [12]. A powerful data structure that is able to address these variations of the pattern matching problem at once is the suffix tree [13]. The main drawback of the suffix tree is its space consumption, which is non-negligible both in theory and in practice. As a consequence, the suffix tree has been replaced by the suffix array [14]. While suffix arrays do not have all the functionalities of suffix trees, it has been shown that they can be augmented with some additional data structures — notably, the longest common prefix (LCP) array — so that it is possible to retrieve the full functionality of a suffix tree [15]. All these components can be successfully compressed, leading to the so-called compressed suffix trees [16].
The natural question is whether it is possible to provide suffix tree functionalities not only to strings, but also to graphs, and in particular Wheeler graphs. In this paper, we provide a first partial affirmative answer by considering the problem of computing matching statistics. In 2010, Ohlebusch et al. [17] proposed a time and space efficient algorithm for computing matching statistics which relies on some components of a compressed suffix tree. In this paper, we show how their algorithm can be generalized from strings to Wheeler deterministic finite automata. Most importantly, we introduce a notion of longest common prefix (LCP) array for Wheeler automata, thus establishing an important step towards extending (compressed) suffix tree functionalities to labeled graphs.
2 Notation and first definitions
Throughout the paper, we consider an alphabet and a fixed total order on . We denote by the set of all finite strings on and by the set of all (countably) infinite strings on . The empty word is . If , then is the reverse string of . We extend the total order from to lexicographically. If and are integers, with , define . If is a string, the -th character of is , and .
We will consider deterministic automata , where is the set of states, is the set of labeled edges, is the initial state and is the set of final states. The definition implies that for every and for every there exists at most one edge labeled leaving . Following [7, 10], we assume that has no incoming edges, and every state is reachable from the initial state; moreover, all edges entering the same state have the same label (input-consistency), so that for every we can let be the label of all edges entering . We define , where is a special character for which we assume for every (the character is analogous to the termination character used for suffix trees and suffix arrays). As a consequence, an edge can be simply written as , because it must be .
We assume familiarity with the notions of suffix array (SA), Burrows-Wheeler transform (BWT), FM-index and backward search [1].
The matching statistics of a pattern with respect to a string are defined as follows. Assume that , where for every . Determining the matching statistics of with respect to means determining, for , (i) the longest prefix of which occurs in , and (ii) the interval corresponding to the set of all strings starting with in the list of all lexicographically sorted suffixes. We can describe (i) and (ii) by means of three values: the length of , and the endpoints and of the interval considered in (ii). For example, let (see Figure 1), and . For , we have , so and (suffixes starting with ). For , we have , so and (all suffixes start with the empty string). For , we have , so , and (suffixes starting with ). For , we have , so , and (suffixes starting with ). One can proceed analogously for .
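For concreteness, the definition above can be checked with a naive sketch: the Python function below (a quadratic-time illustration, not the efficient algorithm discussed later) computes, for every position q of a pattern P, the length of the longest prefix of P[q..] occurring as a substring of T. The pattern `issa` is a hypothetical example.

```python
def matching_statistics(T, P):
    """For each position q of P, the length of the longest prefix of
    P[q:] that occurs as a substring of T (naive sketch)."""
    ms = []
    for q in range(len(P)):
        length = 0
        # extend the current match while it still occurs somewhere in T
        while q + length < len(P) and P[q:q + length + 1] in T:
            length += 1
        ms.append(length)
    return ms

# hypothetical pattern against the text of Figure 1
print(matching_statistics("mississippi$", "issa"))  # -> [3, 2, 1, 0]
```

For example, at position 0 the longest prefix of `issa` occurring in `mississippi$` is `iss`, of length 3, while the final `a` does not occur at all, giving length 0.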
Figure 1: the sorted suffixes of mississippi$, with the corresponding LCP array, suffix array (SA) and Burrows-Wheeler transform (BWT).

| | Sorted suffixes | LCP | SA | BWT |
|---|---|---|---|---|
| 1 | $ | | 12 | i |
| 2 | i$ | 0 | 11 | p |
| 3 | ippi$ | 1 | 8 | s |
| 4 | issippi$ | 1 | 5 | s |
| 5 | ississippi$ | 4 | 2 | m |
| 6 | mississippi$ | 0 | 1 | $ |
| 7 | pi$ | 0 | 10 | p |
| 8 | ppi$ | 1 | 9 | i |
| 9 | sippi$ | 0 | 7 | s |
| 10 | sissippi$ | 2 | 4 | s |
| 11 | ssippi$ | 1 | 6 | i |
| 12 | ssissippi$ | 3 | 3 | i |
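The columns of Figure 1 can be reproduced directly from the definitions. The sketch below (plain Python, 0-based indices, whereas the figure is 1-based) builds the suffix array, LCP array and BWT of mississippi$ by brute-force sorting, which is enough for illustration.

```python
def sa_lcp_bwt(T):
    """Suffix array, LCP array and BWT of T, which must end with a unique
    sentinel '$' smaller than every other character; all 0-based."""
    n = len(T)
    sa = sorted(range(n), key=lambda i: T[i:])        # sort suffixes
    lcp = [0] * n                                     # lcp[j] = LCP of the
    for j in range(1, n):                             # (j-1)-th and j-th
        a, b = T[sa[j - 1]:], T[sa[j]:]               # smallest suffixes
        k = 0
        while k < min(len(a), len(b)) and a[k] == b[k]:
            k += 1
        lcp[j] = k
    bwt = "".join(T[(i - 1) % n] for i in sa)         # char preceding each
    return sa, lcp, bwt                               # sorted suffix

sa, lcp, bwt = sa_lcp_bwt("mississippi$")
# sa  -> [11, 10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2]  (i.e. 12, 11, 8, ... 1-based)
# lcp -> [0, 0, 1, 1, 4, 0, 0, 1, 0, 2, 1, 3]
# bwt -> "ipssm$pissii"
```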
3 Computing matching statistics for strings
We will first describe the algorithm by Ohlebusch et al. [17], emphasizing the ideas that we will generalize when switching to Wheeler DFAs. The algorithm computes the matching statistics using a number of iterations linear in by exploiting the backward search. We start from the end of , and we use the backward search (starting from the interval which corresponds to the set of suffixes prefixed by the empty string) to find the interval of all occurrences of the last character of in (if any). Then, starting from the new interval, we use the backward search to find all the occurrences of the suffix of length 2 of in (if any), and so on. At some point, it may happen that for some we have that occurs in , but the next application of the backward search returns the empty interval, so that does not occur in (the case corresponds to the initial setting when is the empty string). We distinguish two cases:
- (Case 1) If and , this means that all suffixes of are prefixed by . This may happen in particular if : this means that the first backward search has been unsuccessful. We immediately conclude that character does not occur in , so and (because all suffixes start with the empty string). In this case, in the following iterations of the algorithm, we can simply discard : when for we will be searching for the longest prefix of occurring in , it will suffice to search for the longest prefix of occurring in .
- (Case 2) If or , this means that the number of suffixes of starting with is less than . Now, every suffix starting with also starts with . If the number of suffixes starting with is equal to the number of suffixes starting with , then also does not occur in . More generally, for we can have that occurs in only if the number of suffixes starting with is larger than the number of suffixes starting with . Since we are interested in maximal matches, we want to be as large as possible: we will show later how to compute the largest integer such that the number of suffixes starting with is larger than the number of suffixes starting with . Notice that always exists, because all suffixes start with the empty string, but less than suffixes start with . After determining we discard (so in the following iterations of the algorithm we will simply consider ), and we recursively apply the backward search starting from the interval associated with the occurrences of — we will also see how to compute this interval.
Let us apply the above algorithm to and . We start with the interval , corresponding to the empty pattern, and character . A backward step yields the interval (suffixes starting with ), so . Now, we apply a backward step from and , obtaining (suffixes starting with ), so . Again, we apply a backward step from and , obtaining (suffixes starting with ), so . Again, we apply a backward step from and , obtaining (suffixes starting with ), so . We now apply a backward step from and , and we obtain the empty interval. This means that no suffix starts with . Notice in Figure 1 that the number of suffixes starting with is equal to the number of suffixes starting with or , but the number of suffixes starting with is bigger. As a consequence, we consider the interval of all suffixes starting with — which is — and we apply a backward step with . This time the backward step is successful, and we obtain (suffixes starting with ), and . We now apply a backward step from and , obtaining the empty interval. This means that no suffix starts with . Notice in Figure 1 that the number of suffixes starting with is bigger than the number of suffixes starting with . The corresponding interval is , but a backward step with is still unsuccessful (so no suffix starts with ). The number of suffixes starting with is smaller than the number of suffixes starting with the empty string (which is equal to ), so we apply a backward step with and . Since the backward step is still unsuccessful, we conclude that does not occur in , so and . Finally, we start again from the whole interval , and a backward step with returns (suffixes starting with ), so .
It is easy to see that the number of iterations is linear in . Indeed, every time we apply a backward step, either we move to the left across to compute a new matching statistic, or we increase by at least 1 the length of the suffix of which is forever discarded. This implies that the number of iterations is bounded by .
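A single backward step can be sketched with plain prefix-count tables standing in for the compressed rank structures of the FM-index; intervals are 0-based and half-open, and an empty interval has equal endpoints. The driver at the bottom repeats the step on the worked example of Figure 1.

```python
from collections import Counter

def fm_index(bwt):
    """Plain-array stand-ins for the FM-index rank structures:
    C[c] = number of characters in bwt smaller than c,
    occ[c][i] = number of occurrences of c in bwt[:i]."""
    counts = Counter(bwt)
    C, total = {}, 0
    for c in sorted(counts):
        C[c] = total
        total += counts[c]
    occ = {c: [0] * (len(bwt) + 1) for c in counts}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return C, occ

def backward_step(C, occ, lo, hi, c):
    """Map the suffix-array interval [lo, hi) of a string W to the
    interval of cW; empty intervals have lo == hi."""
    if c not in C:
        return 0, 0
    return C[c] + occ[c][lo], C[c] + occ[c][hi]

# search "ssi" right to left (BWT of mississippi$ as in Figure 1)
C, occ = fm_index("ipssm$pissii")
lo, hi = 0, 12                    # interval of the empty pattern
for c in reversed("ssi"):
    lo, hi = backward_step(C, occ, lo, hi, c)
print((lo, hi))                   # -> (10, 12): ssippi$ and ssissippi$
```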
We are only left with showing (i) how to compute and (ii) the interval of all suffixes starting with in Case 2 of the algorithm. To this end, we introduce the longest common prefix (LCP) array of . We define to be the length of the longest common prefix of the -st lexicographically smallest suffix of and the -th lexicographically smallest suffix of . In Figure 1 we have because the fourth lexicographically smallest suffix of is , the fifth lexicographically smallest suffix of is , and the longest common prefix of and is , which has length . Remember that in the example the backward search starting from (suffixes starting with ) and was unsuccessful, so computing means determining the longest prefix of such that the number of suffixes starting with such a prefix is bigger than . This is easy to compute by using the LCP array: the longest such prefix is the one of length , so that the desired prefix is . As a consequence, we are only left with showing how to compute the interval of all suffixes starting with the prefix — which is . Notice that in order to compute this interval, it is enough to expand the interval in both directions as long as the LCP value does not go below . Since , , and , and we already know that , we conclude that the desired interval is . In other words, given a position , we must be able to compute the biggest integer less than such that , and the smallest integer bigger than such that (in our case, ). These queries are called PSV (“previous smaller value”) and NSV (“next smaller value”) queries. The LCP array can be augmented in such a way that PSV and NSV queries can be solved efficiently: different space-time trade-offs are possible; we refer the reader to [17] for details.
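The Case 2 computation just described can be sketched as follows; PSV and NSV are done by linear scans here for clarity, whereas [17] supports them efficiently on an augmented LCP array. The example uses the LCP array of Figure 1 and the interval of the suffixes starting with ssi.

```python
def parent_interval(lcp, lo, hi):
    """Given the suffix-array interval [lo, hi) of a string W, return
    (l, plo, phi): the length l of the longest proper prefix of W whose
    interval [plo, phi) strictly contains [lo, hi), and that interval.
    PSV/NSV are linear scans in this sketch."""
    n = len(lcp)
    # the longest prefix of W with a larger interval has length
    # max(lcp[lo], lcp[hi]) (values outside the array count as 0)
    l = max(lcp[lo] if lo > 0 else 0, lcp[hi] if hi < n else 0)
    plo, phi = lo, hi
    while plo > 0 and lcp[plo] >= l:      # PSV: expand to the left
        plo -= 1
    while phi < n and lcp[phi] >= l:      # NSV: expand to the right
        phi += 1
    return l, plo, phi

lcp = [0, 0, 1, 1, 4, 0, 0, 1, 0, 2, 1, 3]   # LCP array of Figure 1
# [10, 12) is the interval of "ssi"; its longest prefix with a strictly
# larger interval is "s" (length 1), whose interval is [8, 12)
print(parent_interval(lcp, 10, 12))          # -> (1, 8, 12)
```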
4 Matching statistics for Wheeler DFAs
Let us define Wheeler DFAs [7].
Definition 1.
Let be a DFA. A Wheeler order on is a total order on Q such that for every and:
- (Axiom 1) If and , then .
- (Axiom 2) If , and , then .
A DFA is Wheeler if it admits a Wheeler order.
It is immediate to check that this definition is equivalent to the one in [7], where it was shown that if a DFA admits a Wheeler order , then is uniquely determined (that is, is the Wheeler order on ). In the following, we fix a Wheeler DFA , where we assume , with in the Wheeler order, and coincides with the initial state . See Figure 2 for an example.
We now show that a Wheeler order can be seen as a permutation of the set of all states playing the same role as the suffix array of a string. In the following, it will be expedient to (conceptually) assume that has a self-loop labeled (this is consistent with Axiom 1, because for every ). This implies that every state has at least one incoming edge, so for every state there exists at least one infinite string that can be read starting from and following edges in a backward fashion (for example, in Figure 2 for such a string is ). We denote by the set of all such strings. Formally:
Definition 2.
Let . For every state define:
For example, in Figure 2 we have .
The following lemma shows that the permutation of the states defined by the Wheeler order is the one lexicographically sorting the strings entering each state, just like the permutation defined by the suffix array lexicographically sorts the suffixes of the strings (a suffix is seen as a string “leaving” a text position).
Lemma 3.
Let , with . Let and . Then, .
Proof.
Let in be such that (i) , (ii) for every and (iii) . Analogously, let in be such that (i) , (ii) for every and (iii) . Let . We must prove that . Let be the smallest integer such that the -th character of is different than the -th character of . In other words, we know that , , , , but . We must prove that . Since , and , from Axiom 2 we obtain . Since , , and , from Axiom 2 we obtain . By iterating this argument, we conclude . By Axiom 1, we obtain . Since , we conclude . ∎
If we think of a string as a labeled path, then the suffix array sorts the strings that can be read from each position by moving forward (that is, the suffixes of the string), while the Wheeler order sorts the strings that can be read from each position by moving backward towards the initial state. The underlying idea is the same: the forward vs backward difference is only due to historical reasons [6]. To compute the matching statistics on a Wheeler DFA we reason as in the previous section, replacing the backward search with the forward search [6], defined as follows: given an interval in and , find the (possibly empty) interval in such that a state is reachable from some state , with , through an edge labeled , if and only if (this easily follows by using the axioms of Definition 1). For a constant-size alphabet, given and then can be determined in constant time. Given a string , if we start from the whole set of states and repeatedly apply the forward search, we reach the set of all states for which there exists prefixed by ; this is an interval with respect to the Wheeler order: in the following we call this interval .
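A forward-search step can be sketched on an explicit edge list. The 4-state automaton below is a hypothetical example (states numbered in Wheeler order, labels attached to target states by input-consistency), and the linear scan over the edges stands in for the constant-time rank/select implementation of [6].

```python
def forward_step(edges, label, interval, a):
    """One step of forward search on a Wheeler DFA.  States are 1..n in
    Wheeler order; edges is a list of pairs (u, v), and the label of
    (u, v) is label[v].  interval is an inclusive pair (lo, hi) of
    states, or None if empty."""
    if interval is None:
        return None
    lo, hi = interval
    targets = [v for (u, v) in edges
               if lo <= u <= hi and label[v] == a]
    if not targets:
        return None
    # by the Wheeler axioms the set of targets is a contiguous range
    return (min(targets), max(targets))

# hypothetical automaton: state 1 is initial (label '#'),
# states 2 and 3 are labeled 'a', state 4 is labeled 'b'
label = {1: '#', 2: 'a', 3: 'a', 4: 'b'}
edges = [(1, 2), (2, 3), (2, 4), (3, 4)]
interval = (1, 4)                 # all states: the empty pattern
for c in "ab":
    interval = forward_step(edges, label, interval, c)
print(interval)                   # -> (4, 4): only state 4 is reached by "ab"
```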
Because of the forward vs backward difference the problem of matching statistics will be defined in a symmetrical way on Wheeler DFAs. Given a pattern , for every we want to determine (i) the longest suffix of which occurs in the Wheeler DFA (that is, that can be read somewhere on by concatenating edges), and the endpoints of the interval .
Broadly speaking, we can apply the same idea of the algorithm for strings, but in a symmetrical way. We start from the beginning of (not from the end of ), and initially we consider the whole set of states. We repeatedly apply the forward search (not the backward search), until the forward search returns the empty interval for some . This means that does not occur in . Then, if is the whole set of states, we conclude that the character labels no edge in the graph. Otherwise, we must find the smallest such that is strictly contained in (that is, we must determine the longest suffix of which reaches more states than ). Then we must determine the endpoints of the interval so that we can go on with the forward search.
The challenge now is to find a way to solve the same subproblems that we identified in Case 2 of the algorithm for strings. In other words, we must find a way to determine and find the endpoints of the interval . We will show that the solution is not as simple as the one for the algorithm on strings.
5 The LCP array and matching statistics for Wheeler DFAs
We start by observing that may be an infinite set. For example, in Figure 2, we have
In general, an infinite set of (lexicographically sorted) strings in need not admit a minimum or a maximum. For example, the set does not admit a minimum (but only the infimum string ). Nonetheless, Lemma 3 implies that each admits both a minimum and a maximum. For example, the minimum is obtained as follows. Let , and for every , recursively let be the smallest integer in such that . Then, the minimum of is , and analogously one can determine the maximum.
In the following, we will denote the minimum and the maximum of by and , respectively (for example, in Figure 2 we have , and ). Lemma 3 implies that:
This suggests to generalize the LCP array as follows. Given , let be the length of the longest common prefix of and (if , define ).
Definition 4.
The LCP-array of a Wheeler automaton is the array which contains the following values in this order: , , , , , .
From the above characterization of and , one can prove that for every entry either or (it follows from the Fine and Wilf theorem [18, 19]), and one can design a polynomial time algorithm to compute .
Unfortunately, the array alone is not sufficient for computing matching statistics. Assume that , and that when we apply the forward search by adding a character , we obtain . We must then determine the largest suffix of such that is strictly contained in . Suppose that every string in is prefixed by , and every string in is prefixed by . In particular, both and are prefixed by . In this case, we can proceed like in the algorithm for strings: the desired suffix is the one having length , which can be determined using . However, in general, even if some string in must be prefixed by , the string need not be prefixed by , and similarly need not be prefixed by . The worst-case scenario occurs when . Consider Figure 2, and assume that . Then, we have (note that is a string in prefixed by ). However, both , and , are not prefixed by . Notice that and , but is not the suffix of length 3 of . Indeed, since is only prefixed by the prefix of of length , and is only prefixed by the prefix of of length , we conclude that it must be . In general, the desired suffix is the one having length given by:
(1)
The above formula shows that, in order to compute , in addition to it suffices to know the values and ( is a suffix of , so it is determined by its length). We now show how our algorithm can efficiently maintain the current pattern , the set and the values and during the computation of the matching statistics. We assume that the input automaton is encoded with the rank/select data structures supporting the execution of a step of forward search in time, see [6] for details. In addition, we will use the following result.
Lemma 5.
Let be a sequence of values over an ordered alphabet . Consider the following queries: (i) given , compute the minimum value in , and (ii) given and , determine the biggest (or the smallest ) such that . Then, can be augmented with a data structure of bits such that query (i) can be answered in constant time and query (ii) can be answered in time.
Proof.
There exists a data structure of bits that allows one to solve range minimum queries in constant time [20], so using we can solve queries (i) in constant time. Now, let us show how to solve queries (ii). Let be the answer of query (i) on input and . If , then we must keep searching in the interval , otherwise, we must keep searching in the interval . In other words, we can answer a query (ii) by means of a binary search on , which takes (and so ) time. ∎
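The proof can be sketched in code: a sparse-table RMQ (which uses more space than the succinct structure of [20], but answers range-minimum queries in constant time) drives a binary search implementing query (ii). The example query runs on the LCP array of Figure 1.

```python
class RangeMin:
    """Sparse-table RMQ: query(i, j) returns min a[i..j] (inclusive)
    in O(1) time after O(n log n) preprocessing."""
    def __init__(self, a):
        self.t = [list(a)]
        k = 1
        while (1 << k) <= len(a):
            p = self.t[-1]
            half = 1 << (k - 1)
            self.t.append([min(p[i], p[i + half])
                           for i in range(len(a) - (1 << k) + 1)])
            k += 1

    def query(self, i, j):
        k = (j - i + 1).bit_length() - 1
        return min(self.t[k][i], self.t[k][j - (1 << k) + 1])

def prev_smaller(rmq, i, t):
    """Query (ii): the largest j <= i with a[j] < t, found by binary
    search over O(log n) range-minimum queries; None if no such j."""
    lo, hi = 0, i
    if rmq.query(lo, hi) >= t:
        return None
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if rmq.query(mid, hi) < t:   # an answer lies in [mid, hi]
            lo = mid
        else:                        # all of a[mid..hi] >= t
            hi = mid - 1
    return lo

rmq = RangeMin([0, 0, 1, 1, 4, 0, 0, 1, 0, 2, 1, 3])  # LCP of Figure 1
print(prev_smaller(rmq, 10, 1))      # -> 8: rightmost entry < 1 at or before 10
```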
Notice that query (ii) can be seen as a variant of PSV and NSV queries. In the following, we assume that the array has been augmented with the data structure of Lemma 5.
At the beginning we have , so and trivially . At each iteration we perform a step of forward search computing given ; then we distinguish two cases according to whether is empty or not.
Case 1. is not empty. In that case will become the pattern at the next iteration. Since we already have we are left with the task of computing and . We only show how to compute , the latter computation being analogous. Let be the smallest integer in such that . Notice that we can easily compute by means of standard rank/select operations on the compact data structure used to encode . Since , it must be . Moreover, the characterization of that we described above implies that , hence . To compute we distinguish two subcases:
- , hence . Since , there exist and both prefixed by . But , so is also prefixed by , and we conclude .
- . In this case, we have , and therefore is equal to
With the above formula we can compute using query (i) of Lemma 5 over the range and the value .
Case 2. is empty. In this case at the next iteration the pattern will be the largest suffix of such that is strictly contained in . We compute using (1); if we set , otherwise we apply query (ii) of Lemma 5 to find the rightmost entry in smaller than . Computing is analogous.
Given , where , , and at least one inequality is strict, we want to compute and . We only consider , the latter computation being analogous. We distinguish two subcases:
- . Then .
- . In particular, since is the left endpoint of and , one can prove as in Case 1 that is prefixed by . We immediately conclude that , which can be directly computed since is a value stored in .
We can summarize the above discussion as follows.
Theorem 6.
Given a Wheeler DFA , there exists a data structure occupying words which can compute the matching statistics of a pattern in time .
Funding
TG funded by National Institutes of Health (NIH) NIAID (grant no. HG011392), the National Science Foundation NSF IIBR (grant no. 2029552) and a Natural Science and Engineering Research Council (NSERC) Discovery Grant (grant no. RGPIN-07185-2020). GM funded by the Italian Ministry of University and Research (PRIN 2017WR7SHH). MS funded by the INdAM-GNCS Project (CUP_E55F22000270001). NP funded by the European Union (ERC, REGINDEX, 101039208). Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or the European Research Council. Neither the European Union nor the granting authority can be held responsible for them.
6 References
- [1] P. Ferragina and G. Manzini, “Opportunistic data structures with applications,” in Proc. 41st Annual Symposium on Foundations of Computer Science (FOCS’00), 2000, pp. 390–398.
- [2] M. Burrows and D. J. Wheeler, “A block-sorting lossless data compression algorithm,” Tech. Rep., 1994.
- [3] P. Ferragina, F. Luccio, G. Manzini, and S. Muthukrishnan, “Structuring labeled trees for optimal succinctness, and beyond,” in proc. 46th Annual IEEE Symposium on Foundations of Computer Science (FOCS’05), 2005, pp. 184–193.
- [4] A. Bowe, T. Onodera, K. Sadakane, and T. Shibuya, “Succinct de Bruijn graphs,” in Algorithms in Bioinformatics, Berlin, Heidelberg, 2012, pp. 225–235, Springer Berlin Heidelberg.
- [5] V. Mäkinen, N. Välimäki, and J. Sirén, “Indexing graphs for path queries with applications in genome research,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 11, pp. 375–388, 2014.
- [6] T. Gagie, G. Manzini, and J. Sirén, “Wheeler graphs: A framework for BWT-based data structures,” Theoret. Comput. Sci., vol. 698, pp. 67–78, 2017.
- [7] J. Alanko, G. D’Agostino, A. Policriti, and N. Prezza, “Regular languages meet prefix sorting,” in Proc. of the 31st Symposium on Discrete Algorithms, (SODA’20). 2020, pp. 911–930, SIAM.
- [8] N. Cotumaccio and N. Prezza, “On indexing and compressing finite automata,” in Proc. of the 32nd Symposium on Discrete Algorithms, (SODA’21). 2021, pp. 2585–2599, SIAM.
- [9] N. Cotumaccio, “Graphs can be succinctly indexed for pattern matching in time,” in 2022 Data Compression Conference (DCC), 2022, pp. 272–281.
- [10] J. Alanko, G. D’Agostino, A. Policriti, and N. Prezza, “Wheeler languages,” Information and Computation, vol. 281, pp. 104820, 2021.
- [11] D. Gusfield, Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology, Cambridge University Press, 1997.
- [12] W. I. Chang and E. L. Lawler, “Sublinear approximate string matching and biological applications,” Algorithmica, vol. 12, pp. 327–344, 1994.
- [13] P. Weiner, “Linear pattern matching algorithms,” in Proc. 14th IEEE Annual Symposium on Switching and Automata Theory, 1973, pp. 1–11.
- [14] U. Manber and G. Myers, “Suffix arrays: A new method for on-line string searches,” SIAM J. Comput., vol. 22, no. 5, pp. 935–948, 1993.
- [15] M. I. Abouelhoda, S. Kurtz, and E. Ohlebusch, “Replacing suffix trees with enhanced suffix arrays,” J. of Discrete Algorithms, vol. 2, no. 1, pp. 53–86, 2004.
- [16] K. Sadakane, “Compressed suffix trees with full functionality,” Theor. Comp. Sys., vol. 41, no. 4, pp. 589–607, 2007.
- [17] E. Ohlebusch, S. Gog, and A. Kügel, “Computing matching statistics and maximal exact matches on compressed full-text indexes,” in Proceedings of the 17th International Conference on String Processing and Information Retrieval (SPIRE’10), Berlin, Heidelberg, 2010, pp. 347–358, Springer-Verlag.
- [18] N. J. Fine and H. S. Wilf, “Uniqueness theorems for periodic functions,” Proc. Amer. Math. Soc., vol. 16, pp. 109–114, 1965.
- [19] S. Mantaci, A. Restivo, G. Rosone, and M. Sciortino, “An extension of the Burrows-Wheeler transform,” Theor. Comput. Sci., vol. 387, no. 3, pp. 298–312, 2007.
- [20] J. Fischer, “Optimal succinctness for range minimum queries,” in LATIN 2010: Theoretical Informatics, A. López-Ortiz, Ed., Berlin, Heidelberg, 2010, pp. 158–169, Springer Berlin Heidelberg.