Computing matching statistics on Wheeler DFAs
Abstract
Matching statistics were introduced to solve the approximate string matching problem, which is a recurrent subroutine in bioinformatics applications. In 2010, Ohlebusch et al. [SPIRE 2010] proposed a time and space efficient algorithm for computing matching statistics which relies on some components of a compressed suffix tree - notably, the longest common prefix (LCP) array. In this paper, we show how their algorithm can be generalized from strings to Wheeler deterministic finite automata. Most importantly, we introduce a notion of LCP array for Wheeler automata, thus establishing a first clear step towards extending (compressed) suffix tree functionalities to labeled graphs.
1 Introduction
Given a string and a pattern , the classical formulation of the pattern matching problem requires deciding whether the pattern occurs in the string and, possibly, counting the number of such occurrences and reporting the positions where they occur. The invention of the FM-index [1], which is based on the Burrows-Wheeler transform [2], opened a new line of research in the pattern matching field. The indexing and compression techniques behind the FM-index deeply rely on the idea of suffix sorting, and over the years have been generalized from strings to trees [3], de Bruijn graphs [4, 5], Wheeler graphs [6, 7] and arbitrary graphs [8, 9]. In particular, the class of Wheeler graphs is probably the one that captures the intuition behind the FM-index in the simplest way, and indeed the notion of Wheeler order has relevant consequences in automata theory [7, 10].
However, in bioinformatics we are not only interested in exact pattern matching, but also in a myriad of variations of the pattern matching problem [11]. In particular, matching statistics were introduced to solve the approximate pattern matching problem [12]. A powerful data structure that is able to address these variations of the pattern matching problem at once is the suffix tree [13]. The main drawback of the suffix tree is its space consumption, which is non-negligible both in theory and in practice. As a consequence, the suffix tree has been replaced by the suffix array [14]. While suffix arrays do not have all the functionalities of suffix trees, it has been shown that they can be augmented with some additional data structures — notably, the longest common prefix (LCP) array — so that it is possible to retrieve the full functionality of a suffix tree [15]. All these components can be successfully compressed, leading to the so-called compressed suffix trees [16].
The natural question is whether it is possible to provide suffix tree functionalities not only to strings, but also to graphs, and in particular Wheeler graphs. In this paper, we provide a first partial affirmative answer by considering the problem of computing matching statistics. In 2010, Ohlebusch et al. [17] proposed a time and space efficient algorithm for computing matching statistics which relies on some components of a compressed suffix tree. In this paper, we show how their algorithm can be generalized from strings to Wheeler deterministic finite automata. Most importantly, we introduce a notion of longest common prefix (LCP) array for Wheeler automata, thus establishing an important step towards extending (compressed) suffix tree functionalities to labeled graphs.
2 Notation and first definitions
Throughout the paper, we consider an alphabet and a fixed total order on . We denote by the set of all finite strings on and by the set of all (countably) infinite strings on . The empty word is . If , then is the reverse string of . We extend the total order from to lexicographically. If and are integers, with , define . If is a string, the -th character of is , and .
We will consider deterministic automata , where is the set of states, is the set of labeled edges, is the initial state and is the set of final states. The definition implies that for every and for every there exists at most one edge labeled leaving . Following [7, 10], we assume that has no incoming edges, and every state is reachable from the initial state; moreover, all edges entering the same state have the same label (input-consistency), so that for every we can let be the label of all edges entering . We define , where is a special character for which we assume for every (the character is analogous to the termination character used for suffix trees and suffix arrays). As a consequence, an edge can be simply written as , because it must be .
We assume familiarity with the notions of suffix array (SA), Burrows-Wheeler transform (BWT), FM-index and backward search [1].
The matching statistics of a pattern with respect to a string are defined as follows. Assume that , where for every . Determining the matching statistics of with respect to means determining, for , (i) the longest prefix of which occurs in , and (ii) the interval corresponding to the set of all strings starting with in the list of all lexicographically sorted suffixes. We can describe (i) and (ii) by means of three values: the length of , and the endpoints and of the interval considered in (ii). For example, let (see Figure 1), and . For , we have , so and (suffixes starting with ). For , we have , so and (all suffixes start with the empty string). For , we have , so , and (suffixes starting with ). For , we have , so , and (suffixes starting with ). One can proceed analogously for .
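For concreteness, the definition above can be checked with a naive sketch: the Python function below (a quadratic-time illustration, not the efficient algorithm discussed later) computes, for every position q of a pattern P, the length of the longest prefix of P[q..] occurring as a substring of T. The pattern `issa` is a hypothetical example.

```python
def matching_statistics(T, P):
    """For each position q of P, the length of the longest prefix of
    P[q:] that occurs as a substring of T (naive sketch)."""
    ms = []
    for q in range(len(P)):
        length = 0
        # extend the current match while it still occurs somewhere in T
        while q + length < len(P) and P[q:q + length + 1] in T:
            length += 1
        ms.append(length)
    return ms

# hypothetical pattern against the text of Figure 1
print(matching_statistics("mississippi$", "issa"))  # -> [3, 2, 1, 0]
```

For example, at position 0 the longest prefix of `issa` occurring in `mississippi$` is `iss`, of length 3, while the final `a` does not occur at all, giving length 0.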
Figure 1: the sorted suffixes of mississippi$, with the corresponding LCP array, suffix array (SA) and Burrows-Wheeler transform (BWT).

| | Sorted suffixes | LCP | SA | BWT |
|---|---|---|---|---|
| 1 | $ | | 12 | i |
| 2 | i$ | 0 | 11 | p |
| 3 | ippi$ | 1 | 8 | s |
| 4 | issippi$ | 1 | 5 | s |
| 5 | ississippi$ | 4 | 2 | m |
| 6 | mississippi$ | 0 | 1 | $ |
| 7 | pi$ | 0 | 10 | p |
| 8 | ppi$ | 1 | 9 | i |
| 9 | sippi$ | 0 | 7 | s |
| 10 | sissippi$ | 2 | 4 | s |
| 11 | ssippi$ | 1 | 6 | i |
| 12 | ssissippi$ | 3 | 3 | i |
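The columns of Figure 1 can be reproduced directly from the definitions. The sketch below (plain Python, 0-based indices, whereas the figure is 1-based) builds the suffix array, LCP array and BWT of mississippi$ by brute-force sorting, which is enough for illustration.

```python
def sa_lcp_bwt(T):
    """Suffix array, LCP array and BWT of T, which must end with a unique
    sentinel '$' smaller than every other character; all 0-based."""
    n = len(T)
    sa = sorted(range(n), key=lambda i: T[i:])        # sort suffixes
    lcp = [0] * n                                     # lcp[j] = LCP of the
    for j in range(1, n):                             # (j-1)-th and j-th
        a, b = T[sa[j - 1]:], T[sa[j]:]               # smallest suffixes
        k = 0
        while k < min(len(a), len(b)) and a[k] == b[k]:
            k += 1
        lcp[j] = k
    bwt = "".join(T[(i - 1) % n] for i in sa)         # char preceding each
    return sa, lcp, bwt                               # sorted suffix

sa, lcp, bwt = sa_lcp_bwt("mississippi$")
# sa  -> [11, 10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2]  (i.e. 12, 11, 8, ... 1-based)
# lcp -> [0, 0, 1, 1, 4, 0, 0, 1, 0, 2, 1, 3]
# bwt -> "ipssm$pissii"
```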
3 Computing matching statistics for strings
We will first describe the algorithm by Ohlebusch et al. [17], emphasizing the ideas that we will generalize when switching to Wheeler DFAs. The algorithm computes the matching statistics using a number of iterations linear in by exploiting the backward search. We start from the end of , and we use the backward search (starting from the interval which corresponds to the set of suffixes prefixed by the empty string) to find the interval of all occurrences of the last character of in (if any). Then, starting from the new interval, we use the backward search to find all the occurrences of the suffix of length 2 of in (if any), and so on. At some point, it may happen that for some we have that occurs in , but the next application of the backward search returns the empty interval, so that does not occur in (the case corresponds to the initial setting when is the empty string). We distinguish two cases:
- (Case 1) If and , this means that all suffixes of are prefixed by . This may happen in particular if : this means that the first backward search has been unsuccessful. We immediately conclude that character does not occur in , so and (because all suffixes start with the empty string). In this case, in the following iterations of the algorithm, we can simply discard : when for we will be searching for the longest prefix of occurring in , it will suffice to search for the longest prefix of occurring in .
- (Case 2) If or , this means that the number of suffixes of starting with is less than . Now, every suffix starting with also starts with . If the number of suffixes starting with is equal to the number of suffixes starting with , then also does not occur in . More generally, for we can have that occurs in only if the number of suffixes starting with is larger than the number of suffixes starting with . Since we are interested in maximal matches, we want to be as large as possible: we will show later how to compute the largest integer such that the number of suffixes starting with is larger than the number of suffixes starting with . Notice that always exists, because all suffixes start with the empty string, but less than suffixes start with . After determining we discard (so in the following iterations of the algorithm we will simply consider ), and we recursively apply the backward search starting from the interval associated with the occurrences of — we will also see how to compute this interval.
Let us apply the above algorithm to and . We start with the interval , corresponding to the empty pattern, and character . A backward step yields the interval (suffixes starting with ), so . Now, we apply a backward step from and , obtaining (suffixes starting with ), so . Again, we apply a backward step from and , obtaining (suffixes starting with ), so . Again, we apply a backward step from and , obtaining (suffixes starting with ), so . We now apply a backward step from and , and we obtain the empty interval. This means that no suffix starts with . Notice in Figure 1 that the number of suffixes starting with is equal to the number of suffixes starting with or , but the number of suffixes starting with is bigger. As a consequence, we consider the interval of all suffixes starting with — which is — and we apply a backward step with . This time the backward step is successful, and we obtain (suffixes starting with ), and . We now apply a backward step from and , obtaining the empty interval. This means that no suffix starts with . Notice in Figure 1 that the number of suffixes starting with is bigger than the number of suffixes starting with . The corresponding interval is , but a backward step with is still unsuccessful (so no suffix starts with ). The number of suffixes starting with is smaller than the number of suffixes starting with the empty string (which is equal to ), so we apply a backward step with and . Since the backward step is still unsuccessful, we conclude that does not occur in , so and . Finally, we start again from the whole interval , and a backward step with returns (suffixes starting with ), so .
It is easy to see that the number of iterations is linear in . Indeed, every time we apply a backward step, either we move to the left across to compute a new matching statistic, or we increase by at least 1 the length of the suffix of which is forever discarded. This implies that the number of iterations is bounded by .
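A single backward step can be sketched with plain prefix-count tables standing in for the compressed rank structures of the FM-index; intervals are 0-based and half-open, and an empty interval has equal endpoints. The driver at the bottom repeats the step on the worked example of Figure 1.

```python
from collections import Counter

def fm_index(bwt):
    """Plain-array stand-ins for the FM-index rank structures:
    C[c] = number of characters in bwt smaller than c,
    occ[c][i] = number of occurrences of c in bwt[:i]."""
    counts = Counter(bwt)
    C, total = {}, 0
    for c in sorted(counts):
        C[c] = total
        total += counts[c]
    occ = {c: [0] * (len(bwt) + 1) for c in counts}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return C, occ

def backward_step(C, occ, lo, hi, c):
    """Map the suffix-array interval [lo, hi) of a string W to the
    interval of cW; empty intervals have lo == hi."""
    if c not in C:
        return 0, 0
    return C[c] + occ[c][lo], C[c] + occ[c][hi]

# search "ssi" right to left (BWT of mississippi$ as in Figure 1)
C, occ = fm_index("ipssm$pissii")
lo, hi = 0, 12                    # interval of the empty pattern
for c in reversed("ssi"):
    lo, hi = backward_step(C, occ, lo, hi, c)
print((lo, hi))                   # -> (10, 12): ssippi$ and ssissippi$
```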
We are only left with showing (i) how to compute and (ii) the interval of all suffixes starting with in Case 2 of the algorithm. To this end, we introduce the longest common prefix (LCP) array of . We define to be the length of the longest common prefix of the -st lexicographically smallest suffix of and the -th lexicographically smallest suffix of . In Figure 1 we have because the fourth lexicographically smallest suffix of is , the fifth lexicographically smallest suffix of is , and the longest common prefix of and is , which has length . Remember that in the example the backward search starting from (suffixes starting with ) and was unsuccessful, so computing means determining the longest prefix of such that the number of suffixes starting with such a prefix is bigger than . This is easy to compute by using the LCP array: the longest such prefix is the one of length , so that the desired prefix is . As a consequence, we are only left with showing how to compute the interval of all suffixes starting with the prefix — which is . Notice that in order to compute this interval, it is enough to expand the interval in both directions as long as the LCP value does not go below . Since , , and , and we already know that , we conclude that the desired interval is . In other words, given a position , we must be able to compute the biggest integer less than such that , and the smallest integer bigger than such that (in our case, ). These queries are called PSV (“previous smaller value”) and NSV (“next smaller value”) queries. The LCP array can be augmented in such a way that PSV and NSV queries can be solved efficiently: different space-time trade-offs are possible; we refer the reader to [17] for details.
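The Case 2 computation just described can be sketched as follows; PSV and NSV are done by linear scans here for clarity, whereas [17] supports them efficiently on an augmented LCP array. The example uses the LCP array of Figure 1 and the interval of the suffixes starting with ssi.

```python
def parent_interval(lcp, lo, hi):
    """Given the suffix-array interval [lo, hi) of a string W, return
    (l, plo, phi): the length l of the longest proper prefix of W whose
    interval [plo, phi) strictly contains [lo, hi), and that interval.
    PSV/NSV are linear scans in this sketch."""
    n = len(lcp)
    # the longest prefix of W with a larger interval has length
    # max(lcp[lo], lcp[hi]) (values outside the array count as 0)
    l = max(lcp[lo] if lo > 0 else 0, lcp[hi] if hi < n else 0)
    plo, phi = lo, hi
    while plo > 0 and lcp[plo] >= l:      # PSV: expand to the left
        plo -= 1
    while phi < n and lcp[phi] >= l:      # NSV: expand to the right
        phi += 1
    return l, plo, phi

lcp = [0, 0, 1, 1, 4, 0, 0, 1, 0, 2, 1, 3]   # LCP array of Figure 1
# [10, 12) is the interval of "ssi"; its longest prefix with a strictly
# larger interval is "s" (length 1), whose interval is [8, 12)
print(parent_interval(lcp, 10, 12))          # -> (1, 8, 12)
```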
4 Matching statistics for Wheeler DFAs
Let us define Wheeler DFAs [7].
Definition 1.
Let be a DFA. A Wheeler order on is a total order on Q such that for every and:
- (Axiom 1) If and , then .
- (Axiom 2) If , and , then .
A DFA is Wheeler if it admits a Wheeler order.
It is immediate to check that this definition is equivalent to the one in [7], where it was shown that if a DFA admits a Wheeler order , then is uniquely determined (that is, is the Wheeler order on ). In the following, we fix a Wheeler DFA , where we assume , with in the Wheeler order, and coincides with the initial state . See Figure 2 for an example.
We now show that a Wheeler order can be seen as a permutation of the set of all states playing the same role as the suffix array of a string. In the following, it will be expedient to (conceptually) assume that has a self-loop labeled (this is consistent with Axiom 1, because for every ). This implies that every state has at least one incoming edge, so for every state there exists at least one infinite string that can be read starting from and following edges in a backward fashion (for example, in Figure 2 for such a string is ). We denote by the set of all such strings. Formally:
Definition 2.
Let . For every state define:
For example, in Figure 2 we have .
The following lemma shows that the permutation of the states defined by the Wheeler order is the one lexicographically sorting the strings entering each state, just like the permutation defined by the suffix array lexicographically sorts the suffixes of the strings (a suffix is seen as a string “leaving” a text position).
Lemma 3.
Let , with . Let and . Then, .
Proof.
Let in be such that (i) , (ii) for every and (iii) . Analogously, let in be such that (i) , (ii) for every and (iii) . Let . We must prove that . Let be the smallest integer such that the -th character of is different than the -th character of . In other words, we know that , , , , but . We must prove that . Since , and , from Axiom 2 we obtain . Since , , and , from Axiom 2 we obtain . By iterating this argument, we conclude . By Axiom 1, we obtain . Since , we conclude . ∎
If we think of a string as a labeled path, then the suffix array sorts the strings that can be read from each position by moving forward (that is, the suffixes of the string), while the Wheeler order sorts the strings that can be read from each position by moving backward towards the initial state. The underlying idea is the same: the forward vs backward difference is only due to historical reasons [6]. To compute the matching statistics on a Wheeler DFA we reason as in the previous section, replacing the backward search with the forward search [6], defined as follows: given an interval in and , find the (possibly empty) interval in such that a state is reachable from some state , with , through an edge labeled , if and only if (this easily follows by using the axioms of Definition 1). For a constant-size alphabet, given and then can be determined in constant time. Given a string , if we start from the whole set of states and repeatedly apply the forward search, we reach the set of all states for which there exists prefixed by ; this is an interval with respect to the Wheeler order: in the following we call this interval .
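A forward-search step can be sketched on an explicit edge list. The 4-state automaton below is a hypothetical example (states numbered in Wheeler order, labels attached to target states by input-consistency), and the linear scan over the edges stands in for the constant-time rank/select implementation of [6].

```python
def forward_step(edges, label, interval, a):
    """One step of forward search on a Wheeler DFA.  States are 1..n in
    Wheeler order; edges is a list of pairs (u, v), and the label of
    (u, v) is label[v].  interval is an inclusive pair (lo, hi) of
    states, or None if empty."""
    if interval is None:
        return None
    lo, hi = interval
    targets = [v for (u, v) in edges
               if lo <= u <= hi and label[v] == a]
    if not targets:
        return None
    # by the Wheeler axioms the set of targets is a contiguous range
    return (min(targets), max(targets))

# hypothetical automaton: state 1 is initial (label '#'),
# states 2 and 3 are labeled 'a', state 4 is labeled 'b'
label = {1: '#', 2: 'a', 3: 'a', 4: 'b'}
edges = [(1, 2), (2, 3), (2, 4), (3, 4)]
interval = (1, 4)                 # all states: the empty pattern
for c in "ab":
    interval = forward_step(edges, label, interval, c)
print(interval)                   # -> (4, 4): only state 4 is reached by "ab"
```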
Because of the forward vs backward difference the problem of matching statistics will be defined in a symmetrical way on Wheeler DFAs. Given a pattern , for every we want to determine (i) the longest suffix of which occurs in the Wheeler DFA (that is, that can be read somewhere on by concatenating edges), and the endpoints of the interval .
Broadly speaking, we can apply the same idea of the algorithm for strings, but in a symmetrical way. We start from the beginning of (not from the end of ), and initially we consider the whole set of states. We repeatedly apply the forward search (not the backward search), until the forward search returns the empty interval for some . This means that does not occur in . Then, if is the whole set of states, we conclude that the character labels no edge in the graph. Otherwise, we must find the smallest such that is strictly contained in (that is, we must determine the longest suffix of which reaches more states than ). Then we must determine the endpoints of the interval so that we can go on with the forward search.
The challenge now is to find a way to solve the same subproblems that we identified in Case 2 of the algorithm for strings. In other words, we must find a way to determine and find the endpoints of the interval . We will show that the solution is not as simple as the one for the algorithm on strings.
5 The LCP array and matching statistics for Wheeler DFAs
We start by observing that may be an infinite set. For example, in Figure 2, we have
In general, an infinite set of (lexicographically sorted) strings in need not admit a minimum or a maximum. For example, the set does not admit a minimum (but only the infimum string ). Nonetheless, Lemma 3 implies that each admits both a minimum and a maximum. For example, the minimum is obtained as follows. Let , and for every , recursively let be the smallest integer in such that . Then, the minimum of is , and analogously one can determine the maximum.
In the following, we will denote the minimum and the maximum of by and , respectively (for example, in Figure 2 we have , and ). Lemma 3 implies that:
This suggests to generalize the LCP array as follows. Given , let be the length of the longest common prefix of and (if , define ).
Definition 4.
The LCP-array of a Wheeler automaton is the array which contains the following values in this order: , , , , , .
From the above characterization of and , one can prove that for every entry either or (it follows from the Fine and Wilf theorem [18, 19]), and one can design a polynomial time algorithm to compute .
Unfortunately, the array alone is not sufficient for computing matching statistics. Assume that , and that when we apply the forward search by adding a character , we obtain . We must then determine the largest suffix of such that is strictly contained in . Suppose that every string in is prefixed by , and every string in is prefixed by . In particular, both and are prefixed by . In this case, we can proceed like in the algorithm for strings: the desired suffix is the one having length , which can be determined using . However, in general, even if some string in must be prefixed by , the string need not be prefixed by , and similarly need not be prefixed by . The worst-case scenario occurs when . Consider Figure 2, and assume that . Then, we have (note that is a string in prefixed by ). However, both , and , are not prefixed by . Notice that and , but is not the suffix of length 3 of . Indeed, since is only prefixed by the prefix of of length , and is only prefixed by the prefix of of length , we conclude that it must be . In general, the desired suffix is the one having length given by:
(1)
The above formula shows that, in order to compute , in addition to it suffices to know the values and ( is a suffix of , so it is determined by its length). We now show how our algorithm can efficiently maintain the current pattern , the set and the values and during the computation of the matching statistics. We assume that the input automaton is encoded with the rank/select data structures supporting the execution of a step of forward search in time, see [6] for details. In addition, we will use the following result.
Lemma 5.
Let be a sequence of values over an ordered alphabet . Consider the following queries: (i) given , compute the minimum value in , and (ii) given and , determine the biggest (or the smallest ) such that . Then, can be augmented with a data structure of bits such that query (i) can be answered in constant time and query (ii) can be answered in time.
Proof.
There exists a data structure of bits that allows one to solve range minimum queries in constant time [20], so using we can solve queries (i) in constant time. Now, let us show how to solve queries (ii). Let be the answer of query (i) on input and . If , then we must keep searching in the interval , otherwise, we must keep searching in the interval . In other words, we can answer a query (ii) by means of a binary search on , which takes (and so ) time. ∎
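The proof can be sketched in code: a sparse-table RMQ (which uses more space than the succinct structure of [20], but answers range-minimum queries in constant time) drives a binary search implementing query (ii). The example query runs on the LCP array of Figure 1.

```python
class RangeMin:
    """Sparse-table RMQ: query(i, j) returns min a[i..j] (inclusive)
    in O(1) time after O(n log n) preprocessing."""
    def __init__(self, a):
        self.t = [list(a)]
        k = 1
        while (1 << k) <= len(a):
            p = self.t[-1]
            half = 1 << (k - 1)
            self.t.append([min(p[i], p[i + half])
                           for i in range(len(a) - (1 << k) + 1)])
            k += 1

    def query(self, i, j):
        k = (j - i + 1).bit_length() - 1
        return min(self.t[k][i], self.t[k][j - (1 << k) + 1])

def prev_smaller(rmq, i, t):
    """Query (ii): the largest j <= i with a[j] < t, found by binary
    search over O(log n) range-minimum queries; None if no such j."""
    lo, hi = 0, i
    if rmq.query(lo, hi) >= t:
        return None
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if rmq.query(mid, hi) < t:   # an answer lies in [mid, hi]
            lo = mid
        else:                        # all of a[mid..hi] >= t
            hi = mid - 1
    return lo

rmq = RangeMin([0, 0, 1, 1, 4, 0, 0, 1, 0, 2, 1, 3])  # LCP of Figure 1
print(prev_smaller(rmq, 10, 1))      # -> 8: rightmost entry < 1 at or before 10
```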
Notice that query (ii) can be seen as a variant of PSV and NSV queries. In the following, we assume that the array has been augmented with the data structure of Lemma 5.
At the beginning we have , so and trivially . At each iteration we perform a step of forward search computing given ; then we distinguish two cases according to whether is empty or not.
Case 1. is not empty. In that case will become the pattern at the next iteration. Since we already have we are left with the task of computing and . We only show how to compute , the latter computation being analogous. Let be the smallest integer in such that . Notice that we can easily compute by means of standard rank/select operations on the compact data structure used to encode . Since , it must be . Moreover, the characterization of that we described above implies that , hence . To compute we distinguish two subcases:
- , hence . Since , there exist and both prefixed by . But , so is also prefixed by , and we conclude .
- . In this case, we have , and therefore is equal to
With the above formula we can compute using query (i) of Lemma 5 over the range and the value .
Case 2. is empty. In this case at the next iteration the pattern will be the largest suffix of such that is strictly contained in . We compute using (1); if we set , otherwise we apply query (ii) of Lemma 5 to find the rightmost entry in smaller than . Computing is analogous.
Given , where , , and at least one inequality is strict, we want to compute and . We only consider , the latter computation being analogous. We distinguish two subcases:
- . Then .
- . In particular, since is the left endpoint of and , one can prove as in Case 1 that is prefixed by . We immediately conclude that , which can be directly computed since is a value stored in .
We can summarize the above discussion as follows.
Theorem 6.
Given a Wheeler DFA , there exists a data structure occupying words which can compute the matching statistics of a pattern in time .
Funding
TG funded by National Institutes of Health (NIH) NIAID (grant no. HG011392), the National Science Foundation NSF IIBR (grant no. 2029552) and a Natural Science and Engineering Research Council (NSERC) Discovery Grant (grant no. RGPIN-07185-2020). GM funded by the Italian Ministry of University and Research (PRIN 2017WR7SHH). MS funded by the INdAM-GNCS Project (CUP_E55F22000270001). NP funded by the European Union (ERC, REGINDEX, 101039208). Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or the European Research Council. Neither the European Union nor the granting authority can be held responsible for them.
6 References
- [1] P. Ferragina and G. Manzini, “Opportunistic data structures with applications,” in Proc. 41st Annual Symposium on Foundations of Computer Science (FOCS’00), 2000, pp. 390–398.
- [2] M. Burrows and D. J. Wheeler, “A block-sorting lossless data compression algorithm,” Tech. Rep., 1994.
- [3] P. Ferragina, F. Luccio, G. Manzini, and S. Muthukrishnan, “Structuring labeled trees for optimal succinctness, and beyond,” in proc. 46th Annual IEEE Symposium on Foundations of Computer Science (FOCS’05), 2005, pp. 184–193.
- [4] A. Bowe, T. Onodera, K. Sadakane, and T. Shibuya, “Succinct de Bruijn graphs,” in Algorithms in Bioinformatics, Berlin, Heidelberg, 2012, pp. 225–235, Springer Berlin Heidelberg.
- [5] V. Mäkinen, N. Välimäki, and J. Sirén, “Indexing graphs for path queries with applications in genome research,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 11, pp. 375–388, 2014.
- [6] T. Gagie, G. Manzini, and J. Sirén, “Wheeler graphs: A framework for BWT-based data structures,” Theoret. Comput. Sci., vol. 698, pp. 67–78, 2017.
- [7] J. Alanko, G. D’Agostino, A. Policriti, and N. Prezza, “Regular languages meet prefix sorting,” in Proc. of the 31st Symposium on Discrete Algorithms, (SODA’20). 2020, pp. 911–930, SIAM.
- [8] N. Cotumaccio and N. Prezza, “On indexing and compressing finite automata,” in Proc. of the 32nd Symposium on Discrete Algorithms, (SODA’21). 2021, pp. 2585–2599, SIAM.
- [9] N. Cotumaccio, “Graphs can be succinctly indexed for pattern matching in time,” in 2022 Data Compression Conference (DCC), 2022, pp. 272–281.
- [10] J. Alanko, G. D’Agostino, A. Policriti, and N. Prezza, “Wheeler languages,” Information and Computation, vol. 281, pp. 104820, 2021.
- [11] D. Gusfield, Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology, Cambridge University Press, 1997.
- [12] W. I. Chang and E. L. Lawler, “Sublinear approximate string matching and biological applications,” Algorithmica, vol. 12, pp. 327–344, 1994.
- [13] P. Weiner, “Linear pattern matching algorithms,” in Proc. 14th IEEE Annual Symposium on Switching and Automata Theory, 1973, pp. 1–11.
- [14] U. Manber and G. Myers, “Suffix arrays: A new method for on-line string searches,” SIAM J. Comput., vol. 22, no. 5, pp. 935–948, 1993.
- [15] M. I. Abouelhoda, S. Kurtz, and E. Ohlebusch, “Replacing suffix trees with enhanced suffix arrays,” J. of Discrete Algorithms, vol. 2, no. 1, pp. 53–86, 2004.
- [16] K. Sadakane, “Compressed suffix trees with full functionality,” Theor. Comp. Sys., vol. 41, no. 4, pp. 589–607, 2007.
- [17] E. Ohlebusch, S. Gog, and A. Kügel, “Computing matching statistics and maximal exact matches on compressed full-text indexes,” in Proceedings of the 17th International Conference on String Processing and Information Retrieval (SPIRE’10), Berlin, Heidelberg, 2010, pp. 347–358, Springer-Verlag.
- [18] N. J. Fine and H. S. Wilf, “Uniqueness theorems for periodic functions,” Proc. Amer. Math. Soc., vol. 16, pp. 109–114, 1965.
- [19] S. Mantaci, A. Restivo, G. Rosone, and M. Sciortino, “An extension of the Burrows-Wheeler transform,” Theor. Comput. Sci., vol. 387, no. 3, pp. 298–312, 2007.
- [20] J. Fischer, “Optimal succinctness for range minimum queries,” in LATIN 2010: Theoretical Informatics, A. López-Ortiz, Ed., Berlin, Heidelberg, 2010, pp. 158–169, Springer Berlin Heidelberg.