GSSI, Italy and Dalhousie University, [email protected], https://orcid.org/0000-0002-1402-5298
Copyright: Nicola Cotumaccio
CCS concepts: Theory of computation → Graph algorithms analysis; Theory of computation → Pattern matching
Funding: This work was partially funded by Dante Labs.
Prefix Sorting DFAs: a Recursive Algorithm
Abstract
In the past thirty years, numerous algorithms for building the suffix array of a string have been proposed. In 2021, the notion of suffix array was extended from strings to DFAs, and it was shown that the resulting data structure can be built in $O(m^2 + n^{5/2})$ time, where $n$ is the number of states and $m$ is the number of edges [SODA 2021]. Recently, algorithms running in $O(mn)$ and $O(n^2 \log n)$ time have been described [CPM 2023].
In this paper, we improve the previous bounds by proposing an $O(n^2)$ recursive algorithm inspired by Farach’s algorithm for building a suffix tree [FOCS 1997]. To this end, we provide insight into the rich lexicographic and combinatorial structure of a graph, thus contributing to the fascinating journey which might lead to solving the long-standing open problem of building the suffix tree of a graph.
keywords:
Suffix Array, Burrows-Wheeler Transform, FM-index, Recursive Algorithms, Graph Theory, Pattern Matching.
1 Introduction
The suffix tree [30] of a string is a versatile data structure introduced by Weiner in 1973 which allows solving a myriad of combinatorial problems, such as determining whether a pattern occurs in the string, computing matching statistics, searching for regular expressions, computing the Lempel-Ziv decomposition of the string and finding palindromes. The book by Gusfield [17] devotes almost 150 pages to the applications of suffix trees, stressing the importance of these applications in bioinformatics. However, the massive increase of genomic data in the last decades requires space-efficient data structures able to efficiently support pattern matching queries, and the space consumption of suffix trees is too high. In 1990, Manber and Myers invented suffix arrays [24] as a space-efficient alternative to suffix trees. While suffix arrays do not have the full functionality of suffix trees, they still allow solving pattern matching queries. Suffix arrays started a new chapter in data compression, which culminated in the invention of data structures closely related to suffix arrays, notably, the Burrows-Wheeler Transform [4] and the FM-index [13, 15], which have heavily influenced sequence assembly [29].
The impact of suffix arrays has led to a big effort in the attempt of designing efficient algorithms to construct suffix arrays, where “efficient” refers to various metrics (worst-case running time, average running time, space, performance on real data and so on); see [27] for a comprehensive survey on the topic. Let us focus on worst-case running time. Manber and Myers build the suffix array of a string of length $n$ in $O(n \log n)$ time by means of a prefix-doubling algorithm [24]. In 1997, Farach proposed a recursive algorithm to build the suffix tree of a string in linear time for integer alphabets [11]. In the following years, the recursive paradigm of Farach’s algorithm was used to develop a multitude of linear-time algorithms for building the suffix array [22, 19, 20, 25]. All these algorithms carefully exploit the lexicographic structure of the suffixes of a string, recursively reducing the problem of computing the suffix array of a string to the problem of computing the suffix array of a smaller string (induced sorting).
The problem of solving pattern matching queries not only on strings, but also on labeled graphs, is an active topic of research. Recently, Equi et al. showed that no algorithm can solve pattern matching queries on arbitrary graphs in $O(m^{1-\epsilon} |P|)$ or $O(m |P|^{1-\epsilon})$ time (where $m$ is the number of edges, $P$ is the pattern and $\epsilon > 0$), unless the Orthogonal Vectors hypothesis fails [10, 9]. On the other hand, over the years the idea of (lexicographically) sorting the suffixes of a string has been generalized to graphs, thus leading to compact data structures that are able to support pattern matching queries on graphs. The mechanism behind the suffix array, the Burrows-Wheeler Transform and the FM-index was first generalized to trees [12, 14]; later on, it was generalized to de Bruijn graphs [3, 23] (which can be used for Eulerian sequence assembly [18]). Subsequently, it was extended to the so-called Wheeler graphs [16, 1], and finally to arbitrary graphs and automata [8, 6]. The idea of lexicographically sorting the strings reaching the states of an automaton also has deep theoretical consequences in automata theory: for example, it leads to a parametrization of the powerset construction, which implies fixed-parameter tractable algorithms for PSPACE-complete problems such as deciding the equivalence of two non-deterministic finite automata (NFAs) [7].
The case of deterministic finite automata (DFAs) is of particular interest, because in this case the notion of “suffix array” of a DFA has a simple interpretation in terms of strings. Assume that there is a fixed total order $\preceq$ on the alphabet $\Sigma$ of a DFA $\mathcal{A}$, and extend $\preceq$ lexicographically to the set of all infinite strings on $\Sigma$. Assume that each state has at least one incoming edge. If $u$ is a state, consider the set $I_u$ of all infinite strings that can be read starting from $u$ and following edges in a backward fashion, and let $\min_u$ and $\max_u$ be the lexicographically smallest (largest, respectively) string in $I_u$ (see Figure 1). Consider the partial order $\preceq$ on the set $Q$ of all states such that for every $u, v \in Q$, with $u \neq v$, it holds $u \prec v$ if and only if $\max_u \prec \min_v$. Then, the partial order $\preceq$ induces a (partial) permutation of the set of all states that plays the same role as the permutation of text positions induced by the suffix array of a string [8, 21], so $\preceq$ is essentially the “suffix array” of the DFA. If we are able to compute the partial order $\preceq$ (and a minimum-size partition of $Q$ into sets such that the restriction of $\preceq$ to each set is a total order), then we can efficiently and compactly solve pattern matching queries on the DFA by means of techniques that extend the Burrows-Wheeler transform and the FM-index from strings to graphs. As a consequence, we now have to solve the problem of efficiently building the “suffix array” of a DFA, that is, the problem of computing $\preceq$.
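To make the definition concrete, the following sketch (ours, not from [8, 21]; the names `min_prefix` and `lam` are illustrative) computes the first $k$ characters of $\min_u$ with a greedy backward walk: at each step it keeps exactly the set of nodes from which the lexicographically smallest string read so far can be continued.

```python
from typing import Dict, List, Set, Tuple

def min_prefix(V: List[int], E: Set[Tuple[int, int]], lam: Dict[int, str],
               u: int, k: int) -> str:
    """First k characters of min_u: the smallest infinite string that can be
    read from u by following edges in a backward fashion (lam[v] is the label
    of all edges entering v)."""
    preds: Dict[int, List[int]] = {v: [] for v in V}
    for a, b in E:
        preds[b].append(a)
    frontier, out = {u}, []
    for _ in range(k):
        # all nodes in the frontier carry the same label, by construction
        out.append(lam[next(iter(frontier))])
        candidates = {w for v in frontier for w in preds[v]}
        smallest = min(lam[w] for w in candidates)  # nonempty: every state has an in-edge
        frontier = {w for w in candidates if lam[w] == smallest}
    return "".join(out)

# Two states, lam = {0: 'a', 1: 'b'}, edges 0->0, 0->1, 1->0:
print(min_prefix([0, 1], {(0, 0), (0, 1), (1, 0)}, {0: "a", 1: "b"}, 1, 4))  # baaa
```

Replacing `min` with `max` yields the analogous prefix of $\max_u$.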
The first paper on the topic [8] presents an algorithm that builds $\preceq$ in $O(m^2 + n^{5/2})$ time, where $n$ is the number of states and $m$ is the number of edges. In a recent paper [21], Kim et al. describe two algorithms running in $O(mn)$ and $O(n^2 \log n)$ time. The key observation is that, if we build the min/max-partition of the set of all states, then we can determine the partial order $\preceq$ in linear time by means of a reduction to the interval partitioning problem. Determining the min/max-partition of the set of all states means picking the strings $\min_u$ and $\max_u$ for every state $u$, and sorting these strings (note that some of these strings may be equal, see Figure 1 for an example). As a consequence, we are only left with the problem of determining the min/max-partition efficiently.
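To illustrate the reduction (a sketch under our assumptions, not necessarily the exact construction of [21]): suppose each state $u$ is represented by the interval $[lo_u, hi_u]$ of the ranks of $\min_u$ and $\max_u$ in the sorted list of all minima/maxima, and that $u \prec v$ holds exactly when $hi_u < lo_v$. A minimum-size partition into totally ordered sets is then the classic interval partitioning problem, solvable greedily after sorting by left endpoint:

```python
import heapq
from typing import List, Tuple

def chain_partition(intervals: List[Tuple[int, int]]) -> List[List[int]]:
    """Partition intervals [lo, hi] into the minimum number of chains such
    that, within a chain, consecutive intervals satisfy hi < lo (so the
    partial order restricted to each chain is total)."""
    order = sorted(range(len(intervals)), key=lambda i: intervals[i])
    chains: List[List[int]] = []
    ends: List[Tuple[int, int]] = []          # heap of (hi of last interval, chain id)
    for i in order:
        lo, hi = intervals[i]
        if ends and ends[0][0] < lo:          # reuse a chain that ended before lo
            _, c = heapq.heappop(ends)
            chains[c].append(i)
        else:                                 # otherwise open a new chain
            chains.append([i])
            c = len(chains) - 1
        heapq.heappush(ends, (hi, c))
    return chains

print(chain_partition([(0, 2), (1, 3), (3, 4), (5, 6)]))  # [[0, 2], [1, 3]]
```

The greedy strategy is optimal because the number of chains it opens equals the maximum number of pairwise overlapping intervals, which is an obvious lower bound.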
The $O(n^2 \log n)$ algorithm builds the min/max-partition by generalizing Manber and Myers’s algorithm from strings to DFAs. However, since it is possible to build the suffix array of a string in $O(n)$ time, it is natural to wonder whether it is possible to determine the min/max-partition in $O(n^2)$ time.
In this paper, we show that, indeed, it is possible to build the min/max-partition in $O(n^2)$ time by adopting a recursive approach inspired by one of the linear-time algorithms that we have mentioned earlier, namely, Ko and Aluru’s algorithm [22]. As a consequence, our algorithm is asymptotically faster than all previous algorithms.
A long-standing open problem is whether it is possible to define a suffix tree of a graph. Some recent work [5] suggests that it is possible to define data structures that simulate the behavior of a suffix tree by carefully studying the lexicographic structure of the graph (the results in [5] only hold for Wheeler graphs, but we believe that they can be extended to arbitrary graphs). More specifically, it is reasonable to believe that it is possible to generalize the notion of compressed suffix tree of a string [28] to graphs. A compressed suffix tree is a compressed representation of a suffix tree which consists of some components, including a suffix array. We have already seen that $\preceq$ generalizes the suffix array to a graph structure, and [5] suggests that the remaining components may also be generalized to graphs. The components of the suffix tree of a string heavily exploit the lexicographic and combinatorial structure of the string. Since the algorithm that we present in this paper deeply relies on the richness of this lexicographic structure (which becomes even more challenging and surprising when switching from a string setting to a graph setting), we believe that our results also provide a solid conceptual contribution towards extending suffix tree functionality to graphs.
We remark that, a few days after we submitted this paper to arXiv, a new arXiv preprint showed how to determine the min/max-partition in $O(m \log n)$ time, where $n$ is the number of states and $m$ is the number of edges (this preprint was later also accepted for publication [2]). If the graph underlying the DFA is sparse, then the algorithm in [2] improves on our algorithm. Since that algorithm uses different techniques (it is obtained by adapting Paige and Tarjan’s partition refinement algorithm [26]), we are left with the intriguing open problem of determining whether, by possibly combining the ideas behind our algorithm and the algorithm in [2], it is possible to build the min/max-partition in $O(m)$ time.
Due to space constraints, all proofs can be found in the appendix.
2 Preliminaries
2.1 Relation with Previous Work
In the setting of the previous works on the topic [1, 8, 21], the problem that we want to solve is defined as follows. Consider a deterministic finite automaton (DFA) $\mathcal{A}$ such that (i) all edges entering the same state have the same label, (ii) each state is reachable from the initial state, (iii) each state is co-reachable, that is, it is either final or it allows reaching a final state, and (iv) the initial state has no incoming edges. Then, determine the min/max-partition of the set of states (see Section 2.2 for the formal definition of the min/max-partition, and see Figure 1 for an example).
-
•
Assumptions (ii) and (iii) are standard assumptions in automata theory, because all states that do not satisfy these assumptions can be removed without changing the accepted language.
-
•
In this setting, all the non-initial states have an incoming edge, but the initial state has no incoming edges. This implies that for some state $u$ it may hold $I_u = \emptyset$ (remember that $I_u$ is the set of all infinite strings that can be read starting from $u$ and following edges in a backward fashion, see the introduction), so Kim et al. [21] need to perform a tedious case analysis which also takes finite strings into account in order to define the min/max-partition (in particular, the minimum and maximum strings reaching the initial state are both equal to the empty string). However, we can easily avoid this complication by means of the same trick used in [5]: we can add a self-loop to the initial state, where the label of the self-loop is a special character $\#$ smaller than any character in the alphabet. Intuitively, $\#$ plays the same role as the termination character $\$$ in the Burrows-Wheeler transform of a string, and since $\#$ is the smallest character, adding this self-loop does not affect the min/max-partition (see [21] for details).
-
•
Notice that the initial state and the set of all final states play no role in the definition of the min/max-partition; this explains why, more generally, it will be expedient to consider deterministic graphs rather than DFAs (otherwise, we would need to artificially add an initial state and a set of final states whenever we recursively build a graph in our algorithm). Equivalently, one may assume to work with semiautomata in which the transition function is not necessarily total. This justifies the assumptions that we will make in Section 2.2.
-
•
Some recent papers [6, 7] have shown that assumptions (i) and (iv) can be removed. The partial order $\preceq$ is defined analogously, and all the algorithms for building $\preceq$ that we have mentioned still work. Indeed, if a state is reached by edges with distinct labels, we only need to consider the edges with the smallest label when computing $\min_u$ and the edges with the largest label when computing $\max_u$; moreover, we once again assume that the initial state has a self-loop labeled $\#$. The only difference is that assumption (i) implies $m \le n^2$ ($n$ being the number of states, $m$ being the number of edges) because each state can have at most $n$ incoming edges, but this is no longer true if we remove assumption (i). As a consequence, the running time of our algorithm is no longer $O(n^2)$ but $O(m + n^2)$ (and the running time of the algorithm in [21] becomes $O(m + n^2 \log n)$) because we still need to process all edges in the DFA.
To sum up, all the algorithms for computing $\preceq$ work on arbitrary DFAs.
2.2 Notation and First Definitions
Let $\Sigma$ be a finite alphabet. We consider finite, edge-labeled graphs $G = (V, E)$, where $V$ is the set of all nodes and $E \subseteq V \times V \times \Sigma$ is the set of all edges. Up to taking a subset of $\Sigma$, we assume that all $a \in \Sigma$ label some edge in the graph. We assume that all nodes have at least one incoming edge, and all edges entering the same node have the same label (input consistency). This implies that an edge $(u, v, a)$ can be simply denoted as $(u, v)$, because it must be $a = \lambda(v)$, where $\lambda(v)$ is the label of all edges entering $v$. In particular, it must be $e \le n^2$, where $n = |V|$ and $e = |E|$ (and so an $O(e)$ algorithm is also an $O(n^2)$ algorithm). If we do not know the $\lambda(v)$’s, we can easily compute them by scanning all edges. In addition, we always assume that $G$ is deterministic, that is, for every $u \in V$ and for every $a \in \Sigma$ there exists at most one $v \in V$ such that $(u, v) \in E$ and $\lambda(v) = a$.
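For concreteness, here is one possible in-memory representation of such a graph (ours; the class name and fields are not from the paper), which checks the standing assumptions at construction time:

```python
from collections import defaultdict
from typing import Dict, List, Tuple

class LabeledGraph:
    """A deterministic, input-consistent, edge-labeled graph: lam[v] is the
    label shared by all edges entering v, so an edge is stored as a pair (u, v)."""

    def __init__(self, n: int, lam: List[str], edges: List[Tuple[int, int]]) -> None:
        self.n, self.lam = n, lam
        self.preds: Dict[int, List[int]] = defaultdict(list)
        seen = set()
        for u, v in edges:
            # determinism: u has at most one successor per label
            assert (u, lam[v]) not in seen, "graph is not deterministic"
            seen.add((u, lam[v]))
            self.preds[v].append(u)
        # every node must have at least one incoming edge
        assert all(self.preds[v] for v in range(n)), "node without incoming edges"
```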
Let $\Sigma^*$ be the set of all finite strings on $\Sigma$, and let $\Sigma^\omega$ be the set of all (countably) right-infinite strings on $\Sigma$. If $\alpha \in \Sigma^* \cup \Sigma^\omega$ and $i \ge 1$, we denote by $\alpha_i$ the $i$-th character of $\alpha$ (that is, $\alpha = \alpha_1 \alpha_2 \alpha_3 \cdots$). If $i \le j$, we define $\alpha_{[i..j]} = \alpha_i \cdots \alpha_j$, and if $i > j$, then $\alpha_{[i..j]}$ is the empty string $\varepsilon$. If $\alpha \in \Sigma^*$, then $|\alpha|$ is the length of $\alpha$; for every $1 \le i \le |\alpha|$ the string $\alpha_{[1..i]}$ is a prefix of $\alpha$, and if $i < |\alpha|$ it is a strict prefix of $\alpha$; analogously, one defines suffixes and strict suffixes of $\alpha$. An occurrence of $\alpha \in \Sigma^*$ starting at $u$ and ending at $v$ is a sequence of nodes $u_1, \dots, u_{|\alpha|+1}$ of $G$ such that (i) $u_1 = u$, (ii) $u_{|\alpha|+1} = v$, (iii) $(u_{i+1}, u_i) \in E$ for every $1 \le i \le |\alpha|$ and (iv) $\lambda(u_i) = \alpha_i$ for every $1 \le i \le |\alpha|$. An occurrence of $\alpha \in \Sigma^\omega$ starting at $u$ is a sequence of nodes $(u_i)_{i \ge 1}$ of $G$ such that (i) $u_1 = u$, (ii) $(u_{i+1}, u_i) \in E$ for every $i \ge 1$ and (iii) $\lambda(u_i) = \alpha_i$ for every $i \ge 1$. Intuitively, a string has an occurrence starting at $u$ if we can read it on the graph starting from $u$ and following edges in a backward fashion.
In the paper, occurrences of strings in $\Sigma^\omega$ will play a key role, while occurrences of strings in $\Sigma^*$ will be used as a technical tool. For every $u \in V$, we denote by $I_u$ the set of all strings in $\Sigma^\omega$ admitting an occurrence starting at $u$. Since every node has at least one incoming edge, it holds $I_u \neq \emptyset$.
A total order $\le$ on a set $U$ is a reflexive, antisymmetric and transitive relation on $U$ in which every two elements are comparable. If $u, v \in U$, we write $u < v$ if $u \le v$ and $u \neq v$.
Let $\preceq$ be a fixed total order on $\Sigma$. We extend $\preceq$ to $\Sigma^\omega$ lexicographically. It is easy to show that every $I_u$ contains a lexicographically smallest string $\min_u$ and a lexicographically largest string $\max_u$ (for example, this follows from [21, Observation 8]).
We will often use the following immediate observation. Let $\alpha \in I_u$, and let $(u_i)_{i \ge 1}$ be an occurrence of $\alpha$. Fix $j \ge 1$. Then, $(u_{j+i-1})_{i \ge 1}$ is an occurrence of $\alpha_j \alpha_{j+1} \alpha_{j+2} \cdots$, and $\alpha_j \alpha_{j+1} \alpha_{j+2} \cdots \in I_{u_j}$.
Let $U \subseteq V$. Let $\{U_1, \dots, U_k\}$ be the unique partition of $U$ and let $\le$ be the unique total order on $\{U_1, \dots, U_k\}$ such that, for every $i, j \in \{1, \dots, k\}$ and for every $u \in U_i$ and $v \in U_j$, (i) if $i = j$, then $\min_u = \min_v$ and (ii) if $U_i < U_j$, then $\min_u \prec \min_v$. Then, we say that $(\{U_1, \dots, U_k\}, \le)$, or more simply $U_1 < \dots < U_k$, is the min-partition of $U$. The max-partition of $U$ is defined analogously. Now, consider the set $V' = V \times \{1, 2\}$, and define $\mu_{(u,1)} = \min_u$ and $\mu_{(u,2)} = \max_u$ for every $u \in V$. Let $\{U'_1, \dots, U'_{k'}\}$ be the unique partition of $V'$ and let $\le$ be the unique total order on $\{U'_1, \dots, U'_{k'}\}$ such that, for every $i, j \in \{1, \dots, k'\}$ and for every $x \in U'_i$ and $y \in U'_j$, (i) if $i = j$, then $\mu_x = \mu_y$ and (ii) if $U'_i < U'_j$, then $\mu_x \prec \mu_y$. Then, we say that $(\{U'_1, \dots, U'_{k'}\}, \le)$, or more simply $U'_1 < \dots < U'_{k'}$, is the min/max-partition of $V$.
The main result of this paper will be proving that the min/max-partition of $V$ can be determined in $O(n^2)$ time.
2.3 Our Approach
Let $G = (V, E)$ be a graph. We will first show how to build the min-partition of $V$ in $O(n^2)$ time, where $n = |V|$ (Section 4); then, we will show how the algorithm can be adapted so that it builds the min/max-partition in $O(n^2)$ time (Section 5).
In order to build the min-partition of $V$, we will first classify all minima into three categories (Section 3), so that we can split $V$ into three pairwise-disjoint sets $V_1$, $V_2$, $V_3$. Then, we will show that in $O(n^2)$ time:
-
•
we can compute $V_1$, $V_2$ and $V_3$ (Section 4.1);
-
•
we can define a graph $G'$ having fewer nodes (Section 4.2);
-
•
assuming that we have already determined the min-partition of the set of nodes of $G'$, we can determine the min-partition of $V$ (Section 4.3).
Analogously, in $O(n^2)$ time we can reduce the problem of determining the min-partition of $V$ to the problem of determining the min-partition of the set of all nodes of a complementary smaller graph (Section 4.4). As a consequence, since the smaller of the two reduced graphs has at most $\lceil n/2 \rceil$ nodes, we obtain a recursive algorithm whose running time is given by the recurrence:

$$T(n) = T(\lceil n/2 \rceil) + O(n^2),$$

and we conclude that the running time of our algorithm is $T(n) = O(n^2)$.
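As a sanity check (ours), unrolling the recurrence and ignoring ceilings gives a geometric series, with $c$ the hidden constant:

$$T(n) \le c\,n^2 + c\left(\frac{n}{2}\right)^2 + c\left(\frac{n}{4}\right)^2 + \dots \le c\,n^2 \sum_{i \ge 0} 4^{-i} = \frac{4}{3}\,c\,n^2 = O(n^2).$$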
3 Classifying Strings
In [22], Ko and Aluru divide the suffixes of a string into two groups. Here we follow an approach purely based on stringology, without fixing a string or a graph from the start: we divide the strings of $\Sigma^\omega$ into three groups, which we call group 1, group 2 and group 3 (Corollary 3.3 provides the intuition behind this choice).
Definition 3.1.
Let $\alpha \in \Sigma^\omega$. Let $a \in \Sigma$ and $\beta \in \Sigma^\omega$ be such that $\alpha = a\beta$. Then, we define the group of $\alpha$ as follows:
-
1.
$\alpha$ is in group 1 if $\alpha \prec \beta$.
-
2.
$\alpha$ is in group 2 if $\alpha = \beta$.
-
3.
$\alpha$ is in group 3 if $\beta \prec \alpha$.
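For intuition, the group of an eventually periodic string $\alpha = \mathtt{pre} \cdot \mathtt{cyc}^\omega$ can be computed by scanning adjacent characters, since the comparison between $\alpha$ and $\beta$ is decided at the first position where two adjacent characters of $\alpha$ differ (the characterization below makes this precise). A minimal sketch of ours, not part of the paper:

```python
def char_at(pre: str, cyc: str, i: int) -> str:
    """i-th character (0-based) of the infinite string pre + cyc cyc cyc ..."""
    return pre[i] if i < len(pre) else cyc[(i - len(pre)) % len(cyc)]

def group(pre: str, cyc: str) -> int:
    """Group of alpha = pre + cyc^w, written alpha = a*beta: 1 if alpha < beta,
    2 if alpha == beta (alpha is constant), 3 if beta < alpha."""
    assert cyc, "the period must be nonempty"
    for i in range(len(pre) + len(cyc) + 1):  # a difference, if any, shows up here
        a, b = char_at(pre, cyc, i), char_at(pre, cyc, i + 1)
        if a != b:
            return 1 if a < b else 3
    return 2

print(group("a", "b"))  # 'abbb...' is in group 1
print(group("", "a"))   # 'aaaa...' is in group 2
print(group("b", "a"))  # 'baaa...' is in group 3
```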
We will constantly use the following characterization.
Lemma 3.2.
Let $\alpha \in \Sigma^\omega$. Let $a \in \Sigma$ and $\beta \in \Sigma^\omega$ be such that $\alpha = a\beta$. Then:
-
1.
$\alpha$ is in group 2 if and only if $\alpha = \beta$, if and only if $\alpha = a^\omega$.
-
2.
$\alpha$ is in group 1 or in group 3 if and only if $\alpha \neq \beta$, if and only if $\alpha \neq a^\omega$.
Assume that $\alpha \neq a^\omega$. Then, there exist unique $k \ge 1$, $c \in \Sigma \setminus \{a\}$ and $\delta \in \Sigma^\omega$ such that $\alpha = a^k c \delta$ (and so $\beta = a^{k-1} c \delta$). Moreover:
-
1.
$\alpha$ is in group 1 if and only if $\alpha \prec \beta$, if and only if $a \prec c$, if and only if $a^\omega \prec \alpha$.
-
2.
$\alpha$ is in group 3 if and only if $\beta \prec \alpha$, if and only if $c \prec a$, if and only if $\alpha \prec a^\omega$.
The following corollary will be a key ingredient in our recursive approach.
Corollary 3.3.
Let $a \in \Sigma$. Let $\alpha, \alpha' \in \Sigma^\omega$ be such that $\alpha_1 = a$ and $\alpha'_1 = a$. Then:
-
1.
If $\alpha$ is in group 3 and $\alpha'$ is in group 2, then $\alpha \prec \alpha'$.
-
2.
If $\alpha$ is in group 2 or in group 3 and $\alpha'$ is in group 1, then $\alpha \prec \alpha'$. Equivalently, if $\alpha$ is in group 1 and $\alpha'$ is in group 2 or in group 3, then $\alpha' \prec \alpha$.
4 Computing the min-partition
Let $G = (V, E)$ be a graph. We will prove that we can compute the min-partition of $V$ in $O(n^2)$ time. In this section, for every we define (see Figure 2).
Let , and let be an occurrence of starting at . It is immediate to realize that (i) if , then , (ii) if , then for every and (iii) if , then .
As a first step, let us prove that without loss of generality we can remove some edges from without affecting the min/max-partition. This preprocessing will be helpful in Lemma 4.20.
Definition 4.1.
Let be a graph. We say that is trimmed if it contains no edge such that and .
In order to simplify the readability of our proofs, we will not directly remove some edges from , but we will first build a copy of where every node is mapped to a node , and then we will trim the graph. In this way, when we write and it will always be clear whether we refer to the original graph or the trimmed graph. We will use the same convention in Section 4.2 when we define the graph that we will use for the recursive step.
Lemma 4.2.
Let be a graph. Then, in time we can build a trimmed graph , with , such that for every it holds . In particular, for every .
4.1 Classifying Minima
Let us first show how to compute all such that .
Lemma 4.3.
Let be a graph, and let .
-
1.
If and , then .
-
2.
If , and , then .
Corollary 4.4.
Let be a graph, and let . Then, if and only if there exist and such that (i) for every , (ii) , (iii) and (iv) .
Corollary 4.4 yields an algorithm to decide whether is such that .
Corollary 4.5.
Let be a graph. We can determine all such that in time .
Now, let us show how to determine all such that . We can assume that we have already determined all such that .
Lemma 4.6.
Let be a graph, and let such that . Then, we have if and only if there exist and such that (i) for every , (ii) , (iii) for some and (iv) .
In particular, such must satisfy for every .
Corollary 4.7.
Let be a graph. We can determine all such that in time .
Corollary 4.8.
Let be a graph. Then, in time we can compute for every .
4.2 Recursive Step
Let us sketch the general idea to build a smaller graph for the recursive step. We consider each such that , and we follow edges in a backward fashion, aiming to determine a prefix of . As a consequence, we discard edges through which no occurrence of can go, and by Corollary 3.3 we can restrict our attention to the nodes such that is minimal. We proceed like this until we encounter nodes such that .
Let us formalize our intuition. We will first present some properties that the occurrences of a string must satisfy.
Lemma 4.9.
Let be a graph. Let be such that . Let be an occurrence of and let be an occurrence of . Then:
-
1.
for every .
-
2.
for every .
-
3.
for every .
In particular, the previous results hold if and and are two distinct occurrences of .
Lemma 4.10.
Let be a graph. Let and let be an occurrence of starting at . Let be such that . Then, are pairwise distinct. In particular, .
The previous results allow us to give the following definition.
Definition 4.11.
Let be a graph. Let such that . Define to be the smallest integer such that , where is an occurrence of starting at .
Note that is well-defined, because (i) it cannot hold for every by Lemma 4.10 (indeed, if , then is an occurrence of starting at , and by Lemma 4.10 there exists such that ) and (ii) does not depend on the choice of by Lemma 4.9. In particular, it must be because are pairwise distinct ( is distinct from because and by the minimality of ).
Lemma 4.12.
Let be a graph. Let such that . Then, for every . In particular, if , then .
If is a nonempty set of nodes such that for every it holds , we define . If is a nonempty set of nodes such that for every it holds , we define .
Let be a nonempty set of nodes. Let , where . Notice that is nonempty, and both and are well-defined. In other words, is obtained by first considering the subset of all nodes such that is as small as possible, and then considering the subset of all nodes such that is as small as possible. This is consistent with our intuition on how we should be looking for a prefix of .
Define:
Notice that we also require that a node in has not been encountered before. Intuitively, this does not affect our search for a prefix of : if we met the same node twice, we would have a cycle in which all edges are equally labeled (because by Lemma 4.12 labels can only decrease), and no occurrence of the minimum can go through such a cycle, because removing the cycle from the occurrence would yield a smaller string by Lemma 3.2.
The following technical lemma is crucial to prove that our intuition is correct.
Lemma 4.13.
Let be a graph. Let such that .
-
1.
is well-defined and nonempty for every .
-
2.
Let be an occurrence of starting at . Then, for every . In particular, and for every .
-
3.
For every and for every there exists an occurrence of starting at and ending at .
Let such that . We define:
-
•
;
-
•
Now, in order to define the smaller graph for the recursive step, we also need a new alphabet , which must be defined consistently with the mutual ordering of the minima. The next lemma yields all the information that we need.
Lemma 4.14.
Let be a graph. Let such that . Assume that one of the following statements is true:
-
1.
is not a prefix of and .
-
2.
, and .
-
3.
is a strict prefix of .
Then, .
Equivalently, if , then one of the following is true: (i) is not a prefix of and ; (ii) and ; (iii) is a strict prefix of .
Now, let , and let be the total order on such that for every distinct , it holds if and only if one of the following is true:
-
1.
is not a prefix of and .
-
2.
, and .
-
3.
is a strict prefix of .
It is immediate to verify that is a total order: indeed, is obtained (i) by first comparing the ’s using the variant of the (total) lexicographic order on in which a string is smaller than every strict prefix of it and (ii) if the ’s are equal, by comparing the ’s, which are elements of .
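This variant order can be implemented with a plain comparator (our sketch): usual lexicographic comparison, except that when one string is a strict prefix of the other, the longer string wins. Equivalently, one may pad every string with a sentinel larger than all characters, which is exactly the trick used in the proof of Lemma 4.20.

```python
from functools import cmp_to_key

def cmp_prefix_last(x: str, y: str) -> int:
    """Lexicographic order, except that a string is smaller than every
    strict prefix of it (so 'abc' < 'ab' < 'a')."""
    if x == y:
        return 0
    if x.startswith(y):        # y is a strict prefix of x: the longer string wins
        return -1
    if y.startswith(x):
        return 1
    return -1 if x < y else 1

print(sorted(["ab", "a", "abc", "b"], key=cmp_to_key(cmp_prefix_last)))
# ['abc', 'ab', 'a', 'b']
```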
Starting from , we define a new graph as follows:
-
•
.
-
•
The new totally-ordered alphabet is .
-
•
For every , we define .
-
•
.
Note that for every such that and for every it holds , so and . Moreover, satisfies all the assumptions about graphs that we use in this paper: (i) all edges entering the same node have the same label (by definition), (ii) every node has at least one incoming edge (because if , then by Lemma 4.13) and (iii) is deterministic (because if and , then and , so by the definition of if we immediately obtain , and if we obtain ; since by Lemma 4.13 there exist two occurrences of starting at and and both ending at , the determinism of implies and so ).
Notice that if is such that , then contains exactly one string, namely, ; in particular, .
When we implement and , we use integer alphabets and ; in particular, we will not store by means of pairs ’s, but we will remap to an integer alphabet consistently with the total order on , so that the mutual order of the ’s is not affected.
Let us prove that we can use for the recursive step. We will start with some preliminary results.
Lemma 4.15.
Let be a graph. Let be such that , and . Then, .
Lemma 4.16.
Let be a graph. Let , and let be an occurrence of starting at . Then, exactly one of the following holds true:
-
1.
There exists such that for every and for every .
-
2.
for every , and both and are true for infinitely many ’s.
Crucially, the next lemma establishes a correspondence between minima of nodes in and minima of nodes in .
Lemma 4.17.
Let be a graph. Let such that . Let be an occurrence of starting at . Let be the infinite sequence of nodes in obtained as follows. Consider , and for every , let be the smallest element of , if it exists. For every such that is defined, let , and if is such that is not defined (so is a finite set), let . Then, is an occurrence of starting at in .
The following theorem shows that our reduction to is correct.
Theorem 4.18.
Let be a graph. Let be such that .
-
1.
If , then .
-
2.
If , then .
Since is a total order (so exactly one among , and holds true), from Theorem 4.18 we immediately obtain the following result.
Corollary 4.19.
Let be a graph. Let be such that .
-
1.
It holds if and only if .
-
2.
It holds if and only if .
In particular, if we have the min-partition of (with respect to ), then we also have the min-partition of (with respect to ).
Lastly, we show that our reduction to can be computed within time.
Lemma 4.20.
Let be a trimmed graph. Then, we can build in time.
4.3 Merging
We want to determine the min-partition of , assuming that we already have the min-partition of .
First, note that we can easily build the min-partition of . Indeed, if , then by Lemma 3.2. As a consequence, if , then (i) if and only if and (ii) if and only if , so we can build in time by using counting sort.
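As an illustration, the counting sort just mentioned can be sketched as follows (ours; `sigma`, which maps every character to its rank in the ordered alphabet, is a hypothetical input). Since these minima are constant strings, ordering and grouping them only requires bucketing the nodes by their single label:

```python
from typing import Dict, List

def min_partition_by_label(nodes: List[int], lam: Dict[int, str],
                           sigma: Dict[str, int]) -> List[List[int]]:
    """One-pass counting sort: two nodes are in the same class iff they have
    the same label, and classes are listed in increasing label order."""
    buckets: List[List[int]] = [[] for _ in range(len(sigma))]
    for u in nodes:
        buckets[sigma[lam[u]]].append(u)
    return [b for b in buckets if b]
```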
For every and , let . Consider : (i) if , then and (ii) if and , then by Corollary 3.3. As a consequence, in order to build , we only have to build the min-partition of , for every and every .
A possible way to implement each is by means of an array storing the elements of , where we also use a special character to delimit the border between consecutive elements of .
It is immediate to build incrementally for every , from its smallest element to its largest element. At the beginning, is empty for every . Then, scan the elements in from smallest to largest, and add to , where for any (the definition of does not depend on the choice of ). We scan only once, so this step takes time. Analogously, we can build for every by using .
We are only left with showing how to build for every . At the beginning, each is empty, and we will build each from its smallest element to its largest element. During this step of the algorithm, we will gradually mark the nodes such that . At the beginning of the step, no such node is marked, and at the end of the step all these nodes will be marked. Let , with . Notice that it must be , because if there existed , then it would be by Lemma 3.2 and so would not be the smallest character in . Now, consider ; we have already fully computed . Process each in from smallest to largest, and for every compute the set of all non-marked nodes such that , , and for some . Then, if add to and mark the nodes in . After processing the elements in , we process the element in , , , , and so on, in this order. Each is processed from its (current) smallest element to its (current) largest element. We never remove or modify elements in any , but we only add elements to the ’s. More precisely, when we process in , for every we compute the set of all non-marked nodes such that , , and for some and, if , then we add to and we mark the nodes in .
The following lemma shows that our approach is correct. Let us give some intuition. A prefix of a min-partition is a subset of such that, if , and , then . Notice that every prefix of is obtained by taking the union of , , , , , , in this order up to some element , where possibly we only pick a prefix of the last element . Then, we will show that, when we process in , we have already built the prefix of whose largest element is . This means that, for every and for any occurrence of starting at , it must hold that is in .
Lemma 4.21.
Let be a graph. If we know the min-partition of , then we can build the min-partition of in time.
4.4 The Complementary Case
We have shown that in $O(n^2)$ time we can reduce the problem of determining the min-partition of $V$ to the problem of determining the min-partition of the set of all nodes of a smaller graph. Now, we must show that (similarly) in $O(n^2)$ time we can reduce the problem of determining the min-partition of $V$ to the problem of determining the min-partition of the set of all nodes of the complementary smaller graph. This time, building the smaller graph will be simpler, because when searching for prefixes of the minima we will never meet nodes that we have already met (so the definition of the ’s is simpler and we do not need to trim the graph to stay within the bound). On the other hand, the merging step will be more complex, because the order in which we will process the ’s will be from largest to smallest (, , , , , and so on), so we will need to update some elements of some ’s to include the information about minima that we may infer at a later stage of the algorithm. We provide the details in the appendix.
5 Computing the min/max-partition
Let be a graph. We can build the max-partition of by simply considering the transpose total order of (the one for which if and only if ) and building the min-partition. As a consequence, the algorithm to build the max-partition is entirely symmetrical to the algorithm to build the min-partition.
Let be a graph. Let us show how we can build the min/max-partition of in time. Assume that we have two graphs and on the same alphabet , with (we allow and to possibly be the null graph, that is, the graph without vertices). Let , , , and for every define if , and if . Let be the unique partition of and let be the unique total order on such that, for every and for every and , (i) if , then and (ii) if , then . Then, we say that , or more simply , is the min/max-partition of . We will show that we can compute the min/max partition of in time. In particular, if and are two (distinct) copies of the same graph , then we can compute the min/max-partition of in time.
We compute for every and we compute for every . If the number of values equal to is smaller than the number of values equal to 1, then (in time ) we build the graphs and as defined before, where and , otherwise we consider the complementary case (which is symmetrical). When building and , we define a unique alphabet obtained by jointly sorting the ’s and the ’s, which is possible because Lemma 4.14 also applies to maxima. Note that .
Assume that we have recursively obtained the min/max-partition of with respect to and . This yields the min/max-partition of . Then, we can build the min/max-partition of by jointly applying the merging step, which is possible because both the merging step for minima and the merging step for maxima require building the ’s by processing , , , , and so on, in this order.
Since we obtain the same recursion as before, we conclude that we can compute the min/max partition of in time.
References
- [1] Jarno Alanko, Giovanna D’Agostino, Alberto Policriti, and Nicola Prezza. Regular languages meet prefix sorting. In Shuchi Chawla, editor, Proc. of the 31st Symposium on Discrete Algorithms, (SODA’20), pages 911–930. SIAM, 2020. doi:10.1137/1.9781611975994.55.
- [2] Ruben Becker, Manuel Cáceres, Davide Cenzato, Sung-Hwan Kim, Bojana Kodric, Francisco Olivares, and Nicola Prezza. Sorting Finite Automata via Partition Refinement. In Inge Li Gørtz, Martin Farach-Colton, Simon J. Puglisi, and Grzegorz Herman, editors, 31st Annual European Symposium on Algorithms (ESA 2023), volume 274 of Leibniz International Proceedings in Informatics (LIPIcs), pages 15:1–15:15, Dagstuhl, Germany, 2023. Schloss Dagstuhl – Leibniz-Zentrum für Informatik. URL: https://drops.dagstuhl.de/opus/volltexte/2023/18668, doi:10.4230/LIPIcs.ESA.2023.15.
- [3] Alexander Bowe, Taku Onodera, Kunihiko Sadakane, and Tetsuo Shibuya. Succinct de Bruijn graphs. In Ben Raphael and Jijun Tang, editors, Algorithms in Bioinformatics, pages 225–235, Berlin, Heidelberg, 2012. Springer Berlin Heidelberg.
- [4] M. Burrows and D. J. Wheeler. A block-sorting lossless data compression algorithm. Technical report, 1994.
- [5] Alessio Conte, Nicola Cotumaccio, Travis Gagie, Giovanni Manzini, Nicola Prezza, and Marinella Sciortino. Computing matching statistics on Wheeler DFAs. In 2023 Data Compression Conference (DCC), pages 150–159, 2023. doi:10.1109/DCC55655.2023.00023.
- [6] Nicola Cotumaccio. Graphs can be succinctly indexed for pattern matching in time. In 2022 Data Compression Conference (DCC), pages 272–281, 2022. doi:10.1109/DCC52660.2022.00035.
- [7] Nicola Cotumaccio, Giovanna D’Agostino, Alberto Policriti, and Nicola Prezza. Co-lexicographically ordering automata and regular languages - part i. J. ACM, 70(4), aug 2023. doi:10.1145/3607471.
- [8] Nicola Cotumaccio and Nicola Prezza. On indexing and compressing finite automata. In Dániel Marx, editor, Proc. of the 32nd Symposium on Discrete Algorithms, (SODA’21), pages 2585–2599. SIAM, 2021. doi:10.1137/1.9781611976465.153.
- [9] Massimo Equi, Veli Mäkinen, and Alexandru I. Tomescu. Graphs cannot be indexed in polynomial time for sub-quadratic time string matching, unless SETH fails. In Tomáš Bureš, Riccardo Dondi, Johann Gamper, Giovanna Guerrini, Tomasz Jurdziński, Claus Pahl, Florian Sikora, and Prudence W.H. Wong, editors, SOFSEM 2021: Theory and Practice of Computer Science, pages 608–622, Cham, 2021. Springer International Publishing.
- [10] Massimo Equi, Veli Mäkinen, Alexandru I. Tomescu, and Roberto Grossi. On the complexity of string matching for graphs. ACM Trans. Algorithms, 19(3), apr 2023. doi:10.1145/3588334.
- [11] M. Farach. Optimal suffix tree construction with large alphabets. In Proceedings 38th Annual Symposium on Foundations of Computer Science, pages 137–143, 1997. doi:10.1109/SFCS.1997.646102.
- [12] P. Ferragina, F. Luccio, G. Manzini, and S. Muthukrishnan. Structuring labeled trees for optimal succinctness, and beyond. In proc. 46th Annual IEEE Symposium on Foundations of Computer Science (FOCS’05), pages 184–193, 2005. doi:10.1109/SFCS.2005.69.
- [13] P. Ferragina and G. Manzini. Opportunistic data structures with applications. In Proc. 41st Annual Symposium on Foundations of Computer Science (FOCS’00), pages 390–398, 2000. doi:10.1109/SFCS.2000.892127.
- [14] Paolo Ferragina, Fabrizio Luccio, Giovanni Manzini, and S. Muthukrishnan. Compressing and indexing labeled trees, with applications. J. ACM, 57(1), nov 2009. doi:10.1145/1613676.1613680.
- [15] Paolo Ferragina and Giovanni Manzini. Indexing compressed text. J. ACM, 52(4):552–581, jul 2005. doi:10.1145/1082036.1082039.
- [16] Travis Gagie, Giovanni Manzini, and Jouni Sirén. Wheeler graphs: A framework for BWT-based data structures. Theoretical Computer Science, 698:67 – 78, 2017. Algorithms, Strings and Theoretical Approaches in the Big Data Era (In Honor of the 60th Birthday of Professor Raffaele Giancarlo). doi:10.1016/j.tcs.2017.06.016.
- [17] Dan Gusfield. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, 1997. doi:10.1017/CBO9780511574931.
- [18] Ramana M. Idury and Michael S. Waterman. A new algorithm for DNA sequence assembly. Journal of computational biology : a journal of computational molecular cell biology, 2 2:291–306, 1995.
- [19] Juha Kärkkäinen, Peter Sanders, and Stefan Burkhardt. Linear work suffix array construction. J. ACM, 53(6):918–936, nov 2006. doi:10.1145/1217856.1217858.
- [20] Dong Kyue Kim, Jeong Seop Sim, Heejin Park, and Kunsoo Park. Constructing suffix arrays in linear time. Journal of Discrete Algorithms, 3(2):126–142, 2005. Combinatorial Pattern Matching (CPM) Special Issue. URL: https://www.sciencedirect.com/science/article/pii/S1570866704000486, doi:10.1016/j.jda.2004.08.019.
- [21] Sung-Hwan Kim, Francisco Olivares, and Nicola Prezza. Faster prefix-sorting algorithms for deterministic finite automata. In Laurent Bulteau and Zsuzsanna Lipták, editors, 34th Annual Symposium on Combinatorial Pattern Matching, CPM 2023, June 26-28, 2023, Marne-la-Vallée, France, volume 259 of LIPIcs, pages 16:1–16:16. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2023. doi:10.4230/LIPIcs.CPM.2023.16.
- [22] Pang Ko and Srinivas Aluru. Space efficient linear time construction of suffix arrays. Journal of Discrete Algorithms, 3(2):143–156, 2005. Combinatorial Pattern Matching (CPM) Special Issue. URL: https://www.sciencedirect.com/science/article/pii/S1570866704000498, doi:10.1016/j.jda.2004.08.002.
- [23] Veli Mäkinen, Niko Välimäki, and Jouni Sirén. Indexing graphs for path queries with applications in genome research. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 11:375–388, 2014.
- [24] U. Manber and G. Myers. Suffix arrays: A new method for on-line string searches. SIAM J. Comput., 22(5):935–948, 1993. doi:10.1137/0222058.
- [25] Joong Chae Na. Linear-time construction of compressed suffix arrays using o(n log n)-bit working space for large alphabets. In Alberto Apostolico, Maxime Crochemore, and Kunsoo Park, editors, Combinatorial Pattern Matching, pages 57–67, Berlin, Heidelberg, 2005. Springer Berlin Heidelberg.
- [26] Robert Paige and Robert E. Tarjan. Three partition refinement algorithms. SIAM Journal on Computing, 16(6):973–989, 1987. arXiv:https://doi.org/10.1137/0216062, doi:10.1137/0216062.
- [27] Simon J. Puglisi, W. F. Smyth, and Andrew H. Turpin. A taxonomy of suffix array construction algorithms. ACM Comput. Surv., 39(2):4–es, jul 2007. doi:10.1145/1242471.1242472.
- [28] Kunihiko Sadakane. Compressed suffix trees with full functionality. Theory Comput. Syst., 41(4):589–607, 2007. doi:10.1007/s00224-006-1198-x.
- [29] Jared T. Simpson and Richard Durbin. Efficient construction of an assembly string graph using the FM-index. Bioinformatics, 26(12):i367–i373, 06 2010. doi:10.1093/bioinformatics/btq217.
- [30] P. Weiner. Linear pattern matching algorithms. In Proc. 14th IEEE Annual Symposium on Switching and Automata Theory, pages 1–11, 1973. doi:10.1109/SWAT.1973.13.
Appendix A Proofs from Section 3
Statement of Lemma 3.2. Let $\alpha \in \Sigma^\omega$. Let $a \in \Sigma$ and $\beta \in \Sigma^\omega$ be such that $\alpha = a\beta$. Then:
-
1.
$\alpha$ is in group 2 if and only if $\alpha = \beta$, if and only if $\alpha = a^\omega$.
-
2.
$\alpha$ is in group 1 or in group 3 if and only if $\alpha \neq \beta$, if and only if $\alpha \neq a^\omega$.
Assume that $\alpha \neq a^\omega$. Then, there exist unique $k \ge 1$, $c \in \Sigma \setminus \{a\}$ and $\delta \in \Sigma^\omega$ such that $\alpha = a^k c \delta$ (and so $\beta = a^{k-1} c \delta$). Moreover:
-
1.
$\alpha$ is in group 1 if and only if $\alpha \prec \beta$, if and only if $a \prec c$, if and only if $a^\omega \prec \alpha$.
-
2.
$\alpha$ is in group 3 if and only if $\beta \prec \alpha$, if and only if $c \prec a$, if and only if $\alpha \prec a^\omega$.
Proof A.1.
-
1.
Let us prove that if and only if .
If , we have , so .
Assume that and so . In order to prove that , it will suffice to show that for every it holds . We proceed by induction on . Notice that , which proves our claim for . Now, assume that . By the inductive hypothesis, we can assume . Hence, .
Let us prove that if and only if . We have if and only if , if and only if .
-
2.
It follows by negating the first point.
Now assume that . Since , then there exist unique , and such that . Moreover:
-
1.
We have if and only if , if and only if , if and only if , if and only if , if and only if , if and only if , if and only if , if and only if , if and only if , if and only if .
-
2.
It follows by negating the previous points.
Statement of Corollary 3.3. Let $a \in \Sigma$. Let $\alpha, \alpha' \in \Sigma^\omega$ be such that $\alpha_1 = a$ and $\alpha'_1 = a$. Then:
-
1.
If $\alpha$ is in group 3 and $\alpha'$ is in group 2, then $\alpha \prec \alpha'$.
-
2.
If $\alpha$ is in group 2 or in group 3 and $\alpha'$ is in group 1, then $\alpha \prec \alpha'$. Equivalently, if $\alpha$ is in group 1 and $\alpha'$ is in group 2 or in group 3, then $\alpha' \prec \alpha$.
Proof A.2.
Appendix B Proofs from Section 4
Statement of Lemma 4.2. Let be a graph. Then, in time we can build a trimmed graph , with , such that for every it holds . In particular, for every .
Proof B.1.
Define for every , , and . Essentially, we define by removing from all edges that violate the definition of trimmed graph. It is easy to realize that still satisfies the assumptions on graphs that we make in this paper.
-
1.
All edges entering the same node in have the same label by construction.
-
2.
Every has an incoming edge in . Indeed, if , then, if is any node such that , then and so . If , then there exists such that and , so and .
-
3.
is deterministic because it is essentially a subgraph of a deterministic graph. More precisely, assume that are such that . We must prove that , or equivalently, . From we obtain , so from we obtain because is deterministic.
Now, let us prove that for every it holds . To this end we have to prove that (i) and (ii) if , then .
-
1.
Let us prove that . To this end, it will suffice to prove that, if is an occurrence of starting at , then is an occurrence of starting at . For every , we have . We are only left with showing that for every . We know that , so we only have to prove that . Assume that . We must prove that . Since is an occurrence of starting at , this is equivalent to proving that , which indeed follows from .
-
2.
Let us prove that if , then (which implies ). Let be an occurrence of starting at . It will suffice to prove that is an occurrence of starting at . For every we have because , and , so the conclusion follows.
In particular, for every , and is a trimmed graph.
B.1 Proofs from Section 4.1
Statement of Lemma 4.3. Let be a graph, and let .
-
1.
If and , then .
-
2.
If , and , then .
Proof B.2.
Statement of Corollary 4.4. Let be a graph, and let . Then, if and only if there exist and such that (i) for every , (ii) , (iii) and (iv) .
Proof B.3.
Let and be such that (i) for every , (ii) , (iii) and (iv) . We must prove that . Let us prove by induction that for every it holds (and in particular ). If , then from and we obtain by Lemma 4.3. Now, assume that . By the inductive hypothesis, we have . Since and , then by Lemma 4.3 we conclude .
Assume that . Then, by Lemma 3.2 there exist , and such that and . Let be an occurrence of starting at . Then, (i) , (ii) for every , (iii) for and (iv) . As a consequence, if and for every , then (i) for every , (ii) , (iii) and (iv) because .
Statement of Corollary 4.5. Let be a graph. We can determine all such that in time .
Proof B.4.
In our algorithm, we will mark exactly all nodes such that by means of a graph traversal. We will use an (initially empty) queue. Initially, no node is marked. First, process all edges and, if and has not been marked before, then mark and add to the queue. This step takes time. Next, recursively pick an element from the queue, and consider the unique such that and (if it exists). If has not been marked before, then mark and add to the queue. This step also takes time because each edge is considered at most once. At the end of the algorithm, by Corollary 4.4 a node has been marked if and only if .
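The traversal just described can be phrased as follows (a sketch of ours; the stripped seeding condition is an assumption, namely that a node $v$ is seeded when some in-neighbour carries a strictly smaller label, after which marks propagate along the unique equally-labelled forward edge):

```python
from collections import deque
from typing import Dict, List, Tuple

def mark_nodes(n: int, edges: List[Tuple[int, int]], lam: List[str]) -> List[bool]:
    """Queue-based marking in O(n + m) time: every edge is processed once."""
    succ_same: Dict[int, int] = {}       # unique successor of u labelled lam[u]
    marked, queue = [False] * n, deque()
    for u, v in edges:
        if lam[v] == lam[u]:
            succ_same[u] = v             # unique by determinism
        elif lam[u] < lam[v] and not marked[v]:
            marked[v] = True             # assumed seeding condition
            queue.append(v)
    while queue:
        u = queue.popleft()
        v = succ_same.get(u)
        if v is not None and not marked[v]:
            marked[v] = True
            queue.append(v)
    return marked
```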
Statement of Lemma 4.6. Let be a graph, and let such that . Then, we have if and only if there exist and such that (i) for every , (ii) , (iii) for some and (iv) .
In particular, such must satisfy for every .
Proof B.5.
Let and such that (i) for every , (ii) , (iii) for some and (iv) . We must prove that . Notice that . Similarly, we have . Since , then by induction we obtain for every , and so . As a consequence, , and so by Lemma 3.2. Since , we conclude .
Assume that . In particular, , so there exists an occurrence of starting at . This means that (i) for every , (ii) and for every . Since is finite, there exist such that . Then, and have the desired properties.
Now, let us prove that, if there exist with the desired properties, then for every . Notice that it must be for every , because otherwise by Corollary 4.4 we would conclude . As a consequence, it must be for every by the characterization that we have just proved.
Statement of Corollary 4.7. Let be a graph. We can determine all such that in time .
Proof B.6.
By Corollary 4.5, we can assume that we have already computed all such that . In order to compute all such that , we explore by using a breadth-first search algorithm, with the following modifications: (i) we only start from nodes such that , (ii) we follow edges in a backward fashion, not in a forward fashion, and (iii) if we start from , we explore the graph by only following edges labeled (that is, we first consider only all such that and , then we repeat this step in BFS fashion). Note that during the search we can never encounter a node such that , otherwise by Corollary 4.4 we would obtain . During the search, we will infer the value for every node that we encounter. Assume that we start the exploration from node . If at some point during the exploration we encounter a node for which we have already concluded that , then we do not consider the edges entering and we backtrack (because all the ’s in the characterization of Lemma 4.6 must satisfy ). If at some point during the exploration we encounter a node for which we have already concluded that , then we backtrack straight to and, by Lemma 4.6, we can also conclude that all that we encounter during the backtracking (including ) are such that . If during the exploration of at some point we encounter the same node twice, then we backtrack straight to and, by Lemma 4.6, we can also conclude that all that we encounter during the backtracking (including ) are such that . If during the exploration of we never encounter a node for which we have already determined and we never encounter the same node twice (and so at some point we get stuck), by Lemma 4.6 we can conclude that all that we have encountered (including ) are such that . The algorithm runs in time because we never follow the same edge twice.
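An equivalent way to carry out this search (our reformulation, not the exploration above, and under the assumption that this stage detects the nodes $u$ whose minimum is the constant string $\lambda(u)^\omega$): restrict the graph to equally-labelled edges between still-unclassified nodes, and keep exactly the nodes that retain an infinite backward chain, found by repeatedly peeling nodes with no remaining in-neighbour:

```python
from collections import deque
from typing import List, Tuple

def constant_minima(n: int, edges: List[Tuple[int, int]], lam: List[str],
                    excluded: List[bool]) -> List[bool]:
    """alive[u] ends up True iff u has an infinite backward path through
    non-excluded nodes labelled lam[u], i.e. lam[u]^w can be read from u."""
    indeg, succ = [0] * n, [[] for _ in range(n)]
    for u, v in edges:
        if not excluded[u] and not excluded[v] and lam[u] == lam[v]:
            indeg[v] += 1
            succ[u].append(v)
    alive = [not excluded[v] for v in range(n)]
    queue = deque(v for v in range(n) if alive[v] and indeg[v] == 0)
    while queue:                 # peel nodes with no backward continuation
        u = queue.popleft()
        alive[u] = False
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0 and alive[v]:
                queue.append(v)
    return alive
```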
B.2 Proofs from Section 4.2
Statement of Lemma 4.9. Let be a graph. Let be such that . Let be an occurrence of and let be an occurrence of . Then:
-
1.
for every .
-
2.
for every .
-
3.
for every .
In particular, the previous results hold if and and are two distinct occurrences of .
Proof B.7.
-
1.
For every we have .
-
2.
Fix . We have and so
-
3.
Fix . By the previous points we have and , so by Corollary 3.3 we conclude that it must necessarily be .
Statement of Lemma 4.10. Let be a graph. Let and let be an occurrence of starting at . Let be such that . Then, are pairwise distinct. In particular, .
Proof B.8.
We assume that , because the case is symmetrical. Notice that for every we have that is an occurrence of starting at and , so . In particular, for every we have .
Suppose for the sake of contradiction that there exist such that . Then, . By induction, we obtain for every , and so . For every , we have . Since , then , so for every , which implies and so . By Lemma 3.2 we conclude , a contradiction.
Statement of Lemma 4.12. Let be a graph. Let such that . Then, for every . In particular, if , then .
Proof B.9.
Fix . By the definition of we have , so .
Statement of Lemma 4.13. Let be a graph. Let such that .
-
1.
is well-defined and nonempty for every .
-
2.
Let be an occurrence of starting at . Then, for every . In particular, and for every .
-
3.
For every and for every there exists an occurrence of starting at and ending at .
Proof B.10.
We proceed by induction on . If , then is well-defined and nonempty, and . Moreover, if , then , and there exists an occurrence of starting and ending at . Now, assume that .
First, notice that is well-defined and nonempty. Indeed, by the inductive hypothesis is well-defined and nonempty, so is nonempty because every node has an incoming edge.
Let us prove that . Let and . We must prove that .
-
1.
Let us prove that . By the inductive hypothesis, we know that . We know that , so .
-
2.
Let us prove that . Let . We must prove that . Since , there exists such that . From we obtain . By the inductive hypothesis, there exists an occurrence of starting at and ending at , so there exists an occurrence of starting at and ending at . Since , it must be , and in particular .
-
3.
Let us prove that . Let . We must prove that . In particular, , so as before we obtain . Moreover, from we obtain . From Corollary 3.3 we conclude .
Let us prove that . If , we have and the conclusion follows because for every . As a consequence, in the following we can assume . In particular, , so by Lemma 3.2 there exists , and such that and . Suppose for the sake of contradiction that there exists such that . We know that , and in particular, . From Lemma 4.12 we obtain for every , or equivalently for every . Since , we conclude for every . As a consequence, . On the other hand, since and , by the inductive hypothesis there exists an occurrence of starting at and ending at , so . As a consequence, by the minimality of we obtain , or equivalently, . Since , we obtain , a contradiction.
Let us prove that is well-defined and nonempty, and . This follows from and .
Lastly, let us prove that if , then there exists an occurrence of starting at and ending at . Since , then there exists such that . In particular, . By the inductive hypothesis, there exists an occurrence of starting at and ending at , so there exists an occurrence of starting at and ending at .
Statement of Lemma 4.14. Let be a graph. Let such that . Assume that one of the following statements is true:
-
1.
is not a prefix of and .
-
2.
, and .
-
3.
is a strict prefix of .
Then, .
Equivalently, if , then one of the following is true: (i) is not a prefix of and ; (ii) and ; (iii) is a strict prefix of .
Proof B.11.
Let us prove that, if one of the three statements (1) - (3) is true, then .
-
1.
If and is not a prefix of , then .
- 2.
-
3.
Assume that is a strict prefix of (hence ). In particular, , so similarly to the previous case we only have to prove that . Again, we obtain , , and . Since , the minimality of implies . By Corollary 3.3 we conclude .
Now, let us prove that if , then one of the following is true: (i) is not a prefix of and ; (ii) and ; (iii) is a strict prefix of . If this were not true, then one of the following would be true: (1) is not a prefix of and ; (2) , and ; (3) is a strict prefix of . We would then conclude , a contradiction.
Statement of Lemma 4.15. Let be a graph. Let be such that , and . Then, .
Proof B.12.
Statement of Lemma 4.16. Let be a graph. Let , and let be an occurrence of starting at . Then, exactly one of the following holds true:
-
1.
There exists such that for every and for every .
-
2.
for every , and both and are true for infinitely many ’s.
Proof B.13.
If for every , then both and are true for infinitely many ’s, because otherwise we could choose such that either for every or for every , and in both cases Lemma 4.10 leads to a contradiction, because is an occurrence of starting at and is a finite set.
Now, assume that there exists such that ; without loss of generality, we can assume that is the smallest integer with this property. We only have to prove that if , then . Since , then by Lemma 3.2. Since is an occurrence of starting at , then for every , so for every , being an occurrence of starting at . By Lemma 3.2, we conclude for every .
Statement of Lemma 4.17. Let be a graph. Let such that . Let be an occurrence of starting at . Let be the infinite sequence of nodes in obtained as follows. Consider , and for every , let be the smallest element of , if it exists. For every such that is defined, let , and if is such that is not defined (so is a finite set), let . Then, is an occurrence of starting at in .
Proof B.14.
First, notice that for every because , and (so ). Now, let us prove that for every . Fix . We distinguish three cases.
- 1.
- 2.
-
3.
and are both undefined. Then, by the previous case we conclude .
We are left with proving that for every . Let be such that for every . We have to prove that . Since for every , we have , and is an occurrence of starting at . The conclusion follows if we prove that for every we have . Fix ; it will suffice to show that, for every , if , then . Fix , and let be an occurrence of starting at . Notice that for every we have , or equivalently, , or equivalently, (so ) and . Note that for every . We distinguish two cases:
-
1.
There exists such that . In this case, the definition of implies , and in particular . Moreover, since and , the definition of implies , and in particular . We conclude .
-
2.
For every it holds . In this case, for every we have and , so and by Lemma 4.13 there exists an occurrence of starting at and ending at . As a consequence, there is an occurrence of starting at and ending at and so, if , then . This implies , so from we obtain . By Lemma 4.14 and , we obtain that one of the following statements must be true:
-
(a)
is not a prefix of , is not a prefix of and .
-
(b)
and .
-
(c)
is a strict prefix of .
In all three cases, we conclude , or equivalently, , or equivalently, .
-
(a)
Statement of Theorem 4.18. Let be a graph. Let be such that .
-
1.
If , then .
-
2.
If , then .
Proof B.15.
Let be an occurrence of starting at , and let be an occurrence of starting at . Let be the occurrence of defined by means of in Lemma 4.17, and let be the occurrence of defined by means of in Lemma 4.17. Let , and let be the smallest element of , if it exists. Moreover, let , and let be the smallest element of , if it exists. Notice that .
In the rest of the proof, we say that an integer is nice if it satisfies the following properties:
-
•
and are all defined, and for every .
-
•
for every .
Note the following properties:
-
•
is always nice because .
-
•
If is nice, then every is nice. This implies that either every is nice, or there exists a unique such that every is nice and every is not nice.
-
•
If is nice, and , then is nice. Indeed, implies . Moreover, since is an occurrence of starting at , then by the minimality of and Lemma 4.13 we have for every , and , which implies that is defined. Analogously, one obtains for every and , so is defined and , which implies that is nice.
-
•
If is nice, then for every . Indeed, assume for the sake of contradiction that for some (the case is analogous). Since is an occurrence of starting at , then by Lemma 4.13 we have for every , and , which by Lemma 4.16 implies for every . This implies that is not defined, which is a contradiction because .
Let us prove that, if is nice, then:
-
•
.
-
•
.
We will prove these two properties separately.
-
•
Let us prove that . Since is nice, then for every we have that and are defined, and . As a consequence, it will suffice to prove that for every . This is equivalent to proving that for every . This follows from for every .
-
•
Let us prove that . Fix . We must prove that , or equivalently, . This follows from and .
We can now prove the two claims of the theorem. We will use the following observation: if every is nice, then and , because and for every .
-
1.
Assume that ; we must prove that . If every is nice, we are done, so we can assume that is the largest nice integer. By Lemma 4.9, we have for every , so . In particular, for every we have that is defined if and only if is defined and, if so . Since is nice, then and are defined, and . We know that and , so implies that in order to prove that we only have to prove that .
By Lemma 4.9, we have . Since for every , is an occurrence of starting at and is an occurrence of starting at , we obtain . As a consequence, . In particular, it must be , so from we conclude . Moreover, it must be , otherwise we would conclude that is nice, which contradicts the maximality of . As a consequence, by the definition of we conclude that .
-
2.
Assume that ; we must prove that . Notice that it cannot happen that every is nice, otherwise we would obtain , a contradiction. Let be the biggest nice integer. In particular, and are defined, , and . We know that and , so from we conclude . Since , and , in order to prove that we only have to prove that .
Notice that it cannot be and , because:
-
(a)
If , then by Lemma 4.15 we would conclude , a contradiction.
-
(b)
If , then would be a nice integer, which contradicts the maximality of .
From this observation, Lemma 4.14 and we conclude that one of the following must be true:
(a) is not a prefix of , is not a prefix of and .
(b) , and .
(c) is a strict prefix of .
In all three cases we conclude , or equivalently, , which implies .
Statement of Lemma 4.20. Let be a trimmed graph. Then, we can build in time.
Proof B.16.
The definition of implies that in order to build it is sufficient to compute , and for every such that ; moreover, we also need to compute the total order , so that we can remap to an integer alphabet consistently with the total order on .
Fix such that . For every , let be the set of all edges entering a node in , and let . Note that since the ’s are pairwise distinct, then the ’s are pairwise disjoint. Let us prove that . To this end, it will suffice to prove that for every there exist at most two edges in whose start nodes are equal to . Fix , and let be the smallest integer such that is the first state of an edge for some , if such a exists. Notice that is uniquely determined because the value does not depend on the choice of , and is deterministic. This means that is uniquely determined. In order to prove our claim, it will suffice to show that if for some and , then , because we will conclude that and are uniquely determined, since is deterministic. Since and , by the definition of , we have . Since , then by Lemma 4.12 we obtain , which by Lemma 4.13 is equivalent to , and so we conclude , or equivalently, . Since and , then , so from we conclude that it must be because is trimmed, and so .
Let us show that in we can compute , and for every such that . It will suffice to show that, for a fixed such that , we can compute , and in time.
Let us show how we recursively compute each set , for . During the algorithm, we will mark some nodes. Initially, no node is marked, and after step , a node of is marked if and only if it belongs to . If , we just let , and we define . Now, assume that , and assume that we have computed . We first scan all edges in and we collect the set of all start nodes of these edges. Then, by scanning , we first determine , then we determine . Next, we determine and . By checking which nodes in are marked, we determine . Finally, we mark all nodes in . We can perform all these steps in time, where the hidden constant in the asymptotic notation does not depend on . Now, we check whether . If , then and we are done. Otherwise, we have and we proceed with step .
We conclude that we can determine in time , where we have used that the hidden constant in does not depend on . In addition, we have and .
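For concreteness, the marking scheme above can be sketched as follows in Python. All names are hypothetical, since the formal sets live in the notation of Section 4: in_edges[u] lists the edges entering node u as (start, end, label) triples, and frontier0 is the initial set; the actual algorithm additionally filters nodes by label and by the auxiliary sets of the proof. The sketch only illustrates why marking makes the total work linear.

def compute_frontiers(in_edges, frontier0):
    """Iteratively grow the sets; marking ensures every node enters at most
    one set, so each node and each edge is touched O(1) times overall."""
    marked = set(frontier0)          # nodes already placed in some set
    frontiers = [set(frontier0)]
    while frontiers[-1]:
        # Scan all edges entering the current set and collect their start nodes.
        starts = {v for u in frontiers[-1] for (v, _, _) in in_edges.get(u, ())}
        new = starts - marked        # keep only unmarked nodes...
        marked |= new                # ...and mark them
        frontiers.append(new)
    frontiers.pop()                  # drop the trailing empty set
    return frontiers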
We are only left with showing how to compute . Essentially, we only have to radix sort the strings ’s by taking into account that is defined by considering a slight variation of the lexicographic order. More precisely, we proceed as follows. We know that for each such that , so we first pad the end of each with a special character larger than all characters in the alphabet until the length of each string is exactly . Next, we consider two extra special characters and such that , and we append exactly one of these characters to each : we append if , and we append if . Now, we radix sort the (modified) ’s in time, so obtaining .
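The padding-and-sentinel step admits a short sketch, again with hypothetical names (strings maps each key to a tuple of integer characters over {0, ..., sigma - 1}, flag[k] selects which extra sentinel is appended, and the chosen order among the three special characters is only one possible instantiation of the order prescribed by the proof):

def pad_and_radix_sort(strings, flag, sigma):
    """Pad all strings to a common length with a character larger than the
    alphabet, append one of two sentinels, and LSD radix sort the result."""
    L = max(len(s) for s in strings.values())
    PAD = sigma                              # pad character: beats every letter
    S_LOW, S_HIGH = sigma + 1, sigma + 2     # the two extra sentinels
    padded = {
        k: s + (PAD,) * (L - len(s)) + ((S_LOW,) if flag[k] else (S_HIGH,))
        for k, s in strings.items()
    }
    keys = list(padded)
    # LSD radix sort: one stable counting pass per position, right to left.
    for pos in range(L, -1, -1):
        buckets = [[] for _ in range(sigma + 3)]
        for k in keys:
            buckets[padded[k][pos]].append(k)
        keys = [k for b in buckets for k in b]
    return keys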
B.3 Proofs from Section 4.3
Statement of Lemma 4.21. Let be a graph. If we know the min-partition of , then we can build the min-partition of in time.
Proof B.17.
We have seen that the algorithm correctly builds and for every . Note that when the algorithm builds the ’s, it never modifies the ’s and the ’s (because we only add elements to the ’s).
Let us prove that, when we consider in (and before computing the ’s), then:
1. is an element of and we have already built the prefix of whose largest element is .
2. every contains a prefix of .
At the beginning of the algorithm we consider in , with , so our claim is true because we have already built and , and all the ’s are empty.
Now, assume that our claim is true when we consider . We want to prove that it is true when we process the next element (if it exists). When we consider , we compute the nonempty ’s, and we add each such to . We now want to prove that is correctly identified as the next element in .
• First, let us prove that, if , then . Since , then there exist such that . Since by the inductive hypothesis is an element of and we have already built the prefix of whose largest element is and since and were not marked before, then it must be , and , so we conclude .
• We are only left with showing that if and is not in some element of after is added to , then . Let such that ; we have shown that . The definition of implies that for some that we have not processed yet. Since by the inductive hypothesis is an element of and we have already built the prefix of whose largest element is , then it must be , and we conclude .
Now, let us prove that, if after considering in , the next element to be considered is not in , then the construction of is complete. If then the conclusion is immediate because and had already been fully built. Now assume that . Suppose for the sake of contradiction that there exists which has not been added to some element in . Without loss of generality, we choose such that is as small as possible. Since , then there exists such that and (and so ). If , then we immediately obtain that has already been processed (because we have already built the prefix of whose largest element is ) and so should have already been processed when was processed, a contradiction. If , then by Corollary 3.3 we have , so and ; the minimality of implies that was previously added to some element of and so should have been processed, again a contradiction.
As a consequence, when we process the next element after in , then our claim will be true (independently of whether the next element is in or not). In particular, if there is no next element, we conclude that all ’s have been built.
Lastly, we can build each in time, because we only need to scan each node and each edge once.
Appendix C Details on the Complementary Case
Let us show how to reduce the problem of computing the min-partition of to the problem of determining the min-partition of , where is a graph such that . We use the same notation that we used in the previous sections (with a complementary meaning) so that it will be easy to follow our argument.
C.1 Recursive Step
For every such that , let be the smallest integer such that , where is an occurrence of starting at . For every , we can define each as we did before; now, it will be for every . In fact, when we explore the graph starting from in a backward fashion and we (implicitly) build a prefix of , at each step it is still true that we must consider the nodes with minimum value and, among those, the nodes with minimum value . This time, there is no chance that we include in a node already included before: at every step, can only increase (because ), so if there were a cycle, then all edges of the cycle would have the same label and by Lemma 4.6 we would conclude for some , a contradiction. In particular, this means that now we do not even need to somehow trim the graph to ensure that our algorithm runs in time, because we cannot visit the same node twice. The ’s and the ’s are defined as before, and Lemma 4.14 still implies that and must be defined as before. When we define , we define and we define as before. It is easy to check that we can build in time as before, and if we have the min-partition of (with respect to ), then we also have the min-partition of (with respect to ).
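Since the traversed subgraph is acyclic, the backward exploration in this case can be sketched as a plain graph walk; the names are hypothetical (in_edges[u] lists the edges entering u as (start, end, label) triples), and the revisit guard is kept only as a safety net, since the argument above shows it never fires:

def backward_explore(in_edges, start):
    """Collect all nodes reachable from start by walking edges backward;
    no trimming is needed because no node can be reached twice."""
    seen, stack, order = {start}, [start], []
    while stack:
        u = stack.pop()
        order.append(u)
        for (v, _, _) in in_edges.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return order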
C.2 Merging
Let such that . By Lemma 3.2, there exist , and such that and . Then, we define .
In the following, we will often use the following observation. Let such that . Let such that and . Then, (i) and (ii) if , then and .
Note that, if are such that , and , then if and only if .
It is easy to compute all ’s in time. Start from each such that , and explore the graph in a backward fashion by only following edges labeled , until either we encounter a node for which we have already determined , or we cannot explore the graph any longer. For all the nodes that we encounter it must be (otherwise it would be ), and if we cannot explore any longer from some that we encounter, it must be . We cannot encounter the same node twice because is deterministic, and we cannot have cycles, otherwise it would be by Lemma 4.6. As a consequence, we have built a tree (where the root has no outgoing edges), and computing the ’s is equivalent to computing the height of each node.
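Computing the heights then takes linear time overall; here is a minimal sketch, under the assumption that the tree is encoded by a hypothetical parent map sending each node to its unique out-neighbour (and the root to None):

def node_depths(parent):
    """Depth of every node in a forest given by parent pointers; each node
    is assigned a depth exactly once, so the total time is linear."""
    depth = {}
    for start in parent:
        path, u = [], start
        # Walk up until a node with a known depth (or a root) is reached.
        while u is not None and u not in depth:
            path.append(u)
            u = parent[u]
        d = -1 if u is None else depth[u]
        # Unwind the path, assigning depths on the way back.
        for v in reversed(path):
            d += 1
            depth[v] = d
    return depth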
Let us show how we can use the ’s to determine the min-partition of , assuming that we already have the min-partition of . As before, we have the min-partition of . For every and for every , we define , and as before, and our problem reduces to computing each .
By using and , we can compute the ’s and the ’s as before, so the challenging part is to compute the ’s. Let , with . During the algorithm, we will assign a number to every element that appears in some at some point.
Notice that must be empty because otherwise we would conclude that is not the largest character. This suggests that this time the order in which we process the ’s must be , , , , , and so on. Moreover, we will build each incrementally from its largest element to its smallest element (so we will consider suffixes of min-partitions, not prefixes). This time we will not mark nodes, but we will mark entries of the ’s to indicate that a node has been removed from an element in . Intuitively, this time we need to remove nodes because it will be true that when we process in then we have already built the suffix (not the prefix) of whose largest element is , but we are now building from its largest element to its smallest element, so a node reached by an edge leaving a node in may also be reached by an edge leaving a node that we have not processed, and for which it holds .
Assume that we process in . Note that if is such that , , and for some , then it must be otherwise ; if is an element in the ’s or in the ’s, it must always be because, if it were , then we would conclude . For every , define if , and if . We compute the set of all nodes such that , , and for some and, if , then (i) we mark the entries of containing an element in and (ii) we add to , letting . We assume that we maintain an additional array that for every node in already occurring in some stores its (unique) current position, so that operation (i) can be performed in constant time without affecting the running time of the algorithm (which is still ).
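The constant-time removal above can be realized with a position table; the following is a minimal sketch under assumed names (entries are flagged dead instead of being physically deleted, mirroring the marking of entries described in the proof):

class MarkableLists:
    """Lists with O(1) append and O(1) logical removal via a position table."""
    def __init__(self):
        self.lists = {}   # key -> list of [node, alive] entries
        self.pos = {}     # node -> (key, index) of its current entry

    def append(self, key, node):
        lst = self.lists.setdefault(key, [])
        self.pos[node] = (key, len(lst))
        lst.append([node, True])

    def remove(self, node):
        key, i = self.pos[node]        # O(1) lookup of the entry...
        self.lists[key][i][1] = False  # ...which is marked as removed

    def alive(self, key):
        return [n for n, ok in self.lists.get(key, []) if ok]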
Notice that at any time the following will be true in each : (a) if is in , then ; (b) if is nonempty, then the first that we have added is such that ; (c) if has been added to immediately after , then ; (d) if for some in , then .
Let us prove that, when we consider in (and before computing the ’s), then is an element of and we have already built the suffix of whose largest element is .
At the beginning of the algorithm we consider in , with , so our claim is true because we have already built and .
Now, assume that our claim is true when we consider in . Let us prove that, when we consider the next element in (if it exists), then is an element of and we have already built the suffix of whose largest element is . Notice that either or .
1. Assume that and . If , we are done because we have already built the ’s and the ’s. Now assume that . This implies that .
(a) Suppose that . First, let us prove that, if , then . Let be such that , and . Since we obtain that either (i) , or (ii) , and . In both cases, we conclude that is an element of the suffix of whose largest element is , which by the inductive hypothesis has been correctly built, and so is added to an element of ; is never removed from this element otherwise in there would be a string smaller than . As a consequence, it must also be , so by the inductive hypothesis and we conclude .
We are only left with proving that, if is neither in , nor in the suffix of whose largest element is , and if , then . The conclusion is immediate if , so we can assume . It cannot be , otherwise would be in the suffix of whose largest element is . Since , it cannot be and , otherwise again would be in the suffix of whose largest element is . As a consequence, it must hold and, if , then . If , or and , we immediately conclude . Hence, we can assume , and . Let be such that , and . As before, we must have already considered and , but since is not in , by construction it must be and so we conclude .
(b) Suppose that . Arguing as we did above, we infer that all such that are already in some element of the suffix of whose largest element is . By using this information, the same proof as in the previous case shows that is correct.
2. Assume that and . Since we have already built the ’s and the ’s, we only have to prove that, if , then all are already in some element of the suffix of whose largest element is . Assume for the sake of contradiction that this is not true for some , and choose such that is as small as possible. By arguing as before, we conclude (by the minimality of ) that must be in some ; but since , we conclude that must be in some element of the suffix of whose largest element is , a contradiction.
3. Assume that and . In particular, . As in the previous case, we obtain that if , then all are already in some element of the suffix of whose largest element is . Moreover, and must necessarily be empty because we have already built them, and is not contained in any of them. First, let us prove that, if , then . Let be such that , and . Since we obtain that either (i) , or (ii) , and . However, case (ii) cannot occur because , so it must be , and we conclude that is an element of the suffix of whose largest element is . As before, we conclude that also is in , and .
We are only left with proving that, if is neither in , nor in the suffix of whose largest element is , and if , then . The conclusion is immediate if , so we can assume . It cannot be , otherwise would be in the suffix of whose largest element is . It cannot be and , because . As a consequence, it must hold and, if , then . If , or and , we immediately conclude . Hence, we can assume , and . Let be such that , and . As before, we must have already considered and , but since is not in , by construction it must be and so we conclude .
4. Assume that and . As before, we obtain that if , then all are already in some element of the suffix of whose largest element is . Moreover, and must necessarily be empty because we have already built them, and is not contained in any of them. Since we have already built and , we only have to prove that is empty. If it were not empty, in particular there would exist with . If is such that and , then we conclude that it must be . Then, must be in some element of the suffix of whose largest element is , so would be in some element of ; but this is a contradiction, because .
Lastly, if after considering in there is no element to consider, we can check that every is in some element of the suffix of whose largest element is (and so we have built ). Indeed, assume for the sake of contradiction that this is not true. Consider the set of all nodes that do not satisfy this property (it must be ); let be the set of all nodes in for which is maximal, and let be the set of all nodes in for which is minimal. Pick any , and let such that and . As before, we obtain that must be in some element of the suffix of whose largest element is , so should be in some and there would be an element to consider, a contradiction.