
Pointer-Machine Algorithms for Fully-Online Construction of
Suffix Trees and DAWGs on Multiple Strings

Shunsuke Inenaga Department of Informatics, Kyushu University, Japan
PRESTO, Japan Science and Technology Agency, Japan
[email protected]
Abstract

We deal with the problem of maintaining the suffix tree indexing structure for a fully-online collection of multiple strings, where a new character can be prepended to any string in the collection at any time. The only previously known algorithm for the problem, recently proposed by Takagi et al. [Algorithmica 82(5): 1346-1377 (2020)], runs in O(N log σ) time and O(N) space on the word RAM model, where N denotes the total length of the strings and σ denotes the alphabet size. Their algorithm makes heavy use of the nearest marked ancestor (NMA) data structure on semi-dynamic trees, which supports queries and insertions of nodes in O(1) amortized time on the word RAM model. In this paper, we present a simpler fully-online right-to-left algorithm that builds the suffix tree for a given string collection in O(N(log σ + log d)) time and O(N) space, where d is the maximum number of in-coming Weiner links to a node of the suffix tree. We note that d is bounded by the height of the suffix tree, which is in turn bounded by the length of the longest string in the collection. The advantage of this new algorithm is that it works on the pointer machine model, namely, it does not use the complicated NMA data structures that involve table look-ups. As a byproduct, we also obtain a pointer-machine algorithm for building the directed acyclic word graph (DAWG) for a fully-online left-to-right collection of multiple strings, which runs in O(N(log σ + log d)) time and O(N) space, again without the aid of the NMA data structures.

1 Introduction

1.1 Suffix trees and DAWGs

Suffix trees are a fundamental string data structure with a myriad of applications [9]. The first efficient construction algorithm for suffix trees, proposed by Weiner [19], builds the suffix tree for a string in a right-to-left online manner, by updating the suffix tree each time a new character is prepended to the string. It runs in O(n log σ) time and O(n) space, where n is the length of the string and σ is the alphabet size.

One of the most interesting features of Weiner’s algorithm is its very close relationship to Blumer et al.’s algorithm [2], which builds the directed acyclic word graph (DAWG) in a left-to-right online manner, by updating the DAWG each time a new character is appended to the string. It is well known that the DAG of the Weiner links of the suffix tree of T is equivalent to the DAWG of the reversal T̄ of T, or symmetrically, the suffix link tree of the DAWG of T̄ is equivalent to the suffix tree of T. Thus, right-to-left online construction of suffix trees is essentially equivalent to left-to-right online construction of DAWGs. This means that Blumer et al.’s DAWG construction algorithm also runs in O(n log σ) time and O(n) space [2].

DAWGs also support efficient pattern matching queries, and have been applied to other important problems on strings, such as local alignment [5], pattern matching with variable-length don’t cares [13], dynamic dictionary matching [10], compact online Lempel-Ziv factorization [21], finding minimal absent words [7], and finding gapped repeats [17].

1.2 Fully online construction of suffix trees and DAWGs

Takagi et al. [16] initiated the generalized problem of maintaining the suffix tree for a collection of strings in a fully-online manner, where a new character can be prepended to any string in the collection at any time. This fully-online scenario arises in real-time database systems, e.g. for sensor networks or trajectories. Takagi et al. showed that a direct application of Weiner’s algorithm [19] to this fully-online setting requires visiting Θ(N min(K, √N)) nodes, where N is the total length of the strings and K is the number of strings in the collection. Note that this leads to a worst-case Θ(N^{1.5} log σ)-time construction when K = Ω(√N).

In their analysis, it was shown that Weiner’s original algorithm applied to a fully-online string collection visits a total of Θ(N min(K, √N)) nodes. This means that the amortization argument for the number of nodes visited in the climbing process of Weiner’s algorithm when inserting a new leaf does not carry over to multiple strings in the fully-online setting. To overcome this difficulty, Takagi et al. proved the following three claims: (1) by using σ nearest marked ancestor (NMA) data structures [20], one can skip the last part of each climbing process; (2) all the σ NMA data structures can be stored in O(N) space; (3) the number of nodes explicitly visited in the remaining part of each climbing process can be amortized to O(1) per newly added character. This led to their O(N log σ)-time and O(N)-space fully-online right-to-left construction of the suffix tree for multiple strings.

Takagi et al. [16] also showed that Blumer et al.’s algorithm applied to fully-online left-to-right DAWG construction requires Θ(N min(K, √N)) work as well. They then showed how to maintain an implicit representation of the DAWG in O(N) space which supports fully-online updates and simulates each DAWG edge traversal in O(log σ) time. The key here was again a non-trivial use of the aforementioned σ NMA data structures over the suffix tree of the reversed strings.

As stated above, Takagi et al.’s construction relies heavily on the NMA data structures [20]. Although NMA data structures are useful and powerful, all known NMA data structures for (static and dynamic) trees that support O(1) (amortized) time queries and updates [8, 11, 20] are quite involved, and they are valid only on the word RAM model, as they use look-up tables that explicitly store the answers for small sub-problems. Hence, in general, it would be preferable if one could achieve similar efficiency without NMA data structures.

1.3 Our contribution

In this paper, we show how to maintain the suffix tree for a right-to-left fully-online string collection in O(N(log σ + log d)) time and O(N) space, where d is the maximum number of in-coming Weiner links to a node of the suffix tree. Our construction does not use NMA data structures and works in the pointer-machine model [18], a simple computational model without address arithmetic. We note that d is bounded by the height of the suffix tree, and the height of the suffix tree is in turn at most the maximum length of the strings. Hence, the d term can be dominated by the σ term when the strings are over an integer alphabet of size polynomial in N, or when a large number of strings of similar lengths are treated. To achieve the aforementioned bounds on the pointer-machine model, we reduce the problem of maintaining the in-coming Weiner links of the nodes to the ordered split-insert-find problem, which maintains dynamic sets of sorted elements under split and insert operations and find queries, and which can be solved in a total of O(N log d) time and O(N) space.

As a byproduct of the above result, we also obtain the first non-trivial algorithm that maintains an explicit representation of the DAWG for a fully-online left-to-right collection of multiple strings, running in O(N(log σ + log d)) time and O(N) space. By an explicit representation we mean that every edge of the DAWG is implemented as a pointer. This DAWG construction does not require complicated table look-ups and thus also works on the pointer machine model.

2 Preliminaries

2.1 String notations

Let Σ be a general ordered alphabet. Any element of Σ* is called a string. For any string T, let |T| denote its length. Let ε be the empty string, namely, |ε| = 0. Let Σ⁺ = Σ* ∖ {ε}. If T = XYZ, then X, Y, and Z are called a prefix, a substring, and a suffix of T, respectively. For any 1 ≤ i ≤ j ≤ |T|, let T[i..j] denote the substring of T that begins at position i and ends at position j in T. For any 1 ≤ i ≤ |T|, let T[i] denote the i-th character of T. For any string T, let 𝖲𝗎𝖿𝖿𝗂𝗑(T) denote the set of suffixes of T, and for any set 𝒯 of strings, let 𝖲𝗎𝖿𝖿𝗂𝗑(𝒯) denote the set of suffixes of all strings in 𝒯, namely, 𝖲𝗎𝖿𝖿𝗂𝗑(𝒯) = ⋃_{T ∈ 𝒯} 𝖲𝗎𝖿𝖿𝗂𝗑(T). For any string T, let T̄ denote the reversed string of T, i.e., T̄ = T[|T|]⋯T[1]. For any set 𝒯 of strings, let 𝒯̄ = {T̄ ∣ T ∈ 𝒯}.
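As a tiny illustration of this notation (a sketch of ours, using Python’s 0-based slicing rather than the text’s 1-based T[i..j], and including the empty suffix in 𝖲𝗎𝖿𝖿𝗂𝗑(T)):

```python
# Toy illustration of Suffix(T), Suffix of a collection, and reversal.
# Note: Python slices are 0-based, while the text indexes T[i..j] from 1.
def suffixes(T):
    """All suffixes of T, including the empty string."""
    return {T[i:] for i in range(len(T) + 1)}

def suffixes_of_collection(collection):
    """Union of the suffix sets of all strings in the collection."""
    out = set()
    for T in collection:
        out |= suffixes(T)
    return out

T = "abaa"
assert T[::-1] == "aaba"      # the reversed string of T
assert "baa" in suffixes(T)
assert "" in suffixes(T)
assert suffixes_of_collection({"ab", "ba"}) == {"", "a", "b", "ab", "ba"}
```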

2.2 Suffix trees and DAWGs for multiple strings

For ease of description, we assume that each string T_i in the collection 𝒯 terminates with a unique character $_i that does not appear elsewhere in 𝒯. However, our algorithms also work without the $_i symbols at the right ends of the strings.

A compacted trie is a rooted tree such that (1) each edge is labeled by a non-empty string, (2) each internal node is branching, and (3) the string labels of the out-going edges of each node begin with mutually distinct characters. The suffix tree [19] for a text collection 𝒯, denoted 𝖲𝖳𝗋𝖾𝖾(𝒯), is a compacted trie which represents 𝖲𝗎𝖿𝖿𝗂𝗑(𝒯). The string depth of a node v of 𝖲𝖳𝗋𝖾𝖾(𝒯) is the length of the substring that is represented by v. We sometimes identify a node v with the substring it represents. The suffix tree for a single string T is denoted 𝖲𝖳𝗋𝖾𝖾(T).

Figure 1: Left: 𝖲𝖳𝗋𝖾𝖾(𝒯) for 𝒯 = {𝚌𝚊𝚋𝚊𝚊$𝟷, 𝚊𝚋𝚊𝚊𝚋$𝟸}. The bold broken arrows represent hard Weiner links, while the narrow broken arrows represent soft Weiner links. Not all Weiner links are shown for simplicity. Right: 𝖣𝖠𝖶𝖦(𝒮) for 𝒮 = 𝒯̄ = {$𝟷𝚊𝚊𝚋𝚊𝚌, $𝟸𝚋𝚊𝚊𝚋𝚊}. The broken arrow represents a suffix link. Not all suffix links are shown for simplicity.

𝖲𝖳𝗋𝖾𝖾(𝒯) has at most 2N-1 nodes and thus at most 2N-2 edges, since every internal node of 𝖲𝖳𝗋𝖾𝖾(𝒯) is branching and there are N leaves in 𝖲𝖳𝗋𝖾𝖾(𝒯). By representing each edge label x by a triple ⟨k, i, j⟩ of integers such that x = T_k[i..j], 𝖲𝖳𝗋𝖾𝖾(𝒯) can be stored in O(N) space.
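The triple representation of edge labels can be sketched as follows (a toy Python fragment of ours; the two-string collection and indices are illustrative only, with $_1 and $_2 written as two ordinary characters):

```python
# Storing an edge label x = T_k[i..j] as a triple (k, i, j) of integers
# instead of the substring itself is what keeps STree in O(N) space.
texts = {1: "cabaa$1", 2: "abaab$2"}   # toy collection

def edge_label(k, i, j):
    """Return T_k[i..j] with 1-based inclusive indices, as in the text."""
    return texts[k][i - 1:j]

assert edge_label(1, 2, 4) == "aba"
assert edge_label(2, 1, 2) == "ab"
```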

We define the suffix link of each non-root node av of 𝖲𝖳𝗋𝖾𝖾(𝒯), with a ∈ Σ and v ∈ Σ*, by 𝗌𝗅𝗂𝗇𝗄(av) = v. For each explicit node v and a ∈ Σ, we also define the reversed suffix link (a.k.a. Weiner link) by 𝖶_𝗅𝗂𝗇𝗄_a(v) = avx, where x ∈ Σ* is the shortest string such that avx is a node of 𝖲𝖳𝗋𝖾𝖾(𝒯). 𝖶_𝗅𝗂𝗇𝗄_a(v) is undefined if av is not a substring of the strings in 𝒯. A Weiner link 𝖶_𝗅𝗂𝗇𝗄_a(v) = avx is said to be hard if x = ε, and soft if x ∈ Σ⁺.

See the left diagram of Figure 1 for an example of 𝖲𝖳𝗋𝖾𝖾(𝒯)\mathsf{STree}(\mathcal{T}) and Weiner links.

The directed acyclic word graph (DAWG in short) [2, 3] of a text collection 𝒮, denoted 𝖣𝖠𝖶𝖦(𝒮), is a (partial) DFA which represents 𝖲𝗎𝖿𝖿𝗂𝗑(𝒮). It is proven in [3] that 𝖣𝖠𝖶𝖦(𝒮) has at most 2N-1 nodes and 3N-4 edges for N ≥ 3. Since each DAWG edge is labeled by a single character, 𝖣𝖠𝖶𝖦(𝒮) can be stored in O(N) space. The DAWG for a single string S is denoted 𝖣𝖠𝖶𝖦(S).

A node of 𝖣𝖠𝖶𝖦(𝒮) corresponds to the substrings of 𝒮 which share the same set of ending positions in 𝒮. Thus, for each node, there is a unique longest string represented by that node. For any node v of 𝖣𝖠𝖶𝖦(𝒮), let 𝑙𝑜𝑛𝑔(v) denote the longest string represented by v. An edge (u, a, v) in the DAWG is called primary if |𝑙𝑜𝑛𝑔(u)| + 1 = |𝑙𝑜𝑛𝑔(v)|, and is called secondary otherwise. For each node v of 𝖣𝖠𝖶𝖦(𝒮) with |𝑙𝑜𝑛𝑔(v)| ≥ 1, let 𝗌𝗅𝗂𝗇𝗄(v) = y, where y is the longest suffix of 𝑙𝑜𝑛𝑔(v) which is not represented by v.

Suppose 𝒮 = 𝒯̄. It is known (cf. [2, 3, 4]) that there is a node v in 𝖲𝖳𝗋𝖾𝖾(𝒯) iff there is a node x in 𝖣𝖠𝖶𝖦(𝒮) such that 𝑙𝑜𝑛𝑔(x) = v̄. Also, the hard Weiner links and the soft Weiner links of 𝖲𝖳𝗋𝖾𝖾(𝒯) coincide with the primary edges and the secondary edges of 𝖣𝖠𝖶𝖦(𝒮), respectively. In a symmetric view, the reversed suffix links of 𝖣𝖠𝖶𝖦(𝒮) coincide with the suffix tree 𝖲𝖳𝗋𝖾𝖾(𝒯) of 𝒯.

See Figure 1 for some concrete examples of the aforementioned symmetry. For instance, the nodes 𝚊𝚋𝚊𝚊 and 𝚋𝚊𝚊 of 𝖲𝖳𝗋𝖾𝖾(𝒯) correspond to the nodes of 𝖣𝖠𝖶𝖦(𝒮) whose longest strings are the reversals 𝚊𝚊𝚋𝚊 and 𝚊𝚊𝚋, respectively. Observe that 𝖲𝖳𝗋𝖾𝖾(𝒯) and 𝖣𝖠𝖶𝖦(𝒮) have 19 nodes each. The Weiner links of 𝖲𝖳𝗋𝖾𝖾(𝒯) labeled by character 𝚌 correspond to the out-going edges of 𝖣𝖠𝖶𝖦(𝒮) labeled by 𝚌. To see another example, the three Weiner links from node 𝚊 in 𝖲𝖳𝗋𝖾𝖾(𝒯) labeled 𝚊, 𝚋, and 𝚌 correspond to the three out-going edges of node {𝚊} of 𝖣𝖠𝖶𝖦(𝒮) labeled 𝚊, 𝚋, and 𝚌, respectively. For the symmetric view, focus on the suffix link from the node {$𝟸𝚋𝚊𝚊𝚋, 𝚋𝚊𝚊𝚋} of 𝖣𝖠𝖶𝖦(𝒮) to the node {𝚊𝚊𝚋, 𝚊𝚋}. This suffix link, reversed, corresponds to the edge labeled 𝚋$𝟸 from the node 𝚋𝚊𝚊 to the node 𝚋𝚊𝚊𝚋$𝟸 in 𝖲𝖳𝗋𝖾𝖾(𝒯).

We now see that the two following tasks are essentially equivalent:

  • (A) Building 𝖲𝖳𝗋𝖾𝖾(𝒯) for a fully-online right-to-left text collection 𝒯, using hard and soft Weiner links.

  • (B) Building 𝖣𝖠𝖶𝖦(𝒮) for a fully-online left-to-right text collection 𝒮, using suffix links.

2.3 Pointer machines

A pointer machine [18] is an abstract model of computation in which the state of the computation is stored as a directed graph, where each node can contain a constant number of data items (e.g. integers, symbols) and a constant number of pointers (i.e. out-going edges to other nodes). The instructions supported by the pointer machine model are basically creating new nodes and pointers, manipulating data, and performing comparisons. The crucial restriction of the pointer machine model, which distinguishes it from the word RAM model, is that pointer machines cannot perform address arithmetic; namely, memory access must be performed only through an explicit reference to a pointer. While the pointer machine model is apparently weaker than the word RAM model, which supports address arithmetic and unit-cost bit-wise operations, it serves as a good basis for modeling linked structures such as trees and graphs, which are exactly our targets in this paper. In addition, pointer machines are powerful enough to simulate list-processing based languages such as LISP and Prolog (and their variants), which have recurrently gained attention.

3 Brief reviews on previous algorithms

To understand why and how our new algorithms to be presented in Section 4 work efficiently, let us briefly recall the previous related algorithms.

3.1 Weiner’s algorithm and Blumer et al.’s algorithm for a single string

First, we briefly review how Weiner’s algorithm for a single string T adds a new leaf to the suffix tree when a new character a is prepended to T. Our description of Weiner’s algorithm differs slightly from the original one, in that we use both hard and soft Weiner links, while Weiner’s original algorithm uses hard Weiner links only and instead maintains Boolean vectors indicating the existence of soft Weiner links.

Suppose we have already constructed 𝖲𝖳𝗋𝖾𝖾(T) with hard and soft Weiner links. Let ℓ be the leaf that represents T. Given a new character a, Weiner’s algorithm climbs up the path from the leaf ℓ until encountering the deepest ancestor v of ℓ for which 𝖶_𝗅𝗂𝗇𝗄_a(v) is defined. If there is no such ancestor of ℓ, then a new leaf representing aT is inserted as a child of the root r of the suffix tree. Otherwise, the algorithm follows the Weiner link and arrives at its target node u = 𝖶_𝗅𝗂𝗇𝗄_a(v). There are two sub-cases:

  • (1) If 𝖶_𝗅𝗂𝗇𝗄_a(v) is a hard Weiner link, then a new leaf ℓ̂ representing aT is inserted as a child of u.

  • (2) If 𝖶_𝗅𝗂𝗇𝗄_a(v) is a soft Weiner link, then the algorithm splits the in-coming edge of u into two edges by inserting a new node y as the new parent of u such that |y| = |v| + 1 (see also Figure 2). A new leaf ℓ̂ representing aT is inserted as a child of this new internal node y. We also copy each out-going Weiner link 𝖶_𝗅𝗂𝗇𝗄_c(u) of u with a character c as an out-going Weiner link 𝖶_𝗅𝗂𝗇𝗄_c(y) of y, so that their target nodes are the same (i.e. 𝖶_𝗅𝗂𝗇𝗄_c(u) = 𝖶_𝗅𝗂𝗇𝗄_c(y)); see also Figure 3. Then, a new hard Weiner link from v to y labeled a is created; in other words, the old soft Weiner link 𝖶_𝗅𝗂𝗇𝗄_a(v) = u is replaced by the new hard Weiner link 𝖶_𝗅𝗂𝗇𝗄_a(v) = y. In addition, all the old soft Weiner links 𝖶_𝗅𝗂𝗇𝗄_a(z) = u of ancestors z of v in 𝖲𝖳𝗋𝖾𝖾(T) have to be redirected to new soft Weiner links 𝖶_𝗅𝗂𝗇𝗄_a(z) = y in 𝖲𝖳𝗋𝖾𝖾(aT). These redirections can be done by continuing to climb up the path from v until finding the deepest node x that has a hard Weiner link labeled a pointing to the parent of u in 𝖲𝖳𝗋𝖾𝖾(T).

Figure 2: Left: Illustration of 𝖲𝖳𝗋𝖾𝖾(T) in Case (2), before inserting the new leaf representing aT. Right: Illustration of 𝖲𝖳𝗋𝖾𝖾(aT) in Case (2), after inserting the new leaf representing aT. In both diagrams, thick broken arrows represent hard Weiner links, and narrow broken arrows represent soft Weiner links. All these Weiner links are labeled by a. Also, new Weiner links labeled a are created from the nodes between the leaf for T and v to the new leaf for aT (not shown in this diagram).
Figure 3: Illustration of the process of copying the out-going Weiner links of u to its new parent y in Case (2). Left: Out-going Weiner links of node u before the update. Right: Each out-going Weiner link of node u is copied to its new parent y, represented by a red broken arrow.

In both Cases (1) and (2) above, new soft Weiner links 𝖶_𝗅𝗂𝗇𝗄_a(x) = ℓ̂ are created from every node x on the path from ℓ up to the child of v.
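The climbing step shared by both cases can be sketched as follows (a minimal Python sketch of ours over a hand-built toy tree, not a full suffix tree implementation; the Node class and all names are illustrative assumptions):

```python
# Sketch of the climbing step of Weiner's algorithm: starting from the leaf
# for T, walk upward to the deepest ancestor v with W_link_a(v) defined.
class Node:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.wlinks = {}          # character -> (target node, is_hard)

def deepest_ancestor_with_wlink(leaf, a):
    """Return the deepest ancestor of leaf (the leaf itself included) that
    has a Weiner link labeled a, or None if no such ancestor exists."""
    v = leaf
    while v is not None and a not in v.wlinks:
        v = v.parent
    return v

# Toy path root - v - leaf, where only v has a Weiner link labeled 'a'.
root = Node("root")
v = Node("v", parent=root)
leaf = Node("leaf", parent=v)
v.wlinks["a"] = (Node("u"), False)    # a soft Weiner link, as in Case (2)

assert deepest_ancestor_with_wlink(leaf, "a") is v
assert deepest_ancestor_with_wlink(leaf, "b") is None  # new leaf from the root
```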

The running time analysis of the above algorithm consists of the following three parts.

  • (a) In both Cases (1) and (2), the number of nodes from the leaf ℓ for T up to v is bounded by the number of newly created soft Weiner links. This is amortized O(1) per new character, since the resulting suffix tree has a total of O(n) soft Weiner links [2], where n = |T|.

  • (b) In Case (2), the number of out-going Weiner links copied from u to y is bounded by the number of newly created Weiner links, which is also amortized O(1) per new character by the same argument as in (a).

  • (c) In Case (2), the number of redirected soft Weiner links is bounded by the number of nodes from v up to x. The analysis by Weiner [19] shows that this number of nodes can also be amortized to O(1) per new character.

Wrapping up (a), (b), and (c), the total numbers of visited nodes, created Weiner links, and redirected Weiner links throughout the construction of 𝖲𝖳𝗋𝖾𝖾(T) by prepending n characters are all O(n). Thus, Weiner’s algorithm constructs 𝖲𝖳𝗋𝖾𝖾(T) in a right-to-left online manner in O(n log σ) time and O(n) space, where the log σ term comes from the cost of maintaining the Weiner links of each node in lexicographically sorted order, e.g. by a standard balanced binary search tree.

Since this algorithm correctly maintains all (hard and soft) Weiner links, it builds 𝖣𝖠𝖶𝖦(S) for the reversed string S = T̄ in a left-to-right online manner, in O(n log σ) time and O(n) space. In other words, this version of Weiner’s algorithm is equivalent to Blumer et al.’s DAWG online construction algorithm.

We remark that the aforementioned version of Weiner’s algorithm, and equivalently Blumer et al.’s algorithm, work on the pointer machine model, as they use neither address arithmetic nor table look-ups.

3.2 Takagi et al.’s algorithm for multiple strings on the word RAM

When Weiner’s algorithm is applied to fully-online right-to-left construction of 𝖲𝖳𝗋𝖾𝖾(𝒯), the amortization in Analysis (c) no longer works. Namely, it was shown by Takagi et al. [16] that the number of redirected soft Weiner links is Θ(N min(K, √N)) in the fully-online setting for K strings. A simpler upper bound O(NK) immediately follows from the observation that the insertion of a new leaf for a string T_i in 𝒯 may also increase the depths of the leaves for all the other K-1 strings T_1, …, T_{i-1}, T_{i+1}, …, T_K in 𝒯. Takagi et al. then obtained the aforementioned improved O(N min(K, √N)) upper bound, and presented a lower-bound instance that indeed requires Ω(N min(K, √N)) work. It should also be noted that the original version of Weiner’s algorithm, which only maintains Boolean indicators for the existence of soft Weiner links, must also visit Θ(N min(K, √N)) nodes [16].

Takagi et al. gave a neat way to overcome this difficulty, using the nearest marked ancestor (NMA) data structure [20] for a rooted tree. This NMA data structure allows for marking nodes, splitting edges, inserting new leaves, and answering NMA queries in O(1) amortized time each, in the word RAM model with machine word size Ω(log N). Takagi et al. showed how to skip the nodes between v and x in O(1) amortized time, using a single NMA query on the NMA data structure associated with the character a that is prepended to T. They also showed how to store the σ NMA data structures for all σ distinct characters in O(N) total space. Since the amortization argument (c) is no longer needed thanks to the NMA data structures, and since the analyses (a) and (b) still hold for fully-online multiple strings, the total number of visited nodes is reduced to O(N) in their algorithm. This led to their O(N log σ)-time, O(N)-space construction in the word RAM model.

Takagi et al.’s Θ(N min(K, √N)) bound also applies to the number of visited nodes and to the number of redirected secondary edges of 𝖣𝖠𝖶𝖦(𝒮) for multiple strings in the fully-online setting. Instead, they showed how to simulate secondary edge traversals on 𝖣𝖠𝖶𝖦(𝒮) in O(log σ) amortized time each, using the aforementioned NMA data structures. We remark that their data structure is only an implicit representation of 𝖣𝖠𝖶𝖦(𝒮), in the sense that the secondary edges are not explicitly stored.

4 Simple fully-online constructions of suffix trees and DAWGs on the pointer-machine model

In this section, we present our new algorithms for fully-online construction of suffix trees and DAWGs for multiple strings, which work on the pointer-machine model.

4.1 Right-to-left suffix tree construction

In this section, we present our new algorithm that constructs the suffix tree for a fully-online right-to-left string collection.

Consider a collection 𝒯′ = {T_1, …, T_K} of K strings. Suppose that we have built 𝖲𝖳𝗋𝖾𝖾(𝒯′) and that for each string T_i ∈ 𝒯′ we know the leaf ℓ_i that represents T_i.

In our fully-online setting, any new character from Σ can be prepended to any string in the current collection 𝒯′. Suppose that a new character a ∈ Σ is prepended to a string T in the collection 𝒯′, and let 𝒯 = (𝒯′ ∖ {T}) ∪ {aT} be the collection after the update. Our task is to update 𝖲𝖳𝗋𝖾𝖾(𝒯′) to 𝖲𝖳𝗋𝖾𝖾(𝒯).

Our approach is to reduce the sub-problem of redirecting Weiner links to the ordered split-insert-find problem, which operates on ordered sets over a dynamic universe of elements, and supports the following operations and queries efficiently:

  • Make-set, which creates a new set that consists of a single element;

  • Split, which splits a given set into two disjoint sets, so that one set contains only smaller elements than the other;

  • Insert, which inserts a new element into a given set;

  • Find, which answers the name of the set that a given element belongs to.

Recall our description of Weiner’s algorithm in Section 3.1 and see Figure 2. Consider the set of in-coming Weiner links of node u before the update (the left diagram of Figure 2), and assume that these Weiner links are sorted by the string depths of their origin nodes. After arriving at the node v in the climbing process from the leaf for T, we take the Weiner link labeled a and arrive at node u. Then we access the set of in-coming Weiner links of u by a find query. When we create a new internal node y as the parent of the new leaf for aT, we split this set into two sets, one being the set of in-coming Weiner links of y, and the other the set of in-coming Weiner links of u (see the right diagram of Figure 2). This can be maintained by a single split operation.

We now turn our attention to the copying process of Weiner links described in Figure 3. Observe that each newly copied Weiner link can be inserted by a single find query and a single insert operation on the set of in-coming Weiner links of 𝖶_𝗅𝗂𝗇𝗄_c(u), for each character c for which 𝖶_𝗅𝗂𝗇𝗄_c(u) is defined.
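To make the interface concrete, here is a deliberately naive Python sketch of the four operations (ours, backed by sorted lists and a dictionary, so it does not achieve the O(log d) pointer-machine bounds discussed below; all class and method names are illustrative):

```python
import bisect

class OrderedSet:
    """One named set of the ordered split-insert-find structure."""
    def __init__(self, elems=()):
        self.elems = sorted(elems)

class SplitInsertFind:
    def __init__(self):
        self.owner = {}                      # element -> its OrderedSet

    def make_set(self, x):                   # Make-set
        s = OrderedSet([x])
        self.owner[x] = s
        return s

    def insert(self, s, x):                  # Insert
        bisect.insort(s.elems, x)
        self.owner[x] = s

    def find(self, x):                       # Find
        return self.owner[x]

    def split(self, s, k):                   # Split: (<= k, > k)
        i = bisect.bisect_right(s.elems, k)
        small = OrderedSet(s.elems[:i])
        s.elems = s.elems[i:]
        for x in small.elems:
            self.owner[x] = small
        return small, s

# E.g. string depths of the origin nodes of u's in-coming Weiner links:
dsu = SplitInsertFind()
s = dsu.make_set(1)
for depth in (2, 5, 7):
    dsu.insert(s, depth)
small, big = dsu.split(s, 5)      # split at the string depth of v
assert small.elems == [1, 2, 5] and big.elems == [7]
assert dsu.find(2) is small and dsu.find(7) is big
```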

Now we prove the next lemma:

Lemma 1.

Let f denote the operation and query time of a linear-space algorithm for the ordered split-insert-find problem. Then, we can build the suffix tree for a fully-online right-to-left string collection of total length N in a total of O(N(f + log σ)) time and O(N) space.

Proof.

The number of split operations is clearly bounded by the number of leaves, which is N. Since the number of Weiner links is at most 3N-4, the number of insert operations is also bounded by 3N-4. The number of find queries is thus bounded by N + 3N-4 = 4N-4. By using a linear-space split-insert-find data structure, we can maintain the sets of in-coming Weiner links of all nodes in a total of O(Nf) time with O(N) space.

Given a new character a to prepend to a string T, we climb up the path from the leaf for T and find the deepest ancestor v of the leaf for which 𝖶_𝗅𝗂𝗇𝗄_a(v) is defined. This can be checked in O(log σ) time at each visited node, using a balanced search tree. Since we do not climb up through the nodes z (see Figure 2) whose soft Weiner links labeled a are redirected, we can use the same analysis (a) as in the case of a single string. Consequently, the number of nodes visited by our algorithm is O(N). Hence we use O(N log σ) total time for finding the deepest nodes that have Weiner links for the prepended characters.

Overall, our algorithm uses O(N(f + log σ)) time and O(N) space. ∎

Our ordered split-insert-find problem is a special case of the union-split-find problem on ordered sets, since each insert operation can be trivially simulated by make-set and union operations. Link-cut trees of Sleator and Tarjan [14] for a dynamic forest support make-tree, link, and cut operations and find-root queries in time logarithmic in the size of the tree. Since each of our ordered sets can be maintained as a path in a link-cut tree, make-set, insert, split, and find in the ordered split-insert-find problem can all be supported in O(log d) time each. Since link-cut trees work on the pointer machine model, this leads to a pointer-machine algorithm for fully-online right-to-left construction of the suffix tree for multiple strings with f = O(log d). Here, in our context, d denotes the maximum number of in-coming Weiner links to a node of the suffix tree.

A potential drawback of using link-cut trees is that, in order to achieve O(log d)-time operations and queries, they use auxiliary data structures such as splay trees [15] as building blocks. Yet, in what follows, we show that our ordered split-insert-find problem can be solved with a simpler balanced search tree, the AVL tree [1], retaining the O(N(log σ + log d)) time and O(N) space complexities.

Theorem 1.

There is an AVL-tree based pointer-machine algorithm that builds the suffix tree for a fully-online right-to-left collection of multiple strings of total length N in O(N(log σ + log d)) time with O(N) space, where d is the maximum number of in-coming Weiner links to a suffix tree node and σ is the alphabet size.

Proof.

For each node u of the suffix tree 𝖲𝖳𝗋𝖾𝖾(𝒯′) before the update, let S(u) = {|x| ∣ 𝖶_𝗅𝗂𝗇𝗄_a(x) = u}, where a = u[1]; namely, S(u) is the set of the string depths of the origin nodes of the in-coming Weiner links of u. We maintain an AVL tree for S(u) associated with the node u, so that each in-coming Weiner link of u points to the corresponding node in the AVL tree for S(u). The root of the AVL tree is always linked to the suffix tree node u, and each time another node of the AVL tree becomes the new root as a result of rotations, we adjust the link so that it points to u from the new root of the AVL tree.

This way, a find query for a given Weiner link reduces to accessing the root of the AVL tree that contains the given Weiner link, which can be done in O(\log|S(u)|)\subseteq O(\log d) time.
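The find query can be pictured with the following sketch, where `AVLNode` and `find_target` are hypothetical names, only the fields relevant to the query are shown, and rotations and rebalancing are omitted.

```python
class AVLNode:
    """Node of the AVL tree storing S(u); only the fields needed for
    the find query are sketched here."""
    def __init__(self, key):
        self.key = key              # string depth of a Weiner-link origin
        self.parent = None          # None iff this node is the AVL root
        self.stree_node = None      # set on the root: the link to u

def find_target(avl_node):
    """Follow parent pointers to the AVL root, then take its link to the
    suffix tree node u.  The walk visits O(log |S(u)|) nodes, since AVL
    trees have logarithmic height."""
    while avl_node.parent is not None:
        avl_node = avl_node.parent
    return avl_node.stree_node
```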

Inserting a new element into S(u) can also be done in O(\log|S(u)|)\subseteq O(\log d) time.

Given an integer k, let S_1 and S_2 denote the subsets of S(u) such that every element of S_1 is at most k, every element of S_2 is larger than k, and S_1\cup S_2=S(u). It is well known that the AVL tree for S(u) can be split into two AVL trees, one for S_1 and one for S_2, in O(\log|S(u)|)\subseteq O(\log d) time (cf. [12]). In our context, k is the string depth of the deepest node v that has a Weiner link with character a on the upward path from the leaf for T. This allows us to maintain S_1=S(y) and S_2=S(u) in O(\log d) time in the updated suffix tree \mathsf{STree}(\mathcal{T}).

When we create the in-coming Weiner links labeled a to the new leaf \hat{\ell} for aT, we first perform a make-set operation, which builds an AVL tree consisting only of the root. If we naïvely inserted each in-coming Weiner link into the AVL tree one by one, it would take O(N\log d) total time. However, we can actually perform this process in O(N) total time, even on the pointer machine model. Since we climb up the path from the leaf \ell for T, the in-coming Weiner links are already sorted in decreasing order of the string depths of their origin nodes. We create a family of maximal complete binary trees of size 2^{h}-1 each, arranged in decreasing order of h. This can be done as follows: initially set r\leftarrow|S(\hat{\ell})|; then greedily take the largest h such that 2^{h}-1\leq r, update r\leftarrow r-(2^{h}-1), search for the next largest h, and so on. These trees can easily be created in O(|S(\hat{\ell})|) total time by a simple linear scan over the sorted list of the in-coming Weiner links. Since the heights h of these complete binary search trees are monotonically decreasing, and since all of them are AVL trees, one can merge them all into a single AVL tree in time linear in the height of the final AVL tree (cf. [12]), which is bounded by O(h)=O(\log|S(\hat{\ell})|). Thus, we can construct the initial AVL tree for the in-coming Weiner links of each new leaf \hat{\ell} in O(|S(\hat{\ell})|) time. Since the total number of Weiner links is O(N), we can construct the initial AVL trees for the in-coming Weiner links of all new leaves in O(N) total time.
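The greedy decomposition of r=|S(\hat{\ell})| into sizes of the form 2^{h}-1 can be sketched as follows (the function name `complete_tree_sizes` is ours; turning each size-(2^{h}-1) slice of the sorted list into a complete binary search tree is a straightforward recursion and is omitted here):

```python
def complete_tree_sizes(r):
    """Greedily decompose r into summands of the form 2**h - 1 with
    non-increasing h, as in the linear-time AVL construction above."""
    sizes = []
    while r > 0:
        h = r.bit_length()          # candidate for the largest h ...
        if 2 ** h - 1 > r:          # ... lowered when 2**h - 1 exceeds r
            h -= 1
        sizes.append(2 ** h - 1)
        r -= 2 ** h - 1
    return sizes
```

For instance, r = 10 decomposes as 7 + 3, i.e., a complete tree of height 3 followed by one of height 2.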

Overall, our algorithm works in O(N(\log\sigma+\log d)) time with O(N) space. ∎

4.2 Left-to-right DAWG construction

The next theorem immediately follows from Theorem 1.

Theorem 2.

There is an AVL-tree-based pointer-machine algorithm that builds an explicit representation of the DAWG for a fully-online left-to-right collection of multiple strings of total length N in O(N(\log\sigma+\log d)) time with O(N) space, where d is the maximum number of in-coming edges of a DAWG node and \sigma is the alphabet size. This representation of the DAWG allows each edge traversal in O(\log\sigma+\log d) time.

Proof.

The correctness and the complexity of construction are immediate from Theorem 1.

Given a character a and a node v of the DAWG, we first search for the out-going edge of v labeled a in O(\log\sigma) time. If it does not exist, we terminate. Otherwise, we take this a-edge and arrive at the corresponding node in the AVL tree for the destination node u of this a-edge. We then perform a find query on the AVL tree and obtain u in O(\log d) time. ∎

We emphasize that Theorem 2 gives the first non-trivial algorithm that builds an explicit representation of the DAWG for fully-online multiple strings. Recall that a direct application of Blumer et al.'s algorithm to the case of K fully-online multiple strings requires visiting \Theta(N\min(K,\sqrt{N})) nodes in the DAWG, which leads to O(N\min(K,\sqrt{N})\log\sigma)=O(N^{1.5}\log\sigma)-time construction for K=\Theta(\sqrt{N}).

It should be noted that, after all the N characters have been processed, this representation of the DAWG can easily be modified in O(N) time, in an offline manner, so that each edge traversal takes O(\log\sigma) time.

4.3 On optimality of our algorithms

It is known that sorting a length-N sequence of characters drawn from an alphabet of size \sigma is a lower bound for building the suffix tree [6], and the same holds for the DAWG. This is because, when we build the suffix tree or the DAWG with the out-going edges of each node sorted in lexicographical order, we obtain a sorted list of the characters at the root. Thus, \Omega(N\log\sigma) is a comparison-model lower bound for building the suffix tree or the DAWG. Since Takagi et al.'s O(N\log\sigma)-time algorithm [16] works only on the word RAM model, in which faster integer sorting algorithms exist, it is interesting to explore cases where our O(N(\log\sigma+\log d))-time algorithms for the weaker model of computation perform in optimal O(N\log\sigma) time.

It is clear that the maximum number d of in-coming Weiner links to a node is bounded by the total length N of the strings. Hence, in the case of integer alphabets of size \sigma=N^{O(1)}, our algorithms run in optimal O(N\log\sigma)=O(N\log N) time.

For the case of a smaller alphabet size \sigma=\mathrm{polylog}(N), the next lemma can be useful:

Lemma 2.

The maximum number d of in-coming Weiner links to a suffix tree node is less than the height of the suffix tree.

Proof.

For any node u of the suffix tree, all in-coming Weiner links to u are labeled by the same character a, which is the first character of the substring represented by u. Therefore, all in-coming Weiner links to u originate from nodes on the path between the root and the node u[2..|u|], and the number of such nodes is bounded by the height of the suffix tree. ∎

We note that the height of the suffix tree for multiple strings is bounded by the length of the longest string in the collection. In many applications, such as time series from sensor data, it is natural to assume that all K strings in the collection have similar lengths. Hence, when the collection consists of K=N/\mathrm{polylog}(N) strings of length \mathrm{polylog}(N) each, we have d=\mathrm{polylog}(N). In such cases, our algorithms run in optimal O(N\log\sigma)=O(N\log\log N) time.


Figure 4: Left: The K-1=\lceil\sqrt{N}/2\rceil strings, where character \mathtt{b} has been prepended only to the first string T_1. Right: The corresponding part of the suffix tree. Dashed arrows represent Weiner links with character \mathtt{b}.


Figure 5: Left: The K-1=\lceil\sqrt{N}/2\rceil strings after character \mathtt{b} has been prepended to all of them. Right: The corresponding part of the suffix tree after the updates. Each time a new leaf is created, \Theta(\sqrt{N}) in-coming Weiner links are involved in a split operation on the AVL tree, which takes O(\log N) time.

The next lemma exhibits an instance over a binary alphabet (\sigma=2) that requires a certain amount of work for the splitting process.

Lemma 3.

There exists a set of fully-online multiple strings over a binary alphabet such that the node split procedure of our algorithms takes O(\sqrt{N}\log N) total time.

Proof.

Let K=1+\lceil\sqrt{N}/2\rceil.

For the time being, we assume that each string T_i is terminated with a unique symbol \$_i. Consider a subset \{T_1,\ldots,T_{K-1}\} of K-1=\lceil\sqrt{N}/2\rceil strings such that for each 1\leq i\leq K-1, T_i=\mathtt{a}^{\sqrt{N}-i+1}\$_i. We then prepend the other character \mathtt{b} of the binary alphabet \{\mathtt{a},\mathtt{b}\} to each T_i in increasing order of i=1,\ldots,K-1. For i=1, \sqrt{N} Weiner links to the new leaf for \mathtt{b}T_1=\mathtt{b}\mathtt{a}^{\sqrt{N}}\$_1, each labeled \mathtt{b}, are created. See Figure 4 for an illustration of this step.

Then, for each i=2,\ldots,K-1, inserting a new leaf for \mathtt{b}T_i requires the insertion of a new internal node as the parent of the new leaf. This splits the set of in-coming Weiner links into two sets: one is a singleton consisting of the Weiner link from the node \mathtt{a}^{\sqrt{N}-i+1}, and the other consists of the Weiner links from the shallower nodes. Each of these K-2 split operations can be done by a simple deletion operation on the corresponding AVL tree, taking O(\log\sqrt{N})=O(\log N) time each. See Figure 5 for an illustration.

Observe also that the same analysis holds even if we remove the terminal symbol \$_i from each string T_i (in this case, there is a non-branching internal node for each T_i, and we start the climbing-up process from this internal node).

The total length of these K-1 strings is approximately 3N/8. We can arbitrarily choose the last string T_K, of length approximately 5N/8, so that it does not affect the above split operations (e.g., a unary string \mathtt{a}^{5N/8} or \mathtt{b}^{5N/8} would suffice).

Thus, there exists an instance over a binary alphabet for which the node split operations require O(\sqrt{N}\log N) total time. ∎
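The length accounting in the proof can be sanity-checked numerically, under the proof's convention that |T_i|=\sqrt{N}-i+2 including the terminal symbol (the function name below is ours):

```python
import math

def total_length_fraction(N):
    """Total length of T_1, ..., T_{K-1} from the proof of Lemma 3,
    as a fraction of N; it approaches 3/8 as N grows."""
    root = math.isqrt(N)
    k1 = (root + 1) // 2                     # K - 1 = ceil(sqrt(N) / 2)
    total = sum((root - i + 1) + 1           # |a^{sqrt(N)-i+1}| + terminal
                for i in range(1, k1 + 1))
    return total / N
```

For N = 10^6 the fraction evaluates to about 0.3758, close to 3/8 = 0.375.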

Since \sqrt{N}\log N=o(N), the \sqrt{N}\log N term is always dominated by the N\log\sigma term. It is left open whether there exists a set of strings with \Theta(N) character additions, each of which requires splitting a set that involves N^{O(1)} in-coming Weiner links. If such an instance exists, then our algorithm must take \Theta(N\log N) time in the worst case.

5 Conclusions and future work

In this paper we considered the problem of maintaining the suffix tree and the DAWG indexing structures for a collection of multiple strings that are updated in a fully-online manner, where a new character can be added to the left end or the right end of any string in the collection, respectively. Our contributions are simple pointer-machine algorithms that work in O(N(\log\sigma+\log d)) time and O(N) space, where N is the total length of the strings, \sigma is the alphabet size, and d is the maximum number of in-coming Weiner links to a node of the suffix tree. The key idea was to reduce the sub-problem of re-directing in-coming Weiner links to the ordered split-insert-find problem, which we solved in O(\log d) time per operation with AVL trees. We also discussed cases where our O(N(\log\sigma+\log d))-time solution is optimal.

A major open question regarding the proposed algorithms is whether there exists an instance over a small alphabet that contains \Theta(N) positions, each of which requires \Theta(\log N) time for the split operation, or requires \Theta(N) insertions each taking \Theta(\log N) time. If such instances exist, then the running time of our algorithms may be worse than the optimal O(N\log\sigma) for small \sigma. So far, we have only found an instance with \sigma=2 that takes sub-linear O(\sqrt{N}\log N) total time for the split operations.

Acknowledgements

This work is supported by JST PRESTO Grant Number JPMJPR1922.

References

  • [1] G. Adelson-Velsky and E. Landis. An algorithm for the organization of information. Proceedings of the USSR Academy of Sciences (in Russian), 146:263–266, 1962. English translation by Myron J. Ricci in Soviet Mathematics - Doklady, 3:1259–1263, 1962.
  • [2] A. Blumer, J. Blumer, D. Haussler, A. Ehrenfeucht, M. T. Chen, and J. Seiferas. The smallest automaton recognizing the subwords of a text. Theoretical Computer Science, 40:31–55, 1985.
  • [3] A. Blumer, J. Blumer, D. Haussler, R. McConnell, and A. Ehrenfeucht. Complete inverted files for efficient text retrieval and analysis. J. ACM, 34(3):578–595, 1987.
  • [4] M. Crochemore and W. Rytter. Text Algorithms. Oxford University Press, 1994.
  • [5] H. H. Do and W. Sung. Compressed directed acyclic word graph with application in local alignment. Algorithmica, 67(2):125–141, 2013.
  • [6] M. Farach-Colton, P. Ferragina, and S. Muthukrishnan. On the sorting-complexity of suffix tree construction. J. ACM, 47(6):987–1011, 2000.
  • [7] Y. Fujishige, Y. Tsujimaru, S. Inenaga, H. Bannai, and M. Takeda. Computing DAWGs and minimal absent words in linear time for integer alphabets. In MFCS 2016, pages 38:1–38:14, 2016.
  • [8] H. N. Gabow and R. E. Tarjan. A linear-time algorithm for a special case of disjoint set union. Journal of Computer and System Sciences, 30:209–221, 1985.
  • [9] D. Gusfield and J. Stoye. Linear time algorithms for finding and representing all the tandem repeats in a string. J. Comput. Syst. Sci., 69(4):525–546, 2004.
  • [10] D. Hendrian, S. Inenaga, R. Yoshinaka, and A. Shinohara. Efficient dynamic dictionary matching with DAWGs and AC-automata. Theor. Comput. Sci., 792:161–172, 2019.
  • [11] H. Imai and T. Asano. Dynamic orthogonal segment intersection search. J. Algorithms, 8(1):1–18, 1987.
  • [12] D. Knuth. The Art Of Computer Programming, vol. 3: Sorting And Searching, Second Edition. Addison-Wesley, 1998.
  • [13] G. Kucherov and M. Rusinowitch. Matching a set of strings with variable length don’t cares. Theor. Comput. Sci., 178(1-2):129–154, 1997.
  • [14] D. D. Sleator and R. E. Tarjan. A data structure for dynamic trees. J. Comput. Syst. Sci., 26(3):362–391, 1983.
  • [15] D. D. Sleator and R. E. Tarjan. Self-adjusting binary search trees. J. ACM, 32(3):652–686, 1985.
  • [16] T. Takagi, S. Inenaga, H. Arimura, D. Breslauer, and D. Hendrian. Fully-online suffix tree and directed acyclic word graph construction for multiple texts. Algorithmica, 82(5):1346–1377, 2020.
  • [17] Y. Tanimura, Y. Fujishige, T. I, S. Inenaga, H. Bannai, and M. Takeda. A faster algorithm for computing maximal \alpha-gapped repeats in a string. In SPIRE 2015, pages 124–136, 2015.
  • [18] R. E. Tarjan. A class of algorithms which require nonlinear time to maintain disjoint sets. J. Comput. Syst. Sci., 18(2):110–127, 1979.
  • [19] P. Weiner. Linear pattern-matching algorithms. In Proc. of 14th IEEE Ann. Symp. on Switching and Automata Theory, pages 1–11, 1973.
  • [20] J. Westbrook. Fast incremental planarity testing. In ICALP 1992, pages 342–353, 1992.
  • [21] J. Yamamoto, T. I, H. Bannai, S. Inenaga, and M. Takeda. Faster compact on-line Lempel-Ziv factorization. In STACS 2014, pages 675–686, 2014.