This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

Breaking a Barrier in Constructing Compact Indexes for Parameterized Pattern Matching

Kento Iseri
Kyushu Institute of Technology, Fukuoka, Japan
[email protected]

Tomohiro  I
Kyushu Institute of Technology, Fukuoka, Japan
[email protected]

Diptarama Hendrian
Tohoku University, Sendai, Japan
[email protected]

Dominik Köppl
Department of Computer Science, Universität Münster, Germany
[email protected]

Ryo Yoshinaka
Tohoku University, Sendai, Japan
[email protected]

Ayumi Shinohara
Tohoku University, Sendai, Japan
[email protected]
Abstract

A parameterized string (p-string) is a string over an alphabet (ΣsΣp)(\Sigma_{s}\cup\Sigma_{p}), where Σs\Sigma_{s} and Σp\Sigma_{p} are disjoint alphabets for static symbols (s-symbols) and for parameter symbols (p-symbols), respectively. Two p-strings xx and yy are said to parameterized match (p-match) if and only if xx can be transformed into yy by applying a bijection on Σp\Sigma_{p} to every occurrence of p-symbols in xx. The indexing problem for p-matching is to preprocess a p-string TT of length nn so that we can efficiently find the occurrences of substrings of TT that p-match with a given pattern. Extending the Burrows-Wheeler Transform (BWT) based index for exact string pattern matching, Ganguly et al. [SODA 2017] proposed the first compact index (named pBWT) for p-matching, and posed an open problem on how to construct it in compact space, i.e., in O(nlg|ΣsΣp|)O(n\lg|\Sigma_{s}\cup\Sigma_{p}|) bits of space. Hashimoto et al. [SPIRE 2022] partially solved this problem by showing how to construct some components of pBWTs for TT in O(n|Σp|lgnlglgn)O(n\frac{|\Sigma_{p}|\lg n}{\lg\lg n}) time in an online manner while reading the symbols of TT from right to left. In this paper, we improve the time complexity to O(nlg|Σp|lgnlglgn)O(n\frac{\lg|\Sigma_{p}|\lg n}{\lg\lg n}). We remark that removing the multiplicative factor of |Σp||\Sigma_{p}| from the complexity is of great interest because it has not been achieved for over a decade in the construction of related data structures like parameterized suffix arrays even in the offline setting. We also show that our data structure can support backward search, a core procedure of BWT-based indexes, at any stage of the online construction, making it the first compact index for p-matching that can be constructed in compact space and even in an online manner.

1 Introduction

A parameterized string (p-string) is a string over an alphabet (ΣsΣp)(\Sigma_{s}\cup\Sigma_{p}), where Σs\Sigma_{s} and Σp\Sigma_{p} are disjoint alphabets for static symbols (s-symbols) and for parameter symbols (p-symbols), respectively. Two p-strings xx and yy are said to parameterized match (p-match) if and only if xx can be transformed into yy by applying a bijection on Σp\Sigma_{p} to every occurrence of p-symbols in xx. P-matching was introduced by Baker aiming at software maintenance and plagiarism detection  [1, 2, 3], and has been extensively studied in the last decades (see a recent survey [21] and references therein).

The indexing problem for p-matching is to preprocess a p-string TT of length nn so that we can efficiently find the occurrences of substrings of TT that p-match with a given pattern. Extending indexes for exact string pattern matching, there have been proposed several data structures that can be used as indexes for p-matching, e.g., parameterized suffix trees [1, 19, 2, 3], parameterized suffix arrays [7, 17, 4, 11], parameterized suffix trays [13], parameterized DAWGs [23], parameterized position heaps [8, 10, 12] and parameterized Burrows-Wheeler Transform (pBWT) [14, 18, 15].

Among these indexes, pBWTs are the most space economic, consuming nlgσ+O(n)n\lg\sigma+O(n) bits [14] or 2nlgσ+2n+o(n)2n\lg\sigma+2n+o(n) bits with a simplified version proposed in [18], where σ\sigma is the alphabet size. Let σs\sigma_{s} and respectively σp\sigma_{p} be the numbers of distinct s-symbols and p-symbols that appear in TT. The pBWT of TT can be constructed via the parameterized suffix tree of TT for which O(n(lgσs+lgσp))O(n(\lg\sigma_{s}+\lg\sigma_{p}))-time or randomized O(n)O(n)-time construction algorithms are known [19, 6, 20], but the intermediate memory footprint of O(nlgn)O(n\lg n) bits could be intolerable when nn is significantly larger than σ\sigma. Hashimoto et al. [16] showed how to construct some components of the pBWT of [18] for TT in O(nσplgnlglgn)O(n\frac{\sigma_{p}\lg n}{\lg\lg n}) time in an online manner while reading the symbols of TT from right to left.

In this paper, we improve the time complexity of [16] to O(nlgσplgnlglgn)O(n\frac{\lg\sigma_{p}\lg n}{\lg\lg n}). Removing the multiplicative factor of σp\sigma_{p} from the time complexity of [16] is of great interest because it has not been achieved for over a decade in the construction of related data structures like parameterized suffix arrays even in an offline setting [11]. We also show that our data structure can support backward search, a core procedure of BWT-based indexes, at any stage of the online construction, making it the first compact index for p-matching that can be constructed in compact space and even in an online manner. This is not likely to be achieved with the previous work [16] due to the lack of support for 2D range counting queries in the data structure it uses.

Our computational assumptions are as follows:

  • We assume a standard Word-RAM model with word size Ω(lgn)\Omega(\lg n).

  • Each symbol in (ΣsΣp)(\Sigma_{s}\cup\Sigma_{p}) is represented by O(lgn)O(\lg n) bits.

  • We can distinguish symbols in Σs\Sigma_{s} and Σp\Sigma_{p} in O(1)O(1) time, e.g., by having some flag bits or thresholds separating them each other.

  • The order on s-symbols are determined in O(1)O(1) time based on their bit representations.

An index of a p-string TT for p-matching is to support, given a pattern ww, the counting queries that compute the number of occurrences of substrings that p-match with ww and the locating queries that compute the occurrences of ww. Our main result is:

Theorem 1.

For a p-string TT of length nn over an alphabet (ΣsΣp)[0..σ](\Sigma_{s}\cup\Sigma_{p})\subseteq[0..\sigma], an index of TT for p-matching can be constructed online in O(nlgσplgnlglgn)O(n\frac{\lg\sigma_{p}\lg n}{\lg\lg n}) time and O(nlgσ)O(n\lg\sigma) bits of space, where σp\sigma_{p} is the number of distinct p-symbols used in the p-string. At any stage of the online construction, it can support the counting queries in O(mlgσplgnlglgn)O(m\frac{\lg\sigma_{p}\lg n}{\lg\lg n}) time and the locating queries in O(mlgσplgnlglgn+𝗈𝖼𝖼lg2nlgσlglgn)O(m\frac{\lg\sigma_{p}\lg n}{\lg\lg n}+\mathsf{occ}\frac{\lg^{2}n}{\lg\sigma\lg\lg n}) time, where mm is the pattern length and 𝗈𝖼𝖼\mathsf{occ} is the number of occurrences to be reported.

2 Preliminaries

2.1 Basic notations and tools

An integer interval {i,i+1,,j}\{i,i+1,\dots,j\} is denoted by [i..j][i..j], where [i..j][i..j] represents the empty interval if i>ji>j.

Let Σ\Sigma be an ordered finite alphabet. An element of Σ\Sigma^{*} is called a string over Σ\Sigma. The length of a string ww is denoted by |w||w|. The empty string ε\varepsilon is the string of length 0, that is, |ε|=0|\varepsilon|=0. Let Σ+=Σ{ε}\Sigma^{+}=\Sigma^{*}-\{\varepsilon\} and Σk={xΣ|x|=k}\Sigma^{k}=\{x\in\Sigma^{*}\mid|x|=k\} for any non-negative integer kk. The concatenation of two strings xx and yy is denoted by xyx\cdot y or simply xyxy. When a string ww is represented by the concatenation of strings xx, yy and zz (i.e. w=xyzw=xyz), then xx, yy and zz are called a prefix, substring, and suffix of ww, respectively. A substring xx of ww is called proper if xwx\neq w.

The ii-th symbol of a string ww is denoted by w[i]w[i] for 1i|w|1\leq i\leq|w|, and the substring of a string ww that begins at position ii and ends at position jj is denoted by w[i..j]w[i..j] for 1ij|w|1\leq i\leq j\leq|w|, i.e., w[i..j]=w[i]w[i+1]w[j]w[i..j]=w[i]w[i+1]\cdots w[j]. For convenience, let w[i..j]=εw[i..j]=\varepsilon if j<ij<i; further let w[..i]=w[1..i]w[..i]=w[1..i] and w[i..]=w[i..|w|]w[i..]=w[i..|w|] denote abbreviations for the prefix of length ii and the suffix starting at position ii, respectively. For two strings xx and yy, let 𝗅𝖼𝗉(x,y)\mathsf{lcp}(x,y) denote the length of the longest common prefix between xx and yy. We consider the lexicographic order over Σ\Sigma^{*} by extending the strict total order << defined on Σ\Sigma: xx is lexicographically smaller than yy (denoted as x<yx<y) if and only if either xx is a proper prefix of yy or x[𝗅𝖼𝗉(x,y)+1]<y[𝗅𝖼𝗉(x,y)+1]x[\mathsf{lcp}(x,y)+1]<y[\mathsf{lcp}(x,y)+1] holds.

For any string ww, character cc, and position i(1i|w|)i~{}(1\leq i\leq|w|), 𝗋𝖺𝗇𝗄c(w,i)\mathsf{rank}_{c}(w,i) returns the number of occurrences of cc in w[..i]w[..i] and 𝗌𝖾𝗅𝖾𝖼𝗍c(w,i)\mathsf{select}_{c}(w,i) returns the ii-th occurrence of cc in ww. For 1ij|w|1\leq i\leq j\leq|w|, a range minimum query 𝖱𝗆𝖰w(i,j)\mathsf{RmQ}_{w}(i,j) asks for argminikj{w[k]}\arg\min_{i\leq k\leq j}\{w[k]\}. We also consider find previous/next queries 𝖥𝖯𝖰p(w,i)\mathsf{FPQ}_{p}(w,i) and 𝖥𝖭𝖰p(w,i)\mathsf{FNQ}_{p}(w,i), where pp is a predicate either in the form of “cc” (equal to cc), “<c<c” (less than cc) or “c\geq c” (larger than or equal to cc): 𝖥𝖯𝖰p(w,i)\mathsf{FPQ}_{p}(w,i) returns the largest position jij\leq i at which w[j]w[j] satisfies the predicate pp. Symmetrically, 𝖥𝖭𝖰p(w,i)\mathsf{FNQ}_{p}(w,i) returns the smallest position jij\geq i at which w[j]w[j] satisfies the predicate pp. For example with the integer string w=[2,5,10,6,8,3,14,5]w=[2,5,10,6,8,3,14,5], 𝖥𝖭𝖰5(w,4)=8\mathsf{FNQ}_{5}(w,4)=8, 𝖥𝖭𝖰6(w,4)=4\mathsf{FNQ}_{6}(w,4)=4, 𝖥𝖯𝖰5(w,4)=2\mathsf{FPQ}_{5}(w,4)=2, 𝖥𝖭𝖰<5(w,4)=6\mathsf{FNQ}_{<5}(w,4)=6, 𝖥𝖯𝖰<5(w,4)=1\mathsf{FPQ}_{<5}(w,4)=1, 𝖥𝖭𝖰9(w,4)=7\mathsf{FNQ}_{\geq 9}(w,4)=7 and 𝖥𝖯𝖰9(w,4)=3\mathsf{FPQ}_{\geq 9}(w,4)=3.

If the answer of 𝗌𝖾𝗅𝖾𝖼𝗍c(w,i)\mathsf{select}_{c}(w,i), 𝖥𝖯𝖰p(w,i)\mathsf{FPQ}_{p}(w,i) or 𝖥𝖭𝖰p(w,i)\mathsf{FNQ}_{p}(w,i) does not exist, it is just ignored. To handle this case of non-existence, we would use them in an expression with min\min or max\max: For example, max{1,𝖥𝖯𝖰p(w,i)}\max\{1,\mathsf{FPQ}_{p}(w,i)\} returns 11 if 𝖥𝖯𝖰p(w,i)\mathsf{FPQ}_{p}(w,i) does not exist.

Dynamic strings should support insertion/deletion of a symbol to/from any position as well as fast random access. We use the following result:

Lemma 2 ([22]).

A dynamic string of length nn over an alphabet [0..σ][0..\sigma] can be implemented while supporting random access, insertion, deletion, 𝗋𝖺𝗇𝗄\mathsf{rank} and 𝗌𝖾𝗅𝖾𝖼𝗍\mathsf{select} queries in (n+o(n))lgσ(n+o(n))\lg\sigma bits of space and O(lgnlglgn)O(\frac{\lg n}{\lg\lg n}) query and update times.

Dynamic binary strings equipped with 𝗋𝖺𝗇𝗄\mathsf{rank} and 𝗌𝖾𝗅𝖾𝖼𝗍\mathsf{select} queries can be used as a building block for the dynamic wavelet matrix [5] of a string over an alphabet [0..σ][0..\sigma] to support queries beyond 𝗋𝖺𝗇𝗄\mathsf{rank} and 𝗌𝖾𝗅𝖾𝖼𝗍\mathsf{select}. The idea is that each of the other queries can be simulated by performing one of the building block queries on every level of the wavelet matrix, which has lgσ\lceil\lg\sigma\rceil levels, cf. [24, Section 6.2.].

Lemma 3.

A dynamic string of length nn over an alphabet [0..σ][0..\sigma] with σ=O(n)\sigma=O(n) can be implemented while supporting random access, insertion, deletion, 𝗋𝖺𝗇𝗄\mathsf{rank}, 𝗌𝖾𝗅𝖾𝖼𝗍\mathsf{select}, 𝖱𝗆𝖰\mathsf{RmQ}, 𝖥𝖯𝖰\mathsf{FPQ} and 𝖥𝖭𝖰\mathsf{FNQ} queries in (n+o(n))lgσ+O(lgσlgn)(n+o(n))\lceil\lg\sigma\rceil+O(\lg\sigma\lg n) bits of space and O(lgσlgnlglgn)O(\frac{\lg\sigma\lg n}{\lg\lg n}) query and update times.

2.2 Parameterized strings

Let Σs\Sigma_{s} and Σp\Sigma_{p} denote two disjoint sets of symbols. We call a symbol in Σs\Sigma_{s} a static symbol (s-symbol) and a symbol in Σp\Sigma_{p} a parameter symbol (p-symbol). A parameterized string (p-string) is a string over (ΣsΣp)(\Sigma_{s}\cup\Sigma_{p}). Let \infty represent a symbol that is larger than any integer, and let 𝐍=𝐍+{}\mathbf{N}_{\infty}=\mathbf{N}_{+}\cup\{\infty\} be the set of natural positive numbers 𝐍+\mathbf{N}_{+} including infinity. We assume that 𝐍Σs=\mathbf{N}_{\infty}\cap\Sigma_{s}=\emptyset and (𝐍Σs)(\mathbf{N}_{\infty}\cup\Sigma_{s}) is an ordered alphabet such that all s-symbols are smaller than any element in 𝐍\mathbf{N}_{\infty}. Let $\$ be the smallest s-symbol, which will be used as an end-marker of p-strings. For any p-string ww the p-encoded string w\langle w\rangle of ww, also proposed as 𝚙𝚛𝚎𝚟(w)\mathtt{prev}_{\infty}(w) in [18], is the string in (𝐍Σs)|w|(\mathbf{N}_{\infty}\cup\Sigma_{s})^{|w|} such that

w[i]={w[i]if w[i]Σs,if w[i]Σp and w[i] does not appear in w[..i1],ijotherwise,\langle w\rangle[i]=\begin{cases}w[i]&\mbox{if $w[i]\in\Sigma_{s}$},\\ \infty&\mbox{if $w[i]\in\Sigma_{p}$ and $w[i]$ does not appear in $w[..i-1]$},\\ i-j&\mbox{otherwise,}\end{cases}

where jj is the largest position in [1..i1][1..i-1] with w[i]=w[j]w[i]=w[j]. In other words, p-encoding transforms an occurrence of a p-symbol into the distance to the previous occurrence of the same p-symbol, or \infty if it is the leftmost occurrence. By definition, p-encoding is prefix-consistent, i.e., w=wc[..|w|]\langle w\rangle=\langle wc\rangle[..|w|] for any symbol c(ΣsΣp)c\in(\Sigma_{s}\cup\Sigma_{p}). On the other hand, w\langle w\rangle and cw[2..]\langle cw\rangle[2..] differ if and only if cΣpc\in\Sigma_{p} occurs in ww. If it is the case, the leftmost occurrence hh of cc in ww is the unique position such that w\langle w\rangle and cw[2..]\langle cw\rangle[2..] differ with w[h]=\langle w\rangle[h]=\infty and (cw[2..])[h]=cw[h+1]=h(\langle cw\rangle[2..])[h]=\langle cw\rangle[h+1]=h, i.e., h=𝗌𝖾𝗅𝖾𝖼𝗍c(w,1)h=\mathsf{select}_{c}(w,1) and h+1=𝗌𝖾𝗅𝖾𝖼𝗍c(cw,2)h+1=\mathsf{select}_{c}(cw,2).

For any p-string ww, let |w|p|w|_{p} denote the number of distinct p-symbols in ww, i.e., |w|p=𝗋𝖺𝗇𝗄(w,|w|)|w|_{p}=\mathsf{rank}_{\infty}(\langle w\rangle,|w|). We define a function π\pi that maps a p-string ww over (ΣsΣp)(\Sigma_{s}\cup\Sigma_{p}) to an element in (Σs[1..|w|p])(\Sigma_{s}\cup[1..|w|_{p}]) such that

π(w)={w[1]if w[1]Σs,|w[..h+1]|potherwise,\pi(w)=\begin{cases}w[1]&\mbox{if $w[1]\in\Sigma_{s}$},\\ |w[..h+1]|_{p}&\mbox{otherwise,}\end{cases}

where h+1=min{|w|,𝗌𝖾𝗅𝖾𝖼𝗍w[1](w,2)}h+1=\min\{|w|,\mathsf{select}_{w[1]}(w,2)\}. In the second case, π(w)\pi(w) represents the rank of p-symbol w[1]w[1] when ordering all p-symbols by their leftmost occurrences in w[2..]w[2..], considering the rank of p-symbols not in w[2..]w[2..] to be |w|p|w|_{p}. If 𝗌𝖾𝗅𝖾𝖼𝗍w[1](w,2)\mathsf{select}_{w[1]}(w,2) exists, it holds that h=𝗌𝖾𝗅𝖾𝖼𝗍(w[2..],π(w))h=\mathsf{select}_{\infty}(\langle w[2..]\rangle,\pi(w)). For convenience, let π(ε)=$\pi(\varepsilon)=\$.

For two p-strings uu and vv, 𝗅𝖼𝗉(u,v)\mathsf{lcp}^{\infty}(\langle u\rangle,\langle v\rangle) denotes the number of \infty’s in the longest common prefix of u\langle u\rangle and v\langle v\rangle.

The following lemma states basic but important properties of the p-string encoding and π\pi, which immediately follow from the definition (see Fig. 1 for illustrations).

Lemma 4.

For any p-strings xx and yy over (ΣsΣp)(\Sigma_{s}\cup\Sigma_{p}) with λ=𝗅𝖼𝗉(x[2..],y[2..])\lambda=\mathsf{lcp}(\langle x[2..]\rangle,\langle y[2..]\rangle), e=𝗅𝖼𝗉(x[2..],y[2..])e=\mathsf{lcp}^{\infty}(\langle x[2..]\rangle,\langle y[2..]\rangle) and x[2..]<y[2..]\langle x[2..]\rangle<\langle y[2..]\rangle, Table 1 shows a complete list of cases for 𝗅𝖼𝗉(x,y)\mathsf{lcp}(\langle x\rangle,\langle y\rangle), 𝗅𝖼𝗉(x,y)\mathsf{lcp}^{\infty}(\langle x\rangle,\langle y\rangle) and the lexicographic order between x\langle x\rangle and y\langle y\rangle, where a case starting with A and B is under the condition that at least one of π(x)\pi(x) and π(y)\pi(y) is in Σs\Sigma_{s}, and respectively none of π(x)\pi(x) and π(y)\pi(y) is in Σs\Sigma_{s}.

Table 1: All cases considered in Lemma 4. On the one hand, a case starting with letter A assumes that at least one of π(x)\pi(x) and π(y)\pi(y) is in Σs\Sigma_{s}, while on the other hand, a case starting with letter B assumes that none of π(x)\pi(x) and π(y)\pi(y) is in Σs\Sigma_{s}. Here, h=𝗌𝖾𝗅𝖾𝖼𝗍x[1](x[2..],1)h=\mathsf{select}_{x[1]}(x[2..],1) and h=𝗌𝖾𝗅𝖾𝖼𝗍y[1](y[2..],1)h^{\prime}=\mathsf{select}_{y[1]}(y[2..],1).
cases additional conditions 𝗅𝖼𝗉(x,y)\mathsf{lcp}(\langle x\rangle,\langle y\rangle) 𝗅𝖼𝗉(x,y)\mathsf{lcp}^{\infty}(\langle x\rangle,\langle y\rangle) lexicographic order
(A1) π(x)π(y)\pi(x)\neq\pi(y) 0 0 x<y\langle x\rangle<\langle y\rangle iff π(x)<π(y)\pi(x)<\pi(y)
(A2) π(x)=π(y)\pi(x)=\pi(y) λ+1\lambda+1 ee x<y\langle x\rangle<\langle y\rangle
(B1) π(x)=π(y)e\pi(x)=\pi(y)\leq e λ+1\lambda+1 ee x<y\langle x\rangle<\langle y\rangle
(B2) π(x)e\pi(x)\leq e and π(x)<π(y)\pi(x)<\pi(y) hh π(x)\pi(x) x<y\langle x\rangle<\langle y\rangle
(B3) π(y)e\pi(y)\leq e and π(y)<π(x)\pi(y)<\pi(x) hh^{\prime} π(y)\pi(y) y<x\langle y\rangle<\langle x\rangle
(B4) e<min{π(x),π(y)}e<\min\{\pi(x),\pi(y)\} λ+1\lambda+1 e+1e+1 x<y\langle x\rangle<\langle y\rangle
Refer to caption
Figure 1: Illustrations for the cases of Lemma 4. Each right arrow represents the longest common prefix of two p-encoded strings, and the lexicographic order between them is determined by the following p-encoded symbols. For Case (B1), h=𝗌𝖾𝗅𝖾𝖼𝗍x[1](x[2..],1)=𝗌𝖾𝗅𝖾𝖼𝗍y[1](y[2..],1)h=\mathsf{select}_{x[1]}(x[2..],1)=\mathsf{select}_{y[1]}(y[2..],1). For Case (B2)-(B4) and (B4)’, h=𝗌𝖾𝗅𝖾𝖼𝗍x[1](x[2..],1)h=\mathsf{select}_{x[1]}(x[2..],1) and h=𝗌𝖾𝗅𝖾𝖼𝗍y[1](y[2..],1)h^{\prime}=\mathsf{select}_{y[1]}(y[2..],1). Case (B4)’ illustrates the case with b=b=\infty, which is included in Case (B4).

By Lemma 4, we have the following corollaries:

Corollary 5.

For any p-strings xx and yy, 𝗅𝖼𝗉(x,y)𝗅𝖼𝗉(x[2..],y[2..])+1\mathsf{lcp}^{\infty}(\langle x\rangle,\langle y\rangle)\leq\mathsf{lcp}^{\infty}(\langle x[2..]\rangle,\langle y[2..]\rangle)+1.

Corollary 6.

For any p-strings xx and yy with π(x)=π(y)\pi(x)=\pi(y), x<yx<y if and only if x<y\langle x\rangle<\langle y\rangle.

Corollary 7.

For any p-strings xx and yy with π(x)𝗅𝖼𝗉(x[2..],y[2..])<π(y)\pi(x)\leq\mathsf{lcp}^{\infty}(\langle x[2..]\rangle,\langle y[2..]\rangle)<\pi(y), it holds that x[..h+1]<x<x[..h+1]<y\langle x\rangle[..h+1]<\langle x\rangle<\langle x\rangle[..h^{\prime}+1]<\langle y\rangle, where h=𝗌𝖾𝗅𝖾𝖼𝗍x[1](x[2..],1)h=\mathsf{select}_{x[1]}(x[2..],1) and h=𝗌𝖾𝗅𝖾𝖼𝗍y[1](y[2..],1)h^{\prime}=\mathsf{select}_{y[1]}(y[2..],1).

Table 2: An example of 𝖱T1(i)\mathsf{R}^{-1}_{T}(i), 𝖫𝖢𝖯T\mathsf{LCP}^{\infty}_{T}, 𝖫T\mathsf{L}_{T} and 𝖥T\mathsf{F}_{T} for a p-string T=𝚡𝚢𝚊𝚣𝚢𝚡𝚊𝚣𝚡𝚣𝚊$T=\mathtt{xyazyxazxza\$} with Σs={𝚊}\Sigma_{s}=\{\mathtt{a}\} and Σp={𝚡,𝚢,𝚣}\Sigma_{p}=\{\mathtt{x},\mathtt{y},\mathtt{z}\}.
ii T[i..]T[i..] T[i..]\langle T[i..]\rangle 𝖱T1(i)\mathsf{R}^{-1}_{T}(i) 𝖫𝖢𝖯T[i]\mathsf{LCP}^{\infty}_{T}[i] 𝖫T[i]\mathsf{L}_{T}[i] 𝖥T[i]\mathsf{F}_{T}[i] T[𝖱T1(i)..]\langle T[\mathsf{R}^{-1}_{T}(i)..]\rangle
1 𝚡𝚢𝚊𝚣𝚢𝚡𝚊𝚣𝚡𝚣𝚊$\mathtt{xyazyxazxza\$} 𝚊𝟹𝟻𝚊𝟺𝟹𝟸𝚊$\mathtt{\infty\infty a\infty 35a432a\$} 12 0 𝚊\mathtt{a} $\mathtt{\$} $\$
2 𝚢𝚊𝚣𝚢𝚡𝚊𝚣𝚡𝚣𝚊$\mathtt{yazyxazxza\$} 𝚊𝟹𝚊𝟺𝟹𝟸𝚊$\mathtt{\infty a\infty 3\infty a432a\$} 11 0 1 𝚊\mathtt{a} 𝚊$\mathtt{a\$}
3 𝚊𝚣𝚢𝚡𝚊𝚣𝚡𝚣𝚊$\mathtt{azyxazxza\$} 𝚊𝚊𝟺𝟹𝟸𝚊$\mathtt{a\infty\infty\infty a432a\$} 7 0 2 𝚊\mathtt{a} 𝚊𝟸𝚊$\mathtt{a\infty\infty 2a\$}
4 𝚣𝚢𝚡𝚊𝚣𝚡𝚣𝚊$\mathtt{zyxazxza\$} 𝚊𝟺𝟹𝟸𝚊$\mathtt{\infty\infty\infty a432a\$} 3 2 2 𝚊\mathtt{a} 𝚊𝚊𝟺𝟹𝟸𝚊$\mathtt{a\infty\infty\infty a432a\$}
5 𝚢𝚡𝚊𝚣𝚡𝚣𝚊$\mathtt{yxazxza\$} 𝚊𝟹𝟸𝚊$\mathtt{\infty\infty a\infty 32a\$} 10 0 2 1 𝚊$\mathtt{\infty a\$}
6 𝚡𝚊𝚣𝚡𝚣𝚊$\mathtt{xazxza\$} 𝚊𝟹𝟸𝚊$\mathtt{\infty a\infty 32a\$} 6 1 3 2 𝚊𝟹𝟸𝚊$\mathtt{\infty a\infty 32a\$}
7 𝚊𝚣𝚡𝚣𝚊$\mathtt{azxza\$} 𝚊𝟸𝚊$\mathtt{a\infty\infty 2a\$} 2 2 3 2 𝚊𝟹𝚊𝟺𝟹𝟸𝚊$\mathtt{\infty a\infty 3\infty a432a\$}
8 𝚣𝚡𝚣𝚊$\mathtt{zxza\$} 𝟸𝚊$\mathtt{\infty\infty 2a\$} 9 1 2 2 𝚊$\mathtt{\infty\infty a\$}
9 𝚡𝚣𝚊$\mathtt{xza\$} 𝚊$\mathtt{\infty\infty a\$} 5 2 3 3 𝚊𝟹𝟸𝚊$\mathtt{\infty\infty a\infty 32a\$}
10 𝚣𝚊$\mathtt{za\$} 𝚊$\mathtt{\infty a\$} 1 3 $\mathtt{\$} 3 𝚊𝟹𝟻𝚊𝟺𝟹𝟸𝚊$\mathtt{\infty\infty a\infty 35a432a\$}
11 𝚊$\mathtt{a\$} 𝚊$\mathtt{a\$} 8 2 𝚊\mathtt{a} 2 𝟸𝚊$\mathtt{\infty\infty 2a\$}
12 $\mathtt{\$} $\mathtt{\$} 4 2 𝚊\mathtt{a} 3 𝚊𝟺𝟹𝟸𝚊$\mathtt{\infty\infty\infty a432a\$}

Let TT be a p-string that has the smallest s-symbol $\$ as its end-marker, i.e., T[|T|]=$T[|T|]=\$ and $\$ does not appear anywhere else in TT. The suffix rank function 𝖱T:[1..|T|][1..|T|]\mathsf{R}_{T}:[1..|T|]\rightarrow[1..|T|] for TT maps a position i(1i|T|)i~{}(1\leq i\leq|T|) to the lexicographic rank of T[i..]\langle T[i..]\rangle in {T[j..]1j|T|}\{\langle T[j..]\rangle\mid 1\leq j\leq|T|\}. Its inverse function 𝖱T1(i)\mathsf{R}^{-1}_{T}(i) returns the starting position of the lexicographically ii-th p-encoded suffix of TT.

The main components of parameterized Burrows-Wheeler Transform (pBWT) of TT are 𝖥T\mathsf{F}_{T} and 𝖫T\mathsf{L}_{T}. 𝖥T\mathsf{F}_{T} (resp. 𝖫T\mathsf{L}_{T}) is defined to be the string of length |T||T| such that 𝖥T[i]=π(T[𝖱T1(i)..])\mathsf{F}_{T}[i]=\pi(T[\mathsf{R}^{-1}_{T}(i)..]) (resp. 𝖫T[i]=π(T[𝖱T1(i)1..])\mathsf{L}_{T}[i]=\pi(T[\mathsf{R}^{-1}_{T}(i)-1..])), where we assume that T[0..]=$T[0..]=\$. 111Previous studies [14, 18, 16] define pBWTs based on sorted cyclic rotations, but our suffix-based definition is more suitable for online construction to prevent unnecessary update on 𝖥T\mathsf{F}_{T} and 𝖫T\mathsf{L}_{T}. Since {T[𝖱T1(i)..]1i|T|}={T[𝖱T1(i)1..]1i|T|}\{T[\mathsf{R}^{-1}_{T}(i)..]\mid 1\leq i\leq|T|\}=\{T[\mathsf{R}^{-1}_{T}(i)-1..]\mid 1\leq i\leq|T|\} is equivalent to the set of all non-empty suffixes of TT, 𝖥T\mathsf{F}_{T} is a permutation of 𝖫T\mathsf{L}_{T}.

The so-called LF-mapping 𝖫𝖥T\mathsf{LF}_{T} maps a position ii to 𝖱T(𝖱T1(i)1)\mathsf{R}_{T}(\mathsf{R}^{-1}_{T}(i)-1) if 𝖱T1(i)>1\mathsf{R}^{-1}_{T}(i)>1, and otherwise 𝖱T(|T|)=1\mathsf{R}_{T}(|T|)=1. By definition and Corollary 6, we have:

Corollary 8.

For any p-string TT and any integers i,ji,j with 1i<j|T|1\leq i<j\leq|T|, 𝖫𝖥T(i)<𝖫𝖥T(j)\mathsf{LF}_{T}(i)<\mathsf{LF}_{T}(j) if 𝖫T[i]=𝖫T[j]\mathsf{L}_{T}[i]=\mathsf{L}_{T}[j].

Thanks to Corollary 8, it holds that 𝖫𝖥T(i)=𝗌𝖾𝗅𝖾𝖼𝗍c(𝖥T,𝗋𝖺𝗇𝗄c(𝖫T,i))\mathsf{LF}_{T}(i)=\mathsf{select}_{c}(\mathsf{F}_{T},\mathsf{rank}_{c}(\mathsf{L}_{T},i)), where c=𝖫T[i]c=\mathsf{L}_{T}[i]. The inverse function 𝖥𝖫T\mathsf{FL}_{T} of 𝖫𝖥T\mathsf{LF}_{T} can be computed by 𝖥𝖫T(i)=𝗌𝖾𝗅𝖾𝖼𝗍c(𝖫T,𝗋𝖺𝗇𝗄c(𝖥T,i))\mathsf{FL}_{T}(i)=\mathsf{select}_{c}(\mathsf{L}_{T},\mathsf{rank}_{c}(\mathsf{F}_{T},i)), where c=𝖥T[i]c=\mathsf{F}_{T}[i].

Let 𝖫𝖢𝖯T\mathsf{LCP}^{\infty}_{T} be the string of length |T||T| such that 𝖫𝖢𝖯T[0]=0\mathsf{LCP}^{\infty}_{T}[0]=0 and 𝖫𝖢𝖯T[i]=𝗅𝖼𝗉(T[𝖱T1(i1)..],T[𝖱T1(i)..])\mathsf{LCP}^{\infty}_{T}[i]=\mathsf{lcp}^{\infty}(T[\mathsf{R}^{-1}_{T}(i-1)..],T[\mathsf{R}^{-1}_{T}(i)..]) for any 1<i|T|1<i\leq|T|. An example of all explained arrays is given in Table 2.

3 Online construction algorithm

For online construction of our index for p-matching, we consider maintaining dynamic data structures for 𝖥T\mathsf{F}_{T}, 𝖫T\mathsf{L}_{T} and 𝖫𝖢𝖯T\mathsf{LCP}^{\infty}_{T} while prepending a symbol to the current p-string TT. The details of the data structures will be presented in Subsection 3.3. In what follows, we focus on a single step of updating TT to T^=cT\hat{T}=cT for some symbol cc in ΣsΣp\Sigma_{s}\cup\Sigma_{p}. Note that 𝖥T\mathsf{F}_{T}, 𝖫T\mathsf{L}_{T} and 𝖫𝖢𝖯T\mathsf{LCP}^{\infty}_{T} are strongly related to the sorted p-encoded suffixes of a p-string and T^=cT\hat{T}=cT is the only suffix that was not in the suffixes of TT. Let k=𝖱T(1)k=\mathsf{R}_{T}(1) and k^=𝖱T^(1)\hat{k}=\mathsf{R}_{\hat{T}}(1). In order to deal with the new emerging suffix T^\hat{T}, we compute the lexicographic rank k^\hat{k} of T^\langle\hat{T}\rangle in the non-empty p-encoded suffixes of T^\hat{T}. Then 𝖥T^\mathsf{F}_{\hat{T}} and 𝖫T^\mathsf{L}_{\hat{T}} can be obtained by replacing $\$ in 𝖫T\mathsf{L}_{T} at kk by π(T^)\pi(\hat{T}) and inserting $\$ and π(T^)\pi(\hat{T}) into the k^\hat{k}-th position of 𝖫T\mathsf{L}_{T} and 𝖥T\mathsf{F}_{T}, respectively. In Subsection 3.1, we propose our algorithm to compute k^\hat{k}. For updating 𝖫𝖢𝖯\mathsf{LCP}^{\infty}, we have to compute the 𝗅𝖼𝗉\mathsf{lcp}^{\infty}-values for T^\langle\hat{T}\rangle with its lexicographically adjacent p-encoded suffixes, which will be treated in Subsection 3.2.

3.1 How to compute k^\hat{k}

Unlike the existing work [16] that computes k^\hat{k} by counting the number of p-encoded suffixes that are lexicographically smaller than T^\langle\hat{T}\rangle, we get k^\hat{k} indirectly by computing the rank of a lexicographically closest (smaller or larger) p-encoded suffix to T^\langle\hat{T}\rangle. The lexicographically smaller (resp. larger) closest element in {T[i..]1i|T|}\{\langle T[i..]\rangle\mid 1\leq i\leq|T|\} to T^\langle\hat{T}\rangle is called the p-pred (resp. p-succ) of T^\langle\hat{T}\rangle. and its rank is denoted by kk_{-} (resp. k+k_{+}). Then it holds that k^=k+=k+1\hat{k}=k_{+}=k_{-}+1.

We start with the easy case that the prepended symbol cc is an s-symbol.

Lemma 9.

Let T^=cT\hat{T}=cT be a p-string with cΣsc\in\Sigma_{s}. If p:=𝖥𝖯𝖰c(𝖫T,k)p:=\mathsf{FPQ}_{c}(\mathsf{L}_{T},k) exists, the rank kk_{-} of the p-pred of T^\hat{T} is 𝖫𝖥T(p)\mathsf{LF}_{T}(p). Otherwise, k=𝗌𝖾𝗅𝖾𝖼𝗍b(𝖥T,𝗋𝖺𝗇𝗄b(𝖥T,|T|))k_{-}=\mathsf{select}_{b}(\mathsf{F}_{T},\mathsf{rank}_{b}(\mathsf{F}_{T},|T|)), where bb is the largest s-symbol that appears in TT and smaller than cc.

Proof.

By Case (A2) of Lemma 4 (cf. Table 1), the lexicographic order of p-encoded suffixes starting with cc does not change by removing their first characters, which are all cc. If pp exists, T[𝖱T1(p)..]\langle T[\mathsf{R}^{-1}_{T}(p)..]\rangle is the lexicographically smaller closest p-encoded suffix to T\langle T\rangle that is preceded by cc. Hence, T[𝖱T1(𝖫𝖥T(p))..]=cT[𝖱T1(p)..]\langle T[\mathsf{R}^{-1}_{T}(\mathsf{LF}_{T}(p))..]\rangle=\langle cT[\mathsf{R}^{-1}_{T}(p)..]\rangle is the p-pred of cT=T^\langle cT\rangle=\langle\hat{T}\rangle, which means that k=𝖫𝖥T(p)k_{-}=\mathsf{LF}_{T}(p).

If pp does not exist, it implies that T^\langle\hat{T}\rangle is the lexicographically smallest p-encoded suffix that starts with cc. Since T^\langle\hat{T}\rangle lexicographically comes right after the p-encoded suffixes starting with an s-symbol smaller than cc, kk_{-} is the last occurrence of bb in 𝖥T\mathsf{F}_{T}, that is, k=𝗌𝖾𝗅𝖾𝖼𝗍b(𝖥T,𝗋𝖺𝗇𝗄b(𝖥T,|T|))k_{-}=\mathsf{select}_{b}(\mathsf{F}_{T},\mathsf{rank}_{b}(\mathsf{F}_{T},|T|)). ∎

In the rest of this subsection, we consider the case that cc is a p-symbol. If TT contains no p-symbol, it is clear that k=|T|k_{-}=|T|. Hence, in what follows, we assume that there is a p-symbol in TT.

Since T^\langle\hat{T}\rangle has the longest 𝗅𝖼𝗉\mathsf{lcp}-value λ^\hat{\lambda} with its p-pred or p-succ, we search for such p-encoded suffixes of TT using the following lemmas to leverage the information stored in 𝖫𝖢𝖯T\mathsf{LCP}^{\infty}_{T}.

Lemma 10.

Given two positions ii and jj with 1i<j|T|1\leq i<j\leq|T|, 𝗅𝖼𝗉(T[𝖱T1(i)..],T[𝖱T1(j)..])=𝖱𝗆𝖰𝖫𝖢𝖯T(i+1,j)\mathsf{lcp}^{\infty}(\langle T[\mathsf{R}^{-1}_{T}(i)..]\rangle,\langle T[\mathsf{R}^{-1}_{T}(j)..]\rangle)=\mathsf{RmQ}_{\mathsf{LCP}^{\infty}_{T}}(i+1,j).

Proof.

It is not difficult to see that 𝗅𝖼𝗉(x,z)=min{𝗅𝖼𝗉(x,y),𝗅𝖼𝗉(y,z)}\mathsf{lcp}(x,z)=\min\{\mathsf{lcp}(x,y),\mathsf{lcp}(y,z)\} for any strings x<y<zx<y<z, and thus, 𝗅𝖼𝗉(x,z)=min{𝗅𝖼𝗉(x,y),𝗅𝖼𝗉(y,z)}\mathsf{lcp}^{\infty}(x,z)=\min\{\mathsf{lcp}^{\infty}(x,y),\mathsf{lcp}^{\infty}(y,z)\}. Since 𝖫𝖢𝖯\mathsf{LCP}^{\infty} holds the 𝗅𝖼𝗉\mathsf{lcp}^{\infty}-values of lexicographically adjacent p-encoded suffixes, we get 𝗅𝖼𝗉(T[𝖱T1(i)..],T[𝖱T1(j)..])=min{𝖫𝖢𝖯T[g]}g=i+1j=𝖱𝗆𝖰𝖫𝖢𝖯T(i+1,j)\mathsf{lcp}^{\infty}(\langle T[\mathsf{R}^{-1}_{T}(i)..]\rangle,\langle T[\mathsf{R}^{-1}_{T}(j)..]\rangle)=\min\{\mathsf{LCP}^{\infty}_{T}[g]\}_{g=i+1}^{j}=\mathsf{RmQ}_{\mathsf{LCP}^{\infty}_{T}}(i+1,j) by applying the previous argument successively. ∎

Lemma 11.

Algorithm 1 correctly computes the maximal interval [l..r][l..r] such that 𝗅𝖼𝗉(T[𝖱T1(i)..],T[𝖱T1(j)..])e\mathsf{lcp}^{\infty}(\langle T[\mathsf{R}^{-1}_{T}(i)..]\rangle,\langle T[\mathsf{R}^{-1}_{T}(j)..]\rangle)\geq e for any j[l..r]j\in[l..r].

1 Function GetMI(ii, ee):
2       lmax{1,𝖥𝖯𝖰<e(𝖫𝖢𝖯T,i)}l\leftarrow\max\{1,\mathsf{FPQ}_{<e}(\mathsf{LCP}^{\infty}_{T},i)\};
3       rmin{|T|,𝖥𝖭𝖰<e(𝖫𝖢𝖯T,i+1)1}r\leftarrow\min\{|T|,\mathsf{FNQ}_{<e}(\mathsf{LCP}^{\infty}_{T},i+1)-1\};
4       return [l..r][l..r];
5      
Algorithm 1 Algorithm to compute the maximal interval [l..r][l..r] such that 𝗅𝖼𝗉(T[𝖱T1(i)..],T[𝖱T1(j)..])e\mathsf{lcp}^{\infty}(\langle T[\mathsf{R}^{-1}_{T}(i)..]\rangle,\langle T[\mathsf{R}^{-1}_{T}(j)..]\rangle)\geq e for any j[l..r]j\in[l..r].
Lemma 12.

Algorithm 2 correctly returns k^\hat{k}.

Proof.

Let hi=𝗌𝖾𝗅𝖾𝖼𝗍(T,i)h_{i}=\mathsf{select}_{\infty}(\langle T\rangle,i) for any 1imin{|T|p,π(T^)}1\leq i\leq\min\{|T|_{p},\pi(\hat{T})\}, and hi=|T|+1h_{i}=|T|+1 for any i>min{|T|p,π(T^)}i>\min\{|T|_{p},\pi(\hat{T})\}. Also let λ=max{𝗅𝖼𝗉(T^,T[i..])1i|T|}\lambda=\max\{\mathsf{lcp}(\langle\hat{T}\rangle,\langle T[i..]\rangle)\mid 1\leq i\leq|T|\}. Although Algorithm 2 does not intend to compute the exact value of λ\lambda, it checks if λ\lambda falls in [he..he+1][h_{e}..h_{e+1}] in decreasing order of ee starting from min{π(T^),max{𝖫𝖢𝖯T[k],𝖫𝖢𝖯T[k+1]}}\min\{\pi(\hat{T}),\max\{\mathsf{LCP}^{\infty}_{T}[k],\mathsf{LCP}^{\infty}_{T}[k+1]\}\}. One of the necessary conditions to have 𝗅𝖼𝗉(T^,T[i..])>he\mathsf{lcp}(\langle\hat{T}\rangle,\langle T[i..]\rangle)>h_{e} is that 𝗅𝖼𝗉(T,T[i+1..])he\mathsf{lcp}(\langle T\rangle,\langle T[i+1..]\rangle)\geq h_{e}, or equivalently 𝗅𝖼𝗉(T,T[i+1..])e\mathsf{lcp}^{\infty}(\langle T\rangle,\langle T[i+1..]\rangle)\geq e. Line 2 computes the maximal interval [l..r][l..r] that represents the ranks of the p-encoded suffixes having an 𝗅𝖼𝗉\mathsf{lcp}^{\infty}-value larger than or equal to ee. The basic idea is to find a p-encoded suffix in {T[𝖱T1(p)..]}p=lr\{\langle T[\mathsf{R}^{-1}_{T}(p)..]\rangle\}_{p=l}^{r} that comes closest to T^\langle\hat{T}\rangle when extended by adding its preceding symbol. When ee comes down to the point with λ[he+1..he+1]\lambda\in[h_{e}+1..h_{e+1}], k^\hat{k} is returned in one of the if-then-blocks at Lines 222 and 2.

If 𝗅𝖼𝗉(T^,T[i..])=he^\mathsf{lcp}(\langle\hat{T}\rangle,\langle T[i..]\rangle)=h_{\hat{e}} for an integer e^\hat{e}, there are two possible scenarios:

  1. 1.

    𝗅𝖼𝗉(T,T[i+1..])e^\mathsf{lcp}^{\infty}(\langle T\rangle,\langle T[i+1..]\rangle)\geq\hat{e} and either π(T^)\pi(\hat{T}) or π(T[i..])\pi(T[i..]) is e^\hat{e}, and

  2. 2.

    𝗅𝖼𝗉(T,T[i+1..])=he^1\mathsf{lcp}(\langle T\rangle,\langle T[i+1..]\rangle)=h_{\hat{e}}-1 and both π(T^)\pi(\hat{T}) and π(T[i..])\pi(T[i..]) are at least e^\hat{e}.

The former p-encoded suffix is processed in one of the if-then-blocks at Lines 2 and 2 when e=e^e=\hat{e}, while the latter at Lines 222 and 2 when e=e^1e=\hat{e}-1. Note that the former is never farther from T^\langle\hat{T}\rangle than the latter because the lexicographic order between T^\langle\hat{T}\rangle and T[i..]\langle T[i..]\rangle is determined by \infty and he^h_{\hat{e}} at he^+1h_{\hat{e}}+1 in the former case, while it is by \infty and something smaller than he^h_{\hat{e}} in the latter case. Since Algorithm 2 processes the former case first, it guarantees that the algorithm finds the closer one first.

The case with e=π(T^)e=\pi(\hat{T}) is treated differently than other cases in the if-then-block at Line 2 since hπ(T^)h_{\pi(\hat{T})} is the unique position where T[hπ(T^)]=\langle T\rangle[h_{\pi(\hat{T})}]=\infty turns into T^[hπ(T^)+1]=hπ(T^)\langle\hat{T}\rangle[h_{\pi(\hat{T})}+1]=h_{\pi(\hat{T})}. For a p-encoded suffix T[𝖱T1(q)..]{T[𝖱T1(p)..]}p=lr\langle T[\mathsf{R}^{-1}_{T}(q^{\prime})..]\rangle\in\{\langle T[\mathsf{R}^{-1}_{T}(p)..]\rangle\}_{p=l}^{r}, having 𝖫T[q]=π(T^)\mathsf{L}_{T}[q^{\prime}]=\pi(\hat{T}) is necessary and sufficient for its extended suffix T[𝖱T1(q)1..]\langle T[\mathsf{R}^{-1}_{T}(q^{\prime})-1..]\rangle to have an 𝗅𝖼𝗉\mathsf{lcp}-value larger than hπ(T^)h_{\pi(\hat{T})} with T^\hat{T}. By Corollary 6, p-encoded suffixes satisfying this condition must preserve their lexicographic order after extension, and hence, it is enough to search for the closest one (q𝖥𝖯𝖰e(𝖫T,k)q\leftarrow\mathsf{FPQ}_{e}(\mathsf{L}_{T},k) or q𝖥𝖭𝖰e(𝖫T,k)q\leftarrow\mathsf{FNQ}_{e}(\mathsf{L}_{T},k)) to T\langle T\rangle and compute the rank of its extended suffix by 𝖫𝖥T(q)\mathsf{LF}_{T}(q). If Lines 2 and 2 fail to return a value, it means that λhπ(T^)\lambda\leq h_{\pi(\hat{T})}. The if-block at Line 2 checks if there exists a p-encoded suffix T[i+1..]\langle T[i+1..]\rangle that satisfies the former condition to be 𝗅𝖼𝗉(T^,T[i..])=hπ(T^)\mathsf{lcp}(\langle\hat{T}\rangle,\langle T[i..]\rangle)=h_{\pi(\hat{T})}. It is enough to find one T[𝖱T1(q)..]\langle T[\mathsf{R}^{-1}_{T}(q)..]\rangle with 𝖫[q]π(T^)\mathsf{L}[q]\geq\pi(\hat{T}) for q[l..r]q\in[l..r] because it is necessary and sufficient to have 𝗅𝖼𝗉(T^,T[𝖱T1(q)1..])=hπ(T^)\mathsf{lcp}(\langle\hat{T}\rangle,\langle T[\mathsf{R}^{-1}_{T}(q)-1..]\rangle)=h_{\pi(\hat{T})} and T^[hπ(T^)+1]=hπ(T^)=T[𝖱T1(q)1..][hπ(T^)+1]\langle\hat{T}\rangle[h_{\pi(\hat{T})}+1]=\infty\neq h_{\pi(\hat{T})}=\langle T[\mathsf{R}^{-1}_{T}(q)-1..]\rangle[h_{\pi(\hat{T})}+1]. Note that there could be two or more p-encoded suffixes that satisfy the condition and their lexicographic order may change by extension. In the then-block at Line 2, the algorithm computes the rank of the lexicographically smallest p-encoded suffix that has an 𝗅𝖼𝗉\mathsf{lcp}^{\infty}-value larger than π(T^)\pi(\hat{T}) with T[𝖱T1(q)1..]=T[𝖱T1(𝖫𝖥T(q))..]\langle T[\mathsf{R}^{-1}_{T}(q)-1..]\rangle=\langle T[\mathsf{R}^{-1}_{T}(\mathsf{LF}_{T}(q))..]\rangle, which is the p-succ of T^\hat{T} in this case.

The case with eπ(T^)e\neq\pi(\hat{T}) is processed in the else-block at Line 2. Here, it is good to keep in mind that when we enter this else-block, 𝗅𝖼𝗉(T,T[i..])e\mathsf{lcp}^{\infty}(\langle T\rangle,\langle T[i..]\rangle)\leq e or π(T[i1..])e\pi(T[i-1..])\leq e holds for any proper suffix T[i..]T[i..] of TT, since otherwise k^\hat{k} must be reporeted in a previous round of the foreach loop.

When the if-condition at Line 2 holds, T[𝖱T1(q)..]\langle T[\mathsf{R}^{-1}_{T}(q)..]\rangle is the lexicographically smaller closest p-encoded suffix to T\langle T\rangle such that 𝗅𝖼𝗉(T^,T[𝖱T1(q)1..])e+1\mathsf{lcp}^{\infty}(\langle\hat{T}\rangle,\langle T[\mathsf{R}^{-1}_{T}(q)-1..]\rangle)\geq e+1, or equivalently 𝗅𝖼𝗉(T^,T[𝖱T1(q)1..])>he\mathsf{lcp}(\langle\hat{T}\rangle,\langle T[\mathsf{R}^{-1}_{T}(q)-1..]\rangle)>h_{e}. For any p-encoded suffix in {T[𝖱T1(p)..]}p=q+1k1\{\langle T[\mathsf{R}^{-1}_{T}(p)..]\rangle\}_{p=q+1}^{k-1} its extended suffix is lexicographically smaller than T[𝖱T1(q)1..]\langle T[\mathsf{R}^{-1}_{T}(q)-1..]\rangle due to Corollary 7, and never closer to T^\langle\hat{T}\rangle than T[𝖱T1(q)1..]\langle T[\mathsf{R}^{-1}_{T}(q)-1..]\rangle. At Line 2, the algorithm computes the maximal interval [l..r][l^{\prime}..r^{\prime}] such that every p-encoded suffix in {T[𝖱T1(p)..]}p=lr\{\langle T[\mathsf{R}^{-1}_{T}(p)..]\rangle\}_{p=l^{\prime}}^{r^{\prime}} shares the common prefix of length h:=min{|T|𝖱T1(q)+1,𝗌𝖾𝗅𝖾𝖼𝗍(T[𝖱T1(q)..],e+1)}h^{\prime}:=\min\{|T|-\mathsf{R}^{-1}_{T}(q)+1,\mathsf{select}_{\infty}(\langle T[\mathsf{R}^{-1}_{T}(q)..]\rangle,e+1)\} with T[𝖱T1(q)..]\langle T[\mathsf{R}^{-1}_{T}(q)..]\rangle. Since any T[i..]{T[𝖱T1(p)..]}p=1l1\langle T[i..]\rangle\in\{\langle T[\mathsf{R}^{-1}_{T}(p)..]\rangle\}_{p=1}^{l^{\prime}-1} has an 𝗅𝖼𝗉\mathsf{lcp}-value smaller than hh^{\prime} with T[𝖱T1(q)..]\langle T[\mathsf{R}^{-1}_{T}(q)..]\rangle, it follows from Lemma 4 that T[i1..]<T[𝖱T1(q)1..]\langle T[i-1..]\rangle<\langle T[\mathsf{R}^{-1}_{T}(q)-1..]\rangle. Also, for any T[i..]{T[𝖱T1(p)..]}p=k+1|T|\langle T[i..]\rangle\in\{\langle T[\mathsf{R}^{-1}_{T}(p)..]\rangle\}_{p=k+1}^{|T|}, the aforementioned precondition to enter the else-block at Line 2 leads to T^<T[i1..]\langle\hat{T}\rangle<\langle T[i-1..]\rangle or T[i1..]<T[𝖱T1(q)..]\langle T[i-1..]\rangle<\langle T[\mathsf{R}^{-1}_{T}(q)..]\rangle by Lemma 4. So far we have confirmed that the 𝗅𝖼𝗉\mathsf{lcp}-value between T^\langle\hat{T}\rangle and the p-pred of T^\langle\hat{T}\rangle is less than hh^{\prime}, which implies that the p-pred is the largest p-encoded suffix that is prefixed by x:=T[𝖱T1(q)1..][..h]x:=\langle T[\mathsf{R}^{-1}_{T}(q)-1..]\rangle[..h^{\prime}]. If q𝖥𝖯𝖰e+2(𝖫T,r)q^{\prime}\leftarrow\mathsf{FPQ}_{\geq e+2}(\mathsf{L}_{T},r^{\prime}) computed at Line 2 is in [l..r][l^{\prime}..r^{\prime}], T[𝖱T1(q)1..]=T[𝖫𝖥T(q)..]\langle T[\mathsf{R}^{-1}_{T}(q^{\prime})-1..]\rangle=\langle T[\mathsf{LF}_{T}(q^{\prime})..]\rangle is prefixed by xx\cdot\infty and the p-pred is the largest p-encoded suffix that is prefixed by xx\cdot\infty, which can be computed by maxGetMI(𝖫𝖥T(q),e+2)\max\textnormal{{GetMI}}(\mathsf{LF}_{T}(q^{\prime}),e+2) because T[𝖱T1(q)1..][..h+1]=T[𝖱T1(𝖫𝖥T(q)..][..h+1]=x\langle T[\mathsf{R}^{-1}_{T}(q^{\prime})-1..]\rangle[..h^{\prime}+1]=\langle T[\mathsf{R}^{-1}_{T}(\mathsf{LF}_{T}(q^{\prime})..]\rangle[..h^{\prime}+1]=x\cdot\infty contains exactly e+2e+2 \infty’s. If q[l..r]q^{\prime}\notin[l^{\prime}..r^{\prime}], the p-pred is the largest p-encoded suffix that is prefixed by xhx\cdot h^{\prime}, which is T[𝖱T1(𝖫𝖥T(q))..]\langle T[\mathsf{R}^{-1}_{T}(\mathsf{LF}_{T}(q))..]\rangle.

When the if-condition at Line 2 holds, T[𝖱T1(q)..]\langle T[\mathsf{R}^{-1}_{T}(q)..]\rangle is the lexicographically larger closest p-encoded suffix to T\langle T\rangle such that 𝗅𝖼𝗉(T^,T[𝖱T1(q)1..])e+1\mathsf{lcp}^{\infty}(\langle\hat{T}\rangle,\langle T[\mathsf{R}^{-1}_{T}(q)-1..]\rangle)\geq e+1, or equivalently 𝗅𝖼𝗉(T^,T[𝖱T1(q)1..])>he\mathsf{lcp}(\langle\hat{T}\rangle,\langle T[\mathsf{R}^{-1}_{T}(q)-1..]\rangle)>h_{e}. For any p-encoded suffix in {T[𝖱T1(p)..]}p=k+1q1\{\langle T[\mathsf{R}^{-1}_{T}(p)..]\rangle\}_{p=k+1}^{q-1} its extended suffix is lexicographically smaller than T[𝖱T1(q)1..]\langle T[\mathsf{R}^{-1}_{T}(q)-1..]\rangle due to Corollary 7, and never closer to T^\langle\hat{T}\rangle than T[𝖱T1(q)1..]\langle T[\mathsf{R}^{-1}_{T}(q)-1..]\rangle. At Line 2, the algorithm computes the maximal interval [l..r][l^{\prime}..r^{\prime}] such that every p-encoded suffix in {T[𝖱T1(p)..]}p=lr\{\langle T[\mathsf{R}^{-1}_{T}(p)..]\rangle\}_{p=l^{\prime}}^{r^{\prime}} shares the common prefix of length h:=min{|T|𝖱T1(q)+1,𝗌𝖾𝗅𝖾𝖼𝗍(T[𝖱T1(q)..],e+1)}h^{\prime}:=\min\{|T|-\mathsf{R}^{-1}_{T}(q)+1,\mathsf{select}_{\infty}(\langle T[\mathsf{R}^{-1}_{T}(q)..]\rangle,e+1)\} with T[𝖱T1(q)..]\langle T[\mathsf{R}^{-1}_{T}(q)..]\rangle. Since any T[i..]{T[𝖱T1(p)..]}p=r+1|T|\langle T[i..]\rangle\in\{\langle T[\mathsf{R}^{-1}_{T}(p)..]\rangle\}_{p=r^{\prime}+1}^{|T|} has an 𝗅𝖼𝗉\mathsf{lcp}-value smaller than hh^{\prime} with T[𝖱T1(q)..]\langle T[\mathsf{R}^{-1}_{T}(q)..]\rangle, it follows from Lemma 4 that T[i1..]<T[𝖱T1(q)1..]\langle T[i-1..]\rangle<\langle T[\mathsf{R}^{-1}_{T}(q)-1..]\rangle. Also, for any T[i..]{T[𝖱T1(p)..]}p=1k1\langle T[i..]\rangle\in\{\langle T[\mathsf{R}^{-1}_{T}(p)..]\rangle\}_{p=1}^{k-1}, the aforementioned precondition to enter the else-block at Line 2 leads to T[i1..]<T^\langle T[i-1..]\rangle<\langle\hat{T}\rangle by Lemma 4. So far we have confirmed that the 𝗅𝖼𝗉\mathsf{lcp}-value between T^\langle\hat{T}\rangle and the p-succ of T^\langle\hat{T}\rangle is less than hh^{\prime}, which implies that the p-succ is the smallest p-encoded suffix that is prefixed by x:=T[𝖱T1(q)1..][..h]x:=\langle T[\mathsf{R}^{-1}_{T}(q)-1..]\rangle[..h^{\prime}]. If q𝖥𝖭𝖰e+1(𝖫T,l)q^{\prime}\leftarrow\mathsf{FNQ}_{e+1}(\mathsf{L}_{T},l^{\prime}) computed at Line 2 is in [l..r][l^{\prime}..r^{\prime}], the p-succ is the smallest p-encoded suffix that is prefixed by xhx\cdot h^{\prime}, which is T[𝖱T1(𝖫𝖥T(q))..]\langle T[\mathsf{R}^{-1}_{T}(\mathsf{LF}_{T}(q^{\prime}))..]\rangle. If q[l..r]q^{\prime}\notin[l^{\prime}..r^{\prime}], the p-succ is the smallest p-encoded suffix that is prefixed by xx\cdot\infty, which can be computed by minGetMI(𝖫𝖥T(q),e+2)\min\textnormal{{GetMI}}(\mathsf{LF}_{T}(q),e+2) because T[𝖱T1(q)1..][..h+1]=T[𝖱T1(𝖫𝖥T(q))..][..h+1]=x\langle T[\mathsf{R}^{-1}_{T}(q)-1..]\rangle[..h^{\prime}+1]=\langle T[\mathsf{R}^{-1}_{T}(\mathsf{LF}_{T}(q))..]\rangle[..h^{\prime}+1]=x\cdot\infty contains exactly e+2e+2 \infty’s.

When we enter the if-then-block at Line 2, it is guaranteed that λhe\lambda\leq h_{e}. In order to check if there exists a p-encoded suffix T[i+1..]\langle T[i+1..]\rangle that satisfies the former condition to be 𝗅𝖼𝗉(T^,T[i..])=he\mathsf{lcp}(\langle\hat{T}\rangle,\langle T[i..]\rangle)=h_{e}, the algorithm computes q𝖥𝖯𝖰e(𝖫T,r))q\leftarrow\mathsf{FPQ}_{e}(\mathsf{L}_{T},r)). If q[l..r]q\in[l..r], T[𝖱T1(q)..]\langle T[\mathsf{R}^{-1}_{T}(q)..]\rangle is the lexicographically largest p-encoded suffix that satisfies the condition, and by Corollary 6, its extended suffix T[𝖱T1(q)1..]\langle T[\mathsf{R}^{-1}_{T}(q)-1..]\rangle must be the largest one to have 𝗅𝖼𝗉(T^,T[𝖱T1(q)1..])=he\mathsf{lcp}(\langle\hat{T}\rangle,\langle T[\mathsf{R}^{-1}_{T}(q)-1..]\rangle)=h_{e}. Therefore, T[𝖱T1(q)1..]\langle T[\mathsf{R}^{-1}_{T}(q)-1..]\rangle is the p-pred of T^\langle\hat{T}\rangle, and k^=1+𝖫𝖥T(q)\hat{k}=1+\mathsf{LF}_{T}(q). ∎

1 foreach emin{π(T^),max{𝖫𝖢𝖯T[k],𝖫𝖢𝖯T[k+1]}}e\leftarrow\min\{\pi(\hat{T}),\max\{\mathsf{LCP}^{\infty}_{T}[k],\mathsf{LCP}^{\infty}_{T}[k+1]\}\} down to 11 do
2       [l..r][l..r]\leftarrow GetMI(kk, ee);
3       if e=π(T^)e=\pi(\hat{T}) then
4             if (q𝖥𝖯𝖰e(𝖫T,k))[l..r](q\leftarrow\mathsf{FPQ}_{e}(\mathsf{L}_{T},k))\in[l..r] then  return 1+𝖫𝖥T(q)1+\mathsf{LF}_{T}(q) ;
5             if (q𝖥𝖭𝖰e(𝖫T,k))[l..r](q\leftarrow\mathsf{FNQ}_{e}(\mathsf{L}_{T},k))\in[l..r] then  return 𝖫𝖥T(q)\mathsf{LF}_{T}(q) ;
6             if (q𝖥𝖭𝖰e+1(𝖫T,l))[l..r](q\leftarrow\mathsf{FNQ}_{\geq e+1}(\mathsf{L}_{T},l))\in[l..r] then  return min\min GetMI(𝖫𝖥T(q)\mathsf{LF}_{T}(q), e+1e+1) ;
7            
8      else
9             if (q𝖥𝖯𝖰e+1(𝖫T,k))[l..r](q\leftarrow\mathsf{FPQ}_{\geq e+1}(\mathsf{L}_{T},k))\in[l..r] then
10                   [l..r][l^{\prime}..r^{\prime}]\leftarrow GetMI(qq, e+1e+1);
11                   q𝖥𝖯𝖰e+2(𝖫T,r)q^{\prime}\leftarrow\mathsf{FPQ}_{\geq e+2}(\mathsf{L}_{T},r^{\prime});
12                   if q[l..r]q^{\prime}\in[l^{\prime}..r^{\prime}] then  return 1+max1+\max GetMI(𝖫𝖥T(q)\mathsf{LF}_{T}(q^{\prime}), e+2e+2) ;
13                   else  return 1+𝖫𝖥T(q)1+\mathsf{LF}_{T}(q) ;
14                  
15            if (q𝖥𝖭𝖰e+1(𝖫T,k))[l..r](q\leftarrow\mathsf{FNQ}_{\geq e+1}(\mathsf{L}_{T},k))\in[l..r] then
16                   [l..r][l^{\prime}..r^{\prime}]\leftarrow GetMI(qq, e+1e+1);
17                   q𝖥𝖭𝖰e+1(𝖫T,l)q^{\prime}\leftarrow\mathsf{FNQ}_{e+1}(\mathsf{L}_{T},l^{\prime});
18                   if q[l..r]q^{\prime}\in[l^{\prime}..r^{\prime}] then  return 𝖫𝖥T(q)\mathsf{LF}_{T}(q^{\prime}) ;
19                   else  return min\min GetMI(𝖫𝖥T(q)\mathsf{LF}_{T}(q), e+2e+2) ;
20                  
21            if (q𝖥𝖯𝖰e(𝖫T,r))[l..r](q\leftarrow\mathsf{FPQ}_{e}(\mathsf{L}_{T},r))\in[l..r] then  return 1+𝖫𝖥T(q)1+\mathsf{LF}_{T}(q) ;
22            
23      
Algorithm 2 Algorithm to compute k^\hat{k}.

3.2 How to maintain 𝖫𝖢𝖯\mathsf{LCP}^{\infty}

Suppose that we have k=𝖱T(1)k=\mathsf{R}_{T}(1), k^=𝖱T^(1)\hat{k}=\mathsf{R}_{\hat{T}}(1), 𝖫T\mathsf{L}_{T}, 𝖥T\mathsf{F}_{T}, we show how to compute 𝗅𝖼𝗉\mathsf{lcp}^{\infty}-values of T^\langle\hat{T}\rangle with its p-pred T[𝖱T1(k^)1..]\langle T[\mathsf{R}^{-1}_{T}(\hat{k})-1..]\rangle and p-succ T[𝖱T1(k^)..]\langle T[\mathsf{R}^{-1}_{T}(\hat{k})..]\rangle to maintain 𝖫𝖢𝖯\mathsf{LCP}^{\infty}.

We focus on 𝗅𝖼𝗉(T^,T[𝖱T1(k^)..])\mathsf{lcp}^{\infty}(\langle\hat{T}\rangle,\langle T[\mathsf{R}^{-1}_{T}(\hat{k})..]\rangle) because the other one can be treated similarly. We apply Lemma 4 by setting x=T^x=\hat{T} and y=T[𝖱T1(k^)..]y=T[\mathsf{R}^{-1}_{T}(\hat{k})..] if k<𝖥𝖫T(k^)k<\mathsf{FL}_{T}(\hat{k}) (otherwise we swap their roles for xx and yy). In order to get 𝗅𝖼𝗉(x,y)\mathsf{lcp}^{\infty}(\langle x\rangle,\langle y\rangle), all we need are π(x)=π(T^)\pi(x)=\pi(\hat{T}), π(y)=𝖥[k^]\pi(y)=\mathsf{F}[\hat{k}] and e=𝗅𝖼𝗉(x[2..],y[2..])e=\mathsf{lcp}^{\infty}(\langle x[2..]\rangle,\langle y[2..]\rangle). For the computation of ee we use Lemma 10, i.e, e=𝗅𝖼𝗉(x[2..],y[2..])=𝖱𝗆𝖰𝖫𝖢𝖯T(k+1,𝖥𝖫T(k^))e=\mathsf{lcp}^{\infty}(\langle x[2..]\rangle,\langle y[2..]\rangle)=\mathsf{RmQ}_{\mathsf{LCP}^{\infty}_{T}}(k+1,\mathsf{FL}_{T}(\hat{k})).

3.3 Dynamic data structures and analysis

We consider constructing 𝖥T\mathsf{F}_{T}, 𝖫T\mathsf{L}_{T} and 𝖫𝖢𝖯T\mathsf{LCP}^{\infty}_{T} of a p-string TT over an alphabet (ΣsΣp)[0..σ](\Sigma_{s}\cup\Sigma_{p})\subseteq[0..\sigma] of length nn online. Let σs\sigma_{s} and respectively σp\sigma_{p} be the numbers of distinct s-symbols and p-symbols that appear in TT.

First we show data structures needed to implement our algorithm presented in the previous subsections. For 𝖥T\mathsf{F}_{T} we maintain a dynamic string of Lemma 2 supporting random access, insertion, 𝗋𝖺𝗇𝗄\mathsf{rank} and 𝗌𝖾𝗅𝖾𝖼𝗍\mathsf{select} queries in O(lgnlglgn)O(\frac{\lg n}{\lg\lg n}) time and O(nlgσ)O(n\lg\sigma) bits of space. For 𝖫𝖢𝖯T\mathsf{LCP}^{\infty}_{T} we maintain a dynamic string of Lemma 3 to support random access, insertion, 𝖱𝗆𝖰\mathsf{RmQ}, 𝖥𝖯𝖰\mathsf{FPQ} and 𝖥𝖭𝖰\mathsf{FNQ} queries in O(lgσplgnlglgn)O(\frac{\lg\sigma_{p}\lg n}{\lg\lg n}) time and O(nlgσp)O(n\lg\sigma_{p}) bits of space.

If we build a dynamic string of Lemma 3 for 𝖫T\mathsf{L}_{T}, the query time would be O(lgσlgnlglgn)O(\frac{\lg\sigma\lg n}{\lg\lg n}). Since our algorithm does not use 𝖱𝗆𝖰\mathsf{RmQ}, 𝖥𝖯𝖰\mathsf{FPQ} and 𝖥𝖭𝖰\mathsf{FNQ} queries for s-symbols, we can reduce the query time to O(lgσplgnlglgn)O(\frac{\lg\sigma_{p}\lg n}{\lg\lg n}) as follows. We represent 𝖫T\mathsf{L}_{T} with one level of a wavelet tree, where a bit vector BTB_{T} partitions the alphabet into Σs\Sigma_{s} and [1..|T|p][1..|T|_{p}] and thus has pointers to XTX_{T} and YTY_{T} storing respectively the sequence over Σs\Sigma_{s} and that over [1..|T|p][1..|T|_{p}] of 𝖫T\mathsf{L}_{T}. We represent the former and the latter by the data structures described in Lemmas 2 and 3, respectively, since we only need the aforementioned queries such as 𝖱𝗆𝖰\mathsf{RmQ} on YTY_{T}. Then, queries on 𝖫T\mathsf{L}_{T} can be answered in O(lgσplgnlglgn)O(\frac{\lg\sigma_{p}\lg n}{\lg\lg n}) time using O(nlgσ)O(n\lg\sigma) bits of space.

In addition to these dynamic strings for 𝖥T\mathsf{F}_{T}, 𝖫T\mathsf{L}_{T} and 𝖫𝖢𝖯T\mathsf{LCP}^{\infty}_{T}, we consider another dynamic string ZTZ_{T}. Let ZTZ_{T} be a string that is obtained by extracting the leftmost occurrence of every p-symbol in TT. Thus, |ZT|=σpn|Z_{T}|=\sigma_{p}\leq n. A dynamic string ZTZ_{T} of Lemma 2 enables us to compute π(cT)\pi(cT) for a p-symbol cc by π(cT)=min{,𝗌𝖾𝗅𝖾𝖼𝗍c(ZT,1)}\pi(cT)=\min\{\infty,\mathsf{select}_{c}(Z_{T},1)\} in O(lgσplglgσp)O(\frac{\lg\sigma_{p}}{\lg\lg\sigma_{p}}) time and O(σplgσp)O(\sigma_{p}\lg\sigma_{p}) bits of space.

We also maintain a fusion tree of [25] of O(σslgn)=O(nlgσ)O(\sigma_{s}\lg n)=O(n\lg\sigma) bits to maintain the set of s-symbols used in the current TT, which enables us to compute bb in Lemma 9 in O(lgσslglgσs)O(\frac{\lg\sigma_{s}}{\lg\lg\sigma_{s}}) time.

We are now ready to prove the following lemma.

Lemma 13.

𝖥T\mathsf{F}_{T}, 𝖫T\mathsf{L}_{T} and 𝖫𝖢𝖯T\mathsf{LCP}^{\infty}_{T} for a p-string of length nn over an alphabet (ΣsΣp)[0..σ](\Sigma_{s}\cup\Sigma_{p})\subseteq[0..\sigma] can be constructed online in O(nlgσplgnlglgn)O(n\frac{\lg\sigma_{p}\lg n}{\lg\lg n}) time and O(nlgσ)O(n\lg\sigma) bits of space, where σp\sigma_{p} is the number of distinct p-symbols used in the p-string.

Proof.

We maintain the dynamic data structures of O(nlgσ)O(n\lg\sigma) bits described in this subsection while prepending a symbol to the current p-string. For a single step of updating TT to T^=cT\hat{T}=cT with c(ΣsΣp)c\in(\Sigma_{s}\cup\Sigma_{p}), we compute k^=𝖱T^(1)\hat{k}=\mathsf{R}_{\hat{T}}(1) as described in Subsection 3.1 and obtain 𝖥T^\mathsf{F}_{\hat{T}} and 𝖫T^\mathsf{L}_{\hat{T}} by replacing $\$ in 𝖫T\mathsf{L}_{T} at k=𝖱T(1)k=\mathsf{R}_{T}(1) by π(T^)\pi(\hat{T}) and inserting $\$ and π(T^)\pi(\hat{T}) into the k^\hat{k}-th position of 𝖫T\mathsf{L}_{T} and 𝖥T\mathsf{F}_{T}, respectively. 𝖫𝖢𝖯\mathsf{LCP}^{\infty} is updated as described in Subsection 3.2.

If cΣsc\in\Sigma_{s}, the computation of k^\hat{k} based on Lemma 9 requires a constant number of queries. If cΣpc\in\Sigma_{p}, Algorithm 2 computes k^\hat{k} invoking O(2+ee^)O(2+e-\hat{e}) queries, where e=max{𝖫𝖢𝖯T[k],𝖫𝖢𝖯T[k+1]}e=\max\{\mathsf{LCP}^{\infty}_{T}[k],\mathsf{LCP}^{\infty}_{T}[k+1]\} and e^=max{𝖫𝖢𝖯T^[k^],𝖫𝖢𝖯T^[k^+1]}\hat{e}=\max\{\mathsf{LCP}^{\infty}_{\hat{T}}[\hat{k}],\mathsf{LCP}^{\infty}_{\hat{T}}[\hat{k}+1]\}. The value ee can be seen as a potential held by the current string TT, which upper bounds the number of queries. The number of queries in a single step can be O(σp)O(\sigma_{p}) in the worst case when ee and e^\hat{e} are close to σp\sigma_{p} and respectively 0, but this will reduce the potential for later steps, which allows us to give an amortized analysis. Since a single step increases the potential at most 1 by Corollary 5, the total number of queries can be bounded by O(n)O(n).

Since we invoke O(n)O(n) queries that take O(lgσplgnlglgn)O(\frac{\lg\sigma_{p}\lg n}{\lg\lg n}) time each, the overall time complexity is O(nlgσplgnlglgn)O(n\frac{\lg\sigma_{p}\lg n}{\lg\lg n}). ∎

4 Extendable compact index for p-matching

In this section, we show that 𝖫T\mathsf{L}_{T}, 𝖥T\mathsf{F}_{T} and 𝖫𝖢𝖯T\mathsf{LCP}^{\infty}_{T} can serve as an index for p-matching.

First we show that we can support backward search, a core procedure of BWT-based indexes, with the data structures for 𝖫T\mathsf{L}_{T}, 𝖥T\mathsf{F}_{T} and 𝖫𝖢𝖯T\mathsf{LCP}^{\infty}_{T} described in Subsection 3.3. For any p-string ww, let ww-interval be the maximal interval [l..r][l..r] such that T[𝖱T1(p)..]\langle T[\mathsf{R}^{-1}_{T}(p)..]\rangle is prefixed by w\langle w\rangle for any p[l..r]p\in[l..r]. We show the next lemma for a single step of backward search, which computes cwcw-interval from ww-interval.

Lemma 14.

Suppose that we have data structures for 𝖫T\mathsf{L}_{T}, 𝖥T\mathsf{F}_{T} and T^\langle\hat{T}\rangle described in Subsection 3.3. Given ww-interval [l..r][l..r] and c(ΣsΣp)c\in(\Sigma_{s}\cup\Sigma_{p}), we can compute cwcw-interval [l..r][l^{\prime}..r^{\prime}] in O(lgσplgnlglgn)O(\frac{\lg\sigma_{p}\lg n}{\lg\lg n}) time.

Proof.

We show that we can compute cwcw-interval from ww-interval using a constant number of queries supported on 𝖫T\mathsf{L}_{T}, 𝖥T\mathsf{F}_{T} and T^\langle\hat{T}\rangle, which takes O(lgσplgnlglgn)O(\frac{\lg\sigma_{p}\lg n}{\lg\lg n}) time each.

  • When cc is in Σs\Sigma_{s}: T[𝖱T1(𝖫𝖥T(p))..]\langle T[\mathsf{R}^{-1}_{T}(\mathsf{LF}_{T}(p))..]\rangle is prefixed by cw\langle cw\rangle if and only if T[𝖱T1(p)..]\langle T[\mathsf{R}^{-1}_{T}(p)..]\rangle is prefixed by w\langle w\rangle and 𝖫T[p]=c\mathsf{L}_{T}[p]=c. In other words, 𝖫𝖥T(p)[l..r]\mathsf{LF}_{T}(p)\in[l^{\prime}..r^{\prime}] if and only if p[l..r]p\in[l..r] and 𝖫T[p]=c\mathsf{L}_{T}[p]=c. Then it holds that l=𝖫𝖥T(𝖥𝖭𝖰c(𝖫T,l))l^{\prime}=\mathsf{LF}_{T}(\mathsf{FNQ}_{c}(\mathsf{L}_{T},l)) and r=𝖫𝖥T(𝖥𝖯𝖰c(𝖫T,r))r^{\prime}=\mathsf{LF}_{T}(\mathsf{FPQ}_{c}(\mathsf{L}_{T},r)) due to Corollary 6.

  • When cc is a p-symbol that appears in ww: Similar to the previous case, 𝖫𝖥T(p)[l..r]\mathsf{LF}_{T}(p)\in[l^{\prime}..r^{\prime}] if and only if p[l..r]p\in[l..r] and 𝖫T[p]=π(cw)\mathsf{L}_{T}[p]=\pi(cw). Then it holds that l=𝖫𝖥T(𝖥𝖭𝖰π(cw)(𝖫T,l))l^{\prime}=\mathsf{LF}_{T}(\mathsf{FNQ}_{\pi(cw)}(\mathsf{L}_{T},l)) and r=𝖫𝖥T(𝖥𝖯𝖰π(cw)(𝖫T,r))r^{\prime}=\mathsf{LF}_{T}(\mathsf{FPQ}_{\pi(cw)}(\mathsf{L}_{T},r)) due to Corollary 6.

  • When cc is a p-symbol that does not appear in ww: Let e=|w|pe=|w|_{p}. Since p[l..r]p\in[l..r] and 𝖫T[p]>e\mathsf{L}_{T}[p]>e are necessary and sufficient conditions for 𝖫𝖥T(p)\mathsf{LF}_{T}(p) to be in [l..r][l^{\prime}..r^{\prime}], we can compute rl+1r^{\prime}-l^{\prime}+1, the width of [l..r][l^{\prime}..r^{\prime}], by counting the number of positions pp such that 𝖫T[p]>e\mathsf{L}_{T}[p]>e with p[l..r]p\in[l..r]. This can be done with 2D range counting queries, which can also be supported with the wavelet tree of Lemma 3. If s=𝖫𝖥T(𝖥𝖭𝖰>e(𝖫T,l)s=\mathsf{LF}_{T}(\mathsf{FNQ}_{>e}(\mathsf{L}_{T},l) is in [l..r][l..r], it holds that rl+10r^{\prime}-l^{\prime}+1\neq 0 and 𝖫𝖥T(s)[l..r]\mathsf{LF}_{T}(s)\in[l^{\prime}..r^{\prime}]. Note that 𝖫𝖥T(s)\mathsf{LF}_{T}(s) is not necessarily ll^{\prime} because p-encoded suffixes T[𝖱T1(p)..]\langle T[\mathsf{R}^{-1}_{T}(p)..]\rangle with 𝖫T[p]>e\mathsf{L}_{T}[p]>e in [l..r][l..r] do not necessarily preserve the lexicographic order when they are extended by one symbol to the left, making it non-straightforward to identify the position ll^{\prime}.

    To tackle this problem, we consider [le..re]=GetMI(s,e)[l_{e}..r_{e}]=\textnormal{{GetMI}}(s,e) and [le+1..re+1]=GetMI(𝖫𝖥T(s),e+1)[l^{\prime}_{e+1}..r^{\prime}_{e+1}]=\textnormal{{GetMI}}(\mathsf{LF}_{T}(s),e+1), and show that l=le+1+xl^{\prime}=l^{\prime}_{e+1}+x, where xx is the number of positions pp such that 𝖫T[p]>e\mathsf{L}_{T}[p]>e with p[le..l1]p\in[l_{e}..l-1]. Observe that [l..r][le..re][l..r]\subseteq[l_{e}..r_{e}] and [l..r][le+1..re+1][l^{\prime}..r^{\prime}]\subseteq[l^{\prime}_{e+1}..r^{\prime}_{e+1}] by definition, and that 𝖫𝖥T(p)[le+1..re+1]\mathsf{LF}_{T}(p)\in[l^{\prime}_{e+1}..r^{\prime}_{e+1}] if and only if p[le..re]p\in[l_{e}..r_{e}] and 𝖫T[p]>e\mathsf{L}_{T}[p]>e (see Fig. 2 for an illustration). Also, it holds that T[𝖱T1(𝖫𝖥T(p))..]<T[𝖱T1(𝖫𝖥T(q))..]\langle T[\mathsf{R}^{-1}_{T}(\mathsf{LF}_{T}(p))..]\rangle<\langle T[\mathsf{R}^{-1}_{T}(\mathsf{LF}_{T}(q))..]\rangle for any p[le..l1]p\in[l_{e}..l-1] and q[l..r]q\in[l..r] with 𝖫T[p]>e\mathsf{L}_{T}[p]>e and 𝖫T[q]>e\mathsf{L}_{T}[q]>e because 𝗅𝖼𝗉(T[𝖱T1(p)..]<T[𝖱T1(q)..])=e\mathsf{lcp}^{\infty}(\langle T[\mathsf{R}^{-1}_{T}(p)..]\rangle<\langle T[\mathsf{R}^{-1}_{T}(q)..]\rangle)=e, and they fall into Case (B4) of Lemma 4. Similarly for any p[l..r]p\in[l..r] and q[r+1..re]q\in[r+1..r_{e}] with 𝖫T[p]>e\mathsf{L}_{T}[p]>e and 𝖫T[q]>e\mathsf{L}_{T}[q]>e, we have T[𝖱T1(𝖫𝖥T(p))..]<T[𝖱T1(𝖫𝖥T(q))..]\langle T[\mathsf{R}^{-1}_{T}(\mathsf{LF}_{T}(p))..]\rangle<\langle T[\mathsf{R}^{-1}_{T}(\mathsf{LF}_{T}(q))..]\rangle. Hence, l=le+1+xl^{\prime}=l^{\prime}_{e+1}+x holds.

This concludes the proof. ∎

Refer to caption
Figure 2: Illustration for the computation of cwcw-interval [l..r][l^{\prime}..r^{\prime}] from ww-interval [l..r][l..r] for the case when cc is a p-symbol that does not appear in ww with e=|w|p=3e=|w|_{p}=3. The left (resp. right) part shows sorted p-encoded suffixes around [l..r][l..r] (resp. [l..r][l^{\prime}..r^{\prime}]) with grayed areas representing the longest common prefix between ww (resp. cwcw) and each p-encoded suffix. It depicts that the positions pp in [le..l1][l_{e}..l-1] with 𝖫T[p]>e\mathsf{L}_{T}[p]>e are mapped to [le+1..l1][l^{\prime}_{e+1}..l^{\prime}-1] by LF-mapping.

We are now ready to prove the main theorem:

Proof of Theorem 1:.

If we only need counting queries, Lemmas 13 and 14 are enough: While we build 𝖫T\mathsf{L}_{T}, 𝖥T\mathsf{F}_{T} and 𝖫𝖢𝖯T\mathsf{LCP}^{\infty}_{T} online, we can compute ww-interval [l..r][l..r] for a given pattern ww of length mm using Lemma 14 successively mm times, spending O(mlgσplgnlglgn)O(m\frac{\lg\sigma_{p}\lg n}{\lg\lg n}) time in total.

Since {𝖱T1(i)i[l..r]}\{\mathsf{R}^{-1}_{T}(i)\mid i\in[l..r]\} is the set of occurrences of ww in TT, we consider how to access 𝖱T1(i)\mathsf{R}^{-1}_{T}(i) in O(lg2nlgσlglgn)O(\frac{\lg^{2}n}{\lg\sigma\lg\lg n}) time to support locating queries. As is common in BWT-based indexes, we employ a sampling technique (e.g., see [9]): For every Θ(logσn)\Theta(\log_{\sigma}n) text positions we store the values so that if we apply LF/FL-mapping to ii successively at most Θ(logσn)\Theta(\log_{\sigma}n) times we hit one of the sampled text positions. A minor remark is that since our online construction proceeds from right to left, it is convenient to start sampling from the right-end of TT and store the distance to the right-end instead of the text position counted from the left-end of TT.

During the online construction of the data structures for 𝖫T\mathsf{L}_{T}, 𝖥T\mathsf{F}_{T} and 𝖫𝖢𝖯T\mathsf{LCP}^{\infty}_{T}, we additionally maintain a dynamic bit vector of length nn and dynamic integer string VTV_{T} of length O(n/logσn)O(n/\log_{\sigma}n), which marks the sampled positions and stores sampled values, respectively. We implement VTV_{T} with the dynamic string of Lemma 2 in O(nlgnlogσn)=O(nlgσ)O(n\frac{\lg n}{\log_{\sigma}n})=O(n\lg\sigma) bits with O(lgn)O(\lg n) query times. In order to support LF/FL-mapping in O(lgnlglgn)O(\frac{\lg n}{\lg\lg n}) time, we also maintain a dynamic string of Lemma 2 for 𝖫T\mathsf{L}_{T}. With the additional space usage of O(nlgσ)O(n\lg\sigma) bits, we can access 𝖱T1(i)\mathsf{R}^{-1}_{T}(i) in O(lg2nlgσlglgn)O(\frac{\lg^{2}n}{\lg\sigma\lg\lg n}) time as we use LF/FL-mapping at most Θ(logσn)\Theta(\log_{\sigma}n) times. This leads to the claimed time bound for locating queries. ∎

Acknowledgements

This work was supported by JSPS KAKENHI Grant Number 19K20213 and 22K11907 (TI).

References

  • [1] Brenda S. Baker. A theory of parameterized pattern matching: algorithms and applications. In S. Rao Kosaraju, David S. Johnson, and Alok Aggarwal, editors, Proc. 25th Annual ACM Symposium on Theory of Computing (STOC), pages 71–80. ACM, 1993.
  • [2] Brenda S. Baker. Parameterized pattern matching: Algorithms and applications. Journal of Computer and System Sciences, 52(1):28–42, 1996.
  • [3] Brenda S. Baker. Parameterized duplication in strings: Algorithms and an application to software maintenance. SIAM J. Comput., 26(5):1343–1362, 1997.
  • [4] Richard Beal and Donald A. Adjeroh. p-suffix sorting as arithmetic coding. J. Discrete Algorithms, 16:151–169, 2012.
  • [5] Francisco Claude, Gonzalo Navarro, and Alberto Ordóñez Pereira. The wavelet matrix: An efficient wavelet tree for large alphabets. Inf. Syst., 47:15–32, 2015.
  • [6] Richard Cole and Ramesh Hariharan. Faster suffix tree construction with missing suffix links. SIAM J. Comput., 33(1):26–42, 2003.
  • [7] Satoshi Deguchi, Fumihito Higashijima, Hideo Bannai, Shunsuke Inenaga, and Masayuki Takeda. Parameterized suffix arrays for binary strings. In Proc. Prague Stringology Conference (PSC) 2008, pages 84–94, 2008.
  • [8] Diptarama, Takashi Katsura, Yuhei Otomo, Kazuyuki Narisawa, and Ayumi Shinohara. Position heaps for parameterized strings. In Juha Kärkkäinen, Jakub Radoszewski, and Wojciech Rytter, editors, Proc. 28th Annual Symposium on Combinatorial Pattern Matching (CPM) 2017, volume 78 of LIPIcs, pages 8:1–8:13. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2017.
  • [9] Paolo Ferragina and Giovanni Manzini. Opportunistic data structures with applications. In FOCS, pages 390–398, 2000.
  • [10] Noriki Fujisato, Yuto Nakashima, Shunsuke Inenaga, Hideo Bannai, and Masayuki Takeda. Right-to-left online construction of parameterized position heaps. In Jan Holub and Jan Zdárek, editors, Proc. Prague Stringology Conference (PSC) 2018, pages 91–102. Czech Technical University in Prague, Faculty of Information Technology, Department of Theoretical Computer Science, 2018.
  • [11] Noriki Fujisato, Yuto Nakashima, Shunsuke Inenaga, Hideo Bannai, and Masayuki Takeda. Direct linear time construction of parameterized suffix and LCP arrays for constant alphabets. In Nieves R. Brisaboa and Simon J. Puglisi, editors, Proc. 26th International Symposium on String Processing and Information Retrieval (SPIRE) 2019, volume 11811 of Lecture Notes in Computer Science, pages 382–391. Springer, 2019.
  • [12] Noriki Fujisato, Yuto Nakashima, Shunsuke Inenaga, Hideo Bannai, and Masayuki Takeda. The parameterized position heap of a trie. In Pinar Heggernes, editor, Proc. 11th International Conference on Algorithms and Complexity (CIAC) 2019, volume 11485 of Lecture Notes in Computer Science, pages 237–248. Springer, 2019.
  • [13] Noriki Fujisato, Yuto Nakashima, Shunsuke Inenaga, Hideo Bannai, and Masayuki Takeda. The parameterized suffix tray. In Tiziana Calamoneri and Federico Corò, editors, Proc. 12th International Conference on Algorithms and Complexity (CIAC) 2021, volume 12701 of Lecture Notes in Computer Science, pages 258–270. Springer, 2021.
  • [14] Arnab Ganguly, Rahul Shah, and Sharma V. Thankachan. pBWT: Achieving succinct data structures for parameterized pattern matching and related problems. In Proc. 28th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA) 2017, pages 397–407, 2017. doi:10.1137/1.9781611974782.25.
  • [15] Arnab Ganguly, Rahul Shah, and Sharma V. Thankachan. Fully functional parameterized suffix trees in compact space. In Mikolaj Bojanczyk, Emanuela Merelli, and David P. Woodruff, editors, Proc. 49th International Colloquium on Automata, Languages, and Programming, (ICALP) 2022, volume 229 of LIPIcs, pages 65:1–65:18. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2022.
  • [16] Daiki Hashimoto, Diptarama Hendrian, Dominik Köppl, Ryo Yoshinaka, and Ayumi Shinohara. Computing the parameterized burrows-wheeler transform online. In Diego Arroyuelo and Barbara Poblete, editors, Proc. 29th International Symposium on String Processing and Information Retrieval (SPIRE) 2022, volume 13617 of Lecture Notes in Computer Science, pages 70–85. Springer, 2022. doi:10.1007/978-3-031-20643-6\_6.
  • [17] Tomohiro I, Satoshi Deguchi, Hideo Bannai, Shunsuke Inenaga, and Masayuki Takeda. Lightweight parameterized suffix array construction. In Proc. 20th International Workshop on Combinatorial Algorithms (IWOCA) 2009, pages 312–323, 2009.
  • [18] Sung-Hwan Kim and Hwan-Gue Cho. Simpler FM-index for parameterized string matching. Inf. Process. Lett., 165:106026, 2021. doi:10.1016/j.ipl.2020.106026.
  • [19] S. Rao Kosaraju. Faster algorithms for the construction of parameterized suffix trees (preliminary version). In Proc. 36th Annual Symposium on Foundations of Computer Science (FOCS), pages 631–637. IEEE Computer Society, 1995.
  • [20] Taehyung Lee, Joong Chae Na, and Kunsoo Park. On-line construction of parameterized suffix trees for large alphabets. Inf. Process. Lett., 111(5):201–207, 2011.
  • [21] Juan Mendivelso, Sharma V. Thankachan, and Yoan J. Pinzón. A brief history of parameterized matching problems. Discret. Appl. Math., 274:103–115, 2020. doi:10.1016/j.dam.2018.07.017.
  • [22] J. Ian Munro and Yakov Nekrich. Compressed data structures for dynamic sequences. In Proc. 23rd Annual European Symposium on Algorithms (ESA) 2015, pages 891–902, 2015.
  • [23] Katsuhito Nakashima, Noriki Fujisato, Diptarama Hendrian, Yuto Nakashima, Ryo Yoshinaka, Shunsuke Inenaga, Hideo Bannai, Ayumi Shinohara, and Masayuki Takeda. Parameterized dawgs: Efficient constructions and bidirectional pattern searches. Theor. Comput. Sci., 933:21–42, 2022. doi:10.1016/j.tcs.2022.09.008.
  • [24] Gonzalo Navarro. Wavelet trees for all. J. Discrete Algorithms, 25:2–20, 2014. doi:10.1016/j.jda.2013.07.004.
  • [25] Mihai Patrascu and Mikkel Thorup. Dynamic integer sets with optimal rank, select, and predecessor search. In 55th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2014, Philadelphia, PA, USA, October 18-21, 2014, pages 166–175, 2014. doi:10.1109/FOCS.2014.26.