Breaking a Barrier in Constructing Compact Indexes for Parameterized Pattern Matching

Kento Iseri
Kyushu Institute of Technology, Fukuoka, Japan

Tomohiro I
Kyushu Institute of Technology, Fukuoka, Japan

Diptarama Hendrian
Tohoku University, Sendai, Japan

Dominik Köppl
Department of Computer Science, Universität Münster, Germany

Ryo Yoshinaka
Tohoku University, Sendai, Japan

Ayumi Shinohara
Tohoku University, Sendai, Japan

Abstract

A parameterized string (p-string) is a string over an alphabet $(\Sigma_{s}\cup\Sigma_{p})$ , where $\Sigma_{s}$ and $\Sigma_{p}$ are disjoint alphabets for static symbols (s-symbols) and for parameter symbols (p-symbols), respectively. Two p-strings $x$ and $y$ are said to parameterized match (p-match) if and only if $x$ can be transformed into $y$ by applying a bijection on $\Sigma_{p}$ to every occurrence of p-symbols in $x$ . The indexing problem for p-matching is to preprocess a p-string $T$ of length $n$ so that we can efficiently find the occurrences of substrings of $T$ that p-match with a given pattern. Extending the Burrows-Wheeler Transform (BWT) based index for exact string pattern matching, Ganguly et al. [SODA 2017] proposed the first compact index (named pBWT) for p-matching, and posed an open problem on how to construct it in compact space, i.e., in $O(n\lg|\Sigma_{s}\cup\Sigma_{p}|)$ bits of space. Hashimoto et al. [SPIRE 2022] partially solved this problem by showing how to construct some components of pBWTs for $T$ in $O(n\frac{|\Sigma_{p}|\lg n}{\lg\lg n})$ time in an online manner while reading the symbols of $T$ from right to left. In this paper, we improve the time complexity to $O(n\frac{\lg|\Sigma_{p}|\lg n}{\lg\lg n})$ . We remark that removing the multiplicative factor of $|\Sigma_{p}|$ from the complexity is of great interest because it has not been achieved for over a decade in the construction of related data structures like parameterized suffix arrays even in the offline setting. We also show that our data structure can support backward search, a core procedure of BWT-based indexes, at any stage of the online construction, making it the first compact index for p-matching that can be constructed in compact space and even in an online manner.

1 Introduction

A parameterized string (p-string) is a string over an alphabet $(\Sigma_{s}\cup\Sigma_{p})$ , where $\Sigma_{s}$ and $\Sigma_{p}$ are disjoint alphabets for static symbols (s-symbols) and for parameter symbols (p-symbols), respectively. Two p-strings $x$ and $y$ are said to parameterized match (p-match) if and only if $x$ can be transformed into $y$ by applying a bijection on $\Sigma_{p}$ to every occurrence of p-symbols in $x$ . P-matching was introduced by Baker aiming at software maintenance and plagiarism detection [1, 2, 3], and has been extensively studied in the last decades (see a recent survey [21] and references therein).

The indexing problem for p-matching is to preprocess a p-string $T$ of length $n$ so that we can efficiently find the occurrences of substrings of $T$ that p-match with a given pattern. Extending indexes for exact string pattern matching, there have been proposed several data structures that can be used as indexes for p-matching, e.g., parameterized suffix trees [1, 19, 2, 3], parameterized suffix arrays [7, 17, 4, 11], parameterized suffix trays [13], parameterized DAWGs [23], parameterized position heaps [8, 10, 12] and parameterized Burrows-Wheeler Transform (pBWT) [14, 18, 15].

Among these indexes, pBWTs are the most space-efficient, consuming $n\lg\sigma+O(n)$ bits [14] or $2n\lg\sigma+2n+o(n)$ bits with a simplified version proposed in [18], where $\sigma$ is the alphabet size. Let $\sigma_{s}$ and $\sigma_{p}$ be the numbers of distinct s-symbols and p-symbols that appear in $T$ , respectively. The pBWT of $T$ can be constructed via the parameterized suffix tree of $T$ , for which $O(n(\lg\sigma_{s}+\lg\sigma_{p}))$ -time or randomized $O(n)$ -time construction algorithms are known [19, 6, 20], but the intermediate memory footprint of $O(n\lg n)$ bits could be intolerable when $n$ is significantly larger than $\sigma$ . Hashimoto et al. [16] showed how to construct some components of the pBWT of [18] for $T$ in $O(n\frac{\sigma_{p}\lg n}{\lg\lg n})$ time in an online manner while reading the symbols of $T$ from right to left.

In this paper, we improve the time complexity of [16] to $O(n\frac{\lg\sigma_{p}\lg n}{\lg\lg n})$ . Removing the multiplicative factor of $\sigma_{p}$ from the time complexity of [16] is of great interest because it has not been achieved for over a decade in the construction of related data structures like parameterized suffix arrays even in an offline setting [11]. We also show that our data structure can support backward search, a core procedure of BWT-based indexes, at any stage of the online construction, making it the first compact index for p-matching that can be constructed in compact space and even in an online manner. This is not likely to be achieved with the previous work [16] due to the lack of support for 2D range counting queries in the data structure it uses.

Our computational assumptions are as follows:

• We assume a standard Word-RAM model with word size $\Omega(\lg n)$ .
• Each symbol in $(\Sigma_{s}\cup\Sigma_{p})$ is represented by $O(\lg n)$ bits.
• We can distinguish symbols in $\Sigma_{s}$ and $\Sigma_{p}$ in $O(1)$ time, e.g., by having some flag bits or thresholds separating them from each other.
• The order on s-symbols is determined in $O(1)$ time based on their bit representations.

Given a pattern $w$ , an index of a p-string $T$ for p-matching supports counting queries, which compute the number of occurrences of substrings of $T$ that p-match with $w$ , and locating queries, which compute the positions of these occurrences. Our main result is:

Theorem 1.

For a p-string $T$ of length $n$ over an alphabet $(\Sigma_{s}\cup\Sigma_{p})\subseteq[0..\sigma]$ , an index of $T$ for p-matching can be constructed online in $O(n\frac{\lg\sigma_{p}\lg n}{\lg\lg n})$ time and $O(n\lg\sigma)$ bits of space, where $\sigma_{p}$ is the number of distinct p-symbols used in the p-string. At any stage of the online construction, it can support the counting queries in $O(m\frac{\lg\sigma_{p}\lg n}{\lg\lg n})$ time and the locating queries in $O(m\frac{\lg\sigma_{p}\lg n}{\lg\lg n}+\mathsf{occ}\frac{\lg^{2}n}{\lg\sigma\lg\lg n})$ time, where $m$ is the pattern length and $\mathsf{occ}$ is the number of occurrences to be reported.

2 Preliminaries

2.1 Basic notations and tools

An integer interval $\{i,i+1,\dots,j\}$ is denoted by $[i..j]$ , where $[i..j]$ represents the empty interval if $i>j$ .

Let $\Sigma$ be an ordered finite alphabet. An element of $\Sigma^{*}$ is called a string over $\Sigma$ . The length of a string $w$ is denoted by $|w|$ . The empty string $\varepsilon$ is the string of length 0, that is, $|\varepsilon|=0$ . Let $\Sigma^{+}=\Sigma^{*}-\{\varepsilon\}$ and $\Sigma^{k}=\{x\in\Sigma^{*}\mid|x|=k\}$ for any non-negative integer $k$ . The concatenation of two strings $x$ and $y$ is denoted by $x\cdot y$ or simply $xy$ . When a string $w$ is represented by the concatenation of strings $x$ , $y$ and $z$ (i.e. $w=xyz$ ), then $x$ , $y$ and $z$ are called a prefix, substring, and suffix of $w$ , respectively. A substring $x$ of $w$ is called proper if $x\neq w$ .

The $i$ -th symbol of a string $w$ is denoted by $w[i]$ for $1\leq i\leq|w|$ , and the substring of a string $w$ that begins at position $i$ and ends at position $j$ is denoted by $w[i..j]$ for $1\leq i\leq j\leq|w|$ , i.e., $w[i..j]=w[i]w[i+1]\cdots w[j]$ . For convenience, let $w[i..j]=\varepsilon$ if $j<i$ ; further let $w[..i]=w[1..i]$ and $w[i..]=w[i..|w|]$ denote abbreviations for the prefix of length $i$ and the suffix starting at position $i$ , respectively. For two strings $x$ and $y$ , let $\mathsf{lcp}(x,y)$ denote the length of the longest common prefix between $x$ and $y$ . We consider the lexicographic order over $\Sigma^{*}$ by extending the strict total order $<$ defined on $\Sigma$ : $x$ is lexicographically smaller than $y$ (denoted as $x<y$ ) if and only if either $x$ is a proper prefix of $y$ or $x[\mathsf{lcp}(x,y)+1]<y[\mathsf{lcp}(x,y)+1]$ holds.

For any string $w$ , character $c$ , and position $i~{}(1\leq i\leq|w|)$ , $\mathsf{rank}_{c}(w,i)$ returns the number of occurrences of $c$ in $w[..i]$ and $\mathsf{select}_{c}(w,i)$ returns the $i$ -th occurrence of $c$ in $w$ . For $1\leq i\leq j\leq|w|$ , a range minimum query $\mathsf{RmQ}_{w}(i,j)$ asks for $\arg\min_{i\leq k\leq j}\{w[k]\}$ . We also consider find previous/next queries $\mathsf{FPQ}_{p}(w,i)$ and $\mathsf{FNQ}_{p}(w,i)$ , where $p$ is a predicate either in the form of “ $c$ ” (equal to $c$ ), “ $<c$ ” (less than $c$ ) or “ $\geq c$ ” (larger than or equal to $c$ ): $\mathsf{FPQ}_{p}(w,i)$ returns the largest position $j\leq i$ at which $w[j]$ satisfies the predicate $p$ . Symmetrically, $\mathsf{FNQ}_{p}(w,i)$ returns the smallest position $j\geq i$ at which $w[j]$ satisfies the predicate $p$ . For example with the integer string $w=[2,5,10,6,8,3,14,5]$ , $\mathsf{FNQ}_{5}(w,4)=8$ , $\mathsf{FNQ}_{6}(w,4)=4$ , $\mathsf{FPQ}_{5}(w,4)=2$ , $\mathsf{FNQ}_{<5}(w,4)=6$ , $\mathsf{FPQ}_{<5}(w,4)=1$ , $\mathsf{FNQ}_{\geq 9}(w,4)=7$ and $\mathsf{FPQ}_{\geq 9}(w,4)=3$ .
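The queries defined above can be made concrete with a naive reference implementation. The following is a minimal sketch, assuming 1-based positions as in the text and linear-time scans instead of the succinct structures used later; the function names and the convention of returning `None` for a non-existent answer are illustrative choices, not part of the paper.

```python
from typing import Callable, Optional, Sequence


def rank(w: Sequence, c, i: int) -> int:
    """Number of occurrences of c in w[1..i] (1-based, inclusive)."""
    return sum(1 for x in w[:i] if x == c)


def select(w: Sequence, c, i: int) -> Optional[int]:
    """1-based position of the i-th occurrence of c in w, or None."""
    seen = 0
    for pos, x in enumerate(w, start=1):
        if x == c:
            seen += 1
            if seen == i:
                return pos
    return None


def rmq(w: Sequence, i: int, j: int) -> int:
    """1-based position of a minimum of w[i..j] (leftmost on ties)."""
    return min(range(i, j + 1), key=lambda k: w[k - 1])


def fpq(w: Sequence, pred: Callable, i: int) -> Optional[int]:
    """Largest position j <= i such that w[j] satisfies pred, or None."""
    for j in range(i, 0, -1):
        if pred(w[j - 1]):
            return j
    return None


def fnq(w: Sequence, pred: Callable, i: int) -> Optional[int]:
    """Smallest position j >= i such that w[j] satisfies pred, or None."""
    for j in range(i, len(w) + 1):
        if pred(w[j - 1]):
            return j
    return None


# The integer string used in the example of the text.
w = [2, 5, 10, 6, 8, 3, 14, 5]
print(fnq(w, lambda x: x == 5, 4))  # 8
print(fpq(w, lambda x: x < 5, 4))   # 1
print(fnq(w, lambda x: x >= 9, 4))  # 7
```

Passing the predicate as a function covers the three forms “ $c$ ”, “ $<c$ ” and “ $\geq c$ ” with a single routine.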

If the answer to $\mathsf{select}_{c}(w,i)$ , $\mathsf{FPQ}_{p}(w,i)$ or $\mathsf{FNQ}_{p}(w,i)$ does not exist, the query is simply ignored. To handle such non-existence, we use these queries inside an expression with $\min$ or $\max$ : For example, $\max\{1,\mathsf{FPQ}_{p}(w,i)\}$ returns $1$ if $\mathsf{FPQ}_{p}(w,i)$ does not exist.

Dynamic strings should support insertion/deletion of a symbol to/from any position as well as fast random access. We use the following result:

Lemma 2 ([22]).

A dynamic string of length $n$ over an alphabet $[0..\sigma]$ can be implemented while supporting random access, insertion, deletion, $\mathsf{rank}$ and $\mathsf{select}$ queries in $(n+o(n))\lg\sigma$ bits of space and $O(\frac{\lg n}{\lg\lg n})$ query and update times.

Dynamic binary strings equipped with $\mathsf{rank}$ and $\mathsf{select}$ queries can be used as a building block for the dynamic wavelet matrix [5] of a string over an alphabet $[0..\sigma]$ to support queries beyond $\mathsf{rank}$ and $\mathsf{select}$ . The idea is that each of the other queries can be simulated by performing one of the building block queries on every level of the wavelet matrix, which has $\lceil\lg\sigma\rceil$ levels, cf. [24, Section 6.2.].

Lemma 3.

A dynamic string of length $n$ over an alphabet $[0..\sigma]$ with $\sigma=O(n)$ can be implemented while supporting random access, insertion, deletion, $\mathsf{rank}$ , $\mathsf{select}$ , $\mathsf{RmQ}$ , $\mathsf{FPQ}$ and $\mathsf{FNQ}$ queries in $(n+o(n))\lceil\lg\sigma\rceil+O(\lg\sigma\lg n)$ bits of space and $O(\frac{\lg\sigma\lg n}{\lg\lg n})$ query and update times.

2.2 Parameterized strings

Let $\Sigma_{s}$ and $\Sigma_{p}$ denote two disjoint sets of symbols. We call a symbol in $\Sigma_{s}$ a static symbol (s-symbol) and a symbol in $\Sigma_{p}$ a parameter symbol (p-symbol). A parameterized string (p-string) is a string over $(\Sigma_{s}\cup\Sigma_{p})$ . Let $\infty$ represent a symbol that is larger than any integer, and let $\mathbf{N}_{\infty}=\mathbf{N}_{+}\cup\{\infty\}$ be the set of natural positive numbers $\mathbf{N}_{+}$ including infinity. We assume that $\mathbf{N}_{\infty}\cap\Sigma_{s}=\emptyset$ and $(\mathbf{N}_{\infty}\cup\Sigma_{s})$ is an ordered alphabet such that all s-symbols are smaller than any element in $\mathbf{N}_{\infty}$ . Let $\$$ be the smallest s-symbol, which will be used as an end-marker of p-strings. For any p-string $w$ the p-encoded string $\langle w\rangle$ of $w$ , also proposed as $\mathtt{prev}_{\infty}(w)$ in [18], is the string in $(\mathbf{N}_{\infty}\cup\Sigma_{s})^{|w|}$ such that

\langle w\rangle[i]=\begin{cases}w[i]&\mbox{if $w[i]\in\Sigma_{s}$},\\ \infty&\mbox{if $w[i]\in\Sigma_{p}$ and $w[i]$ does not appear in $w[..i-1]$},\\ i-j&\mbox{otherwise,}\end{cases}

where $j$ is the largest position in $[1..i-1]$ with $w[i]=w[j]$ . In other words, p-encoding transforms an occurrence of a p-symbol into the distance to the previous occurrence of the same p-symbol, or $\infty$ if it is the leftmost occurrence. By definition, p-encoding is prefix-consistent, i.e., $\langle w\rangle=\langle wc\rangle[..|w|]$ for any symbol $c\in(\Sigma_{s}\cup\Sigma_{p})$ . On the other hand, $\langle w\rangle$ and $\langle cw\rangle[2..]$ differ if and only if $c\in\Sigma_{p}$ occurs in $w$ . If it is the case, the leftmost occurrence $h$ of $c$ in $w$ is the unique position such that $\langle w\rangle$ and $\langle cw\rangle[2..]$ differ with $\langle w\rangle[h]=\infty$ and $(\langle cw\rangle[2..])[h]=\langle cw\rangle[h+1]=h$ , i.e., $h=\mathsf{select}_{c}(w,1)$ and $h+1=\mathsf{select}_{c}(cw,2)$ .
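The p-encoding and its two consistency properties can be checked with a few lines of code. The sketch below is a straightforward left-to-right scan, assuming that a p-string is given as a Python string together with the set of its s-symbols, and that $\infty$ is modeled by `float("inf")`; the helper `differing_positions` is a hypothetical name introduced only for this illustration.

```python
INF = float("inf")


def p_encode(w: str, static: set) -> list:
    """p-encoded string <w>: s-symbols stay, p-symbols become distances."""
    last = {}  # p-symbol -> 1-based position of its latest occurrence
    out = []
    for i, ch in enumerate(w, start=1):
        if ch in static:
            out.append(ch)
            continue
        out.append(i - last[ch] if ch in last else INF)
        last[ch] = i
    return out


def differing_positions(c: str, w: str, static: set) -> list:
    """1-based positions where <w> and <cw>[2..] differ."""
    old = p_encode(w, static)
    new = p_encode(c + w, static)[1:]
    return [h for h, (a, b) in enumerate(zip(old, new), start=1) if a != b]


STATIC = {"a", "$"}
print(p_encode("xyazyxazxza$", STATIC))
# [inf, inf, 'a', inf, 3, 5, 'a', 4, 3, 2, 'a', '$']
print(differing_positions("x", "yazyxazxza$", STATIC))  # [5]
```

In the second call, prepending $\mathtt{x}$ changes only position $h=5$ , the leftmost occurrence of $\mathtt{x}$ in the shorter string, whose value turns from $\infty$ into $h$ .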

For any p-string $w$ , let $|w|_{p}$ denote the number of distinct p-symbols in $w$ , i.e., $|w|_{p}=\mathsf{rank}_{\infty}(\langle w\rangle,|w|)$ . We define a function $\pi$ that maps a p-string $w$ over $(\Sigma_{s}\cup\Sigma_{p})$ to an element in $(\Sigma_{s}\cup[1..|w|_{p}])$ such that

\pi(w)=\begin{cases}w[1]&\mbox{if $w[1]\in\Sigma_{s}$},\\ |w[..h+1]|_{p}&\mbox{otherwise,}\end{cases}

where $h+1=\min\{|w|,\mathsf{select}_{w[1]}(w,2)\}$ . In the second case, $\pi(w)$ represents the rank of p-symbol $w[1]$ when ordering all p-symbols by their leftmost occurrences in $w[2..]$ , considering the rank of p-symbols not in $w[2..]$ to be $|w|_{p}$ . If $\mathsf{select}_{w[1]}(w,2)$ exists, it holds that $h=\mathsf{select}_{\infty}(\langle w[2..]\rangle,\pi(w))$ . For convenience, let $\pi(\varepsilon)=\$$ .
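The definition of $\pi$ translates directly into code. The following is a minimal sketch, assuming the same string-plus-set representation of p-strings as an illustrative convention; it is a naive scan and not the way $\pi$ is obtained inside the index.

```python
def pi(w: str, static: set):
    """pi(w): w[1] if it is an s-symbol, else |w[..h+1]|_p as in the text."""
    if not w:
        return "$"  # convention pi(epsilon) = $
    first = w[0]
    if first in static:
        return first
    nxt = w.find(first, 1)  # 0-based index of the second occurrence, or -1
    end = len(w) if nxt == -1 else nxt + 1  # this is h+1 in 1-based terms
    # number of distinct p-symbols in the prefix w[..h+1]
    return len({ch for ch in w[:end] if ch not in static})


STATIC = {"a", "$"}
print(pi("xazxza$", STATIC))  # 2: x recurs at position 4, and xazx has {x, z}
print(pi("za$", STATIC))      # 1: z does not recur, so the whole string counts
print(pi("a$", STATIC))       # a
```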

For two p-strings $u$ and $v$ , $\mathsf{lcp}^{\infty}(\langle u\rangle,\langle v\rangle)$ denotes the number of $\infty$ ’s in the longest common prefix of $\langle u\rangle$ and $\langle v\rangle$ .

The following lemma states basic but important properties of the p-string encoding and $\pi$ , which immediately follow from the definition (see Fig. 1 for illustrations).

Lemma 4.

For any p-strings $x$ and $y$ over $(\Sigma_{s}\cup\Sigma_{p})$ with $\lambda=\mathsf{lcp}(\langle x[2..]\rangle,\langle y[2..]\rangle)$ , $e=\mathsf{lcp}^{\infty}(\langle x[2..]\rangle,\langle y[2..]\rangle)$ and $\langle x[2..]\rangle<\langle y[2..]\rangle$ , Table 1 shows a complete list of cases for $\mathsf{lcp}(\langle x\rangle,\langle y\rangle)$ , $\mathsf{lcp}^{\infty}(\langle x\rangle,\langle y\rangle)$ and the lexicographic order between $\langle x\rangle$ and $\langle y\rangle$ , where the cases starting with A apply when at least one of $\pi(x)$ and $\pi(y)$ is in $\Sigma_{s}$ , and the cases starting with B apply when neither $\pi(x)$ nor $\pi(y)$ is in $\Sigma_{s}$ .

Table 1: All cases considered in Lemma 4. A case starting with letter A assumes that at least one of $\pi(x)$ and $\pi(y)$ is in $\Sigma_{s}$ , while a case starting with letter B assumes that neither $\pi(x)$ nor $\pi(y)$ is in $\Sigma_{s}$ . Here, $h=\mathsf{select}_{x[1]}(x[2..],1)$ and $h^{\prime}=\mathsf{select}_{y[1]}(y[2..],1)$ .

cases	additional conditions	$\mathsf{lcp}(\langle x\rangle,\langle y\rangle)$	$\mathsf{lcp}^{\infty}(\langle x\rangle,\langle y\rangle)$	lexicographic order
(A1)	$\pi(x)\neq\pi(y)$	$0$	$0$	$\langle x\rangle<\langle y\rangle$ iff $\pi(x)<\pi(y)$
(A2)	$\pi(x)=\pi(y)$	$\lambda+1$	$e$	$\langle x\rangle<\langle y\rangle$
(B1)	$\pi(x)=\pi(y)\leq e$	$\lambda+1$	$e$	$\langle x\rangle<\langle y\rangle$
(B2)	$\pi(x)\leq e$ and $\pi(x)<\pi(y)$	$h$	$\pi(x)$	$\langle x\rangle<\langle y\rangle$
(B3)	$\pi(y)\leq e$ and $\pi(y)<\pi(x)$	$h^{\prime}$	$\pi(y)$	$\langle y\rangle<\langle x\rangle$
(B4)	$e<\min\{\pi(x),\pi(y)\}$	$\lambda+1$	$e+1$	$\langle x\rangle<\langle y\rangle$

Figure 1: Illustrations for the cases of Lemma 4. Each right arrow represents the longest common prefix of two p-encoded strings, and the lexicographic order between them is determined by the following p-encoded symbols. For Case (B1), $h=\mathsf{select}_{x[1]}(x[2..],1)=\mathsf{select}_{y[1]}(y[2..],1)$ . For Case (B2)-(B4) and (B4)’, $h=\mathsf{select}_{x[1]}(x[2..],1)$ and $h^{\prime}=\mathsf{select}_{y[1]}(y[2..],1)$ . Case (B4)’ illustrates the case with $b=\infty$ , which is included in Case (B4).

By Lemma 4, we have the following corollaries:

Corollary 5.

For any p-strings $x$ and $y$ , $\mathsf{lcp}^{\infty}(\langle x\rangle,\langle y\rangle)\leq\mathsf{lcp}^{\infty}(\langle x[2..]\rangle,\langle y[2..]\rangle)+1$ .

Corollary 6.

For any p-strings $x$ and $y$ with $\pi(x)=\pi(y)$ , $\langle x[2..]\rangle<\langle y[2..]\rangle$ if and only if $\langle x\rangle<\langle y\rangle$ .

Corollary 7.

For any p-strings $x$ and $y$ with $\pi(x)\leq\mathsf{lcp}^{\infty}(\langle x[2..]\rangle,\langle y[2..]\rangle)<\pi(y)$ , it holds that $\langle x\rangle[..h+1]\leq\langle x\rangle<\langle y\rangle[..h^{\prime}+1]\leq\langle y\rangle$ , where $h=\mathsf{select}_{x[1]}(x[2..],1)$ and $h^{\prime}=\mathsf{select}_{y[1]}(y[2..],1)$ .

Table 2: An example of $\mathsf{R}^{-1}_{T}(i)$ , $\mathsf{LCP}^{\infty}_{T}$ , $\mathsf{L}_{T}$ and $\mathsf{F}_{T}$ for a p-string $T=\mathtt{xyazyxazxza\$}$ with $\Sigma_{s}=\{\mathtt{a}\}$ and $\Sigma_{p}=\{\mathtt{x},\mathtt{y},\mathtt{z}\}$ .

$i$	$T[i..]$	$\langle T[i..]\rangle$	$\mathsf{R}^{-1}_{T}(i)$	$\mathsf{LCP}^{\infty}_{T}[i]$	$\mathsf{L}_{T}[i]$	$\mathsf{F}_{T}[i]$	$\langle T[\mathsf{R}^{-1}_{T}(i)..]\rangle$
1	$\mathtt{xyazyxazxza\$}$	$\mathtt{\infty\infty a\infty 35a432a\$}$	12	0	$\mathtt{a}$	$\mathtt{\$}$	$\$$
2	$\mathtt{yazyxazxza\$}$	$\mathtt{\infty a\infty 3\infty a432a\$}$	11	0	1	$\mathtt{a}$	$\mathtt{a\$}$
3	$\mathtt{azyxazxza\$}$	$\mathtt{a\infty\infty\infty a432a\$}$	7	0	2	$\mathtt{a}$	$\mathtt{a\infty\infty 2a\$}$
4	$\mathtt{zyxazxza\$}$	$\mathtt{\infty\infty\infty a432a\$}$	3	2	2	$\mathtt{a}$	$\mathtt{a\infty\infty\infty a432a\$}$
5	$\mathtt{yxazxza\$}$	$\mathtt{\infty\infty a\infty 32a\$}$	10	0	2	1	$\mathtt{\infty a\$}$
6	$\mathtt{xazxza\$}$	$\mathtt{\infty a\infty 32a\$}$	6	1	3	2	$\mathtt{\infty a\infty 32a\$}$
7	$\mathtt{azxza\$}$	$\mathtt{a\infty\infty 2a\$}$	2	2	3	2	$\mathtt{\infty a\infty 3\infty a432a\$}$
8	$\mathtt{zxza\$}$	$\mathtt{\infty\infty 2a\$}$	9	1	2	2	$\mathtt{\infty\infty a\$}$
9	$\mathtt{xza\$}$	$\mathtt{\infty\infty a\$}$	5	2	3	3	$\mathtt{\infty\infty a\infty 32a\$}$
10	$\mathtt{za\$}$	$\mathtt{\infty a\$}$	1	3	$\mathtt{\$}$	3	$\mathtt{\infty\infty a\infty 35a432a\$}$
11	$\mathtt{a\$}$	$\mathtt{a\$}$	8	2	$\mathtt{a}$	2	$\mathtt{\infty\infty 2a\$}$
12	$\mathtt{\$}$	$\mathtt{\$}$	4	2	$\mathtt{a}$	3	$\mathtt{\infty\infty\infty a432a\$}$

Let $T$ be a p-string that has the smallest s-symbol $\$$ as its end-marker, i.e., $T[|T|]=\$$ and $\$$ does not appear anywhere else in $T$ . The suffix rank function $\mathsf{R}_{T}:[1..|T|]\rightarrow[1..|T|]$ for $T$ maps a position $i~{}(1\leq i\leq|T|)$ to the lexicographic rank of $\langle T[i..]\rangle$ in $\{\langle T[j..]\rangle\mid 1\leq j\leq|T|\}$ . Its inverse function $\mathsf{R}^{-1}_{T}(i)$ returns the starting position of the lexicographically $i$ -th p-encoded suffix of $T$ .

The main components of the parameterized Burrows-Wheeler Transform (pBWT) of $T$ are $\mathsf{F}_{T}$ and $\mathsf{L}_{T}$ . $\mathsf{F}_{T}$ (resp. $\mathsf{L}_{T}$ ) is defined to be the string of length $|T|$ such that $\mathsf{F}_{T}[i]=\pi(T[\mathsf{R}^{-1}_{T}(i)..])$ (resp. $\mathsf{L}_{T}[i]=\pi(T[\mathsf{R}^{-1}_{T}(i)-1..])$ ), where we assume that $T[0..]=\$$ . (Footnote 1: Previous studies [14, 18, 16] define pBWTs based on sorted cyclic rotations, but our suffix-based definition is more suitable for online construction because it prevents unnecessary updates on $\mathsf{F}_{T}$ and $\mathsf{L}_{T}$ .) Since $\{T[\mathsf{R}^{-1}_{T}(i)..]\mid 1\leq i\leq|T|\}=\{T[\mathsf{R}^{-1}_{T}(i)-1..]\mid 1\leq i\leq|T|\}$ is the set of all non-empty suffixes of $T$ , $\mathsf{F}_{T}$ is a permutation of $\mathsf{L}_{T}$ .

The so-called LF-mapping $\mathsf{LF}_{T}$ maps a position $i$ to $\mathsf{R}_{T}(\mathsf{R}^{-1}_{T}(i)-1)$ if $\mathsf{R}^{-1}_{T}(i)>1$ , and otherwise $\mathsf{R}_{T}(|T|)=1$ . By definition and Corollary 6, we have:

Corollary 8.

For any p-string $T$ and any integers $i,j$ with $1\leq i<j\leq|T|$ , $\mathsf{LF}_{T}(i)<\mathsf{LF}_{T}(j)$ if $\mathsf{L}_{T}[i]=\mathsf{L}_{T}[j]$ .

Thanks to Corollary 8, it holds that $\mathsf{LF}_{T}(i)=\mathsf{select}_{c}(\mathsf{F}_{T},\mathsf{rank}_{c}(\mathsf{L}_{T},i))$ , where $c=\mathsf{L}_{T}[i]$ . The inverse function $\mathsf{FL}_{T}$ of $\mathsf{LF}_{T}$ can be computed by $\mathsf{FL}_{T}(i)=\mathsf{select}_{c}(\mathsf{L}_{T},\mathsf{rank}_{c}(\mathsf{F}_{T},i))$ , where $c=\mathsf{F}_{T}[i]$ .
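The two formulas for $\mathsf{LF}_{T}$ and $\mathsf{FL}_{T}$ can be tried on the arrays of Table 2. The sketch below hard-codes $\mathsf{F}_{T}$ , $\mathsf{L}_{T}$ and $\mathsf{R}^{-1}_{T}$ from that table and uses naive linear-time `rank` and `select`, which is an assumption made only to keep the example short; the function names are illustrative.

```python
# Arrays copied from Table 2 for T = xyazyxazxza$ (1-based in the text).
F = ["$", "a", "a", "a", 1, 2, 2, 2, 3, 3, 2, 3]
L = ["a", 1, 2, 2, 2, 3, 3, 2, 3, "$", "a", "a"]
R_INV = [12, 11, 7, 3, 10, 6, 2, 9, 5, 1, 8, 4]


def rank(w, c, i):
    """Occurrences of c in w[1..i]."""
    return w[:i].count(c)


def select(w, c, i):
    """1-based position of the i-th occurrence of c in w, or None."""
    positions = [p for p, x in enumerate(w, start=1) if x == c]
    return positions[i - 1] if 1 <= i <= len(positions) else None


def lf(i):
    """LF-mapping through rank on L and select on F."""
    c = L[i - 1]
    return select(F, c, rank(L, c, i))


def fl(i):
    """Inverse mapping through rank on F and select on L."""
    c = F[i - 1]
    return select(L, c, rank(F, c, i))


def lf_by_definition(i):
    """R_T(R_T^{-1}(i) - 1), or R_T(|T|) when the suffix starts at 1."""
    pos = R_INV[i - 1]
    target = pos - 1 if pos > 1 else len(R_INV)
    return R_INV.index(target) + 1


print([lf(i) for i in range(1, 13)])
# [2, 5, 6, 7, 8, 9, 10, 11, 12, 1, 3, 4]
```

On this example the rank/select formula agrees with the definition of $\mathsf{LF}_{T}$ at every position, and `fl` undoes `lf`.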

Let $\mathsf{LCP}^{\infty}_{T}$ be the string of length $|T|$ such that $\mathsf{LCP}^{\infty}_{T}[1]=0$ and $\mathsf{LCP}^{\infty}_{T}[i]=\mathsf{lcp}^{\infty}(\langle T[\mathsf{R}^{-1}_{T}(i-1)..]\rangle,\langle T[\mathsf{R}^{-1}_{T}(i)..]\rangle)$ for any $1<i\leq|T|$ . An example of all explained arrays is given in Table 2.
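The arrays of Table 2 can be reproduced by a naive reference construction that sorts all p-encoded suffixes explicitly. This is only a sketch for checking the definitions, with quadratic space and worse time; it is not the online algorithm of this paper. It assumes that the order on s-symbols is the character order of Python strings (so that $\$$ is smallest) and that the first entry of $\mathsf{LCP}^{\infty}_{T}$ is $0$ .

```python
INF = float("inf")


def p_encode(w, static):
    last, out = {}, []
    for i, ch in enumerate(w, start=1):
        if ch in static:
            out.append(ch)
            continue
        out.append(i - last[ch] if ch in last else INF)
        last[ch] = i
    return out


def pi(w, static):
    if not w or w[0] in static:
        return w[0] if w else "$"
    nxt = w.find(w[0], 1)
    end = len(w) if nxt == -1 else nxt + 1
    return len({ch for ch in w[:end] if ch not in static})


def sort_key(encoded):
    # s-symbols (tag 0) are smaller than integers and infinity (tag 1)
    return [(0, x) if isinstance(x, str) else (1, x) for x in encoded]


def build_naive(T, static):
    n = len(T)
    enc = {i: p_encode(T[i - 1:], static) for i in range(1, n + 1)}
    r_inv = sorted(range(1, n + 1), key=lambda i: sort_key(enc[i]))
    F = [pi(T[i - 1:], static) for i in r_inv]
    # T[0..] is defined to be $, so the suffix starting at 1 gets L = $
    L = [pi(T[i - 2:], static) if i > 1 else "$" for i in r_inv]
    lcp_inf = [0]
    for a, b in zip(r_inv, r_inv[1:]):
        count = 0
        for u, v in zip(enc[a], enc[b]):
            if u != v:
                break
            count += 1 if u == INF else 0
        lcp_inf.append(count)
    return r_inv, F, L, lcp_inf


r_inv, F, L, lcp_inf = build_naive("xyazyxazxza$", {"a", "$"})
print(r_inv)    # [12, 11, 7, 3, 10, 6, 2, 9, 5, 1, 8, 4]
print(lcp_inf)  # [0, 0, 0, 2, 0, 1, 2, 1, 2, 3, 2, 2]
```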

3 Online construction algorithm

For online construction of our index for p-matching, we consider maintaining dynamic data structures for $\mathsf{F}_{T}$ , $\mathsf{L}_{T}$ and $\mathsf{LCP}^{\infty}_{T}$ while prepending a symbol to the current p-string $T$ . The details of the data structures will be presented in Subsection 3.3. In what follows, we focus on a single step of updating $T$ to $\hat{T}=cT$ for some symbol $c$ in $\Sigma_{s}\cup\Sigma_{p}$ . Note that $\mathsf{F}_{T}$ , $\mathsf{L}_{T}$ and $\mathsf{LCP}^{\infty}_{T}$ are strongly related to the sorted p-encoded suffixes of a p-string and $\hat{T}=cT$ is the only suffix that was not in the suffixes of $T$ . Let $k=\mathsf{R}_{T}(1)$ and $\hat{k}=\mathsf{R}_{\hat{T}}(1)$ . In order to deal with the new emerging suffix $\hat{T}$ , we compute the lexicographic rank $\hat{k}$ of $\langle\hat{T}\rangle$ in the non-empty p-encoded suffixes of $\hat{T}$ . Then $\mathsf{F}_{\hat{T}}$ and $\mathsf{L}_{\hat{T}}$ can be obtained by replacing $\$$ in $\mathsf{L}_{T}$ at $k$ by $\pi(\hat{T})$ and inserting $\$$ and $\pi(\hat{T})$ into the $\hat{k}$ -th position of $\mathsf{L}_{T}$ and $\mathsf{F}_{T}$ , respectively. In Subsection 3.1, we propose our algorithm to compute $\hat{k}$ . For updating $\mathsf{LCP}^{\infty}$ , we have to compute the $\mathsf{lcp}^{\infty}$ -values for $\langle\hat{T}\rangle$ with its lexicographically adjacent p-encoded suffixes, which will be treated in Subsection 3.2.
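The update of $\mathsf{F}_{T}$ and $\mathsf{L}_{T}$ described above amounts to one replacement and two insertions once $\hat{k}$ is known. The sketch below uses plain Python lists instead of dynamic strings and takes $k$ , $\hat{k}$ and $\pi(\hat{T})$ as given inputs. The arrays for $T=\mathtt{yazyxazxza\$}$ were worked out by hand for this illustration (an assumption, they do not appear in the paper), while the expected result for $\hat{T}=\mathtt{x}T$ is Table 2.

```python
def prepend_update(F, L, k, k_hat, pi_new):
    """Return (F', L') for T' = cT given F, L of T.

    k      : rank of <T> among the p-encoded suffixes of T (1-based)
    k_hat  : rank of <cT> among the p-encoded suffixes of cT (1-based)
    pi_new : pi(cT)
    """
    new_F, new_L = list(F), list(L)
    assert new_L[k - 1] == "$"  # T itself was preceded by the virtual $
    new_L[k - 1] = pi_new       # T is now preceded by c
    new_L.insert(k_hat - 1, "$")     # the new suffix cT is preceded by $
    new_F.insert(k_hat - 1, pi_new)  # and starts with pi(cT)
    return new_F, new_L


# Hand-derived arrays for T = yazyxazxza$ (illustrative assumption).
F_T = ["$", "a", "a", "a", 1, 2, 2, 2, 3, 2, 3]
L_T = ["a", 1, 2, 2, 2, 3, "$", 2, 3, "a", "a"]

F_new, L_new = prepend_update(F_T, L_T, k=7, k_hat=10, pi_new=3)
print(F_new)  # ['$', 'a', 'a', 'a', 1, 2, 2, 2, 3, 3, 2, 3]
print(L_new)  # ['a', 1, 2, 2, 2, 3, 3, 2, 3, '$', 'a', 'a']
```

The entries of all other suffixes are untouched because $\pi$ of a suffix depends only on that suffix, which is why computing $\hat{k}$ is the only non-trivial part of the step.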

3.1 How to compute $\hat{k}$

Unlike the existing work [16] that computes $\hat{k}$ by counting the number of p-encoded suffixes that are lexicographically smaller than $\langle\hat{T}\rangle$ , we get $\hat{k}$ indirectly by computing the rank of a lexicographically closest (smaller or larger) p-encoded suffix to $\langle\hat{T}\rangle$ . The lexicographically smaller (resp. larger) closest element in $\{\langle T[i..]\rangle\mid 1\leq i\leq|T|\}$ to $\langle\hat{T}\rangle$ is called the p-pred (resp. p-succ) of $\langle\hat{T}\rangle$ , and its rank is denoted by $k_{-}$ (resp. $k_{+}$ ). Then it holds that $\hat{k}=k_{+}=k_{-}+1$ .

We start with the easy case that the prepended symbol $c$ is an s-symbol.

Lemma 9.

Let $\hat{T}=cT$ be a p-string with $c\in\Sigma_{s}$ . If $p:=\mathsf{FPQ}_{c}(\mathsf{L}_{T},k)$ exists, the rank $k_{-}$ of the p-pred of $\langle\hat{T}\rangle$ is $\mathsf{LF}_{T}(p)$ . Otherwise, $k_{-}=\mathsf{select}_{b}(\mathsf{F}_{T},\mathsf{rank}_{b}(\mathsf{F}_{T},|T|))$ , where $b$ is the largest s-symbol that appears in $T$ and is smaller than $c$ .

Proof.

By Case (A2) of Lemma 4 (cf. Table 1), the lexicographic order of p-encoded suffixes starting with $c$ does not change by removing their first characters, which are all $c$ . If $p$ exists, $\langle T[\mathsf{R}^{-1}_{T}(p)..]\rangle$ is the lexicographically smaller closest p-encoded suffix to $\langle T\rangle$ that is preceded by $c$ . Hence, $\langle T[\mathsf{R}^{-1}_{T}(\mathsf{LF}_{T}(p))..]\rangle=\langle cT[\mathsf{R}^{-1}_{T}(p)..]\rangle$ is the p-pred of $\langle cT\rangle=\langle\hat{T}\rangle$ , which means that $k_{-}=\mathsf{LF}_{T}(p)$ .

If $p$ does not exist, it implies that $\langle\hat{T}\rangle$ is the lexicographically smallest p-encoded suffix that starts with $c$ . Since $\langle\hat{T}\rangle$ lexicographically comes right after the p-encoded suffixes starting with an s-symbol smaller than $c$ , $k_{-}$ is the last occurrence of $b$ in $\mathsf{F}_{T}$ , that is, $k_{-}=\mathsf{select}_{b}(\mathsf{F}_{T},\mathsf{rank}_{b}(\mathsf{F}_{T},|T|))$ . ∎
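Lemma 9 can be exercised on the running example of Table 2, where $k=\mathsf{R}_{T}(1)=10$ . The following sketch hard-codes $\mathsf{F}_{T}$ and $\mathsf{L}_{T}$ from that table and uses naive scans; it assumes that s-symbols are single characters ordered by Python's string comparison, and the symbol $\mathtt{b}$ in the second call is a hypothetical s-symbol that does not occur in $T$ .

```python
# Arrays copied from Table 2 for T = xyazyxazxza$.
F = ["$", "a", "a", "a", 1, 2, 2, 2, 3, 3, 2, 3]
L = ["a", 1, 2, 2, 2, 3, 3, 2, 3, "$", "a", "a"]
K = 10  # rank of <T> itself, i.e. R_T(1)


def rank(w, c, i):
    return w[:i].count(c)


def select(w, c, i):
    positions = [p for p, x in enumerate(w, start=1) if x == c]
    return positions[i - 1] if 1 <= i <= len(positions) else None


def fpq_eq(w, c, i):
    """Largest position j <= i with w[j] == c, or None."""
    for j in range(i, 0, -1):
        if w[j - 1] == c:
            return j
    return None


def lf(i):
    c = L[i - 1]
    return select(F, c, rank(L, c, i))


def p_pred_rank_static(c):
    """k_- for T' = cT when c is an s-symbol (both cases of Lemma 9)."""
    p = fpq_eq(L, c, K)
    if p is not None:
        return lf(p)
    # largest s-symbol occurring in T that is smaller than c
    b = max(s for s in F if isinstance(s, str) and s < c)
    return select(F, b, rank(F, b, len(F)))


print(p_pred_rank_static("a"))  # 2, hence the new suffix gets rank 3
print(p_pred_rank_static("b"))  # 4, hence the new suffix gets rank 5
```

For $c=\mathtt{a}$ the new p-encoded suffix $\mathtt{a\infty\infty a\infty 35a432a\$}$ indeed falls between $\mathtt{a\$}$ and $\mathtt{a\infty\infty 2a\$}$ , since s-symbols are smaller than integers.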

In the rest of this subsection, we consider the case that $c$ is a p-symbol. If $T$ contains no p-symbol, it is clear that $k_{-}=|T|$ . Hence, in what follows, we assume that there is a p-symbol in $T$ .

Since $\langle\hat{T}\rangle$ has the longest $\mathsf{lcp}$ -value $\hat{\lambda}$ with its p-pred or p-succ, we search for such p-encoded suffixes of $T$ using the following lemmas to leverage the information stored in $\mathsf{LCP}^{\infty}_{T}$ .

Lemma 10.

Given two positions $i$ and $j$ with $1\leq i<j\leq|T|$ , $\mathsf{lcp}^{\infty}(\langle T[\mathsf{R}^{-1}_{T}(i)..]\rangle,\langle T[\mathsf{R}^{-1}_{T}(j)..]\rangle)=\mathsf{LCP}^{\infty}_{T}[\mathsf{RmQ}_{\mathsf{LCP}^{\infty}_{T}}(i+1,j)]$ .

Proof.

It is not difficult to see that $\mathsf{lcp}(x,z)=\min\{\mathsf{lcp}(x,y),\mathsf{lcp}(y,z)\}$ for any strings $x<y<z$ , and thus, $\mathsf{lcp}^{\infty}(x,z)=\min\{\mathsf{lcp}^{\infty}(x,y),\mathsf{lcp}^{\infty}(y,z)\}$ . Since $\mathsf{LCP}^{\infty}$ holds the $\mathsf{lcp}^{\infty}$ -values of lexicographically adjacent p-encoded suffixes, we get $\mathsf{lcp}^{\infty}(\langle T[\mathsf{R}^{-1}_{T}(i)..]\rangle,\langle T[\mathsf{R}^{-1}_{T}(j)..]\rangle)=\min\{\mathsf{LCP}^{\infty}_{T}[g]\}_{g=i+1}^{j}=\mathsf{LCP}^{\infty}_{T}[\mathsf{RmQ}_{\mathsf{LCP}^{\infty}_{T}}(i+1,j)]$ by applying the previous argument successively. ∎
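Lemma 10 can be checked exhaustively on the example of Table 2 by comparing the range minimum over $\mathsf{LCP}^{\infty}_{T}$ with a direct computation on the p-encoded suffixes. The sketch below hard-codes $\mathsf{R}^{-1}_{T}$ and $\mathsf{LCP}^{\infty}_{T}$ from the table and reads the range minimum as the minimum value in the range; the naive `min` over a slice stands in for a real range minimum structure.

```python
INF = float("inf")
T = "xyazyxazxza$"
STATIC = {"a", "$"}
R_INV = [12, 11, 7, 3, 10, 6, 2, 9, 5, 1, 8, 4]    # from Table 2
LCP_INF = [0, 0, 0, 2, 0, 1, 2, 1, 2, 3, 2, 2]     # from Table 2


def p_encode(w, static):
    last, out = {}, []
    for i, ch in enumerate(w, start=1):
        if ch in static:
            out.append(ch)
            continue
        out.append(i - last[ch] if ch in last else INF)
        last[ch] = i
    return out


def lcp_inf_direct(i, j):
    """Number of infinities in the common prefix of the suffixes of rank i, j."""
    x = p_encode(T[R_INV[i - 1] - 1:], STATIC)
    y = p_encode(T[R_INV[j - 1] - 1:], STATIC)
    count = 0
    for u, v in zip(x, y):
        if u != v:
            break
        count += 1 if u == INF else 0
    return count


def lcp_inf_by_range_min(i, j):
    """Minimum of LCP_INF[i+1..j] (1-based), as stated in Lemma 10."""
    return min(LCP_INF[i:j])  # 0-based slice i..j-1 equals 1-based i+1..j


print(lcp_inf_direct(5, 7), lcp_inf_by_range_min(5, 7))   # 1 1
print(lcp_inf_direct(8, 12), lcp_inf_by_range_min(8, 12)) # 2 2
```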

Lemma 11.

Algorithm 1 correctly computes the maximal interval $[l..r]$ such that $\mathsf{lcp}^{\infty}(\langle T[\mathsf{R}^{-1}_{T}(i)..]\rangle,\langle T[\mathsf{R}^{-1}_{T}(j)..]\rangle)\geq e$ for any $j\in[l..r]$ .

Function GetMI( $i$ , $e$ ):
    $l\leftarrow\max\{1,\mathsf{FPQ}_{<e}(\mathsf{LCP}^{\infty}_{T},i)\}$ ;
    $r\leftarrow\min\{|T|,\mathsf{FNQ}_{<e}(\mathsf{LCP}^{\infty}_{T},i+1)-1\}$ ;
    return $[l..r]$ ;

Algorithm 1: Algorithm to compute the maximal interval $[l..r]$ such that $\mathsf{lcp}^{\infty}(\langle T[\mathsf{R}^{-1}_{T}(i)..]\rangle,\langle T[\mathsf{R}^{-1}_{T}(j)..]\rangle)\geq e$ for any $j\in[l..r]$ .
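Algorithm 1 can be transcribed almost literally. The following sketch runs it on $\mathsf{LCP}^{\infty}_{T}$ from Table 2, using linear scans for the two find queries and falling back to $1$ or $|T|$ when a query has no answer, which mirrors the $\max$ / $\min$ convention of Section 2.1; the helper names are illustrative.

```python
LCP_INF = [0, 0, 0, 2, 0, 1, 2, 1, 2, 3, 2, 2]  # from Table 2


def fpq_less(w, e, i):
    """Largest position j <= i with w[j] < e, or None."""
    for j in range(i, 0, -1):
        if w[j - 1] < e:
            return j
    return None


def fnq_less(w, e, i):
    """Smallest position j >= i with w[j] < e, or None."""
    for j in range(i, len(w) + 1):
        if w[j - 1] < e:
            return j
    return None


def get_mi(i, e):
    """Maximal rank interval [l..r] around i sharing at least e infinities."""
    n = len(LCP_INF)
    p = fpq_less(LCP_INF, e, i)
    l = p if p is not None else 1           # max{1, FPQ}
    q = fnq_less(LCP_INF, e, i + 1)
    r = q - 1 if q is not None else n       # min{|T|, FNQ - 1}
    return l, r


print(get_mi(9, 2))  # (8, 12): ranks 8..12 all start with two infinities
print(get_mi(6, 2))  # (6, 7)
print(get_mi(3, 2))  # (3, 4)
```

In Table 2, the suffixes of ranks 8 to 12 are exactly those whose p-encoding starts with $\infty\infty$ , which matches the first call.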

Lemma 12.

Algorithm 2 correctly returns $\hat{k}$ .

Proof.

Let $h_{i}=\mathsf{select}_{\infty}(\langle T\rangle,i)$ for any $1\leq i\leq\min\{|T|_{p},\pi(\hat{T})\}$ , and $h_{i}=|T|+1$ for any $i>\min\{|T|_{p},\pi(\hat{T})\}$ . Also let $\lambda=\max\{\mathsf{lcp}(\langle\hat{T}\rangle,\langle T[i..]\rangle)\mid 1\leq i\leq|T|\}$ . Although Algorithm 2 does not intend to compute the exact value of $\lambda$ , it checks if $\lambda$ falls in $[h_{e}..h_{e+1}]$ in decreasing order of $e$ starting from $\min\{\pi(\hat{T}),\max\{\mathsf{LCP}^{\infty}_{T}[k],\mathsf{LCP}^{\infty}_{T}[k+1]\}\}$ . One of the necessary conditions to have $\mathsf{lcp}(\langle\hat{T}\rangle,\langle T[i..]\rangle)>h_{e}$ is that $\mathsf{lcp}(\langle T\rangle,\langle T[i+1..]\rangle)\geq h_{e}$ , or equivalently $\mathsf{lcp}^{\infty}(\langle T\rangle,\langle T[i+1..]\rangle)\geq e$ . Line 2 computes the maximal interval $[l..r]$ that represents the ranks of the p-encoded suffixes having an $\mathsf{lcp}^{\infty}$ -value larger than or equal to $e$ . The basic idea is to find a p-encoded suffix in $\{\langle T[\mathsf{R}^{-1}_{T}(p)..]\rangle\}_{p=l}^{r}$ that comes closest to $\langle\hat{T}\rangle$ when extended by adding its preceding symbol. When $e$ comes down to the point with $\lambda\in[h_{e}+1..h_{e+1}]$ , $\hat{k}$ is returned in one of the if-then-blocks at Lines 2, 2, 2 and 2.

If $\mathsf{lcp}(\langle\hat{T}\rangle,\langle T[i..]\rangle)=h_{\hat{e}}$ for an integer $\hat{e}$ , there are two possible scenarios:

1.

$\mathsf{lcp}^{\infty}(\langle T\rangle,\langle T[i+1..]\rangle)\geq\hat{e}$ and either $\pi(\hat{T})$ or $\pi(T[i..])$ is $\hat{e}$ , and
2.

$\mathsf{lcp}(\langle T\rangle,\langle T[i+1..]\rangle)=h_{\hat{e}}-1$ and both $\pi(\hat{T})$ and $\pi(T[i..])$ are at least $\hat{e}$ .

The former p-encoded suffix is processed in one of the if-then-blocks at Lines 2 and 2 when $e=\hat{e}$ , while the latter at Lines 2, 2, 2 and 2 when $e=\hat{e}-1$ . Note that the former is never farther from $\langle\hat{T}\rangle$ than the latter because the lexicographic order between $\langle\hat{T}\rangle$ and $\langle T[i..]\rangle$ is determined by $\infty$ and $h_{\hat{e}}$ at $h_{\hat{e}}+1$ in the former case, while it is by $\infty$ and something smaller than $h_{\hat{e}}$ in the latter case. Since Algorithm 2 processes the former case first, it guarantees that the algorithm finds the closer one first.

The case with $e=\pi(\hat{T})$ is treated differently than other cases in the if-then-block at Line 2 since $h_{\pi(\hat{T})}$ is the unique position where $\langle T\rangle[h_{\pi(\hat{T})}]=\infty$ turns into $\langle\hat{T}\rangle[h_{\pi(\hat{T})}+1]=h_{\pi(\hat{T})}$ . For a p-encoded suffix $\langle T[\mathsf{R}^{-1}_{T}(q^{\prime})..]\rangle\in\{\langle T[\mathsf{R}^{-1}_{T}(p)..]\rangle\}_{p=l}^{r}$ , having $\mathsf{L}_{T}[q^{\prime}]=\pi(\hat{T})$ is necessary and sufficient for its extended suffix $\langle T[\mathsf{R}^{-1}_{T}(q^{\prime})-1..]\rangle$ to have an $\mathsf{lcp}$ -value larger than $h_{\pi(\hat{T})}$ with $\hat{T}$ . By Corollary 6, p-encoded suffixes satisfying this condition must preserve their lexicographic order after extension, and hence, it is enough to search for the closest one ( $q\leftarrow\mathsf{FPQ}_{e}(\mathsf{L}_{T},k)$ or $q\leftarrow\mathsf{FNQ}_{e}(\mathsf{L}_{T},k)$ ) to $\langle T\rangle$ and compute the rank of its extended suffix by $\mathsf{LF}_{T}(q)$ . If Lines 2 and 2 fail to return a value, it means that $\lambda\leq h_{\pi(\hat{T})}$ . The if-block at Line 2 checks if there exists a p-encoded suffix $\langle T[i+1..]\rangle$ that satisfies the former condition to be $\mathsf{lcp}(\langle\hat{T}\rangle,\langle T[i..]\rangle)=h_{\pi(\hat{T})}$ . It is enough to find one $\langle T[\mathsf{R}^{-1}_{T}(q)..]\rangle$ with $\mathsf{L}[q]\geq\pi(\hat{T})$ for $q\in[l..r]$ because it is necessary and sufficient to have $\mathsf{lcp}(\langle\hat{T}\rangle,\langle T[\mathsf{R}^{-1}_{T}(q)-1..]\rangle)=h_{\pi(\hat{T})}$ and $\langle\hat{T}\rangle[h_{\pi(\hat{T})}+1]=\infty\neq h_{\pi(\hat{T})}=\langle T[\mathsf{R}^{-1}_{T}(q)-1..]\rangle[h_{\pi(\hat{T})}+1]$ . Note that there could be two or more p-encoded suffixes that satisfy the condition and their lexicographic order may change by extension. 
In the then-block at Line 2, the algorithm computes the rank of the lexicographically smallest p-encoded suffix that has an $\mathsf{lcp}^{\infty}$ -value larger than $\pi(\hat{T})$ with $\langle T[\mathsf{R}^{-1}_{T}(q)-1..]\rangle=\langle T[\mathsf{R}^{-1}_{T}(\mathsf{LF}_{T}(q))..]\rangle$ , which is the p-succ of $\hat{T}$ in this case.

The case with $e\neq\pi(\hat{T})$ is processed in the else-block at Line 2. Here, it is good to keep in mind that when we enter this else-block, $\mathsf{lcp}^{\infty}(\langle T\rangle,\langle T[i..]\rangle)\leq e$ or $\pi(T[i-1..])\leq e$ holds for any proper suffix $T[i..]$ of $T$ , since otherwise $\hat{k}$ must have been reported in a previous round of the foreach loop.

When the if-condition at Line 2 holds, $\langle T[\mathsf{R}^{-1}_{T}(q)..]\rangle$ is the lexicographically smaller closest p-encoded suffix to $\langle T\rangle$ such that $\mathsf{lcp}^{\infty}(\langle\hat{T}\rangle,\langle T[\mathsf{R}^{-1}_{T}(q)-1..]\rangle)\geq e+1$ , or equivalently $\mathsf{lcp}(\langle\hat{T}\rangle,\langle T[\mathsf{R}^{-1}_{T}(q)-1..]\rangle)>h_{e}$ . For any p-encoded suffix in $\{\langle T[\mathsf{R}^{-1}_{T}(p)..]\rangle\}_{p=q+1}^{k-1}$ its extended suffix is lexicographically smaller than $\langle T[\mathsf{R}^{-1}_{T}(q)-1..]\rangle$ due to Corollary 7, and never closer to $\langle\hat{T}\rangle$ than $\langle T[\mathsf{R}^{-1}_{T}(q)-1..]\rangle$ . At Line 2, the algorithm computes the maximal interval $[l^{\prime}..r^{\prime}]$ such that every p-encoded suffix in $\{\langle T[\mathsf{R}^{-1}_{T}(p)..]\rangle\}_{p=l^{\prime}}^{r^{\prime}}$ shares the common prefix of length $h^{\prime}:=\min\{|T|-\mathsf{R}^{-1}_{T}(q)+1,\mathsf{select}_{\infty}(\langle T[\mathsf{R}^{-1}_{T}(q)..]\rangle,e+1)\}$ with $\langle T[\mathsf{R}^{-1}_{T}(q)..]\rangle$ . Since any $\langle T[i..]\rangle\in\{\langle T[\mathsf{R}^{-1}_{T}(p)..]\rangle\}_{p=1}^{l^{\prime}-1}$ has an $\mathsf{lcp}$ -value smaller than $h^{\prime}$ with $\langle T[\mathsf{R}^{-1}_{T}(q)..]\rangle$ , it follows from Lemma 4 that $\langle T[i-1..]\rangle<\langle T[\mathsf{R}^{-1}_{T}(q)-1..]\rangle$ . Also, for any $\langle T[i..]\rangle\in\{\langle T[\mathsf{R}^{-1}_{T}(p)..]\rangle\}_{p=k+1}^{|T|}$ , the aforementioned precondition to enter the else-block at Line 2 leads to $\langle\hat{T}\rangle<\langle T[i-1..]\rangle$ or $\langle T[i-1..]\rangle<\langle T[\mathsf{R}^{-1}_{T}(q)..]\rangle$ by Lemma 4. 
So far we have confirmed that the $\mathsf{lcp}$-value between $\langle\hat{T}\rangle$ and the p-pred of $\langle\hat{T}\rangle$ is less than $h^{\prime}$, which implies that the p-pred is the largest p-encoded suffix that is prefixed by $x:=\langle T[\mathsf{R}^{-1}_{T}(q)-1..]\rangle[..h^{\prime}]$. If $q^{\prime}\leftarrow\mathsf{FPQ}_{\geq e+2}(\mathsf{L}_{T},r^{\prime})$ computed at Line 11 is in $[l^{\prime}..r^{\prime}]$, $\langle T[\mathsf{R}^{-1}_{T}(q^{\prime})-1..]\rangle=\langle T[\mathsf{LF}_{T}(q^{\prime})..]\rangle$ is prefixed by $x\cdot\infty$ and the p-pred is the largest p-encoded suffix that is prefixed by $x\cdot\infty$, which can be computed by $\max\textnormal{{GetMI}}(\mathsf{LF}_{T}(q^{\prime}),e+2)$ because $\langle T[\mathsf{R}^{-1}_{T}(q^{\prime})-1..]\rangle[..h^{\prime}+1]=\langle T[\mathsf{R}^{-1}_{T}(\mathsf{LF}_{T}(q^{\prime}))..]\rangle[..h^{\prime}+1]=x\cdot\infty$ contains exactly $e+2$ $\infty$’s. If $q^{\prime}\notin[l^{\prime}..r^{\prime}]$, the p-pred is the largest p-encoded suffix that is prefixed by $x\cdot h^{\prime}$, which is $\langle T[\mathsf{R}^{-1}_{T}(\mathsf{LF}_{T}(q))..]\rangle$.

When the if-condition at Line 15 holds, $\langle T[\mathsf{R}^{-1}_{T}(q)..]\rangle$ is the lexicographically larger closest p-encoded suffix to $\langle T\rangle$ such that $\mathsf{lcp}^{\infty}(\langle\hat{T}\rangle,\langle T[\mathsf{R}^{-1}_{T}(q)-1..]\rangle)\geq e+1$, or equivalently $\mathsf{lcp}(\langle\hat{T}\rangle,\langle T[\mathsf{R}^{-1}_{T}(q)-1..]\rangle)>h_{e}$. For any p-encoded suffix in $\{\langle T[\mathsf{R}^{-1}_{T}(p)..]\rangle\}_{p=k+1}^{q-1}$, its extended suffix is lexicographically smaller than $\langle T[\mathsf{R}^{-1}_{T}(q)-1..]\rangle$ due to Corollary 7, and never closer to $\langle\hat{T}\rangle$ than $\langle T[\mathsf{R}^{-1}_{T}(q)-1..]\rangle$. At Line 16, the algorithm computes the maximal interval $[l^{\prime}..r^{\prime}]$ such that every p-encoded suffix in $\{\langle T[\mathsf{R}^{-1}_{T}(p)..]\rangle\}_{p=l^{\prime}}^{r^{\prime}}$ shares the common prefix of length $h^{\prime}:=\min\{|T|-\mathsf{R}^{-1}_{T}(q)+1,\mathsf{select}_{\infty}(\langle T[\mathsf{R}^{-1}_{T}(q)..]\rangle,e+1)\}$ with $\langle T[\mathsf{R}^{-1}_{T}(q)..]\rangle$. Since any $\langle T[i..]\rangle\in\{\langle T[\mathsf{R}^{-1}_{T}(p)..]\rangle\}_{p=r^{\prime}+1}^{|T|}$ has an $\mathsf{lcp}$-value smaller than $h^{\prime}$ with $\langle T[\mathsf{R}^{-1}_{T}(q)..]\rangle$, it follows from Lemma 4 that $\langle T[i-1..]\rangle<\langle T[\mathsf{R}^{-1}_{T}(q)-1..]\rangle$. Also, for any $\langle T[i..]\rangle\in\{\langle T[\mathsf{R}^{-1}_{T}(p)..]\rangle\}_{p=1}^{k-1}$, the aforementioned precondition to enter the else-block at Line 8 leads to $\langle T[i-1..]\rangle<\langle\hat{T}\rangle$ by Lemma 4. So far we have confirmed that the $\mathsf{lcp}$-value between $\langle\hat{T}\rangle$ and the p-succ of $\langle\hat{T}\rangle$ is less than $h^{\prime}$, which implies that the p-succ is the smallest p-encoded suffix that is prefixed by $x:=\langle T[\mathsf{R}^{-1}_{T}(q)-1..]\rangle[..h^{\prime}]$.
If $q^{\prime}\leftarrow\mathsf{FNQ}_{e+1}(\mathsf{L}_{T},l^{\prime})$ computed at Line 17 is in $[l^{\prime}..r^{\prime}]$, the p-succ is the smallest p-encoded suffix that is prefixed by $x\cdot h^{\prime}$, which is $\langle T[\mathsf{R}^{-1}_{T}(\mathsf{LF}_{T}(q^{\prime}))..]\rangle$. If $q^{\prime}\notin[l^{\prime}..r^{\prime}]$, the p-succ is the smallest p-encoded suffix that is prefixed by $x\cdot\infty$, which can be computed by $\min\textnormal{{GetMI}}(\mathsf{LF}_{T}(q),e+2)$ because $\langle T[\mathsf{R}^{-1}_{T}(q)-1..]\rangle[..h^{\prime}+1]=\langle T[\mathsf{R}^{-1}_{T}(\mathsf{LF}_{T}(q))..]\rangle[..h^{\prime}+1]=x\cdot\infty$ contains exactly $e+2$ $\infty$’s.

When we enter the if-then-block at Line 21, it is guaranteed that $\lambda\leq h_{e}$. In order to check whether there exists a p-encoded suffix $\langle T[i+1..]\rangle$ that satisfies the former condition to be $\mathsf{lcp}(\langle\hat{T}\rangle,\langle T[i..]\rangle)=h_{e}$, the algorithm computes $q\leftarrow\mathsf{FPQ}_{e}(\mathsf{L}_{T},r)$. If $q\in[l..r]$, $\langle T[\mathsf{R}^{-1}_{T}(q)..]\rangle$ is the lexicographically largest p-encoded suffix that satisfies the condition, and by Corollary 6, its extended suffix $\langle T[\mathsf{R}^{-1}_{T}(q)-1..]\rangle$ must be the largest one to have $\mathsf{lcp}(\langle\hat{T}\rangle,\langle T[\mathsf{R}^{-1}_{T}(q)-1..]\rangle)=h_{e}$. Therefore, $\langle T[\mathsf{R}^{-1}_{T}(q)-1..]\rangle$ is the p-pred of $\langle\hat{T}\rangle$, and $\hat{k}=1+\mathsf{LF}_{T}(q)$. ∎

1 foreach $e\leftarrow\min\{\pi(\hat{T}),\max\{\mathsf{LCP}^{\infty}_{T}[k],\mathsf{LCP}^{\infty}_{T}[k+1]\}\}$ down to $1$ do
2   $[l..r]\leftarrow\textnormal{GetMI}(k,e)$;
3   if $e=\pi(\hat{T})$ then
4     if $(q\leftarrow\mathsf{FPQ}_{e}(\mathsf{L}_{T},k))\in[l..r]$ then return $1+\mathsf{LF}_{T}(q)$;
5     if $(q\leftarrow\mathsf{FNQ}_{e}(\mathsf{L}_{T},k))\in[l..r]$ then return $\mathsf{LF}_{T}(q)$;
6     if $(q\leftarrow\mathsf{FNQ}_{\geq e+1}(\mathsf{L}_{T},l))\in[l..r]$ then return $\min\textnormal{GetMI}(\mathsf{LF}_{T}(q),e+1)$;
8   else
9     if $(q\leftarrow\mathsf{FPQ}_{\geq e+1}(\mathsf{L}_{T},k))\in[l..r]$ then
10      $[l^{\prime}..r^{\prime}]\leftarrow\textnormal{GetMI}(q,e+1)$;
11      $q^{\prime}\leftarrow\mathsf{FPQ}_{\geq e+2}(\mathsf{L}_{T},r^{\prime})$;
12      if $q^{\prime}\in[l^{\prime}..r^{\prime}]$ then return $1+\max\textnormal{GetMI}(\mathsf{LF}_{T}(q^{\prime}),e+2)$;
13      else return $1+\mathsf{LF}_{T}(q)$;
15    if $(q\leftarrow\mathsf{FNQ}_{\geq e+1}(\mathsf{L}_{T},k))\in[l..r]$ then
16      $[l^{\prime}..r^{\prime}]\leftarrow\textnormal{GetMI}(q,e+1)$;
17      $q^{\prime}\leftarrow\mathsf{FNQ}_{e+1}(\mathsf{L}_{T},l^{\prime})$;
18      if $q^{\prime}\in[l^{\prime}..r^{\prime}]$ then return $\mathsf{LF}_{T}(q^{\prime})$;
19      else return $\min\textnormal{GetMI}(\mathsf{LF}_{T}(q),e+2)$;
21  if $(q\leftarrow\mathsf{FPQ}_{e}(\mathsf{L}_{T},r))\in[l..r]$ then return $1+\mathsf{LF}_{T}(q)$;

Algorithm 2 Algorithm to compute $\hat{k}$

3.2 How to maintain $\mathsf{LCP}^{\infty}$

Suppose that we have $k=\mathsf{R}_{T}(1)$, $\hat{k}=\mathsf{R}_{\hat{T}}(1)$, $\mathsf{L}_{T}$ and $\mathsf{F}_{T}$. We show how to compute the $\mathsf{lcp}^{\infty}$-values of $\langle\hat{T}\rangle$ with its p-pred $\langle T[\mathsf{R}^{-1}_{T}(\hat{k}-1)..]\rangle$ and p-succ $\langle T[\mathsf{R}^{-1}_{T}(\hat{k})..]\rangle$ in order to maintain $\mathsf{LCP}^{\infty}$.

We focus on $\mathsf{lcp}^{\infty}(\langle\hat{T}\rangle,\langle T[\mathsf{R}^{-1}_{T}(\hat{k})..]\rangle)$ because the other one can be treated similarly. We apply Lemma 4 by setting $x=\hat{T}$ and $y=T[\mathsf{R}^{-1}_{T}(\hat{k})..]$ if $k<\mathsf{FL}_{T}(\hat{k})$ (otherwise we swap the roles of $x$ and $y$). In order to get $\mathsf{lcp}^{\infty}(\langle x\rangle,\langle y\rangle)$, all we need are $\pi(x)=\pi(\hat{T})$, $\pi(y)=\mathsf{F}_{T}[\hat{k}]$ and $e=\mathsf{lcp}^{\infty}(\langle x[2..]\rangle,\langle y[2..]\rangle)$. For the computation of $e$ we use Lemma 10, i.e., $e=\mathsf{lcp}^{\infty}(\langle x[2..]\rangle,\langle y[2..]\rangle)=\mathsf{RmQ}_{\mathsf{LCP}^{\infty}_{T}}(k+1,\mathsf{FL}_{T}(\hat{k}))$.
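The range-minimum step above has a well-known counterpart for ordinary suffix arrays, which may help to see why a single $\mathsf{RmQ}$ suffices. The following sketch illustrates only that classical analogue on a plain string; it is a stand-in and does not implement p-encoding or $\mathsf{lcp}^{\infty}$. The names `lcp`, `lcp_array` and `lcp_by_rmq` are ours, ranks are 0-based, and the range minimum is a naive scan.

```python
def lcp(x, y):
    """Length of the longest common prefix of x and y."""
    n = 0
    while n < len(x) and n < len(y) and x[n] == y[n]:
        n += 1
    return n


text = "abracadabra$"  # '$' acts as a unique, smallest end marker
# order[r] is the start position of the suffix of rank r (naive sorting).
order = sorted(range(len(text)), key=lambda i: text[i:])

# lcp_array[r] is the lcp of the suffixes of rank r-1 and r.
lcp_array = [0] + [
    lcp(text[order[r - 1]:], text[order[r]:]) for r in range(1, len(text))
]


def lcp_by_rmq(a, b):
    """lcp of the suffixes of rank a < b as a range minimum over lcp_array."""
    return min(lcp_array[a + 1:b + 1])
```

In the setting of this subsection, the two ranks would be $k$ and $\mathsf{FL}_{T}(\hat{k})$, and the array would be $\mathsf{LCP}^{\infty}_{T}$.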

3.3 Dynamic data structures and analysis

We consider constructing $\mathsf{F}_{T}$, $\mathsf{L}_{T}$ and $\mathsf{LCP}^{\infty}_{T}$ of a p-string $T$ of length $n$ over an alphabet $(\Sigma_{s}\cup\Sigma_{p})\subseteq[0..\sigma]$ online. Let $\sigma_{s}$ and $\sigma_{p}$ be the numbers of distinct s-symbols and p-symbols, respectively, that appear in $T$.

First we show data structures needed to implement our algorithm presented in the previous subsections. For $\mathsf{F}_{T}$ we maintain a dynamic string of Lemma 2 supporting random access, insertion, $\mathsf{rank}$ and $\mathsf{select}$ queries in $O(\frac{\lg n}{\lg\lg n})$ time and $O(n\lg\sigma)$ bits of space. For $\mathsf{LCP}^{\infty}_{T}$ we maintain a dynamic string of Lemma 3 to support random access, insertion, $\mathsf{RmQ}$ , $\mathsf{FPQ}$ and $\mathsf{FNQ}$ queries in $O(\frac{\lg\sigma_{p}\lg n}{\lg\lg n})$ time and $O(n\lg\sigma_{p})$ bits of space.

If we build a dynamic string of Lemma 3 for $\mathsf{L}_{T}$ , the query time would be $O(\frac{\lg\sigma\lg n}{\lg\lg n})$ . Since our algorithm does not use $\mathsf{RmQ}$ , $\mathsf{FPQ}$ and $\mathsf{FNQ}$ queries for s-symbols, we can reduce the query time to $O(\frac{\lg\sigma_{p}\lg n}{\lg\lg n})$ as follows. We represent $\mathsf{L}_{T}$ with one level of a wavelet tree, where a bit vector $B_{T}$ partitions the alphabet into $\Sigma_{s}$ and $[1..|T|_{p}]$ and thus has pointers to $X_{T}$ and $Y_{T}$ storing respectively the sequence over $\Sigma_{s}$ and that over $[1..|T|_{p}]$ of $\mathsf{L}_{T}$ . We represent the former and the latter by the data structures described in Lemmas 2 and 3, respectively, since we only need the aforementioned queries such as $\mathsf{RmQ}$ on $Y_{T}$ . Then, queries on $\mathsf{L}_{T}$ can be answered in $O(\frac{\lg\sigma_{p}\lg n}{\lg\lg n})$ time using $O(n\lg\sigma)$ bits of space.
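The one-level split can be pictured with a small static sketch. It assumes, purely for illustration, that s-symbols are Python strings and that all other entries are integers (or infinity); the class `SplitSequence` and its method names are ours, positions are 0-based, and `rank1`/`select1` are naive linear scans rather than the dynamic structures of Lemmas 2 and 3.

```python
class SplitSequence:
    """Naive one-level split of a sequence into an s-symbol part and an integer part."""

    def __init__(self, seq):
        # bits plays the role of B_T: 0 marks an s-symbol, 1 marks an integer entry.
        self.bits = [0 if isinstance(c, str) else 1 for c in seq]
        self.x = [c for c in seq if isinstance(c, str)]      # role of X_T
        self.y = [c for c in seq if not isinstance(c, str)]  # role of Y_T

    def rank1(self, i):
        """Number of 1-bits in bits[0:i]."""
        return sum(self.bits[:i])

    def select1(self, j):
        """Position of the j-th (0-based) 1-bit."""
        seen = 0
        for p, b in enumerate(self.bits):
            if b:
                if seen == j:
                    return p
                seen += 1
        raise IndexError(j)

    def access(self, i):
        """Recover the i-th entry of the original sequence."""
        ones = self.rank1(i)
        return self.y[ones] if self.bits[i] else self.x[i - ones]

    def find_next_ge(self, i, e):
        """Smallest position p >= i holding an integer >= e, or None (naive scan of Y)."""
        j = self.rank1(i)  # index in Y of the first integer entry at or after i
        while j < len(self.y) and self.y[j] < e:
            j += 1
        return self.select1(j) if j < len(self.y) else None
```

The point of the split is that a query such as `find_next_ge` touches only the integer part, so its cost depends on the alphabet of $Y_{T}$ and not on the s-symbols.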

In addition to these dynamic strings for $\mathsf{F}_{T}$ , $\mathsf{L}_{T}$ and $\mathsf{LCP}^{\infty}_{T}$ , we consider another dynamic string $Z_{T}$ . Let $Z_{T}$ be a string that is obtained by extracting the leftmost occurrence of every p-symbol in $T$ . Thus, $|Z_{T}|=\sigma_{p}\leq n$ . A dynamic string $Z_{T}$ of Lemma 2 enables us to compute $\pi(cT)$ for a p-symbol $c$ by $\pi(cT)=\min\{\infty,\mathsf{select}_{c}(Z_{T},1)\}$ in $O(\frac{\lg\sigma_{p}}{\lg\lg\sigma_{p}})$ time and $O(\sigma_{p}\lg\sigma_{p})$ bits of space.
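As a minimal illustration of how $Z_{T}$ yields $\pi(cT)$, the sketch below uses a plain Python list in place of the dynamic string of Lemma 2. The function names and the explicit `p_symbols` set are ours, and a missing symbol is reported as infinity.

```python
import math


def leftmost_occurrences(t, p_symbols):
    """Z_T: the distinct p-symbols of t, ordered by their leftmost occurrence."""
    seen, z = set(), []
    for c in t:
        if c in p_symbols and c not in seen:
            seen.add(c)
            z.append(c)
    return z


def pi_of_prepended(c, z):
    """pi(cT) for a p-symbol c: the 1-based position of c in Z_T, or infinity."""
    return z.index(c) + 1 if c in z else math.inf


def prepend(c, z):
    """Z_{cT} from Z_T: c becomes the first entry and is removed from its old place."""
    return [c] + [d for d in z if d != c]
```

For example, with $T=\texttt{xAyxzAy}$ and p-symbols $\{\texttt{x},\texttt{y},\texttt{z}\}$, the list is `['x', 'y', 'z']`, so prepending `y` gives the value 2, while prepending a p-symbol that does not occur in $T$ gives infinity.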

We also maintain a fusion tree of [25] of $O(\sigma_{s}\lg n)=O(n\lg\sigma)$ bits to maintain the set of s-symbols used in the current $T$ , which enables us to compute $b$ in Lemma 9 in $O(\frac{\lg\sigma_{s}}{\lg\lg\sigma_{s}})$ time.

We are now ready to prove the following lemma.

Lemma 13.

$\mathsf{F}_{T}$ , $\mathsf{L}_{T}$ and $\mathsf{LCP}^{\infty}_{T}$ for a p-string of length $n$ over an alphabet $(\Sigma_{s}\cup\Sigma_{p})\subseteq[0..\sigma]$ can be constructed online in $O(n\frac{\lg\sigma_{p}\lg n}{\lg\lg n})$ time and $O(n\lg\sigma)$ bits of space, where $\sigma_{p}$ is the number of distinct p-symbols used in the p-string.

Proof.

We maintain the dynamic data structures of $O(n\lg\sigma)$ bits described in this subsection while prepending a symbol to the current p-string. For a single step of updating $T$ to $\hat{T}=cT$ with $c\in(\Sigma_{s}\cup\Sigma_{p})$ , we compute $\hat{k}=\mathsf{R}_{\hat{T}}(1)$ as described in Subsection 3.1 and obtain $\mathsf{F}_{\hat{T}}$ and $\mathsf{L}_{\hat{T}}$ by replacing $\$$ in $\mathsf{L}_{T}$ at $k=\mathsf{R}_{T}(1)$ by $\pi(\hat{T})$ and inserting $\$$ and $\pi(\hat{T})$ into the $\hat{k}$ -th position of $\mathsf{L}_{T}$ and $\mathsf{F}_{T}$ , respectively. $\mathsf{LCP}^{\infty}$ is updated as described in Subsection 3.2.

If $c\in\Sigma_{s}$, the computation of $\hat{k}$ based on Lemma 9 requires a constant number of queries. If $c\in\Sigma_{p}$, Algorithm 2 computes $\hat{k}$ invoking $O(2+e-\hat{e})$ queries, where $e=\max\{\mathsf{LCP}^{\infty}_{T}[k],\mathsf{LCP}^{\infty}_{T}[k+1]\}$ and $\hat{e}=\max\{\mathsf{LCP}^{\infty}_{\hat{T}}[\hat{k}],\mathsf{LCP}^{\infty}_{\hat{T}}[\hat{k}+1]\}$. The value $e$ can be seen as a potential held by the current string $T$, which upper bounds the number of queries. The number of queries in a single step can be $O(\sigma_{p})$ in the worst case when $e$ and $\hat{e}$ are close to $\sigma_{p}$ and $0$, respectively, but this reduces the potential for later steps, which allows us to give an amortized analysis. Since a single step increases the potential by at most 1 by Corollary 5, the total number of queries can be bounded by $O(n)$.

Since we invoke $O(n)$ queries that take $O(\frac{\lg\sigma_{p}\lg n}{\lg\lg n})$ time each, the overall time complexity is $O(n\frac{\lg\sigma_{p}\lg n}{\lg\lg n})$ . ∎
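The amortized bound can be written out as a telescoping sum. The following is a sketch under assumptions that we state explicitly: $e_{i}$ denotes the potential before the $i$-th step (so that $\hat{e}$ of step $i$ equals $e_{i+1}$ by definition), the potential of the initial string is $e_{1}=0$, and $c$ is a constant such that step $i$ issues at most $c\,(2+e_{i}-e_{i+1})$ queries. The symbols $e_{i}$ and $c$ are our notation. Since $e_{i+1}\leq e_{i}+1$, every summand is at least $c$, which also covers the constant cost of the steps that prepend an s-symbol.

```latex
\sum_{i=1}^{n} c\,\bigl(2 + e_{i} - e_{i+1}\bigr)
  \;=\; c\,\bigl(2n + e_{1} - e_{n+1}\bigr)
  \;\le\; 2cn \;=\; O(n),
\qquad \text{since } e_{1} = 0 \text{ and } e_{n+1} \ge 0 .
```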

4 Extendable compact index for p-matching

In this section, we show that $\mathsf{L}_{T}$ , $\mathsf{F}_{T}$ and $\mathsf{LCP}^{\infty}_{T}$ can serve as an index for p-matching.

First we show that we can support backward search, a core procedure of BWT-based indexes, with the data structures for $\mathsf{L}_{T}$ , $\mathsf{F}_{T}$ and $\mathsf{LCP}^{\infty}_{T}$ described in Subsection 3.3. For any p-string $w$ , let $w$ -interval be the maximal interval $[l..r]$ such that $\langle T[\mathsf{R}^{-1}_{T}(p)..]\rangle$ is prefixed by $\langle w\rangle$ for any $p\in[l..r]$ . We show the next lemma for a single step of backward search, which computes $cw$ -interval from $w$ -interval.

Lemma 14.

Suppose that we have data structures for $\mathsf{L}_{T}$, $\mathsf{F}_{T}$ and $\mathsf{LCP}^{\infty}_{T}$ described in Subsection 3.3. Given $w$-interval $[l..r]$ and $c\in(\Sigma_{s}\cup\Sigma_{p})$, we can compute $cw$-interval $[l^{\prime}..r^{\prime}]$ in $O(\frac{\lg\sigma_{p}\lg n}{\lg\lg n})$ time.

Proof.

We show that we can compute $cw$-interval from $w$-interval using a constant number of queries supported on $\mathsf{L}_{T}$, $\mathsf{F}_{T}$ and $\mathsf{LCP}^{\infty}_{T}$, which take $O(\frac{\lg\sigma_{p}\lg n}{\lg\lg n})$ time each.

• When $c$ is in $\Sigma_{s}$: $\langle T[\mathsf{R}^{-1}_{T}(\mathsf{LF}_{T}(p))..]\rangle$ is prefixed by $\langle cw\rangle$ if and only if $\langle T[\mathsf{R}^{-1}_{T}(p)..]\rangle$ is prefixed by $\langle w\rangle$ and $\mathsf{L}_{T}[p]=c$. In other words, $\mathsf{LF}_{T}(p)\in[l^{\prime}..r^{\prime}]$ if and only if $p\in[l..r]$ and $\mathsf{L}_{T}[p]=c$. Then it holds that $l^{\prime}=\mathsf{LF}_{T}(\mathsf{FNQ}_{c}(\mathsf{L}_{T},l))$ and $r^{\prime}=\mathsf{LF}_{T}(\mathsf{FPQ}_{c}(\mathsf{L}_{T},r))$ due to Corollary 6.

• When $c$ is a p-symbol that appears in $w$: Similar to the previous case, $\mathsf{LF}_{T}(p)\in[l^{\prime}..r^{\prime}]$ if and only if $p\in[l..r]$ and $\mathsf{L}_{T}[p]=\pi(cw)$. Then it holds that $l^{\prime}=\mathsf{LF}_{T}(\mathsf{FNQ}_{\pi(cw)}(\mathsf{L}_{T},l))$ and $r^{\prime}=\mathsf{LF}_{T}(\mathsf{FPQ}_{\pi(cw)}(\mathsf{L}_{T},r))$ due to Corollary 6.

• When $c$ is a p-symbol that does not appear in $w$: Let $e=|w|_{p}$. Since $p\in[l..r]$ and $\mathsf{L}_{T}[p]>e$ are necessary and sufficient conditions for $\mathsf{LF}_{T}(p)$ to be in $[l^{\prime}..r^{\prime}]$, we can compute $r^{\prime}-l^{\prime}+1$, the width of $[l^{\prime}..r^{\prime}]$, by counting the number of positions $p$ such that $\mathsf{L}_{T}[p]>e$ with $p\in[l..r]$. This can be done with 2D range counting queries, which can also be supported with the wavelet tree of Lemma 3. If $s=\mathsf{FNQ}_{>e}(\mathsf{L}_{T},l)$ is in $[l..r]$, it holds that $r^{\prime}-l^{\prime}+1\neq 0$ and $\mathsf{LF}_{T}(s)\in[l^{\prime}..r^{\prime}]$. Note that $\mathsf{LF}_{T}(s)$ is not necessarily $l^{\prime}$ because p-encoded suffixes $\langle T[\mathsf{R}^{-1}_{T}(p)..]\rangle$ with $\mathsf{L}_{T}[p]>e$ in $[l..r]$ do not necessarily preserve the lexicographic order when they are extended by one symbol to the left, making it non-straightforward to identify the position $l^{\prime}$.

To tackle this problem, we consider $[l_{e}..r_{e}]=\textnormal{{GetMI}}(s,e)$ and $[l^{\prime}_{e+1}..r^{\prime}_{e+1}]=\textnormal{{GetMI}}(\mathsf{LF}_{T}(s),e+1)$, and show that $l^{\prime}=l^{\prime}_{e+1}+x$, where $x$ is the number of positions $p$ such that $\mathsf{L}_{T}[p]>e$ with $p\in[l_{e}..l-1]$. Observe that $[l..r]\subseteq[l_{e}..r_{e}]$ and $[l^{\prime}..r^{\prime}]\subseteq[l^{\prime}_{e+1}..r^{\prime}_{e+1}]$ by definition, and that $\mathsf{LF}_{T}(p)\in[l^{\prime}_{e+1}..r^{\prime}_{e+1}]$ if and only if $p\in[l_{e}..r_{e}]$ and $\mathsf{L}_{T}[p]>e$ (see Fig. 2 for an illustration). Also, it holds that $\langle T[\mathsf{R}^{-1}_{T}(\mathsf{LF}_{T}(p))..]\rangle<\langle T[\mathsf{R}^{-1}_{T}(\mathsf{LF}_{T}(q))..]\rangle$ for any $p\in[l_{e}..l-1]$ and $q\in[l..r]$ with $\mathsf{L}_{T}[p]>e$ and $\mathsf{L}_{T}[q]>e$ because $\mathsf{lcp}^{\infty}(\langle T[\mathsf{R}^{-1}_{T}(p)..]\rangle,\langle T[\mathsf{R}^{-1}_{T}(q)..]\rangle)=e$, and they fall into Case (B4) of Lemma 4. Similarly, for any $p\in[l..r]$ and $q\in[r+1..r_{e}]$ with $\mathsf{L}_{T}[p]>e$ and $\mathsf{L}_{T}[q]>e$, we have $\langle T[\mathsf{R}^{-1}_{T}(\mathsf{LF}_{T}(p))..]\rangle<\langle T[\mathsf{R}^{-1}_{T}(\mathsf{LF}_{T}(q))..]\rangle$. Hence, $l^{\prime}=l^{\prime}_{e+1}+x$ holds.

This concludes the proof. ∎
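The arithmetic of the third case (a p-symbol that does not appear in $w$) can be sketched as follows. This is only an illustration of the counting step: the inputs `l_e` and `l_next` stand for $l_{e}$ and $l^{\prime}_{e+1}$ and are assumed to be given (they would come from GetMI), the range counting is a naive scan instead of a wavelet-tree query, and s-symbols are assumed to be represented as Python strings. All function names are ours; positions are 1-based as in the text.

```python
def is_greater(value, e):
    """True if value is an integer (or infinity) entry of L that exceeds e."""
    return not isinstance(value, str) and value > e


def count_greater(L, lo, hi, e):
    """Number of positions p in [lo..hi] (1-based, inclusive) with L[p] > e."""
    return sum(1 for p in range(lo, hi + 1) if is_greater(L[p - 1], e))


def new_p_symbol_interval(L, l, r, l_e, l_next, e):
    """Return (l', r') for a p-symbol absent from w, or None if the interval is empty."""
    width = count_greater(L, l, r, e)      # r' - l' + 1
    if width == 0:
        return None
    x = count_greater(L, l_e, l - 1, e)    # qualifying positions in [l_e..l-1]
    left = l_next + x                      # l' = l'_{e+1} + x
    return (left, left + width - 1)
```

With the hypothetical values `L = ['a', 3, 1, inf, 2, 'b', 4]`, `e = 1`, `[l..r] = [3..6]`, `l_e = 1` and `l_next = 10`, two positions in `[3..6]` qualify and one position in `[1..2]` qualifies, so the sketch returns `(11, 12)`.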

We are now ready to prove the main theorem:

Proof of Theorem 1.

If we only need counting queries, Lemmas 13 and 14 are enough: While we build $\mathsf{L}_{T}$ , $\mathsf{F}_{T}$ and $\mathsf{LCP}^{\infty}_{T}$ online, we can compute $w$ -interval $[l..r]$ for a given pattern $w$ of length $m$ using Lemma 14 successively $m$ times, spending $O(m\frac{\lg\sigma_{p}\lg n}{\lg\lg n})$ time in total.

Since $\{\mathsf{R}^{-1}_{T}(i)\mid i\in[l..r]\}$ is the set of occurrences of $w$ in $T$ , we consider how to access $\mathsf{R}^{-1}_{T}(i)$ in $O(\frac{\lg^{2}n}{\lg\sigma\lg\lg n})$ time to support locating queries. As is common in BWT-based indexes, we employ a sampling technique (e.g., see [9]): For every $\Theta(\log_{\sigma}n)$ text positions we store the values so that if we apply LF/FL-mapping to $i$ successively at most $\Theta(\log_{\sigma}n)$ times we hit one of the sampled text positions. A minor remark is that since our online construction proceeds from right to left, it is convenient to start sampling from the right-end of $T$ and store the distance to the right-end instead of the text position counted from the left-end of $T$ .

During the online construction of the data structures for $\mathsf{L}_{T}$, $\mathsf{F}_{T}$ and $\mathsf{LCP}^{\infty}_{T}$, we additionally maintain a dynamic bit vector of length $n$ and a dynamic integer string $V_{T}$ of length $O(n/\log_{\sigma}n)$, which mark the sampled positions and store the sampled values, respectively. We implement $V_{T}$ with the dynamic string of Lemma 2 in $O(n\frac{\lg n}{\log_{\sigma}n})=O(n\lg\sigma)$ bits with $O(\lg n)$ query time. In order to support LF/FL-mapping in $O(\frac{\lg n}{\lg\lg n})$ time, we also maintain a dynamic string of Lemma 2 for $\mathsf{L}_{T}$. With the additional space usage of $O(n\lg\sigma)$ bits, we can access $\mathsf{R}^{-1}_{T}(i)$ in $O(\frac{\lg^{2}n}{\lg\sigma\lg\lg n})$ time as we use LF/FL-mapping at most $\Theta(\log_{\sigma}n)$ times. This leads to the claimed time bound for locating queries. ∎
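The sampling idea can be illustrated with a small static sketch. It is a stand-in under stated assumptions: it sorts the plain suffixes of an ordinary string instead of p-encoded suffixes, it stores the FL-mapping as an explicit array, and the parameter `step` plays the role of the sampling distance $\Theta(\log_{\sigma}n)$. As suggested in the proof, the stored values are distances to the right end of the text. All names are ours and ranks are 0-based.

```python
def build_sampled_index(text, step):
    """Return (order, fl, samples) for a text that ends with a unique smallest marker."""
    n = len(text)
    order = sorted(range(n), key=lambda i: text[i:])  # start position per rank
    rank = [0] * n
    for r, pos in enumerate(order):
        rank[pos] = r
    # fl[r] is the rank of the suffix that starts one position to the right.
    fl = [rank[(order[r] + 1) % n] for r in range(n)]
    # Sample the ranks whose distance to the right end is 1, 1 + step, 1 + 2*step, ...
    samples = {r: n - order[r] for r in range(n) if (n - order[r] - 1) % step == 0}
    return order, fl, samples


def distance_to_right_end(r, fl, samples):
    """Walk the FL-mapping from rank r until a sampled rank is hit."""
    steps = 0
    while r not in samples:
        r = fl[r]
        steps += 1
    return samples[r] + steps  # each FL step moved one position to the right
```

Because the suffix at distance 1 is always sampled, the walk stops after fewer than `step` applications of the mapping, which corresponds to the $\Theta(\log_{\sigma}n)$ factor in the locating time.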

Acknowledgements

This work was supported by JSPS KAKENHI Grant Numbers 19K20213 and 22K11907 (TI).

References

[1] Brenda S. Baker. A theory of parameterized pattern matching: algorithms and applications. In S. Rao Kosaraju, David S. Johnson, and Alok Aggarwal, editors, Proc. 25th Annual ACM Symposium on Theory of Computing (STOC), pages 71–80. ACM, 1993.
[2] Brenda S. Baker. Parameterized pattern matching: Algorithms and applications. Journal of Computer and System Sciences, 52(1):28–42, 1996.
[3] Brenda S. Baker. Parameterized duplication in strings: Algorithms and an application to software maintenance. SIAM J. Comput., 26(5):1343–1362, 1997.
[4] Richard Beal and Donald A. Adjeroh. p-suffix sorting as arithmetic coding. J. Discrete Algorithms, 16:151–169, 2012.
[5] Francisco Claude, Gonzalo Navarro, and Alberto Ordóñez Pereira. The wavelet matrix: An efficient wavelet tree for large alphabets. Inf. Syst., 47:15–32, 2015.
[6] Richard Cole and Ramesh Hariharan. Faster suffix tree construction with missing suffix links. SIAM J. Comput., 33(1):26–42, 2003.
[7] Satoshi Deguchi, Fumihito Higashijima, Hideo Bannai, Shunsuke Inenaga, and Masayuki Takeda. Parameterized suffix arrays for binary strings. In Proc. Prague Stringology Conference (PSC) 2008, pages 84–94, 2008.
[8] Diptarama, Takashi Katsura, Yuhei Otomo, Kazuyuki Narisawa, and Ayumi Shinohara. Position heaps for parameterized strings. In Juha Kärkkäinen, Jakub Radoszewski, and Wojciech Rytter, editors, Proc. 28th Annual Symposium on Combinatorial Pattern Matching (CPM) 2017, volume 78 of LIPIcs, pages 8:1–8:13. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2017.
[9] Paolo Ferragina and Giovanni Manzini. Opportunistic data structures with applications. In FOCS, pages 390–398, 2000.
[10] Noriki Fujisato, Yuto Nakashima, Shunsuke Inenaga, Hideo Bannai, and Masayuki Takeda. Right-to-left online construction of parameterized position heaps. In Jan Holub and Jan Zdárek, editors, Proc. Prague Stringology Conference (PSC) 2018, pages 91–102. Czech Technical University in Prague, Faculty of Information Technology, Department of Theoretical Computer Science, 2018.
[11] Noriki Fujisato, Yuto Nakashima, Shunsuke Inenaga, Hideo Bannai, and Masayuki Takeda. Direct linear time construction of parameterized suffix and LCP arrays for constant alphabets. In Nieves R. Brisaboa and Simon J. Puglisi, editors, Proc. 26th International Symposium on String Processing and Information Retrieval (SPIRE) 2019, volume 11811 of Lecture Notes in Computer Science, pages 382–391. Springer, 2019.
[12] Noriki Fujisato, Yuto Nakashima, Shunsuke Inenaga, Hideo Bannai, and Masayuki Takeda. The parameterized position heap of a trie. In Pinar Heggernes, editor, Proc. 11th International Conference on Algorithms and Complexity (CIAC) 2019, volume 11485 of Lecture Notes in Computer Science, pages 237–248. Springer, 2019.
[13] Noriki Fujisato, Yuto Nakashima, Shunsuke Inenaga, Hideo Bannai, and Masayuki Takeda. The parameterized suffix tray. In Tiziana Calamoneri and Federico Corò, editors, Proc. 12th International Conference on Algorithms and Complexity (CIAC) 2021, volume 12701 of Lecture Notes in Computer Science, pages 258–270. Springer, 2021.
[14] Arnab Ganguly, Rahul Shah, and Sharma V. Thankachan. pBWT: Achieving succinct data structures for parameterized pattern matching and related problems. In Proc. 28th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA) 2017, pages 397–407, 2017. doi:10.1137/1.9781611974782.25.
[15] Arnab Ganguly, Rahul Shah, and Sharma V. Thankachan. Fully functional parameterized suffix trees in compact space. In Mikolaj Bojanczyk, Emanuela Merelli, and David P. Woodruff, editors, Proc. 49th International Colloquium on Automata, Languages, and Programming, (ICALP) 2022, volume 229 of LIPIcs, pages 65:1–65:18. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2022.
[16] Daiki Hashimoto, Diptarama Hendrian, Dominik Köppl, Ryo Yoshinaka, and Ayumi Shinohara. Computing the parameterized Burrows-Wheeler transform online. In Diego Arroyuelo and Barbara Poblete, editors, Proc. 29th International Symposium on String Processing and Information Retrieval (SPIRE) 2022, volume 13617 of Lecture Notes in Computer Science, pages 70–85. Springer, 2022. doi:10.1007/978-3-031-20643-6_6.
[17] Tomohiro I, Satoshi Deguchi, Hideo Bannai, Shunsuke Inenaga, and Masayuki Takeda. Lightweight parameterized suffix array construction. In Proc. 20th International Workshop on Combinatorial Algorithms (IWOCA) 2009, pages 312–323, 2009.
[18] Sung-Hwan Kim and Hwan-Gue Cho. Simpler FM-index for parameterized string matching. Inf. Process. Lett., 165:106026, 2021. doi:10.1016/j.ipl.2020.106026.
[19] S. Rao Kosaraju. Faster algorithms for the construction of parameterized suffix trees (preliminary version). In Proc. 36th Annual Symposium on Foundations of Computer Science (FOCS), pages 631–637. IEEE Computer Society, 1995.
[20] Taehyung Lee, Joong Chae Na, and Kunsoo Park. On-line construction of parameterized suffix trees for large alphabets. Inf. Process. Lett., 111(5):201–207, 2011.
[21] Juan Mendivelso, Sharma V. Thankachan, and Yoan J. Pinzón. A brief history of parameterized matching problems. Discret. Appl. Math., 274:103–115, 2020. doi:10.1016/j.dam.2018.07.017.
[22] J. Ian Munro and Yakov Nekrich. Compressed data structures for dynamic sequences. In Proc. 23rd Annual European Symposium on Algorithms (ESA) 2015, pages 891–902, 2015.
[23] Katsuhito Nakashima, Noriki Fujisato, Diptarama Hendrian, Yuto Nakashima, Ryo Yoshinaka, Shunsuke Inenaga, Hideo Bannai, Ayumi Shinohara, and Masayuki Takeda. Parameterized dawgs: Efficient constructions and bidirectional pattern searches. Theor. Comput. Sci., 933:21–42, 2022. doi:10.1016/j.tcs.2022.09.008.
[24] Gonzalo Navarro. Wavelet trees for all. J. Discrete Algorithms, 25:2–20, 2014. doi:10.1016/j.jda.2013.07.004.
[25] Mihai Patrascu and Mikkel Thorup. Dynamic integer sets with optimal rank, select, and predecessor search. In 55th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2014, Philadelphia, PA, USA, October 18-21, 2014, pages 166–175, 2014. doi:10.1109/FOCS.2014.26.
