Position Heaps for Cartesian-tree Matching
on Strings and Tries

Akio Nishimoto¹ ¹Department of Informatics, Kyushu University, Japan
{nishimoto.akio, noriki.fujisato, yuto.nakashima, inenaga}@inf.kyushu-u.ac.jp
²PRESTO, Japan Science and Technology Agency, Japan Noriki Fujisato¹ ¹Department of Informatics, Kyushu University, Japan
{nishimoto.akio, noriki.fujisato, yuto.nakashima, inenaga}@inf.kyushu-u.ac.jp
²PRESTO, Japan Science and Technology Agency, Japan Yuto Nakashima¹ ¹Department of Informatics, Kyushu University, Japan
{nishimoto.akio, noriki.fujisato, yuto.nakashima, inenaga}@inf.kyushu-u.ac.jp
²PRESTO, Japan Science and Technology Agency, Japan Shunsuke Inenaga^1,2 ¹Department of Informatics, Kyushu University, Japan
{nishimoto.akio, noriki.fujisato, yuto.nakashima, inenaga}@inf.kyushu-u.ac.jp
²PRESTO, Japan Science and Technology Agency, Japan

Abstract

The Cartesian-tree pattern matching is a recently introduced scheme of pattern matching that detects fragments in a sequential data stream which have a similar structure as a query pattern. Formally, Cartesian-tree pattern matching seeks all substrings $S^{\prime}$ of the text string $S$ such that the Cartesian tree of $S^{\prime}$ and that of a query pattern $P$ coincide. In this paper, we present a new indexing structure for this problem, called the Cartesian-tree Position Heap (CPH). Let $n$ be the length of the input text string $S$ , $m$ the length of a query pattern $P$ , and $\sigma$ the alphabet size. We show that the CPH of $S$ , denoted $\mathsf{CPH}(S)$ , supports pattern matching queries in $O(m(\sigma+\log(\min\{h,m\}))+\mathit{occ})$ time with $O(n)$ space, where $h$ is the height of the CPH and $\mathit{occ}$ is the number of pattern occurrences. We show how to build $\mathsf{CPH}(S)$ in $O(n\log\sigma)$ time with $O(n)$ working space. Further, we extend the problem to the case where the text is a labeled tree (i.e. a trie). Given a trie $\boldsymbol{T}$ with $N$ nodes, we show that the CPH of $\boldsymbol{T}$ , denoted $\mathsf{CPH}(\boldsymbol{T})$ , supports pattern matching queries on the trie in $O(m(\sigma^{2}+\log(\min\{h,m\}))+\mathit{occ})$ time with $O(N\sigma)$ space. We also show a construction algorithm for $\mathsf{CPH}(\boldsymbol{T})$ running in $O(N\sigma)$ time and $O(N\sigma)$ working space.

1 Introduction

If the Cartesian trees $\mathsf{CT}(X)$ and $\mathsf{CT}(Y)$ of two strings $X$ and $Y$ are equal, then we say that $X$ and $Y$ Cartesian-tree match (ct-match). The Cartesian-tree pattern matching problem (ct-matching problem) [18] is, given a text string $S$ and a pattern $P$ , to find all substrings $S^{\prime}$ of $S$ that ct-match with $P$ .

String equivalence with ct-matching belongs to the class of substring-consistent equivalence relation (SCER) [17], namely, the following holds: If two strings $X$ and $Y$ ct-match, then $X[i..j]$ and $Y[i..j]$ also ct-match for any $1\leq i\leq j\leq|X|$ . Among other types of SCERs ([3, 4, 5, 14, 15]), ct-matching is the most related to order-peserving matching (op-matching) [16, 7, 9]. Two strings $X$ and $Y$ are said to op-match if the relative order of the characters in $X$ and the relative order of the characters in $Y$ are the same. It is known that with ct-matching one can detect some interesting occurrences of a pattern that cannot be captured with op-matching. More precisely, if two strings $X$ and $Y$ op-match, then $X$ and $Y$ also ct-match. However, the reverse is not true. With this property in hand, ct-matching is motivated for analysis of time series such as stock charts [18, 11].

This paper deals with the indexing version of the ct-matching problem. Park et al. [18] proposed the Cartesian suffix tree (CST) for a text string $S$ that can be built in $O(n\log n)$ worst-case time or $O(n)$ expected time, where $n$ is the length of the text string $S$ . The $\log n$ factor in the worst-case complexity is due to the fact that the parent-encoding, a key concept for ct-matching introduced in [18], is a sequence of integers in range $[0..n-1]$ . While it is not explicitly stated in Park et al.’s paper [18], our simple analysis (c.f. Lemma 9 in Section 5) reveals that the CST supports pattern matching queries in $O(m\log m+\mathit{occ})$ time, where $m$ is the pattern length and $\mathit{occ}$ is the number of pattern occurrences.

In this paper, we present a new indexing structure for this problem, called the Cartesian-tree Position Heap (CPH). We show that the CPH of $S$ , which occupies $O(n)$ space, can be built in $O(n\log\sigma)$ time with $O(n)$ working space and supports pattern matching queries in $O(m(\sigma+\log(\min\{h,m\}))+\mathit{occ})$ time, where $h$ is the height of the CPH. Compared to the afore-mentioned CST, our CPH is the first index for ct-matching that can be built in worst-case linear time for constant-size alphabets, while pattern matching queries with our CPH can be slower than with the CST when $\sigma$ is large.

We then consider the case where the text is a labeled tree (i.e. a trie). Given a trie $\boldsymbol{T}$ with $N$ nodes, we show that the CPH of $\boldsymbol{T}$ , which occupies $O(N\sigma)$ space, can be built in $O(N\sigma)$ time and $O(N\sigma)$ working space. We also show how to support pattern matching queries in $O(m(\sigma^{2}+\log(\min\{h,m\}))+\mathit{occ})$ time in the trie case. To our knowledge, our CPH is the first indexing structure for ct-matching on tries that uses linear space for constant-size alphabets.

Conceptually, our CPH is most related to the parameterized position heap (PPH) for a string [12] and for a trie [13], in that our CPHs and the PPHs are both constructed in an incremental manner where the suffixes of an input string and the suffixes of an input trie are processed in increasing order of their lengths. However, some new techniques are required in the construction of our CPH due to different nature of the parent encoding [18] of strings for ct-matching, from the previous encoding [3] of strings for parameterized matching.

2 Preliminaries

2.1 Strings and (Reversed) Tries

Let $\Sigma$ be an ordered alphabet of size $\sigma$ . An element of $\Sigma$ is called a character. An element of $\Sigma^{*}$ is called a string. For a string $S\in\Sigma^{*}$ , let $\sigma_{S}$ denote the number of distinct characters in $S$ .

The empty string $\varepsilon$ is a string of length 0, namely, $|\varepsilon|=0$ . For a string $S=XYZ$ , $X$ , $Y$ and $Z$ are called a prefix, substring, and suffix of $S$ , respectively. The set of prefixes of a string $S$ is denoted by $\mathsf{Prefix}(S)$ . The $i$ -th character of a string $S$ is denoted by $S[i]$ for $1\leq i\leq|S|$ , and the substring of a string $S$ that begins at position $i$ and ends at position $j$ is denoted by $S[i..j]$ for $1\leq i\leq j\leq|S|$ . For convenience, let $S[i..j]=\varepsilon$ if $j<i$ . Also, let $S[i..]=S[i..|S|]$ for any $1\leq i\leq|S|+1$ .

A trie is a rooted tree that represents a set of strings, where each edge is labeled with a character from $\Sigma$ and the labels of the out-going edges of each node is mutually distinct. Tries are natural generalizations to strings in that tries can have branches while strings are sequences without branches.

Let $\mathbf{x}$ be any node of a given trie $\boldsymbol{T}$ , and let $\mathbf{r}$ denote the root of $\boldsymbol{T}$ . Let $0pt(\mathbf{x})$ denote the depth of $\mathbf{x}$ . When $\mathbf{x}\neq\mathbf{r}$ , let $\mathsf{parent}(\mathbf{x})$ denote the parent of $\mathbf{x}$ . For any $0\leq j\leq 0pt(\mathbf{x})$ , let $\mathsf{anc}(\mathbf{x},j)$ denote the $j$ -th ancestor of $\mathbf{x}$ , namely, $\mathsf{anc}(\mathbf{x},0)=\mathbf{x}$ and $\mathsf{anc}(\mathbf{x},j)=\mathsf{parent}(\mathsf{anc}(\mathbf{x},j-1))$ for $1\leq j\leq 0pt(\mathbf{x})$ . It is known that after a linear-time processing on $\boldsymbol{T}$ , $\mathsf{anc}(\mathbf{x},j)$ for any query node $\mathbf{x}$ and integer $j$ can be answered in $O(1)$ time [6].

For the sake of convenience, in the case where our input is a trie $\boldsymbol{T}$ , then we consider its reversed trie where the path labels are read in the leaf-to-root direction. On the other hand, the trie-based data structures (namely position heaps) we build for input strings and reversed tries are usual tries where the path labels are read in the root-to-leaf direction.

For each (reversed) path $(\mathbf{x},\mathbf{y})$ in $\boldsymbol{T}$ such that $\mathbf{y}=\mathsf{anc}(\mathbf{x},j)$ with $j=|0pt(\mathbf{x})|-|0pt(\mathbf{y})|$ , let $\mathsf{str}(\mathbf{x},\mathbf{y})$ denote the string obtained by concatenating the labels of the edges from $\mathbf{x}$ to $\mathbf{y}$ . For any node $\mathbf{x}$ of $\boldsymbol{T}$ , let $\mathsf{str}(\mathbf{x})=\mathsf{str}(\mathbf{x},\mathbf{r})$ .

Let $N$ be the number of nodes in $\boldsymbol{T}$ . We associate a unique id to each node of $\boldsymbol{T}$ . Here we use a bottom-up level-order traversal rank as the id of each node in $\boldsymbol{T}$ , and we sometimes identify each node with its id. For each node id $i$ ( $1\leq i\leq N$ ) let $\boldsymbol{T}[i..]=\mathsf{str}(i)$ , i.e., $\boldsymbol{T}[i..]$ is the path string from node $i$ to the root $\mathbf{r}$ .

2.2 Cartesian-tree Pattern Matching

The Cartesian tree of a string $S$ , denoted $\mathsf{CT}(S)$ , is the rooted tree with $|S|$ nodes which is recursively defined as follows:

•

If $|S|=0$ , then $\mathsf{CT}(S)$ is the empty tree.
•

If $|S|\geq 1$ , then $\mathsf{CT}(S)$ is the tree whose root $r$ stores the left-most minimum value $S[i]$ in $S$ , namely, $r=S[i]$ iff $S[i]\leq S[j]$ for any $i\neq j$ and $S[h]>S[i]$ for any $h<i$ . The left-child of $r$ is $\mathsf{CT}(S[1..i-1])$ and the right-child of $r$ is $\mathsf{CT}(S[i+1..|S|])$ .

The parent distance encoding of a string $S$ of length $n$ , denoted $\mathsf{PD}(S)$ , is a sequence of $n$ integers over $[0..n-1]$ such that

\mathsf{PD}(S)[i]=\begin{cases}i-\max_{1\leq j<i}\{j\mid S[j]\leq S[i]\}&\mbox{if such $j$ exists},\\ 0&\mbox{otherwise.}\end{cases}

Namely, $\mathsf{PD}(S)[i]$ represents the distance to from position $i$ to its nearest left-neighbor position $j$ that stores a value that is less than or equal to $S[i]$ .

A tight connection between $\mathsf{CT}$ and $\mathsf{PD}$ is known:

Lemma 1 ([19]).

For any two strings $S_{1}$ and $S_{2}$ of equal length, $\mathsf{CT}(S_{1})=\mathsf{CT}(S_{2})$ iff $\mathsf{PD}(S_{1})=\mathsf{PD}(S_{2})$ .

For two strings $S_{1}$ and $S_{2}$ , we write $S_{1}\approx S_{2}$ iff $\mathsf{CT}(S_{1})=\mathsf{CT}(S_{2})$ (or equivalently $\mathsf{PD}(S_{1})=\mathsf{PD}(S_{2})$ ). We also say that $S_{1}$ and $S_{2}$ ct-match when $S_{1}\approx S_{2}$ . See Fig. 1 for a concrete example.

Refer to caption — Figure 1: Two strings $S_{1}=\mathtt{316486759}$ and $S_{2}=\mathtt{713286945}$ ct-match since $\mathsf{CT}(S_{1})=\mathsf{CT}(S_{2})$ and $\mathsf{PD}(S_{1})=\mathsf{PD}(S_{2})$ .

We consider the indexing problems for Cartesian-tree pattern matching on a text string and a text trie, which are respectively defined as follows:

Problem 1 (Cartesian-Tree Pattern Matching on Text String).

Preprocess:: A text string $S$ of length $n$ .
Query:: A pattern string $P$ of length $m$ .
Report:: All text positions $i$ such that $S[i..i+m-1]\approx P$ .

Problem 2 (Cartesian-Tree Pattern Matching on Text Trie).

Preprocess:: A text trie $\boldsymbol{T}$ with $N$ nodes.
Query:: A pattern string $P$ of length $m$ .
Report:: All trie nodes $i$ such that $(\boldsymbol{T}[i..])[1..m]\approx P$ .

2.3 Sequence Hash Trees

Let $\mathcal{W}=\langle w_{1},\ldots,w_{k}\rangle$ be a sequence of non-empty strings such that for any $1<i\leq k$ , $w_{i}\notin\mathsf{Prefix}(w_{j})$ for any $1\leq j<i$ . The sequence hash tree [8] of a sequence $\mathcal{W}=\langle w_{1},\ldots,w_{k}\rangle$ of $k$ strings, denoted $\mathsf{SHT}(\mathcal{W})=\mathsf{SHT}(\mathcal{W})^{k}$ , is a trie structure that is incrementally built as follows:

1.

$\mathsf{SHT}(\mathcal{W})^{0}=\mathsf{SHT}(\langle\ \rangle)$ for the empty sequence $\langle\ \rangle$ is the tree only with the root.
2.

For $i=1,\ldots,k$ , $\mathsf{SHT}(\mathcal{W})^{i}$ is obtained by inserting the shortest prefix $u_{i}$ of $w_{i}$ that does not exist in $\mathsf{SHT}(\mathcal{W})^{i-1}$ . This is done by finding the longest prefix $p_{i}$ of $w_{i}$ that exists in $\mathsf{SHT}(\mathcal{W})^{i-1}$ , and adding the new edge $(p_{i},c,u_{i})$ , where $c=w_{i}[|p_{i}|+1]$ is the first character of $w_{i}$ that could not be traversed in $\mathsf{SHT}(\mathcal{W})^{i-1}$ .

Since we have assumed that each $w_{i}$ in $\mathcal{W}$ is not a prefix of $w_{j}$ for any $1\leq j<i$ , the new edge $(p_{i},c,u_{i})$ is always created for each $1\leq i\leq k$ . This means that $\mathsf{SHT}(\mathcal{W})$ contains exactly $k+1$ nodes (including the root).

To perform pattern matching queries efficiently, each node of $\mathsf{SHT}(\mathcal{W})$ is augmented with the maximal reach pointer. For each $1\leq i\leq k$ , let $u_{i}$ be the newest node in $\mathsf{SHT}(\mathcal{W})^{i}$ , namely, $u_{i}$ is the shortest prefix of $w_{i}$ which did not exist in $\mathsf{SHT}(\mathcal{W})^{i-1}$ . Then, in the complete sequence hash tree $\mathsf{SHT}(\mathcal{W})=\mathsf{SHT}(\mathcal{W})^{k}$ , we set $\mathsf{mrp}(u_{i})=u_{j}$ iff $u_{j}$ is the deepest node in $\mathsf{SHT}(\mathcal{W})$ such that $u_{j}$ is a prefix of $w_{i}$ . Intuitively, $\mathsf{mrp}(u_{i})$ represents the last visited node $u_{j}$ when we traverse $w_{i}$ from the root of the complete $\mathsf{SHT}(\mathcal{W})$ . Note that $j\geq i$ always holds. When $j=i$ (i.e. when the maximal reach pointer is a self-loop), then we can omit it because it is not used in the pattern matching algorithm.

3 Cartesian-tree Position Heaps for Strings

In this section, we introduce our new indexing structure for Problem 1. For a given text string $S$ of length $n$ , let $\mathcal{W}_{S}$ denote the sequence of the parent distance encodings of the non-empty suffixes of $S$ which are sorted in increasing order of their lengths. Namely, $\mathcal{W}_{S}=\langle w_{1}$ , …, $w_{n}\rangle=\langle\mathsf{PD}(S[n..])$ , …, $\mathsf{PD}(S[1..])\rangle$ , where $w_{n-i+1}=\mathsf{PD}(S[i..])$ . The Cartesian-tree Position Heap (CPH) of string $S$ , denoted $\mathsf{CPH}(S)$ , is the sequence hash tree of $\mathcal{W}_{S}$ , that is, $\mathsf{CPH}(S)=\mathsf{SHT}(\mathcal{W}_{S})$ . Note that for each $1\leq i\leq n+1$ , $\mathsf{CPH}(S[i..])=\mathsf{SHT}(\mathcal{W}_{S})^{n-i+1}$ holds.

Our algorithm builds $\mathsf{CPH}(S[i..])$ for decreasing $i=n,\ldots,1$ , which means that we process the given text string $S$ in a right-to-left online manner, by prepending the new character $S[i]$ to the current suffix $S[i+1..]$ .

For a sequence $v$ of integers, let $\mathcal{Z}_{v}$ denote the sorted list of positions $z$ in $v$ such that $v[z]=0$ iff $z\in\mathcal{Z}_{v}$ . Clearly $|\mathcal{Z}_{v}|$ is equal to the number of $0$ ’s in $v$ .

Lemma 2.

For any string $S$ , $|\mathcal{Z}_{\mathsf{PD}(S)}|\leq\sigma_{S}$ .

Proof.

Let $\mathcal{Z}_{\mathsf{PD}(S)}=z_{1},\ldots,z_{\ell}$ . We have that $S[z_{1}]>\cdots>S[z_{\ell}]$ since otherwise $\mathsf{PD}(S)[z_{x}]\neq 0$ for some $z_{x}$ , a contradiction. Thus $|\mathcal{Z}_{\mathsf{PD}(S)}|\leq\sigma_{S}$ holds. ∎

Lemma 3.

For each $i=n,\ldots,1$ , $\mathsf{PD}(S[i..])$ can be computed from $\mathsf{PD}(S[i+1..])$ in an online manner, using a total of $O(n)$ time with $O(\sigma_{S})$ working space.

Proof.

Given a new character $S[i]$ , we check each position $z$ in the list $\mathcal{Z}_{\mathsf{PD}(S[i+1..])}$ in increasing order. Let $\hat{z}=z+i$ , i.e., $\hat{z}$ is the global position in $S$ corresponding to $z$ in $S[i+1..]$ . If $S[i]\leq S[\hat{z}]$ , then we set $\mathsf{PD}(S[i..])[z-i+1]=z-i~{}(>0)$ and remove $z$ from the list. Remark that these removed positions correspond to the front pointers in the next suffix $S[i..]$ . We stop when we encounter the first $z$ in the list such that $S[i]>S[\hat{z}]$ . Finally we add the position $i$ to the head of the remaining positions in the list. This gives us $\mathcal{Z}_{\mathsf{PD}(S[i..])}$ for the next suffix $S[i..]$ .

It is clear that once a position in the PD encoding is assigned a non-zero value, then the value never changes whatever characters we prepend to the string. Therefore, we can compute $\mathsf{PD}(S[i..])$ from $\mathsf{PD}(S[i+1..])$ in a total of $O(n)$ time for every $1\leq i\leq n$ . The working space is $O(\sigma_{S})$ due to Lemma 2. ∎

A position $i$ in a sequence $u$ of non-negative integers is said to be a front pointer in $u$ if $i-u[i]=1$ and $i\geq 2$ . Let $\mathcal{F}_{u}$ denote the sorted list of front pointers in $u$ . For example, if $u=01214501$ , then $\mathcal{F}_{u}=\{2,3,5,6\}$ . The positions of the suffix $S[i+1..]$ which are removed from $\mathcal{Z}_{\mathsf{PD}(S[i+1..])}$ correspond to the front pointers in $\mathcal{F}_{\mathsf{PD}(S[i..])}$ for the next suffix $S[i..]$ .

Our construction algorithm updates $\mathsf{CPH}(S[i+1..])$ to $\mathsf{CPH}(S[i..])$ by inserting a new node for the next suffix $S[i..]$ , processing the given string $S$ in a right-to-left online manner. Here the task is to efficiently locate the parent of the new node in the current CPH at each iteration.

As in the previous work on right-to-left online construction of indexing structures for other types of pattern matching [20, 10, 12, 13], we use the reversed suffix links in our construction algorithm for $\mathsf{CPH}(S)$ . For ease of explanation, we first introduce the notion of the suffix links. Let $u$ be any non-root node of $\mathsf{CPH}(S)$ . We identify $u$ with the path label from the root of $\mathsf{CPH}(S)$ to $u$ , so that $u$ is a PD encoding of some substring of $S$ . We define the suffix link of $u$ , denoted $\mathsf{sl}(u)$ , such that $\mathsf{sl}(u)=v$ iff $v$ is obtained by (1) removing the first $0$ ( $=u[1]$ ), and (2) substituting $0$ for the character $u[f]$ at every front pointer $f\in\mathcal{F}_{u}\subseteq[2..|u|]$ of $u$ . The reversed suffix link of $v$ with non-negative integer label $a$ , denoted $\mathsf{rsl}(v,a)$ , is defined such that $\mathsf{rsl}(v,a)=u$ iff $\mathsf{sl}(u)=v$ and $a=|\mathcal{F}_{u}|$ . See also Figure 2.

Lemma 4.

Let $u,v$ be any nodes of $\mathsf{CPH}(S)$ such that $\mathsf{rsl}(v,a)=u$ with label $a$ . Then $a\leq\sigma_{S}$ .

Proof.

Since $|\mathcal{F}_{u}|\leq|\mathcal{Z}_{v}|$ , using Lemma 2, we obtain $a=|\mathcal{F}_{u}|\leq|\mathcal{Z}_{v}|\leq\sigma_{S^{\prime}}\leq\sigma_{S}$ , where $S^{\prime}$ is a substring of $S$ such that $\mathsf{PD}(S^{\prime})=v$ . ∎∎

The next lemma shows that the number of out-going reversed suffix links of each node $v$ is bounded by the alphabet size.

Our CPH construction algorithm makes use of the following monotonicity of the labels of reversed suffix links:

Lemma 5.

Suppose that there exist two reversed suffix links $\mathsf{rsl}(v,a)=u$ and $\mathsf{rsl}(v^{\prime},a^{\prime})=u^{\prime}$ such that $v^{\prime}=\mathsf{parent}(v)$ and $u^{\prime}=\mathsf{parent}(u)$ . Then, $0\leq a-a^{\prime}\leq 1$ .

Proof.

Immediately follows from $a=|\mathcal{F}_{u}|$ , $a^{\prime}=|\mathcal{F}_{u^{\prime}}|$ , and $u^{\prime}=u[1..|u|-1]$ . ∎

We are ready to design our right-to-left online construction algorithm for the CPH of a given string $S$ . Since $\mathsf{PD}(S[i..])$ is the $(n-i+1)$ -th string $w_{n-i+1}$ of the input sequence $\mathcal{W}_{S}$ , for ease of explanation, we will use the convention that $u(i)=u_{n-i+1}$ and $p(i)=p_{n-i-1}$ , where the new node $u(i)$ for $w_{n-i+1}=\mathsf{PD}(S[i..])$ is inserted as a child of $p(i)$ . See Figure 3.

{breakbox}

Algorithm 1: Right-to-Left Online Construction of $\mathsf{CPH}(S)$

$i=n$ (base case):: We begin with $\mathsf{CPH}(S[n..])$ which consists of the root $r=u(n+1)$ and the node $u(n)$ for the first (i.e. shortest) suffix $S[n..]$ of $S$ . Since $w_{1}=\mathsf{PD}(S[n..])=\mathsf{PD}(S[n])=0$ , the edge from $r$ to $u(n)$ is labeled $0$ . Also, we set the reversed suffix link $\mathsf{rsl}(r,0)=u(n)$ .
$i=n-1,\ldots,1$ (iteration):: Given $\mathsf{CPH}(S[i+1..])$ which consists of the nodes $u(i+1),\ldots,u(n)$ , which respectively represent some prefixes of the already processed strings $w_{n-i},\ldots,w_{1}=\mathsf{PD}(S[i+1..]),\ldots,\mathsf{PD}(S[n..])$ , together with their reversed suffix links. We find the parent $p(i)$ of the new node $u(i)$ for $\mathsf{PD}(S[i..])$ , as follows: We start from the last-created node $u(i+1)$ for the previous $\mathsf{PD}(S[i+1..])$ , and climb up the path towards the root $r$ . Let $d_{i}\in[1..|u(i+1)|]$ be the smallest integer such that the $d_{i}$ -th ancestor $v(i)=\mathsf{anc}(u(i+1),d_{i})$ of $u(i+1)$ has the reversed suffix link $\mathsf{rsl}(v(i),a)$ with the label $a=|\mathcal{F}_{\mathsf{PD}(S[i..i+|v(i)|])}|$ . We traverse the reversed suffix link from $v(i)$ and let $p(i)=\mathsf{rsl}(v(i),a)$ . We then insert the new node $u(i)$ as the new child of $p(i)$ , with the edge labeled $\mathsf{PD}(S[i..])[i+|u(i)|-1]$ . Finally, we create a new reversed suffix link $\mathsf{rsl}(\hat{v}(i),b)=u(i)$ , where $\hat{v}(i)=\mathsf{anc}(u(i+1),d_{i}-1)$ and $\mathsf{parent}(\hat{v})=v$ . We set $b\leftarrow a+1$ if the position $i+|p(i)|$ is a front pointer of $\mathsf{PD}(S[i..])$ , and $b\leftarrow a$ otherwise.

For computing the label $a=|\mathcal{F}_{\mathsf{PD}(S[i..i+|v(i)|])}|$ efficiently, we introduce a new encoding $\mathsf{FP}$ that is defined as follows: For any string $S$ of length $n$ , let $\mathsf{FP}(S)[i]=|\mathcal{F}_{\mathsf{PD}(S[i..n])}|$ . The $\mathsf{FP}$ encoding preserves the ct-matching equivalence:

Lemma 6.

For any two strings $S_{1}$ and $S_{2}$ , $S_{1}\approx S_{2}$ iff $\mathsf{FP}(S_{1})=\mathsf{FP}(S_{2})$ .

Proof.

For a string $S$ , consider the DAG $\mathsf{G}(S)=(V,E)$ such that $V=\{1,\ldots,|S|\}$ , $E=\{(j,i)\mid j=i-\mathsf{PD}(S)[i]\}$ (see also Figure 7 in Appendix). By Lemma 1, for any strings $S_{1}$ and $S_{2}$ , $\mathsf{G}(S_{1})=\mathsf{G}(S_{2})$ iff $S_{1}\approx S_{2}$ . Now, we will show there is a one-to-one correspondence between the DAG $\mathsf{G}$ and the $\mathsf{FP}$ encoding.

( $\Rightarrow$ ) We are given $\mathsf{G}(S)$ for some (unknown) string $S$ . Since $\mathsf{FP}(S)[i]$ is the in-degree of the node $i$ of $\mathsf{G}(S)$ , $\mathsf{FP}(S)$ is unique for the given DAG $\mathsf{G}(S)$ .

( $\Leftarrow$ ) Given $\mathsf{FP}(S)$ for some (unknown) string $S$ , we show an algorithm that builds DAG $\mathsf{G}(S)$ . We first create nodes $V=\{1,\ldots,|S|\}$ without edges, where all nodes in $V$ are initially unmarked. For each $i=n,\ldots,1$ in decreasing order, if $\mathsf{FP}(S)[i]>0$ , then select the leftmost $\mathsf{FP}(S)[i]$ unmarked nodes in the range $[i-1..n]$ , and create an edge $(i,i^{\prime})$ from each selected node $i^{\prime}$ to $i$ . We mark all these $\mathsf{FP}(S)[i]$ nodes at the end of this step, and proceed to the next node $i-1$ . The resulting DAG $\mathsf{G}(S)$ is clearly unique for a given $\mathsf{PD}(S)$ . ∎

For computing the label $a=|\mathcal{F}_{\mathsf{PD}(S[i..i+|v(i)|])}|=\mathsf{FP}(S[i..i+|v(i)|])[1]$ of the reversed suffix link in Algorithm 1, it is sufficient to maintain the induced graph $\mathsf{G}_{[i..j]}$ of DAG $\mathsf{G}$ for a variable-length sliding window $S[i..j]$ with the nodes $\{i,\ldots,j\}$ . This can easily be maintained in $O(n)$ total time.

Theorem 1.

Algorithm 1 builds $\mathsf{CPH}(S[i..])$ for decreasing $i=n,\ldots,1$ in a total of $O(n\log\sigma)$ time and $O(n)$ space, where $\sigma$ is the alphabet size.

Proof.

Correctness: Consider the $(n-i+1)$ -th step in which we process $\mathsf{PD}(S[i..])$ . By Lemma 5, the $d_{i}$ -th ancestor $v(i)=\mathsf{anc}(u(i+1),d_{i})$ of $u(i+1)$ can be found by simply walking up the path from the start node $u(i+1)$ . Note that there always exists such ancestor $v(i)$ of $u(i+1)$ since the root $r$ has the defined reversed suffix link $\mathsf{rsl}(r,0)=0$ . By the definition of $v(i)$ and its reversed suffix link, $\mathsf{rsl}(v(i),a)=p(i)$ is the longest prefix of $\mathsf{PD}(S[i..])$ that is represented by $\mathsf{CPH}(S[i+1..])$ (see also Figure 3). Thus, $p(i)$ is the parent of the new node $u(i)$ for $\mathsf{PD}(S[i..])$ . The correctness of the new reversed suffix link $\mathsf{rsl}(\hat{v}(i),b)=u(i)$ follows from the definition.

Complexity: The time complexity is proportional to the total number $\sum_{i=1}^{n}d_{i}$ of nodes that we visit for all $i=n,\ldots,1$ . Clearly $|u(i)|-|u(i+1)|=d_{i}-2$ . Thus, $\sum_{i=1}^{n}d_{i}=\sum_{i=1}^{n}(|u(i)|-|u(i+1)|+2)=|u(1)|-|u(n)|+2n\leq 3n=O(n)$ . Using Lemma LABEL:lem:rsl_bounds and sliding-window $\mathsf{FP}$ , we can find the reversed suffix links in $O(\log\sigma_{S})$ time at each of the $\sum_{i=1}^{n}d_{i}$ visited nodes. Thus the total time complexity is $O(n\log\sigma_{S})$ . Since the number of nodes in $\mathsf{CPH}(S)$ is $n+1$ and the number of reversed suffix links is $n$ , the total space complexity is $O(n)$ . ∎

Lemma 7.

There exists a string $S$ of length $n$ over a binary alphabet $\Sigma=\{\mathtt{1,2}\}$ such a node in $\mathsf{CPH}(S)$ has $\Omega(\sqrt{n})$ out-going edges.

Proof.

Consider string $S=\mathtt{1}\mathtt{121}\mathtt{1221}\cdots\mathtt{1}\mathtt{2}^{k}\mathtt{1}$ . Then, for any $1\leq\ell\leq k$ , there exist nodes representing $01^{k-2}\ell$ (see also Figure 6 in Appendix). Since $k=\Theta(\sqrt{n})$ , the parent node $01^{k-2}$ has $\Omega(\sqrt{n})$ out-going edges. ∎

Due to Lemma 7, if we maintain a sorted list of out-going edges for each node during our online construction of $\mathsf{CPH}(S[i..])$ , it would require $O(n\log n)$ time even for a constant-size alphabet. Still, after $\mathsf{CPH}(S)$ has been constructed, we can sort all the edges offline, as follows:

Theorem 2.

For any string $S$ over an integer alphabet $\Sigma=[1..\sigma]$ of size $\sigma=n^{O(1)}$ , the edge-sorted $\mathsf{CPH}(S)$ together with the maximal reach pointers can be computed in $O(n\log\sigma_{S})$ time and $O(n)$ space.

Proof.

We sort the edges of $\mathsf{CPH}(S)$ as follows: Let $i$ be the id of each node $u(i)$ . Then sort the pairs $(i,x)$ of the ids and the edge labels. Since $i\in[0..n-1]$ and $x\in[1..n^{O(1)}]$ , we can sort these pairs in $O(n)$ time by a radix sort. The maximal reach pointers can be computed in $O(n\log\sigma_{S})$ time using the reversed suffix links, in a similar way to the position heaps for exact matching [10]. ∎

See Figure 5 in Appendix for an example of $\mathsf{CPH}(S)$ with maximal reach pointers.

4 Cartesian-tree Position Heaps for Tries

Let $\boldsymbol{T}$ be the input text trie with $N$ nodes. A naïve extension of our CPH to a trie would be to build the CPH for the sequence $\langle\mathsf{PD}(\boldsymbol{T}[N..]),\ldots,\mathsf{PD}(\boldsymbol{T}[1..])\rangle$ of the parent encodings of all the path strings of $\boldsymbol{T}$ towards the root $\mathbf{r}$ . However, this does not seem to work because the parent encodings are not consistent for suffixes. For instance, consider two strings $\mathtt{1432}$ and $\mathtt{4432}$ . Their longest common suffix $\mathtt{432}$ is represented by a single path in a trie $\boldsymbol{T}$ . However, the longest common suffix of $\mathsf{PD}(\mathtt{1432})=0123$ and $\mathsf{PD}(\mathtt{4432})=0100$ is $\varepsilon$ . Thus, in the worst case, we would have to consider all the path strings $\boldsymbol{T}[N..]$ , …, $\boldsymbol{T}[1..]$ in $\boldsymbol{T}$ separately, but the total length of these path strings in $\boldsymbol{T}$ is $\Omega(N^{2})$ .

To overcome this difficulty, we reuse the $\mathsf{FP}$ encoding from Section 3. Since $\mathsf{FP}(S)[i]$ is determined merely by the suffix $S[i..]$ , the $\mathsf{FP}$ encoding is suffix-consistent. For an input trie $\boldsymbol{T}$ , let the FP-trie $\boldsymbol{T}_{\mathsf{FP}}$ be the reversed trie storing $\mathsf{FP}(\boldsymbol{T}[i..])$ for all the original path strings $\boldsymbol{T}[i..]$ towards the root. Let $N^{\prime}$ be the number of nodes in $\boldsymbol{T}_{\mathsf{FP}}$ . Since $\mathsf{FP}$ is suffix-consistent, $N^{\prime}\leq N$ always holds. Namely, $\mathsf{FP}$ is a linear-size representation of the equivalence relation of the nodes of $\boldsymbol{T}$ w.r.t. $\approx$ . Each node $v$ of $\boldsymbol{T}_{\mathsf{FP}}$ stores the equivalence class $\mathcal{C}_{v}=\{i\mid\boldsymbol{T}_{\mathsf{FP}}[v..]=\mathsf{FP}(\boldsymbol{T}[i..])\}$ of the nodes $i$ in $\boldsymbol{T}$ that correspond to $v$ . We set $\min\{\mathcal{C}_{v}\}$ to be the representative of $\mathcal{C}_{v}$ , as well as the id of node $v$ . See Figure 8 in Appendix.

Let $\Sigma_{\boldsymbol{T}}$ be the set of distinct characters (i.e. edge labels) in $\boldsymbol{T}$ and let $\sigma_{\boldsymbol{T}}=|\Sigma_{\boldsymbol{T}}|$ . The FP-trie $\boldsymbol{T}_{\mathsf{FP}}$ can be computed in $O(N\sigma_{\boldsymbol{T}})$ time and working space by a standard traversal on $\boldsymbol{T}$ , where we store at most $\sigma_{\boldsymbol{T}}$ front pointers in each node of the current path in $\boldsymbol{T}$ due to Lemma 3.

Let $i_{N^{\prime}},\ldots,i_{1}$ be the node id’s of $\boldsymbol{T}_{\mathsf{FP}}$ which are sorted in decreasing order. The Cartesian-tree position heap for the input trie $\boldsymbol{T}$ is $\mathsf{CPH}(\boldsymbol{T})=\mathsf{SHT}(\mathcal{W}_{\boldsymbol{T}})$ , where $\mathcal{W}_{\boldsymbol{T}}=\langle\mathsf{PD}(\boldsymbol{T}[i_{N^{\prime}}..],\ldots,\mathsf{PD}(\boldsymbol{T}[i_{1}..])\rangle$ .

As in the case of string inputs in Section 3, we insert the shortest prefix of $\mathsf{PD}(\boldsymbol{T}[i_{k}..])$ that does not exist in $\mathsf{CPH}(\boldsymbol{T}[i_{k+1}..])$ . To perform this insert operation, we use the following data structure for a random-access query on the PD encoding of any path string in $\boldsymbol{T}$ :

Lemma 8.

There is a data structure of size $O(N\sigma_{\boldsymbol{T}})$ that can answer the following queries in $O(\sigma_{\boldsymbol{T}})$ time each.
Query input: The id $i$ of a node in $\boldsymbol{T}$ and integer $\ell>0$ .
Query output: The $\ell$ th (last) symbol $\mathsf{PD}((\boldsymbol{T}[i..])[1..\ell])[\ell]$ in $\mathsf{PD}(\boldsymbol{T}[i..])[1..\ell]$ .

Proof.

Let $\mathbf{x}$ be the node with id $i$ , and $\mathbf{z}=\mathsf{anc}(\mathbf{x},\ell)$ . Namely, $\mathsf{str}(\mathbf{x},\mathbf{z})=(\boldsymbol{T}[j..])[1..\ell]$ . For each character $a\in\Sigma_{\boldsymbol{T}}$ , let $\mathsf{na}(\mathbf{x},a)$ denote the nearest ancestor $\mathbf{y}_{a}$ of $\mathbf{x}$ such that the edge $(\mathsf{parent}(\mathbf{y}_{a}),\mathbf{y}_{a})$ is labeled $a$ . If such an ancestor does not exist, then we set $\mathsf{na}(\mathbf{x},a)$ to the root $\mathbf{r}$ .

Let $\mathbf{z^{\prime}}=\mathsf{anc}(\mathbf{x},\ell-1)$ , and $b$ be the label of the edge $(\mathbf{z},\mathbf{z^{\prime}})$ . Let $D$ be an empty set. For each character $a\in\Sigma_{\boldsymbol{T}}$ , we query $\mathsf{na}(\mathbf{x},a)=\mathbf{y}_{a}$ . If $d_{a}=|\mathbf{y}_{a}|-|\mathbf{z}^{\prime}|>0$ and $a\leq b$ , then $d_{a}$ is a candidate for $(\mathsf{PD}(\boldsymbol{T}[j..])[1..\ell])[\ell]$ and add $d_{a}$ to set $D$ . After testing all $a\in\Sigma_{\boldsymbol{T}}$ , we have that $(\mathsf{PD}(\boldsymbol{T}[j..])[1..\ell])[\ell]=\min D$ . See Figure 4.

For all characters $a\in\Sigma_{\boldsymbol{T}}$ and all nodes $x$ in $\boldsymbol{T}$ , $\mathsf{na}(\mathbf{x},a)$ can be pre-computed in a total of $O(N\sigma_{\boldsymbol{T}})$ preprocessing time and space, by standard traversals on $\boldsymbol{T}$ . Clearly each query is answered in $O(\sigma_{\boldsymbol{T}})$ time. ∎

Theorem 3.

Let $\boldsymbol{T}$ be a given trie with $N$ nodes whose edge labels are from an integer alphabet of size $n^{O(1)}$ . The edge-sorted $\mathsf{CPH}(\boldsymbol{T})$ with the maximal reach pointers, which occupies $O(N\sigma_{\boldsymbol{T}})$ space, can be built in $O(N\sigma_{\boldsymbol{T}})$ time.

Proof.

The rest of the construction algorithm of $\mathsf{CPH}(\boldsymbol{T})$ is almost the same as the case of the CPH for a string, except that the amortization argument in the proof for Theorem 1 cannot be applied to the case where the input is a trie. Instead, we use the nearest marked ancestor (NMA) data structure [21, 2] that supports queries and marking nodes in amortized $O(1)$ time each, using space linear in the input tree. For each $a\in[0..\sigma_{\boldsymbol{T}}]$ , we create a copy $\mathsf{CPH}_{a}(\boldsymbol{T})$ of $\mathsf{CPH}(\boldsymbol{T})$ and maintain the NMA data structure on $\mathsf{CPH}_{a}(\boldsymbol{T})$ so that every node $v$ that has defined reversed suffix link $\mathsf{rsl}(v,a)$ is marked, and any other nodes are unmarked. The NMA query for a given node $v$ with character $a$ is denoted by $\mathsf{nma}_{a}(v)$ . If $v$ itself is marked with $a$ , then let $\mathsf{nma}_{a}(v)=v$ . For any node $\mathbf{x}$ of $\boldsymbol{T}$ , let $\mathcal{I}_{\mathbf{x}}$ be the array of size at most $\sigma_{\boldsymbol{T}}$ s.t. $\mathcal{I}_{\mathbf{x}}[j]=h$ iff $h$ is the $j$ th smallest element of $\mathcal{F}_{\mathsf{PD}(\mathsf{str}(\mathbf{x}))}$ .

We are ready to design our construction algorithm: Suppose that we have already built $\mathsf{CPH}(\boldsymbol{T}[i_{k+1}..])$ and we are to update it to $\mathsf{CPH}(\boldsymbol{T}[i_{k}..])$ . Let $\mathbf{w}$ be the node in $\boldsymbol{T}$ with id $i_{k}$ , and let $\mathbf{u}=\mathsf{parent}(\mathbf{w})$ in $\boldsymbol{T}_{\mathsf{FP}}$ . Let $u$ be the node of $\mathsf{CPH}(\boldsymbol{T}[i_{k+1}..])$ that corresponds to $\mathbf{u}$ . We initially set $v\leftarrow u$ and $a\leftarrow|\mathcal{F}_{\mathsf{PD}(\boldsymbol{T}[i_{k}..i_{k}+|u|])}|$ . Let $d(a)=\max\{|u|-\mathcal{I}_{\mathbf{w}}[a]+1,0\}$ . We perform the following:

(1)

Check whether $v^{\prime}=\mathsf{anc}(u,d(a))$ is marked in $\mathsf{CPH}_{a}(\boldsymbol{T})$ . If so, go to (2). Otherwise, update $v\leftarrow v^{\prime}$ , $a\leftarrow a-1$ , and repeat (1).
(2)

Return $\mathsf{nma}(v,a)$ .

By the definitions of $\mathcal{I}_{\mathbf{w}}[a]$ and $d(a)$ , the node $v(i_{k})$ from which we should take the reversed suffix link is in the path between $v^{\prime}$ and $v$ , and it is the lowest ancestor of $v$ that has the reversed suffix link with $a$ . Thus, the above algorithm correctly computes the desired node. By Lemma 4, the number of queries in (1) for each of the $N^{\prime}$ nodes is $O(\sigma_{\boldsymbol{T}})$ , and we use the dynamic level ancestor data structure on our CPH that allows for leaf insertions and level ancestor queries in $O(1)$ time each [1]. This gives us $O(N\sigma_{\boldsymbol{T}})$ -time and space construction.

We will reuse the random access data structure of Lemma 8 for pattern matching (see Section 5.2). Thus $\mathsf{CPH}(\boldsymbol{T})$ requires $O(N\sigma_{\boldsymbol{T}})$ space. ∎

5 Cartesian-tree Pattern Matching with Position Heaps

5.1 Pattern Matching on Text String $S$ with $\mathsf{CPH}(S)$

Given a pattern $P$ of length $m$ , we first compute the greedy factorization $\mathsf{f}(P)=P_{0},P_{1},\ldots,P_{k}$ of $P$ such that $P_{0}=\varepsilon$ , and for $1\leq l\leq k$ , $P_{l}=P[\mathsf{lsum}(l-1)+1..\mathsf{lsum}(l)]$ is the longest prefix of $P_{l}\cdots P_{k}$ that is represented by $\mathsf{CPH}(S)$ , where $\mathsf{lsum}(l)=\sum_{j=0}^{l}|P_{j}|$ . We consider such a factorization of $P$ since the height $h$ of $\mathsf{CPH}(S)$ can be smaller than the pattern length $m$ .

Lemma 9.

Any node $v$ in $\mathsf{CPH}(S)$ has at most $|v|$ out-going edges.

Proof.

Let $(v,c,u)$ be any out-going edge of $v$ . When $|u|-1$ is a front pointer of $u$ , then $c=u[|u|]$ and this is when $c$ takes the maximum value. Since $u[|u|]\leq|u|-1$ , we have $c\leq|u|-1$ . Since the edge label of $\mathsf{CPH}(S)$ is non-negative, $v$ can have at most $|u|-1=|v|$ out-going edges. ∎

The next corollary immediately follows from Lemma 9.

Corollary 1.

Given a pattern $P$ of length $m$ , its factorization $\mathsf{f}(P)$ can be computed in $O(m\log(\min\{m,h\}))$ time, where $h$ is the height of $\mathsf{CPH}(S)$ .

The next lemma is analogous to the position heap for exact matching [10].

Lemma 10.

Consider two nodes $u$ and $v$ in $\mathsf{CPH}(S)$ such that $u=\mathsf{PD}(P)$ the id of $v$ is $i$ . Then, $\mathsf{PD}(S[i..])[1..|u|]=u$ iff one of the following conditions holds: (a) $v$ is a descendant of $u$ ; (b) $\mathsf{mrp}(v)$ is a descendant of $u$ .

See also Figure 10 in Appendix. We perform a standard traversal on $\mathsf{CPH}(S)$ so that one we check whether a node is a descendant of another node in $O(1)$ time.

When $k=1$ (i.e. $\mathsf{f}(P)=P$ ), $\mathsf{PD}(P)$ is represented by some node $u$ of $\mathsf{CPH}(S)$ . Now a direct application of Lemma 10 gives us all the $\mathit{occ}$ pattern occurrences in $O(m\log m+\mathit{occ})$ time, where $\min\{m,h\}=m$ in this case. All we need here is to report the id of every descendant of $u$ (Condition (a)) and the id of each node $v$ that satisfies Condition (b). The number of such nodes $v$ is less than $m$ .

When $k\geq 2$ (i.e. $\mathsf{f}(P)\neq P$ ), there is no node that represents $\mathsf{PD}(P)$ for the whole pattern $P$ . This happens only when $\mathit{occ}<m$ , since otherwise there has to be a node representing $\mathsf{PD}(P)$ by the incremental construction of $\mathsf{CPH}(S)$ , a contradiction. This implies that Condition (a) of Lemma 10 does apply when $k\geq 2$ . Thus, the candidates for the pattern occurrences only come from Condition (b), which are restricted to the nodes $v$ such that $\mathsf{mrp}(v)=u_{1}$ , where $u_{1}=\mathsf{PD}(P_{1})$ . We apply Condition (b) iteratively for the following $P_{2},\ldots,P_{k}$ , while keeping track of the position $i$ that was associated to each node $v$ such that $\mathsf{mrp}(v)=u_{1}$ . This can be done by padding $i$ with the off-set $\mathsf{lsum}(l-1)$ when we process $P_{l}$ . We keep such a position $i$ if Condition (b) is satisfied for all the following pattern blocks $P_{2},\ldots,P_{k}$ , namely, if the maximal reach pointer of the node with id $i+\mathsf{lsum}(l-1)$ points to node $u_{l}=\mathsf{PD}(P_{l})$ for increasing $l=2,\ldots,k$ . As soon as Condition (b) is not satisfied with some $l$ , we discard position $i$ .

Suppose that we have processed the all pattern blocks $P_{1},\ldots,P_{k}$ in $\mathsf{f}(P)$ . Now we have that $\mathsf{PD}(S[i..])[1..m]=\mathsf{PD}(P)$ (or equivalently $S[i..i+m-1]\approx P$ ) only if the position $i$ has survived. Namely, position $i$ is only a candidate of a pattern occurrence at this point, since the above algorithm only guarantees that $\mathsf{PD}(P_{1})\cdots\mathsf{PD}(P_{k})=\mathsf{PD}(S[i..])[1..m]$ . Note also that, by Condition (b), the number of such survived positions $i$ is bounded by $\min\{|P_{1}|,\ldots,|P_{k}|\}\leq m/k$ .

For each survived position $i$ , we verify whether $\mathsf{PD}(P)=\mathsf{PD}(S[i..])[1..m]$ . This can be done by checking, for each increasing $l=1,\ldots,k$ , whether or not $\mathsf{PD}(S[i..])[\mathsf{lsum}(l-1)+y]=\mathsf{PD}(P_{1}\cdots P_{l})[\mathsf{lsum}(l-1)+y]$ for every position $y$ ( $1\leq y\leq|P_{l}|$ ) such that $\mathsf{PD}(P_{l})[y]=0$ . By the definition of $\mathsf{PD}$ , the number of such positions $y$ is at most $\sigma_{P_{l}}\leq\sigma_{P}$ . Thus, for each survived position $i$ we have at most $k\sigma_{P}$ positions to verify. Since we have at most $m/k$ survived positions, the verification takes a total of $O(\frac{m}{k}\cdot k\sigma_{P})=O(m\sigma_{P})$ time.

Theorem 4.

Let $S$ be the text string of length $n$ . Using $\mathsf{CPH}(S)$ of size $O(n)$ augmented with the maximal reach pointers, we can find all $\mathit{occ}$ occurrences for a given pattern $P$ in $S$ in $O(m(\sigma_{P}+\log(\min\{m,h\}))+\mathit{occ})$ time, where $m=|P|$ and $h$ is the height of $\mathsf{CPH}(S)$ .

5.2 Pattern Matching on Text Trie $\boldsymbol{T}$ with $\mathsf{CPH}(\boldsymbol{T})$

In the text trie case, we can basically use the same matching algorithm as in the text string case of Section 5.1. However, recall that we cannot afford to store the PD encodings of the path strings in $\boldsymbol{T}$ as it requires $\Omega(n^{2})$ space. Instead, we reuse the random-access data structure of Lemma 8 for the verification step. Since it takes $O(\sigma_{\boldsymbol{T}})$ time for each random-access query, and since the data structure occupies $O(N\sigma_{\boldsymbol{T}})$ space, we have the following complexity:

Theorem 5.

Let $\boldsymbol{T}$ be the text trie with $N$ nodes. Using $\mathsf{CPH}(\boldsymbol{T})$ of size $O(N\sigma_{\boldsymbol{T}})$ augmented with the maximal reach pointers, we can find all $\mathit{occ}$ occurrences for a given pattern $P$ in $\boldsymbol{T}$ in $O(m(\sigma_{P}\sigma_{\boldsymbol{T}}+\log(\min\{m,h\}))+\mathit{occ})$ time, where $m=|P|$ and $h$ is the height of $\mathsf{CPH}(\boldsymbol{T})$ .

Acknowledgments

This work was supported by JSPS KAKENHI Grant Numbers JP18K18002 (YN) and JP21K17705 (YN), and by JST PRESTO Grant Number JPMJPR1922 (SI).

References

[1] S. Alstrup and J. Holm. Improved algorithms for finding level ancestors in dynamic trees. In U. Montanari, J. D. P. Rolim, and E. Welzl, editors, ICALP 2000, volume 1853 of Lecture Notes in Computer Science, pages 73–84. Springer, 2000.
[2] A. Amir, M. Farach, R. M. Idury, J. A. L. Poutré, and A. A. Schäffer. Improved dynamic dictionary matching. Information and Computation, 119(2):258–282, 1995.
[3] B. S. Baker. A theory of parameterized pattern matching: algorithms and applications. In STOC 1993, pages 71–80, 1993.
[4] B. S. Baker. Parameterized pattern matching by Boyer-Moore type algorithms. In Proc. 6th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 541–550, 1995.
[5] B. S. Baker. Parameterized pattern matching: Algorithms and applications. J. Comput. Syst. Sci., 52(1):28–42, 1996.
[6] M. A. Bender and M. Farach-Colton. The level ancestor problem simplified. Theor. Comput. Sci., 321(1):5–12, 2004.
[7] S. Cho, J. C. Na, K. Park, and J. S. Sim. A fast algorithm for order-preserving pattern matching. Inf. Process. Lett., 115(2):397–402, 2015.
[8] E. Coffman and J. Eve. File structures using hashing functions. Communications of the ACM, 13:427–432, 1970.
[9] M. Crochemore, C. S. Iliopoulos, T. Kociumaka, M. Kubica, A. Langiu, S. P. Pissis, J. Radoszewski, W. Rytter, and T. Walen. Order-preserving indexing. Theor. Comput. Sci., 638:122–135, 2016.
[10] A. Ehrenfeucht, R. M. McConnell, N. Osheim, and S.-W. Woo. Position heaps: A simple and dynamic text indexing data structure. Journal of Discrete Algorithms, 9(1):100–121, 2011.
[11] T. Fu, K. F. Chung, R. W. P. Luk, and C. Ng. Stock time series pattern matching: Template-based vs. rule-based approaches. Eng. Appl. Artif. Intell., 20(3):347–364, 2007.
[12] N. Fujisato, Y. Nakashima, S. Inenaga, H. Bannai, and M. Takeda. Right-to-left online construction of parameterized position heaps. In PSC 2018, pages 91–102, 2018.
[13] N. Fujisato, Y. Nakashima, S. Inenaga, H. Bannai, and M. Takeda. The parameterized position heap of a trie. In CIAC 2019, pages 237–248, 2019.
[14] T. I, S. Inenaga, and M. Takeda. Palindrome pattern matching. In R. Giancarlo and G. Manzini, editors, CPM 2011, volume 6661 of Lecture Notes in Computer Science, pages 232–245. Springer, 2011.
[15] H. Kim and Y. Han. OMPPM: online multiple palindrome pattern matching. Bioinform., 32(8):1151–1157, 2016.
[16] J. Kim, P. Eades, R. Fleischer, S. Hong, C. S. Iliopoulos, K. Park, S. J. Puglisi, and T. Tokuyama. Order-preserving matching. Theor. Comput. Sci., 525:68–79, 2014.
[17] Y. Matsuoka, T. Aoki, S. Inenaga, H. Bannai, and M. Takeda. Generalized pattern matching and periodicity under substring consistent equivalence relations. Theor. Comput. Sci., 656:225–233, 2016.
[18] S. G. Park, M. Bataa, A. Amir, G. M. Landau, and K. Park. Finding patterns and periods in cartesian tree matching. Theor. Comput. Sci., 845:181–197, 2020.
[19] S. Song, G. Gu, C. Ryu, S. Faro, T. Lecroq, and K. Park. Fast algorithms for single and multiple pattern Cartesian tree matching. Theor. Comput. Sci., 849:47–63, 2021.
[20] P. Weiner. Linear pattern-matching algorithms. In Proc. of 14th IEEE Ann. Symp. on Switching and Automata Theory, pages 1–11, 1973.
[21] J. Westbrook. Fast incremental planarity testing. In Proc. ICALP 1992, number 623 in LNCS, pages 342–353, 1992.