
Position Heaps for Cartesian-tree Matching
on Strings and Tries

Akio Nishimoto^1, Noriki Fujisato^1, Yuto Nakashima^1, Shunsuke Inenaga^{1,2}

^1 Department of Informatics, Kyushu University, Japan
{nishimoto.akio, noriki.fujisato, yuto.nakashima, inenaga}@inf.kyushu-u.ac.jp

^2 PRESTO, Japan Science and Technology Agency, Japan
Abstract

Cartesian-tree pattern matching is a recently introduced scheme of pattern matching that detects fragments of a sequential data stream whose tree structure is the same as that of a query pattern. Formally, Cartesian-tree pattern matching seeks all substrings S' of the text string S such that the Cartesian tree of S' and that of a query pattern P coincide. In this paper, we present a new indexing structure for this problem, called the Cartesian-tree Position Heap (CPH). Let n be the length of the input text string S, m the length of a query pattern P, and σ the alphabet size. We show that the CPH of S, denoted CPH(S), supports pattern matching queries in O(m(σ + log(min{h, m})) + occ) time with O(n) space, where h is the height of the CPH and occ is the number of pattern occurrences. We show how to build CPH(S) in O(n log σ) time with O(n) working space. Further, we extend the problem to the case where the text is a labeled tree (i.e. a trie). Given a trie T with N nodes, we show that the CPH of T, denoted CPH(T), supports pattern matching queries on the trie in O(m(σ^2 + log(min{h, m})) + occ) time with O(Nσ) space. We also show a construction algorithm for CPH(T) running in O(Nσ) time and O(Nσ) working space.

1 Introduction

If the Cartesian trees CT(X) and CT(Y) of two strings X and Y are equal, then we say that X and Y Cartesian-tree match (ct-match). The Cartesian-tree pattern matching problem (ct-matching problem) [18] is, given a text string S and a pattern P, to find all substrings S' of S that ct-match with P.

String equivalence under ct-matching belongs to the class of substring-consistent equivalence relations (SCERs) [17]; namely, the following holds: if two strings X and Y ct-match, then X[i..j] and Y[i..j] also ct-match for any 1 ≤ i ≤ j ≤ |X|. Among other types of SCERs ([3, 4, 5, 14, 15]), ct-matching is most closely related to order-preserving matching (op-matching) [16, 7, 9]. Two strings X and Y are said to op-match if the relative order of the characters in X and the relative order of the characters in Y are the same. It is known that ct-matching can detect interesting occurrences of a pattern that cannot be captured with op-matching. More precisely, if two strings X and Y op-match, then X and Y also ct-match, but the converse is not true. With this property in hand, ct-matching is well motivated for the analysis of time series such as stock charts [18, 11].

This paper deals with the indexing version of the ct-matching problem. Park et al. [18] proposed the Cartesian suffix tree (CST) for a text string S that can be built in O(n log n) worst-case time or O(n) expected time, where n is the length of the text string S. The log n factor in the worst-case complexity is due to the fact that the parent distance encoding, a key concept for ct-matching introduced in [18], is a sequence of integers in the range [0..n-1]. While it is not explicitly stated in Park et al.'s paper [18], our simple analysis (c.f. Lemma 9 in Section 5) reveals that the CST supports pattern matching queries in O(m log m + occ) time, where m is the pattern length and occ is the number of pattern occurrences.

In this paper, we present a new indexing structure for this problem, called the Cartesian-tree Position Heap (CPH). We show that the CPH of S, which occupies O(n) space, can be built in O(n log σ) time with O(n) working space, and supports pattern matching queries in O(m(σ + log(min{h, m})) + occ) time, where h is the height of the CPH. Compared to the aforementioned CST, our CPH is the first index for ct-matching that can be built in worst-case linear time for constant-size alphabets, while pattern matching queries with our CPH can be slower than with the CST when σ is large.

We then consider the case where the text is a labeled tree (i.e. a trie). Given a trie T with N nodes, we show that the CPH of T, which occupies O(Nσ) space, can be built in O(Nσ) time and O(Nσ) working space. We also show how to support pattern matching queries in O(m(σ^2 + log(min{h, m})) + occ) time in the trie case. To our knowledge, our CPH is the first indexing structure for ct-matching on tries that uses linear space for constant-size alphabets.

Conceptually, our CPH is most closely related to the parameterized position heap (PPH) for a string [12] and for a trie [13], in that our CPHs and the PPHs are both constructed incrementally, processing the suffixes of an input string or trie in increasing order of their lengths. However, some new techniques are required in the construction of our CPH due to the different nature of the parent distance encoding [18] of strings for ct-matching, compared to the previous encoding [3] of strings for parameterized matching.

2 Preliminaries

2.1 Strings and (Reversed) Tries

Let Σ be an ordered alphabet of size σ. An element of Σ is called a character. An element of Σ* is called a string. For a string S ∈ Σ*, let σ_S denote the number of distinct characters in S.

The empty string ε is the string of length 0, namely, |ε| = 0. For a string S = XYZ, the strings X, Y, and Z are called a prefix, substring, and suffix of S, respectively. The set of prefixes of a string S is denoted by Prefix(S). The i-th character of a string S is denoted by S[i] for 1 ≤ i ≤ |S|, and the substring of S that begins at position i and ends at position j is denoted by S[i..j] for 1 ≤ i ≤ j ≤ |S|. For convenience, let S[i..j] = ε if j < i. Also, let S[i..] = S[i..|S|] for any 1 ≤ i ≤ |S| + 1.

A trie is a rooted tree that represents a set of strings, where each edge is labeled with a character from Σ and the labels of the out-going edges of each node are mutually distinct. Tries are natural generalizations of strings in that tries can have branches while strings are branchless sequences.

Let x be any node of a given trie T, and let r denote the root of T. Let depth(x) denote the depth of x. When x ≠ r, let parent(x) denote the parent of x. For any 0 ≤ j ≤ depth(x), let anc(x, j) denote the j-th ancestor of x, namely, anc(x, 0) = x and anc(x, j) = parent(anc(x, j-1)) for 1 ≤ j ≤ depth(x). It is known that, after a linear-time preprocessing on T, anc(x, j) for any query node x and integer j can be answered in O(1) time [6].

For the sake of convenience, in the case where our input is a trie T, we consider its reversed trie, where the path labels are read in the leaf-to-root direction. On the other hand, the trie-based data structures (namely, position heaps) that we build for input strings and reversed tries are usual tries, where the path labels are read in the root-to-leaf direction.

For each (reversed) path (x, y) in T such that y = anc(x, j) with j = depth(x) - depth(y), let str(x, y) denote the string obtained by concatenating the labels of the edges from x to y. For any node x of T, let str(x) = str(x, r).

Let N be the number of nodes in T. We associate a unique id with each node of T. Here we use the bottom-up level-order traversal rank as the id of each node in T, and we sometimes identify each node with its id. For each node id i (1 ≤ i ≤ N), let T[i..] = str(i), i.e., T[i..] is the path string from node i to the root r.

2.2 Cartesian-tree Pattern Matching

The Cartesian tree of a string S, denoted CT(S), is the rooted tree with |S| nodes which is recursively defined as follows:

  • If |S| = 0, then CT(S) is the empty tree.

  • If |S| ≥ 1, then CT(S) is the tree whose root r stores the left-most minimum value S[i] in S, namely, r = S[i] iff S[i] ≤ S[j] for any j ≠ i and S[h] > S[i] for any h < i. The left child of r is CT(S[1..i-1]) and the right child of r is CT(S[i+1..|S|]).
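The recursive definition above can be transcribed directly (a minimal sketch for illustration only; `ct_shape` is a hypothetical helper name, and it returns only the shape of the tree, which is all that ct-matching compares):

```python
def ct_shape(S):
    """Shape of CT(S) as a nested pair (left subtree, right subtree);
    None stands for the empty tree."""
    if len(S) == 0:
        return None
    # leftmost position of the minimum value (ties broken by position)
    i = min(range(len(S)), key=lambda j: (S[j], j))
    return (ct_shape(S[:i]), ct_shape(S[i + 1:]))
```

For example, ct_shape("316486759") and ct_shape("713286945") coincide (these two strings ct-match), while ct_shape("12") and ct_shape("21") differ.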

The parent distance encoding of a string S of length n, denoted PD(S), is a sequence of n integers over [0..n-1] such that

    PD(S)[i] = i - max{ j | 1 ≤ j < i and S[j] ≤ S[i] }   if such j exists,
    PD(S)[i] = 0                                          otherwise.

Namely, PD(S)[i] represents the distance from position i to its nearest left-neighbor position j that stores a value less than or equal to S[i].
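A direct transcription of this definition (a quadratic-time sketch for illustration only; 0-indexed, whereas the paper is 1-indexed):

```python
def pd(S):
    """Parent distance encoding PD(S): for each position i, the distance
    to the nearest j < i with S[j] <= S[i], or 0 if no such j exists."""
    out = []
    for i in range(len(S)):
        d = 0
        for j in range(i - 1, -1, -1):
            if S[j] <= S[i]:
                d = i - j
                break
        out.append(d)
    return out
```

For instance, pd("316486759") = [0, 0, 1, 2, 1, 2, 1, 4, 1].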

A tight connection between CT and PD is known:

Lemma 1 ([19]).

For any two strings S_1 and S_2 of equal length, CT(S_1) = CT(S_2) iff PD(S_1) = PD(S_2).

For two strings S_1 and S_2, we write S_1 ≈ S_2 iff CT(S_1) = CT(S_2) (or equivalently PD(S_1) = PD(S_2)). We also say that S_1 and S_2 ct-match when S_1 ≈ S_2. See Fig. 1 for a concrete example.


Figure 1: Two strings S_1 = 316486759 and S_2 = 713286945 ct-match since CT(S_1) = CT(S_2) and PD(S_1) = PD(S_2).

We consider the indexing problems for Cartesian-tree pattern matching on a text string and a text trie, which are respectively defined as follows:

Problem 1 (Cartesian-Tree Pattern Matching on Text String).

Preprocess: A text string S of length n.

Query: A pattern string P of length m.

Report: All text positions i such that S[i..i+m-1] ≈ P.

Problem 2 (Cartesian-Tree Pattern Matching on Text Trie).

Preprocess: A text trie T with N nodes.

Query: A pattern string P of length m.

Report: All trie nodes i such that (T[i..])[1..m] ≈ P.

2.3 Sequence Hash Trees

Let W = ⟨w_1, …, w_k⟩ be a sequence of non-empty strings such that, for any 1 < i ≤ k, w_i ∉ Prefix(w_j) for any 1 ≤ j < i. The sequence hash tree [8] of W, denoted SHT(W) = SHT(W)^k, is a trie structure that is incrementally built as follows:

  1. SHT(W)^0 = SHT(⟨⟩) for the empty sequence ⟨⟩ is the tree consisting only of the root.

  2. For i = 1, …, k, SHT(W)^i is obtained by inserting the shortest prefix u_i of w_i that does not exist in SHT(W)^{i-1}. This is done by finding the longest prefix p_i of w_i that exists in SHT(W)^{i-1} and adding the new edge (p_i, c, u_i), where c = w_i[|p_i| + 1] is the first character of w_i that could not be traversed in SHT(W)^{i-1}.

Since we have assumed that each w_i in W is not a prefix of w_j for any 1 ≤ j < i, the new edge (p_i, c, u_i) is always created for each 1 ≤ i ≤ k. This means that SHT(W) contains exactly k + 1 nodes (including the root).

To perform pattern matching queries efficiently, each node of SHT(W) is augmented with a maximal reach pointer. For each 1 ≤ i ≤ k, let u_i be the newest node in SHT(W)^i, namely, u_i is the shortest prefix of w_i which did not exist in SHT(W)^{i-1}. Then, in the complete sequence hash tree SHT(W) = SHT(W)^k, we set mrp(u_i) = u_j iff u_j is the deepest node in SHT(W) such that u_j is a prefix of w_i. Intuitively, mrp(u_i) represents the last visited node u_j when we traverse w_i from the root of the complete SHT(W). Note that j ≥ i always holds. When j = i (i.e. when the maximal reach pointer is a self-loop), we can omit it because it is not used in the pattern matching algorithm.
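The incremental construction of steps 1–2 can be sketched as follows (a toy nested-dict illustration under the stated prefix-freeness assumption; maximal reach pointers are omitted, and `build_sht`/`count_nodes` are hypothetical helper names):

```python
def build_sht(W):
    """Sequence hash tree of W as nested dicts: each insertion walks down
    the longest existing prefix p_i and creates exactly one new node u_i."""
    root = {}
    for w in W:
        node = root
        for c in w:
            if c in node:
                node = node[c]     # still inside the existing prefix p_i
            else:
                node[c] = {}       # the shortest absent prefix u_i ends here
                break
    return root

def count_nodes(tree):
    """Number of nodes including the root."""
    return 1 + sum(count_nodes(child) for child in tree.values())
```

For example, build_sht(["a", "ba", "bb", "ab"]) has exactly 4 + 1 = 5 nodes, matching the k + 1 node count above.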

3 Cartesian-tree Position Heaps for Strings


Figure 2: CPH(S) for string S = 26427584365741. For each w_i = PD(S[n-i+1..]), the underlined prefix is the string that is represented by the node u_i in CPH(S). The dotted arcs are reversed suffix links (not all reversed suffix links are shown).

In this section, we introduce our new indexing structure for Problem 1. For a given text string S of length n, let W_S denote the sequence of the parent distance encodings of the non-empty suffixes of S, sorted in increasing order of their lengths. Namely, W_S = ⟨w_1, …, w_n⟩ = ⟨PD(S[n..]), …, PD(S[1..])⟩, where w_{n-i+1} = PD(S[i..]). The Cartesian-tree Position Heap (CPH) of string S, denoted CPH(S), is the sequence hash tree of W_S, that is, CPH(S) = SHT(W_S). Note that for each 1 ≤ i ≤ n + 1, CPH(S[i..]) = SHT(W_S)^{n-i+1} holds.
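Combining the PD encoding with the sequence hash tree yields a naive quadratic-time reference construction of CPH(S) (a sketch for checking small examples only, not the efficient algorithm developed in this section); note that every suffix insertion creates exactly one node, so the heap has n + 1 nodes:

```python
def build_cph(S):
    """CPH(S): sequence hash tree of PD(S[n..]), ..., PD(S[1..]),
    inserting the PD encodings of the suffixes in increasing length order."""
    def pd(T):  # naive parent distance encoding, 0-indexed
        return [next((p - j for j in range(p - 1, -1, -1) if T[j] <= T[p]), 0)
                for p in range(len(T))]

    root = {}
    for i in range(len(S) - 1, -1, -1):    # suffixes, shortest first
        node = root
        for c in pd(S[i:]):
            if c in node:
                node = node[c]
            else:
                node[c] = {}               # the new node u(i)
                break
    return root
```

For the string S = 26427584365741 of Figure 2 (n = 14), the resulting trie has 14 + 1 = 15 nodes.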

Our algorithm builds CPH(S[i..]) for decreasing i = n, …, 1, which means that we process the given text string S in a right-to-left online manner, prepending the new character S[i] to the current suffix S[i+1..].

For a sequence v of integers, let Z_v denote the sorted list of the positions z in v with v[z] = 0. Clearly, |Z_v| equals the number of 0's in v.

Lemma 2.

For any string S, |Z_{PD(S)}| ≤ σ_S.

Proof.

Let Z_{PD(S)} = z_1, …, z_ℓ. We have S[z_1] > ⋯ > S[z_ℓ], since otherwise PD(S)[z_x] ≠ 0 for some z_x, a contradiction. Thus |Z_{PD(S)}| ≤ σ_S holds. ∎

Lemma 3.

For each i = n, …, 1, PD(S[i..]) can be computed from PD(S[i+1..]) in an online manner, using a total of O(n) time with O(σ_S) working space.

Proof.

Given a new character S[i], we check each position ẑ in the list Z_{PD(S[i+1..])} in increasing order, where the positions in the list are maintained as global positions in S. If S[i] ≤ S[ẑ], then we set PD(S[i..])[ẑ-i+1] = ẑ-i (> 0) and remove ẑ from the list. Remark that these removed positions correspond to the front pointers in the next suffix S[i..]. We stop as soon as we encounter the first ẑ in the list such that S[i] > S[ẑ]. Finally, we add the position i to the head of the remaining positions in the list. This gives us Z_{PD(S[i..])} for the next suffix S[i..].

It is clear that, once a position in the PD encoding is assigned a non-zero value, the value never changes whatever characters we prepend to the string. Therefore, we can compute PD(S[i..]) from PD(S[i+1..]) in a total of O(n) time over all 1 ≤ i ≤ n. The working space is O(σ_S) due to Lemma 2. ∎
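The proof of Lemma 3 translates into the following right-to-left sketch (an illustration under our reading of the proof; it keeps the zero-valued positions of the current suffix in a deque, in increasing position order and hence strictly decreasing value order):

```python
from collections import deque

def pd_online(S):
    """Compute PD(S) by prepending characters right to left (Lemma 3).
    Z holds the positions whose PD value is still 0 in the current suffix;
    prepending S[i] finalizes a prefix of Z, and a position is never
    revisited once finalized, so the total work is O(n)."""
    n = len(S)
    PD = [0] * n
    Z = deque()                    # 0-indexed positions, increasing order
    for i in range(n - 1, -1, -1):
        while Z and S[i] <= S[Z[0]]:
            z = Z.popleft()
            PD[z] = z - i          # finalized: never changes afterwards
        Z.appendleft(i)            # position i is the new head, with value 0
    return PD
```

For example, pd_online("316486759") = [0, 0, 1, 2, 1, 2, 1, 4, 1]; by Lemma 2, at most σ_S positions ever sit in the deque at once.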

A position i in a sequence u of non-negative integers is said to be a front pointer in u if i - u[i] = 1 and i ≥ 2. Let F_u denote the sorted list of front pointers in u. For example, if u = 01214501, then F_u = {2, 3, 5, 6}. The positions of the suffix S[i+1..] that are removed from Z_{PD(S[i+1..])} in the proof of Lemma 3 correspond to the front pointers in F_{PD(S[i..])} for the next suffix S[i..].

Our construction algorithm updates CPH(S[i+1..]) to CPH(S[i..]) by inserting a new node for the next suffix S[i..], processing the given string S in a right-to-left online manner. Here the task is to efficiently locate the parent of the new node in the current CPH at each iteration.

As in the previous work on right-to-left online construction of indexing structures for other types of pattern matching [20, 10, 12, 13], we use reversed suffix links in our construction algorithm for CPH(S). For ease of explanation, we first introduce the notion of suffix links. Let u be any non-root node of CPH(S). We identify u with the path label from the root of CPH(S) to u, so that u is the PD encoding of some substring of S. We define the suffix link of u, denoted sl(u), such that sl(u) = v iff v is obtained by (1) removing the first 0 (= u[1]) and (2) substituting 0 for the character u[f] at every front pointer f ∈ F_u ⊆ [2..|u|] of u. The reversed suffix link of v with non-negative integer label a, denoted rsl(v, a), is defined such that rsl(v, a) = u iff sl(u) = v and a = |F_u|. See also Figure 2.
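In 0-indexed terms, the suffix link of a PD sequence can be sketched as follows (a direct illustration of the definition; `suffix_link` is a hypothetical helper name):

```python
def suffix_link(u):
    """sl(u): zero out every front pointer of u (the entries pointing at
    u's first position), then drop the first entry."""
    v = list(u)
    for f in range(1, len(u)):
        if f - u[f] == 0:          # 1-indexed: f - u[f] = 1, a front pointer
            v[f] = 0
    return v[1:]
```

For example, PD(1432) = [0, 1, 2, 3] has front pointers at all of positions 2, 3, 4, and suffix_link([0, 1, 2, 3]) = [0, 0, 0] = PD(432), as one expects since sl maps the PD encoding of a string to that of the string with its first character removed.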

Lemma 4.

Let u, v be any nodes of CPH(S) such that rsl(v, a) = u with label a. Then a ≤ σ_S.

Proof.

Since |F_u| ≤ |Z_v|, using Lemma 2 we obtain a = |F_u| ≤ |Z_v| ≤ σ_{S'} ≤ σ_S, where S' is a substring of S such that PD(S') = v. ∎

Lemma 4 implies that the number of out-going reversed suffix links of each node v is bounded by the alphabet size.

Figure 3: We climb up the path from u(i+1) and find the parent p(i) of the new node u(i) (in black). The label a of the reversed suffix link we traverse from v(i) is equal to the number of front pointers in p(i).

Our CPH construction algorithm makes use of the following monotonicity of the labels of reversed suffix links:

Lemma 5.

Suppose that there exist two reversed suffix links rsl(v, a) = u and rsl(v', a') = u' such that v' = parent(v) and u' = parent(u). Then 0 ≤ a - a' ≤ 1.

Proof.

Immediately follows from a = |F_u|, a' = |F_{u'}|, and u' = u[1..|u|-1]. ∎

We are ready to design our right-to-left online construction algorithm for the CPH of a given string S. Since PD(S[i..]) is the (n-i+1)-th string w_{n-i+1} of the input sequence W_S, for ease of explanation we will use the convention that u(i) = u_{n-i+1} and p(i) = p_{n-i+1}, where the new node u(i) for w_{n-i+1} = PD(S[i..]) is inserted as a child of p(i). See Figure 3.


Algorithm 1: Right-to-Left Online Construction of 𝖢𝖯𝖧(S)\mathsf{CPH}(S)

i = n (base case):

We begin with CPH(S[n..]), which consists of the root r = u(n+1) and the node u(n) for the first (i.e. shortest) suffix S[n..] of S. Since w_1 = PD(S[n..]) = PD(S[n]) = 0, the edge from r to u(n) is labeled 0. Also, we set the reversed suffix link rsl(r, 0) = u(n).

i = n-1, …, 1 (iteration):

We are given CPH(S[i+1..]), which consists of the nodes u(i+1), …, u(n) that respectively represent some prefixes of the already processed strings w_{n-i}, …, w_1 = PD(S[i+1..]), …, PD(S[n..]), together with their reversed suffix links. We find the parent p(i) of the new node u(i) for PD(S[i..]) as follows: we start from the last-created node u(i+1) for the previous PD(S[i+1..]) and climb up the path towards the root r. Let d_i ∈ [1..|u(i+1)|] be the smallest integer such that the d_i-th ancestor v(i) = anc(u(i+1), d_i) of u(i+1) has the reversed suffix link rsl(v(i), a) with the label a = |F_{PD(S[i..i+|v(i)|])}|. We traverse this reversed suffix link and let p(i) = rsl(v(i), a). We then insert the new node u(i) as the new child of p(i), with the edge labeled by the last symbol PD(S[i..])[|u(i)|]. Finally, we create a new reversed suffix link rsl(v̂(i), b) = u(i), where v̂(i) = anc(u(i+1), d_i - 1), so that parent(v̂(i)) = v(i). We set b ← a + 1 if the position i + |p(i)| is a front pointer of PD(S[i..]), and b ← a otherwise.

For computing the label a = |F_{PD(S[i..i+|v(i)|])}| efficiently, we introduce a new encoding FP defined as follows: for any string S of length n, let FP(S)[i] = |F_{PD(S[i..n])}|. The FP encoding preserves the ct-matching equivalence:

Lemma 6.

For any two strings S_1 and S_2, S_1 ≈ S_2 iff FP(S_1) = FP(S_2).

Proof.

For a string S, consider the DAG G(S) = (V, E) with V = {1, …, |S|} and E = {(j, i) | j = i - PD(S)[i]} (see also Figure 7 in Appendix). By Lemma 1, for any strings S_1 and S_2, G(S_1) = G(S_2) iff S_1 ≈ S_2. Now we show that there is a one-to-one correspondence between the DAG G and the FP encoding.

(⇒) We are given G(S) for some (unknown) string S. Since FP(S)[i] is the in-degree of the node i of G(S), FP(S) is unique for the given DAG G(S).

(⇐) Given FP(S) for some (unknown) string S, we show an algorithm that builds the DAG G(S). We first create the nodes V = {1, …, |S|} without edges, where all nodes in V are initially unmarked. For each i = n, …, 1 in decreasing order, if FP(S)[i] > 0, then we select the leftmost FP(S)[i] unmarked nodes in the range [i+1..n] and create an edge (i, i') from each selected node i' to i. We mark all these FP(S)[i] nodes at the end of this step, and proceed to the next node i - 1. The resulting DAG G(S) is clearly unique for a given FP(S). ∎
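A direct quadratic-time sketch of the FP encoding, which lets one check Lemma 6 on small examples (illustration only; the construction algorithm instead maintains the FP values incrementally):

```python
def fp(S):
    """FP(S)[i] = number of front pointers in PD(S[i..]) (0-indexed here)."""
    def pd(T):  # naive parent distance encoding
        return [next((p - j for j in range(p - 1, -1, -1) if T[j] <= T[p]), 0)
                for p in range(len(T))]

    out = []
    for i in range(len(S)):
        u = pd(S[i:])
        # 0-indexed front pointer test: f - u[f] == 0  (1-indexed: f - u[f] == 1)
        out.append(sum(1 for f in range(1, len(u)) if f - u[f] == 0))
    return out
```

Consistent with Lemma 6, fp("316486759") == fp("713286945"), since the two strings of Figure 1 ct-match.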

For computing the label a = |F_{PD(S[i..i+|v(i)|])}| = FP(S[i..i+|v(i)|])[1] of the reversed suffix link in Algorithm 1, it suffices to maintain the induced subgraph G_{[i..j]} of the DAG G, with the nodes {i, …, j}, for a variable-length sliding window S[i..j]. This can easily be maintained in O(n) total time.

Theorem 1.

Algorithm 1 builds CPH(S[i..]) for decreasing i = n, …, 1 in a total of O(n log σ) time and O(n) space, where σ is the alphabet size.

Proof.

Correctness: Consider the (n-i+1)-th step, in which we process PD(S[i..]). By Lemma 5, the d_i-th ancestor v(i) = anc(u(i+1), d_i) of u(i+1) can be found by simply walking up the path from the start node u(i+1). Note that such an ancestor v(i) of u(i+1) always exists, since the root r has the defined reversed suffix link rsl(r, 0) = u(n). By the definition of v(i) and its reversed suffix link, rsl(v(i), a) = p(i) is the longest prefix of PD(S[i..]) that is represented by CPH(S[i+1..]) (see also Figure 3). Thus, p(i) is the parent of the new node u(i) for PD(S[i..]). The correctness of the new reversed suffix link rsl(v̂(i), b) = u(i) follows from the definition.

Complexity: The time complexity is proportional to the total number Σ_{i=1}^{n} d_i of nodes that we visit over all i = n, …, 1. Since |u(i)| = |p(i)| + 1 = |v(i)| + 2 and |v(i)| = |u(i+1)| - d_i, we have d_i = |u(i+1)| - |u(i)| + 2. By telescoping, Σ_{i=1}^{n} d_i = |u(n+1)| - |u(1)| + 2n ≤ 2n = O(n), since u(n+1) is the root. Using Lemma 4 and the sliding-window FP encoding, we can find the reversed suffix links in O(log σ_S) time at each of the Σ_{i=1}^{n} d_i visited nodes. Thus the total time complexity is O(n log σ_S). Since the number of nodes in CPH(S) is n + 1 and the number of reversed suffix links is n, the total space complexity is O(n). ∎

Lemma 7.

There exists a string S of length n over a binary alphabet Σ = {1, 2} such that a node in CPH(S) has Ω(√n) out-going edges.

Proof.

Consider the string S = 1·121·1221 ⋯ 1 2^k 1. Then, for any 1 ≤ ℓ ≤ k, there exists a node representing 0 1^{k-2} ℓ (see also Figure 6 in Appendix). Since k = Θ(√n), the parent node 0 1^{k-2} has Ω(√n) out-going edges. ∎

Due to Lemma 7, if we maintained a sorted list of out-going edges for each node during our online construction of CPH(S[i..]), it would require O(n log n) time even for a constant-size alphabet. Still, after CPH(S) has been constructed, we can sort all the edges offline, as follows:

Theorem 2.

For any string S over an integer alphabet Σ = [1..σ] of size σ = n^{O(1)}, the edge-sorted CPH(S), together with the maximal reach pointers, can be computed in O(n log σ_S) time and O(n) space.

Proof.

We sort the edges of CPH(S) as follows: let i be the id of each node u(i), and let x be the label of the in-coming edge of u(i). Then we sort the pairs (i, x) of the node ids and the edge labels. Since both i and x are integers of magnitude n^{O(1)}, we can sort these pairs in O(n) time by a radix sort. The maximal reach pointers can be computed in O(n log σ_S) time using the reversed suffix links, in a similar way to the position heaps for exact matching [10]. ∎
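The radix sort used in the proof can be sketched with two stable counting-sort passes (a generic illustration; `sort_edges` is a hypothetical name, and the key ranges are assumed to be n^{O(1)} as in the theorem):

```python
def sort_edges(pairs, max_id, max_label):
    """Sort (node id, edge label) pairs lexicographically in linear time:
    a stable counting-sort pass on the label, then one on the id."""
    def counting_pass(arr, key, upper):
        buckets = [[] for _ in range(upper + 1)]
        for p in arr:
            buckets[key(p)].append(p)   # stable: preserves arrival order
        return [p for b in buckets for p in b]

    by_label = counting_pass(pairs, lambda p: p[1], max_label)
    return counting_pass(by_label, lambda p: p[0], max_id)
```

For example, sort_edges([(2, 3), (1, 2), (2, 1), (1, 5)], 2, 5) yields [(1, 2), (1, 5), (2, 1), (2, 3)].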

See Figure 5 in Appendix for an example of CPH(S) with maximal reach pointers.

4 Cartesian-tree Position Heaps for Tries

Let T be the input text trie with N nodes. A naïve extension of our CPH to a trie would be to build the CPH for the sequence ⟨PD(T[N..]), …, PD(T[1..])⟩ of the parent distance encodings of all the path strings of T towards the root. However, this does not work well, because the parent distance encodings are not consistent for suffixes. For instance, consider the two strings 1432 and 4432. Their longest common suffix 432 is represented by a single path in a trie T, but the longest common suffix of PD(1432) = 0123 and PD(4432) = 0100 is ε. Thus, in the worst case, we would have to treat all the path strings T[N..], …, T[1..] of T separately, but the total length of these path strings is Ω(N^2).

To overcome this difficulty, we reuse the FP encoding from Section 3. Since FP(S)[i] is determined solely by the suffix S[i..], the FP encoding is suffix-consistent. For an input trie T, let the FP-trie T_FP be the reversed trie storing FP(T[i..]) for all the original path strings T[i..] towards the root. Let N' be the number of nodes in T_FP. Since FP is suffix-consistent, N' ≤ N always holds. Namely, T_FP is a linear-size representation of the equivalence relation of the nodes of T w.r.t. ≈. Each node v of T_FP stores the equivalence class C_v = {i | T_FP[v..] = FP(T[i..])} of the nodes i in T that correspond to v. We set min{C_v} to be the representative of C_v, as well as the id of node v. See Figure 8 in Appendix.

Let Σ_T be the set of distinct characters (i.e. edge labels) in T and let σ_T = |Σ_T|. The FP-trie T_FP can be computed in O(N σ_T) time and working space by a standard traversal on T, where we store at most σ_T front pointers in each node of the current path in T due to Lemma 3.

Let iN,,i1i_{N^{\prime}},\ldots,i_{1} be the node id’s of 𝑻𝖥𝖯\boldsymbol{T}_{\mathsf{FP}} which are sorted in decreasing order. The Cartesian-tree position heap for the input trie 𝑻\boldsymbol{T} is 𝖢𝖯𝖧(𝑻)=𝖲𝖧𝖳(𝒲𝑻)\mathsf{CPH}(\boldsymbol{T})=\mathsf{SHT}(\mathcal{W}_{\boldsymbol{T}}), where 𝒲𝑻=𝖯𝖣(𝑻[iN..]),,𝖯𝖣(𝑻[i1..])\mathcal{W}_{\boldsymbol{T}}=\langle\mathsf{PD}(\boldsymbol{T}[i_{N^{\prime}}..]),\ldots,\mathsf{PD}(\boldsymbol{T}[i_{1}..])\rangle.
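Since the heap is a sequence hash tree, its shape can be illustrated with the string-case construction of Section 3: for each PD encoding in the sequence, insert the shortest prefix that is missing from the current tree, which adds exactly one new node. A minimal dict-trie sketch (a hypothetical simplification: node ids, edge sorting, and maximal reach pointers are omitted):

```python
def pd(s):
    """Parent distance encoding: distance back to the nearest
    smaller-or-equal value, or 0 if none exists."""
    out = []
    for i, c in enumerate(s):
        d = 0
        for j in range(i - 1, -1, -1):
            if s[j] <= c:
                d = i - j
                break
        out.append(d)
    return out

def build_cph(text):
    """Insert, for i = n, ..., 1, the shortest prefix of PD(text[i..])
    absent from the current heap; each insertion adds exactly one node."""
    root = {}
    for i in range(len(text), 0, -1):
        enc = pd(text[i - 1:])
        node, l = root, 0
        while l < len(enc) and enc[l] in node:   # walk longest present prefix
            node = node[enc[l]]
            l += 1
        # l < len(enc) always holds: the heap height is at most the number
        # of insertions performed so far, which is less than len(enc).
        node[enc[l]] = {}
    return root
```

Because each of the n insertions creates exactly one node, the resulting heap for a length-n string has n + 1 nodes including the root.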

As in the case of string inputs in Section 3, we insert the shortest prefix of 𝖯𝖣(𝑻[ik..])\mathsf{PD}(\boldsymbol{T}[i_{k}..]) that does not exist in 𝖢𝖯𝖧(𝑻[ik+1..])\mathsf{CPH}(\boldsymbol{T}[i_{k+1}..]). To perform this insert operation, we use the following data structure for a random-access query on the PD encoding of any path string in 𝑻\boldsymbol{T}:

Lemma 8.

There is a data structure of size O(Nσ𝐓)O(N\sigma_{\boldsymbol{T}}) that can answer the following queries in O(σ𝐓)O(\sigma_{\boldsymbol{T}}) time each.
Query input: The id ii of a node in 𝐓\boldsymbol{T} and integer >0\ell>0.
Query output: The \ellth (i.e. the last) symbol 𝖯𝖣((𝐓[i..])[1..])[]\mathsf{PD}((\boldsymbol{T}[i..])[1..\ell])[\ell] of 𝖯𝖣((𝐓[i..])[1..])\mathsf{PD}((\boldsymbol{T}[i..])[1..\ell]).

Proof.

Let 𝐱\mathbf{x} be the node with id ii, and 𝐳=𝖺𝗇𝖼(𝐱,)\mathbf{z}=\mathsf{anc}(\mathbf{x},\ell). Namely, 𝗌𝗍𝗋(𝐱,𝐳)=(𝑻[i..])[1..]\mathsf{str}(\mathbf{x},\mathbf{z})=(\boldsymbol{T}[i..])[1..\ell]. For each character aΣ𝑻a\in\Sigma_{\boldsymbol{T}}, let 𝗇𝖺(𝐱,a)\mathsf{na}(\mathbf{x},a) denote the nearest ancestor 𝐲a\mathbf{y}_{a} of 𝐱\mathbf{x} such that the edge (𝗉𝖺𝗋𝖾𝗇𝗍(𝐲a),𝐲a)(\mathsf{parent}(\mathbf{y}_{a}),\mathbf{y}_{a}) is labeled aa. If such an ancestor does not exist, then we set 𝗇𝖺(𝐱,a)\mathsf{na}(\mathbf{x},a) to the root 𝐫\mathbf{r}.

Let 𝐳=𝖺𝗇𝖼(𝐱,1)\mathbf{z^{\prime}}=\mathsf{anc}(\mathbf{x},\ell-1), and let bb be the label of the edge (𝐳,𝐳)(\mathbf{z},\mathbf{z^{\prime}}). Let DD be an empty set. For each character aΣ𝑻a\in\Sigma_{\boldsymbol{T}}, we query 𝗇𝖺(𝐱,a)=𝐲a\mathsf{na}(\mathbf{x},a)=\mathbf{y}_{a}. If da=|𝐲a||𝐳|>0d_{a}=|\mathbf{y}_{a}|-|\mathbf{z}^{\prime}|>0 and aba\leq b, then dad_{a} is a candidate for (𝖯𝖣(𝑻[i..])[1..])[](\mathsf{PD}(\boldsymbol{T}[i..])[1..\ell])[\ell], and we add dad_{a} to the set DD. After testing all aΣ𝑻a\in\Sigma_{\boldsymbol{T}}, we have that (𝖯𝖣(𝑻[i..])[1..])[]=minD(\mathsf{PD}(\boldsymbol{T}[i..])[1..\ell])[\ell]=\min D; if DD is empty, then the answer is 0 by the definition of 𝖯𝖣\mathsf{PD}. See Figure 4.

Refer to caption
Figure 4: Illustration for the data structure of Lemma 8, where (𝑻[i..])[1..]=𝗌𝗍𝗋(𝐱,𝐳)(\boldsymbol{T}[i..])[1..\ell]=\mathsf{str}(\mathbf{x},\mathbf{z}).

For all characters aΣ𝑻a\in\Sigma_{\boldsymbol{T}} and all nodes 𝐱\mathbf{x} in 𝑻\boldsymbol{T}, 𝗇𝖺(𝐱,a)\mathsf{na}(\mathbf{x},a) can be pre-computed in a total of O(Nσ𝑻)O(N\sigma_{\boldsymbol{T}}) preprocessing time and space, by standard traversals on 𝑻\boldsymbol{T}. Clearly each query is answered in O(σ𝑻)O(\sigma_{\boldsymbol{T}}) time. ∎
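Under these definitions, the proof of Lemma 8 can be turned into the following naive sketch. The parent-array representation, the class name, and the O(ℓ)-time level-ancestor loop are our illustrative assumptions (the paper instead uses an O(1)-time level ancestor structure [1]); 𝗇𝖺 is precomputed by inheriting each parent's table.

```python
def pd(s):
    """Reference parent distance encoder, used only for cross-checking."""
    out = []
    for i, c in enumerate(s):
        d = 0
        for j in range(i - 1, -1, -1):
            if s[j] <= c:
                d = i - j
                break
        out.append(d)
    return out

class TrieNA:
    """parent[v] / label[v]: parent id and label of the edge (parent[v], v);
    the root has parent -1 and an unused label.  na[v] maps a label a to the
    nearest ancestor-or-self y of v whose parent edge is labeled a."""
    def __init__(self, parent, label):
        self.parent, self.label = parent, label
        n = len(parent)
        self.depth = [0] * n
        self.na = [dict() for _ in range(n)]
        for v in range(n):                     # assumes parent[v] < v
            p = parent[v]
            if p >= 0:
                self.depth[v] = self.depth[p] + 1
                self.na[v] = dict(self.na[p])  # O(sigma) copy per node
                self.na[v][label[v]] = v

    def anc(self, v, k):
        # naive O(k) level ancestor; stands in for the O(1) structure [1]
        for _ in range(k):
            v = self.parent[v]
        return v

    def pd_symbol(self, i, l):
        """PD((T[i..])[1..l])[l]: position p of the path string from node i
        toward the root is the parent-edge label of anc(i, p - 1)."""
        zp = self.anc(i, l - 1)        # node carrying position l's character
        b = self.label[zp]
        best = 0
        for a, ya in self.na[i].items():
            d = self.depth[ya] - self.depth[zp]
            if a <= b and 0 < d and (best == 0 or d < best):
                best = d               # nearest previous position with char <= b
        return best                    # 0 when no such position exists

def path_string(t, i):
    """Path string of node i read toward the root."""
    out = []
    while t.parent[i] >= 0:
        out.append(t.label[i])
        i = t.parent[i]
    return "".join(out)
```

The cross-check against the reference encoder shows that taking the minimum positive depth difference over the stored 𝗇𝖺 pointers indeed yields the requested PD symbol.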

Theorem 3.

Let 𝐓\boldsymbol{T} be a given trie with NN nodes whose edge labels are from an integer alphabet of size NO(1)N^{O(1)}. The edge-sorted 𝖢𝖯𝖧(𝐓)\mathsf{CPH}(\boldsymbol{T}) with the maximal reach pointers, which occupies O(Nσ𝐓)O(N\sigma_{\boldsymbol{T}}) space, can be built in O(Nσ𝐓)O(N\sigma_{\boldsymbol{T}}) time.

Proof.

The rest of the construction algorithm of 𝖢𝖯𝖧(𝑻)\mathsf{CPH}(\boldsymbol{T}) is almost the same as the case of the CPH for a string, except that the amortization argument in the proof for Theorem 1 cannot be applied to the case where the input is a trie. Instead, we use the nearest marked ancestor (NMA) data structure [21, 2] that supports queries and marking nodes in amortized O(1)O(1) time each, using space linear in the input tree. For each a[0..σ𝑻]a\in[0..\sigma_{\boldsymbol{T}}], we create a copy 𝖢𝖯𝖧a(𝑻)\mathsf{CPH}_{a}(\boldsymbol{T}) of 𝖢𝖯𝖧(𝑻)\mathsf{CPH}(\boldsymbol{T}) and maintain the NMA data structure on 𝖢𝖯𝖧a(𝑻)\mathsf{CPH}_{a}(\boldsymbol{T}) so that every node vv that has a defined reversed suffix link 𝗋𝗌𝗅(v,a)\mathsf{rsl}(v,a) is marked, and all other nodes are unmarked. The NMA query for a given node vv with character aa is denoted by 𝗇𝗆𝖺a(v)\mathsf{nma}_{a}(v). If vv itself is marked with aa, then let 𝗇𝗆𝖺a(v)=v\mathsf{nma}_{a}(v)=v. For any node 𝐱\mathbf{x} of 𝑻\boldsymbol{T}, let 𝐱\mathcal{I}_{\mathbf{x}} be the array of size at most σ𝑻\sigma_{\boldsymbol{T}} s.t. 𝐱[j]=h\mathcal{I}_{\mathbf{x}}[j]=h iff hh is the jjth smallest element of 𝖯𝖣(𝗌𝗍𝗋(𝐱))\mathcal{F}_{\mathsf{PD}(\mathsf{str}(\mathbf{x}))}.

We are ready to design our construction algorithm: Suppose that we have already built 𝖢𝖯𝖧(𝑻[ik+1..])\mathsf{CPH}(\boldsymbol{T}[i_{k+1}..]) and we are to update it to 𝖢𝖯𝖧(𝑻[ik..])\mathsf{CPH}(\boldsymbol{T}[i_{k}..]). Let 𝐰\mathbf{w} be the node in 𝑻\boldsymbol{T} with id iki_{k}, and let 𝐮=𝗉𝖺𝗋𝖾𝗇𝗍(𝐰)\mathbf{u}=\mathsf{parent}(\mathbf{w}) in 𝑻𝖥𝖯\boldsymbol{T}_{\mathsf{FP}}. Let uu be the node of 𝖢𝖯𝖧(𝑻[ik+1..])\mathsf{CPH}(\boldsymbol{T}[i_{k+1}..]) that corresponds to 𝐮\mathbf{u}. We initially set vuv\leftarrow u and a|𝖯𝖣(𝑻[ik..ik+|u|])|a\leftarrow|\mathcal{F}_{\mathsf{PD}(\boldsymbol{T}[i_{k}..i_{k}+|u|])}|. Let d(a)=max{|u|𝐰[a]+1,0}d(a)=\max\{|u|-\mathcal{I}_{\mathbf{w}}[a]+1,0\}. We perform the following:

  1. (1)

    Check whether v=𝖺𝗇𝖼(u,d(a))v^{\prime}=\mathsf{anc}(u,d(a)) is marked in 𝖢𝖯𝖧a(𝑻)\mathsf{CPH}_{a}(\boldsymbol{T}). If so, go to (2). Otherwise, update vvv\leftarrow v^{\prime}, aa1a\leftarrow a-1, and repeat (1).

  2. (2)

    Return 𝗇𝗆𝖺a(v)\mathsf{nma}_{a}(v).

By the definitions of 𝐰[a]\mathcal{I}_{\mathbf{w}}[a] and d(a)d(a), the node v(ik)v(i_{k}) from which we should take the reversed suffix link is in the path between vv^{\prime} and vv, and it is the lowest ancestor of vv that has the reversed suffix link with aa. Thus, the above algorithm correctly computes the desired node. By Lemma 4, the number of queries in (1) for each of the NN^{\prime} nodes is O(σ𝑻)O(\sigma_{\boldsymbol{T}}), and we use the dynamic level ancestor data structure on our CPH that allows for leaf insertions and level ancestor queries in O(1)O(1) time each [1]. This gives us an O(Nσ𝑻)O(N\sigma_{\boldsymbol{T}})-time and O(Nσ𝑻)O(N\sigma_{\boldsymbol{T}})-space construction.

We will reuse the random access data structure of Lemma 8 for pattern matching (see Section 5.2). Thus 𝖢𝖯𝖧(𝑻)\mathsf{CPH}(\boldsymbol{T}) requires O(Nσ𝑻)O(N\sigma_{\boldsymbol{T}}) space. ∎

5 Cartesian-tree Pattern Matching with Position Heaps

5.1 Pattern Matching on Text String SS with 𝖢𝖯𝖧(S)\mathsf{CPH}(S)

Given a pattern PP of length mm, we first compute the greedy factorization 𝖿(P)=P0,P1,,Pk\mathsf{f}(P)=P_{0},P_{1},\ldots,P_{k} of PP such that P0=εP_{0}=\varepsilon, and for 1lk1\leq l\leq k, Pl=P[𝗅𝗌𝗎𝗆(l1)+1..𝗅𝗌𝗎𝗆(l)]P_{l}=P[\mathsf{lsum}(l-1)+1..\mathsf{lsum}(l)] is the longest prefix of PlPkP_{l}\cdots P_{k} that is represented by 𝖢𝖯𝖧(S)\mathsf{CPH}(S), where 𝗅𝗌𝗎𝗆(l)=j=0l|Pj|\mathsf{lsum}(l)=\sum_{j=0}^{l}|P_{j}|. We consider such a factorization of PP since the height hh of 𝖢𝖯𝖧(S)\mathsf{CPH}(S) can be smaller than the pattern length mm.
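A dict-trie sketch of this greedy factorization is given below, reusing the shortest-absent-prefix construction of the heap. The names are illustrative, and plain dictionary lookups stand in for the edge-sorted binary search that yields the logarithmic factor of Corollary 1; the descent is sound because 𝖯𝖣 is prefix-consistent, i.e. pd(rest)[:l] equals pd(rest[:l]).

```python
def pd(s):
    out = []
    for i, c in enumerate(s):
        d = 0
        for j in range(i - 1, -1, -1):
            if s[j] <= c:
                d = i - j
                break
        out.append(d)
    return out

def build_cph(text):
    """Shortest-absent-prefix insertion, as in Section 3 (dict-trie sketch)."""
    root = {}
    for i in range(len(text), 0, -1):
        enc = pd(text[i - 1:])
        node, l = root, 0
        while l < len(enc) and enc[l] in node:
            node = node[enc[l]]
            l += 1
        node[enc[l]] = {}
    return root

def factorize(root, p):
    """Greedy f(P): each block is the longest prefix of the remaining
    suffix whose PD encoding is a path from the heap's root."""
    blocks, rest = [], p
    while rest:
        enc = pd(rest)              # enc[:l] == pd(rest[:l]) (prefix-consistency)
        node, l = root, 0
        while l < len(rest) and enc[l] in node:
            node = node[enc[l]]
            l += 1
        blocks.append(rest[:l])     # l >= 1: a nonempty heap always has child 0
        rest = rest[l:]
    return blocks
```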

Lemma 9.

Any node vv in 𝖢𝖯𝖧(S)\mathsf{CPH}(S) has at most |v|+1|v|+1 out-going edges.

Proof.

Let (v,c,u)(v,c,u) be any out-going edge of vv, where c=u[|u|]c=u[|u|]. The label cc takes its maximum value when |u|1|u|-1 is a front pointer of uu, in which case c=|u|1c=|u|-1. Since the edge labels of 𝖢𝖯𝖧(S)\mathsf{CPH}(S) are non-negative integers and c|u|1c\leq|u|-1, there are at most |u|=|v|+1|u|=|v|+1 distinct values for cc. Thus vv can have at most |v|+1|v|+1 out-going edges. ∎
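The bound is easy to sanity-check empirically on a dict-trie sketch of the construction (illustrative names; the second test string is the one of Figure 6, whose node 011011 attains the bound with 4 out-going edges).

```python
def pd(s):
    out = []
    for i, c in enumerate(s):
        d = 0
        for j in range(i - 1, -1, -1):
            if s[j] <= c:
                d = i - j
                break
        out.append(d)
    return out

def build_cph(text):
    """Shortest-absent-prefix insertion (dict-trie sketch of the CPH)."""
    root = {}
    for i in range(len(text), 0, -1):
        enc = pd(text[i - 1:])
        node, l = root, 0
        while l < len(enc) and enc[l] in node:
            node = node[enc[l]]
            l += 1
        node[enc[l]] = {}
    return root

def within_degree_bound(root):
    """Lemma 9: a node at string depth d has at most d + 1 children."""
    stack = [(root, 0)]
    while stack:
        v, d = stack.pop()
        if len(v) > d + 1:
            return False
        stack.extend((c, d + 1) for c in v.values())
    return True
```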

The next corollary immediately follows from Lemma 9.

Corollary 1.

Given a pattern PP of length mm, its factorization 𝖿(P)\mathsf{f}(P) can be computed in O(mlog(min{m,h}))O(m\log(\min\{m,h\})) time, where hh is the height of 𝖢𝖯𝖧(S)\mathsf{CPH}(S).

The next lemma is analogous to the position heap for exact matching [10].

Lemma 10.

Consider two nodes uu and vv in 𝖢𝖯𝖧(S)\mathsf{CPH}(S) such that u=𝖯𝖣(P)u=\mathsf{PD}(P) and the id of vv is ii. Then, 𝖯𝖣(S[i..])[1..|u|]=u\mathsf{PD}(S[i..])[1..|u|]=u iff one of the following conditions holds: (a) vv is a descendant of uu; (b) 𝗆𝗋𝗉(v)\mathsf{mrp}(v) is a descendant of uu.

See also Figure 10 in Appendix. We perform a standard traversal on 𝖢𝖯𝖧(S)\mathsf{CPH}(S) in advance so that one can check whether a node is a descendant of another node in O(1)O(1) time.

When k=1k=1 (i.e. 𝖿(P)=P\mathsf{f}(P)=P), 𝖯𝖣(P)\mathsf{PD}(P) is represented by some node uu of 𝖢𝖯𝖧(S)\mathsf{CPH}(S). Now a direct application of Lemma 10 gives us all the 𝑜𝑐𝑐\mathit{occ} pattern occurrences in O(mlogm+𝑜𝑐𝑐)O(m\log m+\mathit{occ}) time, where min{m,h}=m\min\{m,h\}=m in this case. All we need here is to report the id of every descendant of uu (Condition (a)) and the id of each node vv that satisfies Condition (b). The number of such nodes vv is less than mm.
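For this k=1k=1 case, Lemma 10 translates directly into code. The sketch below uses hypothetical names and recomputes the maximal reach pointers by walking from the root, whereas the real index stores ids and precomputed pointers; it collects the ids of uu's descendants (Condition (a)) and the ids whose maximal reach descends into uu's subtree (Condition (b)), and a brute-force matcher is included for cross-checking.

```python
def pd(s):
    out = []
    for i, c in enumerate(s):
        d = 0
        for j in range(i - 1, -1, -1):
            if s[j] <= c:
                d = i - j
                break
        out.append(d)
    return out

class Node:
    def __init__(self, node_id):
        self.id, self.ch = node_id, {}

def build_cph(s):
    root = Node(0)
    for i in range(len(s), 0, -1):          # ids n, ..., 1
        enc = pd(s[i - 1:])
        v, l = root, 0
        while l < len(enc) and enc[l] in v.ch:
            v = v.ch[enc[l]]
            l += 1
        v.ch[enc[l]] = Node(i)              # one new node per insertion
    return root

def subtree(u):
    out, stack = [], [u]
    while stack:
        v = stack.pop()
        out.append(v)
        stack.extend(v.ch.values())
    return out

def cph_search(s, p):
    """Occurrences of p in s under Cartesian-tree matching, k = 1 case."""
    root, enc = build_cph(s), pd(p)
    u = root
    for c in enc:                           # locate u = PD(p), if present
        if c not in u.ch:
            return None                     # k >= 2: block algorithm needed
        u = u.ch[c]
    in_u = {v.id for v in subtree(u)}
    occ = set(in_u)                         # (a): v is a descendant of u
    for v in subtree(root):                 # (b): mrp(v) is a descendant of u
        if v.id == 0 or v.id in occ:
            continue
        enc_i, w, l = pd(s[v.id - 1:]), root, 0
        while l < len(enc_i) and enc_i[l] in w.ch:
            w = w.ch[enc_i[l]]
            l += 1                          # w = mrp(v) after the walk
        if w.id in in_u:
            occ.add(v.id)
    return occ

def brute(s, p):
    m = len(p)
    return {i for i in range(1, len(s) - m + 2)
            if pd(s[i - 1:i - 1 + m]) == pd(p)}
```

On the string of Figure 5, searching for the pattern 132 exercises Condition (b): the node with id 13 lies above uu, but its maximal reach lands on uu itself.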

When k2k\geq 2 (i.e. 𝖿(P)P\mathsf{f}(P)\neq P), there is no node that represents 𝖯𝖣(P)\mathsf{PD}(P) for the whole pattern PP. This happens only when 𝑜𝑐𝑐<m\mathit{occ}<m, since otherwise there has to be a node representing 𝖯𝖣(P)\mathsf{PD}(P) by the incremental construction of 𝖢𝖯𝖧(S)\mathsf{CPH}(S), a contradiction. This implies that Condition (a) of Lemma 10 does not apply when k2k\geq 2. Thus, the candidates for the pattern occurrences only come from Condition (b), which are restricted to the nodes vv such that 𝗆𝗋𝗉(v)=u1\mathsf{mrp}(v)=u_{1}, where u1=𝖯𝖣(P1)u_{1}=\mathsf{PD}(P_{1}). We apply Condition (b) iteratively for the following P2,,PkP_{2},\ldots,P_{k}, while keeping track of the position ii that was associated to each node vv such that 𝗆𝗋𝗉(v)=u1\mathsf{mrp}(v)=u_{1}. This can be done by padding ii with the offset 𝗅𝗌𝗎𝗆(l1)\mathsf{lsum}(l-1) when we process PlP_{l}. We keep such a position ii if Condition (b) is satisfied for all the following pattern blocks P2,,PkP_{2},\ldots,P_{k}, namely, if the maximal reach pointer of the node with id i+𝗅𝗌𝗎𝗆(l1)i+\mathsf{lsum}(l-1) points to node ul=𝖯𝖣(Pl)u_{l}=\mathsf{PD}(P_{l}) for increasing l=2,,kl=2,\ldots,k. As soon as Condition (b) is not satisfied with some ll, we discard position ii.

Suppose that we have processed all the pattern blocks P1,,PkP_{1},\ldots,P_{k} in 𝖿(P)\mathsf{f}(P). Now we have that 𝖯𝖣(S[i..])[1..m]=𝖯𝖣(P)\mathsf{PD}(S[i..])[1..m]=\mathsf{PD}(P) (or equivalently S[i..i+m1]PS[i..i+m-1]\approx P) only if the position ii has survived. Namely, position ii is only a candidate of a pattern occurrence at this point, since the above algorithm only guarantees that 𝖯𝖣(P1)𝖯𝖣(Pk)=𝖯𝖣(S[i..])[1..m]\mathsf{PD}(P_{1})\cdots\mathsf{PD}(P_{k})=\mathsf{PD}(S[i..])[1..m]. Note also that, by Condition (b), the number of such survived positions ii is bounded by min{|P1|,,|Pk|}m/k\min\{|P_{1}|,\ldots,|P_{k}|\}\leq m/k.

For each survived position ii, we verify whether 𝖯𝖣(P)=𝖯𝖣(S[i..])[1..m]\mathsf{PD}(P)=\mathsf{PD}(S[i..])[1..m]. This can be done by checking, for each increasing l=1,,kl=1,\ldots,k, whether or not 𝖯𝖣(S[i..])[𝗅𝗌𝗎𝗆(l1)+y]=𝖯𝖣(P1Pl)[𝗅𝗌𝗎𝗆(l1)+y]\mathsf{PD}(S[i..])[\mathsf{lsum}(l-1)+y]=\mathsf{PD}(P_{1}\cdots P_{l})[\mathsf{lsum}(l-1)+y] for every position yy (1y|Pl|1\leq y\leq|P_{l}|) such that 𝖯𝖣(Pl)[y]=0\mathsf{PD}(P_{l})[y]=0. By the definition of 𝖯𝖣\mathsf{PD}, the number of such positions yy is at most σPlσP\sigma_{P_{l}}\leq\sigma_{P}. Thus, for each survived position ii we have at most kσPk\sigma_{P} positions to verify. Since we have at most m/km/k survived positions, the verification takes a total of O(mkkσP)=O(mσP)O(\frac{m}{k}\cdot k\sigma_{P})=O(m\sigma_{P}) time.

Theorem 4.

Let SS be the text string of length nn. Using 𝖢𝖯𝖧(S)\mathsf{CPH}(S) of size O(n)O(n) augmented with the maximal reach pointers, we can find all 𝑜𝑐𝑐\mathit{occ} occurrences for a given pattern PP in SS in O(m(σP+log(min{m,h}))+𝑜𝑐𝑐)O(m(\sigma_{P}+\log(\min\{m,h\}))+\mathit{occ}) time, where m=|P|m=|P| and hh is the height of 𝖢𝖯𝖧(S)\mathsf{CPH}(S).

5.2 Pattern Matching on Text Trie 𝑻\boldsymbol{T} with 𝖢𝖯𝖧(𝑻)\mathsf{CPH}(\boldsymbol{T})

In the text trie case, we can basically use the same matching algorithm as in the text string case of Section 5.1. However, recall that we cannot afford to store the PD encodings of the path strings in 𝑻\boldsymbol{T} as it requires Ω(N2)\Omega(N^{2}) space. Instead, we reuse the random-access data structure of Lemma 8 for the verification step. Since it takes O(σ𝑻)O(\sigma_{\boldsymbol{T}}) time for each random-access query, and since the data structure occupies O(Nσ𝑻)O(N\sigma_{\boldsymbol{T}}) space, we have the following complexity:

Theorem 5.

Let 𝐓\boldsymbol{T} be the text trie with NN nodes. Using 𝖢𝖯𝖧(𝐓)\mathsf{CPH}(\boldsymbol{T}) of size O(Nσ𝐓)O(N\sigma_{\boldsymbol{T}}) augmented with the maximal reach pointers, we can find all 𝑜𝑐𝑐\mathit{occ} occurrences for a given pattern PP in 𝐓\boldsymbol{T} in O(m(σPσ𝐓+log(min{m,h}))+𝑜𝑐𝑐)O(m(\sigma_{P}\sigma_{\boldsymbol{T}}+\log(\min\{m,h\}))+\mathit{occ}) time, where m=|P|m=|P| and hh is the height of 𝖢𝖯𝖧(𝐓)\mathsf{CPH}(\boldsymbol{T}).

Acknowledgments

This work was supported by JSPS KAKENHI Grant Numbers JP18K18002 (YN) and JP21K17705 (YN), and by JST PRESTO Grant Number JPMJPR1922 (SI).

References

  • [1] S. Alstrup and J. Holm. Improved algorithms for finding level ancestors in dynamic trees. In U. Montanari, J. D. P. Rolim, and E. Welzl, editors, ICALP 2000, volume 1853 of Lecture Notes in Computer Science, pages 73–84. Springer, 2000.
  • [2] A. Amir, M. Farach, R. M. Idury, J. A. L. Poutré, and A. A. Schäffer. Improved dynamic dictionary matching. Information and Computation, 119(2):258–282, 1995.
  • [3] B. S. Baker. A theory of parameterized pattern matching: algorithms and applications. In STOC 1993, pages 71–80, 1993.
  • [4] B. S. Baker. Parameterized pattern matching by Boyer-Moore type algorithms. In Proc. 6th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 541–550, 1995.
  • [5] B. S. Baker. Parameterized pattern matching: Algorithms and applications. J. Comput. Syst. Sci., 52(1):28–42, 1996.
  • [6] M. A. Bender and M. Farach-Colton. The level ancestor problem simplified. Theor. Comput. Sci., 321(1):5–12, 2004.
  • [7] S. Cho, J. C. Na, K. Park, and J. S. Sim. A fast algorithm for order-preserving pattern matching. Inf. Process. Lett., 115(2):397–402, 2015.
  • [8] E. Coffman and J. Eve. File structures using hashing functions. Communications of the ACM, 13:427–432, 1970.
  • [9] M. Crochemore, C. S. Iliopoulos, T. Kociumaka, M. Kubica, A. Langiu, S. P. Pissis, J. Radoszewski, W. Rytter, and T. Walen. Order-preserving indexing. Theor. Comput. Sci., 638:122–135, 2016.
  • [10] A. Ehrenfeucht, R. M. McConnell, N. Osheim, and S.-W. Woo. Position heaps: A simple and dynamic text indexing data structure. Journal of Discrete Algorithms, 9(1):100–121, 2011.
  • [11] T. Fu, K. F. Chung, R. W. P. Luk, and C. Ng. Stock time series pattern matching: Template-based vs. rule-based approaches. Eng. Appl. Artif. Intell., 20(3):347–364, 2007.
  • [12] N. Fujisato, Y. Nakashima, S. Inenaga, H. Bannai, and M. Takeda. Right-to-left online construction of parameterized position heaps. In PSC 2018, pages 91–102, 2018.
  • [13] N. Fujisato, Y. Nakashima, S. Inenaga, H. Bannai, and M. Takeda. The parameterized position heap of a trie. In CIAC 2019, pages 237–248, 2019.
  • [14] T. I, S. Inenaga, and M. Takeda. Palindrome pattern matching. In R. Giancarlo and G. Manzini, editors, CPM 2011, volume 6661 of Lecture Notes in Computer Science, pages 232–245. Springer, 2011.
  • [15] H. Kim and Y. Han. OMPPM: online multiple palindrome pattern matching. Bioinform., 32(8):1151–1157, 2016.
  • [16] J. Kim, P. Eades, R. Fleischer, S. Hong, C. S. Iliopoulos, K. Park, S. J. Puglisi, and T. Tokuyama. Order-preserving matching. Theor. Comput. Sci., 525:68–79, 2014.
  • [17] Y. Matsuoka, T. Aoki, S. Inenaga, H. Bannai, and M. Takeda. Generalized pattern matching and periodicity under substring consistent equivalence relations. Theor. Comput. Sci., 656:225–233, 2016.
  • [18] S. G. Park, M. Bataa, A. Amir, G. M. Landau, and K. Park. Finding patterns and periods in cartesian tree matching. Theor. Comput. Sci., 845:181–197, 2020.
  • [19] S. Song, G. Gu, C. Ryu, S. Faro, T. Lecroq, and K. Park. Fast algorithms for single and multiple pattern Cartesian tree matching. Theor. Comput. Sci., 849:47–63, 2021.
  • [20] P. Weiner. Linear pattern-matching algorithms. In Proc. of 14th IEEE Ann. Symp. on Switching and Automata Theory, pages 1–11, 1973.
  • [21] J. Westbrook. Fast incremental planarity testing. In Proc. ICALP 1992, number 623 in LNCS, pages 342–353, 1992.

Appendix A Figures

Refer to caption Refer to caption

Figure 5: The Cartesian-tree position heap of string S=𝟸𝟼𝟺𝟸𝟽𝟻𝟾𝟺𝟹𝟼𝟺𝟽𝟻𝟽𝟼S=\mathtt{264275843647576} together with the maximal reach pointers (doubly-lined arcs). For each wi=𝖯𝖣(S[ni+1..])w_{i}=\mathsf{PD}(S[n-i+1..]), the singly-underlined prefix is the string that is represented by the node uiu_{i} in 𝖢𝖯𝖧(S)\mathsf{CPH}(S), and the doubly-underlined substring is the string skipped by the maximal reach pointer.

Refer to caption Refer to caption

Figure 6: The string SS of Lemma 7 with k=4k=4, i.e., S=𝟷𝟷𝟸𝟷𝟸𝟸𝟷𝟸𝟸𝟸𝟷𝟸𝟸𝟸𝟸𝟷S=\mathtt{1121221222122221} and its Cartesian-tree position heap 𝖢𝖯𝖧(S)\mathsf{CPH}(S). For each wi=𝖯𝖣(S[ni+1..])w_{i}=\mathsf{PD}(S[n-i+1..]), the underlined prefix is represented by the node of 𝖢𝖯𝖧(S)\mathsf{CPH}(S) that corresponds to wiw_{i}. Node 011011 has k=4k=4 out-going edges. This example also shows that the upper bound of Lemma 9 is tight, since node 011011 has |011|+1=4|011|+1=4 out-going edges.

Refer to caption

Figure 7: DAG 𝖦(S)\mathsf{G}(S), 𝖯𝖣(S)\mathsf{PD}(S), and 𝖥𝖯(S)\mathsf{FP}(S) for string S=𝟸𝟼𝟺𝟸𝟽𝟻𝟾𝟺𝟹𝟼𝟻𝟽𝟺𝟷S=\mathtt{26427584365741}.

Refer to caption Refer to caption

Refer to caption
Figure 8: Left upper: An example of input trie 𝑻\boldsymbol{T}. Left lower: The FP-trie 𝑻𝖥𝖯\boldsymbol{T}_{\mathsf{FP}} that is obtained from the original trie 𝑻\boldsymbol{T}. For instance, the two path strings 𝑻[3..]=𝟻𝟹𝟺𝟹\boldsymbol{T}[3..]=\mathtt{5343} and 𝑻[4..]=𝟺𝟸𝟻𝟹\boldsymbol{T}[4..]=\mathtt{4253} have the same FP encoding 02000200 and thus the node id’s 33 and 44 are stored in a single node in 𝑻𝖥𝖯\boldsymbol{T}_{\mathsf{FP}}. The representative (the id) of the node {3,4}\{3,4\} in 𝑻𝖥𝖯\boldsymbol{T}_{\mathsf{FP}} is min{3,4}=3\min\{3,4\}=3.

Refer to caption

Figure 9: 𝖢𝖯𝖧(𝑻)\mathsf{CPH}(\boldsymbol{T}) for the trie 𝑻\boldsymbol{T} of Fig. 8, where the nodes vv store the representatives 1,2,3,5,6,8,9,11,12,13,15,161,2,3,5,6,8,9,11,12,13,15,16 of their corresponding equivalence classes 𝒞v\mathcal{C}_{v}.

Refer to caption Refer to caption

Figure 10: Left: Illustration for Condition (a) of Lemma 10. Right: Illustration for Condition (b) of Lemma 10, where the doubly-lined arc represents the maximal reach pointer.