This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

Fast and linear-time string matching algorithms based on the distances of qq-gram occurrences

Satoshi Kobayashi Graduate School of Information Sciences, Tohoku University, Japan Diptarama Hendrian Graduate School of Information Sciences, Tohoku University, Japan Ryo Yoshinaka Graduate School of Information Sciences, Tohoku University, Japan Ayumi Shinohara Graduate School of Information Sciences, Tohoku University, Japan
Abstract

Given a text TT of length nn and a pattern PP of length mm, the string matching problem is a task to find all occurrences of PP in TT. In this study, we propose an algorithm that solves this problem in O((n+m)q)O((n+m)q) time considering the distance between two adjacent occurrences of the same qq-gram contained in PP. We also propose a theoretical improvement of it which runs in O(n+m)O(n+m) time, though it is not necessarily faster in practice. We compare the execution times of our and existing algorithms on various kinds of real and artificial datasets such as an English text, a genome sequence and a Fibonacci string. The experimental results show that our algorithm is as fast as the state-of-the-art algorithms in many cases, particularly when a pattern frequently appears in a text.

1 Introduction

The exact string matching problem is a task to find all occurrences of PP in TT when given a text TT of length nn and a pattern PP of length mm. A brute-force solution of this problem is to compare PP with all the substrings of TT of length mm. It takes O(nm)O(nm) time. The Knuth-Morris-Pratt (KMP) algorithm [17] is well known as an algorithm that can solve the problem in O(n+m)O(n+m) time. However, it is not efficient in practice, because it scans every position of the text at least once. The Boyer-Moore algorithm [3] is famous as an algorithm that can perform string matching fast in practice by skipping many positions of the text, though it has O(nm)O(nm) worst-case time complexity. Like this, many efficient algorithms whose worst-case time complexity is the same or even worse than the naive method have been proposed so far [16, 21, 19]. For example, the HASHqq algorithm [19] focuses on the substrings of length qq in a pattern and obtains a larger shift amount. However, considering that such an algorithm is embedded in software and actually used, if the worst-case input strings are given, the operation of the software may be slowed down. Therefore, an algorithm that operates theoretically and practically fast is important. Franek et al. [14] proposed the Franek-Jennings-Smyth (FJS) algorithm, which is a hybrid of the KMP algorithm and the Sunday algorithm [21]. The worst-case time complexity of the FJS algorithm is O(n+m+σ)O(n+m+\sigma) and it works fast in practice, where σ\sigma is the alphabet size. Kobayashi et al. [18] proposed an algorithm that improves the speed of the FJS algorithm by combining a method that extends the idea of the Quite-Naive algorithm [4]. This algorithm has the same worst-case time complexity as the FJS algorithm, and it runs faster than the FJS algorithm in many cases. The LWFRqq algorithm [8] is a practically fast algorithm that works in linear time. This algorithm uses a method of quickly recognizing substrings of a pattern using a hash function. See [12, 15] for recent surveys on exact string matching algorithms.

This paper proposes two new exact string matching algorithms based on the HASHqq and KMP algorithms incorporating a new idea based on the distances of occurrences of the same qq-grams. The time complexity of the preprocessing phase of the first algorithm is O(mq)O(mq) and the search phase runs in O(nq)O(nq) time. The second algorithm improves the theoretical complexity of the first algorithm, where the preprocessing and searching times are O(m)O(m) and O(n)O(n), respectively. Our algorithms are as fast as the state-of-the-art algorithms in many cases. Particularly, our algorithms work faster when a pattern frequently appears in a text.

This paper is organized as follows. Section 2 briefly reviews the KMP and HASHqq algorithms, which are the basis of the proposed algorithms. Section 3 proposes our algorithms. Section 4 shows experimental results comparing the proposed algorithms with several other algorithms using artificial and practical data. Section 5 draws our conclusions.

2 Preliminaries

2.1 Notation

Let Σ\Sigma be a set of characters called an alphabet and σ=|Σ|\sigma=|\Sigma| be its size. Σ\Sigma^{*} denotes the set of all strings over Σ\Sigma. The length of a string wΣw\in\Sigma^{*} is denoted by |w||w|. The empty string, denoted by ε\varepsilon, is the string of length zero. The ii-th character of ww is denoted by w[i]w[i] for each 1i|w|1\leq i\leq|w|. The substring of ww starting at ii and ending at jj is denoted by w[i:j]w[i:j] for 1ij|w|1\leq i\leq j\leq|w|. For convenience, let w[i:j]=εw[i:j]=\varepsilon if i>ji>j. A string w[1:i]w[1:i] is called a prefix of ww and a string w[i:|w|]w[i:|w|] is called a suffix of ww. A string vv is a border of ww if vv is both a prefix and a suffix of ww. Note that the empty string is a border of any string. Moreover, a prefix, a suffix or a border vv of ww is called proper when vwv\neq w. The length of the longest proper border of w[1:i]w[1:i] for 1i|w|1\leq i\leq|w| is given by

𝐵𝑜𝑟𝑑w[i]=max{jw[1:j]=w[ij+1:i] and 0j<i}.\displaystyle\mathit{Bord}_{w}[i]=\max\{\,j\mid w[1:j]=w[i-j+1:i]\text{ and }0\leq j<i\,\}\,.

Throughout this paper, we assume Σ\Sigma is an integer alphabet.

2.2 The exact string matching problem

The exact string matching problem is defined as follows:

Input: A text TΣT\in\Sigma^{*} of length nn and a pattern PΣP\in\Sigma^{*} of length mm,
Output: All positions ii such that T[i:i+m1]=PT[i:i+m-1]=P for 1inm+11\leq i\leq n-m+1.

We will use a text TΣT\in\Sigma^{*} of length nn and a pattern PΣP\in\Sigma^{*} of length mm throughout the paper.

Let us consider comparing T[i:i+m1]T[i:i+m-1] and P[1:m]P[1:m]. The naive method compares characters of the two strings from left to right. When a character mismatch occurs, the pattern is shifted to the right by one character. That is, we compare T[i+1:i+m]T[i+1:i+m] and P[1:m]P[1:m]. This naive method takes O(nm)O(nm) time for matching. There are a number of ideas to shift the pattern more so that searching TT for PP can be performed more quickly, using shifting functions obtained by preprocessing the pattern.

2.3 Knuth-Morris-Pratt algorithm

The Knuth-Morris-Pratt (KMP) algorithm [17] is well known as a string matching algorithm that has linear worst-case time complexity. When the KMP algorithm has confirmed that T[i:i+j2]=P[1:j1]T[i:i+j-2]=P[1:j-1] and T[i+j1]P[j]T[i+j-1]\neq P[j] for some jmj\leq m, it shifts the pattern so that a suffix of T[i:i+j2]T[i:i+j-2] matches a prefix of PP and we do not have to re-scan any part of T[i:i+j2]T[i:i+j-2] again. That is, the pattern can be shifted by jk1j-k-1 for k=𝐵𝑜𝑟𝑑P[j1]k=\mathit{Bord}_{P}[j-1]. In addition, if P[k+1]=P[j]P[k+1]=P[j], the same mismatch will occur again after the shift. In order to avoid this kind of mismatch, we use 𝑆𝑡𝑟𝑜𝑛𝑔_𝐵𝑜𝑟𝑑[1:m+1]\mathit{Strong\text{\scalebox{0.6}[1.0]{\textunderscore}}Bord}[1:m+1] given by

𝑆𝑡𝑟𝑜𝑛𝑔_𝐵𝑜𝑟𝑑P(j)={𝐵𝑜𝑟𝑑P(m)if j=m+1,max({kP[1:k]=P[jk:j1],P[k+1]P[j],max({k0k<j}{1})otherwise.\mathit{Strong\text{\scalebox{0.6}[1.0]{\textunderscore}}Bord}_{P}(j)=\left\{\begin{aligned} &\mathit{Bord}_{P}(m)&&\text{if $j=m+1$,}\\ &\max\!\big{(}\{\,k\mid P[1:k]=P[j-k:j-1],\;P[k+1]\neq P[j],\;\\ &\text{\phantom{$\max\big{(}\{\,k\mid{}$}}0\leq k<j\,\}\cup\{-1\}\big{)}&&\text{otherwise.}\\ \end{aligned}\right.

The amount 𝐾𝑀𝑃_𝑆ℎ𝑖𝑓𝑡[j]\mathit{KMP\text{\scalebox{0.6}[1.0]{\textunderscore}}Shift}[j] of the shift is given by

𝐾𝑀𝑃_𝑆ℎ𝑖𝑓𝑡[j]=j𝑆𝑡𝑟𝑜𝑛𝑔_𝐵𝑜𝑟𝑑P(j)1.\displaystyle\mathit{KMP\text{\scalebox{0.6}[1.0]{\textunderscore}}Shift}[j]=j-\mathit{Strong\text{\scalebox{0.6}[1.0]{\textunderscore}}Bord}_{P}(j)-1.

This function has a domain of {1,,m+1}\{1,\dots,m+1\} and is implemented as an array in the algorithm. Hereafter, we identify some functions and the arrays that implement them.

Fact 1.

If P[1:j1]=T[i:i+j2]P[1:j-1]=T[i:i+j-2] and P[j]T[i+j1]P[j]\neq T[i+j-1], then P[1:jkj1]=T[i+kj:i+j2]P[1:j-k_{j}-1]=T[i+k_{j}:i+j-2] holds for kj=𝐾𝑀𝑃_𝑆ℎ𝑖𝑓𝑡[j]k_{j}=\mathit{KMP\text{\scalebox{0.6}[1.0]{\textunderscore}}Shift}[j]. Moreover, there is no positive integer k<𝐾𝑀𝑃_𝑆ℎ𝑖𝑓𝑡[j]k<\mathit{KMP\text{\scalebox{0.6}[1.0]{\textunderscore}}Shift}[j] such that P=T[i+k:i+k+m1]P=T[i+k:i+k+m-1].

Note that if the algorithm has confirmed T[i:i+m1]=PT[i:i+m-1]=P, the shift is given by 𝐾𝑀𝑃_𝑆ℎ𝑖𝑓𝑡[m+1]\mathit{KMP\text{\scalebox{0.6}[1.0]{\textunderscore}}Shift}[m+1] after reporting the occurrence of the pattern. Algorithm 1 shows a pseudocode to compute the array 𝐾𝑀𝑃_𝑆ℎ𝑖𝑓𝑡\mathit{KMP\text{\scalebox{0.6}[1.0]{\textunderscore}}Shift}. It runs in O(m)O(m) time. By using 𝐾𝑀𝑃_𝑆ℎ𝑖𝑓𝑡\mathit{KMP\text{\scalebox{0.6}[1.0]{\textunderscore}}Shift}, the KMP algorithm finds all occurrences of PP in TT in O(n)O(n) time.

1 Function 𝙿𝚛𝚎𝙺𝙼𝙿𝚂𝚑𝚒𝚏𝚝(P)\mathtt{PreKMPShift}(P)
2       m|P|m\leftarrow|P|; i1i\leftarrow 1; j0j\leftarrow 0;
3       𝑆𝑡𝑟𝑜𝑛𝑔_𝐵𝑜𝑟𝑑P[1]1\mathit{Strong\text{\scalebox{0.6}[1.0]{\textunderscore}}Bord}_{P}[1]\leftarrow-1;
4       while imi\leq m do
5             while j>0j>0 and P[i]P[j]P[i]\neq P[j] do j𝑆𝑡𝑟𝑜𝑛𝑔_𝐵𝑜𝑟𝑑P[j]j\leftarrow\mathit{Strong\text{\scalebox{0.6}[1.0]{\textunderscore}}Bord}_{P}[j];
6             ii+1i\leftarrow i+1; jj+1j\leftarrow j+1;
7             if imi\leq m and P[i]=P[j]P[i]=P[j] then 𝑆𝑡𝑟𝑜𝑛𝑔_𝐵𝑜𝑟𝑑P[i]𝑆𝑡𝑟𝑜𝑛𝑔_𝐵𝑜𝑟𝑑P[j]\mathit{Strong\text{\scalebox{0.6}[1.0]{\textunderscore}}Bord}_{P}[i]\leftarrow\mathit{Strong\text{\scalebox{0.6}[1.0]{\textunderscore}}Bord}_{P}[j];
8            else 𝑆𝑡𝑟𝑜𝑛𝑔_𝐵𝑜𝑟𝑑P[i]j\mathit{Strong\text{\scalebox{0.6}[1.0]{\textunderscore}}Bord}_{P}[i]\leftarrow j;
9            
10      for j1j\leftarrow 1 to mm do 𝐾𝑀𝑃_𝑆ℎ𝑖𝑓𝑡[j]j𝑆𝑡𝑟𝑜𝑛𝑔_𝐵𝑜𝑟𝑑P[j]1\mathit{KMP\text{\scalebox{0.6}[1.0]{\textunderscore}}Shift}[j]\leftarrow j-\mathit{Strong\text{\scalebox{0.6}[1.0]{\textunderscore}}Bord}_{P}[j]-1;
11       return 𝐾𝑀𝑃_𝑆ℎ𝑖𝑓𝑡\mathit{KMP\text{\scalebox{0.6}[1.0]{\textunderscore}}Shift}
Algorithm 1 Computing 𝐾𝑀𝑃_𝑆ℎ𝑖𝑓𝑡\mathit{KMP\text{\scalebox{0.6}[1.0]{\textunderscore}}Shift}

2.4 HASHqq algorithm

The HASHqq algorithm [19] is an adaptation of the Wu-Manber multiple string matching algorithm [22] to the single string matching problem. Before comparing PP and T[i:i+m1]T[i:i+m-1], the HASHqq algorithm shifts the pattern so that the suffix qq-gram T[i+mq:i+m1]T[i+m-q:i+m-1] of the text substring shall match the rightmost occurrence of the same qq-gram in the pattern. For practical efficiency, we use a hash function, though it may result in aligning mismatching qq-grams occasionally. The shift amount is given by 𝑠ℎ𝑖𝑓𝑡(h(T[i+mq:i+m1]))\mathit{shift}(h(T[i+m-q:i+m-1])) where

𝑠ℎ𝑖𝑓𝑡(c)\displaystyle\mathit{shift}(c) =mmax({j|h(P[jq+1:j])=c,qjm}{q1}),\displaystyle=m-\max(\{j\;|\;h(P[j-q+1:j])=c,\;q\leq j\leq m\}\cup\{q-1\})\,,
h(x)\displaystyle h(x) =(2q1x[1]+2q2x[2]++2x[q1]+x[q])mod 28.\displaystyle=(2^{q-1}\cdot x[1]+2^{q-2}\cdot x[2]+\cdots+2\cdot x[q-1]+x[q]){\rm\;mod\;2^{8}}.

We repeatedly shift the pattern till the suffix qq-grams of the pattern and the considered text substring have a matching hash value, in which case the shift amount will be 0. We then compare the characters of the pattern and the text substring from left to right. If a character mismatch occurs during the comparison, the pattern is shifted by

min({kh(P[mk:mk])=h(P[m:m]), 1kmq}{m})\min(\{\,k\mid h(P[m^{\prime}-k:m-k])=h(P[m^{\prime}:m]),\;1\leq k\leq m-q\,\}\cup\{m^{\prime}\}) (1)

where m=mq+1m^{\prime}=m-q+1, since the qq-gram suffixes of the pattern and the text substring have the same hash values. The time complexity of the preprocessing phase for computing the shift function is O(mq)O(mq). The searching phase has O(n(m+q))O(n(m+q)) time complexity. The worst-case time complexity is worse than that of the naive method, but it works fast in practice.

Fact 2.

If 𝑠ℎ𝑖𝑓𝑡(h(T[i+mq:i+m1]))=jmq+1\mathit{shift}(h(T[i+m-q:i+m-1]))=j\neq m-q+1, then h(P[mjq+1:mj])=h(T[i+mq:i+m1])h(P[m-j-q+1:m-j])=h(T[i+m-q:i+m-1]). There is no positive integer k<jk<j such that P=T[i+k:i+k+m1]P=T[i+k:i+k+m-1].

3 Proposed algorithms

3.1 DISTqq algorithm

Our proposed algorithm uses three kinds of shifting functions. The first one 𝐻𝑄_𝑆ℎ𝑖𝑓𝑡\mathit{HQ\text{\scalebox{0.6}[1.0]{\textunderscore}}Shift} is essentially the same as 𝑠ℎ𝑖𝑓𝑡\mathit{shift}, the one used in the HASHqq algorithm, except for the hashing function. The second one 𝑑𝑖𝑠𝑡\mathit{dist} is based on the distance of the closest occurrences of the qq-grams of the same hash value in the pattern. We involve 𝐾𝑀𝑃_𝑆ℎ𝑖𝑓𝑡\mathit{KMP\text{\scalebox{0.6}[1.0]{\textunderscore}}Shift} as the third one to guarantee the linear-time behavior.

Formally, the first shifting function is given as

𝐻𝑄_𝑆ℎ𝑖𝑓𝑡[c]\displaystyle\mathit{HQ\text{\scalebox{0.6}[1.0]{\textunderscore}}Shift}[c] =mmax({jh(P[jq+1:j])=c,qjm}{q1}),\displaystyle=m-\max(\{j\mid h(P[j-q+1:j])=c,\;q\leq j\leq m\}\cup\{q-1\}),
where h(x)=(4q1x[1]+4q2x[2]++4x[q1]+x[q])mod 216.\displaystyle h(x)=(4^{q-1}\cdot x[1]+4^{q-2}\cdot x[2]+\cdots+4\cdot x[q-1]+x[q]){\rm\;mod\;2^{16}}.

Fact 2 holds for 𝐻𝑄_𝑆ℎ𝑖𝑓𝑡\mathit{HQ\text{\scalebox{0.6}[1.0]{\textunderscore}}Shift}.

The second shifting function is defined for j=q,,mj=q,\dots,m by

𝑑𝑖𝑠𝑡[j]\displaystyle\mathit{dist}[j] =min({kh(P[jk:jk])=h(P[j:j]), 1kjq}{j})\displaystyle=\min(\{\,k\mid h(P[j^{\prime}-k:j-k])=h(P[j^{\prime}:j]),\;1\leq k\leq j-q\,\}\cup\{j^{\prime}\})

where j=jq+1j^{\prime}=j-q+1. This function 𝑑𝑖𝑠𝑡\mathit{dist} is a generalization of the shift (Eq. 1) used in the HASHqq algorithm. We have 𝑑𝑖𝑠𝑡[j]=k<j\mathit{dist}[j]=k<j^{\prime} if the qq-gram ending at jj and the one ending at jkj-k have the same hash value, while no qq-grams occurring between those have the same value. If no qq-gram ending before jj has the same hash value, then 𝑑𝑖𝑠𝑡[j]=j\mathit{dist}[j]=j^{\prime}. By using this, in the situation where h(P[jq+1:j])=h(T[i+jq:i+j1])h(P[j-q+1:j])=h(T[i+j-q:i+j-1]), when a mismatch occurs anywhere between T[i:i+m1]T[i:i+m-1] and PP, the pattern can be shifted by 𝑑𝑖𝑠𝑡[j]\mathit{dist}[j].

Fact 3.

Suppose that h(P[jq+1:j])=h(T[i+jq:i+j1])h(P[j-q+1:j])=h(T[i+j-q:i+j-1]). Then h(P[jq+1𝑑𝑖𝑠𝑡[j]:j𝑑𝑖𝑠𝑡[j]])=h(T[i+jq:i+j1])h(P[j-q+1-\mathit{dist}[j]:j-\mathit{dist}[j]])=h(T[i+j-q:i+j-1]), unless 𝑑𝑖𝑠𝑡[j]=jq+1\mathit{dist}[j]=j-q+1. Moreover, there is no positive integer k<𝑑𝑖𝑠𝑡[j]k<\mathit{dist}[j] such that P=T[i+k:i+k+m1]P=T[i+k:i+k+m-1].

Those functions 𝐻𝑄_𝑆ℎ𝑖𝑓𝑡\mathit{HQ\text{\scalebox{0.6}[1.0]{\textunderscore}}Shift}, 𝑑𝑖𝑠𝑡\mathit{dist} and 𝐾𝑀𝑃_𝑆ℎ𝑖𝑓𝑡\mathit{KMP\text{\scalebox{0.6}[1.0]{\textunderscore}}Shift} are computed in the preprocessing phase. Algorithms 2 and 3 compute the arrays 𝐻𝑄_𝑆ℎ𝑖𝑓𝑡\mathit{HQ\text{\scalebox{0.6}[1.0]{\textunderscore}}Shift} and 𝑑𝑖𝑠𝑡\mathit{dist}, respectively.

1 Function 𝙿𝚛𝚎𝙷𝚚𝚂𝚑𝚒𝚏𝚝(P,q)\mathtt{PreHqShift}(P,q)
2       m|P|m\leftarrow|P|;
3       for i0i\leftarrow 0 to 21612^{16}-1 do
4             𝐻𝑄_𝑆ℎ𝑖𝑓𝑡[i]mq+1\mathit{HQ\text{\scalebox{0.6}[1.0]{\textunderscore}}Shift}[i]\leftarrow m-q+1;
5            
6      for iqi\leftarrow q to mm do
7             hashh(P[iq+1:i])hash\leftarrow h(P[i-q+1:i]);
8             𝐻𝑄_𝑆ℎ𝑖𝑓𝑡[hash]mi\mathit{HQ\text{\scalebox{0.6}[1.0]{\textunderscore}}Shift}[hash]\leftarrow m-i;
9            
10      return 𝐻𝑄_𝑆ℎ𝑖𝑓𝑡\mathit{HQ\text{\scalebox{0.6}[1.0]{\textunderscore}}Shift};
11      
Algorithm 2 Computing 𝐻𝑄_𝑆ℎ𝑖𝑓𝑡\mathit{HQ\text{\scalebox{0.6}[1.0]{\textunderscore}}Shift}
1 Function 𝙿𝚛𝚎𝙳𝚒𝚜𝚝𝙰𝚛𝚛𝚊𝚢(P,q)\mathtt{PreDistArray}(P,q)
2       for j1j\leftarrow 1 to q1q-1 do
3             𝑑𝑖𝑠𝑡[j]1\mathit{dist}[j]\leftarrow 1;
4            
5      for j0j\leftarrow 0 to 21612^{16}-1 do
6             prevpos[j]0prevpos[j]\leftarrow 0;
7            
8      for jqj\leftarrow q to |P||P| do
9             hashh(P[jq+1:j])hash\leftarrow h(P[j-q+1:j]);
10             if prevpos[hash]=0prevpos[hash]=0 then djq+1d\leftarrow j-q+1;
11             else djprevpos[hash]d\leftarrow j-prevpos[hash];
12             𝑑𝑖𝑠𝑡[j]d\mathit{dist}[j]\leftarrow d; prevpos[hash]jprevpos[hash]\leftarrow j;
13            
14      return 𝑑𝑖𝑠𝑡\mathit{dist};
15      
Algorithm 3 Computing 𝑑𝑖𝑠𝑡\mathit{dist}

Figure 1 shows examples of shifting the pattern using 𝐻𝑄_𝑆ℎ𝑖𝑓𝑡\mathit{HQ\text{\scalebox{0.6}[1.0]{\textunderscore}}Shift} and 𝑑𝑖𝑠𝑡\mathit{dist}.

Refer to caption
Refer to caption
Figure 1: Shifting a pattern using 𝐻𝑄_𝑆ℎ𝑖𝑓𝑡\mathit{HQ\text{\scalebox{0.6}[1.0]{\textunderscore}}Shift} and 𝑑𝑖𝑠𝑡\mathit{dist}

Both functions 𝐻𝑄_𝑆ℎ𝑖𝑓𝑡\mathit{HQ\text{\scalebox{0.6}[1.0]{\textunderscore}}Shift} and 𝑑𝑖𝑠𝑡\mathit{dist} shift the pattern using qq-gram hash values based on Facts 2 and 3, respectively. The latter can be used only when we know that the pattern and the text substring have aligned qq-grams ending at jj with the same hash value and it may shift the pattern at most jq+1j-q+1, while the former can be used anytime and the maximum possible shift is mq+1m-q+1. The advantage of the function 𝑑𝑖𝑠𝑡\mathit{dist} is in the computational cost. If we know that the premise of Fact 3 is satisfied, we can immediately perform the shift based on 𝑑𝑖𝑠𝑡\mathit{dist}, while computing 𝐻𝑄_𝑆ℎ𝑖𝑓𝑡(h(w))\mathit{HQ\text{\scalebox{0.6}[1.0]{\textunderscore}}Shift}(h(w)) for the concerned qq-gram ww in the text is not as cheap as 𝑑𝑖𝑠𝑡[j]\mathit{dist}[j]. Our algorithm exploits this advantage of the new shifting function 𝑑𝑖𝑠𝑡\mathit{dist}.

1 Function 𝙳𝙸𝚂𝚃𝚚(P,T,q)\mathtt{DISTq}(P,T,q)
2       𝐾𝑀𝑃_𝑆ℎ𝑖𝑓𝑡𝙿𝚛𝚎𝙺𝙼𝙿𝚂𝚑𝚒𝚏𝚝(P)\mathit{KMP\text{\scalebox{0.6}[1.0]{\textunderscore}}Shift}\leftarrow\mathtt{PreKMPShift}(P);
3       𝐻𝑄_𝑆ℎ𝑖𝑓𝑡𝙿𝚛𝚎𝙷𝚚𝚂𝚑𝚒𝚏𝚝(P,q)\mathit{HQ\text{\scalebox{0.6}[1.0]{\textunderscore}}Shift}\leftarrow\mathtt{PreHqShift}(P,q);
4       𝑑𝑖𝑠𝑡𝙿𝚛𝚎𝙳𝚒𝚜𝚝𝙰𝚛𝚛𝚊𝚢(P,q)\mathit{dist}\leftarrow\mathtt{PreDistArray}(P,q);
5       n|T|n\leftarrow|T|; m|P|m\leftarrow|P|; i1i\leftarrow 1; j1j\leftarrow 1; kmk\leftarrow m;
6       while knk\leq n do
7             if j1j\leq 1 then
8                   while True do // Alignment-phase
9                         sh𝐻𝑄_𝑆ℎ𝑖𝑓𝑡[h(T[km+1:k])]sh\leftarrow\mathit{HQ\text{\scalebox{0.6}[1.0]{\textunderscore}}Shift}[h(T[k-m+1:k])]; kk+shk\leftarrow k+sh;
10                         if shmq+1sh\neq m-q+1 then
11                               posm1shpos\leftarrow m-1-sh;
12                               if P[1]=T[km+1]P[1]=T[k-m+1] then break;
13                               kk+𝑑𝑖𝑠𝑡[pos]k\leftarrow k+\mathit{dist}[pos];
14                              
15                        if k>nk>n then halt;
16                        
17                  j2j\leftarrow 2; ikm+2i\leftarrow k-m+2; // Comparison-phase
18                   while jmj\leq m and P[j]=T[i]P[j]=T[i] do
19                         ii+1i\leftarrow i+1; jj+1j\leftarrow j+1;
20                        
21                  if j=m+1j=m+1 then
22                         output imi-m;
23                        
24                  
25                  if 𝑑𝑖𝑠𝑡[pos]j1and𝑑𝑖𝑠𝑡[pos]𝐾𝑀𝑃_𝑆ℎ𝑖𝑓𝑡[j]\mathit{dist}[pos]\geq j-1{\rm\;and\;}\mathit{dist}[pos]\geq\mathit{KMP\text{\scalebox{0.6}[1.0]{\textunderscore}}Shift}[j] then
26                         jj𝑑𝑖𝑠𝑡[pos]j\leftarrow j-\mathit{dist}[pos];
27                        
28                  else
29                         jj𝐾𝑀𝑃_𝑆ℎ𝑖𝑓𝑡[j]j\leftarrow j-\mathit{KMP\text{\scalebox{0.6}[1.0]{\textunderscore}}Shift}[j];
30                        
31                  
32            else
33                   while jmj\leq m and P[j]=T[i]P[j]=T[i] do // KMP-phase
34                         ii+1i\leftarrow i+1; jj+1j\leftarrow j+1;
35                        
36                  
37                  if j=m+1j=m+1 then
38                         output imi-m;
39                        
40                  jj𝐾𝑀𝑃_𝑆ℎ𝑖𝑓𝑡[j]j\leftarrow j-\mathit{KMP\text{\scalebox{0.6}[1.0]{\textunderscore}}Shift}[j];
41                  
42            ki+mjk\leftarrow i+m-j;
43            
44      
Algorithm 4 DISTqq algorihm

Next, we explain our searching algorithm shown in Algorithm 4. The searching phase is divided into three: Alignment-phase, Comparison-phase, and KMP-phase. The goal of the Alignment-phase is to shift the pattern as far as possible without comparing each single character of the pattern and the text. The Alignment-phase ends when we align the pattern and a text substring that have (a) aligned qq-grams of the same hash value and (b) the same first character. Suppose PP and T[km+1:k]T[k-m+1:k] are aligned at the beginning of the Alignment-phase. If s=𝐻𝑄_𝑆ℎ𝑖𝑓𝑡[h(T[kq+1:k])]mqs=\mathit{HQ\text{\scalebox{0.6}[1.0]{\textunderscore}}Shift}[h(T[k-q+1:k])]\leq m-q, by shifting the pattern by ss, we find the aligned qq-grams of the same hash value. Namely, h(P[mqs+1:ms])=h(T[kq+1:k])h(P[m-q-s+1:m-s])=h(T[k-q+1:k]). Otherwise, we shift the pattern by mq+1m-q+1 repeatedly until we find aligned qq-grams of the same hash value. When finding a position 𝑝𝑜𝑠\mathit{pos} satisfying (a) by aligning PP and T[km+1:k]T[k^{\prime}-m+1:k^{\prime}] for some kk^{\prime}, i.e., h(P[𝑝𝑜𝑠q+1:𝑝𝑜𝑠])=h(T[km+𝑝𝑜𝑠q+1:km+𝑝𝑜𝑠])h(P[\mathit{pos}-q+1:\mathit{pos}])=h(T[k^{\prime}-m+\mathit{pos}-q+1:k^{\prime}-m+\mathit{pos}]), we simply check the condition (b). If P[1]P[1] and the corresponding text character match, we move to the Comparison-phase. Otherwise, we safely shift the pattern using 𝑑𝑖𝑠𝑡[𝑝𝑜𝑠]\mathit{dist}[\mathit{pos}]. Note that although it is possible to use the function 𝐻𝑄_𝑆ℎ𝑖𝑓𝑡(h(T[kq+1:k]))\mathit{HQ\text{\scalebox{0.6}[1.0]{\textunderscore}}Shift}(h(T[k^{\prime}-q+1:k^{\prime}])) rather than 𝑑𝑖𝑠𝑡[𝑝𝑜𝑠]\mathit{dist}[\mathit{pos}], the computation would be more expensive. Shifting the pattern by 𝑑𝑖𝑠𝑡[𝑝𝑜𝑠]\mathit{dist}[\mathit{pos}], unless 𝑑𝑖𝑠𝑡[𝑝𝑜𝑠]=𝑝𝑜𝑠q+1\mathit{dist}[\mathit{pos}]=\mathit{pos}-q+1, still the pattern and the aligned text substring satisfy (a). However, we do not repeat 𝑑𝑖𝑠𝑡\mathit{dist}-shift any more, since the smaller 𝑝𝑜𝑠\mathit{pos} becomes, the smaller the expected shift amount will be. We simply restart the Alignment-phase. Once the conditions (a) and (b) are satisfied, we move on to the Comparison-phase.

In the Comparison-phase, we check the characters from P[2]P[2] to P[m]P[m]. If a character mismatch occurs during the comparison, either of the shift by 𝐾𝑀𝑃_𝑆ℎ𝑖𝑓𝑡\mathit{KMP\text{\scalebox{0.6}[1.0]{\textunderscore}}Shift} or by 𝑑𝑖𝑠𝑡\mathit{dist} is possible. Therefore, we select the one where the resumption position of the character comparison goes further to the right after shifting the pattern. If the resumption position of the comparison is the same, we select the one with the larger shift amount. Recall that when the KMP algorithm finds that P[1:j1]=T[i:i+j2]P[1:j-1]=T[i:i+j-2] and P[j]T[i+j1]P[j]\neq T[i+j-1], it resumes comparison from checking the match between T[i+j1]T[i+j-1] and P[j𝐾𝑀𝑃_𝑆ℎ𝑖𝑓𝑡[j]]P[j-\mathit{KMP\text{\scalebox{0.6}[1.0]{\textunderscore}}Shift}[j]] if 𝐾𝑀𝑃_𝑆ℎ𝑖𝑓𝑡[j]<j\mathit{KMP\text{\scalebox{0.6}[1.0]{\textunderscore}}Shift}[j]<j, and T[i+j]T[i+j] and P[1]P[1] if 𝐾𝑀𝑃_𝑆ℎ𝑖𝑓𝑡[j]=j\mathit{KMP\text{\scalebox{0.6}[1.0]{\textunderscore}}Shift}[j]=j. On the other hand, if we shift the pattern by 𝑑𝑖𝑠𝑡[pos]\mathit{dist}[pos], we simply resume matching T[i+𝑑𝑖𝑠𝑡[pos]]T[i+\mathit{dist}[pos]] and P[1]P[1]. Therefore, we should use 𝐾𝑀𝑃_𝑆ℎ𝑖𝑓𝑡[j]\mathit{KMP\text{\scalebox{0.6}[1.0]{\textunderscore}}Shift}[j] rather than 𝑑𝑖𝑠𝑡[pos]\mathit{dist}[pos] when either 𝐾𝑀𝑃_𝑆ℎ𝑖𝑓𝑡[j]<j\mathit{KMP\text{\scalebox{0.6}[1.0]{\textunderscore}}Shift}[j]<j and 𝑑𝑖𝑠𝑡[pos]<j1\mathit{dist}[pos]<j-1 or 𝐾𝑀𝑃_𝑆ℎ𝑖𝑓𝑡[j]=j>𝑑𝑖𝑠𝑡[pos]\mathit{KMP\text{\scalebox{0.6}[1.0]{\textunderscore}}Shift}[j]=j>\mathit{dist}[pos]. Summarizing the discussion, we shift the pattern by 𝑑𝑖𝑠𝑡[pos]\mathit{dist}[pos] if 𝑑𝑖𝑠𝑡[pos]j1\mathit{dist}[pos]\geq j-1 and 𝑑𝑖𝑠𝑡[pos]𝐾𝑀𝑃_𝑆ℎ𝑖𝑓𝑡[j]\mathit{dist}[pos]\geq\mathit{KMP\text{\scalebox{0.6}[1.0]{\textunderscore}}Shift}[j] hold. Otherwise, we shift the pattern by 𝐾𝑀𝑃_𝑆ℎ𝑖𝑓𝑡[j]\mathit{KMP\text{\scalebox{0.6}[1.0]{\textunderscore}}Shift}[j]. At this moment, we may have a “partial match” between the pattern and the aligned text substring. If we have performed the KMP-shift with 𝐾𝑀𝑃_𝑆ℎ𝑖𝑓𝑡[j]<j1\mathit{KMP\text{\scalebox{0.6}[1.0]{\textunderscore}}Shift}[j]<j-1, then we have a match between the nonempty prefixes of the pattern and the aligned text substring of length j𝐾𝑀𝑃_𝑆ℎ𝑖𝑓𝑡[j]1j-\mathit{KMP\text{\scalebox{0.6}[1.0]{\textunderscore}}Shift}[j]-1. In this case, we go to the KMP-phase, where we simply perform the KMP algorithm. The KMP-phase prevents the character comparison position from returning to the left and guarantees the linear time behavior of our algorithm. If we have no partial match, we return to the Alignment-phase.

Theorem 1.

The worst-case time complexity of the DISTqq algorithm is O((n+m)q)O((n+m)q).

Proof.

Since the proposed algorithm uses Fact 1 on the KMP algorithm to prevent the character comparison position from going back to the left, the number of character comparisons is at most 2nm2n-m times like the KMP algorithm. In addition, the hash value of qq-gram is calculated to perform the shift using 𝐻𝑄_𝑆ℎ𝑖𝑓𝑡\mathit{HQ\text{\scalebox{0.6}[1.0]{\textunderscore}}Shift}. Since the hash value calculation requires O(q)O(q) time and it is calculated at the maximum of nq+1n-q+1 places in the text, the hash value calculation takes O(nq)O(nq) time in total. Therefore, the worst-case time complexity of the searching phase is O(nq)O(nq). In the preprocessing, O(mq)O(mq) time is required to calculate the hash value of qq-gram at mq+1m-q+1 locations. \Box

Example 1.

Let P=𝚊𝚋𝚊𝚊𝚋𝚋𝚊𝚊𝚊P={\tt abaabbaaa}. The shifting functions 𝑑𝑖𝑠𝑡\mathit{dist}, 𝐾𝑀𝑃_𝑆ℎ𝑖𝑓𝑡\mathit{KMP\text{\scalebox{0.6}[1.0]{\textunderscore}}Shift}, and 𝐻𝑄_𝑆ℎ𝑖𝑓𝑡\mathit{HQ\text{\scalebox{0.6}[1.0]{\textunderscore}}Shift} are shown below. The hash values are calculated by treating each character as its ASCII value, e.g. 𝚊\tt{a} is calculated as 97.

jj 1 2 3 4 5 6 7 8 9 10
PP a b a a b b a a a
𝑑𝑖𝑠𝑡\mathit{dist} - - 1 2 3 4 5 4 7
𝐾𝑀𝑃_𝑆ℎ𝑖𝑓𝑡\mathit{KMP\text{\scalebox{0.6}[1.0]{\textunderscore}}Shift} 1 1 3 2 4 3 7 6 7 8
xx aba baa aab abb bba aaa others
h(x)h(x) 2041 2053 2038 2042 2057 2037
𝐻𝑄_𝑆ℎ𝑖𝑓𝑡[h(x)]\mathit{HQ\text{\scalebox{0.6}[1.0]{\textunderscore}}Shift}[h(x)] 6 1 4 3 2 0 7

Figure 2 illustrates an example run of the DISTqq algorithm (q=3q=3) for finding P=𝚊𝚋𝚊𝚊𝚋𝚋𝚊𝚊𝚊P={\tt abaabbaaa} in T=𝚊𝚋𝚋𝚊𝚊𝚋𝚋𝚊𝚊𝚋𝚊𝚋𝚋𝚊𝚋𝚋𝚊𝚊𝚊𝚋𝚊𝚊𝚋𝚊𝚊𝚋𝚋𝚊𝚊𝚊T={\tt abbaabbaababbabbaaabaabaabbaaa}.

Attempt 1

We shift the pattern by one character for 𝐻𝑄_𝑆ℎ𝑖𝑓𝑡[h(T[7:9])]=𝐻𝑄_𝑆ℎ𝑖𝑓𝑡[h(𝚋𝚊𝚊)]=𝐻𝑄_𝑆ℎ𝑖𝑓𝑡[2053]=1\mathit{HQ\text{\scalebox{0.6}[1.0]{\textunderscore}}Shift}[h(T[7:9])]=\linebreak[4]\mathit{HQ\text{\scalebox{0.6}[1.0]{\textunderscore}}Shift}[h({\tt baa})]=\mathit{HQ\text{\scalebox{0.6}[1.0]{\textunderscore}}Shift}[2053]=1. Since the position of the qq-gram aligned by this shift is 8, pospos is updated to 8.

Attempt 2

We check whether the first character of the pattern matches the corresponding character of the text. Finding P[1]T[2]P[1]\neq T[2], the pattern is shifted by 𝑑𝑖𝑠𝑡[pos]=𝑑𝑖𝑠𝑡[8]=4\mathit{dist}[pos]=\mathit{dist}[8]=4.

Attempt 3

We shift the pattern by 𝐻𝑄_𝑆ℎ𝑖𝑓𝑡[h(T[12:14])]=𝐻𝑄_𝑆ℎ𝑖𝑓𝑡[h(𝚋𝚋𝚊)]=𝟸\mathit{HQ\text{\scalebox{0.6}[1.0]{\textunderscore}}Shift}[h(T[12:14])]=\mathit{HQ\text{\scalebox{0.6}[1.0]{\textunderscore}}Shift}[h(\tt{bba})]=2 and update pospos to 7.

Attempt 4

We check whether the first character of the pattern matches the corresponding character of the text. From P[1]=T[8]P[1]=T[8], we compare the characters of P[2:9]P[2:9] and T[9:16]T[9:16] from left to right. Since P[2]T[9]P[2]\neq T[9], the pattern is shifted by 𝐾𝑀𝑃_𝑆ℎ𝑖𝑓𝑡[2]\mathit{KMP\text{\scalebox{0.6}[1.0]{\textunderscore}}Shift}[2] or 𝑑𝑖𝑠𝑡[pos]=𝑑𝑖𝑠𝑡[7]\mathit{dist}[pos]=\mathit{dist}[7]. From 𝐾𝑀𝑃_𝑆ℎ𝑖𝑓𝑡[2]=1\mathit{KMP\text{\scalebox{0.6}[1.0]{\textunderscore}}Shift}[2]=1, 𝑑𝑖𝑠𝑡[7]=5\mathit{dist}[7]=5, 𝑑𝑖𝑠𝑡[pos]21\mathit{dist}[pos]\geq 2-1 and 𝑑𝑖𝑠𝑡[pos]𝐾𝑀𝑃_𝑆ℎ𝑖𝑓𝑡[2]\mathit{dist}[pos]\geq\mathit{KMP\text{\scalebox{0.6}[1.0]{\textunderscore}}Shift}[2] are satisfied. Therefore, we shift the pattern by 𝑑𝑖𝑠𝑡[pos]=𝑑𝑖𝑠𝑡[7]=5\mathit{dist}[pos]=\mathit{dist}[7]=5.

Attempt 5

We shift the pattern by 𝐻𝑄_𝑆ℎ𝑖𝑓𝑡[h(T[19:21])]=𝐻𝑄_𝑆ℎ𝑖𝑓𝑡[h(𝚊𝚋𝚊)]=𝐻𝑄_𝑆ℎ𝑖𝑓𝑡[2041]=6\mathit{HQ\text{\scalebox{0.6}[1.0]{\textunderscore}}Shift}[h(T[19:21])]=\mathit{HQ\text{\scalebox{0.6}[1.0]{\textunderscore}}Shift}[h({\tt aba})]=\linebreak[4]\mathit{HQ\text{\scalebox{0.6}[1.0]{\textunderscore}}Shift}[2041]=6 and update pospos to 3.

Attempt 6

We check whether the first character of the pattern matches the corresponding character of the text. By P[1]=T[19]P[1]=T[19], the characters of P[2:9]P[2:9] and T[20:27]T[20:27] are compared from left to right. Since P[6]T[24]P[6]\neq T[24], the pattern is shifted by 𝐾𝑀𝑃_𝑆ℎ𝑖𝑓𝑡[6]\mathit{KMP\text{\scalebox{0.6}[1.0]{\textunderscore}}Shift}[6] or 𝑑𝑖𝑠𝑡[pos]=𝑑𝑖𝑠𝑡[3]\mathit{dist}[pos]=\mathit{dist}[3]. By 𝐾𝑀𝑃_𝑆ℎ𝑖𝑓𝑡[6]=3>𝑑𝑖𝑠𝑡[3]=1\mathit{KMP\text{\scalebox{0.6}[1.0]{\textunderscore}}Shift}[6]=3>\mathit{dist}[3]=1, the pattern is shifted by 𝐾𝑀𝑃_𝑆ℎ𝑖𝑓𝑡[6]=3\mathit{KMP\text{\scalebox{0.6}[1.0]{\textunderscore}}Shift}[6]=3.

Attempt 7

Attempt 6 shows that P[1:2]=T[22:23]P[1:2]=T[22:23], that is, there is a partial match, so we continue the comparison of T[24:30]T[24:30] and P[3:9]P[3:9]. Since T[24:30]=P[3:9]T[24:30]=P[3:9], the pattern occurrence position 22 is reported.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
T:T: a b b a a b b a a b a b b a b b a a a b a a b a a b b a a a
Attempt 1 P:P: a b a a b b a a a
×1{\color[rgb]{1,0,0}\times}_{1}
Attempt 2 P:P: a b a a b b a a a
Attempt 3 P:P: a b a a b b a a a
1{\color[rgb]{0,0,1}\circ}_{1} ×2{\color[rgb]{1,0,0}\times}_{2}
Attempt 4 P:P: a b a a b b a a a
Attempt 5 P:P: a b a a b b a a a
1{\color[rgb]{0,0,1}\circ}_{1} 2{\color[rgb]{0,0,1}\circ}_{2} 3{\color[rgb]{0,0,1}\circ}_{3} 4{\color[rgb]{0,0,1}\circ}_{4} 5{\color[rgb]{0,0,1}\circ}_{5} ×6{\color[rgb]{1,0,0}\times}_{6}
Attempt 6 P:P: a b a a b b a a a
{\color[rgb]{0,0,1}\bullet} {\color[rgb]{0,0,1}\bullet} 1{\color[rgb]{0,0,1}\circ}_{1} 2{\color[rgb]{0,0,1}\circ}_{2} 3{\color[rgb]{0,0,1}\circ}_{3} 4{\color[rgb]{0,0,1}\circ}_{4} 5{\color[rgb]{0,0,1}\circ}_{5} 6{\color[rgb]{0,0,1}\circ}_{6} 7{\color[rgb]{0,0,1}\circ}_{7}
Attempt 7 P:P: a b a a b b a a a
Figure 2: An example run of the DISTqq algorithm for a pattern P=𝚊𝚋𝚊𝚊𝚋𝚋𝚊𝚊𝚊P={\tt abaabbaaa} and a text T=𝚊𝚋𝚋𝚊𝚊𝚋𝚋𝚊𝚊𝚋𝚊𝚋𝚋𝚊𝚋𝚋𝚊𝚊𝚊𝚋𝚊𝚊𝚋𝚊𝚊𝚋𝚋𝚊𝚊𝚊T={\tt abbaabbaababbabbaaabaabaabbaaa}. For each alignment of the pattern, {\color[rgb]{0,0,1}\circ} and ×{\color[rgb]{1,0,0}\times} indicate a match and a mismatch between the text and the pattern, respectively. The character with {\color[rgb]{0,0,1}\bullet} is known to match the character at the corresponding position in the text without comparison. Subscript numbers show the order of character comparisons in each attempt.

3.2 LDISTqq algorithm

The LDISTqq algorithm modifies the DISTqq algorithm so that the worst-case time complexity is independent of qq. In the DISTqq algorithm, if strings such as T=𝚊nT={\tt a}^{n} and P=𝚋𝚊m1P={\tt ba}^{m-1} are given, O(nq)O(nq) time is required for searching phase because the hash values of each qq-gram are calculated in the text. Since the hash function hh defined in Section 3.1 is a rolling hash, if the hash value of w[i:i+q1]w[i:i+q-1] has already been obtained for a string ww, the hash value of w[i+1:i+q]w[i+1:i+q] can be computed in constant time by h(w[i+1:i+q])=(4(h(w[i:i+q1])4q1w[i])+w[i+q])mod 216h(w[i+1:i+q])=(4\cdot(h(w[i:i+q-1])-4^{q-1}\cdot w[i])+w[i+q]){\rm\;mod\;}2^{16}. The LDISTqq algorithm modifies Line 4 of Algorithm 4 so that we calculate the hash value of the qq-gram using the previously calculated value of the other qq-gram in the incremental way, if they overlap. Similarly, the time complexity of the preprocessing phase can be reduced.

Theorem 2.

The worst-case time complexity of the LDISTqq algorithm is O(n+m)O(n+m).

Proof.

Like the DISTqq algorithm, we compare characters at most 2nm2n-m times. To calculate the hash value of a qq-gram, if it is overlapped with the qq-gram for which the hash value has been calculated one step before, the incremental update is performed using the rolling hash. Therefore, the calculation of the hash value of qq-grams takes O(n)O(n) time in total. Thus, the worst-case time complexity of matching is O(n)O(n). Calculating the hash values of qq-grams in the preprocess is performed in the same way, so it is done in O(m)O(m) time. \Box

4 Experiments

In this section, we compare the execution times of the proposed algorithms with the existing algorithms listed below, where algorithms that run in linear time in the input string size are marked with \star.

  • BNDMqq [20]: Backward Nondeterministic DAWG Matching algorithm using qq-grams with q=2,4and 6q=2,4{\rm\;and\;}6,

  • SBNDMqq [1]: Simplified version of the Backward Nondeterministic DAWG Matching algorithm using qq-grams with q=2,4,6and 8q=2,4,6{\rm\;and\;}8,

  • KBNDM [7]: Factorized variant of the BNDM algorithm,

  • BSDMqq [10]: Backward SNR DAWG Matching algorithm using condensed alphabets with groups of qq characters with 1q81\leq q\leq 8,

  • \star FJS [14]: Franek-Jennings-Smyth algorithm,

  • \star FJS++ [18]: Modification of the FJS algorithm,

  • HASHqq [19]: Hashing algorithm using qq-grams with 2q82\leq q\leq 8 (see Section 2.4),

  • FS-ww [11]: Multiple Windows version of the Fast Search algorithm [5] implemented using ww sliding windows with w=1,2,4,6and 8w=1,2,4,6{\rm\;and\;}8,

  • IOM [6]: Improved Occurrence Matcher,

  • WOM [6]: Worst Occurrence Matcher,

  • SKIPqq [9]: Skip-Search algorithm using qq-grams with 2q82\leq q\leq 8,

  • WFRqq [8]: Weak-Factors-Recognition algorithm implemented with a qq-chained loop with 2q82\leq q\leq 8,

  • \star LWFRqq [8]: Linear-Weak-Factors-Recognition algorithm implemented with a qq-chained loop with 2q82\leq q\leq 8,

  • \star DISTqq: Our algorithm proposed in Section 3.1 (Algorithm 4) with 1q81\leq q\leq 8,

  • \star LDISTqq: Our algorithm proposed in Section 3.2 with 1q81\leq q\leq 8.

Table 1: Genome sequence (σ=4\sigma=4, n=4641652n=4641652)
mm 2 4 8 16 32 64 128 256 512 1024
BNDMqq 218.34(2)218.34^{(2)} 174.55(2)174.55^{(2)} 84.14(4)84.14^{(4)} 64.89(6)64.89^{(6)} 60.39(6)60.39^{(6)} 61.07(6)61.07^{(6)} 60.85(6)60.85^{(6)} 62.17(6)62.17^{(6)} 61.20(6)61.20^{(6)} 60.16(6)60.16^{(6)}
SBNDMqq 𝟏𝟕𝟗¯.90(𝟐)¯\underline{\bf{179}}\underline{\bf.90^{(2)}} 154.20(2)154.20^{(2)} 80.94(4)80.94^{(4)} 73.29(6)73.29^{(6)} 57.10(8)57.10^{(8)} 61.01(6)61.01^{(6)} 61.74(6)61.74^{(6)} 61.20(6)61.20^{(6)} 61.67(6)61.67^{(6)} 60.80(6)60.80^{(6)}
KBNDM 311.78311.78 201.99201.99 150.15150.15 113.84113.84 83.2383.23 67.8367.83 75.6575.65 75.3975.39 76.5876.58 74.7274.72
BSDMqq 195.38(2)195.38^{(2)} 𝟏𝟏𝟖¯.86(𝟑)¯\underline{\bf{118}}\underline{\bf.86^{(3)}} 84.23(5)84.23^{(5)} 63.03(6)63.03^{(6)} 61.26(6)61.26^{(6)} 58.75(6)58.75^{(6)} 57.50(7)57.50^{(7)} 56.99(6)56.99^{(6)} 57.01(6)57.01^{(6)} 56.58(6)56.58^{(6)}
\star FJS 407.02407.02 353.60353.60 311.96311.96 279.13279.13 308.42308.42 297.00297.00 266.12266.12 317.79317.79 317.34317.34 296.19296.19
\star FJS++ 388.44388.44 296.70296.70 203.03203.03 171.17171.17 149.52149.52 136.59136.59 128.55128.55 130.39130.39 122.51122.51 112.99112.99
HASHqq 571.44(2)571.44^{(2)} 272.86(3)272.86^{(3)} 126.00(3)126.00^{(3)} 88.12(3)88.12^{(3)} 68.22(3)68.22^{(3)} 58.84(6)58.84^{(6)} 55.09(6)55.09^{(6)} 59.48(7)59.48^{(7)} 57.34(7)57.34^{(7)} 57.96(7)57.96^{(7)}
FS-ww 332.32(4)332.32^{(4)} 245.99(4)245.99^{(4)} 184.72(4)184.72^{(4)} 158.72(4)158.72^{(4)} 143.79(6)143.79^{(6)} 125.05(4)125.05^{(4)} 123.52(6)123.52^{(6)} 117.90(6)117.90^{(6)} 108.03(6)108.03^{(6)} 100.72(6)100.72^{(6)}
IOM 377.25377.25 275.36275.36 215.72215.72 220.97220.97 219.86219.86 218.12218.12 210.61210.61 221.31221.31 230.15230.15 211.69211.69
WOM 381.54381.54 301.46301.46 220.34220.34 182.30182.30 166.27166.27 143.24143.24 136.20136.20 133.75133.75 127.40127.40 114.74114.74
SKIPqq 250.89(2)250.89^{(2)} 136.18(3)136.18^{(3)} 91.51(4)91.51^{(4)} 63.96(6)63.96^{(6)} 56.79(7)56.79^{(7)} 53.09(7)53.09^{(7)} 52.10(7)52.10^{(7)} 57.12(7)57.12^{(7)} 56.97(8)56.97^{(8)} 58.00(6)58.00^{(6)}
WFRqq 219.50(2)219.50^{(2)} 168.39(2)168.39^{(2)} 88.86(4)88.86^{(4)} 65.82(4)65.82^{(4)} 57.19(8)57.19^{(8)} 55.12(2)55.12^{(2)} 51.77(3)51.77^{(3)} 50.04(3)50.04^{(3)} 55.02(6)55.02^{(6)} 𝟓𝟒¯.64(𝟖)¯\underline{\bf{54}}\underline{\bf.64^{(8)}}
\star LWFRqq 216.10(2)216.10^{(2)} 173.48(3)173.48^{(3)} 88.71(4)88.71^{(4)} 60.75(5)60.75^{(5)} 𝟓𝟑¯.84(𝟓)¯\underline{\bf{53}}\underline{\bf.84^{(5)}} 𝟓𝟎¯.48(𝟔)¯\underline{\bf{50}}\underline{\bf.48^{(6)}} 𝟒𝟗¯.65(𝟖)¯\underline{\bf{49}}\underline{\bf.65^{(8)}} 48.71(6)48.71^{(6)} 54.90(6)54.90^{(6)} 54.97(7)54.97^{(7)}
\star DISTqq 186.10(2)186.10^{(2)} 125.44(3)125.44^{(3)} 𝟕𝟖¯.56(𝟒)¯\underline{\bf{78}}\underline{\bf.56^{(4)}} 𝟔𝟎¯.48(𝟓)¯\underline{\bf{60}}\underline{\bf.48^{(5)}} 55.21(6)55.21^{(6)} 52.05(7)52.05^{(7)} 51.26(8)51.26^{(8)} 50.44(8)50.44^{(8)} 54.81(7)54.81^{(7)} 55.58(8)55.58^{(8)}
\star LDISTqq 295.55(2)295.55^{(2)} 181.99(3)181.99^{(3)} 86.58(4)86.58^{(4)} 65.29(6)65.29^{(6)} 56.74(6)56.74^{(6)} 52.31(6)52.31^{(6)} 50.39(6)50.39^{(6)} 𝟒𝟖¯.70(𝟕)¯\underline{\bf{48}}\underline{\bf.70^{(7)}} 𝟓𝟒¯.33(𝟒)¯\underline{\bf{54}}\underline{\bf.33^{(4)}} 55.22(7)55.22^{(7)}
Table 2: English text (σ=62\sigma=62, n=4017009n=4017009)
mm 2 4 8 16 32 64 128 256 512 1024
BNDMqq 140.48(2)140.48^{(2)} 93.39(2)93.39^{(2)} 70.84(4)70.84^{(4)} 54.10(4)54.10^{(4)} 𝟒𝟓¯.88(𝟒)¯\underline{\bf{45}}\underline{\bf.88^{(4)}} 47.61(4)47.61^{(4)} 47.64(4)47.64^{(4)} 46.88(4)46.88^{(4)} 47.40(6)47.40^{(6)} 45.58(4)45.58^{(4)}
SBNDMqq 𝟏𝟎𝟖¯.32(𝟐)¯\underline{\bf{108}}\underline{\bf.32^{(2)}} 𝟕𝟑¯.50(𝟐)¯\underline{\bf{73}}\underline{\bf.50^{(2)}} 𝟔𝟒¯.87(𝟐)¯\underline{\bf{64}}\underline{\bf.87^{(2)}} 𝟓𝟎¯.73(𝟒)¯\underline{\bf{50}}\underline{\bf.73^{(4)}} 47.85(4)47.85^{(4)} 48.45(4)48.45^{(4)} 48.55(4)48.55^{(4)} 47.56(4)47.56^{(4)} 47.08(4)47.08^{(4)} 45.98(4)45.98^{(4)}
KBNDM 192.76192.76 126.56126.56 92.0392.03 71.0471.04 64.1764.17 56.7956.79 49.5649.56 49.0949.09 50.2450.24 47.6847.68
BSDMqq 117.22(2)117.22^{(2)} 79.96(2)79.96^{(2)} 70.36(3)70.36^{(3)} 57.72(4)57.72^{(4)} 49.91(8)49.91^{(8)} 48.00(8)48.00^{(8)} 44.99(8)44.99^{(8)} 43.62(8)43.62^{(8)} 43.06(8)43.06^{(8)} 42.70(8)42.70^{(8)}
\star FJS 192.52192.52 158.61158.61 106.40106.40 89.0489.04 78.4378.43 68.1468.14 64.8064.80 61.1561.15 60.4860.48 55.3855.38
\star FJS++ 196.88196.88 155.86155.86 101.77101.77 77.7477.74 67.4067.40 60.3160.31 54.7754.77 52.4452.44 51.6251.62 47.7447.74
HASHqq 312.91(2)312.91^{(2)} 218.95(2)218.95^{(2)} 107.38(2)107.38^{(2)} 70.85(2)70.85^{(2)} 56.90(3)56.90^{(3)} 49.33(6)49.33^{(6)} 46.63(5)46.63^{(5)} 45.27(8)45.27^{(8)} 44.77(3)44.77^{(3)} 42.98(7)42.98^{(7)}
FS-ww 129.50(6)129.50^{(6)} 104.45(6)104.45^{(6)} 71.57(6)71.57^{(6)} 61.08(4)61.08^{(4)} 54.19(6)54.19^{(6)} 49.53(4)49.53^{(4)} 48.95(6)48.95^{(6)} 47.02(4)47.02^{(4)} 46.96(6)46.96^{(6)} 43.00(4)43.00^{(4)}
IOM 193.37193.37 148.51148.51 104.36104.36 86.2286.22 75.0975.09 69.0869.08 63.6363.63 60.6260.62 58.8358.83 51.7651.76
WOM 199.23199.23 153.81153.81 112.45112.45 86.6186.61 75.2675.26 62.8062.80 60.5460.54 62.7462.74 58.9158.91 55.6555.65
SKIPqq 161.38(2)161.38^{(2)} 100.52(2)100.52^{(2)} 72.33(3)72.33^{(3)} 55.71(4)55.71^{(4)} 49.06(4)49.06^{(4)} 48.73(4)48.73^{(4)} 49.87(4)49.87^{(4)} 48.81(8)48.81^{(8)} 45.75(2)45.75^{(2)} 45.32(2)45.32^{(2)}
WFRqq 137.40(2)137.40^{(2)} 85.22(2)85.22^{(2)} 68.88(2)68.88^{(2)} 52.85(5)52.85^{(5)} 46.67(5)46.67^{(5)} 𝟒𝟒¯.39(𝟖)¯\underline{\bf{44}}\underline{\bf.39^{(8)}} 𝟒𝟐¯.09(𝟖)¯\underline{\bf{42}}\underline{\bf.09^{(8)}} 41.66(5)41.66^{(5)} 41.65(6)41.65^{(6)} 40.53(2)40.53^{(2)}
\star LWFRqq 121.08(2)121.08^{(2)} 85.47(2)85.47^{(2)} 70.38(2)70.38^{(2)} 53.83(3)53.83^{(3)} 47.89(5)47.89^{(5)} 44.49(5)44.49^{(5)} 42.38(6)42.38^{(6)} 𝟒𝟏¯.09(𝟖)¯\underline{\bf{41}}\underline{\bf.09^{(8)}} 𝟒𝟏¯.23(𝟖)¯\underline{\bf{41}}\underline{\bf.23^{(8)}} 𝟒𝟎¯.30(𝟖)¯\underline{\bf{40}}\underline{\bf.30^{(8)}}
\star DISTqq 115.61(2)115.61^{(2)} 80.14(2)80.14^{(2)} 65.84(3)65.84^{(3)} 52.25(4)52.25^{(4)} 48.13(4)48.13^{(4)} 46.26(4)46.26^{(4)} 43.32(4)43.32^{(4)} 42.84(8)42.84^{(8)} 42.98(4)42.98^{(4)} 41.47(7)41.47^{(7)}
\star LDISTqq 229.62(2)229.62^{(2)} 102.15(2)102.15^{(2)} 69.70(3)69.70^{(3)} 55.34(4)55.34^{(4)} 49.46(5)49.46^{(5)} 45.93(5)45.93^{(5)} 44.37(3)44.37^{(3)} 43.15(4)43.15^{(4)} 42.21(5)42.21^{(5)} 41.11(7)41.11^{(7)}
Table 3: Fibonacci string (σ=2\sigma=2, n=2178309n=2178309)
mm 2 4 8 16 32 64 128 256 512 1024
BNDMqq 343.25(2)343.25^{(2)} 308.44(2)308.44^{(2)} 283.26(4)283.26^{(4)} 257.64(6)257.64^{(6)} 233.94(6)233.94^{(6)} 285.63(4)285.63^{(4)} 284.37(4)284.37^{(4)} 293.00(4)293.00^{(4)} 307.82(4)307.82^{(4)} 315.47(4)315.47^{(4)}
SBNDMqq 𝟐𝟖𝟔¯.02(𝟐)¯\underline{\bf{286}}\underline{\bf.02^{(2)}} 𝟐𝟗𝟐¯.15(𝟐)¯\!\underline{\bf{292}}\underline{\bf.15^{(2)}} 272.98(4)272.98^{(4)} 276.35(6)276.35^{(6)} 306.42(6)306.42^{(6)} 372.03(6)372.03^{(6)} 432.53(6)432.53^{(6)} 493.20(6)493.20^{(6)} 546.94(6)546.94^{(6)} 602.09(6)602.09^{(6)}
KBNDM 541.70541.70 405.78405.78 411.08411.08 422.85422.85 382.25382.25 402.45402.45 425.60425.60 437.67437.67 461.07461.07 451.12451.12
BSDMqq 482.47(2)482.47^{(2)} 500.43(3)500.43^{(3)} 397.29(5)397.29^{(5)} 362.52(8)362.52^{(8)} 330.76(6)330.76^{(6)} 736.89(1)736.89^{(1)} 766.57(1)766.57^{(1)} 782.98(1)782.98^{(1)} 790.26(1)790.26^{(1)} 508.80(3)508.80^{(3)}
\star FJS 402.44402.44 362.23362.23 276.97276.97 237.87237.87 218.38218.38 𝟐𝟎𝟔¯.07¯\underline{\bf{206}}\underline{\bf.07} 203.86203.86 202.94202.94 196.49196.49 194.01194.01
\star FJS++ 456.20456.20 396.04396.04 335.70335.70 319.93319.93 295.64295.64 300.36300.36 296.48296.48 295.37295.37 288.56288.56 289.00289.00
HASHqq 645.48(2)645.48^{(2)} 406.70(2)406.70^{(2)} 𝟐𝟓𝟕¯.69(𝟒)¯\!\underline{\bf{257}}\underline{\bf.69^{(4)}} 251.91(7)251.91^{(7)} 279.81(7)279.81^{(7)} 344.99(7)344.99^{(7)} 415.16(7)415.16^{(7)} 470.71(7)470.71^{(7)} 514.05(7)514.05^{(7)} 579.57(7)579.57^{(7)}
FS-ww 383.23(1)383.23^{(1)} 396.23(1)396.23^{(1)} 347.72(1)347.72^{(1)} 289.15(1)289.15^{(1)} 253.36(1)253.36^{(1)} 246.82(1)246.82^{(1)} 248.73(1)248.73^{(1)} 235.66(1)235.66^{(1)} 235.35(1)235.35^{(1)} 230.61(1)230.61^{(1)}
IOM 381.92381.92 414.42414.42 453.54453.54 497.84497.84 543.93543.93 641.13641.13 751.42751.42 839.92839.92 899.19899.19 1019.591019.59
WOM 552.38552.38 555.43555.43 564.67564.67 617.93617.93 664.47664.47 732.05732.05 852.06852.06 926.35926.35 1036.311036.31 1126.171126.17
SKIPqq 470.93(2)470.93^{(2)} 394.09(2)394.09^{(2)} 332.66(5)332.66^{(5)} 336.16(8)336.16^{(8)} 374.91(8)374.91^{(8)} 464.23(3)464.23^{(3)} 460.54(3)460.54^{(3)} 450.18(3)450.18^{(3)} 451.05(3)451.05^{(3)} 464.13(3)464.13^{(3)}
WFRqq 442.32(2)442.32^{(2)} 497.95(3)497.95^{(3)} 528.45(3)528.45^{(3)} 652.48(6)652.48^{(6)} 2132.38(7)\scalebox{1.0}[1.0]{2132}.38^{(7)} 3762.19(8)\scalebox{1.0}[1.0]{3762}.19^{(8)} 6762.67(8)\scalebox{1.0}[1.0]{6762}.67^{(8)} 12624.63(8)\scalebox{1.0}[1.0]{12624}.63^{(8)} 24416.65(8)\scalebox{1.0}[1.0]{24416}.65^{(8)} 48596.02(8)\scalebox{1.0}[1.0]{48596}.02^{(8)}
\star LWFRqq 552.64(2)552.64^{(2)} 504.77(3)504.77^{(3)} 428.34(5)428.34^{(5)} 342.80(7)342.80^{(7)} 297.07(7)297.07^{(7)} 304.57(2)304.57^{(2)} 274.47(6)274.47^{(6)} 265.28(6)265.28^{(6)} 258.06(6)258.06^{(6)} 254.92(6)254.92^{(6)}
\star DISTqq 438.96(1)438.96^{(1)} 359.56(2)359.56^{(2)} 293.80(2)293.80^{(2)} 𝟐𝟑𝟓¯.61(𝟐)¯\!\underline{\bf{235}}\underline{\bf.61^{(2)}} 𝟐𝟏𝟑¯.07(𝟕)¯\underline{\bf{213}}\underline{\bf.07^{(7)}} 206.92(2)206.92^{(2)} 201.18(8)201.18^{(8)} 196.56(5)196.56^{(5)} 194.86(4)194.86^{(4)} 193.62(5)193.62^{(5)}
\star LDISTqq 565.13(2)565.13^{(2)} 412.15(3)412.15^{(3)} 300.98(4)300.98^{(4)} 245.38(7)245.38^{(7)} 215.89(8)215.89^{(8)} 208.40(7)208.40^{(7)} 𝟐𝟎𝟎¯.45(𝟒)¯\underline{\bf{200}}\underline{\bf.45^{(4)}} 𝟏𝟗𝟑¯.74(𝟒)¯\underline{\bf{193}}\underline{\bf.74^{(4)}} 𝟏𝟗𝟎¯.85(𝟑)¯\underline{\bf{190}}\underline{\bf.85^{(3)}} 𝟏𝟗𝟑¯.48(𝟖)¯\underline{\bf{193}}\underline{\bf.48^{(8)}}
Table 4: Texts with frequent pattern occurrences (σ=4\sigma=4, n=4000000n=4000000, m=8m=8)
occocc 0 128 256 512 1024 2048 4096 8192 16384 32768 65536 131072
BNDMqq 73.53(4)73.53^{(4)} 72.28(4)72.28^{(4)} 72.87(4)72.87^{(4)} 73.89(4)73.89^{(4)} 74.39(4)74.39^{(4)} 75.75(4)75.75^{(4)} 77.60(4)77.60^{(4)} 81.63(4)81.63^{(4)} 82.67(4)82.67^{(4)} 88.98(4)88.98^{(4)} 108.56(4)108.56^{(4)} 146.13(6)146.13^{(6)}
SBNDMqq 𝟔𝟖¯.58(𝟒)¯\underline{\bf{68}}\underline{\bf.58^{(4)}} 𝟔𝟖¯.92(𝟒)¯\underline{\bf{68}}\underline{\bf.92^{(4)}} 71.17(4)71.17^{(4)} 77.26(4)77.26^{(4)} 81.04(4)81.04^{(4)} 80.76(4)80.76^{(4)} 78.33(4)78.33^{(4)} 82.82(4)82.82^{(4)} 91.01(4)91.01^{(4)} 96.83(4)96.83^{(4)} 111.75(4)111.75^{(4)} 140.21(4)140.21^{(4)}
KBNDM 127.73127.73 128.41128.41 128.41128.41 126.92126.92 128.74128.74 131.17131.17 129.74129.74 135.70135.70 140.99140.99 152.46152.46 164.59164.59 186.32186.32
BSDMqq 69.85(4)69.85^{(4)} 69.04(4)69.04^{(4)} 𝟔𝟖¯.85(𝟒)¯\underline{\bf{68}}\underline{\bf.85^{(4)}} 𝟔𝟗¯.19(𝟒)¯\underline{\bf{69}}\underline{\bf.19^{(4)}} 𝟕𝟎¯.63(𝟒)¯\underline{\bf{70}}\underline{\bf.63^{(4)}} 𝟕𝟐¯.00(𝟒)¯\underline{\bf{72}}\underline{\bf.00^{(4)}} 𝟕𝟐¯.48(𝟒)¯\underline{\bf{72}}\underline{\bf.48^{(4)}} 𝟕𝟔¯.25(𝟒)¯\underline{\bf{76}}\underline{\bf.25^{(4)}} 𝟕𝟗¯.95(𝟒)¯\underline{\bf{79}}\underline{\bf.95^{(4)}} 89.91(4)89.91^{(4)} 110.22(4)110.22^{(4)} 143.91(4)143.91^{(4)}
\star FJS 246.26246.26 253.35253.35 263.94263.94 247.73247.73 255.94255.94 276.44276.44 262.83262.83 256.36256.36 251.33251.33 269.39269.39 259.93259.93 251.06251.06
\star FJS++ 174.38174.38 173.86173.86 180.46180.46 173.05173.05 178.65178.65 182.98182.98 177.56177.56 181.17181.17 182.48182.48 190.71190.71 197.69197.69 199.60199.60
HASHqq 121.09(3)121.09^{(3)} 121.36(3)121.36^{(3)} 122.64(3)122.64^{(3)} 122.50(3)122.50^{(3)} 119.49(3)119.49^{(3)} 123.98(3)123.98^{(3)} 124.31(3)124.31^{(3)} 124.93(3)124.93^{(3)} 126.81(3)126.81^{(3)} 128.94(3)128.94^{(3)} 143.75(3)143.75^{(3)} 153.08(4)153.08^{(4)}
FS-ww 154.89(4)154.89^{(4)} 155.56(4)155.56^{(4)} 165.18(4)165.18^{(4)} 156.67(4)156.67^{(4)} 159.03(4)159.03^{(4)} 166.44(4)166.44^{(4)} 165.62(4)165.62^{(4)} 160.46(4)160.46^{(4)} 167.46(4)167.46^{(4)} 178.53(2)178.53^{(2)} 185.86(2)185.86^{(2)} 191.38(2)191.38^{(2)}
IOM 185.31185.31 181.07181.07 195.53195.53 187.98187.98 194.18194.18 200.99200.99 195.98195.98 195.89195.89 197.67197.67 202.61202.61 214.90214.90 209.52209.52
WOM 192.59192.59 192.28192.28 207.57207.57 189.21189.21 196.02196.02 203.86203.86 196.98196.98 199.79199.79 203.59203.59 215.79215.79 219.94219.94 230.59230.59
SKIPqq 79.32(4)79.32^{(4)} 77.14(4)77.14^{(4)} 79.19(4)79.19^{(4)} 82.08(4)82.08^{(4)} 81.82(4)81.82^{(4)} 82.55(4)82.55^{(4)} 83.83(4)83.83^{(4)} 87.08(4)87.08^{(4)} 89.82(4)89.82^{(4)} 93.09(4)93.09^{(4)} 106.41(4)106.41^{(4)} 126.22(3)126.22^{(3)}
WFRqq 76.68(4)76.68^{(4)} 76.38(4)76.38^{(4)} 77.63(4)77.63^{(4)} 83.21(4)83.21^{(4)} 80.93(4)80.93^{(4)} 81.42(4)81.42^{(4)} 83.86(4)83.86^{(4)} 88.55(4)88.55^{(4)} 93.16(4)93.16^{(4)} 106.07(4)106.07^{(4)} 123.90(4)123.90^{(4)} 164.77(4)164.77^{(4)}
\star LWFRqq 76.99(4)76.99^{(4)} 76.33(4)76.33^{(4)} 78.83(4)78.83^{(4)} 77.57(4)77.57^{(4)} 78.74(4)78.74^{(4)} 76.46(4)76.46^{(4)} 84.65(4)84.65^{(4)} 89.87(4)89.87^{(4)} 96.17(4)96.17^{(4)} 108.67(4)108.67^{(4)} 129.35(3)129.35^{(3)} 168.70(3)168.70^{(3)}
\star DISTqq 69.67(4)69.67^{(4)} 73.38(4)73.38^{(4)} 74.35(4)74.35^{(4)} 74.31(4)74.31^{(4)} 75.12(4)75.12^{(4)} 74.96(4)74.96^{(4)} 77.21(4)77.21^{(4)} 77.56(4)77.56^{(4)} 80.09(4)80.09^{(4)} 𝟖𝟔¯.25(𝟒)¯\underline{\bf{86}}\underline{\bf.25^{(4)}} 𝟏𝟎𝟏¯.12(𝟒)¯\underline{\bf{101}}\underline{\bf.12^{(4)}} 𝟏𝟐𝟎¯.98(𝟑)¯\underline{\bf{120}}\underline{\bf.98^{(3)}}
\star LDISTqq 75.82(4)75.82^{(4)} 74.45(4)74.45^{(4)} 74.89(4)74.89^{(4)} 76.77(4)76.77^{(4)} 74.62(4)74.62^{(4)} 75.47(4)75.47^{(4)} 77.10(4)77.10^{(4)} 80.18(4)80.18^{(4)} 83.40(4)83.40^{(4)} 88.86(4)88.86^{(4)} 103.73(4)103.73^{(4)} 122.27(4)122.27^{(4)}
Table 5: Texts with frequent pattern occurrences (σ=95\sigma=95, n=4000000n=4000000, m=8m=8)
occocc 0 128 256 512 1024 2048 4096 8192 16384 32768 65536 131072
BNDMqq 56.58(2)56.58^{(2)} 55.41(2)55.41^{(2)} 56.26(2)56.26^{(2)} 56.89(2)56.89^{(2)} 56.68(2)56.68^{(2)} 57.42(2)57.42^{(2)} 58.80(2)58.80^{(2)} 60.72(2)60.72^{(2)} 67.37(2)67.37^{(2)} 74.57(2)74.57^{(2)} 98.52(2)98.52^{(2)} 125.65(2)125.65^{(2)}
SBNDMqq 𝟓𝟐¯.76(𝟐)¯\underline{\bf{52}}\underline{\bf.76^{(2)}} 𝟓𝟐¯.37(𝟐)¯\underline{\bf{52}}\underline{\bf.37^{(2)}} 𝟓𝟑¯.09(𝟐)¯\underline{\bf{53}}\underline{\bf.09^{(2)}} 𝟓𝟑¯.23(𝟐)¯\underline{\bf{53}}\underline{\bf.23^{(2)}} 55.56(2)55.56^{(2)} 𝟓𝟐¯.82(𝟐)¯\underline{\bf{52}}\underline{\bf.82^{(2)}} 56.41(2)56.41^{(2)} 𝟓𝟔¯.50(𝟐)¯\underline{\bf{56}}\underline{\bf.50^{(2)}} 61.58(2)61.58^{(2)} 69.96(2)69.96^{(2)} 91.72(2)91.72^{(2)} 112.41(2)112.41^{(2)}
KBNDM 81.9881.98 80.2280.22 77.6177.61 79.4279.42 79.5579.55 81.8881.88 82.6182.61 82.2582.25 89.1989.19 98.6598.65 111.59111.59 142.65142.65
BSDMqq 54.89(2)54.89^{(2)} 56.07(2)56.07^{(2)} 54.55(2)54.55^{(2)} 55.99(2)55.99^{(2)} 55.78(2)55.78^{(2)} 56.78(2)56.78^{(2)} 58.38(2)58.38^{(2)} 61.48(2)61.48^{(2)} 66.76(2)66.76^{(2)} 76.66(3)76.66^{(3)} 96.79(3)96.79^{(3)} 131.76(3)131.76^{(3)}
\star FJS 81.9381.93 81.4181.41 81.1481.14 81.8381.83 82.0482.04 82.1682.16 82.1082.10 84.3084.30 86.0286.02 90.7290.72 100.41100.41 111.55111.55
\star FJS++ 80.2680.26 84.2784.27 81.0581.05 80.5880.58 80.2180.21 80.8480.84 83.5283.52 85.6685.66 87.4087.40 94.0394.03 102.16102.16 112.23112.23
HASHqq 101.34(2)101.34^{(2)} 103.34(2)103.34^{(2)} 102.15(2)102.15^{(2)} 105.02(2)105.02^{(2)} 101.55(2)101.55^{(2)} 104.07(2)104.07^{(2)} 105.84(2)105.84^{(2)} 107.46(2)107.46^{(2)} 107.79(2)107.79^{(2)} 112.15(2)112.15^{(2)} 118.52(2)118.52^{(2)} 126.53(2)126.53^{(2)}
FS-ww 52.77(8)52.77^{(8)} 52.42(6)52.42^{(6)} 53.51(6)53.51^{(6)} 53.74(6)53.74^{(6)} 𝟓𝟑¯.07(𝟔)¯\underline{\bf{53}}\underline{\bf.07^{(6)}} 53.95(6)53.95^{(6)} 𝟓𝟓¯.33(𝟔)¯\underline{\bf{55}}\underline{\bf.33^{(6)}} 57.61(6)57.61^{(6)} 𝟔𝟏¯.36(𝟔)¯\underline{\bf{61}}\underline{\bf.36^{(6)}} 68.86(6)68.86^{(6)} 81.66(4)81.66^{(4)} 102.11(2)102.11^{(2)}
IOM 82.0782.07 83.8683.86 84.8984.89 85.5885.58 82.6482.64 82.4582.45 83.9083.90 86.9886.98 87.2887.28 91.6391.63 99.3099.30 106.37106.37
WOM 84.3784.37 84.4284.42 86.3986.39 85.0585.05 85.3685.36 85.3585.35 86.2086.20 86.7486.74 90.2290.22 93.3193.31 105.36105.36 114.68114.68
SKIPqq 60.93(2)60.93^{(2)} 59.34(2)59.34^{(2)} 63.66(3)63.66^{(3)} 62.97(3)62.97^{(3)} 62.11(2)62.11^{(2)} 62.42(2)62.42^{(2)} 64.08(2)64.08^{(2)} 65.12(2)65.12^{(2)} 69.11(2)69.11^{(2)} 76.18(3)76.18^{(3)} 83.05(3)83.05^{(3)} 101.43(3)101.43^{(3)}
WFRqq 59.39(2)59.39^{(2)} 59.57(2)59.57^{(2)} 59.24(2)59.24^{(2)} 60.12(2)60.12^{(2)} 61.16(2)61.16^{(2)} 55.51(2)55.51^{(2)} 58.24(2)58.24^{(2)} 62.38(2)62.38^{(2)} 68.35(2)68.35^{(2)} 81.99(2)81.99^{(2)} 104.92(2)104.92^{(2)} 139.15(2)139.15^{(2)}
\star LWFRqq 55.99(2)55.99^{(2)} 58.63(2)58.63^{(2)} 54.16(2)54.16^{(2)} 53.75(2)53.75^{(2)} 56.99(2)56.99^{(2)} 60.87(2)60.87^{(2)} 58.45(2)58.45^{(2)} 63.39(2)63.39^{(2)} 72.48(2)72.48^{(2)} 85.61(2)85.61^{(2)} 112.47(3)112.47^{(3)} 152.54(3)152.54^{(3)}
\star DISTqq 58.60(2)58.60^{(2)} 59.30(2)59.30^{(2)} 58.94(2)58.94^{(2)} 58.46(2)58.46^{(2)} 58.47(3)58.47^{(3)} 59.91(2)59.91^{(2)} 61.55(2)61.55^{(2)} 62.29(2)62.29^{(2)} 65.21(2)65.21^{(2)} 𝟔𝟓¯.99(𝟐)¯\underline{\bf{65}}\underline{\bf.99^{(2)}} 𝟕𝟕¯.36(𝟐)¯\underline{\bf{77}}\underline{\bf.36^{(2)}} 𝟗𝟓¯.05(𝟐)¯\underline{\bf{95}}\underline{\bf.05^{(2)}}
\star LDISTqq 62.56(3)62.56^{(3)} 61.52(2)61.52^{(2)} 66.62(3)66.62^{(3)} 66.90(2)66.90^{(2)} 67.28(3)67.28^{(3)} 63.66(2)63.66^{(2)} 65.22(2)65.22^{(2)} 67.15(2)67.15^{(2)} 67.59(2)67.59^{(2)} 73.93(2)73.93^{(2)} 85.10(2)85.10^{(2)} 100.33(2)100.33^{(2)}

All algorithms are implemented in C language, compiled by GCC 9.2.0 with the optimization option 𝙾𝟹\mathtt{-O3}. We used the implementations in SMART [13] for all algorithms except for the FJS, FJS++ and our algorithms. The implementations of our algorithms are available at https://github.com/ushitora/distq. We experimented with the following strings:

  1. 1.

    Genome sequence (Table 4): the genome sequence of E. coli of length n=4641652n=4641652 with σ=4\sigma=4, from NCBI111https://www.ncbi.nlm.nih.gov/genome/167?genome _ assembly _ id=161521. The patterns are randomly extracted from TT of length m=2,4,8,16,32,64,128,256,512 and 1024m=2,4,8,16,32,64,128,256,512\text{\;and\;}1024. We measured the total running time of 25 executions.

  2. 2.

    English text (Table 4): the King James version of the Bible of length n=4017009n=4017009 with σ=62\sigma=62, from the Large Canterbury Corpus222http://corpus.canterbury.ac.nz/ [2]. We removed the line breaks from the text. The patterns are randomly extracted from TT of length m=2,4,8,16,32,64,128,256,512m=2,4,8,16,32,64,128,256,512 and 10241024. We measured the total running time of 25 executions.

  3. 3.

    Fibonacci string (Table 4): generated by the following recurrence

    𝐹𝑖𝑏1=b,𝐹𝑖𝑏2=a and 𝐹𝑖𝑏k=𝐹𝑖𝑏k1𝐹𝑖𝑏k2 for k>2.\mathit{Fib}_{1}=\texttt{b},\;\mathit{Fib}_{2}=\texttt{a}\;\text{ and }\;\mathit{Fib}_{k}=\mathit{Fib}_{k-1}\cdot\mathit{Fib}_{k-2}\text{ for }k>2.

    The text is fixed to T=𝐹𝑖𝑏32T=\mathit{Fib}_{32} of length n=2178309n=2178309. The patterns are randomly extracted from TT of length m=2,4,8,16,32,64,128,256,512 and 1024m=2,4,8,16,32,64,128,256,512\text{\;and\;}1024. We measured the total running time of 100 executions.

  4. 4.

    Texts with frequent pattern occurrences (Tables 4, 4): generated by intentionally embedding a lot of patterns. We embedded 𝑜𝑐𝑐=0,128,256,512,1024,2048,4096,8192,16384,32768,65536\mathit{occ}=0,128,256,512,1024,2048,4096,8192,16384,\linebreak[1]32768,65536, and 131072131072 occurrences of a pattern of length m=8m=8 into a text of length n=4000000n=4000000 over an alphabet of size σ=4 and 95\sigma=4\mbox{ and }95. More specifically, we first randomly generate a pattern and a provisional text, which may contain the pattern. Then we randomly change characters of the text until the pattern does not occur in the text. Finally we embed the pattern 𝑜𝑐𝑐\mathit{occ} times at random positions without overlapping. We measured the total running time of 25 executions.

The best performance among three trials is recorded for each experiment. For the algorithms using parameter qq or ww, we report only the best results. The value of qq or ww giving the best performance is shown in round brackets.

Experimental results show that when the pattern is short, the SBNDMqq and BSDMqq algorithms have good performance in general. For the genome sequence text, WFRqq, LWFRqq and our algorithms are the fastest algorithms except when the pattern is very short. On the English text, SBNDMqq and LWFRqq run fastest for short and long patterns, respectively. On the other hand, DISTqq runs almost as fast as the best algorithm on both short and long patterns. In fact, it runs faster than SBDDMqq and LWFRqq for long and short patterns, respectively. In the experiments on the Fibonacci string, the FJS algorithm and our algorithms have shown good results as the pattern length increases. Differently from the previous two sorts of texts, our algorithms clearly outperformed the LWFRqq algorithm. Since the Fibonacci strings have many repeating structures and patterns are randomly extracted from the text, the number of occurrences of the pattern is very large in this experiment. Therefore, we hypothesize that the efficiency of DISTqq algorithms does not decrease when the number of pattern occurrences is large. We fixed the pattern length and alphabet size and prepared data with the number of pattern occurrences intentionally changed. From the experimental result, it is found that our algorithms become more advantageous as the number of pattern occurrences increases. The results show that the LDISTqq algorithm is generally slower than the DISTqq algorithm. This should be due to the overhead of the process of determining whether to update the hash value difference by the rolling hash in the LDISTqq algorithm.

5 Conclusion

We proposed two new algorithms for the exact string matching problem: the DISTqq algorithm and the LDISTqq algorithm. We confirmed that our algorithms are as efficient as the state-of-the-art algorithms in many cases. Particularly when a pattern frequently appears in a text, our algorithms outperformed existing algorithms. The DISTqq algorithm runs in O(q(n+m))O(q(n+m)) time and the LDISTqq algorithm runs in O(n+m)O(n+m) time. Their performances were not significantly different in our experiments and rather the former ran faster than the latter in most cases, where the optimal value of qq was relatively small.

References

  • [1] Cyril Allauzen, Maxime Crochemore, and Mathieu Raffinot. Factor oracle: A new structure for pattern matching. In Jan Pavelka, Gerard Tel, and Miroslav Bartošek, editors, SOFSEM’99: Theory and Practice of Informatics, pages 295–310. Springer Berlin Heidelberg, 1999.
  • [2] Ross Arnold and Tim Bell. A corpus for the evaluation of lossless compression algorithms. In Proceedings of DCC ’97. Data Compression Conference, pages 201–210, 1997.
  • [3] Robert S. Boyer and J. Strother Moore. A fast string searching algorithm. Commun. ACM, 20(10):762–772, 1977.
  • [4] Domenico Cantone and Simone Faro. Searching for a substring with constant extra-space complexity. In Proceedings of Third International Conference on Fun with algorithms, pages 118–131, 2004.
  • [5] Domenico Cantone and Simone Faro. Fast-search algorithms: New efficient variants of the Boyer-Moore pattern-matching algorithm. Journal of Automata, Languages and Combinatorics, 10:589–608, 2005.
  • [6] Domenico Cantone and Simone Faro. Improved and self-tuned occurrence heuristics. Journal of Discrete Algorithms, 28:73–84, 2014.
  • [7] Domenico Cantone, Simone Faro, and Emanuele Giaquinta. A compact representation of nondeterministic (suffix) automata for the bit-parallel approach. Information and Computation, 213:3–12, 2012. Special Issue: Combinatorial Pattern Matching (CPM 2010).
  • [8] Domenico Cantone, Simone Faro, and Arianna Pavone. Linear and Efficient String Matching Algorithms Based on Weak Factor Recognition. Journal of Experimental Algorithmics, 24(1):1–20, 2019.
  • [9] Simone Faro. A very fast string matching algorithm based on condensed alphabets. In Riccardo Dondi, Guillaume Fertin, and Giancarlo Mauri, editors, Algorithmic Aspects in Information and Management - 11th International Conference, AAIM 2016, Bergamo, Italy, July 18-20, 2016, Proceedings, volume 9778 of Lecture Notes in Computer Science, pages 65–76. Springer, 2016.
  • [10] Simone Faro and Thierry Lecroq. A fast suffix automata based algorithm for exact online string matching. In Nelma Moreira and Rogério Reis, editors, Implementation and Application of Automata, pages 149–158. Springer Berlin Heidelberg, 2012.
  • [11] Simone Faro and Thierry Lecroq. A multiple sliding windows approach to speed up string matching algorithms. In Ralf Klasing, editor, Experimental Algorithms, pages 172–183. Springer Berlin Heidelberg, 2012.
  • [12] Simone Faro and Thierry Lecroq. The exact online string matching problem. ACM Computing Surveys, 45(2):1–42, 2013.
  • [13] Simone Faro, Thierry Lecroq, Stefano Borzì, Simone Di Mauro, and Alessandro Maggio. The string matching algorithms research tool. In Jan Holub and Jan Žďárek, editors, Proceedings of the Prague Stringology Conference 2016, pages 99–113, Czech Technical University in Prague, Czech Republic, 2016.
  • [14] Frantisek Franek, Christopher G. Jennings, and W.F. Smyth. A simple fast hybrid pattern-matching algorithm. Journal of Discrete Algorithms, 5(4):682–695, 2007.
  • [15] Saqib I. Hakak, Amirrudin Kamsin, Palaiahnakote Shivakumara, Gulshan A. Gilkar, Wazir Z. Khan, and Muhammad Imran. Exact string matching algorithms: Survey, issues, and future research directions. IEEE Access, 7:69614–69637, 2019.
  • [16] R. Nigel Horspool. Practical fast searching in strings. Software: Practice and Experience, 10(6):501–506, 1980.
  • [17] Donald E. Knuth, James H. Morris, Jr., and Vaughan R. Pratt. Fast pattern matching in strings. SIAM Journal on Computing, 6(2):323–350, 1977.
  • [18] Satoshi Kobayashi, Diptarama Hendrian, Ryo Yoshinaka, and Ayumi Shinohara. An improvement of the Franek-Jennings-Smyth pattern matching algorithm. In Proceedings of the Prague Stringology Conference 2019, pages 56–68, 2019.
  • [19] Thierry Lecroq. Fast exact string matching algorithms. Information Processing Letters, 102(6):229–235, 2007.
  • [20] Gonzalo Navarro and Mathieu Raffinot. A bit-parallel approach to suffix automata: Fast extended string matching. In Martin Farach-Colton, editor, Combinatorial Pattern Matching, pages 14–33. Springer Berlin Heidelberg, 1998.
  • [21] Daniel M. Sunday and Daniel M. A very fast substring search algorithm. Communications of the ACM, 33(8):132–142, 1990.
  • [22] Sun Wu and Udi Manber. A fast algorithm for multi-pattern searching. Technical Report TR–94–17, Department of Computer Science, Chung-Cheng University, 1994.