This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

Towards Efficient Interactive Computation of
Dynamic Time Warping Distance

Akihiro Nishi1 1 Department of Informatics, Kyushu University, Fukuoka, Japan
2 PRESTO, Japan Science and Technology Agency, Kawaguchi, Japan
3 M&D Data Science Center, Tokyo Medical and Dental University, Tokyo, Japan
Yuto Nakashima1 1 Department of Informatics, Kyushu University, Fukuoka, Japan
2 PRESTO, Japan Science and Technology Agency, Kawaguchi, Japan
3 M&D Data Science Center, Tokyo Medical and Dental University, Tokyo, Japan
Shunsuke Inenaga1,2 1 Department of Informatics, Kyushu University, Fukuoka, Japan
2 PRESTO, Japan Science and Technology Agency, Kawaguchi, Japan
3 M&D Data Science Center, Tokyo Medical and Dental University, Tokyo, Japan
Hideo Bannai3 1 Department of Informatics, Kyushu University, Fukuoka, Japan
2 PRESTO, Japan Science and Technology Agency, Kawaguchi, Japan
3 M&D Data Science Center, Tokyo Medical and Dental University, Tokyo, Japan
Masayuki Takeda1 1 Department of Informatics, Kyushu University, Fukuoka, Japan
2 PRESTO, Japan Science and Technology Agency, Kawaguchi, Japan
3 M&D Data Science Center, Tokyo Medical and Dental University, Tokyo, Japan
Abstract

The dynamic time warping (DTW) is a widely-used method that allows us to efficiently compare two time series that can vary in speed. Given two strings AA and BB of respective lengths mm and nn, there is a fundamental dynamic programming algorithm that computes the DTW distance 𝖽𝗍𝗐(A,B)\mathsf{dtw}(A,B) for AA and BB together with an optimal alignment in Θ(mn)\Theta(mn) time and space. In this paper, we tackle the problem of interactive computation of the DTW distance for dynamic strings, denoted 𝐃𝟐𝐓𝐖\mathbf{D^{2}TW}, where character-wise edit operation (insertion, deletion, substitution) can be performed at an arbitrary position of the strings. Let MM and NN be the sizes of the run-length encoding (RLE) of AA and BB, respectively. We present an algorithm for 𝐃𝟐𝐓𝐖\mathbf{D^{2}TW} that occupies Θ(mN+nM)\Theta(mN+nM) space and uses O(m+n+#chg)O(mN+nM)O(m+n+\#_{\mathrm{chg}})\subseteq O(mN+nM) time to update a compact differential representation 𝐷𝑆\mathit{DS} of the DP table per edit operation, where #chg\#_{\mathrm{chg}} denotes the number of cells in 𝐷𝑆\mathit{DS} whose values change after the edit operation. Our method is at least as efficient as the algorithm recently proposed by Froese et al. running in Θ(mN+nM)\Theta(mN+nM) time, and is faster when #chg\#_{\mathrm{chg}} is smaller than O(mN+nM)O(mN+nM) which, as our preliminary experiments suggest, is likely to be the case in the majority of instances.

1 Introduction

The dynamic time warping (DTW) is a classical and widely-used method that allows us to efficiently compare two temporal sequences or time series that can vary in speed. A fundamental dynamic programming algorithm computes the DTW distance 𝖽𝗍𝗐(A,B)\mathsf{dtw}(A,B) for two strings AA and BB together with an optimal alignment in Θ(mn)\Theta(mn) time and space [12], where |A|=m|A|=m and |B|=n|B|=n. This algorithm allows one to update the DP table D\mathit{D} for 𝖽𝗍𝗐(A,B)\mathsf{dtw}(A,B) in O(m)O(m) time (resp. O(n)O(n) time) when a new character is appended to BB (resp. to AA).

In this paper, we introduce the “dynamic” DTW problem, denoted 𝐃𝟐𝐓𝐖\mathbf{D^{2}TW}, where character-wise edit operation (insertion, deletion, substitution) can be performed at an arbitrary position of the strings. More formally, we wish to maintain a (space-efficient) representation of D\mathit{D} that can dynamically be modified according to a given operation. This representation should be able to quickly answer the value of D[m,n]=𝖽𝗍𝗐(A,B)\mathit{D}[m,n]=\mathsf{dtw}(A,B) upon query, together with an optimal alignment achieving 𝖽𝗍𝗐(A,B)\mathsf{dtw}(A,B). This kind of interactive computation for (a representation of) D\mathit{D} can be of practical merits, e.g. when simulating stock charts, or editing musical sequences. Another example of applications of 𝐃𝟐𝐓𝐖\mathbf{D^{2}TW} is a sliding window version of DTW which computes 𝖽𝗍𝗐(A,B[j..j+d1])\mathsf{dtw}(A,B[j..j+d-1]) between AA and every substring B[j..j+d1]B[j..j+d-1] of BB of arbitrarily fixed length dd.

Incremental/decremental computation of a DP table is a restricted version of the aforementioned interactive computation, which allows for prepending a new character to BB, and/or deleting the leftmost character from BB. A number of incremental/decremental computation algorithms have been proposed for the unit-cost edit distance and weighted edit distance: Kim and Park [9] showed an incremental/decremental algorithm for the unit-cost edit distance that occupies Θ(mn)\Theta(mn) space and runs in O(m+n)O(m+n) time per operation. Hyyrö et al. [7] proposed an algorithm for the edit distance with integer weights which uses Θ(mn)\Theta(mn) space and runs in O(min{c(m+n),mn})O(\min\{c(m+n),mn\}) time per operation, where cc is the maximum weight in the cost function. This translates into O(m+n)O(m+n) time under constant weights. Schmidt [13] gave an algorithm that uses Θ(mn)\Theta(mn) space and runs in O(nlogm)O(n\log m) time per operation for a general weighted edit distance. Hyyrö and Inenaga [5] presented a space efficient alternative to incremental/decremental unit-cost edit distance computation which runs in O(m+n)O(m+n) time per operation but uses only Θ(mN+nM)\Theta(mN+nM) space, where MM and NN are the sizes of run-length encoding (RLE) of AA and BB, respectively. Since MmM\leq m and NnN\leq n always hold, the mN+nMmN+nM terms can be much smaller than the mnmn term for strings that contain many long character runs. Later, Hyyrö and Inenaga [6] presented a space-efficient alternative for edit distance with integer weights, which runs in O(min{c(m+n),mn})O(\min\{c(m+n),mn\}) time per operation and requires Θ(mN+nM)\Theta(mN+nM) space.

Fully-dynamic interactive computation for the (weighted) edit distance was also considered: Let jj^{*} be the position in BB where the modification has been performed. For the unit cost edit distance, Hyyrö et al. [8] presented a representation of the DP table which uses Θ(mn)\Theta(mn) space and can be updated in O(min{rc(m+n),mn})O(\min\{rc(m+n),mn\}) time per operation, where r=min{j,nj+1}r=\min\{j^{*},n-j^{*}+1\} and cc is the maximum weight. They also showed that there exist instances that require Ω(min{rc(m+n),mn})\Omega(\min\{rc(m+n),mn\}) time to update their data structure per operation. Very recently, Charalampopoulos et al. [3] showed how to maintain an optimal (weighted) alignment of two fully-dynamic strings in O~(nmin{n,c})\tilde{O}(n\min\{\sqrt{n},c\}) time per operation, where m=nm=n.

While computing longest common subsequence (LCS) and weighted edit distance of strings of length nn can both be reduced to computing DTW of strings of length O(n)O(n) [1, 10], a reduction to the other direction is not known. It thus seems difficult to directly apply any of the aforementioned algorithms to our 𝐃𝟐𝐓𝐖\mathbf{D^{2}TW} problem. Also, a conditional lower bound suggests that strongly sub-quadratic DTW algorithms are unlikely to exist [1, 2]. Thus, any method that recomputes the nav̈e DP table D\mathit{D} from scratch should take almost quadratic time per update.

Our contribution. This paper takes the first step towards an efficient solution to 𝐃𝟐𝐓𝐖\mathbf{D^{2}TW}. Namely, we present an algorithm for 𝐃𝟐𝐓𝐖\mathbf{D^{2}TW} that occupies Θ(mN+nM)\Theta(mN+nM) space and uses O(m+n+#chg)O(m+n+\#_{\mathrm{chg}}) time to update a compact differential representation 𝐷𝑆\mathit{DS} for the DP table D\mathit{D} per edit operation, where #chg\#_{\mathrm{chg}} denotes the number of cells in 𝐷𝑆\mathit{DS} whose values change after the edit operation. Since #chg=O(mN+nM)\#_{\mathrm{chg}}=O(mN+nM) always holds, our method is always at least as efficient as the naïve method that recomputes the full DP table D\mathit{D} in Θ(mn)\Theta(mn) time, or the algorithm of Froese et al. [4] that recomputes another sparse representation of D\mathit{D} in Θ(mN+nM)\Theta(mN+nM) time. While there exist worst-case instances that give #chg=Ω(mN+nM)\#_{\mathrm{chg}}=\Omega(mN+nM), our preliminary experiments suggest that, in many cases, #chg\#_{\mathrm{chg}} can be much smaller than the size of 𝐷𝑆\mathit{DS} which is Θ(mN+nM)\Theta(mN+nM).

Technically our algorithm is most related to Hyyrö et al.’s method [7, 8] and Froese et al.’s method [4], but our algorithm is not straightforward from these.

2 Preliminaries

We consider sequences (strings) of characters from an alphabet Σ\Sigma of real numbers. Let A=a1,,amA=a_{1},\ldots,a_{m} be a string consisting of mm characters from Σ\Sigma. The run-length encoding 𝗋𝗅𝖾(A)\mathsf{rle}(A) of string AA is a compact representation of AA such that each maximal run of the same characters in AA is represented by a pair of the character and the length of the run. More formally, let \mathbb{N} denote the set of positive integers. For any non-empty string AA, 𝗋𝗅𝖾(A)=a1e1aMeM\mathsf{rle}(A)=a_{1}^{e_{1}}\cdots a_{M}^{e_{M}}, where aIΣa_{I}\in\Sigma and eIe_{I}\in\mathbb{N} for any 1IM1\leq I\leq M, and aIaI+1a_{I}\neq a_{I+1} for any 1I<M1\leq I<M. Each aIeIa_{I}^{e_{I}} in 𝗋𝗅𝖾(A)\mathsf{rle}(A) is called a (character) run, and eIe_{I} is called the exponent of this run. The size of 𝗋𝗅𝖾(A)\mathsf{rle}(A) is the number MM of runs in 𝗋𝗅𝖾(A)\mathsf{rle}(A). E.g., for string A=𝚊𝚊𝚌𝚌𝚌𝚌𝚌𝚌𝚌𝚋𝚋𝚊𝚋𝚋𝚋𝚋A=\mathtt{aacccccccbbabbbb} of length 16, 𝗋𝗅𝖾(A)=𝚊2𝚌7𝚋2𝚊1𝚋4\mathsf{rle}(A)=\mathtt{a}^{2}\mathtt{c}^{7}\mathtt{b}^{2}\mathtt{a}^{1}\mathtt{b}^{4} and its size is 5.

Dynamic time warping (DTW) is a commonly used method to compare two temporal sequences that may vary in speed. Consider two strings A=a1,,amA=a_{1},\ldots,a_{m} and B=b1,,bnB=b_{1},\ldots,b_{n}. To formally define the DTW for AA and BB, we consider an m×nm\times n grid graph 𝒢m,n\mathcal{G}_{m,n} such that each vertex (i,j)(i,j) has (at most) three directed edges; one to the lower neighbor (i+1,j)(i+1,j) (if it exists), one to the right neighbor (i,j+1)(i,j+1) (if it exists), and one to the lower-right neighbor (i+1,j+1)(i+1,j+1) (if it exists). A path in 𝒢m,n\mathcal{G}_{m,n} that starts from vertex (1,1)(1,1) and ends at vertex (m,n)(m,n) is called a warping path, and is denoted by a sequence (1,1),,(i,j),,(m,n)(1,1),\ldots,(i,j),\ldots,(m,n) of adjacent vertices. Let 𝒫m,n\mathcal{P}_{m,n} be the set of all warping paths in 𝒢m,n\mathcal{G}_{m,n}. Note that each warping path in 𝒫m,n\mathcal{P}_{m,n} corresponds to an alignment of AA and BB. The DTW for strings AA and BB, denoted 𝖽𝗍𝗐(A,B)\mathsf{dtw}(A,B), is defined by 𝖽𝗍𝗐(A,B)=minp𝒫m,n(i,j)p(aibj)2\mathsf{dtw}(A,B)=\min_{p\in\mathcal{P}_{m,n}}\sqrt{\sum_{(i,j)\in p}(a_{i}-b_{j})^{2}}.

The fundamental Θ(mn)\Theta(mn)-time and space solution for computing 𝖽𝗍𝗐(A,B)\mathsf{dtw}(A,B), given in [12], fills an m×nm\times n dynamic programming table D\mathit{D} such that D[i,j]=𝖽𝗍𝗐(A[1..i],B[1..j])2\mathit{D}[i,j]=\mathsf{dtw}(A[1..i],B[1..j])^{2} for 1im1\leq i\leq m and 1jn1\leq j\leq n. Therefore, after all the cells of D\mathit{D} are filled, the desired result 𝖽𝗍𝗐(A,B)\mathsf{dtw}(A,B) can be obtained by D[m,n]\sqrt{\mathit{D}[m,n]}. The value for each cell D[i,j]\mathit{D}[i,j] is computed by the following well-known recurrence:

D[1,1]=(a1b1)2,D[i,1]=D[i1,1]+(aib1)2for 1<im,D[1,j]=D[1,j1]+(a1bj)2for 1<jn,D[i,j]=min{D[i,j1],D[i1,j],D[i1,j1]}+(aibj)2for 1<im and 1<jn.\begin{array}[]{lll}\mathit{D}[1,1]&=&(a_{1}-b_{1})^{2},\\ \mathit{D}[i,1]&=&\mathit{D}[i-1,1]+(a_{i}-b_{1})^{2}\quad\text{for }1<i\leq m,\\ \mathit{D}[1,j]&=&\mathit{D}[1,j-1]+(a_{1}-b_{j})^{2}\quad\text{for }1<j\leq n,\\ \mathit{D}[i,j]&=&\begin{array}[t]{l}\min\{\mathit{D}[i,j-1],\mathit{D}[i-1,j],\mathit{D}[i-1,j-1]\}+(a_{i}-b_{j})^{2}\\ \ \text{for $1<i\leq m$ and $1<j\leq n$.}\end{array}\end{array} (1)

In the rest of this paper, we will consider the problem of maintaining a representation for D\mathit{D}, each time one of the strings, BB, is dynamically modified by an edit operation (i.e. single character insertion, deletion, or substitution) on an arbitrary position in BB. We call this kind of interactive computation of 𝖽𝗍𝗐(A,B)\mathsf{dtw}(A,B) as the dynamic DTW computation, denoted by 𝐃𝟐𝐓𝐖\mathbf{D^{2}TW}.

Let B\mathit{B^{\prime}} denote the string after an edit operation is performed on BB, and D\mathit{D^{\prime}} denote the dynamic programming table D\mathit{D} after it has been updated to correspond to 𝖽𝗍𝗐(A,B)\mathsf{dtw}(A,\mathit{B^{\prime}}). In a special case where the edit operation is performed at the right end of BB, where we have B=Bc\mathit{B^{\prime}}=Bc (insertion), B=B[1..n1]\mathit{B^{\prime}}=B[1..n-1] (deletion) or B=B[1..n1]c\mathit{B^{\prime}}=B[1..n-1]c (substitution) with a character cΣc\in\Sigma, then D\mathit{D} can easily be updated to D\mathit{D^{\prime}} in O(m)O(m) time by simply computing a single column at index j=nj=n or j=n+1j=n+1 using recurrence (1).

Refer to caption
Figure 1: In this example where A=dcbbccdaA=\textrm{dcbbccda} and B=acbeeaadB=\textrm{acbeeaad}, the values of Θ(mn)\Theta(mn) cells of the DP table for 𝖽𝗍𝗐(A,B)\mathsf{dtw}(A,B) change after the edit operation on BB (here, the first character B[1]=aB[1]=\mathrm{a} of BB was deleted).

As in Figure 1, in the worst case, the values of Θ(mn)\Theta(mn) cells of the DP table for 𝖽𝗍𝗐(A,B)\mathsf{dtw}(A,B) can change after an edit on BB. The following lemma gives a stronger statement that updating D\mathit{D} to D\mathit{D^{\prime}} in our 𝐃𝟐𝐓𝐖\mathbf{D^{2}TW} scenario cannot be amortized:

Lemma 1

There are strings AA, BB and a sequence of kk edits on BB such that Θ(kmn)\Theta(kmn) cells in D\mathit{D^{\prime}} have different values in the corresponding cells in D\mathit{D}.

3 Our 𝐃𝟐𝐓𝐖\mathbf{D^{2}TW} Algorithm based on RLE

We first explain the data structures which are used in our algorithm.

Differential representation 𝐷𝑅\mathit{DR} of D\mathit{D}. The first idea of our algorithm is to use a differential representation 𝐷𝑅\mathit{DR} of D\mathit{D}: Each cell of 𝐷𝑅\mathit{DR} contains two fields that respectively store the horizontal difference and the vertical difference, namely, 𝐷𝑅[i,j].U=D[i,j]D[i1,j]\mathit{DR}[i,j].U=\mathit{D}[i,j]-\mathit{D}[i-1,j] and 𝐷𝑅[i,j].L=D[i,j]D[i,j1]\mathit{DR}[i,j].L=\mathit{D}[i,j]-\mathit{D}[i,j-1]. We let 𝐷𝑅[i,1].L=0\mathit{DR}[i,1].L=0 for any 1im1\leq i\leq m and 𝐷𝑅[1,j].U=0\mathit{DR}[1,j].U=0 for any 1jn1\leq j\leq n. The diagonal difference D[i,j]D[i1,j1]\mathit{D}[i,j]-\mathit{D}[i-1,j-1] can easily be computed from 𝐷𝑅[i,j].U\mathit{DR}[i,j].U and 𝐷𝑅[i,j].L\mathit{DR}[i,j].L and thus is not explicitly stored in 𝐷𝑅[i,j]\mathit{DR}[i,j].

In our algorithm we make heavy use of the following lemma:

Lemma 2

For any 1<im1<i\leq m,

𝐷𝑅[i,j].U={(aib1)2if j=1,z𝐷𝑅[i1,j].Lif 2jn,\mathit{DR}[i,j].U=\begin{cases}(a_{i}-b_{1})^{2}&\mbox{if }j=1,\\ z-\mathit{DR}[i-1,j].L&\mbox{if }2\leq j\leq n,\end{cases}

and for any 1<jn1<j\leq n,

𝐷𝑅[i,j].L={(a1bj)2if i=1,z𝐷𝑅[i,j1].Uif 2im,\mathit{DR}[i,j].L=\begin{cases}(a_{1}-b_{j})^{2}&\mbox{if }i=1,\\ z-\mathit{DR}[i,j-1].U&\mbox{if }2\leq i\leq m,\end{cases}

where z=min{𝐷𝑅[i1,j].L,𝐷𝑅[i,j1].U, 0}+(aibj)2z=\min\{\mathit{DR}[i-1,j].L,\ \mathit{DR}[i,j-1].U,\ 0\}+(a_{i}-b_{j})^{2}.

  • Proof. 𝐷𝑅[i,1].U=(aib1)2\mathit{DR}[i,1].U=(a_{i}-b_{1})^{2} and 𝐷𝑅[1,j].L=(a1bj)2\mathit{DR}[1,j].L=(a_{1}-b_{j})^{2} are clear from recurrence (1). Now we consider 1<im1<i\leq m and 1<jn1<j\leq n, and let d=D[i1,j1]d=\mathit{D}[i-1,j-1], x=𝐷𝑅[i1,j].Lx=\mathit{DR}[i-1,j].L, y=𝐷𝑅[i,j1].Uy=\mathit{DR}[i,j-1].U, and d+z=D[i,j]d+z=D[i,j]. Then we have D[i1,j]=d+x\mathit{D}[i-1,j]=d+x and D[i,j1]=d+y\mathit{D}[i,j-1]=d+y (see Figure 2). It follows from the definition of 𝐷𝑅\mathit{DR} that 𝐷𝑅[i,j].U=D[i,j]D[i1,j]=zx\mathit{DR}[i,j].U=\mathit{D}[i,j]-\mathit{D}[i-1,j]=z-x and 𝐷𝑅[i,j].L=D[i,j]D[i,j1]=zy\mathit{DR}[i,j].L=\mathit{D}[i,j]-\mathit{D}[i,j-1]=z-y. Since D[i,j]=min{D[i1,j1],D[i1,j],D[i,j1]}+(aibj)2\mathit{D}[i,j]=\min\{\mathit{D}[i-1,j-1],\mathit{D}[i-1,j],\mathit{D}[i,j-1]\}+(a_{i}-b_{j})^{2} by recurrence (1), we obtain d+z=min{d,d+x,d+y}+(aibj)2d+z=\min\{d,d+x,d+y\}+(a_{i}-b_{j})^{2} which leads to z=min{x,y,0}+(aibj)2z=\min\{x,y,0\}+(a_{i}-b_{j})^{2}. \Box

RLE-based sparse differential representation 𝐷𝑆\mathit{DS}. The second key idea of our algorithm is to divide the dynamic programming table D\mathit{D} into “boxes” that are defined by intersections of maximal runs of AA and BB. Note that D\mathit{D} contains M×NM\times N such boxes. Let 𝗋𝗅𝖾(A)=A1k1AMkM\mathsf{rle}(A)=A_{1}^{k_{1}}\dots A_{M}^{k_{M}} and 𝗋𝗅𝖾(B)=B1l1BNlN\mathsf{rle}(B)=B_{1}^{l_{1}}\dots B_{N}^{l_{N}} be the RLEs of AA and BB. Let iTI=iI1ki+1i_{\mathrm{T}}^{I}=\sum_{i}^{I-1}{k_{i}}+1, iBI=iIkii_{\mathrm{B}}^{I}=\sum_{i}^{I}{k_{i}}, jLJ=jJ1lj+1j_{\mathrm{L}}^{J}=\sum_{j}^{J-1}{l_{j}}+1, and jRJ=jJljj_{\mathrm{R}}^{J}=\sum_{j}^{J}{l_{j}}. We define a sparse table 𝐷𝑆\mathit{DS} for 𝐷𝑅\mathit{DR} that consists only of the rows and columns on the borders of the maximal runs in AA and BB. Namely, 𝐷𝑆\mathit{DS} is a sparse table that only stores the rows iTI,iBI(1IM)i_{\mathrm{T}}^{I},i_{\mathrm{B}}^{I}~{}(1\leq I\leq M) and the columns jLJ,jRJ(1JN)j_{\mathrm{L}}^{J},j_{\mathrm{R}}^{J}~{}(1\leq J\leq N), of 𝐷𝑅\mathit{DR} (see Figure 3).

[Uncaptioned image]
Figure 2: Illustration for Lemma 2 which depicts the corresponding cells of the dynamic programming table D\mathit{D}, where D[i1,j1]=d\mathit{D}[i-1,j-1]=d, D[i1,j]=d+x\mathit{D}[i-1,j]=d+x, D[i,j1]=d+y\mathit{D}[i,j-1]=d+y, and D[i,j]=d+zD[i,j]=d+z.
[Uncaptioned image]
Figure 3: Illustration for 𝐷𝑆\mathit{DS} that consists only of the cells of 𝐷𝑅\mathit{DR} corresponding to the maximal run boundaries of AA and BB (white rows and columns). The gray regions that are surrounded by the box boundaries are not stored in 𝐷𝑆\mathit{DS}.

Each row and column of 𝐷𝑆\mathit{DS} is implemented by a linked list as follows: each cell 𝐷𝑆[i,j]\mathit{DS}[i,j] has four links to the upper, lower, left, and right neighbors in 𝐷𝑆\mathit{DS} (if these neighbors exist), plus a diagonal link to the right-lower direction. This diagonal link from 𝐷𝑆[i,j]\mathit{DS}[i,j] points to the first cell 𝐷𝑆[i+h,j+h]\mathit{DS}[i+h,j+h] that is reached by following the right-lower diagonal path from 𝐷𝑆[i,j]\mathit{DS}[i,j], namely, h0h\geq 0 is the smallest integer such that i+h=iBIi+h=i_{\mathrm{B}}^{I} or j+h=jLJj+h=j_{\mathrm{L}}^{J}. Clearly 𝐷𝑆\mathit{DS} occupies Θ(mN+nM)\Theta(mN+nM) space. 𝐷𝑆\mathit{DS} can answer 𝖽𝗍𝗐(A,B)=D[m,n]\mathsf{dtw}(A,B)=\mathit{D}[m,n] in O(m+n)O(m+n) time by tracing O(m+n)O(m+n) cells of 𝐷𝑆\mathit{DS} from (1,1)(1,1) to (m,n)(m,n).

For each 1I<M1\leq I<M and 1J<N1\leq J<N, we consider the region of 𝐷𝑅\mathit{DR} that is surrounded by the borders of the IIth and (I+1)(I+1)th runs of AA, and the JJth and (J+1)(J+1)th runs of BB. This region is called a box for I,JI,J, and is denoted by I,J\mathcal{B}^{I,J}. For ease of description, we will sometimes refer to a box I,J\mathcal{B}^{I,J} also in D\mathit{D} and 𝐷𝑆\mathit{DS}.

3.1 Updating 𝐷𝑆\mathit{DS} after an edit operation

Suppose that an edit operation has been performed at position jj^{*} of string BB and let BB^{\prime} denote the edited string. Let D\mathit{D^{\prime}} denote the dynamic programming table for 𝖽𝗍𝗐(A,B)\mathsf{dtw}(A,B^{\prime}), and 𝐷𝑅\mathit{DR^{\prime}} the difference representation for D\mathit{D^{\prime}}. As Figure 4 shows, the number of changed cells in 𝐷𝑅\mathit{DR^{\prime}} can be much smaller than that of changed cells in D\mathit{D^{\prime}} (see also Figure 1).

Refer to caption
Figure 4: For the running example from Figure 1, only the gray cells have different values in the difference representations 𝐷𝑅\mathit{DR} (left) and 𝐷𝑅\mathit{DR^{\prime}} (right).

Let 𝐷𝑆\mathit{DS^{\prime}} denote the sparse table for 𝐷𝑅\mathit{DR^{\prime}}. Since 𝐷𝑆\mathit{DS} consists only of the boundary cells, the number of changed cells in 𝐷𝑆\mathit{DS^{\prime}} can even be much smaller. In what follows, we show how to efficiently update 𝐷𝑆\mathit{DS} to 𝐷𝑆\mathit{DS^{\prime}}.

Because the prefix B[1..j1]B[1..j^{*}-1] remains unchanged after the edit operation, for any j<jj<j^{*} we have 𝐷𝑅[i,j]=𝐷𝑅[i,j]\mathit{DR}[i,j]=\mathit{DR^{\prime}}[i,j] by Lemma 2 and recurrence (1). Hence, we can restrict ourselves to the indices jjj\geq j^{*}. We define \ell as a correcting offset of string indices before and after the update: =1\ell=-1 if a character has been inserted at position jj^{*} of BB, =1\ell=1 if a character has been deleted from position jj^{*} of BB, and =0\ell=0 otherwise. Now, for any jjj\geq j^{*}, B[j]=B[j+]B^{\prime}[j]=B[j+\ell] and column jj in 𝐷𝑅\mathit{DR^{\prime}} corresponds to column j+j+\ell in 𝐷𝑅\mathit{DR}.

Let I,J\mathcal{B}^{I,J} be any box on 𝐷𝑆\mathit{DS^{\prime}}. For the the top row iTIi_{\mathrm{T}}^{I} of I,J\mathcal{B}^{I,J}, we use a linked list ΔTI,J\Delta_{\mathrm{T}}^{I,J} that stores the column indices jj (jLJjjRJj_{\mathrm{L}}^{J}\leq j\leq j_{\mathrm{R}}^{J}) such that 𝐷𝑆[iTI,j+]𝐷𝑆[iTI,j]\mathit{DS}[i_{\mathrm{T}}^{I},j+\ell]\neq\mathit{DS^{\prime}}[i_{\mathrm{T}}^{I},j], in increasing order. We also compute, in each element of the list, the value for D[iTI,j]\mathit{D^{\prime}}[i_{\mathrm{T}}^{I},j] of the corresponding column index jj. We use similar lists ΔBI,J\Delta_{\mathrm{B}}^{I,J}, ΔLI,J\Delta_{\mathrm{L}}^{I,J}, and ΔRI,J\Delta_{\mathrm{R}}^{I,J} for the bottom row, left column, and right column of I,J\mathcal{B}^{I,J}, respectively. We compute these lists when an edit operation is performed to string BB, and use them to update 𝐷𝑆\mathit{DS} to 𝐷𝑆\mathit{DS^{\prime}} efficiently.

Let #chg\#_{\mathrm{chg}} denote the number of cells in our sparse representation such that 𝐷𝑆[i+,j]𝐷𝑆[i,j]\mathit{DS}[i+\ell,j]\neq\mathit{DS^{\prime}}[i,j]. In the sequel, we prove:

Theorem 1

Our 𝐃𝟐𝐓𝐖\mathbf{D^{2}TW} algorithm updates 𝐷𝑆\mathit{DS} to 𝐷𝑆\mathit{DS^{\prime}} in O(m+n+#chg)O(m\!+\!n\!+\!\#_{\mathrm{chg}}) time.

Initial step. Suppose that jj^{*} is in the JJth run of string BB. Let I,J\mathcal{B}^{I,J} be any of the MM boxes of 𝐷𝑅\mathit{DR} that contain column jj^{*}, where jLJjjRJj_{\mathrm{L}}^{J}\leq j^{*}\leq j_{\mathrm{R}}^{J}. Due to Lemma 2, (1,j)(1,j^{*}) is the only cell in the first row where we may have 𝐷𝑆[1,j]𝐷𝑆[1,j+]\mathit{DS^{\prime}}[1,j^{*}]\neq\mathit{DS}[1,j^{*}+\ell]. 𝐷𝑆[1,j]\mathit{DS^{\prime}}[1,j^{*}] can be easily computed in O(1)O(1) time by Lemma 2. Then, D[1,j]\mathit{D^{\prime}}[1,j^{*}] can be computed in O(j)O(n)O(j^{*})\subseteq O(n) time by tracing the first row and using 𝐷𝑆[1,j].L\mathit{DS^{\prime}}[1,j].L for increasing j=1,,jj=1,\ldots,j^{*}. The list ΔTI,J\Delta_{\mathrm{T}}^{I,J} only contains jj^{*} (coupled with D[1,j]\mathit{D^{\prime}}[1,j^{*}]) if 𝐷𝑆[1,j]𝐷𝑆[1,j+]\mathit{DS^{\prime}}[1,j^{*}]\neq\mathit{DS}[1,j^{*}+\ell], and it is empty otherwise.

Editing string BB at position jj^{*} incurs some structural changes to 𝐷𝑆\mathit{DS}: (a) I,J\mathcal{B}^{I,J} gets wider by one (insertion of the same character to a run), (b) I,J\mathcal{B}^{I,J} gets narrower by one (deletion of a character), (c) I,J\mathcal{B}^{I,J} is divided into 2M2M or 3M3M boxes (insertion of a different character to a run, or character substitution).

In cases (a) and (b), the diagonal links of I,J\mathcal{B}^{I,J} need to be updated. A crucial observation is that the total number of such diagonal links to update is bounded by mm for all the MM boxes 1,J\mathcal{B}^{1,J}, …, M,J\mathcal{B}^{M,J}, since the destinations of such diagonal links are within the same column of 𝐷𝑆\mathit{DS^{\prime}} (jRJ+1j_{\mathrm{R}}^{J}+1 in case (a), and jRJ1j_{\mathrm{R}}^{J}-1 in case (b)). For each box I,J\mathcal{B}^{I,J}, if jRJjLJiTIiBIj_{\mathrm{R}}^{J}-j_{\mathrm{L}}^{J}\geq i_{\mathrm{T}}^{I}-i_{\mathrm{B}}^{I} (i.e. I,J\mathcal{B}^{I,J} is a square or a horizontal rectangle), then we scan the top row iTIi_{\mathrm{T}}^{I} from right to left and fix the diagonal links until encountering the first cell in iTIi_{\mathrm{T}}^{I} whose diagonal link needs no updates (see Figure 6). The case with jRJjLJ<iTIiBIj_{\mathrm{R}}^{J}-j_{\mathrm{L}}^{J}<i_{\mathrm{T}}^{I}-i_{\mathrm{B}}^{I} (i.e. I,J\mathcal{B}^{I,J} is a vertical rectangle) can be treated similarly. By the above observation, these costs for all boxes I,J\mathcal{B}^{I,J} that contain the edit position jj^{*} sum up to O(m)O(m).

[Uncaptioned image] Figure 5: Case (a) of the initial step. The dashed arcs are the old diagonal links in 𝐷𝑆\mathit{DS}, and the sold arcs are the modified diagonal links in 𝐷𝑆\mathit{DS^{\prime}}. The gray cells depict cells (iTI,jRJ)(i_{\mathrm{T}}^{I},j_{\mathrm{R}}^{J}) and (iBI,jRJ)(i_{\mathrm{B}}^{I},j_{\mathrm{R}}^{J}). [Uncaptioned image] Figure 6: Case (c) of the initial step, where character substitution has been performed at position jj^{*}. The dashed arcs are the old diagonal links in 𝐷𝑆\mathit{DS} from row iTIi_{\mathrm{T}}^{I} up to jj^{*}, and the sold arcs are the modified diagonal links from new column jj^{*} in 𝐷𝑆\mathit{DS^{\prime}}.

In case (a), we shift the right column jRJj_{\mathrm{R}}^{J} of 𝐷𝑆\mathit{DS} to the right by one position, and reuse it as the right column jRJ+1j_{\mathrm{R}}^{J}+1 of 𝐷𝑆\mathit{DS^{\prime}}. This incurs two new cells (iTI,jRJ)(i_{\mathrm{T}}^{I},j_{\mathrm{R}}^{J}) and (iBI,jRJ)(i_{\mathrm{B}}^{I},j_{\mathrm{R}}^{J}) in 𝐷𝑆\mathit{DS^{\prime}} (the gray cells in Figure 6). We can compute 𝐷𝑆[iTI,jRJ]\mathit{DS^{\prime}}[i_{\mathrm{T}}^{I},j_{\mathrm{R}}^{J}] in O(1)O(1) time using Lemma 2. Now consider to compute 𝐷𝑆[i,jRJ+1]\mathit{DS^{\prime}}[i,j_{\mathrm{R}}^{J}+1] for the new right column. Since this right column initially stores 𝐷𝑆[i,jRJ]\mathit{DS}[i,j_{\mathrm{R}}^{J}] for the old 𝐷𝑆\mathit{DS}, using Lemma 2, we can compute 𝐷𝑆[i,jRJ+1]\mathit{DS^{\prime}}[i,j_{\mathrm{R}}^{J}+1] in increasing order of i=1,,mi=1,\ldots,m, from top to bottom, in O(1)O(1) time each. We can compute D[1,jRJ+1]\mathit{D^{\prime}}[1,j_{\mathrm{R}}^{J}+1] in O(jRJ)O(j_{\mathrm{R}}^{J}) time by simply scanning the first row. Then, we can compute D[i,jRJ+1]\mathit{D^{\prime}}[i,j_{\mathrm{R}}^{J}+1] for increasing i=2,,mi=2,\ldots,m, using 𝐷𝑆[i,jRJ+1]\mathit{DS^{\prime}}[i,j_{\mathrm{R}}^{J}+1], and construct ΔRI,J\Delta_{\mathrm{R}}^{I,J}. This takes a total of O(jRJ+m)O(m+n)O(j_{\mathrm{R}}^{J}+m)\subseteq O(m+n) time. Finally, 𝐷𝑆[iBI,jRJ]\mathit{DS^{\prime}}[i_{\mathrm{B}}^{I},j_{\mathrm{R}}^{J}] is computed from D[iBI,jRJ+1]\mathit{D^{\prime}}[i_{\mathrm{B}}^{I},j_{\mathrm{R}}^{J}+1] and 𝐷𝑆[iBI,jRJ+1].L\mathit{DS^{\prime}}[i_{\mathrm{B}}^{I},j_{\mathrm{R}}^{J}+1].L in O(1)O(1) time. Case (b) can be treated similarly.

For case (c), we consider a sub-case where a character substitution was performed completely inside a run of BB, at position jj^{*}. This divides an existing box I,J\mathcal{B}^{I,J} into three boxes I,J\mathcal{B}^{I,J}, I,J+1\mathcal{B}^{I,J+1}, and I,J+2\mathcal{B}^{I,J+2}. Thus, there appear three new columns j1j^{*}-1, jj^{*}, and j+1j^{*}+1 in 𝐷𝑆\mathit{DS^{\prime}}. Then, the diagonal links for these new columns can be computed in O(1)O(1) time each, by scanning row iTIi_{\mathrm{T}}^{I} from j+1j^{*}+1, from right to left (see Figure 6). The 𝐷𝑆\mathit{DS^{\prime}} values for the cells in these new columns, as well as the D\mathit{D^{\prime}} values for column j+1j^{*}+1, can also be computed in similar ways to cases (a) and (b). The other sub-cases of (c) can be treated similarly.

Updating cells on row iTIi_{\mathrm{T}}^{I} and column jLJj_{\mathrm{L}}^{J}. In what follows, suppose that we are given a box I,J\mathcal{B}^{I,J} to the right of the edit position jj^{*}, in which some boundary cell values may have to be updated. For ease of exposition, we will discuss the simplest case with substitution where the column indices do not change between 𝐷𝑆\mathit{DS} and 𝐷𝑆\mathit{DS^{\prime}}. The cases with insertion/deletion can be treated similarly by considering the offset value \ell appropriately.

Now our task is to quickly detect the boundary cells (i,j)(i,j) of I,J\mathcal{B}^{I,J} such that 𝐷𝑆[i,j]𝐷𝑆[i,j]\mathit{DS}[i,j]\neq\mathit{DS^{\prime}}[i,j], and to update them. We assume that the boundary cell values of the preceding boxes I1,J\mathcal{B}^{I-1,J} and I,J1\mathcal{B}^{I,J-1} have already been computed.

We consider how to detect the cells on the top boundary row iTIi_{\mathrm{T}}^{I} and the cells on the left boundary column jLJj_{\mathrm{L}}^{J} of box I,J\mathcal{B}^{I,J} that need to be updated, and how to update them. For this sake, we use the following lemma on the values of 𝐷𝑅\mathit{DR}, which is immediate from Lemma 2:

Lemma 3

Let 1im1\leq i\leq m and 1jn1\leq j\leq n. Suppose that for any cell (i,j)(i^{\prime},j^{\prime}) with i<ii^{\prime}<i or j<jj^{\prime}<j, the value of 𝐷𝑅[i,j]\mathit{DR^{\prime}}[i^{\prime},j^{\prime}] has already been computed. If 𝐷𝑅[i,j]𝐷𝑅[i,j]\mathit{DR}[i,j]\neq\mathit{DR^{\prime}}[i,j], then 𝐷𝑅[i,j1].U𝐷𝑅[i,j1].U\mathit{DR}[i,j-1].U\neq\mathit{DR}^{\prime}[i,j-1].U or 𝐷𝑅[i1,j].L𝐷𝑅[i1,j].L\mathit{DR}[i-1,j].L\neq\mathit{DR^{\prime}}[i-1,j].L.

Intuitively, Lemma 3 states that the cell (i,j)(i,j) such that 𝐷𝑅[i,j]𝐷𝑅[i,j]\mathit{DR}[i,j]\neq\mathit{DR^{\prime}}[i,j] must be propagated from its left neighbor or its top neighbor. We use this lemma for updating the boundaries of each box I,J\mathcal{B}^{I,J} stored in 𝐷𝑆\mathit{DS}. Recall that the values on the preceding row iTI1=iBI1i_{\mathrm{T}}^{I}-1=i_{\mathrm{B}}^{I-1} and on the preceding column jLJ1=jRJ1j_{\mathrm{L}}^{J}-1=j_{\mathrm{R}}^{J-1} have already been updated. Then, the cells on iTIi_{\mathrm{T}}^{I} and jLJj_{\mathrm{L}}^{J} of box I,J\mathcal{B}^{I,J} with 𝐷𝑆[i,j]𝐷𝑆[i,j]\mathit{DS}[i,j]\neq\mathit{DS^{\prime}}[i^{\prime},j^{\prime}] can be found in constant time each, from the lists ΔBI1,J\Delta_{\mathrm{B}}^{I-1,J} and ΔRI,J1\Delta_{\mathrm{R}}^{I,J-1} maintained for the preceding row iTI1=iBI1i_{\mathrm{T}}^{I}-1=i_{\mathrm{B}}^{I-1} and preceding column jLJ1=jRJ1j_{\mathrm{L}}^{J}-1=j_{\mathrm{R}}^{J-1}, respectively.

We process column indices ΔBI1,J\Delta_{\mathrm{B}}^{I-1,J} in increasing order, and suppose that we are currently processing column index j^ΔBI1,J\hat{j}\in\Delta_{\mathrm{B}}^{I-1,J} in the bottom row iBI1i_{\mathrm{B}}^{I-1} of the preceding box I1,J\mathcal{B}^{I-1,J}. According to the above arguments, this indicates that the cells (iTI,j)(i_{\mathrm{T}}^{I},j) in the top row iTIi_{\mathrm{T}}^{I} of I,J\mathcal{B}^{I,J} that need to be updated (i.e., 𝐷𝑆[iTI,j]𝐷𝑆[iTI,j]\mathit{DS}[i_{\mathrm{T}}^{I},j]\neq\mathit{DS^{\prime}}[i_{\mathrm{T}}^{I},j]). We assume that, for any jj^{\prime} with jLJj<j^j_{\mathrm{L}}^{J}\leq j^{\prime}<\hat{j}, the value of 𝐷𝑆[iTI,j]\mathit{DS^{\prime}}[i_{\mathrm{T}}^{I},j^{\prime}] has already been computed. Also, we have maintained a partial list for ΔTI,J\Delta_{\mathrm{T}}^{I,J} where the last element of this partial list stores the largest j′′j^{\prime\prime} such that jLJj′′<j^j_{\mathrm{L}}^{J}\leq j^{\prime\prime}<\hat{j} and 𝐷𝑆[iTI,j′′]𝐷𝑆[iTI,j′′]\mathit{DS}[i_{\mathrm{T}}^{I},j^{\prime\prime}]\neq\mathit{DS^{\prime}}[i_{\mathrm{T}}^{I},j^{\prime\prime}], together with the value of D[iTI,j′′]\mathit{D}^{\prime}[i_{\mathrm{T}}^{I},j^{\prime\prime}]. Now it follows from Lemma 2 that both 𝐷𝑆[iTI,j^].U\mathit{DS^{\prime}}[i_{\mathrm{T}}^{I},\hat{j}].U and 𝐷𝑆[iTI,j^].L\mathit{DS^{\prime}}[i_{\mathrm{T}}^{I},\hat{j}].L can be respectively computed in constant time from 𝐷𝑆[iTI1,j^].L\mathit{DS^{\prime}}[i_{\mathrm{T}}^{I}-1,\hat{j}].L and 𝐷𝑆[iTI,j^1].U\mathit{DS^{\prime}}[i_{\mathrm{T}}^{I},\hat{j}-1].U, and thus we can check whether 𝐷𝑆[iTI,j^]𝐷𝑆[iTI,j^]\mathit{DS}[i_{\mathrm{T}}^{I},\hat{j}]\neq\mathit{DS^{\prime}}[i_{\mathrm{T}}^{I},\hat{j}] in constant time as well. In case 𝐷𝑆[iTI,j^]𝐷𝑆[iTI,j^]\mathit{DS}[i_{\mathrm{T}}^{I},\hat{j}]\neq\mathit{DS^{\prime}}[i_{\mathrm{T}}^{I},\hat{j}], we append j^\hat{j} to the partial list for ΔTI,J\Delta_{\mathrm{T}}^{I,J}. By the definition of 𝐷𝑆\mathit{DS}, we have D[iTI,j^]=D[iTI1,j^]𝐷𝑆[iTI,j^].U\mathit{D^{\prime}}[i_{\mathrm{T}}^{I},\hat{j}]=\mathit{D^{\prime}}[i_{\mathrm{T}}^{I}-1,\hat{j}]-\mathit{DS^{\prime}}[i_{\mathrm{T}}^{I},\hat{j}].U. Since D[iTI1,j^]=D[iBI1,j^]\mathit{D^{\prime}}[i_{\mathrm{T}}^{I}-1,\hat{j}]=\mathit{D^{\prime}}[i_{\mathrm{B}}^{I-1},\hat{j}] is stored with the current column index j^\hat{j} in the list ΔBI1,J\Delta_{\mathrm{B}}^{I-1,J}, D[iTI,j^]\mathit{D^{\prime}}[i_{\mathrm{T}}^{I},\hat{j}] can also be computed in constant time.

Suppose we have processed cell (iTI,j^)(i_{\mathrm{T}}^{I},\hat{j}). We perform the same procedure as above for the right-neighbor cells (iTI,j^+p)(i_{\mathrm{T}}^{I},\hat{j}+p) with p=1p=1 and increasing pp, until encountering the first cell (iTI,j^+p)(i_{\mathrm{T}}^{I},\hat{j}+p) such that (1) 𝐷𝑆[iTI,j^+p]=𝐷𝑆[iTI,j^+p]\mathit{DS}[i_{\mathrm{T}}^{I},\hat{j}+p]=\mathit{DS^{\prime}}[i_{\mathrm{T}}^{I},\hat{j}+p], (2) j^+pΔBI1,J\hat{j}+p\in\Delta_{\mathrm{B}}^{I-1,J}, or (3) j^+p=jRJ+1\hat{j}+p=j_{\mathrm{R}}^{J}+1. In cases (1) and (2), we move on to the next element of in ΔBI1,J\Delta_{\mathrm{B}}^{I-1,J}, and perform the same procedure as above. We are done when we encounter case (3) or ΔBI1,J\Delta_{\mathrm{B}}^{I-1,J} becomes empty. The total number of cells (iTI,j^+p)(i_{\mathrm{T}}^{I},\hat{j}+p) for all boxes in 𝐷𝑆\mathit{DS^{\prime}} is bounded by #chg\#_{\mathrm{chg}}.

In a similar way, we process row indices ΔRI,J1\Delta_{\mathrm{R}}^{I,J-1} in increasing order, update the cells on the left column jLJj_{\mathrm{L}}^{J}, and maintain another partial list for ΔLI,J\Delta_{\mathrm{L}}^{I,J}.

Updating cells on row iBIi_{\mathrm{B}}^{I} and column jRJj_{\mathrm{R}}^{J}. Let us consider how to detect the cells on the bottom row iBIi_{\mathrm{B}}^{I} and the cells on the right column jRJj_{\mathrm{R}}^{J} of box I,J\mathcal{B}^{I,J} that need to be updated, and how to update them.

The next lemma shows monotonicity on the values of D\mathit{D} inside each I,J\mathcal{B}^{I,J}.

Lemma 4 ([4])

For any (i,j)(i,j) with 1im1\leq i\leq m and jLJ<jjRJj_{\mathrm{L}}^{J}<j\leq j_{\mathrm{R}}^{J}, D[i,j]D[i,j1]\mathit{D}[i,j]\geq\mathit{D}[i,j-1]. For any (i,j)(i,j) with iTI<iiBIi_{\mathrm{T}}^{I}<i\leq i_{\mathrm{B}}^{I} and 1jn1\leq j\leq n, D[i,j]D[i1,j]\mathit{D}[i,j]\geq\mathit{D}[i-1,j].

The next corollary is immediate from Lemma 4.

Corollary 1

For any cell (i,j)(i,j) with 1im1\leq i\leq m and jLJ<jjRJj_{\mathrm{L}}^{J}<j\leq j_{\mathrm{R}}^{J}, 𝐷𝑅[i,j].L0\mathit{DR}[i,j].L\geq 0. Also, for any cell (i,j)(i,j) with iTI<iiBIi_{\mathrm{T}}^{I}<i\leq i_{\mathrm{B}}^{I} and 1jn1\leq j\leq n, 𝐷𝑅[i,j].U0\mathit{DR}[i,j].U\geq 0.

Now we obtain the next lemma, which is a key to our algorithm.

Lemma 5

For any cell (i,j)(i,j) with iTI+1<iiBIi_{\mathrm{T}}^{I}+1<i\leq i_{\mathrm{B}}^{I} and jLJ+1<jjRJj_{\mathrm{L}}^{J}+1<j\leq j_{\mathrm{R}}^{J}, 𝐷𝑅[i,j]=𝐷𝑅[i1,j1]\mathit{DR}[i,j]=\mathit{DR}[i-1,j-1].

  • Proof. By Corollary 1, 𝐷𝑅[i1,j].L0\mathit{DR}[i-1,j].L\geq 0 and 𝐷𝑅[i,j1].U0\mathit{DR}[i,j-1].U\geq 0 for iTI+1<iiBIi_{\mathrm{T}}^{I}+1<i\leq i_{\mathrm{B}}^{I} and jLJ+1<jjRJj_{\mathrm{L}}^{J}+1<j\leq j_{\mathrm{R}}^{J}. Thus clearly min{𝐷𝑅[i1,j].L,𝐷𝑅[i,j1].U,0}=0\min\{\mathit{DR}[i-1,j].L,\mathit{DR}[i,j-1].U,0\}=0. Therefore, for the value of zz in Lemma 2, we have z=(aibj)2z=(a_{i}-b_{j})^{2}, which leads to

    𝐷𝑅[i,j].U\displaystyle\mathit{DR}[i,j].U =\displaystyle= (aibj)2𝐷𝑅[i1,j].L\displaystyle(a_{i}-b_{j})^{2}-\mathit{DR}[i-1,j].L (2)
    𝐷𝑅[i,j].L\displaystyle\mathit{DR}[i,j].L =\displaystyle= (aibj)2𝐷𝑅[i,j1].U\displaystyle(a_{i}-b_{j})^{2}-\mathit{DR}[i,j-1].U (3)

    By applying equation (3) to the 𝐷𝑅[i1,j].L\mathit{DR}[i-1,j].L term of equation (2), we get

    𝐷𝑅[i,j].U=(aibj)2((ai1bj)2𝐷𝑅[i1,j1].U).\mathit{DR}[i,j].U=(a_{i}-b_{j})^{2}-((a_{i-1}-b_{j})^{2}-\mathit{DR}[i-1,j-1].U).

    Recall that ai=ai1a_{i}=a_{i-1}, since we are considering cells in the same box I,J\mathcal{B}^{I,J}. Thus 𝐷𝑅[i,j].U=𝐷𝑅[i1,j1].U\mathit{DR}[i,j].U=\mathit{DR}[i-1,j-1].U. By applying equation (2) to the 𝐷𝑅[i,j1].U\mathit{DR}[i,j-1].U term of equation (3), we similarly obtain 𝐷𝑅[i,j].L=𝐷𝑅[i1,j1].L\mathit{DR}[i,j].L=\mathit{DR}[i-1,j-1].L. \Box

For any iTI+1<iiBIi_{\mathrm{T}}^{I}+1<i\leq i_{\mathrm{B}}^{I} and jLJ+1<jjRJj_{\mathrm{L}}^{J}+1<j\leq j_{\mathrm{R}}^{J}, let \ell be the smallest positive integer that satisfies i=iTI+1i-\ell=i_{\mathrm{T}}^{I}+1 or j=jLJ+1j-\ell=j_{\mathrm{L}}^{J}+1. By Lemma 5, for any cell (i,j)(i,j) on the bottom row iBIi_{\mathrm{B}}^{I} or on the right column jRJj_{\mathrm{R}}^{J}, we have 𝐷𝑆[i,j]=𝐷𝑅[i,j]\mathit{DS}[i,j]=\mathit{DR}[i-\ell,j-\ell] and 𝐷𝑆[i,j]=𝐷𝑅[i,j]\mathit{DS^{\prime}}[i,j]=\mathit{DR^{\prime}}[i-\ell,j-\ell]. This means that 𝐷𝑆[i,j]𝐷𝑆[i,j]\mathit{DS}[i,j]\neq\mathit{DS^{\prime}}[i,j] iff 𝐷𝑅[i,j]𝐷𝑅[i,j]\mathit{DR}[i-\ell,j-\ell]\neq\mathit{DR^{\prime}}[i-\ell,j-\ell]. Thus, finding cells (i,j)(i,j) with 𝐷𝑆[i,j]𝐷𝑆[i,j]\mathit{DS}[i,j]\neq\mathit{DS^{\prime}}[i,j] on the bottom row iBIi_{\mathrm{B}}^{I} or on the right column jRJj_{\mathrm{R}}^{J} reduces to finding cells (i,j)(i^{\prime},j^{\prime}) with 𝐷𝑅[i,j]𝐷𝑅[i,j]\mathit{DR}[i^{\prime},j^{\prime}]\neq\mathit{DR^{\prime}}[i^{\prime},j^{\prime}] on the row iTI+1i_{\mathrm{T}}^{I}+1 or on the column jLJ+1j_{\mathrm{L}}^{J}+1. See Figure 8.

[Uncaptioned image] Figure 7: Diagonal propagation of 𝐷𝑅[i,j]𝐷𝑅[i,j]\mathit{DR}[i,j]\neq\mathit{DR^{\prime}}[i,j] inside box I,J\mathcal{B}^{I,J}. [Uncaptioned image] Figure 8: Illustration for the case where s>ts>t in Lemma 6.

We have shown how to compute ΔTI,J\Delta_{\mathrm{T}}^{I,J} for the top row iTIi_{\mathrm{T}}^{I} and ΔLI,J\Delta_{\mathrm{L}}^{I,J} for the left column jLJj_{\mathrm{L}}^{J}. We here explain how to use ΔTI,J\Delta_{\mathrm{T}}^{I,J} (we can use ΔLI,J\Delta_{\mathrm{L}}^{I,J} in a symmetric manner). We process column indices in ΔTI,J\Delta_{\mathrm{T}}^{I,J} in increasing order, and suppose that we are currently processing column index j^ΔTI,J\hat{j}\in\Delta_{\mathrm{T}}^{I,J} in the top row iTIi_{\mathrm{T}}^{I} of the current box I,J\mathcal{B}^{I,J}. We check whether 𝐷𝑅[iTI+1,j^]𝐷𝑅[iTI+1,j^]\mathit{DR}[i_{\mathrm{T}}^{I}+1,\hat{j}]\neq\mathit{DR^{\prime}}[i_{\mathrm{T}}^{I}+1,\hat{j}]. For this sake, we need to know the values of 𝐷𝑅[iTI+1,j^]\mathit{DR}[i_{\mathrm{T}}^{I}+1,\hat{j}] and 𝐷𝑅[iTI+1,j^]\mathit{DR^{\prime}}[i_{\mathrm{T}}^{I}+1,\hat{j}]. Recall that, by Lemma 5, 𝐷𝑅[iTI+1,j^]\mathit{DR}[i_{\mathrm{T}}^{I}+1,\hat{j}] is equal to 𝐷𝑅[iTI+1+h,j^+h]\mathit{DR}[i_{\mathrm{T}}^{I}+1+h,\hat{j}+h] (=𝐷𝑆[iTI+1+h,j^+h]=\mathit{DS}[i_{\mathrm{T}}^{I}+1+h,\hat{j}+h]) on the bottom row iBIi_{\mathrm{B}}^{I} (if iTI+1+h=iBIi_{\mathrm{T}}^{I}+1+h=i_{\mathrm{B}}^{I}) or on the right column jRJj_{\mathrm{R}}^{J} (if j^+h=jRJ\hat{j}+h=j_{\mathrm{R}}^{J}), where h>0h>0. Since the cell (iTI+1+h,j^+h)(i_{\mathrm{T}}^{I}+1+h,\hat{j}+h) can be retrieved in constant time by the diagonal link from the cell (iTI,j^1)(i_{\mathrm{T}}^{I},\hat{j}-1) on the top row iTIi_{\mathrm{T}}^{I}, we can compute 𝐷𝑅[iTI+1,j^]\mathit{DR}[i_{\mathrm{T}}^{I}+1,\hat{j}] in constant time, applying Lemma 5 to the upper-left direction.

Computing 𝐷𝑅[iTI+1,j^]\mathit{DR^{\prime}}[i_{\mathrm{T}}^{I}+1,\hat{j}] is more involved. By Lemma 2, we can compute 𝐷𝑅[iTI+1,j^]\mathit{DR^{\prime}}[i_{\mathrm{T}}^{I}+1,\hat{j}] from 𝐷𝑅[iTI,j^].L\mathit{DR^{\prime}}[i_{\mathrm{T}}^{I},\hat{j}].L and 𝐷𝑅[iTI+1,j^1].U\mathit{DR^{\prime}}[i_{\mathrm{T}}^{I}+1,\hat{j}-1].U. Since (iTI,j^)(i_{\mathrm{T}}^{I},\hat{j}) is on the top row iTIi_{\mathrm{T}}^{I}, 𝐷𝑅[iTI,j^].L=𝐷𝑆[iTI,j^].L\mathit{DR^{\prime}}[i_{\mathrm{T}}^{I},\hat{j}].L=\mathit{DS^{\prime}}[i_{\mathrm{T}}^{I},\hat{j}].L has already been computed. Consider to compute 𝐷𝑅[iTI+1,j^1].U\mathit{DR^{\prime}}[i_{\mathrm{T}}^{I}+1,\hat{j}-1].U. Since 𝐷𝑅[iTI+1,j^1].U=D[iTI+1,j^1]D[iTI,j^1]\mathit{DR^{\prime}}[i_{\mathrm{T}}^{I}+1,\hat{j}-1].U=\mathit{D^{\prime}}[i_{\mathrm{T}}^{I}+1,\hat{j}-1]-\mathit{D^{\prime}}[i_{\mathrm{T}}^{I},\hat{j}-1], it suffices to compute D[iTI,j^1]\mathit{D^{\prime}}[i_{\mathrm{T}}^{I},\hat{j}-1] and D[iTI+1,j^1]\mathit{D^{\prime}}[i_{\mathrm{T}}^{I}+1,\hat{j}-1]. By definition, D[iTI,j^1]=D[iTI,j^]𝐷𝑅[iTI,j^].L\mathit{D^{\prime}}[i_{\mathrm{T}}^{I},\hat{j}-1]=\mathit{D^{\prime}}[i_{\mathrm{T}}^{I},\hat{j}]-\mathit{DR^{\prime}}[i_{\mathrm{T}}^{I},\hat{j}].L. Since j^ΔTI,J\hat{j}\in\Delta_{\mathrm{T}}^{I,J}, we can retrieve the value of D[iTI,j^]\mathit{D^{\prime}}[i_{\mathrm{T}}^{I},\hat{j}] from the current element of the list ΔTI,J\Delta_{\mathrm{T}}^{I,J}, in O(1)O(1) time. Since 𝐷𝑅[iTI,j^].L=𝐷𝑆[iTI,j^].L\mathit{DR^{\prime}}[i_{\mathrm{T}}^{I},\hat{j}].L=\mathit{DS^{\prime}}[i_{\mathrm{T}}^{I},\hat{j}].L, we can compute D[iTI,j^1]\mathit{D^{\prime}}[i_{\mathrm{T}}^{I},\hat{j}-1] in O(1)O(1) time.

What remains is how to compute D[iTI+1,j^1]\mathit{D^{\prime}}[i_{\mathrm{T}}^{I}+1,\hat{j}-1]. We use the next lemma.

Lemma 6

For any cell (i,j)(i,j) with iTI+1<iiBIi_{\mathrm{T}}^{I}+1<i\leq i_{\mathrm{B}}^{I} and jLJ+1<jjRJj_{\mathrm{L}}^{J}+1<j\leq j_{\mathrm{R}}^{J}, let s=jjLJs=j-j_{\mathrm{L}}^{J} and t=iiTIt=i-i_{\mathrm{T}}^{I}. Then,

D[i,j]=D[iTI+max{ts,0},jLJ+max{st,0}]+min{s,t}(aibj)2.\mathit{D}[i,j]=\mathit{D}[i_{\mathrm{T}}^{I}+\max\{t-s,0\},j_{\mathrm{L}}^{J}+\max\{s-t,0\}]+\min\{s,t\}\cdot(a_{i}-b_{j})^{2}.
  • Proof. Consider the case where s>ts>t. By applying Lemma 4 to recurrence (1), we obtain D[i,j]=D[i1,j1]+(aibj)2\mathit{D}[i,j]=\mathit{D}[i-1,j-1]+(a_{i}-b_{j})^{2}. Since ai=aia_{i}=a_{i^{\prime}} and bj=bjb_{j}=b_{j^{\prime}} for iTI<i<ii_{\mathrm{T}}^{I}<i^{\prime}<i and jLJ<j<jj_{\mathrm{L}}^{J}<j^{\prime}<j, by repeatedly applying Lemma 4 to the above equation, we get D[i,j]=D[iTI,jLJ+(st)]+t(aibj)2\mathit{D}[i,j]=\mathit{D}[i_{\mathrm{T}}^{I},j_{\mathrm{L}}^{J}+(s-t)]+t\cdot(a_{i}-b_{j})^{2}. See also Figure 8. The case sts\leq t is similar and we obtain D[i,j]=D[iTI+(ts),jLJ]+s(aibj)2\mathit{D}[i,j]=\mathit{D}[i_{\mathrm{T}}^{I}+(t-s),j_{\mathrm{L}}^{J}]+s\cdot(a_{i}-b_{j})^{2}. By merging the two equations for s>ts>t and sts\leq t, we obtain the desired equation. \Box

Let k=j^jLJk=\hat{j}-j_{\mathrm{L}}^{J}. Since jLJ+1<j^j_{\mathrm{L}}^{J}+1<\hat{j}, k2k\geq 2. Since s=j^1jLJ=k1s=\hat{j}-1-j_{\mathrm{L}}^{J}=k-1, t=iTI+1iTI=1t=i_{\mathrm{T}}^{I}+1-i_{\mathrm{T}}^{I}=1, and k2k\geq 2, we get sts\geq t. Thus it follows from Lemma 6 that

D[iTI+1,j^1]=D[iTI,jLJ+(k2)]+(A[iTI]B[j^])2=D[iTI,j^2]+(A[iTI]B[j^])2.\mathit{D^{\prime}}[i_{\mathrm{T}}^{I}+1,\hat{j}-1]\!=\!\mathit{D^{\prime}}[i_{\mathrm{T}}^{I},j_{\mathrm{L}}^{J}+(k-2)]+(A[i_{\mathrm{T}}^{I}]-B[\hat{j}])^{2}\!=\!\mathit{D^{\prime}}[i_{\mathrm{T}}^{I},\hat{j}-2]+(A[i_{\mathrm{T}}^{I}]-B[\hat{j}])^{2}.

Since the value D[iTI,j^]\mathit{D^{\prime}}[i_{\mathrm{T}}^{I},\hat{j}] is already computed and stored in the corresponding element of ΔTI,J\Delta_{T}^{I,J}, we can compute, in O(1)O(1) time, D[iTI,j^2]\mathit{D^{\prime}}[i_{\mathrm{T}}^{I},\hat{j}-2] by

D[iTI,j^2]\displaystyle\mathit{D^{\prime}}[i_{\mathrm{T}}^{I},\hat{j}-2] =\displaystyle= D[iTI,j^]𝐷𝑅[iTI,j^].L𝐷𝑅[iTI,j^1].L\displaystyle\mathit{D^{\prime}}[i_{\mathrm{T}}^{I},\hat{j}]-\mathit{DR^{\prime}}[i_{\mathrm{T}}^{I},\hat{j}].L-\mathit{DR^{\prime}}[i_{\mathrm{T}}^{I},\hat{j}-1].L
=\displaystyle= D[iTI,j^]𝐷𝑆[iTI,j^].L𝐷𝑆[iTI,j^1].L.\displaystyle\mathit{D^{\prime}}[i_{\mathrm{T}}^{I},\hat{j}]-\mathit{DS^{\prime}}[i_{\mathrm{T}}^{I},\hat{j}].L-\mathit{DS^{\prime}}[i_{\mathrm{T}}^{I},\hat{j}-1].L.

Thus, we can determine in O(1)O(1) time whether 𝐷𝑅[iTI+1,j^]𝐷𝑅[iTI+1,j^]\mathit{DR}[i_{\mathrm{T}}^{I}+1,\hat{j}]\neq\mathit{DR^{\prime}}[i_{\mathrm{T}}^{I}+1,\hat{j}], and hence whether 𝐷𝑆[iTI+1+h,j^+h]𝐷𝑆[iTI+1+h,j^+h]\mathit{DS}[i_{\mathrm{T}}^{I}+1+h,\hat{j}+h]\neq\mathit{DS^{\prime}}[i_{\mathrm{T}}^{I}+1+h,\hat{j}+h].

Suppose 𝐷𝑆[iTI+1+h,j^+h]𝐷𝑆[iTI+1+h,j^+h]\mathit{DS}[i_{\mathrm{T}}^{I}+1+h,\hat{j}+h]\neq\mathit{DS^{\prime}}[i_{\mathrm{T}}^{I}+1+h,\hat{j}+h]. Then we need to compute D[iTI+1+h,j^+h]\mathit{D^{\prime}}[i_{\mathrm{T}}^{I}+1+h,\hat{j}+h]. This can be computed in constant time using Lemma 6, by D[iTI+1+h,j^+h]=D[iTI,j^1]+(h+1)(A[iTI]B[j^])2\mathit{D^{\prime}}[i_{\mathrm{T}}^{I}+1+h,\hat{j}+h]=\mathit{D^{\prime}}[i_{\mathrm{T}}^{I},\hat{j}-1]+(h+1)\cdot(A[i_{\mathrm{T}}^{I}]-B[\hat{j}])^{2}, where D[iTI,j^1]=D[iTI,j^]𝐷𝑅[iTI,j^].L\mathit{D^{\prime}}[i_{\mathrm{T}}^{I},\hat{j}-1]=\mathit{D^{\prime}}[i_{\mathrm{T}}^{I},\hat{j}]-\mathit{DR^{\prime}}[i_{\mathrm{T}}^{I},\hat{j}].L. We add the column index j^+h\hat{j}+h to list ΔBI,J\Delta_{\mathrm{B}}^{I,J} if iTI+1+h=iBIi_{\mathrm{T}}^{I}+1+h=i_{\mathrm{B}}^{I}, and/or add the row index iTI+1+hi_{\mathrm{T}}^{I}+1+h to list ΔRI,J\Delta_{\mathrm{R}}^{I,J} if j^+h=jRJ\hat{j}+h=j_{\mathrm{R}}^{J}, together with the value of D[iTI+1+h,j^+h]\mathit{D^{\prime}}[i_{\mathrm{T}}^{I}+1+h,\hat{j}+h].

Refer to caption
Figure 9: Illustration for the process of computing 𝐷𝑅[iTI+1,j^]\mathit{DR^{\prime}}[i_{\mathrm{T}}^{I}+1,\hat{j}]. The gray cells show those for which both values of D\mathit{D^{\prime}} and 𝐷𝑅\mathit{DR^{\prime}} are unknown, the vertically striped cells show those for which only the value of D\mathit{D^{\prime}} is known, the horizontally striped cells show those for which only the value of 𝐷𝑅\mathit{DR^{\prime}} is known, and the white cells show those for which both values of D\mathit{D^{\prime}} and 𝐷𝑅\mathit{DR^{\prime}} are known. At the final step (lower-right), the desired value 𝐷𝑅[iTI+1,j^]\mathit{DR^{\prime}}[i_{\mathrm{T}}^{I}+1,\hat{j}] has been computed.

The above process of computing 𝐷𝑅[iTI+1,j^]\mathit{DR^{\prime}}[i_{\mathrm{T}}^{I}+1,\hat{j}] is illustrated in Figure 9. Suppose we have processed cell (iTI+1,j^)(i_{\mathrm{T}}^{I}+1,\hat{j}). We perform the same procedure as above for the right-neighbor cells (iTI+1,j^+q)(i_{\mathrm{T}}^{I}+1,\hat{j}+q) with q=1q=1 and increasing qq, until encountering the first cell (iTI+1,j^+q)(i_{\mathrm{T}}^{I}+1,\hat{j}+q) such that (1) 𝐷𝑅[iTI+1,j^+q]=𝐷𝑅[iTI+1,j^+q]\mathit{DR}[i_{\mathrm{T}}^{I}+1,\hat{j}+q]=\mathit{DR^{\prime}}[i_{\mathrm{T}}^{I}+1,\hat{j}+q], (2) j^+qΔTI,J\hat{j}+q\in\Delta_{\mathrm{T}}^{I,J}, or (3) j^+q=jRJ+1\hat{j}+q=j_{\mathrm{R}}^{J}+1. In cases (1) and (2), we remove j^\hat{j} from ΔTI,J\Delta_{\mathrm{T}}^{I,J} and move to the next element of in ΔTI,J\Delta_{\mathrm{T}}^{I,J}. We are done when we encounter case (3) or ΔTI,J\Delta_{\mathrm{T}}^{I,J} becomes empty. By Lemma 5, the total number of cells (iTI+1,j^+q)(i_{\mathrm{T}}^{I}+1,\hat{j}+q) for all boxes in 𝐷𝑆\mathit{DS^{\prime}} is O(#chg)O(\#_{\mathrm{chg}}).

Batched updates. Our algorithm can efficiently support batched updates for insertion, deletion, substitution of a run of characters.

Theorem 2

Let B\mathit{B^{\prime}} be the string after a run-wise edit operation on BB, and let n=|B|n^{\prime}=|\mathit{B^{\prime}}|. 𝐷𝑆\mathit{DS} can be updated to 𝐷𝑆\mathit{DS^{\prime}} in O(m+max{n,n}+#chg)O(m+\max\{n,n^{\prime}\}+\#_{\mathrm{chg}}^{\prime}) time where #chg\#_{\mathrm{chg}}^{\prime} denotes the number of cells where the values differ between 𝐷𝑆\mathit{DS} and 𝐷𝑆\mathit{DS^{\prime}}.

Since nn^{\prime} is the length of the string |B||\mathit{B^{\prime}}| after modification, #chg\#_{\mathrm{chg}}^{\prime} in Theorem 2 is bounded by O(mN+max{n,n}M)O(mN+\max\{n^{\prime},n\}M). Thus, we can perform a batched run-wise update on our sparse table 𝐷𝑆\mathit{DS} in worst-case O(m+max{n,n}+#chg)O(mN+max{n,n}M)O(m+\max\{n,n^{\prime}\}+\#_{\mathrm{chg}}^{\prime})\subseteq O(mN+\max\{n,n^{\prime}\}M) time. Let kk be the total number of characters that are involved in a run-wise batched edit operation from BB to B\mathit{B^{\prime}} (namely, a run of kk characters is inserted, a run of kk characters is deleted, or a run of k1k_{1} characters is substituted for a run of k2k_{2} characters with k=k1+k2k=k_{1}+k_{2}). Then a naïve kk-time applications of Theorem 1 to the run-wise batched edit operation requires O(k(m+n+#chg))O(k(mN+nM))O(k(m+n+\#_{\mathrm{chg}}))\subseteq O(k(mN+nM)) time. Since nn+kn^{\prime}\leq n+k, the batched update of Theorem 2 is faster than the naïve method by a factor of kk whenever kO(n)k\in O(n). We also remark that our batched update algorithm is at least as efficient as building the sparse DP table of Froese et al.’s algorithm [4] from scratch using Θ(mN+max{n,n}M)\Theta(mN+\max\{n,n^{\prime}\}M) time and space.

3.2 Evaluation of #chg\#_{\mathrm{chg}}

As was proven previously, our 𝐃𝟐𝐓𝐖\mathbf{D^{2}TW} algorithm works in O(m+n+#chg)O(m+n+\#_{\mathrm{chg}}) time per edit operation on one of the strings. In this subsection, we analyze how large the #chg\#_{\mathrm{chg}} would be in theory and practice. Although #chg=Θ(mN+nM)\#_{\mathrm{chg}}=\Theta(mN+nM) in the worst case for some strings (Theorem 3), our preliminary experiments shown below suggest that #chg\#_{\mathrm{chg}} can be much smaller than mN+nMmN+nM in many cases.

Theorem 3

Consider strings A=A1kAMkA=A_{1}^{k}\cdots A_{M}^{k} and B=B1lBNlB=B_{1}^{l}\cdots B_{N}^{l} of RLE sizes MM and NN, respectively, where |A|=m=kM|A|=m=kM and |B|=n=lN|B|=n=lN. We assume lexicographical orders of characters as AI1>AIA_{I-1}>A_{I} for 1<IM1<I\leq M, BJ1<BJB_{J-1}<B_{J} for 1<JN1<J\leq N, and AM>BNA_{M}>B_{N}. If we delete B[1]B[1] from BB, then #chg=Ω(mN+nM)\#_{\mathrm{chg}}=\Omega(mN+nM).

We have also conducted preliminary experiments to estimate practical values of #chg\#_{\mathrm{chg}}, using randomly generated strings. For simplicity, we set m=nm=n and M=NM=N for all experiments. We fixed the alphabet size |Σ|=26|\Sigma|=26 throughout our experiments.

Refer to caption
Refer to caption
Figure 10: Comparisons of the values of #chg\#_{\mathrm{chg}} and the sizes of the sparse table 𝐷𝑆\mathit{DS} on two randomly generated strings AA and BB. Upper: With fixed RLE size N=M=50N=M=50 and varying lengths n=mn=m from 5050 to 500500 (horizontal axis). Lower: With fixed length n=m=500n=m=500 and varying RLE sizes N=MN=M from 1010 to 500500 (horizontal axis).

In the first experiment, we fixed the RLE size M=N=50M=N=50, randomly generated two strings AA and BB of varying lengths m=nm=n from 5050 to 500500, and compared the values of #chg\#_{\mathrm{chg}} and the sizes of 𝐷𝑆\mathit{DS}. For each mm, we randomly generated 50 pairs of strings AA and BB of length mm each, and took the average values for #chg\#_{\mathrm{chg}} and the sizes of 𝐷𝑆\mathit{DS} when B[1]B[1] was deleted from BB. In the second experiment, we fixed the string length m=n=500m=n=500 and randomly generated two strings AA and BB of varying RLE sizes M=NM=N from 1010 to 500500. For each MM, we randomly generated 50 pairs of strings AA and BB of RLE size MM, and took the average values for #chg\#_{\mathrm{chg}} and the sizes of 𝐷𝑆\mathit{DS} when B[1]B[1] was deleted from BB. The results are shown in Figure 10. In both experiments, #chg\#_{\mathrm{chg}} is much smaller than the size of 𝐷𝑆\mathit{DS}. It is noteworthy that even when the values of MM (=N=N) and mm (=n=n) are close, the value of #chg\#_{\mathrm{chg}} stayed very small. This suggests that our algorithm can be fast also on strings that are not RLE-compressible.

Acknowledgments

This work was supported by JSPS KAKENHI Grant Numbers JP18K18002 (YN), JP17H01697 (SI), JP20H04141 (HB), JP18H04098 (MT), and JST PRESTO Grant Number JPMJPR1922 (SI).

References

  • [1] A. Abboud, A. Backurs, and V. V. Williams. Tight hardness results for LCS and other sequence similarity measures. In FOCS 2015, pages 59–78, 2015.
  • [2] K. Bringmann and M. Künnemann. Quadratic conditional lower bounds for string problems and dynamic time warping. In FOCS 2015, pages 79–97, 2015.
  • [3] P. Charalampopoulos, T. Kociumaka, and S. Mozes. Dynamic string alignment. In CPM 2020, pages 9:1–9:13, 2020.
  • [4] V. Froese, B. J. Jain, M. Rymar, and M. Weller. Fast exact dynamic time warping on run-length encoded time series. CoRR, abs/1903.03003, 2020.
  • [5] H. Hyyrö and S. Inenaga. Compacting a dynamic edit distance table by RLE compression. In SOFSEM 2016, pages 302–313, 2016.
  • [6] H. Hyyrö and S. Inenaga. Dynamic RLE-compressed edit distance tables under general weighted cost functions. Int. J. Found. Comput. Sci., 29(4):623–645, 2018.
  • [7] H. Hyyrö, K. Narisawa, and S. Inenaga. Dynamic edit distance table under a general weighted cost function. In SOFSEM 2010, pages 515–527, 2010.
  • [8] H. Hyyrö, K. Narisawa, and S. Inenaga. Dynamic edit distance table under a general weighted cost function. J. Disc. Algo., 34:2–17, 2015.
  • [9] S.-R. Kim and K. Park. A dynamic edit distance table. J. Disc. Algo., 2:302–312, 2004.
  • [10] W. Kuszmaul. Dynamic time warping in strongly subquadratic time: Algorithms for the low-distance regime and approximate evaluation. In ICALP 2019, pages 80:1–80:15, 2019.
  • [11] A. Nishi, Y. Nakashima, S. Inenaga, H. Bannai, and M. Takeda. Towards efficient interactive computation of dynamic time warping distance. CoRR, abs/2005.08190, 2020.
  • [12] H. Sakoe and S. Chiba. Dynamic programming algorithm optimization for spoken word recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, 26(1):43–49, 1978.
  • [13] J. P. Schmidt. All highest scoring paths in weighted grid graphs and their application in finding all approximate repeats in strings. SIAM J. Comp., 27(4):972–992, 1998.

Appendix A Appendix: Omitted proofs

A.1 Proof for Theorem 3

To prove Theorem 3, we establish the following lemma.

Lemma 7

Let D\mathit{D} be the dynamic programming table for the above strings AA and BB. Then,

min{D[i1,j],D[i,j1],D[i1,j1]}={D[i,j1]if i<j,D[i1,j]if i>j,D[i1,j1]if i=j.\min\{\mathit{D}[i-1,j],\mathit{D}[i,j-1],\mathit{D}[i-1,j-1]\}=\begin{cases}\mathit{D}[i,j-1]&\mbox{if }i<j,\\ \mathit{D}[i-1,j]&\mbox{if }i>j,\\ \mathit{D}[i-1,j-1]&\mbox{if }i=j.\\ \end{cases}
  • Proof. By recurrence (1), the argument holds for any cells (1,j)(1,j) and (i,1)(i,1), where 1im1\leq i\leq m and 1jn1\leq j\leq n. For any cell (i,j)(i,j) with i>1i>1 or j>1j>1, suppose that the argument holds for any (i,j)(i^{\prime},j^{\prime}) with i<ii^{\prime}<i and j<jj^{\prime}<j. We consider the five following cases:

    1. 1.

      Case i<j1i<j-1: For the cell (i1,j)(i-1,j), it follows from the inductive hypothesis and recurrence (1) that D[i1,j]=D[i1,j1]+(ai1bj)2\mathit{D}[i-1,j]=\mathit{D}[i-1,j-1]+(a_{i-1}-b_{j})^{2}. Since (ai1bj)20(a_{i-1}-b_{j})^{2}\geq 0, D[i1,j]D[i1,j1]\mathit{D}[i-1,j]\geq\mathit{D}[i-1,j-1]. Similarly, for the cells (i1,j1)(i-1,j-1) and (i,j1)(i,j-1), D[i1,j1]=D[i1,j2]+(ai1bj1)2\mathit{D}[i-1,j-1]=\mathit{D}[i-1,j-2]+(a_{i-1}-b_{j-1})^{2} and D[i,j1]=D[i,j2]+(aibj1)2\mathit{D}[i,j-1]=\mathit{D}[i,j-2]+(a_{i}-b_{j-1})^{2}. Since ai1aia_{i-1}\geq a_{i}, ai1bj1aibj1a_{i-1}-b_{j-1}\geq a_{i}-b_{j-1}. Because aibj1>0a_{i}-b_{j-1}>0, it holds that (ai1bj1)2(aibj1)2(a_{i-1}-b_{j-1})^{2}\geq(a_{i}-b_{j-1})^{2}. By the inductive hypothesis, D[i1,j2]D[i,j2]\mathit{D}[i-1,j-2]\geq\mathit{D}[i,j-2]. Thus we have D[i1,j2]+(ai1bj1)2D[i,j2]+(aibj1)2\mathit{D}[i-1,j-2]+(a_{i-1}-b_{j-1})^{2}\geq\mathit{D}[i,j-2]+(a_{i}-b_{j-1})^{2}, which implies D[i1,j1]D[i,j1]\mathit{D}[i-1,j-1]\geq\mathit{D}[i,j-1].

    2. 2.

      Case i=j1i=j-1: Analogously to Case 1, we get D[i1,j]D[i1,j1]D[i-1,j]\geq D[i-1,j-1]. For the cells (i1,j1)(i-1,j-1) and (i,j1)(i,j-1), it follows from the inductive hypothesis and recurrence (1) that D[i1,j1]=D[i1,j2]+(ai1bj1)2\mathit{D}[i-1,j-1]=\mathit{D}[i-1,j-2]+(a_{i-1}-b_{j-1})^{2} and D[i,j1]=D[i1,j2]+(aibj1)2\mathit{D}[i,j-1]=\mathit{D}[i-1,j-2]+(a_{i}-b_{j-1})^{2}. Since (ai1bj1)2(aibj1)2(a_{i-1}-b_{j-1})^{2}\geq(a_{i}-b_{j-1})^{2}, D[i1,j2]+(ai1bj1)2D[i1,j2]+(aibj1)2\mathit{D}[i-1,j-2]+(a_{i-1}-b_{j-1})^{2}\geq\mathit{D}[i-1,j-2]+(a_{i}-b_{j-1})^{2}, which implies D[i1,j1]D[i,j1]\mathit{D}[i-1,j-1]\geq\mathit{D}[i,j-1].

    3. 3.

      Case i1>ji-1>j: By symmetric arguments to Case 1.

    4. 4.

      Case i1=ji-1=j: By symmetric arguments to Case 2.

    5. 5.

      Case i=ji=j: For the cells (i1,j)(i-1,j) and (i,j1)(i,j-1), by the inductive hypothesis min{D[i2,j],D[i1,j1],D[i1,j]}=D[i1,j1]\min\{\mathit{D}[i-2,j],\mathit{D}[i-1,j-1],\mathit{D}[i-1,j]\}=D[i-1,j-1] and min{D[i1,j1],D[i,j2],D[i,j1]}=D[i1,j1]\min\{\mathit{D}[i-1,j-1],\mathit{D}[i,j-2],\mathit{D}[i,j-1]\}=D[i-1,j-1]. Thus D[i1,j]D[i1,j1]\mathit{D}[i-1,j]\geq\mathit{D}[i-1,j-1] and D[i,j1]D[i1,j1]\mathit{D}[i,j-1]\geq D[i-1,j-1].

    \Box

We are ready to prove Theorem 3.

  • Proof. For simplicity, we assume that the column indices of D\mathit{D} begin with 0. In the grid graph 𝒢m,n\mathcal{G}_{m,n} over D\mathit{D}, we assign a weight (aibj)2(a_{i}-b_{j})^{2} to each in-coming edge of cell (i,j)(i,j). We also consider the grid graph 𝒢m1,n\mathcal{G}_{m-1,n} over D\mathit{D^{\prime}} obtained by removing (1,0),,(m,0)(1,0),\ldots,(m,0) from 𝒢m,n\mathcal{G}_{m,n}.

    Consider a cell (i,j)(i,j) with 1<i<j1<i<j that is on the top row of some box I,J\mathcal{B}^{I,J}, namely i=iTI=Ik+1i=i_{\mathrm{T}}^{I}=Ik+1 for some 1IM11\leq I\leq M-1. By Lemma 7, p1=(1,0),,(i,i1),,(i,j)p_{1}=(1,0),\dots,(i,i-1),\dots,(i,j) is the minimum weight path from (1,0)(1,0) to (i,j)(i,j). Similarly, p1=(1,1),,(i,i),,(i,j)p^{\prime}_{1}=(1,1),\dots,(i,i),\dots,(i,j) is the minimum weight path from (1,1)(1,1) to (i,j)(i,j).

    For the cell (i1,j)(i-1,j) that is the upper neighbor of (i,j)(i,j) and is on the bottom row of box I1,J\mathcal{B}^{I-1,J}, by analogous arguments to the above, p2=(1,0),,(i1,i2),,(i1,j)p_{2}=(1,0),\dots,(i-1,i-2),\dots,(i-1,j) is the minimum weight path from (1,0)(1,0) to (i1,j)(i-1,j), and p2=(1,1),,(i1,i1),,(i1,j)p^{\prime}_{2}=(1,1),\dots,(i-1,i-1),\dots,(i-1,j) is the minimum weight path from (1,1)(1,1) to (i1,j)(i-1,j).

    Let p3p_{3} be the sub-path of p1p_{1} and p2p_{2} ending at (i1,i2)(i-1,i-2), p4p_{4} the sub-path of p1p^{\prime}_{1} and p2p^{\prime}_{2} ending at (i1,i1)(i-1,i-1), p5p_{5} the sub-path of p1p_{1} and p1p^{\prime}_{1} from (i,i)(i,i) to (i,j)(i,j), and p6p_{6} the sub-path of p2p_{2} and p2p^{\prime}_{2} from (i1,i1)(i-1,i-1) to (i1,j)(i-1,j). Let e1e_{1} be the edge from (i1,i2)(i-1,i-2) to (i,i1)(i,i-1), e2e_{2} be the edge from (i,i1)(i,i-1) to (i,i)(i,i), e3e_{3} be the edge from (i1,i2)(i-1,i-2) to (i1,i1)(i-1,i-1), and e4e_{4} be the edge from (i1,i1)(i-1,i-1) to (i,i)(i,i). See Figure 11 that depicts these paths and edges.

    Refer to caption
    Figure 11: Minimum weight paths to (i,j)(i,j) and (i1,j)(i-1,j) over D\mathit{D} and D\mathit{D^{\prime}}.

For any path pp in 𝒢m,n\mathcal{G}_{m,n}, let 𝗐(p)\mathsf{w}(p) denote the total weights of edges in pp. Now we have 𝗐(p1)=𝗐(p3)+𝗐(e1)+𝗐(e2)+𝗐(p5)\mathsf{w}(p_{1})=\mathsf{w}(p_{3})+\mathsf{w}(e_{1})+\mathsf{w}(e_{2})+\mathsf{w}(p_{5}) and 𝗐(p1)=𝗐(p4)+𝗐(e4)+𝗐(p5)\mathsf{w}(p^{\prime}_{1})=\mathsf{w}(p_{4})+\mathsf{w}(e_{4})+\mathsf{w}(p_{5}). By the definition of DTW, D[i,j]\mathit{D}[i,j] stores the cost of the minimum weight path from (1,0)(1,0) to (i,j)(i,j), and D[i,j]\mathit{D^{\prime}}[i,j] stores the cost of the minimum weight path from (1,1)(1,1) to (i,j)(i,j). Thus D[i,j]=𝗐(p3)+𝗐(e1)+𝗐(e2)+𝗐(p5)\mathit{D}[i,j]=\mathsf{w}(p_{3})+\mathsf{w}(e_{1})+\mathsf{w}(e_{2})+\mathsf{w}(p_{5}) and D[i,j]=𝗐(p4)+𝗐(e4)+𝗐(p5)\mathit{D^{\prime}}[i,j]=\mathsf{w}(p_{4})+\mathsf{w}(e_{4})+\mathsf{w}(p_{5}). Similarly, D[i1,j]=𝗐(p3)+𝗐(e3)+𝗐(p6)\mathit{D}[i-1,j]=\mathsf{w}(p_{3})+\mathsf{w}(e_{3})+\mathsf{w}(p_{6}) and D[i1,j]=𝗐(p4)+𝗐(p6)\mathit{D^{\prime}}[i-1,j]=\mathsf{w}(p_{4})+\mathsf{w}(p_{6}).

Now 𝐷𝑆[i,j].U=D[i1,j]D[i,j]=𝗐(e3)+𝗐(p6)(𝗐(e1)+𝗐(e2)+𝗐(p5))\mathit{DS}[i,j].U=\mathit{D}[i-1,j]-\mathit{D}[i,j]=\mathsf{w}(e_{3})+\mathsf{w}(p_{6})-(\mathsf{w}(e_{1})+\mathsf{w}(e_{2})+\mathsf{w}(p_{5})), and 𝐷𝑆[i,j].U=D[i1,j]D[i,j]=𝗐(p6)(𝗐(e4)+𝗐(p5))\mathit{DS^{\prime}}[i,j].U=\mathit{D^{\prime}}[i-1,j]-\mathit{D^{\prime}}[i,j]=\mathsf{w}(p_{6})-(\mathsf{w}(e_{4})+\mathsf{w}(p_{5})). This leads to 𝐷𝑆[i,j].U𝐷𝑆[i,j].U=𝗐(e3)+𝗐(e4)(𝗐(e1)+𝗐(e2))\mathit{DS}[i,j].U-\mathit{DS^{\prime}}[i,j].U=\mathsf{w}(e_{3})+\mathsf{w}(e_{4})-(\mathsf{w}(e_{1})+\mathsf{w}(e_{2})). Recall that 𝗐(e2)=𝗐(e4)=(A[i]B[j])2\mathsf{w}(e_{2})=\mathsf{w}(e_{4})=(A[i]-B[j])^{2}. Thus 𝐷𝑆[i,j].U𝐷𝑆[i,j].U=𝗐(e3)𝗐(e1)\mathit{DS}[i,j].U-\mathit{DS^{\prime}}[i,j].U=\mathsf{w}(e_{3})-\mathsf{w}(e_{1}). Also, because A[i]A[i1]A[i]\neq A[i-1], (A[i]B[i1])2=𝗐(e1)𝗐(e3)=(A[i1]B[i1])2(A[i]-B[i-1])^{2}=\mathsf{w}(e_{1})\neq\mathsf{w}(e_{3})=(A[i-1]-B[i-1])^{2}, Consequently, 𝐷𝑆[i,j].U𝐷𝑆[i,j].U0\mathit{DS}[i,j].U-\mathit{DS^{\prime}}[i,j].U\neq 0.

Therefore, for any cell (i,j)(i,j) with 1<i<j1<i<j that lies on the top row of any character run in AA, 𝐷𝑆[i,j]𝐷𝑆[i,j]\mathit{DS}[i,j]\neq\mathit{DS^{\prime}}[i,j]. Since each top row iTI=Ik+1i_{\mathrm{T}}^{I}=Ik+1 (1IM1)(1\leq I\leq M-1) contains n(Ik+1)n-(Ik+1) such cells, and since n=kMn=kM, there are I=1M1(n(Ik+1))=n(M1)k((M1)M)/2(M1)=(n2)(M1)/2\sum_{I=1}^{M-1}{(n-(Ik+1))}=n(M-1)-k((M-1)M)/2-(M-1)=(n-2)(M-1)/2 such cells for top rows I=1,,M1I=1,\ldots,M-1. Symmetric arguments show that there are J=1N1(m(Jl+1))=(m2)(N1)/2\sum_{J=1}^{N-1}{(m-(Jl+1))}=(m-2)(N-1)/2 cells with 𝐷𝑆[i,j]𝐷𝑆[i,j]\mathit{DS}[i,j]\neq\mathit{DS^{\prime}}[i,j] for left rows J=1,,N1J=1,\ldots,N-1. Thus, #chg((n2)(M1)+(m2)(N1))/2Ω(mN+nM)\#_{\mathrm{chg}}\geq((n-2)(M-1)+(m-2)(N-1))/2\in\Omega(mN+nM). \Box

A.2 Proof for Lemma 1

  • Proof. The lemma can be shown in a similar manner to the proof for Theorem 3 above. In so doing, we set M=nM=n and N=nN=n in the strings AA and BB of Section 3.2, and consider strings AA and BB such that A[i1]>A[i]A[i-1]>A[i] for 1<im1<i\leq m and B[j1]<B[j]B[j-1]<B[j] for 1<jn1<j\leq n, and A[m]>B[n]A[m]>B[n]. Since deleting the leftmost character B[1]B[1] of BB is symmetric to appending a new character b1b_{1} to BB such that b1<B[1]b_{1}<B[1], we get Ω(mn)\Omega(mn) lower bound for the number of cells where D[i,j]D[i,j]D[i,j]\neq\mathit{D^{\prime}}[i,j] per appended character. If we repeat this by recursively appending kk new characters bib_{i} such that bi<bi1<<B[1]b_{i}<b_{i-1}<\cdots<B[1] for i=2,,ki=2,\ldots,k, we get Ω(mn)\Omega(mn) lower bound for the number of cells where D[i,j]D[i,j]D[i,j]\neq\mathit{D^{\prime}}[i,j] for each bib_{i}. Hence there are a total of Θ(kmn)\Theta(kmn) cells in D\mathit{D^{\prime}} that differ from the corresponding cells in D\mathit{D}, for kk edit operations on BB. \Box