Amercing: An Intuitive, Elegant and Effective Constraint for Dynamic Time Warping
Abstract
Dynamic Time Warping (DTW), and its constrained (CDTW) and weighted (WDTW) variants, are time series distances with a wide range of applications. They minimize the cost of non-linear alignments between series. CDTW and WDTW have been introduced because DTW is too permissive in its alignments. However, CDTW uses a crude step function, allowing unconstrained flexibility within the window, and none beyond it. WDTW's multiplicative weight is relative to the distances between aligned points along a warped path, rather than being a direct function of the amount of warping that is introduced. In this paper, we introduce Amerced Dynamic Time Warping (ADTW), a new, intuitive, DTW variant that penalizes the act of warping by a fixed additive cost. Like CDTW and WDTW, ADTW constrains the amount of warping. However, it avoids both abrupt discontinuities in the amount of warping allowed and the limitations of a multiplicative penalty. We formally introduce ADTW, prove some of its properties, and discuss its parameterization. We show on a simple example how it can be parameterized to achieve an intuitive outcome, and demonstrate its usefulness on a standard time series classification benchmark. We provide a demonstration application in C++ [1].
Keywords: Time Series, Dynamic Time Warping, Elastic Distance

1 Introduction
Dynamic Time Warping (DTW) is a distance measure for time series. First developed for speech recognition [2, 3], it has been adopted across a broad spectrum of applications including gesture recognition [4], signature verification [5], shape matching [6], road surface monitoring [7], neuroscience [8] and medical diagnosis [9].
DTW sums pairwise distances between aligned points in two series. To allow for similar events that unfold at different rates, DTW provides elasticity in which points are aligned. However, it is well understood that DTW can be too flexible in the alignments it allows. We illustrate this with respect to the three series shown in Figure 1.
[Figure 1: the three example series S, T and U.]
A natural expectation for any time series distance function ∆ is that 0 < ∆(S, T) < ∆(S, U). However, DTW(S, T) = DTW(S, U) = 0. While this may be appropriate for some tasks, where all that matters is that these sequences all contain five 1s and one −1, it is inappropriate for others.
Two variants have been developed to constrain DTW's flexibility, Constrained DTW (CDTW [2, 3], see Section 2.2) and Weighted DTW (WDTW [10], see Section 2.3). CDTW constrains the alignments from warping by more than a window, w. Within the window, any warping is allowed, and none beyond it. Depending on w, there are three possibilities for how CDTW might assess our example:
1. For w = 0, CDTW_0(S, T) = CDTW_0(S, U) = 8.
2. For w = 1, CDTW_1(S, T) = 0 < CDTW_1(S, U) = 8 (see Figure 3).
3. For w ≥ 2, CDTW_w(S, T) = CDTW_w(S, U) = 0.
WDTW applies a multiplicative penalty to alignments that increases the more the alignment is warped. However, because the penalty is multiplicative, the perfectly matched elements in our example still lead to WDTW(S, T) = WDTW(S, U) = 0 (see Figure 4).
None of DTW, CDTW or WDTW can be parameterized to obtain the natural expectation that 0 < ∆(S, T) < ∆(S, U).
In this paper, we present Amerced DTW (ADTW), an intuitive, elegant and effective variant of DTW that applies a tunable additive penalty ω for warping an alignment. To align the −1 in S with the −1 in T, it is necessary to warp the alignments by 1 step, incurring a cost of ω. A further compensatory warping step is required to bring the two end points back into alignment, incurring another cost of ω. Thus, the total cost of aligning the two series is ADTW_ω(S, T) = 2ω. Aligning the −1 in S with the −1 in U requires warping by two steps, incurring a penalty of 2ω, with a further 2ω penalty required to realign the end points, resulting in ADTW_ω(S, U) = 4ω. Thus, ADTW can be parameterized to be in line with natural expectations for our example, or to allow the flexibility to treat the three series as equivalent if that is appropriate for an application:
1. For 0 < ω < 2 (see Figure 5), 0 < ADTW_ω(S, T) = 2ω < ADTW_ω(S, U) = 4ω < 8. We have ADTW_ω(S, U) ≤ 8 because the path will not be warped if the resulting cost of 8 is cheaper than 4 warping penalties.
2. For ω ≥ 4, ADTW_ω(S, T) = ADTW_ω(S, U) = 8, because the penalty for warping is greater than the cost of not warping.
3. For ω = 0, ADTW_0(S, T) = ADTW_0(S, U) = 0, because there is no penalty for warping.
We show that this approach is not only intuitive, but often provides superior outcomes in practice.
2 Background and Related Work
DTW is a foundational technique for a wide range of time series data analysis tasks, including similarity search [11], regression [12], clustering [13], anomaly and outlier detection [14], motif discovery [15], forecasting [16], and subspace projection [17].
In this paper, for ease of exposition, we only consider univariate time series, although ADTW extends directly to the multivariate case. We denote series by the capital letters S, T and U. The letter ℓ denotes the length of a series. Subscripting (e.g. ℓ_S) is used to disambiguate between the lengths of different series. The elements S_i with 1 ≤ i ≤ ℓ_S are the elements of the series S, drawn from a domain 𝔻 (e.g. the real numbers).
We assess ADTW on time series classification tasks. Given a training set of labeled time series, time series classification learns a classifier, which is then assessed with respect to classifying an evaluation set of labeled time series.
2.1 Dynamic Time Warping and Variants
Dynamic Time Warping (DTW) [2, 3] handles alignment of series with distortion and disparate lengths. An alignment A = ⟨A_1, …, A_{|A|}⟩ between two series S and T is made of tuples A_k = (i, j) representing the point-to-point alignments of S_i with T_j. A warping path is an alignment with the following properties:
1. Series extremities are aligned with each other, i.e. A_1 = (1, 1) and A_{|A|} = (ℓ_S, ℓ_T);
2. it is continuous, i.e. for successive tuples A_k = (i, j) and A_{k+1} = (i′, j′), we have i′ − i ≤ 1 and j′ − j ≤ 1;
3. it is monotonic, i.e. i′ ≥ i and j′ ≥ j.
Given a cost function λ(S_i, T_j) assessing the cost of aligning S_i with T_j, DTW finds a warping path minimizing the cumulative sum of λ:

DTW(S, T) = min_{A ∈ 𝒜(S, T)} Σ_{(i,j) ∈ A} λ(S_i, T_j)   (1)

where 𝒜(S, T) denotes the set of all warping paths between S and T.
In this paper, following common practice in time series classification, we use the squared L2 norm, λ(a, b) = (a − b)², as the cost function for all distances.
DTW can be computed on a 0-indexed (ℓ_S + 1) × (ℓ_T + 1) cost matrix M_DTW. A cell M_DTW(i, j) represents the minimal cumulative cost of aligning the first i points of S with the first j points of T. It follows that DTW(S, T) = M_DTW(ℓ_S, ℓ_T).
The cost matrix is defined by the set of recursive equations (2). The first two equations (2a) and (2b) define the border conditions. The third one (2c) computes the cost of a cell (i, j) by adding the cost of aligning S_i with T_j, given by λ(S_i, T_j), to the cost of the smallest of its predecessors. Figure 2 shows examples of DTW cost matrices.
M_DTW(0, 0) = 0   (2a)
M_DTW(i, 0) = M_DTW(0, j) = +∞, for 1 ≤ i ≤ ℓ_S, 1 ≤ j ≤ ℓ_T   (2b)
M_DTW(i, j) = λ(S_i, T_j) + min{M_DTW(i−1, j−1), M_DTW(i−1, j), M_DTW(i, j−1)}   (2c)
An efficient implementation technique [18] allows DTW and its variants, including ADTW, to be computed with an O(ℓ) space complexity and a pruned O(ℓ²) time complexity; the worst case can usually be avoided.
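To make equations (2) concrete, the following minimal C++ sketch computes DTW while keeping only two rows of the cost matrix, illustrating the O(ℓ) space bound. The pruning and early abandoning of [18] are omitted, and all identifiers are ours rather than those of the demonstration application [1].

```cpp
#include <algorithm>
#include <cstddef>
#include <limits>
#include <vector>

// Squared L2 cost between two points, as used throughout the paper.
double cost(double a, double b) { return (a - b) * (a - b); }

// Plain DTW per equations (2), kept to two rows of the cost matrix for
// O(ell) space. This sketch omits the pruning and early abandoning of [18].
double dtw(const std::vector<double>& S, const std::vector<double>& T) {
    const double INF = std::numeric_limits<double>::infinity();
    std::vector<double> prev(T.size() + 1, INF), curr(T.size() + 1, INF);
    prev[0] = 0.0;  // border condition M(0,0) = 0; all other borders are +inf
    for (std::size_t i = 1; i <= S.size(); ++i) {
        curr[0] = INF;
        for (std::size_t j = 1; j <= T.size(); ++j) {
            // M(i,j) = cost(S_i, T_j) + min of the three predecessors (2c)
            curr[j] = cost(S[i - 1], T[j - 1])
                    + std::min({prev[j - 1], prev[j], curr[j - 1]});
        }
        std::swap(prev, curr);
    }
    return prev[T.size()];  // DTW(S,T) = M(ell_S, ell_T)
}
```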
[Figure 2: examples of DTW cost matrices.]
2.2 Constrained DTW
In Constrained DTW (CDTW), the warping path is strictly restricted to a subarea of the cost matrix. Different constraints, with different subarea “shapes”, exist [3, 19]. We focus on the popular Sakoe-Chiba band [3], also known as a warping window (or window for short). The window is a parameter w controlling how far the warping path can deviate from the diagonal. Given a cost matrix M, a line index i and a column index j, we constrain CDTW by |i − j| ≤ w. A window of w = 0 forces the warping path to be on the diagonal, while a window of w = ℓ − 1 is equivalent to DTW (no constraint). For example, with w = 1, the warping path can only step one cell away from each side of the diagonal (Figure 3).
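As a companion to the DTW sketch above, here is a minimal C++ rendering of CDTW with a Sakoe-Chiba band; identifiers are ours, and a tuned implementation would iterate only over the band rather than testing every cell. When w < |ℓ_S − ℓ_T| the end cell is unreachable and the sketch returns +∞, matching the observation in Section 3.2 that CDTW is then undefined.

```cpp
#include <algorithm>
#include <cstddef>
#include <limits>
#include <vector>

double cost(double a, double b) { return (a - b) * (a - b); }

// CDTW with a Sakoe-Chiba band: identical to DTW except that cells with
// |i - j| > w are never filled, so the path cannot leave the band.
double cdtw(const std::vector<double>& S, const std::vector<double>& T,
            std::size_t w) {
    const double INF = std::numeric_limits<double>::infinity();
    const std::size_t n = S.size(), m = T.size();
    std::vector<std::vector<double>> M(n + 1, std::vector<double>(m + 1, INF));
    M[0][0] = 0.0;
    for (std::size_t i = 1; i <= n; ++i)
        for (std::size_t j = 1; j <= m; ++j) {
            if ((i > j ? i - j : j - i) > w) continue;  // outside the window
            M[i][j] = cost(S[i - 1], T[j - 1])
                    + std::min({M[i - 1][j - 1], M[i - 1][j], M[i][j - 1]});
        }
    return M[n][m];  // +inf if the window cannot reach the end cell
}
```

With w = 0 only the diagonal is filled, giving the squared Euclidean distance of Section 2.4; with w ≥ ℓ − 1 it reproduces plain DTW.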
[Figure 3: examples of CDTW cost matrices with a window w = 1.]
An effective window increases the usefulness of CDTW over DTW by preventing spurious alignments. For example, nearest neighbor classification using CDTW as the distance measure (NN-CDTW) is significantly more accurate than nearest neighbor classification using DTW (NN-DTW, see Section 5). CDTW is also faster to compute than DTW, as cells of the cost matrix beyond the window are ignored. However, the window is a parameter that must be learned. The Sakoe-Chiba band is part of the original definition of DTW and the de facto default constraint for CDTW (e.g. in [20, 21, 22]).
One of the issues that must be addressed in any application of CDTW is how to select an appropriate value for w. A method deployed when the UCR benchmark archive was first established has become the de facto default method for time series classification [23]. This method applies leave-one-out cross validation to nearest neighbor classification over the training data for each of 100 candidate window sizes spanning 0 to ℓ. The best value of w is selected, with respect to the performance measure of interest (typically the lowest error).
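For concreteness, a sketch of the LOOCV error count for a single candidate window follows; it assumes the cdtw function from the sketch above, and the de facto method simply evaluates this for each candidate w and keeps the best.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

double cdtw(const std::vector<double>& S, const std::vector<double>& T,
            std::size_t w);  // the CDTW sketch above

// Leave-one-out cross validation of nearest neighbor classification for one
// candidate window: each training series is classified by its nearest other
// training series under CDTW_w, and misclassifications are counted.
std::size_t loocvErrors(const std::vector<std::vector<double>>& series,
                        const std::vector<int>& labels, std::size_t w) {
    std::size_t errors = 0;
    for (std::size_t q = 0; q < series.size(); ++q) {
        double best = std::numeric_limits<double>::infinity();
        int predicted = -1;
        for (std::size_t c = 0; c < series.size(); ++c) {
            if (c == q) continue;  // leave the query series out
            const double d = cdtw(series[q], series[c], w);
            if (d < best) { best = d; predicted = labels[c]; }
        }
        if (predicted != labels[q]) ++errors;
    }
    return errors;
}
```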
2.3 Weighted DTW
Weighted Dynamic Time Warping (WDTW) [10] penalizes phase difference between two points. The alignment cost of a cell (i, j) is weighted according to its distance to the diagonal, d = |i − j|. In other words, WDTW relies on a weight function (3a) to define a new cost function (3b). A large weight decreases the chances of a cell being on an optimal path. The weight factor parameter g controls the penalization, and usually lies within 0.01–0.6 [10]. Figure 4 shows several WDTW cost matrices.
w(d) = 1 / (1 + e^{−g·(d − ℓ/2)})   (3a)
λ_g(S_i, T_j) = w(|i − j|) · λ(S_i, T_j)   (3b)
WDTW applies a penalty to every pair of aligned points that is off the diagonal. Hence, the longer the off-diagonal path, the greater the penalty. Thus, it favors many small deviations from the diagonal over fewer, longer deviations.
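The following minimal C++ sketch mirrors equations (3), precomputing the logistic weights and folding them into the usual DTW recurrence. Identifiers are ours, and we take ℓ = max(ℓ_S, ℓ_T) in the weight function.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// WDTW per equations (3): the DTW recurrence, but each alignment cost is
// scaled by a logistic weight of the phase difference |i - j|.
double wdtw(const std::vector<double>& S, const std::vector<double>& T,
            double g) {
    const double INF = std::numeric_limits<double>::infinity();
    const std::size_t n = S.size(), m = T.size();
    const double half = static_cast<double>(std::max(n, m)) / 2.0;
    // Precompute w(d) = 1 / (1 + exp(-g * (d - ell/2))), equation (3a).
    std::vector<double> weight(std::max(n, m));
    for (std::size_t d = 0; d < weight.size(); ++d)
        weight[d] = 1.0 / (1.0 + std::exp(-g * (static_cast<double>(d) - half)));
    std::vector<std::vector<double>> M(n + 1, std::vector<double>(m + 1, INF));
    M[0][0] = 0.0;
    for (std::size_t i = 1; i <= n; ++i)
        for (std::size_t j = 1; j <= m; ++j) {
            const std::size_t d = i > j ? i - j : j - i;
            // Weighted cost, equation (3b).
            const double c =
                weight[d] * (S[i - 1] - T[j - 1]) * (S[i - 1] - T[j - 1]);
            M[i][j] = c + std::min({M[i - 1][j - 1], M[i - 1][j], M[i][j - 1]});
        }
    return M[n][m];
}
```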
[Figure 4: examples of WDTW cost matrices.]
WDTW is also confronted with the problem of selecting an appropriate value for its parameter, g. We use the method employed in the Elastic Ensemble [20], modeled on the de facto method for parameterizing CDTW. This method applies leave-one-out cross validation to nearest neighbor classification over the training data for 100 candidate values of g.
2.4 Squared Euclidean Distance
One simple distance measure is to sum the costs of aligning the i-th point of one series with the i-th point of the other. This is equivalent to CDTW with w = 0, and is only defined for same-length series (ℓ = ℓ_S = ℓ_T). When the cost function for aligning two points a and b is λ(a, b) = (a − b)², this distance measure is known as the Squared Euclidean Distance (SQED):
SQED(S, T) = Σ_{i=1}^{ℓ} (S_i − T_i)²   (4)
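Equation (4) translates directly; a minimal sketch, assuming same-length input as the definition requires:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Squared Euclidean distance, equation (4): the cost of the diagonal
// alignment. Defined only for series of the same length.
double sqed(const std::vector<double>& S, const std::vector<double>& T) {
    assert(S.size() == T.size());
    double total = 0.0;
    for (std::size_t i = 0; i < S.size(); ++i)
        total += (S[i] - T[i]) * (S[i] - T[i]);
    return total;
}
```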
3 Amerced Dynamic Time Warping
Amerced Dynamic Time Warping (ADTW) provides a new, intuitive, elegant, and effective mechanism for constraining the alignments within the DTW framework. It achieves this by the introduction of a novel constraint, amercing — the application of a simple additive penalty ω every time an alignment is warped (a step advances along only one of the two series). This addresses the following problems:
1. DTW can be too flexible in its alignments;
2. CDTW uses a crude step function — any flexibility is allowed within the window and none beyond it;
3. WDTW applies a multiplicative weight, and hence promotes large degrees of warping if they lead to low cost alignments — the penalty incurred for warping is dependent on the costs of the warped alignments. Further, as the penalty is paid for every off-diagonal alignment (where i ≠ j), WDTW penalizes off-diagonal paths for their length, rather than for the number of times they adjust the alignment¹.

¹ By adjust the alignment we mean having successive alignments A_k = (i, j) and A_{k+1} = (i′, j′) such that it is not the case that i′ = i + 1 and j′ = j + 1.
3.1 Formal definition of ADTW
Given two time series S and T, a cost function λ, and an amercing penalty ω (see Section 4), ADTW finds a warping path minimizing the cumulative sum of amerced costs:
ADTW_ω(S, T) = min_{A ∈ 𝒜(S, T)} [ Σ_{k=1}^{|A|} λ(S_{i_k}, T_{j_k}) + ω · Σ_{k=2}^{|A|} 1_warp(A_{k−1}, A_k) ]   (5)
where 1_warp indicates that the amercing penalty ω is only applied if the step from A_{k−1} = (i, j) to A_k = (i′, j′) is not an increment on both series, i.e. it is not the case that i′ = i + 1 and j′ = j + 1. ADTW can be computed on a cost matrix M_ADTW (6) with ADTW_ω(S, T) = M_ADTW(ℓ_S, ℓ_T).
ADTW has the same border conditions as DTW (equations (6a) and (6b)). The other matrix cells are computed recursively, amercing the off-diagonal alignments by the penalty ω (6c). Methods for choosing ω are discussed in Section 4.
M_ADTW(0, 0) = 0   (6a)
M_ADTW(i, 0) = M_ADTW(0, j) = +∞, for 1 ≤ i ≤ ℓ_S, 1 ≤ j ≤ ℓ_T   (6b)
M_ADTW(i, j) = min{M_ADTW(i−1, j−1) + λ(S_i, T_j), M_ADTW(i−1, j) + λ(S_i, T_j) + ω, M_ADTW(i, j−1) + λ(S_i, T_j) + ω}   (6c)
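Equations (6) translate into code just as readily as those of DTW; a minimal C++ sketch follows (full matrix, no pruning; identifiers are ours, not those of [1]). Setting ω = 0 recovers DTW, while a sufficiently large ω confines the path to the diagonal, in line with the properties established in the next subsection.

```cpp
#include <algorithm>
#include <cstddef>
#include <limits>
#include <vector>

double cost(double a, double b) { return (a - b) * (a - b); }

// ADTW per equations (6): identical borders to DTW, but the two
// non-diagonal predecessors are amerced by the additive penalty omega.
double adtw(const std::vector<double>& S, const std::vector<double>& T,
            double omega) {
    const double INF = std::numeric_limits<double>::infinity();
    const std::size_t n = S.size(), m = T.size();
    std::vector<std::vector<double>> M(n + 1, std::vector<double>(m + 1, INF));
    M[0][0] = 0.0;
    for (std::size_t i = 1; i <= n; ++i)
        for (std::size_t j = 1; j <= m; ++j)
            M[i][j] = cost(S[i - 1], T[j - 1])
                    + std::min({M[i - 1][j - 1],        // diagonal: no penalty
                                M[i - 1][j] + omega,    // warping step: amerced
                                M[i][j - 1] + omega});  // warping step: amerced
    return M[n][m];
}
```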
[Figure 5: examples of ADTW cost matrices.]
3.2 Properties of ADTW
ADTW is monotonic with respect to ω:

ω ≤ ω′ ⟹ ADTW_ω(S, T) ≤ ADTW_{ω′}(S, T)   (7)
Proof.
For any warping path that minimizes (5) under ω′, the same path will result in an equivalent or lower score for ω. Hence the minimum score for a warping path under ω cannot exceed the minimum score under ω′. ∎
With no penalty, ADTW degenerates to DTW:

ADTW_0(S, T) = DTW(S, T)   (8)
If ℓ_S = ℓ_T, an infinite penalty makes ADTW degenerate to SQED:

ADTW_{+∞}(S, T) = SQED(S, T)   (9)
Proof.
With ω = +∞, any off-diagonal alignment will receive an infinite penalty. Hence the minimal cost warping path must follow the diagonal. ∎
When ℓ_S = ℓ_T,

DTW(S, T) ≤ ADTW_ω(S, T) ≤ SQED(S, T)   (10)
This observation provides useful and intuitive bounds for the measure.
ADTW is symmetric with respect to the order of its arguments:

ADTW_ω(S, T) = ADTW_ω(T, S)   (11)
Proof.
If we consider the cost matrix, such as illustrated in Figure 5, swapping S for T has the effect of flipping the matrix on the diagonal. This does not affect the cost of the minimal path. ∎
That a distance measure should satisfy this symmetry is intuitive, as it does not seem appropriate that the order of the arguments to a distance function should affect its value. This symmetry also holds for the other variants of DTW.
ADTW is symmetric with respect to the order in which the series are processed. With rev(S) = ⟨S_{ℓ_S}, …, S_2, S_1⟩, we have:

ADTW_ω(rev(S), rev(T)) = ADTW_ω(S, T)   (12)
Proof.
For any warping path A of (S, T), there is a matching warping path rev(A) of (rev(S), rev(T)). As these warping paths both contain the same alignments, the cost terms will be the same. As reversal preserves the number of warping steps, the amercing penalty terms will also be identical. ∎
This symmetry is also intuitive, as it does not seem appropriate that the direction from which one calculates a distance measure should affect its value. Neither CDTW nor WDTW has this symmetry when ℓ_S ≠ ℓ_T. This is because both measures apply constraints relative to the diagonal, which is different depending on whether one traces it from the top left or the bottom right of the cost matrix. Unlike CDTW, ADTW is well defined when ℓ_S ≠ ℓ_T: CDTW_w is undefined whenever w < |ℓ_S − ℓ_T|.
4 Parameterization
As shown in Section 3.2, ADTW can be parameterized so as to range from being as flexible as DTW to being as constrained as SQED. If the situation requires a large amount of warping, a small penalty should be used. Reciprocally, a large penalty helps to avoid warping. Hence, ω must be tuned for the task at hand, ideally using expert knowledge. Without the latter, one has to fall back on some automated approach. In this paper, we evaluate ADTW on a classification benchmark (Section 5) due to its clearly defined objective evaluation measures. Our automated parameter selection method has been designed to work in this context. How best to parameterize ADTW in other scenarios remains an important topic for future investigation.
The penalty range is continuous, and hence cannot be exhaustively assessed unless the objective function is convex with respect to ω, or there are other properties that can be exploited for efficient complete search. Instead, we use a method similar to the de facto standard for parameterizing CDTW and WDTW. To this end we define a range of penalties from small to large. We select from this range the penalty that achieves the lowest error when used as the distance measure for nearest neighbor classification, as assessed through leave-one-out cross validation (LOOCV) on the training set.
This leads to a major issue with an additive penalty such as ω: its scale. A penalty that is small in one context may be huge in another, as the impact of the penalty depends on the magnitude of the costs at each step along a warping path. Hence, we must find a reasonable way to determine the scale of penalties. We achieve this by defining a maximal penalty ω′, and multiplying it by a ratio 0 ≤ r ≤ 1 (13). Our penalty range is now [0, ω′]. This leads to two questions: how to define ω′, and how to sample r.
ω = r × ω′   (13)
Let us first consider two series S and T. As ADTW_ω(S, T) ≤ SQED(S, T), taking ω′ = SQED(S, T) ensures that ADTW_{ω′}(S, T) = SQED(S, T). Indeed, taking one single warping step then costs as much as the full diagonal. In turn, this ensures that we do not need to test any value beyond ω′. However, we need a single penalty for all series from a dataset, so we sample random pairs of training series and set ω′ to their average SQED.
We now have to find the best ratio r — we use a detour to better explain it. Intuitively, ω′ represents the cost of ℓ steps along the diagonal, and we can define an average step cost ω″ = ω′/ℓ. In turn, we have

ω = r′ × ω″   (14)
where r′ = r × ℓ. As ω″ remains a large penalty (a single warping step costs as much as an average step along the diagonal), we need to start sampling at values below ω″. Without expert knowledge, our best guess is to start several orders of magnitude lower. Simplifying equation (14) back to equation (13), r must be small enough to account for both the factor 1/ℓ and the desired orders of magnitude below ω″. Then, successive r must gradually increase toward 1. If several “best” ω are found, we take the median value.
As leave-one-out cross validation is time consuming, and to ensure a fair comparison with the methods for parameterizing CDTW and WDTW, we limit the search to a hundred values [20, 24]. The formula r_i = (i/100)⁵ for i ∈ {1, …, 100} fulfills our requirements, covering a range from 10⁻¹⁰ to 1 and favoring smaller penalties, which tend to be more useful than large ones, as any sufficiently large penalty confines the warping path to the diagonal.
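Putting the pieces together, the following C++ sketch derives the hundred candidate penalties; the number of sampled pairs and the fixed seed are illustrative placeholders, not values prescribed here.

```cpp
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Squared Euclidean distance between same-length series, equation (4).
double sqed(const std::vector<double>& S, const std::vector<double>& T) {
    double total = 0.0;
    for (std::size_t i = 0; i < S.size(); ++i)
        total += (S[i] - T[i]) * (S[i] - T[i]);
    return total;
}

// Build the 100 candidate penalties omega_i = r_i * omega', with
// r_i = (i/100)^5 and omega' estimated as the average SQED over randomly
// sampled pairs of training series. The pair count and seed are
// illustrative choices, not taken from the text.
std::vector<double> adtwPenalties(const std::vector<std::vector<double>>& train,
                                  std::size_t numPairs = 1000) {
    std::mt19937 rng(42);  // fixed seed for reproducibility
    std::uniform_int_distribution<std::size_t> pick(0, train.size() - 1);
    double sum = 0.0;
    for (std::size_t k = 0; k < numPairs; ++k)
        sum += sqed(train[pick(rng)], train[pick(rng)]);
    const double omegaPrime = sum / static_cast<double>(numPairs);
    std::vector<double> penalties;
    penalties.reserve(100);
    for (int i = 1; i <= 100; ++i)
        penalties.push_back(std::pow(i / 100.0, 5.0) * omegaPrime);
    return penalties;  // LOOCV then selects among these; ties take the median
}
```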
5 Experiments
Due to its clearly defined objective evaluation measures, we evaluate ADTW against other distances on a classification benchmark, using the UCR archive [23] in its 128-dataset version. We retained 112 datasets after filtering out those with series of variable length, containing missing data, or having only one training exemplar per class (which leads to spurious results under LOOCV). The distances are used in nearest neighbor classifiers. The archive provides default train and test splits. For each dataset, the parameters w, g and ω are learned on the training data following the methods described in Sections 2 and 4. We report the accuracy obtained on the test split. Our results are available online [1].
Following Demšar [25], we compare classifiers using the Wilcoxon signed-rank test. The results can be graphically presented in mean rank diagrams (Figures 6 and 9), where classifiers that are not significantly different (at a 0.05 significance level, adjusted with the Holm correction for multiple testing) are linked by a horizontal bar. A rank closer to 1 indicates a more accurate classifier.
We compare NN-ADTW against NN-DTW, NN-CDTW, and NN-WDTW. Figure 6 summarizes the outcome of this comparison. It shows that NN-ADTW attains a significantly better rank on accuracy than any other measure.
[Figure 6: mean rank diagram comparing NN-ADTW, NN-CDTW, NN-WDTW and NN-DTW.]
Mean rank diagrams are useful to summarize relative performance, but they do not tell the full story — a classifier consistently outperforming another would be deemed significantly better ranked, even if the actual gain is minimal. Accuracy scatter plots (see Figures 7 and 8) allow us to visualize how a classifier performs relative to another. The space is divided by a diagonal representing equal accuracy, and each dot represents a dataset. Points close to the diagonal indicate comparable accuracy between the classifiers; conversely, points far away indicate different accuracies.
Figure 7 shows the accuracy scatter plots of NN-ADTW against the other distances. Points above the diagonal indicate datasets where NN-ADTW is more accurate, and conversely below it. We also indicate the numbers of ties and wins per classifier, and the resulting Wilcoxon score. NN-ADTW is almost always more accurate than NN-DTW — usually substantially so — and the majority of points remain well above the diagonal for NN-CDTW and NN-WDTW, albeit by a smaller margin. We note a point well below the diagonal in Figures 7(a) and 7(c). This is the “Rock” dataset, for which our parameterization process fails to select a high enough penalty. Although ADTW has the capacity to behave like its more constrained counterparts if appropriately parameterized, our parameterization process fails to achieve this for “Rock.” Effective parameterization remains an open problem.
[Figure 7: accuracy scatter plots of NN-ADTW against each competitor.]
We also compare NN-ADTW against a hypothetical classifier that, for each dataset, uses the most accurate among all competitors. Such a classifier is not feasible to obtain in practice, as it is not possible to be certain which classifier will obtain the lowest error on previously unseen test data. The accuracy scatter plots against the hypothetical classifier are shown in Figure 8. A Wilcoxon signed-rank test assesses NN-ADTW as significantly more accurate than this hypothetical classifier at a 0.05 significance level. These results suggest that ADTW is a good default choice of DTW variant.
[Figure 8: accuracy scatter plot of NN-ADTW against the hypothetical classifier.]
Finally, we present a sensitivity analysis for the exponent used to tune ω. Figure 9 shows the effect of using different values for the exponent p when sampling the ratio r_i = (i/100)^p under LOOCV at train time. All but the smallest exponents lead to comparable results. While p = 5 leads to the best results on this benchmark, we have no theoretical grounds on which to prefer it, and in other applications we have examined, slightly higher values lead to slightly greater accuracy.
[Figure 9: mean rank diagram for the exponents used when sampling the penalty ratio.]
6 Conclusions
DTW is a popular time series distance measure that allows flexibility in the series alignments [2, 3]. However, it can be too flexible for some applications, and two main forms of constraint have been developed to limit the alignments. Windows provide a strict limit on the distance over which points may be aligned [3, 19]. Multiplicative weights penalize all off-diagonal points proportionally to their distance from the diagonal and the cost of their alignment [10].
However, windows introduce an abrupt discontinuity: within the window there is unconstrained flexibility, and beyond it there is none. Multiplicative weights allow large warping at little cost if the aligned points have low cost. Further, they penalize the length of an off-diagonal path rather than the number of times the path deviates from the diagonal. What is more, neither windows nor multiplicative weights are symmetric with respect to reversing the series (they do not guarantee ∆(rev(S), rev(T)) = ∆(S, T) when ℓ_S ≠ ℓ_T).
Amerced DTW introduces a tunable additive weight ω that produces a smooth range of distance measures such that ADTW_ω is monotonic with respect to ω, ADTW_0(S, T) = DTW(S, T) and, for same-length series, ADTW_{+∞}(S, T) = SQED(S, T). It is symmetric with respect to the order of the arguments and to reversing the series, irrespective of whether the series share the same length. It has the intuitive property of penalizing the number of times a path deviates from simply aligning successive points in each series, rather than penalizing the length of the paths following such a deviation.
Our experiments indicate that when ADTW is employed for nearest neighbor classification on the widely used UCR benchmark, it results in more accurate classification significantly more often than any other DTW variant.
The application of ADTW in the many other types of task to which DTW is often applied remains a productive direction for future investigation. These include similarity search [11], regression [12], clustering [13], anomaly and outlier detection [14], motif discovery [15], forecasting [16], and subspace projection [17]. One issue that will need to be addressed in each of these domains is how best to tune the amercing penalty ω, especially if a task does not have objective criteria by which utility may be judged. We hope that ADTW will prove as effective in these other applications as it has proved to be in classification. A C++ implementation of ADTW is available at [1].
Acknowledgments
This work was supported by the Australian Research Council award DP210100072. We would like to thank the maintainers and contributors of the UCR Archive, and Hassan Ismail Fawaz for the critical difference diagram drawing code [26]. We are also grateful to Eamonn Keogh, Chang Wei Tan, and Mahsa Salehi for insightful comments on an early draft.
References
- [1] M. Herrmann, NN1 ADTW demonstration application and results, https://github.com/HerrmannM/paper-2021-ADTW (2021).
- [2] H. Sakoe, S. Chiba, Recognition of continuously spoken words based on time-normalization by dynamic programming, Journal of the Acoustical Society of Japan 27 (9) (1971) 483–490.
- [3] H. Sakoe, S. Chiba, Dynamic programming algorithm optimization for spoken word recognition, IEEE Transactions on Acoustics, Speech, and Signal Processing 26 (1) (1978) 43–49. doi:10.1109/TASSP.1978.1163055.
- [4] H. Cheng, Z. Dai, Z. Liu, Y. Zhao, An image-to-class dynamic time warping approach for both 3D static and trajectory hand gesture recognition, Pattern Recognition 55 (2016) 137–147.
- [5] M. Okawa, Online signature verification using single-template matching with time-series averaging and gradient boosting, Pattern Recognition 112 (2021) 107699.
- [6] Z. Yasseen, A. Verroust-Blondet, A. Nasri, Shape matching by part alignment using extended chordal axis transform, Pattern Recognition 57 (2016) 115–135.
- [7] G. Singh, D. Bansal, S. Sofat, N. Aggarwal, Smart patrolling: An efficient road surface monitoring using smartphone sensors and crowdsourcing, Pervasive and Mobile Computing 40 (2017) 71–88.
- [8] Y. Cao, N. Rakhilin, P. H. Gordon, X. Shen, E. C. Kan, A real-time spike classification method based on dynamic time warping for extracellular enteric neural recording with large waveform variability, Journal of Neuroscience Methods 261 (2016) 97–109.
- [9] R. Varatharajan, G. Manogaran, M. K. Priyan, R. Sundarasekar, Wearable sensor devices for early detection of Alzheimer disease using dynamic time warping algorithm, Cluster Computing 21 (1) (2018) 681–690.
- [10] Y.-S. Jeong, M. K. Jeong, O. A. Omitaomu, Weighted dynamic time warping for time series classification, Pattern Recognition 44 (9) (2011) 2231–2240. doi:10.1016/j.patcog.2010.09.022.
- [11] T. Rakthanmanon, B. Campana, A. Mueen, G. Batista, B. Westover, Q. Zhu, J. Zakaria, E. Keogh, Searching and mining trillions of time series subsequences under dynamic time warping, in: Proc. 18th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, 2012, pp. 262–270.
- [12] C. W. Tan, C. Bergmeir, F. Petitjean, G. I. Webb, Time series extrinsic regression, Data Mining and Knowledge Discovery 35 (3) (2021) 1032–1060. doi:10.1007/s10618-021-00745-9.
- [13] F. Petitjean, A. Ketterlin, P. Gançarski, A global averaging method for dynamic time warping, with applications to clustering, Pattern Recognition 44 (3) (2011) 678–693.
- [14] D. M. Diab, B. AsSadhan, H. Binsalleeh, S. Lambotharan, K. G. Kyriakopoulos, I. Ghafir, Anomaly detection using dynamic time warping, in: 2019 IEEE International Conference on Computational Science and Engineering (CSE) and IEEE International Conference on Embedded and Ubiquitous Computing (EUC), IEEE, 2019, pp. 193–198.
- [15] S. Alaee, R. Mercer, K. Kamgar, E. Keogh, Time series motifs discovery under DTW allows more robust discovery of conserved structure, Data Mining and Knowledge Discovery 35 (3) (2021) 863–910.
- [16] K. Bandara, H. Hewamalage, Y.-H. Liu, Y. Kang, C. Bergmeir, Improving the accuracy of global forecasting models using time series data augmentation, Pattern Recognition 120 (2021) 108148.
- [17] H. Deng, W. Chen, Q. Shen, A. J. Ma, P. C. Yuen, G. Feng, Invariant subspace learning for time series data based on dynamic time warping distance, Pattern Recognition 102 (2020) 107210. doi:10.1016/j.patcog.2020.107210.
- [18] M. Herrmann, G. I. Webb, Early abandoning and pruning for elastic distances including dynamic time warping, Data Mining and Knowledge Discovery 35 (6) (2021) 2577–2601. doi:10.1007/s10618-021-00782-4.
- [19] F. Itakura, Minimum prediction residual principle applied to speech recognition, IEEE Transactions on Acoustics, Speech, and Signal Processing 23 (1) (1975) 67–72. doi:10.1109/TASSP.1975.1162641.
- [20] J. Lines, A. Bagnall, Time series classification with ensembles of elastic distance measures, Data Mining and Knowledge Discovery 29 (3) (2015) 565–592. doi:10.1007/s10618-014-0361-2.
- [21] B. Lucas, A. Shifaz, C. Pelletier, L. O’Neill, N. Zaidi, B. Goethals, F. Petitjean, G. I. Webb, Proximity Forest: An effective and scalable distance-based classifier for time series, Data Mining and Knowledge Discovery 33 (3) (2019) 607–635. doi:10.1007/s10618-019-00617-3.
- [22] A. Shifaz, C. Pelletier, F. Petitjean, G. I. Webb, TS-CHIEF: A scalable and accurate forest algorithm for time series classification, Data Mining and Knowledge Discovery 34 (3) (2020) 742–775. doi:10.1007/s10618-020-00679-8.
- [23] H. A. Dau, A. Bagnall, K. Kamgar, C.-C. M. Yeh, Y. Zhu, S. Gharghabi, C. A. Ratanamahatana, E. Keogh, The UCR Time Series Archive, arXiv:1810.07758 (2019).
- [24] C. W. Tan, F. Petitjean, G. I. Webb, FastEE: Fast Ensembles of Elastic Distances for time series classification, Data Mining and Knowledge Discovery 34 (1) (2020) 231–272. doi:10.1007/s10618-019-00663-x.
- [25] J. Demšar, Statistical Comparisons of Classifiers over Multiple Data Sets, Journal of Machine Learning Research 7 (2006) 1–30.
- [26] H. Ismail Fawaz, Critical difference diagram with Wilcoxon-Holm post-hoc analysis, https://github.com/hfawaz/cd-diagram (2019).