
Epidemic changepoint detection
in the presence of nuisance changes

Julius Juodakis, Stephen Marsland
School of Mathematics and Statistics, Victoria University of Wellington, New Zealand
Corresponding author: [email protected]
Abstract

Many time series problems feature epidemic changes – segments where a parameter deviates from a background baseline. The number and location of such changes can be estimated in a principled way by existing detection methods, provided that the background level is stable and known. However, practical data often contains nuisance changes in background level, which interfere with standard estimation techniques. Furthermore, such changes often differ from the target segments only in duration, and appear as false alarms in the detection results.

To solve these issues, we propose a two-level detector that models and separates nuisance and signal changes. As part of this method, we developed a new, efficient approach to simultaneously estimate unknown, but fixed, background level and detect epidemic changes. The analytic and computational properties of the proposed methods are established, including consistency and convergence. We demonstrate via simulations that our two-level detector provides accurate estimation of changepoints under a nuisance process, while other state-of-the-art detectors fail. Using real-world genomic and demographic datasets, we demonstrate that our method can identify and localise target events while separating out seasonal variations and experimental artefacts.

Keywords: Change point detection, Epidemic change points, Piecewise stationary time series, Segmentation, Stochastic gradient methods

1 Introduction

The problem of identifying when the probability distribution of a time series changes – changepoint detection – has been studied since the middle of the 20th century. Early developments, such as Shewhart’s control charts (Shewhart, 1930) and Page’s CUSUM test (Page, 1954), stemmed from operations research. However, as automatic sensing systems and continuous data collection became more common, many new use cases for changepoint detection have arisen, such as seismic events (Li et al., 2016), epidemic outbreaks (Texier et al., 2016), physiological signals (Vaisman et al., 2010), market movements (Bracker and Smith, 1999), and network traffic spikes (Hochenbaum et al., 2017). Stimulated by such practical interest, the growth of corresponding statistical theory has been rapid, as reviewed in e.g. Aminikhanghahi and Cook (2016) and Truong et al. (2020).

Different applications pose different statistical challenges. If a single drastic change may be expected, such as when detecting machine failure, the goal is to find a method with minimal time to detection and a controlled false alarm rate (Lau and Tay, 2019). In many other applications the number of changepoints is unknown a priori; the challenge is then to identify the change locations regardless, preferably in a computationally efficient way. Some problems, such as peak detection in sound (Aminikhanghahi and Cook, 2016) or genetic data (Hocking et al., 2017), feature epidemic segments – changepoints followed by a return to the background level – and incorporating this constraint can improve detection or simplify post-processing of outputs.

However, current detection methods that do incorporate a background level assume it to be stable throughout the dataset (e.g. Fisch et al., 2018; Zhao and Yau, 2019). This is not realistic in several common applications. In genetic analyses such as measurements of protein binding along DNA there may be large regions where the observed background level is shifted due to structural variation in the genome or technical artefacts (Zhang et al., 2008). Similarly, a standard task in sound processing is to detect speech in the presence of dynamic background chatter, so-called babble noise (Pearce et al., 2000). In various datasets from epidemiology or climatology, such as wave height measurements (Killick et al., 2012), seasonal effects are observed as recurring background changes and will interfere with detection of shorter events. Methods that assume a constant background will be inaccurate in these cases, while ignoring the epidemic structure entirely would cost detection power and complicate the interpretation of outputs.

Our goal is to develop a general method for detecting epidemic changepoints in the presence of nuisance changes in the background. Furthermore, we assume that the only available feature for distinguishing the two types of segments is their duration: this would allow analysing the examples above, which share the property that the nuisance process is slower. The most similar research to ours is that of Lau and Tay (2019) for detecting failure of a machine that can be switched on, and thus undergo an irrelevant change in the background level. However, the setting there concerned a single change with specified background and nuisance distributions; in contrast, we are motivated by the case when multiple changes may be present, with only duration distinguishing their types. Detection then requires two novel developments: 1) rapid estimation of local background level, 2) modelling and distinguishing the two types of potentially overlapping segments.

These developments are presented in this paper as follows: after a background section, we provide a new algorithm that simultaneously detects epidemic changepoints and estimates the unknown background level (Section 3). The convergence and consistency of this algorithm are proved in the Supplementary Material. While this algorithm is of interest in its own right, we use it to build a detector that allows local variations in the background, i.e., nuisance changes, in Section 4, again with proven properties. In Section 5 we test various properties of the algorithms with simulations, before showing how the proposed nuisance-robust detector can be applied in two problems: detecting histone modifications in the human genome while ignoring structural variations, and detecting the effects of the COVID-19 pandemic in Spanish mortality data, robustly to seasonal effects. Compared to state-of-the-art methods, the proposed detector produced lower false-alarm rates (or more parsimonious models), while retaining accurate detection of true signal peaks.

2 Background

In the general changepoint detection setup, the data consists of a sequence of observations $x_{0:n}=\{x_{0},x_{1},\dots,x_{n}\}$, split by $k$ changepoints $0<\tau_{1}<\tau_{2}<\dots<\tau_{k}<n$ into $k+1$ segments. Within each segment the observations are drawn from a distribution specific to that segment; these are often assumed to belong to some fixed family $f_{\theta}(x)$, with a different value of the parameter $\theta$ for each segment. The most common and historically earliest example is the change in mean of a Gaussian (see Truong et al. (2020)), i.e., for each $t\in[\tau_{i},\tau_{i+1})$, $x_{t}\sim\mathcal{N}(\mu_{i},\sigma^{2})$, for known, fixed $\sigma^{2}$. (We assume $\theta\in\mathbb{R}^{1}$ to keep notation clearer, but multidimensional problems are also common, such as when both the mean and variance of a Gaussian are subject to change.)

While early research focused on optimal hypothesis testing of a single change against no change in a parameter, modern models aim to estimate the position of all changepoints $\{\tau_{i}\}$ in a given time series, and optionally the vector of segment parameters $\bm{\theta}$. A common approach is to use a penalised cost: choose a segment cost function $C(x_{a:b};\theta)$, usually based on the negative log-likelihood of the observations $x_{a:b}$ given $\theta$, and a penalty $p(k)$ for the number of changepoints $k$. The full cost of $x_{\tau_{0}:\tau_{k}}$ (where $\tau_{0}=0$ and $\tau_{k}=n$) is then:

F(n;\bm{\tau},\bm{\theta},k)=\sum_{i=0}^{k-1}C(x_{\tau_{i}+1:\tau_{i+1}};\theta_{i})+p(k). (1)

Changepoint number and positions are estimated by finding $F(n)=\min F(n;\bm{\tau},\bm{\theta},k)$. Such estimation has been shown to be consistent for a range of different data generation models (Fisch et al., 2018; Zheng et al., 2019).
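For illustration, the penalised cost (1) can be evaluated for a candidate segmentation as in the following R sketch (a minimal version assuming the Gaussian change-in-mean cost with unit variance; not the paper's released code):

```r
# A minimal sketch (not the paper's released code): Gaussian change-in-mean
# cost with sigma^2 = 1, minimised over theta at the segment mean, and the
# penalised cost (1) with p(k) = beta * k.
seg_cost <- function(x) sum((x - mean(x))^2)

full_cost <- function(x, tau, beta) {
  bounds <- c(0, tau, length(x))          # tau: interior changepoints
  segs <- Map(function(a, b) x[(a + 1):b],
              head(bounds, -1), tail(bounds, -1))
  sum(vapply(segs, seg_cost, numeric(1))) + beta * length(tau)
}

set.seed(1)
x <- c(rnorm(50, 0), rnorm(50, 3))
full_cost(x, tau = 50, beta = 3 * log(100))          # with the true change
full_cost(x, tau = integer(0), beta = 3 * log(100))  # no-change model
```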

For a discrete problem such as this, computation of the true minimum is not trivial – a naïve brute force approach would require $\mathcal{O}(2^{n})$ tests. Approaches to reducing this cost fall into two broad classes: 1) simplifying the search algorithm by memoisation and pruning of paths (Jackson et al., 2005; Killick et al., 2012; Rigaill, 2015); 2) using greedy methods to find approximate solutions faster (Fryzlewicz, 2014; Baranowski et al., 2019). In both classes, there are methods that can consistently estimate multiple changepoints in linear time under certain conditions.

The first category is of more relevance here. It is based on the Optimal Partitioning (OP) algorithm (Jackson et al., 2005). Let the data be partitioned into discrete blocks $B_{i}:\bigcup_{i}B_{i}=\{x\}_{n}$, so $B_{i}\cap B_{j}=\emptyset,\forall i\neq j$. A function $V$ that maps each set of blocks $P_{j}=\{B_{i}\}$ to a cost is block-additive if:

\forall P_{1},P_{2},\;V(P_{1}\cup P_{2})=V(P_{1})+V(P_{2}). (2)

If each segment incurs a fixed penalty $\beta=p(k)/k$, then the function $F$ defined in (1) is block-additive over segments, and can be defined recursively as:

F(s)=\min_{t}\left(F(t)+C(x_{t+1:s})+\beta\right).

In OP, this cost is calculated for each $s\leq n$, and thus its minimisation requires $\mathcal{O}(n^{2})$ evaluations of $C$. Furthermore, when the cost function $C$ is such that for all $a\leq b<c$:

C(x_{a:c})\geq C(x_{a:b})+C(x_{b+1:c}) (3)

then at each $s$, it can be quickly determined that some candidate segmentations cannot be “rescued” by further segments, and so they can be pruned from the search space. This approach reduces the complexity to $\mathcal{O}(n)$ in the best case. It gave rise to the family of algorithms called PELT (Pruned Exact Linear Time; Killick et al., 2012; Zhao and Yau, 2019). Note that OP and PELT do not rely on any probabilistic assumptions and find exactly the same minimum as the global exponential-complexity search. Since the introduction of PELT, a number of alternative pruning schemes have been developed (Maidstone et al., 2016).
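For illustration, a minimal R sketch of OP with optional PELT-style pruning for the Gaussian change-in-mean cost (our own sketch, not code from the cited works; variable names are hypothetical):

```r
# A sketch of OP with optional PELT-style pruning for the Gaussian
# change-in-mean cost (our illustration, not the authors' code).
seg_cost <- function(x) sum((x - mean(x))^2)

op_segment <- function(x, beta, prune = TRUE) {
  n <- length(x)
  Fv <- c(-beta, rep(NA, n))   # Fv[t + 1] = F(t); F(0) = -beta
  last <- integer(n)           # optimal previous changepoint for each s
  cand <- 0                    # candidate changepoint times t = 0, ..., s - 1
  for (s in 1:n) {
    costs <- vapply(cand,
                    function(t) Fv[t + 1] + seg_cost(x[(t + 1):s]) + beta,
                    numeric(1))
    Fv[s + 1] <- min(costs)
    last[s] <- cand[which.min(costs)]
    if (prune)                 # drop t that can never be optimal again, via (3)
      cand <- cand[costs - beta <= Fv[s + 1]]
    cand <- c(cand, s)
  }
  cps <- integer(0); s <- n    # backtrack the optimal changepoints
  while (s > 0) { cps <- c(last[s], cps); s <- last[s] }
  list(cost = Fv[n + 1], changepoints = setdiff(cps, 0))
}

set.seed(1)
op_segment(c(rnorm(50, 0), rnorm(50, 3)), beta = 3 * log(100))$changepoints
```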

A variation on the basic changepoint detection problem is the identification of epidemic changepoints, when a change in regime appears for a finite time and then the process returns to the background level. The concept of a background distribution $f_{B}$ is introduced, and the segments are defined by pairs of changepoints $s_{i},e_{i}$, outside of which the data is drawn from the background model. The data model then becomes:

f(x_{t})=\begin{cases}f_{S}(x_{t};\theta_{i})&\text{if }\exists i:s_{i}\leq t\leq e_{i}\\ f_{B}(x_{t})&\text{otherwise}.\end{cases} (4)

An equivalent of the CUSUM test for the epidemic situation was first proposed by Levin and Kline (1985), and a number of methods for multiple changepoint detection in this setting have been proposed (Olshen et al., 2004; Fisch et al., 2018; Zhao and Yau, 2019). These use a cost function that includes a separate term $C^{0}$ for the background points:

F(n;\{(s_{i},e_{i})\}_{k},\bm{\theta},k)=\sum_{i=1}^{k}C(x_{s_{i}:e_{i}};\theta_{i})+C^{0}(\{x_{t}:t\notin\bigcup_{i}[s_{i};e_{i}]\};\theta_{0})+p(k) (5)

A common choice for the background distribution is some particular “null” case of the segment distribution family, so that $f_{B}(x)=f_{S}(x;\theta_{0})$ and $C^{0}(\cdot)=C(\cdot;\theta_{0})$. However, while the value of $\theta_{0}$ is known in some settings (such as when presented with two copies of DNA), in other cases it may need to be estimated. Since this parameter is shared across all background points, the cost function is no longer block-additive as in (2), and algorithms such as OP and PELT cannot be directly applied.

One solution is to substitute the unknown parameter with some robust estimate of it, based on the unsegmented series $x_{0:n}$. The success of the resulting changepoint estimation then relies on this estimate being sufficiently close to the true value, and so the non-background data fraction must be small (Fisch et al., 2018). This is unlikely to be true in our motivating case, when nuisance changes in the background level are possible.

Another option is to define

F(n;\theta_{0})=\min_{k,\{(s_{i},e_{i})\}_{k},\{\theta_{i}\}_{k}}F(n;\{(s_{i},e_{i})\}_{k},\{\theta_{i}\}_{k},\theta_{0},k)

which can be minimised using OP or PELT, and then use gradient descent for the outer minimisation over $\theta_{0}$ (Zhao and Yau, 2019). The main drawback of this approach is an increase in computation time proportional to the number of steps needed for the optimisation.
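For illustration, a sketch of this profile approach in R, with a naive $\mathcal{O}(n^{2})$ inner solver standing in for the epidemic-OP routine (epidemic_cost is a hypothetical name, not the aPELT-profile implementation):

```r
# Sketch of the profile idea: the inner epidemic-OP solve (here a naive
# O(n^2) version, with a hypothetical name) is treated as a function of
# theta0 and minimised numerically, as in aPELT-profile.
epidemic_cost <- function(x, theta0, beta) {
  n <- length(x)
  Fv <- c(0, rep(NA, n))                       # Fv[t + 1] = F(t)
  for (s in 1:n) {
    FB <- Fv[s] + (x[s] - theta0)^2            # point stays in background
    FS <- min(vapply(0:(s - 1), function(t) {  # or ends an epidemic segment
      seg <- x[(t + 1):s]
      Fv[t + 1] + sum((seg - mean(seg))^2) + beta
    }, numeric(1)))
    Fv[s + 1] <- min(FB, FS)
  }
  Fv[n + 1]
}

set.seed(2)
x <- c(rnorm(40, 1), rnorm(20, 4), rnorm(40, 1))   # true theta0 = 1
optimize(function(th) epidemic_cost(x, th, beta = 3 * log(100)),
         interval = range(x))
# Each outer step repeats the full inner solve -- the overhead Algorithm 1 avoids.
```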

3 Detection of changepoints with unknown background parameter

To estimate the background parameter and changepoints efficiently, while allowing a large proportion of non-background data, we introduce Algorithm 1. The algorithm makes a pass over the data, during which epidemic segments are detected in a standard way, i.e., by minimising a penalised cost using OP (steps 3-12). The estimate of the parameter $\theta_{0}$ is updated iteratively at each time point. We will show that this process converges almost surely to the true $\theta_{0}$ for certain models of background and segment levels, or reaches its neighbourhood $(\theta_{0}-\epsilon;\theta_{0}+\epsilon)$ under more general conditions.

Algorithm 1 Detection of changepoints with unknown background parameter
1:  Input: $C^{0},C,\beta,x_{0:n}$
2:  Initialize $F(0)=0$, $B(0)=\{x_{0}\}$, $\theta_{0}=x_{0}$
3:  for $t\in 1,\dots,n$ do
4:     Calculate $F_{B}=F(t-1)+C^{0}(x_{t};\theta_{0})$, and $F_{S}=\min_{1\leq k\leq l}F(t-k)+C(x_{t-k+1:t})+\beta$
5:     if $F_{B}<F_{S}$ then
6:        Assign $B(t)=B(t-1)\cup x_{t}$, and recalculate $\theta_{0}$ from $B$
7:        $F(t)=F_{B}$
8:     else
9:        Assign $B(t)=B(t-k)$ (here $k$: $\operatorname{argmin}F_{S}$ in step 4)
10:       $F(t)=F_{S}$
11:    end if
12: end for
13: Repeat steps 3-12, without updating $\theta_{0}$
14: Output: $\theta_{0}$, changepoint positions

The algorithm then makes a second pass over the data (step 13), repeating the segmentation using the final estimate of $\theta_{0}$. The step is identical to a single application of a known-background changepoint detector, such as the one described in Fisch et al. (2018). Its purpose is to update the changepoint positions that are close to the start of the data and so had been determined using less precise estimates of $\theta_{0}$. This simplifies the theoretical performance analysis, but an attractive option is to use this algorithm in an online manner, without this step. We evaluate the practical consequences of this omission later in Section 5.

By segmenting and estimating the background simultaneously, this algorithm is computationally more efficient than dedicated optimisation over possible $\theta_{0}$ values as in Zhao and Yau (2019), while allowing a larger fraction of non-background points than using a robust estimator on unsegmented data, as in Fisch et al. (2018).
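For concreteness, below is a minimal R sketch of Algorithm 1 for the Gaussian change-in-mean case with unit variance (our simplification, not the released implementation); the background set $B(t)$ is tracked through its running sum and size, so that the reversion $B(t)=B(t-k)$ in step 9 is exact:

```r
# A minimal sketch of Algorithm 1 for the Gaussian change-in-mean case
# (sigma = 1); our simplification, not the released implementation.
detect_unknown_bg <- function(x, beta, l) {
  n <- length(x)
  Fv <- c(0, 0, rep(NA, n - 1))          # Fv[t + 1] = F(t); x[1] initialises
  Bsum <- c(NA, x[1], rep(NA, n - 1))    # sum over B(t), stored at index t + 1
  Bn   <- c(NA, 1,    rep(NA, n - 1))    # |B(t)|
  seg_from <- rep(NA_integer_, n)
  for (t in 2:n) {
    theta0 <- Bsum[t] / Bn[t]            # current background estimate
    FB <- Fv[t] + (x[t] - theta0)^2      # step 4, background branch
    ks <- 1:min(l, t - 1)
    FSs <- vapply(ks, function(k) {      # step 4, segment branch
      seg <- x[(t - k + 1):t]
      Fv[t - k + 1] + sum((seg - mean(seg))^2) + beta
    }, numeric(1))
    if (FB < min(FSs)) {                 # steps 6-7: grow B, update theta0
      Bsum[t + 1] <- Bsum[t] + x[t]; Bn[t + 1] <- Bn[t] + 1
      Fv[t + 1] <- FB
    } else {                             # steps 9-10: B(t) = B(t - k)
      k <- ks[which.min(FSs)]
      Bsum[t + 1] <- Bsum[t - k + 1]; Bn[t + 1] <- Bn[t - k + 1]
      seg_from[t] <- t - k
      Fv[t + 1] <- min(FSs)
    }
  }
  # Step 13 (second pass with theta0 fixed) is omitted in this sketch.
  list(theta0 = Bsum[n + 1] / Bn[n + 1], seg_from = seg_from)
}
```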

We now demonstrate that Algorithm 1 converges almost surely under certain constraints, and that it is consistent.

3.1 Convergence

Theorem 1.

Consider the problem of an epidemic change in mean, with data $x_{0:n}$ generated as in (4). Assume the distribution $f_{B}(x)$ and the marginal distribution of segment points $f_{S}(x)=\int f_{S}(x;\theta)f_{\Theta}(\theta)\,d\theta$ are symmetric and strongly unimodal, with unknown background mean $\theta_{0}$, and that data points within each segment are iid. Here, $f_{\Theta}$ is the probability density of the parameter $\theta$ corresponding to each data point. Denote by $w_{t}$ the estimate of $\theta_{0}$ obtained by analysing $x_{0:t}$ by Algorithm 1. The sequence $\{w_{t}\}$ converges:

  1. to $\theta_{0}$ almost surely, if $\mathbb{E}f_{S}=\theta_{0}$;

  2. to a neighbourhood $(\theta_{0}-\epsilon,\theta_{0}+\epsilon)$ almost surely, where $\epsilon\rightarrow 0$ as the number of background points $n$ between successive segments grows, $n\rightarrow\infty$.

We refer the reader to Appendix A for the proof. It is based on a result by Bottou (1998), who established conditions under which an online minimisation algorithm almost surely converges to the optimum (including stochastic gradient descent as a special case). We show that these conditions are either satisfied directly by the updating process in our algorithm, or longer update cycles can be defined that will satisfy the conditions, in which case the convergence efficiency is determined by $n$.
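Concretely, in the Gaussian mean case the background update is an online average, $w\leftarrow w+(x-w)/|B|$, i.e. a stochastic-gradient step on $\mathbb{E}(x-w)^{2}$ with step size $1/|B|$. A toy R illustration on background-only data (not a proof):

```r
# Toy illustration (not a proof): on background points, the update of the
# estimate is w <- w + (x - w)/nB, a stochastic-gradient step on
# E(x - w)^2 with step size 1/nB, hence the link to Bottou's conditions.
set.seed(3)
theta0 <- 2
x_bg <- rnorm(5000, mean = theta0)    # background-only stream, for clarity
w <- x_bg[1]
for (i in 2:length(x_bg)) w <- w + (x_bg[i] - w) / i
w    # converges to theta0 = 2 as the stream grows
```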

3.2 Consistency

The changepoint model can be understood as a function over an interval that is sampled to produce $n$ observed points. As the sampling density increases, more accurate estimation of the number and locations of changepoints is expected; this property is formalised as consistency of the detector. Fisch et al. (2018) showed that detectors based on minimising penalised cost are consistent in the Gaussian data case, and their result can be adapted to prove consistency of Algorithm 1. We will use the strengthened SIC-type penalty $\alpha\log(n)^{1+\delta}$, $\delta>0$, as presented in Fisch et al. (2018), but a similar result can be obtained with $\delta=0$.

Additionally, the minimum signal-to-noise ratio at which segments can be detected is formalised by the following assumption:

\forall i,\;(e_{i}-s_{i})\Delta_{i}>\log(n)^{1+\delta}, (6)

where $\Delta_{i}$ is a function of the relative strength of the parameter changes $|\mu_{i}-\mu_{0}|$ and $\sigma_{k}/\sigma_{0}+\sigma_{0}/\sigma_{k}$.

Theorem 2.

Let the data $x_{0:n}$ be generated from an epidemic changepoint model as in (4), with $f_{B}$ and $f_{S}$ Gaussian, and the changing parameter $\theta$ either the mean or the variance (assume the other parameter is known). Further, assume (6) holds for the $k$ changepoints. Analyse the data by Algorithm 1 with penalty $\beta=\alpha\log(n)^{1+\delta}$, $\alpha,\delta>0$. The estimated number and position of changepoints will be consistent, i.e. $\forall\epsilon>0$, $n>B$:

P\left(\hat{k}=k,\;|\hat{s}_{i}-s_{i}|<\frac{A}{\Delta_{i}}\log(n)^{1+\delta},\;|\hat{e}_{i}-e_{i}|<\frac{A}{\Delta_{i}}\log(n)^{1+\delta},\;\forall 1\leq i\leq k\right)\geq 1-Cn^{-\epsilon}, (7)

for some $A,B,C$ that do not increase with $n$.

The proof is given in Appendix B. We use the connection between our Algorithm 1 and stochastic gradient descent to establish error bounds on the background parameter estimates. These bounds then allow us to apply a previous consistency result (Fisch et al., 2018) to our case.

3.3 Pruning

Much of the improvement in performance of change estimation algorithms comes from pruning of the search space. Standard pruning (Killick et al., 2012) can be applied to Algorithm 1. It continues to find optimal solutions in this case, as is shown in Proposition 1.

To implement pruning, the search set for $F_{S}$ in step 4 would be changed to:

F_{S}=\min_{t^{\prime}\in K}\left(F(t^{\prime})+C(x_{t^{\prime}+1:t})+\beta\right),

and step 11a would be introduced to update $K$ as:

11a. $K=K\cap\{s:F(s)+C(x_{s+1:t})<F(t),\;t+1-l\leq s<t\}\cup\{t\}.$
Proposition 1.

Assume a cost function $C$ such that (3) applies. If, for some $s>t-l$:

F(s)+C(x_{s+1:t})\geq F(t),

then $s$ will not be a segment start point in the optimal solution, and can be excluded from consideration for all subsequent $t^{\prime}>t$.

Proof.

For all $t^{\prime}\leq s+l$, the proof applies as for other pruned algorithms (Killick et al., 2012):

F(t)+C(x_{t+1:t^{\prime}})+\beta\leq F(s)+C(x_{s+1:t})+C(x_{t+1:t^{\prime}})+\beta\leq F(s)+C(x_{s+1:t^{\prime}})+\beta.

For $t^{\prime}>s+l$, segment $(s,t^{\prime})$ will exceed the length constraint and thus cannot be part of the final segmentation. ∎

4 Detecting changepoints with a nuisance process

4.1 Problem setup

In this section, we consider the changepoint detection problem when there is an interfering nuisance process. We assume that this process, like the signal, consists of segments, and we index the start and end times as $s^{N}_{j},e^{N}_{j}$. Data within these segments is generated from a nuisance-only distribution $f_{N}$, or from some distribution $f_{NS}$ if a signal occurs at the same time. In total, four states are possible (background, nuisance, signal, or nuisance and signal), so the overall distribution of data points is:

f(x_{t})=\begin{cases}f_{NS}(x_{t};\theta_{i},\theta^{N}_{j})&\text{if }\exists i,j:t\in[s_{i};e_{i}]\cap[s^{N}_{j},e^{N}_{j}]\\ f_{S}(x_{t};\theta_{i})&\text{if }\exists i:t\in[s_{i},e_{i}],\,t\notin\bigcup_{j}[s^{N}_{j},e^{N}_{j}]\\ f_{N}(x_{t};\theta^{N}_{j})&\text{if }\exists j:t\in[s^{N}_{j},e^{N}_{j}],\,t\notin\bigcup_{i}[s_{i},e_{i}]\\ f_{B}(x_{t})&\text{otherwise}\end{cases} (8)

We add two more conditions to ensure identifiability: 1) the nuisance process evolves more slowly than the signal process, so $\min(e^{N}_{j}-s^{N}_{j})>\max(e_{i}-s_{i})$; 2) signal segments are either entirely contained within a nuisance segment, or entirely outside of it:

\forall i,j,\text{ either }[s_{i},e_{i}]\subset(s^{N}_{j},e^{N}_{j}),\text{ or }[s_{i},e_{i}]\cap[s^{N}_{j},e^{N}_{j}]=\emptyset (9)
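For illustration, these two conditions can be checked for a candidate layout of intervals with a small sketch (a hypothetical helper, not part of the method itself):

```r
# A small sketch (hypothetical helper, not part of the method) checking the
# two identifiability conditions for interval layouts given as two-column
# matrices of (start, end) rows.
check_identifiable <- function(sig, nui) {
  cond1 <- min(nui[, 2] - nui[, 1]) > max(sig[, 2] - sig[, 1])  # durations
  nested_or_disjoint <- function(s, nn)
    (s[1] > nn[1] && s[2] < nn[2]) ||      # strictly inside, as in (9)
    (s[2] < nn[1] || s[1] > nn[2])         # or fully outside
  cond2 <- all(apply(sig, 1, function(s)
    apply(nui, 1, function(nn) nested_or_disjoint(s, nn))))
  cond1 && cond2
}

sig <- rbind(c(30, 50), c(80, 90))   # short signal segments
nui <- rbind(c(20, 70))              # one long nuisance segment
check_identifiable(sig, nui)         # TRUE: one nested, one disjoint
```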

We adapt the changepoint detection method as follows. Define $X_{S}=\bigcup_{i}x_{s_{i}:e_{i}}$, $X_{N}=\bigcup_{j}x_{s^{N}_{j}:e^{N}_{j}}$, a penalty $p^{\prime}(m)=\beta^{\prime}m$ for the number of nuisance segments, and cost functions $C^{NS},C^{S},C^{N},C^{0}$ corresponding to each of the distributions in (8). Then the full cost of a model with $k$ true segments and $m$ nuisance segments is:

F(n;\{(s_{i},e_{i})\}_{k},k,\{(s^{N}_{j},e^{N}_{j})\}_{m},m,\bm{\theta})=C^{0}(x_{0:n}\setminus(X_{S}\cup X_{N}))+\sum_{i=0}^{k}C^{S}(x_{s_{i}:e_{i}}\setminus X_{N};\theta_{i})+\sum_{j=0}^{m}\left(C^{N}(x_{s^{N}_{j}:e^{N}_{j}}\setminus X_{S};\theta_{j})+\sum_{i=0}^{k}C^{NS}(x_{s^{N}_{j}:e^{N}_{j}}\cap x_{s_{i}:e_{i}};\theta_{i},\theta_{j})\right)+\beta k+\beta^{\prime}m. (10)

Note that this cost can also be expressed, using $k_{j}$ as the number of signal segments that overlap nuisance segment $j$ and $k_{0}=k-\sum k_{j}$, as:

F(n;\{(s_{i},e_{i})\}_{k},k,\{(s^{N}_{j},e^{N}_{j})\}_{m},m,\bm{\theta})=C^{0}(x_{0:n}\setminus(X_{S}\cup X_{N}))+\sum_{i=0}^{k_{0}}C^{S}(x_{s_{i}:e_{i}}\setminus X_{N};\theta_{i})+\beta k_{0}+\sum_{j=0}^{m}\left(C^{\prime}(x_{s^{N}_{j}:e^{N}_{j}})+\beta k_{j}+\beta^{\prime}\right).

Condition (9) ensures that $C^{\prime}$ and $C^{S}$ are independent (no points or $\theta_{i}$ are shared between them), so $F$ is block-additive over (signal or nuisance) segments and can be minimised by dynamic programming.

4.2 Proposed method

To minimise this cost, we propose the method outlined in Algorithm 2. Its outer loop proceeds over the data to identify segments by the usual OP approach. However, to evaluate the cost $C^{\prime}(x_{a:b})$, which allows segment and nuisance overlaps, a changepoint detection problem with unknown background level must be solved over $x_{a:b}$. This is achieved by an inner loop using Algorithm 1. If $C^{\prime}=C^{S}$, this method would reduce to a standard detector of epidemic changepoints, with the exception that segments are separated into two types depending on their length; this is very similar to the collective and point anomaly (CAPA) detector in Fisch et al. (2018).

The cost $C^{\prime}$ is minimised using Algorithm 1, so by Theorem 2, the number and positions of true segments will be estimated consistently, given accurate assignment of the $m$ nuisance positions $s^{N}_{j},e^{N}_{j}$. However, the latter event is subject to a complex set of assumptions on relative signal strength, position, and duration of the segments. Therefore, we do not attempt to describe these in full here, but instead investigate the performance of the method by extensive simulations in Section 5.1.3.

Algorithm 2 Adaptive segmentation by changepoint detection
1:  Input: $C^{0},C^{S},C^{N},C^{NS},\beta,\beta^{\prime},x_{0:n}$
2:  Initialize $F(0)=0$, $\theta_{0}=x_{0}$, lists of tuples $chp_{S},chp_{N},chp_{NS}$
3:  If $\theta_{0}$ not known: estimate using an appropriate robust estimator
4:  for $t\in 1,\dots,n$ do
5:     for $t^{\prime}\in 0,\dots,t-l-1$ do
6:        Apply Algorithm 1 to data $x_{t^{\prime}+1:t}$, store returned cost as $C^{\prime}$ and changepoints as $chp_{NS}(t^{\prime})$
7:        $F_{N}(t^{\prime})=F(t^{\prime})+C^{\prime}+\beta^{\prime}$
8:     end for
9:     $F_{B}=F(t-1)+C^{0}(x_{t})$
10:    $F_{S}=\min_{1\leq k\leq l}F(t-k)+C^{S}(x_{t-k+1:t})+\beta$
11:    $F_{N}=\min F_{N}(t^{\prime})$
12:    Assign $F(t)=\min\{F_{B},F_{S},F_{N}\}$
13:    if $F(t)=F_{B}$ then
14:       Assign $chp_{S}(t)=chp_{S}(t-1)$, $chp_{N}(t)=chp_{N}(t-1)$
15:    else if $F(t)=F_{S}$ then
16:       Assign $chp_{S}(t)=chp_{S}(t-k)\cup(t-k,t)$ with $k$ determined by $\operatorname{argmin}F_{S}$, $chp_{N}(t)=chp_{N}(t-1)$
17:    else
18:       Assign $chp_{S}(t)=chp_{S}(t^{\prime})\cup chp_{NS}(t^{\prime})$, $chp_{N}(t)=chp_{N}(t^{\prime})\cup(t^{\prime},t)$ with $t^{\prime}$ determined by $\operatorname{argmin}F_{N}$
19:    end if
20: end for
21: Output: changepoint positions $chp_{S}(n)=\{(s_{i},e_{i})\}$ and $chp_{N}(n)=\{(s^{N}_{j},e^{N}_{j})\}$

Algorithm 2 is stated assuming that a known or estimated value of the parameter $\theta_{0}$, corresponding to the background level without the nuisance variations, is available. In practice, it may be known when there is a technical noise floor or a meaningful baseline that can be expected after removing the nuisance changes. Alternatively, $\theta_{0}$ may be estimated by a robust estimator if a sufficient fraction of background points can be assumed; in either case, the (known or estimated) value is substituted to reduce the computational cost. Otherwise, the method can be modified to estimate $\theta_{0}$ simultaneously with segmentation, using a principle similar to Algorithm 1.

4.3 Pruning

In the proposed method, the estimation of the parameter $\mu_{j}$ (the mean of segment $j$) is sensitive to the segment length, therefore the cost $C^{\prime}$ does not necessarily satisfy the additivity condition (3), and so it cannot be guaranteed that PELT-like pruning will be exact. However, we can establish a local pruning scheme that retains the exact optimum with probability $\rightarrow 1$ as $n\rightarrow\infty$.

Proposition 2.

Assume data $x_{0:n}$ is generated from a Gaussian epidemic changepoint model, and that the distance between changepoints is bounded by some function $A(n)$:

\forall i,j:\;s^{N}_{j+1}-e^{N}_{j}>A(n),\;|s_{i}-s^{N}_{j}|>A(n).

At time tt, the solution space is pruned by removing:

\bm{k}_{pr,t}=\{k:F(k-1)+C^{\prime}(x_{k:t})\geq\min_{m}F(m-1)+C^{\prime}(x_{m:t})+\alpha\log(n)^{1+\delta}\}. (11)

Here $m\in(t-A(n);t]$, $k\in(t-A(n);t]$, $k\neq m$. Then $\forall\epsilon>0$, there exist constants $B,n_{0}$, such that when $n>n_{0}$, the true nuisance segment positions are retained with high probability:

P(\forall j:s^{N}_{j}\notin\bigcup_{t}\bm{k}_{pr,t},\;e^{N}_{j}\notin\bigcup_{t}\bm{k}_{pr,t})\geq 1-Bn^{-\epsilon}.

The proof is given in Appendix C. The assumed distance bound serves to simplify the detection problem: within each window $(t-A(n),t]$, at most one true changepoint may be present, and the initial part of Algorithm 2 is identical to a standard epidemic changepoint detector. It can be shown that other candidate segmentations in the pruning window are unlikely to have significantly lower cost than the one associated with $s^{N}_{j},e^{N}_{j}$, and therefore $s^{N}_{j},e^{N}_{j}$ are likely to be retained in pruning.

As this scheme only prunes within windows of size $A(n)$, and setting a large $A(n)$ may violate the true data generation model, it is less efficient than PELT-type pruning. However, assuming that the overall estimation of nuisance and signal changepoints is consistent, Proposition 2 extends to standard pruning over the full dataset. We show that this holds empirically in Section 5.1.3.

5 Experiments

In this section, we present the results of simulations used to evaluate the performance of the methods, and demonstrations on real-world datasets.

The methods proposed in this paper are implemented in R 3.5.3, and available on GitHub at https://github.com/jjuod/changepoint-detection. The repository also includes the code for recreating the simulations and data analyses presented here.

5.1 Simulations

5.1.1 Algorithm 1 estimates the background parameter consistently

In the first experiment we tested the performance of the algorithm with a fixed background parameter. Three different sets of data were generated and used:

Scenario 1. Gaussian data with one signal segment: $n$ data points were drawn as $x_{t}\sim\mathcal{N}(\theta_{t},1)$, with a segment of $\theta_{t}=3$ at times $t\in(0.3n;0.5n]$, and $\theta_{t}=0$ otherwise (background level).

Scenario 2. Gaussian data with multiple signal segments: $n$ data points were drawn as $x_{t}\sim\mathcal{N}(\theta_{t},1)$, with:

\theta_{t}=\begin{cases}-1&\text{when }t\in(0.2n;0.3n]\cup(0.7n;0.8n]\\ 1&\text{when }t\in(0.5n;0.6n]\\ 0&\text{otherwise}.\end{cases}

Scenario 3. Heavy-tailed data with one signal segment: $n$ data points were drawn from the generalized $t$ distribution as $x_{t}\sim T(3)+\theta_{t}$, with $\theta_{t}=2$ at times $t\in(0.2n;0.6n]$ and 0 otherwise (background level).
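For illustration, these generators can be sketched in R as follows (the random seed and the exact t-distribution call are our choices; the paper's simulation code is available in the repository linked in Section 5):

```r
# A sketch of the three generators (seeds and the exact t-distribution call
# are our choices; the paper's simulation code is in the linked repository).
gen_scenario <- function(scenario, n) {
  t <- seq_len(n)
  theta <- numeric(n)
  if (scenario == 1) {
    theta[t > 0.3 * n & t <= 0.5 * n] <- 3
    theta + rnorm(n)
  } else if (scenario == 2) {
    theta[(t > 0.2 * n & t <= 0.3 * n) | (t > 0.7 * n & t <= 0.8 * n)] <- -1
    theta[t > 0.5 * n & t <= 0.6 * n] <- 1
    theta + rnorm(n)
  } else {
    theta[t > 0.2 * n & t <= 0.6 * n] <- 2
    theta + rt(n, df = 3)   # t with 3 df: variance df/(df - 2) = 3
  }
}

x <- gen_scenario(2, n = 750)   # one replication of Scenario 2
```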

To evaluate the background level ($\theta_{0}$) estimation by Algorithm 1, we generated time series for each of the three scenarios with values of $n$ between 30 and 750, with 500 replications for each $n$. Note that the segment positions remain the same for all sample sizes, so the increase in $n$ can be interpreted as denser sampling of the time series.

Each time series was analysed using Algorithm 1 to estimate $\theta_{0}$. The maximum segment length was set to $l=0.5n$ and the penalty to $\beta=3\log(n)^{1.1}$. The cost was computed using the Gaussian log-likelihood function with known variance. This means that the cost function was mis-specified in Scenario 3, and so provides a test of the robustness of the algorithm (although the variance was set to 3, as expected for $T(3)$).

For comparison, in each replication we also retrieved the median of the entire time series $x_{1:n}$, which is used as the estimate of $\theta_{0}$ by Fisch et al. (2018). For Scenarios 1 and 2, we also computed the quantiles of $\mathcal{N}(0,\sigma^{2}/\sqrt{n_{B}})$ as the oracle efficiency limit based on the CLT, with $n_{B}$ the total number of background points: $n_{B}=0.8n$ for Scenario 1 and $0.7n$ for Scenario 2.

Figure 1 shows the results for all three scenarios. It can be seen that Algorithm 1 produced consistent estimates given sufficiently long time series. When the signal-to-noise ratio was large (Scenario 1), most of the estimated background values were accurate even at small $n$, and at larger $n$ our estimator reached efficiency and accuracy similar to the oracle estimator. Predictably, the median of the non-segmented data produced biased estimates, which may nonetheless be preferable in some cases, as our estimate showed large variability at low $n$ in Scenarios 2 and 3. At $n>400$, our estimator provides the lowest total error in all tested scenarios, even with the mis-specified cost function in Scenario 3.

Figure 1: Consistency of the background parameter estimates. Time series simulated in three different scenarios were analysed by Algorithm 1 (shown in red). Lines are the inter-quartile range (solid) and 2.5–97.5% range (faint) of the background parameter estimates observed in 500 replications at each $n$. For comparison, we also show the ranges of estimates obtained by the median (blue) and the mean of true background points (green; an oracle estimator).

5.1.2 Segment positions estimated by Algorithm 1 are accurate and consistent

The same simulation setup was used to evaluate the consistency of the estimated segment number and positions. Data were generated and analysed with Algorithm 1 as in Section 5.1.1, but we also retrieved the changepoint positions that were estimated in step 12. This corresponds to the online usage of the algorithm, in which segmentation close to the start of the data is based on early $\theta_{0}$ estimates and may therefore itself be less precise. We extracted the mean number of segments reported by the algorithm and also calculated the true positive rate (TPR) as the fraction of simulations in which the algorithm reported at least 1 changepoint within $0.05n$ points of each true changepoint.
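For reference, this criterion can be sketched as follows (runs is a hypothetical list of per-replication detection results, not an object from the paper's code):

```r
# Sketch of the TPR criterion: each true changepoint must be matched by at
# least one reported changepoint within 0.05 * n points.
hit_all <- function(detected, truth, n, tol = 0.05)
  all(vapply(truth, function(cp) any(abs(detected - cp) <= tol * n), logical(1)))

# TPR over replications ('runs' is hypothetical):
# mean(vapply(runs, function(r) hit_all(r$changepoints, truth, n), logical(1)))
```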

The results show that the TPR approaches 1 for all three scenarios (Table 1). As expected, detecting a strong segment was easier than detecting multiple weak ones: segmentation in Scenario 1 was accurate even at $n=30$, while $n>500$ was needed to reach $\geq 90\%$ TPR in Scenario 2. In Scenario 3, the algorithm correctly detected changes at the true segment start and end (TPR $\approx 100\%$), but tended to fit the segment as multiple ones, likely due to the heavy tails of the $t$ distribution. Notably, skipping step 13 had very little impact on the performance of the algorithm, suggesting that the faster online version can be safely used.

Table 1: Consistency of the number and position of estimated changepoints. Time series simulated in three different scenarios were analysed using Algorithm 1. For each $n$, 500 replications were performed. The mean number of reported segments and the TPR (fraction of iterations when a changepoint was detected within $0.05n$ of each true changepoint) are shown. The number of true segments was 1, 3, and 1 for Scenarios 1, 2, and 3 respectively. The same time series were also analysed in an online manner, i.e., without step 13 of the algorithm; the results are shown on the right of the table.
(Full algorithm) (No step 13)
Scenario $n$ Mean # segm. TPR Mean # segm. TPR
1 30 1.112 0.932 1.138 0.924
90 1.040 0.998 1.056 0.998
180 1.054 1.000 1.060 1.000
440 1.032 1.000 1.040 1.000
750 1.028 1.000 1.034 1.000
2 30 0.552 0.000 0.584 0.000
90 1.096 0.008 1.050 0.004
180 1.762 0.086 1.720 0.080
440 2.900 0.824 2.854 0.782
750 3.010 0.992 3.008 0.988
3 30 0.644 0.142 0.662 0.132
90 1.426 0.592 1.416 0.570
180 1.930 0.878 1.932 0.866
440 2.798 0.998 2.798 0.996
750 3.692 1.000 3.664 1.000

5.1.3 Algorithm 2 recovers true signal segments under interference

To evaluate Algorithm 2, we generated time series with both signal and nuisance segments under two different scenarios:

Scenario 1. Gaussian data with a signal segment overlapping a nuisance segment: $n$ data points were drawn as $x_{t}\sim\mathcal{N}(\theta^{S}_{t}+\theta^{N}_{t},1)$, with $\theta^{S}_{t}=2$ when $t\in(0.3n;0.5n]$, 0 otherwise, and $\theta^{N}_{t}=2$ when $t\in(0.2n;0.7n]$, 0 otherwise.

Scenario 2. Gaussian data with a nuisance segment and two non-overlapping signal segments: $n$ data points were drawn as $x_{t}\sim\mathcal{N}(\theta^{S}_{t}+\theta^{N}_{t},1)$, with:

\theta^{N}_{t}=1\text{ when }t\in(0.2n;0.4n],\quad 0\text{ otherwise}
\theta^{S}_{t}=\begin{cases}3&\text{when }t\in(0.5n;0.6n]\\ -3&\text{when }t\in(0.7n;0.8n]\\ 0&\text{otherwise}\end{cases}

Time series were generated for $n$ between 30 and 240, in 500 replications at each $n$. Each series was analysed by three methods. Algorithm 2 (proposed) was run with penalties $\beta=\beta^{\prime}=3\log(n)^{1.1}$ as before. For classical detection of epidemic changes in mean, we used the R package anomaly (Fisch et al., 2018); the implementation allows separating segments of length 1, but we treated them as standard segments and set all penalties to $3\log(n)^{1.1}$. In both methods, background parameters were set to $\mu_{0}=0,\sigma_{0}=1$, and the maximum segment length was $l=0.33n$ in Scenario 1 and $l=0.15n$ in Scenario 2. As an example of a different approach, we included the narrowest-over-threshold detector implemented in the R package not, with default parameters. This is a non-epidemic changepoint detector that was shown to outperform most comparable methods (Baranowski et al., 2019). Since it does not include the background-signal distinction, we define signal segments as regions between two successive changepoints where the mean exceeds $\mu_{0}\pm\sigma_{0}$.
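For illustration, this thresholding rule can be written as a short post-processing step (our own sketch; the changepoint vector cpts may come from any detector, such as the not package):

```r
# Post-processing sketch for a non-epidemic detector: given changepoints
# cpts from any method (e.g. the not package), split x into segments and
# keep those whose mean leaves the band mu0 +/- sigma0.
signal_segments <- function(x, cpts, mu0 = 0, sigma0 = 1) {
  bounds <- c(0, sort(cpts), length(x))
  segs <- data.frame(start = head(bounds, -1) + 1, end = tail(bounds, -1))
  segs$mean <- mapply(function(a, b) mean(x[a:b]), segs$start, segs$end)
  segs[abs(segs$mean - mu0) > sigma0, ]   # segments classified as signal
}
```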

As before, evaluation metrics were the mean number of segments and the TPR. For the proposed method, we required accurate segment type detection as well, i.e., a true positive is counted when the detector reported a nuisance start/end within $0.05n$ of each nuisance changepoint and a signal start/end within $0.05n$ of each signal changepoint. In Scenario 1, we also extracted the estimate of $\theta$ corresponding to the detected signal segment (or the segment closest to $(0.3n;0.5n]$ if multiple were detected). The average of these over the 500 replications is reported as $\hat{\theta}^{S}$.

To evaluate the effects of pruning, Algorithm 2 was applied without pruning, or with global pruning as in (11), with $m\in(0;t-l)$ at each $t$. Only 5 out of 5000 runs (500 iterations $\times$ 10 settings) showed any differences between the globally-pruned and non-pruned methods (Supplementary Table S1 in Appendix D), so we only present results obtained with pruning from here on.

We observed that the proposed method (with pruning) successfully detected true signal segments in both scenarios (Figure 2 and Table 2). The number of nuisance detections was accurate in Scenario 1, and slightly underestimated in favour of more signal segments in Scenario 2, most likely because the simulated nuisance length was close to the cutoff of $0.15n$.

Figure 2: Relative bias in the number of changepoints estimated by the proposed Algorithm 2 (pruned) and two alternative detectors, anomaly and not. Data were simulated for two scenarios, in 500 replications for each $n$. For the proposed algorithm, bias is calculated separately for signal and nuisance segments.
Table 2: The true positive rate of changepoint estimation by the proposed Algorithm 2 (pruned) and two alternative detectors, anomaly and not. Data were simulated for two scenarios, in 500 replications for each $n$. The true positive rate is the fraction of iterations when a changepoint of the correct type was detected within $0.05n$ of each true changepoint (types are signal, S, and nuisance, N).
Scenario $n$ Proposed (S) Proposed (N) anomaly not
1 30 0.444 0.548 0.382 0.428
60 0.742 0.752 0.532 0.724
100 0.938 0.930 0.870 0.912
160 0.984 0.972 0.982 0.994
240 1.000 0.986 1.000 0.998
2 30 0.700 0.110 0.934 0.152
60 0.924 0.238 0.962 0.872
100 0.986 0.412 0.998 0.988
160 0.998 0.640 1.000 1.000
240 0.998 0.764 1.000 1.000

As expected, when nuisance segments are not included in the model (the anomaly and not methods), they were identified as multiple changepoints; as a result, the number of segments was over-estimated by up to 3-fold. These models are also unable to capture the signal-specific change in mean $\theta^{S}$: in Scenario 1 at $n=240$, anomaly estimated $\hat{\theta}^{S}=3.99$ and not estimated $\hat{\theta}^{S}=3.95$. These values correspond to the sum of the signal and nuisance effects. While the estimation is accurate and could be used to recover the value of interest by post-hoc analysis, our proposed method estimated the signal-specific change directly, as $\hat{\theta}^{S}=2.00$ in Scenario 1 at $n=240$.

5.2 Real-world Data

5.2.1 ChIP-seq

As an example application of the algorithms proposed in this paper, we demonstrate peak detection in chromatin immunoprecipitation sequencing (ChIP-seq) data. The goal of ChIP-seq is to identify DNA locations where a particular protein of interest binds, by precipitating and sequencing bound DNA. This produces a density of binding events along the genome that then needs to be processed to identify discrete peaks. Typically, some knowledge of the expected peak length is available to the researcher, although with considerable uncertainty (Rashid et al., 2011). Furthermore, the background level may contain local shifts of various sizes, caused by sequencing bias or true structural variation in the genome (Zhang et al., 2008). The method proposed in this paper is designed for such cases, and can potentially provide more accurate and more robust detection.

We used datasets from two ChIP-seq experiments investigating histone modifications in human immune cells. Broad Institute H3K27ac data was obtained from the UCSC Genome Browser, GEO accession GSM733771, as mean read coverage in non-overlapping windows of 25 bp. While ground truth is not available for this experiment, we also retrieved an input control track for the same cell line from UCSC (GEO accession GSM733742). We analysed a window near the centromere of chromosome 1, between 120,100,000 and 120,700,000 bp. This window contains nuisance variation in the background level, as seen in the input control (Figure 3, top). To improve runtime, the data was downsampled to approximately 1000 points, each corresponding to mean coverage in a window of 500 bp. All read coordinates in this paper correspond to the hg19 genome build.

The second dataset was the UCI/McGill dataset, obtained from https://archive.ics.uci.edu/ml/datasets/chipseq. This dataset was previously used for evaluating peak detection algorithms (Hocking et al., 2017, 2018). Mean read coverage at 1 bp resolution is provided, as well as peak annotations based on visual inspection. The annotations are weak labels in the sense that they indicate peak presence or absence in a region, not their exact positions, to acknowledge the uncertainty when labelling visually. From this data, we used the H3K36me3 modification in monocyte sample ID McGill0104, AM annotation. Around the labels $L=\{(s_{i},e_{i})\}$ in each chromosome, we extracted read coverage for the window between $s-(e-s)$ and $e+(e-s)$ bp, where $s=\min s_{i}$, $e=\max e_{i}$, and downsampled to about 1000 points as before. Based on visual inspection of the coverage, we chose a group of labels in chromosome 12, which provides a variety of annotations and a structure that appears to contain both nuisance shifts and signal peaks.

The ChIP-seq datasets were analysed by the method proposed here (Algorithm 2), an epidemic detector from the package anomaly, and the non-epidemic detector not. The length of the signal segments was limited to 50 kbp in anomaly and in the proposed method. As an estimate of the global mean $\mu_{0}$, the median of the unsegmented data was used, and $\sigma_{0}$ was estimated by the standard deviation of the non-segmented data. As before, penalties were set to $3\log(n)^{1.1}$. Only segments with estimated $\theta>\mu_{0}$ are shown, as we are a priori interested only in regions of increased binding.

We also used GFPOP, implemented in the R package PeakSegDisk (Hocking et al., 2018): this detector has been developed specifically for ChIP-seq data processing, and models a change in the Poisson rate parameter. This method does not include a single background level, but enforces alternating up-down constraints. It is intended as a supervised method, with the penalty value $\lambda$ chosen based on the best segmentation provided by the training data. Therefore, for the Broad Institute dataset, we repeated the segmentation with $\lambda\in\{10^{1},10^{2},10^{3},10^{4},10^{5},10^{6}\}$, and show the $\lambda$ that produces between 2 and 10 segments.

To evaluate the results quantitatively, we calculated the SIC based on each method's segmentation. We used the Gaussian likelihood with parameters matching the detection status (i.e., the estimated mean for the points in each segment, $\hat{\mu}_{0}$ for points outside segments, and $\hat{\sigma}_{0}$ as the standard deviation for all points). The number of parameters was set to $3k+2$: a mean and two endpoints for each of the $k$ segments reported, and two for the background parameters.
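For illustration, this criterion can be sketched as follows (our own helper, assuming segments are supplied as a two-column matrix of start and end indices):

```r
# Sketch of the comparison criterion (our own helper): Gaussian
# log-likelihood with segment means inside reported segments and
# (mu0_hat, sigma0_hat) elsewhere, penalised by (3k + 2) log n.
sic <- function(x, segs, mu0, sigma0) {   # segs: two-column (start, end)
  mu <- rep(mu0, length(x))
  for (i in seq_len(nrow(segs))) {
    idx <- segs[i, 1]:segs[i, 2]
    mu[idx] <- mean(x[idx])                # estimated mean within segment i
  }
  ll <- sum(dnorm(x, mean = mu, sd = sigma0, log = TRUE))
  -2 * ll + (3 * nrow(segs) + 2) * log(length(x))
}
```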

In the H3K27ac data, all methods detected the three most prominent peaks, but produced different results for smaller peaks and more diffuse change areas (Figure 3, bottom). Both PeakSegDisk and anomaly marked a broad segment in the area around 120,600,000 bp. Based on comparison with the control data, this change is spurious, and it exceeds the 50 kbp bound for target segments. While this bound was provided to the anomaly detector, it does not include an alternative way to model these changes, and therefore still reports one or more shorter segments. In contrast, our method accurately modelled the area as a nuisance segment with two overlapping sharp peaks.

Using not, the data was partitioned into 10 segments. By defining segments with low mean ($\theta<\hat{\mu}_{0}+\hat{\sigma}_{0}$) as background, we could reduce this to 4 signal segments; while this removed the spurious background change, it also discarded the shorter change around 120,200,000 bp, which fits the definition of a signal peak ($<50$ kbp) and was retained by the proposed method. This data illustrates that choosing the post-processing required for most approaches is not trivial, and can have a large impact on the results. In contrast, the parameters required for our method have a natural interpretation and may be known a priori or easily estimated, and the outputs are provided in a directly useful form.

Figure 3: ChIP-seq read counts and analysis results. Counts provided as mean coverage in 500 bp windows for a non-specific control sample (top) and H3K27ac histone modification (bottom), chromosome 1. Segments detected in the H3K27ac data by the method proposed here (Algorithm 2) and three other detectors are shown under the counts. Note that the proposed method can also produce longer nuisance changes (yellow) overlapped by signal segments (blue). not does not specifically identify background segments; we show the ones with relatively low mean in light blue.

In the UCI data, segment detections also generally matched the visually determined labels. However, our method produced the most parsimonious models to explain the changes. Figure 4 shows such an example from chromosome 12, where our method reported two nuisance segments and a single sharp peak around 62,750,000 bp. The nuisance segments correspond to broad regions of mean shift, which were also detected by anomaly and not, but using 6 and 16 segments, respectively. Notably, PeakSeg differed considerably: as this method does not incorporate a single background level, but requires segments to alternate between background and signal, the area around 62,750,000 bp was defined as background, despite having a mean of $4.5\,\hat{\mu}_{0}$. In total, 12 segments were reported by this method. This shows that the ability to separate nuisance and signal segments helps produce more parsimonious models, and in this way minimises downstream efforts such as experimental replication of the peaks.

The visual annotations provided for this region are shown in the first row in Figure 4. Note that they do not distinguish between narrow and broad peaks (single annotations in this sample range up to 690 kb in size). Furthermore, comparison with such labels does not account for finer segmentation, coverage in the peak area, or the number of false alarms outside it. For these reasons we are unable to use the labels in a quantitative way.

Quantitative comparison of the segmentations by SIC also favours our proposed method in both datasets. In the Broad dataset, the SICs corresponding to the segmentations reported by PeakSeg, not, and anomaly were 4447.4, 4532.2, and 4311.6, respectively, while the segmentation produced by our model had an SIC of 4285.8. In the UCI data, the smallest criterion value was also produced by our method (4664.5), closely followed by anomaly (4664.8), while the not segmentation resulted in an SIC of 4705.7 and PeakSeg in 6886.7. This suggests that in addition to the practical benefits of separating unwanted segments, the nuisance-signal structure provides a better fit to these datasets than models that allow only one type of segments.

Figure 4: A window of H3K36me3 ChIP-seq data on chromosome 12. Read coverage in 1100 bp windows (black points), manual annotations of peaks included in the dataset (boxes below), and detection results using Algorithm 2 proposed in this paper, as well as three state-of-the-art methods (lines at the bottom).

5.2.2 European mortality data

The recent pandemic of coronavirus disease COVID-19 prompted a renewed interest in early outbreak detection and quantification. In particular, analysis of mortality data provided an important resource for guiding the public health responses to it (e.g. Baud et al., 2020; Yuan et al., 2020; Dowd et al., 2020).

We analysed Eurostat data of weekly deaths in Spain over a three-year period between 2017 and 2020. Data was retrieved from the https://ec.europa.eu Data Explorer. Besides the impact of the pandemic, mortality data contains normal seasonal variations, particularly in older age groups. We use the 60-64 years age group, in which these trends are visually clear and thus provide a form of ground truth.

Data were analysed with the four methods introduced earlier. For the proposed and anomaly methods, we used the median and standard deviation of the first 52 weeks of the dataset as estimates of $\mu_{0}$ and $\sigma_{0}$, respectively. Penalties in both were set to $3\log(n)^{1.1}$, as previously. The maximum length of signal segments was set to 10 weeks, to separate seasonal effects. In addition, not was used with default parameters, defining signal as regions where the mean exceeds $\mu_{0}\pm\sigma_{0}$, and PeakSegDisk with a basic grid search to select the penalty, as before.

The results of the four analyses are shown in Figure 5. Three of the methods, anomaly, PeakSeg, and Algorithm 2, detected a sharp peak around the pandemic period. However, anomaly and PeakSeg also marked one winter period as a signal segment, while ignoring the other two. Four segments were created by not, including a broad peak continuing well past the end of the pandemic spike. In contrast, the proposed method marked the pandemic spike sharply, while also labelling all three winter periods as nuisance segments. The resulting detection using our method is again parsimonious and flexible: if only short peaks are of interest, our method reports those with lower false alarm rate than the other methods, but broader segments are also marked accurately and can be retrieved if relevant.

As in the ChIP-seq data, comparing the results by SIC identifies our method as optimal for this dataset. The values corresponding to the PeakSeg, not, and anomaly models were 1629.3, 1648.2, and 1626.3 respectively, while Algorithm 2 produced a segmentation with an SIC of 1568.2. Note that the SIC penalises both signal and nuisance segments, so in this case our model still appears optimal despite having more parameters.

Figure 5: Weekly deaths in Spain, in the 60-64 years age group, over 2017–2020 (black points). Detection results using the method proposed in this paper and three alternative methods shown as lines below.

6 Discussion

In this paper, we have presented a pair of algorithms for improving the detection of epidemic changepoints. Similarly to stochastic gradient descent, the iterative updating of the background estimate in Algorithm 1 leads to fast convergence while allowing a large fraction of non-background points. This is utilised in Algorithm 2 to analyse nuisance-signal overlaps.

The computational complexity of both algorithms presented here is $\mathcal{O}(n)$ in the best case, which is similar to state-of-the-art pruned algorithms (Killick et al., 2012; Hocking et al., 2018). However, note that this is stated in the number of required evaluations of $C$. It is usually implicitly assumed that this function can be evaluated and minimised over $\theta$ recursively, so that the total number of operations may also be linear. This would not be achievable with methods that estimate the background level strictly offline, such as aPELT-profile (Zhao and Yau, 2019). Therefore, the development of Algorithm 1 was essential to create the overlap detector.

One of the major practical benefits of the proposed model is the ability to separate non-target segments. We anticipate that this will greatly improve downstream processing, effectively reducing the false alarm rate, or the manual load if the detections are reviewed. Despite that, it is difficult to evaluate this benefit at present: while there are recent datasets with annotations specifically for testing changepoint detection (Hocking et al., 2017; van den Burg and Williams, 2020), they are based on labelling all visually apparent changes. In future work, we expect to provide further application-specific comparisons that would measure the impact of separating and neutralising the nuisance process.

7 Acknowledgements

This work was supported by the Royal Society of New Zealand Marsden Fund and Te Pūnaha Matatini, a New Zealand Centre of Research Excellence.

References

  • Aminikhanghahi and Cook (2016) Aminikhanghahi, S. and Cook, D. J. (2016) A survey of methods for time series change point detection. Knowledge and Information Systems, 51, 339–367.
  • Baranowski et al. (2019) Baranowski, R., Chen, Y. and Fryzlewicz, P. (2019) Narrowest-over-threshold detection of multiple change points and change-point-like features. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 81, 649–672.
  • Baud et al. (2020) Baud, D., Qi, X., Nielsen-Saines, K., Musso, D., Pomar, L. and Favre, G. (2020) Real estimates of mortality following COVID-19 infection. The Lancet Infectious Diseases, 20, 773.
  • Bottou (1998) Bottou, L. (1998) Online algorithms and stochastic approximations. In Online Learning and Neural Networks (ed. D. Saad). Cambridge, UK: Cambridge University Press. Revised, oct 2012.
  • Bracker and Smith (1999) Bracker, K. and Smith, K. L. (1999) Detecting and modeling changing volatility in the copper futures market. Journal of Futures Markets, 19, 79–100.
  • van den Burg and Williams (2020) van den Burg, G. J. J. and Williams, C. K. I. (2020) An evaluation of change point detection algorithms. Preprint arXiv:2003.06222.
  • Dowd et al. (2020) Dowd, J. B., Andriano, L., Brazel, D. M., Rotondi, V., Block, P., Ding, X., Liu, Y. and Mills, M. C. (2020) Demographic science aids in understanding the spread and fatality rates of COVID-19. Proceedings of the National Academy of Sciences, 117, 9696–9698.
  • Fisch et al. (2018) Fisch, A. T. M., Eckley, I. A. and Fearnhead, P. (2018) A linear time method for the detection of point and collective anomalies. Preprint arXiv:1806.01947.
  • Fryzlewicz (2014) Fryzlewicz, P. (2014) Wild binary segmentation for multiple change-point detection. The Annals of Statistics, 42, 2243–2281.
  • Harvey et al. (2019) Harvey, N. J. A., Liaw, C., Plan, Y. and Randhawa, S. (2019) Tight analyses for non-smooth stochastic gradient descent. In Proceedings of the Thirty-Second Conference on Learning Theory (eds. A. Beygelzimer and D. Hsu), vol. 99 of Proceedings of Machine Learning Research, 1579–1613. Phoenix, USA: PMLR.
  • Hochenbaum et al. (2017) Hochenbaum, J., Vallis, O. S. and Kejariwal, A. (2017) Automatic anomaly detection in the cloud via statistical learning. Preprint arXiv:1704.07706.
  • Hocking et al. (2017) Hocking, T. D., Goerner-Potvin, P., Morin, A., Shao, X., Pastinen, T. and Bourque, G. (2017) Optimizing ChIP-seq peak detectors using visual labels and supervised machine learning. Bioinformatics, 33, 491–499.
  • Hocking et al. (2018) Hocking, T. D., Rigaill, G., Fearnhead, P. and Bourque, G. (2018) Generalized functional pruning optimal partitioning (gfpop) for constrained changepoint detection in genomic data. Preprint arXiv:1810.00117.
  • Jackson et al. (2005) Jackson, B., Scargle, J., Barnes, D., Arabhi, S., Alt, A., Gioumousis, P., Gwin, E., San, P., Tan, L. and Tsai, T. T. (2005) An algorithm for optimal partitioning of data on an interval. IEEE Signal Processing Letters, 12, 105–108.
  • Killick et al. (2012) Killick, R., Fearnhead, P. and Eckley, I. A. (2012) Optimal detection of changepoints with a linear computational cost. Journal of the American Statistical Association, 107, 1590–1598.
  • Lau and Tay (2019) Lau, T. S. and Tay, W. P. (2019) Quickest change detection in the presence of a nuisance change. IEEE Transactions on Signal Processing, 67, 5281–5296.
  • Levin and Kline (1985) Levin, B. and Kline, J. (1985) The cusum test of homogeneity with an application in spontaneous abortion epidemiology. Statistics in Medicine, 4, 469–488.
  • Li et al. (2016) Li, S., Cao, Y., Leamon, C., Xie, Y., Shi, L. and Song, W. (2016) Online seismic event picking via sequential change-point detection. In 2016 54th Annual Allerton Conference on Communication, Control, and Computing (Allerton). IEEE.
  • Maidstone et al. (2016) Maidstone, R., Hocking, T., Rigaill, G. and Fearnhead, P. (2016) On optimal multiple changepoint algorithms for large data. Statistics and Computing, 27, 519–533.
  • Olshen et al. (2004) Olshen, A. B., Venkatraman, E. S., Lucito, R. and Wigler, M. (2004) Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics, 5, 557–572.
  • Page (1954) Page, E. S. (1954) Continuous inspection schemes. Biometrika, 41, 100–115.
  • Pearce et al. (2000) Pearce, D., Hirsch, H.-G. and Ericsson Eurolab Deutschland GmbH (2000) The Aurora experimental framework for the performance evaluation of speech recognition systems under noisy conditions. In ISCA ITRW ASR2000, 29–32.
  • Purkayastha (1998) Purkayastha, S. (1998) Simple proofs of two results on convolutions of unimodal distributions. Statistics & Probability Letters, 39, 97–100.
  • Rashid et al. (2011) Rashid, N. U., Giresi, P. G., Ibrahim, J. G., Sun, W. and Lieb, J. D. (2011) ZINBA integrates local covariates with DNA-seq data to identify broad and narrow regions of enrichment, even within amplified genomic regions. Genome Biology, 12, R67.
  • Rigaill (2015) Rigaill, G. (2015) A pruned dynamic programming algorithm to recover the best segmentations with 1 to $K_{max}$ change-points. Journal de la Société Française de Statistique, 156, 180–205.
  • Shewhart (1930) Shewhart, W. A. (1930) Economic quality control of manufactured product. Bell System Technical Journal, 9, 364–389.
  • Texier et al. (2016) Texier, G., Farouh, M., Pellegrin, L., Jackson, M. L., Meynard, J.-B., Deparis, X. and Chaudet, H. (2016) Outbreak definition by change point analysis: a tool for public health decision? BMC Medical Informatics and Decision Making, 16.
  • Truong et al. (2020) Truong, C., Oudre, L. and Vayatis, N. (2020) Selective review of offline change point detection methods. Signal Processing, 167, 107299.
  • Vaisman et al. (2010) Vaisman, L., Zariffa, J. and Popovic, M. R. (2010) Application of singular spectrum-based change-point analysis to EMG-onset detection. Journal of Electromyography and Kinesiology, 20, 750–760.
  • Yuan et al. (2020) Yuan, J., Li, M., Lv, G. and Lu, Z. K. (2020) Monitoring transmissibility and mortality of COVID-19 in Europe. International Journal of Infectious Diseases, 95, 311–315.
  • Zhang et al. (2008) Zhang, Y., Liu, T., Meyer, C. A., Eeckhoute, J., Johnson, D. S., Bernstein, B. E., Nussbaum, C., Myers, R. M., Brown, M., Li, W. and Liu, X. S. (2008) Model-based analysis of ChIP-seq (MACS). Genome Biology, 9, R137.
  • Zhao and Yau (2019) Zhao, Z. and Yau, C. Y. (2019) Alternating pruned dynamic programming for multiple epidemic change-point estimation. Preprint arXiv:1907.06810.
  • Zheng et al. (2019) Zheng, C., Eckley, I. A. and Fearnhead, P. (2019) Consistency of a range of penalised cost approaches for detecting multiple changepoints. Preprint arXiv:1911.01716.
  • Zidek and van Eeden (2003) Zidek, J. V. and van Eeden, C. (2003) Uncertainty, entropy, variance and the effect of partial information, vol. 42 of Lecture Notes–Monograph Series, 155–167. Institute of Mathematical Statistics.

Appendix A Proof of Theorem 1 (Convergence of Algorithm 1)

Bottou (1998) analyses the case of an online algorithm iteratively minimising some function $f(x,w)$ (where $x$ represents the complete data and $w$ the parameters). Data points $\{x_{t}\}$ arrive sequentially, and at each iteration an estimate $w_{t}$ of the location of the minimum $w^{*}$ is obtained using some update function $H(x,w)$ and learning rate $\gamma_{t}$ as:

w_{t+1}=w_{t}-\gamma_{t}H(x_{t+1},w_{t}). (12)

This updating mechanism gives rise to stochastic gradient descent if $\mathbb{E}H(x_{t+1},w_{t})=\nabla_{w}f(x,w)$, but this is not required for the following argument.

To make the link with Algorithm 1 explicit, the update equation applied by this algorithm can be written as:

w_{t+1}=w_{t}+\gamma_{t}(x_{t+1}-w_{t}).
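
In code, this is simply the running-mean recursion over the points currently classified as background; the helper below is a minimal sketch (the function name and bookkeeping are ours):

    def update_background(w, x_next, n_bg):
        """One update step: w <- w + gamma*(x - w), gamma = 1/(n_bg + 1),
        so w remains the mean of all points accepted as background."""
        return w + (x_next - w) / (n_bg + 1)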

Then $w^{*}=\theta_{0}$ (i.e., the background mean that is to be estimated), and we ask whether the sequence of updates converges, $w_{t}\rightarrow w^{*}$. It was shown by Bottou (1998) that this occurs almost surely if the following three conditions are met:

  1. "convexity" – a single optimum $w^{*}$ exists and the expected value of the updates always points towards it:

     \forall\epsilon>0,\ \inf_{(w-w^{*})^{2}>\epsilon}(w-w^{*})\mathbb{E}H(x,w)>0; (13)

  2. learning rate convergence:

     \sum_{t=1}^{\infty}\gamma_{t}=\infty,\quad\sum_{t=1}^{\infty}\gamma^{2}_{t}<\infty; (14)

  3. bounded variance of the updates:

     \mathbb{E}H(x,w)^{2}\leq A+B(w-w^{*})^{2},\quad A,B\geq 0. (15)

Thus, the proof of convergence of our algorithm reduces to showing that these requirements are satisfied. We start with the assumption that the global mean of segment points is also $\theta_{0}$, and then relax this requirement.

The following lemma will be needed:

Lemma 1.

Let $f$ be a unimodal distribution, symmetric around a point $\mu$ (so that $f(x_{1})<f(x_{2})$ when $x_{1}<x_{2}\leq\mu$ and $f(x_{1})>f(x_{2})$ when $\mu\leq x_{1}<x_{2}$), such as a Gaussian. Consider a truncated random variable $X$ with pdf:

g(x)=\begin{cases}0&\text{if }x<m-a\\ \frac{f(x)}{P(m-a\leq X\leq m+a)}&\text{if }m-a\leq x\leq m+a\\ 0&\text{if }x>m+a\end{cases}

for some $a>0$ and $m$. Then $\inf_{m\neq\mu}(\mathbb{E}X-m)(\mu-m)>0$.

Proof.
Dropping the positive normalising constant of $g$ (which does not affect the sign):

\mathbb{E}X-m =\int_{m-a}^{m+a}(x-m)f(x)dx
=\int_{-a}^{0}yf(y+m)dy+\int_{0}^{a}yf(y+m)dy
=\int_{0}^{a}y(f(m+y)-f(m-y))dy. (16)

When $m+a<\mu$, $f$ is increasing throughout the integration range, and $\mathbb{E}X-m>0$; the opposite is true for $m-a>\mu$. If $m<\mu<m+a$, split the integral in (16) as:

\mathbb{E}X-m=\int_{0}^{\mu-m}y(f(m+y)-f(m-y))dy+\int_{\mu-m}^{a}y(f(m+y)-f(m-y))dy.

The first integral covers the range where $f$ is increasing, and thus is positive. Since $\mu-m>0$, $|m+y-\mu|<|m-y-\mu|$ for $y>0$, and $f(m+y)>f(m-y)$ by the symmetry of $f$ around $\mu$ and its monotonicity, so the second integral is positive as well. Similarly, $\mathbb{E}X-m<0$ for $m-a<\mu<m$. ∎
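
Lemma 1 is easy to spot-check numerically. The snippet below uses the truncated normal from scipy to verify the sign of $(\mathbb{E}X-m)(\mu-m)$ for a Gaussian parent with $\mu=0$ and a few arbitrary window centres $m$:

    import numpy as np
    from scipy.stats import truncnorm

    mu, a = 0.0, 1.5                     # parent mean and window half-width
    for m in (-2.0, -0.5, 0.3, 1.0, 4.0):
        lo, hi = m - a, m + a
        # truncnorm takes standardised bounds (here scale = 1, loc = mu)
        ex = truncnorm.mean(lo - mu, hi - mu, loc=mu, scale=1.0)
        assert (ex - m) * (mu - m) > 0   # the sign predicted by Lemma 1
        print(f"m = {m:+.1f}: E[X] = {ex:+.3f}")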

A.1 When the global mean of segments matches the background mean

Consider the case where the background points are independent draws from $\mathcal{N}(w^{*},\sigma^{2})$, the points within each segment are $\mathcal{N}(\theta_{i},\sigma^{2})$ with $\sigma^{2}$ known, and $\theta\sim\mathcal{N}(w^{*},\tau^{2})$. Let $w_{t}$ be the value of the background mean estimated by Algorithm 1 after processing $t$ data points. In this case $w_{t}\xrightarrow{a.s.}w^{*}$.

Proof.

Denote the true class of the next data point $x_{t+1}$ by $\delta_{t+1}$ (1 for background points, 0 for signal). Algorithm 1 estimates this as:

\hat{\delta}_{t+1}=\begin{cases}1&\text{if }F(t)+C^{0}(x_{t+1};w_{t})<\min_{1\leq k\leq l}F(t+1-k)+C(x_{t+2-k:t+1})+\beta\\ 0&\text{otherwise}.\end{cases}

Initially, assume for simplicity that the true maximum segment length is 1 (so only $k=1$ is tested). When $\hat{\delta}_{t+1}=1$, the background estimate is updated as:

w_{t+1}=w_{t}+\frac{1}{\sum_{i=1}^{t}\hat{\delta}_{i}+1}(x_{t+1}-w_{t})

(otherwise $w_{t+1}=w_{t}$). So $\gamma_{t}=1/(\sum_{i=1}^{t}\hat{\delta}_{i}+1)$, and hence the learning rate convergence conditions (14) are satisfied.

Substituting in the costs based on a one-dimensional Gaussian pdf $\phi$, $\hat{\delta}_{t+1}=1$ if:

-\log\phi(x_{t+1};w_{t},\sigma^{2})<-\log\phi(x_{t+1};x_{t+1},\sigma^{2})+\beta
\Rightarrow\frac{1}{\sigma\sqrt{2\pi}}\exp\left(\frac{-(x_{t+1}-w_{t})^{2}}{2\sigma^{2}}\right)>\frac{1}{\sigma\sqrt{2\pi}}e^{-\beta}
\Rightarrow|x_{t+1}-w_{t}|<\sqrt{2\beta\sigma^{2}}
\Rightarrow x_{t+1}\in(w_{t}-\sqrt{2\beta\sigma^{2}},\ w_{t}+\sqrt{2\beta\sigma^{2}}). (17)

The distribution of true segment data points, $f(x_{t+1}|\delta_{t+1}=0)$, is symmetric with mean $w^{*}$ by assumption (as is easily verified for the Gaussian case). The background point distribution is also automatically symmetric. Thus, the overall distribution of the points used to update the $w_{t}$ estimate is a truncation of a symmetric unimodal distribution. In the present case, it is a truncated normal with limits $(w_{t}-\sigma\sqrt{2\beta},\ w_{t}+\sigma\sqrt{2\beta})$, based on (17); more generally, it is a truncated variant of the parent distribution with symmetric limits of the form $w_{t}\pm a$ and parent mean $w^{*}$ (that the acceptance set is an interval follows from the unimodality of $f$).

This means that $f(x_{t+1}|\hat{\delta}_{t+1}=1)$ satisfies the requirements of Lemma 1 with $\mu=w^{*}$, which implies the "convexity" condition (13):

\inf_{(w-w^{*})^{2}>\epsilon}(w-w^{*})\mathbb{E}H(x_{t+1},w)=\inf_{(w-w^{*})^{2}>\epsilon}(w-w^{*})(w-\mathbb{E}(x_{t+1}|\hat{\delta}_{t+1}=1))>0.

Following a similar approach – conditioning on $\delta_{t+1}$ – and using the law of total variance, it can be shown that the second moment $\mathbb{E}(w-x_{t+1})^{2}$ is bounded, as required for condition (15), and so $w_{t}\xrightarrow{a.s.}w^{*}$.
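
This behaviour is easy to reproduce by simulation: classifying each point with rule (17) and updating the running mean over the accepted points drives the estimate towards $w^{*}$, even with a substantial fraction of unit-length segments, as long as the segment means are symmetric around $w^{*}$. A minimal sketch with arbitrary constants (not the full Algorithm 1):

    import numpy as np

    rng = np.random.default_rng(1)
    w, n_bg = 0.5, 1                         # deliberately biased start
    w_star, sigma, beta = 0.0, 1.0, 2 * np.log(10_000)

    for t in range(10_000):
        if rng.random() < 1 / 3:             # unit-length segment point,
            theta = rng.normal(w_star, 2.0)  # mean symmetric around w*
            x = rng.normal(theta, sigma)
        else:                                # background point
            x = rng.normal(w_star, sigma)
        if abs(x - w) < np.sqrt(2 * beta * sigma**2):   # rule (17)
            w += (x - w) / (n_bg + 1)        # running-mean update
            n_bg += 1

    print(f"final estimate: {w:.3f} (true background mean {w_star})")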

Remark. So far, we assumed that the segment length is $k=1$. If segments occur and are tested in non-overlapping windows of any fixed size $k\geq 2$, the result is similar: $\forall j\in[t-k+1;t],\ \hat{\delta}_{j+1}=1$ if:

-\sum_{i=t-k+1}^{t}\log\phi(x_{i+1};w_{t-k},\sigma^{2})<-\sum_{i=t-k+1}^{t}\log\phi(x_{i+1};\bar{x}_{k},\sigma^{2})+\beta,

where $\bar{x}_{k}=\sum_{i=t-k+1}^{t}x_{i+1}/k$. This can be expressed as truncation limits for the accepted $\bar{x}_{k}$, analogously to (17):

\beta >-\sum_{i=t-k+1}^{t}\log\phi(x_{i+1};w_{t-k},\sigma^{2})+\sum_{i=t-k+1}^{t}\log\phi(x_{i+1};\bar{x}_{k},\sigma^{2})
=\sum_{i=t-k+1}^{t}\frac{(x_{i+1}-w_{t-k})^{2}}{2\sigma^{2}}-\frac{(x_{i+1}-\bar{x}_{k})^{2}}{2\sigma^{2}}
=\frac{1}{2\sigma^{2}}\sum_{i=t-k+1}^{t}(w_{t-k}-\bar{x}_{k})^{2}
\Rightarrow\sqrt{2\beta\sigma^{2}/k}>|w_{t-k}-\bar{x}_{k}|. (18)

The pdf of $\bar{x}_{k}$ is (up to scaling) a $k$-fold convolution of $f(x)$. Since $f$, as shown earlier, is symmetric unimodal, so is the convolution, and hence the distribution of $\bar{x}_{k}$, with mean $\mu=w^{*}$ (Purkayastha, 1998). In the special case when $f$ is the normal pdf, this can also be shown directly from Gaussian properties. Then Lemma 1 implies condition (13), and the rest of the proof follows as before.
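
The equivalence in (18) can also be verified directly: the Gaussian likelihood-ratio statistic of a $k$-point window collapses to $k(w_{t-k}-\bar{x}_{k})^{2}/(2\sigma^{2})$. A short numerical check with arbitrary values:

    import numpy as np

    rng = np.random.default_rng(0)
    sigma, beta, k, w = 1.0, 5.0, 4, 0.2
    x = rng.normal(0.0, sigma, size=k)

    def nll(v, mean):                      # Gaussian negative log-likelihood
        return (0.5 * np.sum((v - mean) ** 2) / sigma**2
                + 0.5 * len(v) * np.log(2 * np.pi * sigma**2))

    stat = nll(x, w) - nll(x, x.mean())    # likelihood-ratio statistic
    assert np.isclose(stat, k * (w - x.mean()) ** 2 / (2 * sigma**2))
    print("window accepted:", stat < beta,
          "equivalently:", abs(w - x.mean()) < np.sqrt(2 * beta * sigma**2 / k))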

When all segment positions (overlapping or not) are tested, the background point acceptance rule is:

\hat{\delta}_{t}=1\text{ if }x_{t}\in\bigcap_{1\leq k\leq l}S_{k},\quad\text{with }S_{k}=\{x_{t}:F(t-1)+C^{0}(x_{t})<F(t-k)+C(x_{t-k+1:t})+\beta\}.

As demonstrated earlier, $S_{1}=(w_{t}-\sqrt{2\beta\sigma^{2}},\ w_{t}+\sqrt{2\beta\sigma^{2}})$. Define three sets of $x_{t}$ based on which rules they pass: $X_{1}=\{x_{t}\in S_{1}\}$, $X_{a}=\{x_{t}\in S_{1}\cap S_{\geq 2}\}$ and $X_{r}=\{x_{t}\in S_{1}\setminus S_{\geq 2}\}$, where $S_{\geq 2}=\bigcap_{2\leq k\leq l}S_{k}$, and let $P_{a}$ and $P_{r}$ be the probabilities of the corresponding $x_{t}$ sets. Clearly, $X_{1}=X_{a}\cup X_{r}$. We are interested in the mean of the points accepted as background, i.e. $\mathbb{E}X_{a}$. Assume w.l.o.g. $\mu=0$, $\sigma=1$, $w_{t}>\mu$, as the other case is symmetric. We will now show that for sufficiently large $n$, $\mathbb{E}X_{a}<w_{t}$, satisfying (13).

Using the conditional mean formula:

\mathbb{E}X_{1} =P_{r}\mathbb{E}X_{r}+P_{a}\mathbb{E}X_{a}
\Rightarrow\mathbb{E}X_{a} =\mathbb{E}X_{1}/(1-P_{r})-P_{r}\mathbb{E}X_{r}/(1-P_{r}). (19)

Assume for now that $\mathbb{E}X_{r}=\mu=0$. Then, to obtain $\mathbb{E}X_{a}<w_{t}$, we need:

\mathbb{E}X_{1}/w_{t}<1-P_{r}. (20)

Denote $p=\sqrt{2\beta}$, which is an increasing function of $n$, and consider the growth of both sides of (20) as $n$ increases. For the Gaussian or other distributions in the exponential family with mean $\mu$, truncated to a symmetric region $(-a,a)$, it is known that $Var(X|S;\mu)=\frac{d}{d\mu}\mathbb{E}(X|S;\mu)$ (Zidek and van Eeden, 2003). Then (denoting $S^{\prime}_{1}=(-p,+p)$):

\mathbb{E}X_{1} =w_{t}+\mathbb{E}(X|S^{\prime}_{1};-w_{t})
=w_{t}+\mathbb{E}(X|S^{\prime}_{1};0)+\int_{0}^{w_{t}}\frac{d\mathbb{E}(X|S^{\prime}_{1};-a)}{da}da
=w_{t}+0-\int_{0}^{w_{t}}\frac{d\mathbb{E}(X|S^{\prime}_{1};a)}{da}da
<w_{t}-w_{t}\min_{0\leq a\leq w_{t}}Var(X|S^{\prime}_{1};a).

Hence:

\mathbb{E}X_{1}/w_{t}<1-Var(X|S^{\prime}_{1};w_{t})=1-\int_{w_{t}-p}^{w_{t}+p}x^{2}f(x)dx,

where $f(x)$ is the pdf of $x_{t}$ given $x\in S_{1}$. Thus, this side grows with $p$ as $-x^{2}f(x)$.

To analyse $P_{r}$, we first simplify the background acceptance condition. For any $k\geq 2$, by the definition of $F$ and the additivity of $C$, we have:

F(t-k)+C(x_{t-k+1:t})+\beta =F(t-k)+C(x_{t-k+1:t-1};\bar{x}_{k})+C(x_{t};\bar{x}_{k})+\beta
=F(t-k)+C(x_{t-k+1:t-1};\bar{x}_{-t})+(k-1)d(\bar{x}_{k},\bar{x}_{-t})+d(\bar{x}_{k},x_{t})+\beta
=F(t-1)+A(t-1)+(k-1)d(\bar{x}_{k},\bar{x}_{-t})+d(\bar{x}_{k},x_{t}),

where $d$ is some distance function, $\bar{x}_{-t}$ is the mean of the points $x_{t-k+1:t-1}$, and $A(t)\geq 0$ is a constant depending only on $x_{1:t}$.

It is also helpful to note that $F(t-1)\leq F(t-k)+C^{0}(x_{t-k+1:t-1})$, which corresponds to the case when all of $x_{t-k+1:t-1}$ were identified as background; hence:

A(t-1) =F(t-k)+C(x_{t-k+1:t-1};\bar{x}_{-t})+\beta-F(t-1)
\geq C(x_{t-k+1:t-1};\bar{x}_{-t})+\beta-C^{0}(x_{t-k+1:t-1})
\geq\beta-(k-1)d(\bar{x}_{-t},w_{t-1}). (21)

Using the Gaussian cost, i.e. $d(a,b)=(a-b)^{2}/2$, and the recursive formula for the mean, the acceptance condition for $k$ becomes:

F(t-1)+C^{0}(x_{t}) <F(t-1)+A(t-1)+\frac{k-1}{2k^{2}}(x_{t}-\bar{x}_{-t})^{2}+\frac{(k-1)^{2}}{2k^{2}}(x_{t}-\bar{x}_{-t})^{2}
\Rightarrow(x_{t}-w_{t-1})^{2} <2A(t-1)+\frac{k-1}{k}(x_{t}-\bar{x}_{-t})^{2}. (22)

By substituting in the value of $A(t-1)$ from (21), we obtain the following lower bound for $P(x\in S_{k}|x)$:

P(x\in S_{k}|x) \geq P\left((x_{t}-w_{t-1})^{2}<2\beta-(k-1)(\bar{x}_{-t}-w_{t-1})^{2}+\frac{k-1}{k}(x_{t}-\bar{x}_{-t})^{2}\right)
=P\left((x_{t}-w_{t-1})^{2}-\frac{k-1}{k}(x_{t}-\bar{x}_{-t})^{2}+(k-1)(\bar{x}_{-t}-w_{t-1})^{2}<2\beta\right)
\geq P\left((x_{t}-w_{t-1})^{2}+(k-1)(\bar{x}_{-t}-w_{t-1})^{2}<2\beta\right)
\geq 1-\mathbb{E}\left((x_{t}-w_{t-1})^{2}+(k-1)(\bar{x}_{-t}-w_{t-1})^{2}\right)/(2\beta) (by Markov's inequality)
\Rightarrow P(x\notin S_{k}|x) \leq\mathbb{E}\left((x_{t}-w_{t-1})^{2}+(k-1)(\bar{x}_{-t}-w_{t-1})^{2}\right)/(2\beta).

Thus, we have the following bound for $P_{r}$ at any $k$:

P_{r} =\int_{w_{t}-p}^{w_{t}+p}P(x\notin S_{k}|x)f(x)dx
\leq\int_{w_{t}-p}^{w_{t}+p}O(x^{2}/p^{2})f(x)dx. (23)

Therefore, as $n$ increases, $1-P_{r}$ grows faster than $1-\mathbb{E}X_{1}/w_{t}=1-\int_{w_{t}-p}^{w_{t}+p}x^{2}f(x)dx$. This means that $\exists p_{0}$, and thus $\exists n_{0}$, such that for $n>n_{0}$, and thus $p>p_{0}$, (20) holds.

So far we assumed $\mathbb{E}X_{r}=\mu$. Clearly, for larger values of $\mathbb{E}X_{r}$, $\mathbb{E}X_{a}$ is even smaller and (13) is satisfied.

For the case when $\mathbb{E}X_{r}<\mu$, consider the worst-case scenario $\mathbb{E}X_{r}=w_{t}-p$ (this is the bound on $X_{1}$, and thus on $X_{r}$, imposed by $S_{1}$). Similarly to (19), we need:

\mathbb{E}X_{a}<w_{t}
\Rightarrow\mathbb{E}X_{1}/(1-P_{r})-P_{r}(w_{t}-p)/(1-P_{r})<w_{t}
\Rightarrow\mathbb{E}X_{1}/w_{t}-P_{r}+P_{r}p/w_{t}<1-P_{r}
\Rightarrow P_{r}p/w_{t}<1-\mathbb{E}X_{1}/w_{t}
\Rightarrow P_{r}p/w_{t}<\int_{w_{t}-p}^{w_{t}+p}x^{2}f(x)dx.

However, based on (23), $P_{r}p/w_{t}\leq\int_{w_{t}-p}^{w_{t}+p}O(x^{2}/p)f(x)dx$, so again $\exists n_{0}$ such that for $n>n_{0}$, $\mathbb{E}X_{a}<w_{t}$, and condition (13) holds.

And so overall $\mathbb{E}(x_{t}|\hat{\delta}_{t}=1)$ satisfies condition (13). Since the distribution of the accepted $x_{t}$ still has the bounded support imposed by $S_{1}$, condition (15) still holds, and the learning rate condition (14) holds as before, implying convergence. ∎

A.2 When the global mean of segments does not match the background mean

Consider now $\theta\sim\mathcal{N}(\mu,\tau^{2})$ for some $\mu\neq w^{*}$, in particular $\mu>w^{*}$, so that the overall mean of segment points is $\mathbb{E}(\theta)>w^{*}$. Then, if $w^{*}<w_{t}<\mathbb{E}(\theta)$, any segment points that were misclassified as background will (on average) push the estimates away from the background mean, in violation of the "convexity" condition (13).

We assume that each segment point is followed by no fewer than $n$ background points. Then, as $n\rightarrow\infty$, $w_{t}\xrightarrow{a.s.}w^{*}$. For every finite $n$, $\exists\epsilon>0$ such that $P(|w_{t}-w^{*}|>\epsilon)=0$.

Proof.

Suppose that a misclassification at time $T$ is followed by $n$ correctly classified background points: $\delta_{T}=0$, $\delta_{t}=1$ for $t\in[T+1;T+n]$, $\hat{\delta}_{t}=1$ for $t\in[T;T+n]$. For the points $t\in[T+1;T+n]$, almost sure convergence of $w_{t}$ was established above, i.e. for all $\epsilon>0$ there exists a $t_{0}$ such that $\forall t:n\geq t\geq t_{0}$, $P(|w_{T+t}-w^{*}|<\epsilon)=1$. Therefore, given $n\geq t_{0}$:

P(|w_{T+n}-w^{*}|<|w_{T-1}-w^{*}|)=1
\Rightarrow\begin{cases}P(w_{T+n}-w_{T-1}<0)=1,&\text{if }w_{T-1}-w^{*}>0\\ P(w_{T+n}-w_{T-1}>0)=1,&\text{if }w_{T-1}-w^{*}<0\end{cases}
\Rightarrow\inf_{w_{T-1}\neq w^{*}}(w_{T-1}-w^{*})\mathbb{E}(w_{T+n}-w_{T-1})<0. (24)

Indexing the segment–background cycles by $i$, denote the first estimate of each cycle by $w^{\prime}_{i}$, so the set of these estimates is:

\{w^{\prime}_{i}\}=\{w_{1},\dots,w_{T-2-n},w_{T-1},w_{T+n},w_{T+1+2n},\dots\}.

The elements of this sequence can be expressed recursively as:

w^{\prime}_{i+1}=w^{\prime}_{i}-\gamma^{\prime}_{i}H^{\prime}(\{x^{\prime}_{i}\},w^{\prime}_{i}),

with $\{x^{\prime}_{i}\}=\{x_{t}:i(n+1)\leq t\leq i(n+1)+n\}$.

From (24), $\mathbb{E}(w^{\prime}_{i}-w^{\prime}_{i+1})=\gamma^{\prime}_{i}\mathbb{E}H^{\prime}(\{x^{\prime}_{i}\},w^{\prime}_{i})$ is "convex" as defined in (13), and because $\gamma^{\prime}_{i}>0$, so is $\mathbb{E}H^{\prime}(\{x^{\prime}_{i}\},w^{\prime}_{i})$.

Let $\gamma^{\prime}_{i}=\frac{1}{i(n+1)+1}$. Then:

H^{\prime}(\{x^{\prime}_{i}\},w^{\prime}_{i}) =\sum_{t=i(n+1)}^{i(n+1)+n}\gamma_{t}(w_{t-1}-x_{t})/\gamma^{\prime}_{i}
=\sum_{t=i(n+1)}^{i(n+1)+n}\frac{i(n+1)+1}{t+1}(w_{t-1}-x_{t})
\leq(n+1)\max_{t}(w_{t-1}-x_{t}).

So for $n<\infty$, conditions (14)–(15) are satisfied as well, and $w^{\prime}_{i}\xrightarrow{a.s.}w^{*}$. (When $n\to\infty$, the convergence conditions are satisfied directly, without using the sequence $\{w^{\prime}_{i}\}$.) ∎
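
The finite-$n$ case can be illustrated by simulation: a cycle of one absorbed segment point with mean $\mu>w^{*}$ followed by $n$ genuine background points keeps the running-mean estimate within a band around $w^{*}$ of width roughly $(\mu-w^{*})/(n+1)$, shrinking as $n$ grows. A minimal sketch with arbitrary constants:

    import numpy as np

    rng = np.random.default_rng(3)
    w, n_bg, w_star, n = 0.0, 1, 0.0, 20
    trace = []
    for i in range(2_000):
        # one misclassified segment point (mu = 3 > w*) is absorbed...
        w += (rng.normal(3.0, 1.0) - w) / (n_bg + 1); n_bg += 1
        # ...then n background points pull the estimate back towards w*
        for _ in range(n):
            w += (rng.normal(w_star, 1.0) - w) / (n_bg + 1); n_bg += 1
        trace.append(w)
    print(f"final estimate: {w:.3f}; "
          f"max deviation over cycles: {max(abs(v - w_star) for v in trace):.3f}")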

A.3 Martingale Approach

We can also describe the update process over the background points using martingales. The algorithm estimates are random variables $w_{t}$; let $\{\mathcal{W}_{t}\}$ be the sequence of $\sigma$-algebras such that for each $t$, $w_{t}$ is measurable with respect to $\mathcal{W}_{t}$. Using Lemma 1, and assuming $w^{*}<w_{t}$ again, within each cycle the estimates comprise a supermartingale, $\mathbb{E}(w_{t+1}|\mathcal{W}_{t})<w_{t}$, over the points $T\leq t<\min(T+n,FHT_{w}(w^{*}))$, where $FHT_{x}(a)=\inf\{t:x_{t}\leq a\}$ is the first hitting time of the process realisation $\{x_{t}\}$ at the value $a$.

Consider again the problematic case when the global mean does not match the background mean and a misclassification pushes the estimate away from the background mean, i.e. $w^{*}<w_{T-1}<w_{T}<\mathbb{E}(\theta)$. In order for $w^{\prime}_{i}$ to converge, we need the perturbed estimates to return to a value below $w_{T-1}$ in each cycle. At the extremes, we have:

FHT_{w}(w_{T})=T (the starting position);
FHT_{w}(w_{T-1})<\infty for sufficiently large $n$, because $w_{t}\xrightarrow{a.s.}w^{*}$.

Clearly, $FHT_{w}(w_{T-1})\leq FHT_{w}(w^{*})$. However, the number of background points $n$ required to satisfy $FHT_{w}(w_{T-1})<T+n$ will depend on five factors: the distribution $f_{B}$ and the penalty $p$ (since they determine the distribution of the update values $H$), the size of the estimated background set at time $T$ (as it determines the relevant $\gamma_{t}$), and $w_{T-1}$ and $w_{T}$.

In practice, $n$ is bounded by the available data, so there is a non-zero probability that, over the segment–background cycles indexed by $i$:

\max_{i}FHT_{w}(w_{T_{i}-1})>T_{i}+n.

In that case, define $b=\min\{a:\forall i,FHT_{w}(a)\leq T_{i}+n\}$; the final estimate of $w^{\prime}_{i}$ will be bounded by $[w^{*},b]$. As $n$ increases, $P(|w^{*}-b|>\epsilon)\rightarrow 0$ for any $\epsilon>0$.

Similar reasoning applies when $\mathbb{E}(\theta)<w_{T}<w_{T-1}<w^{*}$.

Appendix B Proof of Theorem 2 (Consistency)

Proof.

Some general consistency results for changepoint detection by penalised cost methods are given in Fisch et al. (2018). In particular, an equivalent of our Theorem 2 is established for any algorithm that detects changepoints by exactly minimising a cost $F(n;\{s_{i},e_{i}\},\theta,\hat{\mu},\hat{\sigma})$, where $\hat{\cdot}$ marks robust estimates of the background parameters. While the original statement uses the median and interquartile range of $x_{0:n}$ for $\hat{\mu}$ and $\hat{\sigma}$, the proof only requires that the estimates satisfy certain upper bounds on deviations from the true values. Therefore, we will first show that the online estimates produced by Algorithm 1 are within these bounds, and then follow the rest of the proof from Fisch et al. (2018).

Noting again that Algorithm 1 is effectively a stochastic gradient descent procedure, with each data point seen precisely once, we can use the error bound on the estimates produced by such algorithms, as provided in Theorem 7.5 of Harvey et al. (2019):

Theorem 3 (Harvey et al., 2019).

Let the function $f(w)$ be 1-strongly convex and 1-Lipschitz. A stochastic gradient algorithm for minimising this function runs for $T$ cycles, and at each cycle updates the estimate as in (12) with $\gamma_{t}=1/t$, $\mathbb{E}H=\nabla f(w)$. Then:

P\left(\lVert w_{t}-w^{*}\rVert^{2}\leq O\left(\frac{\log(1/\delta)}{t}\right)\right)\geq 1-\delta.
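
For intuition, this rate is easy to reproduce on the simplest 1-strongly convex objective, $f(w)=\frac{1}{2}\mathbb{E}(w-x)^{2}$ with $x\sim\mathcal{N}(0,1)$, for which the update with $\gamma_{t}=1/t$ reduces to a running mean. A minimal Monte Carlo sketch (constants arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    T, reps = 10_000, 200
    errs = np.empty(reps)
    for r in range(reps):
        w = 5.0                        # deliberately poor start
        for t in range(1, T + 1):
            x = rng.normal()           # E[H] = w - E[x] = grad f(w)
            w -= (w - x) / t           # gamma_t = 1/t
        errs[r] = w ** 2
    print(f"mean squared error: {errs.mean():.2e}; 1/T = {1 / T:.2e}")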

Using $\delta=n^{-\epsilon}$, and assuming without loss of generality that $\sigma_{0}=1$, we can establish an upper bound on the error of the background parameters estimated by Algorithm 1 after $n$ cycles:

P\left((\hat{\mu}-\mu_{0})^{2}\leq O\left(\frac{\log(n^{\epsilon})}{n}\right)\right)=P\left(|\hat{\mu}-\mu_{0}|\leq O\left(\sqrt{\epsilon}\sqrt{\frac{\log n}{n}}\right)\right)\geq 1-n^{-\epsilon},
P\left(|\hat{\sigma}^{2}-\sigma^{2}_{0}|\leq O\left(\sqrt{\epsilon}\sqrt{\frac{\log n}{n}}\right)\right)\geq 1-n^{-\epsilon}.

Application of Boole’s inequality leads to:

P\left(|\hat{\mu}-\mu_{0}|\leq D_{1}\sigma_{0}\sqrt{\frac{\log(n)}{n}},\ \left|\frac{\hat{\sigma}^{2}}{\sigma^{2}_{0}}-1\right|\leq D_{2}\sqrt{\frac{\log(n)}{n}}\right)\geq 1-C_{1}n^{-\epsilon}, (25)

for some constants $C_{1},D_{1},D_{2}$ and sufficiently large $n$. Since the objective function $f$ in our algorithm is the Gaussian log-likelihood (i.e., the updates $\mathbb{E}H$ approximate its gradient), for any given segmentation it is 1-strongly convex. For other functions, overall consistency can still be achieved similarly, but the convergence rate may be slower than $n^{-\epsilon}$.

Having established the bound on estimate errors, we can use Lemma 9 from Fisch et al. (2018) and the proof method reported there.

First, introduce an event $E$ based on a combination of bounds limiting the behaviour of the Gaussian data $x_{1:n}$, which, for any $\epsilon>0$, occurs with probability $P(E)>1-C_{2}n^{-\epsilon}$ for some constant $C_{2}$ and sufficiently large $n$ (Lemmas 1 and 2 in Fisch et al. (2018)). Conditional on this event, the following lemma holds for the epidemic cost $F$ defined as in (5):

Lemma 2 (Fisch et al., 2018).

Let $\{\tau\}$ be the set of true segment positions $\{(s_{i},e_{i})\}$, and $\theta$ the vector of true segment means. Assume $E$ holds, and some $\hat{\mu},\hat{\sigma}$ are available for which the event in (25) holds. Then, there exist constants $C_{3}$ and $n_{1}$ such that when $n>n_{1}$,

F(n;\{\tau\},\theta,\hat{\mu},\hat{\sigma})-F(n;\{\tau\},\theta,\mu_{0},\sigma_{0})<C_{3}\log n.

This lemma, together with results established for classical changepoint detection, can be used to show that the cost of any inconsistent solution will exceed the cost based on true segment positions and parameters (Proposition 8 in Fisch et al. (2018)):

Proposition 3 (Fisch et al., 2018).

Define $\{\tau^{\prime}\}$ to be any set of segments $\{(s_{i},e_{i})\}$ that does not satisfy the consistency event in (7). Let $\tilde{\theta}=\operatorname{argmin}_{\theta}F(n;\theta)$ be the parameters estimated by minimising the cost for a given segmentation (i.e. the vector of means and/or variances of $x_{s_{i}:e_{i}}$ for each $i$). Assume $E$ holds. Then there exist constants $C_{4}$ and $n_{2}$ such that, when $n>n_{2}$:

F(n;\{\tau^{\prime}\},\tilde{\theta},\hat{\mu},\hat{\sigma})\geq F(n;\{\tau\},\theta,\hat{\mu},\hat{\sigma})+C_{4}\log(n)^{1+\delta/2}.

See the original publication for a detailed proof of these results.

Finally, for a given set of changepoints, using fitted maximum-likelihood parameters by definition results in minimal cost:

F(n;\{\tau\},\theta,\hat{\mu},\hat{\sigma})\geq F(n;\{\tau\},\tilde{\theta},\hat{\mu},\hat{\sigma}).

Thus, when Proposition 3 holds, we have:

F(n;\{\tau^{\prime}\},\tilde{\theta},\hat{\mu},\hat{\sigma})>F(n;\{\tau\},\tilde{\theta},\hat{\mu},\hat{\sigma}),

and an exact minimisation algorithm will always find a solution in the consistent set. The overall probability of the events required for Proposition 3 is a combination of $P(E)$, established before, and (25), which by Boole's inequality is:

P>1-C_{5}n^{-\epsilon},

for any $\epsilon>0$, $n>n_{3}$ and some constants $n_{3},C_{5}$. ∎

Appendix C Proof of Proposition 2 (Pruning of Algorithm 2)

Proof.

Denote the true start and end of a nuisance segment as $s_{j},e_{j}$. Consider the case when $s_{j}\in(t-A(n);t]$. Pruning at time $t$ will not remove this point (i.e. $s_{j}\notin\bm{k}_{pr,t}$) iff:

C^{0}(x_{t-A(n):s_{j}-1})+C^{N}(x_{s_{j}:t})<C^{0}(x_{t-A(n):m-1})+C^{N}(x_{m:t})+\alpha\log(n)^{1+\delta},

with $m$ such that the right-hand side is minimised and $m\neq s_{j}$.

Denote by $C(x_{a:b};\hat{\mu},\hat{\sigma})$ the Gaussian cost calculated with the MLE estimates of the parameters (i.e. the mean and variance of $x_{a:b}$). Note that since $C^{0}(x_{a:b})=C(x_{a:b};\mu_{0},\sigma_{0})$ and $C^{N}(x_{a:b})=C(x_{a:b};\hat{\mu},\sigma_{N})$, the required event to preserve $s_{j}$ can be stated as:

C(x_{t-A(n):s_{j}-1};\mu_{0},\sigma_{0})+C(x_{s_{j}:t};\hat{\mu},\sigma_{N})-C(x_{t-A(n):m-1};\mu_{0},\sigma_{0})-C(x_{m:t};\hat{\mu},\sigma_{N})<\alpha\log(n)^{1+\delta}.

We can establish the probability of this using the following bound (Proposition 4 in Fisch et al. (2018)):

Lemma 3.

Let $x_{1:n}$ be piecewise-Gaussian data. Choose any subset $x_{i:j}$, $1\leq i<j\leq n$, with a true changepoint at $s$, i.e., $x_{t}\sim\mathcal{N}(\mu_{1},\sigma_{1})$ for $t\in[i;s-1]$ and $x_{t}\sim\mathcal{N}(\mu_{2},\sigma_{2})$ for $t\in[s;j]$. Then, for any candidate changepoint $\tau$ and any $\epsilon>0$, there exist constants $B,n_{0},K_{1}$ such that:

C(x_{i:s-1};\mu_{1},\sigma_{1})+C(x_{s:j};\mu_{2},\sigma_{2})-C(x_{i:\tau-1};\hat{\mu},\hat{\sigma})-C(x_{\tau:j};\hat{\mu},\hat{\sigma})\leq K_{1}\log(n)

is true for all $i,j$ with $P\geq 1-Bn^{-\epsilon}$ when $n>n_{0}$.

Now take $i=s_{j}-A(n)+1$, $j=s_{j}+A(n)-1$. Note that there is one and only one changepoint within $x_{i:j}$, because of the required distance between changepoints. Applying Lemma 3 to such $x_{i:j}$ shows that, conditional on an event with probability $P\geq 1-Bn^{-\epsilon}$, the following is true for all $t\in[s_{j},s_{j}+A(n))$:

C(x_{t-A(n):s_{j}-1};\mu_{0},\sigma_{0})+C(x_{s_{j}:t};\hat{\mu},\sigma_{N})-C(x_{t-A(n):m-1};\mu_{0},\sigma_{0})-C(x_{m:t};\hat{\mu},\sigma_{N})
\leq C(x_{t-A(n):s_{j}-1};\mu_{0},\sigma_{0})+C(x_{s_{j}:t};\mu_{N},\sigma_{N})-C(x_{t-A(n):m-1};\hat{\mu},\hat{\sigma})-C(x_{m:t};\hat{\mu},\hat{\sigma})
\leq K_{1}\log(n)<\alpha\log(n)^{1+\delta},

where we also used the fact that $\hat{\mu},\hat{\sigma}=\operatorname{argmin}_{\mu,\sigma}C(x;\mu,\sigma)$.

Therefore, with the same probability, $s_{j}\notin\bigcup_{t=s_{j}}^{s_{j}+A(n)-1}\bm{k}_{pr,t}$. Also, $s_{j}\notin\bigcup_{t=s_{j}+A(n)}^{n}\bm{k}_{pr,t}$, because then $s_{j}\leq t-A(n)$ and is not considered in the pruning scheme, and clearly $s_{j}\notin\bigcup_{t=1}^{s_{j}-1}\bm{k}_{pr,t}$. The case for $e_{j}$ follows by symmetry, and obviously no true changepoint can be pruned out if $s_{j},e_{j}=\emptyset$, so the overall probability of retaining a true changepoint remains $P\geq 1-Bn^{-\epsilon}$. ∎
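
For concreteness, the pruning test analysed above can be written out directly from the definitions of $C^{0}$ and $C^{N}$. The sketch below is our illustrative rendering (the function names, the twice-negative-log-likelihood convention, and the brute-force minimisation over $m$ are ours), not the pruned implementation used in Algorithm 2:

    import numpy as np

    def gauss_cost(v, mean, var):
        """Twice the negative Gaussian log-likelihood of v at (mean, var)."""
        return np.sum(np.log(2 * np.pi * var) + (v - mean) ** 2 / var)

    def keep_candidate(x, s, t, A, mu0, var0, varN, alpha, delta):
        """Keep nuisance start s in (t - A, t] iff its split cost is within
        alpha * log(n)^(1 + delta) of the best alternative split m != s."""
        n = len(x)
        def split_cost(m):
            left, right = x[t - A:m], x[m:t + 1]          # C^0 then C^N
            return (gauss_cost(left, mu0, var0)
                    + gauss_cost(right, right.mean(), varN))
        best = min(split_cost(m) for m in range(t - A + 1, t + 1) if m != s)
        return split_cost(s) < best + alpha * np.log(n) ** (1 + delta)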

Appendix D Supplementary Table

Table S1: Comparison of Algorithm 2 detections without pruning (Full) or with global pruning (Pruned). Of the 5000 simulation runs, all runs where the two options produced any difference in segments were identified. All detections from these runs were extracted and are shown here. Segment types are N – nuisance, S – signal.
                               (Full)                   (Pruned)
Scenario   n     Run number    Segm. type  Start  End   Segm. type  Start  End
1          30    88            N           7      21    S           7      11
1          30    88            S           13     15    N           12     21
1          30    88            S           16     20    S           16     20
1          100   84            N           21     69    S           21     50
1          100   84            S           51     68    S           51     69
2          30    378           N           10     18    S           10     12
2          30    378           S           13     15    S           16     18
2          30    393           N           2      12    N           3      12
2          30    393           S           4      7     S           4      7
2          30    393           S           16     18    S           16     18
2          30    393           S           22     24    S           22     24
2          240   445           N           55     95    N           55     95
2          240   445           S           121    144   S           121    144
2          240   445           N           170    218   S           170    192
2          240   445           S           193    214   S           215    218