
Online Correlation Change Detection for Large-Dimensional Data with an Application to Forecasting of El Niño Events

Jie Gao¹, Liyan Xie², and Zhaoyuan Li¹*

¹School of Data Science, The Chinese University of Hong Kong, Shenzhen, Guangdong, China, 518172
²Department of Industrial and Systems Engineering, University of Minnesota
*Corresponding author. Email: [email protected]
Abstract

We consider detecting change points in the correlation structure of streaming large-dimensional data with minimal assumptions on the underlying data distribution. Based on the $\ell_{1}$ and $\ell_{\infty}$ norms of the squared difference of the vectorized pre-change and post-change correlation matrices, detection statistics are constructed for dense and sparse settings, respectively. The proposed detection procedures possess a blessing-of-dimensionality property, and a novel algorithm for threshold selection is designed based on sign-flip permutation. Theoretical evaluations of the proposed methods are conducted in terms of average run length and expected detection delay. Numerical studies examine the finite-sample performance of the proposed methods: the average detection delays have slopes similar to that of the exactly optimal CUSUM test, indicating that our methods are effective. Moreover, a combined $\ell_{1}$ and $\ell_{\infty}$ norm approach is proposed and performs as expected in settings that transition from sparse to dense changes. Our method is applied to forecast El Niño events and achieves state-of-the-art hit rates greater than 0.86 with zero false alarm rates. This application illustrates the efficiency and effectiveness of our proposed methodology in detecting fundamental changes with minimal delay.

Keywords: Change point analysis; Correlation structure; El Niño prediction; Knockoff; Sign-flip permutation.

1 Introduction

The detection of abrupt changes in the statistical behavior of streaming data is a classic and fundamental problem in signal processing and statistics (Siegmund, 1985; Poor and Hadjiliadis, 2008; Veeravalli and Banerjee, 2014; Tartakovsky et al., 2014). A change point refers to the time when the underlying data distribution changes, and it may correspond to a triggering event that could have catastrophic consequences if not detected promptly. Therefore, the goal is to detect the change in distribution as quickly as possible, subject to false alarm constraints.

In the multivariate setting where multiple variables are observed, in addition to changes in univariate characteristics, crucial events are often marked by abrupt correlational changes (Cabrieto et al., 2018). Such multivariate settings are common in network data, including social networks (Raginsky et al., 2012; Peel and Clauset, 2015), sensor networks (Raghavan and Veeravalli, 2010), and cyber-physical systems (Nurjahan et al., 2016; Chen et al., 2015; Lakhina et al., 2004). Correlation changes are significant in various real-world applications. For example, in economics, increasing associations among diverse financial assets are often associated with financial crises (Galeano and Wied, 2017; Wied, 2017). In climate science, El Niño events, the most prominent phenomenon of contemporary natural climate variability, exemplify a setting in which interactions between different locations in the Pacific strengthen before events and weaken during them (Ludescher et al., 2013). In seismic detection, the correlations between sensors strengthen when a seismic event occurs (Xie et al., 2019).

Change point detection in correlation structure has been extensively studied in conventional low-dimensional settings, i.e., the dimension remains small while the sample size can become very large. For example, Wied (2017) constructed a bootstrap variance matrix estimator to detect changes in the correlation matrix, and Cabrieto et al. (2018) proposed a Gaussian kernel-based change point detection method that can locate multiple change points simultaneously. Madrid Padilla et al. (2023) localized multiple change points in multivariate time series with potentially short-range dependence. Change point detection in time series has also been studied in Killick et al. (2013) and Dette et al. (2019), as well as the references therein. However, in large-dimensional settings where the dimension can be larger than the (available) sample size, methods designed for low-dimensional settings often perform poorly or are ill-defined. For example, kernel-based methods are sensitive to the choice of kernel function and parameters, particularly in moderate to large dimensions, and the method proposed by Wied (2017) is unstable when the dimension is large relative to the sample size due to the (near) singularity of the variance estimator.

For large-dimensional data, Choi and Shin (2021) proposed a break test based on the self-normalization method, which avoids variance estimation. Li and Gao (2024) introduced a sign-flip parallel analysis-based detection method and a two-step approach to locate change points. In addition, since the correlation matrix is a particular case of the covariance matrix, some methods designed to detect changes in covariance can be applied to detect changes in correlation. For example, Dette et al. (2022) presented a two-stage bootstrap approach to detect and locate change points in the covariance structure of large-dimensional data. These existing works focus mainly on offline settings, assuming a fixed data sequence of finite length collected before analysis, and are therefore not directly applicable in online settings where data arrive sequentially.

There is a line of work on detecting covariance changes in online settings. Xie et al. (2020) discussed online monitoring of multivariate streaming data for changes in covariance matrices, focusing on a specific covariance structure that transitions from the identity matrix to an unknown spiked covariance model. They proposed the largest-eigenvalue Shewhart chart and the subspace cumulative sum detection procedure. Buzun and Avanesov (2018) used the sup norm of a matrix as their detection statistic and considered abrupt changes in the inverse covariance matrix and the covariance matrix of large-dimensional data, respectively. Their methods used entirely data-driven algorithms with bootstrap calibration to select thresholds, and temporal independence of the random variables was assumed. Li and Li (2023) considered the Frobenius norm and introduced a novel stopping rule that accommodates spatial and temporal dependencies and is applicable to non-Gaussian data. However, it is worth mentioning that relying solely on the covariance matrix for analysis may be susceptible to noise interference, making the correlation matrix more reliable for certain applications. This is evidenced by the real data analysis in Li and Gao (2024), which accurately detected changes in the coronary artery disease index in rearranged RNA data using correlation analysis, while covariance-based methods did not.

This paper focuses on online change detection of correlation matrices. We do not restrict the distributions to a particular parametric form, but only assume that the pre- and post-change distributions have different, unknown correlation matrices. We assume the availability of a reference dataset generated from the pre-change distribution, which is used to estimate the pre-change correlation matrix. The post-change correlation matrix is estimated from the data sequence and compared with the reference to compute detection statistics. Moreover, we incorporate both the $\ell_{1}$ and $\ell_{\infty}$ norms to effectively accommodate sparse and non-sparse changes in the correlation structure.

The main contributions of the paper are threefold. First, our efficient algorithms are the first to detect correlation changes in large-dimensional streaming data. The proposed methods utilize SMOTE and knockoff techniques to enhance sample efficiency. Second, we characterize the theoretical performance of the proposed methods, specifically in terms of the average run length and the expected detection delay. Finally, a major novelty is the application to El Niño forecasting, where early warnings can be used to predict long-term climate states. Using high-quality observational data available since 1950, our method yields hit rates greater than 0.86 with zero false alarm rates. From this perspective, our method enables applications to many other real problems, such as detecting, as early as possible, microearthquakes and tremor-like signals in seismic datasets, which are weak signals caused by minor subsurface changes in the Earth.

The paper is organized as follows. Section 2 presents our problem setup and preliminaries. Section 3 introduces the proposed online detection procedure and two improved methods incorporating the synthetic minority oversampling technique (SMOTE) and knockoff techniques; moreover, a sign-flip permutation method is proposed to select the thresholds used in the detection procedures. Section 4 presents the theoretical results of the proposed methods in terms of average run length and worst-case average detection delay. Simulations and El Niño forecasting are discussed in Sections 5 and 6, respectively. Section 7 concludes with a brief discussion. Supplementary materials for this article are available online, including all proofs and an application to seismic event detection.

Notation.

We use $\mathbb{P}_{\infty}$ to denote the probability measure on the sequence of observations when the change never occurs, and $\mathbb{E}_{\infty}$ is the corresponding expectation. Similarly, $\mathbb{E}_{\nu}$ denotes the expectation when the actual change point equals $\nu$; thus $\mathbb{E}_{1}$ refers to the expectation under the post-change regime. The functions $f_{0}$ and $f_{1}$ represent the pre- and post-change probability density functions of the observations, respectively. For a vector $\bm{x}=(x_{1},\ldots,x_{p})$, $\|\bm{x}\|_{1}:=\sum_{i=1}^{p}|x_{i}|$ denotes its $\ell_{1}$ norm, and $\|\bm{x}\|_{\infty}:=\max_{i=1,\ldots,p}|x_{i}|$ denotes its $\ell_{\infty}$ norm. For a matrix $\bm{A}=(a_{ij})_{1\leq i\leq m,1\leq j\leq n}$, $\|\bm{A}\|_{F}:=\sqrt{\sum_{i=1}^{m}\sum_{j=1}^{n}a_{ij}^{2}}$ is its Frobenius norm. We denote by $\bm{A}_{k:k^{\prime}}$ the $k$-th to $k^{\prime}$-th columns of the matrix $\bm{A}$. We use $O(\cdot)$ and $o(\cdot)$ for the standard big-O and little-o notation. Finally, the symbol $\circ$ denotes the Hadamard (element-wise) product.

2 Problem Setup and Preliminaries

2.1 Problem Setup

Consider a sequence of observations $\{\bm{x}_{t},t=1,2,\ldots\}$, where $\bm{x}_{t}\in\mathbb{R}^{p}$. In general, we do not impose any restriction on $p$, except in the knockoff enhancement discussed in Section 3.4. We are interested in detecting the change point in the underlying correlation structure. We assume the following data-generating model:

$$\mathbb{C}\mathrm{orr}(\bm{x}_{t})=\bm{R}_{0},\quad t=1,2,\ldots,\nu-1,$$
$$\mathbb{C}\mathrm{orr}(\bm{x}_{t})=\bm{R}_{1},\quad t=\nu,\nu+1,\ldots.$$

Here, $\nu$ is the change point at which the correlation structure of $\bm{x}_{t}$ changes. The matrices $\bm{R}_{0}=[\rho_{0}(i,j)]_{i,j=1,\ldots,p}$ and $\bm{R}_{1}=[\rho_{1}(i,j)]_{i,j=1,\ldots,p}$ are the pre- and post-change correlation matrices, respectively. We assume that both $\bm{R}_{0}$ and $\bm{R}_{1}$ are unknown, and the change point $\nu$ is deterministic and unknown. The difference between $\bm{R}_{0}$ and $\bm{R}_{1}$ indicates the magnitude and pattern of the change. Specifically, the correlation change is considered sparse if the number of differing entries between $\bm{R}_{0}$ and $\bm{R}_{1}$ is relatively small compared to the total number of entries; otherwise the change is considered dense.

Our goal is to detect the change as quickly as possible, subject to false alarm constraints. This is usually achieved by designing a stopping time (Poor and Hadjiliadis, 2008), which is a random variable $T$ adapted to the data sequence $\{\bm{x}_{t},t=1,2,\ldots\}$ such that for each $t$, the event $\{T=t\}\in\sigma(\bm{x}_{1},\ldots,\bm{x}_{t})$, where $\sigma(\bm{x}_{1},\ldots,\bm{x}_{t})$ denotes the sigma-algebra generated by $\{\bm{x}_{1},\ldots,\bm{x}_{t}\}$. Equivalently, the event $\{T=t\}$ is a function of only $\bm{x}_{1},\ldots,\bm{x}_{t}$.

2.2 Performance Measures

A fundamental objective in sequential change point detection is to optimize the trade-off between the false alarm rate and the average detection delay. Controlling the false alarm rate is commonly achieved by setting an appropriate threshold on a test statistic. On the other hand, the threshold also affects the average detection delay: a larger threshold incurs fewer false alarms but leads to a larger detection delay, and vice versa.

We next introduce the two commonly used performance metrics in sequential detection: the average run length (ARL) and the worst-case average detection delay (WADD). The ARL is used to characterize false alarms and is defined, for a given stopping time $T$, as

$$\mathsf{ARL}(T)=\mathbb{E}_{\infty}[T]. \qquad (1)$$

Its reciprocal is the commonly used false alarm rate $\mathsf{FAR}(T)=1/\mathsf{ARL}(T)=1/\mathbb{E}_{\infty}[T]$. The ARL can be interpreted as the expected time duration between two consecutive false alarms. We typically focus on test procedures that satisfy a constraint $\gamma$ on the ARL, i.e., consider the set of tests

$$\mathcal{D}_{\gamma}=\{T:\mathsf{ARL}(T)\geq\gamma\},$$

where the constraint $\gamma$ is pre-specified and can go to infinity.

Finding a uniformly powerful test that minimizes the delay over all possible values of the change point $\nu$, subject to an ARL constraint, is generally intractable. Usually, it is more tractable to pose the problem in the so-called minimax setting. We adopt the worst-case average detection delay (WADD), defined as the supremum of the average detection delay conditioned on the worst possible realizations (Lorden, 1971). More specifically,

$$\mathsf{WADD}(T)=\sup_{\nu\geq 1}\ \operatorname*{ess\,sup}\ \mathbb{E}_{\nu}\left[(T-\nu+1)^{+}\,\middle|\,\bm{x}_{1},\dots,\bm{x}_{\nu-1}\right], \qquad (2)$$

where $\operatorname*{ess\,sup}$ denotes the essential supremum and $(x)^{+}=\max\{x,0\}$. Under this definition of WADD, the formulation of interest is:

$$\text{minimize } \mathsf{WADD}(T), \ \text{ subject to } \mathsf{ARL}(T)\geq\gamma. \qquad (3)$$
Remark 2.1 (Information-theoretic lower bound).

The information-theoretic lower bound on $\mathsf{WADD}(T)$, i.e., the minimum value of (3), is known to be attained by the CUSUM procedure (Page, 1954), a commonly used sequential change detection procedure that enjoys efficient implementation and exact optimality properties (Lorden, 1971; Moustakides, 1986; Ritov, 1990). By accumulating the log-likelihood ratios, the CUSUM statistic is constructed as $W_{t}=\max_{1\leq i\leq t}\sum_{k=i}^{t}\ell(\bm{x}_{k})$, where $\ell(\bm{x}_{k})=\log(f_{1}(\bm{x}_{k})/f_{0}(\bm{x}_{k}))$ is the log-likelihood ratio statistic, and the stopping time is thus $T_{\text{c}}=\inf\{t\geq 1:W_{t}\geq b\}$, for a pre-specified threshold $b$ such that $\mathsf{ARL}(T_{\text{c}})=\gamma$. The WADD of the CUSUM test is known to be $(\log\gamma)/D(f_{1}\|f_{0})\cdot(1+o(1))$, where $D(f_{1}\|f_{0})=\int f_{1}(x)\log(f_{1}(x)/f_{0}(x))dx$ is the KL divergence between the post- and pre-change distributions (Tartakovsky et al., 2014). In Section 5, we compare the performance of the proposed detection methods with that of the CUSUM procedure to validate their effectiveness. It is important to note that the CUSUM test requires full knowledge of the pre- and post-change density functions, which is usually unavailable in practice.
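To make the recursion concrete, the following is a minimal Python sketch of the CUSUM procedure, using the equivalent recursion $W_{t}=\max(W_{t-1},0)+\ell(\bm{x}_{t})$; the univariate Gaussian mean-shift likelihood ratio below is an illustrative assumption, not part of our proposed method.

    import numpy as np

    def cusum_stopping_time(x, loglik_ratio, b):
        # CUSUM recursion W_t = max(W_{t-1}, 0) + l(x_t); stop when W_t >= b.
        W = 0.0
        for t, xt in enumerate(x, start=1):
            W = max(W, 0.0) + loglik_ratio(xt)
            if W >= b:
                return t          # stopping time T_c (1-indexed)
        return None               # threshold never crossed

    # Illustrative example: Gaussian mean shift from 0 to delta (unit variance),
    # for which the log-likelihood ratio is l(x) = delta * x - delta**2 / 2.
    delta = 0.5
    llr = lambda x: delta * x - delta ** 2 / 2
    rng = np.random.default_rng(0)
    x = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(delta, 1.0, 500)])
    print(cusum_stopping_time(x, llr, b=10.0))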

Remark 2.2 (How does $p$ influence the detection procedure?).

The dimensionality of the data, $p$, significantly affects the correlation structure before and after a change point, which, in turn, affects the detection delay for a fixed ARL. Higher dimensions can lead to more complex correlations, potentially masking changes. Generally, as $p$ increases, the detection delay may increase due to a diminished signal-to-noise ratio and increased sensitivity to noise. In Section 5, however, our results indicate a shorter detection delay as $p$ increases, demonstrating the blessing-of-dimensionality property of our detection method.

3 Online Detection Procedure

In this section, we first introduce the construction of sliding sample correlation matrices and define some preliminary statistics in Section 3.1. We then propose sum-type and max-type detection statistics, together with their window-limited and Shewhart-type variants, in Sections 3.2 and 3.3, respectively. We discuss SMOTE and knockoff enhancements in Section 3.4, providing solutions for situations with limited sample sizes. In addition, we present a practical method for threshold selection in Section 3.5, which ensures computational and sample efficiency while maintaining the desired ARL condition approximately.

3.1 Sliding Sample Correlation Matrices and Preliminary Statistics

Assume an i.i.d. data sequence $\bm{x}_{1},\ldots,\bm{x}_{t},\ldots$, where $\bm{x}_{t}:=(x_{t1},\ldots,x_{tp})^{\top}$ has entries with bounded fourth moments and correlation matrix $\bm{R}$. The matrix $\bm{R}$ can be estimated using the sample correlation matrix defined as follows. There are typically two scenarios, depending on prior knowledge of the distributional parameters. For illustration purposes, we first consider the simplest case where the mean and variance of each entry are known. Without loss of generality, we assume $\mathbb{E}(x_{1i})=0$ and $\sigma^{2}=\mathbb{V}\mathrm{ar}(x_{1i})=1$, $\forall i=1,\ldots,p$. The sample correlation matrix for samples within a time window $[s,t]$ is then given by $\widetilde{\bm{R}}_{s:t}=\frac{1}{t-s+1}\sum_{k=s}^{t}\bm{x}_{k}\bm{x}_{k}^{\top}$.

Second, we consider the general case where the population mean and variance are unknown. Let $\bar{\bm{x}}_{s:t}=\frac{1}{t-s+1}\sum_{k=s}^{t}\bm{x}_{k}$ be the sample mean from time $s$ to $t$; the corresponding sample covariance matrix within $[s,t]$ is

$$\widehat{\bm{\Sigma}}_{s:t}=\frac{1}{t-s}\sum_{k=s}^{t}(\bm{x}_{k}-\bar{\bm{x}}_{s:t})(\bm{x}_{k}-\bar{\bm{x}}_{s:t})^{\top}. \qquad (4)$$

Let $\bm{D}_{s:t}=\mathrm{diag}(\widehat{\bm{\Sigma}}_{s:t})$ be the diagonal matrix of $\widehat{\bm{\Sigma}}_{s:t}$ and $\bm{y}_{k}=\bm{D}_{s:t}^{-1/2}(\bm{x}_{k}-\bar{\bm{x}}_{s:t})$ be the standardized version of $\bm{x}_{k}$, for $k=s,\ldots,t$; then the corresponding sample correlation matrix for samples within $[s,t]$ is

$$\widehat{\bm{R}}_{s:t}=\frac{1}{t-s}\sum_{k=s}^{t}\bm{y}_{k}\bm{y}_{k}^{\top}. \qquad (5)$$

Assume there is also a set of reference (historical) data $\{\bm{x}_{-H},\ldots,\bm{x}_{0}\}$ of size $H+1$, known to be generated from the pre-change distribution with an unknown correlation matrix $\bm{R}_{0}$. According to equation (5), the sample correlation matrix of all reference data can be similarly calculated as

$$\widehat{\bm{R}}_{0}=\widehat{\bm{R}}_{-H:0}=\frac{1}{H}\sum_{k=-H}^{0}\bm{y}_{k}\bm{y}_{k}^{\top}, \qquad (6)$$

where $\bm{y}_{k}=\bm{D}_{-H:0}^{-1/2}(\bm{x}_{k}-\bar{\bm{x}}_{-H:0})$, for $k=-H,\ldots,0$, and we further denote $\widehat{\bm{R}}_{0}=[\hat{\rho}_{0}(i,j)]_{i,j=1,\ldots,p}$. The size of the historical data, $H+1$, is assumed to be sufficiently large. When appropriate, we may use only a subset of this historical data for calculating the sample correlation matrix.

Given the online sequence $\{\bm{x}_{t},t=1,2,\ldots\}$, we calculate the detection statistics as follows. At each time $t$, we investigate all values of $t^{\prime}<t$ as potential change points. For each $t^{\prime}$, we calculate the sample correlation matrix $\widehat{\bm{R}}_{t^{\prime}:t}=[\hat{\rho}_{t^{\prime}:t}(i,j)]_{i,j=1,\ldots,p}$ according to (5) using the potential post-change samples $\{\bm{x}_{t^{\prime}},\ldots,\bm{x}_{t}\}$. Then, we calculate the squared difference between $\widehat{\bm{R}}_{t^{\prime}:t}$ and $\widehat{\bm{R}}_{0}$, i.e.,

$$\bm{v}_{t^{\prime},t}=\mathrm{vecho}\left\{\left[\left(\hat{\rho}_{0}(i,j)-\hat{\rho}_{t^{\prime}:t}(i,j)\right)^{2}\right]_{i,j=1,\ldots,p}\right\}\in\mathbb{R}^{\frac{p(p-1)}{2}}, \qquad (7)$$

where $\mathrm{vecho}(\cdot)$ denotes half-vectorization, i.e., vectorizing only the lower triangular part (excluding the diagonal) of the symmetric matrix. If a change point exists at $t^{\prime}$, then $\bm{v}_{t^{\prime},t}$ is an estimator of $\mathrm{vecho}\{[(\rho_{0}(i,j)-\rho_{1}(i,j))^{2}]_{i,j=1,\ldots,p}\}$, which measures the entry-wise differences between the pre- and post-change population correlation matrices.
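As an illustration of (5)-(7), below is a minimal numpy sketch computing the sample correlation matrix of a window of samples and the half-vectorized squared difference $\bm{v}_{t^{\prime},t}$; the helper names are our own.

    import numpy as np

    def sample_corr(X):
        # Sample correlation matrix as in (5); X is p x n with samples in columns.
        Y = X - X.mean(axis=1, keepdims=True)          # center each coordinate
        Y = Y / Y.std(axis=1, ddof=1, keepdims=True)   # standardize by sample std
        return (Y @ Y.T) / (X.shape[1] - 1)

    def v_stat(R0_hat, R_win):
        # Half-vectorized squared entry-wise difference as in (7).
        i, j = np.tril_indices_from(R0_hat, k=-1)      # strictly lower triangle
        return (R0_hat[i, j] - R_win[i, j]) ** 2

    # Usage: R0_hat from reference data, R_win from the current window.
    rng = np.random.default_rng(0)
    X_ref, X_win = rng.normal(size=(10, 100)), rng.normal(size=(10, 30))
    v = v_stat(sample_corr(X_ref), sample_corr(X_win))  # length p(p-1)/2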

More specifically, for any $(i,j)$-th entry of the correlation matrix, we present the expectation of $\bm{v}_{1,t}(i,j)$ when the data sequence $\{\bm{x}_{t},t=1,2,\ldots\}$ is sampled entirely from the post-change regime, to illustrate its effectiveness for detection. For notational convenience, we write $\rho_{0}=\rho_{0}(i,j)$ and $\rho_{1}=\rho_{1}(i,j)$ here.

Lemma 3.1.

Assume $\mathbb{E}[x_{ki}]=0$ and $\mathbb{E}[x_{ki}^{2}]=1$ for $i=1,\ldots,p$ and all $k$. For $\bm{v}_{1,t}(i,j)=(\hat{\rho}_{0}(i,j)-\hat{\rho}_{1:t}(i,j))^{2}$, we have

$$\mathbb{E}_{1}[\bm{v}_{1,t}(i,j)]=(\rho_{0}-\rho_{1})^{2}+\frac{1}{H+1}(\beta_{20}-\rho_{0}^{2})+\frac{1}{t}(\beta_{21}-\rho_{1}^{2}), \qquad (8)$$

where $\beta_{20}:=\mathbb{E}_{\infty}[(x_{ki}x_{kj})^{2}]$ is the expectation under the pre-change regime, and $\beta_{21}:=\mathbb{E}_{1}[(x_{ki}x_{kj})^{2}]$ is the expectation under the post-change regime.

We note that in (8), the first term $(\rho_{0}-\rho_{1})^{2}$ dominates the expectation when $H$ and $t$ are sufficiently large. Therefore, for any fixed $t$, larger values of $\bm{v}_{t^{\prime},t}(i,j)$ indicate a significant difference between $\rho_{0}(i,j)$ and $\rho_{1}(i,j)$. In contrast, under the pre-change regime, all elements of $\bm{v}_{t^{\prime},t}$ are considerably small.

Remark 3.2 (The case with unknown mean).

In the more general case with unknown mean and known unit variance $\mathbb{V}\mathrm{ar}(x_{ki})=1$ for $i=1,\ldots,p$, $k=1,2,\ldots$, the expectation of $\bm{v}_{1,t}(i,j)=(\hat{\rho}_{0}(i,j)-\hat{\rho}_{1:t}(i,j))^{2}$ is

$$\mathbb{E}_{1}[\bm{v}_{1,t}(i,j)]=\left(\rho_{0}-\rho_{1}\right)^{2}+\frac{1}{H}\beta_{20}+\frac{1}{t-1}\beta_{21}+\frac{5-t}{t(t-1)}\rho_{1}^{2}+\frac{3H+4-H^{2}}{H(H+1)^{2}}\rho_{0}^{2}.$$

The first term $(\rho_{0}-\rho_{1})^{2}$ is still the dominating term. When the variance $\mathbb{V}\mathrm{ar}(x_{ki})$ is also unknown, the sample variances serve as consistent estimators of $\mathbb{V}\mathrm{ar}(x_{ki})$, and the expectation of $\bm{v}_{t^{\prime},t}(i,j)$ can be approximated similarly as above.

Remark 3.3.

Although we assume the data are i.i.d., our proposed method is not restricted to i.i.d. cases. The strong assumptions outlined in Lemma 3.1 and Remark 3.2 are primarily for theoretical analysis. In Sections 5 and 6, our method demonstrates robust performance in more complex scenarios, including non-Gaussian distributions and serial dependence.

3.2 Sum-Type Detection Statistics

To detect dense changes in the correlation structure, we use the $\ell_{1}$ norm of $\bm{v}_{t^{\prime},t}$ to construct detection statistics. Since the true change-point location $\nu$ is unknown, we take the maximum over all potential change points $t^{\prime}<t$, and the resulting detection statistic is defined as

$$S_{t}^{(\text{sum})}=\max_{1\leq t^{\prime}\leq t-1}\frac{(t-t^{\prime})H}{H+t-t^{\prime}}\left\|\bm{v}_{t^{\prime},t}\right\|_{1}, \qquad (9)$$

where the scaling $\frac{(t-t^{\prime})H}{H+t-t^{\prime}}$ is incorporated to balance the variance of $\bm{v}_{t^{\prime},t}$, which is of order $O(\frac{1}{t-t^{\prime}}+\frac{1}{H})$. When $H$ is relatively large, we may simply use $S_{t}^{(\text{sum})}=\max_{1\leq t^{\prime}\leq t-1}(t-t^{\prime})\|\bm{v}_{t^{\prime},t}\|_{1}$.

Window-limited variant

In practice, searching for the maximum over $1\leq t^{\prime}\leq t-1$ is computationally expensive, especially when $t$ becomes large. Alternatively, we can use the window-limited sum-type statistic (WL-Sum) defined as

$$S_{t}^{(\text{WL-sum})}=\max_{t-w\leq t^{\prime}\leq t-1}\frac{(t-t^{\prime})H}{H+t-t^{\prime}}\left\|\bm{v}_{t^{\prime},t}\right\|_{1}, \qquad (10)$$

where $w$ is a pre-specified window size.

Shewhart-type variant

Alternatively, we can further improve computational efficiency by eliminating the maximization over $t-w\leq t^{\prime}\leq t-1$ at time $t$. This leads to the following Shewhart sum-type (ST-Sum) statistic:

$$S_{t}^{(\text{ST-sum})}=\left\|\bm{v}_{t-w,t}\right\|_{1}. \qquad (11)$$

Change detection is performed via the stopping time defined as

$$T^{(\text{sum})}:=\inf\{t:S_{t}^{(\text{sum})}\geq b^{(\text{sum})}\},$$

where $b^{(\text{sum})}$ is a pre-specified threshold selected to meet the false alarm rate constraint, as detailed later. Here $S_{t}^{(\text{sum})}$ can be any sum-type test statistic mentioned above; note that the detection threshold varies for different test statistics.
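The following sketch (our own illustrative code, with sample_corr as in the Section 3.1 sketch) implements the window-limited statistic and the corresponding stopping rule; norm='sum' gives (10), while norm='max' gives the WL-Max statistic (13) of Section 3.3 below.

    import numpy as np

    def sample_corr(X):
        # Sample correlation as in (5); X is p x n with samples in columns.
        Y = X - X.mean(axis=1, keepdims=True)
        Y = Y / Y.std(axis=1, ddof=1, keepdims=True)
        return (Y @ Y.T) / (X.shape[1] - 1)

    def wl_statistic(R0_hat, X_recent, H, norm="sum"):
        # Window-limited statistic at time t: (10) for norm='sum', (13) for 'max'.
        # X_recent holds the columns x_{t-w}, ..., x_t.
        i, j = np.tril_indices(R0_hat.shape[0], k=-1)
        n, best = X_recent.shape[1], -np.inf
        for m in range(n - 1):                  # candidate change point t' = t-w+m
            R = sample_corr(X_recent[:, m:])
            v = (R0_hat[i, j] - R[i, j]) ** 2   # entries of v_{t',t}, see (7)
            k = n - 1 - m                       # k = t - t'
            s = v.sum() if norm == "sum" else v.max()
            best = max(best, k * H / (H + k) * s)
        return best

    def stopping_time(stream, R0_hat, H, w, b, norm="sum"):
        # First time t with S_t >= b; stream is an iterable of p-vectors.
        buf = []
        for t, x in enumerate(stream, start=1):
            buf = (buf + [np.asarray(x, dtype=float)])[-(w + 1):]
            if len(buf) >= 3:
                X = np.column_stack(buf)
                if wl_statistic(R0_hat, X, H, norm) >= b:
                    return t
        return None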

3.3 Max-Type Detection Statistics

The sum-type detection statistics above may not be efficient in detecting sparse changes, where only a relatively small number of elements of the correlation structure change. Under the sparse setting, a conventional test statistic is the maximum difference between the pre- and post-change correlation matrices. We thus define the max-type test statistic as

$$S_{t}^{(\max)}=\max_{1\leq t^{\prime}\leq t-1}\frac{(t-t^{\prime})H}{H+t-t^{\prime}}\left\|\bm{v}_{t^{\prime},t}\right\|_{\infty}. \qquad (12)$$

Window-limited variant

A window-limited variant of the max-type statistic (WL-Max) is employed to mitigate the computational burden,

$$S_{t}^{(\text{WL-max})}=\max_{t-w\leq t^{\prime}\leq t-1}\frac{(t-t^{\prime})H}{H+t-t^{\prime}}\left\|\bm{v}_{t^{\prime},t}\right\|_{\infty}. \qquad (13)$$

Shewhart-type variant

The Shewhart max-type statistic (ST-Max) is similarly constructed as

$$S_{t}^{(\text{ST-max})}=\left\|\bm{v}_{t-w,t}\right\|_{\infty}. \qquad (14)$$

The corresponding stopping time is

$$T^{(\max)}:=\inf\{t:S_{t}^{(\max)}\geq b^{(\max)}\},$$

where $S_{t}^{(\max)}$ can be either $S_{t}^{(\text{WL-max})}$ or $S_{t}^{(\text{ST-max})}$.

Remark 3.4 (The choice of window size).

The choice of window size involves a tradeoff between computational efficiency and detection delay. Generally, we assume a small sequence of data is available to pre-determine the optimal window size. A smaller window size may be computationally faster but could result in longer detection delays, especially for large ARL constraints. Conversely, a larger window size improves detection delay but is computationally less efficient. The key idea is to have sufficient data to ensure that the signal-to-noise ratio is detectable. In Proposition 4.3, we provide a lower bound for selecting the window size. A small-scale simulation is conducted to evaluate the impact of the window size $w$, with the results presented in Figure 10 of the Supplementary Material. The key observation is that, for a fixed small ARL value, a wide range of window sizes generally achieves low detection delays, allowing us to select smaller window sizes to ensure computational efficiency. However, as the ARL increases, the detection delay for smaller window sizes can grow quadratically, necessitating the use of larger window sizes to maintain detection performance. Additionally, larger window sizes are needed to perform more challenging detection tasks effectively.

Remark 3.5.

An alternative method to improve the computational efficiency of window-limited statistics in (10) and (13) is to use lagged sliding windows with stopping time defined as

$$T^{\prime}:=\inf\{t:t=w+s\times t^{\prime},\ t^{\prime}\in\mathbb{N}_{+},\ S_{t}\geq b\},$$

where $1\leq s\leq w$ is the lag between neighboring sliding windows; when $s=1$, this reduces to the previous definition, and $s=w$ means that we have independent (non-overlapping) sliding time windows.

3.4 SMOTE and Knockoff Enhancements

For window-limited-variant test statistics, the underlying change point $t^{\prime}$ can vary from $t-w$ to $t-1$, so the calculation of $\widehat{\bm{R}}_{t^{\prime}:t}$ and the resulting statistics can be unstable due to a lack of samples, especially when $w$ is small compared to the data dimension $p$ and $t^{\prime}$ is close to $t-1$. To address this issue, we propose two modified approaches that incorporate the SMOTE and knockoff techniques, respectively. See Figure 1 for an illustration.

SMOTE-based method

The SMOTE technique is an oversampling method designed to deal with imbalanced data in classification problems (Chawla et al., 2002). Specifically, for each sample $\bm{x}_{t}$ in the minority class, the five nearest neighbors with the smallest Euclidean distance are identified. One is randomly chosen as $\bm{x}_{t}^{*}$, based on which a new SMOTE sample is produced as $\bm{x}_{\text{SMOTE}}=\bm{x}_{t}+u\cdot(\bm{x}_{t}^{*}-\bm{x}_{t})$, where $u$ is drawn from the standard uniform distribution on $[0,1]$. Blagus and Lusa (2013) investigated SMOTE's theoretical properties and performance on large-dimensional data.
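A minimal sketch of the SMOTE interpolation step described above (our own helper; five nearest neighbors, as in Chawla et al., 2002):

    import numpy as np

    def smote_samples(X, n_new, k=5, seed=None):
        # Generate n_new SMOTE samples from X (p x n, samples in columns):
        # pick a random sample, one of its k nearest neighbors, and
        # interpolate x + u * (x_star - x) with u ~ Uniform[0, 1].
        rng = np.random.default_rng(seed)
        p, n = X.shape
        out = np.empty((p, n_new))
        for s in range(n_new):
            t = rng.integers(n)
            d = np.linalg.norm(X - X[:, [t]], axis=0)   # Euclidean distances
            d[t] = np.inf                               # exclude the sample itself
            star = rng.choice(np.argsort(d)[:k])        # random nearest neighbor
            out[:, s] = X[:, t] + rng.uniform() * (X[:, star] - X[:, t])
        return out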

The proposed methods based on window-limited-variant test statistics can be enhanced by combining them with SMOTE to effectively detect change points; the procedure is detailed in Algorithm 1.

Input: Data dimension $p$, online data $\{\bm{x}_{t},t=1,2,\ldots\}$, reference data $\{\bm{x}_{-H},\ldots,\bm{x}_{0}\}$, the pre-specified ARL $\gamma$, window size $w$, threshold $b$.
Output: Stopping time $T$.
1  Use the reference data to calculate $\widehat{\bm{R}}_{0}$;
2  Initialize $S^{\prime}_{1}=0$;
3  while $S^{\prime}_{t}<b$ do
4      if $t<w+1$ then
5          $\bm{X}=[\bm{x}_{1},\ldots,\bm{x}_{t}]\in\mathbb{R}^{p\times t}$;
6      else
7          $\bm{X}=[\bm{x}_{t-w},\ldots,\bm{x}_{t}]\in\mathbb{R}^{p\times(w+1)}$;
8      end if
9      $m_{0}=\#\mathrm{columns}(\bm{X})$;
10     $\bm{X}_{\text{SMOTE}_{0}}=\bm{X}$;
11     Let $\bm{X}_{\text{SMOTE}_{0}}$ be the minority class, from which $m_{0}$ SMOTE variables are generated as $\bm{X}_{\text{SMOTE}_{1}}$, and set $\widetilde{\bm{X}}=[\bm{X}_{\text{SMOTE}_{0}},\bm{X}_{\text{SMOTE}_{1}}]$;
12     $\tilde{m}_{0}=\#\mathrm{columns}(\widetilde{\bm{X}})=2m_{0}$;
13     for $m\leftarrow 1$ to $m_{0}-1$ do
14         $\bm{X}_{s}=\widetilde{\bm{X}}_{m:\tilde{m}_{0}}$;
15         Use $\bm{X}_{s}$ to calculate the corresponding sample correlation matrix $\widehat{\bm{R}}_{m:\tilde{m}_{0}}$ and thus the test statistic $S_{t}^{(m)}=g(\bm{v}_{m,\tilde{m}_{0}})$, where $g(\cdot)$ depends on which type of test statistic is used, and a weight $\frac{H(m_{0}-m)}{H+(m_{0}-m)}$ is added for window-limited-variant statistics;
16     end for
17     $S^{\prime}_{t}=\max_{m=1,\ldots,m_{0}-1}\{S_{t}^{(m)}\}$;
18 end while
19 Stopping time $T=t$.
Algorithm 1 SMOTE-enhanced Detection Procedure

Knockoff-based method

Proposed by Barber and Candès (2015), the knockoff method is designed to select variables that are truly associated with the response while controlling the false discovery rate (FDR). The knockoff variables imitate the correlation structure of the original variables but are unrelated to the response. Barber and Candès (2015) achieved exact finite-sample FDR control in the homoscedastic Gaussian linear model when $n\geq 2p$ (along with a nearly exact extension to the case when $n<2p$). We enhance our proposed detection procedures by incorporating the knockoff method when $p>2w+2$.

Specifically, for a matrix $\bm{X}\in\mathbb{R}^{p\times(w+1)}$, $\bm{X}_{\text{knockoff}}=\bm{X}(\bm{I}-\bm{\Sigma}^{-1}\mathrm{diag}\{\bm{z}\})+\bm{U}\bm{C}$, where $\bm{\Sigma}=\bm{X}^{\top}\bm{X}$ satisfies $\bm{\Sigma}_{j,j}=1$ for all $j$, $\bm{z}$ is a non-negative vector, $\bm{U}$ is a $p\times(w+1)$ orthonormal matrix orthogonal to the span of the columns of $\bm{X}$, and $\bm{C}^{\top}\bm{C}=2\,\mathrm{diag}\{\bm{z}\}-\mathrm{diag}\{\bm{z}\}\bm{\Sigma}^{-1}\mathrm{diag}\{\bm{z}\}$ is obtained from a Cholesky decomposition; see Barber and Candès (2015) for more details on choosing $\bm{z}$. Then $\bm{X}_{\text{knockoff}}$ has the same correlation structure as $\bm{X}$.

The algorithm is the same as Algorithm 1, except that line 10 becomes $\bm{X}_{\text{knockoff}_{0}}=\bm{X}$, and line 11 becomes: "For $\bm{X}_{\text{knockoff}_{0}}$, generate $m_{0}$ knockoff variables as $\bm{X}_{\text{knockoff}_{1}}$, and set $\widetilde{\bm{X}}=[\bm{X}_{\text{knockoff}_{0}},\bm{X}_{\text{knockoff}_{1}}]$." Through the knockoff enhancement, additional independent random variables are added that maintain the same correlation structure as the original variables. This approach improves the estimation of the correlation matrix, particularly when the sample size is small. Regarding detection delay, for a fixed ARL, this technique allows us to use a smaller window size while still identifying change points effectively, as shown in Section 5.
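For illustration, below is a minimal sketch of the equi-correlated fixed-X knockoff construction of Barber and Candès (2015), written in the standard design-matrix orientation (rows as observations); in our setting one would apply it to $\bm{X}^{\top}$, so that $p>2w+2$ plays the role of $n\geq 2d$. The equi-correlated choice of $\bm{z}$ and the small jitter are our own illustrative choices.

    import numpy as np
    from scipy.linalg import cholesky, null_space

    def equicorrelated_knockoffs(A, eps=1e-8):
        # A: n x d design matrix with n >= 2d and a nonsingular Gram matrix.
        n, d = A.shape
        assert n >= 2 * d, "fixed-X knockoffs require n >= 2d"
        A = A / np.linalg.norm(A, axis=0, keepdims=True)  # unit-norm columns
        Sigma = A.T @ A                                   # unit diagonal
        lam_min = np.linalg.eigvalsh(Sigma)[0]
        z = np.full(d, min(1.0, 2.0 * lam_min) - eps)     # equi-correlated z
        Sinv_D = np.linalg.solve(Sigma, np.diag(z))       # Sigma^{-1} diag(z)
        # C satisfies C'C = 2 diag(z) - diag(z) Sigma^{-1} diag(z), PSD by choice of z.
        M = 2.0 * np.diag(z) - np.diag(z) @ Sinv_D
        C = cholesky(M + eps * np.eye(d))                 # upper-triangular factor
        U = null_space(A.T)[:, :d]                        # orthonormal, orthogonal to col(A)
        return A @ (np.eye(d) - Sinv_D) + U @ C           # knockoff copy of A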

Figure 1: Illustration of the SMOTE and knockoff enhancement techniques.
Remark 3.6.

A small simulation study is conducted to show the effects of these two enhancement methods on correlation estimation. Two settings are considered under different ground-truth correlation matrices:

  • Setting 1: the true correlation matrix is the identity matrix $\bm{R}=\mathbf{I}_{p}$;

  • Setting 2: the true correlation matrix is $\bm{R}=[\rho(i,j)]_{i,j=1,\ldots,p}$ with $\rho(i,j)=1$ for $i=j$ and $\rho(i,j)=0.5$ for $i\neq j$, $1\leq i,j\leq p$.

Table 1 compares the means and standard deviations of $\|\bm{R}-\widehat{\bm{R}}\|_{F}$, where $\widehat{\bm{R}}$ is the sample correlation matrix calculated from the original samples (of size $w$), the SMOTE-enhanced samples, and the knockoff-enhanced samples, respectively. Different combinations of data dimension $p$ and sample size $w$ are considered. Compared to the sample correlation matrix obtained from the original samples, the knockoff method can enhance the accuracy of correlation estimation, while the SMOTE method cannot (though its bias is small). This follows from the construction principle of knockoff variables: they keep the same correlation structure as the original variables, whereas the SMOTE method only ensures that the new samples follow approximately the same distribution as the original ones.

Table 1: Mean and standard deviation (Std) of $\|\bm{R}-\widehat{\bm{R}}\|_{F}$.

                            Setting 1                          Setting 2
                   Original   SMOTE     Knockoff     Original   SMOTE     Knockoff
p=50,  w=20  Mean  11.3541    11.9440   11.2093      8.6080     8.9213    8.5133
             Std   0.2215     0.2816    0.2224       1.6088     1.4629    1.6260
p=100, w=20  Mean  22.8313    23.9804   22.6955      17.1047    17.8367   17.0168
             Std   0.2192     0.3524    0.2207       2.8597     2.6303    2.8764
p=300, w=20  Mean  68.7075    72.1942   68.5769      51.7309    53.7253   51.6465
             Std   0.1981     0.7886    0.1987       9.6532     8.7277    9.6662
p=300, w=50  Mean  42.7811    45.7033   42.7045      32.8976    34.6930   32.8487
             Std   0.1407     0.4921    0.1409       6.2258     5.2383    6.2366
p=300, w=100 Mean  30.0997    32.5922   30.0469      22.5139    25.3277   22.4771
             Std   0.1067     0.3467    0.1067       3.3156     4.0502    3.3196

3.5 Signflip-Based Threshold Selection

Selecting a proper threshold bb for stopping time is crucial in the sequential change point detection procedure. In general, there are two types of methods for determining the appropriate threshold. One way is to infer an exact threshold based on the distribution of test statistics under the pre-change regime. The other way is to determine the threshold based on the empirical distribution of numerous simulated test statistics, which is more common in practice as it requires little knowledge about the distribution of test statistics.

However, in real-data applications where the number of pre-change samples is limited, or where generating samples is time-consuming, the available data size is small. We propose a new heuristic method for selecting the threshold using a sign-flip permutation method, which works well even for small sample sizes. The method is detailed in Algorithm 2. It is worth mentioning that, for simplicity, we assume in Algorithm 2 that the pre-change correlation matrix $\bm{R}_{0}$ is known or pre-estimated from available historical data. In real applications, $\bm{R}_{0}$ may be known in advance based on domain knowledge or estimable from reference data, as shown in our real data analyses in Section 6.

Input: Data dimension $p$, number of sign-flip trials $q$, pre-change correlation estimator $\widehat{\bm{R}}_{0}$, a sequence of pre-change data $\{\bm{x}_{1},\ldots,\bm{x}_{M}\}$, the pre-specified ARL $\gamma$, window size $w$.
Output: Threshold $b$.
1  for $l\leftarrow 1$ to $q$ do
2      Randomly sign-flip entries: form $\bm{x}_{k}^{*}=\bm{\mathcal{R}}\circ\bm{x}_{k}$, $\forall k=1,\ldots,M$, where $\bm{\mathcal{R}}\in\mathbb{R}^{p\times 1}$ has i.i.d. Rademacher entries, i.e., $\bm{\mathcal{R}}(i)=+1$ with probability $1/2$ and $-1$ with probability $1/2$, independently for $i=1,\ldots,p$;
3      for $t\leftarrow 2$ to $M$ do
4          if $t<w+1$ then
5              $\bm{X}=[\bm{x}^{*}_{1},\ldots,\bm{x}^{*}_{t}]\in\mathbb{R}^{p\times t}$;
6          else
7              $\bm{X}=[\bm{x}^{*}_{t-w},\ldots,\bm{x}^{*}_{t}]\in\mathbb{R}^{p\times(w+1)}$;
8          end if
9          $m_{0}=\#\mathrm{columns}(\bm{X})$;
10         for $m\leftarrow 1$ to $m_{0}-1$ do
11             Use $\bm{X}_{m:m_{0}}$ to calculate the sample correlation matrix $\widehat{\bm{R}}_{m:m_{0}}$ and thus the statistic $S_{t,m}=g(\bm{v}_{m,m_{0}})$, where $g(\cdot)$ depends on which type of statistic is used, and a weight $\frac{H(m_{0}-m)}{H+(m_{0}-m)}$ is added for window-limited-variant statistics;
12         end for
13         Calculate the test statistic $S_{t}^{(l)}=\max_{1\leq m\leq m_{0}-1}\{S_{t,m}\}$;
14     end for
15 end for
16 Set the threshold $b$ using the $q$ sequences of detection statistics $\{S_{t}^{(l)},t=1,\ldots,M\}_{l=1}^{q}$ such that their average run length equals $\gamma$.
Algorithm 2 Signflip-based Threshold Selection
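A condensed sketch of Algorithm 2's two ingredients: the Rademacher sign-flip of the pre-change data, and the final calibration of $b$ (here by simple bisection on the empirical average run length, one of several possible choices; stat_fn stands for any of the detection statistics above and is assumed to be supplied by the user).

    import numpy as np

    def mean_run_length(X_pre, stat_fn, b, q, seed=None):
        # Average run length over q sign-flip trials; X_pre is p x M pre-change data.
        # stat_fn maps a p x M array to the statistic sequence (S_2, ..., S_M).
        rng = np.random.default_rng(seed)
        p, M = X_pre.shape
        runs = []
        for _ in range(q):
            signs = rng.choice([-1.0, 1.0], size=(p, 1))  # Rademacher vector R
            S = stat_fn(signs * X_pre)                    # x_k* = R o x_k
            hit = np.nonzero(S >= b)[0]
            runs.append(hit[0] + 2 if hit.size else M)    # first crossing time
        return float(np.mean(runs))

    def calibrate_threshold(X_pre, stat_fn, gamma, q, lo, hi, iters=30):
        # Bisect on b so that the average sign-flip run length matches gamma.
        for _ in range(iters):
            mid = 0.5 * (lo + hi)
            if mean_run_length(X_pre, stat_fn, mid, q) < gamma:
                lo = mid          # alarms too frequent: raise the threshold
            else:
                hi = mid
        return 0.5 * (lo + hi)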

To illustrate why the threshold selected via Algorithm 2 is valid, we present the following lemma.

Lemma 3.7.

Given a random vector $\bm{x}_{t}=(x_{t1},\ldots,x_{tp})^{\top}\in\mathbb{R}^{p\times 1}$ and two independent Rademacher vectors $\bm{\mathcal{R}}_{1}:=(r_{1}^{(1)},\ldots,r_{p}^{(1)})^{\top}$ and $\bm{\mathcal{R}}_{2}:=(r_{1}^{(2)},\ldots,r_{p}^{(2)})^{\top}$ with i.i.d. Rademacher entries, the vectors $\bm{x}_{t}\circ\bm{\mathcal{R}}_{1}$ and $\bm{x}_{t}\circ\bm{\mathcal{R}}_{2}$ are uncorrelated, as $\mathbb{C}\mathrm{ov}(x_{ti}r_{i}^{(1)},x_{tj}r_{j}^{(2)})=0$ for $1\leq i,j\leq p$.

Lemma 3.7 demonstrates that, for a given vector $\bm{x}_{t}$, the data shuffled through different sign-flip steps using different $\bm{\mathcal{R}}$ exhibit weak dependence. Moreover, if the shuffled data follow a Gaussian distribution, they become independent. This property ensures that the test statistics computed from the $q$ sign-flip trials are nearly independent, allowing for appropriate threshold selection. A numerical result supporting this assertion is presented in Remark 3.8.

Remark 3.8.

In addition, we illustrate through a simulation example that the sign-flip permutation in Algorithm 2 does not change the distribution of the detection statistic $S_{t}$ under the pre-change measure. We draw pre-change samples from the standard Gaussian distribution with dimension $p=50$ and set $w=20$, $H=100$, $M=1000$. In Figure 2, we plot, in increasing order, the detection statistic $S_{t}$ calculated from the original data and $\tilde{S}_{t}$ calculated from the sign-flipped data, using the WL-Sum statistic in (10); the two distributions are almost identical.

Figure 2: The empirical distributions of the increasing-order elements of $S_{t}$ and $\tilde{S}_{t}$.
Remark 3.9.

Before concluding this section, we comment on the insights behind the effectiveness of the SMOTE- and knockoff-enhanced detection procedures in Section 3.4. In the simulation results in Section 5, both SMOTE- and knockoff-enhanced methods have smaller detection delays compared to the original WL-Sum procedure. We use a simple simulation study to illustrate the key reasons behind their effectiveness. We let the pre-change correlation matrix be $\bm{R}_{0}=\mathbf{I}_{p}$, and the post-change correlation matrix has entries $\rho_{1}(i,j)=1$ for $i=j$ and $\rho_{1}(i,j)=0.5$ for $1\leq i\neq j\leq p$. We simulate data sequences of length $M=1000$ under the pre- and post-change regimes separately, and use window size $w=20$ for the window-limited approaches.

For $t$ ranging from $2$ to $1000$ (999 time steps in total), Table 2 reports the number of times that $t^{\prime}_{\text{max}}:=\operatorname*{arg\,max}_{t-w\leq t^{\prime}\leq t-1}S_{t}^{(\text{WL-sum})}$ equals $t-w$, $t-w+1$, or $t-w+2$, and the number of times that $t^{\prime}_{\text{max}}<t-w/2$. Under the pre-change regime, among the 999 time indices, both SMOTE- and knockoff-enhanced methods yield $t^{\prime}_{\text{max}}=t-w$ more than 91.9% of the time, making their estimated statistics more credible and thus the resulting thresholds more reliable. In contrast, the original method has a relatively low number of instances with $t^{\prime}_{\text{max}}=t-w$. In addition, the knockoff-enhanced method consistently finds the maximum of the test statistics at $t^{\prime}_{\text{max}}=t-w$. A similar pattern is observed under the post-change regime. These results align with the detection delay findings in Section 5, where the knockoff-enhanced method exhibits the smallest detection delay, followed by the SMOTE-enhanced method, compared with the original WL-Sum statistic $S_{t}^{(\text{WL-sum})}$.

Table 2: Number of times (out of 999) that $t^{\prime}_{\text{max}}$ falls at each location when $S_{t}^{(\text{WL-sum})}$ reaches its maximum.

                                t-w    t-w+1   t-w+2   < t-w/2
Pre-change   p=50   SMOTE       919    73      7       999
                    Knockoff    999    0       0       999
                    Original    362    165     112     964
             p=300  SMOTE       982    16      1       999
                    Knockoff    999    0       0       999
                    Original    914    72      13      999
Post-change  p=50   SMOTE       852    110     30      999
                    Knockoff    998    1       0       999
                    Original    247    138     94      930
             p=300  SMOTE       894    78      16      999
                    Knockoff    999    0       0       999
                    Original    311    135     111     953

4 Theoretical Analysis

In this section, we provide the theoretical guarantees for the proposed tests with respect to two key performance measures: the average run length and the expected detection delay. The detailed proofs are deferred to Section C of the Supplementary Material.

4.1 ARL Approximation

For simplicity, we let the size of the reference data $H$ equal the chosen window size $w$. Note that we can regard the weight $\frac{(t-t^{\prime})H}{H+t-t^{\prime}}$ in front of the test statistic $S_{t}$ as a constant for a given $t^{\prime}$, and we may unify the sum-type statistics for non-sparse settings and the max-type statistics for sparse settings since they are, in essence, combinations of element-wise entries. We first characterize the temporal correlation of the statistics as follows, which will be utilized later for the ARL approximation.

Lemma 4.1 (Temporal correlation of sequential detection statistics).

Suppose all samples are i.i.d. from the pre-change distribution with $\mathbb{E}[x_{ti}]=0$ and $\mathbb{V}\mathrm{ar}[x_{ti}]=1$ for $i=1,\ldots,p$, $t=1,2,\ldots$; then the correlation between the detection statistics $S_{t+s}$ and $S_{t}$ is

$$\mathbb{C}\mathrm{orr}(S_{t+s},S_{t})=1-\frac{s}{w}+o\left(\frac{s}{w}\right).$$

Here $s=o(w)$ is a small time shift from $t$, and $S_{t}$ can be either the sum-type or the max-type statistics defined in Section 3.

Based on this temporal correlation, we have the following theorem characterizing the approximate ARL of the proposed detection procedures. The main idea is to use a linear approximation for the correlation between the detection statistics $S_{t}$ and $S_{t+s}$. Then, the behavior of the detection procedure can be related to a random field. By leveraging the localization theorem (Siegmund et al., 2010), we obtain an asymptotic approximation for the ARL when the threshold $b$ is large enough.

Theorem 4.2 (ARL Approximation).

Assume that under the pre-change measure we have $\mathbb{E}_{\infty}[x_{ti}^{4}]<\infty$ for $1\leq i\leq p$, $t=1,2,\ldots$. Under Assumptions C.1–C.4 detailed in Appendix C, as the threshold $b\rightarrow\infty$, the ARL of the stopping time $T$ can be approximated as

$$\mathbb{E}_{\infty}[T]=\frac{1}{2}[2\pi/\kappa]^{1/2}e^{\kappa/2}\bigg/\int_{\xi_{1}}^{\xi_{2}}y\zeta^{2}(y)\,dy\ (1+o(1)), \qquad (15)$$

where $\xi_{1}=\frac{2b}{\sqrt{w\sigma_{d}^{2}}}$, $\xi_{2}=\frac{2b}{\sigma_{d}}$, $\kappa=\frac{(b-\mu)^{2}}{\sigma_{d}^{2}}$, $\mu=\mathbb{E}_{\infty}[\bm{v}_{t-w,t}(i,j)]$, $\sigma_{d}^{2}=\mathbb{V}\mathrm{ar}_{\infty}[\bm{v}_{t-w,t}(i,j)]$, and $T$ can be the stopping time of either sum-type or max-type statistics. Here $\zeta(\cdot)$ is a special function closely related to the Laplace transform of the overshoot over the boundary of a random walk (Siegmund and Yakir, 2007):

$$\zeta(y)\approx\frac{\frac{2}{y}\left[\Phi\left(\frac{y}{2}\right)-0.5\right]}{\frac{y}{2}\Phi\left(\frac{y}{2}\right)+\phi\left(\frac{y}{2}\right)},$$

where $\phi(y)$ and $\Phi(y)$ are the probability density function and the cumulative distribution function of the standard Gaussian distribution, respectively.

The bounded fourth-moment assumption is needed to ensure that the estimated correlation coefficients are reliable estimators, i.e., $\hat{\rho}(i,j)$ converges to $\rho(i,j)$ in probability. Additionally, Assumptions C.1–C.4 are required to ensure the applicability of the localization theorem. In practice, all assumptions are easily satisfied by data from commonly used distributions, such as Gaussian distributions and exponential families.

The main contribution of Theorem 4.2 is to provide a theoretical method to set the threshold that avoids Monte Carlo simulation, which can be time-consuming, especially when the desired ARL is large. From Theorem 4.2, we can numerically compute the threshold $b$ by setting the right-hand side of Equation (15) equal to the desired ARL value (see the sketch after Table 3). Table 3 shows the high accuracy of this approximation by comparing the threshold obtained from Equation (15) with that obtained from a simulation study. For a sequence of i.i.d. standard normal samples, we conduct 1000 sign-flip trials to find the threshold for different ARL values. The results of the WL-Sum, WL-Max, ST-Sum, and ST-Max procedures are presented in Table 3, indicating that the approximation is reasonably accurate: the relative error is smaller than 1% for WL-Sum, around 7% to 11% for ST-Sum, around 1% for WL-Max, and smaller than 8% for ST-Max. We also note that the magnitude of the threshold varies across test statistics. For example, the WL-Sum statistic involves the summation of $p(p-1)/2$ terms and results in consistently higher thresholds than the max-type statistics.

Table 3: Comparison of the threshold $b$ obtained from simulations and the theoretical approximation (15). $p$=50, $w$=20, $H$=100, sign-flip trials $q$=1000. For WL-Sum, the unit is $10^{3}$.

ARL                    5,000     10,000    20,000    30,000    40,000    50,000
WL-Sum  Simulated      1.3271    1.3379    1.3478    1.3530    1.3589    1.3598
        Theoretical    1.3388    1.3499    1.3603    1.3659    1.3698    1.3729
ST-Sum  Simulated      79.1592   79.7146   80.3381   80.7368   81.0911   81.3835
        Theoretical    78.2469   78.8507   79.4074   79.7111   79.9234   80.0864
WL-Max  Simulated      17.3070   17.9350   18.5248   18.8585   19.0879   19.1525
        Theoretical    16.7333   17.0234   17.2975   17.4572   17.5667   17.6469
ST-Max  Simulated      1.0308    1.0617    1.0999    1.1168    1.1351    1.1412
        Theoretical    0.9561    0.9732    0.9896    0.9988    1.0052    1.0101
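As a sketch of how the threshold can be computed from Theorem 4.2, the code below evaluates the right-hand side of (15) and inverts it by root-finding; the pre-change moments $\mu$ and $\sigma_{d}^{2}$ and the bracketing interval are assumptions supplied by the user (e.g., estimated from reference data).

    import numpy as np
    from scipy.integrate import quad
    from scipy.optimize import brentq
    from scipy.stats import norm

    def zeta(y):
        # Overshoot correction function of Siegmund and Yakir (2007).
        return (2.0 / y) * (norm.cdf(y / 2) - 0.5) / (
            (y / 2) * norm.cdf(y / 2) + norm.pdf(y / 2))

    def arl_approx(b, mu, sigma_d, w):
        # Right-hand side of the ARL approximation in Equation (15).
        xi1, xi2 = 2 * b / np.sqrt(w * sigma_d ** 2), 2 * b / sigma_d
        kappa = (b - mu) ** 2 / sigma_d ** 2
        integral, _ = quad(lambda y: y * zeta(y) ** 2, xi1, xi2)
        return 0.5 * np.sqrt(2 * np.pi / kappa) * np.exp(kappa / 2) / integral

    def threshold_for_arl(gamma, mu, sigma_d, w, lo, hi):
        # Find b with ARL(b) = gamma on a user-supplied bracket [lo, hi].
        return brentq(lambda b: arl_approx(b, mu, sigma_d, w) - gamma, lo, hi)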

4.2 Detection Delay of Max-type and Sum-type Statistics

When a change point $\nu$ exists, the performance of the detection procedure is measured by the detection delay defined in (2), which can be interpreted as the expected number of post-change samples needed to detect the change. We note that the supremum over all possible change points $\nu$ in the definition (2) is typically intractable for a general detection procedure. Therefore, we use a simplified definition of the expected detection delay (EDD) as $\mathbb{E}_{1}[T]$, i.e., the expected stopping time when the change point equals $1$; this is also what is typically simulated in practice. We present the following proposition for the detection delay of the proposed stopping times.

Proposition 4.3 (EDD Approximation).

Assume that under the post-change measure we have $\mathbb{E}_{1}[x_{ti}^{4}]<\infty$ for $1\leq i\leq p$, $t=1,2,\ldots$; then the EDD of the WL-Sum and WL-Max test statistics can be approximated as follows. As the threshold $b\to\infty$ and for window size $w\geq\frac{b}{\min(\Delta_{1},\Delta_{2})}$,

$$\mathbb{E}_{1}[T^{(\text{sum})}]=\frac{b(1+o(1))}{\Delta_{1}},\quad\text{and}\quad\mathbb{E}_{1}[T^{(\max)}]=\frac{b(1+o(1))}{\Delta_{2}},$$

where $\Delta_{1}:=\sum_{1\leq i<j\leq p}(\rho_{0}(i,j)-\rho_{1}(i,j))^{2}$ and $\Delta_{2}:=\max_{1\leq i<j\leq p}(\rho_{0}(i,j)-\rho_{1}(i,j))^{2}$ are two notions of signal strength; note that a large $\Delta_{2}/\Delta_{1}$ indicates that the change in the correlation matrix is sparse.
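The signal strengths and the corresponding leading-order delay predictions of Proposition 4.3 are straightforward to compute; a brief sketch (our own helper) follows.

    import numpy as np

    def predicted_edd(R0, R1, b):
        # Leading-order EDD approximations b / Delta1 and b / Delta2
        # from Proposition 4.3.
        i, j = np.tril_indices_from(R0, k=-1)
        diff2 = (R0[i, j] - R1[i, j]) ** 2
        delta1, delta2 = diff2.sum(), diff2.max()   # Delta_1, Delta_2
        return b / delta1, b / delta2               # EDD of sum- and max-type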

5 Simulation Results

In this section, we conduct extensive simulations to illustrate the performance of the proposed methods. The simulations are conducted across varying dimensions $p=50,60,100,300$ and window sizes $w=20,40,50,60$. We set $H=100$, $M=1000$, $q=1000$, and all detection delays are averaged over 1000 replications. We consider two data-generating distributions for the stream $\{\bm{x}_{t}\}$: (i) the multivariate normal distribution $N(\bm{0},\bm{R})$; and (ii) the multivariate Student's $t$ distribution, $t_{5}$, with 5 degrees of freedom and covariance matrix $\bm{R}$. For each distribution, we examine four scenarios: Cases 1 and 2 represent non-sparse settings, Case 3 represents a sparse setting, and Case 4 is a more general setting. Below, $r$ denotes the off-diagonal element of the post-change correlation matrices, and $\lfloor\cdot\rfloor$ is the floor function.

  • Case 1: $\bm{R}_{0}=\mathbf{I}_{p}$; $\rho_{1}(i,j)=r$ for $1\leq i\neq j\leq p$.

  • Case 2: $\bm{R}_{0}=\mathbf{I}_{p}$; $\rho_{1}(i,j)=r$ for $1\leq i\neq j\leq\lfloor p/2\rfloor$ and $0$ otherwise.

  • Case 3: $\rho_{0}(i,j)=-0.3$, $\rho_{1}(i,j)=0.9$ for $1\leq i\neq j\leq\lfloor p^{0.3}\rfloor$ and $0$ otherwise.

  • Case 4: $\rho_{0}(i,j)=0.3$ for $1\leq i\neq j\leq\lfloor p/2\rfloor$ and $0$ otherwise; $\rho_{1}(i,j)=0.5$ for $\lfloor p/2\rfloor+1\leq i\neq j\leq p$ and $0$ otherwise.

We simulate the expected detection delay $\mathbb{E}_{1}[T]$ by setting the change point $\nu=1$, which means that all streaming data are drawn from the post-change distribution. This simplifies the simulation process, as computing the exact WADD defined in (2), which requires considering all possible change points, is often impractical. Importantly, for some detection procedures, such as the CUSUM test, the worst-case detection delay is achieved when the change occurs at $\nu=1$ (Xie et al., 2021). As a result, $\mathbb{E}_{1}[T]$ closely approximates the worst-case scenario, providing a reliable alternative for evaluating detection performance without significant deviation from the WADD.

For clarity, we summarize the notation for the different methods. In addition to WL-Sum and WL-Max defined in Sections 3.2 and 3.3, WL-Sum+SMOTE and WL-Sum+Knockoff denote the methods incorporating the SMOTE and knockoff enhancements, respectively. CUSUM denotes the exact CUSUM procedure, which is exactly optimal and thus provides the information-theoretic lower bound on the delay (Moustakides, 1986). The knockoff-enhanced methods require $p>2w+2$, so their detection results are absent for large window sizes when this condition is violated.

Following the definition of sparsity in Feng et al. (2022), we define the "dense level" to represent the extent of the change in the correlation structure: the proportion of changed elements within the entire lower triangle of the correlation matrix, excluding the diagonal elements. A smaller dense level therefore indicates a sparser alteration in the correlation structure, and vice versa. In Case 1, the dense level is 1. In Case 2, the dense level is $\frac{\lfloor p/2\rfloor(\lfloor p/2\rfloor-1)}{p(p-1)}\approx\frac{1}{4}$. In Case 3, the dense level is $\frac{\lfloor p^{0.3}\rfloor(\lfloor p^{0.3}\rfloor-1)}{p(p-1)}\approx p^{-1.4}$. From Case 1 to Case 3, the change in the correlation structure becomes increasingly sparse.

Figure 3: Normal distribution (Case 1: the dense level of change is 1). Comparison of ADDs of WL-Sum, WL-Sum+SMOTE, WL-Sum+Knockoff, and CUSUM.
Figure 4: Normal distribution (Case 2: the dense level of change is 0.25). Comparison of ADDs of WL-Sum, WL-Sum+SMOTE, WL-Sum+Knockoff, and CUSUM, under $r=0.5$.
Figure 5: Normal distribution (Case 3: the dense level of change is $p^{-1.4}$). Comparison of ADDs of WL-Sum, WL-Sum+SMOTE, WL-Sum+Knockoff, WL-Max, WL-Max+SMOTE, WL-Max+Knockoff, and CUSUM.
Figure 6: Student's $t$ distribution $t_5$. Comparison of ADDs of WL-Sum, WL-Sum+SMOTE, WL-Sum+Knockoff, and CUSUM. (a) Case 1, $p=50$; (b) Case 1, $p=100$; (c) Case 2, $p=50$; (d) Case 2, $p=100$.
Figure 7: Results under Case 4. Comparison of ADDs of WL-Sum, WL-Sum+SMOTE, WL-Sum+Knockoff, and CUSUM. (a) Normal distribution; (b) Student's $t$ distribution $t_5$.

Figure 3 shows the average detection delay (ADD) of the different methods under Case 1, with the ARL ranging from $10^2$ to $10^5$. Subfigures (a), (b), and (c) show the results when $r$ changes from 0 to 0.3, 0.5, and 0.8, respectively. CUSUM is the optimal detection method and provides the theoretical lower bound on the detection delay of all test procedures. For a fixed dense level, the ADD decreases as the magnitude of the change in the correlation structure increases. Moreover, the SMOTE- and knockoff-enhanced methods effectively reduce the detection delay, with the knockoff enhancement performing better. Comparing (b), (d), and (f), as $p$ increases from 50 to 300, the maximum ADD of WL-Sum (and of WL-Sum+SMOTE and WL-Sum+Knockoff) decreases, indicating that our proposed methods perform better as the dimension grows, i.e., the blessing of dimension. Comparing (b) and (e), as the window size $w$ increases from 20 to 50, the ADD changes little for WL-Sum and WL-Sum+SMOTE. Thus, in this setting, the best approach is to use a small window size together with the knockoff enhancement.

Figure 4 compares the ADDs of the different methods in Case 2. Comparing subfigure (a) with Figure 3(b), as the dense level drops from 1 to 0.2449, the ADD of WL-Sum increases from 47.2 to 75.6, indicating a more challenging detection problem when the change becomes sparser. Subfigures (a) and (b) differ only in window size and show little difference in ADD, so the window size has little influence in this case, although SMOTE can reduce the detection delay when $w$ is larger. Subfigures (a), (c), and (d) show that the ADD decreases as $p$ grows, further demonstrating the effectiveness of our proposed method for large-dimensional data.

Figure 5 compares the ADDs of WL-Sum, WL-Max, and their corresponding enhanced methods under Case 3. In all settings, the ADDs of the WL-Sum statistic and its enhanced versions are much larger than those of the WL-Max-type statistics, indicating that max-type statistics perform better in sparse settings. In addition, the enhancements yield only modest improvements in ADD for the WL-Max statistics. Compared with the exact CUSUM test, the ADDs of our proposed methods, especially the knockoff-enhanced ones, are comparable, demonstrating the effectiveness of our methods.

To assess the performance of our method on data generated from the Student's $t$ distribution, Figure 6 compares the ADDs of WL-Sum and its corresponding enhanced methods under Cases 1 and 2. Our method demonstrates strong and robust performance under the $t$ distribution as well.

Figure 7 compares the ADDs of WL-Sum and its corresponding enhanced methods in Case 4, where $\bm{R}_0$ is not an identity matrix. Subfigure (a) presents data generated from the normal distribution, while subfigure (b) presents data from the Student's $t$ distribution. For this more general correlation change, our method still performs well; the performance under the Student's $t$ distribution is reduced, however, due to the increased difficulty of detection.

As shown above, sum-type statistics perform well for dense correlation changes, while max-type statistics excel for sparse changes. Since the dense level is typically unknown in practice, we can utilize both statistics and calibrate them as follows:

S_{t}^{c}=\max\left(\frac{1}{b_{1}}S_{t}^{(\text{WL-sum})},\frac{1}{b_{2}}S_{t}^{(\text{WL-max})}\right),

and the stopping time is

T^{c}=\inf\left\{t:S_{t}^{c}\geq 1\right\},

where $b_1$ and $b_2$ are the estimated thresholds for the WL-Sum and WL-Max statistics, respectively, for a fixed $\gamma$. These thresholds calibrate the WL-Sum and WL-Max statistics to the same magnitude. We denote this method by WL-Sum-Max-Combined; it runs the sum-type and max-type detection procedures in parallel and stops as soon as either one raises an alarm.
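A minimal sketch of this combined stopping rule, assuming the two statistic sequences and the calibrated thresholds $b_1,b_2$ have already been computed (the function name is ours):

```python
def combined_stopping_time(S_sum, S_max, b1, b2):
    """First time t (1-indexed) at which max(S_sum[t]/b1, S_max[t]/b2) >= 1."""
    for t, (s1, s2) in enumerate(zip(S_sum, S_max), start=1):
        if max(s1 / b1, s2 / b2) >= 1.0:
            return t          # alarm: either the sum-type or the max-type test fired
    return None               # no alarm within the observed horizon
```

Because the two tests run in parallel, the thresholds must be calibrated jointly for the common target $\gamma$, which is what the normalization by $b_1$ and $b_2$ accomplishes.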

The combined statistic $S_t^c$ is compared with the WL-Sum and WL-Max approaches in terms of ADD in Figure 8. For simplicity, we do not present the enhanced methods here; their behavior should be consistent with Figure 5: the enhancement reduces the ADD of WL-Sum at lower dense levels, whereas for WL-Max it does not yield significant improvements. We fix the ARL at $10^5$, let $p=60$, and consider window sizes $w=30,40,60$. For a given positive integer $n\leq p$, let $\rho_0(i,j)=\rho_1(i,j)=1$ for $i=j$, $\rho_0(i,j)=r_0$ for $1\leq i\neq j\leq n$ and 0 otherwise, and $\rho_1(i,j)=r_1$ for $1\leq i\neq j\leq n$ and 0 otherwise. When $n=2$, only the correlation coefficient between variables 1 and 2 changes, which is extremely sparse. As $n$ increases from 2 to 20, the dense level increases from 0.0006 to 0.1073. The correlation coefficients $r_0$ and $r_1$ are selected so that the largest eigenvalue of $\bm{R}_1\bm{R}_0^{-1}$ lies between 5 and 8.

The WL-Max statistic performs well for very sparse changes, while the WL-Sum statistic performs better as the dense level increases. This finding aligns with the two-sample testing literature: the max-type statistic has higher power in sparse settings, whereas the sum-type statistic has higher power in non-sparse settings. The yellow line represents the combined method, which works consistently well in both regimes: it has a lower ADD than WL-Max in sparse cases and nearly the same ADD as WL-Sum in non-sparse cases.

Figure 8: Normal distribution. Comparison of ADDs of WL-Sum, WL-Max, and WL-Sum-Max-Combined, with the dense level varying from 0.0006 to 0.1073. $\gamma=10^5$, $p=60$.

6 El Niño Event Prediction

In this section, we test our proposed method on a real El Niño dataset. To our knowledge, this is the first time a change point detection procedure has been applied to forecasting El Niño events. Our method achieves state-of-the-art hit rates greater than 0.86 with zero false alarm rates.

El Niño episodes are part of the El Niño-Southern Oscillation (ENSO), the strongest driver of interannual climate variability, which can trigger extreme weather events and disasters in various parts of the world. Early warning signals of El Niño events would be instrumental in avoiding some of the worst damage.

To forecast El Niño events, many state-of-the-art coupled climate models, as well as a variety of statistical approaches, have been suggested; see the review by Bunde et al. (2024). Although these forecasts are quite successful at shorter lead times (say, 6 months), they have limited anticipation power at longer lead times. In particular, they generally fail to overcome the so-called "spring barrier": in spring, most methods tend to make wrong predictions for El Niño events. Some methods have been constructed to predict El Niño events more than 9 months in advance (Ludescher et al., 2013). However, it is almost impossible to forecast these events accurately more than 12 months in advance. Lenssen et al. (2024) found that ENSO is predictable at least two years in advance only when forecasts are initialized during strong El Niño events, while forecasts initialized during other states have no predictive skill beyond one year. Our method can predict El Niño events more than one year in advance most of the time.

An El Niño episode is characterized by rather irregular warm excursions from the long-term mean state. An El Niño event is said to occur when the ONI index, the 3-month running mean of the sea surface temperature anomaly averaged over the NINO3.4 region ($5^{\circ}$S-$5^{\circ}$N, $170^{\circ}$W-$120^{\circ}$W), stays above $0.5^{\circ}$C for at least five consecutive months. Ludescher et al. (2013) showed that the cross-correlations between the El Niño Basin and the surrounding sites tend to strengthen before El Niño episodes and then weaken significantly. We therefore concentrate on changes in the correlations and show that, well before an El Niño episode, the correlations tend to increase first and then decrease sharply. We use this robust observation to forecast El Niño development in advance. The sum-type test statistic is applied here.

We use weekly 1000 hPa temperature data obtained from the ERA5 database at 00:00 on the 7th, 14th, 21st, and 28th days of each month, on a $7.5^{\circ}\times 7.5^{\circ}$ grid, spanning from 1974 to 2023. We select the same 207 nodes as Ludescher et al. (2013) ($30^{\circ}$S-$30^{\circ}$N, $120^{\circ}$E-$75^{\circ}$W), containing the El Niño Basin, where the nodes are arranged as a $9\times 23$ matrix. The data dimension is thus $p=207$, and $T=2400$ is the length of the time series. The reference data are chosen from Jan 7th to Dec 28th, 1971, and from Jan 7th to Dec 28th, 1974, as no El Niño event happened during these periods. The window size is set to $w=48$ (one year). As the correlation change only occurs in a subset of the 207 nodes, we select $8\times 8=64$ nodes that are geographically close to each other as a subset, giving 32 subsets $V_1,V_2,\ldots,V_{32}$, as shown in panel (a) of Figure 9. The ST-Sum statistic (11) is calculated for each subset, and the maximum value over the 32 subsets is used as the final detection statistic. The prediction result is shown in panel (b) of Figure 9.
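Schematically, the scan over subsets can be written as follows, where `st_sum` stands for an implementation of the ST-Sum statistic (11) restricted to a subset of nodes; the surrounding names are ours.

```python
def scan_statistic(X_window, X_ref, subsets, st_sum):
    """Maximum of the subset-wise ST-Sum statistics.

    X_window: (w, p) array, current sliding window (w = 48 weeks here).
    X_ref:    (H, p) array, event-free reference data.
    subsets:  list of index arrays, e.g. the 32 blocks V_1, ..., V_32 of 64 nodes.
    st_sum:   callable implementing the ST-Sum statistic of equation (11).
    """
    return max(st_sum(X_window[:, idx], X_ref[:, idx]) for idx in subsets)
```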

The blue line represents the test statistic, and the yellow and green vertical lines mark the beginning and end of the 15 El Niño events from 1974 to 2023. The horizontal green line is the threshold. An alarm indicating the arrival of an El Niño event is triggered every time the test statistic crosses the threshold from above. The alarm counts as a correct prediction if an El Niño episode sets in within the following one or two years; otherwise, it is considered a false alarm. The correct predictions are marked with red arrows. During our detection period from 1974 to 2023, there were 15 years with an El Niño episode (i.e., 15 events) and 35 years without one (35 event-free years). Our method yields 16 signals, which correctly predict 13 El Niño events with 0 false alarms, resulting in a hit rate of $13/15=0.867$ and a false alarm rate of $0/35=0$, as shown in the first column of Table 4. It is worth mentioning that, since the statistic may contain some noise, any excursion above the threshold by a very small margin (less than 10 in this case) is treated as noise and not as a real alarm. Such instances are indicated with black dotted lines.
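The alarm rule just described (a downward crossing whose preceding excursion exceeded the threshold by at least the noise margin) can be sketched as follows; this is our own illustration, not the authors' code.

```python
import numpy as np

def downward_crossings(S, b, margin=10.0):
    """Times at which S crosses the threshold b from above, ignoring excursions
    whose peak exceeded b by less than `margin` (treated as noise)."""
    times, peak, above = [], -np.inf, False
    for t, s in enumerate(S):
        if s >= b:
            above, peak = True, max(peak, s)
        elif above:                       # statistic drops back below b
            if peak - b >= margin:
                times.append(t)           # alarm at the downward crossing
            above, peak = False, -np.inf
    return times
```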

Figure 9: (a) Illustration of the subset selection; (b) The prediction of El Niño events from 1974 to 2023.

In general, El Niño and La Niña events alternate; that is, an El Niño event is often followed by a La Niña event and vice versa. However, this is not always the case: sometimes several El Niño events occur consecutively without an intervening La Niña event, and La Niña events can likewise appear in succession. Of our 13 correct predictions, 10 signals occur during or at the end of a La Niña event, indicating that an El Niño would occur within the next one to two years. The other three signals (in 1977, 1981, and 2002) appear during or at the end of an El Niño event, indicating that La Niña would not follow and that another El Niño event would occur instead. Furthermore, for the strongest El Niño event on record, starting in winter 2014 and ending before summer 2016 with the highest peak strength ever recorded, our method gives a first signal in winter 2012, another in summer 2013, and a third just before the starting point of the event. These three signals not only separately forecast El Niño conditions in the following two years, but together also indicate an extreme event.

Ludescher et al. (2013) forecast El Niño events with a hit rate of 0.7 and a false alarm rate of 0.1 on the training data (1951-1980), and a hit rate of 0.667 and a false alarm rate of 0.048 on the testing data (1981-2011) with the threshold set at 2.82, as shown in Figure 2 of their paper. For the same test period, the comparison results are presented in the last two columns of Table 4, and the prediction results of our method are detailed in Figure 11 of the Supplementary Material. Our method outperforms that of Ludescher et al. (2013), with two more correct predictions, which increases the hit rate by 0.222. Their method failed to detect the three most recent El Niño events, dating back to 2011. In addition, their method consists of two steps: an optimal prediction algorithm is first learned from the training data and then applied to the testing data, and it can produce different results for different thresholds. In contrast, our method is easily implemented without optimization on a training dataset, the threshold is determined by the proposed algorithm, and theoretical guarantees are provided.

Table 4: Comparison of our method and Ludescher et al. (2013)'s method in terms of hit rate and false alarm rate.

|                  | Our method    | Our method  | Ludescher's  |
|------------------|---------------|-------------|--------------|
| Period           | 1974-2023     | 1981-2011   | 1981-2011    |
| Hit rate         | 0.867 (13/15) | 0.889 (8/9) | 0.667 (6/9)  |
| False alarm rate | 0 (0/35)      | 0 (0/21)    | 0.048 (1/21) |

7 Discussion and Conclusion

In this paper, we propose an online change point detection procedure for correlation structures. Both sum-type and max-type statistics are proposed, for non-sparse and sparse settings, respectively. The two types of statistics can also be combined in practice, providing more flexibility and efficiency for detection tasks. Simulation studies illustrate the performance of the combined method as the change in correlation structure varies from sparse to dense regimes. In addition, we propose combining the SMOTE and knockoff techniques to increase sample efficiency, and the enhanced detection procedures show smaller detection delays in most simulation scenarios. Furthermore, an efficient sign-flip algorithm is proposed to select the threshold for our detection statistics. Theoretical approximations are also provided for the ARL and EDD of the detection procedures. More importantly, the proposed detection procedure is applied to forecast El Niño events: our method can predict these events one to two years in advance and achieves state-of-the-art hit rates above 0.86 with zero false alarm rates. There is immense societal benefit in such high-accuracy multi-year forecasts, as many human systems make decisions on this timescale.

Given the superior performance in forecasting El Niño events, we believe that our methods can be applied to predict other important climate phenomena, especially those involving varying relationships between different regions of the Earth. The application of our proposed algorithm to forecasting La Niña events will be reported in our future work.

There are several possible extensions of our methods. A direct extension is to detect change points in the covariance structure of streaming large-dimensional data, including both general and structured covariance changes. In addition, while we constructed the detection statistics based on the $\ell_1$ and $\ell_\infty$ norms of the difference between two vectorized correlation matrices, other norms can be used to adapt to different circumstances.

References

  • Barber and Candès (2015) R. F. Barber and E. J. Candès. Controlling the false discovery rate via knockoffs. The Annals of Statistics, 43(5):2055–2085, 2015.
  • Blagus and Lusa (2013) R. Blagus and L. Lusa. SMOTE for high-dimensional class-imbalanced data. BMC Bioinformatics, 14(1):1–16, 2013.
  • Bunde et al. (2024) Armin Bunde, Josef Ludescher, and Hans Joachim Schellnhuber. Evaluation of the real-time El Niño forecasts by the climate network approach between 2011 and present. Theoretical and Applied Climatology, pages 1–10, 2024.
  • Buzun and Avanesov (2018) N. Buzun and V. Avanesov. Change-point detection in high-dimensional covariance structure. Electronic Journal of Statistics, 12(2):3254–3294, 2018.
  • Cabrieto et al. (2018) J. Cabrieto, F. Tuerlinckx, P. Kuppens, F. H. Wilhelm, M. Liedlgruber, and E. Ceulemans. Capturing correlation changes by applying kernel change point detection on the running correlations. Information Sciences, 447:117–139, 2018.
  • Cao et al. (2018) Y. Cao, Y. Xie, and N. Gebraeel. Multi-sensor slope change detection. Annals of Operations Research, 263:163–189, 2018.
  • Chawla et al. (2002) N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer. SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16:321–357, 2002.
  • Chen et al. (2015) Y. Chen, T. Banerjee, A. D. Dominguez-Garcia, and V. V. Veeravalli. Quickest line outage detection and identification. IEEE Transactions on Power Systems, 31(1):749–758, 2015.
  • Choi and Shin (2021) J. E. Choi and D. W. Shin. A self-normalization break test for correlation matrix. Statistical Papers, 62(5):2333–2353, 2021.
  • Dette et al. (2019) H. Dette, W. Wu, and Z. Zhou. Change point analysis of correlation in non-stationary time series. Statistica Sinica, 29(2):611–643, 2019.
  • Dette et al. (2022) H. Dette, G. Pan, and Q. Yang. Estimating a change point in a sequence of very high-dimensional covariance matrices. Journal of the American Statistical Association, 117(537):444–454, 2022.
  • Feng et al. (2022) L. Feng, T. Jiang, B. Liu, and W. Xiong. Max-sum tests for cross-sectional independence of high-dimensional panel data. The Annals of Statistics, 50(2):1124–1143, 2022.
  • Galeano and Wied (2017) P. Galeano and D. Wied. Dating multiple change points in the correlation matrix. Test, 26(2):331–352, 2017.
  • Killick et al. (2013) R. Killick, I. A. Eckley, and P. Jonathan. A wavelet-based approach for detecting changes in second order structure within nonstationary time series. Electronic Journal of Statistics, 7:1167–1183, 2013.
  • Lakhina et al. (2004) A. Lakhina, M. Crovella, and C. Diot. Diagnosing network-wide traffic anomalies. ACM SIGCOMM Computer Communication Review, 34(4):219–230, 2004.
  • Lenssen et al. (2024) N. Lenssen, P. DiNezio, L. Goddard, C. Deser, Y. Kushnir, S. Mason, M. Newman, and Y. Okumura. Strong El Niño events lead to robust multi-year ENSO predictability. Geophysical Research Letters, 51(12):e2023GL106988, 2024.
  • Li and Li (2023) L. Li and J. Li. Online change-point detection in high-dimensional covariance structure with application to dynamic networks. Journal of Machine Learning Research, 24(51):1–44, 2023.
  • Li et al. (2015) S. Li, Y. Xie, H. Dai, and L. Song. M-statistic for kernel change-point detection. Advances in Neural Information Processing Systems, 28, 2015.
  • Li and Gao (2024) Z. Li and J. Gao. Efficient change point detection and estimation in high-dimensional correlation matrices. Electronic Journal of Statistics, 18(1):942–979, 2024.
  • Lorden (1971) G. Lorden. Procedures for reacting to a change in distribution. Annals of Mathematical Statistics, 42(6):1897–1908, 1971.
  • Ludescher et al. (2013) J. Ludescher, A. Gozolchiani, M. I. Bogachev, A. Bunde, S. Havlin, and H. J. Schellnhuber. Improved El Niño forecasting by cooperativity detection. Proceedings of the National Academy of Sciences, 110(29):11742–11745, 2013.
  • Madrid Padilla et al. (2023) C. M. Madrid Padilla, H. Xu, D. Wang, O. H. Madrid Padilla, and Y. Yu. Change point detection and inference in multivariate non-parametric models under mixing conditions. Advances in Neural Information Processing Systems, 36:21081–21134, 2023.
  • Moustakides (1986) G.V. Moustakides. Optimal stopping times for detecting changes in distributions. The Annals of Statistics, 14(4):1379–1387, 1986.
  • Nurjahan et al. (2016) N. Nurjahan, F. Nizam, S. Chaki, S. A. Mamun, and M. S. Kaiser. Attack detection and prevention in the cyber physical system. In 2016 International Conference on Computer Communication and Informatics (ICCCI), pages 1–6. IEEE, 2016.
  • Page (1954) E. S. Page. Continuous inspection schemes. Biometrika, 41(1/2):100–115, 1954.
  • Peel and Clauset (2015) L. Peel and A. Clauset. Detecting change points in the large-scale structure of evolving networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 29, pages 2914–2920, 2015.
  • Pollak and Yakir (1998) M. Pollak and B. Yakir. A new representation for a renewal-theoretic constant appearing in asymptotic approximations of large deviations. The Annals of Applied Probability, 8(3):749–774, 1998.
  • Poor and Hadjiliadis (2008) H. V. Poor and O. Hadjiliadis. Quickest Detection. Cambridge University Press, 2008.
  • Raghavan and Veeravalli (2010) V. Raghavan and V. V. Veeravalli. Quickest change detection of a Markov process across a sensor array. IEEE Transactions on Information Theory, 56(4):1961–1981, 2010.
  • Raginsky et al. (2012) M. Raginsky, R. Willett, C. Horn, J. Silva, and R. Marcia. Sequential anomaly detection in the presence of noise and limited feedback. IEEE Transactions on Information Theory, 58(8):5544–5562, 2012.
  • Ritov (1990) Y. Ritov. Decision theoretic optimality of the CUSUM procedure. The Annals of Statistics, 18(3):1464–1469, 1990.
  • Siegmund (1985) D. Siegmund. Sequential Analysis: Tests and Confidence Intervals. Springer, 1985.
  • Siegmund and Venkatraman (1995) D. Siegmund and E. S. Venkatraman. Using the generalized likelihood ratio statistic for sequential detection of a change-point. The Annals of Statistics, 23(1):255–271, 1995.
  • Siegmund and Yakir (2000) D. Siegmund and B. Yakir. Tail probabilities for the null distribution of scanning statistics. Bernoulli, 6(2):191–213, 2000.
  • Siegmund and Yakir (2007) D. Siegmund and B. Yakir. The Statistics of Gene Mapping, volume 1. Springer, 2007.
  • Siegmund and Yakir (2008) D. Siegmund and B. Yakir. Detecting the emergence of a signal in a noisy image. Statistics and Its Interface, 1(1):3–12, 2008.
  • Siegmund et al. (2010) D. Siegmund, B. Yakir, and N. Zhang. Tail approximations for maxima of random fields by likelihood ratio transformations. Sequential Analysis, 29(3):245–262, 2010.
  • Tartakovsky et al. (2014) A. Tartakovsky, I. Nikiforov, and M. Basseville. Sequential Analysis: Hypothesis Testing and Changepoint Detection. CRC Press, 2014.
  • Veeravalli and Banerjee (2014) V. V. Veeravalli and T. Banerjee. Quickest change detection. In Academic Press Library in Signal Processing, volume 3, pages 209–255. Elsevier, 2014.
  • Wied (2017) D. Wied. A nonparametric test for a constant correlation matrix. Econometric Reviews, 36(10):1157–1172, 2017.
  • Xie et al. (2019) L. Xie, Y. Xie, and G. V. Moustakides. Asynchronous multi-sensor change-point detection for seismic tremors. In 2019 IEEE International Symposium on Information Theory (ISIT), pages 2199–2203, 2019.
  • Xie et al. (2020) L. Xie, Y. Xie, and G. V. Moustakides. Sequential subspace change point detection. Sequential Analysis, 39(3):307–335, 2020.
  • Xie et al. (2021) L. Xie, S. Zou, Y. Xie, and V. V. Veeravalli. Sequential (quickest) change detection: Classical results and new directions. IEEE Journal on Selected Areas in Information Theory, 2(2):494–514, 2021.
  • Xie and Siegmund (2013) Y. Xie and D. Siegmund. Sequential multi-sensor change-point detection. The Annals of Statistics, 41(2):670–692, 2013.

Appendix A Additional Numerical Results

A small simulation is conducted to illustrate the impact of different window sizes $w$; the results are shown in Figure 10. We let $w$ vary from 5 to 50, and CUSUM again serves as the theoretical lower bound on the detection delay. The first two rows correspond to data generated under the Case 1 setting defined in Section 5, while the last two rows correspond to Case 2. For a fixed small ARL, all window sizes exhibit low detection delays, and smaller windows are preferred for computational efficiency. However, as the ARL increases, the detection delay for small window sizes can grow quadratically, necessitating larger window sizes. Comparing the first two rows with the last two, it is evident that a larger window size is required for the more challenging detection scenario of Case 2.

Figure 10: Results for different window sizes $w$. First and second rows: Case 1. Third and fourth rows: Case 2.
Figure 11: The prediction of El Niño events from 1981 to 2011.

Appendix B Seismic Event Detection

We also apply our proposed approaches to a real seismic dataset collected at Parkfield, California, from 2 a.m. to 4 a.m. on Dec 23rd, 2004. The raw data contain records from 13 seismic sensors that simultaneously record a continuous data stream at a frequency of 250 Hz; see Figure 12(a). The goal is to detect micro-earthquakes and tremor-like signals as soon as possible; these are weak signals caused by minor subsurface changes in the Earth. A tremor signal propagates from the source to each sensor, and the affected sensors observe a similar waveform corrupted by noise. These tremor signals are helpful for geophysical studies and for predicting potential earthquakes.

To improve computational efficiency, we first down-sample the data to length $T=7198$. The first 590 observations do not contain tremor signals and are thus selected as reference data. We use a sliding window of length 200 s, i.e., at each time point we utilize the previous 200 observations to estimate the corresponding correlation matrix $\hat{\bm{R}}_t$. We use the ST-Sum test statistic in (11); its trajectory is shown in Figure 12(b). We use 50 sign-flip trials to obtain the detection threshold. In practice, we inflate the threshold chosen by the algorithm slightly to ensure a zero false alarm rate on the reference data.
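A minimal sketch of this sliding-window correlation estimate (our own illustration, not the authors' code):

```python
import numpy as np

def sliding_corr(X, t, win=200):
    """Sample correlation matrix of the `win` observations preceding time t.

    X is the (T, 13) down-sampled sensor array; this is R_hat_t in the text."""
    return np.corrcoef(X[t - win:t], rowvar=False)
```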

In Figure 13, the vertical solid line denotes the true event time and the dashed line denotes the detected change point. Our proposed method finds three main events at 604, 2093, and 6370 seconds, as shown in Figure 13(a), (b), and (d), respectively. They match well with the true event catalog (594, 2090, and 6369 seconds) obtained from the Northern California Earthquake Data Center. We also compare our results with those obtained in Xie et al. (2020) on the same dataset, where three main events are detected at 615, 2127, and 6371 seconds, corresponding to actual events at 594, 2124, and 6369 seconds. Both methods detect the same first and third events; for the second event, the one at 2090 s has magnitude 1.66 and the one at 2124 s has magnitude 1.1, and the two are very close in time. Compared to Xie et al. (2020), our method has a smaller detection delay for the first event and comparable delays for the latter two. Our method also detects the tremor-like signal at 4504 s, corresponding to an actual event at 4475 seconds with magnitude 1.58, i.e., with a time delay of 29 seconds; see Figure 13(c). The subspace method in Xie et al. (2020) fails to detect this event. It is worth noting that the change in Figure 13(c) is challenging to detect, since the data variation between 2000 and 4000 seconds is very large.

Figure 12: (a) Raw data; (b) Values of the ST-Sum test statistic.
Figure 13: Four change points detected in the seismic data. Time starts at 2 a.m. on December 23, 2004, and each plot covers a range of 150 seconds. The vertical solid black line marks a real event and the vertical dotted black line marks the detected change point.

Appendix C Proofs

Proof of Lemma 3.1.

For $\bm{v}_{1,T}(i,j)$, we have

\begin{align*}
(\hat{\rho}_{0}(i,j)-\hat{\rho}_{1:t}(i,j))^2&=\left(\frac{1}{H+1}\sum_{k=-H}^{0}y_{ki}y_{kj}-\frac{1}{t}\sum_{l=1}^{t}y_{li}y_{lj}\right)^2\\
&=\left(\frac{1}{H+1}\sum_{k=-H}^{0}y_{ki}y_{kj}\right)^2+\left(\frac{1}{t}\sum_{l=1}^{t}y_{li}y_{lj}\right)^2-\frac{2}{(H+1)t}\sum_{k=-H}^{0}\sum_{l=1}^{t}y_{ki}y_{kj}y_{li}y_{lj}\\
&=\frac{1}{(H+1)^2}\sum_{k=-H}^{0}y_{ki}^2y_{kj}^2+\frac{1}{(H+1)^2}\sum_{k_1=-H}^{0}\sum_{k_2=-H,\,k_2\neq k_1}^{0}y_{k_1i}y_{k_1j}y_{k_2i}y_{k_2j}+\frac{1}{t^2}\sum_{l=1}^{t}y_{li}^2y_{lj}^2\\
&\quad+\frac{1}{t^2}\sum_{l_1=1}^{t}\sum_{l_2=1,\,l_2\neq l_1}^{t}y_{l_1i}y_{l_1j}y_{l_2i}y_{l_2j}-\frac{2}{(H+1)t}\sum_{k=-H}^{0}\sum_{l=1}^{t}y_{ki}y_{kj}y_{li}y_{lj}.
\end{align*}

Under the assumption that ${\mathbb{E}}(x_{ki})=0$ and ${\mathbb{E}}(x_{ki}^2)=1$ for $i=1,\ldots,p$ and $k=1,2,\ldots$, we have ${\mathbb{E}}_{\infty}(x_{ki}x_{kj})=\rho_0(i,j)$ and ${\mathbb{E}}_1(x_{ki}x_{kj})={\mathbb{E}}_1(y_{ki}y_{kj})=\rho_1(i,j)$. Taking expectations term by term,

\begin{align*}
{\mathbb{E}}\big(\hat{\rho}_{0}(i,j)\big)^2&=\frac{1}{H+1}\beta_{20}+\frac{H}{H+1}\rho_{0}^{2}(i,j),\\
{\mathbb{E}}\big(\hat{\rho}_{1}(i,j)\big)^2&=\frac{1}{t}\beta_{21}+\frac{t-1}{t}\rho_{1}^{2}(i,j),\\
{\mathbb{E}}\left(\frac{2}{(H+1)t}\sum_{k=-H}^{0}\sum_{l=1}^{t}y_{ki}y_{kj}y_{li}y_{lj}\right)&=2\rho_{0}(i,j)\rho_{1}(i,j).
\end{align*}

Finally, we have

{\mathbb{E}}\left(\hat{\rho}_{0}(i,j)-\hat{\rho}_{1}(i,j)\right)^2=(\rho_{0}(i,j)-\rho_{1}(i,j))^2+\frac{1}{H+1}(\beta_{20}-\rho_{0}^{2}(i,j))+\frac{1}{t}(\beta_{21}-\rho_{1}^{2}(i,j)). ∎
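As a sanity check (ours, not part of the proof), this identity can be verified by Monte Carlo for bivariate Gaussian data, for which $\beta_2={\mathbb{E}}[(y_iy_j)^2]=1+2\rho^2$ by Isserlis' theorem:

```python
import numpy as np

rng = np.random.default_rng(0)
H, t, rho0, rho1, reps = 99, 50, 0.3, 0.6, 20_000

def products(n, rho):
    """n i.i.d. products y_i * y_j for a standard bivariate normal with correlation rho."""
    z = rng.standard_normal((n, 2))
    return z[:, 0] * (rho * z[:, 0] + np.sqrt(1 - rho ** 2) * z[:, 1])

# Monte Carlo estimate of E[(rho0_hat - rho1_hat)^2]
lhs = np.mean([(products(H + 1, rho0).mean() - products(t, rho1).mean()) ** 2
               for _ in range(reps)])

# Right-hand side of Lemma 3.1, with beta_2 = 1 + 2 rho^2 for Gaussian data
beta20, beta21 = 1 + 2 * rho0 ** 2, 1 + 2 * rho1 ** 2
rhs = ((rho0 - rho1) ** 2 + (beta20 - rho0 ** 2) / (H + 1)
       + (beta21 - rho1 ** 2) / t)
print(lhs, rhs)   # the two values agree up to Monte Carlo error
```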

Proof of Remark 3.2.

Notice that, under the assumption that $\mathrm{var}(x_{ki})=1$, we have $\bm{y}_k=\bm{x}_k-\bar{\bm{x}}_{-H:0}$ for $k=-H,\ldots,0$ and $\bm{y}_k=\bm{x}_k-\bar{\bm{x}}_{1:t}$ for $k=1,\ldots,t$. Due to the unbiasedness of the sample covariance, we have ${\mathbb{E}}_{\infty}[\hat{\rho}_0(i,j)]=\rho_0(i,j)$ and ${\mathbb{E}}_1[\hat{\rho}_{1:t}(i,j)]=\rho_1(i,j)$.

We first write

(\hat{\rho}_0(i,j)-\hat{\rho}_{1:t}(i,j))^2=(\hat{\rho}_0(i,j))^2+(\hat{\rho}_{1:t}(i,j))^2-2\hat{\rho}_0(i,j)\hat{\rho}_{1:t}(i,j).

We have ${\mathbb{E}}[\hat{\rho}_0(i,j)\hat{\rho}_{1:t}(i,j)]=\rho_0(i,j)\rho_1(i,j)$. For the term ${\mathbb{E}}[(\hat{\rho}_0(i,j))^2]$ (the term ${\mathbb{E}}_1[(\hat{\rho}_{1:t}(i,j))^2]$ can be calculated similarly), we can assume ${\mathbb{E}}(x_{ki})=0$ for simplicity under both the pre- and post-change regimes.

Note that

(\hat{\rho}_0(i,j))^2=\frac{1}{H^2}\left(\sum_{k=-H}^{0}y_{ki}y_{kj}\right)^2=\frac{1}{H^2}\sum_{k=-H}^{0}y_{ki}^2y_{kj}^2+\frac{1}{H^2}\sum_{k_1=-H}^{0}\sum_{k_2=-H,\,k_2\neq k_1}^{0}y_{k_1i}y_{k_1j}y_{k_2i}y_{k_2j}.

For the first term, we have

\begin{align*}
y_{ki}^2y_{kj}^2&=\left(x_{ki}-\frac{1}{H+1}\sum_{l=-H}^{0}x_{li}\right)^2\left(x_{kj}-\frac{1}{H+1}\sum_{l=-H}^{0}x_{lj}\right)^2\\
&=\left[x_{ki}^2-\frac{2}{H+1}x_{ki}\sum_{l=-H}^{0}x_{li}+\frac{1}{(H+1)^2}\left(\sum_{l=-H}^{0}x_{li}\right)^2\right]\\
&\quad\cdot\left[x_{kj}^2-\frac{2}{H+1}x_{kj}\sum_{l=-H}^{0}x_{lj}+\frac{1}{(H+1)^2}\left(\sum_{l=-H}^{0}x_{lj}\right)^2\right]\\
&=x_{ki}^2x_{kj}^2-\frac{2}{H+1}x_{ki}^2x_{kj}\sum_{l=-H}^{0}x_{lj}+\frac{1}{(H+1)^2}x_{ki}^2\left(\sum_{l=-H}^{0}x_{lj}\right)^2\\
&\quad-\frac{2}{H+1}x_{ki}x_{kj}^2\sum_{l=-H}^{0}x_{li}+\frac{4}{(H+1)^2}x_{ki}x_{kj}\left(\sum_{l=-H}^{0}x_{li}\right)\left(\sum_{l=-H}^{0}x_{lj}\right)\\
&\quad-\frac{2}{(H+1)^3}x_{ki}\left(\sum_{l=-H}^{0}x_{li}\right)\left(\sum_{l=-H}^{0}x_{lj}\right)^2+\frac{1}{(H+1)^2}x_{kj}^2\left(\sum_{l=-H}^{0}x_{li}\right)^2\\
&\quad-\frac{2}{(H+1)^3}x_{kj}\left(\sum_{l=-H}^{0}x_{lj}\right)\left(\sum_{l=-H}^{0}x_{li}\right)^2+\frac{1}{(H+1)^4}\left(\sum_{l=-H}^{0}x_{li}\right)^2\left(\sum_{l=-H}^{0}x_{lj}\right)^2.
\end{align*}

Note that ${\mathbb{E}}_{\infty}[x_{ki}x_{kj}]=\rho_0(i,j)$ and ${\mathbb{E}}_{\infty}[x_{ki}^2]=1$ (after assuming $x_{ki}$ is zero-mean and unit-variance). We define the following notation:

\begin{align*}
A_1&=x_{ki}^2x_{kj}^2,\qquad A_2=-\frac{2}{H+1}x_{ki}^2x_{kj}\sum_{l=-H}^{0}x_{lj},\qquad A_3=\frac{1}{(H+1)^2}x_{ki}^2\left(\sum_{l=-H}^{0}x_{lj}\right)^2,\\
A_4&=-\frac{2}{H+1}x_{ki}x_{kj}^2\sum_{l=-H}^{0}x_{li},\qquad A_5=\frac{4}{(H+1)^2}x_{ki}x_{kj}\left(\sum_{l=-H}^{0}x_{li}\right)\left(\sum_{l=-H}^{0}x_{lj}\right),\\
A_6&=-\frac{2}{(H+1)^3}x_{ki}\left(\sum_{l=-H}^{0}x_{li}\right)\left(\sum_{l=-H}^{0}x_{lj}\right)^2,\qquad A_7=\frac{1}{(H+1)^2}x_{kj}^2\left(\sum_{l=-H}^{0}x_{li}\right)^2,\\
A_8&=-\frac{2}{(H+1)^3}x_{kj}\left(\sum_{l=-H}^{0}x_{lj}\right)\left(\sum_{l=-H}^{0}x_{li}\right)^2,\qquad A_9=\frac{1}{(H+1)^4}\left(\sum_{l=-H}^{0}x_{li}\right)^2\left(\sum_{l=-H}^{0}x_{lj}\right)^2.
\end{align*}

Thus $y_{ki}^2y_{kj}^2$ decomposes as $y_{ki}^2y_{kj}^2=A_1+A_2+\cdots+A_9$, and taking the expectation of each term and summing gives ${\mathbb{E}}(y_{ki}^2y_{kj}^2)$. The expectations of the individual terms are as follows.

\begin{align*}
{\mathbb{E}}(A_1)&=\beta_{20},\qquad{\mathbb{E}}(A_2)=-\frac{2}{H+1}\beta_{20},\\
{\mathbb{E}}(A_3)&={\mathbb{E}}\left(\frac{1}{(H+1)^2}x_{ki}^2\left(\sum_{l=-H}^{0}x_{lj}^2+\sum_{l_1=-H}^{0}\sum_{l_2=-H,\,l_2\neq l_1}^{0}x_{l_1j}x_{l_2j}\right)\right)\\
&={\mathbb{E}}\left(\frac{1}{(H+1)^2}\left[x_{ki}^2x_{kj}^2+\sum_{l\neq k,\,l=-H}^{0}x_{ki}^2x_{lj}^2\right]\right)=\frac{1}{(H+1)^2}\beta_{20}+\frac{H}{(H+1)^2},\\
{\mathbb{E}}(A_4)&={\mathbb{E}}\left(-\frac{2}{H+1}x_{ki}x_{kj}^2\sum_{l=-H}^{0}x_{li}\right)={\mathbb{E}}\left(-\frac{2}{H+1}x_{ki}^2x_{kj}^2\right)=-\frac{2}{H+1}\beta_{20},\\
{\mathbb{E}}(A_5)&={\mathbb{E}}\left(\frac{4}{(H+1)^2}x_{ki}^2x_{kj}^2+\frac{4}{(H+1)^2}\sum_{l\neq k,\,l=-H}^{0}x_{ki}x_{kj}x_{li}x_{lj}\right)=\frac{4}{(H+1)^2}\beta_{20}+\frac{4H}{(H+1)^2}\rho_0^2(i,j),\\
{\mathbb{E}}(A_6)&={\mathbb{E}}\left(-\frac{2}{(H+1)^3}x_{ki}\left(\sum_{l=-H}^{0}x_{li}\right)\left(\sum_{l=-H}^{0}x_{lj}\right)^2\right)\\
&={\mathbb{E}}\left(-\frac{2}{(H+1)^3}\left[x_{ki}^2x_{kj}^2+x_{ki}^2\sum_{l\neq k,\,l=-H}^{0}x_{lj}^2+\sum_{l\neq k,\,l=-H}^{0}x_{ki}x_{kj}x_{li}x_{lj}\right]\right)\\
&=-\frac{2}{(H+1)^3}\beta_{20}-\frac{2H}{(H+1)^3}-\frac{2H}{(H+1)^3}\rho_0^2(i,j),\\
{\mathbb{E}}(A_7)&={\mathbb{E}}\left(\frac{1}{(H+1)^2}\left[x_{kj}^2x_{ki}^2+\sum_{l\neq k,\,l=-H}^{0}x_{kj}^2x_{li}^2\right]\right)=\frac{1}{(H+1)^2}\beta_{20}+\frac{H}{(H+1)^2},\\
{\mathbb{E}}(A_8)&=-\frac{2}{(H+1)^3}\beta_{20}-\frac{2H}{(H+1)^3}-\frac{2H}{(H+1)^3}\rho_0^2(i,j),\\
{\mathbb{E}}(A_9)&=\frac{1}{(H+1)^3}\beta_{20}+\frac{H}{(H+1)^3}+\frac{H}{(H+1)^3}\rho_0^2(i,j).
\end{align*}

Then the expectation of $y_{ki}^2y_{kj}^2$ is

{\mathbb{E}}\left(y_{ki}^2y_{kj}^2\right)=\frac{H(1-H+H^2)}{(H+1)^3}\beta_{20}+\frac{2H^2-H}{(H+1)^3}+\frac{4H^2+H}{(H+1)^3}\rho_0^2(i,j).

Similarly, for the second term $y_{k_1i}y_{k_1j}y_{k_2i}y_{k_2j}$ with $k_1\neq k_2$, we decompose it into a sum of separate terms as follows.

\begin{align*}
y_{k_1i}y_{k_1j}y_{k_2i}y_{k_2j}&=\left(x_{k_1i}-\frac{1}{H+1}\sum_{k=-H}^{0}x_{ki}\right)\left(x_{k_1j}-\frac{1}{H+1}\sum_{k=-H}^{0}x_{kj}\right)\\
&\quad\cdot\left(x_{k_2i}-\frac{1}{H+1}\sum_{k=-H}^{0}x_{ki}\right)\left(x_{k_2j}-\frac{1}{H+1}\sum_{k=-H}^{0}x_{kj}\right)\\
&=\left[x_{k_1i}x_{k_1j}-\frac{1}{H+1}x_{k_1i}\sum_{k=-H}^{0}x_{kj}-\frac{1}{H+1}x_{k_1j}\sum_{k=-H}^{0}x_{ki}+\frac{1}{(H+1)^2}\left(\sum_{k=-H}^{0}x_{ki}\right)\left(\sum_{k=-H}^{0}x_{kj}\right)\right]\\
&\quad\cdot\left[x_{k_2i}x_{k_2j}-\frac{1}{H+1}x_{k_2i}\sum_{k=-H}^{0}x_{kj}-\frac{1}{H+1}x_{k_2j}\sum_{k=-H}^{0}x_{ki}+\frac{1}{(H+1)^2}\left(\sum_{k=-H}^{0}x_{ki}\right)\left(\sum_{k=-H}^{0}x_{kj}\right)\right]\\
&=x_{k_1i}x_{k_1j}x_{k_2i}x_{k_2j}-\frac{1}{H+1}x_{k_1i}x_{k_1j}x_{k_2i}\left(\sum_{k=-H}^{0}x_{kj}\right)-\frac{1}{H+1}x_{k_1i}x_{k_1j}x_{k_2j}\left(\sum_{k=-H}^{0}x_{ki}\right)\\
&\quad+\frac{1}{(H+1)^2}x_{k_1i}x_{k_1j}\left(\sum_{k=-H}^{0}x_{ki}\right)\left(\sum_{k=-H}^{0}x_{kj}\right)\\
&\quad-\frac{1}{H+1}x_{k_1i}\left(\sum_{k=-H}^{0}x_{kj}\right)x_{k_2i}x_{k_2j}+\frac{1}{(H+1)^2}x_{k_1i}\left(\sum_{k=-H}^{0}x_{kj}\right)x_{k_2i}\left(\sum_{k=-H}^{0}x_{kj}\right)\\
&\quad+\frac{1}{(H+1)^2}x_{k_1i}\left(\sum_{k=-H}^{0}x_{kj}\right)x_{k_2j}\left(\sum_{k=-H}^{0}x_{ki}\right)-\frac{1}{(H+1)^3}x_{k_1i}\left(\sum_{k=-H}^{0}x_{kj}\right)\left(\sum_{k=-H}^{0}x_{ki}\right)\left(\sum_{k=-H}^{0}x_{kj}\right)\\
&\quad-\frac{1}{H+1}x_{k_1j}\left(\sum_{k=-H}^{0}x_{ki}\right)x_{k_2i}x_{k_2j}+\frac{1}{(H+1)^2}x_{k_1j}\left(\sum_{k=-H}^{0}x_{ki}\right)x_{k_2i}\left(\sum_{k=-H}^{0}x_{kj}\right)\\
&\quad+\frac{1}{(H+1)^2}x_{k_1j}\left(\sum_{k=-H}^{0}x_{ki}\right)x_{k_2j}\left(\sum_{k=-H}^{0}x_{ki}\right)-\frac{1}{(H+1)^3}x_{k_1j}\left(\sum_{k=-H}^{0}x_{ki}\right)\left(\sum_{k=-H}^{0}x_{ki}\right)\left(\sum_{k=-H}^{0}x_{kj}\right)\\
&\quad+\frac{1}{(H+1)^2}\left(\sum_{k=-H}^{0}x_{ki}\right)\left(\sum_{k=-H}^{0}x_{kj}\right)x_{k_2i}x_{k_2j}-\frac{1}{(H+1)^3}\left(\sum_{k=-H}^{0}x_{ki}\right)\left(\sum_{k=-H}^{0}x_{kj}\right)x_{k_2i}\left(\sum_{k=-H}^{0}x_{kj}\right)\\
&\quad-\frac{1}{(H+1)^3}\left(\sum_{k=-H}^{0}x_{ki}\right)\left(\sum_{k=-H}^{0}x_{kj}\right)x_{k_2j}\left(\sum_{k=-H}^{0}x_{ki}\right)+\frac{1}{(H+1)^4}\left(\sum_{k=-H}^{0}x_{ki}\right)^2\left(\sum_{k=-H}^{0}x_{kj}\right)^2.
\end{align*}

The expectation of $y_{k_1i}y_{k_1j}y_{k_2i}y_{k_2j}$ can be calculated in a similar way:

{\mathbb{E}}\left(y_{k_1i}y_{k_1j}y_{k_2i}y_{k_2j}\right)=\frac{1-2H}{(H+1)^3}+\frac{3H}{(H+1)^3}\beta_{20}+\frac{3+H^2+H^3}{(H+1)^3}\rho_0^2(i,j).

Based on the expectations of $y_{ki}^2y_{kj}^2$ and $y_{k_1i}y_{k_1j}y_{k_2i}y_{k_2j}$, we have

\begin{align*}
{\mathbb{E}}\left(\left(\hat{\rho}_0(i,j)\right)^2\right)&=\frac{H+1}{H^2}\,{\mathbb{E}}\left(y_{ki}^2y_{kj}^2\right)+\frac{(H+1)H}{H^2}\,{\mathbb{E}}\left(y_{k_1i}y_{k_1j}y_{k_2i}y_{k_2j}\right)\\
&=\frac{H+1}{H^2}\left[\frac{H(1-H+H^2)}{(H+1)^3}\beta_{20}+\frac{2H^2-H}{(H+1)^3}+\frac{4H^2+H}{(H+1)^3}\rho_0^2(i,j)\right]\\
&\quad+\frac{H+1}{H}\left[\frac{1-2H}{(H+1)^3}+\frac{3H}{(H+1)^3}\beta_{20}+\frac{3+H^2+H^3}{(H+1)^3}\rho_0^2(i,j)\right]\\
&=\frac{H^2+2H+1}{H(H+1)^2}\beta_{20}+0+\frac{H^3+H^2+4H+4}{H(H+1)^2}\rho_0^2(i,j)\\
&=\frac{1}{H}\beta_{20}+\frac{H^3+H^2+4H+4}{H(H+1)^2}\rho_0^2(i,j).
\end{align*}

Similarly, we can obtain,

{\mathbb{E}}\left(\left(\hat{\rho}_{1:t}(i,j)\right)^2\right)=\frac{1}{t-1}\beta_{21}+\frac{t^2-2t+5}{(t-1)t}\rho_1^2(i,j).

Finally, we have,

\begin{align*}
&{\mathbb{E}}\left((\hat{\rho}_0(i,j)-\hat{\rho}_{1:t}(i,j))^2\right)\\
&\quad=\frac{1}{H}\beta_{20}+\frac{1}{t-1}\beta_{21}+(\rho_0(i,j)-\rho_1(i,j))^2+\frac{5-t}{t(t-1)}\rho_1^2(i,j)+\frac{3H+4-H^2}{H(H+1)^2}\rho_0^2(i,j). \qquad ∎
\end{align*}

Proof of Lemma 3.7.

Given a vector $\bm{x}_t=(x_{t1},\ldots,x_{tp})^{\top}\in\mathbb{R}^{p\times 1}$ and two independent Rademacher vectors $\bm{\mathcal{R}}_1:=(r_1^{(1)},\ldots,r_p^{(1)})^{\top}$ and $\bm{\mathcal{R}}_2:=(r_1^{(2)},\ldots,r_p^{(2)})^{\top}$ with i.i.d. Rademacher entries, $\bm{x}_t\circ\bm{\mathcal{R}}_1$ and $\bm{x}_t\circ\bm{\mathcal{R}}_2$ are sub-Gaussian random vectors, and $\mathbb{E}(r_i^{(1)})=\mathbb{E}(r_j^{(2)})=0$ by the property of Rademacher entries. Let $\tilde{\bm{x}}_t=((\bm{x}_t\circ\bm{\mathcal{R}}_1)^{\top},(\bm{x}_t\circ\bm{\mathcal{R}}_2)^{\top})^{\top}=(x_{t1}r_1^{(1)},\ldots,x_{tp}r_p^{(1)},x_{t1}r_1^{(2)},\ldots,x_{tp}r_p^{(2)})^{\top}\in\mathbb{R}^{2p\times 1}$; then, for $1\leq i,j\leq p$, the corresponding cross-covariance entry of $\tilde{\bm{x}}_t$ is

\operatorname{Cov}(x_{ti}r_i^{(1)},x_{tj}r_j^{(2)})=\mathbb{E}(x_{ti}r_i^{(1)}x_{tj}r_j^{(2)})-\mathbb{E}(x_{ti}r_i^{(1)})\mathbb{E}(x_{tj}r_j^{(2)})=\mathbb{E}(r_i^{(1)})\mathbb{E}(r_j^{(2)})\operatorname{Cov}(x_{ti},x_{tj})=0.

Then $\bm{x}_t\circ\bm{\mathcal{R}}_1$ and $\bm{x}_t\circ\bm{\mathcal{R}}_2$ are asymptotically independent. ∎
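The lemma is easy to check empirically; a small sketch (ours) showing that the empirical cross-covariances of the two sign-flipped copies vanish even when the original coordinates are strongly correlated:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, rho = 100_000, 5, 0.8

R = np.full((p, p), rho)
np.fill_diagonal(R, 1.0)
X = rng.standard_normal((n, p)) @ np.linalg.cholesky(R).T

F1 = rng.choice([-1.0, 1.0], size=(n, p))   # independent Rademacher sign flips
F2 = rng.choice([-1.0, 1.0], size=(n, p))
A, B = X * F1, X * F2                        # the two sign-flipped copies

cross = A.T @ B / n                          # empirical E[A_i B_j]; means are ~0
print(np.abs(cross).max())                   # O(1/sqrt(n)), i.e. close to 0
```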

Proof of Lemma 4.1.

In this lemma, we calculate the temporal correlation of $S_{t+s}$ and $S_t$, where $S_t$ can represent either the sum-type or the max-type test statistic. We may treat both uniformly, since each is, in essence, a combination of element-wise entries, with each entry being $C\times(\hat{\rho}_0(i,j)-\hat{\rho}_{t':t}(i,j))^2$ for a constant $C$: $C$ equals 1 for max-type statistics and is at most $p(p-1)/2$ for sum-type statistics, depending on the number of changed entries. Moreover, for the window-limited variants, the weight $\frac{(t-t')H}{H+t-t'}$ can also be treated as a constant, which does not influence the correlation value. The correlation between $S_{t+s}$ and $S_t$ is therefore essentially the correlation between $\bm{v}_{t-w,t}(i,j)$ and $\bm{v}_{t+s-w,t+s}(i,j)$, whose concrete form is

\operatorname{Corr}\left(S_{t+s},S_t\right)=\operatorname{Corr}\left(\left(\hat{\rho}_0(i,j)-\hat{\rho}_{t-w:t}(i,j)\right)^2,\left(\hat{\rho}_0(i,j)-\hat{\rho}_{t+s-w:t+s}(i,j)\right)^2\right).

We then calculate the covariance and variance terms separately. For the covariance term, we have

\begin{align*}
&\operatorname{Cov}\left(\left(\hat{\rho}_0(i,j)-\hat{\rho}_{t-w:t}(i,j)\right)^2,\left(\hat{\rho}_0(i,j)-\hat{\rho}_{t+s-w:t+s}(i,j)\right)^2\right)\\
&\quad=\operatorname{Cov}\left(\left(\frac{1}{w}\sum_{k=-w}^{0}y_{ki}y_{kj}-\frac{1}{w}\sum_{k=t+s-w}^{t}y_{ki}y_{kj}-\frac{1}{w}\sum_{k=t-w}^{t+s-w-1}y_{ki}y_{kj}\right)^2,\right.\\
&\qquad\qquad\left.\left(\frac{1}{w}\sum_{k=-w}^{0}y_{ki}y_{kj}-\frac{1}{w}\sum_{k=t+s-w}^{t}y_{ki}y_{kj}-\frac{1}{w}\sum_{k=t+1}^{t+s}y_{ki}y_{kj}\right)^2\right).
\end{align*}

For simplicity, we denote $A=\frac{1}{w}\sum_{k=-w}^{0}y_{ki}y_{kj}-\frac{1}{w}\sum_{k=t+s-w}^{t}y_{ki}y_{kj}$, $B=\frac{1}{w}\sum_{k=t-w}^{t+s-w-1}y_{ki}y_{kj}$, and $C=\frac{1}{w}\sum_{k=t+1}^{t+s}y_{ki}y_{kj}$; then

\begin{align*}
&\operatorname{Cov}\left(\left(\hat{\rho}_0(i,j)-\hat{\rho}_{t-w:t}(i,j)\right)^2,\left(\hat{\rho}_0(i,j)-\hat{\rho}_{t+s-w:t+s}(i,j)\right)^2\right)=\operatorname{Cov}\left((A-B)^2,(A-C)^2\right)\\
&\quad=\operatorname{Var}\left(A^2\right)-2{\mathbb{E}}C\left({\mathbb{E}}A^3-{\mathbb{E}}A^2{\mathbb{E}}A\right)-2{\mathbb{E}}B\left({\mathbb{E}}A^3-{\mathbb{E}}A^2{\mathbb{E}}A\right)+4{\mathbb{E}}B\,{\mathbb{E}}C\operatorname{Var}(A).
\end{align*}

For the variance term, since $\operatorname{Var}\left(\left(\hat{\rho}_0(i,j)-\hat{\rho}_{t-w:t}(i,j)\right)^2\right)=\operatorname{Var}\left(\left(\hat{\rho}_0(i,j)-\hat{\rho}_{t+s-w:t+s}(i,j)\right)^2\right)$ by the independence assumption, we only calculate the first one:

\begin{align*}
\operatorname{Var}\left(\left(\hat{\rho}_0(i,j)-\hat{\rho}_{t-w:t}(i,j)\right)^2\right)&=\operatorname{Var}\left((A-B)^2\right)\\
&=\operatorname{Var}\left(A^2\right)+4{\mathbb{E}}A^2{\mathbb{E}}B^2+\operatorname{Var}\left(B^2\right)-4{\mathbb{E}}B\left({\mathbb{E}}A^3-{\mathbb{E}}A^2{\mathbb{E}}A\right)\\
&\quad-4{\mathbb{E}}A\left({\mathbb{E}}B^3-{\mathbb{E}}B^2{\mathbb{E}}B\right)-4({\mathbb{E}}A)^2({\mathbb{E}}B)^2\\
&=\operatorname{Var}\left(A^2\right)+\operatorname{Var}\left(B^2\right)-4{\mathbb{E}}B\left({\mathbb{E}}A^3-{\mathbb{E}}A^2{\mathbb{E}}A\right)-4{\mathbb{E}}A\left({\mathbb{E}}B^3-{\mathbb{E}}B^2{\mathbb{E}}B\right)+4\operatorname{Var}(AB).
\end{align*}

Let ${\mathbb{E}}_{\infty}(y_{ki}y_{kj})=\rho_0(i,j)$, ${\mathbb{E}}_{\infty}(y_{ki}y_{kj})^2=\beta_{20}$, ${\mathbb{E}}_{\infty}(y_{ki}y_{kj})^3=\beta_{30}$, and ${\mathbb{E}}_{\infty}(y_{ki}y_{kj})^4=\beta_{40}$; then we take the expectation of each term as follows.

\begin{align*}
{\mathbb{E}}(A)&=\rho_0(i,j)-\frac{w-s}{w}\rho_0(i,j)=\rho_0(i,j)\frac{s}{w},\\
{\mathbb{E}}(B)&=\rho_0(i,j)\frac{s}{w},\qquad{\mathbb{E}}(C)=\rho_0(i,j)\frac{s}{w},\\
{\mathbb{E}}(B^2)&=\left(\beta_{20}-\rho_0^2(i,j)\right)\frac{s}{w^2}+\rho_0^2(i,j)\frac{s^2}{w^2},\\
{\mathbb{E}}(B^3)&=\left(\beta_{30}-3\beta_{20}\rho_0(i,j)+2\rho_0^3(i,j)\right)\frac{s}{w^3}+3\left(\beta_{20}\rho_0(i,j)-\rho_0^3(i,j)\right)\frac{s^2}{w^3}+\rho_0^3(i,j)\frac{s^3}{w^3},\\
{\mathbb{E}}(B^4)&=\left(\beta_{40}-4\beta_{30}\rho_0(i,j)-3\beta_{20}^2-6\rho_0^4(i,j)+12\beta_{20}\rho_0^2(i,j)\right)\frac{s}{w^4}\\
&\quad+\left(4\beta_{30}\rho_0(i,j)+3\beta_{20}^2-18\beta_{20}\rho_0^2(i,j)+11\rho_0^4(i,j)\right)\frac{s^2}{w^4}\\
&\quad+\left(6\beta_{20}\rho_0^2(i,j)-6\rho_0^4(i,j)\right)\frac{s^3}{w^4}+\rho_0^4(i,j)\frac{s^4}{w^4},\\
{\mathbb{E}}(A^2)&=\left(2\beta_{20}-2\rho_0^2(i,j)\right)\frac{1}{w}+\left(\rho_0^2(i,j)-\beta_{20}\right)\frac{s}{w^2}+\rho_0^2(i,j)\frac{s^2}{w^2}+\left(2\beta_{20}-2\rho_0^2(i,j)\right)\frac{1}{w^2},\\
{\mathbb{E}}(A^3)&=\left(6\beta_{20}\rho_0(i,j)-6\rho_0^3(i,j)\right)\frac{s}{w^2}+\left(\beta_{30}+3\beta_{20}\rho_0(i,j)-4\rho_0^3(i,j)\right)\frac{s}{w^3}\\
&\quad+\left(3\rho_0^3(i,j)-3\beta_{20}\rho_0(i,j)\right)\frac{s^2}{w^3}+\rho_0^3(i,j)\frac{s^3}{w^3},\\
{\mathbb{E}}(A^4)&=\left(-24\beta_{20}\rho_0^2(i,j)+12\rho_0^4(i,j)+12\beta_{20}^2\right)\frac{1}{w^2}+\left(24\beta_{20}\rho_0^2(i,j)-12\rho_0^4(i,j)-12\beta_{20}^2\right)\frac{s}{w^3}\\
&\quad+\left(12\beta_{20}\rho_0^2(i,j)-12\rho_0^4(i,j)\right)\frac{s^2}{w^3}+\left(2\beta_{40}-8\beta_{30}\rho_0(i,j)+18\beta_{20}^2-24\beta_{20}\rho_0^2(i,j)+12\rho_0^4(i,j)\right)\frac{1}{w^3}\\
&\quad+\left(-\beta_{40}+4\beta_{30}\rho_0(i,j)-9\beta_{20}^2+12\beta_{20}\rho_0^2(i,j)-6\rho_0^4(i,j)\right)\frac{s}{w^4}\\
&\quad+\left(4\beta_{30}\rho_0(i,j)+3\beta_{20}^2-6\beta_{20}\rho_0^2(i,j)-\rho_0^4(i,j)\right)\frac{s^2}{w^4}\\
&\quad+\left(-6\beta_{20}\rho_0^2(i,j)+6\rho_0^4(i,j)\right)\frac{s^3}{w^4}+\rho_0^4(i,j)\frac{s^4}{w^4}+\left(2\beta_{40}-8\beta_{30}\rho_0(i,j)+6\beta_{20}^2\right)\frac{1}{w^4},\\
\operatorname{Var}(A)&=\left(2\beta_{20}-2\rho_0^2(i,j)\right)\frac{1}{w}+\left(\rho_0^2(i,j)-\beta_{20}\right)\frac{s}{w^2}+\left(2\beta_{20}-2\rho_0^2(i,j)\right)\frac{1}{w^2},\\
\operatorname{Var}(A^2)&=8\left(\beta_{20}-\rho_0^2(i,j)\right)^2\frac{1}{w^2}-8\left(\beta_{20}-\rho_0^2(i,j)\right)^2\frac{s}{w^3}\\
&\quad+\left(8\beta_{20}\rho_0^2(i,j)-8\rho_0^4(i,j)\right)\frac{s^2}{w^3}+\left(2\beta_{40}-8\beta_{30}\rho_0(i,j)+10\beta_{20}^2-8\beta_{20}\rho_0^2(i,j)+4\rho_0^4(i,j)\right)\frac{1}{w^3}\\
&\quad+\left(-\beta_{40}+4\beta_{30}\rho_0(i,j)-5\beta_{20}^2+4\beta_{20}\rho_0^2(i,j)-2\rho_0^4(i,j)\right)\frac{s}{w^4}\\
&\quad+\left(4\beta_{30}\rho_0(i,j)+2\beta_{20}^2-8\beta_{20}\rho_0^2(i,j)+2\rho_0^4(i,j)\right)\frac{s^2}{w^4}\\
&\quad+\left(-4\beta_{20}\rho_0^2(i,j)+4\rho_0^4(i,j)\right)\frac{s^3}{w^4}+\left(2\beta_{40}+2\beta_{20}^2-8\beta_{30}\rho_0(i,j)+8\beta_{20}\rho_0^2(i,j)-4\rho_0^4(i,j)\right)\frac{1}{w^4},\\
\operatorname{Var}(B^2)&=\left(\beta_{40}-4\beta_{30}\rho_0(i,j)-3\beta_{20}^2-6\rho_0^4(i,j)+12\beta_{20}\rho_0^2(i,j)\right)\frac{s}{w^4}\\
&\quad+\left(4\beta_{30}\rho_0(i,j)-16\beta_{20}\rho_0^2(i,j)+10\rho_0^4(i,j)+2\beta_{20}^2\right)\frac{s^2}{w^4}+\left(4\beta_{20}\rho_0^2(i,j)-4\rho_0^4(i,j)\right)\frac{s^3}{w^4},\\
\operatorname{Var}(AB)&=2\left(\beta_{20}-\rho_0^2(i,j)\right)^2\frac{s}{w^3}+2\left(\beta_{20}\rho_0^2(i,j)-\rho_0^4(i,j)\right)\frac{s^2}{w^3}\\
&\quad+\left(-\beta_{20}^2-3\rho_0^4(i,j)+4\beta_{20}\rho_0^2(i,j)\right)\frac{s^2}{w^4}+2\left(\beta_{20}-\rho_0^2(i,j)\right)^2\frac{s}{w^4}.
\end{align*}

Since the correlation is represented as follows,

Corr(St+s,St)\displaystyle\operatorname{Corr}\left(S_{t+s},S_{t}\right)
=1+4𝔼B𝔼CVar(A)Var(B2)+4𝔼A(𝔼B3𝔼B𝔼B2)4Var(AB)Var(A2)+Var(B2)4𝔼B(𝔼A3𝔼A2𝔼A)4𝔼A(𝔼B3𝔼B2𝔼B)+4Var(AB),\displaystyle=1+\frac{4{\mathbb{E}}B{\mathbb{E}}C\operatorname{Var}(A)-\operatorname{Var}\left(B^{2}\right)+4{\mathbb{E}}A\left({\mathbb{E}}B^{3}-{\mathbb{E}}B{\mathbb{E}}B^{2}\right)-4\operatorname{Var}(AB)}{\operatorname{Var}\left(A^{2}\right)+\operatorname{Var}\left(B^{2}\right)-4{\mathbb{E}}B\left({\mathbb{E}}A^{3}-{\mathbb{E}}A^{2}{\mathbb{E}}A\right)-4{\mathbb{E}}A\left({\mathbb{E}}B^{3}-{\mathbb{E}}B^{2}{\mathbb{E}}B\right)+4\operatorname{Var}(AB)},

the numerator equals

4𝔼B𝔼CVar(A)Var(B2)+4𝔼A(𝔼B3𝔼B𝔼B2)4Var(AB)\displaystyle 4{\mathbb{E}}B{\mathbb{E}}C\operatorname{Var}(A)-\operatorname{Var}\left(B^{2}\right)+4{\mathbb{E}}A\left({\mathbb{E}}B^{3}-{\mathbb{E}}B{\mathbb{E}}B^{2}\right)-4\operatorname{Var}(AB)
=(β40+4β30ρ0(i,j)5β202+4β20ρ02(i,j)2ρ04(i,j))sw48(β20ρ02(i,j))2sw3\displaystyle\quad=\left(-\beta_{40}+4\beta_{30}\rho_{0}(i,j)-5\beta_{20}^{2}+4\beta_{20}\rho_{0}^{2}(i,j)-2\rho_{0}^{4}(i,j)\right)\frac{s}{w^{4}}-8\left(\beta_{20}-\rho_{0}^{2}(i,j)\right)^{2}\frac{s}{w^{3}}
+(2β2024ρ04(i,j)+5β20ρ02(i,j))s2w4,\displaystyle\quad+\left(2\beta_{20}^{2}-4\rho_{0}^{4}(i,j)+5\beta_{20}\rho_{0}^{2}(i,j)\right)\frac{s^{2}}{w^{4}},

and the denominator takes value

Var(A2)+Var(B2)4𝔼B(𝔼A3𝔼A2𝔼A)4𝔼A(𝔼B3𝔼B2𝔼B)+4Var(AB)\displaystyle\operatorname{Var}\left(A^{2}\right)+\operatorname{Var}\left(B^{2}\right)-4{\mathbb{E}}B\left({\mathbb{E}}A^{3}-{\mathbb{E}}A^{2}{\mathbb{E}}A\right)-4{\mathbb{E}}A\left({\mathbb{E}}B^{3}-{\mathbb{E}}B^{2}{\mathbb{E}}B\right)+4\operatorname{Var}(AB)
=8(β20ρ02(i,j))21w2+(2β408β30ρ0(i,j)+10β2028β20ρ02(i,j)+4ρ04(i,j))1w3\displaystyle\quad=8\left(\beta_{20}-\rho_{0}^{2}(i,j)\right)^{2}\frac{1}{w^{2}}+\left(2\beta_{40}-8\beta_{30}\rho_{0}(i,j)+10\beta_{20}^{2}-8\beta_{20}\rho_{0}^{2}(i,j)+4\rho_{0}^{4}(i,j)\right)\frac{1}{w^{3}}
+(12β20ρ02(i,j)+12ρ04(i,j))s2w4\displaystyle\quad+(-12\beta_{20}\rho_{0}^{2}(i,j)+12\rho_{0}^{4}(i,j))\frac{s^{2}}{w^{4}}
+(2β40+2β2028β30ρ0(i,j)+8β20ρ02(i,j)4ρ04(i,j))1w4\quad+\left(2\beta_{40}+2\beta_{20}^{2}-8\beta_{30}\rho_{0}(i,j)+8\beta_{20}\rho_{0}^{2}(i,j)-4\rho_{0}^{4}(i,j)\right)\frac{1}{w^{4}}
+(8β2028ρ04(i,j)+16β20ρ02(i,j))sw4.\displaystyle\quad+\left(-8\beta_{20}^{2}-8\rho_{0}^{4}(i,j)+16\beta_{20}\rho_{0}^{2}(i,j)\right)\frac{s}{w^{4}}.

Combining the expansions of the numerator and the denominator yields

Corr=1sw+o(sw),Corr=1-\frac{s}{w}+o(\frac{s}{w}),

which completes the proof. ∎
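As a numerical sanity check of this expansion (our own illustration, not part of the proof), the following minimal Python sketch estimates the lag-s correlation of the squared window differences by Monte Carlo, with i.i.d. standard normal draws standing in for the products y_{ki}y_{kj}, and compares it with the first-order value 1 - s/w:

import numpy as np

rng = np.random.default_rng(0)
w, s, reps = 200, 20, 20000              # window length, lag, Monte Carlo replications
v_t = np.empty(reps)
v_ts = np.empty(reps)
for r in range(reps):
    ref = rng.standard_normal(w)         # reference block (stand-in for y_{ki} y_{kj})
    stream = rng.standard_normal(w + s)  # stream long enough to cover both sliding windows
    a_t = ref.mean() - stream[:w].mean()   # difference of window means at time t
    a_ts = ref.mean() - stream[s:].mean()  # same quantity s steps later (overlap w - s)
    v_t[r] = a_t ** 2
    v_ts[r] = a_ts ** 2
print(np.corrcoef(v_t, v_ts)[0, 1], 1 - s / w)  # empirical correlation vs. 1 - s/w

With w = 200 and s = 20, the empirical correlation is close to 0.90, consistent with 1 - s/w.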

Proof of Theorem 4.2.

The proof is based on a general method for computing first passage probabilities first introduced in Pollak and Yakir (1998), further developed in Siegmund and Yakir (2000) and Siegmund et al. (2010), and commonly used in similar problems (Xie and Siegmund, 2013; Li et al., 2015; Cao et al., 2018). Throughout the proof, the probability measure refers to the nominal case in which all samples are drawn from the same pre-change distribution. For the test statistic defined in this paper

St(WL-sum)=maxtwtt1(tt)HH+tt𝒗t,t1,S_{t}^{(\text{WL-sum})}=\max_{t-w\leqslant t^{\prime}\leqslant t-1}\frac{\left(t-t^{\prime}\right)H}{H+t-t^{\prime}}\left\|\bm{v}_{t^{\prime},t}\right\|_{1},

suppose H=wH=w and that (tt)HH+tt𝒗t,t1\frac{\left(t-t^{\prime}\right)H}{H+t-t^{\prime}}\left\|\bm{v}_{t^{\prime},t}\right\|_{1} attains its maximum at t=twt^{\prime}=t-w (as shown in Table 2, this is the most common case); then

St(WL-sum)=w2p(p1)2(ρ^0(i,j)ρ^tw:t(i,j))2\displaystyle S_{t}^{(\text{WL-sum})}=\frac{w}{2}\cdot\frac{p(p-1)}{2}\cdot\left(\hat{\rho}_{0}(i,j)-\hat{\rho}_{t-w:t}(i,j)\right)^{2}
St(WL-max)=w2(ρ^0(i,j)ρ^tw:t(i,j))2,\displaystyle S_{t}^{(\text{WL-max})}=\frac{w}{2}\cdot\left(\hat{\rho}_{0}(i,j)-\hat{\rho}_{t-w:t}(i,j)\right)^{2},
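For concreteness, these simplified statistics can be computed directly from data. Below is a minimal Python sketch (the helper name wl_stats is ours, and we interpret the p(p-1)/2 factor as a sum over all pairs i < j of squared correlation differences):

import numpy as np

def wl_stats(ref, window):
    # ref: (n0, p) reference block used to estimate rho_0
    # window: (w, p) most recent window; simplification H = w, t' = t - w
    w = window.shape[0]
    rho0 = np.corrcoef(ref, rowvar=False)      # reference correlation estimate
    rho_t = np.corrcoef(window, rowvar=False)  # current-window estimate
    iu = np.triu_indices_from(rho0, k=1)       # all pairs (i, j) with i < j
    d2 = (rho0[iu] - rho_t[iu]) ** 2           # squared correlation differences
    return 0.5 * w * d2.sum(), 0.5 * w * d2.max()  # (sum-type, max-type)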

Let Vtw=𝒗tw,t(i,j)=(ρ^0(i,j)ρ^tw:t(i,j))2=(1wk=w0ykiykj1wk=twtykiykj)2.V_{t}^{w}=\bm{v}_{t-w,t}(i,j)=\left(\hat{\rho}_{0}(i,j)-\hat{\rho}_{t-w:t}(i,j)\right)^{2}=\left(\frac{1}{w}\sum_{k=-w}^{0}y_{ki}y_{kj}-\frac{1}{w}\sum_{k=t-w}^{t}y_{ki}y_{kj}\right)^{2}. From the preceding calculations, we have

μ:=𝔼(Vtw)=2w(β20ρ02(i,j)),\displaystyle\mu:={\mathbb{E}}\left(V_{t}^{w}\right)=\frac{2}{w}\left(\beta_{20}-\rho_{0}^{2}(i,j)\right),
σd2:=Var(Vtw)=8(β20ρ02(i,j))21w2\displaystyle\sigma_{d}^{2}:=\operatorname{Var}\left(V_{t}^{w}\right)=8\left(\beta_{20}-\rho_{0}^{2}(i,j)\right)^{2}\frac{1}{w^{2}}
+(2β408β30ρ0(i,j)+10β2028β20ρ02(i,j)+4ρ04(i,j))1w3\displaystyle\quad+\left(2\beta_{40}-8\beta_{30}\rho_{0}(i,j)+10\beta_{20}^{2}-8\beta_{20}\rho_{0}^{2}(i,j)+4\rho_{0}^{4}(i,j)\right)\frac{1}{w^{3}}
+(12β20ρ02(i,j)+12ρ04(i,j))s2w4\displaystyle\quad+(-12\beta_{20}\rho_{0}^{2}(i,j)+12\rho_{0}^{4}(i,j))\frac{s^{2}}{w^{4}}
+(2β40+2β2028β30ρ0(i,j)+8β20ρ02(i,j)4ρ04(i,j))1w4\quad+\left(2\beta_{40}+2\beta_{20}^{2}-8\beta_{30}\rho_{0}(i,j)+8\beta_{20}\rho_{0}^{2}(i,j)-4\rho_{0}^{4}(i,j)\right)\frac{1}{w^{4}}
+(8β2028ρ04(i,j)+16β20ρ02(i,j))sw4.\displaystyle\quad+\left(-8\beta_{20}^{2}-8\rho_{0}^{4}(i,j)+16\beta_{20}\rho_{0}^{2}(i,j)\right)\frac{s}{w^{4}}.
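These moment expansions translate directly into code. The following sketch (hypothetical name v_moments; a verbatim transcription of the two displays above, with s entering only through the lag terms carried over from the covariance expansion) returns the mean and variance of V_t^w given estimates of beta_20, beta_30, beta_40 and rho_0(i, j):

def v_moments(beta20, beta30, beta40, rho0, w, s=0.0):
    # Direct transcription of the displayed expansions for E(V_t^w) and Var(V_t^w).
    mu = 2.0 / w * (beta20 - rho0**2)
    var = (8.0 * (beta20 - rho0**2)**2 / w**2
           + (2*beta40 - 8*beta30*rho0 + 10*beta20**2
              - 8*beta20*rho0**2 + 4*rho0**4) / w**3
           + (-12*beta20*rho0**2 + 12*rho0**4) * s**2 / w**4
           + (2*beta40 + 2*beta20**2 - 8*beta30*rho0
              + 8*beta20*rho0**2 - 4*rho0**4) / w**4
           + (-8*beta20**2 - 8*rho0**4 + 16*beta20*rho0**2) * s / w**4)
    return mu, var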

Here we denote the moment generating function as

ψw(θ)=log𝔼(exp{θVtw}),\psi_{w}(\theta)=\log\mathbb{E}\left(\exp\left\{\theta V_{t}^{w}\right\}\right),

and select θ=θw\theta=\theta_{w} by solving the equation ψ˙w(θ)=b\dot{\psi}_{w}(\theta)=b. Since VtwV_{t}^{w} is a function of 2w+22w+2 independent random samples, ψw\psi_{w} converges to a limit as ww\rightarrow\infty, and θw\theta_{w} converges to a limiting value, denoted by θ\theta. The transformed distribution for the sequence before position tt with window size ww is denoted by

dtw=exp{θwVtwψw(θw)}d.d\mathbb{P}_{t}^{w}=\exp\left\{\theta_{w}V_{t}^{w}-\psi_{w}\left(\theta_{w}\right)\right\}d\mathbb{P}.

Let

(t,w):=log(dtw/d)=θwVtwψw(θw).\ell(t,w):=\log\left(d\mathbb{P}_{t}^{w}/d\mathbb{P}\right)=\theta_{w}V_{t}^{w}-\psi_{w}\left(\theta_{w}\right).
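Numerically, θ_w can be obtained by root finding on an empirical version of ψ̇_w. The following is a minimal sketch (hypothetical name tilt_parameter; here ψ_w is estimated from Monte Carlo draws of V_t^w rather than computed in closed form):

import numpy as np
from scipy.optimize import brentq

def tilt_parameter(v_samples, b, theta_max=200.0):
    # Solve psi_dot_w(theta) = b, where psi_w(theta) = log E exp(theta * V_t^w)
    # is estimated from Monte Carlo draws of V_t^w.
    v = np.asarray(v_samples, dtype=float)

    def psi_dot(theta):
        wts = np.exp(theta * v - (theta * v).max())  # stabilized exponential weights
        return (v * wts).sum() / wts.sum()           # derivative of log E exp(theta V)

    # A root exists in [0, theta_max] when mean(v) < b < psi_dot(theta_max).
    return brentq(lambda th: psi_dot(th) - b, 0.0, theta_max)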

Let D={(t,w):0tm,1wm}D=\{(t,w):0\leq t\leq m,1\leq w\leq m\} denote the set of all possible windows in the scan, and let A={max(t,w)DVtwb}A=\left\{\max_{(t,w)\in D}V_{t}^{w}\geq b\right\} be the event of interest (the event {Tm}\{T\leq m\}), i.e., that the detection procedure stops before time mm.

By measure transformation, we have

(A)\displaystyle\mathbb{P}(A) =(t,w)D𝔼[exp[(t,w)]((t,w)Dexp[(t,w)])1;A]\displaystyle=\sum_{(t,w)\in D}\mathbb{E}\left[\exp[\ell(t,w)]\left(\sum_{\left(t^{\prime},w^{\prime}\right)\in D}\exp\left[\ell\left(t^{\prime},w^{\prime}\right)\right]\right)^{-1};A\right] (16)
=(t,w)D𝔼tw[((t,w)Dexp[(t,w)])1;A]\displaystyle=\sum_{(t,w)\in D}\mathbb{E}_{t}^{w}\left[\left(\sum_{\left(t^{\prime},w^{\prime}\right)\in D}\exp\left[\ell\left(t^{\prime},w^{\prime}\right)\right]\right)^{-1};A\right]
=(t,w)De~(t,w)(t,w)×𝔼tw[maxt,we(t,w)(t,w)t,we(t,w)(t,w)e~(t,w)[maxt,w(t,w)(t,w)];A]\displaystyle=\sum_{(t,w)\in D}e^{\tilde{\ell}(t,w)-\ell(t,w)}\times\mathbb{E}_{t}^{w}\left[\frac{\max_{t^{\prime},w^{\prime}}e^{\ell\left(t^{\prime},w^{\prime}\right)-\ell(t,w)}}{\sum_{t^{\prime},w^{\prime}}e^{\ell\left(t^{\prime},w^{\prime}\right)-\ell(t,w)}}e^{-\tilde{\ell}(t,w)-\left[\max_{t^{\prime},w^{\prime}}\ell\left(t^{\prime},w^{\prime}\right)-\ell(t,w)\right]};A\right]
=eθwψ˙w(θw)+ψw(θw)×(t,w)D𝔼tw[M(t,w)S(t,w)e~(t,w)logM(t,w);A],\displaystyle=e^{-\theta_{w}\dot{\psi}_{w}\left(\theta_{w}\right)+\psi_{w}\left(\theta_{w}\right)}\times\sum_{(t,w)\in D}\mathbb{E}_{t}^{w}\left[\frac{M(t,w)}{S(t,w)}e^{-\tilde{\ell}(t,w)-\log M(t,w)};A\right],

where

~(t,w)\displaystyle\tilde{\ell}(t,w) =θw[Vtwψ˙w(θw)],\displaystyle=\theta_{w}\left[V_{t}^{w}-\dot{\psi}_{w}\left(\theta_{w}\right)\right],
M(t,w)\displaystyle M(t,w) =maxt,wexp{θw(VtwVtw)},\displaystyle=\max_{t^{\prime},w^{\prime}}\exp\left\{\theta_{w}\left(V_{t^{\prime}}^{w^{\prime}}-V_{t}^{w}\right)\right\},
S(t,w)\displaystyle S(t,w) =t,wexp{θw(VtwVtw)}.\displaystyle=\sum_{t^{\prime},w^{\prime}}\exp\left\{\theta_{w}\left(V_{t^{\prime}}^{w^{\prime}}-V_{t}^{w}\right)\right\}.
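In code, for a grid of scan-statistic values these three quantities are elementwise transformations. A minimal Python sketch (hypothetical name local_quantities; it takes the maximum and the sum over the whole grid, whereas the proof restricts t', w' to a local neighborhood) is:

import numpy as np

def local_quantities(V, t, w, theta_w, psi_dot_w):
    # V: 2-d array with V[t, w] the scan statistic at time t and window size w.
    ell_tilde = theta_w * (V[t, w] - psi_dot_w)   # centered log-likelihood ratio
    ratios = np.exp(theta_w * (V - V[t, w]))      # exp(theta_w (V_{t'}^{w'} - V_t^w))
    return ell_tilde, ratios.max(), ratios.sum()  # (ell_tilde, M(t, w), S(t, w))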

Consider a sequence of σ\sigma-fields ^\hat{\mathcal{F}}, and let M^(t,w)\hat{M}(t,w) and S^(t,w)\hat{S}(t,w) be approximations of M(t,w)M(t,w) and S(t,w)S(t,w), respectively, that are measurable with respect to ^\hat{\mathcal{F}}. Given ϵ>0\epsilon>0, we assume that the following conditions hold for all large bb:

Assumption C.1.

M(t,w)M(t,w) and S(t,w)S(t,w) satisfy 0<M(t,w)S(t,w)<0<M(t,w)\leq S(t,w)<\infty with probability 1.

Assumption C.2.

There exist M^(t,w)\hat{M}(t,w) and S^(t,w)\hat{S}(t,w) measurable with respect to ^\hat{\mathcal{F}} such that |M^(t,w)/M(t,w)1|ϵ|\hat{M}(t,w)/M(t,w)-1|\leq\epsilon and |S^(t,w)/S(t,w)1|ϵ|\hat{S}(t,w)/S(t,w)-1|\leq\epsilon, with probability at least 1ϵb1/21-\epsilon b^{-1/2}.

Assumption C.3.

𝔼[M^(t,w)/S^(t,w)]\mathbb{E}[\hat{M}(t,w)/\hat{S}(t,w)] converges to a finite and positive limit, denoted by 𝔼[ℳ/𝒮]\mathbb{E}[\mathcal{M}/\mathcal{S}].

Assumption C.4.

There exist μ\mu\in\mathbb{R} and σd+\sigma_{d}\in\mathbb{R}^{+} such that for every ϵ,δ>0\epsilon,\delta>0 and all large enough bb, the probability of the event

B={sup|x|logb|b1/2(~(t,w)xlogM^(t,w)+(0,δ]|^)δσdϕ(μσd)|ϵ}B=\left\{\sup_{|x|\leq\log b}\left|b^{1/2}\mathbb{P}\left(\tilde{\ell}(t,w)\in x-\log\hat{M}(t,w)+(0,\delta]\,\middle|\,\hat{\mathcal{F}}\right)-\frac{\delta}{\sigma_{d}}\phi\left(\frac{\mu}{\sigma_{d}}\right)\right|\leq\epsilon\right\}

is bounded from below by 1ϵb1/21-\epsilon b^{-1/2}.

Since t,wt,w are fixed in much of the following analysis, we suppress the dependence of the notation on t,wt,w and simply write ~,S,M\tilde{\ell},S,M. Under Assumptions C.1, C.2, C.3 and C.4, a localization lemma allows us to simplify the expectation

𝔼tw[MSe~logM;~+logM0]\mathbb{E}_{t}^{w}\left[\frac{M}{S}e^{-\tilde{\ell}-\log M};\tilde{\ell}+\log M\geq 0\right]

into a simpler form

12πσw2𝔼[MS],\frac{1}{\sqrt{2\pi\sigma_{w}^{2}}}\mathbb{E}\left[\frac{M}{S}\right],

where σw2\sigma_{w}^{2} denotes the variance of ~\tilde{\ell} under the measure tw\mathbb{P}_{t}^{w}. The reduction relies on the fact that, for large mm, the local processes MM and SS are approximately independent of the global process ~\tilde{\ell}. This independence allows the expectation above to factor into the expectation of M/SM/S times a term involving ~+logM\tilde{\ell}+\log M, with logM\log M treated essentially as a constant.

We first consider the processes MM and SS and derive the expectation 𝔼[M/S]\mathbb{E}[M/S] following Siegmund and Yakir (2000).

The covariance between the two terms is given by

Cov(θwVtw,θwVtw)=θw2Cov(Vtw,Vtw)=θw2σd2(1|tt|w+o(|tt|w)).\operatorname{Cov}\left(\theta_{w}V_{t^{\prime}}^{w^{\prime}},\theta_{w}V_{t}^{w}\right)=\theta_{w}^{2}\operatorname{Cov}\left(V_{t^{\prime}}^{w^{\prime}},V_{t}^{w}\right)=\theta_{w}^{2}\sigma_{d}^{2}\left(1-\frac{|t^{\prime}-t|}{w}+o\left(\frac{|t^{\prime}-t|}{w}\right)\right).

When ww is large, the correlation depends linearly on the difference |tt|\left|t^{\prime}-t\right|, so the process behaves locally like a random walk in the candidate change time tt, with increment variance θw2σd2/w\theta_{w}^{2}\sigma_{d}^{2}/w. Following Siegmund and Yakir (2000), we have

𝔼[M/S]=[θw2σd2/wν([θwσd2/w]1/2)]2.\mathbb{E}[M/S]=\left[\theta_{w}^{2}\sigma_{d}^{2}/w\nu\left(\left[\theta_{w}\sigma_{d}^{2}/w\right]^{1/2}\right)\right]^{2}.

Moreover, the process ~\tilde{\ell} is zero-mean and has variance σw2=Vartw(~)=θw2ψ¨(θw)\sigma_{w}^{2}=\operatorname{Var}_{t}^{w}(\tilde{\ell})=\theta_{w}^{2}\ddot{\psi}\left(\theta_{w}\right) under the measure tw\mathbb{P}_{t}^{w}. Substituting the result for the expectations in (16) yields

(Tm)=2w=0m1(m2w)eθwψ˙w(θw)+ψw(θw)[2πθw2ψ¨w(θw)]1/2[θw2σd2/wν([θwσd2/w]1/2)]2.\mathbb{P}\left(T\leq m\right)=2\sum_{w=0}^{m-1}(m-2w)e^{-\theta_{w}\dot{\psi}_{w}\left(\theta_{w}\right)+\psi_{w}\left(\theta_{w}\right)}\left[2\pi\theta_{w}^{2}\ddot{\psi}_{w}\left(\theta_{w}\right)\right]^{-1/2}\left[\theta_{w}^{2}\sigma_{d}^{2}/w\nu\left(\left[\theta_{w}\sigma_{d}^{2}/w\right]^{1/2}\right)\right]^{2}.

In the limiting case, VtwV_{t}^{w} can be well approximated by the Gaussian distribution 𝒩(μ,σd2)\mathcal{N}\left(\mu,\sigma_{d}^{2}\right). The moment generating function then becomes ψ(θ)=μθ+θ2σd2/2\psi(\theta)=\mu\theta+\theta^{2}\sigma_{d}^{2}/2, and the limiting value is θ=(bμ)/σd2\theta=(b-\mu)/\sigma_{d}^{2}, the solution of ψ˙(θ)=b\dot{\psi}(\theta)=b. Furthermore, the summation can be approximated by an integral, yielding

(Tm)\displaystyle\mathbb{P}\left(T\leq m\right) =2w=0m1(m2w)e(bμ)2/(2σd2)[2π(bμ)2/σd2]1/2[(bμ)2/(wσd2)ν([(bμ)/w]1/2)]2\displaystyle=2\sum_{w=0}^{m-1}(m-2w)e^{-(b-\mu)^{2}/\left(2\sigma_{d}^{2}\right)}\left[2\pi(b-\mu)^{2}/\sigma_{d}^{2}\right]^{-1/2}\left[(b-\mu)^{2}/\left(w\sigma_{d}^{2}\right)\nu\left(\left[(b-\mu)/w\right]^{1/2}\right)\right]^{2} (17)
4e(bμ)2/(2σd2)[2π(bμ)2/σd2]1/2[(bμ)2/σd2]21mν2([4b2/(mtσd2)]1/2)(1t)𝑑t/t2.\displaystyle\approx 4e^{-(b-\mu)^{2}/\left(2\sigma_{d}^{2}\right)}\left[2\pi(b-\mu)^{2}/\sigma_{d}^{2}\right]^{-1/2}\left[(b-\mu)^{2}/\sigma_{d}^{2}\right]^{2}\int_{1}^{m}\nu^{2}\left(\left[4b^{2}/\left(mt\sigma_{d}^{2}\right)\right]^{1/2}\right)(1-t)dt/t^{2}.

Here it is assumed that mm is large, but small enough that the right-hand side of (17) converges to 0 when bb\rightarrow\infty. Changing variables in the integrand, we can rewrite this approximation as

{Tm}m×2e(bμ)2/(2σd2)[2π(bμ)2/σd2]1/2[(bμ)2/σd2][4b2/(wσd2)]1/2[4b2/(σd2)]1/2yν2(y)𝑑y.\mathbb{P}\left\{T\leq m\right\}\approx m\times 2e^{-(b-\mu)^{2}/\left(2\sigma_{d}^{2}\right)}\left[2\pi(b-\mu)^{2}/\sigma_{d}^{2}\right]^{-1/2}\left[(b-\mu)^{2}/\sigma_{d}^{2}\right]\int_{\left[4b^{2}/\left(w\sigma_{d}^{2}\right)\right]^{1/2}}^{\left[4b^{2}/\left(\sigma_{d}^{2}\right)\right]^{1/2}}y\nu^{2}(y)dy. (18)

From the arguments in Siegmund and Venkatraman (1995) and Siegmund and Yakir (2008), we know that TT is asymptotically exponentially distributed and uniformly integrable. Hence, if λ\lambda denotes the factor multiplying mm on the right-hand side of (18), then for large mm, in the range where mλm\lambda is bounded away from 0 and ++\infty, we have {Tm}[1exp(λm)]0\mathbb{P}\left\{T\leq m\right\}-[1-\exp(-\lambda m)]\rightarrow 0. Consequently, 𝔼[T]1/λ\mathbb{E}\left[T\right]\approx 1/\lambda, which completes the proof. We omit some technical details needed to make the derivation rigorous; they are described and proved in Siegmund et al. (2010). ∎
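The resulting approximation is straightforward to evaluate numerically. The sketch below (hypothetical names nu and arl_approx; nu uses the standard numerical approximation to Siegmund's special function rather than its exact form) computes the factor λ from the last display and returns E[T] ≈ 1/λ:

import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

def nu(x):
    # Standard numerical approximation to Siegmund's special function nu(x).
    x = max(x, 1e-10)
    num = (2.0 / x) * (norm.cdf(x / 2.0) - 0.5)
    den = (x / 2.0) * norm.cdf(x / 2.0) + norm.pdf(x / 2.0)
    return num / den

def arl_approx(b, mu, sigma_d, w):
    # lambda = 2 z phi(z) * integral of y nu(y)^2 dy, with z = (b - mu) / sigma_d,
    # matching the factor multiplying m in (18); then E[T] is approximately 1 / lambda.
    z = (b - mu) / sigma_d
    lo = np.sqrt(4.0 * b**2 / (w * sigma_d**2))
    hi = np.sqrt(4.0 * b**2 / sigma_d**2)
    integral, _ = quad(lambda y: y * nu(y)**2, lo, hi)
    lam = 2.0 * z * norm.pdf(z) * integral
    return 1.0 / lam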

Proof of Proposition 4.3.

For ν=1\nu=1 and a given detection threshold b>0b>0, when the window size is sufficiently large, at the time of detection (the stopping time TT) we have

𝔼[ST]𝔼T[Tww+T𝔼[𝒗1,T1]]𝔼T[T𝔼[𝒗1,T1]].\mathbb{E}[S_{T}]\approx\mathbb{E}_{T}[\frac{Tw}{w+T}{\mathbb{E}}[\left\|\bm{v}_{1,T}\right\|_{1}]]\approx\mathbb{E}_{T}[T{\mathbb{E}}[\left\|\bm{v}_{1,T}\right\|_{1}]].

On the other hand, we have 𝔼[ST]=b+𝔼[STb]\mathbb{E}[S_{T}]=b+\mathbb{E}[S_{T}-b], where 𝔼[STb]\mathbb{E}[S_{T}-b] is the overshoot of the detection procedure and is of order o(b)o(b) as bb\to\infty (Siegmund, 1985). This yields the first-order approximation for the expected stopping time:

𝔼[T]𝔼[𝒗1,T1]=b(1+o(1))𝔼[T]=b(1+o(1))Δ,\mathbb{E}[T]{\mathbb{E}}[\left\|\bm{v}_{1,T}\right\|_{1}]=b(1+o(1))\Rightarrow\mathbb{E}[T]=\frac{b(1+o(1))}{\Delta},

where Δ=Δ1\Delta=\Delta_{1} when StS_{t} is the sum-type detection statistic, and Δ=Δ2\Delta=\Delta_{2} for the max-type detection statistic.
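As an informal check of this first-order approximation, one can model the post-change statistic as a drifting random walk and compare the simulated mean first-passage time with b/Δ (a sketch with hypothetical drift and noise values; the actual detection statistic has a more complex dependence structure):

import numpy as np

rng = np.random.default_rng(1)
b, delta, sigma = 50.0, 0.8, 1.0     # threshold, post-change drift, noise level
delays = []
for _ in range(5000):
    s_stat, t = 0.0, 0
    while s_stat < b:
        s_stat += delta + sigma * rng.standard_normal()  # drift delta per step
        t += 1
    delays.append(t)
print(np.mean(delays), b / delta)    # empirical mean delay vs. first-order b / Delta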