\externaldocument

Supp

Spectral Change Point Estimation for High Dimensional Time Series by Sparse Tensor Decomposition

Xinyu Zhang and Kung-Sik Chan
Department of Statistics and Actuarial Science, University of Iowa, Iowa City, USA. Email: [email protected]. Department of Statistics and Actuarial Science, University of Iowa, Iowa City, USA. Email: [email protected].

Abstract

Multivariate time series may be subject to partial structural changes over certain frequency band, for instance, in neuroscience. We study the change point detection problem with high dimensional time series, within the framework of frequency domain. The overarching goal is to locate all change points and delineate which series are activated by the change, over which frequencies. In practice, the number of activated series per change and frequency could span from a few to full participation. We solve the problem by first computing a CUSUM tensor based on spectra estimated from blocks of the time series. A frequency-specific projection approach is applied for dimension reduction. The projection direction is estimated by a proposed tensor decomposition algorithm that adjusts to the sparsity level of changes. Finally, the projected CUSUM vectors across frequencies are aggregated for change point detection. We provide theoretical guarantees on the number of estimated change points and the convergence rate of their locations. We derive error bounds for the estimated projection direction for identifying the frequency-specific series activated in a change. We provide data-driven rules for the choice of parameters. The efficacy of the proposed method is illustrated by simulation and a stock returns application.

Keywords: CANDECOMP/PARAFAC decomposition, frequency-specific projection, piecewise stationary, sparsity, spectral density matrix.

1 Introduction

In the context of high-dimensional time series, the stationary assumption is often suspect. Instead, data may be piecewise stationary, with the non-stationarity caused by structural breaks in the mean function (Csörgö and Horváth, 1997; Fryzlewicz, 2014; Jirak, 2015; Wang and Samworth, 2018; Enikeeva and Harchaoui, 2019), in the variance function (Aue et al., 2009; Kirch et al., 2015; Wang et al., 2021), or in the auto- and/or cross-covariance function (equivalently the spectral density function or simply referred to as the spectrum below). Here, we consider the problem of detecting and locating changes in the spectrum of high-dimensional time series.

The spectrum provides comprehensive information about the dynamics as it encapsulates both the contemporaneous and dynamic covariance structure of the time series components. On the one hand, by incorporating the autocovariances at all lags, it allows us to discern patterns and structural breaks that may not be evident when focusing solely on the contemporaneous covariances. On the other hand, changes in a parametric model (e.g., VAR model) have been widely studied (Davis et al., 2006; Chan et al., 2014; Kirch et al., 2015), and such changes will also result in structure breaks in the spectrum. Hence, frequency domain approaches are able to detect such changes, also offering the benefit of circumventing any model assumptions (Ombao et al., 2005; Cho and Fryzlewicz, 2012; Adak, 1998; Cho et al., 2023). What’s more, it is especially appealing in fields where frequency interpretations or periodic patterns are natural and crucial, e.g., electrical engineering, finance, physics, neuroscience, etc. Furthermore, from the frequency viewpoint, it is pertinent to identify over which frequencies (CP frequencies) and across which series (CP series) a structural change occurs. This problem of detecting frequency-series specific changes is underexplored, save two related works: Schröder and Ombao (2019) proposed the FreSpeD method, which detects change points based on one auto-spectrum (co-spectrum) at a time. Preuss et al. (2015) considered the problem of change-point detection based on local spectral density matrix, but their method is not designed for high dimensional time series, nor frequency specific.

To simultaneously estimate the change point locations, activated series and frequencies, we propose to explore the sparse spectral norm of the difference in the spectral matrix per frequency for change point detection. We focus on sparsity because in high dimension, changes tend to activate a few time series components, which poses a significant challenge, especially with weak signals. In such cases, aggregating information across all components via, say, the spectral norm may lose power due to accumulation of error, whereas the proposed sparse method helps to capture and highlight the signals. We emphasize that the proposed methodology is also adaptable to non-sparse settings, with the sparsity level determined in a data-driven way. Similarly, changes may be sparse or dense over the frequencies. For example, a covariance matrix shift generally affects all frequencies, while data from certain field, such as neuroscience, may exhibit changes only over specific frequency bands. Thus, the proposed method employs frequency-specific detection followed by adaptive aggregation to address the multi-faceted nature of structural changes in high-dimensional time series.

To achieve our goal, our methodology builds on frequency-specific projection, with the projector obtained by a sparse tensor decomposition algorithm. Specifically, we first sub-divide the data into blocks and compute their spectra (Adak, 1998). We then transform the sequence of spectra per frequency into a third-order CUSUM tensor. How to integrate the information in the third-order tensor for change point detection is a challenging problem. Several aggregation approaches for integrating information for change point detection have been proposed, including $l_{2}$ -aggregation (Horváth and Hušková, 2012), $l_{\infty}$ -aggregation (Jirak, 2015), among others. Considering our goals of specifying the CP series, we propose to use a sparse projection approach. The projection approach has been used by Wang and Samworth (2018) and Wang et al. (2021) for change point detection respectively in the mean and variance function, where the projector is obtained through some matrix decomposition, in combination with the wild binary segmentation (Fryzlewicz, 2014) for multiple change points. In comparison, our problem requires the development of novel algorithms, including a new variant of the tensor power method, and theory for extracting the signal from the third-order CUSUM tensor. By exploring the special structures of the CUSUM tensor, we develop a novel special sparse tensor decomposition algorithm, based on the truncated matrix power and tensor power methods, to obtain the projection direction from the CUSUM tensor, and the algorithm is theoretically guaranteed.

With the projector, change point detection may be implemented based on the projected CUSUM in a frequency-specific manner. However, an overall conclusion on the change point locations is often desired, the solution of which requires integrating the information across frequencies. We do this with thresholded $l_{1}$ aggregation (Cho and Fryzlewicz, 2015) which achieves adaptivity across frequencies. In sum, leveraging the tensor decomposition algorithm, we develop a novel change point detection approach with high dimensional time series, that can simultaneously identify which series and frequencies are activated per change.

We establish the convergence rate for the projector output by the proposed sparse tensor decomposition algorithm, via a perturbation analysis of the high dimensional spectral CUSUM tensor. The significance of sparsity is evident as it results in a faster convergence rate. However, we emphasize that the proposed method can detect sparse to dense spectral changes, via a data-driven sparsity parameter. Furthermore, we show that our method consistently estimates both the number and locations of the change points, with the localization rate nearly achieving the minimax optimal rate for the covariance change point problem.

We outline the rest of the paper. Section 2 elaborates the main problem. We detail the proposed methods in Section 3 with justification based on characterizing their theoretical properties via non-asymptotics in Section 4. In Section 5, we discuss the choice of the hyper-parameters for applying the proposed methods, and report the empirical performance of the proposed method via simulation. In Section 6, we demonstrate the efficacy of the proposed methods by doing a change point analysis with a panel of S&P100 returns. We conclude in Section 7. All proofs and some additional simulation results are collected in the Supplementary Materials.

1.1 Notation

We generally use bold-face upright letters (e.g., $\mathbf{X}$ ) for matrices, bold-face italic letters (e.g., $\bm{\gamma}$ ) for vectors, calligraphic-font letters (e.g., $\mathcal{F}$ ) for tensors, and superscript ^∗ to signify tensor-induced vectors (e.g., $\mathcal{F}^{*}$ ). For any real number $r$ , $\lceil r\rceil$ denotes the ceiling function, i.e., the smallest integer greater or equal to $r$ , and $\lfloor r\rfloor=-\lceil-r\rceil$ denotes the floor function. For any vector $\bm{c}\in\mathbb{R}^{p}$ , $\|\bm{c}\|_{0}$ denotes its $l_{0}$ norm, and $\|\bm{c}\|$ its Euclidean norm. For any $p\times p$ real matrix $\mathbf{A}$ , $\|\mathbf{A}\|$ denotes its spectral norm, and $\mathbf{A}_{i,}$ its $i$ -th row. We follow the tensor convention in Kolda and Bader (2009). Specifically, for a third-order tensor $\mathcal{G}\in\mathbb{C}^{p_{1}\times p_{2}\times p_{3}}$ and some $1\leq j\leq p_{2},1\leq l\leq p_{3}$ , the mode 1 fibre is given by $\mathcal{G}_{:jl}$ , and the slice along mode 3 is given as $\mathcal{G}_{::l}$ . When $\mathcal{G}$ is real, define the vector product with $\bm{c}^{(k)}\in\mathbb{R}^{p_{k}}$ with $k=1,2,3$ , e.g. $\mathcal{G}\times_{1}\bm{c}^{(1)}\equiv\sum_{i=1}^{p_{1}}\bm{c}_{i}^{(1)}\mathcal{G}_{i::}$ . For any set $\mathbb{U}$ , let $|\mathbb{U}|$ be its cardinality.

2 Main problem

Let $\mathbf{X}=(\mathbf{X}_{1},\ldots,\mathbf{X}_{N})\in\mathbb{R}^{p\times N}$ be a $p$ -dimensional centered multivariate time series of length $N$ . Let the scaled time be $t=n/N\in[0,1]$ . We assume that $\mathbf{X}$ is a piecewise stationary process, i.e.,

\mathbf{X}_{n}=\sum_{m=0}^{\infty}\tilde{\mathbf{A}}(v_{q},m)\mathbf{Y}_{n-m},\quad v_{q-1}<t=n/N\leq v_{q},

(2.1)

where $v_{0}=0<v_{1}<\dots<v_{Q}<v_{Q+1}=1$ are change points of unknown numbers at unknown locations, the $\mathbf{Y}$ ’s are $p$ -dimensional doubly infinite i.i.d. innovations with zero mean and identity covariance matrix, and $\tilde{\mathbf{A}}(v_{q},m)$ is a sequence of $p\times p$ coefficient matrices. By assuming structural changes in the filter $\tilde{\mathbf{A}}(v_{q},m)$ , this facilitates a general framework for structural breaks in any auto- and/or cross- second moments across all lags. Note that the proposed framework presumes no changes in the mean function, or the data has been pre-processed to remove any such changes. The interesting problem of joint detection of changes in the first and second moments awaits future investigation.

Changes in $\tilde{\mathbf{A}}(v_{q},m)$ are succinctly encapsulated by the time-varying spectral density matrix $\mathbf{f}(t,\omega)$ . Specifically, denote the transfer function at frequency $\omega$ as $\mathbf{A}(v_{q},\omega)=\sum_{m=0}^{\infty}\tilde{\mathbf{A}}(v_{q},m)\exp(-\mathrm{i}\omega m)$ , it follows that $\mathbf{f}(t,\omega)$ is piecewise constant such that

\mathbf{f}(t,\omega)=\mathbf{f}(v_{q},\omega)=\mathbf{A}(v_{q},\omega)\mathbf{A}^{H}(v_{q},\omega),\quad v_{q-1}<t\leq v_{q}\,\,\mbox{for}\,\,q=1,\ldots,Q+1,\,\,\omega\in(-\pi,\pi],

where the superscript ^H denotes the complex conjugate transpose. Hence, the differences $\mathbf{f}(v_{q+1},\omega)-\mathbf{f}(v_{q},\omega)$ quantify the structural breaks. Our objective is to leverage this observation to devise a change point detection algorithm. As previously discussed, in high dimensional time series, a structural break may activate only a subset of time-series components, the search of which may be effected by finding a sparse linear projection of the vector time series that maximizes the spectral changes; such a projection may be frequency-specific. The basic algebraic operation is the following sparse spectral norm. Let $1\leq s\leq p$ be a fixed scalar. For any Hermitian matrix $\mathbf{B}\in\mathbb{C}^{p\times p}$ , its sparse spectral norm $\rho(\mathbf{B},s)$ is defined as

\displaystyle\rho(\mathbf{B},s)\equiv\sup_{\begin{subarray}{c}\bm{c}\in\mathbb{R}^{p},\|\bm{c}\|_{0}\leq s,\|\bm{c}\|=1\end{subarray}}|\bm{c}^{\top}\mathbf{B}\bm{c}|.

(2.2)

As $\mathbf{B}$ is Hermitian, $\rho(\mathbf{B},s)=\rho(\mathrm{Re}(\mathbf{B}),s)$ . Hence, applying the preceding sparse spectral norm to detecting changes in $\mathbf{f}(t,\omega)$ is equivalent to detecting changes in its real part $\mathrm{Re}[\mathbf{f}(t,\omega)]$ , which is known as the co-spectrum. The co-spectrum admits the interesting interpretation that it is proportional to the covariance matrix of the component of $\mathbf{X}$ corresponding to the frequency $\omega$ , see Section 7.1 in Brillinger (2001). The co-spectral increment between two consecutive time segments is given by

\mathbf{g}_{q}(\omega)\equiv\mathrm{Re}[\mathbf{f}(v_{q+1},\omega)-\mathbf{f}(v_{q},\omega)]\quad\mbox{for}\quad q=1,\ldots,Q,\,\,\omega\in(-\pi,\pi].

(2.3)

These are real symmetric matrices, thereby admitting an eigen-decomposition. Specifically, let $\lambda_{qi}(\omega)$ be the real eigenvalues ordered in decreasing magnitude, and $\bm{\gamma}_{qi}(\omega)$ the corresponding eigenvectors. The validity of our proposed methods require the following assumptions.

Assumption 1.

$\mathbf{X}$ is generated by (2.1). There exist positive constants $c_{1}>0$ and $c_{2}>2$ such that for all $m\geq 0$ ,

\max_{1\leq q\leq Q+1}\max_{1\leq i\leq p}\|\tilde{\mathbf{A}}_{i,}(v_{q},m)\|\leq c_{1}(1+m)^{-c_{2}}.

(2.4)

Assumption 1 requires the process $\mathbf{X}$ to be weakly dependent up to an algebraic decay rate. This assumption ensures consistent spectrum estimation with high-dimensional time series, under the framework of function dependence (Zhang and Wu, 2021).

Assumption 2.

For each $q=1,\ldots,Q$ , there exists a non-empty set of frequencies, $\mathbb{F}_{q}$ , such that for all $\omega\in\mathbb{F}_{q}$ , the leading eigenvalue $|\lambda_{q1}(\omega)|$ of $\mathbf{g}_{q}(\omega)$ and the gap $\Delta\lambda_{q}(\omega)=|\lambda_{q1}(\omega)|-|\lambda_{q2}(\omega)|$ are both non-degenerate with $|\lambda_{q1}(\omega)|>\underline{\lambda}>0$ and $\Delta\lambda_{q}(\omega)>\underline{\Delta\lambda}>0$ . The leading eigenvector $\bm{\gamma}_{q1}(\omega)$ is sparse such that $\sup_{q=1,\ldots,Q,\omega\in\mathbb{F}_{q}}\|\bm{\gamma}_{q1}(\omega)\|_{0}\leq k_{0}<p$ . Assume that there exist positive constants $\underline{f}$ and $\bar{f}$ such that for all $q$ and $\omega$ ,

k_{0}\underline{f}\leq\rho(\mathrm{Re}(\mathbf{f}(v_{q},\omega)),k_{0})\leq k_{0}\bar{f}.

(2.5)

Assumption 2 imposes conditions on the sparse spectral norm of the spectral densities $\mathbf{f}(v_{q},\omega)$ and their increments $\mathbf{g}_{q}(\omega)$ . Note that the RHS of (2.5) ensues from (2.4). These quantities are related in the following sense:

\underline{\Delta\lambda}\leq\underline{\lambda}\leq 2\max_{q}\rho(\mathrm{Re}(\mathbf{f}(v_{q},\omega)),k_{0})\leq 2k_{0}\bar{f}.

(2.6)

Assumption 2 allows but does not require that changes in $\mathbf{f}(t,\omega)$ occur only over a subset of frequencies that are specific to the change point. That is, the spectrum may remain constant at some frequencies but change over time at others. This relaxation could be useful in fields where specific frequency change is common, such as neuroscience. The frequencies in $\mathbb{F}_{q}$ are referred to as the change-point frequencies at the $q$ -th break. For each $\omega\in\mathbb{F}_{q}$ , we assume that only a set of series experience spectral change at $\omega$ , and such series are referred to as the CP series, the collection of which is denoted as $\mathbb{S}_{q}^{\omega}$ . Furthermore, let $\mathbb{F}=\{\mathbb{F}_{q},1\leq q\leq Q\}$ and $\mathbb{S}=\{\mathbb{S}_{q}^{\omega},1\leq q\leq Q,\omega\in\mathbb{F}_{q}\}$ be the list of all CP frequencies and CP series, respectively. We remark that while our methodology performs optimally in the sparse series case, which is demonstrated in Theorem 4.1, it could be applied in non-sparse settings as well.

Our goal is to detect the number and location of the change points, as well as the corresponding CP frequencies and CP series ( $\mathbb{F}$ and $\mathbb{S}$ ).

3 Methodology

Our change point detection method proceeds as follows. First, in Section 3.1, the entire time series is partitioned into multiple blocks, over each of which a spectral density matrix estimate is computed. A CUSUM-type test statistic for change-point detection is then constructed based on the sequence of spectral density matrix estimates, resulting in a tensor. Then, in Section 3.2, a frequency-specific projection approach is proposed to compress the CUSUM tensor into a vector at each frequency. This projection approach is realized by a special tensor decomposition algorithm, which is introduced in Section 3.3. In Section 3.4, these frequency-specific pieces of information are aggregated via a sparsified wild binary segmentation algorithm for implementing multiple change point detection.

3.1 CUSUM of the block-based spectral matrices

The first step is to estimate the time-dependent spectral density matrix $\mathbf{f}(t,\omega)$ , which requires some trade-off between spectral estimation accuracy and temporal resolution of the change point location. Similar to Adak (1998) and with no loss of generality, we partition the index set $\{1,\ldots,N\}$ into $B$ time blocks of equal length $L$ , which are defined by the boundaries $\{n_{b}=Lb,b=0,...,B\}$ with $B=\lfloor N/L\rfloor$ . The change point search is then on the block time scale, with the change block index set defined as $\mathbb{U}=\{u_{1\leq q\leq Q}|u_{q}=\lceil v_{q}B\rceil\}$ . Note that the search can be fine-tuned using overlapping windows around any change point detected on the block scale, although we will not delve into this aspect below.

Denote $\Omega$ as a set of frequencies over which we estimate the spectrum. For instance, it can be the set of Fourier frequencies $2\pi l/L$ for $1\leq l\leq\lceil L/2\rceil$ , or a subset thereof, or certain frequency band of interest, as in neuroscience. Consider the $b$ -th block indexed by time $n_{b-1}+1,\ldots,n_{b}$ . For any frequency $\omega\in\Omega$ , define $\mathcal{F}(\omega)$ as a $p\times p\times B$ (population) spectrum tensor with $\mathcal{F}_{::b}(\omega)=\mathbf{f}_{b}(\omega)$ defined as a weighted average of $\mathbf{f}(v_{q},\omega)$ with weights proportional to number of observations within $b$ -th block falling into the $q$ -th segment; see the formula in (LABEL:eq:f). Then, $\hat{\mathcal{F}}(\omega)$ is defined as a $p\times p\times B$ smoothed periodogram tensor with

\hat{\mathcal{F}}_{::b}(\omega)=\hat{\mathbf{f}}_{b}(\omega)=\frac{1}{2\pi}\sum_{m=-R}^{R}K\left(\frac{m}{R}\right)\hat{\mathbf{\Sigma}}_{b}(m)\exp(-\mathrm{i}\omega m),

(3.1)

where $K(\cdot)$ is a kernel function (Bartlett kernel in our implementation) with bandwidth $1\leq R<L$ , and $\hat{\mathbf{\Sigma}}_{b}(m)$ is the lag- $m$ autocovariance matrix for the $b$ -th block, i.e.,

\hat{\mathbf{\Sigma}}_{b}(m)=\frac{1}{L}\sum_{n=n_{b-1}+1+m}^{n_{b}}\mathbf{X}_{n-m}\mathbf{X}_{n}^{\top}\text{ for }m\geq 0,\text{ and }\hat{\mathbf{\Sigma}}_{b}(m)=\hat{\mathbf{\Sigma}}_{b}^{\top}(-m)\text{ for }m<0.

For a generic $p\times p\times B$ spectrum or periodogram tensor $\mathcal{G}$ , define the co-spectrum CUSUM transformation $\mathcal{C}_{s,e}:\mathbb{C}^{p\times p\times B}\to\mathbb{R}^{p\times p\times(e-s)}$ on the interval $\{(s,e):1\leq s\leq e\leq B\}$ such that the $(b-s+1)$ -th slice of $\mathcal{C}_{s,e}(\mathcal{G})$ for $b=s,\ldots,e-1$ is

\displaystyle\mathcal{C}_{s,b,e}(\mathcal{G})=\sqrt{\frac{(b-s+1)(e-b)}{e-s+1}}\mathrm{Re}\left(\frac{1}{e-b}\sum_{i=b+1}^{e}\mathcal{G}_{::i}-\frac{1}{b-s+1}\sum_{i=s}^{b}\mathcal{G}_{::i}\right).

(3.2)

Specifically, when $s=1$ and $e=B$ , we have

\displaystyle\mathcal{C}_{1,b,B}(\mathcal{G})=\sqrt{\frac{b(B-b)}{B}}\mathrm{Re}\left(\frac{1}{B-b}\sum_{i=b+1}^{B}\mathcal{G}_{::i}-\frac{1}{b}\sum_{i=1}^{b}\mathcal{G}_{::i}\right).

Apply the $\mathcal{C}_{s,e}$ transformation to $\hat{\mathcal{F}}(\omega)$ and $\mathcal{F}(\omega)$ , and for simplicity denote $\hat{\mathcal{T}}_{s,e}(\omega)=\mathcal{C}_{s,e}(\hat{\mathcal{F}}(\omega))$ and $\mathcal{T}_{s,e}(\omega)=\mathcal{C}_{s,e}(\mathcal{F}(\omega))$ , and their $b$ -th slices as $\hat{\mathcal{T}}_{s,b,e}(\omega)$ and $\mathcal{T}_{s,b,e}(\omega)$ .

In the Supplementary Section LABEL:sec:spectheory , we provide non-asymptotic bounds on the deviation between $\hat{\mathcal{T}}_{s,e}(\omega)$ and $\mathcal{T}_{s,e}(\omega)$ uniformly over $s,e$ and $\omega$ , which facilitates deriving the theoretical properties of the change point detection method.

3.2 A frequency-specific projection approach

To extract and aggregate the information about the change points as contained in the third-order tensor $\hat{\mathcal{T}}_{s,e}(\omega)$ for fixed $\omega$ , we develop a projection approach to compress it into a projection vector along which the projected univariate series preserves the greatest change point signal. Consider the simple case that the spectral density function changes at some specific frequency $\omega$ , across a single change point (i.e. $Q=1$ ). Denote $v(\omega)=v_{1}(\omega)$ as the change point location, and $u(\omega)$ the corresponding change point block index. Let $\mathbf{g}(\omega)=\mathbf{g}_{1}(\omega)=\mathrm{Re}[\mathbf{f}(v_{2},\omega)-\mathbf{f}(v_{1},\omega)]$ be the co-spectral increment.

For notation simplicity, assume that any change point is the first time point of some block (when change point is in the interior of some time blocks, just replace $u(\omega)$ below by $Bv(\omega)$ ), the co-spectrum CUSUM is given by

\mathcal{T}_{1,b,B}(\omega)=\left\{\begin{array}[]{ll}\sqrt{\frac{b}{B(B-b)}}(B-u(\omega))\mathbf{g}(\omega),&\quad b<u(\omega),\\ \sqrt{\frac{(B-b)}{Bb}}u(\omega)\mathbf{g}(\omega),&\quad b\geq u(\omega).\\ \end{array}\right.

(3.3)

Let

\displaystyle\begin{split}{\bm{\alpha}}^{\prime}(\omega)\equiv\sqrt{\frac{1}{B}}\Big{(}&\sqrt{\frac{1}{B-1}}(B-u(\omega)),\sqrt{\frac{2}{B-2}}(B-u(\omega)),\ldots,\\ &\sqrt{u(\omega)(B-u(\omega))},\sqrt{\frac{B-u(\omega)-1}{u(\omega)+1}}u(\omega),\ldots,\sqrt{\frac{1}{B-1}}u(\omega)\Big{)}^{\top},\end{split}

(3.4)

and $\bm{\alpha}(\omega)=\mathrm{Norm}({\bm{\alpha}}^{\prime}(\omega))$ , where $\mathrm{Norm}(\bm{c})\equiv\bm{c}/\|\bm{c}\|$ is defined as the normalization operator on vector $\bm{c}$ . Let $\circ$ denotes the outer product. The eigen-decomposition of $\mathbf{g}(\omega)$ entails that

\mathcal{T}_{1,B}(\omega)=\mathbf{g}(\omega)\circ{\bm{\alpha}}^{\prime}(\omega)=\sum_{i=1}^{r}\lambda_{i}(\omega)\|{\bm{\alpha}}^{\prime}(\omega)\|\cdot\bm{\gamma}_{i}(\omega)\circ\bm{\gamma}_{i}(\omega)\circ\bm{\alpha}(\omega).

(3.5)

Thus, it is a special tensor in that it has a CANDECOMP/PARAFAC-decomposition whose modes 1 and 2 are sparse in the leading term, symmetric and orthogonal, and identical mode 3 across the components; see Figure 3.1 for a graphical illustration.

Refer to caption — Figure 3.1: A graphical illustration of (3.5).

For the third-order tensor $\hat{\mathcal{F}}(\omega)$ , we seek a projector projecting each of its frontal slice matrices into a scalar, thereby rendering the tensor into a vector for change point detection. The projected series should preserve the information about the change point as much as possible. Consider a real projector $\bm{\beta}(\omega)\in\mathbb{R}^{p}$ and $\|\bm{\beta}(\omega)\|=1$ such that the tensor $\hat{\mathcal{F}}(\omega)$ is projected into a vector $\hat{\mathcal{F}}(\omega)\times_{1}\bm{\beta}(\omega)\times_{2}\bm{\beta}(\omega)$ , on which we could apply the operator $\mathcal{C}_{s,e}$ since vectors are $1\times 1\times B$ tensors. Meanwhile ${\mathcal{F}}(\omega)$ is projected into a vector ${\mathcal{F}}(\omega)\times_{1}\bm{\beta}(\omega)\times_{2}\bm{\beta}(\omega)$ , whose CUSUM will achieve the maximum magnitude with the projection $\bm{\beta}(\omega)=\bm{\gamma}_{1}(\omega)$ , and the maximum is

\displaystyle\|\mathcal{T}_{1,B}(\omega)\times_{1}\bm{\gamma}_{1}(\omega)\times_{2}\bm{\gamma}_{1}(\omega)\|=\|\lambda_{1}(\omega){\bm{\alpha}}^{\prime}(\omega)\|,

hence retaining the most signal about the change point. Thus, the oracle projection direction is the leading mode-1 component in the decomposition of the tensor $\mathcal{T}_{1,B}(\omega)$ .

3.3 Special tensor decomposition

Below, we work on recovering the optimal projection $\bm{\gamma}_{1}(\omega)$ from the tensor $\hat{\mathcal{T}}_{1,B}(\omega)$ . A state-of-art tensor decomposition technique is the alternating least squares (ALS) procedure (Kolda and Bader, 2009). This involves solving the least squares problem one tensor mode at a time, while keeping the other modes fixed, and alternating between the tensor modes. Another approach is the tensor power method, which is basically a greedy or rank-1 ALS update. This is a natural generalization of the matrix power method for extracting the dominant eigenvector, with a truncated version given by Yuan and Zhang (2013) for the sparse setting. In sparse tensor decomposition, the tensor power method can be combined with a soft thresholding operator (Allen, 2012) or a truncation step (Sun et al., 2017), and the latter provided theoretical guarantees. Since the tensor decomposition for ${\mathcal{T}}_{1,B}(\omega)$ enjoys the special features including partial orthogonality, symmetry and identical third mode, we utilize these features and develop a new Algorithm 1. It combines the tensor power method with the truncated matrix power method. For any $\bm{c}\in\mathbb{R}^{p}$ , $\mathrm{Trun}(\bm{c},k)$ equals the vector resulted from retaining its largest $k$ elements in magnitude but replacing other elements of $\bm{c}$ by zero. In Algorithm 1, we first discard the boundary frontal slices of the tensor $\hat{\mathcal{T}}_{s,e}(\omega)\in\mathbb{R}^{p\times p\times(e-s)}$ , for technical reasons. Specifically, let $\hat{\mathcal{T}}_{s,e}^{\dagger}(\omega)\in\mathbb{R}^{p\times p\times(e-s-2\nu)}$ be its trimmed version by discarding $\nu\asymp{R\log(Np)}/{L}\leq\log(Np)$ slices at each boundary, based on which change point detection is proceeded. For the single-change-point setting, Algorithm 1 is implemented with $s=1$ and $e=B$ .

Algorithm 1 Tensor power with truncated matrix power algorithm.

\hat{\mathcal{T}}_{s,e}(\omega)\in\mathbb{R}^{p\times p\times(e-s)}

k_{0}\leq k\leq p

\hat{\bm{\gamma}}_{s,e}(\omega)=\bm{\gamma}^{(j)}(\omega)

3:Let

j=1

and initialize the unit vector

\bm{\gamma}^{(0)}(\omega)\in\mathbb{R}^{p}

4:repeat

5: Compute

\bm{\alpha}^{(j)}(\omega)=\mathrm{Norm}(\hat{\mathcal{T}}_{s,e}^{\dagger}(\omega)\times_{1}{\bm{\gamma}}^{(j-1)}(\omega)\times_{2}{\bm{\gamma}}^{(j-1)}(\omega))

6: Compute

\mathbf{D}^{(j)}(\omega)=\hat{\mathcal{T}}_{s,e}^{\dagger}(\omega)\times_{3}{\bm{\alpha}}^{(j)}(\omega)

7: Let

i=1

and

\bm{\gamma}^{(j,0)}(\omega)=\bm{\gamma}^{(j-1)}(\omega)

8: repeat

9: Compute

\bm{\gamma}^{(j,i)}(\omega)=\mathrm{Norm}(\mathrm{Trun}(\mathrm{Norm}(\mathbf{D}^{(j)}(\omega)\bm{\gamma}^{(j,i-1)}(\omega)),k))

, update

i\leftarrow i+1

10: until Convergence

11: Let

\bm{\gamma}^{(j)}(\omega)=\bm{\gamma}^{(j,i)}(\omega)

, update

j\leftarrow j+1

12:until Convergence

Algorithm 1 is a tensor power method, with $\bm{\alpha}^{(j)}(\omega)$ and $\bm{\gamma}^{(j)}(\omega)$ alternately updated according to lines 3 and 9. Specifically, $\bm{\gamma}^{(j)}(\omega)$ is updated by a truncated matrix power method (Yuan and Zhang, 2013), under the special case here that the matrix $\mathbf{D}^{(j)}(\omega)$ is dynamic and updated along the outer iteration. The iteration may be terminated by keeping a pre-specified maximal number of iterations, or if the changes in the $\bm{\alpha}^{(j)}(\omega)$ ’s and $\bm{\gamma}^{(j)}(\omega)$ ’s are below some threshold. For initialization $\bm{\gamma}^{(0)}(\omega)$ , refer to Section LABEL:sec:initial for two approaches we provide, namely, the DSPCA approach and the truncated PCA approach. The former has theoretical guarantee and the latter is computationally efficient. Although there are two levels of iterations, the outer iteration for $\bm{\alpha}^{(j)}(\omega)$ converges very quickly thanks to the identical third mode, thus a few outer iterations would work, which we explain in Remark LABEL:rem:alpha. The convergence of $\bm{\gamma}^{(j)}(\omega)$ output by this algorithm is theoretically guaranteed by Theorem 4.1, where the benefit of exploring the sparsity structure is also illustrated.

3.4 Change point estimation

Now, based on the CUSUM tensor $\hat{\mathcal{T}}_{s,e}(\omega)$ , on interval $[s,e]$ , we could obtain the estimated projector $\hat{\bm{\gamma}}_{s,e}(\omega)$ , and thus project $\hat{\mathcal{T}}_{s,e}(\omega)$ into a vector. Theoretically speaking, it is preferable to implement the projector estimation and the projection-enabled change point detection with two independent co-spectrum tensors, with which the consistency of the proposed change-point detection method (see Theorem 4.2) can be established. Below, we assume the existence of two independent and identically distributed co-spectrum tensors $\hat{\mathcal{F}}(\omega)$ and $\hat{\mathcal{F}}^{\prime}(\omega)$ ; see Algorithm 2. Note that the output $\hat{\sigma}_{s,e}(\omega)>0$ since $\hat{\mathcal{F}}(\omega)$ is positive semidefinite, and it accounts for the variation of the projected spectral density.

Algorithm 2 Independent projection based on CUSUM.

\hat{\mathcal{F}}(\omega)\in\mathbb{R}^{p\times p\times B}

k_{0}\leq k\leq p

s

e

\hat{\bm{\gamma}}_{s,e}(\omega)

\hat{\sigma}_{s,e}(\omega)

\hat{\mathcal{T}}^{*}_{s,e}(\omega)

3:Let

\hat{\mathcal{F}}^{\prime}(\omega)

be an independent copy of

\hat{\mathcal{F}}(\omega)

and compute the CUSUM tensors

\hat{\mathcal{T}}^{\prime}_{s,e}(\omega)

and

\hat{\mathcal{T}}_{s,e}(\omega)

by (3.2) respectively with

\hat{\mathcal{F}}^{\prime}(\omega)

and

\hat{\mathcal{F}}(\omega)

4:Implement Algorithm 1 with input

\hat{\mathcal{T}}^{\prime}_{s,e}(\omega)

and

k

to obtain

\hat{\bm{\gamma}}_{s,e}(\omega)

5:Obtain the projected CUSUM vector

\hat{\mathcal{T}}^{*}_{s,e}(\omega)

and a scaling factor

\hat{\sigma}_{s,e}(\omega)

	$\displaystyle\hat{\mathcal{T}}^{*}_{s,e}(\omega)$	$\displaystyle\equiv\hat{\mathcal{T}}_{s,e}(\omega)\times_{1}\hat{\bm{\gamma}}_{s,e}(\omega)\times_{2}\hat{\bm{\gamma}}_{s,e}(\omega),$
	$\displaystyle\hat{\sigma}_{s,e}(\omega)$	$\displaystyle\equiv\frac{1}{e-s+1}\sum_{b=s}^{e}\hat{\mathcal{F}}_{::b}(\omega)\times_{1}\hat{\bm{\gamma}}_{s,e}(\omega)\times_{2}\hat{\bm{\gamma}}_{s,e}(\omega).$

Remark 3.1.

We remark on possible approaches to obtain two independent and identically distributed co-spectrum tensors $\hat{\mathcal{F}}(\omega)$ and $\hat{\mathcal{F}}^{\prime}(\omega)$ . For independent data, two independent samples may be obtained by sample splitting (Wang and Samworth, 2018), which is not applicable for our time series data. However, thanks to the asymptotic independence of Fourier transform, we could do the splitting on frequency domain, and we provide two choices of obtaining (asymptotically) independent $\hat{\mathcal{F}}(\omega)$ and $\hat{\mathcal{F}}^{\prime}(\omega)$ . First, by choosing $\Omega$ to be $\{\omega_{l}=2\pi l/R,1\leq l\leq\lceil R/2\rceil\}$ , for each $\omega_{l}$ , $\hat{\mathcal{F}}(\omega_{l-1})$ (or $\hat{\mathcal{F}}(\omega_{l+1})$ ) could be used as $\hat{\mathcal{F}}^{\prime}(\omega_{l})$ , an asymptotic independent copy of $\hat{\mathcal{F}}(\omega_{l})$ , assuming a slowly varying population spectrum across the frequency range. For the second approach, let $\Omega$ be $\{2\pi l/L,1\leq l\leq\lceil L/2\rceil\}$ . It is well-known that the lag-window estimate (3.1) could be approximated by a corresponding spectral window estimate, i.e., weighted average of the periodograms over Fourier frequencies $\Omega$ , where two independent estimates can be obtained by calculating the weighted average of periodograms on disjoint subset of Fourier frequencies. In practice, to fully leverage the signal in the data and enjoy the flexibility of specifying frequencies, we recommend using $\hat{\mathcal{F}}(\omega)$ for both the projector estimation and change point detection.

Based on Algorithm 2, we could detect and estimate the single-change-point location at frequency $\omega$ by assessing the value of $|\hat{\mathcal{T}}^{*}_{s,e}(\omega)|$ . In practice, the data may sustain multiple change points. Also, as previously discussed, an overall conclusion of change points across frequencies is desired. We combine the threshold $l_{1}$ aggregation (Cho and Fryzlewicz, 2015) for multiple frequencies and wild binary segmentation (Fryzlewicz, 2014) for multiple change points, which will be referred to as the wild sparsified binary segmentation. Specifically, we apply binary segmentation to isolate individual change points within random windows over which we aggregate the information across frequencies. After obtaining $\hat{\mathcal{T}}^{*}_{s,e}(\omega)$ and $\hat{\sigma}_{s,e}(\omega)$ for each Fourier frequency $\omega\in\Omega$ by Algorithm 2, let $\hat{\mathcal{F}}\in\mathbb{R}^{p\times p\times B\times|\Omega|}$ be a 4-order tensor with $\hat{\mathcal{F}}_{:::l}=\hat{\mathcal{F}}(\omega_{l})$ for $1\leq l\leq|\Omega|$ . We define a thresholded sum of the CUSUM across all frequencies, as follows:

\mathfrak{C}_{s,e}(\hat{\mathcal{F}})=\sum_{\omega\in\Omega}\frac{|\hat{\mathcal{T}}^{*}_{s,e}(\omega)|}{\hat{\sigma}_{s,e}(\omega)}\mathbb{I}\left(\frac{|\hat{\mathcal{T}}^{*}_{s,e}(\omega)|}{\hat{\sigma}_{s,e}(\omega)}>\tau\right),

(3.6)

where $\mathbb{I}$ is the indicator function, and $\tau$ is the threshold. The threshold step reduces the influence of irrelevant noisy parts, and can consistently detect change points.

To implement the wild binary segmentation, we first randomly sample a large number of pairs, $(s_{1},e_{1}),\ldots,(s_{J},e_{J})$ , uniformly from the set $\{(s,e):1\leq s<e\leq B\}$ . Then, for each $1\leq j\leq J$ , $\mathfrak{C}_{s_{j},e_{j}}(\hat{\mathcal{F}})$ is computed via (3.6). Denote the maximizer of $\mathfrak{C}_{s_{j},e_{j}}(\hat{\mathcal{F}})$ over $[{s}_{j}+\nu,{e}_{j}-\nu]$ by $\hat{u}^{[j]}$ , and the corresponding maximum CUSUM value is $\mathfrak{C}_{s_{j},\hat{u}^{[j]},e_{j}}(\hat{\mathcal{F}})$ . Suppose $\mathfrak{C}_{s_{j},\hat{u}^{[j]},e_{j}}(\hat{\mathcal{F}}),1\leq j\leq J$ is maximized at $j=\hat{j}$ . If $\mathfrak{C}_{s_{j},\hat{u}^{[\hat{j}]},e_{j}}(\hat{\mathcal{F}})$ is not above 0, conclude that no change point is found, otherwise we declare $\hat{u}^{[\hat{j}]}$ a change point and repeat the above procedure recursively on each segment to the left and right of $\hat{u}^{[\hat{j}]}$ to find other change points. Thus, we carry out change-point detection recursively, with the search restricted to those initial time windows $(s_{j},e_{j}),1\leq j\leq J$ that lie in an interval under search. The algorithm is detailed in Algorithm 3, with the key steps in lines 6-9 depicted in Figure 3.2.

Algorithm 3 Multiple spectral change points detection algorithm.

\hat{\mathcal{F}}\in\mathbb{R}^{p\times p\times B\times|\Omega|}

k_{0}\leq k\leq p

J\in\mathbb{N}

\tau

\hat{\mathbb{U}}

3:Set

\hat{\mathbb{U}}=\emptyset

, randomly draw

(s_{1},e_{1}),\ldots,(s_{J},e_{J})

, uniformly from

\{(s,e):1\leq s<e\leq B\}

4:Run SpecCp(

\hat{\mathcal{F}},0,B

)

5:function SpecCp(

\hat{\mathcal{F}},s,e

)

6: Set

\mathcal{J}_{s,e}\equiv\{j:[s_{j},e_{j}]\subseteq[s,e],e_{j}-s_{j}+1>2\nu\}\cup\{0\}

where

s_{0}=s

and

e_{0}=e

7: for

j\in\mathcal{J}_{s,e}

8: for

\omega\in\Omega

9: Run Algorithm 2 with

\hat{\mathcal{F}}(\omega)

k

s_{j}

e_{j}

as input, and let

\hat{\bm{\gamma}}^{[j]}(\omega)

\hat{\sigma}_{s_{j},e_{j}}(\omega)

\hat{\mathcal{T}}^{*}_{s_{j},e_{j}}(\omega)

be the output

10: end for

11: Compute

\mathfrak{C}_{s_{j},e_{j}}(\hat{\mathcal{F}})

by (3.6)

12: end for

13: Let

(j_{0},\hat{u})=\arg\max_{j\in\mathcal{J}_{s,e},{s}_{j}+\nu\leq u\leq{e}_{j}-\nu}\mathfrak{C}_{s_{j},u,e_{j}}(\hat{\mathcal{F}})

14: if

\mathfrak{C}_{s_{j_{0}},\hat{u},e_{j_{0}}}(\hat{\mathcal{F}})>0

then

15: Add

\hat{u}

to the set

\hat{\mathbb{U}}

16: SpecCp(

\hat{\mathcal{F}},s,\hat{u}

) and SpecCp(

\hat{\mathcal{F}},\hat{u}+1,e

)

17: end if

18:end function

4 Theoretical properties

In this section, we establish the convergence of Algorithm 1 for sparse tensor decomposition and Algorithm 3 for change point detections. Let the window size in (3.1) be $R\asymp(\frac{N\delta}{\log(Np)})^{1/3}$ . This choice is to the balance the bias variance trade-off in the CUSUM of the spectrum estimation. We introduce some assumptions.

Assumption 3.

The $\mathbf{Y}$ ’s are i.i.d. $p$ -dimensional Gaussian random vectors of zero mean and identity covariance matrix.

Assumption 4.

There exists $\delta>0$ such that for all $q=1,\ldots,Q$ , $\min(v_{q}-v_{q-1})\geq\delta$ .

Assumption 5.

Let $\phi=\left(\frac{\log(Np)}{N}\right)^{1/3}$ . There exists a large enough absolute constant $c>1$ such that (i) $\delta\geq c\phi^{3}$ , (ii) $\delta^{5/6}\underline{\Delta\lambda}\geq ck_{0}\sqrt{\phi}$ , and (iii) $c\left(\frac{N\delta}{\log(Np)}\right)^{1/3}\leq L\leq N\delta$ .

The Gaussian Assumption 3 simplifies the non-asymptotic development, which could be relaxed to existence of finite polynomial moments; see Zhang and Wu (2021). In Assumption 4, if the number of change points is finite, then $\delta$ can be taken as some positive fraction, otherwise, it decreases to zero with increasing $N$ . Assumption 5 (i) allows the high dimension setting with $p$ increasing with $N$ at some exponential rate so long as $\log(p)\leq\delta N-\log(N)$ . Also, as a result, the trimming in Algorithm 1 satisfies $\nu\asymp{R\log(Np)}/{L}\lesssim\delta B$ , hence no change points near the boundary will be missed due to trimming. Assumption 5 (ii) imposes a lower bound on the signal strength by measuring the product $\delta^{5/6}\underline{\Delta\lambda}$ . Note that if the time-series dimension $p$ increases sufficiently slowly as $N\to\infty$ , for instance, if $\log(p)=o(N)$ , then $\phi\to 0$ and so does the lower bound. Also, (iii) means that the block length $L$ should be smaller than the shortest inter-change-point distance but large enough to enable accurate per-block spectral density estimates. In sum, these are mild regularity assumptions.

Next we show the asymptotic validity of Algorithm 1 for the single-change-point setting, i.e. $s=1$ and $e=B$ .

Theorem 4.1.

Suppose Assumptions 1-5 hold. Let $Q=1$ , $s=1$ , $e=B$ . Consider some frequency $\omega\in\mathbb{F}_{1}$ . Denote $\lambda(\omega)=|\lambda_{1}(\omega)|>0$ . Assume $k$ satisfies

\left(1+2(k_{0}/k)^{1/2}+2k_{0}/k\right)\eta^{2}(\omega)<1,\quad\eta(\omega)\equiv\frac{\lambda(\omega)-(1-c_{1})\Delta\lambda(\omega)}{\lambda(\omega)-c_{1}\Delta\lambda(\omega)}.

(4.1)

with some constant $0<c_{1}<1/2$ . Assume the initial unit vector $\bm{\gamma}^{(0)}(\omega)$ is close enough to $\bm{\gamma}_{1}(\omega)$ such that $|\bm{\gamma}_{1}(\omega)^{\top}\bm{\gamma}^{(0)}(\omega)|$ is larger than some quantity defined in (LABEL:eq:th1_gamma). Let $a=2k+k_{0}$ with $k\gtrsim k_{0}$ . Then, with probability at least $1-N^{3-c_{2}}p^{2-c_{2}}$ for some $c_{2}>3$ , we have

\sqrt{1-\left|\bm{\gamma}_{1}^{\top}(\omega)\hat{\bm{\gamma}}_{1,B}(\omega)\right|}\leq O\left(\frac{a\phi}{\delta^{5/6}\underline{\Delta\lambda}}\right),

(4.2)

Assumption 5 (ii) implies that the RHS (4.2) is further bounded by $\sqrt{\phi}$ . Under the condition that $a\asymp k_{0}$ is far less than $p$ , the fact that this order grows linearly in $k_{0}$ illustrates the potential gain by taking into account of the sparsity in $\bm{\gamma}_{1}(\omega)$ , which is corroborated by the simulation results reported below. The consistency of the projector obtains if $\phi\to 0$ , e.g., if $\log(p)=o(N)$ ; hence the projector casts light on the importance of each component series in the structure break.

The following results state a non-asymptotic probability bound regarding the consistency of the proposed change-point detection method. Let $\mathbb{\hat{U}}$ denote the set of all change points detected by Algorithm 3, with $\hat{Q}=|\mathbb{\hat{U}}|$ equal to the number of detected change points. Sort the change point estimates (in ascending order) as $\hat{u}_{q}$ for $q=1,\ldots,\hat{Q}$ . Recalling that the trimming is $\nu\asymp{R\log(Np)}/{L}$ .

Theorem 4.2.

Suppose Assumptions 1-5 and the conditions in Theorem 4.1 hold. Assume $\max_{j\in\mathcal{J}}|e_{j}-s_{j}+1|\leq c_{1}\delta B$ for an absolute constant $c_{1}>0$ , and the number of random intervals $J$ is large enough that $J\geq 9\log(B\delta^{-1})/\delta^{2}$ . Let ${\epsilon}={k^{2}\nu}/{\underline{\lambda}^{2}}\leq{k^{2}\log(Np)}/{\underline{\lambda}^{2}}$ . Suppose there exists a positive constant $c_{2}$ such that the CUSUM threshold satisfies $\sqrt{\epsilon}<c_{2}\tau<\sqrt{\delta B}\underline{\lambda}/k$ . Then, with probability at least $1-3N^{3-c_{3}}p^{2-c_{3}}-\delta^{-1}\exp(-\delta^{2}J/9)$ for some constant $c_{3}>3$ , we have $\hat{Q}={Q}$ and $\max_{q=1,\ldots,\hat{Q}}|\hat{u}_{q}-{u}_{q}|<c_{4}\epsilon$ with some constant $c_{4}>0$ .

Note that the error is linear in $k^{2}$ , which is of the same order of $k_{0}^{2}$ with $k\gtrsim k_{0}$ . Again, it emphasizes the advantage of accounting for the sparsity. Wang et al. (2021) established the minimax lower bounds of the localization error for the problem of covariance change point for independent high dimensional sub-Gaussian data. For our result, the localization error on the original scale is bounded by $\epsilon L={k^{2}\log(Np)R}/{\underline{\lambda}^{2}}$ and we comment that this matches the minimax rate up to $\log(Np)R$ , where $R$ is due to the existence of autocovariance matrices of order $-R$ to $R$ in (3.1).

5 Empirical performance

5.1 Parameter guidance

We discuss in this section some aspects about the implementation of Algorithm 3, including hyper-parameters choices and practical techniques, based on our limited experience gained from the simulation experiments reported below.

The choice of block length $L$ could be set based on the user’s preference. For example, certain neuroscience data may be segmented into uniform time intervals. By Assumption 4, $L$ should enable accurate per-block spectral density estimates while not containing more than one change point. In practice, we recommend setting $L$ to be between 50 and 100. Since $R\geq(L/\log(Np))^{1/3}$ by Assumption 5 (iii), we find $R=\lfloor L^{1/3}\rfloor\lor 1$ works well for our change point detection. For Algorithms 1 and 3, we specify a trimming distance $\nu\asymp{R\log(Np)}/{L}\geq(\log(Np)/{L})^{2/3}$ for technical reasons. In practice, we set $\nu=0$ in Algorithm 1 to exploit full information in finding the projector, and $\nu=\lfloor(B\log(Np))^{2/3}/15\rfloor\lor 1$ in Algorithm 3. Also, to avoid spurious change point detection, we require $\mathfrak{C}_{s_{j},e_{j}}(\hat{\mathcal{F}})$ to stay positive in a small section around the detected change point. Specifically, let $\hat{u}^{[j]}=\arg\max_{u}\mathfrak{C}_{s_{j},u,e_{j}}(\hat{\mathcal{F}})$ within the set $\{u:\min(u-s_{j},e_{j}-u)>\nu,\mathfrak{C}_{s_{j},b,e_{j}}(\hat{\mathcal{F}})>0,|b-u|<\nu/4\}$ .

As for the sparsity $k$ , we use a heuristic approach to determine an upper bound of the number of series components sustaining at least one change. We do this by implementing Algorithm 3 with each univariate component time series of $\mathbf{X}$ , where projection is not needed. Then, we set $k$ to be the number of such time series found to have some change points. The series-by-series approach tends to find spurious change points for stationary time series, making it non-competitive as a standalone method but providing a reasonable choice of $k$ for our proposed method since Theorem 4.2 requires $k$ to be greater than $k_{0}$ . We recommend the wild interval number $J=500$ . In the simulation studies, we found that even setting $J=30$ yielded satisfactory results.

In implementing Algorithm 3 for $\mathbf{X}$ , we determine the threshold $\tau$ to be frequency-specific as $\tau(\omega)$ for $\omega\in\Omega$ via bootstrap as follows: Generate bootstrap samples of $\hat{\mathcal{F}}(\omega)$ by re-sampling with replacement the block indexes. For each such bootstrap sample, run Algorithm 2 with $s=1$ and $e=B$ , and calculate the value $\max_{s\leq u\leq e}|\hat{\mathcal{T}}^{*}_{s,u,e}(\omega)|/\hat{\sigma}_{s,e}(\omega)$ . Take an upper quantile (97.5%) of these bootstrap values as $\tau(\omega)$ .

Here we discuss two techniques to further enhance robustness against outliers and heavy tailed data. A challenge arises from inflated block spectral density estimates in the presence of outliers, resulting in an elevated threshold $\tau(\omega)$ thence lower power to change point detection. One approach to solving this problem is to determine the threshold via bootstrap that samples blocks whose spectral norm of the spectrum (averaged across frequencies) falls below 90% of all blocks. Another technique is to apply a nonparametric transformation to make the data marginally normal, via applying an instantaneous normal quantile transformation co-ordinate by co-ordinate. Specifically, let $F_{N}$ be the empirical marginal cumulative distribution function (CDF) of a particular component, and $\Phi$ the CDF of the standard normal distribution. Then the normal quantile tranformation is effected by the function $\Phi^{-1}\circ\{F_{N}-1/(2N)\}$ . For normally distributed data, the transformation is akin to standardization. We implemented the former technique in all the numerical results, and we show the benefit of the latter technique for heavy-tailed data in Supplementary LABEL:sec:auxsimulation.

5.2 Simulation scenarios

The main objectives of this simulation section are: (i) to evaluate the performance of Algorithm 1, (ii) to evaluate the performance of Algorithm 3, and (iii) to show the benefit of exploiting the sparsity structure.

We consider three data generating processes (DGPs) for verifying these three objectives. Recalling that $\mathbb{S}$ is defined as the CP series index set of cardinality $k_{0}$ .

\displaystyle\textbf{DGP1 (factor):}\qquad\mathbf{X}_{n}=\left\{\begin{matrix}&\mathbf{Z}_{n},&0<n/N\leq v_{1},\\ &\mathbf{Z}_{n}+\bm{\gamma}W_{n},&v_{1}<n/N\leq 1,\\ \end{matrix}\right.

where $\bm{\gamma}$ is a unit $p$ vector with $\bm{\gamma}_{i}=1/\sqrt{k_{0}}$ for $i\in\mathbb{S}$ and 0 otherwise, and $W_{n}={\epsilon}_{n}+\phi{\epsilon}_{n-1}$ is a univariate MA(1) process with i.i.d. normal errors ${\epsilon}_{n}\sim\mathcal{N}(0,\sigma^{2})$ independent of $\mathbf{Z}_{n}$ . $\mathbf{Z}_{n}$ follows a VMA(1) model with the VMA coefficient matrix $\mathrm{diag}(0.6\times\mathbf{1}_{p})$ , and i.i.d. errors from $\mathcal{N}({\bm{0}_{p},\mathbf{\Sigma}_{p\times p}})$ with $\mathbf{\Sigma}_{ii}=2$ and $\mathbf{\Sigma}_{ij}=0.2$ for $i\neq j$ , where $\mathbf{1}_{p}$ is the $p$ -length vector of all 1s.

\displaystyle\textbf{DGP2 (VMA):}\qquad\mathbf{X}_{n}=\left\{\begin{matrix}&\bm{\epsilon}_{n}+\mathbf{\Phi}^{(1)}\bm{\epsilon}_{n-1},&v_{q-1}<n/N\leq v_{q}\text{ for odd }q,\\ &\bm{\epsilon}_{n}+\mathbf{\Phi}^{(2)}\bm{\epsilon}_{n-1},&v_{q-1}<n/N\leq v_{q}\text{ for even }q,\\ \end{matrix}\right.

where $\mathbf{\Phi}^{(1)}=\mathrm{diag}(0.6\times\mathbf{1}_{p})$ , and $\mathbf{\Phi}^{(2)}=\mathbf{\Phi}^{(1)}$ except that $\mathbf{\Phi}^{(2)}_{ii}=-0.6$ for $i\in\mathbb{S}$ .

\displaystyle\textbf{DGP3 (VAR):}\qquad\mathbf{X}_{n}=\left\{\begin{matrix}&\bm{\epsilon}_{n}+\mathbf{\Psi}_{1}\mathbf{X}_{n-1}+\mathbf{\Psi}_{2}^{(1)}\mathbf{X}_{n-2},&v_{q-1}<n/N\leq v_{q}\text{ for odd }q\\ &\bm{\epsilon}_{n}+\mathbf{\Psi}_{1}\mathbf{X}_{n-1}+\mathbf{\Psi}_{2}^{(2)}\mathbf{X}_{n-2},&v_{q-1}<n/N\leq v_{q}\text{ for even }q,\\ \end{matrix}\right.

where $\mathbf{\Psi}_{1}=\mathrm{diag}(0.1\times\mathbf{1}_{p})$ , $\mathbf{\Psi}_{2}^{(1)}=\mathrm{diag}(0.4\times\mathbf{1}_{p})$ , and $\mathbf{\Psi}_{2}^{(2)}=\mathbf{\Psi}_{2}^{(1)}$ except that $\mathbf{\Psi}_{2,ii}^{(2)}=-0.7$ for $i\in\mathbb{S}$ . For both DGP 2 and 3, on all segments, the innovations $\{\bm{\epsilon}_{n}\}$ are i.i.d. $\mathcal{N}({0_{p},\mathbf{\Sigma}_{p\times p}})$ with $\mathbf{\Sigma}_{ii}=1$ , and $\mathbf{\Sigma}_{ij}=0.2$ for $i\neq j$ .

We comment on the design of these DGPs. DGP 1 follows (2.1), and hence $\mathbf{g}(\omega)$ defined in (2.3) equals the spectrum of $\bm{\gamma}W_{n}$ , with the eigenvalue (spectral norm) $\lambda_{1}(\omega)$ being the spectral density of $W_{n}$ , see Figure 5.1. And it exactly satisfies the sparsity condition in Assumption 2 with the vector $\gamma$ of cardinality $k_{0}$ . This DGP facilitates straightforward manipulation of the spectral norm and the sparse projection, and hence will be used to validate objectives (i) and (iii). For simplicity, we consider the single-change-point setting here.

DGPs 2 and 3 consider commonly-used VMA and VAR models with coefficients changes in the multiple-change-points setting, respectively. Note that, given the contemporaneous correlation among elements of $\bm{\epsilon}_{n}$ , the structure break of the spectrum also exists in components outside $\mathbb{S}$ . Hence, these two DGPs belong to the “approximately” sparse case, which allow us to assess the generality of our methodology. What’s more, for DGP2, when $\mathbb{S}$ contains all the components, the two sub-models with $\mathbf{\Phi}^{(1)}$ and $\mathbf{\Phi}^{(2)}$ are not distinguishable in terms of the covariance matrix, indicating the necessity and importance of our method for exploiting the spectrum structure.

We contrast the empirical performances of the proposed method (SCP) with several state-of-art methods including: SBS-LSW (Cho and Fryzlewicz, 2015), which uses the sparsified binary segmentation with the locally stationary wavelet periodogram; the factor-adjusted vector autoregressive method (Cho et al., 2023), which detects structure break in both the factor driven component (FVAR-c) and the idiosyncratic VAR process (FVAR-i); the FreSpeD method (Schröder and Ombao, 2019), which detects change points in each auto-spectra, followed by a cluster postprocessing procedure. We also include the series-by-series method as a benchmark, which implements Algorithm 3 component-wise and aggregates the found change points via clustering. See Supplementary Section LABEL:sec:auxsimulation for further implementation details of these methods.

For all simulation results, each experiment was replicated 500 times. We set the frequency set as a subset of the Fourier frequencies, which is $\Omega=\{\pi l/\lfloor L/4\rfloor,1\leq l\leq\lfloor L/4\rfloor\}$ .

5.3 Simulation results

For DGP1, we consider $p=80$ , $N=6000$ , $L=75$ and $v_{1}=0.5$ . Two sparsity levels with $k_{0}\in\{3,\lfloor\sqrt{p}\rfloor\}$ , i.e., $\{3,8\}$ , and $\mathbb{S}$ is the set of the CP series that are evenly distributed across the $p$ dimensions. We first examine the performance of Algorithm 1 by evaluating $|\bm{\gamma}(\omega)^{\top}\hat{\bm{\gamma}}_{1,B}(\omega)|$ . For $W_{n}$ in DGP1, we set $\phi=-0.6$ and $\sigma^{2}=0.6$ . Figure 5.1 plots the mean absolute inner product against frequency. Besides $k=k_{0}$ , we also consider $k=p$ as a benchmark. Fig. 5.1 shows that the inner product curves bear resemblance to the signal magnitude $\lambda_{1}(\omega)$ . Over the high frequency region when the signal is strong, the mean absolute inner product is generally close to 1 for $k=k_{0}$ , indicating almost perfect alignment with its true counterpart $\bm{\gamma}$ in DGP1. The inferior performance with $k=p$ shows the benefit of exploring the sparsity structure.

We also check the performance of change point detection for DGP1. Specifically, for $W_{n}$ in DGP1, let $\phi=-0.3$ and $\sigma^{2}=0.6$ or $\sigma^{2}=5$ , respectively representing weak and strong signals. Again, we consider $k_{0}\in\{3,8\}$ . In Table 1, we report the simulation results of the estimated change point numbers and locations. We show the frequencies of the estimated numbers being less than, equal to or greater than the true value. As for the change point location, we use the adjusted Rand index (ARI) of the estimated partition of the time points into stationary segments against the true partition to measure the performance, since change point estimation may be viewed as a special case of classification.

Table 1: Simulation results for DGP1

	$\sigma^{2}=0.6,k_{0}=3$				$\sigma^{2}=5,k_{0}=8$
	$\hat{q}<q$	$\hat{q}=q$	$\hat{q}>q$	ARI	$\hat{q}<q$	$\hat{q}=q$	$\hat{q}>q$	ARI
SCP	0.00	0.91	0.09	0.95	0.00	0.99	0.01	0.99
SBS-LSW	0.13	0.80	0.07	0.73	0.00	0.90	0.10	0.98
Series-by-series	0.00	0.06	0.94	0.65	0.00	0.14	0.86	0.70
FVAR-c	1.00	0.00	0.00	0.00	0.04	0.96	0.00	0.94
FVAR-i	1.00	0.00	0.00	0.00	1.00	0.00	0.00	0.00
FreSpeD	0.32	0.42	0.26	0.34	0.00	0.03	0.97	0.70
SCP ( $k=k_{0}$ )	0.00	0.92	0.08	0.95	0.00	0.97	0.03	0.99
SCP ( $k=p$ )	0.31	0.68	0.01	0.59	0.00	1.00	0.00	1.00

Table 1 shows that our method is very competitive in terms of accuracy regarding the number and location of the change points. To compare, the SBS-LSW and FVAR method behave well in the strong signal case ( $\sigma^{2}=5,k_{0}=8$ ), but have a tendency to miss the change point in the weak signal case ( $\sigma^{2}=0.6,k_{0}=3$ ). As for the series-by-series and the FreSpeD approaches, both employing a procedure where marginal detection is conducted first followed by clustering, they have a tendency to overestimate change points. Particularly, these methods may introduce spurious change points whenever any component detects one, rendering it less effective in high dimensions. In contrast, with information integration over the entire matrix, our method is more stable against spurious change points. Our SCP method sets the sparsity parameter $k$ via the data-driven approach elaborated in Section 5.1. To compare, we have also examined the performance when $k$ is set to the true $k_{0}$ or $p$ . It turns out that our method performs as well as the one with true $k_{0}$ . However, using $k=p$ may result in significant performance loss, particularly when the signal is weak, possibly due to the aggregation of errors. This once again underscores the advantages of exploring the sparsity structure inherent in the data.

We consider multiple change points for DGP2 and DGP3 with $q=4$ equally spaced change points. For $p=80$ , we consider different sparsity $k_{0}$ from $\{3,\lfloor\sqrt{p}\rfloor,{p}/2,p\}$ , i.e., $\{3,8,40,80\}$ . The elements of the CP series index set $\mathbb{S}$ are evenly distributed across the $p$ dimensions. We consider $N=6000,12000$ with $L=75$ , and we only show the simulation results for DGP2 with $N=6000$ and DGP3 with $N=12000$ in Tables 2 and 3; see the other two tables in the Supplementary Section LABEL:sec:auxsimulation.

Table 2: Simulation results for DGP2 with

N=6000

$N$	$k_{0}$	$\hat{q}<q$	$\hat{q}=q$	$\hat{q}>q$	ARI	$\hat{q}<q$	$\hat{q}=q$	$\hat{q}>q$	ARI	$\hat{q}<q$	$\hat{q}=q$	$\hat{q}>q$	ARI
		SCP				SBS-LSW				Series-by-series
	3	0.02	0.98	0.00	0.98	0	0.82	0.18	0.98	0	0.19	0.81	0.90
	8	0.01	0.99	0.00	0.99	0	0.76	0.24	0.98	0	0.21	0.79	0.90
	40	0.00	1.00	0.00	1.00	0	0.31	0.69	0.95	0	0.42	0.58	0.91
	80	0.00	1.00	0.00	1.00	0	0.11	0.89	0.92	0	0.92	0.08	0.95
		FVAR-c				FVAR-i				FreSpeD
	3	1.00	0.00	0.00	0.00	0	1.00	0.00	0.98	0	0.08	0.92	0.89
	8	0.93	0.07	0.00	0.33	0	1.00	0.00	0.98	0	0.01	0.99	0.80
	40	0.00	0.92	0.08	0.97	0	1.00	0.00	0.98	0	0.00	1.00	0.64
6000	80	0.00	0.72	0.28	0.95	0	1.00	0.00	0.98	0	0.00	1.00	0.66

Table 3: Simulation results for DGP3 with

N=12000

$N$	$k_{0}$	$\hat{q}<q$	$\hat{q}=q$	$\hat{q}>q$	ARI	$\hat{q}<q$	$\hat{q}=q$	$\hat{q}>q$	ARI	$\hat{q}<q$	$\hat{q}=q$	$\hat{q}>q$	ARI
		SCP				SBS-LSW				Series-by-series
	3	0.0	1.00	0.00	0.99	0	0.66	0.34	0.96	0	0.29	0.71	0.92
	8	0.0	1.00	0.00	0.99	0	0.45	0.55	0.96	0	0.32	0.68	0.92
	40	0.0	1.00	0.00	0.99	0	0.02	0.98	0.88	0	0.54	0.46	0.94
	80	0.0	0.99	0.01	0.99	0	0.00	1.00	0.84	0	0.93	0.07	0.97
		FVAR-c				FVAR-i				FreSpeD
	3	1.0	0.00	0.00	0.03	1	0.00	0.00	0.00	0	0.58	0.42	0.97
	8	0.1	0.90	0.00	0.95	1	0.00	0.00	0.00	0	0.28	0.72	0.94
	40	0.0	0.69	0.31	0.96	1	0.00	0.00	0.00	0	0.00	1.00	0.83
12000	80	0.0	0.71	0.29	0.96	1	0.00	0.00	0.01	0	0.00	1.00	0.78

Across experiments, the proposed SCP can accurately estimate the number and location of changes across different scenarios. And it adapts to different sparsity levels. The FVAR method is very competitive for DGP2, but not for DGP3, where the signal seems weaker. We note that other methods may overestimate the change point numbers. Next we consider the CP frequency and CP series. Fig. 5.2 displays the results when $k_{0}=8$ . Clearly, the shape of the empirical distribution of the CP frequencies bears resemblance to that of the difference in the spectral densities pre- and post-change. And the projection correctly identifies the CP series.

In Supplementary Section LABEL:sec:auxsimulation, we report additional simulation results with different series lengths and dimensions. We also present the empirical performance of the proposed method against non-normal distributions.

6 Applications

We illustrate the proposed method with the difference of log returns of the daily adjusted closing values for a subset of the Standard and Poor’s 100 (S&P100) stocks, from November 1, 1999 to October 7, 2021, totalling $N=5520$ observations. Among the S&P100 stocks, we focus our analysis on the $p=79$ stocks with complete observations during this period. The dataset is publicly available in the $\mathtt{alphavantager}$ package of R.

To apply Algorithm 3, we set the block lengths to be $L=60$ , which is about three months. We choose other hyper-parameters according to the guidance in Section 5.1 and apply the normal quantile transformation. Four change points are found, namely, $\mathbb{U}=\{17,32,45,80\}$ , which respectively corresponds to 2003-11-20, 2007-06-21, 2010-07-27 and 2018-11-27 for the end of the corresponding block. Fig. 6.1 shows which series at which frequency underwent a change in their spectrum per change point, i.e., it displays the CP frequency set $\mathbb{F}_{q}$ and the corresponding frequency-specific CP series set $\mathbb{S}_{q}^{\omega}$ for $1\leq q\leq 4$ .

The four change points display quite different patterns. At the 1st change point, the changes emerge in part of the frequencies (high frequencies), and most affected series (CP series) are stocks from the Information Technology sector. This corresponds to the technology bubble burst, which heavily impacted IT companies. The 2nd and 3rd change points show similarity in terms of the broad CP frequency and evenly-spread CP series, with the Financial and Real Estate stocks having slightly greater weights. These two change points correspond to the start and end of the subprime mortgage crisis and global financial crisis. For the 4th change point, the changes occurred over part of the frequencies, and the Energy sector stands out as CP series. In 2019, there were significant fluctuations in oil prices due to various factors such as changes in global oil supply and demand dynamics and economic uncertainties. In this application, the CP series cast light on the processes underlying the found structural breaks.

7 Discussion

In this paper, we study the problem of change point detection within the general framework of a high dimensional piecewise stationary process, via frequency domain analysis. We accommodate both cross-sectional and temporal dependencies in the data as well as in the structural change. By studying the general framework of changes in the spectral density matrix, we encompass many standard parametric models, such as VAR and dynamic factor models. The structure change is allowed to be series sparse and frequency specific, which facilitates practical analysis to discern its origin and impact. Specifically, the benefit of exploring the sparse structure is shown via both theory and empirical evidence. We have shown that the proposed method, via projection based on a novel decomposition of the third-order CUSUM tensor, leads to stability and accuracy of the detected change points, as compared with existing methods.

A number of interesting problems await future research. There’s potential for a more comprehensive exploration of frequency-specific information, particularly in fields like neuroscience, which constitutes an ongoing project for us. Another topic is to detect change points with data admitting other structures associated to change points, for instance, time series may be divided into groups whose member series may likely experience a change together.

Supplemental Material

The Supplementary Material contains all proofs as well as further numerical discussions and results.

References

(1)
Adak (1998) Adak, S. (1998). Time-dependent spectral analysis of nonstationary time series, Journal of the American Statistical Association 93(444): 1488–1501.
Allen (2012) Allen, G. (2012). Sparse higher-order principal components analysis, Artificial Intelligence and Statistics, PMLR, pp. 27–36.
Aue et al. (2009) Aue, A., Hörmann, S., Horváth, L. and Reimherr, M. (2009). Break detection in the covariance structure of multivariate time series models, The Annals of Statistics 37(6B): 4046–4087.
Brillinger (2001) Brillinger, D. R. (2001). Time Series: Data Analysis and Theory, SIAM.
Chan et al. (2014) Chan, N. H., Yau, C. Y. and Zhang, R.-M. (2014). Group LASSO for structural break time series, Journal of the American Statistical Association 109(506): 590–599.
Cho and Fryzlewicz (2012) Cho, H. and Fryzlewicz, P. (2012). Multiscale and multilevel technique for consistent segmentation of nonstationary time series, Statistica Sinica 22(1): 207–229.
Cho and Fryzlewicz (2015) Cho, H. and Fryzlewicz, P. (2015). Multiple-change-point detection for high dimensional time series via sparsified binary segmentation, Journal of the Royal Statistical Society. Series B (Statistical Methodology) 77(2): 475–507.
Cho et al. (2023) Cho, H., Maeng, H., Eckley, I. A. and Fearnhead, P. (2023). High-dimensional time series segmentation via factor-adjusted vector autoregressive modelling, Journal of the American Statistical Association pp. 1–28.
Csörgö and Horváth (1997) Csörgö, M. and Horváth, L. (1997). Limit Theorems in Change-Point Analysis, Wiley Series in Probability and Statistics, Wiley, Chichester ; New York.
Davis et al. (2006) Davis, R. A., Lee, T. C. M. and Rodriguez-Yam, G. A. (2006). Structural break estimation for nonstationary time series models, Journal of the American Statistical Association 101(473): 223–239.
Enikeeva and Harchaoui (2019) Enikeeva, F. and Harchaoui, Z. (2019). High-dimensional change-point detection under sparse alternatives, The Annals of Statistics 47(4): 2051–2079.
Fryzlewicz (2014) Fryzlewicz, P. (2014). Wild binary segmentation for multiple change-point detection, The Annals of Statistics 42(6): 2243–2281.
Horváth and Hušková (2012) Horváth, L. and Hušková, M. (2012). Change-point detection in panel data, Journal of Time Series Analysis 33(4): 631–648.
Jirak (2015) Jirak, M. (2015). Uniform change point tests in high dimension, The Annals of Statistics 43(6): 2451–2483.
Kirch et al. (2015) Kirch, C., Muhsal, B. and Ombao, H. (2015). Detection of changes in multivariate time series with application to EEG data, Journal of the American Statistical Association 110(511): 1197–1216.
Kolda and Bader (2009) Kolda, T. G. and Bader, B. W. (2009). Tensor decompositions and applications, SIAM Review 51(3): 455–500.
Ombao et al. (2005) Ombao, H., von Sachs, R. and Guo, W. (2005). SLEX analysis of multivariate nonstationary time series, Journal of the American Statistical Association 100(470): 519–531.
Preuss et al. (2015) Preuss, P., Puchstein, R. and Dette, H. (2015). Detection of multiple structural breaks in multivariate time series, Journal of the American Statistical Association 110(510): 654–668.
Schröder and Ombao (2019) Schröder, A. L. and Ombao, H. (2019). FreSpeD: Frequency-specific change-point detection in epileptic seizure multi-channel EEG data, Journal of the American Statistical Association 114(525): 115–128.
Sun et al. (2017) Sun, W. W., Lu, J., Liu, H. and Cheng, G. (2017). Provable sparse tensor decomposition, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 79(3): 899–916.
Wang et al. (2021) Wang, D., Yu, Y. and Rinaldo, A. (2021). Optimal covariance change point localization in high dimensions, Bernoulli 27(1): 554–575.
Wang and Samworth (2018) Wang, T. and Samworth, R. J. (2018). High dimensional change point estimation via sparse projection, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 80(1): 57–83.
Yuan and Zhang (2013) Yuan, X.-T. and Zhang, T. (2013). Truncated power method for sparse eigenvalue problems, Journal of Machine Learning Research 14(Apr): 899–925.
Zhang and Wu (2021) Zhang, D. and Wu, W. B. (2021). Convergence of covariance and spectral density estimates for high-dimensional locally stationary processes, The Annals of Statistics 49(1): 233–254.