Break recovery in graphical networks with D-trace loss
Abstract
We consider the problem of estimating a time-varying sparse precision matrix, which is assumed to evolve in a piece-wise constant manner. Building upon the Group Fused LASSO and LASSO penalty functions, we estimate both the network structure and the change-points. We propose an alternative estimator to the commonly employed Gaussian likelihood loss, namely the D-trace loss. We provide the conditions for the consistency of the estimated change-points and of the sparse estimators in each block. We show that the solutions to the corresponding estimation problem exist when some conditions relating to the tuning parameters of the penalty functions are satisfied. Unfortunately, these conditions are not verifiable in general, posing challenges for tuning the parameters in practice. To address this issue, we introduce a modified regularizer and develop a revised problem that always admits solutions: these solutions can be used for detecting possible unsolvability of the original problem or obtaining a solution of the original problem otherwise. An alternating direction method of multipliers (ADMM) is then proposed to solve the revised problem. The relevance of the method is illustrated through simulations and real data experiments.
Key words: Alternating direction method of multipliers, Change-points, Dynamic network, Precision matrix, Sparsity.
1 Introduction
Much attention has been devoted to the development of methods for the detection of multiple change-points in the underlying data generating process of some realized sample. The model parameters are typically assumed to be constant in each regime but can experience “jumps” or “breaks” over time. The extraction of these breaks is usually performed through the application of some filtering techniques. To do so, a popular method is the fused LASSO which screens the existence of parameter breaks over the sample of observations. In particular, [17] developed a framework to recover multiple change-points for one-dimensional piece-wise constant signals. Then [4] extended this procedure to grouped parameters with the Group Fused LASSO in the context of autoregressive time series models. In the same spirit, [30] applied the Group Fused LASSO to linear regression models. This filtering technique has also been popular among the literature on change-points in panel data models: e.g., [31] aimed to identify structural changes in linear panel data regressions based on the adaptive Group Fused LASSO; using a similar filtering technique, [21] considered the estimation of structural breaks in panel models with interactive fixed effects.
This paper considers the detection of structural breaks in time series networks, whose structure is represented by the corresponding precision matrix. Some works have been dedicated to change-point detection for precision matrices. [46] assumed that the covariance matrix evolves smoothly over time. Contrary to the framework of [20], which focused on the detection of multiple breaks at a node level through the Group Fused LASSO, [33] restricted their framework to a single change point which impacts the global network structure. Our aim is to detect multiple change-points that affect the whole network. Thus, our viewpoint is similar to [16] and [13]: these works considered the Group Fused Graphical LASSO (GFGL) to detect breaks in the precision matrix. The latter filtering technique is a mixture of the Group Fused LASSO regularization of [2] and the LASSO regularization applied to the Gaussian likelihood function. We propose an alternative estimator to the commonly employed penalized Gaussian likelihood, which builds upon the D-trace loss of [44]. The formulation of the latter is much simpler than the Gaussian likelihood (in particular, the absence of a log-determinant term), thus allowing for a more direct theoretical analysis and implementation procedure. The D-trace loss and its extensions have been applied to diverse problems: while [40] considered network change analysis between two samples of vector-valued observations, [19] adapted the latter framework to differential network analysis for fMRI data; in the context of compositional data, [41] and [43] considered modified versions of the D-trace loss to estimate the high-dimensional compositional precision matrix under sparsity constraints; using the matching between the coefficients of SVAR processes and the entries of the precision matrix of the observed sample, [29] considered the sparse estimation of the latter based on the D-trace loss. In addition to the detection of structural breaks in the network, we allow for sparse dependence in the entries of the precision matrix in each regime. This motivates the use of the LASSO penalization function applied to each coefficient of the piece-wise constant precision matrix.
The corresponding estimation problem formulated in (2.1), Section 2, with the D-trace loss, includes two tuning parameters. In practice, the optimal parameters are unknown a priori and are expected to be estimated upon solving the estimation problems with a series of tuning parameters. However, we show that the problem with general tuning parameters may be unbounded from below (and hence, does not have solutions). Consequently, to identify the optimal parameters, one may encounter estimation problems that are unbounded from below, and the unboundedness can be numerically difficult to detect. To address this issue, we introduce a new regularizer, thereby constructing a revised problem that consistently has solutions regardless of the choice of the tuning parameters. In addition to the existence of a solution, this revised problem also enjoys several desirable properties. On the one hand, if the solutions to the revised problem possess some easy-to-detect patterns, then the original problem may not have solutions, and we can further update the tuning parameters towards obtaining reliable estimators. On the other hand, if the patterns are not detected, then the solutions to the revised problem also solve the original problem. We adapt the celebrated alternating direction method of multipliers (ADMM) to solve the revised problem, with its convergence guaranteed by [10]. ADMM is a widely used algorithm for solving convex optimization problems with separable objectives and linear coupling constraints. Its classical version was first introduced by [15] and [12], with the convergence established in [11]. Then [7] showed that ADMM is equivalent to the proximal point algorithm applied to a certain maximal monotone operator. This insight led to the development of the first proximal ADMM by [6]. Building upon Eckstein’s work, [18] extended the proximal ADMM to include a broader class of proximal terms. This approach was further advanced by [10], which generalized the method to allow the use of semi-proximal terms, enhancing the algorithm’s applicability. More recently, considerable efforts have been made to further accelerate variants of ADMM: see, e.g., the Schur complement-based semi-proximal ADMM in [23]; the inexact symmetric Gauss-Seidel-based ADMM in [5], [24], [25], [39]; the accelerated linearized ADMM in [28], [37], [22]; the preconditioned ADMM in [36], [34], [35]; the accelerated proximal ADMM based on Halpern iteration in [42], [38].
Our contributions can be summarized as follows: we propose a new estimator for break detection in the precision matrix, the Group Fused D-trace LASSO (GFDtL); we derive the conditions for which all the break points and the precision matrices can be consistently estimated when the estimated breaks coincide with the true number of breaks; we provide a modified regularizer to ensure the existence of solutions to the revised problem, and show that these solutions either exhibit some easily verifiable patterns indicating the possible unsolvability of the original problem so that we can further update the tuning parameters towards obtaining reliable estimators, or solve the original problem if the patterns are not detected; an ADMM is adapted for implementation with convergence guarantees. The relevance of the novel estimator for change-points in time-varying networks compared to the standard Gaussian-based GFGL estimator is illustrated through simulations and real data experiments.
The rest of the paper is organized as follows. Section 2 details the framework and estimation procedure. Section 3 contains the asymptotic properties. Section 4 is devoted to the optimization aspect of the estimation procedure. Section 5 provides the implementation details. Section 6 contains some simulation experiments, a sensitivity and computational complexity analysis. A real data experiment is provided in Section 7. All the proofs of the main text and auxiliary results are relegated to the Appendices.
Notation: Throughout this paper, we use and to denote the -th element of a vector and the -th element of a matrix , respectively. We write (resp. ) to denote the transpose of the matrix (resp. the vector ). For any square matrix , we write to denote the symmetrization of . The -th order identity matrix is denoted by . We denote by (resp. ) the zero matrix (resp. matrix of ones). We write to denote the vectorization operator that stacks the columns of on top of one another into a vector. For two matrices and , is the Kronecker product. The set of symmetric matrices is denoted by . A symmetric matrix is said to be positive semi-definite (resp. positive definite) and written as (resp. ) if all its eigenvalues are non-negative (resp. positive). The expression (resp. ) for , means (resp. ). We use and to denote the sets of positive semi-definite matrices and positive definite matrices, respectively. For a positive semi-definite matrix , is the unique positive semi-definite matrix such that . The norm for is denoted by , . The Frobenius norm and the off-diagonal (semi)norm of a matrix are denoted by and , respectively. The spectral norm, i.e., the maximum singular value of a matrix is written as . For a symmetric matrix A, we use (resp. ) to denote its largest (resp. smallest) eigenvalue. The maximum absolute value of all entries of a matrix is denoted by . We use to denote the set of symmetric matrices whose diagonal elements are . By we denote a sequence of symmetric matrices whose diagonal elements are , i.e., . For a sequence of symmetric matrices , refers to the -th element of , and is the copy of with diagonal elements set to ; in particular, . Given , we use to denote the projection onto . The proximal operator of a function at is defined as ; for more details about the proximal operator, we refer the interested readers to Sections 12.4, 14.1, 14.2 in [1].
2 Framework
For a sequence of -dimensional random vectors observed at , we consider the estimation of the underlying network through the corresponding precision matrix. The latter is assumed to evolve over time and the task is to recover the break dates. More formally, we denote by a disjoint partitioning of the set such that , and . The partition of the break dates is denoted by with the convention . Then, we assume and Var for , such that the observations indexed by elements in are -dimensional realizations of a centered random variable with variance-covariance . We denote by the precision matrix with entries . In practice, we consider the sequence of precision matrices such that the total number of distinct matrices in the set is and . We are interested in estimating the true number of break dates, the true partition and the true unknown precision matrices . As a consequence, the true data generating process is assumed to be
and , with blocks . While and the break dates are unknown, is typically much smaller than and, assuming the underlying network may exhibit some sparse structures, we consider the sparse estimation of ’s and the estimation of via a mixture of LASSO and Group Fused LASSO, which we will hereafter refer to as the Group Fused D-trace LASSO (GFDtL), defined as
(2.1)
where is the sample, are the tuning parameters, and the D-trace loss of [44] is defined as
Sparsity within the estimated precision matrix for a given block is controlled by , whereas affects the smoothing and guarantees that the solution is piece-wise constant.
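To fix ideas, the following sketch illustrates how the objective in (2.1) can be evaluated for a candidate sequence of precision matrices. It is an illustration only (written in Python/NumPy rather than the Matlab implementation used later), and it assumes that the per-period D-trace loss uses the outer product of the observation at time t and that the three terms enter with unit scaling; the exact normalization in (2.1) may differ.

```python
import numpy as np

def dtrace_loss(theta, sigma_hat):
    # D-trace loss of [44]: 0.5 * <Theta^2, Sigma_hat> - tr(Theta)
    return 0.5 * np.trace(theta @ theta @ sigma_hat) - np.trace(theta)

def gfdtl_objective(thetas, ys, lam1, lam2):
    # thetas: list of T symmetric p x p matrices; ys: list of T observations in R^p
    T = len(ys)
    fit = sum(dtrace_loss(thetas[t], np.outer(ys[t], ys[t])) for t in range(T)) / T
    # LASSO penalty on the off-diagonal entries of each precision matrix
    off_l1 = sum(np.abs(th - np.diag(np.diag(th))).sum() for th in thetas)
    # Group Fused penalty on successive differences (detects the breaks)
    fused = sum(np.linalg.norm(thetas[t] - thetas[t - 1], 'fro') for t in range(1, T))
    return fit + lam1 * off_l1 + lam2 * fused
```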
3 Asymptotic properties
Before stating the large sample results, we define some notations and present the assumptions used hereafter. Define and
Assumption 1.
-
(i)
is a centered strong mixing process, that is, such that for all , , with and the mixing coefficient , where are the filtrations generated by and .
-
(ii)
such that , , .
Assumption 2.
: .
Assumption 3.
Let be a non-increasing positive sequence converging to zero. The following conditions hold:
-
(i)
for some .
-
(ii)
and as .
-
(iii)
and .
-
(iv)
and as .
Assumption 1-(i) relates to the properties of . Assumption 1-(ii) is a tail condition and will allow us to apply exponential inequalities for dependent processes. Assumption 2 ensures the identification of the model: it is similar to Assumption A.1 of [20] or Assumption A.2 of [30]. Assumption 3 provides conditions on , , , and the tuning parameters . Condition (i) concerns the convergence rate of to . In condition (ii), the sample size in each regime may diverge with rate , but at a slower rate than , and the number of breaks may diverge slowly: this is similar to Assumption A.3-(i) of [30] or Assumption H3 of [4]. It also sets the slowest rate at which may shrink to zero: . Conditions (iii) and (iv) specify the fastest rate at which may shrink to zero, which is . It is worth emphasizing that conditions (ii)-(iii) imply and conditions (ii) and (iv) imply that . Finally, the effect of the LASSO shrinkage through does not relate to : this is because the Group Fused LASSO penalty only allows to detect change-points. The consistency of , given , is provided in the next Theorem.
Remark Result (i) implies . Since , this means . Here, is a key quantity to control for the rate at which converges to . Note that , implies that the fastest convergence rate for the break ratio estimator depends on the regularization parameter and . Result (ii) relates to the consistency of the precision matrix in each regime.
It is worth noting that the true number of breaks is unknown. Following the common practice in the change-point literature, we assume that is bounded by a known conservative upper bound . Next, we define for any two sets and . The next result establishes that all true break points in can be consistently estimated by some points in .
The proof of Theorem 3.2 is done by contradiction and follows similar arguments as in the proof of Theorem 3.1. It relies on the optimality conditions from Lemma A.3. Theorem 3.2 ensures that even if the number of blocks is overestimated, there will be an estimated change-point close to each unknown true change-point.
4 Optimization
In this section, we move on to the optimization aspects of criterion (2.1), including the dual problem, the existence of solutions, and the algorithm. Specifically, given , , and , we consider the following scaled optimization problem
(4.1)
where we scaled Problem (2.1) by a factor of for numerical stability. One can also notice that in (4.1) we use rather than as in (2.1). This choice is made for practical reasons: setting ensures non-singular solutions, and the set is closed and convex, hence the projection onto it is well defined.
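For illustration, the projection onto the set of symmetric matrices whose eigenvalues are bounded below by a positive constant admits a simple spectral form; the sketch below (a textbook construction, not the authors' Matlab code) clips the eigenvalues from below.

```python
import numpy as np

def project_eig_floor(theta, delta):
    # Frobenius-norm projection onto {Theta symmetric : Theta >= delta * I}
    sym = 0.5 * (theta + theta.T)            # symmetrize first
    eigval, eigvec = np.linalg.eigh(sym)
    eigval = np.maximum(eigval, delta)       # floor the spectrum at delta
    return (eigvec * eigval) @ eigvec.T
```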
4.1 Dual problem and existence of solutions
Proposition 4.1.
- (i)
-
(ii)
If , then there exists such that for any and any , the dual problem (4.2) has a Slater point (i.e., a feasible point that satisfies all the inequality and positive semi-definite constraints strictly) and the primal problem (4.1) has solutions.
We note that the assumption in Proposition 4.1-(ii) is reasonable because it can be viewed as a sample-based version of Assumption 2.
We next present a simple example to illustrate that for some specific dataset , even with , there may still exist some and such that (4.2) is infeasible, meaning that its optimal value is . Since (4.1) and (4.2) have the same optimal value, this means (4.1) does not have solutions in this case.
Example 1.
Consider the case that where and . Then
and . The positive semi-definite constraint in (4.2) can be written as
After some simple calculations, we have
Recall that the diagonal elements of a positive semi-definite matrix must be non-negative; we can then observe that the above condition requires that and . These two conditions imply that for any and any , the dual problem (4.2) is infeasible and hence the primal problem (4.1) does not have solutions.
Remark 1.
It is worth pointing out that the nonexistence of solution does not contradict our findings in Section 3 because those results only indicate that when there is a ground truth, under suitable assumptions, the ground truth can be (approximately) recovered from a solution of problem (2.1) with suitably chosen , and ; in particular, it did not imply solution existence of problem (2.1) for general , and .
Although we can obtain some lower bound on (as detailed in the proof of Proposition 4.1-(ii)) that ensures the existence of solutions to (4.1), this bound may not be tight. This can cause practical issues because the optimal may be strictly smaller than . We would then need to work with problem (4.1) with some to locate such . However, since , (4.1) may be unbounded from below, and certifying such a scenario is a challenging problem. This motivates us to modify problem (4.1) to obtain a new model which has solutions for all choices of and , and (under some mild condition) returns a solution of problem (4.1) when the latter problem is solvable.
To this end, we notice from the above example that the unsolvability of problem (4.1) when is small may be related to the fact that the relations between different groups induced by the Group Fused LASSO regularizer are not strong enough to leverage the condition to ensure dual strict feasibility (and hence the existence of solutions to (4.1)). Hence, this motivates the introduction of a new (modified) regularizer to replace the Group Fused LASSO regularizer in problem (4.1): intuitively, this regularizer should be similar to the Group Fused LASSO regularizer when is small, so as to induce the (same) desired breaks, but should penalize large more heavily to ensure the solution existence of the new model.
4.2 A revised problem with a modified regularizer
Let and let
(4.3)
Here, is necessary and sufficient to ensure the convexity of . The function employs the absolute value in a small region near 0 (determined by ) and switches to a quadratic function outside this region. In this way, it reduces to the classical penalty when is near 0, while imposing a more substantial penalty as goes further away from 0.
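As an illustration of this behavior, a reverse-Huber-type penalty reproduces the description above: it equals the absolute value on a small interval around 0 and grows quadratically outside it. This is a sketch only; the exact constants in (4.3) may differ.

```python
import numpy as np

def modified_penalty(x, eps):
    # |x| near zero, quadratic growth beyond eps; convex and continuous at |x| = eps
    ax = np.abs(x)
    return np.where(ax <= eps, ax, (ax ** 2 + eps ** 2) / (2.0 * eps))
```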
Replacing by in problem (4.1), we obtain the following revised optimization problem:
(4.4)
The next proposition shows the dual problem and the existence of solutions to (4.4).
Proposition 4.2.
Proposition 4.3.
Equipped with Proposition 4.3, we can derive the following practical way to search for a suitable such that (4.1) is solvable by solving (a sequence of) (4.4). Specifically, for any positive and , we solve (4.4) with an appropriately large , and then check if the solution satisfies (4.6): if it does, we increase further to pursue a reliable estimator; otherwise we obtain a solution to (4.1), and all the theoretical properties described in Section 3 hold.
4.3 An alternating direction method of multipliers
In this subsection, we discuss how to adapt the alternating direction method of multipliers (ADMM) to solve (4.4). We rewrite (4.4) as the following constrained optimization problem:
(4.7)
where we denote for the sake of notational simplicity; is the copy of with the diagonal elements set to 0, and .
Given , the augmented Lagrangian function is
In iteration , our ADMM consists of three update steps:
-
1.
update:
Note that this update is well defined as is strongly convex in (with other variables fixed).
-
2.
update:
Note that this update is well defined as is strongly convex in the variables , , (with other variables fixed).
-
3.
Dual update: for all , apply (4.8) below. (The dual stepsize 1.61 in (4.8) can more generally be chosen from the interval ; here we pick 1.61 for simplicity.)
(4.8)
We next discuss how to perform the first two update steps efficiently.
update: Setting the derivative of the objective function of the subproblem with respect to to 0, we have
For the sake of simplicity, we let for all . By further denoting
for all and , the update is equivalent to solving the following linear system:
(4.9)
This linear system does not have a closed form solution in general. Here we use pcg in Matlab to solve it. Specifically, in each iteration, we use the solution from the previous iteration as the initial point and solve the system only up to some tolerance that decreases with iterations.
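For illustration, the idea of warm-starting an iterative solver at the previous ADMM iterate can be sketched with a minimal conjugate gradient routine; the paper uses Matlab's pcg, and the routine and tolerance rule below are placeholders.

```python
import numpy as np

def cg_solve(matvec, b, x0, tol=1e-8, max_iter=500):
    # Conjugate gradient for A x = b with A symmetric positive definite,
    # started from x0 (e.g., the solution from the previous ADMM iteration).
    x = x0.astype(float).copy()
    r = b - matvec(x)
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        if np.sqrt(rs) <= tol * (1.0 + np.linalg.norm(b)):
            break
        Ap = matvec(p)
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x
```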
update: It is notable that this subproblem is block separable, allowing us to solve it by addressing three further subproblems with respect to , , and , respectively.
For , one has
For , it holds that
(4.11)
The update scheme for is slightly more complicated. For simplicity, we denote
For each , we need to solve the following optimization problem:
By the definition of in (4.3), it is equivalent to solving the following two problems
and let
(4.12)
where and refer to the optimal values of the two optimization problems ① and ②, respectively. We first note that if , then and we have . We next assume that . For ①, from the proximal operator of the Frobenius norm, we have
(4.13)
For ②, by considering the derivative of its objective function, we get
(4.14)
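For the branch ①, the update in (4.13) relies on the proximal operator of the Frobenius norm, i.e., block soft-thresholding; a minimal sketch is given below. The full update additionally compares the values of ① and ② as in (4.12), which is omitted here.

```python
import numpy as np

def block_soft_threshold(m, tau):
    # Proximal operator of tau * ||.||_F at the matrix m: shrink the whole block
    # towards zero, and set it exactly to zero when ||m||_F <= tau.
    norm = np.linalg.norm(m, 'fro')
    if norm <= tau:
        return np.zeros_like(m)
    return (1.0 - tau / norm) * m
```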
Overall, the ADMM is summarized in Algorithm 1, whose convergence directly follows from [10, Appendix B].
5 Implementation details
We implement Algorithm 1 in Matlab R2023a and perform several numerical experiments on both simulated datasets and real datasets, which will be discussed in the later sections. The Matlab codes for the implementation of Algorithm 1 and the experiments are available in https://github.com/linyopt/GFDtL. In this section, we provide some implementation details about the algorithm and the numerical experiments; interested readers can check the GitHub repository for more technical details that are not covered here.
We first briefly describe the process to obtain an estimator and to identify the corresponding breaks based on a given and a sample with . Specifically, for any pair of tuning parameters , Proposition 4.3 suggests that we can apply Algorithm 1 to solve (4.4) with a sufficiently large to either assert that (4.1) may not have solutions or obtain a solution to (4.1). With the obtained estimator in hand, we identify the breaks by selecting the ’s with . Since controls the model complexity and smoothing, they must be calibrated accordingly. We search for the optimal tuning parameters over a user-specified grid based on some criteria, which will be discussed in the next subsection.
It is important to further discuss how the breaks are identified from a given estimator. Intuitively, with the Group Fused LASSO regularizer, the estimator should satisfy that for some ’s, , which we identify as breaks. In contrast, the estimator produced by the competing method, the Gaussian loss-based GFGL approach of [13], does not inherently possess this property. Instead, their algorithm includes an additional post-processing step to extract the breaks from the estimator. Although, as discussed hereafter, this method can also produce reasonable breaks, their estimator does not naturally exhibit the desired property of containing identifiable breaks.
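A minimal sketch of the break-identification step described above: successive differences of the estimated precision matrices are thresholded, the tolerance being a user-chosen value (a placeholder here).

```python
import numpy as np

def identify_breaks(thetas, tol=1e-6):
    # Return the dates t at which the estimated precision matrix changes,
    # i.e., ||Theta_t - Theta_{t-1}||_F exceeds the tolerance.
    return [t for t in range(1, len(thetas))
            if np.linalg.norm(thetas[t] - thetas[t - 1], 'fro') > tol]
```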
5.1 Optimal selection of the tuning parameters
The Bayesian information criterion (BIC) is a common criterion to choose the optimal tuning parameters: see, e.g., [27]. However, it is known that there are some scenarios where BIC may fail to select the optimal tuning parameters: see, e.g., Section 4.2 and the Appendix in [13]. This motivates us to propose the following three methods for the selection of the optimal pair of tuning parameters:
-
(a)
Method a: When the true underlying data generating process is known, the optimal pair can be selected as the pair that minimizes or maximizes suitable performance measures, such as the Hausdorff distance, model recovery, or estimation error. Although this strategy can only be employed when the true underlying structure is known, it is informative about the relative performances of different competing procedures when one is in a position to minimize/maximize the performance metrics.
-
(b)
Method b: In the case of simulated data, we divide the simulated samples of observations into a training set and test sets, all of them sampled from the same data generating process: the training set is denoted by and the test sets are denoted by . Based on , we apply Algorithm 1 to solve (4.4) with an appropriately large for different candidates and obtain . The optimal is selected as the pair which minimizes
(5.1) is user-specified. Throughout this paper, we set .
-
(c)
Method c: When the true underlying data generating process is unknown, as in real data, following [13], we employ a BIC-type criterion given by
(5.2) where represents the complexity, or degrees of freedom, and is defined as
Then we select the optimal values according to the criterion
The definition of varies across the literature: see, e.g., the different definitions provided in [13] and [27]. In the former work, is the sum of the number of active edges at and of the corresponding changes for . In the latter work, is the number of non-zero coefficient blocks in for . Based on our preliminary experiments, we found that these two definitions do not result in significant differences in the selection of the optimal tuning parameters. Therefore, since we will compare our algorithm with the GFGL method of [13], we use their definition for consistency.
For these three methods, the minimization problem is solved w.r.t. over a user-specified grid: in our synthetic experiments, the grid is specified as for and for .
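A minimal sketch of this grid search is given below; the routines `fit` (solving (4.4) for a pair of tuning parameters and returning None when the unsolvability pattern (4.6) is detected) and `score` (the held-out loss (5.1) or the BIC-type criterion (5.2)) are placeholders supplied by the user.

```python
import itertools

def select_tuning(grid1, grid2, fit, score):
    # Exhaustive search over the user-specified grid of (lambda1, lambda2).
    best_pair, best_val = None, float('inf')
    for lam1, lam2 in itertools.product(grid1, grid2):
        thetas = fit(lam1, lam2)
        if thetas is None:      # pair excluded: (4.1) deemed unsolvable
            continue
        val = score(thetas, lam1, lam2)
        if val < best_val:
            best_pair, best_val = (lam1, lam2), val
    return best_pair
```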
It is worth noting that from Proposition 4.1-(ii), within the user-specified grid there may be some pairs of such that (4.1) does not have solutions, which can be detected by checking (4.6). For these pairs of , we set and , so that they will never be selected. In Subsection 6.2, we perform a sensitivity analysis on a simulated dataset to evaluate and compare Method b and Method c for selecting the optimal pair of tuning parameters.
5.2 Initialization and termination criterion
Throughout this paper, we initialize the algorithm as follows: for all ,
here the inverse is well-defined since we have .
We next describe the termination criterion, which essentially consists of checking the constraint violations for the dual problem (4.5), and also the gap between the primal and dual objective values. Recall that is the primal variable with , , and for all ; is the dual variable with , , for all ; and let for all , where . We define the relative infeasibility for the positive semi-definite constraint as
For the bound constraint of , we similarly define the relative infeasibility as
where is the -th element of . Then, the relative dual infeasibility is defined as
The relative duality gap is defined by
where and are the objective values of primal problem (cf. (4.4)) at and dual problem (cf. (4.5)) at , respectively.
Overall, throughout this paper, we terminate Algorithm 1 (we note that Algorithm 1 does not involve while the dual problem (4.5) does; here, we set for all and , which comes from the derivation of the dual problem (4.5), see (B.20)) when both the relative primal-dual gap and the relative dual infeasibility are sufficiently small:
or, it is detected that Problem (4.1) may not have solutions:
(5.3)
or, the relative successive changes of both primal and dual variables are sufficiently small:
where . We set for experiments on both synthetic and real datasets.
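For illustration, the overall stopping logic can be organized as follows; the relative successive change is measured here (as an assumption) by the norm of the difference divided by one plus the norm of the previous iterate, the tolerance value is a placeholder, and the detection of (5.3) is omitted.

```python
import numpy as np

def relative_change(x_new, x_old):
    # Relative successive change between two (vectorized) iterates.
    return np.linalg.norm(x_new - x_old) / (1.0 + np.linalg.norm(x_old))

def should_stop(rel_gap, rel_dual_infeas, rel_chg_primal, rel_chg_dual, tol=1e-5):
    # Stop when the relative duality gap and dual infeasibility are small,
    # or when both primal and dual iterates have essentially stopped moving.
    return (max(rel_gap, rel_dual_infeas) <= tol
            or max(rel_chg_primal, rel_chg_dual) <= tol)
```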
5.3 Choices of algorithm parameters
The parameter specifies a lower bound of the eigenvalues of matrices in the resulting solution, which ensures the non-singularity of each matrix in the obtained solution if . The choice of depends on how “non-singular” the solution is expected to be. In our experiments, we set . In view of Proposition 4.3 and its proof, should be large enough, with its lower bound related to and the possible solution to the corresponding (4.1). After some trial experiments, we set for experiments on synthetic datasets and for experiments on real datasets. For the tolerance of pcg, we set and . The parameter in ADMM has no effect on the results but only affects the convergence speed of the algorithm, so we did not fine-tune it; interested readers can refer to the GitHub repository for further details regarding these settings.
6 Synthetic experiments
In this section, we conduct some simulation experiments to assess the performances of the GFDtL procedure, its sensitivity to the tuning parameters and computational complexity.
6.1 Simulations
To assess the finite-sample relevance of the GFDtL procedure, we consider the following settings, where we denote by the true number of breaks and by the true precision matrices:
Setting (i): For each true precision matrix , its structure is uniformly drawn from the set of graphs with vertex size , that is, the number of variables, and edges, yielding an Erdős-Rényi graph for the block . The zero entries are generated by matching the pattern of the adjacency matrix with the precision matrix , that is, , which provides the sparsity pattern in the off-diagonal coefficients of . The proportion of zero entries is calibrated by .
Then the off-diagonal non-zero entries of are drawn in , where denotes the uniform distribution in . The diagonal elements are drawn in . Finally, to ensure that the resulting matrix is positive-definite, if the simulated satisfies , we apply , where is the first value in such that .
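A rough sketch of the generator used in Setting (i) is given below; the entry ranges and the eigenvalue floor `delta` are illustrative placeholders (the paper's exact intervals are those stated above), and the positive-definiteness correction shifts the diagonal until the smallest eigenvalue exceeds the floor, in the spirit of the final step described above.

```python
import numpy as np

def make_sparse_precision(p, n_edges, rng, delta=0.1):
    # Erdos-Renyi support with n_edges edges, then fill the nonzero entries.
    theta = np.zeros((p, p))
    pairs = [(i, j) for i in range(p) for j in range(i + 1, p)]
    for k in rng.choice(len(pairs), size=n_edges, replace=False):
        i, j = pairs[k]
        theta[i, j] = theta[j, i] = rng.uniform(0.3, 0.6) * rng.choice([-1.0, 1.0])
    theta[np.diag_indices(p)] = rng.uniform(1.0, 2.0, size=p)
    lam_min = np.linalg.eigvalsh(theta).min()
    if lam_min < delta:                      # positive-definiteness correction
        theta += (delta - lam_min) * np.eye(p)
    return theta

# Example: Theta = make_sparse_precision(20, 30, np.random.default_rng(0))
```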
Setting (ii): For each true , its off-diagonal non-zero entries are generated in and diagonal elements are drawn in . To ensure that the resulting matrix is positive-definite, we apply the same final step as in Setting (i).
Setting (iii): The precision matrix is generated following the same spirit as in Section 5 of [3]. We construct , , where with , and where makes the diagonal entries in and different. We set with . The coefficient equals with probability , and equals otherwise. Then, we set when and each non-zero off-diagonal coefficient is multiplied by (resp. ) with probability (resp. ). Finally, to ensure that the resulting matrix is positive-definite, we apply the same final step as in Setting (i). This creates a banded structure.
For each of these settings, for , we draw from the -dimensional Gaussian distribution, with when . We set and three cases relating to the breaks are considered: (a) no break; (b) a single break; (c) several breaks. In case (b), and in case (c), ; the locations of the breaks, i.e., , are randomly set, conditionally on being at least , where . We set so that (resp. ) when (resp. ): the regimes may have different time lengths but satisfy a minimum time length condition. The latter relates to the issue of trimming: see the Introduction and Section 4 in [30], who mentioned that is a standard choice. We set the sample size in case (a) and in cases (b), (c). The “sparsity degree”, that is, the proportion of zero entries in the lower triangular part of , is set as and in settings (i) and (ii), which represents approximately and zero entries in each regime out of the lower triangular elements, respectively. Note that in setting (i), we allow the true number of zero entries to slightly vary between each regime around the corresponding sparsity degree, although remains constant. In setting (ii), we keep the true number of zero entries constant across the regimes, i.e., and zero entries exactly. In setting (iii), the sparsity degree can slightly vary depending on the value of .
For each setting, we draw one hundred batches of independent samples . For each batch, we apply Algorithm 1 to solve (4.4) with some large over a grid specified as for and for , and select the optimal pair using the selection methods described in Subsection 5.1, which will be denoted by Method a, Method b and Method c hereafter. As a competing method, we also solve the Gaussian loss-based GFGL of [13] (the Matlab code for GFGL is available on the corresponding journal website as supplemental material) by selecting the optimal tuning parameters using the same strategies. For these two methods, we report the following metrics as performance measures:
-
(i)
the number of breaks detected by the procedure denoted by .
-
(ii)
the Hausdorff distance , which serves as a measure of the estimation accuracy of the break dates.
-
(iii)
the F1 score defined as , where TP is the number of correctly estimated non-zero coefficients, FN is the number of incorrectly estimated zero entries, and FP is the number of incorrectly estimated non-zero entries. The closer the score is to 1, the better (an illustrative sketch of these metrics is given after this list).
-
(iv)
the accuracy defined as , where TN is the number of correctly estimated zero entries.
-
(v)
the averaged mean squared error (MSE) for precision accuracy.
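An illustrative sketch of these metrics (the threshold used to declare an entry non-zero is an assumption):

```python
import numpy as np

def f1_and_accuracy(theta_hat, theta_true, tol=1e-8):
    # Support recovery on the off-diagonal entries.
    mask = ~np.eye(theta_true.shape[0], dtype=bool)
    est_nz = np.abs(theta_hat[mask]) > tol
    true_nz = np.abs(theta_true[mask]) > tol
    tp = np.sum(est_nz & true_nz)
    fp = np.sum(est_nz & ~true_nz)
    fn = np.sum(~est_nz & true_nz)
    tn = np.sum(~est_nz & ~true_nz)
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) > 0 else 1.0
    acc = (tp + tn) / mask.sum()
    return f1, acc

def hausdorff(breaks_a, breaks_b):
    # Hausdorff distance between two non-empty sets of break dates.
    d = lambda a, b: max(min(abs(x - y) for y in b) for x in a)
    return max(d(breaks_a, breaks_b), d(breaks_b, breaks_a))
```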
These metrics are averaged over the one hundred independent batches and are reported in Table 1, Table 2 and Table 3. These simulated experiments have been run on an Intel(R) Xeon(R) Gold 6242R CPU @ 3.09 GHz with 128 GB of RAM.
| | Number of breaks | | Hausdorff distance | | F1 score | | Accuracy | | MSE | |
|---|---|---|---|---|---|---|---|---|---|---|
| | GFDtL | GFGL | GFDtL | GFGL | GFDtL | GFGL | GFDtL | GFGL | GFDtL | GFGL |
Method a | 0 | 0 | 0 | 0 | 0.8472 | 0.7174 | 0.9133 | 0.8084 | 0.1707 | 0.2266 | |
Method b | 0 | 3.6800 | 0 | 46.0100 | 0.8072 | 0.5377 | 0.8919 | 0.5518 | 0.1662 | 0.2089 | |
Method c | 0.1400 | 2.7000 | 4.1100 | 42.2200 | 0.7525 | 0.6915 | 0.8425 | 0.8060 | 0.1335 | 0.2331 | |
Method a | 0 | 0 | 0 | 0 | 0.8246 | 0.7247 | 0.8036 | 0.6429 | 0.2098 | 0.4219 | |
Method b | 0 | 2.5000 | 0 | 36.8400 | 0.7178 | 0.7078 | 0.7251 | 0.5811 | 0.3452 | 0.4127 | |
Method c | 0.0700 | 5.9800 | 2.6100 | 45.0200 | 0.7903 | 0.6526 | 0.7541 | 0.6281 | 0.1755 | 0.4491 | |
Method a | 1.0100 | 1.2700 | 0.4900 | 1.3500 | 0.7261 | 0.6592 | 0.8294 | 0.7641 | 0.2496 | 0.2382 | |
Method b | 2.1800 | 23.0600 | 20.9900 | 69.9600 | 0.6815 | 0.4712 | 0.7805 | 0.4209 | 0.2049 | 0.2092 | |
Method c | 1.4000 | 3.7200 | 9.5100 | 21.9400 | 0.6480 | 0.6707 | 0.7264 | 0.8016 | 0.2029 | 0.2427 | |
Method a | 1.0000 | 1.3600 | 0.0700 | 0.4300 | 0.7441 | 0.7323 | 0.6882 | 0.6347 | 0.3558 | 0.4237 | |
Method b | 2.7900 | 20.0700 | 27.7800 | 66.7200 | 0.7163 | 0.7140 | 0.6696 | 0.5784 | 0.3774 | 0.4068 | |
Method c | 1.1300 | 5.4000 | 9.7300 | 23.4000 | 0.7223 | 0.6670 | 0.6583 | 0.6343 | 0.3335 | 0.4505 | |
Method a | 4.2100 | 5.7600 | 2.2700 | 2.7100 | 0.6389 | 0.6088 | 0.7409 | 0.7093 | 0.2906 | 0.2585 | |
Method b | 5.4500 | 44.1100 | 11.5800 | 27.6700 | 0.6040 | 0.4555 | 0.6924 | 0.3810 | 0.2476 | 0.2159 | |
Method c | 2.9200 | 7.2300 | 29.3100 | 18.0100 | 0.5792 | 0.6228 | 0.6579 | 0.7488 | 0.2853 | 0.2599 | |
Method a | 4.1100 | 5.1100 | 1.7400 | 1.1800 | 0.7021 | 0.7175 | 0.6180 | 0.5920 | 0.4475 | 0.4379 | |
Method b | 6.3000 | 36.8100 | 14.9100 | 27.6500 | 0.6989 | 0.7163 | 0.6209 | 0.5759 | 0.4218 | 0.4114 | |
Method c | 1.8800 | 6.0800 | 41.0400 | 11.5000 | 0.6616 | 0.6615 | 0.5700 | 0.6067 | 0.4555 | 0.4762 |
| | Number of breaks | | Hausdorff distance | | F1 score | | Accuracy | | MSE | |
|---|---|---|---|---|---|---|---|---|---|---|
| | GFDtL | GFGL | GFDtL | GFGL | GFDtL | GFGL | GFDtL | GFGL | GFDtL | GFGL |
Method a | 0 | 0 | 0 | 0 | 0.8364 | 0.7131 | 0.9051 | 0.8109 | 0.2087 | 0.3582 | |
Method b | 0.0100 | 2.7600 | 0.3600 | 40.0400 | 0.7940 | 0.5841 | 0.8861 | 0.6200 | 0.2239 | 0.3417 | |
Method c | 0.0200 | 2.2200 | 0.8000 | 44.0100 | 0.7576 | 0.6655 | 0.8532 | 0.8048 | 0.1560 | 0.3654 | |
Method a | 0 | 0 | 0 | 0 | 0.8437 | 0.8143 | 0.7784 | 0.7069 | 0.2451 | 0.6669 | |
Method b | 0 | 1.2700 | 0 | 29.1500 | 0.6995 | 0.8155 | 0.6500 | 0.7027 | 0.5162 | 0.6652 | |
Method c | 0.0100 | 3.2600 | 0.3600 | 38.2700 | 0.8318 | 0.6885 | 0.7639 | 0.6071 | 0.2423 | 0.6981 | |
Method a | 1.0200 | 1.2900 | 0.9100 | 1.3800 | 0.7021 | 0.6546 | 0.7968 | 0.7645 | 0.2833 | 0.3470 | |
Method b | 1.8900 | 23.9000 | 15.7300 | 86.6000 | 0.6777 | 0.5139 | 0.7769 | 0.4931 | 0.2530 | 0.3206 | |
Method c | 1.0100 | 3.1200 | 16.3800 | 21.0600 | 0.6264 | 0.6398 | 0.7374 | 0.7998 | 0.2552 | 0.3506 | |
Method a | 0.9900 | 1.4500 | 0.3000 | 0.8000 | 0.7846 | 0.8271 | 0.6909 | 0.7186 | 0.5085 | 0.6511 | |
Method b | 3.5100 | 20.5600 | 39.8500 | 91.1200 | 0.7465 | 0.8209 | 0.6560 | 0.7054 | 0.5470 | 0.6431 | |
Method c | 1.0300 | 4.0800 | 14.2600 | 25.0800 | 0.7717 | 0.6935 | 0.6765 | 0.6089 | 0.4779 | 0.6874 | |
Method a | 4.3100 | 6.1500 | 4.5500 | 4.5300 | 0.5985 | 0.5901 | 0.6861 | 0.6931 | 0.3334 | 0.3466 | |
Method b | 4.3900 | 40.8100 | 16.0900 | 33.6400 | 0.5788 | 0.4821 | 0.6624 | 0.4297 | 0.3099 | 0.3153 | |
Method c | 1.9300 | 5.4500 | 55.6100 | 32.4000 | 0.5226 | 0.5971 | 0.5968 | 0.7630 | 0.3480 | 0.3559 | |
Method a | 4.1000 | 5.1300 | 1.9400 | 1.1900 | 0.7772 | 0.8265 | 0.6702 | 0.7150 | 0.6536 | 0.6727 | |
Method b | 6.8800 | 33.9200 | 17.0400 | 34.2100 | 0.7691 | 0.8233 | 0.6655 | 0.7077 | 0.6031 | 0.6516 | |
Method c | 1.5000 | 6.3700 | 50.4000 | 12.0900 | 0.7507 | 0.7154 | 0.6546 | 0.6171 | 0.6527 | 0.7177 |
| | Number of breaks | | Hausdorff distance | | F1 score | | Accuracy | | MSE | |
|---|---|---|---|---|---|---|---|---|---|---|
| | GFDtL | GFGL | GFDtL | GFGL | GFDtL | GFGL | GFDtL | GFGL | GFDtL | GFGL |
Method a | 0 | 0 | 0 | 0 | 0.8236 | 0.7848 | 0.8230 | 0.7704 | 0.1089 | 0.2241 | |
Method b | 0.0600 | 5.6600 | 1.5900 | 48.6800 | 0.7611 | 0.7268 | 0.7941 | 0.6448 | 0.1504 | 0.2100 | |
Method c | 0.0300 | 7.3800 | 0.8900 | 53.9000 | 0.5396 | 0.5322 | 0.6835 | 0.6533 | 0.1709 | 0.2511 | |
Method a | 1.0000 | 1.1200 | 2.3800 | 3.1900 | 0.7799 | 0.7357 | 0.7670 | 0.7425 | 0.1800 | 0.2344 | |
Method b | 2.5500 | 19.2700 | 27.8200 | 85.5000 | 0.7564 | 0.7136 | 0.7600 | 0.6084 | 0.1825 | 0.2092 | |
Method c | 0.4400 | 7.5100 | 59.1100 | 42.3700 | 0.3961 | 0.4830 | 0.6028 | 0.6325 | 0.2317 | 0.2554 | |
Method a | 4.9900 | 7.2700 | 8.2500 | 9.6700 | 0.6950 | 0.6816 | 0.6709 | 0.6791 | 0.2181 | 0.2274 | |
Method b | 4.5200 | 34.1600 | 24.3900 | 32.7100 | 0.7104 | 0.6951 | 0.7029 | 0.5862 | 0.2251 | 0.2095 | |
Method c | 0.3700 | 4.3300 | 107.9600 | 66.8400 | 0.3538 | 0.4279 | 0.5836 | 0.6102 | 0.2589 | 0.2478 |
Overall, our GFDtL procedure performs better than the GFGL with respect to all the metrics in any setting and for Methods a and b for the selection of the optimal pair of . Importantly, our procedure results in lower MSE, hence more accurate estimation.
It is worth mentioning the case of no break, that is : means that the procedure concludes that there is no break; in this setting, our procedure performs much better than the GFGL, particularly in terms of MSE and Hausdorff distance.
When applying Method b, the latter metric obtained by GFDtL is much better than the one resulting from the GFGL, which tends to overestimate the number of breaks. Overall, we may conclude that GFDtL results in a low probability of falsely detecting breaks in the no break case.
Finally, the BIC (i.e., Method c)-based results are in favor of the GFGL method, particularly in terms of . This is because the BIC tends to underestimate the number of breaks when applied to the GFDtL, i.e., it tends to select large : indeed, when , the results obtained by Method c are good; but in the case of multiple breaks, the number of breaks detected by BIC is much lower than the true number of breaks. This behavior is further detailed in Subsection 6.2.
6.2 Sensitivity analysis with respect to the tuning parameters
We propose a sensitivity analysis of Methods b and c provided in Subsection 5.1 with respect to the calibration of . More precisely, we illustrate the ability of the proposed strategies to identify the optimal pair for break and sparse estimation. The experiments are conducted on datasets simulated according to Setting (i) in Subsection 6.1 with , , the “sparsity degree” being and . To better approximate the metric surfaces, we use a denser grid specified as for and for .
The results are displayed in Figure 1, with the three rows corresponding to , and the four columns showcasing the results for BICs (cf. (5.2)), lossvals (cf. (5.1)), Hausdorff distances, and F1 scores, respectively. In each subfigure, lighter colors represent more favorable tuning parameters, indicating areas where the metric values are minimized or maximized as appropriate. To facilitate visualization given the wide range of values for BICs and lossvals, we pre-processed these metrics before plotting: specifically, we subtracted the minimum value to ensure non-negativity, then applied the log1p function in Matlab (i.e., ) to compress the scale of the values, making the patterns more discernible and enhancing the interpretability of the results.
[Figure 1: surfaces of the BIC, lossval, Hausdorff distance, and F1 score metrics over the tuning-parameter grid; rows correspond to the three break scenarios, columns to the four metrics.]
The figures suggest consistent patterns across the surfaces of all four metrics. Specifically, there are distinct boundaries splitting the metric surfaces into two primary regions: an upper and a lower region. Each of these regions further subdivides into multiple subregions that exhibit similar characteristics across all four metrics. The lower regions of the BIC, lossval, and F1 score surfaces are characterized by numerous vertical bars, indicating areas of potentially optimal parameter combinations. In contrast, the Hausdorff distance surface displays a constant lower region.
An interesting observation is that the BIC-type criterion tends to favor smaller values of , whereas the lossval criterion leans towards slightly larger values. Furthermore, although both criteria struggle to identify the optimal Hausdorff distance (except when ), when comparing the tuning parameters selected by the BIC-type criterion to those by the lossval criterion, the lossval criterion exhibits a slight advantage; its optimal region (the white area in the subfigures) is larger, especially as increases. For example, when , the white region extends to the upper right corner, which is preferable to the lower region. This suggests that the lossval criterion may be more effective in identifying the optimal parameters. These advantages of the lossval criterion over the BIC-type criterion are further corroborated by the results presented in the previous subsection.
The figures also shed light on the reasons behind the BIC-type criterion’s poor performance in the context of GFDtL. Specifically, the BIC-type criterion shows a preference for larger values, which corresponds to fewer breaks in the estimation. This preference can be directly attributed to the definition of BIC (cf. (5.2)). In particular, when there are breaks, is at least , which typically dominates the loss value term in the BIC formula. Consequently, the BIC-type criterion tends to favor estimators with fewer breaks, leading to suboptimal performance in detecting the true number of breaks, especially in scenarios with a higher number of actual breaks.
Furthermore, the F1 score surfaces provide additional insights into the model’s performance across different parameter combinations. The gradual transition from darker to lighter colors as increases (for fixed ) suggests that the model’s ability to correctly identify true positives improves with larger values, up to a certain point. This observation aligns with the lossval criterion’s preference for slightly larger values compared to the BIC-type criterion.
6.3 Empirical computational complexity analysis
The computational complexity of Algorithm 1 is influenced by several factors, including the sample size , the dimension , the number of breaks , , and the pair . To empirically analyze this complexity, we conduct a series of experiments based on Setting (i) in Subsection 6.1, varying each factor individually. Specifically, for each factor, we perform 5 experiments with other factors held constant, and plot the averaged computation time to visualize the impact of each factor on the algorithm’s complexity. Here, we use . These experiments are conducted on a desktop with an Intel(R) Core(TM) CPU (10 cores and 20 threads) and 64 GB of RAM.
As displayed in Figure 2, the computation time is approximately linear in , quadratic in , and not significantly affected by and . The impact of is presented in Table 4, where one can observe that the computation times for and are notably shorter. This is because the algorithm terminates early as (5.3) is satisfied within the first few iterations. Additionally, the computation times for , , , , and are significantly higher, as these are marginal cases that are typically more challenging to solve. Apart from these marginal cases, for large and , Algorithm 1 requires nearly the same amount of time to converge.
[Figure 2: averaged computation time as each factor varies, with the other factors held constant.]
10 | 20 | 30 | 40 | 50 | 60 | 70 | 80 | 90 | 100 | 110 | 120 | 130 | 140 | 150 | 160 | 170 | 180 | 190 | 200 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0.1 | 3.52 | 33.71 | 12.47 | 6.71 | 5.17 | 4.17 | 5.58 | 5.58 | 5.83 | 5.73 | 5.58 | 5.51 | 5.52 | 5.51 | 5.53 | 5.44 | 5.52 | 5.50 | 5.57 | 5.47 |
0.2 | 6.60 | 14.00 | 6.15 | 4.28 | 3.68 | 3.98 | 4.13 | 4.19 | 4.13 | 4.15 | 4.16 | 4.15 | 4.14 | 4.14 | 4.15 | 4.17 | 4.15 | 4.15 | 4.15 | 4.15 |
0.3 | 27.25 | 7.13 | 3.69 | 2.98 | 2.67 | 2.91 | 2.90 | 2.91 | 2.91 | 2.91 | 2.91 | 2.91 | 2.91 | 2.92 | 2.91 | 2.91 | 2.92 | 2.91 | 2.93 | 2.91 |
0.4 | 36.05 | 5.11 | 2.75 | 2.40 | 2.16 | 2.16 | 2.16 | 2.16 | 2.16 | 2.16 | 2.16 | 2.22 | 2.19 | 2.21 | 2.18 | 2.20 | 2.21 | 2.21 | 2.21 | 2.23 |
0.5 | 23.22 | 4.65 | 2.68 | 2.28 | 2.11 | 2.12 | 2.12 | 2.11 | 2.12 | 2.12 | 2.18 | 2.17 | 2.17 | 2.21 | 2.17 | 2.18 | 2.18 | 2.17 | 2.17 | 2.17 |
0.6 | 14.14 | 4.64 | 2.81 | 2.30 | 2.30 | 2.22 | 2.23 | 2.17 | 2.20 | 2.21 | 2.24 | 2.20 | 2.24 | 2.15 | 2.20 | 2.14 | 2.17 | 2.17 | 2.19 | 2.18 |
0.7 | 10.38 | 4.78 | 2.87 | 2.19 | 2.26 | 2.30 | 2.15 | 2.18 | 2.16 | 2.23 | 2.17 | 2.19 | 2.21 | 2.21 | 2.21 | 2.15 | 2.18 | 2.18 | 2.19 | 2.19 |
0.8 | 9.17 | 4.72 | 2.88 | 2.46 | 2.22 | 2.41 | 2.59 | 2.48 | 2.20 | 2.15 | 2.14 | 2.15 | 2.18 | 2.23 | 2.24 | 2.43 | 2.27 | 2.35 | 2.20 | 2.26 |
0.9 | 8.73 | 4.80 | 3.03 | 2.45 | 2.18 | 2.18 | 2.21 | 2.15 | 2.15 | 2.13 | 2.16 | 2.21 | 2.21 | 2.14 | 2.15 | 2.13 | 2.20 | 2.17 | 2.13 | 2.13 |
1.0 | 8.06 | 4.60 | 2.84 | 2.13 | 2.14 | 2.13 | 2.15 | 2.14 | 2.13 | 2.13 | 2.13 | 2.14 | 2.14 | 2.14 | 2.15 | 2.16 | 2.14 | 2.14 | 2.13 | 2.14 |
7 Real data experiment
In this section, the relevance of our method is compared with the GFGL through a portfolio allocation experiment based on real financial data. The same computer was employed as in Subsection 6.1. We consider hereafter the stochastic process in of log-stock returns, where with the stock price of the -th index at time . The portfolio allocation will be performed with stock data from the S&P 500, which are representative of different economic sectors (the data can be found on https://finance.yahoo.com or https://macrobond.com, and are provided in the GitHub repository): Alphabet, Amazon, American Airlines, Apple, Berkshire Hathaway, Boeing, Chevron, Equity Residential, ExxonMobil, Ford, General Electric, Goldman Sachs, Jacobs Engineering Group, JPMorgan, Lockheed Martin, Pfizer, Procter & Gamble, United Parcel Service, Verizon, Walmart. The sample period is November 11, 2019 – March 27, 2020, corresponding to observations.
We analyse the economic performances obtained by the GFDtL and GFGL through the GMVP investment problem. Following [8], the latter problem at time , in the absence of short-sales constraints, is defined as
where is the vector of portfolio weights for time chosen at time , is the conditional (on the past) covariance matrix of . The explicit solution is given by . Now it is natural to replace by an estimator , yielding . As a function depending on only, the GMVP performance essentially depends on the precise measurement of the latter or, equivalently, the precision matrix. In our setting, we set , estimated by the GFDtL and GFGL procedures, where are selected by Method c described in Subsection 5.1. We also consider the equally weighted portfolio, which will be denoted by . The following performance metrics (annualized) will be reported: AVG as the average of portfolio returns computed as , multiplied by ; SD as the standard deviation of portfolio returns, multiplied by ; IR as the information ratio computed as .
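A minimal sketch of the GMVP weight computation from an estimated precision matrix (the estimator plugged in is the one produced by GFDtL or GFGL for the relevant regime):

```python
import numpy as np

def gmvp_weights(precision):
    # w = Theta 1 / (1' Theta 1), i.e., the explicit GMVP solution with the
    # inverse covariance replaced by the estimated precision matrix.
    ones = np.ones(precision.shape[0])
    w = precision @ ones
    return w / (ones @ w)
```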
The key performance measure is SD. The GMVP problem essentially aims to minimize the variance rather than to maximize the expected return. Hence, as emphasized in [9], Section 6.2, high AVG and IR are desirable but should be considered of secondary importance compared with the quality of the measure of a covariance matrix estimator. We also report the number of breaks detected by the procedure.
Both GFGL and GFDtL estimate the following break dates: February 25, 2020; March 9, 2020; March 17, 2020. These breaks are in line with the COVID-19 outbreak and its aftermath: the S&P 500 index entered a downward trend from February 20, 2020, and reached a minimum value on March 23, 2020, which precedes the rally. Despite the presence of the COVID-19 shock, our proposed GFDtL procedure provides the lowest SD and clearly outperforms the GFGL. The BIC-based selection results in relevant estimations of the break dates.
S&P 500 data | AVG | SD | IR | Breaks
---|---|---|---|---
GFDtL | -50.29 | 35.76 | -1.41 | 3 |
GFGL | -101.44 | 50.00 | -2.03 | 3 |
Equally weighted | -77.77 | 45.69 | -1.70 | -
Note: The lowest SD figure is in bold face.
8 Concluding remarks
We propose a filtering procedure for the estimation of the number of change-points in the precision matrix of a vector-valued random process, whose full structure can “break” over time. We show that the estimated break dates are sufficiently close to the true break dates together with the consistency of the estimated precision matrix in each regime. We propose an ADMM-based algorithm to solve the optimization problem with breaks and study its properties. The simulation results illustrate the relevance of our method compared to the Gaussian likelihood GFGL in terms of change-point detection and graph recovery. They also emphasize the difficulty of devising a strategy to select the optimal pair .
A potential extension consists in the specification of adaptive weights in both the Group Fused penalty and the LASSO penalty: indeed, unless assuming high-level conditions such as the irrepresentable condition of [45], the LASSO penalty requires a correction, such as the adaptive version, to ensure the recovery of the sparse structure provided that the estimated break dates are close to the true ones. This would require the computation of some first step consistent estimator that would enter the penalty functions.
Acknowledgements
Benjamin Poignard was supported by JSPS KAKENHI (22K13377) and RIKEN. Ting Kei Pong was supported partly by the Hong Kong Research Grants Council PolyU153002/21p. Akiko Takeda was supported by JSPS KAKENHI (23H03351).
Appendix A Intermediary results
Proof.
Let us prove Point (i). Recall that is the true variance-covariance of , with . Now take any , with such that . Then, by the triangle inequality, under Assumption 2:
We show . By Assumption 1, we can apply Theorem 1 of [26] on : the latter result states the mixing condition for some ; then for , we may take , which allows us to apply their Theorem 1. Thus, there exist constants such that, , ,
Take large enough. Applying the previous inequality, by Bonferroni’s inequality:
which goes to as by Assumption 3-(i) implying . Since for ,
under . To prove Point (ii), we rely on the inequality
The result follows from the bound derived on . ∎
The next Lemma will be useful to bound the first order derivative w.r.t. of the non-penalized D-trace loss function.
Lemma A.2.
Proof.
The result follows the same steps as in the proof of Lemma A.1. ∎
Lemma A.3.
Consider problem (2.1). Define , and . The GFDtL estimator satisfies the conditions
where are the sub-gradient matrices defined by
and and for , satisfies
Proof.
Defining , and , the problem stated in (2.1) can be recast as a minimization of the function
Invoking subdifferential calculus, a necessary and sufficient condition for to minimize is that for all , belongs to the subdifferential of with respect to at , that is
with the subgradient matrices defined as: and
Now if for is one of the estimated break dates, then , and
since the breaks cannot occur at and . When , then the first order condition with respect to yields
so that
∎
Lemma A.4.
Let be a given set of -dimensional vectors with . For an arbitrary fixed , let be defined as
where is the -th element of . Then has a Slater point.
Proof.
Since , there exist a large and a small such that
where and for all . We can see from the above display that
(A.1)
To find and such that for all , we need
(A.2)
We claim that the following choice of defined recursively (starting from ) satisfies (A.2):
Indeed, it is routine to check that the second and third lines of (A.2) are satisfied. Then, using and the above display recursively, we have
Then by (A.1), . Hence is a Slater point of . ∎
Appendix B Proofs
B.1 Proof of Theorem 3.1
Proof of point (i).
The proof builds upon the works of [17], Proposition 3, [30], Theorem 3.1 and [14], Theorem 1. We define:
By union bound, , . So we aim to show:
with the complement of .
Proof of (a). We show:
where . We prove as the other case follows in the same spirit. In light of :
(B.1)
By Lemma A.3, with and , let , in form:
and
Therefore, under , taking the differences, by the triangle inequality, we obtain:
Each component of is bounded by . We deduce by the triangle inequality:
(B.2)
where the first equality holds since and for by (B.1). Let the event:
Since inequality (B.2) holds with probability one, then . Therefore, we have:
Let us first bound . Since , for :
with with probability tending to one by Lemma A.1, and . By , in Assumption 3-(iii), we deduce . We now bound . For any :
with with probability tending to one by Lemma A.1. We now need to evaluate the bound for . To do so, we rely on the KKT conditions of Lemma A.3. Note that with probability tending to one, . We have when as given and given . Therefore, by Lemma A.3 with and , following the steps to obtain inequality (B.2), we get
Therefore, denoting by , conditional on :
By Lemma A.1, with probability tending to one. We deduce
The first term tends to zero since and by Assumption 3-(ii) and (iii). As for the second term, using , note that
Therefore, for finite, applying Lemma A.2, we deduce that for any :
since . Hence, . Let us now consider . Applying the same reasoning to show the convergence of the second summation on the right-hand side of (B.1), we get
when , and
since with probability tending to one, and , then under , we deduce . Consequently, we proved .
Proof of (b). We prove (b) by showing and . As in the proof of (a), we simply show . To do so, we define:
where . Then, we have:
We first bound . For any :
since implies . Moreover, since
we deduce:
(B.4)
Let us treat the first term. By Lemma A.3 with and , we obtain:
and
We deduce
As a consequence:
(B.5)
In the same vein, applying Lemma A.3 with and , we obtain:
where with probability one. Let the event:
Therefore, by the triangle inequality, (B.5) and (B.1) imply that the event holds with probability one. Hence:
(B.8)
The first term in (B.8) tends to zero under , , and . Moreover, note that
and
under Assumption 3-(ii)-(iii). In the same manner, we can show that the second term in (B.4) tends to zero.
We now consider . The probability of the event is upper bounded by:
Now implies and for any and:
Therefore:
First, we consider the second term of the right-hand side of (B.1). Let in (B.1), then holds with probability one. Therefore:
Since , then . So under the conditions , the right-hand side of the previous inequality converges to zero. As for the first term of (B.1), applying in (B.1):
The right-hand side of the last inequality converges to zero under the same conditions. Finally, we can prove that .
Proof of point (ii).
By point (i) and under Assumption 3-(ii), for any , , which is under Assumption 3-(ii). Hence, or is satisfied for any . Set and assume and consider two cases: (ii-a) and (ii-b) . In case (ii-a), by Lemma A.3 with change-points and :
Therefore, we deduce
Therefore, using part (i) of Theorem 3.1, we obtain
where with probability tending to one. We deduce
It then follows that
(B.10)
In case (ii-b), by Lemma A.3, with change-points and , we have
With with probability tending to one, we deduce
Hence, (B.10) holds. Using similar arguments, we can show that the latter is satisfied when .
B.2 Proof of Theorem 3.2
Using the result of Theorem 3.1, we aim to show that:
(B.11)
To do so, we define:
The probability (B.11) can be bounded as:
We first focus on , which can be expressed as:
By Lemma A.3 with change-points and , given the case :
and
Therefore, taking the differences, we get:
Then, the event defined as
where with probability tending to one, holds with probability one. Hence, we deduce
with
In the same vein as in the analysis of (B.8), we can show that as . requires more arguments. By Lemma A.3, with change-points and :
and
Therefore
which implies
with with probability tending to one. We deduce
where with probability tending to one. The first term in the second inequality of (B.2) tends to zero under the conditions and . Under , the second term also tends to zero. Therefore, we conclude as . Based on similar arguments, we can show as ; hence as . Similarly, it can be proved that as .
We now consider . Define
First, we consider . By Lemma A.3, for the change-points and , we obtain
and for the change-points and , we get
Moreover, by the triangle inequality, we have
with with probability tending to one. So is upper bounded as follows
which tends to zero in the spirit as in (B.8). For , by Lemma A.3 with change-points and to obtain (B.2) and with change-points , , we get
By the triangle inequality, we have
with with probability tending to one. Therefore, we obtain
which tends to zero based on arguments similar to those used for (B.8). For , by Lemma A.3 with change-points and , we have
and with change-points , , we get
By the triangle inequality, we deduce
with , with probability tending to one. We deduce
which tends to zero based on the same arguments as in the convergence of (B.8). Finally, to analyze , we apply Lemma A.3 with to obtain (B.2) and with to obtain (B.2). By the triangle inequality, we have
we deduce
which also tends to zero, as in the proof of the convergence to zero of (B.8). We conclude that as .
B.3 Proof of Proposition 4.1
Proof of point (i).
We can rewrite (4.1) as the following constrained optimization problem:
(B.18)
where we write for short; is the indicator function of the set ; is a matrix whose diagonal elements are 0; is the copy of with the diagonal elements set to 0.
Denote the dual variables by for simplicity, where , , and for all . The Lagrangian function of (B.18) is
where for convenience, we set ; we further note that . Now, letting , we have
Therefore, we obtain the dual problem as in (4.2). Finally, the equality of the optimal values follows from [32, Theorem 31.1] upon noting that there exists with for all satisfying the equality constraints in (B.18).
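For intuition, the derivation above follows the standard Lagrangian pattern. For a generic problem $\min_{x} f(x)$ subject to $Ax=b$, with $f$ closed, proper and convex (generic notation, used only to display the mechanism), minimizing the Lagrangian in the primal variable gives the dual function:
\[
\inf_{x}\big\{f(x)+\langle y,\,Ax-b\rangle\big\}\;=\;-f^{*}(-A^{\top}y)-\langle b,\,y\rangle ,
\]
and the resulting dual problem $\sup_{y}\{-f^{*}(-A^{\top}y)-\langle b,y\rangle\}$ has the same optimal value as the primal under a constraint qualification, such as the existence of a feasible point in the relative interior of $\operatorname{dom} f$.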
B.4 Proof of Proposition 4.2
Proof of point (i).
We first rewrite (4.4) as the following constrained optimization problem:
(B.19)
where we write for short; is a matrix whose diagonal elements are 0; is the copy of with the diagonal elements set to 0.
Denote the dual variables by for simplicity, where , , and for all . The Lagrangian function of (B.19) is
where for convenience, we set ; we also note that . Now, letting , we have
(B.20)
For the problem for each , it holds that
For ①, we have
(B.21)
where the first equality follows from the (equality case in) Cauchy-Schwarz inequality; .
For ②, one can see that
(B.22)
where (a) comes from the (equality case in) Cauchy-Schwarz inequality, and (b) is true since the minimum is attained at . One can see this by first locating the vertex of the quadratic objective.
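Both facts are elementary; as a worked illustration with generic symbols (a nonzero vector $a$, a radius $r>0$, and scalars $\alpha>0$, $\beta$, none of which refer to the quantities in the proof),
\[
\min_{\|u\|_2\le r}\,\langle a,u\rangle=-r\|a\|_2 \quad\text{at } u=-r\,\frac{a}{\|a\|_2},
\qquad
\min_{t\in\mathbb{R}}\Big\{\frac{t^{2}}{2\alpha}-\beta t\Big\}=-\frac{\alpha\beta^{2}}{2}\quad\text{at } t=\alpha\beta .
\]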
Using (B.21) and (B.22), we have
The above display is exactly the definition of . Using this and (B.20), we can conclude that the dual problem of (4.4) is (4.5). Finally, the equality of optimal values follows from [32, Theorem 31.1] upon noting that there exists with for all satisfying the equality constraints in (4.4).
B.5 Proof of Proposition 4.3
For simplicity, for a given pair of fixed and , we denote the objective functions of (4.1) and (4.4) with by and , respectively. From the definition of in (4.3), we know that
(B.23)
Proof of point (i).
Suppose that are such that (4.1) has solutions.
Let be an arbitrary solution to (4.1) and define
Then it holds that
(B.24)
Fix any . Suppose that is a solution to (4.4) with this ; then we have
where (a) comes from (B.23); (b) holds thanks to the assumption that is a solution to (4.4) with ; (c) is true because of (B.24). Therefore, is also a solution to (4.1).
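In schematic form, writing $f$ for the objective of (4.1) and $\tilde{f}$ for the objective of (4.4) (symbols introduced here only to display the chain), and denoting by $\widetilde{\Omega}$ the considered solution to (4.4) and by $\bar{\Omega}$ the solution to (4.1), the three steps read
\[
f(\widetilde{\Omega})\;\overset{(a)}{\le}\;\tilde{f}(\widetilde{\Omega})\;\overset{(b)}{\le}\;\tilde{f}(\bar{\Omega})\;\overset{(c)}{=}\;f(\bar{\Omega}),
\]
so that $\widetilde{\Omega}$ attains the optimal value of (4.1).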
Proof of point (ii).
It suffices to show that, for arbitrary fixed , if there exists such that there exists a solution to (4.4) with that satisfies
(B.25)
then (4.1) has solutions. To this end, we notice from (B.25) and the definition of in (4.3) that
(B.26)
where and are the subdifferentials of and at , respectively.
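For completeness, we recall the standard facts underlying this subdifferential condition (stated generically): a point $x^{\star}$ minimizes a proper convex function $h$ if and only if $0\in\partial h(x^{\star})$, and for proper convex functions $h_1,h_2$ such that $h_1$ is continuous at some point of $\operatorname{dom} h_2$,
\[
\partial\big(h_1+h_2\big)(x)\;=\;\partial h_1(x)+\partial h_2(x)\qquad\text{for all } x .
\]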
References
- Bauschke and Combettes [2011] H. H. Bauschke and P. L. Combettes. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer New York, 2011. ISBN 9781441994677. 10.1007/978-1-4419-9467-7.
- Bleakley and Vert [2011] K. Bleakley and J. Vert. The group fused lasso for multiple change-point detection. HAL, Technical Report, Computational Biology Center, Paris, 2011.
- Cai et al. [2016] T. Cai, W. Liu, and H. Zhou. Estimating sparse precision matrix: Optimal rates of convergence and adaptive estimation. The Annals of Statistics, 44(2):455–488, 2016. 10.1214/13-AOS1171.
- Chan et al. [2014] N. Chan, C. Yau, and R.-M. Zhang. Group lasso for structural break time series. Journal of the American Statistical Association, 109(506):590–599, 2014. 10.1080/01621459.2013.866566.
- Chen et al. [2017] L. Chen, D. Sun, and K.-C. Toh. An efficient inexact symmetric Gauss–Seidel based majorized ADMM for high-dimensional convex composite conic programming. Mathematical Programming, 161(1–2):237–270, Apr. 2017. ISSN 1436-4646. 10.1007/s10107-016-1007-5.
- Eckstein [1994] J. Eckstein. Some saddle-function splitting methods for convex programming. Optimization Methods and Software, 4(1):75–83, Jan. 1994. ISSN 1029-4937. 10.1080/10556789408805578.
- Eckstein and Bertsekas [1992] J. Eckstein and D. P. Bertsekas. On the Douglas—Rachford splitting method and the proximal point algorithm for maximal monotone operators. Mathematical Programming, 55(1–3):293–318, Apr. 1992. ISSN 1436-4646. 10.1007/bf01581204.
- Engle and Colacito [2006] R. Engle and R. Colacito. Testing and valuing dynamic correlations for asset allocation. Journal of Business & Economic Statistics, 24(2):238–253, 2006. 10.1198/073500106000000017.
- Engle et al. [2019] R. Engle, O. Ledoit, and M. Wolf. Large dynamic covariance matrices. Journal of Business & Economic Statistics, 37(2):363–375, 2019. 10.1080/07350015.2017.1345683.
- Fazel et al. [2013] M. Fazel, T. K. Pong, D. Sun, and P. Tseng. Hankel matrix rank minimization with applications to system identification and realization. SIAM Journal on Matrix Analysis and Applications, 34(3):946–977, Jan. 2013. ISSN 1095-7162. 10.1137/110853996.
- Fortin and Glowinski [1983] M. Fortin and R. Glowinski. Augmented Lagrangian Methods: Applications to the Numerical Solution of Boundary-Value Problems. Studies in Mathematics and Its Applications. Elsevier, 1983.
- Gabay and Mercier [1976] D. Gabay and B. Mercier. A dual algorithm for the solution of nonlinear variational problems via finite element approximation. Computers & Mathematics with Applications, 2(1):17–40, 1976. ISSN 0898-1221. 10.1016/0898-1221(76)90003-1.
- Gibberd and Nelson [2017] A. Gibberd and J. Nelson. Regularized estimation of piecewise constant Gaussian graphical models: the group-fused graphical lasso. Journal of Computational and Graphical Statistics, 26(3):623–634, 2017. 10.1080/10618600.2017.1302340.
- Gibberd and Roy [2021] A. Gibberd and S. Roy. Consistent multiple changepoint estimation with fused Gaussian graphical models. Annals of the Institute of Statistical Mathematics, 73:283–309, 2021. 10.1007/s10463-020-00749-0.
- Glowinski and Marroco [1975] R. Glowinski and A. Marroco. Sur l’approximation, par éléments finis d’ordre un, et la résolution, par pénalisation-dualité d’une classe de problèmes de dirichlet non linéaires. Revue française d’automatique, informatique, recherche opérationnelle. Analyse numérique, 9(R2):41–76, 1975. ISSN 0397-9342. 10.1051/m2an/197509r200411.
- Hallac et al. [2017] D. Hallac, Y. Park, S. Boyd, and J. Leskovec. Network inference via the time-varying graphical lasso. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2017. 10.1145/3097983.3098037.
- Harchaoui and Lévy-Leduc [2010] Z. Harchaoui and C. Lévy-Leduc. Multiple change-point estimation with a total variation penalty. Journal of the American Statistical Association, 105(492):1480–1493, 2010. 10.1198/jasa.2010.tm09181.
- He et al. [2002] B. He, L.-Z. Liao, D. Han, and H. Yang. A new inexact alternating directions method for monotone variational inequalities. Mathematical Programming, 92(1):103–118, Mar. 2002. ISSN 0025-5610. 10.1007/s101070100280.
- Ji et al. [2021] J. Ji, Y. He, L. Liu, and L. Xie. Brain connectivity alteration detection via matrix-variate differential network model. Biometrics, 77(4):1409–1421, 2021. 10.1111/biom.13359.
- Kolar and Xing [2012] M. Kolar and E. Xing. Estimating networks with jumps. Electronic Journal of Statistics, 6:2069–2106, 2012. 10.1214/12-EJS739.
- Li et al. [2016a] D. Li, J. Qian, and L. Su. Panel data models with interactive fixed effects and multiple structural breaks. Journal of the American Statistical Association, 111(516):1804–1819, 2016a. 10.1080/01621459.2015.1119696.
- Li and Lin [2019] H. Li and Z. Lin. Accelerated alternating direction method of multipliers: An optimal O(1 / k) nonergodic analysis. Journal of Scientific Computing, 79(2):671–699, Dec. 2019. ISSN 1573-7691. 10.1007/s10915-018-0893-5.
- Li et al. [2016b] X. Li, D. Sun, and K.-C. Toh. A Schur complement based semi-proximal ADMM for convex quadratic conic programming and extensions. Mathematical Programming, 155(1–2):333–373, Dec. 2016b. ISSN 1436-4646. 10.1007/s10107-014-0850-5.
- Li et al. [2018] X. Li, D. Sun, and K.-C. Toh. QSDPNAL: a two-phase augmented Lagrangian method for convex quadratic semidefinite programming. Mathematical Programming Computation, 10(4):703–743, Apr. 2018. ISSN 1867-2957. 10.1007/s12532-018-0137-6.
- Li et al. [2019] X. Li, D. Sun, and K.-C. Toh. A block symmetric Gauss–Seidel decomposition theorem for convex composite quadratic programming and its applications. Mathematical Programming, 175(1–2):395–418, Feb. 2019. ISSN 1436-4646. 10.1007/s10107-018-1247-7.
- Merlevède et al. [2011] F. Merlevède, M. Peligrad, and E. Rio. A Bernstein type inequality and moderate deviations for weakly dependent sequences. Probability Theory and Related Fields, 151:435–474, 2011. 10.1007/s00440-010-0304-9.
- Monti et al. [2014] R. Monti, P. Hellyer, D. Sharp, R. Leech, C. Anagnostopoulos, and G. Montana. Estimating time-varying brain connectivity networks from functional MRI time series. NeuroImage, 103:427–443, 2014. 10.1016/j.neuroimage.2014.07.033.
- Ouyang et al. [2015] Y. Ouyang, Y. Chen, G. Lan, and E. Pasiliao. An accelerated linearized alternating direction method of multipliers. SIAM Journal on Imaging Sciences, 8(1):644–681, Jan. 2015. ISSN 1936-4954. 10.1137/14095697x.
- Poignard and Asai [2023] B. Poignard and M. Asai. Estimation of high-dimensional vector autoregression via sparse precision matrix. The Econometrics Journal, 26(2):307–326, 2023. 10.1093/ectj/utad003.
- Qian and Su [2016a] J. Qian and L. Su. Shrinkage estimation of regression models with multiple structural changes. Econometric Theory, 32(6):1376–1433, 2016a. 10.1017/S0266466615000237.
- Qian and Su [2016b] J. Qian and L. Su. Shrinkage estimation of common breaks in panel data models via adaptive group fused lasso. Journal of Econometrics, 191(1):86–109, 2016b. 10.1016/j.jeconom.2015.09.004.
- Rockafellar [1970] R. T. Rockafellar. Convex Analysis. Princeton University Press, Dec. 1970. ISBN 9781400873173. 10.1515/9781400873173.
- Roy et al. [2017] S. Roy, Y. Atchadé, and G. Michailidis. Change point estimation in high dimensional Markov random-field models. Journal of the Royal Statistical Society Series B: Statistical Methodology, 79(4):1187–1206, 2017. 10.1111/rssb.12205.
- Sabach and Teboulle [2022] S. Sabach and M. Teboulle. Faster Lagrangian-based methods in convex optimization. SIAM Journal on Optimization, 32(1):204–227, Feb. 2022. ISSN 1095-7189. 10.1137/20m1375358.
- Sun et al. [2024] D. Sun, Y. Yuan, G. Zhang, and X. Zhao. Accelerating preconditioned ADMM via degenerate proximal point mappings, 2024.
- Xiao et al. [2018] Y. Xiao, L. Chen, and D. Li. A generalized alternating direction method of multipliers with semi-proximal terms for convex composite conic programming. Mathematical Programming Computation, 10(4):533–555, Jan. 2018. ISSN 1867-2957. 10.1007/s12532-018-0134-9.
- Xu [2017] Y. Xu. Accelerated first-order primal-dual proximal methods for linearly constrained composite convex programming. SIAM Journal on Optimization, 27(3):1459–1484, Jan. 2017. ISSN 1095-7189. 10.1137/16m1082305.
- Yang et al. [2023] B. Yang, X. Zhao, X. Li, and D. Sun. An accelerated proximal alternating direction method of multipliers for optimal decentralized control of uncertain systems, 2023.
- Yang et al. [2021] L. Yang, J. Li, D. Sun, and K. Toh. A fast globally linearly convergent algorithm for the computation of Wasserstein barycenters. Journal of Machine Learning Research, 22:1–37, 2021. URL http://jmlr.org/papers/v22/19-629.html.
- Yuan et al. [2017] H. Yuan, R. Xi, C. Chen, and M. Deng. Differential network analysis via lasso penalized D-trace loss. Biometrika, 104(4):755–770, 2017. 10.1093/biomet/asx049.
- Yuan et al. [2019] H. Yuan, S. He, and M. Deng. Compositional data network analysis via lasso penalized D-trace loss. Bioinformatics, 35(18):3404–3411, 2019. 10.1093/bioinformatics/btz098.
- Zhang et al. [2022] G. Zhang, Y. Yuan, and D. Sun. An efficient HPR algorithm for the Wasserstein barycenter problem with computational complexity, 2022.
- Zhang et al. [2024] S. Zhang, H. Wang, and W. Lin. Care: Large precision matrix estimation for compositional data. Journal of the American Statistical Association, 2024. 10.1080/01621459.2024.2335586.
- Zhang and Zou [2014] T. Zhang and H. Zou. Sparse precision matrix estimation via lasso penalized D-trace loss. Biometrika, 101(1):103–120, 2014. 10.1093/biomet/ast059.
- Zhao and Yu [2006] P. Zhao and B. Yu. On model selection consistency of lasso. Journal of Machine Learning Research, 7(90):2541–2563, 2006. URL https://www.jmlr.org/papers/v7/zhao06a.html.
- Zhou et al. [2010] S. Zhou, J. Lafferty, and L. Wasserman. Time varying undirected graphs. Machine Learning, 80:295–319, 2010. 10.1007/s10994-010-5180-0.