
Break recovery in graphical networks with D-trace loss

Ying Lin, Benjamin Poignard, Ting Kei Pong, Akiko Takeda

Ying Lin: Department of Applied Mathematics, The Hong Kong Polytechnic University, People’s Republic of China. E-mail address: [email protected]
Benjamin Poignard: Graduate School of Economics, Osaka University, Japan; jointly affiliated at RIKEN and OIST (MLDS unit). E-mail address: [email protected]
Ting Kei Pong: Department of Applied Mathematics, The Hong Kong Polytechnic University, People’s Republic of China. E-mail address: [email protected]
Akiko Takeda: Graduate School of Information Science and Technology, The University of Tokyo, Japan; Center for Advanced Intelligence Project, RIKEN, Japan. E-mail address: [email protected]
Abstract

We consider the problem of estimating a time-varying sparse precision matrix, which is assumed to evolve in a piece-wise constant manner. Building upon the Group Fused LASSO and LASSO penalty functions, we estimate both the network structure and the change-points. We propose an alternative estimator to the commonly employed Gaussian likelihood loss, namely the D-trace loss. We provide the conditions for the consistency of the estimated change-points and of the sparse estimators in each block. We show that the solutions to the corresponding estimation problem exist when some conditions relating to the tuning parameters of the penalty functions are satisfied. Unfortunately, these conditions are not verifiable in general, posing challenges for tuning the parameters in practice. To address this issue, we introduce a modified regularizer and develop a revised problem that always admits solutions: these solutions can be used for detecting possible unsolvability of the original problem or obtaining a solution of the original problem otherwise. An alternating direction method of multipliers (ADMM) is then proposed to solve the revised problem. The relevance of the method is illustrated through simulations and real data experiments.

Key words: Alternating direction method of multipliers, Change-points, Dynamic network, Precision matrix, Sparsity.

1 Introduction

Much attention has been devoted to the development of methods for detecting multiple change-points in the underlying data generating process of a realized sample. The model parameters are typically assumed to be constant in each regime but can experience “jumps” or “breaks” over time. The extraction of these breaks is usually performed through some filtering technique; a popular choice is the fused LASSO, which screens for parameter breaks over the sample of observations. In particular, [17] developed a framework to recover multiple change-points for one-dimensional piece-wise constant signals. Then [4] extended this procedure to grouped parameters with the Group Fused LASSO in the context of autoregressive time series models. In the same spirit, [30] applied the Group Fused LASSO to linear regression models. This filtering technique has also been popular in the literature on change-points in panel data models: e.g., [31] aimed to identify structural changes in linear panel data regressions based on the adaptive Group Fused LASSO; using a similar filtering technique, [21] considered the estimation of structural breaks in panel models with interactive fixed effects.

This paper considers the detection of structural breaks in time series networks, whose structure is represented by the corresponding precision matrix. Several works have been dedicated to change-point detection for precision matrices. [46] assumed that the covariance matrix evolves smoothly over time. Contrary to the framework of [20], which focused on the detection of multiple breaks at a node level through the Group Fused LASSO, [33] restricted their framework to a single change-point that impacts the global network structure. Our aim is to detect multiple change-points that affect the whole network. Thus, our viewpoint is similar to [16] and [13]: these works considered the Group Fused Graphical LASSO (GFGL) to detect breaks in the precision matrix. The latter filtering technique is a mixture of the Group Fused LASSO regularization of [2] and the LASSO regularization applied to the Gaussian likelihood function. We propose an alternative estimator to the commonly employed penalized Gaussian likelihood, which builds upon the D-trace loss of [44]. The formulation of the latter is much simpler than the Gaussian likelihood (in particular, there is no log-determinant term), thus allowing for a more direct theoretical analysis and implementation. The D-trace loss and its extensions have been applied to diverse problems: while [40] considered network change analysis between two samples of vector-valued observations, [19] adapted the latter framework to differential network analysis for fMRI data; in the context of compositional data, [41] and [43] considered modified versions of the D-trace loss to estimate the high-dimensional compositional precision matrix under a sparsity constraint; using the matching between the coefficients of SVAR processes and the entries of the precision matrix of the observed sample, [29] considered the sparse estimation of the latter based on the D-trace loss. In addition to the detection of structural breaks in the network, we allow for sparse dependence in the entries of the precision matrix in each regime. This motivates the use of the LASSO penalty applied to each coefficient of the piece-wise constant precision matrix.

The corresponding estimation problem, formulated in (2.1) in Section 2 with the D-trace loss, includes two tuning parameters. In practice, the optimal parameters are unknown a priori and are typically selected by solving the estimation problem over a range of tuning parameters. However, we show that the problem with general tuning parameters may be unbounded from below (and hence may not have solutions). Consequently, while searching for the optimal parameters, one may encounter estimation problems that are unbounded from below, and this unboundedness can be numerically difficult to detect. To address this issue, we introduce a new regularizer, thereby constructing a revised problem that has solutions regardless of the choice of the tuning parameters. In addition to the existence of a solution, this revised problem enjoys several desirable properties. On the one hand, if the solutions to the revised problem exhibit some easy-to-detect patterns, then the original problem may not have solutions, and we can further update the tuning parameters towards obtaining reliable estimators. On the other hand, if the patterns are not detected, then the solutions to the revised problem also solve the original problem. We adapt the celebrated alternating direction method of multipliers (ADMM) to solve the revised problem, with its convergence guaranteed by [10]. ADMM is a widely used algorithm for solving convex optimization problems with separable objectives and linear coupling constraints. Its classical version was first introduced by [15] and [12], with convergence established in [11]. Then [7] showed that ADMM is equivalent to the proximal point algorithm applied to a certain maximal monotone operator. This insight led to the development of the first proximal ADMM by [6]. Building upon Eckstein’s work, [18] extended the proximal ADMM to include a broader class of proximal terms. This approach was further advanced by [10], which generalized the method to allow the use of semi-proximal terms, enhancing the algorithm’s applicability. More recently, considerable efforts have been made to further accelerate variants of ADMM: see, e.g., the Schur complement-based semi-proximal ADMM in [23]; the inexact symmetric Gauss-Seidel-based ADMM in [5], [24], [25], [39]; the accelerated linearized ADMM in [28], [37], [22]; the preconditioned ADMM in [36], [34], [35]; the accelerated proximal ADMM based on Halpern iteration in [42], [38].

Our contributions can be summarized as follows: we propose a new estimator for break detection in the precision matrix, the Group Fused D-trace LASSO (GFDtL); we derive conditions under which all the break points and the precision matrices can be consistently estimated when the estimated number of breaks coincides with the true number of breaks; we provide a modified regularizer to ensure the existence of solutions to the revised problem, and show that these solutions either exhibit some easily verifiable patterns indicating the possible unsolvability of the original problem, in which case we can further update the tuning parameters towards obtaining reliable estimators, or solve the original problem if the patterns are not detected; an ADMM is adapted for the implementation, with convergence guarantees. The relevance of the novel estimator for change-points in time-varying networks, compared to the standard Gaussian-based GFGL estimator, is illustrated through simulations and real data experiments.

The rest of the paper is organized as follows. Section 2 details the framework and estimation procedure. Section 3 contains the asymptotic properties. Section 4 is devoted to the optimization aspect of the estimation procedure. Section 5 provides the implementation details. Section 6 contains some simulation experiments, a sensitivity and computational complexity analysis. A real data experiment is provided in Section 7. All the proofs of the main text and auxiliary results are relegated to the Appendices.

Notation: Throughout this paper, we use $V_{k}$ and $A_{kl}$ to denote the $k$-th element of a vector $V\in{\mathbb{R}}^{d}$ and the $(k,l)$-th element of a matrix $A\in{\mathbb{R}}^{m\times n}$, respectively. We write $A^{\top}$ (resp. $V^{\top}$) to denote the transpose of the matrix $A$ (resp. the vector $V$). For any square matrix $A\in{\mathbb{R}}^{n\times n}$, we write $\operatorname{Sym}(A)\coloneqq(A+A^{\top})/2$ to denote the symmetrization of $A$. The $p$-th order identity matrix is denoted by $I_{p}$. We denote by $\mathbf{0}_{k\times l}\in{\mathbb{R}}^{k\times l}$ (resp. $\mathbf{1}_{k\times l}\in{\mathbb{R}}^{k\times l}$) the $k\times l$ zero matrix (resp. the $k\times l$ matrix of ones). We write $\text{vec}(A)$ to denote the vectorization operator that stacks the columns of $A$ on top of one another into a vector. For two matrices $A$ and $B$, $A\otimes B$ is their Kronecker product. The set of $n\times n$ symmetric matrices is denoted by $\mathcal{S}^{n}$. A symmetric matrix $S\in\mathcal{S}^{n}$ is said to be positive semi-definite (resp. positive definite) and written as $S\succeq 0$ (resp. $S\succ 0$) if all its eigenvalues are non-negative (resp. positive). The expression $A\succeq B$ (resp. $A\succ B$) for $A,B\in\mathcal{S}^{n}$ means $A-B\succeq 0$ (resp. $A-B\succ 0$). We use $\mathcal{S}_{+}^{n}$ and $\mathcal{S}_{++}^{n}$ to denote the sets of positive semi-definite matrices and positive definite matrices, respectively. For a positive semi-definite matrix $S$, $S^{\frac{1}{2}}$ is the unique positive semi-definite matrix such that $S=S^{\frac{1}{2}}S^{\frac{1}{2}}$. The $\ell_{p}$ norm of $V\in{\mathbb{R}}^{d}$ is denoted by $\|V\|_{p}=\big(\sum^{d}_{k=1}|V_{k}|^{p}\big)^{1/p}$, $p\geq 1$. The Frobenius norm and the off-diagonal $\ell_{1}$ (semi)norm of a matrix $A\in{\mathbb{R}}^{m\times n}$ are denoted by $\|A\|_{F}=\sqrt{\sum_{k=1}^{m}\sum_{l=1}^{n}|A_{kl}|^{2}}$ and $\|A\|_{1,\mathrm{off}}=\sum_{k\neq l}|A_{kl}|$, respectively. The spectral norm, i.e., the maximum singular value, of a matrix $A$ is written as $\|A\|_{s}$. For a symmetric matrix $A$, we use $\lambda_{\max}(A)$ (resp. $\lambda_{\min}(A)$) to denote its largest (resp. smallest) eigenvalue. The maximum absolute value of all entries of a matrix $A$ is denoted by $\|A\|_{\max}$. We use $\mathcal{S}_{\mathrm{off}}^{n}$ to denote the set of $n\times n$ symmetric matrices whose diagonal elements are $0$. By $\{Y_{t,\mathrm{off}}\}_{t=1}^{T}$ we denote a sequence of symmetric matrices whose diagonal elements are $0$, i.e., $Y_{t,\mathrm{off}}\in\mathcal{S}_{\mathrm{off}}^{n}$. For a sequence of symmetric matrices $\{\Theta_{t}\}_{t=1}^{T}$, $\Theta_{uv,t}$ refers to the $(u,v)$-th element of $\Theta_{t}$, and $\Theta_{t,\mathrm{off}}$ is the copy of $\Theta_{t}$ with diagonal elements set to $0$; in particular, $\|\Theta_{t}\|_{1,\mathrm{off}}=\|\Theta_{t,\mathrm{off}}\|_{1}$. Given $\epsilon\geq 0$, we use $\mathrm{Proj}_{\cdot\succeq\epsilon I_{p}}$ to denote the projection onto $\{S\,:\,S\succeq\epsilon I_{p}\}$. The proximal operator of a function $f$ at $x$ is defined as $\mathrm{prox}_{f}(x)=\operatorname*{\arg\min}_{y}\{f(y)+\frac{1}{2}\|y-x\|^{2}\}$; for more details about the proximal operator, we refer the interested readers to Sections 12.4, 14.1 and 14.2 in [1].

2 Framework

For a sequence of $p$-dimensional random vectors $(X_{t})$ observed at $t=1,\ldots,T$, we consider the estimation of the underlying network through the corresponding precision matrix. The latter is assumed to evolve over time and the task is to recover the break dates. More formally, we denote by $\{{\mathcal{B}}_{j}\}_{1\leq j\leq m}$ a disjoint partitioning of the set $\{1,\ldots,T\}$ such that ${\mathcal{B}}_{j}\cap{\mathcal{B}}_{j^{\prime}}=\emptyset$ for $j\neq j^{\prime}$, $\cup_{j}{\mathcal{B}}_{j}=\{1,\ldots,T\}$ and ${\mathcal{B}}_{j}=\{T_{j-1},T_{j-1}+1,\ldots,T_{j}-1\}$. The set of break dates is denoted by ${\mathcal{T}}_{m}=\{T_{1}<T_{2}<\ldots<T_{m}\}$ with the convention $T_{0}=1$, $T_{m+1}=T+1$. Then, we assume ${\mathbb{E}}[X_{t}]=0$ and $\text{Var}(X_{t})=\Sigma_{j}$ for $t\in{\mathcal{B}}_{j}$, so that the observations indexed by elements in ${\mathcal{B}}_{j}$ are $p$-dimensional realizations of a centered random variable with variance-covariance $\Sigma_{j}$. We denote by $\Omega_{j}=\Sigma^{-1}_{j}$ the precision matrix, with entries $\Omega_{uv,j}$, $1\leq u,v\leq p$. In practice, we consider the sequence of precision matrices $\{\Theta_{1},\ldots,\Theta_{T}\}$ such that the total number of distinct matrices in the set is $m+1$ and $\Theta_{t}=\Omega_{j}$, $t\in{\mathcal{B}}_{j}$, $j=1,\ldots,m+1$. We are interested in estimating the unknown true number $m^{\ast}$ of break dates, the true partition ${\mathcal{T}}^{\ast}_{m^{\ast}}=\{T^{\ast}_{1}<T^{\ast}_{2}<\ldots<T^{\ast}_{m^{\ast}}\}$ and the true unknown precision matrices $\Omega^{\ast}_{j}$. As a consequence, the true data generating process is assumed to be

\[
{\mathbb{E}}[X_{t}]=0,\quad\text{Var}(X_{t})=\Sigma^{\ast}_{j},\quad\Theta^{\ast}_{t}=\Omega^{\ast}_{j}=\Sigma^{\ast-1}_{j}\quad\text{when}\;t=T^{\ast}_{j-1},T^{\ast}_{j-1}+1,\ldots,T^{\ast}_{j}-1,
\]

and $1\leq j\leq m^{\ast}+1$, $T^{\ast}_{0}=1$, $T^{\ast}_{m^{\ast}+1}=T+1$, with blocks ${\mathcal{B}}^{\ast}_{j}=\{T^{\ast}_{j-1},\ldots,T^{\ast}_{j}-1\}$. While $m^{\ast}$ and the break dates are unknown, $m^{\ast}$ is typically much smaller than $T$ and, assuming the underlying network may exhibit some sparse structure, we consider the sparse estimation of the $\Theta_{t}$'s and the estimation of ${\mathcal{T}}_{m}$ via a mixture of LASSO and Group Fused LASSO penalties, which we hereafter refer to as the Group Fused D-trace LASSO (GFDtL), defined as

\[
\{\widehat{\Theta}_{t}\}^{T}_{t=1}=\underset{\Theta_{t}\succ 0,\,1\leq t\leq T}{\arg\min}\Big\{{\mathbb{L}}(\{\Theta_{t}\}^{T}_{t=1},\mathcal{X}_{T})+\lambda_{1}\sum^{T}_{t=1}\|\Theta_{t}\|_{1,\text{off}}+\lambda_{2}\sum^{T}_{t=2}\|\Theta_{t}-\Theta_{t-1}\|_{F}\Big\},\tag{2.1}
\]

where $\mathcal{X}_{T}=(X_{1},\ldots,X_{T})$ is the sample, $\lambda_{1},\lambda_{2}$ are the tuning parameters, $\|\Theta_{t}\|_{1,\text{off}}=\sum_{k\neq l}|\Theta_{kl,t}|$, and the D-trace loss of [44] is defined as

\[
{\mathbb{L}}(\{\Theta_{t}\}^{T}_{t=1},\mathcal{X}_{T})=\frac{1}{T}\sum^{T}_{t=1}\Big[\text{tr}\Big(\frac{1}{2}\Theta^{2}_{t}X_{t}X^{\top}_{t}\Big)-\text{tr}(\Theta_{t})\Big].
\]

Sparsity within the estimated precision matrix for a given block is controlled by $\lambda_{1}$, whereas $\lambda_{2}$ controls the smoothing and guarantees that the solution is piece-wise constant.
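For concreteness, the loss and the GFDtL objective are straightforward to evaluate; the following minimal Python sketch (the code and naming are ours, added for illustration; the reference implementation accompanying this paper is in Matlab) computes the objective of (2.1) for a candidate sequence of precision matrices:

```python
import numpy as np

def dtrace_loss(Theta, X):
    """D-trace loss of Section 2: average over t of tr((1/2) Theta_t^2 x_t x_t^T) - tr(Theta_t)."""
    T = len(Theta)
    return sum(0.5 * np.trace(Th @ Th @ np.outer(x, x)) - np.trace(Th)
               for Th, x in zip(Theta, X)) / T

def gfdtl_objective(Theta, X, lam1, lam2):
    """Objective of (2.1): D-trace loss + lam1 * off-diagonal l1 norms + lam2 * fused Frobenius terms."""
    p = Theta[0].shape[0]
    off = ~np.eye(p, dtype=bool)                    # mask selecting off-diagonal entries
    l1_off = sum(np.abs(Th[off]).sum() for Th in Theta)
    fused = sum(np.linalg.norm(Theta[t] - Theta[t - 1], 'fro')
                for t in range(1, len(Theta)))
    return dtrace_loss(Theta, X) + lam1 * l1_off + lam2 * fused
```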

3 Asymptotic properties

Before stating the large sample results, we introduce some notation and present the assumptions used hereafter. Define $\mathcal{I}^{\ast}_{j}=T^{\ast}_{j}-T^{\ast}_{j-1}$ and

\[
\mathcal{I}_{\min}=\underset{1\leq j\leq m^{\ast}+1}{\min}|\mathcal{I}^{\ast}_{j}|,\quad\eta_{\min}=\underset{1\leq j\leq m^{\ast}}{\min}\|\Omega^{\ast}_{j+1}-\Omega^{\ast}_{j}\|_{F},\quad s^{\ast}_{\max}=\underset{1\leq j\leq m^{\ast}}{\max}\|\Omega^{\ast}_{j}\|_{F}.
\]
Assumption 1.
  • (i)

    $(X_{t})$ is a centered strong mixing process, that is, $\exists\, 0<\rho<1$ such that for all $t\in\mathbb{Z}^{+}$, $\alpha(t)\leq c_{\alpha}\rho^{t}$, with $c_{\alpha}>0$ and $\alpha(\cdot)$ the mixing coefficient $\alpha(T)=\sup_{A\in{\mathcal{F}}^{0}_{-\infty},B\in{\mathcal{F}}^{\infty}_{T}}|{\mathbb{P}}(A){\mathbb{P}}(B)-{\mathbb{P}}(A\cap B)|$, where ${\mathcal{F}}^{0}_{-\infty},{\mathcal{F}}^{\infty}_{T}$ are the filtrations generated by $\{X_{t}:-\infty\leq t\leq 0\}$ and $\{X_{t}:T\leq t\leq\infty\}$.

  • (ii)

    $\exists\,\gamma,b>0$ such that $\forall s>0$ and $\forall\, 1\leq k,l\leq p$, $\sup_{t\geq 1}\,{\mathbb{P}}(|X_{k,t}X_{l,t}|>s)\leq\exp(1-(s/b)^{\gamma})$.

Assumption 2.

$\exists\,\underline{\mu},\overline{\mu}$: $0<\underline{\mu}\leq\underset{1\leq j\leq m^{\ast}+1}{\min}\lambda_{\min}\big(\Sigma^{\ast}_{j}\big)\leq\underset{1\leq j\leq m^{\ast}+1}{\max}\lambda_{\max}\big(\Sigma^{\ast}_{j}\big)\leq\overline{\mu}<\infty$.

Assumption 3.

Let $(\delta_{T})$ be a non-increasing positive sequence converging to zero. The following conditions hold:

  • (i)

    $T\delta_{T}\geq c_{v}\log(pT)^{(2+\gamma)/\gamma}$ for some $c_{v}>0$.

  • (ii)

    $m^{\ast}=O(\log(T))$ and $\mathcal{I}_{\min}/(T\delta_{T})\rightarrow\infty$ as $T\rightarrow\infty$.

  • (iii)

    $p\sqrt{\log(pT)/(T\delta_{T})}\rightarrow 0$ and $(\sqrt{T\delta_{T}}\,\eta_{\min})^{-1}p\,s^{\ast}_{\max}\sqrt{\log(pT)}\rightarrow 0$.

  • (iv)

    $\lambda_{2}/(\eta_{\min}\delta_{T})\rightarrow 0$ and $\lambda_{1}Tp/\eta_{\min}\rightarrow 0$ as $T\rightarrow\infty$.

Assumption 1-(i) relates to the properties of $(X_{t})$. Assumption 1-(ii) is a tail condition and will allow us to apply exponential inequalities for dependent processes. Assumption 2 ensures the identification of the model: it is similar to Assumption A.1 of [20] or Assumption A.2 of [30]. Assumption 3 provides conditions on $\delta_{T}$, $m^{\ast}$, $\mathcal{I}_{\min}$, $\eta_{\min}$ and the tuning parameters $\lambda_{1},\lambda_{2}$. Condition (i) concerns the convergence rate of $\delta_{T}$ to $0$. In condition (ii), the sample size in each regime may diverge at rate $T\delta_{T}$, but slower than $T$, and the number of breaks $m^{\ast}$ may diverge slowly: this is similar to Assumption A.3-(i) of [30] or Assumption H3 of [4]. It also sets the slowest rate at which $\delta_{T}$ may shrink to zero: $\delta_{T}=o(\mathcal{I}_{\min}/T)$. Conditions (iii) and (iv) specify the fastest rate at which $\delta_{T}$ may shrink to zero, namely $\delta_{T}\gg\max\big(\lambda_{2}/\eta_{\min},\,p^{2}(s^{\ast}_{\max})^{2}\log(pT)/(T\eta_{\min}^{2})\big)$. It is worth emphasizing that conditions (ii)-(iii) imply $p^{2}(s^{\ast}_{\max})^{2}\log(pT)=o(\mathcal{I}_{\min}\eta^{2}_{\min})$, and conditions (ii) and (iv) imply $\lambda_{2}T=o(\eta_{\min}\mathcal{I}_{\min})$. Finally, the strength of the LASSO shrinkage through $\lambda_{1}$ does not relate to $\delta_{T}$: this is because only the Group Fused LASSO penalty serves to detect the change-points. The consistency of $\widehat{T}_{j},\widehat{\Omega}_{j}$, given $\widehat{m}=m^{\ast}$, is provided in the next theorem.

Theorem 3.1.

Suppose Assumptions 1-3 are satisfied and that $\widehat{m}=m^{\ast}$. Then:

  • (i)

${\mathbb{P}}\Big(\underset{1\leq j\leq m^{\ast}}{\max}|\widehat{T}_{j}-T^{\ast}_{j}|\leq T\delta_{T}\Big)\rightarrow 1$ as $T\rightarrow\infty$.

  • (ii)

$\|\widehat{\Omega}_{j}-\Omega^{\ast}_{j}\|_{F}=O_{p}\Big(\frac{\lambda_{2}T}{\mathcal{I}^{\ast}_{j}}+\lambda_{1}Tp\big(1+\frac{T\delta_{T}}{\mathcal{I}^{\ast}_{j}}\big)+\frac{T\delta_{T}}{\mathcal{I}^{\ast}_{j}}+s^{\ast}_{\max}\,p\sqrt{\frac{\log(pT)}{\mathcal{I}^{\ast}_{j}}}\Big)$, for $j=1,\ldots,m^{\ast}+1$.

Remark. Result (i) implies $\max_{1\leq j\leq m^{\ast}}T^{-1}|\widehat{T}_{j}-T^{\ast}_{j}|=o_{p}(\delta_{T})$. Since $\delta_{T}=o(1)$, this means $T^{-1}|\widehat{T}_{j}-T^{\ast}_{j}|=o_{p}(1)$. Here, $\delta_{T}$ is a key quantity that controls the rate at which $\widehat{T}_{j}/T$ converges to $T^{\ast}_{j}/T$. Note that the requirement $\delta_{T}\gg\max\big(\frac{\lambda_{2}}{\eta_{\min}},\,p^{2}(s^{\ast}_{\max})^{2}\log(pT)/(T\eta_{\min}^{2})\big)$ implies that the fastest convergence rate for the break ratio estimator depends on the regularization parameter $\lambda_{2}$ and on $p^{2}(s^{\ast}_{\max})^{2}\log(pT)/(T\eta_{\min}^{2})$. Result (ii) relates to the consistency of the precision matrix estimator in each regime.

It is worth noting that the true number of breaks $m^{\ast}$ is unknown. Following the common practice in the change-point literature, we assume that $m^{\ast}$ is bounded by a known conservative upper bound $m_{\max}$. Next, we define $h(A,B)\coloneqq\sup_{b\in B}\inf_{a\in A}|a-b|$ for any two sets $A$ and $B$. The next result establishes that all true break points in ${\mathcal{T}}^{\ast}_{m^{\ast}}$ can be consistently estimated by some points in $\widehat{{\mathcal{T}}}_{\widehat{m}}$.

Theorem 3.2.

Suppose Assumptions 1-3 are satisfied. If $m^{\ast}\leq\widehat{m}\leq m_{\max}$, then

\[
{\mathbb{P}}\big(h(\widehat{{\mathcal{T}}}_{\widehat{m}},{\mathcal{T}}^{\ast}_{m^{\ast}})\leq T\delta_{T}\big)\rightarrow 1\quad\text{as}\;T\rightarrow\infty.
\]

The proof of Theorem 3.2 proceeds by contradiction and follows arguments similar to those in the proof of Theorem 3.1. It relies on the optimality conditions from Lemma A.3. Theorem 3.2 ensures that, even if the number of blocks is overestimated, there will be an estimated change-point close to each unknown true change-point.
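In code, the one-sided distance $h$ is immediate to evaluate; a small Python sketch (our own illustration) reads:

```python
def h(A, B):
    """One-sided distance of Theorem 3.2: sup over b in B of inf over a in A of |a - b|.
    h(A_hat, B_true) being small means every true break is close to some estimated break."""
    return max(min(abs(a - b) for a in A) for b in B)

# e.g. h({50, 120}, {47, 125}) == 5
```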

4 Optimization

In this section, we move on to the optimization aspects of criterion (2.1), including the dual problem, the existence of solutions, and the algorithm. Specifically, given $\mathcal{X}_{T}$, $\epsilon>0$, $\lambda_{1}>0$ and $\lambda_{2}>0$, we consider the following scaled optimization problem

\[
\min_{\substack{\Theta_{t}\succeq\epsilon I_{p},\\ 1\leq t\leq T}}\left\{\sum_{t=1}^{T}\left[\text{tr}\Big(\frac{1}{2}\Theta_{t}^{2}X_{t}X_{t}^{\top}\Big)-\text{tr}(\Theta_{t})\right]+\lambda_{1}T\sum_{t=1}^{T}\|\Theta_{t}\|_{1,\text{off}}+\lambda_{2}T\sum_{t=1}^{T-1}\|\Theta_{t+1}-\Theta_{t}\|_{F}\right\},\tag{4.1}
\]

where we scaled Problem (2.1) by a factor of $T$ for numerical stability. One can also notice that in (4.1) we use $\Theta_{t}\succeq\epsilon I_{p}$ rather than $\Theta_{t}\succ 0$ as in (2.1). This choice is made for practical reasons, as setting $\epsilon>0$ ensures non-singular solutions, while the set $\{S\,:\,S\succeq\epsilon I_{p}\}$ is closed and convex, and hence the projection onto it is well defined.

4.1 Dual problem and existence of solutions

We first derive the dual problem of (4.1), and then discuss the existence of solutions to (4.1).

Proposition 4.1.
  (i)

    The dual problem of (4.1) is

\[
\begin{aligned}
\max_{\mathbf{Y}}\quad&\left\{\sum_{t=1}^{T}-\frac{1}{2}\mathrm{tr}(W_{t}^{\top}W_{t})+\epsilon\sum_{t=1}^{T}\mathrm{tr}\left(Z_{t}-Z_{t-1}-I_{p}+(X_{t}X_{t}^{\top})^{\frac{1}{2}}W_{t}-Y_{t,\mathrm{off}}\right)\right\}\\
\text{s.t.}\quad&Z_{0}=Z_{T}=\mathbf{0}_{p\times p};\\
&Z_{t}-Z_{t-1}-I_{p}+\operatorname{Sym}\left((X_{t}X_{t}^{\top})^{\frac{1}{2}}W_{t}\right)-Y_{t,\mathrm{off}}\succeq 0\quad\forall t=1,\dots,T;\\
&\|Z_{t}\|_{F}\leq\lambda_{2}T\quad\forall t=1,\dots,T-1;\\
&|Y_{uv,t}|\leq\lambda_{1}T\quad\forall t=1,\dots,T,\;u,v=1,\dots,p\text{ with }u\neq v,
\end{aligned}\tag{4.2}
\]

where $\mathbf{Y}=\left\{\{W_{t}\}_{t=1}^{T},\{Y_{t,\mathrm{off}}\}_{t=1}^{T},\{Z_{t}\}_{t=1}^{T-1}\right\}$ is the dual variable with $W_{t}\in{\mathbb{R}}^{p\times p}$, $Y_{t,\mathrm{off}}\in\mathcal{S}_{\mathrm{off}}^{p}$, $Z_{t}\in\mathcal{S}^{p}$ for all $t$; $\operatorname{Sym}$ is the symmetrization operator. Moreover, the optimal values of (4.1) and (4.2) are the same.

  (ii)

If $\sum_{t=1}^{T}X_{t}X_{t}^{\top}\succ 0$, then there exists $\overline{\lambda}_{2}>0$ such that for any $\lambda_{1}>0$ and any $\lambda_{2}\geq\overline{\lambda}_{2}$, the dual problem (4.2) has a Slater point (i.e., a feasible point that satisfies all the inequality and positive semi-definite constraints strictly, with all the “$\leq$” and “$\succeq$” holding as “$<$” and “$\succ$”, respectively) and the primal problem (4.1) has solutions.

We note that the assumption $\sum_{t=1}^{T}X_{t}X_{t}^{\top}\succ 0$ in Proposition 4.1-(ii) is reasonable because it can be viewed as a sample-based version of Assumption 2.

We next present a simple example illustrating that, for some specific datasets $\mathcal{X}_{T}$, even with $\sum_{t=1}^{T}X_{t}X_{t}^{\top}\succ 0$, there may still exist some $\lambda_{1}$ and $\lambda_{2}$ such that (4.2) is infeasible, meaning that its optimal value is $-\infty$. Since (4.1) and (4.2) have the same optimal value, (4.1) does not have solutions in this case.

Example 1.

Consider the case where $\mathcal{X}_{2}=(X_{1},X_{2})$ with $X_{1}=(1,0)^{\top}$ and $X_{2}=(0,1)^{\top}$. Then

\[
X_{1}X_{1}^{\top}=\begin{bmatrix}1&0\\ 0&0\end{bmatrix},\quad X_{2}X_{2}^{\top}=\begin{bmatrix}0&0\\ 0&1\end{bmatrix},
\]

and $X_{1}X_{1}^{\top}+X_{2}X_{2}^{\top}=I_{2}\succ 0$. The positive semi-definite constraints in (4.2) can be written as

\[
\begin{aligned}
&Z_{1}-I_{2}+\begin{bmatrix}W_{11,1}&W_{12,1}/2\\ W_{12,1}/2&0\end{bmatrix}-\begin{bmatrix}0&Y_{12,1}\\ Y_{12,1}&0\end{bmatrix}\succeq 0,\\
&-Z_{1}-I_{2}+\begin{bmatrix}0&W_{21,2}/2\\ W_{21,2}/2&W_{22,2}\end{bmatrix}-\begin{bmatrix}0&Y_{12,2}\\ Y_{12,2}&0\end{bmatrix}\succeq 0.
\end{aligned}
\]

After some simple calculations, we have

\[
\begin{aligned}
&\begin{bmatrix}Z_{11,1}+W_{11,1}-1&Z_{12,1}+W_{12,1}/2-Y_{12,1}\\ Z_{21,1}+W_{12,1}/2-Y_{12,1}&Z_{22,1}-1\end{bmatrix}\succeq 0,\\
&\begin{bmatrix}Z_{11,1}+1&Z_{12,1}-W_{21,2}/2+Y_{12,2}\\ Z_{21,1}-W_{21,2}/2+Y_{12,2}&Z_{22,1}-W_{22,2}+1\end{bmatrix}\preceq 0.
\end{aligned}
\]

Recall that the diagonal elements of a positive semi-definite matrix must be non-negative. The above conditions then require $Z_{11,1}\leq-1$ and $Z_{22,1}\geq 1$, so that any feasible $Z_{1}$ satisfies $\|Z_{1}\|_{F}\geq\sqrt{2}$, while the constraint $\|Z_{1}\|_{F}\leq\lambda_{2}T=2\lambda_{2}$ must also hold. Hence, for any $\lambda_{1}>0$ and any $\lambda_{2}<1/\sqrt{2}$, the dual problem (4.2) is infeasible, and therefore the primal problem (4.1) does not have solutions.
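One can also verify the unboundedness on the primal side directly in this example: for $s\geq\epsilon$, the matrices $\Theta_{1}=\mathrm{diag}(\epsilon,s)$ and $\Theta_{2}=\mathrm{diag}(s,\epsilon)$ are feasible for (4.1), and the objective of (4.1) evaluates to

\[
\epsilon^{2}-\big(2+2\sqrt{2}\,\lambda_{2}\big)\epsilon+\big(2\sqrt{2}\,\lambda_{2}-2\big)s,
\]

which tends to $-\infty$ as $s\rightarrow\infty$ whenever $\lambda_{2}<1/\sqrt{2}$; note that the off-diagonal penalty vanishes on diagonal matrices, so $\lambda_{1}$ plays no role along this direction.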

Remark 1.

It is worth pointing out that the nonexistence of solutions does not contradict our findings in Section 3: those results only indicate that, when there is a ground truth, under suitable assumptions, the ground truth can be (approximately) recovered from a solution of problem (2.1) with suitably chosen $T$, $\lambda_{1}$ and $\lambda_{2}$; in particular, they do not imply the existence of solutions to problem (2.1) for general $T$, $\lambda_{1}$ and $\lambda_{2}$.

Although we can obtain a lower bound $\overline{\lambda}_{2}$ for $\lambda_{2}$ (as detailed in the proof of Proposition 4.1-(ii)) that ensures the existence of solutions to (4.1), this $\overline{\lambda}_{2}$ may not be tight. This raises practical issues because the optimal $\lambda^{*}_{2}$ can be strictly smaller than $\overline{\lambda}_{2}$: to locate such a $\lambda^{*}_{2}$, we would then need to work with problem (4.1) for some $\lambda_{2}<\overline{\lambda}_{2}$. However, for such $\lambda_{2}$, (4.1) may be unbounded from below, and certifying this scenario is a challenging problem. This motivates us to modify problem (4.1) to obtain a new model which has solutions for all choices of $\lambda_{1}$ and $\lambda_{2}$, and which (under some mild condition) returns a solution of problem (4.1) when the latter problem is solvable.

To this end, we notice from the above example that the unsolvability of problem (4.1) when $\lambda_{2}$ is small may be related to the fact that the coupling between different groups induced by the Group Fused LASSO regularizer is not strong enough to leverage the condition $\sum_{t=1}^{T}X_{t}X_{t}^{\top}\succ 0$ into dual strict feasibility (and hence the existence of solutions to (4.1)). This motivates the introduction of a new regularizer to replace the Group Fused LASSO regularizer in problem (4.1): intuitively, this regularizer should be similar to the Group Fused LASSO regularizer when $\|\Theta_{t+1}-\Theta_{t}\|_{F}$ is small, so as to induce the same desired breaks, but should penalize large $\|\Theta_{t+1}-\Theta_{t}\|_{F}$ more heavily so as to ensure the solution existence of the new model.

4.2 A revised problem with a modified regularizer

Let $\lambda_{3}\geq 0.5$ and let

\[
\mathcal{R}(x;\lambda_{3})\coloneqq\begin{cases}|x|&\text{if }|x|\leq\lambda_{3},\\ x^{2}-\lambda_{3}^{2}+\lambda_{3}&\text{otherwise}.\end{cases}\tag{4.3}
\]

Here, the condition $\lambda_{3}\geq 0.5$ is necessary and sufficient for the convexity of $\mathcal{R}$. The function $\mathcal{R}$ coincides with the absolute value in a region near $0$ (determined by $\lambda_{3}$) and switches to a quadratic function outside this region. In this way, it reduces to the classical $\ell_{1}$ penalty when $x$ is near $0$, while imposing a substantially larger penalty as $x$ moves further away from $0$.
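A direct Python transcription of (4.3), added here for illustration, makes the two regimes explicit:

```python
import numpy as np

def R(x, lam3):
    """Modified regularizer (4.3): absolute value on [-lam3, lam3], quadratic tail outside.
    lam3 >= 0.5 makes the two pieces meet at lam3 with non-decreasing slope (convexity)."""
    ax = np.abs(np.asarray(x, dtype=float))
    return np.where(ax <= lam3, ax, ax**2 - lam3**2 + lam3)

# At the junction x = lam3 both branches equal lam3, and the slope jumps
# from 1 to 2*lam3, which is >= 1 exactly when lam3 >= 0.5.
```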

Replacing $\|\Theta_{t+1}-\Theta_{t}\|_{F}$ by $\mathcal{R}(\|\Theta_{t+1}-\Theta_{t}\|_{F};\lambda_{3})$ in problem (4.1), we obtain the following revised optimization problem:

\[
\min_{\substack{\Theta_{t}\succeq\epsilon I_{p},\\ 1\leq t\leq T}}\left\{\sum_{t=1}^{T}\left[\text{tr}\Big(\frac{1}{2}\Theta_{t}^{2}X_{t}X_{t}^{\top}\Big)-\text{tr}(\Theta_{t})\right]+\lambda_{1}T\sum_{t=1}^{T}\|\Theta_{t}\|_{1,\text{off}}+\lambda_{2}T\sum_{t=1}^{T-1}\mathcal{R}(\|\Theta_{t+1}-\Theta_{t}\|_{F};\lambda_{3})\right\}.\tag{4.4}
\]

The next proposition gives the dual problem of (4.4) and establishes the existence of solutions to (4.4).

Proposition 4.2.
  (i)

    Let

\[
\mathcal{G}(x;\lambda_{3})=\min\left\{-\left(x-\lambda_{2}T\right)_{+}\lambda_{3},\;\lambda_{2}T\left(\Big(\lambda_{3}-\frac{x}{2\lambda_{2}T}\Big)_{+}^{2}-\frac{x^{2}}{4\lambda_{2}^{2}T^{2}}-\lambda_{3}^{2}+\lambda_{3}\right)\right\},
\]

where $(\cdot)_{+}=\max\{\cdot,0\}$. Then the dual problem of (4.4) is

\[
\begin{aligned}
\max_{\mathbf{Y}}\quad&\Bigg\{\sum_{t=1}^{T}-\frac{1}{2}\mathrm{tr}(W_{t}^{\top}W_{t})+\epsilon\sum_{t=1}^{T}\mathrm{tr}\left(Z_{t}-Z_{t-1}-I_{p}+(X_{t}X_{t}^{\top})^{\frac{1}{2}}W_{t}-Y_{t,\mathrm{off}}\right)+\sum_{t=1}^{T-1}\mathcal{G}(\|Z_{t}\|_{F};\lambda_{3})\Bigg\}\\
\text{s.t.}\quad&Z_{0}=Z_{T}=\mathbf{0}_{p\times p};\\
&Z_{t}-Z_{t-1}-I_{p}+\operatorname{Sym}\left((X_{t}X_{t}^{\top})^{\frac{1}{2}}W_{t}\right)-Y_{t,\mathrm{off}}\succeq 0\quad\forall t=1,\dots,T;\\
&|Y_{uv,t}|\leq\lambda_{1}T\quad\forall t=1,\dots,T,\;u,v=1,\dots,p\text{ with }u\neq v,
\end{aligned}\tag{4.5}
\]

where $\mathbf{Y}=\left\{\{W_{t}\}_{t=1}^{T},\{Y_{t,\mathrm{off}}\}_{t=1}^{T},\{Z_{t}\}_{t=1}^{T-1}\right\}$ is the dual variable with $W_{t}\in{\mathbb{R}}^{p\times p}$, $Y_{t,\mathrm{off}}\in\mathcal{S}_{\mathrm{off}}^{p}$, $Z_{t}\in\mathcal{S}^{p}$ for all $t$; $\operatorname{Sym}$ is the symmetrization operator. Moreover, (4.4) and (4.5) have the same optimal values.

  (ii)

If $\sum_{t=1}^{T}X_{t}X_{t}^{\top}\succ 0$, then the dual problem (4.5) has a Slater point and the primal problem (4.4) has solutions.

The relationship between (4.1) and (4.4) is summarized as follows.

Proposition 4.3.

Given $\mathcal{X}_{T}$ and $\epsilon>0$, the following statements hold:

  (i)

For any positive $\lambda_{1}$ and $\lambda_{2}$ such that (4.1) has solutions, there exists $\overline{\lambda}_{3}\geq 0.5$ such that for any $\lambda_{3}\geq\overline{\lambda}_{3}$, any solution of (4.4) also solves (4.1).

  (ii)

Fix any positive $\lambda_{1}$ and $\lambda_{2}$ such that (4.1) does not have solutions. Then for any $\lambda_{3}\geq 0.5$, any solution $\{\Theta_{t}^{*}\}_{t=1}^{T}$ of (4.4) satisfies

\[
\max_{t=1,\dots,T-1}\|\Theta_{t+1}^{*}-\Theta_{t}^{*}\|_{F}\geq\lambda_{3}.\tag{4.6}
\]

Equipped with Proposition 4.3, we can derive the following practical way to search for a suitable $\lambda_{2}$ such that (4.1) is solvable, by solving (a sequence of instances of) (4.4), as sketched below. Specifically, for any positive $\lambda_{1}$ and $\lambda_{2}$, we solve (4.4) with an appropriately large $\lambda_{3}$ and then check whether the solution satisfies (4.6): if it does, we increase $\lambda_{2}$ further to pursue a reliable estimator; otherwise, we have obtained a solution to (4.1), and all the theoretical properties described in Section 3 hold.
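In Python, this search could be organized as follows (a sketch for illustration; `solve_revised` stands for any user-supplied solver of (4.4), e.g., the ADMM of Algorithm 1 below, and the doubling rule for $\lambda_2$ is just one possible update):

```python
import numpy as np

def search_lambda2(X, lam1, lam2, lam3, eps, solve_revised, grow=2.0, max_rounds=10):
    """Search for a lambda2 making (4.1) solvable, following Proposition 4.3.
    `solve_revised` solves (4.4) and returns the estimated matrices Theta_1, ..., Theta_T;
    all names here are ours, not part of the reference implementation."""
    for _ in range(max_rounds):
        Theta = solve_revised(X, lam1, lam2, lam3, eps)
        jumps = [np.linalg.norm(Theta[t + 1] - Theta[t], 'fro')
                 for t in range(len(Theta) - 1)]
        if max(jumps) >= lam3:   # pattern (4.6): (4.1) may have no solution
            lam2 *= grow         # strengthen the fused penalty and try again
        else:
            return Theta, lam2   # by Proposition 4.3, Theta also solves (4.1)
    raise RuntimeError("no suitable lambda2 found within the given budget")
```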

4.3 An alternating direction method of multipliers

In this subsection, we discuss how to adapt the alternating direction method of multipliers (ADMM) to solve (4.4). We rewrite (4.4) as the following constrained optimization problem:

\[
\begin{aligned}
\min_{\mathbf{X}}\;&\left\{\sum_{t=1}^{T}\left[\text{tr}\Big(\frac{1}{2}\Theta_{t}^{2}X_{t}X_{t}^{\top}\Big)-\text{tr}(\Theta_{t})+\delta_{\cdot\succeq\epsilon I_{p}}(V_{t})\right]+\lambda_{1}T\sum_{t=1}^{T}\|\Upsilon_{t,\mathrm{off}}\|_{1,\text{off}}+\lambda_{2}T\sum_{t=1}^{T-1}\mathcal{R}(\|D_{t}\|_{F};\lambda_{3})\right\}\\
\text{s.t.}\;&\;V_{t}=\Theta_{t},\;\Upsilon_{t,\mathrm{off}}=\Theta_{t,\mathrm{off}}\quad\forall t=1,\dots,T;\qquad D_{t}=\Theta_{t+1}-\Theta_{t}\quad\forall t=1,\dots,T-1,
\end{aligned}\tag{4.7}
\]

where we denote $\mathbf{X}=\left\{\{\Theta_{t}\}_{t=1}^{T},\{V_{t}\}_{t=1}^{T},\{\Upsilon_{t,\mathrm{off}}\}_{t=1}^{T},\{D_{t}\}_{t=1}^{T-1}\right\}$ for the sake of notational simplicity; $\delta_{\cdot\succeq\epsilon I_{p}}$ is the indicator function of $\{S\,:\,S\succeq\epsilon I_{p}\}$ (equal to $0$ on this set and $+\infty$ outside); $\Theta_{t,\mathrm{off}}$ is the copy of $\Theta_{t}$ with the diagonal elements set to $0$, and $\Upsilon_{t,\mathrm{off}}\in\mathcal{S}_{\mathrm{off}}^{p}$.

Given $\beta>0$, the augmented Lagrangian function is

\[
\begin{aligned}
&L(\{\Theta_{t}\}_{t=1}^{T},\{V_{t}\}_{t=1}^{T},\{\Upsilon_{t,\mathrm{off}}\}_{t=1}^{T},\{D_{t}\}_{t=1}^{T-1},\{A_{t}\}_{t=1}^{T},\{Y_{t,\mathrm{off}}\}_{t=1}^{T},\{Z_{t}\}_{t=1}^{T-1})\\
\coloneqq\;&\sum_{t=1}^{T}\left[\text{tr}\Big(\frac{1}{2}\Theta_{t}^{2}X_{t}X_{t}^{\top}\Big)-\text{tr}(\Theta_{t})+\delta_{\cdot\succeq\epsilon I_{p}}(V_{t})\right]+\lambda_{1}T\sum_{t=1}^{T}\|\Upsilon_{t,\mathrm{off}}\|_{1,\text{off}}+\lambda_{2}T\sum_{t=1}^{T-1}\mathcal{R}(\|D_{t}\|_{F};\lambda_{3})\\
&+\sum_{t=1}^{T}\left[-\left\langle A_{t},\Theta_{t}-V_{t}\right\rangle+\frac{\beta}{2}\|\Theta_{t}-V_{t}\|_{F}^{2}\right]+\sum_{t=1}^{T}\left[-\left\langle Y_{t,\mathrm{off}},\Theta_{t,\mathrm{off}}-\Upsilon_{t,\mathrm{off}}\right\rangle+\frac{\beta}{2}\|\Theta_{t,\mathrm{off}}-\Upsilon_{t,\mathrm{off}}\|_{F}^{2}\right]\\
&+\sum_{t=1}^{T-1}\left[-\left\langle Z_{t},\Theta_{t+1}-\Theta_{t}-D_{t}\right\rangle+\frac{\beta}{2}\|\Theta_{t+1}-\Theta_{t}-D_{t}\|_{F}^{2}\right].
\end{aligned}
\]

In iteration $k+1$, our ADMM consists of three update steps:

  1. $\{\Theta_{t}\}_{t=1}^{T}$ update:

\[
\{\Theta_{t}^{k+1}\}_{t=1}^{T}=\operatorname*{\arg\min}_{\{\Theta_{t}\}_{t=1}^{T}}L\Big(\{\Theta_{t}\}_{t=1}^{T},\{V_{t}^{k}\}_{t=1}^{T},\{\Upsilon_{t,\mathrm{off}}^{k}\}_{t=1}^{T},\{D_{t}^{k}\}_{t=1}^{T-1},\{A_{t}^{k}\}_{t=1}^{T},\{Y_{t,\mathrm{off}}^{k}\}_{t=1}^{T},\{Z_{t}^{k}\}_{t=1}^{T-1}\Big).
\]

     Note that this update is well defined as $L$ is strongly convex in $\{\Theta_{t}\}_{t=1}^{T}$ (with the other variables fixed).

  2. $\{V_{t}\}_{t=1}^{T},\{\Upsilon_{t,\mathrm{off}}\}_{t=1}^{T},\{D_{t}\}_{t=1}^{T-1}$ update:

\[
\begin{aligned}
&\{V_{t}^{k+1}\}_{t=1}^{T},\{\Upsilon_{t,\mathrm{off}}^{k+1}\}_{t=1}^{T},\{D_{t}^{k+1}\}_{t=1}^{T-1}\\
=\;&\operatorname*{\arg\min}_{\{V_{t}\}_{t=1}^{T},\{\Upsilon_{t,\mathrm{off}}\}_{t=1}^{T},\{D_{t}\}_{t=1}^{T-1}}L\Big(\{\Theta_{t}^{k+1}\}_{t=1}^{T},\{V_{t}\}_{t=1}^{T},\{\Upsilon_{t,\mathrm{off}}\}_{t=1}^{T},\{D_{t}\}_{t=1}^{T-1},\{A_{t}^{k}\}_{t=1}^{T},\{Y_{t,\mathrm{off}}^{k}\}_{t=1}^{T},\{Z_{t}^{k}\}_{t=1}^{T-1}\Big).
\end{aligned}
\]

     Note that this update is well defined as $L$ is strongly convex in the variables $\{V_{t}\}_{t=1}^{T}$, $\{\Upsilon_{t,\mathrm{off}}\}_{t=1}^{T}$, $\{D_{t}\}_{t=1}^{T-1}$ (with the other variables fixed).

  3. Dual update: for all $t$ (the dual stepsize $1.61$ in (4.8) can more generally be chosen from the interval $(0,(\sqrt{5}+1)/2)$; we pick $1.61$ for simplicity),

\[
\begin{aligned}
A_{t}^{k+1}&=A_{t}^{k}-1.61\beta(\Theta_{t}^{k+1}-V_{t}^{k+1}),\\
Y_{t,\mathrm{off}}^{k+1}&=Y_{t,\mathrm{off}}^{k}-1.61\beta(\Theta_{t,\mathrm{off}}^{k+1}-\Upsilon_{t,\mathrm{off}}^{k+1}),\\
Z_{t}^{k+1}&=Z_{t}^{k}-1.61\beta(\Theta_{t+1}^{k+1}-\Theta_{t}^{k+1}-D_{t}^{k+1}).
\end{aligned}\tag{4.8}
\]

We next discuss how to perform the first two update steps efficiently.

$\{\Theta_{t}\}_{t=1}^{T}$ update: Setting the derivative of the objective function of the subproblem with respect to $\Theta_{t}$ to $0$, we have

\[
\left\{\begin{aligned}
\mathbf{0}_{p\times p}=\;&\frac{1}{2}(X_{1}X_{1}^{\top}\Theta_{1}+\Theta_{1}X_{1}X_{1}^{\top})-I_{p}-A_{1}^{k}-Y_{1,\mathrm{off}}^{k}+Z_{1}^{k}+\beta(\Theta_{1}-V_{1}^{k})+\beta(\Theta_{1,\mathrm{off}}-\Upsilon_{1,\mathrm{off}}^{k})\\
&-\beta(\Theta_{2}-\Theta_{1}-D_{1}^{k}),\\
\mathbf{0}_{p\times p}=\;&\frac{1}{2}(X_{t}X_{t}^{\top}\Theta_{t}+\Theta_{t}X_{t}X_{t}^{\top})-I_{p}-A_{t}^{k}-Y_{t,\mathrm{off}}^{k}+Z_{t}^{k}-Z_{t-1}^{k}+\beta(\Theta_{t}-V_{t}^{k})+\beta(\Theta_{t,\mathrm{off}}-\Upsilon_{t,\mathrm{off}}^{k})\\
&-\beta(\Theta_{t+1}-\Theta_{t}-D_{t}^{k})+\beta(\Theta_{t}-\Theta_{t-1}-D_{t-1}^{k})\qquad\forall\,t=2,\dots,T-1,\\
\mathbf{0}_{p\times p}=\;&\frac{1}{2}(X_{T}X_{T}^{\top}\Theta_{T}+\Theta_{T}X_{T}X_{T}^{\top})-I_{p}-A_{T}^{k}-Y_{T,\mathrm{off}}^{k}-Z_{T-1}^{k}+\beta(\Theta_{T}-V_{T}^{k})+\beta(\Theta_{T,\mathrm{off}}-\Upsilon_{T,\mathrm{off}}^{k})\\
&+\beta(\Theta_{T}-\Theta_{T-1}-D_{T-1}^{k}).
\end{aligned}\right.
\]

For the sake of simplicity, we let $Z_{0}^{k}=Z_{T}^{k}=D_{0}^{k}=D_{T}^{k}=\mathbf{0}_{p\times p}$ for all $k$. By further denoting

\[
\Psi_{t}^{k}=I_{p}+A_{t}^{k}+Y_{t,\mathrm{off}}^{k}-Z_{t}^{k}+Z_{t-1}^{k}+\beta V_{t}^{k}+\beta\Upsilon_{t,\mathrm{off}}^{k}-\beta D_{t}^{k}+\beta D_{t-1}^{k}
\]

for all $t$ and $k$, the $\{\Theta_{t}\}_{t=1}^{T}$ update is equivalent to solving the following linear system:

\[
\begin{aligned}
\frac{1}{2}(X_{1}X_{1}^{\top}\Theta_{1}+\Theta_{1}X_{1}X_{1}^{\top})+2\beta\Theta_{1}+\beta\Theta_{1,\mathrm{off}}-\beta\Theta_{2}&=\Psi_{1}^{k},\\
-\beta\Theta_{1}+\frac{1}{2}(X_{2}X_{2}^{\top}\Theta_{2}+\Theta_{2}X_{2}X_{2}^{\top})+3\beta\Theta_{2}+\beta\Theta_{2,\mathrm{off}}-\beta\Theta_{3}&=\Psi_{2}^{k},\\
-\beta\Theta_{2}+\frac{1}{2}(X_{3}X_{3}^{\top}\Theta_{3}+\Theta_{3}X_{3}X_{3}^{\top})+3\beta\Theta_{3}+\beta\Theta_{3,\mathrm{off}}-\beta\Theta_{4}&=\Psi_{3}^{k},\\
&\;\;\vdots\\
-\beta\Theta_{T-2}+\frac{1}{2}(X_{T-1}X_{T-1}^{\top}\Theta_{T-1}+\Theta_{T-1}X_{T-1}X_{T-1}^{\top})+3\beta\Theta_{T-1}+\beta\Theta_{T-1,\mathrm{off}}-\beta\Theta_{T}&=\Psi_{T-1}^{k},\\
-\beta\Theta_{T-1}+\frac{1}{2}(X_{T}X_{T}^{\top}\Theta_{T}+\Theta_{T}X_{T}X_{T}^{\top})+2\beta\Theta_{T}+\beta\Theta_{T,\mathrm{off}}&=\Psi_{T}^{k}.
\end{aligned}\tag{4.9}
\]

This linear system does not have a closed-form solution in general. Here we use pcg in Matlab to solve it. Specifically, in each iteration, we use the solution from the previous iteration as the initial point and solve the system only up to some tolerance that decreases along the iterations.
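For illustration, a matrix-free conjugate-gradient version of this update can be sketched in Python as follows (our own transcription of (4.9), assuming scipy is available; the paper's implementation relies on Matlab's pcg with warm starts instead):

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def theta_update(S, Psi, beta, tol=1e-10):
    """One Theta-update: solve the block-tridiagonal system (4.9) by conjugate gradient.
    S[t] = X_t X_t^T and Psi[t] = Psi_t^k are p x p arrays."""
    T, p = len(S), S[0].shape[0]

    def off(M):  # copy of M with zero diagonal, i.e. M_off
        return M - np.diag(np.diag(M))

    def apply_A(v):  # left-hand side of (4.9) applied to the stacked Theta's
        Th = v.reshape(T, p, p)
        out = np.empty_like(Th)
        for t in range(T):
            c = 2.0 if t in (0, T - 1) else 3.0   # 2*beta at the ends, 3*beta inside
            out[t] = 0.5 * (S[t] @ Th[t] + Th[t] @ S[t]) \
                     + beta * (c * Th[t] + off(Th[t]))
            if t > 0:
                out[t] -= beta * Th[t - 1]
            if t < T - 1:
                out[t] -= beta * Th[t + 1]
        return out.ravel()

    A = LinearOperator((T * p * p, T * p * p), matvec=apply_A)
    sol, info = cg(A, np.stack(Psi).ravel(), atol=tol)
    return sol.reshape(T, p, p)
```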

$\{V_{t}\}_{t=1}^{T},\{\Upsilon_{t,\mathrm{off}}\}_{t=1}^{T},\{D_{t}\}_{t=1}^{T-1}$ update: It is notable that this subproblem is block separable, allowing us to solve it by addressing three further subproblems with respect to $\{V_{t}\}_{t=1}^{T}$, $\{\Upsilon_{t,\mathrm{off}}\}_{t=1}^{T}$, and $\{D_{t}\}_{t=1}^{T-1}$, respectively.

For $V_{t}$, one has

\[
V_{t}^{k+1}=\operatorname*{\arg\min}_{V_{t}\succeq\epsilon I_{p}}\left\{\left\langle A_{t}^{k},V_{t}\right\rangle+\frac{\beta}{2}\|\Theta_{t}^{k+1}-V_{t}\|_{F}^{2}\right\}=\operatorname*{\arg\min}_{V_{t}\succeq\epsilon I_{p}}\left\{\left\|V_{t}-\Big(\Theta_{t}^{k+1}-\frac{A_{t}^{k}}{\beta}\Big)\right\|_{F}^{2}\right\}=\mathrm{Proj}_{\cdot\succeq\epsilon I_{p}}\left(\Theta_{t}^{k+1}-\frac{A_{t}^{k}}{\beta}\right).\tag{4.10}
\]
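The projection $\mathrm{Proj}_{\cdot\succeq\epsilon I_{p}}$ amounts to clipping the eigenvalues from below at $\epsilon$; a minimal Python sketch (our own illustration) is:

```python
import numpy as np

def proj_psd_eps(M, eps):
    """Projection onto {S : S >= eps * I}: symmetrize, then clip eigenvalues below at eps."""
    M = 0.5 * (M + M.T)                    # Sym(M); the input should already be symmetric
    w, Q = np.linalg.eigh(M)
    return (Q * np.maximum(w, eps)) @ Q.T  # Q diag(max(w, eps)) Q^T
```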

For $\Upsilon_{t,\mathrm{off}}$, it holds that

\[
\begin{aligned}
\Upsilon_{t,\mathrm{off}}^{k+1}&=\operatorname*{\arg\min}_{\Upsilon_{t,\mathrm{off}}}\left\{\lambda_{1}T\|\Upsilon_{t,\mathrm{off}}\|_{1,\mathrm{off}}+\left\langle Y_{t,\mathrm{off}}^{k},\Upsilon_{t,\mathrm{off}}\right\rangle+\frac{\beta}{2}\|\Theta_{t,\mathrm{off}}^{k+1}-\Upsilon_{t,\mathrm{off}}\|_{F}^{2}\right\}\\
&=\operatorname*{\arg\min}_{\Upsilon_{t,\mathrm{off}}}\left\{\frac{\lambda_{1}T}{\beta}\|\Upsilon_{t,\mathrm{off}}\|_{1,\mathrm{off}}+\frac{1}{2}\left\|\Upsilon_{t,\mathrm{off}}-\Big(\Theta_{t,\mathrm{off}}^{k+1}-\frac{Y_{t,\mathrm{off}}^{k}}{\beta}\Big)\right\|_{F}^{2}\right\}\\
&=\mathrm{prox}_{\frac{\lambda_{1}T}{\beta}\|\cdot\|_{1,\mathrm{off}}}\left(\Theta_{t,\mathrm{off}}^{k+1}-\frac{Y_{t,\mathrm{off}}^{k}}{\beta}\right)=\left(\mathrm{prox}_{\frac{\lambda_{1}T}{\beta}|\cdot|}\Big(\Theta_{uv,t}^{k+1}-\frac{Y_{uv,t}^{k}}{\beta}\Big)\right)_{\substack{u,v=1,\dots,p\\ \text{with }u\neq v}}.
\end{aligned}\tag{4.11}
\]
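Entrywise, $\mathrm{prox}_{\tau|\cdot|}$ is the soft-thresholding operator, so this update can be sketched as follows (illustration in Python, with our own naming):

```python
import numpy as np

def upsilon_update(Theta, Y_off, lam1, T, beta):
    """Update (4.11): entrywise soft-thresholding at level lam1*T/beta of the
    off-diagonal part of Theta - Y_off/beta; the diagonal stays at zero."""
    X = Theta - Y_off / beta
    np.fill_diagonal(X, 0.0)               # restrict to the off-diagonal copy
    tau = lam1 * T / beta
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)
```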

The update scheme for $D_{t}$ is slightly more complicated. For simplicity, we denote

\[
\Xi_{t}^{k}=\Theta_{t+1}^{k+1}-\Theta_{t}^{k+1}-\frac{Z_{t}^{k}}{\beta}\qquad\forall t=1,\dots,T-1,\;k=1,2,\dots.
\]

For each $t$, we need to solve the following optimization problem:

\[
D_{t}^{k+1}=\operatorname*{\arg\min}_{D_{t}}\frac{1}{2}\|D_{t}-\Xi_{t}^{k}\|_{F}^{2}+\frac{\lambda_{2}T}{\beta}\mathcal{R}(\|D_{t}\|_{F};\lambda_{3}).
\]

By the definition of $\mathcal{R}$ in (4.3), this is equivalent to solving the following two problems

\[
\begin{aligned}
D_{t}^{\diamond}&\coloneqq\operatorname*{\arg\min}_{\|D_{t}\|_{F}\leq\lambda_{3}}\frac{1}{2}\|D_{t}-\Xi_{t}^{k}\|_{F}^{2}+\frac{\lambda_{2}T}{\beta}\|D_{t}\|_{F},&\text{①}\\
D_{t}^{+}&\coloneqq\operatorname*{\arg\min}_{\|D_{t}\|_{F}\geq\lambda_{3}}\frac{1}{2}\|D_{t}-\Xi_{t}^{k}\|_{F}^{2}+\frac{\lambda_{2}T}{\beta}\left(\|D_{t}\|_{F}^{2}-\lambda_{3}^{2}+\lambda_{3}\right),&\text{②}
\end{aligned}
\]

and then setting

\[
D_{t}^{k+1}=\begin{cases}D_{t}^{\diamond}&\text{if }\mathrm{val}^{\diamond}\leq\mathrm{val}^{+},\\ D_{t}^{+}&\text{if }\mathrm{val}^{\diamond}>\mathrm{val}^{+},\end{cases}\tag{4.12}
\]

where $\mathrm{val}^{\diamond}$ and $\mathrm{val}^{+}$ refer to the optimal values of the two optimization problems ① and ②, respectively. We first note that if $\Xi_{t}^{k}=\mathbf{0}_{p\times p}$, then $\mathrm{val}^{+}\geq 0=\mathrm{val}^{\diamond}$ and we have $D^{k+1}_{t}=D_{t}^{\diamond}=\mathbf{0}_{p\times p}$. We next assume that $\Xi_{t}^{k}\neq\mathbf{0}_{p\times p}$. For ①, from the proximal operator of the Frobenius norm, we have

\[
D_{t}^{\diamond}=\underbrace{\min\left\{\Big(\|\Xi_{t}^{k}\|_{F}-\frac{\lambda_{2}T}{\beta}\Big)_{+},\lambda_{3}\right\}}_{\imath_{1}}\frac{\Xi_{t}^{k}}{\|\Xi_{t}^{k}\|_{F}},\quad\mathrm{val}^{\diamond}=\frac{1}{2}\left(\imath_{1}-\|\Xi_{t}^{k}\|_{F}\right)^{2}+\frac{\lambda_{2}T}{\beta}\imath_{1}.\tag{4.13}
\]

For ②, by considering the derivative of its objective function, we get

\[
D_{t}^{+}=\underbrace{\max\left\{\frac{\beta\|\Xi_{t}^{k}\|_{F}}{\beta+2\lambda_{2}T},\lambda_{3}\right\}}_{\imath_{2}}\frac{\Xi_{t}^{k}}{\|\Xi_{t}^{k}\|_{F}},\quad\mathrm{val}^{+}=\frac{1}{2}\left(\imath_{2}-\|\Xi_{t}^{k}\|_{F}\right)^{2}+\frac{\lambda_{2}T}{\beta}\left(\imath_{2}^{2}-\lambda_{3}^{2}+\lambda_{3}\right).\tag{4.14}
\]

The update scheme for $D_{t}$ is obtained upon combining (4.13) and (4.14) with (4.12).
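In Python, the whole $D_t$-update (4.12)-(4.14) is only a few lines (a sketch we add for illustration, transcribing the two-case rule directly):

```python
import numpy as np

def d_update(Xi, lam2, lam3, T, beta):
    """D_t-update: minimizer of 0.5*||D - Xi||_F^2 + (lam2*T/beta) * R(||D||_F; lam3),
    following (4.12)-(4.14)."""
    r = np.linalg.norm(Xi, 'fro')
    if r == 0.0:
        return np.zeros_like(Xi)
    c = lam2 * T / beta
    # case (1): constrained to ||D||_F <= lam3, l1-type penalty (4.13)
    i1 = min(max(r - c, 0.0), lam3)
    val1 = 0.5 * (i1 - r) ** 2 + c * i1
    # case (2): constrained to ||D||_F >= lam3, quadratic penalty (4.14)
    i2 = max(beta * r / (beta + 2.0 * lam2 * T), lam3)
    val2 = 0.5 * (i2 - r) ** 2 + c * (i2 ** 2 - lam3 ** 2 + lam3)
    scale = i1 if val1 <= val2 else i2     # pick the case with smaller value, as in (4.12)
    return (scale / r) * Xi
```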

Overall, the ADMM is summarized in Algorithm 1, whose convergence follows directly from [10, Appendix B].

Algorithm 1 An ADMM for solving (4.4)
1: input: $\mathcal{X}_{T}=(X_{1},\dots,X_{T})$; $\lambda_{1}>0$, $\lambda_{2}>0$, $\epsilon>0$, $\lambda_{3}\geq 0.5$; $\beta>0$; $\epsilon_{\mathsf{pcg}}\in(0,1)$, $\tau\in(0,1)$.
2: $\mathbf{X}^{0}=\{\{\Theta_{t}^{0}\}_{t=1}^{T},\{V_{t}^{0}\}_{t=1}^{T},\{\Upsilon_{t,\mathrm{off}}^{0}\}_{t=1}^{T},\{D_{t}^{0}\}_{t=1}^{T-1}\}$; $\mathbf{Y}^{0}=\{\{A_{t}^{0}\}_{t=1}^{T},\{Y_{t,\mathrm{off}}^{0}\}_{t=1}^{T},\{Z_{t}^{0}\}_{t=1}^{T-1}\}$.
3: output: $\mathbf{\widehat{X}}=\{\{\widehat{\Theta}_{t}\}_{t=1}^{T},\{\widehat{V}_{t}\}_{t=1}^{T},\{\widehat{\Upsilon}_{t,\mathrm{off}}\}_{t=1}^{T},\{\widehat{D}_{t}\}_{t=1}^{T-1}\}$; $\mathbf{\widehat{Y}}=\{\{\widehat{A}_{t}\}_{t=1}^{T},\{\widehat{Y}_{t,\mathrm{off}}\}_{t=1}^{T},\{\widehat{Z}_{t}\}_{t=1}^{T-1}\}$.
4: Set $k\leftarrow 0$.
5: while the termination criterion is not met do
6:     $\{\Theta_{t}\}_{t=1}^{T}$ update: call pcg in Matlab to solve the linear system (4.9), using $\{\Theta_{t}^{k}\}_{t=1}^{T}$ as the initial point, up to tolerance $\epsilon_{\mathsf{pcg}}$ to obtain $\{\Theta_{t}^{k+1}\}_{t=1}^{T}$.
7:     if ${\rm mod}(k+1,10)=0$ then
8:         Update $\epsilon_{\mathsf{pcg}}\leftarrow\max\{\tau\epsilon_{\mathsf{pcg}},10^{-12}\}$.
9:     end if
10:     Update $\{V_{t}\}_{t=1}^{T}$ according to (4.10).
11:     Update $\{\Upsilon_{t,\mathrm{off}}\}_{t=1}^{T}$ according to (4.11).
12:     Update $\{D_{t}\}_{t=1}^{T-1}$ according to (4.12).
13:     Update the dual variables according to (4.8).
14:     Set $k\leftarrow k+1$.
15: end while
16: Return $\widehat{\Theta}_{t}=\Theta_{t}^{k}$, $\widehat{V}_{t}=V_{t}^{k}$, $\widehat{\Upsilon}_{t,\mathrm{off}}=\Upsilon_{t,\mathrm{off}}^{k}$, $\widehat{D}_{t}=D_{t}^{k}$, $\widehat{A}_{t}=A_{t}^{k}$, $\widehat{Y}_{t,\mathrm{off}}=Y_{t,\mathrm{off}}^{k}$, $\widehat{Z}_{t}=Z_{t}^{k}$ for all $t$.

5 Implementation details

We implement Algorithm 1 in Matlab R2023a and perform several numerical experiments on both simulated and real datasets, which are discussed in the later sections. The Matlab code for the implementation of Algorithm 1 and for the experiments is available at https://github.com/linyopt/GFDtL. In this section, we provide some implementation details about the algorithm and the numerical experiments; interested readers can check the GitHub repository for technical details that are not covered here.

We first briefly describe the process of obtaining an estimator and identifying the corresponding breaks for a given $\epsilon>0$ and a sample $\mathcal{X}_{T}$ with $\sum_{t=1}^{T}X_{t}X_{t}^{\top}\succ 0$. Specifically, for any pair of tuning parameters $(\lambda_{1},\lambda_{2})$, Proposition 4.3 suggests that we can apply Algorithm 1 to solve (4.4) with a sufficiently large $\lambda_{3}$ to either assert that (4.1) may not have solutions or obtain a solution to (4.1). With the obtained estimator $\{\widehat{\Theta}_{t}\}_{t=1}^{T}$ in hand, we identify the breaks by selecting the $t$'s with $\|\widehat{\Theta}_{t+1}-\widehat{\Theta}_{t}\|_{F}\geq 10^{-6}$. Since $(\lambda_{1},\lambda_{2})$ control the model complexity and the smoothing, they must be calibrated accordingly. We search for the optimal tuning parameters over a user-specified grid based on some criteria, which are discussed in the next subsection.
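This break-identification step is a simple thresholding of consecutive Frobenius gaps; in Python it could read as follows (our illustration, using the $10^{-6}$ threshold from the text):

```python
import numpy as np

def detect_breaks(Theta_hat, tol=1e-6):
    """Indices t (0-based) at which a new block starts, i.e. where the estimated
    precision matrix jumps: ||Theta_t - Theta_{t-1}||_F >= tol."""
    return [t for t in range(1, len(Theta_hat))
            if np.linalg.norm(Theta_hat[t] - Theta_hat[t - 1], 'fro') >= tol]
```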

It is important to further discuss how the breaks are identified from a given estimator. Intuitively, with the Group Fused LASSO regularizer, the estimator $\{\widehat{\Theta}_{t}\}_{t=1}^{T}$ should be piece-wise constant, with $\|\widehat{\Theta}_{t+1}-\widehat{\Theta}_{t}\|_{F}\neq 0$ only at some $t$'s, which we identify as breaks. In contrast, the estimator produced by the competing method, the Gaussian loss-based GFGL approach of [13], does not inherently possess this property. Instead, their algorithm includes an additional post-processing step to extract the breaks from the estimator. Although, as discussed hereafter, this method can also produce reasonable breaks, their estimator does not naturally exhibit identifiable breaks.

5.1 Optimal selection of the tuning parameters

The Bayesian information criterion (BIC) is a common criterion for choosing the optimal tuning parameters: see, e.g., [27]. However, it is known that there are scenarios where the BIC may fail to select the optimal tuning parameters: see, e.g., Section 4.2 and the Appendix in [13]. This motivates us to propose the following three methods for selecting the optimal pair of tuning parameters:

  • (a)

Method a: When the true underlying data generating process is known, the optimal pair can be selected as the one that minimizes or maximizes suitable performance measures, such as the Hausdorff distance, model recovery, or estimation error. Although this strategy can only be employed when the true underlying structure is known, it is informative about the relative performances of the competing procedures when one is in a position to optimize the performance metrics.

  • (b)

Method b: In the case of simulated data, we divide the simulated samples of observations into a training set and $B$ test sets, all of them sampled from the same data generating process: the training set is denoted by $\mathcal{X}^{\text{train}}_{T}$ and the test sets are denoted by $\mathcal{X}^{\text{test}}_{(1),T},\ldots,\mathcal{X}^{\text{test}}_{(B),T}$. Based on $\mathcal{X}^{\text{train}}_{T}$, we apply Algorithm 1 to solve (4.4) with an appropriately large $\lambda_{3}$ for different $(\lambda_{1},\lambda_{2})$ candidates and obtain $\{\widehat{\Theta}(\mathcal{X}^{\text{train}}_{T})_{\lambda_{1},\lambda_{2}}\}^{T}_{t=1}$. The optimal $(\lambda^{\ast}_{1},\lambda^{\ast}_{2})$ is selected as the pair which minimizes

\[
\text{lossval}(\lambda_{1},\lambda_{2})\coloneqq\frac{1}{B}\sum^{B}_{j=1}{\mathbb{L}}(\{\widehat{\Theta}(\mathcal{X}^{\text{train}}_{T})_{\lambda_{1},\lambda_{2}}\}^{T}_{t=1},\mathcal{X}^{\text{test}}_{(j),T}).\tag{5.1}
\]

Here $B$ is user-specified; throughout this paper, we set $B=10$.

  • (c)

Method c: When the true underlying data generating process is unknown, as with real data, following [13] we employ a BIC-type criterion given by

\[
\text{BIC}(\lambda_{1},\lambda_{2})={\mathbb{L}}(\{\widehat{\Theta}_{t}\}^{T}_{t=1},\mathcal{X}_{T})+K\log(T),\tag{5.2}
\]

where $K$ represents the complexity, or degrees of freedom, and is defined as

\[
K=\text{card}\big(\mathbf{1}(\widehat{\Theta}_{uv,t}\neq\widehat{\Theta}_{uv,t-1}),\,\forall 2\leq t\leq T,\,\forall u\neq v\big)+\text{card}\big(\mathbf{1}(\widehat{\Theta}_{uv,1}\neq 0),\,\forall u\neq v\big).
\]

Then we select the optimal values $(\lambda^{\ast}_{1},\lambda^{\ast}_{2})$ according to the criterion

\[
(\lambda^{\ast}_{1},\lambda^{\ast}_{2})=\underset{\lambda_{1},\lambda_{2}}{\arg\min}\;\text{BIC}(\lambda_{1},\lambda_{2}).
\]

The definition of $K$ varies across the literature: see, e.g., the different definitions provided in [13] and [27]. In the former work, $K$ is the sum of the number of active edges at $t=1$ and of the corresponding changes for $t=2,\ldots,T$. In the latter work, $K$ is the number of non-zero coefficient blocks in $\widehat{\Theta}_{uv,t}$, $t=1,\ldots,T$, for $1\leq u\neq v\leq p$. Based on our preliminary experiments, we found that these two definitions do not result in significant differences in the selection of the optimal tuning parameters. Therefore, since we will compare our algorithm with the GFGL method of [13], we use their definition for consistency (see the sketch after this list).

For these three methods, the minimization problem is solved w.r.t. $(\lambda_1,\lambda_2)$ over a user-specified grid: in our synthetic experiments, the grid is specified as $0.1,0.2,0.3,\ldots,1$ for $\lambda_1$ and $10,20,30,\ldots,200$ for $\lambda_2$.

It is worth noting that, in view of Proposition 4.1-(ii), the user-specified grid may contain pairs $(\lambda_1,\lambda_2)$ for which (4.1) does not have solutions; this can be detected by checking (4.6). For these pairs, we set $\text{lossval}(\lambda_1,\lambda_2)=+\infty$ and $\text{BIC}(\lambda_1,\lambda_2)=+\infty$, so that they are never selected. In Subsection 6.2, we perform a sensitivity analysis on a simulated dataset to evaluate and compare Method b and Method c for selecting the optimal pair of tuning parameters.
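The following Python sketch illustrates the overall grid search under Method b; it assumes a hypothetical solver solve_gfdtl standing for Algorithm 1, which returns the estimated sequence together with a flag raised when (4.1) is detected to be possibly unsolvable, and a loss function loss_fn playing the role of $\mathbb{L}_T$. Method c is analogous, with the BIC-type criterion in place of lossval.

```python
import numpy as np

def select_tuning(solve_gfdtl, loss_fn, X_train, X_tests,
                  grid1=np.arange(0.1, 1.01, 0.1),
                  grid2=np.arange(10, 201, 10)):
    """Grid search for (lambda_1, lambda_2) under Method b (lossval, cf. (5.1)).
    Pairs flagged as possibly unsolvable (cf. (4.6)) receive lossval = +inf,
    so they are never selected."""
    best_pair, best_val = None, np.inf
    for lam1 in grid1:
        for lam2 in grid2:
            theta, unsolvable = solve_gfdtl(X_train, lam1, lam2)
            val = np.inf if unsolvable else np.mean(
                [loss_fn(theta, X_test) for X_test in X_tests])
            if val < best_val:
                best_pair, best_val = (lam1, lam2), val
    return best_pair
```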

5.2 Initialization and termination criterion

Throughout this paper, we initialize the algorithm as follows: for all $t$,

$$\Theta_t^0=\Big(\sum_{t=1}^TX_tX_t^\top\Big)^{-1},\quad V_t^0=\Theta_t^0,\quad\Upsilon_{t,\mathrm{off}}^0=\Theta_{t,\mathrm{off}}^0,\quad D_t^0=\Theta_{t+1}^0-\Theta_t^0=\mathbf{0}_{p\times p},\quad A_t^0=\mathbf{1}_{p\times p},\quad Y_{t,\mathrm{off}}^0=Z_t^0=\mathbf{0}_{p\times p};$$

here the inverse is well-defined since $\sum_{t=1}^TX_tX_t^\top\succ0$.

We next describe the termination criterion, which essentially consists of checking the constraint violations for the dual problem (4.5), together with the gap between the primal and dual objective values. Recall that $\mathbf{X}=\big\{\{\Theta_t\}_{t=1}^T,\{V_t\}_{t=1}^T,\{\Upsilon_{t,\mathrm{off}}\}_{t=1}^T,\{D_t\}_{t=1}^{T-1}\big\}$ is the primal variable with $\Theta_t\in\mathcal{S}^p$, $V_t\in\mathcal{S}^p$, $\Upsilon_{t,\mathrm{off}}\in\mathcal{S}_{\mathrm{off}}^p$ and $D_t\in\mathcal{S}^p$ for all $t$; $\mathbf{Y}=\big\{\{W_t\}_{t=1}^T,\{Y_{t,\mathrm{off}}\}_{t=1}^T,\{Z_t\}_{t=1}^{T-1}\big\}$ is the dual variable with $W_t\in\mathbb{R}^{p\times p}$, $Y_{t,\mathrm{off}}\in\mathcal{S}_{\mathrm{off}}^p$, $Z_t\in\mathcal{S}^p$ for all $t$; and let $\zeta_t=Z_t-Z_{t-1}-I_p+\operatorname{Sym}\big((X_tX_t^\top)^{\frac{1}{2}}W_t\big)-Y_{t,\mathrm{off}}$ for all $t$, where $Z_0=Z_T=\mathbf{0}_{p\times p}$. We define the relative infeasibility for the positive semi-definite constraint as

$$\mathsf{dfeas}_1(\mathbf{Y})\coloneqq\max_{t=1,\dots,T}\left\{\frac{|\min\{\lambda_{\min}(\zeta_t),0\}|}{\|\zeta_t\|_F+1}\right\}.$$

For the bound constraint on $\{Y_{t,\mathrm{off}}\}_{t=1}^T$, we similarly define the relative infeasibility as

$$\mathsf{dfeas}_2(\mathbf{Y})\coloneqq\frac{\big(\max_{t,u\neq v}\{|Y_{uv,t,\mathrm{off}}|\}-\lambda_1\big)_+}{1+\max_{t,u\neq v}\{|Y_{uv,t,\mathrm{off}}|\}},$$

where $Y_{uv,t,\mathrm{off}}$ is the $(u,v)$-th element of $Y_{t,\mathrm{off}}$. Then, the relative dual infeasibility is defined as

$$\mathsf{dfeas}(\mathbf{Y})\coloneqq\max\{\mathsf{dfeas}_1(\mathbf{Y}),\mathsf{dfeas}_2(\mathbf{Y})\}.$$

The relative duality gap is defined by

$$\mathsf{gap}(\mathbf{X},\mathbf{Y})\coloneqq\frac{|v_p(\mathbf{X})-v_d(\mathbf{Y})|}{1+|v_p(\mathbf{X})|+|v_d(\mathbf{Y})|},$$

where $v_p(\mathbf{X})$ and $v_d(\mathbf{Y})$ are the objective values of the primal problem (cf. (4.4)) at $\mathbf{X}$ and of the dual problem (cf. (4.5)) at $\mathbf{Y}$, respectively.

Overall, throughout this paper, we terminate Algorithm 1 when both the relative primal-dual gap and the relative dual infeasibility are sufficiently small (we note that Algorithm 1 does not involve $\{W_t\}_{t=1}^T$ while the dual problem (4.5) does; here, we set $W_t^k=(X_tX_t^\top)^{\frac{1}{2}}\Theta_t^k$ for all $t$ and $k$, which comes from the derivation of the dual problem (4.5), see (B.20)):

$$\max\big\{\mathsf{gap}(\mathbf{X}^k,\mathbf{Y}^k),\mathsf{dfeas}(\mathbf{Y}^k)\big\}\leq\epsilon_{\mathrm{tol}},$$

or, it is detected that Problem (4.1) may not have solutions:

$$\max_{t=1,\dots,T-1}\|\Theta_{t+1}^k-\Theta_t^k\|_F\geq\lambda_3, \quad (5.3)$$

or, the relative successive changes of both primal and dual variables are sufficiently small:

$$\max\Big\{\frac{\|\{\Theta_t^{k+1}-\Theta_t^k\}_{t=1}^T\|}{1+\|\{\Theta_t^{k+1}\}_{t=1}^T\|+\|\{\Theta_t^k\}_{t=1}^T\|},\;\frac{\|\{A_t^{k+1}-A_t^k\}_{t=1}^T\|}{1+\|\{A_t^{k+1}\}_{t=1}^T\|+\|\{A_t^k\}_{t=1}^T\|},\;\frac{\|\{Y_{t,\mathrm{off}}^{k+1}-Y_{t,\mathrm{off}}^k\}_{t=1}^T\|}{1+\|\{Y_{t,\mathrm{off}}^{k+1}\}_{t=1}^T\|+\|\{Y_{t,\mathrm{off}}^k\}_{t=1}^T\|},\;\frac{\|\{Z_t^{k+1}-Z_t^k\}_{t=1}^{T-1}\|}{1+\|\{Z_t^{k+1}\}_{t=1}^{T-1}\|+\|\{Z_t^k\}_{t=1}^{T-1}\|}\Big\}\leq\frac{\epsilon_{\mathrm{tol}}}{10^3},$$

where $\|\{\Theta_t\}_{t=1}^T\|\coloneqq\big(\sum_{t=1}^T\sum_{u=1}^p\sum_{v=1}^p(\Theta_{uv,t})^2\big)^{\frac{1}{2}}$. We set $\epsilon_{\mathrm{tol}}=10^{-3}$ for the experiments on both synthetic and real datasets.
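For concreteness, here is a minimal Python sketch of two of these quantities, the PSD part of the dual infeasibility and the relative duality gap; it assumes the matrices $\zeta_t$ have already been assembled as above, and the function names are ours.

```python
import numpy as np

def dfeas_psd(zetas):
    """dfeas_1(Y): relative violation of the PSD constraints, for the
    matrices zeta_t stacked in an array of shape (T, p, p)."""
    return max(abs(min(np.linalg.eigvalsh(z)[0], 0.0))
               / (np.linalg.norm(z, "fro") + 1.0) for z in zetas)

def rel_gap(v_primal, v_dual):
    """Relative duality gap between the primal and dual objective values."""
    return abs(v_primal - v_dual) / (1.0 + abs(v_primal) + abs(v_dual))
```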

5.3 Choices of algorithm parameters

The parameter $\epsilon$ specifies a lower bound on the eigenvalues of the matrices in the resulting solution, which ensures the non-singularity of each matrix in the obtained solution whenever $\epsilon>0$. The choice of $\epsilon$ depends on how "non-singular" the solution is expected to be. In our experiments, we set $\epsilon=0.01$. In view of Proposition 4.3 and its proof, $\lambda_3$ should be large enough, with its lower bound related to $(\lambda_1,\lambda_2)$ and the possible solution to the corresponding (4.1). After some trial experiments, we set $\lambda_3=10$ for the experiments on synthetic datasets and $\lambda_3=50$ for the experiments on real datasets. For the tolerance of pcg, we set $\epsilon_{\mathsf{pcg}}=10^{-2}$ and $\tau=0.9$. The parameter $\beta$ in the ADMM affects only the convergence speed of the algorithm, not the solution it returns, so we did not fine-tune it; interested readers can refer to the GitHub repository for further details regarding these settings.

6 Synthetic experiments

In this section, we conduct simulation experiments to assess the performance of the GFDtL procedure, its sensitivity to the tuning parameters, and its computational complexity.

6.1 Simulations

To assess the finite-sample relevance of the GFDtL procedure, we consider the following settings, where we denote by $m^{\ast}$ the true number of breaks and by $\Omega^{\ast}_j$, $j=1,\ldots,m^{\ast}+1$, the true precision matrices:

Setting (i): For each true precision matrix $\Omega^{\ast}_j$, $j=1,\ldots,m^{\ast}+1$, its structure is uniformly drawn from the set of graphs with vertex size $\text{card}(V_j)=p$, that is, the number of variables, and $\text{card}(E_j)=M_j$ edges, giving the graph $G(V_j,E_j)\sim$ Erdös-Rényi$(\mathcal{P},M_j)$ for the block $\mathcal{B}^{\ast}_j$. The zero entries are generated by matching the pattern of the adjacency matrix $E_j$ and the precision matrix $\Omega^{\ast}_j$, that is, $(u,v)\in E_j\Leftrightarrow\Omega^{\ast}_{uv,j}\neq0$, providing the sparsity pattern in the off-diagonal coefficients of $\Omega^{\ast}_j$. The proportion of zero entries is calibrated by $\mathcal{P}$. Then the off-diagonal non-zero entries of $\Omega^{\ast}_j$ are drawn from $\mathcal{U}([-0.8,-0.05]\cup[0.05,0.8])$, where $\mathcal{U}([-a,-b]\cup[b,a])$ denotes the uniform distribution on $[-a,-b]\cup[b,a]$. The diagonal elements are drawn from $\mathcal{U}([0.5,1])$. Finally, to ensure that the resulting matrix is positive-definite, if the simulated $\Omega^{\ast}_j$ satisfies $\lambda_{\min}(\Omega^{\ast}_j)<0.01$, we apply $\Omega^{\ast}_j=\Omega^{\ast}_j+(\zeta+|\lambda_{\min}(\Omega^{\ast}_j)|)I_p$, where $\zeta$ is the first value in $\{0.005,0.01,0.015,\ldots\}$ such that $\lambda_{\min}(\Omega^{\ast}_j)>0.01$.
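A minimal Python sketch of this generation mechanism is given below; the function name and random-number interface are ours, and for simplicity edges are drawn independently with probability $1-\mathcal{P}$ rather than with a fixed edge count $M_j$.

```python
import numpy as np

def simulate_precision(p=10, sparsity=0.8, rng=None):
    """Sketch of Setting (i): sparse symmetric matrix with off-diagonal
    non-zeros in U([-0.8,-0.05] U [0.05,0.8]), diagonal in U([0.5,1]),
    then the ridge correction described above so that lambda_min > 0.01."""
    rng = rng or np.random.default_rng(0)
    omega = np.diag(rng.uniform(0.5, 1.0, size=p))
    for u in range(p):
        for v in range(u + 1, p):
            if rng.random() > sparsity:  # edge present with prob. 1 - sparsity
                omega[u, v] = omega[v, u] = (
                    rng.uniform(0.05, 0.8) * rng.choice([-1.0, 1.0]))
    lam = np.linalg.eigvalsh(omega)[0]       # smallest eigenvalue
    if lam < 0.01:
        zeta = 0.005
        while lam + zeta + abs(lam) <= 0.01:  # lambda_min after the shift
            zeta += 0.005
        omega += (zeta + abs(lam)) * np.eye(p)
    return omega
```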

Setting (ii): For each true $\Omega^{\ast}_j$, $j=1,\ldots,m^{\ast}+1$, the off-diagonal non-zero entries are drawn from $\mathcal{U}([-1,1])$ and the diagonal elements from $\mathcal{U}([1.1,1.5])$. To ensure that the resulting matrix is positive-definite, we apply the same final step as in Setting (i).

Setting (iii): The precision matrix is generated in the same spirit as in Section 5 of [3], which creates a banded structure. We construct $\Sigma^{\ast}_j=\Omega^{\ast-1}_j=D^{1/2}CD^{1/2}$, $j=1,\ldots,m^{\ast}+1$, where $D=\text{diag}(U_1,\ldots,U_p)$ with $U_k\sim\mathcal{U}([0.5,2])$, $1\leq k\leq p$; the matrix $D$ makes the diagonal entries of $\Sigma^{\ast}_j$ and $\Theta^{\ast}_j$ different. We set $C=(c_{uv})_{1\leq u,v\leq p}$ with $c_{uv}=a^{|v-u|}$, where the coefficient $a$ equals $0.4$ with probability $0.5$ and $0.1$ otherwise. Then, we set $\Omega^{\ast}_{j,uv}=0$ when $|\Omega^{\ast}_{j,uv}|<0.05$, and each non-zero off-diagonal coefficient is multiplied by $1$ (resp. $-1$) with probability $0.5$ (resp. $0.5$). Finally, to ensure that the resulting matrix is positive-definite, we apply the same final step as in Setting (i).

For each of these settings, for $t=1,\ldots,T$, we draw $X_t\sim\mathcal{N}_{\mathbb{R}^p}(0,\Theta^{\ast-1}_t)$, the $p$-dimensional Gaussian distribution, with $\Theta^{\ast}_t=\Omega^{\ast}_j$ when $t\in\mathcal{B}^{\ast}_j$. We set $p=10$ and consider three cases relating to the breaks: (a) no break; (b) a single break; (c) several breaks. In case (b), $m^{\ast}=1$ and in case (c), $m^{\ast}=4$; the locations of the breaks, i.e., $T^{\ast}_j$, $j=1,\ldots,m^{\ast}$, are randomly set, conditionally on $\mathcal{I}_{\min}$ being at least $\kappa T$, where $\kappa=1/(m^{\ast}+c)$. We set $c=8$ so that $\kappa=0.11$ (resp. $\kappa=0.0833$) when $m^{\ast}=1$ (resp. $m^{\ast}=4$): the regimes may have different time lengths but satisfy a minimum time length condition. The latter relates to the issue of trimming: see the Introduction and Section 4 in [30], who mentioned that $\kappa\in[0.05,0.25]$ is a standard choice. We set the sample size $T=100$ in case (a) and $T=150$ in cases (b) and (c). The "sparsity degree", that is, the proportion of zero entries in the lower triangular part of $\Omega^{\ast}_j$, is set as $80\%$ and $30\%$ in Settings (i) and (ii), which represents approximately $36$ and $14$ zero entries in each regime out of the $45$ lower triangular elements, respectively. Note that in Setting (i), we allow the true number of zero entries to vary slightly between regimes around the corresponding sparsity degree, although $\mathcal{P}$ remains constant. In Setting (ii), we keep the true number of zero entries constant across the regimes, i.e., exactly $36$ and $14$ zero entries. In Setting (iii), the sparsity degree can vary slightly depending on the value of $a$.
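For completeness, a short Python sketch (with assumed helper names) of how a sample with breaks is drawn once the regime precision matrices are available:

```python
import numpy as np

def sample_piecewise(omegas, break_dates, T, rng=None):
    """Draw X_1,...,X_T with X_t ~ N(0, Omega_j^{-1}) on the j-th regime;
    `break_dates` are the first dates of regimes 2,...,m*+1."""
    rng = rng or np.random.default_rng(0)
    p = omegas[0].shape[0]
    bounds = [1] + list(break_dates) + [T + 1]
    X = np.empty((T, p))
    for j, omega in enumerate(omegas):
        cov = np.linalg.inv(omega)
        for t in range(bounds[j], bounds[j + 1]):
            X[t - 1] = rng.multivariate_normal(np.zeros(p), cov)
    return X
```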

For each setting, we draw one hundred batches of $T$ independent samples $\mathcal{X}_T$. For each batch, we apply Algorithm 1 to solve (4.4) with some large $\lambda_3$ over a grid specified as $0.1,0.2,0.3,\ldots,1$ for $\lambda_1$ and $10,20,30,\ldots,200$ for $\lambda_2$, and select the optimal pair $(\lambda^{\ast}_1,\lambda^{\ast}_2)$ using the selection methods described in Subsection 5.1, denoted by Method a, Method b and Method c hereafter. As a competing method, we also solve the Gaussian loss-based GFGL of [13] (the Matlab code for GFGL is available on the corresponding journal website as supplemental material), selecting the optimal tuning parameters with the same strategies. For these two methods, we report the following metrics as performance measures:

(i) the number of breaks detected by the procedure, denoted by $nb$;

(ii) the Hausdorff distance $d_h=100\times\max\big(h(\widehat{T}_{\widehat{m}},\mathcal{T}^{\ast}_{m^{\ast}}),h(\mathcal{T}^{\ast}_{m^{\ast}},\widehat{T}_{\widehat{m}})\big)/T$, which serves as a measure of the estimation accuracy of the break dates (see the sketch following this list);

(iii) the $\text{F}_1$ score defined as $\text{F}_1=2\text{TP}/(2\text{TP}+\text{FN}+\text{FP})$, where TP is the number of correctly estimated non-zero coefficients, FN the number of incorrectly estimated zero entries, and FP the number of incorrectly estimated non-zero entries; the closer the $\text{F}_1$ score is to $1$, the better;

(iv) the accuracy defined as $\mathrm{acc}=(\text{TP}+\text{TN})/T$, where TN is the number of correctly estimated zero entries;

(v) the averaged mean squared error (MSE) $\sqrt{(p^2T)^{-1}\sum_{t=1}^T\|\widehat{\Theta}_t-\Theta^{\ast}_t\|^2_F}$, measuring precision-matrix accuracy.
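As a minimal illustration of metrics (ii) and (iii), the following Python sketch computes the scaled Hausdorff distance between two non-empty break sets and the $\text{F}_1$ score from estimated and true precision matrices; the function names are ours and, for brevity, the $\text{F}_1$ computation pools all entries rather than restricting to the off-diagonal coefficients.

```python
import numpy as np

def hausdorff_pct(est_breaks, true_breaks, T):
    """d_h: 100 * max of the two directed distances, divided by T.
    Both break sets are assumed non-empty here."""
    d = lambda a, b: max(min(abs(x - y) for y in b) for x in a)
    return 100.0 * max(d(est_breaks, true_breaks),
                       d(true_breaks, est_breaks)) / T

def f1_score(theta_hat, theta_true, tol=1e-8):
    """F1 = 2TP / (2TP + FN + FP) over the estimated/true supports."""
    est_nz, true_nz = np.abs(theta_hat) > tol, np.abs(theta_true) > tol
    tp = np.sum(est_nz & true_nz)
    fp = np.sum(est_nz & ~true_nz)
    fn = np.sum(~est_nz & true_nz)
    return 2 * tp / (2 * tp + fn + fp)
```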

These metrics are averaged over the one hundred independent batches and are reported in Table 1, Table 2 and Table 3. These simulated experiments have been run on an Intel(R) Xeon(R) Gold 6242R CPU @ 3.09 GHz with 128 GB of RAM.

Table 1: Break recovery, model selection and precision accuracy based on 100 replications, Setting (i), with respect to $(m^{\ast},s^{\ast},T)$. For $d_h$ and MSE, smaller numbers are better; for $\text{F}_1$ and $\mathrm{acc}$, larger numbers are better. Bold figures represent the best performing models.
$nb$ $d_h$ $\text{F}_1$ $\mathrm{acc}$ MSE
$(m^{\ast},s^{\ast},T)$ $(\lambda_1,\lambda_2)$ GFDtL GFGL GFDtL GFGL GFDtL GFGL GFDtL GFGL GFDtL GFGL
(0,0.8,100) Method a 0 0 0 0 0.8472 0.7174 0.9133 0.8084 0.1707 0.2266
Method b 0 3.6800 0 46.0100 0.8072 0.5377 0.8919 0.5518 0.1662 0.2089
Method c 0.1400 2.7000 4.1100 42.2200 0.7525 0.6915 0.8425 0.8060 0.1335 0.2331
(0,0.3,100) Method a 0 0 0 0 0.8246 0.7247 0.8036 0.6429 0.2098 0.4219
Method b 0 2.5000 0 36.8400 0.7178 0.7078 0.7251 0.5811 0.3452 0.4127
Method c 0.0700 5.9800 2.6100 45.0200 0.7903 0.6526 0.7541 0.6281 0.1755 0.4491
(1,0.8,150) Method a 1.0100 1.2700 0.4900 1.3500 0.7261 0.6592 0.8294 0.7641 0.2496 0.2382
Method b 2.1800 23.0600 20.9900 69.9600 0.6815 0.4712 0.7805 0.4209 0.2049 0.2092
Method c 1.4000 3.7200 9.5100 21.9400 0.6480 0.6707 0.7264 0.8016 0.2029 0.2427
(1,0.3,150) Method a 1.0000 1.3600 0.0700 0.4300 0.7441 0.7323 0.6882 0.6347 0.3558 0.4237
Method b 2.7900 20.0700 27.7800 66.7200 0.7163 0.7140 0.6696 0.5784 0.3774 0.4068
Method c 1.1300 5.4000 9.7300 23.4000 0.7223 0.6670 0.6583 0.6343 0.3335 0.4505
(4,0.8,150) Method a 4.2100 5.7600 2.2700 2.7100 0.6389 0.6088 0.7409 0.7093 0.2906 0.2585
Method b 5.4500 44.1100 11.5800 27.6700 0.6040 0.4555 0.6924 0.3810 0.2476 0.2159
Method c 2.9200 7.2300 29.3100 18.0100 0.5792 0.6228 0.6579 0.7488 0.2853 0.2599
(4,0.3,150) Method a 4.1100 5.1100 1.7400 1.1800 0.7021 0.7175 0.6180 0.5920 0.4475 0.4379
Method b 6.3000 36.8100 14.9100 27.6500 0.6989 0.7163 0.6209 0.5759 0.4218 0.4114
Method c 1.8800 6.0800 41.0400 11.5000 0.6616 0.6615 0.5700 0.6067 0.4555 0.4762
Table 2: Break recovery, model selection and precision accuracy based on 100 replications, Setting (ii), with respect to $(m^{\ast},s^{\ast},T)$. For $d_h$ and MSE, smaller numbers are better; for $\text{F}_1$ and $\mathrm{acc}$, larger numbers are better. Bold figures represent the best performing models.
$nb$ $d_h$ $\text{F}_1$ $\mathrm{acc}$ MSE
$(m^{\ast},s^{\ast},T)$ $(\lambda_1,\lambda_2)$ GFDtL GFGL GFDtL GFGL GFDtL GFGL GFDtL GFGL GFDtL GFGL
(0,0.8,100) Method a 0 0 0 0 0.8364 0.7131 0.9051 0.8109 0.2087 0.3582
Method b 0.0100 2.7600 0.3600 40.0400 0.7940 0.5841 0.8861 0.6200 0.2239 0.3417
Method c 0.0200 2.2200 0.8000 44.0100 0.7576 0.6655 0.8532 0.8048 0.1560 0.3654
(0,0.3,100) Method a 0 0 0 0 0.8437 0.8143 0.7784 0.7069 0.2451 0.6669
Method b 0 1.2700 0 29.1500 0.6995 0.8155 0.6500 0.7027 0.5162 0.6652
Method c 0.0100 3.2600 0.3600 38.2700 0.8318 0.6885 0.7639 0.6071 0.2423 0.6981
(1,0.8,150) Method a 1.0200 1.2900 0.9100 1.3800 0.7021 0.6546 0.7968 0.7645 0.2833 0.3470
Method b 1.8900 23.9000 15.7300 86.6000 0.6777 0.5139 0.7769 0.4931 0.2530 0.3206
Method c 1.0100 3.1200 16.3800 21.0600 0.6264 0.6398 0.7374 0.7998 0.2552 0.3506
(1,0.3,150) Method a 0.9900 1.4500 0.3000 0.8000 0.7846 0.8271 0.6909 0.7186 0.5085 0.6511
Method b 3.5100 20.5600 39.8500 91.1200 0.7465 0.8209 0.6560 0.7054 0.5470 0.6431
Method c 1.0300 4.0800 14.2600 25.0800 0.7717 0.6935 0.6765 0.6089 0.4779 0.6874
(4,0.8,150) Method a 4.3100 6.1500 4.5500 4.5300 0.5985 0.5901 0.6861 0.6931 0.3334 0.3466
Method b 4.3900 40.8100 16.0900 33.6400 0.5788 0.4821 0.6624 0.4297 0.3099 0.3153
Method c 1.9300 5.4500 55.6100 32.4000 0.5226 0.5971 0.5968 0.7630 0.3480 0.3559
(4,0.3,150) Method a 4.1000 5.1300 1.9400 1.1900 0.7772 0.8265 0.6702 0.7150 0.6536 0.6727
Method b 6.8800 33.9200 17.0400 34.2100 0.7691 0.8233 0.6655 0.7077 0.6031 0.6516
Method c 1.5000 6.3700 50.4000 12.0900 0.7507 0.7154 0.6546 0.6171 0.6527 0.7177
Table 3: Break recovery, model selection and precision accuracy based on 100 replications, Setting (iii), with respect to $(m^{\ast},T)$. For $d_h$ and MSE, smaller numbers are better; for $\text{F}_1$ and $\mathrm{acc}$, larger numbers are better. Bold figures represent the best performing models.
$nb$ $d_h$ $\text{F}_1$ $\mathrm{acc}$ MSE
$(m^{\ast},T)$ $(\lambda_1,\lambda_2)$ GFDtL GFGL GFDtL GFGL GFDtL GFGL GFDtL GFGL GFDtL GFGL
(0,100) Method a 0 0 0 0 0.8236 0.7848 0.8230 0.7704 0.1089 0.2241
Method b 0.0600 5.6600 1.5900 48.6800 0.7611 0.7268 0.7941 0.6448 0.1504 0.2100
Method c 0.0300 7.3800 0.8900 53.9000 0.5396 0.5322 0.6835 0.6533 0.1709 0.2511
(1,150) Method a 1.0000 1.1200 2.3800 3.1900 0.7799 0.7357 0.7670 0.7425 0.1800 0.2344
Method b 2.5500 19.2700 27.8200 85.5000 0.7564 0.7136 0.7600 0.6084 0.1825 0.2092
Method c 0.4400 7.5100 59.1100 42.3700 0.3961 0.4830 0.6028 0.6325 0.2317 0.2554
(4,150) Method a 4.9900 7.2700 8.2500 9.6700 0.6950 0.6816 0.6709 0.6791 0.2181 0.2274
Method b 4.5200 34.1600 24.3900 32.7100 0.7104 0.6951 0.7029 0.5862 0.2251 0.2095
Method c 0.3700 4.3300 107.9600 66.8400 0.3538 0.4279 0.5836 0.6102 0.2589 0.2478

Overall, our GFDtL procedure performs better than the GFGL with respect to all the metrics in any setting when Methods a and b are used for the selection of the optimal pair $(\lambda_1,\lambda_2)$. Importantly, our procedure results in a lower MSE, hence more accurate estimation.

It is worth commenting on the case of no break, that is, $m^{\ast}=0$: $d_h=0$ means that the procedure concludes that there is no break; in this setting, our procedure performs much better than the GFGL, particularly in terms of MSE and Hausdorff distance.

When applying Method b, the Hausdorff distance obtained by the GFDtL is much better than the one resulting from the GFGL, which tends to overestimate the number of breaks. Overall, we may conclude that the GFDtL results in a low probability of falsely detecting breaks in the no-break case.

Finally, the BIC-based results (i.e., Method c) are in favor of the GFGL method, particularly in terms of $d_h$. This is because the BIC tends to underestimate the number of breaks when applied to the GFDtL, i.e., it tends to select a large $\lambda_2$: indeed, when $m^{\ast}\leq1$, the results obtained by Method c are good; but in the case of multiple breaks, the number of breaks detected by the BIC is much lower than the true number. This behavior is further detailed in Subsection 6.2.

6.2 Sensitivity analysis with respect to the tuning parameters

We propose a sensitivity analysis of Methods b and c of Subsection 5.1 with respect to the calibration of $(\lambda_1,\lambda_2)$. More precisely, we illustrate the ability of the proposed strategies to identify the optimal pair $(\lambda_1,\lambda_2)$ for break detection and sparse estimation. The experiments are conducted on datasets simulated according to Setting (i) in Subsection 6.1 with $T=100$, $p=10$, a "sparsity degree" of $80\%$, and $m^{\ast}=0,1,3$. To better approximate the metric surfaces, we use a denser grid specified as $0.1,0.11,0.12,\ldots,1$ for $\lambda_1$ and $10,11,12,\ldots,200$ for $\lambda_2$.

The results are displayed in Figure 1, with the three rows corresponding to $m^{\ast}=0,1,3$ and the four columns showcasing the results for the BICs (cf. (5.2)), lossvals (cf. (5.1)), Hausdorff distances, and F1 scores, respectively. In each subfigure, lighter colors represent more favorable tuning parameters, indicating areas where the metric values are minimized or maximized, as appropriate. To facilitate visualization given the wide range of values for the BICs and lossvals, we pre-processed these metrics before plotting: specifically, we subtracted the minimum value to ensure non-negativity, then applied the log1p function in Matlab (i.e., $\log(1+\cdot)$) to compress the scale of the values, making the patterns more discernible and enhancing the interpretability of the results.

Figure 1: Sensitivity analysis of tuning parameters (subfigures (a)–(l)).

The figures suggest consistent patterns across the surfaces of all four metrics. Specifically, there are distinct boundaries splitting the metric surfaces into two primary regions: an upper and a lower region. Each of these regions further subdivides into multiple subregions that exhibit similar characteristics across all four metrics. The lower regions of the BIC, lossval, and F1 score surfaces are characterized by numerous vertical bars, indicating areas of potentially optimal parameter combinations. In contrast, the Hausdorff distance surface displays a constant lower region.

An interesting observation is that the BIC-type criterion tends to favor smaller values of $\lambda_1$, whereas the lossval criterion leans towards slightly larger $\lambda_1$ values. Furthermore, although both criteria struggle to identify the optimal Hausdorff distance (except when $m^{\ast}=0$), when comparing the tuning parameters selected by the BIC-type criterion to those selected by the lossval criterion, the latter exhibits a slight advantage: its optimal region (the white area in the subfigures) is larger, especially as $m^{\ast}$ increases. For example, when $m^{\ast}=3$, the white region extends to the upper right corner, which is preferable to the lower region. This suggests that the lossval criterion may be more effective in identifying the optimal parameters. These advantages of the lossval criterion over the BIC-type criterion are further corroborated by the results presented in the previous subsection.

The figures also shed light on the reasons behind the BIC-type criterion's poor performance in the context of the GFDtL. Specifically, the BIC-type criterion shows a preference for larger $\lambda_2$ values, which corresponds to fewer breaks in the estimation. This preference can be directly attributed to the definition of the BIC (cf. (5.2)). In particular, when there are breaks, $K\log(T)$ is at least $\log(T)p(p-1)$, which typically dominates the loss value term in the BIC formula. Consequently, the BIC-type criterion tends to favor estimators with fewer breaks, leading to suboptimal performance in detecting the true number of breaks, especially in scenarios with a higher number of actual breaks.
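To fix ideas, with $T=150$ and $p=10$ as in our experiments, a single break affecting the whole network already contributes at least $p(p-1)\log(T)=90\log(150)\approx451$ to the penalty in (5.2), which typically exceeds any attainable decrease of the loss term.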

Furthermore, the F1 score surfaces provide additional insights into the model's performance across different parameter combinations. The gradual transition from darker to lighter colors as $\lambda_1$ increases (for fixed $\lambda_2$) suggests that the model's ability to correctly identify true positives improves with larger $\lambda_1$ values, up to a certain point. This observation aligns with the lossval criterion's preference for slightly larger $\lambda_1$ values compared to the BIC-type criterion.

6.3 Empirical computational complexity analysis

The computational complexity of Algorithm 1 is influenced by several factors, including the sample size $T$, the dimension $p$, the number of breaks $m^{\ast}$, $\lambda_3$, and the pair $(\lambda_1,\lambda_2)$. To empirically analyze this complexity, we conduct a series of experiments based on Setting (i) in Subsection 6.1, varying each factor individually. Specifically, for each factor, we perform 5 experiments with the other factors held constant, and plot the averaged computation time to visualize the impact of each factor on the algorithm's complexity. Here, we use $\beta=21$. These experiments are conducted on a desktop with an Intel(R) Core(TM) processor (10 cores and 20 threads) and 64 GB of RAM.

As displayed in Figure 2, the computation time is approximately linear in $T$, quadratic in $p$, and not significantly affected by $m^{\ast}$ and $\lambda_3$. The impact of $(\lambda_1,\lambda_2)$ is presented in Table 4, where one can observe that the computation times for $(0.1,10)$ and $(0.2,10)$ are notably shorter. This is because the algorithm terminates early, as (5.3) is satisfied within the first few iterations. Additionally, the computation times for $(0.1,20)$, $(0.2,20)$, $(0.3,10)$, $(0.4,10)$, and $(0.5,10)$ are significantly higher, as these are marginal cases that are typically more challenging to solve. Apart from these marginal cases, for large $\lambda_1$ and $\lambda_2$, Algorithm 1 requires nearly the same amount of time to converge.

Figure 2: Empirical computational complexity analysis (subfigures (a)–(d)).
Table 4: Computation time (s) vs. $\lambda_1$ and $\lambda_2$ ($T=100$, $p=10$, $m^{\ast}=1$, $\lambda_3=10$)
$\lambda_1\backslash\lambda_2$ 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170 180 190 200
0.1 3.52 33.71 12.47 6.71 5.17 4.17 5.58 5.58 5.83 5.73 5.58 5.51 5.52 5.51 5.53 5.44 5.52 5.50 5.57 5.47
0.2 6.60 14.00 6.15 4.28 3.68 3.98 4.13 4.19 4.13 4.15 4.16 4.15 4.14 4.14 4.15 4.17 4.15 4.15 4.15 4.15
0.3 27.25 7.13 3.69 2.98 2.67 2.91 2.90 2.91 2.91 2.91 2.91 2.91 2.91 2.92 2.91 2.91 2.92 2.91 2.93 2.91
0.4 36.05 5.11 2.75 2.40 2.16 2.16 2.16 2.16 2.16 2.16 2.16 2.22 2.19 2.21 2.18 2.20 2.21 2.21 2.21 2.23
0.5 23.22 4.65 2.68 2.28 2.11 2.12 2.12 2.11 2.12 2.12 2.18 2.17 2.17 2.21 2.17 2.18 2.18 2.17 2.17 2.17
0.6 14.14 4.64 2.81 2.30 2.30 2.22 2.23 2.17 2.20 2.21 2.24 2.20 2.24 2.15 2.20 2.14 2.17 2.17 2.19 2.18
0.7 10.38 4.78 2.87 2.19 2.26 2.30 2.15 2.18 2.16 2.23 2.17 2.19 2.21 2.21 2.21 2.15 2.18 2.18 2.19 2.19
0.8 9.17 4.72 2.88 2.46 2.22 2.41 2.59 2.48 2.20 2.15 2.14 2.15 2.18 2.23 2.24 2.43 2.27 2.35 2.20 2.26
0.9 8.73 4.80 3.03 2.45 2.18 2.18 2.21 2.15 2.15 2.13 2.16 2.21 2.21 2.14 2.15 2.13 2.20 2.17 2.13 2.13
1.0 8.06 4.60 2.84 2.13 2.14 2.13 2.15 2.14 2.13 2.13 2.13 2.14 2.14 2.14 2.15 2.16 2.14 2.14 2.13 2.14

7 Real data experiment

In this section, the relevance of our method is compared with the GFGL through a portfolio allocation experiment based on real financial data. The same computer was employed as in Subsection 6.1. We consider hereafter the stochastic process $(X_t)$ in $\mathbb{R}^p$ of log-stock returns, where $X_{t,j}=100\times\log(P_{t,j}/P_{t-1,j})$, $1\leq j\leq p$, with $P_{t,j}$ the stock price of the $j$-th index at time $t$. The portfolio allocation is performed with $20$ stocks from the S&P 500, which are representative of different economic sectors (the data can be found on https://finance.yahoo.com or https://macrobond.com, and are provided on the GitHub repository): Alphabet, Amazon, American Airlines, Apple, Berkshire Hathaway, Boeing, Chevron, Equity Residential, ExxonMobil, Ford, General Electric, Goldman Sachs, Jacobs Engineering Group, JPMorgan, Lockheed Martin, Pfizer, Procter & Gamble, United Parcel Service, Verizon, Walmart. The sample period is November 11, 2019 – March 27, 2020, corresponding to $T=100$ observations.

We analyse the economic performances obtained by the GFDtL and GFGL through the global minimum variance portfolio (GMVP) investment problem. Following [8], the latter problem at time $t$, in the absence of short-sales constraints, is defined as

$$\min_{w_t}\;w_t^\top H_tw_t,\;\;\text{s.t.}\;\;\mathbf{1}^{\top}_{p\times1}w_t=1,$$

where $w_t$ is the vector of portfolio weights for time $t$ chosen at time $t-1$ and $H_t$ is the $p\times p$ conditional (on the past) covariance matrix of $X_t$. The explicit solution is given by $\omega_t=H_t^{-1}\mathbf{1}_{p\times1}/\mathbf{1}^{\top}_{p\times1}H_t^{-1}\mathbf{1}_{p\times1}$. It is then natural to replace $H_t$ by an estimator $\widehat{H}_t$, yielding $\widehat{\omega}_t=\widehat{H}_t^{-1}\mathbf{1}_{p\times1}/\mathbf{1}^{\top}_{p\times1}\widehat{H}_t^{-1}\mathbf{1}_{p\times1}$. As a function of $H_t$ only, the GMVP performance essentially depends on the precise measurement of the latter or, equivalently, of the precision matrix. In our setting, we set $\widehat{H}_t^{-1}=\widehat{\Theta}_{t-1}$, estimated by the GFDtL and GFGL procedures, where $(\lambda^{\ast}_1,\lambda^{\ast}_2)$ is selected by Method c described in Subsection 5.1. We also consider the equally weighted portfolio, denoted by $1/p$. The following (annualized) performance metrics are reported: AVG, the average of the portfolio returns $\widehat{\omega}_t^\top X_t$, multiplied by $252$; SD, the standard deviation of the portfolio returns, multiplied by $\sqrt{252}$; IR, the information ratio computed as $\textbf{AVG}/\textbf{SD}$. The key performance measure is SD: the GMVP problem essentially aims to minimize the variance rather than to maximize the expected return. Hence, as emphasized in [9], Section 6.2, high AVG and IR are desirable but should be considered of secondary importance compared with the quality of the covariance matrix estimator. We also report the number of breaks $nb$ detected by the procedure.
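To make the allocation step concrete, here is a minimal Python sketch (with helper names of our own choosing) of the GMVP weights written directly in terms of the precision matrix, together with the annualized performance metrics; it is a sketch of the formulas above, not of our full pipeline.

```python
import numpy as np

def gmvp_weights(theta):
    """Closed-form GMVP weights w = H^{-1} 1 / (1' H^{-1} 1),
    expressed with the precision matrix Theta = H^{-1}."""
    ones = np.ones(theta.shape[0])
    w = theta @ ones
    return w / (ones @ w)

def annualized_metrics(port_returns):
    """AVG (x 252), SD (x sqrt(252)) and IR = AVG/SD of daily returns."""
    avg = 252.0 * np.mean(port_returns)
    sd = np.sqrt(252.0) * np.std(port_returns, ddof=1)
    return avg, sd, avg / sd
```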

Both the GFGL and GFDtL estimate the following break dates: February 25, 2020; March 9, 2020; March 17, 2020. These breaks are in line with the aftershock of the COVID-19 outbreak: the S&P 500 index entered a downward trend from February 20, 2020, and reached a minimum on March 23, 2020, which precedes the rally. Despite the presence of the COVID-19 shock, our proposed GFDtL procedure provides the lowest SD and clearly outperforms the GFGL. The BIC-based selection results in relevant estimates of the break dates.

Table 5: Annualized GMVP performance metrics.
S&P 500 data
AVG SD IR $nb$
GFDtL -50.29 35.76 -1.41 3
GFGL -101.44 50.00 -2.03 3
$1/p$ -77.77 45.69 -1.70 -

Note: The lowest SD figure is in bold face.

8 Concluding remarks

We propose a filtering procedure for the estimation of the number of change-points in the precision matrix of a vector-valued random process, whose full structure can "break" over time. We show that the estimated break dates are sufficiently close to the true break dates, together with the consistency of the estimated precision matrix in each regime. We propose an ADMM-based algorithm to solve the optimization problem with breaks and study its properties. The simulation results illustrate the relevance of our method compared to the Gaussian likelihood-based GFGL in terms of change-point detection and graph recovery. They also emphasize the difficulty of devising a strategy to select the optimal pair $(\lambda_1,\lambda_2)$.

A potential extension consists in the specification of adaptive weights in both the Group Fused penalty and the LASSO penalty: indeed, unless high-level conditions such as the irrepresentable condition of [45] are assumed, the LASSO penalty requires a correction, such as its adaptive version, to ensure the recovery of the sparse structure, provided that the estimated break dates are close to the true ones. This would require the computation of a first-step consistent estimator to enter the penalty functions.

Acknowledgements

Benjamin Poignard was supported by JSPS KAKENHI (22K13377) and RIKEN. Ting Kei Pong was supported partly by the Hong Kong Research Grants Council PolyU153002/21p. Akiko Takeda was supported by JSPS KAKENHI (23H03351).

Appendix A Intermediary results

Lemma A.1.

Under Assumption 1, Assumption 2 and Assumption 3-(i), if $p^2\log(pT)=o(T\delta_T)$ with $\delta_T\rightarrow0$ as $T\rightarrow\infty$, then:

(i) $\underset{\substack{1\leq s<r\leq T+1\\ r-s\geq T\delta_T}}{\sup}\;\lambda_{\max}\Big(\frac{1}{r-s}\sum_{t=s}^{r-1}X_tX_t^\top\Big)\leq\overline{\mu}+o_p(1)$;

(ii) $\underline{\mu}+o_p(1)\leq\underset{\substack{1\leq s<r\leq T+1\\ r-s\geq T\delta_T}}{\inf}\;\lambda_{\min}\Big(\frac{1}{r-s}\sum_{t=s}^{r-1}X_tX_t^\top\Big)$.

Proof.

Let us prove Point (i). Recall that $\Sigma^{\ast}_j$ is the true variance-covariance matrix of $X_t$ with $t\in\mathcal{B}^{\ast}_j$. Now take any $s\leq t\leq r-1$, with $s,r\in\{1,\ldots,T\}$ such that $r-s\geq T\delta_T$. Then, by the triangle inequality, under Assumption 2:

$$\Big\|\frac{1}{r-s}\sum_{t=s}^{r-1}X_tX_t^\top\Big\|_s\leq\Big\|\frac{1}{r-s}\sum_{t=s}^{r-1}\text{Var}(X_t)\Big\|_s+\Big\|\frac{1}{r-s}\sum_{t=s}^{r-1}\big(X_tX_t^\top-\text{Var}(X_t)\big)\Big\|_s\leq\overline{\mu}+p\Big\|\frac{1}{r-s}\sum_{t=s}^{r-1}\big(X_tX_t^\top-\text{Var}(X_t)\big)\Big\|_{\max}.$$

We show $\max_{\substack{1\leq s<r\leq T+1\\ r-s\geq T\delta_T}}\|\frac{1}{r-s}\sum_{t=s}^{r-1}(X_tX_t^\top-\text{Var}(X_t))\|_s=o_p(1)$. By Assumption 1, we can apply Theorem 1 of [26] to $\zeta_{kl,t}\coloneqq X_{k,t}X_{l,t}$: the latter result requires the mixing condition $\alpha(t)\leq\exp(-c_1t^{\gamma_1})$ for some $c_1,\gamma_1>0$; for $c_1,\gamma_1=1$, we may take $\rho=\exp(-2c_1)$, which allows us to apply their Theorem 1. Thus, there exist constants $C_1,C_2$ such that, $\forall\epsilon>0$, $\forall1\leq k,l\leq p$,

$$\mathbb{P}\Big(\Big|\frac{1}{r-s}\sum_{t=s}^{r-1}\big(X_{k,t}X_{l,t}-\text{Var}(X_t)_{kl}\big)\Big|\geq\epsilon\Big)\leq(T+1)\exp\Big(-\frac{((r-s)\epsilon)^{\frac{\gamma}{1+\gamma}}}{C_1}\Big)+\exp\Big(-\frac{((r-s)\epsilon)^2}{(r-s)C_2}\Big).$$

Take $C>0$ large enough. Applying the previous inequality together with Bonferroni's inequality:

\begin{align*}
&\mathbb{P}\Big(\sup_{\substack{1\leq s<r\leq T+1\\ r-s\geq T\delta_T}}\Big\|\frac{1}{\sqrt{r-s}}\sum_{t=s}^{r-1}\big(X_tX_t^\top-\text{Var}(X_t)\big)\Big\|_{\max}\geq C\sqrt{\log(pT)}\Big)\\
&\quad\leq T^2\sup_{\substack{1\leq s<r\leq T+1\\ r-s\geq T\delta_T}}\mathbb{P}\Big(\max_{1\leq k,l\leq p}\Big|\frac{1}{r-s}\sum_{t=s}^{r-1}\big(X_tX_t^\top-\text{Var}(X_t)\big)_{kl}\Big|\geq C\sqrt{\log(pT)/(r-s)}\Big)\\
&\quad\leq T^2\sup_{\substack{1\leq s<r\leq T+1\\ r-s\geq T\delta_T}}p^2\Big\{(T+1)\exp\Big(-\frac{\big((r-s)C\sqrt{\log(pT)/(r-s)}\big)^{\frac{\gamma}{1+\gamma}}}{C_1}\Big)+\exp\Big(-\frac{\big((r-s)C\sqrt{\log(pT)/(r-s)}\big)^2}{(r-s)C_2}\Big)\Big\}\\
&\quad\leq\exp\Big(-\frac{(CT\delta_T\log(pT))^{\frac{\gamma}{2(1+\gamma)}}}{C_1}+4\log(pT)\Big)+\exp\Big(-\frac{CT\delta_T\log(pT)}{C_2}+2\log(pT)\Big),
\end{align*}

which goes to $0$ as $T\rightarrow\infty$ by Assumption 3-(i), which implies $(T\delta_T\log(pT))^{\gamma/(2(1+\gamma))}\propto\log(pT)$. Since $\|A\|_s\leq p\|A\|_{\max}$ for $A\in\mathbb{R}^{p\times p}$,

$$p\max_{\substack{1\leq s<r\leq T+1\\ r-s\geq T\delta_T}}\Big\|\frac{1}{r-s}\sum_{t=s}^{r-1}\big(X_tX_t^\top-\text{Var}(X_t)\big)\Big\|_{\max}=O_p\Big(p\sqrt{\frac{\log(pT)}{T\delta_T}}\Big)=o_p(1),$$

under $(T\delta_T)^{-1}p^2\log(pT)\rightarrow0$. To prove Point (ii), we rely on the inequality

$$\lambda_{\min}\Big(\frac{1}{r-s}\sum_{t=s}^{r-1}X_tX_t^\top\Big)\geq\underline{\mu}-\Big\|\frac{1}{r-s}\sum_{t=s}^{r-1}\big(X_tX_t^\top-\text{Var}(X_t)\big)\Big\|_s.$$

The result follows from the bound derived on $\|\frac{1}{r-s}\sum_{t=s}^{r-1}(X_tX_t^\top-\text{Var}(X_t))\|_s$. ∎

The next lemma will be useful to bound the first-order derivative with respect to $\Theta$ of the non-penalized D-trace loss function.

Lemma A.2.

Suppose Assumption 1, Assumption 2 and Assumption 3-(i) are satisfied, and let $\delta_T\rightarrow0$ as $T\rightarrow\infty$. Then

$$\sup_{\substack{1\leq s<r\leq T+1\\ r-s\geq T\delta_T}}\Big\|\frac{1}{\sqrt{r-s}}\sum_{t=s}^{r-1}\big(X_tX_t^\top-\text{Var}(X_t)\big)\Big\|_{\max}=O_p(\sqrt{\log(pT)}).$$
Proof.

The result follows the same steps as in the proof of Lemma A.1. ∎

Lemma A.3.

Consider problem (2.1). Define $\Gamma_t=\Theta_t-\Theta_{t-1}$ for $t\geq2$ and $\Gamma_1=\Theta_1$. The GFDtL estimator $\{\widehat{\Theta}_t\}_{t=1}^T$ satisfies the conditions

$$\forall t\in\{1,\ldots,T\},\;\frac{1}{T}\sum_{r=t}^T\Big(\frac{1}{2}\widehat{\Theta}_rX_rX_r^\top+\frac{1}{2}X_rX_r^\top\widehat{\Theta}_r-I_p\Big)+\lambda_1\sum_{r=t}^T\widehat{E}_{1r}+\lambda_2\widehat{E}_{2t}=\mathbf{0}_{p\times p},$$

where $\widehat{E}_{1t},\widehat{E}_{2t}$ are the sub-gradient matrices defined by

$$\forall u\neq v,\;\widehat{E}_{uv,1t}=\begin{cases}\text{sgn}\big(\sum_{s=1}^t\widehat{\Gamma}_{uv,s}\big)&\text{if}\;\sum_{s=1}^t\widehat{\Gamma}_{uv,s}\neq0,\\ \in[-1,1]&\text{otherwise},\end{cases}$$

and $\widehat{E}_{21}=\mathbf{0}_{p\times p}$, and for $t=2,\ldots,T$, $\widehat{E}_{2t}$ satisfies

$$\widehat{E}_{2t}=\frac{\widehat{\Gamma}_t}{\|\widehat{\Gamma}_t\|_F}\;\;\text{if}\;\;\widehat{\Gamma}_t\neq\mathbf{0}_{p\times p},\;\;\text{and}\;\;\|\widehat{E}_{2t}\|_F\leq1\;\;\text{if}\;\;\|\widehat{\Gamma}_t\|_F=0.$$
Proof.

Defining $\Gamma_t=\Theta_t-\Theta_{t-1}$ for $t\geq2$ and $\Gamma_1=\Theta_1$, the problem stated in (2.1) can be recast as the minimization of the function

$$\mathbb{G}_{\lambda_1,\lambda_2}(\{\Theta_t\}_{t=1}^T,\mathcal{X}_T)=\frac{1}{T}\sum_{t=1}^T\Big[\text{tr}\Big(\frac{1}{2}\Big(\sum_{s=1}^t\Gamma_s\Big)^2X_tX_t^\top\Big)-\text{tr}\Big(\sum_{s=1}^t\Gamma_s\Big)\Big]+\lambda_1\sum_{t=1}^T\sum_{u\neq v}\Big|\sum_{s=1}^t\Gamma_{uv,s}\Big|+\lambda_2\sum_{t=2}^T\|\Gamma_t\|_F.$$

Invoking subdifferential calculus, a necessary and sufficient condition for $(\widehat{\Gamma}_t)_{1\leq t\leq T}$ to minimize $\mathbb{G}_{\lambda_1,\lambda_2}(\cdot,\mathcal{X}_T)$ is that, for all $t=1,\ldots,T$, $\mathbf{0}_{p\times p}\in\mathbb{R}^{p\times p}$ belongs to the subdifferential of $\mathbb{G}_{\lambda_1,\lambda_2}(\cdot,\mathcal{X}_T)$ with respect to $\Gamma_t$ at $(\widehat{\Gamma}_t)_{1\leq t\leq T}$, that is,

$$\frac{1}{T}\sum_{r=t}^T\Big(\frac{1}{2}\Big(\sum_{s=1}^r\widehat{\Gamma}_s\Big)X_rX_r^\top+\frac{1}{2}X_rX_r^\top\Big(\sum_{s=1}^r\widehat{\Gamma}_s\Big)-I_p\Big)+\lambda_1\sum_{r=t}^T\widehat{E}_{1r}+\lambda_2\widehat{E}_{2t}=\mathbf{0}_{p\times p},$$

with the subgradient matrices defined as $\widehat{E}_{21}=\mathbf{0}_{p\times p}$,

$$\widehat{E}_{2t}=\frac{\widehat{\Gamma}_t}{\|\widehat{\Gamma}_t\|_F}\;\text{if}\;\widehat{\Gamma}_t\neq\mathbf{0}_{p\times p},\;\;\|\widehat{E}_{2t}\|_F\leq1\;\text{if}\;\|\widehat{\Gamma}_t\|_F=0,\;\;\text{and}\;\;\forall u\neq v,\;\widehat{E}_{uv,1t}=\begin{cases}\omega_{uv,t}\,\text{sgn}\big(\sum_{s=1}^t\widehat{\Gamma}_{uv,s}\big)&\text{if}\;\sum_{s=1}^t\widehat{\Gamma}_{uv,s}\neq0,\\ \in[-1,1]&\text{otherwise}.\end{cases}$$

Now if $t=\widehat{T}_j$ for $j\in\{1,\ldots,\widehat{m}\}$ is one of the estimated break dates, then $\widehat{\Gamma}_t\neq\mathbf{0}_{p\times p}$, and

$$\frac{1}{T}\sum_{r=\widehat{T}_j}^T\Big(\frac{1}{2}\widehat{\Theta}_rX_rX_r^\top+\frac{1}{2}X_rX_r^\top\widehat{\Theta}_r-I_p\Big)+\lambda_1\sum_{r=\widehat{T}_j}^T\widehat{E}_{1r}+\lambda_2\frac{\widehat{\Gamma}_{\widehat{T}_j}}{\|\widehat{\Gamma}_{\widehat{T}_j}\|_F}=\mathbf{0}_{p\times p},$$

since the breaks cannot occur at $t=1$ and $\sum_{s=1}^r\widehat{\Gamma}_s=\widehat{\Theta}_r$. When $t=1$, the first-order condition with respect to $\Gamma_t$ yields

$$\frac{1}{T}\sum_{r=1}^T\Big(\frac{1}{2}\Big(\sum_{s=1}^r\widehat{\Gamma}_s\Big)X_rX_r^\top+\frac{1}{2}X_rX_r^\top\Big(\sum_{s=1}^r\widehat{\Gamma}_s\Big)-I_p\Big)+\lambda_1\sum_{r=1}^T\widehat{E}_{1r}=\mathbf{0}_{p\times p},$$

so that

$$\frac{1}{T}\Big\|\sum_{r=1}^T\Big(\frac{1}{2}\Big(\sum_{s=1}^r\widehat{\Gamma}_s\Big)X_rX_r^\top+\frac{1}{2}X_rX_r^\top\Big(\sum_{s=1}^r\widehat{\Gamma}_s\Big)-I_p\Big)+\lambda_1\sum_{r=1}^T\widehat{E}_{1r}\Big\|_F\leq\lambda_2.\;\;∎$$

Lemma A.4.

Let $\mathcal{X}_T=(X_1,\dots,X_T)$ be a given set of $p$-dimensional vectors with $\sum_{t=1}^TX_tX_t^\top\succ0$. For an arbitrary fixed $\lambda_1>0$, let $\mathcal{C}_{\lambda_1}$ be defined as

$$\mathcal{C}_{\lambda_1}\coloneqq\Big\{\big\{\{W_t\}_{t=1}^T,\{Y_{t,\mathrm{off}}\}_{t=1}^T,\{Z_t\}_{t=1}^{T-1}\big\}\,:\,Z_0=Z_T=\mathbf{0}_{p\times p};\;Z_t-Z_{t-1}-I_p+\operatorname{Sym}\big((X_tX_t^\top)^{\frac{1}{2}}W_t\big)-Y_{t,\mathrm{off}}\succeq0\;\forall t;\;|Y_{uv,t,\mathrm{off}}|\leq\lambda_1T\;\forall t,u,v;\;W_t\in\mathbb{R}^{p\times p},\,Y_{t,\mathrm{off}}\in\mathcal{S}_{\mathrm{off}}^p,\,Z_t\in\mathcal{S}^p\;\forall t\Big\},$$

where $Y_{uv,t,\mathrm{off}}$ is the $(u,v)$-th element of $Y_{t,\mathrm{off}}$. Then $\mathcal{C}_{\lambda_1}$ has a Slater point.

Proof.

Since $\sum_{t=1}^TX_tX_t^\top\succ0$, there exist a large $c>0$ and a small $\gamma>0$ such that

$$T\gamma I_p\prec c\sum_{t=1}^TX_tX_t^\top-TI_p=\sum_{t=1}^T(X_tX_t^\top)^{\frac{1}{2}}\overline{W}_t-\sum_{t=1}^T\overline{Y}_{t,\mathrm{off}}-TI_p,$$

where $\overline{W}_t\coloneqq c(X_tX_t^\top)^{\frac{1}{2}}$ and $\overline{Y}_{t,\mathrm{off}}\coloneqq\mathbf{0}_{p\times p}$ for all $t$. We can see from the above display that

$$cX_TX_T^\top-I_p\succ(T-1)I_p-c\sum_{t=1}^{T-1}X_tX_t^\top+(T-1)\gamma I_p+\gamma I_p\succ(T-1)I_p-c\sum_{t=1}^{T-1}X_tX_t^\top+(T-1)\gamma I_p. \quad (A.1)$$

To find $\{\overline{Z}_t\}_{t=1}^{T-1}$ and $\overline{Z}_0=\overline{Z}_T=\mathbf{0}_{p\times p}$ such that $\overline{Z}_t-\overline{Z}_{t-1}-I_p+\operatorname{Sym}\big((X_tX_t^\top)^{\frac{1}{2}}\overline{W}_t\big)-\overline{Y}_t\succ0$ for all $t$, we need

$$\left\{\begin{aligned}&\overline{Z}_{T-1}\prec cX_TX_T^\top-I_p,\\&\overline{Z}_t\prec\overline{Z}_{t+1}+cX_{t+1}X_{t+1}^\top-I_p\;\;\forall t=1,2,\dots,T-2,\\&\overline{Z}_1\succ I_p-cX_1X_1^\top.\end{aligned}\right. \quad (A.2)$$

We claim that the following choice of $\{\overline{Z}_t\}_{t=1}^{T-1}$, defined recursively (starting from $\overline{Z}_0=\mathbf{0}_{p\times p}$), satisfies (A.2):

$$\overline{Z}_t=\overline{Z}_{t-1}+I_p-cX_tX_t^\top+\gamma I_p\;\;\forall t=1,\dots,T-1.$$

Indeed, it is routine to check that the second and third lines of (A.2) are satisfied. Then, using $\overline{Z}_0=\mathbf{0}_{p\times p}$ and the above display recursively, we have

$$\overline{Z}_{T-1}=(T-1)I_p-c\sum_{t=1}^{T-1}X_tX_t^\top+(T-1)\gamma I_p.$$

Then by (A.1), $\overline{Z}_{T-1}\prec cX_TX_T^\top-I_p$. Hence $\big\{\{\overline{W}_t\}_{t=1}^T,\{\overline{Y}_{t,\mathrm{off}}\}_{t=1}^T,\{\overline{Z}_t\}_{t=1}^{T-1}\big\}$ is a Slater point of $\mathcal{C}_{\lambda_1}$. ∎

Appendix B Proofs

B.1 Proof of Theorem 3.1

Proof of point (i). The proof builds upon [17], Proposition 3, [30], Theorem 3.1, and [14], Theorem 1. We define:

$$A_{T,j}=\big\{|\widehat{T}_j-T^{\ast}_j|\geq T\delta_T\big\},\qquad C_T=\big\{\max_{1\leq j\leq m^{\ast}}|\widehat{T}_j-T^{\ast}_j|<\mathcal{I}_{\min}/2\big\}.$$

By the union bound, $\mathbb{P}(\max_{1\leq j\leq m^{\ast}}|\widehat{T}_j-T^{\ast}_j|\geq T\delta_T)\leq\sum_{j=1}^{m^{\ast}}\mathbb{P}(A_{T,j})$, with $m^{\ast}<\infty$. So we aim to show:

$$(a)\;\sum_{j=1}^{m^{\ast}}\mathbb{P}(A_{T,j}\cap C_T)\rightarrow0,\qquad(b)\;\sum_{j=1}^{m^{\ast}}\mathbb{P}(A_{T,j}\cap C^c_T)\rightarrow0,$$

with $C^c_T$ the complement of $C_T$.

Proof of (a). We show:

$$\sum_{j=1}^{m^{\ast}}\mathbb{P}(A^{+}_{T,j}\cap C_T)\rightarrow0\quad\text{and}\quad\sum_{j=1}^{m^{\ast}}\mathbb{P}(A^{-}_{T,j}\cap C_T)\rightarrow0,$$

where $A^{+}_{T,j}=\{T^{\ast}_j-\widehat{T}_j\geq T\delta_T\}$ and $A^{-}_{T,j}=\{\widehat{T}_j-T^{\ast}_j\geq T\delta_T\}$. We prove $\sum_{j=1}^{m^{\ast}}\mathbb{P}(A^{+}_{T,j}\cap C_T)\rightarrow0$, as the other case follows in the same spirit. In light of $C_T$:

$$\forall j\in\{1,\ldots,m^{\ast}\},\;T^{\ast}_{j-1}<\widehat{T}_j<T^{\ast}_{j+1}. \quad (B.1)$$

By Lemma A.3, with $t=T^{\ast}_j$ and $t=\widehat{T}_j$, letting $\Lambda(\Sigma)=\frac{1}{2}(\Sigma\otimes I_p+I_p\otimes\Sigma)$, in $\text{vec}(\cdot)$ form:

$$\frac{1}{T}\sum_{r=\widehat{T}_j}^T\Big[\Lambda(X_rX_r^\top)\text{vec}(\Theta^{\ast}_r+\widehat{\Theta}_r-\Theta^{\ast}_r)-\text{vec}(I_p)\Big]+\text{vec}\Big(\lambda_1\sum_{r=\widehat{T}_j}^T\widehat{E}_{1r}+\lambda_2\frac{\widehat{\Gamma}_{\widehat{T}_j}}{\|\widehat{\Gamma}_{\widehat{T}_j}\|_F}\Big)=\mathbf{0}_{p^2\times1},$$

and

$$\Big\|\frac{1}{T}\sum_{r=T^{\ast}_j}^T\Big[\Lambda(X_rX_r^\top)\text{vec}(\Theta^{\ast}_r)-\text{vec}(I_p)\Big]+\frac{1}{T}\sum_{r=T^{\ast}_j}^T\Lambda(X_rX_r^\top)\text{vec}(\widehat{\Theta}_r-\Theta^{\ast}_r)+\lambda_1\text{vec}\Big(\sum_{r=T^{\ast}_j}^T\widehat{E}_{1r}\Big)\Big\|_2\leq\lambda_2.$$

Therefore, under $T^{\ast}_j>\widehat{T}_j$, taking the differences, by the triangle inequality, we obtain:

$$2\lambda_2\geq\Big\|\frac{1}{T}\sum_{r=\widehat{T}_j}^{T^{\ast}_j-1}\Big[\Lambda(X_rX_r^\top)\text{vec}(\Theta^{\ast}_r)-\text{vec}(I_p)\Big]+\frac{1}{T}\sum_{r=\widehat{T}_j}^{T^{\ast}_j-1}\Lambda(X_rX_r^\top)\text{vec}(\widehat{\Theta}_r-\Theta^{\ast}_r)+\text{vec}\Big(\lambda_1\sum_{r=\widehat{T}_j}^{T^{\ast}_j-1}\widehat{E}_{1r}\Big)\Big\|_2.$$

Each component of $\lambda_1\sum_{r=\widehat{T}_j}^{T^{\ast}_j-1}\widehat{E}_{1r}$ is bounded by $\pm\lambda_1(T^{\ast}_j-\widehat{T}_j)$. We deduce by the triangle inequality:

\begin{align*}
2\lambda_2+\lambda_1\sqrt{p(p-1)}(T^{\ast}_j-\widehat{T}_j)\geq&\;\Big\|\frac{1}{T}\sum_{r=\widehat{T}_j}^{T^{\ast}_j-1}\Big[\Lambda(X_rX_r^\top)\text{vec}(\Theta^{\ast}_r)-\text{vec}(I_p)\Big]+\frac{1}{T}\sum_{r=\widehat{T}_j}^{T^{\ast}_j-1}\Lambda(X_rX_r^\top)\text{vec}(\widehat{\Theta}_r-\Theta^{\ast}_r)\Big\|_2\\
=&\;\Big\|\frac{1}{T}\sum_{r=\widehat{T}_j}^{T^{\ast}_j-1}\Big[\Lambda(X_rX_r^\top)\text{vec}(\Omega^{\ast}_j)-\text{vec}(I_p)\Big]+\frac{1}{T}\sum_{r=\widehat{T}_j}^{T^{\ast}_j-1}\Lambda(X_rX_r^\top)\text{vec}(\widehat{\Omega}_{j+1}-\Omega^{\ast}_j)\Big\|_2\\
\geq&\;\Big\|\frac{1}{T}\sum_{r=\widehat{T}_j}^{T^{\ast}_j-1}\Lambda(X_rX_r^\top)\text{vec}(\Omega^{\ast}_{j+1}-\Omega^{\ast}_j)\Big\|_2-\Big\|\frac{1}{T}\sum_{r=\widehat{T}_j}^{T^{\ast}_j-1}\Lambda(X_rX_r^\top)\text{vec}(\widehat{\Omega}_{j+1}-\Omega^{\ast}_{j+1})\Big\|_2\\
&\;-\Big\|\frac{1}{T}\sum_{r=\widehat{T}_j}^{T^{\ast}_j-1}\Big[\frac{1}{2}\Omega^{\ast}_jX_rX_r^\top+\frac{1}{2}X_rX_r^\top\Omega^{\ast}_j-I_p\Big]\Big\|_F\eqqcolon R_{Tj,1}-R_{Tj,2}-R_{Tj,3}, \quad (B.2)
\end{align*}

where the first equality holds since $\widehat{\Theta}_r=\widehat{\Omega}_{j+1}$ and $\Theta^{\ast}_r=\Omega^{\ast}_j$ for $r\in[\widehat{T}_j,T^{\ast}_j-1]$ by (B.1), and $R_{Tj,1}$, $R_{Tj,2}$, $R_{Tj,3}$ denote the three (non-negative) norms in the last bound. Define the event:

$$\overline{R}_{Tj}=\Big\{2\lambda_2+\lambda_1\sqrt{p(p-1)}(T^{\ast}_j-\widehat{T}_j)\geq\frac{1}{3}R_{Tj,1}\Big\}\cup\Big\{R_{Tj,2}\geq\frac{1}{3}R_{Tj,1}\Big\}\cup\Big\{R_{Tj,3}\geq\frac{1}{3}R_{Tj,1}\Big\}.$$

Since inequality (B.2) holds with probability one, $\mathbb{P}(\overline{R}_{Tj})=1$. Therefore, we have:

\begin{align*}
\mathbb{P}(A^{+}_{T,j}\cap C_T)\leq&\;\mathbb{P}\Big(A^{+}_{T,j}\cap C_T\cap\Big\{2\lambda_2+\lambda_1\sqrt{p(p-1)}(T^{\ast}_j-\widehat{T}_j)\geq\frac{1}{3}R_{Tj,1}\Big\}\Big)\\
&\;+\mathbb{P}\Big(A^{+}_{T,j}\cap C_T\cap\Big\{R_{Tj,2}\geq\frac{1}{3}R_{Tj,1}\Big\}\Big)+\mathbb{P}\Big(A^{+}_{T,j}\cap C_T\cap\Big\{R_{Tj,3}\geq\frac{1}{3}R_{Tj,1}\Big\}\Big)\\
\eqqcolon&\;AC_{j,1}+AC_{j,2}+AC_{j,3}.
\end{align*}

Let us first bound $\sum^{m^{\ast}}_{j=1}AC_{j,1}$. Since $\|AB\|_{F}\geq\lambda_{\min}(A^{\top}A)^{1/2}\|B\|_{F}$, for $1\leq j\leq m^{\ast}$:

AC_{j,1}\leq\;{\mathbb{P}}(A^{+}_{T,j}\cap\{2\lambda_{2}+\lambda_{1}\sqrt{p(p-1)}(T^{\ast}_{j}-\widehat{T}_{j})\geq\frac{1}{3}R_{Tj,1}\})
\leq\;{\mathbb{P}}\Big{(}\|\frac{1}{T^{\ast}_{j}-\widehat{T}_{j}}\sum^{T^{\ast}_{j}-1}_{r=\widehat{T}_{j}}\Lambda(X_{r}X^{\top}_{r})\text{vec}(\Omega^{\ast}_{j+1}-\Omega^{\ast}_{j})\|_{2}\leq\frac{3T}{T^{\ast}_{j}-\widehat{T}_{j}}\big{[}2\lambda_{2}+\lambda_{1}\sqrt{p(p-1)}(T^{\ast}_{j}-\widehat{T}_{j})\big{]},\;T^{\ast}_{j}-\widehat{T}_{j}\geq T\delta_{T}\Big{)}
\leq\;{\mathbb{P}}\Big{(}\gamma^{\min}_{1,T,j}\|\Omega^{\ast}_{j+1}-\Omega^{\ast}_{j}\|_{F}\leq\frac{6T\lambda_{2}}{T^{\ast}_{j}-\widehat{T}_{j}}+3T\lambda_{1}\sqrt{p(p-1)},\;T^{\ast}_{j}-\widehat{T}_{j}\geq T\delta_{T}\Big{)}
\leq\;{\mathbb{P}}\Big{(}\gamma^{\min}_{1,T,j}\leq\frac{6\lambda_{2}}{\eta_{\min}\delta_{T}}+\frac{3T\lambda_{1}\sqrt{p(p-1)}}{\eta_{\min}},\;T^{\ast}_{j}-\widehat{T}_{j}\geq T\delta_{T}\Big{)},

with $\gamma^{\min}_{1,T,j}=\lambda_{\min}(\frac{1}{T^{\ast}_{j}-\widehat{T}_{j}}\sum^{T^{\ast}_{j}-1}_{r=\widehat{T}_{j}}X_{r}X^{\top}_{r})\geq\underline{\mu}/2>0$ with probability tending to one by Lemma A.1, and $\eta_{\min}=\min_{1\leq j\leq m^{\ast}}\|\Omega^{\ast}_{j+1}-\Omega^{\ast}_{j}\|_{F}$. Since $\lambda_{2}/(\eta_{\min}\delta_{T})\rightarrow 0$ and $T\lambda_{1}\sqrt{p(p-1)}/\eta_{\min}\rightarrow 0$ by Assumption 3-(iii), we deduce $\sum^{m^{\ast}}_{j=1}AC_{j,1}\rightarrow 0$.
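The passage from the second to the third probability above rests on $\|AB\|_{F}\geq\lambda_{\min}(A^{\top}A)^{1/2}\|B\|_{F}$, applied with $A$ the averaged $\Lambda(X_{r}X^{\top}_{r})$ operator. A small numerical sketch of the underlying spectral fact, again under the assumption $\Lambda(S)=(S\otimes I_{p}+I_{p}\otimes S)/2$, whose eigenvalues are the pairwise averages $(\lambda_{i}(S)+\lambda_{j}(S))/2$:

import numpy as np

# Sketch: the eigenvalues of Lambda(S) = (S kron I + I kron S)/2 are the
# pairwise averages (lambda_i(S) + lambda_j(S))/2, so its smallest eigenvalue
# equals lambda_min(S); hence ||Lambda(S) vec(B)||_2 >= lambda_min(S) ||B||_F.
rng = np.random.default_rng(1)
p = 5
X = rng.standard_normal((p, 2 * p))
S = X @ X.T / (2 * p)                         # a positive definite "sample" matrix
Lam = 0.5 * (np.kron(S, np.eye(p)) + np.kron(np.eye(p), S))

lam_min_S = np.linalg.eigvalsh(S)[0]
assert np.isclose(np.linalg.eigvalsh(Lam)[0], lam_min_S)

B = rng.standard_normal((p, p))
lhs = np.linalg.norm(Lam @ B.flatten(order="F"))
assert lhs >= lam_min_S * np.linalg.norm(B, "fro") - 1e-10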

We now bound $\sum^{m^{\ast}}_{j=1}AC_{j,2}$. For any $j=1,\ldots,m^{\ast}$:

AC_{j,2}=\;{\mathbb{P}}\Big{(}A^{+}_{T,j}\cap C_{T}\cap\Big{\{}\|\frac{1}{T^{\ast}_{j}-\widehat{T}_{j}}\sum^{T^{\ast}_{j}-1}_{r=\widehat{T}_{j}}\Lambda(X_{r}X^{\top}_{r})\text{vec}(\widehat{\Omega}_{j+1}-\Omega^{\ast}_{j+1})\|_{2}\geq\frac{1}{3}\|\frac{1}{T^{\ast}_{j}-\widehat{T}_{j}}\sum^{T^{\ast}_{j}-1}_{r=\widehat{T}_{j}}\Lambda(X_{r}X^{\top}_{r})\text{vec}(\Omega^{\ast}_{j+1}-\Omega^{\ast}_{j})\|_{2}\Big{\}}\Big{)}
\leq\;{\mathbb{P}}\Big{(}A^{+}_{T,j}\cap C_{T}\cap\Big{\{}\gamma^{\max}_{T,j}\|\widehat{\Omega}_{j+1}-\Omega^{\ast}_{j+1}\|_{F}\geq\frac{1}{3}\gamma^{\min}_{1,T,j}\|\Omega^{\ast}_{j+1}-\Omega^{\ast}_{j}\|_{F}\Big{\}}\Big{)},

with $\gamma^{\max}_{T,j}=\lambda_{\max}(\frac{1}{T^{\ast}_{j}-\widehat{T}_{j}}\sum^{T^{\ast}_{j}-1}_{r=\widehat{T}_{j}}X_{r}X^{\top}_{r})\leq 2\overline{\mu}$ with probability tending to one by Lemma A.1. We now need to evaluate the bound for $\|\widehat{\Omega}_{j+1}-\Omega^{\ast}_{j+1}\|_{F}$. To do so, we rely on the KKT conditions of Lemma A.3. We have $\widehat{\Theta}_{t}=\widehat{\Omega}_{j+1}$ when $t\in[T^{\ast}_{j},(T^{\ast}_{j}+T^{\ast}_{j+1})/2-1]$, as $\widehat{T}_{j}<T^{\ast}_{j}$ given $A^{+}_{T,j}$ and $\widehat{T}_{j+1}>(T^{\ast}_{j}+T^{\ast}_{j+1})/2$ given $C_{T}$. Therefore, by Lemma A.3 with $l=(T^{\ast}_{j}+T^{\ast}_{j+1})/2$ and $l=T^{\ast}_{j}$, following the steps used to obtain inequality (B.2), we get

2\lambda_{2}+\lambda_{1}\sqrt{p(p-1)}[(T^{\ast}_{j}+T^{\ast}_{j+1})/2-T^{\ast}_{j}]
\geq\;\|\frac{1}{T}\sum^{(T^{\ast}_{j}+T^{\ast}_{j+1})/2-1}_{r=T^{\ast}_{j}}\Lambda(X_{r}X^{\top}_{r})\text{vec}(\widehat{\Omega}_{j+1}-\Omega^{\ast}_{j+1})\|_{2}-\|\frac{1}{T}\sum^{(T^{\ast}_{j}+T^{\ast}_{j+1})/2-1}_{r=T^{\ast}_{j}}\Big{[}\frac{1}{2}\Omega^{\ast}_{j+1}X_{r}X^{\top}_{r}+\frac{1}{2}X_{r}X^{\top}_{r}\Omega^{\ast}_{j+1}-I_{p}\Big{]}\|_{F}.

Therefore, denoting by $\gamma^{\min}_{2,T,j}=\lambda_{\min}\Big{(}\frac{1}{T^{\ast}_{j+1}-T^{\ast}_{j}}\sum^{(T^{\ast}_{j}+T^{\ast}_{j+1})/2-1}_{r=T^{\ast}_{j}}X_{r}X^{\top}_{r}\Big{)}$, conditional on $C_{T}$:

\|\widehat{\Omega}_{j+1}-\Omega^{\ast}_{j+1}\|_{F}\leq\;(\gamma^{\min}_{2,T,j})^{-1}\Big{(}\frac{2T\lambda_{2}+T\lambda_{1}\sqrt{p(p-1)}[(T^{\ast}_{j}+T^{\ast}_{j+1})/2-T^{\ast}_{j}]}{T^{\ast}_{j+1}-T^{\ast}_{j}}
+\|\frac{2}{T^{\ast}_{j+1}-T^{\ast}_{j}}\sum^{(T^{\ast}_{j}+T^{\ast}_{j+1})/2-1}_{r=T^{\ast}_{j}}\Big{[}\frac{1}{2}\Omega^{\ast}_{j+1}X_{r}X^{\top}_{r}+\frac{1}{2}X_{r}X^{\top}_{r}\Omega^{\ast}_{j+1}-I_{p}\Big{]}\|_{F}\Big{)}.

By Lemma A.1, $\gamma^{\min}_{2,T,j}\geq\underline{\mu}/2>0$ with probability tending to one. We deduce

\sum^{m^{\ast}}_{j=1}{\mathbb{P}}\Big{(}\Big{\{}\|\widehat{\Omega}_{j+1}-\Omega^{\ast}_{j+1}\|_{F}\geq(\gamma^{\max}_{T,j})^{-1}\gamma^{\min}_{1,T,j}\|\Omega^{\ast}_{j+1}-\Omega^{\ast}_{j}\|_{F}/3\Big{\}}\cap C_{T}\Big{)}
\leq\;\sum^{m^{\ast}}_{j=1}{\mathbb{P}}\Big{(}\frac{2T\lambda_{2}+T\lambda_{1}\sqrt{p(p-1)}[(T^{\ast}_{j}+T^{\ast}_{j+1})/2-T^{\ast}_{j}]}{T^{\ast}_{j+1}-T^{\ast}_{j}}\geq(\gamma^{\max}_{T,j})^{-1}\gamma^{\min}_{1,T,j}\gamma^{\min}_{2,T,j}\|\Omega^{\ast}_{j+1}-\Omega^{\ast}_{j}\|_{F}/6\Big{)}
+\sum^{m^{\ast}}_{j=1}{\mathbb{P}}\Big{(}\|\frac{2}{T^{\ast}_{j+1}-T^{\ast}_{j}}\sum^{(T^{\ast}_{j}+T^{\ast}_{j+1})/2-1}_{r=T^{\ast}_{j}}\Big{[}\frac{1}{2}\Omega^{\ast}_{j+1}X_{r}X^{\top}_{r}+\frac{1}{2}X_{r}X^{\top}_{r}\Omega^{\ast}_{j+1}-I_{p}\Big{]}\|_{F}\geq(\gamma^{\max}_{T,j})^{-1}\gamma^{\min}_{1,T,j}\gamma^{\min}_{2,T,j}\|\Omega^{\ast}_{j+1}-\Omega^{\ast}_{j}\|_{F}/6\Big{)}
\leq\;\sum^{m^{\ast}}_{j=1}{\mathbb{P}}\Big{(}\frac{2T\lambda_{2}}{\mathcal{I}_{\min}\eta_{\min}}+\frac{T\lambda_{1}\sqrt{p(p-1)}}{\eta_{\min}}\geq(\gamma^{\max}_{T,j})^{-1}\gamma^{\min}_{1,T,j}\gamma^{\min}_{2,T,j}/6\Big{)}
+\sum^{m^{\ast}}_{j=1}{\mathbb{P}}\Big{(}\|\frac{1}{(T^{\ast}_{j+1}-T^{\ast}_{j})/2}\sum^{(T^{\ast}_{j}+T^{\ast}_{j+1})/2-1}_{r=T^{\ast}_{j}}\Big{[}\frac{1}{2}\Omega^{\ast}_{j+1}X_{r}X^{\top}_{r}+\frac{1}{2}X_{r}X^{\top}_{r}\Omega^{\ast}_{j+1}-I_{p}\Big{]}\|_{F}\geq\overline{\mu}^{-1}\underline{\mu}^{2}\eta_{\min}/48\Big{)}.

The first term tends to zero since $T\lambda_{2}/(\mathcal{I}_{\min}\eta_{\min})\rightarrow 0$ and $T\lambda_{1}p/\eta_{\min}\rightarrow 0$ by Assumption 3-(ii) and (iii). As for the second term, using $\|AB\|_{F}\leq\|A\|_{s}\|B\|_{F}$, note that

\|\frac{1}{(T^{\ast}_{j+1}-T^{\ast}_{j})/2}\sum^{(T^{\ast}_{j}+T^{\ast}_{j+1})/2-1}_{r=T^{\ast}_{j}}\Big{[}\frac{1}{2}\Omega^{\ast}_{j+1}X_{r}X^{\top}_{r}+\frac{1}{2}X_{r}X^{\top}_{r}\Omega^{\ast}_{j+1}-I_{p}\Big{]}\|_{F}
=\;\|\frac{1}{(T^{\ast}_{j+1}-T^{\ast}_{j})/2}\sum^{(T^{\ast}_{j}+T^{\ast}_{j+1})/2-1}_{r=T^{\ast}_{j}}\Big{[}\frac{1}{2}\Omega^{\ast}_{j+1}\Big{(}X_{r}X^{\top}_{r}-\Sigma^{\ast}_{j+1}\Big{)}+\frac{1}{2}\Big{(}X_{r}X^{\top}_{r}-\Sigma^{\ast}_{j+1}\Big{)}\Omega^{\ast}_{j+1}\Big{]}\|_{F}
\leq\;s^{\ast}_{\max}\,p\,\|\frac{1}{(T^{\ast}_{j+1}-T^{\ast}_{j})/2}\sum^{(T^{\ast}_{j}+T^{\ast}_{j+1})/2-1}_{r=T^{\ast}_{j}}\big{(}X_{r}X^{\top}_{r}-\Sigma^{\ast}_{j+1}\big{)}\|_{\max}.

Therefore, applying Lemma A.2, we deduce that for any $j$:

{\mathbb{P}}\Big{(}\|\frac{1}{(T^{\ast}_{j+1}-T^{\ast}_{j})/2}\sum^{(T^{\ast}_{j}+T^{\ast}_{j+1})/2-1}_{r=T^{\ast}_{j}}\Big{[}\frac{1}{2}\Omega^{\ast}_{j+1}X_{r}X^{\top}_{r}+\frac{1}{2}X_{r}X^{\top}_{r}\Omega^{\ast}_{j+1}-I_{p}\Big{]}\|_{F}\geq\overline{\mu}^{-1}\underline{\mu}^{2}\eta_{\min}/48\Big{)}\rightarrow 0,

since $(\eta_{\min}\mathcal{I}^{1/2}_{\min})^{-1}s^{\ast}_{\max}\,p\sqrt{\log(pT)}\rightarrow 0$. Hence, $\sum^{m^{\ast}}_{j=1}AC_{j,2}\rightarrow 0$.
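The chain of norm bounds used for $AC_{j,2}$ above, namely $\|\frac{1}{2}(\Omega M+M\Omega)\|_{F}\leq\|\Omega\|_{s}\|M\|_{F}\leq\|\Omega\|_{s}\,p\,\|M\|_{\max}$ for $p\times p$ matrices, can be sanity-checked numerically as follows (our notation assumptions: $\|\cdot\|_{s}$ is the spectral norm, $\|\cdot\|_{\max}$ the entrywise maximum norm, and $s^{\ast}_{\max}$ bounds $\|\Omega^{\ast}_{j}\|_{s}$):

import numpy as np

# Check of ||(Omega M + M Omega)/2||_F <= ||Omega||_s ||M||_F <= ||Omega||_s p ||M||_max.
rng = np.random.default_rng(2)
p = 6
Omega = rng.standard_normal((p, p)); Omega = (Omega + Omega.T) / 2
M = rng.standard_normal((p, p)); M = (M + M.T) / 2   # plays the role of X_r X_r^T - Sigma*

sym = 0.5 * (Omega @ M + M @ Omega)
spec = np.linalg.norm(Omega, 2)                      # spectral norm ||Omega||_s
assert np.linalg.norm(sym, "fro") <= spec * np.linalg.norm(M, "fro") + 1e-10
assert np.linalg.norm(M, "fro") <= p * np.abs(M).max() + 1e-10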

Let us now consider $\sum^{m^{\ast}}_{j=1}AC_{j,3}$. Applying the same reasoning used to show the convergence of the second summation on the right-hand side of (B.1), we get

\|\frac{1}{T^{\ast}_{j}-\widehat{T}_{j}}\sum^{T^{\ast}_{j}-1}_{r=\widehat{T}_{j}}\Big{[}\frac{1}{2}\Omega^{\ast}_{j}X_{r}X^{\top}_{r}+\frac{1}{2}X_{r}X^{\top}_{r}\Omega^{\ast}_{j}-I_{p}\Big{]}\|_{F}=O_{p}\Big{(}p\,s^{\ast}_{\max}\sqrt{\frac{\log(pT)}{T\delta_{T}}}\Big{)}=o_{p}(\eta_{\min}),

when $T^{\ast}_{j}-\widehat{T}_{j}\geq T\delta_{T}$, and

\sum^{m^{\ast}}_{j=1}AC_{j,3}\leq\sum^{m^{\ast}}_{j=1}{\mathbb{P}}(A^{+}_{T,j}\cap\{R_{Tj,3}\geq\frac{1}{3}R_{Tj,1}\})
\leq\;\sum^{m^{\ast}}_{j=1}{\mathbb{P}}\Big{(}A^{+}_{T,j}\cap\Big{\{}\|\frac{1}{T^{\ast}_{j}-\widehat{T}_{j}}\sum^{T^{\ast}_{j}-1}_{r=\widehat{T}_{j}}\Big{[}\frac{1}{2}\Omega^{\ast}_{j}X_{r}X^{\top}_{r}+\frac{1}{2}X_{r}X^{\top}_{r}\Omega^{\ast}_{j}-I_{p}\Big{]}\|_{F}\geq\frac{1}{3}\|\frac{1}{T^{\ast}_{j}-\widehat{T}_{j}}\sum^{T^{\ast}_{j}-1}_{r=\widehat{T}_{j}}\Lambda(X_{r}X^{\top}_{r})\text{vec}(\Omega^{\ast}_{j+1}-\Omega^{\ast}_{j})\|_{2}\Big{\}}\Big{)}
\leq\;\sum^{m^{\ast}}_{j=1}{\mathbb{P}}\Big{(}A^{+}_{T,j}\cap\Big{\{}\|\frac{1}{T^{\ast}_{j}-\widehat{T}_{j}}\sum^{T^{\ast}_{j}-1}_{r=\widehat{T}_{j}}\Big{[}\frac{1}{2}\Omega^{\ast}_{j}X_{r}X^{\top}_{r}+\frac{1}{2}X_{r}X^{\top}_{r}\Omega^{\ast}_{j}-I_{p}\Big{]}\|_{F}\geq\frac{1}{3}\eta_{\min}\gamma^{\min}_{1,T,j}\Big{\}}\Big{)},

since $\gamma^{\min}_{1,T,j}\geq\underline{\mu}/2>0$ with probability tending to one and $T\delta_{T}\leq T^{\ast}_{j}-\widehat{T}_{j}$. Then, under $(\sqrt{T\delta_{T}}\eta_{\min})^{-1}s^{\ast}_{\max}\,p\sqrt{\log(pT)}\rightarrow 0$, we deduce $\sum^{m^{\ast}}_{j=1}AC_{j,3}\rightarrow 0$. Consequently, we proved $\sum^{m^{\ast}}_{j=1}{\mathbb{P}}(A_{T,j}\cap C_{T})\rightarrow 0$.

Proof of (b). We prove (b) by showing $\sum^{m^{\ast}}_{j=1}{\mathbb{P}}(A^{+}_{T,j}\cap C^{c}_{T})\rightarrow 0$ and $\sum^{m^{\ast}}_{j=1}{\mathbb{P}}(A^{-}_{T,j}\cap C^{c}_{T})\rightarrow 0$. As in the proof of (a), we simply show $\sum^{m^{\ast}}_{j=1}{\mathbb{P}}(A^{+}_{T,j}\cap C^{c}_{T})\rightarrow 0$. To do so, we define:

D^{(l)}_{T}\coloneqq\big{\{}\exists j\in\{1,\ldots,m^{\ast}\},\widehat{T}_{j}\leq T^{\ast}_{j-1}\big{\}}\cap C^{c}_{T},
D^{(m)}_{T}\coloneqq\big{\{}\forall j\in\{1,\ldots,m^{\ast}\},T^{\ast}_{j-1}<\widehat{T}_{j}<T^{\ast}_{j+1}\big{\}}\cap C^{c}_{T},
D^{(r)}_{T}\coloneqq\big{\{}\exists j\in\{1,\ldots,m^{\ast}\},\widehat{T}_{j}\geq T^{\ast}_{j+1}\big{\}}\cap C^{c}_{T},

where $C^{c}_{T}=\{\max_{1\leq j\leq m^{\ast}}|\widehat{T}_{j}-T^{\ast}_{j}|\geq\mathcal{I}_{\min}/2\}$. Then, we have:

\sum^{m^{\ast}}_{j=1}{\mathbb{P}}(A^{+}_{T,j}\cap C^{c}_{T})=\sum^{m^{\ast}}_{j=1}\Big{[}{\mathbb{P}}(A^{+}_{T,j}\cap D^{(l)}_{T})+{\mathbb{P}}(A^{+}_{T,j}\cap D^{(m)}_{T})+{\mathbb{P}}(A^{+}_{T,j}\cap D^{(r)}_{T})\Big{]}.

We first bound $\sum^{m^{\ast}}_{j=1}{\mathbb{P}}(A^{+}_{T,j}\cap D^{(m)}_{T})$. For any $j$:

{\mathbb{P}}(A^{+}_{T,j}\cap D^{(m)}_{T})
\leq\;{\mathbb{P}}(A^{+}_{T,j}\cap\big{\{}\widehat{T}_{j+1}-T^{\ast}_{j}\geq\frac{1}{2}\mathcal{I}_{\min}\big{\}}\cap D^{(m)}_{T})+{\mathbb{P}}(A^{+}_{T,j}\cap\big{\{}\widehat{T}_{j+1}-T^{\ast}_{j}<\frac{1}{2}\mathcal{I}_{\min}\big{\}}\cap D^{(m)}_{T})
\leq\;{\mathbb{P}}(A^{+}_{T,j}\cap\big{\{}\widehat{T}_{j+1}-T^{\ast}_{j}\geq\frac{1}{2}\mathcal{I}_{\min}\big{\}}\cap D^{(m)}_{T})+{\mathbb{P}}(A^{+}_{T,j}\cap\big{\{}T^{\ast}_{j+1}-\widehat{T}_{j+1}\geq\frac{1}{2}\mathcal{I}_{\min}\big{\}}\cap D^{(m)}_{T}),

since $0\leq\widehat{T}_{j+1}-T^{\ast}_{j}\leq\mathcal{I}_{\min}/2$ implies $T^{\ast}_{j+1}-\widehat{T}_{j+1}=(T^{\ast}_{j+1}-T^{\ast}_{j})-(\widehat{T}_{j+1}-T^{\ast}_{j})\geq\mathcal{I}_{\min}-\mathcal{I}_{\min}/2=\mathcal{I}_{\min}/2$. Moreover, since

\Big{\{}A^{+}_{T,j}\cap\big{\{}T^{\ast}_{j+1}-\widehat{T}_{j+1}\geq\frac{1}{2}\mathcal{I}_{\min}\big{\}}\cap D^{(m)}_{T}\Big{\}}\subset\bigcup^{m^{\ast}-1}_{k=j+1}\Big{[}\big{\{}T^{\ast}_{k}-\widehat{T}_{k}\geq\mathcal{I}_{\min}/2\big{\}}\cap\big{\{}\widehat{T}_{k+1}-T^{\ast}_{k}\geq\mathcal{I}_{\min}/2\big{\}}\cap D^{(m)}_{T}\Big{]},

we deduce:

\sum^{m^{\ast}}_{j=1}{\mathbb{P}}(A^{+}_{T,j}\cap D^{(m)}_{T})\leq\sum^{m^{\ast}}_{j=1}{\mathbb{P}}(A^{+}_{T,j}\cap\big{\{}\widehat{T}_{j+1}-T^{\ast}_{j}\geq\frac{1}{2}\mathcal{I}_{\min}\big{\}}\cap D^{(m)}_{T}) (B.4)
+\sum^{m^{\ast}}_{j=1}\sum^{m^{\ast}-1}_{k=j+1}{\mathbb{P}}\Big{(}\big{\{}T^{\ast}_{k}-\widehat{T}_{k}\geq\mathcal{I}_{\min}/2\big{\}}\cap\big{\{}\widehat{T}_{k+1}-T^{\ast}_{k}\geq\mathcal{I}_{\min}/2\big{\}}\cap D^{(m)}_{T}\Big{)}.

Let us treat the first term. By Lemma A.3 with $t=\widehat{T}_{j}$ and $t=T^{\ast}_{j}$, we obtain:

\frac{1}{T}\sum^{T}_{r=\widehat{T}_{j}}\Big{[}\Lambda(X_{r}X^{\top}_{r})\text{vec}(\Theta^{\ast}_{r}+\widehat{\Theta}_{r}-\Theta^{\ast}_{r})-\text{vec}(I_{p})\Big{]}+\lambda_{1}\text{vec}\big{(}\sum^{T}_{r=\widehat{T}_{j}}\widehat{E}_{1r}\big{)}=\lambda_{2}\text{vec}\big{(}\frac{\widehat{\Gamma}_{\widehat{T}_{j}}}{\|\widehat{\Gamma}_{\widehat{T}_{j}}\|_{F}}\big{)},

and

\|\frac{1}{T}\sum^{T}_{r=T^{\ast}_{j}}\Big{[}\Lambda(X_{r}X^{\top}_{r})\text{vec}(\Theta^{\ast}_{r}+\widehat{\Theta}_{r}-\Theta^{\ast}_{r})-\text{vec}(I_{p})\Big{]}+\lambda_{1}\text{vec}\big{(}\sum^{T}_{r=T^{\ast}_{j}}\widehat{E}_{1r}\big{)}\|_{2}\leq 2\lambda_{2}.

We deduce

\gamma^{\min}_{1,T,j}\|\widehat{\Omega}_{j+1}-\Omega^{\ast}_{j}\|_{F}-\|\frac{1}{T^{\ast}_{j}-\widehat{T}_{j}}\sum^{T^{\ast}_{j}-1}_{r=\widehat{T}_{j}}\Big{[}\frac{1}{2}\Omega^{\ast}_{j}X_{r}X^{\top}_{r}+\frac{1}{2}X_{r}X^{\top}_{r}\Omega^{\ast}_{j}-I_{p}\Big{]}\|_{F}
\leq\;\frac{1}{T^{\ast}_{j}-\widehat{T}_{j}}\|\sum^{T^{\ast}_{j}-1}_{r=\widehat{T}_{j}}\Lambda(X_{r}X^{\top}_{r})\text{vec}(\widehat{\Omega}_{j+1}-\Omega^{\ast}_{j})+\sum^{T^{\ast}_{j}-1}_{r=\widehat{T}_{j}}\Big{[}\Lambda(X_{r}X^{\top}_{r})\text{vec}(\Omega^{\ast}_{j})-\text{vec}(I_{p})\Big{]}\|_{2}
\leq\;\frac{2\lambda_{2}T}{T^{\ast}_{j}-\widehat{T}_{j}}+\lambda_{1}T\sqrt{p(p-1)}.

As a consequence:

\|\widehat{\Omega}_{j+1}-\Omega^{\ast}_{j}\|_{F}\leq(\gamma^{\min}_{1,T,j})^{-1}\Big{[}\frac{2\lambda_{2}T}{T^{\ast}_{j}-\widehat{T}_{j}}+\lambda_{1}T\sqrt{p(p-1)}+\|\frac{1}{T^{\ast}_{j}-\widehat{T}_{j}}\sum^{T^{\ast}_{j}-1}_{r=\widehat{T}_{j}}\Big{[}\frac{1}{2}\Omega^{\ast}_{j}X_{r}X^{\top}_{r}+\frac{1}{2}X_{r}X^{\top}_{r}\Omega^{\ast}_{j}-I_{p}\Big{]}\|_{F}\Big{]}. (B.5)

In the same vein, applying Lemma A.3 with $t=\widehat{T}_{j+1}$ and $t=T^{\ast}_{j}$, we obtain:

\|\widehat{\Omega}_{j+1}-\Omega^{\ast}_{j+1}\|_{F}\leq\;(\gamma^{\min}_{3,T,j})^{-1}\Big{[}\frac{2\lambda_{2}T}{\widehat{T}_{j+1}-T^{\ast}_{j}}+\lambda_{1}T\sqrt{p(p-1)}+\|\frac{1}{\widehat{T}_{j+1}-T^{\ast}_{j}}\sum^{\widehat{T}_{j+1}-1}_{r=T^{\ast}_{j}}\Big{[}\frac{1}{2}\Omega^{\ast}_{j+1}X_{r}X^{\top}_{r}+\frac{1}{2}X_{r}X^{\top}_{r}\Omega^{\ast}_{j+1}-I_{p}\Big{]}\|_{F}\Big{]},

where $\gamma^{\min}_{3,T,j}=\lambda_{\min}(\frac{1}{\widehat{T}_{j+1}-T^{\ast}_{j}}\sum^{\widehat{T}_{j+1}-1}_{r=T^{\ast}_{j}}X_{r}X^{\top}_{r})\geq\underline{\mu}/2$ with probability tending to one. Define the event:

E_{T,j}\coloneqq\Big{\{}\|\Omega^{\ast}_{j+1}-\Omega^{\ast}_{j}\|_{F}\leq(\gamma^{\min}_{1,T,j})^{-1}\Big{[}\frac{2\lambda_{2}T}{T^{\ast}_{j}-\widehat{T}_{j}}+\lambda_{1}T\sqrt{p(p-1)}\Big{]}
+(\gamma^{\min}_{3,T,j})^{-1}\Big{[}\frac{2\lambda_{2}T}{\widehat{T}_{j+1}-T^{\ast}_{j}}+\lambda_{1}T\sqrt{p(p-1)}\Big{]}+(\gamma^{\min}_{1,T,j})^{-1}\|\frac{1}{T^{\ast}_{j}-\widehat{T}_{j}}\sum^{T^{\ast}_{j}-1}_{r=\widehat{T}_{j}}\Big{[}\frac{1}{2}\Omega^{\ast}_{j}X_{r}X^{\top}_{r}+\frac{1}{2}X_{r}X^{\top}_{r}\Omega^{\ast}_{j}-I_{p}\Big{]}\|_{F}
+(\gamma^{\min}_{3,T,j})^{-1}\|\frac{1}{\widehat{T}_{j+1}-T^{\ast}_{j}}\sum^{\widehat{T}_{j+1}-1}_{r=T^{\ast}_{j}}\Big{[}\frac{1}{2}\Omega^{\ast}_{j+1}X_{r}X^{\top}_{r}+\frac{1}{2}X_{r}X^{\top}_{r}\Omega^{\ast}_{j+1}-I_{p}\Big{]}\|_{F}\Big{\}}.

Therefore, by the triangle inequality, (B.5) and (B.1) imply that the event $E_{T,j}$ holds with probability one. Hence:

\sum^{m^{\ast}}_{j=1}{\mathbb{P}}(A^{+}_{T,j}\cap\big{\{}\widehat{T}_{j+1}-T^{\ast}_{j}\geq\frac{1}{2}\mathcal{I}_{\min}\big{\}}\cap D^{(m)}_{T}) (B.8)
=\;\sum^{m^{\ast}}_{j=1}{\mathbb{P}}(E_{T,j}\cap A^{+}_{T,j}\cap\big{\{}\widehat{T}_{j+1}-T^{\ast}_{j}\geq\frac{1}{2}\mathcal{I}_{\min}\big{\}}\cap D^{(m)}_{T})
\leq\;\sum^{m^{\ast}}_{j=1}{\mathbb{P}}(E_{T,j}\cap\big{\{}T^{\ast}_{j}-\widehat{T}_{j}>T\delta_{T}\big{\}}\cap\big{\{}\widehat{T}_{j+1}-T^{\ast}_{j}\geq\frac{1}{2}\mathcal{I}_{\min}\big{\}})
\leq\;\sum^{m^{\ast}}_{j=1}{\mathbb{P}}\Big{(}(\gamma^{\min}_{1,T,j})^{-1}\Big{[}\frac{2\lambda_{2}}{\delta_{T}}+\lambda_{1}T\sqrt{p(p-1)}\Big{]}+(\gamma^{\min}_{3,T,j})^{-1}\Big{[}\frac{4\lambda_{2}T}{\mathcal{I}_{\min}}+\lambda_{1}T\sqrt{p(p-1)}\Big{]}\geq\|\Omega^{\ast}_{j+1}-\Omega^{\ast}_{j}\|_{F}/3\Big{)}
+\sum^{m^{\ast}}_{j=1}{\mathbb{P}}\Big{(}\Big{\{}(\gamma^{\min}_{1,T,j})^{-1}\|\frac{1}{T^{\ast}_{j}-\widehat{T}_{j}}\sum^{T^{\ast}_{j}-1}_{r=\widehat{T}_{j}}\Big{[}\frac{1}{2}\Omega^{\ast}_{j}X_{r}X^{\top}_{r}+\frac{1}{2}X_{r}X^{\top}_{r}\Omega^{\ast}_{j}-I_{p}\Big{]}\|_{F}\geq\|\Omega^{\ast}_{j+1}-\Omega^{\ast}_{j}\|_{F}/3\Big{\}}\cap\big{\{}T^{\ast}_{j}-\widehat{T}_{j}>T\delta_{T}\big{\}}\Big{)}
+\sum^{m^{\ast}}_{j=1}{\mathbb{P}}\Big{(}\Big{\{}(\gamma^{\min}_{3,T,j})^{-1}\|\frac{1}{\widehat{T}_{j+1}-T^{\ast}_{j}}\sum^{\widehat{T}_{j+1}-1}_{r=T^{\ast}_{j}}\Big{[}\frac{1}{2}\Omega^{\ast}_{j+1}X_{r}X^{\top}_{r}+\frac{1}{2}X_{r}X^{\top}_{r}\Omega^{\ast}_{j+1}-I_{p}\Big{]}\|_{F}\geq\|\Omega^{\ast}_{j+1}-\Omega^{\ast}_{j}\|_{F}/3\Big{\}}\cap\big{\{}\widehat{T}_{j+1}-T^{\ast}_{j}\geq\mathcal{I}_{\min}/2\big{\}}\Big{)}.

The first term in (B.8) tends to zero under $\lambda_{2}/(\eta_{\min}\delta_{T})\rightarrow 0$, $\lambda_{2}T/(\mathcal{I}_{\min}\eta_{\min})\rightarrow 0$ and $\lambda_{1}Tp/\eta_{\min}\rightarrow 0$. Moreover, note that

\|\frac{1}{T^{\ast}_{j}-\widehat{T}_{j}}\sum^{T^{\ast}_{j}-1}_{r=\widehat{T}_{j}}\Big{[}\frac{1}{2}\Omega^{\ast}_{j}X_{r}X^{\top}_{r}+\frac{1}{2}X_{r}X^{\top}_{r}\Omega^{\ast}_{j}-I_{p}\Big{]}\|_{F}=O_{p}\Big{(}s^{\ast}_{\max}\,p\sqrt{\frac{\log(pT)}{T\delta_{T}}}\Big{)}=o_{p}(\eta_{\min}),

and

\|\frac{1}{\widehat{T}_{j+1}-T^{\ast}_{j}}\sum^{\widehat{T}_{j+1}-1}_{r=T^{\ast}_{j}}\Big{[}\frac{1}{2}\Omega^{\ast}_{j+1}X_{r}X^{\top}_{r}+\frac{1}{2}X_{r}X^{\top}_{r}\Omega^{\ast}_{j+1}-I_{p}\Big{]}\|_{F}=O_{p}\Big{(}s^{\ast}_{\max}\,p\sqrt{\frac{\log(pT)}{\mathcal{I}_{\min}}}\Big{)}=o_{p}(\eta_{\min}),

under Assumption 3-(ii)-(iii). In the same manner, we can show that the second term in (B.4) tends to zero.

We now consider $\sum^{m^{\ast}}_{j=1}{\mathbb{P}}(A^{+}_{T,j}\cap D^{(l)}_{T})$. The probability of the event $A^{+}_{T,j}\cap D^{(l)}_{T}$ is upper bounded by ${\mathbb{P}}(D^{(l)}_{T})$, which satisfies:

{\mathbb{P}}(D^{(l)}_{T})\leq\sum^{m^{\ast}}_{j=1}2^{j-1}{\mathbb{P}}(\max(l\in\{1,\ldots,m^{\ast}\}:\widehat{T}_{l}\leq T^{\ast}_{l-1})=j).

Now $\max(l\in\{1,\ldots,m^{\ast}\}:\widehat{T}_{l}\leq T^{\ast}_{l-1})=j$ implies $\widehat{T}_{j}\leq T^{\ast}_{j-1}$ and $\widehat{T}_{l+1}>T^{\ast}_{l}$ for any $j\leq l\leq m^{\ast}$, and:

\big{\{}\max(l\in\{1,\ldots,m^{\ast}\}:\widehat{T}_{l}\leq T^{\ast}_{l-1})=j\big{\}}\subset\bigcup^{m^{\ast}-1}_{k=j}\Big{(}\big{\{}T^{\ast}_{k}-\widehat{T}_{k}\geq\mathcal{I}_{\min}/2\big{\}}\cap\big{\{}\widehat{T}_{k+1}-T^{\ast}_{k}\geq\mathcal{I}_{\min}/2\big{\}}\Big{)}.

Therefore:

\sum^{m^{\ast}}_{j=1}{\mathbb{P}}(A^{+}_{T,j}\cap D^{(l)}_{T})
\leq\;m^{\ast}\sum^{m^{\ast}-1}_{j=1}2^{j-1}\sum^{m^{\ast}-1}_{k=j}{\mathbb{P}}\Big{(}\big{\{}T^{\ast}_{k}-\widehat{T}_{k}\geq\mathcal{I}_{\min}/2\big{\}}\cap\big{\{}\widehat{T}_{k+1}-T^{\ast}_{k}\geq\mathcal{I}_{\min}/2\big{\}}\Big{)}+m^{\ast}2^{m^{\ast}-1}{\mathbb{P}}(T^{\ast}_{m^{\ast}}-\widehat{T}_{m^{\ast}}\geq\mathcal{I}_{\min}/2).

First, we consider the second term of the right-hand side of (B.1). Let $j=m^{\ast}$ in (B.1); then $E_{T,m^{\ast}}$ holds with probability one. Therefore:

m^{\ast}2^{m^{\ast}-1}{\mathbb{P}}(T^{\ast}_{m^{\ast}}-\widehat{T}_{m^{\ast}}\geq\mathcal{I}_{\min}/2)=m^{\ast}2^{m^{\ast}-1}{\mathbb{P}}(E_{T,m^{\ast}}\cap\big{\{}T^{\ast}_{m^{\ast}}-\widehat{T}_{m^{\ast}}\geq\mathcal{I}_{\min}/2\big{\}})
\leq\;m^{\ast}2^{m^{\ast}-1}{\mathbb{P}}\Big{(}(\gamma^{\min}_{1,T,m^{\ast}})^{-1}\Big{[}\frac{2\lambda_{2}}{\delta_{T}}+\lambda_{1}T\sqrt{p(p-1)}\Big{]}+(\gamma^{\min}_{3,T,m^{\ast}})^{-1}\Big{[}\frac{4\lambda_{2}T}{\mathcal{I}_{\min}}+\lambda_{1}T\sqrt{p(p-1)}\Big{]}\geq\|\Omega^{\ast}_{m^{\ast}+1}-\Omega^{\ast}_{m^{\ast}}\|_{F}/3\Big{)}
+m^{\ast}2^{m^{\ast}-1}{\mathbb{P}}\Big{(}(\gamma^{\min}_{1,T,m^{\ast}})^{-1}\|\frac{1}{T^{\ast}_{m^{\ast}}-\widehat{T}_{m^{\ast}}}\sum^{T^{\ast}_{m^{\ast}}-1}_{r=\widehat{T}_{m^{\ast}}}\Big{[}\frac{1}{2}\Omega^{\ast}_{m^{\ast}}X_{r}X^{\top}_{r}+\frac{1}{2}X_{r}X^{\top}_{r}\Omega^{\ast}_{m^{\ast}}-I_{p}\Big{]}\|_{F}\geq\|\Omega^{\ast}_{m^{\ast}+1}-\Omega^{\ast}_{m^{\ast}}\|_{F}/3,\;T^{\ast}_{m^{\ast}}-\widehat{T}_{m^{\ast}}\geq\mathcal{I}_{\min}/2\Big{)}
+m^{\ast}2^{m^{\ast}-1}{\mathbb{P}}\Big{(}(\gamma^{\min}_{3,T,m^{\ast}})^{-1}\|\frac{1}{T-T^{\ast}_{m^{\ast}}}\sum^{T}_{r=T^{\ast}_{m^{\ast}}}\Big{[}\frac{1}{2}\Omega^{\ast}_{m^{\ast}+1}X_{r}X^{\top}_{r}+\frac{1}{2}X_{r}X^{\top}_{r}\Omega^{\ast}_{m^{\ast}+1}-I_{p}\Big{]}\|_{F}\geq\|\Omega^{\ast}_{m^{\ast}+1}-\Omega^{\ast}_{m^{\ast}}\|_{F}/3\Big{)}.

Since $m^{\ast}2^{m^{\ast}-1}=O(T\log(T))$, we have $\log(m^{\ast}2^{m^{\ast}-1})=O(\log(T^{1+\epsilon/2}))$. So, under the conditions $(\sqrt{T\delta_{T}}\eta_{\min})^{-1}s^{\ast}_{\max}\,p\sqrt{\log(pT)}\rightarrow 0$ and $(\mathcal{I}^{1/2}_{\min}\eta_{\min})^{-1}s^{\ast}_{\max}\,p\sqrt{\log(pT)}\rightarrow 0$, the right-hand side of the previous inequality converges to zero. As for the first term of (B.1), applying $j=k$ in (B.1):

m^{\ast}\sum^{m^{\ast}-1}_{j=1}2^{j-1}\sum^{m^{\ast}-1}_{k=j}{\mathbb{P}}\Big{(}\big{\{}T^{\ast}_{k}-\widehat{T}_{k}\geq\mathcal{I}_{\min}/2\big{\}}\cap\big{\{}\widehat{T}_{k+1}-T^{\ast}_{k}\geq\mathcal{I}_{\min}/2\big{\}}\Big{)}
\leq\;m^{\ast}2^{m^{\ast}-1}\sum^{m^{\ast}-1}_{k=1}{\mathbb{P}}\Big{(}E_{T,k}\cap\big{\{}T^{\ast}_{k}-\widehat{T}_{k}\geq\mathcal{I}_{\min}/2\big{\}}\cap\big{\{}\widehat{T}_{k+1}-T^{\ast}_{k}\geq\mathcal{I}_{\min}/2\big{\}}\Big{)}
\leq\;m^{\ast}2^{m^{\ast}-1}\sum^{m^{\ast}-1}_{k=1}\Big{\{}{\mathbb{P}}\Big{(}(\gamma^{\min}_{1,T,k})^{-1}\Big{[}\frac{2\lambda_{2}}{\delta_{T}}+\lambda_{1}T\sqrt{p(p-1)}\Big{]}+(\gamma^{\min}_{3,T,k})^{-1}\Big{[}\frac{4\lambda_{2}T}{\mathcal{I}_{\min}}+\lambda_{1}T\sqrt{p(p-1)}\Big{]}\geq\|\Omega^{\ast}_{k+1}-\Omega^{\ast}_{k}\|_{F}/3\Big{)}
+{\mathbb{P}}\Big{(}(\gamma^{\min}_{1,T,k})^{-1}\|\frac{1}{T^{\ast}_{k}-\widehat{T}_{k}}\sum^{T^{\ast}_{k}-1}_{r=\widehat{T}_{k}}\Big{[}\frac{1}{2}\Omega^{\ast}_{k}X_{r}X^{\top}_{r}+\frac{1}{2}X_{r}X^{\top}_{r}\Omega^{\ast}_{k}-I_{p}\Big{]}\|_{F}\geq\|\Omega^{\ast}_{k+1}-\Omega^{\ast}_{k}\|_{F}/3,\;T^{\ast}_{k}-\widehat{T}_{k}\geq\mathcal{I}_{\min}/2\Big{)}
+{\mathbb{P}}\Big{(}(\gamma^{\min}_{3,T,k})^{-1}\|\frac{1}{\widehat{T}_{k+1}-T^{\ast}_{k}}\sum^{\widehat{T}_{k+1}-1}_{r=T^{\ast}_{k}}\Big{[}\frac{1}{2}\Omega^{\ast}_{k+1}X_{r}X^{\top}_{r}+\frac{1}{2}X_{r}X^{\top}_{r}\Omega^{\ast}_{k+1}-I_{p}\Big{]}\|_{F}\geq\|\Omega^{\ast}_{k+1}-\Omega^{\ast}_{k}\|_{F}/3,\;\widehat{T}_{k+1}-T^{\ast}_{k}\geq\mathcal{I}_{\min}/2\Big{)}\Big{\}}.

The right-hand side of the last inequality converges to zero under the same conditions. Finally, in the same manner, we can prove that $\sum^{m^{\ast}}_{j=1}{\mathbb{P}}(A^{+}_{T,j}\cap D^{(r)}_{T})\rightarrow 0$.

Proof of point (ii). By point (i), for any $j=1,\ldots,m^{\ast}$, $|\widehat{T}_{j}-T^{\ast}_{j}|=O_{p}(T\delta_{T})$, which is $o_{p}(\mathcal{I}_{\min})$ under Assumption 3-(ii). Hence, either $(T^{\ast}_{j-1}+T^{\ast}_{j})/2<\widehat{T}_{j}<T^{\ast}_{j}$ or $T^{\ast}_{j}\leq\widehat{T}_{j}<(T^{\ast}_{j}+T^{\ast}_{j+1})/2$ is satisfied for any $j$. Fix $l\in\{1,\ldots,m^{\ast}\}$, assume $(T^{\ast}_{l-1}+T^{\ast}_{l})/2<\widehat{T}_{l}<T^{\ast}_{l}$, and consider two cases: (ii-a) $(T^{\ast}_{l}+T^{\ast}_{l+1})/2<\widehat{T}_{l+1}<T^{\ast}_{l+1}$ and (ii-b) $T^{\ast}_{l+1}\leq\widehat{T}_{l+1}$. In case (ii-a), by Lemma A.3 with change-points $t=\widehat{T}_{l}$ and $t=\widehat{T}_{l+1}$:

2\lambda_{2}\geq\;\|\frac{1}{T}\sum^{T}_{r=\widehat{T}_{l}}\Big{[}\Lambda(X_{r}X^{\top}_{r})\text{vec}(\widehat{\Theta}_{r})-\text{vec}(I_{p})\Big{]}-\frac{1}{T}\sum^{T}_{r=\widehat{T}_{l+1}}\Big{[}\Lambda(X_{r}X^{\top}_{r})\text{vec}(\widehat{\Theta}_{r})-\text{vec}(I_{p})\Big{]}+\text{vec}\big{(}\lambda_{1}\sum^{T}_{r=\widehat{T}_{l}}\widehat{E}_{1r}-\lambda_{1}\sum^{T}_{r=\widehat{T}_{l+1}}\widehat{E}_{1r}\big{)}\|_{2}
=\;\|\frac{1}{T}\sum^{\widehat{T}_{l+1}-1}_{r=\widehat{T}_{l}}\Big{[}\Lambda(X_{r}X^{\top}_{r})\text{vec}(\widehat{\Theta}_{r})-\text{vec}(I_{p})\Big{]}+\text{vec}\big{(}\lambda_{1}\sum^{\widehat{T}_{l+1}-1}_{r=\widehat{T}_{l}}\widehat{E}_{1r}\big{)}\|_{2}.

Therefore, we deduce

2\lambda_{2}+\lambda_{1}\sqrt{p(p-1)}(\widehat{T}_{l+1}-\widehat{T}_{l})
\geq\;\|\frac{1}{T}\sum^{T^{\ast}_{l}-1}_{r=\widehat{T}_{l}}\Big{[}\Lambda(X_{r}X^{\top}_{r})\text{vec}(\widehat{\Theta}_{r})-\text{vec}(I_{p})\Big{]}+\frac{1}{T}\sum^{\widehat{T}_{l+1}-1}_{r=T^{\ast}_{l}}\Big{[}\Lambda(X_{r}X^{\top}_{r})\text{vec}(\widehat{\Theta}_{r})-\text{vec}(I_{p})\Big{]}\|_{2}
=\;\|\frac{1}{T}\sum^{T^{\ast}_{l}-1}_{r=\widehat{T}_{l}}\Big{[}\Lambda(X_{r}X^{\top}_{r})\text{vec}(\widehat{\Omega}_{l+1}-\Omega^{\ast}_{l}+\Omega^{\ast}_{l})-\text{vec}(I_{p})\Big{]}+\frac{1}{T}\sum^{\widehat{T}_{l+1}-1}_{r=T^{\ast}_{l}}\Big{[}\Lambda(X_{r}X^{\top}_{r})\text{vec}(\widehat{\Omega}_{l+1}-\Omega^{\ast}_{l+1}+\Omega^{\ast}_{l+1})-\text{vec}(I_{p})\Big{]}\|_{2}
\geq\;\|\frac{1}{T}\sum^{\widehat{T}_{l+1}-1}_{r=T^{\ast}_{l}}\Big{[}\Lambda(X_{r}X^{\top}_{r})\text{vec}(\widehat{\Omega}_{l+1}-\Omega^{\ast}_{l+1}+\Omega^{\ast}_{l+1})-\text{vec}(I_{p})\Big{]}\|_{2}-\|\frac{1}{T}\sum^{T^{\ast}_{l}-1}_{r=\widehat{T}_{l}}\Big{[}\Lambda(X_{r}X^{\top}_{r})\text{vec}(\widehat{\Omega}_{l+1}-\Omega^{\ast}_{l}+\Omega^{\ast}_{l})-\text{vec}(I_{p})\Big{]}\|_{2}
\geq\;\frac{\widehat{T}_{l+1}-T^{\ast}_{l}}{T}\Big{\{}\|\frac{1}{\widehat{T}_{l+1}-T^{\ast}_{l}}\sum^{\widehat{T}_{l+1}-1}_{r=T^{\ast}_{l}}\Lambda(X_{r}X^{\top}_{r})\text{vec}(\widehat{\Omega}_{l+1}-\Omega^{\ast}_{l+1})\|_{2}-\|\frac{1}{\widehat{T}_{l+1}-T^{\ast}_{l}}\sum^{\widehat{T}_{l+1}-1}_{r=T^{\ast}_{l}}\Big{[}\frac{1}{2}\Omega^{\ast}_{l+1}X_{r}X^{\top}_{r}+\frac{1}{2}X_{r}X^{\top}_{r}\Omega^{\ast}_{l+1}-I_{p}\Big{]}\|_{F}\Big{\}}
-\frac{T^{\ast}_{l}-\widehat{T}_{l}}{T}\|\frac{1}{T^{\ast}_{l}-\widehat{T}_{l}}\sum^{T^{\ast}_{l}-1}_{r=\widehat{T}_{l}}\Big{[}\Lambda(X_{r}X^{\top}_{r})\text{vec}(\widehat{\Omega}_{l+1}-\Omega^{\ast}_{l}+\Omega^{\ast}_{l})-\text{vec}(I_{p})\Big{]}\|_{2}.

Therefore, using part (i) of Theorem 3.1, we obtain

2\lambda_{2}+\lambda_{1}\sqrt{p(p-1)}(\widehat{T}_{l+1}-\widehat{T}_{l})
\geq\;\frac{\widehat{T}_{l+1}-T^{\ast}_{l}}{T}\Big{\{}\gamma^{\min}_{T,l}\|\widehat{\Omega}_{l+1}-\Omega^{\ast}_{l+1}\|_{F}-\|\frac{1}{\widehat{T}_{l+1}-T^{\ast}_{l}}\sum^{\widehat{T}_{l+1}-1}_{r=T^{\ast}_{l}}\Big{[}\frac{1}{2}\Omega^{\ast}_{l+1}X_{r}X^{\top}_{r}+\frac{1}{2}X_{r}X^{\top}_{r}\Omega^{\ast}_{l+1}-I_{p}\Big{]}\|_{F}\Big{\}}-O_{p}\Big{(}\frac{T^{\ast}_{l}-\widehat{T}_{l}}{T}\Big{)},

where $\gamma^{\min}_{T,l}=\lambda_{\min}(\frac{1}{\widehat{T}_{l+1}-T^{\ast}_{l}}\sum^{\widehat{T}_{l+1}-1}_{r=T^{\ast}_{l}}X_{r}X^{\top}_{r})\geq\underline{\mu}/2$ with probability tending to one. We deduce

2\lambda_{2}+\lambda_{1}\sqrt{p(p-1)}(\widehat{T}_{l+1}-\widehat{T}_{l})
\geq\;\frac{\widehat{T}_{l+1}-T^{\ast}_{l}}{T}\Big{\{}\gamma^{\min}_{T,l}\|\widehat{\Omega}_{l+1}-\Omega^{\ast}_{l+1}\|_{F}-O_{p}\Big{(}s^{\ast}_{\max}\,p\sqrt{\frac{\log(pT)}{I^{\ast}_{l+1}}}\Big{)}\Big{\}}-O_{p}\Big{(}\frac{T^{\ast}_{l}-\widehat{T}_{l}}{T}\Big{)}.

As a consequence, dividing by $(\widehat{T}_{l+1}-T^{\ast}_{l})\gamma^{\min}_{T,l}/T$ and using $\widehat{T}_{l+1}-T^{\ast}_{l}\geq I^{\ast}_{l+1}/2$, $\widehat{T}_{l+1}-\widehat{T}_{l}\leq I^{\ast}_{l+1}+O_{p}(T\delta_{T})$ and $T^{\ast}_{l}-\widehat{T}_{l}=O_{p}(T\delta_{T})$, it can be deduced that

\|\widehat{\Omega}_{l+1}-\Omega^{\ast}_{l+1}\|_{F}=O_{p}\Big{(}\frac{\lambda_{2}T}{I^{\ast}_{l+1}}+\lambda_{1}Tp\Big{(}1+\frac{T\delta_{T}}{I^{\ast}_{l+1}}\Big{)}+\frac{T\delta_{T}}{I^{\ast}_{l+1}}+s^{\ast}_{\max}\,p\sqrt{\frac{\log(pT)}{I^{\ast}_{l+1}}}\Big{)}. (B.10)

In case (ii-b), by Lemma A.3 with change-points $t=\widehat{T}_{l}$ and $t=\widehat{T}_{l+1}$, we have

2\lambda_{2}+\lambda_{1}\sqrt{p(p-1)}(\widehat{T}_{l+1}-\widehat{T}_{l})\geq\|\frac{1}{T}\sum^{\widehat{T}_{l+1}-1}_{r=\widehat{T}_{l}}\Big{[}\Lambda(X_{r}X^{\top}_{r})\text{vec}(\widehat{\Theta}_{r})-\text{vec}(I_{p})\Big{]}\|_{2}
=\;\|\frac{1}{T}\sum^{T^{\ast}_{l}-1}_{r=\widehat{T}_{l}}\Big{[}\Lambda(X_{r}X^{\top}_{r})\text{vec}(\widehat{\Theta}_{r})-\text{vec}(I_{p})\Big{]}+\frac{1}{T}\sum^{T^{\ast}_{l+1}-1}_{r=T^{\ast}_{l}}\Big{[}\Lambda(X_{r}X^{\top}_{r})\text{vec}(\widehat{\Theta}_{r})-\text{vec}(I_{p})\Big{]}+\frac{1}{T}\sum^{\widehat{T}_{l+1}-1}_{r=T^{\ast}_{l+1}}\Big{[}\Lambda(X_{r}X^{\top}_{r})\text{vec}(\widehat{\Theta}_{r})-\text{vec}(I_{p})\Big{]}\|_{2}
=\;\|\frac{1}{T}\sum^{T^{\ast}_{l}-1}_{r=\widehat{T}_{l}}\Big{[}\Lambda(X_{r}X^{\top}_{r})\text{vec}(\widehat{\Omega}_{l+1}-\Omega^{\ast}_{l}+\Omega^{\ast}_{l})-\text{vec}(I_{p})\Big{]}+\frac{1}{T}\sum^{T^{\ast}_{l+1}-1}_{r=T^{\ast}_{l}}\Big{[}\Lambda(X_{r}X^{\top}_{r})\text{vec}(\widehat{\Omega}_{l+1}-\Omega^{\ast}_{l+1}+\Omega^{\ast}_{l+1})-\text{vec}(I_{p})\Big{]}+\frac{1}{T}\sum^{\widehat{T}_{l+1}-1}_{r=T^{\ast}_{l+1}}\Big{[}\Lambda(X_{r}X^{\top}_{r})\text{vec}(\widehat{\Omega}_{l+1}-\Omega^{\ast}_{l+2}+\Omega^{\ast}_{l+2})-\text{vec}(I_{p})\Big{]}\|_{2}
\geq\;\|\frac{1}{T}\sum^{T^{\ast}_{l+1}-1}_{r=T^{\ast}_{l}}\Big{[}\Lambda(X_{r}X^{\top}_{r})\text{vec}(\widehat{\Omega}_{l+1}-\Omega^{\ast}_{l+1}+\Omega^{\ast}_{l+1})-\text{vec}(I_{p})\Big{]}\|_{2}-\|\frac{1}{T}\sum^{T^{\ast}_{l}-1}_{r=\widehat{T}_{l}}\Big{[}\Lambda(X_{r}X^{\top}_{r})\text{vec}(\widehat{\Omega}_{l+1}-\Omega^{\ast}_{l}+\Omega^{\ast}_{l})-\text{vec}(I_{p})\Big{]}\|_{2}
-\|\frac{1}{T}\sum^{\widehat{T}_{l+1}-1}_{r=T^{\ast}_{l+1}}\Big{[}\Lambda(X_{r}X^{\top}_{r})\text{vec}(\widehat{\Omega}_{l+1}-\Omega^{\ast}_{l+2}+\Omega^{\ast}_{l+2})-\text{vec}(I_{p})\Big{]}\|_{2}.

With $\overline{\gamma}^{\min}_{T,l}=\lambda_{\min}(\frac{1}{I^{\ast}_{l+1}}\sum^{T^{\ast}_{l+1}-1}_{r=T^{\ast}_{l}}X_{r}X^{\top}_{r})\geq\underline{\mu}/2$ with probability tending to one, we deduce

2\lambda_{2}+\lambda_{1}\sqrt{p(p-1)}(\widehat{T}_{l+1}-\widehat{T}_{l})
\geq\;\frac{I^{\ast}_{l+1}}{T}\Big{\{}\overline{\gamma}^{\min}_{T,l}\|\widehat{\Omega}_{l+1}-\Omega^{\ast}_{l+1}\|_{F}-O_{p}\Big{(}s^{\ast}_{\max}\,p\sqrt{\frac{\log(pT)}{I^{\ast}_{l+1}}}\Big{)}\Big{\}}-O_{p}\Big{(}\frac{T^{\ast}_{l}-\widehat{T}_{l}}{T}\Big{)}-O_{p}\Big{(}\frac{\widehat{T}_{l+1}-T^{\ast}_{l+1}}{T}\Big{)}.

Hence, (B.10) holds. Using similar arguments, we can show that the latter is satisfied when $T^{\ast}_{l}\leq\widehat{T}_{l}<(T^{\ast}_{l}+T^{\ast}_{l+1})/2$.

B.2 Proof of Theorem 3.2

Using the result of Theorem 3.1, we aim to show that:

{\mathbb{P}}\big{(}\{h(\widehat{{\mathcal{T}}}_{\widehat{m}},{\mathcal{T}}^{\ast}_{m^{\ast}})>T\delta_{T}\}\cap\{m^{\ast}<\widehat{m}\leq m_{\max}\}\big{)}\rightarrow 0\;\;\text{as}\;\;T\rightarrow\infty. (B.11)

To do so, we define:

L_{m,k,1}=\big{\{}\forall 1\leq l\leq m,\;|\widehat{T}_{l}-T^{\ast}_{k}|>T\delta_{T}\;\text{and}\;\widehat{T}_{l}<T^{\ast}_{k}\big{\}},
L_{m,k,2}=\big{\{}\forall 1\leq l\leq m,\;|\widehat{T}_{l}-T^{\ast}_{k}|>T\delta_{T}\;\text{and}\;\widehat{T}_{l}>T^{\ast}_{k}\big{\}},
L_{m,k,3}=\big{\{}\exists 1\leq l\leq m-1,\;|\widehat{T}_{l}-T^{\ast}_{k}|>T\delta_{T},\;|\widehat{T}_{l+1}-T^{\ast}_{k}|>T\delta_{T}\;\text{and}\;\widehat{T}_{l}<T^{\ast}_{k}<\widehat{T}_{l+1}\big{\}}.
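These three events describe the ways an undetected true change-point $T^{\ast}_{k}$ can sit relative to the estimated ones. For intuition, here is a small Python sketch of the detection criterion, under our reading that $h(\widehat{{\mathcal{T}}}_{m},{\mathcal{T}}^{\ast}_{m^{\ast}})$ is the one-sided distance $\max_{k}\min_{l}|\widehat{T}_{l}-T^{\ast}_{k}|$ (consistent with the union bound below; the exact definition of $h$ is given in the main text):

# Hypothetical illustration; h is read as max_k min_l |T_hat_l - T_star_k|.
def h_distance(T_hat, T_star):
    """Largest distance from a true change-point to its nearest estimate."""
    return max(min(abs(t_hat - t_star) for t_hat in T_hat) for t_star in T_star)

T_star = [100, 250, 400]            # hypothetical true change-points
T_hat = [98, 240, 260, 395]         # over-segmented estimate (m > m*)
print(h_distance(T_hat, T_star))    # -> 10: every true break has an estimate within 10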

The probability (B.11) can be bounded as:

{\mathbb{P}}\big{(}\{h(\widehat{{\mathcal{T}}}_{\widehat{m}},{\mathcal{T}}^{\ast}_{m^{\ast}})>T\delta_{T}\}\cap\{m^{\ast}<\widehat{m}\leq m_{\max}\}\big{)}\leq\sum^{m_{\max}}_{m=m^{\ast}+1}{\mathbb{P}}\big{(}h(\widehat{{\mathcal{T}}}_{m},{\mathcal{T}}^{\ast}_{m^{\ast}})>T\delta_{T}\big{)}
\leq\;\sum^{m_{\max}}_{m=m^{\ast}+1}\sum^{m^{\ast}}_{k=1}{\mathbb{P}}\big{(}\forall l\in\{1,\ldots,m\},\;|\widehat{T}_{l}-T^{\ast}_{k}|>T\delta_{T}\big{)}=\sum^{m_{\max}}_{m=m^{\ast}+1}\sum^{m^{\ast}}_{k=1}\Big{[}{\mathbb{P}}\big{(}L_{m,k,1}\big{)}+{\mathbb{P}}\big{(}L_{m,k,2}\big{)}+{\mathbb{P}}\big{(}L_{m,k,3}\big{)}\Big{]}.

We first focus on $\sum^{m_{\max}}_{m=m^{\ast}+1}\sum^{m^{\ast}}_{k=1}{\mathbb{P}}\big{(}L_{m,k,1}\big{)}$, which can be expressed as:

{\mathbb{P}}\big{(}L_{m,k,1}\big{)}={\mathbb{P}}\big{(}L_{m,k,1}\cap\{\widehat{T}_{m}>T^{\ast}_{k-1}\}\big{)}+{\mathbb{P}}\big{(}L_{m,k,1}\cap\{\widehat{T}_{m}\leq T^{\ast}_{k-1}\}\big{)}.

By Lemma A.3 with change-points $t=\widehat{T}_{m}$ and $t=T^{\ast}_{k}$, given the case $T^{\ast}_{k}\geq\widehat{T}_{m}>T^{\ast}_{k-1}$:

\frac{1}{T}\sum^{T}_{r=\widehat{T}_{m}}\Big{[}\Lambda(X_{r}X^{\top}_{r})\text{vec}(\widehat{\Theta}_{r})-\text{vec}(I_{p})\Big{]}+\text{vec}\big{(}\lambda_{1}\sum^{T}_{r=\widehat{T}_{m}}\widehat{E}_{1r}+\lambda_{2}\frac{\widehat{\Gamma}_{\widehat{T}_{m}}}{\|\widehat{\Gamma}_{\widehat{T}_{m}}\|_{F}}\big{)}=\mathbf{0}_{p^{2}\times 1},

and

\|\frac{1}{T}\sum^{T}_{r=T^{\ast}_{k}}\Big{[}\Lambda(X_{r}X^{\top}_{r})\text{vec}(\widehat{\Theta}_{r})-\text{vec}(I_{p})\Big{]}+\text{vec}\big{(}\lambda_{1}\sum^{T}_{r=T^{\ast}_{k}}\widehat{E}_{1r}\big{)}\|_{2}\leq\lambda_{2}.

Therefore, taking the differences, we get:

2\lambda_{2}+\lambda_{1}\sqrt{p(p-1)}(T^{\ast}_{k}-\widehat{T}_{m})\geq\|\frac{1}{T}\sum^{T^{\ast}_{k}-1}_{r=\widehat{T}_{m}}\Big{[}\Lambda(X_{r}X^{\top}_{r})\text{vec}(\widehat{\Theta}_{r})-\text{vec}(I_{p})\Big{]}\|_{2}
\geq\;\|\frac{1}{T}\sum^{T^{\ast}_{k}-1}_{r=\widehat{T}_{m}}\Lambda(X_{r}X^{\top}_{r})\text{vec}(\widehat{\Omega}_{m+1}-\Omega^{\ast}_{k+1})+\frac{1}{T}\sum^{T^{\ast}_{k}-1}_{r=\widehat{T}_{m}}\Lambda(X_{r}X^{\top}_{r})\text{vec}(\Omega^{\ast}_{k+1}-\Omega^{\ast}_{k})+\frac{1}{T}\sum^{T^{\ast}_{k}-1}_{r=\widehat{T}_{m}}\Big{[}\Lambda(X_{r}X^{\top}_{r})\text{vec}(\Omega^{\ast}_{k})-\text{vec}(I_{p})\Big{]}\|_{2}.

Therefore, the event ${\mathcal{B}}_{T}$ defined as

{\mathcal{B}}_{T}\coloneqq\Big{\{}\|\Omega^{\ast}_{k+1}-\Omega^{\ast}_{k}\|_{F}\leq(\gamma^{\min}_{4,T,m,k})^{-1}\Big{[}\frac{2\lambda_{2}T}{T^{\ast}_{k}-\widehat{T}_{m}}+\lambda_{1}T\sqrt{p(p-1)}
+\|\frac{1}{T^{\ast}_{k}-\widehat{T}_{m}}\sum^{T^{\ast}_{k}-1}_{r=\widehat{T}_{m}}\Lambda(X_{r}X^{\top}_{r})\text{vec}(\widehat{\Omega}_{m+1}-\Omega^{\ast}_{k+1})\|_{2}+\|\frac{1}{T^{\ast}_{k}-\widehat{T}_{m}}\sum^{T^{\ast}_{k}-1}_{r=\widehat{T}_{m}}\Big{[}\Lambda(X_{r}X^{\top}_{r})\text{vec}(\Omega^{\ast}_{k})-\text{vec}(I_{p})\Big{]}\|_{2}\Big{]}\Big{\}},

where $\gamma^{\min}_{4,T,m,k}=\lambda_{\min}(\frac{1}{T^{\ast}_{k}-\widehat{T}_{m}}\sum^{T^{\ast}_{k}-1}_{r=\widehat{T}_{m}}X_{r}X^{\top}_{r})\geq\underline{\mu}/2$ with probability tending to one, holds with probability one. Hence, we deduce $\sum^{m_{\max}}_{m=m^{\ast}+1}\sum^{m^{\ast}}_{k=1}{\mathbb{P}}\big{(}L_{m,k,1}\cap\{\widehat{T}_{m}>T^{\ast}_{k-1}\}\big{)}\leq M_{1,1}+M_{1,2}+M_{1,3}$, with

M_{1,1}\coloneqq\sum^{m_{\max}}_{m=m^{\ast}+1}\sum^{m^{\ast}}_{k=1}{\mathbb{P}}\big{(}\|\Omega^{\ast}_{k+1}-\Omega^{\ast}_{k}\|_{F}/3\leq(\gamma^{\min}_{4,T,m,k})^{-1}\big{[}2\lambda_{2}\delta^{-1}_{T}+\lambda_{1}T\sqrt{p(p-1)}\big{]}\big{)},

M_{1,2}\coloneqq\sum^{m_{\max}}_{m=m^{\ast}+1}\sum^{m^{\ast}}_{k=1}{\mathbb{P}}\big{(}T^{\ast}_{k}-\widehat{T}_{m}>T\delta_{T},\;\|\Omega^{\ast}_{k+1}-\Omega^{\ast}_{k}\|_{F}/3\leq(\gamma^{\min}_{4,T,m,k})^{-1}\|\frac{1}{T^{\ast}_{k}-\widehat{T}_{m}}\sum^{T^{\ast}_{k}-1}_{r=\widehat{T}_{m}}\Lambda(X_{r}X^{\top}_{r})\text{vec}(\widehat{\Omega}_{m+1}-\Omega^{\ast}_{k+1})\|_{2}\big{)},

M_{1,3}\coloneqq\sum^{m_{\max}}_{m=m^{\ast}+1}\sum^{m^{\ast}}_{k=1}{\mathbb{P}}\big{(}T^{\ast}_{k}-\widehat{T}_{m}>T\delta_{T},\;\|\Omega^{\ast}_{k+1}-\Omega^{\ast}_{k}\|_{F}/3\leq(\gamma^{\min}_{4,T,m,k})^{-1}\|\frac{1}{T^{\ast}_{k}-\widehat{T}_{m}}\sum^{T^{\ast}_{k}-1}_{r=\widehat{T}_{m}}\Big{[}\Lambda(X_{r}X^{\top}_{r})\text{vec}(\Omega^{\ast}_{k})-\text{vec}(I_{p})\Big{]}\|_{2}\big{)}.

In the same vein as in the analysis of (B.8), we can show that $M_{1,1},M_{1,3}\rightarrow 0$ as $T\rightarrow\infty$. $M_{1,2}$ requires more arguments. By Lemma A.3, with change-points $t=T^{\ast}_{k}$ and $t=T^{\ast}_{k+1}$:

\|\frac{1}{T}\sum^{T}_{r=T^{\ast}_{k}}\Big{[}\Lambda(X_{r}X^{\top}_{r})\text{vec}(\widehat{\Omega}_{m+1})-\text{vec}(I_{p})\Big{]}+\text{vec}\big{(}\lambda_{1}\sum^{T}_{r=T^{\ast}_{k}}\widehat{E}_{1r}\big{)}\|_{2}\leq\lambda_{2},

and

\|\frac{1}{T}\sum^{T}_{r=T^{\ast}_{k+1}}\Big{[}\Lambda(X_{r}X^{\top}_{r})\text{vec}(\widehat{\Omega}_{m+1})-\text{vec}(I_{p})\Big{]}+\text{vec}\big{(}\lambda_{1}\sum^{T}_{r=T^{\ast}_{k+1}}\widehat{E}_{1r}\big{)}\|_{2}\leq\lambda_{2}.

Therefore

2\lambda_{2}+\lambda_{1}\sqrt{p(p-1)}(T^{\ast}_{k+1}-T^{\ast}_{k})
\geq\;\|\frac{1}{T}\sum^{T^{\ast}_{k+1}-1}_{r=T^{\ast}_{k}}\Lambda(X_{r}X^{\top}_{r})\text{vec}(\widehat{\Omega}_{m+1}-\Omega^{\ast}_{k+1})+\frac{1}{T}\sum^{T^{\ast}_{k+1}-1}_{r=T^{\ast}_{k}}\Big{[}\Lambda(X_{r}X^{\top}_{r})\text{vec}(\Omega^{\ast}_{k+1})-\text{vec}(I_{p})\Big{]}\|_{2},

which implies

\|\widehat{\Omega}_{m+1}-\Omega^{\ast}_{k+1}\|_{F}\leq(\gamma^{\min}_{5,T,k})^{-1}\Big{[}\frac{2\lambda_{2}T}{T^{\ast}_{k+1}-T^{\ast}_{k}}+\lambda_{1}T\sqrt{p(p-1)}+\|\frac{1}{T^{\ast}_{k+1}-T^{\ast}_{k}}\sum^{T^{\ast}_{k+1}-1}_{r=T^{\ast}_{k}}\Big{[}\Lambda(X_{r}X^{\top}_{r})\text{vec}(\Omega^{\ast}_{k+1})-\text{vec}(I_{p})\Big{]}\|_{2}\Big{]},

with $\gamma^{\min}_{5,T,k}=\lambda_{\min}(\frac{1}{T^{\ast}_{k+1}-T^{\ast}_{k}}\sum^{T^{\ast}_{k+1}-1}_{r=T^{\ast}_{k}}X_{r}X^{\top}_{r})\geq\underline{\mu}/2$ with probability tending to one. We deduce

M_{1,2}\leq\sum^{m_{\max}}_{m=m^{\ast}+1}\sum^{m^{\ast}}_{k=1}{\mathbb{P}}\big{(}\|\Omega^{\ast}_{k+1}-\Omega^{\ast}_{k}\|_{F}/3\leq(\gamma^{\min}_{4,T,m,k})^{-1}\gamma^{\max}_{2,T,m,k}\|\widehat{\Omega}_{m+1}-\Omega^{\ast}_{k+1}\|_{F}\big{)}
\leq\;\sum^{m_{\max}}_{m=m^{\ast}+1}\sum^{m^{\ast}}_{k=1}\Big{[}{\mathbb{P}}\big{(}\gamma^{\min}_{4,T,m,k}(\gamma^{\max}_{2,T,m,k})^{-1}\|\Omega^{\ast}_{k+1}-\Omega^{\ast}_{k}\|_{F}/6\leq(\gamma^{\min}_{5,T,k})^{-1}\Big{[}\frac{2\lambda_{2}T}{\mathcal{I}_{\min}}+\lambda_{1}T\sqrt{p(p-1)}\Big{]}\big{)}
+{\mathbb{P}}\big{(}\gamma^{\min}_{4,T,m,k}(\gamma^{\max}_{2,T,m,k})^{-1}\|\Omega^{\ast}_{k+1}-\Omega^{\ast}_{k}\|_{F}/6\leq(\gamma^{\min}_{5,T,k})^{-1}\|\frac{1}{T^{\ast}_{k+1}-T^{\ast}_{k}}\sum^{T^{\ast}_{k+1}-1}_{r=T^{\ast}_{k}}\Big{[}\Lambda(X_{r}X^{\top}_{r})\text{vec}(\Omega^{\ast}_{k+1})-\text{vec}(I_{p})\Big{]}\|_{2}\big{)}\Big{]},

where $\gamma^{\max}_{2,T,m,k}=\lambda_{\max}(\frac{1}{T^{\ast}_{k}-\widehat{T}_{m}}\sum^{T^{\ast}_{k}-1}_{r=\widehat{T}_{m}}X_{r}X^{\top}_{r})\leq 2\overline{\mu}$ with probability tending to one. The first term in the second inequality of (B.2) tends to zero under the conditions $\lambda_{2}T/(\mathcal{I}_{\min}\eta_{\min})\rightarrow 0$ and $\lambda_{1}Tp/\eta_{\min}\rightarrow 0$; under $(\eta_{\min}\mathcal{I}^{1/2}_{\min})^{-1}s^{\ast}_{\max}\,p\sqrt{\log(pT)}\rightarrow 0$, the second term tends to zero. Therefore, we conclude $\sum^{m_{\max}}_{m=m^{\ast}+1}\sum^{m^{\ast}}_{k=1}{\mathbb{P}}\big{(}L_{m,k,1}\cap\{\widehat{T}_{m}>T^{\ast}_{k-1}\}\big{)}\rightarrow 0$ as $T\rightarrow\infty$. Based on similar arguments, we can show $\sum^{m_{\max}}_{m=m^{\ast}+1}\sum^{m^{\ast}}_{k=1}{\mathbb{P}}\big{(}L_{m,k,1}\cap\{\widehat{T}_{m}\leq T^{\ast}_{k-1}\}\big{)}\rightarrow 0$ as $T\rightarrow\infty$. Therefore, $\sum^{m_{\max}}_{m=m^{\ast}+1}\sum^{m^{\ast}}_{k=1}{\mathbb{P}}\big{(}L_{m,k,1}\big{)}\rightarrow 0$ as $T\rightarrow\infty$, and it can similarly be proved that $\sum^{m_{\max}}_{m=m^{\ast}+1}\sum^{m^{\ast}}_{k=1}{\mathbb{P}}\big{(}L_{m,k,2}\big{)}\rightarrow 0$ as $T\rightarrow\infty$.
We now consider $\sum^{m_{\max}}_{m=m^{\ast}+1}\sum^{m^{\ast}}_{k=1}{\mathbb{P}}\big{(}L_{m,k,3}\big{)}$. Define

L^{(1)}_{m,k,3}\coloneqq L_{m,k,3}\cap\{T^{\ast}_{k-1}<\widehat{T}_{l}<\widehat{T}_{l+1}<T^{\ast}_{k+1}\},\qquad L^{(2)}_{m,k,3}\coloneqq L_{m,k,3}\cap\{T^{\ast}_{k-1}<\widehat{T}_{l}<T^{\ast}_{k+1},\;\widehat{T}_{l+1}\geq T^{\ast}_{k+1}\},
L^{(3)}_{m,k,3}\coloneqq L_{m,k,3}\cap\{\widehat{T}_{l}\leq T^{\ast}_{k-1},\;T^{\ast}_{k-1}<\widehat{T}_{l+1}<T^{\ast}_{k+1}\},\qquad L^{(4)}_{m,k,3}\coloneqq L_{m,k,3}\cap\{\widehat{T}_{l}\leq T^{\ast}_{k-1},\;T^{\ast}_{k+1}<\widehat{T}_{l+1}\}.

First, we consider $L^{(1)}_{m,k,3}$. By Lemma A.3, for the change-points $t=T^{\ast}_{k}$ and $t=\widehat{T}_{l}$, we obtain

2\lambda_{2}+\lambda_{1}\sqrt{p(p-1)}(T^{\ast}_{k}-\widehat{T}_{l})
\geq\;\|\frac{1}{T}\sum^{T^{\ast}_{k}-1}_{r=\widehat{T}_{l}}\Lambda(X_{r}X^{\top}_{r})\text{vec}(\widehat{\Omega}_{l+1}-\Omega^{\ast}_{k})+\frac{1}{T}\sum^{T^{\ast}_{k}-1}_{r=\widehat{T}_{l}}\Big{[}\Lambda(X_{r}X^{\top}_{r})\text{vec}(\Omega^{\ast}_{k})-\text{vec}(I_{p})\Big{]}\|_{2},

and for the change-points $t=T^{\ast}_{k}$ and $t=\widehat{T}_{l+1}$, we get

2\lambda_{2}+\lambda_{1}\sqrt{p(p-1)}(\widehat{T}_{l+1}-T^{\ast}_{k})
\geq\;\|\frac{1}{T}\sum^{\widehat{T}_{l+1}-1}_{r=T^{\ast}_{k}}\Lambda(X_{r}X^{\top}_{r})\text{vec}(\widehat{\Omega}_{l+1}-\Omega^{\ast}_{k+1})+\frac{1}{T}\sum^{\widehat{T}_{l+1}-1}_{r=T^{\ast}_{k}}\Big{[}\Lambda(X_{r}X^{\top}_{r})\text{vec}(\Omega^{\ast}_{k+1})-\text{vec}(I_{p})\Big{]}\|_{2}.

Moreover, by the triangle inequality, we have

\|\Omega^{\ast}_{k+1}-\Omega^{\ast}_{k}\|_{F}\leq\|\widehat{\Omega}_{l+1}-\Omega^{\ast}_{k}\|_{F}+\|\widehat{\Omega}_{l+1}-\Omega^{\ast}_{k+1}\|_{F}
\leq\;(\gamma^{\min}_{6,T,k,l})^{-1}\Big{[}\frac{2\lambda_{2}T}{T^{\ast}_{k}-\widehat{T}_{l}}+\lambda_{1}T\sqrt{p(p-1)}+\|\frac{1}{T^{\ast}_{k}-\widehat{T}_{l}}\sum^{T^{\ast}_{k}-1}_{r=\widehat{T}_{l}}\Big{[}\Lambda(X_{r}X^{\top}_{r})\text{vec}(\Omega^{\ast}_{k})-\text{vec}(I_{p})\Big{]}\|_{2}\Big{]}
+(\gamma^{\min}_{7,T,k,l})^{-1}\Big{[}\frac{2\lambda_{2}T}{\widehat{T}_{l+1}-T^{\ast}_{k}}+\lambda_{1}T\sqrt{p(p-1)}+\|\frac{1}{\widehat{T}_{l+1}-T^{\ast}_{k}}\sum^{\widehat{T}_{l+1}-1}_{r=T^{\ast}_{k}}\Big{[}\Lambda(X_{r}X^{\top}_{r})\text{vec}(\Omega^{\ast}_{k+1})-\text{vec}(I_{p})\Big{]}\|_{2}\Big{]},

with $\gamma^{\min}_{6,T,k,l}=\lambda_{\min}(\frac{1}{T^{\ast}_{k}-\widehat{T}_{l}}\sum^{T^{\ast}_{k}-1}_{r=\widehat{T}_{l}}X_{r}X^{\top}_{r})\geq\underline{\mu}/2$ and $\gamma^{\min}_{7,T,k,l}=\lambda_{\min}(\frac{1}{\widehat{T}_{l+1}-T^{\ast}_{k}}\sum^{\widehat{T}_{l+1}-1}_{r=T^{\ast}_{k}}X_{r}X^{\top}_{r})\geq\underline{\mu}/2$ with probability tending to one. So $\sum^{m_{\max}}_{m=m^{\ast}+1}\sum^{m^{\ast}}_{k=1}{\mathbb{P}}\big{(}L^{(1)}_{m,k,3}\big{)}$ is upper bounded as follows:

\sum^{m_{\max}}_{m=m^{\ast}+1}\sum^{m^{\ast}}_{k=1}{\mathbb{P}}\big{(}L^{(1)}_{m,k,3}\big{)}
\leq\;\sum^{m_{\max}}_{m=m^{\ast}+1}\sum^{m^{\ast}}_{k=1}{\mathbb{P}}\big{(}\|\Omega^{\ast}_{k+1}-\Omega^{\ast}_{k}\|_{F}/3\leq\big{(}(\gamma^{\min}_{6,T,k,l})^{-1}+(\gamma^{\min}_{7,T,k,l})^{-1}\big{)}\big{[}2\lambda_{2}\delta^{-1}_{T}+\lambda_{1}T\sqrt{p(p-1)}\big{]}\big{)}
+\sum^{m_{\max}}_{m=m^{\ast}+1}\sum^{m^{\ast}}_{k=1}{\mathbb{P}}\big{(}\|\Omega^{\ast}_{k+1}-\Omega^{\ast}_{k}\|_{F}/3\leq(\gamma^{\min}_{6,T,k,l})^{-1}\|\frac{1}{T^{\ast}_{k}-\widehat{T}_{l}}\sum^{T^{\ast}_{k}-1}_{r=\widehat{T}_{l}}\Big{[}\Lambda(X_{r}X^{\top}_{r})\text{vec}(\Omega^{\ast}_{k})-\text{vec}(I_{p})\Big{]}\|_{2},\;T^{\ast}_{k}-\widehat{T}_{l}\geq T\delta_{T}\big{)}
+\sum^{m_{\max}}_{m=m^{\ast}+1}\sum^{m^{\ast}}_{k=1}{\mathbb{P}}\big{(}\|\Omega^{\ast}_{k+1}-\Omega^{\ast}_{k}\|_{F}/3\leq(\gamma^{\min}_{7,T,k,l})^{-1}\|\frac{1}{\widehat{T}_{l+1}-T^{\ast}_{k}}\sum^{\widehat{T}_{l+1}-1}_{r=T^{\ast}_{k}}\Big{[}\Lambda(X_{r}X^{\top}_{r})\text{vec}(\Omega^{\ast}_{k+1})-\text{vec}(I_{p})\Big{]}\|_{2},\;\widehat{T}_{l+1}-T^{\ast}_{k}\geq T\delta_{T}\big{)},

which tends to zero in the same spirit as (B.8). For $L^{(2)}_{m,k,3}$, by Lemma A.3 with change-points $t=T^{\ast}_{k}$ and $t=\widehat{T}_{l}$ (to obtain the analogue of (B.2)), and with change-points $t=T^{\ast}_{k}$ and $t=T^{\ast}_{k+1}$, we get

2\lambda_{2}+\lambda_{1}\sqrt{p(p-1)}(T^{\ast}_{k+1}-T^{\ast}_{k})
\geq\;\|\frac{1}{T}\sum^{T^{\ast}_{k+1}-1}_{r=T^{\ast}_{k}}\Lambda(X_{r}X^{\top}_{r})\text{vec}(\widehat{\Omega}_{l+1}-\Omega^{\ast}_{k+1})+\frac{1}{T}\sum^{T^{\ast}_{k+1}-1}_{r=T^{\ast}_{k}}\Big{[}\Lambda(X_{r}X^{\top}_{r})\text{vec}(\Omega^{\ast}_{k+1})-\text{vec}(I_{p})\Big{]}\|_{2}.

By the triangle inequality, we have

\|\Omega^{\ast}_{k+1}-\Omega^{\ast}_{k}\|_{F}\leq\|\widehat{\Omega}_{l+1}-\Omega^{\ast}_{k}\|_{F}+\|\widehat{\Omega}_{l+1}-\Omega^{\ast}_{k+1}\|_{F}
\leq\;(\gamma^{\min}_{6,T,k,l})^{-1}\Big{[}\frac{2\lambda_{2}T}{T^{\ast}_{k}-\widehat{T}_{l}}+\lambda_{1}T\sqrt{p(p-1)}+\|\frac{1}{T^{\ast}_{k}-\widehat{T}_{l}}\sum^{T^{\ast}_{k}-1}_{r=\widehat{T}_{l}}\Big{[}\Lambda(X_{r}X^{\top}_{r})\text{vec}(\Omega^{\ast}_{k})-\text{vec}(I_{p})\Big{]}\|_{2}\Big{]}
+(\gamma^{\min}_{8,T,k})^{-1}\Big{[}\frac{2\lambda_{2}T}{T^{\ast}_{k+1}-T^{\ast}_{k}}+\lambda_{1}T\sqrt{p(p-1)}+\|\frac{1}{T^{\ast}_{k+1}-T^{\ast}_{k}}\sum^{T^{\ast}_{k+1}-1}_{r=T^{\ast}_{k}}\Big{[}\Lambda(X_{r}X^{\top}_{r})\text{vec}(\Omega^{\ast}_{k+1})-\text{vec}(I_{p})\Big{]}\|_{2}\Big{]},

with $\gamma^{\min}_{8,T,k}=\lambda_{\min}(\frac{1}{T^{\ast}_{k+1}-T^{\ast}_{k}}\sum^{T^{\ast}_{k+1}-1}_{r=T^{\ast}_{k}}X_{r}X^{\top}_{r})\geq\underline{\mu}/2$ with probability tending to one. Therefore, we obtain

\begin{align*}
&\sum^{m_{\max}}_{m=m^{\ast}+1}\sum^{m^{\ast}}_{k=1}\mathbb{P}\big(L^{(2)}_{m,k,3}\big)\\
\leq{}&\sum^{m_{\max}}_{m=m^{\ast}+1}\sum^{m^{\ast}}_{k=1}\mathbb{P}\Big(\|\Omega^{\ast}_{k+1}-\Omega^{\ast}_{k}\|_{F}/3\leq(\gamma^{\min}_{6,T,k,l})^{-1}\big[2\lambda_{2}\delta^{-1}_{T}+\lambda_{1}T\sqrt{p(p-1)}\big]+(\gamma^{\min}_{8,T,k})^{-1}\big[\tfrac{2\lambda_{2}T}{\mathcal{I}_{\min}}+\lambda_{1}T\sqrt{p(p-1)}\big]\Big)\\
&+\sum^{m_{\max}}_{m=m^{\ast}+1}\sum^{m^{\ast}}_{k=1}\mathbb{P}\Big(\|\Omega^{\ast}_{k+1}-\Omega^{\ast}_{k}\|_{F}/3\leq(\gamma^{\min}_{6,T,k,l})^{-1}\Big\|\frac{1}{T^{\ast}_{k}-\widehat{T}_{l}}\sum^{T^{\ast}_{k}-1}_{r=\widehat{T}_{l}}\Big[\Lambda(X_{r}X^{\top}_{r})\text{vec}(\Omega^{\ast}_{k})-\text{vec}(I_{p})\Big]\Big\|_{2},\,T^{\ast}_{k}-\widehat{T}_{l}\geq T\delta_{T}\Big)\\
&+\sum^{m_{\max}}_{m=m^{\ast}+1}\sum^{m^{\ast}}_{k=1}\mathbb{P}\Big(\|\Omega^{\ast}_{k+1}-\Omega^{\ast}_{k}\|_{F}/3\leq(\gamma^{\min}_{8,T,k})^{-1}\Big\|\frac{1}{T^{\ast}_{k+1}-T^{\ast}_{k}}\sum^{T^{\ast}_{k+1}-1}_{r=T^{\ast}_{k}}\Big[\Lambda(X_{r}X^{\top}_{r})\text{vec}(\Omega^{\ast}_{k+1})-\text{vec}(I_{p})\Big]\Big\|_{2}\Big),
\end{align*}

which tends to zero based on arguments similar to those used for the convergence of (B.8). For $L^{(3)}_{m,k,3}$, by Lemma A.3 with change-points $t=T^{\ast}_{k-1}$ and $t=T^{\ast}_{k}$, we have

\begin{align*}
&2\lambda_{2}+\lambda_{1}\sqrt{p(p-1)}(T^{\ast}_{k}-T^{\ast}_{k-1})\\
\geq{}&\Big\|\frac{1}{T}\sum^{T^{\ast}_{k}-1}_{r=T^{\ast}_{k-1}}\Lambda(X_{r}X^{\top}_{r})\text{vec}(\widehat{\Omega}_{l+1}-\Omega^{\ast}_{k})+\frac{1}{T}\sum^{T^{\ast}_{k}-1}_{r=T^{\ast}_{k-1}}\Big[\Lambda(X_{r}X^{\top}_{r})\text{vec}(\Omega^{\ast}_{k})-\text{vec}(I_{p})\Big]\Big\|_{2},
\end{align*}

and with change-points $t=T^{\ast}_{k}$, $t=\widehat{T}_{l+1}$, we get

\begin{align*}
&2\lambda_{2}+\lambda_{1}\sqrt{p(p-1)}(\widehat{T}_{l+1}-T^{\ast}_{k})\\
\geq{}&\Big\|\frac{1}{T}\sum^{\widehat{T}_{l+1}-1}_{r=T^{\ast}_{k}}\Lambda(X_{r}X^{\top}_{r})\text{vec}(\widehat{\Omega}_{l+1}-\Omega^{\ast}_{k+1})+\frac{1}{T}\sum^{\widehat{T}_{l+1}-1}_{r=T^{\ast}_{k}}\Big[\Lambda(X_{r}X^{\top}_{r})\text{vec}(\Omega^{\ast}_{k+1})-\text{vec}(I_{p})\Big]\Big\|_{2}.
\end{align*}

By the triangle inequality, we deduce

\begin{align*}
\|\Omega^{\ast}_{k+1}-\Omega^{\ast}_{k}\|_{F}\leq{}&\|\widehat{\Omega}_{l+1}-\Omega^{\ast}_{k}\|_{F}+\|\widehat{\Omega}_{l+1}-\Omega^{\ast}_{k+1}\|_{F}\\
\leq{}&(\gamma^{\min}_{9,T,k})^{-1}\Big[\frac{2\lambda_{2}T}{T^{\ast}_{k}-T^{\ast}_{k-1}}+\lambda_{1}T\sqrt{p(p-1)}+\Big\|\frac{1}{T^{\ast}_{k}-T^{\ast}_{k-1}}\sum^{T^{\ast}_{k}-1}_{r=T^{\ast}_{k-1}}\Big[\Lambda(X_{r}X^{\top}_{r})\text{vec}(\Omega^{\ast}_{k})-\text{vec}(I_{p})\Big]\Big\|_{2}\Big]\\
&+(\gamma^{\min}_{10,T,k,l})^{-1}\Big[\frac{2\lambda_{2}T}{\widehat{T}_{l+1}-T^{\ast}_{k}}+\lambda_{1}T\sqrt{p(p-1)}+\Big\|\frac{1}{\widehat{T}_{l+1}-T^{\ast}_{k}}\sum^{\widehat{T}_{l+1}-1}_{r=T^{\ast}_{k}}\Big[\Lambda(X_{r}X^{\top}_{r})\text{vec}(\Omega^{\ast}_{k+1})-\text{vec}(I_{p})\Big]\Big\|_{2}\Big],
\end{align*}

where $\gamma^{\min}_{9,T,k}=\lambda_{\min}\big(\frac{1}{T^{\ast}_{k}-T^{\ast}_{k-1}}\sum^{T^{\ast}_{k}-1}_{r=T^{\ast}_{k-1}}X_{r}X^{\top}_{r}\big)\geq\underline{\mu}/2$ and $\gamma^{\min}_{10,T,k,l}=\lambda_{\min}\big(\frac{1}{\widehat{T}_{l+1}-T^{\ast}_{k}}\sum^{\widehat{T}_{l+1}-1}_{r=T^{\ast}_{k}}X_{r}X^{\top}_{r}\big)\geq\underline{\mu}/2$, each with probability tending to one. We deduce

\begin{align*}
&\sum^{m_{\max}}_{m=m^{\ast}+1}\sum^{m^{\ast}}_{k=1}\mathbb{P}\big(L^{(3)}_{m,k,3}\big)\\
\leq{}&\sum^{m_{\max}}_{m=m^{\ast}+1}\sum^{m^{\ast}}_{k=1}\mathbb{P}\Big(\|\Omega^{\ast}_{k+1}-\Omega^{\ast}_{k}\|_{F}/3\leq(\gamma^{\min}_{9,T,k})^{-1}\big[\tfrac{2\lambda_{2}T}{\mathcal{I}_{\min}}+\lambda_{1}T\sqrt{p(p-1)}\big]+(\gamma^{\min}_{10,T,k,l})^{-1}\big[2\lambda_{2}\delta^{-1}_{T}+\lambda_{1}T\sqrt{p(p-1)}\big]\Big)\\
&+\sum^{m_{\max}}_{m=m^{\ast}+1}\sum^{m^{\ast}}_{k=1}\mathbb{P}\Big(\|\Omega^{\ast}_{k+1}-\Omega^{\ast}_{k}\|_{F}/3\leq(\gamma^{\min}_{9,T,k})^{-1}\Big\|\frac{1}{T^{\ast}_{k}-T^{\ast}_{k-1}}\sum^{T^{\ast}_{k}-1}_{r=T^{\ast}_{k-1}}\Big[\Lambda(X_{r}X^{\top}_{r})\text{vec}(\Omega^{\ast}_{k})-\text{vec}(I_{p})\Big]\Big\|_{2}\Big)\\
&+\sum^{m_{\max}}_{m=m^{\ast}+1}\sum^{m^{\ast}}_{k=1}\mathbb{P}\Big(\|\Omega^{\ast}_{k+1}-\Omega^{\ast}_{k}\|_{F}/3\leq(\gamma^{\min}_{10,T,k,l})^{-1}\Big\|\frac{1}{\widehat{T}_{l+1}-T^{\ast}_{k}}\sum^{\widehat{T}_{l+1}-1}_{r=T^{\ast}_{k}}\Big[\Lambda(X_{r}X^{\top}_{r})\text{vec}(\Omega^{\ast}_{k+1})-\text{vec}(I_{p})\Big]\Big\|_{2},\,\widehat{T}_{l+1}-T^{\ast}_{k}\geq T\delta_{T}\Big),
\end{align*}

which tends to zero based on the same arguments as in the convergence of (B.8). Finally, to analyze $L^{(4)}_{m,k,3}$, we apply Lemma A.3 with $t=T^{\ast}_{k-1}$, $t=T^{\ast}_{k}$ and with $t=T^{\ast}_{k}$, $t=T^{\ast}_{k+1}$, obtaining (B.2) in both cases. By the triangle inequality, we have

\begin{align*}
\|\Omega^{\ast}_{k+1}-\Omega^{\ast}_{k}\|_{F}\leq{}&\|\widehat{\Omega}_{l+1}-\Omega^{\ast}_{k}\|_{F}+\|\widehat{\Omega}_{l+1}-\Omega^{\ast}_{k+1}\|_{F}\\
\leq{}&(\gamma^{\min}_{9,T,k})^{-1}\Big[\frac{2\lambda_{2}T}{T^{\ast}_{k}-T^{\ast}_{k-1}}+\lambda_{1}T\sqrt{p(p-1)}+\Big\|\frac{1}{T^{\ast}_{k}-T^{\ast}_{k-1}}\sum^{T^{\ast}_{k}-1}_{r=T^{\ast}_{k-1}}\Big[\Lambda(X_{r}X^{\top}_{r})\text{vec}(\Omega^{\ast}_{k})-\text{vec}(I_{p})\Big]\Big\|_{2}\Big]\\
&+(\gamma^{\min}_{8,T,k})^{-1}\Big[\frac{2\lambda_{2}T}{T^{\ast}_{k+1}-T^{\ast}_{k}}+\lambda_{1}T\sqrt{p(p-1)}+\Big\|\frac{1}{T^{\ast}_{k+1}-T^{\ast}_{k}}\sum^{T^{\ast}_{k+1}-1}_{r=T^{\ast}_{k}}\Big[\Lambda(X_{r}X^{\top}_{r})\text{vec}(\Omega^{\ast}_{k+1})-\text{vec}(I_{p})\Big]\Big\|_{2}\Big],
\end{align*}

from which we deduce

\begin{align*}
&\sum^{m_{\max}}_{m=m^{\ast}+1}\sum^{m^{\ast}}_{k=1}\mathbb{P}\big(L^{(4)}_{m,k,3}\big)\\
\leq{}&\sum^{m_{\max}}_{m=m^{\ast}+1}\sum^{m^{\ast}}_{k=1}\mathbb{P}\Big(\|\Omega^{\ast}_{k+1}-\Omega^{\ast}_{k}\|_{F}/3\leq\big((\gamma^{\min}_{9,T,k})^{-1}+(\gamma^{\min}_{8,T,k})^{-1}\big)\big[\tfrac{2\lambda_{2}T}{\mathcal{I}_{\min}}+\lambda_{1}T\sqrt{p(p-1)}\big]\Big)\\
&+\sum^{m_{\max}}_{m=m^{\ast}+1}\sum^{m^{\ast}}_{k=1}\mathbb{P}\Big(\|\Omega^{\ast}_{k+1}-\Omega^{\ast}_{k}\|_{F}/3\leq(\gamma^{\min}_{9,T,k})^{-1}\Big\|\frac{1}{T^{\ast}_{k}-T^{\ast}_{k-1}}\sum^{T^{\ast}_{k}-1}_{r=T^{\ast}_{k-1}}\Big[\Lambda(X_{r}X^{\top}_{r})\text{vec}(\Omega^{\ast}_{k})-\text{vec}(I_{p})\Big]\Big\|_{2}\Big)\\
&+\sum^{m_{\max}}_{m=m^{\ast}+1}\sum^{m^{\ast}}_{k=1}\mathbb{P}\Big(\|\Omega^{\ast}_{k+1}-\Omega^{\ast}_{k}\|_{F}/3\leq(\gamma^{\min}_{8,T,k})^{-1}\Big\|\frac{1}{T^{\ast}_{k+1}-T^{\ast}_{k}}\sum^{T^{\ast}_{k+1}-1}_{r=T^{\ast}_{k}}\Big[\Lambda(X_{r}X^{\top}_{r})\text{vec}(\Omega^{\ast}_{k+1})-\text{vec}(I_{p})\Big]\Big\|_{2}\Big),
\end{align*}

which also tends to zero, as in the proof of the convergence to zero of (B.8). We conclude that ${\mathbb{P}}\big(\{h(\widehat{\mathcal{T}}_{\widehat{m}},\mathcal{T}^{\ast}_{m^{\ast}})>T\delta_{T}\}\cap\{m^{\ast}<\widehat{m}\leq m_{\max}\}\big)\rightarrow 0$ as $T\rightarrow\infty$.

B.3 Proof of Proposition 4.1

Proof of point (i).
We can rewrite (4.1) as the following constrained optimization problem:

\begin{align}
\min_{\mathbf{X}}\quad&\sum_{t=1}^{T}\left[\frac{1}{2}\mathrm{tr}(U_{t}^{\top}U_{t})-\mathrm{tr}(\Theta_{t})+\delta_{\cdot\succeq\epsilon I_{p}}(\Theta_{t})\right]+\lambda_{1}T\sum_{t=1}^{T}\|\Upsilon_{t,\mathrm{off}}\|_{1,\text{off}}+\lambda_{2}T\sum_{t=1}^{T-1}\|D_{t}\|_{F}\tag{B.18}\\
\text{s.t.}\quad&U_{t}=(X_{t}X_{t}^{\top})^{\frac{1}{2}}\Theta_{t},\ \Upsilon_{t,\mathrm{off}}=\Theta_{t,\mathrm{off}}\quad\forall t=1,\dots,T;\nonumber\\
&D_{t}=\Theta_{t+1}-\Theta_{t}\quad\forall t=1,\dots,T-1,\nonumber
\end{align}

where we write $\mathbf{X}=\left\{\{\Theta_{t}\}_{t=1}^{T},\{U_{t}\}_{t=1}^{T},\{\Upsilon_{t,\mathrm{off}}\}_{t=1}^{T},\{D_{t}\}_{t=1}^{T-1}\right\}$ for short; $\delta_{\cdot\succeq\epsilon I_{p}}(\cdot)$ is the indicator function of the set $\{S\,:\,S\succeq\epsilon I_{p}\}$; $\Upsilon_{t,\mathrm{off}}$ is a matrix whose diagonal elements are $0$; $\Theta_{t,\mathrm{off}}$ is the copy of $\Theta_{t}$ with the diagonal elements set to $0$.
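To fix ideas, here is a minimal numerical sketch (illustrative only, not the authors' code) of the objective of (B.18): the equality constraints are eliminated by substitution, using $\mathrm{tr}(U_{t}^{\top}U_{t})=\mathrm{tr}(\Theta_{t}X_{t}X_{t}^{\top}\Theta_{t})$ for symmetric $\Theta_{t}$, and the positive-definiteness indicator is assumed inactive at the evaluation point. All function names and toy data are hypothetical.

```python
import numpy as np

def primal_objective(Thetas, Xs, lam1, lam2):
    """Evaluate the objective of (B.18) at symmetric {Theta_t}, with the
    equality constraints substituted in (indicator term assumed feasible)."""
    T = len(Thetas)
    loss = 0.0
    for Theta, x in zip(Thetas, Xs):
        S = np.outer(x, x)  # X_t X_t^T
        # (1/2) tr(Theta X_t X_t^T Theta) - tr(Theta): the D-trace part.
        loss += 0.5 * np.trace(Theta @ S @ Theta) - np.trace(Theta)
    # lambda_1 * T * sum_t (off-diagonal l1 norm of Theta_t).
    l1_off = sum(np.abs(Th).sum() - np.abs(np.diag(Th)).sum() for Th in Thetas)
    # lambda_2 * T * sum_t ||Theta_{t+1} - Theta_t||_F (fused penalty).
    fused = sum(np.linalg.norm(Thetas[t + 1] - Thetas[t], "fro")
                for t in range(T - 1))
    return loss + lam1 * T * l1_off + lam2 * T * fused

# Toy usage: T = 5 observations of dimension p = 3, identity estimates.
rng = np.random.default_rng(0)
Xs = [rng.standard_normal(3) for _ in range(5)]
Thetas = [np.eye(3) for _ in range(5)]
print(primal_objective(Thetas, Xs, lam1=0.1, lam2=0.1))
```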

Denote the dual variables by $\mathbf{Y}=\{\{W_{t}\}_{t=1}^{T},\{Y_{t,\mathrm{off}}\}_{t=1}^{T},\{Z_{t}\}_{t=1}^{T-1}\}$ for simplicity, where $W_{t}\in\mathbb{R}^{p\times p}$, $Y_{t,\mathrm{off}}\in\mathcal{S}_{\mathrm{off}}^{p}$, and $Z_{t}\in\mathcal{S}^{p}$ for all $t$. The Lagrangian function of (B.18) is

\begin{align*}
L(\mathbf{X};\mathbf{Y})={}&\sum_{t=1}^{T}\left[\frac{1}{2}\mathrm{tr}(U_{t}^{\top}U_{t})-\mathrm{tr}(\Theta_{t})+\delta_{\cdot\succeq\epsilon I_{p}}(\Theta_{t})\right]+\lambda_{1}T\sum_{t=1}^{T}\|\Upsilon_{t,\mathrm{off}}\|_{1,\text{off}}\\
&+\lambda_{2}T\sum_{t=1}^{T-1}\|D_{t}\|_{F}-\sum_{t=1}^{T}\left\langle W_{t},U_{t}-(X_{t}X_{t}^{\top})^{\frac{1}{2}}\Theta_{t}\right\rangle\\
&-\sum_{t=1}^{T}\left\langle Y_{t,\mathrm{off}},\Theta_{t,\mathrm{off}}-\Upsilon_{t,\mathrm{off}}\right\rangle-\sum_{t=1}^{T-1}\left\langle Z_{t},\Theta_{t+1}-\Theta_{t}-D_{t}\right\rangle\\
={}&\sum_{t=1}^{T}\left[\frac{1}{2}\mathrm{tr}(U_{t}^{\top}U_{t})-\left\langle W_{t},U_{t}\right\rangle\right]+\sum_{t=1}^{T}\left[\lambda_{1}T\|\Upsilon_{t,\mathrm{off}}\|_{1,\text{off}}+\left\langle Y_{t,\mathrm{off}},\Upsilon_{t,\mathrm{off}}\right\rangle\right]\\
&+\sum_{t=1}^{T-1}\left[\lambda_{2}T\|D_{t}\|_{F}+\left\langle Z_{t},D_{t}\right\rangle\right]\\
&+\sum_{t=1}^{T}\left[\left\langle Z_{t}-Z_{t-1}-I_{p}+\operatorname{Sym}\big((X_{t}X_{t}^{\top})^{\frac{1}{2}}W_{t}\big)-Y_{t,\mathrm{off}},\Theta_{t}\right\rangle+\delta_{\cdot\succeq\epsilon I_{p}}(\Theta_{t})\right],
\end{align*}

where for convenience, we set $Z_{0}=Z_{T}=\mathbf{0}_{p\times p}$; we further note that $\left\langle Y_{t,\mathrm{off}},\Theta_{t,\mathrm{off}}\right\rangle=\left\langle Y_{t,\mathrm{off}},\Theta_{t}\right\rangle$. Now, letting $\zeta_{t}=Z_{t}-Z_{t-1}-I_{p}+\operatorname{Sym}\big((X_{t}X_{t}^{\top})^{\frac{1}{2}}W_{t}\big)-Y_{t,\mathrm{off}}$, we have

\begin{align*}
\min_{U_{t}}\,\frac{1}{2}\mathrm{tr}(U_{t}^{\top}U_{t})-\left\langle W_{t},U_{t}\right\rangle&=-\frac{1}{2}\mathrm{tr}(W_{t}^{\top}W_{t});\\
\min_{\Upsilon_{t,\mathrm{off}}}\,\lambda_{1}T\|\Upsilon_{t,\mathrm{off}}\|_{1,\text{off}}+\left\langle Y_{t,\mathrm{off}},\Upsilon_{t,\mathrm{off}}\right\rangle&=\begin{cases}0&\text{if }|Y_{t,uv}|\leq\lambda_{1}T\ \ \forall u\neq v,\\ -\infty&\text{otherwise};\end{cases}\\
\min_{D_{t}}\,\lambda_{2}T\|D_{t}\|_{F}+\left\langle Z_{t},D_{t}\right\rangle&=\begin{cases}0&\text{if }\|Z_{t}\|_{F}\leq\lambda_{2}T,\\ -\infty&\text{otherwise};\end{cases}\\
\min_{\Theta_{t}}\,\left\langle\zeta_{t},\Theta_{t}\right\rangle+\delta_{\cdot\succeq\epsilon I_{p}}(\Theta_{t})&=\begin{cases}\epsilon\,\mathrm{tr}(\zeta_{t})&\text{if }\zeta_{t}\succeq 0,\\ -\infty&\text{otherwise}.\end{cases}
\end{align*}

Therefore, we obtain the dual problem as in (4.2). Finally, the equality of the optimal values follows from [32, Theorem 31.1] upon noting that there exists $\mathbf{X}$ with $\Theta_{t}\succ\epsilon I_{p}$ for all $t$ satisfying the equality constraints in (B.18).
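As a sanity check on the first and third partial minimizations above, the following short numerical sketch (illustrative only; it plays no role in the proof) verifies on random instances that the quadratic minimum equals $-\frac{1}{2}\mathrm{tr}(W_{t}^{\top}W_{t})$ and that the group-norm objective stays nonnegative whenever $\|Z_{t}\|_{F}\leq\lambda_{2}T$:

```python
import numpy as np

rng = np.random.default_rng(1)

# min_U (1/2) tr(U^T U) - <W, U> is attained at U = W, with optimal value
# -(1/2) tr(W^T W); any other U gives a larger value.
W = rng.standard_normal((4, 4))
val_at_W = 0.5 * np.trace(W.T @ W) - np.sum(W * W)
assert np.isclose(val_at_W, -0.5 * np.trace(W.T @ W))
for _ in range(100):
    U = rng.standard_normal((4, 4))
    assert 0.5 * np.trace(U.T @ U) - np.sum(W * U) >= val_at_W - 1e-10

# When ||Z||_F <= lam2*T, Cauchy-Schwarz gives
# lam2*T*||D||_F + <Z, D> >= (lam2*T - ||Z||_F) * ||D||_F >= 0,
# so the minimum over D is 0, attained at D = 0.
lam2T = 5.0
Z = rng.standard_normal((4, 4))
Z *= 0.9 * lam2T / np.linalg.norm(Z, "fro")  # enforce ||Z||_F < lam2*T
for _ in range(100):
    D = rng.standard_normal((4, 4))
    assert lam2T * np.linalg.norm(D, "fro") + np.sum(Z * D) >= -1e-10
```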

Proof of point (ii).
Since $\sum_{t=1}^{T}X_{t}X_{t}^{\top}\succ 0$, by Lemma A.4, the set $\mathcal{C}_{\lambda_{1}}$ has a Slater point $\left\{\{\overline{W}_{t}\}_{t=1}^{T},\{\overline{Y}_{t,\mathrm{off}}\}_{t=1}^{T},\{\overline{Z}_{t}\}_{t=1}^{T-1}\right\}$. Then the result follows directly with $\overline{\lambda}_{2}=1+\max_{1\leq t\leq T-1}\|\overline{Z}_{t}\|_{F}/T$. The existence of solutions to the primal problem comes from strong duality thanks to the strict feasibility of the dual problem; see, for example, [32, Theorem 31.1].
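To spell out why this choice of $\overline{\lambda}_{2}$ suffices (a short check, using the constraint $\|Z_{t}\|_{F}\leq\lambda_{2}T$ identified in the derivation of the dual problem above): for any $\lambda_{2}\geq\overline{\lambda}_{2}$ and every $t=1,\dots,T-1$,

\begin{align*}
\lambda_{2}T\;\geq\;\overline{\lambda}_{2}T\;=\;T+\max_{1\leq s\leq T-1}\|\overline{Z}_{s}\|_{F}\;>\;\|\overline{Z}_{t}\|_{F},
\end{align*}

so the Slater point of $\mathcal{C}_{\lambda_{1}}$ is also strictly feasible for the dual problem (4.2).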

B.4 Proof of Proposition 4.2

Proof of point (i).
We first rewrite (4.4) as the following constrained optimization problem:

\begin{align}
\min_{\mathbf{X}}\quad&\sum_{t=1}^{T}\left[\frac{1}{2}\mathrm{tr}(U_{t}^{\top}U_{t})-\mathrm{tr}(\Theta_{t})+\delta_{\cdot\succeq\epsilon I_{p}}(\Theta_{t})\right]+\lambda_{1}T\sum_{t=1}^{T}\|\Upsilon_{t,\mathrm{off}}\|_{1,\text{off}}+\lambda_{2}T\sum_{t=1}^{T-1}\mathcal{R}(\|D_{t}\|_{F};\lambda_{3})\tag{B.19}\\
\text{s.t.}\quad&U_{t}=(X_{t}X_{t}^{\top})^{\frac{1}{2}}\Theta_{t},\ \Upsilon_{t,\mathrm{off}}=\Theta_{t,\mathrm{off}}\quad\forall t=1,\dots,T;\nonumber\\
&D_{t}=\Theta_{t+1}-\Theta_{t}\quad\forall t=1,\dots,T-1,\nonumber
\end{align}

where we write $\mathbf{X}=\left\{\{\Theta_{t}\}_{t=1}^{T},\{U_{t}\}_{t=1}^{T},\{\Upsilon_{t,\mathrm{off}}\}_{t=1}^{T},\{D_{t}\}_{t=1}^{T-1}\right\}$ for short; $\Upsilon_{t,\mathrm{off}}$ is a matrix whose diagonal elements are $0$; $\Theta_{t,\mathrm{off}}$ is the copy of $\Theta_{t}$ with the diagonal elements set to $0$.

Denote the dual variables by $\mathbf{Y}=\{\{W_{t}\}_{t=1}^{T},\{Y_{t,\mathrm{off}}\}_{t=1}^{T},\{Z_{t}\}_{t=1}^{T-1}\}$ for simplicity, where $W_{t}\in\mathbb{R}^{p\times p}$, $Y_{t,\mathrm{off}}\in\mathcal{S}_{\mathrm{off}}^{p}$, and $Z_{t}\in\mathcal{S}^{p}$ for all $t$. The Lagrangian function of (B.19) is

\begin{align*}
L(\mathbf{X};\mathbf{Y})={}&\sum_{t=1}^{T}\left[\frac{1}{2}\mathrm{tr}(U_{t}^{\top}U_{t})-\mathrm{tr}(\Theta_{t})+\delta_{\cdot\succeq\epsilon I_{p}}(\Theta_{t})\right]+\lambda_{1}T\sum_{t=1}^{T}\|\Upsilon_{t,\mathrm{off}}\|_{1,\text{off}}\\
&+\lambda_{2}T\sum_{t=1}^{T-1}\mathcal{R}(\|D_{t}\|_{F};\lambda_{3})-\sum_{t=1}^{T}\left\langle W_{t},U_{t}-(X_{t}X_{t}^{\top})^{\frac{1}{2}}\Theta_{t}\right\rangle\\
&-\sum_{t=1}^{T}\left\langle Y_{t,\mathrm{off}},\Theta_{t,\mathrm{off}}-\Upsilon_{t,\mathrm{off}}\right\rangle-\sum_{t=1}^{T-1}\left\langle Z_{t},\Theta_{t+1}-\Theta_{t}-D_{t}\right\rangle\\
={}&\sum_{t=1}^{T}\left[\frac{1}{2}\mathrm{tr}(U_{t}^{\top}U_{t})-\left\langle W_{t},U_{t}\right\rangle\right]+\sum_{t=1}^{T}\left[\lambda_{1}T\|\Upsilon_{t,\mathrm{off}}\|_{1,\text{off}}+\left\langle Y_{t,\mathrm{off}},\Upsilon_{t,\mathrm{off}}\right\rangle\right]\\
&+\sum_{t=1}^{T-1}\left[\lambda_{2}T\mathcal{R}(\|D_{t}\|_{F};\lambda_{3})+\left\langle Z_{t},D_{t}\right\rangle\right]\\
&+\sum_{t=1}^{T}\left[\left\langle Z_{t}-Z_{t-1}-I_{p}+\operatorname{Sym}\big((X_{t}X_{t}^{\top})^{\frac{1}{2}}W_{t}\big)-Y_{t,\mathrm{off}},\Theta_{t}\right\rangle+\delta_{\cdot\succeq\epsilon I_{p}}(\Theta_{t})\right],
\end{align*}

where for convenience, we set $Z_{0}=Z_{T}=\mathbf{0}_{p\times p}$; we also note that $\left\langle Y_{t,\mathrm{off}},\Theta_{t,\mathrm{off}}\right\rangle=\left\langle Y_{t,\mathrm{off}},\Theta_{t}\right\rangle$. Now, letting $\zeta_{t}=Z_{t}-Z_{t-1}-I_{p}+\operatorname{Sym}\big((X_{t}X_{t}^{\top})^{\frac{1}{2}}W_{t}\big)-Y_{t,\mathrm{off}}$, we have

\begin{align}
\min_{U_{t}}\,\frac{1}{2}\mathrm{tr}(U_{t}^{\top}U_{t})-\left\langle W_{t},U_{t}\right\rangle&=-\frac{1}{2}\mathrm{tr}(W_{t}^{\top}W_{t});\tag{B.20}\\
\min_{\Upsilon_{t,\mathrm{off}}}\,\lambda_{1}T\|\Upsilon_{t,\mathrm{off}}\|_{1,\text{off}}+\left\langle Y_{t,\mathrm{off}},\Upsilon_{t,\mathrm{off}}\right\rangle&=\begin{cases}0&\text{if }|Y_{t,uv}|\leq\lambda_{1}T\ \ \forall u\neq v,\\ -\infty&\text{otherwise};\end{cases}\nonumber\\
\min_{\Theta_{t}}\,\left\langle\zeta_{t},\Theta_{t}\right\rangle+\delta_{\cdot\succeq\epsilon I_{p}}(\Theta_{t})&=\begin{cases}\epsilon\,\mathrm{tr}(\zeta_{t})&\text{if }\zeta_{t}\succeq 0,\\ -\infty&\text{otherwise}.\end{cases}\nonumber
\end{align}

For the problem $\min_{D_{t}}\lambda_{2}T\mathcal{R}(\|D_{t}\|_{F};\lambda_{3})+\left\langle Z_{t},D_{t}\right\rangle$ for each $t$, it holds that

\begin{align*}
&\min_{D_{t}}\,\lambda_{2}T\mathcal{R}(\|D_{t}\|_{F};\lambda_{3})+\left\langle Z_{t},D_{t}\right\rangle\\
={}&\min\Big\{\underbrace{\min_{\|D_{t}\|_{F}\leq\lambda_{3}}\lambda_{2}T\|D_{t}\|_{F}+\left\langle Z_{t},D_{t}\right\rangle}_{\text{(I)}},\ \underbrace{\min_{\|D_{t}\|_{F}\geq\lambda_{3}}\lambda_{2}T(\|D_{t}\|_{F}^{2}-\lambda_{3}^{2}+\lambda_{3})+\left\langle Z_{t},D_{t}\right\rangle}_{\text{(II)}}\Big\}.
\end{align*}

For (I), we have

\begin{align}
\min_{\|D_{t}\|_{F}\leq\lambda_{3}}\lambda_{2}T\|D_{t}\|_{F}+\left\langle Z_{t},D_{t}\right\rangle&=\min_{\substack{\|D_{t}\|_{F}\leq\lambda_{3};\\ D_{t}=-\alpha Z_{t},\,\alpha\geq 0}}\lambda_{2}T\|D_{t}\|_{F}-\|Z_{t}\|_{F}\|D_{t}\|_{F}\tag{B.21}\\
&=\min_{0\leq r\leq\lambda_{3}}(\lambda_{2}T-\|Z_{t}\|_{F})r=-\left(\|Z_{t}\|_{F}-\lambda_{2}T\right)_{+}\lambda_{3},\nonumber
\end{align}

where the first equality follows from the (equality case in) Cauchy–Schwarz inequality; $(\cdot)_{+}=\max\{\cdot,0\}$.

For (II), one can see that

\begin{align}
&\min_{\|D_{t}\|_{F}\geq\lambda_{3}}\lambda_{2}T(\|D_{t}\|_{F}^{2}-\lambda_{3}^{2}+\lambda_{3})+\left\langle Z_{t},D_{t}\right\rangle\nonumber\\
\overset{(\text{a})}{=}{}&\min_{r\geq\lambda_{3}}\lambda_{2}T(r^{2}-\lambda_{3}^{2}+\lambda_{3})-\|Z_{t}\|_{F}r\nonumber\\
={}&\min_{r\geq\lambda_{3}}\lambda_{2}T\left(\left(r-\frac{\|Z_{t}\|_{F}}{2\lambda_{2}T}\right)^{2}-\frac{\|Z_{t}\|_{F}^{2}}{4\lambda_{2}^{2}T^{2}}-\lambda_{3}^{2}+\lambda_{3}\right)\nonumber\\
\overset{(\text{b})}{=}{}&\lambda_{2}T\left(\left(\lambda_{3}-\frac{\|Z_{t}\|_{F}}{2\lambda_{2}T}\right)_{+}^{2}-\frac{\|Z_{t}\|_{F}^{2}}{4\lambda_{2}^{2}T^{2}}-\lambda_{3}^{2}+\lambda_{3}\right),\tag{B.22}
\end{align}

where (a) comes from the (equality case in) Cauchy–Schwarz inequality, and (b) holds since the minimum is attained at $r=\max\left\{\frac{\|Z_{t}\|_{F}}{2\lambda_{2}T},\lambda_{3}\right\}$, as one can see by first locating the vertex of the quadratic objective.
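The closed forms (B.21) and (B.22) can also be double-checked numerically: after the radial reduction $D_{t}=-rZ_{t}/\|Z_{t}\|_{F}$, both subproblems are one-dimensional in $r=\|D_{t}\|_{F}$, so a fine grid over $r$ should reproduce the stated minima. A minimal sketch, with illustrative values standing in for $\lambda_{2}T$, $\lambda_{3}$ and $\|Z_{t}\|_{F}$:

```python
import numpy as np

lam2T, lam3, z = 2.0, 0.8, 3.5  # stand-ins for lambda_2*T, lambda_3, ||Z_t||_F

# Branch (I): 0 <= r <= lam3, objective (lam2T - z) * r.
r1 = np.linspace(0.0, lam3, 100001)
b21_grid = np.min(lam2T * r1 - z * r1)
b21_closed = -max(z - lam2T, 0.0) * lam3
assert np.isclose(b21_grid, b21_closed, atol=1e-8)

# Branch (II): r >= lam3; the grid is truncated far beyond the vertex
# of the quadratic, located at r = z / (2 * lam2T).
r2 = np.linspace(lam3, 50.0, 200001)
b22_grid = np.min(lam2T * (r2**2 - lam3**2 + lam3) - z * r2)
b22_closed = lam2T * (max(lam3 - z / (2.0 * lam2T), 0.0) ** 2
                      - z**2 / (4.0 * lam2T**2) - lam3**2 + lam3)
assert np.isclose(b22_grid, b22_closed, atol=1e-6)
print(b21_closed, b22_closed)  # -1.2 and -1.21125 for these values
```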

Using (B.21) and (B.22), we have

\begin{align*}
&\min_{D_{t}}\,\lambda_{2}T\mathcal{R}(\|D_{t}\|_{F};\lambda_{3})+\left\langle Z_{t},D_{t}\right\rangle\\
={}&\min\left\{-\left(\|Z_{t}\|_{F}-\lambda_{2}T\right)_{+}\lambda_{3},\ \lambda_{2}T\left(\left(\lambda_{3}-\frac{\|Z_{t}\|_{F}}{2\lambda_{2}T}\right)_{+}^{2}-\frac{\|Z_{t}\|_{F}^{2}}{4\lambda_{2}^{2}T^{2}}-\lambda_{3}^{2}+\lambda_{3}\right)\right\}.
\end{align*}

The above display is exactly the definition of $\mathcal{G}(\|Z_{t}\|_{F};\lambda_{3})$. Using this and (B.20), we conclude that the dual problem of (4.4) is (4.5). Finally, the equality of optimal values follows from [32, Theorem 31.1] upon noting that there exists $\mathbf{X}$ with $\Theta_{t}\succ\epsilon I_{p}$ for all $t$ satisfying the equality constraints in (B.19).

Proof of point (ii).
Since $\sum_{t=1}^{T}X_{t}X_{t}^{\top}\succ 0$, by Lemma A.4, the set $\mathcal{C}_{\lambda_{1}}$, and hence the dual problem, has a Slater point $\left\{\{\overline{W}_{t}\}_{t=1}^{T},\{\overline{Y}_{t,\mathrm{off}}\}_{t=1}^{T},\{\overline{Z}_{t}\}_{t=1}^{T-1}\right\}$. The existence of solutions to the primal problem comes from strong duality thanks to the strict feasibility of the dual problem; see, for example, [32, Theorem 31.1].

B.5 Proof of Proposition 4.3

For simplicity, for a given pair of fixed $\lambda_{1}$ and $\lambda_{2}$, we denote the objective functions of (4.1) and (4.4) with $\lambda_{3}$ by $F$ and $G_{\lambda_{3}}$, respectively. From the definition of $\mathcal{R}(\cdot;\lambda_{3})$ in (4.3), we know that

\begin{align}
F\big(\{\Theta_{t}\}_{t=1}^{T}\big)\leq G_{\lambda_{3}}\big(\{\Theta_{t}\}_{t=1}^{T}\big)\text{ for any }\{\Theta_{t}\}_{t=1}^{T}.\tag{B.23}
\end{align}
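Indeed, (B.23) can be checked directly from the two-branch form of $\mathcal{R}$ used in the proof of Proposition 4.2 (the display preceding (B.21) and (B.22)): for $0\leq x\leq\lambda_{3}$ we have $\mathcal{R}(x;\lambda_{3})=x$, while for $x\geq\lambda_{3}$,

\begin{align*}
\mathcal{R}(x;\lambda_{3})-x=x^{2}-\lambda_{3}^{2}+\lambda_{3}-x=(x-\lambda_{3})(x+\lambda_{3}-1)\geq 0,
\end{align*}

since $x\geq\lambda_{3}$ and $x+\lambda_{3}-1\geq 2\lambda_{3}-1\geq 0$ whenever $\lambda_{3}\geq 0.5$.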

Proof of point (i).
Suppose that $\lambda_{1},\lambda_{2}$ are such that (4.1) has solutions. Let $\{\Theta_{t}^{*}\}_{t=1}^{T}$ be an arbitrary solution to (4.1) and define

\begin{align*}
\overline{\lambda}_{3}=\max\left\{\max_{t=1,\dots,T-1}\|\Theta_{t+1}^{*}-\Theta_{t}^{*}\|_{F},\,0.5\right\}.
\end{align*}

Since $\max_{t}\|\Theta_{t+1}^{*}-\Theta_{t}^{*}\|_{F}\leq\overline{\lambda}_{3}\leq\lambda_{3}$, each $\mathcal{R}(\|\Theta_{t+1}^{*}-\Theta_{t}^{*}\|_{F};\lambda_{3})$ is evaluated on the identity branch $\mathcal{R}(x;\lambda_{3})=x$, so it holds that

\begin{align}
F\big(\{\Theta_{t}^{*}\}_{t=1}^{T}\big)=G_{\lambda_{3}}\big(\{\Theta_{t}^{*}\}_{t=1}^{T}\big)\text{ for any }\lambda_{3}\geq\overline{\lambda}_{3}.\tag{B.24}
\end{align}

Fix any $\lambda_{3}\geq\overline{\lambda}_{3}$ and suppose that $\{\widehat{\Theta}_{t}\}_{t=1}^{T}$ is a solution to (4.4) with this $\lambda_{3}$. Then we have

\begin{align*}
F\big(\{\widehat{\Theta}_{t}\}_{t=1}^{T}\big)\overset{(\text{a})}{\leq}G_{\lambda_{3}}\big(\{\widehat{\Theta}_{t}\}_{t=1}^{T}\big)\overset{(\text{b})}{\leq}G_{\lambda_{3}}\big(\{\Theta_{t}^{*}\}_{t=1}^{T}\big)\overset{(\text{c})}{=}F\big(\{\Theta_{t}^{*}\}_{t=1}^{T}\big),
\end{align*}

where (a) comes from (B.23); (b) holds thanks to the assumption that $\{\widehat{\Theta}_{t}\}_{t=1}^{T}$ is a solution to (4.4) with $\lambda_{3}$; (c) is true because of (B.24). Therefore, $\{\widehat{\Theta}_{t}\}_{t=1}^{T}$ is also a solution to (4.1).

Proof of point (ii).
It suffices to show that, for arbitrary fixed $\lambda_{1},\lambda_{2}$, if for some $\lambda_{3}\geq 0.5$ there exists a solution $\{\Theta_{t}^{*}\}_{t=1}^{T}$ to (4.4) with this $\lambda_{3}$ satisfying

\begin{align}
\max_{t=1,\dots,T-1}\|\Theta_{t+1}^{*}-\Theta_{t}^{*}\|_{F}<\lambda_{3},\tag{B.25}
\end{align}

then (4.1) has solutions. To this end, we notice from (B.25) and the definition of $\mathcal{R}$ in (4.3) that

\begin{align}
\partial F(\{\Theta_{t}^{*}\}_{t=1}^{T})=\partial G_{\lambda_{3}}(\{\Theta_{t}^{*}\}_{t=1}^{T}),\tag{B.26}
\end{align}

where $\partial F(\{\Theta_{t}^{*}\}_{t=1}^{T})$ and $\partial G_{\lambda_{3}}(\{\Theta_{t}^{*}\}_{t=1}^{T})$ are the subdifferentials of $F$ and $G_{\lambda_{3}}$ at $\{\Theta_{t}^{*}\}_{t=1}^{T}$, respectively.

Given that $\{\Theta_{t}^{*}\}_{t=1}^{T}$ is a solution to (4.4), the optimality condition implies

\begin{align*}
0\in\partial G_{\lambda_{3}}(\{\Theta_{t}^{*}\}_{t=1}^{T}).
\end{align*}

This, along with (B.26), shows that $\{\Theta_{t}^{*}\}_{t=1}^{T}$ is also a solution to (4.1).
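In practice, point (ii) yields a simple certificate for solving the original problem via the revised one. The sketch below (hypothetical helper name; the solution $\{\Theta_{t}\}$ of (4.4) is assumed to come from some solver, e.g., the ADMM developed in the paper) checks the strict inequality (B.25):

```python
import numpy as np

def certifies_original_solution(Thetas, lam3):
    """Return True if (B.25) holds, i.e. max_t ||Theta_{t+1} - Theta_t||_F
    is strictly below lam3; by Proposition 4.3(ii) (with lam3 >= 0.5), the
    given {Theta_t} then also solves the original problem (4.1)."""
    gaps = [np.linalg.norm(Thetas[t + 1] - Thetas[t], "fro")
            for t in range(len(Thetas) - 1)]
    return max(gaps) < lam3

# Toy usage with a hypothetical solver output for T = 3, p = 3:
Thetas = [np.eye(3), 1.01 * np.eye(3), 1.01 * np.eye(3)]
print(certifies_original_solution(Thetas, lam3=0.5))  # True for this toy input
```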


References

  • Bauschke and Combettes [2011] H. H. Bauschke and P. L. Combettes. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer New York, 2011. ISBN 9781441994677. 10.1007/978-1-4419-9467-7.
  • Bleakley and Vert [2011] K. Bleakley and J. Vert. The group fused lasso for multiple change-point detection. HAL, Technical Report, Computational Biology Center, Paris, 2011.
  • Cai et al. [2016] T. Cai, W. Liu, and H. Zhou. Estimating sparse precision matrix: Optimal rates of convergence and adaptive estimation. The Annals of Statistics, 44(2):455–488, 2016. 10.1214/13-AOS1171.
  • Chan et al. [2014] N. Chan, C. Yau, and R.-M. Zhang. Group lasso for structural break time series. Journal of the American Statistical Association, 109(506):590–599, 2014. 10.1080/01621459.2013.866566.
  • Chen et al. [2017] L. Chen, D. Sun, and K.-C. Toh. An efficient inexact symmetric Gauss–Seidel based majorized ADMM for high-dimensional convex composite conic programming. Mathematical Programming, 161(1–2):237–270, Apr. 2017. ISSN 1436-4646. 10.1007/s10107-016-1007-5.
  • Eckstein [1994] J. Eckstein. Some saddle-function splitting methods for convex programming. Optimization Methods and Software, 4(1):75–83, Jan. 1994. ISSN 1029-4937. 10.1080/10556789408805578.
  • Eckstein and Bertsekas [1992] J. Eckstein and D. P. Bertsekas. On the Douglas–Rachford splitting method and the proximal point algorithm for maximal monotone operators. Mathematical Programming, 55(1–3):293–318, Apr. 1992. ISSN 1436-4646. 10.1007/bf01581204.
  • Engle and Colacito [2006] R. Engle and R. Colacito. Testing and valuing dynamic correlations for asset allocation. Journal of Business & Economic Statistics, 24(2):238–253, 2006. 10.1198/073500106000000017.
  • Engle et al. [2019] R. Engle, O. Ledoit, and M. Wolf. Large dynamic covariance matrices. Journal of Business & Economic Statistics, 37(2):363–375, 2019. 10.1080/07350015.2017.1345683.
  • Fazel et al. [2013] M. Fazel, T. K. Pong, D. Sun, and P. Tseng. Hankel matrix rank minimization with applications to system identification and realization. SIAM Journal on Matrix Analysis and Applications, 34(3):946–977, Jan. 2013. ISSN 1095-7162. 10.1137/110853996.
  • Fortin and Glowinski [1983] M. Fortin and R. Glowinski. Augmented Lagrangian Methods: Applications to the Numerical Solution of Boundary-Value Problems. Studies in Mathematics and Its Applications. Elsevier, 1983.
  • Gabay and Mercier [1976] D. Gabay and B. Mercier. A dual algorithm for the solution of nonlinear variational problems via finite element approximation. Computers & Mathematics with Applications, 2(1):17–40, 1976. ISSN 0898-1221. 10.1016/0898-1221(76)90003-1.
  • Gibberd and Nelson [2017] A. Gibberd and J. Nelson. Regularized estimation of piecewise constant gaussian graphical models: the group-fused graphical lasso. Journal of Computational and Graphical Statistics, 26(3):623–634, 2017. 10.1080/10618600.2017.1302340.
  • Gibberd and Roy [2021] A. Gibberd and S. Roy. Consistent multiple changepoint estimation with fused gaussian graphical models. Annals of the Institute of Statistical Mathematics, 73:283–309, 2021. 10.1007/s10463-020-00749-0.
  • Glowinski and Marroco [1975] R. Glowinski and A. Marroco. Sur l’approximation, par éléments finis d’ordre un, et la résolution, par pénalisation-dualité d’une classe de problèmes de dirichlet non linéaires. Revue française d’automatique, informatique, recherche opérationnelle. Analyse numérique, 9(R2):41–76, 1975. ISSN 0397-9342. 10.1051/m2an/197509r200411.
  • Hallac et al. [2017] D. Hallac, Y. Park, S. Boyd, and J. Leskovec. Network inference via the time-varying graphical lasso. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2017. 10.1145/3097983.3098037.
  • Harchaoui and Lévy-Leduc [2010] Z. Harchaoui and C. Lévy-Leduc. Multiple change-point estimation with a total variation penalty. Journal of the American Statistical Association, 105(492):1480–1493, 2010. 10.1198/jasa.2010.tm09181.
  • He et al. [2002] B. He, L.-Z. Liao, D. Han, and H. Yang. A new inexact alternating directions method for monotone variational inequalities. Mathematical Programming, 92(1):103–118, Mar. 2002. ISSN 0025-5610. 10.1007/s101070100280.
  • Ji et al. [2021] J. Ji, Y. He, L. Liu, and L. Xie. Brain connectivity alteration detection via matrix-variate differential network model. Biometrics, 77(4):1409–1421, 2021. 10.1111/biom.13359.
  • Kolar and Xing [2012] M. Kolar and E. Xing. Estimating networks with jumps. Electronic Journal of Statistics, 6:2069–2106, 2012. 10.1214/12-EJS739.
  • Li et al. [2016a] D. Li, J. Qian, and L. Su. Panel data models with interactive fixed effects and multiple structural breaks. Journal of the American Statistical Association, 111(516):1804–1819, 2016a. 10.1080/01621459.2015.1119696.
  • Li and Lin [2019] H. Li and Z. Lin. Accelerated alternating direction method of multipliers: An optimal O(1 / k) nonergodic analysis. Journal of Scientific Computing, 79(2):671–699, Dec. 2019. ISSN 1573-7691. 10.1007/s10915-018-0893-5.
  • Li et al. [2016b] X. Li, D. Sun, and K.-C. Toh. A Schur complement based semi-proximal ADMM for convex quadratic conic programming and extensions. Mathematical Programming, 155(1–2):333–373, Dec. 2016b. ISSN 1436-4646. 10.1007/s10107-014-0850-5.
  • Li et al. [2018] X. Li, D. Sun, and K.-C. Toh. QSDPNAL: a two-phase augmented lagrangian method for convex quadratic semidefinite programming. Mathematical Programming Computation, 10(4):703–743, Apr. 2018. ISSN 1867-2957. 10.1007/s12532-018-0137-6.
  • Li et al. [2019] X. Li, D. Sun, and K.-C. Toh. A block symmetric Gauss–Seidel decomposition theorem for convex composite quadratic programming and its applications. Mathematical Programming, 175(1–2):395–418, Feb. 2019. ISSN 1436-4646. 10.1007/s10107-018-1247-7.
  • Merlevède et al. [2011] F. Merlevède, M. Peligrad, and E. Rio. A Bernstein type inequality and moderate deviations for weakly dependent sequences. Probability Theory and Related Fields, 151:435–474, 2011. 10.1007/s00440-010-0304-9.
  • Monti et al. [2014] R. Monti, P. Hellyer, D. Sharp, R. Leech, C. Anagnostopoulos, and G. Montana. Estimating time-varying brain connectivity networks from functional mri time series. NeuroImage, 103:427–443, 2014. 10.1016/j.neuroimage.2014.07.033.
  • Ouyang et al. [2015] Y. Ouyang, Y. Chen, G. Lan, and E. Pasiliao. An accelerated linearized alternating direction method of multipliers. SIAM Journal on Imaging Sciences, 8(1):644–681, Jan. 2015. ISSN 1936-4954. 10.1137/14095697x.
  • Poignard and Asai [2023] B. Poignard and M. Asai. Estimation of high-dimensional vector autoregression via sparse precision matrix. The Econometrics Journal, 26(2):307–326, 2023. 10.1093/ectj/utad003.
  • Qian and Su [2016a] J. Qian and L. Su. Shrinkage estimation of regression models with multiple structural changes. Econometric Theory, 32(6):1376–1433, 2016a. 10.1017/S0266466615000237.
  • Qian and Su [2016b] J. Qian and L. Su. Shrinkage estimation of common breaks in panel data models via adaptive group fused lasso. Journal of Econometrics, 191(1):86–109, 2016b. 10.1016/j.jeconom.2015.09.004.
  • Rockafellar [1970] R. T. Rockafellar. Convex Analysis. Princeton University Press, Dec. 1970. ISBN 9781400873173. 10.1515/9781400873173.
  • Roy et al. [2017] S. Roy, Y. Atchadé, and G. Michailidis. Change point estimation in high dimensional markov random-field models. Journal of the Royal Statistical Society Series B: Statistical Methodology, 79(4):1187–1206, 2017. 10.1111/rssb.12205.
  • Sabach and Teboulle [2022] S. Sabach and M. Teboulle. Faster lagrangian-based methods in convex optimization. SIAM Journal on Optimization, 32(1):204–227, Feb. 2022. ISSN 1095-7189. 10.1137/20m1375358.
  • Sun et al. [2024] D. Sun, Y. Yuan, G. Zhang, and X. Zhao. Accelerating preconditioned ADMM via degenerate proximal point mappings, 2024.
  • Xiao et al. [2018] Y. Xiao, L. Chen, and D. Li. A generalized alternating direction method of multipliers with semi-proximal terms for convex composite conic programming. Mathematical Programming Computation, 10(4):533–555, Jan. 2018. ISSN 1867-2957. 10.1007/s12532-018-0134-9.
  • Xu [2017] Y. Xu. Accelerated first-order primal-dual proximal methods for linearly constrained composite convex programming. SIAM Journal on Optimization, 27(3):1459–1484, Jan. 2017. ISSN 1095-7189. 10.1137/16m1082305.
  • Yang et al. [2023] B. Yang, X. Zhao, X. Li, and D. Sun. An Accelerated Proximal Alternating Direction Method of Multipliers for Optimal Decentralized Control of Uncertain Systems, 2023.
  • Yang et al. [2021] L. Yang, J. Li, D. Sun, and K. Toh. A fast globally linearly convergent algorithm for the computation of Wasserstein barycenters. Journal of Machine Learning Research, 22:1–37, 2021. URL http://jmlr.org/papers/v22/19-629.html.
  • Yuan et al. [2017] H. Yuan, R. Xi, C. Chen, and M. Deng. Differential network analysis via lasso penalized d-trace loss. Biometrika, 104(4):755–770, 2017. 10.1093/biomet/asx049.
  • Yuan et al. [2019] H. Yuan, S. He, and M. Deng. Compositional data network analysis via lasso penalized d-trace loss. Bioinformatics, 35(18):3404–3411, 2019. 10.1093/bioinformatics/btz098.
  • Zhang et al. [2022] G. Zhang, Y. Yuan, and D. Sun. An Efficient HPR Algorithm for the Wasserstein Barycenter Problem with $O(\mathrm{Dim}(P)/\varepsilon)$ Computational Complexity, 2022.
  • Zhang et al. [2024] S. Zhang, H. Wang, and W. Lin. Care: Large precision matrix estimation for compositional data. Journal of the American Statistical Association, 2024. 10.1080/01621459.2024.2335586.
  • Zhang and Zou [2014] T. Zhang and H. Zou. Sparse precision matrix estimation via lasso penalized d-trace loss. Biometrika, 101(1):103–120, 2014. 10.1093/biomet/ast059.
  • Zhao and Yu [2006] P. Zhao and B. Yu. On model selection consistency of lasso. Journal of Machine Learning Research, 7(90):2541–2563, 2006. URL https://www.jmlr.org/papers/v7/zhao06a.html.
  • Zhou et al. [2010] S. Zhou, J. Lafferty, and L. Wasserman. Time varying undirected graphs. Machine Learning, 80:295–319, 2010. 10.1007/s10994-010-5180-0.