Break recovery in graphical networks with D-trace loss
Abstract
We consider the problem of estimating a time-varying sparse precision matrix, which is assumed to evolve in a piece-wise constant manner. Building upon the Group Fused LASSO and LASSO penalty functions, we estimate both the network structure and the change-points. We propose an alternative estimator to the commonly employed Gaussian likelihood loss, namely the D-trace loss. We provide the conditions for the consistency of the estimated change-points and of the sparse estimators in each block. We show that the solutions to the corresponding estimation problem exist when some conditions relating to the tuning parameters of the penalty functions are satisfied. Unfortunately, these conditions are not verifiable in general, posing challenges for tuning the parameters in practice. To address this issue, we introduce a modified regularizer and develop a revised problem that always admits solutions: these solutions can be used for detecting possible unsolvability of the original problem or obtaining a solution of the original problem otherwise. An alternating direction method of multipliers (ADMM) is then proposed to solve the revised problem. The relevance of the method is illustrated through simulations and real data experiments.
Key words: Alternating direction method of multipliers, Change-points, Dynamic network, Precision matrix, Sparsity.
1 Introduction
Much attention has been devoted to the development of methods for the detection of multiple change-points in the underlying data generating process of some realized sample. The model parameters are typically assumed to be constant in each regime but can experience “jumps” or “breaks” over time. The extraction of these breaks is usually performed through the application of some filtering techniques. To do so, a popular method is the fused LASSO which screens the existence of parameter breaks over the sample of observations. In particular, [17] developed a framework to recover multiple change-points for one-dimensional piece-wise constant signals. Then [4] extended this procedure to grouped parameters with the Group Fused LASSO in the context of autoregressive time series models. In the same spirit, [30] applied the Group Fused LASSO to linear regression models. This filtering technique has also been popular among the literature on change-points in panel data models: e.g., [31] aimed to identify structural changes in linear panel data regressions based on the adaptive Group Fused LASSO; using a similar filtering technique, [21] considered the estimation of structural breaks in panel models with interactive fixed effects.
This paper considers the detection of structural breaks in time series networks, whose structure is represented by the corresponding precision matrix. Some works have been dedicated to change-point detection for precision matrices. [46] assumed that the covariance matrix evolves smoothly over time. Contrary to the framework of [20], which focused on the detection of multiple breaks at a node level through the Group Fused LASSO, [33] restricted their framework to a single change point which impacts the global network structure. Our aim is to detect multiple change-points that affect the whole network. Thus, our viewpoint is similar to [16] and [13]: these works considered the Group Fused Graphical LASSO (GFGL) to detect breaks in the precision matrix. The latter filtering technique is a mixture of the Group Fused LASSO regularization of [2] and the LASSO regularization applied to the Gaussian likelihood function. We propose an alternative estimator to the commonly employed penalized Gaussian likelihood, which builds upon the D-trace loss of [44]. The formulation of the latter is much simpler than the Gaussian likelihood (in particular, the absence of a log-determinant term), thus allowing for a more direct theoretical analysis and implementation procedure. The D-trace loss and its extensions have been applied to diverse problems: while [40] considered network change analysis between two samples of vector-valued observations, [19] adapted the latter framework to differential network analysis for fMRI data; in the context of compositional data, [41] and [43] considered modified versions of the D-trace loss to estimate the high-dimensional compositional precision matrix under sparsity constraints; using the matching between the coefficients of SVAR processes and the entries of the precision matrix of the observed sample, [29] considered the sparse estimation of the latter based on the D-trace loss. In addition to the detection of structural breaks in the network, we allow for sparse dependence in the entries of the precision matrix in each regime. This motivates the use of the LASSO penalization function applied to each coefficient of the piece-wise constant precision matrix.
The corresponding estimation problem formulated in (2.1), Section 2, with the D-trace loss, includes two tuning parameters. In practice, the optimal parameters are unknown a priori and are expected to be estimated upon solving the estimation problems with a series of tuning parameters. However, we show that the problem with general tuning parameters may be unbounded from below (and hence, does not have solutions). Consequently, to identify the optimal parameters, one may encounter estimation problems that are unbounded from below, and the unboundedness can be numerically difficult to detect. To address this issue, we introduce a new regularizer, thereby constructing a revised problem that consistently has solutions regardless of the choice of the tuning parameters. In addition to the existence of a solution, this revised problem also enjoys several desirable properties. On the one hand, if the solutions to the revised problem possess some easy-to-detect patterns, then the original problem may not have solutions, and we can further update the tuning parameters towards obtaining reliable estimators. On the other hand, if the patterns are not detected, then the solutions to the revised problem also solve the original problem. We adapt the celebrated alternating direction method of multipliers (ADMM) to solve the revised problem, with its convergence guaranteed by [10]. ADMM is a widely used algorithm for solving convex optimization problems with separable objectives and linear coupling constraints. Its classical version was first introduced by [15] and [12], with the convergence established in [11]. Then [7] showed that ADMM is equivalent to the proximal point algorithm applied to a certain maximal monotone operator. This insight led to the development of the first proximal ADMM by [6]. Building upon Eckstein’s work, [18] extended the proximal ADMM to include a broader class of proximal terms. This approach was further advanced by [10], which generalized the method to allow the use of semi-proximal terms, enhancing the algorithm’s applicability. More recently, considerable efforts have been made to further accelerate variants of ADMM: see, e.g., the Schur complement-based semi-proximal ADMM in [23]; the inexact symmetric Gauss-Seidel-based ADMM in [5], [24], [25], [39]; the accelerated linearized ADMM in [28], [37], [22]; the preconditioned ADMM in [36], [34], [35]; the accelerated proximal ADMM based on Halpern iteration in [42], [38].
Our contributions can be summarized as follows: we propose a new estimator for break detection in the precision matrix, the Group Fused D-trace LASSO (GFDtL); we derive the conditions for which all the break points and the precision matrices can be consistently estimated when the estimated breaks coincide with the true number of breaks; we provide a modified regularizer to ensure the existence of solutions to the revised problem, and show that these solutions either exhibit some easily verifiable patterns indicating the possible unsolvability of the original problem so that we can further update the tuning parameters towards obtaining reliable estimators, or solve the original problem if the patterns are not detected; an ADMM is adapted for implementation with convergence guarantees. The relevance of the novel estimator for change-points in time-varying networks compared to the standard Gaussian-based GFGL estimator is illustrated through simulations and real data experiments.
The rest of the paper is organized as follows. Section 2 details the framework and estimation procedure. Section 3 contains the asymptotic properties. Section 4 is devoted to the optimization aspect of the estimation procedure. Section 5 provides the implementation details. Section 6 contains some simulation experiments, a sensitivity and computational complexity analysis. A real data experiment is provided in Section 7. All the proofs of the main text and auxiliary results are relegated to the Appendices.
Notation: Throughout this paper, we use and to denote the -th element of a vector and the -th element of a matrix , respectively. We write (resp. ) to denote the transpose of the matrix (resp. the vector ). For any square matrix , we write to denote the symmetrization of . The -th order identity matrix is denoted by . We denote by (resp. ) the zero matrix (resp. matrix of ones). We write to denote the vectorization operator that stacks the columns of on top of one another into a vector. For two matrices and , is the Kronecker product. The set of symmetric matrices is denoted by . A symmetric matrix is said to be positive semi-definite (resp. positive definite) and written as (resp. ) if all its eigenvalues are non-negative (resp. positive). The expression (resp. ) for , means (resp. ). We use and to denote the sets of positive semi-definite matrices and positive definite matrices, respectively. For a positive semi-definite matrix , is the unique positive semi-definite matrix such that . The norm for is denoted by , . The Frobenius norm and the off-diagonal (semi)norm of a matrix are denoted by and , respectively. The spectral norm, i.e., the maximum singular value of a matrix is written as . For a symmetric matrix A, we use (resp. ) to denote its largest (resp. smallest) eigenvalue. The maximum absolute value of all entries of a matrix is denoted by . We use to denote the set of symmetric matrices whose diagonal elements are . By we denote a sequence of symmetric matrices whose diagonal elements are , i.e., . For a sequence of symmetric matrices , refers to the -th element of , and is the copy of with diagonal elements set to ; in particular, . Given , we use to denote the projection onto . The proximal operator of a function at is defined as ; for more details about the proximal operator, we refer the interested readers to Sections 12.4, 14.1, 14.2 in [1].
2 Framework
For a sequence of -dimensional random vectors observed at , we consider the estimation of the underlying network through the corresponding precision matrix. The latter is assumed to evolve over time and the task is to recover the break dates. More formally, we denote by a disjoint partitioning of the set such that , and . The partition of the break dates is denoted by with the convention . Then, we assume and Var for , such that the observations indexed by elements in are -dimensional realizations of a centered random variable with variance-covariance . We denote by the precision matrix with entries . In practice, we consider the sequence of precision matrices such that the total number of distinct matrices in the set is and . We are interested in estimating the true number of break dates, the true partition and the true unknown precision matrices . As a consequence, the true data generating process is assumed to be
and , with blocks . While and the break dates are unknown, is typically much smaller than and, assuming the underlying network may exhibit some sparse structures, we consider the sparse estimation of ’s and the estimation of via a mixture of LASSO and Group Fused LASSO, which we will hereafter refer to as the Group Fused D-trace LASSO (GFDtL), defined as
(2.1)
where is the sample, are the tuning parameters, and the D-trace loss of [44] is defined as
Sparsity within the estimated precision matrix for a given block is controlled by , whereas affects the smoothing and guarantees that the solution is piece-wise constant.
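To fix ideas, the following sketch illustrates how the objective in (2.1) can be evaluated for a candidate sequence of precision matrices. It is an illustration only (written in Python/NumPy rather than the Matlab implementation used later), and it assumes that the per-period D-trace loss uses the outer product of the observation at time t and that the three terms enter with unit scaling; the exact normalization in (2.1) may differ.

```python
import numpy as np

def dtrace_loss(theta, sigma_hat):
    # D-trace loss of [44]: 0.5 * <Theta^2, Sigma_hat> - tr(Theta)
    return 0.5 * np.trace(theta @ theta @ sigma_hat) - np.trace(theta)

def gfdtl_objective(thetas, ys, lam1, lam2):
    # thetas: list of T symmetric p x p matrices; ys: list of T observations in R^p
    T = len(ys)
    fit = sum(dtrace_loss(thetas[t], np.outer(ys[t], ys[t])) for t in range(T)) / T
    # LASSO penalty on the off-diagonal entries of each precision matrix
    off_l1 = sum(np.abs(th - np.diag(np.diag(th))).sum() for th in thetas)
    # Group Fused penalty on successive differences (detects the breaks)
    fused = sum(np.linalg.norm(thetas[t] - thetas[t - 1], 'fro') for t in range(1, T))
    return fit + lam1 * off_l1 + lam2 * fused
```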
3 Asymptotic properties
Before stating the large sample results, we define some notations and present the assumptions used hereafter. Define and
Assumption 1.
-
(i)
is a centered strong mixing process, that is, such that for all , , with and the mixing coefficient , where are the filtrations generated by and .
-
(ii)
such that , , .
Assumption 2.
: .
Assumption 3.
Let be a non-increasing positive sequence converging to zero. The following conditions hold:
-
(i)
for some .
-
(ii)
and as .
-
(iii)
and .
-
(iv)
and as .
Assumption 1-(i) relates to the properties of . Assumption 1-(ii) is a tail condition and will allow us to apply exponential inequalities for dependent processes. Assumption 2 ensures the identification of the model: it is similar to Assumption A.1 of [20] or Assumption A.2 of [30]. Assumption 3 provides conditions on , , , and the tuning parameters . Condition (i) concerns the convergence rate of to . In condition (ii), the sample size in each regime may diverge with rate , but at a slower rate than , and the number of breaks may diverge slowly: this is similar to Assumption A.3-(i) of [30] or Assumption H3 of [4]. It also sets the slowest rate at which may shrink to zero: . Conditions (iii) and (iv) specify the fastest rate at which may shrink to zero, which is . It is worth emphasizing that conditions (ii)-(iii) imply and conditions (ii) and (iv) imply that . Finally, the effect of the LASSO shrinkage through does not relate to : this is because the Group Fused LASSO penalty only allows to detect change-points. The consistency of , given , is provided in the next Theorem.
Remark Result (i) implies . Since , this means . Here, is a key quantity to control for the rate at which converges to . Note that , implies that the fastest convergence rate for the break ratio estimator depends on the regularization parameter and . Result (ii) relates to the consistency of the precision matrix in each regime.
It is worth noting that the true number of breaks is unknown. Following the common practice in the change-point literature, we assume that is bounded by a known conservative upper bound . Next, we define for any two sets and . The next result establishes that all true break points in can be consistently estimated by some points in .
The proof of Theorem 3.2 is done by contradiction and follows similar arguments as in the proof of Theorem 3.1. It relies on the optimality conditions from Lemma A.3. Theorem 3.2 ensures that even if the number of blocks is overestimated, there will be an estimated change-point close to each unknown true change-point.
4 Optimization
In this section, we move on to the optimization aspects of criterion (2.1), including the dual problem, the existence of solutions, and the algorithm. Specifically, given , , and , we consider the following scaled optimization problem
(4.1)
where we scaled Problem (2.1) by a factor of for numerical stability. One can also notice that in (4.1) we use rather than as in (2.1). This choice is made for practical reasons: setting ensures non-singular solutions, and the set is closed and convex, hence the projection onto it is well defined.
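For illustration, the projection onto the set of symmetric matrices whose eigenvalues are bounded below by a positive constant admits a simple spectral form; the sketch below (a textbook construction, not the authors' Matlab code) clips the eigenvalues from below.

```python
import numpy as np

def project_eig_floor(theta, delta):
    # Frobenius-norm projection onto {Theta symmetric : Theta >= delta * I}
    sym = 0.5 * (theta + theta.T)            # symmetrize first
    eigval, eigvec = np.linalg.eigh(sym)
    eigval = np.maximum(eigval, delta)       # floor the spectrum at delta
    return (eigvec * eigval) @ eigvec.T
```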
4.1 Dual problem and existence of solutions
Proposition 4.1.
- (i)
-
(ii)
If , then there exists such that for any and any , the dual problem (4.2) has a Slater point (i.e., a feasible point that satisfies all the inequality and positive semi-definite constraints strictly) and the primal problem (4.1) has solutions.
We note that the assumption in Proposition 4.1-(ii) is reasonable because it can be viewed as a sample-based version of Assumption 2.
We next present a simple example to illustrate that for some specific dataset , even with , there may still exist some and such that (4.2) is infeasible, meaning that its optimal value is . Since (4.1) and (4.2) have the same optimal value, this means (4.1) does not have solutions in this case.
Example 1.
Consider the case that where and . Then
and . The positive semi-definite constraint in (4.2) can be written as
After some simple calculations, we have
Recall that the diagonal elements of a positive semi-definite matrix must be non-negative; we can then observe that the above condition requires that and . These two conditions imply that for any and any , the dual problem (4.2) is infeasible and hence the primal problem (4.1) does not have solutions.
Remark 1.
It is worth pointing out that the nonexistence of solution does not contradict our findings in Section 3 because those results only indicate that when there is a ground truth, under suitable assumptions, the ground truth can be (approximately) recovered from a solution of problem (2.1) with suitably chosen , and ; in particular, it did not imply solution existence of problem (2.1) for general , and .
Although we can obtain some lower bound on (as detailed in the proof of Proposition 4.1-(ii)) that ensures the existence of solutions to (4.1), this bound may not be tight. This can cause practical issues because the optimal may be strictly smaller than . We would then need to work with problem (4.1) with some to locate such . However, since , (4.1) may be unbounded from below, and certifying such a scenario is a challenging problem. This motivates us to modify problem (4.1) to obtain a new model which has solutions for all choices of and , and (under some mild condition) returns a solution of problem (4.1) when the latter problem is solvable.
To this end, we notice from the above example that the unsolvability of problem (4.1) when is small may be related to the fact that the relations between different groups induced by the Group Fused LASSO regularizer are not strong enough to leverage the condition to ensure dual strict feasibility (and hence the existence of solutions to (4.1)). Hence, this motivates the introduction of a new (modified) regularizer to replace the Group Fused LASSO regularizer in problem (4.1): intuitively, this regularizer should be similar to the Group Fused LASSO regularizer when is small, so as to induce the (same) desired breaks, but should penalize large more heavily to ensure the solution existence of the new model.
4.2 A revised problem with a modified regularizer
Let and let
(4.3)
Here, is necessary and sufficient to ensure the convexity of . The function employs the absolute value in a small region near 0 (determined by ) and switches to a quadratic function outside this region. In this way, it reduces to the classical penalty when is near 0, while imposing a more substantial penalty as goes further away from 0.
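As an illustration of this behavior, a reverse-Huber-type penalty reproduces the description above: it equals the absolute value on a small interval around 0 and grows quadratically outside it. This is a sketch only; the exact constants in (4.3) may differ.

```python
import numpy as np

def modified_penalty(x, eps):
    # |x| near zero, quadratic growth beyond eps; convex and continuous at |x| = eps
    ax = np.abs(x)
    return np.where(ax <= eps, ax, (ax ** 2 + eps ** 2) / (2.0 * eps))
```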
Replacing by in problem (4.1), we obtain the following revised optimization problem:
(4.4)
The next proposition shows the dual problem and the existence of solutions to (4.4).
Proposition 4.2.
Proposition 4.3.
Equipped with Proposition 4.3, we can derive the following practical way to search for a suitable such that (4.1) is solvable by solving (a sequence of) (4.4). Specifically, for any positive and , we solve (4.4) with an appropriately large , and then check if the solution satisfies (4.6): if it does, we increase further to pursue a reliable estimator; otherwise we obtain a solution to (4.1), and all the theoretical properties described in Section 3 hold.
4.3 An alternating direction method of multipliers
In this subsection, we discuss how to adapt the alternating direction method of multipliers (ADMM) to solve (4.4). We rewrite (4.4) as the following constrained optimization problem:
(4.7)
where we denote for the sake of notational simplicity; is the copy of with the diagonal elements set to 0, and .
Given , the augmented Lagrangian function is
In iteration , our ADMM consists of three update steps:
-
1.
update:
Note that this update is well defined as is strongly convex in (with other variables fixed).
-
2.
update:
Note that this update is well defined as is strongly convex in the variables , , (with other variables fixed).
-
3.
Dual update: for all , apply (4.8) below. (The dual stepsize 1.61 in (4.8) can more generally be chosen from the interval ; here we pick 1.61 for simplicity.)
(4.8)
We next discuss how to perform the first two update steps efficiently.
update: Setting the derivative of the objective function of the subproblem with respect to to 0, we have
For the sake of simplicity, we let for all . By further denoting
for all and , the update is equivalent to solving the following linear system:
(4.9)
This linear system does not have a closed form solution in general. Here we use pcg in Matlab to solve it. Specifically, in each iteration, we use the solution from the previous iteration as the initial point and solve the system only up to some tolerance that decreases with iterations.
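For illustration, the idea of warm-starting an iterative solver at the previous ADMM iterate can be sketched with a minimal conjugate gradient routine; the paper uses Matlab's pcg, and the routine and tolerance rule below are placeholders.

```python
import numpy as np

def cg_solve(matvec, b, x0, tol=1e-8, max_iter=500):
    # Conjugate gradient for A x = b with A symmetric positive definite,
    # started from x0 (e.g., the solution from the previous ADMM iteration).
    x = x0.astype(float).copy()
    r = b - matvec(x)
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        if np.sqrt(rs) <= tol * (1.0 + np.linalg.norm(b)):
            break
        Ap = matvec(p)
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x
```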
update: It is notable that this subproblem is block separable, allowing us to solve it by addressing three further subproblems with respect to , , and , respectively.
For , one has
For , it holds that
(4.11)
The update scheme for is slightly more complicated. For simplicity, we denote
For each , we need to solve the following optimization problem:
By the definition of in (4.3), it is equivalent to solving the following two problems
and let
(4.12)
where and refer to the optimal values of the two optimization problems ① and ②, respectively. We first note that if , then and we have . We next assume that . For ①, from the proximal operator of the Frobenius norm, we have
(4.13)
For ②, by considering the derivative of its objective function, we get
(4.14)
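For the branch ①, the update in (4.13) relies on the proximal operator of the Frobenius norm, i.e., block soft-thresholding; a minimal sketch is given below. The full update additionally compares the values of ① and ② as in (4.12), which is omitted here.

```python
import numpy as np

def block_soft_threshold(m, tau):
    # Proximal operator of tau * ||.||_F at the matrix m: shrink the whole block
    # towards zero, and set it exactly to zero when ||m||_F <= tau.
    norm = np.linalg.norm(m, 'fro')
    if norm <= tau:
        return np.zeros_like(m)
    return (1.0 - tau / norm) * m
```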
Overall, the ADMM is summarized in Algorithm 1, whose convergence directly follows from [10, Appendix B].
5 Implementation details
We implement Algorithm 1 in Matlab R2023a and perform several numerical experiments on both simulated datasets and real datasets, which will be discussed in the later sections. The Matlab codes for the implementation of Algorithm 1 and the experiments are available in https://github.com/linyopt/GFDtL. In this section, we provide some implementation details about the algorithm and the numerical experiments; interested readers can check the GitHub repository for more technical details that are not covered here.
We first briefly describe the process to obtain an estimator and to identify the corresponding breaks based on a given and a sample with . Specifically, for any pair of tuning parameters , Proposition 4.3 suggests that we can apply Algorithm 1 to solve (4.4) with a sufficiently large to either assert that (4.1) may not have solutions or obtain a solution to (4.1). With the obtained estimator in hand, we identify the breaks by selecting the ’s with . Since controls the model complexity and smoothing, they must be calibrated accordingly. We search for the optimal tuning parameters over a user-specified grid based on some criteria, which will be discussed in the next subsection.
It is important to further discuss how the breaks are identified from a given estimator. Intuitively, with the Group Fused LASSO regularizer, the estimator should satisfy that for some ’s, , which we identify as breaks. In contrast, the estimator produced by the competing method, the Gaussian loss-based GFGL approach of [13], does not inherently possess this property. Instead, their algorithm includes an additional post-processing step to extract the breaks from the estimator. Although, as discussed hereafter, this method can also produce reasonable breaks, their estimator does not naturally exhibit the desired property of containing identifiable breaks.
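A minimal sketch of the break-identification step described above: successive differences of the estimated precision matrices are thresholded, the tolerance being a user-chosen value (a placeholder here).

```python
import numpy as np

def identify_breaks(thetas, tol=1e-6):
    # Return the dates t at which the estimated precision matrix changes,
    # i.e., ||Theta_t - Theta_{t-1}||_F exceeds the tolerance.
    return [t for t in range(1, len(thetas))
            if np.linalg.norm(thetas[t] - thetas[t - 1], 'fro') > tol]
```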
5.1 Optimal selection of the tuning parameters
The Bayesian information criterion (BIC) is a common criterion to choose the optimal tuning parameters: see, e.g., [27]. However, it is known that there are some scenarios where BIC may fail to select the optimal tuning parameters: see, e.g., Section 4.2 and the Appendix in [13]. This motivates us to propose the following three methods for the selection of the optimal pair of tuning parameters:
-
(a)
Method a: When the true underlying data generating process is known, the optimal pair can be selected as the pair that minimizes or maximizes suitable performance measures, such as the Hausdorff distance, model recovery, or estimation error. Although this strategy can only be employed when the true underlying structure is known, it is informative about the relative performances of different competing procedures when one is in a position to minimize/maximize the performance metrics.
-
(b)
Method b: In the case of simulated data, we divide the simulated samples of observations into a training set and test sets, all of them sampled from the same data generating process: the training set is denoted by and the test sets are denoted by . Based on , we apply Algorithm 1 to solve (4.4) with an appropriately large for different candidates and obtain . The optimal is selected as the pair which minimizes
(5.1) is user-specified. Throughout this paper, we set .
-
(c)
Method c: When the true underlying data generating process is unknown, as in real data, following [13], we employ a BIC-type criterion given by
(5.2) where represents the complexity, or degrees of freedom, and is defined as
Then we select the optimal values according to the criterion
The definition of varies across the literature: see, e.g., the different definitions provided in [13] and [27]. In the former work, is the sum of the number of active edges at and of the corresponding changes for . In the latter work, is the number of non-zero coefficient blocks in for . Based on our preliminary experiments, we found that these two definitions do not result in significant differences in the selection of the optimal tuning parameters. Therefore, since we will compare our algorithm with the GFGL method of [13], we use their definition for consistency.
For these three methods, the minimization problem is solved w.r.t. over a user-specified grid: in our synthetic experiments, the grid is specified as for and for .
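A minimal sketch of this grid search is given below; the routines `fit` (solving (4.4) for a pair of tuning parameters and returning None when the unsolvability pattern (4.6) is detected) and `score` (the held-out loss (5.1) or the BIC-type criterion (5.2)) are placeholders supplied by the user.

```python
import itertools

def select_tuning(grid1, grid2, fit, score):
    # Exhaustive search over the user-specified grid of (lambda1, lambda2).
    best_pair, best_val = None, float('inf')
    for lam1, lam2 in itertools.product(grid1, grid2):
        thetas = fit(lam1, lam2)
        if thetas is None:      # pair excluded: (4.1) deemed unsolvable
            continue
        val = score(thetas, lam1, lam2)
        if val < best_val:
            best_pair, best_val = (lam1, lam2), val
    return best_pair
```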
It is worth noting that from Proposition 4.1-(ii), within the user-specified grid there may be some pairs of such that (4.1) does not have solutions, which can be detected by checking (4.6). For these pairs of , we set and , so that they will never be selected. In Subsection 6.2, we perform a sensitivity analysis on a simulated dataset to evaluate and compare Method b and Method c for selecting the optimal pair of tuning parameters.
5.2 Initialization and termination criterion
Throughout this paper, we initialize the algorithm as follows: for all ,
here the inverse is well-defined since we have .
We next describe the termination criterion, which essentially consists of checking the constraint violations for the dual problem (4.5), and also the gap between the primal and dual objective values. Recall that is the primal variable with , , and for all ; is the dual variable with , , for all ; and let for all , where . We define the relative infeasibility for the positive semi-definite constraint as
For the bound constraint of , we similarly define the relative infeasibility as
where is the -th element of . Then, the relative dual infeasibility is defined as
The relative duality gap is defined by
where and are the objective values of primal problem (cf. (4.4)) at and dual problem (cf. (4.5)) at , respectively.
Overall, throughout this paper, we terminate Algorithm 1 (we note that Algorithm 1 does not involve while the dual problem (4.5) does; here, we set for all and , which comes from the derivation of the dual problem (4.5), see (B.20)) when both the relative primal-dual gap and the relative dual infeasibility are sufficiently small:
or, it is detected that Problem (4.1) may not have solutions:
(5.3)
or, the relative successive changes of both primal and dual variables are sufficiently small:
where . We set for experiments on both synthetic and real datasets.
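For illustration, the overall stopping logic can be organized as follows; the relative successive change is measured here (as an assumption) by the norm of the difference divided by one plus the norm of the previous iterate, the tolerance value is a placeholder, and the detection of (5.3) is omitted.

```python
import numpy as np

def relative_change(x_new, x_old):
    # Relative successive change between two (vectorized) iterates.
    return np.linalg.norm(x_new - x_old) / (1.0 + np.linalg.norm(x_old))

def should_stop(rel_gap, rel_dual_infeas, rel_chg_primal, rel_chg_dual, tol=1e-5):
    # Stop when the relative duality gap and dual infeasibility are small,
    # or when both primal and dual iterates have essentially stopped moving.
    return (max(rel_gap, rel_dual_infeas) <= tol
            or max(rel_chg_primal, rel_chg_dual) <= tol)
```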
5.3 Choices of algorithm parameters
The parameter specifies a lower bound of the eigenvalues of matrices in the resulting solution, which ensures the non-singularity of each matrix in the obtained solution if . The choice of depends on how “non-singular” the solution is expected to be. In our experiments, we set . In view of Proposition 4.3 and its proof, should be large enough, with its lower bound related to and the possible solution to the corresponding (4.1). After some trial experiments, we set for experiments on synthetic datasets and for experiments on real datasets. For the tolerance of pcg, we set and . The parameter in ADMM has no effect on the results but only affects the convergence speed of the algorithm, so we did not fine-tune it; interested readers can refer to the GitHub repository for further details regarding these settings.
6 Synthetic experiments
In this section, we conduct some simulation experiments to assess the performances of the GFDtL procedure, its sensitivity to the tuning parameters and computational complexity.
6.1 Simulations
To assess the finite-sample relevance of the GFDtL procedure, we consider the following settings, where we denote by the true number of breaks and by the true precision matrices:
Setting (i): For each true precision matrix , its structure is uniformly drawn from the set of graphs with vertex size , that is, the number of variables, and edges, yielding an Erdős-Rényi graph for the block . The zero entries are generated by matching the pattern of the adjacency matrix with the precision matrix , that is, , which provides the sparsity pattern in the off-diagonal coefficients of . The proportion of zero entries is calibrated by .
Then the off-diagonal non-zero entries of are drawn in , where denotes the uniform distribution in . The diagonal elements are drawn in . Finally, to ensure that the resulting matrix is positive-definite, if the simulated satisfies , we apply , where is the first value in such that .
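A rough sketch of the generator used in Setting (i) is given below; the entry ranges and the eigenvalue floor `delta` are illustrative placeholders (the paper's exact intervals are those stated above), and the positive-definiteness correction shifts the diagonal until the smallest eigenvalue exceeds the floor, in the spirit of the final step described above.

```python
import numpy as np

def make_sparse_precision(p, n_edges, rng, delta=0.1):
    # Erdos-Renyi support with n_edges edges, then fill the nonzero entries.
    theta = np.zeros((p, p))
    pairs = [(i, j) for i in range(p) for j in range(i + 1, p)]
    for k in rng.choice(len(pairs), size=n_edges, replace=False):
        i, j = pairs[k]
        theta[i, j] = theta[j, i] = rng.uniform(0.3, 0.6) * rng.choice([-1.0, 1.0])
    theta[np.diag_indices(p)] = rng.uniform(1.0, 2.0, size=p)
    lam_min = np.linalg.eigvalsh(theta).min()
    if lam_min < delta:                      # positive-definiteness correction
        theta += (delta - lam_min) * np.eye(p)
    return theta

# Example: Theta = make_sparse_precision(20, 30, np.random.default_rng(0))
```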
Setting (ii): For each true , its off-diagonal non-zero entries are generated in and diagonal elements are drawn in . To ensure that the resulting matrix is positive-definite, we apply the same final step as in Setting (i).
Setting (iii): The precision matrix is generated following the same spirit as in Section 5 of [3]. We construct , , where with , and where makes the diagonal entries in and different. We set with . The coefficient equals with probability , and equals otherwise. Then, we set when and each non-zero off-diagonal coefficient is multiplied by (resp. ) with probability (resp. ). Finally, to ensure that the resulting matrix is positive-definite, we apply the same final step as in Setting (i). This creates a banded structure.
For each of these settings, for , we draw from the -dimensional Gaussian distribution, with when . We set and three cases relating to the breaks are considered: (a) no break; (b) a single break; (c) several breaks. In case (b), and in case (c), ; the locations of the breaks, i.e., , are randomly set, conditionally on being at least , where . We set so that (resp. ) when (resp. ): the regimes may have different time lengths but satisfy a minimum time length condition. The latter relates to the issue of trimming: see the Introduction and Section 4 in [30], who mentioned that is a standard choice. We set the sample size in case (a) and in cases (b), (c). The “sparsity degree”, that is, the proportion of zero entries in the lower triangular part of , is set as and in settings (i) and (ii), which represents approximately and zero entries in each regime out of the lower triangular elements, respectively. Note that in setting (i), we allow the true number of zero entries to slightly vary between each regime around the corresponding sparsity degree, although remains constant. In setting (ii), we keep the true number of zero entries constant across the regimes, i.e., and zero entries exactly. In setting (iii), the sparsity degree can slightly vary depending on the value of .
For each setting, we draw one hundred batches of independent samples . For each batch, we apply Algorithm 1 to solve (4.4) with some large over a grid specified as for and for , and select the optimal pair using the selection methods described in Subsection 5.1, which will be denoted by Method a, Method b and Method c hereafter. As a competing method, we also solve the Gaussian loss-based GFGL of [13] (the Matlab code for GFGL is available on the corresponding journal website as supplemental material) by selecting the optimal tuning parameters using the same strategies. For these two methods, we report the following metrics as performance measures:
-
(i)
the number of breaks detected by the procedure denoted by .
-
(ii)
the Hausdorff distance , which serves as a measure of the estimation accuracy of the break dates.
-
(iii)
the F1 score defined as , where TP is the number of correctly estimated non-zero coefficients, FN is the number of incorrectly estimated zero entries, and FP is the number of incorrectly estimated non-zero entries. The closer the score is to 1, the better (an illustrative sketch of these metrics is given after this list).
-
(iv)
the accuracy defined as , where TN is the number of correctly estimated zero entries.
-
(v)
the averaged mean squared error (MSE) for precision accuracy.
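An illustrative sketch of these metrics (the threshold used to declare an entry non-zero is an assumption):

```python
import numpy as np

def f1_and_accuracy(theta_hat, theta_true, tol=1e-8):
    # Support recovery on the off-diagonal entries.
    mask = ~np.eye(theta_true.shape[0], dtype=bool)
    est_nz = np.abs(theta_hat[mask]) > tol
    true_nz = np.abs(theta_true[mask]) > tol
    tp = np.sum(est_nz & true_nz)
    fp = np.sum(est_nz & ~true_nz)
    fn = np.sum(~est_nz & true_nz)
    tn = np.sum(~est_nz & ~true_nz)
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) > 0 else 1.0
    acc = (tp + tn) / mask.sum()
    return f1, acc

def hausdorff(breaks_a, breaks_b):
    # Hausdorff distance between two non-empty sets of break dates.
    d = lambda a, b: max(min(abs(x - y) for y in b) for x in a)
    return max(d(breaks_a, breaks_b), d(breaks_b, breaks_a))
```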
These metrics are averaged over the one hundred independent batches and are reported in Table 1, Table 2 and Table 3. These simulated experiments have been run on an Intel(R) Xeon(R) Gold 6242R CPU @ 3.09 GHz with 128 GB of RAM.
| | Number of breaks | | Hausdorff distance | | F1 score | | Accuracy | | MSE | |
|---|---|---|---|---|---|---|---|---|---|---|
| | GFDtL | GFGL | GFDtL | GFGL | GFDtL | GFGL | GFDtL | GFGL | GFDtL | GFGL |
Method a | 0 | 0 | 0 | 0 | 0.8472 | 0.7174 | 0.9133 | 0.8084 | 0.1707 | 0.2266 | |
Method b | 0 | 3.6800 | 0 | 46.0100 | 0.8072 | 0.5377 | 0.8919 | 0.5518 | 0.1662 | 0.2089 | |
Method c | 0.1400 | 2.7000 | 4.1100 | 42.2200 | 0.7525 | 0.6915 | 0.8425 | 0.8060 | 0.1335 | 0.2331 | |
Method a | 0 | 0 | 0 | 0 | 0.8246 | 0.7247 | 0.8036 | 0.6429 | 0.2098 | 0.4219 | |
Method b | 0 | 2.5000 | 0 | 36.8400 | 0.7178 | 0.7078 | 0.7251 | 0.5811 | 0.3452 | 0.4127 | |
Method c | 0.0700 | 5.9800 | 2.6100 | 45.0200 | 0.7903 | 0.6526 | 0.7541 | 0.6281 | 0.1755 | 0.4491 | |
Method a | 1.0100 | 1.2700 | 0.4900 | 1.3500 | 0.7261 | 0.6592 | 0.8294 | 0.7641 | 0.2496 | 0.2382 | |
Method b | 2.1800 | 23.0600 | 20.9900 | 69.9600 | 0.6815 | 0.4712 | 0.7805 | 0.4209 | 0.2049 | 0.2092 | |
Method c | 1.4000 | 3.7200 | 9.5100 | 21.9400 | 0.6480 | 0.6707 | 0.7264 | 0.8016 | 0.2029 | 0.2427 | |
Method a | 1.0000 | 1.3600 | 0.0700 | 0.4300 | 0.7441 | 0.7323 | 0.6882 | 0.6347 | 0.3558 | 0.4237 | |
Method b | 2.7900 | 20.0700 | 27.7800 | 66.7200 | 0.7163 | 0.7140 | 0.6696 | 0.5784 | 0.3774 | 0.4068 | |
Method c | 1.1300 | 5.4000 | 9.7300 | 23.4000 | 0.7223 | 0.6670 | 0.6583 | 0.6343 | 0.3335 | 0.4505 | |
Method a | 4.2100 | 5.7600 | 2.2700 | 2.7100 | 0.6389 | 0.6088 | 0.7409 | 0.7093 | 0.2906 | 0.2585 | |
Method b | 5.4500 | 44.1100 | 11.5800 | 27.6700 | 0.6040 | 0.4555 | 0.6924 | 0.3810 | 0.2476 | 0.2159 | |
Method c | 2.9200 | 7.2300 | 29.3100 | 18.0100 | 0.5792 | 0.6228 | 0.6579 | 0.7488 | 0.2853 | 0.2599 | |
Method a | 4.1100 | 5.1100 | 1.7400 | 1.1800 | 0.7021 | 0.7175 | 0.6180 | 0.5920 | 0.4475 | 0.4379 | |
Method b | 6.3000 | 36.8100 | 14.9100 | 27.6500 | 0.6989 | 0.7163 | 0.6209 | 0.5759 | 0.4218 | 0.4114 | |
Method c | 1.8800 | 6.0800 | 41.0400 | 11.5000 | 0.6616 | 0.6615 | 0.5700 | 0.6067 | 0.4555 | 0.4762 |
| | Number of breaks | | Hausdorff distance | | F1 score | | Accuracy | | MSE | |
|---|---|---|---|---|---|---|---|---|---|---|
| | GFDtL | GFGL | GFDtL | GFGL | GFDtL | GFGL | GFDtL | GFGL | GFDtL | GFGL |
Method a | 0 | 0 | 0 | 0 | 0.8364 | 0.7131 | 0.9051 | 0.8109 | 0.2087 | 0.3582 | |
Method b | 0.0100 | 2.7600 | 0.3600 | 40.0400 | 0.7940 | 0.5841 | 0.8861 | 0.6200 | 0.2239 | 0.3417 | |
Method c | 0.0200 | 2.2200 | 0.8000 | 44.0100 | 0.7576 | 0.6655 | 0.8532 | 0.8048 | 0.1560 | 0.3654 | |
Method a | 0 | 0 | 0 | 0 | 0.8437 | 0.8143 | 0.7784 | 0.7069 | 0.2451 | 0.6669 | |
Method b | 0 | 1.2700 | 0 | 29.1500 | 0.6995 | 0.8155 | 0.6500 | 0.7027 | 0.5162 | 0.6652 | |
Method c | 0.0100 | 3.2600 | 0.3600 | 38.2700 | 0.8318 | 0.6885 | 0.7639 | 0.6071 | 0.2423 | 0.6981 | |
Method a | 1.0200 | 1.2900 | 0.9100 | 1.3800 | 0.7021 | 0.6546 | 0.7968 | 0.7645 | 0.2833 | 0.3470 | |
Method b | 1.8900 | 23.9000 | 15.7300 | 86.6000 | 0.6777 | 0.5139 | 0.7769 | 0.4931 | 0.2530 | 0.3206 | |
Method c | 1.0100 | 3.1200 | 16.3800 | 21.0600 | 0.6264 | 0.6398 | 0.7374 | 0.7998 | 0.2552 | 0.3506 | |
Method a | 0.9900 | 1.4500 | 0.3000 | 0.8000 | 0.7846 | 0.8271 | 0.6909 | 0.7186 | 0.5085 | 0.6511 | |
Method b | 3.5100 | 20.5600 | 39.8500 | 91.1200 | 0.7465 | 0.8209 | 0.6560 | 0.7054 | 0.5470 | 0.6431 | |
Method c | 1.0300 | 4.0800 | 14.2600 | 25.0800 | 0.7717 | 0.6935 | 0.6765 | 0.6089 | 0.4779 | 0.6874 | |
Method a | 4.3100 | 6.1500 | 4.5500 | 4.5300 | 0.5985 | 0.5901 | 0.6861 | 0.6931 | 0.3334 | 0.3466 | |
Method b | 4.3900 | 40.8100 | 16.0900 | 33.6400 | 0.5788 | 0.4821 | 0.6624 | 0.4297 | 0.3099 | 0.3153 | |
Method c | 1.9300 | 5.4500 | 55.6100 | 32.4000 | 0.5226 | 0.5971 | 0.5968 | 0.7630 | 0.3480 | 0.3559 | |
Method a | 4.1000 | 5.1300 | 1.9400 | 1.1900 | 0.7772 | 0.8265 | 0.6702 | 0.7150 | 0.6536 | 0.6727 | |
Method b | 6.8800 | 33.9200 | 17.0400 | 34.2100 | 0.7691 | 0.8233 | 0.6655 | 0.7077 | 0.6031 | 0.6516 | |
Method c | 1.5000 | 6.3700 | 50.4000 | 12.0900 | 0.7507 | 0.7154 | 0.6546 | 0.6171 | 0.6527 | 0.7177 |
| | Number of breaks | | Hausdorff distance | | F1 score | | Accuracy | | MSE | |
|---|---|---|---|---|---|---|---|---|---|---|
| | GFDtL | GFGL | GFDtL | GFGL | GFDtL | GFGL | GFDtL | GFGL | GFDtL | GFGL |
Method a | 0 | 0 | 0 | 0 | 0.8236 | 0.7848 | 0.8230 | 0.7704 | 0.1089 | 0.2241 | |
Method b | 0.0600 | 5.6600 | 1.5900 | 48.6800 | 0.7611 | 0.7268 | 0.7941 | 0.6448 | 0.1504 | 0.2100 | |
Method c | 0.0300 | 7.3800 | 0.8900 | 53.9000 | 0.5396 | 0.5322 | 0.6835 | 0.6533 | 0.1709 | 0.2511 | |
Method a | 1.0000 | 1.1200 | 2.3800 | 3.1900 | 0.7799 | 0.7357 | 0.7670 | 0.7425 | 0.1800 | 0.2344 | |
Method b | 2.5500 | 19.2700 | 27.8200 | 85.5000 | 0.7564 | 0.7136 | 0.7600 | 0.6084 | 0.1825 | 0.2092 | |
Method c | 0.4400 | 7.5100 | 59.1100 | 42.3700 | 0.3961 | 0.4830 | 0.6028 | 0.6325 | 0.2317 | 0.2554 | |
Method a | 4.9900 | 7.2700 | 8.2500 | 9.6700 | 0.6950 | 0.6816 | 0.6709 | 0.6791 | 0.2181 | 0.2274 | |
Method b | 4.5200 | 34.1600 | 24.3900 | 32.7100 | 0.7104 | 0.6951 | 0.7029 | 0.5862 | 0.2251 | 0.2095 | |
Method c | 0.3700 | 4.3300 | 107.9600 | 66.8400 | 0.3538 | 0.4279 | 0.5836 | 0.6102 | 0.2589 | 0.2478 |
Overall, our GFDtL procedure performs better than the GFGL with respect to all the metrics in any setting and for Methods a and b for the selection of the optimal pair of . Importantly, our procedure results in lower MSE, hence more accurate estimation.
It is worth mentioning the case of no break, that is : means that the procedure concludes that there is no break; in this setting, our procedure performs much better than the GFGL, particularly in terms of MSE and Hausdorff distance.
When applying Method b, the latter metric obtained by GFDtL is much better than the one resulting from the GFGL, which tends to overestimate the number of breaks. Overall, we may conclude that GFDtL results in a low probability of falsely detecting breaks in the no break case.
Finally, the BIC (i.e., Method c)-based results are in favor of the GFGL method, particularly in terms of . This is because the BIC tends to underestimate the number of breaks when applied to the GFDtL, i.e., it tends to select large : indeed, when , the results obtained by Method c are good; but in the case of multiple breaks, the number of breaks detected by BIC is much lower than the true number of breaks. This behavior is further detailed in Subsection 6.2.
6.2 Sensitivity analysis with respect to the tuning parameters
We propose a sensitivity analysis of Methods b and c provided in Subsection 5.1 with respect to the calibration of . More precisely, we illustrate the ability of the proposed strategies to identify the optimal pair for break and sparse estimation. The experiments are conducted on datasets simulated according to Setting (i) in Subsection 6.1 with , , the “sparsity degree” being and . To better approximate the metric surfaces, we use a denser grid specified as for and for .
The results are displayed in Figure 1, with the three rows corresponding to , and the four columns showcasing the results for BICs (cf. (5.2)), lossvals (cf. (5.1)), Hausdorff distances, and F1 scores, respectively. In each subfigure, lighter colors represent more favorable tuning parameters, indicating areas where the metric values are minimized or maximized as appropriate. To facilitate visualization given the wide range of values for BICs and lossvals, we pre-processed these metrics before plotting: specifically, we subtracted the minimum value to ensure non-negativity, then applied the log1p function in Matlab (i.e., ) to compress the scale of the values, making the patterns more discernible and enhancing the interpretability of the results.
[Figure 1: surfaces of the BIC, lossval, Hausdorff distance, and F1 score metrics over the tuning-parameter grid; rows correspond to the three break scenarios, columns to the four metrics.]
The figures suggest consistent patterns across the surfaces of all four metrics. Specifically, there are distinct boundaries splitting the metric surfaces into two primary regions: an upper and a lower region. Each of these regions further subdivides into multiple subregions that exhibit similar characteristics across all four metrics. The lower regions of the BIC, lossval, and F1 score surfaces are characterized by numerous vertical bars, indicating areas of potentially optimal parameter combinations. In contrast, the Hausdorff distance surface displays a constant lower region.
An interesting observation is that the BIC-type criterion tends to favor smaller values of , whereas the lossval criterion leans towards slightly larger values. Furthermore, although both criteria struggle to identify the optimal Hausdorff distance (except when ), when comparing the tuning parameters selected by the BIC-type criterion to those by the lossval criterion, the lossval criterion exhibits a slight advantage; its optimal region (the white area in the subfigures) is larger, especially as increases. For example, when , the white region extends to the upper right corner, which is preferable to the lower region. This suggests that the lossval criterion may be more effective in identifying the optimal parameters. These advantages of the lossval criterion over the BIC-type criterion are further corroborated by the results presented in the previous subsection.
The figures also shed light on the reasons behind the BIC-type criterion’s poor performance in the context of GFDtL. Specifically, the BIC-type criterion shows a preference for larger values, which corresponds to fewer breaks in the estimation. This preference can be directly attributed to the definition of BIC (cf. (5.2)). In particular, when there are breaks, is at least , which typically dominates the loss value term in the BIC formula. Consequently, the BIC-type criterion tends to favor estimators with fewer breaks, leading to suboptimal performance in detecting the true number of breaks, especially in scenarios with a higher number of actual breaks.
Furthermore, the F1 score surfaces provide additional insights into the model’s performance across different parameter combinations. The gradual transition from darker to lighter colors as increases (for fixed ) suggests that the model’s ability to correctly identify true positives improves with larger values, up to a certain point. This observation aligns with the lossval criterion’s preference for slightly larger values compared to the BIC-type criterion.
6.3 Empirical computational complexity analysis
The computational complexity of Algorithm 1 is influenced by several factors, including the sample size , the dimension , the number of breaks , , and the pair . To empirically analyze this complexity, we conduct a series of experiments based on Setting (i) in Subsection 6.1, varying each factor individually. Specifically, for each factor, we perform 5 experiments with other factors held constant, and plot the averaged computation time to visualize the impact of each factor on the algorithm’s complexity. Here, we use . These experiments are conducted on a desktop with an Intel(R) Core(TM) CPU (10 cores and 20 threads) and 64 GB of RAM.
As displayed in Figure 2, the computation time is approximately linear in , quadratic in , and not significantly affected by and . The impact of is presented in Table 4, where one can observe that the computation times for and are notably shorter. This is because the algorithm terminates early as (5.3) is satisfied within the first few iterations. Additionally, the computation times for , , , , and are significantly higher, as these are marginal cases that are typically more challenging to solve. Apart from these marginal cases, for large and , Algorithm 1 requires nearly the same amount of time to converge.
[Figure 2: averaged computation time as each factor varies, with the other factors held constant.]
10 | 20 | 30 | 40 | 50 | 60 | 70 | 80 | 90 | 100 | 110 | 120 | 130 | 140 | 150 | 160 | 170 | 180 | 190 | 200 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0.1 | 3.52 | 33.71 | 12.47 | 6.71 | 5.17 | 4.17 | 5.58 | 5.58 | 5.83 | 5.73 | 5.58 | 5.51 | 5.52 | 5.51 | 5.53 | 5.44 | 5.52 | 5.50 | 5.57 | 5.47 |
0.2 | 6.60 | 14.00 | 6.15 | 4.28 | 3.68 | 3.98 | 4.13 | 4.19 | 4.13 | 4.15 | 4.16 | 4.15 | 4.14 | 4.14 | 4.15 | 4.17 | 4.15 | 4.15 | 4.15 | 4.15 |
0.3 | 27.25 | 7.13 | 3.69 | 2.98 | 2.67 | 2.91 | 2.90 | 2.91 | 2.91 | 2.91 | 2.91 | 2.91 | 2.91 | 2.92 | 2.91 | 2.91 | 2.92 | 2.91 | 2.93 | 2.91 |
0.4 | 36.05 | 5.11 | 2.75 | 2.40 | 2.16 | 2.16 | 2.16 | 2.16 | 2.16 | 2.16 | 2.16 | 2.22 | 2.19 | 2.21 | 2.18 | 2.20 | 2.21 | 2.21 | 2.21 | 2.23 |
0.5 | 23.22 | 4.65 | 2.68 | 2.28 | 2.11 | 2.12 | 2.12 | 2.11 | 2.12 | 2.12 | 2.18 | 2.17 | 2.17 | 2.21 | 2.17 | 2.18 | 2.18 | 2.17 | 2.17 | 2.17 |
0.6 | 14.14 | 4.64 | 2.81 | 2.30 | 2.30 | 2.22 | 2.23 | 2.17 | 2.20 | 2.21 | 2.24 | 2.20 | 2.24 | 2.15 | 2.20 | 2.14 | 2.17 | 2.17 | 2.19 | 2.18 |
0.7 | 10.38 | 4.78 | 2.87 | 2.19 | 2.26 | 2.30 | 2.15 | 2.18 | 2.16 | 2.23 | 2.17 | 2.19 | 2.21 | 2.21 | 2.21 | 2.15 | 2.18 | 2.18 | 2.19 | 2.19 |
0.8 | 9.17 | 4.72 | 2.88 | 2.46 | 2.22 | 2.41 | 2.59 | 2.48 | 2.20 | 2.15 | 2.14 | 2.15 | 2.18 | 2.23 | 2.24 | 2.43 | 2.27 | 2.35 | 2.20 | 2.26 |
0.9 | 8.73 | 4.80 | 3.03 | 2.45 | 2.18 | 2.18 | 2.21 | 2.15 | 2.15 | 2.13 | 2.16 | 2.21 | 2.21 | 2.14 | 2.15 | 2.13 | 2.20 | 2.17 | 2.13 | 2.13 |
1.0 | 8.06 | 4.60 | 2.84 | 2.13 | 2.14 | 2.13 | 2.15 | 2.14 | 2.13 | 2.13 | 2.13 | 2.14 | 2.14 | 2.14 | 2.15 | 2.16 | 2.14 | 2.14 | 2.13 | 2.14 |
7 Real data experiment
In this section, the relevance of our method is compared with the GFGL through a portfolio allocation experiment based on real financial data. The same computer was employed as in Subsection 6.1. We consider hereafter the stochastic process in of log-stock returns, where with the stock price of the -th index at time . The portfolio allocation will be performed with stock data from the S&P 500, which are representative of different economic sectors (the data can be found on https://finance.yahoo.com or https://macrobond.com, and are provided in the GitHub repository): Alphabet, Amazon, American Airlines, Apple, Berkshire Hathaway, Boeing, Chevron, Equity Residential, ExxonMobil, Ford, General Electric, Goldman Sachs, Jacobs Engineering Group, JPMorgan, Lockheed Martin, Pfizer, Procter & Gamble, United Parcel Service, Verizon, Walmart. The sample period is November 11, 2019 – March 27, 2020, corresponding to observations.
We analyse the economic performances obtained by the GFDtL and GFGL through the GMVP investment problem. Following [8], the latter problem at time , in the absence of short-sales constraints, is defined as
where is the vector of portfolio weights for time chosen at time , is the conditional (on the past) covariance matrix of . The explicit solution is given by . Now it is natural to replace by an estimator , yielding . As a function depending on only, the GMVP performance essentially depends on the precise measurement of the latter or, equivalently, the precision matrix. In our setting, we set , estimated by the GFDtL and GFGL procedures, where are selected by Method c described in Subsection 5.1. We also consider the equally weighted portfolio, which will be denoted by . The following performance metrics (annualized) will be reported: AVG as the average of portfolio returns computed as , multiplied by ; SD as the standard deviation of portfolio returns, multiplied by ; IR as the information ratio computed as .
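A minimal sketch of the GMVP weight computation from an estimated precision matrix (the estimator plugged in is the one produced by GFDtL or GFGL for the relevant regime):

```python
import numpy as np

def gmvp_weights(precision):
    # w = Theta 1 / (1' Theta 1), i.e., the explicit GMVP solution with the
    # inverse covariance replaced by the estimated precision matrix.
    ones = np.ones(precision.shape[0])
    w = precision @ ones
    return w / (ones @ w)
```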
The key performance measure is SD. The GMVP problem essentially aims to minimize the variance rather than to maximize the expected return. Hence, as emphasized in [9], Section 6.2, high AVG and IR are desirable but should be considered of secondary importance compared with the quality of the measure of a covariance matrix estimator. We also report the number of breaks detected by the procedure.
Both GFGL and GFDtL estimate the following break dates: February 25, 2020; March 9, 2020; March 17, 2020. These breaks are in line with the COVID-19 outbreak and its aftermath: the S&P 500 index entered a downward trend from February 20, 2020, and reached a minimum value on March 23, 2020, which precedes the rally. Despite the presence of the COVID-19 shock, our proposed GFDtL procedure provides the lowest SD and clearly outperforms the GFGL. The BIC-based selection results in relevant estimations of the break dates.
S&P 500 data | AVG | SD | IR | Breaks
---|---|---|---|---
GFDtL | -50.29 | 35.76 | -1.41 | 3 |
GFGL | -101.44 | 50.00 | -2.03 | 3 |
Equally weighted | -77.77 | 45.69 | -1.70 | -
Note: The lowest SD figure is in bold face.
8 Concluding remarks
We propose a filtering procedure for the estimation of the number of change-points in the precision matrix of a vector-valued random process, whose full structure can “break” over time. We show that the estimated break dates are sufficiently close to the true break dates together with the consistency of the estimated precision matrix in each regime. We propose an ADMM-based algorithm to solve the optimization problem with breaks and study its properties. The simulation results illustrate the relevance of our method compared to the Gaussian likelihood GFGL in terms of change-point detection and graph recovery. They also emphasize the difficulty of devising a strategy to select the optimal pair .
A potential extension consists in the specification of adaptive weights in both the Group Fused penalty and the LASSO penalty: indeed, unless assuming high-level conditions such as the irrepresentable condition of [45], the LASSO penalty requires a correction, such as the adaptive version, to ensure the recovery of the sparse structure provided that the estimated break dates are close to the true ones. This would require the computation of some first step consistent estimator that would enter the penalty functions.
Acknowledgements
Benjamin Poignard was supported by JSPS KAKENHI (22K13377) and RIKEN. Ting Kei Pong was supported partly by the Hong Kong Research Grants Council PolyU153002/21p. Akiko Takeda was supported by JSPS KAKENHI (23H03351).
Appendix A Intermediary results
Proof.
Let us prove Point (i). Recall that is the true variance-covariance of , with . Now take any , with such that . Then, by the triangle inequality, under Assumption 2:
We show . By Assumption 1, we can apply Theorem 1 of [26] on : the latter result states the mixing condition for some ; then for , we may take , which allows us to apply their Theorem 1. Thus, there exist constants such that, , ,
Take large enough. Applying the previous inequality, by Bonferroni’s inequality:
which goes to as by Assumption 3-(i) implying . Since for ,
under . To prove Point (ii), we rely on the inequality
The result follows from the bound derived on . ∎
The next Lemma will be useful to bound the first order derivative w.r.t. of the non-penalized D-trace loss function.
Lemma A.2.
Proof.
The result follows the same steps as in the proof of Lemma A.1. ∎
Lemma A.3.
Consider problem (2.1). Define , and . The GFDtL estimator satisfies the conditions
where are the sub-gradient matrices defined by
and and for , satisfies
Proof.
Defining , and , the problem stated in (2.1) can be recast as a minimization of the function
Invoking subdifferential calculus, a necessary and sufficient condition for to minimize is that for all , belongs to the subdifferential of with respect to at , that is
with the subgradient matrices defined as: and
Now if for is one of the estimated break dates, then , and
since the breaks cannot occur at and . When , then the first order condition with respect to yields
so that
∎
Lemma A.4.
Let be a given set of -dimensional vectors with . For an arbitrary fixed , let be defined as
where is the -th element of . Then has a Slater point.
Proof.
Since , there exist a large and a small such that
where and for all . We can see from the above display that
(A.1)
To find and such that for all , we need
(A.2)
We claim that the following choice of defined recursively (starting from ) satisfies (A.2):
Indeed, it is routine to check that the second and third lines of (A.2) are satisfied. Then, using and the above display recursively, we have
Then by (A.1), . Hence is a Slater point of . ∎
Appendix B Proofs
B.1 Proof of Theorem 3.1
Proof of point (i).
The proof builds upon the works of [17], Proposition 3, [30], Theorem 3.1 and [14], Theorem 1. We define:
By union bound, , . So we aim to show:
with the complement of .
Proof of (a). We show:
where . We prove as the other case follows in the same spirit. In light of :
(B.1)
By Lemma A.3, with and , let , in form:
and
Therefore, under , taking the differences, by the triangle inequality, we obtain:
Each component of is bounded by . We deduce by the triangle inequality:
(B.2)
where the first equality holds since and for by (B.1). Let the event:
Since inequality (B.2) holds with probability one, then . Therefore, we have:
Let us first bound . Since , for :
with with probability tending to one by Lemma A.1, and . By , in Assumption 3-(iii), we deduce . We now bound . For any :
with with probability tending to one by Lemma A.1. We now need to evaluate the bound for . To do so, we rely on the KKT conditions of Lemma A.3. Note that with probability tending to one, . We have when as given and given . Therefore, by Lemma A.3 with and , following the steps to obtain inequality (B.2), we get
Therefore, denoting by , conditional on :
By Lemma A.1, with probability tending to one. We deduce
The first term tends to zero since and by Assumption 3-(ii) and (iii). As for the second term, using , note that
Therefore, for finite, applying Lemma A.2, we deduce that for any :
since . Hence, . Let us now consider . Applying the same reasoning to show the convergence of the second summation on the right-hand side of (B.1), we get
when , and
since with probability tending to one, and , then under , we deduce . Consequently, we proved .
Proof of (b). We prove (b) by showing and . As in the proof of (a), we simply show . To do so, we define:
where . Then, we have:
We first bound . For any :
since implies . Moreover, since
we deduce:
(B.4)
Let us treat the first term. By Lemma A.3 with and , we obtain:
and
We deduce
As a consequence:
(B.5)
In the same vein, applying Lemma A.3 with and , we obtain:
where with probability one. Let the event:
Therefore, by the triangle inequality, (B.5) and (B.1) imply that the event holds with probability one. Hence:
(B.8)
The first term in (B.8) tends to zero under , , and . Moreover, note that
and
under Assumption 3-(ii)-(iii). In the same manner, we can show that the second term in (B.4) tends to zero.
We now consider . The probability of the event is upper bounded by:
Now implies and for any and:
Therefore:
First, we consider the second term of the right-hand side of (B.1). Let in (B.1), then holds with probability one. Therefore:
Since , then . So under the conditions , the right-hand side of the previous inequality converges to zero. As for the first term of (B.1), applying in (B.1):
The right-hand side of the last inequality converges to zero under the same conditions. Finally, we can prove that .
Proof of point (ii).
By point (i) and under Assumption 3-(ii), for any , , which is under Assumption 3-(ii). Hence, or is satisfied for any . Set and assume and consider two cases: (ii-a) and (ii-b) . In case (ii-a), by Lemma A.3 with change-points and :
Therefore, we deduce
Therefore, using part (i) of Theorem 3.1, we obtain
where with probability tending to one. We deduce
It then follows that
(B.10)
In case (ii-b), by Lemma A.3, with change-points and , we have
With with probability tending to one, we deduce
Hence, (B.10) holds. Using similar arguments, we can show that the latter is satisfied when .
B.2 Proof of Theorem 3.2
Using the result of Theorem 3.1, we aim to show that:
(B.11)
To do so, we define:
The probability (B.11) can be bounded as:
We first focus on , which can be expressed as:
By Lemma A.3 with change-points and , given the case :
and
Therefore, taking the differences, we get:
Then, the event defined as
where with probability tending to one, holds with probability one. Hence, we deduce
with
In the same vein as in the analysis of (B.8), we can show that as . requires more arguments. By Lemma A.3, with change-points and :
and
Therefore
which implies
with with probability tending to one. We deduce
where with probability tending to one. The first term in the second inequality of (B.2) tends to zero under the conditions and . Under , the second term also tends to zero. Therefore, we conclude as . Based on similar arguments, we can show as ; hence as . Similarly, it can be proved that as .
We now consider . Define
First, we consider . By Lemma A.3, for the change-points and , we obtain
and for the change-points and , we get
Moreover, by the triangle inequality, we have
with with probability tending to one. So is upper bounded as follows
which tends to zero in the spirit as in (B.8). For , by Lemma A.3 with change-points and to obtain (B.2) and with change-points , , we get
By the triangle inequality, we have
with with probability tending to one. Therefore, we obtain
which tends to zero based on arguments similar to those used for (B.8). For , by Lemma A.3 with change-points and , we have
and with change-points , , we get
By the triangle inequality, we deduce
with , with probability tending to one. We deduce
which tends to zero based on the same arguments as in the convergence of (B.8). Finally, to analyze , we apply Lemma A.3 with to obtain (B.2) and with to obtain (B.2). By the triangle inequality, we have
we deduce
which also tends to zero, as in the proof of the convergence to zero of (B.8). We conclude that as .
B.3 Proof of Proposition 4.1
Proof of point (i).
We can rewrite (4.1) as the following constrained optimization problem:
(B.18)
where we write for short; is the indicator function of the set ; is a matrix whose diagonal elements are 0; is the copy of with the diagonal elements set to 0.
Denote the dual variables by for simplicity, where , , and for all . The Lagrangian function of (B.18) is
where for convenience, we set ; we further note that . Now, letting , we have
Therefore, we obtain the dual problem as in (4.2). Finally, the equality of the optimal values follows from [32, Theorem 31.1] upon noting that there exists with for all satisfying the equality constraints in (B.18).
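For intuition, the derivation above follows the standard Lagrangian pattern. For a generic problem $\min_{x} f(x)$ subject to $Ax=b$, with $f$ closed, proper and convex (generic notation, used only to display the mechanism), minimizing the Lagrangian in the primal variable gives the dual function:
\[
\inf_{x}\big\{f(x)+\langle y,\,Ax-b\rangle\big\}\;=\;-f^{*}(-A^{\top}y)-\langle b,\,y\rangle ,
\]
and the resulting dual problem $\sup_{y}\{-f^{*}(-A^{\top}y)-\langle b,y\rangle\}$ has the same optimal value as the primal under a constraint qualification, such as the existence of a feasible point in the relative interior of $\operatorname{dom} f$.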
B.4 Proof of Proposition 4.2
Proof of point (i).
We first rewrite (4.4) as the following constrained optimization problem:
(B.19)
where we write for short; is a matrix whose diagonal elements are 0; is the copy of with the diagonal elements set to 0.
Denote the dual variables by for simplicity, where , , and for all . The Lagrangian function of (B.19) is
where for convenience, we set ; we also note that . Now, letting , we have
(B.20)
For the problem for each , it holds that
For ①, we have
(B.21)
where the first equality follows from the (equality case in) Cauchy-Schwarz inequality; .
For ②, one can see that
(B.22)
where (a) comes from the (equality case in) Cauchy-Schwarz inequality, and (b) is true since the minimum is attained at . One can see this by first locating the vertex of the quadratic objective.
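Both facts are elementary; as a worked illustration with generic symbols (a nonzero vector $a$, a radius $r>0$, and scalars $\alpha>0$, $\beta$, none of which refer to the quantities in the proof),
\[
\min_{\|u\|_2\le r}\,\langle a,u\rangle=-r\|a\|_2 \quad\text{at } u=-r\,\frac{a}{\|a\|_2},
\qquad
\min_{t\in\mathbb{R}}\Big\{\frac{t^{2}}{2\alpha}-\beta t\Big\}=-\frac{\alpha\beta^{2}}{2}\quad\text{at } t=\alpha\beta .
\]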
Using (B.21) and (B.22), we have
The above display is exactly the definition of . Using this and (B.20), we can conclude that the dual problem of (4.4) is (4.5). Finally, the equality of optimal values follows from [32, Theorem 31.1] upon noting that there exists with for all satisfying the equality constraints in (4.4).
B.5 Proof of Proposition 4.3
For simplicity, for a given pair of fixed and , we denote the objective functions of (4.1) and (4.4) with by and , respectively. From the definition of in (4.3), we know that
(B.23)
Proof of point (i).
Suppose that are such that (4.1) has solutions.
Let be an arbitrary solution to (4.1) and define
Then it holds that
(B.24)
Fix any . Suppose that is a solution to (4.4) with this ; then we have
where (a) comes from (B.23); (b) holds thanks to the assumption that is a solution to (4.4) with ; (c) is true because of (B.24). Therefore, is also a solution to (4.1).
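In schematic form, writing $f$ for the objective of (4.1) and $\tilde{f}$ for the objective of (4.4) (symbols introduced here only to display the chain), and denoting by $\widetilde{\Omega}$ the considered solution to (4.4) and by $\bar{\Omega}$ the solution to (4.1), the three steps read
\[
f(\widetilde{\Omega})\;\overset{(a)}{\le}\;\tilde{f}(\widetilde{\Omega})\;\overset{(b)}{\le}\;\tilde{f}(\bar{\Omega})\;\overset{(c)}{=}\;f(\bar{\Omega}),
\]
so that $\widetilde{\Omega}$ attains the optimal value of (4.1).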
Proof of point (ii).
It suffices to show that, for arbitrary fixed , if there exists such that there exists a solution to (4.4) with that satisfies
(B.25)
then (4.1) has solutions. To this end, we notice from (B.25) and the definition of in (4.3) that
(B.26)
where and are the subdifferentials of and at , respectively.
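For completeness, we recall the standard facts underlying this subdifferential condition (stated generically): a point $x^{\star}$ minimizes a proper convex function $h$ if and only if $0\in\partial h(x^{\star})$, and for proper convex functions $h_1,h_2$ such that $h_1$ is continuous at some point of $\operatorname{dom} h_2$,
\[
\partial\big(h_1+h_2\big)(x)\;=\;\partial h_1(x)+\partial h_2(x)\qquad\text{for all } x .
\]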
References
- Bauschke and Combettes [2011] H. H. Bauschke and P. L. Combettes. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer New York, 2011. ISBN 9781441994677. 10.1007/978-1-4419-9467-7.
- Bleakley and Vert [2011] K. Bleakley and J. Vert. The group fused lasso for multiple change-point detection. HAL, Technical Report, Computational Biology Center, Paris, 2011.
- Cai et al. [2016] T. Cai, W. Liu, and H. Zhou. Estimating sparse precision matrix: Optimal rates of convergence and adaptive estimation. The Annals of Statistics, 44(2):455–488, 2016. 10.1214/13-AOS1171.
- Chan et al. [2014] N. Chan, C. Yau, and R.-M. Zhang. Group lasso for structural break time series. Journal of the American Statistical Association, 109(506):590–599, 2014. 10.1080/01621459.2013.866566.
- Chen et al. [2017] L. Chen, D. Sun, and K.-C. Toh. An efficient inexact symmetric Gauss–Seidel based majorized ADMM for high-dimensional convex composite conic programming. Mathematical Programming, 161(1–2):237–270, Apr. 2017. ISSN 1436-4646. 10.1007/s10107-016-1007-5.
- Eckstein [1994] J. Eckstein. Some saddle-function splitting methods for convex programming. Optimization Methods and Software, 4(1):75–83, Jan. 1994. ISSN 1029-4937. 10.1080/10556789408805578.
- Eckstein and Bertsekas [1992] J. Eckstein and D. P. Bertsekas. On the Douglas—Rachford splitting method and the proximal point algorithm for maximal monotone operators. Mathematical Programming, 55(1–3):293–318, Apr. 1992. ISSN 1436-4646. 10.1007/bf01581204.
- Engle and Colacito [2006] R. Engle and R. Colacito. Testing and valuing dynamic correlations for asset allocation. Journal of Business & Economic Statistics, 24(2):238–253, 2006. 10.1198/073500106000000017.
- Engle et al. [2019] R. Engle, O. Ledoit, and M. Wolf. Large dynamic covariance matrices. Journal of Business & Economic Statistics, 37(2):363–375, 2019. 10.1080/07350015.2017.1345683.
- Fazel et al. [2013] M. Fazel, T. K. Pong, D. Sun, and P. Tseng. Hankel matrix rank minimization with applications to system identification and realization. SIAM Journal on Matrix Analysis and Applications, 34(3):946–977, Jan. 2013. ISSN 1095-7162. 10.1137/110853996.
- Fortin and Glowinski [1983] M. Fortin and R. Glowinski. Augmented Lagrangian Methods: Applications to the Numerical Solution of Boundary-Value Problems. Studies in Mathematics and Its Applications. Elsevier, 1983.
- Gabay and Mercier [1976] D. Gabay and B. Mercier. A dual algorithm for the solution of nonlinear variational problems via finite element approximation. Computers & Mathematics with Applications, 2(1):17–40, 1976. ISSN 0898-1221. 10.1016/0898-1221(76)90003-1.
- Gibberd and Nelson [2017] A. Gibberd and J. Nelson. Regularized estimation of piecewise constant Gaussian graphical models: the group-fused graphical lasso. Journal of Computational and Graphical Statistics, 26(3):623–634, 2017. 10.1080/10618600.2017.1302340.
- Gibberd and Roy [2021] A. Gibberd and S. Roy. Consistent multiple changepoint estimation with fused Gaussian graphical models. Annals of the Institute of Statistical Mathematics, 73:283–309, 2021. 10.1007/s10463-020-00749-0.
- Glowinski and Marroco [1975] R. Glowinski and A. Marroco. Sur l’approximation, par éléments finis d’ordre un, et la résolution, par pénalisation-dualité d’une classe de problèmes de dirichlet non linéaires. Revue française d’automatique, informatique, recherche opérationnelle. Analyse numérique, 9(R2):41–76, 1975. ISSN 0397-9342. 10.1051/m2an/197509r200411.
- Hallac et al. [2017] D. Hallac, Y. Park, S. Boyd, and J. Leskovec. Network inference via the time-varying graphical lasso. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2017. 10.1145/3097983.3098037.
- Harchaoui and Lévy-Leduc [2010] Z. Harchaoui and C. Lévy-Leduc. Multiple change-point estimation with a total variation penalty. Journal of the American Statistical Association, 105(492):1480–1493, 2010. 10.1198/jasa.2010.tm09181.
- He et al. [2002] B. He, L.-Z. Liao, D. Han, and H. Yang. A new inexact alternating directions method for monotone variational inequalities. Mathematical Programming, 92(1):103–118, Mar. 2002. ISSN 0025-5610. 10.1007/s101070100280.
- Ji et al. [2021] J. Ji, Y. He, L. Liu, and L. Xie. Brain connectivity alteration detection via matrix-variate differential network model. Biometrics, 77(4):1409–1421, 2021. 10.1111/biom.13359.
- Kolar and Xing [2012] M. Kolar and E. Xing. Estimating networks with jumps. Electronic Journal of Statistics, 6:2069–2106, 2012. 10.1214/12-EJS739.
- Li et al. [2016a] D. Li, J. Qian, and L. Su. Panel data models with interactive fixed effects and multiple structural breaks. Journal of the American Statistical Association, 111(516):1804–1819, 2016a. 10.1080/01621459.2015.1119696.
- Li and Lin [2019] H. Li and Z. Lin. Accelerated alternating direction method of multipliers: An optimal O(1 / k) nonergodic analysis. Journal of Scientific Computing, 79(2):671–699, Dec. 2019. ISSN 1573-7691. 10.1007/s10915-018-0893-5.
- Li et al. [2016b] X. Li, D. Sun, and K.-C. Toh. A Schur complement based semi-proximal ADMM for convex quadratic conic programming and extensions. Mathematical Programming, 155(1–2):333–373, Dec. 2016b. ISSN 1436-4646. 10.1007/s10107-014-0850-5.
- Li et al. [2018] X. Li, D. Sun, and K.-C. Toh. QSDPNAL: a two-phase augmented Lagrangian method for convex quadratic semidefinite programming. Mathematical Programming Computation, 10(4):703–743, Apr. 2018. ISSN 1867-2957. 10.1007/s12532-018-0137-6.
- Li et al. [2019] X. Li, D. Sun, and K.-C. Toh. A block symmetric Gauss–Seidel decomposition theorem for convex composite quadratic programming and its applications. Mathematical Programming, 175(1–2):395–418, Feb. 2019. ISSN 1436-4646. 10.1007/s10107-018-1247-7.
- Merlevède et al. [2011] F. Merlevède, M. Peligrad, and E. Rio. A Bernstein type inequality and moderate deviations for weakly dependent sequences. Probability Theory and Related Fields, 151:435–474, 2011. 10.1007/s00440-010-0304-9.
- Monti et al. [2014] R. Monti, P. Hellyer, D. Sharp, R. Leech, C. Anagnostopoulos, and G. Montana. Estimating time-varying brain connectivity networks from functional MRI time series. NeuroImage, 103:427–443, 2014. 10.1016/j.neuroimage.2014.07.033.
- Ouyang et al. [2015] Y. Ouyang, Y. Chen, G. Lan, and E. Pasiliao. An accelerated linearized alternating direction method of multipliers. SIAM Journal on Imaging Sciences, 8(1):644–681, Jan. 2015. ISSN 1936-4954. 10.1137/14095697x.
- Poignard and Asai [2023] B. Poignard and M. Asai. Estimation of high-dimensional vector autoregression via sparse precision matrix. The Econometrics Journal, 26(2):307–326, 2023. 10.1093/ectj/utad003.
- Qian and Su [2016a] J. Qian and L. Su. Shrinkage estimation of regression models with multiple structural changes. Econometric Theory, 32(6):1376–1433, 2016a. 10.1017/S0266466615000237.
- Qian and Su [2016b] J. Qian and L. Su. Shrinkage estimation of common breaks in panel data models via adaptive group fused lasso. Journal of Econometrics, 191(1):86–109, 2016b. 10.1016/j.jeconom.2015.09.004.
- Rockafellar [1970] R. T. Rockafellar. Convex Analysis. Princeton University Press, Dec. 1970. ISBN 9781400873173. 10.1515/9781400873173.
- Roy et al. [2017] S. Roy, Y. Atchadé, and G. Michailidis. Change point estimation in high dimensional Markov random-field models. Journal of the Royal Statistical Society Series B: Statistical Methodology, 79(4):1187–1206, 2017. 10.1111/rssb.12205.
- Sabach and Teboulle [2022] S. Sabach and M. Teboulle. Faster Lagrangian-based methods in convex optimization. SIAM Journal on Optimization, 32(1):204–227, Feb. 2022. ISSN 1095-7189. 10.1137/20m1375358.
- Sun et al. [2024] D. Sun, Y. Yuan, G. Zhang, and X. Zhao. Accelerating preconditioned ADMM via degenerate proximal point mappings, 2024.
- Xiao et al. [2018] Y. Xiao, L. Chen, and D. Li. A generalized alternating direction method of multipliers with semi-proximal terms for convex composite conic programming. Mathematical Programming Computation, 10(4):533–555, Jan. 2018. ISSN 1867-2957. 10.1007/s12532-018-0134-9.
- Xu [2017] Y. Xu. Accelerated first-order primal-dual proximal methods for linearly constrained composite convex programming. SIAM Journal on Optimization, 27(3):1459–1484, Jan. 2017. ISSN 1095-7189. 10.1137/16m1082305.
- Yang et al. [2023] B. Yang, X. Zhao, X. Li, and D. Sun. An accelerated proximal alternating direction method of multipliers for optimal decentralized control of uncertain systems, 2023.
- Yang et al. [2021] L. Yang, J. Li, D. Sun, and K. Toh. A fast globally linearly convergent algorithm for the computation of Wasserstein barycenters. Journal of Machine Learning Research, 22:1–37, 2021. URL http://jmlr.org/papers/v22/19-629.html.
- Yuan et al. [2017] H. Yuan, R. Xi, C. Chen, and M. Deng. Differential network analysis via lasso penalized D-trace loss. Biometrika, 104(4):755–770, 2017. 10.1093/biomet/asx049.
- Yuan et al. [2019] H. Yuan, S. He, and M. Deng. Compositional data network analysis via lasso penalized D-trace loss. Bioinformatics, 35(18):3404–3411, 2019. 10.1093/bioinformatics/btz098.
- Zhang et al. [2022] G. Zhang, Y. Yuan, and D. Sun. An efficient HPR algorithm for the Wasserstein barycenter problem with computational complexity, 2022.
- Zhang et al. [2024] S. Zhang, H. Wang, and W. Lin. Care: Large precision matrix estimation for compositional data. Journal of the American Statistical Association, 2024. 10.1080/01621459.2024.2335586.
- Zhang and Zou [2014] T. Zhang and H. Zou. Sparse precision matrix estimation via lasso penalized D-trace loss. Biometrika, 101(1):103–120, 2014. 10.1093/biomet/ast059.
- Zhao and Yu [2006] P. Zhao and B. Yu. On model selection consistency of lasso. Journal of Machine Learning Research, 7(90):2541–2563, 2006. URL https://www.jmlr.org/papers/v7/zhao06a.html.
- Zhou et al. [2010] S. Zhou, J. Lafferty, and L. Wasserman. Time varying undirected graphs. Machine Learning, 80:295–319, 2010. 10.1007/s10994-010-5180-0.