Decentralized Inexact Proximal Gradient Method With Network-Independent Stepsizes for Convex Composite Optimization
Abstract
This paper proposes a novel CTA (Combine-Then-Adapt)-based decentralized algorithm for solving convex composite optimization problems over undirected and connected networks, where each local loss function contains both a smooth and a nonsmooth term. The proposed algorithm uses uncoordinated, network-independent constant stepsizes and only needs to solve a sequence of proximal mappings approximately, which is advantageous for decentralized composite optimization problems in which the proximal mappings of the nonsmooth loss functions may not admit analytical solutions. For the general convex case, we prove a sublinear ergodic convergence rate for the proposed algorithm, which improves to a non-ergodic rate if the proximal mappings are solved exactly. Furthermore, under metric subregularity, we establish a linear convergence rate for the proposed algorithm. Numerical experiments demonstrate the efficiency of the algorithm.
Index Terms:
Decentralized optimization, inexact proximal gradient method, linear convergence, network independence.

I Introduction
This paper considers the following decentralized composite optimization over networks with computational agents:
minimize_x  sum_{i=1}^{n} [ f_i(x) + g_i(x) ]        (1)
where the local loss functions f_i and g_i are convex and can only be accessed by the corresponding agent. We assume that f_i is smooth with Lipschitz continuous gradient, and that g_i is proper and closed, but not necessarily smooth. This setting is general and arises in various application fields, including multi-agent control, wireless communication, and decentralized machine learning [1].
Arguably, the most straightforward approach to the decentralized setting is to employ the classical gradient descent method combined with the network structure [2, 3, 4, 5]. On the one hand, in [2, 3, 4], a mixing matrix is used to drive the local states to consensus through weighted averaging of neighbor states. These algorithms can be regarded as quadratic penalty methods for the equality consensus constraints, which leads to inexact convergence when the stepsizes are fixed [3]. To obtain the exact optimal solution, diminishing stepsizes are adopted in [4]. On the other hand, by using the sign of the relative state rather than averaging the local iterates via the mixing matrix, [5] proposes an algorithm based on the exact penalty method. However, these primal methods cannot achieve exact convergence under constant stepsizes.
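This inexact-convergence phenomenon is easy to reproduce. The sketch below is a toy illustration (our own construction, not code from the paper): two agents run a DGD-style update x^{k+1} = W x^k − α ∇f(x^k) on scalar quadratics f_i(x) = 0.5 (x − c_i)^2, and with a constant stepsize the iterates stall at a biased fixed point; in this particular toy example the distance to the optimum x* = 1 works out to α/(1+α).

```python
import numpy as np

# Toy example (illustrative, not from the paper): two agents minimizing
# f_i(x) = 0.5*(x - c_i)^2, so the global optimum is x* = mean(c) = 1.
c = np.array([0.0, 2.0])
W = np.array([[0.5, 0.5], [0.5, 0.5]])  # symmetric, doubly stochastic
alpha = 0.1                              # fixed (constant) stepsize

x = np.zeros(2)
for _ in range(300):
    # DGD: combine neighbor states via W, then take a local gradient step
    x = W @ x - alpha * (x - c)

# With a constant stepsize, DGD stalls at a biased fixed point:
# here each agent's error is alpha/(1+alpha) ~ 0.0909, not zero.
print(np.abs(x - 1.0))
```

Shrinking α reduces the bias but slows convergence, which is exactly the trade-off that motivates diminishing stepsizes in [4].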
A recent line of work in the primal-dual domain provides a number of alternative methods that converge exactly to the optimum with fixed stepsizes. When the nonsmooth terms vanish and the smooth parts are strongly convex, several linearly convergent algorithms have been proposed, including D-ADMM [6], DLM [7], EXTRA [8], DIGing [9], Harnessing [10], AugDGM [11, 12], and Exact Diffusion [13]. When the nonsmooth terms are identical among all the agents, under the assumption that the smooth parts are strongly convex, [14, 15, 16, 17] present several linearly convergent algorithms for problem (1). When the nonsmooth terms may be distinct among the agents, IC-ADMM [18], PG-ADMM [19], PG-EXTRA [20], NIDS [21], and PAD [22] have been proposed. In [18], the convergence of IC-ADMM is proved under strong convexity of the smooth parts. In [19], PG-ADMM is shown to have a sublinear ergodic convergence rate. Based on EXTRA [8], PG-EXTRA and NIDS are developed in [20] and [21], respectively. PG-EXTRA is shown to have a sublinear convergence rate, which improves when the smooth part vanishes. NIDS, which allows a network-independent stepsize, is proved to have a sublinear non-ergodic convergence rate. In [22], an inexact-convergence algorithm called PAD is proposed via the quadratic penalty method and a linearized ADMM. For PAD, it is proved that for a fixed penalty parameter, the distance between the solution of problem (1) and the limit point of PAD is of the order of the penalty parameter; moreover, PAD has a sublinear ergodic convergence rate under general convexity and a linear convergence rate under strong convexity of the smooth parts.
Algorithm | Stepsize condition | Rate | Exactness | ||
PUDA[14] | strongly convex | ; proximal mapping of is efficiently computable | linear | exact | |
ABC-Algorithm[15] | |||||
NEXT[16]/SONATA[17] | |||||
IC-ADMM[18] | strongly convex | proximal mapping of is efficiently computable | converges | exact | |
PG-ADMM[19] | convex | ergodic: | |||
PG-EXTRA[20] | non-ergodic: | ||||
NIDS[21] | , | non-ergodic: | |||
PAD[22] | convex | proximal mapping of is efficiently computable | ergodic: | inexact | |
strongly convex | linear | ||||
D-iPGM | convex | , | ergodic: non-ergodic: | exact | |
metrically subregular | linear | ||||
and are the stepsizes of primal update and dual update, respectively. is the primal stepsize of local agent. is the Lipschitz constant of the gradients and . is the strong convexity constant of the functions . is the strong convexity parameter of the successive convex approximation surrogate of . is a given constant. is the signless Laplacian matrix. The definition of , , and can be found in Assumption 1 and Notations. |
These algorithms can be divided into ATC (Adapt-Then-Combine)- and CTA (Combine-Then-Adapt)-based decentralized algorithms according to their characteristics [15]. Among ATC-based decentralized algorithms, several have network-independent constant stepsizes, such as AugDGM [11, 12], Exact Diffusion [13], the ABC-Algorithm [15], NEXT [16]/SONATA [17], and NIDS [21] (see [15] for more details). However, to the best of our knowledge, the choice of stepsize in existing CTA-based decentralized algorithms is related to the network topology. Moreover, the algorithms mentioned above for composite optimization require the proximal mapping of the nonsmooth term to have an analytic solution or to be efficiently computable. In many applications, however, the proximal mapping has no analytic solution. Such cases include the generalized LASSO regularizer [23], the fused LASSO regularizer [24], the octagonal shrinkage and clustering algorithm for regression (OSCAR) regularizer [25], and the group LASSO regularizer [26]. This prompts us to introduce an inexact inner solver to compute an approximation of the proximal mapping. Several works focus on inexact versions of the ALM [27], the primal-dual hybrid gradient method [28], ADMM [29], and decentralized ALM [30, 31]; to our knowledge, the corresponding results for problem (1) are relatively scarce. Furthermore, for problem (1) with distinct nonsmooth parts, the linear convergence of the corresponding algorithms under suitable assumptions has not been established. Although [14] shows that when the smooth parts are strongly convex and the nonsmooth parts are convex with closed-form proximal mappings, dimension-independent exact global linear convergence cannot be attained for the composite optimization problem (1), this does not mean that a dimension-dependent linear rate is impossible under certain conditions. This naturally leads to the following three questions.
Question 1. Can one provide a CTA-based decentralized algorithm with network-independent constant stepsizes?
Question 2. Can one provide a decentralized algorithm when the proximal mapping of has no analytic solution?
Question 3. Can one establish the linear convergence of the proposed decentralized algorithm for the problem (1)?
This paper addresses Questions 1, 2, and 3 for problem (1) over an undirected and connected network. We develop a novel CTA-based decentralized inexact proximal gradient method (D-iPGM) with network-independent constant stepsizes. In the implementation of D-iPGM, computable approximations of the proximal mapping are allowed, which implies that the proposed algorithm can reduce the computational cost when the proximal mapping has no closed-form solution. Regarding convergence, we prove a sublinear ergodic convergence rate under general convexity. When the proximal mapping is solved exactly, we establish non-ergodic convergence of D-iPGM. We further prove linear convergence of D-iPGM under metric subregularity, which is weaker than strong convexity. Table I compares D-iPGM with existing decentralized algorithms for convex composite optimization problems over networks.
This paper is organized as follows. Section II casts the problem (1) into a constrained form, and based on the reformulation, the D-iPGM is proposed. Additionally, some relations between D-iPGM and PG-EXTRA [20] and NIDS [21] are discussed. Then, the convergence properties under general convexity and metric subregularity are investigated in Section III. Finally, the numerical experiments are implemented in Section IV and conclusions are given in Section V.
Notations: denotes the -dimensional vector space with inner product . The -norm, -norm, and -norm are denoted as , , and , respectively. , , and denote the null matrix, the identity matrix, and the all-ones matrix of appropriate dimensions, respectively. For , let . For a matrix , and denote the smallest and largest nonzero singular values of , respectively. If is symmetric, means that is positive definite. Let , and let be a nonempty closed convex subset of . For any , we define , , and . For a given closed proper convex function , the proximal mapping of relative to is defined as . When , we omit it from these notations. denotes the open Euclidean norm ball around with modulus .
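For instance, for the scaled ℓ1 norm g = λ‖·‖₁, the proximal mapping just defined has the familiar closed form of componentwise soft-thresholding. A minimal sketch (our own illustration, not code from the paper):

```python
import numpy as np

def prox_l1(v, lam):
    """prox_{lam*||.||_1}(v) = argmin_x  lam*||x||_1 + 0.5*||x - v||^2.
    The problem separates per coordinate; the minimizer is soft-thresholding:
    each component of v is shrunk toward 0 by at most lam."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

v = np.array([3.0, -0.5, 1.0])
print(prox_l1(v, 1.0))   # components with |v_i| <= lam are set to 0
```

Regularizers such as the generalized LASSO, whose nonsmooth part composes a norm with a matrix, lose this closed form, which is the situation targeted by the inexact solver developed later.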
II Decentralized Inexact Proximal Gradient Method
Consider an undirected and connected network , where denotes the vertex set, and the edge set specifies the connectivity in the network, i.e., a communication link between agents and exists iff . Let be the decision variable held by the agent . To cast the problem (1) into a consensus-constrained form and make sure that , we introduce the global mixing matrix , where , and make the following standard assumption in decentralized optimization.
Assumption 1.
The mixing matrix W is symmetric and doubly stochastic, and the null space of I − W is the consensus subspace spanned by the all-ones vector. Moreover, w_ij > 0 if (i, j) ∈ E or i = j; otherwise, w_ij = 0.
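A standard construction satisfying Assumption 1 is the Metropolis-Hastings rule, which is also used for the experiments in Section IV. A minimal sketch (our own illustration; the adjacency-list format and function name are assumptions, not the paper's code):

```python
import numpy as np

def metropolis_weights(adj):
    """Metropolis-Hastings mixing matrix for an undirected graph given as an
    adjacency list {node: [neighbors]}: w_ij = 1/(1 + max(deg_i, deg_j)) for
    each edge, with the diagonal absorbing the remaining mass of each row."""
    n = len(adj)
    deg = [len(adj[i]) for i in range(n)]
    W = np.zeros((n, n))
    for i in range(n):
        for j in adj[i]:
            W[i, j] = 1.0 / (1 + max(deg[i], deg[j]))
        W[i, i] = 1.0 - W[i].sum()
    return W

# Path graph 0 - 1 - 2
W = metropolis_weights({0: [1], 1: [0, 2], 2: [1]})
print(W)   # symmetric, doubly stochastic, eigenvalue 1 with multiplicity one
```

By construction W is symmetric with nonnegative entries and unit row sums, and for a connected graph the eigenvalue 1 is simple, as Assumption 1 requires.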
In addition, with this mixing matrix, similarly to [20], we introduce another mixing matrix defined as . By Assumption 1, the constraint is equivalent to the identity , which is again equivalent to . Then, we have the following equivalent formulation of problem (1).
(2) |
where . To guarantee the well-posedness of (1), it is necessary to suppose that an optimal solution of (1) exists. Throughout the paper, we make the following assumption.
Assumption 2.
The local function is convex and -smooth.
According to [32, Theorem 5.8] and Assumption 2, for any , we have the following commonly used inequalities
(3) | |||
(4) | |||
(5) |
where .
II-A Algorithm Development
Recall the ABC unified framework proposed in [15]:
(6a) | ||||
(6b) | ||||
(6c) |
where are suitably chosen weight matrices, is the primal stepsize, and the nonsmooth function is common to all agents. Let . According to this unified framework, when , by eliminating the dual variable , it holds that
and these algorithms are the ATC-based decentralized algorithms. Similarly, when , it deduces that
and these algorithms are the CTA-based decentralized algorithms. For ATC-based algorithms, the communication of both state information and local gradient information is involved. With this strategy, the stepsize conditions of ATC-based algorithms are often independent of the network topology; an example is NIDS [21], which exchanges gradient-adapted estimates. For CTA-based algorithms, only state information needs to be shared; however, their stepsize conditions are often related to the network topology; an example is PG-EXTRA [20], which exchanges only the state information. To blend the respective advantages of these two types of algorithms, we propose the following decentralized algorithm for problem (1).
(7a) | ||||
(7b) | ||||
(7c) |
where is the primal stepsize matrix, is the dual stepsize, is the “scaled” Lagrange multiplier, and is the Lagrange multiplier. To achieve a network independent stepsize, we use the simple primal correction step (7c). In Section III, we will prove that the algorithm (7) is convergent when , . By eliminating the dual variable , it holds that
Apparently, the implementation of the proposed algorithm only requires the neighboring variables , and there is only one round of communication in each iteration. Compared with (6), the proposed algorithm is indeed a CTA-based decentralized algorithm, and it is not covered by this primal-dual algorithmic framework.
On the other hand, when the proximal mapping has no analytical solution, solving subproblem (7a) to high accuracy is wasteful. It is more efficient to design a proper condition and stop the subproblem procedure, i.e., the inner iterations, once the condition is satisfied. By introducing an absolutely summable error criterion for subproblem (7a), we obtain D-iPGM (see Algorithm 1).
(8) |
(9a) | |||
(9b) |
(10) |
In Algorithm 1, to save computational expense, inexact proximal gradient descent is allowed in each iteration of D-iPGM; i.e., we can solve subproblem (8) inexactly by finding an approximate solution. This requires an appropriate inner iterative method, which is available for the majority of applications. For instance, for the generalized LASSO regularizer [23], the fused LASSO regularizer [24], the OSCAR regularizer [25], and the group LASSO regularizer [26], in which the nonsmooth term is the composition of a nonsmooth function with a given matrix, we can use operator splitting methods [33] or the accelerated linearized ADMM [34] to solve (8) with the inner stopping criterion (9).
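To make the inner solver concrete, the following sketch (an illustration under assumed notation, not the paper's implementation) approximates prox_g(v) for g(x) = λ‖Ax‖₁, the generalized-LASSO-type case with no closed form for general A, using a Chambolle-Pock (C-V type) primal-dual loop with a successive-difference stopping rule:

```python
import numpy as np

def inexact_prox(A, v, lam, tol, max_iter=20000):
    """Approximate prox of g(x) = lam*||A x||_1 at v, i.e. the minimizer of
    lam*||A x||_1 + 0.5*||x - v||^2, via the saddle-point reformulation
    min_x max_{||y||_inf <= lam} <A x, y> + 0.5*||x - v||^2 and
    Chambolle-Pock iterations. Names and stopping rule are illustrative."""
    m, n = A.shape
    tau = sigma = 0.99 / np.linalg.norm(A, 2)    # tau*sigma*||A||^2 < 1
    x, y, x_bar = v.copy(), np.zeros(m), v.copy()
    for _ in range(max_iter):
        # dual ascent, then projection onto the l_inf ball of radius lam
        y = np.clip(y + sigma * (A @ x_bar), -lam, lam)
        # primal step: prox of 0.5*||. - v||^2 with stepsize tau
        x_new = (x - tau * (A.T @ y) + tau * v) / (1 + tau)
        x_bar = 2 * x_new - x
        if np.linalg.norm(x_new - x) <= tol:     # inner stopping criterion
            return x_new
        x = x_new
    return x

# Sanity check: with A = I the prox reduces to soft-thresholding.
z = inexact_prox(np.eye(3), np.array([3.0, -0.5, 1.0]), 1.0, 1e-10)
print(np.round(z, 6))
```

The tolerance passed in plays the role of the summable inner accuracy: loosening it early and tightening it across outer iterations is exactly the strategy Algorithm 1 permits.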
Remark 1.
Since each nonsmooth term is convex, closed, and proper, by the definition of the proximal mapping, each of the subproblems (8) is strongly convex, so its exact solution is uniquely determined. From [29], for any given tolerance, an approximate solution can be found such that the absolutely summable error criterion (9) holds. Therefore, Algorithm 1 is well-defined. Moreover, by [28, Lemma 3.2], the inner loops of Algorithm 1 always terminate after a finite number of iterations.
Remark 2.
It follows from (9a) and the definition of the proximal mapping that the approximate solution coincides with the exact one when the inner error vanishes. Therefore, it can be seen as an approximation of the exact solution. To simplify the convergence analysis, we give the inner stopping criterion (9). However, in practice, we can alternatively use a criterion based on the successive difference of the inner iterates, because the inner error can be bounded by a constant multiple of this successive difference. We will specify this in Section IV.
Remark 3.
Let . Different from [28], we require rather than . Thus, by this setting, all agents have their own independent subproblems and independent error criteria for these subproblems, which implies that these subproblems can be solved in a decentralized manner. In addition, to control the residual , we introduce a summable sequence , which implies that we can choose by with any and , or by with . Apparently, when the proximal mapping of has a closed-form representation, we can set . In this scenario, we use the exact proximal gradient descent to perform the primal update, thus it has the same complexity as the existing proximal gradient methods [14, 15, 16, 17, 18, 19, 20, 21, 22].
II-B Comparison With PG-EXTRA [20] and NIDS [21]
In this subsection, we compare D-iPGM (), PG-EXTRA [20], and NIDS [21] from the perspective of operator splitting. Recall problem (2), which is equivalent to
(11) |
where , and is an indicator function defined as if ; otherwise, , which encodes the constraint . For compactness of exposition, we define the following operators
According to [35, Section 11.3.3], PG-EXTRA is equivalent to the C-V splitting algorithm [36, 37] applied to problem (11). In addition, it can also be derived by forward-backward operator splitting [33] under the metric defined by
i.e., it can be rewritten as
where . Since , the primal-dual update of PG-EXTRA is
(12a) | ||||
(12b) |
By eliminating the dual variable , it holds that
(13a) | ||||
(13b) |
In the implementation of PG-EXTRA, only the communication of the state variable is involved. Since the eigenvalues of lie in , and the multiplicity of eigenvalue is one, we obtain that the metric is positive definite. To ensure the convergence of PG-EXTRA, the positive definiteness of the following metric matrix is required
where . Thus, we have that the stepsize condition of PG-EXTRA is .
As for NIDS, by [35, Section 11.3.3], it is equivalent to the primal-dual three-operator splitting algorithm (PD3O) [38] applied to problem (11) with the metric matrix being
The primal-dual update of NIDS is
(14a) | ||||
(14b) |
By eliminating the dual variable , it holds that
(15a) | ||||
(15b) |
To achieve a more relaxed convergence condition than PG-EXTRA, the dual update contains two additional terms compared with the primal-dual version of PG-EXTRA (12). Moreover, the implementation of NIDS involves communication of both the state variable and the gradient-adapted estimate. Since , the metric is positive definite. Similarly, to ensure the convergence of NIDS, the metric matrix
is required to be positive definite. Thus, the stepsize condition of NIDS is .
Inspired by asymmetric forward-backward-adjoint splitting [39], which is a very general three-operator splitting scheme, and by the prediction-correction framework [40], the proposed D-iPGM can be seen as a triangularly preconditioned forward-backward operator splitting algorithm with a primal corrector. For a clearer view of this, we define the triangular preconditioner and the corrector
Let , and . By these definitions, D-iPGM is equivalent to
By eliminating the dual variable, it holds that
Different from forward-backward operator splitting, the preconditioner is not required to be positive definite. To achieve a network-independent convergence condition, the correction step is employed, which enables the metric matrices of D-iPGM,
to be diagonal. Therefore, as with NIDS, the stepsize condition of D-iPGM is network-independent. We compare it with PG-EXTRA and NIDS in Table II. We observe that the primal sequence produced by D-iPGM differs from those of PG-EXTRA and NIDS; D-iPGM is therefore a distinct algorithm. To achieve a more lenient convergence condition (i.e., a network-independent stepsize condition) by a mechanism different from NIDS, D-iPGM adopts an extra primal correction step. This correction step can be executed independently by each agent without exchanging any private information, and its extra computational cost is almost negligible compared with PG-EXTRA. It is worth noting that in the absence of the proximal term, the sequences generated by NIDS and D-iPGM are identical. Furthermore, as shown in Section IV-A, the exact version of D-iPGM and NIDS exhibit similar performance in solving composite optimization problems.
III Convergence Analysis
This section analyzes the convergence and the convergence rate of the D-iPGM. The convergence analysis will be conducted in the variational inequality context [40], and the forthcoming analysis of the proposed D-iPGM is based on the following fundamental lemma.
Lemma 1 ([40]).
Let and be proper closed convex functions. If is differentiable, and the solution set of the problem is nonempty, then it holds that if and only if
Recall the Lagrangian of problem (2): where and is the Lagrange multiplier. By strong duality, to solve problem (2), we can consider the saddle-point problem
(16) |
Define as the set of saddle-points to the above saddle-point problem. It holds that , where is the optimal solution set to problem (2) and is the optimal solution set to its dual problem. Denote the KKT mapping as
We have if and only if . A point is called a saddle point of the Lagrangian function if it satisfies the saddle-point inequalities, which can be alternatively rewritten as the following variational inequalities (by Lemma 1)
(17) | |||
(18) |
where , ,
Note that the mapping is affine with a skew-symmetric matrix and thus we have
(19) |
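The identity (19) is an instance of a generic fact: for an affine mapping F(u) = Mu + q with skew-symmetric M, one has ⟨u − v, F(u) − F(v)⟩ = (u − v)ᵀ M (u − v) = 0 for all u, v. A quick numeric illustration (our own, with a random skew-symmetric matrix):

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.normal(size=(4, 4))
M = B - B.T                      # skew-symmetric: M^T = -M
q = rng.normal(size=4)
F = lambda u: M @ u + q          # affine map with skew-symmetric linear part

u, v = rng.normal(size=4), rng.normal(size=4)
# <u - v, F(u) - F(v)> = (u-v)^T M (u-v) = 0 since w^T M w = -w^T M w
print((u - v) @ (F(u) - F(v)))   # zero up to rounding
```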
From the above analysis, the set is also the solution set of (17) and (18). Therefore, the convergence of D-iPGM can be proved if the generated sequence satisfies an inequality similar to (17) or (18) with an extra term converging to zero. Based on this observation, we show the convergence properties of the sequence generated by D-iPGM in Sections III-A, III-B, and III-C.
III-A Convergence Analysis Under General Convexity
To simplify the notation for the convergence analysis, we first define two matrices as the following
Then, with the matrices and , we define the following two matrices, which are important in establishing the convergence and convergence rates of D-iPGM.
(20) |
Proposition 1.
If and , the matrices and are positive definite.
Proof:
It follows from the definitions of and that
In addition, it holds that
Since and , implies that . This completes the proof of this proposition. ∎
With the matrix , we define two more matrices as
The positive definiteness of is necessary for establishing the convergence and the non-ergodic convergence rate of D-iPGM, whereas the positive definiteness of is needed for establishing the ergodic convergence rate of D-iPGM. If , we have for any
where and . Let . We can rewrite Algorithm 1 as
(21a) | ||||
(21b) | ||||
(21c) |
where . Note that if , the sequences generated by Algorithm 1 and by recursion (21) coincide. In the analysis, we consider this sequence to illustrate the convergence of Algorithm 1.
To establish the global convergence and derive the convergence rate of D-iPGM, we first give the following lemma.
Lemma 2.
Proof:
See Appendix A. ∎
To ensure the positive definiteness of , we assume that the stepsize satisfies , . Then, we give the following theorem to show that the sequence generated by (21) converges to a primal-dual optimal solution of problem (16).
Theorem 1.
Proof:
See Appendix B. ∎
Although (1) may not guarantee the monotonicity of sequence , since is bounded and is summable, the sequence is a quasi-Fejér monotone sequence in -norm, i.e.,
which guarantees the convergence of D-iPGM.
III-B Sublinear Convergence Rate Under General Convexity
With general convexity, this subsection establishes the sublinear convergence rates of D-iPGM.
We give the following theorem to show the non-ergodic and ergodic convergence rate of D-iPGM.
Theorem 2.
Proof:
See Appendix C. ∎
With the stepsize condition stated above, we establish the non-ergodic iteration complexity of D-iPGM. Under a stronger stepsize condition, the ergodic convergence rates of the primal-dual gap, the primal optimality gap, and the consensus error are proved.
The next theorem gives the non-ergodic convergence rates of D-iPGM when .
Theorem 3.
Proof:
See Appendix D. ∎
It follows from (A) that the successive difference can be used to measure the accuracy of the iterate. More specifically, if the successive difference is zero, then by (21c), it follows that
Compared to (17), . Therefore, can be viewed as an error measurement after iterations of the D-iPGM. The same error measurement is considered in PG-EXTRA [20] and NIDS [21].
III-C Linear Convergence Rate Under Metric Subregularity
By some variational analysis techniques, this subsection proves the linear convergence of D-iPGM under metric subregularity. Recall the notion of metric subregularity [41], which is important in the establishment of the local linear convergence rate [42, 43].
Definition 1.
A set-valued mapping is metrically subregular at if for some there exists such that
where , .
To ensure the linear rate convergence of D-iPGM, we give the following assumption.
Assumption 3.
There exists an integer and a sequence such that
(26) |
where .
Remark 4.
From Theorem 1, we have . It implies that . Therefore, to facilitate implementation, we can use instead of in practice.
Then, the linear convergence rate of D-iPGM is presented.
Theorem 4.
Proof:
See Appendix E. ∎
Note that if converges to as , condition (28) holds eventually for sufficiently large, since and are constants. In addition, if , it holds that , which implies that for all
Thus, Q-linear convergence of D-iPGM holds for exact proximal iterations.
Remark 5.
In [14], a dimension-independent lower bound for problem (1) is given, and it is reported that if each agent owns a different local nonsmooth term, then exact linear convergence cannot be attained in the worst case (where each smooth part is strongly convex and each nonsmooth part is convex with closed-form proximal mappings). However, in Theorem 4, a dimension-dependent linear rate of D-iPGM is established, which does not contradict the existing results.
Next, using the analysis techniques in [42] and [43], we give a sufficient condition under which is metrically subregular at . Similarly to [42] and [43], a convex function is said to satisfy the structured assumption if , where is a matrix, is a vector in , and is smooth and essentially locally strongly convex, i.e., for any compact and convex subset , is strongly convex on . Some commonly used loss functions in machine learning automatically satisfy the structured assumption; Table III summarizes such cases, where , , , , and .
Local loss function | Convexity of | |
Linear regression | Convex | |
Logistic regression | Convex | |
Logistic regression | Convex | |
Likelihood estimation | Convex | |
Poisson regression | Convex |
By the structure of , the following proposition is provided, which presents an alternative characterization of .
Proposition 2.
Suppose that satisfies the structured assumption. The set can be characterized as
with some satisfying for all , and , where , and is a block diagonal matrix with its -th block being and other blocks being .
The proof of Proposition 2 is rather standard by [44, Lemma 2.1] and is thus omitted here. To facilitate the analysis, an auxiliary perturbed set-valued mapping with perturbation is introduced in association with . It is obvious that . Similarly to [42, Proposition 40] and [43, Proposition 4.1], one has the following equivalence.
Proposition 3.
Assume that the structured assumption on is satisfied. Then the metric subregularity conditions of and are equivalent. Precisely, given , the following two statements are equivalent:
(i) There exist such that
(ii) There exist such that
Suppose that satisfies the structured assumption and is a polyhedral multifunction, i.e., a multifunction whose graph is a union of a finite number of polyhedral convex sets. In this scenario, is a polyhedral multifunction, and hence is also a polyhedral multifunction. By [45, Proposition 1], for any point , is metrically subregular at . Thus, by Proposition 3, is metrically subregular at for any given . It follows from [43, Proposition 7] that if is a polyhedral convex function, which includes the indicator function of a polyhedral set and polyhedral convex regularizers, or a convex piecewise linear-quadratic function, then is a polyhedral multifunction. This case covers scenarios where is the -norm regularizer, the -norm regularizer, the elastic net regularizer [46], the generalized LASSO regularizer [23], or the fused LASSO regularizer [24].
IV Numerical Experiments
In this section, we compare the performance of D-iPGM with PG-EXTRA [20] (see (12) or (13)) and NIDS [21] (see (14) or (15)) on decentralized linear and logistic regression problems with different regularizers on real datasets. All the algorithms are implemented in Matlab R2020b on a computer with a 3.30 GHz AMD Ryzen 9 5900HS with Radeon Graphics and 16 GB of memory.
For all experiments, we first compute the solution to (1) by a centralized method, and then run the algorithms over a randomly generated connected network with agents and undirected edges, where is the connectivity ratio. The mixing matrix is generated with the Metropolis-Hastings rule (see [8, Section 2.4]). The performance is evaluated by the relative error . For decentralized linear regression, the loss function is
For decentralized logistic regression, the loss function is
Here, each agent holds its own training data , including sample vectors and corresponding labels . We use three real datasets, a9a, covtype, and ijcnn1 (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). Moreover, the training samples are randomly and evenly distributed over all the agents.
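For reference, the logistic loss used here takes the standard form Σ_j log(1 + exp(−b_j a_jᵀ x)); the sketch below (our own reconstruction with illustrative names A, b, not the paper's Matlab code) evaluates it with its gradient and bounds the smoothness constant by ‖A‖₂²/4.

```python
import numpy as np

def logistic_loss_grad(x, A, b):
    """Logistic loss sum_j log(1 + exp(-b_j * a_j^T x)) and its gradient;
    A stacks the sample vectors a_j^T row-wise, b holds labels in {-1, +1}."""
    z = b * (A @ x)
    loss = np.sum(np.logaddexp(0.0, -z))      # stable log(1 + exp(-z))
    grad = A.T @ (-b / (1.0 + np.exp(z)))     # sigma(-z) = 1/(1 + exp(z))
    return loss, grad

# The gradient is Lipschitz with constant at most ||A||_2^2 / 4, a standard
# bound that gives an admissible smoothness constant for the stepsize rules.
rng = np.random.default_rng(0)
A = rng.normal(size=(8, 3))
b = np.sign(rng.normal(size=8))
L_bound = np.linalg.norm(A, 2) ** 2 / 4
print(L_bound)
```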
Algorithm | a9a: Linear Regression | covtype: Linear Regression | ijcnn1: Linear Regression | ||||||
Inner Iter. | Outer Iter. | Time | Inner Iter. | Outer Iter. | Time | Inner Iter. | Outer Iter. | Time | |
PG-EXTRA | 32687 | 146 | 14.1916 | 63843 | 186 | 6.8362 | 42901 | 139 | 1.7367 |
NIDS | 31678 | 141 | 13.7203 | 61940 | 180 | 6.7148 | 41027 | 133 | 1.6229 |
D-iPGM- | 31460 | 140 | 13.1901 | 61809 | 180 | 6.4375 | 40528 | 132 | 1.6263 |
D-iPGM- | 18254 | 140 | 11.4739 | 35067 | 185 | 5.1945 | 27595 | 132 | 1.2733 |
D-iPGM- | 16640 | 146 | 11.2607 | 33887 | 189 | 4.9323 | 24375 | 132 | 1.1896 |
D-iPGM- | 11836 | 142 | 10.6125 | 23520 | 184 | 4.4608 | 13388 | 132 | 0.8935 |
D-iPGM- | 8568 | 141 | 9.9773 | 17539 | 185 | 3.9734 | 11529 | 132 | 0.8552 |
Algorithm | a9a: Logistic Regression | covtype: Logistic Regression | ijcnn1: Logistic Regression | ||||||
Inner Iter. | Outer Iter. | Time | Inner Iter. | Outer Iter. | Time | Inner Iter. | Outer Iter. | Time | |
PG-EXTRA | 32912 | 169 | 15.3620 | 52761 | 199 | 6.8032 | 54041 | 135 | 1.9368 |
NIDS | 32910 | 169 | 15.3008 | 55084 | 208 | 7.0974 | 53794 | 130 | 1.8648 |
D-iPGM- | 33794 | 169 | 15.7742 | 55281 | 209 | 7.0255 | 54050 | 130 | 1.9370 |
D-iPGM- | 17701 | 165 | 13.6317 | 31920 | 210 | 5.8638 | 37284 | 130 | 1.4901 |
D-iPGM- | 16055 | 166 | 13.0957 | 30864 | 211 | 5.6870 | 32520 | 130 | 1.4303 |
D-iPGM- | 13458 | 166 | 12.9017 | 19383 | 160 | 4.1614 | 17887 | 123 | 1.0219 |
D-iPGM- | 8521 | 156 | 11.5524 | 18795 | 173 | 4.2117 | 13759 | 117 | 0.8875 |
IV-A Scenario 1: The Proximal Mapping Has a Closed-Form Solution
We first consider the decentralized linear and logistic regression problems with the ℓ1 regularizer. For these problems, the proximal mapping of the nonsmooth term has a closed-form solution, so we use the exact version of D-iPGM. Thus, the per-iteration complexities of the three algorithms are the same. Note that there is only one round of communication in each iteration of these algorithms, so the amount of information exchanged over the network is proportional to the number of iterations. Consequently, only the number of iterations is recorded.
The relative errors of these algorithms after each iteration are shown in Fig. 1, where the remaining parameters of D-iPGM are fixed. For PG-EXTRA, we only show the smallest stepsize, because PG-EXTRA diverges for the larger ones. With the smallest stepsize, the three algorithms have similar convergence performance; with the larger stepsizes, D-iPGM and NIDS converge at almost the same speed. In addition, D-iPGM and NIDS converge faster with a larger stepsize.
Then, we consider the decentralized linear regression problem with another regularizer whose proximal mapping also has a closed-form solution. We use uncoordinated stepsizes across the agents. The relative error after each iteration is shown in Fig. 2, which shows that D-iPGM converges faster with larger stepsizes.
IV-B Scenario 2: The Proximal Mapping Has No Closed-Form Solution
In this subsection, we consider the decentralized linear and logistic regression problems with the ℓ1 + generalized LASSO [23] regularizer, where the matrix in the generalized LASSO term is drawn from the normal distribution. In this scenario, the proximal mapping of the nonsmooth term has no closed-form solution for a general matrix, and we use the C-V algorithm [36, 37] to compute it numerically.
Recall the primal update of D-iPGM,
Let . By the definition of the proximal mapping of , it holds that
(29)
Applying the C-V algorithm to problem (29), we get
where , , and and are the stepsizes satisfying . Let . It follows from [33, Section 7] and [47, Lemma 3] that there exists a constant such that . Therefore, if , we stop the inner procedure and set ; i.e., the inner stopping criterion of D-iPGM can be chosen as
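For illustration, the inner subproblem could be solved as follows. This is our own sketch of a Condat-Vu (C-V) inner loop for evaluating the prox of g(x) = lam1*||x||_1 + lam2*||B x||_1; the names, the relative-change stopping rule (a stand-in for the paper's criterion), and the stepsize condition 1/tau - sigma*||B||^2 >= 1/(2*gamma) from Condat [36] are our assumptions:

```python
import numpy as np

def prox_gen_lasso(c, B, lam1, lam2, gamma, tau, sigma,
                   tol=1e-8, max_iter=5000):
    """Approximate prox_{gamma*g}(c) for g(x) = lam1*||x||_1 + lam2*||B x||_1
    via the Condat-Vu primal-dual iteration. The smooth part is the quadratic
    (1/(2*gamma))*||x - c||^2, whose gradient is (x - c)/gamma."""
    x = c.copy()
    y = np.zeros(B.shape[0])                 # dual variable for B x
    for _ in range(max_iter):
        grad = (x - c) / gamma + B.T @ y
        v = x - tau * grad
        # primal prox: soft-thresholding for lam1*||.||_1
        x_new = np.sign(v) * np.maximum(np.abs(v) - tau * lam1, 0.0)
        # dual prox: projection onto the box [-lam2, lam2]
        # (the prox of the conjugate of lam2*||.||_1)
        y = np.clip(y + sigma * (B @ (2 * x_new - x)), -lam2, lam2)
        # hedged stand-in for the inner stopping criterion: relative change
        if np.linalg.norm(x_new - x) <= tol * max(1.0, np.linalg.norm(x_new)):
            x = x_new
            break
        x = x_new
    return x
```

With lam2 = 0 the dual variable vanishes and the loop reduces to iterative soft-thresholding, which gives a convenient sanity check.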
For this scenario, we use the primal-dual forms of PG-EXTRA (12) and NIDS (14) to solve the considered problems. Note that the primal updates of the three algorithms are the same, and that PG-EXTRA and NIDS are exact iterative algorithms. We apply the C-V algorithm to compute (12a) and (14a) with the inner stopping criterion
To compare the effect of inexact iteration and to ensure that the outer-iteration performance of all considered algorithms is similar, we set , for the outer iteration according to the experimental results of Scenario 1. In addition, we use the stopping criterion for all considered methods.

Note that different algorithms have different per-iteration complexities. In Tables IV and V, we report the number of inner iterations (Inner Iter.) and outer iterations (Outer Iter.), as well as the computational time in seconds (Time), when each method reaches the outer stopping criterion. From Tables IV and V, we observe that the computational time and the inner iteration number of D-iPGM are significantly smaller than those of the exact algorithms, including its own exact version, while the outer iteration number of D-iPGM is very close to that of the exact algorithms. Therefore, D-iPGM outperforms the exact version of D-iPGM, PG-EXTRA, and NIDS in terms of computational time. Moreover, comparing the performance of D-iPGM under various inner stopping criteria suggests that it is better to choose as the inner stopping criterion. To further visualize the numerical results, Fig. 3 shows the required number of inner iterations at each outer iteration, which indicates that D-iPGM requires fewer inner iterations per outer iteration than the exact algorithms. If we choose or as the inner stopping criterion, the number of inner iterations required per outer iteration gradually increases; if we choose or , the trend is similar to that of the exact algorithms.
V Conclusion
In this paper, we proposed a CTA-based decentralized algorithm, D-iPGM, for convex composite optimization over networks with uncoordinated network-independent constant stepsizes. D-iPGM does not require the exact solution of the proximal mapping of in each iteration, making it suitable for decentralized composite optimization problems where the proximal mapping may be difficult to compute. We established an convergence rate of D-iPGM under general convexity, which improves to when the proximal mapping is solved exactly. Under metric subregularity, we further established a linear convergence rate. Numerical experiments confirmed the effectiveness of D-iPGM. Accelerated, asynchronous, and stochastic variants of D-iPGM are left for future work.
Appendix A Proof of Lemma 2
Proof:
It follows from (21a) that , which implies that for ,
(30)
It follows from (21b) that for
(31)
Combining (A) and (31), it gives that
(32)
Using (19), (A) can be rewritten as
(33)
Note that
By rearranging some terms of (A), it holds that for
By (5), the assertion (2) holds. Combining (4) and (A), the assertion (2) holds. ∎
Appendix B Proof of Theorem 1
Proof:
1) According to (21c) and , one has that
Applying the identity to the with specifications , , and , it holds that
(34)
Note that and , one has
(35)
Thus, substituting (B) and (B) into (2), it holds that
(36)
Letting in (B) be arbitrary , we have
2) We introduce the following notations.
(37a)
(37b)
(37c)
where . It follows from (1) that
which implies that . Hence, for one has
(38)
By the nonexpansivity of proximal mapping, it holds that
Therefore, it follows that for
(39)
Combining (B) and (B), it holds that
(40)
Summing the inequality (40) over , one has for any ,
which implies
Since and is summable, it follows that for any , is bounded.
3) Summing (1) over , one obtains
Since is summable and is bounded, one has
(41)
which implies that . Let be a cluster point of and be a subsequence converging to . Then, it follows from (A) that
Since the matrix is nonsingular, and , , and are continuous, taking in the above inequality, one gets
Comparing with (17), we conclude that . Therefore, by (40), it holds that
Since , by [27, Lemma 3.2] the quasi-Fejér monotone sequence converges to a unique limit point. Then, with being an accumulation point of , one has . ∎
Appendix C Proof of Theorem 2
Proof:
which implies that
and . Therefore, it holds that
(42)
where . Thus, by (41) and [20, Proposition 1], the running-average optimality residual and the running-best optimality residual hold.
Substituting (B) and (B) into (2), it holds that
(43)
Note that
Summing (C) over , we obtain
It follows from the convexity of and the definition of that
Therefore, the primal–dual gap
(44)
holds. Since is summable and is bounded, the primal-dual gap is .
Note that inequality (C) holds for all ; hence it is also true for any optimal solution and , where is any given positive number. Letting and , it gives
(45)
Note that . Combining (C) and (C), one has
Since is bounded and , there exists such that
(46)
For any , applying [34, Lemma 2.3] to (46), one has
Letting , we have
Thus, the primal gap and the consensus error are . ∎
Appendix D Proof of Theorem 3
Proof:
Since for all , by (A), one has
(47)
Letting in (D), one has
(48)
Rewriting the inequality (D) for the -th iteration, it gives that
(49)
Setting in (D), one has
(50)
Note that . Adding (D) and (D), we obtain
(51)
Then, adding the term
to both sides of (D) and using the identity
one obtains that
(52)
On one hand, we have
On the other hand, we have
Setting , one has that
Thus, by (D), it holds that
(53)
By the identity with , , and , we have
Thus, (25) holds.
By (1) and the equivalence of different norms, there exists a constant such that for any
(54)
Summing (54) over , it gives that
It follows from (25) that the sequence is monotonically non-increasing. Thus, we have
Since , the rate of follows from [20, Proposition 1]. By (C), the rate of the first-order optimality residual holds. ∎
Appendix E Proof of Theorem 4
Proof:
Since and converge to and converges to as , it follows from (B) that and also converge to as . By (37), one has
which implies that
and . Therefore, it holds that
where . Since the KKT mapping is metrically subregular at , there exists and such that when
Since and converge to as , there exists such that
(55)
Note that , one has
(56)
where . Combining (55) and (E), it gives that
(57)
where . It follows from (1) that for
(58)
Together with (E), one has
which implies that
(59)
where . By (58) and the fact , one has
It follows that
(60)
where . By the triangle inequality, one has
(61)
Then, by the Lipschitz continuity of , it holds that
which together with (60) and (E) deduces that
(62)
By (26), (B) and the triangle inequality, it holds that
where . Then, by the fact that
and (62), it can be deduced that when
which together with (59), implies that when
It follows that if
then one has . This completes the proof. ∎
References
- [1] A. Nedić, “Convergence rate of distributed averaging dynamics and optimization in networks,” Found. Trends Syst. Control, vol. 2, no. 1, pp. 1–100, 2015.
- [2] A. Nedić and A. Ozdaglar, “Distributed subgradient methods for multiagent optimization,” IEEE Trans. Autom. Control, vol. 54, no. 1, pp. 48–61, Jan. 2009.
- [3] K. Yuan, Q. Ling, and W. Yin, “On the convergence of decentralized gradient descent,” SIAM J. Optim., vol. 26, no. 3, pp. 1835–1854, 2016.
- [4] D. Jakovetić, J. Xavier, and J. M. F. Moura, “Fast distributed gradient method,” IEEE Trans. Autom. Control, vol. 59, no. 5, pp. 1131–1146, May 2014.
- [5] J. Zhang, K. You, and T. Başar, “Distributed discrete-time optimization in multiagent networks using only sign of relative state,” IEEE Trans. Autom. Control, vol. 64, no. 6, pp. 2352–2367, Jun. 2019.
- [6] W. Shi, Q. Ling, K. Yuan, G. Wu, and W. Yin, “On the linear convergence of the ADMM in decentralized consensus optimization,” IEEE Trans. Signal Process., vol. 62, no. 7, pp. 1750–1761, Apr. 2014.
- [7] Q. Ling, W. Shi, G. Wu, and A. Ribeiro, “DLM: Decentralized linearized alternating direction method of multipliers,” IEEE Trans. Signal Process., vol. 63, no. 15, pp. 4051–4064, Aug. 2015.
- [8] W. Shi, Q. Ling, G. Wu, and W. Yin, “EXTRA: An exact first-order algorithm for decentralized consensus optimization,” SIAM J. Optim., vol. 25, no. 2, pp. 944–966, 2015.
- [9] A. Nedić, A. Olshevsky, and W. Shi, “Achieving geometric convergence for distributed optimization over time-varying graphs,” SIAM J. Optim., vol. 27, no. 4, pp. 2597–2633, 2017.
- [10] G. Qu and N. Li, “Harnessing smoothness to accelerate distributed optimization,” IEEE Trans. Control Netw. Syst., vol. 5, no. 3, pp. 1245–1260, Sep. 2018.
- [11] J. Xu, S. Zhu, Y. C. Soh, and L. Xie, “Augmented distributed gradient methods for multi-agent optimization under uncoordinated constant stepsizes,” in Proc. 54th IEEE Conf. Decis. Control, 2015, pp. 2055–2060.
- [12] A. Nedić, A. Olshevsky, W. Shi, and C. A. Uribe, “Geometrically convergent distributed optimization with uncoordinated step-sizes,” in Proc. Amer. Control Conf., 2017, pp. 3950–3955.
- [13] K. Yuan, B. Ying, X. Zhao, and A. H. Sayed, “Exact diffusion for distributed optimization and learning—Part I: Algorithm development,” IEEE Trans. Signal Process., vol. 67, no. 3, pp. 708–723, Feb. 2019.
- [14] S. A. Alghunaim, E. Ryu, K. Yuan, and A. H. Sayed, “Decentralized proximal gradient algorithms with linear convergence rates,” IEEE Trans. Autom. Control, vol. 66, no. 6, pp. 2787–2794, Jun. 2020.
- [15] J. Xu, Y. Tian, Y. Sun and G. Scutari, “Distributed algorithms for composite optimization: Unified framework and convergence analysis,” IEEE Trans. Signal Process., vol. 69, pp. 3555–3570, Jun. 2021.
- [16] P. D. Lorenzo and G. Scutari, “NEXT: In-network nonconvex optimization,” IEEE Trans. Signal Inf. Process. Netw., vol. 2, no. 2, pp. 120–136, Jun. 2016.
- [17] Y. Sun, G. Scutari, and A. Daneshmand, “Distributed optimization based on gradient tracking revisited: Enhancing convergence rate via surrogation,” SIAM J. Optim., vol. 32, no. 2, pp. 345–385, 2022.
- [18] T.-H. Chang, M. Hong, and X. Wang, “Multi-agent distributed optimization via inexact consensus ADMM,” IEEE Trans. Signal Process., vol. 63, no. 2, pp. 482–497, Jan. 2015.
- [19] N. S. Aybat, Z. Wang, T. Lin, and S. Ma, “Distributed linearized alternating direction method of multipliers for composite convex consensus optimization,” IEEE Trans. Autom. Control, vol. 63, no. 1, pp. 5–20, Jan. 2017.
- [20] W. Shi, Q. Ling, G. Wu, and W. Yin, “A proximal gradient algorithm for decentralized composite optimization,” IEEE Trans. Signal Process., vol. 63, no. 22, pp. 6013–6023, Nov. 2015.
- [21] Z. Li, W. Shi, and M. Yan, “A decentralized proximal-gradient method with network independent step-sizes and separated convergence rates,” IEEE Trans. Signal Process., vol. 67, no. 17, pp. 4494-4506, Sep. 2019.
- [22] J. Zhang, H. Liu, A. M.-C. So and Q. Ling, “A penalty alternating direction method of multipliers for convex composite optimization over decentralized networks,” IEEE Trans. Signal Process., vol. 69, pp. 4282–4295, Jun. 2021.
- [23] R. J. Tibshirani and J. Taylor, “The solution path of the generalized lasso,” Ann. Statist., vol. 39, no. 3, pp. 1335–1371, Jun. 2011.
- [24] R. Tibshirani, M. Saunders, S. Rosset, J. Zhu, and K. Knight, “Sparsity and smoothness via the fused lasso,” J. Roy. Stat. Soc. Ser. B, Stat. Methodol., vol. 67, no. 1, pp. 91–108, Feb. 2005.
- [25] H. D. Bondell and B. J. Reich, “Simultaneous regression shrinkage, variable selection, and supervised clustering of predictors with OSCAR,” Biometrics, vol. 64, no. 1, pp. 115–123, Mar. 2008.
- [26] L. Jacob, G. Obozinski, and J.-P. Vert, “Group lasso with overlap and graph lasso,” in Proc. 26th Int. Conf. Mach. Learn., 2009, pp. 433–440.
- [27] L. Chen, X. Li, D. Sun, and K.-C. Toh, “On the equivalence of inexact proximal ALM and ADMM for a class of convex composite programming,” Math. Program., vol. 185, pp. 111–161, 2021.
- [28] F. Jiang, X. Cai, Z. Wu, and D. Han, “Approximate first-order primal-dual algorithms for saddle point problems,” Math. Comp., vol. 90, pp. 1227–1262, 2021.
- [29] J. Eckstein and W. Yao, “Approximate ADMM algorithms derived from Lagrangian splitting,” Comput. Optim. Appl., vol. 68, no. 2, pp. 363–405, 2017.
- [30] Y. Arjevani, J. Bruna, B. Can, M. Gurbuzbalaban, S. Jegelka, and H. Lin, “IDEAL: Inexact decentralized accelerated augmented Lagrangian method,” in Proc. Adv. Neural Inf. Process. Syst., 2020, pp. 20648–20659.
- [31] M. Hong, “Decomposing linearly constrained nonconvex problems by a proximal primal dual approach: Algorithms, convergence, and applications,” 2016, arXiv:1604.00543.
- [32] A. Beck, First-Order Methods in Optimization, vol. 25. SIAM, Philadelphia, 2017.
- [33] L. Condat, D. Kitahara, A. Contreras, and A. Hirabayashi, “Proximal splitting algorithms for convex optimization: A tour of recent advances, with new twists,” 2021, arXiv:1912.00137v7.
- [34] Y. Xu, “Accelerated first-order primal–dual proximal methods for linearly constrained composite convex programming,” SIAM J. Optim., vol. 27, no. 3, pp. 1459–1484, 2017.
- [35] E. K. Ryu and W. Yin, Large-Scale Convex Optimization: Algorithms & Analyses via Monotone Operators, Cambridge, 2022.
- [36] L. Condat, “A primal-dual splitting method for convex optimization involving Lipschitzian, proximable and linear composite terms,” J. Optim. Theory Appl., vol. 158, no. 2, pp. 460–479, 2013.
- [37] B. C. Vũ, “A splitting algorithm for dual monotone inclusions involving cocoercive operators,” Adv. Comput. Math., vol. 38, no. 3, pp. 667–681, 2013.
- [38] M. Yan, “A new primal-dual algorithm for minimizing the sum of three functions with a linear operator,” J. Sci. Comput., vol. 76, pp. 1698–1717, 2018.
- [39] P. Latafat and P. Patrinos, “Asymmetric forward-backward-adjoint splitting for solving monotone inclusions involving three operators,” Comput. Optim. Appl., vol. 68, no. 1, pp. 57–93, 2017.
- [40] B. S. He and X. M. Yuan, “On construction of splitting contraction algorithms in a prediction-correction framework for separable convex optimization,” 2022, arXiv:2204.11522.
- [41] R. T. Rockafellar and R. J.-B. Wets, Variational Analysis, Springer-Verlag Berlin Heidelberg, 2004.
- [42] X. Yuan, S. Zeng, and J. Zhang, “Discerning the linear convergence of ADMM for structured convex optimization through the lens of variational analysis,” J. Mach. Learn. Res., vol. 21, pp. 1–75, 2020.
- [43] J. J. Ye, X. Yuan, S. Zeng, and J. Zhang, “Variational analysis perspective on linear convergence of some first order methods for nonsmooth convex optimization problems,” Set-Valued Var. Anal., vol. 29, pp. 803–837, 2021.
- [44] Z. Luo and P. Tseng, “On the linear convergence of descent methods for convex essentially smooth minimization,” SIAM J. Control. Optim., vol. 30, pp. 408–425, 1992.
- [45] S. M. Robinson, “Some continuity properties of polyhedral multifunctions,” Math. Program., vol. 14, pp. 206–214, 1981.
- [46] H. Zou and T. Hastie, “Regularization and variable selection via the elastic net,” J. Roy. Stat. Soc. Ser. B, Stat. Methodol., vol. 67, no. 2, pp. 301–320, 2005.
- [47] F. Jiang, Z. Wu, X. Cai, and H. Zhao, “Unified linear convergence of first-order primal-dual algorithms for saddle point problems,” Optim. Lett., vol. 16, pp. 1675–1700, 2022.
Luyao Guo received the B.S. degree in information and computing science from Shanxi University, Taiyuan, China, in 2020. He is currently pursuing the Ph.D. degree in applied mathematics with the Jiangsu Provincial Key Laboratory of Networked Collective Intelligence, School of Mathematics, Southeast University, Nanjing, China. His current research focuses on distributed optimization and learning.
Xinli Shi (Senior Member, IEEE) received the B.S. degree in software engineering, the M.S. degree in applied mathematics, and the Ph.D. degree in control science and engineering from Southeast University, Nanjing, China, in 2013, 2016, and 2019, respectively. He held a China Scholarship Council Studentship for one-year study with RMIT University, Melbourne, VIC, Australia, in 2018. He is currently an Associate Professor with the School of Cyber Science and Engineering, Southeast University. His current research interests include distributed optimization, nonsmooth analysis, and network control systems. Dr. Shi was the recipient of the Outstanding Ph.D. Degree Thesis Award from Jiangsu Province, China.
Jinde Cao (Fellow, IEEE) received the B.S. degree in mathematics from Anhui Normal University, Wuhu, China, in 1986, the M.S. degree in mathematics/applied mathematics from Yunnan University, Kunming, China, in 1989, and the Ph.D. degree in mathematics/applied mathematics from Sichuan University, Chengdu, China, in 1998. He is an Endowed Chair Professor, the Dean of the School of Mathematics, and the Director of the Jiangsu Provincial Key Laboratory of Networked Collective Intelligence of China and the Research Center for Complex Systems and Network Sciences with Southeast University, Nanjing, China. Prof. Cao was a recipient of the National Innovation Award of China, the Gold medal of Russian Academy of Natural Sciences, the Obada Prize, and the Highly Cited Researcher Award in Engineering, Computer Science, and Mathematics by Thomson Reuters/Clarivate Analytics. He is elected as a foreign member of Russian Academy of Sciences, a member of the Academy of Europe, a foreign member of Russian Academy of Engineering, a member of the European Academy of Sciences and Arts, a foreign member of Russian Academy of Natural Sciences, a foreign fellow of Pakistan Academy of Sciences, a fellow of African Academy of Sciences, a foreign Member of the Lithuanian Academy of Sciences, and an IASCYS academician.
Zihao Wang received the B.E. degree in network engineering from Nanjing Xiaozhuang University, Nanjing, China, in 2020. He is currently pursuing the M.E. degree in cyberspace security with the School of Cyber Science and Engineering, Southeast University, Nanjing. His current research focuses on privacy-preserving distributed optimization of networked multi-agent systems.