Accelerated Dual Averaging Methods for Decentralized Constrained Optimization
Abstract
In this work, we study decentralized convex constrained optimization problems in networks. We focus on the dual averaging-based algorithmic framework, which is well documented to be superior in simultaneously handling constraints and complex communication environments. Two new decentralized dual averaging (DDA) algorithms are proposed. In the first one, a second-order dynamic average consensus protocol is tailored for DDA-type algorithms, which equips each agent with a provably more accurate estimate of the global dual variable than conventional schemes. We rigorously prove that the proposed algorithm attains $O(1/t)$ convergence for general convex and smooth problems, for which existing DDA methods were only known to converge at $O(1/\sqrt{t})$ prior to our work. In the second one, we use the extrapolation technique to accelerate the convergence of DDA. Compared to existing accelerated algorithms, where typically two different variables are exchanged among agents at each time, the proposed algorithm only seeks consensus on local gradients. The extrapolation is then performed based on two sequences of primal variables that are determined by the accumulations of gradients at two consecutive time instants, respectively. The algorithm is proved to converge at $O\big(\frac{1}{t^2} + \frac{1}{t(1-\beta)^2}\big)$, where $\beta$ denotes the second largest singular value of the mixing matrix. We remark that the condition on the algorithmic parameter that guarantees convergence does not rely on the spectrum of the mixing matrix, making it easy to satisfy in practice. Finally, numerical results are presented to demonstrate the efficiency of the proposed methods.
Index Terms:
Decentralized optimization, constrained optimization, dual averaging, acceleration, multi-agent system.

I Introduction
Consider a multi-agent system consisting of $n$ agents. Each agent holds a private objective function. They are connected via a communication network in order to collaboratively solve the following optimization problem:
$$\min_{x \in \mathcal{X}} \; f(x) := \sum_{i=1}^{n} f_i(x) \qquad (1)$$
where $f_i: \mathbb{R}^d \to \mathbb{R}$ represents the local smooth objective function of agent $i$ and $\mathcal{X} \subseteq \mathbb{R}^d$ denotes the constraint set shared by all the agents. This problem is referred to as decentralized optimization in the literature and finds broad applications in optimal control of cyber-physical systems, sensor networks, and machine learning, to name a few. For an overview of decentralized optimization and its applications, please refer to [1, 2].
Over the last decade, many decentralized optimization algorithms have been proposed for solving Problem (1). For unconstrained problems, i.e., $\mathcal{X} = \mathbb{R}^d$, the authors in [3, 4] developed decentralized gradient descent (DGD) methods with constant step sizes, where the local search performed by individual agents is guided by local gradients and a consensus protocol. However, because each individual gradient evaluated at the global optimum is not necessarily zero, the search directions induced by consensus-seeking and local gradients may conflict with each other, making it difficult to reach the exact solution of the problem. Several efforts have been made to overcome this drawback. For example, the authors in [5] proposed the EXTRA algorithm, which adds a cumulative correction term to DGD to achieve consensual optimization. Alternatively, an additional gradient-tracking process based on the dynamic average consensus scheme in [6] can be used. It is shown in [7, 8] that, for unconstrained smooth optimization, the algorithms steered by the tracked gradient exactly converge at an $O(1/t)$ rate. Based on this idea, a decentralized Nesterov gradient descent was proposed in [9], where the rate of convergence is accelerated to $O(1/t^{1.4-\epsilon})$ for any $\epsilon \in (0, 1.4)$ at the expense of exchanging an additional variable among agents at each time instant. In [10], the authors proposed an accelerated decentralized algorithm with multiple consensus rounds at each time instant, and proved that after $t$ local iterations and $O(t \log t)$ communication rounds the objective error is bounded by $O(1/t^2)$. By modeling Problem (1) as a linearly constrained optimization problem, centralized primal-dual paradigms such as the augmented Lagrangian method (ALM), the alternating direction method of multipliers (ADMM), and dual ascent can also be used to design decentralized algorithms [11, 12, 13]. Based on the primal-dual reformulation, an accelerated primal-dual method was developed in [14]. The rate of convergence is improved to $O\big(\frac{L}{t^2} + \frac{L}{t\sqrt{\tilde{\lambda}}}\big)$, where $L$ denotes the smoothness parameter of each objective function and $\tilde{\lambda}$ is the eigengap of the graph Laplacian. Notably, the authors established a lower bound for a class of decentralized primal-dual methods, suggesting that the algorithm developed therein is optimal in terms of gradient computations. The authors in [15] considered the Lagrangian dual formulation of the decentralized optimization problem and developed two algorithms based on accelerated dual methods. The algorithms are proved to be linearly convergent for strongly convex and smooth problems.
For constrained problems, the design and convergence analysis of decentralized optimization algorithms are more challenging [16, 17, 18]. The seminal work in [19] is based on the gossip protocol and the projected subgradient method, where the step size was made decaying to ensure convergence. The randomized smoothing technique and a multi-round consensus scheme are used to design a provably optimal decentralized algorithm for non-smooth convex problems in [20]. To improve the performance using a constant step size, a variant of EXTRA (PG-EXTRA) was developed in [21], where the constraint is modeled as a non-smooth indicator function and handled via the proximal operator. An $O(1/t)$ rate of convergence is proved for the squared norm of the difference of consecutive iterates. Recently, the authors in [22] proposed an accelerated decentralized penalty method (APM), where the constraint can also be treated as the non-smooth part of the objective. Notably, there are some decentralized algorithms [23, 24, 25, 26, 27, 28, 29] available in the literature where the local search mechanism for individual agents is inspired by dual methods [30], e.g., mirror descent and dual averaging [31, 32]. In particular, dual averaging is provably more efficient in exploiting sparsity than proximal gradient methods for $\ell_1$-regularized problems [32]. For example, the authors in [26] developed a decentralized dual averaging (DDA) algorithm where a linear model of the global objective function is gradually learned by each agent via gossip. Compared to other types of decentralized first-order methods, DDA has the advantage of simultaneously handling constraints and complex communication environments, e.g., directed networks [33], deterministic time-varying networks [29], and stochastic networks [26, 34]. From a technical perspective, this is because DDA seeks consensus on linear models of the objective function rather than on the local projected iterates as in decentralized primal methods, e.g., DGD, thereby decoupling the consensus-seeking process from the nonlinear projection and facilitating the rate analysis in complex communication environments. We present a more detailed comparison in Section III-A.
Although decentralized dual methods in the literature have demonstrated advantages over their primal counterparts in terms of constraint handling and analysis complexity, existing results focus on non-smooth problems and attain an $O(1/\sqrt{t})$ rate of convergence. Considering this, a question naturally arises: if the objective functions exhibit some desired properties, e.g., smoothness, is it possible to accelerate the convergence rate of DDA beyond $O(1/\sqrt{t})$? We provide an affirmative answer to this question in this work. The main results and contributions are summarized in the following:
•
First, we develop a new DDA algorithm, where a second-order dynamic average consensus protocol is deliberately designed to assist each agent in estimating the global dual variable. Compared to the conventional estimation scheme [26], the proposed method equips each agent with provably more accurate estimates. In particular, the estimation error accumulated over time is proved to admit an upper bound constituted by the successive difference of an auxiliary variable whose update uses the mean of local dual variables. Then a rigorous investigation into the convergence of the auxiliary variable is carried out. Summarizing these two relations, we establish conditions on the algorithm parameters such that the estimation error can be fully compensated, leading to an improved rate of convergence $O(1/t)$.
•
Second, we propose an accelerated DDA (ADDA) algorithm. Different from DDA, each agent employs a first-order dynamic average consensus protocol to estimate the mean of local gradients and accumulates the estimate over time to generate a local dual variable. By evaluating the gradient of the convex conjugate of a strongly convex function at this local dual variable, each agent produces a primal variable and uses it to construct another two sequences of primal variables in an iterative manner based on the extrapolation technique in [35] and the average consensus protocol. The rate of convergence is proved to be $O\big(\frac{1}{t^2} + \frac{1}{t(1-\beta)^2}\big)$, where $\beta$ denotes the second largest singular value of the mixing matrix. Notably, the condition on the algorithmic parameter that ensures convergence does not rely on the mixing matrix. Establishing such a condition that is independent of the mixing matrix offers the appealing advantage of convenient verification in practical applications.
•
Finally, the proposed algorithms are tested and compared with a few methods in the literature on decentralized LASSO problems characterized by synthetic and real datasets. The comparison results demonstrate the efficiency of the proposed methods.
Notation: We use $\mathbb{R}$ and $\mathbb{R}^d$ to denote the set of reals and the $d$-dimensional Euclidean space, respectively. Given a real number $a$, we denote by $\lceil a \rceil$ the ceiling function that maps $a$ to the least integer greater than or equal to $a$. Given a vector $x \in \mathbb{R}^d$, $\|x\|$ denotes its $\ell_2$-norm. Given a matrix $A$, its spectral radius is denoted by $\rho(A)$. Its eigenvalues and singular values are denoted by $\lambda_i(A)$ and $\sigma_i(A)$, respectively.
II Preliminaries
II-A Basic Setup
We consider the finite-sum optimization problem (1), in which $\mathcal{X}$ is a convex and compact set and each $f_i$ satisfies the following assumptions.
Assumption 1. For all $i \in \{1, \dots, n\}$:

i) $f_i$ is continuously differentiable on $\mathcal{X}$.

ii) $f_i$ is convex on $\mathcal{X}$, i.e., for any $x, y \in \mathcal{X}$,

$$f_i(y) \ge f_i(x) + \langle \nabla f_i(x), y - x \rangle. \qquad (2)$$

iii) $\nabla f_i$ is Lipschitz continuous on $\mathcal{X}$ with a constant $L > 0$, i.e., for any $x, y \in \mathcal{X}$,

$$\|\nabla f_i(x) - \nabla f_i(y)\| \le L \|x - y\|. \qquad (3)$$
II-B Communication Network
We consider solving Problem (1) in a decentralized fashion, that is, a pair of agents can exchange information only if they are connected in the communication network. To describe the network topology, an undirected graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ is used, where $\mathcal{V} = \{1, \dots, n\}$ denotes the set of agents and $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$ represents the set of bidirectional channels, i.e., $(i, j) \in \mathcal{E}$ indicates that nodes $i$ and $j$ can send information to each other. Agent $j$ is said to be a neighbor of $i$ if there exists a link between them, and the set of $i$'s neighbors is denoted by $\mathcal{N}_i$. For every pair $(i, j) \in \mathcal{E}$, a positive weight $w_{ij} = w_{ji} > 0$ is assigned to $i$ and $j$ to weigh the information received from each other. Otherwise, $w_{ij} = 0$ is considered. For the convergence of the algorithm, we make the following assumption on the weight matrix $W = [w_{ij}] \in \mathbb{R}^{n \times n}$.
Assumption 2.

i) $W = W^\top$ and $W \mathbf{1}_n = \mathbf{1}_n$, where $\mathbf{1}_n$ denotes the all-one vector of dimension $n$.

ii) $W$ has a strictly positive diagonal, i.e., $\min_i w_{ii} > 0$.
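To make Assumption 2 concrete, the following sketch constructs a weight matrix by the Metropolis-Hastings rule [36], which is also the rule used in the simulation section. The function name and the cycle-graph example are our own illustration, assuming a 0/1 adjacency matrix as input.

```python
import numpy as np

def metropolis_weights(adj):
    """Build a mixing matrix W from a 0/1 adjacency matrix via the
    Metropolis-Hastings rule [36]. The result satisfies Assumption 2:
    W = W^T, W 1 = 1, and a strictly positive diagonal."""
    n = adj.shape[0]
    deg = adj.sum(axis=1)                     # node degrees
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j and adj[i, j]:
                W[i, j] = 1.0 / (1.0 + max(deg[i], deg[j]))
        W[i, i] = 1.0 - W[i].sum()            # self-weight keeps each row summing to 1
    return W

# Example: a 5-node cycle graph.
adj = np.roll(np.eye(5), 1, axis=1) + np.roll(np.eye(5), -1, axis=1)
W = metropolis_weights(adj)
assert np.allclose(W, W.T) and np.allclose(W.sum(axis=1), 1.0)
```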
II-C Centralized Dual Averaging
Our algorithms are based on the dual averaging methods [31]. Before introducing them, we state the following definition.
Definition 1. A differentiable function $d: \mathbb{R}^d \to \mathbb{R}$ is strongly convex with modulus $\mu > 0$ on $\mathcal{X}$ if, for any $x, y \in \mathcal{X}$,

$$d(y) \ge d(x) + \langle \nabla d(x), y - x \rangle + \frac{\mu}{2} \|y - x\|^2. \qquad (4)$$
Let $d: \mathbb{R}^d \to \mathbb{R}$ be a strongly convex and differentiable function with modulus $\mu$ on $\mathcal{X}$ such that

$$d(x) \ge 0, \quad \forall x \in \mathcal{X}. \qquad (5)$$

To meet the condition in (5) for any $\mathcal{X}$, one can choose

$$d(x) = \hat{d}(x) - \min_{y \in \mathcal{X}} \hat{d}(y),$$

where $\hat{d}$ is any strongly convex function with modulus $\mu$, e.g., $\hat{d}(x) = \frac{\mu}{2}\|x\|^2$. The convex conjugate of $d$ is defined as

$$d^*(z) = \sup_{x \in \mathcal{X}} \left\{ \langle z, x \rangle - d(x) \right\}.$$
As a corollary of Danskin’s Theorem, we have the following result [35].
Lemma 1. For all $z \in \mathbb{R}^d$, we have

$$\nabla d^*(z) = \operatorname*{argmax}_{x \in \mathcal{X}} \left\{ \langle z, x \rangle - d(x) \right\}. \qquad (6)$$
Dual averaging. The dual averaging method can be applied to solve Problem (1) in a centralized manner. Starting with $x_0 \in \mathcal{X}$ and $z_0 = 0$, it generates a sequence of variables iteratively according to

$$x_{t+1} = \operatorname*{argmin}_{x \in \mathcal{X}} \left\{ \langle z_{t+1}, x \rangle + \gamma_t \, d(x) \right\}, \qquad (7)$$

where

$$z_{t+1} = z_t + \nabla f(x_t) \qquad (8)$$

and $\{\gamma_t\}$ is a sequence of positive parameters that determines the rate of convergence. It is proved in [31] that the objective error vanishes at an $O(1/\sqrt{t})$ rate when $\gamma_t \propto \sqrt{t}$, that is, with order exactly $O(1/\sqrt{t})$. When the objective function is convex and smooth, a constant $\gamma_t \equiv \gamma$ can be used to achieve an ergodic $O(1/t)$ rate of convergence in terms of the objective error [37].
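As an illustration, the following is a minimal sketch of centralized dual averaging with the quadratic prox-function $d(x) = \frac{1}{2}\|x\|^2$, for which the argmin in (7) reduces to a projection. The function names and the toy example are our own, and the updates follow our reconstruction of (7)-(8) above.

```python
import numpy as np

def dual_averaging(grad, proj, dim, T, gamma=1.0):
    """Centralized dual averaging (7)-(8) with d(x) = (1/2)||x||^2,
    for which the argmin in (7) is a projection: x = proj(-z / gamma).
    A sketch under our reconstruction, not the paper's verbatim pseudocode."""
    x = proj(np.zeros(dim))          # x_0: any feasible point
    z = np.zeros(dim)                # accumulated gradients, z_0 = 0
    for _ in range(T):
        z = z + grad(x)              # (8): z_{t+1} = z_t + grad f(x_t)
        x = proj(-z / gamma)         # (7) with the quadratic prox-function
    return x

# Toy usage: minimize ||x - c||^2 over the unit Euclidean ball.
c = np.array([2.0, 0.0])
grad = lambda x: 2.0 * (x - c)
proj = lambda v: v / max(1.0, np.linalg.norm(v))
x_hat = dual_averaging(grad, proj, dim=2, T=500, gamma=4.0)
# x_hat is close to [1, 0], the projection of c onto the ball.
```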
Accelerated dual averaging. To speed up the rate of convergence, an accelerated dual averaging algorithm was developed in [35]. In particular, the variables are updated according to

$$x_{t+1} = \frac{A_t}{A_{t+1}} \hat{x}_t + \frac{a_{t+1}}{A_{t+1}} \nabla d^*(-z_t), \qquad \text{(9a)}$$

$$\hat{x}_{t+1} = \frac{A_t}{A_{t+1}} \hat{x}_t + \frac{a_{t+1}}{A_{t+1}} \nabla d^*(-z_{t+1}), \qquad \text{(9b)}$$

where $a_t = \alpha t$ for some $\alpha > 0$, $A_t = \sum_{s=1}^{t} a_s$, and

$$z_{t+1} = z_t + a_{t+1} \nabla f(x_{t+1}). \qquad (10)$$

Note that $t \ge 0$ is considered for the above iteration, and the variables are initialized with $z_0 = 0$ and $\hat{x}_0 = \nabla d^*(0)$. For convex and smooth objective functions, it is proved that the objective error at $\hat{x}_t$ decays at an $O(1/t^2)$ rate [35].
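A hedged sketch of the accelerated iteration (9)-(10), under the same conventions as above (quadratic prox-function, increasing weights $a_t = \alpha t$); this follows our reconstruction of the updates rather than the authors' exact pseudocode.

```python
import numpy as np

def accelerated_dual_averaging(grad, proj, dim, T, alpha):
    """Accelerated dual averaging (9)-(10) with d(x) = (1/2)||x||^2,
    so that grad d*(-z) = proj(-z). A sketch under our reconstruction."""
    z = np.zeros(dim)
    x_hat = proj(np.zeros(dim))      # hat-x_0 = grad d*(0)
    A = 0.0
    for t in range(1, T + 1):
        a = alpha * t                # a_t grows linearly, cf. the text above
        A_next = A + a
        x = (A / A_next) * x_hat + (a / A_next) * proj(-z)      # (9a): query point
        z = z + a * grad(x)                                     # (10): weighted dual step
        x_hat = (A / A_next) * x_hat + (a / A_next) * proj(-z)  # (9b): solution sequence
        A = A_next
    return x_hat
```

For an $L$-smooth objective, taking $\alpha$ on the order of $1/L$ keeps the weighted dual steps stable in this construction; the precise admissible range in [35] depends on the modulus of $d$ and is an assumption on our part.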
III Algorithms and Convergence Results
In this section, we develop two new DDA algorithms that are provably more efficient than existing DDA-type algorithms.
III-A Decentralized Dual Averaging
To solve Problem (1) in a decentralized manner, we propose a novel DDA algorithm. In particular, we employ the following dynamic average consensus protocol to estimate the global dual variable $z_t$ in (8):
$$z_{i,t+1} = \sum_{j=1}^{n} w_{ij} \left( z_{j,t} + y_{j,t} \right), \qquad \text{(11a)}$$

$$y_{i,t+1} = \sum_{j=1}^{n} w_{ij} \, y_{j,t} + \nabla f_i(x_{i,t+1}) - \nabla f_i(x_{i,t}), \qquad \text{(11b)}$$
where $z_{i,t}$ is a local estimate of $z_t$ generated by agent $i$, and $y_{i,t}$, initialized as $y_{i,0} = \nabla f_i(x_{i,0})$, is a proxy of the average gradient that aims to reduce the consensus error among the variables $\{z_{i,t}\}$. Using it, each agent performs a local computation step to update its estimate of the minimizer:
$$x_{i,t+1} = \operatorname*{argmin}_{x \in \mathcal{X}} \left\{ \langle z_{i,t+1}, x \rangle + \gamma_t \, d(x) \right\}. \qquad (12)$$
The overall algorithm is summarized in Algorithm 1.
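Because the displayed updates (11)-(12) were reconstructed above, the following minimal sketch of Algorithm 1 follows that reading, with $d(x) = \frac{1}{2}\|x\|^2$ so that (12) is a projection; all function names are ours, not the paper's.

```python
import numpy as np

def dda(W, grads, proj, dim, T, gamma):
    """Decentralized dual averaging (Algorithm 1) under our reconstruction
    of (11)-(12). W is an (n, n) mixing matrix satisfying Assumption 2;
    grads is a list of n local gradient oracles; proj projects onto X."""
    n = W.shape[0]
    x = np.zeros((n, dim))                     # x_{i,0}: assumes the origin is feasible
    g = np.array([grads[i](x[i]) for i in range(n)])
    y = g.copy()                               # y_{i,0} = grad f_i(x_{i,0})
    z = np.zeros((n, dim))                     # z_{i,0} = 0
    for _ in range(T):
        z = W @ (z + y)                        # (11a): second-order consensus step
        x = np.array([proj(-z[i] / gamma) for i in range(n)])  # (12), constant gamma
        g_new = np.array([grads[i](x[i]) for i in range(n)])
        y = W @ y + g_new - g                  # (11b): dynamic average consensus
        g = g_new
    return x
```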
Before proceeding, we make the following remarks on Algorithm 1.
i) Subproblem solvability. Similar to centralized dual averaging methods, we assume the subproblem in (12) can be solved easily. For general problems, we can choose $d(x) = \frac{1}{2}\|x\|^2$ such that the subproblem (12) reduces to computing the projection of a point onto $\mathcal{X}$. If $\mathcal{X}$ is simple enough, e.g., the simplex or an $\ell_1$-norm ball, a closed-form solution exists.
ii) Comparison with existing DDA algorithms. In existing DDA algorithms [26, 29, 28], each agent estimates $z_t$ in the following way:

$$z_{i,t+1} = \sum_{j=1}^{n} w_{ij} \, z_{j,t} + \nabla f_i(x_{i,t}). \qquad (13)$$
For this scheme, it is proved that the consensus error among the variables $\{z_{i,t}\}$ admits only a constant upper bound [26], which necessitates the use of a monotonically decreasing sequence $\{\gamma_t\}$ for convergence. However, decreasing $\{\gamma_t\}$ slows down the convergence significantly; the rate of convergence in [26, 29] is reported to be $O(\log t / \sqrt{t})$. To speed up the convergence, we develop the consensus protocol in (11), which is inspired by the high-order consensus scheme in [6]. Thanks to it, we are able to prove that the deviation among the variables $\{z_{i,t}\}$ asymptotically vanishes as time evolves. Therefore, the parameter $\gamma_t$ in (12) can be set constant, i.e., $\gamma_t \equiv \gamma$, which is key to obtaining the improved rates.
iii) Comparison with DGD algorithms. In existing DGD methods, a so-called gradient-tracking process similar to (11) is usually observed:

$$x_{i,t+1} = \sum_{j=1}^{n} w_{ij} \, x_{j,t} - \eta \, y_{i,t}, \qquad y_{i,t+1} = \sum_{j=1}^{n} w_{ij} \, y_{j,t} + \nabla f_i(x_{i,t+1}) - \nabla f_i(x_{i,t}), \qquad (14)$$

where $\eta > 0$ represents the step size. The proposed scheme (11) differs from (14) in step (11a). With such a deliberate design and another local dual averaging step in (12), Algorithm 1 solves constrained problems with a convergence rate guarantee. To compare DDA with existing algorithms applicable to solving Problem (1), we recall the PG-EXTRA algorithm [21]:

$$x_{i,t+\frac{1}{2}} = x_{i,t-\frac{1}{2}} + \sum_{j=1}^{n} w_{ij} \, x_{j,t} - \sum_{j=1}^{n} \tilde{w}_{ij} \, x_{j,t-1} - \eta \left( \nabla f_i(x_{i,t}) - \nabla f_i(x_{i,t-1}) \right), \qquad x_{i,t+1} = P_{\mathcal{X}}\big( x_{i,t+\frac{1}{2}} \big), \qquad (15)$$

where $\eta$ represents the step size, $P_{\mathcal{X}}$ denotes the projection onto $\mathcal{X}$, and $\tilde{w}_{ij}$ denotes the $(i,j)$-th entry of $\tilde{W} = (I + W)/2$. Notably, PG-EXTRA seeks consensus among variables $\{x_{i,t}\}$ that are obtained via a projection operator at the previous time instant. In contrast, DDA manages to agree on $\{z_{i,t}\}$, which essentially decouples the consensus-seeking procedure from the projection. After using the smoothness assumption in (3) to bound the gradient difference in (11b), the iteration in (11) can be kept almost linear, which greatly facilitates the rate analysis; see the proof of Lemma 5 for more details.
Next, we present the convergence result of Algorithm 1. Inspired by [26], we first establish the convergence property of an auxiliary sequence $\{\tilde{x}_t\}$, which is instrumental in proving the convergence of $\{x_{i,t}\}$. In particular, the update of $\tilde{x}_t$ obeys

$$\tilde{x}_{t+1} = \operatorname*{argmin}_{x \in \mathcal{X}} \left\{ \langle \bar{z}_{t+1}, x \rangle + \gamma_t \, d(x) \right\}, \qquad (16)$$

where the initial vector $\tilde{x}_0 = \operatorname*{argmin}_{x \in \mathcal{X}} d(x)$ and $\bar{z}_t = \frac{1}{n} \sum_{i=1}^{n} z_{i,t}$ denotes the mean of the local dual variables. To proceed, we introduce the following matrix:
(17)
where $\sigma := \sigma_2(W)$ denotes the second largest singular value of $W$, and let $\rho(M)$ be the spectral radius of $M$.
Theorem 1.
Proof.
The proof is postponed to Appendix B. ∎
To obtain a more explicit version of (18), we identify the eigenvalues of $M$ as $\lambda_1(M)$ and $\lambda_2(M)$, where

(23)

Thus, we have $\rho(M) = \max\{|\lambda_1(M)|, |\lambda_2(M)|\}$. By routine calculation, we can verify that $\lambda_1(M)$ and $\lambda_2(M)$ monotonically increase with $1/\gamma$. Consequently, as long as $\gamma$ satisfies

(24)

then $\rho(M)$ also satisfies (18).
then also satisfies (18). Based on (24), we have
whose size is comparable to the DGD algorithms [38, 8, 39] in the literature.
Next, we consider an unconstrained version of Problem (1), i.e., $\mathcal{X} = \mathbb{R}^d$, for which the rate of convergence is stated in the following corollary.

Corollary 1.
Proof.
The proof is given in Appendix C. ∎
III-B Accelerated Decentralized Dual Averaging
To further speed up the convergence, we develop a decentralized variant of the accelerated dual averaging method in (9) and (10). Different from Algorithm 1, we consider building consensus among local gradients and propose the following iteration rule:
$$x_{i,t+1} = \frac{A_t}{A_{t+1}} \hat{x}_{i,t} + \frac{a_{t+1}}{A_{t+1}} \nabla d^*(-z_{i,t}), \qquad \text{(27a)}$$

$$\hat{x}_{i,t+1} = \sum_{j=1}^{n} w_{ij} \left( \frac{A_t}{A_{t+1}} \hat{x}_{j,t} + \frac{a_{t+1}}{A_{t+1}} \nabla d^*(-z_{j,t+1}) \right), \qquad \text{(27b)}$$

where

$$a_t = \alpha t, \qquad A_t = \sum_{s=1}^{t} a_s, \qquad (28)$$

and

$$y_{i,t+1} = \sum_{j=1}^{n} w_{ij} \, y_{j,t} + \nabla f_i(x_{i,t+1}) - \nabla f_i(x_{i,t}), \qquad z_{i,t+1} = z_{i,t} + a_{t+1} \, y_{i,t+1}. \qquad (29)$$
The overall algorithm is summarized in Algorithm 2. It is worth mentioning that agents in Algorithm 2 consume the same communication resources as in Algorithm 1 to achieve acceleration.
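A hedged sketch of Algorithm 2 under our reconstruction of (27)-(29), again with the quadratic prox-function so that $\nabla d^*(-z)$ is a projection; the one-consensus-one-gradient structure per iteration mirrors the discussion below, and the function names are ours.

```python
import numpy as np

def adda(W, grads, proj, dim, T, alpha):
    """Accelerated decentralized dual averaging (Algorithm 2) under our
    reconstruction of (27)-(29), with grad d*(-z) = proj(-z)."""
    n = W.shape[0]
    x_hat = np.zeros((n, dim))                 # solution sequence; assumes 0 feasible
    x = x_hat.copy()                           # extrapolated query points
    g = np.array([grads[i](x[i]) for i in range(n)])
    y = g.copy()                               # estimates of the average gradient
    z = np.zeros((n, dim))                     # local dual variables
    A = 0.0
    for t in range(1, T + 1):
        a = alpha * t                          # increasing weights, cf. (28)
        A_next = A + a
        v = np.array([proj(-z[i]) for i in range(n)])
        x = (A / A_next) * x_hat + (a / A_next) * v            # (27a)
        g_new = np.array([grads[i](x[i]) for i in range(n)])
        y = W @ y + g_new - g                  # (29): track the mean of local gradients
        g = g_new
        z = z + a * y                          # (29): accumulate into the dual variable
        v = np.array([proj(-z[i]) for i in range(n)])
        x_hat = W @ ((A / A_next) * x_hat + (a / A_next) * v)  # (27b)
        A = A_next
    return x_hat
```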
Assumption 3. For the problem in (1), the constraint set $\mathcal{X}$ is bounded with diameter

$$R := \max_{x, y \in \mathcal{X}} \|x - y\| < \infty.$$
Theorem 2.
Proof.
The proof is postponed to Appendix D. ∎
i) Comparison with existing accelerated algorithms. Accelerated methods for decentralized constrained optimization are rarely reported in the literature. Recently, the authors in [22] developed the APM algorithm, where the iteration rule reads
(33a)

(33b)

(33c)
where $\theta_t \in (0, 1]$ is a decreasing parameter satisfying a prescribed recursion with $\theta_0 = 1$. By a change of variables, we can equivalently rewrite (33b) and (33c) in an accumulated form, from which we can see that new gradients are assigned decreasing weights, whereas increasing weights are used for ADDA in (28). The reason for such different choices of parameters may be two-fold. First, parameter choices in (centralized) primal gradient descent and dual averaging methods are intrinsically different. Second, APM gradually increases the penalty parameter in order to enforce consensus, which essentially dilutes the weights on the gradients, as shown above. We will show in the simulations that weights decreasing over time slow down convergence. There are also a few other accelerated decentralized methods, such as [9, 14]; however, they do not apply to constrained problems.
ii) Discussion about optimality. For ADDA, the rate of convergence is proved to be

$$O\left( \frac{1}{t^2} + \frac{1}{t (1 - \beta)^2} \right).$$

In light of the lower bound in [14], it is not optimal in terms of the dependence on $1 - \beta$. In particular, the dominant term of the error becomes larger as $\beta$ grows, i.e., as the network becomes more sparsely connected. This is mainly because we consider a one-consensus-one-gradient update in the algorithm. However, extending the algorithm in [14] to handle constraints may require further investigation. In the simulation section, we demonstrate the superiority of ADDA over existing decentralized constrained optimization algorithms.
IV Simulation
In this section, we verify the proposed methods by applying them to solve the following constrained LASSO problem:

$$\min_{\|x\|_1 \le r} \; \sum_{i=1}^{n} \|A_i x - b_i\|^2,$$

where $A_i \in \mathbb{R}^{m_i \times d}$, $b_i \in \mathbb{R}^{m_i}$, and $r > 0$ is a constant parameter that defines the constraint. In the simulation, each agent $i$ has access to a local data tuple $(A_i, b_i)$. Two different problem instances, characterized by real and synthetic datasets, respectively, are considered.
IV-A Case I: Real Dataset
In this setting, we use sparco7 [40, 17] to define the LASSO problem, and consider a cycle graph and a complete graph of $n$ nodes. The corresponding weight matrix $W$ is determined by following the Metropolis-Hastings rule [36]. Each agent holds a local measurement matrix $A_i$ and the corresponding slice $b_i$ of the corrupted measurement. The constraint parameter is set as $r = \|x_{\mathrm{true}}\|_1$, where $x_{\mathrm{true}}$ denotes the unknown variable to be recovered via solving LASSO. In this case, the simulation experiments were performed using MATLAB R2020b.
For comparison, the PG-EXTRA method in [21] and the APM method in [22] are simulated, with their step sizes and parameters hand-tuned for good performance. For DDA and ADDA in this work, we use constant parameters $\gamma$ and $\alpha$, respectively, and $d(x) = \frac{1}{2}\|x\|^2$ as the prox-function. The projection onto an $\ell_1$ ball is carried out via the algorithm in [41]. All the methods are initialized with $x_{i,0} = \mathbf{0}$.
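For completeness, the following standard sort-based routine computes the same Euclidean projection onto the $\ell_1$ ball as the faster algorithm of [41]; this simple variant is our own stand-in, not the implementation used in the experiments.

```python
import numpy as np

def project_l1_ball(v, r):
    """Euclidean projection of v onto {x : ||x||_1 <= r} via sorting;
    it returns the same point as the O(d)-expected algorithm of [41]."""
    if np.abs(v).sum() <= r:
        return v.copy()                          # already feasible
    u = np.sort(np.abs(v))[::-1]                 # sorted magnitudes, descending
    cssv = np.cumsum(u) - r
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > cssv)[0][-1]
    theta = cssv[rho] / (rho + 1.0)              # soft-threshold level
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)
```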
The performance of the four algorithms is displayed in Figs. 1 and 2. In Fig. 1, the performance is evaluated in terms of the objective error $f(x_t) - f^\star$, where $f^\star$ is identified using CVX [42]. It demonstrates that the DDA method outperforms the other methods when the graph is a cycle. As the graph becomes denser, i.e., the complete graph, the convergence of all algorithms becomes faster. Among them, the ADDA method demonstrates the most significant improvement. This is in line with Theorem 2, where the network connectivity impacts the $O(1/t)$ term of the convergence error as opposed to the $O(1/t^2)$ term. In Fig. 2, we compare the trajectories of the consensus error, i.e., $\sum_{i=1}^{n} \|x_{i,t} - \bar{x}_t\|$ with $\bar{x}_t = \frac{1}{n}\sum_{i=1}^{n} x_{i,t}$, for all methods. When the graph is a cycle, APM and PG-EXTRA have smaller consensus errors than the developed methods, mainly because they build consensus directly among the primal variables $\{x_{i,t}\}$. When the graph is complete, the consensus error of the proposed DDA method vanishes because of the conservation property established in Lemma 3 and the complete graph structure.
IV-B Case II: Synthetic Dataset
For the synthetic dataset, the problem dimensions and the number of agents are fixed, and the data is generated in the following way. First, each local measurement matrix $A_i$ is randomly generated with each entry drawn from the standard normal distribution $\mathcal{N}(0, 1)$. Next, each entry of the sparse vector $x_{\mathrm{true}}$ to be recovered via LASSO is, with a prescribed probability, randomly drawn from a normal distribution and otherwise set to zero. Then the corrupted measurement is produced based on

$$b_i = A_i x_{\mathrm{true}} + e_i,$$

where $e_i$ represents Gaussian noise with zero mean and a prescribed variance. The constraint parameter is set as $r = \|x_{\mathrm{true}}\|_1$. For this setting, we employed the message passing interface (MPI) in Python 3.7.3 to simulate a network of nodes, where each node is connected to a subset of the other nodes. For comparison, the proposed methods are compared to their centralized counterparts. The parameters for dual averaging and accelerated dual averaging are set to constant values. Similarly, the function $\frac{1}{2}\|x\|^2$ is used as the prox-function, and the algorithms are initialized with $\mathbf{0}$.
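A sketch of the synthetic data generation described above; the default sparsity and noise levels are illustrative stand-ins, since the paper's exact values were lost in extraction.

```python
import numpy as np

def make_lasso_data(n, m, d, sparsity=0.1, noise_std=0.1, seed=0):
    """Generate decentralized LASSO data as described above: n agents,
    each with an (m, d) measurement matrix and an m-dimensional response."""
    rng = np.random.default_rng(seed)
    A = [rng.standard_normal((m, d)) for _ in range(n)]           # A_i with N(0,1) entries
    x_true = rng.standard_normal(d) * (rng.random(d) < sparsity)  # sparse ground truth
    b = [Ai @ x_true + noise_std * rng.standard_normal(m) for Ai in A]
    r = np.abs(x_true).sum()                                      # constraint radius
    return A, b, x_true, r
```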
The performance of the developed algorithms and their centralized counterparts is illustrated in Fig. 3. In particular, the performance is evaluated in terms of the objective function value versus computing time. It demonstrates that the proposed methods outperform the corresponding centralized algorithms, in the sense that the decentralized algorithms consume less computing time than their centralized counterparts to reach the same degree of accuracy.
Figure 1: Objective error versus iterations for the four algorithms in Case I (cycle and complete graphs).

Figure 2: Consensus error versus iterations for the four algorithms in Case I.

Figure 3: Objective function value versus computing time for the proposed methods and their centralized counterparts in Case II.
V Conclusion
In this work, we have designed two DDA algorithms for solving decentralized constrained optimization problems with improved rates of convergence. In the first one, a novel second-order dynamic average consensus scheme is developed, with which each agent locally generates a more accurate estimate of the global dual variable than with existing methods under mild assumptions. This property enables each agent to use a large constant weight in the local dual averaging step, and therefore improves the rate of convergence. In the second algorithm, each agent retains the conventional first-order dynamic average consensus method to estimate the average of local gradients. Instead, the extrapolation technique together with the average consensus protocol is used to achieve acceleration over a decentralized network.
This work opens several avenues for future research. Here, we focused on the basic setting with time-invariant bidirectional communication networks. We believe that the consensus-based dual averaging framework can be extended to tackle decentralized constrained optimization in complex networks, e.g., directed networks [43] and time-varying networks [38]. Furthermore, we expect that our approach, as suggested by its centralized counterpart, i.e., follow-the-regularized-leader, may deliver strong performance in the online optimization setting [44].
Appendix A Roadmap for the proofs
Figure 4: An illustration of how the lemmas, theorems, and corollaries in the proofs relate to each other.
Before proceeding to the proofs, we present Fig. 4 to illustrate how they relate to each other.
Appendix B Proof of Theorem 1
B-A Preliminaries
We introduce several notations to facilitate the presentation of the proof. Let

$$\mathbf{z}_t = [z_{1,t}, \dots, z_{n,t}]^\top, \qquad \mathbf{y}_t = [y_{1,t}, \dots, y_{n,t}]^\top, \qquad \nabla_t = [\nabla f_1(x_{1,t}), \dots, \nabla f_n(x_{n,t})]^\top.$$

Using these notations, we express (11) in the following compact form:

$$\mathbf{z}_{t+1} = W \left( \mathbf{z}_t + \mathbf{y}_t \right), \qquad \text{(34a)}$$

$$\mathbf{y}_{t+1} = W \mathbf{y}_t + \nabla_{t+1} - \nabla_t, \qquad \text{(34b)}$$

where $\mathbf{y}_0 = \nabla_0$. Before proceeding to the proof of Theorem 1, we present several technical lemmas.
Recall a lemma from [39].
Lemma 2. Suppose that $\{a_t\}_{t \ge 0}$ and $\{b_t\}_{t \ge 0}$ are two sequences of positive scalars such that for all $t \ge 0$,

$$a_{t+1} \le \rho \, a_t + b_t,$$

where $\rho \in (0, 1)$ is a constant. Then, the following holds for all $T \ge 1$:

$$\sum_{t=0}^{T} a_t \le \frac{a_0}{1 - \rho} + \frac{1}{1 - \rho} \sum_{t=0}^{T-1} b_t.$$
Lemma 3. For Algorithm 1, we have that for any $t \ge 0$,

$$\frac{1}{n} \sum_{i=1}^{n} y_{i,t} = \frac{1}{n} \sum_{i=1}^{n} \nabla f_i(x_{i,t}). \qquad (35)$$
Proof of Lemma 3.
Proof of Lemma 4.
The following lemma establishes the relation between the sequences $\{\mathbf{z}_t\}$ and $\{\tilde{x}_t\}$.

Lemma 5.
Proof of Lemma 5.
From Lemma 3, we have
Define
(38) |
These in conjunction with (34a) lead to
(39) |
Because of (36), we have
In addition, we have
(40) |
where the last equality is due to the double stochasticity of $W$. Using the same arguments as above together with Assumption 2, we have
(41) |
Similarly, from Lemma 3, we obtain
(42) |
where the last inequality is due to Assumption 1. Using this together with Lemma 1, we obtain
Upon substituting the above inequality into (42), we obtain
(43) |
By combining (41) and (43), we establish the following inequality:
where is defined in (17). By iterating the above inequality and using
we obtain
Recall the eigenvalues of the matrix $M$ in (23). Then an analytical form can be presented for the $t$-th power of $M$ (see, e.g., [45]). Therefore, the $(i,j)$-th entry of $M^t$ can be written in closed form. Due to the assumption that $\lambda_1(M) \ne \lambda_2(M)$ and $\rho(M) < 1$, we have
Therefore,
(44) |
Using Lemma 1, we further obtain
(45) |
Further using the above relation and Lemma 2, the inequality (37) follows as desired. ∎
Then, a lemma is stated for the prox-mapping.

Lemma 6. Given a sequence of variables $\{u_t\} \subset \mathbb{R}^d$ and a positive sequence $\{\gamma_t\}$, for $\{x_t\}$ generated by

$$x_{t+1} = \operatorname*{argmin}_{x \in \mathcal{X}} \left\{ \langle u_t, x \rangle + \gamma_t \, d(x) \right\},$$

where $d$ satisfies the condition in (5), it holds that

(46)
Proof of Lemma 6.
Define
where . Since
and is strongly convex with modulus , we have
Upon taking $x = x_{t+1}$ in the above inequality and using
we obtain
which is equivalent to
Iterating the above relation over time yields
(47) |
We turn to consider
which together with (47) leads to the inequality in (46), thereby concluding the proof. ∎
B-B Proof of Theorem 1
Proof of Theorem 1.
Appendix C Proof of Corollary 1
Proof of Corollary 1.
We consider
(50)
where the two inequalities follow from Assumption 1. When $\mathcal{X} = \mathbb{R}^d$, the solutions of (12) and (16) admit closed forms, implying that $\tilde{x}_t = \frac{1}{n} \sum_{i=1}^{n} x_{i,t}$. Upon summing up (50) from $i = 1$ to $n$, we obtain
(51) |
Then, we iterate (51) over time and use the convexity of $f$ to get
(52) |
Using Lemma 5, we obtain
(53) |
Recalling (49), we have
(54) |
By scaling both sides of (54) and adding the resultant inequality to (53), we get
(55) |
Upon using the condition in (25), we arrive at (26) as desired. ∎
Appendix D Proof of Theorem 2
D-A Preliminaries
For Algorithm 2, we define
Based on these notations, we present the steps in (27) and (29) in the following compact form
(56a)

(56b)

(56c)
According to (27), we have
(57a)
(57b) |
Before proving Theorem 2, we present Lemma 7, which establishes upper bounds for the consensus error vectors.
Proof of Lemma 7.
Since all the primal iterates are within the constraint set, we readily have
by Assumption 3. Upon using
and the foregoing definitions, we have that (58) holds for $t = 0$.
For $t \ge 1$, we prove the claim by an induction argument. Suppose that (58) holds for some $t \ge 0$. Next, we examine the two upper bounds in turn.
i) Upper bound for the first consensus error vector. Using
Evaluating the norm of both sides of the above equality yields
where the last inequality follows from the induction hypothesis and Assumption 3. Since the relevant coefficient monotonically decreases with $t$, we have
where the last inequality is due to the preceding relation. It then remains to prove
(59) |
to obtain the desired bound. To prove (59), we let
which implies
Based on the above relation, we further obtain
(60) |
This in conjunction with
and the foregoing definitions yields
Since the relevant quantity monotonically increases with $t$, we have (59) satisfied.
ii) Upper bound for the second consensus error vector. Using the same arguments as above, we have
By following the same line of reasoning as in the first part, we are able to obtain
Summarizing the above bounds, the proof is completed. ∎
Lemma 8 establishes an upper bound for the remaining consensus error vector.

Lemma 8.
Proof of Lemma 8.
For (62), we subtract the network-wide average from both sides of (56c) to get
(63) |
Using the same procedure as in Appendix B-A leads to
(64) |
Since the objective is smooth, we obtain
To bound the right-hand side, we consider
where the first equality is due to (56a). Combining (56a) and (56b) with an additional algebraic step, it holds that
(65) |
where Lemma 7 and Assumption 3 are used to get the second inequality. By substituting (65) into (64), we obtain
(66) |
By the initialization, we have that (62) is valid for $t = 0$. Next, we prove that (62) also holds for $t \ge 1$ by mathematical induction. Suppose that (62) holds true for some $t \ge 0$. Using this hypothesis and (66), we obtain
Finally, using the same argument as with (59) and (60) in the proof of Lemma 7, we arrive at (62) as desired. ∎
D-B Proof of Theorem 2
Proof of Theorem 2.
Using the definitions above, we have
Upon using the convexity of $f$, we obtain
By (57b), we obtain
(67) |
where the last inequality is due to the smoothness of $f$. To bound the first term, we consider
(68) |
where the first inequality follows from the triangle inequality and the last inequality is due to Lemma 7 and (57). For the second term, applying Lemma 6 with appropriately chosen sequences, we have
(69) |
To bound the last term, we use (61) to get
Upon using Lemma 7, we obtain
Recall the bound established in Lemma 8. Therefore,
(70) |
Finally, by collectively substituting (68), (69), and (70) into (67), we get
Based on the condition in (30) and the preceding bounds, we obtain (31) as desired.
References
- [1] A. Nedić, A. Olshevsky, and M. G. Rabbat, “Network topology and communication-computation tradeoffs in decentralized optimization,” Proceedings of the IEEE, vol. 106, no. 5, pp. 953–976, 2018.
- [2] T. Yang, X. Yi, J. Wu, Y. Yuan, D. Wu, Z. Meng, Y. Hong, H. Wang, Z. Lin, and K. H. Johansson, “A survey of distributed optimization,” Annual Reviews in Control, vol. 47, pp. 278–305, 2019.
- [3] K. Yuan, Q. Ling, and W. Yin, “On the convergence of decentralized gradient descent,” SIAM Journal on Optimization, vol. 26, no. 3, pp. 1835–1854, 2016.
- [4] A. Nedic and A. Ozdaglar, “Distributed subgradient methods for multi-agent optimization,” IEEE Transactions on Automatic Control, vol. 54, no. 1, pp. 48–61, 2009.
- [5] W. Shi, Q. Ling, G. Wu, and W. Yin, “EXTRA: An exact first-order algorithm for decentralized consensus optimization,” SIAM Journal on Optimization, vol. 25, no. 2, pp. 944–966, 2015.
- [6] M. Zhu and S. Martínez, “Discrete-time dynamic average consensus,” Automatica, vol. 46, no. 2, pp. 322–329, 2010.
- [7] D. Varagnolo, F. Zanella, A. Cenedese, G. Pillonetto, and L. Schenato, “Newton-Raphson consensus for distributed convex optimization,” IEEE Transactions on Automatic Control, vol. 61, no. 4, pp. 994–1009, 2015.
- [8] G. Qu and N. Li, “Harnessing smoothness to accelerate distributed optimization,” IEEE Transactions on Control of Network Systems, vol. 5, no. 3, pp. 1245–1260, 2017.
- [9] ——, “Accelerated distributed Nesterov gradient descent,” IEEE Transactions on Automatic Control, vol. 65, no. 6, pp. 2566–2581, 2019.
- [10] D. Jakovetić, J. Xavier, and J. M. Moura, “Fast distributed gradient methods,” IEEE Transactions on Automatic Control, vol. 59, no. 5, pp. 1131–1146, 2014.
- [11] D. Jakovetić, “A unification and generalization of exact distributed first-order methods,” IEEE Transactions on Signal and Information Processing over Networks, vol. 5, no. 1, pp. 31–46, 2018.
- [12] W. Shi, Q. Ling, K. Yuan, G. Wu, and W. Yin, “On the linear convergence of the ADMM in decentralized consensus optimization,” IEEE Transactions on Signal Processing, vol. 62, no. 7, pp. 1750–1761, 2014.
- [13] C. A. Uribe, S. Lee, A. Gasnikov, and A. Nedić, “A dual approach for optimal algorithms in distributed optimization over networks,” Optimization Methods and Software, vol. 36, no. 1, pp. 171–210, 2021.
- [14] J. Xu, Y. Tian, Y. Sun, and G. Scutari, “Accelerated primal-dual algorithms for distributed smooth convex optimization over networks,” in International Conference on Artificial Intelligence and Statistics. PMLR, 2020, pp. 2381–2391.
- [15] K. Scaman, F. Bach, S. Bubeck, Y. T. Lee, and L. Massoulié, “Optimal algorithms for smooth and strongly convex distributed optimization in networks,” in International Conference on Machine Learning. PMLR, 2017, pp. 3027–3036.
- [16] P. Latafat, L. Stella, and P. Patrinos, “New primal-dual proximal algorithm for distributed optimization,” in 2016 IEEE 55th Conference on Decision and Control (CDC). IEEE, 2016, pp. 1959–1964.
- [17] H.-T. Wai, J. Lafond, A. Scaglione, and E. Moulines, “Decentralized frank–wolfe algorithm for convex and nonconvex problems,” IEEE Transactions on Automatic Control, vol. 62, no. 11, pp. 5522–5537, 2017.
- [18] Z. Li and M. Yan, “New convergence analysis of a primal-dual algorithm with large stepsizes,” Advances in Computational Mathematics, vol. 47, no. 1, pp. 1–20, 2021.
- [19] A. Nedic, A. Ozdaglar, and P. A. Parrilo, “Constrained consensus and optimization in multi-agent networks,” IEEE Transactions on Automatic Control, vol. 55, no. 4, pp. 922–938, 2010.
- [20] K. Scaman, F. Bach, S. Bubeck, Y. T. Lee, and L. Massoulié, “Optimal algorithms for non-smooth distributed optimization in networks,” in 32nd International Conference on Neural Information Processing Systems, 2018, pp. 2745–2754.
- [21] W. Shi, Q. Ling, G. Wu, and W. Yin, “A proximal gradient algorithm for decentralized composite optimization,” IEEE Transactions on Signal Processing, vol. 63, no. 22, pp. 6013–6023, 2015.
- [22] H. Li, C. Fang, W. Yin, and Z. Lin, “Decentralized accelerated gradient methods with increasing penalty parameters,” IEEE Transactions on Signal Processing, vol. 68, pp. 4855–4870, 2020.
- [23] M. Rabbat, “Multi-agent mirror descent for decentralized stochastic optimization,” in 2015 IEEE 6th International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP). IEEE, 2015, pp. 517–520.
- [24] S. Shahrampour and A. Jadbabaie, “Distributed online optimization in dynamic environments using mirror descent,” IEEE Transactions on Automatic Control, vol. 63, no. 3, pp. 714–725, 2017.
- [25] D. Yuan, Y. Hong, D. W. Ho, and G. Jiang, “Optimal distributed stochastic mirror descent for strongly convex optimization,” Automatica, vol. 90, pp. 196–203, 2018.
- [26] J. C. Duchi, A. Agarwal, and M. J. Wainwright, “Dual averaging for distributed optimization: Convergence analysis and network scaling,” IEEE Transactions on Automatic Control, vol. 57, no. 3, pp. 592–606, 2011.
- [27] S. Liu, P.-Y. Chen, and A. O. Hero, “Accelerated distributed dual averaging over evolving networks of growing connectivity,” IEEE Transactions on Signal Processing, vol. 66, no. 7, pp. 1845–1859, 2018.
- [28] C. Liu, H. Li, and Y. Shi, “A unitary distributed subgradient method for multi-agent optimization with different coupling sources,” Automatica, vol. 114, no. 108834, 2020.
- [29] S. Liang, G. Yin et al., “Dual averaging push for distributed convex optimization over time-varying directed graph,” IEEE Transactions on Automatic Control, vol. 65, no. 4, pp. 1785–1791, 2019.
- [30] A. S. Nemirovsky and D. B. Yudin, Problem Complexity and Method Efficiency in Optimization. John Wiley and Sons, 1983.
- [31] Y. Nesterov, “Primal-dual subgradient methods for convex problems,” Mathematical Programming, vol. 120, no. 1, pp. 221–259, 2009.
- [32] L. Xiao, “Dual averaging methods for regularized stochastic learning and online optimization,” Journal of Machine Learning Research, vol. 11, pp. 2543–2596, 2010.
- [33] K. I. Tsianos, S. Lawlor, and M. G. Rabbat, “Push-sum distributed dual averaging for convex optimization,” in 2012 IEEE 51st Conference on Decision and Control (CDC). IEEE, 2012, pp. 5453–5458.
- [34] I. Colin, A. Bellet, J. Salmon, and S. Clémençon, “Gossip dual averaging for decentralized optimization of pairwise functions,” in International Conference on Machine Learning. PMLR, 2016, pp. 1388–1396.
- [35] M. Cohen, J. Diakonikolas, and L. Orecchia, “On acceleration with noise-corrupted gradients,” in International Conference on Machine Learning, 2018, pp. 1019–1028.
- [36] L. Xiao and S. Boyd, “Fast linear iterations for distributed averaging,” Systems & Control Letters, vol. 53, no. 1, pp. 65–78, 2004.
- [37] H. Lu, R. M. Freund, and Y. Nesterov, “Relatively smooth convex optimization by first-order methods, and applications,” SIAM Journal on Optimization, vol. 28, no. 1, pp. 333–354, 2018.
- [38] A. Nedic, A. Olshevsky, and W. Shi, “Achieving geometric convergence for distributed optimization over time-varying graphs,” SIAM Journal on Optimization, vol. 27, no. 4, pp. 2597–2633, 2017.
- [39] J. Xu, S. Zhu, Y. C. Soh, and L. Xie, “Augmented distributed gradient methods for multi-agent optimization under uncoordinated constant stepsizes,” in 2015 IEEE 54th Conference on Decision and Control (CDC). IEEE, 2015, pp. 2055–2060.
- [40] E. van den Berg, M. Friedlander, G. Hennenfent, F. Herrmann, R. Saab, and O. Yılmaz, “Sparco: A testing framework for sparse reconstruction,” Dept. Comput. Sci., Univ. British Columbia, Vancouver, Tech. Rep. TR-2007-20, 2007. [Online]. Available: http://www.cs.ubc.ca/labs/scl/sparco
- [41] L. Condat, “Fast projection onto the simplex and the l1 ball,” Mathematical Programming, vol. 158, no. 1-2, pp. 575–585, 2016.
- [42] M. Grant and S. Boyd, “CVX: Matlab software for disciplined convex programming, version 2.1,” 2014.
- [43] S. Pu, W. Shi, J. Xu, and A. Nedić, “A push-pull gradient method for distributed optimization in networks,” in 2018 IEEE 57th Conference on Decision and Control (CDC). IEEE, 2018, pp. 3385–3390.
- [44] X. Yi, X. Li, L. Xie, and K. H. Johansson, “Distributed online convex optimization with time-varying coupled inequality constraints,” IEEE Transactions on Signal Processing, vol. 68, pp. 731–746, 2020.
- [45] K. S. Williams, “The $n$th power of a $2 \times 2$ matrix,” Mathematics Magazine, vol. 65, no. 5, p. 336, 1992.