A Unifying Approximate Method of Multipliers
for Distributed Composite Optimization
Abstract
This paper investigates convex composite optimization on an undirected network, where each node, privately endowed with a smooth component function and a nonsmooth one, is required to minimize the sum of all the component functions throughout the network. To address such a problem, a general Approximate Method of Multipliers (AMM) is developed, which attempts to approximate the Method of Multipliers by virtue of a surrogate function with numerous options. We then design the possibly nonseparable, time-varying surrogate function in various ways, leading to different distributed realizations of AMM. We demonstrate that AMM generalizes more than ten state-of-the-art distributed optimization algorithms, and that certain specific designs of its surrogate function yield a variety of algorithms that are new to the literature. Furthermore, we show that AMM achieves an $O(1/k)$ rate of convergence to optimality, and that the convergence rate becomes linear when the problem is locally restricted strongly convex and smooth. These convergence rates provide new or stronger convergence results for many prior methods that can be viewed as specializations of AMM.
Index Terms:
Distributed optimization, convex composite optimization, the Method of Multipliers.

I Introduction
This paper addresses the following convex composite optimization problem:
$$\operatorname*{minimize}_{x\in\mathbb{R}^d}\ \sum_{i=1}^{N}\bigl(f_i(x)+g_i(x)\bigr)\qquad(1)$$
where each $f_i:\mathbb{R}^d\to\mathbb{R}$ is convex and has a Lipschitz continuous gradient, and each $g_i:\mathbb{R}^d\to\mathbb{R}\cup\{+\infty\}$ is convex and can be non-differentiable. Notice that $g_i$ is allowed to be a zero function. Also, if $g_i$ contains an indicator function with respect to a closed convex set $X_i$, i.e., a function equal to $0$ on $X_i$ and $+\infty$ otherwise, then problem (1) turns into a nonsmooth, constrained convex program.
We consider solving problem (1) in a distributed way over a network modeled as a connected, undirected graph $\mathcal{G}=(\mathcal{V},\mathcal{E})$, where $\mathcal{V}=\{1,\dots,N\}$ is the vertex set and $\mathcal{E}$ is the edge set. We suppose each node $i\in\mathcal{V}$ possesses two private component functions $f_i$ and $g_i$, aims at solving (1), and only exchanges information with its neighbors, denoted by the set $\mathcal{N}_i=\{j\in\mathcal{V}:\{i,j\}\in\mathcal{E}\}$. Applications of such a distributed composite optimization problem include control of multi-agent systems, robust estimation by sensor networks, resource allocation in smart grids, decision making for swarm robotics, as well as distributed learning [1].
To date, a large body of distributed optimization algorithms have been proposed to solve problem (1) or its special cases. A typical practice is to reformulate (1) into a constrained problem with a separable objective function and a consensus constraint to be relaxed. For example, the inexact methods [2, 3] solve this reformulated problem by penalizing the consensus constraint in the objective function, and the primal-dual methods [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22] relax the consensus constraint through dual decomposition techniques. Another common approach is to emulate the centralized subgradient/gradient methods in a decentralized manner, such as the distributed subgradient methods [23, 24, 25] and the distributed gradient-tracking-based methods [26, 27, 28, 29, 30, 31, 32, 33, 34] where [31, 32, 33, 34] utilize surrogate functions in order to realize the decentralized emulation.
Despite the growing literature, relatively few methods manage to tackle the general form of problem (1) under distributed settings, among which only the first-order methods [8, 15, 6, 10, 17, 9, 11, 12, 13, 16, 7] are guaranteed to converge to optimality with constant step-sizes. Here, [8, 15] rely on strong convexity of the problem, and [6, 10, 17] only prove asymptotic convergence for non-strongly convex problems. In contrast, the remaining works [9, 11, 12, 13, 16, 7] establish convergence rates in solving (1). Although these algorithms are developed using different rationales, we discover that the majority of them, i.e., [9, 11, 12, 13, 16], can be viewed as originating from the Method of Multipliers [35].
The Method of Multipliers is a seminal (centralized) optimization method, and one of its notable variants is the Alternating Direction Method of Multipliers (ADMM) [35]. In this paper, we develop a novel paradigm of solving (1) via approximating the behavior of the Method of Multipliers, called Approximate Method of Multipliers (AMM). The proposed AMM adopts a possibly time-varying surrogate function to take the place of the smooth objective function at every minimization step in the Method of Multipliers, facilitating abundant designs of distributed algorithms for solving (1) in a fully decentralized fashion. Unlike the typical separable surrogate functions for distributed optimization such as those in [31, 32, 33, 34], our surrogate function can be nonseparable in some distributed versions of AMM. It also admits certain function forms that cannot be included by [31, 32, 33, 34].
To enable distributed implementation of AMM, we first opt for a Bregman-divergence-type surrogate function, leading to a class of distributed realizations of AMM referred to as Distributed Approximate Method of Multipliers (DAMM). We also utilize convex conjugate and graph topologies to design the surrogate function, and thus construct two additional sets of distributed realizations of AMM, referred to as DAMM-SC and DAMM-SQ, for solving smooth convex optimization (i.e., (1) with ). We concretely exemplify such realizations of AMM, so that new distributed proximal/second-order/gradient-tracking methods can be obtained. Apart from that, AMM and its distributed realizations also unify a wide range of state-of-the-art distributed first-order and second-order algorithms [5, 13, 19, 16, 12, 11, 4, 9, 14, 26, 18] for solving (1) or its special cases, as all of them can be cast into the form of AMM/DAMM/DAMM-SQ. This offers a unifying perspective for understanding the nature of these existing methods with various design rationales.
We show that AMM allows the nodes to reach a consensus and converge to the optimal value at a rate of $O(1/k)$, where $k$ is the number of iterations, either with a particular surrogate function type or under a local restricted strong convexity condition. Among all the algorithms that are able to solve the general form of problem (1) in a distributed way, the convergence rates of AMM and of the algorithms in [9, 11, 12, 13, 16, 7] achieve the best order of $O(1/k)$, while the algorithms in [9, 11, 12, 13, 16] are indeed special cases of AMM with that particular surrogate function type, and the convergence rate in [7] is in terms of an implicit measure of optimality error. Moreover, we show that when problem (1) is both smooth and locally restricted strongly convex, AMM enables the nodes to attain the optimum at a linear rate. Unlike most existing works, this linear convergence is established without requiring global strong convexity. Naturally, our analysis of AMM yields new convergence results and relaxes some problem assumptions, with no degradation of the convergence rate order, for quite a few existing distributed optimization methods that can be generalized by AMM.
The outline of the paper is as follows: Section II describes the proposed AMM and the distributed designs. Section III demonstrates that AMM generalizes a number of prior distributed optimization methods. Section IV discusses the convergence results of AMM under various assumptions, and Section V presents the comparative simulation results. Finally, Section VI concludes the paper.
Notation and Definition: For a real matrix $A$, its $(i,j)$-entry is denoted by $[A]_{ij}$. In addition, $\mathrm{Null}(A)$ is the null space of $A$, $\mathrm{Range}(A)$ is the range of $A$, and $\|A\|$ is the spectral norm of $A$. Besides, $\mathrm{diag}(A_1,\dots,A_n)$ represents the block diagonal matrix with $A_1,\dots,A_n$ as its diagonal blocks. Also, $\mathbf{0}$, $\mathbf{1}$, and $O$ represent a zero vector, an all-one vector, and a zero square matrix of proper dimensions, respectively, and $I$ represents the identity matrix. For any two matrices $A$ and $B$, $A\otimes B$ is the Kronecker product of $A$ and $B$. If $A$ and $B$ are square matrices of the same size, $A\succeq B$ means $A-B$ is positive semidefinite and $A\succ B$ means $A-B$ is positive definite. For any symmetric matrix $A$, $\lambda_{\max}(A)$ denotes its largest eigenvalue. For any $x\in\mathbb{R}^n$ and any symmetric $P\succeq O$, $\|x\|_P:=\sqrt{x^TPx}$. For any countable set $S$, $|S|$ represents its cardinality. For any convex function $h$, $\partial h(x)$ denotes its subdifferential (i.e., the set of subgradients) at $x$. If $h$ is differentiable at $x$, $\partial h(x)$ only contains the gradient $\nabla h(x)$.
Given a convex set $C\subseteq\mathbb{R}^n$, a function $h:\mathbb{R}^n\to\mathbb{R}$ is said to be strongly convex on $C$ with convexity parameter $m>0$ (or simply $m$-strongly convex on $C$) if $h(y)\ge h(x)+\langle s,y-x\rangle+\frac{m}{2}\|y-x\|^2$ for all $x,y\in C$ and $s\in\partial h(x)$. We say $h$ is (globally) strongly convex if it is strongly convex on $\mathbb{R}^n$. Given $\tilde{x}\in\mathbb{R}^n$, a differentiable $h$ is said to be restricted strongly convex with respect to $\tilde{x}$ on $C$ with convexity parameter $m>0$ if $\langle\nabla h(x)-\nabla h(\tilde{x}),x-\tilde{x}\rangle\ge m\|x-\tilde{x}\|^2$ for all $x\in C$. Given $L>0$, $h$ is said to be $L$-smooth if it is differentiable and its gradient is Lipschitz continuous with Lipschitz constant $L$, i.e., $\|\nabla h(x)-\nabla h(y)\|\le L\|x-y\|$ for all $x,y\in\mathbb{R}^n$.
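To make these definitions concrete, the following is a minimal numerical check for a quadratic function; the matrix $Q$, the sampling, and the tolerance are illustrative assumptions, not objects from this paper.

```python
import numpy as np

# Sanity check of the definitions above for h(x) = 0.5 x^T Q x with Q
# symmetric positive definite: h is lambda_min(Q)-strongly convex,
# lambda_max(Q)-smooth, and restricted strongly convex with respect to
# its minimizer 0 with parameter lambda_min(Q).
Q = np.diag([0.5, 2.0, 4.0])
grad = lambda x: Q @ x
eigs = np.linalg.eigvalsh(Q)
L, m = eigs.max(), eigs.min()

rng = np.random.default_rng(0)
for _ in range(1000):
    x, y = rng.standard_normal(3), rng.standard_normal(3)
    # L-smoothness: ||grad h(x) - grad h(y)|| <= L ||x - y||
    assert np.linalg.norm(grad(x) - grad(y)) <= L * np.linalg.norm(x - y) + 1e-9
    # restricted strong convexity w.r.t. 0: <grad h(x) - grad h(0), x> >= m ||x||^2
    assert (grad(x) - grad(np.zeros(3))) @ x >= m * (x @ x) - 1e-9
```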
II Approximate Method of Multipliers and Distributed Designs
This section develops distributed optimization algorithms for solving problem (1) over the undirected, connected graph , with the following formalized problem assumption.
Assumption 1.
For each $i\in\mathcal{V}$, $f_i$ and $g_i$ are convex. In addition, each $f_i$ is $L_i$-smooth for some $L_i>0$. Moreover, there exists at least one optimal solution to problem (1).
II-A Approximate Method of Multipliers
We first propose a family of optimization algorithms that effectively approximate the Method of Multipliers [35] and serve as a cornerstone of developing distributed algorithms for solving problem (1).
Let each node $i\in\mathcal{V}$ keep a copy $x_i\in\mathbb{R}^d$ of the global decision variable in problem (1), and define $\mathbf{x}=[x_1^T,\dots,x_N^T]^T\in\mathbb{R}^{Nd}$, $f(\mathbf{x})=\sum_{i\in\mathcal{V}}f_i(x_i)$, and $g(\mathbf{x})=\sum_{i\in\mathcal{V}}g_i(x_i)$. Then, problem (1) can be equivalently transformed into
$$\operatorname*{minimize}_{\mathbf{x}\in\mathbb{R}^{Nd}}\ f(\mathbf{x})+g(\mathbf{x})\quad\text{subject to}\quad x_1=x_2=\cdots=x_N.\qquad(2)$$
Next, let $H\in\mathbb{R}^{Nd\times Nd}$ be a symmetric positive semidefinite matrix whose null space is the consensus subspace $\{\mathbf{x}\in\mathbb{R}^{Nd}:x_1=\cdots=x_N\}$. It has been shown in [18] that the consensus constraint in problem (2) is then equivalent to $H\mathbf{x}=\mathbf{0}$. Therefore, (2) is identical to
$$\operatorname*{minimize}_{\mathbf{x}\in\mathbb{R}^{Nd}}\ f(\mathbf{x})+g(\mathbf{x})\quad\text{subject to}\quad H\mathbf{x}=\mathbf{0}.\qquad(3)$$
Problems (1), (2), and (3) have the same optimal value. Also, $x^\star$ is an optimum of (1) if and only if $\mathbf{x}^\star=[(x^\star)^T,\dots,(x^\star)^T]^T$ is an optimum of (2) and (3).
Consider applying the Method of Multipliers [35] to solve problem (3). At each iteration $k$, the primal variable $\mathbf{x}^k$ is updated by minimizing an augmented Lagrangian function with a penalty, weighted by $\rho>0$, on the consensus constraint $H\mathbf{x}=\mathbf{0}$; the dual variable $\mathbf{v}^k$ is then updated through a dual ascent step, with its initial value $\mathbf{v}^0$ arbitrarily set.
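For intuition, here is a minimal sketch of the Method of Multipliers on a toy equality-constrained quadratic program; the problem data, penalty value, and closed-form primal step are illustrative assumptions, while the alternation of an augmented-Lagrangian primal minimization with a dual ascent step is the pattern described above.

```python
import numpy as np

# Method of Multipliers on: minimize 0.5 x^T Q x + c^T x  s.t.  A x = 0.
# The consensus-like constraint x1 = x2 mimics the role of the constraint
# in problem (3); here the primal step is solved in closed form.
Q = np.diag([2.0, 4.0])
c = np.array([-1.0, -2.0])
A = np.array([[1.0, -1.0]])

rho = 1.0                      # penalty parameter
v = np.zeros(A.shape[0])       # dual variable, arbitrarily initialized
x = np.zeros(2)
for k in range(100):
    # Primal: minimize f(x) + v^T A x + (rho/2) ||A x||^2
    x = np.linalg.solve(Q + rho * A.T @ A, -(c + A.T @ v))
    # Dual: ascent step along the constraint residual
    v = v + rho * (A @ x)
print(x, A @ x)                # x -> constrained minimizer, A x -> 0
```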
Although the Method of Multipliers may be applied to solve (3) with properly selected parameters, it is not implementable in a distributed fashion over $\mathcal{G}$, even when the problem reduces to linear programming. To address this issue, our strategy is to first derive a paradigm of approximating the Method of Multipliers and then design its distributed realizations.
Our approximation approach is as follows: starting from an arbitrary initialization, let
(4)
(5)
Compared to the Method of Multipliers, we adopt the same dual update (5) but construct a different primal update (4). In (4), we use a possibly time-varying surrogate function, whose conditions are imposed in Assumption 2 below, to replace the smooth objective $f$ in the primal update of the Method of Multipliers. Additionally, to introduce more flexibility, we use a different weight matrix to define the penalty term. We suppose this weight matrix has the same properties as the constraint matrix $H$, i.e., it is symmetric positive semidefinite with the appropriate null space.
Assumption 2.
The functions satisfy the following:
- (a) are convex and twice continuously differentiable.
- (b) are strongly convex, whose convexity parameters are uniformly bounded from below by some positive constant.
- (c) are Lipschitz continuous, whose Lipschitz constants are uniformly bounded from above by some nonnegative constant.
- (d)
The strong convexity condition in Assumption 2(b) guarantees that the minimizer in (4) uniquely exists, so the update is well-defined. Assumption 2(d) is the key to making (4)–(5) solve problem (3). To explain this, note that (4) is equivalent to finding the unique $\mathbf{x}^{k+1}$ satisfying
(6)
Let $(\mathbf{x}^\star,\mathbf{v}^\star)$ be a primal-dual optimal solution pair of problem (3), which satisfies
(7)
If $(\mathbf{x}^k,\mathbf{v}^k)=(\mathbf{x}^\star,\mathbf{v}^\star)$, then $\mathbf{x}^{k+1}$ has to be $\mathbf{x}^\star$ because of Assumption 2(d), (7), and the uniqueness of the solution in (6). It then follows from (5) that $\mathbf{v}^{k+1}=\mathbf{v}^\star$. Therefore, $(\mathbf{x}^\star,\mathbf{v}^\star)$ is a fixed point of (4)–(5). The remaining conditions in Assumption 2 will be used for the convergence analysis in Section IV.
The paradigm described by (4)–(5) and obeying Assumption 2 is called the Approximate Method of Multipliers (AMM). As there are numerous options for the surrogate function, AMM unifies a wealth of optimization algorithms, including a variety of existing methods (cf. Section III) and many brand-new algorithms. Moreover, since Assumption 2 allows the surrogate to have a more favorable structure than $f$, AMM with appropriate surrogate choices may induce a prominent reduction in computational cost compared to the Method of Multipliers. In the sequel, we provide various options for the surrogate, which give rise to a series of distributed versions of AMM.
A number of existing works [36, 37, 38, 39, 40, 31, 32, 33, 34] also introduce surrogate functions to optimization methods, among which [31, 32, 33, 34] develop distributed algorithms and can address partially nonconvex problems on time-varying networks. The advantages of AMM as well as its differences from the algorithms in [31, 32, 33, 34] are highlighted as follows:
i) The algorithms carrying surrogate functions in [31, 32, 33, 34] are (primal) gradient-tracking-based methods. In contrast, AMM incorporates a surrogate function into a primal-dual framework, so that AMM is inherently different from the algorithms in [31, 32, 33, 34] and requires new analysis tools.
ii) The surrogate function conditions in Assumption 2 intersect with, but still differ from, those in [31, 32, 33, 34]. For instance, when $f$ is twice continuously differentiable, there are surrogate choices that meet Assumption 2 but cannot be accommodated by [31, 32, 33, 34].
iii) To enable distributed implementation, the surrogate functions in [31, 32, 33, 34] need to be fully separable, in the sense that they can be written as a sum of functions of independent variables. In contrast, AMM allows its surrogate to be densely coupled under proper design, yet it can still be executed in a distributed fashion (cf. Sections II-C and III-C4). This may lead to more diverse algorithm designs.
II-B Distributed Approximate Method of Multipliers
This subsection lays out the parameter designs of AMM for distributed implementations.
We first apply the following change of variable to AMM:
(8)
Then, AMM (4)–(5) can be rewritten as
(9)
(10)
Moreover, note that
where the relevant subspace is the orthogonal complement of the consensus subspace in (2). Hence, (8) requires the dual iterates to stay in this subspace, which, due to (10), can be ensured simply by the following initialization:
(11)
Next, partition the primal variable and the dual variable in (9)–(11) into $N$ blocks of dimension $d$, and suppose each node $i$ maintains its own primal and dual blocks. Clearly, the nodes manage to collectively meet (11) by setting, for instance, every local dual block to zero. Below, we discuss the selections of the weight matrices and local functions for the sake of distributed implementations of (9) and (10).
Assumption 3.
The matrices and satisfy the following:
- (a)
- (b)
- (c)
- (d)
The nodes can jointly determine the weight matrices under Assumption 3 without any centralized coordination. For instance, we can let each node agree with every neighbor on the corresponding entries and set the remaining entries accordingly, which directly guarantees Assumption 3(a). Then, Assumption 3(c)(d) are satisfied effortlessly by means of Assumption 3(b) and the connectivity of $\mathcal{G}$. Typical examples of such matrices include the graph Laplacian matrix and the Metropolis weight matrix. Also, the conditions on the weight matrices in Section II-A hold due to Assumption 3.
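As a concrete illustration of the two examples just mentioned, the following sketch builds the graph Laplacian and the Metropolis weight matrix from an edge list; the small graph is an assumed toy example, and the exact scaling ranges required by Assumption 3 are not reproduced here.

```python
import numpy as np

# Graph Laplacian and Metropolis weights for a small undirected graph.
edges = [(0, 1), (1, 2), (2, 3), (0, 3), (1, 3)]
n = 4
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
deg = A.sum(axis=1)

L = np.diag(deg) - A                     # Laplacian: symmetric PSD, L 1 = 0

W = np.zeros((n, n))                     # Metropolis weights
for i, j in edges:
    W[i, j] = W[j, i] = 1.0 / (1.0 + max(deg[i], deg[j]))
W += np.diag(1.0 - W.sum(axis=1))        # make each row sum to one

assert np.allclose(L @ np.ones(n), 0.0)
assert np.allclose(W.sum(axis=1), 1.0) and np.allclose(W, W.T)
```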
With Assumption 3, (10) can be accomplished by letting each node update as
(12)
This is a local operation, as each node only needs to acquire the primal variables associated with its neighbors.
It remains to design the surrogate function so that (9) can be realized in a distributed way. To this end, we introduce a set of functions satisfying Assumption 4 below.
Assumption 4.
The functions satisfy:
- (a) are twice continuously differentiable.
- (b) are strongly convex, whose convexity parameters are uniformly bounded from below by some positive constant.
- (c) are Lipschitz continuous, whose Lipschitz constants are uniformly bounded from above by some positive constant.
- (d) are convex.
To meet Assumption 4(d), it suffices to let each function be strongly convex with a convexity parameter at least as large as the Lipschitz constant of the corresponding local gradient; the largest of these Lipschitz constants serves as a readily available upper bound.
Now define
(13)
and let $D_\phi(x,y)$ denote the Bregman divergence associated with a differentiable function $\phi$ at $x$ and $y$, i.e.,
$$D_\phi(x,y)=\phi(x)-\phi(y)-\langle\nabla\phi(y),\,x-y\rangle.\qquad(14)$$
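As a quick illustration of (14), the sketch below evaluates the Bregman divergence for a user-supplied pair $(\phi,\nabla\phi)$ and checks the textbook identity that a quadratic kernel yields a weighted squared distance; the quadratic $\phi$ is an assumed example.

```python
import numpy as np

# Bregman divergence D_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>.
def bregman(phi, grad_phi, x, y):
    return phi(x) - phi(y) - grad_phi(y) @ (x - y)

# For phi(x) = 0.5 x^T P x, D_phi(x, y) = 0.5 (x - y)^T P (x - y).
P = np.diag([1.0, 3.0])
phi = lambda z: 0.5 * z @ P @ z
grad_phi = lambda z: P @ z
x, y = np.array([1.0, 2.0]), np.array([0.0, 1.0])
assert np.isclose(bregman(phi, grad_phi, x, y), 0.5 * (x - y) @ P @ (x - y))
```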
Then, we set the surrogate function as
(15)
With Assumption 4, (15) is sufficient to ensure Assumption 2. To see this, note from Assumption 4(a)(d) that the surrogate in (15) is twice continuously differentiable and convex, i.e., Assumption 2(a) holds. Also note from (14) and (15) that
(16)
This, along with Assumption 4(b)(c), guarantees Assumption 2(b)(c)(d).
To see how (15) results in a distributed implementation of (9), note from (16) that (9) is equivalent to
for some subgradient of the nonsmooth objective at the new iterate. Then, using (13) and the structure of the weight matrices given in Assumption 3, this can be decomposed node-wise. In other words, (9) can be achieved by letting each node $i$ solve the following strongly convex optimization problem:
(17)
which can be locally carried out by node $i$.
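To convey the flavor of such a local step, here is a hedged sketch: if the smooth part of node $i$'s subproblem (17) were quadratic with curvature $cI$ and unconstrained minimizer $z$ (both of which would be built from the elided data of (17)), and $g_i=\lambda\|\cdot\|_1$, the minimization would reduce to coordinatewise soft-thresholding, computable entirely at node $i$.

```python
import numpy as np

def soft_threshold(z, t):
    # prox of t * ||.||_1: coordinatewise shrinkage toward zero
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def local_primal_step(z, c, lam):
    # argmin_x  lam * ||x||_1 + (c / 2) * ||x - z||^2
    return soft_threshold(z, lam / c)

print(local_primal_step(np.array([0.3, -1.2, 0.05]), c=2.0, lam=0.5))
# -> [ 0.05 -0.95  0.  ] : small entries zeroed, large ones shrunk
```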
The algorithms described by (11), (17), and (12) under Assumptions 3 and 4 constitute a set of distributed realizations of AMM, referred to as the Distributed Approximate Method of Multipliers (DAMM), which can be implemented by the nodes of $\mathcal{G}$ via exchanging information with their neighbors only. Algorithm 1 describes how each node acts in DAMM.
Finally, we provide two examples of DAMM with two particular choices of the local functions.
Example 1.
For each $i\in\mathcal{V}$, let the local function be the sum of any convex, smooth, and twice continuously differentiable function and a sufficiently strong quadratic term. Then, DAMM reduces to a new distributed proximal algorithm, with the following local update of $x_i^k$:
Example 2.
Suppose the local smooth objectives $f_i$ are twice continuously differentiable. Then we can let each local function be a second-order model of $f_i$ augmented by a sufficiently strong quadratic term, and the resulting DAMM is a new distributed second-order method, with the following local update of $x_i^k$:
II-C Special Case: Smooth Problem
In this subsection, we focus on the smooth convex optimization problem, i.e., (1) with each $g_i\equiv 0$. For this special case of (1), we provide additional designs of the surrogate function in AMM, leading to a couple of variations of DAMM.
Here, we still let the surrogate take the form of (15), but no longer require its kernel to be a separable function as in (13). Instead, we construct it upon another set of functions satisfying Assumption 5.
Assumption 5.
The functions satisfy the following:
- (a) are twice continuously differentiable.
- (b) are strongly convex, whose convexity parameters are uniformly bounded from below by some positive constant.
- (c) are Lipschitz continuous, whose Lipschitz constants are uniformly bounded from above by some positive constant.
- (d) are convex, where the convex conjugates of the update functions are involved.
- (e) For any argument and any $i\in\mathcal{V}$, the $i$th $d$-dimensional block of the associated map is independent of the blocks that node $i$ cannot access.
From Assumption 5(b)(c), each such function is strongly convex and smooth [41], so that Assumption 5(d) holds under a suitable relation between the corresponding parameters. Now we set
(18)
Unlike DAMM, the surrogate given by (15) and (18) under Assumption 5 does not necessarily have a separable kernel. Below, we show that such surrogates also satisfy Assumption 2, leading to another subclass of AMM.
To do so, first notice that the strong convexity and smoothness in Assumption 5(b)(c) guarantee Assumption 2(b)(c). Also note from (16) that Assumption 2(d) is assured. In addition, due to Assumption 5(d), the relevant difference of functions is convex and, thus, so is the surrogate. To show the twice continuous differentiability required by Assumption 2(a), recall from [42] that, due to Assumption 5(b)(c), the gradient map is invertible and its inverse is the gradient of the convex conjugate. This, along with (18), implies that
(19)
From the inverse function theorem [43], this inverse map is continuously differentiable, so that the functions given by (15) and (18) are twice continuously differentiable. We then conclude that Assumption 2 holds.
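The conjugate facts invoked above can be verified numerically for a quadratic; in the sketch below, $P$ is an assumed positive definite matrix, $\phi^*(y)=\frac{1}{2}y^TP^{-1}y$ is the conjugate of $\phi(x)=\frac{1}{2}x^TPx$, and the identity $\nabla\phi^*(\nabla\phi(x))=x$ is the inverse-map property cited from [42].

```python
import numpy as np

# For phi(x) = 0.5 x^T P x with P > 0, the convex conjugate is
# phi*(y) = 0.5 y^T P^{-1} y, and grad phi* inverts grad phi.
P = np.array([[2.0, 0.5],
              [0.5, 1.0]])
grad_phi = lambda x: P @ x
grad_phi_conj = lambda y: np.linalg.solve(P, y)    # = P^{-1} y

x = np.array([0.7, -1.3])
assert np.allclose(grad_phi_conj(grad_phi(x)), x)  # inverse-map property
```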
Equipped with (15), (18), and Assumption 5, we now derive additional distributed realizations of AMM (9)–(11) for minimizing $f$. As $g\equiv\mathbf{0}$, (9) is equivalent to
(20)
This, together with (19), gives
(21)
Like DAMM, we let each node $i$ maintain the $i$th $d$-dimensional blocks of the primal and dual variables, where the dual update is the same as (12). According to (21), node $i$ can compute its primal update, owing to Assumption 5(e), provided that it has access to the $i$th block of the right-hand side of (21). Therefore, the remaining question is whether each node is able to evaluate this block locally. In fact, this can be enabled by the following two concrete ways of constructing the update functions.
Way #1: Let the update functions be constant across iterations, i.e., generated by a single function satisfying Assumption 5. Then, according to (18), the gradient maps are fixed. We introduce an auxiliary variable that aggregates the right-hand side of (20) and, from (20), satisfies the recursive relation
Due to the structure of the weight matrices in Assumption 3, each node is able to locally update this auxiliary variable as above. Nevertheless, its initialization cannot be achieved without centralized coordination, given that the initial primal iterate is arbitrarily chosen. We may overcome this by restricting the initialization as follows: let each node pick an arbitrary local starting point and then set its auxiliary variable consistently from local information. Due to Assumption 5(e), such an initialization is decentralized and only requires each node to share two local quantities with its neighbors.
Incorporating the above into (21) yields another distributed version of AMM, which is called Distributed Approximate Method of Multipliers for Smooth problems with Constant update functions (DAMM-SC) and is specified in Algorithm 2.
Example 3.
We can further reduce the communication cost of DAMM-SC by restricting each block of the update function to depend solely on node $i$. For example, we may build the function from arbitrary twice continuously differentiable, smooth, convex local components plus suitable quadratic terms. It is straightforward to see that Assumption 5(a)(b)(c) hold and that Assumption 5(d) can be satisfied with proper parameters. For such a choice, the $i$th block of the gradient map is determined by node $i$ itself and hence meets Assumption 5(e). Thus, the primal update of DAMM-SC, i.e., Line 8 of Algorithm 2, simplifies so that each node only needs to share its primal variable with its neighbors during each iteration, and the local communications in Line 6 and Line 12 of Algorithm 2 are eliminated.
Way #2: For each iteration, let the update function be quadratic. Suppose the associated matrices are uniformly bounded in the positive semidefinite order, and let each of them have a neighbor-sparse structure, i.e., the $(i,j)$-block is zero whenever $j\notin\mathcal{N}_i\cup\{i\}$. It can be shown that such quadratic update functions satisfy Assumption 5. Substituting them into (21) gives
(22)
which can be executed in a distributed manner due to the neighbor-sparse structure just imposed. This leads to an additional distributed version of AMM, which is referred to as the Distributed Approximate Method of Multipliers for Smooth problems with Quadratic update functions (DAMM-SQ) and is elaborated in Algorithm 3, where we introduce an auxiliary variable for convenience.
Example 4.
We present a new distributed gradient-tracking algorithm as an example of DAMM-SQ, built from weight matrices satisfying Assumption 3. The required positive definiteness can be ensured by letting the underlying matrix be strictly diagonally dominant with positive diagonal entries, and, similar to the discussion below Assumption 3, such matrices can be locally determined by the nodes. It can also be verified that all the other conditions required by DAMM-SQ hold. With this particular choice, (22) becomes a recursion in which the auxiliary variable tracks the average of all the local gradients, so that this example of DAMM-SQ imitates the behavior of the (centralized) gradient descent method.
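To illustrate the behavior this example imitates, below is a generic DIGing-style gradient-tracking recursion on a toy consensus least-squares problem; it is not the exact DAMM-SQ update (22), and the mixing matrix, step-size, and data are illustrative assumptions.

```python
import numpy as np

def gradient_tracking(W, grads, x0, alpha, iters):
    # y_i tracks the network-average gradient, so the x-update mimics
    # centralized gradient descent; W is symmetric doubly stochastic.
    n = len(grads)
    x = x0.copy()
    g = np.stack([grads[i](x[i]) for i in range(n)])
    y = g.copy()                                   # y^0 = grad f(x^0)
    for _ in range(iters):
        x_new = W @ x - alpha * y
        g_new = np.stack([grads[i](x_new[i]) for i in range(n)])
        y = W @ y + g_new - g                      # tracking update
        x, g = x_new, g_new
    return x

# Toy problem: f_i(x) = 0.5 (x - b_i)^2, whose global optimum is mean(b).
b = np.array([1.0, 2.0, 3.0, 6.0])
grads = [lambda z, bi=bi: z - bi for bi in b]
W = np.full((4, 4), 0.25)                          # complete-graph averaging
x = gradient_tracking(W, grads, np.zeros((4, 1)), alpha=0.3, iters=200)
print(x.ravel())                                   # every entry near 3.0
```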
III Existing Algorithms as Specializations
This section exemplifies that AMM and its distributed realizations generalize a variety of existing distributed optimization methods originally developed in different ways.
III-A Specializations of DAMM-SQ
DAMM-SQ described in Algorithm 3 generalizes several distributed first-order and second-order algorithms for solving problem (1) with each $g_i\equiv 0$, including EXTRA [5], ID-FBBS [13], and DQM [19].
III-A1 EXTRA
EXTRA [5] is a well-known first-order algorithm developed from a decentralized gradient descent method. From [5, Eq. (3.5)], EXTRA can be expressed as
(23)
where the initial point is arbitrarily given and the two mixing matrices are average matrices associated with $\mathcal{G}$ (a matrix is said to be an average matrix associated with $\mathcal{G}$ if it satisfies certain symmetry, stochasticity, and sparsity conditions with respect to $\mathcal{G}$). By a change of variables, (23) becomes
(24)
(25)
This is in the form of DAMM-SQ under a particular choice of the weight matrices and quadratic update functions. As the assumptions imposed on the mixing matrices and the step-size in [5] guarantee Assumption 3, the initialization (11), and the curvature bound required in Section II-C, this particular choice satisfies all the conditions of DAMM-SQ.
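For reference, the standard EXTRA recursion from [5] can be sketched as follows with the common choice $\tilde W=(I+W)/2$; the mapping of these quantities onto DAMM-SQ's parameters is as described above, and the toy usage at the end is an assumed demonstration.

```python
import numpy as np

def extra(W, grads, x0, alpha, iters):
    # EXTRA (cf. [5]): x^1 = W x^0 - alpha grad f(x^0), and then
    # x^{k+2} = (I+W) x^{k+1} - W_tilde x^k
    #           - alpha (grad f(x^{k+1}) - grad f(x^k)).
    n = W.shape[0]
    I = np.eye(n)
    W_t = 0.5 * (I + W)
    g = lambda X: np.stack([grads[i](X[i]) for i in range(n)])
    x_prev, x = x0, W @ x0 - alpha * g(x0)
    for _ in range(iters):
        x_prev, x = x, (I + W) @ x - W_t @ x_prev - alpha * (g(x) - g(x_prev))
    return x

# Toy usage: consensus averaging via f_i(x) = 0.5 (x - b_i)^2.
b = np.array([1.0, 2.0, 3.0, 6.0])
grads = [lambda z, bi=bi: z - bi for bi in b]
W = np.full((4, 4), 0.25)
print(extra(W, grads, b.reshape(4, 1), alpha=0.5, iters=300).ravel())
```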
III-A2 ID-FBBS
III-A3 DQM
DQM [19] is a distributed second-order method for solving problem (1) with strongly convex, smooth, and twice continuously differentiable $f_i$'s and zero $g_i$'s. DQM takes a primal-dual form with an arbitrarily given initialization and suitably chosen matrices. Observe that DAMM-SQ reduces to DQM under a particular setting of the weight matrices and quadratic update functions. This setting clearly satisfies Assumption 3 and meets all the other requirements in Section II-C.
III-B Specializations of DAMM
A number of distributed algorithms for composite or nonsmooth convex optimization can be cast into the form of DAMM described in Algorithm 1, including PGC [16], PG-EXTRA [12], DPGA [11], a decentralized ADMM in [4], and D-FBBS [13].
III-B1 PGC and PG-EXTRA
PGC [16] and PG-EXTRA [12] are two recently proposed distributed methods for solving problem (1), where PGC is constructed upon ADMM [35] and PG-EXTRA is an extension of EXTRA [5] that handles nonzero $g_i$'s. According to [16, Section V-D], PGC can be described by a primal-dual recursion with arbitrary initialization, whose parameters are chosen as follows: a positive definite diagonal matrix and two row-stochastic matrices are selected so that the induced weight matrices are symmetric and respect the sparsity pattern of $\mathcal{G}$. To cast PGC in the form of DAMM, we define the weight matrices and local functions of DAMM accordingly. Then, starting from an arbitrary initialization, the updates of PGC can be expressed as
(26)
(27)
This means that PGC is exactly in the form of DAMM. In addition, the conditions assumed in [16] imply that Assumption 3 and Assumption 4 hold. PG-EXTRA can also be described by (26)–(27) with particular parameter choices, i.e., it is a special form of PGC. Therefore, DAMM generalizes both PGC and PG-EXTRA.
III-B2 DPGA and a Decentralized ADMM
DPGA [11] is a distributed proximal gradient method with an arbitrarily initialized primal-dual recursion of the following form:
where the parameters are chosen so that the update equations of DPGA coincide with those of DAMM under particular weight matrices and local functions. These choices clearly satisfy Assumption 3 and, due to the parameter conditions in [11], the associated local functions are convex, so Assumption 4 holds. The decentralized ADMM in [4] for solving strongly convex and smooth problems can be viewed as a special case of DPGA with particular parameters, so it is also a specialization of DAMM.
III-B3 D-FBBS
D-FBBS [13] addresses problem (1) and is designed based on a Bregman method and an operator splitting technique. D-FBBS can be described by (26)–(27) with particular parameter choices built from an average matrix associated with $\mathcal{G}$ and an arbitrary initialization. These choices satisfy Assumption 3, and the induced local functions satisfy Assumption 4 because of the step-size condition assumed in [13]. Therefore, we conclude that D-FBBS can be specialized from DAMM.
III-C Specializations of AMM
Since DAMM-SQ and DAMM are subsets of AMM, the algorithms in Section III-A and Section III-B are also specializations of AMM. Below, we present some other methods [9, 14, 26, 18] that can be specialized from AMM but do not belong to DAMM, DAMM-SC, or DAMM-SQ.
III-C1 A Distributed ADMM
In [9], a distributed ADMM is proposed to solve (1), taking the following form:
where the initialization is arbitrarily given. The algorithm parameters consist of a penalty constant and matrices, built from the graph, that have full column rank. We may view this algorithm as a specialization of AMM (9)–(11) with particular weight matrices and surrogate functions: the weight matrices are symmetric positive semidefinite with the null-space property required in Section II-A, and, since the involved matrices have full column rank, the curvature condition guaranteeing Assumption 2 holds.
III-C2 A Distributed Primal-Dual Algorithm
In [14], a distributed primal-dual algorithm is developed to solve (1) with each $g_i$ equal to the indicator function of a closed convex set, and takes the following form:
(28)
(29)
where the initial points are arbitrarily given and the projection onto the constraint set is used. Besides, a symmetric positive semidefinite weight matrix satisfying suitable null-space and spectral conditions is adopted. To see how this algorithm relates to AMM, we use (29) to rewrite (28) as
(30)
It can be seen that (29)–(30) are equivalent to AMM (4)–(5) under a particular choice of the weight matrices and a quadratic surrogate. This choice satisfies the conditions required by AMM in Section II-A; also, the surrogate is strongly convex and satisfies Assumption 2.
III-C3 DIGing on Static Networks
DIGing [26] is a distributed gradient-tracking method for solving problem (1) with zero $g_i$'s over time-varying networks. Here, we only consider DIGing on static undirected networks, with an average matrix satisfying the standard symmetry and spectral conditions. It is shown in [26] that DIGing can then be written as follows:
where the initialization can be arbitrary. Summing the above equation over the iterations yields
a relation that holds for every iteration. After a change of variables, the above update is the same as
Such an algorithmic form of DIGing is identical to AMM (9)–(11) with the weight matrices and surrogate given above, and it can be verified that these choices satisfy all the conditions in Section II-A.
III-C4 ESOM
ESOM [18] is a class of distributed second-order algorithms that address problem (1) with strongly convex, smooth, and twice continuously differentiable $f_i$'s and zero $g_i$'s. It is developed by incorporating a proximal technique and certain second-order approximations into the Method of Multipliers. To describe ESOM, let an average matrix associated with $\mathcal{G}$ be given, along with the auxiliary matrices defined in [18]. With the above notations, each ESOM-$K$ algorithm can be described by
(31)
(32)
where the initial dual variable is any vector satisfying the initialization (11) of AMM. Note that the primal and dual updates of AMM, i.e., (9)–(10), reduce to (31)–(32) under a particular choice of the weight matrices and surrogate. Clearly, this choice satisfies the conditions in Section II-A. Also, Assumption 2 holds because of the bounds established in [18], which involve the Lipschitz constant of the local Hessians. Note that, unlike most specializations of AMM discussed in this section, the surrogate for ESOM-$K$ with $K\ge 1$ is nonseparable.
III-D Connections between AMM and Existing Unifying Methods
PUDA [22] and ABC [21] are two recently proposed distributed methods for convex composite optimization, which unify a number of existing methods, including a subset of the aforementioned specializations of AMM. Nevertheless, unlike AMM, which can be specialized to both first-order and second-order methods, PUDA and ABC are first-order algorithms. Moreover, AMM allows the local nonsmooth functions $g_i$ to be non-identical, i.e., each node only needs to know a local portion of the global nonsmooth objective. In contrast, PUDA and ABC are restricted to the case of identical $g_i$'s, i.e., each node has to know the entire nonsmooth objective, and it is not straightforward to extend their analyses to the more general case of non-identical $g_i$'s.
Although none of AMM, PUDA, and ABC can include the others as special cases, they are implicitly connected through the following algorithm: starting from an arbitrary initialization, for each $k\ge 0$, let
(33)
(34)
(35)
Different from AMM, whose primal update (9) involves both the surrogate function and the nonsmooth objective, the above algorithm tackles these two parts separately and thus has two sequential primal minimization operations. Indeed, when $g\equiv\mathbf{0}$, (33)–(35) are identical to (9)–(10).
It can be shown that, with particular weight matrices and surrogate choices, (33)–(35) become equivalent to PUDA and ABC under some mild parameter conditions; the detailed designs of the corresponding matrices can be found in [22, 21]. Such equivalence requires the additional conditions that certain parameter matrices be invertible and symmetric. In fact, these additional conditions are satisfied by many specializations of PUDA and ABC, such as those in [22, Table I] and [21, Table II], when they choose positive definite average matrices. Also, the above options for PUDA and ABC satisfy our conditions for AMM in Section II-A.
To summarize, PUDA and ABC with some additional parameter conditions can be specialized from (33)–(35), while AMM and (33)–(35) are the same if problem (1) is smooth and are different otherwise. As a result, when solving smooth problems, AMM, PUDA, and ABC share a number of common special cases (e.g., EXTRA [5] and DIGing [26]).
IV Convergence Analysis
In this section, we analyze the convergence performance of AMM described by (4)–(5) (which are equivalent to (9)–(11)).
Assumption 6.
The weight matrices of AMM are symmetric, and their null spaces are compatible with the consensus subspace defined in (2). In addition, an ordering condition between them holds.
In DAMM, DAMM-SC, and DAMM-SQ, the weight matrices are built from matrices complying with Assumption 3. Thus, as long as we further impose the ordering condition, Assumption 6 holds. Moreover, all the existing specializations of AMM, DAMM, and DAMM-SQ in Section III, except DIGing [26], also satisfy Assumption 6. DIGing is not necessarily required to meet the ordering condition, yet it can easily do so by letting its average matrix be symmetric positive semidefinite.
In what follows, we let the iterates be generated by (4)–(5), and let $(\mathbf{x}^\star,\mathbf{v}^\star)$ be a primal-dual optimal solution pair of problem (3), which satisfies (7). Also, let $L$ denote the Lipschitz constant of $\nabla f$. In addition, define the curvature matrices associated with the surrogates.
Note that each of these matrices exists and is symmetric due to Assumption 2. Moreover, by Assumption 2(d),
(36)
Then, we introduce the following auxiliary lemma for deriving the main results.
Proof.
See Appendix -A. ∎
IV-A Convergence under General Convexity
In this subsection, we provide the convergence rates of AMM in solving the general convex problem (1).
Let $\bar{\mathbf{x}}^k$ denote the running average of the primal iterates from iteration $1$ to $k$. Below, we derive sublinear convergence rates for (i) the consensus error, which represents the infeasibility of $\bar{\mathbf{x}}^k$ in solving the equivalent problem (3), and (ii) the objective error, which is a direct measure of optimality. Throughout Section IV-A, we tentatively consider the following type of surrogate function that fulfills Assumption 2:
(38)
where the weight matrix in (38) satisfies the conditions stated in Section II-A. Such a choice of the surrogate results in a constant curvature matrix.
Note that AMM endowed with (38) still generalizes most existing algorithms discussed in Section III, including EXTRA [5], ID-FBBS [13], PGC [16], PG-EXTRA [12], DPGA [11], the decentralized ADMM in [4], D-FBBS [13], the distributed ADMM in [9], the distributed primal-dual algorithm in [14], and DIGing [26]. Although DQM [19] and ESOM [18] are specialized from AMM with surrogates other than (38), they both require problem (1) to be strongly convex and smooth; for such a case, we will provide convergence rates for the general form of AMM in Section IV-B.
We now bound the consensus error and the objective error, which plays a key role in acquiring the rates at which these errors vanish.
Proof.
See Appendix -B. ∎
Observe from Lemma 2 that as long as the primal and dual iterates remain bounded, both the consensus error and the objective error are guaranteed to vanish. The following lemma, built on Lemma 1 and a few auxiliary constants, assures that this is the case.
Lemma 3.
Proof.
See Appendix -C. ∎
The following theorem results from Lemma 2 and Lemma 3, which provides rates for both the consensus error and the objective error at the running average .
Theorem 1.
Suppose all the conditions in Lemma 3 hold. Then, for each $k\ge 1$,
Among the existing distributed algorithms that are able to solve nonsmooth convex optimization problems with non-strongly convex objective functions (e.g., [10, 6, 7, 9, 11, 12, 13, 16, 17, 23, 24, 33, 31, 34, 32, 21]), the best convergence rates are of order $O(1/k)$, achieved by [9, 11, 16, 12, 13, 7, 33, 21]. Indeed, AMM with all the conditions in Theorem 1 generalizes the algorithms in [9, 11, 16, 12, 13] as well as their parameter conditions. Like Theorem 1, [9, 11, 16, 21] achieve $O(1/k)$ rates for both the objective error and the consensus error, whereas [12, 13, 7, 33] reach $O(1/k)$ rates in terms of optimality residuals that implicitly reflect deviations from an optimality condition, and they provide weaker guarantees for the consensus error. Also note that [33, 21] only address problem (1) with identical $g_i$'s.
Theorem 1 provides new convergence results for some existing algorithms generalized by AMM. In particular, the $O(1/k)$ rate in terms of the objective error is new to PG-EXTRA [12] and D-FBBS [13] for nonsmooth problems as well as to EXTRA [5] and ID-FBBS [13] for smooth problems, all of which previously only had rates with respect to optimality residuals. Moreover, Theorem 1 improves their rates of reaching consensus to $O(1/k)$. Furthermore, Theorem 1 extends the $O(1/k)$ rate of the distributed primal-dual algorithm in [14] for constrained smooth convex optimization to general nonsmooth convex problems, and allows PGC [16] to establish its $O(1/k)$ rate without the originally required condition that each local nonsmooth function have a compact domain.
Finally, we illustrate how the sublinear rates in Theorem 1 are influenced by the graph topology. For simplicity, fix the penalty parameter and the surrogate weights, and consider a dual optimum associated with the smallest nonzero eigenvalue of the weight matrix, which exists by the optimality conditions (7). The constants in Theorem 1 can then be bounded in terms of this eigenvalue, which plays a role similar to the spectral gap in quantifying the consensus error. Therefore, denser network connectivity, which typically enlarges this eigenvalue, can yield faster convergence of both the objective error and the consensus error.
Compared to the existing specializations of AMM that also enjoy $O(1/k)$ rates in solving (1), the above rate exhibits better dependence on the graph than the corresponding rates of DPGA [11] and of the distributed ADMM [9]. The remaining specializations discussed in Section III have not been provided with a similar characterization of their convergence rates in terms of the graph topology.
IV-B Convergence under Local Restricted Strong Convexity
In this subsection, we impose an additional problem assumption and derive stronger convergence guarantees for AMM. Henceforth, we no longer restrict the surrogate to the form of (38).
Assumption 7.
The function $f$ is locally restricted strongly convex with respect to the optimum of problem (1); i.e., for any convex and compact set containing the optimum, $f$ is restricted strongly convex with respect to the optimum on that set with some positive convexity parameter.
Restricted strong convexity has been studied in the literature. For example, [44, 5] consider its global version and [20] considers the local version in Assumption 7. Under Assumption 7, problem (1) has a unique optimum, and so does problem (3) [20]. Furthermore, Assumption 7 is less restrictive than the standard global strong convexity condition. For instance, the objective function of logistic regression [45] is not globally strongly convex but meets Assumption 7.
Proposition 1.
Proof.
See Appendix -D. ∎
Proposition 1 provides a sufficient condition for Assumption 7. When each $f_i$ is twice continuously differentiable, this sufficient condition is much weaker than global strong convexity, which requires a uniform positive lower bound on the Hessian of $f$.
Note that Assumption 7 does not assure any strong convexity of the stacked function $f$ on the whole space, yet it guarantees the following property.
Proposition 2.
Subsequently, we construct a convex and compact set containing the optimum, so that the restricted strong convexity in Proposition 2 takes effect on it, leading to a parameter condition for the convergence analysis. To introduce this set, note from Assumption 2 that there exist symmetric positive semidefinite matrices bounding the surrogate curvature uniformly over the iterations. From Assumption 2(b) and Assumption 6, the resulting sublevel set is convex and compact. Thus, from Proposition 2, $f$ is restricted strongly convex with respect to the optimum on this set. Lemma 4 is another consequence of Lemma 1, showing that the iterates stay in this set.
Lemma 4.
Proof.
See Appendix -E. ∎
When the curvature matrices are constant over iterations or $f$ is globally restricted strongly convex (so that the convexity parameter can be chosen independently of the set), it is straightforward to satisfy (44). Otherwise, the convexity parameter depends on the set and thus on the algorithm parameters, so that both sides of (44) are coupled. That said, (44) can always be satisfied by proper parameter choices. To see this, arbitrarily pick a candidate set that contains the constructed set for sufficiently small parameters, and fix a convexity parameter of $f$ on it; this parameter is then independent of the remaining algorithm parameters. The following then suffices to satisfy (44):
(45)
The lower and upper bounds in (45) do not depend on the iterates. Also, (45) is well-posed, since it holds for all sufficiently small parameter values. Therefore, parameters satisfying (45) meet (44).
Next, we present the convergence rates of AMM in both optimality and feasibility under the local restricted strong convexity condition. We first consider the smooth case of (1) with each $g_i\equiv 0$, and provide a linear rate of convergence for a quantity that jointly measures the distance to primal optimality, dual optimality, and primal feasibility of (3). In Theorem 2, we force the reference dual optimum to satisfy not only (7) but also an additional range condition; such a dual optimum always exists and can be constructed from any Lagrange dual optimum of (3) together with the initial dual iterate of AMM.
Theorem 2.
Proof.
See Appendix -F. ∎
In comparison with many existing works, such as [4, 18, 19, 9, 26, 27, 6, 7, 28, 29, 21, 22], that assume global strong convexity to derive linear convergence rates, the linear rate in Theorem 2 only requires the weaker condition of local restricted strong convexity in Assumption 7. Moreover, Theorem 2 provides new convergence results for a number of existing algorithms that can be specialized from AMM. Specifically, Theorem 2 establishes linear convergence for D-FBBS [13], ID-FBBS [13], and DPGA [11], which had never been discussed before. In addition, Theorem 2 allows EXTRA [5], DIGing [26], ESOM [18], DQM [19], the distributed ADMMs [9, 4], PUDA [22], and ABC [21] to achieve linear convergence under less restrictive problem assumptions, including relaxing the global strong convexity in [26, 18, 19, 9, 4, 21, 22] and the global restricted strong convexity in [5] to local restricted strong convexity, and eliminating the Lipschitz continuity of the Hessians required in [18, 19].
Both the problem and the graph affect the linear rate in Theorem 2, and their impacts become explicit when each $f_i$ is globally restricted strongly convex with respect to the optimum. In that case, a larger convexity parameter allows larger algorithm parameters in (47) and hence a faster rate, and a smaller smoothness constant has the same effect. Likewise, if the graph topology changes so that the relevant eigenvalue bound increases, which indicates denser connectivity of $\mathcal{G}$, the rate improves.
The above discussion suggests that a smaller condition number or denser network connectivity may lead to faster convergence of AMM in solving strongly convex and smooth problems. Indeed, for some particular forms of AMM, such dependence can be made more explicit and is comparable to some prior results: with an average matrix associated with $\mathcal{G}$, suitable parameter choices satisfying (47), and the spectral gap determined by the second largest eigenvalue of that matrix, Theorem 2 yields
(48)
Most existing linearly convergent specializations of AMM have not revealed relations as explicit as (48), with the following exceptions: the decentralized ADMM [4] attains the same dependence as (48), DIGing [26] gives a worse dependence, and the distributed ADMM [9] can reach a better one. This is probably because (48) follows directly from our analysis, which covers more general problem assumptions and algorithmic forms.
Finally, we remove the restriction of zero $g_i$'s in Theorem 2 and derive the following result.
Theorem 3.
Proof.
See Appendix -G. ∎
The convergence rates in Theorem 3 are of the same order as those in Theorem 1. Theorem 1 considers a more general optimization problem, while Theorem 3 allows for more general surrogate functions. In the literature, [8, 25, 15, 10] also deal with nonsmooth, strongly convex problems. Even under global strong convexity, [25, 15] provide slower convergence rates, and [8] derives a comparable rate of convergence only to dual optimality. Although [10] attains linear convergence, it considers a more restricted problem, assuming $f$ to be globally restricted strongly convex and the subgradients of the nonsmooth part to be uniformly bounded.
Remark 1.
Compared to the existing algorithms in Section III that can be viewed as specializations of AMM, AMM is able to achieve convergence rates of the same or even better order under identical or weaker problem conditions. Moreover, although the theoretical results in Section IV require additional conditions on the parameters of AMM besides those in Section II, these parameter conditions are not restrictive. Indeed, when AMM reduces to the algorithms in Section III, its parameter conditions in Theorem 1 still generalize those in [5, 13, 16, 12, 11, 9, 14] and partially overlap with those in [26] for non-strongly convex problems. The parameter conditions in Theorem 2 are more general than those in [5, 19], different from but intersecting with those in [26, 18], and, admittedly, relatively limited compared to those in [9, 4] for strongly convex problems, yet on the premise that the surrogate is specialized to the corresponding particular forms in Section III.
V Numerical Examples
This section demonstrates the competent practical performance of DAMM via numerical examples.
V-A Non-identical Local Nonsmooth Objectives
Consider the following constrained $\ell_1$-regularized problem:
(52)
where the local losses, the regularization weight, and the closed convex constraint sets are randomly generated, with the intersection of the constraint sets nonempty. The graph $\mathcal{G}$ is also randomly generated and connected.
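Since the constraint sets and numerical values of (52) are elided above, the following hedged sketch assumes box constraints for illustration: for $g(x)=\lambda\|x\|_1+\iota_{[l,u]^d}(x)$ with a separable box, the proximal operator factors into soft-thresholding followed by coordinatewise clipping.

```python
import numpy as np

def prox_l1_box(z, t, lam, l, u):
    # prox of t * (lam * ||.||_1 + indicator of the box [l, u]^d):
    # soft-threshold each coordinate, then clip it into [l, u].
    x = np.sign(z) * np.maximum(np.abs(z) - t * lam, 0.0)
    return np.clip(x, l, u)

z = np.array([2.5, -0.1, -3.0])
print(prox_l1_box(z, t=1.0, lam=0.5, l=-2.0, u=2.0))   # [ 2.  0. -2.]
```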
The simulation involves PG-EXTRA [12], D-FBBS [13], DPGA [11], and the distributed ADMM [9], which are guaranteed to solve (52) with distinct $g_i$'s. From Section III, they are specializations of AMM, and the first three algorithms can also be specialized from DAMM. In addition to these existing methods, we run a particular DAMM whose local functions depend on the problem data; note that this is a new algorithm for solving (52). All these algorithms have similar computational complexities per iteration, while each iteration of the distributed ADMM requires twice the communication cost of the others. The algorithmic forms of PG-EXTRA, D-FBBS, DPGA, and the distributed ADMM are given in Section III. For DPGA, we build its weights from the Metropolis weight matrix defined below Assumption 3, and we set the weight matrix in the distributed ADMM accordingly. We assign the corresponding local functions to the above new DAMM and to PG-EXTRA and D-FBBS when cast in the form of DAMM. The remaining parameters are all fine-tuned within their theoretical ranges to achieve the best possible performance.
Figure 1 plots the optimality error (which includes both the objective error and the consensus error) at the running average and at the iterate itself. For each of the aforementioned algorithms, the running average produces a smoother curve, while the iterate converges faster. The new DAMM outperforms the other four methods, suggesting that AMM not only unifies a diversity of existing methods, but also creates novel distributed algorithms that achieve superior performance in solving various convex composite optimization problems.
V-B Comparison with Methods Using Surrogate Functions
This subsection compares the convergence performance of the particular DAMM in Section V-A with that of NEXT [31] and SONATA [33], which also utilize surrogate functions. However, NEXT and SONATA are only guaranteed to address problem (1) with identical $g_i$'s. Thus, we change the constraint sets in (52) to the unit ball. The remaining settings comply with Section V-A.
At each iteration, the computational complexities of all three algorithms are roughly the same, yet the communication costs of NEXT and SONATA are double that of the particular DAMM. For NEXT and SONATA, we follow the simulations in [31, 33] and approximate each local objective by a quadratic surrogate at each iteration, choosing the average matrix based on the Metropolis weight matrix. The diminishing step-size of NEXT is set as in [31], and SONATA adopts a fixed step-size. Again, we fine-tune all the algorithm parameters within their theoretical ranges.
Figure 2 displays the optimality errors generated by the aforementioned algorithms. Observe that the particular DAMM converges much faster than the other two methods. The main reason for the slow convergence of SONATA on this numerical example is that its theoretical step-size is very small.
VI Conclusion
We have introduced a unifying Approximate Method of Multipliers (AMM) that emulates the Method of Multipliers via a surrogate function. Proper designs of the surrogate function lead to a wealth of distributed algorithms for solving convex composite optimization over undirected graphs. Sublinear and linear convergence rates for AMM are established under various mild conditions. The proposed AMM and its distributed realizations generalize a number of well-known methods in the literature. Moreover, the theoretical convergence rates of AMM are no worse, and sometimes even better, than the convergence results of these existing specializations of AMM in terms of rate order and problem assumptions. The generality of AMM and its convergence analysis provide insights into the design and analysis of distributed primal-dual optimization algorithms, and allow us to explore high-performance specializations of AMM for specific convex optimization problems in practice.
-A Proof of Lemma 1
Define and . Due to (6) and (7), and . It follows from (36) that
(53)
Also, since , we have . This, along with (5), gives . Thus, . By substituting (53) into the above equation and using , we obtain . In this equation, , because , , and is convex. Moreover, due to and (5), we have . It follows that . Hence, to prove (37), it suffices to show that for any ,
(54)
To do so, note from the AM-GM inequality and the Lipschitz continuity of that for any and , . By adding the above inequality over all with , we have . Then, adding to both sides of this inequality leads to (54).
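For completeness, the Young-type form of the AM-GM inequality used in this step reads
$$2\langle a,b\rangle\le\epsilon\|a\|^2+\frac{1}{\epsilon}\|b\|^2,\qquad\forall a,b\in\mathbb{R}^n,\ \epsilon>0,$$
which follows from expanding $0\le\|\sqrt{\epsilon}\,a-b/\sqrt{\epsilon}\|^2$.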
-B Proof of Lemma 2
Let . Due to (5), , so that (39) holds. Next, we prove (40). For simplicity, define . Since is convex, for all . Due to (4), . Thus, by letting in the above inequality, we obtain
(55)
In addition, because of (5) and ,
(56)
Also, due to the convexity and the -smoothness of each ,
(57)
Through combining (55), (56), and (57) and utilizing , we derive
(58)
Now adding (58) over yields . Moreover, the convexity of and implies . Combining the above results in (40).
-C Proof of Lemma 3
-D Proof of Proposition 1
For convenience, let . Since and is continuous around , there exist and such that . Therefore,
(61)
Let be any convex and compact set containing . Suppose . Let . Since is convex, , and , we have . Moreover, using , (61), and , we obtain . Adding the above two inequalities leads to . This, along with (61), implies that is restricted strongly convex with respect to on , so that Assumption 7 holds.
-E Proof of Lemma 4
For simplicity, define . Since , . Thus, to show , it suffices to show that . We prove this by induction. Clearly, holds for . Now suppose and below we show . Using (37) and (59) with replaced by ,
(62)
On the other hand, since , we have . Due to (44), there exist and such that (47) holds. Then, through the AM-GM inequality, , which further implies
Substituting this into (62) and applying (5), we obtain
(63)
Due to the hypothesis , we have . It then follows from Proposition 2 and that . This and (63) give
(64)
Note that , due to (47), and . Thus, the right-hand side of (64) is nonnegative, which implies .
-F Proof of Theorem 2
Recall from the paragraph above Theorem 2 that we force . Also, due to (5), . Hence, . Moreover, since , . It then follows from (53) that . Then, via the AM-GM inequality, for any ,
(65)
From (65), , , and the -smoothness of each , we have that for all ,
Note that (64) holds here because (44) guarantees (47) with some and . Then, by combining the above inequality and (64), we obtain . Note from (47) that and . Also note that , , , , , , , and . Therefore, there exists such that , which guarantees (46).
-G Proof of Theorem 3
In the proof of Lemma 4, we have shown that is non-increasing. Thus, . It follows that . By substituting this into (39) and (41), we obtain (49) and (51). Note that this substitution is legitimate, since (39) and (41) do not rely on (38) to hold. To prove (50), we let . Since , is convex. Similar to the derivation of (55),
(66)
Since , we have and . These two inequalities along with imply
(67)
Note that (56) and (57) hold in no need of (38). Hence, by integrating (66), (67), (56), (57), and , . Similar to the derivation of (40) from (58), it follows that . From (64), and , where and satisfying (47) exist due to (44). Also, from (44), . Combining the above gives (50).
References
- [1] A. Nedić, “Convergence rate of distributed averaging dynamics and optimization in networks,” Foundations and Trends in Systems and Control, vol. 2, no. 1, pp. 1–100, 2015.
- [2] D. Bajović, D. Jakovetić, N. Krejić, and N. Krklec Jerinkić, “Newton-like method with diagonal correction for distributed optimization,” SIAM Journal on Optimization, vol. 27, no. 2, pp. 1171–1203, 2017.
- [3] F. Mansoori and E. Wei, “Fast distributed asynchronous Newton-based optimization algorithm,” IEEE Transactions on Automatic Control, vol. 65, no. 7, pp. 2769–2784, 2020.
- [4] W. Shi, Q. Ling, K. Yuan, G. Wu, and W. Yin, “On the linear convergence of the ADMM in decentralized consensus optimization,” IEEE Transactions on Signal Processing, vol. 62, no. 7, pp. 1750–1761, 2014.
- [5] W. Shi, Q. Ling, G. Wu, and W. Yin, “EXTRA: An exact first-order algorithm for decentralized consensus optimization,” SIAM Journal on Optimization, vol. 25, no. 2, pp. 944–966, 2015.
- [6] Y. Liu, W. Xu, G. Wu, Z. Tian, and Q. Ling, “Communication-censored ADMM for decentralized consensus optimization,” IEEE Transactions on Signal Processing, vol. 67, no. 10, pp. 2565–2579, 2019.
- [7] Z. Li, W. Shi, and M. Yan, “A decentralized proximal-gradient method with network independent stepsizes and separated convergence rates,” IEEE Transactions on Signal Processing, vol. 67, no. 17, pp. 4494–4506, 2019.
- [8] I. Notarnicola and G. Notarstefano, “Asynchronous distributed optimization via randomized dual proximal gradient,” IEEE Transactions on Automatic Control, vol. 62, no. 5, pp. 2095–2106, 2017.
- [9] A. Makhdoumi and A. Ozdaglar, “Convergence rate of distributed ADMM over networks,” IEEE Transactions on Automatic Control, vol. 62, no. 10, pp. 5082–5095, 2017.
- [10] J. Zeng, T. He, and M. Wang, “A fast proximal gradient algorithm for decentralized composite optimization over directed networks,” Systems & Control Letters, vol. 107, pp. 36–43, 2017.
- [11] N. S. Aybat, Z. Wang, T. Lin, and S. Ma, “Distributed linearized alternating direction method of multipliers for composite convex consensus optimization,” IEEE Transactions on Automatic Control, vol. 63, no. 1, pp. 5–20, 2018.
- [12] W. Shi, Q. Ling, G. Wu, and W. Yin, “A proximal gradient algorithm for decentralized composite optimization,” IEEE Transactions on Signal Processing, vol. 63, no. 22, pp. 6013–6023, 2015.
- [13] J. Xu, S. Zhu, Y. Soh, and L. Xie, “A Bregman splitting scheme for distributed optimization over networks,” IEEE Transactions on Automatic Control, vol. 63, no. 11, pp. 3809–3824, 2018.
- [14] J. Lei, H. Chen, and H. Fang, “Primal-dual algorithm for distributed constrained optimization,” Systems & Control Letters, vol. 96, pp. 110–117, 2016.
- [15] X. Wu and J. Lu, “Fenchel dual gradient methods for distributed convex optimization over time-varying networks,” IEEE Transactions on Automatic Control, vol. 64, no. 11, pp. 4629–4636, 2019.
- [16] M. Hong and T. Chang, “Stochastic proximal gradient consensus over random networks,” IEEE Transactions on Signal Processing, vol. 65, no. 11, pp. 2933–2948, 2017.
- [17] P. Bianchi, W. Hachem, and F. Iutzeler, “A coordinate descent primal-dual algorithm and application to distributed asynchronous optimization,” IEEE Transactions on Automatic Control, vol. 61, no. 10, pp. 2947–2957, 2016.
- [18] A. Mokhtari, W. Shi, Q. Ling, and A. Ribeiro, “A decentralized second-order method with exact linear convergence rate for consensus optimization,” IEEE Transactions on Signal and Information Processing over Networks, vol. 2, no. 4, pp. 507–522, 2016.
- [19] ——, “DQM: Decentralized quadratically approximated alternating direction method of multipliers,” IEEE Transactions on Signal Processing, vol. 64, no. 19, pp. 5158–5173, 2016.
- [20] X. Wu, Z. Qu, and J. Lu, “A second-order proximal algorithm for consensus optimization,” IEEE Transactions on Automatic Control, vol. 66, no. 4, pp. 1864–1871, 2021.
- [21] J. Xu, Y. Tian, Y. Sun, and G. Scutari, “Distributed algorithms for composite optimization: Unified and tight convergence analysis,” arXiv preprint arXiv:2002.11534, 2020.
- [22] S. A. Alghunaim, E. Ryu, K. Yuan, and A. H. Sayed, “Decentralized proximal gradient algorithms with linear convergence rates,” accepted to IEEE Transactions on Automatic Control, 2020.
- [23] A. Nedić and A. Olshevsky, “Distributed optimization over time-varying directed graphs,” IEEE Transactions on Automatic Control, vol. 60, no. 3, pp. 601–615, 2015.
- [24] C. Xi and U. A. Khan, “Distributed subgradient projection algorithm over directed graphs,” IEEE Transactions on Automatic Control, vol. 62, no. 8, pp. 3986–3992, 2017.
- [25] D. Yuan, Y. Hong, D. W. C. Ho, and C. Jiang, “Optimal distributed stochastic mirror descent for strongly convex optimization,” Automatica, vol. 90, pp. 196–203, 2018.
- [26] A. Nedić, A. Olshevsky, and W. Shi, “Achieving geometric convergence for distributed optimization over time-varying graphs,” SIAM Journal on Optimization, vol. 27, no. 4, pp. 2597–2633, 2017.
- [27] G. Qu and N. Li, “Harnessing smoothness to accelerate distributed optimization,” IEEE Transactions on Control of Network Systems, vol. 5, no. 3, pp. 1245–1260, 2017.
- [28] S. Pu, W. Shi, J. Xu, and A. Nedić, “Push-pull gradient methods for distributed optimization in networks,” accepted to IEEE Transactions on Automatic Control, 2020.
- [29] R. Xin and U. A. Khan, “Distributed heavy-ball: A generalization and acceleration of first-order methods with gradient tracking,” IEEE Transactions on Automatic Control, vol. 65, no. 6, pp. 2627–2633, 2020.
- [30] D. Varagnolo, F. Zanella, A. Cenedese, G. Pillonetto, and L. Schenato, “Newton-Raphson consensus for distributed convex optimization,” IEEE Transactions on Automatic Control, vol. 61, no. 4, pp. 994–1009, 2016.
- [31] P. Di Lorenzo and G. Scutari, “NEXT: In-network nonconvex optimization,” IEEE Transactions on Signal and Information Processing over Networks, vol. 2, no. 2, pp. 120–136, 2016.
- [32] A. Daneshmand, Y. Sun, G. Scutari, F. Facchinei, and B. Sadler, “Decentralized dictionary learning over time-varying digraphs,” Journal of Machine Learning Research, vol. 20, pp. 1–62, 2019.
- [33] G. Scutari and Y. Sun, “Distributed nonconvex constrained optimization over time-varying digraphs,” Mathematical Programming, vol. 176, pp. 497–544, 2019.
- [34] I. Notarnicola, Y. Sun, G. Scutari, and G. Notarstefano, “Distributed big-data optimization via block-wise gradient tracking,” accepted to IEEE Transactions on Automatic Control, 2020.
- [35] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, “Distributed optimization and statistical learning via the alternating direction method of multipliers,” Foundations and Trends in Machine Learning, vol. 3, no. 1, pp. 1–122, 2011.
- [36] F. Facchinei, G. Scutari, and S. Sagratella, “Parallel selective algorithms for nonconvex big data optimization,” IEEE Transactions on Signal Processing, vol. 63, no. 7, pp. 1874–1889, 2015.
- [37] F. Facchinei, L. Lampariello, and G. Scutari, “Feasible methods for nonconvex nonsmooth problems with applications in green communications,” Mathematical Programming, vol. 164, pp. 55–90, 2017.
- [38] M. Hong, X. Wang, M. Razaviyayn, and Z.-Q. Luo, “Iteration complexity analysis of block coordinate descent methods,” Mathematical Programming, vol. 163, pp. 85–114, 2017.
- [39] G. Scutari, F. Facchinei, and L. Lampariello, “Parallel and distributed methods for constrained nonconvex optimization—Part I: Theory,” IEEE Transactions on Signal Processing, vol. 65, no. 8, pp. 1929–1944, 2017.
- [40] L. Cannelli, F. Facchinei, V. Kungurtsev, and G. Scutari, “Asynchronous parallel algorithms for nonconvex optimization,” Mathematical Programming, vol. 184, pp. 1–34, 2019.
- [41] S. Kakade, S. Shalev-Shwartz, and A. Tewari, “On the duality of strong convexity and strong smoothness: Learning applications and matrix regularization,” Manuscript, 2009.
- [42] A. Beck and M. Teboulle, “Mirror descent and nonlinear projected subgradient methods for convex optimization,” Operations Research Letters, vol. 31, no. 3, pp. 167–175, 2003.
- [43] M. Spivak, Calculus on manifolds: A modern approach to classical theorems of advanced calculus. Addison-Wesley, 1965.
- [44] K. Yuan, Q. Ling, and W. Yin, “On the convergence of decentralized gradient descent,” SIAM Journal on Optimization, vol. 26, no. 3, pp. 1835–1854, 2016.
- [45] F. Bach, “Adaptivity of averaged stochastic gradient descent to local strong convexity for logistic regression,” Journal of Machine Learning Research, vol. 15, pp. 595–627, 2014.
Xuyang Wu (SM’17) received the B.S. degree in Information and Computing Science from Northwestern Polytechnical University, Xi’an, China, in 2015, and the Ph.D. degree from the School of Information Science and Technology at ShanghaiTech University, Shanghai, China, in 2020. He is currently a postdoctoral researcher in the Division of Decision and Control Systems at KTH Royal Institute of Technology, Stockholm, Sweden. His research interests include distributed optimization, large-scale optimization, and their applications in IoT and power systems.
Jie Lu (SM’08-M’13) received the B.S. degree in Information Engineering from Shanghai Jiao Tong University, China, in 2007, and the Ph.D. degree in Electrical and Computer Engineering from the University of Oklahoma, USA, in 2011. From 2012 to 2015 she was a postdoctoral researcher with KTH Royal Institute of Technology, Stockholm, Sweden, and with Chalmers University of Technology, Gothenburg, Sweden. Since 2015, she has been an assistant professor in the School of Information Science and Technology at ShanghaiTech University, Shanghai, China. Her research interests include distributed optimization, optimization theory and algorithms, and networked dynamical systems.