
A Unifying Approximate Method of Multipliers
for Distributed Composite Optimization

Xuyang Wu and Jie Lu. X. Wu is with the Division of Decision and Control Systems, KTH Royal Institute of Technology, SE-100 44 Stockholm, Sweden. Email: [email protected]. J. Lu is with the School of Information Science and Technology, ShanghaiTech University, 201210 Shanghai, China. Email: [email protected]. This work has been supported by the National Natural Science Foundation of China under grant 61603254.
Abstract

This paper investigates solving convex composite optimization on an undirected network, where each node, privately endowed with a smooth component function and a nonsmooth one, is required to minimize the sum of all the component functions throughout the network. To address such a problem, a general Approximate Method of Multipliers (AMM) is developed, which attempts to approximate the Method of Multipliers by virtue of a surrogate function with numerous options. We then design the possibly nonseparable, time-varying surrogate function in various ways, leading to different distributed realizations of AMM. We demonstrate that AMM generalizes more than ten state-of-the-art distributed optimization algorithms, and that certain specific designs of its surrogate function yield a variety of algorithms that are new to the literature. Furthermore, we show that AMM is able to achieve an O(1/k) rate of convergence to optimality, and that the convergence rate becomes linear when the problem is locally restricted strongly convex and smooth. Such convergence rates provide new or stronger convergence results for many prior methods that can be viewed as specializations of AMM.

Index Terms:
Distributed optimization, convex composite optimization, the Method of Multipliers.

I Introduction

This paper addresses the following convex composite optimization problem:

\underset{x\in\mathbb{R}^{d}}{\operatorname{minimize}}~\sum_{i=1}^{N}\left(f_{i}(x)+h_{i}(x)\right), (1)

where each f_{i}:\mathbb{R}^{d}\rightarrow\mathbb{R} is convex and has a Lipschitz continuous gradient, and each h_{i}:\mathbb{R}^{d}\rightarrow\mathbb{R}\cup\{+\infty\} is convex and can be non-differentiable. Notice that f_{i} is allowed to be a zero function. Also, if h_{i} contains an indicator function \mathcal{I}_{X_{i}} with respect to a closed convex set X_{i}\subset\mathbb{R}^{d}, i.e., \mathcal{I}_{X_{i}}(x)=0 if x\in X_{i} and \mathcal{I}_{X_{i}}(x)=+\infty otherwise, then problem (1) turns into a nonsmooth, constrained convex program.

We consider solving problem (1) in a distributed way over a network modeled as a connected, undirected graph \mathcal{G}=(\mathcal{V},\mathcal{E}), where \mathcal{V}=\{1,\ldots,N\} is the vertex set and \mathcal{E}\subseteq\{\{i,j\}|~i,j\in\mathcal{V},~i\neq j\} is the edge set. We suppose each node i\in\mathcal{V} possesses two private component functions f_{i} and h_{i}, aims at solving (1), and only exchanges information with its neighbors denoted by the set \mathcal{N}_{i}=\{j|~\{i,j\}\in\mathcal{E}\}. Applications of such a distributed composite optimization problem include control of multi-agent systems, robust estimation by sensor networks, resource allocation in smart grids, decision making for swarm robotics, as well as distributed learning [1].

To date, a large body of distributed optimization algorithms has been proposed to solve problem (1) or its special cases. A typical practice is to reformulate (1) into a constrained problem with a separable objective function and a consensus constraint to be relaxed. For example, the inexact methods [2, 3] solve this reformulated problem by penalizing the consensus constraint in the objective function, and the primal-dual methods [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22] relax the consensus constraint through dual decomposition techniques. Another common approach is to emulate centralized subgradient/gradient methods in a decentralized manner, such as the distributed subgradient methods [23, 24, 25] and the distributed gradient-tracking-based methods [26, 27, 28, 29, 30, 31, 32, 33, 34], where [31, 32, 33, 34] utilize surrogate functions in order to realize the decentralized emulation.

Despite the growing literature, relatively few methods manage to tackle the general form of problem (1) under distributed settings, among which only the first-order methods [8, 15, 6, 10, 17, 9, 11, 12, 13, 16, 7] are guaranteed to converge to optimality with constant step-sizes. Here, [8, 15] rely on strong convexity of the problem, and [6, 10, 17] only prove asymptotic convergence for non-strongly convex problems. In contrast, the remaining works [9, 11, 12, 13, 16, 7] establish O(1/k) convergence rates in solving (1). Although these algorithms are developed using different rationales, we discover that the majority of them, i.e., [9, 11, 12, 13, 16], can in fact be viewed as originating from the Method of Multipliers [35].

The Method of Multipliers is a seminal (centralized) optimization method, and one of its notable variants is the Alternating Direction Method of Multipliers (ADMM) [35]. In this paper, we develop a novel paradigm of solving (1) via approximating the behavior of the Method of Multipliers, called Approximate Method of Multipliers (AMM). The proposed AMM adopts a possibly time-varying surrogate function to take the place of the smooth objective function at every minimization step in the Method of Multipliers, facilitating abundant designs of distributed algorithms for solving (1) in a fully decentralized fashion. Unlike the typical separable surrogate functions for distributed optimization such as those in [31, 32, 33, 34], our surrogate function can be nonseparable in some distributed versions of AMM. It also admits certain function forms that cannot be included by [31, 32, 33, 34].

To enable distributed implementation of AMM, we first opt for a Bregman-divergence-type surrogate function, leading to a class of distributed realizations of AMM referred to as Distributed Approximate Method of Multipliers (DAMM). We also utilize convex conjugation and the graph topology to design the surrogate function, and thus construct two additional sets of distributed realizations of AMM, referred to as DAMM-SC and DAMM-SQ, for solving smooth convex optimization (i.e., (1) with h_{i}\equiv 0 \forall i\in\mathcal{V}). We concretely exemplify such realizations of AMM, so that new distributed proximal/second-order/gradient-tracking methods can be obtained. Apart from that, AMM and its distributed realizations also unify a wide range of state-of-the-art distributed first-order and second-order algorithms [5, 13, 19, 16, 12, 11, 4, 9, 14, 26, 18] for solving (1) or its special cases, as all of them can be cast into the form of AMM/DAMM/DAMM-SQ. This offers a unifying perspective for understanding the nature of these existing methods with various design rationales.

We show that AMM allows the nodes to reach a consensus and converge to the optimal value at a rate of O(1/k), either with a particular surrogate function type or under a local restricted strong convexity condition on \sum_{i\in\mathcal{V}}f_{i}(x). Among all the algorithms that are able to solve the general form of problem (1) in a distributed way, the convergence rates of AMM and the algorithms in [9, 11, 12, 13, 16, 7] achieve the best order of O(1/k), while the algorithms in [9, 11, 12, 13, 16] are indeed special cases of AMM with that particular surrogate function type and the convergence rate in [7] is in terms of an implicit measure of optimality error. Moreover, we show that when problem (1) is both smooth and locally restricted strongly convex, AMM enables the nodes to attain the optimum at a linear rate. Unlike most existing works, this linear convergence is established without requiring global strong convexity. Naturally, our analysis of AMM yields new convergence results and relaxes some problem assumptions, with no degradation of the convergence rate order, for quite a few existing distributed optimization methods that can be generalized by AMM.

The outline of the paper is as follows: Section II describes the proposed AMM and the distributed designs. Section III demonstrates that AMM generalizes a number of prior distributed optimization methods. Section IV discusses the convergence results of AMM under various assumptions, and Section V presents the comparative simulation results. Finally, Section VI concludes the paper.

Notation and Definition: We use A=(a_{ij})_{n\times n} to denote an n\times n real matrix whose (i,j)-entry, denoted by [A]_{ij}, is equal to a_{ij}. In addition, \operatorname{Null}(A) is the null space of A, \operatorname{Range}(A) is the range of A, and \|A\| is the spectral norm of A. Besides, \operatorname{diag}(D_{1},\ldots,D_{n})\in\mathbb{R}^{nd\times nd} represents the block diagonal matrix with D_{1},\ldots,D_{n}\in\mathbb{R}^{d\times d} being its diagonal blocks. Also, \mathbf{0}, \mathbf{1}, and \mathbf{O} represent a zero vector, an all-one vector, and a zero square matrix of proper dimensions, respectively, and I_{n} represents the n\times n identity matrix. For any two matrices A and B, A\otimes B is the Kronecker product of A and B. If A and B are square matrices of the same size, A\succeq B means A-B is positive semidefinite and A\succ B means A-B is positive definite. For any A=A^{T}\in\mathbb{R}^{n\times n}, \lambda_{\max}(A) denotes the largest eigenvalue of A. For any z\in\mathbb{R}^{n} and A\succeq\mathbf{O}, \|z\|=\sqrt{z^{T}z} and \|z\|_{A}=\sqrt{z^{T}Az}. For any countable set S, we use |S| to represent its cardinality. For any convex function f:\mathbb{R}^{d}\rightarrow\mathbb{R}, \partial f(x)\subset\mathbb{R}^{d} denotes its subdifferential (i.e., the set of subgradients) at x. If f is differentiable, \partial f(x) only contains the gradient \nabla f(x).

Given a convex set X\subseteq\mathbb{R}^{d}, a function f:\mathbb{R}^{d}\rightarrow\mathbb{R} is said to be strongly convex on X with convexity parameter \sigma>0 (or simply \sigma-strongly convex on X) if \langle g_{x}-g_{y},x-y\rangle\geq\sigma\|x-y\|^{2} \forall x,y\in X, \forall g_{x}\in\partial f(x), \forall g_{y}\in\partial f(y). We say f is (globally) strongly convex if it is strongly convex on \mathbb{R}^{d}. Given \tilde{x}\in X, f is said to be restricted strongly convex with respect to \tilde{x} on X with convexity parameter \sigma>0 if \langle g_{x}-g_{\tilde{x}},x-\tilde{x}\rangle\geq\sigma\|x-\tilde{x}\|^{2} \forall x\in X, \forall g_{x}\in\partial f(x), \forall g_{\tilde{x}}\in\partial f(\tilde{x}). Given L\geq 0, f is said to be L-smooth if it is differentiable and its gradient is Lipschitz continuous with Lipschitz constant L, i.e., \|\nabla f(x)-\nabla f(y)\|\leq L\|x-y\| \forall x,y\in\mathbb{R}^{d}.

II Approximate Method of Multipliers and Distributed Designs

This section develops distributed optimization algorithms for solving problem (1) over the undirected, connected graph \mathcal{G}, with the following formalized problem assumption.

Assumption 1.

For each i\in\mathcal{V}, f_{i}:\mathbb{R}^{d}\rightarrow\mathbb{R} and h_{i}:\mathbb{R}^{d}\rightarrow\mathbb{R}\cup\{+\infty\} are convex. In addition, f_{i} is M_{i}-smooth for some M_{i}>0. Moreover, there exists at least one optimal solution x^{\star} to problem (1).

II-A Approximate Method of Multipliers

We first propose a family of optimization algorithms that effectively approximate the Method of Multipliers [35] and serve as a cornerstone of developing distributed algorithms for solving problem (1).

Let each node i\in\mathcal{V} keep a copy x_{i}\in\mathbb{R}^{d} of the global decision variable x\in\mathbb{R}^{d} in problem (1), and define

f(\mathbf{x}):=\textstyle{\sum_{i\in\mathcal{V}}}f_{i}(x_{i}),\quad h(\mathbf{x}):=\textstyle{\sum_{i\in\mathcal{V}}}h_{i}(x_{i}),

where \mathbf{x}=(x_{1}^{T},\ldots,x_{N}^{T})^{T}\in\mathbb{R}^{Nd}. Then, problem (1) can be equivalently transformed into

\begin{split}\underset{\mathbf{x}\in\mathbb{R}^{Nd}}{\operatorname{minimize}}~~&~f(\mathbf{x})+h(\mathbf{x})\\ \operatorname{subject~to}~&~\mathbf{x}\in S:=\{\mathbf{x}\in\mathbb{R}^{Nd}|~x_{1}=\cdots=x_{N}\}.\end{split} (2)

Next, let \tilde{H}\in\mathbb{R}^{Nd\times Nd} be such that \tilde{H}=\tilde{H}^{T}, \tilde{H}\succeq\mathbf{O}, and \operatorname{Null}(\tilde{H})=S. It has been shown in [18] that the consensus constraint \mathbf{x}\in S in problem (2) is equivalent to \tilde{H}^{\frac{1}{2}}\mathbf{x}=\mathbf{0}. Therefore, (2) is identical to

\begin{split}\underset{\mathbf{x}\in\mathbb{R}^{Nd}}{\operatorname{minimize}}~~&~f(\mathbf{x})+h(\mathbf{x})\\ \operatorname{subject~to}~&~\tilde{H}^{\frac{1}{2}}\mathbf{x}=\mathbf{0}.\end{split} (3)

Problems (1), (2), and (3) have the same optimal value. Also, x^{\star}\in\mathbb{R}^{d} is an optimum of (1) if and only if \mathbf{x}^{\star}=((x^{\star})^{T},\ldots,(x^{\star})^{T})^{T}\in\mathbb{R}^{Nd} is an optimum of (2) and (3).

Consider applying the Method of Multipliers [35] to solve problem (3), which gives

\mathbf{x}^{k+1}\in\operatorname{\arg\;\min}_{\mathbf{x}\in\mathbb{R}^{Nd}}f(\mathbf{x})+h(\mathbf{x})+\frac{\rho}{2}\|\mathbf{x}\|_{\tilde{H}}^{2}+(\mathbf{v}^{k})^{T}\tilde{H}^{\frac{1}{2}}\mathbf{x},
\mathbf{v}^{k+1}=\mathbf{v}^{k}+\rho\tilde{H}^{\frac{1}{2}}\mathbf{x}^{k+1}.

Here, \mathbf{x}^{k}\in\mathbb{R}^{Nd} is the primal variable at iteration k\geq 0, which is updated by minimizing an augmented Lagrangian function with the penalty \frac{\rho}{2}\|\mathbf{x}\|_{\tilde{H}}^{2}, \rho>0, on the consensus constraint \tilde{H}^{\frac{1}{2}}\mathbf{x}=\mathbf{0}. In addition, \mathbf{v}^{k}\in\mathbb{R}^{Nd} is the dual variable, whose initial value \mathbf{v}^{0} can be arbitrarily set.

Although the Method of Multipliers may be applied to solve (3) with properly selected parameters, it is not implementable in a distributed fashion over \mathcal{G}, even when the problem reduces to linear programming. To address this issue, our strategy is to first derive a paradigm of approximating the Method of Multipliers and then design its distributed realizations.

Our approximation approach is as follows: Starting from any \mathbf{v}^{0}\in\mathbb{R}^{Nd}, let

\mathbf{x}^{k+1}=\underset{\mathbf{x}\in\mathbb{R}^{Nd}}{\operatorname{\arg\;\min}}\;u^{k}(\mathbf{x})+h(\mathbf{x})+\frac{\rho}{2}\|\mathbf{x}\|_{H}^{2}+(\mathbf{v}^{k})^{T}\tilde{H}^{\frac{1}{2}}\mathbf{x}, (4)
\mathbf{v}^{k+1}=\mathbf{v}^{k}+\rho\tilde{H}^{\frac{1}{2}}\mathbf{x}^{k+1},\quad\forall k\geq 0. (5)

Compared to the Method of Multipliers, we adopt the same dual update (5) but construct a different primal update (4). In (4), we use a possibly time-varying surrogate function u^{k}:\mathbb{R}^{Nd}\rightarrow\mathbb{R} to replace f(\mathbf{x}) in the primal update of the Method of Multipliers, whose conditions are imposed in Assumption 2 below. Additionally, to introduce more flexibility, we use a different weight matrix H\in\mathbb{R}^{Nd\times Nd} to define the penalty term \frac{\rho}{2}\|\mathbf{x}\|_{H}^{2}, \rho>0. We suppose H has the same properties as \tilde{H}, i.e., H=H^{T}\succeq\mathbf{O} and \operatorname{Null}(H)=S.

Assumption 2.

The functions u^{k} \forall k\geq 0 satisfy the following:

  1. (a)

    u^{k} \forall k\geq 0 are convex and twice continuously differentiable.

  2. (b)

    u^{k}+\frac{\rho}{2}\|\cdot\|_{H}^{2} \forall k\geq 0 are strongly convex, whose convexity parameters are uniformly bounded from below by some positive constant.

  3. (c)

    \nabla u^{k} \forall k\geq 0 are Lipschitz continuous, whose Lipschitz constants are uniformly bounded from above by some nonnegative constant.

  4. (d)

    \nabla u^{k}(\mathbf{x}^{k})=\nabla f(\mathbf{x}^{k}) \forall k\geq 0, where \mathbf{x}^{0} can be arbitrarily given and \mathbf{x}^{k}, k\geq 1 is generated by (4)–(5).

The strong convexity condition in Assumption 2(b) guarantees that \mathbf{x}^{k+1} in (4) exists and is unique. Assumption 2(d) is the key to making (4)–(5) solve problem (3). To explain this, note that (4) is equivalent to finding the unique \mathbf{x}^{k+1} satisfying

-\nabla u^{k}(\mathbf{x}^{k+1})-\rho H\mathbf{x}^{k+1}-\tilde{H}^{\frac{1}{2}}\mathbf{v}^{k}\in\partial h(\mathbf{x}^{k+1}). (6)

Let (\mathbf{x}^{\star},\mathbf{v}^{\star}) be a primal-dual optimal solution pair of problem (3), which satisfies

-\tilde{H}^{\frac{1}{2}}\mathbf{v}^{\star}-\nabla f(\mathbf{x}^{\star})\in\partial h(\mathbf{x}^{\star}). (7)

If (\mathbf{x}^{k},\mathbf{v}^{k})=(\mathbf{x}^{\star},\mathbf{v}^{\star}), then \mathbf{x}^{k+1} has to be \mathbf{x}^{\star} because of Assumption 2(d), H\mathbf{x}^{\star}=\mathbf{0}, (7), and the uniqueness of \mathbf{x}^{k+1} in (6). It follows from (5) and \tilde{H}^{\frac{1}{2}}\mathbf{x}^{\star}=\mathbf{0} that \mathbf{v}^{k+1}=\mathbf{v}^{\star}. Therefore, (\mathbf{x}^{\star},\mathbf{v}^{\star}) is a fixed point of (4)–(5). The remaining conditions in Assumption 2 will be used for convergence analysis later in Section IV.

The paradigm described by (4)–(5) and obeying Assumption 2 is called Approximate Method of Multipliers (AMM). As there are numerous options of the surrogate function u^{k}, AMM unifies a wealth of optimization algorithms, including a variety of existing methods (cf. Section III) and many brand new algorithms. Moreover, since Assumption 2 allows u^{k} to have a more favorable structure than f, AMM with appropriate u^{k}'s may induce a prominent reduction in computational cost compared to the Method of Multipliers. In the sequel, we will provide various options of u^{k}, which give rise to a series of distributed versions of AMM.
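As a simple illustration (not one of the designs developed in the sequel), one can verify that the linearized choice u^{k}(\mathbf{x})=\langle\nabla f(\mathbf{x}^{k}),\mathbf{x}\rangle+\frac{\alpha}{2}\|\mathbf{x}-\mathbf{x}^{k}\|^{2} with any \alpha>0 satisfies Assumption 2: it is convex and twice continuously differentiable, u^{k}+\frac{\rho}{2}\|\cdot\|_{H}^{2} is \alpha-strongly convex, \nabla u^{k} is Lipschitz continuous with constant \alpha, and \nabla u^{k}(\mathbf{x}^{k})=\nabla f(\mathbf{x}^{k}). With this choice, the primal update (4) can be interpreted as a linearized, proximal-gradient-type step on the augmented Lagrangian.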

A number of existing works [36, 37, 38, 39, 40, 31, 32, 33, 34] also introduce surrogate functions to optimization methods, among which [31, 32, 33, 34] develop distributed algorithms and can address partially nonconvex problems on time-varying networks. The advantages of AMM as well as its differences from the algorithms in [31, 32, 33, 34] are highlighted as follows:

i) The algorithms carrying surrogate functions in [31, 32, 33, 34] are (primal) gradient-tracking-based methods. In contrast, AMM incorporates a surrogate function into a primal-dual framework, so that AMM is inherently different from the algorithms in [31, 32, 33, 34] and requires new analysis tools.

ii) The surrogate function conditions in Assumption 2 intersect with but still differ from those in [31, 32, 33, 34]. For instance, when f is twice continuously differentiable, u^{k}(\mathbf{x})=\langle\nabla f(\mathbf{x}^{k}),\mathbf{x}\rangle+\frac{1}{2}(\mathbf{x}-\mathbf{x}^{k})^{T}(\nabla^{2}f(\mathbf{x}^{k})+\epsilon I)(\mathbf{x}-\mathbf{x}^{k}) with \epsilon>0 meets Assumption 2 but cannot be included by [31, 32, 33, 34].

iii) To enable distributed implementation, the surrogate functions in [31, 32, 33, 34] need to be fully separable in the sense that they can be written as the sum of N functions with independent variables. In contrast, AMM allows u^{k} to be densely coupled under proper design yet can still be executed in a distributed fashion (cf. Sections II-C and III-C4). This may lead to more diverse algorithm design.

iv) In [31, 32, 33, 34], it is required that the global nonsmooth objective function be accessible to every node, while problem (1) only assigns a local nonsmooth component h_{i} to each node i, which is more applicable to scenarios concerning privacy and makes the analysis more challenging.

II-B Distributed Approximate Method of Multipliers

This subsection lays out the parameter designs of AMM for distributed implementations.

We first apply the following change of variable to AMM:

\mathbf{q}^{k}=\tilde{H}^{\frac{1}{2}}\mathbf{v}^{k},\quad k\geq 0. (8)

Then, AMM (4)–(5) can be rewritten as

\mathbf{x}^{k+1}=\underset{\mathbf{x}\in\mathbb{R}^{Nd}}{\operatorname{\arg\;\min}}~u^{k}(\mathbf{x})+h(\mathbf{x})+\frac{\rho}{2}\|\mathbf{x}\|_{H}^{2}+(\mathbf{q}^{k})^{T}\mathbf{x}, (9)
\mathbf{q}^{k+1}=\mathbf{q}^{k}+\rho\tilde{H}\mathbf{x}^{k+1}. (10)

Moreover, note that

\operatorname{Range}(\tilde{H}^{\frac{1}{2}})=\operatorname{Range}(\tilde{H})=S^{\bot}:=\{\mathbf{x}\in\mathbb{R}^{Nd}|~x_{1}+\cdots+x_{N}=\mathbf{0}\},

where S^{\bot} is the orthogonal complement of S in (2). Hence, (8) requires \mathbf{q}^{k}\in S^{\bot} \forall k\geq 0, which, due to (10), can be ensured simply by the following initialization:

\mathbf{q}^{0}\in S^{\bot}. (11)

Therefore, (9)–(11) is an equivalent form of AMM.

Next, partition the primal variable \mathbf{x}^{k} and the dual variable \mathbf{q}^{k} in (9)–(11) as \mathbf{x}^{k}=((x_{1}^{k})^{T},\ldots,(x_{N}^{k})^{T})^{T} and \mathbf{q}^{k}=((q_{1}^{k})^{T},\ldots,(q_{N}^{k})^{T})^{T}. Suppose each node i\in\mathcal{V} maintains x_{i}^{k}\in\mathbb{R}^{d} and q_{i}^{k}\in\mathbb{R}^{d}. Clearly, the nodes manage to collectively meet (11) by setting, for instance, q_{i}^{0}=\mathbf{0} \forall i\in\mathcal{V}. Below, we discuss the selections of u^{k}, H, and \tilde{H} for the sake of distributed implementations of (9) and (10).

To this end, we let

H=P\otimes I_{d},\quad\tilde{H}=\tilde{P}\otimes I_{d},

and impose Assumption 3 on P,\tilde{P}\in\mathbb{R}^{N\times N}.

Assumption 3.

The matrices P=(p_{ij})_{N\times N} and \tilde{P}=(\tilde{p}_{ij})_{N\times N} satisfy the following:

  1. (a)

    p_{ij}=p_{ji}, \tilde{p}_{ij}=\tilde{p}_{ji}, \forall\{i,j\}\in\mathcal{E}.

  2. (b)

    p_{ij}=\tilde{p}_{ij}=0, \forall i\in\mathcal{V}, \forall j\notin\mathcal{N}_{i}\cup\{i\}.

  3. (c)

    \operatorname{Null}(P)=\operatorname{Null}(\tilde{P})=\operatorname{span}(\mathbf{1}).

  4. (d)

    P\succeq\mathbf{O}, \tilde{P}\succeq\mathbf{O}.

The nodes can jointly determine P,\tilde{P} under Assumption 3 without any centralized coordination. For instance, we can let each node i\in\mathcal{V} agree with every neighbor j\in\mathcal{N}_{i} on p_{ij}=p_{ji}<0 and \tilde{p}_{ij}=\tilde{p}_{ji}<0, and set p_{ii}=-\sum_{j\in\mathcal{N}_{i}}p_{ij} and \tilde{p}_{ii}=-\sum_{j\in\mathcal{N}_{i}}\tilde{p}_{ij}, which directly guarantees Assumption 3(a). Then, Assumption 3(c)(d) are readily satisfied by means of Assumption 3(b) and the connectivity of \mathcal{G}. Typical examples of such P and \tilde{P} include the graph Laplacian matrix L_{\mathcal{G}}\in\mathbb{R}^{N\times N} with [L_{\mathcal{G}}]_{ij}=[L_{\mathcal{G}}]_{ji}=-1 \forall\{i,j\}\in\mathcal{E} and the Metropolis weight matrix M_{\mathcal{G}}\in\mathbb{R}^{N\times N} with [M_{\mathcal{G}}]_{ij}=[M_{\mathcal{G}}]_{ji}=-\frac{1}{\max\{|\mathcal{N}_{i}|,|\mathcal{N}_{j}|\}+1} \forall\{i,j\}\in\mathcal{E}. Also, the conditions on H and \tilde{H} in Section II-A hold due to Assumption 3.
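As a concrete illustration of this construction, the following minimal Python sketch (the small graph is an assumed toy example, not from the paper) builds the graph Laplacian L_{\mathcal{G}} and the Metropolis matrix M_{\mathcal{G}}, with diagonal entries set to the negated off-diagonal row sums as described above, and numerically checks the requirements of Assumption 3.

import numpy as np

N = 4
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]   # illustrative connected graph
deg = np.zeros(N, dtype=int)
for i, j in edges:
    deg[i] += 1; deg[j] += 1                       # |N_i|

L = np.zeros((N, N)); M = np.zeros((N, N))
for i, j in edges:
    L[i, j] = L[j, i] = -1.0                       # [L_G]_{ij} = -1
    M[i, j] = M[j, i] = -1.0 / (max(deg[i], deg[j]) + 1.0)   # Metropolis weight
for i in range(N):
    L[i, i] = -L[i].sum(); M[i, i] = -M[i].sum()   # p_ii = -sum_{j in N_i} p_ij

for P in (L, M):
    assert np.allclose(P, P.T)                     # Assumption 3(a): symmetry
    assert np.allclose(P @ np.ones(N), 0.0)        # the all-one vector lies in Null(P)
    w = np.sort(np.linalg.eigvalsh(P))
    assert w[0] > -1e-12 and w[1] > 1e-12          # Assumption 3(c)(d): PSD with Null(P) = span(1)
print("L_G and M_G satisfy Assumption 3 on this toy graph")

Assumption 3(b) holds by construction, since only entries associated with edges of the graph are assigned nonzero values.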

With Assumption 3, (10) can be accomplished by letting each node update as

q_{i}^{k+1}=q_{i}^{k}+\rho\textstyle{\sum_{j\in\mathcal{N}_{i}\cup\{i\}}}\tilde{p}_{ij}x_{j}^{k+1},\quad\forall i\in\mathcal{V}. (12)

This is a local operation, as each node i\in\mathcal{V} only needs to acquire the primal variables associated with its neighbors.

It remains to design the surrogate function u^{k} so that (9) can be realized in a distributed way. To this end, let \psi_{i}^{k}:\mathbb{R}^{d}\rightarrow\mathbb{R} \forall i\in\mathcal{V} \forall k\geq 0 be a set of functions under Assumption 4.

Assumption 4.

The functions \psi_{i}^{k} \forall i\in\mathcal{V} \forall k\geq 0 satisfy:

  1. (a)

    \psi_{i}^{k} \forall i\in\mathcal{V} \forall k\geq 0 are twice continuously differentiable.

  2. (b)

    \psi_{i}^{k} \forall i\in\mathcal{V} \forall k\geq 0 are strongly convex, whose convexity parameters are uniformly bounded from below by some positive constant.

  3. (c)

    \nabla\psi_{i}^{k} \forall i\in\mathcal{V} \forall k\geq 0 are Lipschitz continuous, whose Lipschitz constants are uniformly bounded from above by some positive constant.

  4. (d)

    (\sum_{i\in\mathcal{V}}\psi_{i}^{k}(x_{i}))-\frac{\rho}{2}\|\mathbf{x}\|_{H}^{2} \forall k\geq 0 are convex.

To meet Assumption 4(d), it suffices to let \psi_{i}^{k} \forall i\in\mathcal{V} \forall k\geq 0 be any strongly convex functions whose convexity parameters are larger than or equal to \rho\lambda_{\max}(H)>0, and one readily available upper bound on \lambda_{\max}(H) is N\max_{i\in\mathcal{V},\;j\in\mathcal{N}_{i}}|p_{ij}|.

Now define

\phi^{k}(\mathbf{x}):=\bigl(\textstyle{\sum_{i\in\mathcal{V}}}\psi_{i}^{k}(x_{i})\bigr)-\frac{\rho}{2}\|\mathbf{x}\|_{H}^{2},\quad\forall k\geq 0, (13)

and let D_{\phi^{k}}(\mathbf{x},\mathbf{x}^{k}) be the Bregman divergence associated with \phi^{k} at \mathbf{x} and \mathbf{x}^{k}, i.e.,

D_{\phi^{k}}(\mathbf{x},\mathbf{x}^{k})=\phi^{k}(\mathbf{x})-\phi^{k}(\mathbf{x}^{k})-\langle\nabla\phi^{k}(\mathbf{x}^{k}),\mathbf{x}-\mathbf{x}^{k}\rangle. (14)

Then, we set u^{k} as

u^{k}(\mathbf{x})=D_{\phi^{k}}(\mathbf{x},\mathbf{x}^{k})+\langle\nabla f(\mathbf{x}^{k}),\mathbf{x}\rangle. (15)

With Assumption 4, (15) is sufficient to ensure Assumption 2. To see this, note from Assumption 4(a)(d) that u^{k}(\mathbf{x}) in (15) is twice continuously differentiable and convex, i.e., Assumption 2(a) holds. Also note from (14) and (15) that

\nabla u^{k}(\mathbf{x})=\nabla\phi^{k}(\mathbf{x})-\nabla\phi^{k}(\mathbf{x}^{k})+\nabla f(\mathbf{x}^{k}). (16)

This, along with Assumption 4(b)(c), guarantees Assumption 2(b)(c)(d).

To see how (15) results in a distributed implementation of (9), note from (16) that (9) is equivalent to

\mathbf{0}=\nabla\phi^{k}(\mathbf{x}^{k+1})-\nabla\phi^{k}(\mathbf{x}^{k})+\nabla f(\mathbf{x}^{k})+g^{k+1}+\rho H\mathbf{x}^{k+1}+\mathbf{q}^{k}

for some g^{k+1}\in\partial h(\mathbf{x}^{k+1}). Then, using (13) and the structure of H given in Assumption 3, this can be written as

\mathbf{0}=\nabla\psi_{i}^{k}(x_{i}^{k+1})+g_{i}^{k+1}+q_{i}^{k}-\nabla\psi_{i}^{k}(x_{i}^{k})+\nabla f_{i}(x_{i}^{k})+\rho\textstyle{\sum_{j\in\mathcal{N}_{i}\cup\{i\}}}p_{ij}x_{j}^{k},\quad\forall i\in\mathcal{V},

where g_{i}^{k+1}\in\partial h_{i}(x_{i}^{k+1}). In other words, (9) can be achieved by letting each node i\in\mathcal{V} solve the following strongly convex optimization problem:

x_{i}^{k+1}=\operatorname{\arg\;\min}_{x\in\mathbb{R}^{d}}~\psi_{i}^{k}(x)+h_{i}(x)+\langle x,q_{i}^{k}-\nabla\psi_{i}^{k}(x_{i}^{k})+\nabla f_{i}(x_{i}^{k})+\rho\textstyle{\sum_{j\in\mathcal{N}_{i}\cup\{i\}}}p_{ij}x_{j}^{k}\rangle, (17)

which can be locally carried out by node i.

Algorithms described by (11), (17), and (12) under Assumptions 3 and 4 constitute a set of distributed realizations of AMM, referred to as Distributed Approximate Method of Multipliers (DAMM), which can be implemented by the nodes in \mathcal{G} via exchanging information with their neighbors only. Algorithm 1 describes how each node acts in DAMM.

Algorithm 1 DAMM
1:  Initialization:
2:  Each node i\in\mathcal{V} selects q_{i}^{0}\in\mathbb{R}^{d} such that \sum_{i\in\mathcal{V}}q_{i}^{0}=\mathbf{0} (or simply sets q_{i}^{0}=\mathbf{0}).
3:  Each node i\in\mathcal{V} arbitrarily sets x_{i}^{0}\in\mathbb{R}^{d} and sends x_{i}^{0} to every neighbor j\in\mathcal{N}_{i}.
4:  for k\geq 0 do
5:     Each node i\in\mathcal{V} computes x_{i}^{k+1}=\operatorname{\arg\;\min}_{x\in\mathbb{R}^{d}}\psi_{i}^{k}(x)+h_{i}(x)+\langle x,q_{i}^{k}-\nabla\psi_{i}^{k}(x_{i}^{k})+\nabla f_{i}(x_{i}^{k})+\rho\sum_{j\in\mathcal{N}_{i}\cup\{i\}}p_{ij}x_{j}^{k}\rangle.
6:     Each node i\in\mathcal{V} sends x_{i}^{k+1} to every neighbor j\in\mathcal{N}_{i}.
7:     Each node i\in\mathcal{V} computes q_{i}^{k+1}=q_{i}^{k}+\rho\sum_{j\in\mathcal{N}_{i}\cup\{i\}}\tilde{p}_{ij}x_{j}^{k+1}.
8:  end for

Finally, we provide two examples of DAMM with two particular choices of \psi_{i}^{k}.

Example 1.

For each i\in\mathcal{V}, let \psi_{i}^{k}(x)=r_{i}(x-x_{i}^{k})+\frac{\epsilon_{i}}{2}\|x\|^{2} \forall k\geq 0, where r_{i}:\mathbb{R}^{d}\rightarrow\mathbb{R} \forall i\in\mathcal{V} can be any convex, smooth, and twice continuously differentiable functions and \epsilon_{i}>0 is such that \epsilon_{i}\geq\rho\lambda_{\max}(H). Then, DAMM reduces to a new distributed proximal algorithm, with the following local update of x_{i}^{k}:

x_{i}^{k+1}=\operatorname{\arg\;\min}_{x\in\mathbb{R}^{d}}~r_{i}(x-x_{i}^{k})+\frac{\epsilon_{i}}{2}\|x-x_{i}^{k}\|^{2}+h_{i}(x)+\langle x,q_{i}^{k}-\nabla r_{i}(\mathbf{0})+\nabla f_{i}(x_{i}^{k})+\rho\textstyle{\sum_{j\in\mathcal{N}_{i}\cup\{i\}}}p_{ij}x_{j}^{k}\rangle.
Example 2.

Suppose f_{i} \forall i\in\mathcal{V} are twice continuously differentiable. Since \nabla^{2}f_{i}(x)\preceq M_{i}I_{d} \forall x\in\mathbb{R}^{d}, we can let each \psi_{i}^{k}(x)=\frac{1}{2}x^{T}(\nabla^{2}f_{i}(x_{i}^{k})+\epsilon_{i}I_{d})x, where \epsilon_{i}>0 is such that \epsilon_{i}\geq\rho\lambda_{\max}(H). Then, the resulting DAMM is a new distributed second-order method, with the following local update of x_{i}^{k}:

x_{i}^{k+1}=\operatorname{\arg\;\min}_{x\in\mathbb{R}^{d}}~\frac{1}{2}\|x-x_{i}^{k}\|_{\nabla^{2}f_{i}(x_{i}^{k})+\epsilon_{i}I_{d}}^{2}+h_{i}(x)+\langle x,q_{i}^{k}+\nabla f_{i}(x_{i}^{k})+\rho\textstyle{\sum_{j\in\mathcal{N}_{i}\cup\{i\}}}p_{ij}x_{j}^{k}\rangle.

II-C Special Case: Smooth Problem

In this subsection, we focus on the smooth convex optimization problem \min_{x\in\mathbb{R}^{d}}\sum_{i\in\mathcal{V}}f_{i}(x), i.e., (1) with h_{i}(x)\equiv 0 \forall i\in\mathcal{V}. For this special case of (1), we provide additional designs of the surrogate function u^{k} in AMM, leading to a couple of variations of DAMM.

Here, we let u^{k} still be in the form of (15), but no longer require \phi^{k}+\frac{\rho}{2}\|\cdot\|_{H}^{2} to be a separable function as in (13). Instead, we construct \phi^{k} based upon another function \gamma^{k}:\mathbb{R}^{Nd}\rightarrow\mathbb{R} under Assumption 5.

Assumption 5.

The functions \gamma^{k} \forall k\geq 0 satisfy the following:

  1. (a)

    \gamma^{k} \forall k\geq 0 are twice continuously differentiable.

  2. (b)

    \gamma^{k} \forall k\geq 0 are strongly convex, whose convexity parameters are uniformly bounded from below by \underline{\gamma}>0.

  3. (c)

    \nabla\gamma^{k} \forall k\geq 0 are Lipschitz continuous, whose Lipschitz constants are uniformly bounded from above by \bar{\gamma}>0.

  4. (d)

    (\gamma^{k})^{\star}(\mathbf{x})-\frac{\rho}{2}\|\mathbf{x}\|_{H}^{2} \forall k\geq 0 are convex, where (\gamma^{k})^{\star}(\mathbf{x})=\sup_{\mathbf{y}\in\mathbb{R}^{Nd}}\langle\mathbf{x},\mathbf{y}\rangle-\gamma^{k}(\mathbf{y}) is the convex conjugate function of \gamma^{k}.

  5. (e)

    For any k\geq 0 and any \mathbf{x}\in\mathbb{R}^{Nd}, the ith d-dimensional block of \nabla\gamma^{k}(\mathbf{x}), denoted by \nabla_{i}\gamma^{k}(\mathbf{x}), is independent of x_{j} \forall j\notin\mathcal{N}_{i}\cup\{i\}.

From Assumption 5(b)(c), each (\gamma^{k})^{\star} is (1/\bar{\gamma})-strongly convex and (1/\underline{\gamma})-smooth [41], so that Assumption 5(d) holds as long as I_{Nd}/\bar{\gamma}-\rho H\succeq\mathbf{O}. Now we set

\phi^{k}(\mathbf{x})=(\gamma^{k})^{\star}(\mathbf{x})-\frac{\rho}{2}\|\mathbf{x}\|_{H}^{2},\quad\forall k\geq 0. (18)

Unlike DAMM, u^{k} given by (15) and (18) under Assumption 5 does not necessarily guarantee that u^{k}(\cdot)+\frac{\rho}{2}\|\cdot\|_{H}^{2} is separable. Below, we show that such u^{k} \forall k\geq 0 also satisfy Assumption 2, leading to another subclass of AMM.

To do so, first notice that the strong convexity and smoothness of (\gamma^{k})^{\star} guarantee Assumption 2(b)(c). Also note from (16) that Assumption 2(d) is assured. In addition, due to Assumption 5(d), \phi^{k} is convex and, thus, so is u^{k}. To show the twice continuous differentiability of u^{k} in Assumption 2(a), consider the fact from [42] that due to Assumption 5(b)(c), \nabla\gamma^{k} is invertible and its inverse function is (\nabla\gamma^{k})^{-1}=\nabla(\gamma^{k})^{\star}. This, along with (18), implies that

\nabla\phi^{k}(\mathbf{x})=(\nabla\gamma^{k})^{-1}(\mathbf{x})-\rho H\mathbf{x}. (19)

From the inverse function theorem [43], (\nabla\gamma^{k})^{-1} is continuously differentiable, so that \phi^{k} and therefore u^{k} given by (15) and (18) are twice continuously differentiable. We then conclude that Assumption 2 holds.

Equipped with (15), (18), and Assumption 5, we start to derive additional distributed realizations of AMM (9)–(11) for minimizing \sum_{i\in\mathcal{V}}f_{i}(x). As h_{i}(x)\equiv 0 \forall i\in\mathcal{V}, (9) is equivalent to

\nabla\phi^{k}(\mathbf{x}^{k+1})-\nabla\phi^{k}(\mathbf{x}^{k})+\nabla f(\mathbf{x}^{k})+\rho H\mathbf{x}^{k+1}+\mathbf{q}^{k}=\mathbf{0}. (20)

This, together with (19), gives

\mathbf{x}^{k+1}=\nabla\gamma^{k}(\nabla\phi^{k}(\mathbf{x}^{k})-\nabla f(\mathbf{x}^{k})-\mathbf{q}^{k}). (21)

Like DAMM, we let each node i\in\mathcal{V} maintain the ith d-dimensional blocks of \mathbf{x}^{k} and \mathbf{q}^{k}, i.e., x_{i}^{k} and q_{i}^{k}, where the update of q_{i}^{k} is the same as (12). According to (21), the update of x_{i}^{k} is given by x_{i}^{k+1}=\nabla_{i}\gamma^{k}(\nabla\phi^{k}(\mathbf{x}^{k})-\nabla f(\mathbf{x}^{k})-\mathbf{q}^{k}). From Assumption 5(e), this can be computed by node i provided that it has access to \nabla_{j}\phi^{k}(\mathbf{x}^{k})-\nabla f_{j}(x_{j}^{k})-q_{j}^{k} \forall j\in\mathcal{N}_{i}\cup\{i\}, where \nabla_{j}\phi^{k}(\mathbf{x}^{k})\in\mathbb{R}^{d} denotes the jth block of \nabla\phi^{k}(\mathbf{x}^{k}). Therefore, the remaining question is whether each node i\in\mathcal{V} is able to locally evaluate \nabla_{i}\phi^{k}(\mathbf{x}^{k}). In fact, this can be enabled by the following two concrete ways of constructing \gamma^{k}.

Way #1: Let \gamma^{k}=\gamma \forall k\geq 0 for some \gamma:\mathbb{R}^{Nd}\rightarrow\mathbb{R} satisfying Assumption 5. Then, according to (18), \phi^{k} \forall k\geq 0 are fixed to some \phi:\mathbb{R}^{Nd}\rightarrow\mathbb{R}. We introduce an auxiliary variable \mathbf{y}^{k}=((y_{1}^{k})^{T},\ldots,(y_{N}^{k})^{T})^{T} such that \mathbf{y}^{k}=\nabla\phi(\mathbf{x}^{k}) \forall k\geq 0. From (20), \mathbf{y}^{k} satisfies the recursive relation

\mathbf{y}^{k+1}=\mathbf{y}^{k}-\nabla f(\mathbf{x}^{k})-\mathbf{q}^{k}-\rho H\mathbf{x}^{k+1}.

Due to the structure of H in Assumption 3, each node i\in\mathcal{V} is able to locally update y_{i}^{k+1} as above. Nevertheless, the initialization \mathbf{y}^{0}=\nabla\phi(\mathbf{x}^{0})=(\nabla\gamma)^{-1}(\mathbf{x}^{0})-\rho H\mathbf{x}^{0} cannot be achieved without centralized coordination, given that \mathbf{x}^{0} is arbitrarily chosen. We may overcome this by imposing a restriction on \mathbf{x}^{0} as follows: Let each node i\in\mathcal{V} pick any \tilde{z}_{i}\in\mathbb{R}^{d}, and force x_{i}^{0}=\nabla_{i}\gamma(\tilde{\mathbf{z}}) with \tilde{\mathbf{z}}=(\tilde{z}_{1}^{T},\ldots,\tilde{z}_{N}^{T})^{T}, so that \tilde{\mathbf{z}}=(\nabla\gamma)^{-1}(\mathbf{x}^{0}). Then, by letting each y_{i}^{0}=\tilde{z}_{i}-\rho\sum_{j\in\mathcal{N}_{i}\cup\{i\}}p_{ij}x_{j}^{0}, we obtain \mathbf{y}^{0}=\nabla\phi(\mathbf{x}^{0}). Due to Assumption 5(e), such initialization is decentralized, which only requires each node i to share \tilde{z}_{i} and x_{i}^{0} with its neighbors.

Incorporating the above into (21) yields another distributed version of AMM, which is called Distributed Approximate Method of Multipliers for Smooth problems with Constant update functions (DAMM-SC) and is specified in Algorithm 2.

Algorithm 2 DAMM-SC
1:  Initialization:
2:  Each node i\in\mathcal{V} selects q_{i}^{0}\in\mathbb{R}^{d} such that \sum_{i\in\mathcal{V}}q_{i}^{0}=\mathbf{0} (or simply sets q_{i}^{0}=\mathbf{0}).
3:  Each node i\in\mathcal{V} arbitrarily chooses \tilde{z}_{i}\in\mathbb{R}^{d} and sends \tilde{z}_{i} to every neighbor j\in\mathcal{N}_{i}.
4:  Each node i\in\mathcal{V} sets x_{i}^{0}=\nabla_{i}\gamma(\tilde{\mathbf{z}}) (which only depends on \tilde{z}_{j} \forall j\in\mathcal{N}_{i}\cup\{i\}) and sends it to every neighbor j\in\mathcal{N}_{i}.
5:  Each node i\in\mathcal{V} sets y_{i}^{0}=\tilde{z}_{i}-\rho\sum_{j\in\mathcal{N}_{i}\cup\{i\}}p_{ij}x_{j}^{0}.
6:  Each node i\in\mathcal{V} sends y_{i}^{0}-\nabla f_{i}(x_{i}^{0})-q_{i}^{0} to every neighbor j\in\mathcal{N}_{i}.
7:  for k\geq 0 do
8:     Each node i\in\mathcal{V} computes x_{i}^{k+1}=\nabla_{i}\gamma(\mathbf{y}^{k}-\nabla f(\mathbf{x}^{k})-\mathbf{q}^{k}) (which only depends on y_{j}^{k}-\nabla f_{j}(x_{j}^{k})-q_{j}^{k} \forall j\in\mathcal{N}_{i}\cup\{i\}).
9:     Each node i\in\mathcal{V} sends x_{i}^{k+1} to every neighbor j\in\mathcal{N}_{i}.
10:     Each node i\in\mathcal{V} computes y_{i}^{k+1}=y_{i}^{k}-\nabla f_{i}(x_{i}^{k})-q_{i}^{k}-\rho\sum_{j\in\mathcal{N}_{i}\cup\{i\}}p_{ij}x_{j}^{k+1}.
11:     Each node i\in\mathcal{V} computes q_{i}^{k+1}=q_{i}^{k}+\rho\sum_{j\in\mathcal{N}_{i}\cup\{i\}}\tilde{p}_{ij}x_{j}^{k+1}.
12:     Each node i\in\mathcal{V} sends y_{i}^{k+1}-\nabla f_{i}(x_{i}^{k+1})-q_{i}^{k+1} to every neighbor j\in\mathcal{N}_{i}.
13:  end for

Example 3.

We can further reduce the communication cost of DAMM-SC by restricting \nabla_{i}\gamma to depend solely on node i. For example, let g_{i}:\mathbb{R}^{d}\rightarrow\mathbb{R} \forall i\in\mathcal{V} be arbitrary twice continuously differentiable, smooth, convex functions and let g(\mathbf{x})=\sum_{i\in\mathcal{V}}g_{i}(x_{i}). Then, we may choose \gamma(\mathbf{x})=g(\mathbf{x})+\frac{\epsilon}{2}\|\mathbf{x}\|^{2} with \epsilon>0. It is straightforward to see that Assumption 5(a)(b)(c) hold and Assumption 5(d) can be satisfied with proper \rho,\epsilon>0. For such a choice of \gamma, \nabla_{i}\gamma(\mathbf{x})=\nabla g_{i}(x_{i})+\epsilon x_{i}, which is determined by node i itself and hence meets Assumption 5(e). Thus, the primal update of DAMM-SC, i.e., Line 8 of Algorithm 2, is simplified to x_{i}^{k+1}=\nabla g_{i}(y_{i}^{k}-\nabla f_{i}(x_{i}^{k})-q_{i}^{k})+\epsilon(y_{i}^{k}-\nabla f_{i}(x_{i}^{k})-q_{i}^{k}), so that each node i only needs to share x_{i}^{k+1} with its neighbors during each iteration and the local communications in Line 6 and Line 12 of Algorithm 2 are eliminated.

Way #2: For each k\geq 0, let \gamma^{k}(\mathbf{x})=\frac{1}{2}\mathbf{x}^{T}G^{k}\mathbf{x}, where G^{k}=(G^{k})^{T}\in\mathbb{R}^{Nd\times Nd}. Suppose there exist \bar{\gamma}\geq\underline{\gamma}>0 such that \underline{\gamma}I_{Nd}\preceq G^{k}\preceq\bar{\gamma}I_{Nd} \forall k\geq 0, and also suppose (G^{k})^{-1}\succeq\rho H \forall k\geq 0. Moreover, we let each G^{k} have a neighbor-sparse structure, i.e., the d\times d (i,j)-block of G^{k}, denoted by G_{ij}^{k}, is equal to \mathbf{O}\in\mathbb{R}^{d\times d} for all j\notin\mathcal{N}_{i}\cup\{i\}. It can be shown that such quadratic \gamma^{k}'s satisfy Assumption 5. Substituting \gamma^{k}(\mathbf{x})=\frac{1}{2}\mathbf{x}^{T}G^{k}\mathbf{x} into (21) gives

\mathbf{x}^{k+1}=\mathbf{x}^{k}-G^{k}(\rho H\mathbf{x}^{k}+\nabla f(\mathbf{x}^{k})+\mathbf{q}^{k}), (22)

which can be executed in a distributed manner due to the neighbor-sparse structure of G^{k}. This leads to an additional distributed version of AMM, which is referred to as Distributed Approximate Method of Multipliers for Smooth problems with Quadratic update functions (DAMM-SQ) and is elaborated in Algorithm 3, where we introduce, for convenience, an auxiliary variable \mathbf{z}^{k}=((z_{1}^{k})^{T},\ldots,(z_{N}^{k})^{T})^{T}=\rho H\mathbf{x}^{k}+\nabla f(\mathbf{x}^{k})+\mathbf{q}^{k}.

Algorithm 3 DAMM-SQ
1:  Initialization:
2:  Same as Lines 2–3 in Algorithm 1.
3:  Each node i\in\mathcal{V} computes z_{i}^{0}=\nabla f_{i}(x_{i}^{0})+q_{i}^{0}+\rho\sum_{j\in\mathcal{N}_{i}\cup\{i\}}p_{ij}x_{j}^{0} and sends it to every neighbor j\in\mathcal{N}_{i}.
4:  for k\geq 0 do
5:     Each node i\in\mathcal{V} computes x_{i}^{k+1}=x_{i}^{k}-\sum_{j\in\mathcal{N}_{i}\cup\{i\}}G_{ij}^{k}z_{j}^{k}.
6:     Each node i\in\mathcal{V} sends x_{i}^{k+1} to every neighbor j\in\mathcal{N}_{i}.
7:     Each node i\in\mathcal{V} computes q_{i}^{k+1}=q_{i}^{k}+\rho\sum_{j\in\mathcal{N}_{i}\cup\{i\}}\tilde{p}_{ij}x_{j}^{k+1}.
8:     Each node i\in\mathcal{V} computes z_{i}^{k+1}=\nabla f_{i}(x_{i}^{k+1})+q_{i}^{k+1}+\rho\sum_{j\in\mathcal{N}_{i}\cup\{i\}}p_{ij}x_{j}^{k+1}.
9:     Each node i\in\mathcal{V} sends z_{i}^{k+1} to every neighbor j\in\mathcal{N}_{i}.
10:  end for
Example 4.

We present a new distributed gradient-tracking algorithm as an example of DAMM-SQ, where G^{k}=\frac{1}{\rho}(I_{N}-P)\otimes I_{d} \forall k\geq 0 with P=\tilde{P}\prec I_{N} under Assumption 3. Note that P=\tilde{P}\prec I_{N} can be satisfied by letting I_{N}-P=I_{N}-\tilde{P} be strictly diagonally dominant with positive diagonal entries. Similar to the discussions below Assumption 3, such P,\tilde{P} and therefore G^{k} can be locally determined by the nodes. Moreover, since (I_{N}-P)^{-1}\succeq I_{N}\succ P, (G^{k})^{-1}=\rho(I_{N}-P)^{-1}\otimes I_{d}\succ\rho H. Also, it can be verified that all the other conditions on G^{k} required by DAMM-SQ hold. With this particular G^{k}, (22) becomes

\mathbf{x}^{k+1}=((P+(I_{N}-P)^{2})\otimes I_{d})\mathbf{x}^{k}-\frac{1}{\rho}((I_{N}-P)\otimes I_{d})(\nabla f(\mathbf{x}^{k})+\mathbf{q}^{k}).

Since P\mathbf{1}=\mathbf{0} and (\mathbf{1}^{T}\otimes I_{d})\mathbf{q}^{k}=\mathbf{0}, this indicates \frac{1}{N}\sum_{i\in\mathcal{V}}x_{i}^{k+1}=\frac{1}{N}\sum_{i\in\mathcal{V}}x_{i}^{k}-\frac{1}{\rho}\bigl(\frac{1}{N}\sum_{i\in\mathcal{V}}\nabla f_{i}(x_{i}^{k})\bigr). It can thus be seen that this example of DAMM-SQ tracks the average of all the local gradients so as to imitate the behavior of the (centralized) gradient descent method.
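The toy numerical sketch below (scalar decision variable, quadratic f_{i}(x)=\frac{a_{i}}{2}(x-b_{i})^{2}, a ring graph, and illustrative parameter values that are not tuned to the conditions of Section IV) runs this gradient-tracking instance of DAMM-SQ and also tracks how closely the average-iterate relation displayed above is satisfied along the iterations.

import numpy as np

rng = np.random.default_rng(2)
N = 10
L = 2.0*np.eye(N)                                  # ring-graph Laplacian
for i in range(N):
    L[i, (i+1) % N] = L[i, (i-1) % N] = -1.0
P = 0.2*L                                          # P = P~; I - P strictly diagonally dominant
rho = 20.0                                         # illustrative penalty parameter
G = (np.eye(N) - P)/rho                            # G^k = (1/rho)(I_N - P), neighbor-sparse

a = rng.uniform(1.0, 2.0, size=N)                  # f_i(x) = (a_i/2)(x - b_i)^2
b = rng.normal(size=N)
x_opt = (a @ b)/a.sum()                            # minimizer of sum_i f_i

x = rng.normal(size=N); q = np.zeros(N)
avg_gap = 0.0
for k in range(3000):
    z = rho*(P @ x) + a*(x - b) + q                # z^k = rho*H x^k + grad f(x^k) + q^k
    x_new = x - G @ z                              # primal update (22)
    # deviation from: mean(x^{k+1}) = mean(x^k) - (1/rho)*mean(grad f(x^k))
    avg_gap = max(avg_gap, abs(x_new.mean() - (x.mean() - (a*(x - b)).mean()/rho)))
    q = q + rho*(P @ x_new)                        # dual update (10) with P~ = P
    x = x_new
print("max violation of the average-iterate relation:", avg_gap)
print("consensus error:", np.abs(x - x.mean()).max())
print("distance of the average to the optimum:", abs(x.mean() - x_opt))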

III Existing Algorithms as Specializations

This section exemplifies that AMM and its distributed realizations generalize a variety of existing distributed optimization methods originally developed in different ways.

III-A Specializations of DAMM-SQ

DAMM-SQ described in Algorithm 3 generalizes several distributed first-order and second-order algorithms for solving problem (1) with h_{i}\equiv 0 \forall i\in\mathcal{V}, including EXTRA [5], ID-FBBS [13], and DQM [19].

III-A1 EXTRA

EXTRA [5] is a well-known first-order algorithm developed from a decentralized gradient descent method. From [5, Eq. (3.5)], EXTRA can be expressed as

\mathbf{x}^{k+1}=(\tilde{W}\otimes I_{d})\mathbf{x}^{k}-\alpha\nabla f(\mathbf{x}^{k})+\sum_{t=0}^{k}((W-\tilde{W})\otimes I_{d})\mathbf{x}^{t}, (23)

where \mathbf{x}^{0} is arbitrarily given, \alpha>0, and W,\tilde{W}\in\mathbb{R}^{N\times N} are two average matrices associated with \mathcal{G}. (We say W\in\mathbb{R}^{N\times N} is an average matrix associated with \mathcal{G} if W=W^{T}, W\mathbf{1}=\mathbf{1}, \|W-\mathbf{1}\mathbf{1}^{T}/N\|<1, and [W]_{ij}=0 \forall i\in\mathcal{V} \forall j\notin\mathcal{N}_{i}\cup\{i\}; it can be shown that -I_{N}\prec W\preceq I_{N}.) By letting \mathbf{q}^{k}=\frac{1}{\alpha}\sum_{t=0}^{k}((\tilde{W}-W)\otimes I_{d})\mathbf{x}^{t}, (23) becomes

\mathbf{x}^{k+1}=(\tilde{W}\otimes I_{d})\mathbf{x}^{k}-\alpha\nabla f(\mathbf{x}^{k})-\alpha\mathbf{q}^{k}, (24)
\mathbf{q}^{k+1}=\mathbf{q}^{k}+\frac{1}{\alpha}((\tilde{W}-W)\otimes I_{d})\mathbf{x}^{k+1}. (25)

This is in the form of DAMM-SQ with \rho=1/\alpha, \tilde{P}=\tilde{W}-W, P=I_{N}-\tilde{W}, and G^{k}=\alpha I_{Nd}. As [5] assumes \tilde{W}\succeq W and \operatorname{Null}(\tilde{W}-W)=\operatorname{span}(\mathbf{1}), Assumption 3 and (11) are guaranteed. Besides, [5] assumes \tilde{W}\succ\mathbf{O}, so that (G^{k})^{-1}=\rho I_{Nd}\succeq\rho(I_{N}-\tilde{W})\otimes I_{d}=\rho H. It is then straightforward to see that this particular G^{k} satisfies all the conditions in Section II-C.
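As a quick sanity check of this reformulation, the short Python script below (toy scalar quadratic f_{i}; the specific W, \tilde{W}=(I_{N}+W)/2, and \alpha are illustrative choices) runs the original EXTRA recursion (23) and the DAMM-SQ form (24)–(25) side by side, with \mathbf{q}^{0}=\frac{1}{\alpha}(\tilde{W}-W)\mathbf{x}^{0} as in the change of variable above, and confirms that the two recursions generate identical iterates.

import numpy as np

rng = np.random.default_rng(3)
N, alpha = 5, 0.05
L = 2.0*np.eye(N)
for i in range(N):
    L[i, (i+1) % N] = L[i, (i-1) % N] = -1.0
W = np.eye(N) - 0.25*L                             # an average matrix for a ring graph
Wt = 0.5*(np.eye(N) + W)                           # W~ = (I+W)/2, the usual EXTRA choice
a = rng.uniform(1.0, 2.0, size=N); b = rng.normal(size=N)
grad = lambda x: a*(x - b)                         # f_i(x) = (a_i/2)(x - b_i)^2
x0 = rng.normal(size=N)

# original EXTRA form (23)
x_e = x0.copy(); hist = [x0.copy()]
for k in range(200):
    x_new = Wt @ x_e - alpha*grad(x_e) + (W - Wt) @ sum(hist)
    hist.append(x_new); x_e = x_new

# DAMM-SQ form (24)-(25) with q^0 = (1/alpha)(W~ - W) x^0
x_d = x0.copy(); q = (Wt - W) @ x0 / alpha
for k in range(200):
    x_d = Wt @ x_d - alpha*grad(x_d) - alpha*q
    q = q + (Wt - W) @ x_d / alpha

print("max difference between the two forms:", np.abs(x_e - x_d).max())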

III-A2 ID-FBBS

ID-FBBS [13] takes the form of (24)–(25), except that W=2\tilde{W}-I_{N} and \mathbf{q}^{0} can be any vector in S^{\perp}. Since [13] also assumes \tilde{W}\succ\mathbf{O}, it follows from the analysis in Section III-A1 that ID-FBBS is a particular example of DAMM-SQ, where \rho=1/\alpha, \tilde{P}=P=I_{N}-\tilde{W}, and G^{k}=\alpha I_{Nd}, with Assumption 3 and all the conditions on G^{k} in Section II-C satisfied.

III-A3 DQM

DQM [19] is a distributed second-order method for solving problem (1) with strongly convex, smooth, and twice continuously differentiable f_{i}'s and zero h_{i}'s. DQM takes the following form: x_{i}^{k+1}=x_{i}^{k}-(2c|\mathcal{N}_{i}|I_{d}+\nabla^{2}f_{i}(x_{i}^{k}))^{-1}(c\sum_{j\in\mathcal{N}_{i}}(x_{i}^{k}-x_{j}^{k})+\nabla f_{i}(x_{i}^{k})+q_{i}^{k}) and q_{i}^{k+1}=q_{i}^{k}+c\sum_{j\in\mathcal{N}_{i}}(x_{i}^{k+1}-x_{j}^{k+1}), where x_{i}^{0} \forall i\in\mathcal{V} are arbitrarily given, q_{i}^{0} \forall i\in\mathcal{V} satisfy \sum_{i\in\mathcal{V}}q_{i}^{0}=\mathbf{0}, and c>0. Observe that DAMM-SQ reduces to DQM if we set G^{k}=(2c\operatorname{diag}(|\mathcal{N}_{1}|,\ldots,|\mathcal{N}_{N}|)\otimes I_{d}+\nabla^{2}f(\mathbf{x}^{k}))^{-1}, \rho=c, p_{ij}=\tilde{p}_{ij}=-1 \forall\{i,j\}\in\mathcal{E}, and p_{ii}=\tilde{p}_{ii}=|\mathcal{N}_{i}| \forall i\in\mathcal{V}. Clearly, P and \tilde{P} satisfy Assumption 3. Additionally, (G^{k})^{-1}\succeq 2c\operatorname{diag}(|\mathcal{N}_{1}|,\ldots,|\mathcal{N}_{N}|)\otimes I_{d}\succeq\rho P\otimes I_{d}=\rho H, and G^{k} meets all the other requirements in Section II-C.

III-B Specializations of DAMM

A number of distributed algorithms for composite or nonsmooth convex optimization can be cast into the form of DAMM described in Algorithm 1, including PGC [16], PG-EXTRA [12], DPGA [11], a decentralized ADMM in [4], and D-FBBS [13].

III-B1 PGC and PG-EXTRA

PGC [16] and PG-EXTRA [12] are two recently proposed distributed methods for solving problem (1), where PGC is constructed upon ADMM [35] and PG-EXTRA is an extension of EXTRA [5] to address (1) with nonzero h_{i}'s. According to [16, Section V-D], PGC can be described by \mathbf{q}^{0}\in S^{\perp}, \mathbf{q}^{k}=\mathbf{q}^{0}+\sum_{\ell=1}^{k}((\Lambda_{\beta}(\tilde{W}-W))\otimes I_{d})\mathbf{x}^{\ell} \forall k\geq 1, and \mathbf{x}^{k+1}=\operatorname{\arg\;\min}_{\mathbf{x}\in\mathbb{R}^{Nd}}h(\mathbf{x})+\frac{1}{2}\|\mathbf{x}-(\tilde{W}\otimes I_{d})\mathbf{x}^{k}\|_{\Lambda_{\beta}\otimes I_{d}}^{2}+\langle\nabla f(\mathbf{x}^{k})+\mathbf{q}^{k},\mathbf{x}\rangle \forall k\geq 0, where \mathbf{x}^{0} is arbitrarily given and the parameters are chosen as follows: Let \Lambda_{\beta}=\operatorname{diag}(\beta_{1},\ldots,\beta_{N}) be a positive definite diagonal matrix and W,\tilde{W}\in\mathbb{R}^{N\times N} be two row-stochastic matrices such that \Lambda_{\beta}W and \Lambda_{\beta}\tilde{W} are symmetric, [W]_{ij},[\tilde{W}]_{ij}>0 \forall j\in\mathcal{N}_{i}\cup\{i\}, and [W]_{ij}=[\tilde{W}]_{ij}=0 otherwise. To cast PGC in the form of DAMM, let \rho=1, \tilde{P}=\Lambda_{\beta}(\tilde{W}-W), and P=\Lambda_{\beta}(I_{N}-\tilde{W}). Then, starting from any \mathbf{q}^{0}\in S^{\perp}, the updates of PGC can be expressed as

\mathbf{x}^{k+1}=\operatorname{\arg\;\min}_{\mathbf{x}\in\mathbb{R}^{Nd}}h(\mathbf{x})+\frac{1}{2}\|\mathbf{x}-\mathbf{x}^{k}\|_{\Lambda_{\beta}\otimes I_{d}}^{2}+\langle\nabla f(\mathbf{x}^{k})+\mathbf{q}^{k}+\rho(P\otimes I_{d})\mathbf{x}^{k},\mathbf{x}\rangle, (26)
\mathbf{q}^{k+1}=\mathbf{q}^{k}+\rho(\tilde{P}\otimes I_{d})\mathbf{x}^{k+1}. (27)

This means that PGC is exactly in the form of DAMM with \psi_{i}^{k}(\cdot)=\frac{\beta_{i}}{2}\|\cdot\|^{2}. Note that \Lambda_{\beta}\succeq\Lambda_{\beta}\tilde{W} and \operatorname{Null}(\Lambda_{\beta}(I_{N}-\tilde{W}))=\operatorname{span}(\mathbf{1}). In addition, [16] assumes \Lambda_{\beta}\tilde{W}\succeq\Lambda_{\beta}W, \Lambda_{\beta}\tilde{W}\succeq\mathbf{O}, and \operatorname{Null}(\Lambda_{\beta}(\tilde{W}-W))=\operatorname{span}(\mathbf{1}). Consequently, Assumption 3 and Assumption 4 hold. PG-EXTRA can also be described by (26)–(27) with \beta_{i}=\rho>0 \forall i\in\mathcal{V} and \mathbf{q}^{0}=\rho(\tilde{P}\otimes I_{d})\mathbf{x}^{0}, i.e., it is a special form of PGC. Therefore, DAMM generalizes both PGC and PG-EXTRA.

III-B2 DPGA and a Decentralized ADMM

DPGA [11] is a distributed proximal gradient method and has the following form: Given arbitrary x_{i}^{0} and q_{i}^{0}=\mathbf{0},

x_{i}^{k+1}=\operatorname{\arg\;\min}_{x\in\mathbb{R}^{d}}~h_{i}(x)+\frac{1}{2c_{i}}\|x-x_{i}^{k}\|^{2}+\langle\nabla f_{i}(x_{i}^{k})+q_{i}^{k}+\textstyle{\sum_{j\in\mathcal{N}_{i}\cup\{i\}}}\Gamma_{ij}x_{j}^{k},x\rangle,
q_{i}^{k+1}=q_{i}^{k}+\textstyle{\sum_{j\in\mathcal{N}_{i}\cup\{i\}}}\Gamma_{ij}x_{j}^{k+1},\quad\forall i\in\mathcal{V},

where c_{i}>0 \forall i\in\mathcal{V}, \Gamma_{ij}=\Gamma_{ji}<0 \forall\{i,j\}\in\mathcal{E}, and \Gamma_{ii}=-\sum_{j\in\mathcal{N}_{i}}\Gamma_{ij} \forall i\in\mathcal{V}. The above update equations of DPGA are equivalent to those of DAMM with \psi_{i}^{k}(\cdot)=\frac{1}{2c_{i}}\|\cdot\|^{2}, \rho=1, and P,\tilde{P} such that \tilde{p}_{ij},p_{ij} are equal to \Gamma_{ij} if \{i,j\}\in\mathcal{E} or i=j and are equal to 0 otherwise. Apparently, P and \tilde{P} satisfy Assumption 3. Furthermore, due to the conditions on c_{i} in [11], it is guaranteed that \sum_{i\in\mathcal{V}}\psi_{i}^{k}(x_{i})-\frac{\rho}{2}\|\mathbf{x}\|_{H}^{2} is convex and, thus, Assumption 4 holds. The decentralized ADMM in [4] for solving strongly convex and smooth problems can be viewed as a special case of DPGA with c_{i}=\frac{1}{2c|\mathcal{N}_{i}|} \forall i\in\mathcal{V} for some c>0 and \Gamma_{ij}=\Gamma_{ji}=-c \forall\{i,j\}\in\mathcal{E}, so that it is also a specialization of DAMM.

III-B3 D-FBBS

D-FBBS [13] addresses problem (1) with f_{i}\equiv 0 \forall i\in\mathcal{V} and is designed based on a Bregman method and an operator splitting technique. D-FBBS is described by (26)–(27) with \beta_{i}=\rho>0 \forall i\in\mathcal{V}, \tilde{P}=P=I_{N}-W for an average matrix W\in\mathbb{R}^{N\times N} associated with \mathcal{G}, \mathbf{x}^{0} being arbitrary, and \mathbf{q}^{0}\in S^{\perp}. Clearly, P,\tilde{P} satisfy Assumption 3. Now let \psi_{i}^{k}(\cdot)=\frac{\rho}{2}\|\cdot\|^{2}, which satisfies Assumption 4 because [13] assumes W\succ\mathbf{O}. Therefore, we conclude that D-FBBS can be specialized from DAMM.

III-C Specializations of AMM

Since DAMM-SQ and DAMM are subsets of AMM, the algorithms in Section III-A and Section III-B are also specializations of AMM. Below, we present some other methods [9, 14, 26, 18] that can be specialized from AMM but do not belong to DAMM, DAMM-SC, or DAMM-SQ.

III-C1 A Distributed ADMM

In [9], a distributed ADMM is proposed to solve (1) with f_{i}\equiv 0 \forall i\in\mathcal{V}:

\mathbf{x}^{k+1}=\operatorname{\arg\;\min}_{\mathbf{x}\in\mathbb{R}^{Nd}}h(\mathbf{x})+\langle Q^{T}\mathbf{w}^{k},\mathbf{x}\rangle+c\langle Q^{T}(\Lambda^{-1}\otimes I_{d})Q\mathbf{x}^{k},\mathbf{x}\rangle+\frac{c}{2}\|\mathbf{x}-\mathbf{x}^{k}\|_{\tilde{Q}}^{2},
\mathbf{w}^{k+1}=\mathbf{w}^{k}+c(\Lambda^{-1}\otimes I_{d})Q\mathbf{x}^{k+1},

where 𝐱0Nd\mathbf{x}^{0}\in\mathbb{R}^{Nd} is arbitrarily given and 𝐰0=𝟎\mathbf{w}^{0}=\mathbf{0}. In the above, c>0c>0, Λ=diag(|𝒩1|+1,,|𝒩N|+1)\Lambda=\operatorname{diag}(|\mathcal{N}_{1}|+1,\ldots,|\mathcal{N}_{N}|+1), and Q=ΓIdQ=\Gamma\otimes I_{d} with ΓN×N\Gamma\in\mathbb{R}^{N\times N} satisfying [Γ]ij=0[\Gamma]_{ij}=0 i𝒱\forall i\in\mathcal{V} j𝒩i{i}\forall j\notin\mathcal{N}_{i}\cup\{i\} and Null(ΓTΛ1Γ)\operatorname{Null}(\Gamma^{T}\Lambda^{-1}\Gamma) =span(𝟏)=\operatorname{span}(\mathbf{1}). Additionally, Q~=diag(Q1TQ1,,QNTQN)Nd×Nd\tilde{Q}=\operatorname{diag}(Q_{1}^{T}Q_{1},\ldots,Q_{N}^{T}Q_{N})\in\mathbb{R}^{Nd\times Nd}, where QiNd×dQ_{i}\in\mathbb{R}^{Nd\times d} i𝒱\forall i\in\mathcal{V} are such that Q=(Q1,,QN)Q=(Q_{1},\ldots,Q_{N}). It is shown in [9] that Q~QT(Λ1Id)Q\tilde{Q}\succeq Q^{T}(\Lambda^{-1}\!\otimes\!I_{d})Q. We may view this algorithm as a specialization from AMM (9)–(11), in which 𝐪k=QT𝐰k\mathbf{q}^{k}=Q^{T}\mathbf{w}^{k}, H=H~=QT(Λ1Id)QH=\tilde{H}=Q^{T}(\Lambda^{-1}\otimes I_{d})Q, ρ=c\rho=c, and uk(𝐱)=ρ2(𝐱𝐱k)T(Q~H)(𝐱𝐱k)u^{k}(\mathbf{x})=\frac{\rho}{2}(\mathbf{x}-\mathbf{x}^{k})^{T}(\tilde{Q}-H)(\mathbf{x}-\mathbf{x}^{k}). Apparently, H,H~H,\tilde{H} are symmetric and positive semidefinite, and Null(H)=Null(H~)=S\operatorname{Null}(H)=\operatorname{Null}(\tilde{H})=S, as required in Section II-A. Also, since each QiQ_{i} has full column rank, Q~𝐎\tilde{Q}\succ\mathbf{O}. This guarantees Assumption 2.

III-C2 A Distributed Primal-Dual Algorithm

In [14], a distributed primal-dual algorithm is developed to solve (1) with each hih_{i} equal to the indicator function Xi\mathcal{I}_{X_{i}} with respect to a closed convex set XiX_{i}, and takes the following form:

𝐱k+1=PX[𝐱kα(f(𝐱k)+(ΓId)𝐰k+(ΓId)𝐱k)],\displaystyle\!\!\mathbf{x}^{k+1}\!=\!P_{X}[\mathbf{x}^{k}\!\!-\!\alpha(\nabla f(\mathbf{x}^{k})\!+\!(\Gamma\otimes I_{d})\mathbf{w}^{k}\!\!+\!\!(\Gamma\otimes I_{d})\mathbf{x}^{k})], (28)
𝐰k+1=𝐰k+α(ΓId)𝐱k,\displaystyle\!\!\mathbf{w}^{k+1}\!=\!\mathbf{w}^{k}+\alpha(\Gamma\otimes I_{d})\mathbf{x}^{k}, (29)

where 𝐱0,𝐰0\mathbf{x}^{0},\mathbf{w}^{0} are arbitrarily given, X=X1××XNX=X_{1}\times\cdots\times X_{N}, and PX[]P_{X}[\cdot] is the projection onto XX. Besides, ΓN×N\Gamma\in\mathbb{R}^{N\times N} is a symmetric positive semidefinite matrix satisfying [Γ]ij=0[\Gamma]_{ij}=0 i𝒱\forall i\in\mathcal{V} j𝒩i{i}\forall j\notin\mathcal{N}_{i}\cup\{i\} and Null(Γ)=span(𝟏)\operatorname{Null}(\Gamma)=\operatorname{span}(\mathbf{1}), and 0<α12Γ0<\alpha\leq\frac{1}{2\|\Gamma\|}. To see how this algorithm relates to AMM, we use (29) to rewrite (28) as

𝐱k+1=\displaystyle\mathbf{x}^{k+1}= argmin𝐱Ndh(𝐱)\displaystyle\operatorname{\arg\;\min}_{\mathbf{x}\in\mathbb{R}^{Nd}}~{}h(\mathbf{x})
+12α𝐱(𝐱kα(f(𝐱k)+(ΓId)𝐰k+1\displaystyle+\frac{1}{2\alpha}\|\mathbf{x}-\Bigl{(}\mathbf{x}^{k}-\alpha\bigl{(}\nabla f(\mathbf{x}^{k})+(\Gamma\otimes I_{d})\mathbf{w}^{k+1}
+((ΓαΓ2)Id)𝐱k))2,\displaystyle+((\Gamma-\alpha\Gamma^{2})\otimes I_{d})\mathbf{x}^{k}\bigr{)}\Bigr{)}\|^{2}, (30)

where h(𝐱)=X(𝐱)h(\mathbf{x})=\mathcal{I}_{X}(\mathbf{x}). It can be seen that (29)–(30) are equivalent to AMM (4)–(5) with 𝐯k=𝐰k+1\mathbf{v}^{k}=\mathbf{w}^{k+1}, ρ=α\rho=\alpha, H~=Γ2Id\tilde{H}=\Gamma^{2}\otimes I_{d}, H=(Γ/αΓ2)IdH=(\Gamma/\alpha-\Gamma^{2})\otimes I_{d}, and uk(𝐱)=f(𝐱k),𝐱+12α𝐱𝐱kINdα(ΓαΓ2)Id2u^{k}(\mathbf{x})=\langle\nabla f(\mathbf{x}^{k}),\mathbf{x}\rangle+\frac{1}{2\alpha}\|\mathbf{x}-\mathbf{x}^{k}\|_{I_{Nd}-\alpha(\Gamma-\alpha\Gamma^{2})\otimes I_{d}}^{2}. Since α12Γ\alpha\leq\frac{1}{2\|\Gamma\|}, we have HΓId/(2α)H\succeq\Gamma\otimes I_{d}/(2\alpha) and INdα(ΓαΓ2)IdINdαΓId𝐎I_{Nd}-\alpha(\Gamma-\alpha\Gamma^{2})\otimes I_{d}\succeq I_{Nd}-\alpha\Gamma\otimes I_{d}\succ\mathbf{O}. Thus, H,H~H,\tilde{H} satisfy the conditions required by AMM in Section II-A. Also, uku^{k} is strongly convex and satisfies Assumption 2.
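For illustration, a minimal NumPy sketch of one iteration of (28)–(29) is given below, assuming x and w are N-by-d arrays, grad_f is a list of local gradient callables, and project_X is a list of projection operators onto the sets X_i; these names are hypothetical.

import numpy as np

def primal_dual_step(x, w, Gamma, alpha, grad_f, project_X):
    # One iteration of (28)-(29); rows of x and w are the local primal/dual variables.
    mix_x = Gamma @ x                                 # (Gamma ⊗ I_d) x^k, computed row-wise
    mix_w = Gamma @ w                                 # (Gamma ⊗ I_d) w^k
    x_new = np.empty_like(x)
    for i in range(x.shape[0]):
        step = grad_f[i](x[i]) + mix_w[i] + mix_x[i]
        x_new[i] = project_X[i](x[i] - alpha * step)  # projection onto X_i
    w_new = w + alpha * mix_x                         # dual update uses the old iterate x^k
    return x_new, w_new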

III-C3 DIGing on Static Networks

DIGing [26] is a distributed gradient-tracking method for solving problem (1) with hi0h_{i}\equiv 0 i𝒱\forall i\in\mathcal{V} over time-varying networks. Here, we only consider DIGing on static undirected networks. Let α>0\alpha>0 and WN×NW\in\mathbb{R}^{N\times N} satisfy W𝟏=WT𝟏=𝟏W\mathbf{1}=W^{T}\mathbf{1}=\mathbf{1}, [W]ij=0[W]_{ij}=0 i𝒱\forall i\in\mathcal{V} j𝒩i{i}\forall j\notin\mathcal{N}_{i}\cup\{i\}, and W1N𝟏𝟏T<1\|W-\frac{1}{N}\mathbf{1}\mathbf{1}^{T}\|<1. It is shown in [26] that DIGing with W=WTW=W^{T} can be written as follows:

𝐱k+2=\displaystyle\mathbf{x}^{k+2}= (2WId)𝐱k+1(W2Id)𝐱k\displaystyle(2W\otimes I_{d})\mathbf{x}^{k+1}-(W^{2}\otimes I_{d})\mathbf{x}^{k}
α(f(𝐱k+1)f(𝐱k)),k0,\displaystyle-\alpha(\nabla f(\mathbf{x}^{k+1})-\nabla f(\mathbf{x}^{k})),\quad\forall k\geq 0,

where 𝐱0\mathbf{x}^{0} can be arbitrary and 𝐱1=(WId)𝐱0αf(𝐱0)\mathbf{x}^{1}=(W\otimes I_{d})\mathbf{x}^{0}-\alpha\nabla f(\mathbf{x}^{0}). Adding the above equation from k=0k=0 to k=K1k=K-1 yields

𝐱K+1=\displaystyle\mathbf{x}^{K+1}= (W2Id)𝐱K+((INW)Id)𝐱0\displaystyle(W^{2}\otimes I_{d})\mathbf{x}^{K}+((I_{N}-W)\otimes I_{d})\mathbf{x}^{0}
k=0K((INW)2Id)𝐱kαf(𝐱K)\displaystyle-\sum_{k=0}^{K}((I_{N}-W)^{2}\otimes I_{d})\mathbf{x}^{k}-\alpha\nabla f(\mathbf{x}^{K})

for all K0K\geq 0. By letting 𝐪0=1α((W2W)Id)𝐱0\mathbf{q}^{0}=\frac{1}{\alpha}((W^{2}-W)\otimes I_{d})\mathbf{x}^{0}, the above update is the same as

𝐱K+1\displaystyle\mathbf{x}^{K+1} =(W2Id)𝐱Kαf(𝐱K)α𝐪K,\displaystyle=(W^{2}\otimes I_{d})\mathbf{x}^{K}-\alpha\nabla f(\mathbf{x}^{K})-\alpha\mathbf{q}^{K},
𝐪K+1\displaystyle\mathbf{q}^{K+1} =𝐪K+1α((INW)2Id)𝐱K+1.\displaystyle=\mathbf{q}^{K}+\frac{1}{\alpha}((I_{N}-W)^{2}\otimes I_{d})\mathbf{x}^{K+1}.

Such an algorithmic form of DIGing is identical to AMM (9)–(11) with the above given 𝐪0S\mathbf{q}^{0}\in S^{\bot}, ρ=1/α\rho=1/\alpha, H=(INW2)IdH=(I_{N}-W^{2})\otimes I_{d}, H~=(INW)2Id\tilde{H}=(I_{N}-W)^{2}\otimes I_{d}, and uk(𝐱)=f(𝐱k),𝐱+ρ2(WId)(𝐱𝐱k)2u^{k}(\mathbf{x})=\langle\nabla f(\mathbf{x}^{k}),\mathbf{x}\rangle+\frac{\rho}{2}\|(W\otimes I_{d})(\mathbf{x}-\mathbf{x}^{k})\|^{2}. It can be verified that uku^{k} k0\forall k\geq 0 and H,H~H,\tilde{H} satisfy all the conditions in Section II-A.
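A minimal NumPy sketch of the primal-dual form of DIGing derived above is given below, where x and q are N-by-d arrays, W is the symmetric mixing matrix, and grad_f is a list of local gradient callables; the names are ours.

import numpy as np

def diging_pd_step(x, q, W, alpha, grad_f):
    # One iteration of the rewritten DIGing updates above.
    grads = np.stack([grad_f[i](x[i]) for i in range(x.shape[0])])
    x_new = W @ (W @ x) - alpha * grads - alpha * q   # (W^2 ⊗ I_d) x^K - α ∇f(x^K) - α q^K
    L = np.eye(W.shape[0]) - W
    q_new = q + (L @ (L @ x_new)) / alpha             # q^{K+1} = q^K + (1/α)((I_N - W)^2 ⊗ I_d) x^{K+1}
    return x_new, q_new

# Initialization consistent with the derivation above:
# q0 = (W @ (W @ x0) - W @ x0) / alpha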

III-C4 ESOM

ESOM [18] is a class of distributed second-order algorithms that address problem (1) with fif_{i} i𝒱\forall i\in\mathcal{V} being strongly convex, smooth, and twice continuously differentiable and with hi0h_{i}\equiv 0 i𝒱\forall i\in\mathcal{V}. It is developed by incorporating a proximal technique and certain second-order approximations into the Method of Multipliers. To describe ESOM, let α>0\alpha>0, ϵ>0\epsilon>0, and WN×NW\in\mathbb{R}^{N\times N} be an average matrix associated with 𝒢\mathcal{G} such that [W]ij0[W]_{ij}\geq 0 i,j𝒱\forall i,j\in\mathcal{V}. In addition, define Wd:=diag([W]11,,[W]NN)W_{d}:=\operatorname{diag}([W]_{11},\ldots,[W]_{NN}), B:=α(INd+(W2Wd)Id)B:=\alpha(I_{Nd}+(W-2W_{d})\otimes I_{d}), Dk:=2f(𝐱k)+ϵINd+2α(INdWdId)D^{k}:=\nabla^{2}f(\mathbf{x}^{k})+\epsilon I_{Nd}+2\alpha(I_{Nd}-W_{d}\otimes I_{d}), and Qk(K):=(Dk)12t=0K((Dk)12B(Dk)12)t(Dk)12Q^{k}(K):=(D^{k})^{-\frac{1}{2}}\sum_{t=0}^{K}((D^{k})^{-\frac{1}{2}}B(D^{k})^{-\frac{1}{2}})^{t}(D^{k})^{-\frac{1}{2}}, K0K\geq 0. With the above notations, each ESOM-KK algorithm can be described by

𝐱k+1\displaystyle\mathbf{x}^{k+1}\! =𝐱kQk(K)(f(𝐱k)+𝐪k+α(INdWId)𝐱k),\displaystyle=\!\mathbf{x}^{k}\!\!-\!Q^{k}(\!K\!)(\nabla f(\mathbf{x}^{k})\!\!+\!\mathbf{q}^{k}\!+\!\alpha(I_{Nd}\!-\!W\!\!\otimes\!I_{d})\mathbf{x}^{k}), (31)
𝐪k+1\displaystyle\mathbf{q}^{k+1}\! =𝐪k+α(INdWId)𝐱k+1,\displaystyle=\mathbf{q}^{k}+\alpha(I_{Nd}-\!W\!\otimes I_{d})\mathbf{x}^{k+1}, (32)

where 𝐱0\mathbf{x}^{0} is any vector in Nd\mathbb{R}^{Nd} and 𝐪0=𝟎\mathbf{q}^{0}=\mathbf{0} which satisfies the initialization (11) of AMM. Note that the primal and dual updates of AMM, i.e., (9)–(10), reduce to (31)–(32) when H~=H=INdWId\tilde{H}=H=I_{Nd}-W\otimes I_{d}, ρ=α\rho=\alpha, and uk(𝐱)=12𝐱T((Qk(K))1ρH)𝐱((Qk(K))1ρH)𝐱k,𝐱+f(𝐱k),𝐱u^{k}(\mathbf{x})=\frac{1}{2}\mathbf{x}^{T}((Q^{k}(K))^{-1}-\rho H)\mathbf{x}-\langle((Q^{k}(K))^{-1}-\rho H)\mathbf{x}^{k},\mathbf{x}\rangle+\langle\nabla f(\mathbf{x}^{k}),\mathbf{x}\rangle. Clearly, HH and H~\tilde{H} satisfy the conditions in Section II-A. Also, Assumption 2 holds, since (M+ϵ+2ρ)INd(Qk(K))12f(𝐱k)+ϵINd+ρH(M+\epsilon+2\rho)I_{Nd}\succeq(Q^{k}(K))^{-1}\succeq\nabla^{2}f(\mathbf{x}^{k})+\epsilon I_{Nd}+\rho H [18], where M>0M>0 is the Lipschitz constant of all the fi\nabla f_{i}’s. Note that unlike most specializations of AMM discussed in this section, uk()+ρ2H2u^{k}(\cdot)+\frac{\rho}{2}\|\cdot\|_{H}^{2} for ESOM-KK with K1K\geq 1 is non-separable.
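Note that Q^k(K) never needs to be formed explicitly: applying it to a vector amounts to K+1 solves with the block-diagonal matrix D^k and K multiplications by B, each of which corresponds to one round of neighbor communication in a distributed implementation. The following is a minimal centralized NumPy sketch of one ESOM-K iteration under the definitions above; grad_f_full and hess_f_full (assumed names) return the stacked gradient and the block-diagonal Hessian of f at the stacked iterate.

import numpy as np

def apply_Qk(g, D, B, K):
    # Returns Q^k(K) @ g via the recursion s_0 = D^{-1} g, s_{t+1} = D^{-1}(g + B s_t),
    # whose K-th iterate equals sum_{t=0}^{K} (D^{-1} B)^t D^{-1} g = Q^k(K) g.
    s = np.linalg.solve(D, g)
    for _ in range(K):
        s = np.linalg.solve(D, g + B @ s)
    return s

def esom_step(x, q, W, alpha, eps, grad_f_full, hess_f_full, K):
    # One ESOM-K iteration (31)-(32); x and q are stacked Nd-vectors, W is N-by-N.
    N = W.shape[0]
    d = x.size // N
    I = np.eye(N * d)
    H = I - np.kron(W, np.eye(d))                     # I_{Nd} - W ⊗ I_d
    W_d = np.diag(np.diag(W))
    D = hess_f_full(x) + eps * I + 2 * alpha * (I - np.kron(W_d, np.eye(d)))
    B = alpha * (I + np.kron(W - 2 * W_d, np.eye(d)))
    g = grad_f_full(x) + q + alpha * (H @ x)
    x_new = x - apply_Qk(g, D, B, K)
    q_new = q + alpha * (H @ x_new)
    return x_new, q_new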

III-D Connections between AMM and Existing Unifying Methods

PUDA [22] and ABC [21] are two recently proposed distributed methods for convex composite optimization, which unify a number of existing methods including a subset of the aforementioned specializations of AMM. Nevertheless, unlike AMM that can be specialized to both first-order and second-order methods, PUDA and ABC are first-order algorithms. Moreover, AMM allows hih_{i} i𝒱\forall i\in\mathcal{V} to be non-identical, i.e., each node only needs to know a local portion of the global nonsmooth objective function hh. In contrast, PUDA and ABC are restricted to the case of identical hih_{i}’s, i.e., each node has to know the entire hh, and it is not straightforward to extend their analyses to the more general case of non-identical hih_{i}’s.

Although none of AMM, PUDA, and ABC can include the others as special cases, they are implicitly connected through the following algorithm: Let 𝐪0=𝟎Nd\mathbf{q}^{0}=\mathbf{0}_{Nd}. For any k0k\geq 0, let

𝐳k+1\displaystyle\mathbf{z}^{k+1} =argmin𝐳Nduk(𝐳)+ρ2𝐳H2+(𝐪k)T𝐳,\displaystyle=\operatorname{\arg\;\min}_{\mathbf{z}\in\mathbb{R}^{Nd}}u^{k}(\mathbf{z})+\frac{\rho}{2}\|\mathbf{z}\|_{H}^{2}+(\mathbf{q}^{k})^{T}\mathbf{z}, (33)
𝐱k+1\displaystyle\mathbf{x}^{k+1} =argmin𝐱Ndh(𝐱)+ρ2𝐱𝐳k+12,\displaystyle=\operatorname{\arg\;\min}_{\mathbf{x}\in\mathbb{R}^{Nd}}~{}h(\mathbf{x})+\frac{\rho}{2}\|\mathbf{x}-\mathbf{z}^{k+1}\|^{2}, (34)
𝐪k+1\displaystyle\mathbf{q}^{k+1} =𝐪k+ρH~𝐳k+1.\displaystyle=\mathbf{q}^{k}+\rho\tilde{H}\mathbf{z}^{k+1}. (35)

Different from AMM whose primal update (9) involves both the surrogate function and the nonsmooth objective, the above algorithm tackles these two parts separately and thus has two sequential primal minimization operations. Indeed, when h0h\equiv 0, (33)–(35) are identical to (9)–(10).
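As an illustration, consider a quadratic surrogate u^k(z) = (1/2)||z - x^k||_A^2 + <grad f(x^k), z> with A symmetric and A + rho*H positive definite; then step (33) reduces to a linear solve and step (34) to a proximal map. A minimal centralized NumPy sketch of one pass of (33)–(35) under this assumption is given below, where grad_f_full and prox_h (assumed names) return the stacked gradient of f and the proximal map prox_h(v, t) = argmin_x h(x) + ||x - v||^2/(2t), respectively.

import numpy as np

def unified_step(x, q, A, H, H_tilde, rho, grad_f_full, prox_h):
    # One pass of (33)-(35) with the quadratic surrogate described above;
    # x and q are stacked Nd-vectors, and A, H, H_tilde are Nd-by-Nd matrices.
    # (33): the minimizer satisfies (A + rho*H) z = A x^k - grad f(x^k) - q^k.
    z = np.linalg.solve(A + rho * H, A @ x - grad_f_full(x) - q)
    # (34): proximal map of h/rho evaluated at z^{k+1}.
    x_new = prox_h(z, 1.0 / rho)
    # (35): dual ascent along H_tilde.
    q_new = q + rho * (H_tilde @ z)
    return x_new, q_new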

It can be shown that with particular uku^{k}, HH, and H~\tilde{H}, (33)–(35) and 𝐪0=𝟎Nd\mathbf{q}^{0}=\mathbf{0}_{Nd} become equivalent to PUDA and ABC under some mild parameter conditions. Specifically, PUDA corresponds to 𝐮k(𝐳)=ρ2𝐳𝐱kI𝒞12+𝐟(𝐱k),𝐳\mathbf{u}^{k}(\mathbf{z})=\frac{\rho}{2}\|\mathbf{z}-\mathbf{x}^{k}\|_{I-\mathcal{C}_{1}}^{2}+\langle\nabla\mathbf{f}(\mathbf{x}^{k}),\mathbf{z}\rangle, ρ>0\rho>0, H=𝒜11INd+𝒞1H=\mathcal{A}_{1}^{-1}-I_{Nd}+\mathcal{C}_{1}, and H~=12𝒜11\tilde{H}=\mathcal{B}_{1}^{2}\mathcal{A}_{1}^{-1}, while ABC corresponds to 𝐮k(𝐳)=ρ2𝐳𝐱k21𝒜22+𝐟(𝐱k),𝐳\mathbf{u}^{k}(\mathbf{z})=\frac{\rho}{2}\|\mathbf{z}-\mathbf{x}^{k}\|_{\mathcal{B}_{2}^{-1}\mathcal{A}_{2}}^{2}+\langle\nabla\mathbf{f}(\mathbf{x}^{k}),\mathbf{z}\rangle, ρ>0\rho>0, H=21(INd𝒜2)H=\mathcal{B}_{2}^{-1}(I_{Nd}-\mathcal{A}_{2}), and H~=21𝒞2\tilde{H}=\mathcal{B}_{2}^{-1}\mathcal{C}_{2}. Here, 𝒜1,1,𝒞1,𝒜2,2,𝒞2Nd×Nd\mathcal{A}_{1},\mathcal{B}_{1},\mathcal{C}_{1},\mathcal{A}_{2},\mathcal{B}_{2},\mathcal{C}_{2}\in\mathbb{R}^{Nd\times Nd} are proper matrices whose detailed designs can be found in [22, 21]. Such equivalence requires additional conditions that 𝒜1\mathcal{A}_{1} and 2\mathcal{B}_{2} are invertible, 𝒞1INd\mathcal{C}_{1}\preceq I_{Nd}, and 12𝒜11\mathcal{B}_{1}^{2}\mathcal{A}_{1}^{-1} is symmetric. In fact, these additional conditions are satisfied by many specializations of PUDA and ABC such as those in [22, Table I] and [21, Table II] when they choose positive definite average matrices. Also, the above options of uku^{k}, HH, and H~\tilde{H} for PUDA and ABC satisfy our conditions for AMM in Section II-A.

To summarize, PUDA and ABC with some additional parameter conditions can be specialized from (33)–(35), while AMM and (33)–(35) are the same if problem (1) is smooth and are different otherwise. As a result, when solving smooth problems, AMM, PUDA, and ABC share a number of common special cases (e.g., EXTRA [5] and DIGing [26]).

IV Convergence Analysis

In this section, we analyze the convergence performance of AMM described by (4)–(5) (which are equivalent to (9)–(11)).

For the purpose of analysis, we add HH~H\succeq\tilde{H} to the conditions on H,H~H,\tilde{H} in Section II-A, leading to Assumption 6.

Assumption 6.

The matrices HH and H~\tilde{H} are symmetric and Null(H)=Null(H~)=S\operatorname{Null}(H)=\operatorname{Null}(\tilde{H})=S, where SS is defined in (2). In addition, HH~𝐎H\succeq\tilde{H}\succeq\mathbf{O}.

In DAMM, DAMM-SC, and DAMM-SQ, we adopt H=PIdH=P\otimes I_{d} and H~=P~Id\tilde{H}=\tilde{P}\otimes I_{d}, where PP and P~\tilde{P} comply with Assumption 3. Thus, as long as we further impose P~P\tilde{P}\preceq P, Assumption 6 holds. Moreover, all the existing specializations of AMM, DAMM, and DAMM-SQ in Section III, except DIGing [26], also satisfy Assumption 6. DIGing does not necessarily meet HH~H\succeq\tilde{H}, yet it can easily satisfy HH~H\succeq\tilde{H} by letting its average matrix be symmetric positive semidefinite.
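In practice, whether candidate matrices H and H~ satisfy Assumption 6 can be verified numerically. The following is a small sketch of such a check for dense NumPy matrices, where S is the consensus subspace defined in (2); the function name and tolerance are our own choices.

import numpy as np

def check_assumption_6(H, H_tilde, N, d, tol=1e-8):
    # Symmetry of H and H_tilde.
    ok = np.allclose(H, H.T, atol=tol) and np.allclose(H_tilde, H_tilde.T, atol=tol)
    # H ⪰ H_tilde ⪰ O.
    ok &= np.linalg.eigvalsh(H - H_tilde).min() >= -tol
    ok &= np.linalg.eigvalsh(H_tilde).min() >= -tol
    # Null(H) = Null(H_tilde) = S: the consensus subspace lies in both null spaces ...
    S_basis = np.kron(np.ones((N, 1)), np.eye(d))     # columns span S
    ok &= np.allclose(H @ S_basis, 0.0, atol=tol)
    ok &= np.allclose(H_tilde @ S_basis, 0.0, atol=tol)
    # ... and neither null space is larger than S.
    ok &= np.linalg.matrix_rank(H, tol=tol) == N * d - d
    ok &= np.linalg.matrix_rank(H_tilde, tol=tol) == N * d - d
    return ok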

In what follows, we let (𝐱k,𝐯k)(\mathbf{x}^{k},\mathbf{v}^{k}) be generated by (4)–(5), and let (𝐱,𝐯)(\mathbf{x}^{\star},\mathbf{v}^{\star}) be a primal-dual optimal solution pair of problem (3), which satisfies (7). Also, let ΛM:=diag(M1,,MN)Id\Lambda_{M}:=\operatorname{diag}(M_{1},\ldots,M_{N})\otimes I_{d}, where Mi>0M_{i}>0 is the Lipschitz constant of fi\nabla f_{i}. In addition, define

Ak:=012uk((1s)𝐱k+s𝐱k+1)𝑑s,k0.\displaystyle A^{k}:=\int_{0}^{1}\nabla^{2}u^{k}((1-s)\mathbf{x}^{k}+s\mathbf{x}^{k+1})ds,\quad\forall k\geq 0.

Note that every AkA^{k} exists and is symmetric due to Assumption 2. Moreover, since uk(𝐱k)=f(𝐱k)\nabla u^{k}(\mathbf{x}^{k})=\nabla f(\mathbf{x}^{k}),

uk(𝐱k+1)\displaystyle\nabla u^{k}(\mathbf{x}^{k+1})
=\displaystyle= uk(𝐱k)+012uk((1s)𝐱k+s𝐱k+1)(𝐱k+1𝐱k)𝑑s\displaystyle\nabla u^{k}(\mathbf{x}^{k})+\int_{0}^{1}\nabla^{2}u^{k}((1-s)\mathbf{x}^{k}+s\mathbf{x}^{k+1})(\mathbf{x}^{k+1}-\mathbf{x}^{k})ds
=\displaystyle= f(𝐱k)+Ak(𝐱k+1𝐱k).\displaystyle\nabla f(\mathbf{x}^{k})+A^{k}(\mathbf{x}^{k+1}-\mathbf{x}^{k}). (36)

Then, we introduce the following auxiliary lemma for deriving the main results.

Lemma 1.

Suppose Assumption 1, Assumption 2, and Assumption 6 hold. For any η[0,1)\eta\in[0,1) and k0k\geq 0,

1ρ𝐯k𝐯k+1,𝐯k+1𝐯η𝐱k𝐱,f(𝐱k)f(𝐱)\displaystyle\frac{1}{\rho}\langle\mathbf{v}^{k}-\mathbf{v}^{k+1},\mathbf{v}^{k+1}-\mathbf{v}^{\star}\rangle\geq\eta\langle\mathbf{x}^{k}-\mathbf{x}^{\star},\nabla f(\mathbf{x}^{k})-\nabla f(\mathbf{x}^{\star})\rangle
+𝐱k+1𝐱,Ak(𝐱k+1𝐱k)𝐱k+1𝐱kΛM24(1η).\displaystyle+\langle\mathbf{x}^{k+1}-\mathbf{x}^{\star},A^{k}(\mathbf{x}^{k+1}-\mathbf{x}^{k})\rangle-\frac{\|\mathbf{x}^{k+1}-\mathbf{x}^{k}\|_{\Lambda_{M}}^{2}}{4(1-\eta)}. (37)
Proof.

See Appendix -A. ∎

IV-A Convergence under General Convexity

In this subsection, we provide the convergence rates of AMM in solving the general convex problem (1).

Let 𝐱¯k=1kt=1k𝐱t\bar{\mathbf{x}}^{k}\!=\!\frac{1}{k}\sum_{t=1}^{k}\mathbf{x}^{t} k1\forall k\geq 1 be the running average of 𝐱t\mathbf{x}^{t} from t=1t=1 to t=kt=k. Below, we derive sublinear convergence rates for (i) the consensus error H~12𝐱¯k\|\tilde{H}^{\frac{1}{2}}\bar{\mathbf{x}}^{k}\|, which represents the infeasibility of 𝐱¯k\bar{\mathbf{x}}^{k} in solving the equivalent problem (3), and (ii) the objective error |f(𝐱¯k)+h(𝐱¯k)f(𝐱)h(𝐱)||f(\bar{\mathbf{x}}^{k})+h(\bar{\mathbf{x}}^{k})-f(\mathbf{x}^{\star})-h(\mathbf{x}^{\star})|, which is a direct measure of optimality. Throughout Section IV-A, we tentatively consider the following type of surrogate function that fulfills Assumption 2:

uk(𝐱)=12𝐱𝐱kA2+f(𝐱k),𝐱,k0,u^{k}(\mathbf{x})=\frac{1}{2}\|\mathbf{x}-\mathbf{x}^{k}\|_{A}^{2}+\langle\nabla f(\mathbf{x}^{k}),\mathbf{x}\rangle,\quad\forall k\geq 0, (38)

where ANd×NdA\in\mathbb{R}^{Nd\times Nd} satisfies A=ATA=A^{T}, A𝐎A\succeq\mathbf{O}, and A+ρH𝐎A+\rho H\succ\mathbf{O}. Such a choice of uku^{k} results in Ak=AA^{k}=A k0\forall k\geq 0.

Note that AMM endowed with (38) still generalizes most existing algorithms discussed in Section III, including EXTRA [5], ID-FBBS [13], PGC [16], PG-EXTRA [12], DPGA [11], the decentralized ADMM in [4], D-FBBS [13], the distributed ADMM in [9], the distributed primal-dual algorithm in [14], and DIGing [26]. Although DQM [19] and ESOM [18] are specialized from AMM with uku^{k}’s different from (38), they both require problem (1) to be strongly convex and smooth; convergence rates for the general form of AMM in that case are provided in Section IV-B.

We now bound H~12𝐱¯k\|\tilde{H}^{\frac{1}{2}}\bar{\mathbf{x}}^{k}\| and |f(𝐱¯k)+h(𝐱¯k)f(𝐱)h(𝐱)||f(\bar{\mathbf{x}}^{k})+h(\bar{\mathbf{x}}^{k})-f(\mathbf{x}^{\star})-h(\mathbf{x}^{\star})|. These bounds play a key role in establishing the rates at which the two errors vanish.

Lemma 2.

Suppose Assumption 1, Assumption 6, and (38) hold. For any k1k\geq 1,

H~12𝐱¯k=1ρk𝐯k𝐯0,\displaystyle\|\tilde{H}^{\frac{1}{2}}\bar{\mathbf{x}}^{k}\|=\frac{1}{\rho k}\|\mathbf{v}^{k}-\mathbf{v}^{0}\|, (39)
f(𝐱¯k)+h(𝐱¯k)f(𝐱)h(𝐱)\displaystyle f(\bar{\mathbf{x}}^{k})\!+\!h(\bar{\mathbf{x}}^{k})\!-\!f(\mathbf{x}^{\star})\!-\!h(\mathbf{x}^{\star})
1k(𝐯022ρ+𝐱0𝐱A22+t=0k1𝐱t+1𝐱tΛMA22),\displaystyle\leq\frac{1}{k}\!\Bigl{(}\frac{\|\mathbf{v}^{0}\|^{2}}{2\rho}\!+\!\frac{\|\mathbf{x}^{0}\!-\!\mathbf{x}^{\star}\|_{A}^{2}}{2}\!+\!\sum_{t=0}^{k-1}\!\frac{\|\mathbf{x}^{t+1}\!-\!\mathbf{x}^{t}\|_{\Lambda_{M}\!-\!A}^{2}}{2}\Bigr{)}, (40)
f(𝐱¯k)+h(𝐱¯k)f(𝐱)h(𝐱)𝐯𝐯k𝐯0ρk.\displaystyle f(\bar{\mathbf{x}}^{k})\!+\!h(\bar{\mathbf{x}}^{k})\!-\!f(\mathbf{x}^{\star})\!-\!h(\mathbf{x}^{\star})\!\geq-\!\frac{\|\mathbf{v}^{\star}\|\cdot\|\mathbf{v}^{k}-\mathbf{v}^{0}\|}{\rho k}. (41)
Proof.

See Appendix -B. ∎

Observe from Lemma 2 that as long as 𝐯k𝐯0\|\mathbf{v}^{k}-\mathbf{v}^{0}\| and t=0k1𝐱t+1𝐱tΛMA2\sum_{t=0}^{k-1}\|\mathbf{x}^{t+1}-\mathbf{x}^{t}\|_{\Lambda_{M}-A}^{2} are bounded, H~12𝐱¯k\|\tilde{H}^{\frac{1}{2}}\bar{\mathbf{x}}^{k}\| and |f(𝐱¯k)+h(𝐱¯k)f(𝐱)h(𝐱)||f(\bar{\mathbf{x}}^{k})+h(\bar{\mathbf{x}}^{k})-f(\mathbf{x}^{\star})-h(\mathbf{x}^{\star})| are guaranteed to converge to 0. The following lemma, derived with the help of Lemma 1, assures that this is indeed the case, where 𝐳k:=((𝐱k)T,(𝐯k)T)T\mathbf{z}^{k}:=((\mathbf{x}^{k})^{T},(\mathbf{v}^{k})^{T})^{T} k0\forall k\geq 0, 𝐳:=((𝐱)T,(𝐯)T)T\mathbf{z}^{\star}:=((\mathbf{x}^{\star})^{T},(\mathbf{v}^{\star})^{T})^{T}, and G:=diag(A,INd/ρ)G:=\operatorname{diag}(A,I_{Nd}/\rho).

Lemma 3.

Suppose all the conditions in Lemma 2 hold. Also suppose that AΛM/2A\succ\Lambda_{M}/2. For any k1k\geq 1,

𝐯k𝐯0𝐯0𝐯+ρ𝐳0𝐳G,\displaystyle\|\mathbf{v}^{k}-\mathbf{v}^{0}\|\leq\|\mathbf{v}^{0}-\mathbf{v}^{\star}\|+\sqrt{\rho}\|\mathbf{z}^{0}-\mathbf{z}^{\star}\|_{G}, (42)
t=0k1𝐱t+1𝐱tΛMA2𝐳0𝐳G21σ¯,\displaystyle\sum_{t=0}^{k-1}\|\mathbf{x}^{t+1}-\mathbf{x}^{t}\|_{\Lambda_{M}-A}^{2}\leq\frac{\|\mathbf{z}^{0}-\mathbf{z}^{\star}\|_{G}^{2}}{1-\underline{\sigma}}, (43)

where σ¯:=min{σ|σAΛM/2}(0,1)\underline{\sigma}:=\min\{\sigma|~{}\sigma A\succeq\Lambda_{M}/2\}\in(0,1).

Proof.

See Appendix -C. ∎

The following theorem, which results from Lemma 2 and Lemma 3, provides O(1/k)O(1/k) rates for both the consensus error and the objective error at the running average 𝐱¯k\bar{\mathbf{x}}^{k}.

Theorem 1.

Suppose all the conditions in Lemma 3 hold. For any k1k\geq 1,

H~12𝐱¯k𝐯0𝐯+ρ𝐳0𝐳Gρk,\displaystyle\|\tilde{H}^{\frac{1}{2}}\bar{\mathbf{x}}^{k}\|\leq\frac{\|\mathbf{v}^{0}-\mathbf{v}^{\star}\|+\sqrt{\rho}\|\mathbf{z}^{0}-\mathbf{z}^{\star}\|_{G}}{\rho k},
f(𝐱¯k)+h(𝐱¯k)f(𝐱)h(𝐱)\displaystyle f(\bar{\mathbf{x}}^{k})\!+\!h(\bar{\mathbf{x}}^{k})\!-\!f(\mathbf{x}^{\star})\!-\!h(\mathbf{x}^{\star})
12k(𝐯02ρ+𝐱0𝐱A2+𝐳0𝐳G21σ¯),\displaystyle\leq\frac{1}{2k}\left(\frac{\|\mathbf{v}^{0}\|^{2}}{\rho}\!+\!\|\mathbf{x}^{0}-\mathbf{x}^{\star}\|_{A}^{2}\!+\!\frac{\|\mathbf{z}^{0}-\mathbf{z}^{\star}\|_{G}^{2}}{1-\underline{\sigma}}\right),
f(𝐱¯k)+h(𝐱¯k)f(𝐱)h(𝐱)\displaystyle f(\bar{\mathbf{x}}^{k})+h(\bar{\mathbf{x}}^{k})-f(\mathbf{x}^{\star})-h(\mathbf{x}^{\star})
𝐯(𝐯0𝐯+ρ𝐳0𝐳G)ρk.\displaystyle\geq-\frac{\|\mathbf{v}^{\star}\|(\|\mathbf{v}^{0}-\mathbf{v}^{\star}\|+\sqrt{\rho}\|\mathbf{z}^{0}-\mathbf{z}^{\star}\|_{G})}{\rho k}.
Proof.

Substitute (42)–(43) into (39)–(41). ∎

Among the existing distributed algorithms that are able to solve nonsmooth convex optimization problems with non-strongly convex objective functions (e.g., [10, 6, 7, 9, 11, 12, 13, 16, 17, 23, 24, 33, 31, 34, 32, 21]), the best convergence rates are of O(1/k)O(1/k), achieved by [9, 11, 16, 12, 13, 7, 33, 21]. Indeed, AMM with all the conditions in Theorem 1 generalizes the algorithms in [9, 11, 16, 12, 13] as well as their parameter conditions. Like Theorem 1, [9, 11, 16, 21] achieve O(1/k)O(1/k) rates for both the objective error and the consensus error, whereas [12, 13, 7, 33] reach O(1/k)O(1/k) rates in terms of optimality residuals that implicitly reflect deviations from an optimality condition, and they only guarantee O(1/k)O(1/\sqrt{k}) rates for the consensus error. Also note that [33, 21] only address problem (1) with identical hih_{i}’s.

Theorem 1 provides new convergence results for some existing algorithms generalized by AMM. In particular, the O(1/k)O(1/k) rate in terms of the objective error is new to PG-EXTRA [12] and D-FBBS [13] for nonsmooth problems as well as EXTRA [5] and ID-FBBS [13] for smooth problems, which only had O(1/k)O(1/k) rates with respect to optimality residuals before. Moreover, Theorem 1 improves their O(1/k)O(1/\sqrt{k}) rates in reaching consensus to O(1/k)O(1/k). Furthermore, Theorem 1 extends the O(1/k)O(1/k) rate of the distributed primal-dual algorithm in [14] for constrained smooth convex optimization to general nonsmooth convex problems, and allows PGC [16] to establish its O(1/k)O(1/k) rate without the originally required condition that each hih_{i} needs to have a compact domain.

Finally, we illustrate how the sublinear rates in Theorem 1 are influenced by the graph topology. For simplicity, set 𝐯0=𝟎\mathbf{v}^{0}=\mathbf{{0}} and H~INd\tilde{H}\preceq I_{Nd}. Also, let gh(𝐱)g^{\star}\in\partial h(\mathbf{x}^{\star}) be such that f(𝐱)+gS\nabla f(\mathbf{x}^{\star})+g^{\star}\in S^{\perp}, which exists because of the optimality condition i𝒱fi(x)(i𝒱hi)(x)-\sum_{i\in\mathcal{V}}\nabla f_{i}(x^{\star})\in(\partial\sum_{i\in\mathcal{V}}h_{i})(x^{\star}), and let 𝐯=(H~12)(f(𝐱)+g)\mathbf{{v}}^{\star}=-(\tilde{H}^{\frac{1}{2}})^{\dagger}(\nabla f(\mathbf{x}^{\star})+g^{\star}). It follows from (7) that 𝐯\mathbf{{v}}^{\star} is an optimal solution to the Lagrange dual of (3). Then, 𝐯0𝐯=𝐯1λH~f(𝐱)+g\|\mathbf{{v}}^{0}-\mathbf{{v}}^{\star}\|=\|\mathbf{{v}}^{\star}\|\leq\frac{1}{\sqrt{\lambda_{\tilde{H}}}}\|\nabla f(\mathbf{x}^{\star})+g^{\star}\|, where λH~(0,1]\lambda_{\tilde{H}}\in(0,1] is the smallest nonzero eigenvalue of H~\tilde{H}. Note that 𝐳0𝐳G=(𝐱0𝐱A2+1ρ𝐯0𝐯2)12𝐱0𝐱A+1ρ𝐯0𝐯\|\mathbf{{z}}^{0}-\mathbf{{z}}^{\star}\|_{G}=(\|\mathbf{{x}}^{0}-\mathbf{{x}}^{\star}\|_{A}^{2}+\frac{1}{\rho}\|\mathbf{{v}}^{0}-\mathbf{{v}}^{\star}\|^{2})^{\frac{1}{2}}\leq\|\mathbf{{x}}^{0}-\mathbf{{x}}^{\star}\|_{A}+\frac{1}{\sqrt{\rho}}\|\mathbf{{v}}^{0}-\mathbf{{v}}^{\star}\|. Thus, by letting ρ=1λH~\rho=\frac{1}{\sqrt{\lambda_{\tilde{H}}}}, we have |f(𝐱¯k)+h(𝐱¯k)f(𝐱)h(𝐱)|O(1kλH~)|f(\bar{\mathbf{x}}^{k})+h(\bar{\mathbf{x}}^{k})-f(\mathbf{x}^{\star})-h(\mathbf{x}^{\star})|\leq O(\frac{1}{k\sqrt{\lambda_{\tilde{H}}}}). In addition, PS[𝐱¯k]1λH~PS[𝐱¯k]H~=1λH~H~12𝐱¯kO(1kλH~)\|P_{S^{\perp}}[\bar{\mathbf{{x}}}^{k}]\|\leq\frac{1}{\sqrt{\lambda_{\tilde{H}}}}\|P_{S^{\perp}}[\bar{\mathbf{{x}}}^{k}]\|_{\tilde{H}}=\frac{1}{\sqrt{\lambda_{\tilde{H}}}}\|\tilde{H}^{\frac{1}{2}}\bar{\mathbf{x}}^{k}\|\leq O(\frac{1}{k\sqrt{\lambda_{\tilde{H}}}}). Here, PS[𝐱¯k]\|P_{S^{\perp}}[\bar{\mathbf{{x}}}^{k}]\| plays a similar role as H~12𝐱¯k\|\tilde{H}^{\frac{1}{2}}\bar{\mathbf{x}}^{k}\| in quantifying the consensus error. Therefore, denser network connectivity, which is often indicated by larger λH~\lambda_{\tilde{H}}, can yield faster convergence of both the objective error and the consensus error (measured by PS[𝐱¯k]\|P_{S^{\perp}}[\bar{\mathbf{{x}}}^{k}]\|).

Compared to the existing specializations of AMM that also have O(1/k)O(1/k) rates in solving (1), the above O(1/(kλH~))O(1/(k\sqrt{\lambda_{\tilde{H}}})) rate has better dependence on λH~(0,1]\lambda_{\tilde{H}}\in(0,1] than the O(1/(kλH~))O(1/(k\lambda_{\tilde{H}})) rate of DPGA [11] and the O(1/(kλH~2))O(1/(k\lambda_{\tilde{H}}^{2})) rate of the distributed ADMM [9]. The remaining specializations discussed in Section III have not been provided with similar dependence of their convergence rates on the graph topology.

IV-B Convergence under Local Restricted Strong Convexity

In this subsection, we impose an additional problem assumption and derive the corresponding convergence rates of AMM. Henceforth, we no longer restrict uku^{k} to the form of (38).

Assumption 7.

The function i𝒱fi(x)\sum_{i\in\mathcal{V}}f_{i}(x) is locally restricted strongly convex with respect to xx^{\star}, where xx^{\star} is an optimum of problem (1), i.e., for any convex and compact set CdC\subset\mathbb{R}^{d} containing xx^{\star}, i𝒱fi(x)\sum_{i\in\mathcal{V}}f_{i}(x) is restricted strongly convex with respect to xx^{\star} on CC with convexity parameter mC>0m_{C}>0.

Restricted strong convexity has been studied in the literature. For example, [44, 5] consider its global version and [20] considers the local version as in Assumption 7. Under Assumption 7, problem (1) has a unique optimum xx^{\star}, and problem (3) has a unique optimum 𝐱=((x)T,,(x)T)TS\mathbf{x}^{\star}=((x^{\star})^{T},\ldots,(x^{\star})^{T})^{T}\in S [20]. Furthermore, Assumption 7 is less restrictive than the standard global strong convexity condition. For instance, the objective function of logistic regression [45] is not globally strongly convex but meets Assumption 7.

Proposition 1.

Suppose Assumption 1 holds. If there exists ϵ>0\epsilon>0 such that each fif_{i} is twice continuously differentiable on the ball B(x,ϵ):={x|xxϵ}B(x^{\star},\epsilon):=\{x|~{}\|x-x^{\star}\|\leq\epsilon\} and i𝒱2fi(x)\sum_{i\in\mathcal{V}}\nabla^{2}f_{i}(x^{\star}) is positive definite, then Assumption 7 holds.

Proof.

See Appendix -D. ∎

Proposition 1 provides a sufficient condition for Assumption 7. When each fif_{i} is twice continuously differentiable, this sufficient condition is much weaker than global strong convexity which requires i𝒱2fi(x)𝐎\sum_{i\in\mathcal{V}}\nabla^{2}f_{i}(x)\succ\mathbf{O} xd\forall x\in\mathbb{R}^{d}.

Note that Assumption 7 does not ensure any strong convexity of f(𝐱)=i𝒱fi(xi)f(\mathbf{x})=\sum_{i\in\mathcal{V}}f_{i}(x_{i}), yet it guarantees the following property of f(𝐱)+ρ4𝐱H~2f(\mathbf{x})+\frac{\rho}{4}\|\mathbf{x}\|_{\tilde{H}}^{2}.

Proposition 2.

[20, Lemma 1] Under Assumption 6 and Assumption 7, f(𝐱)+ρ4𝐱H~2f(\mathbf{x})+\frac{\rho}{4}\|\mathbf{x}\|_{\tilde{H}}^{2} is locally restricted strongly convex with respect to 𝐱\mathbf{x}^{\star}.

Subsequently, we construct a convex and compact set 𝒞Nd\mathcal{C}\subset\mathbb{R}^{Nd} containing 𝐱\mathbf{x}^{\star}, so that the restricted strong convexity in Proposition 2 takes effect on 𝒞\mathcal{C}, leading to a parameter condition that guarantees 𝐱k𝒞\mathbf{x}^{k}\in\mathcal{C} k0\forall k\geq 0. To introduce 𝒞\mathcal{C}, note from Assumption 2 that there exist symmetric positive semidefinite matrices AA_{\ell}, AuNd×NdA_{u}\in\mathbb{R}^{Nd\times Nd} such that for any 𝐱Nd\mathbf{x}\in\mathbb{R}^{Nd} and k0k\geq 0, A2uk(𝐱)AuA_{\ell}\preceq\nabla^{2}u^{k}(\mathbf{x})\preceq A_{u}. Let Δ:=AuA0\Delta:=\|A_{u}-A_{\ell}\|\geq 0. Moreover, define Aa=(A+Au)/2A_{a}=(A_{\ell}+A_{u})/2 and G~=diag(Aa,INd/ρ)\tilde{G}=\operatorname{diag}(A_{a},I_{Nd}/\rho), which are also symmetric positive semidefinite. Then, let 𝒞:={𝐱|𝐱𝐱Aa+ρH~2𝐳0𝐳G~2+ρ𝐱0H~2}\mathcal{C}:=\left\{\mathbf{x}|~{}\|\mathbf{x}-\mathbf{x}^{\star}\|_{A_{a}+\rho\tilde{H}}^{2}\leq\|\mathbf{z}^{0}-\mathbf{z}^{\star}\|_{\tilde{G}}^{2}+\rho\|\mathbf{x}^{0}\|_{\tilde{H}}^{2}\right\}, where 𝐳0\mathbf{z}^{0} and 𝐳\mathbf{z}^{\star} are defined above Lemma 3. From Assumption 2(b) and Assumption 6, Aa+ρH~𝐎A_{a}+\rho\tilde{H}\succ\mathbf{O}, so that 𝒞\mathcal{C} is convex and compact. Thus, from Proposition 2, f(𝐱)+ρ4𝐱H~2f(\mathbf{x})+\frac{\rho}{4}\|\mathbf{x}\|_{\tilde{H}}^{2} is restricted strongly convex with respect to 𝐱\mathbf{x}^{\star} on 𝒞\mathcal{C} with convexity parameter mρ(0,)m_{\rho}\in(0,\infty). Lemma 4 is another consequence of Lemma 1, showing that 𝐱k\mathbf{x}^{k} remains in 𝒞\mathcal{C} for all k0k\geq 0.

Lemma 4.

Under Assumption 1, Assumption 2, Assumption 6, and Assumption 7, 𝐱k𝒞\mathbf{x}^{k}\in\mathcal{C} k0\forall k\geq 0 provided that

Aa(Δ28ηmρ+Δ)INd+ΛM2(1η) for some η(0,1),\displaystyle A_{a}\!\succ\!\!\left(\!\frac{\Delta^{2}}{8\eta m_{\rho}}\!+\!\Delta\right)I_{Nd}\!+\!\frac{\Lambda_{M}}{2(1\!-\!\eta)}\text{ for some $\eta\in(0,1)$}, (44)

where ΛM\Lambda_{M} is defined above Lemma 1.

Proof.

See Appendix -E. ∎

When Δ=0\Delta=0 (which means 2uk\nabla^{2}u^{k} is constant) or i𝒱fi(x)\sum_{i\in\mathcal{V}}f_{i}(x) is globally restricted strongly convex (which means mρm_{\rho} can be independent of 𝒞\mathcal{C}), it is straightforward to find uku^{k} so that (44) holds. Otherwise, mρm_{\rho} depends on 𝒞\mathcal{C} and thus on AaA_{a}, so that both sides of (44) involve AaA_{a}. With that being said, (44) can always be satisfied by proper uku^{k}’s. To see this, arbitrarily pick η(0,1)\eta\in(0,1) and λ¯a,λ¯a>0\bar{\lambda}_{a},\underline{\lambda}_{a}>0 such that λ¯a>λ¯a>M2(1η)\bar{\lambda}_{a}>\underline{\lambda}_{a}>\frac{M}{2(1-\eta)}, where M:=maxi𝒱MiM:=\max_{i\in\mathcal{V}}M_{i}. If we choose uku^{k} such that the corresponding AaA_{a} satisfies λ¯aINdAaλ¯aINd\underline{\lambda}_{a}I_{Nd}\preceq A_{a}\preceq\bar{\lambda}_{a}I_{Nd}, then 𝒞\mathcal{C} is a subset of 𝒞:={𝐱|λ¯a𝐱𝐱2λ¯a𝐱0𝐱2+1ρ𝐯0𝐯2+ρ𝐱0H~2}\mathcal{C}^{\prime}:=\{\mathbf{x}|~{}\underline{\lambda}_{a}\|\mathbf{x}-\mathbf{x}^{\star}\|^{2}\leq\bar{\lambda}_{a}\|\mathbf{x}^{0}-\mathbf{x}^{\star}\|^{2}+\frac{1}{\rho}\|\mathbf{v}^{0}-\mathbf{v}^{\star}\|^{2}+\rho\|\mathbf{x}^{0}\|_{\tilde{H}}^{2}\}. Let mρm_{\rho}^{\prime} be any convexity parameter of f(𝐱)+ρ𝐱H~24f(\mathbf{x})+\frac{\rho\|\mathbf{x}\|_{\tilde{H}}^{2}}{4} on 𝒞\mathcal{C}^{\prime}. Clearly, mρ(0,mρ]m_{\rho}^{\prime}\in(0,m_{\rho}] and is independent of AaA_{a}. Then, the following suffices to satisfy (44):

(Δ28ηmρ+Δ+λ¯a)INdAaλ¯aINd.\left(\frac{\Delta^{2}}{8\eta m_{\rho}^{\prime}}+\Delta+\underline{\lambda}_{a}\right)I_{Nd}\preceq A_{a}\preceq\bar{\lambda}_{a}I_{Nd}. (45)

The lower and upper bounds on AaA_{a} in (45) do not need to depend on AaA_{a}. Also, (45) is well-posed, since Δ28ηmρ+Δ+λ¯aλ¯a\frac{\Delta^{2}}{8\eta m_{\rho}^{\prime}}+\Delta+\underline{\lambda}_{a}\leq\bar{\lambda}_{a} holds for sufficiently small Δ>0\Delta>0. Therefore, uku^{k} k0\forall k\geq 0 with such Δ\Delta and with AaA_{a} satisfying (45) meet (44).

Next, we present the convergence rates of AMM in both optimality and feasibility under the local restricted strong convexity condition. We first consider the smooth case of (1) with each hih_{i} identically equal to 0, and provide a linear convergence rate for 𝐳k𝐳G~2+ρ𝐱kH~2\|\mathbf{z}^{k}-\mathbf{z}^{\star}\|_{\tilde{G}}^{2}+\rho\|\mathbf{x}^{k}\|_{\tilde{H}}^{2}, which quantifies the distance to primal optimality, dual optimality, and primal feasibility of (3). In Theorem 2, we force 𝐯\mathbf{v^{\star}} to satisfy not only (7) but 𝐯𝐯0S\mathbf{v}^{\star}-\mathbf{v}^{0}\in S^{\perp} as well. Such 𝐯\mathbf{v}^{\star} is a particular optimum to the Lagrange dual of (3), and can be chosen as 𝐯=𝐯~+((1Ni𝒱(vi0v~i))T,,(1Ni𝒱(vi0v~i))T)T\mathbf{v}^{\star}=\tilde{\mathbf{v}}+((\frac{1}{N}\sum_{i\in\mathcal{V}}(v_{i}^{0}-\tilde{v}_{i}))^{T},\ldots,(\frac{1}{N}\sum_{i\in\mathcal{V}}(v_{i}^{0}-\tilde{v}_{i}))^{T})^{T}, where 𝐯~\tilde{\mathbf{v}} is any Lagrange dual optimum of (3) and 𝐯0=((v10)T,,(vN0)T)T\mathbf{v}^{0}=((v_{1}^{0})^{T},\ldots,(v_{N}^{0})^{T})^{T} is the initial dual iterate of AMM.

Theorem 2.

Suppose all the conditions in Lemma 4 hold. Also suppose hi(x)0h_{i}(x)\equiv 0 i𝒱\forall i\in\mathcal{V}. Then, there exists δ~(0,1)\tilde{\delta}\in(0,1) such that for each k0k\geq 0,

𝐳k+1𝐳G~2+ρ𝐱k+1H~2\displaystyle\|\mathbf{z}^{k+1}-\mathbf{z}^{\star}\|_{\tilde{G}}^{2}+\rho\|\mathbf{x}^{k+1}\|_{\tilde{H}}^{2}
(1δ~)(𝐳k𝐳G~2+ρ𝐱kH~2).\displaystyle\leq(1-\tilde{\delta})(\|\mathbf{z}^{k}-\mathbf{z}^{\star}\|_{\tilde{G}}^{2}+\rho\|\mathbf{x}^{k}\|_{\tilde{H}}^{2}). (46)

Specifically, δ~=max{δ|Bi(δ)𝐎,i{1,2,3}}\tilde{\delta}=\max\{\delta|~{}B_{i}(\delta)\succeq\mathbf{O},\forall i\in\{1,2,3\}\} with

B1(δ)=\displaystyle B_{1}(\delta)\!= (2ηmρβΔ)INdδAaδ(1+θ1)(1+θ2)ΛM2/(ρλH~),\displaystyle(2\eta m_{\rho}\!\!-\!\beta\Delta)I_{Nd}\!-\!\delta A_{a}\!\!-\!\delta(1\!+\!\theta_{1})(1\!+\!\theta_{2})\Lambda_{M}^{2}/\!(\rho\lambda_{\tilde{H}}\!),
B2(δ)=\displaystyle B_{2}(\delta)\!= ρ(1δη)H~ρδ(1+1/θ1)HH~H,\displaystyle\rho(1-\delta-\eta)\tilde{H}-\rho\delta(1+1/\theta_{1})H\tilde{H}^{\dagger}H,
B3(δ)=\displaystyle B_{3}(\delta)\!= (1σ)Aa2δ(1+θ1)(1+1/θ2)(Au2ρλH~+ρλ¯H12H~H12),\displaystyle(1\!-\!\sigma)\!A_{a}\!\!-\!2\delta(1\!+\!\theta_{1}\!)(1\!+\!\!1/\theta_{2}\!)(\tfrac{A_{u}^{2}}{\rho\lambda_{\tilde{H}}}\!+\!\rho\bar{\lambda}H^{\frac{1}{2}}\tilde{H}^{\dagger}H^{\frac{1}{2}}),

where θ1,θ2\theta_{1},\theta_{2} are arbitrary positive scalars, λ¯\bar{\lambda} is any upper bound on H\|H\|, η(0,1)\eta\in(0,1) is given in (44), β>0\beta>0 and σ(0,1)\sigma\in(0,1) are such that

βΔ<2ηmρ,σAa(14β+1)ΔINd+ΛM2(1η),\beta\Delta<2\eta m_{\rho},\quad\sigma A_{a}\succeq(\frac{1}{4\beta}+1)\Delta I_{Nd}+\frac{\Lambda_{M}}{2(1-\eta)}, (47)

and λH~>0\lambda_{\tilde{H}}>0 is the smallest nonzero eigenvalue of H~\tilde{H}.

Proof.

See Appendix -F. ∎

In comparison with many existing works such as [4, 18, 19, 9, 26, 27, 6, 7, 28, 29, 21, 22] that assume global strong convexity to derive linear convergence rates, the linear rate in Theorem 2 only requires the weaker condition of local restricted strong convexity in Assumption 7. Moreover, Theorem 2 provides new convergence results to a number of existing algorithms that can be specialized from AMM. Specifically, Theorem 2 establishes linear convergence for D-FBBS [13], ID-FBBS [13], and DPGA [11], which has never been discussed before. In addition, Theorem 2 allows EXTRA [5], DIGing [26], ESOM [18], DQM [19], the distributed ADMMs [9, 4], PUDA [22], and ABC [21] to achieve linear convergence under less restrictive problem assumptions, including relaxing the global strong convexity in [26, 18, 19, 9, 4, 21, 22] and the global restricted strong convexity in [5] to local restricted strong convexity, and eliminating the Lipschitz continuity of 2fi\nabla^{2}f_{i} required in [18, 19].

Both the problem and the graph have impacts on the linear rate (1δ~)(1-\tilde{\delta}) in Theorem 2. Their impacts become explicit when each fif_{i} is globally restricted strongly convex with respect to xx^{\star} with convexity parameter m>0m>0. For simplicity, we set Mi=MmM_{i}=M\geq m i𝒱\forall i\in\mathcal{V} and mρ=mm_{\rho}=m. Also, choose H=H~H=\tilde{H} and proper uku^{k} k0\forall k\geq 0. In the expressions of Bi(δ)B_{i}(\delta) i=1,2,3\forall i=1,2,3, we pick any θ2>0\theta_{2}>0 and then let θ1\theta_{1} be such that max{δ:B2(δ)𝐎}>max{δ:B1(δ)𝐎}\max\{\delta:B_{2}(\delta)\succeq\mathbf{O}\}>\max\{\delta:B_{1}(\delta)\succeq\mathbf{O}\}. We first suppose mm increases. Note that uku^{k} still satisfies (44) with the same η\eta, while we allow for a larger β\beta and thus a smaller σ\sigma such that (47) holds. Accordingly, max{δ:Bi(δ)𝐎}\max\{\delta:B_{i}(\delta)\succeq\mathbf{O}\} i=1,3\forall i=1,3 become larger and, therefore, so does δ~\tilde{\delta}. Similarly, smaller MM yields larger δ~\tilde{\delta}. Now suppose the graph topology changes so that λH~\lambda_{\tilde{H}} increases, which implies denser connectivity of 𝒢\mathcal{G}. Since λ¯\bar{\lambda} can be any upper bound on H\|H\|, we can always assume it is unchanged with the graph topology (e.g., λ¯=N\bar{\lambda}=N when H=L𝒢IdH=L_{\mathcal{G}}\otimes I_{d}). Then, it can be seen that δ~\tilde{\delta} increases.

The above discussions suggest that smaller condition number κf:=M/m\kappa_{f}:=M/m or denser network connectivity may lead to faster convergence of AMM in solving strongly convex and smooth problems. Indeed, for some particular forms of AMM, such dependence can be more explicit and is comparable to some prior results. For example, let H=H~=(INW)Id/4H=\tilde{H}=(I_{N}-W)\otimes I_{d}/4 with an average matrix WW associated with 𝒢\mathcal{G}, ρ=M2/λH~\rho=M\sqrt{2/\lambda_{\tilde{H}}}, and uku^{k} be such that Δ=0\Delta=0 and Aa=ρ(3IN+W)Id4A_{a}=\frac{\rho(3I_{N}+W)\otimes I_{d}}{4}. Thus, ρINdAaρ2INd\rho I_{Nd}\succeq A_{a}\succeq\frac{\rho}{2}I_{Nd}, Hλ¯=12\|H\|\leq\bar{\lambda}=\frac{1}{2}, and λH~=1λ2(W)4(0,12)\lambda_{\tilde{H}}=\frac{1-\lambda_{2}(W)}{4}\in(0,\frac{1}{2}), where λ2(W)\lambda_{2}(W) is the second largest eigenvalue of WW. Accordingly, ρ2M\rho\geq 2M and AaMINdA_{a}\succeq MI_{Nd}. We may pick η=14\eta=\frac{1}{4} and σ=23\sigma=\frac{2}{3} so that (47) holds. Then, by choosing θ1=θ2=1\theta_{1}=\theta_{2}=1 in Theorem 2, we can show that

δ~=O(min{1λ2(W)κf,1λ2(W)}).\tilde{\delta}=O(\min\{\tfrac{\sqrt{1-\lambda_{2}(W)}}{\kappa_{f}},1-\lambda_{2}(W)\}). (48)

Most existing linearly convergent specializations of AMM have not revealed such explicit relations as (48), except the following. The decentralized ADMM [4] has the same result as (48), and DIGing [26] gives δ~=O((1λ2(W))2κf1.5)\tilde{\delta}=O(\frac{(1-\lambda_{2}(W))^{2}}{\kappa_{f}^{1.5}}) when W𝐎W\succeq\mathbf{O}, which is worse than (48). The distributed ADMM [9] can reach δ~=O(1λ2(W)κf)\tilde{\delta}=O(\frac{1-\lambda_{2}(W)}{\sqrt{\kappa_{f}}}), which is better than (48). This is probably because (48) follows directly from our analysis, which covers more general problem assumptions and algorithmic forms.

Finally, we remove the condition of hi0h_{i}\equiv 0 i𝒱\forall i\in\mathcal{V} in Theorem 2 and derive the following result.

Theorem 3.

Suppose all the conditions in Lemma 4 hold. For any k1k\geq 1,

H~12𝐱¯kd0+𝐯0𝐯ρk,\displaystyle\!\|\tilde{H}^{\frac{1}{2}}\bar{\mathbf{x}}^{k}\|\leq\frac{d_{0}+\|\mathbf{v}^{0}-\mathbf{v}^{\star}\|}{\rho k}, (49)
f(𝐱¯k)+h(𝐱¯k)f(𝐱)h(𝐱)1k\displaystyle\!f(\bar{\mathbf{x}}^{k})+h(\bar{\mathbf{x}}^{k})-f(\mathbf{x}^{\star})-h(\mathbf{x}^{\star})\!\leq\frac{1}{k}
(d02ρ(Δ2(2ηmρβΔ)+1η1σ)+𝐱0𝐱Au22+𝐯022ρ),\displaystyle\!\cdot\!\Bigl{(}\!\frac{d_{0}^{2}}{\rho}\bigl{(}\dfrac{\Delta}{2(2\eta m_{\rho}\!-\!\beta\Delta)}\!+\!\dfrac{1\!-\!\eta}{1\!-\!\sigma}\bigr{)}\!\!+\!\!\frac{\|\mathbf{x}^{0}\!-\!\mathbf{x}^{\star}\|_{A_{u}}^{2}}{2}\!+\!\!\frac{\|\mathbf{v}^{0}\|^{2}}{2\rho}\!\Bigr{)}, (50)
f(𝐱¯k)+h(𝐱¯k)f(𝐱)h(𝐱)𝐯(d0+𝐯0𝐯)ρk,\displaystyle\!f(\bar{\mathbf{x}}^{k})\!\!+\!h(\bar{\mathbf{x}}^{k})\!-\!\!f(\mathbf{x}^{\star})\!-\!h(\mathbf{x}^{\star})\!\!\geq\!\!-\frac{\|\mathbf{v}^{\star}\|(d_{0}\!+\!\!\|\mathbf{v}^{0}\!-\!\mathbf{v}^{\star}\|)}{\rho k}, (51)

where d0=ρ𝐳0𝐳G~2+ρ2𝐱0H~2d_{0}=\sqrt{\rho\|\mathbf{z}^{0}-\mathbf{z}^{\star}\|_{\tilde{G}}^{2}+\rho^{2}\|\mathbf{x}^{0}\|_{\tilde{H}}^{2}}, η(0,1)\eta\in(0,1) is given in (44), and β>0\beta>0 and σ(0,1)\sigma\in(0,1) are such that (47) holds.

Proof.

See Appendix -G. ∎

The convergence rates in Theorem 3 are of the same order as those in Theorem 1. Theorem 1 considers a more general optimization problem, while Theorem 3 allows for more general uku^{k}’s. In the literature, [8, 25, 15, 10] also deal with nonsmooth, strongly convex problems. Even under global strong convexity, [25, 15] provide convergence rates slower than O(1/k)O(1/k), and [8] derives an O(1/k)O(1/k) rate of convergence only to dual optimality. Although [10] attains linear convergence, it considers a more restricted problem, which assumes fif_{i} to be globally restricted strongly convex and the subgradients of hih_{i} to be uniformly bounded.

Remark 1.

Compared to the existing algorithms in Section III that can be viewed as specializations of AMM, AMM is able to achieve convergence rates of the same or even better order under identical or weaker problem conditions. Moreover, although the theoretical results in Section IV require additional conditions on the parameters of AMM besides those in Section II, these parameter conditions are not restrictive. Indeed, when AMM reduces to the algorithms in Section III, its parameter conditions in Theorem 1 still generalize those in [5, 13, 16, 12, 11, 9, 14] and partially overlap those in [26] for non-strongly convex problems. The parameter conditions in Theorem 2 are more general than those in [5, 19], different from but intersecting with those in [26, 18], and admittedly, relatively limited compared to those in [9, 4] for strongly convex problems, yet on the premise that uku^{k} is specialized to the corresponding particular forms specified in Section III.

V Numerical Examples

This section demonstrates the competitive practical performance of DAMM via numerical examples.

V-A Non-identical Local Nonsmooth Objectives

Consider the following constrained l1l_{1}-regularized problem:

minimizexi𝒱Xii𝒱(12Bixbi2+1Nx1),\operatorname{minimize}_{x\in\cap_{i\in\mathcal{V}}X_{i}}~{}\sum_{i\in\mathcal{V}}\left(\frac{1}{2}\|B_{i}x-b_{i}\|^{2}+\frac{1}{N}\|x\|_{1}\right), (52)

where each Bim×dB_{i}\in\mathbb{R}^{m\times d}, bimb_{i}\in\mathbb{R}^{m}, and Xi={x|xaiai+1}X_{i}=\{x|~{}\|x-a_{i}\|\leq\|a_{i}\|+1\} with aida_{i}\in\mathbb{R}^{d}. Note that i𝒱Xi\cap_{i\in\mathcal{V}}X_{i} contains 𝟎\mathbf{0} and is nonempty. We set fi(x)=12Bixbi2f_{i}(x)=\frac{1}{2}\|B_{i}x-b_{i}\|^{2} and hi(x)=1Nx1+Xi(x)h_{i}(x)=\frac{1}{N}\|x\|_{1}+\mathcal{I}_{X_{i}}(x) i𝒱\forall i\in\mathcal{V}. We choose N=20N=20, d=5d=5, and m=3m=3. Besides, we randomly generate each BiB_{i}, bib_{i}, and aia_{i}. The graph 𝒢\mathcal{G} is also randomly generated and contains 2626 links.

The simulation involves PG-EXTRA [12], D-FBBS [13], DPGA [11], and the distributed ADMM [9], which are guaranteed to solve (52) with distinct hih_{i}’s. From Section III, they are specializations of AMM, and the first three algorithms can also be specialized from DAMM. In addition to these existing methods, we run a particular DAMM with ψik()=ϖi():=12BiTBi+ϵId2\psi_{i}^{k}(\cdot)=\varpi_{i}(\cdot):=\frac{1}{2}\|\cdot\|_{B_{i}^{T}B_{i}+\epsilon I_{d}}^{2}, ϵ>0\epsilon>0, which depends on the problem data. Note that this is a new algorithm for solving (52). All these algorithms have similar computational complexities per iteration, while each iteration of the distributed ADMM requires twice the communication cost of the others. The algorithmic forms of PG-EXTRA, D-FBBS, DPGA, and the distributed ADMM are given in Section III. For DPGA, we let ci=cc_{i}=c i𝒱\forall i\in\mathcal{V} for some c>0c>0 and Γij=12c[M𝒢]ij\Gamma_{ij}=\frac{1}{2c}[M_{\mathcal{G}}]_{ij} i𝒱\forall i\in\mathcal{V} j𝒩i{i}\forall j\in\mathcal{N}_{i}\cup\{i\}, where M𝒢M_{\mathcal{G}} is the Metropolis weight matrix defined below Assumption 3. We also set the weight matrix Γ=12M𝒢\Gamma=\frac{1}{2}M_{\mathcal{G}} in the distributed ADMM. We assign P=P~=12M𝒢P=\tilde{P}=\frac{1}{2}M_{\mathcal{G}} to the above new DAMM and to PG-EXTRA and D-FBBS when cast in the form of DAMM. The remaining parameters are all fine-tuned in their theoretical ranges to achieve the best possible performance.
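For concreteness, the following sketch shows one way in which the data of (52) and the data-dependent metric ψ_i of the new DAMM variant could be generated; the Gaussian distributions and the seed are assumptions, as the text only states that B_i, b_i, and a_i are randomly generated.

import numpy as np

rng = np.random.default_rng(0)                    # hypothetical seed; not specified in the text
N, d, m, eps = 20, 5, 3, 0.1                      # eps > 0 is the tunable parameter of psi_i
B = rng.standard_normal((N, m, d))                # local data B_i
b = rng.standard_normal((N, m))                   # local data b_i
a = rng.standard_normal((N, d))                   # centers of the balls X_i

def grad_f(i, x):
    # Gradient of f_i(x) = 0.5 * ||B_i x - b_i||^2.
    return B[i].T @ (B[i] @ x - b[i])

def project_X(i, x):
    # Projection onto X_i = {x : ||x - a_i|| <= ||a_i|| + 1}.
    r = np.linalg.norm(a[i]) + 1.0
    dist = np.linalg.norm(x - a[i])
    return x if dist <= r else a[i] + r * (x - a[i]) / dist

# Data-dependent metric of the new DAMM variant: psi_i^k(.) = 0.5 * ||.||^2 in the
# (B_i^T B_i + eps * I_d)-norm.
Psi = np.stack([B[i].T @ B[i] + eps * np.eye(d) for i in range(N)])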

Figure 1: Convergence performance of PG-EXTRA, D-FBBS, DPGA, distributed ADMM, and DAMM with ψik()=ϖi()\psi_{i}^{k}(\cdot)=\varpi_{i}(\cdot). (a) Optimality error at 𝐱¯k\bar{\mathbf{x}}^{k}; (b) optimality error at 𝐱k\mathbf{x}^{k}.

Figure 1 plots the optimality error |f()+h()f(𝐱)h(𝐱)|+H~12()|f(\cdot)+h(\cdot)-f(\mathbf{x}^{\star})-h(\mathbf{x}^{\star})|+\|\tilde{H}^{\frac{1}{2}}(\cdot)\| (which includes both the objective error and the consensus error) at the running average 𝐱¯k\bar{\mathbf{x}}^{k} and the iterate 𝐱k\mathbf{x}^{k}, respectively. For each of the aforementioned algorithms, the running average produces a smoother curve, while the iterate converges faster. The new DAMM with ψik()=ϖi()\psi_{i}^{k}(\cdot)=\varpi_{i}(\cdot) outperforms the other four methods, suggesting that AMM not only unifies a diversity of existing methods, but also creates novel distributed algorithms that achieve superior performance in solving various convex composite optimization problems.

V-B Comparison with Methods Using Surrogate Functions

This subsection compares the convergence performance of the particular DAMM with ψik()=ϖi()\psi_{i}^{k}(\cdot)=\varpi_{i}(\cdot) in Section V-A with that of NEXT [31] and SONATA [33], which also utilize surrogate functions. However, NEXT and SONATA are only guaranteed to address problem (1) with identical hih_{i}’s. Thus, we change XiX_{i} i𝒱\forall i\in\mathcal{V} in (52) to the unit ball {x|x1}\{x|~{}\|x\|\leq 1\}. The remaining settings comply with Section V-A.

At each iteration, the computational complexities of all the three algorithms are roughly the same, yet the communication costs of NEXT and SONATA double that of the particular DAMM. For NEXT and SONATA, we follow the simulations in [31, 33] to approximate fi(xi)f_{i}(x_{i}) by fi(xi)+τ2xixik2f_{i}(x_{i})+\frac{\tau}{2}\|x_{i}-x_{i}^{k}\|^{2} for some τ>0\tau>0 at each iteration kk, and choose the average matrix as INM𝒢I_{N}-M_{\mathcal{G}}, where M𝒢M_{\mathcal{G}} is the Metropolis weight matrix. The diminishing step-size of NEXT is set as c/kc/k with c>0c>0, and SONATA adopts a fixed step-size. Again, we fine-tune all the algorithm parameters within their theoretical ranges.

Figure 2 displays the optimality error |f(𝐱k)+h(𝐱k)f(𝐱)h(𝐱)|+H~12𝐱k|f(\mathbf{x}^{k})+h(\mathbf{x}^{k})-f(\mathbf{x}^{\star})-h(\mathbf{x}^{\star})|+\|\tilde{H}^{\frac{1}{2}}\mathbf{x}^{k}\| generated by the aforementioned algorithms. Observe that the particular DAMM converges much faster than the other two methods. The main reason for the slow convergence of SONATA in solving this numerical example is that its theoretical step-size is very small.

Figure 2: Convergence performance of NEXT, SONATA, and DAMM with ψik()=ϖi()\psi_{i}^{k}(\cdot)=\varpi_{i}(\cdot).

VI Conclusion

We have introduced a unifying Approximate Method of Multipliers (AMM) that emulates the Method of Multipliers via a surrogate function. Proper designs of the surrogate function lead to a wealth of distributed algorithms for solving convex composite optimization over undirected graphs. Sublinear and linear convergence rates for AMM are established under various mild conditions. The proposed AMM and its distributed realizations generalize a number of well-known methods in the literature. Moreover, the theoretical convergence rates of AMM are no worse than, and sometimes better than, those of such existing specializations of AMM in terms of rate order and problem assumptions. The generality of AMM as well as its convergence analysis provides insights into the design and analysis of distributed primal-dual optimization algorithms, and allows us to explore high-performance specializations of AMM when addressing specific convex optimization problems in practice.

-A Proof of Lemma 1

Define gk+1=uk(𝐱k+1)ρH𝐱k+1H~12𝐯kg^{k+1}\!=\!-\nabla u^{k}(\mathbf{x}^{k+1})\!-\!\rho H\mathbf{x}^{k+1}\!-\!\tilde{H}^{\frac{1}{2}}\mathbf{v}^{k} and g=f(𝐱)H~12𝐯g^{\star}\!=\!-\nabla f(\mathbf{x}^{\star})\!-\!\tilde{H}^{\frac{1}{2}}\mathbf{v}^{\star}. Due to (6) and (7), gk+1h(𝐱k+1)g^{k+1}\in\partial h(\mathbf{x}^{k+1}) and gh(𝐱)g^{\star}\in\partial h(\mathbf{x}^{\star}). It follows from (36) that

H~12(𝐯𝐯k)=\displaystyle\tilde{H}^{\frac{1}{2}}(\mathbf{v}^{\star}-\mathbf{v}^{k})= f(𝐱k)f(𝐱)+Ak(𝐱k+1𝐱k)\displaystyle\nabla f(\mathbf{x}^{k})-\nabla f(\mathbf{x}^{\star})+A^{k}(\mathbf{x}^{k+1}-\mathbf{x}^{k})
+gk+1g+ρH𝐱k+1.\displaystyle+g^{k+1}-g^{\star}+\rho H\mathbf{x}^{k+1}. (53)

Also, since Null(H~12)=Null(H~)=S\operatorname{Null}(\tilde{H}^{\frac{1}{2}})=\operatorname{Null}(\tilde{H})=S, we have H~12𝐱=𝟎\tilde{H}^{\frac{1}{2}}\mathbf{x}^{\star}=\mathbf{0}. This, along with (5), gives 𝐯k𝐯k+1=ρH~12(𝐱k+1𝐱)\mathbf{v}^{k}-\mathbf{v}^{k+1}=-\rho\tilde{H}^{\frac{1}{2}}(\mathbf{x}^{k+1}-\mathbf{x}^{\star}). Thus, 𝐯k𝐯k+1,𝐯k𝐯=ρH~12(𝐱k+1𝐱),𝐯k𝐯=ρ𝐱k+1𝐱,H~12(𝐯𝐯k)\langle\mathbf{v}^{k}-\mathbf{v}^{k+1},\mathbf{v}^{k}-\mathbf{v}^{\star}\rangle=-\rho\langle\tilde{H}^{\frac{1}{2}}(\mathbf{x}^{k+1}-\mathbf{x}^{\star}),\mathbf{v}^{k}-\mathbf{v}^{\star}\rangle=\rho\langle\mathbf{x}^{k+1}-\mathbf{x}^{\star},\tilde{H}^{\frac{1}{2}}(\mathbf{v}^{\star}-\mathbf{v}^{k})\rangle. By substituting (53) into the above equation and using H𝐱=𝟎H\mathbf{x}^{\star}=\mathbf{0}, we obtain 𝐯k𝐯k+1,𝐯k𝐯=ρ𝐱k+1𝐱,f(𝐱k)f(𝐱)+ρ𝐱k+1𝐱,gk+1g+ρ𝐱k+1𝐱,Ak(𝐱k+1𝐱k)+ρ2𝐱k+1H2\langle\mathbf{v}^{k}-\mathbf{v}^{k+1},\mathbf{v}^{k}-\mathbf{v}^{\star}\rangle=\rho\langle\mathbf{x}^{k+1}\!-\!\mathbf{x}^{\star},\nabla f(\mathbf{x}^{k})\!-\!\nabla f(\mathbf{x}^{\star})\rangle\!+\!\rho\langle\mathbf{x}^{k+1}\!-\!\mathbf{x}^{\star},g^{k+1}\!-\!g^{\star}\rangle+\rho\langle\mathbf{x}^{k+1}-\mathbf{x}^{\star},A^{k}(\mathbf{x}^{k+1}-\mathbf{x}^{k})\rangle+\rho^{2}\|\mathbf{x}^{k+1}\|_{H}^{2}. In this equation, 𝐱k+1𝐱,gk+1g0\langle\mathbf{x}^{k+1}-\mathbf{x}^{\star},g^{k+1}-g^{\star}\rangle\geq 0, because gk+1h(𝐱k+1)g^{k+1}\in\partial h(\mathbf{x}^{k+1}), gh(𝐱)g^{\star}\in\partial h(\mathbf{x}^{\star}), and hh is convex. Moreover, due to HH~H\succeq\tilde{H} and (5), we have ρ2𝐱k+1H2ρ2𝐱k+1H~2=𝐯k+1𝐯k2=𝐯k𝐯k+1,𝐯k𝐯𝐯k𝐯k+1,𝐯k+1𝐯\rho^{2}\|\mathbf{x}^{k+1}\|_{H}^{2}\geq\rho^{2}\|\mathbf{x}^{k+1}\|_{\tilde{H}}^{2}=\|\mathbf{v}^{k+1}-\mathbf{v}^{k}\|^{2}=\langle\mathbf{v}^{k}-\mathbf{v}^{k+1},\mathbf{v}^{k}-\mathbf{v}^{\star}\rangle-\langle\mathbf{v}^{k}-\mathbf{v}^{k+1},\mathbf{v}^{k+1}-\mathbf{v}^{\star}\rangle. It follows that 1ρ𝐯k𝐯k+1,𝐯k+1𝐯𝐱k+1𝐱,f(𝐱k)f(𝐱)+𝐱k+1𝐱,Ak(𝐱k+1𝐱k)\frac{1}{\rho}\langle\mathbf{v}^{k}\!-\!\mathbf{v}^{k+1},\mathbf{v}^{k+1}\!-\!\mathbf{v}^{\star}\rangle\geq\langle\mathbf{x}^{k+1}\!-\!\mathbf{x}^{\star},\nabla f(\mathbf{x}^{k})\!-\!\nabla f(\mathbf{x}^{\star})\rangle+\langle\mathbf{x}^{k+1}-\mathbf{x}^{\star},A^{k}(\mathbf{x}^{k+1}-\mathbf{x}^{k})\rangle. Hence, to prove (37), it suffices to show that for any η(0,1)\eta\in(0,1),

𝐱k+1𝐱,f(𝐱k)f(𝐱)\displaystyle\langle\mathbf{x}^{k+1}-\mathbf{x}^{\star},\nabla f(\mathbf{x}^{k})-\nabla f(\mathbf{x}^{\star})\rangle
η𝐱k𝐱,f(𝐱k)f(𝐱)𝐱k+1𝐱kΛM24(1η).\displaystyle\geq\!\eta\langle\mathbf{x}^{k}\!-\!\mathbf{x}^{\star},\nabla f(\mathbf{x}^{k})\!-\!\nabla f(\mathbf{x}^{\star})\rangle\!-\!\frac{\|\mathbf{x}^{k+1}\!-\!\mathbf{x}^{k}\|_{\Lambda_{M}}^{2}}{4(1-\eta)}. (54)

To do so, note from the AM-GM inequality and the Lipschitz continuity of fi\nabla f_{i} that for any i𝒱i\in\mathcal{V} and ci>0c_{i}>0, xik+1xik,fi(xik)fi(x)cifi(xik)fi(x)214cixik+1xik2ciMixikx,fi(xik)fi(x)14cixik+1xik2\langle x_{i}^{k+1}-x_{i}^{k},\nabla f_{i}(x_{i}^{k})-\nabla f_{i}(x^{\star})\rangle\geq-c_{i}\|\nabla f_{i}(x_{i}^{k})-\nabla f_{i}(x^{\star})\|^{2}-\frac{1}{4c_{i}}\|x_{i}^{k+1}-x_{i}^{k}\|^{2}\geq-c_{i}M_{i}\langle x_{i}^{k}-x^{\star},\nabla f_{i}(x_{i}^{k})-\nabla f_{i}(x^{\star})\rangle-\frac{1}{4c_{i}}\|x_{i}^{k+1}-x_{i}^{k}\|^{2}. By adding the above inequality over all i𝒱i\in\mathcal{V} with ci=1ηMic_{i}=\frac{1-\eta}{M_{i}}, we have 𝐱k+1𝐱k,f(𝐱k)f(𝐱)+(1η)𝐱k𝐱,f(𝐱k)f(𝐱)𝐱k+1𝐱kΛM24(1η)\langle\mathbf{x}^{k+1}-\mathbf{x}^{k},\nabla f(\mathbf{x}^{k})-\nabla f(\mathbf{x}^{\star})\rangle+(1-\eta)\langle\mathbf{x}^{k}-\mathbf{x}^{\star},\nabla f(\mathbf{x}^{k})-\nabla f(\mathbf{x}^{\star})\rangle\geq-\frac{\|\mathbf{x}^{k+1}-\mathbf{x}^{k}\|_{\Lambda_{M}}^{2}}{4(1-\eta)}. Then, adding η𝐱k𝐱,f(𝐱k)f(𝐱)\eta\langle\mathbf{x}^{k}-\mathbf{x}^{\star},\nabla f(\mathbf{x}^{k})-\nabla f(\mathbf{x}^{\star})\rangle to both sides of this inequality leads to (54).

-B Proof of Lemma 2

Let k1k\geq 1. Due to (5), H~12𝐱¯k=1kt=1kH~12𝐱t=1ρk(𝐯k𝐯0)\tilde{H}^{\frac{1}{2}}\bar{\mathbf{x}}^{k}=\frac{1}{k}\sum_{t=1}^{k}\tilde{H}^{\frac{1}{2}}\mathbf{x}^{t}=\frac{1}{\rho k}(\mathbf{v}^{k}-\mathbf{v}^{0}), so that (39) holds. Next, we prove (40). For simplicity, define Jt(𝐱)=f(𝐱t),𝐱+𝐱𝐱tA2/2+h(𝐱)+ρ𝐱H2/2+(𝐯t)TH~12𝐱J^{t}(\mathbf{x})=\langle\nabla f(\mathbf{x}^{t}),\mathbf{x}\rangle+\|\mathbf{x}-\mathbf{x}^{t}\|_{A}^{2}/2+h(\mathbf{x})+\rho\|\mathbf{x}\|_{H}^{2}/2+(\mathbf{v}^{t})^{T}\tilde{H}^{\frac{1}{2}}\mathbf{x} t0\forall t\geq 0. Since Jt(𝐱)𝐱A22J^{t}(\mathbf{x})-\frac{\|\mathbf{x}\|_{A}^{2}}{2} is convex, (Jt(𝐱)𝐱A22)(Jt(𝐱t+1)𝐱t+1A22)g~t+1A𝐱t+1,𝐱𝐱t+1\bigl{(}J^{t}(\mathbf{x}^{\star})-\frac{\|\mathbf{x}^{\star}\|_{A}^{2}}{2}\bigr{)}-\bigl{(}J^{t}(\mathbf{x}^{t+1})-\frac{\|\mathbf{x}^{t+1}\|_{A}^{2}}{2}\bigr{)}\geq\langle\tilde{g}^{t+1}\!-\!A\mathbf{x}^{t+1},\mathbf{x}^{\star}\!-\!\mathbf{x}^{t+1}\rangle for all g~t+1Jt(𝐱t+1)\tilde{g}^{t+1}\!\in\!\partial J^{t}(\mathbf{x}^{t+1}). Due to (4), 𝟎Jt(𝐱t+1)\mathbf{0}\in\partial J^{t}(\mathbf{x}^{t+1}). Thus, by letting g~t+1=𝟎\tilde{g}^{t+1}=\mathbf{0} in the above inequality, we obtain

Jt(𝐱t+1)Jt(𝐱)𝐱t+1𝐱A2/2.J^{t}(\mathbf{x}^{t+1})-J^{t}(\mathbf{x}^{\star})\leq-\|\mathbf{x}^{t+1}-\mathbf{x}^{\star}\|_{A}^{2}/2. (55)

In addition, because of (5) and HH~H\succeq\tilde{H},

H~12𝐱t+1,𝐯t=1ρ𝐯t+1𝐯t,𝐯t\displaystyle\langle\tilde{H}^{\frac{1}{2}}\mathbf{x}^{t+1},\mathbf{v}^{t}\rangle=\frac{1}{\rho}\langle\mathbf{v}^{t+1}-\mathbf{v}^{t},\mathbf{v}^{t}\rangle
=\displaystyle= (𝐯t+12𝐯t2𝐯t+1𝐯t2)/(2ρ)\displaystyle(\|\mathbf{v}^{t+1}\|^{2}-\|\mathbf{v}^{t}\|^{2}-\|\mathbf{v}^{t+1}-\mathbf{v}^{t}\|^{2})/(2\rho)
=\displaystyle= (𝐯t+12𝐯t2)/(2ρ)ρ𝐱t+1H~2/2\displaystyle(\|\mathbf{v}^{t+1}\|^{2}-\|\mathbf{v}^{t}\|^{2})/(2\rho)-\rho\|\mathbf{x}^{t+1}\|_{\tilde{H}}^{2}/2
\displaystyle\geq (𝐯t+12𝐯t2)/(2ρ)ρ𝐱t+1H2/2.\displaystyle(\|\mathbf{v}^{t+1}\|^{2}-\|\mathbf{v}^{t}\|^{2})/(2\rho)-\rho\|\mathbf{x}^{t+1}\|_{H}^{2}/2. (56)

Also, due to the convexity and the MiM_{i}-smoothness of each fif_{i},

f(𝐱t+1)f(𝐱)=(f(𝐱t+1)f(𝐱t))+(f(𝐱t)f(𝐱))\displaystyle f(\mathbf{x}^{t+1})-f(\mathbf{x}^{\star})=(f(\mathbf{x}^{t+1})-f(\mathbf{x}^{t}))+(f(\mathbf{x}^{t})-f(\mathbf{x}^{\star}))
\displaystyle\leq f(𝐱t),𝐱t+1𝐱t+𝐱t+1𝐱tΛM2/2+f(𝐱t),𝐱t𝐱\displaystyle\langle\nabla f(\mathbf{x}^{t}),\mathbf{x}^{t+1}\!\!-\!\mathbf{x}^{t}\rangle\!+\!\|\mathbf{x}^{t+1}\!\!-\!\mathbf{x}^{t}\|_{\Lambda_{M}}^{2}/2\!+\!\langle\nabla f(\mathbf{x}^{t}),\mathbf{x}^{t}\!-\!\mathbf{x}^{\star}\rangle
=\displaystyle= f(𝐱t),𝐱t+1𝐱+𝐱t+1𝐱tΛM2/2.\displaystyle\langle\nabla f(\mathbf{x}^{t}),\mathbf{x}^{t+1}-\mathbf{x}^{\star}\rangle+\|\mathbf{x}^{t+1}-\mathbf{x}^{t}\|_{\Lambda_{M}}^{2}/2. (57)

Through combining (55), (56), and (57) and utilizing H𝐱=H~12𝐱=𝟎H\mathbf{x}^{\star}=\tilde{H}^{\frac{1}{2}}\mathbf{x}^{\star}=\mathbf{0}, we derive

f(𝐱t+1)+h(𝐱t+1)f(𝐱)h(𝐱)+𝐯t+12𝐯t22ρ\displaystyle f(\mathbf{x}^{t+1})+h(\mathbf{x}^{t+1})-f(\mathbf{x}^{\star})-h(\mathbf{x}^{\star})+\frac{\|\mathbf{v}^{t+1}\|^{2}-\|\mathbf{v}^{t}\|^{2}}{2\rho}
\displaystyle\leq 𝐱t𝐱A22𝐱t+1𝐱A22+𝐱t+1𝐱tΛMA22.\displaystyle\frac{\|\mathbf{x}^{t}\!-\!\mathbf{x}^{\star}\|_{A}^{2}}{2}\!-\!\frac{\|\mathbf{x}^{t+1}\!-\!\mathbf{x}^{\star}\|_{A}^{2}}{2}\!+\!\frac{\|\mathbf{x}^{t+1}\!-\!\mathbf{x}^{t}\|_{\Lambda_{M}-A}^{2}}{2}. (58)

Now adding (58) over t=0,,k1t=0,\ldots,k-1 yields t=1k(f(𝐱t)+h(𝐱t)f(𝐱)h(𝐱))𝐯022ρ+𝐱0𝐱A22+t=0k1𝐱t+1𝐱tΛMA22\sum_{t=1}^{k}\left(f(\mathbf{x}^{t})+h(\mathbf{x}^{t})-f(\mathbf{x}^{\star})-h(\mathbf{x}^{\star})\right)\leq\frac{\|\mathbf{v}^{0}\|^{2}}{2\rho}+\frac{\|\mathbf{x}^{0}-\mathbf{x}^{\star}\|_{A}^{2}}{2}+\sum_{t=0}^{k-1}\frac{\|\mathbf{x}^{t+1}-\mathbf{x}^{t}\|_{\Lambda_{M}-A}^{2}}{2}. Moreover, the convexity of ff and hh implies f(𝐱¯k)+h(𝐱¯k)1kt=1k(f(𝐱t)+h(𝐱t))f(\bar{\mathbf{x}}^{k})+h(\bar{\mathbf{x}}^{k})\leq\frac{1}{k}\sum_{t=1}^{k}\left(f(\mathbf{x}^{t})+h(\mathbf{x}^{t})\right). Combining the above results in (40).

Finally, to prove (41), note that 𝐱\mathbf{x}^{\star} is an optimal solution of min𝐱f(𝐱)+h(𝐱)+𝐯,H~12𝐱\min_{\mathbf{x}}f(\mathbf{x})+h(\mathbf{x})+\langle\mathbf{v}^{\star},\tilde{H}^{\frac{1}{2}}\mathbf{x}\rangle. It then follows from H~12𝐱=𝟎\tilde{H}^{\frac{1}{2}}\mathbf{x}^{\star}=\mathbf{0} that f(𝐱)+h(𝐱)f(𝐱¯t)+h(𝐱¯t)+𝐯,H~12𝐱¯tf(𝐱¯t)+h(𝐱¯t)+𝐯H~12𝐱¯tf(\mathbf{x}^{\star})+h(\mathbf{x}^{\star})\leq f(\bar{\mathbf{x}}^{t})+h(\bar{\mathbf{x}}^{t})+\langle\mathbf{v}^{\star},\tilde{H}^{\frac{1}{2}}\bar{\mathbf{x}}^{t}\rangle\leq f(\bar{\mathbf{x}}^{t})+h(\bar{\mathbf{x}}^{t})+\|\mathbf{v}^{\star}\|\cdot\|\tilde{H}^{\frac{1}{2}}\bar{\mathbf{x}}^{t}\|. This, together with (39), implies (41).

-C Proof of Lemma 3

From the definition of GG, for each t0t\geq 0,

\begin{aligned}
&\|\mathbf{z}^{t}-\mathbf{z}^{\star}\|_{G}^{2}-\|\mathbf{z}^{t+1}-\mathbf{z}^{\star}\|_{G}^{2}-\|\mathbf{z}^{t+1}-\mathbf{z}^{t}\|_{G}^{2}\\
&=2\langle G(\mathbf{z}^{t}-\mathbf{z}^{t+1}),\mathbf{z}^{t+1}-\mathbf{z}^{\star}\rangle\\
&=2\langle A(\mathbf{x}^{t}-\mathbf{x}^{t+1}),\mathbf{x}^{t+1}-\mathbf{x}^{\star}\rangle+\frac{2}{\rho}\langle\mathbf{v}^{t}-\mathbf{v}^{t+1},\mathbf{v}^{t+1}-\mathbf{v}^{\star}\rangle. \qquad (59)
\end{aligned}
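
The first equality in (59) is the standard three-point identity $\|a-c\|_{G}^{2}-\|b-c\|_{G}^{2}-\|a-b\|_{G}^{2}=2\langle G(a-b),b-c\rangle$ for a positive semidefinite weight $G$. As a quick numerical sanity check (a minimal illustration with randomly generated data, not needed for the proof):

```python
import numpy as np

# Three-point identity behind the first equality in (59):
#   ||a - c||_G^2 - ||b - c||_G^2 - ||a - b||_G^2 = 2 <G(a - b), b - c>
# for any positive semidefinite G, checked here with random data.
rng = np.random.default_rng(0)
n = 6
M = rng.standard_normal((n, n))
G = M @ M.T                          # random positive semidefinite matrix
a, b, c = rng.standard_normal((3, n))
sq = lambda u: u @ G @ u             # squared weighted norm ||u||_G^2
lhs = sq(a - c) - sq(b - c) - sq(a - b)
rhs = 2 * (G @ (a - b)) @ (b - c)
assert np.isclose(lhs, rhs)
```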

By substituting (37) with At=AA^{t}=A and η=0\eta=0 into (59),

\begin{aligned}
\|\mathbf{z}^{t}-\mathbf{z}^{\star}\|_{G}^{2}-\|\mathbf{z}^{t+1}-\mathbf{z}^{\star}\|_{G}^{2}&\geq\frac{1}{\rho}\|\mathbf{v}^{t+1}-\mathbf{v}^{t}\|^{2}+\|\mathbf{x}^{t+1}-\mathbf{x}^{t}\|_{A-\frac{\Lambda_{M}}{2}}^{2}\\
&\geq(1-\underline{\sigma})\|\mathbf{x}^{t+1}-\mathbf{x}^{t}\|_{A}^{2}, \qquad (60)
\end{aligned}

where the last step is due to σ¯AΛM/2\underline{\sigma}A\succeq\Lambda_{M}/2. Adding (60) over t=0,,k1t=0,\ldots,k-1 and using 𝐯k𝐯ρ𝐳k𝐳G\|\mathbf{v}^{k}\!-\!\mathbf{v}^{\star}\|\!\leq\!\sqrt{\rho}\|\mathbf{z}^{k}\!-\!\mathbf{z}^{\star}\|_{G}, we have 1ρ𝐯k𝐯2+(1σ¯)t=0k1𝐱t+1𝐱tA2𝐳0𝐳G2\frac{1}{\rho}\|\mathbf{v}^{k}-\mathbf{v}^{\star}\|^{2}+(1-\underline{\sigma})\sum_{t=0}^{k-1}\|\mathbf{x}^{t+1}-\mathbf{x}^{t}\|_{A}^{2}\leq\|\mathbf{z}^{0}-\mathbf{z}^{\star}\|_{G}^{2}. Therefore, (42) can be proved by combining the above inequality with 𝐯k𝐯0𝐯k𝐯+𝐯0𝐯\|\mathbf{v}^{k}-\mathbf{v}^{0}\|\leq\|\mathbf{v}^{k}-\mathbf{v}^{\star}\|+\|\mathbf{v}^{0}-\mathbf{v}^{\star}\|. Furthermore, the above inequality, along with AΛMAA\succ\Lambda_{M}-A, implies (43).

-D Proof of Proposition 1

For convenience, let f~(x)=i𝒱fi(x)\tilde{f}(x)=\sum_{i\in\mathcal{V}}f_{i}(x). Since 2f~(x)𝐎\nabla^{2}\tilde{f}(x^{\star})\succ\mathbf{O} and 2f~\nabla^{2}\tilde{f} is continuous around xx^{\star}, there exist ϵ1(0,ϵ]\epsilon_{1}\in(0,\epsilon] and c>0c>0 such that 2f~(x)cId\nabla^{2}\tilde{f}(x)\succeq cI_{d} xB(x,ϵ1)\forall x\in B(x^{\star},\epsilon_{1}). Therefore,

\langle\nabla\tilde{f}(x)-\nabla\tilde{f}(x^{\star}),x-x^{\star}\rangle\geq c\|x-x^{\star}\|^{2},\quad\forall x\in B(x^{\star},\epsilon_{1}). \qquad (61)
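
Here (61) follows from the Hessian bound via the mean-value form of the gradient: since $B(x^{\star},\epsilon_{1})$ is convex, for any $x\in B(x^{\star},\epsilon_{1})$,

\langle\nabla\tilde{f}(x)-\nabla\tilde{f}(x^{\star}),x-x^{\star}\rangle=\int_{0}^{1}(x-x^{\star})^{T}\nabla^{2}\tilde{f}(x^{\star}+\tau(x-x^{\star}))(x-x^{\star})\,d\tau\geq c\|x-x^{\star}\|^{2}.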

Let $C\subset\mathbb{R}^{d}$ be any convex and compact set containing $x^{\star}$, and suppose $x\in C\setminus B(x^{\star},\epsilon_{1})$. Let $z=x^{\star}+\epsilon_{1}\frac{x-x^{\star}}{\|x-x^{\star}\|}\in B(x^{\star},\epsilon_{1})$. Since $\tilde{f}$ is convex, $x-x^{\star}=\frac{x-z}{1-\epsilon_{1}/\|x-x^{\star}\|}$, and $\|x-x^{\star}\|>\epsilon_{1}$, we have $\langle\nabla\tilde{f}(x)-\nabla\tilde{f}(z),x-x^{\star}\rangle\geq 0$. Moreover, using $x-x^{\star}=\frac{\|x-x^{\star}\|}{\epsilon_{1}}(z-x^{\star})$, (61), and $\|z-x^{\star}\|=\epsilon_{1}$, we obtain $\langle\nabla\tilde{f}(z)-\nabla\tilde{f}(x^{\star}),x-x^{\star}\rangle=\frac{\|x-x^{\star}\|}{\epsilon_{1}}\langle\nabla\tilde{f}(z)-\nabla\tilde{f}(x^{\star}),z-x^{\star}\rangle\geq\frac{c\|x-x^{\star}\|}{\epsilon_{1}}\|z-x^{\star}\|^{2}=\frac{c\epsilon_{1}}{\|x-x^{\star}\|}\|x-x^{\star}\|^{2}$. Adding the above two inequalities leads to $\langle\nabla\tilde{f}(x)-\nabla\tilde{f}(x^{\star}),x-x^{\star}\rangle\geq\frac{c\epsilon_{1}}{\max_{y\in C}\|y-x^{\star}\|}\|x-x^{\star}\|^{2}$. This, together with (61), implies that $\tilde{f}$ is restricted strongly convex with respect to $x^{\star}$ on $C$ (with convexity parameter $\min\{c,\frac{c\epsilon_{1}}{\max_{y\in C}\|y-x^{\star}\|}\}$), so that Assumption 7 holds.
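
As a concrete illustration of Proposition 1 (a hypothetical one-dimensional example, not taken from the paper), consider $\tilde{f}(x)=\log\cosh(x)$: it satisfies $\tilde{f}''(0)=1>0$ at its minimizer $x^{\star}=0$ and is not globally strongly convex since $\tilde{f}''(x)\to 0$ as $|x|\to\infty$, yet it is restricted strongly convex on any compact interval, as the following minimal sketch checks numerically:

```python
import numpy as np

# Hypothetical example (not from the paper): f(x) = log(cosh(x)) has
# f''(0) = 1 > 0 at its minimizer x* = 0, but f''(x) -> 0 as |x| -> infinity.
# Proposition 1 still yields restricted strong convexity on C = [-R, R].
R = 10.0
x = np.linspace(-R, R, 100001)
x = x[x != 0.0]                     # exclude x* itself
ratio = np.tanh(x) * x / x**2       # <f'(x) - f'(0), x - 0> / ||x - 0||^2
print(f"restricted strong convexity constant on [-{R}, {R}]: {ratio.min():.4f}")
assert ratio.min() > 0
```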

-E Proof of Lemma 4

For simplicity, define $c^{k}:=\|\mathbf{z}^{k}-\mathbf{z}^{\star}\|_{\tilde{G}}^{2}+\rho\|\mathbf{x}^{k}\|_{\tilde{H}}^{2}$. Since $\tilde{H}\mathbf{x}^{\star}=\mathbf{0}$, $\|\mathbf{x}^{k}-\mathbf{x}^{\star}\|_{A_{a}+\rho\tilde{H}}^{2}=\|\mathbf{x}^{k}-\mathbf{x}^{\star}\|_{A_{a}}^{2}+\rho\|\mathbf{x}^{k}\|_{\tilde{H}}^{2}\leq c^{k}$. Thus, to show $\mathbf{x}^{k}\in\mathcal{C}$ $\forall k\geq 0$, it suffices to show that $c^{k}\leq c^{0}$ $\forall k\geq 0$. We prove this by induction. Clearly, $c^{k}\leq c^{0}$ holds for $k=0$. Now suppose $c^{k}\leq c^{0}$ for some $k\geq 0$; we show below that $c^{k+1}\leq c^{k}\leq c^{0}$. Using (37) and (59) with $G$ replaced by $\tilde{G}$,

\begin{aligned}
&\|\mathbf{z}^{k}-\mathbf{z}^{\star}\|_{\tilde{G}}^{2}-\|\mathbf{z}^{k+1}-\mathbf{z}^{\star}\|_{\tilde{G}}^{2}-\|\mathbf{z}^{k+1}-\mathbf{z}^{k}\|_{\tilde{G}}^{2}\\
&\geq 2\eta\langle\mathbf{x}^{k}-\mathbf{x}^{\star},\nabla f(\mathbf{x}^{k})-\nabla f(\mathbf{x}^{\star})\rangle\\
&\quad+2\langle\mathbf{x}^{k+1}-\mathbf{x}^{\star},(A^{k}-A_{a})(\mathbf{x}^{k+1}-\mathbf{x}^{k})\rangle-\frac{\|\mathbf{x}^{k+1}-\mathbf{x}^{k}\|_{\Lambda_{M}}^{2}}{2(1-\eta)}. \qquad (62)
\end{aligned}

On the other hand, since AAkAuA_{\ell}\preceq A^{k}\preceq A_{u}, we have AkAaΔ/2\|A^{k}-A_{a}\|\leq\Delta/2. Due to (44), there exist β>0\beta>0 and σ(0,1)\sigma\in(0,1) such that (47) holds. Then, through the AM-GM inequality, 𝐱k+1𝐱,(AkAa)(𝐱k+1𝐱k)=𝐱k𝐱,(AkAa)(𝐱k+1𝐱k)+𝐱k+1𝐱k,(AkAa)(𝐱k+1𝐱k)12(βΔ𝐱k𝐱2+Δ4β𝐱k+1𝐱k2)Δ2𝐱k+1𝐱k2\langle\mathbf{x}^{k+1}-\mathbf{x}^{\star},(A^{k}-A_{a})(\mathbf{x}^{k+1}-\mathbf{x}^{k})\rangle=\langle\mathbf{x}^{k}-\mathbf{x}^{\star},(A^{k}-A_{a})(\mathbf{x}^{k+1}-\mathbf{x}^{k})\rangle+\langle\mathbf{x}^{k+1}-\mathbf{x}^{k},(A^{k}-A_{a})(\mathbf{x}^{k+1}-\mathbf{x}^{k})\rangle\geq-\frac{1}{2}(\beta\Delta\|\mathbf{x}^{k}-\mathbf{x}^{\star}\|^{2}+\frac{\Delta}{4\beta}\|\mathbf{x}^{k+1}-\mathbf{x}^{k}\|^{2})-\frac{\Delta}{2}\|\mathbf{x}^{k+1}\!-\!\mathbf{x}^{k}\|^{2}, which further implies

\begin{aligned}
&\|\mathbf{x}^{k+1}-\mathbf{x}^{k}\|_{A_{a}}^{2}+2\langle\mathbf{x}^{k+1}-\mathbf{x}^{\star},(A^{k}-A_{a})(\mathbf{x}^{k+1}-\mathbf{x}^{k})\rangle-\frac{\|\mathbf{x}^{k+1}-\mathbf{x}^{k}\|_{\Lambda_{M}}^{2}}{2(1-\eta)}\\
&\geq(1-\sigma)\|\mathbf{x}^{k+1}-\mathbf{x}^{k}\|_{A_{a}}^{2}-\beta\Delta\|\mathbf{x}^{k}-\mathbf{x}^{\star}\|^{2}.
\end{aligned}
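
The key estimate in the preceding display is a Young-type inequality: for any $\beta>0$ and any matrix $M$ with $\|M\|\leq\Delta/2$, $\langle u,Mw\rangle\geq-\frac{1}{2}\bigl(\beta\Delta\|u\|^{2}+\frac{\Delta}{4\beta}\|w\|^{2}\bigr)$, applied with $u=\mathbf{x}^{k}-\mathbf{x}^{\star}$, $M=A^{k}-A_{a}$, and $w=\mathbf{x}^{k+1}-\mathbf{x}^{k}$. A quick numerical sanity check (an illustration only, with randomly generated data):

```python
import numpy as np

# Young-type bound used above, with ||M|| <= Delta/2:
#   <u, M w> >= -(beta*Delta*||u||^2 + (Delta/(4*beta))*||w||^2) / 2.
rng = np.random.default_rng(1)
n, Delta, beta = 4, 2.0, 0.7
M = rng.standard_normal((n, n))
M *= (Delta / 2) / np.linalg.norm(M, 2)    # rescale so that ||M|| = Delta/2
u, w = rng.standard_normal((2, n))
lhs = u @ (M @ w)
rhs = -0.5 * (beta * Delta * u @ u + (Delta / (4 * beta)) * w @ w)
assert lhs >= rhs
```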

Substituting this into (62) and applying (5), we obtain

\begin{aligned}
c^{k}-c^{k+1}&\geq(1-\sigma)\|\mathbf{x}^{k+1}-\mathbf{x}^{k}\|_{A_{a}}^{2}-\beta\Delta\|\mathbf{x}^{k}-\mathbf{x}^{\star}\|^{2}\\
&\quad+2\eta\langle\mathbf{x}^{k}-\mathbf{x}^{\star},\nabla f(\mathbf{x}^{k})-\nabla f(\mathbf{x}^{\star})\rangle+\rho\|\mathbf{x}^{k}\|_{\tilde{H}}^{2}. \qquad (63)
\end{aligned}
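
Here, "applying (5)" converts the dual increment into a primal quantity, exactly as in (56): by (5), $\mathbf{v}^{k+1}-\mathbf{v}^{k}=\rho\tilde{H}^{\frac{1}{2}}\mathbf{x}^{k+1}$, so

\frac{1}{\rho}\|\mathbf{v}^{k+1}-\mathbf{v}^{k}\|^{2}=\rho\|\tilde{H}^{\frac{1}{2}}\mathbf{x}^{k+1}\|^{2}=\rho\|\mathbf{x}^{k+1}\|_{\tilde{H}}^{2},

and this term, coming from $\|\mathbf{z}^{k+1}-\mathbf{z}^{k}\|_{\tilde{G}}^{2}$ in the analog of (59) with $G$ replaced by $\tilde{G}$, offsets $\rho\|\mathbf{x}^{k+1}\|_{\tilde{H}}^{2}$ in $c^{k+1}$.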

Due to the hypothesis ckc0c^{k}\leq c^{0}, we have 𝐱k𝒞\mathbf{x}^{k}\in\mathcal{C}. It then follows from Proposition 2 and H~𝐱=𝟎\tilde{H}\mathbf{x}^{\star}=\mathbf{0} that 2𝐱k𝐱,f(𝐱k)f(𝐱)2mρ𝐱k𝐱2ρ𝐱kH~22\langle\mathbf{x}^{k}-\mathbf{x}^{\star},\nabla f(\mathbf{x}^{k})-\nabla f(\mathbf{x}^{\star})\rangle\geq 2m_{\rho}\|\mathbf{x}^{k}-\mathbf{x}^{\star}\|^{2}-\rho\|\mathbf{x}^{k}\|_{\tilde{H}}^{2}. This and (63) give

\begin{aligned}
c^{k}-c^{k+1}&\geq(1-\sigma)\|\mathbf{x}^{k+1}-\mathbf{x}^{k}\|_{A_{a}}^{2}\\
&\quad+(2\eta m_{\rho}-\beta\Delta)\|\mathbf{x}^{k}-\mathbf{x}^{\star}\|^{2}+\rho(1-\eta)\|\mathbf{x}^{k}\|_{\tilde{H}}^{2}. \qquad (64)
\end{aligned}

Note that σ(0,1)\sigma\in(0,1), βΔ<2ηmρ\beta\Delta<2\eta m_{\rho} due to (47), and η(0,1)\eta\in(0,1). Thus, the right-hand side of (64) is nonnegative, which implies ck+1ckc0c^{k+1}\leq c^{k}\leq c^{0}.

-F Proof of Theorem 2

Recall from the paragraph above Theorem 2 that we force 𝐯0𝐯S\mathbf{v}^{0}-\mathbf{v}^{\star}\in S^{\bot}. Also, due to (5), 𝐯k𝐯0Range(H~12)=S\mathbf{v}^{k}-\mathbf{v}^{0}\in\operatorname{Range}(\tilde{H}^{\frac{1}{2}})=S^{\bot}. Hence, 𝐯k𝐯Range(H~12)\mathbf{v}^{k}-\mathbf{v}^{\star}\in\operatorname{Range}(\tilde{H}^{\frac{1}{2}}). Moreover, since h(𝐱)0h(\mathbf{x})\equiv 0, gk+1=g=𝟎g^{k+1}=g^{\star}=\mathbf{0}. It then follows from (53) that 𝐯𝐯k=(H~12)(f(𝐱k)f(𝐱)+(Ak+ρH)(𝐱k+1𝐱k)+ρH𝐱k)\mathbf{v}^{\star}-\mathbf{v}^{k}=(\tilde{H}^{\frac{1}{2}})^{\dagger}(\nabla f(\mathbf{x}^{k})-\nabla f(\mathbf{x}^{\star})+(A^{k}+\rho H)(\mathbf{x}^{k+1}-\mathbf{x}^{k})+\rho H\mathbf{x}^{k}). Then, via the AM-GM inequality, for any θ1,θ2>0\theta_{1},\theta_{2}>0,

\begin{aligned}
\|\mathbf{v}^{k}-\mathbf{v}^{\star}\|^{2}&\leq(1+\theta_{1})(1+\theta_{2})\|\nabla f(\mathbf{x}^{k})-\nabla f(\mathbf{x}^{\star})\|_{\tilde{H}^{\dagger}}^{2}\\
&\quad+2(1+\theta_{1})(1+1/\theta_{2})\|\mathbf{x}^{k+1}-\mathbf{x}^{k}\|_{(A^{k})^{T}\tilde{H}^{\dagger}A^{k}+\rho^{2}H\tilde{H}^{\dagger}H}^{2}\\
&\quad+\rho^{2}(1+1/\theta_{1})\|H\mathbf{x}^{k}\|_{\tilde{H}^{\dagger}}^{2}. \qquad (65)
\end{aligned}
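
The splitting behind (65) is the elementary inequality $\|a+b\|_{P}^{2}\leq(1+\theta)\|a\|_{P}^{2}+(1+1/\theta)\|b\|_{P}^{2}$ for $P\succeq\mathbf{O}$ and $\theta>0$ (a consequence of the AM-GM inequality), applied with $P=\tilde{H}^{\dagger}$: first,

\begin{aligned}
\|\mathbf{v}^{k}-\mathbf{v}^{\star}\|^{2}&=\|\nabla f(\mathbf{x}^{k})-\nabla f(\mathbf{x}^{\star})+(A^{k}+\rho H)(\mathbf{x}^{k+1}-\mathbf{x}^{k})+\rho H\mathbf{x}^{k}\|_{\tilde{H}^{\dagger}}^{2}\\
&\leq(1+\theta_{1})\|\nabla f(\mathbf{x}^{k})-\nabla f(\mathbf{x}^{\star})+(A^{k}+\rho H)(\mathbf{x}^{k+1}-\mathbf{x}^{k})\|_{\tilde{H}^{\dagger}}^{2}+\rho^{2}(1+1/\theta_{1})\|H\mathbf{x}^{k}\|_{\tilde{H}^{\dagger}}^{2},
\end{aligned}

and then the first term is split once more with $\theta_{2}$, using $\|(A^{k}+\rho H)(\mathbf{x}^{k+1}-\mathbf{x}^{k})\|_{\tilde{H}^{\dagger}}^{2}\leq 2\|\mathbf{x}^{k+1}-\mathbf{x}^{k}\|_{(A^{k})^{T}\tilde{H}^{\dagger}A^{k}+\rho^{2}H\tilde{H}^{\dagger}H}^{2}$.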

From (65), HH~Hλ¯H12H~H12H\tilde{H}^{\dagger}H\preceq\bar{\lambda}H^{\frac{1}{2}}\tilde{H}^{\dagger}H^{\frac{1}{2}}, H~INdλH~\tilde{H}^{\dagger}\preceq\frac{I_{Nd}}{\lambda_{\tilde{H}}}, and the MiM_{i}-smoothness of each fif_{i}, we have that for all δ(0,1)\delta\in(0,1),

\begin{aligned}
\delta\|\mathbf{z}^{k}-\mathbf{z}^{\star}\|_{\tilde{G}}^{2}&\leq\delta\|\mathbf{x}^{k}-\mathbf{x}^{\star}\|_{\frac{(1+\theta_{1})(1+\theta_{2})\Lambda_{M}^{2}}{\rho\lambda_{\tilde{H}}}+A_{a}}^{2}\\
&\quad+\frac{2\delta(1+\theta_{1})(1+1/\theta_{2})}{\rho}\|\mathbf{x}^{k+1}-\mathbf{x}^{k}\|_{A_{u}^{2}/\lambda_{\tilde{H}}+\rho^{2}\bar{\lambda}H^{\frac{1}{2}}\tilde{H}^{\dagger}H^{\frac{1}{2}}}^{2}\\
&\quad+\delta\rho(1+1/\theta_{1})\|H\mathbf{x}^{k}\|_{\tilde{H}^{\dagger}}^{2}.
\end{aligned}

Note that (64) holds here because (44) guarantees (47) with some β>0\beta>0 and σ(0,1)\sigma\in(0,1). Then, by combining the above inequality and (64), we obtain (1δ)(𝐳k𝐳G~2+ρ𝐱kH~2)(𝐳k+1𝐳G~2+ρ𝐱k+1H~2)𝐱k𝐱B1(δ)2+𝐱kB2(δ)2+𝐱k+1𝐱kB3(δ)2(1\!-\!\delta)(\|\mathbf{z}^{k}\!-\!\mathbf{z}^{\star}\|_{\tilde{G}}^{2}\!+\!\rho\|\mathbf{x}^{k}\|_{\tilde{H}}^{2})\!-\!(\|\mathbf{z}^{k+1}\!-\!\mathbf{z}^{\star}\|_{\tilde{G}}^{2}\!+\!\rho\|\mathbf{x}^{k+1}\|_{\tilde{H}}^{2})\geq\|\mathbf{x}^{k}-\mathbf{x}^{\star}\|_{B_{1}(\delta)}^{2}+\|\mathbf{x}^{k}\|_{B_{2}(\delta)}^{2}+\|\mathbf{x}^{k+1}-\mathbf{x}^{k}\|_{B_{3}(\delta)}^{2}. Note from (47) that 2ηmρβΔ>02\eta m_{\rho}-\beta\Delta>0 and Aa𝐎A_{a}\succ\mathbf{O}. Also note that θ1>0\theta_{1}>0, θ2>0\theta_{2}>0, ρ>0\rho>0, λH~>0\lambda_{\tilde{H}}>0, η(0,1)\eta\in(0,1), H~𝐎\tilde{H}\succeq\mathbf{O}, Range(H~)=Range(H)=S\operatorname{Range}(\tilde{H})=\operatorname{Range}(H)=S^{\bot}, and σ(0,1)\sigma\in(0,1). Therefore, there exists δ(0,1)\delta\in(0,1) such that Bi(δ)𝐎B_{i}(\delta)\succeq\mathbf{O} i{1,2,3}\forall i\in\{1,2,3\}, which guarantees (46).

-G Proof of Theorem 3

In the proof of Lemma 4, we have shown that (ck)k=0(c^{k})_{k=0}^{\infty} is non-increasing. Thus, 𝐳k𝐳G~2ckc0=d02/ρ\|\mathbf{z}^{k}-\mathbf{z}^{\star}\|_{\tilde{G}}^{2}\leq c^{k}\leq c^{0}=d_{0}^{2}/\rho. It follows that 𝐯0𝐯k𝐯0𝐯+𝐯𝐯k𝐯0𝐯+ρ𝐳k𝐳G~𝐯0𝐯+d0\|\mathbf{v}^{0}-\mathbf{v}^{k}\|\leq\|\mathbf{v}^{0}-\mathbf{v}^{\star}\|+\|\mathbf{v}^{\star}-\mathbf{v}^{k}\|\leq\|\mathbf{v}^{0}-\mathbf{v}^{\star}\|+\sqrt{\rho}\|\mathbf{z}^{k}-\mathbf{z}^{\star}\|_{\tilde{G}}\leq\|\mathbf{v}^{0}-\mathbf{v}^{\star}\|+d_{0}. By substituting this into (39) and (41), we obtain (49) and (51). Note that this substitution is legitimate, since (39) and (41) do not rely on (38) to hold. To prove (50), we let J~t(𝐱)=ut(𝐱)+h(𝐱)+ρ2𝐱H2+(𝐯t)TH~12𝐱\tilde{J}^{t}(\mathbf{x})=u^{t}(\mathbf{x})+h(\mathbf{x})+\frac{\rho}{2}\|\mathbf{x}\|_{H}^{2}+(\mathbf{v}^{t})^{T}\tilde{H}^{\frac{1}{2}}\mathbf{x} t0\forall t\geq 0. Since 2ut(𝐱)A\nabla^{2}u^{t}(\mathbf{x})\succeq A_{\ell} 𝐱Nd\forall\mathbf{x}\in\mathbb{R}^{Nd}, J~t(𝐱)𝐱A2/2\tilde{J}^{t}(\mathbf{x})-\|\mathbf{x}\|_{A_{\ell}}^{2}/2 is convex. Similar to the derivation of (55),

\tilde{J}^{t}(\mathbf{x}^{t+1})-\tilde{J}^{t}(\mathbf{x}^{\star})\leq-\|\mathbf{x}^{t+1}-\mathbf{x}^{\star}\|_{A_{\ell}}^{2}/2. \qquad (66)

Since A2ut(𝐱)AuA_{\ell}\preceq\nabla^{2}u^{t}(\mathbf{x})\preceq A_{u} 𝐱Nd\forall\mathbf{x}\in\mathbb{R}^{Nd}, we have ut(𝐱t+1)ut(𝐱t)+ut(𝐱t),𝐱t+1𝐱t+𝐱t+1𝐱tA2/2u^{t}(\mathbf{x}^{t+1})\geq u^{t}(\mathbf{x}^{t})+\langle\nabla u^{t}(\mathbf{x}^{t}),\mathbf{x}^{t+1}-\mathbf{x}^{t}\rangle+\|\mathbf{x}^{t+1}-\mathbf{x}^{t}\|_{A_{\ell}}^{2}/2 and ut(𝐱)ut(𝐱t)+ut(𝐱t),𝐱𝐱t+𝐱𝐱tAu2/2u^{t}(\mathbf{x}^{\star})\leq u^{t}(\mathbf{x}^{t})+\langle\nabla u^{t}(\mathbf{x}^{t}),\mathbf{x}^{\star}-\mathbf{x}^{t}\rangle+\|\mathbf{x}^{\star}-\mathbf{x}^{t}\|_{A_{u}}^{2}/2. These two inequalities along with ut(𝐱t)=f(𝐱t)\nabla u^{t}(\mathbf{x}^{t})=\nabla f(\mathbf{x}^{t}) imply

\begin{aligned}
u^{t}(\mathbf{x}^{t+1})-u^{t}(\mathbf{x}^{\star})&\geq\langle\nabla f(\mathbf{x}^{t}),\mathbf{x}^{t+1}-\mathbf{x}^{\star}\rangle\\
&\quad+\|\mathbf{x}^{t+1}-\mathbf{x}^{t}\|_{A_{\ell}}^{2}/2-\|\mathbf{x}^{\star}-\mathbf{x}^{t}\|_{A_{u}}^{2}/2. \qquad (67)
\end{aligned}

Note that (56) and (57) hold without requiring (38). Hence, by combining (66), (67), (56), (57), and $H\mathbf{x}^{\star}=\tilde{H}^{\frac{1}{2}}\mathbf{x}^{\star}=\mathbf{0}$, we obtain $f(\mathbf{x}^{t+1})+h(\mathbf{x}^{t+1})-f(\mathbf{x}^{\star})-h(\mathbf{x}^{\star})+\frac{\|\mathbf{v}^{t+1}\|^{2}-\|\mathbf{v}^{t}\|^{2}}{2\rho}\leq\frac{\|\mathbf{x}^{t}-\mathbf{x}^{\star}\|_{A_{u}}^{2}}{2}-\frac{\|\mathbf{x}^{t+1}-\mathbf{x}^{\star}\|_{A_{\ell}}^{2}}{2}+\frac{\|\mathbf{x}^{t+1}-\mathbf{x}^{t}\|_{\Lambda_{M}-A_{\ell}}^{2}}{2}$. Similar to the derivation of (40) from (58), it follows that $f(\bar{\mathbf{x}}^{k})+h(\bar{\mathbf{x}}^{k})-f(\mathbf{x}^{\star})-h(\mathbf{x}^{\star})\leq\frac{1}{k}\bigl(\frac{\|\mathbf{v}^{0}\|^{2}}{2\rho}+\frac{\|\mathbf{x}^{0}-\mathbf{x}^{\star}\|_{A_{u}}^{2}}{2}\bigr)+\frac{1}{k}\sum_{t=0}^{k-1}\frac{\|\mathbf{x}^{t+1}-\mathbf{x}^{t}\|_{\Lambda_{M}-A_{\ell}}^{2}}{2}+\frac{\Delta}{2k}\sum_{t=1}^{k-1}\|\mathbf{x}^{t}-\mathbf{x}^{\star}\|^{2}$ (see the expanded telescoping step displayed below). From (64), $\sum_{t=0}^{k-1}\|\mathbf{x}^{t+1}-\mathbf{x}^{t}\|_{A_{a}}^{2}\leq\frac{c^{0}-c^{k}}{1-\sigma}\leq\frac{d_{0}^{2}}{\rho(1-\sigma)}$ and $\sum_{t=1}^{k-1}\|\mathbf{x}^{t}-\mathbf{x}^{\star}\|^{2}\leq\frac{c^{1}-c^{k}}{2\eta m_{\rho}-\beta\Delta}\leq\frac{d_{0}^{2}}{\rho(2\eta m_{\rho}-\beta\Delta)}$, where $\beta>0$ and $\sigma\in(0,1)$ satisfying (47) exist due to (44). Also, from (44), $\Lambda_{M}-A_{\ell}\preceq\Lambda_{M}\preceq 2(1-\eta)A_{a}$. Combining the above gives (50).
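
In the telescoping step referenced above, the mismatched weights $A_{u}$ and $A_{\ell}$ are what produce the $\frac{\Delta}{2k}$-term: assuming $A_{u}-A_{\ell}\preceq\Delta I_{Nd}$ (consistent with the bound $\|A^{k}-A_{a}\|\leq\Delta/2$ used in the proof of Lemma 4),

\begin{aligned}
\sum_{t=0}^{k-1}\Bigl(\frac{\|\mathbf{x}^{t}-\mathbf{x}^{\star}\|_{A_{u}}^{2}}{2}-\frac{\|\mathbf{x}^{t+1}-\mathbf{x}^{\star}\|_{A_{\ell}}^{2}}{2}\Bigr)&=\frac{\|\mathbf{x}^{0}-\mathbf{x}^{\star}\|_{A_{u}}^{2}}{2}-\frac{\|\mathbf{x}^{k}-\mathbf{x}^{\star}\|_{A_{\ell}}^{2}}{2}+\sum_{t=1}^{k-1}\frac{\|\mathbf{x}^{t}-\mathbf{x}^{\star}\|_{A_{u}-A_{\ell}}^{2}}{2}\\
&\leq\frac{\|\mathbf{x}^{0}-\mathbf{x}^{\star}\|_{A_{u}}^{2}}{2}+\frac{\Delta}{2}\sum_{t=1}^{k-1}\|\mathbf{x}^{t}-\mathbf{x}^{\star}\|^{2}.
\end{aligned}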

References

  • [1] A. Nedić, “Convergence rate of distributed averaging dynamics and optimization in networks,” Foundations and Trends in Systems and Control, vol. 2, no. 1, pp. 1–100, 2015.
  • [2] D. Bajović, D. Jakovetić, N. Krejić, and N. Krklec Jerinkić, “Newton-like method with diagonal correction for distributed optimization,” SIAM Journal on Optimization, vol. 27, no. 2, pp. 1171–1203, 2017.
  • [3] F. Mansoori and E. Wei, “Fast distributed asynchronous Newton-based optimization algorithm,” IEEE Transactions on Automatic Control, vol. 65, no. 7, pp. 2769–2784, 2020.
  • [4] W. Shi, Q. Ling, K. Yuan, G. Wu, and W. Yin, “On the linear convergence of the ADMM in decentralized consensus optimization,” IEEE Transactions on Signal Processing, vol. 62, no. 7, pp. 1750–1761, 2014.
  • [5] W. Shi, Q. Ling, G. Wu, and W. Yin, “EXTRA: An exact first-order algorithm for decentralized consensus optimization,” SIAM Journal on Optimization, vol. 25, no. 2, pp. 944–966, 2015.
  • [6] Y. Liu, W. Xu, G. Wu, Z. Tian, and Q. Ling, “Communication-censored ADMM for decentralized consensus optimization,” IEEE Transactions on Signal Processing, vol. 67, no. 10, pp. 2565–2579, 2019.
  • [7] Z. Li, W. Shi, and M. Yan, “A decentralized proximal-gradient method with network independent stepsizes and separated convergence rates,” IEEE Transactions on Signal Processing, vol. 67, no. 17, pp. 4494–4506, 2019.
  • [8] I. Notarnicola and G. Notarstefano, “Asynchronous distributed optimization via randomized dual proximal gradient,” IEEE Transactions on Automatic Control, vol. 62, no. 5, pp. 2095–2106, 2017.
  • [9] A. Makhdoumi and A. Ozdaglar, “Convergence rate of distributed ADMM over networks,” IEEE Transactions on Automatic Control, vol. 62, no. 10, pp. 5082–5095, 2017.
  • [10] J. Zeng, T. He, and M. Wang, “A fast proximal gradient algorithm for decentralized composite optimization over directed networks,” Systems & Control Letters, vol. 107, pp. 36–43, 2017.
  • [11] N. S. Aybat, Z. Wang, T. Lin, and S. Ma, “Distributed linearized alternating direction method of multipliers for composite convex consensus optimization,” IEEE Transactions on Automatic Control, vol. 63, no. 1, pp. 5–20, 2018.
  • [12] W. Shi, Q. Ling, G. Wu, and W. Yin, “A proximal gradient algorithm for decentralized composite optimization,” IEEE Transactions on Signal Processing, vol. 63, no. 22, pp. 6013–6023, 2015.
  • [13] J. Xu, S. Zhu, Y. Soh, and L. Xie, “A Bregman splitting scheme for distributed optimization over networks,” IEEE Transactions on Automatic Control, vol. 63, no. 11, pp. 3809–3824, 2018.
  • [14] J. Lei, H. Chen, and H. Fang, “Primal-dual algorithm for distributed constrained optimization,” Systems & Control Letters, vol. 96, pp. 110–117, 2016.
  • [15] X. Wu and J. Lu, “Fenchel dual gradient methods for distributed convex optimization over time-varying networks,” IEEE Transactions on Automatic Control, vol. 64, no. 11, pp. 4629–4636, 2019.
  • [16] M. Hong and T. Chang, “Stochastic proximal gradient consensus over random networks,” IEEE Transactions on Signal Processing, vol. 65, no. 11, pp. 2933–2948, 2017.
  • [17] P. Bianchi, W. Hachem, and F. Iutzeler, “A coordinate descent primal-dual algorithm and application to distributed asynchronous optimization,” IEEE Transactions on Automatic Control, vol. 61, no. 10, pp. 2947–2957, 2016.
  • [18] A. Mokhtari, W. Shi, Q. Ling, and A. Ribeiro, “A decentralized second-order method with exact linear convergence rate for consensus optimization,” IEEE Transactions on Signal and Information Processing over Networks, vol. 2, no. 4, pp. 507–522, 2016.
  • [19] ——, “DQM: Decentralized quadratically approximated alternating direction method of multipliers,” IEEE Transactions on Signal Processing, vol. 64, no. 19, pp. 5158–5173, 2016.
  • [20] X. Wu, Z. Qu, and J. Lu, “A second-order proximal algorithm for consensus optimization,” IEEE Transactions on Automatic Control, vol. 66, no. 4, pp. 1864–1871, 2021.
  • [21] J. Xu, Y. Tian, Y. Sun, and G. Scutari, “Distributed algorithms for composite optimization: Unified and tight convergence analysis,” arXiv preprint arXiv:2002.11534, 2020.
  • [22] S. A. Alghunaim, E. Ryu, K. Yuan, and A. H. Sayed, “Decentralized proximal gradient algorithms with linear convergence rates,” accepted to IEEE Transactions on Automatic Control, 2020.
  • [23] A. Nedić and A. Olshevsky, “Distributed optimization over time-varying directed graphs,” IEEE Transactions on Automatic Control, vol. 60, no. 3, pp. 601–615, 2015.
  • [24] C. Xi and U. A. Khan, “Distributed subgradient projection algorithm over directed graphs,” IEEE Transactions on Automatic Control, vol. 62, no. 8, pp. 3986–3992, 2017.
  • [25] D. Yuan, Y. Hong, D. W. C. Ho, and C. Jiang, “Optimal distributed stochastic mirror descent for strongly convex optimization,” Automatica, vol. 90, pp. 196–203, 2018.
  • [26] A. Nedić, A. Olshevsky, and W. Shi, “Achieving geometric convergence for distributed optimization over time-varying graphs,” SIAM Journal on Optimization, vol. 27, no. 4, pp. 2597–2633, 2017.
  • [27] G. Qu and N. Li, “Harnessing smoothness to accelerate distributed optimization,” IEEE Transactions on Control of Network Systems, vol. 5, no. 3, pp. 1245–1260, 2017.
  • [28] S. Pu, W. Shi, J. Xu, and A. Nedić, “Push-pull gradient methods for distributed optimization in networks,” accepted to IEEE Transactions on Automatic Control, 2020.
  • [29] R. Xin and U. A. Khan, “Distributed heavy-ball: A generalization and acceleration of first-order methods with gradient tracking,” IEEE Transactions on Automatic Control, vol. 65, no. 6, pp. 2627–2633, 2020.
  • [30] D. Varagnolo, F. Zanella, A. Cenedese, G. Pillonetto, and L. Schenato, “Newton-Raphson consensus for distributed convex optimization,” IEEE Transactions on Automatic Control, vol. 61, no. 4, pp. 994–1009, 2016.
  • [31] P. Di Lorenzo and G. Scutari, “NEXT: In-network nonconvex optimization,” IEEE Transactions on Signal and Information Processing over Networks, vol. 2, no. 2, pp. 120–136, 2016.
  • [32] A. Daneshmand, Y. Sun, G. Scutari, F. Facchinei, and B. Sadler, “Decentralized dictionary learning over time-varying digraphs,” Journal of Machine Learning Research, vol. 20, pp. 1–62, 2019.
  • [33] G. Scutari and Y. Sun, “Distributed nonconvex constrained optimization over time-varying digraphs,” Mathematical Programming, vol. 176, pp. 497–544, 2019.
  • [34] I. Notarnicola, Y. Sun, G. Scutari, and G. Notarstefano, “Distributed big-data optimization via block-wise gradient tracking,” accepted to IEEE Transactions on Automatic Control, 2020.
  • [35] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, “Distributed optimization and statistical learning via the alternating direction method of multipliers,” Foundations and Trends in Machine Learning, vol. 3, no. 1, pp. 1–122, 2011.
  • [36] F. Facchinei, G. Scutari, and S. Sagratella, “Parallel selective algorithms for nonconvex big data optimization,” IEEE Transactions on Signal Processing, vol. 63, no. 7, pp. 1874–1889, 2015.
  • [37] F. Facchinei, L. Lampariello, and G. Scutari, “Feasible methods for nonconvex nonsmooth problems with applications in green communications,” Mathematical Programming, vol. 164, pp. 55–90, 2017.
  • [38] M. Hong, X. Wang, M. Razaviyayn, and Z.-Q. Luo, “Iteration complexity analysis of block coordinate descent methods,” Mathematical Programming, vol. 163, pp. 85–114, 2017.
  • [39] G. Scutari, F. Facchinei, and L. Lampariello, “Parallel and distributed methods for constrained nonconvex optimization—Part I: Theory,” IEEE Transactions on Signal Processing, vol. 65, no. 8, pp. 1929–1944, 2017.
  • [40] L. Cannelli, F. Facchinei, V. Kungurtsev, and G. Scutari, “Asynchronous parallel algorithms for nonconvex optimization,” Mathematical Programming, vol. 184, pp. 1–34, 2019.
  • [41] S. Kakade, S. Shalev-Shwartz, and A. Tewari, “On the duality of strong convexity and strong smoothness: Learning applications and matrix regularization,” Manuscript, 2009.
  • [42] A. Beck and M. Teboulle, “Mirror descent and nonlinear projected subgradient methods for convex optimization,” Operations Research Letters, vol. 31, no. 3, pp. 167–175, 2003.
  • [43] M. Spivak, Calculus on manifolds: A modern approach to classical theorems of advanced calculus.   Addison-Wesley, 1965.
  • [44] K. Yuan, Q. Ling, and W. Yin, “On the convergence of decentralized gradient descent,” SIAM Journal on Optimization, vol. 26, no. 3, pp. 1835–1854, 2016.
  • [45] F. Bach, “Adaptivity of averaged stochastic gradient descent to local strong convexity for logistic regression,” Journal of Machine Learning Research, vol. 15, pp. 595–627, 2014.
Xuyang Wu (SM’17) received the B.S. degree in Information and Computing Science from Northwestern Polytechnical University, Xi’an, China, in 2015, and the Ph.D. degree from the School of Information Science and Technology at ShanghaiTech University, Shanghai, China, in 2020. He is currently a postdoctoral researcher in the Division of Decision and Control Systems at KTH Royal Institute of Technology, Stockholm, Sweden. His research interests include distributed optimization, large-scale optimization, and their applications in IoT and power systems.
Jie Lu (SM’08-M’13) received the B.S. degree in Information Engineering from Shanghai Jiao Tong University, China, in 2007, and the Ph.D. degree in Electrical and Computer Engineering from the University of Oklahoma, USA, in 2011. From 2012 to 2015 she was a postdoctoral researcher with KTH Royal Institute of Technology, Stockholm, Sweden, and with Chalmers University of Technology, Gothenburg, Sweden. Since 2015, she has been an assistant professor in the School of Information Science and Technology at ShanghaiTech University, Shanghai, China. Her research interests include distributed optimization, optimization theory and algorithms, and networked dynamical systems.