
A Unifying Approximate Method of Multipliers
for Distributed Composite Optimization

Xuyang Wu and Jie Lu. X. Wu is with the Division of Decision and Control Systems, KTH Royal Institute of Technology, SE-100 44 Stockholm, Sweden. Email: [email protected]. J. Lu is with the School of Information Science and Technology, ShanghaiTech University, 201210 Shanghai, China. Email: [email protected]. This work has been supported by the National Natural Science Foundation of China under grant 61603254.
Abstract

This paper investigates solving convex composite optimization on an undirected network, where each node, privately endowed with a smooth component function and a nonsmooth one, is required to minimize the sum of all the component functions throughout the network. To address such a problem, a general Approximate Method of Multipliers (AMM) is developed, which attempts to approximate the Method of Multipliers by virtue of a surrogate function with numerous options. We then design the possibly nonseparable, time-varying surrogate function in various ways, leading to different distributed realizations of AMM. We demonstrate that AMM generalizes more than ten state-of-the-art distributed optimization algorithms, and that certain specific designs of its surrogate function yield a variety of algorithms that are new to the literature. Furthermore, we show that AMM is able to achieve an O(1/k) rate of convergence to optimality, and that the convergence rate becomes linear when the problem is locally restricted strongly convex and smooth. Such convergence rates provide new or stronger convergence results for many prior methods that can be viewed as specializations of AMM.

Index Terms:
Distributed optimization, convex composite optimization, the Method of Multipliers.

I Introduction

This paper addresses the following convex composite optimization problem:

\underset{x\in\mathbb{R}^{d}}{\operatorname{minimize}}~\sum_{i=1}^{N}\left(f_{i}(x)+h_{i}(x)\right), (1)

where each f_{i}:\mathbb{R}^{d}\rightarrow\mathbb{R} is convex and has a Lipschitz continuous gradient, and each h_{i}:\mathbb{R}^{d}\rightarrow\mathbb{R}\cup\{+\infty\} is convex and can be non-differentiable. Notice that f_{i} is allowed to be a zero function. Also, if h_{i} contains an indicator function \mathcal{I}_{X_{i}} with respect to a closed convex set X_{i}\subset\mathbb{R}^{d}, i.e., \mathcal{I}_{X_{i}}(x)=0 if x\in X_{i} and \mathcal{I}_{X_{i}}(x)=+\infty otherwise, then problem (1) turns into a nonsmooth, constrained convex program.

We consider solving problem (1) in a distributed way over a network modeled as a connected, undirected graph \mathcal{G}=(\mathcal{V},\mathcal{E}), where \mathcal{V}=\{1,\ldots,N\} is the vertex set and \mathcal{E}\subseteq\{\{i,j\}|~i,j\in\mathcal{V},~i\neq j\} is the edge set. We suppose each node i\in\mathcal{V} possesses two private component functions f_{i} and h_{i}, aims at solving (1), and only exchanges information with its neighbors denoted by the set \mathcal{N}_{i}=\{j|~\{i,j\}\in\mathcal{E}\}. Applications of such a distributed composite optimization problem include control of multi-agent systems, robust estimation by sensor networks, resource allocation in smart grids, decision making for swarm robotics, as well as distributed learning [1].

To date, a large body of distributed optimization algorithms has been proposed to solve problem (1) or its special cases. A typical practice is to reformulate (1) into a constrained problem with a separable objective function and a consensus constraint to be relaxed. For example, the inexact methods [2, 3] solve this reformulated problem by penalizing the consensus constraint in the objective function, and the primal-dual methods [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22] relax the consensus constraint through dual decomposition techniques. Another common approach is to emulate centralized subgradient/gradient methods in a decentralized manner, such as the distributed subgradient methods [23, 24, 25] and the distributed gradient-tracking-based methods [26, 27, 28, 29, 30, 31, 32, 33, 34], where [31, 32, 33, 34] utilize surrogate functions in order to realize the decentralized emulation.

Despite the growing literature, relatively few methods manage to tackle the general form of problem (1) under distributed settings, among which only the first-order methods [8, 15, 6, 10, 17, 9, 11, 12, 13, 16, 7] are guaranteed to converge to optimality with constant step-sizes. Here, [8, 15] rely on strong convexity of the problem, and [6, 10, 17] only prove asymptotic convergence for non-strongly convex problems. In contrast, the remaining works [9, 11, 12, 13, 16, 7] establish O(1/k) convergence rates in solving (1). Although these algorithms are developed using different rationales, we discover that the majority of them, i.e., [9, 11, 12, 13, 16], can in fact be viewed as originating from the Method of Multipliers [35].

The Method of Multipliers is a seminal (centralized) optimization method, and one of its notable variants is the Alternating Direction Method of Multipliers (ADMM) [35]. In this paper, we develop a novel paradigm of solving (1) via approximating the behavior of the Method of Multipliers, called Approximate Method of Multipliers (AMM). The proposed AMM adopts a possibly time-varying surrogate function to take the place of the smooth objective function at every minimization step in the Method of Multipliers, facilitating abundant designs of distributed algorithms for solving (1) in a fully decentralized fashion. Unlike the typical separable surrogate functions for distributed optimization such as those in [31, 32, 33, 34], our surrogate function can be nonseparable in some distributed versions of AMM. It also admits certain function forms that cannot be included by [31, 32, 33, 34].

To enable distributed implementation of AMM, we first opt for a Bregman-divergence-type surrogate function, leading to a class of distributed realizations of AMM referred to as Distributed Approximate Method of Multipliers (DAMM). We also utilize convex conjugation and the graph topology to design the surrogate function, and thus construct two additional sets of distributed realizations of AMM, referred to as DAMM-SC and DAMM-SQ, for solving smooth convex optimization (i.e., (1) with h_{i}\equiv 0 \forall i\in\mathcal{V}). We concretely exemplify such realizations of AMM, so that new distributed proximal/second-order/gradient-tracking methods can be obtained. Apart from that, AMM and its distributed realizations also unify a wide range of state-of-the-art distributed first-order and second-order algorithms [5, 13, 19, 16, 12, 11, 4, 9, 14, 26, 18] for solving (1) or its special cases, as all of them can be cast into the form of AMM/DAMM/DAMM-SQ. This offers a unifying perspective for understanding the nature of these existing methods with various design rationales.

We show that AMM allows the nodes to reach a consensus and converge to the optimal value at a rate of O(1/k), either with a particular surrogate function type or under a local restricted strong convexity condition on \sum_{i\in\mathcal{V}}f_{i}(x). Among all the algorithms that are able to solve the general form of problem (1) in a distributed way, the convergence rates of AMM and the algorithms in [9, 11, 12, 13, 16, 7] achieve the best order of O(1/k), while the algorithms in [9, 11, 12, 13, 16] are indeed special cases of AMM with that particular surrogate function type and the convergence rate in [7] is in terms of an implicit measure of optimality error. Moreover, we show that when problem (1) is both smooth and locally restricted strongly convex, AMM enables the nodes to attain the optimum at a linear rate. Unlike most existing works, this linear convergence is established without requiring global strong convexity. Naturally, our analysis of AMM yields new convergence results and relaxes some problem assumptions, with no degradation of the convergence rate order, for quite a few existing distributed optimization methods that can be generalized by AMM.

The outline of the paper is as follows: Section II describes the proposed AMM and the distributed designs. Section III demonstrates that AMM generalizes a number of prior distributed optimization methods. Section IV discusses the convergence results of AMM under various assumptions, and Section V presents the comparative simulation results. Finally, Section VI concludes the paper.

Notation and Definition: We use A=(a_{ij})_{n\times n} to denote an n\times n real matrix whose (i,j)-entry, denoted by [A]_{ij}, is equal to a_{ij}. In addition, \operatorname{Null}(A) is the null space of A, \operatorname{Range}(A) is the range of A, and \|A\| is the spectral norm of A. Besides, \operatorname{diag}(D_{1},\ldots,D_{n})\in\mathbb{R}^{nd\times nd} represents the block diagonal matrix with D_{1},\ldots,D_{n}\in\mathbb{R}^{d\times d} being its diagonal blocks. Also, \mathbf{0}, \mathbf{1}, and \mathbf{O} represent a zero vector, an all-one vector, and a zero square matrix of proper dimensions, respectively, and I_{n} represents the n\times n identity matrix. For any two matrices A and B, A\otimes B is the Kronecker product of A and B. If A and B are square matrices of the same size, A\succeq B means A-B is positive semidefinite and A\succ B means A-B is positive definite. For any A=A^{T}\in\mathbb{R}^{n\times n}, \lambda_{\max}(A) denotes the largest eigenvalue of A. For any z\in\mathbb{R}^{n} and A\succeq\mathbf{O}, \|z\|=\sqrt{z^{T}z} and \|z\|_{A}=\sqrt{z^{T}Az}. For any countable set S, we use |S| to represent its cardinality. For any convex function f:\mathbb{R}^{d}\rightarrow\mathbb{R}, \partial f(x)\subset\mathbb{R}^{d} denotes its subdifferential (i.e., the set of subgradients) at x. If f is differentiable, \partial f(x) only contains the gradient \nabla f(x).

Given a convex set X\subseteq\mathbb{R}^{d}, a function f:\mathbb{R}^{d}\rightarrow\mathbb{R} is said to be strongly convex on X with convexity parameter \sigma>0 (or simply \sigma-strongly convex on X) if \langle g_{x}-g_{y},x-y\rangle\geq\sigma\|x-y\|^{2} \forall x,y\in X, \forall g_{x}\in\partial f(x), \forall g_{y}\in\partial f(y). We say f is (globally) strongly convex if it is strongly convex on \mathbb{R}^{d}. Given \tilde{x}\in X, f is said to be restricted strongly convex with respect to \tilde{x} on X with convexity parameter \sigma>0 if \langle g_{x}-g_{\tilde{x}},x-\tilde{x}\rangle\geq\sigma\|x-\tilde{x}\|^{2} \forall x\in X, \forall g_{x}\in\partial f(x), \forall g_{\tilde{x}}\in\partial f(\tilde{x}). Given L\geq 0, f is said to be L-smooth if it is differentiable and its gradient is Lipschitz continuous with Lipschitz constant L, i.e., \|\nabla f(x)-\nabla f(y)\|\leq L\|x-y\| \forall x,y\in\mathbb{R}^{d}.

II Approximate Method of Multipliers and Distributed Designs

This section develops distributed optimization algorithms for solving problem (1) over the undirected, connected graph \mathcal{G}, with the following formalized problem assumption.

Assumption 1.

For each i\in\mathcal{V}, f_{i}:\mathbb{R}^{d}\rightarrow\mathbb{R} and h_{i}:\mathbb{R}^{d}\rightarrow\mathbb{R}\cup\{+\infty\} are convex. In addition, f_{i} is M_{i}-smooth for some M_{i}>0. Moreover, there exists at least one optimal solution x^{\star} to problem (1).

II-A Approximate Method of Multipliers

We first propose a family of optimization algorithms that effectively approximate the Method of Multipliers [35] and serve as a cornerstone of developing distributed algorithms for solving problem (1).

Let each node i\in\mathcal{V} keep a copy x_{i}\in\mathbb{R}^{d} of the global decision variable x\in\mathbb{R}^{d} in problem (1), and define

f(\mathbf{x}):=\textstyle{\sum_{i\in\mathcal{V}}}f_{i}(x_{i}),\quad h(\mathbf{x}):=\textstyle{\sum_{i\in\mathcal{V}}}h_{i}(x_{i}),

where \mathbf{x}=(x_{1}^{T},\ldots,x_{N}^{T})^{T}\in\mathbb{R}^{Nd}. Then, problem (1) can be equivalently transformed into

\begin{split}\underset{\mathbf{x}\in\mathbb{R}^{Nd}}{\operatorname{minimize}}~~&~f(\mathbf{x})+h(\mathbf{x})\\ \operatorname{subject~to}~&~\mathbf{x}\in S:=\{\mathbf{x}\in\mathbb{R}^{Nd}|~x_{1}=\cdots=x_{N}\}.\end{split} (2)

Next, let \tilde{H}\in\mathbb{R}^{Nd\times Nd} be such that \tilde{H}=\tilde{H}^{T}, \tilde{H}\succeq\mathbf{O}, and \operatorname{Null}(\tilde{H})=S. It has been shown in [18] that the consensus constraint \mathbf{x}\in S in problem (2) is equivalent to \tilde{H}^{\frac{1}{2}}\mathbf{x}=\mathbf{0}. Therefore, (2) is identical to

\begin{split}\underset{\mathbf{x}\in\mathbb{R}^{Nd}}{\operatorname{minimize}}~~&~f(\mathbf{x})+h(\mathbf{x})\\ \operatorname{subject~to}~&~\tilde{H}^{\frac{1}{2}}\mathbf{x}=\mathbf{0}.\end{split} (3)

Problems (1), (2), and (3) have the same optimal value. Also, x^{\star}\in\mathbb{R}^{d} is an optimum of (1) if and only if \mathbf{x}^{\star}=((x^{\star})^{T},\ldots,(x^{\star})^{T})^{T}\in\mathbb{R}^{Nd} is an optimum of (2) and (3).

Consider applying the Method of Multipliers [35] to solve problem (3), which gives

\mathbf{x}^{k+1}\in\operatorname{\arg\;\min}_{\mathbf{x}\in\mathbb{R}^{Nd}}f(\mathbf{x})+h(\mathbf{x})+\frac{\rho}{2}\|\mathbf{x}\|_{\tilde{H}}^{2}+(\mathbf{v}^{k})^{T}\tilde{H}^{\frac{1}{2}}\mathbf{x},
\mathbf{v}^{k+1}=\mathbf{v}^{k}+\rho\tilde{H}^{\frac{1}{2}}\mathbf{x}^{k+1}.

Here, \mathbf{x}^{k}\in\mathbb{R}^{Nd} is the primal variable at iteration k\geq 0, which is updated by minimizing an augmented Lagrangian function with the penalty \frac{\rho}{2}\|\mathbf{x}\|_{\tilde{H}}^{2}, \rho>0, on the consensus constraint \tilde{H}^{\frac{1}{2}}\mathbf{x}=\mathbf{0}. In addition, \mathbf{v}^{k}\in\mathbb{R}^{Nd} is the dual variable, whose initial value \mathbf{v}^{0} can be arbitrarily set.

Although the Method of Multipliers may be applied to solve (3) with properly selected parameters, it is not implementable in a distributed fashion over \mathcal{G}, even when the problem reduces to linear programming. To address this issue, our strategy is to first derive a paradigm of approximating the Method of Multipliers and then design its distributed realizations.

Our approximation approach is as follows: Starting from any \mathbf{v}^{0}\in\mathbb{R}^{Nd}, let

\mathbf{x}^{k+1}=\underset{\mathbf{x}\in\mathbb{R}^{Nd}}{\operatorname{\arg\;\min}}\;u^{k}(\mathbf{x})+h(\mathbf{x})+\frac{\rho}{2}\|\mathbf{x}\|_{H}^{2}+(\mathbf{v}^{k})^{T}\tilde{H}^{\frac{1}{2}}\mathbf{x}, (4)
\mathbf{v}^{k+1}=\mathbf{v}^{k}+\rho\tilde{H}^{\frac{1}{2}}\mathbf{x}^{k+1},\quad\forall k\geq 0. (5)

Compared to the Method of Multipliers, we adopt the same dual update (5) but construct a different primal update (4). In (4), we use a possibly time-varying surrogate function u^{k}:\mathbb{R}^{Nd}\rightarrow\mathbb{R} to replace f(\mathbf{x}) in the primal update of the Method of Multipliers, whose conditions are imposed in Assumption 2 below. Additionally, to introduce more flexibility, we use a different weight matrix H\in\mathbb{R}^{Nd\times Nd} to define the penalty term \frac{\rho}{2}\|\mathbf{x}\|_{H}^{2}, \rho>0. We suppose H has the same properties as \tilde{H}, i.e., H=H^{T}\succeq\mathbf{O} and \operatorname{Null}(H)=S.

Assumption 2.

The functions u^{k} \forall k\geq 0 satisfy the following:

  1. (a)

    u^{k} \forall k\geq 0 are convex and twice continuously differentiable.

  2. (b)

    u^{k}+\frac{\rho}{2}\|\cdot\|_{H}^{2} \forall k\geq 0 are strongly convex, whose convexity parameters are uniformly bounded from below by some positive constant.

  3. (c)

    \nabla u^{k} \forall k\geq 0 are Lipschitz continuous, whose Lipschitz constants are uniformly bounded from above by some nonnegative constant.

  4. (d)

    \nabla u^{k}(\mathbf{x}^{k})=\nabla f(\mathbf{x}^{k}) \forall k\geq 0, where \mathbf{x}^{0} can be arbitrarily given and \mathbf{x}^{k}, k\geq 1 is generated by (4)–(5).

The strong convexity condition in Assumption 2(b) guarantees that \mathbf{x}^{k+1} in (4) exists and is unique. Assumption 2(d) is the key to making (4)–(5) solve problem (3). To explain this, note that (4) is equivalent to finding the unique \mathbf{x}^{k+1} satisfying

-\nabla u^{k}(\mathbf{x}^{k+1})-\rho H\mathbf{x}^{k+1}-\tilde{H}^{\frac{1}{2}}\mathbf{v}^{k}\in\partial h(\mathbf{x}^{k+1}). (6)

Let (\mathbf{x}^{\star},\mathbf{v}^{\star}) be a primal-dual optimal solution pair of problem (3), which satisfies

-\tilde{H}^{\frac{1}{2}}\mathbf{v}^{\star}-\nabla f(\mathbf{x}^{\star})\in\partial h(\mathbf{x}^{\star}). (7)

If (\mathbf{x}^{k},\mathbf{v}^{k})=(\mathbf{x}^{\star},\mathbf{v}^{\star}), then \mathbf{x}^{k+1} has to be \mathbf{x}^{\star} because of Assumption 2(d), H\mathbf{x}^{\star}=\mathbf{0}, (7), and the uniqueness of \mathbf{x}^{k+1} in (6). It follows from (5) and \tilde{H}^{\frac{1}{2}}\mathbf{x}^{\star}=\mathbf{0} that \mathbf{v}^{k+1}=\mathbf{v}^{\star}. Therefore, (\mathbf{x}^{\star},\mathbf{v}^{\star}) is a fixed point of (4)–(5). The remaining conditions in Assumption 2 will be used for convergence analysis later in Section IV.

The paradigm described by (4)–(5) and obeying Assumption 2 is called Approximate Method of Multipliers (AMM). As there are numerous options of the surrogate function u^{k}, AMM unifies a wealth of optimization algorithms, including a variety of existing methods (cf. Section III) and many brand new algorithms. Moreover, since Assumption 2 allows u^{k} to have a more favorable structure than f, AMM with appropriate u^{k}'s may induce a prominent reduction in computational cost compared to the Method of Multipliers. In the sequel, we will provide various options of u^{k}, which give rise to a series of distributed versions of AMM.
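As a simple illustration (not one of the designs developed in the sequel), one can verify that the linearized choice u^{k}(\mathbf{x})=\langle\nabla f(\mathbf{x}^{k}),\mathbf{x}\rangle+\frac{\alpha}{2}\|\mathbf{x}-\mathbf{x}^{k}\|^{2} with any \alpha>0 satisfies Assumption 2: it is convex and twice continuously differentiable, u^{k}+\frac{\rho}{2}\|\cdot\|_{H}^{2} is \alpha-strongly convex, \nabla u^{k} is Lipschitz continuous with constant \alpha, and \nabla u^{k}(\mathbf{x}^{k})=\nabla f(\mathbf{x}^{k}). With this choice, the primal update (4) can be interpreted as a linearized, proximal-gradient-type step on the augmented Lagrangian.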

A number of existing works [36, 37, 38, 39, 40, 31, 32, 33, 34] also introduce surrogate functions to optimization methods, among which [31, 32, 33, 34] develop distributed algorithms and can address partially nonconvex problems on time-varying networks. The advantages of AMM as well as its differences from the algorithms in [31, 32, 33, 34] are highlighted as follows:

i) The algorithms carrying surrogate functions in [31, 32, 33, 34] are (primal) gradient-tracking-based methods. In contrast, AMM incorporates a surrogate function into a primal-dual framework, so that AMM is inherently different from the algorithms in [31, 32, 33, 34] and requires new analysis tools.

ii) The surrogate function conditions in Assumption 2 intersect with but still differ from those in [31, 32, 33, 34]. For instance, when f is twice continuously differentiable, u^{k}(\mathbf{x})=\langle\nabla f(\mathbf{x}^{k}),\mathbf{x}\rangle+\frac{1}{2}(\mathbf{x}-\mathbf{x}^{k})^{T}(\nabla^{2}f(\mathbf{x}^{k})+\epsilon I)(\mathbf{x}-\mathbf{x}^{k}) with \epsilon>0 meets Assumption 2 but cannot be included by [31, 32, 33, 34].

iii) To enable distributed implementation, the surrogate functions in [31, 32, 33, 34] need to be fully separable in the sense that they can be written as the sum of N functions with independent variables. In contrast, AMM allows u^{k} to be densely coupled under proper design yet can still be executed in a distributed fashion (cf. Sections II-C and III-C4). This may lead to more diverse algorithm design.

iv) In [31, 32, 33, 34], it is required that the global nonsmooth objective function be accessible to every node, while problem (1) only assigns a local nonsmooth component h_{i} to each node i, which is more applicable to scenarios concerning privacy and makes the analysis more challenging.

II-B Distributed Approximate Method of Multipliers

This subsection lays out the parameter designs of AMM for distributed implementations.

We first apply the following change of variable to AMM:

\mathbf{q}^{k}=\tilde{H}^{\frac{1}{2}}\mathbf{v}^{k},\quad k\geq 0. (8)

Then, AMM (4)–(5) can be rewritten as

\mathbf{x}^{k+1}=\underset{\mathbf{x}\in\mathbb{R}^{Nd}}{\operatorname{\arg\;\min}}~u^{k}(\mathbf{x})+h(\mathbf{x})+\frac{\rho}{2}\|\mathbf{x}\|_{H}^{2}+(\mathbf{q}^{k})^{T}\mathbf{x}, (9)
\mathbf{q}^{k+1}=\mathbf{q}^{k}+\rho\tilde{H}\mathbf{x}^{k+1}. (10)

Moreover, note that

\operatorname{Range}(\tilde{H}^{\frac{1}{2}})=\operatorname{Range}(\tilde{H})=S^{\bot}:=\{\mathbf{x}\in\mathbb{R}^{Nd}|~x_{1}+\cdots+x_{N}=\mathbf{0}\},

where S^{\bot} is the orthogonal complement of S in (2). Hence, (8) requires \mathbf{q}^{k}\in S^{\bot} \forall k\geq 0, which, due to (10), can be ensured simply by the following initialization:

\mathbf{q}^{0}\in S^{\bot}. (11)

Therefore, (9)–(11) is an equivalent form of AMM.

Next, partition the primal variable \mathbf{x}^{k} and the dual variable \mathbf{q}^{k} in (9)–(11) as \mathbf{x}^{k}=((x_{1}^{k})^{T},\ldots,(x_{N}^{k})^{T})^{T} and \mathbf{q}^{k}=((q_{1}^{k})^{T},\ldots,(q_{N}^{k})^{T})^{T}. Suppose each node i\in\mathcal{V} maintains x_{i}^{k}\in\mathbb{R}^{d} and q_{i}^{k}\in\mathbb{R}^{d}. Clearly, the nodes manage to collectively meet (11) by setting, for instance, q_{i}^{0}=\mathbf{0} \forall i\in\mathcal{V}. Below, we discuss the selections of u^{k}, H, and \tilde{H} for the sake of distributed implementations of (9) and (10).

To this end, we let

H=P\otimes I_{d},\quad\tilde{H}=\tilde{P}\otimes I_{d},

and impose Assumption 3 on P,\tilde{P}\in\mathbb{R}^{N\times N}.

Assumption 3.

The matrices P=(p_{ij})_{N\times N} and \tilde{P}=(\tilde{p}_{ij})_{N\times N} satisfy the following:

  1. (a)

    p_{ij}=p_{ji}, \tilde{p}_{ij}=\tilde{p}_{ji}, \forall\{i,j\}\in\mathcal{E}.

  2. (b)

    p_{ij}=\tilde{p}_{ij}=0, \forall i\in\mathcal{V}, \forall j\notin\mathcal{N}_{i}\cup\{i\}.

  3. (c)

    \operatorname{Null}(P)=\operatorname{Null}(\tilde{P})=\operatorname{span}(\mathbf{1}).

  4. (d)

    P\succeq\mathbf{O}, \tilde{P}\succeq\mathbf{O}.

The nodes can jointly determine P,\tilde{P} under Assumption 3 without any centralized coordination. For instance, we can let each node i\in\mathcal{V} agree with every neighbor j\in\mathcal{N}_{i} on p_{ij}=p_{ji}<0 and \tilde{p}_{ij}=\tilde{p}_{ji}<0, and set p_{ii}=-\sum_{j\in\mathcal{N}_{i}}p_{ij} and \tilde{p}_{ii}=-\sum_{j\in\mathcal{N}_{i}}\tilde{p}_{ij}, which directly guarantees Assumption 3(a). Then, Assumption 3(c)(d) are readily satisfied by means of Assumption 3(b) and the connectivity of \mathcal{G}. Typical examples of such P and \tilde{P} include the graph Laplacian matrix L_{\mathcal{G}}\in\mathbb{R}^{N\times N} with [L_{\mathcal{G}}]_{ij}=[L_{\mathcal{G}}]_{ji}=-1 \forall\{i,j\}\in\mathcal{E} and the Metropolis weight matrix M_{\mathcal{G}}\in\mathbb{R}^{N\times N} with [M_{\mathcal{G}}]_{ij}=[M_{\mathcal{G}}]_{ji}=-\frac{1}{\max\{|\mathcal{N}_{i}|,|\mathcal{N}_{j}|\}+1} \forall\{i,j\}\in\mathcal{E}. Also, the conditions on H and \tilde{H} in Section II-A hold due to Assumption 3.
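As a concrete illustration of this construction, the following minimal Python sketch (the small graph is an assumed toy example, not from the paper) builds the graph Laplacian L_{\mathcal{G}} and the Metropolis matrix M_{\mathcal{G}}, with diagonal entries set to the negated off-diagonal row sums as described above, and numerically checks the requirements of Assumption 3.

import numpy as np

N = 4
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]   # illustrative connected graph
deg = np.zeros(N, dtype=int)
for i, j in edges:
    deg[i] += 1; deg[j] += 1                       # |N_i|

L = np.zeros((N, N)); M = np.zeros((N, N))
for i, j in edges:
    L[i, j] = L[j, i] = -1.0                       # [L_G]_{ij} = -1
    M[i, j] = M[j, i] = -1.0 / (max(deg[i], deg[j]) + 1.0)   # Metropolis weight
for i in range(N):
    L[i, i] = -L[i].sum(); M[i, i] = -M[i].sum()   # p_ii = -sum_{j in N_i} p_ij

for P in (L, M):
    assert np.allclose(P, P.T)                     # Assumption 3(a): symmetry
    assert np.allclose(P @ np.ones(N), 0.0)        # the all-one vector lies in Null(P)
    w = np.sort(np.linalg.eigvalsh(P))
    assert w[0] > -1e-12 and w[1] > 1e-12          # Assumption 3(c)(d): PSD with Null(P) = span(1)
print("L_G and M_G satisfy Assumption 3 on this toy graph")

Assumption 3(b) holds by construction, since only entries associated with edges of the graph are assigned nonzero values.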

With Assumption 3, (10) can be accomplished by letting each node update as

q_{i}^{k+1}=q_{i}^{k}+\rho\textstyle{\sum_{j\in\mathcal{N}_{i}\cup\{i\}}}\tilde{p}_{ij}x_{j}^{k+1},\quad\forall i\in\mathcal{V}. (12)

This is a local operation, as each node i\in\mathcal{V} only needs to acquire the primal variables associated with its neighbors.

It remains to design the surrogate function u^{k} so that (9) can be realized in a distributed way. To this end, let \psi_{i}^{k}:\mathbb{R}^{d}\rightarrow\mathbb{R} \forall i\in\mathcal{V} \forall k\geq 0 be a set of functions under Assumption 4.

Assumption 4.

The functions \psi_{i}^{k} \forall i\in\mathcal{V} \forall k\geq 0 satisfy:

  1. (a)

    \psi_{i}^{k} \forall i\in\mathcal{V} \forall k\geq 0 are twice continuously differentiable.

  2. (b)

    \psi_{i}^{k} \forall i\in\mathcal{V} \forall k\geq 0 are strongly convex, whose convexity parameters are uniformly bounded from below by some positive constant.

  3. (c)

    \nabla\psi_{i}^{k} \forall i\in\mathcal{V} \forall k\geq 0 are Lipschitz continuous, whose Lipschitz constants are uniformly bounded from above by some positive constant.

  4. (d)

    (\sum_{i\in\mathcal{V}}\psi_{i}^{k}(x_{i}))-\frac{\rho}{2}\|\mathbf{x}\|_{H}^{2} \forall k\geq 0 are convex.

To meet Assumption 4(d), it suffices to let \psi_{i}^{k} \forall i\in\mathcal{V} \forall k\geq 0 be any strongly convex functions whose convexity parameters are larger than or equal to \rho\lambda_{\max}(H)>0, and one readily available upper bound on \lambda_{\max}(H) is N\max_{i\in\mathcal{V},\;j\in\mathcal{N}_{i}}|p_{ij}|.

Now define

\phi^{k}(\mathbf{x}):=\bigl(\textstyle{\sum_{i\in\mathcal{V}}}\psi_{i}^{k}(x_{i})\bigr)-\frac{\rho}{2}\|\mathbf{x}\|_{H}^{2},\quad\forall k\geq 0, (13)

and let D_{\phi^{k}}(\mathbf{x},\mathbf{x}^{k}) be the Bregman divergence associated with \phi^{k} at \mathbf{x} and \mathbf{x}^{k}, i.e.,

D_{\phi^{k}}(\mathbf{x},\mathbf{x}^{k})=\phi^{k}(\mathbf{x})-\phi^{k}(\mathbf{x}^{k})-\langle\nabla\phi^{k}(\mathbf{x}^{k}),\mathbf{x}-\mathbf{x}^{k}\rangle. (14)

Then, we set u^{k} as

u^{k}(\mathbf{x})=D_{\phi^{k}}(\mathbf{x},\mathbf{x}^{k})+\langle\nabla f(\mathbf{x}^{k}),\mathbf{x}\rangle. (15)

With Assumption 4, (15) is sufficient to ensure Assumption 2. To see this, note from Assumption 4(a)(d) that u^{k}(\mathbf{x}) in (15) is twice continuously differentiable and convex, i.e., Assumption 2(a) holds. Also note from (14) and (15) that

\nabla u^{k}(\mathbf{x})=\nabla\phi^{k}(\mathbf{x})-\nabla\phi^{k}(\mathbf{x}^{k})+\nabla f(\mathbf{x}^{k}). (16)

This, along with Assumption 4(b)(c), guarantees Assumption 2(b)(c)(d).

To see how (15) results in a distributed implementation of (9), note from (16) that (9) is equivalent to

\mathbf{0}=\nabla\phi^{k}(\mathbf{x}^{k+1})-\nabla\phi^{k}(\mathbf{x}^{k})+\nabla f(\mathbf{x}^{k})+g^{k+1}+\rho H\mathbf{x}^{k+1}+\mathbf{q}^{k}

for some g^{k+1}\in\partial h(\mathbf{x}^{k+1}). Then, using (13) and the structure of H given in Assumption 3, this can be written as

\mathbf{0}=\nabla\psi_{i}^{k}(x_{i}^{k+1})+g_{i}^{k+1}+q_{i}^{k}-\nabla\psi_{i}^{k}(x_{i}^{k})+\nabla f_{i}(x_{i}^{k})+\rho\textstyle{\sum_{j\in\mathcal{N}_{i}\cup\{i\}}}p_{ij}x_{j}^{k},\quad\forall i\in\mathcal{V},

where g_{i}^{k+1}\in\partial h_{i}(x_{i}^{k+1}). In other words, (9) can be achieved by letting each node i\in\mathcal{V} solve the following strongly convex optimization problem:

x_{i}^{k+1}=\operatorname{\arg\;\min}_{x\in\mathbb{R}^{d}}~\psi_{i}^{k}(x)+h_{i}(x)+\langle x,q_{i}^{k}-\nabla\psi_{i}^{k}(x_{i}^{k})+\nabla f_{i}(x_{i}^{k})+\rho\textstyle{\sum_{j\in\mathcal{N}_{i}\cup\{i\}}}p_{ij}x_{j}^{k}\rangle, (17)

which can be locally carried out by node i.

Algorithms described by (11), (17), and (12) under Assumptions 3 and 4 constitute a set of distributed realizations of AMM, referred to as Distributed Approximate Method of Multipliers (DAMM), which can be implemented by the nodes in \mathcal{G} via exchanging information with their neighbors only. Algorithm 1 describes how each node acts in DAMM.

Algorithm 1 DAMM
1:  Initialization:
2:  Each node i\in\mathcal{V} selects q_{i}^{0}\in\mathbb{R}^{d} such that \sum_{i\in\mathcal{V}}q_{i}^{0}=\mathbf{0} (or simply sets q_{i}^{0}=\mathbf{0}).
3:  Each node i\in\mathcal{V} arbitrarily sets x_{i}^{0}\in\mathbb{R}^{d} and sends x_{i}^{0} to every neighbor j\in\mathcal{N}_{i}.
4:  for k\geq 0 do
5:     Each node i\in\mathcal{V} computes x_{i}^{k+1}=\operatorname{\arg\;\min}_{x\in\mathbb{R}^{d}}\psi_{i}^{k}(x)+h_{i}(x)+\langle x,q_{i}^{k}-\nabla\psi_{i}^{k}(x_{i}^{k})+\nabla f_{i}(x_{i}^{k})+\rho\sum_{j\in\mathcal{N}_{i}\cup\{i\}}p_{ij}x_{j}^{k}\rangle.
6:     Each node i\in\mathcal{V} sends x_{i}^{k+1} to every neighbor j\in\mathcal{N}_{i}.
7:     Each node i\in\mathcal{V} computes q_{i}^{k+1}=q_{i}^{k}+\rho\sum_{j\in\mathcal{N}_{i}\cup\{i\}}\tilde{p}_{ij}x_{j}^{k+1}.
8:  end for

Finally, we provide two examples of DAMM with two particular choices of \psi_{i}^{k}.

Example 1.

For each i\in\mathcal{V}, let \psi_{i}^{k}(x)=r_{i}(x-x_{i}^{k})+\frac{\epsilon_{i}}{2}\|x\|^{2} \forall k\geq 0, where r_{i}:\mathbb{R}^{d}\rightarrow\mathbb{R} \forall i\in\mathcal{V} can be any convex, smooth, and twice continuously differentiable functions and \epsilon_{i}>0 is such that \epsilon_{i}\geq\rho\lambda_{\max}(H). Then, DAMM reduces to a new distributed proximal algorithm, with the following local update of x_{i}^{k}:

x_{i}^{k+1}=\operatorname{\arg\;\min}_{x\in\mathbb{R}^{d}}~r_{i}(x-x_{i}^{k})+\frac{\epsilon_{i}}{2}\|x-x_{i}^{k}\|^{2}+h_{i}(x)+\langle x,q_{i}^{k}-\nabla r_{i}(\mathbf{0})+\nabla f_{i}(x_{i}^{k})+\rho\textstyle{\sum_{j\in\mathcal{N}_{i}\cup\{i\}}}p_{ij}x_{j}^{k}\rangle.
Example 2.

Suppose f_{i} \forall i\in\mathcal{V} are twice continuously differentiable. Since \nabla^{2}f_{i}(x)\preceq M_{i}I_{d} \forall x\in\mathbb{R}^{d}, we can let each \psi_{i}^{k}(x)=\frac{1}{2}x^{T}(\nabla^{2}f_{i}(x_{i}^{k})+\epsilon_{i}I_{d})x, where \epsilon_{i}>0 is such that \epsilon_{i}\geq\rho\lambda_{\max}(H). Then, the resulting DAMM is a new distributed second-order method, with the following local update of x_{i}^{k}:

x_{i}^{k+1}=\operatorname{\arg\;\min}_{x\in\mathbb{R}^{d}}~\frac{1}{2}\|x-x_{i}^{k}\|_{\nabla^{2}f_{i}(x_{i}^{k})+\epsilon_{i}I_{d}}^{2}+h_{i}(x)+\langle x,q_{i}^{k}+\nabla f_{i}(x_{i}^{k})+\rho\textstyle{\sum_{j\in\mathcal{N}_{i}\cup\{i\}}}p_{ij}x_{j}^{k}\rangle.

II-C Special Case: Smooth Problem

In this subsection, we focus on the smooth convex optimization problem \min_{x\in\mathbb{R}^{d}}\sum_{i\in\mathcal{V}}f_{i}(x), i.e., (1) with h_{i}(x)\equiv 0 \forall i\in\mathcal{V}. For this special case of (1), we provide additional designs of the surrogate function u^{k} in AMM, leading to a couple of variations of DAMM.

Here, we let u^{k} still be in the form of (15), but no longer require \phi^{k}+\frac{\rho}{2}\|\cdot\|_{H}^{2} to be a separable function as in (13). Instead, we construct \phi^{k} based upon another function \gamma^{k}:\mathbb{R}^{Nd}\rightarrow\mathbb{R} under Assumption 5.

Assumption 5.

The functions \gamma^{k} \forall k\geq 0 satisfy the following:

  1. (a)

    \gamma^{k} \forall k\geq 0 are twice continuously differentiable.

  2. (b)

    \gamma^{k} \forall k\geq 0 are strongly convex, whose convexity parameters are uniformly bounded from below by \underline{\gamma}>0.

  3. (c)

    \nabla\gamma^{k} \forall k\geq 0 are Lipschitz continuous, whose Lipschitz constants are uniformly bounded from above by \bar{\gamma}>0.

  4. (d)

    (\gamma^{k})^{\star}(\mathbf{x})-\frac{\rho}{2}\|\mathbf{x}\|_{H}^{2} \forall k\geq 0 are convex, where (\gamma^{k})^{\star}(\mathbf{x})=\sup_{\mathbf{y}\in\mathbb{R}^{Nd}}\langle\mathbf{x},\mathbf{y}\rangle-\gamma^{k}(\mathbf{y}) is the convex conjugate function of \gamma^{k}.

  5. (e)

    For any k\geq 0 and any \mathbf{x}\in\mathbb{R}^{Nd}, the ith d-dimensional block of \nabla\gamma^{k}(\mathbf{x}), denoted by \nabla_{i}\gamma^{k}(\mathbf{x}), is independent of x_{j} \forall j\notin\mathcal{N}_{i}\cup\{i\}.

From Assumption 5(b)(c), each (\gamma^{k})^{\star} is (1/\bar{\gamma})-strongly convex and (1/\underline{\gamma})-smooth [41], so that Assumption 5(d) holds as long as I_{Nd}/\bar{\gamma}-\rho H\succeq\mathbf{O}. Now we set

\phi^{k}(\mathbf{x})=(\gamma^{k})^{\star}(\mathbf{x})-\frac{\rho}{2}\|\mathbf{x}\|_{H}^{2},\quad\forall k\geq 0. (18)

Unlike DAMM, u^{k} given by (15) and (18) under Assumption 5 does not necessarily guarantee that u^{k}(\cdot)+\frac{\rho}{2}\|\cdot\|_{H}^{2} is separable. Below, we show that such u^{k} \forall k\geq 0 also satisfy Assumption 2, leading to another subclass of AMM.

To do so, first notice that the strong convexity and smoothness of (\gamma^{k})^{\star} guarantee Assumption 2(b)(c). Also note from (16) that Assumption 2(d) is assured. In addition, due to Assumption 5(d), \phi^{k} is convex and, thus, so is u^{k}. To show the twice continuous differentiability of u^{k} in Assumption 2(a), consider the fact from [42] that due to Assumption 5(b)(c), \nabla\gamma^{k} is invertible and its inverse function is (\nabla\gamma^{k})^{-1}=\nabla(\gamma^{k})^{\star}. This, along with (18), implies that

\nabla\phi^{k}(\mathbf{x})=(\nabla\gamma^{k})^{-1}(\mathbf{x})-\rho H\mathbf{x}. (19)

From the inverse function theorem [43], (\nabla\gamma^{k})^{-1} is continuously differentiable, so that \phi^{k} and therefore u^{k} given by (15) and (18) are twice continuously differentiable. We then conclude that Assumption 2 holds.

Equipped with (15), (18), and Assumption 5, we start to derive additional distributed realizations of AMM (9)–(11) for minimizing \sum_{i\in\mathcal{V}}f_{i}(x). As h_{i}(x)\equiv 0 \forall i\in\mathcal{V}, (9) is equivalent to

\nabla\phi^{k}(\mathbf{x}^{k+1})-\nabla\phi^{k}(\mathbf{x}^{k})+\nabla f(\mathbf{x}^{k})+\rho H\mathbf{x}^{k+1}+\mathbf{q}^{k}=\mathbf{0}. (20)

This, together with (19), gives

\mathbf{x}^{k+1}=\nabla\gamma^{k}(\nabla\phi^{k}(\mathbf{x}^{k})-\nabla f(\mathbf{x}^{k})-\mathbf{q}^{k}). (21)

Like DAMM, we let each node i\in\mathcal{V} maintain the ith d-dimensional blocks of \mathbf{x}^{k} and \mathbf{q}^{k}, i.e., x_{i}^{k} and q_{i}^{k}, where the update of q_{i}^{k} is the same as (12). According to (21), the update of x_{i}^{k} is given by x_{i}^{k+1}=\nabla_{i}\gamma^{k}(\nabla\phi^{k}(\mathbf{x}^{k})-\nabla f(\mathbf{x}^{k})-\mathbf{q}^{k}). From Assumption 5(e), this can be computed by node i provided that it has access to \nabla_{j}\phi^{k}(\mathbf{x}^{k})-\nabla f_{j}(x_{j}^{k})-q_{j}^{k} \forall j\in\mathcal{N}_{i}\cup\{i\}, where \nabla_{j}\phi^{k}(\mathbf{x}^{k})\in\mathbb{R}^{d} denotes the jth block of \nabla\phi^{k}(\mathbf{x}^{k}). Therefore, the remaining question is whether each node i\in\mathcal{V} is able to locally evaluate \nabla_{i}\phi^{k}(\mathbf{x}^{k}). In fact, this can be enabled by the following two concrete ways of constructing \gamma^{k}.

Way #1: Let \gamma^{k}=\gamma \forall k\geq 0 for some \gamma:\mathbb{R}^{Nd}\rightarrow\mathbb{R} satisfying Assumption 5. Then, according to (18), \phi^{k} \forall k\geq 0 are fixed to some \phi:\mathbb{R}^{Nd}\rightarrow\mathbb{R}. We introduce an auxiliary variable \mathbf{y}^{k}=((y_{1}^{k})^{T},\ldots,(y_{N}^{k})^{T})^{T} such that \mathbf{y}^{k}=\nabla\phi(\mathbf{x}^{k}) \forall k\geq 0. From (20), \mathbf{y}^{k} satisfies the recursive relation

\mathbf{y}^{k+1}=\mathbf{y}^{k}-\nabla f(\mathbf{x}^{k})-\mathbf{q}^{k}-\rho H\mathbf{x}^{k+1}.

Due to the structure of H in Assumption 3, each node i\in\mathcal{V} is able to locally update y_{i}^{k+1} as above. Nevertheless, the initialization \mathbf{y}^{0}=\nabla\phi(\mathbf{x}^{0})=(\nabla\gamma)^{-1}(\mathbf{x}^{0})-\rho H\mathbf{x}^{0} cannot be achieved without centralized coordination, given that \mathbf{x}^{0} is arbitrarily chosen. We may overcome this by imposing a restriction on \mathbf{x}^{0} as follows: Let each node i\in\mathcal{V} pick any \tilde{z}_{i}\in\mathbb{R}^{d}, and force x_{i}^{0}=\nabla_{i}\gamma(\tilde{\mathbf{z}}) with \tilde{\mathbf{z}}=(\tilde{z}_{1}^{T},\ldots,\tilde{z}_{N}^{T})^{T}, so that \tilde{\mathbf{z}}=(\nabla\gamma)^{-1}(\mathbf{x}^{0}). Then, by letting each y_{i}^{0}=\tilde{z}_{i}-\rho\sum_{j\in\mathcal{N}_{i}\cup\{i\}}p_{ij}x_{j}^{0}, we obtain \mathbf{y}^{0}=\nabla\phi(\mathbf{x}^{0}). Due to Assumption 5(e), such initialization is decentralized, which only requires each node i to share \tilde{z}_{i} and x_{i}^{0} with its neighbors.

Incorporating the above into (21) yields another distributed version of AMM, which is called Distributed Approximate Method of Multipliers for Smooth problems with Constant update functions (DAMM-SC) and is specified in Algorithm 2.

Algorithm 2 DAMM-SC
1:  Initialization:
2:  Each node i\in\mathcal{V} selects q_{i}^{0}\in\mathbb{R}^{d} such that \sum_{i\in\mathcal{V}}q_{i}^{0}=\mathbf{0} (or simply sets q_{i}^{0}=\mathbf{0}).
3:  Each node i\in\mathcal{V} arbitrarily chooses \tilde{z}_{i}\in\mathbb{R}^{d} and sends \tilde{z}_{i} to every neighbor j\in\mathcal{N}_{i}.
4:  Each node i\in\mathcal{V} sets x_{i}^{0}=\nabla_{i}\gamma(\tilde{\mathbf{z}}) (which only depends on \tilde{z}_{j} \forall j\in\mathcal{N}_{i}\cup\{i\}) and sends it to every neighbor j\in\mathcal{N}_{i}.
5:  Each node i\in\mathcal{V} sets y_{i}^{0}=\tilde{z}_{i}-\rho\sum_{j\in\mathcal{N}_{i}\cup\{i\}}p_{ij}x_{j}^{0}.
6:  Each node i\in\mathcal{V} sends y_{i}^{0}-\nabla f_{i}(x_{i}^{0})-q_{i}^{0} to every neighbor j\in\mathcal{N}_{i}.
7:  for k\geq 0 do
8:     Each node i\in\mathcal{V} computes x_{i}^{k+1}=\nabla_{i}\gamma(\mathbf{y}^{k}-\nabla f(\mathbf{x}^{k})-\mathbf{q}^{k}) (which only depends on y_{j}^{k}-\nabla f_{j}(x_{j}^{k})-q_{j}^{k} \forall j\in\mathcal{N}_{i}\cup\{i\}).
9:     Each node i\in\mathcal{V} sends x_{i}^{k+1} to every neighbor j\in\mathcal{N}_{i}.
10:     Each node i\in\mathcal{V} computes y_{i}^{k+1}=y_{i}^{k}-\nabla f_{i}(x_{i}^{k})-q_{i}^{k}-\rho\sum_{j\in\mathcal{N}_{i}\cup\{i\}}p_{ij}x_{j}^{k+1}.
11:     Each node i\in\mathcal{V} computes q_{i}^{k+1}=q_{i}^{k}+\rho\sum_{j\in\mathcal{N}_{i}\cup\{i\}}\tilde{p}_{ij}x_{j}^{k+1}.
12:     Each node i\in\mathcal{V} sends y_{i}^{k+1}-\nabla f_{i}(x_{i}^{k+1})-q_{i}^{k+1} to every neighbor j\in\mathcal{N}_{i}.
13:  end for

Example 3.

We can further reduce the communication cost of DAMM-SC by restricting \nabla_{i}\gamma to depend solely on node i. For example, let g_{i}:\mathbb{R}^{d}\rightarrow\mathbb{R} \forall i\in\mathcal{V} be arbitrary twice continuously differentiable, smooth, convex functions and let g(\mathbf{x})=\sum_{i\in\mathcal{V}}g_{i}(x_{i}). Then, we may choose \gamma(\mathbf{x})=g(\mathbf{x})+\frac{\epsilon}{2}\|\mathbf{x}\|^{2} with \epsilon>0. It is straightforward to see that Assumption 5(a)(b)(c) hold and Assumption 5(d) can be satisfied with proper \rho,\epsilon>0. For such a choice of \gamma, \nabla_{i}\gamma(\mathbf{x})=\nabla g_{i}(x_{i})+\epsilon x_{i}, which is determined by node i itself and hence meets Assumption 5(e). Thus, the primal update of DAMM-SC, i.e., Line 8 of Algorithm 2, is simplified to x_{i}^{k+1}=\nabla g_{i}(y_{i}^{k}-\nabla f_{i}(x_{i}^{k})-q_{i}^{k})+\epsilon(y_{i}^{k}-\nabla f_{i}(x_{i}^{k})-q_{i}^{k}), so that each node i only needs to share x_{i}^{k+1} with its neighbors during each iteration and the local communications in Line 6 and Line 12 of Algorithm 2 are eliminated.

Way #2: For each k\geq 0, let \gamma^{k}(\mathbf{x})=\frac{1}{2}\mathbf{x}^{T}G^{k}\mathbf{x}, where G^{k}=(G^{k})^{T}\in\mathbb{R}^{Nd\times Nd}. Suppose there exist \bar{\gamma}\geq\underline{\gamma}>0 such that \underline{\gamma}I_{Nd}\preceq G^{k}\preceq\bar{\gamma}I_{Nd} \forall k\geq 0, and also suppose (G^{k})^{-1}\succeq\rho H \forall k\geq 0. Moreover, we let each G^{k} have a neighbor-sparse structure, i.e., the d\times d (i,j)-block of G^{k}, denoted by G_{ij}^{k}, is equal to \mathbf{O}\in\mathbb{R}^{d\times d} for all j\notin\mathcal{N}_{i}\cup\{i\}. It can be shown that such quadratic \gamma^{k}'s satisfy Assumption 5. Substituting \gamma^{k}(\mathbf{x})=\frac{1}{2}\mathbf{x}^{T}G^{k}\mathbf{x} into (21) gives

\mathbf{x}^{k+1}=\mathbf{x}^{k}-G^{k}(\rho H\mathbf{x}^{k}+\nabla f(\mathbf{x}^{k})+\mathbf{q}^{k}), (22)

which can be executed in a distributed manner due to the neighbor-sparse structure of G^{k}. This leads to an additional distributed version of AMM, which is referred to as Distributed Approximate Method of Multipliers for Smooth problems with Quadratic update functions (DAMM-SQ) and is elaborated in Algorithm 3, where we introduce, for convenience, an auxiliary variable \mathbf{z}^{k}=((z_{1}^{k})^{T},\ldots,(z_{N}^{k})^{T})^{T}=\rho H\mathbf{x}^{k}+\nabla f(\mathbf{x}^{k})+\mathbf{q}^{k}.

Algorithm 3 DAMM-SQ
1:  Initialization:
2:  Same as Lines 2–3 in Algorithm 1.
3:  Each node i\in\mathcal{V} computes z_{i}^{0}=\nabla f_{i}(x_{i}^{0})+q_{i}^{0}+\rho\sum_{j\in\mathcal{N}_{i}\cup\{i\}}p_{ij}x_{j}^{0} and sends it to every neighbor j\in\mathcal{N}_{i}.
4:  for k\geq 0 do
5:     Each node i\in\mathcal{V} computes x_{i}^{k+1}=x_{i}^{k}-\sum_{j\in\mathcal{N}_{i}\cup\{i\}}G_{ij}^{k}z_{j}^{k}.
6:     Each node i\in\mathcal{V} sends x_{i}^{k+1} to every neighbor j\in\mathcal{N}_{i}.
7:     Each node i\in\mathcal{V} computes q_{i}^{k+1}=q_{i}^{k}+\rho\sum_{j\in\mathcal{N}_{i}\cup\{i\}}\tilde{p}_{ij}x_{j}^{k+1}.
8:     Each node i\in\mathcal{V} computes z_{i}^{k+1}=\nabla f_{i}(x_{i}^{k+1})+q_{i}^{k+1}+\rho\sum_{j\in\mathcal{N}_{i}\cup\{i\}}p_{ij}x_{j}^{k+1}.
9:     Each node i\in\mathcal{V} sends z_{i}^{k+1} to every neighbor j\in\mathcal{N}_{i}.
10:  end for
Example 4.

We present a new distributed gradient-tracking algorithm as an example of DAMM-SQ, where G^{k}=\frac{1}{\rho}(I_{N}-P)\otimes I_{d} \forall k\geq 0 with P=\tilde{P}\prec I_{N} under Assumption 3. Note that P=\tilde{P}\prec I_{N} can be satisfied by letting I_{N}-P=I_{N}-\tilde{P} be strictly diagonally dominant with positive diagonal entries. Similar to the discussions below Assumption 3, such P,\tilde{P} and therefore G^{k} can be locally determined by the nodes. Moreover, since (I_{N}-P)^{-1}\succeq I_{N}\succ P, (G^{k})^{-1}=\rho(I_{N}-P)^{-1}\otimes I_{d}\succ\rho H. Also, it can be verified that all the other conditions on G^{k} required by DAMM-SQ hold. With this particular G^{k}, (22) becomes

\mathbf{x}^{k+1}=((P+(I_{N}-P)^{2})\otimes I_{d})\mathbf{x}^{k}-\frac{1}{\rho}((I_{N}-P)\otimes I_{d})(\nabla f(\mathbf{x}^{k})+\mathbf{q}^{k}).

Since P\mathbf{1}=\mathbf{0} and (\mathbf{1}^{T}\otimes I_{d})\mathbf{q}^{k}=\mathbf{0}, this indicates \frac{1}{N}\sum_{i\in\mathcal{V}}x_{i}^{k+1}=\frac{1}{N}\sum_{i\in\mathcal{V}}x_{i}^{k}-\frac{1}{\rho}\bigl(\frac{1}{N}\sum_{i\in\mathcal{V}}\nabla f_{i}(x_{i}^{k})\bigr). It can thus be seen that this example of DAMM-SQ tracks the average of all the local gradients so as to imitate the behavior of the (centralized) gradient descent method.
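The toy numerical sketch below (scalar decision variable, quadratic f_{i}(x)=\frac{a_{i}}{2}(x-b_{i})^{2}, a ring graph, and illustrative parameter values that are not tuned to the conditions of Section IV) runs this gradient-tracking instance of DAMM-SQ and also tracks how closely the average-iterate relation displayed above is satisfied along the iterations.

import numpy as np

rng = np.random.default_rng(2)
N = 10
L = 2.0*np.eye(N)                                  # ring-graph Laplacian
for i in range(N):
    L[i, (i+1) % N] = L[i, (i-1) % N] = -1.0
P = 0.2*L                                          # P = P~; I - P strictly diagonally dominant
rho = 20.0                                         # illustrative penalty parameter
G = (np.eye(N) - P)/rho                            # G^k = (1/rho)(I_N - P), neighbor-sparse

a = rng.uniform(1.0, 2.0, size=N)                  # f_i(x) = (a_i/2)(x - b_i)^2
b = rng.normal(size=N)
x_opt = (a @ b)/a.sum()                            # minimizer of sum_i f_i

x = rng.normal(size=N); q = np.zeros(N)
avg_gap = 0.0
for k in range(3000):
    z = rho*(P @ x) + a*(x - b) + q                # z^k = rho*H x^k + grad f(x^k) + q^k
    x_new = x - G @ z                              # primal update (22)
    # deviation from: mean(x^{k+1}) = mean(x^k) - (1/rho)*mean(grad f(x^k))
    avg_gap = max(avg_gap, abs(x_new.mean() - (x.mean() - (a*(x - b)).mean()/rho)))
    q = q + rho*(P @ x_new)                        # dual update (10) with P~ = P
    x = x_new
print("max violation of the average-iterate relation:", avg_gap)
print("consensus error:", np.abs(x - x.mean()).max())
print("distance of the average to the optimum:", abs(x.mean() - x_opt))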

III Existing Algorithms as Specializations

This section exemplifies that AMM and its distributed realizations generalize a variety of existing distributed optimization methods originally developed in different ways.

III-A Specializations of DAMM-SQ

DAMM-SQ described in Algorithm 3 generalizes several distributed first-order and second-order algorithms for solving problem (1) with h_{i}\equiv 0 \forall i\in\mathcal{V}, including EXTRA [5], ID-FBBS [13], and DQM [19].

III-A1 EXTRA

EXTRA [5] is a well-known first-order algorithm developed from a decentralized gradient descent method. From [5, Eq. (3.5)], EXTRA can be expressed as

\mathbf{x}^{k+1}=(\tilde{W}\otimes I_{d})\mathbf{x}^{k}-\alpha\nabla f(\mathbf{x}^{k})+\sum_{t=0}^{k}((W-\tilde{W})\otimes I_{d})\mathbf{x}^{t}, (23)

where \mathbf{x}^{0} is arbitrarily given, \alpha>0, and W,\tilde{W}\in\mathbb{R}^{N\times N} are two average matrices associated with \mathcal{G}. (We say W\in\mathbb{R}^{N\times N} is an average matrix associated with \mathcal{G} if W=W^{T}, W\mathbf{1}=\mathbf{1}, \|W-\mathbf{1}\mathbf{1}^{T}/N\|<1, and [W]_{ij}=0 \forall i\in\mathcal{V} \forall j\notin\mathcal{N}_{i}\cup\{i\}; it can be shown that -I_{N}\prec W\preceq I_{N}.) By letting \mathbf{q}^{k}=\frac{1}{\alpha}\sum_{t=0}^{k}((\tilde{W}-W)\otimes I_{d})\mathbf{x}^{t}, (23) becomes

\mathbf{x}^{k+1}=(\tilde{W}\otimes I_{d})\mathbf{x}^{k}-\alpha\nabla f(\mathbf{x}^{k})-\alpha\mathbf{q}^{k}, (24)
\mathbf{q}^{k+1}=\mathbf{q}^{k}+\frac{1}{\alpha}((\tilde{W}-W)\otimes I_{d})\mathbf{x}^{k+1}. (25)

This is in the form of DAMM-SQ with \rho=1/\alpha, \tilde{P}=\tilde{W}-W, P=I_{N}-\tilde{W}, and G^{k}=\alpha I_{Nd}. As [5] assumes \tilde{W}\succeq W and \operatorname{Null}(\tilde{W}-W)=\operatorname{span}(\mathbf{1}), Assumption 3 and (11) are guaranteed. Besides, [5] assumes \tilde{W}\succ\mathbf{O}, so that (G^{k})^{-1}=\rho I_{Nd}\succeq\rho(I_{N}-\tilde{W})\otimes I_{d}=\rho H. It is then straightforward to see that this particular G^{k} satisfies all the conditions in Section II-C.
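As a quick sanity check of this reformulation, the short Python script below (toy scalar quadratic f_{i}; the specific W, \tilde{W}=(I_{N}+W)/2, and \alpha are illustrative choices) runs the original EXTRA recursion (23) and the DAMM-SQ form (24)–(25) side by side, with \mathbf{q}^{0}=\frac{1}{\alpha}(\tilde{W}-W)\mathbf{x}^{0} as in the change of variable above, and confirms that the two recursions generate identical iterates.

import numpy as np

rng = np.random.default_rng(3)
N, alpha = 5, 0.05
L = 2.0*np.eye(N)
for i in range(N):
    L[i, (i+1) % N] = L[i, (i-1) % N] = -1.0
W = np.eye(N) - 0.25*L                             # an average matrix for a ring graph
Wt = 0.5*(np.eye(N) + W)                           # W~ = (I+W)/2, the usual EXTRA choice
a = rng.uniform(1.0, 2.0, size=N); b = rng.normal(size=N)
grad = lambda x: a*(x - b)                         # f_i(x) = (a_i/2)(x - b_i)^2
x0 = rng.normal(size=N)

# original EXTRA form (23)
x_e = x0.copy(); hist = [x0.copy()]
for k in range(200):
    x_new = Wt @ x_e - alpha*grad(x_e) + (W - Wt) @ sum(hist)
    hist.append(x_new); x_e = x_new

# DAMM-SQ form (24)-(25) with q^0 = (1/alpha)(W~ - W) x^0
x_d = x0.copy(); q = (Wt - W) @ x0 / alpha
for k in range(200):
    x_d = Wt @ x_d - alpha*grad(x_d) - alpha*q
    q = q + (Wt - W) @ x_d / alpha

print("max difference between the two forms:", np.abs(x_e - x_d).max())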

III-A2 ID-FBBS

ID-FBBS [13] takes the form of (24)–(25), except that W=2\tilde{W}-I_{N} and \mathbf{q}^{0} can be any vector in S^{\perp}. Since [13] also assumes \tilde{W}\succ\mathbf{O}, it follows from the analysis in Section III-A1 that ID-FBBS is a particular example of DAMM-SQ, where \rho=1/\alpha, \tilde{P}=P=I_{N}-\tilde{W}, and G^{k}=\alpha I_{Nd}, with Assumption 3 and all the conditions on G^{k} in Section II-C satisfied.

III-A3 DQM

DQM [19] is a distributed second-order method for solving problem (1) with strongly convex, smooth, and twice continuously differentiable f_{i}'s and zero h_{i}'s. DQM takes the following form: x_{i}^{k+1}=x_{i}^{k}-(2c|\mathcal{N}_{i}|I_{d}+\nabla^{2}f_{i}(x_{i}^{k}))^{-1}(c\sum_{j\in\mathcal{N}_{i}}(x_{i}^{k}-x_{j}^{k})+\nabla f_{i}(x_{i}^{k})+q_{i}^{k}) and q_{i}^{k+1}=q_{i}^{k}+c\sum_{j\in\mathcal{N}_{i}}(x_{i}^{k+1}-x_{j}^{k+1}), where x_{i}^{0} \forall i\in\mathcal{V} are arbitrarily given, q_{i}^{0} \forall i\in\mathcal{V} satisfy \sum_{i\in\mathcal{V}}q_{i}^{0}=\mathbf{0}, and c>0. Observe that DAMM-SQ reduces to DQM if we set G^{k}=(2c\operatorname{diag}(|\mathcal{N}_{1}|,\ldots,|\mathcal{N}_{N}|)\otimes I_{d}+\nabla^{2}f(\mathbf{x}^{k}))^{-1}, \rho=c, p_{ij}=\tilde{p}_{ij}=-1 \forall\{i,j\}\in\mathcal{E}, and p_{ii}=\tilde{p}_{ii}=|\mathcal{N}_{i}| \forall i\in\mathcal{V}. Clearly, P and \tilde{P} satisfy Assumption 3. Additionally, (G^{k})^{-1}\succeq 2c\operatorname{diag}(|\mathcal{N}_{1}|,\ldots,|\mathcal{N}_{N}|)\otimes I_{d}\succeq\rho P\otimes I_{d}=\rho H, and G^{k} meets all the other requirements in Section II-C.

III-B Specializations of DAMM

A number of distributed algorithms for composite or nonsmooth convex optimization can be cast into the form of DAMM described in Algorithm 1, including PGC [16], PG-EXTRA [12], DPGA [11], a decentralized ADMM in [4], and D-FBBS [13].

III-B1 PGC and PG-EXTRA

PGC [16] and PG-EXTRA [12] are two recently proposed distributed methods for solving problem (1), where PGC is constructed upon ADMM [35] and PG-EXTRA is an extension of EXTRA [5] to address (1) with nonzero h_{i}'s. According to [16, Section V-D], PGC can be described by \mathbf{q}^{0}\in S^{\perp}, \mathbf{q}^{k}=\mathbf{q}^{0}+\sum_{\ell=1}^{k}((\Lambda_{\beta}(\tilde{W}-W))\otimes I_{d})\mathbf{x}^{\ell} \forall k\geq 1, and \mathbf{x}^{k+1}=\operatorname{\arg\;\min}_{\mathbf{x}\in\mathbb{R}^{Nd}}h(\mathbf{x})+\frac{1}{2}\|\mathbf{x}-(\tilde{W}\otimes I_{d})\mathbf{x}^{k}\|_{\Lambda_{\beta}\otimes I_{d}}^{2}+\langle\nabla f(\mathbf{x}^{k})+\mathbf{q}^{k},\mathbf{x}\rangle \forall k\geq 0, where \mathbf{x}^{0} is arbitrarily given and the parameters are chosen as follows: Let \Lambda_{\beta}=\operatorname{diag}(\beta_{1},\ldots,\beta_{N}) be a positive definite diagonal matrix and W,\tilde{W}\in\mathbb{R}^{N\times N} be two row-stochastic matrices such that \Lambda_{\beta}W and \Lambda_{\beta}\tilde{W} are symmetric, [W]_{ij},[\tilde{W}]_{ij}>0 \forall j\in\mathcal{N}_{i}\cup\{i\}, and [W]_{ij}=[\tilde{W}]_{ij}=0 otherwise. To cast PGC in the form of DAMM, let \rho=1, \tilde{P}=\Lambda_{\beta}(\tilde{W}-W), and P=\Lambda_{\beta}(I_{N}-\tilde{W}). Then, starting from any \mathbf{q}^{0}\in S^{\perp}, the updates of PGC can be expressed as

\mathbf{x}^{k+1}=\operatorname{\arg\;\min}_{\mathbf{x}\in\mathbb{R}^{Nd}}h(\mathbf{x})+\frac{1}{2}\|\mathbf{x}-\mathbf{x}^{k}\|_{\Lambda_{\beta}\otimes I_{d}}^{2}+\langle\nabla f(\mathbf{x}^{k})+\mathbf{q}^{k}+\rho(P\otimes I_{d})\mathbf{x}^{k},\mathbf{x}\rangle, (26)
\mathbf{q}^{k+1}=\mathbf{q}^{k}+\rho(\tilde{P}\otimes I_{d})\mathbf{x}^{k+1}. (27)

This means that PGC is exactly in the form of DAMM with \psi_{i}^{k}(\cdot)=\frac{\beta_{i}}{2}\|\cdot\|^{2}. Note that \Lambda_{\beta}\succeq\Lambda_{\beta}\tilde{W} and \operatorname{Null}(\Lambda_{\beta}(I_{N}-\tilde{W}))=\operatorname{span}(\mathbf{1}). In addition, [16] assumes \Lambda_{\beta}\tilde{W}\succeq\Lambda_{\beta}W, \Lambda_{\beta}\tilde{W}\succeq\mathbf{O}, and \operatorname{Null}(\Lambda_{\beta}(\tilde{W}-W))=\operatorname{span}(\mathbf{1}). Consequently, Assumption 3 and Assumption 4 hold. PG-EXTRA can also be described by (26)–(27) with \beta_{i}=\rho>0 \forall i\in\mathcal{V} and \mathbf{q}^{0}=\rho(\tilde{P}\otimes I_{d})\mathbf{x}^{0}, i.e., it is a special form of PGC. Therefore, DAMM generalizes both PGC and PG-EXTRA.

III-B2 DPGA and a Decentralized ADMM

DPGA [11] is a distributed proximal gradient method and has the following form: Given arbitrary x_{i}^{0} and q_{i}^{0}=\mathbf{0},

x_{i}^{k+1}=\operatorname{\arg\;\min}_{x\in\mathbb{R}^{d}}~h_{i}(x)+\frac{1}{2c_{i}}\|x-x_{i}^{k}\|^{2}+\langle\nabla f_{i}(x_{i}^{k})+q_{i}^{k}+\textstyle{\sum_{j\in\mathcal{N}_{i}\cup\{i\}}}\Gamma_{ij}x_{j}^{k},x\rangle,
q_{i}^{k+1}=q_{i}^{k}+\textstyle{\sum_{j\in\mathcal{N}_{i}\cup\{i\}}}\Gamma_{ij}x_{j}^{k+1},\quad\forall i\in\mathcal{V},

where c_{i}>0 \forall i\in\mathcal{V}, \Gamma_{ij}=\Gamma_{ji}<0 \forall\{i,j\}\in\mathcal{E}, and \Gamma_{ii}=-\sum_{j\in\mathcal{N}_{i}}\Gamma_{ij} \forall i\in\mathcal{V}. The above update equations of DPGA are equivalent to those of DAMM with \psi_{i}^{k}(\cdot)=\frac{1}{2c_{i}}\|\cdot\|^{2}, \rho=1, and P,\tilde{P} such that \tilde{p}_{ij},p_{ij} are equal to \Gamma_{ij} if \{i,j\}\in\mathcal{E} or i=j and are equal to 0 otherwise. Apparently, P and \tilde{P} satisfy Assumption 3. Furthermore, due to the conditions on c_{i} in [11], it is guaranteed that \sum_{i\in\mathcal{V}}\psi_{i}^{k}(x_{i})-\frac{\rho}{2}\|\mathbf{x}\|_{H}^{2} is convex and, thus, Assumption 4 holds. The decentralized ADMM in [4] for solving strongly convex and smooth problems can be viewed as a special case of DPGA with c_{i}=\frac{1}{2c|\mathcal{N}_{i}|} \forall i\in\mathcal{V} for some c>0 and \Gamma_{ij}=\Gamma_{ji}=-c \forall\{i,j\}\in\mathcal{E}, so that it is also a specialization of DAMM.

III-B3 D-FBBS

D-FBBS [13] addresses problem (1) with f_{i}\equiv 0 \forall i\in\mathcal{V} and is designed based on a Bregman method and an operator splitting technique. D-FBBS is described by (26)–(27) with \beta_{i}=\rho>0 \forall i\in\mathcal{V}, \tilde{P}=P=I_{N}-W for an average matrix W\in\mathbb{R}^{N\times N} associated with \mathcal{G}, \mathbf{x}^{0} being arbitrary, and \mathbf{q}^{0}\in S^{\perp}. Clearly, P,\tilde{P} satisfy Assumption 3. Now let \psi_{i}^{k}(\cdot)=\frac{\rho}{2}\|\cdot\|^{2}, which satisfies Assumption 4 because [13] assumes W\succ\mathbf{O}. Therefore, we conclude that D-FBBS can be specialized from DAMM.

III-C Specializations of AMM

Since DAMM-SQ and DAMM are subsets of AMM, the algorithms in Section III-A and Section III-B are also specializations of AMM. Below, we present some other methods [9, 14, 26, 18] that can be specialized from AMM but do not belong to DAMM, DAMM-SC, or DAMM-SQ.

III-C1 A Distributed ADMM

In [9], a distributed ADMM is proposed to solve (1) with f_{i}\equiv 0 \forall i\in\mathcal{V}:

\mathbf{x}^{k+1}=\operatorname{\arg\;\min}_{\mathbf{x}\in\mathbb{R}^{Nd}}h(\mathbf{x})+\langle Q^{T}\mathbf{w}^{k},\mathbf{x}\rangle+c\langle Q^{T}(\Lambda^{-1}\otimes I_{d})Q\mathbf{x}^{k},\mathbf{x}\rangle+\frac{c}{2}\|\mathbf{x}-\mathbf{x}^{k}\|_{\tilde{Q}}^{2},
\mathbf{w}^{k+1}=\mathbf{w}^{k}+c(\Lambda^{-1}\otimes I_{d})Q\mathbf{x}^{k+1},

where 𝐱0Nd\mathbf{x}^{0}\in\mathbb{R}^{Nd} is arbitrarily given and 𝐰0=𝟎\mathbf{w}^{0}=\mathbf{0}. In the above, c>0c>0, Λ=diag(|𝒩1|+1,,|𝒩N|+1)\Lambda=\operatorname{diag}(|\mathcal{N}_{1}|+1,\ldots,|\mathcal{N}_{N}|+1), and Q=ΓIdQ=\Gamma\otimes I_{d} with ΓN×N\Gamma\in\mathbb{R}^{N\times N} satisfying [Γ]ij=0[\Gamma]_{ij}=0 i𝒱\forall i\in\mathcal{V} j𝒩i{i}\forall j\notin\mathcal{N}_{i}\cup\{i\} and Null(ΓTΛ1Γ)\operatorname{Null}(\Gamma^{T}\Lambda^{-1}\Gamma) =span(𝟏)=\operatorname{span}(\mathbf{1}). Additionally, Q~=diag(Q1TQ1,,QNTQN)Nd×Nd\tilde{Q}=\operatorname{diag}(Q_{1}^{T}Q_{1},\ldots,Q_{N}^{T}Q_{N})\in\mathbb{R}^{Nd\times Nd}, where QiNd×dQ_{i}\in\mathbb{R}^{Nd\times d} i𝒱\forall i\in\mathcal{V} are such that Q=(Q1,,QN)Q=(Q_{1},\ldots,Q_{N}). It is shown in [9] that Q~QT(Λ1Id)Q\tilde{Q}\succeq Q^{T}(\Lambda^{-1}\!\otimes\!I_{d})Q. We may view this algorithm as a specialization from AMM (9)–(11), in which 𝐪k=QT𝐰k\mathbf{q}^{k}=Q^{T}\mathbf{w}^{k}, H=H~=QT(Λ1Id)QH=\tilde{H}=Q^{T}(\Lambda^{-1}\otimes I_{d})Q, ρ=c\rho=c, and uk(𝐱)=ρ2(𝐱𝐱k)T(Q~H)(𝐱𝐱k)u^{k}(\mathbf{x})=\frac{\rho}{2}(\mathbf{x}-\mathbf{x}^{k})^{T}(\tilde{Q}-H)(\mathbf{x}-\mathbf{x}^{k}). Apparently, H,H~H,\tilde{H} are symmetric and positive semidefinite, and Null(H)=Null(H~)=S\operatorname{Null}(H)=\operatorname{Null}(\tilde{H})=S, as required in Section II-A. Also, since each QiQ_{i} has full column rank, Q~𝐎\tilde{Q}\succ\mathbf{O}. This guarantees Assumption 2.

III-C2 A Distributed Primal-Dual Algorithm

In [14], a distributed primal-dual algorithm is developed to solve (1) with each hih_{i} equal to the indicator function Xi\mathcal{I}_{X_{i}} with respect to a closed convex set XiX_{i}, and takes the following form:

𝐱k+1=PX[𝐱kα(f(𝐱k)+(ΓId)𝐰k+(ΓId)𝐱k)],\displaystyle\!\!\mathbf{x}^{k+1}\!=\!P_{X}[\mathbf{x}^{k}\!\!-\!\alpha(\nabla f(\mathbf{x}^{k})\!+\!(\Gamma\otimes I_{d})\mathbf{w}^{k}\!\!+\!\!(\Gamma\otimes I_{d})\mathbf{x}^{k})], (28)
𝐰k+1=𝐰k+α(ΓId)𝐱k,\displaystyle\!\!\mathbf{w}^{k+1}\!=\!\mathbf{w}^{k}+\alpha(\Gamma\otimes I_{d})\mathbf{x}^{k}, (29)

where 𝐱0,𝐰0\mathbf{x}^{0},\mathbf{w}^{0} are arbitrarily given, X=X1××XNX=X_{1}\times\cdots\times X_{N}, and PX[]P_{X}[\cdot] is the projection onto XX. Besides, ΓN×N\Gamma\in\mathbb{R}^{N\times N} is a symmetric positive semidefinite matrix satisfying [Γ]ij=0[\Gamma]_{ij}=0 i𝒱\forall i\in\mathcal{V} j𝒩i{i}\forall j\notin\mathcal{N}_{i}\cup\{i\} and Null(Γ)=span(𝟏)\operatorname{Null}(\Gamma)=\operatorname{span}(\mathbf{1}), and 0<α12Γ0<\alpha\leq\frac{1}{2\|\Gamma\|}. To see how this algorithm relates to AMM, we use (29) to rewrite (28) as

𝐱k+1=\displaystyle\mathbf{x}^{k+1}= argmin𝐱Ndh(𝐱)\displaystyle\operatorname{\arg\;\min}_{\mathbf{x}\in\mathbb{R}^{Nd}}~{}h(\mathbf{x})
+12α𝐱(𝐱kα(f(𝐱k)+(ΓId)𝐰k+1\displaystyle+\frac{1}{2\alpha}\|\mathbf{x}-\Bigl{(}\mathbf{x}^{k}-\alpha\bigl{(}\nabla f(\mathbf{x}^{k})+(\Gamma\otimes I_{d})\mathbf{w}^{k+1}
+((ΓαΓ2)Id)𝐱k))2,\displaystyle+((\Gamma-\alpha\Gamma^{2})\otimes I_{d})\mathbf{x}^{k}\bigr{)}\Bigr{)}\|^{2}, (30)

where h(𝐱)=X(𝐱)h(\mathbf{x})=\mathcal{I}_{X}(\mathbf{x}). It can be seen that (29)–(30) are equivalent to AMM (4)–(5) with 𝐯k=𝐰k+1\mathbf{v}^{k}=\mathbf{w}^{k+1}, ρ=α\rho=\alpha, H~=Γ2Id\tilde{H}=\Gamma^{2}\otimes I_{d}, H=(Γ/αΓ2)IdH=(\Gamma/\alpha-\Gamma^{2})\otimes I_{d}, and uk(𝐱)=f(𝐱k),𝐱+12α𝐱𝐱kINdα(ΓαΓ2)Id2u^{k}(\mathbf{x})=\langle\nabla f(\mathbf{x}^{k}),\mathbf{x}\rangle+\frac{1}{2\alpha}\|\mathbf{x}-\mathbf{x}^{k}\|_{I_{Nd}-\alpha(\Gamma-\alpha\Gamma^{2})\otimes I_{d}}^{2}. Since α12Γ\alpha\leq\frac{1}{2\|\Gamma\|}, we have HΓId/(2α)H\succeq\Gamma\otimes I_{d}/(2\alpha) and INdα(ΓαΓ2)IdINdαΓId𝐎I_{Nd}-\alpha(\Gamma-\alpha\Gamma^{2})\otimes I_{d}\succeq I_{Nd}-\alpha\Gamma\otimes I_{d}\succ\mathbf{O}. Thus, H,H~H,\tilde{H} satisfy the conditions required by AMM in Section II-A. Also, uku^{k} is strongly convex and satisfies Assumption 2.
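For illustration, a minimal NumPy sketch of one iteration of (28)–(29) is given below, assuming x and w are N-by-d arrays, grad_f is a list of local gradient callables, and project_X is a list of projection operators onto the sets X_i; these names are hypothetical.

import numpy as np

def primal_dual_step(x, w, Gamma, alpha, grad_f, project_X):
    # One iteration of (28)-(29); rows of x and w are the local primal/dual variables.
    mix_x = Gamma @ x                                 # (Gamma ⊗ I_d) x^k, computed row-wise
    mix_w = Gamma @ w                                 # (Gamma ⊗ I_d) w^k
    x_new = np.empty_like(x)
    for i in range(x.shape[0]):
        step = grad_f[i](x[i]) + mix_w[i] + mix_x[i]
        x_new[i] = project_X[i](x[i] - alpha * step)  # projection onto X_i
    w_new = w + alpha * mix_x                         # dual update uses the old iterate x^k
    return x_new, w_new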

III-C3 DIGing on Static Networks

DIGing [26] is a distributed gradient-tracking method for solving problem (1) with hi0h_{i}\equiv 0 i𝒱\forall i\in\mathcal{V} over time-varying networks. Here, we only consider DIGing on static undirected networks. Let α>0\alpha>0 and WN×NW\in\mathbb{R}^{N\times N} satisfy W𝟏=WT𝟏=𝟏W\mathbf{1}=W^{T}\mathbf{1}=\mathbf{1}, [W]ij=0[W]_{ij}=0 i𝒱\forall i\in\mathcal{V} j𝒩i{i}\forall j\notin\mathcal{N}_{i}\cup\{i\}, and W1N𝟏𝟏T<1\|W-\frac{1}{N}\mathbf{1}\mathbf{1}^{T}\|<1. It is shown in [26] that DIGing with W=WTW=W^{T} can be written as follows:

𝐱k+2=\displaystyle\mathbf{x}^{k+2}= (2WId)𝐱k+1(W2Id)𝐱k\displaystyle(2W\otimes I_{d})\mathbf{x}^{k+1}-(W^{2}\otimes I_{d})\mathbf{x}^{k}
α(f(𝐱k+1)f(𝐱k)),k0,\displaystyle-\alpha(\nabla f(\mathbf{x}^{k+1})-\nabla f(\mathbf{x}^{k})),\quad\forall k\geq 0,

where 𝐱0\mathbf{x}^{0} can be arbitrary and 𝐱1=(WId)𝐱0αf(𝐱0)\mathbf{x}^{1}=(W\otimes I_{d})\mathbf{x}^{0}-\alpha\nabla f(\mathbf{x}^{0}). Adding the above equation from k=0k=0 to k=K1k=K-1 yields

𝐱K+1=\displaystyle\mathbf{x}^{K+1}= (W2Id)𝐱K+((INW)Id)𝐱0\displaystyle(W^{2}\otimes I_{d})\mathbf{x}^{K}+((I_{N}-W)\otimes I_{d})\mathbf{x}^{0}
k=0K((INW)2Id)𝐱kαf(𝐱K)\displaystyle-\sum_{k=0}^{K}((I_{N}-W)^{2}\otimes I_{d})\mathbf{x}^{k}-\alpha\nabla f(\mathbf{x}^{K})

for all K0K\geq 0. By letting 𝐪0=1α((W2W)Id)𝐱0\mathbf{q}^{0}=\frac{1}{\alpha}((W^{2}-W)\otimes I_{d})\mathbf{x}^{0}, the above update is the same as

𝐱K+1\displaystyle\mathbf{x}^{K+1} =(W2Id)𝐱Kαf(𝐱K)α𝐪K,\displaystyle=(W^{2}\otimes I_{d})\mathbf{x}^{K}-\alpha\nabla f(\mathbf{x}^{K})-\alpha\mathbf{q}^{K},
𝐪K+1\displaystyle\mathbf{q}^{K+1} =𝐪K+1α((INW)2Id)𝐱K+1.\displaystyle=\mathbf{q}^{K}+\frac{1}{\alpha}((I_{N}-W)^{2}\otimes I_{d})\mathbf{x}^{K+1}.

Such an algorithmic form of DIGing is identical to AMM (9)–(11) with the above given 𝐪0S\mathbf{q}^{0}\in S^{\bot}, ρ=1/α\rho=1/\alpha, H=(INW2)IdH=(I_{N}-W^{2})\otimes I_{d}, H~=(INW)2Id\tilde{H}=(I_{N}-W)^{2}\otimes I_{d}, and uk(𝐱)=f(𝐱k),𝐱+ρ2(WId)(𝐱𝐱k)2u^{k}(\mathbf{x})=\langle\nabla f(\mathbf{x}^{k}),\mathbf{x}\rangle+\frac{\rho}{2}\|(W\otimes I_{d})(\mathbf{x}-\mathbf{x}^{k})\|^{2}. It can be verified that uku^{k} k0\forall k\geq 0 and H,H~H,\tilde{H} satisfy all the conditions in Section II-A.
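A minimal NumPy sketch of the primal-dual form of DIGing derived above is given below, where x and q are N-by-d arrays, W is the symmetric mixing matrix, and grad_f is a list of local gradient callables; the names are ours.

import numpy as np

def diging_pd_step(x, q, W, alpha, grad_f):
    # One iteration of the rewritten DIGing updates above.
    grads = np.stack([grad_f[i](x[i]) for i in range(x.shape[0])])
    x_new = W @ (W @ x) - alpha * grads - alpha * q   # (W^2 ⊗ I_d) x^K - α ∇f(x^K) - α q^K
    L = np.eye(W.shape[0]) - W
    q_new = q + (L @ (L @ x_new)) / alpha             # q^{K+1} = q^K + (1/α)((I_N - W)^2 ⊗ I_d) x^{K+1}
    return x_new, q_new

# Initialization consistent with the derivation above:
# q0 = (W @ (W @ x0) - W @ x0) / alpha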

III-C4 ESOM

ESOM [18] is a class of distributed second-order algorithms that address problem (1) with fif_{i} i𝒱\forall i\in\mathcal{V} being strongly convex, smooth, and twice continuously differentiable and with hi0h_{i}\equiv 0 i𝒱\forall i\in\mathcal{V}. It is developed by incorporating a proximal technique and certain second-order approximations into the Method of Multipliers. To describe ESOM, let α>0\alpha>0, ϵ>0\epsilon>0, and WN×NW\in\mathbb{R}^{N\times N} be an average matrix associated with 𝒢\mathcal{G} such that [W]ij0[W]_{ij}\geq 0 i,j𝒱\forall i,j\in\mathcal{V}. In addition, define Wd:=diag([W]11,,[W]NN)W_{d}:=\operatorname{diag}([W]_{11},\ldots,[W]_{NN}), B:=α(INd+(W2Wd)Id)B:=\alpha(I_{Nd}+(W-2W_{d})\otimes I_{d}), Dk:=2f(𝐱k)+ϵINd+2α(INdWdId)D^{k}:=\nabla^{2}f(\mathbf{x}^{k})+\epsilon I_{Nd}+2\alpha(I_{Nd}-W_{d}\otimes I_{d}), and Qk(K):=(Dk)12t=0K((Dk)12B(Dk)12)t(Dk)12Q^{k}(K):=(D^{k})^{-\frac{1}{2}}\sum_{t=0}^{K}((D^{k})^{-\frac{1}{2}}B(D^{k})^{-\frac{1}{2}})^{t}(D^{k})^{-\frac{1}{2}}, K0K\geq 0. With the above notations, each ESOM-KK algorithm can be described by

𝐱k+1\displaystyle\mathbf{x}^{k+1}\! =𝐱kQk(K)(f(𝐱k)+𝐪k+α(INdWId)𝐱k),\displaystyle=\!\mathbf{x}^{k}\!\!-\!Q^{k}(\!K\!)(\nabla f(\mathbf{x}^{k})\!\!+\!\mathbf{q}^{k}\!+\!\alpha(I_{Nd}\!-\!W\!\!\otimes\!I_{d})\mathbf{x}^{k}), (31)
𝐪k+1\displaystyle\mathbf{q}^{k+1}\! =𝐪k+α(INdWId)𝐱k+1,\displaystyle=\mathbf{q}^{k}+\alpha(I_{Nd}-\!W\!\otimes I_{d})\mathbf{x}^{k+1}, (32)

where 𝐱0\mathbf{x}^{0} is any vector in Nd\mathbb{R}^{Nd} and 𝐪0=𝟎\mathbf{q}^{0}=\mathbf{0} which satisfies the initialization (11) of AMM. Note that the primal and dual updates of AMM, i.e., (9)–(10), reduce to (31)–(32) when H~=H=INdWId\tilde{H}=H=I_{Nd}-W\otimes I_{d}, ρ=α\rho=\alpha, and uk(𝐱)=12𝐱T((Qk(K))1ρH)𝐱((Qk(K))1ρH)𝐱k,𝐱+f(𝐱k),𝐱u^{k}(\mathbf{x})=\frac{1}{2}\mathbf{x}^{T}((Q^{k}(K))^{-1}-\rho H)\mathbf{x}-\langle((Q^{k}(K))^{-1}-\rho H)\mathbf{x}^{k},\mathbf{x}\rangle+\langle\nabla f(\mathbf{x}^{k}),\mathbf{x}\rangle. Clearly, HH and H~\tilde{H} satisfy the conditions in Section II-A. Also, Assumption 2 holds, since (M+ϵ+2ρ)INd(Qk(K))12f(𝐱k)+ϵINd+ρH(M+\epsilon+2\rho)I_{Nd}\succeq(Q^{k}(K))^{-1}\succeq\nabla^{2}f(\mathbf{x}^{k})+\epsilon I_{Nd}+\rho H [18], where M>0M>0 is the Lipschitz constant of all the fi\nabla f_{i}’s. Note that unlike most specializations of AMM discussed in this section, uk()+ρ2H2u^{k}(\cdot)+\frac{\rho}{2}\|\cdot\|_{H}^{2} for ESOM-KK with K1K\geq 1 is non-separable.
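Note that Q^k(K) never needs to be formed explicitly: applying it to a vector amounts to K+1 solves with the block-diagonal matrix D^k and K multiplications by B, each of which corresponds to one round of neighbor communication in a distributed implementation. The following is a minimal centralized NumPy sketch of one ESOM-K iteration under the definitions above; grad_f_full and hess_f_full (assumed names) return the stacked gradient and the block-diagonal Hessian of f at the stacked iterate.

import numpy as np

def apply_Qk(g, D, B, K):
    # Returns Q^k(K) @ g via the recursion s_0 = D^{-1} g, s_{t+1} = D^{-1}(g + B s_t),
    # whose K-th iterate equals sum_{t=0}^{K} (D^{-1} B)^t D^{-1} g = Q^k(K) g.
    s = np.linalg.solve(D, g)
    for _ in range(K):
        s = np.linalg.solve(D, g + B @ s)
    return s

def esom_step(x, q, W, alpha, eps, grad_f_full, hess_f_full, K):
    # One ESOM-K iteration (31)-(32); x and q are stacked Nd-vectors, W is N-by-N.
    N = W.shape[0]
    d = x.size // N
    I = np.eye(N * d)
    H = I - np.kron(W, np.eye(d))                     # I_{Nd} - W ⊗ I_d
    W_d = np.diag(np.diag(W))
    D = hess_f_full(x) + eps * I + 2 * alpha * (I - np.kron(W_d, np.eye(d)))
    B = alpha * (I + np.kron(W - 2 * W_d, np.eye(d)))
    g = grad_f_full(x) + q + alpha * (H @ x)
    x_new = x - apply_Qk(g, D, B, K)
    q_new = q + alpha * (H @ x_new)
    return x_new, q_new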

III-D Connections between AMM and Existing Unifying Methods

PUDA [22] and ABC [21] are two recently proposed distributed methods for convex composite optimization, which unify a number of existing methods including a subset of the aforementioned specializations of AMM. Nevertheless, unlike AMM that can be specialized to both first-order and second-order methods, PUDA and ABC are first-order algorithms. Moreover, AMM allows hih_{i} i𝒱\forall i\in\mathcal{V} to be non-identical, i.e., each node only needs to know a local portion of the global nonsmooth objective function hh. In contrast, PUDA and ABC are restricted to the case of identical hih_{i}’s, i.e., each node has to know the entire hh, and it is not straightforward to extend their analyses to the more general case of non-identical hih_{i}’s.

Although none of AMM, PUDA, and ABC can include the others as special cases, they are implicitly connected through the following algorithm: Let 𝐪0=𝟎Nd\mathbf{q}^{0}=\mathbf{0}_{Nd}. For any k0k\geq 0, let

𝐳k+1\displaystyle\mathbf{z}^{k+1} =argmin𝐳Nduk(𝐳)+ρ2𝐳H2+(𝐪k)T𝐳,\displaystyle=\operatorname{\arg\;\min}_{\mathbf{z}\in\mathbb{R}^{Nd}}u^{k}(\mathbf{z})+\frac{\rho}{2}\|\mathbf{z}\|_{H}^{2}+(\mathbf{q}^{k})^{T}\mathbf{z}, (33)
𝐱k+1\displaystyle\mathbf{x}^{k+1} =argmin𝐱Ndh(𝐱)+ρ2𝐱𝐳k+12,\displaystyle=\operatorname{\arg\;\min}_{\mathbf{x}\in\mathbb{R}^{Nd}}~{}h(\mathbf{x})+\frac{\rho}{2}\|\mathbf{x}-\mathbf{z}^{k+1}\|^{2}, (34)
𝐪k+1\displaystyle\mathbf{q}^{k+1} =𝐪k+ρH~𝐳k+1.\displaystyle=\mathbf{q}^{k}+\rho\tilde{H}\mathbf{z}^{k+1}. (35)

Different from AMM whose primal update (9) involves both the surrogate function and the nonsmooth objective, the above algorithm tackles these two parts separately and thus has two sequential primal minimization operations. Indeed, when h0h\equiv 0, (33)–(35) are identical to (9)–(10).
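As an illustration, consider a quadratic surrogate u^k(z) = (1/2)||z - x^k||_A^2 + <grad f(x^k), z> with A symmetric and A + rho*H positive definite; then step (33) reduces to a linear solve and step (34) to a proximal map. A minimal centralized NumPy sketch of one pass of (33)–(35) under this assumption is given below, where grad_f_full and prox_h (assumed names) return the stacked gradient of f and the proximal map prox_h(v, t) = argmin_x h(x) + ||x - v||^2/(2t), respectively.

import numpy as np

def unified_step(x, q, A, H, H_tilde, rho, grad_f_full, prox_h):
    # One pass of (33)-(35) with the quadratic surrogate described above;
    # x and q are stacked Nd-vectors, and A, H, H_tilde are Nd-by-Nd matrices.
    # (33): the minimizer satisfies (A + rho*H) z = A x^k - grad f(x^k) - q^k.
    z = np.linalg.solve(A + rho * H, A @ x - grad_f_full(x) - q)
    # (34): proximal map of h/rho evaluated at z^{k+1}.
    x_new = prox_h(z, 1.0 / rho)
    # (35): dual ascent along H_tilde.
    q_new = q + rho * (H_tilde @ z)
    return x_new, q_new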

It can be shown that with particular uku^{k}, HH, and H~\tilde{H}, (33)–(35) and 𝐪0=𝟎Nd\mathbf{q}^{0}=\mathbf{0}_{Nd} become equivalent to PUDA and ABC under some mild parameter conditions. Specifically, PUDA corresponds to 𝐮k(𝐳)=ρ2𝐳𝐱kI𝒞12+𝐟(𝐱k),𝐳\mathbf{u}^{k}(\mathbf{z})=\frac{\rho}{2}\|\mathbf{z}-\mathbf{x}^{k}\|_{I-\mathcal{C}_{1}}^{2}+\langle\nabla\mathbf{f}(\mathbf{x}^{k}),\mathbf{z}\rangle, ρ>0\rho>0, H=𝒜11INd+𝒞1H=\mathcal{A}_{1}^{-1}-I_{Nd}+\mathcal{C}_{1}, and H~=12𝒜11\tilde{H}=\mathcal{B}_{1}^{2}\mathcal{A}_{1}^{-1}, while ABC corresponds to 𝐮k(𝐳)=ρ2𝐳𝐱k21𝒜22+𝐟(𝐱k),𝐳\mathbf{u}^{k}(\mathbf{z})=\frac{\rho}{2}\|\mathbf{z}-\mathbf{x}^{k}\|_{\mathcal{B}_{2}^{-1}\mathcal{A}_{2}}^{2}+\langle\nabla\mathbf{f}(\mathbf{x}^{k}),\mathbf{z}\rangle, ρ>0\rho>0, H=21(INd𝒜2)H=\mathcal{B}_{2}^{-1}(I_{Nd}-\mathcal{A}_{2}), and H~=21𝒞2\tilde{H}=\mathcal{B}_{2}^{-1}\mathcal{C}_{2}. Here, 𝒜1,1,𝒞1,𝒜2,2,𝒞2Nd×Nd\mathcal{A}_{1},\mathcal{B}_{1},\mathcal{C}_{1},\mathcal{A}_{2},\mathcal{B}_{2},\mathcal{C}_{2}\in\mathbb{R}^{Nd\times Nd} are proper matrices whose detailed designs can be found in [22, 21]. Such equivalence requires additional conditions that 𝒜1\mathcal{A}_{1} and 2\mathcal{B}_{2} are invertible, 𝒞1INd\mathcal{C}_{1}\preceq I_{Nd}, and 12𝒜11\mathcal{B}_{1}^{2}\mathcal{A}_{1}^{-1} is symmetric. In fact, these additional conditions are satisfied by many specializations of PUDA and ABC such as those in [22, Table I] and [21, Table II] when they choose positive definite average matrices. Also, the above options of uku^{k}, HH, and H~\tilde{H} for PUDA and ABC satisfy our conditions for AMM in Section II-A.

To summarize, PUDA and ABC with some additional parameter conditions can be specialized from (33)–(35), while AMM and (33)–(35) are the same if problem (1) is smooth and are different otherwise. As a result, when solving smooth problems, AMM, PUDA, and ABC share a number of common special cases (e.g., EXTRA [5] and DIGing [26]).

IV Convergence Analysis

In this section, we analyze the convergence performance of AMM described by (4)–(5) (which are equivalent to (9)–(11)).

For the purpose of analysis, we add HH~H\succeq\tilde{H} to the conditions on H,H~H,\tilde{H} in Section II-A, leading to Assumption 6.

Assumption 6.

The matrices HH and H~\tilde{H} are symmetric and Null(H)=Null(H~)=S\operatorname{Null}(H)=\operatorname{Null}(\tilde{H})=S, where SS is defined in (2). In addition, HH~𝐎H\succeq\tilde{H}\succeq\mathbf{O}.

In DAMM, DAMM-SC, and DAMM-SQ, we adopt H=PIdH=P\otimes I_{d} and H~=P~Id\tilde{H}=\tilde{P}\otimes I_{d}, where PP and P~\tilde{P} comply with Assumption 3. Thus, as long as we further impose P~P\tilde{P}\preceq P, Assumption 6 holds. Moreover, all the existing specializations of AMM, DAMM, and DAMM-SQ in Section III, except DIGing [26], also satisfy Assumption 6. DIGing does not necessarily meet HH~H\succeq\tilde{H}, yet it can easily satisfy HH~H\succeq\tilde{H} by letting its average matrix be symmetric positive semidefinite.
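In practice, whether candidate matrices H and H~ satisfy Assumption 6 can be verified numerically. The following is a small sketch of such a check for dense NumPy matrices, where S is the consensus subspace defined in (2); the function name and tolerance are our own choices.

import numpy as np

def check_assumption_6(H, H_tilde, N, d, tol=1e-8):
    # Symmetry of H and H_tilde.
    ok = np.allclose(H, H.T, atol=tol) and np.allclose(H_tilde, H_tilde.T, atol=tol)
    # H ⪰ H_tilde ⪰ O.
    ok &= np.linalg.eigvalsh(H - H_tilde).min() >= -tol
    ok &= np.linalg.eigvalsh(H_tilde).min() >= -tol
    # Null(H) = Null(H_tilde) = S: the consensus subspace lies in both null spaces ...
    S_basis = np.kron(np.ones((N, 1)), np.eye(d))     # columns span S
    ok &= np.allclose(H @ S_basis, 0.0, atol=tol)
    ok &= np.allclose(H_tilde @ S_basis, 0.0, atol=tol)
    # ... and neither null space is larger than S.
    ok &= np.linalg.matrix_rank(H, tol=tol) == N * d - d
    ok &= np.linalg.matrix_rank(H_tilde, tol=tol) == N * d - d
    return ok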

In what follows, we let (𝐱k,𝐯k)(\mathbf{x}^{k},\mathbf{v}^{k}) be generated by (4)–(5), and let (𝐱,𝐯)(\mathbf{x}^{\star},\mathbf{v}^{\star}) be a primal-dual optimal solution pair of problem (3), which satisfies (7). Also, let ΛM:=diag(M1,,MN)Id\Lambda_{M}:=\operatorname{diag}(M_{1},\ldots,M_{N})\otimes I_{d}, where Mi>0M_{i}>0 is the Lipschitz constant of fi\nabla f_{i}. In addition, define

Ak:=012uk((1s)𝐱k+s𝐱k+1)𝑑s,k0.\displaystyle A^{k}:=\int_{0}^{1}\nabla^{2}u^{k}((1-s)\mathbf{x}^{k}+s\mathbf{x}^{k+1})ds,\quad\forall k\geq 0.

Note that every AkA^{k} exists and is symmetric due to Assumption 2. Moreover, since uk(𝐱k)=f(𝐱k)\nabla u^{k}(\mathbf{x}^{k})=\nabla f(\mathbf{x}^{k}),

uk(𝐱k+1)\displaystyle\nabla u^{k}(\mathbf{x}^{k+1})
=\displaystyle= uk(𝐱k)+012uk((1s)𝐱k+s𝐱k+1)(𝐱k+1𝐱k)𝑑s\displaystyle\nabla u^{k}(\mathbf{x}^{k})+\int_{0}^{1}\nabla^{2}u^{k}((1-s)\mathbf{x}^{k}+s\mathbf{x}^{k+1})(\mathbf{x}^{k+1}-\mathbf{x}^{k})ds
=\displaystyle= f(𝐱k)+Ak(𝐱k+1𝐱k).\displaystyle\nabla f(\mathbf{x}^{k})+A^{k}(\mathbf{x}^{k+1}-\mathbf{x}^{k}). (36)

Then, we introduce the following auxiliary lemma for deriving the main results.

Lemma 1.

Suppose Assumption 1, Assumption 2, and Assumption 6 hold. For any η[0,1)\eta\in[0,1) and k0k\geq 0,

1ρ𝐯k𝐯k+1,𝐯k+1𝐯η𝐱k𝐱,f(𝐱k)f(𝐱)\displaystyle\frac{1}{\rho}\langle\mathbf{v}^{k}-\mathbf{v}^{k+1},\mathbf{v}^{k+1}-\mathbf{v}^{\star}\rangle\geq\eta\langle\mathbf{x}^{k}-\mathbf{x}^{\star},\nabla f(\mathbf{x}^{k})-\nabla f(\mathbf{x}^{\star})\rangle
+𝐱k+1𝐱,Ak(𝐱k+1𝐱k)𝐱k+1𝐱kΛM24(1η).\displaystyle+\langle\mathbf{x}^{k+1}-\mathbf{x}^{\star},A^{k}(\mathbf{x}^{k+1}-\mathbf{x}^{k})\rangle-\frac{\|\mathbf{x}^{k+1}-\mathbf{x}^{k}\|_{\Lambda_{M}}^{2}}{4(1-\eta)}. (37)
Proof.

See Appendix -A. ∎

IV-A Convergence under General Convexity

In this subsection, we provide the convergence rates of AMM in solving the general convex problem (1).

Let 𝐱¯k=1kt=1k𝐱t\bar{\mathbf{x}}^{k}\!=\!\frac{1}{k}\sum_{t=1}^{k}\mathbf{x}^{t} k1\forall k\geq 1 be the running average of 𝐱t\mathbf{x}^{t} from t=1t=1 to t=kt=k. Below, we derive sublinear convergence rates for (i) the consensus error H~12𝐱¯k\|\tilde{H}^{\frac{1}{2}}\bar{\mathbf{x}}^{k}\|, which represents the infeasibility of 𝐱¯k\bar{\mathbf{x}}^{k} in solving the equivalent problem (3), and (ii) the objective error |f(𝐱¯k)+h(𝐱¯k)f(𝐱)h(𝐱)||f(\bar{\mathbf{x}}^{k})+h(\bar{\mathbf{x}}^{k})-f(\mathbf{x}^{\star})-h(\mathbf{x}^{\star})|, which is a direct measure of optimality. Throughout Section IV-A, we tentatively consider the following type of surrogate function that fulfills Assumption 2:

uk(𝐱)=12𝐱𝐱kA2+f(𝐱k),𝐱,k0,u^{k}(\mathbf{x})=\frac{1}{2}\|\mathbf{x}-\mathbf{x}^{k}\|_{A}^{2}+\langle\nabla f(\mathbf{x}^{k}),\mathbf{x}\rangle,\quad\forall k\geq 0, (38)

where ANd×NdA\in\mathbb{R}^{Nd\times Nd} satisfies A=ATA=A^{T}, A𝐎A\succeq\mathbf{O}, and A+ρH𝐎A+\rho H\succ\mathbf{O}. Such a choice of uku^{k} results in Ak=AA^{k}=A k0\forall k\geq 0.

Note that AMM endowed with (38) still generalizes most existing algorithms discussed in Section III, including EXTRA [5], ID-FBBS [13], PGC [16], PG-EXTRA [12], DPGA [11], the decentralized ADMM in [4], D-FBBS [13], the distributed ADMM in [9], the distributed primal-dual algorithm in [14], and DIGing [26]. Although DQM [19] and ESOM [18] are specialized from AMM with uku^{k}’s different from (38), they both require problem (1) to be strongly convex and smooth; convergence rates for the general form of AMM in that case are provided in Section IV-B.

We now bound H~12𝐱¯k\|\tilde{H}^{\frac{1}{2}}\bar{\mathbf{x}}^{k}\| and |f(𝐱¯k)+h(𝐱¯k)f(𝐱)h(𝐱)||f(\bar{\mathbf{x}}^{k})+h(\bar{\mathbf{x}}^{k})-f(\mathbf{x}^{\star})-h(\mathbf{x}^{\star})|. These bounds play a key role in establishing the rates at which the two errors vanish.

Lemma 2.

Suppose Assumption 1, Assumption 6, and (38) hold. For any k1k\geq 1,

H~12𝐱¯k=1ρk𝐯k𝐯0,\displaystyle\|\tilde{H}^{\frac{1}{2}}\bar{\mathbf{x}}^{k}\|=\frac{1}{\rho k}\|\mathbf{v}^{k}-\mathbf{v}^{0}\|, (39)
f(𝐱¯k)+h(𝐱¯k)f(𝐱)h(𝐱)\displaystyle f(\bar{\mathbf{x}}^{k})\!+\!h(\bar{\mathbf{x}}^{k})\!-\!f(\mathbf{x}^{\star})\!-\!h(\mathbf{x}^{\star})
1k(𝐯022ρ+𝐱0𝐱A22+t=0k1𝐱t+1𝐱tΛMA22),\displaystyle\leq\frac{1}{k}\!\Bigl{(}\frac{\|\mathbf{v}^{0}\|^{2}}{2\rho}\!+\!\frac{\|\mathbf{x}^{0}\!-\!\mathbf{x}^{\star}\|_{A}^{2}}{2}\!+\!\sum_{t=0}^{k-1}\!\frac{\|\mathbf{x}^{t+1}\!-\!\mathbf{x}^{t}\|_{\Lambda_{M}\!-\!A}^{2}}{2}\Bigr{)}, (40)
f(𝐱¯k)+h(𝐱¯k)f(𝐱)h(𝐱)𝐯𝐯k𝐯0ρk.\displaystyle f(\bar{\mathbf{x}}^{k})\!+\!h(\bar{\mathbf{x}}^{k})\!-\!f(\mathbf{x}^{\star})\!-\!h(\mathbf{x}^{\star})\!\geq-\!\frac{\|\mathbf{v}^{\star}\|\cdot\|\mathbf{v}^{k}-\mathbf{v}^{0}\|}{\rho k}. (41)
Proof.

See Appendix -B. ∎

Observe from Lemma 2 that as long as 𝐯k𝐯0\|\mathbf{v}^{k}-\mathbf{v}^{0}\| and t=0k1𝐱t+1𝐱tΛMA2\sum_{t=0}^{k-1}\|\mathbf{x}^{t+1}-\mathbf{x}^{t}\|_{\Lambda_{M}-A}^{2} are bounded, H~12𝐱¯k\|\tilde{H}^{\frac{1}{2}}\bar{\mathbf{x}}^{k}\| and |f(𝐱¯k)+h(𝐱¯k)f(𝐱)h(𝐱)||f(\bar{\mathbf{x}}^{k})+h(\bar{\mathbf{x}}^{k})-f(\mathbf{x}^{\star})-h(\mathbf{x}^{\star})| are guaranteed to converge to 0. The following lemma, derived with the help of Lemma 1, assures that this is indeed the case, where 𝐳k:=((𝐱k)T,(𝐯k)T)T\mathbf{z}^{k}:=((\mathbf{x}^{k})^{T},(\mathbf{v}^{k})^{T})^{T} k0\forall k\geq 0, 𝐳:=((𝐱)T,(𝐯)T)T\mathbf{z}^{\star}:=((\mathbf{x}^{\star})^{T},(\mathbf{v}^{\star})^{T})^{T}, and G:=diag(A,INd/ρ)G:=\operatorname{diag}(A,I_{Nd}/\rho).

Lemma 3.

Suppose all the conditions in Lemma 2 hold. Also suppose that AΛM/2A\succ\Lambda_{M}/2. For any k1k\geq 1,

𝐯k𝐯0𝐯0𝐯+ρ𝐳0𝐳G,\displaystyle\|\mathbf{v}^{k}-\mathbf{v}^{0}\|\leq\|\mathbf{v}^{0}-\mathbf{v}^{\star}\|+\sqrt{\rho}\|\mathbf{z}^{0}-\mathbf{z}^{\star}\|_{G}, (42)
t=0k1𝐱t+1𝐱tΛMA2𝐳0𝐳G21σ¯,\displaystyle\sum_{t=0}^{k-1}\|\mathbf{x}^{t+1}-\mathbf{x}^{t}\|_{\Lambda_{M}-A}^{2}\leq\frac{\|\mathbf{z}^{0}-\mathbf{z}^{\star}\|_{G}^{2}}{1-\underline{\sigma}}, (43)

where σ¯:=min{σ|σAΛM/2}(0,1)\underline{\sigma}:=\min\{\sigma|~{}\sigma A\succeq\Lambda_{M}/2\}\in(0,1).

Proof.

See Appendix -C. ∎

The following theorem, which results from Lemma 2 and Lemma 3, provides O(1/k)O(1/k) rates for both the consensus error and the objective error at the running average 𝐱¯k\bar{\mathbf{x}}^{k}.

Theorem 1.

Suppose all the conditions in Lemma 3 hold. For any k1k\geq 1,

H~12𝐱¯k𝐯0𝐯+ρ𝐳0𝐳Gρk,\displaystyle\|\tilde{H}^{\frac{1}{2}}\bar{\mathbf{x}}^{k}\|\leq\frac{\|\mathbf{v}^{0}-\mathbf{v}^{\star}\|+\sqrt{\rho}\|\mathbf{z}^{0}-\mathbf{z}^{\star}\|_{G}}{\rho k},
f(𝐱¯k)+h(𝐱¯k)f(𝐱)h(𝐱)\displaystyle f(\bar{\mathbf{x}}^{k})\!+\!h(\bar{\mathbf{x}}^{k})\!-\!f(\mathbf{x}^{\star})\!-\!h(\mathbf{x}^{\star})
12k(𝐯02ρ+𝐱0𝐱A2+𝐳0𝐳G21σ¯),\displaystyle\leq\frac{1}{2k}\left(\frac{\|\mathbf{v}^{0}\|^{2}}{\rho}\!+\!\|\mathbf{x}^{0}-\mathbf{x}^{\star}\|_{A}^{2}\!+\!\frac{\|\mathbf{z}^{0}-\mathbf{z}^{\star}\|_{G}^{2}}{1-\underline{\sigma}}\right),
f(𝐱¯k)+h(𝐱¯k)f(𝐱)h(𝐱)\displaystyle f(\bar{\mathbf{x}}^{k})+h(\bar{\mathbf{x}}^{k})-f(\mathbf{x}^{\star})-h(\mathbf{x}^{\star})
𝐯(𝐯0𝐯+ρ𝐳0𝐳G)ρk.\displaystyle\geq-\frac{\|\mathbf{v}^{\star}\|(\|\mathbf{v}^{0}-\mathbf{v}^{\star}\|+\sqrt{\rho}\|\mathbf{z}^{0}-\mathbf{z}^{\star}\|_{G})}{\rho k}.
Proof.

Substitute (42)–(43) into (39)–(41). ∎

Among the existing distributed algorithms that are able to solve nonsmooth convex optimization problems with non-strongly convex objective functions (e.g., [10, 6, 7, 9, 11, 12, 13, 16, 17, 23, 24, 33, 31, 34, 32, 21]), the best convergence rates are of O(1/k)O(1/k), achieved by [9, 11, 16, 12, 13, 7, 33, 21]. Indeed, AMM with all the conditions in Theorem 1 generalizes the algorithms in [9, 11, 16, 12, 13] as well as their parameter conditions. Like Theorem 1, [9, 11, 16, 21] achieve O(1/k)O(1/k) rates for both the objective error and the consensus error, whereas [12, 13, 7, 33] reach O(1/k)O(1/k) rates in terms of optimality residuals that implicitly reflect deviations from an optimality condition, and they only guarantee O(1/k)O(1/\sqrt{k}) rates for the consensus error. Also note that [33, 21] only address problem (1) with identical hih_{i}’s.

Theorem 1 provides new convergence results for some existing algorithms generalized by AMM. In particular, the O(1/k)O(1/k) rate in terms of the objective error is new to PG-EXTRA [12] and D-FBBS [13] for nonsmooth problems as well as EXTRA [5] and ID-FBBS [13] for smooth problems, which only had O(1/k)O(1/k) rates with respect to optimality residuals before. Moreover, Theorem 1 improves their O(1/k)O(1/\sqrt{k}) rates in reaching consensus to O(1/k)O(1/k). Furthermore, Theorem 1 extends the O(1/k)O(1/k) rate of the distributed primal-dual algorithm in [14] for constrained smooth convex optimization to general nonsmooth convex problems, and allows PGC [16] to establish its O(1/k)O(1/k) rate without the originally required condition that each hih_{i} needs to have a compact domain.

Finally, we illustrate how the sublinear rates in Theorem 1 are influenced by the graph topology. For simplicity, set 𝐯0=𝟎\mathbf{v}^{0}=\mathbf{{0}} and H~INd\tilde{H}\preceq I_{Nd}. Also, let gh(𝐱)g^{\star}\in\partial h(\mathbf{x}^{\star}) be such that f(𝐱)+gS\nabla f(\mathbf{x}^{\star})+g^{\star}\in S^{\perp}, which exists because of the optimality condition i𝒱fi(x)(i𝒱hi)(x)-\sum_{i\in\mathcal{V}}\nabla f_{i}(x^{\star})\in(\partial\sum_{i\in\mathcal{V}}h_{i})(x^{\star}), and let 𝐯=(H~12)(f(𝐱)+g)\mathbf{{v}}^{\star}=-(\tilde{H}^{\frac{1}{2}})^{\dagger}(\nabla f(\mathbf{x}^{\star})+g^{\star}). It follows from (7) that 𝐯\mathbf{{v}}^{\star} is an optimal solution to the Lagrange dual of (3). Then, 𝐯0𝐯=𝐯1λH~f(𝐱)+g\|\mathbf{{v}}^{0}-\mathbf{{v}}^{\star}\|=\|\mathbf{{v}}^{\star}\|\leq\frac{1}{\sqrt{\lambda_{\tilde{H}}}}\|\nabla f(\mathbf{x}^{\star})+g^{\star}\|, where λH~(0,1]\lambda_{\tilde{H}}\in(0,1] is the smallest nonzero eigenvalue of H~\tilde{H}. Note that 𝐳0𝐳G=(𝐱0𝐱A2+1ρ𝐯0𝐯2)12𝐱0𝐱A+1ρ𝐯0𝐯\|\mathbf{{z}}^{0}-\mathbf{{z}}^{\star}\|_{G}=(\|\mathbf{{x}}^{0}-\mathbf{{x}}^{\star}\|_{A}^{2}+\frac{1}{\rho}\|\mathbf{{v}}^{0}-\mathbf{{v}}^{\star}\|^{2})^{\frac{1}{2}}\leq\|\mathbf{{x}}^{0}-\mathbf{{x}}^{\star}\|_{A}+\frac{1}{\sqrt{\rho}}\|\mathbf{{v}}^{0}-\mathbf{{v}}^{\star}\|. Thus, by letting ρ=1λH~\rho=\frac{1}{\sqrt{\lambda_{\tilde{H}}}}, we have |f(𝐱¯k)+h(𝐱¯k)f(𝐱)h(𝐱)|O(1kλH~)|f(\bar{\mathbf{x}}^{k})+h(\bar{\mathbf{x}}^{k})-f(\mathbf{x}^{\star})-h(\mathbf{x}^{\star})|\leq O(\frac{1}{k\sqrt{\lambda_{\tilde{H}}}}). In addition, PS[𝐱¯k]1λH~PS[𝐱¯k]H~=1λH~H~12𝐱¯kO(1kλH~)\|P_{S^{\perp}}[\bar{\mathbf{{x}}}^{k}]\|\leq\frac{1}{\sqrt{\lambda_{\tilde{H}}}}\|P_{S^{\perp}}[\bar{\mathbf{{x}}}^{k}]\|_{\tilde{H}}=\frac{1}{\sqrt{\lambda_{\tilde{H}}}}\|\tilde{H}^{\frac{1}{2}}\bar{\mathbf{x}}^{k}\|\leq O(\frac{1}{k\sqrt{\lambda_{\tilde{H}}}}). Here, PS[𝐱¯k]\|P_{S^{\perp}}[\bar{\mathbf{{x}}}^{k}]\| plays a similar role as H~12𝐱¯k\|\tilde{H}^{\frac{1}{2}}\bar{\mathbf{x}}^{k}\| in quantifying the consensus error. Therefore, denser network connectivity, which is often indicated by larger λH~\lambda_{\tilde{H}}, can yield faster convergence of both the objective error and the consensus error (measured by PS[𝐱¯k]\|P_{S^{\perp}}[\bar{\mathbf{{x}}}^{k}]\|).

Compared to the existing specializations of AMM that also have O(1/k)O(1/k) rates in solving (1), the above O(1/(kλH~))O(1/(k\sqrt{\lambda_{\tilde{H}}})) rate has better dependence on λH~(0,1]\lambda_{\tilde{H}}\in(0,1] than the O(1/(kλH~))O(1/(k\lambda_{\tilde{H}})) rate of DPGA [11] and the O(1/(kλH~2))O(1/(k\lambda_{\tilde{H}}^{2})) rate of the distributed ADMM [9]. The remaining specializations discussed in Section III have not been provided with similar dependence of their convergence rates on the graph topology.

IV-B Convergence under Local Restricted Strong Convexity

In this subsection, we impose an additional problem assumption and derive the corresponding convergence rates of AMM. Henceforth, we no longer restrict uku^{k} to the form of (38).

Assumption 7.

The function i𝒱fi(x)\sum_{i\in\mathcal{V}}f_{i}(x) is locally restricted strongly convex with respect to xx^{\star}, where xx^{\star} is an optimum of problem (1), i.e., for any convex and compact set CdC\subset\mathbb{R}^{d} containing xx^{\star}, i𝒱fi(x)\sum_{i\in\mathcal{V}}f_{i}(x) is restricted strongly convex with respect to xx^{\star} on CC with convexity parameter mC>0m_{C}>0.

Restricted strong convexity has been studied in the literature. For example, [44, 5] consider its global version and [20] considers the local version as in Assumption 7. Under Assumption 7, problem (1) has a unique optimum xx^{\star}, and problem (3) has a unique optimum 𝐱=((x)T,,(x)T)TS\mathbf{x}^{\star}=((x^{\star})^{T},\ldots,(x^{\star})^{T})^{T}\in S [20]. Furthermore, Assumption 7 is less restrictive than the standard global strong convexity condition. For instance, the objective function of logistic regression [45] is not globally strongly convex but meets Assumption 7.

Proposition 1.

Suppose Assumption 1 holds. If there exists ϵ>0\epsilon>0 such that each fif_{i} is twice continuously differentiable on the ball B(x,ϵ):={x|xxϵ}B(x^{\star},\epsilon):=\{x|~{}\|x-x^{\star}\|\leq\epsilon\} and i𝒱2fi(x)\sum_{i\in\mathcal{V}}\nabla^{2}f_{i}(x^{\star}) is positive definite, then Assumption 7 holds.

Proof.

See Appendix -D. ∎

Proposition 1 provides a sufficient condition for Assumption 7. When each fif_{i} is twice continuously differentiable, this sufficient condition is much weaker than global strong convexity which requires i𝒱2fi(x)𝐎\sum_{i\in\mathcal{V}}\nabla^{2}f_{i}(x)\succ\mathbf{O} xd\forall x\in\mathbb{R}^{d}.

Note that Assumption 7 does not ensure any strong convexity of f(𝐱)=i𝒱fi(xi)f(\mathbf{x})=\sum_{i\in\mathcal{V}}f_{i}(x_{i}), yet it guarantees the following property of f(𝐱)+ρ4𝐱H~2f(\mathbf{x})+\frac{\rho}{4}\|\mathbf{x}\|_{\tilde{H}}^{2}.

Proposition 2.

[20, Lemma 1] Under Assumption 6 and Assumption 7, f(𝐱)+ρ4𝐱H~2f(\mathbf{x})+\frac{\rho}{4}\|\mathbf{x}\|_{\tilde{H}}^{2} is locally restricted strongly convex with respect to 𝐱\mathbf{x}^{\star}.

Subsequently, we construct a convex and compact set 𝒞Nd\mathcal{C}\subset\mathbb{R}^{Nd} containing 𝐱\mathbf{x}^{\star}, so that the restricted strong convexity in Proposition 2 takes effect on 𝒞\mathcal{C}, leading to a parameter condition that guarantees 𝐱k𝒞\mathbf{x}^{k}\in\mathcal{C} k0\forall k\geq 0. To introduce 𝒞\mathcal{C}, note from Assumption 2 that there exist symmetric positive semidefinite matrices AA_{\ell}, AuNd×NdA_{u}\in\mathbb{R}^{Nd\times Nd} such that for any 𝐱Nd\mathbf{x}\in\mathbb{R}^{Nd} and k0k\geq 0, A2uk(𝐱)AuA_{\ell}\preceq\nabla^{2}u^{k}(\mathbf{x})\preceq A_{u}. Let Δ:=AuA0\Delta:=\|A_{u}-A_{\ell}\|\geq 0. Moreover, define Aa=(A+Au)/2A_{a}=(A_{\ell}+A_{u})/2 and G~=diag(Aa,INd/ρ)\tilde{G}=\operatorname{diag}(A_{a},I_{Nd}/\rho), which are also symmetric positive semidefinite. Then, let 𝒞:={𝐱|𝐱𝐱Aa+ρH~2𝐳0𝐳G~2+ρ𝐱0H~2}\mathcal{C}:=\left\{\mathbf{x}|~{}\|\mathbf{x}-\mathbf{x}^{\star}\|_{A_{a}+\rho\tilde{H}}^{2}\leq\|\mathbf{z}^{0}-\mathbf{z}^{\star}\|_{\tilde{G}}^{2}+\rho\|\mathbf{x}^{0}\|_{\tilde{H}}^{2}\right\}, where 𝐳0\mathbf{z}^{0} and 𝐳\mathbf{z}^{\star} are defined above Lemma 3. From Assumption 2(b) and Assumption 6, Aa+ρH~𝐎A_{a}+\rho\tilde{H}\succ\mathbf{O}, so that 𝒞\mathcal{C} is convex and compact. Thus, from Proposition 2, f(𝐱)+ρ4𝐱H~2f(\mathbf{x})+\frac{\rho}{4}\|\mathbf{x}\|_{\tilde{H}}^{2} is restricted strongly convex with respect to 𝐱\mathbf{x}^{\star} on 𝒞\mathcal{C} with convexity parameter mρ(0,)m_{\rho}\in(0,\infty). Lemma 4 is another consequence of Lemma 1, showing that 𝐱k\mathbf{x}^{k} remains in 𝒞\mathcal{C} for all k0k\geq 0.

Lemma 4.

Under Assumption 1, Assumption 2, Assumption 6, and Assumption 7, 𝐱k𝒞\mathbf{x}^{k}\in\mathcal{C} k0\forall k\geq 0 provided that

Aa(Δ28ηmρ+Δ)INd+ΛM2(1η) for some η(0,1),\displaystyle A_{a}\!\succ\!\!\left(\!\frac{\Delta^{2}}{8\eta m_{\rho}}\!+\!\Delta\right)I_{Nd}\!+\!\frac{\Lambda_{M}}{2(1\!-\!\eta)}\text{ for some $\eta\in(0,1)$}, (44)

where ΛM\Lambda_{M} is defined above Lemma 1.

Proof.

See Appendix -E. ∎

When Δ=0\Delta=0 (which means 2uk\nabla^{2}u^{k} is constant) or i𝒱fi(x)\sum_{i\in\mathcal{V}}f_{i}(x) is globally restricted strongly convex (which means mρm_{\rho} can be independent of 𝒞\mathcal{C}), it is straightforward to find uku^{k} so that (44) holds. Otherwise, mρm_{\rho} depends on 𝒞\mathcal{C} and thus on AaA_{a}, so that both sides of (44) involve AaA_{a}. With that being said, (44) can always be satisfied by proper uku^{k}’s. To see this, arbitrarily pick η(0,1)\eta\in(0,1) and λ¯a,λ¯a>0\bar{\lambda}_{a},\underline{\lambda}_{a}>0 such that λ¯a>λ¯a>M2(1η)\bar{\lambda}_{a}>\underline{\lambda}_{a}>\frac{M}{2(1-\eta)}, where M:=maxi𝒱MiM:=\max_{i\in\mathcal{V}}M_{i}. If we choose uku^{k} such that the corresponding AaA_{a} satisfies λ¯aINdAaλ¯aINd\underline{\lambda}_{a}I_{Nd}\preceq A_{a}\preceq\bar{\lambda}_{a}I_{Nd}, then 𝒞\mathcal{C} is a subset of 𝒞:={𝐱|λ¯a𝐱𝐱2λ¯a𝐱0𝐱2+1ρ𝐯0𝐯2+ρ𝐱0H~2}\mathcal{C}^{\prime}:=\{\mathbf{x}|~{}\underline{\lambda}_{a}\|\mathbf{x}-\mathbf{x}^{\star}\|^{2}\leq\bar{\lambda}_{a}\|\mathbf{x}^{0}-\mathbf{x}^{\star}\|^{2}+\frac{1}{\rho}\|\mathbf{v}^{0}-\mathbf{v}^{\star}\|^{2}+\rho\|\mathbf{x}^{0}\|_{\tilde{H}}^{2}\}. Let mρm_{\rho}^{\prime} be any convexity parameter of f(𝐱)+ρ𝐱H~24f(\mathbf{x})+\frac{\rho\|\mathbf{x}\|_{\tilde{H}}^{2}}{4} on 𝒞\mathcal{C}^{\prime}. Clearly, mρ(0,mρ]m_{\rho}^{\prime}\in(0,m_{\rho}] and is independent of AaA_{a}. Then, the following suffices to satisfy (44):

(Δ28ηmρ+Δ+λ¯a)INdAaλ¯aINd.\left(\frac{\Delta^{2}}{8\eta m_{\rho}^{\prime}}+\Delta+\underline{\lambda}_{a}\right)I_{Nd}\preceq A_{a}\preceq\bar{\lambda}_{a}I_{Nd}. (45)

The lower and upper bounds on AaA_{a} in (45) do not need to depend on AaA_{a}. Also, (45) is well-posed, since Δ28ηmρ+Δ+λ¯aλ¯a\frac{\Delta^{2}}{8\eta m_{\rho}^{\prime}}+\Delta+\underline{\lambda}_{a}\leq\bar{\lambda}_{a} holds for sufficiently small Δ>0\Delta>0. Therefore, uku^{k} k0\forall k\geq 0 with such Δ\Delta and with AaA_{a} satisfying (45) meet (44).

Next, we present the convergence rates of AMM in both optimality and feasibility under the local restricted strong convexity condition. We first consider the smooth case of (1) with each hih_{i} identically equal to 0, and provide a linear convergence rate for 𝐳k𝐳G~2+ρ𝐱kH~2\|\mathbf{z}^{k}-\mathbf{z}^{\star}\|_{\tilde{G}}^{2}+\rho\|\mathbf{x}^{k}\|_{\tilde{H}}^{2}, which quantifies the distance to primal optimality, dual optimality, and primal feasibility of (3). In Theorem 2, we force 𝐯\mathbf{v^{\star}} to satisfy not only (7) but 𝐯𝐯0S\mathbf{v}^{\star}-\mathbf{v}^{0}\in S^{\perp} as well. Such 𝐯\mathbf{v}^{\star} is a particular optimum to the Lagrange dual of (3), and can be chosen as 𝐯=𝐯~+((1Ni𝒱(vi0v~i))T,,(1Ni𝒱(vi0v~i))T)T\mathbf{v}^{\star}=\tilde{\mathbf{v}}+((\frac{1}{N}\sum_{i\in\mathcal{V}}(v_{i}^{0}-\tilde{v}_{i}))^{T},\ldots,(\frac{1}{N}\sum_{i\in\mathcal{V}}(v_{i}^{0}-\tilde{v}_{i}))^{T})^{T}, where 𝐯~\tilde{\mathbf{v}} is any Lagrange dual optimum of (3) and 𝐯0=((v10)T,,(vN0)T)T\mathbf{v}^{0}=((v_{1}^{0})^{T},\ldots,(v_{N}^{0})^{T})^{T} is the initial dual iterate of AMM.

Theorem 2.

Suppose all the conditions in Lemma 4 hold. Also suppose hi(x)0h_{i}(x)\equiv 0 i𝒱\forall i\in\mathcal{V}. Then, there exists δ~(0,1)\tilde{\delta}\in(0,1) such that for each k0k\geq 0,

𝐳k+1𝐳G~2+ρ𝐱k+1H~2\displaystyle\|\mathbf{z}^{k+1}-\mathbf{z}^{\star}\|_{\tilde{G}}^{2}+\rho\|\mathbf{x}^{k+1}\|_{\tilde{H}}^{2}
(1δ~)(𝐳k𝐳G~2+ρ𝐱kH~2).\displaystyle\leq(1-\tilde{\delta})(\|\mathbf{z}^{k}-\mathbf{z}^{\star}\|_{\tilde{G}}^{2}+\rho\|\mathbf{x}^{k}\|_{\tilde{H}}^{2}). (46)

Specifically, δ~=max{δ|Bi(δ)𝐎,i{1,2,3}}\tilde{\delta}=\max\{\delta|~{}B_{i}(\delta)\succeq\mathbf{O},\forall i\in\{1,2,3\}\} with

B1(δ)=\displaystyle B_{1}(\delta)\!= (2ηmρβΔ)INdδAaδ(1+θ1)(1+θ2)ΛM2/(ρλH~),\displaystyle(2\eta m_{\rho}\!\!-\!\beta\Delta)I_{Nd}\!-\!\delta A_{a}\!\!-\!\delta(1\!+\!\theta_{1})(1\!+\!\theta_{2})\Lambda_{M}^{2}/\!(\rho\lambda_{\tilde{H}}\!),
B2(δ)=\displaystyle B_{2}(\delta)\!= ρ(1δη)H~ρδ(1+1/θ1)HH~H,\displaystyle\rho(1-\delta-\eta)\tilde{H}-\rho\delta(1+1/\theta_{1})H\tilde{H}^{\dagger}H,
B3(δ)=\displaystyle B_{3}(\delta)\!= (1σ)Aa2δ(1+θ1)(1+1/θ2)(Au2ρλH~+ρλ¯H12H~H12),\displaystyle(1\!-\!\sigma)\!A_{a}\!\!-\!2\delta(1\!+\!\theta_{1}\!)(1\!+\!\!1/\theta_{2}\!)(\tfrac{A_{u}^{2}}{\rho\lambda_{\tilde{H}}}\!+\!\rho\bar{\lambda}H^{\frac{1}{2}}\tilde{H}^{\dagger}H^{\frac{1}{2}}),

where θ1,θ2\theta_{1},\theta_{2} are arbitrary positive scalars, λ¯\bar{\lambda} is any upper bound on H\|H\|, η(0,1)\eta\in(0,1) is given in (44), β>0\beta>0 and σ(0,1)\sigma\in(0,1) are such that

βΔ<2ηmρ,σAa(14β+1)ΔINd+ΛM2(1η),\beta\Delta<2\eta m_{\rho},\quad\sigma A_{a}\succeq(\frac{1}{4\beta}+1)\Delta I_{Nd}+\frac{\Lambda_{M}}{2(1-\eta)}, (47)

and λH~>0\lambda_{\tilde{H}}>0 is the smallest nonzero eigenvalue of H~\tilde{H}.

Proof.

See Appendix -F. ∎

In comparison with many existing works such as [4, 18, 19, 9, 26, 27, 6, 7, 28, 29, 21, 22] that assume global strong convexity to derive linear convergence rates, the linear rate in Theorem 2 only requires the weaker condition of local restricted strong convexity in Assumption 7. Moreover, Theorem 2 provides new convergence results to a number of existing algorithms that can be specialized from AMM. Specifically, Theorem 2 establishes linear convergence for D-FBBS [13], ID-FBBS [13], and DPGA [11], which has never been discussed before. In addition, Theorem 2 allows EXTRA [5], DIGing [26], ESOM [18], DQM [19], the distributed ADMMs [9, 4], PUDA [22], and ABC [21] to achieve linear convergence under less restrictive problem assumptions, including relaxing the global strong convexity in [26, 18, 19, 9, 4, 21, 22] and the global restricted strong convexity in [5] to local restricted strong convexity, and eliminating the Lipschitz continuity of 2fi\nabla^{2}f_{i} required in [18, 19].

Both the problem and the graph have impacts on the linear rate (1δ~)(1-\tilde{\delta}) in Theorem 2. Their impacts become explicit when each fif_{i} is globally restricted strongly convex with respect to xx^{\star} with convexity parameter m>0m>0. For simplicity, we set Mi=MmM_{i}=M\geq m i𝒱\forall i\in\mathcal{V} and mρ=mm_{\rho}=m. Also, choose H=H~H=\tilde{H} and proper uku^{k} k0\forall k\geq 0. In the expressions of Bi(δ)B_{i}(\delta) i=1,2,3\forall i=1,2,3, we pick any θ2>0\theta_{2}>0 and then let θ1\theta_{1} be such that max{δ:B2(δ)𝐎}>max{δ:B1(δ)𝐎}\max\{\delta:B_{2}(\delta)\succeq\mathbf{O}\}>\max\{\delta:B_{1}(\delta)\succeq\mathbf{O}\}. We first suppose mm increases. Note that uku^{k} still satisfies (44) with the same η\eta, while we allow for a larger β\beta and thus a smaller σ\sigma such that (47) holds. Accordingly, max{δ:Bi(δ)𝐎}\max\{\delta:B_{i}(\delta)\succeq\mathbf{O}\} i=1,3\forall i=1,3 become larger and, therefore, so does δ~\tilde{\delta}. Similarly, smaller MM yields larger δ~\tilde{\delta}. Now suppose the graph topology changes so that λH~\lambda_{\tilde{H}} increases, which implies denser connectivity of 𝒢\mathcal{G}. Since λ¯\bar{\lambda} can be any upper bound on H\|H\|, we can always assume it is unchanged with the graph topology (e.g., λ¯=N\bar{\lambda}=N when H=L𝒢IdH=L_{\mathcal{G}}\otimes I_{d}). Then, it can be seen that δ~\tilde{\delta} increases.

The above discussions suggest that smaller condition number κf:=M/m\kappa_{f}:=M/m or denser network connectivity may lead to faster convergence of AMM in solving strongly convex and smooth problems. Indeed, for some particular forms of AMM, such dependence can be more explicit and is comparable to some prior results. For example, let H=H~=(INW)Id/4H=\tilde{H}=(I_{N}-W)\otimes I_{d}/4 with an average matrix WW associated with 𝒢\mathcal{G}, ρ=M2/λH~\rho=M\sqrt{2/\lambda_{\tilde{H}}}, and uku^{k} be such that Δ=0\Delta=0 and Aa=ρ(3IN+W)Id4A_{a}=\frac{\rho(3I_{N}+W)\otimes I_{d}}{4}. Thus, ρINdAaρ2INd\rho I_{Nd}\succeq A_{a}\succeq\frac{\rho}{2}I_{Nd}, Hλ¯=12\|H\|\leq\bar{\lambda}=\frac{1}{2}, and λH~=1λ2(W)4(0,12)\lambda_{\tilde{H}}=\frac{1-\lambda_{2}(W)}{4}\in(0,\frac{1}{2}), where λ2(W)\lambda_{2}(W) is the second largest eigenvalue of WW. Accordingly, ρ2M\rho\geq 2M and AaMINdA_{a}\succeq MI_{Nd}. We may pick η=14\eta=\frac{1}{4} and σ=23\sigma=\frac{2}{3} so that (47) holds. Then, by choosing θ1=θ2=1\theta_{1}=\theta_{2}=1 in Theorem 2, we can show that

δ~=O(min{1λ2(W)κf,1λ2(W)}).\tilde{\delta}=O(\min\{\tfrac{\sqrt{1-\lambda_{2}(W)}}{\kappa_{f}},1-\lambda_{2}(W)\}). (48)

Most existing linearly convergent specializations of AMM have not revealed such explicit relations as (48), except the following. The decentralized ADMM [4] has the same result as (48), and DIGing [26] gives δ~=O((1λ2(W))2κf1.5)\tilde{\delta}=O(\frac{(1-\lambda_{2}(W))^{2}}{\kappa_{f}^{1.5}}) when W𝐎W\succeq\mathbf{O}, which is worse than (48). The distributed ADMM [9] can reach δ~=O(1λ2(W)κf)\tilde{\delta}=O(\frac{1-\lambda_{2}(W)}{\sqrt{\kappa_{f}}}), which is better than (48). This is probably because (48) follows directly from our analysis, which covers more general problem assumptions and algorithmic forms.

Finally, we remove the condition of hi0h_{i}\equiv 0 i𝒱\forall i\in\mathcal{V} in Theorem 2 and derive the following result.

Theorem 3.

Suppose all the conditions in Lemma 4 hold. For any k1k\geq 1,

H~12𝐱¯kd0+𝐯0𝐯ρk,\displaystyle\!\|\tilde{H}^{\frac{1}{2}}\bar{\mathbf{x}}^{k}\|\leq\frac{d_{0}+\|\mathbf{v}^{0}-\mathbf{v}^{\star}\|}{\rho k}, (49)
f(𝐱¯k)+h(𝐱¯k)f(𝐱)h(𝐱)1k\displaystyle\!f(\bar{\mathbf{x}}^{k})+h(\bar{\mathbf{x}}^{k})-f(\mathbf{x}^{\star})-h(\mathbf{x}^{\star})\!\leq\frac{1}{k}
(d02ρ(Δ2(2ηmρβΔ)+1η1σ)+𝐱0𝐱Au22+𝐯022ρ),\displaystyle\!\cdot\!\Bigl{(}\!\frac{d_{0}^{2}}{\rho}\bigl{(}\dfrac{\Delta}{2(2\eta m_{\rho}\!-\!\beta\Delta)}\!+\!\dfrac{1\!-\!\eta}{1\!-\!\sigma}\bigr{)}\!\!+\!\!\frac{\|\mathbf{x}^{0}\!-\!\mathbf{x}^{\star}\|_{A_{u}}^{2}}{2}\!+\!\!\frac{\|\mathbf{v}^{0}\|^{2}}{2\rho}\!\Bigr{)}, (50)
f(𝐱¯k)+h(𝐱¯k)f(𝐱)h(𝐱)𝐯(d0+𝐯0𝐯)ρk,\displaystyle\!f(\bar{\mathbf{x}}^{k})\!\!+\!h(\bar{\mathbf{x}}^{k})\!-\!\!f(\mathbf{x}^{\star})\!-\!h(\mathbf{x}^{\star})\!\!\geq\!\!-\frac{\|\mathbf{v}^{\star}\|(d_{0}\!+\!\!\|\mathbf{v}^{0}\!-\!\mathbf{v}^{\star}\|)}{\rho k}, (51)

where d0=ρ𝐳0𝐳G~2+ρ2𝐱0H~2d_{0}=\sqrt{\rho\|\mathbf{z}^{0}-\mathbf{z}^{\star}\|_{\tilde{G}}^{2}+\rho^{2}\|\mathbf{x}^{0}\|_{\tilde{H}}^{2}}, η(0,1)\eta\in(0,1) is given in (44), and β>0\beta>0 and σ(0,1)\sigma\in(0,1) are such that (47) holds.

Proof.

See Appendix -G. ∎

The convergence rates in Theorem 3 are of the same order as those in Theorem 1. Theorem 1 considers a more general optimization problem, while Theorem 3 allows for more general uku^{k}’s. In the literature, [8, 25, 15, 10] also deal with nonsmooth, strongly convex problems. Even under global strong convexity, [25, 15] provide convergence rates slower than O(1/k)O(1/k), and [8] derives an O(1/k)O(1/k) rate of convergence only to dual optimality. Although [10] attains linear convergence, it considers a more restricted problem, which assumes fif_{i} to be globally restricted strongly convex and the subgradients of hih_{i} to be uniformly bounded.

Remark 1.

Compared to the existing algorithms in Section III that can be viewed as specializations of AMM, AMM is able to achieve convergence rates of the same or even better order under identical or weaker problem conditions. Moreover, although the theoretical results in Section IV require additional conditions on the parameters of AMM besides those in Section II, these parameter conditions are not restrictive. Indeed, when AMM reduces to the algorithms in Section III, its parameter conditions in Theorem 1 still generalize those in [5, 13, 16, 12, 11, 9, 14] and partially overlap those in [26] for non-strongly convex problems. The parameter conditions in Theorem 2 are more general than those in [5, 19], different from but intersecting with those in [26, 18], and admittedly, relatively limited compared to those in [9, 4] for strongly convex problems, yet on the premise that uku^{k} is specialized to the corresponding particular forms specified in Section III.

V Numerical Examples

This section demonstrates the competitive practical performance of DAMM via numerical examples.

V-A Non-identical Local Nonsmooth Objectives

Consider the following constrained l1l_{1}-regularized problem:

minimizexi𝒱Xii𝒱(12Bixbi2+1Nx1),\operatorname{minimize}_{x\in\cap_{i\in\mathcal{V}}X_{i}}~{}\sum_{i\in\mathcal{V}}\left(\frac{1}{2}\|B_{i}x-b_{i}\|^{2}+\frac{1}{N}\|x\|_{1}\right), (52)

where each Bim×dB_{i}\in\mathbb{R}^{m\times d}, bimb_{i}\in\mathbb{R}^{m}, and Xi={x|xaiai+1}X_{i}=\{x|~{}\|x-a_{i}\|\leq\|a_{i}\|+1\} with aida_{i}\in\mathbb{R}^{d}. Note that i𝒱Xi\cap_{i\in\mathcal{V}}X_{i} contains 𝟎\mathbf{0} and is nonempty. We set fi(x)=12Bixbi2f_{i}(x)=\frac{1}{2}\|B_{i}x-b_{i}\|^{2} and hi(x)=1Nx1+Xi(x)h_{i}(x)=\frac{1}{N}\|x\|_{1}+\mathcal{I}_{X_{i}}(x) i𝒱\forall i\in\mathcal{V}. We choose N=20N=20, d=5d=5, and m=3m=3. Besides, we randomly generate each BiB_{i}, bib_{i}, and aia_{i}. The graph 𝒢\mathcal{G} is also randomly generated and contains 2626 links.

The simulation involves PG-EXTRA [12], D-FBBS [13], DPGA [11], and the distributed ADMM [9], which are guaranteed to solve (52) with distinct hih_{i}’s. From Section III, they are specializations of AMM, and the first three algorithms can also be specialized from DAMM. In addition to these existing methods, we run a particular DAMM with ψik()=ϖi():=12BiTBi+ϵId2\psi_{i}^{k}(\cdot)=\varpi_{i}(\cdot):=\frac{1}{2}\|\cdot\|_{B_{i}^{T}B_{i}+\epsilon I_{d}}^{2}, ϵ>0\epsilon>0, which depends on the problem data. Note that this is a new algorithm for solving (52). All these algorithms have similar computational complexities per iteration, while each iteration of the distributed ADMM requires twice the communication cost of the others. The algorithmic forms of PG-EXTRA, D-FBBS, DPGA, and the distributed ADMM are given in Section III. For DPGA, we let ci=cc_{i}=c i𝒱\forall i\in\mathcal{V} for some c>0c>0 and Γij=12c[M𝒢]ij\Gamma_{ij}=\frac{1}{2c}[M_{\mathcal{G}}]_{ij} i𝒱\forall i\in\mathcal{V} j𝒩i{i}\forall j\in\mathcal{N}_{i}\cup\{i\}, where M𝒢M_{\mathcal{G}} is the Metropolis weight matrix defined below Assumption 3. We also set the weight matrix Γ=12M𝒢\Gamma=\frac{1}{2}M_{\mathcal{G}} in the distributed ADMM. We assign P=P~=12M𝒢P=\tilde{P}=\frac{1}{2}M_{\mathcal{G}} to the above new DAMM and to PG-EXTRA and D-FBBS when cast in the form of DAMM. The remaining parameters are all fine-tuned in their theoretical ranges to achieve the best possible performance.
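For concreteness, the following sketch shows one way in which the data of (52) and the data-dependent metric ψ_i of the new DAMM variant could be generated; the Gaussian distributions and the seed are assumptions, as the text only states that B_i, b_i, and a_i are randomly generated.

import numpy as np

rng = np.random.default_rng(0)                    # hypothetical seed; not specified in the text
N, d, m, eps = 20, 5, 3, 0.1                      # eps > 0 is the tunable parameter of psi_i
B = rng.standard_normal((N, m, d))                # local data B_i
b = rng.standard_normal((N, m))                   # local data b_i
a = rng.standard_normal((N, d))                   # centers of the balls X_i

def grad_f(i, x):
    # Gradient of f_i(x) = 0.5 * ||B_i x - b_i||^2.
    return B[i].T @ (B[i] @ x - b[i])

def project_X(i, x):
    # Projection onto X_i = {x : ||x - a_i|| <= ||a_i|| + 1}.
    r = np.linalg.norm(a[i]) + 1.0
    dist = np.linalg.norm(x - a[i])
    return x if dist <= r else a[i] + r * (x - a[i]) / dist

# Data-dependent metric of the new DAMM variant: psi_i^k(.) = 0.5 * ||.||^2 in the
# (B_i^T B_i + eps * I_d)-norm.
Psi = np.stack([B[i].T @ B[i] + eps * np.eye(d) for i in range(N)])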

Figure 1: Convergence performance of PG-EXTRA, D-FBBS, DPGA, distributed ADMM, and DAMM with ψik()=ϖi()\psi_{i}^{k}(\cdot)=\varpi_{i}(\cdot). (a) Optimality error at 𝐱¯k\bar{\mathbf{x}}^{k}; (b) optimality error at 𝐱k\mathbf{x}^{k}.

Figure 1 plots the optimality error |f()+h()f(𝐱)h(𝐱)|+H~12()|f(\cdot)+h(\cdot)-f(\mathbf{x}^{\star})-h(\mathbf{x}^{\star})|+\|\tilde{H}^{\frac{1}{2}}(\cdot)\| (which includes both the objective error and the consensus error) at the running average 𝐱¯k\bar{\mathbf{x}}^{k} and the iterate 𝐱k\mathbf{x}^{k}, respectively. For each of the aforementioned algorithms, the running average produces a smoother curve, while the iterate converges faster. The new DAMM with ψik()=ϖi()\psi_{i}^{k}(\cdot)=\varpi_{i}(\cdot) outperforms the other four methods, suggesting that AMM not only unifies a diversity of existing methods, but also creates novel distributed algorithms that achieve superior performance in solving various convex composite optimization problems.

V-B Comparison with Methods Using Surrogate Functions

This subsection compares the convergence performance of the particular DAMM with ψik()=ϖi()\psi_{i}^{k}(\cdot)=\varpi_{i}(\cdot) in Section V-A with that of NEXT [31] and SONATA [33], which also utilize surrogate functions. However, NEXT and SONATA are only guaranteed to address problem (1) with identical hih_{i}’s. Thus, we change XiX_{i} i𝒱\forall i\in\mathcal{V} in (52) to the unit ball {x|x1}\{x|~{}\|x\|\leq 1\}. The remaining settings comply with Section V-A.

At each iteration, the computational complexities of all the three algorithms are roughly the same, yet the communication costs of NEXT and SONATA double that of the particular DAMM. For NEXT and SONATA, we follow the simulations in [31, 33] to approximate fi(xi)f_{i}(x_{i}) by fi(xi)+τ2xixik2f_{i}(x_{i})+\frac{\tau}{2}\|x_{i}-x_{i}^{k}\|^{2} for some τ>0\tau>0 at each iteration kk, and choose the average matrix as INM𝒢I_{N}-M_{\mathcal{G}}, where M𝒢M_{\mathcal{G}} is the Metropolis weight matrix. The diminishing step-size of NEXT is set as c/kc/k with c>0c>0, and SONATA adopts a fixed step-size. Again, we fine-tune all the algorithm parameters within their theoretical ranges.

Figure 2 displays the optimality error |f(𝐱k)+h(𝐱k)f(𝐱)h(𝐱)|+H~12𝐱k|f(\mathbf{x}^{k})+h(\mathbf{x}^{k})-f(\mathbf{x}^{\star})-h(\mathbf{x}^{\star})|+\|\tilde{H}^{\frac{1}{2}}\mathbf{x}^{k}\| generated by the aforementioned algorithms. Observe that the particular DAMM converges much faster than the other two methods. The main reason for the slow convergence of SONATA in solving this numerical example is that its theoretical step-size is very small.

Figure 2: Convergence performance of NEXT, SONATA, and DAMM with ψik()=ϖi()\psi_{i}^{k}(\cdot)=\varpi_{i}(\cdot).

VI Conclusion

We have introduced a unifying Approximate Method of Multipliers (AMM) that emulates the Method of Multipliers via a surrogate function. Proper designs of the surrogate function lead to a wealth of distributed algorithms for solving convex composite optimization over undirected graphs. Sublinear and linear convergence rates for AMM are established under various mild conditions. The proposed AMM and its distributed realizations generalize a number of well-known methods in the literature. Moreover, the theoretical convergence rates of AMM are no worse than, and sometimes better than, those of such existing specializations of AMM in terms of rate order and problem assumptions. The generality of AMM as well as its convergence analysis provides insights into the design and analysis of distributed primal-dual optimization algorithms, and allows us to explore high-performance specializations of AMM when addressing specific convex optimization problems in practice.

-A Proof of Lemma 1

Define gk+1=uk(𝐱k+1)ρH𝐱k+1H~12𝐯kg^{k+1}\!=\!-\nabla u^{k}(\mathbf{x}^{k+1})\!-\!\rho H\mathbf{x}^{k+1}\!-\!\tilde{H}^{\frac{1}{2}}\mathbf{v}^{k} and g=f(𝐱)H~12𝐯g^{\star}\!=\!-\nabla f(\mathbf{x}^{\star})\!-\!\tilde{H}^{\frac{1}{2}}\mathbf{v}^{\star}. Due to (6) and (7), gk+1h(𝐱k+1)g^{k+1}\in\partial h(\mathbf{x}^{k+1}) and gh(𝐱)g^{\star}\in\partial h(\mathbf{x}^{\star}). It follows from (36) that

H~12(𝐯𝐯k)=\displaystyle\tilde{H}^{\frac{1}{2}}(\mathbf{v}^{\star}-\mathbf{v}^{k})= f(𝐱k)f(𝐱)+Ak(𝐱k+1𝐱k)\displaystyle\nabla f(\mathbf{x}^{k})-\nabla f(\mathbf{x}^{\star})+A^{k}(\mathbf{x}^{k+1}-\mathbf{x}^{k})
+gk+1g+ρH𝐱k+1.\displaystyle+g^{k+1}-g^{\star}+\rho H\mathbf{x}^{k+1}. (53)

Also, since Null(H~12)=Null(H~)=S\operatorname{Null}(\tilde{H}^{\frac{1}{2}})=\operatorname{Null}(\tilde{H})=S, we have H~12𝐱=𝟎\tilde{H}^{\frac{1}{2}}\mathbf{x}^{\star}=\mathbf{0}. This, along with (5), gives 𝐯k𝐯k+1=ρH~12(𝐱k+1𝐱)\mathbf{v}^{k}-\mathbf{v}^{k+1}=-\rho\tilde{H}^{\frac{1}{2}}(\mathbf{x}^{k+1}-\mathbf{x}^{\star}). Thus, 𝐯k𝐯k+1,𝐯k𝐯=ρH~12(𝐱k+1𝐱),𝐯k𝐯=ρ𝐱k+1𝐱,H~12(𝐯𝐯k)\langle\mathbf{v}^{k}-\mathbf{v}^{k+1},\mathbf{v}^{k}-\mathbf{v}^{\star}\rangle=-\rho\langle\tilde{H}^{\frac{1}{2}}(\mathbf{x}^{k+1}-\mathbf{x}^{\star}),\mathbf{v}^{k}-\mathbf{v}^{\star}\rangle=\rho\langle\mathbf{x}^{k+1}-\mathbf{x}^{\star},\tilde{H}^{\frac{1}{2}}(\mathbf{v}^{\star}-\mathbf{v}^{k})\rangle. By substituting (53) into the above equation and using H𝐱=𝟎H\mathbf{x}^{\star}=\mathbf{0}, we obtain 𝐯k𝐯k+1,𝐯k𝐯=ρ𝐱k+1𝐱,f(𝐱k)f(𝐱)+ρ𝐱k+1𝐱,gk+1g+ρ𝐱k+1𝐱,Ak(𝐱k+1𝐱k)+ρ2𝐱k+1H2\langle\mathbf{v}^{k}-\mathbf{v}^{k+1},\mathbf{v}^{k}-\mathbf{v}^{\star}\rangle=\rho\langle\mathbf{x}^{k+1}\!-\!\mathbf{x}^{\star},\nabla f(\mathbf{x}^{k})\!-\!\nabla f(\mathbf{x}^{\star})\rangle\!+\!\rho\langle\mathbf{x}^{k+1}\!-\!\mathbf{x}^{\star},g^{k+1}\!-\!g^{\star}\rangle+\rho\langle\mathbf{x}^{k+1}-\mathbf{x}^{\star},A^{k}(\mathbf{x}^{k+1}-\mathbf{x}^{k})\rangle+\rho^{2}\|\mathbf{x}^{k+1}\|_{H}^{2}. In this equation, 𝐱k+1𝐱,gk+1g0\langle\mathbf{x}^{k+1}-\mathbf{x}^{\star},g^{k+1}-g^{\star}\rangle\geq 0, because gk+1h(𝐱k+1)g^{k+1}\in\partial h(\mathbf{x}^{k+1}), gh(𝐱)g^{\star}\in\partial h(\mathbf{x}^{\star}), and hh is convex. Moreover, due to HH~H\succeq\tilde{H} and (5), we have ρ2𝐱k+1H2ρ2𝐱k+1H~2=𝐯k+1𝐯k2=𝐯k𝐯k+1,𝐯k𝐯𝐯k𝐯k+1,𝐯k+1𝐯\rho^{2}\|\mathbf{x}^{k+1}\|_{H}^{2}\geq\rho^{2}\|\mathbf{x}^{k+1}\|_{\tilde{H}}^{2}=\|\mathbf{v}^{k+1}-\mathbf{v}^{k}\|^{2}=\langle\mathbf{v}^{k}-\mathbf{v}^{k+1},\mathbf{v}^{k}-\mathbf{v}^{\star}\rangle-\langle\mathbf{v}^{k}-\mathbf{v}^{k+1},\mathbf{v}^{k+1}-\mathbf{v}^{\star}\rangle. It follows that 1ρ𝐯k𝐯k+1,𝐯k+1𝐯𝐱k+1𝐱,f(𝐱k)f(𝐱)+𝐱k+1𝐱,Ak(𝐱k+1𝐱k)\frac{1}{\rho}\langle\mathbf{v}^{k}\!-\!\mathbf{v}^{k+1},\mathbf{v}^{k+1}\!-\!\mathbf{v}^{\star}\rangle\geq\langle\mathbf{x}^{k+1}\!-\!\mathbf{x}^{\star},\nabla f(\mathbf{x}^{k})\!-\!\nabla f(\mathbf{x}^{\star})\rangle+\langle\mathbf{x}^{k+1}-\mathbf{x}^{\star},A^{k}(\mathbf{x}^{k+1}-\mathbf{x}^{k})\rangle. Hence, to prove (37), it suffices to show that for any η(0,1)\eta\in(0,1),

𝐱k+1𝐱,f(𝐱k)f(𝐱)\displaystyle\langle\mathbf{x}^{k+1}-\mathbf{x}^{\star},\nabla f(\mathbf{x}^{k})-\nabla f(\mathbf{x}^{\star})\rangle
η𝐱k𝐱,f(𝐱k)f(𝐱)𝐱k+1𝐱kΛM24(1η).\displaystyle\geq\!\eta\langle\mathbf{x}^{k}\!-\!\mathbf{x}^{\star},\nabla f(\mathbf{x}^{k})\!-\!\nabla f(\mathbf{x}^{\star})\rangle\!-\!\frac{\|\mathbf{x}^{k+1}\!-\!\mathbf{x}^{k}\|_{\Lambda_{M}}^{2}}{4(1-\eta)}. (54)

To do so, note from the AM-GM inequality and the Lipschitz continuity of fi\nabla f_{i} that for any i𝒱i\in\mathcal{V} and ci>0c_{i}>0, xik+1xik,fi(xik)fi(x)cifi(xik)fi(x)214cixik+1xik2ciMixikx,fi(xik)fi(x)14cixik+1xik2\langle x_{i}^{k+1}-x_{i}^{k},\nabla f_{i}(x_{i}^{k})-\nabla f_{i}(x^{\star})\rangle\geq-c_{i}\|\nabla f_{i}(x_{i}^{k})-\nabla f_{i}(x^{\star})\|^{2}-\frac{1}{4c_{i}}\|x_{i}^{k+1}-x_{i}^{k}\|^{2}\geq-c_{i}M_{i}\langle x_{i}^{k}-x^{\star},\nabla f_{i}(x_{i}^{k})-\nabla f_{i}(x^{\star})\rangle-\frac{1}{4c_{i}}\|x_{i}^{k+1}-x_{i}^{k}\|^{2}. By adding the above inequality over all i𝒱i\in\mathcal{V} with ci=1ηMic_{i}=\frac{1-\eta}{M_{i}}, we have 𝐱k+1𝐱k,f(𝐱k)f(𝐱)+(1η)𝐱k𝐱,f(𝐱k)f(𝐱)𝐱k+1𝐱kΛM24(1η)\langle\mathbf{x}^{k+1}-\mathbf{x}^{k},\nabla f(\mathbf{x}^{k})-\nabla f(\mathbf{x}^{\star})\rangle+(1-\eta)\langle\mathbf{x}^{k}-\mathbf{x}^{\star},\nabla f(\mathbf{x}^{k})-\nabla f(\mathbf{x}^{\star})\rangle\geq-\frac{\|\mathbf{x}^{k+1}-\mathbf{x}^{k}\|_{\Lambda_{M}}^{2}}{4(1-\eta)}. Then, adding η𝐱k𝐱,f(𝐱k)f(𝐱)\eta\langle\mathbf{x}^{k}-\mathbf{x}^{\star},\nabla f(\mathbf{x}^{k})-\nabla f(\mathbf{x}^{\star})\rangle to both sides of this inequality leads to (54).

-B Proof of Lemma 2

Let k1k\geq 1. Due to (5), H~12𝐱¯k=1kt=1kH~12𝐱t=1ρk(𝐯k𝐯0)\tilde{H}^{\frac{1}{2}}\bar{\mathbf{x}}^{k}=\frac{1}{k}\sum_{t=1}^{k}\tilde{H}^{\frac{1}{2}}\mathbf{x}^{t}=\frac{1}{\rho k}(\mathbf{v}^{k}-\mathbf{v}^{0}), so that (39) holds. Next, we prove (40). For simplicity, define Jt(𝐱)=f(𝐱t),𝐱+𝐱𝐱tA2/2+h(𝐱)+ρ𝐱H2/2+(𝐯t)TH~12𝐱J^{t}(\mathbf{x})=\langle\nabla f(\mathbf{x}^{t}),\mathbf{x}\rangle+\|\mathbf{x}-\mathbf{x}^{t}\|_{A}^{2}/2+h(\mathbf{x})+\rho\|\mathbf{x}\|_{H}^{2}/2+(\mathbf{v}^{t})^{T}\tilde{H}^{\frac{1}{2}}\mathbf{x} t0\forall t\geq 0. Since Jt(𝐱)𝐱A22J^{t}(\mathbf{x})-\frac{\|\mathbf{x}\|_{A}^{2}}{2} is convex, (Jt(𝐱)𝐱A22)(Jt(𝐱t+1)𝐱t+1A22)g~t+1A𝐱t+1,𝐱𝐱t+1\bigl{(}J^{t}(\mathbf{x}^{\star})-\frac{\|\mathbf{x}^{\star}\|_{A}^{2}}{2}\bigr{)}-\bigl{(}J^{t}(\mathbf{x}^{t+1})-\frac{\|\mathbf{x}^{t+1}\|_{A}^{2}}{2}\bigr{)}\geq\langle\tilde{g}^{t+1}\!-\!A\mathbf{x}^{t+1},\mathbf{x}^{\star}\!-\!\mathbf{x}^{t+1}\rangle for all g~t+1Jt(𝐱t+1)\tilde{g}^{t+1}\!\in\!\partial J^{t}(\mathbf{x}^{t+1}). Due to (4), 𝟎Jt(𝐱t+1)\mathbf{0}\in\partial J^{t}(\mathbf{x}^{t+1}). Thus, by letting g~t+1=𝟎\tilde{g}^{t+1}=\mathbf{0} in the above inequality, we obtain

Jt(𝐱t+1)Jt(𝐱)𝐱t+1𝐱A2/2.J^{t}(\mathbf{x}^{t+1})-J^{t}(\mathbf{x}^{\star})\leq-\|\mathbf{x}^{t+1}-\mathbf{x}^{\star}\|_{A}^{2}/2. (55)

In addition, because of (5) and HH~H\succeq\tilde{H},

H~12𝐱t+1,𝐯t=1ρ𝐯t+1𝐯t,𝐯t\displaystyle\langle\tilde{H}^{\frac{1}{2}}\mathbf{x}^{t+1},\mathbf{v}^{t}\rangle=\frac{1}{\rho}\langle\mathbf{v}^{t+1}-\mathbf{v}^{t},\mathbf{v}^{t}\rangle
=\displaystyle= (𝐯t+12𝐯t2𝐯t+1𝐯t2)/(2ρ)\displaystyle(\|\mathbf{v}^{t+1}\|^{2}-\|\mathbf{v}^{t}\|^{2}-\|\mathbf{v}^{t+1}-\mathbf{v}^{t}\|^{2})/(2\rho)
=\displaystyle= (𝐯t+12𝐯t2)/(2ρ)ρ𝐱t+1H~2/2\displaystyle(\|\mathbf{v}^{t+1}\|^{2}-\|\mathbf{v}^{t}\|^{2})/(2\rho)-\rho\|\mathbf{x}^{t+1}\|_{\tilde{H}}^{2}/2
\displaystyle\geq (𝐯t+12𝐯t2)/(2ρ)ρ𝐱t+1H2/2.\displaystyle(\|\mathbf{v}^{t+1}\|^{2}-\|\mathbf{v}^{t}\|^{2})/(2\rho)-\rho\|\mathbf{x}^{t+1}\|_{H}^{2}/2. (56)

Also, due to the convexity and the MiM_{i}-smoothness of each fif_{i},

f(𝐱t+1)f(𝐱)=(f(𝐱t+1)f(𝐱t))+(f(𝐱t)f(𝐱))\displaystyle f(\mathbf{x}^{t+1})-f(\mathbf{x}^{\star})=(f(\mathbf{x}^{t+1})-f(\mathbf{x}^{t}))+(f(\mathbf{x}^{t})-f(\mathbf{x}^{\star}))
\displaystyle\leq f(𝐱t),𝐱t+1𝐱t+𝐱t+1𝐱tΛM2/2+f(𝐱t),𝐱t𝐱\displaystyle\langle\nabla f(\mathbf{x}^{t}),\mathbf{x}^{t+1}\!\!-\!\mathbf{x}^{t}\rangle\!+\!\|\mathbf{x}^{t+1}\!\!-\!\mathbf{x}^{t}\|_{\Lambda_{M}}^{2}/2\!+\!\langle\nabla f(\mathbf{x}^{t}),\mathbf{x}^{t}\!-\!\mathbf{x}^{\star}\rangle
=\displaystyle= f(𝐱t),𝐱t+1𝐱+𝐱t+1𝐱tΛM2/2.\displaystyle\langle\nabla f(\mathbf{x}^{t}),\mathbf{x}^{t+1}-\mathbf{x}^{\star}\rangle+\|\mathbf{x}^{t+1}-\mathbf{x}^{t}\|_{\Lambda_{M}}^{2}/2. (57)

Through combining (55), (56), and (57) and utilizing H𝐱=H~12𝐱=𝟎H\mathbf{x}^{\star}=\tilde{H}^{\frac{1}{2}}\mathbf{x}^{\star}=\mathbf{0}, we derive

f(𝐱t+1)+h(𝐱t+1)f(𝐱)h(𝐱)+𝐯t+12𝐯t22ρ\displaystyle f(\mathbf{x}^{t+1})+h(\mathbf{x}^{t+1})-f(\mathbf{x}^{\star})-h(\mathbf{x}^{\star})+\frac{\|\mathbf{v}^{t+1}\|^{2}-\|\mathbf{v}^{t}\|^{2}}{2\rho}
\displaystyle\leq 𝐱t𝐱A22𝐱t+1𝐱A22+𝐱t+1𝐱tΛMA22.\displaystyle\frac{\|\mathbf{x}^{t}\!-\!\mathbf{x}^{\star}\|_{A}^{2}}{2}\!-\!\frac{\|\mathbf{x}^{t+1}\!-\!\mathbf{x}^{\star}\|_{A}^{2}}{2}\!+\!\frac{\|\mathbf{x}^{t+1}\!-\!\mathbf{x}^{t}\|_{\Lambda_{M}-A}^{2}}{2}. (58)

Now adding (58) over t=0,,k1t=0,\ldots,k-1 yields t=1k(f(𝐱t)+h(𝐱t)f(𝐱)h(𝐱))𝐯022ρ+𝐱0𝐱A22+t=0k1𝐱t+1𝐱tΛMA22\sum_{t=1}^{k}\left(f(\mathbf{x}^{t})+h(\mathbf{x}^{t})-f(\mathbf{x}^{\star})-h(\mathbf{x}^{\star})\right)\leq\frac{\|\mathbf{v}^{0}\|^{2}}{2\rho}+\frac{\|\mathbf{x}^{0}-\mathbf{x}^{\star}\|_{A}^{2}}{2}+\sum_{t=0}^{k-1}\frac{\|\mathbf{x}^{t+1}-\mathbf{x}^{t}\|_{\Lambda_{M}-A}^{2}}{2}. Moreover, the convexity of ff and hh implies f(𝐱¯k)+h(𝐱¯k)1kt=1k(f(𝐱t)+h(𝐱t))f(\bar{\mathbf{x}}^{k})+h(\bar{\mathbf{x}}^{k})\leq\frac{1}{k}\sum_{t=1}^{k}\left(f(\mathbf{x}^{t})+h(\mathbf{x}^{t})\right). Combining the above results in (40).

Finally, to prove (41), note that 𝐱\mathbf{x}^{\star} is an optimal solution of min𝐱f(𝐱)+h(𝐱)+𝐯,H~12𝐱\min_{\mathbf{x}}f(\mathbf{x})+h(\mathbf{x})+\langle\mathbf{v}^{\star},\tilde{H}^{\frac{1}{2}}\mathbf{x}\rangle. It then follows from H~12𝐱=𝟎\tilde{H}^{\frac{1}{2}}\mathbf{x}^{\star}=\mathbf{0} that f(𝐱)+h(𝐱)f(𝐱¯t)+h(𝐱¯t)+𝐯,H~12𝐱¯tf(𝐱¯t)+h(𝐱¯t)+𝐯H~12𝐱¯tf(\mathbf{x}^{\star})+h(\mathbf{x}^{\star})\leq f(\bar{\mathbf{x}}^{t})+h(\bar{\mathbf{x}}^{t})+\langle\mathbf{v}^{\star},\tilde{H}^{\frac{1}{2}}\bar{\mathbf{x}}^{t}\rangle\leq f(\bar{\mathbf{x}}^{t})+h(\bar{\mathbf{x}}^{t})+\|\mathbf{v}^{\star}\|\cdot\|\tilde{H}^{\frac{1}{2}}\bar{\mathbf{x}}^{t}\|. This, together with (39), implies (41).

-C Proof of Lemma 3

From the definition of GG, for each t0t\geq 0,

\begin{aligned}
&\|\mathbf{z}^{t}-\mathbf{z}^{\star}\|_{G}^{2}-\|\mathbf{z}^{t+1}-\mathbf{z}^{\star}\|_{G}^{2}-\|\mathbf{z}^{t+1}-\mathbf{z}^{t}\|_{G}^{2}\\
&=2\langle G(\mathbf{z}^{t}-\mathbf{z}^{t+1}),\mathbf{z}^{t+1}-\mathbf{z}^{\star}\rangle\\
&=2\langle A(\mathbf{x}^{t}-\mathbf{x}^{t+1}),\mathbf{x}^{t+1}-\mathbf{x}^{\star}\rangle+\frac{2}{\rho}\langle\mathbf{v}^{t}-\mathbf{v}^{t+1},\mathbf{v}^{t+1}-\mathbf{v}^{\star}\rangle. \qquad (59)
\end{aligned}
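
The first equality in (59) is the standard three-point identity $\|a-c\|_{G}^{2}-\|b-c\|_{G}^{2}-\|a-b\|_{G}^{2}=2\langle G(a-b),b-c\rangle$ for a positive semidefinite weight $G$. As a quick numerical sanity check (a minimal illustration with randomly generated data, not needed for the proof):

```python
import numpy as np

# Three-point identity behind the first equality in (59):
#   ||a - c||_G^2 - ||b - c||_G^2 - ||a - b||_G^2 = 2 <G(a - b), b - c>
# for any positive semidefinite G, checked here with random data.
rng = np.random.default_rng(0)
n = 6
M = rng.standard_normal((n, n))
G = M @ M.T                          # random positive semidefinite matrix
a, b, c = rng.standard_normal((3, n))
sq = lambda u: u @ G @ u             # squared weighted norm ||u||_G^2
lhs = sq(a - c) - sq(b - c) - sq(a - b)
rhs = 2 * (G @ (a - b)) @ (b - c)
assert np.isclose(lhs, rhs)
```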

By substituting (37) with At=AA^{t}=A and η=0\eta=0 into (59),

\begin{aligned}
\|\mathbf{z}^{t}-\mathbf{z}^{\star}\|_{G}^{2}-\|\mathbf{z}^{t+1}-\mathbf{z}^{\star}\|_{G}^{2}&\geq\frac{1}{\rho}\|\mathbf{v}^{t+1}-\mathbf{v}^{t}\|^{2}+\|\mathbf{x}^{t+1}-\mathbf{x}^{t}\|_{A-\frac{\Lambda_{M}}{2}}^{2}\\
&\geq(1-\underline{\sigma})\|\mathbf{x}^{t+1}-\mathbf{x}^{t}\|_{A}^{2}, \qquad (60)
\end{aligned}

where the last step is due to σ¯AΛM/2\underline{\sigma}A\succeq\Lambda_{M}/2. Adding (60) over t=0,,k1t=0,\ldots,k-1 and using 𝐯k𝐯ρ𝐳k𝐳G\|\mathbf{v}^{k}\!-\!\mathbf{v}^{\star}\|\!\leq\!\sqrt{\rho}\|\mathbf{z}^{k}\!-\!\mathbf{z}^{\star}\|_{G}, we have 1ρ𝐯k𝐯2+(1σ¯)t=0k1𝐱t+1𝐱tA2𝐳0𝐳G2\frac{1}{\rho}\|\mathbf{v}^{k}-\mathbf{v}^{\star}\|^{2}+(1-\underline{\sigma})\sum_{t=0}^{k-1}\|\mathbf{x}^{t+1}-\mathbf{x}^{t}\|_{A}^{2}\leq\|\mathbf{z}^{0}-\mathbf{z}^{\star}\|_{G}^{2}. Therefore, (42) can be proved by combining the above inequality with 𝐯k𝐯0𝐯k𝐯+𝐯0𝐯\|\mathbf{v}^{k}-\mathbf{v}^{0}\|\leq\|\mathbf{v}^{k}-\mathbf{v}^{\star}\|+\|\mathbf{v}^{0}-\mathbf{v}^{\star}\|. Furthermore, the above inequality, along with AΛMAA\succ\Lambda_{M}-A, implies (43).

-D Proof of Proposition 1

For convenience, let f~(x)=i𝒱fi(x)\tilde{f}(x)=\sum_{i\in\mathcal{V}}f_{i}(x). Since 2f~(x)𝐎\nabla^{2}\tilde{f}(x^{\star})\succ\mathbf{O} and 2f~\nabla^{2}\tilde{f} is continuous around xx^{\star}, there exist ϵ1(0,ϵ]\epsilon_{1}\in(0,\epsilon] and c>0c>0 such that 2f~(x)cId\nabla^{2}\tilde{f}(x)\succeq cI_{d} xB(x,ϵ1)\forall x\in B(x^{\star},\epsilon_{1}). Therefore,

\langle\nabla\tilde{f}(x)-\nabla\tilde{f}(x^{\star}),x-x^{\star}\rangle\geq c\|x-x^{\star}\|^{2},\quad\forall x\in B(x^{\star},\epsilon_{1}). \qquad (61)
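
Here (61) follows from the Hessian bound via the mean-value form of the gradient: since $B(x^{\star},\epsilon_{1})$ is convex, for any $x\in B(x^{\star},\epsilon_{1})$,

\langle\nabla\tilde{f}(x)-\nabla\tilde{f}(x^{\star}),x-x^{\star}\rangle=\int_{0}^{1}(x-x^{\star})^{T}\nabla^{2}\tilde{f}(x^{\star}+\tau(x-x^{\star}))(x-x^{\star})\,d\tau\geq c\|x-x^{\star}\|^{2}.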

Let $C\subset\mathbb{R}^{d}$ be any convex and compact set containing $x^{\star}$, and suppose $x\in C\setminus B(x^{\star},\epsilon_{1})$. Let $z=x^{\star}+\epsilon_{1}\frac{x-x^{\star}}{\|x-x^{\star}\|}\in B(x^{\star},\epsilon_{1})$. Since $\tilde{f}$ is convex, $x-x^{\star}=\frac{x-z}{1-\epsilon_{1}/\|x-x^{\star}\|}$, and $\|x-x^{\star}\|>\epsilon_{1}$, we have $\langle\nabla\tilde{f}(x)-\nabla\tilde{f}(z),x-x^{\star}\rangle\geq 0$. Moreover, using $x-x^{\star}=\frac{\|x-x^{\star}\|}{\epsilon_{1}}(z-x^{\star})$, (61), and $\|z-x^{\star}\|=\epsilon_{1}$, we obtain $\langle\nabla\tilde{f}(z)-\nabla\tilde{f}(x^{\star}),x-x^{\star}\rangle=\frac{\|x-x^{\star}\|}{\epsilon_{1}}\langle\nabla\tilde{f}(z)-\nabla\tilde{f}(x^{\star}),z-x^{\star}\rangle\geq\frac{c\|x-x^{\star}\|}{\epsilon_{1}}\|z-x^{\star}\|^{2}=\frac{c\epsilon_{1}}{\|x-x^{\star}\|}\|x-x^{\star}\|^{2}$. Adding the above two inequalities leads to $\langle\nabla\tilde{f}(x)-\nabla\tilde{f}(x^{\star}),x-x^{\star}\rangle\geq\frac{c\epsilon_{1}}{\max_{y\in C}\|y-x^{\star}\|}\|x-x^{\star}\|^{2}$. This, together with (61), implies that $\tilde{f}$ is restricted strongly convex with respect to $x^{\star}$ on $C$ (with convexity parameter $\min\{c,\frac{c\epsilon_{1}}{\max_{y\in C}\|y-x^{\star}\|}\}$), so that Assumption 7 holds.
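
As a concrete illustration of Proposition 1 (a hypothetical one-dimensional example, not taken from the paper), consider $\tilde{f}(x)=\log\cosh(x)$: it satisfies $\tilde{f}''(0)=1>0$ at its minimizer $x^{\star}=0$ and is not globally strongly convex since $\tilde{f}''(x)\to 0$ as $|x|\to\infty$, yet it is restricted strongly convex on any compact interval, as the following minimal sketch checks numerically:

```python
import numpy as np

# Hypothetical example (not from the paper): f(x) = log(cosh(x)) has
# f''(0) = 1 > 0 at its minimizer x* = 0, but f''(x) -> 0 as |x| -> infinity.
# Proposition 1 still yields restricted strong convexity on C = [-R, R].
R = 10.0
x = np.linspace(-R, R, 100001)
x = x[x != 0.0]                     # exclude x* itself
ratio = np.tanh(x) * x / x**2       # <f'(x) - f'(0), x - 0> / ||x - 0||^2
print(f"restricted strong convexity constant on [-{R}, {R}]: {ratio.min():.4f}")
assert ratio.min() > 0
```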

-E Proof of Lemma 4

For simplicity, define $c^{k}:=\|\mathbf{z}^{k}-\mathbf{z}^{\star}\|_{\tilde{G}}^{2}+\rho\|\mathbf{x}^{k}\|_{\tilde{H}}^{2}$. Since $\tilde{H}\mathbf{x}^{\star}=\mathbf{0}$, $\|\mathbf{x}^{k}-\mathbf{x}^{\star}\|_{A_{a}+\rho\tilde{H}}^{2}=\|\mathbf{x}^{k}-\mathbf{x}^{\star}\|_{A_{a}}^{2}+\rho\|\mathbf{x}^{k}\|_{\tilde{H}}^{2}\leq c^{k}$. Thus, to show $\mathbf{x}^{k}\in\mathcal{C}$ $\forall k\geq 0$, it suffices to show that $c^{k}\leq c^{0}$ $\forall k\geq 0$. We prove this by induction. Clearly, $c^{k}\leq c^{0}$ holds for $k=0$. Now suppose $c^{k}\leq c^{0}$ for some $k\geq 0$; we show below that $c^{k+1}\leq c^{k}\leq c^{0}$. Using (37) and (59) with $G$ replaced by $\tilde{G}$,

\begin{aligned}
&\|\mathbf{z}^{k}-\mathbf{z}^{\star}\|_{\tilde{G}}^{2}-\|\mathbf{z}^{k+1}-\mathbf{z}^{\star}\|_{\tilde{G}}^{2}-\|\mathbf{z}^{k+1}-\mathbf{z}^{k}\|_{\tilde{G}}^{2}\\
&\geq 2\eta\langle\mathbf{x}^{k}-\mathbf{x}^{\star},\nabla f(\mathbf{x}^{k})-\nabla f(\mathbf{x}^{\star})\rangle\\
&\quad+2\langle\mathbf{x}^{k+1}-\mathbf{x}^{\star},(A^{k}-A_{a})(\mathbf{x}^{k+1}-\mathbf{x}^{k})\rangle-\frac{\|\mathbf{x}^{k+1}-\mathbf{x}^{k}\|_{\Lambda_{M}}^{2}}{2(1-\eta)}. \qquad (62)
\end{aligned}

On the other hand, since AAkAuA_{\ell}\preceq A^{k}\preceq A_{u}, we have AkAaΔ/2\|A^{k}-A_{a}\|\leq\Delta/2. Due to (44), there exist β>0\beta>0 and σ(0,1)\sigma\in(0,1) such that (47) holds. Then, through the AM-GM inequality, 𝐱k+1𝐱,(AkAa)(𝐱k+1𝐱k)=𝐱k𝐱,(AkAa)(𝐱k+1𝐱k)+𝐱k+1𝐱k,(AkAa)(𝐱k+1𝐱k)12(βΔ𝐱k𝐱2+Δ4β𝐱k+1𝐱k2)Δ2𝐱k+1𝐱k2\langle\mathbf{x}^{k+1}-\mathbf{x}^{\star},(A^{k}-A_{a})(\mathbf{x}^{k+1}-\mathbf{x}^{k})\rangle=\langle\mathbf{x}^{k}-\mathbf{x}^{\star},(A^{k}-A_{a})(\mathbf{x}^{k+1}-\mathbf{x}^{k})\rangle+\langle\mathbf{x}^{k+1}-\mathbf{x}^{k},(A^{k}-A_{a})(\mathbf{x}^{k+1}-\mathbf{x}^{k})\rangle\geq-\frac{1}{2}(\beta\Delta\|\mathbf{x}^{k}-\mathbf{x}^{\star}\|^{2}+\frac{\Delta}{4\beta}\|\mathbf{x}^{k+1}-\mathbf{x}^{k}\|^{2})-\frac{\Delta}{2}\|\mathbf{x}^{k+1}\!-\!\mathbf{x}^{k}\|^{2}, which further implies

\begin{aligned}
&\|\mathbf{x}^{k+1}-\mathbf{x}^{k}\|_{A_{a}}^{2}+2\langle\mathbf{x}^{k+1}-\mathbf{x}^{\star},(A^{k}-A_{a})(\mathbf{x}^{k+1}-\mathbf{x}^{k})\rangle-\frac{\|\mathbf{x}^{k+1}-\mathbf{x}^{k}\|_{\Lambda_{M}}^{2}}{2(1-\eta)}\\
&\geq(1-\sigma)\|\mathbf{x}^{k+1}-\mathbf{x}^{k}\|_{A_{a}}^{2}-\beta\Delta\|\mathbf{x}^{k}-\mathbf{x}^{\star}\|^{2}.
\end{aligned}
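
The key estimate in the preceding display is a Young-type inequality: for any $\beta>0$ and any matrix $M$ with $\|M\|\leq\Delta/2$, $\langle u,Mw\rangle\geq-\frac{1}{2}\bigl(\beta\Delta\|u\|^{2}+\frac{\Delta}{4\beta}\|w\|^{2}\bigr)$, applied with $u=\mathbf{x}^{k}-\mathbf{x}^{\star}$, $M=A^{k}-A_{a}$, and $w=\mathbf{x}^{k+1}-\mathbf{x}^{k}$. A quick numerical sanity check (an illustration only, with randomly generated data):

```python
import numpy as np

# Young-type bound used above, with ||M|| <= Delta/2:
#   <u, M w> >= -(beta*Delta*||u||^2 + (Delta/(4*beta))*||w||^2) / 2.
rng = np.random.default_rng(1)
n, Delta, beta = 4, 2.0, 0.7
M = rng.standard_normal((n, n))
M *= (Delta / 2) / np.linalg.norm(M, 2)    # rescale so that ||M|| = Delta/2
u, w = rng.standard_normal((2, n))
lhs = u @ (M @ w)
rhs = -0.5 * (beta * Delta * u @ u + (Delta / (4 * beta)) * w @ w)
assert lhs >= rhs
```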

Substituting this into (62) and applying (5), we obtain

\begin{aligned}
c^{k}-c^{k+1}&\geq(1-\sigma)\|\mathbf{x}^{k+1}-\mathbf{x}^{k}\|_{A_{a}}^{2}-\beta\Delta\|\mathbf{x}^{k}-\mathbf{x}^{\star}\|^{2}\\
&\quad+2\eta\langle\mathbf{x}^{k}-\mathbf{x}^{\star},\nabla f(\mathbf{x}^{k})-\nabla f(\mathbf{x}^{\star})\rangle+\rho\|\mathbf{x}^{k}\|_{\tilde{H}}^{2}. \qquad (63)
\end{aligned}
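
Here, "applying (5)" converts the dual increment into a primal quantity, exactly as in (56): by (5), $\mathbf{v}^{k+1}-\mathbf{v}^{k}=\rho\tilde{H}^{\frac{1}{2}}\mathbf{x}^{k+1}$, so

\frac{1}{\rho}\|\mathbf{v}^{k+1}-\mathbf{v}^{k}\|^{2}=\rho\|\tilde{H}^{\frac{1}{2}}\mathbf{x}^{k+1}\|^{2}=\rho\|\mathbf{x}^{k+1}\|_{\tilde{H}}^{2},

and this term, coming from $\|\mathbf{z}^{k+1}-\mathbf{z}^{k}\|_{\tilde{G}}^{2}$ in the analog of (59) with $G$ replaced by $\tilde{G}$, offsets $\rho\|\mathbf{x}^{k+1}\|_{\tilde{H}}^{2}$ in $c^{k+1}$.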

Due to the hypothesis ckc0c^{k}\leq c^{0}, we have 𝐱k𝒞\mathbf{x}^{k}\in\mathcal{C}. It then follows from Proposition 2 and H~𝐱=𝟎\tilde{H}\mathbf{x}^{\star}=\mathbf{0} that 2𝐱k𝐱,f(𝐱k)f(𝐱)2mρ𝐱k𝐱2ρ𝐱kH~22\langle\mathbf{x}^{k}-\mathbf{x}^{\star},\nabla f(\mathbf{x}^{k})-\nabla f(\mathbf{x}^{\star})\rangle\geq 2m_{\rho}\|\mathbf{x}^{k}-\mathbf{x}^{\star}\|^{2}-\rho\|\mathbf{x}^{k}\|_{\tilde{H}}^{2}. This and (63) give

\begin{aligned}
c^{k}-c^{k+1}&\geq(1-\sigma)\|\mathbf{x}^{k+1}-\mathbf{x}^{k}\|_{A_{a}}^{2}\\
&\quad+(2\eta m_{\rho}-\beta\Delta)\|\mathbf{x}^{k}-\mathbf{x}^{\star}\|^{2}+\rho(1-\eta)\|\mathbf{x}^{k}\|_{\tilde{H}}^{2}. \qquad (64)
\end{aligned}

Note that σ(0,1)\sigma\in(0,1), βΔ<2ηmρ\beta\Delta<2\eta m_{\rho} due to (47), and η(0,1)\eta\in(0,1). Thus, the right-hand side of (64) is nonnegative, which implies ck+1ckc0c^{k+1}\leq c^{k}\leq c^{0}.

-F Proof of Theorem 2

Recall from the paragraph above Theorem 2 that we force 𝐯0𝐯S\mathbf{v}^{0}-\mathbf{v}^{\star}\in S^{\bot}. Also, due to (5), 𝐯k𝐯0Range(H~12)=S\mathbf{v}^{k}-\mathbf{v}^{0}\in\operatorname{Range}(\tilde{H}^{\frac{1}{2}})=S^{\bot}. Hence, 𝐯k𝐯Range(H~12)\mathbf{v}^{k}-\mathbf{v}^{\star}\in\operatorname{Range}(\tilde{H}^{\frac{1}{2}}). Moreover, since h(𝐱)0h(\mathbf{x})\equiv 0, gk+1=g=𝟎g^{k+1}=g^{\star}=\mathbf{0}. It then follows from (53) that 𝐯𝐯k=(H~12)(f(𝐱k)f(𝐱)+(Ak+ρH)(𝐱k+1𝐱k)+ρH𝐱k)\mathbf{v}^{\star}-\mathbf{v}^{k}=(\tilde{H}^{\frac{1}{2}})^{\dagger}(\nabla f(\mathbf{x}^{k})-\nabla f(\mathbf{x}^{\star})+(A^{k}+\rho H)(\mathbf{x}^{k+1}-\mathbf{x}^{k})+\rho H\mathbf{x}^{k}). Then, via the AM-GM inequality, for any θ1,θ2>0\theta_{1},\theta_{2}>0,

\begin{aligned}
\|\mathbf{v}^{k}-\mathbf{v}^{\star}\|^{2}&\leq(1+\theta_{1})(1+\theta_{2})\|\nabla f(\mathbf{x}^{k})-\nabla f(\mathbf{x}^{\star})\|_{\tilde{H}^{\dagger}}^{2}\\
&\quad+2(1+\theta_{1})(1+1/\theta_{2})\|\mathbf{x}^{k+1}-\mathbf{x}^{k}\|_{(A^{k})^{T}\tilde{H}^{\dagger}A^{k}+\rho^{2}H\tilde{H}^{\dagger}H}^{2}\\
&\quad+\rho^{2}(1+1/\theta_{1})\|H\mathbf{x}^{k}\|_{\tilde{H}^{\dagger}}^{2}. \qquad (65)
\end{aligned}
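
The splitting behind (65) is the elementary inequality $\|a+b\|_{P}^{2}\leq(1+\theta)\|a\|_{P}^{2}+(1+1/\theta)\|b\|_{P}^{2}$ for $P\succeq\mathbf{O}$ and $\theta>0$ (a consequence of the AM-GM inequality), applied with $P=\tilde{H}^{\dagger}$: first,

\begin{aligned}
\|\mathbf{v}^{k}-\mathbf{v}^{\star}\|^{2}&=\|\nabla f(\mathbf{x}^{k})-\nabla f(\mathbf{x}^{\star})+(A^{k}+\rho H)(\mathbf{x}^{k+1}-\mathbf{x}^{k})+\rho H\mathbf{x}^{k}\|_{\tilde{H}^{\dagger}}^{2}\\
&\leq(1+\theta_{1})\|\nabla f(\mathbf{x}^{k})-\nabla f(\mathbf{x}^{\star})+(A^{k}+\rho H)(\mathbf{x}^{k+1}-\mathbf{x}^{k})\|_{\tilde{H}^{\dagger}}^{2}+\rho^{2}(1+1/\theta_{1})\|H\mathbf{x}^{k}\|_{\tilde{H}^{\dagger}}^{2},
\end{aligned}

and then the first term is split once more with $\theta_{2}$, using $\|(A^{k}+\rho H)(\mathbf{x}^{k+1}-\mathbf{x}^{k})\|_{\tilde{H}^{\dagger}}^{2}\leq 2\|\mathbf{x}^{k+1}-\mathbf{x}^{k}\|_{(A^{k})^{T}\tilde{H}^{\dagger}A^{k}+\rho^{2}H\tilde{H}^{\dagger}H}^{2}$.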

From (65), HH~Hλ¯H12H~H12H\tilde{H}^{\dagger}H\preceq\bar{\lambda}H^{\frac{1}{2}}\tilde{H}^{\dagger}H^{\frac{1}{2}}, H~INdλH~\tilde{H}^{\dagger}\preceq\frac{I_{Nd}}{\lambda_{\tilde{H}}}, and the MiM_{i}-smoothness of each fif_{i}, we have that for all δ(0,1)\delta\in(0,1),

\begin{aligned}
\delta\|\mathbf{z}^{k}-\mathbf{z}^{\star}\|_{\tilde{G}}^{2}&\leq\delta\|\mathbf{x}^{k}-\mathbf{x}^{\star}\|_{\frac{(1+\theta_{1})(1+\theta_{2})\Lambda_{M}^{2}}{\rho\lambda_{\tilde{H}}}+A_{a}}^{2}\\
&\quad+\frac{2\delta(1+\theta_{1})(1+1/\theta_{2})}{\rho}\|\mathbf{x}^{k+1}-\mathbf{x}^{k}\|_{A_{u}^{2}/\lambda_{\tilde{H}}+\rho^{2}\bar{\lambda}H^{\frac{1}{2}}\tilde{H}^{\dagger}H^{\frac{1}{2}}}^{2}\\
&\quad+\delta\rho(1+1/\theta_{1})\|H\mathbf{x}^{k}\|_{\tilde{H}^{\dagger}}^{2}.
\end{aligned}

Note that (64) holds here because (44) guarantees (47) with some β>0\beta>0 and σ(0,1)\sigma\in(0,1). Then, by combining the above inequality and (64), we obtain (1δ)(𝐳k𝐳G~2+ρ𝐱kH~2)(𝐳k+1𝐳G~2+ρ𝐱k+1H~2)𝐱k𝐱B1(δ)2+𝐱kB2(δ)2+𝐱k+1𝐱kB3(δ)2(1\!-\!\delta)(\|\mathbf{z}^{k}\!-\!\mathbf{z}^{\star}\|_{\tilde{G}}^{2}\!+\!\rho\|\mathbf{x}^{k}\|_{\tilde{H}}^{2})\!-\!(\|\mathbf{z}^{k+1}\!-\!\mathbf{z}^{\star}\|_{\tilde{G}}^{2}\!+\!\rho\|\mathbf{x}^{k+1}\|_{\tilde{H}}^{2})\geq\|\mathbf{x}^{k}-\mathbf{x}^{\star}\|_{B_{1}(\delta)}^{2}+\|\mathbf{x}^{k}\|_{B_{2}(\delta)}^{2}+\|\mathbf{x}^{k+1}-\mathbf{x}^{k}\|_{B_{3}(\delta)}^{2}. Note from (47) that 2ηmρβΔ>02\eta m_{\rho}-\beta\Delta>0 and Aa𝐎A_{a}\succ\mathbf{O}. Also note that θ1>0\theta_{1}>0, θ2>0\theta_{2}>0, ρ>0\rho>0, λH~>0\lambda_{\tilde{H}}>0, η(0,1)\eta\in(0,1), H~𝐎\tilde{H}\succeq\mathbf{O}, Range(H~)=Range(H)=S\operatorname{Range}(\tilde{H})=\operatorname{Range}(H)=S^{\bot}, and σ(0,1)\sigma\in(0,1). Therefore, there exists δ(0,1)\delta\in(0,1) such that Bi(δ)𝐎B_{i}(\delta)\succeq\mathbf{O} i{1,2,3}\forall i\in\{1,2,3\}, which guarantees (46).

-G Proof of Theorem 3

In the proof of Lemma 4, we have shown that (ck)k=0(c^{k})_{k=0}^{\infty} is non-increasing. Thus, 𝐳k𝐳G~2ckc0=d02/ρ\|\mathbf{z}^{k}-\mathbf{z}^{\star}\|_{\tilde{G}}^{2}\leq c^{k}\leq c^{0}=d_{0}^{2}/\rho. It follows that 𝐯0𝐯k𝐯0𝐯+𝐯𝐯k𝐯0𝐯+ρ𝐳k𝐳G~𝐯0𝐯+d0\|\mathbf{v}^{0}-\mathbf{v}^{k}\|\leq\|\mathbf{v}^{0}-\mathbf{v}^{\star}\|+\|\mathbf{v}^{\star}-\mathbf{v}^{k}\|\leq\|\mathbf{v}^{0}-\mathbf{v}^{\star}\|+\sqrt{\rho}\|\mathbf{z}^{k}-\mathbf{z}^{\star}\|_{\tilde{G}}\leq\|\mathbf{v}^{0}-\mathbf{v}^{\star}\|+d_{0}. By substituting this into (39) and (41), we obtain (49) and (51). Note that this substitution is legitimate, since (39) and (41) do not rely on (38) to hold. To prove (50), we let J~t(𝐱)=ut(𝐱)+h(𝐱)+ρ2𝐱H2+(𝐯t)TH~12𝐱\tilde{J}^{t}(\mathbf{x})=u^{t}(\mathbf{x})+h(\mathbf{x})+\frac{\rho}{2}\|\mathbf{x}\|_{H}^{2}+(\mathbf{v}^{t})^{T}\tilde{H}^{\frac{1}{2}}\mathbf{x} t0\forall t\geq 0. Since 2ut(𝐱)A\nabla^{2}u^{t}(\mathbf{x})\succeq A_{\ell} 𝐱Nd\forall\mathbf{x}\in\mathbb{R}^{Nd}, J~t(𝐱)𝐱A2/2\tilde{J}^{t}(\mathbf{x})-\|\mathbf{x}\|_{A_{\ell}}^{2}/2 is convex. Similar to the derivation of (55),

\tilde{J}^{t}(\mathbf{x}^{t+1})-\tilde{J}^{t}(\mathbf{x}^{\star})\leq-\|\mathbf{x}^{t+1}-\mathbf{x}^{\star}\|_{A_{\ell}}^{2}/2. \qquad (66)

Since A2ut(𝐱)AuA_{\ell}\preceq\nabla^{2}u^{t}(\mathbf{x})\preceq A_{u} 𝐱Nd\forall\mathbf{x}\in\mathbb{R}^{Nd}, we have ut(𝐱t+1)ut(𝐱t)+ut(𝐱t),𝐱t+1𝐱t+𝐱t+1𝐱tA2/2u^{t}(\mathbf{x}^{t+1})\geq u^{t}(\mathbf{x}^{t})+\langle\nabla u^{t}(\mathbf{x}^{t}),\mathbf{x}^{t+1}-\mathbf{x}^{t}\rangle+\|\mathbf{x}^{t+1}-\mathbf{x}^{t}\|_{A_{\ell}}^{2}/2 and ut(𝐱)ut(𝐱t)+ut(𝐱t),𝐱𝐱t+𝐱𝐱tAu2/2u^{t}(\mathbf{x}^{\star})\leq u^{t}(\mathbf{x}^{t})+\langle\nabla u^{t}(\mathbf{x}^{t}),\mathbf{x}^{\star}-\mathbf{x}^{t}\rangle+\|\mathbf{x}^{\star}-\mathbf{x}^{t}\|_{A_{u}}^{2}/2. These two inequalities along with ut(𝐱t)=f(𝐱t)\nabla u^{t}(\mathbf{x}^{t})=\nabla f(\mathbf{x}^{t}) imply

\begin{aligned}
u^{t}(\mathbf{x}^{t+1})-u^{t}(\mathbf{x}^{\star})&\geq\langle\nabla f(\mathbf{x}^{t}),\mathbf{x}^{t+1}-\mathbf{x}^{\star}\rangle\\
&\quad+\|\mathbf{x}^{t+1}-\mathbf{x}^{t}\|_{A_{\ell}}^{2}/2-\|\mathbf{x}^{\star}-\mathbf{x}^{t}\|_{A_{u}}^{2}/2. \qquad (67)
\end{aligned}

Note that (56) and (57) hold without requiring (38). Hence, by combining (66), (67), (56), (57), and $H\mathbf{x}^{\star}=\tilde{H}^{\frac{1}{2}}\mathbf{x}^{\star}=\mathbf{0}$, we obtain $f(\mathbf{x}^{t+1})+h(\mathbf{x}^{t+1})-f(\mathbf{x}^{\star})-h(\mathbf{x}^{\star})+\frac{\|\mathbf{v}^{t+1}\|^{2}-\|\mathbf{v}^{t}\|^{2}}{2\rho}\leq\frac{\|\mathbf{x}^{t}-\mathbf{x}^{\star}\|_{A_{u}}^{2}}{2}-\frac{\|\mathbf{x}^{t+1}-\mathbf{x}^{\star}\|_{A_{\ell}}^{2}}{2}+\frac{\|\mathbf{x}^{t+1}-\mathbf{x}^{t}\|_{\Lambda_{M}-A_{\ell}}^{2}}{2}$. Similar to the derivation of (40) from (58), it follows that $f(\bar{\mathbf{x}}^{k})+h(\bar{\mathbf{x}}^{k})-f(\mathbf{x}^{\star})-h(\mathbf{x}^{\star})\leq\frac{1}{k}\bigl(\frac{\|\mathbf{v}^{0}\|^{2}}{2\rho}+\frac{\|\mathbf{x}^{0}-\mathbf{x}^{\star}\|_{A_{u}}^{2}}{2}\bigr)+\frac{1}{k}\sum_{t=0}^{k-1}\frac{\|\mathbf{x}^{t+1}-\mathbf{x}^{t}\|_{\Lambda_{M}-A_{\ell}}^{2}}{2}+\frac{\Delta}{2k}\sum_{t=1}^{k-1}\|\mathbf{x}^{t}-\mathbf{x}^{\star}\|^{2}$ (see the expanded telescoping step displayed below). From (64), $\sum_{t=0}^{k-1}\|\mathbf{x}^{t+1}-\mathbf{x}^{t}\|_{A_{a}}^{2}\leq\frac{c^{0}-c^{k}}{1-\sigma}\leq\frac{d_{0}^{2}}{\rho(1-\sigma)}$ and $\sum_{t=1}^{k-1}\|\mathbf{x}^{t}-\mathbf{x}^{\star}\|^{2}\leq\frac{c^{1}-c^{k}}{2\eta m_{\rho}-\beta\Delta}\leq\frac{d_{0}^{2}}{\rho(2\eta m_{\rho}-\beta\Delta)}$, where $\beta>0$ and $\sigma\in(0,1)$ satisfying (47) exist due to (44). Also, from (44), $\Lambda_{M}-A_{\ell}\preceq\Lambda_{M}\preceq 2(1-\eta)A_{a}$. Combining the above gives (50).
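
In the telescoping step referenced above, the mismatched weights $A_{u}$ and $A_{\ell}$ are what produce the $\frac{\Delta}{2k}$-term: assuming $A_{u}-A_{\ell}\preceq\Delta I_{Nd}$ (consistent with the bound $\|A^{k}-A_{a}\|\leq\Delta/2$ used in the proof of Lemma 4),

\begin{aligned}
\sum_{t=0}^{k-1}\Bigl(\frac{\|\mathbf{x}^{t}-\mathbf{x}^{\star}\|_{A_{u}}^{2}}{2}-\frac{\|\mathbf{x}^{t+1}-\mathbf{x}^{\star}\|_{A_{\ell}}^{2}}{2}\Bigr)&=\frac{\|\mathbf{x}^{0}-\mathbf{x}^{\star}\|_{A_{u}}^{2}}{2}-\frac{\|\mathbf{x}^{k}-\mathbf{x}^{\star}\|_{A_{\ell}}^{2}}{2}+\sum_{t=1}^{k-1}\frac{\|\mathbf{x}^{t}-\mathbf{x}^{\star}\|_{A_{u}-A_{\ell}}^{2}}{2}\\
&\leq\frac{\|\mathbf{x}^{0}-\mathbf{x}^{\star}\|_{A_{u}}^{2}}{2}+\frac{\Delta}{2}\sum_{t=1}^{k-1}\|\mathbf{x}^{t}-\mathbf{x}^{\star}\|^{2}.
\end{aligned}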

References

  • [1] A. Nedić, “Convergence rate of distributed averaging dynamics and optimization in networks,” Foundations and Trends in Systems and Control, vol. 2, no. 1, pp. 1–100, 2015.
  • [2] D. Bajović, D. Jakovetić, N. Krejić, and N. Krklec Jerinkić, “Newton-like method with diagonal correction for distributed optimization,” SIAM Journal on Optimization, vol. 27, no. 2, pp. 1171–1203, 2017.
  • [3] F. Mansoori and E. Wei, “Fast distributed asynchronous Newton-based optimization algorithm,” IEEE Transactions on Automatic Control, vol. 65, no. 7, pp. 2769–2784, 2020.
  • [4] W. Shi, Q. Ling, K. Yuan, G. Wu, and W. Yin, “On the linear convergence of the ADMM in decentralized consensus optimization,” IEEE Transactions on Signal Processing, vol. 62, no. 7, pp. 1750–1761, 2014.
  • [5] W. Shi, Q. Ling, G. Wu, and W. Yin, “EXTRA: An exact first-order algorithm for decentralized consensus optimization,” SIAM Journal on Optimization, vol. 25, no. 2, pp. 944–966, 2015.
  • [6] Y. Liu, W. Xu, G. Wu, Z. Tian, and Q. Ling, “Communication-censored ADMM for decentralized consensus optimization,” IEEE Transactions on Signal Processing, vol. 67, no. 10, pp. 2565–2579, 2019.
  • [7] Z. Li, W. Shi, and M. Yan, “A decentralized proximal-gradient method with network independent stepsizes and separated convergence rates,” IEEE Transactions on Signal Processing, vol. 67, no. 17, pp. 4494–4506, 2019.
  • [8] I. Notarnicola and G. Notarstefano, “Asynchronous distributed optimization via randomized dual proximal gradient,” IEEE Transactions on Automatic Control, vol. 62, no. 5, pp. 2095–2106, 2017.
  • [9] A. Makhdoumi and A. Ozdaglar, “Convergence rate of distributed ADMM over networks,” IEEE Transactions on Automatic Control, vol. 62, no. 10, pp. 5082–5095, 2017.
  • [10] J. Zeng, T. He, and M. Wang, “A fast proximal gradient algorithm for decentralized composite optimization over directed networks,” Systems & Control Letters, vol. 107, pp. 36–43, 2017.
  • [11] N. S. Aybat, Z. Wang, T. Lin, and S. Ma, “Distributed linearized alternating direction method of multipliers for composite convex consensus optimization,” IEEE Transactions on Automatic Control, vol. 63, no. 1, pp. 5–20, 2018.
  • [12] W. Shi, Q. Ling, G. Wu, and W. Yin, “A proximal gradient algorithm for decentralized composite optimization,” IEEE Transactions on Signal Processing, vol. 63, no. 22, pp. 6013–6023, 2015.
  • [13] J. Xu, S. Zhu, Y. Soh, and L. Xie, “A Bregman splitting scheme for distributed optimization over networks,” IEEE Transactions on Automatic Control, vol. 63, no. 11, pp. 3809–3824, 2018.
  • [14] J. Lei, H. Chen, and H. Fang, “Primal-dual algorithm for distributed constrained optimization,” Systems & Control Letters, vol. 96, pp. 110–117, 2016.
  • [15] X. Wu and J. Lu, “Fenchel dual gradient methods for distributed convex optimization over time-varying networks,” IEEE Transactions on Automatic Control, vol. 64, no. 11, pp. 4629–4636, 2019.
  • [16] M. Hong and T. Chang, “Stochastic proximal gradient consensus over random networks,” IEEE Transactions on Signal Processing, vol. 65, no. 11, pp. 2933–2948, 2017.
  • [17] P. Bianchi, W. Hachem, and F. Iutzeler, “A coordinate descent primal-dual algorithm and application to distributed asynchronous optimization,” IEEE Transactions on Automatic Control, vol. 61, no. 10, pp. 2947–2957, 2016.
  • [18] A. Mokhtari, W. Shi, Q. Ling, and A. Ribeiro, “A decentralized second-order method with exact linear convergence rate for consensus optimization,” IEEE Transactions on Signal and Information Processing over Networks, vol. 2, no. 4, pp. 507–522, 2016.
  • [19] ——, “DQM: Decentralized quadratically approximated alternating direction method of multipliers,” IEEE Transactions on Signal Processing, vol. 64, no. 19, pp. 5158–5173, 2016.
  • [20] X. Wu, Z. Qu, and J. Lu, “A second-order proximal algorithm for consensus optimization,” IEEE Transactions on Automatic Control, vol. 66, no. 4, pp. 1864–1871, 2021.
  • [21] J. Xu, Y. Tian, Y. Sun, and G. Scutari, “Distributed algorithms for composite optimization: Unified and tight convergence analysis,” arXiv preprint arXiv:2002.11534, 2020.
  • [22] S. A. Alghunaim, E. Ryu, K. Yuan, and A. H. Sayed, “Decentralized proximal gradient algorithms with linear convergence rates,” accepted to IEEE Transactions on Automatic Control, 2020.
  • [23] A. Nedić and A. Olshevsky, “Distributed optimization over time-varying directed graphs,” IEEE Transactions on Automatic Control, vol. 60, no. 3, pp. 601–615, 2015.
  • [24] C. Xi and U. A. Khan, “Distributed subgradient projection algorithm over directed graphs,” IEEE Transactions on Automatic Control, vol. 62, no. 8, pp. 3986–3992, 2017.
  • [25] D. Yuan, Y. Hong, D. W. C. Ho, and C. Jiang, “Optimal distributed stochastic mirror descent for strongly convex optimization,” Automatica, vol. 90, pp. 196–203, 2018.
  • [26] A. Nedić, A. Olshevsky, and W. Shi, “Achieving geometric convergence for distributed optimization over time-varying graphs,” SIAM Journal on Optimization, vol. 27, no. 4, pp. 2597–2633, 2017.
  • [27] G. Qu and N. Li, “Harnessing smoothness to accelerate distributed optimization,” IEEE Transactions on Control of Network Systems, vol. 5, no. 3, pp. 1245–1260, 2017.
  • [28] S. Pu, W. Shi, J. Xu, and A. Nedić, “Push-pull gradient methods for distributed optimization in networks,” accepted to IEEE Transactions on Automatic Control, 2020.
  • [29] R. Xin and U. A. Khan, “Distributed heavy-ball: A generalization and acceleration of first-order methods with gradient tracking,” IEEE Transactions on Automatic Control, vol. 65, no. 6, pp. 2627–2633, 2020.
  • [30] D. Varagnolo, F. Zanella, A. Cenedese, G. Pillonetto, and L. Schenato, “Newton-Raphson consensus for distributed convex optimization,” IEEE Transactions on Automatic Control, vol. 61, no. 4, pp. 994–1009, 2016.
  • [31] P. Di Lorenzo and G. Scutari, “NEXT: In-network nonconvex optimization,” IEEE Transactions on Signal and Information Processing over Networks, vol. 2, no. 2, pp. 120–136, 2016.
  • [32] A. Daneshmand, Y. Sun, G. Scutari, F. Facchinei, and B. Sadler, “Decentralized dictionary learning over time-varying digraphs,” Journal of Machine Learning Research, vol. 20, pp. 1–62, 2019.
  • [33] G. Scutari and Y. Sun, “Distributed nonconvex constrained optimization over time-varying digraphs,” Mathematical Programming, vol. 176, pp. 497–544, 2019.
  • [34] I. Notarnicola, Y. Sun, G. Scutari, and G. Notarstefano, “Distributed big-data optimization via block-wise gradient tracking,” accepted to IEEE Transactions on Automatic Control, 2020.
  • [35] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, “Distributed optimization and statistical learning via the alternating direction method of multipliers,” Foundations and Trends in Machine Learning, vol. 3, no. 1, pp. 1–122, 2011.
  • [36] F. Facchinei, G. Scutari, and S. Sagratella, “Parallel selective algorithms for nonconvex big data optimization,” IEEE Transactions on Signal Processing, vol. 63, no. 7, pp. 1874–1889, 2015.
  • [37] F. Facchinei, L. Lampariello, and G. Scutari, “Feasible methods for nonconvex nonsmooth problems with applications in green communications,” Mathematical Programming, vol. 164, pp. 55–90, 2017.
  • [38] M. Hong, X. Wang, M. Razaviyayn, and Z.-Q. Luo, “Iteration complexity analysis of block coordinate descent methods,” Mathematical Programming, vol. 163, pp. 85–114, 2017.
  • [39] G. Scutari, F. Facchinei, and L. Lampariello, “Parallel and distributed methods for constrained nonconvex optimization—Part I: Theory,” IEEE Transactions on Signal Processing, vol. 65, no. 8, pp. 1929–1944, 2017.
  • [40] L. Cannelli, F. Facchinei, V. Kungurtsev, and G. Scutari, “Asynchronous parallel algorithms for nonconvex optimization,” Mathematical Programming, vol. 184, pp. 1–34, 2019.
  • [41] S. Kakade, S. Shalev-Shwartz, and A. Tewari, “On the duality of strong convexity and strong smoothness: Learning applications and matrix regularization,” Manuscript, 2009.
  • [42] A. Beck and M. Teboulle, “Mirror descent and nonlinear projected subgradient methods for convex optimization,” Operations Research Letters, vol. 31, no. 3, pp. 167–175, 2003.
  • [43] M. Spivak, Calculus on manifolds: A modern approach to classical theorems of advanced calculus.   Addison-Wesley, 1965.
  • [44] K. Yuan, Q. Ling, and W. Yin, “On the convergence of decentralized gradient descent,” SIAM Journal on Optimization, vol. 26, no. 3, pp. 1835–1854, 2016.
  • [45] F. Bach, “Adaptivity of averaged stochastic gradient descent to local strong convexity for logistic regression,” Journal of Machine Learning Research, vol. 15, pp. 595–627, 2014.
Xuyang Wu (SM’17) received the B.S. degree in Information and Computing Science from Northwestern Polytechnical University, Xi’an, China, in 2015, and the Ph.D. degree from the School of Information Science and Technology at ShanghaiTech University, Shanghai, China, in 2020. He is currently a postdoctoral researcher in the Division of Decision and Control Systems at KTH Royal Institute of Technology, Stockholm, Sweden. His research interests include distributed optimization, large-scale optimization, and their applications in IoT and power systems.
Jie Lu (SM’08-M’13) received the B.S. degree in Information Engineering from Shanghai Jiao Tong University, China, in 2007, and the Ph.D. degree in Electrical and Computer Engineering from the University of Oklahoma, USA, in 2011. From 2012 to 2015 she was a postdoctoral researcher with KTH Royal Institute of Technology, Stockholm, Sweden, and with Chalmers University of Technology, Gothenburg, Sweden. Since 2015, she has been an assistant professor in the School of Information Science and Technology at ShanghaiTech University, Shanghai, China. Her research interests include distributed optimization, optimization theory and algorithms, and networked dynamical systems.