
Accelerated Dual Averaging Methods for Decentralized Constrained Optimization

Changxin Liu, Yang Shi, Huiping Li, and Wenli Du. This work was supported in part by the Natural Sciences and Engineering Research Council of Canada, in part by the National Natural Science Foundation of China (NSFC) under Grant 61922068, and in part by the Shaanxi Provincial Funds for Distinguished Young Scientists under Grant 2019JC-14x. (Corresponding author: Yang Shi.) C. Liu is with the Key Laboratory of Smart Manufacturing in Energy Chemical Process, Ministry of Education, East China University of Science and Technology, Shanghai, 200237, and also with the Department of Mechanical Engineering, University of Victoria, Victoria, BC V8W 3P6, Canada (e-mail: chxliu@uvic.ca). Y. Shi is with the Department of Mechanical Engineering, University of Victoria, Victoria, BC V8W 3P6, Canada (e-mail: yshi@uvic.ca). H. Li is with the School of Marine Science and Technology, Northwestern Polytechnical University, Xi’an, 710072, China (e-mail: lihuiping@nwpu.edu.cn). W. Du is with the Key Laboratory of Smart Manufacturing in Energy Chemical Process, Ministry of Education, East China University of Science and Technology, Shanghai, 200237, China (e-mail: wldu@ecust.edu.cn).
Abstract

In this work, we study decentralized convex constrained optimization problems in networks. We focus on the dual averaging-based algorithmic framework that is well-documented to be superior in handling constraints and complex communication environments simultaneously. Two new decentralized dual averaging (DDA) algorithms are proposed. In the first one, a second-order dynamic average consensus protocol is tailored for DDA-type algorithms, which equips each agent with a provably more accurate estimate of the global dual variable than conventional schemes. We rigorously prove that the proposed algorithm attains $\mathcal{O}(1/t)$ convergence for general convex and smooth problems, for which existing DDA methods were only known to converge at $\mathcal{O}(1/\sqrt{t})$ prior to our work. In the second one, we use the extrapolation technique to accelerate the convergence of DDA. Compared to existing accelerated algorithms, where typically two different variables are exchanged among agents at each time, the proposed algorithm only seeks consensus on local gradients. Then, the extrapolation is performed based on two sequences of primal variables which are determined by the accumulations of gradients at two consecutive time instants, respectively. The algorithm is proved to converge at $\mathcal{O}(1)\left(\frac{1}{t^{2}}+\frac{1}{t(1-\beta)^{2}}\right)$, where $\beta$ denotes the second largest singular value of the mixing matrix. We remark that the condition for the algorithmic parameter to guarantee convergence does not rely on the spectrum of the mixing matrix, making it easy to satisfy in practice. Finally, numerical results are presented to demonstrate the efficiency of the proposed methods.

Index Terms:
Decentralized optimization, constrained optimization, dual averaging, acceleration, multi-agent system.

I Introduction

Consider a multi-agent system consisting of $n$ agents. Each agent holds a private objective function. They are connected via a communication network in order to collaboratively solve the following optimization problem:

\min_{x\in\mathcal{X}}\left\{f(x):=\frac{1}{n}\sum_{i=1}^{n}f_{i}(x)\right\} (1)

where $f_{i}$ represents the local smooth objective function of agent $i$ and $\mathcal{X}\subseteq\mathbb{R}^{m}$ denotes the constraint set shared by all the agents. This problem is referred to as decentralized optimization in the literature and finds broad applications in optimal control of cyber-physical systems, sensor networks, and machine learning, to name a few. For an overview of decentralized optimization and its applications, please refer to [1, 2].

Over the last decade, many decentralized optimization algorithms have been proposed for solving Problem (1). For unconstrained problems, i.e., $\mathcal{X}=\mathbb{R}^{m}$, the authors in [3, 4] developed decentralized gradient descent (DGD) methods with constant step sizes, where the local search performed by individual agents is guided by local gradients and a consensus protocol. However, because each individual gradient evaluated at the global optimum is not necessarily zero, the search directions induced by consensus-seeking and local gradients may conflict with each other, making it difficult to ascertain the exact solution to the problem. Several efforts have been made to overcome this drawback. For example, the authors in [5] proposed the EXTRA algorithm that adds a cumulative correction term to DGD to achieve consensual optimization. Alternatively, the additional gradient-tracking process based on the dynamic average consensus scheme in [6] can be used. It is shown in [7, 8] that, for unconstrained smooth optimization, the algorithms steered by the tracked gradient exactly converge at an $\mathcal{O}(1/t)$ rate. Based on this idea, a decentralized Nesterov gradient descent method was proposed in [9], where the rate of convergence is accelerated to $\mathcal{O}(1/t^{1.4-\epsilon})$ for any $\epsilon\in(0,1.4)$ at the expense of exchanging an additional variable among agents at each time instant. In [10], the authors proposed an accelerated decentralized algorithm with multiple consensus rounds at each time instant, and proved that after $t$ local iterations and $\mathcal{O}(t\log t)$ communication rounds the objective error is bounded by $\mathcal{O}(1/t^{2})$. By modeling Problem (1) as a linearly constrained optimization problem, centralized primal-dual paradigms such as the augmented Lagrangian method (ALM), the alternating direction method of multipliers (ADMM), and dual ascent can also be used to design decentralized algorithms [11, 12, 13]. Based on the primal-dual reformulation, an accelerated primal-dual method was developed in [14]. The rate of convergence is improved to $\mathcal{O}(1)\left(\frac{L}{t^{2}}+\frac{1}{t\sqrt{\eta}}\right)$, where $L$ denotes the smoothness parameter of each objective function and $\eta=\lambda_{2}(\mathcal{L})/\lambda_{m}(\mathcal{L})$ is the eigengap of the graph Laplacian $\mathcal{L}$. Notably, the authors established a lower bound for a class of decentralized primal-dual methods, suggesting that the algorithm developed therein is optimal in terms of gradient computations. The authors in [15] considered the Lagrangian dual formulation of the decentralized optimization problem and developed two algorithms based on accelerated dual methods. The algorithms are proved to be linearly convergent for strongly convex and smooth problems.

For constrained problems, the design and convergence analysis of decentralized optimization algorithms are more challenging [16, 17, 18]. The seminal work in [19] is based on the gossip protocol and the projected subgradient method, where the step size is made decaying to ensure convergence. The randomized smoothing technique and a multi-round consensus scheme are used to design a provably optimal decentralized algorithm for non-smooth convex problems in [20]. To improve the performance using a constant step size, a variant of EXTRA (PG-EXTRA) was developed in [21], where the constraint is modeled as a non-smooth indicator function and handled via the proximal operator. An $\mathcal{O}(1/t)$ rate of convergence is proved for the squared norm of the difference of consecutive iterates. Recently, the authors in [22] proposed an accelerated decentralized penalty method (APM), where the constraint can also be treated as the non-smooth part of the objective. Notably, there are some decentralized algorithms [23, 24, 25, 26, 27, 28, 29] available in the literature where the local search mechanism for individual agents is inspired by dual methods [30], e.g., mirror descent and dual averaging [31, 32]. In particular, dual averaging is provably more efficient in exploiting sparsity than proximal gradient methods for $l_{1}$-regularized problems [32]. For example, the authors in [26] developed a decentralized dual averaging (DDA) algorithm where a linear model of the global objective function is gradually learned by each agent via gossip. Compared to other types of decentralized first-order methods, DDA has the advantage of simultaneously handling constraints and complex communication environments, e.g., directed networks [33], deterministic time-varying networks [29], and stochastic networks [26, 34]. From a technical perspective, this is because DDA seeks consensus on linear models of the objective function rather than on the local projected iterates as in decentralized primal methods, e.g., DGD, therefore decoupling the consensus-seeking process from the nonlinear projection and facilitating the rate analysis in complex communication environments. We present a more detailed comparison in Section III-A.

Although decentralized dual methods in the literature have demonstrated advantages over their primal counterparts in terms of constraint handling and analysis complexity, existing results focus on non-smooth problems and achieve only an $\mathcal{O}(1/\sqrt{t})$ rate of convergence. Considering this, a question naturally arises: if the objective functions exhibit some desired properties, e.g., smoothness, is it possible to accelerate the convergence rate of DDA beyond $\mathcal{O}(1/\sqrt{t})$? We provide an affirmative answer to this question in this work. The main results and contributions are summarized in the following:

  • First, we develop a new DDA algorithm, where a second-order dynamic average consensus protocol is deliberately designed to assist each agent in estimating the global dual variable. Compared to the conventional estimation scheme [26], the proposed method equips each agent with provably more accurate estimates. In particular, the estimation error accumulated over time is proved to admit an upper bound constituted by the successive differences of an auxiliary variable whose update uses the mean of local dual variables. Then a rigorous investigation into the convergence of the auxiliary variable is carried out. Combining these two relations, we establish conditions on the algorithm parameters such that the estimation error can be fully compensated, leading to an improved rate of convergence $\mathcal{O}(1/t)$.

  • Second, we propose an accelerated DDA (ADDA) algorithm. Different from DDA, each agent employs a first-order dynamic average consensus protocol to estimate the mean of local gradients and accumulates the estimates over time to generate a local dual variable. By solving the convex conjugate of a $1$-strongly convex function over this local dual variable, each agent produces a primal variable and uses it to construct another two sequences of primal variables in an iterative manner based on the extrapolation technique in [35] and the average consensus protocol. The rate of convergence is proved to be $\mathcal{O}(1)\left(\frac{1}{t^{2}}+\frac{1}{t(1-\beta)^{2}}\right)$, where $\beta$ denotes the second largest singular value of the mixing matrix. Notably, the condition for the algorithmic parameter to ensure convergence does not rely on the mixing matrix. Establishing such a condition that is independent of the mixing matrix offers the appealing advantage of convenient verification in practical applications.

  • Finally, the proposed algorithms are tested and compared with a few methods in the literature on decentralized LASSO problems characterized by synthetic and real datasets. The comparison results demonstrate the efficiency of the proposed methods.

Notation: We use $\mathbb{R}$ and $\mathbb{R}^{n}$ to denote the set of reals and the $n$-dimensional Euclidean space, respectively. Given a real number $a$, we denote by $\lceil a\rceil$ the ceiling function that maps $a$ to the least integer greater than or equal to $a$. Given a vector $x\in\mathbb{R}^{n}$, $\lVert x\rVert$ denotes its $2$-norm. Given a matrix $P\in\mathbb{R}^{n\times n}$, its spectral radius is denoted by $\rho(P)$. Its eigenvalues and singular values are denoted by $\lambda_{1}(P)\geq\lambda_{2}(P)\geq\cdots\geq\lambda_{n}(P)$ and $\sigma_{1}(P)\geq\sigma_{2}(P)\geq\cdots\geq\sigma_{n}(P)$, respectively.

II Preliminaries

II-A Basic Setup

We consider the finite-sum optimization problem (1), in which $\mathcal{X}$ is a convex and compact set, and $f_{i}$ satisfies the following assumptions for all $i=1,\dots,n$:

Assumption 1.
  • i)

    $f_{i}$ is continuously differentiable on $\mathcal{X}$.

  • ii)

    $f_{i}$ is convex on $\mathcal{X}$, i.e., for any $x,y\in\mathcal{X}$,

    f_{i}(x)-f_{i}(y)-\langle\nabla f_{i}(y),x-y\rangle\geq 0. (2)
  • iii)

    $\nabla f_{i}$ is Lipschitz continuous on $\mathcal{X}$ with a constant $L>0$, i.e., for any $x,y\in\mathcal{X}$,

    \|\nabla f_{i}(x)-\nabla f_{i}(y)\|\leq L\|x-y\|. (3)

A direct consequence of Assumption 1(iii) is

f_{i}(x)-f_{i}(y)-\langle\nabla f_{i}(y),x-y\rangle\leq\frac{L}{2}\|x-y\|^{2},\quad\forall x,y\in\mathcal{X}. (4)

The above assumptions are standard in the study of decentralized algorithms for convex optimization problems. Throughout the paper, we denote by $x^{*}$ an optimal solution of Problem (1).

II-B Communication Network

We consider solving Problem (1) in a decentralized fashion, that is, a pair of agents can exchange information only if they are connected in the communication network. To describe the network topology, an undirected graph $\mathcal{G}=\{\mathcal{V},\mathcal{E}\}$ is used, where $\mathcal{V}=\{1,\cdots,n\}$ denotes the set of $n$ agents and $\mathcal{E}\subseteq\mathcal{V}\times\mathcal{V}$ represents the set of bidirectional channels, i.e., $(i,j)\in\mathcal{E}$ indicates that nodes $i$ and $j$ can send information to each other. Agent $j$ is said to be a neighbor of $i$ if there exists a link between them, and the set of $i$'s neighbors is denoted by $\mathcal{N}_{i}=\{j\in\mathcal{V}\,|\,(j,i)\in\mathcal{E}\}$. For every pair $(i,j)\in\mathcal{E}$, a positive weight $p_{ij}>0$ is assigned to $i$ and $j$ to weigh the information received from each other; otherwise $p_{ij}=0$. For the convergence of the algorithm, we make the following assumption on $P:=[p_{ij}]\in[0,1]^{n\times n}$.

Assumption 2.
  • i)

    $P\mathbf{1}=\mathbf{1}$ and $\mathbf{1}^{\mathrm{T}}P=\mathbf{1}^{\mathrm{T}}$, where $\mathbf{1}$ denotes the all-one vector of dimension $n$.

  • ii)

    $P$ has a strictly positive diagonal, i.e., $p_{ii}>0$.

Assumption 2 implies that $\sigma_{2}(P)<1$ [28]. Given a connected network, constant edge weights or the Metropolis-Hastings algorithm [36] can be used to construct a weight matrix $P$ fulfilling Assumption 2.
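To make Assumption 2 concrete, the following is a minimal sketch (not from the paper) of the Metropolis-Hastings rule [36] for building a symmetric, doubly stochastic weight matrix with a positive diagonal; the function name `metropolis_weights` and the 0/1 `adjacency` array are illustrative assumptions.

```python
import numpy as np

def metropolis_weights(adjacency):
    # Metropolis-Hastings weights: p_ij = 1 / (1 + max(d_i, d_j)) for (i, j) in E.
    n = adjacency.shape[0]
    degrees = adjacency.sum(axis=1)
    P = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j and adjacency[i, j]:
                P[i, j] = 1.0 / (1.0 + max(degrees[i], degrees[j]))
        P[i, i] = 1.0 - P[i].sum()   # self-weight keeps every row (and column) summing to one
    return P
```

For a connected graph, the resulting $P$ satisfies $P\mathbf{1}=\mathbf{1}$, $\mathbf{1}^{\mathrm{T}}P=\mathbf{1}^{\mathrm{T}}$, and $p_{ii}>0$, as required by Assumption 2.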

II-C Centralized Dual Averaging

Our algorithms are based on the dual averaging methods [31]. Before introducing them, we state the following definition.

Definition 1.

A differentiable function $\psi$ is strongly convex with modulus $\mu>0$ on $\mathcal{X}$ if

\psi(x)-\psi(y)-\langle\nabla\psi(y),x-y\rangle\geq\frac{\mu}{2}\lVert x-y\rVert^{2},\quad\forall x,y\in\mathcal{X}.

Let $d$ be a strongly convex and differentiable function with modulus $1$ on $\mathcal{X}$ such that

x^{(0)}=\operatorname*{argmin}_{x\in\mathcal{X}}d(x)\quad\mathrm{and}\quad d(x^{(0)})=0. (5)

To meet the condition in (5) for any $x^{(0)}\in\mathcal{X}$, one can choose

d(x)=\tilde{d}(x)-\tilde{d}(x^{(0)})-\langle\nabla\tilde{d}(x^{(0)}),x-x^{(0)}\rangle,

where $\tilde{d}$ is any strongly convex function with modulus $1$, e.g., $\tilde{d}(x)=\lVert x\rVert^{2}/2$. The gradient of the convex conjugate of $d$ is given by

\nabla d^{*}(\cdot)=\operatorname*{argmax}_{x\in\mathcal{X}}\left\{\left\langle\cdot,x\right\rangle-d(x)\right\}.

As a corollary of Danskin’s Theorem, we have the following result [35].

Lemma 1.

For all $x,y\in\mathbb{R}^{m}$, we have

\left\lVert\nabla d^{*}(x)-\nabla d^{*}(y)\right\rVert\leq\lVert x-y\rVert. (6)

Dual averaging. The dual averaging method can be applied to solve Problem (1) in a centralized manner. Starting with $x^{(0)}$, it generates a sequence of variables $\{x^{(t)}\}_{t\geq 0}$ iteratively according to

{x}^{(t)}=\nabla d^{*}\left(-a_{t}z^{(t)}\right) (7)

where

z^{(t)}=\sum_{\tau=0}^{t-1}\nabla f(x^{(\tau)}) (8)

and $\{a_{t}\}_{t\geq 0}$ is a sequence of positive parameters that determines the rate of convergence. Let $\tilde{x}^{(t)}=t^{-1}\sum_{\tau=0}^{t-1}{x}^{(\tau)}$. It is proved in [31] that $f(\tilde{x}^{(t)})-f(x^{*})\leq\mathcal{O}(1/\sqrt{t})$ when $a_{t}=\Theta(1/\sqrt{t})$, that is, with order exactly $1/\sqrt{t}$. When the objective function is convex and smooth, a constant $a_{t}=a$ can be used to achieve an ergodic $\mathcal{O}(1/t)$ rate of convergence in terms of the objective error [37].
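As a quick illustration of (7) and (8), the following is a minimal sketch (not from the paper) of the centralized dual averaging loop with the prox-function $d(x)=\lVert x-x^{(0)}\rVert^{2}/2$, for which $\nabla d^{*}$ reduces to a Euclidean projection onto $\mathcal{X}$; `grad_f` and `project_onto_X` are assumed user-supplied callables.

```python
import numpy as np

def dual_averaging(grad_f, project_onto_X, x0, a, num_iters):
    x = x0.copy()
    z = np.zeros_like(x0)                # z^(t): running sum of gradients, cf. (8)
    x_avg = np.zeros_like(x0)
    for t in range(1, num_iters + 1):
        z += grad_f(x)                   # accumulate grad f(x^(t-1))
        x = project_onto_X(x0 - a * z)   # x^(t) = grad d*(-a z^(t)), cf. (7)
        x_avg += (x - x_avg) / t         # ergodic average, the quantity bounded in [37]
    return x_avg
```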

Accelerated dual averaging. To speed up the rate of convergence, an accelerated dual averaging algorithm is developed in [35]. In particular, the variables are updated according to

u^{(t)}=\frac{A_{t-1}}{A_{t}}v^{(t-1)}+\frac{a_{t}}{A_{t}}{w}^{(t-1)} (9a)
v^{(t)}=\frac{A_{t-1}}{A_{t}}v^{(t-1)}+\frac{a_{t}}{A_{t}}{w}^{(t)}, (9b)

where $a_{t}:=a(t+1)$ for some $a>0$, $A_{t}=\sum_{\tau=1}^{t}a_{\tau}$, and

{w}^{(t)}=\nabla d^{*}\left(-\sum_{\tau=1}^{t}a_{\tau}\nabla f(u^{(\tau)})\right). (10)

Note that $t\geq 2$ is considered for the above iteration, and the variables are initialized with $u^{(1)}=w^{(0)}=x^{(0)}$, $v^{(1)}=w^{(1)}$. For convex and smooth objective functions, it is proved that $f(v^{(t)})-f(x^{*})\leq\mathcal{O}(1/t^{2})$ [35].
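For comparison with the plain method above, here is a minimal sketch (again an illustration, not the paper's code) of the recursion (9)-(10) with $a_{t}=a(t+1)$ and the same projection-based $\nabla d^{*}$; `grad_f` and `project_onto_X` are assumed callables.

```python
import numpy as np

def accelerated_dual_averaging(grad_f, project_onto_X, x0, a, num_iters):
    u, w = x0.copy(), x0.copy()          # u^(1) = w^(0) = x^(0)
    v = None
    z = np.zeros_like(x0)                # running sum of a_tau * grad f(u^(tau))
    A = 0.0
    for t in range(1, num_iters + 1):
        a_t = a * (t + 1)
        A_prev, A = A, A + a_t
        if t > 1:
            u = (A_prev / A) * v + (a_t / A) * w               # extrapolation step (9a)
        z += a_t * grad_f(u)
        w = project_onto_X(x0 - z)                             # w^(t) from (10)
        v = w if t == 1 else (A_prev / A) * v + (a_t / A) * w  # averaging step (9b)
    return v                             # f(v^(t)) - f(x*) decays at O(1/t^2) [35]
```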

III Algorithms and Convergence Results

In this section, we develop two new DDA algorithms that are provably more efficient than existing DDA-type algorithms.

III-A Decentralized Dual Averaging

To solve Problem (1) in a decentralized manner, we propose a novel DDA algorithm. In particular, we employ the following dynamic average consensus protocol to estimate $z^{(t)}$ in (8):

z_{i}^{(t)}=\sum_{j=1}^{n}p_{ij}\left(z_{j}^{(t-1)}+s_{j}^{(t-1)}\right), (11a)
s_{i}^{(t)}=\sum_{j=1}^{n}p_{ij}s_{j}^{(t-1)}+\nabla f_{i}(x^{(t)}_{i})-\nabla f_{i}(x^{(t-1)}_{i}), (11b)

where $z_{i}^{(t)}$ is a local estimate of $z^{(t)}$ generated by agent $i$, and $s_{i}^{(t)}$ is a proxy of $\frac{1}{n}\sum_{i=1}^{n}\nabla f_{i}(x_{i}^{(t)})$ which aims to reduce the consensus error among the variables $\{z_{i}^{(t)}:i=1,\cdots,n\}_{t\geq 0}$. Using it, each agent $i$ performs a local computation step to update its estimate of $x^{(t)}$:

{x}_{i}^{(t)}=\nabla d^{*}\left(-az_{i}^{(t)}\right). (12)

The overall algorithm is summarized in Algorithm 1.

Algorithm 1 Decentralized Dual Averaging
  Input: $a>0$, $x^{(0)}\in\mathcal{X}$, and a strongly convex function $d$ with modulus $1$ such that (5) holds
  Initialize: $x_{i}^{(0)}=x^{(0)}$, $z_{i}^{(0)}=0$, and $s_{i}^{(0)}=\nabla f_{i}(x^{(0)})$ for all $i=1,\dots,n$
  for $t=1,2,\cdots$ do
     In parallel (task for agent $i$, $i=1,\dots,n$)
     collect $z_{j}^{(t-1)}$ and $s_{j}^{(t-1)}$ from all agents $j\in\mathcal{N}_{i}$
     update $z_{i}^{(t)}$ and $s_{i}^{(t)}$ by (11)
     compute $x_{i}^{(t)}$ by (12)
     broadcast $z_{i}^{(t)}$ and $s_{i}^{(t)}$ to all agents $j\in\mathcal{N}_{i}$
  end for
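The following is a minimal simulation sketch (not part of the paper) of Algorithm 1 in which all agents are updated inside one loop; in a real deployment each agent would only exchange $z_{i}$, $s_{i}$ with its neighbors. `grads` is an assumed list of per-agent gradient callables and `project_onto_X` an assumed Euclidean projection corresponding to $d(x)=\lVert x-x^{(0)}\rVert^{2}/2$.

```python
import numpy as np

def dda(P, grads, project_onto_X, x0, a, num_iters):
    n = P.shape[0]
    x = np.tile(x0, (n, 1))                              # x_i^(0) = x^(0)
    z = np.zeros_like(x)                                 # z_i^(0) = 0
    g = np.array([grads[i](x[i]) for i in range(n)])
    s = g.copy()                                         # s_i^(0) = grad f_i(x^(0))
    for _ in range(num_iters):
        z = P @ (z + s)                                  # consensus on dual estimates, (11a)
        x = np.array([project_onto_X(x0 - a * z[i]) for i in range(n)])   # local step (12)
        g_new = np.array([grads[i](x[i]) for i in range(n)])
        s = P @ s + g_new - g                            # gradient tracking, (11b)
        g = g_new
    return x.mean(axis=0)
```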

Before proceeding, we make the following remarks on Algorithm 1.

i) Subproblem solvability. Similar to centralized dual averaging methods, we assume the subproblem in (12) can be solved easily. For general problems, we can choose $d(x)=\lVert x-x^{(0)}\rVert^{2}/2$ such that the subproblem (12) reduces to computing the projection of variables onto $\mathcal{X}$. If $\mathcal{X}$ is simple enough, e.g., the simplex or the $l_{1}$-norm ball, a closed-form solution exists.
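For the $l_{1}$-norm ball $\{x:\lVert x\rVert_{1}\leq R\}$ specifically, the projection can be computed exactly by a sort-based routine; the sketch below follows the standard approach (cf. [41]) and is included only for illustration.

```python
import numpy as np

def project_l1_ball(v, R):
    # Euclidean projection of v onto {x : ||x||_1 <= R} via soft-thresholding.
    if np.abs(v).sum() <= R:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]                  # magnitudes in decreasing order
    cssv = np.cumsum(u)
    k = np.arange(1, v.size + 1)
    rho = np.nonzero(u - (cssv - R) / k > 0)[0][-1]
    theta = (cssv[rho] - R) / (rho + 1.0)         # optimal threshold
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)
```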

ii) Comparison with existing DDA algorithms. In existing DDA algorithms [26, 29, 28], each agent estimates $z^{(t)}$ in the following way:

z_{i}^{(t)}=\sum_{j=1}^{n}p_{ij}z_{j}^{(t-1)}+\nabla f_{i}(x_{i}^{(t)}). (13)

For this scheme, it is proved that the consensus error among the variables $\{z_{i}^{(t)}:i=1,\cdots,n\}_{t\geq 0}$ admits a constant upper bound [26], which necessitates the use of a monotonically decreasing sequence $\{a_{t}\}_{t\geq 0}$ for convergence. However, decreasing $a_{t}$ slows down the convergence significantly; the rate of convergence in [26, 29] is reported to be $\mathcal{O}(1/\sqrt{t})$. To speed up the convergence, we develop the consensus protocol in (11), which is inspired by the high-order consensus scheme in [6]. Thanks to it, we are able to prove that the deviation among the variables $\{z_{i}^{(t)}:i=1,\cdots,n\}_{t\geq 0}$ asymptotically vanishes as time evolves. Therefore, the parameter in (12) can be set constant, i.e., $a_{t}=a>0$, which is key to obtaining the improved rates.

iii) Comparison with DGD algorithms. In existing DGD-type algorithms, a so-called gradient-tracking process similar to (11) is usually observed:

\begin{split}x_{i}^{(t)}&=\sum_{j=1}^{n}p_{ij}\left(x_{j}^{(t-1)}+as_{j}^{(t-1)}\right),\\ s_{i}^{(t)}&=\sum_{j=1}^{n}p_{ij}s_{j}^{(t-1)}+\nabla f_{i}(x^{(t)}_{i})-\nabla f_{i}(x^{(t-1)}_{i}),\end{split} (14)

where $a$ represents the step size. The proposed scheme (11) differs from (14) in step (11a). With such a deliberate design and the additional local dual averaging step in (12), Algorithm 1 solves constrained problems with a convergence rate guarantee. To compare DDA with existing algorithms applicable to solving Problem (1), we recall the PG-EXTRA algorithm [21]:

\begin{split}\hat{x}_{i}^{(t+1)}=&\sum_{j=1}^{n}p_{ij}{x}_{j}^{(t)}+\hat{x}_{i}^{(t)}-\sum_{j=1}^{n}\tilde{p}_{ij}{x}_{j}^{(t-1)}-a\left(\nabla f_{i}(x_{i}^{(t)})-\nabla f_{i}(x_{i}^{(t-1)})\right)\\ {x}_{i}^{(t+1)}=&\operatorname*{argmin}_{x\in\mathcal{X}}\left\lVert x-\hat{x}_{i}^{(t+1)}\right\rVert^{2}\end{split} (15)

where $a$ represents the step size and $\tilde{p}_{ij}$ denotes the $(i,j)$-th entry of $\tilde{P}=(P+I)/2$. Notably, PG-EXTRA seeks consensus among the variables $\{{x}_{i}^{(t)}:i=1,\cdots,n\}$ at time $t+1$ that are obtained via a projection operator at time $t$. In contrast, DDA manages to agree on $\{{z}_{i}^{(t)}:i=1,\cdots,n\}$, which essentially decouples the consensus-seeking procedure from the projection. After using the smoothness assumption in (3) to bound $\lVert\nabla f_{i}(x^{(t)}_{i})-\nabla f_{i}(x^{(t-1)}_{i})\rVert$ in (11b), the iteration in (11) can be kept almost linear, which greatly facilitates the rate analysis; see the proof of Lemma 5 for more details.
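For completeness, the next sketch simulates the PG-EXTRA recursion (15) in the same style; it is illustrative only, and the initialization $\hat{x}_{i}^{(1)}=\sum_{j}p_{ij}x_{j}^{(0)}-a\nabla f_{i}(x^{(0)})$ is an assumption (the standard PG-EXTRA start) rather than something specified above.

```python
import numpy as np

def pg_extra(P, grads, project_onto_X, x0, a, num_iters):
    n = P.shape[0]
    P_tilde = (P + np.eye(n)) / 2
    x_prev = np.tile(x0, (n, 1))
    g_prev = np.array([grads[i](x_prev[i]) for i in range(n)])
    x_hat = P @ x_prev - a * g_prev                       # assumed first step
    x = np.array([project_onto_X(x_hat[i]) for i in range(n)])
    for _ in range(num_iters - 1):
        g = np.array([grads[i](x[i]) for i in range(n)])
        x_hat = P @ x + x_hat - P_tilde @ x_prev - a * (g - g_prev)   # cf. (15)
        x_prev, g_prev = x, g
        x = np.array([project_onto_X(x_hat[i]) for i in range(n)])    # projection step
    return x.mean(axis=0)
```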

Next, we present the convergence result of Algorithm 1. Inspired by [26], we first establish the convergence property of an auxiliary sequence $\{{y}^{(t)}\}_{t\geq 0}$, which is instrumental in proving the convergence of $\{x_{i}^{(t)}:i=1,\cdots,n\}_{t\geq 0}$. In particular, the update of ${y}^{(t)}$ obeys

{y}^{(t)}=\nabla d^{*}\left(-a\overline{z}^{(t)}\right), (16)

where the initial vector ${y}^{(0)}=x^{(0)}$ and $\overline{z}^{(t)}=\frac{1}{n}\sum_{i=1}^{n}z_{i}^{(t)}$. To proceed, we introduce the following $2\times 2$ matrix:

{\bf M}=\begin{bmatrix}\beta&\beta\\ aL(\beta+1)&\beta(aL+1)\end{bmatrix} (17)

where $\beta=\sigma_{2}(P)$, and let $\rho({\bf M})$ be the spectral radius of ${\bf M}$.

Theorem 1.

Suppose that Assumptions 1 and 2 are satisfied. If the constant $a$ in Algorithm 1 satisfies

\frac{1}{a}>2L\cdot\max\left\{\frac{\beta}{(1-\beta)^{2}},1+\frac{8}{9\left(1-(\rho({\bf M}))^{2}\right)}\right\}, (18)

then, for all $t\geq 1$, it holds that

f(\tilde{y}^{(t)})-f(x^{*})\leq\frac{C}{at}, (19)

where $\tilde{y}^{(t)}=t^{-1}\sum_{\tau=1}^{t}{y}^{(\tau)}$ with ${y}^{(\tau)}$ defined in (16),

C:=d(x^{*})+\frac{8a\pi^{2}}{9nL\left(1-(\rho({\bf M}))^{2}\right)},

and

\pi^{2}=\sum_{i=1}^{n}\left\|\nabla f_{i}(x^{(0)})-\frac{1}{n}\sum_{j=1}^{n}\nabla f_{j}(x^{(0)})\right\|^{2}. (20)

In addition, for all $t\geq 1$ and $i=1,\cdots,n$, we have

\lVert\tilde{x}_{i}^{(t)}-\tilde{y}^{(t)}\rVert^{2}\leq\frac{D}{t} (21)

where $\tilde{x}_{i}^{(t)}=t^{-1}\sum_{\tau=1}^{t}{x}_{i}^{(\tau)}$,

D:=\frac{8nC}{9\gamma\left(1-\rho({\bf M})\right)^{2}}+\frac{8\pi^{2}}{9L^{2}\left(1-(\rho({\bf M}))^{2}\right)},

and

\gamma:=\frac{1}{2}-aL-\frac{8aL}{9\left(1-(\rho({\bf M}))^{2}\right)}. (22)
Proof.

The proof is postponed to Appendix B. ∎

To obtain a more explicit version of (18), we identify the eigenvalues of ${\bf M}$ as $\lambda_{1}=(\xi_{1}+\xi_{2})/2$ and $\lambda_{2}=(\xi_{1}-\xi_{2})/2$, where

\xi_{1}=\beta(2+aL)>0,\quad\xi_{2}=\sqrt{a^{2}\beta^{2}L^{2}+4aL\beta(\beta+1)}>0. (23)

Thus, we have $\lvert\lambda_{1}\rvert>\lvert\lambda_{2}\rvert$ and $\rho({\bf M})=\lambda_{1}>0$. By routine calculation, we can verify that $\rho({\bf M})$ and $\nu(a):=\frac{8}{9(1-(\rho({\bf M}))^{2})}$ monotonically increase with $a$. Due to

\nu\left(\frac{1}{2L}\right)<\frac{1}{1-\left(\frac{2.5\beta+\sqrt{2.25\beta^{2}+2\beta}}{2}\right)^{2}},

we have that as long as $a$ satisfies

\frac{1}{a}>2L\cdot\max\left\{\frac{\beta}{(1-\beta)^{2}},1+\frac{1}{1-\left(\frac{2.5\beta+\sqrt{2.25\beta^{2}+2\beta}}{2}\right)^{2}}\right\}, (24)

then $a$ also satisfies (18). Based on (24), we have

a=\Theta\left(\frac{(1-\beta)^{2}}{L}\right),

whose size is comparable to that used in the DGD algorithms [38, 8, 39] in the literature.
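As a practical aid (not part of the analysis), one can also check the original condition (18) numerically: compute $\rho({\bf M})$ from the eigenvalue formulas (23) for a candidate $a$ and verify the inequality directly, as in the sketch below.

```python
import numpy as np

def satisfies_condition_18(a, L, beta):
    # rho(M) = lambda_1 = (xi_1 + xi_2) / 2, cf. (23)
    xi1 = beta * (2 + a * L)
    xi2 = np.sqrt(a**2 * beta**2 * L**2 + 4 * a * L * beta * (beta + 1))
    rho = (xi1 + xi2) / 2
    if rho >= 1:                       # Lemma 4 requires rho(M) < 1
        return False
    rhs = 2 * L * max(beta / (1 - beta)**2, 1 + 8 / (9 * (1 - rho**2)))
    return 1.0 / a > rhs               # condition (18)
```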

Next, we consider an unconstrained version of Problem (1), i.e., $\mathcal{X}=\mathbb{R}^{m}$, where the rate of convergence is stated for $f(\tilde{x}_{i}^{(t)})-f(x^{*})$.

Corollary 1.

Suppose the premise of Theorem 1 holds. If $\mathcal{X}=\mathbb{R}^{m}$ in (1) and $d(x)=\lVert x\rVert^{2}/2$ in (12), and

\frac{1}{a}>2L\cdot\max\left\{\frac{\beta}{(1-\beta)^{2}},1+\frac{8}{3\left(1-(\rho({\bf M}))^{2}\right)}\right\}, (25)

then

f(\tilde{x}_{i}^{(t)})-f(x^{*})\leq\frac{1}{t}\left(\frac{n}{2a}\lVert x^{*}\rVert^{2}+\frac{8\pi^{2}}{3L\left(1-(\rho({\bf M}))^{2}\right)}\right) (26)

where $\tilde{x}_{i}^{(t)}=t^{-1}\sum_{\tau=1}^{t}{x}_{i}^{(\tau)}$ and $\pi^{2}$ is defined in (20).

Proof.

The proof is given in Appendix C. ∎

III-B Accelerated Decentralized Dual Averaging

To further speed up the convergence, we develop a decentralized variant of the accelerated dual averaging method in (9) and (10). Different from Algorithm 1, we consider building consensus among the variables $\{v_{i}^{(t)},i=1,\cdots,n\}$ and propose the following iteration rule:

u^{(t)}_{i}=\frac{A_{t-1}}{A_{t}}\sum_{j=1}^{n}p_{ij}v^{(t-1)}_{j}+\frac{a_{t}}{A_{t}}{w}_{i}^{(t-1)} (27a)
v^{(t)}_{i}=\frac{A_{t-1}}{A_{t}}\sum_{j=1}^{n}p_{ij}v^{(t-1)}_{j}+\frac{a_{t}}{A_{t}}{w}_{i}^{(t)}, (27b)

where

{w}_{i}^{(t)}=\nabla d^{*}\left(-\sum_{\tau=1}^{t}a_{\tau}q_{i}^{(\tau)}\right), (28)

and

q_{i}^{(t)}=\sum_{j=1}^{n}p_{ij}q_{j}^{(t-1)}+\nabla f_{i}(u^{(t)}_{i})-\nabla f_{i}(u^{(t-1)}_{i}). (29)

The overall algorithm is summarized in Algorithm 2. It is worth mentioning that agents in Algorithm 2 consume the same communication resources as in Algorithm 1 to achieve acceleration.

Algorithm 2 Accelerated Decentralized Dual Averaging
  Input: $a>0$, $x^{(0)}\in\mathcal{X}$, and a strongly convex function $d$ with modulus $1$ such that (5) holds
  Initialize: $A_{1}=a_{1}=2a$, $u_{i}^{(1)}=w_{i}^{(0)}=x^{(0)}$, $q_{i}^{(1)}=\nabla f_{i}(x^{(0)})$, and $v_{i}^{(1)}=w_{i}^{(1)}$ for all $i=1,\dots,n$
  for $t=2,3,\cdots$ do
     set $a_{t}=a_{t-1}+a$ and $A_{t}=A_{t-1}+a_{t}$
     In parallel (task for agent $i$, $i=1,\dots,n$)
     collect $v_{j}^{(t-1)}$ and $q_{j}^{(t-1)}$ from all agents $j\in\mathcal{N}_{i}$
     update $u_{i}^{(t)}$ by (27a)
     update $q_{i}^{(t)}$ by (29)
     compute $w_{i}^{(t)}$ by (28)
     update $v_{i}^{(t)}$ by (27b)
     broadcast $v_{i}^{(t)}$ and $q_{i}^{(t)}$ to all agents $j\in\mathcal{N}_{i}$
  end for
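Analogously to the sketch given for Algorithm 1, the following is a minimal, illustrative simulation of Algorithm 2 with all agents updated in one loop; `grads` and `project_onto_X` are assumed callables, and $d(x)=\lVert x-x^{(0)}\rVert^{2}/2$ is used so that (28) is a projection.

```python
import numpy as np

def adda(P, grads, project_onto_X, x0, a, num_iters):
    n = P.shape[0]
    u = np.tile(x0, (n, 1))                                   # u_i^(1) = x^(0)
    g_prev = np.array([grads[i](x0) for i in range(n)])
    q = g_prev.copy()                                         # q_i^(1) = grad f_i(x^(0))
    a_t = A = 2 * a                                           # a_1 = A_1 = 2a
    dual = a_t * q                                            # running sum of a_tau q_i^(tau)
    w = np.array([project_onto_X(x0 - dual[i]) for i in range(n)])   # w_i^(1), cf. (28)
    v = w.copy()                                              # v_i^(1) = w_i^(1)
    for t in range(2, num_iters + 1):
        a_t += a
        A_prev, A = A, A + a_t
        Pv = P @ v
        u_new = (A_prev / A) * Pv + (a_t / A) * w             # extrapolation, (27a)
        g_new = np.array([grads[i](u_new[i]) for i in range(n)])
        q = P @ q + g_new - g_prev                            # gradient tracking, (29)
        u, g_prev = u_new, g_new
        dual += a_t * q
        w = np.array([project_onto_X(x0 - dual[i]) for i in range(n)])   # (28)
        v = (A_prev / A) * Pv + (a_t / A) * w                 # averaging, (27b)
    return v.mean(axis=0)
```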

Assumption 3.

For the problem in (1), the constraint set $\mathcal{X}$ is bounded with the following diameter:

G=\max_{x,y\in\mathcal{X}}\lVert x-y\rVert.
Theorem 2.

For Algorithm 2, if Assumptions 1, 2, and 3 are satisfied, and

a\leq\frac{1}{6L}, (30)

then, for all $t\geq 1$, it holds that

f(\overline{v}^{(t)})-f(x^{*})\leq\frac{d(x^{*})}{A_{t}}+\frac{t}{A_{t}}\left(\frac{2G(LC_{p}+C_{g})}{\sqrt{n}}+\frac{6LC_{p}^{2}}{n}\right), (31)

where

C_{p}:=\left\lceil\frac{3}{1-\beta}\right\rceil\sqrt{n}G

and

C_{g}:=2L\left\lceil\frac{3}{1-\beta}\right\rceil\frac{\sqrt{n}G+C_{p}}{1-\beta}.

In addition, for all $t\geq 1$ and $i=1,\cdots,n$, we have

\left\lVert{v}_{i}^{(t)}-\overline{v}^{(t)}\right\rVert^{2}\leq\frac{2aC_{p}}{A_{t}}. (32)
Proof.

The proof is postponed to Appendix D. ∎

For Algorithm 2 and Theorem 2, the following remarks are in order.

i) Comparison with existing accelerated algorithms. Accelerated methods for decentralized constrained optimization are rarely reported in the literature. Recently, the authors in [22] developed the APM algorithm, where the iteration rule reads

y_{i}^{(t)}={x}_{i}^{(t)}+\frac{\theta_{t}(1-\theta_{t-1})}{\theta_{t-1}}\left({x}_{i}^{(t)}-{x}_{i}^{(t-1)}\right) (33a)
{s}_{i}^{(t)}=\nabla f_{i}(y_{i}^{(t)})+\frac{\beta_{0}}{\theta_{t}}\sum_{j=1}^{n}p_{ij}\left(y_{i}^{(t)}-y_{j}^{(t)}\right) (33b)
{x}_{i}^{(t+1)}=\arg\min_{{x}\in\mathcal{X}}\left\lVert x-y_{i}^{(t)}+\frac{s_{i}^{(t)}}{L+\beta_{0}/\theta_{t}}\right\rVert^{2} (33c)

where $\beta_{0}={L}/{\sqrt{1-\lambda_{2}(P)}}$ and $\theta_{t}$ is a decreasing parameter satisfying $\theta_{t}={\theta_{t-1}}/{(1+\theta_{t-1})}$ with $\theta_{0}=1$. Letting $\hat{s}_{i}^{(t)}=\theta_{t}s_{i}^{(t)}$, we can equivalently rewrite (33b) and (33c) as

\begin{split}\hat{s}_{i}^{(t)}=&{\theta_{t}}\nabla f_{i}(y_{i}^{(t)})+{\beta_{0}}\sum_{j=1}^{n}p_{ij}\left(y_{i}^{(t)}-y_{j}^{(t)}\right)\\ x_{i}^{(t+1)}=&\arg\min_{{x}\in\mathcal{X}}\left\lVert x-y_{i}^{(t)}+\frac{\hat{s}_{i}^{(t)}}{L\theta_{t}+\beta_{0}}\right\rVert^{2},\end{split}

from which we can see that new gradients are assigned decreasing weights, whereas increasing weights are used for ADDA in (28). The reason for such different choices of parameters may be two-fold. First, the parameter choices in (centralized) primal gradient descent and dual averaging methods are intrinsically different. Second, APM gradually increases the penalty parameter $1/\theta_{t}$ in order to enforce consensus, which essentially dilutes the weight on gradients, as shown above. We will show in the simulation that decreasing the weights over time slows down convergence. There are also a few other accelerated decentralized methods such as [9, 14]; however, they do not apply to constrained problems.

ii) Discussion about optimality. For ADDA, the rate of convergence is proved to be

\mathcal{O}(1)\left(\frac{1}{t^{2}}+\frac{1}{t(1-\beta)^{2}}\right).

In light of the lower bound in [14], it is not optimal in terms of the dependence on $\beta$. In particular, the dominant error term $\mathcal{O}(1/(t(1-\beta)^{2}))$ becomes larger as $\beta$ grows, i.e., as the network becomes more sparsely connected. This is mainly because we consider a one-consensus-one-gradient update in the algorithm. However, extending the algorithm in [14] to handle constraints may require further investigation. In the simulation section, we demonstrate the superiority of ADDA over existing decentralized constrained optimization algorithms.

IV Simulation

In this section, we verify the proposed methods by applying them to solve the following constrained LASSO problems:

\min_{x\in\mathbb{R}^{m}}\left\{f(x)=\frac{1}{2n}\sum_{i=1}^{n}\lVert M_{i}x-c_{i}\rVert^{2}\right\},\quad\mathrm{s.t.}\;\;\lVert x\rVert_{1}\leq R

where $M_{i}\in\mathbb{R}^{p_{i}\times m}$, $c_{i}\in\mathbb{R}^{p_{i}}$, and $R$ is a constant parameter that defines the constraint. In the simulation, each agent $i$ has access to the local data tuple $(M_{i},c_{i})$ and $R$. Two different problem instances characterized by real and synthetic datasets are considered.

IV-A Case I: Real Dataset

In this setting, we use sparco7 [40, 17] to define the LASSO problem, and consider a cycle graph and a complete graph of $n=50$ nodes. The corresponding weight matrix $P$ is determined by following the Metropolis-Hastings rule [36]. Each local measurement matrix $M_{i}\in\mathbb{R}^{12\times 2560}$, and the local corrupted measurement $c_{i}\in\mathbb{R}^{12}$. The constraint parameter is set as $R=1.1\cdot\lVert x_{g}\rVert_{1}$, where $x_{g}$ with $\lVert x_{g}\rVert_{0}=20$ denotes the unknown variable to be recovered via solving the LASSO problem. In this case, the simulation experiments were performed using MATLAB R2020b.

For comparison, the PG-EXTRA method in [21] and the APM method in [22] are simulated. For their algorithmic parameters, the step size for PG-EXTRA is set as $10^{-4}$, and the parameters for APM are set as $L=250$ and $\beta_{0}=L/\sqrt{1-\lambda_{2}(P)}$. For DDA and ADDA in this work, we use $a=5\cdot 10^{-4}$ and $a_{t}=(t+1)\cdot 10^{-4}$, respectively, and $\lVert x\rVert^{2}/2$ as the prox-function. The projection onto an $l_{1}$-norm ball is carried out via the algorithm in [41]. All the methods are initialized with $x_{i}^{(0)}=0,\forall i\in\mathcal{V}$.

The performance of the four algorithms is displayed in Figs. 1 and 2. In Fig. 1, the performance is evaluated in terms of the objective error $f(\frac{1}{n}\sum_{i=1}^{n}x_{i}^{(t)})-f(x^{*})$, where $x^{*}$ is identified using CVX [42]. It demonstrates that the DDA method outperforms the other methods when the graph is a cycle. As the graph becomes denser, i.e., complete, the convergence of all algorithms becomes faster. Among them, the ADDA method demonstrates the most significant improvement. This is in line with Theorem 2, where the network connectivity impacts the convergence error in the $\mathcal{O}(1/t)$ term as opposed to the $\mathcal{O}(1/t^{2})$ term. In Fig. 2, we compare the trajectories of the consensus error, i.e., $\sqrt{\sum_{i=1}^{n}\lVert x_{i}^{(t)}-n^{-1}\sum_{j=1}^{n}x_{j}^{(t)}\rVert^{2}}$, for all methods. When the graph is a cycle, APM and PG-EXTRA have smaller consensus errors than the developed methods, mainly because they build consensus directly among the variables $\{x_{i}^{(t)}:i=1,\cdots,n\}_{t\geq 0}$. When the graph is complete, the consensus error of the proposed DDA method vanishes because of the conservation property established in Lemma 3 and the complete graph structure.

IV-B Case II: Synthetic Dataset

For the synthetic dataset, the parameters are set as $n=8$, $m=30000$, $p_{i}=2000,\forall i\in\mathcal{V}$, and the data is generated in the following way. First, each local measurement matrix $M_{i}$ is randomly generated, with each entry following the normal distribution $\mathcal{N}(0,1)$. Next, each entry of the sparse vector $x_{g}$ to be recovered via LASSO is randomly generated from the normal distribution $\mathcal{N}(0,1)$, with $\lVert x_{g}\rVert_{0}=1500$. Then the corrupted measurement $c_{i}$ is produced based on

c_{i}=M_{i}x_{g}+b_{i}

where $b_{i}$ represents Gaussian noise with zero mean and variance $0.01$. The constraint parameter is set as $R=1.1\cdot\lVert x_{g}\rVert_{1}$. For this setting, we employed the message passing interface (MPI) in Python 3.7.3 to simulate a network of $8$ nodes, where each node $i$ is connected to the subset of nodes $\{1+i\,\mathrm{mod}\,8,1+(i+3)\,\mathrm{mod}\,8,1+(i+6)\,\mathrm{mod}\,8\}$. For comparison, the proposed methods are compared to their centralized counterparts. The parameters for dual averaging and accelerated dual averaging are set as $a=1/(3\cdot 10^{5})$ and $a_{t}=a(t+1)$, respectively. Similarly, the function $\lVert x\rVert^{2}/2$ is used as the prox-function, and the algorithms are initialized with $x_{i}^{(0)}=0,\forall i\in\mathcal{V}$.
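The data-generation procedure just described can be reproduced with a few lines of numpy; the snippet below is a rough, single-process sketch (the actual experiments used MPI across 8 nodes), with the random seed chosen arbitrarily.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, p_i, nnz = 8, 30000, 2000, 1500

x_g = np.zeros(m)
support = rng.choice(m, size=nnz, replace=False)
x_g[support] = rng.standard_normal(nnz)                # sparse ground truth, ||x_g||_0 = 1500

M = [rng.standard_normal((p_i, m)) for _ in range(n)]  # local measurement matrices
c = [M[i] @ x_g + 0.1 * rng.standard_normal(p_i) for i in range(n)]  # noise variance 0.01
R = 1.1 * np.linalg.norm(x_g, 1)                       # radius of the l1-norm constraint
```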

The performance of the developed algorithms and their centralized counterparts is illustrated in Fig. 3. In particular, the performance is evaluated in terms of the objective function value versus computing time. It demonstrates that the proposed methods outperform the corresponding centralized algorithms in the sense that the decentralized algorithms consume less computing time than their centralized counterparts to reach the same degree of accuracy.

Figure 1: Comparison of objective error in Case I.
Figure 2: Comparison of consensus error in Case I.
Figure 3: Comparison of objective value in Case II.

V Conclusion

In this work, we have designed two DDA algorithms for solving decentralized constrained optimization problems with improved rates of convergence. In the first one, a novel second-order dynamic average consensus scheme is developed, with which each agent locally generates a more accurate estimate of the global dual variable than in existing methods under mild assumptions. This property enables each agent to use a large constant weight in the local dual averaging step, and therefore improves the rate of convergence. In the second algorithm, each agent retains the conventional first-order dynamic average consensus method to estimate the average of local gradients. Instead, the extrapolation technique together with the average consensus protocol is used to achieve acceleration over a decentralized network.

This work opens several avenues for future research. In this work, we focus on the basic setting with time-invariant bidirectional communication networks. We believe that the consensus-based dual averaging framework can be extended to tackle decentralized constrained optimization in complex networks, e.g., directed networks [43] and time-varying networks [38]. Furthermore, we expect that our approach, as demonstrated by its centralized counterpart, i.e., follow-the-regularized-leader, may deliver superb performance in the online optimization setting [44].

Appendix A Roadmap for the proofs

Figure 4: Relation among the convergence results.

Before proceeding to the proofs, we present Fig. 4 to illustrate how they relate to each other.

Appendix B Proof of Theorem 1

B-A Preliminaries

We introduce several notations to facilitate the presentation of the proof. Let

{\bf x}^{(t)}=\begin{bmatrix}x_{1}^{(t)}\\ \vdots\\ x_{n}^{(t)}\end{bmatrix},\quad{\bf z}^{(t)}=\begin{bmatrix}z_{1}^{(t)}\\ \vdots\\ z_{n}^{(t)}\end{bmatrix},\quad{\bf s}^{(t)}=\begin{bmatrix}s_{1}^{(t)}\\ \vdots\\ s_{n}^{(t)}\end{bmatrix},
{\bf y}^{(t)}=\begin{bmatrix}y^{(t)}\\ \vdots\\ y^{(t)}\end{bmatrix},\quad{\bf\nabla}^{(t)}=\begin{bmatrix}\nabla f_{1}(x_{1}^{(t)})\\ \vdots\\ \nabla f_{n}(x_{n}^{(t)})\end{bmatrix},
\overline{g}^{(t)}=\frac{1}{n}\sum_{i=1}^{n}\nabla f_{i}(x_{i}^{(t)}),\quad\overline{s}^{(t)}=\frac{1}{n}\sum_{i=1}^{n}s_{i}^{(t)},\quad\overline{z}^{(t)}=\frac{1}{n}\sum_{i=1}^{n}z_{i}^{(t)}.

Using these notations, we express (11) in the following compact form:

{\bf z}^{(t)}={\bf P}\left({\bf z}^{(t-1)}+{\bf s}^{(t-1)}\right), (34a)
{\bf s}^{(t)}={\bf P}{\bf s}^{(t-1)}+\nabla^{(t)}-\nabla^{(t-1)}, (34b)

where ${\bf P}=P\otimes I$. Before proceeding to the proof of Theorem 1, we present several technical lemmas.

Recall a lemma from [39].

Lemma 2.

Suppose that $\{\varepsilon^{(t)}\}_{t\geq 0}$ and $\{\epsilon^{(t)}\}_{t\geq 0}$ are two sequences of positive scalars such that for all $t\geq 0$,

\varepsilon^{(t)}\leq\delta^{t}c+\sum_{\tau=0}^{t-1}\delta^{t-\tau-1}\epsilon^{(\tau)}

where $\delta\in(0,1)$ and $c\geq 0$ is a constant. Then, the following holds for all $t\geq 1$:

\sum_{\tau=1}^{t}(\varepsilon^{(\tau)})^{2}\leq\frac{2}{(1-\delta)^{2}}\sum_{\tau=0}^{t-1}(\epsilon^{(\tau)})^{2}+\frac{2c^{2}}{1-\delta^{2}}.
Lemma 3.

For Algorithm 1, we have that for any $t\geq 0$,

\overline{s}^{(t)}=\overline{g}^{(t)},\quad\overline{z}^{(t+1)}=\sum_{\tau=0}^{t}\overline{s}^{(\tau)}. (35)
Proof of Lemma 3.

We prove by induction. For $t=0$, (35) is readily satisfied since $s_{i}^{(0)}=\nabla f_{i}(x^{(0)})$ and $z_{i}^{(0)}=0$ for all $i$. Suppose that (35) holds for $t-1$. Using (34b),

(A\otimes B)(C\otimes D)=(AC\otimes BD), (36)

and the double stochasticity of $P$, we have

\begin{split}\overline{s}^{(t)}&=\frac{1}{n}({\bf 1}^{\mathrm{T}}\otimes I)\left({\bf P}{\bf s}^{(t-1)}+\nabla^{(t)}-\nabla^{(t-1)}\right)\\ &=\frac{1}{n}({\bf 1}^{\mathrm{T}}\otimes I)\left(({P}\otimes I){\bf s}^{(t-1)}+\nabla^{(t)}-\nabla^{(t-1)}\right)\\ &=\frac{1}{n}\left(({\bf 1}^{\mathrm{T}}P)\otimes I\right){\bf s}^{(t-1)}+\overline{g}^{(t)}-\overline{g}^{(t-1)}\\ &=\overline{s}^{(t-1)}+\overline{g}^{(t)}-\overline{g}^{(t-1)}=\overline{g}^{(t)}.\end{split}

Upon using a similar argument for $\overline{z}^{(t)}$, we have

\begin{split}\overline{z}^{(t+1)}&=\frac{1}{n}({\bf 1}^{\mathrm{T}}\otimes I){(P\otimes I)}\left({\bf z}^{(t)}+{\bf s}^{(t)}\right)\\ &=\overline{z}^{(t)}+\overline{s}^{(t)}=\sum_{\tau=0}^{t}\overline{s}^{(\tau)}.\end{split}

Therefore, (35) holds for all $t$. ∎

Lemma 4.

If the parameter $a$ satisfies (18), then $\rho({\bf M})<1$, where ${\bf M}$ is defined in (17).

Proof of Lemma 4.

Recall the two real eigenvalues of ${\bf M}$ in (23). Since $\xi_{1}>0$ and $\xi_{2}>0$, we have $\rho({\bf M})=\lvert\lambda_{1}\rvert>\lvert\lambda_{2}\rvert$. Also, one can verify that $\rho({\bf M})$ monotonically increases with $a$. The solution to the equation $\lvert\lambda_{1}\rvert=1$ can be identified as $a=(1-\beta)^{2}/(2\beta L)$. Therefore, we conclude that $\rho({\bf M})<1$ as long as (18) is satisfied. ∎

The following lemma establishes the relation between the sequences $\{x_{i}^{(t)}\}_{t\geq 0}$ and $\{{y}^{(t)}\}_{t\geq 0}$.

Lemma 5.

Suppose that Assumptions 1 and 2 hold and the parameter $a$ in Algorithm 1 satisfies (18). Then, for all $t\geq 0$, it holds that

\begin{split}&\sum_{\tau=1}^{t}\lVert{\bf x}^{(\tau)}-\mathbf{y}^{(\tau)}\rVert^{2}\\ &\leq\frac{8}{9\left(1-\rho({\bf M})\right)^{2}}\sum_{\tau=0}^{t-1}\lVert{\bf y}^{(\tau+1)}-{\bf y}^{(\tau)}\rVert^{2}+\frac{8\pi^{2}}{9L^{2}\left(1-(\rho({\bf M}))^{2}\right)}\end{split} (37)

where

\pi^{2}=\sum_{i=1}^{n}\left\|\nabla f_{i}(x^{(0)})-\overline{g}^{(0)}\right\|^{2}.
Proof of Lemma 5.

From Lemma 3, we have

\overline{z}^{(\tau)}=\overline{z}^{(\tau-1)}+\overline{g}^{(\tau-1)}.

Define

\tilde{{\bf s}}^{(t)}={\bf s}^{(t)}-{\bf 1}\otimes\overline{s}^{(t)},\quad\tilde{{\bf z}}^{(t)}={\bf z}^{(t)}-{\bf 1}\otimes\overline{z}^{(t)}. (38)

These in conjunction with (34a) lead to

\tilde{\bf z}^{(\tau)}=\mathbf{P}{\bf z}^{(\tau-1)}-\mathbf{1}\otimes\overline{z}^{(\tau-1)}+\mathbf{P}{\bf s}^{(\tau-1)}-\mathbf{1}\otimes\overline{s}^{(\tau-1)}. (39)

Because of $\mathbf{1}\otimes\bar{z}^{(\tau-1)}=(\mathbf{1}\otimes I)\bar{z}^{(\tau-1)}$ and (36), we have

\mathbf{1}\otimes\bar{z}^{(\tau-1)}=\frac{1}{n}(\mathbf{1}\otimes I)(\mathbf{1}^{\mathrm{T}}\otimes I)\mathbf{z}^{(\tau-1)}=\left(\frac{\mathbf{1}\mathbf{1}^{\mathrm{T}}}{n}\otimes I\right)\mathbf{z}^{(\tau-1)}.

In addition, we have

\begin{split}&\mathbf{P}{\bf z}^{(\tau-1)}-\mathbf{1}\otimes\overline{z}^{(\tau-1)}=(P\otimes I)\mathbf{z}^{(\tau-1)}-\left(\frac{\mathbf{1}\mathbf{1}^{\mathrm{T}}}{n}\otimes I\right)\mathbf{z}^{(\tau-1)}\\ &=\left(\left(P-\frac{\mathbf{1}\mathbf{1}^{\mathrm{T}}}{n}\right)\otimes I\right){\bf z}^{(\tau-1)}\\ &=\left(\left(P-\frac{\mathbf{1}\mathbf{1}^{\mathrm{T}}}{n}\right)\otimes I\right)\left(\tilde{\bf z}^{(\tau-1)}+(\mathbf{1}\otimes I)\bar{z}^{(\tau-1)}\right)\\ &=\left(\left(P-\frac{\mathbf{1}\mathbf{1}^{\mathrm{T}}}{n}\right)\otimes I\right)\tilde{\bf z}^{(\tau-1)}+\left(\left(P\mathbf{1}-\mathbf{1}\right)\otimes I\right)\bar{z}^{(\tau-1)}\\ &=\left(\left(P-\frac{\mathbf{1}\mathbf{1}^{\mathrm{T}}}{n}\right)\otimes I\right)\tilde{\bf z}^{(\tau-1)},\end{split} (40)

where the last equality is due to the double stochasticity of $P$. Using the same arguments as above for $\mathbf{P}{\bf s}^{(\tau-1)}-\mathbf{1}\otimes\overline{s}^{(\tau-1)}$ and Assumption 2, we have

\begin{split}\lVert\tilde{\bf z}^{(\tau)}\rVert&\leq\left\lVert\mathbf{P}{\bf z}^{(\tau-1)}-\mathbf{1}\otimes\overline{z}^{(\tau-1)}\right\rVert+\left\lVert\mathbf{P}{\bf s}^{(\tau-1)}-\mathbf{1}\otimes\overline{s}^{(\tau-1)}\right\rVert\\ &\leq\beta\lVert\tilde{\bf z}^{(\tau-1)}\rVert+\beta\lVert\tilde{\bf s}^{(\tau-1)}\rVert.\end{split} (41)

Similarly, from Lemma 3, we obtain

\begin{split}\lVert\tilde{\bf s}^{(\tau)}\rVert&=\left\lVert\mathbf{P}{{\bf s}}^{(\tau-1)}-\left(\mathbf{1}\otimes I\right)\overline{s}^{(\tau-1)}+\nabla^{(\tau)}-\nabla^{(\tau-1)}\right\rVert\\ &\leq\beta\lVert\tilde{\bf s}^{(\tau-1)}\rVert+\left\|\nabla^{(\tau)}-\nabla^{(\tau-1)}\right\|\\ &\leq\beta\lVert\tilde{\bf s}^{(\tau-1)}\rVert+L\lVert{\bf x}^{(\tau)}-{{\bf x}}^{(\tau-1)}\rVert\end{split} (42)

where the last inequality is due to Assumption 1. Using

\begin{split}\lVert{\bf x}^{(\tau)}-{{\bf x}}^{(\tau-1)}\rVert\leq&\lVert{\bf x}^{(\tau)}-{{\bf y}}^{(\tau)}\rVert+\lVert{\bf x}^{(\tau-1)}-{{\bf y}}^{(\tau-1)}\rVert\\ &+\lVert{\bf y}^{(\tau)}-{{\bf y}}^{(\tau-1)}\rVert,\end{split}

and Lemma 1, we obtain

\begin{split}&\lVert{\bf x}^{(\tau)}-{{\bf x}}^{(\tau-1)}\rVert\leq a\lVert\tilde{\mathbf{z}}^{(\tau)}\rVert+a\lVert\tilde{\mathbf{z}}^{(\tau-1)}\rVert+\lVert{\bf y}^{(\tau)}-{{\bf y}}^{(\tau-1)}\rVert\\ &\leq a(\beta+1)\lVert\tilde{\mathbf{z}}^{(\tau-1)}\rVert+a\beta\lVert\tilde{\mathbf{s}}^{(\tau-1)}\rVert+\lVert{\bf y}^{(\tau)}-{{\bf y}}^{(\tau-1)}\rVert.\end{split}

Upon substituting the above inequality into (42), we obtain

\begin{split}\lVert\tilde{\bf s}^{(\tau)}\rVert\leq&\beta(aL+1)\lVert\tilde{\bf s}^{(\tau-1)}\rVert+aL(\beta+1)\lVert\tilde{\bf z}^{(\tau-1)}\rVert\\ &+L\lVert{\bf y}^{(\tau)}-{{\bf y}}^{(\tau-1)}\rVert.\end{split} (43)

By combining (41) and (43), we establish the following inequality:

\begin{bmatrix}\lVert\tilde{\bf z}^{(\tau)}\rVert\\ \lVert\tilde{\bf s}^{(\tau)}\rVert\end{bmatrix}\leq\mathbf{M}\begin{bmatrix}\lVert\tilde{\bf z}^{(\tau-1)}\rVert\\ \lVert\tilde{\bf s}^{(\tau-1)}\rVert\end{bmatrix}+L\begin{bmatrix}0\\ \lVert{\bf y}^{(\tau)}-{{\bf y}}^{(\tau-1)}\rVert\end{bmatrix}

where $\mathbf{M}$ is defined in (17). By iterating the above inequality and using

\lVert\tilde{\bf z}^{(0)}\rVert=0,\quad\lVert\tilde{\bf s}^{(0)}\rVert=\sqrt{\sum_{i=1}^{n}\left\|\nabla f_{i}(x^{(0)})-\overline{g}^{(0)}\right\|^{2}}:=\pi,

we obtain

\begin{bmatrix}\lVert\tilde{\bf z}^{(t)}\rVert\\ \lVert\tilde{\bf s}^{(t)}\rVert\end{bmatrix}\leq L\sum_{\tau=0}^{t-1}\mathbf{M}^{t-\tau-1}\begin{bmatrix}0\\ \lVert{\bf y}^{(\tau+1)}-{{\bf y}}^{(\tau)}\rVert\end{bmatrix}+\mathbf{M}^{t}\begin{bmatrix}0\\ \pi\end{bmatrix}.

Recall the eigenvalues of the matrix $\mathbf{M}$ in (23). Then an analytical form can be presented for the $n$th power of $\mathbf{M}$ (see, e.g., [45]):

\mathbf{M}^{n}=\lambda_{1}^{n}\left(\frac{\mathbf{M}-\lambda_{2}I}{\lambda_{1}-\lambda_{2}}\right)-\lambda_{2}^{n}\left(\frac{\mathbf{M}-\lambda_{1}I}{\lambda_{1}-\lambda_{2}}\right).

Therefore, the $(1,2)$-th entry of $\mathbf{M}^{n}$ can be written as

(\mathbf{M}^{n})_{12}=\frac{\mathbf{M}_{12}(\lambda_{1}^{n}-\lambda_{2}^{n})}{\lambda_{1}-\lambda_{2}}=\frac{\beta(\lambda_{1}^{n}-\lambda_{2}^{n})}{\lambda_{1}-\lambda_{2}}\leq\frac{2\beta(\rho(\mathbf{M}))^{n}}{\xi_{2}}.

Due to the assumption that ${1}/{a}>{2\beta L}/{(1-\beta)^{2}}$ and $\beta\in(0,1)$, we have

\xi_{2}=\beta aL\sqrt{1+\frac{4(\beta+1)}{\beta aL}}>\beta aL\sqrt{1+\frac{8(\beta+1)}{(1-\beta)^{2}}}>3\beta aL.

Therefore,

\begin{split}\lVert\tilde{\bf z}^{(t)}\rVert&\leq\frac{2\beta{L}}{\xi_{2}}\sum_{\tau=0}^{t-1}\left(\rho(\mathbf{M})\right)^{t-\tau-1}\lVert{\bf y}^{(\tau+1)}-{\bf y}^{(\tau)}\rVert+\frac{2\beta}{\xi_{2}}\left(\rho(\mathbf{M})\right)^{t}\pi\\ &\leq\frac{2}{3a}\sum_{\tau=0}^{t-1}\left(\rho(\mathbf{M})\right)^{t-\tau-1}\lVert{\bf y}^{(\tau+1)}-{\bf y}^{(\tau)}\rVert+\frac{2}{3aL}\left(\rho(\mathbf{M})\right)^{t}\pi.\end{split} (44)

Using Lemma 1, we further obtain

\begin{split}\lVert{{\bf x}}^{(t)}-{{\bf y}}^{(t)}\rVert&\leq a\lVert\tilde{\bf z}^{(t)}\rVert\\ &\leq\frac{2}{3}\sum_{\tau=0}^{t-1}\left(\rho(\mathbf{M})\right)^{t-\tau-1}\lVert{\bf y}^{(\tau+1)}-{\bf y}^{(\tau)}\rVert+\frac{2}{3L}\left(\rho(\mathbf{M})\right)^{t}\pi.\end{split} (45)

Further using the above relation and Lemma 2, the inequality (37) follows as desired. ∎

Then, a lemma is stated for the prox-mapping.

Lemma 6.

Given a sequence of variables $\{\zeta^{(t)}\}_{t\geq 0}$ and a positive sequence $\{a_{t}\}_{t\geq 0}$, for $\{{\nu}^{(t)}\}_{t\geq 0}$ generated by

{\nu}^{(t)}=\nabla d^{*}\left(-\sum_{\tau=1}^{t}a_{\tau}\zeta^{(\tau)}\right),

where $\nu^{(0)}=x^{(0)}$ in (5), it holds that

\sum_{\tau=1}^{t}a_{\tau}\left\langle\zeta^{(\tau)},{\nu}^{(\tau)}-x^{*}\right\rangle\leq d(x^{*})-\sum_{\tau=1}^{t}\frac{1}{2}\lVert{\nu}^{(\tau)}-{\nu}^{(\tau-1)}\rVert^{2}. (46)
Proof of Lemma 6.

Define

m_{t}({x})=\sum_{\tau=1}^{t}a_{\tau}\left\langle\zeta^{(\tau)},x\right\rangle+d({x})

where $m_{0}(x)=d(x)$. Since

\nu^{(\tau-1)}=\operatorname*{argmin}_{x\in\mathcal{X}}m_{\tau-1}(x)

and $m_{\tau-1}(x)$ is strongly convex with modulus $1$, we have

m_{\tau-1}(x)-m_{\tau-1}({\nu}^{(\tau-1)})\geq\frac{1}{2}\lVert x-\nu^{(\tau-1)}\rVert^{2},\quad\forall x\in\mathcal{X}.

Upon taking $x=\nu^{(\tau)}$ in the above inequality and using

m_{\tau}({x})=m_{\tau-1}({x})+a_{\tau}\left\langle\zeta^{(\tau)},x\right\rangle,

we obtain

\begin{split}0\leq&m_{\tau-1}({\nu}^{(\tau)})-m_{\tau-1}({\nu}^{(\tau-1)})-\frac{1}{2}\lVert{\nu}^{(\tau)}-{\nu}^{(\tau-1)}\rVert^{2}\\ =&m_{\tau}({\nu}^{(\tau)})-a_{\tau}\left\langle\zeta^{(\tau)},{\nu}^{(\tau)}\right\rangle-m_{\tau-1}({\nu}^{(\tau-1)})-\frac{1}{2}\lVert\nu^{(\tau)}-{\nu}^{(\tau-1)}\rVert^{2},\end{split}

which is equivalent to

a_{\tau}\left\langle\zeta^{(\tau)},{\nu}^{(\tau)}\right\rangle\leq m_{\tau}({\nu}^{(\tau)})-m_{\tau-1}({\nu}^{(\tau-1)})-\frac{1}{2}\lVert{\nu}^{(\tau)}-{\nu}^{(\tau-1)}\rVert^{2}.

Iterating the above inequality from $\tau=1$ to $\tau=t$ yields

\begin{split}\sum_{\tau=1}^{t}\left\langle a_{\tau}\zeta^{(\tau)},{\nu}^{(\tau)}\right\rangle\leq&m_{t}({\nu}^{(t)})-m_{0}({\nu}^{(0)})-\sum_{\tau=1}^{t}\frac{1}{2}\lVert{\nu}^{(\tau)}-{\nu}^{(\tau-1)}\rVert^{2}\\ =&m_{t}({\nu}^{(t)})-\sum_{\tau=1}^{t}\frac{1}{2}\lVert{\nu}^{(\tau)}-\nu^{(\tau-1)}\rVert^{2}.\end{split} (47)

We turn to consider

\begin{split}\sum_{\tau=1}^{t}a_{\tau}\left\langle\zeta^{(\tau)},-x^{*}\right\rangle&\leq\max_{{x}\in\mathcal{X}}\left\{\sum_{\tau=1}^{t}a_{\tau}\left\langle\zeta^{(\tau)},-x\right\rangle-d(x)\right\}+d(x^{*})\\ &=-\min_{{x}\in\mathcal{X}}\left\{\sum_{\tau=1}^{t}a_{\tau}\left\langle\zeta^{(\tau)},x\right\rangle+d(x)\right\}+d(x^{*})\\ &=-m_{t}({\nu}^{(t)})+d(x^{*}),\end{split}

which together with (47) leads to the inequality in (46), thereby concluding the proof. ∎

B-B Proof of Theorem 1

Proof of Theorem 1.

For all τ1\tau\geq 1, we have

1ni=1na(fi(y(τ))fi(x))\displaystyle\frac{1}{n}\sum_{i=1}^{n}a\left(f_{i}({y}^{(\tau)})-f_{i}(x^{*})\right)
1ni=1na(fi(xi(τ1))fi(x)+L2y(τ)xi(τ1)2\displaystyle\leq\frac{1}{n}\sum_{i=1}^{n}a\Big{(}f_{i}(x_{i}^{(\tau-1)})-f_{i}(x^{*})+\frac{L}{2}\lVert{y}^{(\tau)}-x_{i}^{(\tau-1)}\rVert^{2}
+fi(xi(τ1)),y(τ)xi(τ1))\displaystyle\qquad\qquad\quad+\left\langle\nabla f_{i}(x_{i}^{(\tau-1)}),{y}^{(\tau)}-x_{i}^{(\tau-1)}\right\rangle\Big{)}
1ni=1na(L2y(τ)xi(τ1)2+fi(xi(τ1)),y(τ)x)\displaystyle\leq\frac{1}{n}\sum_{i=1}^{n}a\Big{(}\frac{L}{2}\lVert{y}^{(\tau)}-x_{i}^{(\tau-1)}\rVert^{2}+\left\langle\nabla f_{i}(x_{i}^{(\tau-1)}),{y}^{(\tau)}-x^{*}\right\rangle\Big{)}
=1ni=1na(L2y(τ)xi(τ1)2)+ag¯(τ1),y(τ)x,\displaystyle=\frac{1}{n}\sum_{i=1}^{n}a\Big{(}\frac{L}{2}\lVert{y}^{(\tau)}-x_{i}^{(\tau-1)}\rVert^{2}\Big{)}+a\left\langle\overline{g}^{(\tau-1)},{y}^{(\tau)}-x^{*}\right\rangle, (48)

where the two inequalities are due to Assumption 1. By iterating (48) from τ=1\tau=1 to τ=t\tau=t and using Lemma 6 (aτ=a,ζ(τ)=g¯(τ1)a_{\tau}=a,\zeta^{(\tau)}=\overline{g}^{(\tau-1)}, and ν(τ)=y(τ)\nu^{(\tau)}=y^{(\tau)}), we obtain

τ=1ta(f(y(τ))f(x))\displaystyle\sum_{\tau=1}^{t}a\left(f(y^{(\tau)})-f(x^{*})\right)
1nτ=1ti=1na(L2y(τ)xi(τ1)2)\displaystyle\leq\frac{1}{n}\sum_{\tau=1}^{t}\sum_{i=1}^{n}a\Big{(}\frac{L}{2}\lVert{y}^{(\tau)}-x_{i}^{(\tau-1)}\rVert^{2}\Big{)}
\displaystyle\quad-\frac{1}{2}\sum_{\tau=1}^{t}\lVert y^{(\tau)}-y^{(\tau-1)}\rVert^{2}+d(x^{*})
\displaystyle=\frac{aL}{2n}\sum_{\tau=1}^{t}\lVert{\bf y}^{(\tau)}-{\bf x}^{(\tau-1)}\rVert^{2}-\frac{1}{2}\sum_{\tau=1}^{t}\lVert y^{(\tau)}-y^{(\tau-1)}\rVert^{2}+d(x^{*}).

Then we use the inequality

𝐲(τ)𝐱(τ1)22𝐲(τ)𝐲(τ1)2+2𝐲(τ1)𝐱(τ1)2\|\mathbf{y}^{(\tau)}-\mathbf{x}^{(\tau-1)}\|^{2}\leq 2\|\mathbf{y}^{(\tau)}-\mathbf{y}^{(\tau-1)}\|^{2}+2\|\mathbf{y}^{(\tau-1)}-\mathbf{x}^{(\tau-1)}\|^{2}

and the convexity of ff to get

at(f(y~(t))f(x))\displaystyle at\left(f(\tilde{y}^{(t)})-f(x^{*})\right)
1nτ=1t(aL12)𝐲(τ)𝐲(τ1)2\displaystyle\leq\frac{1}{n}\sum_{\tau=1}^{t}\left(aL-\frac{1}{2}\right)\|\mathbf{y}^{(\tau)}-\mathbf{y}^{(\tau-1)}\|^{2}
+aLnτ=1t𝐱(τ1)𝐲(τ1)2+d(x).\displaystyle\quad+\frac{aL}{n}\sum_{\tau=1}^{t}\|\mathbf{x}^{(\tau-1)}-\mathbf{y}^{(\tau-1)}\|^{2}+d(x^{*}).

Upon using Lemma 5, one has

at(f(y~(t))f(x))+γnτ=1t𝐲(τ)𝐲(τ1)2\displaystyle at\left(f(\tilde{y}^{(t)})-f(x^{*})\right)+\frac{\gamma}{n}\sum_{\tau=1}^{t}\|\mathbf{y}^{(\tau)}-\mathbf{y}^{(\tau-1)}\|^{2}
d(x)+8aπ29nL(1(ρ(𝐌))2)=C,\displaystyle\leq d(x^{*})+\frac{8a\pi^{2}}{9nL\left(1-(\rho({\bf M}))^{2}\right)}=C, (49)

where γ>0\gamma>0 is defined in (22). Therefore we have (19) as desired.

From (49) and f(y~(t))f(x)0f(\tilde{y}^{(t)})-f(x^{*})\geq 0, we have

τ=1t𝐲(τ)𝐲(τ1)2nCγ.\sum_{\tau=1}^{t}\|\mathbf{y}^{(\tau)}-\mathbf{y}^{(\tau-1)}\|^{2}\leq\frac{nC}{\gamma}.

By using the above inequality, the convexity of 2\lVert\cdot\rVert^{2}, Jensen’s Inequality, and Lemma 5, we arrive at

t𝐱~(t)𝐲~(t)2τ=1t𝐱(τ)𝐲(τ)2\displaystyle t\lVert{\tilde{\bf x}}^{(t)}-\tilde{\bf y}^{(t)}\rVert^{2}\leq\sum_{\tau=1}^{t}\lVert{{\bf x}}^{(\tau)}-{{\bf y}}^{(\tau)}\rVert^{2}
89(1ρ(𝐌))2τ=0t1𝐲(τ+1)𝐲(τ)2+8π29L2(1(ρ(𝐌))2)\displaystyle\leq\frac{8}{9\left(1-\rho({\bf M})\right)^{2}}\sum_{\tau=0}^{t-1}\lVert{\bf y}^{(\tau+1)}-{\bf y}^{(\tau)}\rVert^{2}+\frac{8\pi^{2}}{9L^{2}\left(1-(\rho({\bf M}))^{2}\right)}
8nC9γ(1ρ(𝐌))2+8π29L2(1(ρ(𝐌))2)=D,\displaystyle\leq\frac{8nC}{9\gamma\left(1-\rho({\bf M})\right)^{2}}+\frac{8\pi^{2}}{9L^{2}\left(1-(\rho({\bf M}))^{2}\right)}=D,

which implies (21) as desired. ∎

Appendix C Proof of Corollary 1

Proof of Corollary 1.

We consider

\begin{split}&f(x_{i}^{(\tau)})-f(y^{(\tau)})\\ &\leq\frac{1}{n}\sum_{j=1}^{n}\left(f_{j}(x_{i}^{(\tau)})-\langle\nabla f_{j}(x_{j}^{(\tau)}),y^{(\tau)}-x_{j}^{(\tau)}\rangle-f_{j}(x_{j}^{(\tau)})\right)\\ &\leq\frac{1}{n}\sum_{j=1}^{n}\Big{(}\langle\nabla f_{j}(x_{j}^{(\tau)}),x_{i}^{(\tau)}-x_{j}^{(\tau)}\rangle-\langle\nabla f_{j}(x_{j}^{(\tau)}),y^{(\tau)}-x_{j}^{(\tau)}\rangle\\ &\quad\quad\quad\quad+\frac{L}{2}\lVert x_{i}^{(\tau)}-y^{(\tau)}+y^{(\tau)}-x_{j}^{(\tau)}\rVert^{2}\Big{)}\\ &\leq\frac{1}{n}\sum_{j=1}^{n}\Big{(}\langle\nabla f_{j}(x_{j}^{(\tau)}),x_{i}^{(\tau)}-y^{(\tau)}\rangle+{L}\lVert x_{i}^{(\tau)}-y^{(\tau)}\rVert^{2}\\ &\quad\quad\quad\quad+{L}\lVert y^{(\tau)}-x_{j}^{(\tau)}\rVert^{2}\Big{)}\\ &=\left\langle\overline{g}^{(\tau)},x_{i}^{(\tau)}-y^{(\tau)}\right\rangle+{L}\lVert x_{i}^{(\tau)}-y^{(\tau)}\rVert^{2}+\frac{L}{n}\lVert\mathbf{y}^{(\tau)}-\mathbf{x}^{(\tau)}\rVert^{2},\end{split} (50)

where the first two inequalities follow from Assumption 1 and the third from $\lVert x+y\rVert^{2}\leq 2\lVert x\rVert^{2}+2\lVert y\rVert^{2}$. When $\mathcal{X}=\mathbb{R}^{m}$, the closed-form solutions for (12) and (16) can be identified as

xi(τ)=azi(τ),y(τ)=az¯(τ),x_{i}^{(\tau)}=-a{z_{i}^{(\tau)}},\quad y^{(\tau)}=-a{\overline{z}^{(\tau)}},

implying that $y^{(\tau)}=\frac{1}{n}\sum_{i=1}^{n}x_{i}^{(\tau)}$. Upon summing (50) from $i=1$ to $i=n$, we obtain

i=1n(f(xi(τ))f(y(τ)))2L𝐲(τ)𝐱(τ)2.\begin{split}\sum_{i=1}^{n}\left(f(x_{i}^{(\tau)})-f(y^{(\tau)})\right)\leq 2L\lVert\mathbf{y}^{(\tau)}-\mathbf{x}^{(\tau)}\rVert^{2}.\end{split} (51)
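As a numerical illustration of (51), the sketch below checks the bound when $y^{(\tau)}$ is the average of the $x_{i}^{(\tau)}$; the quadratic local objectives $f_{j}(x)=\frac{1}{2}\lVert x-c_{j}\rVert^{2}$ (so that $L=1$) and the random iterates are illustrative assumptions rather than the paper's problem data.

```python
import numpy as np

# Sanity check of (51): sum_i ( f(x_i) - f(y) ) <= 2 L ||bold-y - bold-x||^2 when y averages the x_i.
# Illustrative assumption: quadratic local objectives f_j(x) = 0.5*||x - c_j||^2, hence L = 1.
rng = np.random.default_rng(1)
n, m, L = 10, 4, 1.0
C = rng.normal(size=(n, m))                  # centers c_j defining the local objectives
X = rng.normal(size=(n, m))                  # local iterates x_i^{(tau)}
y = X.mean(axis=0)                           # y^{(tau)} equals the average of the x_i^{(tau)} (the X = R^m case)

f = lambda x: 0.5 * np.mean(np.sum((x - C) ** 2, axis=1))   # f(x) = (1/n) sum_j f_j(x)
lhs = sum(f(X[i]) - f(y) for i in range(n))
rhs = 2.0 * L * np.sum((X - y) ** 2)         # 2 L times the squared stacked consensus error
print(lhs <= rhs + 1e-12)                    # expected: True
```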

Then, we sum (51) from $\tau=1$ to $\tau=t$ and use the convexity of $f$ to get

\begin{split}&\sum_{i=1}^{n}\left(tf(\tilde{x}_{i}^{(t)})-\sum_{\tau=1}^{t}f(y^{(\tau)})\right)\leq\sum_{\tau=1}^{t}\sum_{i=1}^{n}\left(f(x_{i}^{(\tau)})-f(y^{(\tau)})\right)\\ &\leq 2L\sum_{\tau=1}^{t}\lVert\mathbf{y}^{(\tau)}-\mathbf{x}^{(\tau)}\rVert^{2},\end{split} (52)

where x~i(t)=t1τ=1txi(τ)\tilde{x}_{i}^{(t)}={t}^{-1}\sum_{\tau=1}^{t}x_{i}^{(\tau)}. Using Lemma 5, we obtain

\begin{split}&\sum_{i=1}^{n}\left(tf(\tilde{x}_{i}^{(t)})-\sum_{\tau=1}^{t}f(y^{(\tau)})\right)\\ &\leq\frac{16L}{9\left(1-\rho({\bf M})\right)^{2}}\sum_{\tau=0}^{t-1}\lVert{\bf y}^{(\tau+1)}-{\bf y}^{(\tau)}\rVert^{2}+\frac{16\pi^{2}}{9L\left(1-(\rho({\bf M}))^{2}\right)}.\end{split} (53)

Recalling the derivation of (49), in which the convexity of $f$ was invoked only in the final step, we also have

\displaystyle a\sum_{\tau=1}^{t}\left(f({y}^{(\tau)})-f(x^{*})\right)\leq d(x^{*})+\frac{8a\pi^{2}}{9nL\left(1-(\rho({\bf M}))^{2}\right)}
\displaystyle-\left(\frac{1}{2}-aL-\frac{8aL}{9\left(1-(\rho({\bf M}))^{2}\right)}\right)\sum_{\tau=1}^{t}\|{y}^{(\tau)}-{y}^{(\tau-1)}\|^{2}. (54)

By multiplying both sides of (54) by $n/a>0$ and adding the resulting inequality to (53), we get

t(f(x~i(t))f(x))ti=1n(f(x~i(t))f(x))(12aL8L3(1ρ(𝐌))2)τ=1t𝐲(τ)𝐲(τ1)2+n2ax2+8π23L(1(ρ(𝐌))2).\begin{split}&t\left(f(\tilde{x}_{i}^{(t)})-f(x^{*})\right)\leq t\sum_{i=1}^{n}\left(f(\tilde{x}_{i}^{(t)})-f(x^{*})\right)\\ \leq&-\Big{(}\frac{1}{2a}-{L}-\frac{8L}{3\left(1-\rho({\bf M})\right)^{2}}\Big{)}{\sum_{\tau=1}^{t}\lVert{\bf y}^{(\tau)}-{\bf y}^{(\tau-1)}\rVert^{2}}\\ &+\frac{n}{2a}\lVert x^{*}\rVert^{2}+\frac{8\pi^{2}}{3L\left(1-(\rho({\bf M}))^{2}\right)}.\end{split} (55)

Upon using the condition in (25), we arrive at (26) as desired. ∎

Appendix D Proof of Theorem 2

D-A Preliminaries

For Algorithm 2, we define

𝐮(t)=[u1(t)un(t)],𝐯(t)=[v1(t)vn(t)],𝐰(t)=[w1(t)wn(t)],{\bf u}^{(t)}=\begin{bmatrix}u_{1}^{(t)}\\ \vdots\\ u_{n}^{(t)}\end{bmatrix},\quad{\bf v}^{(t)}=\begin{bmatrix}v_{1}^{(t)}\\ \vdots\\ v_{n}^{(t)}\end{bmatrix},\quad{\bf w}^{(t)}=\begin{bmatrix}w_{1}^{(t)}\\ \vdots\\ w_{n}^{(t)}\end{bmatrix},
𝐪(t)=[q1(t)qn(t)],^(t)=[f1(u1(t))fn(un(t))],{\bf q}^{(t)}=\begin{bmatrix}q_{1}^{(t)}\\ \vdots\\ q_{n}^{(t)}\end{bmatrix},\,\,\hat{\bf\nabla}^{(t)}=\begin{bmatrix}{\nabla}f_{1}(u_{1}^{(t)})\\ \vdots\\ \nabla f_{n}(u_{n}^{(t)})\end{bmatrix},
u¯(t)=1ni=1nui(t),v¯(t)=1ni=1nvi(t),w¯(t)=1ni=1nwi(t),\overline{u}^{(t)}=\frac{1}{n}\sum_{i=1}^{n}u_{i}^{(t)},\,\overline{v}^{(t)}=\frac{1}{n}\sum_{i=1}^{n}v_{i}^{(t)},\,\overline{{w}}^{(t)}=\frac{1}{n}\sum_{i=1}^{n}{{w}}_{i}^{(t)},
\overline{\hat{g}}^{(t)}=\frac{1}{n}\sum_{i=1}^{n}\nabla f_{i}(u_{i}^{(t)}),\,\overline{{q}}^{(t)}=\frac{1}{n}\sum_{i=1}^{n}{{q}}_{i}^{(t)},\,\tilde{{\bf w}}^{(t)}={\bf w}^{(t)}-{\bf 1}\otimes\overline{w}^{(t)},
𝐮~(t)=𝐮(t)𝟏u¯(t),𝐯~(t)=𝐯(t)𝟏v¯(t),𝐪~(t)=𝐪(t)𝟏q¯(t).\tilde{{\bf u}}^{(t)}={\bf u}^{(t)}-{\bf 1}\otimes\overline{u}^{(t)},\,\tilde{{\bf v}}^{(t)}={\bf v}^{(t)}-{\bf 1}\otimes\overline{v}^{(t)},\,\tilde{{\bf q}}^{(t)}={\bf q}^{(t)}-{\bf 1}\otimes\overline{q}^{(t)}.

Based on this notation, the steps in (27) and (29) can be written in the following compact form:

𝐮(t)\displaystyle{\bf u}^{(t)} =At1At(𝐏𝐯(t1))+atAt𝐰(t1),\displaystyle=\frac{A_{t-1}}{A_{t}}\left({\bf P}{\bf v}^{(t-1)}\right)+\frac{a_{t}}{A_{t}}{\bf w}^{(t-1)}, (56a)
𝐯(t)\displaystyle{\bf v}^{(t)} =At1At(𝐏𝐯(t1))+atAt𝐰(t),\displaystyle=\frac{A_{t-1}}{A_{t}}\left({\bf P}{\bf v}^{(t-1)}\right)+\frac{a_{t}}{A_{t}}{\bf w}^{(t)}, (56b)
𝐪(t)\displaystyle{\bf q}^{(t)} =𝐏𝐪(t1)+^(t)^(t1),\displaystyle={\bf P}{\bf q}^{(t-1)}+\hat{\nabla}^{(t)}-\hat{\nabla}^{(t-1)}, (56c)

where 𝐏=PI{\bf P}=P\otimes I. According to (27), we have

u¯(t)\displaystyle\overline{u}^{(t)} =At1Atv¯(t1)+atAtw¯(t1),\displaystyle=\frac{A_{t-1}}{A_{t}}\overline{v}^{(t-1)}+\frac{a_{t}}{A_{t}}\overline{{w}}^{(t-1)}, (57a)
v¯(t)\displaystyle\overline{v}^{(t)} =At1Atv¯(t1)+atAtw¯(t).\displaystyle=\frac{A_{t-1}}{A_{t}}\overline{v}^{(t-1)}+\frac{a_{t}}{A_{t}}\overline{{w}}^{(t)}. (57b)

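To make the bookkeeping in (56) and (57) concrete, the following sketch runs the compact recursions on a toy network and checks that averaging (56a)–(56c) over the agents reproduces (57) and keeps the average of ${\bf q}^{(t)}$ equal to the average of $\hat{\nabla}^{(t)}$ (the property recorded as (61) below). The ring mixing matrix, the stand-in sequences for ${\bf w}^{(t)}$ and $\hat{\nabla}^{(t)}$, the initialization $q_{i}^{(0)}=\nabla f_{i}(u_{i}^{(0)})$, and the weights $a_{t}=t+1$ are illustrative assumptions only.

```python
import numpy as np

# Toy check of the compact recursions (56a)-(56c) and their averaged forms (57).
rng = np.random.default_rng(2)
n, m, T = 6, 3, 40
I_n = np.eye(n)
P = 0.5 * I_n + 0.25 * (np.roll(I_n, 1, axis=0) + np.roll(I_n, -1, axis=0))  # doubly stochastic ring
a = lambda t: t + 1.0                        # assumed weights a_t, with A_0 = 0
v = rng.normal(size=(n, m))                  # v^{(0)}
w_prev = rng.normal(size=(n, m))             # w^{(0)}
grad = rng.normal(size=(n, m))               # hat-nabla^{(0)}
q = grad.copy()                              # assumed initialization q_i^{(0)} = grad f_i(u_i^{(0)})
A = 0.0
for t in range(1, T + 1):
    A_prev, A = A, A + a(t)
    u = (A_prev / A) * (P @ v) + (a(t) / A) * w_prev            # (56a)
    w = rng.normal(size=(n, m))                                  # stand-in for w^{(t)} (its update is not modeled)
    v_new = (A_prev / A) * (P @ v) + (a(t) / A) * w              # (56b)
    grad_new = rng.normal(size=(n, m))                           # stand-in for hat-nabla^{(t)}
    q = P @ q + grad_new - grad                                  # (56c): dynamic average of the gradients
    assert np.allclose(u.mean(0), (A_prev / A) * v.mean(0) + (a(t) / A) * w_prev.mean(0))  # (57a)
    assert np.allclose(v_new.mean(0), (A_prev / A) * v.mean(0) + (a(t) / A) * w.mean(0))   # (57b)
    assert np.allclose(q.mean(0), grad_new.mean(0))                                        # (61)
    v, grad, w_prev = v_new, grad_new, w
print("ok")
```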
Before proving Theorem 2, we present Lemma 7, which establishes upper bounds for the consensus error vectors $\tilde{\bf u}^{(t)}$ and $\tilde{\bf v}^{(t)}$.

Lemma 7.

For Algorithm 2, if Assumptions 1, 2, and 3 are satisfied, then

𝐯~(t)atAtCp,𝐮~(t)atAtCp\lVert\tilde{\bf v}^{(t)}\rVert\leq\frac{a_{t}}{A_{t}}C_{p},\quad\lVert\tilde{\bf u}^{(t)}\rVert\leq\frac{a_{t}}{A_{t}}C_{p} (58)

for all t1t\geq 1, where Cp=31βnGC_{p}=\lceil\frac{3}{1-\beta}\rceil\sqrt{n}G, and β=σ2(P)\beta=\sigma_{2}(P).

Proof of Lemma 7.

Since $u_{i}^{(t)}$, $v_{i}^{(t)}$, $i=1,\cdots,n$, as well as $\overline{u}^{(t)}$ and $\overline{v}^{(t)}$, all lie within the constraint set, we readily have

𝐮(t)𝟏u¯(t)nGand𝐯(t)𝟏v¯(t)nG\begin{split}\lVert{\bf u}^{(t)}-{\bf 1}\otimes\overline{u}^{(t)}\rVert\leq\sqrt{n}G\,\,\,\,\textrm{and}\,\,\,\,\lVert{\bf v}^{(t)}-{\bf 1}\otimes\overline{v}^{(t)}\rVert\leq\sqrt{n}G\end{split}

by Assumption 3. Upon using

atAt=2(t+1)t(t+3)1t,t1\frac{a_{t}}{A_{t}}=\frac{2(t+1)}{t(t+3)}\geq\frac{1}{t},\quad\forall t\geq 1

and the definition of Cp=31βnGC_{p}=\lceil\frac{3}{1-\beta}\rceil\sqrt{n}G, we have that (58) holds for 1t<31β.1\leq t<\lceil\frac{3}{1-\beta}\rceil.
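The closed form $\frac{a_{t}}{A_{t}}=\frac{2(t+1)}{t(t+3)}$, its lower bound $1/t$, its monotone decrease, and the bound $a_{\tau}^{2}/A_{\tau}\leq 2a$ used at the end of the proof of Theorem 2 can all be verified numerically. The sketch below assumes $a_{t}=a(t+1)$ with $A_{t}=\sum_{\tau=1}^{t}a_{\tau}$, which is our reading of the weight choice and is consistent with the stated ratio.

```python
import numpy as np

# Checks on the (assumed) weight sequence a_t = a*(t+1), A_t = sum_{tau<=t} a_tau = a*t*(t+3)/2.
a = 0.7                                      # any a > 0
t = np.arange(1, 1001)
a_t = a * (t + 1)
A_t = np.cumsum(a_t)
ratio = a_t / A_t
print(np.allclose(ratio, 2 * (t + 1) / (t * (t + 3))))   # closed form of a_t / A_t
print(np.all(ratio >= 1 / t))                             # a_t / A_t >= 1/t for all t >= 1
print(np.all(np.diff(ratio) <= 0))                        # a_t / A_t decreases monotonically with t
print(np.all(a_t ** 2 / A_t <= 2 * a + 1e-12))            # a_t^2 / A_t <= 2a (used in the proof of Theorem 2)
```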

When $t\geq\lceil\frac{3}{1-\beta}\rceil$, we proceed by induction. Suppose that (58) holds for some $t\geq\lceil\frac{3}{1-\beta}\rceil$. Next, we examine the upper bounds for $\lVert\tilde{\bf v}^{(t+1)}\rVert$ and $\lVert\tilde{\bf u}^{(t+1)}\rVert$, respectively.

i) Upper bound for 𝐯~(t+1)\lVert\tilde{\bf v}^{(t+1)}\rVert. Using

𝐏𝐯(t)𝟏v¯(t)=((P𝟏𝟏Tn)I)𝐯~(t),\displaystyle\mathbf{P}{\bf v}^{(t)}-\mathbf{1}\otimes\overline{v}^{(t)}=\left(\left(P-\frac{\mathbf{1}\mathbf{1}^{\mathrm{T}}}{n}\right)\otimes I\right)\tilde{\bf v}^{(t)},

(56b) and (57b), we obtain

𝐯~(t+1)=AtAt+1((P𝟏𝟏Tn)I)𝐯~(t)+at+1At+1𝐰~(t+1).\begin{split}\tilde{\bf v}^{(t+1)}=&\frac{A_{t}}{A_{t+1}}\left(\left(P-\frac{\mathbf{1}\mathbf{1}^{\mathrm{T}}}{n}\right)\otimes I\right)\tilde{\bf v}^{(t)}+\frac{a_{t+1}}{A_{t+1}}\tilde{\bf w}^{(t+1)}.\end{split}

Evaluating the norm of both sides of the above equality yields

\begin{split}\lVert\tilde{\bf v}^{(t+1)}\rVert=&\left\lVert\frac{A_{t}}{A_{t+1}}\left(\left(P-\frac{\mathbf{1}\mathbf{1}^{\mathrm{T}}}{n}\right)\otimes I\right)\tilde{\bf v}^{(t)}+\frac{a_{t+1}}{A_{t+1}}\tilde{\bf w}^{(t+1)}\right\rVert\\ {\leq}&\beta\lVert\tilde{\bf v}^{(t)}\rVert+\frac{a_{t+1}}{A_{t+1}}\left\lVert\tilde{\bf w}^{(t+1)}\right\rVert\\ \leq&\frac{a_{t}}{A_{t}}\beta C_{p}+\frac{a_{t+1}}{A_{t+1}}\sqrt{n}G,\end{split}

where the last inequality follows from the hypothesis that 𝐯~(t)atCp/At\lVert\tilde{\bf v}^{(t)}\rVert\leq{a_{t}}C_{p}/{A_{t}} and Assumption 3. Since at/At{a_{t}}/A_{t} monotonically decreases with tt, we have

\begin{split}\lVert\tilde{\bf v}^{(t+1)}\rVert\leq&\frac{a_{t}}{A_{t}}\left(\beta C_{p}+\sqrt{n}G\right)\leq\frac{a_{t}}{A_{t}}C_{p}\left(\beta+\frac{1}{\lceil\frac{3}{1-\beta}\rceil}\right),\end{split}

where the last inequality is due to nG=Cp31β\sqrt{n}G=\frac{C_{p}}{\lceil\frac{3}{1-\beta}\rceil}. It then remains to prove

(β+131β)Atatat+1At+1,t31β\left({\beta}+\frac{1}{\lceil\frac{3}{1-\beta}\rceil}\right)\leq\frac{A_{t}}{a_{t}}\cdot\frac{a_{t+1}}{A_{t+1}},\quad\forall t\geq\lceil\frac{3}{1-\beta}\rceil (59)

to obtain the bound for 𝐯~(t+1)\lVert\tilde{\bf v}^{(t+1)}\rVert as desired. To prove (59), we let

t0=31β,t_{0}=\lceil\frac{3}{1-\beta}\rceil,

which implies

3t01β.\frac{3}{t_{0}}\leq 1-\beta.

Based on the above relation, we further obtain

β+1t0t02t0t0+2t0+4.\beta+\frac{1}{t_{0}}\leq\frac{t_{0}-2}{t_{0}}\leq\frac{t_{0}+2}{t_{0}+4}. (60)

This in conjunction with

t(t+3)(t+1)(t+1)1,t1\frac{t(t+3)}{(t+1)(t+1)}\geq 1,\quad\forall t\geq 1

and the definitions of ata_{t} and AtA_{t} yields

β+1t0t0+2t0+4t0(t0+3)(t0+1)(t0+1)=At0at0at0+1At0+1.\beta+\frac{1}{t_{0}}{\leq}\frac{t_{0}+2}{t_{0}+4}\cdot\frac{t_{0}(t_{0}+3)}{(t_{0}+1)(t_{0}+1)}=\frac{A_{t_{0}}}{a_{t_{0}}}\cdot\frac{a_{t_{0}+1}}{A_{t_{0}+1}}.

Since Atat+1/(atAt+1){A_{t}a_{t+1}}/{(a_{t}A_{t+1})} monotonically increases with tt, we have (59) satisfied.
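Inequality (59) can also be confirmed numerically: under the same weight assumption as in the earlier sketch, $\frac{A_{t}}{a_{t}}\cdot\frac{a_{t+1}}{A_{t+1}}=\frac{t(t+2)(t+3)}{(t+1)^{2}(t+4)}$, and the check below evaluates (59) together with the monotonicity claim for several values of $\beta$.

```python
import numpy as np

# Numerical check of (59) and of the monotone increase of (A_t/a_t)*(a_{t+1}/A_{t+1}),
# assuming a_t proportional to (t+1) as in the earlier sketch.
ratio = lambda t: t * (t + 2) * (t + 3) / ((t + 1) ** 2 * (t + 4))
for beta in (0.1, 0.5, 0.9, 0.99):
    t0 = int(np.ceil(3 / (1 - beta)))
    ts = np.arange(t0, t0 + 2000)
    assert np.all(beta + 1.0 / t0 <= ratio(ts))   # inequality (59) for all t >= ceil(3/(1-beta))
    assert np.all(np.diff(ratio(ts)) >= 0)        # the ratio increases with t
print("ok")
```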

ii) Upper bound for 𝐮~(t+1)\lVert\tilde{\bf u}^{(t+1)}\rVert. Using the same arguments as above, we have

\begin{split}\lVert\tilde{\bf u}^{(t+1)}\rVert=&\left\lVert\frac{A_{t}}{A_{t+1}}\left(\left(P-\frac{\mathbf{1}\mathbf{1}^{\mathrm{T}}}{n}\right)\otimes I\right)\tilde{\bf v}^{(t)}+\frac{a_{t+1}}{A_{t+1}}\tilde{\bf w}^{(t)}\right\rVert\\ {\leq}&\beta\lVert\tilde{\bf v}^{(t)}\rVert+\frac{a_{t+1}}{A_{t+1}}\left\lVert\tilde{\bf w}^{(t)}\right\rVert\\ \leq&\frac{a_{t}}{A_{t}}\beta C_{p}+\frac{a_{t+1}}{A_{t+1}}\sqrt{n}G.\end{split}

By following the same line of reasoning as in the first part, we are able to obtain

𝐮~(t+1)at+1At+1Cp.\lVert\tilde{\bf u}^{(t+1)}\rVert\leq\frac{a_{t+1}}{A_{t+1}}C_{p}.

Combining the above bounds completes the proof. ∎

Lemma 8 establishes an upper bound for the consensus error vector $\tilde{\bf q}^{(t)}$.

Lemma 8.

Suppose Assumptions 1, 2, and 3 are satisfied. For Algorithm 2, we have

q¯(t)=g^¯(t)\overline{q}^{(t)}=\overline{\hat{g}}^{(t)} (61)

and

𝐪~(t)atAtCg\lVert\tilde{\bf q}^{(t)}\rVert\leq\frac{a_{t}}{A_{t}}C_{g} (62)

for all t1t\geq 1, where Cg=31βL(2nG+2Cp)/(1β)C_{g}={\lceil\frac{3}{1-\beta}\rceil L\Big{(}2\sqrt{n}G+2C_{p}\Big{)}}/{(1-\beta)}, and β=σ2(P)\beta=\sigma_{2}(P).

Proof of Lemma 8.

The proof of (61) directly follows from the proof of Lemma 3, and is omitted here for brevity.

For (62), we subtract 𝟏q¯(t){\bf 1}\otimes\overline{q}^{(t)} from both sides of (56c) to get

𝐪(t)𝟏q¯(t)=𝐏𝐪(t1)𝟏q¯(t1)+^(t)^(t1)𝟏(q¯(t)q¯(t1)).\begin{split}{\bf q}^{(t)}-{\bf 1}\otimes\overline{q}^{(t)}=&{\bf P}{\bf q}^{(t-1)}-{\bf 1}\otimes\overline{q}^{(t-1)}\\ &+\hat{\nabla}^{(t)}-\hat{\nabla}^{(t-1)}-{\bf 1}\otimes(\overline{q}^{(t)}-\overline{q}^{(t-1)}).\\ \end{split} (63)

Using the same procedure as in (B-A) leads to

𝐪~(t)β𝐪~(t1)+^(t)^(t1)𝟏(q¯(t)q¯(t1)).\lVert\tilde{\bf q}^{(t)}\rVert\leq\beta\lVert\tilde{\bf q}^{(t-1)}\rVert+\lVert\hat{\nabla}^{(t)}-\hat{\nabla}^{(t-1)}-{\bf 1}\otimes(\overline{q}^{(t)}-\overline{q}^{(t-1)})\rVert. (64)

Since the objective is smooth, we obtain

^(t)^(t1)𝟏(q¯(t)q¯(t1))=^(t)^(t1)𝟏(g^¯(t)g^¯(t1))=^(t)^(t1)(𝟏𝟏TnI)(^(t)^(t1))^(t)^(t1)L𝐮(t)𝐮(t1).\begin{split}&\left\lVert\hat{\nabla}^{(t)}-\hat{\nabla}^{(t-1)}-\mathbf{1}\otimes\left(\overline{q}^{(t)}-\overline{q}^{(t-1)}\right)\right\rVert\\ =&\left\lVert\hat{\nabla}^{(t)}-\hat{\nabla}^{(t-1)}-\mathbf{1}\otimes\left(\overline{\hat{g}}^{(t)}-\overline{\hat{g}}^{(t-1)}\right)\right\rVert\\ =&\left\lVert\hat{\nabla}^{(t)}-\hat{\nabla}^{(t-1)}-\left(\frac{\mathbf{1}\mathbf{1}^{\mathrm{T}}}{n}\otimes I\right)\left(\hat{\nabla}^{(t)}-\hat{\nabla}^{(t-1)}\right)\right\rVert\\ \leq&\left\lVert\hat{\nabla}^{(t)}-\hat{\nabla}^{(t-1)}\right\rVert\leq L\left\lVert{\bf u}^{(t)}-{\bf u}^{(t-1)}\right\rVert.\end{split}

To bound 𝐮(t)𝐮(t1)\left\lVert{\bf u}^{(t)}-{\bf u}^{(t-1)}\right\rVert, we consider

𝐮(t)𝐮(t1)=At1At𝐏𝐯(t1)+atAt𝐰(t1)𝐮(t1)=At1At𝐏(𝐯(t1)𝐮(t1))+At1At(𝐏II)𝐮(t1)+atAt(𝐰(t1)𝐮(t1))\begin{split}&{\bf u}^{(t)}-{\bf u}^{(t-1)}\\ {=}&\frac{A_{t-1}}{A_{t}}{\bf P}{\bf v}^{(t-1)}+\frac{a_{t}}{A_{t}}{\bf w}^{(t-1)}-{\bf u}^{(t-1)}\\ =&\frac{A_{t-1}}{A_{t}}{\bf P}\left({\bf v}^{(t-1)}-{\bf u}^{(t-1)}\right)+\frac{A_{t-1}}{A_{t}}\left({\bf P}-I\otimes I\right){\bf u}^{(t-1)}\\ &+\frac{a_{t}}{A_{t}}\left({\bf w}^{(t-1)}-{\bf u}^{(t-1)}\right)\end{split}

where the first equality is due to (56a). From (56a) and (56b), we have 𝐯(t1)𝐮(t1)=at1At1(𝐰(t1)𝐰(t2)).{\bf v}^{(t-1)}-{\bf u}^{(t-1)}=\frac{a_{t-1}}{A_{t-1}}\left({\bf w}^{(t-1)}-{\bf w}^{(t-2)}\right). In addition, we have (𝐏II)𝐮(t1)=(𝐏II)(𝐮(t1)𝟏u¯(t1)).\left({\bf P}-I\otimes I\right){\bf u}^{(t-1)}=\left({\bf P}-I\otimes I\right)({\bf u}^{(t-1)}-{\bf 1}\otimes\overline{u}^{(t-1)}). Therefore, it holds that

𝐮(t)𝐮(t1)At1Atat1At1𝐰(t1)𝐰(t2)+2At1At𝐮~(t1)+atAt𝐰(t1)𝐮(t1)atAtnG+2atAtCp+atAtnG=atAt(2nG+2Cp)\begin{split}&\left\lVert{\bf u}^{(t)}-{\bf u}^{(t-1)}\right\rVert\\ {\leq}&\frac{A_{t-1}}{A_{t}}\frac{a_{t-1}}{A_{t-1}}\left\lVert{\bf w}^{(t-1)}-{\bf w}^{(t-2)}\right\rVert+\frac{2A_{t-1}}{A_{t}}\left\lVert\tilde{\bf u}^{(t-1)}\right\rVert\\ &+\frac{a_{t}}{A_{t}}\left\lVert{\bf w}^{(t-1)}-{\bf u}^{(t-1)}\right\rVert\\ \leq&\frac{a_{t}}{A_{t}}\sqrt{n}G+\frac{2a_{t}}{A_{t}}C_{p}+\frac{a_{t}}{A_{t}}\sqrt{n}G\\ =&\frac{a_{t}}{A_{t}}\left(2\sqrt{n}G+2C_{p}\right)\end{split} (65)

where Lemma 7 and Assumption 3 are used to get the second inequality. By substituting (65) into (64), we obtain

𝐪~(t)β𝐪~(t1)+atAtL(2nG+2Cp).\begin{split}\lVert\tilde{\bf q}^{(t)}\rVert\leq\beta\lVert\tilde{\bf q}^{(t-1)}\rVert+\frac{a_{t}}{A_{t}}L\left(2\sqrt{n}G+2C_{p}\right).\end{split} (66)

By initialization, we have $\tilde{\bf q}^{(0)}=0$, and therefore, for all $t\geq 1$,

\begin{split}\left\lVert{\tilde{\bf q}}^{(t)}\right\rVert\leq&L\left(2\sqrt{n}G+2C_{p}\right)\sum_{\tau=1}^{t}\beta^{t-\tau}\frac{a_{\tau}}{A_{\tau}}\leq\frac{L\left(2\sqrt{n}G+2C_{p}\right)}{1-\beta},\end{split}

implying that (62) is valid for 1t<31β1\leq t<\lceil\frac{3}{1-\beta}\rceil. Next, we prove that (62) also holds for t31βt\geq\lceil\frac{3}{1-\beta}\rceil by mathematical induction. Suppose that (62) holds true for some t31βt\geq\lceil\frac{3}{1-\beta}\rceil. Using this hypothesis and (66), we obtain

𝐪~(t+1)atAtβCg+at+1At+1L(2nG+2Cp)atAtCg(β+131β).\begin{split}\lVert\tilde{\bf q}^{(t+1)}\rVert\leq&\frac{a_{t}}{A_{t}}\beta C_{g}+\frac{a_{t+1}}{A_{t+1}}L\left(2\sqrt{n}G+2C_{p}\right)\\ \leq&\frac{a_{t}}{A_{t}}C_{g}\left({\beta}+\frac{1}{\lceil\frac{3}{1-\beta}\rceil}\right).\end{split}

Finally, using the same argument as in (59) and (60) in the proof of Lemma 7, we arrive at (62) as desired. ∎
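The induction behind (62) can be stress-tested by running the recursion (66) with equality, which is the worst case: starting from $\lVert\tilde{\bf q}^{(0)}\rVert=0$, the resulting scalar sequence should stay below $\frac{a_{t}}{A_{t}}C_{g}$. In the sketch below, $K$ is a placeholder standing in for $L\left(2\sqrt{n}G+2C_{p}\right)$, and the weights are assumed as in the earlier sketches.

```python
import numpy as np

# Worst-case check of (62): iterate e_t = beta*e_{t-1} + (a_t/A_t)*K (i.e., (66) with equality), e_0 = 0,
# and verify e_t <= (a_t/A_t)*C_g with C_g = ceil(3/(1-beta))*K/(1-beta).
K = 1.0                                       # stands in for L*(2*sqrt(n)*G + 2*C_p)
for beta in (0.2, 0.8, 0.95):
    Cg = np.ceil(3 / (1 - beta)) * K / (1 - beta)
    e, ok = 0.0, True
    for t in range(1, 5001):
        r = 2 * (t + 1) / (t * (t + 3))       # a_t / A_t
        e = beta * e + r * K
        ok = ok and (e <= r * Cg + 1e-12)
    print(beta, ok)                           # expected: True for each beta
```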

D-B Proof of Theorem 2

Proof of Theorem 2.

Using Aτ1=AτaτA_{\tau-1}=A_{\tau}-a_{\tau}, we have

At(f(v¯(t))f(x))=τ=1t(Aτf(v¯(τ))Aτ1f(v¯(τ1)))τ=1taτf(x)=τ=1t(Aτ(f(v¯(τ))f(u¯(τ)))+aτ(f(u¯(τ))f(x))+Aτ1(f(u¯(τ))f(v¯(τ1))))\begin{split}&A_{t}\Big{(}f(\overline{v}^{(t)})-f(x^{*})\Big{)}\\ =&\sum_{\tau=1}^{t}\left(A_{\tau}f(\overline{v}^{(\tau)})-A_{\tau-1}f(\overline{v}^{(\tau-1)})\right)-\sum_{\tau=1}^{t}a_{\tau}f(x^{*})\\ =&\sum_{\tau=1}^{t}\Big{(}A_{\tau}\left(f(\overline{v}^{(\tau)})-f(\overline{u}^{(\tau)})\right)+a_{\tau}\left(f(\overline{u}^{(\tau)})-f(x^{*})\right)\\ &\quad\quad\,\,+A_{\tau-1}\left(f(\overline{u}^{(\tau)})-f(\overline{v}^{(\tau-1)})\right)\Big{)}\\ \end{split}

Upon using the convexity of ff, we obtain

At(f(v¯(t))f(x))τ=1t(Aτ(f(v¯(τ))f(u¯(τ)))+aτf(u¯(τ)),u¯(τ)x+Aτ1f(u¯(τ)),u¯(τ)v¯(τ1))\begin{split}&A_{t}\left(f(\overline{v}^{(t)})-f(x^{*})\right)\\ \leq&\sum_{\tau=1}^{t}\Big{(}A_{\tau}\left(f(\overline{v}^{(\tau)})-f(\overline{u}^{(\tau)})\right)+a_{\tau}\left\langle\nabla f(\overline{u}^{(\tau)}),\overline{u}^{(\tau)}-x^{*}\right\rangle\\ &\quad\quad\,\,+A_{\tau-1}\left\langle\nabla f(\overline{u}^{(\tau)}),\overline{u}^{(\tau)}-\overline{v}^{(\tau-1)}\right\rangle\Big{)}\end{split}

By (57b), we obtain

At(f(v¯(t))f(x))τ=1tAτ(f(v¯(τ))f(u¯(τ))+f(u¯(τ)),u¯(τ)v¯(τ))+τ=1taτf(u¯(τ)),w¯(τ)xτ=1tAτL2u¯(τ)v¯(τ)2(I)+1ni=1nτ=1taτqi(τ),wi(τ)x(II)+1nτ=1ti=1naτf(u¯(τ))qi(τ),wi(τ)x(III)\begin{split}&A_{t}\Big{(}f(\overline{v}^{(t)})-f(x^{*})\Big{)}\\ \leq&\sum_{\tau=1}^{t}A_{\tau}\Big{(}f(\overline{v}^{(\tau)})-f(\overline{u}^{(\tau)})+\left\langle\nabla f(\overline{u}^{(\tau)}),\overline{u}^{(\tau)}-\overline{v}^{(\tau)}\right\rangle\Big{)}\\ &+\sum_{\tau=1}^{t}a_{\tau}\left\langle\nabla f(\overline{u}^{(\tau)}),\overline{w}^{(\tau)}-x^{*}\right\rangle\\ {\leq}&\sum_{\tau=1}^{t}\frac{A_{\tau}L}{2}\underbrace{\lVert\overline{u}^{(\tau)}-\overline{v}^{(\tau)}\rVert^{2}}_{(I)}+\frac{1}{n}\sum_{i=1}^{n}\underbrace{\sum_{\tau=1}^{t}a_{\tau}\left\langle q_{i}^{(\tau)},{w}_{i}^{(\tau)}-x^{*}\right\rangle}_{(II)}\\ &+\frac{1}{n}\sum_{\tau=1}^{t}\underbrace{\sum_{i=1}^{n}a_{\tau}\left\langle\nabla f(\overline{u}^{(\tau)})-q_{i}^{(\tau)},{w}_{i}^{(\tau)}-x^{*}\right\rangle}_{(III)}\end{split} (67)

where the last inequality is due to the smoothness of ff. To bound (I)(I), we consider

u¯(τ)v¯(τ)2=1ni=1nu¯(τ)ui(τ)+ui(τ)vi(τ)+vi(τ)v¯(τ)21ni=1n(3u¯(τ)ui(τ)2+3ui(τ)vi(τ)2+3vi(τ)v¯(τ)2)(aτAτ)26Cp2+3𝐰(τ)𝐰(τ1)2n\begin{split}&\lVert\overline{u}^{(\tau)}-\overline{v}^{(\tau)}\rVert^{2}\\ =&\frac{1}{n}\sum_{i=1}^{n}\left\lVert\overline{u}^{(\tau)}-{u}^{(\tau)}_{i}+{u}^{(\tau)}_{i}-{v}^{(\tau)}_{i}+{v}^{(\tau)}_{i}-\overline{v}^{(\tau)}\right\rVert^{2}\\ \leq&\frac{1}{n}\sum_{i=1}^{n}\Big{(}3\left\lVert\overline{u}^{(\tau)}-{u}^{(\tau)}_{i}\right\rVert^{2}+3\left\lVert{u}^{(\tau)}_{i}-{v}^{(\tau)}_{i}\right\rVert^{2}\\ &\quad\quad\quad+3\left\lVert{v}^{(\tau)}_{i}-\overline{v}^{(\tau)}\right\rVert^{2}\Big{)}\\ \leq&\left(\frac{a_{\tau}}{A_{\tau}}\right)^{2}\frac{6C_{p}^{2}+{3}\left\lVert{\bf w}^{(\tau)}-{\bf w}^{(\tau-1)}\right\rVert^{2}}{n}\end{split} (68)

where the first inequality follows from $\lVert x+y+z\rVert^{2}\leq 3\lVert x\rVert^{2}+3\lVert y\rVert^{2}+3\lVert z\rVert^{2}$, and the last inequality is due to Lemma 7 and (56). For $(II)$, by letting $\zeta^{(\tau)}=q_{i}^{(\tau)}$ and $\nu^{(\tau)}={w}_{i}^{(\tau)}$ in Lemma 6, we have

τ=1taτqi(τ),wi(τ)xd(x)τ=1t12wi(τ)wi(τ1)2.\begin{split}\sum_{\tau=1}^{t}a_{\tau}\left\langle q_{i}^{(\tau)},{w}_{i}^{(\tau)}-x^{*}\right\rangle\leq d(x^{*})-\sum_{\tau=1}^{t}\frac{1}{2}\lVert{w}_{i}^{(\tau)}-{w}_{i}^{(\tau-1)}\rVert^{2}.\end{split} (69)

To bound (III)(III), we use (61) to get

aτf(u¯(τ))qi(τ),wi(τ)xGaτf(u¯(τ))g^¯(τ)+q¯(τ)qi(τ)Gaτ(f(u¯(τ))g^¯(τ)+q¯(τ)qi(τ)).\begin{split}&a_{\tau}\left\langle\nabla f(\overline{u}^{(\tau)})-q_{i}^{(\tau)},{{w}}_{i}^{(\tau)}-x^{*}\right\rangle\\ \leq&Ga_{\tau}\left\lVert\nabla f(\overline{u}^{(\tau)})-\overline{\hat{g}}^{(\tau)}+\overline{q}^{(\tau)}-q_{i}^{(\tau)}\right\rVert\\ \leq&Ga_{\tau}\left(\left\lVert\nabla f(\overline{u}^{(\tau)})-\overline{\hat{g}}^{(\tau)}\right\rVert+\left\lVert\overline{q}^{(\tau)}-q_{i}^{(\tau)}\right\rVert\right).\end{split}

Upon using Lemma 7, we obtain

\begin{split}&\left\lVert\nabla f(\overline{u}^{(\tau)})-\overline{\hat{g}}^{(\tau)}\right\rVert\leq\frac{1}{n}\sum_{i=1}^{n}\left\lVert\nabla f_{i}(\overline{u}^{(\tau)})-\nabla f_{i}(u_{i}^{(\tau)})\right\rVert\\ \leq&\frac{L}{n}\sum_{i=1}^{n}\left\lVert\overline{u}^{(\tau)}-u_{i}^{(\tau)}\right\rVert\leq L\sqrt{\frac{\lVert\tilde{\bf u}^{(\tau)}\rVert^{2}}{n}}\leq\left(\frac{a_{\tau}}{A_{\tau}}\right)\frac{LC_{p}}{\sqrt{n}}.\end{split}

Moreover, by Lemma 8 and the Cauchy–Schwarz inequality, $\sum_{i=1}^{n}\left\lVert\overline{q}^{(\tau)}-q_{i}^{(\tau)}\right\rVert\leq\sqrt{n}\lVert\tilde{\bf q}^{(\tau)}\rVert\leq{\sqrt{n}C_{g}a_{\tau}}/{A_{\tau}}$. Therefore

i=1naτf(u¯(τ))qi(τ),wi(τ)x(aτ2Aτ)nG(LCp+Cg).\begin{split}&\sum_{i=1}^{n}a_{\tau}\left\langle\nabla f(\overline{u}^{(\tau)})-q_{i}^{(\tau)},{{w}}_{i}^{(\tau)}-x^{*}\right\rangle\\ &\leq\left(\frac{a_{\tau}^{2}}{A_{\tau}}\right){\sqrt{n}G(LC_{p}+C_{g})}.\end{split} (70)

Finally, by collectively substituting (68), (69), and (70) into (67), we get

At(f(v¯(t))f(x))(G(LCp+Cg)n+3LCp2n)τ=1taτ2Aτ+d(x)+12nτ=1t(3Laτ2Aτ1)𝐰(τ)𝐰(τ1)2.\begin{split}&A_{t}\left(f(\overline{v}^{(t)})-f(x^{*})\right)\\ {\leq}&\left(\frac{G(LC_{p}+C_{g})}{\sqrt{n}}+\frac{3LC_{p}^{2}}{n}\right)\sum_{\tau=1}^{t}\frac{a_{\tau}^{2}}{A_{\tau}}\\ &+d(x^{*})+\frac{1}{2n}\sum_{\tau=1}^{t}\left(\frac{3La_{\tau}^{2}}{A_{\tau}}-1\right)\left\lVert{\bf w}^{(\tau)}-{\bf w}^{(\tau-1)}\right\rVert^{2}.\end{split}

Based on the condition in (30) and the fact that aτ2/Aτ2a{a_{\tau}^{2}}/{A_{\tau}}\leq 2a, we obtain (31) as desired.

The inequality in (32) directly follows from Lemma 7. ∎
