
Accelerated Dual Averaging Methods for Decentralized Constrained Optimization

Changxin Liu, Yang Shi, Huiping Li, and Wenli Du. This work was supported in part by the Natural Sciences and Engineering Research Council of Canada, in part by the National Natural Science Foundation of China (NSFC) under Grant 61922068, and in part by the Shaanxi Provincial Funds for Distinguished Young Scientists under Grant 2019JC-14x. (Corresponding author: Yang Shi.) C. Liu is with the Key Laboratory of Smart Manufacturing in Energy Chemical Process, Ministry of Education, East China University of Science and Technology, Shanghai, 200237, and also with the Department of Mechanical Engineering, University of Victoria, Victoria, BC V8W 3P6, Canada (e-mail: chxliu@uvic.ca). Y. Shi is with the Department of Mechanical Engineering, University of Victoria, Victoria, BC V8W 3P6, Canada (e-mail: yshi@uvic.ca). H. Li is with the School of Marine Science and Technology, Northwestern Polytechnical University, Xi’an, 710072, China (e-mail: lihuiping@nwpu.edu.cn). W. Du is with the Key Laboratory of Smart Manufacturing in Energy Chemical Process, Ministry of Education, East China University of Science and Technology, Shanghai, 200237, China (e-mail: wldu@ecust.edu.cn).
Abstract

In this work, we study decentralized convex constrained optimization problems in networks. We focus on the dual averaging-based algorithmic framework that is well-documented to be superior in handling constraints and complex communication environments simultaneously. Two new decentralized dual averaging (DDA) algorithms are proposed. In the first one, a second-order dynamic average consensus protocol is tailored for DDA-type algorithms, which equips each agent with a provably more accurate estimate of the global dual variable than conventional schemes. We rigorously prove that the proposed algorithm attains $\mathcal{O}(1/t)$ convergence for general convex and smooth problems, for which existing DDA methods were only known to converge at $\mathcal{O}(1/\sqrt{t})$ prior to our work. In the second one, we use the extrapolation technique to accelerate the convergence of DDA. Compared to existing accelerated algorithms, where typically two different variables are exchanged among agents at each time, the proposed algorithm only seeks consensus on local gradients. Then, the extrapolation is performed based on two sequences of primal variables which are determined by the accumulations of gradients at two consecutive time instants, respectively. The algorithm is proved to converge at $\mathcal{O}(1)\left(\frac{1}{t^{2}}+\frac{1}{t(1-\beta)^{2}}\right)$, where $\beta$ denotes the second largest singular value of the mixing matrix. We remark that the condition for the algorithmic parameter to guarantee convergence does not rely on the spectrum of the mixing matrix, making it easy to satisfy in practice. Finally, numerical results are presented to demonstrate the efficiency of the proposed methods.

Index Terms:
Decentralized optimization, constrained optimization, dual averaging, acceleration, multi-agent system.

I Introduction

Consider a multi-agent system consisting of $n$ agents. Each agent holds a private objective function. They are connected via a communication network in order to collaboratively solve the following optimization problem:

\min_{x\in\mathcal{X}}\left\{f(x):=\frac{1}{n}\sum_{i=1}^{n}f_{i}(x)\right\} (1)

where $f_{i}$ represents the local smooth objective function of agent $i$ and $\mathcal{X}\subseteq\mathbb{R}^{m}$ denotes the constraint set shared by all the agents. This problem is referred to as decentralized optimization in the literature and finds broad applications in optimal control of cyber-physical systems, sensor networks, and machine learning, to name a few. For an overview of decentralized optimization and its applications, please refer to [1, 2].

Over the last decade, many decentralized optimization algorithms have been proposed for solving Problem (1). For unconstrained problems, i.e., $\mathcal{X}=\mathbb{R}^{m}$, the authors in [3, 4] developed decentralized gradient descent (DGD) methods with constant step sizes, where the local search performed by individual agents is guided by local gradients and a consensus protocol. However, because each individual gradient evaluated at the global optimum is not necessarily zero, the search directions induced by consensus-seeking and local gradients may conflict with each other, making it difficult to ascertain the exact solution to the problem. Several efforts have been made to overcome this drawback. For example, the authors in [5] proposed the EXTRA algorithm that adds a cumulative correction term to DGD to achieve consensual optimization. Alternatively, the additional gradient-tracking process based on the dynamic average consensus scheme in [6] can be used. It is shown in [7, 8] that, for unconstrained smooth optimization, the algorithms steered by the tracked gradient exactly converge at an $\mathcal{O}(1/t)$ rate. Based on this idea, a decentralized Nesterov gradient descent method was proposed in [9], where the rate of convergence is accelerated to $\mathcal{O}(1/t^{1.4-\epsilon})$ for any $\epsilon\in(0,1.4)$ at the expense of exchanging an additional variable among agents at each time instant. In [10], the authors proposed an accelerated decentralized algorithm with multiple consensus rounds at each time instant, and proved that after $t$ local iterations and $\mathcal{O}(t\log t)$ communication rounds the objective error is bounded by $\mathcal{O}(1/t^{2})$. By modeling Problem (1) as a linearly constrained optimization problem, centralized primal-dual paradigms such as the augmented Lagrangian method (ALM), the alternating direction method of multipliers (ADMM), and dual ascent can also be used to design decentralized algorithms [11, 12, 13]. Based on the primal-dual reformulation, an accelerated primal-dual method was developed in [14]. The rate of convergence is improved to $\mathcal{O}(1)\left(\frac{L}{t^{2}}+\frac{1}{t\sqrt{\eta}}\right)$, where $L$ denotes the smoothness parameter of each objective function and $\eta=\lambda_{2}(\mathcal{L})/\lambda_{m}(\mathcal{L})$ is the eigengap of the graph Laplacian $\mathcal{L}$. Notably, the authors established a lower bound for a class of decentralized primal-dual methods, suggesting that the algorithm developed therein is optimal in terms of gradient computations. The authors in [15] considered the Lagrangian dual formulation of the decentralized optimization problem and developed two algorithms based on accelerated dual methods. The algorithms are proved to be linearly convergent for strongly convex and smooth problems.

For constrained problems, the design and convergence analysis of decentralized optimization algorithms are more challenging [16, 17, 18]. The seminal work in [19] is based on the gossip protocol and the projected subgradient method, where the step size is made decaying to ensure convergence. The randomized smoothing technique and a multi-round consensus scheme are used to design a provably optimal decentralized algorithm for non-smooth convex problems in [20]. To improve the performance using a constant step size, a variant of EXTRA (PG-EXTRA) was developed in [21], where the constraint is modeled as a non-smooth indicator function and handled via the proximal operator. An $\mathcal{O}(1/t)$ rate of convergence is proved for the squared norm of the difference of consecutive iterates. Recently, the authors in [22] proposed an accelerated decentralized penalty method (APM), where the constraint can also be treated as the non-smooth part of the objective. Notably, there are some decentralized algorithms [23, 24, 25, 26, 27, 28, 29] available in the literature where the local search mechanism for individual agents is inspired by dual methods [30], e.g., mirror descent and dual averaging [31, 32]. In particular, dual averaging is provably more efficient in exploiting sparsity than proximal gradient methods for $l_{1}$-regularized problems [32]. For example, the authors in [26] developed a decentralized dual averaging (DDA) algorithm where a linear model of the global objective function is gradually learned by each agent via gossip. Compared to other types of decentralized first-order methods, DDA has the advantage of simultaneously handling constraints and complex communication environments, e.g., directed networks [33], deterministic time-varying networks [29], and stochastic networks [26, 34]. From a technical perspective, this is because DDA seeks consensus on linear models of the objective function rather than on the local projected iterates as in decentralized primal methods, e.g., DGD, therefore decoupling the consensus-seeking process from the nonlinear projection and facilitating the rate analysis in complex communication environments. We present a more detailed comparison in Section III-A.

Although decentralized dual methods in the literature have demonstrated advantages over their primal counterparts in terms of constraint handling and analysis complexity, existing results focus on non-smooth problems and achieve only an $\mathcal{O}(1/\sqrt{t})$ rate of convergence. Considering this, a question naturally arises: if the objective functions exhibit some desired properties, e.g., smoothness, is it possible to accelerate the convergence rate of DDA beyond $\mathcal{O}(1/\sqrt{t})$? We provide an affirmative answer to this question in this work. The main results and contributions are summarized in the following:

  • First, we develop a new DDA algorithm, where a second-order dynamic average consensus protocol is deliberately designed to assist each agent in estimating the global dual variable. Compared to the conventional estimation scheme [26], the proposed method equips each agent with provably more accurate estimates. In particular, the estimation error accumulated over time is proved to admit an upper bound constituted by the successive differences of an auxiliary variable whose update uses the mean of local dual variables. Then a rigorous investigation into the convergence of the auxiliary variable is carried out. Combining these two relations, we establish conditions on the algorithm parameters such that the estimation error can be fully compensated, leading to an improved rate of convergence $\mathcal{O}(1/t)$.

  • Second, we propose an accelerated DDA (ADDA) algorithm. Different from DDA, each agent employs a first-order dynamic average consensus protocol to estimate the mean of local gradients and accumulates the estimates over time to generate a local dual variable. By solving the convex conjugate of a $1$-strongly convex function over this local dual variable, each agent produces a primal variable and uses it to construct another two sequences of primal variables in an iterative manner based on the extrapolation technique in [35] and the average consensus protocol. The rate of convergence is proved to be $\mathcal{O}(1)\left(\frac{1}{t^{2}}+\frac{1}{t(1-\beta)^{2}}\right)$, where $\beta$ denotes the second largest singular value of the mixing matrix. Notably, the condition for the algorithmic parameter to ensure convergence does not rely on the mixing matrix. Establishing such a condition that is independent of the mixing matrix offers the appealing advantage of convenient verification in practical applications.

  • Finally, the proposed algorithms are tested and compared with a few methods in the literature on decentralized LASSO problems characterized by synthetic and real datasets. The comparison results demonstrate the efficiency of the proposed methods.

Notation: We use $\mathbb{R}$ and $\mathbb{R}^{n}$ to denote the set of reals and the $n$-dimensional Euclidean space, respectively. Given a real number $a$, we denote by $\lceil a\rceil$ the ceiling function that maps $a$ to the least integer greater than or equal to $a$. Given a vector $x\in\mathbb{R}^{n}$, $\lVert x\rVert$ denotes its $2$-norm. Given a matrix $P\in\mathbb{R}^{n\times n}$, its spectral radius is denoted by $\rho(P)$. Its eigenvalues and singular values are denoted by $\lambda_{1}(P)\geq\lambda_{2}(P)\geq\cdots\geq\lambda_{n}(P)$ and $\sigma_{1}(P)\geq\sigma_{2}(P)\geq\cdots\geq\sigma_{n}(P)$, respectively.

II Preliminaries

II-A Basic Setup

We consider the finite-sum optimization problem (1), in which $\mathcal{X}$ is a convex and compact set, and $f_{i}$ satisfies the following assumptions for all $i=1,\dots,n$:

Assumption 1.
  • i)

    $f_{i}$ is continuously differentiable on $\mathcal{X}$.

  • ii)

    $f_{i}$ is convex on $\mathcal{X}$, i.e., for any $x,y\in\mathcal{X}$,

    f_{i}(x)-f_{i}(y)-\langle\nabla f_{i}(y),x-y\rangle\geq 0. (2)
  • iii)

    $\nabla f_{i}$ is Lipschitz continuous on $\mathcal{X}$ with a constant $L>0$, i.e., for any $x,y\in\mathcal{X}$,

    \|\nabla f_{i}(x)-\nabla f_{i}(y)\|\leq L\|x-y\|. (3)

A direct consequence of Assumption 1(iii) is

f_{i}(x)-f_{i}(y)-\langle\nabla f_{i}(y),x-y\rangle\leq\frac{L}{2}\|x-y\|^{2},\quad\forall x,y\in\mathcal{X}. (4)

The above assumptions are standard in the study of decentralized algorithms for convex optimization problems. Throughout the paper, we denote by $x^{*}$ an optimal solution of Problem (1).

II-B Communication Network

We consider solving Problem (1) in a decentralized fashion, that is, a pair of agents can exchange information only if they are connected in the communication network. To describe the network topology, an undirected graph $\mathcal{G}=\{\mathcal{V},\mathcal{E}\}$ is used, where $\mathcal{V}=\{1,\cdots,n\}$ denotes the set of $n$ agents and $\mathcal{E}\subseteq\mathcal{V}\times\mathcal{V}$ represents the set of bidirectional channels, i.e., $(i,j)\in\mathcal{E}$ indicates that nodes $i$ and $j$ can send information to each other. Agent $j$ is said to be a neighbor of $i$ if there exists a link between them, and the set of $i$'s neighbors is denoted by $\mathcal{N}_{i}=\{j\in\mathcal{V}\,|\,(j,i)\in\mathcal{E}\}$. For every pair $(i,j)\in\mathcal{E}$, a positive weight $p_{ij}>0$ is assigned to $i$ and $j$ to weigh the information received from each other; otherwise $p_{ij}=0$. For the convergence of the algorithm, we make the following assumption on $P:=[p_{ij}]\in[0,1]^{n\times n}$.

Assumption 2.
  • i)

    $P\mathbf{1}=\mathbf{1}$ and $\mathbf{1}^{\mathrm{T}}P=\mathbf{1}^{\mathrm{T}}$, where $\mathbf{1}$ denotes the all-one vector of dimension $n$.

  • ii)

    $P$ has a strictly positive diagonal, i.e., $p_{ii}>0$.

Assumption 2 implies that $\sigma_{2}(P)<1$ [28]. Given a connected network, constant edge weights or the Metropolis-Hastings algorithm [36] can be used to construct a weight matrix $P$ fulfilling Assumption 2.
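To make Assumption 2 concrete, the following is a minimal sketch (not from the paper) of the Metropolis-Hastings rule [36] for building a symmetric, doubly stochastic weight matrix with a positive diagonal; the function name `metropolis_weights` and the 0/1 `adjacency` array are illustrative assumptions.

```python
import numpy as np

def metropolis_weights(adjacency):
    # Metropolis-Hastings weights: p_ij = 1 / (1 + max(d_i, d_j)) for (i, j) in E.
    n = adjacency.shape[0]
    degrees = adjacency.sum(axis=1)
    P = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j and adjacency[i, j]:
                P[i, j] = 1.0 / (1.0 + max(degrees[i], degrees[j]))
        P[i, i] = 1.0 - P[i].sum()   # self-weight keeps every row (and column) summing to one
    return P
```

For a connected graph, the resulting $P$ satisfies $P\mathbf{1}=\mathbf{1}$, $\mathbf{1}^{\mathrm{T}}P=\mathbf{1}^{\mathrm{T}}$, and $p_{ii}>0$, as required by Assumption 2.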

II-C Centralized Dual Averaging

Our algorithms are based on the dual averaging methods [31]. Before introducing them, we state the following definition.

Definition 1.

A differentiable function $\psi$ is strongly convex with modulus $\mu>0$ on $\mathcal{X}$ if

\psi(x)-\psi(y)-\langle\nabla\psi(y),x-y\rangle\geq\frac{\mu}{2}\lVert x-y\rVert^{2},\quad\forall x,y\in\mathcal{X}.

Let $d$ be a strongly convex and differentiable function with modulus $1$ on $\mathcal{X}$ such that

x^{(0)}=\operatorname*{argmin}_{x\in\mathcal{X}}d(x)\quad\mathrm{and}\quad d(x^{(0)})=0. (5)

To meet the condition in (5) for any $x^{(0)}\in\mathcal{X}$, one can choose

d(x)=\tilde{d}(x)-\tilde{d}(x^{(0)})-\langle\nabla\tilde{d}(x^{(0)}),x-x^{(0)}\rangle,

where $\tilde{d}$ is any strongly convex function with modulus $1$, e.g., $\tilde{d}(x)=\lVert x\rVert^{2}/2$. The gradient of the convex conjugate of $d$ is given by

\nabla d^{*}(\cdot)=\operatorname*{argmax}_{x\in\mathcal{X}}\left\{\left\langle\cdot,x\right\rangle-d(x)\right\}.

As a corollary of Danskin’s Theorem, we have the following result [35].

Lemma 1.

For all $x,y\in\mathbb{R}^{m}$, we have

\left\lVert\nabla d^{*}(x)-\nabla d^{*}(y)\right\rVert\leq\lVert x-y\rVert. (6)

Dual averaging. The dual averaging method can be applied to solve Problem (1) in a centralized manner. Starting with $x^{(0)}$, it generates a sequence of variables $\{x^{(t)}\}_{t\geq 0}$ iteratively according to

{x}^{(t)}=\nabla d^{*}\left(-a_{t}z^{(t)}\right) (7)

where

z^{(t)}=\sum_{\tau=0}^{t-1}\nabla f(x^{(\tau)}) (8)

and $\{a_{t}\}_{t\geq 0}$ is a sequence of positive parameters that determines the rate of convergence. Let $\tilde{x}^{(t)}=t^{-1}\sum_{\tau=0}^{t-1}{x}^{(\tau)}$. It is proved in [31] that $f(\tilde{x}^{(t)})-f(x^{*})\leq\mathcal{O}(1/\sqrt{t})$ when $a_{t}=\Theta(1/\sqrt{t})$, that is, with order exactly $1/\sqrt{t}$. When the objective function is convex and smooth, a constant $a_{t}=a$ can be used to achieve an ergodic $\mathcal{O}(1/t)$ rate of convergence in terms of the objective error [37].
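As a quick illustration of (7) and (8), the following is a minimal sketch (not from the paper) of the centralized dual averaging loop with the prox-function $d(x)=\lVert x-x^{(0)}\rVert^{2}/2$, for which $\nabla d^{*}$ reduces to a Euclidean projection onto $\mathcal{X}$; `grad_f` and `project_onto_X` are assumed user-supplied callables.

```python
import numpy as np

def dual_averaging(grad_f, project_onto_X, x0, a, num_iters):
    x = x0.copy()
    z = np.zeros_like(x0)                # z^(t): running sum of gradients, cf. (8)
    x_avg = np.zeros_like(x0)
    for t in range(1, num_iters + 1):
        z += grad_f(x)                   # accumulate grad f(x^(t-1))
        x = project_onto_X(x0 - a * z)   # x^(t) = grad d*(-a z^(t)), cf. (7)
        x_avg += (x - x_avg) / t         # ergodic average, the quantity bounded in [37]
    return x_avg
```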

Accelerated dual averaging. To speed up the rate of convergence, an accelerated dual averaging algorithm is developed in [35]. In particular, the variables are updated according to

u^{(t)}=\frac{A_{t-1}}{A_{t}}v^{(t-1)}+\frac{a_{t}}{A_{t}}{w}^{(t-1)} (9a)
v^{(t)}=\frac{A_{t-1}}{A_{t}}v^{(t-1)}+\frac{a_{t}}{A_{t}}{w}^{(t)}, (9b)

where $a_{t}:=a(t+1)$ for some $a>0$, $A_{t}=\sum_{\tau=1}^{t}a_{\tau}$, and

{w}^{(t)}=\nabla d^{*}\left(-\sum_{\tau=1}^{t}a_{\tau}\nabla f(u^{(\tau)})\right). (10)

Note that $t\geq 2$ is considered for the above iteration, and the variables are initialized with $u^{(1)}=w^{(0)}=x^{(0)}$, $v^{(1)}=w^{(1)}$. For convex and smooth objective functions, it is proved that $f(v^{(t)})-f(x^{*})\leq\mathcal{O}(1/t^{2})$ [35].
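For comparison with the plain method above, here is a minimal sketch (again an illustration, not the paper's code) of the recursion (9)-(10) with $a_{t}=a(t+1)$ and the same projection-based $\nabla d^{*}$; `grad_f` and `project_onto_X` are assumed callables.

```python
import numpy as np

def accelerated_dual_averaging(grad_f, project_onto_X, x0, a, num_iters):
    u, w = x0.copy(), x0.copy()          # u^(1) = w^(0) = x^(0)
    v = None
    z = np.zeros_like(x0)                # running sum of a_tau * grad f(u^(tau))
    A = 0.0
    for t in range(1, num_iters + 1):
        a_t = a * (t + 1)
        A_prev, A = A, A + a_t
        if t > 1:
            u = (A_prev / A) * v + (a_t / A) * w               # extrapolation step (9a)
        z += a_t * grad_f(u)
        w = project_onto_X(x0 - z)                             # w^(t) from (10)
        v = w if t == 1 else (A_prev / A) * v + (a_t / A) * w  # averaging step (9b)
    return v                             # f(v^(t)) - f(x*) decays at O(1/t^2) [35]
```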

III Algorithms and Convergence Results

In this section, we develop two new DDA algorithms that are provably more efficient than existing DDA-type algorithms.

III-A Decentralized Dual Averaging

To solve Problem (1) in a decentralized manner, we propose a novel DDA algorithm. In particular, we employ the following dynamic average consensus protocol to estimate $z^{(t)}$ in (8):

z_{i}^{(t)}=\sum_{j=1}^{n}p_{ij}\left(z_{j}^{(t-1)}+s_{j}^{(t-1)}\right), (11a)
s_{i}^{(t)}=\sum_{j=1}^{n}p_{ij}s_{j}^{(t-1)}+\nabla f_{i}(x^{(t)}_{i})-\nabla f_{i}(x^{(t-1)}_{i}), (11b)

where $z_{i}^{(t)}$ is a local estimate of $z^{(t)}$ generated by agent $i$, and $s_{i}^{(t)}$ is a proxy of $\frac{1}{n}\sum_{i=1}^{n}\nabla f_{i}(x_{i}^{(t)})$ which aims to reduce the consensus error among the variables $\{z_{i}^{(t)}:i=1,\cdots,n\}_{t\geq 0}$. Using it, each agent $i$ performs a local computation step to update its estimate of $x^{(t)}$:

{x}_{i}^{(t)}=\nabla d^{*}\left(-az_{i}^{(t)}\right). (12)

The overall algorithm is summarized in Algorithm 1.

Algorithm 1 Decentralized Dual Averaging
  Input: $a>0$, $x^{(0)}\in\mathcal{X}$, and a strongly convex function $d$ with modulus $1$ such that (5) holds
  Initialize: $x_{i}^{(0)}=x^{(0)}$, $z_{i}^{(0)}=0$, and $s_{i}^{(0)}=\nabla f_{i}(x^{(0)})$ for all $i=1,\dots,n$
  for $t=1,2,\cdots$ do
     In parallel (task for agent $i$, $i=1,\dots,n$)
     collect $z_{j}^{(t-1)}$ and $s_{j}^{(t-1)}$ from all agents $j\in\mathcal{N}_{i}$
     update $z_{i}^{(t)}$ and $s_{i}^{(t)}$ by (11)
     compute $x_{i}^{(t)}$ by (12)
     broadcast $z_{i}^{(t)}$ and $s_{i}^{(t)}$ to all agents $j\in\mathcal{N}_{i}$
  end for
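The following is a minimal simulation sketch (not part of the paper) of Algorithm 1 in which all agents are updated inside one loop; in a real deployment each agent would only exchange $z_{i}$, $s_{i}$ with its neighbors. `grads` is an assumed list of per-agent gradient callables and `project_onto_X` an assumed Euclidean projection corresponding to $d(x)=\lVert x-x^{(0)}\rVert^{2}/2$.

```python
import numpy as np

def dda(P, grads, project_onto_X, x0, a, num_iters):
    n = P.shape[0]
    x = np.tile(x0, (n, 1))                              # x_i^(0) = x^(0)
    z = np.zeros_like(x)                                 # z_i^(0) = 0
    g = np.array([grads[i](x[i]) for i in range(n)])
    s = g.copy()                                         # s_i^(0) = grad f_i(x^(0))
    for _ in range(num_iters):
        z = P @ (z + s)                                  # consensus on dual estimates, (11a)
        x = np.array([project_onto_X(x0 - a * z[i]) for i in range(n)])   # local step (12)
        g_new = np.array([grads[i](x[i]) for i in range(n)])
        s = P @ s + g_new - g                            # gradient tracking, (11b)
        g = g_new
    return x.mean(axis=0)
```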

Before proceeding, we make the following remarks on Algorithm 1.

i) Subproblem solvability. Similar to centralized dual averaging methods, we assume the subproblem in (12) can be solved easily. For general problems, we can choose $d(x)=\lVert x-x^{(0)}\rVert^{2}/2$ such that the subproblem (12) reduces to computing the projection of variables onto $\mathcal{X}$. If $\mathcal{X}$ is simple enough, e.g., the simplex or the $l_{1}$-norm ball, a closed-form solution exists.
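For the $l_{1}$-norm ball $\{x:\lVert x\rVert_{1}\leq R\}$ specifically, the projection can be computed exactly by a sort-based routine; the sketch below follows the standard approach (cf. [41]) and is included only for illustration.

```python
import numpy as np

def project_l1_ball(v, R):
    # Euclidean projection of v onto {x : ||x||_1 <= R} via soft-thresholding.
    if np.abs(v).sum() <= R:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]                  # magnitudes in decreasing order
    cssv = np.cumsum(u)
    k = np.arange(1, v.size + 1)
    rho = np.nonzero(u - (cssv - R) / k > 0)[0][-1]
    theta = (cssv[rho] - R) / (rho + 1.0)         # optimal threshold
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)
```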

ii) Comparison with existing DDA algorithms. In existing DDA algorithms [26, 29, 28], each agent estimates $z^{(t)}$ in the following way:

z_{i}^{(t)}=\sum_{j=1}^{n}p_{ij}z_{j}^{(t-1)}+\nabla f_{i}(x_{i}^{(t)}). (13)

For this scheme, it is proved that the consensus error among the variables $\{z_{i}^{(t)}:i=1,\cdots,n\}_{t\geq 0}$ admits a constant upper bound [26], which necessitates the use of a monotonically decreasing sequence $\{a_{t}\}_{t\geq 0}$ for convergence. However, decreasing $a_{t}$ slows down the convergence significantly; the rate of convergence in [26, 29] is reported to be $\mathcal{O}(1/\sqrt{t})$. To speed up the convergence, we develop the consensus protocol in (11), which is inspired by the high-order consensus scheme in [6]. Thanks to it, we are able to prove that the deviation among the variables $\{z_{i}^{(t)}:i=1,\cdots,n\}_{t\geq 0}$ asymptotically vanishes as time evolves. Therefore, the parameter in (12) can be set constant, i.e., $a_{t}=a>0$, which is key to obtaining the improved rates.

iii) Comparison with DGD algorithms. In existing DGD-type algorithms, a so-called gradient-tracking process similar to (11) is usually observed:

\begin{split}x_{i}^{(t)}&=\sum_{j=1}^{n}p_{ij}\left(x_{j}^{(t-1)}+as_{j}^{(t-1)}\right),\\ s_{i}^{(t)}&=\sum_{j=1}^{n}p_{ij}s_{j}^{(t-1)}+\nabla f_{i}(x^{(t)}_{i})-\nabla f_{i}(x^{(t-1)}_{i}),\end{split} (14)

where $a$ represents the step size. The proposed scheme (11) differs from (14) in step (11a). With such a deliberate design and the additional local dual averaging step in (12), Algorithm 1 solves constrained problems with a convergence rate guarantee. To compare DDA with existing algorithms applicable to solving Problem (1), we recall the PG-EXTRA algorithm [21]:

\begin{split}\hat{x}_{i}^{(t+1)}=&\sum_{j=1}^{n}p_{ij}{x}_{j}^{(t)}+\hat{x}_{i}^{(t)}-\sum_{j=1}^{n}\tilde{p}_{ij}{x}_{j}^{(t-1)}-a\left(\nabla f_{i}(x_{i}^{(t)})-\nabla f_{i}(x_{i}^{(t-1)})\right)\\ {x}_{i}^{(t+1)}=&\operatorname*{argmin}_{x\in\mathcal{X}}\left\lVert x-\hat{x}_{i}^{(t+1)}\right\rVert^{2}\end{split} (15)

where $a$ represents the step size and $\tilde{p}_{ij}$ denotes the $(i,j)$-th entry of $\tilde{P}=(P+I)/2$. Notably, PG-EXTRA seeks consensus among the variables $\{{x}_{i}^{(t)}:i=1,\cdots,n\}$ at time $t+1$ that are obtained via a projection operator at time $t$. In contrast, DDA manages to agree on $\{{z}_{i}^{(t)}:i=1,\cdots,n\}$, which essentially decouples the consensus-seeking procedure from the projection. After using the smoothness assumption in (3) to bound $\lVert\nabla f_{i}(x^{(t)}_{i})-\nabla f_{i}(x^{(t-1)}_{i})\rVert$ in (11b), the iteration in (11) can be kept almost linear, which greatly facilitates the rate analysis; see the proof of Lemma 5 for more details.
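For completeness, the next sketch simulates the PG-EXTRA recursion (15) in the same style; it is illustrative only, and the initialization $\hat{x}_{i}^{(1)}=\sum_{j}p_{ij}x_{j}^{(0)}-a\nabla f_{i}(x^{(0)})$ is an assumption (the standard PG-EXTRA start) rather than something specified above.

```python
import numpy as np

def pg_extra(P, grads, project_onto_X, x0, a, num_iters):
    n = P.shape[0]
    P_tilde = (P + np.eye(n)) / 2
    x_prev = np.tile(x0, (n, 1))
    g_prev = np.array([grads[i](x_prev[i]) for i in range(n)])
    x_hat = P @ x_prev - a * g_prev                       # assumed first step
    x = np.array([project_onto_X(x_hat[i]) for i in range(n)])
    for _ in range(num_iters - 1):
        g = np.array([grads[i](x[i]) for i in range(n)])
        x_hat = P @ x + x_hat - P_tilde @ x_prev - a * (g - g_prev)   # cf. (15)
        x_prev, g_prev = x, g
        x = np.array([project_onto_X(x_hat[i]) for i in range(n)])    # projection step
    return x.mean(axis=0)
```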

Next, we present the convergence result of Algorithm 1. Inspired by [26], we first establish the convergence property of an auxiliary sequence $\{{y}^{(t)}\}_{t\geq 0}$, which is instrumental in proving the convergence of $\{x_{i}^{(t)}:i=1,\cdots,n\}_{t\geq 0}$. In particular, the update of ${y}^{(t)}$ obeys

{y}^{(t)}=\nabla d^{*}\left(-a\overline{z}^{(t)}\right), (16)

where the initial vector ${y}^{(0)}=x^{(0)}$ and $\overline{z}^{(t)}=\frac{1}{n}\sum_{i=1}^{n}z_{i}^{(t)}$. To proceed, we introduce the following $2\times 2$ matrix:

{\bf M}=\begin{bmatrix}\beta&\beta\\ aL(\beta+1)&\beta(aL+1)\end{bmatrix} (17)

where $\beta=\sigma_{2}(P)$, and let $\rho({\bf M})$ be the spectral radius of ${\bf M}$.

Theorem 1.

Suppose that Assumptions 1 and 2 are satisfied. If the constant $a$ in Algorithm 1 satisfies

\frac{1}{a}>2L\cdot\max\left\{\frac{\beta}{(1-\beta)^{2}},1+\frac{8}{9\left(1-(\rho({\bf M}))^{2}\right)}\right\}, (18)

then, for all $t\geq 1$, it holds that

f(\tilde{y}^{(t)})-f(x^{*})\leq\frac{C}{at}, (19)

where $\tilde{y}^{(t)}=t^{-1}\sum_{\tau=1}^{t}{y}^{(\tau)}$ with ${y}^{(\tau)}$ defined in (16),

C:=d(x^{*})+\frac{8a\pi^{2}}{9nL\left(1-(\rho({\bf M}))^{2}\right)},

and

\pi^{2}=\sum_{i=1}^{n}\left\|\nabla f_{i}(x^{(0)})-\frac{1}{n}\sum_{j=1}^{n}\nabla f_{j}(x^{(0)})\right\|^{2}. (20)

In addition, for all $t\geq 1$ and $i=1,\cdots,n$, we have

\lVert\tilde{x}_{i}^{(t)}-\tilde{y}^{(t)}\rVert^{2}\leq\frac{D}{t} (21)

where $\tilde{x}_{i}^{(t)}=t^{-1}\sum_{\tau=1}^{t}{x}_{i}^{(\tau)}$,

D:=\frac{8nC}{9\gamma\left(1-\rho({\bf M})\right)^{2}}+\frac{8\pi^{2}}{9L^{2}\left(1-(\rho({\bf M}))^{2}\right)},

and

\gamma:=\frac{1}{2}-aL-\frac{8aL}{9\left(1-(\rho({\bf M}))^{2}\right)}. (22)
Proof.

The proof is postponed to Appendix B. ∎

To obtain a more explicit version of (18), we identify the eigenvalues of ${\bf M}$ as $\lambda_{1}=(\xi_{1}+\xi_{2})/2$ and $\lambda_{2}=(\xi_{1}-\xi_{2})/2$, where

\xi_{1}=\beta(2+aL)>0,\quad\xi_{2}=\sqrt{a^{2}\beta^{2}L^{2}+4aL\beta(\beta+1)}>0. (23)

Thus, we have $\lvert\lambda_{1}\rvert>\lvert\lambda_{2}\rvert$ and $\rho({\bf M})=\lambda_{1}>0$. By routine calculation, we can verify that $\rho({\bf M})$ and $\nu(a):=\frac{8}{9(1-(\rho({\bf M}))^{2})}$ monotonically increase with $a$. Due to

\nu\left(\frac{1}{2L}\right)<\frac{1}{1-\left(\frac{2.5\beta+\sqrt{2.25\beta^{2}+2\beta}}{2}\right)^{2}},

we have that as long as $a$ satisfies

\frac{1}{a}>2L\cdot\max\left\{\frac{\beta}{(1-\beta)^{2}},1+\frac{1}{1-\left(\frac{2.5\beta+\sqrt{2.25\beta^{2}+2\beta}}{2}\right)^{2}}\right\}, (24)

then $a$ also satisfies (18). Based on (24), we have

a=\Theta\left(\frac{(1-\beta)^{2}}{L}\right),

whose size is comparable to that used in the DGD algorithms [38, 8, 39] in the literature.
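As a practical aid (not part of the analysis), one can also check the original condition (18) numerically: compute $\rho({\bf M})$ from the eigenvalue formulas (23) for a candidate $a$ and verify the inequality directly, as in the sketch below.

```python
import numpy as np

def satisfies_condition_18(a, L, beta):
    # rho(M) = lambda_1 = (xi_1 + xi_2) / 2, cf. (23)
    xi1 = beta * (2 + a * L)
    xi2 = np.sqrt(a**2 * beta**2 * L**2 + 4 * a * L * beta * (beta + 1))
    rho = (xi1 + xi2) / 2
    if rho >= 1:                       # Lemma 4 requires rho(M) < 1
        return False
    rhs = 2 * L * max(beta / (1 - beta)**2, 1 + 8 / (9 * (1 - rho**2)))
    return 1.0 / a > rhs               # condition (18)
```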

Next, we consider an unconstrained version of Problem (1), i.e., $\mathcal{X}=\mathbb{R}^{m}$, where the rate of convergence is stated for $f(\tilde{x}_{i}^{(t)})-f(x^{*})$.

Corollary 1.

Suppose the premise of Theorem 1 holds. If $\mathcal{X}=\mathbb{R}^{m}$ in (1) and $d(x)=\lVert x\rVert^{2}/2$ in (12), and

\frac{1}{a}>2L\cdot\max\left\{\frac{\beta}{(1-\beta)^{2}},1+\frac{8}{3\left(1-(\rho({\bf M}))^{2}\right)}\right\}, (25)

then

f(\tilde{x}_{i}^{(t)})-f(x^{*})\leq\frac{1}{t}\left(\frac{n}{2a}\lVert x^{*}\rVert^{2}+\frac{8\pi^{2}}{3L\left(1-(\rho({\bf M}))^{2}\right)}\right) (26)

where $\tilde{x}_{i}^{(t)}=t^{-1}\sum_{\tau=1}^{t}{x}_{i}^{(\tau)}$ and $\pi^{2}$ is defined in (20).

Proof.

The proof is given in Appendix C. ∎

III-B Accelerated Decentralized Dual Averaging

To further speed up the convergence, we develop a decentralized variant of the accelerated dual averaging method in (9) and (10). Different from Algorithm 1, we consider building consensus among the variables $\{v_{i}^{(t)},i=1,\cdots,n\}$ and propose the following iteration rule:

u^{(t)}_{i}=\frac{A_{t-1}}{A_{t}}\sum_{j=1}^{n}p_{ij}v^{(t-1)}_{j}+\frac{a_{t}}{A_{t}}{w}_{i}^{(t-1)} (27a)
v^{(t)}_{i}=\frac{A_{t-1}}{A_{t}}\sum_{j=1}^{n}p_{ij}v^{(t-1)}_{j}+\frac{a_{t}}{A_{t}}{w}_{i}^{(t)}, (27b)

where

{w}_{i}^{(t)}=\nabla d^{*}\left(-\sum_{\tau=1}^{t}a_{\tau}q_{i}^{(\tau)}\right), (28)

and

q_{i}^{(t)}=\sum_{j=1}^{n}p_{ij}q_{j}^{(t-1)}+\nabla f_{i}(u^{(t)}_{i})-\nabla f_{i}(u^{(t-1)}_{i}). (29)

The overall algorithm is summarized in Algorithm 2. It is worth mentioning that agents in Algorithm 2 consume the same communication resources as in Algorithm 1 to achieve acceleration.

Algorithm 2 Accelerated Decentralized Dual Averaging
  Input: $a>0$, $x^{(0)}\in\mathcal{X}$, and a strongly convex function $d$ with modulus $1$ such that (5) holds
  Initialize: $A_{1}=a_{1}=2a$, $u_{i}^{(1)}=w_{i}^{(0)}=x^{(0)}$, $q_{i}^{(1)}=\nabla f_{i}(x^{(0)})$, and $v_{i}^{(1)}=w_{i}^{(1)}$ for all $i=1,\dots,n$
  for $t=2,3,\cdots$ do
     set $a_{t}=a_{t-1}+a$ and $A_{t}=A_{t-1}+a_{t}$
     In parallel (task for agent $i$, $i=1,\dots,n$)
     collect $v_{j}^{(t-1)}$ and $q_{j}^{(t-1)}$ from all agents $j\in\mathcal{N}_{i}$
     update $u_{i}^{(t)}$ by (27a)
     update $q_{i}^{(t)}$ by (29)
     compute $w_{i}^{(t)}$ by (28)
     update $v_{i}^{(t)}$ by (27b)
     broadcast $v_{i}^{(t)}$ and $q_{i}^{(t)}$ to all agents $j\in\mathcal{N}_{i}$
  end for
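Analogously to the sketch given for Algorithm 1, the following is a minimal, illustrative simulation of Algorithm 2 with all agents updated in one loop; `grads` and `project_onto_X` are assumed callables, and $d(x)=\lVert x-x^{(0)}\rVert^{2}/2$ is used so that (28) is a projection.

```python
import numpy as np

def adda(P, grads, project_onto_X, x0, a, num_iters):
    n = P.shape[0]
    u = np.tile(x0, (n, 1))                                   # u_i^(1) = x^(0)
    g_prev = np.array([grads[i](x0) for i in range(n)])
    q = g_prev.copy()                                         # q_i^(1) = grad f_i(x^(0))
    a_t = A = 2 * a                                           # a_1 = A_1 = 2a
    dual = a_t * q                                            # running sum of a_tau q_i^(tau)
    w = np.array([project_onto_X(x0 - dual[i]) for i in range(n)])   # w_i^(1), cf. (28)
    v = w.copy()                                              # v_i^(1) = w_i^(1)
    for t in range(2, num_iters + 1):
        a_t += a
        A_prev, A = A, A + a_t
        Pv = P @ v
        u_new = (A_prev / A) * Pv + (a_t / A) * w             # extrapolation, (27a)
        g_new = np.array([grads[i](u_new[i]) for i in range(n)])
        q = P @ q + g_new - g_prev                            # gradient tracking, (29)
        u, g_prev = u_new, g_new
        dual += a_t * q
        w = np.array([project_onto_X(x0 - dual[i]) for i in range(n)])   # (28)
        v = (A_prev / A) * Pv + (a_t / A) * w                 # averaging, (27b)
    return v.mean(axis=0)
```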

Assumption 3.

For the problem in (1), the constraint set $\mathcal{X}$ is bounded with the following diameter:

G=\max_{x,y\in\mathcal{X}}\lVert x-y\rVert.
Theorem 2.

For Algorithm 2, if Assumptions 1, 2, and 3 are satisfied, and

a\leq\frac{1}{6L}, (30)

then, for all $t\geq 1$, it holds that

f(\overline{v}^{(t)})-f(x^{*})\leq\frac{d(x^{*})}{A_{t}}+\frac{t}{A_{t}}\left(\frac{2G(LC_{p}+C_{g})}{\sqrt{n}}+\frac{6LC_{p}^{2}}{n}\right), (31)

where

C_{p}:=\left\lceil\frac{3}{1-\beta}\right\rceil\sqrt{n}G

and

C_{g}:=2L\left\lceil\frac{3}{1-\beta}\right\rceil\frac{\sqrt{n}G+C_{p}}{1-\beta}.

In addition, for all $t\geq 1$ and $i=1,\cdots,n$, we have

\left\lVert{v}_{i}^{(t)}-\overline{v}^{(t)}\right\rVert^{2}\leq\frac{2aC_{p}}{A_{t}}. (32)
Proof.

The proof is postponed to Appendix D. ∎

For Algorithm 2 and Theorem 2, the following remarks are in order.

i) Comparison with existing accelerated algorithms. Accelerated methods for decentralized constrained optimization are rarely reported in the literature. Recently, the authors in [22] developed the APM algorithm, where the iteration rule reads

y_{i}^{(t)}={x}_{i}^{(t)}+\frac{\theta_{t}(1-\theta_{t-1})}{\theta_{t-1}}\left({x}_{i}^{(t)}-{x}_{i}^{(t-1)}\right) (33a)
{s}_{i}^{(t)}=\nabla f_{i}(y_{i}^{(t)})+\frac{\beta_{0}}{\theta_{t}}\sum_{j=1}^{n}p_{ij}\left(y_{i}^{(t)}-y_{j}^{(t)}\right) (33b)
{x}_{i}^{(t+1)}=\arg\min_{{x}\in\mathcal{X}}\left\lVert x-y_{i}^{(t)}+\frac{s_{i}^{(t)}}{L+\beta_{0}/\theta_{t}}\right\rVert^{2} (33c)

where $\beta_{0}={L}/{\sqrt{1-\lambda_{2}(P)}}$ and $\theta_{t}$ is a decreasing parameter satisfying $\theta_{t}={\theta_{t-1}}/{(1+\theta_{t-1})}$ with $\theta_{0}=1$. Letting $\hat{s}_{i}^{(t)}=\theta_{t}s_{i}^{(t)}$, we can equivalently rewrite (33b) and (33c) as

\begin{split}\hat{s}_{i}^{(t)}=&{\theta_{t}}\nabla f_{i}(y_{i}^{(t)})+{\beta_{0}}\sum_{j=1}^{n}p_{ij}\left(y_{i}^{(t)}-y_{j}^{(t)}\right)\\ x_{i}^{(t+1)}=&\arg\min_{{x}\in\mathcal{X}}\left\lVert x-y_{i}^{(t)}+\frac{\hat{s}_{i}^{(t)}}{L\theta_{t}+\beta_{0}}\right\rVert^{2},\end{split}

from which we can see that new gradients are assigned decreasing weights, whereas increasing weights are used for ADDA in (28). The reason for such different choices of parameters may be two-fold. First, the parameter choices in (centralized) primal gradient descent and dual averaging methods are intrinsically different. Second, APM gradually increases the penalty parameter $1/\theta_{t}$ in order to enforce consensus, which essentially dilutes the weight on gradients, as shown above. We will show in the simulation that decreasing the weights over time slows down convergence. There are also a few other accelerated decentralized methods such as [9, 14]; however, they do not apply to constrained problems.

ii) Discussion about optimality. For ADDA, the rate of convergence is proved to be

\mathcal{O}(1)\left(\frac{1}{t^{2}}+\frac{1}{t(1-\beta)^{2}}\right).

In light of the lower bound in [14], it is not optimal in terms of the dependence on $\beta$. In particular, the dominant error term $\mathcal{O}(1/(t(1-\beta)^{2}))$ becomes larger as $\beta$ grows, i.e., as the network becomes more sparsely connected. This is mainly because we consider a one-consensus-one-gradient update in the algorithm. However, extending the algorithm in [14] to handle constraints may require further investigation. In the simulation section, we demonstrate the superiority of ADDA over existing decentralized constrained optimization algorithms.

IV Simulation

In this section, we verify the proposed methods by applying them to solve the following constrained LASSO problems:

\min_{x\in\mathbb{R}^{m}}\left\{f(x)=\frac{1}{2n}\sum_{i=1}^{n}\lVert M_{i}x-c_{i}\rVert^{2}\right\},\quad\mathrm{s.t.}\;\;\lVert x\rVert_{1}\leq R

where $M_{i}\in\mathbb{R}^{p_{i}\times m}$, $c_{i}\in\mathbb{R}^{p_{i}}$, and $R$ is a constant parameter that defines the constraint. In the simulation, each agent $i$ has access to the local data tuple $(M_{i},c_{i})$ and $R$. Two different problem instances characterized by real and synthetic datasets are considered.

IV-A Case I: Real Dataset

In this setting, we use sparco7 [40, 17] to define the LASSO problem, and consider a cycle graph and a complete graph of $n=50$ nodes. The corresponding weight matrix $P$ is determined by following the Metropolis-Hastings rule [36]. Each local measurement matrix $M_{i}\in\mathbb{R}^{12\times 2560}$, and the local corrupted measurement $c_{i}\in\mathbb{R}^{12}$. The constraint parameter is set as $R=1.1\cdot\lVert x_{g}\rVert_{1}$, where $x_{g}$ with $\lVert x_{g}\rVert_{0}=20$ denotes the unknown variable to be recovered via solving the LASSO problem. In this case, the simulation experiments were performed using MATLAB R2020b.

For comparison, the PG-EXTRA method in [21] and the APM method in [22] are simulated. For their algorithmic parameters, the step size for PG-EXTRA is set as $10^{-4}$, and the parameters for APM are set as $L=250$ and $\beta_{0}=L/\sqrt{1-\lambda_{2}(P)}$. For DDA and ADDA in this work, we use $a=5\cdot 10^{-4}$ and $a_{t}=(t+1)\cdot 10^{-4}$, respectively, and $\lVert x\rVert^{2}/2$ as the prox-function. The projection onto an $l_{1}$-norm ball is carried out via the algorithm in [41]. All the methods are initialized with $x_{i}^{(0)}=0,\forall i\in\mathcal{V}$.

The performance of the four algorithms is displayed in Figs. 1 and 2. In Fig. 1, the performance is evaluated in terms of the objective error $f(\frac{1}{n}\sum_{i=1}^{n}x_{i}^{(t)})-f(x^{*})$, where $x^{*}$ is identified using CVX [42]. It demonstrates that the DDA method outperforms the other methods when the graph is a cycle. As the graph becomes denser, i.e., complete, the convergence of all algorithms becomes faster. Among them, the ADDA method demonstrates the most significant improvement. This is in line with Theorem 2, where the network connectivity impacts the convergence error in the $\mathcal{O}(1/t)$ term as opposed to the $\mathcal{O}(1/t^{2})$ term. In Fig. 2, we compare the trajectories of the consensus error, i.e., $\sqrt{\sum_{i=1}^{n}\lVert x_{i}^{(t)}-n^{-1}\sum_{j=1}^{n}x_{j}^{(t)}\rVert^{2}}$, for all methods. When the graph is a cycle, APM and PG-EXTRA have smaller consensus errors than the developed methods, mainly because they build consensus directly among the variables $\{x_{i}^{(t)}:i=1,\cdots,n\}_{t\geq 0}$. When the graph is complete, the consensus error of the proposed DDA method vanishes because of the conservation property established in Lemma 3 and the complete graph structure.

IV-B Case II: Synthetic Dataset

For the synthetic dataset, the parameters are set as $n=8$, $m=30000$, $p_{i}=2000,\forall i\in\mathcal{V}$, and the data is generated in the following way. First, each local measurement matrix $M_{i}$ is randomly generated, with each entry following the normal distribution $\mathcal{N}(0,1)$. Next, each entry of the sparse vector $x_{g}$ to be recovered via LASSO is randomly generated from the normal distribution $\mathcal{N}(0,1)$, with $\lVert x_{g}\rVert_{0}=1500$. Then the corrupted measurement $c_{i}$ is produced based on

c_{i}=M_{i}x_{g}+b_{i}

where $b_{i}$ represents Gaussian noise with zero mean and variance $0.01$. The constraint parameter is set as $R=1.1\cdot\lVert x_{g}\rVert_{1}$. For this setting, we employed the message passing interface (MPI) in Python 3.7.3 to simulate a network of $8$ nodes, where each node $i$ is connected to the subset of nodes $\{1+i\,\mathrm{mod}\,8,1+(i+3)\,\mathrm{mod}\,8,1+(i+6)\,\mathrm{mod}\,8\}$. For comparison, the proposed methods are compared to their centralized counterparts. The parameters for dual averaging and accelerated dual averaging are set as $a=1/(3\cdot 10^{5})$ and $a_{t}=a(t+1)$, respectively. Similarly, the function $\lVert x\rVert^{2}/2$ is used as the prox-function, and the algorithms are initialized with $x_{i}^{(0)}=0,\forall i\in\mathcal{V}$.
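The data-generation procedure just described can be reproduced with a few lines of numpy; the snippet below is a rough, single-process sketch (the actual experiments used MPI across 8 nodes), with the random seed chosen arbitrarily.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, p_i, nnz = 8, 30000, 2000, 1500

x_g = np.zeros(m)
support = rng.choice(m, size=nnz, replace=False)
x_g[support] = rng.standard_normal(nnz)                # sparse ground truth, ||x_g||_0 = 1500

M = [rng.standard_normal((p_i, m)) for _ in range(n)]  # local measurement matrices
c = [M[i] @ x_g + 0.1 * rng.standard_normal(p_i) for i in range(n)]  # noise variance 0.01
R = 1.1 * np.linalg.norm(x_g, 1)                       # radius of the l1-norm constraint
```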

The performance of the developed algorithms and their centralized counterparts is illustrated in Fig. 3. In particular, the performance is evaluated in terms of the objective function value versus computing time. It demonstrates that the proposed methods outperform the corresponding centralized algorithms in the sense that the decentralized algorithms consume less computing time than their centralized counterparts to reach the same degree of accuracy.

Figure 1: Comparison of objective error in Case I.
Figure 2: Comparison of consensus error in Case I.
Figure 3: Comparison of objective value in Case II.

V Conclusion

In this work, we have designed two DDA algorithms for solving decentralized constrained optimization problems with improved rates of convergence. In the first one, a novel second-order dynamic average consensus scheme is developed, with which each agent locally generates a more accurate estimate of the global dual variable than in existing methods under mild assumptions. This property enables each agent to use a large constant weight in the local dual averaging step, and therefore improves the rate of convergence. In the second algorithm, each agent retains the conventional first-order dynamic average consensus method to estimate the average of local gradients. Instead, the extrapolation technique together with the average consensus protocol is used to achieve acceleration over a decentralized network.

This work opens several avenues for future research. In this work, we focus on the basic setting with time-invariant bidirectional communication networks. We believe that the consensus-based dual averaging framework can be extended to tackle decentralized constrained optimization in complex networks, e.g., directed networks [43] and time-varying networks [38]. Furthermore, we expect that our approach, as demonstrated by its centralized counterpart, i.e., follow-the-regularized-leader, may deliver superb performance in the online optimization setting [44].

Appendix A Roadmap for the proofs

Figure 4: Relation among the convergence results.

Before proceeding to the proofs, we present Fig. 4 to illustrate how they relate to each other.

Appendix B Proof of Theorem 1

B-A Preliminaries

We introduce several notations to facilitate the presentation of the proof. Let

{\bf x}^{(t)}=\begin{bmatrix}x_{1}^{(t)}\\ \vdots\\ x_{n}^{(t)}\end{bmatrix},\quad{\bf z}^{(t)}=\begin{bmatrix}z_{1}^{(t)}\\ \vdots\\ z_{n}^{(t)}\end{bmatrix},\quad{\bf s}^{(t)}=\begin{bmatrix}s_{1}^{(t)}\\ \vdots\\ s_{n}^{(t)}\end{bmatrix},
{\bf y}^{(t)}=\begin{bmatrix}y^{(t)}\\ \vdots\\ y^{(t)}\end{bmatrix},\quad{\bf\nabla}^{(t)}=\begin{bmatrix}\nabla f_{1}(x_{1}^{(t)})\\ \vdots\\ \nabla f_{n}(x_{n}^{(t)})\end{bmatrix},
\overline{g}^{(t)}=\frac{1}{n}\sum_{i=1}^{n}\nabla f_{i}(x_{i}^{(t)}),\quad\overline{s}^{(t)}=\frac{1}{n}\sum_{i=1}^{n}s_{i}^{(t)},\quad\overline{z}^{(t)}=\frac{1}{n}\sum_{i=1}^{n}z_{i}^{(t)}.

Using these notations, we express (11) in the following compact form:

{\bf z}^{(t)}={\bf P}\left({\bf z}^{(t-1)}+{\bf s}^{(t-1)}\right), (34a)
{\bf s}^{(t)}={\bf P}{\bf s}^{(t-1)}+\nabla^{(t)}-\nabla^{(t-1)}, (34b)

where ${\bf P}=P\otimes I$. Before proceeding to the proof of Theorem 1, we present several technical lemmas.

Recall a lemma from [39].

Lemma 2.

Suppose that $\{\varepsilon^{(t)}\}_{t\geq 0}$ and $\{\epsilon^{(t)}\}_{t\geq 0}$ are two sequences of positive scalars such that for all $t\geq 0$,

\varepsilon^{(t)}\leq\delta^{t}c+\sum_{\tau=0}^{t-1}\delta^{t-\tau-1}\epsilon^{(\tau)}

where $\delta\in(0,1)$ and $c\geq 0$ is a constant. Then, the following holds for all $t\geq 1$:

\sum_{\tau=1}^{t}(\varepsilon^{(\tau)})^{2}\leq\frac{2}{(1-\delta)^{2}}\sum_{\tau=0}^{t-1}(\epsilon^{(\tau)})^{2}+\frac{2c^{2}}{1-\delta^{2}}.
Lemma 3.

For Algorithm 1, we have that for any $t\geq 0$,

\overline{s}^{(t)}=\overline{g}^{(t)},\quad\overline{z}^{(t+1)}=\sum_{\tau=0}^{t}\overline{s}^{(\tau)}. (35)
Proof of Lemma 3.

We prove by induction. For $t=0$, (35) is readily satisfied since $s_{i}^{(0)}=\nabla f_{i}(x^{(0)})$ and $z_{i}^{(0)}=0$ for all $i$. Suppose that (35) holds for $t-1$. Using (34b),

(A\otimes B)(C\otimes D)=(AC\otimes BD), (36)

and the double stochasticity of $P$, we have

\begin{split}\overline{s}^{(t)}&=\frac{1}{n}({\bf 1}^{\mathrm{T}}\otimes I)\left({\bf P}{\bf s}^{(t-1)}+\nabla^{(t)}-\nabla^{(t-1)}\right)\\ &=\frac{1}{n}({\bf 1}^{\mathrm{T}}\otimes I)\left(({P}\otimes I){\bf s}^{(t-1)}+\nabla^{(t)}-\nabla^{(t-1)}\right)\\ &=\frac{1}{n}\left(({\bf 1}^{\mathrm{T}}P)\otimes I\right){\bf s}^{(t-1)}+\overline{g}^{(t)}-\overline{g}^{(t-1)}\\ &=\overline{s}^{(t-1)}+\overline{g}^{(t)}-\overline{g}^{(t-1)}=\overline{g}^{(t)}.\end{split}

Upon using a similar argument for $\overline{z}^{(t)}$, we have

\begin{split}\overline{z}^{(t+1)}&=\frac{1}{n}({\bf 1}^{\mathrm{T}}\otimes I){(P\otimes I)}\left({\bf z}^{(t)}+{\bf s}^{(t)}\right)\\ &=\overline{z}^{(t)}+\overline{s}^{(t)}=\sum_{\tau=0}^{t}\overline{s}^{(\tau)}.\end{split}

Therefore, (35) holds for all $t$. ∎

Lemma 4.

If the parameter $a$ satisfies (18), then $\rho({\bf M})<1$, where ${\bf M}$ is defined in (17).

Proof of Lemma 4.

Recall the two real eigenvalues of ${\bf M}$ in (23). Since $\xi_{1}>0$ and $\xi_{2}>0$, we have $\rho({\bf M})=\lvert\lambda_{1}\rvert>\lvert\lambda_{2}\rvert$. Also, one can verify that $\rho({\bf M})$ monotonically increases with $a$. The solution to the equation $\lvert\lambda_{1}\rvert=1$ can be identified as $a=(1-\beta)^{2}/(2\beta L)$. Therefore, we conclude that $\rho({\bf M})<1$ as long as (18) is satisfied. ∎

The following lemma establishes the relation between the sequences $\{x_{i}^{(t)}\}_{t\geq 0}$ and $\{{y}^{(t)}\}_{t\geq 0}$.

Lemma 5.

Suppose that Assumptions 1 and 2 hold and the parameter $a$ in Algorithm 1 satisfies (18). Then, for all $t\geq 0$, it holds that

\begin{split}&\sum_{\tau=1}^{t}\lVert{\bf x}^{(\tau)}-\mathbf{y}^{(\tau)}\rVert^{2}\\ &\leq\frac{8}{9\left(1-\rho({\bf M})\right)^{2}}\sum_{\tau=0}^{t-1}\lVert{\bf y}^{(\tau+1)}-{\bf y}^{(\tau)}\rVert^{2}+\frac{8\pi^{2}}{9L^{2}\left(1-(\rho({\bf M}))^{2}\right)}\end{split} (37)

where

\pi^{2}=\sum_{i=1}^{n}\left\|\nabla f_{i}(x^{(0)})-\overline{g}^{(0)}\right\|^{2}.
Proof of Lemma 5.

From Lemma 3, we have

\overline{z}^{(\tau)}=\overline{z}^{(\tau-1)}+\overline{g}^{(\tau-1)}.

Define

\tilde{{\bf s}}^{(t)}={\bf s}^{(t)}-{\bf 1}\otimes\overline{s}^{(t)},\quad\tilde{{\bf z}}^{(t)}={\bf z}^{(t)}-{\bf 1}\otimes\overline{z}^{(t)}. (38)

These in conjunction with (34a) lead to

\tilde{\bf z}^{(\tau)}=\mathbf{P}{\bf z}^{(\tau-1)}-\mathbf{1}\otimes\overline{z}^{(\tau-1)}+\mathbf{P}{\bf s}^{(\tau-1)}-\mathbf{1}\otimes\overline{s}^{(\tau-1)}. (39)

Because of $\mathbf{1}\otimes\bar{z}^{(\tau-1)}=(\mathbf{1}\otimes I)\bar{z}^{(\tau-1)}$ and (36), we have

\mathbf{1}\otimes\bar{z}^{(\tau-1)}=\frac{1}{n}(\mathbf{1}\otimes I)(\mathbf{1}^{\mathrm{T}}\otimes I)\mathbf{z}^{(\tau-1)}=\left(\frac{\mathbf{1}\mathbf{1}^{\mathrm{T}}}{n}\otimes I\right)\mathbf{z}^{(\tau-1)}.

In addition, we have

\begin{split}&\mathbf{P}{\bf z}^{(\tau-1)}-\mathbf{1}\otimes\overline{z}^{(\tau-1)}=(P\otimes I)\mathbf{z}^{(\tau-1)}-\left(\frac{\mathbf{1}\mathbf{1}^{\mathrm{T}}}{n}\otimes I\right)\mathbf{z}^{(\tau-1)}\\ &=\left(\left(P-\frac{\mathbf{1}\mathbf{1}^{\mathrm{T}}}{n}\right)\otimes I\right){\bf z}^{(\tau-1)}\\ &=\left(\left(P-\frac{\mathbf{1}\mathbf{1}^{\mathrm{T}}}{n}\right)\otimes I\right)\left(\tilde{\bf z}^{(\tau-1)}+(\mathbf{1}\otimes I)\bar{z}^{(\tau-1)}\right)\\ &=\left(\left(P-\frac{\mathbf{1}\mathbf{1}^{\mathrm{T}}}{n}\right)\otimes I\right)\tilde{\bf z}^{(\tau-1)}+\left(\left(P\mathbf{1}-\mathbf{1}\right)\otimes I\right)\bar{z}^{(\tau-1)}\\ &=\left(\left(P-\frac{\mathbf{1}\mathbf{1}^{\mathrm{T}}}{n}\right)\otimes I\right)\tilde{\bf z}^{(\tau-1)},\end{split} (40)

where the last equality is due to the double stochasticity of $P$. Using the same arguments as above for $\mathbf{P}{\bf s}^{(\tau-1)}-\mathbf{1}\otimes\overline{s}^{(\tau-1)}$ and Assumption 2, we have

\begin{split}\lVert\tilde{\bf z}^{(\tau)}\rVert&\leq\left\lVert\mathbf{P}{\bf z}^{(\tau-1)}-\mathbf{1}\otimes\overline{z}^{(\tau-1)}\right\rVert+\left\lVert\mathbf{P}{\bf s}^{(\tau-1)}-\mathbf{1}\otimes\overline{s}^{(\tau-1)}\right\rVert\\ &\leq\beta\lVert\tilde{\bf z}^{(\tau-1)}\rVert+\beta\lVert\tilde{\bf s}^{(\tau-1)}\rVert.\end{split} (41)

Similarly, from Lemma 3, we obtain

\begin{split}\lVert\tilde{\bf s}^{(\tau)}\rVert&=\left\lVert\mathbf{P}{{\bf s}}^{(\tau-1)}-\left(\mathbf{1}\otimes I\right)\overline{s}^{(\tau-1)}+\nabla^{(\tau)}-\nabla^{(\tau-1)}\right\rVert\\ &\leq\beta\lVert\tilde{\bf s}^{(\tau-1)}\rVert+\left\|\nabla^{(\tau)}-\nabla^{(\tau-1)}\right\|\\ &\leq\beta\lVert\tilde{\bf s}^{(\tau-1)}\rVert+L\lVert{\bf x}^{(\tau)}-{{\bf x}}^{(\tau-1)}\rVert\end{split} (42)

where the last inequality is due to Assumption 1. Using

\begin{split}\lVert{\bf x}^{(\tau)}-{{\bf x}}^{(\tau-1)}\rVert\leq&\lVert{\bf x}^{(\tau)}-{{\bf y}}^{(\tau)}\rVert+\lVert{\bf x}^{(\tau-1)}-{{\bf y}}^{(\tau-1)}\rVert\\ &+\lVert{\bf y}^{(\tau)}-{{\bf y}}^{(\tau-1)}\rVert,\end{split}

and Lemma 1, we obtain

\begin{split}&\lVert{\bf x}^{(\tau)}-{{\bf x}}^{(\tau-1)}\rVert\leq a\lVert\tilde{\mathbf{z}}^{(\tau)}\rVert+a\lVert\tilde{\mathbf{z}}^{(\tau-1)}\rVert+\lVert{\bf y}^{(\tau)}-{{\bf y}}^{(\tau-1)}\rVert\\ &\leq a(\beta+1)\lVert\tilde{\mathbf{z}}^{(\tau-1)}\rVert+a\beta\lVert\tilde{\mathbf{s}}^{(\tau-1)}\rVert+\lVert{\bf y}^{(\tau)}-{{\bf y}}^{(\tau-1)}\rVert.\end{split}

Upon substituting the above inequality into (42), we obtain

\begin{split}\lVert\tilde{\bf s}^{(\tau)}\rVert\leq&\beta(aL+1)\lVert\tilde{\bf s}^{(\tau-1)}\rVert+aL(\beta+1)\lVert\tilde{\bf z}^{(\tau-1)}\rVert\\ &+L\lVert{\bf y}^{(\tau)}-{{\bf y}}^{(\tau-1)}\rVert.\end{split} (43)

By combining (41) and (43), we establish the following inequality:

\begin{bmatrix}\lVert\tilde{\bf z}^{(\tau)}\rVert\\ \lVert\tilde{\bf s}^{(\tau)}\rVert\end{bmatrix}\leq\mathbf{M}\begin{bmatrix}\lVert\tilde{\bf z}^{(\tau-1)}\rVert\\ \lVert\tilde{\bf s}^{(\tau-1)}\rVert\end{bmatrix}+L\begin{bmatrix}0\\ \lVert{\bf y}^{(\tau)}-{{\bf y}}^{(\tau-1)}\rVert\end{bmatrix}

where $\mathbf{M}$ is defined in (17). By iterating the above inequality and using

\lVert\tilde{\bf z}^{(0)}\rVert=0,\quad\lVert\tilde{\bf s}^{(0)}\rVert=\sqrt{\sum_{i=1}^{n}\left\|\nabla f_{i}(x^{(0)})-\overline{g}^{(0)}\right\|^{2}}:=\pi,

we obtain

\begin{bmatrix}\lVert\tilde{\bf z}^{(t)}\rVert\\ \lVert\tilde{\bf s}^{(t)}\rVert\end{bmatrix}\leq L\sum_{\tau=0}^{t-1}\mathbf{M}^{t-\tau-1}\begin{bmatrix}0\\ \lVert{\bf y}^{(\tau+1)}-{{\bf y}}^{(\tau)}\rVert\end{bmatrix}+\mathbf{M}^{t}\begin{bmatrix}0\\ \pi\end{bmatrix}.

Recall the eigenvalues of the matrix $\mathbf{M}$ in (23). Then an analytical form can be presented for the $n$th power of $\mathbf{M}$ (see, e.g., [45]):

\mathbf{M}^{n}=\lambda_{1}^{n}\left(\frac{\mathbf{M}-\lambda_{2}I}{\lambda_{1}-\lambda_{2}}\right)-\lambda_{2}^{n}\left(\frac{\mathbf{M}-\lambda_{1}I}{\lambda_{1}-\lambda_{2}}\right).

Therefore, the $(1,2)$-th entry of $\mathbf{M}^{n}$ can be written as

(\mathbf{M}^{n})_{12}=\frac{\mathbf{M}_{12}(\lambda_{1}^{n}-\lambda_{2}^{n})}{\lambda_{1}-\lambda_{2}}=\frac{\beta(\lambda_{1}^{n}-\lambda_{2}^{n})}{\lambda_{1}-\lambda_{2}}\leq\frac{2\beta(\rho(\mathbf{M}))^{n}}{\xi_{2}}.

Due to the assumption that ${1}/{a}>{2\beta L}/{(1-\beta)^{2}}$ and $\beta\in(0,1)$, we have

\xi_{2}=\beta aL\sqrt{1+\frac{4(\beta+1)}{\beta aL}}>\beta aL\sqrt{1+\frac{8(\beta+1)}{(1-\beta)^{2}}}>3\beta aL.

Therefore,

\begin{split}\lVert\tilde{\bf z}^{(t)}\rVert&\leq\frac{2\beta{L}}{\xi_{2}}\sum_{\tau=0}^{t-1}\left(\rho(\mathbf{M})\right)^{t-\tau-1}\lVert{\bf y}^{(\tau+1)}-{\bf y}^{(\tau)}\rVert+\frac{2\beta}{\xi_{2}}\left(\rho(\mathbf{M})\right)^{t}\pi\\ &\leq\frac{2}{3a}\sum_{\tau=0}^{t-1}\left(\rho(\mathbf{M})\right)^{t-\tau-1}\lVert{\bf y}^{(\tau+1)}-{\bf y}^{(\tau)}\rVert+\frac{2}{3aL}\left(\rho(\mathbf{M})\right)^{t}\pi.\end{split} (44)

Using Lemma 1, we further obtain

\begin{split}\lVert{{\bf x}}^{(t)}-{{\bf y}}^{(t)}\rVert&\leq a\lVert\tilde{\bf z}^{(t)}\rVert\\ &\leq\frac{2}{3}\sum_{\tau=0}^{t-1}\left(\rho(\mathbf{M})\right)^{t-\tau-1}\lVert{\bf y}^{(\tau+1)}-{\bf y}^{(\tau)}\rVert+\frac{2}{3L}\left(\rho(\mathbf{M})\right)^{t}\pi.\end{split} (45)

Further using the above relation and Lemma 2, the inequality (37) follows as desired. ∎

Then, a lemma is stated for the prox-mapping.

Lemma 6.

Given a sequence of variables $\{\zeta^{(t)}\}_{t\geq 0}$ and a positive sequence $\{a_{t}\}_{t\geq 0}$, for $\{{\nu}^{(t)}\}_{t\geq 0}$ generated by

{\nu}^{(t)}=\nabla d^{*}\left(-\sum_{\tau=1}^{t}a_{\tau}\zeta^{(\tau)}\right),

where $\nu^{(0)}=x^{(0)}$ in (5), it holds that

\sum_{\tau=1}^{t}a_{\tau}\left\langle\zeta^{(\tau)},{\nu}^{(\tau)}-x^{*}\right\rangle\leq d(x^{*})-\sum_{\tau=1}^{t}\frac{1}{2}\lVert{\nu}^{(\tau)}-{\nu}^{(\tau-1)}\rVert^{2}. (46)
Proof of Lemma 6.

Define

m_{t}({x})=\sum_{\tau=1}^{t}a_{\tau}\left\langle\zeta^{(\tau)},x\right\rangle+d({x})

where $m_{0}(x)=d(x)$. Since

\nu^{(\tau-1)}=\operatorname*{argmin}_{x\in\mathcal{X}}m_{\tau-1}(x)

and $m_{\tau-1}(x)$ is strongly convex with modulus $1$, we have

m_{\tau-1}(x)-m_{\tau-1}({\nu}^{(\tau-1)})\geq\frac{1}{2}\lVert x-\nu^{(\tau-1)}\rVert^{2},\quad\forall x\in\mathcal{X}.

Upon taking $x=\nu^{(\tau)}$ in the above inequality and using

m_{\tau}({x})=m_{\tau-1}({x})+a_{\tau}\left\langle\zeta^{(\tau)},x\right\rangle,

we obtain

\begin{split}0\leq&m_{\tau-1}({\nu}^{(\tau)})-m_{\tau-1}({\nu}^{(\tau-1)})-\frac{1}{2}\lVert{\nu}^{(\tau)}-{\nu}^{(\tau-1)}\rVert^{2}\\ =&m_{\tau}({\nu}^{(\tau)})-a_{\tau}\left\langle\zeta^{(\tau)},{\nu}^{(\tau)}\right\rangle-m_{\tau-1}({\nu}^{(\tau-1)})-\frac{1}{2}\lVert\nu^{(\tau)}-{\nu}^{(\tau-1)}\rVert^{2},\end{split}

which is equivalent to

a_{\tau}\left\langle\zeta^{(\tau)},{\nu}^{(\tau)}\right\rangle\leq m_{\tau}({\nu}^{(\tau)})-m_{\tau-1}({\nu}^{(\tau-1)})-\frac{1}{2}\lVert{\nu}^{(\tau)}-{\nu}^{(\tau-1)}\rVert^{2}.

Iterating the above inequality from $\tau=1$ to $\tau=t$ yields

\begin{split}\sum_{\tau=1}^{t}\left\langle a_{\tau}\zeta^{(\tau)},{\nu}^{(\tau)}\right\rangle\leq&m_{t}({\nu}^{(t)})-m_{0}({\nu}^{(0)})-\sum_{\tau=1}^{t}\frac{1}{2}\lVert{\nu}^{(\tau)}-{\nu}^{(\tau-1)}\rVert^{2}\\ =&m_{t}({\nu}^{(t)})-\sum_{\tau=1}^{t}\frac{1}{2}\lVert{\nu}^{(\tau)}-\nu^{(\tau-1)}\rVert^{2}.\end{split} (47)

We turn to consider

\begin{split}\sum_{\tau=1}^{t}a_{\tau}\left\langle\zeta^{(\tau)},-x^{*}\right\rangle&\leq\max_{{x}\in\mathcal{X}}\left\{\sum_{\tau=1}^{t}a_{\tau}\left\langle\zeta^{(\tau)},-x\right\rangle-d(x)\right\}+d(x^{*})\\ &=-\min_{{x}\in\mathcal{X}}\left\{\sum_{\tau=1}^{t}a_{\tau}\left\langle\zeta^{(\tau)},x\right\rangle+d(x)\right\}+d(x^{*})\\ &=-m_{t}({\nu}^{(t)})+d(x^{*}),\end{split}

which together with (47) leads to the inequality in (46), thereby concluding the proof. ∎

B-B Proof of Theorem 1

Proof of Theorem 1.

For all τ1\tau\geq 1, we have

1ni=1na(fi(y(τ))fi(x))\displaystyle\frac{1}{n}\sum_{i=1}^{n}a\left(f_{i}({y}^{(\tau)})-f_{i}(x^{*})\right)
1ni=1na(fi(xi(τ1))fi(x)+L2y(τ)xi(τ1)2\displaystyle\leq\frac{1}{n}\sum_{i=1}^{n}a\Big{(}f_{i}(x_{i}^{(\tau-1)})-f_{i}(x^{*})+\frac{L}{2}\lVert{y}^{(\tau)}-x_{i}^{(\tau-1)}\rVert^{2}
+fi(xi(τ1)),y(τ)xi(τ1))\displaystyle\qquad\qquad\quad+\left\langle\nabla f_{i}(x_{i}^{(\tau-1)}),{y}^{(\tau)}-x_{i}^{(\tau-1)}\right\rangle\Big{)}
1ni=1na(L2y(τ)xi(τ1)2+fi(xi(τ1)),y(τ)x)\displaystyle\leq\frac{1}{n}\sum_{i=1}^{n}a\Big{(}\frac{L}{2}\lVert{y}^{(\tau)}-x_{i}^{(\tau-1)}\rVert^{2}+\left\langle\nabla f_{i}(x_{i}^{(\tau-1)}),{y}^{(\tau)}-x^{*}\right\rangle\Big{)}
=1ni=1na(L2y(τ)xi(τ1)2)+ag¯(τ1),y(τ)x,\displaystyle=\frac{1}{n}\sum_{i=1}^{n}a\Big{(}\frac{L}{2}\lVert{y}^{(\tau)}-x_{i}^{(\tau-1)}\rVert^{2}\Big{)}+a\left\langle\overline{g}^{(\tau-1)},{y}^{(\tau)}-x^{*}\right\rangle, (48)

where the two inequalities are due to Assumption 1. By iterating (48) from τ=1\tau=1 to τ=t\tau=t and using Lemma 6 (aτ=a,ζ(τ)=g¯(τ1)a_{\tau}=a,\zeta^{(\tau)}=\overline{g}^{(\tau-1)}, and ν(τ)=y(τ)\nu^{(\tau)}=y^{(\tau)}), we obtain

τ=1ta(f(y(τ))f(x))\displaystyle\sum_{\tau=1}^{t}a\left(f(y^{(\tau)})-f(x^{*})\right)
1nτ=1ti=1na(L2y(τ)xi(τ1)2)\displaystyle\leq\frac{1}{n}\sum_{\tau=1}^{t}\sum_{i=1}^{n}a\Big{(}\frac{L}{2}\lVert{y}^{(\tau)}-x_{i}^{(\tau-1)}\rVert^{2}\Big{)}
\displaystyle\quad-\frac{1}{2}\sum_{\tau=1}^{t}\lVert y^{(\tau)}-y^{(\tau-1)}\rVert^{2}+d(x^{*})
\displaystyle=\frac{aL}{2n}\sum_{\tau=1}^{t}\lVert{\bf y}^{(\tau)}-{\bf x}^{(\tau-1)}\rVert^{2}-\frac{1}{2}\sum_{\tau=1}^{t}\lVert y^{(\tau)}-y^{(\tau-1)}\rVert^{2}+d(x^{*}).

Then we use the inequality

𝐲(τ)𝐱(τ1)22𝐲(τ)𝐲(τ1)2+2𝐲(τ1)𝐱(τ1)2\|\mathbf{y}^{(\tau)}-\mathbf{x}^{(\tau-1)}\|^{2}\leq 2\|\mathbf{y}^{(\tau)}-\mathbf{y}^{(\tau-1)}\|^{2}+2\|\mathbf{y}^{(\tau-1)}-\mathbf{x}^{(\tau-1)}\|^{2}

and the convexity of ff to get

at(f(y~(t))f(x))\displaystyle at\left(f(\tilde{y}^{(t)})-f(x^{*})\right)
1nτ=1t(aL12)𝐲(τ)𝐲(τ1)2\displaystyle\leq\frac{1}{n}\sum_{\tau=1}^{t}\left(aL-\frac{1}{2}\right)\|\mathbf{y}^{(\tau)}-\mathbf{y}^{(\tau-1)}\|^{2}
+aLnτ=1t𝐱(τ1)𝐲(τ1)2+d(x).\displaystyle\quad+\frac{aL}{n}\sum_{\tau=1}^{t}\|\mathbf{x}^{(\tau-1)}-\mathbf{y}^{(\tau-1)}\|^{2}+d(x^{*}).

Upon using Lemma 5, one has

at(f(y~(t))f(x))+γnτ=1t𝐲(τ)𝐲(τ1)2\displaystyle at\left(f(\tilde{y}^{(t)})-f(x^{*})\right)+\frac{\gamma}{n}\sum_{\tau=1}^{t}\|\mathbf{y}^{(\tau)}-\mathbf{y}^{(\tau-1)}\|^{2}
d(x)+8aπ29nL(1(ρ(𝐌))2)=C,\displaystyle\leq d(x^{*})+\frac{8a\pi^{2}}{9nL\left(1-(\rho({\bf M}))^{2}\right)}=C, (49)

where γ>0\gamma>0 is defined in (22). Therefore we have (19) as desired.

From (49) and f(y~(t))f(x)0f(\tilde{y}^{(t)})-f(x^{*})\geq 0, we have

τ=1t𝐲(τ)𝐲(τ1)2nCγ.\sum_{\tau=1}^{t}\|\mathbf{y}^{(\tau)}-\mathbf{y}^{(\tau-1)}\|^{2}\leq\frac{nC}{\gamma}.

By using the above inequality, the convexity of 2\lVert\cdot\rVert^{2}, Jensen’s Inequality, and Lemma 5, we arrive at

t𝐱~(t)𝐲~(t)2τ=1t𝐱(τ)𝐲(τ)2\displaystyle t\lVert{\tilde{\bf x}}^{(t)}-\tilde{\bf y}^{(t)}\rVert^{2}\leq\sum_{\tau=1}^{t}\lVert{{\bf x}}^{(\tau)}-{{\bf y}}^{(\tau)}\rVert^{2}
89(1ρ(𝐌))2τ=0t1𝐲(τ+1)𝐲(τ)2+8π29L2(1(ρ(𝐌))2)\displaystyle\leq\frac{8}{9\left(1-\rho({\bf M})\right)^{2}}\sum_{\tau=0}^{t-1}\lVert{\bf y}^{(\tau+1)}-{\bf y}^{(\tau)}\rVert^{2}+\frac{8\pi^{2}}{9L^{2}\left(1-(\rho({\bf M}))^{2}\right)}
8nC9γ(1ρ(𝐌))2+8π29L2(1(ρ(𝐌))2)=D,\displaystyle\leq\frac{8nC}{9\gamma\left(1-\rho({\bf M})\right)^{2}}+\frac{8\pi^{2}}{9L^{2}\left(1-(\rho({\bf M}))^{2}\right)}=D,

which implies (21) as desired. ∎

Appendix C Proof of Corollary 1

Proof of Corollary 1.

We consider

\begin{split}&f(x_{i}^{(\tau)})-f(y^{(\tau)})\\ &\leq\frac{1}{n}\sum_{j=1}^{n}\left(f_{j}(x_{i}^{(\tau)})-\langle\nabla f_{j}(x_{j}^{(\tau)}),y^{(\tau)}-x_{j}^{(\tau)}\rangle-f_{j}(x_{j}^{(\tau)})\right)\\ &\leq\frac{1}{n}\sum_{j=1}^{n}\Big{(}\langle\nabla f_{j}(x_{j}^{(\tau)}),x_{i}^{(\tau)}-x_{j}^{(\tau)}\rangle-\langle\nabla f_{j}(x_{j}^{(\tau)}),y^{(\tau)}-x_{j}^{(\tau)}\rangle\\ &\quad\quad\quad\quad+\frac{L}{2}\lVert x_{i}^{(\tau)}-y^{(\tau)}+y^{(\tau)}-x_{j}^{(\tau)}\rVert^{2}\Big{)}\\ &\leq\frac{1}{n}\sum_{j=1}^{n}\Big{(}\langle\nabla f_{j}(x_{j}^{(\tau)}),x_{i}^{(\tau)}-y^{(\tau)}\rangle+{L}\lVert x_{i}^{(\tau)}-y^{(\tau)}\rVert^{2}\\ &\quad\quad\quad\quad+{L}\lVert y^{(\tau)}-x_{j}^{(\tau)}\rVert^{2}\Big{)}\\ &=\left\langle\overline{g}^{(\tau)},x_{i}^{(\tau)}-y^{(\tau)}\right\rangle+{L}\lVert x_{i}^{(\tau)}-y^{(\tau)}\rVert^{2}+\frac{L}{n}\lVert\mathbf{y}^{(\tau)}-\mathbf{x}^{(\tau)}\rVert^{2},\end{split} (50)

where the first two inequalities follow from Assumption 1 and the third from $\lVert x+y\rVert^{2}\leq 2\lVert x\rVert^{2}+2\lVert y\rVert^{2}$. When $\mathcal{X}=\mathbb{R}^{m}$, the closed-form solutions for (12) and (16) can be identified as

xi(τ)=azi(τ),y(τ)=az¯(τ),x_{i}^{(\tau)}=-a{z_{i}^{(\tau)}},\quad y^{(\tau)}=-a{\overline{z}^{(\tau)}},

implying that $y^{(\tau)}=\frac{1}{n}\sum_{i=1}^{n}x_{i}^{(\tau)}$. Upon summing (50) from $i=1$ to $i=n$, we obtain

i=1n(f(xi(τ))f(y(τ)))2L𝐲(τ)𝐱(τ)2.\begin{split}\sum_{i=1}^{n}\left(f(x_{i}^{(\tau)})-f(y^{(\tau)})\right)\leq 2L\lVert\mathbf{y}^{(\tau)}-\mathbf{x}^{(\tau)}\rVert^{2}.\end{split} (51)
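As a numerical illustration of (51), the sketch below checks the bound when $y^{(\tau)}$ is the average of the $x_{i}^{(\tau)}$; the quadratic local objectives $f_{j}(x)=\frac{1}{2}\lVert x-c_{j}\rVert^{2}$ (so that $L=1$) and the random iterates are illustrative assumptions rather than the paper's problem data.

```python
import numpy as np

# Sanity check of (51): sum_i ( f(x_i) - f(y) ) <= 2 L ||bold-y - bold-x||^2 when y averages the x_i.
# Illustrative assumption: quadratic local objectives f_j(x) = 0.5*||x - c_j||^2, hence L = 1.
rng = np.random.default_rng(1)
n, m, L = 10, 4, 1.0
C = rng.normal(size=(n, m))                  # centers c_j defining the local objectives
X = rng.normal(size=(n, m))                  # local iterates x_i^{(tau)}
y = X.mean(axis=0)                           # y^{(tau)} equals the average of the x_i^{(tau)} (the X = R^m case)

f = lambda x: 0.5 * np.mean(np.sum((x - C) ** 2, axis=1))   # f(x) = (1/n) sum_j f_j(x)
lhs = sum(f(X[i]) - f(y) for i in range(n))
rhs = 2.0 * L * np.sum((X - y) ** 2)         # 2 L times the squared stacked consensus error
print(lhs <= rhs + 1e-12)                    # expected: True
```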

Then, we sum (51) from $\tau=1$ to $\tau=t$ and use the convexity of $f$ to get

\begin{split}&\sum_{i=1}^{n}\left(tf(\tilde{x}_{i}^{(t)})-\sum_{\tau=1}^{t}f(y^{(\tau)})\right)\leq\sum_{\tau=1}^{t}\sum_{i=1}^{n}\left(f(x_{i}^{(\tau)})-f(y^{(\tau)})\right)\\ &\leq 2L\sum_{\tau=1}^{t}\lVert\mathbf{y}^{(\tau)}-\mathbf{x}^{(\tau)}\rVert^{2},\end{split} (52)

where x~i(t)=t1τ=1txi(τ)\tilde{x}_{i}^{(t)}={t}^{-1}\sum_{\tau=1}^{t}x_{i}^{(\tau)}. Using Lemma 5, we obtain

\begin{split}&\sum_{i=1}^{n}\left(tf(\tilde{x}_{i}^{(t)})-\sum_{\tau=1}^{t}f(y^{(\tau)})\right)\\ &\leq\frac{16L}{9\left(1-\rho({\bf M})\right)^{2}}\sum_{\tau=0}^{t-1}\lVert{\bf y}^{(\tau+1)}-{\bf y}^{(\tau)}\rVert^{2}+\frac{16\pi^{2}}{9L\left(1-(\rho({\bf M}))^{2}\right)}.\end{split} (53)

Recalling the derivation of (49), in which the convexity of $f$ was invoked only in the final step, we also have

\displaystyle a\sum_{\tau=1}^{t}\left(f({y}^{(\tau)})-f(x^{*})\right)\leq d(x^{*})+\frac{8a\pi^{2}}{9nL\left(1-(\rho({\bf M}))^{2}\right)}
\displaystyle-\left(\frac{1}{2}-aL-\frac{8aL}{9\left(1-(\rho({\bf M}))^{2}\right)}\right)\sum_{\tau=1}^{t}\|{y}^{(\tau)}-{y}^{(\tau-1)}\|^{2}. (54)

By multiplying both sides of (54) by $n/a>0$ and adding the resulting inequality to (53), we get

t(f(x~i(t))f(x))ti=1n(f(x~i(t))f(x))(12aL8L3(1ρ(𝐌))2)τ=1t𝐲(τ)𝐲(τ1)2+n2ax2+8π23L(1(ρ(𝐌))2).\begin{split}&t\left(f(\tilde{x}_{i}^{(t)})-f(x^{*})\right)\leq t\sum_{i=1}^{n}\left(f(\tilde{x}_{i}^{(t)})-f(x^{*})\right)\\ \leq&-\Big{(}\frac{1}{2a}-{L}-\frac{8L}{3\left(1-\rho({\bf M})\right)^{2}}\Big{)}{\sum_{\tau=1}^{t}\lVert{\bf y}^{(\tau)}-{\bf y}^{(\tau-1)}\rVert^{2}}\\ &+\frac{n}{2a}\lVert x^{*}\rVert^{2}+\frac{8\pi^{2}}{3L\left(1-(\rho({\bf M}))^{2}\right)}.\end{split} (55)

Upon using the condition in (25), we arrive at (26) as desired. ∎

Appendix D Proof of Theorem 2

D-A Preliminaries

For Algorithm 2, we define

𝐮(t)=[u1(t)un(t)],𝐯(t)=[v1(t)vn(t)],𝐰(t)=[w1(t)wn(t)],{\bf u}^{(t)}=\begin{bmatrix}u_{1}^{(t)}\\ \vdots\\ u_{n}^{(t)}\end{bmatrix},\quad{\bf v}^{(t)}=\begin{bmatrix}v_{1}^{(t)}\\ \vdots\\ v_{n}^{(t)}\end{bmatrix},\quad{\bf w}^{(t)}=\begin{bmatrix}w_{1}^{(t)}\\ \vdots\\ w_{n}^{(t)}\end{bmatrix},
𝐪(t)=[q1(t)qn(t)],^(t)=[f1(u1(t))fn(un(t))],{\bf q}^{(t)}=\begin{bmatrix}q_{1}^{(t)}\\ \vdots\\ q_{n}^{(t)}\end{bmatrix},\,\,\hat{\bf\nabla}^{(t)}=\begin{bmatrix}{\nabla}f_{1}(u_{1}^{(t)})\\ \vdots\\ \nabla f_{n}(u_{n}^{(t)})\end{bmatrix},
u¯(t)=1ni=1nui(t),v¯(t)=1ni=1nvi(t),w¯(t)=1ni=1nwi(t),\overline{u}^{(t)}=\frac{1}{n}\sum_{i=1}^{n}u_{i}^{(t)},\,\overline{v}^{(t)}=\frac{1}{n}\sum_{i=1}^{n}v_{i}^{(t)},\,\overline{{w}}^{(t)}=\frac{1}{n}\sum_{i=1}^{n}{{w}}_{i}^{(t)},
\overline{\hat{g}}^{(t)}=\frac{1}{n}\sum_{i=1}^{n}\nabla f_{i}(u_{i}^{(t)}),\,\overline{{q}}^{(t)}=\frac{1}{n}\sum_{i=1}^{n}{{q}}_{i}^{(t)},\,\tilde{{\bf w}}^{(t)}={\bf w}^{(t)}-{\bf 1}\otimes\overline{w}^{(t)},
𝐮~(t)=𝐮(t)𝟏u¯(t),𝐯~(t)=𝐯(t)𝟏v¯(t),𝐪~(t)=𝐪(t)𝟏q¯(t).\tilde{{\bf u}}^{(t)}={\bf u}^{(t)}-{\bf 1}\otimes\overline{u}^{(t)},\,\tilde{{\bf v}}^{(t)}={\bf v}^{(t)}-{\bf 1}\otimes\overline{v}^{(t)},\,\tilde{{\bf q}}^{(t)}={\bf q}^{(t)}-{\bf 1}\otimes\overline{q}^{(t)}.

Based on this notation, the steps in (27) and (29) can be written in the following compact form:

𝐮(t)\displaystyle{\bf u}^{(t)} =At1At(𝐏𝐯(t1))+atAt𝐰(t1),\displaystyle=\frac{A_{t-1}}{A_{t}}\left({\bf P}{\bf v}^{(t-1)}\right)+\frac{a_{t}}{A_{t}}{\bf w}^{(t-1)}, (56a)
𝐯(t)\displaystyle{\bf v}^{(t)} =At1At(𝐏𝐯(t1))+atAt𝐰(t),\displaystyle=\frac{A_{t-1}}{A_{t}}\left({\bf P}{\bf v}^{(t-1)}\right)+\frac{a_{t}}{A_{t}}{\bf w}^{(t)}, (56b)
𝐪(t)\displaystyle{\bf q}^{(t)} =𝐏𝐪(t1)+^(t)^(t1),\displaystyle={\bf P}{\bf q}^{(t-1)}+\hat{\nabla}^{(t)}-\hat{\nabla}^{(t-1)}, (56c)

where 𝐏=PI{\bf P}=P\otimes I. According to (27), we have

u¯(t)\displaystyle\overline{u}^{(t)} =At1Atv¯(t1)+atAtw¯(t1),\displaystyle=\frac{A_{t-1}}{A_{t}}\overline{v}^{(t-1)}+\frac{a_{t}}{A_{t}}\overline{{w}}^{(t-1)}, (57a)
v¯(t)\displaystyle\overline{v}^{(t)} =At1Atv¯(t1)+atAtw¯(t).\displaystyle=\frac{A_{t-1}}{A_{t}}\overline{v}^{(t-1)}+\frac{a_{t}}{A_{t}}\overline{{w}}^{(t)}. (57b)

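To make the bookkeeping in (56) and (57) concrete, the following sketch runs the compact recursions on a toy network and checks that averaging (56a)–(56c) over the agents reproduces (57) and keeps the average of ${\bf q}^{(t)}$ equal to the average of $\hat{\nabla}^{(t)}$ (the property recorded as (61) below). The ring mixing matrix, the stand-in sequences for ${\bf w}^{(t)}$ and $\hat{\nabla}^{(t)}$, the initialization $q_{i}^{(0)}=\nabla f_{i}(u_{i}^{(0)})$, and the weights $a_{t}=t+1$ are illustrative assumptions only.

```python
import numpy as np

# Toy check of the compact recursions (56a)-(56c) and their averaged forms (57).
rng = np.random.default_rng(2)
n, m, T = 6, 3, 40
I_n = np.eye(n)
P = 0.5 * I_n + 0.25 * (np.roll(I_n, 1, axis=0) + np.roll(I_n, -1, axis=0))  # doubly stochastic ring
a = lambda t: t + 1.0                        # assumed weights a_t, with A_0 = 0
v = rng.normal(size=(n, m))                  # v^{(0)}
w_prev = rng.normal(size=(n, m))             # w^{(0)}
grad = rng.normal(size=(n, m))               # hat-nabla^{(0)}
q = grad.copy()                              # assumed initialization q_i^{(0)} = grad f_i(u_i^{(0)})
A = 0.0
for t in range(1, T + 1):
    A_prev, A = A, A + a(t)
    u = (A_prev / A) * (P @ v) + (a(t) / A) * w_prev            # (56a)
    w = rng.normal(size=(n, m))                                  # stand-in for w^{(t)} (its update is not modeled)
    v_new = (A_prev / A) * (P @ v) + (a(t) / A) * w              # (56b)
    grad_new = rng.normal(size=(n, m))                           # stand-in for hat-nabla^{(t)}
    q = P @ q + grad_new - grad                                  # (56c): dynamic average of the gradients
    assert np.allclose(u.mean(0), (A_prev / A) * v.mean(0) + (a(t) / A) * w_prev.mean(0))  # (57a)
    assert np.allclose(v_new.mean(0), (A_prev / A) * v.mean(0) + (a(t) / A) * w.mean(0))   # (57b)
    assert np.allclose(q.mean(0), grad_new.mean(0))                                        # (61)
    v, grad, w_prev = v_new, grad_new, w
print("ok")
```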
Before proving Theorem 2, we present Lemma 7, which establishes upper bounds for the consensus error vectors $\tilde{\bf u}^{(t)}$ and $\tilde{\bf v}^{(t)}$.

Lemma 7.

For Algorithm 2, if Assumptions 1, 2, and 3 are satisfied, then

𝐯~(t)atAtCp,𝐮~(t)atAtCp\lVert\tilde{\bf v}^{(t)}\rVert\leq\frac{a_{t}}{A_{t}}C_{p},\quad\lVert\tilde{\bf u}^{(t)}\rVert\leq\frac{a_{t}}{A_{t}}C_{p} (58)

for all t1t\geq 1, where Cp=31βnGC_{p}=\lceil\frac{3}{1-\beta}\rceil\sqrt{n}G, and β=σ2(P)\beta=\sigma_{2}(P).

Proof of Lemma 7.

Since $u_{i}^{(t)}$, $v_{i}^{(t)}$, $i=1,\cdots,n$, as well as $\overline{u}^{(t)}$ and $\overline{v}^{(t)}$, all lie within the constraint set, we readily have

𝐮(t)𝟏u¯(t)nGand𝐯(t)𝟏v¯(t)nG\begin{split}\lVert{\bf u}^{(t)}-{\bf 1}\otimes\overline{u}^{(t)}\rVert\leq\sqrt{n}G\,\,\,\,\textrm{and}\,\,\,\,\lVert{\bf v}^{(t)}-{\bf 1}\otimes\overline{v}^{(t)}\rVert\leq\sqrt{n}G\end{split}

by Assumption 3. Upon using

atAt=2(t+1)t(t+3)1t,t1\frac{a_{t}}{A_{t}}=\frac{2(t+1)}{t(t+3)}\geq\frac{1}{t},\quad\forall t\geq 1

and the definition of Cp=31βnGC_{p}=\lceil\frac{3}{1-\beta}\rceil\sqrt{n}G, we have that (58) holds for 1t<31β.1\leq t<\lceil\frac{3}{1-\beta}\rceil.
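The closed form $\frac{a_{t}}{A_{t}}=\frac{2(t+1)}{t(t+3)}$, its lower bound $1/t$, its monotone decrease, and the bound $a_{\tau}^{2}/A_{\tau}\leq 2a$ used at the end of the proof of Theorem 2 can all be verified numerically. The sketch below assumes $a_{t}=a(t+1)$ with $A_{t}=\sum_{\tau=1}^{t}a_{\tau}$, which is our reading of the weight choice and is consistent with the stated ratio.

```python
import numpy as np

# Checks on the (assumed) weight sequence a_t = a*(t+1), A_t = sum_{tau<=t} a_tau = a*t*(t+3)/2.
a = 0.7                                      # any a > 0
t = np.arange(1, 1001)
a_t = a * (t + 1)
A_t = np.cumsum(a_t)
ratio = a_t / A_t
print(np.allclose(ratio, 2 * (t + 1) / (t * (t + 3))))   # closed form of a_t / A_t
print(np.all(ratio >= 1 / t))                             # a_t / A_t >= 1/t for all t >= 1
print(np.all(np.diff(ratio) <= 0))                        # a_t / A_t decreases monotonically with t
print(np.all(a_t ** 2 / A_t <= 2 * a + 1e-12))            # a_t^2 / A_t <= 2a (used in the proof of Theorem 2)
```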

When $t\geq\lceil\frac{3}{1-\beta}\rceil$, we proceed by induction. Suppose that (58) holds for some $t\geq\lceil\frac{3}{1-\beta}\rceil$. Next, we examine the upper bounds for $\lVert\tilde{\bf v}^{(t+1)}\rVert$ and $\lVert\tilde{\bf u}^{(t+1)}\rVert$, respectively.

i) Upper bound for 𝐯~(t+1)\lVert\tilde{\bf v}^{(t+1)}\rVert. Using

𝐏𝐯(t)𝟏v¯(t)=((P𝟏𝟏Tn)I)𝐯~(t),\displaystyle\mathbf{P}{\bf v}^{(t)}-\mathbf{1}\otimes\overline{v}^{(t)}=\left(\left(P-\frac{\mathbf{1}\mathbf{1}^{\mathrm{T}}}{n}\right)\otimes I\right)\tilde{\bf v}^{(t)},

(56b) and (57b), we obtain

𝐯~(t+1)=AtAt+1((P𝟏𝟏Tn)I)𝐯~(t)+at+1At+1𝐰~(t+1).\begin{split}\tilde{\bf v}^{(t+1)}=&\frac{A_{t}}{A_{t+1}}\left(\left(P-\frac{\mathbf{1}\mathbf{1}^{\mathrm{T}}}{n}\right)\otimes I\right)\tilde{\bf v}^{(t)}+\frac{a_{t+1}}{A_{t+1}}\tilde{\bf w}^{(t+1)}.\end{split}

Evaluating the norm of both sides of the above equality yields

\begin{split}\lVert\tilde{\bf v}^{(t+1)}\rVert=&\left\lVert\frac{A_{t}}{A_{t+1}}\left(\left(P-\frac{\mathbf{1}\mathbf{1}^{\mathrm{T}}}{n}\right)\otimes I\right)\tilde{\bf v}^{(t)}+\frac{a_{t+1}}{A_{t+1}}\tilde{\bf w}^{(t+1)}\right\rVert\\ {\leq}&\beta\lVert\tilde{\bf v}^{(t)}\rVert+\frac{a_{t+1}}{A_{t+1}}\left\lVert\tilde{\bf w}^{(t+1)}\right\rVert\\ \leq&\frac{a_{t}}{A_{t}}\beta C_{p}+\frac{a_{t+1}}{A_{t+1}}\sqrt{n}G,\end{split}

where the last inequality follows from the hypothesis that 𝐯~(t)atCp/At\lVert\tilde{\bf v}^{(t)}\rVert\leq{a_{t}}C_{p}/{A_{t}} and Assumption 3. Since at/At{a_{t}}/A_{t} monotonically decreases with tt, we have

\begin{split}\lVert\tilde{\bf v}^{(t+1)}\rVert\leq&\frac{a_{t}}{A_{t}}\left(\beta C_{p}+\sqrt{n}G\right)\leq\frac{a_{t}}{A_{t}}C_{p}\left(\beta+\frac{1}{\lceil\frac{3}{1-\beta}\rceil}\right),\end{split}

where the last inequality is due to nG=Cp31β\sqrt{n}G=\frac{C_{p}}{\lceil\frac{3}{1-\beta}\rceil}. It then remains to prove

(β+131β)Atatat+1At+1,t31β\left({\beta}+\frac{1}{\lceil\frac{3}{1-\beta}\rceil}\right)\leq\frac{A_{t}}{a_{t}}\cdot\frac{a_{t+1}}{A_{t+1}},\quad\forall t\geq\lceil\frac{3}{1-\beta}\rceil (59)

to obtain the bound for 𝐯~(t+1)\lVert\tilde{\bf v}^{(t+1)}\rVert as desired. To prove (59), we let

t0=31β,t_{0}=\lceil\frac{3}{1-\beta}\rceil,

which implies

3t01β.\frac{3}{t_{0}}\leq 1-\beta.

Based on the above relation, we further obtain

β+1t0t02t0t0+2t0+4.\beta+\frac{1}{t_{0}}\leq\frac{t_{0}-2}{t_{0}}\leq\frac{t_{0}+2}{t_{0}+4}. (60)

This in conjunction with

t(t+3)(t+1)(t+1)1,t1\frac{t(t+3)}{(t+1)(t+1)}\geq 1,\quad\forall t\geq 1

and the definitions of ata_{t} and AtA_{t} yields

β+1t0t0+2t0+4t0(t0+3)(t0+1)(t0+1)=At0at0at0+1At0+1.\beta+\frac{1}{t_{0}}{\leq}\frac{t_{0}+2}{t_{0}+4}\cdot\frac{t_{0}(t_{0}+3)}{(t_{0}+1)(t_{0}+1)}=\frac{A_{t_{0}}}{a_{t_{0}}}\cdot\frac{a_{t_{0}+1}}{A_{t_{0}+1}}.

Since Atat+1/(atAt+1){A_{t}a_{t+1}}/{(a_{t}A_{t+1})} monotonically increases with tt, we have (59) satisfied.
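Inequality (59) can also be confirmed numerically: under the same weight assumption as in the earlier sketch, $\frac{A_{t}}{a_{t}}\cdot\frac{a_{t+1}}{A_{t+1}}=\frac{t(t+2)(t+3)}{(t+1)^{2}(t+4)}$, and the check below evaluates (59) together with the monotonicity claim for several values of $\beta$.

```python
import numpy as np

# Numerical check of (59) and of the monotone increase of (A_t/a_t)*(a_{t+1}/A_{t+1}),
# assuming a_t proportional to (t+1) as in the earlier sketch.
ratio = lambda t: t * (t + 2) * (t + 3) / ((t + 1) ** 2 * (t + 4))
for beta in (0.1, 0.5, 0.9, 0.99):
    t0 = int(np.ceil(3 / (1 - beta)))
    ts = np.arange(t0, t0 + 2000)
    assert np.all(beta + 1.0 / t0 <= ratio(ts))   # inequality (59) for all t >= ceil(3/(1-beta))
    assert np.all(np.diff(ratio(ts)) >= 0)        # the ratio increases with t
print("ok")
```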

ii) Upper bound for 𝐮~(t+1)\lVert\tilde{\bf u}^{(t+1)}\rVert. Using the same arguments as above, we have

\begin{split}\lVert\tilde{\bf u}^{(t+1)}\rVert=&\left\lVert\frac{A_{t}}{A_{t+1}}\left(\left(P-\frac{\mathbf{1}\mathbf{1}^{\mathrm{T}}}{n}\right)\otimes I\right)\tilde{\bf v}^{(t)}+\frac{a_{t+1}}{A_{t+1}}\tilde{\bf w}^{(t)}\right\rVert\\ {\leq}&\beta\lVert\tilde{\bf v}^{(t)}\rVert+\frac{a_{t+1}}{A_{t+1}}\left\lVert\tilde{\bf w}^{(t)}\right\rVert\\ \leq&\frac{a_{t}}{A_{t}}\beta C_{p}+\frac{a_{t+1}}{A_{t+1}}\sqrt{n}G.\end{split}

By following the same line of reasoning as in the first part, we are able to obtain

𝐮~(t+1)at+1At+1Cp.\lVert\tilde{\bf u}^{(t+1)}\rVert\leq\frac{a_{t+1}}{A_{t+1}}C_{p}.

Combining the above bounds completes the proof. ∎

Lemma 8 establishes an upper bound for the consensus error vector $\tilde{\bf q}^{(t)}$.

Lemma 8.

Suppose Assumptions 1, 2, and 3 are satisfied. For Algorithm 2, we have

q¯(t)=g^¯(t)\overline{q}^{(t)}=\overline{\hat{g}}^{(t)} (61)

and

𝐪~(t)atAtCg\lVert\tilde{\bf q}^{(t)}\rVert\leq\frac{a_{t}}{A_{t}}C_{g} (62)

for all t1t\geq 1, where Cg=31βL(2nG+2Cp)/(1β)C_{g}={\lceil\frac{3}{1-\beta}\rceil L\Big{(}2\sqrt{n}G+2C_{p}\Big{)}}/{(1-\beta)}, and β=σ2(P)\beta=\sigma_{2}(P).

Proof of Lemma 8.

The proof of (61) directly follows from the proof of Lemma 3, and is omitted here for brevity.

For (62), we subtract 𝟏q¯(t){\bf 1}\otimes\overline{q}^{(t)} from both sides of (56c) to get

𝐪(t)𝟏q¯(t)=𝐏𝐪(t1)𝟏q¯(t1)+^(t)^(t1)𝟏(q¯(t)q¯(t1)).\begin{split}{\bf q}^{(t)}-{\bf 1}\otimes\overline{q}^{(t)}=&{\bf P}{\bf q}^{(t-1)}-{\bf 1}\otimes\overline{q}^{(t-1)}\\ &+\hat{\nabla}^{(t)}-\hat{\nabla}^{(t-1)}-{\bf 1}\otimes(\overline{q}^{(t)}-\overline{q}^{(t-1)}).\\ \end{split} (63)

Using the same procedure as in (B-A) leads to

𝐪~(t)β𝐪~(t1)+^(t)^(t1)𝟏(q¯(t)q¯(t1)).\lVert\tilde{\bf q}^{(t)}\rVert\leq\beta\lVert\tilde{\bf q}^{(t-1)}\rVert+\lVert\hat{\nabla}^{(t)}-\hat{\nabla}^{(t-1)}-{\bf 1}\otimes(\overline{q}^{(t)}-\overline{q}^{(t-1)})\rVert. (64)

Since the objective is smooth, we obtain

^(t)^(t1)𝟏(q¯(t)q¯(t1))=^(t)^(t1)𝟏(g^¯(t)g^¯(t1))=^(t)^(t1)(𝟏𝟏TnI)(^(t)^(t1))^(t)^(t1)L𝐮(t)𝐮(t1).\begin{split}&\left\lVert\hat{\nabla}^{(t)}-\hat{\nabla}^{(t-1)}-\mathbf{1}\otimes\left(\overline{q}^{(t)}-\overline{q}^{(t-1)}\right)\right\rVert\\ =&\left\lVert\hat{\nabla}^{(t)}-\hat{\nabla}^{(t-1)}-\mathbf{1}\otimes\left(\overline{\hat{g}}^{(t)}-\overline{\hat{g}}^{(t-1)}\right)\right\rVert\\ =&\left\lVert\hat{\nabla}^{(t)}-\hat{\nabla}^{(t-1)}-\left(\frac{\mathbf{1}\mathbf{1}^{\mathrm{T}}}{n}\otimes I\right)\left(\hat{\nabla}^{(t)}-\hat{\nabla}^{(t-1)}\right)\right\rVert\\ \leq&\left\lVert\hat{\nabla}^{(t)}-\hat{\nabla}^{(t-1)}\right\rVert\leq L\left\lVert{\bf u}^{(t)}-{\bf u}^{(t-1)}\right\rVert.\end{split}

To bound 𝐮(t)𝐮(t1)\left\lVert{\bf u}^{(t)}-{\bf u}^{(t-1)}\right\rVert, we consider

𝐮(t)𝐮(t1)=At1At𝐏𝐯(t1)+atAt𝐰(t1)𝐮(t1)=At1At𝐏(𝐯(t1)𝐮(t1))+At1At(𝐏II)𝐮(t1)+atAt(𝐰(t1)𝐮(t1))\begin{split}&{\bf u}^{(t)}-{\bf u}^{(t-1)}\\ {=}&\frac{A_{t-1}}{A_{t}}{\bf P}{\bf v}^{(t-1)}+\frac{a_{t}}{A_{t}}{\bf w}^{(t-1)}-{\bf u}^{(t-1)}\\ =&\frac{A_{t-1}}{A_{t}}{\bf P}\left({\bf v}^{(t-1)}-{\bf u}^{(t-1)}\right)+\frac{A_{t-1}}{A_{t}}\left({\bf P}-I\otimes I\right){\bf u}^{(t-1)}\\ &+\frac{a_{t}}{A_{t}}\left({\bf w}^{(t-1)}-{\bf u}^{(t-1)}\right)\end{split}

where the first equality is due to (56a). From (56a) and (56b), we have 𝐯(t1)𝐮(t1)=at1At1(𝐰(t1)𝐰(t2)).{\bf v}^{(t-1)}-{\bf u}^{(t-1)}=\frac{a_{t-1}}{A_{t-1}}\left({\bf w}^{(t-1)}-{\bf w}^{(t-2)}\right). In addition, we have (𝐏II)𝐮(t1)=(𝐏II)(𝐮(t1)𝟏u¯(t1)).\left({\bf P}-I\otimes I\right){\bf u}^{(t-1)}=\left({\bf P}-I\otimes I\right)({\bf u}^{(t-1)}-{\bf 1}\otimes\overline{u}^{(t-1)}). Therefore, it holds that

𝐮(t)𝐮(t1)At1Atat1At1𝐰(t1)𝐰(t2)+2At1At𝐮~(t1)+atAt𝐰(t1)𝐮(t1)atAtnG+2atAtCp+atAtnG=atAt(2nG+2Cp)\begin{split}&\left\lVert{\bf u}^{(t)}-{\bf u}^{(t-1)}\right\rVert\\ {\leq}&\frac{A_{t-1}}{A_{t}}\frac{a_{t-1}}{A_{t-1}}\left\lVert{\bf w}^{(t-1)}-{\bf w}^{(t-2)}\right\rVert+\frac{2A_{t-1}}{A_{t}}\left\lVert\tilde{\bf u}^{(t-1)}\right\rVert\\ &+\frac{a_{t}}{A_{t}}\left\lVert{\bf w}^{(t-1)}-{\bf u}^{(t-1)}\right\rVert\\ \leq&\frac{a_{t}}{A_{t}}\sqrt{n}G+\frac{2a_{t}}{A_{t}}C_{p}+\frac{a_{t}}{A_{t}}\sqrt{n}G\\ =&\frac{a_{t}}{A_{t}}\left(2\sqrt{n}G+2C_{p}\right)\end{split} (65)

where Lemma 7 and Assumption 3 are used to get the second inequality. By substituting (65) into (64), we obtain

𝐪~(t)β𝐪~(t1)+atAtL(2nG+2Cp).\begin{split}\lVert\tilde{\bf q}^{(t)}\rVert\leq\beta\lVert\tilde{\bf q}^{(t-1)}\rVert+\frac{a_{t}}{A_{t}}L\left(2\sqrt{n}G+2C_{p}\right).\end{split} (66)

By initialization, we have $\tilde{\bf q}^{(0)}=0$, and therefore, for all $t\geq 1$,

\begin{split}\left\lVert{\tilde{\bf q}}^{(t)}\right\rVert\leq&L\left(2\sqrt{n}G+2C_{p}\right)\sum_{\tau=1}^{t}\beta^{t-\tau}\frac{a_{\tau}}{A_{\tau}}\leq\frac{L\left(2\sqrt{n}G+2C_{p}\right)}{1-\beta},\end{split}

implying that (62) is valid for 1t<31β1\leq t<\lceil\frac{3}{1-\beta}\rceil. Next, we prove that (62) also holds for t31βt\geq\lceil\frac{3}{1-\beta}\rceil by mathematical induction. Suppose that (62) holds true for some t31βt\geq\lceil\frac{3}{1-\beta}\rceil. Using this hypothesis and (66), we obtain

𝐪~(t+1)atAtβCg+at+1At+1L(2nG+2Cp)atAtCg(β+131β).\begin{split}\lVert\tilde{\bf q}^{(t+1)}\rVert\leq&\frac{a_{t}}{A_{t}}\beta C_{g}+\frac{a_{t+1}}{A_{t+1}}L\left(2\sqrt{n}G+2C_{p}\right)\\ \leq&\frac{a_{t}}{A_{t}}C_{g}\left({\beta}+\frac{1}{\lceil\frac{3}{1-\beta}\rceil}\right).\end{split}

Finally, using the same argument as in (59) and (60) in the proof of Lemma 7, we arrive at (62) as desired. ∎
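The induction behind (62) can be stress-tested by running the recursion (66) with equality, which is the worst case: starting from $\lVert\tilde{\bf q}^{(0)}\rVert=0$, the resulting scalar sequence should stay below $\frac{a_{t}}{A_{t}}C_{g}$. In the sketch below, $K$ is a placeholder standing in for $L\left(2\sqrt{n}G+2C_{p}\right)$, and the weights are assumed as in the earlier sketches.

```python
import numpy as np

# Worst-case check of (62): iterate e_t = beta*e_{t-1} + (a_t/A_t)*K (i.e., (66) with equality), e_0 = 0,
# and verify e_t <= (a_t/A_t)*C_g with C_g = ceil(3/(1-beta))*K/(1-beta).
K = 1.0                                       # stands in for L*(2*sqrt(n)*G + 2*C_p)
for beta in (0.2, 0.8, 0.95):
    Cg = np.ceil(3 / (1 - beta)) * K / (1 - beta)
    e, ok = 0.0, True
    for t in range(1, 5001):
        r = 2 * (t + 1) / (t * (t + 3))       # a_t / A_t
        e = beta * e + r * K
        ok = ok and (e <= r * Cg + 1e-12)
    print(beta, ok)                           # expected: True for each beta
```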

D-B Proof of Theorem 2

Proof of Theorem 2.

Using Aτ1=AτaτA_{\tau-1}=A_{\tau}-a_{\tau}, we have

At(f(v¯(t))f(x))=τ=1t(Aτf(v¯(τ))Aτ1f(v¯(τ1)))τ=1taτf(x)=τ=1t(Aτ(f(v¯(τ))f(u¯(τ)))+aτ(f(u¯(τ))f(x))+Aτ1(f(u¯(τ))f(v¯(τ1))))\begin{split}&A_{t}\Big{(}f(\overline{v}^{(t)})-f(x^{*})\Big{)}\\ =&\sum_{\tau=1}^{t}\left(A_{\tau}f(\overline{v}^{(\tau)})-A_{\tau-1}f(\overline{v}^{(\tau-1)})\right)-\sum_{\tau=1}^{t}a_{\tau}f(x^{*})\\ =&\sum_{\tau=1}^{t}\Big{(}A_{\tau}\left(f(\overline{v}^{(\tau)})-f(\overline{u}^{(\tau)})\right)+a_{\tau}\left(f(\overline{u}^{(\tau)})-f(x^{*})\right)\\ &\quad\quad\,\,+A_{\tau-1}\left(f(\overline{u}^{(\tau)})-f(\overline{v}^{(\tau-1)})\right)\Big{)}\\ \end{split}

Upon using the convexity of ff, we obtain

At(f(v¯(t))f(x))τ=1t(Aτ(f(v¯(τ))f(u¯(τ)))+aτf(u¯(τ)),u¯(τ)x+Aτ1f(u¯(τ)),u¯(τ)v¯(τ1))\begin{split}&A_{t}\left(f(\overline{v}^{(t)})-f(x^{*})\right)\\ \leq&\sum_{\tau=1}^{t}\Big{(}A_{\tau}\left(f(\overline{v}^{(\tau)})-f(\overline{u}^{(\tau)})\right)+a_{\tau}\left\langle\nabla f(\overline{u}^{(\tau)}),\overline{u}^{(\tau)}-x^{*}\right\rangle\\ &\quad\quad\,\,+A_{\tau-1}\left\langle\nabla f(\overline{u}^{(\tau)}),\overline{u}^{(\tau)}-\overline{v}^{(\tau-1)}\right\rangle\Big{)}\end{split}

By (57b), we obtain

At(f(v¯(t))f(x))τ=1tAτ(f(v¯(τ))f(u¯(τ))+f(u¯(τ)),u¯(τ)v¯(τ))+τ=1taτf(u¯(τ)),w¯(τ)xτ=1tAτL2u¯(τ)v¯(τ)2(I)+1ni=1nτ=1taτqi(τ),wi(τ)x(II)+1nτ=1ti=1naτf(u¯(τ))qi(τ),wi(τ)x(III)\begin{split}&A_{t}\Big{(}f(\overline{v}^{(t)})-f(x^{*})\Big{)}\\ \leq&\sum_{\tau=1}^{t}A_{\tau}\Big{(}f(\overline{v}^{(\tau)})-f(\overline{u}^{(\tau)})+\left\langle\nabla f(\overline{u}^{(\tau)}),\overline{u}^{(\tau)}-\overline{v}^{(\tau)}\right\rangle\Big{)}\\ &+\sum_{\tau=1}^{t}a_{\tau}\left\langle\nabla f(\overline{u}^{(\tau)}),\overline{w}^{(\tau)}-x^{*}\right\rangle\\ {\leq}&\sum_{\tau=1}^{t}\frac{A_{\tau}L}{2}\underbrace{\lVert\overline{u}^{(\tau)}-\overline{v}^{(\tau)}\rVert^{2}}_{(I)}+\frac{1}{n}\sum_{i=1}^{n}\underbrace{\sum_{\tau=1}^{t}a_{\tau}\left\langle q_{i}^{(\tau)},{w}_{i}^{(\tau)}-x^{*}\right\rangle}_{(II)}\\ &+\frac{1}{n}\sum_{\tau=1}^{t}\underbrace{\sum_{i=1}^{n}a_{\tau}\left\langle\nabla f(\overline{u}^{(\tau)})-q_{i}^{(\tau)},{w}_{i}^{(\tau)}-x^{*}\right\rangle}_{(III)}\end{split} (67)

where the last inequality is due to the smoothness of ff. To bound (I)(I), we consider

u¯(τ)v¯(τ)2=1ni=1nu¯(τ)ui(τ)+ui(τ)vi(τ)+vi(τ)v¯(τ)21ni=1n(3u¯(τ)ui(τ)2+3ui(τ)vi(τ)2+3vi(τ)v¯(τ)2)(aτAτ)26Cp2+3𝐰(τ)𝐰(τ1)2n\begin{split}&\lVert\overline{u}^{(\tau)}-\overline{v}^{(\tau)}\rVert^{2}\\ =&\frac{1}{n}\sum_{i=1}^{n}\left\lVert\overline{u}^{(\tau)}-{u}^{(\tau)}_{i}+{u}^{(\tau)}_{i}-{v}^{(\tau)}_{i}+{v}^{(\tau)}_{i}-\overline{v}^{(\tau)}\right\rVert^{2}\\ \leq&\frac{1}{n}\sum_{i=1}^{n}\Big{(}3\left\lVert\overline{u}^{(\tau)}-{u}^{(\tau)}_{i}\right\rVert^{2}+3\left\lVert{u}^{(\tau)}_{i}-{v}^{(\tau)}_{i}\right\rVert^{2}\\ &\quad\quad\quad+3\left\lVert{v}^{(\tau)}_{i}-\overline{v}^{(\tau)}\right\rVert^{2}\Big{)}\\ \leq&\left(\frac{a_{\tau}}{A_{\tau}}\right)^{2}\frac{6C_{p}^{2}+{3}\left\lVert{\bf w}^{(\tau)}-{\bf w}^{(\tau-1)}\right\rVert^{2}}{n}\end{split} (68)

where the first inequality follows from $\lVert x+y+z\rVert^{2}\leq 3\lVert x\rVert^{2}+3\lVert y\rVert^{2}+3\lVert z\rVert^{2}$, and the last inequality is due to Lemma 7 and (56). For $(II)$, by letting $\zeta^{(\tau)}=q_{i}^{(\tau)}$ and $\nu^{(\tau)}={w}_{i}^{(\tau)}$ in Lemma 6, we have

τ=1taτqi(τ),wi(τ)xd(x)τ=1t12wi(τ)wi(τ1)2.\begin{split}\sum_{\tau=1}^{t}a_{\tau}\left\langle q_{i}^{(\tau)},{w}_{i}^{(\tau)}-x^{*}\right\rangle\leq d(x^{*})-\sum_{\tau=1}^{t}\frac{1}{2}\lVert{w}_{i}^{(\tau)}-{w}_{i}^{(\tau-1)}\rVert^{2}.\end{split} (69)

To bound (III)(III), we use (61) to get

aτf(u¯(τ))qi(τ),wi(τ)xGaτf(u¯(τ))g^¯(τ)+q¯(τ)qi(τ)Gaτ(f(u¯(τ))g^¯(τ)+q¯(τ)qi(τ)).\begin{split}&a_{\tau}\left\langle\nabla f(\overline{u}^{(\tau)})-q_{i}^{(\tau)},{{w}}_{i}^{(\tau)}-x^{*}\right\rangle\\ \leq&Ga_{\tau}\left\lVert\nabla f(\overline{u}^{(\tau)})-\overline{\hat{g}}^{(\tau)}+\overline{q}^{(\tau)}-q_{i}^{(\tau)}\right\rVert\\ \leq&Ga_{\tau}\left(\left\lVert\nabla f(\overline{u}^{(\tau)})-\overline{\hat{g}}^{(\tau)}\right\rVert+\left\lVert\overline{q}^{(\tau)}-q_{i}^{(\tau)}\right\rVert\right).\end{split}

Upon using Lemma 7, we obtain

\begin{split}&\left\lVert\nabla f(\overline{u}^{(\tau)})-\overline{\hat{g}}^{(\tau)}\right\rVert\leq\frac{1}{n}\sum_{i=1}^{n}\left\lVert\nabla f_{i}(\overline{u}^{(\tau)})-\nabla f_{i}(u_{i}^{(\tau)})\right\rVert\\ \leq&\frac{L}{n}\sum_{i=1}^{n}\left\lVert\overline{u}^{(\tau)}-u_{i}^{(\tau)}\right\rVert\leq L\sqrt{\frac{\lVert\tilde{\bf u}^{(\tau)}\rVert^{2}}{n}}\leq\left(\frac{a_{\tau}}{A_{\tau}}\right)\frac{LC_{p}}{\sqrt{n}}.\end{split}

Moreover, by Lemma 8 and the Cauchy–Schwarz inequality, $\sum_{i=1}^{n}\left\lVert\overline{q}^{(\tau)}-q_{i}^{(\tau)}\right\rVert\leq\sqrt{n}\lVert\tilde{\bf q}^{(\tau)}\rVert\leq{\sqrt{n}C_{g}a_{\tau}}/{A_{\tau}}$. Therefore

i=1naτf(u¯(τ))qi(τ),wi(τ)x(aτ2Aτ)nG(LCp+Cg).\begin{split}&\sum_{i=1}^{n}a_{\tau}\left\langle\nabla f(\overline{u}^{(\tau)})-q_{i}^{(\tau)},{{w}}_{i}^{(\tau)}-x^{*}\right\rangle\\ &\leq\left(\frac{a_{\tau}^{2}}{A_{\tau}}\right){\sqrt{n}G(LC_{p}+C_{g})}.\end{split} (70)

Finally, by collectively substituting (68), (69), and (70) into (67), we get

At(f(v¯(t))f(x))(G(LCp+Cg)n+3LCp2n)τ=1taτ2Aτ+d(x)+12nτ=1t(3Laτ2Aτ1)𝐰(τ)𝐰(τ1)2.\begin{split}&A_{t}\left(f(\overline{v}^{(t)})-f(x^{*})\right)\\ {\leq}&\left(\frac{G(LC_{p}+C_{g})}{\sqrt{n}}+\frac{3LC_{p}^{2}}{n}\right)\sum_{\tau=1}^{t}\frac{a_{\tau}^{2}}{A_{\tau}}\\ &+d(x^{*})+\frac{1}{2n}\sum_{\tau=1}^{t}\left(\frac{3La_{\tau}^{2}}{A_{\tau}}-1\right)\left\lVert{\bf w}^{(\tau)}-{\bf w}^{(\tau-1)}\right\rVert^{2}.\end{split}

Based on the condition in (30) and the fact that aτ2/Aτ2a{a_{\tau}^{2}}/{A_{\tau}}\leq 2a, we obtain (31) as desired.

The inequality in (32) directly follows from Lemma 7. ∎
