
Decentralized Inexact Proximal Gradient Method With Network-Independent Stepsizes for Convex Composite Optimization

Luyao Guo, Xinli Shi,  Jinde Cao,  Zihao Wang This work was supported by the National Natural Science Foundation of China under Grant Nos. 61833005 and 62003084, and the Natural Science Foundation of Jiangsu Province of China under Grant No. BK20200355. (Corresponding author: Jinde Cao.)L. Guo is with the School of Mathematics and the Frontiers Science Center for Mobile Information Communication and Security, Southeast University, Nanjing 210096, China (e-mail: [email protected]).X. Shi and Z. Wang are with School of Cyber Science & Engineering, Southeast University, Nanjing 210096, China ([email protected]; [email protected]).J. Cao is with School of Mathematics, Frontiers Science Center for Mobile Information Communication and Security, Southeast University, Nanjing 210096, China, with Purple Mountain Laboratories, Nanjing 211111, China, and also with Yonsei Frontier Lab, Yonsei University, Seoul 03722, South Korea (e-mail: [email protected]).
Abstract

This paper proposes a novel CTA (Combine-Then-Adapt)-based decentralized algorithm for solving convex composite optimization problems over undirected and connected networks. The local loss function in these problems contains both smooth and nonsmooth terms. The proposed algorithm uses uncoordinated network-independent constant stepsizes and only needs to approximately solve a sequence of proximal mappings, which is advantageous for solving decentralized composite optimization problems where the proximal mappings of the nonsmooth loss functions may not have analytical solutions. For the general convex case, we prove an O(1/k) convergence rate of the proposed algorithm, which can be improved to o(1/k) if the proximal mappings are solved exactly. Furthermore, with metric subregularity, we establish a linear convergence rate for the proposed algorithm. Numerical experiments demonstrate the efficiency of the algorithm.

Index Terms:
Decentralized optimization, inexact proximal gradient method, linear convergence, network independent.

I Introduction

This paper considers the following decentralized composite optimization over networks with N computational agents:

\mathrm{x}^{*}\in\mathop{\arg\min}\limits_{\mathrm{x}\in\mathbb{R}^{n}}\big\{\sum_{i=1}^{N}(f_{i}(\mathrm{x})+g_{i}(\mathrm{x}))\big\}, \qquad (1)

where the local loss functions f_{i}:\mathbb{R}^{n}\rightarrow\mathbb{R} and g_{i}:\mathbb{R}^{n}\rightarrow\mathbb{R}\cup\{\infty\} are convex and can only be accessed by the corresponding agent. We assume that f_{i} is L_{i}-smooth, and that g_{i} is proper and closed but not necessarily smooth. This setting is general and is encountered in various application fields, including multi-agent control, wireless communication, and decentralized machine learning [1].

Arguably, the most straightforward approach to the decentralized setting is to employ the classical gradient descent method combined with the network structure [2, 3, 4, 5]. On one hand, in [2, 3, 4], the mixing matrix is used to enforce consensus of the local states through weighted averaging of neighbor states. These algorithms can be regarded as quadratic penalty methods for the equality consensus constraints, which leads to inexact convergence when the stepsizes are fixed [3]. To obtain the exact optimal solution, diminishing stepsizes are adopted [4]. On the other hand, by using the sign of the relative state rather than averaging over the local iterates via the mixing matrix, [5] proposes an algorithm based on the exact penalty method. However, these primal methods cannot achieve exact convergence with constant stepsizes.

A recent line of work in the primal-dual domain provides a number of alternative methods that converge exactly to the optimum with fixed stepsizes. When g_{i}=0 and f_{i} is strongly convex, several linearly convergent algorithms have been proposed, including D-ADMM[6], DLM[7], EXTRA[8], DIGing[9], Harnessing[10], AugDGM [11, 12], and Exact Diffusion[13]. When g_{i}\neq 0 and the nonsmooth terms are identical among all the agents, under the assumption that f_{i} is strongly convex, [14, 15, 16, 17] present several linearly convergent algorithms for problem (1). When g_{i}\neq 0 and the nonsmooth terms may be distinct among the agents, IC-ADMM[18], PG-ADMM[19], PG-EXTRA[20], NIDS[21], and PAD[22] have been proposed. The convergence of IC-ADMM is proved in [18] under the strong convexity of f_{i}. In [19], PG-ADMM is shown to have an O(1/k) ergodic convergence rate. Based on EXTRA[8], PG-EXTRA and NIDS are developed in [20] and [21], respectively. PG-EXTRA has an O(1/k) convergence rate, which improves to o(1/k) when the smooth part vanishes. NIDS, which has a network-independent stepsize, is proved to have an o(1/k) convergence rate. In [22], by the quadratic penalty method and linearized ADMM, an inexact convergence algorithm called PAD is proposed. For PAD, it is proved that for a fixed penalty parameter 1/\tilde{\epsilon}>0, the distance between the solution of problem (1) and the limit point of PAD is of order O(\tilde{\epsilon}^{\frac{1}{2}}). Moreover, PAD has an O(1/k) ergodic convergence rate under general convexity, and a linear convergence rate under the strong convexity of f_{i}.

TABLE I: Convergence Properties of Decentralized Algorithms For Convex Composite Optimization On Networks
Algorithm f_{i} g_{i} Stepsize condition Rate Exactness
PUDA[14] strongly convex g_{1}=\cdots=g_{N}; proximal mapping of g_{i} is efficiently computable \tau<\frac{2-\sigma_{M}(I-W)}{L_{F}} linear exact
ABC-Algorithm[15] \tau<\frac{1}{L_{F}+\mu_{F}}
NEXT[16]/SONATA[17] \tau<\min\big\{\frac{\tilde{\mu}_{m}}{\sum_{i=1}^{N}L_{i}},\varpi\big\}
IC-ADMM[18] strongly convex proximal mapping of g_{i} is efficiently computable \tau<\frac{\beta\sigma_{m}(\mathcal{L}_{+})}{L_{F}^{2}/\mu_{F}^{2}} converges exact
PG-ADMM[19] convex \tau_{i}<\frac{1}{L_{i}+\beta d_{i}} ergodic: O(\frac{1}{k})
PG-EXTRA[20] \tau<\frac{1+\sigma_{m}(W)}{L_{F}} non-ergodic: O(\frac{1}{k})
NIDS[21] \tau_{i}<\frac{2}{L_{i}}, 0<\beta\leq 0.5/\max_{i}\tau_{i} non-ergodic: o(\frac{1}{k})
PAD[22] convex proximal mapping of g_{i} is efficiently computable \tau<\frac{1}{L_{F}+\beta\sigma_{M}(I-W)} ergodic: O(\frac{1}{k}) inexact
strongly convex \tau<\frac{1}{L_{F}^{2}/\mu_{F}+\beta\sigma_{M}(I-W)} linear
D-iPGM convex \tau_{i}<\frac{2}{L_{i}}, 0<\beta\leq 0.5/\max_{i}\tau_{i} ergodic: O(\frac{1}{k}); non-ergodic: o(\frac{1}{k}) exact
metrically subregular linear
\tau and \beta are the stepsizes of the primal update and the dual update, respectively. \tau_{i} is the primal stepsize of the local agent. L_{i} is the Lipschitz constant of the gradient \nabla f_{i} and L_{F}=\max_{i}\{L_{i}\}. \mu_{F} is the strong convexity constant of the functions f_{i}. \tilde{\mu}_{m} is the strong convexity parameter of the successive convex approximation surrogate of f_{i}. \varpi>0 is a given constant. \mathcal{L}_{+} is the signless Laplacian matrix. The definitions of W, \sigma_{m}(\cdot), and \sigma_{M}(\cdot) can be found in Assumption 1 and Notations.

These algorithms can be divided into ATC (Adapt-Then-Combine)- and CTA (Combine-Then-Adapt)-based decentralized algorithms according to their characteristics [15]. Among ATC-based decentralized algorithms, there are several with network-independent constant stepsizes, such as AugDGM [11, 12], Exact Diffusion[13], ABC-Algorithm [15], NEXT[16]/SONATA[17], and NIDS [21] (see [15] for more details). However, to the best of our knowledge, the stepsize choice of the existing CTA-based decentralized algorithms is related to the network topology. Moreover, in the algorithms mentioned above for composite optimization, the proximal mapping of g_{i} is required to have an analytic solution or to be efficiently computable. However, in many applications, there is no analytic solution to the proximal mapping of g_{i}. Such problems include the generalized LASSO regularizer [23], the fused LASSO regularizer [24], the octagonal selection and clustering algorithm for regression (OSCAR) regularizer [25], and the group LASSO regularizer [26]. This prompts us to introduce an inexact inner solver to compute an approximation of the proximal mapping. There are several works focusing on inexact versions of ALM [27], the primal-dual hybrid gradient method [28], ADMM [29], and decentralized ALM [30, 31]. To our knowledge, the corresponding results for problem (1) are relatively scarce. Furthermore, for problem (1), when the nonsmooth parts g_{i} are distinct, the linear convergence rate of the corresponding algorithms under suitable assumptions has not been established. Although [14] shows that when f_{i} is a strongly convex smooth function and g_{i} is nonsmooth convex with a closed-form proximal mapping, dimension-independent exact global linear convergence cannot be attained for the composite optimization problem (1), this does not mean that a dimension-dependent linear convergence rate is impossible under some conditions. Therefore, it naturally leads to the following three questions.

Question 1. Can one provide a CTA-based decentralized algorithm with network-independent constant stepsizes?

Question 2. Can one provide a decentralized algorithm when the proximal mapping of g_{i} has no analytic solution?

Question 3. Can one establish the linear convergence of the proposed decentralized algorithm for the problem (1)?

This paper aims at addressing Questions 1, 2, and 3 for problem (1) over an undirected and connected network. We develop a novel CTA-based decentralized inexact proximal gradient method (D-iPGM) with network-independent constant stepsizes. In the implementation of D-iPGM, computable approximations of the proximal mapping are allowed, which implies that the proposed algorithm can reduce the computational cost when the proximal mapping of g_{i} has no closed-form solution. For the convergence analysis, we prove the O(1/k) convergence rate under general convexity. When the proximal mapping of g_{i} is solved exactly, we establish the o(1/k) non-ergodic convergence rate of D-iPGM. We further prove the linear convergence rate of D-iPGM under metric subregularity, which is weaker than strong convexity. Table I shows a comparison of D-iPGM with existing decentralized algorithms for convex composite optimization problems over networks.

This paper is organized as follows. Section II casts the problem (1) into a constrained form, and based on the reformulation, the D-iPGM is proposed. Additionally, some relations between D-iPGM and PG-EXTRA [20] and NIDS [21] are discussed. Then, the convergence properties under general convexity and metric subregularity are investigated in Section III. Finally, the numerical experiments are implemented in Section IV and conclusions are given in Section V.

Notations: \mathbb{R}^{n} denotes the n-dimensional vector space with inner product \langle\cdot,\cdot\rangle. The l_{1}-norm, l_{2}-norm, and l_{\infty}-norm are denoted by \|\cdot\|_{1}, \|\cdot\|, and \|\cdot\|_{\infty}, respectively. \mathbf{0}, \mathbf{I}, and \mathbf{1} denote the null matrix, the identity matrix, and the all-ones matrix of appropriate dimensions, respectively. For \mathrm{x}_{i}\in\mathbb{R}^{n}, i=1,\ldots,N, let \mathrm{col}\{\mathrm{x}_{1},\cdots,\mathrm{x}_{N}\}=[\mathrm{x}_{1}^{\sf T},\cdots,\mathrm{x}_{N}^{\sf T}]^{\sf T}. For a matrix A\in\mathbb{R}^{m\times n}, \sigma_{m}(A) and \sigma_{M}(A) are the smallest and largest nonzero singular values of A, respectively. If A is symmetric, A\succ 0 means that A is positive definite. Let \mathcal{H}\succ 0, and let \mathcal{D} be a non-empty closed convex subset of \mathbb{R}^{n}. For any \mathrm{x}\in\mathbb{R}^{n}, we define \|\mathrm{x}\|^{2}_{\mathcal{H}}:=\langle\mathrm{x},\mathcal{H}\mathrm{x}\rangle, \mathrm{dist}_{\mathcal{H}}(\mathrm{x},\mathcal{D}):=\min_{\mathrm{x}^{\prime}\in\mathcal{D}}\|\mathrm{x}-\mathrm{x}^{\prime}\|_{\mathcal{H}}, and \mathcal{P}_{\mathcal{D}}^{\mathcal{H}}(\mathrm{x}):=\arg\min_{\mathrm{x}^{\prime}\in\mathcal{D}}\|\mathrm{x}-\mathrm{x}^{\prime}\|_{\mathcal{H}}. For a given closed proper convex function \theta, the proximal mapping of \theta relative to \|\cdot\|_{\mathcal{H}} is defined as \mathrm{prox}_{\theta}^{\mathcal{H}}(\mathrm{x}):=\arg\min_{\mathrm{y}}\{\theta(\mathrm{y})+\frac{1}{2}\|\mathrm{y}-\mathrm{x}\|_{\mathcal{H}}^{2}\}. When \mathcal{H}=\mathbf{I}, we omit it from these notations. \mathcal{B}_{r}(\bar{\mathrm{x}}):=\{\mathrm{x}:\|\mathrm{x}-\bar{\mathrm{x}}\|<r\} denotes the open Euclidean ball around \bar{\mathrm{x}} with radius r>0.
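For instance, with \mathcal{H}=\mathbf{I} and \theta=\tau\|\cdot\|_{1}, the proximal mapping reduces to entrywise soft-thresholding; the following minimal numerical illustration of this notation is ours and is purely for exposition:

```python
import numpy as np

def prox_l1(x, tau):
    """prox_{tau*||.||_1}(x) = argmin_y tau*||y||_1 + 0.5*||y - x||^2
    (soft-thresholding, the H = I case of the prox notation above)."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

print(prox_l1(np.array([1.5, -0.2, 0.7]), tau=0.5))   # -> [1.0, -0.0, 0.2]
```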

II Decentralized Inexact Proximal Gradient Method

Consider an undirected and connected network \mathcal{G}(\mathcal{V},\mathcal{E}), where \mathcal{V}=\{v_{1},\ldots,v_{N}\} denotes the vertex set, and the edge set \mathcal{E}\subseteq\mathcal{V}\times\mathcal{V} specifies the connectivity in the network, i.e., a communication link between agents i and j exists iff (i,j)\in\mathcal{E}. Let \mathrm{x}_{i}\in\mathbb{R}^{n} be the decision variable held by agent i. To cast problem (1) into a consensus-constrained form and make sure that \mathrm{x}_{1}=\cdots=\mathrm{x}_{N}, we introduce the global mixing matrix \mathbf{W}=W\otimes\mathbf{I}_{n}, where W=[W_{ij}]\in\mathbb{R}^{N\times N}, and make the following standard assumption in decentralized optimization.

Assumption 1.

The mixing matrix W is symmetric and doubly stochastic, and the null space of \mathbf{I}-\mathbf{W} is \mathrm{Span}(\mathbf{1}). Moreover, if (i,j)\notin\mathcal{E} and i\neq j, then W_{ij}=W_{ji}=0; otherwise, W_{ij}>0.
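A standard way to construct a mixing matrix satisfying Assumption 1 on a connected graph is to use Metropolis weights; the following sketch is illustrative (the ring graph and the weight rule are our assumptions, not prescribed by the paper):

```python
import numpy as np

def metropolis_weights(adj):
    """Symmetric, doubly stochastic W with W_ij > 0 iff (i,j) is an edge or i = j.
    adj: symmetric 0/1 adjacency matrix of a connected graph without self-loops."""
    N = adj.shape[0]
    deg = adj.sum(axis=1)
    W = np.zeros((N, N))
    for i in range(N):
        for j in range(N):
            if adj[i, j] == 1 and i != j:
                W[i, j] = 1.0 / (1.0 + max(deg[i], deg[j]))
        W[i, i] = 1.0 - W[i].sum()      # each row (and, by symmetry, column) sums to one
    return W

adj = np.array([[0, 1, 0, 1],           # 4-agent ring graph
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [1, 0, 1, 0]])
W = metropolis_weights(adj)
print(np.allclose(W, W.T), np.allclose(W.sum(axis=0), 1.0))   # True True
```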

In addition, with this mixing matrix, similar to [20], we introduce another mixing matrix \tilde{\mathbf{W}} defined as \tilde{\mathbf{W}}:=\frac{1}{2}(\mathbf{I}+\mathbf{W}). By Assumption 1, the constraint \mathrm{x}_{1}=\cdots=\mathrm{x}_{N} is equivalent to the identity (\mathbf{I}-\mathbf{W})\mathbf{x}=\mathbf{0}, which is in turn equivalent to \sqrt{\mathbf{I}-\mathbf{W}}\mathbf{x}=\mathbf{0}. Then, we have the following equivalent formulation of problem (1):

\min_{\mathbf{x}}~\underbrace{\sum_{i=1}^{N}f_{i}(\mathrm{x}_{i})}_{:=F(\mathbf{x})}+\underbrace{\sum_{i=1}^{N}g_{i}(\mathrm{x}_{i})}_{:=G(\mathbf{x})},\quad \mathrm{s.t.}~\sqrt{\mathbf{I}-\mathbf{W}}\mathbf{x}=\mathbf{0}, \qquad (2)

where \mathbf{x}=[\mathrm{x}_{1}^{\sf T},\cdots,\mathrm{x}_{N}^{\sf T}]^{\sf T}\in\mathbb{R}^{Nn}. To guarantee the well-posedness of (1), it is necessary to suppose that an optimal solution of (1) exists. Throughout the paper, we make the following assumption.

Assumption 2.

The local function f_{i} is convex and L_{i}-smooth.

According to [32, Theorem 5.8] and Assumption 2, for any \mathbf{x},\mathbf{y},\mathbf{z}\in\mathbb{R}^{Nn}, we have the following commonly used inequalities

\langle\nabla F(\mathbf{x})-\nabla F(\mathbf{y}),\mathbf{x}-\mathbf{y}\rangle\geq\|\nabla F(\mathbf{x})-\nabla F(\mathbf{y})\|^{2}_{\mathbf{L}_{F}^{-1}}, \qquad (3)
\langle\mathbf{x}-\mathbf{y},\nabla F(\mathbf{z})\rangle\leq F(\mathbf{x})-F(\mathbf{y})+\frac{1}{2}\|\mathbf{y}-\mathbf{z}\|_{\mathbf{L}_{F}}^{2}, \qquad (4)
\langle\mathbf{x}-\mathbf{y},\nabla F(\mathbf{z})-\nabla F(\mathbf{x})\rangle\leq\frac{1}{4}\|\mathbf{y}-\mathbf{z}\|_{\mathbf{L}_{F}}^{2}, \qquad (5)

where \mathbf{L}_{F}=\mathrm{diag}\{L_{1}\mathbf{I}_{n},\cdots,L_{N}\mathbf{I}_{n}\}.

II-A Algorithm Development

Recall the \mathrm{ABC}-unified framework proposed in [15]:

\mathbf{s}^{k} =\mathrm{A}\mathbf{x}^{k}-\tau\mathrm{B}\nabla F(\mathbf{x}^{k})-\bm{\lambda}^{k}, \qquad (6a)
\bm{\lambda}^{k+1} =\bm{\lambda}^{k}+\mathrm{C}\mathbf{s}^{k}, \qquad (6b)
\mathbf{x}^{k+1} =\mathrm{prox}_{\tau G}(\mathbf{s}^{k}), \qquad (6c)

where \mathrm{A},\mathrm{B},\mathrm{C} are suitably chosen weight matrices, \tau>0 is the primal stepsize, and the nonsmooth function g_{i} is common to all agents. Let \mathrm{C}=\frac{1}{2}(\mathbf{I}-\mathbf{W}). According to this unified framework, when \mathrm{A}=\mathrm{B}\neq\mathbf{I}, by eliminating the dual variable \bm{\lambda}, it holds that

\mathbf{x}^{k+1} =\mathrm{prox}_{\tau G}(\mathbf{s}^{k}),
\mathbf{s}^{k+1} =\tilde{\mathbf{W}}\mathbf{s}^{k}+\underbrace{\mathrm{A}(\underbrace{\mathbf{x}^{k+1}-\mathbf{x}^{k}+\tau\nabla F(\mathbf{x}^{k})-\tau\nabla F(\mathbf{x}^{k+1})}_{\text{First: Adapt}})}_{\text{Then: Combine}},

and these algorithms are the ATC-based decentralized algorithms. Similarly, when \mathrm{B}=\mathbf{I}, one deduces that

\mathbf{x}^{k+1} =\mathrm{prox}_{\tau G}(\mathbf{s}^{k}),
\mathbf{s}^{k+1} =\tilde{\mathbf{W}}\mathbf{s}^{k}+\underbrace{\underbrace{\mathrm{A}(\mathbf{x}^{k+1}-\mathbf{x}^{k})}_{\text{First: Combine}}+\tau\nabla F(\mathbf{x}^{k})-\tau\nabla F(\mathbf{x}^{k+1})}_{\text{Then: Adapt}},

and these algorithms are the CTA-based decentralized algorithms. For ATC-based algorithms, the communication of both state information and local gradient information is involved. By this strategy, the stepsize conditions of ATC-based algorithms are often independent of the network topology, such as for NIDS [21], which exchanges the gradient-adapted estimates 2\mathbf{x}^{k+1}-\mathbf{x}^{k}-\tau(\nabla F(\mathbf{x}^{k+1})-\nabla F(\mathbf{x}^{k})); the stepsize condition of NIDS is 0<\tau<2/L_{F}, where L_{F}=\max_{i\in\mathcal{V}}\{L_{i}\}. For CTA-based algorithms, only state information needs to be shared; however, their stepsize conditions are often related to the network topology, such as for PG-EXTRA [20], which exchanges only the state information and whose stepsize condition is 0<\tau<(1+\sigma_{m}(\mathbf{W}))/L_{F}. To blend the respective advantages of these two types of algorithms, we propose the following decentralized algorithm for problem (1):

\tilde{\mathbf{x}}^{k} =\mathrm{prox}_{G}^{\Gamma^{-1}}(\mathbf{x}^{k}-\Gamma(\nabla F(\mathbf{x}^{k})-\bm{\lambda}^{k})), \qquad (7a)
\bm{\lambda}^{k+1} =\bm{\lambda}^{k}-\beta(\mathbf{I}-\mathbf{W})\tilde{\mathbf{x}}^{k}, \qquad (7b)
\mathbf{x}^{k+1} =\tilde{\mathbf{x}}^{k}-\Gamma(\bm{\lambda}^{k}-\bm{\lambda}^{k+1}), \qquad (7c)

where \Gamma=\mathrm{diag}\{\tau_{1}\mathbf{I}_{n},\cdots,\tau_{N}\mathbf{I}_{n}\} is the primal stepsize matrix, \beta>0 is the dual stepsize, \bm{\lambda}^{k}=\sqrt{\mathbf{I}-\mathbf{W}}\bm{\alpha}^{k} is the "scaled" Lagrange multiplier, and \bm{\alpha}^{k} is the Lagrange multiplier. To achieve a network-independent stepsize, we use the simple primal correction step (7c). In Section III, we will prove that algorithm (7) converges when 0<\tau_{i}<2/L_{i}, i\in\mathcal{V}, and 0<\beta\leq 0.5/\max_{i}\tau_{i}. By eliminating the dual variable \bm{\lambda}, it holds that

\tilde{\mathbf{x}}^{k} =\mathrm{prox}^{\Gamma^{-1}}_{G}(\mathbf{s}^{k}),
\mathbf{x}_{c}^{k} =\beta\Gamma(\mathbf{I}-\mathbf{W})\tilde{\mathbf{x}}^{k},
\mathbf{x}^{k+1} =\tilde{\mathbf{x}}^{k}-\mathbf{x}_{c}^{k},
\mathbf{s}^{k+1} =\mathbf{s}^{k}-\mathbf{x}^{k}+\tilde{\mathbf{x}}^{k}-2\mathbf{x}_{c}^{k}+\Gamma\nabla F(\mathbf{x}^{k})-\Gamma\nabla F(\mathbf{x}^{k+1}).

Apparently, the implementation of the proposed algorithm only requires the neighboring variables \tilde{\mathrm{x}}_{j}^{k}, and there is only one round of communication in each iteration. Compared with (6), the proposed algorithm is indeed a CTA-based decentralized algorithm and is independent of this primal-dual algorithmic framework.

On the other hand, when the proximal mapping of g_{i} has no analytical solution, solving subproblem (7a) very accurately is wasteful. It is more efficient to develop a proper condition and stop the subproblem procedure, i.e., the inner iterations, once the condition is satisfied. By introducing an absolutely summable error criterion for subproblem (7a), we obtain D-iPGM (see Algorithm 1).

Algorithm 1 D-iPGM
1: Input: mixing matrix W, 0<\tau_{i}<2/L_{i}, and 0<\beta\leq 0.5/\max_{i}\tau_{i}. Let \{\varepsilon_{k}\} be a summable sequence of nonnegative numbers.
2: Initialize \mathbf{x}^{0}\in\mathbb{R}^{Nn}, \bm{\lambda}^{0}=\mathbf{0}.
3: for k=1,2,\ldots,K do
4:     Primal Update:
\tilde{\mathrm{x}}_{i}^{k}\approx\tilde{\mathrm{y}}_{i}^{k}=\mathrm{prox}_{\tau_{i}g_{i}}(\mathrm{x}_{i}^{k}-\tau_{i}(\nabla f_{i}(\mathrm{x}_{i}^{k})-\lambda_{i}^{k})), \qquad (8)
such that there exists a vector
\mathrm{d}_{i}^{k}\in\partial g_{i}(\tilde{\mathrm{x}}_{i}^{k})+\nabla f_{i}(\mathrm{x}_{i}^{k})-\lambda_{i}^{k}+\frac{1}{\tau_{i}}(\tilde{\mathrm{x}}_{i}^{k}-\mathrm{x}_{i}^{k}), \qquad (9a)
satisfying \|\mathrm{d}_{i}^{k}\|\leq\varepsilon_{k}. \qquad (9b)
5:     Agent i exchanges \tilde{\mathrm{x}}_{i}^{k} with its neighbors.
6:     Dual Update:
\lambda^{k+1}_{i}=\lambda_{i}^{k}-\beta\big(\tilde{\mathrm{x}}_{i}^{k}-\sum\nolimits_{j=1}^{N}W_{ij}\tilde{\mathrm{x}}_{j}^{k}\big). \qquad (10)
7:     Correction Step: \mathrm{x}_{i}^{k+1}=\tilde{\mathrm{x}}^{k}_{i}-\tau_{i}(\lambda_{i}^{k}-\lambda_{i}^{k+1}).
8: end for
9: Output: \mathbf{x}^{K}.
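To make the steps of Algorithm 1 concrete, below is a minimal centralized simulation of D-iPGM on a decentralized LASSO-type instance, where f_{i} is a local least-squares loss and g_{i}=\rho\|\cdot\|_{1}, so the primal update (8) is exact and \varepsilon_{k}\equiv 0; the problem data, graph (complete, so W=\frac{1}{N}\mathbf{1}\mathbf{1}^{\top}), and stepsizes are illustrative assumptions:

```python
import numpy as np

np.random.seed(0)
N, n, m, rho = 4, 10, 20, 0.1
A = [np.random.randn(m, n) for _ in range(N)]        # local data of agent i
b = [np.random.randn(m) for _ in range(N)]
L = [np.linalg.norm(Ai, 2) ** 2 for Ai in A]         # Lipschitz constants of grad f_i
tau = [1.9 / Li for Li in L]                         # uncoordinated tau_i < 2 / L_i
beta = 0.5 / max(tau)                                # 0 < beta <= 0.5 / max_i tau_i
W = np.full((N, N), 1.0 / N)                         # complete-graph mixing matrix

def grad_f(i, x):                                    # gradient of f_i(x) = 0.5*||A_i x - b_i||^2
    return A[i].T @ (A[i] @ x - b[i])

def prox_g(v, t):                                    # exact prox of t*rho*||.||_1, so eps_k = 0
    return np.sign(v) * np.maximum(np.abs(v) - t * rho, 0.0)

x = np.zeros((N, n)); lam = np.zeros((N, n))
for k in range(500):
    # primal update (8): local (here exact) proximal gradient step
    xt = np.array([prox_g(x[i] - tau[i] * (grad_f(i, x[i]) - lam[i]), tau[i])
                   for i in range(N)])
    # dual update (10): one round of communication of the neighbors' xt
    lam_new = lam - beta * (xt - W @ xt)
    # correction step: purely local, no extra communication
    x = np.array([xt[i] - tau[i] * (lam[i] - lam_new[i]) for i in range(N)])
    lam = lam_new
```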

In Algorithm 1, to save computational expense, inexact proximal gradient descent is allowed in the k-th iteration of D-iPGM, i.e., we can solve subproblem (8) inexactly by finding an approximate solution. This requires an appropriate inner iterative method, which is available for a majority of applications. For instance, for the generalized LASSO regularizer [23], the fused LASSO regularizer [24], the OSCAR regularizer [25], and the group LASSO regularizer [26], in which g_{i} has the form g_{i}(\mathrm{x})=r_{1,i}(\mathrm{x})+r_{2,i}(\mathrm{D}_{i}\mathrm{x}) with r_{1,i},r_{2,i} being nonsmooth functions and \mathrm{D}_{i} being a given matrix, we can use operator splitting methods [33] or accelerated linearized ADMM [34] to solve (8) with the inner stopping criterion (9).

Remark 1.

Since g_{i} is convex, closed, and proper, by the definition of the proximal mapping, each of the subproblems (8) is strongly convex, so that each \tilde{\mathrm{y}}_{i}^{k} is uniquely determined by (\mathrm{x}_{i}^{k},\lambda_{i}^{k}). From [29], it holds that for any \varepsilon_{k}\geq 0, a certain \tilde{\mathrm{x}}^{k} can be found such that the absolutely summable error criterion (9) holds. Therefore, Algorithm 1 is well defined. Moreover, by [28, Lemma 3.2], the inner loops of Algorithm 1 always terminate after a finite number of iterations.

Remark 2.

It follows from (9a) and the definition of the proximal mapping that \tilde{\mathrm{x}}_{i}^{k}=\tilde{\mathrm{y}}_{i}^{k} if \mathrm{d}^{k}_{i}=\mathbf{0}. Therefore, \tilde{\mathrm{x}}_{i}^{k} can be seen as an approximation of the exact solution \tilde{\mathrm{y}}_{i}^{k}. To simplify the convergence analysis, we adopt the inner stopping criterion (9). However, in practice, we can alternatively use \|\mathrm{x}_{i}^{k,l+1}-\mathrm{x}_{i}^{k,l}\|\leq\varepsilon_{0}\varepsilon_{k} as the inner stopping criterion, where \varepsilon_{0}>0 is a constant and \mathrm{x}_{i}^{k,l} is the estimate of \tilde{\mathrm{y}}_{i}^{k} in the l-th inner iteration, because \mathrm{d}_{i}^{k,l+1}\in\partial g_{i}(\mathrm{x}_{i}^{k,l+1})+\nabla f_{i}(\mathrm{x}_{i}^{k})-\lambda_{i}^{k}+\frac{1}{\tau_{i}}(\mathrm{x}_{i}^{k,l+1}-\mathrm{x}_{i}^{k}) can be controlled by \|\mathrm{x}_{i}^{k,l+1}-\mathrm{x}_{i}^{k,l}\|, i.e., there exists a constant \varepsilon_{0} such that \|\mathrm{d}_{i}^{k,l+1}\|\leq\varepsilon_{0}\|\mathrm{x}_{i}^{k,l+1}-\mathrm{x}_{i}^{k,l}\|. We will specify this in Section IV.
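For example, when g_{i}(\mathrm{x})=\rho\|\mathrm{D}\mathrm{x}\|_{1} (a generalized/fused LASSO-type regularizer), subproblem (8) has no closed-form solution, but it can be solved by projected gradient ascent on its dual; the sketch below is one possible inner solver using the practical stopping rule of this remark (the matrix D, \rho, and the tolerances are illustrative assumptions):

```python
import numpy as np

def inexact_prox_gen_lasso(v, tau, D, rho, tol, max_inner=1000):
    """Approximately solve (8) for g(x) = rho*||D x||_1, i.e.,
    min_y rho*||D y||_1 + ||y - v||^2 / (2*tau),
    by projected gradient ascent on the dual variable z with ||z||_inf <= rho.
    The loop stops once ||y^{l+1} - y^{l}|| <= tol, mimicking the practical
    criterion ||x_i^{k,l+1} - x_i^{k,l}|| <= eps_0 * eps_k."""
    z = np.zeros(D.shape[0])
    step = 1.0 / (tau * np.linalg.norm(D, 2) ** 2 + 1e-12)
    y_old = v - tau * (D.T @ z)                       # primal point recovered from z
    for _ in range(max_inner):
        z = np.clip(z + step * (D @ v - tau * (D @ (D.T @ z))), -rho, rho)
        y = v - tau * (D.T @ z)
        if np.linalg.norm(y - y_old) <= tol:
            break
        y_old = y
    return y

# illustrative call: v plays the role of x_i^k - tau_i*(grad f_i(x_i^k) - lambda_i^k)
D = np.diff(np.eye(10), axis=0)                       # finite-difference operator (fused LASSO)
y_approx = inexact_prox_gen_lasso(np.random.randn(10), 0.5, D, rho=0.2, tol=1e-8)
```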

Remark 3.

Let \mathbf{d}^{k}=\mathrm{col}\{\mathrm{d}_{1}^{k},\cdots,\mathrm{d}_{N}^{k}\}. Different from [28], we require \|\mathrm{d}_{i}^{k}\|\leq\varepsilon_{k} rather than \|\mathbf{d}^{k}\|\leq\frac{\varepsilon_{k}}{\max\{\beta,\|\tilde{\mathbf{x}}^{k}\|\}}. Thus, by this setting, all agents have their own independent subproblems and independent error criteria for these subproblems, which implies that these subproblems can be solved in a decentralized manner. In addition, to control the residual \mathrm{d}_{i}^{k}, we introduce a summable sequence \{\varepsilon_{k}\}, which implies that we can choose \varepsilon_{k} as \varepsilon_{k}=\varepsilon_{0}/(k+1)^{r} with any \varepsilon_{0}>0 and r>1, or as \varepsilon_{k}=r^{k} with 0<r<1. Apparently, when the proximal mapping of g_{i} has a closed-form representation, we can set \varepsilon_{k}\equiv 0, k\geq 0. In this scenario, we use the exact proximal gradient descent to perform the primal update, and thus D-iPGM has the same per-iteration complexity as the existing proximal gradient methods [14, 15, 16, 17, 18, 19, 20, 21, 22].

TABLE II: The Comparison of D-iPGM (\varepsilon_{k}=0) with PG-EXTRA [20] and NIDS [21].
Algorithm Communication Stepsize, uncoordinated? Non-ergodic Rate Ergodic Rate Linear Rate Inexact Iteration
PG-EXTRA [20] 2\mathbf{x}^{k+1}-\mathbf{x}^{k} \tau<(1+\sigma_{m}(\mathbf{W}))/L_{F}, no O(1/k) No results No results
NIDS [21] \tau(\nabla F(\mathbf{x}^{k})-\nabla F(\mathbf{x}^{k+1}))+2\mathbf{x}^{k+1}-\mathbf{x}^{k} \tau_{i}<2/L_{i}, yes o(1/k) No results No results
D-iPGM, \varepsilon_{k}=0 \tilde{\mathbf{x}}^{k} \tau_{i}<2/L_{i}, yes o(1/k) O(1/k)

II-B Comparison With PG-EXTRA [20] and NIDS [21]

In this subsection, we compare D-iPGM (\varepsilon_{k}=0), PG-EXTRA [20], and NIDS [21] from the perspective of operator splitting. Recall problem (2), which is equivalent to

\min_{\mathbf{x}}F(\mathbf{x})+G(\mathbf{x})+\delta_{\{\mathbf{0}\}}(\mathbf{V}\mathbf{x}), \qquad (11)

where \mathbf{V}=\sqrt{\mathbf{I}-\mathbf{W}}, and \delta_{\{\mathbf{0}\}}(\mathbf{V}\mathbf{x}) is the indicator function defined as \delta_{\{\mathbf{0}\}}(\mathbf{V}\mathbf{x})=0 if \mathbf{V}\mathbf{x}=\mathbf{0} and \delta_{\{\mathbf{0}\}}(\mathbf{V}\mathbf{x})=\infty otherwise, which encodes the constraint \mathbf{V}\mathbf{x}=\mathbf{0}. For compactness of exposition, we define the following operators

\bm{A}=\begin{pmatrix}\partial G&\mathbf{0}\\ \mathbf{0}&\mathbf{0}\end{pmatrix},\quad\bm{B}=\begin{pmatrix}\mathbf{0}&-\mathbf{V}\\ \mathbf{V}&\mathbf{0}\end{pmatrix},\quad\bm{C}=\begin{pmatrix}\nabla F&\mathbf{0}\\ \mathbf{0}&\mathbf{0}\end{pmatrix}.

According to [35, Section 11.3.3], PG-EXTRA is equivalent to the C-V splitting algorithm [36, 37] applied to problem (11). In addition, it can also be derived from forward-backward operator splitting [33] under the metric defined by

\mathbf{H}_{\mathrm{P}}=\begin{pmatrix}\frac{1}{\tau}\mathbf{I}&\mathbf{V}\\ \mathbf{V}&2\tau\mathbf{I}\end{pmatrix},

i.e., it can be rewritten as

\mathbf{u}^{k+1}=(\mathbf{H}_{\mathrm{P}}+\bm{A}+\bm{B})^{-1}(\mathbf{H}_{\mathrm{P}}-\bm{C})\mathbf{u}^{k},

where \mathbf{u}^{k}=(\mathbf{x}^{k},\bm{\alpha}^{k}). Since \bm{\lambda}^{k}=\sqrt{\mathbf{I}-\mathbf{W}}\bm{\alpha}^{k}, the primal-dual update of PG-EXTRA is

\mathbf{x}^{k+1} =\mathrm{prox}_{\tau G}(\mathbf{x}^{k}-\tau(\nabla F(\mathbf{x}^{k})-\bm{\lambda}^{k})), \qquad (12a)
\bm{\lambda}^{k+1} =\bm{\lambda}^{k}-\frac{1}{2\tau}(\mathbf{I}-\mathbf{W})(2\mathbf{x}^{k+1}-\mathbf{x}^{k}). \qquad (12b)

By eliminating the dual variable \bm{\lambda}, it holds that

\mathbf{x}^{k+1} =\mathrm{prox}_{\tau G}(\mathbf{s}^{k}), \qquad (13a)
\mathbf{s}^{k+1} =\mathbf{s}^{k}-\mathbf{x}^{k+1}+\tilde{\mathbf{W}}(2\mathbf{x}^{k+1}-\mathbf{x}^{k})+\tau\nabla F(\mathbf{x}^{k})-\tau\nabla F(\mathbf{x}^{k+1}). \qquad (13b)

In the implementation of PG-EXTRA, only the communication of the state variable \mathrm{x}_{i} is involved. Since the eigenvalues of \mathbf{W} lie in (-1,1], and the multiplicity of the eigenvalue 1 is one, the metric \mathbf{H}_{\mathrm{P}} is positive definite. To ensure the convergence of PG-EXTRA, the positive definiteness of the following metric matrix is required:

\widehat{\mathbf{H}}_{\mathrm{P}}=\mathbf{H}_{\mathrm{P}}-\frac{1}{2}\mathrm{diag}\{L_{F}\mathbf{I},\mathbf{0}\},

where L_{F}=\max_{i\in\mathcal{V}}\{L_{i}\}. Thus, the stepsize condition of PG-EXTRA is 0<\tau<(1+\sigma_{m}(\mathbf{W}))/L_{F}.

As for NIDS, by [35, Section 11.3.3], it is equivalent to the primal-dual three-operator splitting algorithm (PD3O) [38] applied to problem (11) with the metric matrix being

\mathbf{H}_{\mathrm{N}}=\begin{pmatrix}\frac{1}{\tau}\mathbf{I}&\mathbf{0}\\ \mathbf{0}&\tau(2\mathbf{I}-\mathbf{V}^{2})\end{pmatrix}.

The primal-dual update of NIDS is

\mathbf{x}^{k+1} =\mathrm{prox}_{\tau G}(\mathbf{x}^{k}-\tau(\nabla F(\mathbf{x}^{k})-\bm{\lambda}^{k})), \qquad (14a)
\bm{\lambda}^{k+1} =\bm{\lambda}^{k}-\frac{1}{2\tau}(\mathbf{I}-\mathbf{W})\big(2\mathbf{x}^{k+1}-\mathbf{x}^{k}+\tau\nabla F(\mathbf{x}^{k})-\tau\nabla F(\mathbf{x}^{k+1})\big). \qquad (14b)

By eliminating the dual variable \bm{\lambda}, it holds that

\mathbf{x}^{k+1} =\mathrm{prox}_{\tau G}(\mathbf{s}^{k}), \qquad (15a)
\mathbf{s}^{k+1} =\mathbf{s}^{k}-\mathbf{x}^{k+1}+\tilde{\mathbf{W}}\big(2\mathbf{x}^{k+1}-\mathbf{x}^{k}+\tau\nabla F(\mathbf{x}^{k})-\tau\nabla F(\mathbf{x}^{k+1})\big). \qquad (15b)

To achieve a more relaxed convergence condition than PG-EXTRA, two additional terms, \nabla F(\mathbf{x}^{k}) and \nabla F(\mathbf{x}^{k+1}), appear in the dual update compared with the primal-dual version of PG-EXTRA (12). Moreover, in the implementation of NIDS, the communication of both the state variable \mathrm{x}_{i} and the gradient \nabla f_{i} is involved. Since 2\mathbf{I}-\mathbf{V}^{2}=\mathbf{I}+\mathbf{W}\succ 0, the metric \mathbf{H}_{\mathrm{N}} is positive definite. Similarly, to ensure the convergence of NIDS, the metric matrix

\widehat{\mathbf{H}}_{\mathrm{N}}=\mathbf{H}_{\mathrm{N}}-\frac{1}{2}\mathrm{diag}\{L_{F}\mathbf{I},\mathbf{0}\}

is required to be positive definite. Thus, the stepsize condition of NIDS is 0<\tau<2/L_{F}.

Inspired by asymmetric forward-backward-adjoint splitting [39], which is a very general three-operator splitting scheme, and by the prediction-correction framework [40], the proposed D-iPGM can be seen as a triangularly preconditioned forward-backward operator splitting algorithm with a primal corrector. For a clearer view of this, we define the triangular preconditioner \mathbf{Q}_{\mathrm{D}} and the corrector \mathbf{M}_{\mathrm{D}} as

\mathbf{Q}_{\mathrm{D}}=\begin{pmatrix}\frac{1}{\tau}\mathbf{I}&\mathbf{V}\\ \mathbf{0}&2\tau\mathbf{I}\end{pmatrix},\quad \mathbf{M}_{\mathrm{D}}=\begin{pmatrix}\mathbf{I}&\tau\mathbf{V}\\ \mathbf{0}&\mathbf{I}\end{pmatrix}.

Let \varepsilon_{k}=0, \Gamma=\tau\mathbf{I} with \tau>0, and \beta=\frac{1}{2\tau}. By these definitions, D-iPGM is equivalent to

[D-iPGM]:\quad \tilde{\mathbf{u}}^{k} =(\mathbf{Q}_{\mathrm{D}}+\bm{A}+\bm{B})^{-1}(\mathbf{Q}_{\mathrm{D}}-\bm{C})\mathbf{u}^{k},
\mathbf{u}^{k+1} =\mathbf{u}^{k}-\mathbf{M}_{\mathrm{D}}(\mathbf{u}^{k}-\tilde{\mathbf{u}}^{k}).

By eliminating the dual variable, it holds that

\tilde{\mathbf{x}}^{k} =\mathrm{prox}_{\tau G}(\mathbf{s}^{k}),
\mathbf{x}^{k+1} =\tilde{\mathbf{W}}\tilde{\mathbf{x}}^{k},
\mathbf{s}^{k+1} =\mathbf{s}^{k}-\mathbf{x}^{k}+\mathbf{W}\tilde{\mathbf{x}}^{k}+\tau\nabla F(\mathbf{x}^{k})-\tau\nabla F(\mathbf{x}^{k+1}).

Different from forward-backward operator splitting, the preconditioner \mathbf{Q}_{\mathrm{D}} is not required to be positive definite. To achieve a network-independent convergence condition, the correction step \mathbf{u}^{k+1}=\mathbf{u}^{k}-\mathbf{M}_{\mathrm{D}}(\mathbf{u}^{k}-\tilde{\mathbf{u}}^{k}) is employed, which makes the metric matrices of D-iPGM,

\mathbf{H}_{\mathrm{D}} =\mathbf{Q}_{\mathrm{D}}\mathbf{M}_{\mathrm{D}}^{-1}=\mathrm{diag}\{\tfrac{1}{\tau}\mathbf{I},2\tau\mathbf{I}\}\succ 0,
\widehat{\mathbf{H}}_{\mathrm{D}} =\mathbf{Q}_{\mathrm{D}}^{\sf T}+\mathbf{Q}_{\mathrm{D}}-\mathbf{M}_{\mathrm{D}}^{\sf T}\mathbf{H}_{\mathrm{D}}\mathbf{M}_{\mathrm{D}}-\frac{1}{2}\mathrm{diag}\{L_{F}\mathbf{I},\mathbf{0}\}=\widehat{\mathbf{H}}_{\mathrm{N}},

diagonal. Therefore, similar to NIDS, the stepsize condition of D-iPGM is 0<\tau<2/L_{F}. We compare it with PG-EXTRA and NIDS in Table II. In this comparison, we observe that the primal sequence \{\mathbf{x}^{k}\} produced by D-iPGM differs from those of PG-EXTRA and NIDS. Therefore, we conclude that D-iPGM is a distinct algorithm that distinguishes itself from PG-EXTRA and NIDS. To achieve a more lenient convergence condition (i.e., a network-independent stepsize condition) in a way that differs from NIDS, D-iPGM adopts an extra primal correction step. This primal correction step can be executed independently by each agent without exchanging any private information. Additionally, the extra computational cost is almost negligible compared with PG-EXTRA. It is worth noting that, in the absence of the proximal term, the sequences \{\mathbf{x}^{k}\} generated by NIDS and D-iPGM are identical. Furthermore, as shown in Section IV-A, D-iPGM (\varepsilon_{k}=0) and NIDS exhibit similar performance in solving composite optimization problems.

III Convergence Analysis

This section analyzes the convergence and the convergence rate of the D-iPGM. The convergence analysis will be conducted in the variational inequality context [40], and the forthcoming analysis of the proposed D-iPGM is based on the following fundamental lemma.

Lemma 1 ([40]).

Let \theta_{1}(\mathrm{x}) and \theta_{2}(\mathrm{x}) be proper closed convex functions. If \theta_{1} is differentiable, and the solution set of the problem \min\{\theta_{1}(\mathrm{x})+\theta_{2}(\mathrm{x}):\mathrm{x}\in\mathbb{R}^{n}\} is nonempty, then it holds that \mathrm{x}^{*}\in\arg\min\{\theta_{1}(\mathrm{x})+\theta_{2}(\mathrm{x}):\mathrm{x}\in\mathbb{R}^{n}\} if and only if \theta_{2}(\mathrm{x})-\theta_{2}(\mathrm{x}^{*})+\langle\mathrm{x}-\mathrm{x}^{*},\nabla\theta_{1}(\mathrm{x}^{*})\rangle\geq 0,\ \forall\mathrm{x}\in\mathbb{R}^{n}.

Recall the Lagrangian of problem (2): L(\mathbf{x},\bm{\alpha})=F(\mathbf{x})+G(\mathbf{x})-\langle\bm{\alpha},\mathbf{V}\mathbf{x}\rangle, where \mathbf{V}=\sqrt{\mathbf{I}-\mathbf{W}} and \bm{\alpha}\in\mathbb{R}^{Nn} is the Lagrange multiplier. By strong duality, to solve problem (2), we can consider the saddle-point problem

\min_{\mathbf{x}}\max_{\bm{\alpha}}F(\mathbf{x})+G(\mathbf{x})-\langle\bm{\alpha},\mathbf{V}\mathbf{x}\rangle. \qquad (16)

Define \mathcal{M}^{*} as the set of saddle points of the above saddle-point problem. It holds that \mathcal{M}^{*}=\mathcal{X}^{*}\times\mathcal{Y}^{*}, where \mathcal{X}^{*} is the optimal solution set of problem (2) and \mathcal{Y}^{*} is the optimal solution set of its dual problem. Denote the KKT mapping as

\mathcal{J}(\mathbf{u})=\begin{pmatrix}\partial G(\mathbf{x})+\nabla F(\mathbf{x})-\mathbf{V}\bm{\alpha}\\ \mathbf{V}\mathbf{x}\end{pmatrix}.

We have (\mathbf{x}^{*},\bm{\alpha}^{*})\in\mathcal{X}^{*}\times\mathcal{Y}^{*} if and only if \mathbf{0}\in\mathcal{J}(\mathbf{x}^{*},\bm{\alpha}^{*}). A point \mathbf{u}^{*}=(\mathbf{x}^{*},\bm{\alpha}^{*})\in\mathcal{M}^{*}\subseteq\mathbb{R}^{Nn}\times\mathbb{R}^{Nn}:=\mathcal{M} is called a saddle point of the Lagrangian function L(\mathbf{x},\bm{\alpha}) if L(\mathbf{x}^{*},\bm{\alpha})\leq L(\mathbf{x}^{*},\bm{\alpha}^{*})\leq L(\mathbf{x},\bm{\alpha}^{*}),\ \forall(\mathbf{x},\bm{\alpha})\in\mathcal{M}, which can be alternatively rewritten as the following variational inequalities (by Lemma 1)

G(\mathbf{x})-G(\mathbf{x}^{*})+\langle\mathbf{u}-\mathbf{u}^{*},\mathcal{K}_{1}(\mathbf{u}^{*})\rangle\geq 0,\ \forall\mathbf{u}\in\mathcal{M}, \qquad (17)
\Longleftrightarrow
\phi(\mathbf{x})-\phi(\mathbf{x}^{*})+\langle\mathbf{u}-\mathbf{u}^{*},\mathcal{K}_{2}(\mathbf{u}^{*})\rangle\geq 0,\ \forall\mathbf{u}\in\mathcal{M}, \qquad (18)

where \phi(\mathbf{x})=G(\mathbf{x})+F(\mathbf{x}), \mathbf{u}=\mathrm{col}\{\mathbf{x},\bm{\alpha}\},

\mathcal{K}_{1}(\mathbf{u})=\begin{pmatrix}\nabla F(\mathbf{x})-\mathbf{V}\bm{\alpha}\\ \mathbf{V}\mathbf{x}\end{pmatrix},\text{ and }\mathcal{K}_{2}(\mathbf{u})=\begin{pmatrix}-\mathbf{V}\bm{\alpha}\\ \mathbf{V}\mathbf{x}\end{pmatrix}.

Note that the mapping \mathcal{K}_{2} is affine with a skew-symmetric matrix, and thus we have

\langle\mathbf{u}_{1}-\mathbf{u}_{2},\mathcal{K}_{2}(\mathbf{u}_{1})-\mathcal{K}_{2}(\mathbf{u}_{2})\rangle\equiv 0,\ \forall\mathbf{u}_{1},\mathbf{u}_{2}\in\mathcal{M}. \qquad (19)

From the above analysis, it holds that the set \mathcal{M}^{*} is also the solution set of (17) and (18). Therefore, the convergence of D-iPGM can be proved if the generated sequence can be expressed by a similar inequality to (17) or (18) with an extra term converging to zero. Based on such observation, we will show the convergence properties of the sequence generated by D-iPGM in Sections III-A, III-B, and III-C.

III-A Convergence Analysis Under General Convexity

To simplify the notation for the convergence analysis, we first define the two matrices

\mathbf{Q}=\begin{pmatrix}\Gamma^{-1}&\mathbf{V}\\ \mathbf{0}&\frac{1}{\beta}\mathbf{I}\end{pmatrix},\quad \mathbf{M}=\begin{pmatrix}\mathbf{I}&\Gamma\mathbf{V}\\ \mathbf{0}&\mathbf{I}\end{pmatrix}.

Then, with the matrices \mathbf{Q} and \mathbf{M}, we define the following two matrices, which are important in establishing the convergence and convergence rate of D-iPGM:

\mathbf{H}:=\mathbf{Q}\mathbf{M}^{-1},\quad \mathbf{G}:=\mathbf{Q}^{\sf T}+\mathbf{Q}-\mathbf{M}^{\sf T}\mathbf{H}\mathbf{M}. \qquad (20)
Proposition 1.

If \tau_{i}>0, i\in\mathcal{V}, and 0<\beta\leq 0.5/\max_{i}\tau_{i}, the matrices \mathbf{H} and \mathbf{G} are positive definite.

Proof:

It follows from the definitions of \mathbf{Q} and \mathbf{M} that

\mathbf{H}=\begin{pmatrix}\Gamma^{-1}&\mathbf{V}\\ \mathbf{0}&\frac{1}{\beta}\mathbf{I}\end{pmatrix}\begin{pmatrix}\mathbf{I}&-\Gamma\mathbf{V}\\ \mathbf{0}&\mathbf{I}\end{pmatrix}=\begin{pmatrix}\Gamma^{-1}&\mathbf{0}\\ \mathbf{0}&\frac{1}{\beta}\mathbf{I}\end{pmatrix}.

In addition, it holds that

\mathbf{G} =\mathbf{Q}^{\sf T}+\mathbf{Q}-\mathbf{M}^{\sf T}\mathbf{H}\mathbf{M}=\mathbf{Q}^{\sf T}+\mathbf{Q}-\mathbf{M}^{\sf T}\mathbf{Q}
=\begin{pmatrix}2\Gamma^{-1}&\mathbf{V}\\ \mathbf{V}&\frac{2}{\beta}\mathbf{I}\end{pmatrix}-\begin{pmatrix}\mathbf{I}&\mathbf{0}\\ \mathbf{V}\Gamma&\mathbf{I}\end{pmatrix}\begin{pmatrix}\Gamma^{-1}&\mathbf{V}\\ \mathbf{0}&\frac{1}{\beta}\mathbf{I}\end{pmatrix}
=\begin{pmatrix}2\Gamma^{-1}&\mathbf{V}\\ \mathbf{V}&\frac{2}{\beta}\mathbf{I}\end{pmatrix}-\begin{pmatrix}\Gamma^{-1}&\mathbf{V}\\ \mathbf{V}&\frac{1}{\beta}\mathbf{I}+\mathbf{V}\Gamma\mathbf{V}\end{pmatrix}
=\begin{pmatrix}\Gamma^{-1}&\mathbf{0}\\ \mathbf{0}&\frac{1}{\beta}\mathbf{I}-\mathbf{V}\Gamma\mathbf{V}\end{pmatrix}.

Since \mathbf{V}=\sqrt{\mathbf{I}-\mathbf{W}} and \sigma_{M}(\mathbf{I}-\mathbf{W})\in[0,2), the condition 0<\beta\leq 0.5/\max_{i}\tau_{i} implies that \frac{1}{\beta}\mathbf{I}-\mathbf{V}\Gamma\mathbf{V}\succ\mathbf{0}. This completes the proof of this proposition. ∎
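As a quick numerical sanity check of Proposition 1, one can assemble \mathbf{H} and \mathbf{G} for a small network and verify their positive definiteness; in the sketch below the graph (complete), the uncoordinated \tau_{i}, and \beta are illustrative assumptions, with n=1 so that \mathbf{V}=\sqrt{\mathbf{I}-W} directly:

```python
import numpy as np

np.random.seed(1)
N = 4
W = np.full((N, N), 1.0 / N)                      # complete-graph mixing matrix
evals, evecs = np.linalg.eigh(np.eye(N) - W)
V = evecs @ np.diag(np.sqrt(np.clip(evals, 0.0, None))) @ evecs.T   # V = sqrt(I - W)
Gamma = np.diag(np.random.uniform(0.1, 1.0, N))   # uncoordinated tau_i > 0
beta = 0.5 / Gamma.diagonal().max()               # beta = 0.5 / max_i tau_i

H = np.block([[np.linalg.inv(Gamma), np.zeros((N, N))],
              [np.zeros((N, N)), np.eye(N) / beta]])
G = np.block([[np.linalg.inv(Gamma), np.zeros((N, N))],
              [np.zeros((N, N)), np.eye(N) / beta - V @ Gamma @ V]])

print(np.linalg.eigvalsh(H).min() > 0, np.linalg.eigvalsh(G).min() > 0)   # True True
```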

With the matrix \mathbf{G}, we define two more matrices as

\widehat{\mathbf{H}}_{1}=\mathbf{G}-\frac{1}{2}\mathrm{diag}\{\mathbf{L}_{F},\mathbf{0}\},\text{ and }\widehat{\mathbf{H}}_{2}=\mathbf{G}-\mathrm{diag}\{\mathbf{L}_{F},\mathbf{0}\}.

The positive definiteness of \widehat{\mathbf{H}}_{1} is necessary for establishing the convergence and the non-ergodic convergence rate of D-iPGM, whereas the positive definiteness of \widehat{\mathbf{H}}_{2} is needed for establishing the ergodic convergence rate of D-iPGM. If \widehat{\mathbf{H}}_{1}\succ 0, we have for any \mathbf{u}_{1},\mathbf{u}_{2}\in\mathcal{M}

c_{2}\|\mathbf{u}_{1}-\mathbf{u}_{2}\| \leq\|\mathbf{u}_{1}-\mathbf{u}_{2}\|_{\widehat{\mathbf{H}}_{1}} \leq\|\mathbf{u}_{1}-\mathbf{u}_{2}\|_{\mathbf{H}}\leq c_{1}\|\mathbf{u}_{1}-\mathbf{u}_{2}\|,

where c_{1}=\sigma_{M}^{1/2}(\mathbf{H}) and c_{2}=\sigma_{m}^{1/2}(\widehat{\mathbf{H}}_{1}). Let \tilde{\mathbf{u}}^{k}=\mathrm{col}\{\tilde{\mathbf{x}}^{k},\tilde{\bm{\alpha}}^{k}\}. We can rewrite Algorithm 1 as

\tilde{\mathbf{x}}^{k} =\mathrm{prox}_{G}^{\Gamma^{-1}}(\mathbf{x}^{k}-\Gamma(\nabla F(\mathbf{x}^{k})-\mathbf{V}\bm{\alpha}^{k}-\mathbf{d}^{k})), \qquad (21a)
\tilde{\bm{\alpha}}^{k} =\bm{\alpha}^{k}-\beta\mathbf{V}\tilde{\mathbf{x}}^{k}, \qquad (21b)
\mathbf{u}^{k+1} =\mathbf{u}^{k}-\mathbf{M}(\mathbf{u}^{k}-\tilde{\mathbf{u}}^{k}), \qquad (21c)

where \mathbf{d}^{k}=\mathrm{col}\{\mathrm{d}_{1}^{k},\cdots,\mathrm{d}_{N}^{k}\}. Note that if \bm{\alpha}^{0}=\bm{\lambda}^{0}=\mathbf{0}, the sequences \{\mathbf{x}^{k}\} generated by Algorithm 1 and by recursion (21) are identical. In the analysis, we consider the sequence \{(\mathbf{x}^{k},\bm{\alpha}^{k})\} to illustrate the convergence of Algorithm 1.

To establish the global convergence and derive the convergence rate of D-iPGM, we first give the following lemma.

Lemma 2.

Suppose that Assumptions 1 and 2 hold. For any (\mathbf{x},\bm{\alpha})\in\mathcal{M}, the sequence \{\tilde{\mathbf{u}}^{k}\} generated by (21) satisfies

G(\mathbf{x})-G(\tilde{\mathbf{x}}^{k})+\langle\mathbf{u}-\tilde{\mathbf{u}}^{k},\mathcal{K}_{1}(\mathbf{u})\rangle\geq\langle\mathbf{u}-\tilde{\mathbf{u}}^{k},\mathbf{Q}(\mathbf{u}^{k}-\tilde{\mathbf{u}}^{k})\rangle+\langle\mathbf{x}-\tilde{\mathbf{x}}^{k},\mathbf{d}^{k}\rangle-\frac{1}{4}\|\mathbf{x}^{k}-\tilde{\mathbf{x}}^{k}\|_{\mathbf{L}_{F}}^{2}, \qquad (22)

and

\phi(\mathbf{x})-\phi(\tilde{\mathbf{x}}^{k})+\langle\mathbf{u}-\tilde{\mathbf{u}}^{k},\mathcal{K}_{2}(\mathbf{u})\rangle\geq\langle\mathbf{u}-\tilde{\mathbf{u}}^{k},\mathbf{Q}(\mathbf{u}^{k}-\tilde{\mathbf{u}}^{k})\rangle+\langle\mathbf{x}-\tilde{\mathbf{x}}^{k},\mathbf{d}^{k}\rangle-\frac{1}{2}\|\mathbf{x}^{k}-\tilde{\mathbf{x}}^{k}\|_{\mathbf{L}_{F}}^{2}. \qquad (23)
Proof:

See Appendix A. ∎

To ensure the positive definiteness of \widehat{\mathbf{H}}_{1}, we assume that the stepsizes satisfy 0<\tau_{i}<2/L_{i}, i\in\mathcal{V}, and 0<\beta\leq 0.5/\max_{i}\tau_{i}. Then, we give the following theorem to show that the sequence \{(\mathbf{x}^{k},\bm{\alpha}^{k})\} generated by (21) converges to a primal-dual optimal solution of problem (16).

Theorem 1.

Suppose that Assumptions 1 and 2 hold. If 0<\tau_{i}<2/L_{i}, i\in\mathcal{V}, 0<\beta\leq 0.5/\max_{i}\tau_{i}, and \sum_{k=1}^{\infty}\varepsilon_{k}<\infty, the sequence \{\mathbf{u}^{k}\} generated by (21) satisfies:

1. for any \mathbf{u}^{*}\in\mathcal{M}^{*}, it holds that

\|\mathbf{u}^{k+1}-\mathbf{u}^{*}\|^{2}_{\mathbf{H}}\leq\|\mathbf{u}^{k}-\mathbf{u}^{*}\|^{2}_{\mathbf{H}}-\|\mathbf{u}^{k}-\tilde{\mathbf{u}}^{k}\|^{2}_{\widehat{\mathbf{H}}_{1}}+2\langle\tilde{\mathbf{x}}^{k}-\mathbf{x}^{*},\mathbf{d}^{k}\rangle; \qquad (24)

2. the sequence \{\mathbf{u}^{k}\} is bounded;

3. there exists \mathbf{u}^{\infty}\in\mathcal{M}^{*} such that \lim_{k\rightarrow\infty}\mathbf{u}^{k}=\mathbf{u}^{\infty}, i.e., \mathbf{x}^{\infty}=\mathbf{1}_{N}\otimes\mathrm{x}^{\infty}, and \mathrm{x}^{\infty} solves problem (1).

Proof:

See Appendix B. ∎

Although 1) may not guarantee the monotonicity of the sequence \{\|\mathbf{u}^{k}-\mathbf{u}^{\infty}\|^{2}_{\mathbf{H}}\}, since \{\mathbf{u}^{k}\} is bounded and \|\mathbf{d}^{k}\| is summable, the sequence \{\|\mathbf{u}^{k}-\mathbf{u}^{\infty}\|_{\mathbf{H}}\} is quasi-Fejér monotone in the \mathbf{H}-norm, i.e.,

\|\mathbf{u}^{k+1}-\mathbf{u}^{\infty}\|_{\mathbf{H}}\leq\|\mathbf{u}^{k}-\mathbf{u}^{\infty}\|_{\mathbf{H}}+\bar{\mu}\varepsilon_{k},

which guarantees the convergence of D-iPGM.

When subproblem (8) is solved exactly or the proximal operator of g_{i} has an analytical solution, we have \|\mathbf{d}^{k}\|=0. In this case, the sequence \{\mathbf{u}^{k}\} generated by (21) satisfies

\|\mathbf{u}^{k+1}-\mathbf{u}^{*}\|^{2}_{\mathbf{H}}\leq\|\mathbf{u}^{k}-\mathbf{u}^{*}\|^{2}_{\mathbf{H}}-\|\mathbf{u}^{k}-\tilde{\mathbf{u}}^{k}\|^{2}_{\widehat{\mathbf{H}}_{1}},\ \forall\mathbf{u}^{*}\in\mathcal{M}^{*},

which implies that \{\mathbf{u}^{k}\} is a Fejér monotone sequence with respect to \mathcal{M}^{*} in the \mathbf{H}-norm.

III-B Sublinear Convergence Rate Under General Convexity

With general convexity, this subsection establishes the sublinear convergence rate of D-iPGM.

We give the following theorem to show the O(1/k) non-ergodic and ergodic convergence rate of D-iPGM.

Theorem 2.

Suppose that Assumptions 1 and 2 hold. If 0<\tau_{i}<2/L_{i}, i\in\mathcal{V}, 0<\beta\leq 0.5/\max_{i}\tau_{i}, and \sum_{k=1}^{\infty}\varepsilon_{k}<\infty, it holds that

1. Running-average first-order optimality residual:

\frac{1}{K+1}\sum_{k=0}^{K}\mathrm{dist}^{2}(\mathbf{0},\mathcal{J}(\tilde{\mathbf{u}}^{k}))=O\Big(\frac{1}{K+1}\Big).

2. Running-best first-order optimality residual:

\min_{0\leq k\leq K}\{\mathrm{dist}^{2}(\mathbf{0},\mathcal{J}(\tilde{\mathbf{u}}^{k}))\}=o\Big(\frac{1}{K+1}\Big).

Let \mathbf{X}^{K}=\frac{1}{K}\sum_{k=0}^{K-1}\tilde{\mathbf{x}}^{k} and \mathbf{\Lambda}^{K}=\frac{1}{K}\sum_{k=0}^{K-1}\tilde{\bm{\alpha}}^{k}. Furthermore, if 0<\tau_{i}<1/L_{i}, i\in\mathcal{V}, it holds that

1. Primal-dual gap: L(\mathbf{X}^{K},\bm{\alpha})-L(\mathbf{x},\mathbf{\Lambda}^{K})=O\big(\frac{1}{K}\big).

2. Primal optimality gap: |\phi(\mathbf{X}^{K})-\phi(\mathbf{x}^{*})|=O\big(\frac{1}{K}\big).

3. Consensus error: \|\sqrt{\mathbf{I}-\mathbf{W}}\mathbf{X}^{K}\|=O\big(\frac{1}{K}\big).

Proof:

See Appendix C. ∎

With the stepsize condition \tau_{i}<2/L_{i}, i\in\mathcal{V}, we establish the non-ergodic iteration complexity of D-iPGM. With the stronger condition \tau_{i}<1/L_{i}, i\in\mathcal{V}, the ergodic O(1/k) convergence rates of the primal-dual gap, the primal optimality gap, and the consensus error are proved.

The next theorem gives the o(1/k) non-ergodic convergence rates of D-iPGM when \varepsilon_{k}=0, k\geq 0.

Theorem 3.

Suppose that Assumptions 1 and 2 hold. If 0<\tau_{i}<2/L_{i}, i\in\mathcal{V}, 0<\beta\leq 0.5/\max_{i}\tau_{i}, and \varepsilon_{k}=0 for all k\geq 0, then for any integer k\geq 0, it holds that

\|\mathbf{M}(\mathbf{u}^{k+1}-\tilde{\mathbf{u}}^{k+1})\|^{2}_{\mathbf{H}}\leq\|\mathbf{M}(\mathbf{u}^{k}-\tilde{\mathbf{u}}^{k})\|^{2}_{\mathbf{H}}. \qquad (25)

Moreover, for any \mathbf{u}^{*}\in\mathcal{M}^{*}, it holds that

1. Successive difference:

\|\mathbf{u}^{k+1}-\mathbf{u}^{k}\|^{2}_{\mathbf{H}}=o\Big(\frac{1}{k+1}\Big).

2. First-order optimality residual:

\mathrm{dist}^{2}(\mathbf{0},\mathcal{J}(\tilde{\mathbf{u}}^{k}))=o\Big(\frac{1}{k+1}\Big).
Proof:

See Appendix D. ∎

It follows from (A), i.e.,

G(\mathbf{x})-G(\tilde{\mathbf{x}}^{k})+\langle\mathbf{x}-\tilde{\mathbf{x}}^{k},\nabla F(\mathbf{x}^{k})\rangle+\langle\mathbf{u}-\tilde{\mathbf{u}}^{k},\mathcal{K}_{2}(\tilde{\mathbf{u}}^{k})\rangle\geq\langle\mathbf{u}-\tilde{\mathbf{u}}^{k},\mathbf{Q}(\mathbf{u}^{k}-\tilde{\mathbf{u}}^{k})\rangle+\langle\mathbf{x}-\tilde{\mathbf{x}}^{k},\mathbf{d}^{k}\rangle,\ \forall\mathbf{u}\in\mathcal{M},

that the successive difference \|\mathbf{u}^{k+1}-\mathbf{u}^{k}\|^{2}_{\mathbf{H}} can be used to measure the accuracy of the iterate \mathbf{u}^{k+1} with respect to \mathcal{M}^{*} when \varepsilon_{k}=0, k\geq 0. More specifically, if \|\mathbf{u}^{k+1}-\mathbf{u}^{k}\|^{2}_{\mathbf{H}}=0, by (21c), we have \mathbf{u}^{k}=\tilde{\mathbf{u}}^{k}=\mathbf{u}^{k+1}. Hence, it follows that

G(\mathbf{x})-G(\tilde{\mathbf{x}}^{k})+\langle\mathbf{u}-\tilde{\mathbf{u}}^{k},\mathcal{K}_{1}(\tilde{\mathbf{u}}^{k})\rangle\geq 0,\ \forall\mathbf{u}\in\mathcal{M}.

Compared with (17), \mathbf{u}^{k+1}=\tilde{\mathbf{u}}^{k}\in\mathcal{M}^{*}. Therefore, \|\mathbf{u}^{k+1}-\mathbf{u}^{k}\|^{2}_{\mathbf{H}} can be viewed as an error measurement after k iterations of D-iPGM. The same error measurement is considered in PG-EXTRA [20] and NIDS [21].

III-C Linear Convergence Rate Under Metric Subregularity

By some variational analysis techniques, this subsection proves the linear convergence of D-iPGM under metric subregularity. Recall the notion of metric subregularity [41], which is important in the establishment of the local linear convergence rate [42, 43].

Definition 1.

A set-valued mapping \Psi:\mathbb{R}^{n}\rightrightarrows\mathbb{R}^{m} is metrically subregular at (\bar{u},\bar{v})\in\mathrm{gph}(\Psi) if for some \epsilon>0 there exists \kappa\geq 0 such that

\mathrm{dist}(u,\Psi^{-1}(\bar{v}))\leq\kappa\,\mathrm{dist}(\bar{v},\Psi(u)),\ \forall u\in\mathcal{B}_{\epsilon}(\bar{u}),

where \mathrm{gph}(\Psi):=\{(u,v):v\in\Psi(u)\} and \Psi^{-1}(v):=\{u\in\mathbb{R}^{n}:(u,v)\in\mathrm{gph}(\Psi)\}.

To ensure the linear convergence rate of D-iPGM, we give the following assumption.

Assumption 3.

There exist an integer \bar{k}>0 and a sequence \{\varrho_{k}\} such that

\sup_{k\geq\bar{k}}\{\varrho_{k}\}<1/\bar{\mu},\text{ and }\|\mathbf{d}^{k}\|\leq\varrho_{k}\|\mathbf{x}^{k}-\mathbf{x}^{k+1}\|,\ \forall k\geq\bar{k}, \qquad (26)

where \bar{\mu}=\sigma_{M}(\mathbf{H}^{\frac{1}{2}}\mathbf{M})\sigma_{M}(\Gamma)(\beta\sigma_{M}(\sqrt{\mathbf{I}-\mathbf{W}})+1).

Remark 4.

From Theorem 1, we have \sum_{k=0}^{\infty}\|\mathbf{x}^{k+1}-\mathbf{x}^{k}\|<\infty. It implies that \lim_{k\rightarrow\infty}(\|\mathbf{x}^{k}-\mathbf{x}^{k-1}\|-\|\mathbf{x}^{k+1}-\mathbf{x}^{k}\|)=0. Therefore, to facilitate implementation, we can use \|\mathbf{x}^{k}-\mathbf{x}^{k-1}\| instead of \|\mathbf{x}^{k+1}-\mathbf{x}^{k}\| in practice.

Then, the linear convergence rate of D-iPGM is presented.

Theorem 4.

Suppose that Assumptions 1, 2, and 3 hold, 0<\tau_{i}<2/L_{i}, i\in\mathcal{V}, 0<\beta\leq 0.5/\max_{i}\tau_{i}, and the sequence \{\mathbf{u}^{k}\} generated by (21) converges to \mathbf{u}^{\infty}. If the KKT mapping \mathcal{J}(\mathbf{u}) is metrically subregular at (\mathbf{u}^{\infty},\mathbf{0}), then there exist an integer \bar{k}>0 and a constant \kappa>0 such that for all k\geq\bar{k}

\mathrm{dist}_{\mathbf{H}}(\mathbf{u}^{k+1},\mathcal{M}^{*})\leq\vartheta_{k}\,\mathrm{dist}_{\mathbf{H}}(\mathbf{u}^{k},\mathcal{M}^{*}), \qquad (27)

with

\vartheta_{k} =\frac{\mu(1+\omega)\varrho_{k}+\bar{\vartheta}}{1-\mu\varrho_{k}},\quad \mu=\bar{\mu}/c_{2},\quad \omega=\sqrt{c_{1}/c_{2}},
\bar{\vartheta}^{2} =1-\frac{c_{2}^{2}}{c^{2}_{1}(\kappa\kappa_{2}c_{1}+1)^{2}}<1,
\kappa^{2}_{2} =\frac{\max\{3\sigma_{M}(\mathbf{V}^{2})+\frac{1}{\beta^{2}},\,3\max_{i}\{L^{2}_{i}\}+3\sigma_{m}(\Gamma^{-2})\}}{\sigma_{m}(\widehat{\mathbf{H}}_{1})}.

Moreover, if

\sup_{k\geq\bar{k}}\{\varrho_{k}\}<\frac{1-\bar{\vartheta}}{\mu(2+\omega)}, \qquad (28)

then one has \vartheta_{k}<1, and thus the convergence rate of \mathrm{dist}_{\mathbf{H}}(\mathbf{u}^{k},\mathcal{M}^{*}) is Q-linear.

Proof:

See Appendix E. ∎

Note that if \{\varrho_{k}\} converges to 0 as k\rightarrow\infty, condition (28) holds eventually for \bar{k} sufficiently large, since \bar{\mu} and \bar{\vartheta} are constants. In addition, if \varrho_{k}=0, k\geq 0, it holds that \vartheta_{k}\equiv\bar{\vartheta}, which implies that for all k\geq\bar{k}

\mathrm{dist}^{2}_{\mathbf{H}}(\mathbf{u}^{k+1},\mathcal{M}^{*})\leq\Big(1-\frac{c_{2}^{2}}{c^{2}_{1}(\kappa\kappa_{2}c_{1}+1)^{2}}\Big)\mathrm{dist}^{2}_{\mathbf{H}}(\mathbf{u}^{k},\mathcal{M}^{*}).

Thus, the Q-linear convergence of D-iPGM holds for exact proximal iteration.

Remark 5.

In [14], a dimension-independent lower bound for problem (1) is given, and it is reported that if each agent owns a different local nonsmooth term, then exact linear convergence cannot be attained in the worst case (f_{i} is a strongly convex smooth function and g_{i} is nonsmooth convex with a closed-form proximal mapping). However, Theorem 4 establishes a dimension-dependent linear rate for D-iPGM, which does not contradict the existing results.

Next, using the analysis techniques in [42] and [43], we give a sufficient condition under which \mathcal{J}(\mathbf{u}) is metrically subregular at (\mathbf{u}^{\infty},\mathbf{0}). Similar to [42] and [43], a convex function F(\mathbf{x}) is said to satisfy the structured assumption if F(\mathbf{x})=\sum_{i=1}^{N}f_{i}(\mathrm{x}_{i})=\sum_{i=1}^{N}\big(h_{i}(A_{i}\mathrm{x}_{i})+\langle q_{i},\mathrm{x}_{i}\rangle\big), where A_{i} is an m_{i}\times n matrix, q_{i} is a vector in \mathbb{R}^{n}, and h_{i} is smooth and essentially locally strongly convex, i.e., h_{i} is strongly convex on any compact and convex subset \mathbb{K}. Some commonly used loss functions in machine learning automatically satisfy the structured assumption. Table III summarizes cases satisfying the structured assumption, where \mathrm{b}^{1}_{i}\in\mathbb{R}^{n}, \mathrm{b}^{2}_{i}\in\{0,1\}^{n}, \mathrm{b}^{3}_{i}\in\mathbb{R}^{n}_{+}, \mathrm{b}^{4}_{i}\in\{-1,1\}^{n}, and \mathrm{b}_{i}^{5}\in\{0,1,2,\cdots\}.

TABLE III: Some Commonly Used Convex Loss Functions Satisfying The Structured Assumption.
Local loss function | h_{i} | Convexity of f_{i}
Linear regression | \frac{1}{2}\|\mathrm{x}_{i}-\mathrm{b}^{1}_{i}\|^{2} | Convex
Logistic regression | \sum_{j=1}^{n}\log(1+e^{\mathrm{x}_{ij}})-(\mathrm{b}_{i}^{2})^{\top}\mathrm{x}_{i} | Convex
Logistic regression | -\sum_{j=1}^{n}\log(\mathrm{x}_{ij})+(\mathrm{b}_{i}^{3})^{\top}\mathrm{x}_{i} | Convex
Likelihood estimation | \sum_{j=1}^{n}\log(1+e^{(\mathrm{b}_{i}^{4})^{\top}\mathrm{x}_{i}}) | Convex
Poisson regression | \sum_{j=1}^{n}e^{\mathrm{x}_{ij}}-(\mathrm{b}_{i}^{5})^{\top}\mathrm{x}_{i} | Convex

By the structure of FF, the following proposition is provided, which presents an alternative characterization of \mathcal{M}^{*}.

Proposition 2.

Suppose that F(𝐱)F(\mathbf{x}) satisfies the structured assumption. The set \mathcal{M}^{*} can be characterized as

={𝐮:\displaystyle\mathcal{M}^{*}=\{\mathbf{u}: 𝐀~𝐱=𝐭~,𝐈𝐖𝐱=𝟎,\displaystyle\widetilde{\mathbf{A}}\mathbf{x}=\tilde{\mathbf{t}},\sqrt{\mathbf{I-W}}\mathbf{x}=\mathbf{0},
𝟎G(𝐱)+𝝃~𝐈𝐖𝜶},\displaystyle\mathbf{0}\in\partial G(\mathbf{x})+\tilde{\bm{\xi}}-\sqrt{\mathbf{I-W}}\bm{\alpha}\},

with some 𝐭~\tilde{\mathbf{t}} satisfying 𝐀~𝐱=𝐭~\widetilde{\mathbf{A}}\mathbf{x}=\tilde{\mathbf{t}} for all 𝐱𝒳\mathbf{x}\in\mathcal{X}^{*}, and 𝛏~=col{A1h1(t~1)+q1,,ANhN(t~N)+qN}\tilde{\bm{\xi}}=\mathrm{col}\{A_{1}^{\top}\nabla h_{1}(\tilde{t}_{1})+q_{1},\cdots,A_{N}^{\top}\nabla h_{N}(\tilde{t}_{N})+q_{N}\}, where 𝐭~=col{t~1,,t~N}\tilde{\mathbf{t}}=\mathrm{col}\{\tilde{t}_{1},\cdots,\tilde{t}_{N}\}, and 𝐀~\widetilde{\mathbf{A}} is a block diagonal matrix with its (i,i)(i,i)-th block being AiA_{i} and other blocks being 𝟎\mathbf{0}.

The proof of Proposition 2 is rather standard by [44, Lemma 2.1] and thus is omitted here. To facilitate the analysis, an auxiliary perturbed set-valued mapping with perturbation 𝐏=(P1,P2,P3)\mathbf{P}=(P_{1},P_{2},P_{3}) is introduced associated with \mathcal{M}^{*}: 𝚪(𝐏)={𝐮:P1=𝐀~𝐱𝐭~,P2=𝐕𝐱,P3G(𝐱)+𝝃~𝐕𝜶}.\mathbf{\Gamma}(\mathbf{P})=\{\mathbf{u}:P_{1}=\widetilde{\mathbf{A}}\mathbf{x}-\tilde{\mathbf{t}},P_{2}={\bf{V}}\mathbf{x},P_{3}\in\partial G(\mathbf{x})+\tilde{\bm{\xi}}-{\bf{V}}\bm{\alpha}\}. It is obvious that Γ(𝟎)=\Gamma(\mathbf{0})=\mathcal{M}^{*}. Similar to [42, Proposition 40] and [43, Proposition 4.1], one has the following equivalence.

Proposition 3.

Assume that the structured assumption on F is satisfied. The metric subregularity conditions of \mathbf{\Gamma}^{-1} and \mathcal{J} are equivalent. Precisely, given \bar{\mathbf{u}}\in\mathcal{M}^{*}, the following two statements are equivalent:

(i) There exist κ~1,ϵ~1>0\tilde{\kappa}_{1},\tilde{\epsilon}_{1}>0 such that

dist(𝐮,𝚪(𝟎))κ~1dist(𝟎,𝚪1(𝐮)),𝐮ϵ~1(𝐮¯).\mathrm{dist}(\mathbf{u},{\bf{\Gamma}}(\mathbf{0}))\leq\tilde{\kappa}_{1}~{}\mathrm{dist}(\mathbf{0},\mathbf{\Gamma}^{-1}(\mathbf{u})),\forall\mathbf{u}\in\mathcal{B}_{\tilde{\epsilon}_{1}}(\bar{\mathbf{u}}).

(ii) There exist κ~2,ϵ~2>0\tilde{\kappa}_{2},\tilde{\epsilon}_{2}>0 such that

dist(𝐮,𝒥1(𝟎))κ~2dist(𝟎,𝒥(𝐮)),𝐮ϵ~2(𝐮¯).\mathrm{dist}(\mathbf{u},\mathcal{J}^{-1}(\mathbf{0}))\leq\tilde{\kappa}_{2}~{}\mathrm{dist}(\mathbf{0},\mathcal{J}(\mathbf{u})),\forall\mathbf{u}\in\mathcal{B}_{\tilde{\epsilon}_{2}}(\bar{\mathbf{u}}).

Suppose that F({\bf{x}}) satisfies the structured assumption and that \partial G is a polyhedral multifunction, i.e., a set-valued mapping whose graph is the union of finitely many polyhedral convex sets. In this scenario, {\bf{\Gamma}} is a polyhedral multifunction, and hence \mathbf{\Gamma}^{-1} is also a polyhedral multifunction. By [45, Proposition 1], \mathbf{\Gamma}^{-1} is metrically subregular at any point (\mathbf{u},\mathbf{P})\in\mathrm{gph}(\mathbf{\Gamma}^{-1}). Thus, by Proposition 3, \mathcal{J} is metrically subregular at (\mathbf{u}^{*},\mathbf{0}) for any given \mathbf{u}^{*}\in\mathcal{M}^{*}. It follows from [43, Proposition 7] that if g_{i} is a polyhedral convex function, which includes the indicator function of a polyhedral set and polyhedral convex regularizers, or if g_{i} is a convex piecewise linear-quadratic function, then \partial g_{i} is a polyhedral multifunction. This case covers scenarios where g_{i} is the \ell_{\infty}-norm regularizer, the \ell_{1}-norm regularizer, the elastic net regularizer [46], the generalized LASSO regularizer [23], and the fused LASSO regularizer [24].
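As a concrete illustration of the last sentence, the \ell_{1}-norm and elastic net regularizers admit simple closed-form proximal mappings, and their subdifferentials fall into the polyhedral/piecewise linear-quadratic class discussed above. The following minimal Python sketch (with illustrative parameter names, not the paper's code) shows these two proximal mappings.

```python
import numpy as np

def prox_l1(x, t):
    """Proximal mapping of t*||.||_1 (elementwise soft-thresholding);
    its subdifferential is a polyhedral multifunction."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def prox_elastic_net(x, t, nu1, nu2):
    """Proximal mapping of t*(nu1*||.||_1 + (nu2/2)*||.||^2),
    an example of a piecewise linear-quadratic regularizer."""
    return prox_l1(x, t * nu1) / (1.0 + t * nu2)
```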

Figure 1: The relative error \|{\bf{x}}^{k}-{\bf{x}}^{*}\|/\|{\bf{x}}^{*}\| with respect to the iteration number when the regularizer is \nu_{1,i}\|{\mathrm{x}}\|+\frac{\nu_{2,i}}{2}\|{\mathrm{x}}\|^{2}, where \nu_{1,i}=0.01 and \nu_{2,i}=1. Here, different stepsizes for D-iPGM, PG-EXTRA, and NIDS are considered, and the stepsize is the same across the network of agents.
Figure 2: The relative error \|{\bf{x}}^{k}-{\bf{x}}^{*}\|/\|{\bf{x}}^{*}\| with respect to the iteration number when the regularizer is \nu_{1,i}\|{\mathrm{x}}\|_{1}+\frac{\nu_{2,i}}{2}\|{\mathrm{x}}\|^{2}, where \nu_{1,i}=0.01 and \nu_{2,i}=1. Here each agent has its own local stepsize \tau_{i}=1.99/L_{i}. We fix \tau_{i} and let \tau\beta=0.9999/\sigma_{M}({\bf{I-W}}), 1/2, 1/4, and 1/8, where \tau=\max_{i\in\mathcal{V}}\{\tau_{i}\}.

IV Numerical Experiments

In this section, we compare the performance of D-iPGM with PG-EXTRA [20] (see (12) or (13)) and NIDS [21] (see (14) or (15)) on decentralized linear and logistic regression problems with different regularizers on real datasets. All algorithms are implemented in Matlab R2020b on a computer with a 3.30 GHz AMD Ryzen 9 5900HS (with Radeon Graphics) and 16 GB of memory.

For all experiments, we first compute the solution {\mathrm{x}}^{*} of (1) by centralized methods, and then run the algorithms over a randomly generated connected network with N agents and \frac{\iota N(N-1)}{2} undirected edges, where N=50 and \iota=0.1 is the connectivity ratio. The mixing matrix W is generated with the Metropolis-Hastings rule (see [8, Section 2.4]). The performance is evaluated by the relative error \|{\bf{x}}^{k}-{\bf{x}}^{*}\|/\|{\bf{x}}^{*}\|. For decentralized linear regression, the loss function is

i=1N{1mij=1mi12𝒜ij𝖳xij2}.\sum_{i=1}^{N}\big{\{}\frac{1}{m_{i}}\sum_{j=1}^{m_{i}}\frac{1}{2}\|\mathcal{A}_{ij}^{\sf T}{{\mathrm{x}}}-\mathcal{B}_{ij}\|^{2}\big{\}}.

For decentralized logistic regression, the loss function is

i=1N{1mij=1miln(1+e(𝒜ij𝖳x)ij)}.\sum_{i=1}^{N}\big{\{}\frac{1}{m_{i}}\sum_{j=1}^{m_{i}}\ln(1+e^{-(\mathcal{A}_{ij}^{\sf T}{{\mathrm{x}}})\mathcal{B}_{ij}})\big{\}}.

Here, each agent i holds its own training data (\mathcal{A}_{ij},\mathcal{B}_{ij})\in\mathbb{R}^{n}\times\{-1,1\}, j=1,\cdots,m_{i}, consisting of sample vectors \mathcal{A}_{ij} and the corresponding classes \mathcal{B}_{ij}. We use three real datasets, a9a, covtype, and ijcnn1 (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/), whose dimensions are n=123 and \sum_{i=1}^{N}m_{i}=32550, n=54 and \sum_{i=1}^{N}m_{i}=55500, and n=22 and \sum_{i=1}^{N}m_{i}=49950, respectively. Moreover, the training samples are randomly and evenly distributed over all N agents.
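For completeness, a minimal Python sketch of the Metropolis-Hastings mixing-matrix construction mentioned above is given below; the adjacency matrix of the randomly generated connected graph is assumed to be available, and this sketch only illustrates the rule in [8, Section 2.4] rather than reproducing the Matlab code used in the experiments.

```python
import numpy as np

def metropolis_hastings_weights(adj):
    """Build a symmetric doubly stochastic mixing matrix W from a 0/1 adjacency
    matrix: W_ij = 1/(1 + max(d_i, d_j)) for each edge (i, j), W_ii = 1 - sum_j W_ij."""
    N = adj.shape[0]
    deg = adj.sum(axis=1)
    W = np.zeros((N, N))
    for i in range(N):
        for j in range(N):
            if i != j and adj[i, j] > 0:
                W[i, j] = 1.0 / (1.0 + max(deg[i], deg[j]))
        W[i, i] = 1.0 - W[i, :].sum()
    return W
```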

TABLE IV: Comparison of Decentralized Linear Regression With Generalized LASSO Solved by PG-EXTRA, NIDS, and D-iPGM
Algorithm a9a: Linear Regression covtype: Linear Regression ijcnn1: Linear Regression
Inner Iter. Outer Iter. Time Inner Iter. Outer Iter. Time Inner Iter. Outer Iter. Time
PG-EXTRA 32687 146 14.1916 63843 186 6.8362 42901 139 1.7367
NIDS 31678 141 13.7203 61940 180 6.7148 41027 133 1.6229
D-iPGM-εk=1010\varepsilon_{k}=10^{-10} 31460 140 13.1901 61809 180 6.4375 40528 132 1.6263
D-iPGM-εk=O(1/k2)\varepsilon_{k}=O(1/k^{2}) 18254 140 11.4739 35067 185 5.1945 27595 132 1.2733
D-iPGM-εk=O(lnk/k2)\varepsilon_{k}=O(\ln k/k^{2}) 16640 146 11.2607 33887 189 4.9323 24375 132 1.1896
D-iPGM-εk=O(xikxik1/k)\varepsilon_{k}=O(\|{\mathrm{x}}_{i}^{k}-{\mathrm{x}}_{i}^{k-1}\|/k) 11836 142 10.6125 23520 184 4.4608 13388 132 0.8935
D-iPGM-εk=O(xikxik1/lnk)\varepsilon_{k}=O(\|{\mathrm{x}}_{i}^{k}-{\mathrm{x}}_{i}^{k-1}\|/\ln k) 8568 141 9.9773 17539 185 3.9734 11529 132 0.8552
TABLE V: Comparison of Decentralized Logistic Regression With Generalized LASSO Solved by PG-EXTRA, NIDS, and D-iPGM
Algorithm a9a: Logistic Regression covtype: Logistic Regression ijcnn1: Logistic Regression
Inner Iter. Outer Iter. Time Inner Iter. Outer Iter. Time Inner Iter. Outer Iter. Time
PG-EXTRA 32912 169 15.3620 52761 199 6.8032 54041 135 1.9368
NIDS 32910 169 15.3008 55084 208 7.0974 53794 130 1.8648
D-iPGM-εk=1010\varepsilon_{k}=10^{-10} 33794 169 15.7742 55281 209 7.0255 54050 130 1.9370
D-iPGM-εk=O(1/k2)\varepsilon_{k}=O(1/k^{2}) 17701 165 13.6317 31920 210 5.8638 37284 130 1.4901
D-iPGM-εk=O(lnk/k2)\varepsilon_{k}=O(\ln k/k^{2}) 16055 166 13.0957 30864 211 5.6870 32520 130 1.4303
D-iPGM-εk=O(xikxik1/k)\varepsilon_{k}=O(\|{\mathrm{x}}_{i}^{k}-{\mathrm{x}}_{i}^{k-1}\|/k) 13458 166 12.9017 19383 160 4.1614 17887 123 1.0219
D-iPGM-εk=O(xikxik1/lnk)\varepsilon_{k}=O(\|{\mathrm{x}}_{i}^{k}-{\mathrm{x}}_{i}^{k-1}\|/\ln k) 8521 156 11.5524 18795 173 4.2117 13759 117 0.8875

IV-A Scenario 1: proxτigi()\mathrm{prox}_{\tau_{i}g_{i}}(\cdot) Has a Closed-form Solution

We first consider the decentralized linear and logistic regression problems with the \nu_{1,i}\|{\mathrm{x}}\|+\frac{\nu_{2,i}}{2}\|{\mathrm{x}}\|^{2} regularizer. For these problems, the nonsmooth term is g_{i}({\mathrm{x}})=\nu_{1,i}\|{\mathrm{x}}\|, and we have \mathrm{prox}_{\tau_{i}g_{i}}({\mathrm{x}})=(1-\frac{\tau_{i}\nu_{1,i}}{\max\{\|{\mathrm{x}}\|,\tau_{i}\nu_{1,i}\}}){\mathrm{x}}. It is not difficult to verify that the smoothness constants of these two problems are L_{i}=\|\mathcal{A}_{i}^{\sf T}\mathcal{A}_{i}\|+1, where \mathcal{A}_{i}=[\mathcal{A}_{i1}^{\sf T};\cdots;\mathcal{A}_{im_{i}}^{\sf T}]. In this case, since the proximal mapping of g_{i} has a closed-form solution, we use the exact version of D-iPGM, i.e., \varepsilon_{k}=0 for all k\geq 0. Thus, the per-iteration complexity of these three algorithms is the same. Note that there is only one round of communication in each iteration of these algorithms, so the amount of information exchanged over the network is proportional to the number of iterations. Consequently, only the number of iterations is recorded.
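For reference, the closed-form proximal mapping above can be evaluated as in the following minimal Python sketch (the numerical values are illustrative only):

```python
import numpy as np

def prox_l2_norm(x, t):
    """prox_{t*||.||}(x) = (1 - t / max(||x||, t)) * x  (block soft-thresholding)."""
    nrm = np.linalg.norm(x)
    return (1.0 - t / max(nrm, t)) * x

# illustrative values: tau_i = 1.0 and nu_{1,i} = 0.01, so t = tau_i * nu_{1,i} = 0.01
x = np.array([0.3, -0.4])
print(prox_l2_norm(x, 0.01))
```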

The relative error of these algorithms after each iteration is shown in Fig. 1, where we fix \tau\beta=\frac{1}{2} for D-iPGM. For PG-EXTRA, we only show \tau=1/L_{F}, because PG-EXTRA diverges when \tau=1.4/L_{F} and \tau=1.95/L_{F}. When \tau=1/L_{F}, the three algorithms have similar convergence performance. When \tau=1.4/L_{F} and \tau=1.95/L_{F}, D-iPGM and NIDS converge at almost the same speed. In addition, D-iPGM and NIDS converge faster with larger stepsizes.

Then, we consider the decentralized linear regression problem with the \nu_{1,i}\|{\mathrm{x}}\|_{1}+\frac{\nu_{2,i}}{2}\|{\mathrm{x}}\|^{2} regularizer. In this case, the nonsmooth term is g_{i}({\mathrm{x}})=\nu_{1,i}\|{\mathrm{x}}\|_{1}, whose proximal mapping also has a closed-form solution (elementwise soft-thresholding). We use the uncoordinated stepsizes \tau_{i}=1.99/L_{i} and let \tau\beta=0.9999/\sigma_{M}({\bf{I-W}}), 1/2, 1/4, and 1/8, where \tau=\max_{i\in\mathcal{V}}\{\tau_{i}\}. The relative error after each iteration is shown in Fig. 2, which shows that D-iPGM converges faster with a larger \tau\beta.

IV-B Scenario 2: proxτigi()\mathrm{prox}_{\tau_{i}g_{i}}(\cdot) Has no Closed-form Solution

In this subsection, we consider the decentralized linear and logistic regression problems with the \ell_{2}+generalized LASSO [23] regularizer, i.e., \frac{\nu_{2,i}}{2}\|{\mathrm{x}}\|^{2}+\nu_{1,i}\|{\mathrm{D}}_{i}{\mathrm{x}}\|_{1}, where {\mathrm{D}}_{i}\in\mathbb{R}^{10\times n} is drawn from the normal distribution N(0,1), \nu_{1,i}=0.01, and \nu_{2,i}=1. In this scenario, the nonsmooth term is g_{i}({\mathrm{x}})=\nu_{1,i}\|{\mathrm{D}}_{i}{\mathrm{x}}\|_{1}, whose proximal mapping has no closed-form solution for general {\mathrm{D}}_{i}, and we use the C-V algorithm [36, 37] to compute a numerical solution of \mathrm{prox}_{\tau_{i}g_{i}}(\cdot).

Recall the primal update of D-iPGM,

x~iky~ik:=proxτigi(xikτi(fi(xik)λik)).\tilde{\mathrm{x}}_{i}^{k}\approx\tilde{\mathrm{y}}_{i}^{k}:=\mathrm{prox}_{\tau_{i}g_{i}}(\mathrm{x}_{i}^{k}-\tau_{i}(\nabla f_{i}(\mathrm{x}_{i}^{k})-{\lambda}_{i}^{k})).

Let sik=xikτi(fi(xik)λik)s_{i}^{k}=\mathrm{x}_{i}^{k}-\tau_{i}(\nabla f_{i}(\mathrm{x}_{i}^{k})-{\lambda}_{i}^{k}). By the definition of proximal mapping of gig_{i}, it holds that

x~iky~ik=argminxn{τiν1,iDix1+12xsik2}.\displaystyle\tilde{\mathrm{x}}_{i}^{k}\approx\tilde{\mathrm{y}}_{i}^{k}=\mathop{\arg\min}\limits_{{\mathrm{x}}\in\mathbb{R}^{n}}\{\tau_{i}\nu_{1,i}\|{\mathrm{D}}_{i}{\mathrm{x}}\|_{1}+\frac{1}{2}\|{\mathrm{x}}-{\mathrm{s}}^{k}_{i}\|^{2}\}. (29)

Applying the C-V algorithm to problem (29), we get

xik,l+1\displaystyle{\mathrm{x}}_{i}^{k,l+1} =xik,lt1,i(xik,lsik+Di𝖳wil),\displaystyle={\mathrm{x}}_{i}^{k,l}-t_{1,i}({\mathrm{x}}_{i}^{k,l}-s_{i}^{k}+{\mathrm{D}}_{i}^{\sf T}{\mathrm{w}}_{i}^{l}),
vil+1\displaystyle{\mathrm{v}}_{i}^{l+1} =wil+t2,iDi(2xik,l+1xik,l),\displaystyle={\mathrm{w}}_{i}^{l}+t_{2,i}{\mathrm{D}}_{i}(2{\mathrm{x}}_{i}^{k,l+1}-{\mathrm{x}}_{i}^{k,l}),
wil+1\displaystyle{\mathrm{w}}_{i}^{l+1} =vil+1t2,iproxτiν1,it2,i1(vil+1/t2,i),\displaystyle={\mathrm{v}}_{i}^{l+1}-t_{2,i}\mathrm{prox}_{\frac{\tau_{i}\nu_{1,i}}{t_{2,i}}\|\cdot\|_{1}}({\mathrm{v}}_{i}^{l+1}/t_{2,i}),

where xik,0=xik{\mathrm{x}}_{i}^{k,0}={\mathrm{x}}_{i}^{k}, wi0=𝟎{\mathrm{w}}_{i}^{0}={\bf{0}}, and t1,it_{1,i} and t2,it_{2,i} are the stepsizes satisfying that t1,i2+t1,it2,iDi𝖳Di<1\frac{t_{1,i}}{2}+t_{1,i}t_{2,i}\|{\mathrm{D}}_{i}^{\sf T}{\mathrm{D}}_{i}\|<1. Let dik,l+1gi(xik,l+1)+fi(xik)λik+1τi(xik,l+1xik){\mathrm{d}}_{i}^{k,l+1}\in\partial g_{i}({\mathrm{x}}_{i}^{k,l+1})+\nabla f_{i}(\mathrm{x}_{i}^{k})-\lambda_{i}^{k}+\frac{1}{\tau_{i}}({\mathrm{x}}_{i}^{k,l+1}-\mathrm{x}_{i}^{k}). It follows from [33, Section 7] and [47, Lemma 3] that there exists a constant ε0\varepsilon_{0} such that dik,l+1ε0xik,l+1xik,l\|{\mathrm{d}}_{i}^{k,l+1}\|\leq\varepsilon_{0}\|{\mathrm{x}}_{i}^{k,l+1}-{\mathrm{x}}_{i}^{k,l}\|. Therefore, if xik,l+1xik,l<εk/ε0\|{\mathrm{x}}_{i}^{k,l+1}-{\mathrm{x}}_{i}^{k,l}\|<{\varepsilon_{k}}/{\varepsilon_{0}}, we stop the inner procedure, and set x~ik=xik,l+1\tilde{{\mathrm{x}}}_{i}^{k}={\mathrm{x}}_{i}^{k,l+1}, i.e., we can choose the inner stopping criterion of D-iPGM as

[D-iPGM]: xik,l+1xik,lεk.\text{[D-iPGM]: }\|{\mathrm{x}}_{i}^{k,l+1}-{\mathrm{x}}_{i}^{k,l}\|\leq\varepsilon_{k}.
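A minimal Python sketch of this inner routine is given below; it follows the C-V recursion and stopping rule above, but the stepsize choice and variable names are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def inner_prox_generalized_lasso(s, D, t_reg, x0, eps_k, max_iter=1000):
    """Approximately solve min_x t_reg*||D x||_1 + 0.5*||x - s||^2 (cf. (29)),
    where t_reg plays the role of tau_i * nu_{1,i}, using the C-V iteration;
    the loop stops once successive primal iterates differ by at most eps_k."""
    L = np.linalg.norm(D, 2) ** 2          # ||D^T D||
    t1 = 1.0                               # illustrative stepsizes satisfying
    t2 = 0.4 / (t1 * L)                    # t1/2 + t1*t2*||D^T D|| = 0.9 < 1
    x = x0.copy()
    w = np.zeros(D.shape[0])
    for _ in range(max_iter):
        x_new = x - t1 * (x - s + D.T @ w)
        v = w + t2 * D @ (2.0 * x_new - x)
        # dual step: w^{l+1} = v^{l+1} - t2 * prox_{(t_reg/t2)||.||_1}(v^{l+1}/t2)
        w = v - t2 * np.sign(v / t2) * np.maximum(np.abs(v / t2) - t_reg / t2, 0.0)
        if np.linalg.norm(x_new - x) <= eps_k:
            x = x_new
            break
        x = x_new
    return x
```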

In this scenario, we use the primal-dual forms of PG-EXTRA (12) and NIDS (14) to solve the considered problems. Note that the primal updates of these three algorithms are the same, and that PG-EXTRA and NIDS are exact iterative algorithms. We apply the C-V algorithm to compute (12a) and (14a) with the inner stopping criterion

[PG-EXTRA/NIDS]: xik,l+1xik,l1010.\text{[PG-EXTRA/NIDS]: }\|{\mathrm{x}}_{i}^{k,l+1}-{\mathrm{x}}_{i}^{k,l}\|\leq 10^{-10}.

To compare the effect of the inexact iterations and to ensure that the outer-iteration behavior is similar for all considered algorithms, we set \tau={1}/{L_{F}} and \tau\beta={1}/{2} for the outer iteration, according to the experimental results of Scenario 1. In addition, we use the stopping criterion {\|{\bf{x}}^{k}-{\bf{x}}^{*}\|}/{\|{\bf{x}}^{*}\|}\leq 10^{-5} for all considered methods.

Figure 3: The number of inner iterations against the outer iterations. D-iPGM0, D-iPGM1, D-iPGM2, D-iPGM3, and D-iPGM4 represent D-iPGM with εk=1010,O(xikxik1/k),O(1/k2),O(xikxik1/lnk),O(lnk/k2)\varepsilon_{k}=10^{-10},O(\|{\mathrm{x}}_{i}^{k}-{\mathrm{x}}_{i}^{k-1}\|/k),O(1/k^{2}),O(\|{\mathrm{x}}_{i}^{k}-{\mathrm{x}}_{i}^{k-1}\|/\ln k),O(\ln k/k^{2}).

Note that different algorithms have different per-iteration complexities. In Tables IV and V, we report the number of inner iterations (Inner Iter.) and outer iterations (Outer Iter.), as well as the computational time in seconds (Time), when each method reaches the outer stopping criterion. From Tables IV and V, we observe that the computational time and the number of inner iterations of D-iPGM are significantly smaller than those of the exact algorithms, including its own exact version, while the number of outer iterations of D-iPGM is very close to that of the exact algorithms. Therefore, D-iPGM outperforms its exact version, PG-EXTRA, and NIDS in terms of computational time. Moreover, comparing the performance of D-iPGM under various inner stopping criteria, the results suggest that \varepsilon_{k}=O(\|{\mathrm{x}}_{i}^{k}-{\mathrm{x}}_{i}^{k-1}\|/\ln k) is the better choice of inner stopping criterion. To further visualize the numerical results, Fig. 3 shows the number of inner iterations required at each outer iteration, which indicates that D-iPGM requires fewer inner iterations per outer iteration than the exact algorithms. If we choose \varepsilon_{k}=O(\|{\mathrm{x}}_{i}^{k}-{\mathrm{x}}_{i}^{k-1}\|/\ln k) or \varepsilon_{k}=O(\|{\mathrm{x}}_{i}^{k}-{\mathrm{x}}_{i}^{k-1}\|/k) as the inner stopping criterion, the number of inner iterations required at each outer iteration roughly increases over time. If we choose \varepsilon_{k}=O(1/k^{2}) or \varepsilon_{k}=O(\ln k/k^{2}), the trend of the number of inner iterations is similar to that of the exact algorithms.
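The inner tolerance schedules compared above can be generated by a simple rule, as in the following Python sketch; the constant c is illustrative and does not correspond to the values used in the experiments.

```python
import numpy as np

def inner_tolerance(rule, k, x_curr=None, x_prev=None, c=1e-3):
    """Return eps_k for the schedules compared in Tables IV and V:
    'exact': 1e-10; 'poly': c/k^2; 'logpoly': c*ln(k)/k^2;
    'rel_k': c*||x^k - x^{k-1}||/k; 'rel_log': c*||x^k - x^{k-1}||/ln(k)."""
    k = max(k, 2)  # avoid ln(1) = 0 and division by zero at the first iterations
    if rule == "exact":
        return 1e-10
    if rule == "poly":
        return c / k ** 2
    if rule == "logpoly":
        return c * np.log(k) / k ** 2
    diff = np.linalg.norm(x_curr - x_prev)
    return c * diff / k if rule == "rel_k" else c * diff / np.log(k)
```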

V Conclusion

In this paper, we proposed a CTA-based decentralized algorithm, D-iPGM, for convex composite optimization over networks, which uses uncoordinated network-independent constant stepsizes. D-iPGM does not require the exact solution of the proximal mapping of g_{i} at each iteration, making it suitable for decentralized composite optimization problems where the proximal mapping may be difficult to compute. We established an O(1/k) convergence rate of D-iPGM under general convexity, which improves to o(1/k) if the proximal mapping is solved exactly. With metric subregularity, we further established its linear convergence rate. The numerical experiments confirmed the effectiveness of D-iPGM. Accelerated, asynchronous, and stochastic variants of D-iPGM are left for future work.

Appendix A Proof of Lemma 2

Proof:

It follows from (21a) that G(𝐱)G(𝐱~k)+𝐱𝐱~k,F(𝐱k)𝐕𝜶k+Γ1(𝐱~k𝐱k)𝐝k0,𝐱NnG(\mathbf{x})-G(\tilde{\mathbf{x}}^{k})+\langle\mathbf{x}-\tilde{\mathbf{x}}^{k},\nabla F(\mathbf{x}^{k})-{\bf{V}}\bm{\alpha}^{k}+\Gamma^{-1}(\tilde{\mathbf{x}}^{k}-\mathbf{x}^{k})-\mathbf{d}^{k}\rangle\geq 0,\forall\mathbf{x}\in\mathbb{R}^{Nn}, which implies that for 𝐱Nn\forall\mathbf{x}\in\mathbb{R}^{Nn},

G(𝐱)G(𝐱~k)+𝐱𝐱~k,F(𝐱k)𝐕𝜶~k\displaystyle G(\mathbf{x})-G(\tilde{\mathbf{x}}^{k})+\langle\mathbf{x}-\tilde{\mathbf{x}}^{k},\nabla F(\mathbf{x}^{k})-{\bf{V}}\tilde{\bm{\alpha}}^{k}
+Γ1(𝐱~k𝐱k)+𝐕(𝜶~k𝜶k)𝐝k0.\displaystyle+\Gamma^{-1}(\tilde{\mathbf{x}}^{k}-\mathbf{x}^{k})+{\bf{V}}(\tilde{\bm{\alpha}}^{k}-\bm{\alpha}^{k})-\mathbf{d}^{k}\rangle\geq 0. (30)

It follows from (21b) that for 𝜶Nn\forall\bm{\alpha}\in\mathbb{R}^{Nn}

𝜶𝜶~k,𝐕𝐱~k=𝜶𝜶~k,1β(𝜶k𝜶~k).\displaystyle\langle\bm{\alpha}-\tilde{\bm{\alpha}}^{k},{\bf{V}}\tilde{\mathbf{x}}^{k}\rangle=\langle\bm{\alpha}-\tilde{\bm{\alpha}}^{k},\frac{1}{\beta}(\bm{\alpha}^{k}-\tilde{\bm{\alpha}}^{k})\rangle. (31)

Combining (30) and (31), it gives that

G(𝐱)G(𝐱~k)+𝐱𝐱~k,F(𝐱k)+𝐮𝐮~k,𝒦2(𝐮~k)\displaystyle G(\mathbf{x})-G(\tilde{\mathbf{x}}^{k})+\langle\mathbf{x}-\tilde{\mathbf{x}}^{k},\nabla F(\mathbf{x}^{k})\rangle+\langle\mathbf{u}-\tilde{\mathbf{u}}^{k},\mathcal{K}_{2}(\tilde{\mathbf{u}}^{k})\rangle
𝐮𝐮~k,𝐐(𝐮k𝐮~k)+𝐱𝐱~k,𝐝k,𝐮.\displaystyle\geq\langle\mathbf{u}-\tilde{\mathbf{u}}^{k},{\bf{Q}}(\mathbf{u}^{k}-\tilde{\mathbf{u}}^{k})\rangle+\langle\mathbf{x}-\tilde{\mathbf{x}}^{k},\mathbf{d}^{k}\rangle,\forall\mathbf{u}\in\mathcal{M}. (32)

Using (19), (32) can be rewritten as

G(𝐱)G(𝐱~k)+𝐱𝐱~k,F(𝐱k)+𝐮𝐮~k,𝒦2(𝐮)\displaystyle G(\mathbf{x})-G(\tilde{\mathbf{x}}^{k})+\langle\mathbf{x}-\tilde{\mathbf{x}}^{k},\nabla F(\mathbf{x}^{k})\rangle+\langle\mathbf{u}-\tilde{\mathbf{u}}^{k},\mathcal{K}_{2}({\bf{u}})\rangle
𝐮𝐮~k,𝐐(𝐮k𝐮~k)+𝐱𝐱~k,𝐝k,𝐮.\displaystyle\geq\langle\mathbf{u}-\tilde{\mathbf{u}}^{k},{\bf{Q}}(\mathbf{u}^{k}-\tilde{\mathbf{u}}^{k})\rangle+\langle\mathbf{x}-\tilde{\mathbf{x}}^{k},\mathbf{d}^{k}\rangle,\forall\mathbf{u}\in\mathcal{M}. (33)

Note that

𝐮𝐮~k,𝒦1(𝐮)=𝐱𝐱~k,F(𝐱)+𝐮𝐮~k,𝒦2(𝐮).\langle\mathbf{u}-\tilde{\mathbf{u}}^{k},\mathcal{K}_{1}({\bf{u}})\rangle=\langle\mathbf{x}-\tilde{\mathbf{x}}^{k},\nabla F(\mathbf{x})\rangle+\langle\mathbf{u}-\tilde{\mathbf{u}}^{k},\mathcal{K}_{2}({\bf{u}})\rangle.

By rearranging some terms of (33), it holds that for all {\bf{u}}\in\mathcal{M}

G(𝐱)G(𝐱~k)+𝐮𝐮~k,𝒦1(𝐮)\displaystyle G(\mathbf{x})-G(\tilde{\mathbf{x}}^{k})+\langle\mathbf{u}-\tilde{\mathbf{u}}^{k},\mathcal{K}_{1}({\bf{u}})\rangle
𝐱𝐱~k,F(𝐱k)F(𝐱)\displaystyle\geq-\langle{\bf{x}}-\tilde{{\bf{x}}}^{k},\nabla F({\bf{x}}^{k})-\nabla F({\bf{x}})\rangle
+𝐮𝐮~k,𝐐(𝐮k𝐮~k)+𝐱𝐱~k,𝐝k.\displaystyle\quad+\langle\mathbf{u}-\tilde{\mathbf{u}}^{k},{\bf{Q}}(\mathbf{u}^{k}-\tilde{\mathbf{u}}^{k})\rangle+\langle\mathbf{x}-\tilde{\mathbf{x}}^{k},\mathbf{d}^{k}\rangle.

By (5), the first assertion of Lemma 2 holds. Combining (4) and the above inequality, the second assertion of Lemma 2 holds. ∎

Appendix B Proof of Theorem 1

Proof:

1) According to (21c) and 𝐐=𝐇𝐌{\bf{Q}}={\bf{HM}}, one has that

𝐮𝐮~k,𝐐(𝐮k𝐮~k)=𝐮𝐮~k,𝐇(𝐮k𝐮k+1).\displaystyle\langle\mathbf{u}-\tilde{\mathbf{u}}^{k},{\bf{Q}}(\mathbf{u}^{k}-\tilde{\mathbf{u}}^{k})\rangle=\langle\mathbf{u}-\tilde{\mathbf{u}}^{k},{\bf{H}}(\mathbf{u}^{k}-{\mathbf{u}}^{k+1})\rangle.

Applying the identity ab,𝐇(cd)=12{ad𝐇2ac𝐇2+bc𝐇2bd𝐇2}\langle a-b,{\bf{H}}(c-d)\rangle=\frac{1}{2}\{\|a-d\|^{2}_{{\bf{H}}}-\|a-c\|^{2}_{{\bf{H}}}+\|b-c\|^{2}_{{\bf{H}}}-\|b-d\|^{2}_{{\bf{H}}}\} to the 𝐮~k𝐮,H(𝐮k+1𝐮k)\langle\tilde{\mathbf{u}}^{k}-\mathbf{u},\mathrm{H}(\mathbf{u}^{k+1}-\mathbf{u}^{k})\rangle with specifications a=𝐮~ka=\tilde{\mathbf{u}}^{k}, b=𝐮b=\mathbf{u}, c=𝐮k+1c=\mathbf{u}^{k+1} and d=𝐮kd=\mathbf{u}^{k}, it holds that

𝐮~k𝐮,𝐇(𝐮k+1𝐮k)=12{𝐮~k𝐮k𝐇2\displaystyle\langle\tilde{\mathbf{u}}^{k}-\mathbf{u},{\bf{H}}(\mathbf{u}^{k+1}-\mathbf{u}^{k})\rangle=\frac{1}{2}\{\|\tilde{\mathbf{u}}^{k}-\mathbf{u}^{k}\|^{2}_{{\bf{H}}}
𝐮k+1𝐮~k𝐇2+𝐮k+1𝐮𝐇2𝐮k𝐮𝐇2}.\displaystyle\quad-\|\mathbf{u}^{k+1}-\tilde{\mathbf{u}}^{k}\|^{2}_{{\bf{H}}}+\|\mathbf{u}^{k+1}-\mathbf{u}\|^{2}_{{\bf{H}}}-\|\mathbf{u}^{k}-\mathbf{u}\|^{2}_{{\bf{H}}}\}. (34)

Note that 𝐐=𝐇𝐌{\bf{Q}}={\bf{HM}} and 𝐆=𝐐𝖳+𝐐𝐌𝖳𝐇𝐌{\bf{G}}={\bf{Q}}^{\sf T}+{\bf{Q}}-{\bf{M}}^{\sf T}{\bf{H}}{\bf{M}}, one has

𝐮k𝐮~k𝐇2𝐮k+1𝐮~k𝐇2\displaystyle\|\mathbf{u}^{k}-\tilde{\mathbf{u}}^{k}\|^{2}_{{\bf{H}}}-\|\mathbf{u}^{k+1}-\tilde{\mathbf{u}}^{k}\|^{2}_{{\bf{H}}}
=𝐮k𝐮~k𝐇2(𝐮k𝐮~k)𝐌(𝐮k𝐮~k)𝐇2\displaystyle=\|\mathbf{u}^{k}-\tilde{\mathbf{u}}^{k}\|^{2}_{{\bf{H}}}-\|({\bf{u}}^{k}-\tilde{{\bf{u}}}^{k})-{\bf{M}}({\bf{u}}^{k}-\tilde{{\bf{u}}}^{k})\|^{2}_{{\bf{H}}}
=2𝐮k𝐮~k,𝐇𝐌(𝐮k𝐮~k)\displaystyle=2\langle\mathbf{u}^{k}-\tilde{\mathbf{u}}^{k},{\bf{HM}}(\mathbf{u}^{k}-\tilde{\mathbf{u}}^{k})\rangle
𝐮k𝐮~k,𝐌𝖳𝐇𝐌(𝐮k𝐮~k)\displaystyle\quad-\langle\mathbf{u}^{k}-\tilde{\mathbf{u}}^{k},{\bf{M}}^{\sf T}{\bf{H}}{\bf{M}}(\mathbf{u}^{k}-\tilde{\mathbf{u}}^{k})\rangle
=𝐮k𝐮~k,(𝐐𝖳+𝐐𝐌𝖳𝐇𝐌)(𝐮k𝐮~k)\displaystyle=\langle\mathbf{u}^{k}-\tilde{\mathbf{u}}^{k},({\bf{Q}}^{\sf T}+{\bf{Q}}-{\bf{M}}^{\sf T}{\bf{HM}})(\mathbf{u}^{k}-\tilde{\mathbf{u}}^{k})\rangle
=𝐮k𝐮~k𝐆2.\displaystyle=\|\mathbf{u}^{k}-\tilde{\mathbf{u}}^{k}\|^{2}_{{\bf{G}}}. (35)

Thus, substituting (34) and (35) into (2), it holds that

G(𝐱)G(𝐱~k)+𝐮𝐮~k,𝒦1(𝐮)𝐱𝐱~k,𝐝k\displaystyle G(\mathbf{x})-G(\tilde{\mathbf{x}}^{k})+\langle\mathbf{u}-\tilde{\mathbf{u}}^{k},\mathcal{K}_{1}(\mathbf{u})\rangle\geq\langle\mathbf{x}-\tilde{\mathbf{x}}^{k},\mathbf{d}^{k}\rangle
+12(𝐮𝐮k+1𝐇2𝐮𝐮k𝐇2)+12𝐮k𝐮~k𝐇^12.\displaystyle+\frac{1}{2}(\|{\bf{u}}-{\bf{u}}^{k+1}\|^{2}_{{\bf{H}}}-\|{\bf{u}}-{\bf{u}}^{k}\|^{2}_{{\bf{H}}})+\frac{1}{2}\|\mathbf{u}^{k}-\tilde{\mathbf{u}}^{k}\|^{2}_{\widehat{{\bf{H}}}_{1}}. (36)

Letting {\bf{u}} in (36) be an arbitrary {\bf{u}}^{*}\in\mathcal{M}^{*}, we have

12𝐮k+1𝐮𝐇212𝐮k𝐮𝐇212𝐮k𝐮~k𝐇^12\displaystyle\frac{1}{2}\|\mathbf{u}^{k+1}-\mathbf{u}^{*}\|^{2}_{{\bf{H}}}\leq\frac{1}{2}\|\mathbf{u}^{k}-\mathbf{u}^{*}\|^{2}_{{\bf{H}}}-\frac{1}{2}\|\mathbf{u}^{k}-\tilde{\mathbf{u}}^{k}\|^{2}_{\widehat{{\bf{H}}}_{1}}
+𝐱~k𝐱,𝐝k+G(𝐱)G(𝐱~k)+𝐮𝐮~k,𝒦1(𝐮).\displaystyle+\langle\tilde{\mathbf{x}}^{k}-\mathbf{x}^{*},\mathbf{d}^{k}\rangle+G(\mathbf{x}^{*})-G(\tilde{\mathbf{x}}^{k})+\langle\mathbf{u}^{*}-\tilde{\mathbf{u}}^{k},\mathcal{K}_{1}(\mathbf{u}^{*})\rangle.

Then, by (17), the assertion (1) holds.

2) We introduce the following notations.

𝐲~k\displaystyle\tilde{\mathbf{y}}^{k} =proxGΓ1(𝐱kΓ(F(𝐱k)𝐕𝜶k)),\displaystyle=\mathrm{prox}_{G}^{\Gamma^{-1}}(\mathbf{x}^{k}-\Gamma(\nabla F(\mathbf{x}^{k})-{\bf{V}}\bm{\alpha}^{k})), (37a)
𝐳~k\displaystyle\tilde{{\bf{z}}}^{k} =𝜶kβ𝐕𝐲~k,\displaystyle=\bm{\alpha}^{k}-\beta{\bf{V}}\tilde{\mathbf{y}}^{k}, (37b)
𝐯k+1\displaystyle{\bf{v}}^{k+1} =𝐮k𝐌(𝐮k𝐯~k),\displaystyle={\bf{u}}^{k}-{\bf{M}}({\bf{u}}^{k}-\tilde{{\bf{v}}}^{k}), (37c)

where 𝐯~k=col{𝐲~k,𝐳~k}\tilde{{\bf{v}}}^{k}=\mathrm{col}\{\tilde{\mathbf{y}}^{k},\tilde{{\bf{z}}}^{k}\}. It follows from (1) that

𝐯k+1𝐮𝐇2𝐮k𝐮𝐇2𝐮k𝐯~k𝐇^12,k0,\displaystyle\|\mathbf{v}^{k+1}-\mathbf{u}^{*}\|^{2}_{{\bf{H}}}\leq\|\mathbf{u}^{k}-\mathbf{u}^{*}\|^{2}_{{\bf{H}}}-\|\mathbf{u}^{k}-\tilde{\mathbf{v}}^{k}\|^{2}_{\widehat{{\bf{H}}}_{1}},\forall k\geq 0,

which implies that 𝐯k+1𝐮𝐇𝐮k𝐮𝐇,k0\|\mathbf{v}^{k+1}-\mathbf{u}^{*}\|_{{\bf{H}}}\leq\|\mathbf{u}^{k}-\mathbf{u}^{*}\|_{{\bf{H}}},k\geq 0. Hence, for 𝐮\forall{\bf{u}}^{*}\in\mathcal{M}^{*} one has

𝐮k+1𝐮𝐇\displaystyle\|{\bf{u}}^{k+1}-{\bf{u}}^{*}\|_{{\bf{H}}} 𝐯k+1𝐮𝐇+𝐮k+1𝐯k+1𝐇\displaystyle\leq\|{\bf{v}}^{k+1}-{\bf{u}}^{*}\|_{{\bf{H}}}+\|{\bf{u}}^{k+1}-{\bf{v}}^{k+1}\|_{{\bf{H}}}
𝐮k𝐮𝐇+𝐮k+1𝐯k+1𝐇.\displaystyle\leq\|\mathbf{u}^{k}-\mathbf{u}^{*}\|_{{\bf{H}}}+\|{\bf{u}}^{k+1}-{\bf{v}}^{k+1}\|_{{\bf{H}}}. (38)

By the nonexpansivity of proximal mapping, it holds that

𝐱~k𝐲~kσM(Γ)𝐝kNσM(Γ)εk,k0.\|\tilde{\mathbf{x}}^{k}-\tilde{\mathbf{y}}^{k}\|\leq\sigma_{M}(\Gamma)\|{\bf{d}}^{k}\|\leq\sqrt{N}\sigma_{M}(\Gamma)\varepsilon_{k},\forall k\geq 0.

Therefore, it follows that for k0\forall k\geq 0

𝐮k+1𝐯k+1𝐇=𝐌(𝐮~k𝐯~k)𝐇\displaystyle\|{\bf{u}}^{k+1}-{\bf{v}}^{k+1}\|_{{\bf{H}}}=\|{\bf{M}}(\tilde{{\bf{u}}}^{k}-\tilde{{\bf{v}}}^{k})\|_{{\bf{H}}}
σM(𝐇12𝐌)(𝐱~k𝐲~k+β𝐕(𝐱~k𝐲~k))\displaystyle\leq\sigma_{M}({\bf{H}}^{\frac{1}{2}}{\bf{M}})(\|\tilde{\mathbf{x}}^{k}-\tilde{\mathbf{y}}^{k}\|+\|\beta\mathbf{V}(\tilde{\mathbf{x}}^{k}-\tilde{\mathbf{y}}^{k})\|)
NσM(𝐇12𝐌)σM(Γ)(βσM(𝐕)+1):=μ¯>0εk.\displaystyle\leq\underbrace{\sqrt{N}\sigma_{M}({\bf{H}}^{\frac{1}{2}}{\bf{M}})\sigma_{M}(\Gamma)(\beta\sigma_{M}({\bf{V}})+1)}_{:=\bar{\mu}>0}\varepsilon_{k}. (39)

Combining (38) and (39), it holds that

𝐮k+1𝐮𝐇𝐮k𝐮𝐇+μ¯εk,𝐮.\displaystyle\|{\bf{u}}^{k+1}-{\bf{u}}^{*}\|_{{\bf{H}}}\leq\|\mathbf{u}^{k}-\mathbf{u}^{*}\|_{{\bf{H}}}+\bar{\mu}\varepsilon_{k},\forall{\bf{u}}^{*}\in\mathcal{M}^{*}. (40)

Summing the inequality (40) over k=0,1,,K1k=0,1,\cdots,\mathrm{K}-1, one has for any K1\mathrm{K}\geq 1,

k=0K1{𝐮k+1𝐮𝐇𝐮k𝐮𝐇}k=0K1μ¯εk,\displaystyle\sum_{k=0}^{\mathrm{K}-1}\left\{\|\mathbf{u}^{k+1}-\mathbf{u}^{*}\|_{{\bf{H}}}-\|\mathbf{u}^{k}-\mathbf{u}^{*}\|_{{\bf{H}}}\right\}\leq\sum_{k=0}^{\mathrm{K}-1}\bar{\mu}\varepsilon_{k},

which implies

𝐮K𝐮𝐇𝐮0𝐮𝐇+k=0K1μ¯εk,K1.\|\mathbf{u}^{\mathrm{K}}-\mathbf{u}^{*}\|_{{\bf{H}}}\leq\|\mathbf{u}^{0}-\mathbf{u}^{*}\|_{{\bf{H}}}+\sum_{k=0}^{\mathrm{K}-1}\bar{\mu}\varepsilon_{k},\forall\mathrm{K}\geq 1.

Since 𝐇0{\bf{H}}\succ 0 and {εk}\{\varepsilon_{k}\} is summable, it deduces that for any 𝐮\mathbf{u}^{*}\in\mathcal{M}^{*}, {𝐮k𝐮}\{\|\mathbf{u}^{k}-\mathbf{u}^{*}\|\} is bounded.

3) Summing (1) over k=0,1,,k=0,1,\cdots,\infty, one obtains

k=0𝐮k𝐮~k𝐇^12𝐮0𝐮𝐇2+k=02N𝐱~k𝐱εk.\sum_{k=0}^{\infty}\|{\bf{u}}^{k}-\tilde{{\bf{u}}}^{k}\|^{2}_{\widehat{{\bf{H}}}_{1}}\leq\|{\bf{u}}^{0}-{\bf{u}}^{*}\|_{{\bf{H}}}^{2}+\sum_{k=0}^{\infty}2\sqrt{N}\|\tilde{{\bf{x}}}^{k}-{\bf{x}}^{*}\|\varepsilon_{k}.

Since {εk}\{\varepsilon_{k}\} is summable and {𝐮k}\{\mathbf{u}^{k}\} is bounded, one has

k=0𝐮k𝐮~k𝐇^12<,\displaystyle\sum_{k=0}^{\infty}\|{\bf{u}}^{k}-\tilde{{\bf{u}}}^{k}\|^{2}_{\widehat{{\bf{H}}}_{1}}<\infty, (41)

which implies that \|\tilde{\mathbf{u}}^{k}-\mathbf{u}^{k}\|^{2}\rightarrow 0 as k\rightarrow\infty. Let \mathbf{u}^{\infty} be a cluster point of \{\tilde{\mathbf{u}}^{k}\} and \{\tilde{\mathbf{u}}^{k_{j}}\} be a subsequence converging to \mathbf{u}^{\infty}. Then, it follows from (32) that

G(𝐱)G(𝐱~kj)+𝐱𝐱~kj,F(𝐱kj)+𝐮𝐮~kj,𝒦2(𝐮~kj)\displaystyle G(\mathbf{x})-G(\tilde{\mathbf{x}}^{k_{j}})+\langle\mathbf{x}-\tilde{\mathbf{x}}^{k_{j}},\nabla F(\mathbf{x}^{k_{j}})\rangle+\langle\mathbf{u}-\tilde{\mathbf{u}}^{k_{j}},\mathcal{K}_{2}(\tilde{\mathbf{u}}^{k_{j}})\rangle
𝐮𝐮~kj,𝐐(𝐮kj𝐮~kj)+𝐱𝐱~kj,𝐝kj,𝐮.\displaystyle\geq\langle\mathbf{u}-\tilde{\mathbf{u}}^{k_{j}},{\bf{Q}}(\mathbf{u}^{k_{j}}-\tilde{\mathbf{u}}^{k_{j}})\rangle+\langle\mathbf{x}-\tilde{\mathbf{x}}^{k_{j}},\mathbf{d}^{k_{j}}\rangle,\forall\mathbf{u}\in\mathcal{M}.

Since the matrix 𝐐{\bf{Q}} is nonsingular, and GG, F\nabla F, and 𝒦2\mathcal{K}_{2} are continuous, taking kjk_{j}\rightarrow\infty in the above inequality, one gets

G(𝐱)G(𝐱)+𝐮𝐮,𝒦1(𝐮)0,𝐮.\displaystyle G(\mathbf{x})-G(\mathbf{x}^{\infty})+\langle\mathbf{u}-\mathbf{u}^{\infty},\mathcal{K}_{1}(\mathbf{u}^{\infty})\rangle\geq 0,\forall\mathbf{u}\in\mathcal{M}.

Comparing with (17), we conclude that {\bf{u}}^{\infty}\in\mathcal{M}^{*}. Therefore, by (40), it holds that

𝐮k+1𝐮𝐇𝐮k𝐮𝐇+μ¯εk.\displaystyle\|{\bf{u}}^{k+1}-{\bf{u}}^{\infty}\|_{{\bf{H}}}\leq\|\mathbf{u}^{k}-\mathbf{u}^{\infty}\|_{{\bf{H}}}+\bar{\mu}\varepsilon_{k}.

Since k=0εk<\sum_{k=0}^{\infty}\varepsilon_{k}<\infty, by [27, Lemma 3.2] the quasi-Fejér monotone sequence {𝐮k𝐮𝐇}\{\|{\bf{u}}^{k}-{\bf{u}}^{\infty}\|_{\mathbf{H}}\} converges to a unique limit point. Then, with 𝐮{\bf{u}}^{\infty} being an accumulation point of {𝐮k}\{{\bf{u}}^{k}\}, one has limk𝐮k=𝐮\lim_{k\rightarrow\infty}{\bf{u}}^{k}={\bf{u}}^{\infty}. ∎

Appendix C Proof of Theorem 2

Proof:

By (21a) and (21b), we have

𝟎G(𝐱~k)+F(𝐱k)𝐕𝜶k+Γ1(𝐱~k𝐱k)𝐝k,\displaystyle{\bf{0}}\in\partial G(\tilde{{\bf{x}}}^{k})+\nabla F({\bf{x}}^{k})-{\bf{V}}\bm{\alpha}^{k}+\Gamma^{-1}(\tilde{{\bf{x}}}^{k}-{\bf{x}}^{k})-{\bf{d}}^{k},
𝟎=β𝐕𝐱~k+(𝜶~k𝜶k),\displaystyle{\bf{0}}=\beta{{\bf{V}}}\tilde{{\bf{x}}}^{k}+(\tilde{\bm{\alpha}}^{k}-\bm{\alpha}^{k}),

which implies that

F(𝐱~k)F(𝐱k)𝐕(𝜶~k𝜶k)Γ1(𝐱~k𝐱k)+𝐝k\displaystyle\nabla F(\tilde{{\bf{x}}}^{k})-\nabla F({\bf{x}}^{k})-{\bf{V}}(\tilde{\bm{\alpha}}^{k}-\bm{\alpha}^{k})-\Gamma^{-1}(\tilde{{\bf{x}}}^{k}-{\bf{x}}^{k})+{\bf{d}}^{k}
G(𝐱~k)+F(𝐱~k)𝐕𝜶~k,\displaystyle\in\partial G(\tilde{{\bf{x}}}^{k})+\nabla F(\tilde{{\bf{x}}}^{k})-{\bf{V}}\tilde{\bm{\alpha}}^{k},

and 1β(𝜶~k𝜶k)=𝐕𝐱~k-\frac{1}{\beta}(\tilde{\bm{\alpha}}^{k}-\bm{\alpha}^{k})={{\bf{V}}}\tilde{{\bf{x}}}^{k}. Therefore, it holds that

dist2(𝟎,𝒥(𝐮~k))F(𝐱~k)F(𝐱k)𝐕(𝜶~k𝜶k)\displaystyle\mathrm{dist}^{2}({\bf{0}},\mathcal{J}(\tilde{{\bf{u}}}^{k}))\leq\|\nabla F(\tilde{{\bf{x}}}^{k})-\nabla F({\bf{x}}^{k})-{\bf{V}}(\tilde{\bm{\alpha}}^{k}-\bm{\alpha}^{k})
Γ1(𝐱~k𝐱k)+𝐝k2+1β(𝜶~k𝜶k)2\displaystyle\quad-\Gamma^{-1}(\tilde{{\bf{x}}}^{k}-{\bf{x}}^{k})+{\bf{d}}^{k}\|^{2}+\|\frac{1}{\beta}(\tilde{\bm{\alpha}}^{k}-\bm{\alpha}^{k})\|^{2}
(4maxi{Li2}+4σm(Γ2))𝐱~k𝐱k2\displaystyle\leq(4\max_{i}\{L^{2}_{i}\}+4\sigma_{m}(\Gamma^{2}))\|\tilde{{\bf{x}}}^{k}-{\bf{x}}^{k}\|^{2}
+(4σM(𝐕2)+1β2)𝜶~k𝜶k2+4𝐝k2\displaystyle\quad+(4\sigma_{M}({\bf{V}}^{2})+\frac{1}{\beta^{2}})\|\tilde{\bm{\alpha}}^{k}-\bm{\alpha}^{k}\|^{2}+4\|{\bf{d}}^{k}\|^{2}
κ12𝐮~k𝐮k𝐇^12+4N(εk)2,\displaystyle\leq\kappa_{1}^{2}\|\tilde{{\bf{u}}}^{k}-{\bf{u}}^{k}\|^{2}_{\widehat{{\bf{H}}}_{1}}+4N(\varepsilon_{k})^{2}, (42)

where κ12=max{4σM(𝐕2)+1β2,4maxi{Li2}+4σm(Γ2)}σm(𝐇^1)\kappa_{1}^{2}=\frac{\max\{4\sigma_{M}({\bf{V}}^{2})+\frac{1}{\beta^{2}},4\max_{i}\{L^{2}_{i}\}+4\sigma_{m}(\Gamma^{-2})\}}{\sigma_{m}(\widehat{{\bf{H}}}_{1})}. Thus, by (41) and [20, Proposition 1], the O(1/(K+1))O(1/(K+1)) running-average optimality residual and the o(1/(K+1))o(1/(K+1)) running-best optimality residual hold.

Substituting (34) and (35) into (2), it holds that

ϕ(𝐱)ϕ(𝐱~k)+𝐮𝐮~k,𝒦2(𝐮)𝐱𝐱~k,𝐝k\displaystyle\phi(\mathbf{x})-\phi(\tilde{\mathbf{x}}^{k})+\langle\mathbf{u}-\tilde{\mathbf{u}}^{k},\mathcal{K}_{2}(\mathbf{u})\rangle\geq\langle\mathbf{x}-\tilde{\mathbf{x}}^{k},\mathbf{d}^{k}\rangle
+12(𝐮𝐮k+1𝐇2𝐮𝐮k𝐇2)+12𝐮k𝐮~k𝐇^22.\displaystyle+\frac{1}{2}(\|{\bf{u}}-{\bf{u}}^{k+1}\|^{2}_{{\bf{H}}}-\|{\bf{u}}-{\bf{u}}^{k}\|^{2}_{{\bf{H}}})+\frac{1}{2}\|\mathbf{u}^{k}-\tilde{\mathbf{u}}^{k}\|^{2}_{\widehat{{\bf{H}}}_{2}}. (43)

Note that

L(𝐱~k,𝜶)L(𝐱,𝜶~k)=ϕ(𝐱~k)ϕ(𝐱)+𝐮~k𝐮,𝒦2(𝐮).L(\tilde{{\bf{x}}}^{k},\bm{\alpha})-L({\bf{x}},\tilde{\bm{\alpha}}^{k})=\phi(\tilde{\mathbf{x}}^{k})-\phi(\mathbf{x})+\langle\tilde{\mathbf{u}}^{k}-\mathbf{u},\mathcal{K}_{2}(\mathbf{u})\rangle.

Summing (43) over k=0,1,\cdots,K-1, we obtain

2k=0K1(L(𝐱~k,𝜶)L(𝐱,𝜶~k))\displaystyle 2\sum_{k=0}^{K-1}({L}(\tilde{{\bf{x}}}^{k},\bm{\alpha})-{L}({\bf{x}},\tilde{\bm{\alpha}}^{k}))
𝐮0𝐮𝐇2+2Nk=0K1𝐱~k𝐱εk.\displaystyle\leq\|{\bf{u}}^{0}-{\bf{u}}\|_{{\bf{H}}}^{2}+2\sqrt{N}\sum_{k=0}^{K-1}\|\tilde{{\bf{x}}}^{k}-{\bf{x}}\|\varepsilon_{k}.

It follows from the convexity of FF, GG and the definition of (𝐗K,𝚲K)({\bf{X}}^{K},{\bf{\Lambda}}^{K}) that

K(L(𝐗K,𝜶)L(𝐱,𝚲K))k=0K1(L(𝐱~k,𝜶)L(𝐱,𝜶~k)).K(L({\bf{X}}^{K},\bm{\alpha})-L({\bf{x}},{\bf{\Lambda}}^{K}))\leq\sum_{k=0}^{K-1}(L(\tilde{{\bf{x}}}^{k},\bm{\alpha})-L({\bf{x}},\tilde{\bm{\alpha}}^{k})).

Therefore, the primal–dual gap

L(𝐗K,𝜶)L(𝐱,𝚲K)\displaystyle L({\bf{X}}^{K},\bm{\alpha})-L({\bf{x}},{\bf{\Lambda}}^{K})
𝐮0𝐮𝐇2+2Nk=0K1𝐱~k𝐱εk2K\displaystyle\leq\frac{\|{\bf{u}}^{0}-{\bf{u}}\|_{{\bf{H}}}^{2}+2\sqrt{N}\sum_{k=0}^{K-1}\|\tilde{{\bf{x}}}^{k}-{\bf{x}}\|\varepsilon_{k}}{2K} (44)

holds. Since {εk}\{\varepsilon_{k}\} is summable and {𝐮k}\{\mathbf{u}^{k}\} is bounded, the primal-dual gap is O(1/K)O(1/K).

Note that inequality (44) holds for all ({\bf{x}},\bm{\alpha})\in\mathcal{M}; hence it also holds for any optimal solution {\bf{x}}^{*} and for \mathcal{B}_{\rho}=\{\bm{\alpha}:\|\bm{\alpha}\|\leq\rho\}, where \rho>0 is any given positive number. Letting {\bf{u}}=\mathrm{col}\{{\bf{x}}^{*},\bm{\alpha}\} and {\bf{U}}^{K}=\mathrm{col}\{{\bf{X}}^{K},{\bf{\Lambda}}^{K}\}, it gives

sup𝜶ρ{(𝐗K,𝜶)(𝐱,𝚲K)}\displaystyle\sup_{\bm{\alpha}\in\mathcal{B}_{\rho}}\{\mathcal{L}({\bf{X}}^{K},\bm{\alpha})-\mathcal{L}({\bf{x}}^{*},{\bf{\Lambda}}^{K})\}
=sup𝜶ρ{ϕ(𝐗K)ϕ(𝐱)+𝐔K𝐮,𝒦2(𝐮)}\displaystyle=\sup_{\bm{\alpha}\in\mathcal{B}_{\rho}}\{\phi({\bf{X}}^{K})-\phi(\mathbf{x}^{*})+\langle{\mathbf{U}}^{K}-\mathbf{u},\mathcal{K}_{2}(\mathbf{u})\rangle\}
=(19)sup𝜶ρ{ϕ(𝐗K)ϕ(𝐱)+𝐔K𝐮,𝒦2(𝐔K)}\displaystyle\overset{\eqref{skew-symmetric}}{=}\sup_{\bm{\alpha}\in\mathcal{B}_{\rho}}\{\phi({\bf{X}}^{K})-\phi(\mathbf{x}^{*})+\langle{\mathbf{U}}^{K}-\mathbf{u},\mathcal{K}_{2}({\mathbf{U}}^{K})\rangle\}
=sup𝜶ρ{ϕ(𝐗K)ϕ(𝐱)+𝐗K𝐱,𝐕𝚲K\displaystyle=\sup_{\bm{\alpha}\in\mathcal{B}_{\rho}}\{\phi({\bf{X}}^{K})-\phi(\mathbf{x}^{*})+\langle{\bf{X}}^{K}-{\bf{x}}^{*},-{\bf{V}}{\bf{\Lambda}}^{K}\rangle
+𝚲k𝜶,𝐕𝐗K}\displaystyle\quad\quad\quad\quad+\langle{\bf{\Lambda}}^{k}-\bm{\alpha},{\bf{V}}{\bf{X}}^{K}\rangle\}
=sup𝜶ρ{ϕ(𝐗K)ϕ(𝐱)𝜶,𝐕𝐗K}\displaystyle=\sup_{\bm{\alpha}\in\mathcal{B}_{\rho}}\{\phi({\bf{X}}^{K})-\phi(\mathbf{x}^{*})-\langle\bm{\alpha},{\bf{V}}{\bf{X}}^{K}\rangle\}
=ϕ(𝐗K)ϕ(𝐱)+ρ𝐕𝐗K.\displaystyle=\phi({\bf{X}}^{K})-\phi(\mathbf{x}^{*})+\rho\|{\bf{VX}}^{K}\|. (45)

Note that \bm{\alpha}^{0}={\bf{0}}. Combining (44) and (45), one has

ϕ(𝐗K)ϕ(𝐱)+ρ𝐕𝐗K\displaystyle\phi({\bf{X}}^{K})-\phi(\mathbf{x}^{*})+\rho\|{\bf{VX}}^{K}\|
sup𝜶ρ{𝐮0𝐮𝐇2+2Nk=0K1𝐱~k𝐱εk2K}\displaystyle\leq\sup_{\bm{\alpha}\in\mathcal{B}_{\rho}}\{\frac{\|{\bf{u}}^{0}-{\bf{u}}\|_{{\bf{H}}}^{2}+2\sqrt{N}\sum_{k=0}^{K-1}\|\tilde{{\bf{x}}}^{k}-{\bf{x}}^{*}\|{\varepsilon_{k}}}{2K}\}
=𝐱0𝐱Γ12+1βρ2+2Nk=0K1𝐱~k𝐱εk2K.\displaystyle=\frac{\|{\bf{x}}^{0}-{\bf{x}}^{*}\|^{2}_{\Gamma^{-1}}+\frac{1}{\beta}\rho^{2}+2\sqrt{N}\sum_{k=0}^{K-1}\|\tilde{{\bf{x}}}^{k}-{\bf{x}}^{*}\|{\varepsilon_{k}}}{2K}.

Since {𝐮k}\{{\bf{u}}^{k}\} is bounded and k=0εk<\sum_{k=0}^{\infty}{\varepsilon_{k}}<\infty, there exists 0<ψ<0<\psi<\infty such that

ϕ(𝐗K)ϕ(𝐱)+ρ𝐕𝐗KψK.\displaystyle\phi({\bf{X}}^{K})-\phi(\mathbf{x}^{*})+\rho\|{\bf{VX}}^{K}\|\leq\frac{\psi}{K}. (46)

For any \rho>\|\bm{\alpha}^{*}\|, applying [34, Lemma 2.3] to (46), one has

𝐕𝐗K\displaystyle\|{\bf{VX}}^{K}\| ψK(ρ𝜶),\displaystyle\leq\frac{\psi}{K(\rho-\|\bm{\alpha}^{*}\|)},
ψ𝜶K(ρ𝜶)\displaystyle-\frac{\psi\|\bm{\alpha}^{*}\|}{K(\rho-\|\bm{\alpha}^{*}\|)} ϕ(𝐗K)ϕ(𝐱)ψK.\displaystyle\leq\phi({\bf{X}}^{K})-\phi({\bf{x}}^{*})\leq\frac{\psi}{K}.

Letting ρ=max{1+𝜶,2𝜶}\rho=\max\{1+\|\bm{\alpha}^{*}\|,2\|\bm{\alpha}^{*}\|\}, we have

|ϕ(𝐗K)ϕ(𝐱)|ψK,𝐈𝐖𝐗KψK.\displaystyle|\phi({\bf{X}}^{K})-\phi({\bf{x}}^{*})|\leq\frac{\psi}{K},\|\sqrt{{\bf{I-W}}}{\bf{X}}^{K}\|\leq\frac{\psi}{K}.

Thus, the primal gap and the consensus error are O(1/K)O(1/K). ∎

Appendix D Proof of Theorem 3

Proof:

Since \varepsilon_{k}=0 for all k\geq 0, by (32), one has

G(𝐱)G(𝐱~k)+𝐱𝐱~k,F(𝐱k)\displaystyle G(\mathbf{x})-G(\tilde{\mathbf{x}}^{k})+\langle\mathbf{x}-\tilde{\mathbf{x}}^{k},\nabla F(\mathbf{x}^{k})\rangle
+𝐮𝐮~k,𝒦2(𝐮~k)\displaystyle\quad+\langle\mathbf{u}-\tilde{\mathbf{u}}^{k},\mathcal{K}_{2}(\tilde{\mathbf{u}}^{k})\rangle
𝐮𝐮~k,𝐐(𝐮k𝐮~k),𝐮.\displaystyle\geq\langle\mathbf{u}-\tilde{\mathbf{u}}^{k},{\bf{Q}}(\mathbf{u}^{k}-\tilde{\mathbf{u}}^{k})\rangle,\forall\mathbf{u}\in\mathcal{M}. (47)

Letting {\bf{u}}=\tilde{{\bf{u}}}^{k+1} in (47), one has

G(𝐱~k+1)G(𝐱~k)+𝐱~k+1𝐱~k,F(𝐱k)\displaystyle G(\tilde{{\bf{x}}}^{k+1})-G(\tilde{\mathbf{x}}^{k})+\langle\tilde{{\bf{x}}}^{k+1}-\tilde{\mathbf{x}}^{k},\nabla F(\mathbf{x}^{k})\rangle
+𝐮~k+1𝐮~k,𝒦2(𝐮~k)\displaystyle\quad+\langle\tilde{{\bf{u}}}^{k+1}-\tilde{\mathbf{u}}^{k},\mathcal{K}_{2}(\tilde{\mathbf{u}}^{k})\rangle
𝐮~k+1𝐮~k,𝐐(𝐮k𝐮~k).\displaystyle\geq\langle\tilde{{\bf{u}}}^{k+1}-\tilde{\mathbf{u}}^{k},{\bf{Q}}(\mathbf{u}^{k}-\tilde{\mathbf{u}}^{k})\rangle. (48)

Rewriting inequality (47) for the (k+1)-th iteration gives

G(𝐱)G(𝐱~k+1)+𝐱𝐱~k+1,F(𝐱k+1)\displaystyle G(\mathbf{x})-G(\tilde{\mathbf{x}}^{k+1})+\langle\mathbf{x}-\tilde{\mathbf{x}}^{k+1},\nabla F(\mathbf{x}^{k+1})\rangle
+𝐮𝐮~k+1,𝒦2(𝐮~k+1)\displaystyle\quad+\langle\mathbf{u}-\tilde{\mathbf{u}}^{k+1},\mathcal{K}_{2}(\tilde{\mathbf{u}}^{k+1})\rangle
𝐮𝐮~k+1,𝐐(𝐮k+1𝐮~k+1),𝐮.\displaystyle\geq\langle\mathbf{u}-\tilde{\mathbf{u}}^{k+1},{\bf{Q}}(\mathbf{u}^{k+1}-\tilde{\mathbf{u}}^{k+1})\rangle,\forall\mathbf{u}\in\mathcal{M}. (49)

Setting {\bf{u}}=\tilde{{\bf{u}}}^{k} in (49), one has

G(𝐱~k)G(𝐱~k+1)+𝐱~k𝐱~k+1,F(𝐱k+1)\displaystyle G(\tilde{{\bf{x}}}^{k})-G(\tilde{\mathbf{x}}^{k+1})+\langle\tilde{{\bf{x}}}^{k}-\tilde{\mathbf{x}}^{k+1},\nabla F(\mathbf{x}^{k+1})\rangle
+𝐮~k𝐮~k+1,𝒦2(𝐮~k+1)\displaystyle\quad+\langle\tilde{{\bf{u}}}^{k}-\tilde{\mathbf{u}}^{k+1},\mathcal{K}_{2}(\tilde{\mathbf{u}}^{k+1})\rangle
𝐮~k𝐮~k+1,𝐐(𝐮k+1𝐮~k+1),𝐮.\displaystyle\geq\langle\tilde{{\bf{u}}}^{k}-\tilde{\mathbf{u}}^{k+1},{\bf{Q}}(\mathbf{u}^{k+1}-\tilde{\mathbf{u}}^{k+1})\rangle,\forall\mathbf{u}\in\mathcal{M}. (50)

Note that \langle\tilde{{\bf{u}}}^{k}-\tilde{{\bf{u}}}^{k+1},\mathcal{K}_{2}(\tilde{{\bf{u}}}^{k})-\mathcal{K}_{2}(\tilde{{\bf{u}}}^{k+1})\rangle\equiv 0. Adding (48) and (50), we obtain

𝐮~k𝐮~k+1,𝐐((𝐮k𝐮~k)(𝐮k+1𝐮~k+1))\displaystyle\langle\tilde{{\bf{u}}}^{k}-\tilde{{\bf{u}}}^{k+1},{\bf{Q}}(({\bf{u}}^{k}-\tilde{{\bf{u}}}^{k})-({\bf{u}}^{k+1}-\tilde{{\bf{u}}}^{k+1}))\rangle
𝐱~k𝐱~k+1,F(𝐱k)F(𝐱k+1).\displaystyle\geq\langle\tilde{{\bf{x}}}^{k}-\tilde{\mathbf{x}}^{k+1},\nabla F(\mathbf{x}^{k})-\nabla F(\mathbf{x}^{k+1})\rangle. (51)

Then, adding the term

(𝐮k𝐮~k)(𝐮k+1𝐮~k+1),𝐐((𝐮k𝐮~k)(𝐮k+1𝐮~k+1))\langle({\bf{u}}^{k}-\tilde{{\bf{u}}}^{k})-({\bf{u}}^{k+1}-\tilde{{\bf{u}}}^{k+1}),{\bf{Q}}(({\bf{u}}^{k}-\tilde{{\bf{u}}}^{k})-({\bf{u}}^{k+1}-\tilde{{\bf{u}}}^{k+1}))\rangle

to both sides of (51) and using the identity

𝐮𝖳𝐐𝐮=12𝐮𝖳(𝐐𝖳+𝐐)𝐮,𝐮,{\bf{u}}^{\sf T}{\bf{Q}}{\bf{u}}=\frac{1}{2}{\bf{u}}^{\sf T}({\bf{Q}}^{\sf T}+{\bf{Q}}){\bf{u}},\forall{\bf{u}}\in\mathcal{M},

one obtains that

𝐮k𝐮k+1,𝐐((𝐮k𝐮~k)(𝐮k+1𝐮~k+1))(I)\displaystyle\underbrace{\langle{{\bf{u}}}^{k}-{{\bf{u}}}^{k+1},{\bf{Q}}(({\bf{u}}^{k}-\tilde{{\bf{u}}}^{k})-({\bf{u}}^{k+1}-\tilde{{\bf{u}}}^{k+1}))\rangle}_{({\mathrm{I}})}
𝐱~k𝐱~k+1,F(𝐱k)F(𝐱k+1)(II)\displaystyle\geq\underbrace{\langle\tilde{{\bf{x}}}^{k}-\tilde{\mathbf{x}}^{k+1},\nabla F(\mathbf{x}^{k})-\nabla F(\mathbf{x}^{k+1})\rangle}_{({\mathrm{II}})}
+12(𝐮k𝐮~k)(𝐮k+1𝐮~k+1)𝐐𝖳+𝐐2.\displaystyle\quad+\frac{1}{2}\|({\bf{u}}^{k}-\tilde{{\bf{u}}}^{k})-({\bf{u}}^{k+1}-\tilde{{\bf{u}}}^{k+1})\|^{2}_{{\bf{Q}}^{\sf T}+{\bf{Q}}}. (52)

On one hand, we have

(I)=\displaystyle({\mathrm{I}})= 𝐮k𝐮k+1,𝐐((𝐮k𝐮~k)(𝐮k+1𝐮~k+1))\displaystyle\langle{{\bf{u}}}^{k}-{{\bf{u}}}^{k+1},{\bf{Q}}(({\bf{u}}^{k}-\tilde{{\bf{u}}}^{k})-({\bf{u}}^{k+1}-\tilde{{\bf{u}}}^{k+1}))\rangle
=(21c)\displaystyle\overset{\eqref{DALM2.3}}{=} 𝐌(𝐮k𝐮~k),𝐐((𝐮k𝐮~k)(𝐮k+1𝐮~k+1))\displaystyle\langle{\bf{M}}({\bf{u}}^{k}-\tilde{{\bf{u}}}^{k}),{\bf{Q}}(({\bf{u}}^{k}-\tilde{{\bf{u}}}^{k})-({\bf{u}}^{k+1}-\tilde{{\bf{u}}}^{k+1}))\rangle
=(20)\displaystyle\overset{\eqref{MAXCON}}{=} 𝐮k𝐮~k,𝐌𝖳𝐇𝐌((𝐮k𝐮~k)(𝐮k+1𝐮~k+1)).\displaystyle\langle{\bf{u}}^{k}-\tilde{{\bf{u}}}^{k},{\bf{M}}^{\sf T}{\bf{HM}}(({\bf{u}}^{k}-\tilde{{\bf{u}}}^{k})-({\bf{u}}^{k+1}-\tilde{{\bf{u}}}^{k+1}))\rangle.

On the other hand, we have

(II)=\displaystyle-({\mathrm{II}})= F(𝐱k+1)F(𝐱k),(𝐱~k𝐱k)(𝐱~k+1𝐱k+1)\displaystyle\langle\nabla F({\bf{x}}^{k+1})-\nabla F({\bf{x}}^{k}),(\tilde{{\bf{x}}}^{k}-{\bf{x}}^{k})-(\tilde{{\bf{x}}}^{k+1}-{\bf{x}}^{k+1})\rangle
+F(𝐱k+1)F(𝐱k),𝐱k𝐱k+1\displaystyle+\langle\nabla F({\bf{x}}^{k+1})-\nabla F({\bf{x}}^{k}),{\bf{x}}^{k}-{\bf{x}}^{k+1}\rangle
\displaystyle\leq ϵ2(𝐱~k𝐱k)(𝐱~k+1𝐱k+1)𝐋F2\displaystyle\frac{\epsilon}{2}\|(\tilde{{\bf{x}}}^{k}-{\bf{x}}^{k})-(\tilde{{\bf{x}}}^{k+1}-{\bf{x}}^{k+1})\|^{2}_{{\bf{L}}_{F}}
+12ϵF(𝐱k+1)F(𝐱k)𝐋F12\displaystyle+\frac{1}{2\epsilon}\|\nabla F({\bf{x}}^{k+1})-\nabla F({\bf{x}}^{k})\|^{2}_{{\bf{L}}_{F}^{-1}}
+F(𝐱k+1)F(𝐱k),𝐱k𝐱k+1\displaystyle+\langle\nabla F({\bf{x}}^{k+1})-\nabla F({\bf{x}}^{k}),{\bf{x}}^{k}-{\bf{x}}^{k+1}\rangle
(3)\displaystyle\overset{\eqref{SM3}}{\leq} ϵ2(𝐱~k𝐱k)(𝐱~k+1𝐱k+1)𝐋F2\displaystyle\frac{\epsilon}{2}\|(\tilde{{\bf{x}}}^{k}-{\bf{x}}^{k})-(\tilde{{\bf{x}}}^{k+1}-{\bf{x}}^{k+1})\|^{2}_{{\bf{L}}_{F}}
+(112ϵ)F(𝐱k+1)F(𝐱k),𝐱k𝐱k+1.\displaystyle+(1-\frac{1}{2\epsilon})\langle\nabla F({\bf{x}}^{k+1})-\nabla F({\bf{x}}^{k}),{\bf{x}}^{k}-{\bf{x}}^{k+1}\rangle.

Setting ϵ=12\epsilon=\frac{1}{2}, one has that

(II)14(𝐱~k𝐱k)(𝐱~k+1𝐱k+1)𝐋F2.({\mathrm{II}})\geq-\frac{1}{4}\|(\tilde{{\bf{x}}}^{k}-{\bf{x}}^{k})-(\tilde{{\bf{x}}}^{k+1}-{\bf{x}}^{k+1})\|^{2}_{{\bf{L}}_{F}}.

Thus, by (52), it holds that

2𝐮k𝐮~k,𝐌𝖳𝐇𝐌((𝐮k𝐮~k)(𝐮k+1𝐮~k+1))\displaystyle 2\langle{\bf{u}}^{k}-\tilde{{\bf{u}}}^{k},{\bf{M}}^{\sf T}{\bf{HM}}(({\bf{u}}^{k}-\tilde{{\bf{u}}}^{k})-({\bf{u}}^{k+1}-\tilde{{\bf{u}}}^{k+1}))\rangle
(𝐮k𝐮~k)(𝐮k+1𝐮~k+1)𝐐𝖳+𝐐2\displaystyle\geq\|({\bf{u}}^{k}-\tilde{{\bf{u}}}^{k})-({\bf{u}}^{k+1}-\tilde{{\bf{u}}}^{k+1})\|^{2}_{{\bf{Q}}^{\sf T}+{\bf{Q}}}
12(𝐱~k𝐱k)(𝐱~k+1𝐱k+1)𝐋F2.\displaystyle\quad-\frac{1}{2}\|(\tilde{{\bf{x}}}^{k}-{\bf{x}}^{k})-(\tilde{{\bf{x}}}^{k+1}-{\bf{x}}^{k+1})\|^{2}_{{\bf{L}}_{F}}. (53)

By the identity a𝐇2b𝐇2=2a,𝐇(ab)ab𝐇2\|a\|^{2}_{{\bf{H}}}-\|b\|^{2}_{{\bf{H}}}=2\langle a,{\bf{H}}(a-b)\rangle-\|a-b\|^{2}_{{\bf{H}}} with a=𝐌(𝐮k𝐮~k)a={\bf{M}}({\bf{u}}^{k}-\tilde{{\bf{u}}}^{k}) and b=𝐌(𝐮k+1𝐮~k+1)b={\bf{M}}({\bf{u}}^{k+1}-\tilde{{\bf{u}}}^{k+1}) and 𝐇^1=𝐐𝖳+𝐐𝐌𝖳𝐇𝐌12diag{𝐋F,𝟎}\widehat{{\bf{H}}}_{1}={\bf{Q}}^{\sf T}+{\bf{Q}}-{\bf{M}}^{\sf T}{\bf{HM}}-\frac{1}{2}\mathrm{diag}\{{\bf{L}}_{F},{\bf{0}}\}, we have

𝐌(𝐮k𝐮~k)𝐇2𝐌(𝐮k+1𝐮~k+1)𝐇2\displaystyle\|{\bf{M}}({\bf{u}}^{k}-\tilde{{\bf{u}}}^{k})\|^{2}_{{\bf{H}}}-\|{\bf{M}}({\bf{u}}^{k+1}-\tilde{{\bf{u}}}^{k+1})\|^{2}_{{\bf{H}}}
=2𝐮k𝐮~k,𝐌𝖳𝐇𝐌((𝐮k𝐮~k)(𝐮k+1𝐮~k+1))\displaystyle=2\langle{\bf{u}}^{k}-\tilde{{\bf{u}}}^{k},{\bf{M}}^{\sf T}{\bf{HM}}(({\bf{u}}^{k}-\tilde{{\bf{u}}}^{k})-({\bf{u}}^{k+1}-\tilde{{\bf{u}}}^{k+1}))\rangle
+𝐌((𝐮k𝐮~k)(𝐮k+1𝐮~k+1))𝐇2\displaystyle\quad+\|{\bf{M}}(({\bf{u}}^{k}-\tilde{{\bf{u}}}^{k})-({\bf{u}}^{k+1}-\tilde{{\bf{u}}}^{k+1}))\|^{2}_{{\bf{H}}}
(D)(𝐮k𝐮~k)(𝐮k+1𝐮~k+1)𝐐𝖳+𝐐2\displaystyle\overset{\eqref{Proof:Theorem:point-wise-rate:7}}{\geq}\|({\bf{u}}^{k}-\tilde{{\bf{u}}}^{k})-({\bf{u}}^{k+1}-\tilde{{\bf{u}}}^{k+1})\|^{2}_{{\bf{Q}}^{\sf T}+{\bf{Q}}}
+𝐌((𝐮k𝐮~k)(𝐮k+1𝐮~k+1))𝐇2\displaystyle\quad+\|{\bf{M}}(({\bf{u}}^{k}-\tilde{{\bf{u}}}^{k})-({\bf{u}}^{k+1}-\tilde{{\bf{u}}}^{k+1}))\|^{2}_{{\bf{H}}}
12(𝐱~k𝐱k)(𝐱~k+1𝐱k+1)𝐋F2\displaystyle\quad-\frac{1}{2}\|(\tilde{{\bf{x}}}^{k}-{\bf{x}}^{k})-(\tilde{{\bf{x}}}^{k+1}-{\bf{x}}^{k+1})\|^{2}_{{\bf{L}}_{F}}
=(𝐮k𝐮~k)(𝐮k+1𝐮~k+1)𝐇^120.\displaystyle=\|({\bf{u}}^{k}-\tilde{{\bf{u}}}^{k})-({\bf{u}}^{k+1}-\tilde{{\bf{u}}}^{k+1})\|^{2}_{\widehat{{\bf{H}}}_{1}}\geq 0.

Thus, (25) holds.

By (1) and the equivalence of different norms, there exists a constant c0c_{0} such that for any 𝐮{\bf{u}}^{*}\in\mathcal{M}^{*}

𝐮k+1𝐮𝐇2𝐮k𝐮𝐇2c0𝐌(𝐮k𝐮~k)𝐇2.\displaystyle\|\mathbf{u}^{k+1}-\mathbf{u}^{*}\|^{2}_{{\bf{H}}}\leq\|\mathbf{u}^{k}-\mathbf{u}^{*}\|^{2}_{{\bf{H}}}-c_{0}\|{\bf{M}}(\mathbf{u}^{k}-\tilde{\mathbf{u}}^{k})\|^{2}_{{\bf{H}}}. (54)

Summing (54) over k=0,\cdots,K, it gives that for all {\bf{u}}^{*}\in\mathcal{M}^{*}

k=0Kc0𝐌(𝐮k𝐮~k)𝐇2𝐮0𝐮𝐇2<.\displaystyle\sum_{k=0}^{K}c_{0}\|{\bf{M}}(\mathbf{u}^{k}-\tilde{\mathbf{u}}^{k})\|^{2}_{{\bf{H}}}\leq\|{\bf{u}}^{0}-{\bf{u}}^{*}\|^{2}_{{\bf{H}}}<\infty.

It follows from (25) that the sequence {𝐌(𝐮k𝐮~k)𝐇2}\{\|{\bf{M}}({\bf{u}}^{k}-\tilde{{\bf{u}}}^{k})\|^{2}_{{\bf{H}}}\} is monotonically non-increasing. Thus, we have

𝐌(𝐮k𝐮~k)𝐇2𝐮0𝐮𝐇2c0(k+1).\|{\bf{M}}({\bf{u}}^{k}-\tilde{{\bf{u}}}^{k})\|^{2}_{{\bf{H}}}\leq\frac{\|{\bf{u}}^{0}-{\bf{u}}^{*}\|^{2}_{{\bf{H}}}}{c_{0}(k+1)}.

Since {\bf{u}}^{k+1}-{\bf{u}}^{k}=-{\bf{M}}({\bf{u}}^{k}-\tilde{{\bf{u}}}^{k}), the o(1/k) rate of \|{\bf{u}}^{k+1}-{\bf{u}}^{k}\|^{2}_{{\bf{H}}} follows from [20, Proposition 1]. By (42), the o(1/k) rate of the first-order optimality residual holds. ∎

Appendix E Proof of Theorem 4

Proof:

Since \{{\bf{u}}^{k}\} and \{\tilde{{\bf{u}}}^{k}\} converge to {\bf{u}}^{\infty} and \{\|{\bf{d}}^{k}\|\} converges to 0 as k\rightarrow\infty, it follows from (39) that \{{\bf{v}}^{k}\} and \{\tilde{{\bf{v}}}^{k}\} also converge to {\bf{u}}^{\infty} as k\rightarrow\infty. By (37), one has

𝟎G(𝐲~k)+F(𝐱k)𝐕𝜶k+Γ1(𝐲~k𝐱k),\displaystyle{\bf{0}}\in\partial G(\tilde{{\bf{y}}}^{k})+\nabla F({\bf{x}}^{k})-{\bf{V}}\bm{\alpha}^{k}+\Gamma^{-1}(\tilde{{\bf{y}}}^{k}-{\bf{x}}^{k}),
𝟎=β𝐕𝐲~k+(𝐳~k𝜶k),\displaystyle{\bf{0}}=\beta{{\bf{V}}}\tilde{{\bf{y}}}^{k}+(\tilde{{\bf{z}}}^{k}-\bm{\alpha}^{k}),

which implies that

F(𝐲~k)F(𝐱k)𝐕(𝐳~k𝜶k)Γ1(𝐲~k𝐱k)\displaystyle\nabla F(\tilde{{\bf{y}}}^{k})-\nabla F({\bf{x}}^{k})-{\bf{V}}(\tilde{{\bf{z}}}^{k}-\bm{\alpha}^{k})-\Gamma^{-1}(\tilde{{\bf{y}}}^{k}-{\bf{x}}^{k})
\in\partial G(\tilde{{\bf{y}}}^{k})+\nabla F(\tilde{{\bf{y}}}^{k})-{\bf{V}}\tilde{{\bf{z}}}^{k},

and 1β(𝐳~k𝜶k)=𝐕𝐲~k-\frac{1}{\beta}(\tilde{{\bf{z}}}^{k}-\bm{\alpha}^{k})={{\bf{V}}}\tilde{{\bf{y}}}^{k}. Therefore, it holds that

dist2(𝟎,𝒥(𝐯~k))F(𝐲~k)F(𝐱k)𝐕(𝐳~k𝜶k)\displaystyle\mathrm{dist}^{2}({\bf{0}},\mathcal{J}(\tilde{{\bf{v}}}^{k}))\leq\|\nabla F(\tilde{{\bf{y}}}^{k})-\nabla F({\bf{x}}^{k})-{\bf{V}}(\tilde{{\bf{z}}}^{k}-\bm{\alpha}^{k})
Γ1(𝐲~k𝐱k)2+1β(𝐳~k𝜶k)2\displaystyle\quad-\Gamma^{-1}(\tilde{{\bf{y}}}^{k}-{\bf{x}}^{k})\|^{2}+\|\frac{1}{\beta}(\tilde{{\bf{z}}}^{k}-\bm{\alpha}^{k})\|^{2}
(3maxi{Li2}+3σm(Γ2))𝐲~k𝐱k2\displaystyle\leq(3\max_{i}\{L^{2}_{i}\}+3\sigma_{m}(\Gamma^{2}))\|\tilde{{\bf{y}}}^{k}-{\bf{x}}^{k}\|^{2}
+(3σM(𝐕2)+1β2)𝐳~k𝜶k2\displaystyle\quad+(3\sigma_{M}({\bf{V}}^{2})+\frac{1}{\beta^{2}})\|\tilde{{\bf{z}}}^{k}-\bm{\alpha}^{k}\|^{2}
κ22𝐯~k𝐮k𝐇^12,\displaystyle\leq\kappa_{2}^{2}\|\tilde{{\bf{v}}}^{k}-{\bf{u}}^{k}\|^{2}_{\widehat{{\bf{H}}}_{1}},

where \kappa_{2}^{2}=\frac{\max\{3\sigma_{M}({\bf{V}}^{2})+\frac{1}{\beta^{2}},3\max_{i}\{L^{2}_{i}\}+3\sigma_{m}(\Gamma^{-2})\}}{\sigma_{m}(\widehat{{\bf{H}}}_{1})}. Since the KKT mapping \mathcal{J} is metrically subregular at (\mathbf{u}^{\infty},\mathbf{0}), there exist \kappa>0 and \epsilon>0 such that when \tilde{{\bf{v}}}^{k}\in\mathcal{B}_{\epsilon}(\mathbf{u}^{\infty})

\mathrm{dist}(\tilde{{\bf{v}}}^{k},\mathcal{M}^{*})=\mathrm{dist}(\tilde{{\bf{v}}}^{k},\mathcal{J}^{-1}(\mathbf{0}))
\leq\kappa\,\mathrm{dist}(\mathbf{0},\mathcal{J}(\tilde{{\bf{v}}}^{k}))\leq\kappa\kappa_{2}\|\tilde{{\bf{v}}}^{k}-{\bf{u}}^{k}\|_{\widehat{{\bf{H}}}_{1}}.

Since {𝐮k}\{{\bf{u}}^{k}\} and {𝐯~k}\{\tilde{{\bf{v}}}^{k}\} converge to 𝐮{\bf{u}}^{\infty} as kk\rightarrow\infty, there exists k¯0\bar{k}\geq 0 such that

dist(𝐯~k,)κκ2𝐯~k𝐮k𝐇^1,kk¯.\displaystyle\mathrm{dist}(\tilde{{\bf{v}}}^{k},\mathcal{M}^{*})\leq\kappa\kappa_{2}\|\tilde{{\bf{v}}}^{k}-{\bf{u}}^{k}\|_{\widehat{{\bf{H}}}_{1}},\forall k\geq\bar{k}. (55)

Since 0\prec\widehat{{\bf{H}}}_{1}\prec{\bf{H}}, one has

dist(𝐯~k,)1c1dist𝐇^1(𝐯~k,)\displaystyle\mathrm{dist}(\tilde{{\bf{v}}}^{k},\mathcal{M}^{*})\geq\frac{1}{c_{1}}\mathrm{dist}_{\widehat{{\bf{H}}}_{1}}(\tilde{{\bf{v}}}^{k},\mathcal{M}^{*})
1c1(dist𝐇^1(𝐮k,)𝐯~k𝐮k𝐇^1),\displaystyle\geq\frac{1}{c_{1}}(\mathrm{dist}_{\widehat{{\bf{H}}}_{1}}({\bf{u}}^{k},\mathcal{M}^{*})-\|\tilde{{\bf{v}}}^{k}-{\bf{u}}^{k}\|_{\widehat{{\bf{H}}}_{1}}), (56)

where c_{1}=\sigma_{M}^{1/2}({\bf{H}}). Combining (55) and (56), it gives that

dist𝐇(𝐮k,)c1c2dist𝐇^1(𝐮k,)\displaystyle\mathrm{dist}_{{\bf{H}}}({\bf{u}}^{k},\mathcal{M}^{*})\leq\frac{c_{1}}{c_{2}}\mathrm{dist}_{\widehat{{\bf{H}}}_{1}}({\bf{u}}^{k},\mathcal{M}^{*})
c1(κκ2c1+1)c2𝐯~k𝐮k𝐇^1,kk¯,\displaystyle\leq\frac{c_{1}(\kappa\kappa_{2}c_{1}+1)}{c_{2}}\|\tilde{{\bf{v}}}^{k}-{\bf{u}}^{k}\|_{\widehat{{\bf{H}}}_{1}},\forall k\geq\bar{k}, (57)

where c2=σm1/2(𝐇^1)c_{2}=\sigma_{m}^{1/2}(\widehat{{\bf{H}}}_{1}). It follows from (1) that for k0\forall k\geq 0

𝐯k+1𝐮𝐇2𝐮k𝐮𝐇2𝐯~k𝐮k𝐇^12.\displaystyle\|\mathbf{v}^{k+1}-\mathbf{u}^{\infty}\|^{2}_{{\bf{H}}}\leq\|\mathbf{u}^{k}-\mathbf{u}^{\infty}\|^{2}_{{\bf{H}}}-\|\tilde{{\bf{v}}}^{k}-{\bf{u}}^{k}\|^{2}_{\widehat{{\bf{H}}}_{1}}. (58)

Together with (57), one has

dist𝐇2(𝐮k,)dist𝐇2(𝐯k+1,)𝐯~k𝐮k𝐇^12\displaystyle\mathrm{dist}_{{\bf{H}}}^{2}({\bf{u}}^{k},\mathcal{M}^{*})-\mathrm{dist}_{{\bf{H}}}^{2}({\bf{v}}^{k+1},\mathcal{M}^{*})\geq\|\tilde{{\bf{v}}}^{k}-{\bf{u}}^{k}\|^{2}_{\widehat{{\bf{H}}}_{1}}
c22c12(κκ2c1+1)2dist𝐇2(𝐮k,),kk¯,\displaystyle\geq\frac{c_{2}^{2}}{c^{2}_{1}(\kappa\kappa_{2}c_{1}+1)^{2}}\mathrm{dist}^{2}_{{\bf{H}}}({\bf{u}}^{k},\mathcal{M}^{*}),\forall k\geq\bar{k},

which implies that

dist𝐇(𝐯k+1,)ϑ¯dist𝐇(𝐮k,),kk¯,\displaystyle\mathrm{dist}_{{\bf{H}}}({\bf{v}}^{k+1},\mathcal{M}^{*})\leq\bar{\vartheta}~{}\mathrm{dist}_{{\bf{H}}}({\bf{u}}^{k},\mathcal{M}^{*}),\forall k\geq\bar{k}, (59)

where ϑ¯=1c22c12(κκ2c1+1)2<1\bar{\vartheta}=\sqrt{1-\frac{c_{2}^{2}}{c^{2}_{1}(\kappa\kappa_{2}c_{1}+1)^{2}}}<1. By (58) and the fact 0𝐇^1𝐇0\prec\widehat{{\bf{H}}}_{1}\prec{\bf{H}}, one has

c2c1𝐯~k𝐮k𝐇2𝐯~k𝐮k𝐇^12dist𝐇2(𝐮k,),kk¯.\displaystyle\frac{c_{2}}{c_{1}}\|\tilde{{\bf{v}}}^{k}-{\bf{u}}^{k}\|^{2}_{{\bf{H}}}\leq\|\tilde{{\bf{v}}}^{k}-{\bf{u}}^{k}\|^{2}_{\widehat{{\bf{H}}}_{1}}\leq\mathrm{dist}^{2}_{{\bf{H}}}({\bf{u}}^{k},\mathcal{M}^{*}),\forall k\geq\bar{k}.

It follows that

𝐯~k𝐮k𝐇ωdist𝐇(𝐮k,),kk¯,\displaystyle\|\tilde{{\bf{v}}}^{k}-{\bf{u}}^{k}\|_{{\bf{H}}}\leq\omega~{}\mathrm{dist}_{{\bf{H}}}({\bf{u}}^{k},\mathcal{M}^{*}),\forall k\geq\bar{k}, (60)

where ω=c1/c2\omega=\sqrt{c_{1}/c_{2}}. By the triangle inequality, one has

𝐮k𝒫𝐇(𝐯k+1)𝐇dist𝐇(𝐮k,)\displaystyle\|\mathbf{u}^{k}-\mathcal{P}_{\mathcal{M}^{*}}^{{\bf{H}}}({\bf{v}}^{k+1})\|_{{\bf{H}}}\leq\mathrm{dist}_{{\bf{H}}}(\mathbf{u}^{k},\mathcal{M}^{*})
+𝒫𝐇(𝐮k)𝒫𝐇(𝐯k+1)H,kk¯.\displaystyle\quad+\|\mathcal{P}_{\mathcal{M}^{*}}^{{\bf{H}}}(\mathbf{u}^{k})-\mathcal{P}_{\mathcal{M}^{*}}^{{\bf{H}}}({\mathbf{v}}^{k+1})\|_{\mathrm{H}},\forall k\geq\bar{k}. (61)

Then, by the Lipschitz continuity of 𝒫𝐇\mathcal{P}_{\mathcal{M}^{*}}^{{\bf{H}}}, it holds that

𝒫𝐇(𝐮k)𝒫𝐇(𝐯k+1)𝐇𝐮k𝐯k+1𝐇,\displaystyle\|\mathcal{P}_{\mathcal{M}^{*}}^{{\bf{H}}}(\mathbf{u}^{k})-\mathcal{P}_{\mathcal{M}^{*}}^{{\bf{H}}}({\mathbf{v}}^{k+1})\|_{{\bf{H}}}\leq\|\mathbf{u}^{k}-{\mathbf{v}}^{k+1}\|_{{\bf{H}}},

which together with (60) and (61) implies that

𝐮k𝒫𝐇(𝐯k+1)𝐇(1+ω)dist𝐇(𝐮k,).\displaystyle\|\mathbf{u}^{k}-\mathcal{P}_{\mathcal{M}^{*}}^{{\bf{H}}}({\mathbf{v}}^{k+1})\|_{{\bf{H}}}\leq(1+\omega)\mathrm{dist}_{{\bf{H}}}(\mathbf{u}^{k},\mathcal{M}^{*}). (62)

By (26), (39), and the triangle inequality, it holds that

𝐮k+1𝒫𝐇(𝐯k+1)𝐇\displaystyle\|\mathbf{u}^{k+1}-\mathcal{P}_{\mathcal{M}^{*}}^{{\bf{H}}}({\mathbf{v}}^{k+1})\|_{{\bf{H}}}
𝐯k+1𝐮k+1𝐇+dist𝐇(𝐯k+1,)\displaystyle\leq\|{\mathbf{v}}^{k+1}-\mathbf{u}^{k+1}\|_{{\bf{H}}}+\mathrm{dist}_{{\bf{H}}}({\mathbf{v}}^{k+1},\mathcal{M}^{*})
μ¯𝐝k+dist𝐇(𝐯k+1,)\displaystyle\leq\bar{\mu}\|\mathbf{d}^{k}\|+\mathrm{dist}_{{\bf{H}}}({\mathbf{v}}^{k+1},\mathcal{M}^{*})
μϱk𝐮k𝐮k+1𝐇+dist𝐇(𝐯k+1,)\displaystyle\leq\mu\varrho_{k}\|\mathbf{u}^{k}-\mathbf{u}^{k+1}\|_{{\bf{H}}}+\mathrm{dist}_{{\bf{H}}}({\mathbf{v}}^{k+1},\mathcal{M}^{*})
μϱk(𝐮k+1𝒫𝐇(𝐯k+1)\displaystyle\leq\mu\varrho_{k}(\|\mathbf{u}^{k+1}-\mathcal{P}_{\mathcal{M}^{*}}^{{\bf{H}}}({\mathbf{v}}^{k+1})\|
+𝐮k𝒫𝐇(𝐯k+1))+dist𝐇(𝐯k+1,),\displaystyle\quad+\|\mathbf{u}^{k}-\mathcal{P}_{\mathcal{M}^{*}}^{{\bf{H}}}({\mathbf{v}}^{k+1})\|)+\mathrm{dist}_{{\bf{H}}}({\mathbf{v}}^{k+1},\mathcal{M}^{*}),

where μ=μ¯/c2\mu={\bar{\mu}}/{c_{2}}. Then, by the fact that

𝐮k+1𝒫𝐇(𝐯k+1)𝐇dist𝐇(𝐮k+1,)\|\mathbf{u}^{k+1}-\mathcal{P}_{\mathcal{M}^{*}}^{{\bf{H}}}({\mathbf{v}}^{k+1})\|_{{\bf{H}}}\geq\mathrm{dist}_{{\bf{H}}}(\mathbf{u}^{k+1},\mathcal{M}^{*})

and (62), it can be deduced that, for $k\geq\bar{k}$,

\displaystyle(1-\mu\varrho_{k})\mathrm{dist}_{{\bf{H}}}(\mathbf{u}^{k+1},\mathcal{M}^{*})\leq{\mu}\varrho_{k}(1+\omega)\mathrm{dist}_{{\bf{H}}}(\mathbf{u}^{k},\mathcal{M}^{*})+\mathrm{dist}_{{\bf{H}}}({\mathbf{v}}^{k+1},\mathcal{M}^{*}),

which, together with (59), implies that, for $k\geq\bar{k}$,

\mathrm{dist}_{{\bf{H}}}(\mathbf{u}^{k+1},\mathcal{M}^{*})\leq\underbrace{\frac{{\mu}(1+\omega)\varrho_{k}+\bar{\vartheta}}{(1-{\mu}\varrho_{k})}}_{:=\vartheta_{k}}\mathrm{dist}_{{\bf{H}}}(\mathbf{u}^{k},\mathcal{M}^{*}).
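The threshold on $\varrho_{k}$ used in the next step can be made explicit by a routine rearrangement of the definition of $\vartheta_{k}$, using only the quantities introduced above and assuming $1-\mu\varrho_{k}>0$ (which the bound imposed below guarantees, since $(1-\bar{\vartheta})/(2+\omega)<1$):

\vartheta_{k}=\frac{{\mu}(1+\omega)\varrho_{k}+\bar{\vartheta}}{1-{\mu}\varrho_{k}}<1\;\Longleftrightarrow\;{\mu}(1+\omega)\varrho_{k}+\bar{\vartheta}<1-{\mu}\varrho_{k}\;\Longleftrightarrow\;\varrho_{k}<\frac{1-\bar{\vartheta}}{{\mu}(2+\omega)}.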

It follows that if

\sup_{k\geq\bar{k}}\{\varrho_{k}\}<\frac{1-\bar{\vartheta}}{\mu(2+\omega)},

then one has $\sup_{k\geq\bar{k}}\{\vartheta_{k}\}<1$. This completes the proof. ∎

References

  • [1] A. Nedić, “Convergence rate of distributed averaging dynamics and optimization in networks,” Found. Trends Syst. Control, vol. 2, no. 1, pp. 1–100, 2015.
  • [2] A. Nedić and A. Ozdaglar, “Distributed subgradient methods for multiagent optimization,” IEEE Trans. Autom. Control, vol. 54, no. 1, pp. 48–61, Jan. 2009.
  • [3] K. Yuan, Q. Ling, and W. Yin, “On the convergence of decentralized gradient descent,” SIAM J. Optim., vol. 26, no. 3, pp. 1835–1854, 2016.
  • [4] D. Jakovetić, J. Xavier, and J. M. F. Moura, “Fast distributed gradient method,” IEEE Trans. Autom. Control, vol. 59, no. 5, pp. 1131–1146, May 2014.
  • [5] J. Zhang, K. You, and T. Başar, “Distributed discrete-time optimization in multiagent networks using only sign of relative state,” IEEE Trans. Autom. Control, vol. 64, no. 6, pp. 2352–2367, Jun. 2019.
  • [6] W. Shi, Q. Ling, K. Yuan, G. Wu, and W. Yin, “On the linear convergence of the ADMM in decentralized consensus optimization,” IEEE Trans. Signal Process., vol. 62, no. 7, pp. 1750–1761, Apr. 2014.
  • [7] Q. Ling, W. Shi, G. Wu, and A. Ribeiro, “DLM: Decentralized linearized alternating direction method of multipliers,” IEEE Trans. Signal Process., vol. 63, no. 15, pp. 4051–4064, Aug. 2015.
  • [8] W. Shi, Q. Ling, G. Wu, and W. Yin, “EXTRA: An exact first-order algorithm for decentralized consensus optimization,” SIAM J. Optim., vol. 25, no. 2, pp. 944–966, 2015.
  • [9] A. Nedić, A. Olshevsky, and W. Shi, “Achieving geometric convergence for distributed optimization over time-varying graphs,” SIAM J. Optim., vol. 27, no. 4, pp. 2597–2633, 2017.
  • [10] G. Qu and N. Li, “Harnessing smoothness to accelerate distributed optimization,” IEEE Trans. Control Netw. Syst., vol. 5, no. 3, pp. 1245–1260, Sep. 2018.
  • [11] J. Xu, S. Zhu, Y. C. Soh, and L. Xie, “Augmented distributed gradient methods for multi-agent optimization under uncoordinated constant stepsizes,” in Proc. 54th IEEE Conf. Decis. Control, 2015, pp. 2055–2060.
  • [12] A. Nedić, A. Olshevsky, W. Shi, and C. A. Uribe, “Geometrically convergent distributed optimization with uncoordinated step-sizes,” in Proc. Amer. Control Conf., 2017, pp. 3950–3955.
  • [13] K. Yuan, B. Ying, X. Zhao, and A. H. Sayed, “Exact diffusion for distributed optimization and learning—Part I: Algorithm development,” IEEE Trans. Signal Process., vol. 67, no. 3, pp. 708–723, Feb. 2019.
  • [14] S. A. Alghunaim, E. Ryu, K. Yuan, and A. H. Sayed, “Decentralized proximal gradient algorithms with linear convergence rates,” IEEE Trans. Autom. Control, vol. 66, no. 6, pp. 2787–2794, Jun. 2021.
  • [15] J. Xu, Y. Tian, Y. Sun and G. Scutari, “Distributed algorithms for composite optimization: Unified framework and convergence analysis,” IEEE Trans. Signal Process., vol. 69, pp. 3555–3570, Jun. 2021.
  • [16] P. D. Lorenzo and G. Scutari, “NEXT: In-network nonconvex optimization,” IEEE Trans. Signal Inf. Process. Netw., vol. 2, no. 2, pp. 120–136, Jun. 2016.
  • [17] Y. Sun, G. Scutari, and A. Daneshmand, “Distributed optimization based on gradient tracking revisited: Enhancing convergence rate via surrogation,” SIAM J. Optim., vol. 32, no. 2, pp. 345–385, 2022.
  • [18] T.-H. Chang, M. Hong, and X. Wang, “Multi-agent distributed optimization via inexact consensus ADMM,” IEEE Trans. Signal Process., vol. 63, no. 2, pp. 482–497, Jan. 2015.
  • [19] N. S. Aybat, Z. Wang, T. Lin, and S. Ma, “Distributed linearized alternating direction method of multipliers for composite convex consensus optimization,” IEEE Trans. Autom. Control, vol. 63, no. 1, pp. 5–20, Jan. 2018.
  • [20] W. Shi, Q. Ling, G. Wu, and W. Yin, “A proximal gradient algorithm for decentralized composite optimization,” IEEE Trans. Signal Process., vol. 63, no. 22, pp. 6013–6023, Nov. 2015.
  • [21] Z. Li, W. Shi, and M. Yan, “A decentralized proximal-gradient method with network independent step-sizes and separated convergence rates,” IEEE Trans. Signal Process., vol. 67, no. 17, pp. 4494–4506, Sep. 2019.
  • [22] J. Zhang, H. Liu, A. M.-C. So and Q. Ling, “A penalty alternating direction method of multipliers for convex composite optimization over decentralized networks,” IEEE Trans. Signal Process., vol. 69, pp. 4282–4295, Jun. 2021.
  • [23] R. J. Tibshirani and J. Taylor, “The solution path of the generalized lasso,” Ann. Statist., vol. 39, no. 3, pp. 1335–1371, Jun. 2011.
  • [24] R. Tibshirani, M. Saunders, S. Rosset, J. Zhu, and K. Knight, “Sparsity and smoothness via the fused lasso,” J. Roy. Stat. Soc. Ser. B, Stat. Methodol., vol. 67, no. 1, pp. 91–108, Feb. 2005.
  • [25] H. D. Bondell and B. J. Reich, “Simultaneous regression shrinkage, variable selection, and supervised clustering of predictors with OSCAR,” Biometrics, vol. 64, no. 1, pp. 115–123, Mar. 2008.
  • [26] L. Jacob, G. Obozinski, and J.-P. Vert, “Group lasso with overlap and graph lasso,” in Proc. 26th Int. Conf. Mach. Learn., 2009, pp. 433–440.
  • [27] L. Chen, X. Li, D. Sun, and K.-C. Toh, “On the equivalence of inexact proximal ALM and ADMM for a class of convex composite programming,” Math. Program., vol. 185, pp. 111–161, 2021.
  • [28] F. Jiang, X. Cai, Z. Wu, and D. Han, “Approximate first-order primal-dual algorithms for saddle point problems,” Math. Comp., vol. 90, pp. 1227–1262, 2021.
  • [29] J. Eckstein and W. Yao, “Approximate ADMM algorithms derived from Lagrangian splitting,” Comput. Optim. Appl., vol. 68, no. 2, pp. 363–405, 2017.
  • [30] Y. Arjevani, J. Bruna, B. Can, M. Gurbuzbalaban, S. Jegelka, and H. Lin, “IDEAL: Inexact decentralized accelerated augmented Lagrangian method,” in Proc. Adv. Neural Inf. Process. Syst., 2020, pp. 20648–20659.
  • [31] M. Hong, “Decomposing linearly constrained nonconvex problems by a proximal primal dual approach: Algorithms, convergence, and applications,” 2016, arXiv:1604.00543.
  • [32] A. Beck, First-Order Methods in Optimization, vol. 25. SIAM, Philadelphia, 2017.
  • [33] L. Condat, D. Kitahara, A. Contreras, and A. Hirabayashi, “Proximal splitting algorithms for convex optimization: A tour of recent advances, with new twists,” 2021, arXiv:1912.00137v7.
  • [34] Y. Xu, “Accelerated first-order primal–dual proximal methods for linearly constrained composite convex programming,” SIAM J. Optim., vol. 27, no. 3, pp. 1459–1484, 2017.
  • [35] E. K. Ryu and W. Yin, Large-Scale Convex Optimization: Algorithms & Analyses via Monotone Operators, Cambridge University Press, 2022.
  • [36] L. Condat, “A primal-dual splitting method for convex optimization involving Lipschitzian, proximable and linear composite terms,” J. Optim. Theory Appl., vol. 158, no. 2, pp. 460–479, 2013.
  • [37] B. C. Vũ, “A splitting algorithm for dual monotone inclusions involving cocoercive operators,” Adv. Comput. Math., vol. 38, no. 3, pp. 667–681, 2013.
  • [38] M. Yan, “A new primal-dual algorithm for minimizing the sum of three functions with a linear operator,” J. Sci. Comput., vol. 76, pp. 1698–1717, 2018.
  • [39] P. Latafat and P. Patrinos, “Asymmetric forward-backward-adjoint splitting for solving monotone inclusions involving three operators,” Comput. Optim. Appl., vol. 68, no. 1, pp. 57–93, 2017.
  • [40] B. S. He and X. M. Yuan, “On construction of splitting contraction algorithms in a prediction-correction framework for separable convex optimization,” 2022, arXiv:2204.11522.
  • [41] R. T. Rockafellar and R. J.-B. Wets, Variational Analysis, Springer-Verlag, Berlin, Heidelberg, 2004.
  • [42] X. Yuan, S. Zeng, and J. Zhang, “Discerning the linear convergence of ADMM for structured convex optimization through the lens of variational analysis,” J. Mach. Learn. Res., vol. 21, pp. 1–75, 2020.
  • [43] J. J. Ye, X. Yuan, S. Zeng, and J. Zhang, “Variational analysis perspective on linear convergence of some first order methods for nonsmooth convex optimization problems,” Set-Valued Var. Anal., vol. 29, pp. 803–837, 2021.
  • [44] Z.-Q. Luo and P. Tseng, “On the linear convergence of descent methods for convex essentially smooth minimization,” SIAM J. Control Optim., vol. 30, pp. 408–425, 1992.
  • [45] S. M. Robinson, “Some continuity properties of polyhedral multifunctions,” Math. Program. Study, vol. 14, pp. 206–214, 1981.
  • [46] H. Zou and T. Hastie, “Regularization and variable selection via the elastic net,” J. Roy. Stat. Soc. Ser. B, Stat. Methodol., vol. 67, no. 2, pp. 301–320, 2005.
  • [47] F. Jiang, Z. Wu, X. Cai, and H. Zhao, “Unified linear convergence of first-order primal-dual algorithms for saddle point problems,” Optim. Lett., vol. 16, pp. 1675–1700, 2022.
Luyao Guo received the B.S. degree in information and computing science from Shanxi University, Taiyuan, China, in 2020. He is currently pursuing the Ph.D. degree in applied mathematics with the Jiangsu Provincial Key Laboratory of Networked Collective Intelligence, School of Mathematics, Southeast University, Nanjing, China. His current research focuses on distributed optimization and learning.
Xinli Shi (Senior Member, IEEE) received the B.S. degree in software engineering, the M.S. degree in applied mathematics, and the Ph.D. degree in control science and engineering from Southeast University, Nanjing, China, in 2013, 2016, and 2019, respectively. He held a China Scholarship Council Studentship for a one-year study at the Royal Melbourne Institute of Technology (RMIT University), Melbourne, VIC, Australia, in 2018. He is currently an Associate Professor with the School of Cyber Science and Engineering, Southeast University. His current research interests include distributed optimization, nonsmooth analysis, and network control systems. Dr. Shi was the recipient of the Outstanding Ph.D. Degree Thesis Award from Jiangsu Province, China.
Jinde Cao (Fellow, IEEE) received the B.S. degree in mathematics from Anhui Normal University, Wuhu, China, in 1986, the M.S. degree in mathematics/applied mathematics from Yunnan University, Kunming, China, in 1989, and the Ph.D. degree in mathematics/applied mathematics from Sichuan University, Chengdu, China, in 1998. He is an Endowed Chair Professor, the Dean of the School of Mathematics, and the Director of the Jiangsu Provincial Key Laboratory of Networked Collective Intelligence of China and the Research Center for Complex Systems and Network Sciences with Southeast University, Nanjing, China. Prof. Cao was a recipient of the National Innovation Award of China, the Gold Medal of the Russian Academy of Natural Sciences, the Obada Prize, and the Highly Cited Researcher Award in Engineering, Computer Science, and Mathematics by Thomson Reuters/Clarivate Analytics. He has been elected as a foreign member of the Russian Academy of Sciences, a member of the Academy of Europe, a foreign member of the Russian Academy of Engineering, a member of the European Academy of Sciences and Arts, a foreign member of the Russian Academy of Natural Sciences, a foreign fellow of the Pakistan Academy of Sciences, a fellow of the African Academy of Sciences, a foreign member of the Lithuanian Academy of Sciences, and an IASCYS academician.
Zihao Wang received the B.E. degree in network engineering from Nanjing Xiaozhuang University, Nanjing, China, in 2020. He is currently pursuing the M.E. degree in cyberspace security with the School of Cyber Science and Engineering, Southeast University, Nanjing. His current research focuses on privacy-preserving distributed optimization of networked multi-agent systems.