
Distributed stochastic proximal algorithm with random reshuffling for non-smooth finite-sum optimization

Xia Jiang, Xianlin Zeng, Jian Sun, Jie Chen, and Lihua Xie. This work was supported in part by the National Key R&D Program of China under Grant 2021YFB1714800, the National Natural Science Foundation of China under Grants 61925303, 62088101, 62073035, 62173034, the Natural Science Foundation of Chongqing under Grant 2021ZX4100027 and China Scholarship Council 202006030072. (Corresponding author: Jian Sun.) X. Jiang ([email protected]) and J. Sun ([email protected]) are with the Key Laboratory of Intelligent Control and Decision of Complex Systems, School of Automation, Beijing Institute of Technology, Beijing, 100081, China, and also with the Beijing Institute of Technology Chongqing Innovation Center, Chongqing 401120, China. X. Zeng ([email protected]) is with the Key Laboratory of Intelligent Control and Decision of Complex Systems, School of Automation, Beijing Institute of Technology, Beijing, 100081, China. J. Chen ([email protected]) is with the School of Electronic and Information Engineering, Tongji University, Shanghai, 200082, China, and also with the Beijing Advanced Innovation Center for Intelligent Robots and Systems (Beijing Institute of Technology), Key Laboratory of Biomimetic Robots and Systems (Beijing Institute of Technology), Ministry of Education, Beijing, 100081, China. L. Xie ([email protected]) is with the School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore 639798.
Abstract

Non-smooth finite-sum minimization is a fundamental problem in machine learning. This paper develops a distributed stochastic proximal-gradient algorithm with random reshuffling to solve finite-sum minimization over time-varying multi-agent networks. The objective function is a sum of differentiable convex functions and a non-smooth regularization. Each agent in the network updates local variables by local information exchange and cooperates to seek an optimal solution. We prove that the local variable estimates generated by the proposed algorithm achieve consensus and are attracted to a neighborhood of the optimal solution with an $\mathcal{O}(\frac{1}{T}+\frac{1}{\sqrt{T}})$ convergence rate, where $T$ is the total number of iterations. Finally, some comparative simulations are provided to verify the convergence performance of the proposed algorithm.

Index Terms:
distributed optimization, proximal operator, random reshuffling, stochastic algorithm, time-varying graphs

I Introduction

Distributed non-smooth finite-sum minimization is a basic problem in popular supervised machine learning [1, 2, 3]. In machine learning communities, training a model by merely minimizing the average loss over a finite data set usually leads to over-fitting or poor generalization. Non-smooth regularizations are therefore often included in the cost function to encode prior knowledge, which also introduces the challenge of non-smoothness. Moreover, in many practical network systems [4, 5], training data are naturally stored at different physical nodes and it is expensive to collect the data and train the model on one centralized node. Compared to the centralized setting, the distributed setting makes use of multiple computational resources to train the learning model in parallel, leading to potential speedup. However, since the data are distributed and communication is limited, more involved approaches are needed to solve the minimization problem. Therefore, distributed finite-sum minimization has attracted much attention in machine learning, especially for large-scale applications with big data.

For large-scale non-smooth finite-sum optimization problems, there have been many efficient centralized and distributed first-order algorithms, including subgradient-based methods [6, 7, 8] and proximal gradient algorithms [9, 10, 11, 12, 13, 14]. Subgradient-based methods are very generic in optimization at the expense of slow convergence. To be specific, subgradient-based algorithms may increase the objective function of the optimization problem even for small step-sizes [10]. Proximal gradient algorithms are considered more stable than subgradient-based methods and often exhibit better numerical performance for convex optimization [15]. Hence, proximal gradient algorithms have attracted great interest for large-scale non-smooth optimization problems. In particular, when training data are allocated to different computing nodes, the non-smooth finite-sum optimization needs to be handled efficiently in a distributed manner. Many distributed deterministic proximal gradient algorithms have been proposed with guaranteed convergence for non-smooth optimization over time-invariant and time-varying graphs [12, 13, 14]. These distributed works are developed under the primal-dual framework and are applicable to constrained non-smooth optimization. However, they focus on deterministic non-smooth optimization and require computing the full gradients of all local functions at each iteration. The per-iteration computational cost of a deterministic algorithm is much higher than that of a stochastic gradient method, which hinders the application of deterministic algorithms to non-smooth optimization with large-scale data.

In large-scale machine learning settings, various first-order stochastic methods are the leading algorithms due to their scalability and low computational requirements. Decentralized stochastic gradient descent algorithms have gained a lot of attention recently, especially for the traditional federated learning setting with a star-shaped network topology [16, 17, 18]. None of these methods is fully distributed in the sense that they all require a central parameter server. The potential bottleneck of the star-shaped network is a communication traffic jam on the central server, and the performance degrades significantly when the network bandwidth is low. To consider a more general distributed network topology without a central server, many distributed stochastic gradient algorithms have been studied [19, 20, 21, 22, 23] for convex finite-sum optimization. To avoid the degraded performance of algorithms with diminishing step-sizes, some recent works [22] and [23] have proposed consensus-based distributed stochastic gradient descent methods with non-diminishing (constant) step-sizes for smooth convex optimization over time-invariant networks. With the variance reduction technique to handle the variance of the local stochastic gradients at each node, some efficient distributed stochastic algorithms with constant step-sizes have been studied for strongly convex smooth optimization over time-invariant undirected networks [24] and time-invariant directed networks [25]. However, most of the existing distributed stochastic algorithms are designed for smooth optimization and are only applicable over time-invariant networks.

In practice, a communication network may be time-varying, since network attacks may cause communication failures and the connectivity of a network may change, especially for mobile networks [26, 4]. Therefore, it is necessary to study distributed algorithms over time-varying networks. Considering unreliable network communication, some works have studied distributed asynchronous stochastic algorithms with diminishing step-sizes for smooth convex optimization over multi-agent networks [20, 21]. To avoid the degraded performance of algorithms with diminishing step-sizes, [7] developed a distributed gradient algorithm with a multi-step consensus mapping for multi-agent optimization. This multi-step consensus mapping has also been widely applied in follow-up works [27, 28] for multi-agent consensus and non-smooth optimization over time-varying networks. Unlike these deterministic works, this paper studies a distributed stochastic gradient algorithm with a sample-without-replacement heuristic for non-smooth finite-sum optimization over time-varying multi-agent networks.

Random reshuffling (RR) is a simple, popular but elusive stochastic method for finite-sum minimization in machine learning. In contrast to stochastic gradient descent (SGD), where the training data are sampled uniformly with replacement, the sampling-without-replacement RR method learns from each data point in each epoch and often performs better in many practical problems [29, 30, 31]. The convergence properties of SGD are well understood, with tight lower and upper bounds in many settings [32, 33]; however, the theoretical analysis of RR had not been studied until recent years. Sampling without replacement in RR introduces a significant complication: the gradients are now biased, which implies that a single iteration does not approximate a full gradient descent step. The incremental gradient algorithm (IG) is a special case of the shuffling algorithm. IG generates a deterministic permutation before the start of the epochs and reuses this permutation in all subsequent epochs. In contrast, RR generates a new permutation at the beginning of each epoch. The implementation of IG faces the challenge of choosing a suitable permutation for cycling through the iterations [34], and IG is susceptible to bad orderings compared to RR [35]. Therefore, it is necessary to study a random reshuffling algorithm for convex non-smooth optimization. For strongly convex optimization, the seminal work [30] theoretically characterized various convergence rates for the random reshuffling method. One inspiring recent work by Richtarik [36] has provided some involved yet insightful proofs for the convergence behavior of RR under weak assumptions. For smooth finite-sum minimization, [37] has studied decentralized stochastic gradient with shuffling and provided insights for practical data processing. Moreover, the recent work [38] has studied centralized proximal gradient descent algorithms with random reshuffling (Prox-RR) for non-smooth finite-sum optimization and extended Prox-RR to federated learning. However, these works are not applicable to fully distributed settings, and Prox-RR needs the strong convexity assumption, which hinders its application.

Inspired by the existing deterministic and stochastic algorithms, we develop a distributed stochastic proximal algorithm with random reshuffling (DPG-RR) for non-smooth finite-sum optimization over time-varying multi-agent networks, extending the random reshuffling to distributed non-smooth convex optimization.

The contributions of this paper are summarized as follows.

  • For non-smooth convex optimization, we propose a distributed stochastic algorithm, DPG-RR, over time-varying networks. Although there are a large number of works on proximal SGD [39, 40], few works have studied how to extend random reshuffling to solve convex optimization with non-smooth regularization. This paper extends the recent centralized proximal algorithm Prox-RR in [38] for strongly convex non-smooth optimization to general convex optimization and distributed settings. Another closely related decentralized work is [37], which has provided important insights into distributed stochastic gradient descent with data shuffling under master-slave frameworks. This paper extends [37] to fully distributed settings and makes it applicable to non-smooth optimization. To the best of our knowledge, this is the first attempt to study fully distributed stochastic proximal algorithms with random reshuffling over time-varying networks.

  • The proposed algorithm DPG-RR needs fewer proximal evaluations than proximal SGD and is applicable to time-varying networks. To be specific, for non-smooth optimization with $n$ local functions, the standard proximal SGD applies the proximal operator after each stochastic gradient step [40] and needs $n$ proximal evaluations per epoch. In contrast, the proposed DPG-RR performs only one single proximal evaluation at each iteration. In addition, with the design of a multi-step consensus mapping, the proposed algorithm brings local variable estimates closer to each other, which is crucial for distributed non-smooth optimization over time-varying networks. Compared with the existing incremental gradient algorithm in [41], the proposed random reshuffling algorithm requires a weaker assumption, since the cost function does not need to be strongly convex.

  • This work provides a complete and rigorous theoretical convergence analysis for the proposed DPG-RR over time-varying multi-agent networks. Due to the sampling-without-replacement mechanism of random reshuffling, the local stochastic gradients are biased estimators of the full gradient. In addition, because information is scattered in the distributed setting, there are differences between the local and global objective functions, and thus between the local and global gradients. To handle these differences, this paper proves the summability of the error sequences of the estimated gradients. Then, we prove the convergence of the transformed stochastic algorithm by some novel proof techniques inspired by [36].

The remainder of the paper is organized as follows. Preliminary mathematical notations and the proximal operator are introduced in Section II. The problem description and the design of a distributed solver are provided in Section III. The convergence performance of the proposed algorithm is analyzed in Section IV. Numerical simulations are provided in Section V and the conclusion is given in Section VI.

II Notations and Preliminaries

II-A Notations

We denote by $\mathbb{R}$ the set of real numbers, $\mathbb{N}$ the set of natural numbers, $\mathbb{R}^{n}$ the set of $n$-dimensional real column vectors and $\mathbb{R}^{n\times m}$ the set of $n$-by-$m$ real matrices, respectively. We denote by $v^{\prime}$ the transpose of a vector $v$. In addition, $\|\cdot\|$ denotes the Euclidean norm and $\langle\cdot,\cdot\rangle$ the inner product, defined by $\langle a,b\rangle=a^{\prime}b$. The vectors in this paper are column vectors unless otherwise stated. The notation $[m]$ denotes the set $\{1,\cdots,m\}$. For a differentiable function $f:\mathbb{R}^{n}\to\mathbb{R}$, $\nabla f(x)$ denotes the gradient of $f$ with respect to $x$. For a convex non-smooth function $h:\mathbb{R}^{n}\to\mathbb{R}$, $\partial h(x)$ denotes the subdifferential of $h$ at $x$. The $\varepsilon$-subdifferential of a non-smooth convex function $h$ at $x$, denoted by $\partial_{\varepsilon}h(x)$, is the set of vectors $y$ such that for all $q$,

$h(q)-h(x)\geq y^{\prime}(q-x)-\varepsilon.$ (1)

II-B Proximal Operator

For a proper non-differentiable convex function $h:\mathbb{R}^{n}\to(-\infty,\infty]$ and a scalar $\alpha>0$, the proximal operator is defined as

${\rm prox}_{\alpha,h}(x)={\rm argmin}_{z\in\mathbb{R}^{n}}\big\{h(z)+\frac{1}{2\alpha}\|z-x\|^{2}\big\}.$ (2)

The minimum is attained at a unique point $y={\rm prox}_{\alpha,h}(x)$, which means the proximal operator is a single-valued map. In addition, it follows from the optimality condition for convex optimization that

$0\in\partial h(y)+\frac{1}{\alpha}(y-x).$ (3)

The following proposition presents some properties of the proximal operator.

Proposition II.1.

[42] Let $h:\mathbb{R}^{n}\to(-\infty,\infty]$ be a closed proper convex function. For a scalar $\alpha>0$ and $x\in\mathbb{R}^{n}$, let $y={\rm prox}_{\alpha,h}(x)$. For $x,\hat{x}\in\mathbb{R}^{n}$, $\|{\rm prox}_{\alpha,h}(x)-{\rm prox}_{\alpha,h}(\hat{x})\|\leq\|x-\hat{x}\|$, which is also called the nonexpansiveness of the proximal operator.

When there exists an error $\varepsilon$ in the computation of the proximal operator, we denote the inexact proximal operator by ${\rm prox}_{\alpha,h}^{\varepsilon}(\cdot)$, which is defined as

${\rm prox}_{\alpha,h}^{\varepsilon}(x)\triangleq\Big\{\tilde{x}\,\Big|\,\frac{1}{2\alpha}\|\tilde{x}-x\|^{2}+h(\tilde{x})\leq\varepsilon+\min_{z\in\mathbb{R}^{d}}\big\{\frac{1}{2\alpha}\|z-x\|^{2}+h(z)\big\}\Big\}.$ (4)
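As a concrete instance, for the $\ell_1$ regularization $h(z)=\lambda\|z\|_{1}$ used later in Section V, the minimization in (2) has the closed-form soft-thresholding solution. The following Python sketch (function names are ours, for illustration only) evaluates it and checks the nonexpansiveness property of Proposition II.1.

```python
import numpy as np

def prox_l1(x, alpha, lam):
    """Proximal operator of h(z) = lam * ||z||_1 with parameter alpha.

    Solves argmin_z { lam*||z||_1 + (1/(2*alpha)) * ||z - x||^2 },
    whose closed form is coordinate-wise soft thresholding.
    """
    threshold = alpha * lam
    return np.sign(x) * np.maximum(np.abs(x) - threshold, 0.0)

# Nonexpansiveness (Proposition II.1): ||prox(x) - prox(y)|| <= ||x - y||.
rng = np.random.default_rng(0)
x, y = rng.normal(size=5), rng.normal(size=5)
assert (np.linalg.norm(prox_l1(x, 0.1, 0.5) - prox_l1(y, 0.1, 0.5))
        <= np.linalg.norm(x - y))
```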

III Problem Description And Algorithm Design

In this paper, we aim to solve the following non-smooth finite-sum optimization problem over a multi-agent network,

$\min_{x\in\mathbb{R}^{d}}F(x),\quad F(x)=f(x)+\phi(x)=\frac{1}{m}\sum_{j=1}^{m}\sum_{i=1}^{n}f_{j,i}(x)+\phi(x),$ (5)

where $x\in\mathbb{R}^{d}$ is the unknown decision variable, $m$ is the number of agents in the network and $n$ is the number of local samples. The global sample size of the multi-agent network is $N=mn$. The cost function $f(x)$ in (5) denotes the global cost of the multi-agent network and is differentiable. The regularization $\phi$ is non-smooth and encodes some prior knowledge. In the multi-agent network, each agent $j$ only knows its $n$ local samples and the corresponding cost functions $f_{j,i}(\cdot)$ to update the local decision variable. In addition, the non-smooth regularization $\phi$ is available to all agents. In this setting, each agent uses local cost functions and communications with its neighbors to obtain the same optimal solution $x_{*}$ of problem (5).

In real-world applications of distributed settings, it is difficult to keep all communication links between agents connected and stable due to the existence of network attacks and limited bandwidth. We consider a general time-varying multi-agent communication network $\mathcal{G}(t)=([m],\mathcal{E}(t),A(t))$, where $\mathcal{E}(t)$ denotes the set of communication edges at time $t$. At each time $t$, agent $j$ can communicate with agent $i$ if and only if $(j,i)\in\mathcal{E}(t)$. In addition, the neighbors of agent $j$ at time $t$ are the agents that agent $j$ can communicate with at time $t$. The adjacency matrix $A(t)$ of the network is a square $m\times m$ matrix, whose elements indicate the weights of the edges in $\mathcal{E}(t)$.

For the finite-sum optimization (5), we make the following assumption.

Assumption III.1.
  • (a)

    Functions $f$ and $\phi$ are convex.

  • (b)

    Each local function $f_{j,i}$ has a Lipschitz-continuous gradient with constant $L>0$, i.e., for every $x,y\in\mathbb{R}^{d}$,

    $\|\nabla f_{j,i}(x)-\nabla f_{j,i}(y)\|\leq L\|x-y\|,$ (6)

    while the regularization function $\phi$ is non-smooth.

  • (c)

    There exists a scalar $G_{\phi}$ such that for each subgradient $z\in\partial\phi(x)$, $\|z\|<G_{\phi}$, for every $x\in\mathbb{R}^{d}$.

  • (d)

    There exists a scalar $G_{f}$ such that for each agent $j$ and every $x\in\mathbb{R}^{d}$, $\|\nabla f_{j,i}(x)\|<G_{f}$.

  • (e)

    Further, $F$ is lower bounded by some $F_{*}\in\mathbb{R}$ and the optimization problem has at least one optimal solution $x_{*}$.

Assumption III.1 is common in stochastic algorithm research [37, 43] and finite-sum minimization problems satisfying Assumption III.1 arise in many applications. Here, we provide a real-world example.

Example 1: In the binary classification problem [44], the logistic regression optimization aims to obtain a predictor $x$ to estimate the categorical variables of the testing data. When the large-scale training samples are allocated to $m$ different agents, the local cost function in (5) is a cross-entropy error function, $f_{j,i}(x)=\ln(1+\exp(-l_{j,i}\langle a_{j,i},x\rangle))$, where $a_{j,i}$ and $l_{j,i}$ denote the feature vector and categorical value of the $i$th training sample at agent $j$. Then, the gradient of the convex function $f_{j,i}$ is bounded and satisfies the Lipschitz condition in (6).
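For this example the quantities in Assumption III.1 are in fact explicit: the gradient norm is bounded by $\|a_{j,i}\|$ (so (d) holds with $G_{f}=\max_{j,i}\|a_{j,i}\|$), and $L=\max_{j,i}\|a_{j,i}\|^{2}/4$ is a Lipschitz constant for (b). A small sketch (our helper, numerically stable via log-sum-exp; not part of the paper):

```python
import numpy as np

def logistic_grad(x, a, l):
    """Gradient of f(x) = ln(1 + exp(-l * <a, x>)) for one sample (a, l)."""
    z = -l * a.dot(x)
    s = np.exp(z - np.logaddexp(0.0, z))  # sigmoid(z), computed stably
    return -l * s * a  # norm at most ||a||, since 0 < s < 1

# A Lipschitz constant of the gradient is ||a||^2 / 4 (since l^2 = 1).
a = np.array([0.5, -1.2, 2.0])
L = a.dot(a) / 4.0
```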

For the communication topology, the time-varying multi-agent network of the finite-sum optimization (5) satisfies the following assumption.

Assumption III.2.

Consider the undirected time-varying network $\mathcal{G}(t)$ with adjacency matrices $A(t)=[a_{ij}(t)]$, $t=1,2,\cdots$

  • (a)

    For each $t$, the adjacency matrix $A(t)$ is doubly stochastic.

  • (b)

    There exists a scalar $\eta\in(0,1)$ such that $a_{jj}(t)\geq\eta$ for all $j\in[m]$. In addition, $a_{ij}(t)\geq\eta$ if $\{i,j\}\in\mathcal{E}(t)$, and $a_{ij}(t)=0$ otherwise.

  • (c)

    The time-varying graph sequence $\mathcal{G}(t)$ is uniformly connected. That is, there exists an integer $B\geq 1$ such that each agent $j$ sends its information to all other agents at least once every $B$ consecutive time slots.

Remark III.1.

In Assumption III.2, part (a) is common for undirected graphs and balanced directed graphs. Part (b) means that each agent gives non-negligible weights to its local estimate and the estimates received from neighbors. Part (c) implies that the time-varying network is able to transmit information between any pair of agents in bounded time. Assumption III.2 is widely adopted in the existing literature [7, 28].
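One common way to realize parts (a) and (b) in practice (our construction, not prescribed by the paper) is to build Metropolis weights from each graph snapshot; the sketch below produces a symmetric doubly stochastic $A(t)$ with a positive diagonal for any undirected edge set without duplicate edges.

```python
import numpy as np

def metropolis_weights(edges, m):
    """Doubly stochastic mixing matrix for an undirected graph on m nodes.

    edges: iterable of undirected pairs (i, j) with i != j, no duplicates.
    Uses Metropolis-Hastings weights a_ij = 1 / (1 + max(deg_i, deg_j)).
    """
    deg = np.zeros(m, dtype=int)
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    A = np.zeros((m, m))
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0 / (1 + max(deg[i], deg[j]))
    # Rows sum to 1; by symmetry columns do too, so A is doubly stochastic.
    np.fill_diagonal(A, 1.0 - A.sum(axis=1))
    return A
```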

Next, we design a distributed proximal gradient algorithm with random reshuffling for each agent $j\in[m]$ in the multi-agent network. At the beginning of each iteration $t$, we sample indices $\pi_{t}^{0},\cdots,\pi_{t}^{n-1}$ without replacement from $\{1,\cdots,n\}$ such that $\pi_{t}=\{\pi_{t}^{0},\cdots,\pi_{t}^{n-1}\}$ is a random permutation of $\{1,\cdots,n\}$. Then, we proceed with $n$ inner iterations of the form

$x_{j,t}^{i+1}=x_{j,t}^{i}-\gamma\nabla f_{j,\pi_{t}^{i}}(x_{j,t}^{i}),\quad i\in\{0,\cdots,n-1\},$ (7)

where $\gamma$ is a constant step-size, the superscript $i$ denotes the $i$th inner iteration, and the subscripts $j$ and $t$ denote the $j$th agent and the $t$th outer iteration, respectively.

Then, to achieve consensus between different agents, a multi-step consensus mapping is applied as

$v_{j,t}=\sum_{l=1}^{m}\lambda_{jl,t}x_{l,t}^{n},$ (8)

where $\lambda_{jl,t}$ is the $(j,l)$th element of the matrix $\Phi(\mathbb{T}(t)+t,\mathbb{T}(t))$ for $j,l\in\{1,\cdots,m\}$. The notation $\mathbb{T}(t)$ denotes the total number of communication steps before iteration $t$, and $\Phi$ is a transition matrix, defined as

$\Phi(t,s)=A(t)A(t-1)\cdots A(s+1)A(s),\quad t>s\geq 0,$ (9)

where $A(t)=[a_{ij}(t)]_{i,j=1,\cdots,m}$ is the adjacency matrix of the multi-agent network at iteration $t$. To be specific, using the vector notations $v_{t}=[v_{j,t}]_{j=1,\cdots,m}$ and $x_{t}^{n}=[x_{j,t}^{n}]_{j=1,\cdots,m}$, we rewrite (8) as

$v_{t}=\Phi(\mathbb{T}(t)+t,\mathbb{T}(t))x_{t}^{n}=A(\mathbb{T}(t)+t)A(\mathbb{T}(t)+t-1)\cdots A(\mathbb{T}(t))x_{t}^{n}.$

Agents perform $t$ communication steps at iteration $t$. At each communication step, agents exchange their estimates $x_{j,t}^{n}$ and linearly combine the received estimates using the weights in $A(t)$. The mapping (8) is referred to as a multi-step consensus mapping because the linear combinations of estimates bring the estimates of different agents close to each other.
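In code, this mapping is simply a sequence of mixing rounds applied to the stacked estimates; a minimal sketch (our function name, assuming each row of `X` stores one agent's estimate $x_{j,t}^{n}$ as a numpy array):

```python
def multi_step_consensus(X, mixing_matrices):
    """Apply v_t = A(T+t) ... A(T) x_t^n, one mixing round per matrix.

    X: (m, d) array, row j holds agent j's estimate x_{j,t}^n.
    mixing_matrices: list of (m, m) doubly stochastic matrices, oldest first.
    """
    for A in mixing_matrices:
        X = A @ X  # one communication round: weighted neighbor averaging
    return X
```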

Finally, for the non-smooth regularization function, we use the proximal operator

$x_{j,t+1}={\rm prox}_{\gamma,\phi}(v_{j,t})$ (10)

to obtain the local variable estimate $x_{j,t+1}$ of agent $j$ at the next iteration.

The proposed distributed proximal stochastic gradient algorithm with random reshuffling (DPG-RR) is formally summarized in Algorithm 1. (We suppose the local shuffling is sufficient, i.e., the permutation after shuffling is uniformly distributed.)

Algorithm 1 DPG-RR for agent $j\in[m]$
1: (S.1) Initialization:
2: Step-size $\gamma>0$; initial vector $x_{j,0}\in\mathbb{R}^{d}$ for $j\in[m]$; number of epochs $T$.
3: (S.2) Iterations:
4: for $t=0,1,\cdots,T-1$ do
5:     Generate a random permutation $\pi_{t}=(\pi_{t}^{0},\cdots,\pi_{t}^{n-1})$
6:     $x_{j,t}^{0}=x_{j,t}$
7:     for $i=0,1,\cdots,n-1$ do
8:         $x_{j,t}^{i+1}=x_{j,t}^{i}-\gamma\nabla f_{j,\pi_{t}^{i}}(x_{j,t}^{i})$
9:     end for
10:     $v_{j,t}=\sum_{l=1}^{m}\lambda_{jl,t}x_{l,t}^{n}$, $\ x_{j,t+1}={\rm prox}_{\gamma,\phi}(v_{j,t})$
11: end for
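To make the procedure concrete, a minimal single-epoch simulation of Algorithm 1 is sketched below. It reuses `prox_l1` and `multi_step_consensus` from the earlier sketches and assumes a hypothetical oracle `grad(j, i, x)` returning $\nabla f_{j,i}(x)$; none of these names come from the paper, and this is an illustration rather than a reference implementation.

```python
import numpy as np

def dpg_rr_epoch(X, grad, mixing_matrices, gamma, lam, n):
    """One outer iteration t of DPG-RR for all m agents (a sketch).

    X: (m, d) array whose row j holds agent j's estimate x_{j,t}.
    grad(j, i, x): oracle returning the gradient of f_{j,i} at x.
    mixing_matrices: snapshots A(T(t)), ..., A(T(t)+t) for this iteration.
    lam: weight of the l1 regularization phi(x) = lam * ||x||_1.
    """
    m, _ = X.shape
    for j in range(m):
        pi = np.random.permutation(n)             # local reshuffling (step 5)
        for i in pi:                              # n inner gradient steps (7)
            X[j] = X[j] - gamma * grad(j, i, X[j])
    X = multi_step_consensus(X, mixing_matrices)  # consensus mapping (8)
    for j in range(m):
        X[j] = prox_l1(X[j], gamma, lam)          # one proximal step (10)
    return X
```

Note that, as Remark III.2 below emphasizes, the proximal operator is applied once per agent per epoch rather than after every inner gradient step.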
Remark III.2.

Compared with the standard proximal SGD, where the proximal operator is applied after each stochastic gradient descent step, the proposed DPG-RR performs only one single proximal evaluation at each iteration, which is $\frac{1}{n}$ of the proximal evaluations in proximal SGD. Therefore, the proposed DPG-RR has a lower computational burden and is applicable to non-smooth optimization where the proximal operator is expensive to compute.

Remark III.3.

Compared with the fully random sampling procedure used in the well-known stochastic gradient descent (SGD) method, one main advantage of the random reshuffling procedure is its intrinsic ability to avoid cache misses when reading the data from memory, enabling a faster implementation. In addition, a random reshuffling algorithm usually converges in fewer iterations than distributed SGD [30, 38]. This is because the RR procedure learns from each sample in each epoch, while SGD can miss learning from some samples in any given epoch. The gradient method with the incremental sampling procedure is a special case of the random reshuffling gradient method, which has been investigated in [36].

Remark III.4.

The work in [37] has studied the convergence rates for strongly convex, convex and non-convex optimization under different shuffling procedures, providing a unified analysis framework for three shuffling procedures: global shuffling, local shuffling and insufficient shuffling. (If the training samples of a machine learning task are centrally stored, global shuffling is usually applied, where the entire set of training samples is shuffled after each epoch. Otherwise, local shuffling, where each agent randomly shuffles its local training samples after each epoch, is preferred to reduce the communication burden.) Unlike [37], this paper focuses on non-smooth optimization with a local shuffling procedure, which is more applicable to the distributed setting. In addition, the proposed distributed DPG-RR has a comparable convergence rate and allows non-smooth functions and time-varying communications between different agents for convex finite-sum optimization.

The convergence performance of the proposed algorithm is discussed in the following theorem and the proof is provided in the next section.

Theorem III.1.

Suppose that Assumptions III.1 and III.2 hold and the step-size is $\gamma=M/\sqrt{T}$ with $M\leq\sqrt{6}/(6Ln)$. Then, Algorithm 1 possesses the following properties:

  • (1)

    The local variable estimates $x_{j,t}$ achieve consensus and $\lim_{t\to\infty}\|x_{j,t}-\bar{x}_{t}\|=0$, where $\bar{x}_{t}\triangleq\frac{1}{m}\sum_{j=1}^{m}x_{j,t}$.

  • (2)

    In addition, $\mathbb{E}[F(\hat{x}_{T})-F_{*}]=\mathcal{O}(\frac{1}{T}+\frac{1}{\sqrt{T}})$, where $\hat{x}_{T}\triangleq\frac{1}{mT}\sum_{t=1}^{T}\sum_{j=1}^{m}x_{j,t}$.

Remark III.5.

If the local sample size $n$ increases, the convergence of the proposed algorithm becomes slower. It follows from the proof of Theorem III.1 that the constant term in the convergence rate is proportional to a third-order polynomial of $n$. Since the proposed algorithm is distributed, we can reduce the effect of the local sample size $n$ on the convergence rate by using more agents for large-scale problems, although more agents may require higher communication costs over multi-agent networks. In addition, the convergence rate also depends on other factors, such as the Lipschitz constant of the objective functions, the upper bounds of the subgradients, the shuffling variance, and the bounded intercommunication interval.

Remark III.6.

Theorem III.1 extends the result of the centralized Prox-RR in [38] to distributed settings and non-strongly convex optimization. Compared with the popular stochastic subgradient method with a diminishing step-size for non-smooth optimization, whose convergence rate is $\mathcal{O}(\frac{\log(T)}{\sqrt{T}})$ [19], the proposed DPG-RR has a faster convergence rate.

IV Theoretical analysis

In this section, we present theoretical proofs for the convergence performance of the proposed algorithm. The proof sketch includes three parts.

  • (a)

    Transform DPG-RR into an algorithm with some error sequences, and prove that the error sequences are summable. (Let $\{a_{n}\}$ be a sequence of real numbers. The series $\sum_{k=1}^{\infty}a_{k}$ is summable if and only if the sequence $x_{n}\triangleq\sum_{k=1}^{n}a_{k}$, $n\in\mathbb{N}$, converges.)

  • (b)

    Estimate the bound of the forward per-epoch deviation of the DPG-RR.

  • (c)

    Prove the consensus property of local variables and the convergence performance in Theorem III.1.

Each part is discussed in detail in the subsequent subsections. To present the transformation, we define

$\bar{x}_{t}^{i}\triangleq\frac{1}{m}\sum_{j=1}^{m}x_{j,t}^{i}\in\mathbb{R}^{d},\quad\bar{x}_{t}\triangleq\frac{1}{m}\sum_{j=1}^{m}x_{j,t}\in\mathbb{R}^{d},$
$\bar{v}_{t}\triangleq\frac{1}{m}\sum_{j=1}^{m}v_{j,t}\in\mathbb{R}^{d},\quad z_{t+1}\triangleq{\rm prox}_{\gamma,\phi}(\bar{v}_{t}).$

IV-A Transformation of DPG-RR and the summability of error sequences

Proposition IV.1.

Suppose Assumptions III.1 and III.2 hold. The average variable of Algorithm 1 satisfies

$\bar{x}_{t+1}\in{\rm prox}_{\gamma,\phi}^{\varepsilon_{t+1}}\Big(\bar{x}_{t}-\gamma\big(\sum_{i=0}^{n-1}(\nabla\mathbf{f}_{\pi_{t}^{i}}(\bar{x}_{t}^{i})+e_{t}^{i})\big)\Big),$ (11)

where $\mathbf{f}_{\pi_{t}^{i}}(x)=\frac{1}{m}\sum_{j=1}^{m}f_{j,\pi_{t}^{i}}(x)$, and the error sequences $\{e_{t}^{i}\}_{t=0}^{\infty}$ and $\{\varepsilon_{t+1}\}_{t=0}^{\infty}$ satisfy

$\|e_{t}^{i}\|\leq\frac{L}{m}\sum_{j=1}^{m}\|x_{j,t}^{i}-\bar{x}_{t}^{i}\|,$ (12a)
$\varepsilon_{t+1}\leq\frac{2G_{\phi}}{m}\sum_{j=1}^{m}\|v_{j,t}-\bar{v}_{t}\|+\frac{1}{2\gamma}\big(\frac{1}{m}\sum_{j=1}^{m}\|v_{j,t}-\bar{v}_{t}\|\big)^{2}.$ (12b)
Proof.

See Appendix VII-B. ∎

Then, inspired by Section III.B of [28], we discuss the summability of the error sequences $\{\gamma\|e_{t}\|\}$ and $\{\varepsilon_{t}\}$ in the following proposition.

Proposition IV.2.

Under Assumptions III.1 and III.2, the error sequences $\{e_{t}^{i}\}$ and $\{\varepsilon_{t}\}$ satisfy $\|e_{t}^{i}\|\leq b_{e,t}+\gamma C_{0}$, $\varepsilon_{t}\leq b_{\varepsilon,t}$ and $\sqrt{\varepsilon_{t}}\leq b_{\sqrt{\varepsilon},t}$, where $b_{e,t}$, $b_{\varepsilon,t}$ and $b_{\sqrt{\varepsilon},t}$ are polynomial-geometric sequences and $C_{0}$ is a constant.

Proof.

See Appendix VII-C. ∎

IV-B Boundedness of the forward per-epoch deviation

Before presenting the boundedness analysis, we introduce some necessary quantities for the random reshuffling technique.

Definition IV.1.

[36] For any $i$, the quantity $D_{f_{i}}(x,y)\triangleq f_{i}(x)-f_{i}(y)-\langle\nabla f_{i}(y),x-y\rangle$ is the Bregman divergence between $x$ and $y$ associated with the function $f_{i}$.

If the function $f_{i}$ is $L$-smooth, then for all $x,y\in\mathbb{R}^{d}$, $D_{f_{i}}(x,y)\leq\frac{L}{2}\|x-y\|^{2}$. The difference between the gradients of a convex and $L$-smooth function $f_{i}$ satisfies

$\|\nabla f_{i}(x)-\nabla f_{i}(y)\|^{2}\leq 2LD_{f_{i}}(x,y).$ (13)
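For instance, for the quadratic $f_{i}(x)=\frac{1}{2}\|x\|^{2}$, which is $L$-smooth with $L=1$, a direct computation gives

$D_{f_{i}}(x,y)=\frac{1}{2}\|x\|^{2}-\frac{1}{2}\|y\|^{2}-\langle y,x-y\rangle=\frac{1}{2}\|x-y\|^{2},$

so both the smoothness bound $D_{f_{i}}(x,y)\leq\frac{L}{2}\|x-y\|^{2}$ and (13) hold with equality in this case.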

In addition, the forward per-epoch deviation of the proposed algorithm is introduced as follows.

Definition IV.2.

The forward per-epoch deviation of DPG-RR for the $t$th epoch is defined as

$\mathcal{V}_{t}\triangleq\sum_{i=0}^{n-1}\|\bar{x}_{t}^{i}-\bar{x}_{t+1}\|^{2}.$ (14)

Now, we show in the following lemma that the forward per-epoch deviation $\mathcal{V}_{t}$ of DPG-RR is upper bounded.

Lemma IV.1.

If Assumption III.1 holds, then the forward per-epoch deviation $\mathcal{V}_{t}$ satisfies

$\mathbb{E}[\mathcal{V}_{t}]\leq 24\gamma^{2}Ln^{2}\sum_{i=0}^{n-1}\mathbb{E}[D_{\mathbf{f}_{\pi_{t}^{i}}}(x_{*},\bar{x}_{t}^{i})]+3\gamma^{2}n^{2}\sigma_{*}^{2}+9\gamma^{2}nG_{f}^{2}+6\gamma^{2}nG_{\phi}^{2}+12n^{3}(\gamma^{2}b_{e,t}^{2}+\gamma^{4}C_{0}^{2})+4n\gamma b_{\varepsilon,t},$ (15)

where $\mathcal{V}_{t}$ is defined in (14), $\sigma_{*}^{2}\triangleq\frac{1}{n}\sum_{i=1}^{n}\|\nabla\mathbf{f}_{\pi_{t}^{i}}(x_{*})-\frac{1}{n}\nabla f(x_{*})\|^{2}$, and $b_{e,t}$ and $b_{\varepsilon,t}$ are defined in (53) and (VII-C), respectively.

Proof.

See Appendix VII-D. ∎

IV-C Proof of Theorem III.1

Proof.

(1) First, we prove the consensus property of the generated local variables, i.e., $\lim_{t\to\infty}\|x_{j,t}-\bar{x}_{t}\|=0$ for every agent $j$. By (40) in Lemma VII.5, $\sum_{j=1}^{m}\|x_{j,t}-\bar{x}_{t}\|\leq 2m\Gamma\Xi^{t}\sum_{l=1}^{m}\|x_{l,t-1}^{n}\|$ holds. In addition, with Lemma VII.2 and Lemma VII.6, we obtain that

$\sum_{t=1}^{\infty}2m\Gamma\Xi^{t}\sum_{l=1}^{m}\|x_{l,t-1}^{n}\|<\infty.$

Then, $\sum_{t=1}^{\infty}\sum_{j=1}^{m}\|x_{j,t}-\bar{x}_{t}\|<\infty$. Because $\sum_{j=1}^{m}\|x_{j,t}-\bar{x}_{t}\|$ is nonnegative,

$\lim_{t\to\infty}\sum_{j=1}^{m}\|x_{j,t}-\bar{x}_{t}\|=0,$

and hence $\lim_{t\to\infty}\|x_{j,t}-\bar{x}_{t}\|=0$ for each agent $j$. Therefore, the local variable estimates $x_{j,t}$ of all agents achieve consensus and converge to the average $\bar{x}_{t}$ as $t\to\infty$.

(2) Next, we consider the convergence rate of the proposed algorithm. By Lemma VII.1 and (11), we have

$\bar{x}_{t}=\gamma\bar{g}_{t}+\bar{x}_{t+1}+\gamma e_{t+1}+p_{t+1}+\gamma\bar{d}_{t+1},$ (16)

where $e_{t+1}=\sum_{i=0}^{n-1}e_{t}^{i}$, $\bar{d}_{t+1}\in\partial_{\varepsilon_{t+1}}\phi(\bar{x}_{t+1})$ and $\bar{g}_{t}\triangleq\sum_{i=0}^{n-1}\nabla\mathbf{f}_{\pi_{t}^{i}}(\bar{x}_{t}^{i})$. Then, the squared distance between $\bar{x}_{t}$ and the optimal solution $x_{*}$ satisfies

$\|\bar{x}_{t}-x_{*}\|^{2}=\|\gamma\bar{g}_{t}+\gamma e_{t+1}+p_{t+1}+\gamma\bar{d}_{t+1}+\bar{x}_{t+1}-x_{*}\|^{2}$
$\geq\|\bar{x}_{t+1}-x_{*}\|^{2}+2\gamma\sum_{i=0}^{n-1}\langle\bar{x}_{t+1}-x_{*},\nabla\mathbf{f}_{\pi_{t}^{i}}(\bar{x}_{t}^{i})\rangle+2\langle\bar{x}_{t+1}-x_{*},\gamma e_{t+1}+p_{t+1}+\gamma\bar{d}_{t+1}\rangle.$ (17)

For the second term in (17), we have for any $i$,

$\langle\bar{x}_{t+1}-x_{*},\nabla\mathbf{f}_{\pi_{t}^{i}}(\bar{x}_{t}^{i})\rangle=\mathbf{f}_{\pi_{t}^{i}}(\bar{x}_{t+1})-\mathbf{f}_{\pi_{t}^{i}}(x_{*})+\mathbf{f}_{\pi_{t}^{i}}(x_{*})-\mathbf{f}_{\pi_{t}^{i}}(\bar{x}_{t}^{i})-\langle\nabla\mathbf{f}_{\pi_{t}^{i}}(\bar{x}_{t}^{i}),x_{*}-\bar{x}_{t}^{i}\rangle-[\mathbf{f}_{\pi_{t}^{i}}(\bar{x}_{t+1})-\mathbf{f}_{\pi_{t}^{i}}(\bar{x}_{t}^{i})-\langle\nabla\mathbf{f}_{\pi_{t}^{i}}(\bar{x}_{t}^{i}),\bar{x}_{t+1}-\bar{x}_{t}^{i}\rangle]$
$=[\mathbf{f}_{\pi_{t}^{i}}(\bar{x}_{t+1})-\mathbf{f}_{\pi_{t}^{i}}(x_{*})]+D_{\mathbf{f}_{\pi_{t}^{i}}}(x_{*},\bar{x}_{t}^{i})-D_{\mathbf{f}_{\pi_{t}^{i}}}(\bar{x}_{t+1},\bar{x}_{t}^{i}).$ (18)

Summing $\mathbf{f}_{\pi_{t}^{i}}(\bar{x}_{t+1})-\mathbf{f}_{\pi_{t}^{i}}(x_{*})$ over $i$ from $0$ to $n-1$ gives

$\sum_{i=0}^{n-1}[\mathbf{f}_{\pi_{t}^{i}}(\bar{x}_{t+1})-\mathbf{f}_{\pi_{t}^{i}}(x_{*})]=f(\bar{x}_{t+1})-f_{*}.$ (19)

Next, we bound the third term in (18) using the $L$-smoothness in Assumption III.1 (b):

$D_{\mathbf{f}_{\pi_{t}^{i}}}(\bar{x}_{t+1},\bar{x}_{t}^{i})\leq\frac{L}{2}\|\bar{x}_{t+1}-\bar{x}_{t}^{i}\|^{2}.$ (20)

By summing (20) over $i$ from $0$ to $n-1$, we obtain an upper bound via the forward deviation $\mathcal{V}_{t}$ and Lemma IV.1:

$\sum_{i=0}^{n-1}\mathbb{E}[D_{\mathbf{f}_{\pi_{t}^{i}}}(\bar{x}_{t+1},\bar{x}_{t}^{i})]\leq\frac{L}{2}\mathbb{E}[\mathcal{V}_{t}]\leq 12\gamma^{2}L^{2}n^{2}\sum_{i=0}^{n-1}\mathbb{E}[D_{\mathbf{f}_{\pi_{t}^{i}}}(x_{*},\bar{x}_{t}^{i})]+\frac{3}{2}\gamma^{2}Ln^{2}\sigma_{*}^{2}+\frac{9}{2}\gamma^{2}LnG_{f}^{2}+3\gamma^{2}nLG_{\phi}^{2}+6Ln^{3}(\gamma^{2}b_{e,t}^{2}+\gamma^{4}C_{0}^{2})+2Ln\gamma b_{\varepsilon,t}.$

With the above inequality, we estimate a lower bound of the sum of the second and third terms in (18):

$\sum_{i=0}^{n-1}\mathbb{E}[D_{\mathbf{f}_{\pi_{t}^{i}}}(x_{*},\bar{x}_{t}^{i})-D_{\mathbf{f}_{\pi_{t}^{i}}}(\bar{x}_{t+1},\bar{x}_{t}^{i})]\geq(1-12\gamma^{2}L^{2}n^{2})\sum_{i=0}^{n-1}\mathbb{E}[D_{\mathbf{f}_{\pi_{t}^{i}}}(x_{*},\bar{x}_{t}^{i})]-\frac{3}{2}\gamma^{2}Ln^{2}\sigma_{*}^{2}-\frac{9}{2}\gamma^{2}LnG_{f}^{2}-3\gamma^{2}nLG_{\phi}^{2}-6Ln^{3}(\gamma^{2}b_{e,t}^{2}+\gamma^{4}C_{0}^{2})-2Ln\gamma b_{\varepsilon,t}$
$\geq-\frac{3}{2}\gamma^{2}Ln^{2}\sigma_{*}^{2}-\frac{9}{2}\gamma^{2}LnG_{f}^{2}-3\gamma^{2}nLG_{\phi}^{2}-6Ln^{3}(\gamma^{2}b_{e,t}^{2}+\gamma^{4}C_{0}^{2})-2Ln\gamma b_{\varepsilon,t},$ (21)

where the last inequality holds since $(1-12\gamma^{2}L^{2}n^{2})\geq 0$ by $\gamma\leq\sqrt{3}/(6Ln)$, and $\sum_{i=0}^{n-1}D_{\mathbf{f}_{\pi_{t}^{i}}}(x_{*},\bar{x}_{t}^{i})$ is non-negative due to convexity.

By rearranging (17), we have

$\|\bar{x}_{t+1}-x_{*}\|^{2}\overset{(17),(19)}{\leq}\|\bar{x}_{t}-x_{*}\|^{2}-2\gamma(f(\bar{x}_{t+1})-f_{*})-\sum_{i=0}^{n-1}2\gamma(D_{\mathbf{f}_{\pi_{t}^{i}}}(x_{*},\bar{x}_{t}^{i})-D_{\mathbf{f}_{\pi_{t}^{i}}}(\bar{x}_{t+1},\bar{x}_{t}^{i}))-2\langle\bar{x}_{t+1}-x_{*},\gamma\bar{d}_{t+1}\rangle-2\langle\bar{x}_{t+1}-x_{*},\gamma e_{t+1}+p_{t+1}\rangle$
$\leq\|\bar{x}_{t}-x_{*}\|^{2}-2\gamma(f(\bar{x}_{t+1})-f_{*})-\sum_{i=0}^{n-1}2\gamma(D_{\mathbf{f}_{\pi_{t}^{i}}}(x_{*},\bar{x}_{t}^{i})-D_{\mathbf{f}_{\pi_{t}^{i}}}(\bar{x}_{t+1},\bar{x}_{t}^{i}))-2\langle\bar{x}_{t+1}-x_{*},\gamma\bar{d}_{t+1}\rangle+2\|\bar{x}_{t+1}-x_{*}\|(\gamma\|e_{t+1}\|+\|p_{t+1}\|)$
$\overset{(53),(55)}{\leq}\|\bar{x}_{t}-x_{*}\|^{2}-2\gamma(f(\bar{x}_{t+1})-f_{*})-2\langle\bar{x}_{t+1}-x_{*},\gamma\bar{d}_{t+1}\rangle-\sum_{i=0}^{n-1}2\gamma(D_{\mathbf{f}_{\pi_{t}^{i}}}(x_{*},\bar{x}_{t}^{i})-D_{\mathbf{f}_{\pi_{t}^{i}}}(\bar{x}_{t+1},\bar{x}_{t}^{i}))+2\|\bar{x}_{t+1}-x_{*}\|(n\gamma b_{e,t}+n\gamma^{2}C_{0}+\sqrt{2\gamma}b_{\sqrt{\varepsilon},t}),$ (22)

where the second inequality holds due to the Cauchy-Schwarz inequality and the triangle inequality.

Taking the expectation of (22) and using (21),

$\mathbb{E}[\|\bar{x}_{t+1}-x_{*}\|^{2}]\leq\mathbb{E}[\|\bar{x}_{t}-x_{*}\|^{2}]-2\gamma\mathbb{E}[f(\bar{x}_{t+1})-f_{*}]+9\gamma^{3}LnG_{f}^{2}+3\gamma^{3}Ln^{2}\sigma_{*}^{2}+6\gamma^{3}nLG_{\phi}^{2}+12n^{3}(\gamma^{3}Lb_{e,t}^{2}+\gamma^{5}C_{0}^{2})+4\gamma^{2}Lnb_{\varepsilon,t}-2\gamma\mathbb{E}[\langle\bar{x}_{t+1}-x_{*},\bar{d}_{t+1}\rangle]+2\lambda_{t}\mathbb{E}[\|\bar{x}_{t+1}-x_{*}\|]$
$\leq\mathbb{E}[\|\bar{x}_{t}-x_{*}\|^{2}]-2\gamma\mathbb{E}[f(\bar{x}_{t+1})-f_{*}]+9\gamma^{3}LnG_{f}^{2}+3\gamma^{3}Ln^{2}\sigma_{*}^{2}+6\gamma^{3}nLG_{\phi}^{2}+12n^{3}(\gamma^{3}Lb_{e,t}^{2}+\gamma^{5}C_{0}^{2})+4\gamma^{2}Lnb_{\varepsilon,t}-2\gamma\mathbb{E}[\phi(\bar{x}_{t+1})-\phi_{*}]+2\gamma b_{\varepsilon,t}+2\lambda_{t}\mathbb{E}[\|\bar{x}_{t+1}-x_{*}\|],$

where $\lambda_{t}=\gamma nb_{e,t}+n\gamma^{2}C_{0}+\sqrt{2\gamma}b_{\sqrt{\varepsilon},t}$, and the second inequality holds because of $\langle\bar{x}_{t+1}-x_{*},\bar{d}_{t+1}\rangle+\varepsilon_{t+1}\geq\phi(\bar{x}_{t+1})-\phi_{*}$ and (VII-C).

Then, rearranging the above result leads to

$2\gamma\mathbb{E}[F(\bar{x}_{t+1})-F_{*}]\leq\mathbb{E}[\|\bar{x}_{t}-x_{*}\|^{2}]-\mathbb{E}[\|\bar{x}_{t+1}-x_{*}\|^{2}]+9\gamma^{3}LnG_{f}^{2}+3\gamma^{3}Ln^{2}\sigma_{*}^{2}+6\gamma^{3}nLG_{\phi}^{2}+12n^{3}(\gamma^{3}Lb_{e,t}^{2}+\gamma^{5}C_{0}^{2})+4\gamma^{2}Lnb_{\varepsilon,t}+2\gamma b_{\varepsilon,t}+2\lambda_{t}\mathbb{E}[\|\bar{x}_{t+1}-x_{*}\|].$

Summing over $t$ from $0$ to $T-1$,

$2\gamma\sum_{t=0}^{T-1}\mathbb{E}[F(\bar{x}_{t+1})-F_{*}]\leq\|\bar{x}_{0}-x_{*}\|^{2}-\mathbb{E}[\|\bar{x}_{T}-x_{*}\|^{2}]+9T\gamma^{3}LnG_{f}^{2}+3T\gamma^{3}Ln^{2}\sigma_{*}^{2}+6T\gamma^{3}nLG_{\phi}^{2}+12Tn^{3}\gamma^{5}C_{0}^{2}+12\gamma^{3}Ln^{3}\sum_{t=0}^{T-1}b_{e,t}^{2}+2\gamma(1+2\gamma nL)\sum_{t=0}^{T-1}b_{\varepsilon,t}+2\sum_{t=0}^{T-1}\lambda_{t}\mathbb{E}[\|\bar{x}_{t+1}-x_{*}\|].$ (23)

Then, we need to estimate an upper bound of $\mathbb{E}[\|\bar{x}_{t+1}-x_{*}\|]$. By the transformation of (23) and the fact that $\mathbb{E}[\|\bar{x}_{T}-x_{*}\|]^{2}\leq\mathbb{E}[\|\bar{x}_{T}-x_{*}\|^{2}]$,

$\mathbb{E}[\|\bar{x}_{T}-x_{*}\|]^{2}\leq S_{T}+2\sum_{t=0}^{T-1}\lambda_{t}\mathbb{E}[\|\bar{x}_{t+1}-x_{*}\|],$

where $S_{T}=\|\bar{x}_{0}-x_{*}\|^{2}+9T\gamma^{3}LnG_{f}^{2}+3T\gamma^{3}Ln^{2}\sigma_{*}^{2}+6T\gamma^{3}nLG_{\phi}^{2}+12n^{3}\gamma^{5}TC_{0}^{2}+12\gamma^{3}Ln^{3}\sum_{t=0}^{T-1}b_{e,t}^{2}+2\gamma(1+2\gamma nL)\sum_{t=0}^{T-1}b_{\varepsilon,t}$. It follows from Lemma VII.4 that

$\mathbb{E}[\|\bar{x}_{T}-x_{*}\|]\leq\sum_{t=0}^{T-1}(\gamma nb_{e,t}+n\gamma^{2}C_{0}+\sqrt{2\gamma}b_{\sqrt{\varepsilon},t})+\Big(S_{T}+\big(\sum_{t=0}^{T-1}(\gamma nb_{e,t}+n\gamma^{2}C_{0}+\sqrt{2\gamma}b_{\sqrt{\varepsilon},t})\big)^{2}\Big)^{\frac{1}{2}},$

where the sequences $b_{e,t}$, $b_{\varepsilon,t}$ and $b_{\sqrt{\varepsilon},t}$ are polynomial-geometric sequences and are summable by Lemma VII.2.

By the summability of $b_{e,t}$, $b_{\varepsilon,t}$ and $b_{\sqrt{\varepsilon},t}$, there exist scalars $\textbf{A}$ and $\textbf{B}$ such that

$A_{T}=\sum_{t=0}^{T-1}(\gamma nb_{e,t}+n\gamma^{2}C_{0}+\sqrt{2\gamma}b_{\sqrt{\varepsilon},t})\leq\textbf{A}<\infty,$ (24)

where we use the step-size range $\gamma\leq 1/\sqrt{T}$, and

$B_{T}=T\textbf{D}+12\gamma^{3}Ln^{3}\sum_{t=0}^{T-1}b_{e,t}^{2}+2\gamma(1+2\gamma nL)\sum_{t=0}^{T-1}b_{\varepsilon,t}\leq T\textbf{D}+\textbf{B},$ (25)

where $\textbf{D}\triangleq\gamma^{3}(9LnG_{f}^{2}+3Ln^{2}\sigma_{*}^{2}+6nLG_{\phi}^{2}+12n^{3}\gamma^{2}C_{0}^{2})$. Then, $\mathbb{E}[\|\bar{x}_{T}-x_{*}\|]$ satisfies

$\mathbb{E}[\|\bar{x}_{T}-x_{*}\|]\leq\textbf{A}+(\|\bar{x}_{0}-x_{*}\|^{2}+(T\textbf{D}+\textbf{B})+\textbf{A}^{2})^{\frac{1}{2}}.$

Since $\{A_{t}\}$ and $\{B_{t}\}$ are increasing sequences, we have for $t\leq T$,

$\mathbb{E}[\|\bar{x}_{t}-x_{*}\|]\leq A_{T}+(\|\bar{x}_{0}-x_{*}\|^{2}+B_{T}+A_{T}^{2})^{\frac{1}{2}}\leq 2A_{T}+\|\bar{x}_{0}-x_{*}\|+\sqrt{B_{T}}\leq 2\textbf{A}+\|\bar{x}_{0}-x_{*}\|+\sqrt{T\textbf{D}+\textbf{B}}.$ (26)

Now, we can bound the right-hand side of (23) with (26):

$2\gamma\sum_{t=1}^{T}\mathbb{E}[F(\bar{x}_{t})-F_{*}]\leq\|\bar{x}_{0}-x_{*}\|^{2}+B_{T}+2A_{T}(2A_{T}+\|\bar{x}_{0}-x_{*}\|+\sqrt{B_{T}})\leq\|\bar{x}_{0}-x_{*}\|^{2}+T\textbf{D}+\textbf{B}+2\textbf{A}(2\textbf{A}+\|\bar{x}_{0}-x_{*}\|+\sqrt{T\textbf{D}+\textbf{B}}).$

Dividing both sides by $2\gamma T$, we get

$\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}[F(\bar{x}_{t})-F_{*}]\leq\frac{\|\bar{x}_{0}-x_{*}\|^{2}}{2\gamma T}+\frac{\textbf{D}}{2\gamma}+\frac{\textbf{B}}{2\gamma T}+\frac{\textbf{A}}{\gamma T}(2\textbf{A}+\|\bar{x}_{0}-x_{*}\|+\sqrt{T\textbf{D}+\textbf{B}}).$

Finally, by the convexity of $F$, the average iterate $\hat{x}_{T}=\frac{1}{T}\sum_{t=1}^{T}\bar{x}_{t}$ satisfies

$\mathbb{E}[F(\hat{x}_{T})-F_{*}]\leq\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}[F(\bar{x}_{t})-F_{*}]\leq\frac{\|\bar{x}_{0}-x_{*}\|^{2}}{2\gamma T}+\frac{\textbf{D}}{2\gamma}+\frac{\textbf{B}}{2\gamma T}+\frac{\textbf{A}}{\gamma T}(2\textbf{A}+\|\bar{x}_{0}-x_{*}\|+\sqrt{T\textbf{D}+\textbf{B}})$
$\leq\frac{\|\bar{x}_{0}-x_{*}\|^{2}}{2\gamma T}+\frac{\textbf{A}}{\gamma T}(2\textbf{A}+\|\bar{x}_{0}-x_{*}\|)+\frac{\textbf{A}\sqrt{\textbf{B}}}{\gamma T}+\frac{\textbf{B}}{2\gamma T}+\frac{\textbf{A}\sqrt{\textbf{D}}}{\gamma\sqrt{T}}+\frac{\textbf{D}}{2\gamma}=\mathcal{O}(\frac{1}{T}+\frac{1}{T^{1/2}}),$ (27)

where $\gamma\leq T^{-1/2}$ and the last equality holds due to the facts that $\sqrt{\textbf{D}}/\gamma=\mathcal{O}(T^{-1/4})$ and $\textbf{D}/\gamma=\mathcal{O}(T^{-1})$. ∎

V Simulation

In this section, we apply the proposed algorithm DPG-RR to the black-box binary classification problem [44], which is to find the optimal predictor $x\in\mathbb{R}^{d}$ by solving

$\min_{x\in\mathbb{R}^{d}}F(x),\quad F(x)=\frac{1}{m}\sum_{j=1}^{m}f_{j}(x)+\lambda_{1}\|x\|_{1},$ (28)

where $f_{j}(x)\triangleq\sum_{i=1}^{n}\ln(1+\exp(-l_{j,i}\langle a_{j,i},x\rangle))$, $a_{j,i}\in\mathbb{R}^{d}$ is the feature vector of the $i$th local sample of agent $j$, $l_{j,i}\in\{-1,1\}$ is the classification value of the $i$th local sample of agent $j$, and $\{a_{j,i},l_{j,i}\}_{i=1}^{n}$ denotes the set of local training samples of agent $j$. In this experiment, we use the publicly available real datasets a9a and w8a (available at www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). In addition, in our setting, there are $m=10$ agents and $\lambda_{1}=5\times 10^{-4}$. All comparative algorithms are initialized with the same value. Experiment codes and the adjacency matrices of the ten-agent time-varying networks are provided at https://github.com/managerjiang/Dis-Prox-RR.

[Figure 1: Different algorithms for a9a dataset over time-invariant graphs. (a) Consensus trajectories; (b) Cost evolutions.]
[Figure 2: Different algorithms for w8a dataset over time-invariant graphs. (a) Consensus trajectories; (b) Cost evolutions.]

(1) Time-invariant networks: For comparison, we apply some existing distributed stochastic algorithms, including GT-SAGA in [24] and DSGD in [22], together with the proposed stochastic DPG-RR, to solve (28) over the ten-agent connected network. Since the cost function in (28) is non-smooth, we replace the gradients in GT-SAGA and DSGD with subgradients. The simulation results for the a9a and w8a datasets are shown in Figs. 1 and 2. Fig. 1(b) and Fig. 2(b) show that the trajectories of the objective function $F(x)$ generated by all algorithms converge quickly to the optimum $F_{*}$. To measure the consensus performance, we define the quantity $D(x)=\sum_{i=1}^{m}x_{i}^{\prime}\sum_{j=1}^{m}a_{ij}(x_{i}-x_{j})$, where $a_{ij}$ is the $(i,j)$th element of one doubly stochastic adjacency matrix. ($D(x)$ is an expansion of $\hat{x}^{\top}L\hat{x}$, where $L\triangleq\mathcal{L}\otimes I_{d}\in\mathbb{R}^{md\times md}$, $\mathcal{L}$ is the Laplacian matrix of the undirected multi-agent network and $\hat{x}=[x_{1},\cdots,x_{m}]^{\top}\in\mathbb{R}^{md}$. The Laplacian matrix of an undirected multi-agent network is positive semi-definite. By the fact that $0$ is the eigenvalue of $L$ corresponding to the eigenvector $1_{md}$, $L\hat{x}=0_{md}$ if and only if $x_{i}=x_{j}$ for all $i,j\in\{1,\cdots,m\}$.) If all local variable estimates achieve consensus, then $D(x)=0$. Fig. 1(a) and Fig. 2(a) show that the local variables generated by all algorithms achieve consensus. In addition, it is seen that the proposed algorithm DPG-RR has a better consensus performance than GT-SAGA and DSGD, which verifies the effectiveness of the multi-step consensus mapping in DPG-RR.
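For reference, the consensus measure $D(x)$ can be computed directly from its definition; a minimal helper (our code, with hypothetical names):

```python
import numpy as np

def consensus_measure(X, A):
    """D(x) = sum_i x_i' sum_j a_ij (x_i - x_j); zero iff all rows agree.

    X: (m, d) array of local estimates; A: (m, m) doubly stochastic weights.
    """
    m = X.shape[0]
    return sum(X[i].dot(sum(A[i, j] * (X[i] - X[j]) for j in range(m)))
               for i in range(m))
```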

Although all comparative algorithms have comparable convergence rates in practice, [24] and [22] do not provide theoretical convergence analyses of GT-SAGA and DSGD for non-smooth optimization. In addition, these algorithms are not applicable to time-varying multi-agent networks; hence, we further compare our proposed algorithm with some recent distributed deterministic algorithms for non-smooth optimization over time-varying networks.

(2) Time-varying networks: We compare DPG-RR with the distributed deterministic proximal algorithm Prox-G in [28], the algorithm NEXT in [45] and the distributed subgradient method (DGM) in [7] for solving (28) over time-varying graphs. DPG-RR and Prox-G adopt the same constant step-size, while NEXT and DGM use diminishing step-sizes.

[Figure 3: Ten-agent periodically time-varying networks.]

All distributed algorithms are applied to problem (28) over periodically time-varying ten-agent networks satisfying Assumption III.2, which are shown in Fig. 3.

[Figure 4: Different algorithms for a9a dataset over time-varying graphs. (a) Consensus trajectories; (b) Cost evolutions.]
[Figure 5: Different algorithms for w8a dataset over time-varying graphs. (a) Consensus trajectories; (b) Cost evolutions.]
TABLE I: Running time (s) of different algorithms

Algorithms    a9a      w8a
DPG-RR        25.21    140.89
Prox-G        55.50    161.44
NEXT          80.28    264.78
DGM           30.91    120.51

The simulation results for the a9a and w8a datasets are shown in Fig. 4 and Fig. 5, respectively. It is seen from Fig. 4(a) and Fig. 5(a) that the trajectories generated by DPG-RR, Prox-G, NEXT and DGM all converge to zero, implying that the local variable estimates generated by all algorithms achieve consensus. In addition, DPG-RR shows better consensus numerical performance than the NEXT and DGM algorithms. From Fig. 4(b) and Fig. 5(b), we can see that DPG-RR converges faster than NEXT and DGM on both the a9a and w8a datasets, which verifies that DPG-RR is superior to algorithms with diminishing step-sizes. We observe that DPG-RR has a convergence performance comparable to that of Prox-G, while, due to the stochastic technique, DPG-RR avoids the computation of the full local gradients required by Prox-G. The running times of the different algorithms are shown in Table I. It is observed that the proposed stochastic DPG-RR generally costs less running time than the deterministic algorithms Prox-G and NEXT.

[Figure 6: Algorithms with different sampling procedures for w8a dataset.]

To show the advantages of the random reshuffling sampling procedure, we replace the random reshuffling step of DPG-RR with a stochastic sampling procedure, yielding a DPG-SG algorithm. Similarly, we replace the random reshuffling step of DPG-RR with a deterministic incremental sampling procedure, yielding a DPG-IG algorithm. All algorithms take the same constant step-size ($0.3$) and the simulation results are shown in Fig. 6. Compared with the fully stochastic sampling procedure (DPG-SG), the proposed DPG-RR shows faster convergence. In addition, the DPG-IG algorithm shows a convergence rate comparable to DPG-RR. These numerical results are consistent with the theoretical findings discussed in Remark III.3.

VI Conclusion

Making use of random reshuffling, this paper has developed a distributed stochastic proximal-gradient algorithm for solving large-scale non-smooth convex optimization over time-varying multi-agent graphs. The proposed algorithm performs only one single proximal evaluation at each epoch, which offers significant advantages in scenarios where the proximal mapping is computationally expensive. In addition, the proposed algorithm uses constant step-sizes, overcoming the degraded performance of most stochastic algorithms with diminishing step-sizes. One future research direction is to improve the convergence rate of DPG-RR by accelerated methods, such as Nesterov's acceleration technique or the variance reduction technique. Another is to extend the proposed algorithm to distributed non-smooth non-convex optimization, which is applicable to deep neural network models.

VII Appendix

VII-A Some vital Lemmas

First, the following lemma characterizes the $\varepsilon_{t}$-subdifferential of the non-smooth function $\phi$ at $\bar{x}_{t}$, i.e., $\partial_{\varepsilon_{t}}\phi(\bar{x}_{t})$.

Lemma VII.1.

[9, Lemma 2] If $\bar{x}_{t}$ is an $\varepsilon_{t}$-optimal solution to (2) in the sense of (4) with $y=\bar{x}_{t-1}-\gamma(\bar{g}_{t-1}+e_{t})$, then there exists $p_{t}\in\mathbb{R}^{d}$ such that

$\|p_{t}\|\leq\sqrt{2\gamma\varepsilon_{t}}$ (29)

and $\frac{1}{\gamma}(\bar{x}_{t-1}-\bar{x}_{t}-\gamma\bar{g}_{t-1}-\gamma e_{t}-p_{t})\in\partial_{\varepsilon_{t}}\phi(\bar{x}_{t})$.

The next lemma shows that polynomial-geometric sequences are summable, which is vital for bounding the error sequences that arise in the analysis.

Lemma VII.2.

[28, Proposition 3] Let ζ(0,1)\zeta\in(0,1), and let

Pk,N={cNkN++c1k+c0|cj,j=0,,N}P_{k,N}=\{c_{N}k^{N}+\cdots+c_{1}k+c_{0}|c_{j}\in\mathbb{R},j=0,\cdots,N\}

denote the set of all NN-th order polynomials of kk, where NN\in\mathbb{N}. Then for every polynomial pkPk,Np_{k}\in P_{k,N},

k=1pkζk<.\sum_{k=1}^{\infty}p_{k}\zeta^{k}<\infty.

The result of this lemma for Pk,N={kN}P_{k,N}=\{k^{N}\} will be particularly useful in what follows. Hence, we define

SNζk=1kNζk<.\displaystyle S_{N}^{\zeta}\triangleq\sum_{k=1}^{\infty}k^{N}\zeta^{k}<\infty. (30)
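As an illustrative numerical check of (30) (not part of the analysis), the partial sums of k^N ζ^k stabilize quickly; for N = 2 the limit has the known closed form ζ(1+ζ)/(1−ζ)^3.

# Illustrative check of (30): partial sums of k**N * zeta**k converge.
zeta, N = 0.9, 2
partial = sum(k**N * zeta**k for k in range(1, 2001))
closed_form = zeta * (1 + zeta) / (1 - zeta) ** 3  # exact limit for N = 2
print(partial, closed_form)  # both are approximately 1710.0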

The following lemma characterizes the variance of sampling without replacement, which is a key ingredient in the convergence analysis of DPG-RR.

Lemma VII.3.

[36, Lemma 1] Let X1,,XndX_{1},\cdots,X_{n}\in\mathbb{R}^{d} be fixed vectors, X¯1ni=1nXi\bar{X}\triangleq\frac{1}{n}\sum_{i=1}^{n}X_{i} be their average and σ21ni=1nXiX¯2\sigma^{2}\triangleq\frac{1}{n}\sum_{i=1}^{n}\|X_{i}-\bar{X}\|^{2} be the population variance. Fix any k{1,,n}k\in\{1,\cdots,n\}, let Xπ1,,XπkX_{\pi_{1}},\cdots,X_{\pi_{k}} be sampled uniformly without replacement from {X1,,Xn}\{X_{1},\cdots,X_{n}\} and X¯π\bar{X}_{\pi} be their average. Then, the sample average and variance are given by

𝔼[X¯π]=X¯,𝔼[X¯πX¯2]=nkk(n1)σ2.\displaystyle\mathbb{E}[\bar{X}_{\pi}]=\bar{X},\quad\mathbb{E}[\|\bar{X}_{\pi}-\bar{X}\|^{2}]=\frac{n-k}{k(n-1)}\sigma^{2}.
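A small Monte Carlo experiment (an illustrative sketch with arbitrary fixed scalars, not part of the proof) reproduces this without-replacement variance formula.

import random

n, k, trials = 10, 4, 200_000
X = [float((i * i) % 7) for i in range(n)]     # arbitrary fixed values
x_bar = sum(X) / n
sigma2 = sum((x - x_bar) ** 2 for x in X) / n  # population variance
# Empirical mean-squared deviation of the without-replacement sample average.
mse = sum((sum(random.sample(X, k)) / k - x_bar) ** 2 for _ in range(trials))
print(mse / trials)                            # empirical value
print((n - k) / (k * (n - 1)) * sigma2)        # prediction of Lemma VII.3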

Finally, we provide a lemma on the boundedness of a special nonnegative sequence, which is vital for the convergence analysis.

Lemma VII.4.

[9, Lemma 1] Assume that the nonnegative sequence {uT}\{u_{T}\} satisfies the following recursion for all T1T\geq 1

uT2ST+t=1Tλtut,\displaystyle u_{T}^{2}\leq S_{T}+\sum_{t=1}^{T}\lambda_{t}u_{t}, (31)

with {ST}\{S_{T}\} an increasing sequence, S0u02S_{0}\geq u_{0}^{2} and λt0\lambda_{t}\geq 0 for all tt. Then, for all T1T\geq 1,

uT12t=1Tλt+(ST+(12t=1Tλt)2)12.\displaystyle u_{T}\leq\frac{1}{2}\sum_{t=1}^{T}\lambda_{t}+\Big{(}S_{T}+\big{(}\frac{1}{2}\sum_{t=1}^{T}\lambda_{t}\big{)}^{2}\Big{)}^{\frac{1}{2}}. (32)
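The bound (32) can be checked numerically. The sketch below (illustrative; the sequences S_T and λ_t are generated at random) constructs u_T so that (31) holds with equality and verifies (32) at every T.

import math
import random

random.seed(0)
S, lam, u = [1.0], [0.0], [1.0]        # S_0 = u_0^2 = 1; lambda_0 is unused
for T in range(1, 51):
    S.append(S[-1] + random.random())  # S_T increasing
    lam.append(0.1 * random.random())  # lambda_T nonnegative
    prev = sum(lam[t] * u[t] for t in range(1, T))
    # Positive root of u_T^2 = S_T + lam_T*u_T + prev, i.e. (31) with equality.
    u.append((lam[T] + math.sqrt(lam[T] ** 2 + 4 * (S[T] + prev))) / 2)
    half = 0.5 * sum(lam[1:T + 1])
    assert u[T] <= half + math.sqrt(S[T] + half ** 2) + 1e-9
print("the bound (32) holds for every T up to 50")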

VII-B Proof of Proposition IV.1

Proof.

By taking the average of (7) over mm agents,

x¯ti+1=\displaystyle\bar{x}_{t}^{i+1}= x¯tiγ(1mj=1mfj,πti(x¯ti)+eti)\displaystyle\bar{x}_{t}^{i}-\gamma(\frac{1}{m}\sum_{j=1}^{m}\nabla f_{j,\pi_{t}^{i}}(\bar{x}_{t}^{i})+e_{t}^{i})
=\displaystyle= x¯tiγ(𝐟πti(x¯ti)+eti),\displaystyle\bar{x}_{t}^{i}-\gamma(\nabla\mathbf{f}_{\pi_{t}^{i}}(\bar{x}_{t}^{i})+e_{t}^{i}), (33)

where eti=1mj=1m(fj,πti(xj,ti)fj,πti(x¯ti))e_{t}^{i}=\frac{1}{m}\sum_{j=1}^{m}(\nabla f_{j,\pi_{t}^{i}}(x_{j,t}^{i})-\nabla f_{j,\pi_{t}^{i}}(\bar{x}_{t}^{i})) and 𝐟πti(x)=1mj=1mfj,πti(x)\mathbf{f}_{\pi_{t}^{i}}(x)=\frac{1}{m}\sum_{j=1}^{m}f_{j,\pi_{t}^{i}}(x). Because of the Lipschitz-continuity of the gradient of fj,πti(x)f_{j,\pi_{t}^{i}}(x), eti\|e_{t}^{i}\| satisfies

etiLmj=1mxj,tix¯ti.\displaystyle\|e_{t}^{i}\|\leq\frac{L}{m}\sum_{j=1}^{m}\|x_{j,t}^{i}-\bar{x}_{t}^{i}\|. (34)

Recall that zt+1proxγ,ϕ(v¯t)=argminx{ϕ(x)+12γxv¯t2}z_{t+1}\triangleq{\rm prox}_{\gamma,\phi}(\bar{v}_{t})={\rm argmin}_{x}\{\phi(x)+\frac{1}{2\gamma}\|x-\bar{v}_{t}\|^{2}\}, which is the result of the exact centralized proximal step. Then, we relate zt+1z_{t+1} and x¯t+1\bar{x}_{t+1} by formulating the latter as an inexact proximal step with error εt+1\varepsilon_{t+1}. A simple algebraic expansion gives

ϕ(x¯t+1)+12γx¯t+1v¯t2\displaystyle\phi(\bar{x}_{t+1})+\frac{1}{2\gamma}\|\bar{x}_{t+1}-\bar{v}_{t}\|^{2}
\displaystyle\leq ϕ(zt+1)+Gϕx¯t+1zt+1+12γ(zt+1v¯t2\displaystyle\phi(z_{t+1})+G_{\phi}\|\bar{x}_{t+1}-z_{t+1}\|+\frac{1}{2\gamma}\Big{(}\|z_{t+1}-\bar{v}_{t}\|^{2}
+2zt+1v¯t,x¯t+1zt+1+x¯t+1zt+12)\displaystyle+2\langle z_{t+1}-\bar{v}_{t},\bar{x}_{t+1}-z_{t+1}\rangle+\|\bar{x}_{t+1}-z_{t+1}\|^{2}\Big{)}
\displaystyle\leq minzd{ϕ(z)+12γzv¯t2}+12γx¯t+1zt+12\displaystyle\min_{z\in\mathbb{R}^{d}}\{\phi(z)+\frac{1}{2\gamma}\|z-\bar{v}_{t}\|^{2}\}+\frac{1}{2\gamma}\|\bar{x}_{t+1}-z_{t+1}\|^{2}
+x¯t+1zt+1(Gϕ+1γzt+1v¯t),\displaystyle+\|\bar{x}_{t+1}-z_{t+1}\|(G_{\phi}+\frac{1}{\gamma}\|z_{t+1}-\bar{v}_{t}\|),

where we have used the fact that ϕ(zt+1)+12γzt+1v¯t2=minz{ϕ(z)+12γzv¯t2}\phi(z_{t+1})+\frac{1}{2\gamma}\|z_{t+1}-\bar{v}_{t}\|^{2}=\min_{z}\{\phi(z)+\frac{1}{2\gamma}\|z-\bar{v}_{t}\|^{2}\} and zt+1v¯t,x¯t+1zt+1zt+1v¯tx¯t+1zt+1\langle z_{t+1}-\bar{v}_{t},\bar{x}_{t+1}-z_{t+1}\rangle\leq\|z_{t+1}-\bar{v}_{t}\|\|\bar{x}_{t+1}-z_{t+1}\| in the last inequality.

Therefore, we can write

x¯t+1proxγ,ϕεt+1(v¯t),\displaystyle\bar{x}_{t+1}\in{\rm prox}_{\gamma,\phi}^{\varepsilon_{t+1}}(\bar{v}_{t}), (35)

where εt+1=x¯t+1zt+1(Gϕ+1γzt+1v¯t)+12γx¯t+1zt+12.\varepsilon_{t+1}=\|\bar{x}_{t+1}-z_{t+1}\|(G_{\phi}+\frac{1}{\gamma}\|z_{t+1}-\bar{v}_{t}\|)+\frac{1}{2\gamma}\|\bar{x}_{t+1}-z_{t+1}\|^{2}.

By definition, zt+1z_{t+1} satisfies 1γ(v¯tzt+1)ϕ(zt+1)\frac{1}{\gamma}(\bar{v}_{t}-z_{t+1})\in\partial\phi(z_{t+1}), and therefore the norm of this subgradient is bounded by GϕG_{\phi}. As a result,

εt+12Gϕx¯t+1zt+1+12γx¯t+1zt+12.\displaystyle\varepsilon_{t+1}\leq 2G_{\phi}\|\bar{x}_{t+1}-z_{t+1}\|+\frac{1}{2\gamma}\|\bar{x}_{t+1}-z_{t+1}\|^{2}. (36)

Moreover, by the nonexpansiveness of the proximal operator,

x¯t+1zt+1\displaystyle\|\bar{x}_{t+1}-z_{t+1}\|\leq 1mj=1mproxγ,ϕ(vj,t)proxγ,ϕ(v¯t)\displaystyle\frac{1}{m}\sum_{j=1}^{m}\|{\rm prox}_{\gamma,\phi}(v_{j,t})-{\rm prox}_{\gamma,\phi}(\bar{v}_{t})\|
\displaystyle\leq 1mj=1mvj,tv¯t.\displaystyle\frac{1}{m}\sum_{j=1}^{m}\|v_{j,t}-\bar{v}_{t}\|. (37)

Finally, substituting (VII-B) into (36), we obtain (12b). ∎

VII-C Proof of Proposition IV.2

We first define two useful quantities Γ21+η(m1)B1η(m1)B\Gamma\triangleq 2\frac{1+\eta^{-(m-1)B}}{1-\eta^{(m-1)B}} and Ξ(1η(m1)B)1(m1)B\Xi\triangleq(1-\eta^{(m-1)B})^{\frac{1}{(m-1)B}}. Then, we collect some properties of the local variables generated by the proposed algorithm in the following lemma.

Lemma VII.5.

Under Assumptions III.1 and III.2, for each iteration t2t\geq 2,

j=1mxj,tnj=1mxj,t1n+mγ(Gϕ+nGf),\displaystyle\sum_{j=1}^{m}\|x_{j,t}^{n}\|\leq\sum_{j=1}^{m}\|x_{j,t-1}^{n}\|+m\gamma(G_{\phi}+nG_{f}), (38)
j=1mxj,t+1xj,t2mΓΞt1l=1mxl,t1n+mγ(Gϕ+nGf),\displaystyle\sum_{j=1}^{m}\|x_{j,t+1}-x_{j,t}\|\!\leq\!2m\Gamma\Xi^{t-1}\sum_{l=1}^{m}\|x_{l,t-1}^{n}\|\!+\!m\gamma(G_{\phi}\!+\!nG_{f}), (39)
j=1mxj,tx¯t2mΓΞt1l=1mxl,t1n.\displaystyle\sum_{j=1}^{m}\|x_{j,t}-\bar{x}_{t}\|\leq 2m\Gamma\Xi^{t-1}\sum_{l=1}^{m}\|x_{l,t-1}^{n}\|. (40)
Proof.

By (10), there exists zj,t+1ϕ(xj,t+1)z_{j,t+1}\in\partial\phi(x_{j,t+1}) such that

xj,t+1=vj,tγzj,t+1.\displaystyle x_{j,t+1}=v_{j,t}-\gamma z_{j,t+1}. (41)

Since function ϕ\phi has bounded subgradients,

xj,t+1vj,tγGϕ.\displaystyle\|x_{j,t+1}-v_{j,t}\|\leq\gamma G_{\phi}. (42)

(a) By Algorithm 1, we have

xj,tn=xj,t0γi=0n1fj,πti(xj,ti).\displaystyle x_{j,t}^{n}=x_{j,t}^{0}-\gamma\sum_{i=0}^{n-1}\nabla f_{j,\pi_{t}^{i}}(x_{j,t}^{i}).

Taking norms on both sides of the above equality and summing over jj,

j=1mxj,tn=\displaystyle\sum_{j=1}^{m}\|x_{j,t}^{n}\|= j=1mxj,t0γi=0n1fj,πti(xj,ti)\displaystyle\sum_{j=1}^{m}\|x_{j,t}^{0}-\gamma\sum_{i=0}^{n-1}\nabla f_{j,\pi_{t}^{i}}(x_{j,t}^{i})\|
\displaystyle\leq j=1mxj,t+γmnGf.\displaystyle\sum_{j=1}^{m}\|x_{j,t}\|+\gamma mnG_{f}. (43)

By (42), xj,t+1vj,tγGϕ\|x_{j,t+1}\|-\|v_{j,t}\|\leq\gamma G_{\phi} holds. Since vj,tv_{j,t} is a convex combination of {xl,tn}l=1m\{x_{l,t}^{n}\}_{l=1}^{m} and l=1mλjl,t=1\sum_{l=1}^{m}\lambda_{jl,t}=1,

j=1mvj,tj=1mxj,tn.\displaystyle\sum_{j=1}^{m}\|v_{j,t}\|\leq\sum_{j=1}^{m}\|x_{j,t}^{n}\|. (44)

Then, substituting the fact that xj,t+1vj,tγGϕ\|x_{j,t+1}\|-\|v_{j,t}\|\leq\gamma G_{\phi} and (44) into (VII-C), we obtain

j=1mxj,tnj=1mxj,t1n+mγ(Gϕ+nGf).\displaystyle\sum_{j=1}^{m}\|x_{j,t}^{n}\|\leq\sum_{j=1}^{m}\|x_{j,t-1}^{n}\|+m\gamma(G_{\phi}+nG_{f}).

(b) By (41) and the proposed algorithm,

xj,t+1=\displaystyle x_{j,t+1}= vj,tγzj,t+1\displaystyle v_{j,t}-\gamma z_{j,t+1}
=(8)\displaystyle\overset{\eqref{v_ji_up}}{=} l=1mλjl,txl,tnγzj,t+1\displaystyle\sum_{l=1}^{m}\lambda_{jl,t}x_{l,t}^{n}-\gamma z_{j,t+1}
=(7)\displaystyle\overset{\eqref{x_ji_up}}{=} l=1mλjl,t(xl,t0γi=0n1fl,πti(xl,ti))γzj,t+1.\displaystyle\sum_{l=1}^{m}\lambda_{jl,t}\Big{(}x_{l,t}^{0}-\gamma\sum_{i=0}^{n-1}\nabla f_{l,\pi_{t}^{i}}(x_{l,t}^{i})\Big{)}-\gamma z_{j,t+1}. (45)

Then, subtracting xj,tx_{j,t} from both sides of (VII-C) and taking norms, we obtain

j=1mxj,t+1xj,t\displaystyle\sum_{j=1}^{m}\|x_{j,t+1}-x_{j,t}\|
\displaystyle\leq j=1ml=1mλjl,txl,t0xj,t+mγ(Gϕ+nGf)\displaystyle\sum_{j=1}^{m}\sum_{l=1}^{m}\lambda_{jl,t}\|x_{l,t}^{0}-x_{j,t}\|+m\gamma(G_{\phi}+nG_{f})
=\displaystyle= j=1ml=1mλjl,txl,txj,t+mγ(Gϕ+nGf).\displaystyle\sum_{j=1}^{m}\sum_{l=1}^{m}\lambda_{jl,t}\|x_{l,t}-x_{j,t}\|+m\gamma(G_{\phi}+nG_{f}). (46)

For the first term in (VII-C), by the nonexpansiveness of the proximal operator,

xl,txj,tvl,t1vj,t1.\|x_{l,t}-x_{j,t}\|\leq\|v_{l,t-1}-v_{j,t-1}\|.

In addition, by [27, Proposition 1], the bound of the distance between iterates vj,tv_{j,t} and v¯t\bar{v}_{t} satisfies

vj,tv¯t=\displaystyle\|v_{j,t}-\bar{v}_{t}\|= l=1m(λjl,txl,tn1mxl,tn)\displaystyle\Big{\|}\sum_{l=1}^{m}\big{(}\lambda_{jl,t}x_{l,t}^{n}-\frac{1}{m}x_{l,t}^{n}\big{)}\Big{\|}
\displaystyle\leq l=1m|λjl,t1m|xl,tn\displaystyle\sum_{l=1}^{m}\big{|}\lambda_{jl,t}-\frac{1}{m}\big{|}\|x_{l,t}^{n}\|
\displaystyle\leq ΓΞtl=1mxl,tn.\displaystyle\Gamma\Xi^{t}\sum_{l=1}^{m}\|x_{l,t}^{n}\|. (47)

Then, by the triangle inequality,

j=1ml=1mλjl,tvl,t1vj,t1\displaystyle\sum_{j=1}^{m}\sum_{l=1}^{m}\lambda_{jl,t}\|v_{l,t-1}-v_{j,t-1}\|
\displaystyle\leq j=1ml=1mλjl,t(vl,t1v¯t1+vj,t1v¯t1)\displaystyle\sum_{j=1}^{m}\sum_{l=1}^{m}\lambda_{jl,t}(\|v_{l,t-1}-\bar{v}_{t-1}\|+\|v_{j,t-1}-\bar{v}_{t-1}\|)
\displaystyle\leq 2mΓΞt1l=1mxl,t1n.\displaystyle 2m\Gamma\Xi^{t-1}\sum_{l=1}^{m}\|x_{l,t-1}^{n}\|. (48)

Substituting (VII-C) into (VII-C),

j=1mxj,t+1xj,t\displaystyle\sum_{j=1}^{m}\|x_{j,t+1}-x_{j,t}\|
\displaystyle\leq 2mΓΞt1l=1mxl,t1n+mγ(Gϕ+nGf).\displaystyle 2m\Gamma\Xi^{t-1}\sum_{l=1}^{m}\|x_{l,t-1}^{n}\|+m\gamma(G_{\phi}+nG_{f}).

(c) By the definition of x¯t=1mp=1mxp,t\bar{x}_{t}=\frac{1}{m}\sum_{p=1}^{m}x_{p,t}, j=1mxj,tx¯t\sum_{j=1}^{m}\|x_{j,t}-\bar{x}_{t}\| satisfies

j=1mxj,t1mp=1mxp,t\displaystyle\sum_{j=1}^{m}\|x_{j,t}-\frac{1}{m}\sum_{p=1}^{m}x_{p,t}\|
=\displaystyle= j=1m1mp=1m(xj,txp,t)\displaystyle\sum_{j=1}^{m}\|\frac{1}{m}\sum_{p=1}^{m}(x_{j,t}-x_{p,t})\|
\displaystyle\leq 1mj=1mp=1mvj,t1vp,t1\displaystyle\frac{1}{m}\sum_{j=1}^{m}\sum_{p=1}^{m}\|v_{j,t-1}-v_{p,t-1}\|
\displaystyle\leq 1mj=1mp=1m(vj,t1v¯t1+vp,t1v¯t1)\displaystyle\frac{1}{m}\sum_{j=1}^{m}\sum_{p=1}^{m}(\|v_{j,t-1}-\bar{v}_{t-1}\|+\|v_{p,t-1}-\bar{v}_{t-1}\|)
\displaystyle\leq 2mΓΞt1l=1mxl,t1n,\displaystyle 2m\Gamma\Xi^{t-1}\sum_{l=1}^{m}\|x_{l,t-1}^{n}\|,

where the first inequality follows from Proposition II.1 (b) and the last inequality follows from (VII-C). ∎

The following lemma proves that the sequence j=1mxj,tn\sum_{j=1}^{m}\|x_{j,t}^{n}\| is upper bounded by a polynomial-geometric sequence, which is vital for establishing the summability of the error sequences.

Lemma VII.6.

Under Assumptions III.1 and III.2, there exist non-negative scalars Cq=Cq(x1,1n,,xm,1n)C_{q}=C_{q}(x_{1,1}^{n},\cdots,x_{m,1}^{n}), Cq1=Cq1(m,Γ,Cq,Cq2)C_{q}^{1}=C_{q}^{1}(m,\Gamma,C_{q},C_{q}^{2}), Cq2=Cq2(m,γ,Gg,Gh)C_{q}^{2}=C_{q}^{2}(m,\gamma,G_{g},G_{h}) such that for iteration t1t\geq 1,

j=1mxj,tnCq+Cq1t+Cq2t2.\displaystyle\sum_{j=1}^{m}\|x_{j,t}^{n}\|\leq C_{q}+C_{q}^{1}t+C_{q}^{2}t^{2}.
Proof.

This proof is inspired by [28, Lemma 1]. We proceed by induction on tt. First, we show that the result holds for t=1t=1 by choosing Cq=j=1mxj,1nC_{q}=\sum_{j=1}^{m}\|x_{j,1}^{n}\|. It suffices to show that j=1mxj,1n\sum_{j=1}^{m}\|x_{j,1}^{n}\| is bounded given the initial points xj,0nx_{j,0}^{n}.

Indeed, by (38),

j=1mxj,1nj=1mxj,0n+mγ(Gϕ+nGf)<.\sum_{j=1}^{m}\|x_{j,1}^{n}\|\leq\sum_{j=1}^{m}\|x_{j,0}^{n}\|+m\gamma(G_{\phi}+nG_{f})<\infty.

Therefore, Cq=j=1mxj,1n<C_{q}=\sum_{j=1}^{m}\|x_{j,1}^{n}\|<\infty is a valid choice.

Now suppose the result in Lemma VII.6 holds for some positive integer t1t\geq 1. We show that it also holds for t+1t+1. We transform (38) in Lemma VII.5 into

j=1mxj,t+1nj=1mxj,tn+mγ(Gϕ+nGf)+j=1mxj,txj,t1.\displaystyle\sum_{j=1}^{m}\|x_{j,t+1}^{n}\|\!\leq\!\sum_{j=1}^{m}\!\|x_{j,t}^{n}\|\!+\!m\gamma(G_{\phi}\!+\!nG_{f})\!+\!\sum_{j=1}^{m}\!\|x_{j,t}\!-\!x_{j,t-1}\|. (49)

For the last term in (49), by transforming (39) in Lemma VII.5, we obtain

j=1mxj,t+1xj,t2mΓp=1t1Ξpj=1mxj,pn+(t1)mγ(Gϕ+nGf).\displaystyle\sum_{j=1}^{m}\|x_{j,t+1}-x_{j,t}\|\leq 2m\Gamma\sum_{p=1}^{t-1}\Xi^{p}\sum_{j=1}^{m}\|x_{j,p}^{n}\|+(t-1)m\gamma(G_{\phi}+nG_{f}). (50)

Then, substituting the induction hypothesis into each term of (50), we have

j=1mxj,t+1xj,t\displaystyle\sum_{j=1}^{m}\|x_{j,t+1}-x_{j,t}\|\leq 2mΓp=1t1Ξp(Cq+Cq1p+Cq2p2)\displaystyle 2m\Gamma\sum_{p=1}^{t-1}\Xi^{p}(C_{q}+C_{q}^{1}p+C_{q}^{2}p^{2})
+(t1)mγ(Gϕ+nGf).\displaystyle+(t-1)m\gamma(G_{\phi}+nG_{f}).

By Lemma VII.2 and (30), there exist constants S0Ξ,S1Ξ,S2ΞS_{0}^{\Xi},S_{1}^{\Xi},S_{2}^{\Xi} such that

p=1Ξp(Cq+Cq1p+Cq2p2)CqS0Ξ+Cq1S1Ξ+Cq2S2Ξ.\sum_{p=1}^{\infty}\Xi^{p}(C_{q}+C_{q}^{1}p+C_{q}^{2}p^{2})\leq C_{q}S_{0}^{\Xi}+C_{q}^{1}S_{1}^{\Xi}+C_{q}^{2}S_{2}^{\Xi}.

By the induction hypothesis and (49),

j=1mxj,t+1n\displaystyle\sum_{j=1}^{m}\|x_{j,t+1}^{n}\|\leq Cq+Cq1t+Cq2t2+tmγ(Gϕ+nGf)\displaystyle C_{q}+C_{q}^{1}t+C_{q}^{2}t^{2}+tm\gamma(G_{\phi}+nG_{f})
+2mΓ(CqS0Ξ+Cq1S1Ξ+Cq2S2Ξ).\displaystyle+2m\Gamma(C_{q}S_{0}^{\Xi}+C_{q}^{1}S_{1}^{\Xi}+C_{q}^{2}S_{2}^{\Xi}). (51)

Take the coefficients CqC_{q}, Cq1C_{q}^{1} and Cq2C_{q}^{2} as

Cq\displaystyle C_{q} =j=1mxj,1n,\displaystyle=\sum_{j=1}^{m}\|x_{j,1}^{n}\|,
Cq1\displaystyle C_{q}^{1} =2mΓCqS0Ξ+(2mΓS2Ξ1)Cq22mΓS1Ξ1,\displaystyle=\frac{2m\Gamma C_{q}S_{0}^{\Xi}+(2m\Gamma S_{2}^{\Xi}-1)C_{q}^{2}}{2m\Gamma S_{1}^{\Xi}-1},
Cq2\displaystyle C_{q}^{2} =γm2(Gϕ+nGf).\displaystyle=\frac{\gamma m}{2}(G_{\phi}+nG_{f}).
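To make the comparison below explicit (a worked restatement of the step under these choices): since Cq2=γm2(Gϕ+nGf)C_{q}^{2}=\frac{\gamma m}{2}(G_{\phi}+nG_{f}), the term tmγ(Gϕ+nGf)=2Cq2ttm\gamma(G_{\phi}+nG_{f})=2C_{q}^{2}t on the right-hand side of (VII-C) is absorbed by the cross term in Cq2(t+1)2=Cq2t2+2Cq2t+Cq2C_{q}^{2}(t+1)^{2}=C_{q}^{2}t^{2}+2C_{q}^{2}t+C_{q}^{2}, so it remains to ensure the tt-independent condition

2mΓ(CqS0Ξ+Cq1S1Ξ+Cq2S2Ξ)Cq1+Cq2,\displaystyle 2m\Gamma(C_{q}S_{0}^{\Xi}+C_{q}^{1}S_{1}^{\Xi}+C_{q}^{2}S_{2}^{\Xi})\leq C_{q}^{1}+C_{q}^{2},

which the choice of Cq1C_{q}^{1} above is intended to guarantee.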

Then, comparing coefficients, we see that the right-hand side of (VII-C) is upper bounded by Cq+Cq1(t+1)+Cq2(t+1)2C_{q}+C_{q}^{1}(t+1)+C_{q}^{2}(t+1)^{2}, which shows that the claim holds for t+1t+1. ∎

Proof of Proposition IV.2:

Proof.

(a) By (40) in Lemma VII.5,

xj,tx¯t2ΓΞt1l=1mxl,t1n.\displaystyle\|x_{j,t}-\bar{x}_{t}\|\leq 2\Gamma\Xi^{t-1}\sum_{l=1}^{m}\|x_{l,t-1}^{n}\|. (52)

Then, by Lemma VII.6, (12a) and (52), there exist CqC_{q}, Cq1C_{q}^{1}, Cq2>0C_{q}^{2}>0 such that

eti\displaystyle{\|e_{t}^{i}\|} 2LΓΞt1(Cq+Cq1(t1)+Cq2(t1)2)be,t+γC0,\displaystyle\leq\underbrace{2L\Gamma\Xi^{t-1}(C_{q}\!+\!C_{q}^{1}(t-1)\!+\!C_{q}^{2}(t-1)^{2})}_{b_{e,t}}+\gamma C_{0}, (53)

where C0=2LnGfC_{0}=2LnG_{f} and be,tb_{e,t} is a polynomial-geometric sequence.

(b) It follows from (12b), (VII-C) and Lemma VII.6 that

εt+12GϕΓΞtl=1mxl,tn+12γ(ΓΞtl=1mxl,tn)2\displaystyle\varepsilon_{t+1}\leq 2G_{\phi}\Gamma\Xi^{t}\sum_{l=1}^{m}\|x_{l,t}^{n}\|+\frac{1}{2\gamma}(\Gamma\Xi^{t}\sum_{l=1}^{m}\|x_{l,t}^{n}\|)^{2}
2GϕΓΞt(Cq+Cq1t+Cq2t2)+12γ[ΓΞt(Cq+Cq1t+Cq2t2)]2bε,t.\displaystyle\leq\!\underbrace{\!2G_{\phi}\Gamma\Xi^{t}(C_{q}\!+\!C_{q}^{1}t\!+\!C_{q}^{2}t^{2})\!+\!\frac{1}{2\gamma}\!\big{[}\Gamma\Xi^{t}(C_{q}\!+\!C_{q}^{1}t\!+\!C_{q}^{2}t^{2})\big{]}^{2}\!}_{b_{\varepsilon,t}}. (54)

Using the fact that a+ba+b\sqrt{a+b}\leq\sqrt{a}+\sqrt{b} for all nonnegative real numbers a,ba,b,

εt+1\displaystyle\sqrt{\varepsilon_{t+1}}\!\leq\! 2GϕΓΞt(Cq+Cq1t+Cq2t)+12γΓΞtC(t)bε,t,\displaystyle\underbrace{\!\sqrt{2G_{\phi}\Gamma}\sqrt{\Xi^{t}}\big{(}\sqrt{C_{q}}\!+\!\sqrt{C_{q}^{1}t}\!+\!\sqrt{C_{q}^{2}}t\big{)}\!+\!\frac{1}{\sqrt{2\gamma}}\Gamma\Xi^{t}C(t)\!}_{b_{\sqrt{\varepsilon},t}}, (55)

where C(t)=(Cq+Cq1t+Cq2t2)C(t)=(C_{q}+C_{q}^{1}t+C_{q}^{2}t^{2}). ∎

VII-D Proof of Lemma IV.1

Proof.

By Lemma VII.1 and (11), we have

x¯tγg¯tx¯t+1γi=0n1etiet+1pt+1=γd¯t+1,\displaystyle\bar{x}_{t}-\gamma\bar{g}_{t}-\bar{x}_{t+1}-\gamma\underbrace{\sum_{i=0}^{n-1}e_{t}^{i}}_{e_{t+1}}-p_{t+1}=\gamma\bar{d}_{t+1}, (56)

where pt+12γεt+1\|p_{t+1}\|\leq\sqrt{2\gamma\varepsilon_{t+1}}, d¯t+1εt+1ϕ(x¯t+1)\bar{d}_{t+1}\in\partial_{\varepsilon_{t+1}}\phi(\bar{x}_{t+1}), g¯ti=0n1𝐟πti(x¯ti)\bar{g}_{t}\triangleq\sum_{i=0}^{n-1}\nabla\mathbf{f}_{\pi_{t}^{i}}(\bar{x}_{t}^{i}) and 𝐟πti(x)=1mj=1mfj,πti(x)\mathbf{f}_{\pi_{t}^{i}}(x)=\frac{1}{m}\sum_{j=1}^{m}f_{j,\pi_{t}^{i}}(x).

In addition, by (VII-B), x¯tk=x¯tγi=0k1𝐟πti(x¯ti)γi=0k1eti\bar{x}_{t}^{k}=\bar{x}_{t}-\gamma\sum_{i=0}^{k-1}\nabla\mathbf{f}_{\pi_{t}^{i}}(\bar{x}_{t}^{i})-\gamma\sum_{i=0}^{k-1}e_{t}^{i}. Then, combining (56) with (VII-B),

x¯tkx¯t+1\displaystyle\bar{x}_{t}^{k}-\bar{x}_{t+1}
=\displaystyle= x¯tγi=0k1𝐟πti(x¯ti)γi=0k1eti\displaystyle\bar{x}_{t}-\gamma\sum_{i=0}^{k-1}\nabla\mathbf{f}_{\pi_{t}^{i}}(\bar{x}_{t}^{i})-\gamma\sum_{i=0}^{k-1}e_{t}^{i}
(x¯tγi=0n1𝐟πti(x¯ti)γi=0n1etipt+1γd¯t+1)\displaystyle-\big{(}\bar{x}_{t}-\gamma\sum_{i=0}^{n-1}\nabla\mathbf{f}_{\pi_{t}^{i}}(\bar{x}_{t}^{i})-\gamma\sum_{i=0}^{n-1}e_{t}^{i}-p_{t+1}-\gamma\bar{d}_{t+1}\big{)}
=\displaystyle= γi=kn1𝐟πti(x¯ti)+γi=kn1etie~k,t+1+pt+1+γd¯t+1.\displaystyle\gamma\sum_{i=k}^{n-1}\nabla\mathbf{f}_{\pi_{t}^{i}}(\bar{x}_{t}^{i})+\gamma\underbrace{\sum_{i=k}^{n-1}e_{t}^{i}}_{\tilde{e}_{k,t+1}}+p_{t+1}+\gamma\bar{d}_{t+1}.

Taking the squared norm of x¯tkx¯t+1\bar{x}_{t}^{k}-\bar{x}_{t+1} and expanding, we have

x¯tkx¯t+12\displaystyle\|\bar{x}_{t}^{k}-\bar{x}_{t+1}\|^{2}
=\displaystyle= γi=kn1𝐟πti(x¯ti)+γe~k,t+1+pt+1+γd¯t+12\displaystyle\|\gamma\sum_{i=k}^{n-1}\nabla\mathbf{f}_{\pi_{t}^{i}}(\bar{x}_{t}^{i})+\gamma\tilde{e}_{k,t+1}+p_{t+1}+\gamma\bar{d}_{t+1}\|^{2}
\displaystyle\leq 6γ2i=kn1𝐟πti(x¯ti)2+6γ2e~k,t+12\displaystyle 6\gamma^{2}\|\!\sum_{i=k}^{n-1}\!\nabla\mathbf{f}_{\pi_{t}^{i}}(\bar{x}_{t}^{i})\|^{2}\!+\!6\gamma^{2}\|\tilde{e}_{k,t+1}\|^{2}\!
+6γ2d¯t+12+2pt+12\displaystyle+\!6\gamma^{2}\|\bar{d}_{t+1}\|^{2}\!+\!2\|p_{t+1}\|^{2}
\displaystyle\leq 6γ2(2i=kn1(𝐟πti(x¯ti)𝐟πti(x))2+2i=kn1𝐟πti(x)2)\displaystyle 6\gamma^{2}\!\Big{(}\!2\big{\|}\sum_{i=k}^{n-1}\!(\nabla\mathbf{f}_{\pi_{t}^{i}}\!(\bar{x}_{t}^{i})\!-\!\nabla\mathbf{f}_{\pi_{t}^{i}}\!(x_{*}))\big{\|}^{2}\!+\!2\big{\|}\sum_{i=k}^{n-1}\!\nabla\mathbf{f}_{\pi_{t}^{i}}\!(x_{*})\big{\|}^{2}\!\Big{)}
+6γ2e~k,t+12+6γ2d¯t+12+2pt+12\displaystyle+6\gamma^{2}\|\tilde{e}_{k,t+1}\|^{2}+6\gamma^{2}\|\bar{d}_{t+1}\|^{2}+2\|p_{t+1}\|^{2}
(13),(29)\displaystyle\overset{\eqref{fnablf},\eqref{pt_range}}{\leq} 6γ2(4Lni=kn1D𝐟πti(x,x¯ti)+2i=kn1𝐟πti(x)2)\displaystyle 6\gamma^{2}\!\big{(}4Ln\!\sum_{i=k}^{n-1}D_{\mathbf{f}_{\pi_{t}^{i}}}(x_{*},\bar{x}_{t}^{i})+2\|\sum_{i=k}^{n-1}\nabla\mathbf{f}_{\pi_{t}^{i}}(x_{*})\|^{2}\big{)}
+6γ2e~k,t+12+6γ2d¯t+12+4γεt+1\displaystyle+6\gamma^{2}\|\tilde{e}_{k,t+1}\|^{2}+6\gamma^{2}\|\bar{d}_{t+1}\|^{2}+4\gamma\varepsilon_{t+1}
(53),(VII-C)\displaystyle\overset{\eqref{bet},\eqref{bvapt}}{\leq} 6γ2(4Lni=0n1D𝐟πti(x,x¯ti)+2i=kn1𝐟πti(x)2)\displaystyle 6\gamma^{2}\big{(}4Ln\sum_{i=0}^{n-1}D_{\mathbf{f}_{\pi_{t}^{i}}}(x_{*},\bar{x}_{t}^{i})\!+\!2\|\sum_{i=k}^{n-1}\nabla\mathbf{f}_{\pi_{t}^{i}}(x_{*})\|^{2}\big{)}
+12n2γ2be,t2+12n2γ4C02+6γ2d¯t+12+4γbε,t.\displaystyle+12n^{2}\gamma^{2}b_{e,t}^{2}+12n^{2}\gamma^{4}C_{0}^{2}+6\gamma^{2}\|\bar{d}_{t+1}\|^{2}+4\gamma b_{\varepsilon,t}. (57)

Next, consider the term i=kn1𝐟πti(x)2\|\sum_{i=k}^{n-1}\nabla\mathbf{f}_{\pi_{t}^{i}}(x_{*})\|^{2} in (VII-D). By Lemma VII.3 with X¯π=1nki=kn1𝐟πti(x)\bar{X}_{\pi}=\frac{1}{n-k}\sum_{i=k}^{n-1}\nabla\mathbf{f}_{\pi_{t}^{i}}(x_{*}) and X¯=1nf(x)\bar{X}=\frac{1}{n}\nabla f(x_{*}),

𝔼[i=kn1𝐟πti(x)2]\displaystyle\mathbb{E}\Big{[}\|\sum_{i=k}^{n-1}\nabla\mathbf{f}_{\pi_{t}^{i}}(x_{*})\|^{2}\Big{]}
=\displaystyle= (nk)2𝔼[1nki=kn1𝐟πti(x)2]\displaystyle(n-k)^{2}\mathbb{E}\Big{[}\|\frac{1}{n-k}\sum_{i=k}^{n-1}\nabla\mathbf{f}_{\pi_{t}^{i}}(x_{*})\|^{2}\Big{]}
=\displaystyle= (nk)2𝔼[X¯π2]\displaystyle(n-k)^{2}\mathbb{E}\big{[}\|\bar{X}_{\pi}\|^{2}\big{]}
=\displaystyle= (nk)2(X¯2+𝔼[X¯πX¯2])\displaystyle(n-k)^{2}\big{(}\|\bar{X}\|^{2}+\mathbb{E}[\|\bar{X}_{\pi}-\bar{X}\|^{2}]\big{)}
=\displaystyle= (nk)21nf(x)2+(nk)k(n1)σ2,\displaystyle(n-k)^{2}\|\frac{1}{n}\nabla f(x_{*})\|^{2}+(n-k)\frac{k}{(n-1)}\sigma_{*}^{2},

where σ21ni=1n𝐟πti(x)1nf(x)2\sigma_{*}^{2}\triangleq\frac{1}{n}\sum_{i=1}^{n}\|\nabla\mathbf{f}_{\pi_{t}^{i}}(x_{*})-\frac{1}{n}\nabla f(x_{*})\|^{2} is the population variance at the optimum xx_{*}. Summing this over kk from 0 to n1n-1,

k=0n1𝔼[i=kn1𝐟πti(x)2]\displaystyle\sum_{k=0}^{n-1}\mathbb{E}\Big{[}\|\sum_{i=k}^{n-1}\nabla\mathbf{f}_{\pi_{t}^{i}}(x_{*})\|^{2}\Big{]}
\displaystyle\leq k=0n1(nk)21nf(x)2+k=0n1k(nk)n1σ2\displaystyle\sum_{k=0}^{n-1}(n-k)^{2}\|\frac{1}{n}\nabla f(x_{*})\|^{2}+\sum_{k=0}^{n-1}\frac{k(n-k)}{n-1}\sigma_{*}^{2}
=\displaystyle= n(n+1)(2n+1)6n2f(x)2+n(n+1)6σ2\displaystyle\frac{n(n+1)(2n+1)}{6n^{2}}\|\nabla f(x_{*})\|^{2}+\frac{n(n+1)}{6}\sigma_{*}^{2}
\displaystyle\leq 3n4Gf2+n2σ24,\displaystyle\frac{3n}{4}G_{f}^{2}+\frac{n^{2}\sigma_{*}^{2}}{4}, (58)

where the inequality holds due to n2n\geq 2, n+1<2nn+1<2n and Assumption III.1 (d).
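The two closed-form sums used above can be verified symbolically; the following sympy sketch is illustrative and not part of the proof.

import sympy as sp

n, k = sp.symbols('n k', integer=True, positive=True)
s1 = sp.summation((n - k) ** 2, (k, 0, n - 1))           # sum of (n-k)^2
s2 = sp.summation(k * (n - k), (k, 0, n - 1)) / (n - 1)  # sum of k(n-k)/(n-1)
print(sp.simplify(s1 - n * (n + 1) * (2 * n + 1) / 6))   # prints 0
print(sp.simplify(s2 - n * (n + 1) / 6))                 # prints 0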

Thus, summing (VII-D) over kk gives

k=0n1𝔼[x¯tkx¯t+12]\displaystyle\sum_{k=0}^{n-1}\mathbb{E}[\|\bar{x}_{t}^{k}-\bar{x}_{t+1}\|^{2}]
\displaystyle\leq 24γ2Ln2i=0n1𝔼[D𝐟πti(x,x¯ti)]+9γ2nGf2+3γ2n2σ2\displaystyle 24\gamma^{2}Ln^{2}\sum_{i=0}^{n-1}\mathbb{E}[D_{\mathbf{f}_{\pi_{t}^{i}}}(x_{*},\bar{x}_{t}^{i})]+9\gamma^{2}nG_{f}^{2}+3\gamma^{2}n^{2}\sigma_{*}^{2}
+6γ2nGϕ2+12n3γ2be,t2+12n3γ4C02+4nγbε,t,\displaystyle+6\gamma^{2}nG_{\phi}^{2}+12n^{3}\gamma^{2}b_{e,t}^{2}+12n^{3}\gamma^{4}C_{0}^{2}+4n\gamma b_{\varepsilon,t},

where we use (VII-D) and Assumption III.1 (c). ∎

References

  • [1] F. Iutzeler and J. Malick, “Nonsmoothness in machine learning: Specific structure, proximal identification, and applications,” Set-Valued Var. Anal, vol. 28, pp. 661–678, 2020.
  • [2] W. Tao, Z. Pan, G. Wu, and Q. Tao, “The strength of Nesterov’s extrapolation in the individual convergence of nonsmooth optimization,” IEEE Transactions on Neural Networks and Learning Systems, vol. 31, no. 7, pp. 2557–2568, 2020.
  • [3] A. Astorino and A. Fuduli, “Nonsmooth optimization techniques for semisupervised classification,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 12, pp. 2135–2142, 2007.
  • [4] A. H. Zaini and L. Xie, “Distributed drone traffic coordination using triggered communication,” Unmanned Systems, vol. 08, no. 01, pp. 1–20, 2020.
  • [5] K. Wang, Z. Fu, Q. Xu, D. Chen, L. Wang, and W. Yu, “Distributed fixed step-size algorithm for dynamic economic dispatch with power flow limits,” Science China Information Sciences, vol. 64, p. 112202, 2021.
  • [6] P. Yi and Y. Hong, “Stochastic sub-gradient algorithm for distributed optimization with random sleep scheme,” Control Theory Technol., vol. 13, pp. 333–347, 2015.
  • [7] A. Nedic and A. Ozdaglar, “Distributed subgradient methods for multi-agent optimization,” IEEE Transactions on Automatic Control, vol. 54, no. 1, pp. 48–61, 2009.
  • [8] W. Zhong and J. Kwok, “Accelerated Stochastic Gradient Method for Composite Regularization,” in Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, ser. Proceedings of Machine Learning Research, vol. 33.   Reykjavik, Iceland: PMLR, 22–25 Apr 2014, pp. 1086–1094.
  • [9] M. Schmidt, N. L. Roux, and F. Bach, “Convergence rates of inexact proximal-gradient methods for convex optimization,” Advances in Neural Information Processing Systems, vol. 24, pp. 1458–1466, 2012.
  • [10] D. P. Bertsekas, “Incremental gradient, subgradient, and proximal methods for convex optimization: A survey,” arXiv.cs.SY, 2017.
  • [11] Q. Wang, J. Chen, X. Zeng, and B. Xin, “Distributed proximal-gradient algorithms for nonsmooth convex optimization of second-order multiagent systems,” International Journal of Robust and Nonlinear Control, vol. 30, no. 17, pp. 7574–7592, 2020. [Online]. Available: https://onlinelibrary.wiley.com/doi/abs/10.1002/rnc.5199
  • [12] X. Li, G. Feng, and L. Xie, “Distributed proximal algorithms for multiagent optimization with coupled inequality constraints,” IEEE Transactions on Automatic Control, vol. 66, no. 3, pp. 1223–1230, 2021.
  • [13] S. A. Alghunaim, E. K. Ryu, K. Yuan, and A. H. Sayed, “Decentralized proximal gradient algorithms with linear convergence rates,” IEEE Transactions on Automatic Control, vol. 66, no. 6, pp. 2787–2794, 2021.
  • [14] J. Zhang, H. Liu, A. M.-C. So, and Q. Ling, “A penalty alternating direction method of multipliers for convex composite optimization over decentralized networks,” IEEE Transactions on Signal Processing, vol. 69, pp. 4282–4295, 2021.
  • [15] L. Vandenberghe, “Lecture notes for EE 236C, UCLA,” Spring, 2011-2012.
  • [16] F. Sattler, K.-R. Müller, and W. Samek, “Clustered federated learning: Model-agnostic distributed multitask optimization under privacy constraints,” IEEE Transactions on Neural Networks and Learning Systems, vol. 32, no. 8, pp. 3710–3722, 2021.
  • [17] W. Wu, X. Jing, W. Du, and G. Chen, “Learning dynamics of gradient descent optimization in deep neural networks,” Science China Information Sciences, vol. 64, p. 150102, 2021.
  • [18] W. Li, Z. Wu, T. Chen, L. Li, and Q. Ling, “Communication-censored distributed stochastic gradient descent,” IEEE Transactions on Neural Networks and Learning Systems, pp. 1–13, 2021.
  • [19] O. Shamir and T. Zhang, “Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes,” in Proceedings of the 30th International Conference on International Conference on Machine Learning - Volume 28, ser. ICML’13.   JMLR.org, 2013, p. I–71–I–79.
  • [20] P. Bianchi, G. Fort, and W. Hachem, “Performance of a distributed stochastic approximation algorithm,” IEEE Transactions on Information Theory, vol. 59, no. 11, pp. 7405–7418, 2013.
  • [21] J. Lei, H.-F. Chen, and H.-T. Fang, “Asymptotic properties of primal-dual algorithm for distributed stochastic optimization over random networks with imperfect communications,” SIAM Journal on Control and Optimization, vol. 56, no. 3, pp. 2159–2188, 2018. [Online]. Available: https://doi.org/10.1137/16M1086133
  • [22] X. Lian, C. Zhang, H. Zhang, C.-J. Hsieh, W. Zhang, and J. Liu, “Can decentralized algorithms outperform centralized algorithms? a case study for decentralized parallel stochastic gradient descent,” in Proceedings of the 31st International Conference on Neural Information Processing Systems, ser. NIPS’17.   Red Hook, NY, USA: Curran Associates Inc., 2017, p. 5336–5346.
  • [23] Z. Li, B. Liu, and Z. Ding, “Consensus-based cooperative algorithms for training over distributed data sets using stochastic gradients,” IEEE Transactions on Neural Networks and Learning Systems, pp. 1–11, 2021.
  • [24] R. Xin, U. A. Khan, and S. Kar, “Variance-reduced decentralized stochastic optimization with accelerated convergence,” IEEE Transactions on Signal Processing, vol. 68, p. 6255–6271, 2020. [Online]. Available: http://dx.doi.org/10.1109/TSP.2020.3031071
  • [25] M. I. Qureshi, R. Xin, S. Kar, and U. A. Khan, “Push-SAGA: A decentralized stochastic algorithm with variance reduction over directed graphs,” IEEE Control Systems Letters, vol. 6, pp. 1202–1207, 2022.
  • [26] X.-F. Wang, A. R. Teel, K.-Z. Liu, and X.-M. Sun, “Stability analysis of distributed convex optimization under persistent attacks: A hybrid systems approach,” Automatica, vol. 111, p. 108607, 2020. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0005109819304686
  • [27] A. Nedic, A. Ozdaglar, and P. A. Parrilo, “Constrained consensus and optimization in multi-agent networks,” IEEE Transactions on Automatic Control, vol. 55, no. 4, pp. 922–938, 2010.
  • [28] A. I. Chen and A. Ozdaglar, “A fast distributed proximal-gradient method,” in 2012 50th Annual Allerton Conference on Communication, Control, and Computing (Allerton), Oct 2012, pp. 601–608.
  • [29] K. Ahn, C. Yun, and S. Sra, “SGD with shuffling: optimal rates without component convexity and large epoch requirements,” in Advances in Neural Information Processing Systems.   Curran Associates, Inc., 2020.
  • [30] M. Gurbuzbalaban, A. Ozdaglar, and P. A. Parrilo, “Why random reshuffling beats stochastic gradient descent,” Mathematical Programming, vol. 186, pp. 49–84, 2021.
  • [31] B. Recht and C. Ré, “Parallel stochastic gradient algorithms for large-scale matrix completion,” Mathematical Programming Computation, vol. 5, no. 2, pp. 201–226, 2013.
  • [32] A. Rakhlin, O. Shamir, and K. Sridharan, “Making gradient descent optimal for strongly convex stochastic optimization,” in Proceedings of the 29th International Coference on International Conference on Machine Learning, ser. ICML’12.   Madison, WI, USA: Omnipress, 2012, p. 1571–1578.
  • [33] Y. Drori and O. Shamir, “The complexity of finding stationary points with stochastic gradient descent,” in Proceedings of the 37th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol. 119.   PMLR, 13–18 Jul 2020, pp. 2658–2667. [Online]. Available: https://proceedings.mlr.press/v119/drori20a.html
  • [34] A. Nedic and D. P. Bertsekas, “Incremental subgradient methods for nondifferentiable optimization,” SIAM Journal on Optimization, vol. 12, no. 1, pp. 109–138, 2001.
  • [35] D. P. Bertsekas, Optimization for Machine Learning, chapter 4.   The MIT Press, 2011.
  • [36] K. Mishchenko, A. Khaled, and P. Richtárik, “Random reshuffling: Simple analysis with vast improvements,” in Advances in Neural Information Processing Systems, vol. 33.   Curran Associates, Inc., 2020, pp. 17 309–17 320. [Online]. Available: https://proceedings.neurips.cc/paper/2020/file/c8cc6e90ccbff44c9cee23611711cdc4-Paper.pdf
  • [37] Q. Meng, W. Chen, Y. Wang, Z.-M. Ma, and T.-Y. Liu, “Convergence analysis of distributed stochastic gradient descent with shuffling,” Neurocomputing, vol. 337, pp. 46–57, 2019. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0925231219300578
  • [38] K. Mishchenko, A. Khaled, and P. Richtárik, “Proximal and federated random reshuffling,” 2021.
  • [39] L. Xiao and T. Zhang, “A proximal stochastic gradient method with progressive variance reduction,” SIAM Journal on Optimization, vol. 24, no. 4, pp. 2057–2075, 2014. [Online]. Available: https://doi.org/10.1137/140961791
  • [40] Z. Li and J. Li, “A simple proximal stochastic gradient method for nonsmooth nonconvex optimization,” in Advances in Neural Information Processing Systems, vol. 31.   Curran Associates, Inc., 2018. [Online]. Available: https://proceedings.neurips.cc/paper/2018/file/e727fa59ddefcefb5d39501167623132-Paper.pdf
  • [41] N. D. Vanli, M. Gurbuzbalaban, and A. Ozdaglar, “Global convergence rate of incremental aggregated gradient methods for nonsmooth problems,” in 2016 IEEE 55th Conference on Decision and Control (CDC), 2016, pp. 173–178.
  • [42] A. Beck and M. Teboulle, “A fast iterative shrinkage-thresholding algorithm for linear inverse problems,” SIAM Journal on Imaging Sciences, vol. 2, no. 1, pp. 183–202, 2009.
  • [43] X. Li, K. Huang, W. Yang, S. Wang, and Z. Zhang, “On the convergence of FedAvg on Non-IID data,” in International Conference on Learning Representations, 2020. [Online]. Available: https://openreview.net/forum?id=HJxNAnVtDS
  • [44] S. Abu-Mostafa, M. Magdon-Ismail, and H. Lin, Learning from data.   United States of America: AMLbook.com, 2012.
  • [45] P. D. Lorenzo and G. Scutari, “Next: In-network nonconvex optimization,” IEEE Transactions on Signal and Information Processing over Networks, vol. 2, no. 2, pp. 120–136, 2016.