Distributed stochastic proximal algorithm with random reshuffling for non-smooth finite-sum optimization

Xia Jiang, Xianlin Zeng, Jian Sun, Jie Chen, and Lihua Xie This work was supported in part by the National Key R

\&

D Program of China under Grant 2021YFB1714800, the National Natural Science Foundation of China under Grants 61925303, 62088101, 62073035, 62173034, the Natural Science Foundation of Chongqing under Grant 2021ZX4100027 and China Scholarship Council 202006030072. (Corresponding author: Jian Sun.)X. Jiang ([email protected]) and J. Sun ([email protected]) are with Key Laboratory of Intelligent Control and Decision of Complex Systems, School of Automation, Beijing Institute of Technology, Beijing, 100081, China, and also with the Beijing Institute of Technology Chongqing Innovation Center, Chongqing 401120, ChinaX. Zeng ([email protected]) is with Key Laboratory of Intelligent Control and Decision of Complex Systems, School of Automation, Beijing Institute of Technology, Beijing, 100081, ChinaJ. Chen ([email protected]) is with the School of Electronic and Information Engineering, Tongji University, Shanghai, 200082, China, and also with Beijing Advanced Innovation Center for Intelligent Robots and Systems (Beijing Institute of Technology), Key Laboratory of Biomimetic Robots and Systems (Beijing Institute of Technology), Ministry of Education, Beijing, 100081, China.L. Xie ([email protected]) is with the School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore 639798.

Abstract

The non-smooth finite-sum minimization is a fundamental problem in machine learning. This paper develops a distributed stochastic proximal-gradient algorithm with random reshuffling to solve the finite-sum minimization over time-varying multi-agent networks. The objective function is a sum of differentiable convex functions and non-smooth regularization. Each agent in the network updates local variables by local information exchange and cooperates to seek an optimal solution. We prove that local variable estimates generated by the proposed algorithm achieve consensus and are attracted to a neighborhood of the optimal solution with an $\mathcal{O}(\frac{1}{T}+\frac{1}{\sqrt{T}})$ convergence rate, where $T$ is the total number of iterations. Finally, some comparative simulations are provided to verify the convergence performance of the proposed algorithm.

Index Terms:

distributed optimization, proximal operator, random reshuffling, stochastic algorithm, time-varying graphs

I Introduction

Distributed non-smooth finite-sum minimization is a basic problem of the popular supervised machine learning [1, 2, 3]. In machine learning communities, optimizing a training model with merely the average loss over a finite data set usually leads to over-fitting or poor generalization. Some non-smooth regularizations are often included in the cost function to encode prior knowledge, which also introduces the challenge of non-smoothness. What’s more, in many practical network systems [4, 5], training data are naturally stored at different physical nodes and it is expensive to collect data and train the model in one centralized node. Compared to the centralized setting, the distributed setting makes use of multi computational sources to train the learning model in parallel, leading to potential speedup. However, since the data are distributed and the communication is limited, more involved approaches are needed to solve the minimization problem. Therefore, distributed finite-sum minimization has attracted much attention in machine learning, especially for large-scale applications with big data.

For large-scale non-smooth finite-sum optimization problems, there have been many efficient centralized and distributed first-order algorithms, including subgradient-based methods [6, 7, 8] and proximal gradient algorithms [9, 10, 11, 12, 13, 14]. Subgradient-based method is very generic in optimization with the expense of slow convergence. To be specific, subgradient-based algorithms may increase the objective function of the optimization problem for small step-sizes [10]. Proximal gradient algorithms are considered more stable than subgradient-based methods and often own better numerical performance than subgradient-based methods for convex optimization [15]. Hence, proximal gradient algorithms have attracted great interest for large-scale non-smooth optimization problems. Particularly, training data are allocated to different computing nodes and the non-smooth finite-sum optimization needs to be handled efficiently in a distributed manner. Many distributed deterministic proximal gradient algorithms have been proposed with guaranteed convergence for non-smooth optimization over time-invariant and time-varying graphs [12, 13, 14]. These distributed works are developed under the primal-dual framework and are applicable for the constrained non-smooth optimization. However, these distributed algorithms focus on the deterministic non-smooth optimization and require computing the full gradients of all local functions at each iteration. The per iteration computational cost of a deterministic algorithm is much higher than that of a stochastic gradient method, which hinders the application of deterministic algorithms for non-smooth optimization with large-scale data.

In large-scale machine learning settings, various first-order stochastic methods are leading algorithms due to their scalability and low computational requirements. Decentralized stochastic gradient descent algorithms have gained a lot of attention recently, especially for the traditional federated learning setting with a star-shaped network topology [16, 17, 18]. All of these methods are not fully distributed in a sense that they require a central parameter server. The potential bottleneck of the star-shaped network is possible communication traffic jam on the central sever and the performance will be significantly degraded when the network bandwidth is low. To consider a more general distributed network topology without a central server, many distributed stochastic gradient algorithms have been studied [19, 20, 21, 22, 23] for convex finite-sum optimization. To avoid the degenerated performance of algorithms with diminishing steps-sizes, some recent works [22] and [23] have proposed consensus-based distributed stochastic gradient descent methods with non-diminishing (constant) step-sizes for smooth convex optimization over time-invariant networks. With the variance reduction technique to handle the variance of the local stochastic gradients at each node, some distributed efficient stochastic algorithms with constant step-sizes have been studied for strongly convex smooth optimization over time-invariant undirected networks [24] and time-invariant directed networks [25]. However, most of the existing distributed stochastic algorithms are designed for smooth optimization and are only applicable over time-invariant networks.

In practice, a communication network may be time-varying, since network attacks may cause failures of communications and connectivity of a network may change, especially for mobile networks [26, 4]. Therefore, it is necessary to study distributed algorithms over time-varying networks. Considering the unreliable network communication, some works have studied distributed asynchronous stochastic algorithms with diminishing step-sizes for smooth convex optimization over multi-agent networks [20, 21]. To avoid the degenerated performance of algorithms with diminishing step-sizes, [7] developed a distributed gradient algorithm with a multi-step consensus mapping for multi-agent optimization. This multi-step consensus mapping has also been widely applied in the follow-up works [27, 28] for multi-agent consensus and non-smooth optimization over time-varying networks. Differently from these deterministic works, this paper studies a distributed stochastic gradient algorithm with sample-without-replacement heuristic for non-smooth finite-sum optimization over time-varying multi-agent networks.

Random reshuffling (RR) is a simple, popular but elusive stochastic method for finite-sum minimization in machine learning. Contrasted with stochastic gradient descent (SGD), where the training data are sampled uniformly with replacement, the sampling-without-replacement RR method learns from each data point in each epoch and often own better performance in many practical problems[29, 30, 31]. The convergence properties of SGD are well-understood with tight lower and upper bounds in many settings [32, 33], however, the theoretical analysis of RR had not been studied until recent years. The sampling-without-replacement in RR introduces a significant complication that the gradients are now biased, which implies that a single iteration does not approximate a full gradient descent step. The incremental gradient algorithm (IG) is one special case of the shuffling algorithm. The IG generates a deterministic permutation before the start of epochs and reuses the permutation in all subsequent epochs. In contrast, in the RR, a new permutation is generated at the beginning of each epoch. The implementation of IG has a challenge of choosing a suitable permutation for cycling through the iterations [34], and the IG is susceptible to bad orderings compared to RR [35]. Therefore, it is necessary to study a random reshuffling algorithm for convex non-smooth optimization. For strongly convex optimization, the seminal work [30] theoretically characterized various convergence rates for the random reshuffling method. One inspiring recent work by Richtarik [36] has provided some involved yet insightful proofs for the convergence behavior of RR under weak assumptions. For smooth finite-sum minimization, [37] has studied decentralized stochastic gradient with shuffling and provided insights for practical data processing. What’s more, the recent work [38] has studied centralized proximal gradient descent algorithms with random reshuffling (Prox-RR) for non-smooth finite-sum optimization and extended the Prox-RR to federated learning. However, these works are not applicable to fully distributed settings and the Prox-RR needs the strong convexity assumption, which hinders its application.

Inspired by the existing deterministic and stochastic algorithms, we develop a distributed stochastic proximal algorithm with random reshuffling (DPG-RR) for non-smooth finite-sum optimization over time-varying multi-agent networks, extending the random reshuffling to distributed non-smooth convex optimization.

The contributions of this paper are summarized as follows.

•

For non-smooth convex optimization, we propose a distributed stochastic algorithm DPG-RR over time-varying networks. Although there are a large number of works on proximal SGD [39, 40], few works have studied how to extend random reshuffling to solve convex optimization with non-smooth regularization. This paper extends the recent centralized proximal algorithm Prox-RR in [38] for strongly convex non-smooth optimization to general convex optimization and distributed settings. Another very related decentralized work is [37], which has provided important insights to distributed stochastic gradient descent with data shuffling under master-slave frameworks. This paper extends [37] to fully distributed settings and makes it applicable for non-smooth optimization. To our best knowledge, this is the first attempt to study fully distributed stochastic proximal algorithms with random reshuffling over time-varying networks.
•

The proposed algorithm DPG-RR needs fewer proximal operators compared with proximal SGD and is applicable for time-varying networks. To be specific, for the non-smooth optimization with $n$ local functions, the standard proximal SGD applies the proximal operator after each stochastic gradient step [40] and needs $n$ proximal evaluations. In contrast, the proposed algorithm DPG-RR only operates one single proximal evaluation at each iteration. In addition, with the design of multi-step consensus mapping, this proposed algorithm enables local variable estimates closer to each other, which is crucial for distributed non-smooth optimization over time-varying networks. Compared with the existing incremental gradient algorithm in [41], the proposed random reshuffling algorithm in this paper requires a weaker assumption, where the cost function does not need to be strongly-convex.
•

This work provides a complete and rigorous theoretical convergence analysis for the proposed DPG-RR over time-varying multi-agent networks. Due to the sampling-without-replacement mechanism of random reshuffling, the local stochastic gradients are biased estimators of the full gradient. In addition, because of scattered information in the distributed setting, there are differences between local and global objective functions, and thus local and the global gradients. To handle the differences between local and global gradients, this paper proves the summability of the error sequences of the estimated gradients. Then, we prove the convergence of the transformed stochastic algorithm by some novel proof techniques inspired by [36].

The remainder of the paper is organized as follows. The preliminary mathematical notations and proximal operator are introduced in section II. The problem description and the design of a distributed solver are provided in section III. The convergence performance of the proposed algorithm is analyzed in section IV. Numerical simulations are provided in section V and the conclusion is made in section VI.

II Notations and Preliminaries

II-A Notations

We denote $\mathbb{R}$ the set of real numbers, $\mathbb{N}$ the set of natural numbers, $\mathbb{R}^{n}$ the set of $n$ -dimensional real column vectors and $\mathbb{R}^{n\times m}$ the set of $n$ -by- $m$ real matrices, respectively. We denote $v^{\prime}$ the transpose of vector $v$ . In addition, $\left\|\cdot\right\|$ denotes the Euclidean norm, $\langle\cdot,\cdot\rangle$ the inner product, which is defined by $\langle a,b\rangle=a^{\prime}b$ . The vectors in this paper are column vectors unless otherwise stated. The notation $[m]$ denotes the set $\{1,\cdots,m\}$ . For a differentiable function $f:\mathbb{R}^{n}\to\mathbb{R}$ , $\nabla f(x)$ denotes the gradient of function $f$ with respect to $x$ . For a convex non-smooth function $h:\mathbb{R}^{n}\to\mathbb{R}$ , $\partial h(x)$ denotes the subdifferential of function $h$ at $x$ . The $\varepsilon$ -subdifferential of a non-smooth convex function $h$ at ${x}$ , denoted by $\partial_{\varepsilon}h(x)$ , is the set of vectors $y$ such that for all $q$ ,

\displaystyle h(q)-h({x})\geq y^{\prime}(q-{x})-\varepsilon.

(1)

II-B Proximal Operator

For a proper non-differentiable convex function $h:\mathbb{R}^{n}\to(-\infty,\infty]$ and a scalar $\alpha>0$ , the proximal operator is defined as

\displaystyle{\rm prox}_{\alpha,h}(x)={\rm argmin}_{z\in\mathbb{R}^{n}}{h(z)+\frac{1}{2\alpha}\|z-x\|^{2}}.

(2)

The minimum is attained at a unique point $y={\rm prox}_{\alpha,h}(x)$ , which means the proximal operator is a single-valued map. In addition, it follows from the optimality condition for convex optimization that

\displaystyle 0\in\partial h(y)+\frac{1}{\alpha}(y-x).

(3)

The following proposition presents some properties of the proximal operator.

Proposition II.1.

[42] Let $h:\mathbb{R}^{n}\to(-\infty,\infty]$ be a closed proper convex function. For a scalar $\alpha>0$ and $x\in\mathbb{R}^{n}$ , let $y={\rm prox}_{\alpha,h}(x)$ . For $x,\hat{x}\in\mathbb{R}^{n}$ , $\|{\rm prox}_{\alpha,h}(x)-{\rm prox}_{\alpha,h}(\hat{x})\|\leq\|x-\hat{x}\|$ , which is also called the nonexpansiveness of proximal operator.

When there exists error $\varepsilon$ in the computation of proximal operator, we denote the inexact proximal operator by ${\rm prox}_{\alpha,h}^{\varepsilon}(\cdot)$ , which is defined as

	$\displaystyle{\rm prox}_{\alpha,h}^{\varepsilon}(x)\triangleq\Big{\{}\tilde{x}\big{\|}$	$\displaystyle\frac{1}{2\alpha}\\|\tilde{x}-x\\|^{2}+h(\tilde{x})\leq$
		$\displaystyle\varepsilon+\min_{z\in\mathbb{R}^{d}}\big{\{}\frac{1}{2\alpha}\\|z-x\\|^{2}+h(z)\big{\}}\Big{\}}.$		(4)

III Problem Description And Algorithm Design

In this paper, we aim to solve the following non-smooth finite-sum optimization problem over a multi-agent network,

\displaystyle\min_{x\in\mathbb{R}^{d}}F(x),\ F(x)=f(x)+\phi(x)=\!\frac{1}{m}\!\sum_{j=1}^{m}\sum_{i=1}^{n}f_{j,i}(x)+\phi(x)

(5)

where $x\in\mathbb{R}^{d}$ is the unknown decision variable, $m$ is the number of agents in the network and $n$ is the number of local samples. The global sample size of the multi-agent network is $N=mn$ . The cost function $f(x)$ in (5) denotes the global cost of the multi-agent network and is differentiable. The regularization $\phi$ is non-smooth, which encodes some prior knowledge. In the multi-agent network, each agent $j$ only knows $n$ local samples and the corresponding cost functions $f_{j,i}(\cdot)$ to update the local decision variable. In addition, the non-smooth regularization $\phi$ is available to all agents. In this setting, we assume that each agent uses local cost functions and communications with its neighbors to obtain the same optimal solution $x_{*}$ of problem (5).

In real-world applications of distributed settings, it is difficult to keep all communication links between agents connected and stable due to the the existence of networked attacks and limited bandwidth. We consider a general time-varying multi-agent communication network $\mathcal{G}(t)=([m],\mathcal{E}(t),A(t))$ , where $\mathcal{E}(t)$ denotes the set of communication edges at time $t$ . At each time $t$ , agent $j$ can only communicate with agent $i$ if and only if $(j,i)\in\mathcal{E}(t)$ . In addition, the neighbors of agent $j$ at time $t$ are the agents that agent $j$ can communicate with at time $t$ . The adjacent matrix $A(t)$ of the network is a square $m\times m$ matrix, whose elements indicate the weights of the edges in $\mathcal{E}(t)$ .

For the finite-sum optimization (5), we make the following assumption.

Assumption III.1.

(a)

Functions $f$ and $\phi$ are convex.
(b)

Local function $f_{j,i}$ has a Lipschitz-continuous gradient with constant $L>0$ , i.e. for every $x,y\in\mathbb{R}^{d}$ ,

$\displaystyle\|\nabla f_{j,i}(x)-\nabla f_{j,i}(y)\|\leq L\|x-y\|,$ (6)

while the regular function $\phi$ is non-smooth.
(c)

There exists a scalar $G_{\phi}$ such that for each sub-gradient $z\in\partial\phi(x)$ , $\|z\|<G_{\phi}$ , for every $x\in\mathbb{R}^{d}$ .
(d)

There exists a scalar $G_{f}$ such that for each agent $j$ and every $x\in\mathbb{R}^{d}$ , $\|\nabla f_{j,i}(x)\|<G_{f}$ .
(e)

Further, $F$ is lower bounded by some $F_{*}\in\mathbb{R}$ and the optimization problem owns at least one optimal solution $x_{*}$ .

Assumption III.1 is common in stochastic algorithm research [37, 43] and finite-sum minimization problems satisfying Assumption III.1 arise in many applications. Here, we provide some real-world examples.

Example 1: In binary classification problem [44], the logistic regression optimization aims to obtain a predictor $x$ to estimate the categorical variables of the testing data. When the large-scale training samples are allocated to $m$ different agents, the local cost function in (5) is a cross-entropy error function (CNF), $f_{j,i}(x)=\ln(1+\exp(-l_{j,i}\langle a_{j,i},x\rangle))$ , where $a_{j,i}$ and $l_{j,i}$ denote the feature vector and categorical value of the $i$ th training sample at agent $j$ . Then, the gradient of the convex function $f_{j,i}$ is bounded and satisfies the Lipschitz condition in (6).

For the communication topology, the time-varying multi-agent network of the finite-sum optimization (5) satisfies the following assumption.

Assumption III.2.

Consider the undirected time-varying network $\mathcal{G}(t)$ with adjacent matrices $A(t)=[a_{ij}(t)]$ , $t=1,2,\cdots$

(a)

For each $t$ , the adjacent matrix $A(t)$ is doubly stochastic.
(b)

There exists a scalar $\eta\in(0,1)$ such that $a_{jj}(t)\geq\eta$ for all $j\in[m]$ . In addition, $a_{ij}(t)\geq\eta$ if $\{i,j\}\in\mathcal{E}(t)$ , and $a_{ij}(t)=0$ otherwise.
(c)

The time-varying graph sequence $\mathcal{G}(t)$ is uniformly connected. That is, there exists an integer $B\geq 1$ such that agent $j$ sends its information to all other agents at least once every $B$ consecutive time slots.

Remark III.1.

In Assumption III.2, part (a) is common for undirected graphs and balanced directed graphs. Part (b) means that each agent gives non-negligible weights to the local estimate and the estimates received from neighbors. Part (c) implies that the time-varying network is able to transmit information among any pair of agents in bounded time. Assumption III.2 is widely adopted in the existing literature [7, 28].

Next, we design a distributed proximal gradient algorithm with random reshuffling for each agent $j\in[m]$ in the multi-agent network. At the beginning of each iteration $t$ , we sample indices $\pi_{t}^{0},\cdots,\pi_{t}^{n-1}$ without replacement from $\{1,\cdots,n\}$ such that the generated $\pi_{t}=\{\pi_{t}^{0},\cdots,\pi_{t}^{n-1}\}$ is a random permutation of $\{1,\cdots,n\}$ . Then, we process with $n$ inner iterations of the form

\displaystyle x_{j,t}^{i+1}

\displaystyle=x_{j,t}^{i}-\gamma\nabla f_{j,\pi_{t}^{i}}(x_{j,t}^{i}),\ i\in\{0,\cdots,n-1\},

(7)

where $\gamma$ is a constant step-size, the superscript $i$ denotes the $i$ th inner iteration, the subscripts $j$ and $t$ denote the $j$ th agent and $t$ th outer iteration, respectively.

Then, to achieve the consensus between different agents, a multi-step consensus mapping is applied as

\displaystyle v_{j,t}

\displaystyle=\sum_{l=1}^{m}\lambda_{jl,t}x_{l,t}^{n},

(8)

where $\lambda_{jl,t}$ is the $(j,l)$ th element of matrix $\Phi(\mathbb{T}(t)+t,\mathbb{T}(t))$ for $j,l\in\{1,\cdots,m\}$ . The notation $\mathbb{T}(t)$ is the total number of communication steps before iteration $t$ , and $\Phi$ is a transition matrix, defined as

\displaystyle\Phi(t,s)=A(t)A(t-1)\cdots A(s+1)A(s),\quad t>s\geq 0,

(9)

where $A(t)=[a_{ij}(t)]_{i,j=1,\cdots,m}$ is the adjacent matrix of the multi-agent network at iteration $t$ . To be specific, using vector notations $v_{t}=[v_{j,t}]_{j=1,\cdots,m}$ and $x_{t}^{n}=[x_{j,t}^{n}]_{j=1,\cdots,m}$ , we rewrite (8) as

	$\displaystyle v_{t}=$	$\displaystyle\Phi(\mathbb{T}(t)+t,\mathbb{T}(t))x_{t}^{n}$
	$\displaystyle=$	$\displaystyle A(\mathbb{T}(t)+t)A(\mathbb{T}(t)+t-1)\cdots A(\mathbb{T}(t))x_{t}^{n}.$

Agents perform $t$ communication steps at iteration $t$ . At each communication step, agents exchange their estimates $x_{j,t}^{n}$ and linearly combine the received estimates using weights $A(t)$ . This mapping (8) is referred as a multi-step consensus mapping because linear combinations of estimates bring the estimates of different agents close to each other.

Finally, for the non-smooth regular function, we use proximal operator

\displaystyle x_{j,t+1}={\rm prox}_{\gamma,\phi}(v_{j,t})

(10)

to obtain the local variable estimate $x_{j,t+1}$ of agent $j$ at the next iteration.

The proposed distributed proximal stochastic gradient algorithm with random reshuffling (DPG-RR)¹¹1We suppose the local shuffling is sufficient, i.e. the permutation after shuffling is uniformly distributed. is formally summarized in Algorithm 1.

Algorithm 1 DPG-RR for agent

j\in[m]

1:(S.1) Initialization:

2:Step-size

\gamma>0

; initial vector

x_{j,0}\in\mathbb{R}^{d}

for

j\in[m]

; number of epoches

T

3:(S.2) Iterations:

4:for

t=0,1,\cdots,T-1

5: Generate a random permutation

\pi_{t}=(\pi_{t}^{0},\cdots,\pi_{t}^{n-1})

x_{j,t}^{0}=x_{j,t}

7: for

i=0,1,\cdots,n-1

\displaystyle x_{j,t}^{i+1}=x_{j,t}^{i}-\gamma\nabla f_{j,\pi_{t}^{i}}(x_{j,t}^{i})

9: end for

10:

	$\displaystyle v_{j,t}$	$\displaystyle=\sum_{l=1}^{m}\lambda_{jl,t}x_{l,t}^{n}$
	$\displaystyle x_{j,t+1}$	$\displaystyle={\rm prox}_{\gamma,\phi}(v_{j,t})$

11:end for

Remark III.2.

Compared with the standard proximal SGD, where the proximal operator is applied after each stochastic gradient descent, the proposed DPG-RR only operates one single proximal evaluation at each iteration, which is $\frac{1}{n}$ of the proximal evaluations in the proximal SGD. Therefore, the proposed DPG-RR owns lower computational burden and is applicable for the non-smooth optimization where the proximal operator is expensive to calculate.

Remark III.3.

Compared with the fully random sampling procedure, which is used in the well-known stochastic gradient descent method (SGD), one main advantage of the random reshuffling procedure is its intrinsic ability to avoid cache misses when reading the data from memory, enabling a faster implementation. In addition, a random reshuffling algorithm usually converges in fewer iterations than distributed SGD[30, 38]. It is due to that the RR procedure learns from each sample in each epoch while the SGD can miss learning from some samples in any given epoch. The gradient method with the incremental sampling procedure is a special case of the random reshuffling gradient method, which has been investigated in [36].

Remark III.4.

The work in [37] has studied the convergence rates for strongly convex, convex and non-convex optimization under different shuffling procedures. [37] has provided a unified analysis framework for three shuffling procedures, including global shuffling, local shuffling and insufficient shuffling²²2If the training samples of machine learning task are centrally stored, the global shuffling is usually applied, where the entire training samples are shuffled after each epoch. Otherwise, local shuffling, where each agent randomly shuffles local training samples after each epoch, is preferred to reduce communication burden.. Unlike the work [37], this paper focuses on non-smooth optimization with a local shuffling procedure, which is more applicable for the distributed setting. In addition, this proposed distributed DPG-RR owns a comparable convergence rate and allows non-smooth functions and time-varying communications between different agents for convex finite-sum optimization.

The convergence performance of the proposed algorithm is discussed in the following theorem and the proof is provided in the next section.

Theorem III.1.

Suppose that Assumptions III.1, III.2 hold and the step-size $\gamma=M/{\sqrt{T}},\ M\leq\sqrt{6}/6Ln$ . Then, Algorithm 1 possesses the properties:

(1)

The local variable estimates $x_{j,t}$ achieve consensus and $\lim_{t\to\infty}\|x_{j,t}-\bar{x}_{t}\|=0$ , where $\bar{x}_{t}\triangleq\frac{1}{m}\sum_{j=1}^{m}x_{j,t}$ .
(2)

In addition, $\mathbb{E}[F(\hat{x}_{T})-F_{*}]=\mathcal{O}(\frac{1}{T}+\frac{1}{\sqrt{T}})$ , where $\hat{x}_{T}\triangleq\frac{1}{mT}\sum_{t=1}^{T}\sum_{j=1}^{m}x_{j,t}$ .

Remark III.5.

If the local sample size $n$ increases, the convergence rate of the proposed algorithm will become slower. It follows from the proof of Theorem III.1 that the constant term in the convergence rate is proportional to the third-order polynomial of $n$ . Since the proposed algorithm is distributed, we can reduce the effect of local sample size $n$ on the convergence rate by using more agents for large-scale problems. Whereas more agents may require higher communication costs over multi-agent networks. In addition, the convergence rate also depends on other factors, such as the Lipschitz constant of objective functions, the upper bounds of subgradients, the shuffling variance, and the bounded intercommunication interval.

Remark III.6.

Theorem III.1 extends that of the centralized Prox-RR in [38] to distributed settings and non-strongly convex optimization. Compared with the popular stochastic sub-gradient method with a diminishing step-size for non-smooth optimization, whose convergence rate is $\mathcal{O}(\frac{\log(T)}{\sqrt{T}})$ [19], the proposed DPG-RR has a faster convergence rate.

IV Theoretical analysis

In this section, we present theoretical proofs for the convergence performance of the proposed algorithm. The proof sketch includes three parts.

(a)

Transform the DPG-RR to an algorithm with some error sequences, and prove that the error sequences are summable³³3 Let $\{a_{n}\}$ be a sequence of real numbers. The series $\sum_{k=1}^{\infty}a_{k}$ is summable if and only if the sequence $x_{n}\triangleq\sum_{k=1}^{n}a_{k},\ n\in\mathbb{N}$ converges..
(b)

Estimate the bound of the forward per-epoch deviation of the DPG-RR.
(c)

Prove the consensus property of local variables and the convergence performance in Theorem III.1.

Each part is discussed in detail in the subsequent subsections. To present the transformation, we define

	$\displaystyle\bar{x}_{t}^{i}\triangleq\frac{1}{m}\sum_{j=1}^{m}x_{j,t}^{i}\in\mathbb{R}^{d},\ \bar{x}_{t}\triangleq\frac{1}{m}\sum_{j=1}^{m}x_{j,t}\in\mathbb{R}^{d},$
	$\displaystyle\bar{v}_{t}\triangleq\frac{1}{m}\sum_{j=1}^{m}v_{j,t}\in\mathbb{R}^{d},\ z_{t+1}\triangleq{\rm prox}_{\gamma,\phi}(\bar{v}_{t}).$

IV-A Transformation of DPG-RR and the summability of error sequences

Proposition IV.1.

Suppose Assumptions III.1 and III.2 hold. The average variable of Algorithm 1 satisfies

\displaystyle\bar{x}_{t+1}

\displaystyle\in{\rm prox}_{\gamma,\phi}^{\varepsilon_{t+1}}\Big{(}\bar{x}_{t}-\gamma\big{(}\sum_{i=0}^{n-1}(\nabla\mathbf{f}_{\pi_{t}^{i}}(\bar{x}_{t}^{i})+e_{t}^{i})\big{)}\Big{)},

(11)

where $\mathbf{f}_{\pi_{t}^{i}}(x)=\frac{1}{m}\sum_{j=1}^{m}f_{j,\pi_{t}^{i}}(x)$ , and the error sequences $\{e_{t}^{i}\}_{t=0}^{\infty}$ and $\{\varepsilon_{t+1}\}_{t=0}^{\infty}$ satisfy


$\displaystyle\\|e_{t}^{i}\\|$	$\displaystyle\leq\frac{L}{m}\sum_{j=1}^{m}\\|x_{j,t}^{i}-\bar{x}_{t}^{i}\\|,$	(12a)
$\displaystyle\varepsilon_{t+1}$	$\displaystyle\leq\frac{2G_{\phi}}{m}\sum_{j=1}^{m}\\|v_{j,t}\!-\!\bar{v}_{t}\\|\!+\!\frac{1}{2\gamma}(\frac{1}{m}\sum_{j=1}^{m}\\|v_{j,t}\!-\!\bar{v}_{t}\\|)^{2}.$	(12b)

Proof.

See Appendix VII-B. ∎

Then, inspired by Section III.B of [28], we discuss the summabilities of error sequences $\{\gamma\|e_{t}\|\}$ and $\{\varepsilon_{t}\}$ in the following proposition.

Proposition IV.2.

Under Assumptions III.1 and III.2, error sequences $\{e_{t}^{i}\}$ and $\{\varepsilon_{t}\}$ satisfy $\left\|e_{t}^{i}\right\|\leq b_{e,t}+\gamma C_{0}$ , ${\varepsilon_{t}}\leq b_{\varepsilon,t}$ and $\sqrt{\varepsilon_{t}}\leq b_{\sqrt{\varepsilon},t}$ , where $b_{e,t}$ , $b_{\varepsilon,t}$ and $b_{\sqrt{\varepsilon},t}$ are polynomial-geometric sequences and $C_{0}$ is a constant.

Proof.

See Appendix VII-C. ∎

IV-B Boundness of the forward per-epoch deviation

Before presenting the boundness analysis, we introduce some necessary quantities for the random reshuffling technique.

Definition IV.1.

[36] For any $i$ , the quantity $D_{f_{i}}(x,y)\triangleq f_{i}(x)-f_{i}(y)-\langle\nabla f_{i}(y),x-y\rangle$ is the Bregman divergence between $x$ and $y$ associated with function $f_{i}$ .

If the function $f_{i}$ is $L$ -smooth, then for all $x,y\in\mathbb{R}^{d}$ , $D_{f_{i}}(x,y)\leq\frac{L}{2}\|x-y\|^{2}$ . The difference between the gradients of a convex and $L$ -smooth function $f_{i}$ satisfies

\displaystyle\|\nabla f_{i}(x)-\nabla f_{i}(y)\|^{2}\leq 2LD_{f_{i}}(x,y).

(13)

In addition, the forward per-epoch deviation of the proposed algorithm is introduced as follows.

Definition IV.2.

The forward per-epoch deviation of DPG-RR for the $t$ -th epoch is defined as

\displaystyle\mathcal{V}_{t}\triangleq\sum_{i=0}^{n-1}\|\bar{x}_{t}^{i}-\bar{x}_{t+1}\|^{2}.

(14)

Now, we show that the forward per-epoch deviation $\mathcal{V}_{t}$ of DPG-RR is upper bounded in the following lemma.

Lemma IV.1.

If Assumption III.1 holds, then the forward per-epoch deviation $\mathcal{V}_{t}$ satisfies

	$\displaystyle\mathbb{E}[\mathcal{V}_{t}]\leq$	$\displaystyle 24\gamma^{2}Ln^{2}\sum_{i=0}^{n-1}\mathbb{E}[D_{\mathbf{f}_{\pi_{t}^{i}}}(x_{},\bar{x}_{t}^{i})]+3\gamma^{2}n^{2}\sigma_{}^{2}+9\gamma^{2}nG_{f}^{2}$
		$\displaystyle+6\gamma^{2}nG_{\phi}^{2}+12n^{3}(\gamma^{2}b_{e,t}^{2}+\gamma^{4}C_{0}^{2})+4n\gamma b_{\varepsilon,t},$		(15)

where $\mathcal{V}_{t}$ is defined in (14), $\sigma_{*}^{2}\triangleq\frac{1}{n}\sum_{i=1}^{n}\|\nabla\mathbf{f}_{\pi_{t}^{i}}(x_{*})-\frac{1}{n}\nabla f(x_{*})\|^{2}$ , $b_{e,t}$ and $b_{\varepsilon,t}$ are defined in (53) and (VII-C), respectively.

Proof.

See Appendix VII-D. ∎

IV-C Proof of Theorem III.1

Proof.

(1) First, we prove the consensus property of the generated local variables, i.e. $\lim_{t\to\infty}x_{j,t}\to\bar{x}_{t}$ for all agent $j$ . By (40) in Lemma VII.5, $\sum_{j=1}^{m}\|x_{j,t}-\bar{x}_{t}\|\leq 2m\Gamma\Xi^{t}\sum_{l=1}^{m}\|x_{l,t-1}^{n}\|$ holds. In addition, with Lemma VII.2 and Lemma VII.6, we obtain that

\sum_{t=1}^{\infty}2m\Gamma\Xi^{t}\sum_{l=1}^{m}\|x_{l,t-1}^{n}\|<\infty.

Then, $\sum_{t=1}^{\infty}\sum_{j=1}^{m}\|x_{j,t}-\bar{x}_{t}\|<\infty$ . Because $\sum_{j=1}^{m}\|x_{j,t}-\bar{x}_{t}\|$ is nonnegative,

\lim_{t\to\infty}\sum_{j=1}^{m}\|x_{j,t}-\bar{x}_{t}\|=0,

and hence, $\lim_{t\to\infty}\|x_{j,t}-\bar{x}_{t}\|=0$ for each agent $j$ . Therefore, local variable estimates $x_{j,t}$ of all agents achieve consensus and converge to the average $\bar{x}_{t}$ as $t\to\infty$ .

(2) Next, we consider the convergence rate of the proposed algorithm. By Lemma VII.1 and (11), we have

\displaystyle\bar{x}_{t}=\gamma\bar{g}_{t}+\bar{x}_{t+1}+\gamma e_{t+1}+p_{t+1}+\gamma\bar{d}_{t+1},

(16)

where $e_{t+1}=\sum_{i=0}^{n-1}e_{t}^{i}$ , $\bar{d}_{t+1}\in\partial_{\varepsilon_{t+1}}\phi(\bar{x}_{t+1})$ and $\bar{g}_{t}\triangleq\sum_{i=0}^{n-1}\nabla\mathbf{f}_{\pi_{t}^{i}}(\bar{x}_{t}^{i})$ . Then, the square of the norm between $\bar{x}_{t}$ and the optimal solution $x_{*}$ satisfies

	$\displaystyle\\|\bar{x}_{t}-x_{*}\\|^{2}$
$\displaystyle=$	$\displaystyle\\|\gamma\bar{g}_{t}+\gamma{e_{t+1}}+p_{t+1}+\gamma\bar{d}_{t+1}+\bar{x}_{t+1}-x_{*}\\|^{2}$
$\displaystyle\geq$	$\displaystyle\\|\bar{x}_{t+1}-x_{}\\|^{2}+2\gamma\sum_{i=0}^{n-1}\langle\bar{x}_{t+1}-x_{},\nabla\mathbf{f}_{\pi_{t}^{i}}(\bar{x}_{t}^{i})\rangle$
	$\displaystyle+2\langle\bar{x}_{t+1}-x_{*},\gamma e_{t+1}+p_{t+1}+\gamma\bar{d}_{t+1}\rangle.$	(17)

For the second term in (IV-C), we have for any $i$ ,

	$\displaystyle\langle\bar{x}_{t+1}-x_{*},\nabla\mathbf{f}_{\pi_{t}^{i}}(\bar{x}_{t}^{i})\rangle$
$\displaystyle=$	$\displaystyle\mathbf{f}_{\pi_{t}^{i}}(\bar{x}_{t+1})-\mathbf{f}_{\pi_{t}^{i}}(x_{*})$
	$\displaystyle+\mathbf{f}_{\pi_{t}^{i}}(x_{})-\mathbf{f}_{\pi_{t}^{i}}(\bar{x}_{t}^{i})-\langle\nabla\mathbf{f}_{\pi_{t}^{i}}(\bar{x}_{t}^{i}),x_{}-\bar{x}_{t}^{i}\rangle$
	$\displaystyle-[\mathbf{f}_{\pi_{t}^{i}}(\bar{x}_{t+1})-\mathbf{f}_{\pi_{t}^{i}}(\bar{x}_{t}^{i})-\langle\nabla\mathbf{f}_{\pi_{t}^{i}}(\bar{x}_{t}^{i}),\bar{x}_{t+1}-\bar{x}_{t}^{i}\rangle]$
$\displaystyle=$	$\displaystyle[\mathbf{f}_{\pi_{t}^{i}}(\bar{x}_{t+1})-\mathbf{f}_{\pi_{t}^{i}}(x_{})]\!+\!D_{\mathbf{f}_{\pi_{t}^{i}}}(x_{},\bar{x}_{t}^{i})\!-\!D_{\mathbf{f}_{\pi_{t}^{i}}}(\bar{x}_{t+1},\bar{x}_{t}^{i}).$	(18)

Summing $\mathbf{f}_{\pi_{t}^{i}}(\bar{x}_{t+1})-\mathbf{f}_{\pi_{t}^{i}}(x_{*})$ over $i$ from $0$ to $n-1$ gives

\displaystyle\sum_{i=0}^{n-1}[\mathbf{f}_{\pi_{t}^{i}}(\bar{x}_{t+1})-\mathbf{f}_{\pi_{t}^{i}}(x_{*})]=f(\bar{x}_{t+1})-f_{*}.

(19)

Next, we bound the third term in (IV-C) with $L$ -smoothness in Assumption III.1 (b) as

\displaystyle D_{\mathbf{f}_{\pi_{t}^{i}}}(\bar{x}_{t+1},\bar{x}_{t}^{i})\leq\frac{L}{2}\|\bar{x}_{t+1}-\bar{x}_{t}^{i}\|^{2}.

(20)

By summing (20) over $i$ from $0$ to $n-1$ , we obtain an upper bound of the forward deviation $\mathcal{V}_{t}$ by Lemma IV.1 that

		$\displaystyle\sum_{i=0}^{n-1}\mathbb{E}[D_{\mathbf{f}_{\pi_{t}^{i}}}(\bar{x}_{t+1},\bar{x}_{t}^{i})]\leq\frac{L}{2}\mathbb{E}[\mathcal{V}_{t}]$
	$\displaystyle\leq$	$\displaystyle 12\gamma^{2}L^{2}n^{2}\sum_{i=0}^{n-1}\mathbb{E}[D_{\mathbf{f}_{\pi_{t}^{i}}}(x_{},\bar{x}_{t}^{i})]+3\gamma^{2}L\frac{n^{2}\sigma_{}^{2}}{2}$
		$\displaystyle+\!\frac{9}{2}\gamma^{2}LnG_{f}^{2}\!+\!3\gamma^{2}nLG_{\phi}^{2}\!+\!6Ln^{3}(\gamma^{2}b_{e,t}^{2}\!+\!\gamma^{4}C_{0}^{2})\!+\!2Ln\gamma b_{\varepsilon,t}.$

With the above inequality, we estimate the lower bound of the sum of the second and third term in (IV-C) as

	$\displaystyle\sum_{i=0}^{n-1}\mathbb{E}[D_{\mathbf{f}_{\pi_{t}^{i}}}(x_{*},\bar{x}_{t}^{i})-D_{\mathbf{f}_{\pi_{t}^{i}}}(\bar{x}_{t+1},\bar{x}_{t}^{i})]$
$\displaystyle\geq$	$\displaystyle(1-12\gamma^{2}L^{2}n^{2})\sum_{i=0}^{n-1}\mathbb{E}[D_{\mathbf{f}_{\pi_{t}^{i}}}(x_{},\bar{x}_{t}^{i})]-3\gamma^{2}L\frac{n^{2}\sigma_{}^{2}}{2}$
	$\displaystyle\!-\!\frac{9}{2}\gamma^{2}LnG_{f}^{2}\!-\!3\gamma^{2}nLG_{\phi}^{2}\!-\!6Ln^{3}(\gamma^{2}b_{e,t}^{2}\!+\!\gamma^{4}C_{0}^{2})\!-\!2Ln\gamma b_{\varepsilon,t}$
$\displaystyle\geq$	$\displaystyle-3\gamma^{2}L\frac{n^{2}\sigma_{*}^{2}}{2}-\frac{9}{2}\gamma^{2}LnG_{f}^{2}-3\gamma^{2}nLG_{\phi}^{2}$
	$\displaystyle-6Ln^{3}(\gamma^{2}b_{e,t}^{2}+\gamma^{4}C_{0}^{2})-2Ln\gamma b_{\varepsilon,t},$	(21)

where the last inequality holds since $(1-12\gamma^{2}L^{2}n^{2})\geq 0$ by $\gamma\leq\sqrt{3}/6Ln$ , and $\sum_{i=0}^{n-1}D_{\mathbf{f}_{\pi_{t}^{i}}}(x_{*},\bar{x}_{t}^{i})$ is non-negative due to the convexity.

By rearranging (IV-C), we have

	$\displaystyle\\|\bar{x}_{t+1}-x_{*}\\|^{2}$
$\displaystyle\overset{\eqref{inabx},\eqref{ff}}{\leq}$	$\displaystyle\\|\bar{x}_{t}-x_{}\\|^{2}-2\gamma(f(\bar{x}_{t+1})-f_{})$
	$\displaystyle-\sum_{i=0}^{n-1}2\gamma(D_{\mathbf{f}_{\pi_{t}^{i}}}(x_{*},\bar{x}_{t}^{i})-D_{\mathbf{f}_{\pi_{t}^{i}}}(\bar{x}_{t+1},\bar{x}_{t}^{i}))$
	$\displaystyle-2\langle\bar{x}_{t+1}-x_{},\gamma\bar{d}_{t+1}\rangle\!-2\langle\bar{x}_{t+1}-x_{},\gamma e_{t+1}\!+\!p_{t+1}\rangle$
$\displaystyle\leq$	$\displaystyle\\|\bar{x}_{t}-x_{}\\|^{2}-2\gamma(f(\bar{x}_{t+1})-f_{})$
	$\displaystyle-\sum_{i=0}^{n-1}2\gamma(D_{\mathbf{f}_{\pi_{t}^{i}}}(x_{*},\bar{x}_{t}^{i})-D_{\mathbf{f}_{\pi_{t}^{i}}}(\bar{x}_{t+1},\bar{x}_{t}^{i}))$
	$\displaystyle-2\langle\bar{x}_{t+1}-x_{*},\gamma\bar{d}_{t+1}\rangle$
	$\displaystyle+2\\|\bar{x}_{t+1}-x_{*}\\|(\gamma\\|e_{t+1}\\|+\\|p_{t+1}\\|)$
$\displaystyle\overset{\eqref{bet},\eqref{sqrtbvapt}}{\leq}$	$\displaystyle\\|\bar{x}_{t}-x_{}\\|^{2}\!-\!2\gamma(f(\bar{x}_{t+1})\!-\!f_{})\!-\!2\langle\bar{x}_{t+1}-x_{*},\gamma\bar{d}_{t+1}\rangle$
	$\displaystyle-\sum_{i=0}^{n-1}2\gamma(D_{\mathbf{f}_{\pi_{t}^{i}}}(x_{*},\bar{x}_{t}^{i})-D_{\mathbf{f}_{\pi_{t}^{i}}}(\bar{x}_{t+1},\bar{x}_{t}^{i}))$
	$\displaystyle+2\\|\bar{x}_{t+1}-x_{*}\\|(n\gamma b_{e,t}+n\gamma^{2}C_{0}+\sqrt{2\gamma}b_{\sqrt{\varepsilon},t}).$	(22)

where the second inequality holds due to the Cauchy Schwarz inequality and the triangle inequality.

Taking expectation of (IV-C) and by (IV-C),

		$\displaystyle\mathbb{E}[\\|\bar{x}_{t+1}-x_{*}\\|^{2}]$
	$\displaystyle\leq$	$\displaystyle\mathbb{E}[\\|\bar{x}_{t}-x_{}\\|^{2}]-2\gamma\mathbb{E}[f(\bar{x}_{t+1})-f_{}]$
		$\displaystyle+9\gamma^{3}LnG_{f}^{2}+3\gamma^{3}Ln^{2}\sigma_{*}^{2}+6\gamma^{3}nLG_{\phi}^{2}$
		$\displaystyle+12n^{3}(\gamma^{3}Lb_{e,t}^{2}+\gamma^{5}C_{0}^{2})+4\gamma^{2}Lnb_{\varepsilon,t}$
		$\displaystyle-2\gamma\mathbb{E}[\langle\bar{x}_{t+1}-x_{},\bar{d}_{t+1}\rangle]+2\lambda_{t}\mathbb{E}[\\|\bar{x}_{t+1}-x_{}\\|]$
	$\displaystyle\leq$	$\displaystyle\mathbb{E}[\\|\bar{x}_{t}-x_{}\\|^{2}]-2\gamma\mathbb{E}[f(\bar{x}_{t+1})-f_{}]+9\gamma^{3}LnG_{f}^{2}$
		$\displaystyle+3\gamma^{3}Ln^{2}\sigma_{*}^{2}+6\gamma^{3}nLG_{\phi}^{2}$
		$\displaystyle+12n^{3}(\gamma^{3}Lb_{e,t}^{2}+\gamma^{5}C_{0}^{2})+4\gamma^{2}Lnb_{\varepsilon,t}$
		$\displaystyle-2\gamma\mathbb{E}[\phi(\bar{x}_{t+1})-\phi_{}]+2\gamma b_{\varepsilon,t}+2\lambda_{t}\mathbb{E}[\\|\bar{x}_{t+1}-x_{}\\|],$

where $\lambda_{t}=\gamma nb_{e,t}+n\gamma^{2}C_{0}+\sqrt{2\gamma}b_{\sqrt{\varepsilon},t}$ , the second inequality holds because of $\langle\bar{x}_{t+1}-x_{*},\bar{d}_{t+1}\rangle+\varepsilon_{t+1}\geq\phi(\bar{x}_{t+1})-\phi_{*}$ and (VII-C).

Then, rearranging the above result leads to

		$\displaystyle 2\gamma\mathbb{E}[F(\bar{x}_{t+1})-F_{*}]$
	$\displaystyle\leq$	$\displaystyle\mathbb{E}[\\|\bar{x}_{t}-x_{}\\|^{2}]-\mathbb{E}[\\|\bar{x}_{t+1}-x_{}\\|^{2}]+9\gamma^{3}LnG_{f}^{2}$
		$\displaystyle+3\gamma^{3}Ln^{2}\sigma_{*}^{2}+6\gamma^{3}nLG_{\phi}^{2}+12n^{3}(\gamma^{3}Lb_{e,t}^{2}+\gamma^{5}C_{0}^{2})$
		$\displaystyle+4\gamma^{2}Lnb_{\varepsilon,t}+2\gamma b_{\varepsilon,t}+2\lambda_{t}\mathbb{E}[\\|\bar{x}_{t+1}-x_{*}\\|].$

Summing over $t$ from $0$ to $T-1$ ,

	$\displaystyle 2\gamma\sum_{t=0}^{T-1}\mathbb{E}[F(\bar{x}_{t+1})-F_{*}]$
$\displaystyle\leq$	$\displaystyle\\|\bar{x}_{0}-x_{}\\|^{2}\!-\mathbb{E}[\\|\bar{x}_{T}-x_{}\\|^{2}]\!+9T\gamma^{3}LnG_{f}^{2}\!+3T\gamma^{3}Ln^{2}\sigma_{*}^{2}$
	$\displaystyle+6T\gamma^{3}nLG_{\phi}^{2}+12Tn^{3}\gamma^{5}C_{0}^{2}+12\gamma^{3}Ln^{3}\sum_{t=0}^{T-1}b_{e,t}^{2}$
	$\displaystyle+2\gamma(1\!+\!2\gamma nL)\sum_{t=0}^{T-1}b_{\varepsilon,t}\!+\!2\sum_{t=0}^{T-1}\lambda_{t}\mathbb{E}[\\|\bar{x}_{t+1}\!-\!x_{*}\\|].$	(23)

Then, we need to estimate an upper bound of $\mathbb{E}[\|\bar{x}_{t+1}-x_{*}\|]$ . By the transformation of (IV-C) and the fact that $\mathbb{E}[\|\bar{x}_{T}-x_{*}\|]^{2}\leq\mathbb{E}[\|\bar{x}_{T}-x_{*}\|^{2}]$ ,

		$\displaystyle\mathbb{E}[\\|\bar{x}_{T}-x_{*}\\|]^{2}$
	$\displaystyle\leq$	$\displaystyle S_{T}+2\sum_{t=0}^{T-1}\lambda_{t}\mathbb{E}[\\|\bar{x}_{t+1}-x_{*}\\|].$

where $S_{T}=\|\bar{x}_{0}-x_{*}\|^{2}+9T\gamma^{3}LnG_{f}^{2}+3T\gamma^{3}L{n^{2}}\sigma_{*}^{2}+6T\gamma^{3}nLG_{\phi}^{2}+12n^{3}\gamma^{5}TC_{0}^{2}+12\gamma^{3}Ln^{3}\sum_{t=0}^{T-1}b_{e,t}^{2}+2\gamma(1+2\gamma nL)\sum_{t=0}^{T-1}b_{\varepsilon,t}$ . It follows from Lemma VII.4 that

	$\displaystyle\mathbb{E}[\\|\bar{x}_{T}\!-\!x_{*}\\|]$	$\displaystyle\leq\sum_{t=0}^{T-1}(\gamma nb_{e,t}+n\gamma^{2}C_{0}+\sqrt{2\gamma}b_{\sqrt{\varepsilon},t})$
		$\displaystyle+\!\Big{(}S_{T}\!+\!\big{(}\sum_{t=0}^{T-1}\!(\gamma nb_{e,t}\!+\!n\gamma^{2}C_{0}\!+\!\sqrt{2\gamma}b_{\sqrt{\varepsilon},t})\big{)}^{2}\Big{)}^{\frac{1}{2}},$

where the sequences $b_{e,t}$ , $b_{\varepsilon,t}$ and $b_{\sqrt{\varepsilon},t}$ are polynomial-geometric sequences and are summable by Lemma VII.2.

By the summability of $b_{e,t}$ , $b_{\varepsilon,t}$ and $b_{\sqrt{\varepsilon},t}$ , there exist some scalars A and B such that

\displaystyle A_{T}=\sum_{t=0}^{T-1}(\gamma nb_{e,t}+n\gamma^{2}C_{0}+\sqrt{2\gamma}b_{\sqrt{\varepsilon},t})\leq\textbf{A}<\infty

(24)

where we use the step-size range $\gamma\leq 1/\sqrt{T}$ . And

	$\displaystyle B_{T}=$	$\displaystyle T\textbf{D}+12\gamma^{3}Ln^{3}\sum_{t=0}^{T-1}b_{e,t}^{2}+2\gamma(1+2\gamma nL)\sum_{t=0}^{T-1}b_{\varepsilon,t},$
	$\displaystyle\leq$	$\displaystyle T\textbf{D}+\textbf{B},$		(25)

where $\textbf{D}\triangleq\gamma^{3}(9LnG_{f}^{2}+3L{n^{2}}\sigma_{*}^{2}+6nLG_{\phi}^{2}+12n^{3}\gamma^{2}C_{0}^{2})$ . Then, $\mathbb{E}[\|\bar{x}_{T}-x_{*}\|]$ satisfies

\displaystyle\mathbb{E}[\|\bar{x}_{T}-x_{*}\|]\leq\textbf{A}+(\|\bar{x}_{0}-x_{*}\|^{2}+(T\textbf{D}+\textbf{B})+\textbf{A}^{2})^{\frac{1}{2}}.

Since $\{A_{t}\}$ and $\{B_{t}\}$ are increasing sequences, we have for $t\leq T$ ,

$\displaystyle\mathbb{E}[\\|\bar{x}_{t}-x_{*}\\|]\leq$	$\displaystyle A_{T}+(\\|\bar{x}_{0}-x_{*}\\|^{2}+B_{T}+A_{T}^{2})^{\frac{1}{2}}$
$\displaystyle\leq$	$\displaystyle 2A_{T}+\\|\bar{x}_{0}-x_{*}\\|+\sqrt{B_{T}}$
$\displaystyle\leq$	$\displaystyle 2\textbf{A}+\\|\bar{x}_{0}-x_{*}\\|+\sqrt{T\textbf{D}+\textbf{B}}.$	(26)

Now, we can bound the right-hand side of (IV-C) with (IV-C),

		$\displaystyle 2\gamma\sum_{t=1}^{T}\mathbb{E}[F(\bar{x}_{t})-F_{*}]$
	$\displaystyle\leq$	$\displaystyle\\|\bar{x}_{0}-x_{}\\|^{2}+B_{T}+2A_{T}(2A_{T}+\\|\bar{x}_{0}-x_{}\\|+\sqrt{B_{T}})$
	$\displaystyle\leq$	$\displaystyle\\|\bar{x}_{0}\!-\!x_{}\\|^{2}\!+\!T\textbf{D}\!+\!\textbf{B}\!+\!2\textbf{A}(2\textbf{A}\!+\!\\|\bar{x}_{0}-x_{}\\|\!+\!\sqrt{T\textbf{D}\!+\!\textbf{B}}).$

By dividing both sides by $2\gamma T$ , we get

		$\displaystyle\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}[F(\bar{x}_{t})-F_{*}]$
	$\displaystyle\leq$	$\displaystyle\frac{\\|\bar{x}_{0}\!-\!x_{}\\|^{2}}{2\gamma T}\!+\!\frac{\textbf{D}}{2\gamma}\!+\!\frac{\textbf{B}}{2\gamma T}\!+\!\frac{\textbf{A}}{\gamma T}\!(2\textbf{A}\!+\!\\|\bar{x}_{0}\!-\!x_{}\\|\!+\!\sqrt{\!T\textbf{D}\!+\!\textbf{B}}).$

Finally, by the convexity of $F$ , the average iterate $\hat{x}_{T}=\frac{1}{T}\sum_{t=1}^{T}\bar{x}_{t}$ satisfies

	$\displaystyle\mathbb{E}[F(\hat{x}_{T})-F_{*}]$
$\displaystyle\leq$	$\displaystyle\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}[F(\bar{x}_{t})-F_{*}]$
$\displaystyle\leq$	$\displaystyle\frac{\\|\bar{x}_{0}\!-\!x_{}\\|^{2}}{2\gamma T}\!+\!\frac{\textbf{D}}{2\gamma}\!+\!\frac{\textbf{B}}{2\gamma T}\!+\!\frac{\textbf{A}}{\gamma T}(2\textbf{A}\!+\!\\|\bar{x}_{0}\!-\!x_{}\\|\!+\!\sqrt{\!T\textbf{D}\!+\!\textbf{B}})$
$\displaystyle\leq$	$\displaystyle\frac{\\|\bar{x}_{0}-x_{}\\|^{2}}{2\gamma T}+\frac{\textbf{A}}{\gamma T}(2\textbf{A}+\\|\bar{x}_{0}-x_{}\\|)+\frac{\textbf{A}\sqrt{\textbf{B}}}{\gamma T}+\frac{\textbf{B}}{2\gamma T}$
	$\displaystyle+\frac{\textbf{A}\sqrt{\textbf{D}}}{\gamma\sqrt{T}}+\frac{\textbf{D}}{2\gamma},$
$\displaystyle=$	$\displaystyle\mathcal{O}(\frac{1}{T}+\frac{1}{T^{1/2}})$	(27)

where $\gamma\leq T^{-1/2}$ and the last equality holds due to the fact that $\sqrt{\mathbf{D}}/{\gamma}=\mathcal{O}(T^{-1/4})$ , and $\mathbf{D}/{\gamma}=\mathcal{O}(T^{-1})$ . ∎

V Simulation

In this section, we apply the proposed algorithm DPG-RR to optimize the black-box binary classification problem [44], which is to find the optimal predictor $x\in\mathbb{R}^{n}$ by solving

\displaystyle{\rm min}_{x\in\mathbb{R}^{d}}F(x),\ F(x)=\frac{1}{m}\sum_{j=1}^{m}f_{j}(x)+\lambda_{1}\|x\|_{1},

(28)

where $f_{j}(x)\triangleq\sum_{i=1}^{n}\ln(1+\exp(-l_{j,i}\langle a_{j,i},x\rangle))$ , $a_{j,i}\in\mathbb{R}^{d}$ is the feature vector of the $i$ th local sample of agent $j$ , $l_{j,i}\in\{-1,1\}$ is the classification value of the $i$ th local sample of agent $j$ and $\{a_{j,i},l_{j,i}\}_{i=1}^{n}$ denotes the set of local training samples of agent $j$ . In this experiment, we use the publicly available real datasets a9a and w8a ⁴⁴4a9a and w8a are available in the website www.csie.ntu.edu.tw/ cjlin/libsvmtools/datasets/.. In addition, in our setting, there are $m=10$ agents and $\lambda_{1}=5\times 10^{-4}$ . All comparative algorithms are initialized with a same value. Experiment codes and the adjacent matrices of ten-agent time-varying networks are provided at https://github.com/managerjiang/Dis-Prox-RR.

Refer to caption — (a) Consensus trajectories

(1) Time-invariant networks: For comparison, we apply some existing distributed stochastic algorithms including GT-SAGA in [24] and DSGD in [22], and the proposed stochastic DPG-RR to solve (28) over the ten-agent connected network. Since the cost function in (28) is non-smooth, we replace the gradients in GT-SAGA and DSGD with subgradients. The simulation results for a9a and w8a datasets are shown in Figs. 1 and 2. Fig. 1(b) and Fig. 2(b) show that the trajectories of objective function $F(x)$ generated by all algorithms converge quickly to the optimum $F_{*}$ . To measure the consensus performance, we define one quantity $D(x)$ ⁵⁵5The $D(x)$ is an expansion of $\hat{x}^{\top}L\hat{x}$ , where $L\triangleq\mathcal{L}\otimes I_{d}\in\mathbb{R}^{md\times md}$ , $\mathcal{L}$ is the Laplacian matrix of the undirected multi-agent network and $\hat{x}=[x_{1},\cdots,x_{m}]^{\top}\in\mathbb{R}^{md}$ . In addition, the Laplacian matrix of an undirected multi-agent network is positive semi-definite. By the fact that $0$ is the single eigenvalue corresponding to eigenvector $1_{md}$ of $L$ , $L\hat{x}=0_{md}$ if and only if $x_{i}=x_{j}$ for all $i,j\in\{1,\cdots,m\}$ . as $D(x)=\sum_{i=1}^{m}x_{i}^{\prime}\sum_{j=1}^{m}a_{ij}(x_{i}-x_{j})$ , where $a_{ij}$ is $(i,j)$ th element of one doubly-stochastic adjacent matrix. If all local variable estimates achieve consensus, then, $D(x)=0$ . Fig. 1(a) and Fig. 2(a) represent that local variables generated by all algorithms achieve consensus. In addition, it is seen that the proposed algorithm DPG-RR owns a better consensus performance than GT-SAGA and DSGD, which verify the effectiveness of the multi-step consensus mapping in DPG-RR.

Although all comparative algorithms have comparable convergence rate in practice, [24] and [22] do not provide the theoretical convergence analysis of GT-SAGA and DSGD for non-smooth optimization. In addition, these algorithms are not applicable for time-varying multi-agent networks, hence, we further compare our proposed algorithm with some recent distributed deterministic algorithms for non-smooth optimization over time-varying networks.

(2) Time-varying networks: We compare DPG-RR with the distributed deterministic proximal algorithm Prox-G in [28], the algorithm NEXT in [45] and the distributed subgradient method (DGM) in [7] to solve (28) over the time-varying graphs. DPG-RR and Prox-G adopt a same constant step-size, NEXT and DGM have diminishing step-sizes.

All distributed algorithms are applied over periodically time-varying ten-agent networks satisfying Assumption III.2 to solve the problem (28), which are shown in Fig. 3.

TABLE I: Running time of different algorithms

	time(s)
Algorithms	a9a	w8a
DPG-RR	25.21	140.89
Prox-G	55.50	161.44
NEXT	80.28	264.78
DGM	30.91	120.51

The simulation results for a9a and w8a datasets are shown in Fig. 4 and Fig. 5, respectively. It is seen from Fig. 4(a) and Fig. 5(a) that the trajectories generated by DPG-RR, Prox-G, NEXT and DGM all converge to zero, implying that local variable estimates generated by all algorithms achieve consensus. In addition, the DPG-RR shows a better consensus numerical performance than NEXT and DGM algorithms. From Fig. 4(b) and Fig. 5(b), we can see that DPG-RR converges faster than NEXT and DGM over both a9a and w8a datasets, which verify that DPG-RR is superior to algorithms with diminishing step-sizes. We observe that DPG-RR has a comparable convergence performance of Prox-G. While, due to the stochastic technique, DPG-RR avoids the computation of full local gradients in Prox-G. The running time of different algorithms are shown in Table I. It is observed that the proposed stochastic DPG-RR generally costs less running time than the deterministic algorithms Prox-G and NEXT.

To show the advantages of random reshuffling sampling procedure, we replace the random reshuffling step of DPG-RR with a stochastic sampling procedure, yielding a DPG-SG algorithm. Similarly, we replace the random reshuffling step of DPG-RR with a deterministic incremental sampling procedure, yielding a DPG-IG algorithm. All algorithms take the same constant step-size ( $0.3$ ) and the simulation results are shown in Fig. 6. Compared with fully stochastic sampling procedure (DPG-SG), the proposed DPG-RR shows a faster convergence. In addition, the DPG-IG algorithm shows a comparable convergence rate with DPG-RR. These numerical results are consistent with the theoretical findings discussed in Remark III.3.

VI Conclusion

Making use of random reshuffling, this paper has developed one distributed stochastic proximal-gradient algorithm for solving large-scale non-smooth convex optimization over time-varying multi-agent graphs. The proposed algorithm operates only one single proximal evaluation at each epoch, owning significant advantages in scenarios where the proximal mapping is computationally expensive. In addition, the proposed algorithm owns constant step-sizes, overcoming the degraded performance of most stochastic algorithms with diminishing step-sizes. One future research direction is to improve the convergence rate of DPG-RR by some accelerated methods, such as Nesterov’s acceleration technique or variance reduction technique. Another one is to extend the proposed algorithm for distributed non-smooth non-convex optimization, which is applicable for deep neural network models.

VII Appendix

VII-A Some vital Lemmas

At first, the following lemma characterizes the $\varepsilon_{t}$ -subdifferential of non-smooth function $\phi$ at $\bar{x}_{t}$ , $\partial_{\varepsilon_{t}}\phi(\bar{x}_{t})$ .

Lemma VII.1.

[9, Lemma 2] If $\bar{x}_{t}$ is an $\varepsilon_{t}$ -optimal solution to (2) in the sense of (II-B) with $y=\bar{x}_{t-1}-\gamma(\bar{g}_{t-1}+e_{t})$ , then there exists $p_{t}\in\mathbb{R}^{d}$ such that

\displaystyle\|p_{t}\|\leq\sqrt{2\gamma\varepsilon_{t}}

(29)

and $\frac{1}{\gamma}(\bar{x}_{t-1}-\bar{x}_{t}-\gamma\bar{g}_{t-1}-\gamma e_{t}-p_{t})\in\partial_{\varepsilon_{t}}\phi(\bar{x}_{t}).$

The next lemma shows that polynomial-geometric sequences are summable, which is vital for the analysis of error sequences introduced by transformation.

Lemma VII.2.

[28, Proposition 3] Let $\zeta\in(0,1)$ , and let

P_{k,N}=\{c_{N}k^{N}+\cdots+c_{1}k+c_{0}|c_{j}\in\mathbb{R},j=0,\cdots,N\}

denote the set of all $N$ -th order polynomials of $k$ , where $N\in\mathbb{N}$ . Then for every polynomial $p_{k}\in P_{k,N}$ ,

\sum_{k=1}^{\infty}p_{k}\zeta^{k}<\infty.

The result of this Lemma for $P_{k,N}=\{k^{N}\}$ will be particularly useful for the analysis in the following. Hence, we define

\displaystyle S_{N}^{\zeta}\triangleq\sum_{k=1}^{\infty}k^{N}\zeta^{k}<\infty.

(30)

The following lemma characterizes the variance of sampling without replacement, which is a key ingredient in our convergence result of DPG-RR.

Lemma VII.3.

[36, Lemma 1] Let $X_{1},\cdots,X_{n}\in\mathbb{R}^{d}$ be fixed vectors, $\bar{X}\triangleq\frac{1}{n}\sum_{i=1}^{n}X_{i}$ be their average and $\sigma^{2}\triangleq\frac{1}{n}\sum_{i=1}^{n}\|X_{i}-\bar{X}\|^{2}$ be the population variance. Fix any $k\in\{1,\cdots,n\}$ , let $X_{\pi_{1}},\cdots,X_{\pi_{k}}$ be sampled uniformly without replacement from $\{X_{1},\cdots,X_{n}\}$ and $\bar{X}_{\pi}$ be their average. Then, the sample average and variance are given by

\displaystyle\mathbb{E}[\bar{X}_{\pi}]=\bar{X},\quad\mathbb{E}[\|\bar{X}_{\pi}-\bar{X}\|^{2}]=\frac{n-k}{k(n-1)}\sigma^{2}.

Finally, we provide a lemma on the boundness of one special nonnegative sequence, which is vital for the convergence analysis.

Lemma VII.4.

[9, Lemma 1] Assume that the nonnegative sequence $\{u_{T}\}$ satisfies the following recursion for all $k\geq 1$

\displaystyle u_{T}^{2}\leq S_{T}+\sum_{t=1}^{T}\lambda_{t}u_{t},

(31)

with $\{S_{T}\}$ an increasing sequence, $S_{0}\geq u_{0}^{2}$ and $\lambda_{t}\geq 0$ for all $t$ . Then, for all $T\geq 1$ ,

\displaystyle u_{T}\leq\frac{1}{2}\sum_{t=1}^{T}\lambda_{t}+\Big{(}S_{T}+\big{(}\frac{1}{2}\sum_{t=1}^{T}\lambda_{t}\big{)}^{2}\Big{)}^{\frac{1}{2}}.

(32)

VII-B Proof of Proposition IV.1

Proof.

By taking the average of (7) over $m$ agents,

	$\displaystyle\bar{x}_{t}^{i+1}=$	$\displaystyle\bar{x}_{t}^{i}-\gamma(\frac{1}{m}\sum_{j=1}^{m}\nabla f_{j,\pi_{t}^{i}}(\bar{x}_{t}^{i})+e_{t}^{i})$
	$\displaystyle=$	$\displaystyle\bar{x}_{t}^{i}-\gamma(\nabla\mathbf{f}_{\pi_{t}^{i}}(\bar{x}_{t}^{i})+e_{t}^{i}),$		(33)

where $e_{t}^{i}=\frac{1}{m}\sum_{j=1}^{m}(\nabla f_{j,\pi_{t}^{i}}(x_{j,t}^{i})-\nabla f_{j,\pi_{t}^{i}}(\bar{x}_{t}^{i}))$ and $\mathbf{f}_{\pi_{t}^{i}}(x)=\frac{1}{m}\sum_{j=1}^{m}f_{j,\pi_{t}^{i}}(x)$ . Because of the Lipschitz-continuity of the gradient of $f_{j,\pi_{t}^{i}}(x)$ , $\|e_{t}^{i}\|$ satisfies

\displaystyle\|e_{t}^{i}\|\leq\frac{L}{m}\sum_{j=1}^{m}\|x_{j,t}^{i}-\bar{x}_{t}^{i}\|.

(34)

Recall that $z_{t+1}\triangleq{\rm prox}_{\gamma,\phi}(\bar{v}_{t})={\rm argmin}_{x}\{\phi(x)+\frac{1}{2\gamma}\|x-\bar{v}_{t}\|^{2}\}$ , which is the result of the exact centralized proximal step. Then, we relate $z_{t+1}$ and $\bar{x}_{t+1}$ by formulating the latter as an inexact proximal step with error $\varepsilon_{t+1}$ . A simple algebraic expansion gives

		$\displaystyle\phi(\bar{x}_{t+1})+\frac{1}{2\gamma}\\|\bar{x}_{t+1}-\bar{v}_{t}\\|^{2}$
	$\displaystyle\leq$	$\displaystyle\phi(z_{t+1})+G_{\phi}\\|\bar{x}_{t+1}-z_{t+1}\\|+\frac{1}{2\gamma}\Big{(}\\|z_{t+1}-\bar{v}_{t}\\|^{2}$
		$\displaystyle+2\langle z_{t+1}-\bar{v}_{t},\bar{x}_{t+1}-z_{t+1}\rangle+\\|\bar{x}_{t+1}-z_{t+1}\\|^{2}\Big{)}$
	$\displaystyle\leq$	$\displaystyle\min_{z\in\mathbb{R}^{d}}\{\phi(z)+\frac{1}{2\gamma}\\|z-\bar{v}_{t}\\|^{2}\}+\frac{1}{2\gamma}\\|\bar{x}_{t+1}-z_{t+1}\\|^{2}$
		$\displaystyle+\\|\bar{x}_{t+1}-z_{t+1}\\|(G_{\phi}+\frac{1}{\gamma}\\|z_{t+1}-\bar{v}_{t}\\|),$

where we have used the fact that $\phi(z_{t+1})+\frac{1}{2\gamma}\|z_{t+1}-\bar{v}_{t}\|^{2}=\min_{z}\{\phi(z)+\frac{1}{2\gamma}\|z-\bar{v}_{t}\|^{2}\}$ and $\langle z_{t+1}-\bar{v}_{t},\bar{x}_{t+1}-z_{t+1}\rangle\leq\|z_{t+1}-\bar{v}_{t}\|\|\bar{x}_{t+1}-z_{t+1}\|$ in the last inequality.

Therefore, we can write

\displaystyle\bar{x}_{t+1}\in{\rm prox}_{\gamma,\phi}^{\varepsilon_{t+1}}(\bar{v}_{t}),

(35)

where $\varepsilon_{t+1}=\|\bar{x}_{t+1}-z_{t+1}\|(G_{\phi}+\frac{1}{\gamma}\|z_{t+1}-\bar{v}_{t}\|)+\frac{1}{2\gamma}\|\bar{x}_{t+1}-z_{t+1}\|^{2}.$

By definition, $z_{t+1}$ satisfies $\frac{1}{\gamma}(\bar{v}_{t}-z_{t+1})\in\partial\phi(z_{t+1})$ , and therefore its norm is bounded by $G_{\phi}$ . As a result,

\displaystyle\varepsilon_{t+1}\leq 2G_{\phi}\|\bar{x}_{t+1}-z_{t+1}\|+\frac{1}{2\gamma}\|\bar{x}_{t+1}-z_{t+1}\|^{2}.

(36)

Combined with the nonexpansiveness of the proximal operator,

	$\displaystyle\\|\bar{x}_{t+1}-z_{t+1}\\|\leq$	$\displaystyle\frac{1}{m}\sum_{j=1}^{m}\\|{\rm prox}_{\gamma,\phi}(v_{j,t})-{\rm prox}_{\gamma,\phi}(\bar{v}_{t})\\|$
	$\displaystyle\leq$	$\displaystyle\frac{1}{m}\sum_{j=1}^{m}\\|v_{j,t}-\bar{v}_{t}\\|.$		(37)

Finally, substituting (VII-B) to (36), we obtain (12b). ∎

VII-C Proof of Proposition IV.2

We first define two useful quantities $\Gamma\triangleq 2\frac{1+\eta^{-(m-1)B}}{1-\eta^{(m-1)B}}$ and $\Xi\triangleq(1-\eta^{(m-1)B})^{\frac{1}{(m-1)B}}$ . Then, we provide some properties of the local variables generated by the proposed algorithm in the following Lemma.

Lemma VII.5.

Under Assumptions III.1 and III.2, for each iteration $t\geq 2$ ,

\displaystyle\sum_{j=1}^{m}\|x_{j,t}^{n}\|\leq\sum_{j=1}^{m}\|x_{j,t-1}^{n}\|+m\gamma(G_{\phi}+nG_{f}),

(38)

\displaystyle\sum_{j=1}^{m}\|x_{j,t+1}-x_{j,t}\|\!\leq\!2m\Gamma\Xi^{t-1}\sum_{l=1}^{m}\|x_{l,t-1}^{n}\|\!+\!m\gamma(G_{\phi}\!+\!nG_{f}),

(39)

\displaystyle\sum_{j=1}^{m}\|x_{j,t}-\bar{x}_{t}\|\leq 2m\Gamma\Xi^{t}\sum_{l=1}^{m}\|x_{l,t-1}^{n}\|.

(40)

Proof.

By (10), there exists $z_{j,t+1}\in\partial\phi(x_{j,t+1})$ such that

\displaystyle x_{j,t+1}=v_{j,t}-\gamma z_{j,t+1}.

(41)

Since function $\phi$ has bounded subgradients,

\displaystyle\|x_{j,t+1}-v_{j,t}\|\leq\gamma G_{\phi}.

(42)

(a) By the algorithm 1, we have

\displaystyle x_{j,t}^{n}=x_{j,t}^{0}-\gamma\sum_{i=0}^{n-1}\nabla f_{j,\pi_{t}^{i}}(x_{j,t}^{i}).

Taking norm of the above equality and summing over $j$ ,

	$\displaystyle\sum_{j=1}^{m}\\|x_{j,t}^{n}\\|=$	$\displaystyle\sum_{j=1}^{m}\\|x_{j,t}^{0}-\gamma\sum_{i=0}^{n-1}\nabla f_{j,\pi_{t}^{i}}(x_{j,t}^{i})\\|$
	$\displaystyle\leq$	$\displaystyle\sum_{j=1}^{m}\\|x_{j,t}\\|+\gamma mnG_{f}.$		(43)

By (42), $\|x_{j,t+1}\|-\|v_{j,t}\|\leq\gamma G_{\phi}$ holds. Since $v_{j,t}$ is a convex combination of $\{x_{l,t}^{n}\}_{l=1}^{m}$ and $\sum_{l=1}^{m}\lambda_{jl,t}=1$ ,

\displaystyle\sum_{j=1}^{m}\|v_{j,t}\|\leq\sum_{j=1}^{m}\|x_{j,t}^{n}\|.

(44)

Then, substituting the fact that $\|x_{j,t+1}\|-\|v_{j,t}\|\leq\gamma G_{\phi}$ and (44) to (VII-C), we obtain

\displaystyle\sum_{j=1}^{m}\|x_{j,t}^{n}\|\leq\sum_{j=1}^{m}\|x_{j,t-1}^{n}\|+m\gamma(G_{\phi}+nG_{f}).

(b) By (41) and the proposed algorithm,

$\displaystyle x_{j,t+1}=$	$\displaystyle v_{j,t}-\gamma z_{j,t+1}$
$\displaystyle\overset{\eqref{v_ji_up}}{=}$	$\displaystyle\sum_{l=1}^{m}\lambda_{jl,t}x_{l,t}^{n}-\gamma z_{j,t+1}$
$\displaystyle\overset{\eqref{x_ji_up}}{=}$	$\displaystyle\sum_{l=1}^{m}\lambda_{jl,t}\Big{(}x_{l,t}^{0}-\gamma\sum_{i=0}^{n-1}\nabla f_{l,\pi_{t}^{i}}(x_{l,t}^{i})\Big{)}-\gamma z_{j,t+1}.$	(45)

Then, subtracting $x_{j,t}$ from both sides of (VII-C) and taking norm, we obtain

	$\displaystyle\sum_{j=1}^{m}\\|x_{j,t+1}-x_{j,t}\\|$
$\displaystyle\leq$	$\displaystyle\sum_{j=1}^{m}\sum_{l=1}^{m}\lambda_{jl,t}\\|x_{l,t}^{0}-x_{j,t}\\|+m\gamma(G_{\phi}+nG_{f})$
$\displaystyle=$	$\displaystyle\sum_{j=1}^{m}\sum_{l=1}^{m}\lambda_{jl,t}\\|x_{l,t}-x_{j,t}\\|+m\gamma(G_{\phi}+nG_{f}).$	(46)

For the first term in (VII-C), by the nonexpansiveness of the proximal operator,

\|x_{l,t}-x_{j,t}\|\leq\|v_{l,t-1}-v_{j,t-1}\|.

In addition, by [27, Proposition 1], the bound of the distance between iterates $v_{j,t}$ and $\bar{v}_{t}$ satisfies

$\displaystyle\\|v_{j,t}-\bar{v}_{t}\\|=$	$\displaystyle\Big{\\|}\sum_{l=1}^{m}\big{(}\lambda_{jl,t}x_{l,t}^{n}-\frac{1}{m}x_{l,t}^{n}\big{)}\Big{\\|}$
$\displaystyle\leq$	$\displaystyle\sum_{l=1}^{m}\big{\|}\lambda_{jl,t}-\frac{1}{m}\big{\|}\\|x_{l,t}^{n}\\|$
$\displaystyle\leq$	$\displaystyle\Gamma\Xi^{t}\sum_{l=1}^{m}\\|x_{l,t}^{n}\\|.$	(47)

Then, by transformation,

	$\displaystyle\sum_{j=1}^{m}\sum_{l=1}^{m}\lambda_{jl,t}\\|v_{l,t-1}-v_{j,t-1}\\|$
$\displaystyle\leq$	$\displaystyle\sum_{j=1}^{m}\sum_{l=1}^{m}\lambda_{jl,t}(\\|v_{l,t-1}-\bar{v}_{t-1}\\|+\\|v_{j,t-1}-\bar{v}_{t-1}\\|)$
$\displaystyle\leq$	$\displaystyle 2m\Gamma\Xi^{t-1}\sum_{l=1}^{m}\\|x_{l,t-1}^{n}\\|.$	(48)

Substituting (VII-C) to (VII-C),

		$\displaystyle\sum_{j=1}^{m}\\|x_{j,t+1}-x_{j,t}\\|$
	$\displaystyle\leq$	$\displaystyle 2m\Gamma\Xi^{t-1}\sum_{l=1}^{m}\\|x_{l,t-1}^{n}\\|+m\gamma(G_{\phi}+nG_{f}).$

(c) By the definition of $\bar{x}_{t}=\frac{1}{m}\sum_{p=1}^{m}x_{p,t}$ , $\sum_{j=1}^{m}\|x_{j,t}-\bar{x}_{t}\|$ satisfies

		$\displaystyle\sum_{j=1}^{m}\\|x_{j,t}-\frac{1}{m}\sum_{p=1}^{m}x_{p,t}\\|$
	$\displaystyle=$	$\displaystyle\sum_{j=1}^{m}\\|\frac{1}{m}\sum_{p=1}^{m}(x_{j,t}-x_{p,t})\\|$
	$\displaystyle\leq$	$\displaystyle\frac{1}{m}\sum_{j=1}^{m}\sum_{p=1}^{m}\\|v_{j,t-1}-v_{p,t-1}\\|$
	$\displaystyle\leq$	$\displaystyle\frac{1}{m}\sum_{j=1}^{m}\sum_{p=1}^{m}(\\|v_{j,t-1}-\bar{v}_{t-1}\\|+\\|v_{p,t-1}-\bar{v}_{t-1}\\|)$
	$\displaystyle\leq$	$\displaystyle 2m\Gamma\Xi^{t-1}\sum_{l=1}^{m}\\|x_{l,t-1}^{n}\\|,$

where the first inequality follows from Proposition II.1 (b) and the last inequality follows from (VII-C). ∎

The following Lemma prove that the sequence $\sum_{j=1}^{m}\|x_{j,t}^{n}\|$ is upper bounded by a polynomial-geometric sequence, which is vital in the discussions of the summability of the error sequences.

Lemma VII.6.

Under Assumptions III.1 and III.2, there exist non-negative scalars $C_{q}=C_{q}(x_{1,1}^{n},\cdots,x_{m,1}^{n})$ , $C_{q}^{1}=C_{q}^{1}(m,\Gamma,C_{q},C_{q}^{2})$ , $C_{q}^{2}=C_{q}^{2}(m,\gamma,G_{g},G_{h})$ such that for iteration $t\geq 1$ ,

\displaystyle\sum_{j=1}^{m}\|x_{j,t}^{n}\|\leq C_{q}+C_{q}^{1}t+C_{q}^{2}t^{2}.

Proof.

This proof is inspired by [28, Lemma 1]. We proceed by induction on $t$ . First, we show that the result holds for $t=1$ by choosing $C_{q}=\sum_{j=1}^{m}\|x_{j,1}^{n}\|$ . It suffices to show that $\sum_{j=1}^{m}\|x_{j,1}^{n}\|$ is bounded given the initial points $x_{j,0}^{n}$ .

Indeed, by (38),

\sum_{j=1}^{m}\|x_{j,1}^{n}\|\leq\sum_{j=1}^{m}\|x_{j,0}^{n}\|+m\gamma(G_{\phi}+nG_{f})<\infty.

Therefore, $C_{q}=\sum_{j=1}^{m}\|x_{j,1}^{n}\|<\infty$ is a valid choice.

Now suppose the result in Lemma VII.6 holds for some positive integer $t\geq 1$ . We show that it also holds for $t+1$ . We transform (38) in Lemma VII.5 to

\displaystyle\sum_{j=1}^{m}\|x_{j,t+1}^{n}\|\!\leq\!\sum_{j=1}^{m}\!\|x_{j,t}^{n}\|\!+\!m\gamma(G_{\phi}\!+\!nG_{f})\!+\!\sum_{j=1}^{m}\!\|x_{j,t}\!-\!x_{j,t-1}\|.

(49)

For the last term in (49), by transforming (39) in Lemma VII.5, we obtain

\displaystyle\sum_{j=1}^{m}\!\|x_{j,t+1}\!-\!x_{j,t}\|\!\leq\!2m\Gamma\!\sum_{p=1}^{t-1}\!\Xi^{p}\!\sum_{j=1}^{m}\!\|x_{j,l}^{n}\|\!+\!(t\!-\!1)m\gamma(\!G_{\phi}\!+\!nG_{f}\!).

(50)

Then, substituting the induction hypothesis for $t$ into (50), we have

	$\displaystyle\sum_{j=1}^{m}\\|x_{j,t+1}-x_{j,t}\\|\leq$	$\displaystyle 2m\Gamma\sum_{p=1}^{t-1}\Xi^{p}(C_{q}+C_{q}^{1}p+C_{q}^{2}p^{2})$
		$\displaystyle+(t-1)m\gamma(G_{\phi}+nG_{f}).$

By Lemma VII.2 and (30), there exist constants $S_{0}^{\Xi},S_{1}^{\Xi},S_{2}^{\Xi}$ such that

\sum_{p=1}^{\infty}\Xi^{p}(C_{q}+C_{q}^{1}p+C_{q}^{2}p^{2})\leq C_{q}S_{0}^{\Xi}+C_{q}^{1}S_{1}^{\Xi}+C_{q}^{2}S_{2}^{\Xi}.

By induction hypothesis and (49),

	$\displaystyle\sum_{j=1}^{m}\\|x_{j,t+1}^{n}\\|\leq$	$\displaystyle C_{q}+C_{q}^{1}t+C_{q}^{2}t^{2}+tm\gamma(G_{\phi}+nG_{f})$
		$\displaystyle+2m\Gamma(C_{q}S_{0}^{\Xi}+C_{q}^{1}S_{1}^{\Xi}+C_{q}^{2}S_{2}^{\Xi}).$		(51)

Take the coefficients $C_{q}$ , $C_{q}^{1}$ and $C_{q}^{2}$ as

	$\displaystyle C_{q}$	$\displaystyle=\sum_{j=1}^{m}\\|x_{j,1}^{n}\\|,$
	$\displaystyle C_{q}^{1}$	$\displaystyle=\frac{2m\Gamma C_{q}S_{0}^{\Xi}+(2m\Gamma S_{2}^{\Xi}-1)C_{q}^{2}}{2m\Gamma S_{1}^{\Xi}-1},$
	$\displaystyle C_{q}^{2}$	$\displaystyle=\frac{\gamma m}{2}(G_{\phi}+nG_{f}).$

Then, comparing coefficients, we see that the right-hand side of (VII-C) has an upper bound $C_{q}+C_{q}^{1}(t+1)+C_{q}^{2}(t+1)^{2}$ , which implies that the induction hypothesis holds for $t+1$ . ∎

Proof of Proposition IV.2:

Proof.

(a) By (40) in Lemma VII.5,

\displaystyle\|x_{j,t}-\bar{x}_{t}\|\leq 2\Gamma\Xi^{t-1}\sum_{l=1}^{m}\|x_{l,t-1}^{n}\|.

(52)

Then, by Lemma VII.6, (12a) and (52), there exist $C_{q}$ , $C_{q}^{1}$ , $C_{q}^{2}>0$ such that

\displaystyle{\|e_{t}^{i}\|}

\displaystyle\leq\underbrace{2L\Gamma\Xi^{t-1}(C_{q}\!+\!C_{q}^{1}(t-1)\!+\!C_{q}^{2}(t-1)^{2})}_{b_{e,t}}+\gamma C_{0},

(53)

where $C_{0}=2LnG_{f}$ and $b_{e,t}$ is a polynomial-geometric sequence.

(b) It follows from (12b), (VII-C) and Lemma VII.6 that

		$\displaystyle\varepsilon_{t+1}\leq 2G_{\phi}\Gamma\Xi^{t}\sum_{l=1}^{m}\\|x_{l,t}^{n}\\|+\frac{1}{2\gamma}(\Gamma\Xi^{t}\sum_{l=1}^{m}\\|x_{l,t}^{n}\\|)^{2}$
		$\displaystyle\leq\!\underbrace{\!2G_{\phi}\Gamma\Xi^{t}(C_{q}\!+\!C_{q}^{1}t\!+\!C_{q}^{2}t^{2})\!+\!\frac{1}{2\gamma}\!\big{[}\Gamma\Xi^{t}(C_{q}\!+\!C_{q}^{1}t\!+\!C_{q}^{2}t^{2})\big{]}^{2}\!}_{b_{\varepsilon,t}}.$		(54)

Using the fact that $\sqrt{a+b}\leq\sqrt{a}+\sqrt{b}$ for all nonnegative real numbers $a,b$ ,

\displaystyle\sqrt{\varepsilon_{t+1}}\!\leq\!

\displaystyle\underbrace{\!\sqrt{2G_{\phi}\Gamma}\sqrt{\Xi^{t}}\big{(}\sqrt{C_{q}}\!+\!\sqrt{C_{q}^{1}t}\!+\!\sqrt{C_{q}^{2}}t\big{)}\!+\!\frac{1}{\sqrt{2\gamma}}\Gamma\Xi^{t}C(t)\!}_{b_{\sqrt{\varepsilon},t}},

(55)

where $C(t)=(C_{q}+C_{q}^{1}t+C_{q}^{2}t^{2})$ . ∎

VII-D Proof of Lemma IV.1

Proof.

By Lemma VII.1 and (11), we have

\displaystyle\bar{x}_{t}-\gamma\bar{g}_{t}-\bar{x}_{t+1}-\gamma\underbrace{\sum_{i=0}^{n-1}e_{t}^{i}}_{e_{t+1}}-p_{t+1}=\gamma\bar{d}_{t+1},

(56)

where $\|p_{t+1}\|\leq\sqrt{2\gamma\varepsilon_{t+1}}$ , $\bar{d}_{t+1}\in\partial_{\varepsilon_{t+1}}\phi(\bar{x}_{t+1})$ , $\bar{g}_{t}\triangleq\sum_{i=0}^{n-1}\nabla\mathbf{f}_{\pi_{t}^{i}}(\bar{x}_{t}^{i})$ and $\mathbf{f}_{\pi_{t}^{i}}(x)=\frac{1}{m}\sum_{j=1}^{m}f_{j,\pi_{t}^{i}}(x)$ .

In addition, by (VII-B), $\bar{x}_{t}^{k}=\bar{x}_{t}-\gamma\sum_{i=0}^{k-1}\nabla\mathbf{f}_{\pi_{t}^{i}}(\bar{x}_{t}^{i})-\gamma\sum_{i=0}^{k-1}e_{t}^{i}$ . Then, combining (56) with (VII-B),

		$\displaystyle\bar{x}_{t}^{k}-\bar{x}_{t+1}$
	$\displaystyle=$	$\displaystyle\bar{x}_{t}-\gamma\sum_{i=0}^{k-1}\nabla\mathbf{f}_{\pi_{t}^{i}}(\bar{x}_{t}^{i})-\gamma\sum_{i=0}^{k-1}e_{t}^{i}$
		$\displaystyle-\big{(}\bar{x}_{t}-\gamma\sum_{i=0}^{n-1}\nabla\mathbf{f}_{\pi_{t}^{i}}(\bar{x}_{t}^{i})-\gamma\sum_{i=0}^{n-1}e_{t}^{i}-p_{t+1}-\gamma\bar{d}_{t+1}\big{)}$
	$\displaystyle=$	$\displaystyle\gamma\sum_{i=k}^{n-1}\nabla\mathbf{f}_{\pi_{t}^{i}}(\bar{x}_{t}^{i})+\gamma\underbrace{\sum_{i=k}^{n-1}e_{t}^{i}}_{\tilde{e}_{k,t+1}}+p_{t+1}+\gamma\bar{d}_{t+1}.$

Taking the norm of $\bar{x}_{t}^{k}-\bar{x}_{t+1}$ , we have

	$\displaystyle\\|\bar{x}_{t}^{k}-\bar{x}_{t+1}\\|^{2}$
$\displaystyle=$	$\displaystyle\\|\gamma\sum_{i=k}^{n-1}\nabla\mathbf{f}_{\pi_{t}^{i}}(\bar{x}_{t}^{i})+\gamma\tilde{e}_{k,t+1}+p_{t+1}+\gamma\bar{d}_{t+1}\\|^{2}$
$\displaystyle\leq$	$\displaystyle 6\gamma^{2}\\|\!\sum_{i=k}^{n-1}\!\nabla\mathbf{f}_{\pi_{t}^{i}}(\bar{x}_{t}^{i})\\|^{2}\!+\!6\gamma^{2}\\|\tilde{e}_{k,t+1}\\|^{2}\!$
	$\displaystyle+\!6\gamma^{2}\\|\bar{d}_{t+1}\\|^{2}\!+\!2\\|p_{t+1}\\|^{2}$
$\displaystyle\leq$	$\displaystyle 6\gamma^{2}\!\Big{(}\!2\big{\\|}\sum_{i=k}^{n-1}\!(\nabla\mathbf{f}_{\pi_{t}^{i}}\!(\bar{x}_{t}^{i})\!-\!\nabla\mathbf{f}_{\pi_{t}^{i}}\!(x_{}))\big{\\|}^{2}\!+\!2\big{\\|}\sum_{i=k}^{n-1}\!\nabla\mathbf{f}_{\pi_{t}^{i}}\!(x_{})\big{\\|}^{2}\!\Big{)}$
	$\displaystyle+6\gamma^{2}\\|\tilde{e}_{k,t+1}\\|^{2}+6\gamma^{2}\\|\bar{d}_{t+1}\\|^{2}+2\\|p_{t+1}\\|^{2}$
$\displaystyle\overset{\eqref{fnablf},\eqref{pt_range}}{\leq}$	$\displaystyle 6\gamma^{2}\!\big{(}4Ln\!\sum_{i=k}^{n-1}D_{\mathbf{f}_{\pi_{t}^{i}}}(x_{},\bar{x}_{t}^{i})+2\\|\sum_{i=k}^{n-1}\nabla\mathbf{f}_{\pi_{t}^{i}}(x_{})\\|^{2}\big{)}$
	$\displaystyle+6\gamma^{2}\\|\tilde{e}_{k,t+1}\\|^{2}+6\gamma^{2}\\|\bar{d}_{t+1}\\|^{2}+4\gamma\varepsilon_{t+1}$
$\displaystyle\overset{\eqref{bet},\eqref{bvapt}}{\leq}$	$\displaystyle 6\gamma^{2}\big{(}4Ln\sum_{i=0}^{n-1}D_{\mathbf{f}_{\pi_{t}^{i}}}(x_{},\bar{x}_{t}^{i})\!+\!2\\|\sum_{i=k}^{n-1}\nabla\mathbf{f}_{\pi_{t}^{i}}(x_{})\\|^{2}\big{)}$
	$\displaystyle+12n^{2}\gamma^{2}b_{e,t}^{2}+12n^{2}\gamma^{4}C_{0}^{2}+6\gamma^{2}\\|\bar{d}_{t+1}\\|^{2}+4\gamma b_{\varepsilon,t}.$	(57)

Next, consider the term $\|\sum_{i=k}^{n-1}\nabla\mathbf{f}_{\pi_{t}^{i}}(x_{*})\|^{2}$ in (VII-D). By Lemma VII.3 with $\bar{X}_{\pi}=\frac{1}{n-k}\sum_{i=k}^{n-1}\nabla\mathbf{f}_{\pi_{t}^{i}}(x_{*})$ and $\bar{X}=\frac{1}{n}\nabla f(x_{*})$ ,

		$\displaystyle\mathbb{E}\Big{[}\\|\sum_{i=k}^{n-1}\nabla\mathbf{f}_{\pi_{t}^{i}}(x_{*})\\|^{2}\Big{]}$
	$\displaystyle=$	$\displaystyle(n-k)^{2}\mathbb{E}\Big{[}\\|\frac{1}{n-k}\sum_{i=k}^{n-1}\nabla\mathbf{f}_{\pi_{t}^{i}}(x_{*})\\|^{2}\Big{]}$
	$\displaystyle=$	$\displaystyle(n-k)^{2}\mathbb{E}\big{[}\\|\bar{X}_{\pi}\\|^{2}\big{]}$
	$\displaystyle=$	$\displaystyle(n-k)^{2}\big{(}\\|\bar{X}\\|^{2}+\mathbb{E}[\\|\bar{X}_{\pi}-\bar{X}\\|^{2}]\big{)}$
	$\displaystyle=$	$\displaystyle(n-k)^{2}\\|\frac{1}{n}\nabla f(x_{})\\|^{2}+(n-k)\frac{k}{(n-1)}\sigma_{}^{2},$

where $\sigma_{*}^{2}\triangleq\frac{1}{n}\sum_{i=1}^{n}\|\nabla\mathbf{f}_{\pi_{t}^{i}}(x_{*})-\frac{1}{n}\nabla f(x_{*})\|^{2}$ is the population variance at the optimum $x_{*}$ . Summing this over $k$ from $0$ to $n-1$ ,

	$\displaystyle\sum_{k=0}^{n-1}\mathbb{E}\Big{[}\\|\sum_{i=k}^{n-1}\nabla\mathbf{f}_{\pi_{t}^{i}}(x_{*})\\|^{2}\Big{]}$
$\displaystyle\leq$	$\displaystyle\sum_{k=0}^{n-1}(n-k)^{2}\\|\frac{1}{n}\nabla f(x_{})\\|^{2}+\sum_{k=0}^{n-1}\frac{k(n-k)}{n-1}\sigma_{}^{2}$
$\displaystyle=$	$\displaystyle\frac{n(n+1)(2n+1)}{6n^{2}}\\|\nabla f(x_{})\\|^{2}+\frac{n(n+1)}{6}\sigma_{}^{2}$
$\displaystyle\leq$	$\displaystyle\frac{3n}{4}G_{f}^{2}+\frac{n^{2}\sigma_{*}^{2}}{4},$	(58)

where the inequality holds due to $n\geq 2$ , $n+1<2n$ and Assumption III.1 (d).

Thus, summing (VII-D) over $k$ gives

		$\displaystyle\sum_{k=0}^{n-1}\mathbb{E}[\\|\bar{x}_{t}^{k}-\bar{x}_{t+1}\\|^{2}]$
	$\displaystyle\leq$	$\displaystyle 24\gamma^{2}Ln^{2}\sum_{i=0}^{n-1}\mathbb{E}[D_{\mathbf{f}_{\pi_{t}^{i}}}(x_{},\bar{x}_{t}^{i})]+9\gamma^{2}nG_{f}^{2}+3\gamma^{2}n^{2}\sigma_{}^{2}$
		$\displaystyle+6\gamma^{2}nG_{\phi}^{2}+12n^{3}\gamma^{2}b_{e,t}^{2}+12n^{3}\gamma^{4}C_{0}^{2}+4n\gamma b_{\varepsilon,t},$

where we use (VII-D) and Assumption III.1 (c). ∎

References

[1] F. Iutzeler and J. Malick, “Nonsmoothness in machine learning: Specific structure, proximal identification, and applications,” Set-Valued Var. Anal, vol. 28, pp. 661–678, 2020.
[2] W. Tao, Z. Pan, G. Wu, and Q. Tao, “The strength of nesterov’s extrapolation in the individual convergence of nonsmooth optimization,” IEEE Transactions on Neural Networks and Learning Systems, vol. 31, no. 7, pp. 2557–2568, 2020.
[3] A. Astorino and A. Fuduli, “Nonsmooth optimization techniques for semisupervised classification,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 12, pp. 2135–2142, 2007.
[4] A. H. Zaini and L. Xie, “Distributed drone traffic coordination using triggered communication,” Unmanned Systems, vol. 08, no. 01, pp. 1–20, 2020.
[5] K. Wang, Z. Fu, Q. Xu, D. Chen, L. Wang, and W. Yu, “Distributed fixed step-size algorithm for dynamic economic dispatch with power flow limits,” SCIENCE CHINA INFORMATION SCIENCES, vol. 64, p. 112202, 2021.
[6] P. Yi and Y. Hong, “Stochastic sub-gradient algorithm for distributed optimization with random sleep scheme,” Control Theory Technol., vol. 13, pp. 333–347, 2015.
[7] A. Nedic and A. Ozdaglar, “Distributed subgradient methods for multi-agent optimization,” IEEE Transactions on Automatic Control, vol. 54, no. 1, pp. 48–61, 2009.
[8] W. Zhong and J. Kwok, “Accelerated Stochastic Gradient Method for Composite Regularization,” in Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, ser. Proceedings of Machine Learning Research, vol. 33. Reykjavik, Iceland: PMLR, 22–25 Apr 2014, pp. 1086–1094.
[9] M. Schmidt, N. L. Roux, and F. Bach, “Convergence rates of inexact proximal-gradient methods for convex optimization,” Advances in Neural Information Processing Systems, vol. 24, pp. 1458–1466, 2012.
[10] D. P. Bertsekas, “Incremental gradient, subgradient, and proximal methods for convex optimization: A survey,” arXiv.cs.SY, 2017.
[11] Q. Wang, J. Chen, X. Zeng, and B. Xin, “Distributed proximal-gradient algorithms for nonsmooth convex optimization of second-order multiagent systems,” International Journal of Robust and Nonlinear Control, vol. 30, no. 17, pp. 7574–7592, 2020. [Online]. Available: https://onlinelibrary.wiley.com/doi/abs/10.1002/rnc.5199
[12] X. Li, G. Feng, and L. Xie, “Distributed proximal algorithms for multiagent optimization with coupled inequality constraints,” IEEE Transactions on Automatic Control, vol. 66, no. 3, pp. 1223–1230, 2021.
[13] S. A. Alghunaim, E. K. Ryu, K. Yuan, and A. H. Sayed, “Decentralized proximal gradient algorithms with linear convergence rates,” IEEE Transactions on Automatic Control, vol. 66, no. 6, pp. 2787–2794, 2021.
[14] J. Zhang, H. Liu, A. M.-C. So, and Q. Ling, “A penalty alternating direction method of multipliers for convex composite optimization over decentralized networks,” IEEE Transactions on Signal Processing, vol. 69, pp. 4282–4295, 2021.
[15] L. Vandenberghe, “Lecture notes for EE 236C, UCLA,” Spring, 2011-2012.
[16] F. Sattler, K.-R. Müller, and W. Samek, “Clustered federated learning: Model-agnostic distributed multitask optimization under privacy constraints,” IEEE Transactions on Neural Networks and Learning Systems, vol. 32, no. 8, pp. 3710–3722, 2021.
[17] W. Wu, X. Jing, W. Du, and G. Chen, “Learning dynamics of gradient descent optimization in deep neural networks,” SCIENCE CHINA INFORMATION SCIENCES, vol. 64, p. 150102, 2021.
[18] W. Li, Z. Wu, T. Chen, L. Li, and Q. Ling, “Communication-censored distributed stochastic gradient descent,” IEEE Transactions on Neural Networks and Learning Systems, pp. 1–13, 2021.
[19] O. Shamir and T. Zhang, “Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes,” in Proceedings of the 30th International Conference on International Conference on Machine Learning - Volume 28, ser. ICML’13. JMLR.org, 2013, p. I–71–I–79.
[20] P. Bianchi, G. Fort, and W. Hachem, “Performance of a distributed stochastic approximation algorithm,” IEEE Transactions on Information Theory, vol. 59, no. 11, pp. 7405–7418, 2013.
[21] J. Lei, H.-F. Chen, and H.-T. Fang, “Asymptotic properties of primal-dual algorithm for distributed stochastic optimization over random networks with imperfect communications,” SIAM Journal on Control and Optimization, vol. 56, no. 3, pp. 2159–2188, 2018. [Online]. Available: https://doi.org/10.1137/16M1086133
[22] X. Lian, C. Zhang, H. Zhang, C.-J. Hsieh, W. Zhang, and J. Liu, “Can decentralized algorithms outperform centralized algorithms? a case study for decentralized parallel stochastic gradient descent,” in Proceedings of the 31st International Conference on Neural Information Processing Systems, ser. NIPS’17. Red Hook, NY, USA: Curran Associates Inc., 2017, p. 5336–5346.
[23] Z. Li, B. Liu, and Z. Ding, “Consensus-based cooperative algorithms for training over distributed data sets using stochastic gradients,” IEEE Transactions on Neural Networks and Learning Systems, pp. 1–11, 2021.
[24] R. Xin, U. A. Khan, and S. Kar, “Variance-reduced decentralized stochastic optimization with accelerated convergence,” IEEE Transactions on Signal Processing, vol. 68, p. 6255–6271, 2020. [Online]. Available: http://dx.doi.org/10.1109/TSP.2020.3031071
[25] M. I. Qureshi, R. Xin, S. Kar, and U. A. Khan, “Push-saga: A decentralized stochastic algorithm with variance reduction over directed graphs,” IEEE Control Systems Letters, vol. 6, pp. 1202–1207, 2022.
[26] X.-F. Wang, A. R. Teel, K.-Z. Liu, and X.-M. Sun, “Stability analysis of distributed convex optimization under persistent attacks: A hybrid systems approach,” Automatica, vol. 111, p. 108607, 2020. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0005109819304686
[27] A. Nedic, A. Ozdaglar, and P. A. Parrilo, “Constrained consensus and optimization in multi-agent networks,” IEEE Transactions on Automatic Control, vol. 55, no. 4, pp. 922–938, 2010.
[28] A. I. Chen and A. Ozdaglar, “A fast distributed proximal-gradient method,” in 2012 50th Annual Allerton Conference on Communication, Control, and Computing (Allerton), Oct 2012, pp. 601–608.
[29] K. Ahn, C. Yun, and S. Sra, “SGD with shuffling: optimal rates without component convexity and large epoch requirements,” in Advances in Neural Information Processing Systems. Curran Associates, Inc., 2020.
[30] M. Gurbuzbalaban, A. Ozdaglar, and P. A. Parrilo, “Why random reshuffling beats stochastic gradient descent,” Mathematical Programming, vol. 186, pp. 49–84, 2021.
[31] B. Recht and C. Ré, “Parallel stochastic gradient algorithms for large-scale matrix completion,” Mathematical Programming Computation, vol. 5, no. 2, pp. 201–226, 2013.
[32] A. Rakhlin, O. Shamir, and K. Sridharan, “Making gradient descent optimal for strongly convex stochastic optimization,” in Proceedings of the 29th International Coference on International Conference on Machine Learning, ser. ICML’12. Madison, WI, USA: Omnipress, 2012, p. 1571–1578.
[33] Y. Drori and O. Shamir, “The complexity of finding stationary points with stochastic gradient descent,” in Proceedings of the 37th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol. 119. PMLR, 13–18 Jul 2020, pp. 2658–2667. [Online]. Available: https://proceedings.mlr.press/v119/drori20a.html
[34] A. Nedic and D. P. Bertsekas, “Incremental subgradient methods for nondifferentiable optimization,” SIAM Journal on Optimization, vol. 12, no. 1, pp. 109–138, 2001.
[35] D. P. Bertsekas, Optimization for Machine Learning, chapter 4. The MIT Press, 2011.
[36] K. Mishchenko, A. Khaled, and P. Richtarik, “Random reshuffling: Simple analysis with vast improvements,” in Advances in Neural Information Processing Systems, vol. 33. Curran Associates, Inc., 2020, pp. 17 309–17 320. [Online]. Available: https://proceedings.neurips.cc/paper/2020/file/c8cc6e90ccbff44c9cee23611711cdc4-Paper.pdf
[37] Q. Meng, W. Chen, Y. Wang, Z.-M. Ma, and T.-Y. Liu, “Convergence analysis of distributed stochastic gradient descent with shuffling,” Neurocomputing, vol. 337, pp. 46–57, 2019. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0925231219300578
[38] K. Mishchenko, A. Khaled, and P. Richtárik, “Proximal and federated random reshuffling,” 2021.
[39] L. Xiao and T. Zhang, “A proximal stochastic gradient method with progressive variance reduction,” SIAM Journal on Optimization, vol. 24, no. 4, pp. 2057–2075, 2014. [Online]. Available: https://doi.org/10.1137/140961791
[40] Z. Li and J. Li, “A simple proximal stochastic gradient method for nonsmooth nonconvex optimization,” in Advances in Neural Information Processing Systems, vol. 31. Curran Associates, Inc., 2018. [Online]. Available: https://proceedings.neurips.cc/paper/2018/file/e727fa59ddefcefb5d39501167623132-Paper.pdf
[41] N. D. Vanli, M. Gurbuzbalaban, and A. Ozdaglar, “Global convergence rate of incremental aggregated gradient methods for nonsmooth problems,” in 2016 IEEE 55th Conference on Decision and Control (CDC), 2016, pp. 173–178.
[42] A. Beck and M. Teboulle, “A fast iterative shrinkage-thresholding algorithm for linear inverse problems,” SIAM Journal on Imaging Sciences, vol. 2, no. 1, pp. 183–202, 2009.
[43] X. Li, K. Huang, W. Yang, S. Wang, and Z. Zhang, “On the convergence of FedAvg on Non-IID data,” in International Conference on Learning Representations, 2020. [Online]. Available: https://openreview.net/forum?id=HJxNAnVtDS
[44] S. Abu-Mostafa, M. Magdon-Ismail, and H. Lin, Learning from data. United States of America: AMLbook.com, 2012.
[45] P. D. Lorenzo and G. Scutari, “Next: In-network nonconvex optimization,” IEEE Transactions on Signal and Information Processing over Networks, vol. 2, no. 2, pp. 120–136, 2016.

	$\displaystyle{\rm prox}_{\alpha,h}^{\varepsilon}(x)\triangleq\Big{\{}\tilde{x}\big{\|}$	$\displaystyle\frac{1}{2\alpha}\\|\tilde{x}-x\\|^{2}+h(\tilde{x})\leq$
		$\displaystyle\varepsilon+\min_{z\in\mathbb{R}^{d}}\big{\{}\frac{1}{2\alpha}\\|z-x\\|^{2}+h(z)\big{\}}\Big{\}}.$		(4)

$\displaystyle\mathbb{E}[\\|\bar{x}_{t}-x_{*}\\|]\leq$	$\displaystyle A_{T}+(\\|\bar{x}_{0}-x_{*}\\|^{2}+B_{T}+A_{T}^{2})^{\frac{1}{2}}$
$\displaystyle\leq$	$\displaystyle 2A_{T}+\\|\bar{x}_{0}-x_{*}\\|+\sqrt{B_{T}}$
$\displaystyle\leq$	$\displaystyle 2\textbf{A}+\\|\bar{x}_{0}-x_{*}\\|+\sqrt{T\textbf{D}+\textbf{B}}.$	(26)

		$\displaystyle\phi(\bar{x}_{t+1})+\frac{1}{2\gamma}\\|\bar{x}_{t+1}-\bar{v}_{t}\\|^{2}$
	$\displaystyle\leq$	$\displaystyle\phi(z_{t+1})+G_{\phi}\\|\bar{x}_{t+1}-z_{t+1}\\|+\frac{1}{2\gamma}\Big{(}\\|z_{t+1}-\bar{v}_{t}\\|^{2}$
		$\displaystyle+2\langle z_{t+1}-\bar{v}_{t},\bar{x}_{t+1}-z_{t+1}\rangle+\\|\bar{x}_{t+1}-z_{t+1}\\|^{2}\Big{)}$
	$\displaystyle\leq$	$\displaystyle\min_{z\in\mathbb{R}^{d}}\{\phi(z)+\frac{1}{2\gamma}\\|z-\bar{v}_{t}\\|^{2}\}+\frac{1}{2\gamma}\\|\bar{x}_{t+1}-z_{t+1}\\|^{2}$
		$\displaystyle+\\|\bar{x}_{t+1}-z_{t+1}\\|(G_{\phi}+\frac{1}{\gamma}\\|z_{t+1}-\bar{v}_{t}\\|),$

	$\displaystyle\\|\bar{x}_{t+1}-z_{t+1}\\|\leq$	$\displaystyle\frac{1}{m}\sum_{j=1}^{m}\\|{\rm prox}_{\gamma,\phi}(v_{j,t})-{\rm prox}_{\gamma,\phi}(\bar{v}_{t})\\|$
	$\displaystyle\leq$	$\displaystyle\frac{1}{m}\sum_{j=1}^{m}\\|v_{j,t}-\bar{v}_{t}\\|.$		(37)

	$\displaystyle\sum_{j=1}^{m}\\|x_{j,t}^{n}\\|=$	$\displaystyle\sum_{j=1}^{m}\\|x_{j,t}^{0}-\gamma\sum_{i=0}^{n-1}\nabla f_{j,\pi_{t}^{i}}(x_{j,t}^{i})\\|$
	$\displaystyle\leq$	$\displaystyle\sum_{j=1}^{m}\\|x_{j,t}\\|+\gamma mnG_{f}.$		(43)