
GT-STORM: Taming Sample, Communication, and Memory Complexities in Decentralized Non-Convex Learning

Xin Zhang1, Jia Liu2, Zhengyuan Zhu1, and Elizabeth S. Bentley3
1Department of Statistics, Iowa State University
2Department of Electrical and Computer Engineering, The Ohio State University
3Information Directorate, Air Force Research Laboratory
(2020)
Abstract.

Decentralized nonconvex optimization has received increasing attention in recent years in machine learning due to its advantages in system robustness, data privacy, and implementation simplicity. However, three fundamental challenges in designing decentralized optimization algorithms are how to reduce their sample, communication, and memory complexities. In this paper, we propose a gradient-tracking-based stochastic recursive momentum (GT-STORM) algorithm for efficiently solving nonconvex optimization problems. We show that to reach an $\epsilon^2$-stationary solution, the total number of sample evaluations of our algorithm is $\tilde{O}(m^{1/2}\epsilon^{-3})$ and the number of communication rounds is $\tilde{O}(m^{-1/2}\epsilon^{-3})$, which improve the $O(\epsilon^{-4})$ sample-evaluation and communication costs of the existing decentralized stochastic gradient algorithms. We conduct extensive experiments with a variety of learning models, including non-convex logistic regression and convolutional neural networks, to verify our theoretical findings. Collectively, our results contribute to the state of the art of theories and algorithms for decentralized network optimization.

Network Consensus Optimization, Stochastic Variance Reduction, Gradient Tracking

1. Introduction

In recent years, machine learning has witnessed enormous success in many areas, including image processing, natural language processing, and online recommender systems, just to name a few. From a mathematical perspective, training machine learning models amounts to solving an optimization problem. However, with the rapidly increasing dataset sizes and the high dimensionality and non-convexity of the training problem (e.g., due to the use of deep neural networks), training large-scale machine learning models on a single centralized machine has become inefficient and unscalable. To address the efficiency and scalability challenges, an effective approach is to leverage decentralized computational resources in a computing network, which could follow a parameter server (PS)-worker architecture (recht2011hogwild; zinkevich2010parallelized; dean2012large) or a fully decentralized peer-to-peer network structure (nedic2009distributed; lian2017can). Also, thanks to its robustness to single points of failure, data privacy, and implementation simplicity, decentralized learning over computing networks has attracted increasing interest recently and has been applied in various science and engineering areas, including dictionary learning (chen2014dictionary), multi-agent systems (cao2012overview; zhou2011multirobot), multi-task learning (wang2018distributed; zhang2019distributed), information retrieval (ali2004tivo), and energy allocation (jiang2018consensus).

In the fast-growing literature on decentralized learning over networks, a classical approach is the so-called network consensus optimization, which traces its roots to the seminal work by Tsitsiklis in 1984 (tsitsiklis1984problems). Recently, network consensus optimization has gained a lot of renewed interest owing to the elegant decentralized subgradient descent method (DSGD) proposed by Nedic and Ozdaglar (nedic2009distributed), which has been applied in decentralized learning due to its simple algorithmic structure and good convergence performance. In network-consensus-based decentralized learning, a set of geographically distributed computing nodes collaborate to train a common learning model. Each node holds a local dataset that may be too large to be sent to a centralized location due to network communication limits, or that cannot be shared due to privacy/security risks. A distinctive feature of network-consensus-based decentralized learning is the lack of a dedicated PS. As a result, each node has to exchange information with its local neighbors to reach a consensus on a globally optimal learning model.

Despite its growing significance in practice, the design of high-performance network-consensus-based decentralized learning faces three fundamental and conflicting complexities, namely sample, communication, and memory complexities. First, due to the high dimensionality of most deep learning models, it is impossible to leverage beyond first-order (stochastic) gradient information to compute the update direction in each iteration. The variability of a stochastic gradient is strongly influenced by the number of training samples in its mini-batch. However, the more training samples in a mini-batch, the higher the computational cost of the stochastic gradient. Second, by using fewer training samples in each iteration to trade for a lower computational cost, the resulting stochastic gradient unavoidably has a larger variance, which in turn requires more iterations (hence communication rounds) to reach a certain training accuracy (i.e., slower convergence). The low communication efficiency is particularly problematic in many wireless edge networks, where the communication links could be low-speed and highly unreliable. Lastly, in many mobile edge-computing environments, mobile devices could be severely limited in hardware resources (e.g., CPU/GPU, memory) and cannot afford to reserve a large memory space to run a sophisticated decentralized learning algorithm with too many intermediate variables.

Due to the above fundamental trade-off between sample, communication, and computing resource costs, the notions of sample, communication, and memory complexities (to be formally defined in Section 2) have become three of the most important measures in assessing the performance of decentralized learning algorithms. However, in the literature, most existing works achieve low complexities in some of these measures, but not all (see Section 2 for in-depth discussions). The limitations of these existing works motivate us to ask the following question: Could we design a decentralized learning algorithm that strikes a good balance between sample, communication, and memory complexities? In this paper, we answer this question positively by proposing a new GT-STORM algorithm (gradient-tracking-based stochastic recursive momentum) that achieves low sample, communication, and memory complexities. Our main results and contributions are summarized as follows:

  • Unlike existing approaches, our proposed GT-STORM algorithm adopts a new estimator that is updated with a consensus mixing of the neighboring estimators from the last iteration, which helps improve the global gradient estimation. Our method achieves the nice features of previous works (tran2019hybrid; cutkosky2019momentum; di2016next; lu2019gnsd) while avoiding their pitfalls. To some extent, our GT-STORM algorithm can be viewed as an indirect way of integrating the stochastic gradient method, the variance reduction method, and the gradient tracking method.

  • We provide detailed convergence and complexity analyses. Under some mild assumptions and parameter conditions, our algorithm enjoys an $\tilde{O}(T^{-2/3})$ convergence rate. Note that this rate is much faster than the $O(T^{-1/2})$ rate of the classic decentralized stochastic algorithms, e.g., DSGD (jiang2017collaborative), PSGD (lian2017can), and GNSD (lu2019gnsd). Also, we show that to reach an $\epsilon^2$-stationary solution, the total number of sample evaluations of our algorithm is $\tilde{O}(m^{1/2}\epsilon^{-3})$ and the number of communication rounds is $\tilde{O}(m^{-1/2}\epsilon^{-3})$.

  • We conduct extensive experiments to examine the performance of our algorithm, including both a non-convex logistic regression model on LibSVM datasets and convolutional neural network models on the MNIST and CIFAR-10 datasets. Our experiments show that our algorithm outperforms two state-of-the-art decentralized learning algorithms (lian2017can; lu2019gnsd). These experiments corroborate our theoretical results.

The rest of the paper is organized as follows. In Section 2, we first provide the preliminaries of network consensus optimization and discuss related works with a focus on sample, communication, and memory complexities. In Section 3, we present our proposed GT-STORM algorithm, as well as its communication, sample, and memory complexity analysis. We provide numerical results in Section 4 to verify the theoretical results of our GT-STORM algorithm. Lastly in Section 5, we provide concluding remarks.

2. Preliminaries and Related Work

To facilitate our technical discussions, in Section 2.1, we first provide an overview of network consensus optimization and formally define the notions of sample, communication, and memory complexities of decentralized optimization algorithms for network consensus optimization. Then, in Section 2.2, we review centralized stochastic first-order optimization algorithms for solving non-convex learning problems from a historical perspective, with a focus on sample, communication, and memory complexities. Here, we introduce several acceleration techniques that motivate our GT-STORM algorithmic design. Lastly, we review recent developments in optimization algorithms for decentralized learning and compare them with our work.

2.1. Network Consensus Optimization

As mentioned in Section 1, in decentralized learning, a set of geographically distributed computing nodes form a network. In this paper, we represent such a network by an undirected connected graph $\mathcal{G}=(\mathcal{N},\mathcal{L})$, where $\mathcal{N}$ and $\mathcal{L}$ are the sets of nodes and edges, respectively, with $|\mathcal{N}|=m$. Each node can communicate with its neighbors via the edges in $\mathcal{L}$. The goal of decentralized learning is to use the nodes to distributively and collaboratively solve the following network-wide optimization problem:

(1) $\min_{\mathbf{x}\in\mathbb{R}^{p}} f(\mathbf{x}) = \min_{\mathbf{x}\in\mathbb{R}^{p}} \frac{1}{m}\sum_{i=1}^{m} f_{i}(\mathbf{x}),$

where each local objective function $f_{i}(\mathbf{x})\triangleq\mathbb{E}_{\zeta\sim\mathcal{D}_{i}} f_{i}(\mathbf{x};\zeta)$ is only observable to node $i$ and not necessarily convex. Here, $\mathcal{D}_{i}$ represents the distribution of the dataset at node $i$, and $f_{i}(\mathbf{x};\zeta)$ represents a loss function that evaluates the discrepancy between the learning model's output and the ground truth of a training sample $\zeta$. To solve Problem (1) in a decentralized fashion, a common approach is to rewrite Problem (1) in the following equivalent form:

(2) Minimize $\frac{1}{m}\sum_{i=1}^{m} f_{i}(\mathbf{x}_{i})$
    subject to $\mathbf{x}_{i}=\mathbf{x}_{j}, \quad \forall (i,j)\in\mathcal{L},$

where $\mathbf{x}\triangleq[\mathbf{x}_{1}^{\top},\cdots,\mathbf{x}_{m}^{\top}]^{\top}$ and $\mathbf{x}_{i}$ is an introduced local copy at node $i$. In Problem (2), the constraints ensure that the local copies at all nodes are equal to each other, hence the term "consensus." Thus, Problems (1) and (2) share the same solutions. The main goal of network consensus optimization is to design an algorithm to attain an $\epsilon^2$-stationary point $\mathbf{x}$ defined as follows:

(3) $\underbrace{\Big\|\frac{1}{m}\sum_{i=1}^{m}\nabla f_{i}(\bar{\mathbf{x}})\Big\|^{2}}_{\text{Global gradient magnitude}} + \underbrace{\frac{1}{m}\sum_{i=1}^{m}\|\mathbf{x}_{i}-\bar{\mathbf{x}}\|^{2}}_{\text{Consensus error}} \leq \epsilon^{2},$

where $\bar{\mathbf{x}}\triangleq\frac{1}{m}\sum_{i=1}^{m}\mathbf{x}_{i}$ denotes the global average across all nodes. Different from the traditional $\epsilon^2$-stationary point in centralized optimization, the metric in Eq. (3) has two terms: the first term is the gradient magnitude of the (non-convex) global objective and the second term is the average consensus error of all local copies. To date, many decentralized algorithms have been developed to compute an $\epsilon^2$-stationary point (see Section 2.2). However, most of these algorithms suffer limitations in sample, communication, and memory complexities. In what follows, we formally state the definitions of sample, communication, and memory complexities used in the literature (see, e.g., (sun2019improving)):
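To make the metric in Eq. (3) concrete, the following is a minimal sketch (ours, for illustration only) that evaluates both terms for a stack of local copies; the list of local gradient oracles `grad_fns` is a hypothetical stand-in for the $\nabla f_i$'s:

```python
import numpy as np

def stationarity_metric(x_locals, grad_fns):
    """Evaluate the two terms of Eq. (3).

    x_locals : (m, p) array, row i is the local copy x_i.
    grad_fns : list of m callables; grad_fns[i](x) returns the p-dim gradient of f_i at x.
    """
    x_bar = x_locals.mean(axis=0)                      # global average \bar{x}
    grad_bar = np.mean([g(x_bar) for g in grad_fns], axis=0)
    grad_term = np.linalg.norm(grad_bar) ** 2          # global gradient magnitude
    consensus_term = np.mean(np.sum((x_locals - x_bar) ** 2, axis=1))  # consensus error
    return grad_term + consensus_term
```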

Definition 1 (Sample Complexity).

The sample complexity is defined as the total number of incremental first-order oracle (IFO) calls required across all nodes to find an $\epsilon^2$-stationary point defined in Eq. (3), where one IFO call evaluates a pair $(f_{i}(\mathbf{x};\zeta), \nabla f_{i}(\mathbf{x};\zeta))$ on a sample $\zeta\sim\mathcal{D}_{i}$ and parameter $\mathbf{x}\in\mathbb{R}^{p}$ at node $i$.

Definition 2 (Communication Complexity).

The communication complexity is defined as the total number of communication rounds required to find an $\epsilon^2$-stationary point defined in Eq. (3), where in one communication round each node can send and receive a $p$-dimensional vector with its neighboring nodes.

Definition 3 (Memory Complexity).

The memory complexity is defined as the total dimensionality of all intermediate variables in the algorithm run by a node to find an $\epsilon^2$-stationary point in Eq. (3).

To put these three complexity metrics into perspective, consider the standard centralized gradient descent (GD) method as an example. The GD algorithm has an $O(1/T)$ convergence rate for non-convex optimization, which suggests an $O(\epsilon^{-2})$ communication complexity. Also, it takes a full gradient evaluation in each iteration, i.e., an $O(n)$ per-iteration sample complexity, where $n$ is the total number of samples. This implies an $O(n\epsilon^{-2})$ sample complexity to converge to an $\epsilon^2$-stationary point. Hence, the sample complexity of GD is high if the dataset size $n$ is large.

In contrast, consider the classical stochastic gradient descent (SGD) algorithm that is widely used in machine learning. The basic idea of SGD is to lower the gradient evaluation cost by using only a mini-batch of samples in each iteration. However, due to the sample randomness in mini-batches, the convergence rate of SGD for non-convex optimization is reduced to $O(1/\sqrt{T})$ (ghadimi2013stochastic; bottou2018optimization; zhou2018new). Thus, to reach an $\epsilon^2$-stationary point $\mathbf{x}$ with $\|\nabla f(\mathbf{x})\|^{2}\leq\epsilon^{2}$, SGD has an $O(\epsilon^{-4})$ sample complexity, which could be either higher or lower than the $O(n\epsilon^{-2})$ sample complexity of the GD method, depending on the relationship between $n$ and $\epsilon$. Also, for $p$-dimensional problems, both GD and SGD have memory complexity $p$, since they only need a $p$-dimensional vector to store (stochastic) gradients.
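To spell out the arithmetic behind these complexity claims, the following short derivation (a restatement of the standard argument, not a new result) converts the convergence rates into sample complexities:

```latex
\[
\underbrace{O(1/T)}_{\text{GD rate}} \le \epsilon^{2}
\;\Rightarrow\; T = O(\epsilon^{-2})
\;\Rightarrow\; \underbrace{n \cdot O(\epsilon^{-2})}_{n \text{ IFO calls/iter}} = O(n\epsilon^{-2})
\quad \text{(GD sample complexity)},
\]
\[
\underbrace{O(1/\sqrt{T})}_{\text{SGD rate}} \le \epsilon^{2}
\;\Rightarrow\; T = O(\epsilon^{-4})
\;\Rightarrow\; \underbrace{O(1) \cdot O(\epsilon^{-4})}_{O(1) \text{ IFO calls/iter}} = O(\epsilon^{-4})
\quad \text{(SGD sample complexity)}.
\]
```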

2.2. Related Work

1) Centralized First-Order Methods with Low Complexities: We now review several state-of-the-art low-complexity centralized stochastic first-order methods that are related to our GT-STORM algorithm. To reduce the overall sample and communication complexities of the standard GD and SGD algorithms, a natural approach is variance reduction. Earlier works following this approach include SVRG (johnson2013accelerating; reddi2016stochastic), SAGA (defazio2014saga), and SCSG (lei2017non). These algorithms have an overall sample complexity of $O(n+n^{2/3}\epsilon^{-2})$. A more recent variance reduction method is the stochastic path-integrated differential estimator (SPIDER) (fang2018spider), which is based on the SARAH gradient estimator developed by Nguyen et al. (nguyen2017sarah). SPIDER further lowers the sample complexity to $O(n+\sqrt{n}\epsilon^{-2})$, which attains the $\Omega(\sqrt{n}\epsilon^{-2})$ theoretical lower bound for finding an $\epsilon^2$-stationary point for $n=O(\epsilon^{-4})$. More recently, to improve the small step-size $O(\epsilon L^{-1})$ in SPIDER, a variant called SpiderBoost was proposed in (wang2019spiderboost), which allows a larger constant step-size $O(L^{-1})$ while keeping the same $O(n+\sqrt{n}\epsilon^{-2})$ sample complexity. It should be noted, however, that the significantly improved sample complexity of SPIDER/SpiderBoost is due to a restrictive assumption that a universal Lipschitz smoothness constant exists for all local objectives $f(\cdot;\zeta_{i})$, $\forall i$. This means that the objectives are "similar" and there are no "outliers" in the training samples. Meanwhile, to obtain the optimal communication complexity, SpiderBoost requires a (nearly) full gradient evaluation every $\sqrt{n}$ iterations and a mini-batch stochastic gradient evaluation with batch size $\sqrt{n}$ in each iteration.

To overcome the above limitations, a hybrid stochastic gradient descent (Hybrid-SGD) method was recently proposed in (tran2019hybrid), where a convex combination of the SARAH estimator (nguyen2017sarah) and an unbiased stochastic gradient is used as the gradient estimator. The Hybrid-SGD method relaxes the universal Lipschitz constant assumption in SpiderBoost to an average Lipschitz smoothness assumption. Moreover, it only requires two samples to evaluate the gradient per iteration. As a result, Hybrid-SGD has an $O(\epsilon^{-3})$ sample complexity that is independent of the dataset size. Although Hybrid-SGD is for centralized optimization, the interesting ideas therein motivate our GT-STORM approach for decentralized learning in a similar vein. Interestingly, we show that in decentralized settings, our GT-STORM method can further reduce the gradient evaluation to only one sample per iteration, while not degrading the communication complexity order. Lastly, we remark that all the algorithms above have memory complexity at least $2p$ for $p$-dimensional problems. In contrast, GT-STORM enjoys a $p$ memory complexity.

2) Decentralized Optimization Algorithms: In the literature, many decentralized optimization algorithms have been proposed to solve Problem (1), e.g., first-order methods (nedic2009distributed; yuan2016convergence; shi2015extra; di2016next), primal-dual methods (sun2019distributed; mota2013d), and Newton-type methods (mokhtari2016decentralized; eisen2017decentralized) (see (nedic2018network; chang2020distributed) for comprehensive surveys). In this paper, we consider decentralized first-order methods for the non-convex network consensus optimization in (2). The convergence rate of the well-known decentralized gradient descent (DGD) algorithm (nedic2009distributed) was studied in (zeng2018nonconvex), which showed that DGD with a constant step-size converges at an $O(1/T)$ rate to a step-size-dependent error ball around a stationary point. Later, a gradient tracking (GT) method was proposed in (di2016next) to find an $\epsilon^2$-stationary point with an $O(1/T)$ convergence rate under constant step-sizes. However, these methods require a full gradient evaluation per iteration, which yields an $O(n\epsilon^{-2})$ sample complexity. To reduce the per-iteration sample complexity, stochastic gradients have been adopted in decentralized optimization, e.g., DSGD (jiang2017collaborative), PSGD (lian2017can), and GNSD (lu2019gnsd). Due to the randomness in stochastic gradients, the convergence rate is reduced to $O(1/\sqrt{T})$. Thus, the sample and communication complexities of these stochastic methods are $O(\epsilon^{-4})$ and $O(m^{-1}\epsilon^{-4})$, two orders of magnitude higher than their deterministic counterparts. To overcome the limitations of stochastic methods, a natural idea is to use variance reduction techniques similar to those for centralized optimization to reduce the sample and communication complexities for non-convex network consensus optimization. So far, existing works on decentralized stochastic variance reduction methods include DSA (mokhtari2016dsa), diffusion-AVRG (yuan2018variance), and GT-SAGA (xin2019variance), all of which focus on convex problems. To our knowledge, the decentralized gradient estimation and tracking (D-GET) algorithm in (sun2019improving) is the only work for non-convex optimization. D-GET integrates decentralized gradient tracking (lu2019gnsd) and the SpiderBoost gradient estimator (wang2019spiderboost) to obtain an $O(mn+m\sqrt{n}\epsilon^{-2})$ dataset-size-dependent sample complexity and an $O(\epsilon^{-2})$ communication complexity. Recall that the sample and communication complexities of GT-STORM are $O(m^{1/2}\epsilon^{-3})$ and $O(m^{-1/2}\epsilon^{-3})$, respectively. Thus, if the dataset size $n=\Omega(\epsilon^{-2})$, D-GET has a higher sample complexity than GT-STORM. As an example, when $\epsilon=10^{-2}$, this threshold for $n$ is on the order of $10^{4}$, which is common in modern machine learning datasets. Also, the memory complexity of D-GET is $2p$ as opposed to the $p$ memory complexity of GT-STORM. This implies a huge saving with GT-STORM if $p$ is large, e.g., $p\approx 10^{6}$ in many deep learning models.

3. A Gradient-Tracking Stochastic Recursive Momentum Algorithm

In this section, we introduce our gradient-tracking-based stochastic recursive momentum (GT-STORM) algorithm for solving Problem (2) in Section 3.1. Then, we will state the main theoretical results and their proofs in Sections 3.2 and 3.3, respectively.

3.1. The GT-STORM Algorithm

In the literature, a standard starting point to solve Problem (2) is to reformulate the problem as (nedic2009distributed):

(4) Minimize $\frac{1}{m}\sum_{i=1}^{m} f_{i}(\mathbf{x}_{i})$
    subject to $(\mathbf{W}\otimes\mathbf{I}_{p})\mathbf{x}=\mathbf{x},$

where $\mathbf{I}_{p}$ denotes the $p$-dimensional identity matrix, the operator $\otimes$ denotes the Kronecker product, and $\mathbf{W}\in\mathbb{R}^{m\times m}$ is often referred to as the consensus matrix. We let $[\mathbf{W}]_{ij}$ denote the element in the $i$-th row and $j$-th column of $\mathbf{W}$. For Problems (4) and (2) to be equivalent, $\mathbf{W}$ should satisfy the following properties:

  (a) Doubly stochastic: $\sum_{i=1}^{m}[\mathbf{W}]_{ij}=\sum_{j=1}^{m}[\mathbf{W}]_{ij}=1$.

  (b) Symmetric: $[\mathbf{W}]_{ij}=[\mathbf{W}]_{ji}$, $\forall i,j\in\mathcal{N}$.

  (c) Network-defined sparsity pattern: $[\mathbf{W}]_{ij}>0$ if $(i,j)\in\mathcal{L}$; otherwise $[\mathbf{W}]_{ij}=0$, $\forall i,j\in\mathcal{N}$.

The above properties imply that the eigenvalues of $\mathbf{W}$ are real and can be sorted as $-1<\lambda_{m}\leq\cdots\leq\lambda_{2}<\lambda_{1}=1$. For notational convenience, we define the second-largest eigenvalue magnitude of $\mathbf{W}$ as $\lambda\triangleq\max\{|\lambda_{2}|,|\lambda_{m}|\}$. As will be seen later, $\lambda$ plays an important role in the step-size selection and the algorithm's convergence rate.

As mentioned in Section 2.1, our GT-STORM algorithm is inspired by the GT method (di2016next; nedich2016geometrically) for reducing consensus error and the recursive variance reduction (VR) methods (fang2018spider; wang2019spiderboost) developed for centralized optimization. Specifically, in the GT method, an estimator $\mathbf{y}$ is introduced to track the global gradient:

(5) $\mathbf{y}_{t}=\mathbf{W}\mathbf{y}_{t-1}+\mathbf{g}_{t}-\mathbf{g}_{t-1},$

where $\mathbf{g}_{t}$ is the gradient estimate in the $t$-th iteration. Meanwhile, to reduce the stochastic error, the gradient estimator $\mathbf{v}$ in VR methods is updated recursively based on a double-loop structure as follows:

(6) $\mathbf{v}_{t}=\mathbf{v}_{t-1}+\nabla f(\mathbf{x}_{t};\zeta_{t})-\nabla f(\mathbf{x}_{t-1};\zeta_{t}),\quad\text{if } \mathrm{mod}(t,q)\neq 0,$

where $\nabla f(\mathbf{x};\zeta)$ is the stochastic gradient dependent on parameter $\mathbf{x}$ and a data sample $\zeta$, and $q$ is the number of inner-loop iterations. On the other hand, if $\mathrm{mod}(t,q)=0$, $\mathbf{v}_{t}$ takes a full gradient. Note that these two estimators have a similar structure: both recursively update the previous estimate based on the difference of the gradient estimates between two consecutive iterations (i.e., momentum). This motivates us to consider the following question: Could we somehow "integrate" these two methods to develop a new decentralized gradient estimator that tracks the global gradient and reduces the stochastic error at the same time? Unfortunately, the GT and VR estimators cannot be combined straightforwardly. The major challenge lies in the structural difference in the outer-loop iteration (i.e., $\mathrm{mod}(t,q)=0$), where the VR estimator requires a full gradient and does not follow the recursive updating structure.
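For concreteness, below is a minimal sketch (ours, for illustration) of the double-loop VR update in Eq. (6); `full_grad` and `stoch_grad` are hypothetical oracles for $\nabla f(\mathbf{x})$ and $\nabla f(\mathbf{x};\zeta)$. The outer-loop branch is exactly the part that has no recursive analogue in the GT update (5):

```python
def vr_estimator_step(t, q, v_prev, x_t, x_prev, full_grad, stoch_grad, sample):
    """One step of the SPIDER/SARAH-style estimator in Eq. (6)."""
    if t % q == 0:
        # Outer loop: restart with a full gradient -- this breaks the recursion
        # and is why (5) and (6) cannot be merged directly.
        return full_grad(x_t)
    # Inner loop: recursive (momentum-like) correction using a single sample.
    return v_prev + stoch_grad(x_t, sample) - stoch_grad(x_prev, sample)
```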

Surprisingly, in this paper, we show that there exists an "indirect" way to achieve the salient features of both GT and VR. Our approach is to abandon the double-loop structure of VR and pursue a single-loop structure. Yet, this single-loop structure should still be able to reduce the variance and consistently track the global gradient. Specifically, we introduce a parameter $\beta_{t}\in[0,1]$ in the recursive update and integrate it with a consensus step as follows:

(7) $\mathbf{v}_{i,t}=\beta_{t}\sum_{j\in\mathcal{N}_{i}}[\mathbf{W}]_{ij}\mathbf{v}_{j,t-1}+\nabla f_{i}(\mathbf{x}_{i,t};\zeta_{i,t})-\beta_{t}\nabla f_{i}(\mathbf{x}_{i,t-1};\zeta_{i,t}),$

where $\mathbf{x}_{i,t}$, $\mathbf{v}_{i,t}$, and $\zeta_{i,t}$ are the parameter, gradient estimator, and random sample in the $t$-th iteration at node $i$, respectively. Note that the estimator reduces to the classical stochastic gradient estimator when $\beta_{t}=0$. On the other hand, if we set $\beta_{t}=1$, the estimator becomes the (stochastic) gradient tracking estimator based on a single sample (implying low sample complexity). The key to the success of our GT-STORM design then lies in meticulously choosing the parameter $\beta_{t}$ to mimic the gradient estimator technique in centralized optimization (cutkosky2019momentum; tran2019hybrid). Lastly, the local parameters are updated by the conventional decentralized stochastic gradient descent step:

(8) $\mathbf{x}_{i,t+1}=\sum_{j\in\mathcal{N}_{i}}[\mathbf{W}]_{ij}\mathbf{x}_{j,t}-\eta_{t}\mathbf{v}_{i,t},$

where $\eta_{t}$ is the step-size in iteration $t$. To summarize, we state the full procedure in Algorithm 1 as follows.

  Algorithm 1: Gradient-Tracking-based Stochastic Recursive Momentum Algorithm (GT-STORM).

  Initialization:

  1. Choose $T>0$ and let $t=1$. Set $\mathbf{x}_{i,0}=\mathbf{x}^{0}$ at each node $i$. Calculate $\mathbf{v}_{i,0}=\nabla f_{i}(\mathbf{x}_{i,0};\zeta_{i,0})$ at each node $i$.

  Main Loop:

  2. In the $t$-th iteration, each node $i$ sends its local parameter $\mathbf{x}_{i,t-1}$ and local gradient estimator $\mathbf{v}_{i,t-1}$ to its neighbors. Meanwhile, upon receiving all neighbors' information, each node performs the following:

    a) Update local parameter: $\mathbf{x}_{i,t}=\sum_{j\in\mathcal{N}_{i}}[\mathbf{W}]_{ij}\mathbf{x}_{j,t-1}-\eta_{t-1}\mathbf{v}_{i,t-1}$.

    b) Update local gradient estimator: $\mathbf{v}_{i,t}=\beta_{t}\sum_{j\in\mathcal{N}_{i}}[\mathbf{W}]_{ij}\mathbf{v}_{j,t-1}+\nabla f_{i}(\mathbf{x}_{i,t};\zeta_{i,t})-\beta_{t}\nabla f_{i}(\mathbf{x}_{i,t-1};\zeta_{i,t})$.

  3. Stop if $t>T$; otherwise, let $t\leftarrow t+1$ and go to Step 2.

Two remarks on Algorithm 1 are in order. First, thanks to the single-loop structure, GT-STORM is easier to implement than the low-sample-complexity D-GET method (sun2019improving), which has a double-loop structure. Second, GT-STORM only requires $p$ memory space due to the use of only one intermediate vector $\mathbf{v}$ at each node. In contrast, the memory complexity of D-GET is $2p$ (cf. $\mathbf{y}$ and $\mathbf{v}$ in (sun2019improving)). This 50% saving is huge, particularly for deep learning models, where the number of parameters could be in the range of millions. A minimal simulation sketch of Steps 1-3 is given below.
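The following self-contained NumPy sketch runs Algorithm 1 on synthetic quadratic losses (our illustration, not the authors' reference implementation); the local objectives, ring network, and the constants $\tau$, $\omega$, $\rho$ are assumptions chosen only to make the example runnable:

```python
import numpy as np

rng = np.random.default_rng(0)
m, p, T = 10, 5, 200                        # nodes, dimension, iterations

# Hypothetical local objectives f_i(x) = 0.5*||x - b_i||^2; the "sample" zeta
# is modeled as additive gradient noise shared by both evaluations in Step 2b.
b = rng.normal(size=(m, p))
def stoch_grad(i, x, noise):
    return (x - b[i]) + 0.1 * noise

# Simple ring-network consensus matrix (symmetric, doubly stochastic).
W = np.zeros((m, m))
for i in range(m):
    W[i, i], W[i, (i - 1) % m], W[i, (i + 1) % m] = 0.5, 0.25, 0.25

tau, omega, rho = 1.0, 2.0, 1.0             # schedule constants (assumed values)
x = np.tile(rng.normal(size=p), (m, 1))     # Step 1: x_{i,0} = x^0 for all i
v = np.array([stoch_grad(i, x[i], rng.normal(size=p)) for i in range(m)])

for t in range(1, T + 1):
    eta_prev = tau / (omega + t - 1) ** (1 / 3)   # eta_{t-1}
    beta_t = 1 - rho * eta_prev**2                # beta_t = 1 - rho * eta_{t-1}^2
    noise = rng.normal(size=(m, p))               # one sample zeta_{i,t} per node
    x_prev = x.copy()
    x = W @ x - eta_prev * v                      # Step 2a: consensus + descent
    v = beta_t * (W @ v) + np.array(              # Step 2b: recursive momentum, Eq. (7)
        [stoch_grad(i, x[i], noise[i]) - beta_t * stoch_grad(i, x_prev[i], noise[i])
         for i in range(m)])

print("final consensus error:", np.mean(np.sum((x - x.mean(0)) ** 2, axis=1)))
```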

3.2. Main Theoretical Results

In this section, we establish the complexity properties of the proposed GT-STORM algorithm. For better readability, we state the main theorem and its corollary in this section and defer the intermediate lemmas to Section 3.3. We start with the following assumptions on the global and local objectives:

Assumption 1.

The objective function $f(\mathbf{x})=\frac{1}{m}\sum_{i=1}^{m}f_{i}(\mathbf{x})$ with $f_{i}(\mathbf{x})=\mathbb{E}_{\zeta\sim\mathcal{D}_{i}}f_{i}(\mathbf{x};\zeta)$ satisfies the following assumptions:

  (a) Boundedness from below: There exists a finite lower bound $f^{*}=\inf_{\mathbf{x}}f(\mathbf{x})>-\infty$;

  (b) $L$-average smoothness: $f_{i}(\cdot;\zeta)$ is $L$-average smooth on $\mathbb{R}^{p}$, i.e., there exists a positive constant $L$ such that $\mathbb{E}_{\zeta\sim\mathcal{D}_{i}}[\|\nabla f_{i}(\mathbf{x};\zeta)-\nabla f_{i}(\mathbf{y};\zeta)\|^{2}]\leq L^{2}\|\mathbf{x}-\mathbf{y}\|^{2}$, $\forall\mathbf{x},\mathbf{y}\in\mathbb{R}^{p}$, $i\in[m]$;

  (c) Bounded variance: There exists a constant $\sigma\geq 0$ such that $\mathbb{E}_{\zeta\sim\mathcal{D}_{i}}[\|\nabla f_{i}(\mathbf{x};\zeta)-\nabla f_{i}(\mathbf{x})\|^{2}]\leq\sigma^{2}$, $\forall\mathbf{x}\in\mathbb{R}^{p}$, $i\in[m]$;

  (d) Bounded gradient: There exists a constant $G\geq 0$ such that $\mathbb{E}_{\zeta\sim\mathcal{D}_{i}}[\|\nabla f_{i}(\mathbf{x};\zeta)\|^{2}]\leq G^{2}$, $\forall\mathbf{x}\in\mathbb{R}^{p}$, $i\in[m]$.

In the above assumptions, (a) and (c) are standard in the stochastic non-convex optimization literature; (b) is an expected Lipschitz smoothness condition over the data distribution, which implies the conventional global Lipschitz smoothness (ghadimi2013stochastic) by Jensen's inequality. Note that (b) is weaker than the individual Lipschitz smoothness assumption in (fang2018spider; wang2019spiderboost; sun2019improving): if there exists an outlier data sample, the individual objective function might have a very large smoothness parameter while the average smoothness can still be small; (d) is equivalent to the Lipschitz continuity assumption, which is also commonly used for non-convex stochastic algorithms (zhou2018generalization; karimireddy2019error; koloskova2019decentralized) and is essential for analyzing the decentralized gradient descent method (yuan2016convergence; zeng2018nonconvex; jiang2017collaborative). (Note that under assumption (b), as long as the parameter $\mathbf{x}$ is bounded, (d) is satisfied.)

For convenience, in the subsequent analysis, we define $\tilde{\mathbf{W}}=\mathbf{W}\otimes\mathbf{I}_{p}$, $\mathbf{g}_{i,t}=\nabla f_{i}(\mathbf{x}_{i,t})$, $\mathbf{u}_{i,t}=\nabla f_{i}(\mathbf{x}_{i,t};\zeta_{i,t})$, $\mathbf{w}_{i,t}=\nabla f_{i}(\mathbf{x}_{i,t};\zeta_{i,t})-\nabla f_{i}(\mathbf{x}_{i,t-1};\zeta_{i,t})$, $\mathbf{a}_{t}=[\mathbf{a}_{1,t}^{\top},\cdots,\mathbf{a}_{m,t}^{\top}]^{\top}$, and $\bar{\mathbf{a}}_{t}=\frac{1}{m}\sum_{i=1}^{m}\mathbf{a}_{i,t}$, for $\mathbf{a}\in\{\mathbf{x},\mathbf{u},\mathbf{w},\mathbf{v},\mathbf{g}\}$. Then, the algorithm can be compactly rewritten in the following matrix-vector form:

(9) $\mathbf{x}_{t}=\tilde{\mathbf{W}}\mathbf{x}_{t-1}-\eta_{t-1}\mathbf{v}_{t-1},$
(10) $\mathbf{v}_{t}=\beta_{t}\tilde{\mathbf{W}}\mathbf{v}_{t-1}+\beta_{t}\mathbf{w}_{t}+(1-\beta_{t})\mathbf{u}_{t}.$

Furthermore, since $\mathbf{1}^{\top}\mathbf{W}=\mathbf{1}^{\top}$, we have $\bar{\mathbf{x}}_{t}=\bar{\mathbf{x}}_{t-1}-\eta_{t-1}\bar{\mathbf{v}}_{t-1}$ and $\bar{\mathbf{v}}_{t}=\beta_{t}\bar{\mathbf{v}}_{t-1}+\beta_{t}\bar{\mathbf{w}}_{t}+(1-\beta_{t})\bar{\mathbf{u}}_{t}$. We first state the convergence result for Algorithm 1 as follows:

Theorem 1.

Under Assumption 1 and with positive constants $c_{0}$ and $c_{1}$ satisfying $1-(1+c_{1})\lambda^{2}-\frac{1}{c_{0}}>0$, if we set $\eta_{t}=\tau/(\omega+t)^{1/3}$ and $\beta_{t+1}=1-\rho\eta_{t}^{2}$, with $\tau>0$, $\omega\geq\max\{2,\tau^{3}/\min\{k_{1}^{3},k_{2}^{3},k_{3}^{3}\}\}$, and $\rho=2/(3\tau^{3})+32L^{2}$, then we have the following result for Algorithm 1:

$\min_{t\in[T]}\Big\{\mathbb{E}[\|\nabla f(\bar{\mathbf{x}}_{t})\|^{2}]+\frac{1}{m}\mathbb{E}[\|\mathbf{x}_{t}-\mathbf{1}\otimes\bar{\mathbf{x}}_{t}\|^{2}]\Big\}$
$\leq \frac{2(f(\bar{\mathbf{x}}_{0})-f(\bar{\mathbf{x}}^{*}))}{\tau(T+1)^{2/3}}+\frac{2c_{0}\mathbb{E}[\|\mathbf{v}_{0}-\mathbf{1}\otimes\bar{\mathbf{v}}_{0}\|^{2}]}{m\tau(T+1)^{2/3}}+\frac{(\omega-1)\sigma^{2}}{16mL^{2}\tau^{2}(T+1)^{2/3}}+\frac{\rho^{2}\sigma^{2}\ln(\omega+T-1)}{8mL^{2}(T+1)^{2/3}}$
(11) $+\frac{12(1+\frac{1}{c_{1}})c_{0}\tau^{1/3}G^{2}\rho^{2}}{(\omega-1)^{1/3}(T+1)^{2/3}}+O\Big(\frac{c_{3}\omega}{\tau T^{5/3}}\Big),$

where $c_{3}=\max\{1,\omega/(m\tau^{2}),\tau^{4/3}/\omega^{1/3},\tau\ln(\omega+T)/m\}$, and the constants $k_{1}$, $k_{2}$, and $k_{3}$ are:

(12) $k_{1}=1/\big(2L+32(1+\tfrac{1}{c_{1}})c_{0}L^{2}\big),$
(13) $k_{2}=\big(1-(1+c_{1})\lambda^{2}\big)/\big(1+\tfrac{1}{c_{1}}+\tfrac{1}{c_{0}}\big),$
(14) $k_{3}=\sqrt{\big(1-(1+c_{1})\lambda^{2}-\tfrac{1}{c_{0}}\big)/\big(\tfrac{2}{3\tau^{3}}+\tfrac{2L^{2}+1}{2c_{0}}\big)}.$

In Theorem 1, $c_{0}$ and $c_{1}$ are two constants depending on the network topology, which in turn affect the step-size and convergence. Consider a sparse network, i.e., $\lambda$ is close to (but not exactly) one (recall that $\lambda=\max\{|\lambda_{2}|,|\lambda_{m}|\}$). In order for $1-(1+c_{1})\lambda^{2}-\frac{1}{c_{0}}>0$ to hold, $c_{0}$ needs to be large and $c_{1}$ needs to be close to zero, which leads to small $k_{1}$, $k_{2}$, and $k_{3}$. Note that the step-size $\eta_{t}$ is of order $O(t^{-1/3})$, which is larger than the $O(t^{-1/2})$ order for classical decentralized SGD algorithms. With this larger step-size, the convergence rate is $O(t^{-2/3})$, faster than the $O(t^{-1/2})$ rate of decentralized SGD algorithms. Based on Theorem 1, we have the following sample and communication complexity results for Algorithm 1:
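As a quick sanity check on the parameter schedules in Theorem 1, the following sketch (ours; the values of $\tau$, $\omega$, and $L$ are placeholders, not values prescribed by the theorem) generates $\eta_t$ and $\beta_{t+1}$ and verifies that $\beta_{t+1}\in[0,1]$ once $\omega$ is large enough:

```python
import numpy as np

tau, omega, L = 1.0, 200.0, 1.0           # placeholder constants
rho = 2 / (3 * tau**3) + 32 * L**2        # rho as prescribed in Theorem 1

t = np.arange(0, 100)
eta = tau / (omega + t) ** (1 / 3)        # eta_t = tau / (omega + t)^{1/3}
beta_next = 1 - rho * eta**2              # beta_{t+1} = 1 - rho * eta_t^2

# Since rho * eta_t^2 <= rho * tau^2 / omega^{2/3}, choosing
# omega >= (rho * tau^2)^{3/2} (about 187 here) keeps beta_{t+1} in [0, 1].
print(beta_next.min(), beta_next.max())   # both values lie in [0, 1]
```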

Corollary 2.

Under the conditions in Theorem 1, if $\tau=O(m^{1/3})$ and $\omega=O(m^{4/3})$, then to achieve an $\epsilon^2$-stationary solution, the total number of communication rounds is on the order of $\tilde{O}(m^{-1/2}\epsilon^{-3})$ and the total number of samples evaluated across the network is on the order of $\tilde{O}(m^{1/2}\epsilon^{-3})$.

3.3. Proofs of the Theoretical Results

Due to space limitation, we provide a proof sketch for Theorem 1 here and relegate the details to the appendices. First, we bound the error of the gradient estimator, $\mathbb{E}[\|\mathbf{v}_{t}-\mathbf{g}_{t}\|^{2}]$, as follows:

Lemma 1 (Error of Gradient Estimator).

Under Assumption 1 and with $\mathbf{v}_{t}$ defined in (10), it holds that $\mathbb{E}[\|\bar{\mathbf{v}}_{t}-\bar{\mathbf{g}}_{t}\|^{2}]\leq\beta_{t}^{2}\mathbb{E}[\|\bar{\mathbf{v}}_{t-1}-\bar{\mathbf{g}}_{t-1}\|^{2}]+\frac{2\beta_{t}^{2}L^{2}}{m}\mathbb{E}[\|\mathbf{x}_{t}-\mathbf{x}_{t-1}\|^{2}]+\frac{2(1-\beta_{t})^{2}\sigma^{2}}{m}$.

It can be seen that the upper bound depends on the error in the previous step with a factor $\beta_{t}^{2}$. This will be helpful when we construct a potential function. Then, according to the algorithm updates (9)–(10), we show the following descent inequality:

Lemma 2 (Descent Lemma).

Under Assumption 1, Algorithm 1 satisfies: $\mathbb{E}[f(\bar{\mathbf{x}}_{t+1})]-\mathbb{E}[f(\bar{\mathbf{x}}_{t})]\leq-\frac{\eta_{t}}{2}\mathbb{E}[\|\nabla f(\bar{\mathbf{x}}_{t})\|^{2}]-\big(\frac{\eta_{t}}{2}-\frac{L\eta_{t}^{2}}{2}\big)\mathbb{E}[\|\bar{\mathbf{v}}_{t}\|^{2}]+\eta_{t}\mathbb{E}[\|\bar{\mathbf{v}}_{t}-\bar{\mathbf{g}}_{t}\|^{2}]+\frac{L^{2}\eta_{t}}{m}\mathbb{E}[\|\mathbf{x}_{t}-\mathbf{1}\otimes\bar{\mathbf{x}}_{t}\|^{2}]$.

We remark that the right-hand side (RHS) of the above inequality contains the consensus error of local parameters, $\sum_{t=0}^{T}\mathbb{E}[\|\mathbf{x}_{t}-\mathbf{1}\otimes\bar{\mathbf{x}}_{t}\|^{2}]$, which makes the analysis more difficult than that of centralized optimization. Next, we prove the contraction of the iterates in the following lemma, which is useful in analyzing decentralized gradient tracking algorithms.

Lemma 3 (Iterates Contraction).

The following contraction properties of the iterates produced by Algorithm 1 hold:

(15) $\|\mathbf{x}_{t+1}-\mathbf{1}\otimes\bar{\mathbf{x}}_{t+1}\|^{2}\leq(1+c_{1})\lambda^{2}\|\mathbf{x}_{t}-\mathbf{1}\otimes\bar{\mathbf{x}}_{t}\|^{2}+(1+\tfrac{1}{c_{1}})\eta_{t}^{2}\|\mathbf{v}_{t}-\mathbf{1}\otimes\bar{\mathbf{v}}_{t}\|^{2},$

(16) $\|\mathbf{v}_{t+1}-\mathbf{1}\otimes\bar{\mathbf{v}}_{t+1}\|^{2}\leq(1+c_{1})\beta_{t+1}^{2}\lambda^{2}\|\mathbf{v}_{t}-\mathbf{1}\otimes\bar{\mathbf{v}}_{t}\|^{2}+2(1+\tfrac{1}{c_{1}})\big(\beta_{t+1}^{2}\|\mathbf{w}_{t+1}\|^{2}+(1-\beta_{t+1})^{2}\|\mathbf{u}_{t+1}\|^{2}\big),$

where $c_{1}$ is a positive constant. Additionally, we have

(17) $\|\mathbf{x}_{t+1}-\mathbf{x}_{t}\|^{2}\leq 8\|\mathbf{x}_{t}-\mathbf{1}\otimes\bar{\mathbf{x}}_{t}\|^{2}+4\eta_{t}^{2}\|\mathbf{v}_{t}-\mathbf{1}\otimes\bar{\mathbf{v}}_{t}\|^{2}+4\eta_{t}^{2}m\|\bar{\mathbf{v}}_{t}\|^{2}.$

Finally, we define a potential function in (18), based on which we prove the convergence bound:

Lemma 4.

(Convergence of Potential Function) Define the following potential function:

(18) $H_{t}=\mathbb{E}\Big[f(\bar{\mathbf{x}}_{t})+\frac{1}{32L^{2}\eta_{t-1}}\|\bar{\mathbf{g}}_{t}-\bar{\mathbf{v}}_{t}\|^{2}+\frac{c_{0}}{m\eta_{t-1}}\|\mathbf{x}_{t}-\mathbf{1}\otimes\bar{\mathbf{x}}_{t}\|^{2}+\frac{c_{0}}{m}\|\mathbf{v}_{t}-\mathbf{1}\otimes\bar{\mathbf{v}}_{t}\|^{2}\Big],$

where $c_{0}$ is a positive constant. Under Assumption 1, if we set $\eta_{t}=\tau/(\omega+t)^{1/3}$ and $\beta_{t+1}=1-\rho\eta_{t}^{2}$, where $\tau>0$, $\omega\geq 2$, and $\rho=2/(3\tau^{3})+32L^{2}$ are three constants, then it holds that:

(19) $H_{t+1}-H_{t}\leq-\frac{\eta_{t}}{2}\mathbb{E}[\|\nabla f(\bar{\mathbf{x}}_{t})\|^{2}]+\frac{\rho^{2}\sigma^{2}\eta_{t}^{3}}{16mL^{2}}+2(1+\tfrac{1}{c_{1}})c_{0}G^{2}\rho^{2}\eta_{t}^{4}-\frac{c_{0}C_{1}}{m\eta_{t}}\mathbb{E}[\|\mathbf{x}_{t}-\mathbf{1}\otimes\bar{\mathbf{x}}_{t}\|^{2}]-\frac{c_{0}C_{2}}{m}\mathbb{E}[\|\mathbf{v}_{t}-\mathbf{1}\otimes\bar{\mathbf{v}}_{t}\|^{2}]-\frac{C_{3}\eta_{t}}{4}\mathbb{E}[\|\bar{\mathbf{v}}_{t}\|^{2}],$

where $C_{1}$, $C_{2}$, and $C_{3}$ are the following constants: $C_{1}=1-(1+c_{1})\lambda^{2}-\frac{1}{2c_{0}}-16(1+\tfrac{1}{c_{1}})L^{2}\eta_{t}-\big(\tfrac{2}{3\tau^{3}}+\tfrac{L^{2}}{c_{0}}\big)\eta_{t}^{2}$, $C_{2}=1-(1+c_{1})\lambda^{2}-(1+\tfrac{1}{c_{1}})\eta_{t}-\frac{\eta_{t}}{4c_{0}}-8(1+\tfrac{1}{c_{1}})L^{2}\eta_{t}^{2}$, $C_{3}=1-2L\eta_{t}-32(1+\tfrac{1}{c_{1}})c_{0}L^{2}\eta_{t}$.

Finally, by properly selecting the parameters, the constants $C_{1}$, $C_{2}$, and $C_{3}$ can be made non-negative, which leads to Theorem 1.

4. Experimental Results

In this section, we conduct experiments using several non-convex machine learning problems to evaluate the performance of our method. In particular, we compare our algorithm with the following state-of-the-art single-loop algorithms (a compact sketch of both update rules follows the list):

  • DSGD (nedic2009distributed; yuan2016convergence; jiang2017collaborative): Each node performs $\mathbf{x}_{i,t+1}=\sum_{j\in\mathcal{N}_{i}}[\mathbf{W}]_{ij}\mathbf{x}_{j,t}-\eta\nabla f_{i}(\mathbf{x}_{i,t};\zeta_{i,t})$, where the stochastic gradient $\nabla f_{i}(\mathbf{x}_{i,t};\zeta_{i,t})$ corresponds to a random sample $\zeta_{i,t}$. Then, each node exchanges the local parameter $\mathbf{x}_{i,t}$ with its neighbors.

  • GNSD (lu2019gnsd): Each node keeps two variables, $\mathbf{x}_{i,t}$ and $\mathbf{y}_{i,t}$. The local parameter $\mathbf{x}_{i,t}$ is updated as $\mathbf{x}_{i,t+1}=\sum_{j\in\mathcal{N}_{i}}[\mathbf{W}]_{ij}\mathbf{x}_{j,t}-\eta\mathbf{y}_{i,t}$, and the tracked gradient $\mathbf{y}_{i,t}$ is updated as $\mathbf{y}_{i,t+1}=\sum_{j\in\mathcal{N}_{i}}[\mathbf{W}]_{ij}\mathbf{y}_{j,t}+\nabla f_{i}(\mathbf{x}_{i,t+1};\zeta_{i,t+1})-\nabla f_{i}(\mathbf{x}_{i,t};\zeta_{i,t})$.
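Below is a compact NumPy sketch (ours, for illustration) of the two baseline updates above; `grad_fn` is a hypothetical oracle that draws fresh samples $\zeta_{i,t}$ and returns the stacked stochastic gradients:

```python
import numpy as np

def dsgd_step(x, W, grad_fn, eta):
    """DSGD: consensus on parameters plus a local stochastic gradient step.

    x : (m, p) stacked local parameters; W : (m, m) consensus matrix.
    """
    return W @ x - eta * grad_fn(x)

def gnsd_step(x, y, W, grad_fn, grads_old, eta):
    """GNSD: parameter update plus a gradient-tracking update for y."""
    x_next = W @ x - eta * y
    grads_new = grad_fn(x_next)               # gradients at x_{t+1} on fresh samples
    y_next = W @ y + grads_new - grads_old    # track the global gradient
    return x_next, y_next, grads_new
```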

Here, we compare with the above two classes of stochastic algorithms because they both employ a single-loop structure and do not require full gradient evaluations. We note that it is hard to have a direct and fair comparison with D-GET (sun2019improving) numerically, since D-GET uses full gradients and has a double-loop structure.

Network Model: The communication graph $\mathcal{G}$ is generated by the Erdős–Rényi model with edge connectivity probability $p_{c}$ and number of nodes $m$. We set $m=10$ and the edge connectivity probability $p_{c}=0.5$. The consensus matrix is chosen as $\mathbf{W}=\mathbf{I}-\frac{2}{3\lambda_{\max}(\mathbf{L})}\mathbf{L}$, where $\mathbf{L}$ is the Laplacian matrix of $\mathcal{G}$ and $\lambda_{\max}(\mathbf{L})$ denotes the largest eigenvalue of $\mathbf{L}$.
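A minimal sketch (ours) of this construction using networkx follows; resampling until the graph is connected is our assumption about the procedure, since the consensus analysis requires a connected graph:

```python
import networkx as nx
import numpy as np

m, p_c = 10, 0.5
G = nx.erdos_renyi_graph(m, p_c, seed=1)
while not nx.is_connected(G):                 # assumed: resample until connected
    G = nx.erdos_renyi_graph(m, p_c)

L = nx.laplacian_matrix(G).toarray().astype(float)
lam_max = np.linalg.eigvalsh(L).max()         # largest Laplacian eigenvalue
W = np.eye(m) - (2 / (3 * lam_max)) * L       # consensus matrix from Section 4

# Second-largest eigenvalue magnitude lambda, which governs convergence.
eigs = np.sort(np.abs(np.linalg.eigvalsh(W)))[::-1]
print("lambda =", eigs[1])
```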

1) Non-convex logistic regression: In our first experiment, we consider the binary logistic regression problem with a non-convex regularizer (wang2018cubic; wang2019spiderboost; tran2019hybrid):

(20) $\min_{\mathbf{x}\in\mathbb{R}^{d}} -\frac{1}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n}\Big[y_{ij}\log\big(\frac{1}{1+e^{-\mathbf{x}^{\top}\zeta_{ij}}}\big)+(1-y_{ij})\log\big(\frac{e^{-\mathbf{x}^{\top}\zeta_{ij}}}{1+e^{-\mathbf{x}^{\top}\zeta_{ij}}}\big)\Big]+\alpha\sum_{i=1}^{d}\frac{\mathbf{x}_{i}^{2}}{1+\mathbf{x}_{i}^{2}},$

where the label $y_{ij}\in\{0,1\}$, the feature vector $\zeta_{ij}\in\mathbb{R}^{d}$, and $\alpha=0.1$.
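A minimal NumPy sketch (ours) of a single node's loss and single-sample stochastic gradient for (20) is given below; the sigmoid-based simplification of the two log terms is standard algebra, and the function names are our own:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def local_loss(x, Z, y, alpha=0.1):
    """Non-convex regularized logistic loss of Eq. (20) for one node.

    Z : (n, d) local features; y : (n,) labels in {0, 1}.
    """
    prob = sigmoid(Z @ x)
    nll = -np.mean(y * np.log(prob) + (1 - y) * np.log(1 - prob))
    return nll + alpha * np.sum(x**2 / (1 + x**2))

def local_stoch_grad(x, Z, y, j, alpha=0.1):
    """Stochastic gradient on a single sample j (batch size one, as in Section 4)."""
    err = sigmoid(Z[j] @ x) - y[j]              # d/dx of the logistic term
    reg = alpha * 2 * x / (1 + x**2) ** 2       # d/dx of the non-convex regularizer
    return err * Z[j] + reg
```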

1-a) Datasets: We consider three commonly used binary classification datasets from LibSVM: a9a, rcv1.binary, and ijcnn1. The a9a dataset has 32,561 samples and 123 features; the rcv1.binary dataset has 20,242 samples and 47,236 features; and the ijcnn1 dataset has 49,990 samples and 22 features. We evenly divide each dataset into $m$ sub-datasets corresponding to the $m$ nodes.

1-b) Parameters: For all algorithms, we set the batch size to one, and the initial step-size $\eta_{0}$ is tuned by searching over the grid $\{0.01,0.02,0.05,0.1,0.2,0.5,1.0\}$. For DSGD and GNSD, the step-size is set to $\eta_{t}=\eta_{0}/\sqrt{1+0.1t}$, which is on the order of $O(t^{-1/2})$, following the state-of-the-art theoretical result (lu2019gnsd). For GT-STORM, the step-size is set to $\eta_{t}=\eta_{0}/\sqrt[3]{1+0.1t}$, which is on the order of $O(t^{-1/3})$ as specified in our theoretical result. In addition, we choose the parameter $\rho$ for GT-STORM as $1/\eta_{0}^{2}$, so that $\beta_{1}=0$ in the first step.

1-c) Results: We first compare the convergence rates of the algorithms. We adopt the consensus loss defined on the left-hand side (LHS) of (3) as the criterion. After tuning, the best initial step-sizes are 0.1, 0.5, and 0.2 for a9a, ijcnn1, and rcv1.binary, respectively. The results are shown in Figs. 1–3. It can be seen that our algorithm has a better performance: on the a9a and rcv1.binary datasets, all algorithms reach almost the same accuracy but our algorithm is faster; on the ijcnn1 dataset, our algorithm outperforms the other methods in both speed and accuracy.

Figure 1. Non-convex logistic regression on LibSVM: a9a.
Figure 2. Non-convex logistic regression on LibSVM: ijcnn1.
Figure 3. Non-convex logistic regression on LibSVM: rcv1.
Figure 4. Non-convex logistic regression: the effect of $\rho$.
Figure 5. Non-convex logistic regression: the effect of $p_c$.
Figure 6. Non-convex logistic regression: the effect of $m$.
Figure 7. CNN experimental results on the MNIST dataset.
Figure 8. CNN experimental results on the CIFAR-10 dataset.

Next, we examine the effect of the parameter $\rho$ on our algorithm. We focus on the a9a dataset and fix the initial step-size at $\eta_{0}=0.1$. We choose $\rho$ from $\{10^{-1},10^{0},10^{1},10^{2}\}$. Note that $\rho=10^{2}$ corresponds to the case $\rho=1/\eta_{0}^{2}$. The results are shown in Fig. 4. It can be seen that the case $\rho=10^{1}$ has the best performance, followed by the case $\rho=10^{2}$. Also, as $\rho$ decreases, the convergence speed becomes slower (see the cases $\rho=10^{-1}$ and $10^{0}$).

In addition, we examine the effect of the network topology. We first fix the number of workers at $m=10$ and change the edge connectivity probability $p_{c}$ from 0.35 to 0.9. Note that with a smaller $p_{c}$, the network becomes sparser. We set $\eta_{0}=0.1$ and $\rho=10^{2}$. The results are shown in Fig. 5. Under different $p_{c}$ values, our algorithm has a similar performance in terms of convergence speed and accuracy. But with a larger $p_{c}$ value, i.e., a denser network, the convergence speed slightly increases (see the zoom-in view in Fig. 5). Then, we fix the edge connectivity probability at $p_{c}=0.5$ but change the number of workers $m$ from 10 to 50. We show the results in Fig. 6. It can be seen that with more workers, the algorithm converges faster and reaches a better accuracy.

2) Convolutional neural networks: We use all three algorithms to train a convolutional neural network (CNN) model for image classification on the MNIST and CIFAR-10 datasets. We adopt the same network topology as in the previous experiment. We use a non-identically-distributed data partition strategy: the $i$-th machine can only access the data with the $i$-th label. We fix the initial step-size at $\eta_{0}=0.01$ for all three algorithms, and the remaining settings are the same as in the previous experiment.

2-a) Learning Models: For MNIST, the adopted CNN model has two convolutional layers (the first of size $1\times 16\times 5$ and the second of size $16\times 32\times 5$), each followed by a $2\times 2$ max-pooling layer, and then a fully connected layer. The ReLU activation is used for the two convolutional layers, and the "softmax" activation is applied at the output layer. The batch size is 64 for CNN training on MNIST. For CIFAR-10, we apply a CNN model with two convolutional layers (the first of size $3\times 6\times 5$ and the second of size $6\times 16\times 5$), each followed by a $2\times 2$ max-pooling layer, and then three fully connected layers. The ReLU activation is used for the two convolutional layers and the first two fully connected layers, and the "softmax" activation is applied at the output layer. The batch size is 128 for CNN training on CIFAR-10.
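A minimal PyTorch sketch (ours) consistent with the MNIST architecture described above follows; the padding choice and the fully connected layer width are assumptions, since the text does not specify them (with $5\times 5$ kernels and no padding, a $28\times 28$ input yields a $32\times 4\times 4$ feature map):

```python
import torch
import torch.nn as nn

class MnistCNN(nn.Module):
    """Sketch of the described MNIST model: conv(1->16,5) -> pool ->
    conv(16->32,5) -> pool -> fully connected; softmax is applied via the loss."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5),   # 28x28 -> 24x24 (no padding: assumed)
            nn.ReLU(),
            nn.MaxPool2d(2),                   # 24x24 -> 12x12
            nn.Conv2d(16, 32, kernel_size=5),  # 12x12 -> 8x8
            nn.ReLU(),
            nn.MaxPool2d(2),                   # 8x8 -> 4x4
        )
        self.fc = nn.Linear(32 * 4 * 4, 10)    # width follows from the shapes above

    def forward(self, x):
        x = self.features(x)
        return self.fc(x.flatten(1))           # logits; pair with CrossEntropyLoss

model = MnistCNN()
print(model(torch.randn(64, 1, 28, 28)).shape)  # torch.Size([64, 10])
```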

2-b) Results: Figs. 7 and 8 illustrate the test accuracy of the different algorithms versus iterations on the MNIST and CIFAR-10 datasets, respectively. It can be seen from Fig. 7 that on the MNIST dataset, GNSD and GT-STORM have similar performance, but our GT-STORM maintains a faster speed and a better prediction accuracy. Compared with DSGD, our GT-STORM gains about 10% more accuracy. On the CIFAR-10 dataset (see Fig. 8), the performances of DSGD and GNSD deteriorate, while GT-STORM achieves a better accuracy. Specifically, the accuracy of GT-STORM is around 15% higher than that of GNSD and 25% higher than that of DSGD.

5. Conclusion

In this paper, we proposed a gradient-tracking-based stochastic recursive momentum (GT-STORM) algorithm for decentralized non-convex optimization, which enjoys low sample, communication, and memory complexities. Our algorithm fuses the gradient tracking estimator and the variance reduction estimator and has a simple single-loop structure. Thus, it is more practical than existing works (e.g., GT-SAGA/SVRG and D-GET) in the literature. We also conducted extensive numerical studies to verify the performance of our method, including non-convex logistic regression and neural network models. The numerical results show that our method outperforms the state-of-the-art methods when training on large datasets. Our results in this work contribute to the increasingly important field of decentralized network training.

References

  • (1) B. Recht, C. Re, S. Wright, and F. Niu, “Hogwild: A lock-free approach to parallelizing stochastic gradient descent,” in Advances in neural information processing systems, 2011, pp. 693–701.
  • (2) M. Zinkevich, M. Weimer, L. Li, and A. J. Smola, “Parallelized stochastic gradient descent,” in Advances in neural information processing systems, 2010, pp. 2595–2603.
  • (3) J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang et al., “Large scale distributed deep networks,” in Advances in neural information processing systems, 2012, pp. 1223–1231.
  • (4) A. Nedic and A. Ozdaglar, “Distributed subgradient methods for multi-agent optimization,” IEEE Transactions on Automatic Control, vol. 54, no. 1, p. 48, 2009.
  • (5) X. Lian, C. Zhang, H. Zhang, C.-J. Hsieh, W. Zhang, and J. Liu, “Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent,” in Advances in Neural Information Processing Systems, 2017, pp. 5330–5340.
  • (6) J. Chen, Z. J. Towfic, and A. H. Sayed, “Dictionary learning over distributed models,” IEEE Transactions on Signal Processing, vol. 63, no. 4, pp. 1001–1016, 2014.
  • (7) Y. Cao, W. Yu, W. Ren, and G. Chen, “An overview of recent progress in the study of distributed multi-agent coordination,” IEEE Transactions on Industrial informatics, vol. 9, no. 1, pp. 427–438, 2012.
  • (8) K. Zhou and S. I. Roumeliotis, “Multirobot active target tracking with combinations of relative observations,” IEEE Transactions on Robotics, vol. 27, no. 4, pp. 678–695, 2011.
  • (9) W. Wang, J. Wang, M. Kolar, and N. Srebro, “Distributed stochastic multi-task learning with graph regularization,” arXiv preprint arXiv:1802.03830, 2018.
  • (10) X. Zhang, J. Liu, and Z. Zhu, “Distributed linear model clustering over networks: A tree-based fused-lasso admm approach,” arXiv preprint arXiv:1905.11549, 2019.
  • (11) K. Ali and W. Van Stam, “Tivo: Making show recommendations using a distributed collaborative filtering architecture,” in Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, 2004, pp. 394–401.
  • (12) Z. Jiang, K. Mukherjee, and S. Sarkar, “On consensus-disagreement tradeoff in distributed optimization,” in 2018 Annual American Control Conference (ACC).   IEEE, 2018, pp. 571–576.
  • (13) J. N. Tsitsiklis, “Problems in decentralized decision making and computation.” Massachusetts Inst of Tech Cambridge Lab for Information and Decision Systems, Tech. Rep., 1984.
  • (14) Q. Tran-Dinh, N. H. Pham, D. T. Phan, and L. M. Nguyen, “Hybrid stochastic gradient descent algorithms for stochastic nonconvex optimization,” arXiv preprint arXiv:1905.05920, 2019.
  • (15) A. Cutkosky and F. Orabona, “Momentum-based variance reduction in non-convex sgd,” in Advances in Neural Information Processing Systems, 2019, pp. 15 210–15 219.
  • (16) P. Di Lorenzo and G. Scutari, “Next: In-network nonconvex optimization,” IEEE Transactions on Signal and Information Processing over Networks, vol. 2, no. 2, pp. 120–136, 2016.
  • (17) S. Lu, X. Zhang, H. Sun, and M. Hong, “GNSD: a gradient-tracking based nonconvex stochastic algorithm for decentralized optimization,” in 2019 IEEE Data Science Workshop, DSW 2019.   Institute of Electrical and Electronics Engineers Inc., 2019, pp. 315–321.
  • (18) Z. Jiang, A. Balu, C. Hegde, and S. Sarkar, “Collaborative deep learning in fixed topology networks,” in Advances in Neural Information Processing Systems, 2017, pp. 5904–5914.
  • (19) H. Sun, S. Lu, and M. Hong, “Improving the sample and communication complexity for decentralized non-convex optimization: A joint gradient estimation and tracking approach,” ICML 2020, 2019.
  • (20) S. Ghadimi and G. Lan, “Stochastic first-and zeroth-order methods for nonconvex stochastic programming,” SIAM Journal on Optimization, vol. 23, no. 4, pp. 2341–2368, 2013.
  • (21) L. Bottou, F. E. Curtis, and J. Nocedal, “Optimization methods for large-scale machine learning,” Siam Review, vol. 60, no. 2, pp. 223–311, 2018.
  • (22) P. Zhou, X. Yuan, and J. Feng, “New insight into hybrid stochastic gradient descent: Beyond with-replacement sampling and convexity,” in Advances in Neural Information Processing Systems, 2018, pp. 1234–1243.
  • (23) R. Johnson and T. Zhang, “Accelerating stochastic gradient descent using predictive variance reduction,” in Advances in Neural Information Processing Systems, 2013, pp. 315–323.
  • (24) S. J. Reddi, A. Hefny, S. Sra, B. Póczos, and A. Smola, “Stochastic variance reduction for nonconvex optimization,” in International Conference on Machine Learning, 2016, pp. 314–323.
  • (25) A. Defazio, F. Bach, and S. Lacoste-Julien, “SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives,” in Advances in Neural Information Processing Systems, 2014, pp. 1646–1654.
  • (26) L. Lei, C. Ju, J. Chen, and M. I. Jordan, “Non-convex finite-sum optimization via SCSG methods,” in Advances in Neural Information Processing Systems, 2017, pp. 2348–2358.
  • (27) C. Fang, C. J. Li, Z. Lin, and T. Zhang, “SPIDER: Near-optimal non-convex optimization via stochastic path-integrated differential estimator,” in Advances in Neural Information Processing Systems, 2018, pp. 689–699.
  • (28) L. M. Nguyen, J. Liu, K. Scheinberg, and M. Takáč, “SARAH: A novel method for machine learning problems using stochastic recursive gradient,” in Proceedings of the 34th International Conference on Machine Learning, Volume 70. JMLR.org, 2017, pp. 2613–2621.
  • (29) Z. Wang, K. Ji, Y. Zhou, Y. Liang, and V. Tarokh, “SpiderBoost and momentum: Faster variance reduction algorithms,” in Advances in Neural Information Processing Systems, 2019, pp. 2403–2413.
  • (30) K. Yuan, Q. Ling, and W. Yin, “On the convergence of decentralized gradient descent,” SIAM Journal on Optimization, vol. 26, no. 3, pp. 1835–1854, 2016.
  • (31) W. Shi, Q. Ling, G. Wu, and W. Yin, “EXTRA: An exact first-order algorithm for decentralized consensus optimization,” SIAM Journal on Optimization, vol. 25, no. 2, pp. 944–966, 2015.
  • (32) H. Sun and M. Hong, “Distributed non-convex first-order optimization and information processing: Lower complexity bounds and rate optimal algorithms,” IEEE Transactions on Signal Processing, vol. 67, no. 22, pp. 5912–5928, 2019.
  • (33) J. F. Mota, J. M. Xavier, P. M. Aguiar, and M. Püschel, “D-ADMM: A communication-efficient distributed algorithm for separable optimization,” IEEE Transactions on Signal Processing, vol. 61, no. 10, pp. 2718–2723, 2013.
  • (34) A. Mokhtari, W. Shi, Q. Ling, and A. Ribeiro, “A decentralized second-order method with exact linear convergence rate for consensus optimization,” IEEE Transactions on Signal and Information Processing over Networks, vol. 2, no. 4, pp. 507–522, 2016.
  • (35) M. Eisen, A. Mokhtari, and A. Ribeiro, “Decentralized quasi-newton methods,” IEEE Transactions on Signal Processing, vol. 65, no. 10, pp. 2613–2628, 2017.
  • (36) A. Nedić, A. Olshevsky, and M. G. Rabbat, “Network topology and communication-computation tradeoffs in decentralized optimization,” Proceedings of the IEEE, vol. 106, no. 5, pp. 953–976, 2018.
  • (37) T.-H. Chang, M. Hong, H.-T. Wai, X. Zhang, and S. Lu, “Distributed learning in the non-convex world: From batch to streaming data, and beyond,” arXiv preprint arXiv:2001.04786, 2020.
  • (38) J. Zeng and W. Yin, “On nonconvex decentralized gradient descent,” IEEE Transactions on Signal Processing, vol. 66, no. 11, pp. 2834–2848, 2018.
  • (39) A. Mokhtari and A. Ribeiro, “DSA: Decentralized double stochastic averaging gradient algorithm,” The Journal of Machine Learning Research, vol. 17, no. 1, pp. 2165–2199, 2016.
  • (40) K. Yuan, B. Ying, J. Liu, and A. H. Sayed, “Variance-reduced stochastic learning by networked agents under random reshuffling,” IEEE Transactions on Signal Processing, vol. 67, no. 2, pp. 351–366, 2018.
  • (41) R. Xin, U. A. Khan, and S. Kar, “Variance-reduced decentralized stochastic optimization with gradient tracking,” arXiv preprint arXiv:1909.11774, 2019.
  • (42) A. Nedich, A. Olshevsky, and W. Shi, “A geometrically convergent method for distributed optimization over time-varying graphs,” in 2016 IEEE 55th Conference on Decision and Control (CDC).   IEEE, 2016, pp. 1023–1029.
  • (43) Y. Zhou, Y. Liang, and H. Zhang, “Generalization error bounds with probabilistic guarantee for sgd in nonconvex optimization,” arXiv preprint arXiv:1802.06903, 2018.
  • (44) S. P. Karimireddy, Q. Rebjock, S. U. Stich, and M. Jaggi, “Error feedback fixes SignSGD and other gradient compression schemes,” arXiv preprint arXiv:1901.09847, 2019.
  • (45) A. Koloskova, S. U. Stich, and M. Jaggi, “Decentralized stochastic optimization and gossip algorithms with compressed communication,” arXiv preprint arXiv:1902.00340, 2019.
  • (46) Z. Wang, Y. Zhou, Y. Liang, and G. Lan, “Cubic regularization with momentum for nonconvex optimization,” arXiv preprint arXiv:1810.03763, 2018.
  • (47) X. Zhang, J. Liu, Z. Zhu, and E. S. Bentley. (2020) GT-STORM: taming sample, communication, and memory complexities in decentralized non-convex learning. [Online]. Available: https://kevinliu-osu-ece.github.io/publications/GT-STORM_TR.pdf

Appendix A Additional Experiment Details

Figure 9. The network topology generated by the Erdős–Rényi model.

In our simulation, the communication graph $\mathcal{G}$ is generated from the Erdős–Rényi model, parameterized by the edge connectivity probability $p_c$ and the number of nodes $m$. We set $m=10$ and $p_c=0.5$. The generated graph is shown in Figure 9.
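For completeness, here is a minimal sketch of this topology generation, assuming networkx and a Metropolis–Hastings weighting rule for the mixing matrix; the weighting rule is our own assumption, as the paper only specifies the Erdős–Rényi topology:

```python
import networkx as nx
import numpy as np

# Sample a connected Erdos-Renyi communication graph with m = 10 and p_c = 0.5.
m, p_c, seed = 10, 0.5, 0
G = nx.erdos_renyi_graph(m, p_c, seed=seed)
while not nx.is_connected(G):              # resample until the graph is connected
    seed += 1
    G = nx.erdos_renyi_graph(m, p_c, seed=seed)

# Metropolis-Hastings weights yield a symmetric, doubly stochastic mixing matrix.
W = np.zeros((m, m))
for i, j in G.edges():
    W[i, j] = W[j, i] = 1.0 / (1 + max(G.degree[i], G.degree[j]))
np.fill_diagonal(W, 1.0 - W.sum(axis=1))   # self-weights complete each row to 1

eigs = np.sort(np.linalg.eigvalsh(W))      # eigenvalues in ascending order
print("lambda = max{|lambda_2|, |lambda_m|} =", max(abs(eigs[-2]), abs(eigs[0])))
```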

A.1. Nonconvex Logistic Regression

In Section 4.1, we set the step-size to $\eta_t=\eta_0/\sqrt{1+0.1t}$ for DSGD and GNSD, and to $\eta_t=\eta_0/\sqrt[3]{1+0.1t}$ for GT-STORM. Note that the step-size adopted for GT-STORM diminishes more slowly than those for DSGD and GNSD, although all three choices follow the respective theoretical results. Thus, here we apply the same step-size $\eta_t=\eta_0/\sqrt[3]{1+0.1t}$ to all three algorithms. We tune the initial step-size $\eta_0$ over the grid $\{0.01,0.02,0.05,0.1,0.2,0.5,1.0\}$. After tuning, the best initial step-sizes are $0.1$, $0.5$, and $0.2$ for a9a, ijcnn1, and rcv1.binary, respectively. We show the results in Figure 10. It can be seen that with this larger step-size, although DSGD and GNSD converge faster at the beginning, their final accuracy is unsatisfactory (e.g., on a9a and rcv1.binary). Also, under the same step-size, our algorithm performs much better than the other two.
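For reference, a short sketch (our own illustration, not from the paper) of the two diminishing schedules compared above:

```python
import numpy as np

def dsgd_gnsd_stepsize(eta0: float, t):
    """eta_t = eta0 / sqrt(1 + 0.1 t): the DSGD/GNSD schedule."""
    return eta0 / np.sqrt(1.0 + 0.1 * np.asarray(t))

def gt_storm_stepsize(eta0: float, t):
    """eta_t = eta0 / (1 + 0.1 t)^(1/3): the slower-diminishing GT-STORM schedule."""
    return eta0 / np.cbrt(1.0 + 0.1 * np.asarray(t))

t = np.arange(1000)
# The cube-root schedule stays larger for all t, which is why it is shared here.
print(gt_storm_stepsize(0.2, t)[-1] / dsgd_gnsd_stepsize(0.2, t)[-1])  # > 1
```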

Figure 10. Experimental results for nonconvex logistic regression on LibSVM datasets: all three algorithms adopt the diminishing step-size $\eta_t=\eta_0/\sqrt[3]{1+0.1t}$; the curves are averaged over 10 random trials and the shaded regions represent the standard deviation.

A.2. Convolutional Neural Networks

Here we show the testing loss and accuracy of the CNN models on the MNIST and CIFAR-10 datasets in Figures 11 and 12. In all experiments, our algorithm achieves better performance: higher accuracy and lower loss. The final testing accuracy results are summarized in Table 1.

Figure 11. Experimental results for CNN on the MNIST dataset: (a) I.D. data partition; (b) N.D. data partition.
Figure 12. Experimental results for CNN on the CIFAR-10 dataset: (a) I.D. data partition; (b) N.D. data partition.
Table 1. Testing accuracy of the CNN models trained by different algorithms.

Dataset    Partition   DSGD     GNSD     GT-STORM
MNIST      I.D.        0.9102   0.9102   0.9375
MNIST      N.D.        0.8203   0.9102   0.9257
CIFAR-10   I.D.        0.6093   0.6016   0.7734
CIFAR-10   N.D.        0.5352   0.6133   0.7695

Appendix B Proof of Main Results

Due to space limitations, we provide the key steps of the proofs of the key lemmas and theorems in this appendix; we refer readers to (Zhang20:GT-STORM_TR, ) for the complete proofs.

B.1. Proof for Lemma 1

Proof.

Recall that $\bar{\mathbf{v}}_t=\beta_t(\bar{\mathbf{v}}_{t-1}+\bar{\mathbf{w}}_t)+(1-\beta_t)\bar{\mathbf{u}}_t$. Then:

\|\bar{\mathbf{v}}_t-\bar{\mathbf{g}}_t\|^2=\|\beta_t(\bar{\mathbf{v}}_{t-1}+\bar{\mathbf{w}}_t)+(1-\beta_t)\bar{\mathbf{u}}_t-\bar{\mathbf{g}}_t\|^2
=\|\beta_t(\bar{\mathbf{v}}_{t-1}-\bar{\mathbf{g}}_{t-1})+\beta_t(\bar{\mathbf{w}}_t-\bar{\mathbf{g}}_t+\bar{\mathbf{g}}_{t-1})+(1-\beta_t)(\bar{\mathbf{u}}_t-\bar{\mathbf{g}}_t)\|^2
=\beta_t^2\|\bar{\mathbf{v}}_{t-1}-\bar{\mathbf{g}}_{t-1}\|^2+\|\beta_t(\bar{\mathbf{w}}_t-\bar{\mathbf{g}}_t+\bar{\mathbf{g}}_{t-1})+(1-\beta_t)(\bar{\mathbf{u}}_t-\bar{\mathbf{g}}_t)\|^2
(21) \quad+2\langle\bar{\mathbf{v}}_{t-1}-\bar{\mathbf{g}}_{t-1},\,\beta_t(\bar{\mathbf{w}}_t-\bar{\mathbf{g}}_t+\bar{\mathbf{g}}_{t-1})+(1-\beta_t)(\bar{\mathbf{u}}_t-\bar{\mathbf{g}}_t)\rangle.

Note that $\mathbb{E}_{\zeta_t}[\bar{\mathbf{w}}_t]=\bar{\mathbf{g}}_t-\bar{\mathbf{g}}_{t-1}$ and $\mathbb{E}_{\zeta_t}[\bar{\mathbf{u}}_t]=\bar{\mathbf{g}}_t$. Taking the expectation with respect to $\zeta_t$, we have:

\mathbb{E}_{\zeta_t}[\|\bar{\mathbf{v}}_t-\bar{\mathbf{g}}_t\|^2]
\stackrel{(a)}{=}\beta_t^2\|\bar{\mathbf{v}}_{t-1}-\bar{\mathbf{g}}_{t-1}\|^2+\mathbb{E}_{\zeta_t}[\|\beta_t(\bar{\mathbf{w}}_t-\bar{\mathbf{g}}_t+\bar{\mathbf{g}}_{t-1})+(1-\beta_t)(\bar{\mathbf{u}}_t-\bar{\mathbf{g}}_t)\|^2]
\stackrel{(b)}{\leq}\beta_t^2\|\bar{\mathbf{v}}_{t-1}-\bar{\mathbf{g}}_{t-1}\|^2+\mathbb{E}_{\zeta_t}[2\beta_t^2\|\bar{\mathbf{w}}_t-\bar{\mathbf{g}}_t+\bar{\mathbf{g}}_{t-1}\|^2+2(1-\beta_t)^2\|\bar{\mathbf{u}}_t-\bar{\mathbf{g}}_t\|^2]
\stackrel{(c)}{\leq}\beta_t^2\|\bar{\mathbf{v}}_{t-1}-\bar{\mathbf{g}}_{t-1}\|^2+2\beta_t^2\mathbb{E}_{\zeta_t}[\|\bar{\mathbf{w}}_t\|^2]+\frac{2(1-\beta_t)^2\sigma^2}{m}
\stackrel{(d)}{\leq}\beta_t^2\|\bar{\mathbf{v}}_{t-1}-\bar{\mathbf{g}}_{t-1}\|^2+\frac{2\beta_t^2}{m}\sum_{i=1}^{m}\mathbb{E}_{\zeta_t}[\|\mathbf{w}_{i,t}\|^2]+\frac{2(1-\beta_t)^2\sigma^2}{m}
\stackrel{(e)}{\leq}\beta_t^2\|\bar{\mathbf{v}}_{t-1}-\bar{\mathbf{g}}_{t-1}\|^2+\frac{2\beta_t^2L^2}{m}\sum_{i=1}^{m}\mathbb{E}_{\zeta_t}[\|\mathbf{x}_{i,t}-\mathbf{x}_{i,t-1}\|^2]+\frac{2(1-\beta_t)^2\sigma^2}{m}
(22) =\beta_t^2\|\bar{\mathbf{v}}_{t-1}-\bar{\mathbf{g}}_{t-1}\|^2+\frac{2\beta_t^2L^2}{m}\mathbb{E}_{\zeta_t}[\|\mathbf{x}_t-\mathbf{x}_{t-1}\|^2]+\frac{2(1-\beta_t)^2\sigma^2}{m},

where (a) holds because the cross term has zero expectation; (b) follows from $\|\mathbf{a}+\mathbf{b}\|^2\leq 2\|\mathbf{a}\|^2+2\|\mathbf{b}\|^2$; (c) follows from $\mathbb{E}[\|\mathbf{X}-\mathbb{E}[\mathbf{X}]\|^2]\leq\mathbb{E}[\|\mathbf{X}\|^2]$ and Assumption 1(c); (d) follows from Jensen's inequality; and (e) follows from the $L$-average smoothness. Thus, taking the full expectation, it holds that

\mathbb{E}[\|\bar{\mathbf{v}}_t-\bar{\mathbf{g}}_t\|^2]\leq\beta_t^2\mathbb{E}[\|\bar{\mathbf{v}}_{t-1}-\bar{\mathbf{g}}_{t-1}\|^2]
(23) +\frac{2\beta_t^2L^2}{m}\mathbb{E}[\|\mathbf{x}_t-\mathbf{x}_{t-1}\|^2]+\frac{2(1-\beta_t)^2\sigma^2}{m}.
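To make the estimator analyzed above concrete, here is a minimal single-node numerical sketch of the recursive momentum update $\bar{\mathbf{v}}_t=\beta_t(\bar{\mathbf{v}}_{t-1}+\bar{\mathbf{w}}_t)+(1-\beta_t)\bar{\mathbf{u}}_t$; the quadratic objective, the additive gradient-noise model, and the parameter values are our own illustrative assumptions, not part of the paper:

```python
import numpy as np

# Toy model (our assumption): f(x) = 0.5 * ||x||^2, stochastic gradient = x + noise.
rng = np.random.default_rng(0)
d, sigma, T = 5, 0.1, 5000
tau, omega, rho = 1.0, 2.0, 1.0

x = rng.normal(size=d)
x_prev = x.copy()
v = x + sigma * rng.normal(size=d)         # v_0: a plain stochastic gradient
eta_prev = tau / omega ** (1 / 3)
for t in range(1, T + 1):
    beta = 1 - rho * eta_prev**2           # beta_t = 1 - rho * eta_{t-1}^2
    noise = sigma * rng.normal(size=d)     # one sample zeta_t shared by u_t and w_t
    u = x + noise                          # u_t = grad f(x_t; zeta_t)
    w = (x + noise) - (x_prev + noise)     # w_t; the shared noise cancels in this toy model
    v = beta * (v + w) + (1 - beta) * u    # recursive momentum update
    eta = tau / (omega + t) ** (1 / 3)
    x_prev, x = x, x - eta * v             # descent step x_{t+1} = x_t - eta_t * v_t
    eta_prev = eta
print("||x_T|| =", np.linalg.norm(x), "(should be near the minimizer 0)")
```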

B.2. Proof for Lemma 2

Proof.

From the $L$-smoothness of $f$, we have:

f(\bar{\mathbf{x}}_{t+1})\leq f(\bar{\mathbf{x}}_t)-\eta_t\langle\nabla f(\bar{\mathbf{x}}_t),\bar{\mathbf{v}}_t\rangle+\frac{L\eta_t^2}{2}\|\bar{\mathbf{v}}_t\|^2
=f(\bar{\mathbf{x}}_t)-\frac{\eta_t}{2}\|\nabla f(\bar{\mathbf{x}}_t)\|^2-\Big(\frac{\eta_t}{2}-\frac{L\eta_t^2}{2}\Big)\|\bar{\mathbf{v}}_t\|^2+\frac{\eta_t}{2}\|\bar{\mathbf{v}}_t-\nabla f(\bar{\mathbf{x}}_t)\|^2
\leq f(\bar{\mathbf{x}}_t)-\frac{\eta_t}{2}\|\nabla f(\bar{\mathbf{x}}_t)\|^2-\Big(\frac{\eta_t}{2}-\frac{L\eta_t^2}{2}\Big)\|\bar{\mathbf{v}}_t\|^2+\eta_t\|\bar{\mathbf{v}}_t-\bar{\mathbf{g}}_t\|^2+\eta_t\|\bar{\mathbf{g}}_t-\nabla f(\bar{\mathbf{x}}_t)\|^2
\stackrel{(a)}{\leq}f(\bar{\mathbf{x}}_t)-\frac{\eta_t}{2}\|\nabla f(\bar{\mathbf{x}}_t)\|^2-\Big(\frac{\eta_t}{2}-\frac{L\eta_t^2}{2}\Big)\|\bar{\mathbf{v}}_t\|^2+\eta_t\|\bar{\mathbf{v}}_t-\bar{\mathbf{g}}_t\|^2+\frac{\eta_t}{m}\sum_{i=1}^{m}\|\mathbf{g}_{i,t}-\nabla f_i(\bar{\mathbf{x}}_t)\|^2
(24) \stackrel{(b)}{\leq}f(\bar{\mathbf{x}}_t)-\frac{\eta_t}{2}\|\nabla f(\bar{\mathbf{x}}_t)\|^2-\Big(\frac{\eta_t}{2}-\frac{L\eta_t^2}{2}\Big)\|\bar{\mathbf{v}}_t\|^2+\eta_t\|\bar{\mathbf{v}}_t-\bar{\mathbf{g}}_t\|^2+\frac{L^2\eta_t}{m}\|\mathbf{x}_t-\mathbf{1}\otimes\bar{\mathbf{x}}_t\|^2,

where $\bar{\mathbf{g}}_t=\frac{1}{m}\sum_{i=1}^{m}\mathbf{g}_{i,t}$ and $\mathbf{g}_{i,t}=\nabla f_i(\mathbf{x}_{i,t})$; (a) follows from Jensen's inequality and (b) from the $L$-average smoothness. Taking the full expectation of the above inequality:

\mathbb{E}[f(\bar{\mathbf{x}}_{t+1})]-\mathbb{E}[f(\bar{\mathbf{x}}_t)]\leq-\frac{\eta_t}{2}\mathbb{E}[\|\nabla f(\bar{\mathbf{x}}_t)\|^2]-\Big(\frac{\eta_t}{2}-\frac{L\eta_t^2}{2}\Big)\mathbb{E}[\|\bar{\mathbf{v}}_t\|^2]
(25) +\eta_t\mathbb{E}[\|\bar{\mathbf{v}}_t-\bar{\mathbf{g}}_t\|^2]+\frac{L^2\eta_t}{m}\mathbb{E}[\|\mathbf{x}_t-\mathbf{1}\otimes\bar{\mathbf{x}}_t\|^2].

B.3. Proof for Lemma 3

Proof.

First, for the iterate $\mathbf{x}_t$, we have the following contraction:

(26) \|\tilde{\mathbf{W}}\mathbf{x}_t-\mathbf{1}\otimes\bar{\mathbf{x}}_t\|^2=\|\tilde{\mathbf{W}}(\mathbf{x}_t-\mathbf{1}\otimes\bar{\mathbf{x}}_t)\|^2\leq\lambda^2\|\mathbf{x}_t-\mathbf{1}\otimes\bar{\mathbf{x}}_t\|^2.

This is because $\mathbf{x}_t-\mathbf{1}\otimes\bar{\mathbf{x}}_t$ is orthogonal to $\mathbf{1}$, which is the eigenvector corresponding to the largest eigenvalue of $\tilde{\mathbf{W}}$, and $\lambda=\max\{|\lambda_2|,|\lambda_m|\}$. Recall that $\bar{\mathbf{x}}_{t+1}=\bar{\mathbf{x}}_t-\eta_t\bar{\mathbf{v}}_t$; hence,

\|\mathbf{x}_{t+1}-\mathbf{1}\otimes\bar{\mathbf{x}}_{t+1}\|^2=\|\tilde{\mathbf{W}}\mathbf{x}_t-\eta_t\mathbf{v}_t-\mathbf{1}\otimes(\bar{\mathbf{x}}_t-\eta_t\bar{\mathbf{v}}_t)\|^2
\leq(1+c_1)\|\tilde{\mathbf{W}}\mathbf{x}_t-\mathbf{1}\otimes\bar{\mathbf{x}}_t\|^2+\Big(1+\frac{1}{c_1}\Big)\eta_t^2\|\mathbf{v}_t-\mathbf{1}\otimes\bar{\mathbf{v}}_t\|^2
(27) \leq(1+c_1)\lambda^2\|\mathbf{x}_t-\mathbf{1}\otimes\bar{\mathbf{x}}_t\|^2+\Big(1+\frac{1}{c_1}\Big)\eta_t^2\|\mathbf{v}_t-\mathbf{1}\otimes\bar{\mathbf{v}}_t\|^2.

Similarly to the derivation of (27), we have:

\|\mathbf{v}_{t+1}-\mathbf{1}\otimes\bar{\mathbf{v}}_{t+1}\|^2
=\|\beta_{t+1}(\tilde{\mathbf{W}}\mathbf{v}_t+\mathbf{w}_{t+1})+(1-\beta_{t+1})\mathbf{u}_{t+1}-\mathbf{1}\otimes\big(\beta_{t+1}(\bar{\mathbf{v}}_t+\bar{\mathbf{w}}_{t+1})+(1-\beta_{t+1})\bar{\mathbf{u}}_{t+1}\big)\|^2
\leq(1+c_1)\beta_{t+1}^2\lambda^2\|\mathbf{v}_t-\mathbf{1}\otimes\bar{\mathbf{v}}_t\|^2+\Big(1+\frac{1}{c_1}\Big)\|\beta_{t+1}(\mathbf{w}_{t+1}-\mathbf{1}\otimes\bar{\mathbf{w}}_{t+1})+(1-\beta_{t+1})(\mathbf{u}_{t+1}-\mathbf{1}\otimes\bar{\mathbf{u}}_{t+1})\|^2
\leq(1+c_1)\beta_{t+1}^2\lambda^2\|\mathbf{v}_t-\mathbf{1}\otimes\bar{\mathbf{v}}_t\|^2+2\Big(1+\frac{1}{c_1}\Big)\big(\beta_{t+1}^2\|\mathbf{w}_{t+1}-\mathbf{1}\otimes\bar{\mathbf{w}}_{t+1}\|^2+(1-\beta_{t+1})^2\|\mathbf{u}_{t+1}-\mathbf{1}\otimes\bar{\mathbf{u}}_{t+1}\|^2\big)
(28) \stackrel{(a)}{\leq}(1+c_1)\beta_{t+1}^2\lambda^2\|\mathbf{v}_t-\mathbf{1}\otimes\bar{\mathbf{v}}_t\|^2+2\Big(1+\frac{1}{c_1}\Big)\big(\beta_{t+1}^2\|\mathbf{w}_{t+1}\|^2+(1-\beta_{t+1})^2\|\mathbf{u}_{t+1}\|^2\big),

where (a) is due to $\|\mathbf{I}-\frac{1}{m}\mathbf{1}\mathbf{1}^{\top}\|\leq 1$. Lastly, according to the update equation (9) in the main paper, it holds that

\|\mathbf{x}_{t+1}-\mathbf{x}_t\|^2=\|\tilde{\mathbf{W}}\mathbf{x}_t-\eta_t\mathbf{v}_t-\mathbf{x}_t\|^2
=\|(\tilde{\mathbf{W}}-\mathbf{I})\mathbf{x}_t-\eta_t\mathbf{v}_t\|^2\leq 2\|(\tilde{\mathbf{W}}-\mathbf{I})\mathbf{x}_t\|^2+2\eta_t^2\|\mathbf{v}_t\|^2
=2\|(\tilde{\mathbf{W}}-\mathbf{I})(\mathbf{x}_t-\mathbf{1}\otimes\bar{\mathbf{x}}_t)\|^2+2\eta_t^2\|\mathbf{v}_t\|^2
(29) \stackrel{(a)}{\leq}8\|\mathbf{x}_t-\mathbf{1}\otimes\bar{\mathbf{x}}_t\|^2+4\eta_t^2\|\mathbf{v}_t-\mathbf{1}\otimes\bar{\mathbf{v}}_t\|^2+4\eta_t^2m\|\bar{\mathbf{v}}_t\|^2,

where (a) is because $\|\tilde{\mathbf{W}}-\mathbf{I}\|\leq 2$.
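As a numerical sanity check on the contraction (26), the sketch below (our own illustration, assuming numpy; the lazy-averaging mixing matrix is a hypothetical example, not the paper's topology) verifies the bound on random data:

```python
import numpy as np

rng = np.random.default_rng(1)
m, d = 10, 3
W = 0.5 * (np.eye(m) + np.ones((m, m)) / m)      # lazy complete-graph averaging: symmetric, doubly stochastic
lam = sorted(np.abs(np.linalg.eigvalsh(W)))[-2]  # lambda = max{|lambda_2|, |lambda_m|}, skipping the eigenvalue 1
x = rng.normal(size=(m, d))                      # one row per node
x_bar = x.mean(axis=0)                           # network average
lhs = np.linalg.norm(W @ x - x_bar)              # ||W x - 1 (x_bar)||; broadcasting subtracts x_bar per row
rhs = lam * np.linalg.norm(x - x_bar)            # lambda * ||x - 1 (x_bar)||
print(f"{lhs:.6f} <= {rhs:.6f}:", lhs <= rhs + 1e-12)
```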

B.4. Proof for Lemma 4

Proof.

First, with $\eta_t=\tau/(\omega+t)^{1/3}$, we have:

\frac{1}{\eta_t}-\frac{1}{\eta_{t-1}}=\frac{1}{\tau}\big((\omega+t)^{\frac{1}{3}}-(\omega+t-1)^{\frac{1}{3}}\big)
\stackrel{(a)}{\leq}\frac{1}{3\tau}\cdot\frac{1}{(\omega+t-1)^{\frac{2}{3}}}=\frac{1}{3\tau^3}\eta_{t-1}^2
\stackrel{(b)}{\leq}\frac{1}{3\tau}\cdot\frac{1}{(\frac{\omega}{2}+t)^{\frac{2}{3}}}\leq\frac{1}{3\tau}\cdot\frac{2^{\frac{2}{3}}}{(\omega+t)^{\frac{2}{3}}}
(30) \leq\frac{2^{\frac{2}{3}}}{3\tau^3}\cdot\frac{\tau^2}{(\omega+t)^{\frac{2}{3}}}\leq\frac{2}{3\tau^3}\eta_t,

where (a) is by $(x+y)^{1/3}-x^{1/3}\leq yx^{-2/3}/3$ and (b) is by $\omega\geq 2$.
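A quick numerical check of (30) can be helpful; in the sketch below (our own illustration), $\tau$ and $\omega$ are hypothetical values satisfying $\omega\geq 2$:

```python
import numpy as np

# Verify (30): 1/eta_t - 1/eta_{t-1} <= (2 / (3 tau^3)) * eta_t, with eta_t = tau / (omega + t)^(1/3).
tau, omega = 2.0, 8.0
t = np.arange(1, 100_000)
eta = tau / (omega + t) ** (1 / 3)
eta_prev = tau / (omega + t - 1) ** (1 / 3)
lhs = 1 / eta - 1 / eta_prev
rhs = 2 / (3 * tau**3) * eta
print("(30) holds for all tested t:", bool(np.all(lhs <= rhs)))
```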

Next, we establish the following three contraction bounds:

i) For $\mathbb{E}[\|\bar{\mathbf{v}}_t-\bar{\mathbf{g}}_t\|^2]$, we have:

\frac{1}{\eta_t}\mathbb{E}[\|\bar{\mathbf{v}}_{t+1}-\bar{\mathbf{g}}_{t+1}\|^2]-\frac{1}{\eta_{t-1}}\mathbb{E}[\|\bar{\mathbf{v}}_t-\bar{\mathbf{g}}_t\|^2]
\stackrel{(a)}{\leq}\Big(\frac{\beta_{t+1}^2}{\eta_t}-\frac{1}{\eta_{t-1}}\Big)\mathbb{E}[\|\bar{\mathbf{v}}_t-\bar{\mathbf{g}}_t\|^2]+\frac{2\beta_{t+1}^2L^2}{m\eta_t}\mathbb{E}[\|\mathbf{x}_{t+1}-\mathbf{x}_t\|^2]+\frac{2(1-\beta_{t+1})^2\sigma^2}{m\eta_t}
\stackrel{(b)}{\leq}\Big(\frac{1-\rho\eta_t^2}{\eta_t}-\frac{1}{\eta_{t-1}}\Big)\mathbb{E}[\|\bar{\mathbf{v}}_t-\bar{\mathbf{g}}_t\|^2]+\frac{2\beta_{t+1}^2L^2}{m\eta_t}\mathbb{E}[\|\mathbf{x}_{t+1}-\mathbf{x}_t\|^2]+\frac{2\rho^2\sigma^2\eta_t^3}{m}
\stackrel{(c)}{\leq}\Big(\frac{2}{3\tau^3}-\rho\Big)\eta_t\mathbb{E}[\|\bar{\mathbf{v}}_t-\bar{\mathbf{g}}_t\|^2]+\frac{2\beta_{t+1}^2L^2}{m\eta_t}\mathbb{E}[\|\mathbf{x}_{t+1}-\mathbf{x}_t\|^2]+\frac{2\rho^2\sigma^2\eta_t^3}{m}
(31) \stackrel{(d)}{=}-32\eta_tL^2\mathbb{E}[\|\bar{\mathbf{v}}_t-\bar{\mathbf{g}}_t\|^2]+\frac{2\beta_{t+1}^2L^2}{m\eta_t}\mathbb{E}[\|\mathbf{x}_{t+1}-\mathbf{x}_t\|^2]+\frac{2\rho^2\sigma^2\eta_t^3}{m},

where (a) is from Lemma 1, (b) is by $\beta_{t+1}=1-\rho\eta_t^2<1$, (c) is by (30), and (d) is by the setting $\rho=2/(3\tau^3)+32L^2$.

ii) For $\mathbb{E}[\|\mathbf{x}_t-\mathbf{1}\otimes\bar{\mathbf{x}}_t\|^2]$, we have:

\frac{1}{\eta_t}\mathbb{E}[\|\mathbf{x}_{t+1}-\mathbf{1}\otimes\bar{\mathbf{x}}_{t+1}\|^2]-\frac{1}{\eta_{t-1}}\mathbb{E}[\|\mathbf{x}_t-\mathbf{1}\otimes\bar{\mathbf{x}}_t\|^2]
\stackrel{(a)}{\leq}\Big(\frac{(1+c_1)\lambda^2}{\eta_t}-\frac{1}{\eta_{t-1}}\Big)\mathbb{E}[\|\mathbf{x}_t-\mathbf{1}\otimes\bar{\mathbf{x}}_t\|^2]+\Big(1+\frac{1}{c_1}\Big)\eta_t\mathbb{E}[\|\mathbf{v}_t-\mathbf{1}\otimes\bar{\mathbf{v}}_t\|^2]
(32) \stackrel{(b)}{\leq}\Big(\frac{(1+c_1)\lambda^2-1}{\eta_t}+\frac{2}{3\tau^3}\eta_t\Big)\mathbb{E}[\|\mathbf{x}_t-\mathbf{1}\otimes\bar{\mathbf{x}}_t\|^2]+\Big(1+\frac{1}{c_1}\Big)\eta_t\mathbb{E}[\|\mathbf{v}_t-\mathbf{1}\otimes\bar{\mathbf{v}}_t\|^2],

where (a) is by Lemma 3 and (b) is by (30).

iii) For $\mathbb{E}[\|\mathbf{v}_t-\mathbf{1}\otimes\bar{\mathbf{v}}_t\|^2]$, we have:

\mathbb{E}[\|\mathbf{v}_{t+1}-\mathbf{1}\otimes\bar{\mathbf{v}}_{t+1}\|^2]-\mathbb{E}[\|\mathbf{v}_t-\mathbf{1}\otimes\bar{\mathbf{v}}_t\|^2]
\stackrel{(a)}{\leq}\big((1+c_1)\beta_{t+1}^2\lambda^2-1\big)\mathbb{E}[\|\mathbf{v}_t-\mathbf{1}\otimes\bar{\mathbf{v}}_t\|^2]+2\Big(1+\frac{1}{c_1}\Big)\big(\beta_{t+1}^2\mathbb{E}[\|\mathbf{w}_{t+1}\|^2]+(1-\beta_{t+1})^2\mathbb{E}[\|\mathbf{u}_{t+1}\|^2]\big)
(33) \stackrel{(b)}{\leq}\big((1+c_1)\beta_{t+1}^2\lambda^2-1\big)\mathbb{E}[\|\mathbf{v}_t-\mathbf{1}\otimes\bar{\mathbf{v}}_t\|^2]+2\Big(1+\frac{1}{c_1}\Big)\big(\beta_{t+1}^2L^2\mathbb{E}[\|\mathbf{x}_{t+1}-\mathbf{x}_t\|^2]+mG^2\rho^2\eta_t^4\big),

where (a) is by Lemma 3 and (b) is by $\beta_{t+1}=1-\rho\eta_t^2$ and Assumption 1.

Recall the result from Lemma 2:

\mathbb{E}[f(\bar{\mathbf{x}}_{t+1})]-\mathbb{E}[f(\bar{\mathbf{x}}_t)]\leq-\frac{\eta_t}{2}\mathbb{E}[\|\nabla f(\bar{\mathbf{x}}_t)\|^2]-\Big(\frac{\eta_t}{2}-\frac{L\eta_t^2}{2}\Big)\mathbb{E}[\|\bar{\mathbf{v}}_t\|^2]
(34) +\eta_t\mathbb{E}[\|\bar{\mathbf{v}}_t-\bar{\mathbf{g}}_t\|^2]+\frac{L^2\eta_t}{m}\mathbb{E}[\|\mathbf{x}_t-\mathbf{1}\otimes\bar{\mathbf{x}}_t\|^2].

Then, with the result in i), we have:

\mathbb{E}\Big[f(\bar{\mathbf{x}}_{t+1})+\frac{1}{32L^2\eta_t}\|\bar{\mathbf{g}}_{t+1}-\bar{\mathbf{v}}_{t+1}\|^2\Big]-\mathbb{E}\Big[f(\bar{\mathbf{x}}_t)+\frac{1}{32L^2\eta_{t-1}}\|\bar{\mathbf{g}}_t-\bar{\mathbf{v}}_t\|^2\Big]
\leq-\frac{\eta_t}{2}\mathbb{E}[\|\nabla f(\bar{\mathbf{x}}_t)\|^2]-\Big(\frac{\eta_t}{2}-\frac{L\eta_t^2}{2}\Big)\mathbb{E}[\|\bar{\mathbf{v}}_t\|^2]+\frac{\beta_{t+1}^2}{16m\eta_t}\mathbb{E}[\|\mathbf{x}_{t+1}-\mathbf{x}_t\|^2]
(35) +\frac{L^2\eta_t}{m}\mathbb{E}[\|\mathbf{x}_t-\mathbf{1}\otimes\bar{\mathbf{x}}_t\|^2]+\frac{\rho^2\sigma^2\eta_t^3}{16mL^2}.

Next, with the results in ii) and iii), we have:

\mathbb{E}\Big[\frac{1}{\eta_t}\|\mathbf{x}_{t+1}-\mathbf{1}\otimes\bar{\mathbf{x}}_{t+1}\|^2+\|\mathbf{v}_{t+1}-\mathbf{1}\otimes\bar{\mathbf{v}}_{t+1}\|^2\Big]-\mathbb{E}\Big[\frac{1}{\eta_{t-1}}\|\mathbf{x}_t-\mathbf{1}\otimes\bar{\mathbf{x}}_t\|^2+\|\mathbf{v}_t-\mathbf{1}\otimes\bar{\mathbf{v}}_t\|^2\Big]
\leq-\Big(\frac{1-(1+c_1)\lambda^2}{\eta_t}-\frac{2\eta_t}{3\tau^3}\Big)\mathbb{E}[\|\mathbf{x}_t-\mathbf{1}\otimes\bar{\mathbf{x}}_t\|^2]
-\Big(1-(1+c_1)\beta_{t+1}^2\lambda^2-\Big(1+\frac{1}{c_1}\Big)\eta_t\Big)\mathbb{E}[\|\mathbf{v}_t-\mathbf{1}\otimes\bar{\mathbf{v}}_t\|^2]
(36) +2\Big(1+\frac{1}{c_1}\Big)\beta_{t+1}^2L^2\mathbb{E}[\|\mathbf{x}_{t+1}-\mathbf{x}_t\|^2]+2\Big(1+\frac{1}{c_1}\Big)mG^2\rho^2\eta_t^4.

Thus, for the defined potential function

(37) H_t=\mathbb{E}\Big[f(\bar{\mathbf{x}}_t)+\frac{1}{32L^2\eta_{t-1}}\|\bar{\mathbf{g}}_t-\bar{\mathbf{v}}_t\|^2+\frac{c_0}{m\eta_{t-1}}\|\mathbf{x}_t-\mathbf{1}\otimes\bar{\mathbf{x}}_t\|^2+\frac{c_0}{m}\|\mathbf{v}_t-\mathbf{1}\otimes\bar{\mathbf{v}}_t\|^2\Big],

its per-iteration change can be bounded as

H_{t+1}-H_t\leq-\frac{\eta_t}{2}\mathbb{E}[\|\nabla f(\bar{\mathbf{x}}_t)\|^2]-\Big(\frac{\eta_t}{2}-\frac{L\eta_t^2}{2}\Big)\mathbb{E}[\|\bar{\mathbf{v}}_t\|^2]
+\Big(\frac{\beta_{t+1}^2}{16m\eta_t}+2\Big(1+\frac{1}{c_1}\Big)\frac{c_0\beta_{t+1}^2L^2}{m}\Big)\mathbb{E}[\|\mathbf{x}_{t+1}-\mathbf{x}_t\|^2]
-\Big(1-(1+c_1)\lambda^2-\frac{2\eta_t^2}{3\tau^3}-\frac{L^2\eta_t^2}{c_0}\Big)\times\frac{c_0}{m\eta_t}\mathbb{E}[\|\mathbf{x}_t-\mathbf{1}\otimes\bar{\mathbf{x}}_t\|^2]
-\Big(1-(1+c_1)\beta_{t+1}^2\lambda^2-\Big(1+\frac{1}{c_1}\Big)\eta_t\Big)\times\frac{c_0}{m}\mathbb{E}[\|\mathbf{v}_t-\mathbf{1}\otimes\bar{\mathbf{v}}_t\|^2]
+\frac{\rho^2\sigma^2\eta_t^3}{16mL^2}+2\Big(1+\frac{1}{c_1}\Big)c_0G^2\rho^2\eta_t^4
\stackrel{(a)}{\leq}-\frac{\eta_t}{2}\mathbb{E}[\|\nabla f(\bar{\mathbf{x}}_t)\|^2]+\frac{\rho^2\sigma^2\eta_t^3}{16mL^2}+2\Big(1+\frac{1}{c_1}\Big)c_0G^2\rho^2\eta_t^4
-\Big(1-(1+c_1)\lambda^2-\frac{1}{2c_0}-16\Big(1+\frac{1}{c_1}\Big)L^2\eta_t-\frac{2\eta_t^2}{3\tau^3}-\frac{L^2\eta_t^2}{c_0}\Big)\times\frac{c_0}{m\eta_t}\mathbb{E}[\|\mathbf{x}_t-\mathbf{1}\otimes\bar{\mathbf{x}}_t\|^2]
-\Big(1-(1+c_1)\lambda^2-\Big(1+\frac{1}{c_1}\Big)\eta_t-\frac{\eta_t}{4c_0}-8\Big(1+\frac{1}{c_1}\Big)L^2\eta_t^2\Big)\times\frac{c_0}{m}\mathbb{E}[\|\mathbf{v}_t-\mathbf{1}\otimes\bar{\mathbf{v}}_t\|^2]
(38) -\Big(1-2L\eta_t-32\Big(1+\frac{1}{c_1}\Big)c_0L^2\eta_t\Big)\times\frac{\eta_t}{4}\mathbb{E}[\|\bar{\mathbf{v}}_t\|^2],

where (a) follows from plugging in the bound on $\|\mathbf{x}_{t+1}-\mathbf{x}_t\|^2$ from Lemma 3 and $\beta_{t+1}<1$.

B.5. Proof for Theorem 1

Proof.

From Lemma 4, we have:

\sum_{t=0}^{T}\frac{\eta_t}{2}\mathbb{E}[\|\nabla f(\bar{\mathbf{x}}_t)\|^2]\leq H_0-H_{T+1}+\sum_{t=0}^{T}\frac{\rho^2\sigma^2\eta_t^3}{16mL^2}
+\sum_{t=0}^{T}2\Big(1+\frac{1}{c_1}\Big)c_0G^2\rho^2\eta_t^4-\sum_{t=0}^{T}\frac{c_0C_1}{m\eta_t}\mathbb{E}[\|\mathbf{x}_t-\mathbf{1}\otimes\bar{\mathbf{x}}_t\|^2]
(39) -\sum_{t=0}^{T}\frac{c_0C_2}{m}\mathbb{E}[\|\mathbf{v}_t-\mathbf{1}\otimes\bar{\mathbf{v}}_t\|^2]-\sum_{t=0}^{T}\frac{C_3\eta_t}{4}\mathbb{E}[\|\bar{\mathbf{v}}_t\|^2].

Note that with $\eta_t=\tau/(\omega+t)^{1/3}$ and $\tau\geq 2$, we have:

(40) \sum_{t=0}^{T}\eta_t^3=\sum_{t=0}^{T}\frac{\tau}{\omega+t}\leq\int_{-1}^{T-1}\frac{\tau}{\omega+t}\,dt\leq\tau\ln(\omega+T-1),
(41) \sum_{t=0}^{T}\eta_t^4=\sum_{t=0}^{T}\Big(\frac{\tau}{\omega+t}\Big)^{\frac{4}{3}}\leq\int_{-1}^{T-1}\Big(\frac{\tau}{\omega+t}\Big)^{\frac{4}{3}}\,dt\leq\frac{3\tau^{4/3}}{(\omega-1)^{1/3}}.

Hence, since $\eta_t$ is decreasing, we have:

\frac{\eta_T}{2}\sum_{t=0}^{T}\mathbb{E}[\|\nabla f(\bar{\mathbf{x}}_t)\|^2]+\frac{1}{m}\mathbb{E}[\|\mathbf{x}_t-\mathbf{1}\otimes\bar{\mathbf{x}}_t\|^2]
\leq\sum_{t=0}^{T}\frac{\eta_t}{2}\mathbb{E}[\|\nabla f(\bar{\mathbf{x}}_t)\|^2]+\frac{\eta_t}{2m}\mathbb{E}[\|\mathbf{x}_t-\mathbf{1}\otimes\bar{\mathbf{x}}_t\|^2]
\leq H_0-H_{T+1}+\frac{\tau\rho^2\sigma^2\ln(\omega+T-1)}{16mL^2}
+6\Big(1+\frac{1}{c_1}\Big)c_0G^2\rho^2\frac{\tau^{4/3}}{(\omega-1)^{1/3}}-\sum_{t=0}^{T}\frac{2c_0C_1-\eta_t^2}{2m\eta_t}\mathbb{E}[\|\mathbf{x}_t-\mathbf{1}\otimes\bar{\mathbf{x}}_t\|^2]
(42) -\sum_{t=0}^{T}\frac{c_0C_2}{m}\mathbb{E}[\|\mathbf{v}_t-\mathbf{1}\otimes\bar{\mathbf{v}}_t\|^2]-\sum_{t=0}^{T}\frac{C_3\eta_t}{4}\mathbb{E}[\|\bar{\mathbf{v}}_t\|^2].

Now, we show that by properly choosing $\eta_t$, $c_1$, and $c_0$, the coefficients $C_1-\eta_t^2/2c_0$, $C_2$, and $C_3$ can be made non-negative. Recall that:

(43) C_1=1-(1+c_1)\lambda^2-\frac{1}{2c_0}-16\Big(1+\frac{1}{c_1}\Big)L^2\eta_t-\Big(\frac{2}{3\tau^3}+\frac{L^2}{c_0}\Big)\eta_t^2,
(44) C_2=1-(1+c_1)\lambda^2-\Big(1+\frac{1}{c_1}\Big)\eta_t-\frac{\eta_t}{4c_0}-8\Big(1+\frac{1}{c_1}\Big)L^2\eta_t^2,
(45) C_3=1-2L\eta_t-32\Big(1+\frac{1}{c_1}\Big)c_0L^2\eta_t.

In order to have $C_3\geq 0$, we need:

(46) \eta_t\leq 1\Big/\Big(2L+32\Big(1+\frac{1}{c_1}\Big)c_0L^2\Big):=k_1.

With (46), it follows that:

(47) C_2\geq 1-(1+c_1)\lambda^2-\Big(1+\frac{1}{c_1}\Big)\eta_t-\frac{\eta_t}{2c_0}.

Thus, $C_2\geq 0$ if we set

(48) \eta_t\leq\Big(1-(1+c_1)\lambda^2\Big)\Big/\Big(1+\frac{1}{c_1}+\frac{1}{2c_0}\Big):=k_2.

For $C_1-\eta_t^2/2c_0$, it follows from (46) that:

(49) C_1-\frac{\eta_t^2}{2c_0}\geq 1-(1+c_1)\lambda^2-\frac{1}{c_0}-\Big(\frac{2}{3\tau^3}+\frac{2L^2+1}{2c_0}\Big)\eta_t^2.

By choosing

(50) \eta_t\leq\sqrt{\Big(1-(1+c_1)\lambda^2-\frac{1}{c_0}\Big)\Big/\Big(\frac{2}{3\tau^3}+\frac{2L^2+1}{2c_0}\Big)}:=k_3,
(51) \text{and}\quad 0<1-(1+c_1)\lambda^2-\frac{3}{4c_0},

we have $C_1-\eta_t^2/2c_0\geq 0$. To summarize, we need to set $\eta_t\leq\min\{k_1,k_2,k_3\}$. Since $\eta_t$ is decreasing and $\eta_0=\tau/\omega^{1/3}$, this implies that $\omega\geq(\tau/\min\{k_1,k_2,k_3\})^3$.
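For concreteness, the sketch below (our own illustration) evaluates $k_1$, $k_2$, $k_3$ and the implied lower bound on $\omega$; the constants $L$, $\lambda$, $c_0$, $c_1$, $\tau$ are hypothetical values chosen only to satisfy condition (51):

```python
import numpy as np

# Illustrative constants (our assumptions), chosen so that 1 - (1+c1)*lam^2 - 3/(4*c0) > 0.
L, lam, tau = 1.0, 0.5, 2.0
c1, c0 = 0.2, 4.0
assert 1 - (1 + c1) * lam**2 - 3 / (4 * c0) > 0                      # condition (51)

k1 = 1.0 / (2 * L + 32 * (1 + 1 / c1) * c0 * L**2)                   # from (46)
k2 = (1 - (1 + c1) * lam**2) / (1 + 1 / c1 + 1 / (2 * c0))           # from (48)
k3 = np.sqrt((1 - (1 + c1) * lam**2 - 1 / c0)
             / (2 / (3 * tau**3) + (2 * L**2 + 1) / (2 * c0)))       # from (50)

k = min(k1, k2, k3)
omega = max(2.0, (tau / k) ** 3)   # so that eta_0 = tau / omega^(1/3) <= min{k1, k2, k3}
print(f"k1={k1:.4f}, k2={k2:.4f}, k3={k3:.4f}, omega >= {omega:.1f}")
```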

With the above parameter setting, we have:

\frac{\eta_T}{2}\sum_{t=0}^{T}\mathbb{E}[\|\nabla f(\bar{\mathbf{x}}_t)\|^2]+\frac{1}{m}\mathbb{E}[\|\mathbf{x}_t-\mathbf{1}\otimes\bar{\mathbf{x}}_t\|^2]\leq H_0-H_{T+1}
(52) +\frac{\tau\rho^2\sigma^2\ln(\omega+T-1)}{16mL^2}+6\Big(1+\frac{1}{c_1}\Big)c_0G^2\rho^2\frac{\tau^{4/3}}{(\omega-1)^{1/3}}.

Multiplying both sides of the above inequality by $2/(\eta_T(T+1))$, we have:

\frac{1}{T+1}\sum_{t=0}^{T}\mathbb{E}[\|\nabla f(\bar{\mathbf{x}}_t)\|^2]+\frac{1}{m}\mathbb{E}[\|\mathbf{x}_t-\mathbf{1}\otimes\bar{\mathbf{x}}_t\|^2]
(53) \leq\frac{2(H_0-H_{T+1})}{\eta_T(T+1)}+\frac{\tau\rho^2\sigma^2\ln(\omega+T-1)}{8mL^2\eta_T(T+1)}+\frac{12\big(1+\frac{1}{c_1}\big)c_0\tau^{4/3}G^2\rho^2}{(\omega-1)^{1/3}\eta_T(T+1)}.

Note that

H_0=\mathbb{E}\Big[f(\bar{\mathbf{x}}_0)+\frac{1}{32L^2\eta_{-1}}\|\bar{\mathbf{g}}_0-\bar{\mathbf{v}}_0\|^2+\frac{c_0}{m\eta_{-1}}\|\mathbf{x}_0-\mathbf{1}\otimes\bar{\mathbf{x}}_0\|^2+\frac{c_0}{m}\|\mathbf{v}_0-\mathbf{1}\otimes\bar{\mathbf{v}}_0\|^2\Big]
\stackrel{(a)}{=}\mathbb{E}\Big[f(\bar{\mathbf{x}}_0)+\frac{1}{32L^2\eta_{-1}}\|\bar{\mathbf{g}}_0-\bar{\mathbf{v}}_0\|^2+\frac{c_0}{m}\|\mathbf{v}_0-\mathbf{1}\otimes\bar{\mathbf{v}}_0\|^2\Big]
\stackrel{(b)}{\leq}\mathbb{E}\Big[f(\bar{\mathbf{x}}_0)+\frac{\sigma^2}{32mL^2\eta_{-1}}+\frac{c_0}{m}\|\mathbf{v}_0-\mathbf{1}\otimes\bar{\mathbf{v}}_0\|^2\Big],
H_{T+1}\geq\mathbb{E}\Big[f(\bar{\mathbf{x}}_{T+1})+\frac{c_0}{m\eta_T}\|\mathbf{x}_{T+1}-\mathbf{1}\otimes\bar{\mathbf{x}}_{T+1}\|^2+\frac{c_0}{m}\|\mathbf{v}_{T+1}-\mathbf{1}\otimes\bar{\mathbf{v}}_{T+1}\|^2\Big]\geq f(\bar{\mathbf{x}}^*),

where (a) is by $\mathbf{x}_{i,0}=\mathbf{x}_0$ from line 1 in Algorithm 1 and (b) is by Assumption 1.

Hence, it follows that

\frac{1}{T+1}\sum_{t=0}^{T}\mathbb{E}[\|\nabla f(\bar{\mathbf{x}}_t)\|^2]+\frac{1}{m}\mathbb{E}[\|\mathbf{x}_t-\mathbf{1}\otimes\bar{\mathbf{x}}_t\|^2]
\leq\frac{2(f(\bar{\mathbf{x}}_0)-f(\bar{\mathbf{x}}^*))}{\eta_T(T+1)}+\frac{2c_0\mathbb{E}[\|\mathbf{v}_0-\mathbf{1}\otimes\bar{\mathbf{v}}_0\|^2]}{m\eta_T(T+1)}
+\frac{(\omega-1)\sigma^2}{16mL^2\tau\eta_T(T+1)}+\frac{\tau\rho^2\sigma^2\ln(\omega+T-1)}{8mL^2\eta_T(T+1)}
(54) +\frac{12\big(1+\frac{1}{c_1}\big)c_0\tau^{4/3}G^2\rho^2}{(\omega-1)^{1/3}\eta_T(T+1)}.

Since $\eta_T=\tau/(\omega+T)^{1/3}$, we have:

\min_{t\in[T]}\mathbb{E}[\|\nabla f(\bar{\mathbf{x}}_t)\|^2]+\frac{1}{m}\mathbb{E}[\|\mathbf{x}_t-\mathbf{1}\otimes\bar{\mathbf{x}}_t\|^2]
\leq\frac{2(f(\bar{\mathbf{x}}_0)-f(\bar{\mathbf{x}}^*))}{\tau(T+1)^{2/3}}+\frac{2c_0\mathbb{E}[\|\mathbf{v}_0-\mathbf{1}\otimes\bar{\mathbf{v}}_0\|^2]}{m\tau(T+1)^{2/3}}
+\frac{(\omega-1)\sigma^2}{16mL^2\tau^2(T+1)^{2/3}}+\frac{\rho^2\sigma^2\ln(\omega+T-1)}{8mL^2(T+1)^{2/3}}
(55) +\frac{12\big(1+\frac{1}{c_1}\big)c_0\tau^{1/3}G^2\rho^2}{(\omega-1)^{1/3}(T+1)^{2/3}}+O\Big(\frac{c_3\omega}{\tau T^{5/3}}\Big),

where the $O$-notation follows from $(\omega+T)^{1/3}-(T+1)^{1/3}\leq(\omega-1)(T+1)^{-2/3}/3$ and $c_3=\max\{1,(\omega-1)/(m\tau^2),\tau^{4/3}/\omega^{1/3},\tau\ln(\omega+T-1)/m\}$.

B.6. Proof for Corollary 2

Proof.

First, note that $\omega\geq\max\{2,\tau^3/\min\{k_1^3,k_2^3,k_3^3\}\}$ holds with $\tau=O(m^{1/3})$ and $\omega=O(m^{4/3})$. Plugging these parameters into Theorem 1 yields:

\min_{t\in[T]}\mathbb{E}[\|\nabla f(\bar{\mathbf{x}}_t)\|^2]+\frac{1}{m}\mathbb{E}[\|\mathbf{x}_t-\mathbf{1}\otimes\bar{\mathbf{x}}_t\|^2]
\leq O\Big(\frac{2(f(\bar{\mathbf{x}}_0)-f(\bar{\mathbf{x}}^*))}{m^{1/3}(T+1)^{2/3}}+\frac{2c_0\mathbb{E}[\|\mathbf{v}_0-\mathbf{1}\otimes\bar{\mathbf{v}}_0\|^2]}{m^{4/3}(T+1)^{2/3}}+\frac{\sigma^2}{16L^2m^{1/3}(T+1)^{2/3}}
(56) +\frac{\rho^2\sigma^2\ln(m^{4/3}+T)}{8mL^2(T+1)^{2/3}}+\frac{12\big(1+\frac{1}{c_1}\big)c_0G^2\rho^2}{m^{1/3}(T+1)^{2/3}}+\frac{c_3m}{T^{5/3}}\Big).

With $T\gg m^{4/3}$, we have:

\min_{t\in[T]}\mathbb{E}[\|\nabla f(\bar{\mathbf{x}}_t)\|^2]+\frac{1}{m}\mathbb{E}[\|\mathbf{x}_t-\mathbf{1}\otimes\bar{\mathbf{x}}_t\|^2]
\leq O\Big(\frac{2(f(\bar{\mathbf{x}}_0)-f(\bar{\mathbf{x}}^*))}{m^{1/3}(T+1)^{2/3}}+\frac{2c_0\mathbb{E}[\|\mathbf{v}_0-\mathbf{1}\otimes\bar{\mathbf{v}}_0\|^2]}{m^{4/3}(T+1)^{2/3}}
+\frac{\sigma^2}{16L^2m^{1/3}(T+1)^{2/3}}+\frac{\rho^2\sigma^2\ln T}{8mL^2(T+1)^{2/3}}
(57) +\frac{12\big(1+\frac{1}{c_1}\big)c_0G^2\rho^2}{m^{1/3}(T+1)^{2/3}}+\frac{c_3}{m^{1/3}T^{2/3}}\Big),

where $c_3=\max\{O(1),O(1/m^{1/3}),\ln(m^{4/3}+T)/m^{2/3}\}$. The above result implies that the convergence rate is $\tilde{O}(m^{-1/3}T^{-2/3})$. Thus, to achieve an $\epsilon^2$-stationary solution, the number of communication rounds needed is $T=\tilde{O}(m^{-1/2}\epsilon^{-3})$, and the total number of samples needed is $mT=\tilde{O}(m^{1/2}\epsilon^{-3})$.
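As a closing illustration (ours, not from the paper), solving the rate $m^{-1/3}T^{-2/3}=\epsilon^2$ for $T$ recovers the stated complexities, with the logarithmic factors hidden in $\tilde{O}$ dropped:

```python
def rounds_needed(m: int, eps: float) -> float:
    """From m^(-1/3) * T^(-2/3) = eps^2: T = m^(-1/2) * eps^(-3) (log factors dropped)."""
    return m ** (-0.5) * eps ** (-3.0)

for m in (10, 100):
    for eps in (0.1, 0.01):
        T = rounds_needed(m, eps)
        # One sample per node per round gives the sample complexity m * T = m^(1/2) * eps^(-3).
        print(f"m={m:4d}, eps={eps}: T ~ {T:.3g} rounds, {m * T:.3g} total samples")
```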