DIMIX: DIminishing MIXing for Sloppy Agents
Abstract
We study non-convex distributed optimization problems where a set of agents collaboratively solve a separable optimization problem that is distributed over a time-varying network. The existing methods for these problems rely on (at most) one time-scale algorithms, where each agent performs a diminishing or constant step-size gradient descent at the average estimate of the agents in the network. However, exchanging the exact information required to evaluate these average estimates, if possible at all, potentially introduces a massive communication overhead. Therefore, a reasonable practical assumption is that agents only receive a rough approximation of the neighboring agents’ information. To address this, we introduce and study a two time-scale decentralized algorithm that accommodates a broad class of lossy information sharing methods (including noisy, quantized, and/or compressed information sharing) over time-varying networks. In our method, one time-scale suppresses the (imperfect) incoming information from the neighboring agents, and the other operates on the gradients of the local cost functions. We show that with proper choices of the step-size parameters, the algorithm achieves a convergence rate of for non-convex distributed optimization problems over time-varying networks, for any . Our simulation results support the theoretical results of the paper.
1 Introduction
Distributed learning serves as a learning framework where a set of computing nodes/agents are interested in collaboratively solving an optimization problem. In this paradigm, in the absence of a central node, the learning task solely depends on on-device computation and local communication among the neighboring agents. With the appearance of modern computation architectures and the decentralized nature of storage, large-scale distributed computation frameworks have received significant attention due to data locality, privacy, data ownership, and scalability to larger datasets and systems. These features of distributed learning have led to applications in several domains including distributed deep networks [1, 2, 3], distributed sensor networks [4, 5, 6], and network resource allocation [7, 8]. In the context of Machine Learning, a prototypical application of distributed learning and optimization is the empirical risk minimization (ERM) problem. In ERM problems, each server in a network of data centers has access to a number of independent data points (realizations) sampled from an underlying distribution, and the objective is to minimize an additive and separable cost function (of the data points) over the underlying network [9, 10, 11].
Related Works. Decentralized consensus- or averaging-based optimization algorithms have been studied extensively over the past few years [12, 13, 14, 15, 16, 17]. It has been shown that when a fixed step-size is utilized, the loss function decreases with the rate of until the estimates reach a neighborhood of a (local) minimum of the objective cost function [12]. However, with a fixed step-size, the local estimates may not converge to an optimal point [16]. To remedy this, diminishing step-size variants of the algorithm have been introduced and studied for convex [18, 17, 19] and non-convex problems [20, 21, 22].
A majority of the existing works in this area suffer from a large communication overhead, as a (potentially) massive amount of local information needs to be exchanged among the agents throughout the learning algorithm. In addition, such a communication load can significantly delay the convergence of the algorithm and become a severe bottleneck when the dimension of the model is large. To mitigate the communication overhead, several compression techniques have been introduced to reduce the size of the exchanged information. In particular, this is used in the context of the distributed averaging/consensus problem [23, 12, 24], where each node has an initial numerical value and aims at evaluating the average of all initial values through the exchange of quantized information, over a fixed or a time-varying network. In the context of distributed optimization, various compression approaches have been introduced to mitigate the communication overhead [25, 26, 27, 28, 29].
An inexact proximal gradient method with a fixed step-size over a fixed network for strongly convex loss functions is proposed in [25]. In this algorithm, each node applies an adaptive deterministic quantizer to its model and gradient before sending them to its neighbors. It is shown that under certain conditions, the algorithm provides a linear convergence rate. In another related work [26], the authors proposed a decentralized gradient descent method, named ‘QuanTimed-DSGD’, with fixed step-sizes. In this algorithm, agents exchange a quantized version of their models/estimates in order to reduce the communication load. It is shown that for strongly convex loss functions and a given termination time , constant step-size parameters for the algorithm can be chosen so that the average distance between the models of the agents and the global optimum is bounded by for some and any . In this algorithm, not only do the step-sizes depend on , but the results also hold only at the exact termination iteration , which further needs to satisfy , where itself depends on as well as non-local parameters of the fixed underlying network. For smooth non-convex cost functions, it is shown that the temporal average (up to time ) of the expected norm of the gradients at the average models does not exceed . However, similar to the strongly convex case, the choice of parameters (i.e., the step-sizes) for QuanTimed-DSGD depends on the termination time , and needs to be re-evaluated when the target termination time varies. To compensate for the quantization error, a decentralized (diminishing step-size) gradient descent algorithm using error feedback is proposed in [27, 30]. The proposed algorithms achieve convergence rates of and for strongly convex and non-convex objective functions, respectively. However, the nature of the algorithm restricts its use to time-invariant networks, and in addition, the feedback mechanism cannot compensate for communication noise between the nodes. In [31], a two time-scale gradient descent algorithm was presented for distributed constrained convex optimization problems over an i.i.d. communication graph with noisy communication links and sub-gradient errors. It is shown that if the random communication satisfies certain conditions, proper choices of the time-scale parameters result in almost sure convergence of the local states to an optimal point. Another interesting approach to achieving exact convergence for distributed optimization with fixed gradient step-sizes under a noiseless communication model is to use gradient tracking methods [32, 33].
Our Contributions. We introduce and study a two time-scale decentralized gradient descent algorithm for a broad class of imperfect information sharing schemes over time-varying communication networks, for distributed optimization problems with smooth non-convex local cost functions. In this work, one time-scale addresses the convergence of the local estimates to a stationary point, while the newly introduced second time-scale suppresses the noisy effect of the imperfect incoming information from the neighbors.
Our main result shows that with a proper choice of the parameters for the two diminishing step-size sequences, the temporal average of the expected norm of the gradients decreases with the rate of . To prove this result, we establish new techniques to analyze the interplay between the two time-scales, in particular in the presence of underlying time-varying networks. To validate our theoretical results, we numerically evaluate the algorithm for optimizing a regularized logistic regression on the benchmark MNIST dataset. To assess the efficiency of our algorithm over time-varying networks, we also implement a linear regression model (convex) over synthetic data and a CNN (non-convex) over CIFAR-10. Our simulation results also confirm that introducing a time-shift in our step-sizes improves the finite-time performance of our algorithm, leading to a faster decay of the loss function.
Paper Organization. After introducing some notation, we present the problem formulation, the algorithm, the main result, and some of its practical implications in Section 2. Our theoretical results are corroborated by simulation results in Section 3. To prove the main result, we first provide some auxiliary lemmas in Section 4, whose proofs are presented in Appendix A. In Section 5, we present the proof of the main result, and in Section 6 the proof of Proposition 1. Finally, we conclude the paper in Section 7.
Notation and Basic Terminology. Throughout this paper, we use to denote the set of integers . In this paper, we deal with agents that are minimizing a function in . For notational convenience, we assume that the underlying functions act on row vectors, and hence we view vectors in as row vectors. The rest of the vectors, i.e., the vectors in , are assumed to be column vectors. The -norm of a vector is defined as . The Frobenius norm of a matrix is defined as . A vector is called stochastic if and . Similarly, a non-negative matrix is called (row) stochastic if for every . For a matrix , we denote its -th row and -th column by and , respectively. Note that we deal with two types of vectors throughout the paper. For an matrix and a strictly positive stochastic vector , we define the -norm of by . It can be verified that is a norm on the space of matrices. Finally, we write if (is well-defined and) has non-negative entries.
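For concreteness, the following minimal NumPy sketch computes this weighted norm under the standard instantiation used in this line of work, namely that the squared -norm of a matrix is the -weighted sum of its squared row norms; this instantiation, the function name, and the sanity check are our assumptions.

```python
import numpy as np

def pi_norm(A, pi):
    """Weighted matrix norm ||A||_pi, assuming ||A||_pi^2 = sum_i pi_i * ||A_i||_2^2,
    where A_i is the i-th row of A and pi is a strictly positive stochastic vector."""
    return np.sqrt(np.sum(pi * np.sum(A ** 2, axis=1)))

# Sanity check: with uniform pi, ||A||_pi reduces to ||A||_F / sqrt(n).
A = np.random.randn(4, 3)
pi = np.full(4, 0.25)
assert np.isclose(pi_norm(A, pi), np.linalg.norm(A, 'fro') / 2)
```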
2 Problem Setup and Main Result
In this section, first we formulate non-convex distributed optimization problems over time-varying networks and introduce some standard assumptions on the underlying problem. After proposing our algorithm, we state our main result. Finally, we discuss the implications of our result on various important practical settings with imperfect information sharing.
2.1 Problem Setup
This paper is motivated by stochastic learning problems in which the goal is to solve
(1)
where is a loss function, is the decision/optimization row vector, and is a random vector taking values in that is drawn from an unknown underlying distribution . One of the key practical considerations that renders (1) a challenging task is that the underlying distribution is often unknown. Instead, we have access to independent realizations of and focus on solving the corresponding ERM problem, which is given by
(2)
where is the empirical risk with respect to the data points . We assume that is a non-convex loss function, which potentially results in a non-convex function .
In distributed optimization, we have a network consisting of computing nodes (agents, or workers), where each node observes a non-overlapping subset of data points, denoted by , where . Here, represents the fraction of the data that is processed at node . Note that the vector is a strictly positive stochastic vector, i.e., and . Thus, the ERM problem in (2) can be written as the minimization of the weighted average of local empirical risk functions for all nodes in the network, i.e.,
(3)
where . We can rewrite the ERM problem in (3) as a distributed consensus optimization problem, given by
(4)
Consider a set of agents that are connected through a time-varying network. We represent this network at time by the directed graph , where the vertex set represents the set of agents and the edge set represents the set of links at time . At each discrete time , agent can only send information to its (out-) neighbors in , i.e., all with .
To discuss our algorithm (DIMIX) for solving (4) distributively, let us first discuss its general structure and the information required at each node for its execution. In this algorithm, at each iteration , agent updates its estimate of an optimizer of (3). To this end, it utilizes the gradient information of its own local cost function as well as a noisy/lossy average of its current neighbors’ estimates, denoted by . Here, is a row-stochastic matrix that is consistent with the underlying connectivity network (i.e., only if ) and is a random noise vector. Later, in Section 2.4, we discuss several noisy and lossy information sharing architectures (quantization and noisy communication) that fit in this broad information structure. Now we are ready to discuss the DIMIX algorithm. In this algorithm, using the information available to agent at time , agent updates its current estimate by computing a diminishing weighted average of its own state and the noisy average of its neighbors’ estimates, and moves along its local gradient. More formally, the update rule at node is given by
(5)
where and for some are the diminishing step-sizes of the algorithm, and is an arbitrary shift, introduced to improve the finite-time performance of the algorithm. The description of DIMIX is summarized in Algorithm 1. For notational simplicity, let
(6)
Since , we can rewrite the update rule in (5) in matrix form
(7)
Let us discuss some important aspects of the above update rule. Note that both and are diminishing step-sizes. If and are both constant, then the dynamics in (7) reduces to the algorithm proposed in [26] for both convex and non-convex cost functions. Alternatively, if is a constant sequence and for all , then (7) reduces to averaging-based distributed optimization with diminishing (gradient) step-sizes, which is introduced and studied in [18] for local convex cost functions . The newly introduced time-scale/step-size suppresses the incoming noise from the neighboring agents. However, it also suppresses the incoming signal level at each node . This poses a major technical challenge for establishing convergence-to-consensus guarantees for the algorithm over time-varying networks.
On the other hand, the diminishing step-size for the gradient update is . We chose this representation of the algorithm to make explicit that the local mixing (consensus) scheme operates on a faster time-scale than the gradient descent.
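To make the interplay between the two time-scales concrete, the following NumPy sketch implements one iteration of the matrix dynamics (7). It is an illustration rather than the authors’ reference implementation: the polynomial form of the step-sizes, the parameter names, the additive-noise model for the lossy neighbor average, and the gradient step entering through the product of the two step-sizes are our assumptions, chosen to be consistent with the description above.

```python
import numpy as np

def dimix_step(X, W, grad_fn, t, beta0=1.0, alpha0=0.1,
               nu=0.25, mu=0.5, shift=0, noise_std=0.01):
    """One DIMIX iteration in matrix form, cf. (7) -- a sketch, not reference code.
    X: n x d matrix of local estimates; W: row-stochastic mixing matrix for time t;
    grad_fn: returns the n x d matrix of local stochastic gradients at X.
    We assume polynomially diminishing step-sizes,
        beta(t)  = beta0  / (t + shift + 1)**nu   (mixing, faster time-scale)
        alpha(t) = alpha0 / (t + shift + 1)**mu   (gradients, slower time-scale)."""
    beta = beta0 / (t + shift + 1) ** nu
    alpha = alpha0 / (t + shift + 1) ** mu
    # Lossy neighbor average: the exact average W @ X plus a generic noise term.
    noisy_avg = W @ X + noise_std * np.random.randn(*X.shape)
    # Damped mixing with the imperfect average, followed by a local gradient step.
    return (1 - beta) * X + beta * noisy_avg - alpha * beta * grad_fn(X)
```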
2.2 Assumptions
In order to provide performance guarantees for DIMIX, we need to assume certain regularity conditions on (i) the statistics of the (neighbors’ averaging) noise process , (ii) the mixing properties of the weight sequence , and (iii) the loss function .
First, we discuss our main assumption on the noise sequence .
Assumption 1 (Neighbors State Estimate Assumption)
Suppose that is adapted to a filtration on the underlying probability space (see e.g. Section 5.2 in [34]). We assume that there exists some such that for all and all , the noise sequence satisfies
(8)
(9)
Note that the natural filtration of the random process is one choice for . In this case, (8) reduces to and
Next, we discuss the main assumption on the network connectivity, or more precisely, the mixing of information over the time-varying network.
Assumption 2 (Connectivity Assumption)
We assume that the weight matrix sequence in (7) satisfies the following properties; a numerical sanity check of these conditions is sketched after the list.
- (a) Stochastic with Common Stationary Distribution: is non-negative and and for all , where is the all-one vector, and is the weight vector.
- (b) Bounded Nonzero Elements: There exists some such that if for some and we have , then .
- (c) -Connected: For a fixed integer , the graph is strongly connected for all , where .
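As an illustration, the sketch below numerically checks the three conditions on a finite window of weight matrices. The function name, the tolerances, and the use of networkx for testing strong connectivity are ours.

```python
import numpy as np
import networkx as nx

def check_assumption2(Ws, pi, eta, B, tol=1e-10):
    """Check Assumption 2 on a finite list Ws of n x n weight matrices:
    (a) each W is non-negative, row-stochastic, and satisfies pi^T W = pi^T;
    (b) every nonzero entry of every W is at least eta;
    (c) the union graph over each length-B window is strongly connected."""
    for W in Ws:
        if (W < -tol).any() or not np.allclose(W.sum(axis=1), 1.0, atol=tol):
            return False                    # violates (a): row stochasticity
        if not np.allclose(pi @ W, pi, atol=tol):
            return False                    # violates (a): common stationary pi
        if (W[W > tol] < eta - tol).any():
            return False                    # violates (b): entry below eta
    for k in range(0, len(Ws) - B + 1, B):
        union = sum(Ws[k:k + B]) > tol      # edges appearing in the window
        G = nx.from_numpy_array(union.astype(int), create_using=nx.DiGraph)
        if not nx.is_strongly_connected(G):
            return False                    # violates (c): B-connectivity
    return True
```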
Finally, for the loss function , we make the following assumption.
Assumption 3 (Stochastic Loss Function Assumption)
We assume that:
- (a) The function is -smooth with respect to its first argument, i.e., for any and we have that .
- (b) The stochastic gradient is unbiased and has bounded variance, i.e.,
Note that Assumption 3-(b) implies homogeneous sampling, i.e., each agent draws i.i.d. sample points from a common data batch. In a related work [35], a stronger assumption has been considered which allows for heterogeneous data samples.
Remark 1
Recall that local function is defined as . Then the conditions of Assumption 3 on the function can be directly translated to conditions on given by
.
2.3 Main Result
Here, we characterize the convergence rates of our algorithm for -smooth non-convex loss functions. More precisely, we establish a rate for the temporal average of the expected norm of the gradients for various choices of the time-scale parameters .
Theorem 1
Suppose that Assumptions 1–3 hold and let , where , , and are arbitrary constants with and . Then the weighted average estimates generated by Algorithm 1 satisfy
(10)
where is an arbitrary constant.
Furthermore, for we get the optimal rate of
(11)
Remark 2
Note that the expectation operator is over the randomness of the dataset and the compression/communication noise. Moreover, note that the theorem above shows that the gradient of (which depends on the choice of ) at the average state of (which also depends on ) vanishes at a certain rate. It is worth mentioning that this is not the performance of the average trajectory for the average function.
Remark 3
Proposition 1
Remark 4
We should comment that, almost universally, the existing results and algorithms on distributed optimization over time-varying graphs assume a uniform positive lower bound on the non-zero elements of the (effective) weight matrices [36, 12, 37, 38, 39, 16, 21, 40]. The absence of such an assumption significantly increases the complexity of the convergence analysis of the algorithm. In our work, even though the stochastic matrix sequence is assumed to be -connected, the effective averaging weight sequence, given by , has vanishing (non-self) weights. One of the major theoretical contributions of this work is to introduce tools and techniques to study distributed optimization with diminishing weight sequences.
Remark 5
In a related work [26] on distributed optimization with compressed information sharing among the nodes, the authors considered a fixed step-size (zero time-scale) version of Algorithm 1 with a fixed averaging matrix . It is shown that for a given termination time , the algorithm’s step-sizes can be chosen (depending on ) such that the temporal average (up to iteration ) of the expected norm of the gradient (i.e., defined in (12)) does not exceed (where is a constant). However, the step-sizes need to be re-evaluated and the algorithm re-executed if one targets another termination time . In this work, we use vanishing step-sizes and (which do not depend on the termination time) and show that the same temporal average vanishes at the rate of for every iteration and any arbitrarily small .
2.4 Examples for Stochastic Noisy State Estimation
[Figure 1: General architecture of lossy information sharing: each node compresses/quantizes its estimate, sends it over a (possibly noisy) channel, and the receiving node decodes an approximate neighbors’ average.]
The noisy information in (5) is very general, and captures several models of imperfect data used in practice and/or theoretical studies. A rather general architecture that leads to such noisy/lossy estimates is demonstrated in Figure 1: Once the estimate of node at iteration is evaluated, node may apply an operation (such as compression/sparsification or quantization) on its own model to generate . This data will be sent over a potentially noisy communication channel, and a neighbor will receive a corrupted version of , say from every neighbor node . Upon receiving such channel outputs from all its neighbors, node computes their weighted average . This can be decoded to the approximate average model . In the following we describe three popular frameworks, in which each node can only use an imperfect neighbors’ average to update its estimate. It is worth emphasizing that these are just some examples that lie under the general model in (5).
Example 1
(Unbiased Random Sparsification). Let be the set of all subsets of of size for a parameter . We define the operator as
(16)
For a given time, with drawn uniformly at random from , independently of the input vector , the iteration, and the other agents, is an unbiased version of the random sparsification introduced in [41]. It can be verified that, when , we have
Then, we can get
where . From the properties of , we can verify that and
which fulfills the conditions of Assumption 1. With respect to the architecture of Figure 1, we have . Moreover, the noisy channel is perfect, and the decoder is just an identity function, i.e., and .
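A minimal implementation of this operator might look as follows; the rescaling by d/k is what makes the rand-k selection unbiased, and the helper name and the Monte Carlo check are ours.

```python
import numpy as np

def rand_k_sparsify(x, k, rng):
    """Unbiased rand-k sparsification: keep a uniformly random size-k subset
    of the d coordinates and rescale the survivors by d/k, so E[Q(x)] = x."""
    d = x.shape[0]
    S = rng.choice(d, size=k, replace=False)
    q = np.zeros_like(x)
    q[S] = (d / k) * x[S]
    return q

# Monte Carlo check of unbiasedness.
rng = np.random.default_rng(0)
x = rng.standard_normal(10)
est = np.mean([rand_k_sparsify(x, 3, rng) for _ in range(20000)], axis=0)
print(np.max(np.abs(est - x)))  # small: the operator is unbiased
```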
Example 2
(Stochastic Quantizer with bounded trajectory). The stochastic quantizer with a number of quantization levels maps a vector to a random vector , where its th entry is given by
(17)
and is a random variable taking values
(20)
Note that the random variables are independent across coordinates, agents, and time steps. Thus, in this case, the relationship between and in Figure 1 is . Furthermore, the noisy channel is perfect, and the decoder component is just an identity function, i.e., and . It is shown in [42] that applying this quantizer to with a bounded norm satisfies and . Therefore, the neighbor estimate for node will be
where satisfies and
provided that for every and every . Therefore, the conditions of Assumption 1 are satisfied.
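The following sketch implements a stochastic quantizer of this type, following the QSGD construction of [42]: each coordinate of the normalized vector is randomly rounded to one of the two nearest quantization levels so that the quantizer is unbiased. The function name and the numerical check are ours.

```python
import numpy as np

def stochastic_quantize(v, s, rng):
    """QSGD-style stochastic quantizer with s levels: each |v_i|/||v|| is
    randomly rounded to one of the two nearest points of {0, 1/s, ..., 1},
    with probabilities chosen so that E[Q(v)] = v."""
    norm = np.linalg.norm(v)
    if norm == 0:
        return np.zeros_like(v)
    scaled = s * np.abs(v) / norm          # lies in [0, s]
    lower = np.floor(scaled)
    prob_up = scaled - lower               # probability of rounding up
    level = lower + (rng.random(v.shape) < prob_up)
    return norm * np.sign(v) * level / s

rng = np.random.default_rng(0)
v = rng.standard_normal(8)
est = np.mean([stochastic_quantize(v, 4, rng) for _ in range(20000)], axis=0)
print(np.max(np.abs(est - v)))  # small: E[Q(v)] = v
```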
Note that in both examples above, and might be correlated, especially when nodes and have common neighbor(s). However, this does not violate the conditions of Assumption 1.
Example 3
(Noisy Communication). The noisy neighbor estimate model may arise due to imperfect communication between the agents. Consider a wireless network, in which the computing nodes communicate with their neighbors over a Gaussian channel, i.e., when node sends its state (without compression, i.e., ) to its neighbor , the signal received at node is where is a zero-mean Gaussian noise with variance , independent across , and . Applying an identity map decoder at node (i.e., ) we have
Thus, we get implying
. Hence, the conditions of Assumption 1 are satisfied.
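For completeness, a direct simulation of this model might look as follows (the function name and the loop-based implementation are ours). The output equals the exact weighted average plus a zero-mean Gaussian perturbation, matching the conditions of Assumption 1.

```python
import numpy as np

def noisy_neighbor_average(X, W, sigma, rng):
    """Weighted neighbor averaging over AWGN links: node j receives each
    neighbor's state corrupted by independent Gaussian noise, then forms
    the weighted average, i.e., W @ X plus a zero-mean perturbation."""
    n, d = X.shape
    avg = np.zeros_like(X)
    for j in range(n):
        for i in range(n):
            if W[j, i] > 0:
                received = X[i] + sigma * rng.standard_normal(d)  # channel output
                avg[j] += W[j, i] * received
    return avg
```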
3 Experimental Results
In the following, we discuss two sets of simulations to verify our theoretical results. First, we run Algorithm 1 on a fixed network to compare fixed step-sizes [26] with diminishing step-sizes (this work). Then, we compare the performance of our algorithm on fixed versus time-varying graphs. Throughout this section, we use the unbiased stochastic quantizer in (17) with various numbers of quantization levels . For the CNN experiments over CIFAR-10 and the regularized logistic regression problem over the MNIST dataset, the mini-batch sizes are set to and , respectively; the loss function is the cross-entropy function, and we use data points distributed across agents. Moreover, for each set of experiments, we distribute the data points across the nodes according to , where is drawn uniformly at random from the interval . Our codes are implemented using Python and tested on a 2017 MacBook Pro with 16 GB memory and a 2.5 GHz Intel Core i7 processor.
3.1 DIMIX vs. QuanTimed-DSGD over Fixed Network
We use regularized logistic regression to validate our algorithm on the benchmark MNIST dataset. Data and Experimental Setup. In this experiment, we train a regularized logistic regression model to classify the MNIST dataset over a fixed network. First, we generate a random (connected) undirected Erdős–Rényi graph with edge probability , which is fixed throughout this experiment. We also generate a doubly stochastic weight matrix , where is the Laplacian matrix and is the maximum degree of the graph. Finally, we use the time-invariant weight sequence , where , , and is a diagonal matrix with as its th diagonal element. It can be verified that satisfies the conditions of Assumption 2, i.e., it is row-stochastic and satisfies . We implemented our algorithm with quantizer parameter and tuned the step-size parameters , , with . For the fixed step-sizes, we used the termination time , which results in and for all .
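This construction can be sketched as follows. The helper name is ours, the edge probability p is left as a free parameter, and the formula W = I − L/(d_max + 1) is assumed as the common doubly stochastic construction consistent with the description above in terms of the Laplacian and the maximum degree.

```python
import numpy as np
import networkx as nx

def er_weight_matrix(n, p, seed=0):
    """Sample a connected Erdos-Renyi graph and build the doubly stochastic
    weight matrix W = I - L / (d_max + 1), with L the graph Laplacian and
    d_max the maximum degree."""
    rng = np.random.default_rng(seed)
    while True:  # resample until the graph is connected
        G = nx.erdos_renyi_graph(n, p, seed=int(rng.integers(1 << 30)))
        if nx.is_connected(G):
            break
    L = nx.laplacian_matrix(G).toarray()
    d_max = max(dict(G.degree).values())
    W = np.eye(n) - L / (d_max + 1)
    # W is symmetric and non-negative, with all row/column sums equal to one.
    assert np.allclose(W.sum(axis=1), 1.0) and (W >= 0).all()
    return G, W
```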
[Figure 2: Network variance under fixed versus diminishing step-sizes.]
Results. The plot in Figure 2 shows the network variance, defined by , for fixed and diminishing step-sizes. It can be observed that with fixed step-sizes the algorithm only reaches a neighborhood of the average model, and consensus is not necessarily achieved. In contrast, diminishing step-sizes guarantee that each node’s state converges to the average model.
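For reference, the plotted quantity can be computed with a short helper like the one below, assuming (on our part) that the network variance is the -weighted squared deviation of the local models from the weighted average model.

```python
import numpy as np

def network_variance(X, pi):
    """pi-weighted squared deviation of the local models (rows of X) from
    the average model, mirroring the consensus quantity in Figure 2."""
    x_bar = pi @ X                      # weighted average model (length-d vector)
    return float(np.sum(pi * np.sum((X - x_bar) ** 2, axis=1)))
```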
3.2 Diminishing Step-sizes over Time-varying vs. Fixed Network
Here, we provide simulation results to demonstrate the performance of DIMIX on time-varying networks. For this, we conduct simulations on non-convex (CNN) and convex (linear regression) problems. For the sake of comparison, we apply our algorithm on a corresponding fixed network.
Data and Experimental Setup. For the non-convex setting, we study the problem of learning the parameters of the LeNet-5 CNN [43] for the CIFAR-10 dataset. The LeNet-5 architecture consists of two convolutional layers, two sub-sampling (pooling) layers, and three fully connected layers, with the ReLU activation function. As before, we have nodes and data points, which are distributed among the agents according to some stochastic vector . For the convex setting, we consider a -dimensional linear regression problem. We generate data points , where , with , and for all . In order to generate the data, we uniformly and independently draw the entries of each and each from and , respectively. Similarly, the entries of are sampled uniformly and independently from . The goal is to estimate the unknown coefficient vector , which leads to solving the minimum mean square error problem, i.e.,
.
Experiments with Fixed Graph. We consider a fixed undirected cyclic graph , where , and . For these experiments, the stochastic matrix sequence for any is given by
(21)
Experiments with Time-varying Graph. To evaluate Algorithm 1 for a time-varying network, we use cyclic gossip [44, 45, 6, 46, 47] iterations. Here, the algorithm goes around the cycle graph (described above), and one pair of neighbors (on the cycle) exchanges information at a time. More precisely, at time , the averaging graph consists of only a single edge . For , we let
(22)
Note that each edge in will be visited periodically, and hence, the sequence of stochastic matrices in (22) is -connected with .
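A sketch of this matrix sequence follows; the function name and the equal 1/2–1/2 averaging weights on the active edge are our assumptions, consistent with the gossip description above.

```python
import numpy as np

def cyclic_gossip_matrix(n, t):
    """Stochastic matrix for cyclic gossip, cf. (22): at time t only the cycle
    edge between nodes t mod n and (t+1) mod n is active; its two endpoints
    average their states while every other node keeps its own state."""
    i, j = t % n, (t + 1) % n
    W = np.eye(n)
    W[i, i] = W[j, j] = 0.5
    W[i, j] = W[j, i] = 0.5
    return W
```

Since the union of any consecutive gossip graphs covers the entire cycle, this sequence indeed satisfies the -connectivity condition of Assumption 2.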
[Figure 3: Training loss of LeNet-5 on CIFAR-10 (left) and of linear regression (right), over fixed and time-varying networks.]
Results. In Figure 3, the plot on the left shows the training loss of LeNet-5 on the CIFAR-10 dataset, and the plot on the right shows the training loss of the linear regression. Here, ‘Fixed network’ refers to the full cycle with the stochastic matrix in (21), and ‘Time-varying network’ refers to the network with the stochastic matrix sequence in (22). For the CNN, the parameters of the dynamics in (5) are fine-tuned to , , , and is the parameter of the stochastic quantizer for both networks. For the linear regression model, the parameters are adjusted to , , and with the quantizer parameter . It can be observed that the algorithm works for both fixed and time-varying networks. However, the performance over the time-varying network naturally suffers from slower convergence, due to the slower mixing of information over the network.
4 Auxiliary Lemmas
The following lemmas are used to facilitate the proof of the main result of the paper. We present the proofs of Lemmas 1–5 in Appendix A.
Lemma 1
For any pair of vectors , , and any scalar , we have
(23)
Similarly, for matrices and and any scalar , their -norms satisfy
(24)
Lemma 2
Let be a sequence in and be a positive scalar. Then the following identities hold
(25)
(26)
As a result, for any sequence in and , we have
and
Lemma 3
For any , , and , we have
(27)
Lemma 4
For an matrix and matrix , we have:
As we discussed in Remark 4, we cannot use the conventional results (e.g., in [48, 36, 12, 37, 38, 39, 16, 21, 40]) to bound the norm of a vector after averaging with matrices with diminishing weights. The following lemma provides a new bounding technique, which we use in the proof of the main result of this work.
Lemma 5
Let satisfy the Connectivity Assumption 2 with parameters , and let be given by where for all and is a non-increasing sequence. Then, for any matrix , parameter , and all , we have
where and .
5 Proof of Theorem 1
For our analysis, first we obtain an expression for the average reduction of the objective function at the average of the states, i.e., . Recall that for all . Hence, multiplying both sides of (7) by , we get
From Assumption 3-(a) and Lemma 3.4 in [49], we can conclude
or equivalently,
Then, since is adapted to the filtration and using Assumption 1, we arrive at
(28)
(29)
Let us focus on the second term in (28). Using the identity , we can write
(30)
Moreover, we have
where the last inequality holds as is a convex function and is a stochastic vector. Therefore, using the above inequality and Assumption 3-(a), we get
(31)
(32)
Next, we analyze the last term in (28). From Assumption 1 we have , which leads to
(33)
For the first term in (33), we again exploit Assumption 1 which implies
(34)
where we used the fact that and the inequality holds since
for all . Thus, the last term in (28) is upper bounded as
(35)
Therefore, replacing (30), (31), and (35) in (28) we get
Taking the expectation of both sides leads to
(36)
(37)
(38)
5.1 State Deviations from the Average State:
Note that the dynamics in (7) can be viewed as the linear time-varying system
(39)
with
The solution of (39) is given by
(40)
where with is the transition matrix of the linear system (39). For notational simplicity, let us define
As a result of Lemma 5, we have
where is defined by
(41)
Now consider the dynamics in (40). Assuming , we get
(42)
Moreover, multiplying both sides of (42) from the left by and using , we get
(43)
Then, subtracting (43) from (42), and plugging the definition of we have
Therefore, using Lemma 1 with , we get
By expanding , we get
(44)
(45)
Recall from Assumption 1 that . Moreover, since is measurable with respect to for , we have
Using a similar argument for and conditioning on , we conclude that the above relation holds for all . Therefore, taking the expectation of both sides of (44) and noting that the second and the fourth terms have zero mean, we get
(46)
We continue with bounding the first term in (46). First note that Assumption 1 implies
(47)
This together with Lemma 5 leads to
(48)
(49)
To bound the second term in (46), we use the triangle inequality for the norm , and write
(50)
Using Lemma 5 and , we can upper-bound this expression as
(51)
(52)
(53)
(54)
(55)
(56)
where is given by (41). Next, the fact that and Lemma 2 imply
(57)
(58)
(59)
Using this inequality in (56), we get
(60)
Finally, using the bounds obtained in (48) and (60) in (46) we arrive at
(61)
(62)
5.2 Analysis of the overall deviation:
Our goal here is to bound the overall weighted deviation of the states from their average. First, recall the bound on derived in Section 5.1 for each . We aim to bound
(63)
Focusing on the first term in (63), we can write
(64)
(65)
(66)
where the first inequality is due to the fact that for , and the second one follows from Lemma 2. Similarly, using the fact that for , for the second term in (63), we have
(67)
(68)
(69)
Recall that . Then, using the fact that , we can write
(70)
(71)
where the last inequality follows from Lemma 2. Therefore, from (69) and (71) we have
(72)
Therefore, from (64) and (72) we can conclude
(73)
5.3 Bounding
In this part, we provide an upper bound on . Following [26], we can rewrite as
Then, since is a convex function, we have
(74)
Next, we bound each term in (5.3). Recall from (6) that the th row of is given by , where is the th row of matrix . Thus, from Assumption 3-(a) we have
(75)
(76)
(77)
Similarly, using the convexity of function , for the second term in (5.3) we have
(78)
(79)
(80)
(81)
(82)
(83)
(84)
(85)
(86)
where in (a) we replaced the definitions of and from (3) and (2), respectively, the equality in (b) holds since s are independent samples from the underlying distribution, and (c) follows from Assumption 3-(b) and the fact that for . Finally, for the third term in (5.3), we have
(87)
Plugging (77)–(87) in (5.3), we get
(88)
Next, replacing this bound in (73), we arrive at
(89)
(90)
(91)
(92)
Now, let us define . Then, and , where and . Furthermore, we set such that . Then, for we can rewrite (89) as
where the second inequality holds since is a non-increasing sequence, and
(93)
does not grow with . Therefore, we have
(94)
(95)
(96)
5.4 Back to the Main Dynamics
Recall the dynamics in (36), that is,
Summing (36) for , and using (94) we get
(97)
(98)
(99)
(100)
(101)
(102)
(103)
where . Recall that for . Moreover, for and defined in Lemma 5 we have and . Therefore, for we have
This allows us to lower bound the second summation in (97) for as
(104)
(105)
(106)
(107)
where does not grow with , and the inequality holds since for . Similarly, for , we have . Therefore, for , the last summation in (97) can be upper bounded by
(108)
(109)
(110)
(111)
where
(112)
does not depend on . Plugging (104) and (108) in (97), for , we get
Hence,
(113)
(114)
(115)
(116)
(117)
where
is constant (does not depend on ), and the last inequality in (117) follows from the fact that
(118)
5.5 Bound on the Moments of and
The inequality in (117) provides us with an upper bound on the temporal average of sequence . However, our goal is to derive a bound on the temporal average of . To this end, for any given , we define the measure of convergence as
Note that by Hölder’s inequality [50, Theorem 6.2] for any with , and non-negative sequences and , we have
or equivalently,
(119)
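For reference, the sequence form of Hölder's inequality being invoked reads as follows (with conjugate exponents p and q):

```latex
% Hölder's inequality for non-negative sequences (a_t), (b_t) and
% conjugate exponents p, q > 1 with 1/p + 1/q = 1:
\sum_{t=1}^{T} a_t b_t
  \;\le\;
  \Bigl(\sum_{t=1}^{T} a_t^{p}\Bigr)^{1/p}
  \Bigl(\sum_{t=1}^{T} b_t^{q}\Bigr)^{1/q}
```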
Let so that . Furthermore, let
(120)
Then, applying Hölder’s inequality (119), we arrive at
(121)
It remains to upper bound the terms on the right-hand side of (121). First, using Lemma 3 we get
Therefore,
(122)
Next, continuing from (117), for we can write
(123)
(124)
(125)
where the last inequality follows from the fact that . Next, we can use Lemma 3 to bound , as
(126)
where we used the fact that for . Similarly, applying Lemma 3 on , we get
(127)
(128)
in which we have upper bounded by , for . Note that the upper bound in (123) is the maximum of a constant and two functions of . Moreover, depending on the values of and , each can be upper bounded by a constant, a logarithmic function, or a polynomial (in ). Figure 4 illustrates the four regions of . In the following, we first analyze the interiors of the four regions shown in Figure 4, and then study the boundary cases.
[Figure 4: The four regions of the step-size parameters analyzed below.]
Region : and
Recall from (126) and (127) that both and are upper bounded by constants, while the remaining term in (123) grows faster than a constant. Hence, from (123), in this regime we have
(129)
(130)
(131)
Note that does not depend on . Plugging (122) and (129) in (121), we arrive at
(132)
Region : and
From (126) and (127), we can see that and are upper bounded by a polynomial and a constant function of , respectively. Noting that for sufficiently large a polynomial dominates any constant, we can write
(133)
(134)
(135)
This together with (121) and (122) implies
(136)
Region : and
Recall from (126) and (127) that and are upper bounded by a constant and a polynomial function of , respectively. Therefore, for sufficiently large , we get
(137)
(138)
(139)
Therefore, plugging (137) and (122) into (121), we get
(140)
Region : and
In this region, we can use (126) and (127) to upper bound both and by polynomial functions. Thus, for sufficiently large , we get
(141)
(142)
(143)
Then, we can plug (141) and (122) into (121) to conclude
(144)
(145)
Summarizing the results of the four cases above, we get
(146)
(147)
which concludes the first claim of Theorem 1.
Recall that our goal is to optimize over to achieve the best convergence rate. This is equivalent to maximizing the exponent of in each of the four regions, i.e.,
Interestingly, it turns out that the respective supremum value in all four regions is , which corresponds to the boundary point . However, this point does not belong to any of the corresponding open sets, which motivates the convergence analysis of for .
6 Proof of Proposition 1
First note that we cannot directly conclude the proposition from Theorem 1, since the theorem only holds for , and not . In order to show the claim, for a vector and some we define . (Note that for is not a norm, since it is not a subadditive function for .) Then we have since
where the inequality holds since we have , and .
Now, for the vector with we have
Then, from Theorem 1 for and , we get
(155)
where the last equality holds since for any . Similarly, for the vector with and any we have
(156)
We need to bound the RHS of (6). Let and . Then, using Hölder's inequality in (119) for , we can write
(157)
Note that is bounded in (122). Moreover, for we can write
(158)
where follows from (94), holds since , the inequality in follows from (117) and (118), we have used the bounds in (126) and (127) for in , and holds since the constant term is upper bounded by for large enough . Plugging (122) for and (158) into (157) and (6), and setting , we arrive at
(159)
where the last equality holds since for any . Finally, combining (12), (13), and using Assumption 3 and Lemma 1 (for ) we have
for every . This concludes the proof of Proposition 1.
7 Conclusion
We have studied non-convex distributed optimization over time-varying networks with lossy information sharing. Inspired by the original averaging-based distributed optimization algorithm, we proposed and studied a two time-scale decentralized algorithm that includes a damping mechanism for the incoming information from the neighboring agents as well as the gradients of the local cost functions. We presented the convergence rate for various choices of the diminishing step-size parameters. By optimizing the achieved rate over all feasible parameter choices, the algorithm obtains a convergence rate of for non-convex distributed optimization problems over time-varying networks, for any . Our theoretical results are corroborated by numerical simulations.
References
- [1] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, et al., “Large scale distributed deep networks,” in Advances in neural information processing systems, pp. 1223–1231, 2012.
- [2] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al., “Tensorflow: A system for large-scale machine learning,” in 12th USENIX Symposium on OSDI, pp. 265–283, 2016.
- [3] P. H. Jin, Q. Yuan, F. Iandola, and K. Keutzer, “How to scale distributed deep learning?,” in NIPS Workshop MLSystems, 2016.
- [4] D. Li, K. D. Wong, Y. H. Hu, and A. M. Sayeed, “Detection, classification, and tracking of targets,” IEEE signal processing magazine, vol. 19, no. 2, pp. 17–29, 2002.
- [5] M. Rabbat and R. Nowak, “Distributed optimization in sensor networks,” in Proceedings of the 3rd International Symp. on Information processing in Sensor Networks, pp. 20–27, 2004.
- [6] S. Kar and J. M. Moura, “Distributed consensus algorithms in sensor networks with imperfect communication: Link failures and channel noise,” IEEE Trans. Signal Process., vol. 57, no. 1, pp. 355–369, 2008.
- [7] A. Ribeiro, “Ergodic stochastic optimization algorithms for wireless communication and networking,” IEEE Transactions on Signal Processing, vol. 58, no. 12, pp. 6369–6386, 2010.
- [8] T. Chen, Q. Ling, and G. B. Giannakis, “An online convex optimization approach to proactive network resource allocation,” IEEE Trans. Signal Process., vol. 65, no. 24, pp. 6350–6364, 2017.
- [9] A. Nedić, S. Lee, and M. Raginsky, “Decentralized online optimization with global objectives and local communication,” in 2015 American Control Conference (ACC), pp. 4497–4503, IEEE, 2015.
- [10] R. Xin, A. K. Sahu, S. Kar, and U. A. Khan, “Distributed empirical risk minimization over directed graphs,” in 2019 53rd Asilomar Conference on Signals, Systems, and Computers, pp. 189–193, IEEE, 2019.
- [11] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, “mixup: Beyond empirical risk minimization,” arXiv preprint arXiv:1710.09412, 2017.
- [12] A. Nedić and A. Ozdaglar, “Distributed subgradient methods for multi-agent optimization,” IEEE Trans. Automat. Contr., vol. 54, no. 1, pp. 48–61, 2009.
- [13] I. Lobel and A. Ozdaglar, “Distributed subgradient methods for convex optimization over random networks,” IEEE Trans. Automat. Contr., vol. 56, no. 6, pp. 1291–1306, 2010.
- [14] I. Lobel, A. Ozdaglar, and D. Feijer, “Distributed multi-agent optimization with state-dependent communication,” Mathematical programming, vol. 129, no. 2, pp. 255–284, 2011.
- [15] J. Chen and A. H. Sayed, “Diffusion adaptation strategies for distributed optimization and learning over networks,” IEEE Trans. Signal Process., vol. 60, no. 8, pp. 4289–4305, 2012.
- [16] K. Yuan, Q. Ling, and W. Yin, “On the convergence of decentralized gradient descent,” SIAM Journal on Optimization, vol. 26, no. 3, pp. 1835–1854, 2016.
- [17] D. Jakovetić, J. Xavier, and J. M. Moura, “Fast distributed gradient methods,” IEEE Trans. Automat. Contr., vol. 59, no. 5, pp. 1131–1146, 2014.
- [18] A. Nedić, A. Ozdaglar, and P. A. Parrilo, “Constrained consensus and optimization in multi-agent networks,” IEEE Trans. Automat. Contr., vol. 55, no. 4, pp. 922–938, 2010.
- [19] A. Nedić and A. Olshevsky, “Distributed optimization over time-varying directed graphs,” IEEE Trans. Automat. Contr., vol. 60, no. 3, pp. 601–615, 2014.
- [20] Y. Sun, G. Scutari, and D. Palomar, “Distributed nonconvex multiagent optimization over time-varying networks,” in IEEE Asilomar Conf. Signals Syst. Comput., pp. 788–794, 2016.
- [21] T. Tatarenko and B. Touri, “Non-convex distributed optimization,” IEEE Trans. Automat. Contr., vol. 62, no. 8, pp. 3744–3757, 2017.
- [22] J. Zeng and W. Yin, “On nonconvex decentralized gradient descent,” IEEE Transactions on signal processing, vol. 66, no. 11, pp. 2834–2848, 2018.
- [23] A. Kashyap, T. Başar, and R. Srikant, “Quantized consensus,” Automatica, vol. 43, no. 7, pp. 1192–1203, 2007.
- [24] K. Cai and H. Ishii, “Quantized consensus and averaging on gossip digraphs,” IEEE Trans. Automat. Contr., vol. 56, no. 9, pp. 2087–2100, 2011.
- [25] Y. Pu, M. N. Zeilinger, and C. N. Jones, “Quantization design for distributed optimization,” IEEE Trans. Automat. Contr., vol. 62, no. 5, pp. 2107–2120, 2016.
- [26] A. Reisizadeh, H. Taheri, A. Mokhtari, H. Hassani, and R. Pedarsani, “Robust and communication-efficient collaborative learning,” in Advances in Neural Information Processing Systems, pp. 8388–8399, 2019.
- [27] A. Koloskova, S. Stich, and M. Jaggi, “Decentralized stochastic optimization and gossip algorithms with compressed communication,” in International Conference on Machine Learning, pp. 3478–3487, PMLR, 2019.
- [28] H. Reisizadeh, B. Touri, and S. Mohajer, “Distributed optimization over time-varying graphs with imperfect sharing of information,” arXiv preprint arXiv:2106.08469, 2021.
- [29] H. Reisizadeh, B. Touri, and S. Mohajer, “Adaptive bit allocation for communication-efficient distributed optimization,” in 2021 60th IEEE Conference on Decision and Control (CDC), pp. 1994–2001, IEEE, 2021.
- [30] A. Koloskova, T. Lin, S. U. Stich, and M. Jaggi, “Decentralized deep learning with arbitrary communication compression,” arXiv preprint arXiv:1907.09356, 2019.
- [31] K. Srivastava and A. Nedic, “Distributed asynchronous constrained stochastic optimization,” IEEE journal of selected topics in signal processing, vol. 5, no. 4, pp. 772–790, 2011.
- [32] S. Pu and A. Nedić, “A distributed stochastic gradient tracking method,” in 2018 IEEE Conference on Decision and Control (CDC), pp. 963–968, IEEE, 2018.
- [33] J. Zhang and K. You, “Decentralized stochastic gradient tracking for non-convex empirical risk minimization,” arXiv preprint arXiv:1909.02712, 2019.
- [34] R. Durrett, Probability: theory and examples. Cambridge University Press, 2010.
- [35] X. Lian, C. Zhang, H. Zhang, C.-J. Hsieh, W. Zhang, and J. Liu, “Can decentralized algorithms outperform centralized algorithms? a case study for decentralized parallel stochastic gradient descent,” arXiv preprint arXiv:1705.09056, 2017.
- [36] A. Nedić, A. Olshevsky, A. Ozdaglar, and J. N. Tsitsiklis, “On distributed averaging algorithms and quantization effects,” IEEE Trans. Automat. Contr., vol. 54, no. 11, pp. 2506–2517, 2009.
- [37] A. Nedić and A. Olshevsky, “Distributed optimization of strongly convex functions on directed time-varying graphs,” in IEEE Glob. Conf. Signal Inf. Process. Proc., pp. 329–332, 2013.
- [38] B. Touri and A. Nedić, “Product of random stochastic matrices,” IEEE Trans. Automat. Contr., vol. 59, no. 2, pp. 437–448, 2013.
- [39] B. Touri and C. Langbort, “On endogenous random consensus and averaging dynamics,” IEEE Transactions on Control of Network Systems, vol. 1, no. 3, pp. 241–248, 2014.
- [40] A. Aghajan and B. Touri, “Distributed optimization over dependent random networks,” arXiv preprint arXiv:2010.01956, 2020.
- [41] S. U. Stich, J.-B. Cordonnier, and M. Jaggi, “Sparsified SGD with memory,” arXiv preprint arXiv:1809.07599, 2018.
- [42] D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic, “QSGD: Communication-efficient SGD via gradient quantization and encoding,” in Advances in Neural Information Processing Systems, pp. 1709–1720, 2017.
- [43] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
- [44] S. Boyd, A. Ghosh, B. Prabhakar, and D. Shah, “Randomized gossip algorithms,” IEEE transactions on information theory, vol. 52, no. 6, pp. 2508–2530, 2006.
- [45] A. G. Dimakis, S. Kar, J. M. Moura, M. G. Rabbat, and A. Scaglione, “Gossip algorithms for distributed signal processing,” Proceedings of the IEEE, vol. 98, no. 11, pp. 1847–1864, 2010.
- [46] S. S. Ram, A. Nedić, and V. V. Veeravalli, “Asynchronous gossip algorithm for stochastic optimization: Constant stepsize analysis,” in Recent Advances in Optimization and its Applications in Engineering, pp. 51–60, Springer, 2010.
- [47] S. Lee and A. Nedić, “Asynchronous gossip-based random projection algorithms over networks,” IEEE Trans. Automat. Contr., vol. 61, no. 4, pp. 953–968, 2015.
- [48] P. M. DeMarzo, D. Vayanos, and J. Zwiebel, “Persuasion bias, social influence, and unidimensional opinions,” The Quarterly journal of economics, vol. 118, no. 3, pp. 909–968, 2003.
- [49] S. Bubeck, “Convex optimization: Algorithms and complexity,” Foundations and Trends in Machine Learning, vol. 8, no. 3-4, pp. 231–357, 2015.
- [50] G. B. Folland, Real analysis: modern techniques and their applications, vol. 40. John Wiley & Sons, 1999.
- [51] B. Touri and A. Nedić, “On existence of a quadratic comparison function for random weighted averaging dynamics and its implications,” in 2011 50th IEEE Conference on Decision and Control and European Control Conference, pp. 3806–3811, IEEE, 2011.
Appendix A Proof of Auxiliary Lemmas
In this section, we provide the proofs of auxiliary lemmas.
Proof of Lemma 1: For two vectors and (with identical dimension) and any scalar , we have
where follows from the Cauchy–Schwarz inequality and we used the geometric-arithmetic inequality in .
Now, recall that for a matrix we have . Therefore, for matrices and we have
This completes the proof of the lemma.
Proof of Lemma 2: To prove (25), let . Then, we have
Since the last summation is telescoping, we have
Dividing both sides by , we arrive at (25). To show (26), we utilize the same idea: letting , we can write
Dividing both sides by results in (26).
Proof of Lemma 3: In order to prove the lemma, we separately analyze the cases , , , and . Note that for , the function is a decreasing function, and thus we have . In the following we upper bound the latter integral for each regime of . When we have
(160)
which does not grow with . For , we get
(161)
When we arrive at
(162)
where the last inequality holds since .
Finally, for , the function is an increasing function. Hence, we can write
(163)
Note that the last inequality follows from which holds for any and .
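For reference, the four regimes analyzed above instantiate the standard integral-comparison bounds for sums of this type; writing the generic exponent as p (our notation; the exact constants in the lemma's statement may differ, but the growth orders match), they read:

```latex
% Integral comparison for \sum_{t=1}^{T} t^{-p}, T >= 1:
\sum_{t=1}^{T} t^{-p} \;\le\;
\begin{cases}
  \dfrac{p}{p-1}, & p > 1, \\[4pt]
  1 + \ln T, & p = 1, \\[4pt]
  \dfrac{T^{1-p}}{1-p}, & 0 < p < 1, \\[4pt]
  \dfrac{(T+1)^{1-p}}{1-p}, & p \le 0 .
\end{cases}
```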
Proof of Lemma 4: Recall that the th entry of the matrix product is the inner product between the th row of and the th column of . Thus, using the Cauchy-Schwarz inequality, we have . Therefore,
Using this inequality and the definition of -norm, we get
Proof of Lemma 5: Due to the separable nature of , i.e., , without loss of generality, we may assume that . Thus, let . Define by
(164)
For notational simplicity, let , and . Also with a slight abuse of notation, we denote by for .
Using Theorem 1 in [51], we have
(165)
where . Note that is a non-negative matrix, and hence we have , for . Also, note that since , then Assumption 2-(b) implies that the minimum non-zero elements of are bounded bellow by . Therefore, since is non-increasing, on the window , the minimum non-zero elements of for in this window are lower bounded by . Without loss of generality assume that the entries of are sorted, i.e., , otherwise, we can relabel the agents (rows and columns of s and to achieve this). Therefore, by Lemma 8 in [36], for (165), we have
(166)
We may comment here that although Lemma 8 in [36] is stated for doubly stochastic matrices, and its statement concerns the decrease of for the special case of , it is in fact a result on bounding for a sequence of -connected stochastic matrices in terms of the minimum non-zero entries of the stochastic matrices .
Next, we will show that . This adapts an argument used in the proof of Theorem 18 in [36] to the general .
For a with , define the quotient
(167)
Note that is invariant under scaling and translations by the all-one vector, i.e., for all non-zero and for all . Therefore,
(168)
(169)
Since is a stochastic vector, for a vector with and we have . On the other hand, the fact that implies . Let us consider the difference sequence for , for which we have . Therefore, the optimization problem (168) can be rewritten as
(170)
(171)
Using the Cauchy-Schwarz inequality, we get . Hence,
(172)
This implies that for , we have (note that this inequality also holds for with ). Using this fact in (A) implies
(173)
Applying (173) for steps recursively, we get
Using the fact that and since is a non-increasing sequence, we have
Recall from (165) that is non-increasing. Therefore, for we have
(174)
Next, since is a non-increasing sequence, we have . Thus,
(175)
Therefore, combining (174) and (175), we get
(176)
Now, we define
for with . Note that Assumption 2-(a) and the fact that imply . Then, setting and , we can write
Therefore, using (176) we have
where the second inequality follows from the fact that . This completes the proof of the lemma.