
DIMIX: DIminishing MIXing for Sloppy Agents

Hadi Reisizadeh, Behrouz Touri, and Soheil Mohajer

H. Reisizadeh (email: [email protected]) and S. Mohajer (email: [email protected]) are with the University of Minnesota, and B. Touri (email: [email protected]) is with the University of California San Diego.
Abstract

We study non-convex distributed optimization problems where a set of agents collaboratively solve a separable optimization problem that is distributed over a time-varying network. The existing methods to solve these problems rely on (at most) one time-scale algorithms, where each agent performs a diminishing or constant step-size gradient descent at the average estimate of the agents in the network. However, if possible at all, exchanging the exact information required to evaluate these average estimates potentially introduces a massive communication overhead. Therefore, a reasonable practical assumption is that agents only receive a rough approximation of the neighboring agents' information. To address this, we introduce and study a two time-scale decentralized algorithm with a broad class of lossy information sharing methods (including noisy, quantized, and/or compressed information sharing) over time-varying networks. In our method, one time-scale suppresses the (imperfect) incoming information from the neighboring agents, and one time-scale operates on the local cost functions' gradients. We show that with a proper choice of the step-size parameters, the algorithm achieves a convergence rate of $\mathcal{O}(T^{-1/3+\epsilon})$ for non-convex distributed optimization problems over time-varying networks, for any $\epsilon>0$. Our simulation results support the theoretical results of the paper.

1 Introduction

Distributed learning serves as a learning framework where a set of computing nodes/agents are interested in collaboratively solving an optimization problem. In this paradigm, in the absence of a central node, the learning task solely depends on on-device computation and local communication among the neighboring agents. With the appearance of modern computation architectures and the decentralized nature of storage, large-scale distributed computation frameworks have received significant attention due to data locality, privacy, data ownership, and scalability to larger datasets and systems. These features of distributed learning have led to applications in several domains including distributed deep networks [1, 2, 3], distributed sensor networks [4, 5, 6], and network resource allocation [7, 8]. In the context of Machine Learning, a prototypical application of distributed learning and optimization is the empirical risk minimization (ERM) problem. In ERM problems, each server in a network of data centers has access to a number of independent data points (realizations) sampled from an underlying distribution, and the objective is to minimize an additive and separable cost function (of the data points) over the underlying network [9, 10, 11].

Related Works. Decentralized consensus- or averaging-based optimization algorithms have been studied extensively over the past few years [12, 13, 14, 15, 16, 17]. It has been shown that when a fixed step-size is utilized, the loss function decreases at the rate of $\mathcal{O}(1/T)$ until the estimates reach a neighborhood of a (local) minimum of the objective cost function [12]. However, with a fixed step-size, the local estimates may not converge to an optimal point [16]. To remedy this, the diminishing step-size variation of the algorithm has been introduced and studied for convex [18, 17, 19] and non-convex problems [20, 21, 22].

A majority of the existing works in this area suffer from a large communication overhead, as a (potentially) massive amount of local information needs to be exchanged among the agents throughout the learning algorithm. In addition, such a communication load can lead to a major delay in the convergence of the algorithm and become a severe bottleneck when the dimension of the model is large. To mitigate the communication overhead, several compression techniques have been introduced to reduce the size of the exchanged information. In particular, this is used in the context of the distributed averaging/consensus problem [23, 12, 24], where each node has an initial numerical value and aims at evaluating the average of all initial values by exchanging quantized information over a fixed or a time-varying network. In the context of distributed optimization, various compression approaches have been introduced to mitigate the communication overhead [25, 26, 27, 28, 29].

An inexact proximal gradient method with a fixed step-size over a fixed network for strongly convex loss functions is proposed in [25]. In this algorithm, each node applies an adaptive deterministic quantizer to its model and gradient before sending them to its neighbors. It is shown that under certain conditions, the algorithm provides a linear convergence rate. In another related work [26], the authors proposed a decentralized gradient descent method, named 'QuanTimed-DSGD', with fixed step-sizes. In this algorithm, agents exchange a quantized version of their models/estimates in order to reduce the communication load. It is shown that for strongly convex loss functions and a given termination time $T$, constant step-size parameters for the algorithm can be chosen so that the average distance between the model of the agents and the global optimum is bounded by $cT^{-1/2+\epsilon}$ for some $c>0$ and any $\epsilon>0$. In this algorithm, not only do the step-sizes depend on $T$, but the results also only hold at the exact termination iteration $T$, which further needs to satisfy $T\geq T_{\min}$, where $T_{\min}$ itself depends on $\epsilon$ as well as on non-local parameters of the fixed underlying network. For smooth non-convex cost functions, it is shown that the temporal average (up to time $T$) of the expected norm of the gradients at the average models does not exceed $cT^{-1/3}$. However, similar to the strongly convex case, the choice of parameters (i.e., the step-sizes) for QuanTimed-DSGD depends on the termination time $T$, and needs to be re-evaluated when the target termination time varies. To compensate for the quantization error, a decentralized (diminishing step-size) gradient descent algorithm using error-feedback is proposed in [27, 30]. The proposed algorithm achieves convergence rates of $\mathcal{O}(T^{-1})$ and $\mathcal{O}(T^{-1/2})$ for strongly convex and non-convex objective functions, respectively. However, the nature of the algorithm restricts its use to time-invariant networks, and in addition, the feedback mechanism cannot compensate for communication noise between the nodes. In [31], a two time-scale gradient descent algorithm was presented for distributed constrained and convex optimization problems over an i.i.d. communication graph with noisy communication links and sub-gradient errors. It is shown that if the random communication satisfies certain conditions, proper choices of the time-scale parameters result in the almost sure convergence of the local states to an optimal point. Another interesting approach to achieve exact convergence for distributed optimization with fixed gradient step-sizes under a noiseless communication model is to use gradient tracking methods [32, 33].

Our Contributions. We introduce and study a two time-scale decentralized gradient descent algorithm for a broad class of imperfect information sharing models over time-varying communication networks, for distributed optimization problems with smooth non-convex local cost functions. In this work, one time-scale addresses the convergence of the local estimates to a stationary point, while the newly introduced second time-scale suppresses the noisy effect of the imperfect incoming information from the neighbors.

Our main result shows that with a proper choice of the parameters for the two diminishing step-size sequences, the temporal average of the expected norm of the gradients decreases at the rate of $\mathcal{O}(T^{-1/3+\epsilon})$. To prove this result, we establish new techniques to analyze the interplay between the two time-scales, in particular, in the presence of underlying time-varying networks. To validate our theoretical results, we numerically evaluate the algorithm for optimizing a regularized logistic regression on the benchmark MNIST dataset. To assess the efficiency of our algorithm over time-varying networks, we also implement a linear regression model (convex) over synthetic data and a CNN (non-convex) over CIFAR-10. Our simulation results also confirm that introducing a time-shift in our step-sizes improves the finite-time performance of our algorithm, leading to a faster decay of the loss function.

Paper Organization. After introducing some notation, we present the problem formulation, the algorithm, the main result, and some of its practical implications in Section 2. Our theoretical results are corroborated by simulation results in Section 3. To prove the main result, we first provide some auxiliary lemmas in Section 4, whose proofs are presented in Appendix A. In Section 5, we present the proof of the main result. Finally, we conclude the paper in Section 7.

Notation and Basic Terminology. Throughout this paper, we use $[n]$ to denote the set of integers $\{1,2,\dots,n\}$. We are dealing with $n$ agents that minimize a function on $\mathbb{R}^{d}$. For notational convenience, we assume that the underlying functions act on row vectors, and hence we view vectors in $\mathbb{R}^{1\times d}=\mathbb{R}^{d}$ as row vectors. The remaining vectors, i.e., vectors in $\mathbb{R}^{n\times 1}=\mathbb{R}^{n}$, are assumed to be column vectors; these are the two types of vectors we deal with throughout the paper. The $\ell_{2}$-norm of a vector $\mathbf{x}\in\mathbb{R}^{d}$ is defined by $\|\mathbf{x}\|^{2}=\sum_{j=1}^{d}|x_{j}|^{2}$. The Frobenius norm of a matrix $A\in\mathbb{R}^{n\times d}$ is defined by $\|A\|_{F}^{2}=\sum_{i=1}^{n}\|A_{i}\|^{2}=\sum_{i=1}^{n}\sum_{j=1}^{d}|A_{ij}|^{2}$. A vector $\mathbf{r}\in\mathbb{R}^{n}$ is called stochastic if $r_{i}\geq 0$ and $\sum_{i=1}^{n}r_{i}=1$. Similarly, a non-negative matrix $A\in\mathbb{R}^{n\times d}$ is called (row) stochastic if $\sum_{j=1}^{d}A_{ij}=1$ for every $i\in[n]$. For a matrix $A\in\mathbb{R}^{n\times d}$, we denote its $i$-th row and $j$-th column by $A_{i}$ and $A^{j}$, respectively. For an $n\times d$ matrix $A$ and a strictly positive stochastic vector $\mathbf{r}\in\mathbb{R}^{n}$, we define the $\mathbf{r}$-norm of $A$ by $\|A\|_{\mathbf{r}}^{2}=\sum_{i=1}^{n}r_{i}\|A_{i}\|^{2}$. It can be verified that $\|\cdot\|_{\mathbf{r}}$ is a norm on the space of $n\times d$ matrices. Finally, we write $A\geq B$ if $A-B$ (is well-defined and) has non-negative entries.
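For concreteness, the $\mathbf{r}$-norm can be computed as in the short Python sketch below; this is only an illustration with placeholder inputs, not part of the paper's implementation.

```python
import numpy as np

def r_norm(A, r):
    """Weighted norm ||A||_r = sqrt(sum_i r_i ||A_i||^2), with A_i the i-th row of A."""
    return np.sqrt(np.sum(r * np.sum(A**2, axis=1)))

# sanity check: with r uniform, ||A||_r^2 = ||A||_F^2 / n
A = np.arange(6.0).reshape(3, 2)
r = np.ones(3) / 3
assert np.isclose(r_norm(A, r)**2, np.linalg.norm(A, 'fro')**2 / 3)
```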

2 Problem Setup and Main Result

In this section, first we formulate non-convex distributed optimization problems over time-varying networks and introduce some standard assumptions on the underlying problem. After proposing our algorithm, we state our main result. Finally, we discuss the implications of our result on various important practical settings with imperfect information sharing.

2.1 Problem Setup

This paper is motivated by stochastic learning problems in which the goal is to solve

$$\min_{\mathbf{x}}L(\mathbf{x}):=\min_{\mathbf{x}}\mathbb{E}_{\xi\sim\mathcal{P}}\left[\ell(\mathbf{x},\xi)\right],\qquad(1)$$

where $\ell:\mathbb{R}^{d}\times\mathbb{R}^{p}\rightarrow\mathbb{R}$ is a loss function, $\mathbf{x}\in\mathbb{R}^{1\times d}=\mathbb{R}^{d}$ is the decision/optimization row vector, and $\xi$ is a random vector taking values in $\mathbb{R}^{p}$ that is drawn from an unknown underlying distribution $\mathcal{P}$. One of the key practical considerations that renders (1) challenging is that the underlying distribution $\mathcal{P}$ is often unknown. Instead, we have access to $N$ independent realizations of $\xi$ and focus on solving the corresponding ERM problem, given by

$$\min_{\mathbf{x}}f(\mathbf{x}):=\min_{\mathbf{x}}\frac{1}{N}\sum_{j=1}^{N}\ell(\mathbf{x},\xi_{j}),\qquad(2)$$

where $f(\mathbf{x})$ is the empirical risk with respect to the data points $\mathcal{D}=\{\xi_{1},\ldots,\xi_{N}\}$. We assume that $\ell(\cdot,\cdot)$ is a non-convex loss function, which potentially results in a non-convex function $f(\cdot)$.

In distributed optimization, we have a network consisting of $n$ computing nodes (agents, or workers), where each node $i$ observes a non-overlapping subset of $m_{i}=r_{i}N$ data points, denoted by $\mathcal{D}_{i}=\{\xi^{i}_{1},\ldots,\xi^{i}_{m_{i}}\}$, where $\mathcal{D}=\mathcal{D}_{1}\cup\cdots\cup\mathcal{D}_{n}$. Here, $r_{i}$ represents the fraction of the data that is processed at node $i\in[n]$. Note that the vector $\mathbf{r}=(r_{1},\dots,r_{n})$ is a strictly positive stochastic vector, i.e., $r_{i}>0$ and $\sum_{i=1}^{n}r_{i}=1$. Thus, the ERM problem in (2) can be written as the minimization of the weighted average of the local empirical risk functions $f_{i}$ for all nodes $i\in[n]$ in the network, i.e.,

$$\min_{\mathbf{x}}f(\mathbf{x})=\min_{\mathbf{x}}\sum_{i=1}^{n}r_{i}f_{i}(\mathbf{x})=\min_{\mathbf{x}}\frac{1}{N}\sum_{i=1}^{n}\sum_{\xi\in\mathcal{D}_{i}}\ell(\mathbf{x},\xi),\qquad(3)$$

where $f_{i}(\mathbf{x}):=\frac{1}{m_{i}}\sum_{\xi\in\mathcal{D}_{i}}\ell(\mathbf{x},\xi)=\frac{1}{m_{i}}\sum_{j=1}^{m_{i}}\ell(\mathbf{x},\xi_{j}^{i})$. We can rewrite the ERM problem in (3) as a distributed consensus optimization problem, given by

$$\min_{\mathbf{x}_{1},\ldots,\mathbf{x}_{n}}\sum_{i=1}^{n}r_{i}f_{i}(\mathbf{x}_{i})\quad\textrm{subject to}\quad\mathbf{x}_{1}=\mathbf{x}_{2}=\cdots=\mathbf{x}_{n}.\qquad(4)$$
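To make the decomposition in (3)–(4) concrete, the following sketch splits a toy dataset across agents according to a stochastic vector $\mathbf{r}$ and checks that the weighted average of the local risks equals the global empirical risk; the quadratic loss and the data are placeholders of our own choosing.

```python
import numpy as np

rng = np.random.default_rng(0)
n, N, d = 4, 100, 5
xi = rng.normal(size=(N, d))                          # data points
loss = lambda x, xi: 0.5 * np.sum((x - xi) ** 2, axis=-1)   # placeholder loss l(x, xi)

# split the data: agent i holds m_i = r_i * N points
sizes = np.array([10, 20, 30, 40])
r = sizes / N
chunks = np.split(xi, np.cumsum(sizes)[:-1])

x = rng.normal(size=d)
f_local = np.array([loss(x, c).mean() for c in chunks])     # f_i(x)
f_global = loss(x, xi).mean()                               # f(x)
assert np.isclose(np.dot(r, f_local), f_global)             # f = sum_i r_i f_i
```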

Consider $n\geq 2$ agents that are connected through a time-varying network. We represent this network at time $t\geq 1$ by the directed graph $\mathcal{G}(t)=([n],\mathcal{E}(t))$, where the vertex set $[n]$ represents the set of agents and the edge set $\mathcal{E}(t)\subseteq\{(i,j):i,j\in[n]\}$ represents the set of links at time $t$. At each discrete time $t\geq 1$, agent $i$ can only send information to its (out-)neighbors in $\mathcal{E}(t)$, i.e., to all $j\in[n]$ with $(i,j)\in\mathcal{E}(t)$.

To discuss our algorithm (DIMIX) for solving (4) distributively, let us first discuss its general structure and the information required at each node for its execution. In this algorithm, at each iteration $t\geq 1$, agent $i\in[n]$ updates its estimate $\mathbf{x}_{i}(t)\in\mathbb{R}^{d}$ of an optimizer of (3). To this end, it utilizes the gradient of its own local cost function $f_{i}(\mathbf{x})$ as well as a noisy/lossy average of its current neighbors' estimates, denoted by $\hat{\mathbf{x}}_{i}(t):=\sum_{j=1}^{n}W_{ij}(t)\mathbf{x}_{j}(t)+\mathbf{e}_{i}(t)$. Here, $W(t)$ is a row-stochastic matrix that is consistent with the underlying connectivity network $\mathcal{G}(t)$ (i.e., $W_{ij}(t)>0$ only if $(j,i)\in\mathcal{E}(t)$) and $\mathbf{e}_{i}(t)\in\mathbb{R}^{d}$ is a random noise vector. Later, in Section 2.4, we discuss several noisy and lossy information sharing architectures (quantization and noisy communication) that fit this broad information structure. Now we are ready to discuss the DIMIX algorithm. Using the information available to it at time $t$, agent $i$ updates its current estimate by computing a diminishing weighted average of its own state and the noisy average of its neighbors' estimates, and moves along its local gradient. More formally, the update rule at node $i\in[n]$ is given by

$$\mathbf{x}_{i}(t+1)=(1-\beta(t))\mathbf{x}_{i}(t)+\beta(t)\hat{\mathbf{x}}_{i}(t)-\alpha(t)\beta(t)\nabla f_{i}(\mathbf{x}_{i}(t)),\qquad(5)$$

where $\alpha(t)=\frac{\alpha_{0}}{(t+\tau)^{\nu}}$ and $\beta(t)=\frac{\beta_{0}}{(t+\tau)^{\mu}}$, for some $\mu,\nu\in(0,1)$, are the diminishing step-sizes of the algorithm, and $\tau\geq 0$ is an arbitrary shift introduced to accelerate the finite-time performance of the algorithm. The description of DIMIX is summarized in Algorithm 1. For notational simplicity, let

$$X(t):=\begin{bmatrix}\mathbf{x}_{1}(t)\\ \vdots\\ \mathbf{x}_{n}(t)\end{bmatrix},\quad E(t):=\begin{bmatrix}\mathbf{e}_{1}(t)\\ \vdots\\ \mathbf{e}_{n}(t)\end{bmatrix},\quad\nabla f(X(t)):=\begin{bmatrix}\nabla f_{1}(\mathbf{x}_{1}(t))\\ \vdots\\ \nabla f_{n}(\mathbf{x}_{n}(t))\end{bmatrix}.\qquad(6)$$

Since $\hat{\mathbf{x}}_{i}(t)=\sum_{j=1}^{n}W_{ij}(t)\mathbf{x}_{j}(t)+\mathbf{e}_{i}(t)$, we can rewrite the update rule in (5) in the matrix form

$$X(t+1)=((1-\beta(t))I+\beta(t)W(t))X(t)+\beta(t)E(t)-\alpha(t)\beta(t)\nabla f(X(t)).\qquad(7)$$

Let us discuss some important aspects of the above update rule. Note that both $\alpha(t)$ and $\beta(t)$ are diminishing step-sizes. If $\beta(t)=\beta_{0}<1$ and $\alpha(t)=\alpha_{0}<1$ are both constant, then the dynamics in (7) reduces to the algorithm proposed in [26] for both convex and non-convex cost functions. Alternatively, if $\beta(t)=\beta_{0}<1$ is a constant sequence and $E(t)=\mathbf{0}$ for all $t\geq 1$, then (7) reduces to averaging-based distributed optimization with diminishing step-sizes (for the gradients), which was introduced and studied in [18] for convex local cost functions $f_{i}(\mathbf{x})$. The newly introduced time-scale/step-size $\beta(t)$ suppresses the incoming noise $\mathbf{e}_{i}(t)$ from the neighboring agents. However, $\beta(t)$ also suppresses the incoming signal level $\sum_{j=1}^{n}W_{ij}(t)\mathbf{x}_{j}(t)$ at each node $i$. This poses a major technical challenge for establishing convergence-to-consensus guarantees for the algorithm over time-varying networks.

On the other hand, the diminishing step-size for the gradient update is $\hat{\alpha}(t)=\alpha(t)\beta(t)$. We chose to represent our algorithm in this way to ensure that the local mixing (consensus) scheme operates on a faster time-scale than the gradient descent.

Algorithm 1 DIMIX at agent $i$
1: Stochastic matrix sequence $\{W(t)\}$, number of iterations $T$
2: Set $\mathbf{x}_{i}(1)=0$.
3: for $t=1,\ldots,T-1$ do
4:     Compute the local gradient $\nabla f_{i}(\mathbf{x}_{i}(t))$.
5:     Obtain the noisy average of the neighbors' estimates, $\hat{\mathbf{x}}_{i}(t)$.
6:     Set $\mathbf{x}_{i}(t+1)=(1-\beta(t))\mathbf{x}_{i}(t)+\beta(t)\hat{\mathbf{x}}_{i}(t)-\alpha(t)\beta(t)\nabla f_{i}(\mathbf{x}_{i}(t))$.
7: end for
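For illustration, a minimal simulation of Algorithm 1, written directly in the matrix form (7), is sketched below; the quadratic local costs, the ring weight matrix, and the additive noise model are our own placeholders and are not the setup used in Section 3.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, T = 10, 3, 5000
alpha0, beta0, nu, mu, tau = 0.5, 0.5, 1/6, 1/2, 0
targets = rng.normal(size=(n, d))             # local costs f_i(x) = 0.5 ||x - a_i||^2
grad = lambda X: X - targets                  # row i of grad f(X): gradient of f_i at x_i

# doubly stochastic ring weights (so r is uniform in this toy example)
W = np.zeros((n, n))
for i in range(n):
    W[i, i], W[i, (i - 1) % n], W[i, (i + 1) % n] = 0.5, 0.25, 0.25

X = np.zeros((n, d))
for t in range(1, T):
    alpha = alpha0 / (t + tau) ** nu
    beta = beta0 / (t + tau) ** mu
    X_hat = W @ X + 0.1 * rng.normal(size=(n, d))   # lossy neighbor average (Assumption 1)
    X = (1 - beta) * X + beta * X_hat - alpha * beta * grad(X)

print("consensus error:", np.linalg.norm(X - X.mean(axis=0)))
print("distance to minimizer:", np.linalg.norm(X.mean(axis=0) - targets.mean(axis=0)))
```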

2.2 Assumptions

In order to provide performance guarantees for DIMIX, we need to assume certain regularity conditions on (i) the statistics of the (neighbors' averaging) noise process $\{E(t)\}$, (ii) the mixing properties of the weight sequence $\{W(t)\}$, and (iii) the loss function $\ell(\cdot,\cdot)$.

First, we discuss our main assumption on the noise sequence $\{E(t)\}$.

Assumption 1 (Neighbors State Estimate Assumption)

Suppose that $\{X(t)\}$ is adapted to a filtration $\{\mathcal{F}_{t}\}$ on the underlying probability space (see, e.g., Section 5.2 in [34]). We assume that there exists some $\gamma>0$ such that for all $i\in[n]$ and all $t\geq 1$, the noise sequence $\{\mathbf{e}_{i}(t)\}$ satisfies

$$\mathbb{E}\left[\mathbf{e}_{i}(t)\mid\mathcal{F}_{t}\right]=0,\quad\text{and}\qquad(8)$$
$$\mathbb{E}\left[\|\mathbf{e}_{i}(t)\|^{2}\mid\mathcal{F}_{t}\right]\leq\gamma.\qquad(9)$$

Note that the natural filtration of the random process $\{X(t)\}$ is one choice for $\{\mathcal{F}_{t}\}$. In this case, (8) reduces to $\mathbb{E}\left[\mathbf{e}_{i}(t)\mid X(1),\ldots,X(t)\right]=0$ and $\mathbb{E}\left[\|\mathbf{e}_{i}(t)\|^{2}\mid X(1),\ldots,X(t)\right]\leq\gamma$.

Next, we discuss the main assumption on the network connectivity, or more precisely, the mixing of information over the time-varying network.

Assumption 2 (Connectivity Assumption)

We assume that the weight matrix sequence $\{W(t)\}$ in (7) satisfies the following properties:

(a) Stochastic with Common Stationary Distribution: $W(t)$ is non-negative, and $W(t)\mathbf{1}=\mathbf{1}$ and $\mathbf{r}^{T}W(t)=\mathbf{r}^{T}$ for all $t\geq 1$, where $\mathbf{1}\in\mathbb{R}^{n}$ is the all-one vector and $\mathbf{r}>0$ is the weight vector.

(b) Bounded Nonzero Elements: There exists some $\eta>0$ such that if for some $i,j\in[n]$ and $t\geq 1$ we have $W_{ij}(t)>0$, then $W_{ij}(t)\geq\eta$.

(c) $B$-Connected: For a fixed integer $B\geq 1$, the graph $\left([n],\bigcup_{k=t+1}^{t+B}\mathcal{E}(k)\right)$ is strongly connected for all $t\geq 1$, where $\mathcal{E}(k)=\{(j,i)\mid W_{ij}(k)>0\}$.

Finally for the loss function (,)\ell(\cdot,\cdot), we make the following assumption.

Assumption 3 (Stochastic Loss Function Assumption)

We assume that:

(a) The function $\ell(\cdot,\cdot)$ is $K$-smooth with respect to its first argument, i.e., for any $\mathbf{x},\mathbf{y}\in\mathbb{R}^{d}$ and $\xi\in\mathcal{D}$ we have $\|\nabla\ell(\mathbf{x},\xi)-\nabla\ell(\mathbf{y},\xi)\|\leq K\|\mathbf{x}-\mathbf{y}\|$.

(b) The stochastic gradient $\nabla\ell(\mathbf{x},\xi)$ is unbiased and has bounded variance, i.e.,

$$\mathbb{E}_{\xi}\left[\nabla\ell(\mathbf{x},\xi)\right]=\nabla L(\mathbf{x}),\quad\mathbb{E}_{\xi}\left[\|\nabla\ell(\mathbf{x},\xi)-\nabla L(\mathbf{x})\|^{2}\right]\leq\sigma^{2}.$$

Note that Assumption 3-b implies a homogeneous sampling, i.e., each agent draws i.i.d. sample points from a data batch. In a related work [35], a stronger assumption has been considered which allows for heterogeneous data samples.

Remark 1

Recall that the local function $f_{i}$ is defined as $f_{i}(\mathbf{x})=\frac{1}{m_{i}}\sum_{j=1}^{m_{i}}\ell(\mathbf{x},\xi_{j}^{i})$. Then the conditions of Assumption 3 on the function $\ell(\cdot,\cdot)$ translate directly into conditions on $f_{i}(\mathbf{x})$, given by

$$\|\nabla f_{i}(\mathbf{x})-\nabla f_{i}(\mathbf{y})\|\leq K\|\mathbf{x}-\mathbf{y}\|,\quad\mathbb{E}_{\xi}\left[\nabla f_{i}(\mathbf{x})\right]=\nabla L(\mathbf{x}),\quad\mathbb{E}_{\xi}\left[\|\nabla f_{i}(\mathbf{x})-\nabla L(\mathbf{x})\|^{2}\right]\leq\sigma^{2}/m_{i}.$$

2.3 Main Result

Here, we characterize the convergence rate of our algorithm for $K$-smooth non-convex loss functions. More precisely, we establish a rate for the temporal average of the expected norm of the gradients for various choices of the time-scale parameters $\nu$ and $\mu$.

Theorem 1

Suppose that Assumptions 1–3 hold and let $\alpha(t)=\frac{\alpha_{0}}{(t+\tau)^{\nu}}$ and $\beta(t)=\frac{\beta_{0}}{(t+\tau)^{\mu}}$, where $\alpha_{0},\beta_{0}\in(0,1)$, $\tau\geq 0$, and $\nu,\mu\in(0,1)$ are arbitrary constants with $\mu\neq 1/2$ and $3\nu+\mu\neq 1$. Then the weighted average estimates $\bar{\mathbf{x}}(t):=\sum_{i=1}^{n}r_{i}\mathbf{x}_{i}(t)$ generated by Algorithm 1 satisfy

$$M_{\theta}(\nu,\mu):=\left[\frac{1}{T}\sum_{t=1}^{T}\left(\mathbb{E}\left[\left\|\nabla f(\bar{\mathbf{x}}(t))\right\|^{2}\right]\right)^{\theta}\right]^{1/\theta}=\mathcal{O}\left(T^{-\min\{1-\nu-\mu,\,\mu-\nu,\,2\nu\}}\right),\qquad(10)$$

where $\theta\in(0,1)$ is an arbitrary constant.

Furthermore, for $(\nu^{\star},\mu^{\star})=\left(\frac{1}{6},\frac{1}{2}\right)$ we get the optimal rate of

$$M_{\theta}(\nu^{\star},\mu^{\star})=\mathcal{O}\left(T^{-1/3}\ln T\right).\qquad(11)$$
Remark 2

Note that the expectation operator $\mathbb{E}\left[\cdot\right]$ is over the randomness of the dataset $\mathcal{D}$ and the compression/communication noise. Moreover, note that the theorem above shows that the gradient of $f(\cdot)$ (which depends on the choice of $\mathcal{D}$) at the average state $\bar{\mathbf{x}}(t)$ (which also depends on $\mathcal{D}$) vanishes at a certain rate. It is worth mentioning that this is not the performance of the average trajectory for the average function.

Remark 3

From (10), one has to maximize $\min\{1-\nu-\mu,\mu-\nu,2\nu\}$ over $\nu,\mu\in(0,1)$ to achieve the fastest convergence rate for $M_{\theta}$. This leads to $(\nu^{\star},\mu^{\star})=(1/6,1/2)$, for which neither of the conditions $\mu\neq 1/2$ and $3\nu+\mu\neq 1$ holds. However, one can choose $(\nu,\mu)=(1/6+\epsilon/2,1/2+\epsilon/2)$ and obtain $M_{\theta}=\mathcal{O}(T^{-1/3+\epsilon})$ for any $\epsilon>0$. Nevertheless, note that (11) provides the faster convergence rate of $\mathcal{O}(T^{-1/3}\ln T)$ for $(\nu^{\star},\mu^{\star})=(1/6,1/2)$.
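As an illustration of this choice, the following sketch sets up the near-optimal schedule of Remark 3; the constants $\alpha_{0}$, $\beta_{0}$, $\epsilon$, and $\tau$ are arbitrary placeholders, not tuned values from the paper.

```python
def step_sizes(t, alpha0=0.05, beta0=0.6, eps=0.02, tau=0):
    """alpha(t), beta(t) with (nu, mu) = (1/6 + eps/2, 1/2 + eps/2), as in Remark 3."""
    nu, mu = 1/6 + eps/2, 1/2 + eps/2
    return alpha0 / (t + tau) ** nu, beta0 / (t + tau) ** mu

# the gradient step alpha(t)*beta(t) decays like t^{-(2/3 + eps)},
# while the mixing step beta(t) decays like t^{-(1/2 + eps/2)}
for t in (1, 10, 100, 1000):
    a, b = step_sizes(t)
    print(t, round(a * b, 5), round(b, 5))
```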

Proposition 1

Under the conditions of Theorem 1, for the optimum choice of $(\nu^{\star},\mu^{\star})=(1/6,1/3)$, we have

$$M_{1}(\nu^{\star},\mu^{\star}):=\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\left[\left\|\nabla f(\bar{\mathbf{x}}(t))\right\|^{2}\right]\leq\mathcal{O}\left(T^{-1/3+\epsilon}\right),\qquad(12)$$

for any $\epsilon>0$. Furthermore, in this case, for each agent $i\in[n]$ the convergence rate to consensus is given by

$$\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\left[\|\mathbf{x}_{i}(t)-\bar{\mathbf{x}}(t)\|^{2}\right]\leq\mathcal{O}\left(T^{-1/3+\epsilon}\right).\qquad(13)$$

As a result, combining (12), (13), and Assumption 3, for all $i\in[n]$, we have

$$\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\left[\left\|\nabla f(\mathbf{x}_{i}(t))\right\|^{2}\right]\leq\mathcal{O}\left(T^{-1/3+\epsilon}\right).$$
Remark 4

We should comment that, almost universally, all the existing results and algorithms on distributed optimization over time-varying graphs assume a uniform positive lower bound on the non-zero elements of the (effective) weight matrices [36, 12, 37, 38, 39, 16, 21, 40]. The absence of such an assumption significantly increases the complexity of the convergence analysis of the algorithm. In our work, even though the stochastic matrix sequence $\{W(t)\}$ is assumed to be $B$-connected, the effective averaging weight sequence, given by $\{(1-\beta(t))I+\beta(t)W(t)\}$, has vanishing (non-self) weights. One of the major theoretical contributions of this work is to introduce tools and techniques to study distributed optimization with such diminishing weight sequences.

Remark 5

In a related work [26] on distributed optimization with compressed information sharing among the nodes, the authors considered a fixed step-size (zero time-scale) version of Algorithm 1 with a fixed averaging matrix $W$. It is shown that for a given termination time $T$, the algorithm's step-sizes can be chosen (depending on $T$) such that the temporal average (up to iteration $T$) of the expected norm of the gradient (i.e., $M_{1}$ defined in (12)) does not exceed $cT^{-1/3}$ (where $c>0$ is a constant). However, the step-sizes must be re-evaluated and the algorithm re-executed if one targets another termination time $T^{\prime}$. In this work, we use vanishing step-sizes $\alpha(t)$ and $\beta(t)$ (which do not depend on the termination time) and show that the same temporal average vanishes at the rate of $\mathcal{O}\left(T^{-1/3+\epsilon}\right)$ for every iteration $T$ and any arbitrarily small $\epsilon>0$.

2.4 Examples for Stochastic Noisy State Estimation

Figure 1: A general architecture for the lossy information model.

The noisy information model in (5) is very general, and captures several models of imperfect data sharing used in practice and/or in theoretical studies. A rather general architecture that leads to such noisy/lossy estimates is demonstrated in Figure 1: once the estimate $\mathbf{x}_{j}(t)$ of node $j$ at iteration $t$ is evaluated, node $j$ may apply an operation (such as compression/sparsification or quantization) to its own model to generate $\tilde{\mathbf{x}}_{j}(t)$. This data is sent over a potentially noisy communication channel, and node $i$ receives a corrupted version of $\tilde{\mathbf{x}}_{j}(t)$, say $\tilde{\mathbf{y}}_{i,j}(t)$, from every neighbor node $j$. Upon receiving such channel outputs from all its neighbors, node $i$ computes their weighted average $\mathbf{y}_{i}(t)=\sum_{j=1}^{n}W_{ij}(t)\tilde{\mathbf{y}}_{i,j}(t)$, which can be decoded into the approximate average model $\hat{\mathbf{x}}_{i}(t)$. In the following, we describe three popular frameworks in which each node $i$ can only use an imperfect neighbors' average $\hat{\mathbf{x}}_{i}(t)$ to update its estimate. It is worth emphasizing that these are just a few examples that fall under the general model in (5).

Example 1

(Unbiased Random Sparsification). Let $Z_{k}$ be the set of all subsets of $[d]$ of size $k$, for a parameter $k\in[d]$. We define the operator $\text{Rand}_{k}:\mathbb{R}^{d}\times Z_{k}\rightarrow\mathbb{R}^{d}$ as

$$[\text{Rand}_{k}(\mathbf{x},\zeta)]_{\ell}:=\begin{cases}\frac{d}{k}\,x_{\ell}&\ell\in\zeta\\ 0&\textrm{otherwise}.\end{cases}\qquad(16)$$

With $\zeta_{j}$ drawn uniformly at random from $Z_{k}$, independently of the input vector $\mathbf{x}$, the iteration, and the other agents, $\text{Rand}_{k}(\cdot,\cdot)$ is an unbiased version of the random sparsification introduced in [41]. It can be verified that, when $\|\mathbf{x}\|^{2}\leq D$, we have

$$\mathbb{E}_{\zeta}\left[\text{Rand}_{k}(\mathbf{x},\zeta)\right]=\mathbf{x},\quad\mathbb{E}_{\zeta}\left[\|\text{Rand}_{k}(\mathbf{x},\zeta)-\mathbf{x}\|^{2}\right]\leq\left(\frac{d}{k}-1\right)\|\mathbf{x}\|^{2}\leq\left(\frac{d}{k}-1\right)D.$$

Then, we get

$$\hat{\mathbf{x}}_{i}(t)=\sum_{j=1}^{n}W_{ij}(t)\text{Rand}_{k}(\mathbf{x}_{j}(t),\zeta_{j})=\sum_{j=1}^{n}W_{ij}(t)\mathbf{x}_{j}(t)+\mathbf{e}_{i}(t),$$

where $\mathbf{e}_{i}(t)=\sum_{j=1}^{n}W_{ij}(t)\left(\text{Rand}_{k}(\mathbf{x}_{j}(t),\zeta_{j})-\mathbf{x}_{j}(t)\right)$. From the properties of $\text{Rand}_{k}(\cdot,\cdot)$, we can verify that $\mathbb{E}\left[\mathbf{e}_{i}(t)\mid\mathcal{F}_{t}\right]=0$ and

$$\mathbb{E}\left[\|\mathbf{e}_{i}(t)\|^{2}\mid\mathcal{F}_{t}\right]\leq\left(\frac{d}{k}-1\right)D\sum_{j=1}^{n}W_{ij}(t)^{2}\leq\left(\frac{d}{k}-1\right)D,$$

which fulfills the conditions of Assumption 1. With respect to the architecture of Figure 1, we have $\tilde{\mathbf{x}}_{j}(t)=\text{Rand}_{k}(\mathbf{x}_{j}(t),\zeta_{j})$. Moreover, the noisy channel is perfect and the decoder is just an identity function, i.e., $\mathbf{y}_{i,j}(t)=\tilde{\mathbf{x}}_{j}(t)$ and $\hat{\mathbf{x}}_{i}(t)=\mathbf{y}_{i}(t)$.
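A short sketch of the $\text{Rand}_{k}$ operator in (16), together with a Monte-Carlo check of its unbiasedness and variance bound, is given below; this is our own illustration with arbitrary dimensions and sample sizes.

```python
import numpy as np

def rand_k(x, k, rng):
    """Unbiased random sparsification: keep k coordinates, rescale them by d/k."""
    d = x.size
    keep = rng.choice(d, size=k, replace=False)   # zeta drawn uniformly over subsets of size k
    out = np.zeros_like(x)
    out[keep] = (d / k) * x[keep]
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=20)
samples = np.array([rand_k(x, 5, rng) for _ in range(20000)])
print(np.max(np.abs(samples.mean(axis=0) - x)))              # close to 0: E[Rand_k(x)] = x
print(samples.var(axis=0).sum(), (20/5 - 1) * np.sum(x**2))  # variance vs. (d/k - 1)||x||^2
```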

Example 2

(Stochastic Quantizer with Bounded Trajectory). The stochastic quantizer with $s$ quantization levels maps a vector $\mathbf{x}\in\mathbb{R}^{d}$ to a random vector $Q_{s}^{S}(\mathbf{x})\in\mathbb{R}^{d}$, whose $\ell$-th entry is given by

$$\left[Q^{S}_{s}(\mathbf{x})\right]_{\ell}:=\|\mathbf{x}\|\cdot\operatorname{sgn}(x_{\ell})\cdot\zeta\left(|x_{\ell}|/\|\mathbf{x}\|,s\right),\quad\ell\in[d],\qquad(17)$$

where $\zeta(x,s)$ is a random variable taking values

$$\zeta(x,s)=\begin{cases}\lceil sx\rceil/s&\textrm{w.p. } sx-\lfloor sx\rfloor\\ \lfloor sx\rfloor/s&\textrm{w.p. } \lceil sx\rceil-sx.\end{cases}\qquad(20)$$

Note that the random variables $\{\zeta(\cdot,\cdot)\}$ are independent across coordinates, agents, and time steps. Thus, in this case, the relationship between $\tilde{\mathbf{x}}_{j}(t)$ and $\mathbf{x}_{j}(t)$ in Figure 1 is $\tilde{\mathbf{x}}_{j}(t)=Q^{S}_{s}(\mathbf{x}_{j}(t))$. Furthermore, the noisy channel is perfect and the decoder component is just an identity function, i.e., $\mathbf{y}_{i,j}(t)=\tilde{\mathbf{x}}_{j}(t)$ and $\hat{\mathbf{x}}_{i}(t)=\mathbf{y}_{i}(t)$. It is shown in [42] that applying this quantizer to $\mathbf{x}\in\mathbb{R}^{d}$ with a bounded norm $\|\mathbf{x}\|^{2}\leq D$ satisfies $\mathbb{E}\left[Q^{S}_{s}(\mathbf{x})\right]=\mathbf{x}$ and $\mathbb{E}\left[\|Q^{S}_{s}(\mathbf{x})-\mathbf{x}\|^{2}\right]\leq\min\left(\frac{\sqrt{d}}{s},\frac{d}{s^{2}}\right)D$. Therefore, the neighbors' estimate for node $i$ will be

$$\hat{\mathbf{x}}_{i}(t)=\sum_{j=1}^{n}W_{ij}(t)Q^{S}_{s}(\mathbf{x}_{j}(t))=\sum_{j=1}^{n}W_{ij}(t)\mathbf{x}_{j}(t)+\mathbf{e}_{i}(t),$$

where $\mathbf{e}_{i}(t)=\sum_{j=1}^{n}W_{ij}(t)\left(Q^{S}_{s}(\mathbf{x}_{j}(t))-\mathbf{x}_{j}(t)\right)$ satisfies $\mathbb{E}\left[\mathbf{e}_{i}(t)\mid\mathcal{F}_{t}\right]=0$ and

$$\mathbb{E}\left[\|\mathbf{e}_{i}(t)\|^{2}\mid\mathcal{F}_{t}\right]\leq\min\left(\sqrt{d}/s,d/s^{2}\right)D\sum_{j=1}^{n}W^{2}_{ij}(t)\leq\min\left(\sqrt{d}/s,d/s^{2}\right)D,$$

provided that $\|\mathbf{x}_{j}(t)\|^{2}\leq D$ for every $j\in[n]$ and every $t\geq 1$. Therefore, the conditions of Assumption 1 are satisfied.
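A possible implementation of the stochastic quantizer (17)–(20), with a Monte-Carlo check of its unbiasedness, is sketched below; the dimension, the number of levels, and the sample size are arbitrary choices for this illustration.

```python
import numpy as np

def stochastic_quantize(x, s, rng):
    """Q_s^S(x): quantize |x_l|/||x|| onto a grid of s levels with unbiased randomized rounding."""
    norm = np.linalg.norm(x)
    if norm == 0.0:
        return np.zeros_like(x)
    y = np.abs(x) / norm * s
    lower = np.floor(y)
    level = lower + (rng.random(x.shape) < (y - lower))   # ceil(y) w.p. y - floor(y), else floor(y)
    return norm * np.sign(x) * level / s

rng = np.random.default_rng(0)
x = rng.normal(size=16)
q = np.array([stochastic_quantize(x, 3, rng) for _ in range(20000)])
print(np.max(np.abs(q.mean(axis=0) - x)))                 # close to 0: E[Q_s^S(x)] = x
```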

Note that in both examples above, $\mathbf{e}_{i}(t)$ and $\mathbf{e}_{j}(t)$ might be correlated, especially when nodes $i$ and $j$ have common neighbor(s). However, this does not violate the conditions of Assumption 1.

Example 3

(Noisy Communication). The noisy neighbor estimate model may also arise due to imperfect communication between the agents. Consider a wireless network in which the computing nodes communicate with their neighbors over a Gaussian channel, i.e., when node $j$ sends its state $\mathbf{x}_{j}(t)$ (without compression, i.e., $\tilde{\mathbf{x}}_{j}(t)=\mathbf{x}_{j}(t)$) to its neighbor $i$, the signal received at node $i$ is $\tilde{\mathbf{y}}_{i,j}(t)=\mathbf{x}_{j}(t)+\mathbf{z}_{i,j}(t)$, where $\mathbf{z}_{i,j}(t)$ is a zero-mean Gaussian noise with variance $\zeta^{2}$, independent across $(i,j)$ and $t$. Applying an identity-map decoder at node $i$ (i.e., $\hat{\mathbf{x}}_{i}(t)=\mathbf{y}_{i}(t)$), we have

$$\hat{\mathbf{x}}_{i}(t)=\sum_{j=1}^{n}W_{ij}(t)\left(\mathbf{x}_{j}(t)+\mathbf{z}_{i,j}(t)\right)=\sum_{j=1}^{n}W_{ij}(t)\mathbf{x}_{j}(t)+\sum_{j=1}^{n}W_{ij}(t)\mathbf{z}_{i,j}(t).$$

Thus, we get $\mathbf{e}_{i}(t)=\sum_{j=1}^{n}W_{ij}(t)\mathbf{z}_{i,j}(t)$, implying

$$\mathbb{E}\left[\mathbf{e}_{i}(t)\mid\mathcal{F}_{t}\right]=0,\quad\mathbb{E}\left[\|\mathbf{e}_{i}(t)\|^{2}\mid\mathcal{F}_{t}\right]=\zeta^{2}\sum_{j=1}^{n}W^{2}_{ij}(t)\leq\zeta^{2}.$$

Hence, the conditions of Assumption 1 are satisfied.
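The following sketch reproduces this noisy-channel model numerically (with our own toy parameters) and compares the empirical second moment of $\mathbf{e}_{i}(t)$ with the bound $\zeta^{2}\sum_{j}W_{ij}^{2}(t)$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, zeta = 5, 4, 0.5
W = np.full((n, n), 1.0 / n)                          # any row-stochastic weights
X = rng.normal(size=(n, d))

def noisy_average(X):
    # per-link noise z_{i,j}, scaled so that E||z_{i,j}||^2 = zeta^2
    Z = zeta / np.sqrt(d) * rng.normal(size=(n, n, d))
    Y = X[None, :, :] + Z                             # y_{i,j} = x_j + z_{i,j}
    return np.einsum('ij,ijd->id', W, Y)              # x_hat_i = sum_j W_ij y_{i,j}

E = np.array([noisy_average(X) - W @ X for _ in range(5000)])   # e_i = sum_j W_ij z_{i,j}
print(np.mean(np.sum(E**2, axis=2)), zeta**2 * np.sum(W[0]**2)) # empirical vs. zeta^2 sum_j W_ij^2
```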

3 Experimental Results

In the following, we discuss two sets of simulations to verify our theoretical results. First, we run Algorithm 1 on a fixed network to compare fixed step-sizes [26] versus diminishing step-sizes (this work). Then, we compare the performance of our algorithm on fixed versus time-varying graphs. Throughout this section, we use the unbiased stochastic quantizer in (17) with various numbers of quantization levels $s$. For the CNN experiments on CIFAR-10 and the regularized logistic regression problem on the MNIST dataset, the mini-batch size is set to $10$ and $20$, respectively, the loss function is the cross-entropy function, and we use $N=10{,}000$ data points distributed across $n=20$ agents. Moreover, for each set of experiments, we distribute the data points across the nodes according to $r_{i}=p_{i}/\sum_{i=1}^{20}p_{i}$, where $p_{i}$ is drawn uniformly at random from the interval $(0.01,0.9)$. Our codes are implemented in Python and tested on a 2017 MacBook Pro with 16 GB memory and a 2.5 GHz Intel Core i7 processor.

3.1 DIMIX vs. QuanTimed-DSGD over Fixed Network

We use regularized logistic regression to validate our algorithm on the benchmark MNIST dataset. ▷ Data and Experimental Setup. In this experiment, we train a regularized logistic regression model to classify the MNIST dataset over a fixed network. First, we generate a random (connected) undirected Erdős–Rényi graph with edge probability $p_{c}=0.3$, which is fixed throughout this experiment. We also generate a doubly stochastic weight matrix $A:=I-(d_{\max}+1)^{-1}L_{G}$, where $L_{G}$ is the Laplacian matrix and $d_{\max}$ is the maximum degree of the graph. Finally, we use the time-invariant weight sequence $W=I+c\,\hat{\mathbf{r}}_{\min}\,\text{diag}^{-1}(\mathbf{r})(A-I)$, where $\hat{\mathbf{r}}_{\min}=\min_{i\in[n]}r_{i}/(1-A_{ii})$, $c=0.95$, and $\text{diag}(\mathbf{r})$ is a diagonal matrix with $r_{i}$ as its $i$-th diagonal element. It can be verified that $W$ satisfies the conditions of Assumption 2, i.e., it is row-stochastic and satisfies $\mathbf{r}^{T}W=\mathbf{r}^{T}$. We implemented our algorithm with quantizer parameter $s=3$ and tuned the step-size parameters to $(\alpha_{0},\nu^{\star})=(0.005,1/6)$ and $(\beta_{0},\mu^{\star})=(0.6,1/2)$, with $\tau=2000$. For the fixed step-sizes, we used the termination time $T=7500$, which results in $\alpha(t)=0.001$ and $\beta(t)=0.01$ for all $t\geq 1$.
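The construction of $\mathbf{r}$, $A$, and $W$ described above can be sketched as follows; for simplicity, this illustration overlays the random edges on a ring so that the sampled graph is always connected, which is a convenience of the sketch rather than the procedure used in the experiment.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p_c, c = 20, 0.3, 0.95

# data fractions r_i = p_i / sum_i p_i with p_i ~ Uniform(0.01, 0.9)
p = rng.uniform(0.01, 0.9, size=n)
r = p / p.sum()

# ring backbone plus Erdos-Renyi edges (keeps the sampled graph connected)
adj = np.zeros((n, n))
for i in range(n):
    adj[i, (i + 1) % n] = adj[(i + 1) % n, i] = 1.0
extra = np.triu(rng.random((n, n)) < p_c, 1)
adj = np.maximum(adj, (extra | extra.T).astype(float))

deg = adj.sum(axis=1)
L_G = np.diag(deg) - adj
A = np.eye(n) - L_G / (deg.max() + 1)                 # doubly stochastic mixing matrix

# W = I + c * r_min_hat * diag(r)^{-1} (A - I): row-stochastic with r^T W = r^T
r_min_hat = np.min(r / (1.0 - np.diag(A)))
W = np.eye(n) + c * r_min_hat * (A - np.eye(n)) / r[:, None]
assert np.allclose(W.sum(axis=1), 1.0) and np.allclose(r @ W, r) and (W >= -1e-12).all()
```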

Figure 2: Logistic Regression on MNIST: network variance for fixed and vanishing step-sizes.

Results. The plot in Figure 2 shows the network variance, defined by $\left\|X(t)-\mathbf{1}\bar{\mathbf{x}}(t)\right\|_{\mathbf{r}}^{2}$, for fixed and diminishing step-sizes. It can be observed that with fixed step-sizes the algorithm only reaches a neighborhood of the average model, and consensus is not necessarily achieved. However, using diminishing step-sizes guarantees that each node's state converges to the average model.

3.2 Diminishing Step-sizes over Time-varying vs. Fixed Network

Here, we provide simulation results to demonstrate the performance of DIMIX on time-varying networks. For this, we conduct simulations on non-convex (CNN) and convex (linear regression) problems. For the sake of comparison, we apply our algorithm on a corresponding fixed network.

Data and Experimental Setup. For the non-convex setting, we study the problem of learning the parameters of the LeNet-5 CNN [43] on the CIFAR-10 dataset. The LeNet-5 architecture consists of two convolutional layers, two sub-sampling (pooling) layers, and three fully connected layers, with the ReLU activation function. As before, we have $n=20$ nodes and $N=10{,}000$ data points, which are distributed among the agents according to some stochastic vector $\mathbf{r}$. For the convex setting, we consider a $100$-dimensional linear regression problem. We generate $N=300$ data points $\{\xi_{1},\ldots,\xi_{300}\}$, where $\xi_{i}=(\boldsymbol{u}_{i},v_{i})$ with $v_{i}=\boldsymbol{u}_{i}^{T}\tilde{\mathbf{x}}+\epsilon_{i}$, and $\boldsymbol{u}_{i},\tilde{\mathbf{x}}\in\mathbb{R}^{100}$ for all $i\in[300]$. In order to generate the data, we uniformly and independently draw the entries of each $\boldsymbol{u}_{i}$ and each $\epsilon_{i}$ from $(0,1)$ and $(0,0.1)$, respectively. Similarly, the entries of $\tilde{\mathbf{x}}$ are sampled uniformly and independently from $(-1,1)$. The goal is to estimate the unknown coefficient vector $\tilde{\mathbf{x}}$, which leads to solving the minimum mean square error problem

$$\min_{\mathbf{x}\in\mathbb{R}^{d}}f(\mathbf{x}):=\min_{\mathbf{x}\in\mathbb{R}^{d}}\frac{1}{2N}\sum_{i=1}^{300}\|v_{i}-\boldsymbol{u}_{i}^{T}\mathbf{x}\|^{2}.$$

Experiments with Fixed Graph. We consider a fixed undirected cyclic graph $\mathcal{G}^{C}=([n],\mathcal{E})$, where $\mathcal{E}=\{(\left\langle i\right\rangle,\left\langle i+1\right\rangle):i\in[n]\}$ and $\left\langle i\right\rangle:=(i-1\bmod n)+1$. For these experiments, the stochastic matrix sequence is $W(t)=W$ for all $t$, given by

$$W_{ij}=\begin{cases}\frac{r_{\left\langle j\right\rangle}}{2(r_{\left\langle i\right\rangle}+r_{\left\langle j\right\rangle})}&j\in\{\left\langle i-1\right\rangle,\left\langle i+1\right\rangle\}\\ \frac{r_{\left\langle i\right\rangle}}{2(r_{\left\langle i\right\rangle}+r_{\left\langle i+1\right\rangle})}+\frac{r_{\left\langle i\right\rangle}}{2(r_{\left\langle i\right\rangle}+r_{\left\langle i-1\right\rangle})}&j=i\\ 0&\textrm{otherwise}.\end{cases}\qquad(21)$$

Experiments with Time-varying Graph. To evaluate Algorithm 1 on a time-varying network, we use cyclic gossip [44, 45, 6, 46, 47] iterations. Here, the algorithm goes through the cycle graph $\mathcal{G}^{C}$ (described above), and one pair of neighbors (on the cycle) exchanges information at each time. More precisely, at time $t$, the averaging graph $\mathcal{G}(t)=([n],\mathcal{E}(t))$ consists of only a single edge, $\mathcal{E}(t)=\{({\left\langle t\right\rangle},{\left\langle t+1\right\rangle})\}$. For $\mathcal{G}(t)$, we let

$$[W(t)]_{ij}=\begin{cases}\frac{r_{\left\langle j\right\rangle}}{r_{{\left\langle t\right\rangle}}+r_{{\left\langle t+1\right\rangle}}}&i,j\in\{{\left\langle t\right\rangle},{\left\langle t+1\right\rangle}\}\\ 1&i=j\notin\{{\left\langle t\right\rangle},{\left\langle t+1\right\rangle}\}\\ 0&\textrm{otherwise}.\end{cases}\qquad(22)$$

Note that each edge in $\mathcal{G}^{C}$ is visited periodically, and hence the sequence of stochastic matrices $\{W(t)\}$ in (22) is $B$-connected with $B=n$.
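The gossip sequence (22) can be generated as in the sketch below (our own illustration); each $W(t)$ is row-stochastic and satisfies $\mathbf{r}^{T}W(t)=\mathbf{r}^{T}$, and the sequence is $B$-connected with $B=n$.

```python
import numpy as np

def gossip_matrix(t, r):
    """W(t) from (22): only the cycle edge (<t>, <t+1>) is active at time t."""
    n = len(r)
    i, j = (t - 1) % n, t % n                 # <t>, <t+1> with 0-based indexing
    W = np.eye(n)
    W[i, i] = W[j, i] = r[i] / (r[i] + r[j])
    W[i, j] = W[j, j] = r[j] / (r[i] + r[j])
    return W

rng = np.random.default_rng(0)
r = rng.uniform(0.01, 0.9, size=6); r /= r.sum()
for t in range(1, 4):
    W = gossip_matrix(t, r)
    assert np.allclose(W.sum(axis=1), 1.0) and np.allclose(r @ W, r)
```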

Figure 3: Training Loss vs. Iterations: LeNet-5 on CIFAR-10 (left), and Linear Regression (right).

Results. In Figure 3, the plot on the left shows the training loss of LeNet-5 on the CIFAR-10 dataset, and the plot on the right shows the training loss of the linear regression. Here, 'Fixed network' refers to the full cycle with the stochastic matrix in (21), and 'Time-varying network' refers to the network with the stochastic matrix sequence in (22). For the CNN, the parameters of the dynamics in (5) are fine-tuned to $(\alpha_{0},\nu^{\star})=(0.02,1/6)$, $(\beta_{0},\mu^{\star})=(0.05,1/2)$, and $\tau=0$, and $s=6$ is the parameter of the stochastic quantizer for both networks. For the linear regression model, the parameters are adjusted to $(\alpha_{0},\nu)=(6,1/4)$, $(\beta_{0},\mu)=(16,3/4)$, and $\tau=1500$, with quantizer parameter $s=6$. It can be observed that the algorithm works for both fixed and time-varying networks. However, the performance over the time-varying network naturally suffers from slower convergence, due to the slower mixing of information over the network.

4 Auxiliary Lemmas

The following lemmas are used to facilitate the proof of the main result of the paper. We present the proofs of Lemmas 1–5 in Appendix A.

Lemma 1

For any pair of vectors 𝐮\boldsymbol{u}, 𝐯\boldsymbol{v}, and any scalar ω>0\omega>0, we have

𝒖+𝒗2\displaystyle\|\boldsymbol{u}+\boldsymbol{v}\|^{2} (1+ω)𝒖2+(1+1ω)𝒗2.\displaystyle\leq(1+\omega)\|\boldsymbol{u}\|^{2}+\left(1+\frac{1}{\omega}\right)\|\boldsymbol{v}\|^{2}. (23)

Similarly, for matrices UU and VV and any scalar ω>0\omega>0, their 𝐫\mathbf{r}-norm satisfy

U+V𝐫2(1+ω)U𝐫2+(1+1ω)V𝐫2.\displaystyle\|U+V\|_{\mathbf{r}}^{2}\leq(1+\omega)\|U\|^{2}_{\mathbf{r}}+\left(1+\frac{1}{\omega}\right)\|V\|^{2}_{\mathbf{r}}. (24)
Lemma 2

Let {β(t)}\{\beta(t)\} be a sequence in \mathbb{R} and λ\lambda be a positive scalar. Then the following identities hold

s=1t1β(s)k=s+1t1(1λβ(k))\displaystyle\sum_{s=1}^{t-1}\beta(s)\!\!\prod_{k=s+1}^{t-1}(1-\lambda\beta(k)) =1λ1λk=1t1(1λβ(k)),for all 1t,\displaystyle=\frac{1}{\lambda}-\frac{1}{\lambda}\prod_{k=1}^{t-1}(1-\lambda\beta(k)),\quad\mbox{for all $1\leq t$}, (25)
t=s+1Tβ(t)k=s+1t1(1λβ(k))\displaystyle\sum_{t=s+1}^{T}\beta(t)\!\prod_{k=s+1}^{t-1}(1-\lambda\beta(k)) =1λ1λk=s+1T(1λβ(k))for all 1sT.\displaystyle=\frac{1}{\lambda}-\frac{1}{\lambda}\prod_{k=s+1}^{T}(1-\lambda\beta(k))\quad\mbox{for all $1\leq s\leq T$}. (26)

As a result, for any sequence {β(t)}\{\beta(t)\} in [0,1][0,1] and λ>0\lambda>0, we have

s=1t1β(s)k=s+1t1(1λβ(k))1λ,\sum_{s=1}^{t-1}\beta(s)\prod_{k=s+1}^{t-1}(1-\lambda\beta(k))\leq\frac{1}{\lambda},

and

t=s+1Tβ(t)k=s+1t1(1λβ(k))1λ.\sum_{t=s+1}^{T}\beta(t)\prod_{k=s+1}^{t-1}(1-\lambda\beta(k))\leq\frac{1}{\lambda}.
Lemma 3

For any δ\delta\in\mathbb{R}, τ0\tau\geq 0, and T1T\geq 1, we have

t=1T(t+τ)δ{τ1+δ|1+δ|if δ<1,ln(Tτ+1)if δ=1,21+δ1+δ(T+τ)1+δif δ>1.\displaystyle\sum_{t=1}^{T}(t+\tau)^{\delta}\leq\begin{cases}\frac{\tau^{1+\delta}}{|1+\delta|}&\textrm{if $\delta<-1$},\\ \ln\left(\frac{T}{\tau}+1\right)&\textrm{if $\delta=-1$},\\ \frac{2^{1+\delta}}{1+\delta}(T+\tau)^{1+\delta}&\textrm{if $\delta>-1$}.\end{cases} (27)
Lemma 4

For an n×mn\times m matrix AA and m×qm\times q matrix BB, we have:

AB𝐫A𝐫BF.\displaystyle\left\|AB\right\|_{\mathbf{r}}\leq\left\|A\right\|_{\mathbf{r}}\left\|B\right\|_{F}.

As we discussed in Remark 4, we cannot use the conventional results (e.g., in [48, 36, 12, 37, 38, 39, 16, 21, 40]) to bound the norm of a vector after averaging with matrices with diminishing weights. The following lemma provides a new bounding technique, which we use in the proof of the main result of this work.

Lemma 5

Let {W(t)}\{W(t)\} satisfy the Connectivity Assumption 2 with parameters (B,η)(B,\eta), and let {A(t)}\{A(t)\} be given by A(t)=(1β(t))I+β(t)W(t)A(t)=(1-\beta(t))I+\beta(t)W(t) where β(t)(0,1]\beta(t)\in(0,1] for all tt and {β(t)}\{\beta(t)\} is a non-increasing sequence. Then, for any matrix Un×dU\in\mathbb{R}^{n\times d}, parameter λ=η𝐫min2Bn2\lambda=\frac{\eta\mathbf{r}_{\min}}{2Bn^{2}}, and all t>s1t>s\geq 1, we have

(A(t1)A(t2)A(s+1)𝟏𝐫T)U𝐫2κk=s+1t1(1λβ(k))U𝐫2,\left\|\left(A(t-1)A(t-2)\cdots A(s+1)-\mathbf{1}\mathbf{r}^{T}\right)U\right\|_{\mathbf{r}}^{2}\leq\kappa\prod_{k=s+1}^{t-1}(1-\lambda\beta(k))\left\|U\right\|_{\mathbf{r}}^{2},

where κ=(1Bλβ0)1\kappa=\left(1-B\lambda\beta_{0}\right)^{-1} and β0=β(1)\beta_{0}=\beta(1).

5 Proof of Theorem 1

For our analysis, first we obtain an expression for the average reduction of the objective function f()f(\cdot) at the average of the states, i.e., 𝐱¯(t)=𝐫TX(t)=i=1nri𝐱i(t)\bar{\mathbf{x}}(t)=\mathbf{r}^{T}X(t)=\sum_{i=1}^{n}r_{i}\mathbf{x}_{i}(t). Recall that 𝐫TW(t)=𝐫T\mathbf{r}^{T}W(t)=\mathbf{r}^{T} for all t1t\geq 1. Hence, multiplying both sides of (7) by 𝐫T\mathbf{r}^{T}, we get

𝐱¯(t+1)=𝐱¯(t)+β(t)𝐫TE(t)α(t)β(t)𝐫Tf(X(t)).\displaystyle\bar{\mathbf{x}}(t+1)=\bar{\mathbf{x}}(t)+\beta(t)\mathbf{r}^{T}E(t)-\alpha(t)\beta(t)\mathbf{r}^{T}\nabla f(X(t)).

From Assumption 3-(a) and Lemma 3.4 in [49], we can conclude

f(𝐱¯(t+1))f(𝐱¯(t))f(𝐱¯(t)),𝐱¯(t+1)𝐱¯(t)K2𝐱¯(t+1)𝐱¯(t)2,\displaystyle f(\bar{\mathbf{x}}(t+1))-f(\bar{\mathbf{x}}(t))-\left\langle\nabla f(\bar{\mathbf{x}}(t)),\bar{\mathbf{x}}(t+1)-\bar{\mathbf{x}}(t)\right\rangle\leq\frac{K}{2}\left\|\bar{\mathbf{x}}(t+1)-\bar{\mathbf{x}}(t)\right\|^{2},

or equivalently,

f(𝐱¯(t+1))\displaystyle f(\bar{\mathbf{x}}(t+1))\leq f(𝐱¯(t))+β(t)f(𝐱¯(t)),𝐫TE(t)α(t)β(t)f(𝐱¯(t)),𝐫Tf(X(t))\displaystyle f(\bar{\mathbf{x}}(t))+\beta(t)\left\langle\nabla f(\bar{\mathbf{x}}(t)),\mathbf{r}^{T}E(t)\right\rangle-\alpha(t)\beta(t)\left\langle\nabla f(\bar{\mathbf{x}}(t)),\mathbf{r}^{T}\nabla f(X(t))\right\rangle
+β(t)2K2𝐫TE(t)α(t)𝐫Tf(X(t))2.\displaystyle+\beta(t)^{2}\frac{K}{2}\left\|\mathbf{r}^{T}E(t)-\alpha(t)\mathbf{r}^{T}\nabla f(X(t))\right\|^{2}.

Then, since {X(t)}\{X(t)\} is adapted to the filtration {t}\{\mathcal{F}_{t}\} and using Assumption 1, we arrive at

𝔼[f(𝐱¯(t+1))|t]\displaystyle\mathbb{E}\left[f\left(\bar{\mathbf{x}}(t+1)\right)\middle|\mathcal{F}_{t}\right] f(𝐱¯(t))α(t)β(t)f(𝐱¯(t)),𝐫Tf(X(t))\displaystyle\leq f\left(\bar{\mathbf{x}}(t)\right)-\alpha(t)\beta(t)\left\langle\nabla f\left(\bar{\mathbf{x}}(t)\right),\mathbf{r}^{T}\nabla f(X(t))\right\rangle (28)
+β2(t)K2𝔼[𝐫T[E(t)α(t)f(X(t))]2|t].\displaystyle\phantom{--}+\beta^{2}(t)\frac{K}{2}\mathbb{E}\left[\left\|\mathbf{r}^{T}\left[E(t)-\alpha(t)\nabla f(X(t))\right]\right\|^{2}\middle|\mathcal{F}_{t}\right]. (29)

Let us focus on the second term in (28). Using the identity $2\left\langle\mathbf{x},\mathbf{y}\right\rangle=\left\|\mathbf{x}\right\|^{2}+\left\|\mathbf{y}\right\|^{2}-\left\|\mathbf{x}-\mathbf{y}\right\|^{2}$, we can write

$$\left\langle\nabla f\left(\bar{\mathbf{x}}(t)\right),\mathbf{r}^{T}\nabla f(X(t))\right\rangle=\frac{1}{2}\left(\|\nabla f\left(\bar{\mathbf{x}}(t)\right)\|^{2}+\|\mathbf{r}^{T}\nabla f(X(t))\|^{2}-\|\nabla f\left(\bar{\mathbf{x}}(t)\right)-\mathbf{r}^{T}\nabla f(X(t))\|^{2}\right).\qquad(30)$$

Moreover, we have

f(𝐱¯(t))𝐫Tf(X(t))2\displaystyle\|\nabla f\left(\bar{\mathbf{x}}(t)\right)-\mathbf{r}^{T}\nabla f(X(t))\|^{2} =i=1nrifi(𝐱¯(t))i=1nrifi(𝐱i(t))2\displaystyle=\left\|\sum_{i=1}^{n}r_{i}\nabla f_{i}(\bar{\mathbf{x}}(t))-\sum_{i=1}^{n}r_{i}\nabla f_{i}(\mathbf{x}_{i}(t))\right\|^{2}
=i=1nri(fi(𝐱¯(t))fi(𝐱i(t)))2\displaystyle=\left\|\sum_{i=1}^{n}r_{i}\left(\nabla f_{i}(\bar{\mathbf{x}}(t))-\nabla f_{i}(\mathbf{x}_{i}(t))\right)\right\|^{2}
i=1nrifi(𝐱¯(t))fi(𝐱i(t))2,\displaystyle\leq\sum_{i=1}^{n}r_{i}\left\|\nabla f_{i}(\bar{\mathbf{x}}(t))-\nabla f_{i}(\mathbf{x}_{i}(t))\right\|^{2},

where the last inequality holds since $\|\cdot\|^{2}$ is a convex function and $\mathbf{r}$ is a stochastic vector. Therefore, using the above inequality and Assumption 3-(a), we get

f(𝐱¯(t))𝐫Tf(X(t))2\displaystyle\|\nabla f\left(\bar{\mathbf{x}}(t)\right)-\mathbf{r}^{T}\nabla f(X(t))\|^{2} i=1nrifi(𝐱¯(t))fi(𝐱i(t))2\displaystyle\leq\sum_{i=1}^{n}r_{i}\left\|\nabla f_{i}(\bar{\mathbf{x}}(t))-\nabla f_{i}(\mathbf{x}_{i}(t))\right\|^{2} (31)
K2i=1nri𝐱¯(t)𝐱i(t)2.\displaystyle\leq K^{2}\sum_{i=1}^{n}r_{i}\left\|\bar{\mathbf{x}}(t)-\mathbf{x}_{i}(t)\right\|^{2}. (32)

Next, we analyze the last term in (28). From Assumption 1 we have 𝔼[E(t)|t]=0\mathbb{E}\left[E(t)|\mathcal{F}_{t}\right]=0, which leads to

𝔼[𝐫T[E(t)α(t)f(X(t))]2|t]=𝔼[𝐫TE(t)2|t]+α(t)𝐫Tf(X(t))2.\displaystyle\mathbb{E}\left[\left\|\mathbf{r}^{T}\left[E(t)-\alpha(t)\nabla f(X(t))\right]\right\|^{2}\middle|\mathcal{F}_{t}\right]=\mathbb{E}\left[\left\|\mathbf{r}^{T}E(t)\right\|^{2}\middle|\mathcal{F}_{t}\right]+\left\|\alpha(t)\mathbf{r}^{T}\nabla f(X(t))\right\|^{2}. (33)

For the first term in (33), we again exploit Assumption 1 which implies

𝔼[𝐫TE(t)2|t]\displaystyle\mathbb{E}\left[\left\|\mathbf{r}^{T}E(t)\right\|^{2}\middle|\mathcal{F}_{t}\right] =𝐫T𝔼[E(t)E(t)T|t]𝐫\displaystyle=\mathbf{r}^{T}\mathbb{E}\left[E(t)E(t)^{T}\middle|\mathcal{F}_{t}\right]\mathbf{r}
𝐫T(γ𝟏𝟏T)𝐫=γ,\displaystyle\leq\mathbf{r}^{T}(\gamma\mathbf{1}\mathbf{1}^{T})\mathbf{r}=\gamma, (34)

where we used the fact that 𝐫T𝟏=1\mathbf{r}^{T}\mathbf{1}=1 and the inequality holds since

[𝔼[E(t)E(t)T|t]]ij\displaystyle\left[\mathbb{E}\left[E(t)E(t)^{T}\middle|\mathcal{F}_{t}\right]\right]_{ij} =𝔼[𝐞i(t)𝐞j(t)|t]\displaystyle=\mathbb{E}\left[\mathbf{e}_{i}(t)\mathbf{e}_{j}(t)|\mathcal{F}_{t}\right]
𝔼[𝐞i(t)2|t]𝔼[𝐞j(t)2|t]γ,\displaystyle\leq\sqrt{\mathbb{E}\left[\|\mathbf{e}_{i}(t)\|^{2}|\mathcal{F}_{t}\right]\mathbb{E}\left[\|\mathbf{e}_{j}(t)\|^{2}|\mathcal{F}_{t}\right]}\leq\gamma,

for all i,j[n]i,j\in\ [n]. Thus, the last term in (28) is upper bounded as

𝔼[𝐫T[E(t)α(t)f(X(t))]2|t]γ+α2(t)𝐫Tf(X(t))2.\displaystyle\mathbb{E}\left[\left\|\mathbf{r}^{T}\left[E(t)-\alpha(t)\nabla f(X(t))\right]\right\|^{2}\middle|\mathcal{F}_{t}\right]\leq\gamma+\alpha^{2}(t)\|\mathbf{r}^{T}\nabla f(X(t))\|^{2}. (35)

Therefore, replacing (30), (31), and (35) in (28) we get

𝔼[f(𝐱¯(t+1))|t]\displaystyle\mathbb{E}\left[f\left(\bar{\mathbf{x}}(t+1)\right)\middle|\mathcal{F}_{t}\right]
f(𝐱¯(t))α(t)β(t)f(𝐱¯(t)),𝐫Tf(X(t))\displaystyle\leq f\left(\bar{\mathbf{x}}(t)\right)-\alpha(t)\beta(t)\left\langle\nabla f\left(\bar{\mathbf{x}}(t)\right),\mathbf{r}^{T}\nabla f(X(t))\right\rangle
+β2(t)K2𝔼[𝐫T[E(t)α(t)f(X(t))]2|t]\displaystyle\quad+\beta^{2}(t)\frac{K}{2}\mathbb{E}\left[\left\|\mathbf{r}^{T}\left[E(t)-\alpha(t)\nabla f(X(t))\right]\right\|^{2}\middle|\mathcal{F}_{t}\right]
f(𝐱¯(t))12α(t)β(t)(f(𝐱¯(t))2+𝐫Tf(X(t))22K2i=1nri𝐱¯(t)𝐱i(t)2)\displaystyle\leq f\left(\bar{\mathbf{x}}(t)\right)-\frac{1}{2}\alpha(t)\beta(t)\left(\left\|\nabla f(\bar{\mathbf{x}}(t))\right\|^{2}+\left\|\mathbf{r}^{T}\nabla f(X(t))\right\|^{2}-2K^{2}\sum_{i=1}^{n}r_{i}\left\|\bar{\mathbf{x}}(t)-\mathbf{x}_{i}(t)\right\|^{2}\right)
+β2(t)K2(γ+α2(t)𝐫Tf(X(t))2)\displaystyle\quad+\beta^{2}(t)\frac{K}{2}\left(\gamma+\alpha^{2}(t)\left\|\mathbf{r}^{T}\nabla f(X(t))\right\|^{2}\right)
=f(𝐱¯(t))12α(t)β(t)f(𝐱¯(t))212α(t)β(t)(1α(t)β(t)K)𝐫Tf(X(t))2\displaystyle=f\left(\bar{\mathbf{x}}(t)\right)-\frac{1}{2}\alpha(t)\beta(t)\left\|\nabla f(\bar{\mathbf{x}}(t))\right\|^{2}-\frac{1}{2}\alpha(t)\beta(t)\left(1-\alpha(t)\beta(t)K\right)\left\|\mathbf{r}^{T}\nabla f(X(t))\right\|^{2}
+α(t)β(t)K2X(t)𝟏𝐱¯(t)𝐫2+β2(t)K2γ.\displaystyle\quad+\alpha(t)\beta(t)K^{2}\left\|X(t)-\mathbf{1}\bar{\mathbf{x}}(t)\right\|_{\mathbf{r}}^{2}+\beta^{2}(t)\frac{K}{2}\gamma.

Taking the expectation of both sides lead to

𝔼[f(𝐱¯(t+1))]\displaystyle\mathbb{E}\left[f\left(\bar{\mathbf{x}}(t+1)\right)\right] 𝔼[f(𝐱¯(t))]12α(t)β(t)𝔼[f(𝐱¯(t))2]\displaystyle\leq\mathbb{E}\left[f\left(\bar{\mathbf{x}}(t)\right)\right]-\frac{1}{2}\alpha(t)\beta(t)\mathbb{E}\left[\left\|\nabla f(\bar{\mathbf{x}}(t))\right\|^{2}\right] (36)
12α(t)β(t)(1α(t)β(t)K)𝔼[𝐫Tf(X(t))2]\displaystyle\quad-\frac{1}{2}\alpha(t)\beta(t)\left(1-\alpha(t)\beta(t)K\right)\mathbb{E}\left[\left\|\mathbf{r}^{T}\nabla f(X(t))\right\|^{2}\right] (37)
+α(t)β(t)K2𝔼[X(t)𝟏𝐱¯(t)𝐫2]+β2(t)K2γ.\displaystyle\quad+\alpha(t)\beta(t)K^{2}\mathbb{E}\left[\left\|X(t)-\mathbf{1}\bar{\mathbf{x}}(t)\right\|_{\mathbf{r}}^{2}\right]+\beta^{2}(t)\frac{K}{2}\gamma. (38)

5.1 State Deviations from the Average State: $\mathbb{E}\left[\left\|X(t)-\mathbf{1}\bar{\mathbf{x}}(t)\right\|_{\mathbf{r}}^{2}\right]$

Note that the dynamics in (7) can be viewed as the linear time-varying system

X(t+1)=A(t)X(t)+U(t),\displaystyle X(t+1)=A(t)X(t)+U(t), (39)

with

A(t)\displaystyle A(t) =((1β(t))I+β(t)W(t)),\displaystyle=((1-\beta(t))I+\beta(t)W(t)),
U(t)\displaystyle U(t) =β(t)E(t)α(t)β(t)f(X(t)).\displaystyle=\beta(t)E(t)-\alpha(t)\beta(t)\nabla f(X(t)).

The solution of (39) is given by

X(t)=s=1t1Φ(t:s)U(s)+Φ(t:0)X(1),\displaystyle X(t)=\sum_{s=1}^{t-1}\Phi(t:s)U(s)+\Phi(t:0)X(1), (40)

where Φ(t:s)=A(t1)A(s+1)\Phi(t:s)=A(t-1)\cdots A(s+1) with Φ(t:t1)=I\Phi(t:t-1)=I is the transition matrix of the linear system (39). For the simplicity of notations, let us define

P(t:s)\displaystyle P(t:s) :=β(s)(Φ(t:s)𝟏𝐫T)=β(s)(A(t1)A(s+1)𝟏𝐫T).\displaystyle:=\beta(s)(\Phi(t:s)-\mathbf{1}\mathbf{r}^{T})=\beta(s)\left(A(t-1)\cdots A(s+1)-\mathbf{1}\mathbf{r}^{T}\right).

As a result of Lemma 5, we have

P(t:s)U𝐫π(t:s)U𝐫,\left\|P(t:s)U\right\|_{\mathbf{r}}\leq\pi(t:s)\left\|U\right\|_{\mathbf{r}},

where π(t:s)\pi(t:s) is defined by

π(t:s):=β(s)κ12k=s+1t1(1λβ(k))12.\displaystyle\pi(t:s):=\beta(s)\kappa^{\frac{1}{2}}\prod_{k=s+1}^{t-1}(1-\lambda\beta(k))^{\frac{1}{2}}. (41)

Now consider the dynamics in (40). Assuming X(1)=\mathbf{0}, we get

X(t)=s=1t1Φ(t:s)U(s).\displaystyle X(t)=\sum_{s=1}^{t-1}\Phi(t:s)U(s). (42)

Moreover, multiplying both sides of (42) from the left by 𝐫T\mathbf{r}^{T} and using 𝐫TA(t)=𝐫T\mathbf{r}^{T}A(t)=\mathbf{r}^{T}, we get

𝐱¯(t)=𝐫TX(t)=s=1t1𝐫TΦ(t:s)U(s)=s=1t1𝐫TU(s).\displaystyle\bar{\mathbf{x}}(t)=\mathbf{r}^{T}X(t)=\sum_{s=1}^{t-1}\mathbf{r}^{T}\Phi(t:s)U(s)=\sum_{s=1}^{t-1}\mathbf{r}^{T}U(s). (43)

Then, subtracting (43) from (42), and plugging the definition of U(s)U(s) we have

X(t)𝟏𝐱¯(t)\displaystyle X(t)-\mathbf{1}\bar{\mathbf{x}}(t) =s=1t1Φ(t:s)U(s)s=1t1𝟏𝐫TU(s)\displaystyle=\sum_{s=1}^{t-1}\Phi(t:s)U(s)-\sum_{s=1}^{t-1}\mathbf{1}\mathbf{r}^{T}U(s)
=s=1t1(Φ(t:s)𝟏𝐫T)U(s)\displaystyle=\sum_{s=1}^{t-1}(\Phi(t:s)-\mathbf{1}\mathbf{r}^{T})U(s)
=s=1t1β(s)(Φ(t:s)𝟏𝐫T)[E(s)α(s)f(X(s))]\displaystyle=\sum_{s=1}^{t-1}\beta(s)(\Phi(t:s)-\mathbf{1}\mathbf{r}^{T})\bigg{[}E(s)-\alpha(s)\nabla f(X(s))\bigg{]}
=s=1t1P(t:s)E(s)s=1t1α(s)P(t:s)f(X(s)).\displaystyle=\sum_{s=1}^{t-1}P(t:s)E(s)-\sum_{s=1}^{t-1}\alpha(s)P(t:s)\nabla f(X(s)).

Therefore, using Lemma 1 with ω=1\omega=1, we get

X(t)𝟏𝐱¯(t)𝐫2\displaystyle\left\|X(t)-\mathbf{1}\bar{\mathbf{x}}(t)\right\|_{\mathbf{r}}^{2} =s=1t1P(t:s)E(s)s=1t1α(s)P(t:s)f(X(s))𝐫2\displaystyle=\left\|\sum_{s=1}^{t-1}P(t:s)E(s)-\sum_{s=1}^{t-1}\alpha(s)P(t:s)\nabla f(X(s))\right\|_{\mathbf{r}}^{2}
2s=1t1P(t:s)E(s)𝐫2+2s=1t1α(s)P(t:s)f(X(s))𝐫2.\displaystyle\leq 2\left\|\sum_{s=1}^{t-1}P(t:s)E(s)\right\|_{\mathbf{r}}^{2}+2\left\|\sum_{s=1}^{t-1}\alpha(s)P(t:s)\nabla f(X(s))\right\|_{\mathbf{r}}^{2}.

By expanding s=1t1P(t:s)E(s)𝐫2\left\|\sum_{s=1}^{t-1}P(t:s)E(s)\right\|_{\mathbf{r}}^{2}, we get

\displaystyle\left\|X(t)-\mathbf{1}\bar{\mathbf{x}}(t)\right\|_{\mathbf{r}}^{2} \leq 2\sum_{s=1}^{t-1}\left\|P(t:s)E(s)\right\|_{\mathbf{r}}^{2}+2\sum_{s\neq q}\left\langle P(t:s)E(s),P(t:q)E(q)\right\rangle (44)
+2s=1t1α(s)P(t:s)f(X(s))𝐫2.\displaystyle\phantom{=}+2\left\|\sum_{s=1}^{t-1}\alpha(s)P(t:s)\nabla f(X(s))\right\|_{\mathbf{r}}^{2}. (45)

Recall from Assumption 1 that 𝔼[E(q)|q]=0\mathbb{E}\left[E(q)|\mathcal{F}_{q}\right]=0. Moreover, since E(s)E(s) is measurable with respect to q\mathcal{F}_{q} for q>sq>s, we have

\displaystyle\mathbb{E}\left[\left\langle P(t:s)E(s),P(t:q)E(q)\right\rangle\right] =\mathbb{E}\left[\mathbb{E}\left[\left\langle P(t:s)E(s),P(t:q)E(q)\right\rangle\middle|\mathcal{F}_{q}\right]\right]
=𝔼[P(t:s)E(s),P(t:q)𝔼[E(q)|q]]=0.\displaystyle=\mathbb{E}\left[\left\langle P(t:s)E(s),P(t:q)\mathbb{E}\left[E(q)\middle|\mathcal{F}_{q}\right]\right\rangle\right]=0.

Using a similar argument for q<s and conditioning on \mathcal{F}_{s}, we conclude that the above relation holds for all q\neq s. Therefore, taking the expectation of both sides of (44) and noting that the expectation of the cross terms vanishes, we get

𝔼[X(t)𝟏𝐱¯(t)𝐫2]\displaystyle\mathbb{E}\left[\left\|X(t)-\mathbf{1}\bar{\mathbf{x}}(t)\right\|_{\mathbf{r}}^{2}\right] 2s=1t1𝔼[P(t:s)E(s)𝐫2]+2𝔼[s=1t1α(s)P(t:s)f(X(s))𝐫2]\displaystyle\leq 2\sum_{s=1}^{t-1}\mathbb{E}\left[\left\|P(t:s)E(s)\right\|_{\mathbf{r}}^{2}\right]+2\mathbb{E}\left[\left\|\sum_{s=1}^{t-1}\alpha(s)P(t:s)\nabla f(X(s))\right\|_{\mathbf{r}}^{2}\right] (46)

We continue with bounding the first term in (46). First note that Assumption 1 implies

𝔼[E(s)𝐫2]=𝔼[𝔼[E(s)𝐫2|s]]=𝔼[i=1nri𝔼[𝐞i(s)2|s]]𝔼[i=1nriγ]=γ.\displaystyle\mathbb{E}\left[\left\|E(s)\right\|_{\mathbf{r}}^{2}\right]=\mathbb{E}\left[\mathbb{E}\left[\left\|E(s)\right\|_{\mathbf{r}}^{2}|\mathcal{F}_{s}\right]\right]=\mathbb{E}\left[\sum_{i=1}^{n}r_{i}\mathbb{E}\left[\|\mathbf{e}_{i}(s)\|^{2}|\mathcal{F}_{s}\right]\right]\leq\mathbb{E}\left[\sum_{i=1}^{n}r_{i}\gamma\right]=\gamma. (47)

This together with Lemma 5 leads to

s=1t1𝔼[P(t:s)E(s)𝐫2]\displaystyle\sum_{s=1}^{t-1}\mathbb{E}\left[\left\|P(t:s)E(s)\right\|_{\mathbf{r}}^{2}\right] [s=1t1β2(s)κk=s+1t1(1λβ(k))𝔼[E(s)𝐫2]]\displaystyle\leq\left[\sum_{s=1}^{t-1}\beta^{2}(s)\kappa\prod_{k=s+1}^{t-1}(1-\lambda\beta(k))\mathbb{E}\left[\left\|E(s)\right\|_{\mathbf{r}}^{2}\right]\right] (48)
γκs=1t1[β2(s)k=s+1t1(1λβ(k))].\displaystyle\leq\gamma\kappa\sum_{s=1}^{t-1}\left[\beta^{2}(s)\prod_{k=s+1}^{t-1}(1-\lambda\beta(k))\right]. (49)

To bound the second term in (46), we use the triangle inequality for the norm 𝐫\left\|\cdot\right\|_{\mathbf{r}}, and write

𝔼[s=1t1α(s)P(t:s)f(X(s))𝐫2]𝔼[(s=1t1α(s)P(t:s)f(X(s))𝐫)2]=1s,qt1𝔼[α(s)P(t:s)f(X(s))𝐫α(q)P(t:q)f(X(q))𝐫].\displaystyle\begin{split}&\mathbb{E}\left[\left\|\sum_{s=1}^{t-1}\alpha(s)P(t:s)\nabla f(X(s))\right\|_{\mathbf{r}}^{2}\right]\cr&\leq\mathbb{E}\left[\left(\sum_{s=1}^{t-1}\left\|\alpha(s)P(t:s)\nabla f(X(s))\right\|_{\mathbf{r}}\right)^{2}\right]\cr&=\sum_{1\leq s,q\leq t-1}\mathbb{E}\left[\alpha(s)\left\|P(t:s)\nabla f(X(s))\right\|_{\mathbf{r}}\alpha(q)\left\|P(t:q)\nabla f(X(q))\right\|_{\mathbf{r}}\right].\end{split} (50)

Using Lemma 5 and 2aba2+b22ab\leq a^{2}+b^{2}, we can upper-bound this expression as

1s,qt1\displaystyle\sum_{1\leq s,q\leq t-1} 𝔼[α(s)P(t:s)f(X(s))𝐫α(q)P(t:q)f(X(q))𝐫]\displaystyle\mathbb{E}\left[\alpha(s)\left\|P(t:s)\nabla f(X(s))\right\|_{\mathbf{r}}\alpha(q)\left\|P(t:q)\nabla f(X(q))\right\|_{\mathbf{r}}\right] (51)
1s,qt1𝔼[α(s)π(t:s)f(X(s))𝐫α(q)π(t:q)f(X(q))𝐫]\displaystyle\leq\sum_{1\leq s,q\leq t-1}\mathbb{E}\left[\alpha(s)\pi(t:s)\left\|\nabla f(X(s))\right\|_{\mathbf{r}}\alpha(q)\pi(t:q)\left\|\nabla f(X(q))\right\|_{\mathbf{r}}\right] (52)
=1s,qt1π(t:s)π(t:q)𝔼[α(s)f(X(s))𝐫α(q)f(X(q))𝐫]\displaystyle=\sum_{1\leq s,q\leq t-1}\pi(t:s)\pi(t:q)\mathbb{E}\left[\alpha(s)\left\|\nabla f(X(s))\right\|_{\mathbf{r}}\alpha(q)\left\|\nabla f(X(q))\right\|_{\mathbf{r}}\right] (53)
121s,qt1π(t:s)π(t:q)𝔼[α2(s)f(X(s))𝐫2+α2(q)f(X(q))𝐫2]\displaystyle\leq\frac{1}{2}\sum_{1\leq s,q\leq t-1}\pi(t:s)\pi(t:q)\mathbb{E}\left[\alpha^{2}(s)\left\|\nabla f(X(s))\right\|_{\mathbf{r}}^{2}+\alpha^{2}(q)\left\|\nabla f(X(q))\right\|_{\mathbf{r}}^{2}\right] (54)
=1s,qt1π(t:s)π(t:q)𝔼[α2(s)f(X(s))𝐫2]\displaystyle=\sum_{1\leq s,q\leq t-1}\pi(t:s)\pi(t:q)\mathbb{E}\left[\alpha^{2}(s)\left\|\nabla f(X(s))\right\|_{\mathbf{r}}^{2}\right] (55)
=(q=1t1π(t:q))(s=1t1α2(s)π(t:s)𝔼[f(X(s))𝐫2]),\displaystyle=\left(\sum_{q=1}^{t-1}\pi(t:q)\right)\left(\sum_{s=1}^{t-1}\alpha^{2}(s)\pi(t:s)\mathbb{E}\left[\left\|\nabla f(X(s))\right\|_{\mathbf{r}}^{2}\right]\right), (56)

where π(t:s)\pi(t:s) is given by (41). Next, the fact that 1x1x/2\sqrt{1-x}\leq 1-x/2 and Lemma 2 imply

q=1t1π(t:q)\displaystyle\sum_{q=1}^{t-1}\pi(t:q) =q=1t1[β(q)κ12k=q+1t1(1λβ(k))12]\displaystyle=\sum_{q=1}^{t-1}\left[\beta(q)\kappa^{\frac{1}{2}}\prod_{k=q+1}^{t-1}(1-\lambda\beta(k))^{\frac{1}{2}}\right] (57)
q=1t1β(q)κ12k=q+1t1(1λ2β(k))\displaystyle\leq\sum_{q=1}^{t-1}\beta(q)\kappa^{\frac{1}{2}}\prod_{k=q+1}^{t-1}\left(1-\frac{\lambda}{2}\beta(k)\right) (58)
2λκ12.\displaystyle\leq\frac{2}{\lambda}\kappa^{\frac{1}{2}}. (59)
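As a quick numerical sanity check of (59) — the step where Lemma 2 enters — the snippet below evaluates \sum_{q=1}^{t-1}\beta(q)\prod_{k=q+1}^{t-1}\left(1-\frac{\lambda}{2}\beta(k)\right) for an illustrative diminishing \beta(\cdot) and an assumed value of \lambda (both placeholders, not quantities derived from Lemma 5), and compares it with 2/\lambda.

def mixing_sum(t_max, beta0=1.0, tau=1.0, mu=0.5, lam=0.7):
    # sum_{q=1}^{t_max-1} beta(q) * prod_{k=q+1}^{t_max-1} (1 - (lam/2) * beta(k))
    beta = lambda k: beta0 / (k + tau) ** mu
    total = 0.0
    for q in range(1, t_max):
        prod = 1.0
        for k in range(q + 1, t_max):
            prod *= 1.0 - 0.5 * lam * beta(k)
        total += beta(q) * prod
    return total

lam = 0.7
print(mixing_sum(2000, lam=lam), "<=", 2.0 / lam)   # the sum stays below 2/lambda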

Using this inequality in (56), we get

𝔼[s=1t1α(s)P(t:s)f(X(s))𝐫2]\displaystyle\mathbb{E}\left[\left\|\sum_{s=1}^{t-1}\alpha(s)P(t:s)\nabla f(X(s))\right\|_{\mathbf{r}}^{2}\right] 2λκ12s=1t1[α2(s)π(t:s)𝔼[f(X(s))𝐫2]].\displaystyle\leq\frac{2}{\lambda}\kappa^{\frac{1}{2}}\sum_{s=1}^{t-1}\left[\alpha^{2}(s)\pi(t:s)\mathbb{E}\left[\left\|\nabla f(X(s))\right\|_{\mathbf{r}}^{2}\right]\right]. (60)

Finally, using the bounds obtained in (48) and (60) in (46) we arrive at

𝔼[X(t)𝟏𝐱¯(t)𝐫2]\displaystyle\mathbb{E}\left[\left\|X(t)-\mathbf{1}\bar{\mathbf{x}}(t)\right\|_{\mathbf{r}}^{2}\right] 2γκs=1t1[β2(s)k=s+1t1(1λβ(k))]\displaystyle\leq 2\gamma\kappa\sum_{s=1}^{t-1}\left[\beta^{2}(s)\prod_{k=s+1}^{t-1}(1-\lambda\beta(k))\right] (61)
+4λκ12s=1t1[α2(s)π(t:s)𝔼[f(X(s))𝐫2]].\displaystyle\phantom{\leq}+\frac{4}{\lambda}\kappa^{\frac{1}{2}}\sum_{s=1}^{t-1}\left[\alpha^{2}(s)\pi(t:s)\mathbb{E}\left[\left\|\nabla f(X(s))\right\|_{\mathbf{r}}^{2}\right]\right]. (62)

5.2 Analysis of the overall deviation: t=1Tα(t)β(t)𝔼[X(t)𝟏𝐱¯(t)𝐫2]\sum_{t=1}^{T}\alpha(t)\beta(t)\mathbb{E}\left[\left\|X(t)-\mathbf{1}\bar{\mathbf{x}}(t)\right\|_{\mathbf{r}}^{2}\right]

Our goal here is to bound the overall weighted deviation of the states from their average. Recalling the bound on \mathbb{E}\left[\left\|X(t)-\mathbf{1}\bar{\mathbf{x}}(t)\right\|_{\mathbf{r}}^{2}\right] derived in Section 5.1 for each t, we have

t=1Tα(t)β(t)𝔼[X(t)𝟏𝐱¯(t)𝐫2]\displaystyle\sum_{t=1}^{T}\alpha(t)\beta(t)\mathbb{E}\left[\left\|X(t)-\mathbf{1}\bar{\mathbf{x}}(t)\right\|_{\mathbf{r}}^{2}\right]\leq 2γκt=1Tα(t)β(t)s=1t1[β2(s)k=s+1t1(1λβ(k))]\displaystyle 2\gamma\kappa\sum_{t=1}^{T}\alpha(t)\beta(t)\sum_{s=1}^{t-1}\left[\beta^{2}(s)\!\!\prod_{k=s+1}^{t-1}\!(1-\lambda\beta(k))\right] (63)
+4λκ12t=1Tα(t)β(t)s=1t1[α2(s)π(t:s)𝔼[f(X(s))𝐫2]].\displaystyle+\frac{4}{\lambda}\kappa^{\frac{1}{2}}\sum_{t=1}^{T}\alpha(t)\beta(t)\sum_{s=1}^{t-1}\left[\alpha^{2}(s)\pi(t:s)\mathbb{E}\left[\left\|\nabla f(X(s))\right\|_{\mathbf{r}}^{2}\right]\right].

Focusing on the first term in (63), we can write

t=1T[α(t)β(t)s=1t1[β2(s)k=s+1t1(1λβ(k))]]\displaystyle\sum_{t=1}^{T}\!\left[\alpha(t)\beta(t)\sum_{s=1}^{t-1}\left[\beta^{2}(s)\!\!\prod_{k=s+1}^{t-1}(1-\lambda\beta(k))\right]\right]\! =s=1T1[β2(s)t=s+1T[α(t)β(t)k=s+1t1(1λβ(k))]]\displaystyle=\!\sum_{s=1}^{T-1}\left[\beta^{2}(s)\!\!\sum_{t=s+1}^{T}\left[\alpha(t)\beta(t)\!\prod_{k=s+1}^{t-1}(1-\lambda\beta(k))\right]\right] (64)
s=1T1[α(s)β2(s)t=s+1T[β(t)k=s+1t1(1λβ(k))]]\displaystyle\leq\!\sum_{s=1}^{T-1}\left[\alpha(s)\beta^{2}(s)\!\!\sum_{t=s+1}^{T}\left[\beta(t)\!\prod_{k=s+1}^{t-1}(1-\lambda\beta(k))\right]\right] (65)
1λs=1T1α(s)β2(s),\displaystyle\leq\frac{1}{\lambda}\sum_{s=1}^{T-1}\alpha(s)\beta^{2}(s), (66)

where the first inequality is due to the fact that α(t)α(s)\alpha(t)\leq\alpha(s) for t>st>s, and the second one follows from Lemma 2. Similarly, using the fact that α(t)α(s)\alpha(t)\leq\alpha(s) for t>st>s, for the second term in (63), we have

t=1T[α(t)β(t)s=1t1[α2(s)π(t:s)𝔼[f(X(s))𝐫2]]]\displaystyle\sum_{t=1}^{T}\left[\alpha(t)\beta(t)\sum_{s=1}^{t-1}\left[\alpha^{2}(s)\pi(t:s)\mathbb{E}\left[\left\|\nabla f(X(s))\right\|_{\mathbf{r}}^{2}\right]\right]\right] (67)
=s=1T1[α2(s)𝔼[f(X(s))𝐫2]t=s+1Tα(t)β(t)π(t:s)]\displaystyle=\sum_{s=1}^{T-1}\left[\alpha^{2}(s)\mathbb{E}\left[\left\|\nabla f(X(s))\right\|_{\mathbf{r}}^{2}\right]\sum_{t=s+1}^{T}\alpha(t)\beta(t)\pi(t:s)\right] (68)
s=1T1[α3(s)𝔼[f(X(s))𝐫2]t=s+1Tβ(t)π(t:s)].\displaystyle\leq\sum_{s=1}^{T-1}\left[\alpha^{3}(s)\mathbb{E}\left[\left\|\nabla f(X(s))\right\|_{\mathbf{r}}^{2}\right]\sum_{t=s+1}^{T}\beta(t)\pi(t:s)\right]. (69)

Recall that π(t:s)=β(s)κ12k=s+1t1(1λβ(k))12\pi(t:s)=\beta(s)\kappa^{\frac{1}{2}}\prod_{k=s+1}^{t-1}(1-\lambda\beta(k))^{\frac{1}{2}}. Then, using the fact that 1x1x/2\sqrt{1-x}\leq 1-x/2, we can write

t=s+1Tβ(t)π(t:s)\displaystyle\sum_{t=s+1}^{T}\beta(t)\pi(t:s) =t=s+1T[β(t)β(s)κ12k=s+1t1(1λβ(k))12]\displaystyle=\sum_{t=s+1}^{T}\left[\beta(t)\beta(s)\kappa^{\frac{1}{2}}\prod_{k=s+1}^{t-1}(1-\lambda\beta(k))^{\frac{1}{2}}\right] (70)
β(s)κ12t=s+1T[β(t)k=s+1t1(1λ2β(k))]2λβ(s)κ12,\displaystyle\leq\beta(s)\kappa^{\frac{1}{2}}\sum_{t=s+1}^{T}\left[\beta(t)\prod_{k=s+1}^{t-1}\left(1-\frac{\lambda}{2}\beta(k)\right)\right]\leq\frac{2}{\lambda}\beta(s)\kappa^{\frac{1}{2}}, (71)

where the last inequality follows from Lemma 2. Therefore, from (69) and  (71) we have

t=1T[α(t)β(t)s=1t1[α2(s)π(t:s)𝔼[f(X(s))𝐫2]]]\displaystyle\sum_{t=1}^{T}\!\left[\!\alpha(t)\beta(t)\!\sum_{s=1}^{t-1}\!\left[\!\alpha^{2}(s)\pi(t:s)\mathbb{E}\left[\left\|\nabla f(X(s))\right\|_{\mathbf{r}}^{2}\right]\right]\right] 2λκ12s=1T1[α3(s)β(s)𝔼[f(X(s))𝐫2]].\displaystyle\!\leq\!\frac{2}{\lambda}\kappa^{\frac{1}{2}}\!\sum_{s=1}^{T-1}\!\left[\alpha^{3}(s)\beta(s)\mathbb{E}\left[\left\|\nabla f(X(s))\right\|_{\mathbf{r}}^{2}\right]\right]. (72)

Therefore, from (64) and (72) we can conclude

t=1Tα(t)β(t)\displaystyle\sum_{t=1}^{T}\!\alpha(t)\beta(t) 𝔼[X(t)𝟏𝐱¯(t)𝐫2]2γκλt=1T1α(t)β2(t)+8κλ2t=1T1[α3(t)β(t)𝔼[f(X(t))𝐫2]]\displaystyle\mathbb{E}\left[\left\|X(t)-\mathbf{1}\bar{\mathbf{x}}(t)\right\|_{\mathbf{r}}^{2}\right]\!\leq\!\frac{2\gamma\kappa}{\lambda}\!\sum_{t=1}^{T-1}\!\alpha(t)\beta^{2}(t)\!+\!\frac{8\kappa}{\lambda^{2}}\sum_{t=1}^{T-1}\!\left[\alpha^{3}(t)\beta(t)\mathbb{E}\left[\left\|\nabla f(X(t))\right\|_{\mathbf{r}}^{2}\right]\right]
2γκλt=1Tα(t)β2(t)+8κλ2t=1T[α3(t)β(t)𝔼[f(X(t))𝐫2]].\displaystyle\!\leq\!\frac{2\gamma\kappa}{\lambda}\!\sum_{t=1}^{T}\!\alpha(t)\beta^{2}(t)\!+\!\frac{8\kappa}{\lambda^{2}}\sum_{t=1}^{T}\!\left[\alpha^{3}(t)\beta(t)\mathbb{E}\left[\left\|\nabla f(X(t))\right\|_{\mathbf{r}}^{2}\right]\right]. (73)

5.3 Bounding 𝔼[f(X(t))𝐫2]\mathbb{E}\left[\left\|\nabla f(X(t))\right\|_{\mathbf{r}}^{2}\right]

In this part, we derive an upper bound on \mathbb{E}\left[\left\|\nabla f(X(t))\right\|_{\mathbf{r}}^{2}\right]. Following [26], we can rewrite \nabla f(X(t)) as

\nabla f(X(t))=3\left[\frac{1}{3}\left(\nabla f(X(t))-\nabla f(\mathbf{1}\bar{\mathbf{x}}(t))\right)+\frac{1}{3}\left(\nabla f(\mathbf{1}\bar{\mathbf{x}}(t))-\mathbf{1}\nabla f(\bar{\mathbf{x}}(t))\right)+\frac{1}{3}\mathbf{1}\nabla f(\bar{\mathbf{x}}(t))\right].

Then, since 𝐫2\left\|\cdot\right\|^{2}_{\mathbf{r}} is a convex function, we have

\displaystyle\mathbb{E}\left[\left\|\nabla f(X(t))\right\|_{\mathbf{r}}^{2}\right]\leq 3\mathbb{E}\left[\left\|\nabla f(X(t))-\nabla f(\mathbf{1}\bar{\mathbf{x}}(t))\right\|_{\mathbf{r}}^{2}\right]+3\mathbb{E}\left[\left\|\nabla f(\mathbf{1}\bar{\mathbf{x}}(t))-\mathbf{1}\nabla f(\bar{\mathbf{x}}(t))\right\|_{\mathbf{r}}^{2}\right]
+3𝔼[𝟏f(𝐱¯(t))𝐫2].\displaystyle\phantom{=}+3\mathbb{E}\left[\left\|\mathbf{1}\nabla f(\bar{\mathbf{x}}(t))\right\|_{\mathbf{r}}^{2}\right]. (74)

Next, we bound each term in (5.3). Recall from (6) that the iith row of f(X(t))\nabla f(X(t)) is given by fi(Xi(t))\nabla f_{i}(X_{i}(t)), where Xi(t)=𝐱i(t)X_{i}(t)=\mathbf{x}_{i}(t) is the iith row of matrix X(t)X(t). Thus, from Assumption 3-(a) we have

𝔼[f(X(t))f(𝟏𝐱¯(t))𝐫2]\displaystyle\mathbb{E}\left[\left\|\nabla f(X(t))-\nabla f(\mathbf{1}\bar{\mathbf{x}}(t))\right\|_{\mathbf{r}}^{2}\right] =𝔼[i=1nrifi(𝐱i(t))fi(𝐱¯(t))2]\displaystyle=\mathbb{E}\left[\sum_{i=1}^{n}r_{i}\left\|\nabla f_{i}(\mathbf{x}_{i}(t))-\nabla f_{i}(\bar{\mathbf{x}}(t))\right\|^{2}\right] (75)
K2𝔼[i=1nri𝐱i(t)𝐱¯(t)2]\displaystyle\leq K^{2}\mathbb{E}\left[\sum_{i=1}^{n}r_{i}\left\|\mathbf{x}_{i}(t)-\bar{\mathbf{x}}(t)\right\|^{2}\right] (76)
=K2𝔼[X(t)𝟏𝐱¯(t)𝐫2].\displaystyle=K^{2}\mathbb{E}\left[\left\|X(t)-\mathbf{1}\bar{\mathbf{x}}(t)\right\|_{\mathbf{r}}^{2}\right]. (77)

Similarly, using the convexity of function 2\left\|\cdot\right\|^{2}, for the second term in (5.3) we have

𝔼[f(𝟏𝐱¯(t))𝟏f(𝐱¯(t))𝐫2]\displaystyle\mathbb{E}\left[\left\|\nabla f(\mathbf{1}\bar{\mathbf{x}}(t))-\mathbf{1}\nabla f(\bar{\mathbf{x}}(t))\right\|_{\mathbf{r}}^{2}\right] (78)
=𝔼[i=1nrifi(𝐱¯(t))f(𝐱¯(t))2]\displaystyle=\mathbb{E}\left[\sum_{i=1}^{n}r_{i}\left\|\nabla f_{i}(\bar{\mathbf{x}}(t))-\nabla f(\bar{\mathbf{x}}(t))\right\|^{2}\right] (79)
=i=1nri𝔼[412(fi(𝐱¯(t))L(𝐱¯(t)))12(f(𝐱¯(t))L(𝐱¯(t)))2]\displaystyle=\sum_{i=1}^{n}r_{i}\mathbb{E}\left[4\left\|\frac{1}{2}\Big{(}\nabla f_{i}(\bar{\mathbf{x}}(t))-\nabla L(\bar{\mathbf{x}}(t))\Big{)}-\frac{1}{2}\Big{(}\nabla f(\bar{\mathbf{x}}(t))-\nabla L(\bar{\mathbf{x}}(t))\Big{)}\right\|^{2}\right] (80)
i=1nri𝔼[2fi(𝐱¯(t))L(𝐱¯(t))2+2f(𝐱¯(t))L(𝐱¯(t))2]\displaystyle\leq\sum_{i=1}^{n}r_{i}\mathbb{E}\left[2\left\|\nabla f_{i}(\bar{\mathbf{x}}(t))-\nabla L(\bar{\mathbf{x}}(t))\right\|^{2}+2\left\|\nabla f(\bar{\mathbf{x}}(t))-\nabla L(\bar{\mathbf{x}}(t))\right\|^{2}\right] (81)
=i=1n2ri𝔼[fi(𝐱¯(t))L(𝐱¯(t))2]+2𝔼[f(𝐱¯(t))L(𝐱¯(t))2]\displaystyle=\sum_{i=1}^{n}2r_{i}\mathbb{E}\left[\left\|\nabla f_{i}(\bar{\mathbf{x}}(t))-\nabla L(\bar{\mathbf{x}}(t))\right\|^{2}\right]+2\mathbb{E}\left[\left\|\nabla f(\bar{\mathbf{x}}(t))-\nabla L(\bar{\mathbf{x}}(t))\right\|^{2}\right] (82)
=(a)i=1n2ri𝔼[1mi2j=1mi[(𝐱¯(t),ξji)L(𝐱¯(t))]2]+2𝔼[1N2j=1N[(𝐱¯(t),ξj)L(𝐱¯(t))]2]\displaystyle\stackrel{{\scriptstyle\rm{(a)}}}{{=}}\!\sum_{i=1}^{n}2r_{i}\mathbb{E}\!\left[\!\frac{1}{m_{i}^{2}}\!\left\|\sum_{j=1}^{m_{i}}\!\left[\nabla\ell(\bar{\mathbf{x}}(t),\xi_{j}^{i})\!-\!\nabla L(\bar{\mathbf{x}}(t))\right]\right\|^{2}\right]\!\!+\!2\mathbb{E}\!\left[\!\frac{1}{N^{2}}\!\left\|\sum_{j=1}^{N}\!\left[\nabla\ell(\bar{\mathbf{x}}(t),\xi_{j})\!-\!\nabla L(\bar{\mathbf{x}}(t))\right]\right\|^{2}\right] (83)
=(b)i=1n2ri1mi2j=1mi𝔼[(𝐱¯(t),ξji)L(𝐱¯(t))2]+2N2j=1N𝔼[(𝐱¯(t),ξj)L(𝐱¯(t))2]\displaystyle\stackrel{{\scriptstyle\rm{(b)}}}{{=}}\sum_{i=1}^{n}2r_{i}\frac{1}{m_{i}^{2}}\sum_{j=1}^{m_{i}}\mathbb{E}\left[\left\|\nabla\ell(\bar{\mathbf{x}}(t),\xi_{j}^{i})-\nabla L(\bar{\mathbf{x}}(t))\right\|^{2}\right]+\frac{2}{N^{2}}\sum_{j=1}^{N}\mathbb{E}\left[\left\|\nabla\ell(\bar{\mathbf{x}}(t),\xi_{j})-\nabla L(\bar{\mathbf{x}}(t))\right\|^{2}\right] (84)
(c)i=1n2miN1mi2miσ2+2N2Nσ2\displaystyle\stackrel{{\scriptstyle\rm{(c)}}}{{\leq}}\sum_{i=1}^{n}2\frac{m_{i}}{N}\frac{1}{m_{i}^{2}}m_{i}\sigma^{2}+\frac{2}{N^{2}}N\sigma^{2} (85)
=2(n+1)Nσ2,\displaystyle=\frac{2(n+1)}{N}\sigma^{2}, (86)

where in (a) we replaced the definitions of fi(𝐱¯(t))f_{i}(\bar{\mathbf{x}}(t)) and f(𝐱¯(t))f(\bar{\mathbf{x}}(t)) from (3) and (2), respectively, the equality in (b) holds since ξj\xi_{j}s are independent samples from the underlying distribution, and (c) follows from Assumption 3-(b) and the fact that ri=mi/Nr_{i}=m_{i}/N for i[n]i\in[n]. Finally, for the third term in (5.3), we have

𝔼[𝟏f(𝐱¯(t))𝐫2]=𝔼[i=1nrif(𝐱¯(t))2]=𝔼[f(𝐱¯(t))2].\displaystyle\mathbb{E}\left[\left\|\mathbf{1}\nabla f(\bar{\mathbf{x}}(t))\right\|_{\mathbf{r}}^{2}\right]=\mathbb{E}\left[\sum_{i=1}^{n}r_{i}\left\|\nabla f(\bar{\mathbf{x}}(t))\right\|^{2}\right]=\mathbb{E}\left[\left\|\nabla f(\bar{\mathbf{x}}(t))\right\|^{2}\right]. (87)

Plugging (77)–(87) in (5.3), we get

\displaystyle\mathbb{E}\left[\left\|\nabla f(X(t))\right\|_{\mathbf{r}}^{2}\right]\leq 3K^{2}\mathbb{E}\left[\left\|X(t)-\mathbf{1}\bar{\mathbf{x}}(t)\right\|_{\mathbf{r}}^{2}\right]+3\frac{2(n+1)}{N}\sigma^{2}+3\mathbb{E}\left[\left\|\nabla f(\bar{\mathbf{x}}(t))\right\|^{2}\right]. (88)

Next, replacing this bound in (73), we arrive at

t=1T\displaystyle\sum_{t=1}^{T} α(t)β(t)𝔼[X(t)𝟏𝐱¯(t)𝐫2]\displaystyle\alpha(t)\beta(t)\mathbb{E}\left[\left\|X(t)-\mathbf{1}\bar{\mathbf{x}}(t)\right\|_{\mathbf{r}}^{2}\right] (89)
2γκλt=1Tα(t)β2(t)+8κλ2t=1T[α3(t)β(t)𝔼[f(X(t))𝐫2]]\displaystyle\leq\frac{2\gamma\kappa}{\lambda}\sum_{t=1}^{T}\alpha(t)\beta^{2}(t)+\frac{8\kappa}{\lambda^{2}}\sum_{t=1}^{T}\left[\alpha^{3}(t)\beta(t)\mathbb{E}\left[\left\|\nabla f(X(t))\right\|_{\mathbf{r}}^{2}\right]\right] (90)
2γκλt=1Tα(t)β2(t)+24κK2λ2t=1Tα3(t)β(t)𝔼[X(t)𝟏𝐱¯(t)𝐫2]\displaystyle\leq\frac{2\gamma\kappa}{\lambda}\sum_{t=1}^{T}\alpha(t)\beta^{2}(t)+\frac{24\kappa K^{2}}{\lambda^{2}}\sum_{t=1}^{T}\alpha^{3}(t)\beta(t)\mathbb{E}\left[\left\|X(t)-\mathbf{1}\bar{\mathbf{x}}(t)\right\|_{\mathbf{r}}^{2}\right] (91)
+48κ(n+1)σ2Nλ2t=1Tα3(t)β(t)+24κλ2t=1Tα3(t)β(t)𝔼[f(𝐱¯(t))2].\displaystyle\phantom{\leq}+\frac{48\kappa(n+1)\sigma^{2}}{N\lambda^{2}}\sum_{t=1}^{T}\alpha^{3}(t)\beta(t)+\frac{24\kappa}{\lambda^{2}}\sum_{t=1}^{T}\alpha^{3}(t)\beta(t)\mathbb{E}\left[\left\|\nabla f(\bar{\mathbf{x}}(t))\right\|^{2}\right]. (92)

Now, let us define ϕi,j(T):=t=1Tαi(t)βj(t)\phi_{i,j}(T):=\sum_{t=1}^{T}\alpha^{i}(t)\beta^{j}(t).
Then, 2γκλt=1Tα(t)β2(t)=ϵ1ϕ1,2(T){\frac{2\gamma\kappa}{\lambda}\sum_{t=1}^{T}\alpha(t)\beta^{2}(t)=\epsilon_{1}\phi_{1,2}(T)} and 48κ(n+1)σ2Nλ2t=1Tα3(t)β(t)=ϵ2ϕ3,1(T){\frac{48\kappa(n+1)\sigma^{2}}{N\lambda^{2}}\sum_{t=1}^{T}\alpha^{3}(t)\beta(t)=\epsilon_{2}\phi_{3,1}(T)}, where ϵ1:=2γκλ\epsilon_{1}:=\frac{2\gamma\kappa}{\lambda} and ϵ2:=48κ(n+1)σ2Nλ2\epsilon_{2}:=\frac{48\kappa(n+1)\sigma^{2}}{N\lambda^{2}}. Furthermore, we set T0:=(14α0κ12Kλ)1νT_{0}:=\left\lceil\left(\frac{14\alpha_{0}\kappa^{\frac{1}{2}}K}{\lambda}\right)^{\frac{1}{\nu}}\right\rceil such that 24κK2λ2α2(T0)24196<12\frac{24\kappa K^{2}}{\lambda^{2}}\alpha^{2}(T_{0})\leq\frac{24}{196}<\frac{1}{2}. Then, for T>T0T>T_{0} we can rewrite (89) as

t=1T0α(t)β(t)\displaystyle\sum_{t=1}^{T_{0}}\alpha(t)\beta(t) 𝔼[X(t)𝟏𝐱¯(t)𝐫2]+t=T0+1Tα(t)β(t)𝔼[X(t)𝟏𝐱¯(t)𝐫2]\displaystyle\mathbb{E}\left[\left\|X(t)-\mathbf{1}\bar{\mathbf{x}}(t)\right\|_{\mathbf{r}}^{2}\right]+\sum_{t=T_{0}+1}^{T}\alpha(t)\beta(t)\mathbb{E}\left[\left\|X(t)-\mathbf{1}\bar{\mathbf{x}}(t)\right\|_{\mathbf{r}}^{2}\right]
ϵ1ϕ1,2(T)+ϵ2ϕ3,1(T)+24κλ2t=1Tα3(t)β(t)𝔼[f(𝐱¯(t))2]\displaystyle\leq\epsilon_{1}\phi_{1,2}(T)+\epsilon_{2}\phi_{3,1}(T)+\frac{24\kappa}{\lambda^{2}}\sum_{t=1}^{T}\alpha^{3}(t)\beta(t)\mathbb{E}\left[\left\|\nabla f(\bar{\mathbf{x}}(t))\right\|^{2}\right]
+24κK2λ2t=1T0α3(t)β(t)𝔼[X(t)𝟏𝐱¯(t)𝐫2]\displaystyle\phantom{=}+\frac{24\kappa K^{2}}{\lambda^{2}}\sum_{t=1}^{T_{0}}\alpha^{3}(t)\beta(t)\mathbb{E}\left[\left\|X(t)-\mathbf{1}\bar{\mathbf{x}}(t)\right\|_{\mathbf{r}}^{2}\right]
+24κK2λ2t=T0+1Tα3(t)β(t)𝔼[X(t)𝟏𝐱¯(t)𝐫2]\displaystyle\phantom{=}+\frac{24\kappa K^{2}}{\lambda^{2}}\sum_{t=T_{0}+1}^{T}\alpha^{3}(t)\beta(t)\mathbb{E}\left[\left\|X(t)-\mathbf{1}\bar{\mathbf{x}}(t)\right\|_{\mathbf{r}}^{2}\right]
ϵ1ϕ1,2(T)+ϵ2ϕ3,1(T)+24κλ2t=1Tα3(t)β(t)𝔼[f(𝐱¯(t))2]\displaystyle\leq\epsilon_{1}\phi_{1,2}(T)+\epsilon_{2}\phi_{3,1}(T)+\frac{24\kappa}{\lambda^{2}}\sum_{t=1}^{T}\alpha^{3}(t)\beta(t)\mathbb{E}\left[\left\|\nabla f(\bar{\mathbf{x}}(t))\right\|^{2}\right]
+24κK2λ2t=1T0α3(t)β(t)𝔼[X(t)𝟏𝐱¯(t)𝐫2]\displaystyle\phantom{=}+\frac{24\kappa K^{2}}{\lambda^{2}}\sum_{t=1}^{T_{0}}\alpha^{3}(t)\beta(t)\mathbb{E}\left[\left\|X(t)-\mathbf{1}\bar{\mathbf{x}}(t)\right\|_{\mathbf{r}}^{2}\right]
+24κK2λ2α2(T0)t=T0+1Tα(t)β(t)𝔼[X(t)𝟏𝐱¯(t)𝐫2]\displaystyle\phantom{=}+\frac{24\kappa K^{2}}{\lambda^{2}}\alpha^{2}(T_{0})\sum_{t=T_{0}+1}^{T}\alpha(t)\beta(t)\mathbb{E}\left[\left\|X(t)-\mathbf{1}\bar{\mathbf{x}}(t)\right\|_{\mathbf{r}}^{2}\right]
ϵ1ϕ1,2(T)+ϵ2ϕ3,1(T)+24κλ2t=1Tα3(t)β(t)𝔼[f(𝐱¯(t))2]\displaystyle\leq\epsilon_{1}\phi_{1,2}(T)+\epsilon_{2}\phi_{3,1}(T)+\frac{24\kappa}{\lambda^{2}}\sum_{t=1}^{T}\alpha^{3}(t)\beta(t)\mathbb{E}\left[\left\|\nabla f(\bar{\mathbf{x}}(t))\right\|^{2}\right]
+ϵ3+12t=T0+1Tα(t)β(t)𝔼[X(t)𝟏𝐱¯(t)𝐫2],\displaystyle\phantom{=}+\epsilon_{3}+\frac{1}{2}\sum_{t=T_{0}+1}^{T}\alpha(t)\beta(t)\mathbb{E}\left[\left\|X(t)-\mathbf{1}\bar{\mathbf{x}}(t)\right\|_{\mathbf{r}}^{2}\right],

where the second inequality holds since α(t)\alpha(t) is a non-increasing sequence, and

ϵ3:=24κK2λ2t=1T0α3(t)β(t)𝔼[X(t)𝟏𝐱¯(t)𝐫2]\displaystyle\epsilon_{3}:=\frac{24\kappa K^{2}}{\lambda^{2}}\sum_{t=1}^{T_{0}}\alpha^{3}(t)\beta(t)\mathbb{E}\left[\left\|X(t)-\mathbf{1}\bar{\mathbf{x}}(t)\right\|_{\mathbf{r}}^{2}\right] (93)

does not grow with TT. Therefore, we have

t=1Tα(t)β(t)𝔼[X(t)𝟏𝐱¯(t)𝐫2]\displaystyle\sum_{t=1}^{T}\alpha(t)\beta(t)\mathbb{E}\left[\left\|X(t)-\mathbf{1}\bar{\mathbf{x}}(t)\right\|_{\mathbf{r}}^{2}\right] (94)
2t=1T0α(t)β(t)𝔼[X(t)𝟏𝐱¯(t)𝐫2]+t=T0+1Tα(t)β(t)𝔼[X(t)𝟏𝐱¯(t)𝐫2]\displaystyle\leq 2\sum_{t=1}^{T_{0}}\alpha(t)\beta(t)\mathbb{E}\left[\left\|X(t)-\mathbf{1}\bar{\mathbf{x}}(t)\right\|_{\mathbf{r}}^{2}\right]+\sum_{t=T_{0}+1}^{T}\alpha(t)\beta(t)\mathbb{E}\left[\left\|X(t)-\mathbf{1}\bar{\mathbf{x}}(t)\right\|_{\mathbf{r}}^{2}\right] (95)
2ϵ1ϕ1,2(T)+2ϵ2ϕ3,1(T)+2ϵ3+48κλ2t=1Tα3(t)β(t)𝔼[f(𝐱¯(t))2].\displaystyle\leq 2\epsilon_{1}\phi_{1,2}(T)+2\epsilon_{2}\phi_{3,1}(T)+2\epsilon_{3}+\frac{48\kappa}{\lambda^{2}}\sum_{t=1}^{T}\alpha^{3}(t)\beta(t)\mathbb{E}\left[\left\|\nabla f(\bar{\mathbf{x}}(t))\right\|^{2}\right]. (96)

5.4 Back to the Main Dynamics

Recall the dynamics in (36), that is,

𝔼[f(𝐱¯(t+1))]\displaystyle\mathbb{E}\left[f\left(\bar{\mathbf{x}}(t+1)\right)\right] 𝔼[f(𝐱¯(t))]12α(t)β(t)𝔼[f(𝐱¯(t))2]\displaystyle\leq\mathbb{E}\left[f\left(\bar{\mathbf{x}}(t)\right)\right]-\frac{1}{2}\alpha(t)\beta(t)\mathbb{E}\left[\left\|\nabla f(\bar{\mathbf{x}}(t))\right\|^{2}\right]
12α(t)β(t)(1α(t)β(t)K)𝔼[𝐫Tf(X(t))2]\displaystyle\quad-\frac{1}{2}\alpha(t)\beta(t)\left(1-\alpha(t)\beta(t)K\right)\mathbb{E}\left[\left\|\mathbf{r}^{T}\nabla f(X(t))\right\|^{2}\right]
+α(t)β(t)K2𝔼[X(t)𝟏𝐱¯(t)𝐫2]+β2(t)K2γ.\displaystyle\quad+\alpha(t)\beta(t)K^{2}\mathbb{E}\left[\left\|X(t)-\mathbf{1}\bar{\mathbf{x}}(t)\right\|_{\mathbf{r}}^{2}\right]+\beta^{2}(t)\frac{K}{2}\gamma.

Summing (36) for t=1,2,,Tt=1,2,\dots,T, and using (94) we get

𝔼[f(𝐱¯(T+1))]\displaystyle\mathbb{E}\left[f\left(\bar{\mathbf{x}}(T+1)\right)\right] (97)
𝔼[f(𝐱¯(1))]12t=1Tα(t)β(t)𝔼[f(𝐱¯(t))2]\displaystyle\leq\mathbb{E}\left[f\left(\bar{\mathbf{x}}(1)\right)\right]-\frac{1}{2}\sum_{t=1}^{T}\alpha(t)\beta(t)\mathbb{E}\left[\left\|\nabla f(\bar{\mathbf{x}}(t))\right\|^{2}\right] (98)
12t=1Tα(t)β(t)(1α(t)β(t)K)𝔼[𝐫Tf(X(t))2]\displaystyle\quad-\frac{1}{2}\sum_{t=1}^{T}\alpha(t)\beta(t)\left(1-\alpha(t)\beta(t)K\right)\mathbb{E}\left[\left\|\mathbf{r}^{T}\nabla f(X(t))\right\|^{2}\right] (99)
+K2t=1Tα(t)β(t)𝔼[X(t)𝟏𝐱¯(t)𝐫2]+K2γt=1Tβ2(t)\displaystyle\quad+K^{2}\sum_{t=1}^{T}\alpha(t)\beta(t)\mathbb{E}\left[\left\|X(t)-\mathbf{1}\bar{\mathbf{x}}(t)\right\|_{\mathbf{r}}^{2}\right]+\frac{K}{2}\gamma\sum_{t=1}^{T}\beta^{2}(t) (100)
𝔼[f(𝐱¯(1))]12t=1Tα(t)β(t)𝔼[f(𝐱¯(t))2]+K2γϕ0,2(T)\displaystyle\leq\mathbb{E}\left[f\left(\bar{\mathbf{x}}(1)\right)\right]-\frac{1}{2}\sum_{t=1}^{T}\alpha(t)\beta(t)\mathbb{E}\left[\left\|\nabla f(\bar{\mathbf{x}}(t))\right\|^{2}\right]+\frac{K}{2}\gamma\phi_{0,2}(T) (101)
12t=1Tα(t)β(t)(1α(t)β(t)K)𝔼[𝐫Tf(X(t))2]\displaystyle\quad-\frac{1}{2}\sum_{t=1}^{T}\alpha(t)\beta(t)\left(1-\alpha(t)\beta(t)K\right)\mathbb{E}\left[\left\|\mathbf{r}^{T}\nabla f(X(t))\right\|^{2}\right] (102)
+K2[2ϵ1ϕ1,2(T)+2ϵ2ϕ3,1(T)+2ϵ3+48κλ2t=1Tα3(t)β(t)𝔼[f(𝐱¯(t))2]],\displaystyle\quad+K^{2}\left[2\epsilon_{1}\phi_{1,2}(T)+2\epsilon_{2}\phi_{3,1}(T)+2\epsilon_{3}+\frac{48\kappa}{\lambda^{2}}\sum_{t=1}^{T}\alpha^{3}(t)\beta(t)\mathbb{E}\left[\left\|\nabla f(\bar{\mathbf{x}}(t))\right\|^{2}\right]\right], (103)

where \phi_{0,2}(T)=\sum_{t=1}^{T}\beta^{2}(t). Recall that T_{0}=\left\lceil\left(\frac{14\alpha_{0}\kappa^{\frac{1}{2}}K}{\lambda}\right)^{\frac{1}{\nu}}\right\rceil. Moreover, for \lambda and \kappa defined in Lemma 5 we have \lambda\leq 1 and \kappa>1. Therefore, for t>T_{0} we have

α(t)β(t)Kα(T0)β(T0)Kα(T0)Kλ14κ12<1.\alpha(t)\beta(t)K\leq\alpha(T_{0})\beta(T_{0})K\leq\alpha(T_{0})K\leq\frac{\lambda}{14\kappa^{\frac{1}{2}}}<1.

This allows us to lower bound the second summation in (97) for T>T0T>T_{0} as

12\displaystyle\frac{1}{2} t=1Tα(t)β(t)(1α(t)β(t)K)𝔼[𝐫Tf(X(t))2]\displaystyle\sum_{t=1}^{T}\alpha(t)\beta(t)\left(1-\alpha(t)\beta(t)K\right)\mathbb{E}\left[\left\|\mathbf{r}^{T}\nabla f(X(t))\right\|^{2}\right] (104)
=12t=1T0α(t)β(t)(1α(t)β(t)K)𝔼[𝐫Tf(X(t))2]\displaystyle=\frac{1}{2}\sum_{t=1}^{T_{0}}\alpha(t)\beta(t)\left(1-\alpha(t)\beta(t)K\right)\mathbb{E}\left[\left\|\mathbf{r}^{T}\nabla f(X(t))\right\|^{2}\right] (105)
+12t=T0+1Tα(t)β(t)(1α(t)β(t)K)𝔼[𝐫Tf(X(t))2]\displaystyle\quad+\frac{1}{2}\sum_{t=T_{0}+1}^{T}\alpha(t)\beta(t)\left(1-\alpha(t)\beta(t)K\right)\mathbb{E}\left[\left\|\mathbf{r}^{T}\nabla f(X(t))\right\|^{2}\right] (106)
12t=1T0α(t)β(t)(1α(t)β(t)K)𝔼[𝐫Tf(X(t))2]:=ϵ4,\displaystyle\geq\frac{1}{2}\sum_{t=1}^{T_{0}}\alpha(t)\beta(t)\left(1-\alpha(t)\beta(t)K\right)\mathbb{E}\left[\left\|\mathbf{r}^{T}\nabla f(X(t))\right\|^{2}\right]:=\epsilon_{4}, (107)

where \epsilon_{4} does not grow with T, and the inequality holds since 1-\alpha(t)\beta(t)K\geq 0 for t>T_{0}. Similarly, by the choice of T_{0}=\left\lceil\left(\frac{14\alpha_{0}\kappa^{\frac{1}{2}}K}{\lambda}\right)^{\frac{1}{\nu}}\right\rceil, we have \frac{48\kappa K^{2}}{\lambda^{2}}\alpha^{2}(T_{0})\leq\frac{48}{196}<\frac{1}{4}. Therefore, for T>T_{0}, the last summation in (97) can be upper bounded by

48κK2λ2t=1Tα3(t)β(t)𝔼[f(𝐱¯(t))2]\displaystyle\frac{48\kappa K^{2}}{\lambda^{2}}\sum_{t=1}^{T}\alpha^{3}(t)\beta(t)\mathbb{E}\left[\left\|\nabla f(\bar{\mathbf{x}}(t))\right\|^{2}\right] (108)
=48κK2λ2t=1T0α3(t)β(t)𝔼[f(𝐱¯(t))2]+48κK2λ2t=T0+1Tα3(t)β(t)𝔼[f(𝐱¯(t))2]\displaystyle=\frac{48\kappa K^{2}}{\lambda^{2}}\sum_{t=1}^{T_{0}}\alpha^{3}(t)\beta(t)\mathbb{E}\left[\left\|\nabla f(\bar{\mathbf{x}}(t))\right\|^{2}\right]+\frac{48\kappa K^{2}}{\lambda^{2}}\sum_{t=T_{0}+1}^{T}\alpha^{3}(t)\beta(t)\mathbb{E}\left[\left\|\nabla f(\bar{\mathbf{x}}(t))\right\|^{2}\right] (109)
ϵ5+48κK2λ2α2(T0)t=T0+1Tα(t)β(t)𝔼[f(𝐱¯(t))2]\displaystyle\leq\epsilon_{5}+\frac{48\kappa K^{2}}{\lambda^{2}}\alpha^{2}(T_{0})\sum_{t=T_{0}+1}^{T}\alpha(t)\beta(t)\mathbb{E}\left[\left\|\nabla f(\bar{\mathbf{x}}(t))\right\|^{2}\right] (110)
ϵ5+14t=T0+1Tα(t)β(t)𝔼[f(𝐱¯(t))2],\displaystyle\leq\epsilon_{5}+\frac{1}{4}\sum_{t=T_{0}+1}^{T}\alpha(t)\beta(t)\mathbb{E}\left[\left\|\nabla f(\bar{\mathbf{x}}(t))\right\|^{2}\right], (111)

where

\displaystyle\epsilon_{5}:=\frac{48\kappa K^{2}}{\lambda^{2}}\sum_{t=1}^{T_{0}}\alpha^{3}(t)\beta(t)\mathbb{E}\left[\left\|\nabla f(\bar{\mathbf{x}}(t))\right\|^{2}\right], (112)

does not depend on TT. Plugging (104) and (108) in (97), for T>T0T>T_{0}, we get

𝔼[f(𝐱¯(T+1))]\displaystyle\mathbb{E}\left[f\left(\bar{\mathbf{x}}(T+1)\right)\right] 𝔼[f(𝐱¯(1))]+K2ϕ0,2(T)+2K2(ϵ1ϕ1,2(T)+ϵ2ϕ3,1(T)+ϵ3)\displaystyle\leq\mathbb{E}\left[f\left(\bar{\mathbf{x}}(1)\right)\right]+\frac{K}{2}\phi_{0,2}(T)+2K^{2}\left(\epsilon_{1}\phi_{1,2}(T)+\epsilon_{2}\phi_{3,1}(T)+\epsilon_{3}\right)
12t=1Tα(t)β(t)(1α(t)β(t)K)𝔼[𝐫Tf(X(t))2]\displaystyle\phantom{=}-\frac{1}{2}\sum_{t=1}^{T}\alpha(t)\beta(t)\left(1-\alpha(t)\beta(t)K\right)\mathbb{E}\left[\left\|\mathbf{r}^{T}\nabla f(X(t))\right\|^{2}\right]
12t=1Tα(t)β(t)𝔼[f(𝐱¯(t))2]+48κK2λ2t=1Tα3(t)β(t)𝔼[f(𝐱¯(t))2]\displaystyle\phantom{=}-\frac{1}{2}\sum_{t=1}^{T}\alpha(t)\beta(t)\mathbb{E}\left[\left\|\nabla f(\bar{\mathbf{x}}(t))\right\|^{2}\right]+\frac{48\kappa K^{2}}{\lambda^{2}}\sum_{t=1}^{T}\alpha^{3}(t)\beta(t)\mathbb{E}\left[\left\|\nabla f(\bar{\mathbf{x}}(t))\right\|^{2}\right]
𝔼[f(𝐱¯(1))]+K2ϕ0,2(T)+2K2(ϵ1ϕ1,2(T)+ϵ2ϕ3,1(T)+ϵ3)ϵ4\displaystyle\leq\mathbb{E}\left[f\left(\bar{\mathbf{x}}(1)\right)\right]+\frac{K}{2}\phi_{0,2}(T)+2K^{2}\left(\epsilon_{1}\phi_{1,2}(T)+\epsilon_{2}\phi_{3,1}(T)+\epsilon_{3}\right)-\epsilon_{4}
12t=1T0α(t)β(t)𝔼[f(𝐱¯(t))2]12t=T0+1Tα(t)β(t)𝔼[f(𝐱¯(t))2]\displaystyle\phantom{=}-\frac{1}{2}\sum_{t=1}^{T_{0}}\alpha(t)\beta(t)\mathbb{E}\left[\left\|\nabla f(\bar{\mathbf{x}}(t))\right\|^{2}\right]-\frac{1}{2}\sum_{t=T_{0}+1}^{T}\alpha(t)\beta(t)\mathbb{E}\left[\left\|\nabla f(\bar{\mathbf{x}}(t))\right\|^{2}\right]
+ϵ5+14t=T0+1Tα(t)β(t)𝔼[f(𝐱¯(t))2]\displaystyle\phantom{=}+\epsilon_{5}+\frac{1}{4}\sum_{t=T_{0}+1}^{T}\alpha(t)\beta(t)\mathbb{E}\left[\left\|\nabla f(\bar{\mathbf{x}}(t))\right\|^{2}\right]
𝔼[f(𝐱¯(1))]+K2ϕ0,2(T)+2K2(ϵ1ϕ1,2(T)+ϵ2ϕ3,1(T)+ϵ3)ϵ4+ϵ5\displaystyle\leq\mathbb{E}\left[f\left(\bar{\mathbf{x}}(1)\right)\right]+\frac{K}{2}\phi_{0,2}(T)+2K^{2}\left(\epsilon_{1}\phi_{1,2}(T)+\epsilon_{2}\phi_{3,1}(T)+\epsilon_{3}\right)-\epsilon_{4}+\epsilon_{5}
14t=1Tα(t)β(t)𝔼[f(𝐱¯(t))2].\displaystyle\phantom{=}-\frac{1}{4}\sum_{t=1}^{T}\alpha(t)\beta(t)\mathbb{E}\left[\left\|\nabla f(\bar{\mathbf{x}}(t))\right\|^{2}\right].

Hence,

t=1Tα(t)β(t)𝔼[f(𝐱¯(t))2]\displaystyle\sum_{t=1}^{T}\alpha(t)\beta(t)\mathbb{E}\left[\left\|\nabla f(\bar{\mathbf{x}}(t))\right\|^{2}\right] (113)
4𝔼[f(𝐱¯(1))]4𝔼[f(𝐱¯(T+1))]+2Kϕ0,2(T)+8K2(ϵ1ϕ1,2(T)+ϵ2ϕ3,1(T)+ϵ3)4ϵ4+4ϵ5\displaystyle\leq 4\mathbb{E}\left[f\left(\bar{\mathbf{x}}(1)\right)\right]-4\mathbb{E}\left[f\left(\bar{\mathbf{x}}(T+1)\right)\right]\!+\!2K\phi_{0,2}(T)+8K^{2}\left(\epsilon_{1}\phi_{1,2}(T)+\epsilon_{2}\phi_{3,1}(T)+\epsilon_{3}\right)\!-\!4\epsilon_{4}\!+\!4\epsilon_{5} (114)
4𝔼[f(𝐱¯(1))]4𝔼[f(𝐱)]+2Kϕ0,2(T)+8K2(ϵ1ϕ1,2(T)+ϵ2ϕ3,1(T)+ϵ3)4ϵ4+4ϵ5\displaystyle\leq 4\mathbb{E}\left[f\left(\bar{\mathbf{x}}(1)\right)\right]-4\mathbb{E}\left[f(\mathbf{x}^{\star})\right]+2K\phi_{0,2}(T)+8K^{2}\left(\epsilon_{1}\phi_{1,2}(T)+\epsilon_{2}\phi_{3,1}(T)+\epsilon_{3}\right)-4\epsilon_{4}+4\epsilon_{5} (115)
=ϵ6+2Kϕ0,2(T)+8K2ϵ1ϕ1,2(T)+8K2ϵ2ϕ3,1(T)\displaystyle=\epsilon_{6}+2K\phi_{0,2}(T)+8K^{2}\epsilon_{1}\phi_{1,2}(T)+8K^{2}\epsilon_{2}\phi_{3,1}(T) (116)
ϵ6+(2K+8K2ϵ1α0)ϕ0,2(T)+8K2ϵ2ϕ3,1(T),\displaystyle\leq\epsilon_{6}+(2K+8K^{2}\epsilon_{1}\alpha_{0})\phi_{0,2}(T)+8K^{2}\epsilon_{2}\phi_{3,1}(T), (117)

where

ϵ6\displaystyle\epsilon_{6} :=8K2ϵ34ϵ4+4ϵ5+4𝔼[f(𝐱¯(1))]4𝔼[f(𝐱)]\displaystyle:=8K^{2}\epsilon_{3}-4\epsilon_{4}+4\epsilon_{5}+4\mathbb{E}\left[f\left(\bar{\mathbf{x}}(1)\right)\right]-4\mathbb{E}\left[f(\mathbf{x}^{\star})\right]
\displaystyle=\frac{192\kappa K^{2}}{\lambda^{2}}\sum_{t=1}^{T_{0}}\alpha^{3}(t)\beta(t)\left[K^{2}\mathbb{E}\left[\left\|X(t)-\mathbf{1}\bar{\mathbf{x}}(t)\right\|_{\mathbf{r}}^{2}\right]+\mathbb{E}\left[\left\|\nabla f(\bar{\mathbf{x}}(t))\right\|^{2}\right]\right]
2t=1T0α(t)β(t)(1α(t)β(t)K)𝔼[𝐫Tf(X(t))2]+4(𝔼[f(𝐱¯(1))]𝔼[f(𝐱)])\displaystyle\phantom{=}-2\sum_{t=1}^{T_{0}}\alpha(t)\beta(t)\left(1-\alpha(t)\beta(t)K\right)\mathbb{E}\left[\left\|\mathbf{r}^{T}\nabla f(X(t))\right\|^{2}\right]+4\left(\mathbb{E}\left[f\left(\bar{\mathbf{x}}(1)\right)\right]-\mathbb{E}\left[f(\mathbf{x}^{\star})\right]\right)

is constant (does not depend on TT), and the last inequality in (117) follows from the fact that

ϕ1,2(T)=t=1Tα(t)β2(t)=α0t=1T1(t+τ)νβ2(t)α0t=1Tβ2(t)=α0ϕ0,2(T).\displaystyle\phi_{1,2}(T)=\sum_{t=1}^{T}\alpha(t)\beta^{2}(t)=\alpha_{0}\sum_{t=1}^{T}\frac{1}{(t+\tau)^{\nu}}\beta^{2}(t)\leq\alpha_{0}\sum_{t=1}^{T}\beta^{2}(t)=\alpha_{0}\phi_{0,2}(T). (118)

5.5 Bound on the Moments of 𝔼[f(𝐱¯(t))2]\mathbb{E}\left[\left\|\nabla f(\bar{\mathbf{x}}(t))\right\|^{2}\right] and 𝔼[X(t)𝟏𝐱¯(t)𝐫2]\mathbb{E}\left[\left\|X(t)-\mathbf{1}\bar{\mathbf{x}}(t)\right\|_{\mathbf{r}}^{2}\right]

The inequality in (117) provides us with an upper bound on the temporal average of sequence {α(t)β(t)𝔼[f(𝐱¯(t))2]}\big{\{}\alpha(t)\beta(t)\mathbb{E}[\left\|\nabla f(\bar{\mathbf{x}}(t))\right\|^{2}]\big{\}}. However, our goal is to derive a bound on the temporal average of {𝔼[f(𝐱¯(t))2]}\big{\{}\mathbb{E}[\left\|\nabla f(\bar{\mathbf{x}}(t))\right\|^{2}]\big{\}}. To this end, for any given θ(0,1)\theta\in(0,1), we define the measure of convergence as

Mθ(ν,μ):=[1Tt=1T(𝔼[f(𝐱¯(t))2])θ]1θ.\displaystyle M_{\theta}(\nu,\mu):=\left[\frac{1}{T}\sum_{t=1}^{T}\left(\mathbb{E}\left[\left\|\nabla f(\bar{\mathbf{x}}(t))\right\|^{2}\right]\right)^{\theta}\right]^{\frac{1}{\theta}}.

Note that by Hölder’s inequality [50, Theorem 6.2] for any p,q>1p,q>1 with 1p+1q=1\frac{1}{p}+\frac{1}{q}=1, and non-negative sequences {at}t=1T\{a_{t}\}_{t=1}^{T} and {bt}t=1T\{b_{t}\}_{t=1}^{T}, we have

t=1Tatbt(t=1Tatp)1p(t=1Tbtq)1q,\displaystyle\sum_{t=1}^{T}a_{t}b_{t}\leq\left(\sum_{t=1}^{T}a_{t}^{p}\right)^{\frac{1}{p}}\left(\sum_{t=1}^{T}b_{t}^{q}\right)^{\frac{1}{q}},

or equivalently,

(t=1Tatbt)q(t=1Tatp)qp(t=1Tbtq).\displaystyle\left(\sum_{t=1}^{T}a_{t}b_{t}\right)^{q}\leq\left(\sum_{t=1}^{T}a_{t}^{p}\right)^{\frac{q}{p}}\left(\sum_{t=1}^{T}b_{t}^{q}\right). (119)
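A small numerical illustration of (119), with the conjugate pair (p,q)=\left(\frac{1}{1-\theta},\frac{1}{\theta}\right) that is used next; the sequences and the value of \theta are arbitrary.

import numpy as np

rng = np.random.default_rng(1)
a, b = rng.random(100), rng.random(100)      # arbitrary non-negative sequences
theta = 0.6
p, q = 1.0 / (1.0 - theta), 1.0 / theta      # conjugate exponents, 1/p + 1/q = 1
lhs = np.sum(a * b) ** q
rhs = np.sum(a ** p) ** (q / p) * np.sum(b ** q)
print(lhs <= rhs)                            # Hoelder's inequality raised to the power q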

Let (p,q)=(11θ,1θ)(p,q)=\left(\frac{1}{1-\theta},\frac{1}{\theta}\right) so that 1p+1q=1\frac{1}{p}+\frac{1}{q}=1. Furthermore, let

at=(1α(t)β(t))θ=1(α0β0)θ(t+τ)(μ+ν)θbt=(α(t)β(t)𝔼[f(𝐱¯(t))2])θ.\displaystyle\begin{split}a_{t}&=\left(\frac{1}{\alpha(t)\beta(t)}\right)^{\theta}=\frac{1}{(\alpha_{0}\beta_{0})^{\theta}}(t+\tau)^{(\mu+\nu)\theta}\\ b_{t}&=\left(\alpha(t)\beta(t)\mathbb{E}\left[\left\|\nabla f(\bar{\mathbf{x}}(t))\right\|^{2}\right]\right)^{\theta}.\end{split} (120)

Then, applying Hölder’s inequality (119), we arrive at

Mθ(ν,μ)\displaystyle M_{\theta}(\nu,\mu) =[1Tt=1T(𝔼[f(𝐱¯(t))2])θ]1θ=(1Tt=1Tatbt)q1Tq(t=1Tatp)qp(t=1Tbtq).\displaystyle=\left[\frac{1}{T}\sum_{t=1}^{T}\left(\mathbb{E}\left[\left\|\nabla f(\bar{\mathbf{x}}(t))\right\|^{2}\right]\right)^{\theta}\right]^{\frac{1}{\theta}}=\left(\frac{1}{T}\sum_{t=1}^{T}a_{t}b_{t}\right)^{q}\leq\frac{1}{T^{q}}\left(\sum_{t=1}^{T}a_{t}^{p}\right)^{\frac{q}{p}}\left(\sum_{t=1}^{T}b_{t}^{q}\right). (121)

It remains to upper bound the terms in the right hand side of (121). First, using Lemma 3 we get

t=1Tatp=1(α0β0)θ1θt=1T(t+τ)(ν+μ)θ1θ21+(ν+μ)θ1θ(α0β0)θ1θ(1+(ν+μ)θ1θ)(T+τ)1+(ν+μ)θ1θ.\displaystyle\sum_{t=1}^{T}a_{t}^{p}=\frac{1}{(\alpha_{0}\beta_{0})^{\frac{\theta}{1-\theta}}}\sum_{t=1}^{T}(t+\tau)^{\frac{(\nu+\mu)\theta}{1-\theta}}\leq\frac{2^{1+\frac{(\nu+\mu)\theta}{1-\theta}}}{(\alpha_{0}\beta_{0})^{\frac{\theta}{1-\theta}}\left(1+\frac{(\nu+\mu)\theta}{1-\theta}\right)}(T+\tau)^{1+\frac{(\nu+\mu)\theta}{1-\theta}}.

Therefore,

(t=1Tatp)qp21θθ+(ν+μ)α0β0(1+(ν+μ)θ1θ)1θθ(T+τ)1θθ+ν+μ:=ϵ7(θ)(T+τ)1θθ+ν+μ.\displaystyle\left(\sum_{t=1}^{T}a_{t}^{p}\right)^{\frac{q}{p}}\leq\frac{2^{\frac{1-\theta}{\theta}+(\nu+\mu)}}{\alpha_{0}\beta_{0}\left(1+\frac{(\nu+\mu)\theta}{1-\theta}\right)^{\frac{1-\theta}{\theta}}}(T+\tau)^{\frac{1-\theta}{\theta}+\nu+\mu}:=\epsilon_{7}(\theta)(T+\tau)^{\frac{1-\theta}{\theta}+\nu+\mu}. (122)
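The step from the previous display to (122) rests on a Lemma 3-type estimate of the form \sum_{t=1}^{T}(t+\tau)^{c}\leq\frac{2^{1+c}}{1+c}(T+\tau)^{1+c} for c\geq 0 (displayed above with c=\frac{(\nu+\mu)\theta}{1-\theta}); a quick numerical check for a few illustrative exponents:

import numpy as np

def power_sum_vs_bound(T, c, tau=1.0):
    t = np.arange(1, T + 1)
    lhs = np.sum((t + tau) ** c)
    rhs = 2.0 ** (1 + c) / (1 + c) * (T + tau) ** (1 + c)
    return lhs, rhs

for c in [0.5, 1.0, 2.5]:
    lhs, rhs = power_sum_vs_bound(10**4, c)
    print(c, lhs <= rhs)                     # True for each tested exponent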

Next, continuing from (117), for TT0T\geq T_{0} we can write

t=1Tbtq\displaystyle\sum_{t=1}^{T}b_{t}^{q} =t=1Tα(t)β(t)𝔼[f(𝐱¯(t))2]\displaystyle=\sum_{t=1}^{T}\alpha(t)\beta(t)\mathbb{E}\left[\left\|\nabla f(\bar{\mathbf{x}}(t))\right\|^{2}\right] (123)
ϵ6+(2K+8K2ϵ1α0)ϕ0,2(T)+8K2ϵ2ϕ3,1(T)\displaystyle\leq\epsilon_{6}+(2K+8K^{2}\epsilon_{1}\alpha_{0})\phi_{0,2}(T)+8K^{2}\epsilon_{2}\phi_{3,1}(T) (124)
max{3ϵ6,(6K+24K2ϵ1α0)ϕ0,2(T),24K2ϵ2ϕ3,1(T)},\displaystyle\leq\max\left\{3\epsilon_{6},(6K+24K^{2}\epsilon_{1}\alpha_{0})\phi_{0,2}(T),24K^{2}\epsilon_{2}\phi_{3,1}(T)\right\}, (125)

where the last inequality follows from the fact that a+b+cmax{3a,3b,3c}a+b+c\leq\max\{3a,3b,3c\}. Next, we can use Lemma 3 to bound ϕ0,2(T)\phi_{0,2}(T), as

ϕ0,2(T)=t=1Tβ2(t)=1β02t=1T(t+τ)2μ{τ12μβ02|12μ|if μ>1/2,1β02ln(Tτ+1)if μ=1/2,2β02(12μ)(T+τ)12μif 0μ<1/2,\displaystyle\phi_{0,2}(T)=\sum_{t=1}^{T}\beta^{2}(t)=\frac{1}{\beta_{0}^{2}}\sum_{t=1}^{T}(t+\tau)^{-2\mu}\leq\begin{cases}\frac{\tau^{1-2\mu}}{\beta_{0}^{2}|1-2\mu|}&\textrm{if $\mu>1/2$},\\ \frac{1}{\beta_{0}^{2}}\ln\left(\frac{T}{\tau}+1\right)&\textrm{if $\mu=1/2$},\\ \frac{2}{\beta_{0}^{2}(1-2\mu)}(T+\tau)^{1-2\mu}&\textrm{if $0\leq\mu<1/2$},\end{cases} (126)

where we used the fact that 212μ22^{1-2\mu}\leq 2 for 0μ<10\leq\mu<1. Similarly, applying Lemma 3 on ϕ3,1(T)\phi_{3,1}(T), we get

ϕ3,1(T)\displaystyle\phi_{3,1}(T) =t=1Tα3(t)β(t)=1α03β0t=1T(t+τ)3νμ\displaystyle=\sum_{t=1}^{T}\alpha^{3}(t)\beta(t)=\frac{1}{\alpha_{0}^{3}\beta_{0}}\sum_{t=1}^{T}(t+\tau)^{-3\nu-\mu} (127)
{τ13νμα03β0|13νμ|if 3ν+μ>1,1α03β0ln(Tτ+1)if 3ν+μ=1,2α03β0(13νμ)(T+τ)13νμif 03ν+μ<1,\displaystyle\leq\begin{cases}\frac{\tau^{1-3\nu-\mu}}{\alpha_{0}^{3}\beta_{0}|1-3\nu-\mu|}&\textrm{if $3\nu+\mu>1$},\\ \frac{1}{\alpha_{0}^{3}\beta_{0}}\ln\left(\frac{T}{\tau}+1\right)&\textrm{if $3\nu+\mu=1$},\\ \frac{2}{\alpha_{0}^{3}\beta_{0}(1-3\nu-\mu)}(T+\tau)^{1-3\nu-\mu}&\textrm{if $0\leq 3\nu+\mu<1$},\end{cases} (128)

in which we have upper bounded 2^{1-3\nu-\mu} by 2, for 0\leq 3\nu+\mu<1. Note that the upper bound in (123) is the maximum of a constant and two \phi_{\cdot,\cdot}(T) functions. Moreover, depending on the values of \mu and \nu, each \phi_{\cdot,\cdot}(T) can be upper bounded by a constant, a logarithmic function, or a polynomial (in T). Figure 4 illustrates the four regions of (\mu,\nu). In the following, we first analyze the interior of the four regions shown in Figure 4, and then study the boundary case.

Figure 4: Regions of (μ,ν)(\mu,\nu).
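Since the case analysis below hinges on whether \phi_{0,2}(T) and \phi_{3,1}(T) stay bounded, grow logarithmically, or grow polynomially in T, the following sketch evaluates both sums for one representative (\nu,\mu) pair from each region of Figure 4 and for the boundary point; the choices \alpha_0=\beta_0=\tau=1 are for illustration only.

import numpy as np

def phi(T, a, b, nu, mu, alpha0=1.0, beta0=1.0, tau=1.0):
    # phi_{a,b}(T) = sum_t alpha(t)^a * beta(t)^b with alpha(t)=alpha0/(t+tau)^nu, beta(t)=beta0/(t+tau)^mu
    t = np.arange(1, T + 1)
    return np.sum((alpha0 / (t + tau) ** nu) ** a * (beta0 / (t + tau) ** mu) ** b)

for label, nu, mu in [("Region I", 0.2, 0.6), ("Region II", 0.4, 0.4),
                      ("Region III", 0.1, 0.6), ("Region IV", 0.1, 0.3),
                      ("boundary", 1 / 6, 1 / 2)]:
    print(label,
          [round(phi(T, 0, 2, nu, mu), 2) for T in (10**3, 10**5)],   # phi_{0,2}(T)
          [round(phi(T, 3, 1, nu, mu), 2) for T in (10**3, 10**5)])   # phi_{3,1}(T)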

Region 𝖨\mathsf{I}: μ>1/2\mu>1/2 and 3ν+μ>13\nu+\mu>1
Recall from (126) and (127) that in this region both \phi_{0,2}(T) and \phi_{3,1}(T) are upper bounded by constants. Hence, from (123) we have

t=1Tbtq\displaystyle\sum_{t=1}^{T}b_{t}^{q} max{3ϵ6,(6K+24K2ϵ1α0)ϕ0,2(T),24K2ϵ2ϕ3,1(T)}\displaystyle\leq\max\left\{3\epsilon_{6},(6K+24K^{2}\epsilon_{1}\alpha_{0})\phi_{0,2}(T),24K^{2}\epsilon_{2}\phi_{3,1}(T)\right\} (129)
max{3ϵ6,(6K+24K2ϵ1α0)τ12μβ02|12μ|,24K2ϵ2τ13νμα03β0|13νμ|}\displaystyle\leq\max\left\{3\epsilon_{6},(6K+24K^{2}\epsilon_{1}\alpha_{0})\frac{\tau^{1-2\mu}}{\beta_{0}^{2}|1-2\mu|},24K^{2}\epsilon_{2}\frac{\tau^{1-3\nu-\mu}}{\alpha_{0}^{3}\beta_{0}|1-3\nu-\mu|}\right\} (130)
:=ϵ8.\displaystyle:=\epsilon_{8}. (131)

Note that ϵ8\epsilon_{8} does not depend on TT. Plugging (122) and (129) in (121), we arrive at

Mθ(ν,μ)1T1/θϵ7(θ)ϵ8(T+τ)1θθ+ν+μ=𝒪(T(1νμ)).\displaystyle M_{\theta}(\nu,\mu)\leq\frac{1}{T^{1/\theta}}\epsilon_{7}(\theta)\cdot\epsilon_{8}\cdot(T+\tau)^{\frac{1-\theta}{\theta}+\nu+\mu}=\mathcal{O}\left(T^{-(1-\nu-\mu)}\right). (132)

Region 𝖨𝖨\mathsf{II}: μ<1/2\mu<1/2 and 3ν+μ>13\nu+\mu>1
From (126) and (127), we can see that \phi_{0,2}(T) and \phi_{3,1}(T) are upper bounded by a polynomial and a constant function of T, respectively. Since for sufficiently large T a polynomial dominates any constant, we can write

t=1Tbtq\displaystyle\sum_{t=1}^{T}b_{t}^{q} max{3ϵ6,(6K+24K2ϵ1α0)ϕ0,2(T),24K2ϵ2ϕ3,1(T)}\displaystyle\leq\max\left\{3\epsilon_{6},(6K+24K^{2}\epsilon_{1}\alpha_{0})\phi_{0,2}(T),24K^{2}\epsilon_{2}\phi_{3,1}(T)\right\} (133)
12K+48K2ϵ1α0β02(12μ)(T+τ)12μ\displaystyle\leq\frac{12K+48K^{2}\epsilon_{1}\alpha_{0}}{\beta_{0}^{2}(1-2\mu)}(T+\tau)^{1-2\mu} (134)
:=ϵ9(T+τ)12μ.\displaystyle:=\epsilon_{9}\cdot(T+\tau)^{1-2\mu}. (135)

This together with (121) and (122) implies

Mθ(ν,μ)1T1/θϵ7(θ)(T+τ)1θθ+ν+μϵ9(T+τ)12μ=𝒪(T(μν)).\displaystyle M_{\theta}(\nu,\mu)\leq\frac{1}{T^{1/\theta}}\epsilon_{7}(\theta)(T+\tau)^{\frac{1-\theta}{\theta}+\nu+\mu}\cdot\epsilon_{9}\cdot(T+\tau)^{1-2\mu}=\mathcal{O}\left(T^{-(\mu-\nu)}\right). (136)

Region 𝖨𝖨𝖨\mathsf{III}: μ>1/2\mu>1/2 and 3ν+μ<13\nu+\mu<1
Recall from (126) and (127) that ϕ0,2(T)\phi_{0,2}(T) and ϕ3,1(T)\phi_{3,1}(T) are upper bounded by a constant and a polynomial function of TT, respectively. Therefore, for sufficiently large TT, we get

t=1Tbtq\displaystyle\sum_{t=1}^{T}b_{t}^{q} max{3ϵ6,(6K+24K2ϵ1α0)ϕ0,2(T),24K2ϵ2ϕ3,1(T)}\displaystyle\leq\max\left\{3\epsilon_{6},(6K+24K^{2}\epsilon_{1}\alpha_{0})\phi_{0,2}(T),24K^{2}\epsilon_{2}\phi_{3,1}(T)\right\} (137)
(48K2ϵ2α03β0(13νμ))(T+τ)13νμ\displaystyle\leq\left(\frac{48K^{2}\epsilon_{2}}{\alpha_{0}^{3}\beta_{0}(1-3\nu-\mu)}\right)(T+\tau)^{1-3\nu-\mu} (138)
:=ϵ10(T+τ)13νμ.\displaystyle:=\epsilon_{10}\cdot(T+\tau)^{1-3\nu-\mu}. (139)

Therefore, plugging (137) and (122) to (121), we get

Mθ(ν,μ)1T1/θϵ7(θ)(T+τ)1θθ+ν+μϵ10(T+τ)13νμ=𝒪(T2ν).\displaystyle M_{\theta}(\nu,\mu)\leq\frac{1}{T^{1/\theta}}\epsilon_{7}(\theta)(T+\tau)^{\frac{1-\theta}{\theta}+\nu+\mu}\cdot\epsilon_{10}\cdot(T+\tau)^{1-3\nu-\mu}=\mathcal{O}\left(T^{-2\nu}\right). (140)

Region 𝖨𝖵\mathsf{IV}: μ<1/2\mu<1/2 and 3ν+μ<13\nu+\mu<1
In this region, we can use (126) and (127) to upper bound both ϕ0,2(T)\phi_{0,2}(T) and ϕ3,1(T)\phi_{3,1}(T) by polynomial functions. Thus, for sufficiently large TT, we get

t=1Tbtq\displaystyle\sum_{t=1}^{T}b_{t}^{q} max{3ϵ6,(6K+24K2ϵ1α0)ϕ0,2(T),24K2ϵ2ϕ3,1(T)}\displaystyle\leq\max\left\{3\epsilon_{6},(6K+24K^{2}\epsilon_{1}\alpha_{0})\phi_{0,2}(T),24K^{2}\epsilon_{2}\phi_{3,1}(T)\right\} (141)
max{12K+48K2ϵ1α0β02(12μ),48K2ϵ2α03β0(13νμ)}(T+τ)max{12μ,13νμ}\displaystyle\leq\max\left\{\frac{12K+48K^{2}\epsilon_{1}\alpha_{0}}{\beta_{0}^{2}(1-2\mu)},\frac{48K^{2}\epsilon_{2}}{\alpha_{0}^{3}\beta_{0}(1-3\nu-\mu)}\right\}(T+\tau)^{\max\{1-2\mu,1-3\nu-\mu\}} (142)
:=ϵ11(T+τ)max{12μ,13νμ}.\displaystyle:=\epsilon_{11}\cdot(T+\tau)^{\max\{1-2\mu,1-3\nu-\mu\}}. (143)

Then, we can plug (141) and (122) to (121), to conclude

Mθ(ν,μ)\displaystyle M_{\theta}(\nu,\mu) 1T1/θϵ7(θ)(T+τ)1θθ+ν+μϵ11(T+τ)max{12μ,13νμ}\displaystyle\leq\frac{1}{T^{1/\theta}}\epsilon_{7}(\theta)(T+\tau)^{\frac{1-\theta}{\theta}+\nu+\mu}\cdot\epsilon_{11}\cdot(T+\tau)^{\max\{1-2\mu,1-3\nu-\mu\}} (144)
=𝒪(Tmin{μν,2ν}).\displaystyle=\mathcal{O}\left(T^{-\min\{\mu-\nu,2\nu\}}\right). (145)

Summarizing the result of the four cases above, we get

Mθ(ν,μ)\displaystyle M_{\theta}(\nu,\mu) ={𝒪(T(1νμ))if μ>1/2 and 3ν+μ>1𝒪(T(μν))if μ<1/2 and 3ν+μ>1𝒪(T2ν)if μ>1/2 and 3ν+μ<1𝒪(Tmin{μν,2ν})if μ<1/2 and 3ν+μ<1\displaystyle=\begin{cases}\mathcal{O}\left(T^{-(1-\nu-\mu)}\right)&\textrm{if $\mu>1/2$ and $3\nu+\mu>1$}\\ \mathcal{O}\left(T^{-(\mu-\nu)}\right)&\textrm{if $\mu<1/2$ and $3\nu+\mu>1$}\\ \mathcal{O}\left(T^{-2\nu}\right)&\textrm{if $\mu>1/2$ and $3\nu+\mu<1$}\\ \mathcal{O}\left(T^{-\min\{\mu-\nu,2\nu\}}\right)&\textrm{if $\mu<1/2$ and $3\nu+\mu<1$}\end{cases} (146)
=𝒪(Tmin{1νμ,μν,2ν}),\displaystyle=\mathcal{O}\left(T^{-\min\{1-\nu-\mu,\mu-\nu,2\nu\}}\right), (147)

which concludes the first claim of Theorem 1.
Recall that our goal is to optimize over (ν,μ)(\nu,\mu) to achieve the best convergence rate. This is equivalent to maximizing the exponent of 1/T1/T in each of the four regions, i.e.,

Region 𝖨:\displaystyle\textrm{Region }\mathsf{I}:\quad sup{1νμ:μ>1/2,3ν+μ>1},\displaystyle\sup\{1-\nu-\mu:\mu>1/2,3\nu+\mu>1\},
Region 𝖨𝖨:\displaystyle\textrm{Region }\mathsf{II}:\quad sup{μν:μ<1/2,3ν+μ>1},\displaystyle\sup\{\mu-\nu:\mu<1/2,3\nu+\mu>1\},
Region 𝖨𝖨𝖨:\displaystyle\textrm{Region }\mathsf{III}:\quad sup{2ν:μ>1/2,3ν+μ<1},\displaystyle\sup\{2\nu:\mu>1/2,3\nu+\mu<1\},
Region 𝖨𝖵:\displaystyle\textrm{Region }\mathsf{IV}:\quad sup{min{μν,2ν}:μ<1/2,3ν+μ<1}.\displaystyle\sup\{\min\{\mu-\nu,2\nu\}:\mu<1/2,3\nu+\mu<1\}.

Interestingly, it turns out that the respective supremum value in all four regions is 1/31/3, which corresponds to the boundary point (ν,μ)=(1/6,1/2)(\nu^{\star},\mu^{\star})=(1/6,1/2). However, this point does not belong to any of the corresponding open sets, which motivates the convergence analysis of MθM_{\theta} for (ν,μ)=(1/6,1/2)(\nu^{\star},\mu^{\star})=(1/6,1/2).
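This can also be seen numerically with a coarse grid search of the exponent \min\{1-\nu-\mu,\mu-\nu,2\nu\} from (147) over \nu\in[0,1/2] and \mu\in[0,1] (the grid resolution is arbitrary):

import numpy as np

nu = np.linspace(0.0, 0.5, 501)[:, None]
mu = np.linspace(0.0, 1.0, 1001)[None, :]
exponent = np.minimum(np.minimum(1.0 - nu - mu, mu - nu), 2.0 * nu)   # exponent of 1/T in (147)
i, j = np.unravel_index(np.argmax(exponent), exponent.shape)
print(exponent[i, j], nu[i, 0], mu[0, j])   # ~1/3, attained near (nu, mu) = (1/6, 1/2)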

Boundary Case: μ=1/2\mu=1/2 and 3ν+μ=13\nu+\mu=1
First note that the two lines of interest intersect at (ν,μ)=(1/6,1/2)(\nu^{\star},\mu^{\star})=(1/6,1/2), as shown in Figure 4. Applying Lemma 3 on ϕ0,2(T)\phi_{0,2}(T) and ϕ3,1(T)\phi_{3,1}(T) for (ν,μ)=(1/6,1/2)(\nu^{\star},\mu^{\star})=(1/6,1/2) we get

ϕ0,2(T)\displaystyle\phi_{0,2}(T) 1β02ln(Tτ+1),\displaystyle\leq\frac{1}{\beta_{0}^{2}}\ln\left(\frac{T}{\tau}+1\right), (148)
ϕ3,1(T)\displaystyle\phi_{3,1}(T) 1α03β0ln(Tτ+1).\displaystyle\leq\frac{1}{\alpha_{0}^{3}\beta_{0}}\ln\left(\frac{T}{\tau}+1\right). (149)

Therefore, (123) reduces to

t=1Tbtq\displaystyle\sum_{t=1}^{T}b_{t}^{q} max{3ϵ6,(6K+24K2ϵ1α0)ϕ0,2(T),24K2ϵ2ϕ3,1(T)}\displaystyle\leq\max\left\{3\epsilon_{6},(6K+24K^{2}\epsilon_{1}\alpha_{0})\phi_{0,2}(T),24K^{2}\epsilon_{2}\phi_{3,1}(T)\right\} (150)
max{6K+24K2ϵ1α0β02,24K2ϵ2α03β0}ln(Tτ+1)\displaystyle\leq\max\left\{\frac{6K+24K^{2}\epsilon_{1}\alpha_{0}}{\beta_{0}^{2}},\frac{24K^{2}\epsilon_{2}}{\alpha_{0}^{3}\beta_{0}}\right\}\ln\left(\frac{T}{\tau}+1\right) (151)
:=ϵ12ln(Tτ+1).\displaystyle:=\epsilon_{12}\cdot\ln\left(\frac{T}{\tau}+1\right). (152)

Plugging (150) and (122) to (121), we arrive at

Mθ(16,12)\displaystyle M_{\theta}\left(\frac{1}{6},\frac{1}{2}\right) 1T1/θϵ7(θ)(T+τ)1θθ+23ϵ12ln(Tτ+1)\displaystyle\leq\frac{1}{T^{1/\theta}}\epsilon_{7}(\theta)(T+\tau)^{\frac{1-\theta}{\theta}+\frac{2}{3}}\cdot\epsilon_{12}\cdot\ln\left(\frac{T}{\tau}+1\right) (153)
=𝒪(T1/3lnT),\displaystyle=\mathcal{O}\left(T^{-1/3}\ln T\right), (154)

which is the second claim of the theorem.

6 Proof of Proposition 1

First note that we cannot directly conclude the proposition from Theorem 1, since the theorem only holds for \theta\in(0,1), and not for \theta=1. In order to show the claim, for a vector \mathbf{y}\in\mathbb{R}^{T} and some \theta\in(0,1) we define \|\mathbf{y}\|_{\theta}:=\left(|y_{1}|^{\theta}+|y_{2}|^{\theta}+\ldots+|y_{T}|^{\theta}\right)^{1/\theta} (note that \|\mathbf{y}\|_{\theta} is not a norm, since it is not subadditive for \theta<1). Then we have \|\mathbf{y}\|_{1}\leq\|\mathbf{y}\|_{\theta} since

(𝐲θ𝐲1)θ=|y1|θ++|yT|θ(|y1|++|yT|)θ=t=1T(|yt||y1|++|yT|)θt=1T|yt||y1|++|yT|=1,\displaystyle\left(\frac{\|\mathbf{y}\|_{\theta}}{\|\mathbf{y}\|_{1}}\right)^{\theta}=\frac{|y_{1}|^{\theta}+\ldots+|y_{T}|^{\theta}}{(|y_{1}|+\ldots+|y_{T}|)^{\theta}}=\sum_{t=1}^{T}\left(\frac{|y_{t}|}{|y_{1}|+\ldots+|y_{T}|}\right)^{\theta}\geq\sum_{t=1}^{T}\frac{|y_{t}|}{|y_{1}|+\ldots+|y_{T}|}=1,

where the inequality holds since we have 0|yt|/i=1T|yi|10\leq|y_{t}|/\sum_{i=1}^{T}|y_{i}|\leq 1, and θ<1\theta<1.
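A one-line numerical illustration of \|\mathbf{y}\|_{1}\leq\|\mathbf{y}\|_{\theta} for \theta<1, with arbitrary nonnegative entries:

import numpy as np

y = np.abs(np.random.default_rng(3).normal(size=25))
theta = 0.5
print(np.sum(y) <= np.sum(y ** theta) ** (1.0 / theta))   # ||y||_1 <= ||y||_theta for theta in (0, 1)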
Now, for the vector 𝐲\mathbf{y} with yt=𝔼[f(𝐱¯(t))2]y_{t}=\mathbb{E}\left[\left\|\nabla f(\bar{\mathbf{x}}(t))\right\|^{2}\right] we have

M1\displaystyle M_{1} =1Tt=1T𝔼[f(𝐱¯(t))2]=1T𝐲11T𝐲θ=1T(t=1T(𝔼[f(𝐱¯(t))2])θ)1/θ\displaystyle\!=\!\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\left[\left\|\nabla f(\bar{\mathbf{x}}(t))\right\|^{2}\right]\!=\!\frac{1}{T}\|\mathbf{y}\|_{1}\!\leq\!\frac{1}{T}\|\mathbf{y}\|_{\theta}=\frac{1}{T}\left(\sum_{t=1}^{T}\left(\mathbb{E}\left[\left\|\nabla f(\bar{\mathbf{x}}(t))\right\|^{2}\right]\right)^{\theta}\right)^{1/\theta}
=1T11/θ(1Tt=1T(𝔼[f(𝐱¯(t))2])θ)1/θ=1T11/θMθ.\displaystyle=\frac{1}{T^{1-1/\theta}}\left(\frac{1}{T}\sum_{t=1}^{T}\left(\mathbb{E}\left[\left\|\nabla f(\bar{\mathbf{x}}(t))\right\|^{2}\right]\right)^{\theta}\right)^{1/\theta}=\frac{1}{T^{1-1/\theta}}M_{\theta}.

Then, from Theorem 1 for (ν,μ)=(1/6,1/2)(\nu^{\star},\mu^{\star})=(1/6,1/2) and θ=22+ϵ\theta=\frac{2}{2+\epsilon}, we get

M11T11/θMθ1T11/θ𝒪(T1/3lnT)=Tϵ/2𝒪(T1/3lnT)=𝒪(T1/3+ϵ),\displaystyle M_{1}\leq\frac{1}{T^{1-1/\theta}}M_{\theta}\leq\frac{1}{T^{1-1/\theta}}\mathcal{O}\left(T^{-1/3}\ln T\right)=T^{\epsilon/2}\mathcal{O}\left(T^{-1/3}\ln T\right)=\mathcal{O}\left(T^{-1/3+\epsilon}\right), (155)

where the last equality holds since lnT=𝒪(Tϵ/2)\ln T=\mathcal{O}(T^{\epsilon/2}) for any ϵ>0\epsilon>0. Similarly, for the vector 𝐳T\mathbf{z}\in\mathbb{R}^{T} with zt:=𝔼[X(t)𝟏𝐱¯(t)𝐫2]z_{t}:=\mathbb{E}\left[\left\|X(t)-\mathbf{1}\bar{\mathbf{x}}(t)\right\|_{\mathbf{r}}^{2}\right] and any θ(0,1)\theta\in(0,1) we have

1Tt=1T𝔼[X(t)𝟏𝐱¯(t)𝐫2]=1T𝐳11T𝐳θ\displaystyle\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\left[\left\|X(t)-\mathbf{1}\bar{\mathbf{x}}(t)\right\|_{\mathbf{r}}^{2}\right]=\frac{1}{T}\|\mathbf{z}\|_{1}\leq\frac{1}{T}\|\mathbf{z}\|_{\theta} =1T[t=1T(𝔼[X(t)𝟏𝐱¯(t)𝐫2])θ]1θ\displaystyle=\frac{1}{T}\left[\sum_{t=1}^{T}\left(\mathbb{E}\left[\left\|X(t)\!-\!\mathbf{1}\bar{\mathbf{x}}(t)\right\|_{\mathbf{r}}^{2}\right]\right)^{\theta}\right]^{\frac{1}{\theta}}
=1T11/θ[1Tt=1T(𝔼[X(t)𝟏𝐱¯(t)𝐫2])θ]1θ.\displaystyle=\!\frac{1}{T^{1-1/\theta}}\left[\frac{1}{T}\sum_{t=1}^{T}\left(\mathbb{E}\left[\left\|X(t)\!-\!\mathbf{1}\bar{\mathbf{x}}(t)\right\|_{\mathbf{r}}^{2}\right]\right)^{\theta}\right]^{\frac{1}{\theta}}. (156)

We need to bound the RHS of (156). Let a_{t}:=(\alpha(t)\beta(t))^{-\theta}=(t+\tau)^{2\theta/3}/(\alpha_{0}\beta_{0})^{\theta} and c_{t}:=\left(\alpha(t)\beta(t)\mathbb{E}\left[\left\|X(t)-\mathbf{1}\bar{\mathbf{x}}(t)\right\|_{\mathbf{r}}^{2}\right]\right)^{\theta}. Then, using Hölder's inequality in (119) for (p,q)=\left(\frac{1}{1-\theta},\frac{1}{\theta}\right) we can write

[1Tt=1T(𝔼[X(t)𝟏𝐱¯(t)𝐫2])θ]1θ=(1Tt=1Tatct)q1Tq(t=1Tatp)qp(t=1Tctq).\displaystyle\left[\frac{1}{T}\sum_{t=1}^{T}\left(\mathbb{E}\left[\left\|X(t)-\mathbf{1}\bar{\mathbf{x}}(t)\right\|_{\mathbf{r}}^{2}\right]\right)^{\theta}\right]^{\frac{1}{\theta}}=\left(\frac{1}{T}\sum_{t=1}^{T}a_{t}c_{t}\right)^{q}\leq\frac{1}{T^{q}}\left(\sum_{t=1}^{T}a_{t}^{p}\right)^{\frac{q}{p}}\left(\sum_{t=1}^{T}c_{t}^{q}\right). (157)

Note that (t=1Tatp)qp\left(\sum_{t=1}^{T}a_{t}^{p}\right)^{\frac{q}{p}} is bounded in (122). Moreover, for t=1Tctq\sum_{t=1}^{T}c_{t}^{q} we can write

t=1Tctq=t=1Tα(t)β(t)𝔼[X(t)𝟏𝐱¯(t)𝐫2]\displaystyle\sum_{t=1}^{T}c_{t}^{q}=\sum_{t=1}^{T}\alpha(t)\beta(t)\mathbb{E}\left[\left\|X(t)-\mathbf{1}\bar{\mathbf{x}}(t)\right\|_{\mathbf{r}}^{2}\right]
(a)2ϵ1ϕ1,2(T)+2ϵ2ϕ3,1(T)+2ϵ3+48κλ2t=1Tα3(t)β(t)𝔼[f(𝐱¯(t))2]\displaystyle\stackrel{{\scriptstyle\rm{(a)}}}{{\leq}}2\epsilon_{1}\phi_{1,2}(T)+2\epsilon_{2}\phi_{3,1}(T)+2\epsilon_{3}+\frac{48\kappa}{\lambda^{2}}\sum_{t=1}^{T}\alpha^{3}(t)\beta(t)\mathbb{E}\left[\left\|\nabla f(\bar{\mathbf{x}}(t))\right\|^{2}\right]
(b)2ϵ1ϕ1,2(T)+2ϵ2ϕ3,1(T)+2ϵ3+48κα02λ2t=1Tα(t)β(t)𝔼[f(𝐱¯(t))2]\displaystyle\stackrel{{\scriptstyle\rm{(b)}}}{{\leq}}2\epsilon_{1}\phi_{1,2}(T)+2\epsilon_{2}\phi_{3,1}(T)+2\epsilon_{3}+\frac{48\kappa\alpha_{0}^{2}}{\lambda^{2}}\sum_{t=1}^{T}\alpha(t)\beta(t)\mathbb{E}\left[\left\|\nabla f(\bar{\mathbf{x}}(t))\right\|^{2}\right]
(c)2α0ϵ1ϕ0,2(T)+2ϵ2ϕ3,1(T)+2ϵ3+48κα02λ2(ϵ6+K(2+8Kα0ϵ1)ϕ0,2(T)+8K2ϵ2ϕ3,1(T))\displaystyle\stackrel{{\scriptstyle\rm{(c)}}}{{\leq}}\!\!2\alpha_{0}\epsilon_{1}\phi_{0,2}(T)\!+\!2\epsilon_{2}\phi_{3,1}(T)\!+\!2\epsilon_{3}\!+\!\frac{48\kappa\alpha_{0}^{2}}{\lambda^{2}}\!\left(\!\epsilon_{6}\!+\!K(2\!+\!8K\!\alpha_{0}\epsilon_{1})\phi_{0,2}(T)\!+\!8K^{2}\epsilon_{2}\phi_{3,1}(T)\right)
2ϵ3+48κα02ϵ6λ2+(2α0ϵ1+96κα02K(1+4Kα0ϵ1)λ2)ϕ0,2(T)+(2ϵ2+384κα02K2ϵ2λ2)ϕ3,1(T)\displaystyle\leq\!2\epsilon_{3}\!+\!\frac{48\kappa\alpha_{0}^{2}\epsilon_{6}}{\lambda^{2}}\!+\!\left(\!2\alpha_{0}\epsilon_{1}\!+\!\frac{96\kappa\alpha_{0}^{2}K(1+4K\alpha_{0}\epsilon_{1})}{\lambda^{2}}\!\right)\phi_{0,2}(T)\!+\!\left(2\epsilon_{2}\!+\!\frac{384\kappa\alpha_{0}^{2}K^{2}\epsilon_{2}}{\lambda^{2}}\!\right)\!\phi_{3,1}(T)
(d)2ϵ3+48κα02ϵ6λ2+(2α0ϵ1β02+96κα02K(1+4Kα0ϵ1)λ2β02)ln(Tτ+1)\displaystyle\stackrel{{\scriptstyle\rm{(d)}}}{{\leq}}2\epsilon_{3}+\frac{48\kappa\alpha_{0}^{2}\epsilon_{6}}{\lambda^{2}}+\left(\frac{2\alpha_{0}\epsilon_{1}}{\beta_{0}^{2}}+\frac{96\kappa\alpha_{0}^{2}K(1+4K\alpha_{0}\epsilon_{1})}{\lambda^{2}\beta_{0}^{2}}\right)\ln\left(\frac{T}{\tau}+1\right)
+(2ϵ2α03β0+384κK2ϵ2λ2α0β0)ln(Tτ+1)(e)ϵ13ln(Tτ+1),\displaystyle\hskip 79.0pt+\left(\frac{2\epsilon_{2}}{\alpha_{0}^{3}\beta_{0}}+\frac{384\kappa K^{2}\epsilon_{2}}{\lambda^{2}\alpha_{0}\beta_{0}}\right)\ln\left(\frac{T}{\tau}+1\right)\stackrel{{\scriptstyle\rm{(e)}}}{{\leq}}\epsilon_{13}\cdot\ln\left(\frac{T}{\tau}+1\right), (158)

where \rm{(a)} follows from (94), \rm{(b)} holds since \alpha^{2}(t)=\frac{\alpha_{0}^{2}}{(t+\tau)^{1/3}}\leq\alpha_{0}^{2}, the inequality in \rm{(c)} follows from (117) and (118), in \rm{(d)} we have used the bounds in (126) and (127) for (\nu^{\star},\mu^{\star})=(1/6,1/2), and \rm{(e)} holds since the constant term 2\epsilon_{3}+48\kappa\alpha_{0}^{2}\epsilon_{6}/\lambda^{2} is upper bounded by \ln(T/\tau+1) for large enough T. Plugging (122) for \nu^{\star}+\mu^{\star}=2/3 and (158) into (157) and (156), and setting \theta=\frac{2}{2+\epsilon}, we arrive at

1Tt=1T𝔼[\displaystyle\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\Big{[} 𝐱i(t)𝐱¯(t)2]1𝐫min[1Tt=1T𝔼[X(t)𝟏𝐱¯(t)𝐫2]]\displaystyle\|\mathbf{x}_{i}(t)-\bar{\mathbf{x}}(t)\|^{2}\Big{]}\leq\frac{1}{\mathbf{r}_{\min}}\left[\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\left[\|X(t)-\mathbf{1}\bar{\mathbf{x}}(t)\|_{\mathbf{r}}^{2}\right]\right]
1𝐫min1T11/θ1T1/θϵ7(θ)(T+τ)1θθ+23ϵ13ln(Tτ+1)=𝒪(T1/3+ϵ),\displaystyle\leq\frac{1}{\mathbf{r}_{\min}}\frac{1}{T^{1-1/\theta}}\frac{1}{T^{1/\theta}}\epsilon_{7}(\theta)(T+\tau)^{\frac{1-\theta}{\theta}+\frac{2}{3}}\cdot\epsilon_{13}\cdot\ln\left(\frac{T}{\tau}+1\right)=\mathcal{O}\left(T^{-1/3+\epsilon}\right), (159)

where the last equality holds since lnT=𝒪(Tϵ/2)\ln T=\mathcal{O}(T^{\epsilon/2}) for any ϵ>0\epsilon>0. Finally, combining (12), (13), and using Assumption 3 and Lemma 1 (for ω=1\omega=1) we have

1T\displaystyle\frac{1}{T} t=1T𝔼[f(𝐱i(t))2]2Tt=1T{𝔼[f(𝐱i(t))f(𝐱¯(t))2]+𝔼[f(𝐱¯(t))2]}\displaystyle\sum_{t=1}^{T}\mathbb{E}\Big{[}\|\nabla f(\mathbf{x}_{i}(t))\|^{2}\Big{]}\leq\frac{2}{T}\sum_{t=1}^{T}\left\{\mathbb{E}\Big{[}\|\nabla f(\mathbf{x}_{i}(t))-\nabla f(\bar{\mathbf{x}}(t))\|^{2}\Big{]}+\mathbb{E}\Big{[}\|\nabla f(\bar{\mathbf{x}}(t))\|^{2}\Big{]}\right\}
2Tt=1T{K2𝔼[𝐱i(t)𝐱¯(t)2]+𝔼[f(𝐱¯(t))2]}𝒪(T1/3+ϵ).\displaystyle\leq\frac{2}{T}\ \sum_{t=1}^{T}\left\{K^{2}\mathbb{E}\Big{[}\|\mathbf{x}_{i}(t)-\bar{\mathbf{x}}(t)\|^{2}\Big{]}+\mathbb{E}\Big{[}\|\nabla f(\bar{\mathbf{x}}(t))\|^{2}\Big{]}\right\}\leq\mathcal{O}(T^{-1/3+\epsilon}).

for every i[n]i\in[n]. This concludes the proof of Proposition 1. \blacksquare

7 Conclusion

We have studied non-convex distributed optimization over time-varying networks with lossy information sharing. Building on the classical averaging-based distributed optimization algorithm, we proposed and studied a two time-scale decentralized algorithm whose damping acts both on the (imperfect) information received from the neighboring agents and on the local cost functions' gradients. We presented the convergence rate for various choices of the diminishing step-size parameters. Optimizing the achieved rate over all feasible parameter choices, the algorithm attains a convergence rate of \mathcal{O}(T^{-1/3+\epsilon}) for non-convex distributed optimization problems over time-varying networks, for any \epsilon>0. Our theoretical results are corroborated by numerical simulations.

References

  • [1] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, et al., “Large scale distributed deep networks,” in Advances in Neural Information Processing Systems, pp. 1223–1231, 2012.
  • [2] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al., “Tensorflow: A system for large-scale machine learning,” in 12th USENIX Symposium on OSDI, pp. 265–283, 2016.
  • [3] P. H. Jin, Q. Yuan, F. Iandola, and K. Keutzer, “How to scale distributed deep learning?,” in NIPS Workshop MLSystems, 2016.
  • [4] D. Li, K. D. Wong, Y. H. Hu, and A. M. Sayeed, “Detection, classification, and tracking of targets,” IEEE Signal Processing Magazine, vol. 19, no. 2, pp. 17–29, 2002.
  • [5] M. Rabbat and R. Nowak, “Distributed optimization in sensor networks,” in Proceedings of the 3rd International Symp. on Information processing in Sensor Networks, pp. 20–27, 2004.
  • [6] S. Kar and J. M. Moura, “Distributed consensus algorithms in sensor networks with imperfect communication: Link failures and channel noise,” IEEE Trans. Signal Process., vol. 57, no. 1, pp. 355–369, 2008.
  • [7] A. Ribeiro, “Ergodic stochastic optimization algorithms for wireless communication and networking,” IEEE Transactions on Signal Processing, vol. 58, no. 12, pp. 6369–6386, 2010.
  • [8] T. Chen, Q. Ling, and G. B. Giannakis, “An online convex optimization approach to proactive network resource allocation,” IEEE Trans. Signal Process., vol. 65, no. 24, pp. 6350–6364, 2017.
  • [9] A. Nedić, S. Lee, and M. Raginsky, “Decentralized online optimization with global objectives and local communication,” in 2015 American Control Conference (ACC), pp. 4497–4503, IEEE, 2015.
  • [10] R. Xin, A. K. Sahu, S. Kar, and U. A. Khan, “Distributed empirical risk minimization over directed graphs,” in 2019 53rd Asilomar Conference on Signals, Systems, and Computers, pp. 189–193, IEEE, 2019.
  • [11] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, “mixup: Beyond empirical risk minimization,” arXiv preprint arXiv:1710.09412, 2017.
  • [12] A. Nedić and A. Ozdaglar, “Distributed subgradient methods for multi-agent optimization,” IEEE Trans. Automat. Contr., vol. 54, no. 1, pp. 48–61, 2009.
  • [13] I. Lobel and A. Ozdaglar, “Distributed subgradient methods for convex optimization over random networks,” IEEE Trans. Automat. Contr., vol. 56, no. 6, pp. 1291–1306, 2010.
  • [14] I. Lobel, A. Ozdaglar, and D. Feijer, “Distributed multi-agent optimization with state-dependent communication,” Mathematical programming, vol. 129, no. 2, pp. 255–284, 2011.
  • [15] J. Chen and A. H. Sayed, “Diffusion adaptation strategies for distributed optimization and learning over networks,” IEEE Trans. Signal Process., vol. 60, no. 8, pp. 4289–4305, 2012.
  • [16] K. Yuan, Q. Ling, and W. Yin, “On the convergence of decentralized gradient descent,” SIAM Journal on Optimization, vol. 26, no. 3, pp. 1835–1854, 2016.
  • [17] D. Jakovetić, J. Xavier, and J. M. Moura, “Fast distributed gradient methods,” IEEE Trans. Automat. Contr., vol. 59, no. 5, pp. 1131–1146, 2014.
  • [18] A. Nedić, A. Ozdaglar, and P. A. Parrilo, “Constrained consensus and optimization in multi-agent networks,” IEEE Trans. Automat. Contr., vol. 55, no. 4, pp. 922–938, 2010.
  • [19] A. Nedić and A. Olshevsky, “Distributed optimization over time-varying directed graphs,” IEEE Trans. Automat. Contr., vol. 60, no. 3, pp. 601–615, 2014.
  • [20] Y. Sun, G. Scutari, and D. Palomar, “Distributed nonconvex multiagent optimization over time-varying networks,” in IEEE Asilomar Conf. Signals Syst. Comput., pp. 788–794, 2016.
  • [21] T. Tatarenko and B. Touri, “Non-convex distributed optimization,” IEEE Trans. Automat. Contr., vol. 62, no. 8, pp. 3744–3757, 2017.
  • [22] J. Zeng and W. Yin, “On nonconvex decentralized gradient descent,” IEEE Transactions on Signal Processing, vol. 66, no. 11, pp. 2834–2848, 2018.
  • [23] A. Kashyap, T. Başar, and R. Srikant, “Quantized consensus,” Automatica, vol. 43, no. 7, pp. 1192–1203, 2007.
  • [24] K. Cai and H. Ishii, “Quantized consensus and averaging on gossip digraphs,” IEEE Trans. Automat. Contr., vol. 56, no. 9, pp. 2087–2100, 2011.
  • [25] Y. Pu, M. N. Zeilinger, and C. N. Jones, “Quantization design for distributed optimization,” IEEE Trans. Automat. Contr., vol. 62, no. 5, pp. 2107–2120, 2016.
  • [26] A. Reisizadeh, H. Taheri, A. Mokhtari, H. Hassani, and R. Pedarsani, “Robust and communication-efficient collaborative learning,” in Advances in Neural Information Processing Systems, pp. 8388–8399, 2019.
  • [27] A. Koloskova, S. Stich, and M. Jaggi, “Decentralized stochastic optimization and gossip algorithms with compressed communication,” in International Conference on Machine Learning, pp. 3478–3487, PMLR, 2019.
  • [28] H. Reisizadeh, B. Touri, and S. Mohajer, “Distributed optimization over time-varying graphs with imperfect sharing of information,” arXiv preprint arXiv:2106.08469, 2021.
  • [29] H. Reisizadeh, B. Touri, and S. Mohajer, “Adaptive bit allocation for communication-efficient distributed optimization,” in 2021 60th IEEE Conference on Decision and Control (CDC), pp. 1994–2001, IEEE, 2021.
  • [30] A. Koloskova, T. Lin, S. U. Stich, and M. Jaggi, “Decentralized deep learning with arbitrary communication compression,” arXiv preprint arXiv:1907.09356, 2019.
  • [31] K. Srivastava and A. Nedić, “Distributed asynchronous constrained stochastic optimization,” IEEE Journal of Selected Topics in Signal Processing, vol. 5, no. 4, pp. 772–790, 2011.
  • [32] S. Pu and A. Nedić, “A distributed stochastic gradient tracking method,” in 2018 IEEE Conference on Decision and Control (CDC), pp. 963–968, IEEE, 2018.
  • [33] J. Zhang and K. You, “Decentralized stochastic gradient tracking for non-convex empirical risk minimization,” arXiv preprint arXiv:1909.02712, 2019.
  • [34] R. Durrett, Probability: theory and examples. Cambridge University Press, 2010.
  • [35] X. Lian, C. Zhang, H. Zhang, C.-J. Hsieh, W. Zhang, and J. Liu, “Can decentralized algorithms outperform centralized algorithms? a case study for decentralized parallel stochastic gradient descent,” arXiv preprint arXiv:1705.09056, 2017.
  • [36] A. Nedić, A. Olshevsky, A. Ozdaglar, and J. N. Tsitsiklis, “On distributed averaging algorithms and quantization effects,” IEEE Trans. Automat. Contr., vol. 54, no. 11, pp. 2506–2517, 2009.
  • [37] A. Nedić and A. Olshevsky, “Distributed optimization of strongly convex functions on directed time-varying graphs,” in IEEE Glob. Conf. Signal Inf. Process. Proc., pp. 329–332, 2013.
  • [38] B. Touri and A. Nedić, “Product of random stochastic matrices,” IEEE Trans. Automat. Contr., vol. 59, no. 2, pp. 437–448, 2013.
  • [39] B. Touri and C. Langbort, “On endogenous random consensus and averaging dynamics,” IEEE Transactions on Control of Network Systems, vol. 1, no. 3, pp. 241–248, 2014.
  • [40] A. Aghajan and B. Touri, “Distributed optimization over dependent random networks,” arXiv preprint arXiv:2010.01956, 2020.
  • [41] S. U. Stich, J.-B. Cordonnier, and M. Jaggi, “Sparsified SGD with memory,” arXiv preprint arXiv:1809.07599, 2018.
  • [42] D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic, “QSGD: Communication-efficient SGD via gradient quantization and encoding,” in Advances in Neural Information Processing Systems, pp. 1709–1720, 2017.
  • [43] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
  • [44] S. Boyd, A. Ghosh, B. Prabhakar, and D. Shah, “Randomized gossip algorithms,” IEEE Transactions on Information Theory, vol. 52, no. 6, pp. 2508–2530, 2006.
  • [45] A. G. Dimakis, S. Kar, J. M. Moura, M. G. Rabbat, and A. Scaglione, “Gossip algorithms for distributed signal processing,” Proceedings of the IEEE, vol. 98, no. 11, pp. 1847–1864, 2010.
  • [46] S. S. Ram, A. Nedić, and V. V. Veeravalli, “Asynchronous gossip algorithm for stochastic optimization: Constant stepsize analysis,” in Recent Advances in Optimization and its Applications in Engineering, pp. 51–60, Springer, 2010.
  • [47] S. Lee and A. Nedić, “Asynchronous gossip-based random projection algorithms over networks,” IEEE Trans. Automat. Contr., vol. 61, no. 4, pp. 953–968, 2015.
  • [48] P. M. DeMarzo, D. Vayanos, and J. Zwiebel, “Persuasion bias, social influence, and unidimensional opinions,” The Quarterly Journal of Economics, vol. 118, no. 3, pp. 909–968, 2003.
  • [49] S. Bubeck, “Convex optimization: Algorithms and complexity,” Foundations and Trends in Machine Learning, vol. 8, no. 3-4, pp. 231–357, 2015.
  • [50] G. B. Folland, Real analysis: modern techniques and their applications, vol. 40. John Wiley & Sons, 1999.
  • [51] B. Touri and A. Nedić, “On existence of a quadratic comparison function for random weighted averaging dynamics and its implications,” in 2011 50th IEEE Conference on Decision and Control and European Control Conference, pp. 3806–3811, IEEE, 2011.

Appendix A Proof of Auxiliary Lemmas

In this section, we provide the proofs of auxiliary lemmas.
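Throughout the following proofs we work with the weighted norm \|U\|_{\mathbf{r}}^{2}=\sum_{i=1}^{n}r_{i}\|U_{i}\|^{2}, where U_{i} denotes the i-th row of U and \mathbf{r} is a stochastic vector. The minimal Python sketch below implements this norm for the numerical illustrations that accompany some of the lemmas; the helper name r_norm_sq and the random sizes are hypothetical and not part of the paper's development.

```python
# Minimal sketch of the weighted norm ||U||_r^2 = sum_i r_i * ||U_i||^2,
# where U_i is the i-th row of U and r is a stochastic (probability) vector.
# The helper name r_norm_sq is a hypothetical choice for these illustrations.
import numpy as np

def r_norm_sq(U, r):
    U = np.asarray(U, dtype=float).reshape(len(r), -1)   # treat vectors as n x 1
    return float(np.sum(r * np.sum(U**2, axis=1)))

rng = np.random.default_rng(0)
n, d = 5, 3
r = rng.random(n); r /= r.sum()                          # stochastic weight vector
U = rng.standard_normal((n, d))
print("||U||_r^2 =", r_norm_sq(U, r))
```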

Proof of Lemma 1: For two vectors 𝒖\boldsymbol{u} and 𝒗\boldsymbol{v} (with identical dimension) and any scalar ω>0\omega>0, we have

𝒖+𝒗2\displaystyle\|\boldsymbol{u}+\boldsymbol{v}\|^{2} =𝒖2+𝒗2+2𝒖,𝒗\displaystyle=\|\boldsymbol{u}\|^{2}+\|\boldsymbol{v}\|^{2}+2\left\langle\boldsymbol{u},\boldsymbol{v}\right\rangle
(a)𝒖2+𝒗2+2𝒖𝒗\displaystyle\stackrel{{\scriptstyle\rm{(a)}}}{{\leq}}\|\boldsymbol{u}\|^{2}+\|\boldsymbol{v}\|^{2}+2\|\boldsymbol{u}\|\|\boldsymbol{v}\|
=𝒖2+𝒗2+2(ω𝒖1ω𝒗)\displaystyle=\|\boldsymbol{u}\|^{2}+\|\boldsymbol{v}\|^{2}+2\left(\sqrt{\omega}\|\boldsymbol{u}\|\cdot\frac{1}{\sqrt{\omega}}\|\boldsymbol{v}\|\right)
(b)𝒖2+𝒗2+ω𝒖2+1ω𝒗2\displaystyle\stackrel{{\scriptstyle\rm{(b)}}}{{\leq}}\|\boldsymbol{u}\|^{2}+\|\boldsymbol{v}\|^{2}+\omega\|\boldsymbol{u}\|^{2}+\frac{1}{\omega}\|\boldsymbol{v}\|^{2}
=(1+ω)𝒖2+(1+1ω)𝒗2,\displaystyle=(1+\omega)\|\boldsymbol{u}\|^{2}+\left(1+\frac{1}{\omega}\right)\|\boldsymbol{v}\|^{2},

where \rm{(a)} follows from the Cauchy–Schwarz inequality and \rm{(b)} follows from the arithmetic–geometric mean inequality (2ab\leq a^{2}+b^{2}).

Now, recall that for a matrix U\in\mathbb{R}^{n\times d}, where U_{i} denotes its i-th row, we have \|U\|_{\mathbf{r}}^{2}=\sum_{i=1}^{n}r_{i}\|U_{i}\|^{2}. Therefore, for matrices U and V we have

U+V𝐫2\displaystyle\|U+V\|_{\mathbf{r}}^{2} =i=1nriUi+Vi2i=1nri[(1+ω)Ui2+(1+1ω)Vi2]\displaystyle=\sum_{i=1}^{n}r_{i}\|U_{i}+V_{i}\|^{2}\leq\sum_{i=1}^{n}r_{i}\left[(1+\omega)\|U_{i}\|^{2}+\left(1+\frac{1}{\omega}\right)\|V_{i}\|^{2}\right]
=(1+ω)i=1nriUi2+(1+1ω)i=1nriVi2\displaystyle=(1+\omega)\sum_{i=1}^{n}r_{i}\|U_{i}\|^{2}+\left(1+\frac{1}{\omega}\right)\sum_{i=1}^{n}r_{i}\|V_{i}\|^{2}
=(1+ω)U𝐫2+(1+1ω)V𝐫2.\displaystyle=(1+\omega)\|U\|^{2}_{\mathbf{r}}+\left(1+\frac{1}{\omega}\right)\|V\|^{2}_{\mathbf{r}}.

This completes the proof of the lemma. \blacksquare
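As a quick numerical illustration (not a proof), the following sketch checks the inequality of Lemma 1 on a random instance for several values of \omega; the weighted-norm helper and the test sizes are illustrative choices.

```python
# Numerical sanity check (not a proof) of Lemma 1:
# ||U + V||_r^2 <= (1 + w) ||U||_r^2 + (1 + 1/w) ||V||_r^2 for any w > 0.
import numpy as np

def r_norm_sq(U, r):
    U = np.asarray(U, dtype=float).reshape(len(r), -1)
    return float(np.sum(r * np.sum(U**2, axis=1)))

rng = np.random.default_rng(1)
n, d = 6, 4
r = rng.random(n); r /= r.sum()
U, V = rng.standard_normal((n, d)), rng.standard_normal((n, d))
for w in (0.1, 1.0, 10.0):
    lhs = r_norm_sq(U + V, r)
    rhs = (1 + w) * r_norm_sq(U, r) + (1 + 1 / w) * r_norm_sq(V, r)
    assert lhs <= rhs + 1e-12, (w, lhs, rhs)
print("Lemma 1 inequality holds on this random instance for all tested w.")
```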

Proof of Lemma 2: To prove (25), let g=s=1t1β(s)k=s+1t1(1λβ(k))g=\sum_{s=1}^{t-1}\beta(s)\prod_{k=s+1}^{t-1}(1-\lambda\beta(k)). Then, we have

λg\displaystyle\lambda g =s=1t1λβ(s)k=s+1t1(1λβ(k))\displaystyle=\sum_{s=1}^{t-1}\lambda\beta(s)\prod_{k=s+1}^{t-1}(1-\lambda\beta(k))
=s=1t1(1(1λβ(s)))k=s+1t1(1λβ(k))\displaystyle=\sum_{s=1}^{t-1}(1-(1-\lambda\beta(s)))\prod_{k=s+1}^{t-1}(1-\lambda\beta(k))
\displaystyle=\sum_{s=1}^{t-1}\left[\prod_{k=s+1}^{t-1}\left(1-\lambda\beta(k)\right)-\prod_{k=s}^{t-1}\left(1-\lambda\beta(k)\right)\right].

Since the last summation is a telescoping sum, we have

λg\displaystyle\lambda g =k=tt1(1λβ(k))k=1t1(1λβ(k))=1k=1t1(1λβ(k)).\displaystyle=\prod_{k=t}^{t-1}\left(1-\lambda\beta(k)\right)-\prod_{k=1}^{t-1}\left(1-\lambda\beta(k)\right)=1-\prod_{k=1}^{t-1}\left(1-\lambda\beta(k)\right).

Dividing both sides by \lambda>0, we arrive at (25). To show (26), we utilize the same idea: letting h=\sum_{t=s+1}^{T}\beta(t)\prod_{k=s+1}^{t-1}(1-\lambda\beta(k)), we can write

λh\displaystyle\lambda h =t=s+1T[k=s+1t1(1λβ(k))k=s+1t(1λβ(k))]\displaystyle=\sum_{t=s+1}^{T}\left[\prod_{k=s+1}^{t-1}\left(1-\lambda\beta(k)\right)-\prod_{k=s+1}^{t}\left(1-\lambda\beta(k)\right)\right]
\displaystyle=\prod_{k=s+1}^{s}\left(1-\lambda\beta(k)\right)-\prod_{k=s+1}^{T}\left(1-\lambda\beta(k)\right)
=1k=s+1T(1λβ(k)).\displaystyle=1-\prod_{k=s+1}^{T}\left(1-\lambda\beta(k)\right).

Dividing both sides by \lambda>0 results in (26). \blacksquare
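The telescoping identity derived above, which yields (25), can also be verified numerically. The sketch below uses the illustrative schedule \beta(k)=\beta_{0}/(k+\tau)^{1/2}; the parameter values are chosen only for demonstration.

```python
# Numerical check (not a proof) of the telescoping identity derived above:
#   lambda * sum_{s=1}^{t-1} beta(s) * prod_{k=s+1}^{t-1} (1 - lambda*beta(k))
#       = 1 - prod_{k=1}^{t-1} (1 - lambda*beta(k)).
# beta0, tau, lam, t below are illustrative parameters.
import numpy as np

beta0, tau, lam, t = 0.5, 5.0, 0.8, 50
beta = lambda k: beta0 / (k + tau) ** 0.5

lhs = lam * sum(beta(s) * np.prod([1.0 - lam * beta(k) for k in range(s + 1, t)])
                for s in range(1, t))
rhs = 1.0 - np.prod([1.0 - lam * beta(k) for k in range(1, t)])
assert abs(lhs - rhs) < 1e-10, (lhs, rhs)
print(f"lhs = {lhs:.12f}, rhs = {rhs:.12f}")
```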

Proof of Lemma 3: In order to prove the lemma, we separately analyze the cases δ<1\delta<-1, δ=1\delta=-1, 1<δ<0-1<\delta<0, and δ0\delta\geq 0. Note that for δ<0\delta<0, the function h(t):=(t+τ)δh(t):=(t+\tau)^{\delta} is a decreasing function, and thus we have t=1T(t+τ)δ0T(t+τ)δ𝑑t{\sum_{t=1}^{T}(t+\tau)^{\delta}\leq\int_{0}^{T}(t+\tau)^{\delta}dt}. In the following we upper bound the latter integral for each regime of δ\delta. When δ<1\delta<-1 we have

t=1T(t+τ)δ0T(t+τ)δ𝑑t=τ1+δ(T+τ)1+δ1δτ1+δ1δ,\displaystyle\sum_{t=1}^{T}(t+\tau)^{\delta}\leq\int_{0}^{T}(t+\tau)^{\delta}dt=\frac{\tau^{1+\delta}-(T+\tau)^{1+\delta}}{-1-\delta}\leq\frac{\tau^{1+\delta}}{-1-\delta}, (160)

which does not grow with TT. For δ=1\delta=-1, we get

t=1T(t+τ)10T(t+τ)1𝑑t=ln(T+τ)ln(τ)=ln(Tτ+1).\displaystyle\sum_{t=1}^{T}(t+\tau)^{-1}\leq\int_{0}^{T}(t+\tau)^{-1}dt=\ln(T+\tau)-\ln(\tau)=\ln\left(\frac{T}{\tau}+1\right). (161)

When 1<δ<0-1<\delta<0 we arrive at

t=1T(t+τ)δ0T(t+τ)δ𝑑t=(T+τ)1+δτ1+δ1+δ(T+τ)1+δ1+δ21+δ1+δ(T+τ)1+δ,\displaystyle\sum_{t=1}^{T}(t+\tau)^{\delta}\leq\int_{0}^{T}(t+\tau)^{\delta}dt=\frac{(T+\tau)^{1+\delta}-\tau^{1+\delta}}{1+\delta}\leq\frac{(T+\tau)^{1+\delta}}{1+\delta}\leq\frac{2^{1+\delta}}{1+\delta}(T+\tau)^{1+\delta}, (162)

where the last inequality holds since 21+δ20=12^{1+\delta}\geq 2^{0}=1.

Finally, for δ0\delta\geq 0, the function h(t)h(t) is an increasing function. Hence, we can write

t=1T(t+τ)δ1T+1(t+τ)δ𝑑t=(T+τ+1)1+δ(τ+1)1+δ1+δ21+δ1+δ(T+τ)1+δ.\displaystyle\sum_{t=1}^{T}(t+\tau)^{\delta}\leq\int_{1}^{T+1}(t+\tau)^{\delta}dt=\frac{(T+\tau+1)^{1+\delta}-(\tau+1)^{1+\delta}}{1+\delta}\leq\frac{2^{1+\delta}}{1+\delta}(T+\tau)^{1+\delta}. (163)

Note that the last inequality follows from T+\tau+1\leq 2(T+\tau), which holds for any T\geq 1 and \tau\geq 0. \blacksquare
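The case analysis of Lemma 3 can be illustrated numerically; the sketch below evaluates \sum_{t=1}^{T}(t+\tau)^{\delta} against the bound (160), (161), or (162)/(163) for one representative \delta in each regime, with \tau and T chosen purely for illustration.

```python
# Numerical illustration (not a proof) of the sum/integral bounds in Lemma 3,
# for one representative delta per regime; tau and T are illustrative choices.
import math

def check(delta, tau=4.0, T=10_000):
    S = sum((t + tau) ** delta for t in range(1, T + 1))
    if delta < -1:
        bound = tau ** (1 + delta) / (-1 - delta)                      # cf. (160)
    elif delta == -1:
        bound = math.log(T / tau + 1)                                  # cf. (161)
    else:
        bound = 2 ** (1 + delta) / (1 + delta) * (T + tau) ** (1 + delta)  # cf. (162)/(163)
    assert S <= bound + 1e-9, (delta, S, bound)
    print(f"delta={delta:+.2f}: sum={S:.4g} <= bound={bound:.4g}")

for delta in (-1.5, -1.0, -0.5, 0.0, 0.7):
    check(delta)
```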

Proof of Lemma 4: Recall that the (i,j)(i,j)th entry of the matrix product ABAB is the inner product between the iith row of AA and the jjth column of BB. Thus, using the Cauchy-Schwarz inequality, we have |[AB]ij|=|Ai,Bj|AiBj{|[AB]_{ij}|=|\left\langle A_{i},B^{j}\right\rangle|\leq\left\|A_{i}\right\|\left\|B^{j}\right\|}. Therefore,

[AB]i2=j=1m|[AB]ij|2=j=1m|Ai,Bj|2Ai2j=1mBj2Ai2BF2.\displaystyle\left\|[AB]_{i}\right\|^{2}=\sum_{j=1}^{m}|[AB]_{ij}|^{2}=\sum_{j=1}^{m}|\left\langle A_{i},B^{j}\right\rangle|^{2}\leq\left\|A_{i}\right\|^{2}\sum_{j=1}^{m}\left\|B^{j}\right\|^{2}\leq\left\|A_{i}\right\|^{2}\left\|B\right\|_{F}^{2}.

Using this inequality and the definition of 𝐫\mathbf{r}-norm, we get

AB𝐫2\displaystyle\left\|AB\right\|_{\mathbf{r}}^{2} =i=1nri[AB]i2i=1nriAi2BF2=A𝐫2BF2.\displaystyle=\sum_{i=1}^{n}r_{i}\left\|[AB]_{i}\right\|^{2}\leq\sum_{i=1}^{n}r_{i}\left\|A_{i}\right\|^{2}\left\|B\right\|_{F}^{2}=\left\|A\right\|_{\mathbf{r}}^{2}\left\|B\right\|_{F}^{2}.

\blacksquare
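A quick numerical check of Lemma 4 (not a proof) is sketched below; the matrix shapes and the weighted-norm helper are illustrative choices.

```python
# Numerical sanity check (not a proof) of Lemma 4: ||AB||_r^2 <= ||A||_r^2 * ||B||_F^2,
# where ||.||_r is the row-weighted norm and ||.||_F the Frobenius norm.
import numpy as np

def r_norm_sq(U, r):
    U = np.asarray(U, dtype=float).reshape(len(r), -1)
    return float(np.sum(r * np.sum(U**2, axis=1)))

rng = np.random.default_rng(2)
n, p, m = 5, 7, 3
r = rng.random(n); r /= r.sum()
A = rng.standard_normal((n, p))
B = rng.standard_normal((p, m))
lhs = r_norm_sq(A @ B, r)
rhs = r_norm_sq(A, r) * np.linalg.norm(B, "fro") ** 2
assert lhs <= rhs + 1e-12
print(f"||AB||_r^2 = {lhs:.4f} <= ||A||_r^2 * ||B||_F^2 = {rhs:.4f}")
```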

Proof of Lemma 5: Due to the separable nature of \left\|\cdot\right\|_{\mathbf{r}}, i.e., \left\|U\right\|_{\mathbf{r}}^{2}=\sum_{j=1}^{d}\left\|U^{j}\right\|_{\mathbf{r}}^{2}, without loss of generality, we may assume that d=1. Thus, let U=\boldsymbol{u}\in\mathbb{R}^{n}. Define V_{\mathbf{r}}:\mathbb{R}^{n}\to\mathbb{R}^{+} by

V𝐫(𝒖):=𝒖𝟏𝐫T𝒖𝐫2=i=1nri(ui𝐫T𝒖)2.\displaystyle V_{\mathbf{r}}(\boldsymbol{u}):=\left\|\boldsymbol{u}-\mathbf{1}\mathbf{r}^{T}\boldsymbol{u}\right\|_{\mathbf{r}}^{2}=\sum_{i=1}^{n}r_{i}(u_{i}-\mathbf{r}^{T}\boldsymbol{u})^{2}. (164)

For notational simplicity, let 𝒖(s)=𝒖=[u1u2un]\boldsymbol{u}(s)=\boldsymbol{u}=\begin{bmatrix}u_{1}&u_{2}&\ldots&u_{n}\end{bmatrix}, and 𝒖(k+1)=A(k)𝒖(k)\boldsymbol{u}(k+1)=A(k)\boldsymbol{u}(k). Also with a slight abuse of notation, we denote V𝐫(𝒖(k))V_{\mathbf{r}}(\boldsymbol{u}(k)) by V𝐫(k)V_{\mathbf{r}}(k) for k=s,,tk=s,\ldots,t.

Using Theorem 1 in [51], we have

V𝐫(t)=V𝐫(s)k=st1i<jHij(k)(ui(k)uj(k))2,\displaystyle V_{\mathbf{r}}(t)=V_{\mathbf{r}}(s)-\sum_{k=s}^{t-1}\sum_{i<j}H_{ij}(k)(u_{i}(k)-u_{j}(k))^{2}, (165)

where H(k)=A^{T}(k)\operatorname{diag}(\mathbf{r})A(k). Note that A(k) is a non-negative matrix, and hence we have H(k)\geq\mathbf{r}_{\min}A^{T}(k)A(k) (entry-wise), for k=s,\ldots,t. Also, note that since A(k)=(1-\beta(k))I+\beta(k)W(k), Assumption 2-(b) implies that the minimum non-zero elements of A(k) are bounded below by \eta\beta(k). Therefore, since \beta(k) is non-increasing, over the window k=s,\ldots,s+B the minimum non-zero elements of A(k) are lower bounded by \eta\beta(s+B). Without loss of generality, assume that the entries of \boldsymbol{u} are sorted, i.e., u_{1}\leq\ldots\leq u_{n}; otherwise, we can relabel the agents (i.e., permute the rows and columns of the A(k)'s and the entries of \boldsymbol{u}) to achieve this. Therefore, by Lemma 8 in [36], for (165), we have

V𝐫(s+B)\displaystyle V_{\mathbf{r}}(s+B) V𝐫(s)𝐫mink=ss+B1i<j[AT(k)A(k)]ij(ui(k)uj(k))2\displaystyle\leq V_{\mathbf{r}}(s)-\mathbf{r}_{\min}\sum_{k=s}^{s+B-1}\sum_{i<j}[A^{T}(k)A(k)]_{ij}(u_{i}(k)-u_{j}(k))^{2}
V𝐫(s)η𝐫min2β(s+B)=1n1(u+1u)2.\displaystyle\leq V_{\mathbf{r}}(s)-\frac{\eta\mathbf{r}_{\min}}{2}\beta(s+B)\sum_{\ell=1}^{n-1}(u_{\ell+1}-u_{\ell})^{2}. (166)

We remark that although Lemma 8 in [36] is stated for doubly stochastic matrices, and its statement concerns the decrease of V_{\mathbf{r}}(\mathbf{x}) in the special case \mathbf{r}=\frac{1}{n}\mathbf{1}, it is in fact a result on bounding \sum_{k=s}^{s+B-1}\sum_{i<j}[A^{T}(k)A(k)]_{ij}(u_{i}(k)-u_{j}(k))^{2} for a sequence of B-connected stochastic matrices A(k) in terms of the minimum non-zero entries of the stochastic matrices A(s),\ldots,A(s+B-1).

Next, we show that \sum_{\ell=1}^{n-1}(u_{\ell+1}-u_{\ell})^{2}\geq n^{-2}V_{\mathbf{r}}(\boldsymbol{u}). The argument adapts the one used in the proof of Theorem 18 in [36] to the general V_{\mathbf{r}}(\cdot).

For a 𝒗n\boldsymbol{v}\in\mathbb{R}^{n} with V𝐫(𝒗)>0V_{\mathbf{r}}(\boldsymbol{v})>0, define the quotient

h(𝒗)==1n1(v+1v)2i=1nri(vi𝐫T𝒗)2==1n1(v+1v)2V𝐫(𝒗).\displaystyle h(\boldsymbol{v})=\frac{\sum_{\ell=1}^{n-1}(v_{\ell+1}-v_{\ell})^{2}}{\sum_{i=1}^{n}r_{i}(v_{i}-\mathbf{r}^{T}\boldsymbol{v})^{2}}=\frac{\sum_{\ell=1}^{n-1}(v_{\ell+1}-v_{\ell})^{2}}{V_{\mathbf{r}}(\boldsymbol{v})}. (167)

Note that h(\boldsymbol{v}) is invariant under scaling and under translations by the all-ones vector, i.e., h(\omega\boldsymbol{v})=h(\boldsymbol{v}) for all non-zero \omega\in\mathbb{R} and h(\boldsymbol{v}+\omega\mathbf{1})=h(\boldsymbol{v}) for all \omega\in\mathbb{R}. Therefore,

minv1v2vnV𝐫(𝒗)0h(𝒗)\displaystyle\min_{\begin{subarray}{c}v_{1}\leq v_{2}\leq\cdots\leq v_{n}\\ V_{\mathbf{r}}(\boldsymbol{v})\not=0\end{subarray}}h(\boldsymbol{v}) =minv1v2vn𝐫T𝒗=0,V𝐫(𝒗)=1h(𝒗)\displaystyle=\min_{\begin{subarray}{c}v_{1}\leq v_{2}\leq\cdots\leq v_{n}\\ \mathbf{r}^{T}\boldsymbol{v}=0,V_{\mathbf{r}}(\boldsymbol{v})=1\end{subarray}}h(\boldsymbol{v}) (168)
=minv1v2vn𝐫T𝒗=0,V𝐫(𝒗)=1=1n1(v+1v)2.\displaystyle=\min_{\begin{subarray}{c}v_{1}\leq v_{2}\leq\cdots\leq v_{n}\\ \mathbf{r}^{T}\boldsymbol{v}=0,V_{\mathbf{r}}(\boldsymbol{v})=1\end{subarray}}\sum_{\ell=1}^{n-1}(v_{\ell+1}-v_{\ell})^{2}. (169)

Since \mathbf{r} is a stochastic vector, for a vector \boldsymbol{v} with v_{1}\leq\ldots\leq v_{n} and \mathbf{r}^{T}\boldsymbol{v}=0, we have v_{1}\leq\mathbf{r}^{T}\boldsymbol{v}=0\leq v_{n}. On the other hand, the fact that V_{\mathbf{r}}(\boldsymbol{v})=\sum_{i=1}^{n}r_{i}v^{2}_{i}=1 implies \max(|v_{1}|,|v_{n}|)\geq\frac{1}{\sqrt{n}}. Let us consider the difference sequence \hat{v}_{\ell}=v_{\ell+1}-v_{\ell} for \ell=1,\ldots,n-1, for which we have \sum_{\ell=1}^{n-1}\hat{v}_{\ell}=v_{n}-v_{1}=|v_{n}|+|v_{1}|\geq\max(|v_{1}|,|v_{n}|)\geq\frac{1}{\sqrt{n}}, where we used v_{1}\leq 0\leq v_{n}. Therefore, the optimization problem (168) can be rewritten as

minv1v2vnV𝐫(𝒗)0h(𝒗)\displaystyle\min_{\begin{subarray}{c}v_{1}\leq v_{2}\leq\cdots\leq v_{n}\\ V_{\mathbf{r}}(\boldsymbol{v})\not=0\end{subarray}}h(\boldsymbol{v}) =minv1v2vn𝐫T𝒗=0,V𝐫(𝒗)=1=1n1(v+1v)2\displaystyle=\min_{\begin{subarray}{c}v_{1}\leq v_{2}\leq\cdots\leq v_{n}\\ \mathbf{r}^{T}\boldsymbol{v}=0,V_{\mathbf{r}}(\boldsymbol{v})=1\end{subarray}}\sum_{\ell=1}^{n-1}(v_{\ell+1}-v_{\ell})^{2} (170)
minv^1,,v^n10i=1n1v^i1n=1n1v^2.\displaystyle\geq\min_{\begin{subarray}{c}\hat{v}_{1},\ldots,\hat{v}_{n-1}\geq 0\\ \sum_{i=1}^{n-1}\hat{v}_{i}\geq\frac{1}{\sqrt{n}}\end{subarray}}\sum_{\ell=1}^{n-1}\hat{v}_{\ell}^{2}. (171)

Using the Cauchy-Schwarz inequality, we get =1n1v^2=1n112(=1n1v^)2(1n)2=1n{\sum_{\ell=1}^{n-1}\hat{v}_{\ell}^{2}\cdot\sum_{\ell=1}^{n-1}1^{2}\geq\big{(}\sum_{\ell=1}^{n-1}\hat{v}_{\ell}\big{)}^{2}\geq\big{(}\frac{1}{\sqrt{n}}\big{)}^{2}=\frac{1}{n}}. Hence,

minv1v2vnV𝐫(𝒗)0h(𝒗)1n(n1)1n2.\displaystyle\min_{\begin{subarray}{c}v_{1}\leq v_{2}\leq\ldots\leq v_{n}\\ V_{\mathbf{r}}(\boldsymbol{v})\not=0\end{subarray}}h(\boldsymbol{v})\geq\frac{1}{n(n-1)}\geq\frac{1}{n^{2}}. (172)

This implies that for v_{1}\leq v_{2}\leq\ldots\leq v_{n}, we have \sum_{\ell=1}^{n-1}(v_{\ell+1}-v_{\ell})^{2}\geq n^{-2}V_{\mathbf{r}}(\boldsymbol{v}) (note that this inequality also holds for \boldsymbol{v}\in\mathbb{R}^{n} with V_{\mathbf{r}}(\boldsymbol{v})=0). Using this fact in (166) implies

V𝐫(s+B)(1η𝐫min2n2β(s+B))V𝐫(s).\displaystyle V_{\mathbf{r}}(s+B)\leq\left(1-\frac{\eta\mathbf{r}_{\min}}{2n^{2}}\beta(s+B)\right)V_{\mathbf{r}}(s). (173)

Applying (173) for Δ:=t1sB\Delta:=\lfloor\frac{t-1-s}{B}\rfloor steps recursively, we get

V𝐫(s+ΔB)j=1Δ(1η𝐫min2n2β(s+jB))V𝐫(s)\displaystyle V_{\mathbf{r}}(s+\Delta B)\leq\prod_{j=1}^{\Delta}\left(1-\frac{\eta\mathbf{r}_{\min}}{2n^{2}}\beta(s+jB)\right)V_{\mathbf{r}}(s)

Using the fact that (1x)1/B1x/B(1-x)^{1/B}\leq 1-x/B and since {β(k)}\{\beta(k)\} is a non-increasing sequence, we have

1η𝐫min2n2β(s+jB)\displaystyle 1-\frac{\eta\mathbf{r}_{\min}}{2n^{2}}\beta(s+jB) ==1B(1η𝐫min2n2β(s+jB))1/B\displaystyle=\prod_{\ell=1}^{B}\left(1-\frac{\eta\mathbf{r}_{\min}}{2n^{2}}\beta(s+jB)\right)^{1/B}
=1B(1η𝐫min2Bn2β(s+jB))\displaystyle\leq\prod_{\ell=1}^{B}\left(1-\frac{\eta\mathbf{r}_{\min}}{2Bn^{2}}\beta(s+jB)\right)
=1B(1λβ(s+jB+)).\displaystyle\leq\prod_{\ell=1}^{B}\left(1-\lambda\beta(s+jB+\ell)\right).

Recall from (165) that V𝐫()V_{\mathbf{r}}(\cdot) is non-increasing. Therefore, for s+ΔBt1<s+(Δ+1)Bs+\Delta B\leq t-1<s+(\Delta+1)B we have

V𝐫(t1)\displaystyle V_{\mathbf{r}}(t-1) V𝐫(s+ΔB)\displaystyle\leq V_{\mathbf{r}}(s+\Delta B)
j=1Δ(1η𝐫min2n2β(s+jB))V𝐫(s)\displaystyle\leq\prod_{j=1}^{\Delta}\left(1-\frac{\eta\mathbf{r}_{\min}}{2n^{2}}\beta(s+jB)\right)V_{\mathbf{r}}(s)
j=1Δ=1B(1λβ(s+jB+))V𝐫(s)\displaystyle\leq\prod_{j=1}^{\Delta}\prod_{\ell=1}^{B}\left(1-\lambda\beta(s+jB+\ell)\right)V_{\mathbf{r}}(s)
=k=s+B+1s+(Δ+1)B(1λβ(k))V𝐫(s)\displaystyle=\prod_{k=s+B+1}^{s+(\Delta+1)B}\left(1-\lambda\beta(k)\right)V_{\mathbf{r}}(s)
k=s+B+1t1(1λβ(k))V𝐫(s).\displaystyle\leq\prod_{k=s+B+1}^{t-1}\left(1-\lambda\beta(k)\right)V_{\mathbf{r}}(s). (174)

Next, since {β(k)}\{\beta(k)\} is a non-increasing sequence, we have β(k)β(1)=β0\beta(k)\leq\beta(1)=\beta_{0}. Thus,

k=s+1s+B(1λβ(k))\displaystyle\prod_{k=s+1}^{s+B}\left(1-\lambda\beta(k)\right) k=s+1s+B(1λβ0)\displaystyle\geq\prod_{k=s+1}^{s+B}\left(1-\lambda\beta_{0}\right)
=(1λβ0)B1Bλβ0.\displaystyle=\left(1-\lambda\beta_{0}\right)^{B}\geq 1-B\lambda\beta_{0}. (175)

Therefore, combining (174) and (175), we get

V𝐫(t1)\displaystyle V_{\mathbf{r}}(t-1) k=s+B+1t1(1λβ(k))V𝐫(s)\displaystyle\leq\prod_{k=s+B+1}^{t-1}\left(1-\lambda\beta(k)\right)V_{\mathbf{r}}(s)
k=s+1s+B(1λβ(k))1Bλβ0k=s+B+1t1(1λβ(k))V𝐫(s)\displaystyle\leq\frac{\prod_{k=s+1}^{s+B}\left(1-\lambda\beta(k)\right)}{1-B\lambda\beta_{0}}\!\!\prod_{k=s+B+1}^{t-1}\!\!\left(1-\lambda\beta(k)\right)V_{\mathbf{r}}(s)
=11Bλβ0k=s+1t1(1λβ(k))V𝐫(s).\displaystyle=\frac{1}{1-B\lambda\beta_{0}}\prod_{k=s+1}^{t-1}(1-\lambda\beta(k))V_{\mathbf{r}}(s). (176)

Now, we define

Φ(t:s)=A(t1)A(s+1)\Phi(t:s)=A(t-1)\cdots A(s+1)

for ts{t\geq s} with Φ(t:t1)=I{\Phi(t:t-1)\!=\!I}. Note that Assumption 2-(a) and the fact A(k)=(1β(k))I+β(k)W(k){A(k)=(1\!-\!\beta(k))I\!+\!\beta(k)W(k)} imply 𝐫TΦ(t:s)=𝐫T{\mathbf{r}^{T}\Phi(t:s)=\mathbf{r}^{T}}. Then, setting 𝒖(s)=𝒖=U\boldsymbol{u}(s)=\boldsymbol{u}=U and 𝒖(t1)=Φ(t:s)𝒖(s)=Φ(t:s)U{\boldsymbol{u}(t-1)=\Phi(t:s)\boldsymbol{u}(s)=\Phi(t:s)U}, we can write

(Φ(t:s)𝟏𝐫T)U\displaystyle\left(\Phi(t:s)-\mathbf{1}\mathbf{r}^{T}\right)U =Φ(t:s)U𝟏𝐫TU\displaystyle=\Phi(t:s)U-\mathbf{1}\mathbf{r}^{T}U
=Φ(t:s)U𝟏𝐫TΦ(t:s)U\displaystyle=\Phi(t:s)U-\mathbf{1}\mathbf{r}^{T}\Phi(t:s)U
=𝒖(t1)𝟏𝐫T𝒖(t1).\displaystyle=\boldsymbol{u}(t-1)-\mathbf{1}\mathbf{r}^{T}\boldsymbol{u}(t-1).

Therefore, using (176) we have

(Φ(t:s)𝟏𝐫T)U𝐫2\displaystyle\big{\|}\big{(}\Phi(t:s)-\mathbf{1}\mathbf{r}^{T}\big{)}U\big{\|}_{\mathbf{r}}^{2} =𝒖(t1)𝟏𝐫T𝒖(t1)𝐫2=V𝐫(t1)\displaystyle=\left\|\boldsymbol{u}(t-1)-\mathbf{1}\mathbf{r}^{T}\boldsymbol{u}(t-1)\right\|_{\mathbf{r}}^{2}=V_{\mathbf{r}}(t-1)
11Bλβ0k=s+1t1(1λβ(k))V𝐫(s)\displaystyle\leq\frac{1}{1-B\lambda\beta_{0}}\prod_{k=s+1}^{t-1}(1-\lambda\beta(k))V_{\mathbf{r}}(s)
=11Bλβ0k=s+1t1(1λβ(k))𝒖𝟏𝐫T𝒖𝐫2\displaystyle=\frac{1}{1-B\lambda\beta_{0}}\prod_{k=s+1}^{t-1}(1-\lambda\beta(k))\left\|\boldsymbol{u}-\mathbf{1}\mathbf{r}^{T}\boldsymbol{u}\right\|_{\mathbf{r}}^{2}
11Bλβ0k=s+1t1(1λβ(k))𝒖𝐫2\displaystyle\leq\frac{1}{1-B\lambda\beta_{0}}\prod_{k=s+1}^{t-1}(1-\lambda\beta(k))\left\|\boldsymbol{u}\right\|_{\mathbf{r}}^{2}
=11Bλβ0k=s+1t1(1λβ(k))U𝐫2,\displaystyle=\frac{1}{1-B\lambda\beta_{0}}\prod_{k=s+1}^{t-1}(1-\lambda\beta(k))\left\|U\right\|_{\mathbf{r}}^{2},

where the second inequality follows from the fact that \left\|\boldsymbol{u}-\mathbf{1}\mathbf{r}^{T}\boldsymbol{u}\right\|_{\mathbf{r}}^{2}+\left\|\mathbf{1}\mathbf{r}^{T}\boldsymbol{u}\right\|_{\mathbf{r}}^{2}=\left\|\boldsymbol{u}\right\|_{\mathbf{r}}^{2}. This completes the proof of the lemma. \blacksquare
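Finally, as a purely illustrative sanity check with hypothetical sizes, the sketch below numerically verifies two facts used in this proof: the weighted Pythagorean identity invoked in the last step, and the bound \sum_{\ell=1}^{n-1}(v_{\ell+1}-v_{\ell})^{2}\geq n^{-2}V_{\mathbf{r}}(\boldsymbol{v}) for sorted \boldsymbol{v} established before (173).

```python
# Numerical sanity check (not a proof) of two facts used in the proof of Lemma 5:
# (i)  ||u - 1 r^T u||_r^2 + ||1 r^T u||_r^2 = ||u||_r^2  (weighted Pythagorean identity),
# (ii) sum_l (v_{l+1} - v_l)^2 >= n^{-2} V_r(v) for sorted v (cf. (172)-(173)).
# The random sizes below are illustrative choices.
import numpy as np

rng = np.random.default_rng(3)
n = 6
r = rng.random(n); r /= r.sum()                       # stochastic weight vector

def r_norm_sq(u):
    return float(np.sum(r * u**2))

u = rng.standard_normal(n)
mean_r = float(r @ u)                                 # r^T u
assert abs(r_norm_sq(u - mean_r) + r_norm_sq(np.full(n, mean_r)) - r_norm_sq(u)) < 1e-12

v = np.sort(rng.standard_normal(n))                   # sorted entries
V_r = r_norm_sq(v - float(r @ v))
gaps = np.sum(np.diff(v) ** 2)
assert gaps >= V_r / n**2 - 1e-12
print("Both facts hold on this random instance.")
```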