
Local Stochastic Approximation: A Unified View of Federated Learning and Distributed Multi-Task Reinforcement Learning Algorithms

Thinh T. Doan (School of Electrical and Computer Engineering, Georgia Institute of Technology, GA, 30332, USA. Email: [email protected])
Abstract

Motivated by broad applications in reinforcement learning and federated learning, we study local stochastic approximation over a network of agents, where the goal is to find the root of an operator composed of the local operators at the agents. Our focus is to characterize the finite-time performance of this method when the data at each agent are generated from Markov processes, and hence dependent. In particular, we provide explicit convergence rates of local stochastic approximation for both constant and time-varying step sizes. Our results show that these rates are within a logarithmic factor of the ones under independent data. We then illustrate the applications of these results to problems in multi-task reinforcement learning and federated learning.

1 Introduction

In this paper, we study local stochastic approximation (SA), a distributed variant of the classic SA method originally introduced by Robbins and Monro [1] for solving root-finding problems under corrupted measurements of an operator (or function). We consider the setting where a group of agents communicates indirectly through a centralized coordinator. The goal of the agents is to find the root of an operator that is composed of the local operators at the agents. To solve this problem, each agent iteratively runs a number of local SA steps based on its own data, and the resulting iterates are then averaged at the centralized coordinator. This algorithm, presented in detail in Section 2, is relatively simple and efficient for solving problems in which a large amount of data is distributed across different agents.

We are motivated by the broad applications of local SA to problems in federated learning [2, 3] and multi-task reinforcement learning [4, 5, 6]. These two areas share a common framework, in which multiple agents (clients or workers) collaboratively solve a machine learning/reinforcement learning problem under the coordination of a centralized server [2, 3, 4, 5, 6]. Instead of sharing the data collected at the local devices/environments with the server, the agents run local updates of their models/policies based on their data, and the results are then aggregated at the server with the goal of optimizing the global learning objective. In these contexts, local SA can be used to formulate the popular algorithms studied in these two areas, as shown in Section 4.

Our focus in this paper is on the theoretical aspects of the finite-time performance of local stochastic approximation. Our goal is to characterize the convergence rate of this method when the data at the agents are heterogeneous and dependent. In particular, we consider the case where the data at each agent are generated from a Markov process, as often considered in the context of multi-task reinforcement learning. Our setting generalizes the existing works in the literature, where the local data are assumed to be i.i.d. Under fairly standard assumptions, our main contribution is to show that the convergence rates of local SA are within a logarithmic factor of the comparable bounds for independent data. As illustrated in Section 4, the results in this paper provide a unified view of the finite-time bounds of federated learning and multi-task reinforcement learning algorithms under different settings.

1.1 Related works

Stochastic approximation is one of the most widely used methods for solving stochastic optimization problems in many areas, including machine learning [7] and reinforcement learning [8, 9]. The asymptotic convergence of SA under Markov randomness is often established by using the ordinary differential equation (ODE) method [10, 11], which shows that under the right conditions the noise effects eventually average out and the SA iterate asymptotically follows a stable ODE. On the other hand, the rates of SA have mostly been considered in the context of stochastic gradient descent (SGD) under i.i.d. samples [7, 12]. The finite-time convergence of SGD under Markov randomness has been studied in [13, 14] and the references therein. In the context of reinforcement learning, such results have been derived in [15, 16, 17, 18, 19] for linear SA and in the recent work [20] for nonlinear SA.

The local SA method considered in this paper has recently received much interest in the context of federated learning under the name of local SGD; see, for example, [21, 22, 23, 24, 25, 26, 27]. The finite-time bounds of local SGD in these works are derived when the local data at each agent are sampled i.i.d. On the other hand, our focus is the setting where the local data at each agent are sampled from a Markov process and are therefore dependent. We note that the popular distributed SGD in machine learning shares the same communication structure as local SGD. However, while in local SGD the agents have heterogeneous objectives and only share their iterates with the centralized coordinator, in distributed SGD the agents compute stochastic gradients of a global objective and send them to the centralized coordinator.

Finally, distributed/local stochastic approximation has also found broad applications in the context of (multi-task) reinforcement learning [5, 6, 28, 29, 4, 30], which is the main motivation of this paper. In these applications, since the environment is modeled as a Markov process, the noise in local SA is Markovian. However, there is a lack of understanding of its finite-time performance, and our results in this paper help to fill this gap.

2 Local stochastic approximation

We consider a distributed learning framework in which a group of $N$ agents communicates indirectly through a centralized coordinator. Associated with each agent $i$ is a local operator $F_i:\mathbb{R}^d\rightarrow\mathbb{R}^d$. The goal of the agents is to find a solution $\theta^{*}$ satisfying

$$F(\theta^{*}) \triangleq \sum_{i=1}^{N} F_{i}(\theta^{*}) = 0, \qquad (1)$$

where each $F_i:\mathbb{R}^d\rightarrow\mathbb{R}^d$ is given as

$$F_{i}(\theta) \triangleq \mathbb{E}_{\pi_{i}}\left[F_{i}(\theta;X_{i})\right] = \sum_{X_{i}\in{\cal X}_{i}} \pi_{i}(X_{i})\,F_{i}(\theta;X_{i}). \qquad (2)$$

Here ${\cal X}_i$ is a statistical sample space with probability distribution $\pi_i$ at agent $i$. We assume that each agent $i$ has access to the operator $F_i$ only through its samples $\{F_i(\cdot,X_i^k)\}$, where $\{X_i^k\}$ is a sequence of samples of the random variable $X_i$. We are interested in the case where each sequence $\{X_i^k\}$ is generated from an ergodic Markov process whose stationary distribution is $\pi_i$; moreover, the sequences $\{X_i^k\}$ are independent across $i$. To solve problem (1), we study the local stochastic approximation method formally stated in Algorithm 1. In our algorithm, we implicitly assume that there is an oracle that returns to agent $i$ the value $F_i(\theta;X_i)$ for given $\theta$ and $X_i$.

Initialization: Each agent $i$ initializes $\theta_i^0\in\mathbb{R}^d$, a sequence of step sizes $\{\alpha_k\}$, and a positive integer $H$.
The centralized coordinator initializes $\bar{\theta}^0 = \frac{1}{N}\sum_{i=1}^{N}\theta_i^0$.
for $k=0,1,2,\ldots$ do
       for each worker $i$ do
             1) Receive $\bar{\theta}^k$ sent by the centralized coordinator
             2) Set $\theta_i^{k,0} = \bar{\theta}^k$
             for $t=0,1,\ldots,H-1$ do

$$\theta_i^{k,t+1} = \theta_i^{k,t} - \alpha_{k+t}F_i(\theta_i^{k,t};X_i^{k+t}). \qquad (3)$$

             end for
       end for
      The centralized coordinator receives $\theta_i^{k,H}$ from each agent $i$ and implements

$$\bar{\theta}^{k+1} = \frac{1}{N}\sum_{i=1}^{N}\theta_i^{k,H}. \qquad (4)$$

end for
Algorithm 1: Local stochastic approximation

Algorithm 1 is relatively simple to implement and can be explained as follows. Each agent $i$ maintains a copy $\theta_i$ of the solution $\theta^{*}$, and the centralized coordinator maintains $\bar{\theta}$, the average of the $\theta_i$. At any iteration $k\geq 0$, each agent $i$ first receives $\bar{\theta}^k$ from the centralized coordinator and initializes its iterate as $\theta_i^{k,0}=\bar{\theta}^k$. Here $\theta_i^{k,t}$ denotes the iterate at iteration $k$ and local time $t\in[0,\ldots,H-1]$. Agent $i$ then runs $H$ local stochastic approximation steps using the time-varying step sizes $\alpha_k$ and its local data $\{X_i^{k+t}\}$. After $H$ local steps, the agents send their new local iterates $\theta_i^{k,H}$ to the centralized coordinator, which updates $\bar{\theta}^{k+1}$ by averaging these local values.
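To make the updates (3) and (4) concrete, the following is a minimal NumPy sketch of Algorithm 1. The interfaces `F` (the sampled operator) and `next_sample` (the Markovian data source) are placeholders introduced here for illustration, not objects defined in the paper.

```python
import numpy as np

def local_sa(F, next_sample, init_states, theta0, alphas, H, K):
    """Sketch of Algorithm 1 (local stochastic approximation).

    F(i, theta, x)    -- sampled operator F_i(theta; X_i) for agent i
    next_sample(i, x) -- draws X_i^{k+1} from the Markov chain given X_i^k
    init_states       -- initial Markov states, one per agent
    alphas            -- step-size sequence {alpha_k}
    H, K              -- number of local steps and of outer iterations
    """
    N = len(init_states)
    theta_bar = theta0.copy()             # coordinator's iterate \bar{theta}^0
    states = list(init_states)
    for k in range(K):
        local_iterates = []
        for i in range(N):
            theta = theta_bar.copy()      # agents start from \bar{theta}^k
            for t in range(H):            # H local SA steps, update (3)
                theta = theta - alphas[k + t] * F(i, theta, states[i])
                states[i] = next_sample(i, states[i])
            local_iterates.append(theta)
        theta_bar = np.mean(local_iterates, axis=0)  # server averaging, update (4)
    return theta_bar
```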

3 Convergence analysis

In this section, we study the finite-time performance of Algorithm 1 when each operator $F_i$ is strongly monotone. We provide an upper bound on the rate at which the mean square error $\mathbb{E}\left[\|\bar{\theta}^k-\theta^{*}\|^2\right]$ converges to zero, for both constant and time-varying step sizes. In particular, under a constant step size, $\alpha_k=\alpha$ for some constant $\alpha$ and $k\gtrsim\log(1/\alpha)$, this convergence occurs at the rate

$$\mathbb{E}\left[\|\bar{\theta}^{k+1}-\theta^{*}\|^2\right] \leq \left(1-\frac{H\mu\alpha}{N}\right)^{k+1-\tau(\alpha)}\mathbb{E}\left[\|\bar{\theta}^{\tau(\alpha)}-\theta^{*}\|^2\right] + {\cal O}\left(CNHLB^2\log(1/\alpha)\,\alpha\right),$$

where $\tau(\alpha)$ denotes the mixing time of the underlying Markov chain. On the other hand, under time-varying step sizes $\alpha_k\sim 1/(k+1)$ the rate is

$$\mathbb{E}\left[\|\bar{\theta}^{k+1}-\theta^{*}\|^2\right] \leq {\cal O}\left(\frac{\mathbb{E}\left[\|\bar{\theta}^{{\cal K}^{*}}-\theta^{*}\|^2\right]}{k^2}\right) + {\cal O}\left(\frac{CHLB^2\log(k)}{k}\right), \qquad \forall k\geq{\cal K}^{*},$$

for some positive integer ${\cal K}^{*}$ depending on the problem parameters $C,L,B,\mu$, which we define shortly. First, our rates scale linearly with the number of local steps $H$. As expected, when $H$ goes to $\infty$, each agent runs only its local SA without communicating with the centralized coordinator; in this case, one would expect each agent to find only the root of its own operator. Second, one can view the constant $B$ as the variance of the noise. Finally, our rates match those of local SGD under i.i.d. noise [27], up to a $\log$ factor that reflects the Markovian randomness.

We start our analysis by introducing the technical assumptions and notation used in this section. Given a constant $\alpha>0$, we denote by $\tau_i(\alpha)$ the mixing time associated with the Markov chain $\{X_i^k\}$, i.e., the following condition holds:

$$\|\mathbb{P}_i^k(X_i^0,\cdot)-\pi_i\|_{TV} \leq \alpha, \quad \forall k\geq\tau_i(\alpha),\; \forall X_i^0\in{\cal X}_i,\; \forall i\in[N], \qquad (5)$$

where $\|\cdot\|_{TV}$ is the total variation distance and $\mathbb{P}_i^k(X_i^0,X_i)$ is the probability that $X_i^k=X_i$ when the chain starts from $X_i^0$. The mixing time represents the time it takes for $X_i^k$ to get close to the stationary distribution $\pi_i$. In addition, we denote $\tau(\alpha)=\max_i\tau_i(\alpha)$.
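For intuition, when the transition matrix of a finite chain is available, the mixing time in (5) can be computed numerically by tracking the worst-case total variation distance. The sketch below, including the two-state chain at the end, is purely illustrative and not part of the analysis.

```python
import numpy as np

def mixing_time(P, alpha, k_max=100_000):
    """Smallest k with max_x ||P^k(x, .) - pi||_TV <= alpha for a finite
    ergodic chain with row-stochastic transition matrix P, as in (5)."""
    # stationary distribution pi: left eigenvector of P for eigenvalue 1
    eigvals, eigvecs = np.linalg.eig(P.T)
    pi = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
    pi = pi / pi.sum()
    Pk = np.eye(P.shape[0])
    for k in range(1, k_max + 1):
        Pk = Pk @ P
        tv = 0.5 * np.abs(Pk - pi).sum(axis=1).max()  # worst case over X_i^0
        if tv <= alpha:
            return k
    return k_max

# illustrative two-state chain (hypothetical numbers)
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
print(mixing_time(P, alpha=1e-3))  # grows like C * log(1/alpha), cf. Lemma 1
```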

For the results derived in this section, we consider the following assumptions, which for simplicity we assume hold throughout the rest of this paper.

Assumption 1.

The Markov chain $\{X_i^k\}$ is ergodic (irreducible and aperiodic) with finite state space ${\cal X}_i$.

Assumption 2.

The mappings $F_i(\cdot)$ and $F_i(\cdot,X_i)$ are Lipschitz continuous in $\theta$ almost surely, i.e., there exists a positive constant $L$ such that for all $X_i\in{\cal X}_i$ and $i\in[N]$,

$$\|F_i(\theta)-F_i(\omega)\| \leq L\|\theta-\omega\| \quad\text{and}\quad \|F_i(\theta,X_i)-F_i(\omega,X_i)\| \leq L\|\theta-\omega\|, \qquad \forall \theta,\omega\in\mathbb{R}^d. \qquad (6)$$
Assumption 3.

There exists a positive constant $\mu$ such that

$$(F_i(\theta)-F_i(\omega))^T(\theta-\omega) \geq \mu\|\theta-\omega\|^2, \qquad \forall \theta,\omega\in\mathbb{R}^d,\; \forall i\in[N]. \qquad (7)$$

Assumption 1 implies that the Markov chain $\{X_i^k\}$ has a geometric mixing time, which depends on the second largest eigenvalue of the transition probability matrix of $\{X_i^k\}$ [31]. This assumption holds in various applications, e.g., in incremental optimization [32] and in reinforcement learning problems modeled by Markov decision processes with finitely many states and actions [33]. In addition, Assumption 1 is used in the existing literature to study the finite-time performance of SA and its distributed variants under Markov randomness; see [13, 16, 20, 19, 34, 35] and the references therein.

Assumption 2 is standard in the analysis of local SGD methods in federated learning [27]. It also holds in the reinforcement learning context considered in Section 4. Assumption 3 states that the $F_i$ are strongly monotone (the analogue of strong convexity in the context of SGD).

The following result is a consequence of Assumptions 1 and 2, and is shown in [20, Lemma 3.2]. This lemma states that each Markov chain $X_i$ has a geometric mixing time, a property that translates to the operator $F_i$ through the Lipschitz condition.

Lemma 1.

There exists a constant $C>0$ such that, given $\alpha>0$, we have $\tau_i(\alpha)\leq C\log(1/\alpha)$ for all $i\in[N]$ and

$$\left\|\mathbb{E}\left[F_i(\theta,X_i^k)\,|\,X_i^0\right]-F_i(\theta)\right\| \leq \alpha(\|\theta\|+1), \qquad \forall\theta,\; \forall k\geq\tau_i(\alpha). \qquad (8)$$

Finally, since ${\cal X}_i$ is finite for all $i\in[N]$, Assumption 2 also gives the following result.

Lemma 2.

Let $B=\max_i\max_{X_i\in{\cal X}_i}\|F_i(0,X_i)\|$. Then we have for all $\theta\in\mathbb{R}^d$

$$\|F_i(\theta)\| \leq B(\|\theta\|+1) \quad\text{and}\quad \|F_i(\theta,X_i)\| \leq B(\|\theta\|+1), \qquad \forall X_i\in{\cal X}_i,\; \forall i\in[N]. \qquad (9)$$

3.1 Constant step sizes

In this subsection, we derive the rate of Algorithm 1 under constant step sizes, that is, $\alpha_k=\alpha$ for all $k\geq 0$. Note that by Lemma 1, $\tau_i(\alpha)\leq C\log(1/\alpha)$ for a given constant $\alpha>0$. Thus, $\lim_{\alpha\rightarrow 0}\tau_i(\alpha)\alpha=0$ for all $i\in[N]$, which implies that there exists a sufficiently small positive $\alpha$ such that

$$\alpha\tau(\alpha) \leq \min\left\{\frac{\log(2)}{2BH}\;,\;\frac{\mu}{8N\left(19B^2H+9+57LBH\right)}\;,\;\frac{N}{2H\mu}\right\}, \qquad (10)$$
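As a sanity check, once the problem constants are fixed, condition (10) is straightforward to verify numerically; the constants in the call below are hypothetical and serve only to illustrate the computation.

```python
import math

def check_condition_10(alpha, tau_alpha, B, H, L, mu, N):
    """Verify the step-size condition (10): alpha * tau(alpha) <= min{...}."""
    bound = min(
        math.log(2) / (2 * B * H),
        mu / (8 * N * (19 * B**2 * H + 9 + 57 * L * B * H)),
        N / (2 * H * mu),
    )
    return alpha * tau_alpha <= bound

# hypothetical constants, for illustration only
print(check_condition_10(alpha=1e-6, tau_alpha=10, B=1.0, H=5, L=1.0, mu=0.5, N=10))
```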

where we recall that $\tau(\alpha)=\max_i\tau_i(\alpha)$. The results in this section are established under condition (10). Under constant step sizes, we have from (3), for all $k\geq 0$,

$$\theta_i^{k,H} = \theta_i^{k,0} - \alpha\sum_{t=0}^{H-1}F_i(\theta_i^{k,t};X_i^{k+t}) = \bar{\theta}^k - \alpha\sum_{t=0}^{H-1}F_i(\theta_i^{k,t};X_i^{k+t}), \qquad (11)$$

which implies

$$\bar{\theta}^{k+1} = \bar{\theta}^k - \frac{\alpha}{N}\sum_{i=1}^{N}\sum_{t=0}^{H-1}F_i(\theta_i^{k,t};X_i^{k+t}). \qquad (12)$$

To derive the finite-time bound of Algorithm 1, we require the following three technical lemmas. For ease of exposition, their proofs are presented in the Appendix. The first lemma upper bounds the norm of $\theta_i$ by the norm of $\bar{\theta}$.

Lemma 3.

Let $\{\bar{\theta}^k\}$ and $\{\theta_i^{k,t}\}$, for all $k\geq 0$ and $t\in[1,H]$, be generated by Algorithm 1. In addition, let the step size $\alpha$ satisfy (10). Then the following relations hold for all $k\geq 0$ and $t\in[0,H]$:

$$\|\theta_i^{k,t}\| \leq 2\|\bar{\theta}^k\| + 2BH\alpha \leq 2\|\bar{\theta}^k\| + 1, \qquad (13)$$
$$\|\theta_i^{k,t}-\bar{\theta}^k\| \leq 2BH\alpha\|\bar{\theta}^k\| + 2BH\alpha. \qquad (14)$$

Our next lemma provides an upper bound for the quantity $\|\bar{\theta}^k-\bar{\theta}^{k-\tau(\alpha)}\|$.

Lemma 4.

Let all the conditions in Lemma 3 hold. Then the following relations hold for all $k\geq 0$ and $t\in[0,H]$:

$$\|\bar{\theta}^k-\bar{\theta}^{k-\tau(\alpha)}\| \leq 12BH\alpha\tau(\alpha)\|\bar{\theta}^k\| + 12BH\alpha\tau(\alpha) \leq 2\|\bar{\theta}^k\| + 2, \qquad (15)$$
$$\|\bar{\theta}^k-\bar{\theta}^{k-\tau(\alpha)}\|^2 \leq 288B^2H^2\alpha^2\tau^2(\alpha)\|\bar{\theta}^k\|^2 + 288B^2H^2\alpha^2\tau^2(\alpha) \leq 8\|\bar{\theta}^k\|^2 + 8. \qquad (16)$$

Finally, we present an upper bound for the bias caused by Markovian noise.

Lemma 5.

Let all the conditions in Lemma 3 hold. Then we have

$$\begin{aligned}
-\sum_{i=1}^{N}\sum_{t=0}^{H-1}\mathbb{E}\left[\left\langle \bar{\theta}^k-\theta^{*},\,F_i(\theta_i^{k,t};X_i^{k+t})-F_i(\bar{\theta}^k)\right\rangle\right]
&\leq 36NH\alpha\,\mathbb{E}\left[\|\bar{\theta}^k-\theta^{*}\|^2\right] + 36NH\alpha\left(1+\|\theta^{*}\|\right)^2 \\
&\quad + 12(19L+6B)NBH^2\alpha\tau(\alpha)\,\mathbb{E}\left[\|\bar{\theta}^k-\theta^{*}\|^2\right] \\
&\quad + 12(19L+6B)NBH^2\left(1+\|\theta^{*}\|\right)^2\alpha\tau(\alpha). \qquad (17)
\end{aligned}$$

Our first main result is presented in the following theorem, where we derive the rate of Algorithm 1 under constant step sizes.

Theorem 1.

Let $\{\bar{\theta}^k\}$ and $\{\theta_i^{k,t}\}$, for all $k\geq 0$ and $t\in[1,H]$, be generated by Algorithm 1. In addition, let the step size $\alpha$ satisfy (10). Then we have

$$\begin{aligned}
\mathbb{E}\left[\|\bar{\theta}^{k+1}-\theta^{*}\|^2\right]
&\leq \left(1-\frac{H\mu\alpha}{N}\right)^{k+1-\tau(\alpha)}\mathbb{E}\left[\|\bar{\theta}^{\tau(\alpha)}-\theta^{*}\|^2\right] \\
&\quad + \frac{8NC\left(B^2H+9+3(19L+6B)BH\right)(\|\theta^{*}\|+1)^2}{\mu}\log(1/\alpha)\,\alpha. \qquad (18)
\end{aligned}$$
Remark 1.

Under constant step sizes, the bound in (18) shows that the mean square error generated by Algorithm 1 decays exponentially to a ball around the origin; as $\alpha$ decays to zero, the size of this ball also goes to zero. Second, our rate differs from the one obtained with i.i.d. data, for example in [27], only by a $\log$ factor; this reflects the impact of the Markovian randomness through the mixing time $\tau(\alpha)$. Third, our upper bound scales linearly with the number of local steps $H$.

Proof.

By (12) we consider

$$\begin{aligned}
\|\bar{\theta}^{k+1}-\theta^{*}\|^2
&= \left\|\bar{\theta}^k-\theta^{*}-\frac{\alpha}{N}\sum_{i=1}^{N}\sum_{t=0}^{H-1}F_i(\theta_i^{k,t};X_i^{k+t})\right\|^2 \\
&= \|\bar{\theta}^k-\theta^{*}\|^2 + \left\|\frac{\alpha}{N}\sum_{i=1}^{N}\sum_{t=0}^{H-1}F_i(\theta_i^{k,t};X_i^{k+t})\right\|^2 - \frac{2\alpha}{N}\left\langle \bar{\theta}^k-\theta^{*},\,\sum_{i=1}^{N}\sum_{t=0}^{H-1}F_i(\theta_i^{k,t};X_i^{k+t})\right\rangle \\
&= \|\bar{\theta}^k-\theta^{*}\|^2 + \left\|\frac{\alpha}{N}\sum_{i=1}^{N}\sum_{t=0}^{H-1}F_i(\theta_i^{k,t};X_i^{k+t})\right\|^2 - \frac{2H\alpha}{N}\left\langle \bar{\theta}^k-\theta^{*},\,F(\bar{\theta}^k)\right\rangle \\
&\quad - \frac{2\alpha}{N}\left\langle \bar{\theta}^k-\theta^{*},\,\sum_{i=1}^{N}\sum_{t=0}^{H-1}F_i(\theta_i^{k,t};X_i^{k+t})-F_i(\bar{\theta}^k)\right\rangle. \qquad (19)
\end{aligned}$$

First, using (9) and (13), we bound the second term on the right-hand side of (19):

$$\begin{aligned}
\left\|\frac{\alpha}{N}\sum_{i=1}^{N}\sum_{t=0}^{H-1}F_i(\theta_i^{k,t};X_i^{k+t})\right\|^2
&\leq \frac{\alpha^2}{N^2}\left|\sum_{i=1}^{N}\sum_{t=0}^{H-1}B(\|\theta_i^{k,t}\|+1)\right|^2 \leq \frac{\alpha^2}{N^2}\left|\sum_{i=1}^{N}\sum_{t=0}^{H-1}2B(\|\bar{\theta}^k\|+1)\right|^2 \\
&\leq 4B^2H^2\alpha^2(\|\bar{\theta}^k\|+1)^2 \leq 8B^2H^2\alpha^2\|\bar{\theta}^k-\theta^{*}\|^2 + 8B^2H^2(\|\theta^{*}\|+1)^2\alpha^2. \qquad (20)
\end{aligned}$$

Second, by Assumption 3 we have

$$-\frac{2H\alpha}{N}\left\langle \bar{\theta}^k-\theta^{*},\,F(\bar{\theta}^k)\right\rangle \leq -\frac{2H\mu\alpha}{N}\|\bar{\theta}^k-\theta^{*}\|^2. \qquad (21)$$

Thus, taking the expectation on both sides of (19) and using (17), (20), and (21) yields

$$\begin{aligned}
\mathbb{E}\left[\|\bar{\theta}^{k+1}-\theta^{*}\|^2\right]
&\leq \left(1-\frac{2H\mu\alpha}{N}\right)\mathbb{E}\left[\|\bar{\theta}^k-\theta^{*}\|^2\right] + 8B^2H^2\alpha^2\,\mathbb{E}\left[\|\bar{\theta}^k-\theta^{*}\|^2\right] + 8B^2H^2(\|\theta^{*}\|+1)^2\alpha^2 \\
&\quad + 72H\alpha^2\,\mathbb{E}\left[\|\bar{\theta}^k-\theta^{*}\|^2\right] + 72H\alpha^2\left(1+\|\theta^{*}\|\right)^2 \\
&\quad + 24(19L+6B)BH^2\alpha^2\tau(\alpha)\,\mathbb{E}\left[\|\bar{\theta}^k-\theta^{*}\|^2\right] \\
&\quad + 24(19L+6B)BH^2\left(1+\|\theta^{*}\|\right)^2\alpha^2\tau(\alpha) \\
&\leq \left(1-\frac{2H\mu\alpha}{N}\right)\mathbb{E}\left[\|\bar{\theta}^k-\theta^{*}\|^2\right] \\
&\quad + 8H\left(B^2H+9+3(19L+6B)BH\right)\tau(\alpha)\alpha^2\,\mathbb{E}\left[\|\bar{\theta}^k-\theta^{*}\|^2\right] \\
&\quad + 8H\left(B^2H+9+3(19L+6B)BH\right)(\|\theta^{*}\|+1)^2\tau(\alpha)\alpha^2.
\end{aligned}$$

Recall that $\alpha$ satisfies (10) and, by Lemma 1, $\tau(\alpha)\leq C\log(1/\alpha)$. Then the preceding relation yields (18), i.e., for all $k\geq\tau(\alpha)$,

$$\begin{aligned}
\mathbb{E}\left[\|\bar{\theta}^{k+1}-\theta^{*}\|^2\right]
&\leq \left(1-\frac{H\mu\alpha}{N}\right)\mathbb{E}\left[\|\bar{\theta}^k-\theta^{*}\|^2\right] \\
&\quad + 8H\left(B^2H+9+3(19L+6B)BH\right)(\|\theta^{*}\|+1)^2\tau(\alpha)\alpha^2 \\
&\leq \left(1-\frac{H\mu\alpha}{N}\right)^{k+1-\tau(\alpha)}\mathbb{E}\left[\|\bar{\theta}^{\tau(\alpha)}-\theta^{*}\|^2\right] \\
&\quad + 8H\left(B^2H+9+3(19L+6B)BH\right)(\|\theta^{*}\|+1)^2\tau(\alpha)\alpha^2\sum_{t=\tau(\alpha)}^{k}\left(1-\frac{H\mu\alpha}{N}\right)^{k-t} \\
&\leq \left(1-\frac{H\mu\alpha}{N}\right)^{k+1-\tau(\alpha)}\mathbb{E}\left[\|\bar{\theta}^{\tau(\alpha)}-\theta^{*}\|^2\right] \\
&\quad + \frac{8NC\left(B^2H+9+3(19L+6B)BH\right)(\|\theta^{*}\|+1)^2}{\mu}\log(1/\alpha)\,\alpha. \qquad\blacksquare
\end{aligned}$$

3.2 Time-varying step sizes

In this section, we derive the finite-time bound of Algorithm 1 under time-varying step sizes $\alpha_k$. We consider $\alpha_k$ nonnegative, decreasing, and satisfying $\lim_{k\rightarrow\infty}\alpha_k=0$. Thus, by Lemma 1 we have $\lim_{k\rightarrow\infty}\tau_i(\alpha_k)\alpha_k=0$ for all $i\in[N]$. This implies that there exists a positive integer ${\cal K}^{*}$ such that for all $k\geq{\cal K}^{*}$

$$\sum_{t=k-\tau(\alpha_k)}^{k}\alpha_t \leq \alpha_{k-\tau(\alpha_k)}\tau(\alpha_k) \leq \min\left\{\frac{\log(2)}{2BH}\;,\;\frac{\mu}{8N\left(19B^2H+9+57LBH\right)}\;,\;2\alpha_0\right\}, \qquad (22)$$

where we recall that $\tau(\alpha_k)=\max_i\tau_i(\alpha_k)$. For convenience, we denote

$$\alpha_{k;\tau(\alpha_k)} = \sum_{t=k-\tau(\alpha_k)}^{k}\alpha_t. \qquad (23)$$

Under the step sizes $\alpha_k$, we have from (3), for all $k\geq 0$,

$$\theta_i^{k,H} = \theta_i^{k,0} - \sum_{t=0}^{H-1}\alpha_{k+t}F_i(\theta_i^{k,t};X_i^{k+t}) = \bar{\theta}^k - \sum_{t=0}^{H-1}\alpha_{k+t}F_i(\theta_i^{k,t};X_i^{k+t}), \qquad (24)$$

which implies

$$\bar{\theta}^{k+1} = \bar{\theta}^k - \frac{1}{N}\sum_{i=1}^{N}\sum_{t=0}^{H-1}\alpha_{k+t}F_i(\theta_i^{k,t};X_i^{k+t}). \qquad (25)$$

Similar to the case of constant step sizes, we first present three technical lemmas that are useful for our main result in Theorem 2 below. For ease of exposition, their proofs are presented in the Appendix. The first lemma upper bounds the norm of $\theta_i$ by the norm of $\bar{\theta}$.

Lemma 6.

Let $\{\bar{\theta}^k\}$ and $\{\theta_i^{k,t}\}$, for all $k\geq 0$ and $t\in[1,H]$, be generated by Algorithm 1. In addition, let the step sizes $\{\alpha_k\}$ satisfy (22). Then the following relations hold for all $k\geq 0$ and $t\in[0,H]$:

$$\|\theta_i^{k,t}\| \leq 2\|\bar{\theta}^k\| + 2BH\alpha_k \leq 2\|\bar{\theta}^k\| + 1, \qquad (26)$$
$$\|\theta_i^{k,t}-\bar{\theta}^k\| \leq 2BH\alpha_k\|\bar{\theta}^k\| + 2BH\alpha_k. \qquad (27)$$

Our next lemma provides an upper bound for the quantity $\|\bar{\theta}^k-\bar{\theta}^{k-\tau(\alpha_k)}\|$.

Lemma 7.

Let all the conditions in Lemma 6 hold. Then the following relations hold for all $k\geq 0$ and $t\in[0,H]$:

$$\|\bar{\theta}^k-\bar{\theta}^{k-\tau(\alpha_k)}\| \leq 12BH\alpha_{k;\tau(\alpha_k)}\|\bar{\theta}^k\| + 12BH\alpha_{k;\tau(\alpha_k)} \leq 2\|\bar{\theta}^k\| + 2, \qquad (28)$$
$$\|\bar{\theta}^k-\bar{\theta}^{k-\tau(\alpha_k)}\|^2 \leq 288B^2H^2\alpha_{k;\tau(\alpha_k)}^2\|\bar{\theta}^k\|^2 + 288B^2H^2\alpha_{k;\tau(\alpha_k)}^2 \leq 8\|\bar{\theta}^k\|^2 + 8. \qquad (29)$$

Finally, we present an upper bound for the bias caused by Markovian noise.

Lemma 8.

Let all the conditions in Lemma 6 hold. Then we have

$$\begin{aligned}
-\sum_{i=1}^{N}\sum_{t=0}^{H-1}\alpha_{k+t}\,\mathbb{E}\left[\left\langle \bar{\theta}^k-\theta^{*},\,F_i(\theta_i^{k,t};X_i^{k+t})-F_i(\bar{\theta}^k)\right\rangle\right]
&\leq 36NH\alpha_k^2\,\mathbb{E}\left[\|\bar{\theta}^k-\theta^{*}\|^2\right] + 36NH\alpha_k^2\left(1+\|\theta^{*}\|\right)^2 \\
&\quad + 12(19L+6B)NBH^2\alpha_k\alpha_{k;\tau(\alpha_k)}\,\mathbb{E}\left[\|\bar{\theta}^k-\theta^{*}\|^2\right] \\
&\quad + 12(19L+6B)NBH^2\left(1+\|\theta^{*}\|\right)^2\alpha_k\alpha_{k;\tau(\alpha_k)}. \qquad (30)
\end{aligned}$$

The second main result in this paper is presented in the following theorem, where we study the rate of Algorithm 1 under time-varying step sizes.

Theorem 2.

Let $\{\bar{\theta}^k\}$ and $\{\theta_i^{k,t}\}$, for all $k\geq 0$ and $t\in[1,H]$, be generated by Algorithm 1. In addition, let the step sizes $\alpha_k=\alpha/(k+1)$ satisfy (22), where $\alpha=2N/(H\mu)$. Then we have for all $k\geq{\cal K}^{*}$

$$\begin{aligned}
\mathbb{E}\left[\|\bar{\theta}^{k+1}-\theta^{*}\|^2\right]
&\leq \frac{({\cal K}^{*})^2\,\mathbb{E}\left[\|\bar{\theta}^{{\cal K}^{*}}-\theta^{*}\|^2\right]}{(k+1)^2} \\
&\quad + \frac{16\alpha^2CH\left(B^2H+9+3(19L+6B)BH\right)(\|\theta^{*}\|+1)^2\log\left(\frac{k+1}{\alpha}\right)}{k+1}. \qquad (31)
\end{aligned}$$
Remark 2.

Here we have the same observations as in Theorem 1, except that the rate is now sublinear due to the time-varying step sizes. On the other hand, the mean square error decays to zero rather than to a neighborhood of the origin.

Proof.

By (25) we consider

$$\begin{aligned}
\|\bar{\theta}^{k+1}-\theta^{*}\|^2
&= \left\|\bar{\theta}^k-\theta^{*}-\frac{1}{N}\sum_{i=1}^{N}\sum_{t=0}^{H-1}\alpha_{k+t}F_i(\theta_i^{k,t};X_i^{k+t})\right\|^2 \\
&= \|\bar{\theta}^k-\theta^{*}\|^2 + \left\|\frac{1}{N}\sum_{i=1}^{N}\sum_{t=0}^{H-1}\alpha_{k+t}F_i(\theta_i^{k,t};X_i^{k+t})\right\|^2 - \frac{2}{N}\left\langle \bar{\theta}^k-\theta^{*},\,\sum_{i=1}^{N}\sum_{t=0}^{H-1}\alpha_{k+t}F_i(\theta_i^{k,t};X_i^{k+t})\right\rangle \\
&= \|\bar{\theta}^k-\theta^{*}\|^2 + \left\|\frac{1}{N}\sum_{i=1}^{N}\sum_{t=0}^{H-1}\alpha_{k+t}F_i(\theta_i^{k,t};X_i^{k+t})\right\|^2 - \frac{2H}{N}\left\langle \bar{\theta}^k-\theta^{*},\,F(\bar{\theta}^k)\sum_{t=0}^{H-1}\alpha_{k+t}\right\rangle \\
&\quad - \frac{2}{N}\left\langle \bar{\theta}^k-\theta^{*},\,\sum_{i=1}^{N}\sum_{t=0}^{H-1}\alpha_{k+t}\left(F_i(\theta_i^{k,t};X_i^{k+t})-F_i(\bar{\theta}^k)\right)\right\rangle. \qquad (32)
\end{aligned}$$

First, using (9) and (26), we bound the second term on the right-hand side of (32):

$$\begin{aligned}
\left\|\frac{1}{N}\sum_{i=1}^{N}\sum_{t=0}^{H-1}\alpha_{k+t}F_i(\theta_i^{k,t};X_i^{k+t})\right\|^2
&\leq \frac{\alpha_k^2}{N^2}\left|\sum_{i=1}^{N}\sum_{t=0}^{H-1}B(\|\theta_i^{k,t}\|+1)\right|^2 \leq \frac{\alpha_k^2}{N^2}\left|\sum_{i=1}^{N}\sum_{t=0}^{H-1}2B(\|\bar{\theta}^k\|+1)\right|^2 \\
&\leq 4B^2H^2\alpha_k^2(\|\bar{\theta}^k\|+1)^2 \leq 8B^2H^2\alpha_k^2\|\bar{\theta}^k-\theta^{*}\|^2 + 8B^2H^2(\|\theta^{*}\|+1)^2\alpha_k^2. \qquad (33)
\end{aligned}$$

Second, by Assumption 3 we have

$$-\frac{2H}{N}\left\langle \bar{\theta}^k-\theta^{*},\,F(\bar{\theta}^k)\sum_{t=0}^{H-1}\alpha_{k+t}\right\rangle \leq -\frac{2H\mu\sum_{t=0}^{H-1}\alpha_{k+t}}{N}\|\bar{\theta}^k-\theta^{*}\|^2 \leq -\frac{2H\mu}{N}\alpha_k\|\bar{\theta}^k-\theta^{*}\|^2. \qquad (34)$$

Thus, taking the expectation on both sides of (32) and using (30), (33), and (34) yields

$$\begin{aligned}
\mathbb{E}\left[\|\bar{\theta}^{k+1}-\theta^{*}\|^2\right]
&\leq \left(1-\frac{2H\mu\alpha_k}{N}\right)\mathbb{E}\left[\|\bar{\theta}^k-\theta^{*}\|^2\right] + 8B^2H^2\alpha_k^2\,\mathbb{E}\left[\|\bar{\theta}^k-\theta^{*}\|^2\right] + 8B^2H^2(\|\theta^{*}\|+1)^2\alpha_k^2 \\
&\quad + 72H\alpha_k^2\,\mathbb{E}\left[\|\bar{\theta}^k-\theta^{*}\|^2\right] + 72H\alpha_k^2\left(1+\|\theta^{*}\|\right)^2 \\
&\quad + 24(19L+6B)BH^2\alpha_k\alpha_{k;\tau(\alpha_k)}\,\mathbb{E}\left[\|\bar{\theta}^k-\theta^{*}\|^2\right] \\
&\quad + 24(19L+6B)BH^2\left(1+\|\theta^{*}\|\right)^2\alpha_k\alpha_{k;\tau(\alpha_k)} \\
&\leq \left(1-\frac{2H\mu\alpha_k}{N}\right)\mathbb{E}\left[\|\bar{\theta}^k-\theta^{*}\|^2\right] \\
&\quad + 8H\left(B^2H+9+3(19L+6B)BH\right)\alpha_k\alpha_{k;\tau(\alpha_k)}\,\mathbb{E}\left[\|\bar{\theta}^k-\theta^{*}\|^2\right] \\
&\quad + 8H\left(B^2H+9+3(19L+6B)BH\right)(\|\theta^{*}\|+1)^2\alpha_k\alpha_{k;\tau(\alpha_k)}. \qquad (35)
\end{aligned}$$

Recall that $\alpha_k=\alpha/(k+1)$ satisfies (22) with $\alpha=2N/(H\mu)$. Then we have

$$(k+1)^2\left(1-\frac{H\mu\alpha_k}{N}\right) = (k+1)(k-1) \leq k^2.$$

In addition, by Lemma 1 and (22) we have

$$(k+1)\alpha_{k;\tau(\alpha_k)} \leq \frac{\alpha\tau(\alpha_k)(k+1)}{k+1-\tau(\alpha_k)} \leq 2\alpha\tau(\alpha_k) \leq 2C\alpha\log\left(\frac{k+1}{\alpha}\right).$$

Thus, multiplying both sides of (35) by $(k+1)^2$, and noting from (22) that $8H\left(B^2H+9+3(19L+6B)BH\right)\alpha_{k;\tau(\alpha_k)}\leq H\mu/N$ so that the second term in (35) is absorbed into the contraction factor $1-H\mu\alpha_k/N$, yields

$$\begin{aligned}
(k+1)^2\,\mathbb{E}\left[\|\bar{\theta}^{k+1}-\theta^{*}\|^2\right]
&\leq (k+1)^2\left(1-\frac{H\mu\alpha_k}{N}\right)\mathbb{E}\left[\|\bar{\theta}^k-\theta^{*}\|^2\right] \\
&\quad + 8H\left(B^2H+9+3(19L+6B)BH\right)(\|\theta^{*}\|+1)^2\alpha_k\alpha_{k;\tau(\alpha_k)}(k+1)^2 \\
&\leq k^2\,\mathbb{E}\left[\|\bar{\theta}^k-\theta^{*}\|^2\right] \\
&\quad + 8\alpha H\left(B^2H+9+3(19L+6B)BH\right)(\|\theta^{*}\|+1)^2\alpha_{k;\tau(\alpha_k)}(k+1) \\
&\leq k^2\,\mathbb{E}\left[\|\bar{\theta}^k-\theta^{*}\|^2\right] \\
&\quad + 16\alpha^2CH\left(B^2H+9+3(19L+6B)BH\right)(\|\theta^{*}\|+1)^2\log\left(\frac{k+1}{\alpha}\right) \\
&\leq ({\cal K}^{*})^2\,\mathbb{E}\left[\|\bar{\theta}^{{\cal K}^{*}}-\theta^{*}\|^2\right] \\
&\quad + 16\alpha^2CH\left(B^2H+9+3(19L+6B)BH\right)(\|\theta^{*}\|+1)^2\sum_{t={\cal K}^{*}}^{k}\log\left(\frac{t+1}{\alpha}\right) \\
&\leq ({\cal K}^{*})^2\,\mathbb{E}\left[\|\bar{\theta}^{{\cal K}^{*}}-\theta^{*}\|^2\right] \\
&\quad + 16\alpha^2CH\left(B^2H+9+3(19L+6B)BH\right)(\|\theta^{*}\|+1)^2\,k\log\left(\frac{k+1}{\alpha}\right),
\end{aligned}$$

where the fourth inequality applies the preceding relation recursively from ${\cal K}^{*}$ to $k$.

Dividing both sides by $(k+1)^2$ yields (31). ∎
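As a quick numerical illustration of Theorems 1 and 2, the toy experiment below runs Algorithm 1 on a scalar problem with Markovian data under both step-size regimes. The operator, the two-state chain, and all constants are made up for this sketch and are not taken from the analysis above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy instance: N agents with scalar operators F_i(theta; x) = theta - b[i, x],
# each strongly monotone with mu = 1; the data follow a two-state Markov chain.
N, H, K = 5, 4, 2000
P = np.array([[0.9, 0.1], [0.2, 0.8]])     # transition matrix (made up)
b = rng.normal(size=(N, 2))                # per-agent, per-state offsets
pi = np.array([2 / 3, 1 / 3])              # stationary distribution of P
theta_star = (b @ pi).mean()               # root of sum_i E_pi[F_i(.)]

def run(step):
    theta_bar, states = 0.0, rng.integers(0, 2, size=N)
    errors = []
    for k in range(K):
        local = []
        for i in range(N):
            th = theta_bar
            for t in range(H):                        # H local SA steps, update (3)
                th -= step(k + t) * (th - b[i, states[i]])
                states[i] = rng.choice(2, p=P[states[i]])
            local.append(th)
        theta_bar = np.mean(local)                    # server averaging, update (4)
        errors.append((theta_bar - theta_star) ** 2)
    return errors

err_const = run(lambda k: 0.05)             # constant step: error plateaus (Thm 1)
err_decay = run(lambda k: 1.0 / (k + 10))   # alpha_k ~ 1/(k+1): error -> 0 (Thm 2)
print(err_const[-1], err_decay[-1])
```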

4 Motivating applications

In this section, we consider three concrete applications in federated learning [2, 3] and multi-task reinforcement learning [36] that can be formulated as problem (1). These problems can therefore be solved by Algorithm 1, and one can use our results to provide theoretical guarantees for the resulting methods.

4.1 Federated learning

In federated learning, multiple agents (clients or workers) collaboratively solve a machine learning problem under the coordination of a centralized server [2, 3]. Instead of sharing their data with the server, the agents run local updates of their models (parameters) based on their data, and the results are aggregated at the server with the goal of optimizing a global learning objective. This approach has gained much interest recently due to its efficiency in data processing, system privacy, and operating costs.

A central problem in federated learning is the distributed (or federated) optimization problem, where the goal is to solve

$$\underset{\theta\in\mathbb{R}^d}{\text{minimize}}\;\; G(\theta) \triangleq \frac{1}{N}\sum_{i=1}^{N}G_i(\theta). \qquad (36)$$

Here each $G_i(\theta)=\mathbb{E}_{X_i\sim\pi_i}[G_i(\theta,X_i)]$ is a local loss function and $\pi_i$ is the distribution of the data located at agent $i$. For $i\neq j$, $\pi_i$ and $\pi_j$ may be very different, which is referred to as data heterogeneity across the agents. The most popular method for solving (36) is the so-called local stochastic gradient descent (SGD) [21, 22, 23, 24, 25, 26, 27], which can be viewed as a variant of Algorithm 1. In particular, let $F_i(\theta)=\nabla G_i(\theta)$ and let $F_i(\theta,X_i)$ be its stochastic (sub)gradient $\nabla G_i(\theta,X_i)$. Then at each iteration $k\geq 0$, each agent $i$ initializes $\theta_i^{k,0}=\bar{\theta}^k$ and runs $H$ steps of local SGD to update $\theta_i^{k,t}$. These values are then aggregated by the server to form $\bar{\theta}^{k+1}$, i.e.,

$$\begin{aligned}
&\text{Agent } i: \quad \theta_i^{k,t+1} = \theta_i^{k,t} - \alpha_{k+t}\nabla G_i(\theta_i^{k,t},X_i^{k+t}), \qquad \theta_i^{k,0}=\bar{\theta}^k,\; t\in[0,H-1],\\
&\text{Server}: \quad\;\; \bar{\theta}^{k+1} = \frac{1}{N}\sum_{i=1}^{N}\theta_i^{k,H}.
\end{aligned}$$

In the federated optimization literature, it is often assumed that the samples $\{X_i^k\}$ are drawn i.i.d. from $\pi_i$ and that the resulting stochastic gradients are unbiased, i.e., $\nabla G_i(\theta)=\mathbb{E}[\nabla G_i(\theta,X_i^k)]$. In addition, the variance of these samples is assumed to be bounded. These assumptions are a special case of the ones considered in this paper.
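A minimal sketch of the local SGD scheme above for a simple quadratic loss follows. The loss $G_i(\theta)=\frac{1}{2}\|\theta-\text{target}_i\|^2$, the i.i.d. noise model, and all constants are hypothetical choices made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
N, H, d, K = 10, 5, 3, 500
targets = rng.normal(size=(N, d))          # heterogeneous local minimizers

def grad_G(i, theta, noise):
    """Stochastic gradient of G_i(theta) = 0.5 * ||theta - targets[i]||^2;
    `noise` stands in for the randomness of the data sample X_i."""
    return theta - targets[i] + noise

theta_bar = np.zeros(d)
for k in range(K):
    local = []
    for i in range(N):
        theta = theta_bar.copy()
        for t in range(H):                        # H local SGD steps
            noise = 0.1 * rng.normal(size=d)      # i.i.d. noise, per the FL setting
            theta -= (1.0 / (k * H + t + 10)) * grad_G(i, theta, noise)
        local.append(theta)
    theta_bar = np.mean(local, axis=0)            # server averaging

# the minimizer of (36) for this G is the mean of the local targets
print(np.linalg.norm(theta_bar - targets.mean(axis=0)))
```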

4.2 Distributed multi-task reinforcement learning

We consider a multi-task reinforcement learning problem over a network of $N$ agents operating in $N$ different environments modeled by Markov decision processes (MDPs). Here, each environment represents a task assigned to an agent. We assume that the agents can communicate directly with a centralized coordinator. This is a distributed multi-task reinforcement learning (MTRL) problem over a multi-agent system, which can be characterized mathematically using $N$ different MDPs as follows.

Let ${\cal M}_i=({\cal S}_i,{\cal A}_i,{\cal P}_i,{\cal R}_i,\gamma_i)$ be a 5-tuple representing the discounted-reward MDP at agent $i$, where ${\cal S}_i$, ${\cal A}_i$, and ${\cal P}_i$ are the sets of states, actions, and transition probability matrices, respectively. In addition, ${\cal R}_i$ is the reward function and $\gamma_i\in(0,1)$ is the discount factor. Note that the sets of states and actions at the agents can (partially) overlap with each other, and we denote their unions by ${\cal S}=\cup_i{\cal S}_i$ and ${\cal A}=\cup_i{\cal A}_i$. We focus on randomized stationary policies (RSPs), where a policy $\pi$ assigns to each $s\in{\cal S}$ a probability distribution $\pi(\cdot|s)$ over ${\cal A}$.

Given a policy $\pi$, let $V_i^\pi$ be the value function associated with the $i$-th environment,

$$V_i^\pi(s_i) = \mathbb{E}\left[\sum_{k=0}^{\infty}\gamma_i^k{\cal R}_i(s_i^k,a_i^k)\,\Big|\,s_i^0=s_i\right], \qquad a_i^k\sim\pi(\cdot|s_i^k).$$

Similarly, we denote by $Q_i^\pi$ the $Q$-function in the $i$-th environment,

$$Q_i^\pi(s_i,a_i) = \mathbb{E}\left[\sum_{k=0}^{\infty}\gamma_i^k{\cal R}_i(s_i^k,a_i^k)\,\Big|\,s_i^0=s_i,\,a_i^0=a_i\right].$$

Let $\rho_i$ be an initial state distribution over ${\cal S}_i$ and, with some abuse of notation, denote the long-term reward associated with this distribution by $V_i^\pi(\rho_i)=\mathbb{E}_{s_i\sim\rho_i}\left[V_i^\pi(s_i)\right]$. The goal of the agents is to cooperatively find a policy $\pi^{*}$ that maximizes the total accumulated discounted reward

$$\max_{\pi}\; V(\pi;\boldsymbol{\rho}) \triangleq \sum_{i=1}^{N}V_i^\pi(\rho_i), \qquad \boldsymbol{\rho}=\left[\rho_1,\ldots,\rho_N\right]^T. \qquad (40)$$

Treating the environments as independent RL problems would produce different policies $\pi_i^{*}$, each maximizing its respective $V_i^\pi$. The goal of MTRL is instead to find a single $\pi^{*}$ that balances the performance across all environments. In the following, we present two fundamental problems in this area, both of which can be formulated in the form of (1)–(2). As a consequence, we can apply Algorithm 1 to solve these problems in a distributed manner.

4.2.1 Distributed TD($\lambda$) with linear function approximation

One of the most fundamental problems in RL is the policy evaluation problem, where the goal is to estimate the value function $V^\pi$ associated with a stationary policy $\pi$. This problem arises as a subproblem in RL policy search methods, including policy iteration and actor-critic methods. Our focus here is the multi-task variant of the policy evaluation problem, that is, we want to estimate the sum of the value functions $V_i^\pi$ of a stationary policy $\pi$. In addition, we study this problem when the state space ${\cal S}$ is very large, which motivates the use of function approximation. We consider the linear function approximation $\tilde{V}_i^\theta$ of $V_i^\pi$, parameterized by a weight vector $\theta\in\mathbb{R}^L$ and given as

$$\tilde{V}_i^\theta(s) = \sum_{\ell=1}^{L}\theta_\ell\phi_{i,\ell}(s), \qquad \forall s\in{\cal S},$$

for a given set of $L$ basis vectors $\phi_{i,\ell}:{\cal S}\rightarrow\mathbb{R}$, $\ell\in\{1,\ldots,L\}$; some examples of how to choose these vectors can be found in [37]. Here we assume that $\phi_{i,\ell}(s)=0$ if $s\notin{\cal S}_i$, implying $\tilde{V}_i^\theta(s)=0$. We are interested in the case $L\ll M=|{\cal S}|$. Our goal is to find $\theta^{*}$ that provides a good approximation of the sum of the value functions at the agents, i.e.,

$$\sum_{i=1}^{N}V_i^\pi \approx \sum_{i=1}^{N}\Phi_i\theta^{*},$$

where $\Phi_i\in\mathbb{R}^{M\times L}$ is the feature matrix whose $s$-th row is the feature vector $\phi_i(s)^T$ of agent $i$, with

$$\phi_i(s) = (\phi_{i,1}(s),\ldots,\phi_{i,L}(s))^T \in\mathbb{R}^L.$$

Distributed TD($\lambda$). To solve this problem, we consider a distributed variant of TD($\lambda$), originally studied in [38] and analyzed explicitly in [39, 15, 16]. For simplicity, we consider the case $\lambda=0$; the case $\lambda\in[0,1]$ can be handled in a similar manner. In particular, each agent $i$ maintains an estimate $\theta_i$ of $\theta^{*}$, and the centralized coordinator maintains $\bar{\theta}$, the average of these $\theta_i$. At each iteration $k\geq 0$, each agent $i$ initializes $\theta_i^{k,0}=\bar{\theta}^k$ and runs $H$ steps of local TD(0) to update $\theta_i^{k,t}$. These values are then aggregated by the server to form $\bar{\theta}^{k+1}$, i.e., set $\theta_i^{k,0}=\bar{\theta}^k$ and for all $t\in[0,H-1]$,

$$\begin{aligned}
&\text{Agent } i: \quad \theta_i^{k,t+1} = \theta_i^{k,t} + \alpha_{k+t}\left({\cal R}_i^{k+t}+\gamma\phi_i(s_i^{k+t+1})^T\theta_i^{k,t}-\phi_i(s_i^{k+t})^T\theta_i^{k,t}\right)\phi_i(s_i^{k+t}),\\
&\text{Server}: \quad\;\; \bar{\theta}^{k+1} = \frac{1}{N}\sum_{i=1}^{N}\theta_i^{k,H}, \qquad (41)
\end{aligned}$$

where ${\cal R}_i^k={\cal R}_i(s_i^k,a_i^k)$ and $\{s_i^k,s_i^{k+1},{\cal R}_i^k\}$ is the data tuple observed at time $k$ at agent $i$.

Let $\{X_i^k\}=\left\{(s_i^k,s_i^{k+1},a_i^k)\right\}$, which forms a Markov chain. The update above can then be viewed as a local stochastic approximation method for finding the root of a linear operator. Indeed, let ${\bf A}_i(X_i^k)$ and $b_i(X_i^k)$ be defined as

$$\begin{aligned}
{\bf A}_i(X_i^k) &= \phi_i(s_i^k)\left(\gamma\phi_i(s_i^{k+1})-\phi_i(s_i^k)\right)^T,\\
b_i(X_i^k) &= {\cal R}_i^k\phi_i(s_i^k).
\end{aligned}$$

Moreover, let $\pi_i$ be the stationary distribution of the underlying Markov chain $\{X_i^k\}$ and

$${\bf A}_i = \mathbb{E}_{\pi_i}\left[{\bf A}_i(X_i^k)\right], \qquad b_i = \mathbb{E}_{\pi_i}\left[b_i(X_i^k)\right].$$

Thus, if in this case we consider

$$F_i(\theta_i^k;X_i^k) = -{\bf A}_i(X_i^k)\theta_i^k - b_i(X_i^k),$$

then (41) is a variant of Algorithm 1 in which each $F_i$ is linear in $\theta$. In this case, local TD(0) seeks to find $\theta^{*}$ satisfying

$$\sum_{i=1}^{N}\left({\bf A}_i\theta^{*}+b_i\right) = 0.$$
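A sketch of one agent's inner loop in (41) is given below; the environment interface `env_step`, the feature map `phi`, and the discount factor are placeholders for whatever the $i$-th MDP provides.

```python
import numpy as np

def local_td0(theta_bar, phi, env_step, s0, alphas, k, H, gamma):
    """One agent's H local TD(0) steps, i.e., the agent-side update in (41).

    phi(s)      -- feature vector phi_i(s) in R^L
    env_step(s) -- returns (reward, next_state) under the evaluated policy pi
    """
    theta, s = theta_bar.copy(), s0
    for t in range(H):
        r, s_next = env_step(s)
        # TD error: R_i + gamma * phi(s')^T theta - phi(s)^T theta
        td_error = r + gamma * phi(s_next) @ theta - phi(s) @ theta
        theta = theta + alphas[k + t] * td_error * phi(s)
        s = s_next
    return theta, s   # theta is sent to the coordinator for averaging
```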

To establish the convergence of (41), the following conditions are assumed in the literature [39, 15, 16].

Assumption 4.

The instantaneous rewards at the agents are uniformly bounded, i.e., there exists a constant $R$ such that $|{\cal R}_i(s,s')|\leq R$ for all $s,s'\in{\cal S}$ and $i\in[N]$.

Assumption 5.

For each $i\in[N]$, the feature vectors $\{\phi_{i,\ell}\}$, $\ell\in\{1,\ldots,L\}$, are linearly independent, i.e., the matrix $\Phi_i$ has full column rank. In addition, we assume that all feature vectors $\phi_i(s)$ are uniformly bounded, i.e., $\|\phi_i(s)\|\leq 1$.

Assumption 6.

Each Markov chain $\{X_i^k\}$ is irreducible and aperiodic.

Under Assumptions 4–6, one can verify that Assumptions 1–3 hold [39]. For example, under these assumptions each ${\bf A}_i$ is a negative definite matrix, i.e., $x^T{\bf A}_ix<0$ for all $x\neq 0$.

4.2.2 Distributed Q-learning with linear function approximation

In this section, we consider a distributed variant of the classic Q-learning method [40] for solving problem (40). Similar to the case of TD($\lambda$), we focus on the linear function approximation $\tilde{Q}_i^\theta$ of $Q_i$, parameterized by a weight vector $\theta\in\mathbb{R}^L$ and defined as

$$\tilde{Q}_i^\theta(s,a) = \sum_{\ell=1}^{L}\theta_\ell\phi_{i,\ell}(s,a), \qquad \forall (s,a)\in{\cal S}\times{\cal A},$$

for a given set of $L$ basis vectors $\phi_{i,\ell}:{\cal S}\times{\cal A}\rightarrow\mathbb{R}$, $\ell\in\{1,\ldots,L\}$. We assume again that $\phi_{i,\ell}(s,a)=0$ if either $s\notin{\cal S}_i$ or $a\notin{\cal A}_i$, implying $\tilde{Q}_i^\theta(s,a)=0$. Let $\Phi_i\in\mathbb{R}^{|{\cal S}||{\cal A}|\times L}$ be the feature matrix whose $(s,a)$-th row is $\phi_i(s,a)=(\phi_{i,1}(s,a),\ldots,\phi_{i,L}(s,a))^T$. The goal of distributed Q-learning is then to solve

$$\sum_{i=1}^{N}\Phi_i\theta = \sum_{i=1}^{N}\Pi{\cal T}_i[\Phi_i\theta],$$

where $\Pi$ denotes the projection onto the linear subspace spanned by the feature vectors and ${\cal T}_i$ is the Bellman operator associated with the $Q$-function of environment $i$; see, for example, [20].

To solve this problem, we consider a distributed variant of Q-learning [20, 41, 42]. In particular, each agent $i$ maintains an estimate $\theta_i$ of $\theta^{*}$, and the centralized coordinator maintains $\bar{\theta}$, the average of these $\theta_i$. At each iteration $k\geq 0$, each agent $i$ initializes $\theta_i^{k,0}=\bar{\theta}^k$ and runs $H$ steps of local Q-learning to update $\theta_i^{k,t}$. These values are then aggregated by the server to form $\bar{\theta}^{k+1}$, i.e., set $\theta_i^{k,0}=\bar{\theta}^k$ and for all $t\in[0,H-1]$,

$$\begin{aligned}
&\text{Agent } i: \quad \theta_i^{k,t+1} = \theta_i^{k,t} + \alpha_{k+t}\left({\cal R}_i^{k+t}+\gamma\max_{a'}\phi_i(s_i^{k+t+1},a')^T\theta_i^{k,t}-\phi_i(s_i^{k+t},a_i^{k+t})^T\theta_i^{k,t}\right)\phi_i(s_i^{k+t},a_i^{k+t}),\\
&\text{Server}: \quad\;\; \bar{\theta}^{k+1} = \frac{1}{N}\sum_{i=1}^{N}\theta_i^{k,H}, \qquad (42)
\end{aligned}$$

where ${\cal R}_i^k={\cal R}_i(s_i^k,a_i^k)$ and $\{s_i^k,s_i^{k+1},a_i^k\}$ is the sample trajectory generated by some behavior policy $\sigma_i$ at agent $i$. Let $X_i^k=\{s_i^k,a_i^k,s_i^{k+1}\}$, which forms a Markov chain. We define the nonlinear mapping $F_i$ as

$$F_i(\theta;X_i^k) = \phi_i(s_i^k,a_i^k)\left[{\cal R}_i(s_i^k,a_i^k)+\gamma\max_{a}\phi_i(s_i^{k+1},a)^T\theta-\phi_i(s_i^k,a_i^k)^T\theta\right],$$

and $F_i(\theta)=\mathbb{E}_{\pi_i}\left[F_i(\theta;X_i^k)\right]$, where $\pi_i$ is the stationary distribution of $\{X_i^k\}$. The goal of (42) is then to find $\theta^{*}$ such that

$$\sum_{i=1}^{N}F_i(\theta^{*}) = 0.$$

Under proper assumptions, see for example [20, Theorem 1], one can verify that Assumptions 1–3 hold. Thus, we can apply our results in Theorems 1 and 2 to derive finite-time performance bounds for the distributed Q-learning method in (42).
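Analogously to the TD(0) sketch above, the following is a sketch of one agent's $H$ local Q-learning steps in (42); the behavior policy, the feature map, and the environment interface are placeholders introduced for illustration.

```python
import numpy as np

def local_q_learning(theta_bar, phi, actions, env_step, behavior, s0,
                     alphas, k, H, gamma):
    """One agent's H local Q-learning steps, i.e., the agent-side update in (42).

    phi(s, a)      -- feature vector phi_i(s, a) in R^L
    behavior(s)    -- action drawn from the behavior policy sigma_i
    env_step(s, a) -- returns (reward, next_state)
    """
    theta, s = theta_bar.copy(), s0
    for t in range(H):
        a = behavior(s)
        r, s_next = env_step(s, a)
        q_max = max(phi(s_next, ap) @ theta for ap in actions)
        td_error = r + gamma * q_max - phi(s, a) @ theta
        theta = theta + alphas[k + t] * td_error * phi(s, a)
        s = s_next
    return theta, s   # theta is sent to the coordinator for averaging
```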

5 Concluding remark

This paper studies local stochastic approximation over a network of agents, where the data at each agent are generated from a Markov process. Our main contribution is a finite-time bound on the convergence of the mean square error generated by the algorithm to zero. Our results generalize the existing literature, where the data at the agents are assumed to be i.i.d., an assumption that does not hold for a number of algorithms in multi-task reinforcement learning over multi-agent systems.

References

  • [1] H. Robbins and S. Monro, “A stochastic approximation method,” The Annals of Mathematical Statistics, vol. 22, no. 3, pp. 400–407, 1951.
  • [2] P. Kairouz et al., “Advances and open problems in federated learning,” available at: https://arxiv.org/abs/1912.04977, 2020.
  • [3] T. Li, A. Sahu, A. Talwalkar, and V. Smith, “Federated learning: Challenges, methods, and future directions,” available at: https://arxiv.org/abs/1908.07873, 2020.
  • [4] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” in International conference on machine learning, 2016, pp. 1928–1937.
  • [5] L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward, Y. Doron, V. Firoiu, T. Harley, I. Dunning et al., “Impala: Scalable distributed deep-rl with importance weighted actor-learner architectures,” arXiv preprint arXiv:1802.01561, 2018.
  • [6] M. Hessel, H. Soyer, L. Espeholt, W. Czarnecki, S. Schmitt, and H. van Hasselt, “Multi-task deep reinforcement learning with popart,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 3796–3803.
  • [7] L. Bottou, F. Curtis, and J. Nocedal, “Optimization methods for large-scale machine learning,” SIAM Review, vol. 60, no. 2, pp. 223–311, 2018.
  • [8] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, 2nd ed.   MIT Press, 2018.
  • [9] D. Bertsekas and J. Tsitsiklis, Neuro-Dynamic Programming, 2nd ed.   Athena Scientific, Belmont, MA, 1999.
  • [10] V. Borkar, Stochastic Approximation: A Dynamical Systems Viewpoint.   Cambridge University Press, 2008.
  • [11] H. Kushner and G. Yin, Stochastic Approximation Algorithms and Applications.   Springer, NY, 2003.
  • [12] G. Lan, First-order and Stochastic Optimization Methods for Machine Learning.   Springer-Nature, 2020.
  • [13] T. Sun, Y. Sun, and W. Yin, “On Markov chain gradient descent,” in Proceedings of the 32nd International Conference on Neural Information Processing Systems, ser. NIPS’18, 2018, pp. 9918–9927.
  • [14] T. T. Doan, L. M. Nguyen, N. H. Pham, and J. Romberg, “Convergence rates of accelerated Markov gradient descent with applications in reinforcement learning,” available at: https://arxiv.org/abs/2002.02873, 2020.
  • [15] J. Bhandari, D. Russo, and R. Singal, “A finite time analysis of temporal difference learning with linear function approximation,” in COLT, 2018.
  • [16] R. Srikant and L. Ying, “Finite-time error bounds for linear stochastic approximation and TD learning,” in COLT, 2019.
  • [17] B. Hu and U. Syed, “Characterizing the exact behaviors of temporal difference learning algorithms using Markov jump linear system theory,” in Advances in Neural Information Processing Systems 32, 2019.
  • [18] B. Karimi, B. Miasojedow, E. Moulines, and H. Wai, “Non-asymptotic analysis of biased stochastic approximation scheme,” in Conference on Learning Theory, COLT 2019, Phoenix, AZ, USA, 2019, pp. 1944–1974.
  • [19] T. T. Doan, “Finite-time analysis and restarting scheme for linear two-time-scale stochastic approximation,” available at: https://arxiv.org/abs/1912.10583, 2019.
  • [20] C. Z. Chen, S. Zhang, T. T. Doan, S. T. Maguluri, and J.-P. Clarke, “Performance of Q-learning with linear function approximation: Stability and finite-time analysis,” available at: https://arxiv.org/abs/1905.11425, 2019.
  • [21] T. T. Doan, J. Lubars, C. L. Beck, and R. Srikant, “Convergence rate of distributed random projections,” in 7th IFAC Workshop on Distributed Estimation and Control in Networked Systems (NECSYS), vol. 51, no. 23, 2018, pp. 373–378.
  • [22] S. U. Stich, “Local SGD converges fast and communicates little,” in International Conference on Learning Representations, 2019. [Online]. Available: https://openreview.net/forum?id=S1g2JnRcFX
  • [23] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient learning of deep networks from decentralized data,” in Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, vol. 54, 2017, pp. 1273–1282.
  • [24] B. Woodworth, K. Patel, S. Stich, Z. Dai, B. Bullins, B. McMahan, O. Shamir, and N. Srebro, “Is local SGD better than minibatch SGD?” available at: https://arxiv.org/abs/2002.07839, 2020.
  • [25] F. Haddadpour, M. M. Kamani, M. Mahdavi, and V. Cadambe, “Local SGD with periodic averaging: Tighter analysis and adaptive synchronization,” in Advances in Neural Information Processing Systems 32, 2019, pp. 11082–11094.
  • [26] S. P. Karimireddy, S. Kale, M. Mohri, S. J. Reddi, S. U. Stich, and A. T. Suresh, “SCAFFOLD: Stochastic controlled averaging for on-device federated learning,” 2019. [Online]. Available: https://arxiv.org/abs/1910.06378
  • [27] A. Khaled, K. Mishchenko, and P. Richtárik, “Tighter theory for local SGD on identical and heterogeneous data,” in the 23rd International Conference on Artificial Intelligence and Statistics, 2020.
  • [28] L. T. Liu, U. Dogan, and K. Hofmann, “Decoding multitask DQN in the world of Minecraft,” in the 13th European Workshop on Reinforcement Learning (EWRL), 2016.
  • [29] T. Yu, S. Kumar, A. Gupta, S. Levine, K. Hausman, and C. Finn, “Gradient surgery for multi-task learning,” available at: https://arxiv.org/abs/2001.06782, 2020.
  • [30] A. Nair, P. Srinivasan, S. Blackwell, C. Alcicek, R. Fearon, A. Maria, V. Panneershelvam, M. Suleyman, C. Beattie, S. Petersen, S. Legg, V. Mnih, K. Kavukcuoglu, and D. Silver, “Massively parallel methods for deep reinforcement learning,” 2015.
  • [31] D. A. Levin, Y. Peres, and E. L. Wilmer, Markov Chains and Mixing Times. American Mathematical Society, 2006.
  • [32] S. Ram, A. Nedić, and V. V. Veeravalli, “Incremental stochastic subgradient algorithms for convex optimization,” SIAM Journal on Optimization, vol. 20, no. 2, pp. 691–717, 2009.
  • [33] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton et al., “Mastering the game of Go without human knowledge,” Nature, vol. 550, no. 7676, pp. 354–359, 2017.
  • [34] Y. Wu, W. Zhang, P. Xu, and Q. Gu, “A finite time analysis of two time-scale actor critic methods,” available at: https://arxiv.org/abs/2005.01350, 2020.
  • [35] T. T. Doan, S. T. Maguluri, and J. Romberg, “Finite-time performance of distributed temporal difference learning with linear function approximation,” available at: https://arxiv.org/abs/1907.12530, 2019.
  • [36] S. Zeng, A. Anwar, T. Doan, J. Romberg, and A. Raychowdhury, “A decentralized policy gradient approach to multi-task reinforcement learning,” available at: https://arxiv.org/abs/2006.04338, 2020.
  • [37] G. Konidaris, S. Osentoski, and P. Thomas, “Value function approximation in reinforcement learning using the Fourier basis,” in Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence, 2011, pp. 380–385.
  • [38] R. S. Sutton, “Learning to predict by the methods of temporal differences,” Machine Learning, vol. 3, no. 1, pp. 9–44, 1988.
  • [39] J. N. Tsitsiklis and B. V. Roy, “An analysis of temporal-difference learning with function approximation,” IEEE Transactions on Automatic Control, vol. 42, no. 5, pp. 674–690, 1997.
  • [40] C. Watkins and P. Dayan, “Q-learning,” Machine Learning, vol. 8, pp. 279–292, 1992.
  • [41] F. S. Melo, S. P. Meyn, and M. I. Ribeiro, “An analysis of reinforcement learning with function approximation,” in Proceedings of the 25th International Conference on Machine Learning, 2008, pp. 664–671.
  • [42] D. Lee and N. He, “A unified switching system perspective and O.D.E. analysis of Q-learning algorithms,” available at: https://arxiv.org/abs/1912.02270, 2019.

Appendix A Proofs of Lemmas 3–8

A.1 Proof of Lemma 3

We first show (13). Indeed, by (3) and (9) we have, for any $t\in[0,H-1]$,

\begin{align}
\|\theta_{i}^{k,t+1}-\theta_{i}^{k,t}\| = \alpha\left\|F_{i}(\theta_{i}^{k,t};X_{i}^{k+t})\right\| \le B\alpha\left(\|\theta_{i}^{k,t}\|+1\right), \tag{43}
\end{align}

which gives

\begin{align*}
\|\theta_{i}^{k,t+1}\| \le (1+B\alpha)\|\theta_{i}^{k,t}\| + B\alpha \le (1+B\alpha)^{t+1}\|\theta_{i}^{k,0}\| + B\alpha\sum_{u=0}^{t}(1+B\alpha)^{t-u}.
\end{align*}

Using the relation $1+x\le e^{x}$ for all $x\ge 0$ in the preceding inequality gives (13), i.e.,

\begin{align*}
\|\theta_{i}^{k,t+1}\| \le e^{B\alpha(t+1)}\|\theta_{i}^{k,0}\| + B\alpha(t+1)e^{B\alpha(t+1)} \le e^{B\alpha H}\|\theta_{i}^{k,0}\| + BH\alpha e^{B\alpha H} \le 2\|\bar{\theta}^{k}\| + 2BH\alpha \le 2\|\bar{\theta}^{k}\| + 1,
\end{align*}

where the last two inequalities are due to (10), i.e., $HB\alpha\le\log(2)/(2\tau(\alpha))\le\log(2)$, and recall that $\theta_{i}^{k,0}=\bar{\theta}^{k}$.
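For completeness, the first inequality in the chain above follows by evaluating the geometric sum in the previous display and then applying $1+x\le e^{x}$ together with $e^{x}-1\le xe^{x}$ for $x\ge 0$:

\begin{align*}
B\alpha\sum_{u=0}^{t}(1+B\alpha)^{t-u} = (1+B\alpha)^{t+1}-1 \le e^{B\alpha(t+1)}-1 \le B\alpha(t+1)\,e^{B\alpha(t+1)}.
\end{align*}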

Next, using (43) and (13) we obtain, for all $t\in[0,H-1]$,

\begin{align}
\|\theta_{i}^{k,t+1}-\theta_{i}^{k,t}\| \le 2B\alpha\|\bar{\theta}^{k}\| + 2B^{2}H\alpha^{2} + B\alpha, \tag{44}
\end{align}

which implies (14), i.e., for all $t\in[1,H]$ we have

\begin{align*}
\|\theta_{i}^{k,t}-\bar{\theta}^{k}\| &= \left\|\sum_{u=0}^{t-1}\left(\theta_{i}^{k,u+1}-\theta_{i}^{k,u}\right)\right\| \le \sum_{u=0}^{t-1}\|\theta_{i}^{k,u+1}-\theta_{i}^{k,u}\|\\
&\le 2B\alpha t\|\bar{\theta}^{k}\| + 2B^{2}H\alpha^{2}t + B\alpha t \le 2BH\alpha\|\bar{\theta}^{k}\| + 2B^{2}H^{2}\alpha^{2} + BH\alpha\\
&\le 2BH\alpha\|\bar{\theta}^{k}\| + 2BH\alpha,
\end{align*}

where the last inequality is due to (10).
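As a sanity check on Lemma 3, the following minimal Python sketch simulates one round of $H$ local SA updates for a hypothetical operator $F(\theta;X)=A_{X}\theta+b_{X}$ with $\|A_{X}\|,\|b_{X}\|\le B$, so that the growth condition $\|F(\theta;X)\|\le B(\|\theta\|+1)$ of (9) holds, and verifies (13) and (14) along the sample path; all constants are illustrative choices rather than quantities from the paper.

```python
import numpy as np

# Minimal numerical sanity check of Lemma 3 (eqs. (13)-(14)).
# Hypothetical operator: F(theta; X) = A_X @ theta + b_X with ||A_X||, ||b_X|| <= B,
# which satisfies the growth condition ||F(theta; X)|| <= B(||theta|| + 1) in (9).
rng = np.random.default_rng(0)
d, B, H = 5, 2.0, 10
alpha = np.log(2) / (2 * B * H)    # step size with B*H*alpha <= log(2)/2, cf. (10)

theta_bar = rng.normal(size=d)     # round-k average; theta_i^{k,0} = bar-theta^k
theta = theta_bar.copy()
for t in range(H):
    # a fresh sample: any A_X, b_X with norms at most B is admissible here
    A = rng.normal(size=(d, d)); A *= B / np.linalg.norm(A, 2)
    b = rng.normal(size=d);      b *= B / np.linalg.norm(b)
    theta = theta + alpha * (A @ theta + b)     # one local SA update, cf. (3)
    # eq. (13): the local iterate stays in a ball around the last average
    assert np.linalg.norm(theta) <= 2 * np.linalg.norm(theta_bar) + 1
    # eq. (14): within-round drift of the local iterate from the average
    assert (np.linalg.norm(theta - theta_bar)
            <= 2 * B * H * alpha * (np.linalg.norm(theta_bar) + 1))
print("Lemma 3 bounds hold along this sample path.")
```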

A.2 Proof of Lemma 4

We first show (15). Using (12) and (9) we have

\begin{align*}
\|\bar{\theta}^{k+1}\| - \|\bar{\theta}^{k}\| &\le \|\bar{\theta}^{k+1}-\bar{\theta}^{k}\| \le \frac{\alpha}{N}\sum_{i=1}^{N}\sum_{t=0}^{H-1}\left\|F_{i}(\theta_{i}^{k,t};X_{i}^{k+t})\right\|\\
&\le \frac{\alpha}{N}\sum_{i=1}^{N}\sum_{t=0}^{H-1}B\left(\|\theta_{i}^{k,t}\|+1\right) \le BH\alpha\left(\|\bar{\theta}^{k}\|+1\right) + \frac{\alpha}{N}\sum_{i=1}^{N}\sum_{t=0}^{H-1}B\left\|\theta_{i}^{k,t}-\bar{\theta}^{k}\right\|,
\end{align*}

which, using (14) and $BH\alpha\le\log(2)/2$ (from (10)), gives

\begin{align}
\|\bar{\theta}^{k+1}\| - \|\bar{\theta}^{k}\| \le \|\bar{\theta}^{k+1}-\bar{\theta}^{k}\| &\le BH\alpha\left(\|\bar{\theta}^{k}\|+1\right) + BH\alpha\left(2BH\alpha\|\bar{\theta}^{k}\| + 2BH\alpha\right) \nonumber\\
&\le 2BH\alpha\|\bar{\theta}^{k}\| + 2BH\alpha. \tag{45}
\end{align}

The preceding relation yields

\begin{align*}
\|\bar{\theta}^{k+1}\| \le (1+2BH\alpha)\|\bar{\theta}^{k}\| + 2BH\alpha.
\end{align*}

Using the relation $1+x\le e^{x}$ for all $x\ge 0$, the inequality above gives, for all $t\in[k-\tau(\alpha),k-1]$,

\begin{align*}
\|\bar{\theta}^{t}\| &\le (1+2BH\alpha)^{t-k+\tau(\alpha)}\|\bar{\theta}^{k-\tau(\alpha)}\| + 2BH\sum_{u=k-\tau(\alpha)}^{t-1}\alpha(1+2BH\alpha)^{t-u-1}\\
&\le (1+2BH\alpha)^{\tau(\alpha)}\|\bar{\theta}^{k-\tau(\alpha)}\| + 2BH\alpha\tau(\alpha)(1+2BH\alpha)^{\tau(\alpha)}\\
&\le e^{2BH\alpha\tau(\alpha)}\|\bar{\theta}^{k-\tau(\alpha)}\| + 2BH\alpha\tau(\alpha)e^{2BH\alpha\tau(\alpha)}\\
&\le 2\|\bar{\theta}^{k-\tau(\alpha)}\| + 4BH\alpha\tau(\alpha),
\end{align*}

where the last inequality is due to (10), i.e., $2HB\tau(\alpha)\alpha\le\log(2)$. Using the preceding relation we have from (45)

\begin{align*}
\|\bar{\theta}^{k}-\bar{\theta}^{k-\tau(\alpha)}\| &\le \sum_{t=k-\tau(\alpha)}^{k-1}\|\bar{\theta}^{t+1}-\bar{\theta}^{t}\| \le \sum_{t=k-\tau(\alpha)}^{k-1}2BH\alpha\left(\|\bar{\theta}^{t}\|+1\right)\\
&\le 2BH\alpha\sum_{t=k-\tau(\alpha)}^{k-1}\left(2\|\bar{\theta}^{k-\tau(\alpha)}\| + 4BH\alpha\tau(\alpha)\right) + 2BH\alpha\tau(\alpha)\\
&\le 4BH\alpha\tau(\alpha)\|\bar{\theta}^{k-\tau(\alpha)}\| + 4BH\alpha\tau(\alpha),
\end{align*}

where the last inequality is due to (10). Using the preceding inequality and the triangle inequality yields

\begin{align*}
\|\bar{\theta}^{k}-\bar{\theta}^{k-\tau(\alpha)}\| &\le 4BH\alpha\tau(\alpha)\|\bar{\theta}^{k}-\bar{\theta}^{k-\tau(\alpha)}\| + 4BH\alpha\tau(\alpha)\|\bar{\theta}^{k}\| + 4BH\alpha\tau(\alpha)\\
&\le \frac{2}{3}\|\bar{\theta}^{k}-\bar{\theta}^{k-\tau(\alpha)}\| + 4BH\alpha\tau(\alpha)\|\bar{\theta}^{k}\| + 4BH\alpha\tau(\alpha),
\end{align*}

where in the last inequality we use (10). Rearranging the inequality above yields (15)

\begin{align*}
\|\bar{\theta}^{k}-\bar{\theta}^{k-\tau(\alpha)}\| \le 12BH\alpha\tau(\alpha)\|\bar{\theta}^{k}\| + 12BH\alpha\tau(\alpha) \le 2\|\bar{\theta}^{k}\| + 2.
\end{align*}

Squaring both sides of the preceding relation and using the Cauchy-Schwarz inequality yields (16)

\begin{align*}
\|\bar{\theta}^{k}-\bar{\theta}^{k-\tau(\alpha)}\|^{2} \le 288B^{2}H^{2}\alpha^{2}\tau^{2}(\alpha)\|\bar{\theta}^{k}\|^{2} + 288B^{2}H^{2}\alpha^{2}\tau^{2}(\alpha) \le 8\|\bar{\theta}^{k}\|^{2} + 8.
\end{align*}
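Here the constant $288=2\cdot 12^{2}$; a minimal worked step, using only the elementary inequality $(a+b)^{2}\le 2a^{2}+2b^{2}$, is

\begin{align*}
\left(12BH\alpha\tau(\alpha)\|\bar{\theta}^{k}\| + 12BH\alpha\tau(\alpha)\right)^{2} \le 2\left(12BH\alpha\tau(\alpha)\right)^{2}\|\bar{\theta}^{k}\|^{2} + 2\left(12BH\alpha\tau(\alpha)\right)^{2} = 288B^{2}H^{2}\alpha^{2}\tau^{2}(\alpha)\left(\|\bar{\theta}^{k}\|^{2}+1\right).
\end{align*}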

A.3 Proof of Lemma 5

Consider

\begin{align}
&-\sum_{i=1}^{N}\sum_{t=0}^{H-1}\left\langle \bar{\theta}^{k}-\theta^{*},\,F_{i}(\theta_{i}^{k,t};X_{i}^{k+t})-F_{i}(\bar{\theta}^{k})\right\rangle \nonumber\\
&= -\sum_{i=1}^{N}\sum_{t=0}^{H-1}\left\langle \bar{\theta}^{k}-\bar{\theta}^{k-\tau(\alpha)},\,F_{i}(\theta_{i}^{k,t};X_{i}^{k+t})-F_{i}(\bar{\theta}^{k})\right\rangle - \sum_{i=1}^{N}\sum_{t=0}^{H-1}\left\langle \bar{\theta}^{k-\tau(\alpha)}-\theta^{*},\,F_{i}(\theta_{i}^{k,t};X_{i}^{k+t})-F_{i}(\bar{\theta}^{k})\right\rangle \nonumber\\
&= -\sum_{i=1}^{N}\sum_{t=0}^{H-1}\left\langle \bar{\theta}^{k}-\bar{\theta}^{k-\tau(\alpha)},\,F_{i}(\theta_{i}^{k,t};X_{i}^{k+t})-F_{i}(\bar{\theta}^{k})\right\rangle \nonumber\\
&\quad - \sum_{i=1}^{N}\sum_{t=0}^{H-1}\left\langle \bar{\theta}^{k-\tau(\alpha)}-\theta^{*},\,F_{i}(\theta_{i}^{k-\tau(\alpha),t};X_{i}^{k+t})-F_{i}(\theta_{i}^{k-\tau(\alpha),t})\right\rangle \nonumber\\
&\quad - \sum_{i=1}^{N}\sum_{t=0}^{H-1}\left\langle \bar{\theta}^{k-\tau(\alpha)}-\theta^{*},\,F_{i}(\theta_{i}^{k,t};X_{i}^{k+t})-F_{i}(\theta_{i}^{k-\tau(\alpha),t};X_{i}^{k+t})\right\rangle \nonumber\\
&\quad - \sum_{i=1}^{N}\sum_{t=0}^{H-1}\left\langle \bar{\theta}^{k-\tau(\alpha)}-\theta^{*},\,F_{i}(\theta_{i}^{k-\tau(\alpha),t})-F_{i}(\bar{\theta}^{k-\tau(\alpha)})\right\rangle \nonumber\\
&\quad - \sum_{i=1}^{N}\sum_{t=0}^{H-1}\left\langle \bar{\theta}^{k-\tau(\alpha)}-\theta^{*},\,F_{i}(\bar{\theta}^{k-\tau(\alpha)})-F_{i}(\bar{\theta}^{k})\right\rangle. \tag{46}
\end{align}

We first consider the second term on the right-hand side of (46). Let ${\cal F}_{k}$ be the $\sigma$-field containing all the information generated by Algorithm 1 up to time $k$. Then, using (8) we have

\begin{align*}
&-\sum_{i=1}^{N}\sum_{t=0}^{H-1}\mathbb{E}\left[\left\langle \bar{\theta}^{k-\tau(\alpha)}-\theta^{*},\,F_{i}(\theta_{i}^{k-\tau(\alpha),t};X_{i}^{k+t})-F_{i}(\theta_{i}^{k-\tau(\alpha),t})\right\rangle \,\Big|\, {\cal F}_{k+t-\tau(\alpha)}\right]\\
&= -\sum_{i=1}^{N}\sum_{t=0}^{H-1}\left\langle \bar{\theta}^{k-\tau(\alpha)}-\theta^{*},\,\mathbb{E}\left[F_{i}(\theta_{i}^{k-\tau(\alpha),t};X_{i}^{k+t})-F_{i}(\theta_{i}^{k-\tau(\alpha),t})\,\Big|\,{\cal F}_{k+t-\tau(\alpha)}\right]\right\rangle\\
&\le \sum_{i=1}^{N}\sum_{t=0}^{H-1}\left\|\bar{\theta}^{k-\tau(\alpha)}-\theta^{*}\right\|\left\|\mathbb{E}\left[F_{i}(\theta_{i}^{k-\tau(\alpha),t};X_{i}^{k+t})-F_{i}(\theta_{i}^{k-\tau(\alpha),t})\,\Big|\,{\cal F}_{k+t-\tau(\alpha)}\right]\right\|\\
&\le \sum_{i=1}^{N}\sum_{t=0}^{H-1}\alpha\left\|\bar{\theta}^{k-\tau(\alpha)}-\theta^{*}\right\|\left(\left\|\theta_{i}^{k-\tau(\alpha),t}\right\|+1\right) \le NH\alpha\left\|\bar{\theta}^{k-\tau(\alpha)}-\theta^{*}\right\|\max_{i,t}\left(\left\|\theta_{i}^{k-\tau(\alpha),t}\right\|+1\right),
\end{align*}

which by using the triangle inequality and (13) yields

\begin{align}
&-\sum_{i=1}^{N}\sum_{t=0}^{H-1}\mathbb{E}\left[\left\langle \bar{\theta}^{k-\tau(\alpha)}-\theta^{*},\,F_{i}(\theta_{i}^{k-\tau(\alpha),t};X_{i}^{k+t})-F_{i}(\theta_{i}^{k-\tau(\alpha),t})\right\rangle \,\Big|\, {\cal F}_{k+t-\tau(\alpha)}\right] \nonumber\\
&\le NH\alpha\left(\|\bar{\theta}^{k}-\bar{\theta}^{k-\tau(\alpha)}\| + \|\bar{\theta}^{k}-\theta^{*}\|\right)\left(2\|\bar{\theta}^{k-\tau(\alpha)}\| + 2BH\alpha + 1\right) \nonumber\\
&\le NH\alpha\left(\|\bar{\theta}^{k}-\bar{\theta}^{k-\tau(\alpha)}\| + \|\bar{\theta}^{k}-\theta^{*}\|\right)\left(2\|\bar{\theta}^{k}-\bar{\theta}^{k-\tau(\alpha)}\| + 2\|\bar{\theta}^{k}\| + 2\right) \nonumber\\
&\le 2NH\alpha\left(2\|\bar{\theta}^{k}\| + 2 + \|\bar{\theta}^{k}-\theta^{*}\|\right)\left(3\|\bar{\theta}^{k}\| + 3\right) \nonumber\\
&\le 6NH\alpha\left(3\|\bar{\theta}^{k}-\theta^{*}\| + 2 + 2\|\theta^{*}\|\right)\left(\|\bar{\theta}^{k}-\theta^{*}\| + 1 + \|\theta^{*}\|\right) \nonumber\\
&\le 18NH\alpha\left(\|\bar{\theta}^{k}-\theta^{*}\| + 1 + \|\theta^{*}\|\right)^{2} \le 36NH\alpha\,\|\bar{\theta}^{k}-\theta^{*}\|^{2} + 36NH\alpha\left(1+\|\theta^{*}\|\right)^{2}, \tag{47}
\end{align}

where the third inequality is due to (15) and the last inequality is due to the Cauchy-Schwarz inequality. Next, we consider the third term on the right-hand side of (46). Indeed, using (6) we have

\begin{align}
&-\sum_{i=1}^{N}\sum_{t=0}^{H-1}\left\langle \bar{\theta}^{k-\tau(\alpha)}-\theta^{*},\,F_{i}(\theta_{i}^{k,t};X_{i}^{k+t})-F_{i}(\theta_{i}^{k-\tau(\alpha),t};X_{i}^{k+t})\right\rangle \nonumber\\
&\le L\sum_{i=1}^{N}\sum_{t=0}^{H-1}\|\bar{\theta}^{k-\tau(\alpha)}-\theta^{*}\|\,\|\theta_{i}^{k,t}-\theta_{i}^{k-\tau(\alpha),t}\| \nonumber\\
&\le L\sum_{i=1}^{N}\sum_{t=0}^{H-1}\|\bar{\theta}^{k-\tau(\alpha)}-\theta^{*}\|\left(\|\theta_{i}^{k,t}-\bar{\theta}^{k}\| + \|\bar{\theta}^{k}-\bar{\theta}^{k-\tau(\alpha)}\| + \|\bar{\theta}^{k-\tau(\alpha)}-\theta_{i}^{k-\tau(\alpha),t}\|\right). \tag{48}
\end{align}

Similarly, using (6) we consider the last two terms on the right-hand side of (46)

\begin{align*}
&-\sum_{i=1}^{N}\sum_{t=0}^{H-1}\left(\left\langle \bar{\theta}^{k-\tau(\alpha)}-\theta^{*},\,F_{i}(\theta_{i}^{k-\tau(\alpha),t})-F_{i}(\bar{\theta}^{k-\tau(\alpha)})\right\rangle + \left\langle \bar{\theta}^{k-\tau(\alpha)}-\theta^{*},\,F_{i}(\bar{\theta}^{k-\tau(\alpha)})-F_{i}(\bar{\theta}^{k})\right\rangle\right)\\
&\le L\sum_{i=1}^{N}\sum_{t=0}^{H-1}\|\bar{\theta}^{k-\tau(\alpha)}-\theta^{*}\|\left(\|\theta_{i}^{k-\tau(\alpha),t}-\bar{\theta}^{k-\tau(\alpha)}\| + \|\bar{\theta}^{k}-\bar{\theta}^{k-\tau(\alpha)}\|\right),
\end{align*}

which by adding to (48) yields

\begin{align*}
&-\sum_{i=1}^{N}\sum_{t=0}^{H-1}\left(\left\langle \bar{\theta}^{k-\tau(\alpha)}-\theta^{*},\,F_{i}(\theta_{i}^{k-\tau(\alpha),t})-F_{i}(\bar{\theta}^{k-\tau(\alpha)})\right\rangle + \left\langle \bar{\theta}^{k-\tau(\alpha)}-\theta^{*},\,F_{i}(\bar{\theta}^{k-\tau(\alpha)})-F_{i}(\bar{\theta}^{k})\right\rangle\right)\\
&\qquad - \sum_{i=1}^{N}\sum_{t=0}^{H-1}\left\langle \bar{\theta}^{k-\tau(\alpha)}-\theta^{*},\,F_{i}(\theta_{i}^{k,t};X_{i}^{k+t})-F_{i}(\theta_{i}^{k-\tau(\alpha),t};X_{i}^{k+t})\right\rangle\\
&\le L\sum_{i=1}^{N}\sum_{t=0}^{H-1}\|\bar{\theta}^{k-\tau(\alpha)}-\theta^{*}\|\left(\|\theta_{i}^{k,t}-\bar{\theta}^{k}\| + 2\|\theta_{i}^{k-\tau(\alpha),t}-\bar{\theta}^{k-\tau(\alpha)}\| + 2\|\bar{\theta}^{k}-\bar{\theta}^{k-\tau(\alpha)}\|\right).
\end{align*}

Using (14), (15), and the triangle inequality in the preceding relation yields

\begin{align}
&-\sum_{i=1}^{N}\sum_{t=0}^{H-1}\left(\left\langle \bar{\theta}^{k-\tau(\alpha)}-\theta^{*},\,F_{i}(\theta_{i}^{k-\tau(\alpha),t})-F_{i}(\bar{\theta}^{k-\tau(\alpha)})\right\rangle + \left\langle \bar{\theta}^{k-\tau(\alpha)}-\theta^{*},\,F_{i}(\bar{\theta}^{k-\tau(\alpha)})-F_{i}(\bar{\theta}^{k})\right\rangle\right) \nonumber\\
&\qquad - \sum_{i=1}^{N}\sum_{t=0}^{H-1}\left\langle \bar{\theta}^{k-\tau(\alpha)}-\theta^{*},\,F_{i}(\theta_{i}^{k,t};X_{i}^{k+t})-F_{i}(\theta_{i}^{k-\tau(\alpha),t};X_{i}^{k+t})\right\rangle \nonumber\\
&\le L\sum_{i=1}^{N}\sum_{t=0}^{H-1}\left(\|\bar{\theta}^{k-\tau(\alpha)}-\bar{\theta}^{k}\| + \|\bar{\theta}^{k}-\theta^{*}\|\right) \nonumber\\
&\hskip 71.13188pt\times\left(2BH\alpha\|\bar{\theta}^{k}\| + 4BH\alpha\|\bar{\theta}^{k-\tau(\alpha)}\| + 6BH\alpha + 24BH\alpha\tau(\alpha)\|\bar{\theta}^{k}\| + 18BH\alpha\tau(\alpha)\right) \nonumber\\
&\le LNH\left(2\|\bar{\theta}^{k}\|+2+\|\bar{\theta}^{k}-\theta^{*}\|\right)\left(26BH\alpha\tau(\alpha)\|\bar{\theta}^{k}\| + 24BH\alpha\tau(\alpha) + 4BH\alpha\|\bar{\theta}^{k-\tau(\alpha)}-\bar{\theta}^{k}\| + 4BH\alpha\|\bar{\theta}^{k}\|\right) \nonumber\\
&\le LNH\left(3\|\bar{\theta}^{k}-\theta^{*}\|+2+2\|\theta^{*}\|\right)\left(38BH\alpha\tau(\alpha)\|\bar{\theta}^{k}\| + 32BH\alpha\tau(\alpha)\right) \nonumber\\
&\le LNH\left(3\|\bar{\theta}^{k}-\theta^{*}\|+2+2\|\theta^{*}\|\right)\left(38BH\alpha\tau(\alpha)\|\bar{\theta}^{k}-\theta^{*}\| + 38BH(\|\theta^{*}\|+1)\alpha\tau(\alpha)\right) \nonumber\\
&\le 114LBNH^{2}\alpha\tau(\alpha)\left(\|\bar{\theta}^{k}-\theta^{*}\| + \|\theta^{*}\| + 1\right)^{2} \nonumber\\
&\le 228LBNH^{2}\alpha\tau(\alpha)\,\|\bar{\theta}^{k}-\theta^{*}\|^{2} + 228LBNH^{2}\alpha\tau(\alpha)\left(\|\theta^{*}\|+1\right)^{2}. \tag{49}
\end{align}

Finally, we consider the first term on the right-hand side of (46). Using (13), (15), and (9) we have

\begin{align}
&-\sum_{i=1}^{N}\sum_{t=0}^{H-1}\left\langle \bar{\theta}^{k}-\bar{\theta}^{k-\tau(\alpha)},\,F_{i}(\theta_{i}^{k,t};X_{i}^{k+t})-F_{i}(\bar{\theta}^{k})\right\rangle \le B\sum_{i=1}^{N}\sum_{t=0}^{H-1}\left\|\bar{\theta}^{k}-\bar{\theta}^{k-\tau(\alpha)}\right\|\left(\|\theta_{i}^{k,t}\| + \|\bar{\theta}^{k}\| + 2\right) \nonumber\\
&\le B\sum_{i=1}^{N}\sum_{t=0}^{H-1}\left(12BH\alpha\tau(\alpha)\|\bar{\theta}^{k}\| + 12BH\alpha\tau(\alpha)\right)\left(3\|\bar{\theta}^{k}\| + 2 + 2BH\alpha\right) \nonumber\\
&\le 12NB^{2}H^{2}\alpha\tau(\alpha)\left(\|\bar{\theta}^{k}\|+1\right)\left(3\|\bar{\theta}^{k}\|+3\right) \le 36NB^{2}H^{2}\alpha\tau(\alpha)\left(\|\bar{\theta}^{k}\|+1\right)^{2} \nonumber\\
&\le 36NB^{2}H^{2}\alpha\tau(\alpha)\left(\|\bar{\theta}^{k}-\theta^{*}\| + \|\theta^{*}\| + 1\right)^{2} \nonumber\\
&\le 72NB^{2}H^{2}\alpha\tau(\alpha)\,\|\bar{\theta}^{k}-\theta^{*}\|^{2} + 72NB^{2}H^{2}(1+\|\theta^{*}\|)^{2}\alpha\tau(\alpha), \tag{50}
\end{align}

where in the third inequality we use (10) to have $2BH\alpha\le 1$. Finally, taking the expectation on both sides of (46) and using (47), (49), and (50) yields (17)

\begin{align*}
&-\sum_{i=1}^{N}\sum_{t=0}^{H-1}\mathbb{E}\left[\left\langle \bar{\theta}^{k}-\theta^{*},\,F_{i}(\theta_{i}^{k,t};X_{i}^{k+t})-F_{i}(\bar{\theta}^{k})\right\rangle\right]\\
&\le 36NH\alpha\,\mathbb{E}\left[\|\bar{\theta}^{k}-\theta^{*}\|^{2}\right] + 36NH\alpha\left(1+\|\theta^{*}\|\right)^{2} + 228LBNH^{2}\alpha\tau(\alpha)\,\mathbb{E}\left[\|\bar{\theta}^{k}-\theta^{*}\|^{2}\right]\\
&\qquad + 228LBNH^{2}\alpha\tau(\alpha)\left(\|\theta^{*}\|+1\right)^{2} + 72NB^{2}H^{2}\alpha\tau(\alpha)\,\mathbb{E}\left[\|\bar{\theta}^{k}-\theta^{*}\|^{2}\right] + 72NB^{2}H^{2}(1+\|\theta^{*}\|)^{2}\alpha\tau(\alpha)\\
&\le 36NH\alpha\,\mathbb{E}\left[\|\bar{\theta}^{k}-\theta^{*}\|^{2}\right] + 36NH\alpha\left(1+\|\theta^{*}\|\right)^{2}\\
&\qquad + 12(19L+6B)NBH^{2}\alpha\tau(\alpha)\,\mathbb{E}\left[\|\bar{\theta}^{k}-\theta^{*}\|^{2}\right] + 12(19L+6B)NBH^{2}\left(1+\|\theta^{*}\|\right)^{2}\alpha\tau(\alpha).
\end{align*}
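The Markovian data enter the analysis above only through (8), the bias bound after $\tau(\alpha)$ steps of mixing. The minimal Python sketch below, for a hypothetical two-state chain, illustrates why $\tau(\alpha)$ grows only like $\log(1/\alpha)$ for a geometrically ergodic chain; this is precisely the source of the logarithmic factor in the final convergence rates.

```python
import numpy as np

# Illustration of the mixing time tau(alpha): the smallest t such that
#   max_x || P^t(x, .) - pi ||_TV <= alpha.
# For a geometrically ergodic chain this grows like log(1/alpha).
# The two-state chain below is a hypothetical example.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])                      # transition matrix
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.isclose(evals, 1.0)]).ravel()
pi /= pi.sum()                                  # stationary distribution

def tau(alpha, max_t=10_000):
    """Smallest t with worst-case total-variation distance to pi at most alpha."""
    Pt = np.eye(2)
    for t in range(1, max_t + 1):
        Pt = Pt @ P
        tv = 0.5 * np.abs(Pt - pi).sum(axis=1).max()
        if tv <= alpha:
            return t
    raise RuntimeError("chain did not mix within max_t steps")

for alpha in [1e-1, 1e-2, 1e-3, 1e-4]:
    print(f"alpha = {alpha:7.0e}  ->  tau(alpha) = {tau(alpha)}")
# tau(alpha) grows linearly in log(1/alpha), so alpha * tau(alpha) -> 0.
```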

A.4 Proof of Lemma 6

We first show (26). Indeed, by (3) and (9) we have, for any $t\in[0,H-1]$,

\begin{align}
\|\theta_{i}^{k,t+1}-\theta_{i}^{k,t}\| = \alpha_{k+t}\left\|F_{i}(\theta_{i}^{k,t};X_{i}^{k+t})\right\| \le B\alpha_{k+t}\left(\|\theta_{i}^{k,t}\|+1\right), \tag{51}
\end{align}

which gives

\begin{align*}
\|\theta_{i}^{k,t+1}\| \le (1+B\alpha_{k+t})\|\theta_{i}^{k,t}\| + B\alpha_{k+t} \le \prod_{u=0}^{t}(1+B\alpha_{k+u})\|\theta_{i}^{k,0}\| + B\sum_{u=0}^{t}\alpha_{k+u}\prod_{\ell=u+1}^{t}(1+B\alpha_{k+\ell}).
\end{align*}

Using the relation $1+x\le e^{x}$ for all $x\ge 0$ in the preceding inequality, and since $\alpha_{k}$ is decreasing, we obtain (26), i.e., for $t\in[0,H-1]$,

\begin{align*}
\|\theta_{i}^{k,t+1}\| &\le \exp\left\{B\sum_{u=0}^{t}\alpha_{k+u}\right\}\|\theta_{i}^{k,0}\| + B\sum_{u=0}^{t}\alpha_{k+u}\exp\left\{B\sum_{\ell=u+1}^{t}\alpha_{k+\ell}\right\}\\
&\le \exp\{BH\alpha_{k}\}\|\theta_{i}^{k,0}\| + B\sum_{u=0}^{t}\alpha_{k+u}\exp\{BH\alpha_{k}\}\\
&\le 2\|\bar{\theta}^{k}\| + 2BH\alpha_{k},
\end{align*}

where the second inequality is due to (22), i.e., $HB\alpha_{k}\le\log(2)$, and recall that $\theta_{i}^{k,0}=\bar{\theta}^{k}$. Next, using (51) and (26), and since $\alpha_{k}$ is decreasing, we obtain for all $t\in[0,H-1]$

\begin{align}
\|\theta_{i}^{k,t+1}-\theta_{i}^{k,t}\| \le B\alpha_{k}\left(\|\theta_{i}^{k,t}\|+1\right) \le 2B\alpha_{k}\|\bar{\theta}^{k}\| + 2B^{2}H\alpha_{k}^{2} + B\alpha_{k}, \tag{52}
\end{align}

which implies (27), i.e., for all $t\in[1,H]$ we have

\begin{align*}
\|\theta_{i}^{k,t}-\bar{\theta}^{k}\| &= \left\|\sum_{u=0}^{t-1}\left(\theta_{i}^{k,u+1}-\theta_{i}^{k,u}\right)\right\| \le \sum_{u=0}^{t-1}\|\theta_{i}^{k,u+1}-\theta_{i}^{k,u}\|\\
&\le 2B\alpha_{k}t\|\bar{\theta}^{k}\| + 2B^{2}H\alpha_{k}^{2}t + B\alpha_{k}t \le 2BH\alpha_{k}\|\bar{\theta}^{k}\| + 2B^{2}H^{2}\alpha_{k}^{2} + BH\alpha_{k}\\
&\le 2BH\alpha_{k}\|\bar{\theta}^{k}\| + 2BH\alpha_{k},
\end{align*}

where the last inequality is due to (22).

A.5 Proof of Lemma 7

We first show (28). Using (25) and (9) we have

\begin{align*}
\|\bar{\theta}^{k+1}\| - \|\bar{\theta}^{k}\| &\le \|\bar{\theta}^{k+1}-\bar{\theta}^{k}\| \le \frac{1}{N}\sum_{i=1}^{N}\sum_{t=0}^{H-1}\alpha_{k+t}\left\|F_{i}(\theta_{i}^{k,t};X_{i}^{k+t})\right\|\\
&\le \frac{1}{N}\sum_{i=1}^{N}\sum_{t=0}^{H-1}B\alpha_{k+t}\left(\|\theta_{i}^{k,t}\|+1\right) \le BH\alpha_{k}\left(\|\bar{\theta}^{k}\|+1\right) + \frac{1}{N}\sum_{i=1}^{N}\sum_{t=0}^{H-1}B\alpha_{k+t}\left\|\theta_{i}^{k,t}-\bar{\theta}^{k}\right\|,
\end{align*}

which, using (27) and $BH\alpha_{k}\le\log(2)/2$ (from (22)), gives

\begin{align}
\|\bar{\theta}^{k+1}\| - \|\bar{\theta}^{k}\| \le \|\bar{\theta}^{k+1}-\bar{\theta}^{k}\| &\le BH\alpha_{k}\left(\|\bar{\theta}^{k}\|+1\right) + BH\alpha_{k}\left(2BH\alpha_{k}\|\bar{\theta}^{k}\| + 2BH\alpha_{k}\right) \nonumber\\
&\le 2BH\alpha_{k}\|\bar{\theta}^{k}\| + 2BH\alpha_{k}. \tag{53}
\end{align}

The preceding relation yields

\begin{align*}
\|\bar{\theta}^{k+1}\| \le (1+2BH\alpha_{k})\|\bar{\theta}^{k}\| + 2BH\alpha_{k}.
\end{align*}

Using the relation $1+x\le e^{x}$ for all $x\ge 0$, the inequality above gives, for all $t\in[k-\tau(\alpha_{k}),k-1]$,

\begin{align*}
\|\bar{\theta}^{t+1}\| &\le \prod_{u=k-\tau(\alpha_{k})}^{t}(1+2BH\alpha_{u})\|\bar{\theta}^{k-\tau(\alpha_{k})}\| + 2BH\sum_{u=k-\tau(\alpha_{k})}^{t-1}\alpha_{u}\prod_{\ell=u+1}^{t}(1+2BH\alpha_{\ell})\\
&\le \exp\left\{\sum_{u=k-\tau(\alpha_{k})}^{t}2BH\alpha_{u}\right\}\|\bar{\theta}^{k-\tau(\alpha_{k})}\| + 2BH\sum_{u=k-\tau(\alpha_{k})}^{t-1}\alpha_{u}\exp\left\{\sum_{\ell=u+1}^{t}2BH\alpha_{\ell}\right\}\\
&\le 2\|\bar{\theta}^{k-\tau(\alpha_{k})}\| + 4BH\sum_{u=k-\tau(\alpha_{k})}^{t-1}\alpha_{u},
\end{align*}

where the last inequality is due to (22), i.e., $2HB\alpha_{k;\tau(\alpha_{k})}\le\log(2)$. Using the preceding relation we have from (53)

\begin{align*}
\|\bar{\theta}^{k}-\bar{\theta}^{k-\tau(\alpha_{k})}\| &\le \sum_{t=k-\tau(\alpha_{k})}^{k-1}\|\bar{\theta}^{t+1}-\bar{\theta}^{t}\| \le \sum_{t=k-\tau(\alpha_{k})}^{k-1}2BH\alpha_{t}\left(\|\bar{\theta}^{t}\|+1\right)\\
&\le 2BH\sum_{t=k-\tau(\alpha_{k})}^{k-1}\alpha_{t}\left(2\|\bar{\theta}^{k-\tau(\alpha_{k})}\| + 4BH\sum_{u=k-\tau(\alpha_{k})}^{t-1}\alpha_{u}\right) + 2BH\alpha_{k;\tau(\alpha_{k})}\\
&\le 4BH\alpha_{k;\tau(\alpha_{k})}\|\bar{\theta}^{k-\tau(\alpha_{k})}\| + 4BH\alpha_{k;\tau(\alpha_{k})},
\end{align*}

where the last inequality is due to (22). Using the preceding inequality and the triangle inequality yields

\begin{align*}
\|\bar{\theta}^{k}-\bar{\theta}^{k-\tau(\alpha_{k})}\| &\le 4BH\alpha_{k;\tau(\alpha_{k})}\|\bar{\theta}^{k}-\bar{\theta}^{k-\tau(\alpha_{k})}\| + 4BH\alpha_{k;\tau(\alpha_{k})}\|\bar{\theta}^{k}\| + 4BH\alpha_{k;\tau(\alpha_{k})}\\
&\le \frac{2}{3}\|\bar{\theta}^{k}-\bar{\theta}^{k-\tau(\alpha_{k})}\| + 4BH\alpha_{k;\tau(\alpha_{k})}\|\bar{\theta}^{k}\| + 4BH\alpha_{k;\tau(\alpha_{k})},
\end{align*}

where in the last inequality we use (22). Rearranging the inequality above yields (28)

\begin{align*}
\|\bar{\theta}^{k}-\bar{\theta}^{k-\tau(\alpha_{k})}\| \le 12BH\alpha_{k;\tau(\alpha_{k})}\|\bar{\theta}^{k}\| + 12BH\alpha_{k;\tau(\alpha_{k})} \le 2\|\bar{\theta}^{k}\| + 2.
\end{align*}

Squaring both sides of the preceding relation and using the Cauchy-Schwarz inequality yields (29)

\begin{align*}
\|\bar{\theta}^{k}-\bar{\theta}^{k-\tau(\alpha_{k})}\|^{2} \le 288B^{2}H^{2}\alpha_{k;\tau(\alpha_{k})}^{2}\|\bar{\theta}^{k}\|^{2} + 288B^{2}H^{2}\alpha_{k;\tau(\alpha_{k})}^{2} \le 8\|\bar{\theta}^{k}\|^{2} + 8.
\end{align*}

A.6 Proof of Lemma 8

Consider

\begin{align}
&-\sum_{i=1}^{N}\sum_{t=0}^{H-1}\alpha_{k+t}\left\langle \bar{\theta}^{k}-\theta^{*},\,F_{i}(\theta_{i}^{k,t};X_{i}^{k+t})-F_{i}(\bar{\theta}^{k})\right\rangle \nonumber\\
&= -\sum_{i=1}^{N}\sum_{t=0}^{H-1}\alpha_{k+t}\left\langle \bar{\theta}^{k}-\bar{\theta}^{k-\tau(\alpha_{k})},\,F_{i}(\theta_{i}^{k,t};X_{i}^{k+t})-F_{i}(\bar{\theta}^{k})\right\rangle \nonumber\\
&\qquad - \sum_{i=1}^{N}\sum_{t=0}^{H-1}\alpha_{k+t}\left\langle \bar{\theta}^{k-\tau(\alpha_{k})}-\theta^{*},\,F_{i}(\theta_{i}^{k,t};X_{i}^{k+t})-F_{i}(\bar{\theta}^{k})\right\rangle \nonumber\\
&= -\sum_{i=1}^{N}\sum_{t=0}^{H-1}\alpha_{k+t}\left\langle \bar{\theta}^{k}-\bar{\theta}^{k-\tau(\alpha_{k})},\,F_{i}(\theta_{i}^{k,t};X_{i}^{k+t})-F_{i}(\bar{\theta}^{k})\right\rangle \nonumber\\
&\quad - \sum_{i=1}^{N}\sum_{t=0}^{H-1}\alpha_{k+t}\left\langle \bar{\theta}^{k-\tau(\alpha_{k})}-\theta^{*},\,F_{i}(\theta_{i}^{k-\tau(\alpha_{k}),t};X_{i}^{k+t})-F_{i}(\theta_{i}^{k-\tau(\alpha_{k}),t})\right\rangle \nonumber\\
&\quad - \sum_{i=1}^{N}\sum_{t=0}^{H-1}\alpha_{k+t}\left\langle \bar{\theta}^{k-\tau(\alpha_{k})}-\theta^{*},\,F_{i}(\theta_{i}^{k,t};X_{i}^{k+t})-F_{i}(\theta_{i}^{k-\tau(\alpha_{k}),t};X_{i}^{k+t})\right\rangle \nonumber\\
&\quad - \sum_{i=1}^{N}\sum_{t=0}^{H-1}\alpha_{k+t}\left\langle \bar{\theta}^{k-\tau(\alpha_{k})}-\theta^{*},\,F_{i}(\theta_{i}^{k-\tau(\alpha_{k}),t})-F_{i}(\bar{\theta}^{k-\tau(\alpha_{k})})\right\rangle \nonumber\\
&\quad - \sum_{i=1}^{N}\sum_{t=0}^{H-1}\alpha_{k+t}\left\langle \bar{\theta}^{k-\tau(\alpha_{k})}-\theta^{*},\,F_{i}(\bar{\theta}^{k-\tau(\alpha_{k})})-F_{i}(\bar{\theta}^{k})\right\rangle. \tag{54}
\end{align}

We first consider the second term on the right-hand side of (54). Recall that ${\cal F}_{k}$ is the $\sigma$-field containing all the information generated by Algorithm 1 up to time $k$. Then, using (8) we have

\begin{align*}
&-\sum_{i=1}^{N}\sum_{t=0}^{H-1}\alpha_{k+t}\,\mathbb{E}\left[\left\langle \bar{\theta}^{k-\tau(\alpha_{k})}-\theta^{*},\,F_{i}(\theta_{i}^{k-\tau(\alpha_{k}),t};X_{i}^{k+t})-F_{i}(\theta_{i}^{k-\tau(\alpha_{k}),t})\right\rangle \,\Big|\, {\cal F}_{k+t-\tau(\alpha_{k})}\right]\\
&= -\sum_{i=1}^{N}\sum_{t=0}^{H-1}\alpha_{k+t}\left\langle \bar{\theta}^{k-\tau(\alpha_{k})}-\theta^{*},\,\mathbb{E}\left[F_{i}(\theta_{i}^{k-\tau(\alpha_{k}),t};X_{i}^{k+t})-F_{i}(\theta_{i}^{k-\tau(\alpha_{k}),t})\,\Big|\,{\cal F}_{k+t-\tau(\alpha_{k})}\right]\right\rangle\\
&\le \sum_{i=1}^{N}\sum_{t=0}^{H-1}\alpha_{k+t}\left\|\bar{\theta}^{k-\tau(\alpha_{k})}-\theta^{*}\right\|\left\|\mathbb{E}\left[F_{i}(\theta_{i}^{k-\tau(\alpha_{k}),t};X_{i}^{k+t})-F_{i}(\theta_{i}^{k-\tau(\alpha_{k}),t})\,\Big|\,{\cal F}_{k+t-\tau(\alpha_{k})}\right]\right\|\\
&\le \sum_{i=1}^{N}\sum_{t=0}^{H-1}\alpha_{k+t}\alpha_{k}\left\|\bar{\theta}^{k-\tau(\alpha_{k})}-\theta^{*}\right\|\left(\left\|\theta_{i}^{k-\tau(\alpha_{k}),t}\right\|+1\right) \le NH\alpha_{k}^{2}\left\|\bar{\theta}^{k-\tau(\alpha_{k})}-\theta^{*}\right\|\max_{i,t}\left(\left\|\theta_{i}^{k-\tau(\alpha_{k}),t}\right\|+1\right),
\end{align*}

which by using the triangle inequality and (26) yields

\begin{align}
&-\sum_{i=1}^{N}\sum_{t=0}^{H-1}\alpha_{k+t}\,\mathbb{E}\left[\left\langle \bar{\theta}^{k-\tau(\alpha_{k})}-\theta^{*},\,F_{i}(\theta_{i}^{k-\tau(\alpha_{k}),t};X_{i}^{k+t})-F_{i}(\theta_{i}^{k-\tau(\alpha_{k}),t})\right\rangle \,\Big|\, {\cal F}_{k+t-\tau(\alpha_{k})}\right] \nonumber\\
&\le NH\alpha_{k}^{2}\left(\|\bar{\theta}^{k}-\bar{\theta}^{k-\tau(\alpha_{k})}\| + \|\bar{\theta}^{k}-\theta^{*}\|\right)\left(2\|\bar{\theta}^{k-\tau(\alpha_{k})}\| + 2BH\alpha_{k} + 1\right) \nonumber\\
&\le NH\alpha_{k}^{2}\left(\|\bar{\theta}^{k}-\bar{\theta}^{k-\tau(\alpha_{k})}\| + \|\bar{\theta}^{k}-\theta^{*}\|\right)\left(2\|\bar{\theta}^{k}-\bar{\theta}^{k-\tau(\alpha_{k})}\| + 2\|\bar{\theta}^{k}\| + 2\right) \nonumber\\
&\le 2NH\alpha_{k}^{2}\left(2\|\bar{\theta}^{k}\| + 2 + \|\bar{\theta}^{k}-\theta^{*}\|\right)\left(3\|\bar{\theta}^{k}\| + 3\right) \nonumber\\
&\le 6NH\alpha_{k}^{2}\left(3\|\bar{\theta}^{k}-\theta^{*}\| + 2 + 2\|\theta^{*}\|\right)\left(\|\bar{\theta}^{k}-\theta^{*}\| + 1 + \|\theta^{*}\|\right) \nonumber\\
&\le 18NH\alpha_{k}^{2}\left(\|\bar{\theta}^{k}-\theta^{*}\| + 1 + \|\theta^{*}\|\right)^{2} \le 36NH\alpha_{k}^{2}\,\|\bar{\theta}^{k}-\theta^{*}\|^{2} + 36NH\alpha_{k}^{2}\left(1+\|\theta^{*}\|\right)^{2}, \tag{55}
\end{align}

where the third inequality is due to (28) and the last inequality is due to the Cauchy-Schwarz inequality. Next, we consider the third term on the right-hand side of (54). Indeed, using (6) we have

\begin{align}
&-\sum_{i=1}^{N}\sum_{t=0}^{H-1}\alpha_{k+t}\left\langle \bar{\theta}^{k-\tau(\alpha_{k})}-\theta^{*},\,F_{i}(\theta_{i}^{k,t};X_{i}^{k+t})-F_{i}(\theta_{i}^{k-\tau(\alpha_{k}),t};X_{i}^{k+t})\right\rangle \nonumber\\
&\le L\sum_{i=1}^{N}\sum_{t=0}^{H-1}\alpha_{k+t}\|\bar{\theta}^{k-\tau(\alpha_{k})}-\theta^{*}\|\,\|\theta_{i}^{k,t}-\theta_{i}^{k-\tau(\alpha_{k}),t}\| \nonumber\\
&\le L\sum_{i=1}^{N}\sum_{t=0}^{H-1}\alpha_{k+t}\|\bar{\theta}^{k-\tau(\alpha_{k})}-\theta^{*}\|\left(\|\theta_{i}^{k,t}-\bar{\theta}^{k}\| + \|\bar{\theta}^{k}-\bar{\theta}^{k-\tau(\alpha_{k})}\| + \|\bar{\theta}^{k-\tau(\alpha_{k})}-\theta_{i}^{k-\tau(\alpha_{k}),t}\|\right). \tag{56}
\end{align}

Similarly, using (6) we consider the last two terms on the right-hand side of (54)

\begin{align*}
&-\sum_{i=1}^{N}\sum_{t=0}^{H-1}\alpha_{k+t}\left(\left\langle \bar{\theta}^{k-\tau(\alpha_{k})}-\theta^{*},\,F_{i}(\theta_{i}^{k-\tau(\alpha_{k}),t})-F_{i}(\bar{\theta}^{k-\tau(\alpha_{k})})\right\rangle + \left\langle \bar{\theta}^{k-\tau(\alpha_{k})}-\theta^{*},\,F_{i}(\bar{\theta}^{k-\tau(\alpha_{k})})-F_{i}(\bar{\theta}^{k})\right\rangle\right)\\
&\le L\sum_{i=1}^{N}\sum_{t=0}^{H-1}\alpha_{k+t}\|\bar{\theta}^{k-\tau(\alpha_{k})}-\theta^{*}\|\left(\|\theta_{i}^{k-\tau(\alpha_{k}),t}-\bar{\theta}^{k-\tau(\alpha_{k})}\| + \|\bar{\theta}^{k}-\bar{\theta}^{k-\tau(\alpha_{k})}\|\right),
\end{align*}

which by adding to (56) yields

\begin{align*}
&-\sum_{i=1}^{N}\sum_{t=0}^{H-1}\alpha_{k+t}\left(\left\langle \bar{\theta}^{k-\tau(\alpha_{k})}-\theta^{*},\,F_{i}(\theta_{i}^{k-\tau(\alpha_{k}),t})-F_{i}(\bar{\theta}^{k-\tau(\alpha_{k})})\right\rangle + \left\langle \bar{\theta}^{k-\tau(\alpha_{k})}-\theta^{*},\,F_{i}(\bar{\theta}^{k-\tau(\alpha_{k})})-F_{i}(\bar{\theta}^{k})\right\rangle\right)\\
&\qquad - \sum_{i=1}^{N}\sum_{t=0}^{H-1}\alpha_{k+t}\left\langle \bar{\theta}^{k-\tau(\alpha_{k})}-\theta^{*},\,F_{i}(\theta_{i}^{k,t};X_{i}^{k+t})-F_{i}(\theta_{i}^{k-\tau(\alpha_{k}),t};X_{i}^{k+t})\right\rangle\\
&\le L\sum_{i=1}^{N}\sum_{t=0}^{H-1}\alpha_{k+t}\|\bar{\theta}^{k-\tau(\alpha_{k})}-\theta^{*}\|\left(\|\theta_{i}^{k,t}-\bar{\theta}^{k}\| + 2\|\theta_{i}^{k-\tau(\alpha_{k}),t}-\bar{\theta}^{k-\tau(\alpha_{k})}\| + 2\|\bar{\theta}^{k}-\bar{\theta}^{k-\tau(\alpha_{k})}\|\right).
\end{align*}

Using (27), (28), and the triangle inequality in the preceding relation yields

\begin{align}
&-\sum_{i=1}^{N}\sum_{t=0}^{H-1}\alpha_{k+t}\left(\left\langle \bar{\theta}^{k-\tau(\alpha_{k})}-\theta^{*},\,F_{i}(\theta_{i}^{k-\tau(\alpha_{k}),t})-F_{i}(\bar{\theta}^{k-\tau(\alpha_{k})})\right\rangle + \left\langle \bar{\theta}^{k-\tau(\alpha_{k})}-\theta^{*},\,F_{i}(\bar{\theta}^{k-\tau(\alpha_{k})})-F_{i}(\bar{\theta}^{k})\right\rangle\right) \nonumber\\
&\qquad - \sum_{i=1}^{N}\sum_{t=0}^{H-1}\alpha_{k+t}\left\langle \bar{\theta}^{k-\tau(\alpha_{k})}-\theta^{*},\,F_{i}(\theta_{i}^{k,t};X_{i}^{k+t})-F_{i}(\theta_{i}^{k-\tau(\alpha_{k}),t};X_{i}^{k+t})\right\rangle \nonumber\\
&\le L\sum_{i=1}^{N}\sum_{t=0}^{H-1}\alpha_{k+t}\left(\|\bar{\theta}^{k-\tau(\alpha_{k})}-\bar{\theta}^{k}\| + \|\bar{\theta}^{k}-\theta^{*}\|\right)\left(2BH\alpha_{k}\|\bar{\theta}^{k}\| + 2BH\alpha_{k} + 4BH\alpha_{k}\|\bar{\theta}^{k-\tau(\alpha_{k})}\| + 4BH\alpha_{k-\tau(\alpha_{k})}\right) \nonumber\\
&\qquad + L\sum_{i=1}^{N}\sum_{t=0}^{H-1}\alpha_{k+t}\left(\|\bar{\theta}^{k-\tau(\alpha_{k})}-\bar{\theta}^{k}\| + \|\bar{\theta}^{k}-\theta^{*}\|\right)\left(24BH\alpha_{k;\tau(\alpha_{k})}\|\bar{\theta}^{k}\| + 24BH\alpha_{k;\tau(\alpha_{k})}\right) \nonumber\\
&\le LNBH^{2}\alpha_{k}\left(2\|\bar{\theta}^{k}\|+2+\|\bar{\theta}^{k}-\theta^{*}\|\right)\left(26\alpha_{k;\tau(\alpha_{k})}\|\bar{\theta}^{k}\| + 30\alpha_{k;\tau(\alpha_{k})} + 4\alpha_{k}\|\bar{\theta}^{k-\tau(\alpha_{k})}-\bar{\theta}^{k}\| + 4\alpha_{k}\|\bar{\theta}^{k}\|\right) \nonumber\\
&\le LNBH^{2}\alpha_{k}\left(3\|\bar{\theta}^{k}-\theta^{*}\|+2+2\|\theta^{*}\|\right)\left(30\alpha_{k;\tau(\alpha_{k})}\|\bar{\theta}^{k}\| + 30\alpha_{k;\tau(\alpha_{k})} + 8\alpha_{k}\|\bar{\theta}^{k}\| + 8\alpha_{k}\right) \nonumber\\
&\le 38LNBH^{2}\alpha_{k}\alpha_{k;\tau(\alpha_{k})}\left(3\|\bar{\theta}^{k}-\theta^{*}\|+2+2\|\theta^{*}\|\right)\left(\|\bar{\theta}^{k}\|+1\right) \nonumber\\
&\le 114LNBH^{2}\alpha_{k}\alpha_{k;\tau(\alpha_{k})}\left(\|\bar{\theta}^{k}-\theta^{*}\|+1+\|\theta^{*}\|\right)^{2} \nonumber\\
&\le 228LNBH^{2}\alpha_{k}\alpha_{k;\tau(\alpha_{k})}\,\|\bar{\theta}^{k}-\theta^{*}\|^{2} + 228LNBH^{2}\alpha_{k}\alpha_{k;\tau(\alpha_{k})}\left(\|\theta^{*}\|+1\right)^{2}. \tag{57}
\end{align}

Finally, we consider the first term on the right-hand side of (54). Using (26), (28), and (9) we have

\begin{align}
&-\sum_{i=1}^{N}\sum_{t=0}^{H-1}\alpha_{k+t}\left\langle \bar{\theta}^{k}-\bar{\theta}^{k-\tau(\alpha_{k})},\,F_{i}(\theta_{i}^{k,t};X_{i}^{k+t})-F_{i}(\bar{\theta}^{k})\right\rangle \le B\sum_{i=1}^{N}\sum_{t=0}^{H-1}\alpha_{k+t}\left\|\bar{\theta}^{k}-\bar{\theta}^{k-\tau(\alpha_{k})}\right\|\left(\|\theta_{i}^{k,t}\| + \|\bar{\theta}^{k}\| + 2\right) \nonumber\\
&\le B\sum_{i=1}^{N}\sum_{t=0}^{H-1}\alpha_{k+t}\left(12BH\alpha_{k;\tau(\alpha_{k})}\|\bar{\theta}^{k}\| + 12BH\alpha_{k;\tau(\alpha_{k})}\right)\left(3\|\bar{\theta}^{k}\| + 3\right) \nonumber\\
&\le 36NB^{2}H^{2}\alpha_{k}\alpha_{k;\tau(\alpha_{k})}\left(\|\bar{\theta}^{k}\|+1\right)^{2} \le 36NB^{2}H^{2}\alpha_{k}\alpha_{k;\tau(\alpha_{k})}\left(\|\bar{\theta}^{k}-\theta^{*}\| + \|\theta^{*}\| + 1\right)^{2} \nonumber\\
&\le 72NB^{2}H^{2}\alpha_{k}\alpha_{k;\tau(\alpha_{k})}\,\|\bar{\theta}^{k}-\theta^{*}\|^{2} + 72NB^{2}H^{2}(1+\|\theta^{*}\|)^{2}\alpha_{k}\alpha_{k;\tau(\alpha_{k})}, \tag{58}
\end{align}

where in the third inequality we use (22) to have $2BH\alpha_{k}\le 1$. Finally, taking the expectation on both sides of (54) and using (55), (57), and (58) yields (30), i.e.,

\begin{align*}
&-\sum_{i=1}^{N}\sum_{t=0}^{H-1}\alpha_{k+t}\,\mathbb{E}\left[\left\langle \bar{\theta}^{k}-\theta^{*},\,F_{i}(\theta_{i}^{k,t};X_{i}^{k+t})-F_{i}(\bar{\theta}^{k})\right\rangle\right]\\
&\le 36NH\alpha_{k}^{2}\,\mathbb{E}\left[\|\bar{\theta}^{k}-\theta^{*}\|^{2}\right] + 36NH\alpha_{k}^{2}\left(1+\|\theta^{*}\|\right)^{2} + 228LBNH^{2}\alpha_{k}\alpha_{k;\tau(\alpha_{k})}\,\mathbb{E}\left[\|\bar{\theta}^{k}-\theta^{*}\|^{2}\right]\\
&\qquad + 228LBNH^{2}\alpha_{k}\alpha_{k;\tau(\alpha_{k})}\left(\|\theta^{*}\|+1\right)^{2} + 72NB^{2}H^{2}\alpha_{k}\alpha_{k;\tau(\alpha_{k})}\,\mathbb{E}\left[\|\bar{\theta}^{k}-\theta^{*}\|^{2}\right]\\
&\qquad + 72NB^{2}H^{2}(1+\|\theta^{*}\|)^{2}\alpha_{k}\alpha_{k;\tau(\alpha_{k})}\\
&\le 36NH\alpha_{k}^{2}\,\mathbb{E}\left[\|\bar{\theta}^{k}-\theta^{*}\|^{2}\right] + 36NH\alpha_{k}^{2}\left(1+\|\theta^{*}\|\right)^{2}\\
&\qquad + 12(19L+6B)NBH^{2}\alpha_{k}\alpha_{k;\tau(\alpha_{k})}\,\mathbb{E}\left[\|\bar{\theta}^{k}-\theta^{*}\|^{2}\right] + 12(19L+6B)NBH^{2}\left(1+\|\theta^{*}\|\right)^{2}\alpha_{k}\alpha_{k;\tau(\alpha_{k})}.
\end{align*}
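As an illustration of the time-varying step-size condition (22), the minimal Python sketch below takes the common choice $\alpha_{k}=\alpha_{0}/(k+1)$ together with a hypothetical geometric mixing time $\tau(\alpha)=\lceil C\log(1/\alpha)\rceil$ and, writing $\alpha_{k;\tau(\alpha_{k})}=\sum_{t=k-\tau(\alpha_{k})}^{k-1}\alpha_{t}$ as suggested by the proof of Lemma 7, checks numerically that $2BH\alpha_{k;\tau(\alpha_{k})}\le\log(2)$ holds for all sufficiently large $k$; all constants are illustrative assumptions, not quantities from the paper.

```python
import numpy as np

# Numerical look at the step-size condition used throughout Appendix A.4-A.6.
# Hypothetical choices: alpha_k = alpha0 / (k + 1), a geometric mixing time
# tau(alpha) = ceil(C * log(1/alpha)), and the window sum
#   alpha_{k; tau(alpha_k)} := sum_{t = k - tau(alpha_k)}^{k - 1} alpha_t.
# The check confirms 2*B*H*alpha_{k; tau(alpha_k)} <= log(2), cf. (22), for large k.
B, H, alpha0, C = 2.0, 10, 0.1, 2.0

def alpha(k):
    return alpha0 / (k + 1)

def tau(a):
    return int(np.ceil(C * np.log(1.0 / a)))

for k in [100, 1_000, 10_000, 100_000]:
    tk = tau(alpha(k))
    window = sum(alpha(t) for t in range(k - tk, k))
    print(f"k = {k:>7}  tau = {tk:3d}  2*B*H*alpha_(k;tau) = {2*B*H*window:.4f}"
          f"  (<= log 2 = {np.log(2):.4f}? {2*B*H*window <= np.log(2)})")
```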