
Local Stochastic Approximation: A Unified View of Federated Learning and Distributed Multi-Task Reinforcement Learning Algorithms

Thinh T. Doan (School of Electrical and Computer Engineering, Georgia Institute of Technology, GA, 30332, USA. Email: [email protected])
Abstract

Motivated by broad applications in reinforcement learning and federated learning, we study local stochastic approximation over a network of agents, where the goal is to find the root of an operator composed of the local operators at the agents. Our focus is to characterize the finite-time performance of this method when the data at each agent are generated from Markov processes, and hence dependent. In particular, we provide explicit convergence rates of local stochastic approximation for both constant and time-varying step sizes. Our results show that these rates are within a logarithmic factor of the ones under independent data. We then illustrate the applications of these results to problems in multi-task reinforcement learning and federated learning.

1 Introduction

In this paper, we study local stochastic approximation (SA), a distributed variant of the classic SA method originally introduced by Robbins and Monro [1] for solving root-finding problems under corrupted measurements of an operator (or function). We consider the setting where a group of agents communicates indirectly through a centralized coordinator. The goal of the agents is to find the root of an operator that is composed of the local operators at the agents. To solve this problem, each agent iteratively runs a number of local SA steps based on its own data, and the resulting iterates are then averaged at the centralized coordinator. This algorithm, presented in detail in Section 2, is relatively simple and efficient for solving problems in which a large amount of data is distributed across different agents.

We are motivated by the broad applications of local SA to problems in federated learning [2, 3] and multi-task reinforcement learning [4, 5, 6]. These two areas share a common framework, in which multiple agents (clients or workers) collaboratively solve a machine learning/reinforcement learning problem under the coordination of a centralized server [2, 3, 4, 5, 6]. Instead of sharing the data collected at the local devices/environments with the server, the agents run local updates of their models/policies based on their data, and the results are then aggregated at the server with the goal of optimizing the global learning objective. In these contexts, local SA can be used to formulate the popular algorithms studied in these two areas, as shown in Section 4.

Our focus in this paper is on the theoretical aspects of the finite-time performance of local stochastic approximation. Our goal is to characterize the convergence rate of this method when the data at the agents are heterogeneous and dependent. In particular, we consider the case where the data at each agent are generated from a Markov process, as often considered in the context of multi-task reinforcement learning. Our setting generalizes the existing works in the literature, where the local data are assumed to be i.i.d. Under fairly standard assumptions, our main contribution is to show that the convergence rates of local SA are within a logarithmic factor of the comparable bounds for independent data. As illustrated in Section 4, the results in this paper provide a unified view of the finite-time bounds of federated learning and multi-task reinforcement learning algorithms under different settings.

1.1 Related works

Stochastic approximation is one of the most widely used methods for solving stochastic optimization problems in many areas, including machine learning [7] and reinforcement learning [8, 9]. The asymptotic convergence of SA under Markov randomness is often established by using the ordinary differential equation (ODE) method [10, 11], which shows that under the right conditions the noise effects eventually average out and the SA iterate asymptotically follows a stable ODE. On the other hand, the rates of SA have mostly been considered in the context of stochastic gradient descent (SGD) under i.i.d. samples [7, 12]. The finite-time convergence of SGD under Markov randomness has been studied in [13, 14] and the references therein. In the context of reinforcement learning, such results have been derived in [15, 16, 17, 18, 19] for linear SA and in the recent work [20] for nonlinear SA.

The local SA method considered in this paper has recently received much interest in the context of federated learning under the name of local SGD; see, for example, [21, 22, 23, 24, 25, 26, 27]. The finite-time bounds of local SGD in these works are derived when the local data at each agent are sampled i.i.d. On the other hand, our focus is the setting where the local data at each agent are sampled from a Markov process and are therefore dependent. We note that the popular distributed SGD in machine learning shares the same communication structure as local SGD. However, while in local SGD the agents have heterogeneous objectives and only share their iterates with the centralized coordinator, in distributed SGD the agents compute stochastic gradients of a global objective and send them to the centralized coordinator.

Finally, distributed/local stochastic approximation has also found broad applications in the context of (multi-task) reinforcement learning [5, 6, 28, 29, 4, 30], which is the main motivation of this paper. In these applications, since the environment is modeled as a Markov process, the noise in local SA is Markovian. However, there is a lack of understanding of its finite-time performance, and our results in this paper help to fill this gap.

2 Local stochastic approximation

We consider a distributed learning framework in which a group of $N$ agents communicates indirectly through a centralized coordinator. Associated with each agent $i$ is a local operator $F_i:\mathbb{R}^d\rightarrow\mathbb{R}^d$. The goal of the agents is to find a solution $\theta^{*}$ satisfying

$$F(\theta^{*}) \triangleq \sum_{i=1}^{N} F_{i}(\theta^{*}) = 0, \qquad (1)$$

where each $F_i:\mathbb{R}^d\rightarrow\mathbb{R}^d$ is given as

$$F_{i}(\theta) \triangleq \mathbb{E}_{\pi_{i}}\left[F_{i}(\theta;X_{i})\right] = \sum_{X_{i}\in{\cal X}_{i}} \pi_{i}(X_{i})\,F_{i}(\theta;X_{i}). \qquad (2)$$

Here ${\cal X}_i$ is a statistical sample space with probability distribution $\pi_i$ at agent $i$. We assume that each agent $i$ has access to the operator $F_i$ only through its samples $\{F_i(\cdot,X_i^k)\}$, where $\{X_i^k\}$ is a sequence of samples of the random variable $X_i$. We are interested in the case where each sequence $\{X_i^k\}$ is generated from an ergodic Markov process whose stationary distribution is $\pi_i$; moreover, the sequences $\{X_i^k\}$ are independent across $i$. To solve problem (1), we study the local stochastic approximation method formally stated in Algorithm 1. In our algorithm, we implicitly assume that there is an oracle that returns to agent $i$ the value $F_i(\theta;X_i)$ for given $\theta$ and $X_i$.

Initialization: Each agent $i$ initializes $\theta_i^0\in\mathbb{R}^d$, a sequence of step sizes $\{\alpha_k\}$, and a positive integer $H$.
The centralized coordinator initializes $\bar{\theta}^0 = \frac{1}{N}\sum_{i=1}^{N}\theta_i^0$.
for $k=0,1,2,\ldots$ do
       for each worker $i$ do
             1) Receive $\bar{\theta}^k$ sent by the centralized coordinator
             2) Set $\theta_i^{k,0} = \bar{\theta}^k$
             for $t=0,1,\ldots,H-1$ do

$$\theta_i^{k,t+1} = \theta_i^{k,t} - \alpha_{k+t}F_i(\theta_i^{k,t};X_i^{k+t}). \qquad (3)$$

             end for
       end for
      The centralized coordinator receives $\theta_i^{k,H}$ from each agent $i$ and implements

$$\bar{\theta}^{k+1} = \frac{1}{N}\sum_{i=1}^{N}\theta_i^{k,H}. \qquad (4)$$

end for
Algorithm 1: Local stochastic approximation

Algorithm 1 is relatively simple to implement and can be explained as follows. Each agent $i$ maintains a copy $\theta_i$ of the solution $\theta^{*}$, and the centralized coordinator maintains $\bar{\theta}$, the average of the $\theta_i$. At any iteration $k\geq 0$, each agent $i$ first receives $\bar{\theta}^k$ from the centralized coordinator and initializes its iterate as $\theta_i^{k,0}=\bar{\theta}^k$. Here $\theta_i^{k,t}$ denotes the iterate at iteration $k$ and local time $t\in[0,\ldots,H-1]$. Agent $i$ then runs $H$ local stochastic approximation steps using the time-varying step sizes $\alpha_k$ and its local data $\{X_i^{k+t}\}$. After $H$ local steps, the agents send their new local iterates $\theta_i^{k,H}$ to the centralized coordinator, which updates $\bar{\theta}^{k+1}$ by averaging these local values.
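To make the updates (3) and (4) concrete, the following is a minimal NumPy sketch of Algorithm 1. The interfaces `F` (the sampled operator) and `next_sample` (the Markovian data source) are placeholders introduced here for illustration, not objects defined in the paper.

```python
import numpy as np

def local_sa(F, next_sample, init_states, theta0, alphas, H, K):
    """Sketch of Algorithm 1 (local stochastic approximation).

    F(i, theta, x)    -- sampled operator F_i(theta; X_i) for agent i
    next_sample(i, x) -- draws X_i^{k+1} from the Markov chain given X_i^k
    init_states       -- initial Markov states, one per agent
    alphas            -- step-size sequence {alpha_k}
    H, K              -- number of local steps and of outer iterations
    """
    N = len(init_states)
    theta_bar = theta0.copy()             # coordinator's iterate \bar{theta}^0
    states = list(init_states)
    for k in range(K):
        local_iterates = []
        for i in range(N):
            theta = theta_bar.copy()      # agents start from \bar{theta}^k
            for t in range(H):            # H local SA steps, update (3)
                theta = theta - alphas[k + t] * F(i, theta, states[i])
                states[i] = next_sample(i, states[i])
            local_iterates.append(theta)
        theta_bar = np.mean(local_iterates, axis=0)  # server averaging, update (4)
    return theta_bar
```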

3 Convergence analysis

In this section, we study the finite-time performance of Algorithm 1 when each operator $F_i$ is strongly monotone. We provide an upper bound on the rate at which the mean square error $\mathbb{E}\left[\|\bar{\theta}^k-\theta^{*}\|^2\right]$ converges to zero, for both constant and time-varying step sizes. In particular, under a constant step size, $\alpha_k=\alpha$ for some constant $\alpha$ and $k\gtrsim\log(1/\alpha)$, this convergence occurs at the rate

$$\mathbb{E}\left[\|\bar{\theta}^{k+1}-\theta^{*}\|^2\right] \leq \left(1-\frac{H\mu\alpha}{N}\right)^{k+1-\tau(\alpha)}\mathbb{E}\left[\|\bar{\theta}^{\tau(\alpha)}-\theta^{*}\|^2\right] + {\cal O}\left(CNHLB^2\log(1/\alpha)\,\alpha\right),$$

where $\tau(\alpha)$ denotes the mixing time of the underlying Markov chain. On the other hand, under time-varying step sizes $\alpha_k\sim 1/(k+1)$ the rate is

$$\mathbb{E}\left[\|\bar{\theta}^{k+1}-\theta^{*}\|^2\right] \leq {\cal O}\left(\frac{\mathbb{E}\left[\|\bar{\theta}^{{\cal K}^{*}}-\theta^{*}\|^2\right]}{k^2}\right) + {\cal O}\left(\frac{CHLB^2\log(k)}{k}\right), \qquad \forall k\geq{\cal K}^{*},$$

for some positive integer ${\cal K}^{*}$ depending on the problem parameters $C,L,B,\mu$, which we define shortly. First, our rates scale linearly with the number of local steps $H$. As expected, when $H$ goes to $\infty$, each agent runs only its local SA without communicating with the centralized coordinator; in this case, one would expect each agent to find only the root of its own operator. Second, one can view the constant $B$ as the variance of the noise. Finally, our rates match those of local SGD under i.i.d. noise [27], up to a $\log$ factor that reflects the Markovian randomness.

We start our analysis by introducing the technical assumptions and notation used in this section. Given a constant $\alpha>0$, we denote by $\tau_i(\alpha)$ the mixing time associated with the Markov chain $\{X_i^k\}$, i.e., the following condition holds:

$$\|\mathbb{P}_i^k(X_i^0,\cdot)-\pi_i\|_{TV} \leq \alpha, \quad \forall k\geq\tau_i(\alpha),\; \forall X_i^0\in{\cal X}_i,\; \forall i\in[N], \qquad (5)$$

where $\|\cdot\|_{TV}$ is the total variation distance and $\mathbb{P}_i^k(X_i^0,X_i)$ is the probability that $X_i^k=X_i$ when the chain starts from $X_i^0$. The mixing time represents the time it takes for $X_i^k$ to get close to the stationary distribution $\pi_i$. In addition, we denote $\tau(\alpha)=\max_i\tau_i(\alpha)$.
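For intuition, when the transition matrix of a finite chain is available, the mixing time in (5) can be computed numerically by tracking the worst-case total variation distance. The sketch below, including the two-state chain at the end, is purely illustrative and not part of the analysis.

```python
import numpy as np

def mixing_time(P, alpha, k_max=100_000):
    """Smallest k with max_x ||P^k(x, .) - pi||_TV <= alpha for a finite
    ergodic chain with row-stochastic transition matrix P, as in (5)."""
    # stationary distribution pi: left eigenvector of P for eigenvalue 1
    eigvals, eigvecs = np.linalg.eig(P.T)
    pi = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
    pi = pi / pi.sum()
    Pk = np.eye(P.shape[0])
    for k in range(1, k_max + 1):
        Pk = Pk @ P
        tv = 0.5 * np.abs(Pk - pi).sum(axis=1).max()  # worst case over X_i^0
        if tv <= alpha:
            return k
    return k_max

# illustrative two-state chain (hypothetical numbers)
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
print(mixing_time(P, alpha=1e-3))  # grows like C * log(1/alpha), cf. Lemma 1
```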

For the results derived in this section, we consider the following assumptions, which for simplicity we assume hold throughout the rest of this paper.

Assumption 1.

The Markov chain $\{X_i^k\}$ is ergodic (irreducible and aperiodic) with finite state space ${\cal X}_i$.

Assumption 2.

The mappings $F_i(\cdot)$ and $F_i(\cdot,X_i)$ are Lipschitz continuous in $\theta$ almost surely, i.e., there exists a positive constant $L$ such that for all $X_i\in{\cal X}_i$ and $i\in[N]$,

$$\|F_i(\theta)-F_i(\omega)\| \leq L\|\theta-\omega\| \quad\text{and}\quad \|F_i(\theta,X_i)-F_i(\omega,X_i)\| \leq L\|\theta-\omega\|, \qquad \forall \theta,\omega\in\mathbb{R}^d. \qquad (6)$$
Assumption 3.

There exists a positive constant $\mu$ such that

$$(F_i(\theta)-F_i(\omega))^T(\theta-\omega) \geq \mu\|\theta-\omega\|^2, \qquad \forall \theta,\omega\in\mathbb{R}^d,\; \forall i\in[N]. \qquad (7)$$

Assumption 1 implies that the Markov chain $\{X_i^k\}$ has a geometric mixing time, which depends on the second largest eigenvalue of the transition probability matrix of $\{X_i^k\}$ [31]. This assumption holds in various applications, e.g., in incremental optimization [32] and in reinforcement learning problems modeled by Markov decision processes with finitely many states and actions [33]. In addition, Assumption 1 is used in the existing literature to study the finite-time performance of SA and its distributed variants under Markov randomness; see [13, 16, 20, 19, 34, 35] and the references therein.

Assumption 2 is standard in the analysis of local SGD methods in federated learning [27]. It also holds in the reinforcement learning context considered in Section 4. Assumption 3 states that the $F_i$ are strongly monotone (the analogue of strong convexity in the context of SGD).

The following result is a consequence of Assumptions 1 and 2, and is shown in [20, Lemma 3.2]. This lemma states that each Markov chain $X_i$ has a geometric mixing time, a property that translates to the operator $F_i$ through the Lipschitz condition.

Lemma 1.

There exists a constant $C>0$ such that, given $\alpha>0$, we have $\tau_i(\alpha)\leq C\log(1/\alpha)$ for all $i\in[N]$ and

$$\left\|\mathbb{E}\left[F_i(\theta,X_i^k)\,|\,X_i^0\right]-F_i(\theta)\right\| \leq \alpha(\|\theta\|+1), \qquad \forall\theta,\; \forall k\geq\tau_i(\alpha). \qquad (8)$$

Finally, since ${\cal X}_i$ is finite for all $i\in[N]$, Assumption 2 also gives the following result.

Lemma 2.

Let $B=\max_i\max_{X_i\in{\cal X}_i}\|F_i(0,X_i)\|$. Then we have for all $\theta\in\mathbb{R}^d$

$$\|F_i(\theta)\| \leq B(\|\theta\|+1) \quad\text{and}\quad \|F_i(\theta,X_i)\| \leq B(\|\theta\|+1), \qquad \forall X_i\in{\cal X}_i,\; \forall i\in[N]. \qquad (9)$$

3.1 Constant step sizes

In this subsection, we derive the rate of Algorithm 1 under constant step sizes, that is, $\alpha_k=\alpha$ for all $k\geq 0$. Note that by Lemma 1, $\tau_i(\alpha)\leq C\log(1/\alpha)$ for a given constant $\alpha>0$. Thus, $\lim_{\alpha\rightarrow 0}\tau_i(\alpha)\alpha=0$ for all $i\in[N]$, which implies that there exists a sufficiently small positive $\alpha$ such that

$$\alpha\tau(\alpha) \leq \min\left\{\frac{\log(2)}{2BH}\;,\;\frac{\mu}{8N\left(19B^2H+9+57LBH\right)}\;,\;\frac{N}{2H\mu}\right\}, \qquad (10)$$
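As a sanity check, once the problem constants are fixed, condition (10) is straightforward to verify numerically; the constants in the call below are hypothetical and serve only to illustrate the computation.

```python
import math

def check_condition_10(alpha, tau_alpha, B, H, L, mu, N):
    """Verify the step-size condition (10): alpha * tau(alpha) <= min{...}."""
    bound = min(
        math.log(2) / (2 * B * H),
        mu / (8 * N * (19 * B**2 * H + 9 + 57 * L * B * H)),
        N / (2 * H * mu),
    )
    return alpha * tau_alpha <= bound

# hypothetical constants, for illustration only
print(check_condition_10(alpha=1e-6, tau_alpha=10, B=1.0, H=5, L=1.0, mu=0.5, N=10))
```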

where we recall that $\tau(\alpha)=\max_i\tau_i(\alpha)$. The results in this section are established under condition (10). Under constant step sizes, we have from (3), for all $k\geq 0$,

$$\theta_i^{k,H} = \theta_i^{k,0} - \alpha\sum_{t=0}^{H-1}F_i(\theta_i^{k,t};X_i^{k+t}) = \bar{\theta}^k - \alpha\sum_{t=0}^{H-1}F_i(\theta_i^{k,t};X_i^{k+t}), \qquad (11)$$

which implies

$$\bar{\theta}^{k+1} = \bar{\theta}^k - \frac{\alpha}{N}\sum_{i=1}^{N}\sum_{t=0}^{H-1}F_i(\theta_i^{k,t};X_i^{k+t}). \qquad (12)$$

To derive the finite-time bound of Algorithm 1, we require the following three technical lemmas. For ease of exposition, their proofs are presented in the Appendix. The first lemma upper bounds the norm of $\theta_i$ by the norm of $\bar{\theta}$.

Lemma 3.

Let $\{\bar{\theta}^k\}$ and $\{\theta_i^{k,t}\}$, for all $k\geq 0$ and $t\in[1,H]$, be generated by Algorithm 1. In addition, let the step size $\alpha$ satisfy (10). Then the following relations hold for all $k\geq 0$ and $t\in[0,H]$:

$$\|\theta_i^{k,t}\| \leq 2\|\bar{\theta}^k\| + 2BH\alpha \leq 2\|\bar{\theta}^k\| + 1, \qquad (13)$$
$$\|\theta_i^{k,t}-\bar{\theta}^k\| \leq 2BH\alpha\|\bar{\theta}^k\| + 2BH\alpha. \qquad (14)$$

Our next lemma provides an upper bound for the quantity $\|\bar{\theta}^k-\bar{\theta}^{k-\tau(\alpha)}\|$.

Lemma 4.

Let all the conditions in Lemma 3 hold. Then the following relations hold for all $k\geq 0$ and $t\in[0,H]$:

$$\|\bar{\theta}^k-\bar{\theta}^{k-\tau(\alpha)}\| \leq 12BH\alpha\tau(\alpha)\|\bar{\theta}^k\| + 12BH\alpha\tau(\alpha) \leq 2\|\bar{\theta}^k\| + 2, \qquad (15)$$
$$\|\bar{\theta}^k-\bar{\theta}^{k-\tau(\alpha)}\|^2 \leq 288B^2H^2\alpha^2\tau^2(\alpha)\|\bar{\theta}^k\|^2 + 288B^2H^2\alpha^2\tau^2(\alpha) \leq 8\|\bar{\theta}^k\|^2 + 8. \qquad (16)$$

Finally, we present an upper bound for the bias caused by Markovian noise.

Lemma 5.

Let all the conditions in Lemma 3 hold. Then we have

$$\begin{aligned}
-\sum_{i=1}^{N}\sum_{t=0}^{H-1}\mathbb{E}\left[\left\langle \bar{\theta}^k-\theta^{*},\,F_i(\theta_i^{k,t};X_i^{k+t})-F_i(\bar{\theta}^k)\right\rangle\right]
&\leq 36NH\alpha\,\mathbb{E}\left[\|\bar{\theta}^k-\theta^{*}\|^2\right] + 36NH\alpha\left(1+\|\theta^{*}\|\right)^2 \\
&\quad + 12(19L+6B)NBH^2\alpha\tau(\alpha)\,\mathbb{E}\left[\|\bar{\theta}^k-\theta^{*}\|^2\right] \\
&\quad + 12(19L+6B)NBH^2\left(1+\|\theta^{*}\|\right)^2\alpha\tau(\alpha). \qquad (17)
\end{aligned}$$

Our first main result is presented in the following theorem, where we derive the rate of Algorithm 1 under constant step sizes.

Theorem 1.

Let $\{\bar{\theta}^k\}$ and $\{\theta_i^{k,t}\}$, for all $k\geq 0$ and $t\in[1,H]$, be generated by Algorithm 1. In addition, let the step size $\alpha$ satisfy (10). Then we have

$$\begin{aligned}
\mathbb{E}\left[\|\bar{\theta}^{k+1}-\theta^{*}\|^2\right]
&\leq \left(1-\frac{H\mu\alpha}{N}\right)^{k+1-\tau(\alpha)}\mathbb{E}\left[\|\bar{\theta}^{\tau(\alpha)}-\theta^{*}\|^2\right] \\
&\quad + \frac{8NC\left(B^2H+9+3(19L+6B)BH\right)(\|\theta^{*}\|+1)^2}{\mu}\log(1/\alpha)\,\alpha. \qquad (18)
\end{aligned}$$
Remark 1.

Under constant step sizes, the bound in (18) shows that the mean square error generated by Algorithm 1 decays exponentially to a ball around the origin; as $\alpha$ decays to zero, the size of this ball also goes to zero. Second, our rate differs from the one obtained with i.i.d. data, for example in [27], only by a $\log$ factor; this reflects the impact of the Markovian randomness through the mixing time $\tau(\alpha)$. Third, our upper bound scales linearly with the number of local steps $H$.

Proof.

By (12) we consider

$$\begin{aligned}
\|\bar{\theta}^{k+1}-\theta^{*}\|^2
&= \left\|\bar{\theta}^k-\theta^{*}-\frac{\alpha}{N}\sum_{i=1}^{N}\sum_{t=0}^{H-1}F_i(\theta_i^{k,t};X_i^{k+t})\right\|^2 \\
&= \|\bar{\theta}^k-\theta^{*}\|^2 + \left\|\frac{\alpha}{N}\sum_{i=1}^{N}\sum_{t=0}^{H-1}F_i(\theta_i^{k,t};X_i^{k+t})\right\|^2 - \frac{2\alpha}{N}\left\langle \bar{\theta}^k-\theta^{*},\,\sum_{i=1}^{N}\sum_{t=0}^{H-1}F_i(\theta_i^{k,t};X_i^{k+t})\right\rangle \\
&= \|\bar{\theta}^k-\theta^{*}\|^2 + \left\|\frac{\alpha}{N}\sum_{i=1}^{N}\sum_{t=0}^{H-1}F_i(\theta_i^{k,t};X_i^{k+t})\right\|^2 - \frac{2H\alpha}{N}\left\langle \bar{\theta}^k-\theta^{*},\,F(\bar{\theta}^k)\right\rangle \\
&\quad - \frac{2\alpha}{N}\left\langle \bar{\theta}^k-\theta^{*},\,\sum_{i=1}^{N}\sum_{t=0}^{H-1}F_i(\theta_i^{k,t};X_i^{k+t})-F_i(\bar{\theta}^k)\right\rangle. \qquad (19)
\end{aligned}$$

First, using (9) and (13), we bound the second term on the right-hand side of (19):

$$\begin{aligned}
\left\|\frac{\alpha}{N}\sum_{i=1}^{N}\sum_{t=0}^{H-1}F_i(\theta_i^{k,t};X_i^{k+t})\right\|^2
&\leq \frac{\alpha^2}{N^2}\left|\sum_{i=1}^{N}\sum_{t=0}^{H-1}B(\|\theta_i^{k,t}\|+1)\right|^2 \leq \frac{\alpha^2}{N^2}\left|\sum_{i=1}^{N}\sum_{t=0}^{H-1}2B(\|\bar{\theta}^k\|+1)\right|^2 \\
&\leq 4B^2H^2\alpha^2(\|\bar{\theta}^k\|+1)^2 \leq 8B^2H^2\alpha^2\|\bar{\theta}^k-\theta^{*}\|^2 + 8B^2H^2(\|\theta^{*}\|+1)^2\alpha^2. \qquad (20)
\end{aligned}$$

Second, by Assumption 3 we have

$$-\frac{2H\alpha}{N}\left\langle \bar{\theta}^k-\theta^{*},\,F(\bar{\theta}^k)\right\rangle \leq -\frac{2H\mu\alpha}{N}\|\bar{\theta}^k-\theta^{*}\|^2. \qquad (21)$$

Thus, taking the expectation on both sides of (19) and using (17), (20), and (21) yields

$$\begin{aligned}
\mathbb{E}\left[\|\bar{\theta}^{k+1}-\theta^{*}\|^2\right]
&\leq \left(1-\frac{2H\mu\alpha}{N}\right)\mathbb{E}\left[\|\bar{\theta}^k-\theta^{*}\|^2\right] + 8B^2H^2\alpha^2\,\mathbb{E}\left[\|\bar{\theta}^k-\theta^{*}\|^2\right] + 8B^2H^2(\|\theta^{*}\|+1)^2\alpha^2 \\
&\quad + 72H\alpha^2\,\mathbb{E}\left[\|\bar{\theta}^k-\theta^{*}\|^2\right] + 72H\alpha^2\left(1+\|\theta^{*}\|\right)^2 \\
&\quad + 24(19L+6B)BH^2\alpha^2\tau(\alpha)\,\mathbb{E}\left[\|\bar{\theta}^k-\theta^{*}\|^2\right] \\
&\quad + 24(19L+6B)BH^2\left(1+\|\theta^{*}\|\right)^2\alpha^2\tau(\alpha) \\
&\leq \left(1-\frac{2H\mu\alpha}{N}\right)\mathbb{E}\left[\|\bar{\theta}^k-\theta^{*}\|^2\right] \\
&\quad + 8H\left(B^2H+9+3(19L+6B)BH\right)\tau(\alpha)\alpha^2\,\mathbb{E}\left[\|\bar{\theta}^k-\theta^{*}\|^2\right] \\
&\quad + 8H\left(B^2H+9+3(19L+6B)BH\right)(\|\theta^{*}\|+1)^2\tau(\alpha)\alpha^2.
\end{aligned}$$

Recall that $\alpha$ satisfies (10) and, by Lemma 1, $\tau(\alpha)\leq C\log(1/\alpha)$. Then the preceding relation yields (18), i.e., for all $k\geq\tau(\alpha)$,

$$\begin{aligned}
\mathbb{E}\left[\|\bar{\theta}^{k+1}-\theta^{*}\|^2\right]
&\leq \left(1-\frac{H\mu\alpha}{N}\right)\mathbb{E}\left[\|\bar{\theta}^k-\theta^{*}\|^2\right] \\
&\quad + 8H\left(B^2H+9+3(19L+6B)BH\right)(\|\theta^{*}\|+1)^2\tau(\alpha)\alpha^2 \\
&\leq \left(1-\frac{H\mu\alpha}{N}\right)^{k+1-\tau(\alpha)}\mathbb{E}\left[\|\bar{\theta}^{\tau(\alpha)}-\theta^{*}\|^2\right] \\
&\quad + 8H\left(B^2H+9+3(19L+6B)BH\right)(\|\theta^{*}\|+1)^2\tau(\alpha)\alpha^2\sum_{t=\tau(\alpha)}^{k}\left(1-\frac{H\mu\alpha}{N}\right)^{k-t} \\
&\leq \left(1-\frac{H\mu\alpha}{N}\right)^{k+1-\tau(\alpha)}\mathbb{E}\left[\|\bar{\theta}^{\tau(\alpha)}-\theta^{*}\|^2\right] \\
&\quad + \frac{8NC\left(B^2H+9+3(19L+6B)BH\right)(\|\theta^{*}\|+1)^2}{\mu}\log(1/\alpha)\,\alpha. \qquad\blacksquare
\end{aligned}$$

3.2 Time-varying step sizes

In this section, we derive the finite-time bound of Algorithm 1 under time-varying step sizes $\alpha_k$. We consider $\alpha_k$ nonnegative, decreasing, and satisfying $\lim_{k\rightarrow\infty}\alpha_k=0$. Thus, by Lemma 1 we have $\lim_{k\rightarrow\infty}\tau_i(\alpha_k)\alpha_k=0$ for all $i\in[N]$. This implies that there exists a positive integer ${\cal K}^{*}$ such that for all $k\geq{\cal K}^{*}$

$$\sum_{t=k-\tau(\alpha_k)}^{k}\alpha_t \leq \alpha_{k-\tau(\alpha_k)}\tau(\alpha_k) \leq \min\left\{\frac{\log(2)}{2BH}\;,\;\frac{\mu}{8N\left(19B^2H+9+57LBH\right)}\;,\;2\alpha_0\right\}, \qquad (22)$$

where we recall that $\tau(\alpha_k)=\max_i\tau_i(\alpha_k)$. For convenience, we denote

$$\alpha_{k;\tau(\alpha_k)} = \sum_{t=k-\tau(\alpha_k)}^{k}\alpha_t. \qquad (23)$$

Under the step sizes $\alpha_k$, we have from (3), for all $k\geq 0$,

$$\theta_i^{k,H} = \theta_i^{k,0} - \sum_{t=0}^{H-1}\alpha_{k+t}F_i(\theta_i^{k,t};X_i^{k+t}) = \bar{\theta}^k - \sum_{t=0}^{H-1}\alpha_{k+t}F_i(\theta_i^{k,t};X_i^{k+t}), \qquad (24)$$

which implies

$$\bar{\theta}^{k+1} = \bar{\theta}^k - \frac{1}{N}\sum_{i=1}^{N}\sum_{t=0}^{H-1}\alpha_{k+t}F_i(\theta_i^{k,t};X_i^{k+t}). \qquad (25)$$

Similar to the case of constant step sizes, we first present three technical lemmas that are useful for our main result in Theorem 2 below. For ease of exposition, their proofs are presented in the Appendix. The first lemma upper bounds the norm of $\theta_i$ by the norm of $\bar{\theta}$.

Lemma 6.

Let $\{\bar{\theta}^k\}$ and $\{\theta_i^{k,t}\}$, for all $k\geq 0$ and $t\in[1,H]$, be generated by Algorithm 1. In addition, let the step sizes $\{\alpha_k\}$ satisfy (22). Then the following relations hold for all $k\geq 0$ and $t\in[0,H]$:

$$\|\theta_i^{k,t}\| \leq 2\|\bar{\theta}^k\| + 2BH\alpha_k \leq 2\|\bar{\theta}^k\| + 1, \qquad (26)$$
$$\|\theta_i^{k,t}-\bar{\theta}^k\| \leq 2BH\alpha_k\|\bar{\theta}^k\| + 2BH\alpha_k. \qquad (27)$$

Our next lemma provides an upper bound for the quantity $\|\bar{\theta}^k-\bar{\theta}^{k-\tau(\alpha_k)}\|$.

Lemma 7.

Let all the conditions in Lemma 6 hold. Then the following relations hold for all $k\geq 0$ and $t\in[0,H]$:

$$\|\bar{\theta}^k-\bar{\theta}^{k-\tau(\alpha_k)}\| \leq 12BH\alpha_{k;\tau(\alpha_k)}\|\bar{\theta}^k\| + 12BH\alpha_{k;\tau(\alpha_k)} \leq 2\|\bar{\theta}^k\| + 2, \qquad (28)$$
$$\|\bar{\theta}^k-\bar{\theta}^{k-\tau(\alpha_k)}\|^2 \leq 288B^2H^2\alpha_{k;\tau(\alpha_k)}^2\|\bar{\theta}^k\|^2 + 288B^2H^2\alpha_{k;\tau(\alpha_k)}^2 \leq 8\|\bar{\theta}^k\|^2 + 8. \qquad (29)$$

Finally, we present an upper bound for the bias caused by Markovian noise.

Lemma 8.

Let all the conditions in Lemma 6 hold. Then we have

$$\begin{aligned}
-\sum_{i=1}^{N}\sum_{t=0}^{H-1}\alpha_{k+t}\,\mathbb{E}\left[\left\langle \bar{\theta}^k-\theta^{*},\,F_i(\theta_i^{k,t};X_i^{k+t})-F_i(\bar{\theta}^k)\right\rangle\right]
&\leq 36NH\alpha_k^2\,\mathbb{E}\left[\|\bar{\theta}^k-\theta^{*}\|^2\right] + 36NH\alpha_k^2\left(1+\|\theta^{*}\|\right)^2 \\
&\quad + 12(19L+6B)NBH^2\alpha_k\alpha_{k;\tau(\alpha_k)}\,\mathbb{E}\left[\|\bar{\theta}^k-\theta^{*}\|^2\right] \\
&\quad + 12(19L+6B)NBH^2\left(1+\|\theta^{*}\|\right)^2\alpha_k\alpha_{k;\tau(\alpha_k)}. \qquad (30)
\end{aligned}$$

The second main result in this paper is presented in the following theorem, where we study the rate of Algorithm 1 under time-varying step sizes.

Theorem 2.

Let $\{\bar{\theta}^k\}$ and $\{\theta_i^{k,t}\}$, for all $k\geq 0$ and $t\in[1,H]$, be generated by Algorithm 1. In addition, let the step sizes $\alpha_k=\alpha/(k+1)$ satisfy (22), where $\alpha=2N/(H\mu)$. Then we have for all $k\geq{\cal K}^{*}$

$$\begin{aligned}
\mathbb{E}\left[\|\bar{\theta}^{k+1}-\theta^{*}\|^2\right]
&\leq \frac{({\cal K}^{*})^2\,\mathbb{E}\left[\|\bar{\theta}^{{\cal K}^{*}}-\theta^{*}\|^2\right]}{(k+1)^2} \\
&\quad + \frac{16\alpha^2CH\left(B^2H+9+3(19L+6B)BH\right)(\|\theta^{*}\|+1)^2\log\left(\frac{k+1}{\alpha}\right)}{k+1}. \qquad (31)
\end{aligned}$$
Remark 2.

Here we have the same observations as in Theorem 1, except that the rate is now sublinear due to the time-varying step sizes. On the other hand, the mean square error decays to zero rather than to a neighborhood of the origin.

Proof.

By (25) we consider

$$\begin{aligned}
\|\bar{\theta}^{k+1}-\theta^{*}\|^2
&= \left\|\bar{\theta}^k-\theta^{*}-\frac{1}{N}\sum_{i=1}^{N}\sum_{t=0}^{H-1}\alpha_{k+t}F_i(\theta_i^{k,t};X_i^{k+t})\right\|^2 \\
&= \|\bar{\theta}^k-\theta^{*}\|^2 + \left\|\frac{1}{N}\sum_{i=1}^{N}\sum_{t=0}^{H-1}\alpha_{k+t}F_i(\theta_i^{k,t};X_i^{k+t})\right\|^2 - \frac{2}{N}\left\langle \bar{\theta}^k-\theta^{*},\,\sum_{i=1}^{N}\sum_{t=0}^{H-1}\alpha_{k+t}F_i(\theta_i^{k,t};X_i^{k+t})\right\rangle \\
&= \|\bar{\theta}^k-\theta^{*}\|^2 + \left\|\frac{1}{N}\sum_{i=1}^{N}\sum_{t=0}^{H-1}\alpha_{k+t}F_i(\theta_i^{k,t};X_i^{k+t})\right\|^2 - \frac{2H}{N}\left\langle \bar{\theta}^k-\theta^{*},\,F(\bar{\theta}^k)\sum_{t=0}^{H-1}\alpha_{k+t}\right\rangle \\
&\quad - \frac{2}{N}\left\langle \bar{\theta}^k-\theta^{*},\,\sum_{i=1}^{N}\sum_{t=0}^{H-1}\alpha_{k+t}\left(F_i(\theta_i^{k,t};X_i^{k+t})-F_i(\bar{\theta}^k)\right)\right\rangle. \qquad (32)
\end{aligned}$$

First, using (9) and (26), we bound the second term on the right-hand side of (32):

$$\begin{aligned}
\left\|\frac{1}{N}\sum_{i=1}^{N}\sum_{t=0}^{H-1}\alpha_{k+t}F_i(\theta_i^{k,t};X_i^{k+t})\right\|^2
&\leq \frac{\alpha_k^2}{N^2}\left|\sum_{i=1}^{N}\sum_{t=0}^{H-1}B(\|\theta_i^{k,t}\|+1)\right|^2 \leq \frac{\alpha_k^2}{N^2}\left|\sum_{i=1}^{N}\sum_{t=0}^{H-1}2B(\|\bar{\theta}^k\|+1)\right|^2 \\
&\leq 4B^2H^2\alpha_k^2(\|\bar{\theta}^k\|+1)^2 \leq 8B^2H^2\alpha_k^2\|\bar{\theta}^k-\theta^{*}\|^2 + 8B^2H^2(\|\theta^{*}\|+1)^2\alpha_k^2. \qquad (33)
\end{aligned}$$

Second, by Assumption 3 we have

$$-\frac{2H}{N}\left\langle \bar{\theta}^k-\theta^{*},\,F(\bar{\theta}^k)\sum_{t=0}^{H-1}\alpha_{k+t}\right\rangle \leq -\frac{2H\mu\sum_{t=0}^{H-1}\alpha_{k+t}}{N}\|\bar{\theta}^k-\theta^{*}\|^2 \leq -\frac{2H\mu}{N}\alpha_k\|\bar{\theta}^k-\theta^{*}\|^2. \qquad (34)$$

Thus, taking the expectation on both sides of (32) and using (30), (33), and (34) yields

$$\begin{aligned}
\mathbb{E}\left[\|\bar{\theta}^{k+1}-\theta^{*}\|^2\right]
&\leq \left(1-\frac{2H\mu\alpha_k}{N}\right)\mathbb{E}\left[\|\bar{\theta}^k-\theta^{*}\|^2\right] + 8B^2H^2\alpha_k^2\,\mathbb{E}\left[\|\bar{\theta}^k-\theta^{*}\|^2\right] + 8B^2H^2(\|\theta^{*}\|+1)^2\alpha_k^2 \\
&\quad + 72H\alpha_k^2\,\mathbb{E}\left[\|\bar{\theta}^k-\theta^{*}\|^2\right] + 72H\alpha_k^2\left(1+\|\theta^{*}\|\right)^2 \\
&\quad + 24(19L+6B)BH^2\alpha_k\alpha_{k;\tau(\alpha_k)}\,\mathbb{E}\left[\|\bar{\theta}^k-\theta^{*}\|^2\right] \\
&\quad + 24(19L+6B)BH^2\left(1+\|\theta^{*}\|\right)^2\alpha_k\alpha_{k;\tau(\alpha_k)} \\
&\leq \left(1-\frac{2H\mu\alpha_k}{N}\right)\mathbb{E}\left[\|\bar{\theta}^k-\theta^{*}\|^2\right] \\
&\quad + 8H\left(B^2H+9+3(19L+6B)BH\right)\alpha_k\alpha_{k;\tau(\alpha_k)}\,\mathbb{E}\left[\|\bar{\theta}^k-\theta^{*}\|^2\right] \\
&\quad + 8H\left(B^2H+9+3(19L+6B)BH\right)(\|\theta^{*}\|+1)^2\alpha_k\alpha_{k;\tau(\alpha_k)}. \qquad (35)
\end{aligned}$$

Recall that $\alpha_k=\alpha/(k+1)$ satisfies (22) with $\alpha=2N/(H\mu)$. Then we have

$$(k+1)^2\left(1-\frac{H\mu\alpha_k}{N}\right) = (k+1)(k-1) \leq k^2.$$

In addition, by Lemma 1 and (22) we have

$$(k+1)\alpha_{k;\tau(\alpha_k)} \leq \frac{\alpha\tau(\alpha_k)(k+1)}{k+1-\tau(\alpha_k)} \leq 2\alpha\tau(\alpha_k) \leq 2C\alpha\log\left(\frac{k+1}{\alpha}\right).$$

Thus, multiplying both sides of (35) by $(k+1)^2$, and noting from (22) that $8H\left(B^2H+9+3(19L+6B)BH\right)\alpha_{k;\tau(\alpha_k)}\leq H\mu/N$ so that the second term in (35) is absorbed into the contraction factor $1-H\mu\alpha_k/N$, yields

$$\begin{aligned}
(k+1)^2\,\mathbb{E}\left[\|\bar{\theta}^{k+1}-\theta^{*}\|^2\right]
&\leq (k+1)^2\left(1-\frac{H\mu\alpha_k}{N}\right)\mathbb{E}\left[\|\bar{\theta}^k-\theta^{*}\|^2\right] \\
&\quad + 8H\left(B^2H+9+3(19L+6B)BH\right)(\|\theta^{*}\|+1)^2\alpha_k\alpha_{k;\tau(\alpha_k)}(k+1)^2 \\
&\leq k^2\,\mathbb{E}\left[\|\bar{\theta}^k-\theta^{*}\|^2\right] \\
&\quad + 8\alpha H\left(B^2H+9+3(19L+6B)BH\right)(\|\theta^{*}\|+1)^2\alpha_{k;\tau(\alpha_k)}(k+1) \\
&\leq k^2\,\mathbb{E}\left[\|\bar{\theta}^k-\theta^{*}\|^2\right] \\
&\quad + 16\alpha^2CH\left(B^2H+9+3(19L+6B)BH\right)(\|\theta^{*}\|+1)^2\log\left(\frac{k+1}{\alpha}\right) \\
&\leq ({\cal K}^{*})^2\,\mathbb{E}\left[\|\bar{\theta}^{{\cal K}^{*}}-\theta^{*}\|^2\right] \\
&\quad + 16\alpha^2CH\left(B^2H+9+3(19L+6B)BH\right)(\|\theta^{*}\|+1)^2\sum_{t={\cal K}^{*}}^{k}\log\left(\frac{t+1}{\alpha}\right) \\
&\leq ({\cal K}^{*})^2\,\mathbb{E}\left[\|\bar{\theta}^{{\cal K}^{*}}-\theta^{*}\|^2\right] \\
&\quad + 16\alpha^2CH\left(B^2H+9+3(19L+6B)BH\right)(\|\theta^{*}\|+1)^2\,k\log\left(\frac{k+1}{\alpha}\right),
\end{aligned}$$

where the fourth inequality applies the preceding relation recursively from ${\cal K}^{*}$ to $k$.

Dividing both sides by $(k+1)^2$ yields (31). ∎
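As a quick numerical illustration of Theorems 1 and 2, the toy experiment below runs Algorithm 1 on a scalar problem with Markovian data under both step-size regimes. The operator, the two-state chain, and all constants are made up for this sketch and are not taken from the analysis above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy instance: N agents with scalar operators F_i(theta; x) = theta - b[i, x],
# each strongly monotone with mu = 1; the data follow a two-state Markov chain.
N, H, K = 5, 4, 2000
P = np.array([[0.9, 0.1], [0.2, 0.8]])     # transition matrix (made up)
b = rng.normal(size=(N, 2))                # per-agent, per-state offsets
pi = np.array([2 / 3, 1 / 3])              # stationary distribution of P
theta_star = (b @ pi).mean()               # root of sum_i E_pi[F_i(.)]

def run(step):
    theta_bar, states = 0.0, rng.integers(0, 2, size=N)
    errors = []
    for k in range(K):
        local = []
        for i in range(N):
            th = theta_bar
            for t in range(H):                        # H local SA steps, update (3)
                th -= step(k + t) * (th - b[i, states[i]])
                states[i] = rng.choice(2, p=P[states[i]])
            local.append(th)
        theta_bar = np.mean(local)                    # server averaging, update (4)
        errors.append((theta_bar - theta_star) ** 2)
    return errors

err_const = run(lambda k: 0.05)             # constant step: error plateaus (Thm 1)
err_decay = run(lambda k: 1.0 / (k + 10))   # alpha_k ~ 1/(k+1): error -> 0 (Thm 2)
print(err_const[-1], err_decay[-1])
```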

4 Motivating applications

In this section, we consider three concrete applications in federated learning [2, 3] and multi-task reinforcement learning [36] that can be formulated as problem (1). These problems can therefore be solved by Algorithm 1, and one can use our results to provide theoretical guarantees for the resulting methods.

4.1 Federated learning

In federated learning, multiple agents (clients or workers) collaboratively solve a machine learning problem under the coordination of a centralized server [2, 3]. Instead of sharing their data with the server, the agents run local updates of their models (parameters) based on their data, and the results are aggregated at the server with the goal of optimizing a global learning objective. This approach has gained much interest recently due to its efficiency in data processing, system privacy, and operating costs.

A central problem in federated learning is the distributed (or federated) optimization problem, where the goal is to solve

$$\underset{\theta\in\mathbb{R}^d}{\text{minimize}}\;\; G(\theta) \triangleq \frac{1}{N}\sum_{i=1}^{N}G_i(\theta). \qquad (36)$$

Here each $G_i(\theta)=\mathbb{E}_{X_i\sim\pi_i}[G_i(\theta,X_i)]$ is a local loss function and $\pi_i$ is the distribution of the data located at agent $i$. For $i\neq j$, $\pi_i$ and $\pi_j$ may be very different, which is referred to as data heterogeneity across the agents. The most popular method for solving (36) is the so-called local stochastic gradient descent (SGD) [21, 22, 23, 24, 25, 26, 27], which can be viewed as a variant of Algorithm 1. In particular, let $F_i(\theta)=\nabla G_i(\theta)$ and let $F_i(\theta,X_i)$ be its stochastic (sub)gradient $\nabla G_i(\theta,X_i)$. Then at each iteration $k\geq 0$, each agent $i$ initializes $\theta_i^{k,0}=\bar{\theta}^k$ and runs $H$ steps of local SGD to update $\theta_i^{k,t}$. These values are then aggregated by the server to form $\bar{\theta}^{k+1}$, i.e.,

$$\begin{aligned}
&\text{Agent } i: \quad \theta_i^{k,t+1} = \theta_i^{k,t} - \alpha_{k+t}\nabla G_i(\theta_i^{k,t},X_i^{k+t}), \qquad \theta_i^{k,0}=\bar{\theta}^k,\; t\in[0,H-1],\\
&\text{Server}: \quad\;\; \bar{\theta}^{k+1} = \frac{1}{N}\sum_{i=1}^{N}\theta_i^{k,H}.
\end{aligned}$$

In the federated optimization literature, it is often assumed that the samples $\{X_i^k\}$ are drawn i.i.d. from $\pi_i$ and that the resulting stochastic gradients are unbiased, i.e., $\nabla G_i(\theta)=\mathbb{E}[\nabla G_i(\theta,X_i^k)]$. In addition, the variance of these samples is assumed to be bounded. These assumptions are a special case of the ones considered in this paper.
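A minimal sketch of the local SGD scheme above for a simple quadratic loss follows. The loss $G_i(\theta)=\frac{1}{2}\|\theta-\text{target}_i\|^2$, the i.i.d. noise model, and all constants are hypothetical choices made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
N, H, d, K = 10, 5, 3, 500
targets = rng.normal(size=(N, d))          # heterogeneous local minimizers

def grad_G(i, theta, noise):
    """Stochastic gradient of G_i(theta) = 0.5 * ||theta - targets[i]||^2;
    `noise` stands in for the randomness of the data sample X_i."""
    return theta - targets[i] + noise

theta_bar = np.zeros(d)
for k in range(K):
    local = []
    for i in range(N):
        theta = theta_bar.copy()
        for t in range(H):                        # H local SGD steps
            noise = 0.1 * rng.normal(size=d)      # i.i.d. noise, per the FL setting
            theta -= (1.0 / (k * H + t + 10)) * grad_G(i, theta, noise)
        local.append(theta)
    theta_bar = np.mean(local, axis=0)            # server averaging

# the minimizer of (36) for this G is the mean of the local targets
print(np.linalg.norm(theta_bar - targets.mean(axis=0)))
```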

4.2 Distributed multi-task reinforcement learning

We consider a multi-task reinforcement learning problem over a network of $N$ agents operating in $N$ different environments modeled by Markov decision processes (MDPs). Here, each environment represents a task assigned to an agent. We assume that the agents can communicate directly with a centralized coordinator. This is a distributed multi-task reinforcement learning (MTRL) problem over a multi-agent system, which can be characterized mathematically using $N$ different MDPs as follows.

Let ${\cal M}_i=({\cal S}_i,{\cal A}_i,{\cal P}_i,{\cal R}_i,\gamma_i)$ be a 5-tuple representing the discounted-reward MDP at agent $i$, where ${\cal S}_i$, ${\cal A}_i$, and ${\cal P}_i$ are the sets of states, actions, and transition probability matrices, respectively. In addition, ${\cal R}_i$ is the reward function and $\gamma_i\in(0,1)$ is the discount factor. Note that the sets of states and actions at the agents can (partially) overlap with each other, and we denote their unions by ${\cal S}=\cup_i{\cal S}_i$ and ${\cal A}=\cup_i{\cal A}_i$. We focus on randomized stationary policies (RSPs), where a policy $\pi$ assigns to each $s\in{\cal S}$ a probability distribution $\pi(\cdot|s)$ over ${\cal A}$.

Given a policy $\pi$, let $V_i^\pi$ be the value function associated with the $i$-th environment,

$$V_i^\pi(s_i) = \mathbb{E}\left[\sum_{k=0}^{\infty}\gamma_i^k{\cal R}_i(s_i^k,a_i^k)\,\Big|\,s_i^0=s_i\right], \qquad a_i^k\sim\pi(\cdot|s_i^k).$$

Similarly, we denote by $Q_i^\pi$ the $Q$-function in the $i$-th environment,

$$Q_i^\pi(s_i,a_i) = \mathbb{E}\left[\sum_{k=0}^{\infty}\gamma_i^k{\cal R}_i(s_i^k,a_i^k)\,\Big|\,s_i^0=s_i,\,a_i^0=a_i\right].$$

Let $\rho_i$ be an initial state distribution over ${\cal S}_i$ and, with some abuse of notation, denote the long-term reward associated with this distribution by $V_i^\pi(\rho_i)=\mathbb{E}_{s_i\sim\rho_i}\left[V_i^\pi(s_i)\right]$. The goal of the agents is to cooperatively find a policy $\pi^{*}$ that maximizes the total accumulated discounted reward

$$\max_{\pi}\; V(\pi;\boldsymbol{\rho}) \triangleq \sum_{i=1}^{N}V_i^\pi(\rho_i), \qquad \boldsymbol{\rho}=\left[\rho_1,\ldots,\rho_N\right]^T. \qquad (40)$$

Treating the environments as independent RL problems would produce different policies $\pi_i^{*}$, each maximizing its respective $V_i^\pi$. The goal of MTRL is instead to find a single $\pi^{*}$ that balances the performance across all environments. In the following, we present two fundamental problems in this area, both of which can be formulated in the form of (1)–(2). As a consequence, we can apply Algorithm 1 to solve these problems in a distributed manner.

4.2.1 Distributed TD($\lambda$) with linear function approximation

One of the most fundamental problems in RL is the policy evaluation problem, where the goal is to estimate the value function $V^\pi$ associated with a stationary policy $\pi$. This problem arises as a subproblem in RL policy search methods, including policy iteration and actor-critic methods. Our focus here is the multi-task variant of the policy evaluation problem, that is, we want to estimate the sum of the value functions $V_i^\pi$ of a stationary policy $\pi$. In addition, we study this problem when the state space ${\cal S}$ is very large, which motivates the use of function approximation. We consider the linear function approximation $\tilde{V}_i^\theta$ of $V_i^\pi$, parameterized by a weight vector $\theta\in\mathbb{R}^L$ and given as

$$\tilde{V}_i^\theta(s) = \sum_{\ell=1}^{L}\theta_\ell\phi_{i,\ell}(s), \qquad \forall s\in{\cal S},$$

for a given set of $L$ basis vectors $\phi_{i,\ell}:{\cal S}\rightarrow\mathbb{R}$, $\ell\in\{1,\ldots,L\}$; some examples of how to choose these vectors can be found in [37]. Here we assume that $\phi_{i,\ell}(s)=0$ if $s\notin{\cal S}_i$, implying $\tilde{V}_i^\theta(s)=0$. We are interested in the case $L\ll M=|{\cal S}|$. Our goal is to find $\theta^{*}$ that provides a good approximation of the sum of the value functions at the agents, i.e.,

$$\sum_{i=1}^{N}V_i^\pi \approx \sum_{i=1}^{N}\Phi_i\theta^{*},$$

where $\Phi_i\in\mathbb{R}^{M\times L}$ is the feature matrix whose $s$-th row is the feature vector $\phi_i(s)^T$ of agent $i$, with

$$\phi_i(s) = (\phi_{i,1}(s),\ldots,\phi_{i,L}(s))^T \in\mathbb{R}^L.$$

Distributed TD($\lambda$). To solve this problem, we consider a distributed variant of TD($\lambda$), originally studied in [38] and analyzed explicitly in [39, 15, 16]. For simplicity, we consider the case $\lambda=0$; the case $\lambda\in[0,1]$ can be handled in a similar manner. In particular, each agent $i$ maintains an estimate $\theta_i$ of $\theta^{*}$, and the centralized coordinator maintains $\bar{\theta}$, the average of these $\theta_i$. At each iteration $k\geq 0$, each agent $i$ initializes $\theta_i^{k,0}=\bar{\theta}^k$ and runs $H$ steps of local TD(0) to update $\theta_i^{k,t}$. These values are then aggregated by the server to form $\bar{\theta}^{k+1}$, i.e., set $\theta_i^{k,0}=\bar{\theta}^k$ and for all $t\in[0,H-1]$,

$$\begin{aligned}
&\text{Agent } i: \quad \theta_i^{k,t+1} = \theta_i^{k,t} + \alpha_{k+t}\left({\cal R}_i^{k+t}+\gamma\phi_i(s_i^{k+t+1})^T\theta_i^{k,t}-\phi_i(s_i^{k+t})^T\theta_i^{k,t}\right)\phi_i(s_i^{k+t}),\\
&\text{Server}: \quad\;\; \bar{\theta}^{k+1} = \frac{1}{N}\sum_{i=1}^{N}\theta_i^{k,H}, \qquad (41)
\end{aligned}$$

where ${\cal R}_i^k={\cal R}_i(s_i^k,a_i^k)$ and $\{s_i^k,s_i^{k+1},{\cal R}_i^k\}$ is the data tuple observed at time $k$ at agent $i$.

Let $\{X_i^k\}=\left\{(s_i^k,s_i^{k+1},a_i^k)\right\}$, which forms a Markov chain. The update above can then be viewed as a local stochastic approximation method for finding the root of a linear operator. Indeed, let ${\bf A}_i(X_i^k)$ and $b_i(X_i^k)$ be defined as

$$\begin{aligned}
{\bf A}_i(X_i^k) &= \phi_i(s_i^k)\left(\gamma\phi_i(s_i^{k+1})-\phi_i(s_i^k)\right)^T,\\
b_i(X_i^k) &= {\cal R}_i^k\phi_i(s_i^k).
\end{aligned}$$

Moreover, let $\pi_i$ be the stationary distribution of the underlying Markov chain $\{X_i^k\}$ and

$${\bf A}_i = \mathbb{E}_{\pi_i}\left[{\bf A}_i(X_i^k)\right], \qquad b_i = \mathbb{E}_{\pi_i}\left[b_i(X_i^k)\right].$$

Thus, if in this case we consider

$$F_i(\theta_i^k;X_i^k) = -{\bf A}_i(X_i^k)\theta_i^k - b_i(X_i^k),$$

then (41) is a variant of Algorithm 1 in which each $F_i$ is linear in $\theta$. In this case, local TD(0) seeks to find $\theta^{*}$ satisfying

$$\sum_{i=1}^{N}\left({\bf A}_i\theta^{*}+b_i\right) = 0.$$
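A sketch of one agent's inner loop in (41) is given below; the environment interface `env_step`, the feature map `phi`, and the discount factor are placeholders for whatever the $i$-th MDP provides.

```python
import numpy as np

def local_td0(theta_bar, phi, env_step, s0, alphas, k, H, gamma):
    """One agent's H local TD(0) steps, i.e., the agent-side update in (41).

    phi(s)      -- feature vector phi_i(s) in R^L
    env_step(s) -- returns (reward, next_state) under the evaluated policy pi
    """
    theta, s = theta_bar.copy(), s0
    for t in range(H):
        r, s_next = env_step(s)
        # TD error: R_i + gamma * phi(s')^T theta - phi(s)^T theta
        td_error = r + gamma * phi(s_next) @ theta - phi(s) @ theta
        theta = theta + alphas[k + t] * td_error * phi(s)
        s = s_next
    return theta, s   # theta is sent to the coordinator for averaging
```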

To establish the convergence of (41), the following conditions are assumed in the literature [39, 15, 16].

Assumption 4.

The instantaneous rewards at the agents are uniformly bounded, i.e., there exists a constant $R$ such that $|{\cal R}_i(s,s')|\leq R$ for all $s,s'\in{\cal S}$ and $i\in[N]$.

Assumption 5.

For each $i\in[N]$, the feature vectors $\{\phi_{i,\ell}\}$, $\ell\in\{1,\ldots,L\}$, are linearly independent, i.e., the matrix $\Phi_i$ has full column rank. In addition, we assume that all feature vectors $\phi_i(s)$ are uniformly bounded, i.e., $\|\phi_i(s)\|\leq 1$.

Assumption 6.

Each Markov chain $\{X_i^k\}$ is irreducible and aperiodic.

Under Assumptions 4–6, one can verify that Assumptions 1–3 hold [39]. For example, under these assumptions each ${\bf A}_i$ is a negative definite matrix, i.e., $x^T{\bf A}_ix<0$ for all $x\neq 0$.

4.2.2 Distributed Q-learning with linear function approximation

In this section, we consider a distributed variant of the classic Q-learning method [40] for solving problem (40). Similar to the case of TD($\lambda$), we focus on the linear function approximation $\tilde{Q}_i^\theta$ of $Q_i$, parameterized by a weight vector $\theta\in\mathbb{R}^L$ and defined as

$$\tilde{Q}_i^\theta(s,a) = \sum_{\ell=1}^{L}\theta_\ell\phi_{i,\ell}(s,a), \qquad \forall (s,a)\in{\cal S}\times{\cal A},$$

for a given set of $L$ basis vectors $\phi_{i,\ell}:{\cal S}\times{\cal A}\rightarrow\mathbb{R}$, $\ell\in\{1,\ldots,L\}$. We assume again that $\phi_{i,\ell}(s,a)=0$ if either $s\notin{\cal S}_i$ or $a\notin{\cal A}_i$, implying $\tilde{Q}_i^\theta(s,a)=0$. Let $\Phi_i\in\mathbb{R}^{|{\cal S}||{\cal A}|\times L}$ be the feature matrix whose $(s,a)$-th row is $\phi_i(s,a)=(\phi_{i,1}(s,a),\ldots,\phi_{i,L}(s,a))^T$. The goal of distributed Q-learning is then to solve

$$\sum_{i=1}^{N}\Phi_i\theta = \sum_{i=1}^{N}\Pi{\cal T}_i[\Phi_i\theta],$$

where $\Pi$ denotes the projection onto the linear subspace spanned by the feature vectors and ${\cal T}_i$ is the Bellman operator associated with the $Q$-function of environment $i$; see, for example, [20].

To solve this problem, we consider a distributed variant of Q-learning [20, 41, 42]. In particular, each agent $i$ maintains an estimate $\theta_i$ of $\theta^{*}$, and the centralized coordinator maintains $\bar{\theta}$, the average of these $\theta_i$. At each iteration $k\geq 0$, each agent $i$ initializes $\theta_i^{k,0}=\bar{\theta}^k$ and runs $H$ steps of local Q-learning to update $\theta_i^{k,t}$. These values are then aggregated by the server to form $\bar{\theta}^{k+1}$, i.e., set $\theta_i^{k,0}=\bar{\theta}^k$ and for all $t\in[0,H-1]$,

$$\begin{aligned}
&\text{Agent } i: \quad \theta_i^{k,t+1} = \theta_i^{k,t} + \alpha_{k+t}\left({\cal R}_i^{k+t}+\gamma\max_{a'}\phi_i(s_i^{k+t+1},a')^T\theta_i^{k,t}-\phi_i(s_i^{k+t},a_i^{k+t})^T\theta_i^{k,t}\right)\phi_i(s_i^{k+t},a_i^{k+t}),\\
&\text{Server}: \quad\;\; \bar{\theta}^{k+1} = \frac{1}{N}\sum_{i=1}^{N}\theta_i^{k,H}, \qquad (42)
\end{aligned}$$

where ${\cal R}_i^k={\cal R}_i(s_i^k,a_i^k)$ and $\{s_i^k,s_i^{k+1},a_i^k\}$ is the sample trajectory generated by some behavior policy $\sigma_i$ at agent $i$. Let $X_i^k=\{s_i^k,a_i^k,s_i^{k+1}\}$, which forms a Markov chain. We define the nonlinear mapping $F_i$ as

$$F_i(\theta;X_i^k) = \phi_i(s_i^k,a_i^k)\left[{\cal R}_i(s_i^k,a_i^k)+\gamma\max_{a}\phi_i(s_i^{k+1},a)^T\theta-\phi_i(s_i^k,a_i^k)^T\theta\right],$$

and $F_i(\theta)=\mathbb{E}_{\pi_i}\left[F_i(\theta;X_i^k)\right]$, where $\pi_i$ is the stationary distribution of $\{X_i^k\}$. The goal of (42) is then to find $\theta^{*}$ such that

$$\sum_{i=1}^{N}F_i(\theta^{*}) = 0.$$

Under proper assumptions, see for example [20, Theorem 1], one can verify that Assumptions 1–3 hold. Thus, we can apply our results in Theorems 1 and 2 to derive finite-time performance bounds for the distributed Q-learning method in (42).
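Analogously to the TD(0) sketch above, the following is a sketch of one agent's $H$ local Q-learning steps in (42); the behavior policy, the feature map, and the environment interface are placeholders introduced for illustration.

```python
import numpy as np

def local_q_learning(theta_bar, phi, actions, env_step, behavior, s0,
                     alphas, k, H, gamma):
    """One agent's H local Q-learning steps, i.e., the agent-side update in (42).

    phi(s, a)      -- feature vector phi_i(s, a) in R^L
    behavior(s)    -- action drawn from the behavior policy sigma_i
    env_step(s, a) -- returns (reward, next_state)
    """
    theta, s = theta_bar.copy(), s0
    for t in range(H):
        a = behavior(s)
        r, s_next = env_step(s, a)
        q_max = max(phi(s_next, ap) @ theta for ap in actions)
        td_error = r + gamma * q_max - phi(s, a) @ theta
        theta = theta + alphas[k + t] * td_error * phi(s, a)
        s = s_next
    return theta, s   # theta is sent to the coordinator for averaging
```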

5 Concluding remark

This paper studies local stochastic approximation over a network of agents, where the data at each agent are generated from a Markov process. Our main contribution is a finite-time bound on the convergence of the mean square error generated by the algorithm to zero. Our results generalize the existing literature, where the data at the agents are assumed to be i.i.d., an assumption that does not hold for a number of algorithms in multi-task reinforcement learning over multi-agent systems.

References

  • [1] H. Robbins and S. Monro, “A stochastic approximation method,” The Annals of Mathematical Statistics, vol. 22, no. 3, pp. 400–407, 1951.
  • [2] P. Kairouz et al., “Advances and open problems in federated learning,” available at: https://arxiv.org/abs/1912.04977, 2020.
  • [3] T. Li, A. Sahu, A. Talwalkar, and V. Smith, “Federated learning: Challenges, methods, and future directions,” available at: https://arxiv.org/abs/1908.07873, 2020.
  • [4] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” in International conference on machine learning, 2016, pp. 1928–1937.
  • [5] L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward, Y. Doron, V. Firoiu, T. Harley, I. Dunning et al., “Impala: Scalable distributed deep-rl with importance weighted actor-learner architectures,” arXiv preprint arXiv:1802.01561, 2018.
  • [6] M. Hessel, H. Soyer, L. Espeholt, W. Czarnecki, S. Schmitt, and H. van Hasselt, “Multi-task deep reinforcement learning with popart,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 3796–3803.
  • [7] L. Bottou, F. Curtis, and J. Nocedal, “Optimization methods for large-scale machine learning,” SIAM Review, vol. 60, no. 2, pp. 223–311, 2018.
  • [8] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, 2nd ed.   MIT Press, 2018.
  • [9] D. Bertsekas and J. Tsitsiklis, Neuro-Dynamic Programming, 2nd ed.   Athena Scientific, Belmont, MA, 1999.
  • [10] V. Borkar, Stochastic Approximation: A Dynamical Systems Viewpoint.   Cambridge University Press, 2008.
  • [11] H. Kushner and G. Yin, Stochastic Approximation Algorithms and Applications.   Springer, NY, 2003.
  • [12] G. Lan, First-order and Stochastic Optimization Methods for Machine Learning.   Springer-Nature, 2020.
  • [13] T. Sun, Y. Sun, and W. Yin, “On Markov chain gradient descent,” in Proceedings of the 32nd International Conference on Neural Information Processing Systems, ser. NIPS’18, 2018, pp. 9918–9927.
  • [14] T. T. Doan, L. M. Nguyen, N. H. Pham, and J. Romberg, “Convergence rates of accelerated Markov gradient descent with applications in reinforcement learning,” available at: https://arxiv.org/abs/2002.02873, 2020.
  • [15] J. Bhandari, D. Russo, and R. Singal, “A finite time analysis of temporal difference learning with linear function approximation,” in COLT, 2018.
  • [16] R. Srikant and L. Ying, “Finite-time error bounds for linear stochastic approximation and TD learning,” in COLT, 2019.
  • [17] B. Hu and U. Syed, “Characterizing the exact behaviors of temporal difference learning algorithms using Markov jump linear system theory,” in Advances in Neural Information Processing Systems 32, 2019.
  • [18] B. Karimi, B. Miasojedow, E. Moulines, and H. Wai, “Non-asymptotic analysis of biased stochastic approximation scheme,” in Conference on Learning Theory, COLT 2019, Phoenix, AZ, USA, 2019, pp. 1944–1974.
  • [19] T. T. Doan, “Finite-time analysis and restarting scheme for linear two-time-scale stochastic approximation,” available at: https://arxiv.org/abs/1912.10583, 2019.
  • [20] C. Z. Chen, S. Zhang, T. T. Doan, S. T. Maguluri, and J.-P. Clarke, “Performance of Q-learning with linear function approximation: Stability and finite-time analysis,” available at: https://arxiv.org/abs/1905.11425, 2019.
  • [21] T. T. Doan, J. Lubars, C. L. Beck, and R. Srikant, “Convergence rate of distributed random projections,” in 7th IFAC Workshop on Distributed Estimation and Control in Networked Systems (NECSYS), vol. 51, no. 23, 2018, pp. 373–378.
  • [22] S. U. Stich, “Local SGD converges fast and communicates little,” in International Conference on Learning Representations, 2019. [Online]. Available: https://openreview.net/forum?id=S1g2JnRcFX
  • [23] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient learning of deep networks from decentralized data,” in Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, vol. 54, 2017, pp. 1273–1282.
  • [24] B. Woodworth, K. Patel, S. Stich, Z. Dai, B. Bullins, B. McMahan, O. Shamir, and N. Srebro, “Is local SGD better than minibatch SGD?” available at: https://arxiv.org/abs/2002.07839, 2020.
  • [25] F. Haddadpour, M. M. Kamani, M. Mahdavi, and V. Cadambe, “Local SGD with periodic averaging: Tighter analysis and adaptive synchronization,” in Advances in Neural Information Processing Systems 32, 2019, pp. 11082–11094.
  • [26] S. P. Karimireddy, S. Kale, M. Mohri, S. J. Reddi, S. U. Stich, and A. T. Suresh, “SCAFFOLD: Stochastic controlled averaging for on-device federated learning,” 2019. [Online]. Available: https://arxiv.org/abs/1910.06378
  • [27] A. Khaled, K. Mishchenko, and P. Richtárik, “Tighter theory for local SGD on identical and heterogeneous data,” in the 23rd International Conference on Artificial Intelligence and Statistics, 2020.
  • [28] L. T. Liu, U. Dogan, and K. Hofmann, “Decoding multitask DQN in the world of Minecraft,” in the 13th European Workshop on Reinforcement Learning (EWRL), 2016.
  • [29] T. Yu, S. Kumar, A. Gupta, S. Levine, K. Hausman, and C. Finn, “Gradient surgery for multi-task learning,” available at: https://arxiv.org/abs/2001.06782, 2020.
  • [30] A. Nair, P. Srinivasan, S. Blackwell, C. Alcicek, R. Fearon, A. Maria, V. Panneershelvam, M. Suleyman, C. Beattie, S. Petersen, S. Legg, V. Mnih, K. Kavukcuoglu, and D. Silver, “Massively parallel methods for deep reinforcement learning,” 2015.
  • [31] D. A. Levin, Y. Peres, and E. L. Wilmer, Markov Chains and Mixing Times. American Mathematical Society, 2006.
  • [32] S. Ram, A. Nedić, and V. V. Veeravalli, “Incremental stochastic subgradient algorithms for convex optimization,” SIAM Journal on Optimization, vol. 20, no. 2, pp. 691–717, 2009.
  • [33] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton et al., “Mastering the game of Go without human knowledge,” Nature, vol. 550, no. 7676, pp. 354–359, 2017.
  • [34] Y. Wu, W. Zhang, P. Xu, and Q. Gu, “A finite time analysis of two time-scale actor critic methods,” available at: https://arxiv.org/abs/2005.01350, 2020.
  • [35] T. T. Doan, S. T. Maguluri, and J. Romberg, “Finite-time performance of distributed temporal difference learning with linear function approximation,” available at: https://arxiv.org/abs/1907.12530, 2019.
  • [36] S. Zeng, A. Anwar, T. Doan, J. Romberg, and A. Raychowdhury, “A decentralized policy gradient approach to multi-task reinforcement learning,” available at: https://arxiv.org/abs/2006.04338, 2020.
  • [37] G. Konidaris, S. Osentoski, and P. Thomas, “Value function approximation in reinforcement learning using the Fourier basis,” in Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence, 2011, pp. 380–385.
  • [38] R. S. Sutton, “Learning to predict by the methods of temporal differences,” Machine Learning, vol. 3, no. 1, pp. 9–44, 1988.
  • [39] J. N. Tsitsiklis and B. V. Roy, “An analysis of temporal-difference learning with function approximation,” IEEE Transactions on Automatic Control, vol. 42, no. 5, pp. 674–690, 1997.
  • [40] C. Watkins and P. Dayan, “Q-learning,” Machine Learning, vol. 8, pp. 279–292, 1992.
  • [41] F. S. Melo, S. P. Meyn, and M. I. Ribeiro, “An analysis of reinforcement learning with function approximation,” in Proceedings of the 25th International Conference on Machine Learning, 2008, pp. 664–671.
  • [42] D. Lee and N. He, “A unified switching system perspective and O.D.E. analysis of Q-learning algorithms,” available at: https://arxiv.org/abs/1912.02270, 2019.

Appendix A Proofs of Lemmas 3–8

A.1 Proof of Lemma 3

We first show (13). Indeed, by (3) and (9) we have, for any $t\in[0,H-1]$,

\begin{align}
\|\theta_{i}^{k,t+1}-\theta_{i}^{k,t}\| = \alpha\left\|F_{i}(\theta_{i}^{k,t};X_{i}^{k+t})\right\| \le B\alpha\left(\|\theta_{i}^{k,t}\|+1\right), \tag{43}
\end{align}

which gives

\begin{align*}
\|\theta_{i}^{k,t+1}\| \le (1+B\alpha)\|\theta_{i}^{k,t}\| + B\alpha \le (1+B\alpha)^{t+1}\|\theta_{i}^{k,0}\| + B\alpha\sum_{u=0}^{t}(1+B\alpha)^{t-u}.
\end{align*}

Using the relation $1+x\le e^{x}$ for all $x\ge 0$ in the preceding inequality gives (13), i.e.,

\begin{align*}
\|\theta_{i}^{k,t+1}\| \le e^{B\alpha(t+1)}\|\theta_{i}^{k,0}\| + B\alpha(t+1)e^{B\alpha(t+1)} \le e^{B\alpha H}\|\theta_{i}^{k,0}\| + BH\alpha e^{B\alpha H} \le 2\|\bar{\theta}^{k}\| + 2BH\alpha \le 2\|\bar{\theta}^{k}\| + 1,
\end{align*}

where the last two inequalities are due to (10), i.e., $HB\alpha\le\log(2)/(2\tau(\alpha))\le\log(2)$, and recall that $\theta_{i}^{k,0}=\bar{\theta}^{k}$.
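For completeness, the first inequality in the chain above follows by evaluating the geometric sum in the previous display and then applying $1+x\le e^{x}$ together with $e^{x}-1\le xe^{x}$ for $x\ge 0$:

\begin{align*}
B\alpha\sum_{u=0}^{t}(1+B\alpha)^{t-u} = (1+B\alpha)^{t+1}-1 \le e^{B\alpha(t+1)}-1 \le B\alpha(t+1)\,e^{B\alpha(t+1)}.
\end{align*}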

Next, using (43) and (13) we obtain, for all $t\in[0,H-1]$,

\begin{align}
\|\theta_{i}^{k,t+1}-\theta_{i}^{k,t}\| \le 2B\alpha\|\bar{\theta}^{k}\| + 2B^{2}H\alpha^{2} + B\alpha, \tag{44}
\end{align}

which implies (14), i.e., for all $t\in[1,H]$ we have

\begin{align*}
\|\theta_{i}^{k,t}-\bar{\theta}^{k}\| &= \left\|\sum_{u=0}^{t-1}\left(\theta_{i}^{k,u+1}-\theta_{i}^{k,u}\right)\right\| \le \sum_{u=0}^{t-1}\|\theta_{i}^{k,u+1}-\theta_{i}^{k,u}\|\\
&\le 2B\alpha t\|\bar{\theta}^{k}\| + 2B^{2}H\alpha^{2}t + B\alpha t \le 2BH\alpha\|\bar{\theta}^{k}\| + 2B^{2}H^{2}\alpha^{2} + BH\alpha\\
&\le 2BH\alpha\|\bar{\theta}^{k}\| + 2BH\alpha,
\end{align*}

where the last inequality is due to (10).
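As a sanity check on Lemma 3, the following minimal Python sketch simulates one round of $H$ local SA updates for a hypothetical operator $F(\theta;X)=A_{X}\theta+b_{X}$ with $\|A_{X}\|,\|b_{X}\|\le B$, so that the growth condition $\|F(\theta;X)\|\le B(\|\theta\|+1)$ of (9) holds, and verifies (13) and (14) along the sample path; all constants are illustrative choices rather than quantities from the paper.

```python
import numpy as np

# Minimal numerical sanity check of Lemma 3 (eqs. (13)-(14)).
# Hypothetical operator: F(theta; X) = A_X @ theta + b_X with ||A_X||, ||b_X|| <= B,
# which satisfies the growth condition ||F(theta; X)|| <= B(||theta|| + 1) in (9).
rng = np.random.default_rng(0)
d, B, H = 5, 2.0, 10
alpha = np.log(2) / (2 * B * H)    # step size with B*H*alpha <= log(2)/2, cf. (10)

theta_bar = rng.normal(size=d)     # round-k average; theta_i^{k,0} = bar-theta^k
theta = theta_bar.copy()
for t in range(H):
    # a fresh sample: any A_X, b_X with norms at most B is admissible here
    A = rng.normal(size=(d, d)); A *= B / np.linalg.norm(A, 2)
    b = rng.normal(size=d);      b *= B / np.linalg.norm(b)
    theta = theta + alpha * (A @ theta + b)     # one local SA update, cf. (3)
    # eq. (13): the local iterate stays in a ball around the last average
    assert np.linalg.norm(theta) <= 2 * np.linalg.norm(theta_bar) + 1
    # eq. (14): within-round drift of the local iterate from the average
    assert (np.linalg.norm(theta - theta_bar)
            <= 2 * B * H * alpha * (np.linalg.norm(theta_bar) + 1))
print("Lemma 3 bounds hold along this sample path.")
```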

A.2 Proof of Lemma 4

We first show (15). Using (12) and (9) we have

\begin{align*}
\|\bar{\theta}^{k+1}\| - \|\bar{\theta}^{k}\| &\le \|\bar{\theta}^{k+1}-\bar{\theta}^{k}\| \le \frac{\alpha}{N}\sum_{i=1}^{N}\sum_{t=0}^{H-1}\left\|F_{i}(\theta_{i}^{k,t};X_{i}^{k+t})\right\|\\
&\le \frac{\alpha}{N}\sum_{i=1}^{N}\sum_{t=0}^{H-1}B\left(\|\theta_{i}^{k,t}\|+1\right) \le BH\alpha\left(\|\bar{\theta}^{k}\|+1\right) + \frac{\alpha}{N}\sum_{i=1}^{N}\sum_{t=0}^{H-1}B\left\|\theta_{i}^{k,t}-\bar{\theta}^{k}\right\|,
\end{align*}

which, using (14) and $BH\alpha\le\log(2)/2$ (from (10)), gives

\begin{align}
\|\bar{\theta}^{k+1}\| - \|\bar{\theta}^{k}\| \le \|\bar{\theta}^{k+1}-\bar{\theta}^{k}\| &\le BH\alpha\left(\|\bar{\theta}^{k}\|+1\right) + BH\alpha\left(2BH\alpha\|\bar{\theta}^{k}\| + 2BH\alpha\right) \nonumber\\
&\le 2BH\alpha\|\bar{\theta}^{k}\| + 2BH\alpha. \tag{45}
\end{align}

The preceding relation yields

\begin{align*}
\|\bar{\theta}^{k+1}\| \le (1+2BH\alpha)\|\bar{\theta}^{k}\| + 2BH\alpha.
\end{align*}

Using the relation $1+x\le e^{x}$ for all $x\ge 0$, the inequality above gives, for all $t\in[k-\tau(\alpha),k-1]$,

\begin{align*}
\|\bar{\theta}^{t}\| &\le (1+2BH\alpha)^{t-k+\tau(\alpha)}\|\bar{\theta}^{k-\tau(\alpha)}\| + 2BH\sum_{u=k-\tau(\alpha)}^{t-1}\alpha(1+2BH\alpha)^{t-u-1}\\
&\le (1+2BH\alpha)^{\tau(\alpha)}\|\bar{\theta}^{k-\tau(\alpha)}\| + 2BH\alpha\tau(\alpha)(1+2BH\alpha)^{\tau(\alpha)}\\
&\le e^{2BH\alpha\tau(\alpha)}\|\bar{\theta}^{k-\tau(\alpha)}\| + 2BH\alpha\tau(\alpha)e^{2BH\alpha\tau(\alpha)}\\
&\le 2\|\bar{\theta}^{k-\tau(\alpha)}\| + 4BH\alpha\tau(\alpha),
\end{align*}

where the last inequality is due to (10), i.e., $2HB\tau(\alpha)\alpha\le\log(2)$. Using the preceding relation we have from (45)

\begin{align*}
\|\bar{\theta}^{k}-\bar{\theta}^{k-\tau(\alpha)}\| &\le \sum_{t=k-\tau(\alpha)}^{k-1}\|\bar{\theta}^{t+1}-\bar{\theta}^{t}\| \le \sum_{t=k-\tau(\alpha)}^{k-1}2BH\alpha\left(\|\bar{\theta}^{t}\|+1\right)\\
&\le 2BH\alpha\sum_{t=k-\tau(\alpha)}^{k-1}\left(2\|\bar{\theta}^{k-\tau(\alpha)}\| + 4BH\alpha\tau(\alpha)\right) + 2BH\alpha\tau(\alpha)\\
&\le 4BH\alpha\tau(\alpha)\|\bar{\theta}^{k-\tau(\alpha)}\| + 4BH\alpha\tau(\alpha),
\end{align*}

where the last inequality is due to (10). Using the preceding inequality and the triangle inequality yields

\begin{align*}
\|\bar{\theta}^{k}-\bar{\theta}^{k-\tau(\alpha)}\| &\le 4BH\alpha\tau(\alpha)\|\bar{\theta}^{k}-\bar{\theta}^{k-\tau(\alpha)}\| + 4BH\alpha\tau(\alpha)\|\bar{\theta}^{k}\| + 4BH\alpha\tau(\alpha)\\
&\le \frac{2}{3}\|\bar{\theta}^{k}-\bar{\theta}^{k-\tau(\alpha)}\| + 4BH\alpha\tau(\alpha)\|\bar{\theta}^{k}\| + 4BH\alpha\tau(\alpha),
\end{align*}

where in the last inequality we use (10). Rearranging the inequality above yields (15)

\begin{align*}
\|\bar{\theta}^{k}-\bar{\theta}^{k-\tau(\alpha)}\| \le 12BH\alpha\tau(\alpha)\|\bar{\theta}^{k}\| + 12BH\alpha\tau(\alpha) \le 2\|\bar{\theta}^{k}\| + 2.
\end{align*}

Squaring both sides of the preceding relation and using the Cauchy-Schwarz inequality yields (16)

\begin{align*}
\|\bar{\theta}^{k}-\bar{\theta}^{k-\tau(\alpha)}\|^{2} \le 288B^{2}H^{2}\alpha^{2}\tau^{2}(\alpha)\|\bar{\theta}^{k}\|^{2} + 288B^{2}H^{2}\alpha^{2}\tau^{2}(\alpha) \le 8\|\bar{\theta}^{k}\|^{2} + 8.
\end{align*}
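Here the constant $288=2\cdot 12^{2}$; a minimal worked step, using only the elementary inequality $(a+b)^{2}\le 2a^{2}+2b^{2}$, is

\begin{align*}
\left(12BH\alpha\tau(\alpha)\|\bar{\theta}^{k}\| + 12BH\alpha\tau(\alpha)\right)^{2} \le 2\left(12BH\alpha\tau(\alpha)\right)^{2}\|\bar{\theta}^{k}\|^{2} + 2\left(12BH\alpha\tau(\alpha)\right)^{2} = 288B^{2}H^{2}\alpha^{2}\tau^{2}(\alpha)\left(\|\bar{\theta}^{k}\|^{2}+1\right).
\end{align*}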

A.3 Proof of Lemma 5

Consider

\begin{align}
&-\sum_{i=1}^{N}\sum_{t=0}^{H-1}\left\langle \bar{\theta}^{k}-\theta^{*},\,F_{i}(\theta_{i}^{k,t};X_{i}^{k+t})-F_{i}(\bar{\theta}^{k})\right\rangle \nonumber\\
&= -\sum_{i=1}^{N}\sum_{t=0}^{H-1}\left\langle \bar{\theta}^{k}-\bar{\theta}^{k-\tau(\alpha)},\,F_{i}(\theta_{i}^{k,t};X_{i}^{k+t})-F_{i}(\bar{\theta}^{k})\right\rangle - \sum_{i=1}^{N}\sum_{t=0}^{H-1}\left\langle \bar{\theta}^{k-\tau(\alpha)}-\theta^{*},\,F_{i}(\theta_{i}^{k,t};X_{i}^{k+t})-F_{i}(\bar{\theta}^{k})\right\rangle \nonumber\\
&= -\sum_{i=1}^{N}\sum_{t=0}^{H-1}\left\langle \bar{\theta}^{k}-\bar{\theta}^{k-\tau(\alpha)},\,F_{i}(\theta_{i}^{k,t};X_{i}^{k+t})-F_{i}(\bar{\theta}^{k})\right\rangle \nonumber\\
&\quad - \sum_{i=1}^{N}\sum_{t=0}^{H-1}\left\langle \bar{\theta}^{k-\tau(\alpha)}-\theta^{*},\,F_{i}(\theta_{i}^{k-\tau(\alpha),t};X_{i}^{k+t})-F_{i}(\theta_{i}^{k-\tau(\alpha),t})\right\rangle \nonumber\\
&\quad - \sum_{i=1}^{N}\sum_{t=0}^{H-1}\left\langle \bar{\theta}^{k-\tau(\alpha)}-\theta^{*},\,F_{i}(\theta_{i}^{k,t};X_{i}^{k+t})-F_{i}(\theta_{i}^{k-\tau(\alpha),t};X_{i}^{k+t})\right\rangle \nonumber\\
&\quad - \sum_{i=1}^{N}\sum_{t=0}^{H-1}\left\langle \bar{\theta}^{k-\tau(\alpha)}-\theta^{*},\,F_{i}(\theta_{i}^{k-\tau(\alpha),t})-F_{i}(\bar{\theta}^{k-\tau(\alpha)})\right\rangle \nonumber\\
&\quad - \sum_{i=1}^{N}\sum_{t=0}^{H-1}\left\langle \bar{\theta}^{k-\tau(\alpha)}-\theta^{*},\,F_{i}(\bar{\theta}^{k-\tau(\alpha)})-F_{i}(\bar{\theta}^{k})\right\rangle. \tag{46}
\end{align}

We first consider the second term on the right-hand side of (46). Let ${\cal F}_{k}$ be the $\sigma$-field containing all the information generated by Algorithm 1 up to time $k$. Then, using (8) we have

\begin{align*}
&-\sum_{i=1}^{N}\sum_{t=0}^{H-1}\mathbb{E}\left[\left\langle \bar{\theta}^{k-\tau(\alpha)}-\theta^{*},\,F_{i}(\theta_{i}^{k-\tau(\alpha),t};X_{i}^{k+t})-F_{i}(\theta_{i}^{k-\tau(\alpha),t})\right\rangle \,\Big|\, {\cal F}_{k+t-\tau(\alpha)}\right]\\
&= -\sum_{i=1}^{N}\sum_{t=0}^{H-1}\left\langle \bar{\theta}^{k-\tau(\alpha)}-\theta^{*},\,\mathbb{E}\left[F_{i}(\theta_{i}^{k-\tau(\alpha),t};X_{i}^{k+t})-F_{i}(\theta_{i}^{k-\tau(\alpha),t})\,\Big|\,{\cal F}_{k+t-\tau(\alpha)}\right]\right\rangle\\
&\le \sum_{i=1}^{N}\sum_{t=0}^{H-1}\left\|\bar{\theta}^{k-\tau(\alpha)}-\theta^{*}\right\|\left\|\mathbb{E}\left[F_{i}(\theta_{i}^{k-\tau(\alpha),t};X_{i}^{k+t})-F_{i}(\theta_{i}^{k-\tau(\alpha),t})\,\Big|\,{\cal F}_{k+t-\tau(\alpha)}\right]\right\|\\
&\le \sum_{i=1}^{N}\sum_{t=0}^{H-1}\alpha\left\|\bar{\theta}^{k-\tau(\alpha)}-\theta^{*}\right\|\left(\left\|\theta_{i}^{k-\tau(\alpha),t}\right\|+1\right) \le NH\alpha\left\|\bar{\theta}^{k-\tau(\alpha)}-\theta^{*}\right\|\max_{i,t}\left(\left\|\theta_{i}^{k-\tau(\alpha),t}\right\|+1\right),
\end{align*}

which by using the triangle inequality and (13) yields

\begin{align}
&-\sum_{i=1}^{N}\sum_{t=0}^{H-1}\mathbb{E}\left[\left\langle \bar{\theta}^{k-\tau(\alpha)}-\theta^{*},\,F_{i}(\theta_{i}^{k-\tau(\alpha),t};X_{i}^{k+t})-F_{i}(\theta_{i}^{k-\tau(\alpha),t})\right\rangle \,\Big|\, {\cal F}_{k+t-\tau(\alpha)}\right] \nonumber\\
&\le NH\alpha\left(\|\bar{\theta}^{k}-\bar{\theta}^{k-\tau(\alpha)}\| + \|\bar{\theta}^{k}-\theta^{*}\|\right)\left(2\|\bar{\theta}^{k-\tau(\alpha)}\| + 2BH\alpha + 1\right) \nonumber\\
&\le NH\alpha\left(\|\bar{\theta}^{k}-\bar{\theta}^{k-\tau(\alpha)}\| + \|\bar{\theta}^{k}-\theta^{*}\|\right)\left(2\|\bar{\theta}^{k}-\bar{\theta}^{k-\tau(\alpha)}\| + 2\|\bar{\theta}^{k}\| + 2\right) \nonumber\\
&\le 2NH\alpha\left(2\|\bar{\theta}^{k}\| + 2 + \|\bar{\theta}^{k}-\theta^{*}\|\right)\left(3\|\bar{\theta}^{k}\| + 3\right) \nonumber\\
&\le 6NH\alpha\left(3\|\bar{\theta}^{k}-\theta^{*}\| + 2 + 2\|\theta^{*}\|\right)\left(\|\bar{\theta}^{k}-\theta^{*}\| + 1 + \|\theta^{*}\|\right) \nonumber\\
&\le 18NH\alpha\left(\|\bar{\theta}^{k}-\theta^{*}\| + 1 + \|\theta^{*}\|\right)^{2} \le 36NH\alpha\,\|\bar{\theta}^{k}-\theta^{*}\|^{2} + 36NH\alpha\left(1+\|\theta^{*}\|\right)^{2}, \tag{47}
\end{align}

where the third inequality is due to (15) and the last inequality is due to the Cauchy-Schwarz inequality. Next, we consider the third term on the right-hand side of (46). Indeed, using (6) we have

\begin{align}
&-\sum_{i=1}^{N}\sum_{t=0}^{H-1}\left\langle \bar{\theta}^{k-\tau(\alpha)}-\theta^{*},\,F_{i}(\theta_{i}^{k,t};X_{i}^{k+t})-F_{i}(\theta_{i}^{k-\tau(\alpha),t};X_{i}^{k+t})\right\rangle \nonumber\\
&\le L\sum_{i=1}^{N}\sum_{t=0}^{H-1}\|\bar{\theta}^{k-\tau(\alpha)}-\theta^{*}\|\,\|\theta_{i}^{k,t}-\theta_{i}^{k-\tau(\alpha),t}\| \nonumber\\
&\le L\sum_{i=1}^{N}\sum_{t=0}^{H-1}\|\bar{\theta}^{k-\tau(\alpha)}-\theta^{*}\|\left(\|\theta_{i}^{k,t}-\bar{\theta}^{k}\| + \|\bar{\theta}^{k}-\bar{\theta}^{k-\tau(\alpha)}\| + \|\bar{\theta}^{k-\tau(\alpha)}-\theta_{i}^{k-\tau(\alpha),t}\|\right). \tag{48}
\end{align}

Similarly, using (6) we consider the last two terms on the right-hand side of (46)

\begin{align*}
&-\sum_{i=1}^{N}\sum_{t=0}^{H-1}\left(\left\langle \bar{\theta}^{k-\tau(\alpha)}-\theta^{*},\,F_{i}(\theta_{i}^{k-\tau(\alpha),t})-F_{i}(\bar{\theta}^{k-\tau(\alpha)})\right\rangle + \left\langle \bar{\theta}^{k-\tau(\alpha)}-\theta^{*},\,F_{i}(\bar{\theta}^{k-\tau(\alpha)})-F_{i}(\bar{\theta}^{k})\right\rangle\right)\\
&\le L\sum_{i=1}^{N}\sum_{t=0}^{H-1}\|\bar{\theta}^{k-\tau(\alpha)}-\theta^{*}\|\left(\|\theta_{i}^{k-\tau(\alpha),t}-\bar{\theta}^{k-\tau(\alpha)}\| + \|\bar{\theta}^{k}-\bar{\theta}^{k-\tau(\alpha)}\|\right),
\end{align*}

which by adding to (48) yields

\begin{align*}
&-\sum_{i=1}^{N}\sum_{t=0}^{H-1}\left(\left\langle \bar{\theta}^{k-\tau(\alpha)}-\theta^{*},\,F_{i}(\theta_{i}^{k-\tau(\alpha),t})-F_{i}(\bar{\theta}^{k-\tau(\alpha)})\right\rangle + \left\langle \bar{\theta}^{k-\tau(\alpha)}-\theta^{*},\,F_{i}(\bar{\theta}^{k-\tau(\alpha)})-F_{i}(\bar{\theta}^{k})\right\rangle\right)\\
&\qquad - \sum_{i=1}^{N}\sum_{t=0}^{H-1}\left\langle \bar{\theta}^{k-\tau(\alpha)}-\theta^{*},\,F_{i}(\theta_{i}^{k,t};X_{i}^{k+t})-F_{i}(\theta_{i}^{k-\tau(\alpha),t};X_{i}^{k+t})\right\rangle\\
&\le L\sum_{i=1}^{N}\sum_{t=0}^{H-1}\|\bar{\theta}^{k-\tau(\alpha)}-\theta^{*}\|\left(\|\theta_{i}^{k,t}-\bar{\theta}^{k}\| + 2\|\theta_{i}^{k-\tau(\alpha),t}-\bar{\theta}^{k-\tau(\alpha)}\| + 2\|\bar{\theta}^{k}-\bar{\theta}^{k-\tau(\alpha)}\|\right).
\end{align*}

Using (14), (15), and the triangle inequality in the preceding relation yields

\begin{align}
&-\sum_{i=1}^{N}\sum_{t=0}^{H-1}\left(\left\langle \bar{\theta}^{k-\tau(\alpha)}-\theta^{*},\,F_{i}(\theta_{i}^{k-\tau(\alpha),t})-F_{i}(\bar{\theta}^{k-\tau(\alpha)})\right\rangle + \left\langle \bar{\theta}^{k-\tau(\alpha)}-\theta^{*},\,F_{i}(\bar{\theta}^{k-\tau(\alpha)})-F_{i}(\bar{\theta}^{k})\right\rangle\right) \nonumber\\
&\qquad - \sum_{i=1}^{N}\sum_{t=0}^{H-1}\left\langle \bar{\theta}^{k-\tau(\alpha)}-\theta^{*},\,F_{i}(\theta_{i}^{k,t};X_{i}^{k+t})-F_{i}(\theta_{i}^{k-\tau(\alpha),t};X_{i}^{k+t})\right\rangle \nonumber\\
&\le L\sum_{i=1}^{N}\sum_{t=0}^{H-1}\left(\|\bar{\theta}^{k-\tau(\alpha)}-\bar{\theta}^{k}\| + \|\bar{\theta}^{k}-\theta^{*}\|\right) \nonumber\\
&\hskip 71.13188pt\times\left(2BH\alpha\|\bar{\theta}^{k}\| + 4BH\alpha\|\bar{\theta}^{k-\tau(\alpha)}\| + 6BH\alpha + 24BH\alpha\tau(\alpha)\|\bar{\theta}^{k}\| + 18BH\alpha\tau(\alpha)\right) \nonumber\\
&\le LNH\left(2\|\bar{\theta}^{k}\|+2+\|\bar{\theta}^{k}-\theta^{*}\|\right)\left(26BH\alpha\tau(\alpha)\|\bar{\theta}^{k}\| + 24BH\alpha\tau(\alpha) + 4BH\alpha\|\bar{\theta}^{k-\tau(\alpha)}-\bar{\theta}^{k}\| + 4BH\alpha\|\bar{\theta}^{k}\|\right) \nonumber\\
&\le LNH\left(3\|\bar{\theta}^{k}-\theta^{*}\|+2+2\|\theta^{*}\|\right)\left(38BH\alpha\tau(\alpha)\|\bar{\theta}^{k}\| + 32BH\alpha\tau(\alpha)\right) \nonumber\\
&\le LNH\left(3\|\bar{\theta}^{k}-\theta^{*}\|+2+2\|\theta^{*}\|\right)\left(38BH\alpha\tau(\alpha)\|\bar{\theta}^{k}-\theta^{*}\| + 38BH(\|\theta^{*}\|+1)\alpha\tau(\alpha)\right) \nonumber\\
&\le 114LBNH^{2}\alpha\tau(\alpha)\left(\|\bar{\theta}^{k}-\theta^{*}\| + \|\theta^{*}\| + 1\right)^{2} \nonumber\\
&\le 228LBNH^{2}\alpha\tau(\alpha)\,\|\bar{\theta}^{k}-\theta^{*}\|^{2} + 228LBNH^{2}\alpha\tau(\alpha)\left(\|\theta^{*}\|+1\right)^{2}. \tag{49}
\end{align}

Finally, we consider the first term on the right-hand side of (46). Using (13), (15), and (9) we have

\begin{align}
&-\sum_{i=1}^{N}\sum_{t=0}^{H-1}\left\langle \bar{\theta}^{k}-\bar{\theta}^{k-\tau(\alpha)},\,F_{i}(\theta_{i}^{k,t};X_{i}^{k+t})-F_{i}(\bar{\theta}^{k})\right\rangle \le B\sum_{i=1}^{N}\sum_{t=0}^{H-1}\left\|\bar{\theta}^{k}-\bar{\theta}^{k-\tau(\alpha)}\right\|\left(\|\theta_{i}^{k,t}\| + \|\bar{\theta}^{k}\| + 2\right) \nonumber\\
&\le B\sum_{i=1}^{N}\sum_{t=0}^{H-1}\left(12BH\alpha\tau(\alpha)\|\bar{\theta}^{k}\| + 12BH\alpha\tau(\alpha)\right)\left(3\|\bar{\theta}^{k}\| + 2 + 2BH\alpha\right) \nonumber\\
&\le 12NB^{2}H^{2}\alpha\tau(\alpha)\left(\|\bar{\theta}^{k}\|+1\right)\left(3\|\bar{\theta}^{k}\|+3\right) \le 36NB^{2}H^{2}\alpha\tau(\alpha)\left(\|\bar{\theta}^{k}\|+1\right)^{2} \nonumber\\
&\le 36NB^{2}H^{2}\alpha\tau(\alpha)\left(\|\bar{\theta}^{k}-\theta^{*}\| + \|\theta^{*}\| + 1\right)^{2} \nonumber\\
&\le 72NB^{2}H^{2}\alpha\tau(\alpha)\,\|\bar{\theta}^{k}-\theta^{*}\|^{2} + 72NB^{2}H^{2}(1+\|\theta^{*}\|)^{2}\alpha\tau(\alpha), \tag{50}
\end{align}

where in the third inequality we use (10) to have $2BH\alpha\le 1$. Finally, taking the expectation on both sides of (46) and using (47), (49), and (50) yields (17)

\begin{align*}
&-\sum_{i=1}^{N}\sum_{t=0}^{H-1}\mathbb{E}\left[\left\langle \bar{\theta}^{k}-\theta^{*},\,F_{i}(\theta_{i}^{k,t};X_{i}^{k+t})-F_{i}(\bar{\theta}^{k})\right\rangle\right]\\
&\le 36NH\alpha\,\mathbb{E}\left[\|\bar{\theta}^{k}-\theta^{*}\|^{2}\right] + 36NH\alpha\left(1+\|\theta^{*}\|\right)^{2} + 228LBNH^{2}\alpha\tau(\alpha)\,\mathbb{E}\left[\|\bar{\theta}^{k}-\theta^{*}\|^{2}\right]\\
&\qquad + 228LBNH^{2}\alpha\tau(\alpha)\left(\|\theta^{*}\|+1\right)^{2} + 72NB^{2}H^{2}\alpha\tau(\alpha)\,\mathbb{E}\left[\|\bar{\theta}^{k}-\theta^{*}\|^{2}\right] + 72NB^{2}H^{2}(1+\|\theta^{*}\|)^{2}\alpha\tau(\alpha)\\
&\le 36NH\alpha\,\mathbb{E}\left[\|\bar{\theta}^{k}-\theta^{*}\|^{2}\right] + 36NH\alpha\left(1+\|\theta^{*}\|\right)^{2}\\
&\qquad + 12(19L+6B)NBH^{2}\alpha\tau(\alpha)\,\mathbb{E}\left[\|\bar{\theta}^{k}-\theta^{*}\|^{2}\right] + 12(19L+6B)NBH^{2}\left(1+\|\theta^{*}\|\right)^{2}\alpha\tau(\alpha).
\end{align*}
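The Markovian data enter the analysis above only through (8), the bias bound after $\tau(\alpha)$ steps of mixing. The minimal Python sketch below, for a hypothetical two-state chain, illustrates why $\tau(\alpha)$ grows only like $\log(1/\alpha)$ for a geometrically ergodic chain; this is precisely the source of the logarithmic factor in the final convergence rates.

```python
import numpy as np

# Illustration of the mixing time tau(alpha): the smallest t such that
#   max_x || P^t(x, .) - pi ||_TV <= alpha.
# For a geometrically ergodic chain this grows like log(1/alpha).
# The two-state chain below is a hypothetical example.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])                      # transition matrix
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.isclose(evals, 1.0)]).ravel()
pi /= pi.sum()                                  # stationary distribution

def tau(alpha, max_t=10_000):
    """Smallest t with worst-case total-variation distance to pi at most alpha."""
    Pt = np.eye(2)
    for t in range(1, max_t + 1):
        Pt = Pt @ P
        tv = 0.5 * np.abs(Pt - pi).sum(axis=1).max()
        if tv <= alpha:
            return t
    raise RuntimeError("chain did not mix within max_t steps")

for alpha in [1e-1, 1e-2, 1e-3, 1e-4]:
    print(f"alpha = {alpha:7.0e}  ->  tau(alpha) = {tau(alpha)}")
# tau(alpha) grows linearly in log(1/alpha), so alpha * tau(alpha) -> 0.
```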

A.4 Proof of Lemma 6

We first show (26). Indeed, by (3) and (9) we have, for any $t\in[0,H-1]$,

\begin{align}
\|\theta_{i}^{k,t+1}-\theta_{i}^{k,t}\| = \alpha_{k+t}\left\|F_{i}(\theta_{i}^{k,t};X_{i}^{k+t})\right\| \le B\alpha_{k+t}\left(\|\theta_{i}^{k,t}\|+1\right), \tag{51}
\end{align}

which gives

\begin{align*}
\|\theta_{i}^{k,t+1}\| \le (1+B\alpha_{k+t})\|\theta_{i}^{k,t}\| + B\alpha_{k+t} \le \prod_{u=0}^{t}(1+B\alpha_{k+u})\|\theta_{i}^{k,0}\| + B\sum_{u=0}^{t}\alpha_{k+u}\prod_{\ell=u+1}^{t}(1+B\alpha_{k+\ell}).
\end{align*}

Using the relation $1+x\le e^{x}$ for all $x\ge 0$ in the preceding inequality, and since $\alpha_{k}$ is decreasing, we obtain (26), i.e., for $t\in[0,H-1]$,

\begin{align*}
\|\theta_{i}^{k,t+1}\| &\le \exp\left\{B\sum_{u=0}^{t}\alpha_{k+u}\right\}\|\theta_{i}^{k,0}\| + B\sum_{u=0}^{t}\alpha_{k+u}\exp\left\{B\sum_{\ell=u+1}^{t}\alpha_{k+\ell}\right\}\\
&\le \exp\{BH\alpha_{k}\}\|\theta_{i}^{k,0}\| + B\sum_{u=0}^{t}\alpha_{k+u}\exp\{BH\alpha_{k}\}\\
&\le 2\|\bar{\theta}^{k}\| + 2BH\alpha_{k},
\end{align*}

where the second inequality is due to (22), i.e., $HB\alpha_{k}\le\log(2)$, and recall that $\theta_{i}^{k,0}=\bar{\theta}^{k}$. Next, using (51) and (26), and since $\alpha_{k}$ is decreasing, we obtain for all $t\in[0,H-1]$

\begin{align}
\|\theta_{i}^{k,t+1}-\theta_{i}^{k,t}\| \le B\alpha_{k}\left(\|\theta_{i}^{k,t}\|+1\right) \le 2B\alpha_{k}\|\bar{\theta}^{k}\| + 2B^{2}H\alpha_{k}^{2} + B\alpha_{k}, \tag{52}
\end{align}

which implies (27), i.e., for all $t\in[1,H]$ we have

\begin{align*}
\|\theta_{i}^{k,t}-\bar{\theta}^{k}\| &= \left\|\sum_{u=0}^{t-1}\left(\theta_{i}^{k,u+1}-\theta_{i}^{k,u}\right)\right\| \le \sum_{u=0}^{t-1}\|\theta_{i}^{k,u+1}-\theta_{i}^{k,u}\|\\
&\le 2B\alpha_{k}t\|\bar{\theta}^{k}\| + 2B^{2}H\alpha_{k}^{2}t + B\alpha_{k}t \le 2BH\alpha_{k}\|\bar{\theta}^{k}\| + 2B^{2}H^{2}\alpha_{k}^{2} + BH\alpha_{k}\\
&\le 2BH\alpha_{k}\|\bar{\theta}^{k}\| + 2BH\alpha_{k},
\end{align*}

where the last inequality is due to (22).

A.5 Proof of Lemma 7

We first show (28). Using (25) and (9) we have

\begin{align*}
\|\bar{\theta}^{k+1}\| - \|\bar{\theta}^{k}\| &\le \|\bar{\theta}^{k+1}-\bar{\theta}^{k}\| \le \frac{1}{N}\sum_{i=1}^{N}\sum_{t=0}^{H-1}\alpha_{k+t}\left\|F_{i}(\theta_{i}^{k,t};X_{i}^{k+t})\right\|\\
&\le \frac{1}{N}\sum_{i=1}^{N}\sum_{t=0}^{H-1}B\alpha_{k+t}\left(\|\theta_{i}^{k,t}\|+1\right) \le BH\alpha_{k}\left(\|\bar{\theta}^{k}\|+1\right) + \frac{1}{N}\sum_{i=1}^{N}\sum_{t=0}^{H-1}B\alpha_{k+t}\left\|\theta_{i}^{k,t}-\bar{\theta}^{k}\right\|,
\end{align*}

which, using (27) and $BH\alpha_{k}\le\log(2)/2$ (from (22)), gives

\begin{align}
\|\bar{\theta}^{k+1}\| - \|\bar{\theta}^{k}\| \le \|\bar{\theta}^{k+1}-\bar{\theta}^{k}\| &\le BH\alpha_{k}\left(\|\bar{\theta}^{k}\|+1\right) + BH\alpha_{k}\left(2BH\alpha_{k}\|\bar{\theta}^{k}\| + 2BH\alpha_{k}\right) \nonumber\\
&\le 2BH\alpha_{k}\|\bar{\theta}^{k}\| + 2BH\alpha_{k}. \tag{53}
\end{align}

The preceding relation yields

\begin{align*}
\|\bar{\theta}^{k+1}\| \le (1+2BH\alpha_{k})\|\bar{\theta}^{k}\| + 2BH\alpha_{k}.
\end{align*}

Using the relation $1+x\le e^{x}$ for all $x\ge 0$, the inequality above gives, for all $t\in[k-\tau(\alpha_{k}),k-1]$,

\begin{align*}
\|\bar{\theta}^{t+1}\| &\le \prod_{u=k-\tau(\alpha_{k})}^{t}(1+2BH\alpha_{u})\|\bar{\theta}^{k-\tau(\alpha_{k})}\| + 2BH\sum_{u=k-\tau(\alpha_{k})}^{t-1}\alpha_{u}\prod_{\ell=u+1}^{t}(1+2BH\alpha_{\ell})\\
&\le \exp\left\{\sum_{u=k-\tau(\alpha_{k})}^{t}2BH\alpha_{u}\right\}\|\bar{\theta}^{k-\tau(\alpha_{k})}\| + 2BH\sum_{u=k-\tau(\alpha_{k})}^{t-1}\alpha_{u}\exp\left\{\sum_{\ell=u+1}^{t}2BH\alpha_{\ell}\right\}\\
&\le 2\|\bar{\theta}^{k-\tau(\alpha_{k})}\| + 4BH\sum_{u=k-\tau(\alpha_{k})}^{t-1}\alpha_{u},
\end{align*}

where the last inequality is due to (22), i.e., $2HB\alpha_{k;\tau(\alpha_{k})}\le\log(2)$. Using the preceding relation we have from (53)

\begin{align*}
\|\bar{\theta}^{k}-\bar{\theta}^{k-\tau(\alpha_{k})}\| &\le \sum_{t=k-\tau(\alpha_{k})}^{k-1}\|\bar{\theta}^{t+1}-\bar{\theta}^{t}\| \le \sum_{t=k-\tau(\alpha_{k})}^{k-1}2BH\alpha_{t}\left(\|\bar{\theta}^{t}\|+1\right)\\
&\le 2BH\sum_{t=k-\tau(\alpha_{k})}^{k-1}\alpha_{t}\left(2\|\bar{\theta}^{k-\tau(\alpha_{k})}\| + 4BH\sum_{u=k-\tau(\alpha_{k})}^{t-1}\alpha_{u}\right) + 2BH\alpha_{k;\tau(\alpha_{k})}\\
&\le 4BH\alpha_{k;\tau(\alpha_{k})}\|\bar{\theta}^{k-\tau(\alpha_{k})}\| + 4BH\alpha_{k;\tau(\alpha_{k})},
\end{align*}

where the last inequality is due to (22). Using the preceding inequality and the triangle inequality yields

\begin{align*}
\|\bar{\theta}^{k}-\bar{\theta}^{k-\tau(\alpha_{k})}\| &\le 4BH\alpha_{k;\tau(\alpha_{k})}\|\bar{\theta}^{k}-\bar{\theta}^{k-\tau(\alpha_{k})}\| + 4BH\alpha_{k;\tau(\alpha_{k})}\|\bar{\theta}^{k}\| + 4BH\alpha_{k;\tau(\alpha_{k})}\\
&\le \frac{2}{3}\|\bar{\theta}^{k}-\bar{\theta}^{k-\tau(\alpha_{k})}\| + 4BH\alpha_{k;\tau(\alpha_{k})}\|\bar{\theta}^{k}\| + 4BH\alpha_{k;\tau(\alpha_{k})},
\end{align*}

where in the last inequality we use (22). Rearranging the inequality above yields (28)

\begin{align*}
\|\bar{\theta}^{k}-\bar{\theta}^{k-\tau(\alpha_{k})}\| \le 12BH\alpha_{k;\tau(\alpha_{k})}\|\bar{\theta}^{k}\| + 12BH\alpha_{k;\tau(\alpha_{k})} \le 2\|\bar{\theta}^{k}\| + 2.
\end{align*}

Squaring both sides of the preceding relation and using the Cauchy-Schwarz inequality yields (29)

\begin{align*}
\|\bar{\theta}^{k}-\bar{\theta}^{k-\tau(\alpha_{k})}\|^{2} \le 288B^{2}H^{2}\alpha_{k;\tau(\alpha_{k})}^{2}\|\bar{\theta}^{k}\|^{2} + 288B^{2}H^{2}\alpha_{k;\tau(\alpha_{k})}^{2} \le 8\|\bar{\theta}^{k}\|^{2} + 8.
\end{align*}

A.6 Proof of Lemma 8

Consider

\begin{align}
&-\sum_{i=1}^{N}\sum_{t=0}^{H-1}\alpha_{k+t}\left\langle \bar{\theta}^{k}-\theta^{*},\,F_{i}(\theta_{i}^{k,t};X_{i}^{k+t})-F_{i}(\bar{\theta}^{k})\right\rangle \nonumber\\
&= -\sum_{i=1}^{N}\sum_{t=0}^{H-1}\alpha_{k+t}\left\langle \bar{\theta}^{k}-\bar{\theta}^{k-\tau(\alpha_{k})},\,F_{i}(\theta_{i}^{k,t};X_{i}^{k+t})-F_{i}(\bar{\theta}^{k})\right\rangle \nonumber\\
&\qquad - \sum_{i=1}^{N}\sum_{t=0}^{H-1}\alpha_{k+t}\left\langle \bar{\theta}^{k-\tau(\alpha_{k})}-\theta^{*},\,F_{i}(\theta_{i}^{k,t};X_{i}^{k+t})-F_{i}(\bar{\theta}^{k})\right\rangle \nonumber\\
&= -\sum_{i=1}^{N}\sum_{t=0}^{H-1}\alpha_{k+t}\left\langle \bar{\theta}^{k}-\bar{\theta}^{k-\tau(\alpha_{k})},\,F_{i}(\theta_{i}^{k,t};X_{i}^{k+t})-F_{i}(\bar{\theta}^{k})\right\rangle \nonumber\\
&\quad - \sum_{i=1}^{N}\sum_{t=0}^{H-1}\alpha_{k+t}\left\langle \bar{\theta}^{k-\tau(\alpha_{k})}-\theta^{*},\,F_{i}(\theta_{i}^{k-\tau(\alpha_{k}),t};X_{i}^{k+t})-F_{i}(\theta_{i}^{k-\tau(\alpha_{k}),t})\right\rangle \nonumber\\
&\quad - \sum_{i=1}^{N}\sum_{t=0}^{H-1}\alpha_{k+t}\left\langle \bar{\theta}^{k-\tau(\alpha_{k})}-\theta^{*},\,F_{i}(\theta_{i}^{k,t};X_{i}^{k+t})-F_{i}(\theta_{i}^{k-\tau(\alpha_{k}),t};X_{i}^{k+t})\right\rangle \nonumber\\
&\quad - \sum_{i=1}^{N}\sum_{t=0}^{H-1}\alpha_{k+t}\left\langle \bar{\theta}^{k-\tau(\alpha_{k})}-\theta^{*},\,F_{i}(\theta_{i}^{k-\tau(\alpha_{k}),t})-F_{i}(\bar{\theta}^{k-\tau(\alpha_{k})})\right\rangle \nonumber\\
&\quad - \sum_{i=1}^{N}\sum_{t=0}^{H-1}\alpha_{k+t}\left\langle \bar{\theta}^{k-\tau(\alpha_{k})}-\theta^{*},\,F_{i}(\bar{\theta}^{k-\tau(\alpha_{k})})-F_{i}(\bar{\theta}^{k})\right\rangle. \tag{54}
\end{align}

We first consider the second term on the right-hand side of (54). Recall that ${\cal F}_{k}$ is the $\sigma$-field containing all the information generated by Algorithm 1 up to time $k$. Then, using (8) we have

\begin{align*}
&-\sum_{i=1}^{N}\sum_{t=0}^{H-1}\alpha_{k+t}\,\mathbb{E}\left[\left\langle \bar{\theta}^{k-\tau(\alpha_{k})}-\theta^{*},\,F_{i}(\theta_{i}^{k-\tau(\alpha_{k}),t};X_{i}^{k+t})-F_{i}(\theta_{i}^{k-\tau(\alpha_{k}),t})\right\rangle \,\Big|\, {\cal F}_{k+t-\tau(\alpha_{k})}\right]\\
&= -\sum_{i=1}^{N}\sum_{t=0}^{H-1}\alpha_{k+t}\left\langle \bar{\theta}^{k-\tau(\alpha_{k})}-\theta^{*},\,\mathbb{E}\left[F_{i}(\theta_{i}^{k-\tau(\alpha_{k}),t};X_{i}^{k+t})-F_{i}(\theta_{i}^{k-\tau(\alpha_{k}),t})\,\Big|\,{\cal F}_{k+t-\tau(\alpha_{k})}\right]\right\rangle\\
&\le \sum_{i=1}^{N}\sum_{t=0}^{H-1}\alpha_{k+t}\left\|\bar{\theta}^{k-\tau(\alpha_{k})}-\theta^{*}\right\|\left\|\mathbb{E}\left[F_{i}(\theta_{i}^{k-\tau(\alpha_{k}),t};X_{i}^{k+t})-F_{i}(\theta_{i}^{k-\tau(\alpha_{k}),t})\,\Big|\,{\cal F}_{k+t-\tau(\alpha_{k})}\right]\right\|\\
&\le \sum_{i=1}^{N}\sum_{t=0}^{H-1}\alpha_{k+t}\alpha_{k}\left\|\bar{\theta}^{k-\tau(\alpha_{k})}-\theta^{*}\right\|\left(\left\|\theta_{i}^{k-\tau(\alpha_{k}),t}\right\|+1\right) \le NH\alpha_{k}^{2}\left\|\bar{\theta}^{k-\tau(\alpha_{k})}-\theta^{*}\right\|\max_{i,t}\left(\left\|\theta_{i}^{k-\tau(\alpha_{k}),t}\right\|+1\right),
\end{align*}

which by using the triangle inequality and (26) yields

\begin{align}
&-\sum_{i=1}^{N}\sum_{t=0}^{H-1}\alpha_{k+t}\,\mathbb{E}\left[\left\langle \bar{\theta}^{k-\tau(\alpha_{k})}-\theta^{*},\,F_{i}(\theta_{i}^{k-\tau(\alpha_{k}),t};X_{i}^{k+t})-F_{i}(\theta_{i}^{k-\tau(\alpha_{k}),t})\right\rangle \,\Big|\, {\cal F}_{k+t-\tau(\alpha_{k})}\right] \nonumber\\
&\le NH\alpha_{k}^{2}\left(\|\bar{\theta}^{k}-\bar{\theta}^{k-\tau(\alpha_{k})}\| + \|\bar{\theta}^{k}-\theta^{*}\|\right)\left(2\|\bar{\theta}^{k-\tau(\alpha_{k})}\| + 2BH\alpha_{k} + 1\right) \nonumber\\
&\le NH\alpha_{k}^{2}\left(\|\bar{\theta}^{k}-\bar{\theta}^{k-\tau(\alpha_{k})}\| + \|\bar{\theta}^{k}-\theta^{*}\|\right)\left(2\|\bar{\theta}^{k}-\bar{\theta}^{k-\tau(\alpha_{k})}\| + 2\|\bar{\theta}^{k}\| + 2\right) \nonumber\\
&\le 2NH\alpha_{k}^{2}\left(2\|\bar{\theta}^{k}\| + 2 + \|\bar{\theta}^{k}-\theta^{*}\|\right)\left(3\|\bar{\theta}^{k}\| + 3\right) \nonumber\\
&\le 6NH\alpha_{k}^{2}\left(3\|\bar{\theta}^{k}-\theta^{*}\| + 2 + 2\|\theta^{*}\|\right)\left(\|\bar{\theta}^{k}-\theta^{*}\| + 1 + \|\theta^{*}\|\right) \nonumber\\
&\le 18NH\alpha_{k}^{2}\left(\|\bar{\theta}^{k}-\theta^{*}\| + 1 + \|\theta^{*}\|\right)^{2} \le 36NH\alpha_{k}^{2}\,\|\bar{\theta}^{k}-\theta^{*}\|^{2} + 36NH\alpha_{k}^{2}\left(1+\|\theta^{*}\|\right)^{2}, \tag{55}
\end{align}

where the third inequality is due to (28) and the last inequality is due to the Cauchy-Schwarz inequality. Next, we consider the third term on the right-hand side of (54). Indeed, using (6) we have

\begin{align}
&-\sum_{i=1}^{N}\sum_{t=0}^{H-1}\alpha_{k+t}\left\langle \bar{\theta}^{k-\tau(\alpha_{k})}-\theta^{*},\,F_{i}(\theta_{i}^{k,t};X_{i}^{k+t})-F_{i}(\theta_{i}^{k-\tau(\alpha_{k}),t};X_{i}^{k+t})\right\rangle \nonumber\\
&\le L\sum_{i=1}^{N}\sum_{t=0}^{H-1}\alpha_{k+t}\|\bar{\theta}^{k-\tau(\alpha_{k})}-\theta^{*}\|\,\|\theta_{i}^{k,t}-\theta_{i}^{k-\tau(\alpha_{k}),t}\| \nonumber\\
&\le L\sum_{i=1}^{N}\sum_{t=0}^{H-1}\alpha_{k+t}\|\bar{\theta}^{k-\tau(\alpha_{k})}-\theta^{*}\|\left(\|\theta_{i}^{k,t}-\bar{\theta}^{k}\| + \|\bar{\theta}^{k}-\bar{\theta}^{k-\tau(\alpha_{k})}\| + \|\bar{\theta}^{k-\tau(\alpha_{k})}-\theta_{i}^{k-\tau(\alpha_{k}),t}\|\right). \tag{56}
\end{align}

Similarly, using (6) we consider the last two terms on the right-hand side of (54)

\begin{align*}
&-\sum_{i=1}^{N}\sum_{t=0}^{H-1}\alpha_{k+t}\left(\left\langle \bar{\theta}^{k-\tau(\alpha_{k})}-\theta^{*},\,F_{i}(\theta_{i}^{k-\tau(\alpha_{k}),t})-F_{i}(\bar{\theta}^{k-\tau(\alpha_{k})})\right\rangle + \left\langle \bar{\theta}^{k-\tau(\alpha_{k})}-\theta^{*},\,F_{i}(\bar{\theta}^{k-\tau(\alpha_{k})})-F_{i}(\bar{\theta}^{k})\right\rangle\right)\\
&\le L\sum_{i=1}^{N}\sum_{t=0}^{H-1}\alpha_{k+t}\|\bar{\theta}^{k-\tau(\alpha_{k})}-\theta^{*}\|\left(\|\theta_{i}^{k-\tau(\alpha_{k}),t}-\bar{\theta}^{k-\tau(\alpha_{k})}\| + \|\bar{\theta}^{k}-\bar{\theta}^{k-\tau(\alpha_{k})}\|\right),
\end{align*}

which by adding to (56) yields

\begin{align*}
&-\sum_{i=1}^{N}\sum_{t=0}^{H-1}\alpha_{k+t}\left(\left\langle \bar{\theta}^{k-\tau(\alpha_{k})}-\theta^{*},\,F_{i}(\theta_{i}^{k-\tau(\alpha_{k}),t})-F_{i}(\bar{\theta}^{k-\tau(\alpha_{k})})\right\rangle + \left\langle \bar{\theta}^{k-\tau(\alpha_{k})}-\theta^{*},\,F_{i}(\bar{\theta}^{k-\tau(\alpha_{k})})-F_{i}(\bar{\theta}^{k})\right\rangle\right)\\
&\qquad - \sum_{i=1}^{N}\sum_{t=0}^{H-1}\alpha_{k+t}\left\langle \bar{\theta}^{k-\tau(\alpha_{k})}-\theta^{*},\,F_{i}(\theta_{i}^{k,t};X_{i}^{k+t})-F_{i}(\theta_{i}^{k-\tau(\alpha_{k}),t};X_{i}^{k+t})\right\rangle\\
&\le L\sum_{i=1}^{N}\sum_{t=0}^{H-1}\alpha_{k+t}\|\bar{\theta}^{k-\tau(\alpha_{k})}-\theta^{*}\|\left(\|\theta_{i}^{k,t}-\bar{\theta}^{k}\| + 2\|\theta_{i}^{k-\tau(\alpha_{k}),t}-\bar{\theta}^{k-\tau(\alpha_{k})}\| + 2\|\bar{\theta}^{k}-\bar{\theta}^{k-\tau(\alpha_{k})}\|\right).
\end{align*}

Using (27), (28), and the triangle inequality in the preceding relation yields

\begin{align}
&-\sum_{i=1}^{N}\sum_{t=0}^{H-1}\alpha_{k+t}\left(\left\langle \bar{\theta}^{k-\tau(\alpha_{k})}-\theta^{*},\,F_{i}(\theta_{i}^{k-\tau(\alpha_{k}),t})-F_{i}(\bar{\theta}^{k-\tau(\alpha_{k})})\right\rangle + \left\langle \bar{\theta}^{k-\tau(\alpha_{k})}-\theta^{*},\,F_{i}(\bar{\theta}^{k-\tau(\alpha_{k})})-F_{i}(\bar{\theta}^{k})\right\rangle\right) \nonumber\\
&\qquad - \sum_{i=1}^{N}\sum_{t=0}^{H-1}\alpha_{k+t}\left\langle \bar{\theta}^{k-\tau(\alpha_{k})}-\theta^{*},\,F_{i}(\theta_{i}^{k,t};X_{i}^{k+t})-F_{i}(\theta_{i}^{k-\tau(\alpha_{k}),t};X_{i}^{k+t})\right\rangle \nonumber\\
&\le L\sum_{i=1}^{N}\sum_{t=0}^{H-1}\alpha_{k+t}\left(\|\bar{\theta}^{k-\tau(\alpha_{k})}-\bar{\theta}^{k}\| + \|\bar{\theta}^{k}-\theta^{*}\|\right)\left(2BH\alpha_{k}\|\bar{\theta}^{k}\| + 2BH\alpha_{k} + 4BH\alpha_{k}\|\bar{\theta}^{k-\tau(\alpha_{k})}\| + 4BH\alpha_{k-\tau(\alpha_{k})}\right) \nonumber\\
&\qquad + L\sum_{i=1}^{N}\sum_{t=0}^{H-1}\alpha_{k+t}\left(\|\bar{\theta}^{k-\tau(\alpha_{k})}-\bar{\theta}^{k}\| + \|\bar{\theta}^{k}-\theta^{*}\|\right)\left(24BH\alpha_{k;\tau(\alpha_{k})}\|\bar{\theta}^{k}\| + 24BH\alpha_{k;\tau(\alpha_{k})}\right) \nonumber\\
&\le LNBH^{2}\alpha_{k}\left(2\|\bar{\theta}^{k}\|+2+\|\bar{\theta}^{k}-\theta^{*}\|\right)\left(26\alpha_{k;\tau(\alpha_{k})}\|\bar{\theta}^{k}\| + 30\alpha_{k;\tau(\alpha_{k})} + 4\alpha_{k}\|\bar{\theta}^{k-\tau(\alpha_{k})}-\bar{\theta}^{k}\| + 4\alpha_{k}\|\bar{\theta}^{k}\|\right) \nonumber\\
&\le LNBH^{2}\alpha_{k}\left(3\|\bar{\theta}^{k}-\theta^{*}\|+2+2\|\theta^{*}\|\right)\left(30\alpha_{k;\tau(\alpha_{k})}\|\bar{\theta}^{k}\| + 30\alpha_{k;\tau(\alpha_{k})} + 8\alpha_{k}\|\bar{\theta}^{k}\| + 8\alpha_{k}\right) \nonumber\\
&\le 38LNBH^{2}\alpha_{k}\alpha_{k;\tau(\alpha_{k})}\left(3\|\bar{\theta}^{k}-\theta^{*}\|+2+2\|\theta^{*}\|\right)\left(\|\bar{\theta}^{k}\|+1\right) \nonumber\\
&\le 114LNBH^{2}\alpha_{k}\alpha_{k;\tau(\alpha_{k})}\left(\|\bar{\theta}^{k}-\theta^{*}\|+1+\|\theta^{*}\|\right)^{2} \nonumber\\
&\le 228LNBH^{2}\alpha_{k}\alpha_{k;\tau(\alpha_{k})}\,\|\bar{\theta}^{k}-\theta^{*}\|^{2} + 228LNBH^{2}\alpha_{k}\alpha_{k;\tau(\alpha_{k})}\left(\|\theta^{*}\|+1\right)^{2}. \tag{57}
\end{align}

Finally, we consider the first term on the right-hand side of (54). Using (26), (28), and (9) we have

\begin{align}
&-\sum_{i=1}^{N}\sum_{t=0}^{H-1}\alpha_{k+t}\left\langle \bar{\theta}^{k}-\bar{\theta}^{k-\tau(\alpha_{k})},\,F_{i}(\theta_{i}^{k,t};X_{i}^{k+t})-F_{i}(\bar{\theta}^{k})\right\rangle \le B\sum_{i=1}^{N}\sum_{t=0}^{H-1}\alpha_{k+t}\left\|\bar{\theta}^{k}-\bar{\theta}^{k-\tau(\alpha_{k})}\right\|\left(\|\theta_{i}^{k,t}\| + \|\bar{\theta}^{k}\| + 2\right) \nonumber\\
&\le B\sum_{i=1}^{N}\sum_{t=0}^{H-1}\alpha_{k+t}\left(12BH\alpha_{k;\tau(\alpha_{k})}\|\bar{\theta}^{k}\| + 12BH\alpha_{k;\tau(\alpha_{k})}\right)\left(3\|\bar{\theta}^{k}\| + 3\right) \nonumber\\
&\le 36NB^{2}H^{2}\alpha_{k}\alpha_{k;\tau(\alpha_{k})}\left(\|\bar{\theta}^{k}\|+1\right)^{2} \le 36NB^{2}H^{2}\alpha_{k}\alpha_{k;\tau(\alpha_{k})}\left(\|\bar{\theta}^{k}-\theta^{*}\| + \|\theta^{*}\| + 1\right)^{2} \nonumber\\
&\le 72NB^{2}H^{2}\alpha_{k}\alpha_{k;\tau(\alpha_{k})}\,\|\bar{\theta}^{k}-\theta^{*}\|^{2} + 72NB^{2}H^{2}(1+\|\theta^{*}\|)^{2}\alpha_{k}\alpha_{k;\tau(\alpha_{k})}, \tag{58}
\end{align}

where in the third inequality we use (22) to have $2BH\alpha_{k}\le 1$. Finally, taking the expectation on both sides of (54) and using (55), (57), and (58) yields (30), i.e.,

\begin{align*}
&-\sum_{i=1}^{N}\sum_{t=0}^{H-1}\alpha_{k+t}\,\mathbb{E}\left[\left\langle \bar{\theta}^{k}-\theta^{*},\,F_{i}(\theta_{i}^{k,t};X_{i}^{k+t})-F_{i}(\bar{\theta}^{k})\right\rangle\right]\\
&\le 36NH\alpha_{k}^{2}\,\mathbb{E}\left[\|\bar{\theta}^{k}-\theta^{*}\|^{2}\right] + 36NH\alpha_{k}^{2}\left(1+\|\theta^{*}\|\right)^{2} + 228LBNH^{2}\alpha_{k}\alpha_{k;\tau(\alpha_{k})}\,\mathbb{E}\left[\|\bar{\theta}^{k}-\theta^{*}\|^{2}\right]\\
&\qquad + 228LBNH^{2}\alpha_{k}\alpha_{k;\tau(\alpha_{k})}\left(\|\theta^{*}\|+1\right)^{2} + 72NB^{2}H^{2}\alpha_{k}\alpha_{k;\tau(\alpha_{k})}\,\mathbb{E}\left[\|\bar{\theta}^{k}-\theta^{*}\|^{2}\right]\\
&\qquad + 72NB^{2}H^{2}(1+\|\theta^{*}\|)^{2}\alpha_{k}\alpha_{k;\tau(\alpha_{k})}\\
&\le 36NH\alpha_{k}^{2}\,\mathbb{E}\left[\|\bar{\theta}^{k}-\theta^{*}\|^{2}\right] + 36NH\alpha_{k}^{2}\left(1+\|\theta^{*}\|\right)^{2}\\
&\qquad + 12(19L+6B)NBH^{2}\alpha_{k}\alpha_{k;\tau(\alpha_{k})}\,\mathbb{E}\left[\|\bar{\theta}^{k}-\theta^{*}\|^{2}\right] + 12(19L+6B)NBH^{2}\left(1+\|\theta^{*}\|\right)^{2}\alpha_{k}\alpha_{k;\tau(\alpha_{k})}.
\end{align*}
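As an illustration of the time-varying step-size condition (22), the minimal Python sketch below takes the common choice $\alpha_{k}=\alpha_{0}/(k+1)$ together with a hypothetical geometric mixing time $\tau(\alpha)=\lceil C\log(1/\alpha)\rceil$ and, writing $\alpha_{k;\tau(\alpha_{k})}=\sum_{t=k-\tau(\alpha_{k})}^{k-1}\alpha_{t}$ as suggested by the proof of Lemma 7, checks numerically that $2BH\alpha_{k;\tau(\alpha_{k})}\le\log(2)$ holds for all sufficiently large $k$; all constants are illustrative assumptions, not quantities from the paper.

```python
import numpy as np

# Numerical look at the step-size condition used throughout Appendix A.4-A.6.
# Hypothetical choices: alpha_k = alpha0 / (k + 1), a geometric mixing time
# tau(alpha) = ceil(C * log(1/alpha)), and the window sum
#   alpha_{k; tau(alpha_k)} := sum_{t = k - tau(alpha_k)}^{k - 1} alpha_t.
# The check confirms 2*B*H*alpha_{k; tau(alpha_k)} <= log(2), cf. (22), for large k.
B, H, alpha0, C = 2.0, 10, 0.1, 2.0

def alpha(k):
    return alpha0 / (k + 1)

def tau(a):
    return int(np.ceil(C * np.log(1.0 / a)))

for k in [100, 1_000, 10_000, 100_000]:
    tk = tau(alpha(k))
    window = sum(alpha(t) for t in range(k - tk, k))
    print(f"k = {k:>7}  tau = {tk:3d}  2*B*H*alpha_(k;tau) = {2*B*H*window:.4f}"
          f"  (<= log 2 = {np.log(2):.4f}? {2*B*H*window <= np.log(2)})")
```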