
Non-asymptotic Convergence of Adam-type Reinforcement Learning Algorithms under Markovian Sampling

Huaqing Xiong* (Department of Electrical and Computer Engineering, The Ohio State University), Tengyu Xu* (Department of Electrical and Computer Engineering, The Ohio State University), Yingbin Liang (Department of Electrical and Computer Engineering, The Ohio State University), Wei Zhang (Department of Mechanical and Energy Engineering, Southern University of Science and Technology). *Equal contribution.
Abstract

Despite the wide application of Adam in reinforcement learning (RL), the theoretical convergence of Adam-type RL algorithms has not been established. This paper provides the first such convergence analysis for two fundamental RL algorithms, policy gradient (PG) and temporal difference (TD) learning, that incorporate AMSGrad updates (a standard alternative to Adam in theoretical analysis), referred to as PG-AMSGrad and TD-AMSGrad, respectively. Moreover, our analysis focuses on Markovian sampling for both algorithms. We show that under general nonlinear function approximation, PG-AMSGrad with a constant stepsize converges to a neighborhood of a stationary point at the rate of $\mathcal{O}(1/T)$ (where $T$ denotes the number of iterations), and with a diminishing stepsize converges exactly to a stationary point at the rate of $\mathcal{O}(\log^{2}T/\sqrt{T})$. Furthermore, under linear function approximation, TD-AMSGrad with a constant stepsize converges to a neighborhood of the global optimum at the rate of $\mathcal{O}(1/T)$, and with a diminishing stepsize converges exactly to the global optimum at the rate of $\mathcal{O}(\log T/\sqrt{T})$. Our study develops new techniques for analyzing the Adam-type RL algorithms under Markovian sampling.

1 Introduction

Reinforcement learning (RL) aims to study how an agent learns a policy through interacting with its environment to maximize the accumulative reward for a task. RL has so far achieved tremendous success in various applications such as playing video games Mnih et al., (2013), bipedal walking Castillo et al., (2019) and online advertising Pednault et al., (2002), to name a few. In general, there are two widely used classes of RL algorithms: policy-based methods and value function based methods.

For the first class, policy gradient (PG) Sutton et al., (2000) is a basic algorithm which has motivated many advanced policy-based algorithms including actor-critic Konda and Tsitsiklis, (2000), DPG Silver et al., (2014), TRPO Schulman et al., (2015), PPO Schulman et al., (2017), etc. The idea of the vanilla PG Sutton et al., (2000) is to parameterize the policy and optimize a target accumulated reward function by (stochastic) gradient descent. The asymptotic convergence and finite-time (i.e., non-asymptotic) analysis have been characterized for PG in various scenarios, which will be further discussed in Section 1.2.

For the second class of value function based algorithms, temporal difference (TD) learning Sutton, (1988) is a fundamental algorithm which has motivated more advanced algorithms such as Q-learning Watkins and Dayan, (1992), SARSA Rummery and Niranjan, (1994), etc. The vanilla TD Sutton, (1988) typically parameterizes the value function of an unknown policy and iteratively finds the true value function or its estimator by following the (projected) Bellman operation, which can be interpreted as analogous to a stochastic gradient descent (SGD) update. The asymptotic convergence and finite-time analysis have been established for TD in various scenarios, which will be discussed in Section 1.2.

Despite extensive exploration, all the existing theoretical studies of PG and TD have focused on algorithms that adopt SGD-type updates without adaptation of the stepsize. In practice, however, the adaptive momentum estimation (Adam) method Kingma and Ba, (2015) has been more commonly used in RL Bello et al., (2017); Stooke and Abbeel, (2018). It is so far unclear whether RL algorithms that incorporate Adam-type updates have provable convergence guarantees. The goal of this paper is to provide the first non-asymptotic analysis that characterizes the convergence rate of the Adam-type PG and TD algorithms. In addition, we develop new technical tools to analyze the Adam-type algorithms under Markovian sampling. Such analysis is not available in the existing studies of the Adam-type algorithms in optimization, which usually assume independent and identically distributed (i.i.d.) sampling.

1.1 Our Contribution

The main contribution of the paper lies in establishing the first non-asymptotic analysis for the Adam-type PG and TD algorithms under the Markovian sampling model.

From the technical standpoint, although the existing analysis of the Adam-type algorithms in conventional optimization may provide useful tools, the analysis of such algorithms in RL problems poses new challenges. (1) Samples in TD learning are drawn from an unknown transition kernel in a non-i.i.d. manner, and hence a bias arises in the gradient estimation. Such a bias also depends on the AMSGrad updates, which further complicates the analysis. We develop techniques to bound the bias by the stepsize even under adaptive momentum updates. (2) The sampling process in PG (with a nonlinear approximation architecture) is not only non-i.i.d., but also follows time-varying parametric policies. Thus, it is even more challenging to bound such a bias under the AMSGrad updates. In response to this, we develop new techniques to characterize the bias error, which can also be controlled by the stepsize.

As for the results, we provide the first convergence guarantee for RL algorithms (including PG and TD) incorporating the update rule of AMSGrad (referred to as PG-AMSGrad and TD-AMSGrad, respectively), and our techniques also lead to an improved result for the vanilla PG. (1) First, we show that under general nonlinear function approximation, PG-AMSGrad with a constant stepsize (the "stepsize" here refers to the basic stepsize $\alpha$ in the AMSGrad update (4); the overall learning rate of the algorithm is determined by $\alpha$, the hyperparameters $\beta_{1}$ and $\beta_{2}$, and the first and second moments of the gradients as given in (1)-(4), and is hence adaptive over the AMSGrad iterations) converges to a neighborhood of a stationary point at a rate of $\mathcal{O}(1/T)$ (where $T$ denotes the number of iterations), and with a diminishing stepsize converges exactly to a stationary point at a rate of $\mathcal{O}(\log^{2}T/\sqrt{T})$. (2) Furthermore, under linear function approximation, TD-AMSGrad with a constant stepsize converges to a neighborhood of the global optimum at a rate of $\mathcal{O}(1/T)$, and with a diminishing stepsize converges exactly to the global optimum at a rate of $\mathcal{O}(\log T/\sqrt{T})$. (3) In addition, by adapting our technical tools to analyze the vanilla PG with the SGD update under Markovian sampling, we obtain an orderwise better computational complexity than the existing works.

1.2 Related Work

Due to the rapidly growing theoretical studies on RL, we only review the most relevant studies below.

Convergence analysis of PG: Asymptotic convergence of PG based on stochastic approximation (SA) was established in Williams, (1992); Baxter and Bartlett, (2001); Sutton et al., (2000); Kakade, (2002); Pirotta et al., (2015); Tadić et al., (2017). In specific RL problems such as LQR, PG has been proved to converge to the optimal policy Fazel et al., (2018); Malik et al., (2019); Tu and Recht, (2019). Under specific policy function approximation classes, Bhandari and Russo, (2019); Agarwal et al., (2019) also showed that PG can find the optimal policy. Under general nonconvex approximation, Shen et al., (2019); Papini et al., (2018, 2019); Xu et al., 2019a; Xu et al., (2020) characterized the convergence rate of PG and variance-reduced PG to a stationary point in the finite-horizon scenario, and Zhang et al., (2019); Karimi et al., (2019) provided the convergence rate of PG in the infinite-horizon scenario. Wang et al., (2020) studied PG with neural network function approximation in an overparameterized regime. Convergence analysis has also been established for variants of PG, such as actor-critic algorithms (Bhatnagar et al., 2008, 2009; Bhatnagar, 2010; Luo et al., 2019; Yang et al., 2019; Fu et al., 2020), TRPO (Shani et al., 2020), TRPO/PPO (Liu et al., 2019), etc. This paper studies the infinite-horizon scenario, but focuses on the Adam-type PG, which has not been studied in the literature. Our analysis is also applicable to other variants of PG.

Convergence analysis of TD: Originally proposed in Sutton, (1988), TD learning with function approximation aroused great interest in analyzing its convergence. While a general TD may not converge as pointed out in Baird, (1995); Györfi and Walk, (1996), the authors in Tsitsiklis and Van Roy, (1997) provided conditions to ensure asymptotic convergence of TD with linear function approximation under i.i.d. sampling. Other results on asymptotic convergence using the tools from linear SA were provided in Kushner and Yin, (2003); Benveniste et al., (2012). Non-asymptotic convergence was established for TD under i.i.d. sampling in, e.g., Dalal et al., (2018); Bhandari et al., (2018); Lakshminarayanan and Szepesvari, (2018), and under Markovian sampling in, e.g., Bhandari et al., (2018); Srikant and Ying, (2019); Hu and Syed, (2019). The convergence rate of TD with nonlinear function approximation has recently been studied in Cai et al., (2019) for overparameterized neural networks using i.i.d. samples. In contrast to the aforementioned work on TD with the SGD updates, this paper studies Adam-type TD under Markovian sampling.

Adaptive reinforcement learning algorithms: Adaptivity has been applied to RL algorithms to improve their performance. (Shani et al., 2020) used an adaptive proximity term to study the convergence of TRPO. An adaptive batch size was adopted to improve the policy performance (Papini et al., 2017) and reduce the variance (Ji et al., 2019) of PG. The aforementioned papers did not study how adaptive learning rates affect the performance of PG or TD. More recently, a concurrent work (Sun et al., 2020), which was posted a few days after our paper, provided an analysis of TD(0) and TD($\lambda$) when incorporating adaptive gradient descent (AdaGrad) updates. In contrast, this paper focuses on a more popular class of adaptive algorithms, namely Adam-type methods, and provides the first convergence guarantee when they are applied to PG and TD.

Convergence analysis of Adam-type algorithms in conventional optimization: Adam was proposed in Kingma and Ba, (2015) for speeding up the training of deep neural networks, but the vanilla Adam was shown not to converge in Reddi et al., (2018). Instead, AMSGrad was proposed as a slightly modified version to justify the theoretical performance of Adam, and its regret bounds were characterized in Reddi et al., (2018); Tran and Phong, (2019) for online convex optimization. Recently, Adam/AMSGrad was proved to converge to a stationary point for certain general nonconvex optimization problems Zou et al., 2019a; Zhou et al., (2018); Chen et al., 2019a. Our study provides the first convergence guarantee for the Adam-type algorithms in the RL setting, where Markovian sampling poses the key difference and challenge in our analysis compared with conventional optimization.

2 Preliminary

In this section, we provide the necessary background for the problems that we study in this paper.

2.1 Markov Decision Process

We consider the standard RL setting, where an agent interacts with a (possibly stochastic) environment (e.g., a process or system dynamics). This interaction is usually modeled as a discrete-time discounted Markov decision process (MDP), described by a tuple $(\mathcal{S},\mathcal{A},\mathbb{P},R,\gamma,\zeta)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, and $\mathbb{P}:\mathcal{S}\times\mathcal{A}\times\mathcal{S}\mapsto[0,1]$ is the probability kernel for the state transitions, e.g., $\mathbb{P}(\cdot|s,a)$ denotes the probability distribution of the next state given the current state $s$ and action $a$. In addition, $R:\mathcal{S}\times\mathcal{A}\mapsto[0,R_{\max}]$ is the reward function mapping state-action pairs to a bounded subset of $\mathbb{R}$, $\gamma\in(0,1)$ is the discount factor, and $\zeta$ denotes the initial state distribution. The agent's decision is captured by the policy $\pi:=\pi(\cdot|s)$, which characterizes the density function over the action space $\mathcal{A}$ at the state $s\in\mathcal{S}$. We denote $\nu:=\nu_{\pi}$ as the stationary distribution of the transition kernel $\mathbb{P}$ for a given $\pi$. In addition, we define the $\gamma$-discounted stationary visitation distribution of the policy $\pi$ as $\mu_{\pi}(s)=\sum_{t=1}^{\infty}\gamma^{t}P_{\zeta,\pi}(s_{t}=s)$. Further, we denote $\mu_{\pi}(s,a)=\mu_{\pi}(s)\pi(a|s)$ as the (discounted) state-action visitation distribution.

In this paper, we assume that the state space $\mathcal{S}$ is countably infinite and the action space $\mathcal{A}$ is finite with possibly large cardinality $|\mathcal{A}|$. Hence, we estimate the policy and the value function corresponding to a certain unknown policy by parameterized function classes, as introduced in Sections 3.1 and 4.1, respectively.

2.2 Update Rule of AMSGrad

Although Adam Kingma and Ba, (2015) has gained great success in practice since being proposed, it is shown not to converge even in the simple convex setting Reddi et al., (2018). Instead, a slightly modified version called AMSGrad Reddi et al., (2018) is widely used to understand the success of adaptive momentum optimization algorithms. Given a gradient $g_{t}$ at time $t$, the generic form of AMSGrad is given by

$m_{t}=(1-\beta_{1})m_{t-1}+\beta_{1}g_{t};$  (1)
$v_{t}=(1-\beta_{2})\hat{v}_{t-1}+\beta_{2}g_{t}^{2};$  (2)
$\hat{v}_{t}=\max(\hat{v}_{t-1},v_{t}),\ \hat{V}_{t}=\mathrm{diag}(\hat{v}_{t,1},\dots,\hat{v}_{t,d});$  (3)
$\theta_{t+1}=\theta_{t}-\alpha_{t}\hat{V}_{t}^{-\frac{1}{2}}m_{t},$  (4)

where $\alpha_{t}$ is the stepsize, and $\beta_{1},\beta_{2}$ are two hyper-parameters. In addition, $m_{t}$ and $v_{t}$ given in (1) and (2) are viewed as estimates of the first moment and second moment of the gradient, respectively, which play important roles in adapting the learning rate in (4). Compared to Adam, the main difference of AMSGrad lies in (3), which guarantees the sequence $\hat{v}_{t}$ to be non-decreasing, whereas Adam does not enforce this. The lack of this monotonicity is considered a central reason for the non-convergent behavior of Adam Reddi et al., (2018); Chen et al., 2019a.
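To make the update concrete, below is a minimal NumPy sketch of one AMSGrad step following (1)-(4). The function name, the default hyper-parameter values, and the small constant eps added for numerical stability are our own illustrative choices and are not part of the analysis in this paper.

import numpy as np

def amsgrad_step(theta, g, m, v_hat, alpha, beta1=0.1, beta2=0.01, eps=1e-12):
    # One AMSGrad step following (1)-(4) in the paper's convention.
    m = (1 - beta1) * m + beta1 * g                      # (1) first-moment estimate
    v = (1 - beta2) * v_hat + beta2 * g ** 2             # (2) second-moment estimate
    v_hat = np.maximum(v_hat, v)                         # (3) keep v_hat non-decreasing
    theta = theta - alpha * m / (np.sqrt(v_hat) + eps)   # (4) adaptive update
    return theta, m, v_hat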

2.3 Notations

We use $\|x\|:=\|x\|_{2}=\sqrt{x^{T}x}$ to denote the $\ell_{2}$-norm of a vector $x$, and use $\|x\|_{\infty}=\max_{i}|x_{i}|$ to denote the infinity norm. When $x,y$ are both vectors, $x/y$, $xy$, $x^{2}$, $\sqrt{x}$ are all calculated element-wise, as used in the update of AMSGrad. We denote $[n]=\{1,2,\dots,n\}$, and $\lceil x\rceil\in\mathbb{Z}$ as the integer such that $\lceil x\rceil-1\leq x<\lceil x\rceil$.

3 Convergence of PG-AMSGrad under Markovian Sampling

In this section, we study the convergence of an Adam-type policy gradient algorithm (PG-AMSGrad) with nonlinear function approximation and under non-i.i.d. sampling.

3.1 Policy Gradient and PG-AMSGrad

Consider a reinforcement learning problem which aims to find a policy that maximizes the expected accumulative reward. We assume that the policy is parameterized by $\theta\in\mathbb{R}^{d}$, forming a policy class $\Pi:=\{\pi_{\theta}\,|\,\theta\in\mathbb{R}^{d}\}$, which in general is a nonlinear function class. The policy gradient method is usually used to solve the following infinite-horizon optimization problem:

$\underset{\theta\in\mathbb{R}^{d}}{\text{maximize}}\quad J(\theta)=\mathbb{E}\left[\sum_{t=1}^{\infty}\gamma^{t}R(s_{t},a_{t})\right].$  (5)

The gradient of $J(\theta)$ with respect to $\theta$ is captured by the policy gradient theorem for infinite-horizon MDPs with the discounted reward Sutton et al., (2000), and is given by

$\nabla_{\theta}J(\theta)=\underset{\mu_{\theta}}{\mathbb{E}}\left[Q^{\pi_{\theta}}(s,a)\nabla_{\theta}\log(\pi_{\theta}(a|s))\right],$  (6)

where the expectation is taken over the discounted state-action visitation distribution $\mu_{\theta}:=\mu_{\pi_{\theta}}(s,a)$, and $Q^{\pi_{\theta}}(s,a)$ denotes the Q-function for an initial state-action pair $(s,a)$ defined as

$Q^{\pi_{\theta}}(s,a)=\mathbb{E}\left[\sum_{t=1}^{\infty}\gamma^{t}R(s_{t},a_{t})\,\Big|\,s_{1}=s,a_{1}=a\right].$

In addition, we refer to $\nabla_{\theta}\log\pi_{\theta}(a|s)$ as the score function corresponding to the policy $\pi_{\theta}$.

Since the transition probability is unknown, the policy gradient in (6) needs to be estimated via sampling. The Q-function $Q^{\pi_{\theta}}(s,a)$ and the score function are typically estimated by independent samples. First, at each time $t$, we draw a sample trajectory to provide an estimated Q-function $\hat{Q}^{\pi_{\theta}}(s,a)$ based on the algorithm EstQ Zhang et al., (2019) (see Algorithm 3 in Appendix A for details). Such an estimator has been shown to be unbiased Zhang et al., (2019). That is, if we use $O^{q}$ to denote the randomness including the samples and horizon in EstQ, then we have

$\underset{O^{q}}{\mathbb{E}}\,\hat{Q}^{\pi_{\theta}}(s,a)=Q^{\pi_{\theta}}(s,a),\quad\forall(s,a).$  (7)

Second, we take samples $\{(s_{t},a_{t})\}$ to estimate the score function by following the policy $\pi_{\theta_{t}}$ and the transition function proposed in Konda, (2002), which is given by

$\hat{P}(\cdot|s_{t},a_{t})=\gamma\mathbb{P}(\cdot|s_{t},a_{t})+(1-\gamma)\zeta(\cdot),$

where $\zeta(\cdot)$ denotes the initial distribution and $\mathbb{P}$ is the transition probability of the original MDP. It has been shown in Konda, (2002) that this transition kernel guarantees that the induced Markov chain converges to the state-action visitation distribution.
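As a minimal sketch of this sampling scheme (assuming user-provided samplers env_step for $\mathbb{P}(\cdot|s,a)$ and sample_initial for $\zeta(\cdot)$; these are hypothetical names introduced only for illustration), the next state is drawn from the original kernel with probability $\gamma$ and reset from the initial distribution otherwise:

import random

def sample_next_state_phat(s, a, gamma, env_step, sample_initial):
    # Draw s' ~ P_hat(.|s,a) = gamma * P(.|s,a) + (1 - gamma) * zeta(.)
    if random.random() < gamma:
        return env_step(s, a)      # follow the original transition kernel P
    return sample_initial()        # restart from the initial distribution zeta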

Then, the gradient estimator to approximate $\nabla_{\theta}J(\theta)$ at time $t$ is given by

$g_{t}:=g(\theta_{t};s_{t},a_{t})=\hat{Q}^{\pi_{\theta_{t}}}(s_{t},a_{t})\nabla_{\theta_{t}}\log(\pi_{\theta_{t}}(a_{t}|s_{t})).$  (8)

We then apply such a gradient estimator $g_{t}$ to update the policy parameter by the AMSGrad update given in (1)-(4), and obtain PG-AMSGrad as in Algorithm 1.

Algorithm 1 PG-AMSGrad
1:  Input: $\alpha,\theta_{1},\beta_{1},\beta_{2},m_{0}=0,\hat{v}_{0}=0,t=1,s_{1}\sim\zeta(\cdot),a_{1}\sim\pi_{\theta_{1}}(\cdot|s_{1})$.
2:  while not converged do
3:     Assign stepsize $\alpha_{t}$.
4:     Obtain $\hat{Q}^{\pi_{\theta_{t}}}(s_{t},a_{t})\leftarrow\text{EstQ}(s_{t},a_{t},\theta_{t})$.
5:     Compute $g_{t}=\hat{Q}^{\pi_{\theta_{t}}}(s_{t},a_{t})\nabla_{\theta_{t}}\log(\pi_{\theta_{t}}(a_{t}|s_{t}))$.
6:     $m_{t}=(1-\beta_{1})m_{t-1}+\beta_{1}g_{t}$.
7:     $v_{t}=(1-\beta_{2})\hat{v}_{t-1}+\beta_{2}g_{t}^{2}$.
8:     $\hat{v}_{t}=\max(\hat{v}_{t-1},v_{t}),\ \hat{V}_{t}=\mathrm{diag}(\hat{v}_{t,1},\dots,\hat{v}_{t,d})$.
9:     $\theta_{t+1}=\theta_{t}-\alpha_{t}\hat{V}_{t}^{-\frac{1}{2}}m_{t}$.
10:     $t\leftarrow t+1$.
11:     Sample $s_{t}\sim\hat{P}(\cdot|s_{t-1},a_{t-1}),\ a_{t}\sim\pi_{\theta_{t}}(\cdot|s_{t})$.
12:  end while

We note that the gradient estimator obtained in (8) is biased, because the score function is estimated by a sequence of Markovian samples. We will show that such a biased gradient estimator is in fact computationally more efficient than the unbiased estimator used in the existing literature Zhang et al., (2019) (see Section 3.4). Our main technical novelty here lies in developing techniques to analyze the biased estimator under the AMSGrad update for PG.
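For concreteness, a schematic Python sketch of the main loop of Algorithm 1 is given below. The routines est_q (the EstQ subroutine), score_fn (returning $\nabla_{\theta}\log\pi_{\theta}(a|s)$), sample_action, and sample_next_state are assumed to be supplied by the user and are hypothetical names; the constant stepsize is only one of the choices analyzed in Section 3.3.

import numpy as np

def pg_amsgrad(theta, est_q, score_fn, sample_action, sample_next_state, s1,
               alpha=1e-2, beta1=0.1, beta2=0.01, T=1000, eps=1e-12):
    # Schematic PG-AMSGrad loop (Algorithm 1); all samplers are assumed given.
    m, v_hat = np.zeros_like(theta), np.zeros_like(theta)
    s = s1
    a = sample_action(theta, s)
    for t in range(T):
        q_hat = est_q(s, a, theta)                 # unbiased Q-function estimate (EstQ)
        g = q_hat * score_fn(theta, s, a)          # gradient estimator (8)
        m = (1 - beta1) * m + beta1 * g            # (1)
        v = (1 - beta2) * v_hat + beta2 * g ** 2   # (2)
        v_hat = np.maximum(v_hat, v)               # (3)
        theta = theta - alpha * m / (np.sqrt(v_hat) + eps)  # (4)
        s = sample_next_state(s, a)                # Markovian sampling via P_hat
        a = sample_action(theta, s)
    return theta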

3.2 Technical Assumptions

In the following, we specify some technical assumptions in our convergence analysis.

We consider a general class of parameterized policy functions that satisfy the following assumption.

Assumption 1.

Assume that the parameterized policy $\pi_{\theta}$ is differentiable with respect to $\theta$, and the score function $\nabla_{\theta}\log\pi_{\theta}(a|s)$ corresponding to $\pi_{\theta}(\cdot|s)$ exists. In addition, we assume both the policy function and the score function are Lipschitz continuous with the parameters $L_{\pi}$ and $L$, respectively, i.e., for all $\theta_{1},\theta_{2}\in\mathbb{R}^{d}$,

$|\pi_{\theta_{1}}(a|s)-\pi_{\theta_{2}}(a|s)|\leq L_{\pi}\left\lVert\theta_{1}-\theta_{2}\right\rVert;$
$\left\lVert\nabla_{\theta_{1}}\log(\pi_{\theta_{1}}(a|s))-\nabla_{\theta_{2}}\log(\pi_{\theta_{2}}(a|s))\right\rVert\leq L\left\lVert\theta_{1}-\theta_{2}\right\rVert.$

Further, we also assume that the score function is uniformly bounded by $c_{\Theta}$ for any $(s,a)\in\mathcal{S}\times\mathcal{A}$, i.e.,

$\left\lVert\nabla_{\theta}\log(\pi_{\theta}(a|s))\right\rVert\leq c_{\Theta},\quad\forall\theta.$

This assumption is standard in the existing literature that studies PG with nonconvex function approximation Zhang et al., (2019); Xu et al., 2019a ; Papini et al., (2018).
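For instance, the commonly used softmax policy with uniformly bounded features satisfies Assumption 1; this example (with the feature map $\psi$ and bound $c_{\psi}$ introduced only for illustration) is not required by our analysis. For $\|\psi(s,a)\|\leq c_{\psi}$,

$\pi_{\theta}(a|s)=\frac{\exp(\theta^{T}\psi(s,a))}{\sum_{a^{\prime}\in\mathcal{A}}\exp(\theta^{T}\psi(s,a^{\prime}))},\qquad\nabla_{\theta}\log\pi_{\theta}(a|s)=\psi(s,a)-\sum_{a^{\prime}\in\mathcal{A}}\pi_{\theta}(a^{\prime}|s)\psi(s,a^{\prime}),$

so the score function is bounded by $2c_{\psi}$, and both the policy and the score function are Lipschitz continuous in $\theta$.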

In Algorithm 1, we sample a data trajectory using the transition kernel $\hat{P}$ and the policy $\pi_{\theta_{t}}$. Such a sequence of samples is non-i.i.d. and follows a Markovian distribution. We assume that the MDP and the policies we consider satisfy the following standard mixing property.

Assumption 2.

For any $\theta\in\mathbb{R}^{d}$, there exist constants $\sigma>0$ and $\rho\in(0,1)$ such that

$\underset{s\in\mathcal{S}}{\sup}\ \left\|P(s_{t}\in\cdot|s_{1}=s)-\mu_{\theta}(\cdot)\right\|_{TV}\leq\sigma\rho^{t}\quad\forall t,$

where $\left\|\mu_{1}-\mu_{2}\right\|_{TV}$ denotes the total-variation norm (i.e., the total-variation distance between two probability measures $\mu_{1}$ and $\mu_{2}$).

This assumption holds for irreducible and aperiodic Markov chains Mitrophanov, (2005), and is widely adopted in the theoretical analysis of RL algorithms under Markovian sampling settings Bhandari et al., (2018); Chen et al., 2019b ; Zou et al., 2019b ; Karimi et al., (2019).

3.3 Convergence of PG-AMSGrad

In this section, we provide the convergence analysis of PG-AMSGrad as given in Algorithm 1. We first consider the case with a constant stepsize, and then provide the result with a diminishing stepsize.

Although AMSGrad has been studied in conventional optimization, our analysis of PG-AMSGrad mainly deals with the following new challenges arising in RL. First, samples here are generated via an MDP and distributed in a non-i.i.d. fashion. Thus the gradient estimator is biased and we need to control the bias with a certain upper bound scaled by the stepsize. Second, the sampling distribution also changes over time, which causes additional complication. Thus, our technical development mainly handles the above two challenges under the adaptive momentum update rule of AMSGrad. We next provide the convergence results that we obtain and relegate all the proofs to the appendices.

We first provide the Lipschitz properties for the true policy gradient and its estimator, which are useful for establishing the convergence. Recall that in Algorithm 1, the gradient estimator $g_{t}=\hat{Q}^{\pi_{\theta_{t}}}(s_{t},a_{t})\nabla_{\theta_{t}}\log(\pi_{\theta_{t}}(a_{t}|s_{t}))$ at time $t$ is obtained by using the Q-function estimator generated by the EstQ algorithm (see Appendix A). Note that $\hat{Q}^{\pi_{\theta}}(s,a)$ is an unbiased estimator of $Q^{\pi_{\theta}}(s,a)$ for all $(s,a)$ Zhang et al., (2019), and the samples used for this estimation are independent of those used in the other steps of PG-AMSGrad except the initial sample. Taking expectation over the randomness in EstQ at time $t$ (denoted as $O^{q}_{t}$), we obtain an estimator $\nabla_{\theta_{t}}\tilde{J}(\theta_{t};s_{t},a_{t})$ defined as

$\nabla_{\theta_{t}}\tilde{J}(\theta_{t};s_{t},a_{t}):=\underset{O^{q}_{t}}{\mathbb{E}}[g_{t}]=\underset{O^{q}_{t}}{\mathbb{E}}\left[\hat{Q}^{\pi_{\theta_{t}}}(s_{t},a_{t})\nabla_{\theta_{t}}\log(\pi_{\theta_{t}}(a_{t}|s_{t}))\right]=Q^{\pi_{\theta_{t}}}(s_{t},a_{t})\nabla_{\theta_{t}}\log(\pi_{\theta_{t}}(a_{t}|s_{t})).$  (9)

We next obtain the Lipschitz properties of $\nabla_{\theta}\tilde{J}(\theta;s,a)$ and $\nabla_{\theta}J(\theta)$ in the following lemma.

Lemma 1.

(Lipschitz property of policy gradient) Under Assumptions 1 and 2, the policy gradient $\nabla_{\theta}J(\theta)$ defined in (6) is Lipschitz continuous with the parameter $c_{J}$, i.e., $\forall\theta_{1},\theta_{2}\in\mathbb{R}^{d}$,

$\left\lVert\nabla_{\theta_{1}}J(\theta_{1})-\nabla_{\theta_{2}}J(\theta_{2})\right\rVert\leq c_{J}\left\lVert\theta_{1}-\theta_{2}\right\rVert,$  (10)

where the constant coefficient $c_{J}=\frac{R_{\max}L}{1-\gamma}+\frac{(1+c_{\Theta})R_{\max}}{1-\gamma}\cdot|\mathcal{A}|L_{\pi}\left(1+\lceil\log_{\rho}\sigma^{-1}\rceil+\frac{1}{1-\rho}\right)$. Further, the policy gradient estimator $\nabla_{\theta}\tilde{J}(\theta;s,a)$ defined in (9) is also Lipschitz continuous with the parameter $c_{\tilde{J}}$, i.e., $\forall\theta_{1},\theta_{2}\in\mathbb{R}^{d},\forall(s,a)\in\mathcal{S}\times\mathcal{A}$,

$\left\lVert\nabla_{\theta_{1}}\tilde{J}(\theta_{1};s,a)-\nabla_{\theta_{2}}\tilde{J}(\theta_{2};s,a)\right\rVert\leq c_{\tilde{J}}\left\lVert\theta_{1}-\theta_{2}\right\rVert,$  (11)

where $c_{\tilde{J}}=\frac{R_{\max}L}{1-\gamma}+c_{\Theta}|\mathcal{A}|L_{\pi}\left(1+\lceil\log_{\rho}\sigma^{-1}\rceil+\frac{1}{1-\rho}\right)$.

The following theorem characterizes the convergence of PG-AMSGrad with a constant stepsize. Recall that the stepsize refers to the parameter $\alpha$ in the AMSGrad update (4), not the overall learning rate of the algorithm.

Theorem 1.

(Convergence of PG-AMSGrad with constant stepsize) Fix $\beta_{1},\beta_{2}$ in Algorithm 1. Initialize Algorithm 1 such that $|g_{1,i}|\geq G_{0}$ for all $i\in[d]$ with some $G_{0}>0$. Suppose Assumptions 1 and 2 hold. Let $\alpha_{t}=\alpha$ for $t=1,\dots,T$. Then after running $T$ steps of PG-AMSGrad as given in Algorithm 1, we have:

$\underset{t\in[T]}{\min}\,\mathbb{E}\left[\left\lVert\nabla_{\theta_{t}}J(\theta_{t})\right\rVert^{2}\right]\leq\frac{C_{1}}{T}+\alpha C_{2},$

where

$C_{1}=\frac{G_{\infty}\mathbb{E}[J(z_{1})]}{\alpha}+\frac{dG_{\infty}^{3}}{G_{0}(1-\beta_{1})}+\frac{2G_{\infty}\tau^{*}}{G_{0}}G_{\infty}^{2}+\frac{dc_{J}\alpha G_{\infty}(3\beta_{1}^{2}+2(1-\beta_{1})(1-\beta_{1}/\beta_{2}))}{(1-\beta_{1})(1-\beta_{2})(1-\beta_{1}/\beta_{2})},$
$C_{2}=\frac{G_{\infty}^{3}}{G_{0}}\left[\frac{(3c_{J}+c_{\tilde{J}})\tau^{*}}{G_{0}}+d+\frac{dL_{\pi}G_{\infty}(2\tau^{*}+(\tau^{*})^{2})}{2G_{0}}\right]$

with $c_{J},c_{\tilde{J}}$ defined in Lemma 1, $\tau^{*}=\min\{\tau:m\rho^{\tau}\leq\alpha\}$ and $G_{\infty}=\frac{c_{\Theta}R_{\max}}{1-\gamma}$.

Theorem 1 indicates that under the constant stepsize, PG-AMSGrad converges to a neighborhood of a stationary point at a rate of $\mathcal{O}\left(\frac{1}{T}\right)$. The size of the neighborhood can be controlled by the stepsize $\alpha$. One can observe that $\alpha$ controls a tradeoff between the convergence rate and the convergence accuracy. Decreasing $\alpha$ improves the convergence accuracy, but slows down the convergence, since the coefficient $C_{1}$ contains $\alpha$ in the denominator. To balance such a tradeoff, we set the stepsize $\alpha_{t}=\frac{1}{\sqrt{T}}$. In this case, the mixing time becomes $\tau^{*}=\mathcal{O}(\log T)$ and thus PG-AMSGrad converges to a stationary point with a rate of $\mathcal{O}\left(\frac{\log^{2}T}{\sqrt{T}}\right)$.

In the following, we adopt a diminishing stepsize to eliminate the convergence error and obtain the exact convergence.

Theorem 2.

(Convergence of PG-AMSGrad with diminishing stepsize) Suppose the same conditions of Theorem 1 hold, and let $\alpha_{t}=\frac{\alpha}{\sqrt{t}}$ for $t=1,\dots,T$. Then after running $T$ steps of PG-AMSGrad as given in Algorithm 1, we have:

$\underset{t\in[T]}{\min}\,\mathbb{E}\left[\left\lVert\nabla_{\theta_{t}}J(\theta_{t})\right\rVert^{2}\right]\leq\frac{C_{1}}{T}+\frac{C_{2}}{\sqrt{T}},$

where

$C_{1}=\frac{f_{1}G_{\infty}}{\alpha}+\frac{2dc_{J}\alpha G_{\infty}}{1-\beta_{2}}+\frac{2\tau^{*}G_{\infty}}{G_{0}}G_{\infty}^{2}+\frac{3dc_{J}\beta_{1}^{2}\alpha G_{\infty}}{(1-\beta_{1})(1-\beta_{2})(1-\beta_{1}/\beta_{2})}$
$C_{2}=\frac{R_{\max}G_{\infty}}{\alpha(1-\gamma)}+\frac{dG_{\infty}^{3}}{G_{0}(1-\beta_{1})}+\frac{\alpha G_{\infty}^{3}}{G_{0}}\left[\frac{2(3c_{J}+c_{\tilde{J}})\tau^{*}}{G_{0}}+d\left(1+\frac{L_{\pi}G_{\infty}(\tau^{*}+(\tau^{*})^{2})}{G_{0}}\right)\right]$

with $c_{J},c_{\tilde{J}}$ defined in Lemma 1, $\tau^{*}=\min\{\tau:m\rho^{\tau}\leq\alpha_{T}=\frac{\alpha}{\sqrt{T}}\}$ and $G_{\infty}=\frac{c_{\Theta}R_{\max}}{1-\gamma}$.

Under the diminishing stepsize, PG-AMSGrad can converge exactly to a stationary point. Since $\tau^{*}=\mathcal{O}(\log T)$, the convergence rate is given by $\mathcal{O}\left(\frac{\log^{2}T}{\sqrt{T}}\right)$.

Theorems 1 and 2 indicate that under both a constant stepsize and diminishing stepsize, PG-AMSGrad finds a stationary point efficiently with guaranteed convergence. However, there is a tradeoff between the convergence rate and the convergence accuracy. With a constant stepsize, PG-AMSGrad can converge faster but only to a neighborhood of a stationary point whose size is controlled by the stepsize, whereas a diminishing stepsize yields a better convergence accuracy, but attains a lower convergence rate.

3.4 Implication on Vanilla PG under Markovian Data

Although our focus in this paper is on the Adam-type PG, slight modification of our analysis yields the convergence rate of the vanilla PG under infinite horizon Markovian sampling. In the following, we first present such results and then compare them with two recent studies Zhang et al., (2019); Karimi et al., (2019) on the same model to illustrate the novelty of our analysis and benefit of our sampling strategy.

We consider the vanilla PG algorithm that uses the same gradient estimator and sampling strategy as those of Algorithm 1, but adopts the SGD update (i.e., $\theta_{t+1}=\theta_{t}-\alpha_{t}g_{t}$) rather than the AMSGrad update. We refer to such an algorithm as PG-SGD. The following proposition characterizes the convergence rate of PG-SGD.

Proposition 1.

Suppose Assumptions 1 and 2 hold. After running $T$ steps of PG-SGD with a constant stepsize $\alpha_{t}=\alpha$ for $t=1,\dots,T$, we have:

$\underset{t\in[T]}{\min}\,\mathbb{E}\left[\left\lVert\nabla_{\theta_{t}}J(\theta_{t})\right\rVert^{2}\right]\leq\frac{J(\theta_{1})/\alpha+2G_{\infty}^{2}\tau^{*}}{T}+\alpha C_{1},$

where

$C_{1}=G_{\infty}^{2}\left[\frac{c_{J}}{2}+(3c_{J}+c_{\tilde{J}})\tau^{*}+1+\frac{L_{\pi}G_{\infty}(2\tau^{*}+(\tau^{*})^{2})}{2}\right],$

with $c_{J},c_{\tilde{J}}$ defined in Lemma 1, $\tau^{*}=\min\{\tau:m\rho^{\tau}\leq\alpha\}$ and $G_{\infty}=\frac{c_{\Theta}R_{\max}}{1-\gamma}$. Furthermore, after running $T$ steps of PG-SGD with a diminishing stepsize $\alpha_{t}=\frac{1-\gamma}{\sqrt{t}}$ for $t=1,\dots,T$, we have:

$\underset{t\in[T]}{\min}\,\mathbb{E}\left[\left\lVert\nabla_{\theta_{t}}J(\theta_{t})\right\rVert^{2}\right]\leq\frac{J(\theta_{1})+2(1-\gamma)^{2}G_{\infty}^{2}\tau^{*}}{(1-\gamma)T}+\frac{C_{2}}{(1-\gamma)^{2}\sqrt{T}},$

where

$C_{2}=R_{\max}+(1-\gamma)^{3}G_{\infty}^{2}\left[c_{J}+2(3c_{J}+c_{\tilde{J}})\tau^{*}+1+L_{\pi}G_{\infty}(\tau^{*}+(\tau^{*})^{2})\right],$

with $\tau^{*}=\min\{\tau:m\rho^{\tau}\leq\alpha_{T}=\frac{1-\gamma}{\sqrt{T}}\}$.

We next compare Proposition 1 with two recent studies on the same non-i.i.d. model. First, Karimi et al., (2019) studied infinite-horizon PG with a biased gradient estimator. In their analysis, they did not bound the gradient bias using the stepsize, and hence their convergence has a non-zero error even with a diminishing stepsize. In contrast, we obtain a fine-grained bound on the bias and characterize its dependence on the stepsize. We show that PG exactly converges to a stationary point under the diminishing stepsize.

Another closely related study on the infinite-horizon PG under non-i.i.d. sampling was by Zhang et al., (2019), but their algorithm adopts an unbiased gradient estimator at the cost of using more samples. As a comparison, Proposition 1 indicates that PG-SGD with a biased gradient estimator attains a better convergence rate and accuracy. More specifically, under the constant stepsize, (Zhang et al., 2019, Corollary 4.4) showed that their PG algorithm converges with an error bound of $\mathcal{O}\left(\frac{1}{(1-\gamma)^{5}(1-\sqrt{\gamma})^{2}}\right)$, whereas PG-SGD with a biased gradient estimator achieves a much smaller error bound $\mathcal{O}\left(\frac{1}{(1-\gamma)^{2}}\right)$ by taking $\alpha=1-\gamma$. Similarly, under the diminishing stepsize, (Zhang et al., 2019, Theorem 4.3) showed that their PG algorithm converges at a rate of $\mathcal{O}\left(\frac{1}{(1-\gamma)^{5}(1-\sqrt{\gamma})^{2}\sqrt{T}}\right)$, whereas our PG-SGD converges at a rate of $\mathcal{O}\left(\frac{\log^{2}(\sqrt{T}/(1-\gamma))}{(1-\gamma)^{2}\sqrt{T}}\right)$, which is much faster since $\gamma$ is usually close to 1, and $\log T$ is considered to be less influential in practice.

4 Convergence of TD-AMSGrad under Markovian Sampling

In this section, we apply AMSGrad to TD learning and analyze its convergence under Markovian sampling. The proof techniques for bounding the bias and the nature of the convergence are very different from those of PG-AMSGrad.

4.1 TD Learning and TD-AMSGrad

Policy evaluation is a fundamental task in RL, and often plays a critical role in other algorithms such as PG that we study in Section 3. The goal of policy evaluation is to obtain an accurate estimation of the accumulated reward function, known as the value function $V:\mathcal{S}\mapsto\mathbb{R}$, for a given policy $\pi$, defined as

$V^{\pi}(s)=\mathbb{E}\left[\sum_{t=1}^{\infty}\gamma^{t}R(s_{t},a_{t})\,\Big|\,s_{1}=s\right].$

The Bellman operator corresponding to a value function $V$ is defined by, $\forall s\in\mathcal{S}$,

$(T^{\pi}V)(s)=\sum_{a\in\mathcal{A}}\pi(a|s)R(s,a)+\gamma\sum_{s^{\prime}\in\mathcal{S}}\mathbb{P}(s^{\prime}|s)V(s^{\prime}).$  (12)

The key to approximating the value function is the observation that it satisfies the projected Bellman equation $(\Pi T^{\pi}V)(s)=V(s)$ when projected onto some convex space. Under function approximation, the value function $V(s)$ is parameterized by $\theta\in\mathbb{R}^{d}$ and denoted by $V(s;\theta)$. As in many recent finite-time analyses of TD Bhandari et al., (2018); Xu et al., 2019b; Srikant and Ying, (2019), we consider the linear approximation class of the value function $V(s;\theta)$ defined as

$V(s;\theta)=\phi(s)^{T}\theta,$  (13)

where $\theta\in\mathbb{R}^{d}$, and $\phi:\mathcal{S}\rightarrow\mathbb{R}^{d}$ is a vector function with dimension $d$, whose elements represent the nonlinear kernel (feature) functions. Then the vanilla TD algorithm follows a stochastic iterative method given by

$\theta_{t+1}=\theta_{t}-\alpha_{t}g_{t},$  (14)

where $\alpha_{t}$ is the stepsize, and $g_{t}$ is defined as

$g_{t}:=g(\theta_{t};s_{t},a_{t},s_{t+1})=\left(\phi^{T}(s_{t})\theta_{t}-R(s_{t},a_{t})-\gamma\phi^{T}(s_{t+1})\theta_{t}\right)\phi(s_{t}).$  (15)

Here, $g_{t}$ serves as a stochastic pseudo-gradient, and is an estimator of the full pseudo-gradient given by

$\bar{g}(\theta_{t})=\underset{\nu}{\mathbb{E}}\left[\left(\phi^{T}(s_{t})\theta_{t}-R(s_{t},\pi(s_{t}))-\gamma\phi^{T}(s_{t+1})\theta_{t}\right)\phi(s_{t})\right],$  (16)

where the expectation is taken over the stationary distribution of the states. We note that $\bar{g}(\theta_{t})$ is not a gradient of a loss function, but plays a similar role as the gradient in the gradient descent algorithm.
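A minimal sketch of the stochastic pseudo-gradient (15) under the linear approximation (13) is given below, where phi is an assumed user-provided feature map returning a length-$d$ NumPy array:

import numpy as np

def td_pseudo_gradient(theta, s, r, s_next, phi, gamma):
    # g = (phi(s)^T theta - R(s,a) - gamma * phi(s')^T theta) * phi(s), as in (15)
    td_error = phi(s) @ theta - r - gamma * (phi(s_next) @ theta)
    return td_error * phi(s)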

Then TD-AMSGrad is obtained by replacing the update (14) by the AMSGrad update given in (1)-(4) as in Algorithm 2.

Algorithm 2 TD-AMSGrad
1:  Input: $\alpha,\lambda,\theta_{1},\beta_{1},\beta_{2},m_{0}=0,\hat{v}_{0}=0,s_{1}\sim\zeta(\cdot)$.
2:  for $t=1,2,\ldots,T$ do
3:     Assign $\alpha_{t}$, $\beta_{1t}=\beta_{1}\lambda^{t}$.
4:     Sample $a_{t}\sim\pi,\ s_{t+1}\sim\mathbb{P}(\cdot|s_{t},a_{t})$.
5:     Compute $g_{t}$ as in (15).
6:     $m_{t}=(1-\beta_{1t})m_{t-1}+\beta_{1t}g_{t}$.
7:     $v_{t}=(1-\beta_{2})\hat{v}_{t-1}+\beta_{2}g_{t}^{2}$.
8:     $\hat{v}_{t}=\max(\hat{v}_{t-1},v_{t}),\ \hat{V}_{t}=\mathrm{diag}(\hat{v}_{t,1},\dots,\hat{v}_{t,d})$.
9:     $\theta_{t+1}=\Pi_{\mathcal{D},\hat{V}_{t}^{1/4}}\left(\theta_{t}-\alpha_{t}\hat{V}_{t}^{-\frac{1}{2}}m_{t}\right)$, where $\Pi_{\mathcal{D},\hat{V}_{t}^{1/4}}(\theta^{\prime})=\underset{\theta\in\mathcal{D}}{\arg\min}\left\lVert\hat{V}_{t}^{1/4}\left(\theta^{\prime}-\theta\right)\right\rVert$.
10:  end for
11:  Output: $\frac{1}{T}\sum_{t=1}^{T}\theta_{t}$.

As seen in Algorithm 2, the state-action pairs are sampled as a trajectory under the transition probability $\mathbb{P}$ with the unknown policy $\pi$. Therefore, the samples along the trajectory are dependent, and hence we need to analyze the convergence of TD-AMSGrad under Markovian sampling.
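A schematic Python sketch of Algorithm 2 is given below. The routine sample_transition (returning a tuple $(s_{t},R(s_{t},a_{t}),s_{t+1})$ obtained by following the policy $\pi$) and the feature map phi are assumed user-provided, and the exact weighted projection $\Pi_{\mathcal{D},\hat{V}_{t}^{1/4}}$ is replaced by plain Euclidean clipping onto a ball of a given radius, which is a simplification rather than the exact operator of Algorithm 2.

import numpy as np

def td_amsgrad(theta, sample_transition, phi, gamma, radius,
               alpha=1e-2, beta1=0.1, beta2=0.01, lam=0.99, T=1000, eps=1e-12):
    # Schematic TD-AMSGrad loop (Algorithm 2) with a simplified projection step.
    m, v_hat = np.zeros_like(theta), np.zeros_like(theta)
    iterates = []
    for t in range(1, T + 1):
        beta1t = beta1 * lam ** t                      # diminishing beta_{1t}
        s, r, s_next = sample_transition()             # Markovian sample along the trajectory
        g = (phi(s) @ theta - r - gamma * (phi(s_next) @ theta)) * phi(s)   # (15)
        m = (1 - beta1t) * m + beta1t * g
        v = (1 - beta2) * v_hat + beta2 * g ** 2
        v_hat = np.maximum(v_hat, v)
        theta = theta - alpha * m / (np.sqrt(v_hat) + eps)
        norm = np.linalg.norm(theta)
        if norm > radius:                              # simplified Euclidean projection onto D
            theta = theta * (radius / norm)
        iterates.append(theta.copy())
    return np.mean(np.stack(iterates), axis=0)         # averaged output of Algorithm 2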

4.2 Technical Assumptions

In this section, we introduce some standard technical assumptions for our analysis.

We first give the following standard assumption on the kernel function Tsitsiklis and Van Roy, (1997); Bhandari et al., (2018); Xu et al., 2019b ; Chen et al., 2019b .

Assumption 3.

For any state $s\in\mathcal{S}$, the kernel function $\phi:\mathcal{S}\rightarrow\mathbb{R}^{d}$ is uniformly bounded, i.e., $\left\lVert\phi(s)\right\rVert\leq 1,\;\forall s\in\mathcal{S}$. In addition, we define a feature matrix $\Phi$ as

$\Phi=\begin{bmatrix}\phi^{T}(s_{1})\\ \phi^{T}(s_{2})\\ \vdots\end{bmatrix}=\begin{bmatrix}\phi_{1}(s_{1})&\cdots&\phi_{d}(s_{1})\\ \phi_{1}(s_{2})&\cdots&\phi_{d}(s_{2})\\ \vdots&\vdots&\vdots\end{bmatrix},$

and assume that the columns of $\Phi$ are linearly independent.

The boundedness assumption is mild since we can always normalize the kernel functions.

We next define the following norm of the value function as:

$\left\lVert V_{\theta}\right\rVert_{D}=\sqrt{\sum_{s\in\mathcal{S}}\nu(s)\left(\phi(s)^{T}\theta\right)^{2}}=\left\lVert\theta\right\rVert_{\Sigma},$

where $D=\mathrm{diag}(\nu(s_{1}),\nu(s_{2}),\cdots)$ is a diagonal matrix whose elements are determined by the stationary distribution $\nu$, and $\Sigma\in\mathbb{R}^{d\times d}$ is the steady-state feature covariance matrix given by $\Sigma=\sum_{s\in\mathcal{S}}\nu(s)\phi(s)\phi(s)^{T}$.
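As a small illustration (assuming, for simplicity, a finite enumerable set of states with features stacked in an array), the steady-state feature covariance $\Sigma$ and the norm $\|V_{\theta}\|_{D}=\|\theta\|_{\Sigma}$ can be computed as follows; the function names are ours and are used only for illustration.

import numpy as np

def feature_covariance(phis, nu):
    # Sigma = sum_s nu(s) * phi(s) phi(s)^T, with phis of shape (num_states, d)
    # and nu the stationary probabilities of the corresponding states.
    return (phis * nu[:, None]).T @ phis

def sigma_norm(theta, sigma):
    # ||theta||_Sigma = sqrt(theta^T Sigma theta) = ||V_theta||_D
    return np.sqrt(theta @ sigma @ theta)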

The next assumption of the bounded domain is standard in theoretical analysis of the Adam-type algorithms Reddi et al., (2018); Tran and Phong, (2019).

Assumption 4.

The domain $\mathcal{D}\subset\mathbb{R}^{d}$ of approximation parameters is a ball centered at $\theta=0$ with a bounded diameter containing $\theta^{\star}$. That is, there exists $D_{\infty}$ such that $\theta^{\star}\in\mathcal{D}$, and $\left\lVert\theta_{m}-\theta_{n}\right\rVert<D_{\infty}$ for any $\theta_{m},\theta_{n}\in\mathcal{D}$.

The boundedness parameter $D_{\infty}$ can be chosen as discussed in Bhandari et al., (2018).

4.3 Convergence of TD-AMSGrad

In the following, we provide the convergence results for TD-AMSGrad with linear function approximation under Markovian sampling.

First consider the full pseudo-gradient $\bar{g}(\theta)$ in (16). We define $\theta^{*}$ as the fixed point of $\bar{g}(\theta)$, i.e., $\bar{g}(\theta^{*})=0$. Then $\theta^{*}$ is the unique fixed point under Assumption 3, following from the contraction property of the projected Bellman operator Tsitsiklis and Van Roy, (1997). The following lemma is useful for our convergence analysis.

Lemma 2.

(Bhandari et al., 2018, Lemma 3) Let $\omega>0$ be the minimum eigenvalue of the matrix $\Sigma$. Under Assumption 3 and for any $\theta\in\mathbb{R}^{d}$, we have

$(\theta-\theta^{*})^{T}\bar{g}(\theta)\geq(1-\gamma)\sqrt{\omega}\left\lVert\theta-\theta^{*}\right\rVert^{2}.$

Lemma 2 indicates that the update of TD learning exhibits a property similar to strongly convex optimization.

Since our analysis is under Markovian sampling, a non-zero bias arises in our convergence analysis when estimating the gradient, i.e., $\mathbb{E}[g_{t}]-\bar{g}(\theta_{t})\neq 0$. The following lemma provides an upper bound on such a bias error for TD-AMSGrad.

Lemma 3.

Consider a sequence of non-increasing stepsizes $\{\alpha_{t}\}_{t=1}^{T}$ for Algorithm 2 and fix $\tau^{*}=\min\{\tau:m\rho^{\tau}\leq\alpha_{T}\}$. Initialize Algorithm 2 such that $|g_{1,i}|\geq G_{0}$ for all $i\in[d]$ with some $G_{0}>0$. Under Assumptions 2-4, we have for $t\leq\tau^{*}$:

$\mathbb{E}[(\theta_{t}-\theta^{*})^{T}(g_{t}-\bar{g}(\theta_{t}))]\leq 2((1+\gamma)D_{\infty}+G_{\infty})\cdot\frac{G_{\infty}}{G_{0}}\tau^{*}\alpha_{0};$

and for $t>\tau^{*}$:

$\mathbb{E}[(\theta_{t}-\theta^{*})^{T}(g_{t}-\bar{g}(\theta_{t}))]\leq 4D_{\infty}G_{\infty}\alpha_{T}+2((1+\gamma)D_{\infty}+G_{\infty})\cdot\frac{G_{\infty}}{G_{0}}\sum_{i=t-\tau^{*}}^{t-1}\alpha_{i},$

where $G_{\infty}=R_{\max}+(1+\gamma)D_{\infty}$.

The following theorem provides the convergence of TD-AMSGrad under a constant stepsize coupled with diminishing hyper-parameters in the AMSGrad update.

Theorem 3.

(Convergence of TD-AMSGrad with constant stepsize) Let $\beta_{1t}=\beta_{1}\lambda^{t}$ and $\delta=\beta_{1}/\beta_{2}$ with $\delta,\lambda\in(0,1)$ in Algorithm 2. Initialize Algorithm 2 such that $|g_{1,i}|\geq G_{0}$ for all $i\in[d]$ with some $G_{0}>0$. Let $\alpha_{t}=\alpha,\ t=1,\dots,T$, and suppose Assumptions 2-4 hold. Then the output of Algorithm 2 satisfies:

$\mathbb{E}\left\lVert\theta_{out}-\theta^{\star}\right\rVert^{2}\leq\frac{C_{1}}{T}+\alpha C_{2},$

where

$C_{1}=\frac{G_{\infty}D_{\infty}^{2}}{\alpha c(1-\beta)}+\frac{\beta_{1}\lambda G_{\infty}D_{\infty}^{2}}{2\alpha c(1-\lambda)(1-\beta)}+2((1+\gamma)D_{\infty}+G_{\infty})\cdot\frac{G_{\infty}}{cG_{0}(1-\beta)}(\tau^{*})^{2}\alpha,$
$C_{2}=\frac{4D_{\infty}G_{\infty}}{c(1-\beta)}+\frac{2G_{\infty}\tau^{*}((1+\gamma)D_{\infty}+G_{\infty})}{cG_{0}(1-\beta)}+\frac{(1+\beta)G_{\infty}^{2}}{2cG_{0}(1-\beta)}$

with $c=(1-\gamma)\sqrt{\omega}$ and $G_{\infty}=R_{\max}+(1+\gamma)D_{\infty}$.

In Theorem 3, $C_{1}$ and $C_{2}$ are constants independent of time. Therefore, under the choice of the stepsize and hyper-parameters in the theorem, Algorithm 2 converges to a neighborhood of the global optimum at a rate of $\mathcal{O}\left(\frac{1}{T}\right)$. The size of the neighborhood is controlled by the stepsize $\alpha$. Similarly to Theorem 1, we can balance the tradeoff between the convergence rate and the convergence accuracy, i.e., the size of the neighborhood, by setting the stepsize $\alpha_{t}=\frac{1}{\sqrt{T}}$, which yields convergence to the global optimal solution at the rate of $\mathcal{O}\left(\frac{\log^{2}T}{\sqrt{T}}\right)$.

Next, we provide the convergence result with a diminishing stepsize in the following theorem.

Theorem 4.

(Convergence of TD-AMSGrad with diminishing stepsize) Suppose the same conditions of Theorem 3 hold, and let $\alpha_{t}=\frac{\alpha}{\sqrt{t}}$ for $t=1,\dots,T$. Then the output of Algorithm 2 satisfies:

$\mathbb{E}\left\lVert\theta_{out}-\theta^{\star}\right\rVert^{2}\leq\frac{C_{1}}{\sqrt{T}}+\frac{C_{2}}{T},$

where

$C_{1}=\frac{G_{\infty}D_{\infty}^{2}}{2c\alpha(1-\beta)}+\frac{\alpha(1+\beta_{1})G_{\infty}^{2}}{cG_{0}(1-\beta)}+\frac{4\alpha D_{\infty}G_{\infty}}{c(1-\beta)}+\frac{4\tau^{*}\alpha G_{\infty}((1+\gamma)D_{\infty}+G_{\infty})}{cG_{0}(1-\beta)},$
$C_{2}=\frac{G_{\infty}D_{\infty}^{2}}{\sqrt{2}c\alpha(1-\beta)}+\frac{\beta G_{\infty}D_{\infty}^{2}}{2c\alpha(1-\lambda)^{2}(1-\beta)}+\frac{2G_{\infty}\alpha(\tau^{*})^{2}((1+\gamma)D_{\infty}+G_{\infty})}{cG_{0}(1-\beta)}$

with $c=(1-\gamma)\sqrt{\omega}$ and $G_{\infty}=R_{\max}+(1+\gamma)D_{\infty}$.

Comparing with Theorem 3 and observing that $\tau^{*}=\mathcal{O}(\log T)$, we conclude that TD-AMSGrad with the diminishing stepsize converges exactly to the global optimum at the rate of $\mathcal{O}\left(\frac{\log T}{\sqrt{T}}\right)$, rather than to a neighborhood.

5 Conclusion

This paper provides the first convergence analysis of the Adam-type RL algorithms under Markovian sampling. For PG-AMSGrad with nonlinear function approximation of the policy, we show that the algorithm converges to a stationary point with a diminishing stepsize. With a constant stepsize, we show that PG-AMSGrad converges only to a neighborhood of a stationary point, but at a faster rate. Furthermore, we also provide the finite-time convergence results for TD-AMSGrad to the global optimum under proper choices of the stepsize.

Several future directions along this topic are interesting. For example, it is important to study whether the convergence result of PG-AMSGrad is optimal. More general value function approximation, as well as convergence results for constant hyper-parameters in TD-AMSGrad, are also of interest.

Acknowledgements

The work was supported in part by the U.S. National Science Foundation under Grants CCF-1761506, ECCS-1818904, CCF-1909291 and CCF-1900145.

References

  • Agarwal et al., (2019) Agarwal, A., Kakade, S. M., Lee, J. D., and Mahajan, G. (2019). Optimality and approximation with policy gradient methods in Markov decision processes. arXiv preprint arXiv:1908.00261.
  • Baird, (1995) Baird, L. (1995). Residual algorithms: Reinforcement learning with function approximation. In Machine Learning Proceedings 1995, pages 30–37. Elsevier.
  • Baxter and Bartlett, (2001) Baxter, J. and Bartlett, P. L. (2001). Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 15:319–350.
  • Bello et al., (2017) Bello, I., Zoph, B., Vasudevan, V., and Le, Q. V. (2017). Neural optimizer search with reinforcement learning. In International Conference on Machine Learning (ICML), pages 459–468.
  • Benveniste et al., (2012) Benveniste, A., Métivier, M., and Priouret, P. (2012). Adaptive algorithms and stochastic approximations, volume 22. Springer Science & Business Media.
  • Bhandari and Russo, (2019) Bhandari, J. and Russo, D. (2019). Global optimality guarantees for policy gradient methods. arXiv preprint arXiv:1906.01786.
  • Bhandari et al., (2018) Bhandari, J., Russo, D., and Singal, R. (2018). A finite time analysis of temporal difference learning with linear function approximation. In Conference on Learning Theory (COLT).
  • Bhatnagar, (2010) Bhatnagar, S. (2010). An actor-critic algorithm with function approximation for discounted cost constrained markov decision processes. Systems & Control Letters, 59(12):760–766.
  • Bhatnagar et al., (2008) Bhatnagar, S., Ghavamzadeh, M., Lee, M., and Sutton, R. S. (2008). Incremental natural actor-critic algorithms. In Advances in Neural Information Processing Systems (NeurIPS), pages 105–112.
  • Bhatnagar et al., (2009) Bhatnagar, S., Sutton, R. S., Ghavamzadeh, M., and Lee, M. (2009). Natural actor-critic algorithms. Automatica, 45(11):2471–2482.
  • Cai et al., (2019) Cai, Q., Yang, Z., Lee, J. D., and Wang, Z. (2019). Neural temporal-difference learning converges to global optima. In Advances in Neural Information Processing Systems (NeurIPS), pages 11312–11322.
  • Castillo et al., (2019) Castillo, G. A., Weng, B., Hereid, A., Wang, Z., and Zhang, W. (2019). Reinforcement learning meets hybrid zero dynamics: A case study for rabbit. In 2019 International Conference on Robotics and Automation (ICRA), pages 284–290.
  • (13) Chen, X., Liu, S., Sun, R., and Hong, M. (2019a). On the convergence of a class of Adam-type algorithms for non-convex optimization. In International Conference on Learning Representations (ICLR).
  • (14) Chen, Z., Zhang, S., Doan, T. T., Maguluri, S. T., and Clarke, J.-P. (2019b). Finite-time analysis of Q-learning with linear function approximation. arXiv preprint arXiv:1905.11425.
  • Dalal et al., (2018) Dalal, G., Szörényi, B., Thoppe, G., and Mannor, S. (2018). Finite sample analyses for TD(0) with function approximation. In Thirty-Second AAAI Conference on Artificial Intelligence.
  • Fazel et al., (2018) Fazel, M., Ge, R., Kakade, S., and Mesbahi, M. (2018). Global convergence of policy gradient methods for the linear quadratic regulator. In International Conference on Machine Learning (ICML), pages 1467–1476.
  • Fu et al., (2020) Fu, Z., Yang, Z., Chen, Y., and Wang, Z. (2020). Actor-critic provably finds nash equilibria of linear-quadratic mean-field games. In International Conference on Learning Representations (ICLR).
  • Györfi and Walk, (1996) Györfi, L. and Walk, H. (1996). On the averaged stochastic approximation for linear regression. SIAM Journal on Control and Optimization, 34(1):31–61.
  • Hu and Syed, (2019) Hu, B. and Syed, U. (2019). Characterizing the exact behaviors of temporal difference learning algorithms using Markov jump linear system theory. In Advances in Neural Information Processing Systems (NeurIPS), pages 8477–8488.
  • Ji et al., (2019) Ji, K., Wang, Z., Zhou, Y., and Liang, Y. (2019). Faster stochastic algorithms via history-gradient aided batch size adaptation. arXiv preprint arXiv:1910.09670.
  • Kakade, (2002) Kakade, S. M. (2002). A natural policy gradient. In Advances in Neural Information Processing Systems (NeurIPS), pages 1531–1538.
  • Karimi et al., (2019) Karimi, B., Miasojedow, B., Moulines, E., and Wai, H.-T. (2019). Non-asymptotic analysis of biased stochastic approximation scheme. In Conference on Learning Theory (COLT).
  • Kingma and Ba, (2015) Kingma, D. P. and Ba, J. (2015). Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR).
  • Konda, (2002) Konda, V. (2002). Actor-critic algorithms. PhD thesis, Massachusetts Institute of Technology.
  • Konda and Tsitsiklis, (2000) Konda, V. R. and Tsitsiklis, J. N. (2000). Actor-critic algorithms. In Advances in Neural Information Processing Systems (NeurIPS), pages 1008–1014.
  • Kushner and Yin, (2003) Kushner, H. and Yin, G. G. (2003). Stochastic approximation and recursive algorithms and applications, volume 35. Springer Science & Business Media.
  • Lakshminarayanan and Szepesvari, (2018) Lakshminarayanan, C. and Szepesvari, C. (2018). Linear stochastic approximation: How far does constant step-size and iterate averaging go? In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 1347–1355.
  • Liu et al., (2019) Liu, B., Cai, Q., Yang, Z., and Wang, Z. (2019). Neural trust region/proximal policy optimization attains globally optimal policy. In Advances in Neural Information Processing Systems (NeurIPS), pages 10564–10575.
  • Luo et al., (2019) Luo, Y., Yang, Z., Wang, Z., and Kolar, M. (2019). Natural actor-critic converges globally for hierarchical linear quadratic regulator. arXiv preprint arXiv:1912.06875.
  • Malik et al., (2019) Malik, D., Pananjady, A., Bhatia, K., Khamaru, K., Bartlett, P., and Wainwright, M. (2019). Derivative-free methods for policy optimization: Guarantees for linear quadratic systems. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 2916–2925.
  • Mitrophanov, (2005) Mitrophanov, A. Y. (2005). Sensitivity and convergence of uniformly ergodic Markov chains. Journal of Applied Probability, 42(4):1003–1014.
  • Mnih et al., (2013) Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. (2013). Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.
  • Papini et al., (2018) Papini, M., Binaghi, D., Canonaco, G., Pirotta, M., and Restelli, M. (2018). Stochastic variance-reduced policy gradient. In International Conference on Machine Learning (ICML), pages 4026–4035.
  • Papini et al., (2017) Papini, M., Pirotta, M., and Restelli, M. (2017). Adaptive batch size for safe policy gradients. In Advances in Neural Information Processing Systems (NeurIPS), pages 3591–3600.
  • Papini et al., (2019) Papini, M., Pirotta, M., and Restelli, M. (2019). Smoothing policies and safe policy gradients. arXiv preprint arXiv:1905.03231.
  • Pednault et al., (2002) Pednault, E., Abe, N., and Zadrozny, B. (2002). Sequential cost-sensitive decision making with reinforcement learning. In Proceedings of the eighth ACM SIGKDD International conference on Knowledge Discovery and Data Mining, pages 259–268.
  • Pirotta et al., (2015) Pirotta, M., Restelli, M., and Bascetta, L. (2015). Policy gradient in Lipschitz Markov decision processes. Machine Learning, 100(2-3):255–283.
  • Reddi et al., (2018) Reddi, S. J., Kale, S., and Kumar, S. (2018). On the convergence of Adam and beyond. In International Conference on Learning Representations (ICLR).
  • Rummery and Niranjan, (1994) Rummery, G. A. and Niranjan, M. (1994). On-line Q-learning using connectionist systems, volume 37.
  • Schulman et al., (2015) Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. (2015). Trust region policy optimization. In International Conference on Machine Learning (ICML), pages 1889–1897.
  • Schulman et al., (2017) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
  • Shani et al., (2020) Shani, L., Efroni, Y., and Mannor, S. (2020). Adaptive trust region policy optimization: Global convergence and faster rates for regularized MDPs. In Thirty-Fourth AAAI Conference on Artificial Intelligence.
  • Shen et al., (2019) Shen, Z., Ribeiro, A., Hassani, H., Qian, H., and Mi, C. (2019). Hessian aided policy gradient. In International Conference on Machine Learning (ICML), pages 5729–5738.
  • Silver et al., (2014) Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., and Riedmiller, M. (2014). Deterministic policy gradient algorithms. In International Conference on Machine Learning (ICML), pages 387–395.
  • Srikant and Ying, (2019) Srikant, R. and Ying, L. (2019). Finite-time error bounds for linear stochastic approximation and TD learning. In Conference on Learning Theory (COLT), pages 2803–2830.
  • Stooke and Abbeel, (2018) Stooke, A. and Abbeel, P. (2018). Accelerated methods for deep reinforcement learning. arXiv preprint arXiv:1803.02811.
  • Sun et al., (2020) Sun, T., Shen, H., Chen, T., and Li, D. (2020). Adaptive temporal difference learning with linear function approximation. arXiv preprint arXiv:2002.08537.
  • Sutton, (1988) Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9–44.
  • Sutton et al., (2000) Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems (NeurIPS), pages 1057–1063.
  • Tadić et al., (2017) Tadić, V. B., Doucet, A., et al. (2017). Asymptotic bias of stochastic gradient search. The Annals of Applied Probability, 27(6):3255–3304.
  • Tran and Phong, (2019) Tran, P. T. and Phong, L. T. (2019). On the convergence proof of AMSGrad and a new version. IEEE Access, 7:61706–61716.
  • Tsitsiklis and Van Roy, (1997) Tsitsiklis, J. N. and Van Roy, B. (1997). An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5):674–690.
  • Tu and Recht, (2019) Tu, S. and Recht, B. (2019). The gap between model-based and model-free methods on the linear quadratic regulator: An asymptotic viewpoint. In Conference on Learning Theory (COLT), pages 3036–3083.
  • Wang et al., (2020) Wang, L., Cai, Q., Yang, Z., and Wang, Z. (2020). Neural policy gradient methods: Global optimality and rates of convergence. In International Conference on Learning Representations (ICLR).
  • Watkins and Dayan, (1992) Watkins, C. J. and Dayan, P. (1992). Q-learning. Machine Learning, 8(3-4):279–292.
  • Williams, (1992) Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256.
  • Xu et al., (2019a) Xu, P., Gao, F., and Gu, Q. (2019a). An improved convergence analysis of stochastic variance-reduced policy gradient. In International Conference on Uncertainty in Artificial Intelligence (UAI).
  • Xu et al., (2020) Xu, P., Gao, F., and Gu, Q. (2020). Sample efficient policy gradient methods with recursive variance reduction. In International Conference on Learning Representations (ICLR).
  • Xu et al., (2019b) Xu, T., Zou, S., and Liang, Y. (2019b). Two time-scale off-policy TD learning: Non-asymptotic analysis over Markovian samples. In Advances in Neural Information Processing Systems (NeurIPS), pages 10633–10643.
  • Yang et al., (2019) Yang, Z., Chen, Y., Hong, M., and Wang, Z. (2019). Provably global convergence of actor-critic: A case for linear quadratic regulator with ergodic cost. In Advances in Neural Information Processing Systems (NeurIPS), pages 8351–8363.
  • Zhang et al., (2019) Zhang, K., Koppel, A., Zhu, H., and Başar, T. (2019). Global convergence of policy gradient methods to (almost) locally optimal policies. arXiv preprint arXiv:1906.08383.
  • Zhou et al., (2018) Zhou, D., Tang, Y., Yang, Z., Cao, Y., and Gu, Q. (2018). On the convergence of adaptive gradient methods for nonconvex optimization. arXiv preprint arXiv:1808.05671.
  • Zou et al., (2019a) Zou, F., Shen, L., Jie, Z., Zhang, W., and Liu, W. (2019a). A sufficient condition for convergences of Adam and RMSProp. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 11127–11135.
  • Zou et al., (2019b) Zou, S., Xu, T., and Liang, Y. (2019b). Finite-sample analysis for SARSA with linear function approximation. In Advances in Neural Information Processing Systems (NeurIPS), pages 8665–8675.

Supplementary Materials

Appendix A EstQ algorithm

We introduce below the EstQ algorithm, which produces an unbiased estimate of the Q-function.

Algorithm 3 EstQ Zhang et al., (2019)
1:  Input: s,a,θs,a,\theta. Initialize Q^=0,s1q=s,a1q=a\hat{Q}=0,s^{q}_{1}=s,a^{q}_{1}=a
2:  Draw TGeom(1γ1/2)T\sim\text{Geom}(1-\gamma^{1/2}).
3:  for  t=1,2,,T1t=1,2,\ldots,T-1 do
4:     Collect reward R(stq,atq)R(s^{q}_{t},a^{q}_{t}) and update the Q-function Q^Q^+γt/2R(stq,atq)\hat{Q}\leftarrow\hat{Q}+\gamma^{t/2}R(s^{q}_{t},a^{q}_{t}).
5:     Sample st+1q(|stq,atq),at+1qπθ(|st+1q)s^{q}_{t+1}\sim\mathbb{P}(\cdot|s^{q}_{t},a^{q}_{t}),a^{q}_{t+1}\sim\pi_{\theta}(\cdot|s^{q}_{t+1}).
6:  end for
7:  Collect reward R(sTq,aTq)R(s^{q}_{T},a^{q}_{T}) and update the Q-function Q^Q^+γT/2R(sTq,aTq)\hat{Q}\leftarrow\hat{Q}+\gamma^{T/2}R(s^{q}_{T},a^{q}_{T}).
8:   Output: Q^πθQ^\hat{Q}^{\pi_{\theta}}\leftarrow\hat{Q}.

We emphasize that in Algorithm 3 the samples \{(s^{q}_{t},a^{q}_{t})\} are used only to estimate the Q-function and, except for the initial pair, are independent of those used to evaluate the score function in PG-AMSGrad. Zhang et al., (2019) showed that this algorithm provides an unbiased estimator of the Q-function Q^{\pi_{\theta}}(s,a).
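For illustration only, the rollout in Algorithm 3 can be sketched in Python as follows; sample_action, env_step and reward are hypothetical stand-ins for \pi_{\theta}(\cdot|s), \mathbb{P}(\cdot|s,a) and R(s,a), and the snippet is not part of the paper's analysis.

import numpy as np

def est_q(s, a, sample_action, env_step, reward, gamma, rng=np.random.default_rng()):
    # Sketch of EstQ (Algorithm 3): returns an estimate of Q^{pi_theta}(s, a).
    # Assumed interfaces (not from the paper):
    #   sample_action(s) ~ pi_theta(.|s), env_step(s, a) ~ P(.|s, a), reward(s, a) = R(s, a).
    q_hat = 0.0
    T = rng.geometric(1.0 - np.sqrt(gamma))          # T ~ Geom(1 - gamma^{1/2}), supported on {1, 2, ...}
    for t in range(1, T + 1):
        q_hat += gamma ** (t / 2.0) * reward(s, a)   # accumulate gamma^{t/2} R(s_t^q, a_t^q)
        if t < T:
            s = env_step(s, a)                       # s_{t+1}^q ~ P(.|s_t^q, a_t^q)
            a = sample_action(s)                     # a_{t+1}^q ~ pi_theta(.|s_{t+1}^q)
    return q_hat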

Appendix B Proof of Lemma 1

For notational simplicity, we denote J(θ)=θJ(θ)\nabla J(\theta)=\nabla_{\theta}J(\theta) and J~(θ;s,a)=θJ~(θ;s,a)\nabla\tilde{J}(\theta;s,a)=\nabla_{\theta}\tilde{J}(\theta;s,a) in the sequel. We first introduce two technical lemmas which are useful for the proof of Lemma 1.

Lemma 4.

The gradient estimator is uniformly bounded, that is,

gtgtcΘRmax1γ.\left\lVert g_{t}\right\rVert_{\infty}\leq\left\lVert g_{t}\right\rVert\leq\frac{c_{\Theta}R_{\max}}{1-\gamma}. (17)

Similarly, we also have

J(θ)cΘRmax1γ;J~(θ;s,a)cΘRmax1γ.\left\lVert\nabla J(\theta)\right\rVert\leq\frac{c_{\Theta}R_{\max}}{1-\gamma};\qquad\left\lVert\nabla\tilde{J}(\theta;s,a)\right\rVert\leq\frac{c_{\Theta}R_{\max}}{1-\gamma}.
Proof.

The proof proceeds directly from the definition of g_{t}:

\displaystyle\left\lVert g_{t}\right\rVert=\left\lVert\hat{Q}_{t}(s_{t},a_{t})\nabla_{\theta_{t}}\log(\pi_{\theta_{t}}(a_{t}|s_{t}))\right\rVert
\displaystyle=\left\lVert\left(\sum_{t^{\prime}=t}^{\infty}\gamma^{t^{\prime}-t}R(s_{t^{\prime}},a_{t^{\prime}})\right)\nabla_{\theta_{t}}\log(\pi_{\theta_{t}}(a_{t}|s_{t}))\right\rVert
\displaystyle\leq\left\lvert\sum_{t^{\prime}=t}^{\infty}\gamma^{t^{\prime}-t}R(s_{t^{\prime}},a_{t^{\prime}})\right\rvert\cdot\left\lVert\nabla_{\theta_{t}}\log(\pi_{\theta_{t}}(a_{t}|s_{t}))\right\rVert
\displaystyle\overset{\text{(i)}}{\leq}\frac{c_{\Theta}R_{\max}}{1-\gamma},

where (i) follows from Assumption 1 and from the fact that the reward function is uniformly bounded. The proof of the second claim is similar to the first one. ∎

Lemma 5.

(Zou et al., 2019b, Lemma 3) For any \theta_{1},\theta_{2}\in\mathbb{R}^{d}, under Assumption 2 we have

μθ1μθ2TV|𝒜|Lπ(1+logρσ1+11ρ)θ1θ2.\displaystyle\left\|\mu_{\theta_{1}}-\mu_{\theta_{2}}\right\|_{TV}\leq|\mathcal{A}|L_{\pi}\left(1+\lceil\log_{\rho}\sigma^{-1}\rceil+\frac{1}{1-\rho}\right)\left\lVert\theta_{1}-\theta_{2}\right\rVert.

Proof of Lemma 1: To derive the Lipschitz continuity of J(θ)\nabla J(\theta), we first use the policy gradient theorem Sutton et al., (2000) which gives

J(θ)\displaystyle\nabla J(\theta) =(s,a)t=1γtp(st=s|s1,πθ)Qπθ(s,a)πθ(a|s)dsda\displaystyle=\underset{(s,a)}{\int}\sum_{t=1}^{\infty}\gamma^{t}p(s_{t}=s|s_{1},\pi_{\theta})\cdot Q^{\pi_{\theta}}(s,a)\nabla\pi_{\theta}(a|s)dsda
:=(s,a)Qπθ(s,a)θlog(πθ(a|s))dμθ,\displaystyle:=\underset{(s,a)}{\int}Q^{\pi_{\theta}}(s,a)\nabla_{\theta}\log(\pi_{\theta}(a|s))d\mu_{\theta},

where \mu_{\theta} denotes the discounted state-action visitation distribution. We then establish the Lipschitz continuity of \nabla J(\theta) as follows:

J(θ1)J(θ2)\displaystyle\left\lVert\nabla J(\theta_{1})-\nabla J(\theta_{2})\right\rVert
=(s,a)Qπθ1(s,a)θ1log(πθ1(a|s))dμθ1(s,a)Qπθ2(s,a)θ2log(πθ2(a|s))dμθ2\displaystyle\quad=\underset{(s,a)}{\int}Q^{\pi_{\theta_{1}}}(s,a)\nabla_{\theta_{1}}\log(\pi_{\theta_{1}}(a|s))d\mu_{\theta_{1}}-\underset{(s,a)}{\int}Q^{\pi_{\theta_{2}}}(s,a)\nabla_{\theta_{2}}\log(\pi_{\theta_{2}}(a|s))d\mu_{\theta_{2}}
=[(s,a)Qπθ1(s,a)θ1log(πθ1(a|s))dμθ1(s,a)Qπθ1(s,a)θ2log(πθ2(a|s))dμθ1]\displaystyle\quad=\left[\underset{(s,a)}{\int}Q^{\pi_{\theta_{1}}}(s,a)\nabla_{\theta_{1}}\log(\pi_{\theta_{1}}(a|s))d\mu_{\theta_{1}}-\underset{(s,a)}{\int}Q^{\pi_{\theta_{1}}}(s,a)\nabla_{\theta_{2}}\log(\pi_{\theta_{2}}(a|s))d\mu_{\theta_{1}}\right]
+[(s,a)Qπθ1(s,a)θ2log(πθ2(a|s))dμθ1(s,a)Qπθ2(s,a)θ2log(πθ2(a|s))dμθ1]\displaystyle\quad\quad\ +\left[\underset{(s,a)}{\int}Q^{\pi_{\theta_{1}}}(s,a)\nabla_{\theta_{2}}\log(\pi_{\theta_{2}}(a|s))d\mu_{\theta_{1}}-\underset{(s,a)}{\int}Q^{\pi_{\theta_{2}}}(s,a)\nabla_{\theta_{2}}\log(\pi_{\theta_{2}}(a|s))d\mu_{\theta_{1}}\right]
+[(s,a)Qπθ2(s,a)θ2log(πθ2(a|s))dμθ1(s,a)Qπθ2(s,a)θ2log(πθ2(a|s))dμθ2]\displaystyle\quad\quad\ +\left[\underset{(s,a)}{\int}Q^{\pi_{\theta_{2}}}(s,a)\nabla_{\theta_{2}}\log(\pi_{\theta_{2}}(a|s))d\mu_{\theta_{1}}-\underset{(s,a)}{\int}Q^{\pi_{\theta_{2}}}(s,a)\nabla_{\theta_{2}}\log(\pi_{\theta_{2}}(a|s))d\mu_{\theta_{2}}\right]
=(s,a)Qπθ1(s,a)(θ1log(πθ1(a|s))θ2log(πθ2(a|s)))dμθ1\displaystyle\quad=\underset{(s,a)}{\int}Q^{\pi_{\theta_{1}}}(s,a)\left(\nabla_{\theta_{1}}\log(\pi_{\theta_{1}}(a|s))-\nabla_{\theta_{2}}\log(\pi_{\theta_{2}}(a|s))\right)d\mu_{\theta_{1}}
+(s,a)(Qπθ1(s,a)Qπθ2(s,a))θ2log(πθ2(a|s))dμθ1\displaystyle\quad\quad\ +\underset{(s,a)}{\int}(Q^{\pi_{\theta_{1}}}(s,a)-Q^{\pi_{\theta_{2}}}(s,a))\nabla_{\theta_{2}}\log(\pi_{\theta_{2}}(a|s))d\mu_{\theta_{1}}
+[(s,a)Qπθ2(s,a)θ2log(πθ2(a|s))dμθ1(s,a)Qπθ2(s,a)θ2log(πθ2(a|s))dμθ2]\displaystyle\quad\quad\ +\left[\underset{(s,a)}{\int}Q^{\pi_{\theta_{2}}}(s,a)\nabla_{\theta_{2}}\log(\pi_{\theta_{2}}(a|s))d\mu_{\theta_{1}}-\underset{(s,a)}{\int}Q^{\pi_{\theta_{2}}}(s,a)\nabla_{\theta_{2}}\log(\pi_{\theta_{2}}(a|s))d\mu_{\theta_{2}}\right]
(i)RmaxL1γθ1θ2(s,a)dμθ1+cΘ(s,a)(Qπθ1(s,a)Qπθ2(s,a))dμθ1\displaystyle\quad\overset{\text{(i)}}{\leq}\frac{R_{\max}L}{1-\gamma}\left\lVert\theta_{1}-\theta_{2}\right\rVert\underset{(s,a)}{\int}d\mu_{\theta_{1}}+c_{\Theta}\underset{(s,a)}{\int}(Q^{\pi_{\theta_{1}}}(s,a)-Q^{\pi_{\theta_{2}}}(s,a))d\mu_{\theta_{1}}
+[(s,a)Qπθ2(s,a)θ2log(πθ2(a|s))dμθ1(s,a)Qπθ2(s,a)θ2log(πθ2(a|s))dμθ2]\displaystyle\quad\quad\ +\left[\underset{(s,a)}{\int}Q^{\pi_{\theta_{2}}}(s,a)\nabla_{\theta_{2}}\log(\pi_{\theta_{2}}(a|s))d\mu_{\theta_{1}}-\underset{(s,a)}{\int}Q^{\pi_{\theta_{2}}}(s,a)\nabla_{\theta_{2}}\log(\pi_{\theta_{2}}(a|s))d\mu_{\theta_{2}}\right]
=(ii)RmaxL1γθ1θ2+cΘ(s,a)(Qπθ1(s,a)Qπθ2(s,a))dμθ1\displaystyle\quad\overset{\text{(ii)}}{=}\frac{R_{\max}L}{1-\gamma}\left\lVert\theta_{1}-\theta_{2}\right\rVert+c_{\Theta}\underset{(s,a)}{\int}(Q^{\pi_{\theta_{1}}}(s,a)-Q^{\pi_{\theta_{2}}}(s,a))d\mu_{\theta_{1}}
+[(s,a)Qπθ2(s,a)θ2log(πθ2(a|s))dμθ1(s,a)Qπθ2(s,a)θ2log(πθ2(a|s))dμθ2],\displaystyle\quad\quad\ +\left[\underset{(s,a)}{\int}Q^{\pi_{\theta_{2}}}(s,a)\nabla_{\theta_{2}}\log(\pi_{\theta_{2}}(a|s))d\mu_{\theta_{1}}-\underset{(s,a)}{\int}Q^{\pi_{\theta_{2}}}(s,a)\nabla_{\theta_{2}}\log(\pi_{\theta_{2}}(a|s))d\mu_{\theta_{2}}\right], (18)

where (i) follows from Assumption 1 and the boundedness of Q^{\pi_{\theta}}(s,a), and (ii) follows since \underset{(s,a)}{\int}d\mu_{\theta_{1}}=1. In the following, we bound the last two terms on the right-hand side of (18). First, we show that Q^{\pi_{\theta}}(s,a) is also Lipschitz continuous with respect to \theta. To see this, we have

|Qπθ1(s,a)Qπθ2(s,a)|\displaystyle|Q^{\pi_{\theta_{1}}}(s,a)-Q^{\pi_{\theta_{2}}}(s,a)| =𝔼νθ1[t=0γtR(st,at)]𝔼νθ2[t=0γtR(st,at)]\displaystyle=\underset{\nu_{\theta_{1}}}{\mathbb{E}}\left[\sum_{t=0}^{\infty}\gamma^{t}R(s_{t},a_{t})\right]-\underset{\nu_{\theta_{2}}}{\mathbb{E}}\left[\sum_{t=0}^{\infty}\gamma^{t}R(s_{t},a_{t})\right]
|t=0γtR(st,at)|νθ1νθ2TV\displaystyle\leq\left\lvert\sum_{t=0}^{\infty}\gamma^{t}R(s_{t},a_{t})\right\rvert\left\lVert\nu_{\theta_{1}}-\nu_{\theta_{2}}\right\rVert_{TV}
Rmax1γνθ1νθ2TV,\displaystyle\leq\frac{R_{\max}}{1-\gamma}\left\lVert\nu_{\theta_{1}}-\nu_{\theta_{2}}\right\rVert_{TV},

where \nu_{\theta} is the distribution of the state-action trajectory induced by the transition kernel \mathbb{P} and the policy \pi_{\theta}. Although \nu_{\theta} differs from the state-action visitation distribution \mu_{\theta}, both share the same key property: as \theta varies, each is determined by the same transition kernel and differs only through the policy. Therefore, Lemma 5 also applies to the two distributions \nu_{\theta_{1}} and \nu_{\theta_{2}}, which yields

|Qπθ1(s,a)Qπθ2(s,a)|Rmax1γ|𝒜|Lπ(1+logρσ1+11ρ)θ1θ2.|Q^{\pi_{\theta_{1}}}(s,a)-Q^{\pi_{\theta_{2}}}(s,a)|\leq\frac{R_{\max}}{1-\gamma}\cdot|\mathcal{A}|L_{\pi}\left(1+\lceil\log_{\rho}\sigma^{-1}\rceil+\frac{1}{1-\rho}\right)\left\lVert\theta_{1}-\theta_{2}\right\rVert. (19)

Then we bound the last term on the right-hand side of (18) using similar techniques:

(s,a)Qπθ2(s,a)θ2log(πθ2(a|s))dμθ1(s,a)Qπθ2(s,a)θ2log(πθ2(a|s))dμθ2\displaystyle\underset{(s,a)}{\int}Q^{\pi_{\theta_{2}}}(s,a)\nabla_{\theta_{2}}\log(\pi_{\theta_{2}}(a|s))d\mu_{\theta_{1}}-\underset{(s,a)}{\int}Q^{\pi_{\theta_{2}}}(s,a)\nabla_{\theta_{2}}\log(\pi_{\theta_{2}}(a|s))d\mu_{\theta_{2}}
Qπθ2(s,a)θ2log(πθ2(a|s))μθ1μθ2TV\displaystyle\quad\leq\left\lVert Q^{\pi_{\theta_{2}}}(s,a)\nabla_{\theta_{2}}\log(\pi_{\theta_{2}}(a|s))\right\rVert\left\lVert\mu_{\theta_{1}}-\mu_{\theta_{2}}\right\rVert_{TV}
(i)cΘRmax1γμθ1μθ2TV\displaystyle\quad\overset{\text{(i)}}{\leq}\frac{c_{\Theta}R_{\max}}{1-\gamma}\left\lVert\mu_{\theta_{1}}-\mu_{\theta_{2}}\right\rVert_{TV}
(ii)cΘRmax1γ|𝒜|Lπ(1+logρσ1+11ρ)θ1θ2,\displaystyle\quad\overset{\text{(ii)}}{\leq}\frac{c_{\Theta}R_{\max}}{1-\gamma}\cdot|\mathcal{A}|L_{\pi}\left(1+\lceil\log_{\rho}\sigma^{-1}\rceil+\frac{1}{1-\rho}\right)\left\lVert\theta_{1}-\theta_{2}\right\rVert, (20)

where (i) follows from Assumption 1 and the boundedness of Q^{\pi_{\theta}}(s,a), and (ii) follows from Lemma 5. Substituting the bounds (19) and (20) into (18), we obtain the Lipschitz condition of \nabla J(\theta), i.e.,

J(θ1)J(θ2)\displaystyle\left\lVert\nabla J(\theta_{1})-\nabla J(\theta_{2})\right\rVert\leq RmaxL1γθ1θ2+cΘ(s,a)(Qπθ1(s,a)Qπθ2(s,a))dμθ1\displaystyle\frac{R_{\max}L}{1-\gamma}\left\lVert\theta_{1}-\theta_{2}\right\rVert+c_{\Theta}\underset{(s,a)}{\int}(Q^{\pi_{\theta_{1}}}(s,a)-Q^{\pi_{\theta_{2}}}(s,a))d\mu_{\theta_{1}}
+[(s,a)Qπθ2(s,a)θ2log(πθ2(a|s))dμθ1(s,a)Qπθ2(s,a)θ2log(πθ2(a|s))dμθ2]\displaystyle+\left[\underset{(s,a)}{\int}Q^{\pi_{\theta_{2}}}(s,a)\nabla_{\theta_{2}}\log(\pi_{\theta_{2}}(a|s))d\mu_{\theta_{1}}-\underset{(s,a)}{\int}Q^{\pi_{\theta_{2}}}(s,a)\nabla_{\theta_{2}}\log(\pi_{\theta_{2}}(a|s))d\mu_{\theta_{2}}\right]
\displaystyle\leq [RmaxL1γ+(1+cΘ)Rmax1γ|𝒜|Lπ(1+logρσ1+11ρ)]θ1θ2.\displaystyle\left[\frac{R_{\max}L}{1-\gamma}+\frac{(1+c_{\Theta})R_{\max}}{1-\gamma}\cdot|\mathcal{A}|L_{\pi}\left(1+\lceil\log_{\rho}\sigma^{-1}\rceil+\frac{1}{1-\rho}\right)\right]\left\lVert\theta_{1}-\theta_{2}\right\rVert.

Next we prove (11). Note that the Q-function is bounded. Then we have

J~(θ1;s,a)J~(θ2;s,a)=\displaystyle\left\lVert\nabla\tilde{J}(\theta_{1};s,a)-\nabla\tilde{J}(\theta_{2};s,a)\right\rVert= Qπθ1(s,a)θ1log(πθ1(a|s))Qπθ2(s,a)θ2log(πθ2(a|s))\displaystyle Q^{\pi_{\theta_{1}}}(s,a)\nabla_{\theta_{1}}\log(\pi_{\theta_{1}}(a|s))-Q^{\pi_{\theta_{2}}}(s,a)\nabla_{\theta_{2}}\log(\pi_{\theta_{2}}(a|s))
=\displaystyle= Qπθ1(s,a)(θ1log(πθ1(a|s))θ2log(πθ2(a|s)))\displaystyle Q^{\pi_{\theta_{1}}}(s,a)\left(\nabla_{\theta_{1}}\log(\pi_{\theta_{1}}(a|s))-\nabla_{\theta_{2}}\log(\pi_{\theta_{2}}(a|s))\right)
+θ2log(πθ2(a|s))(Qπθ1(s,a)Qπθ2(s,a))\displaystyle+\nabla_{\theta_{2}}\log(\pi_{\theta_{2}}(a|s))\left(Q^{\pi_{\theta_{1}}}(s,a)-Q^{\pi_{\theta_{2}}}(s,a)\right)
(i)\displaystyle\overset{\text{(i)}}{\leq} RmaxL1γθ1θ2+cΘ(Qπθ1(s,a)Qπθ2(s,a))\displaystyle\frac{R_{\max}L}{1-\gamma}\left\lVert\theta_{1}-\theta_{2}\right\rVert+c_{\Theta}\left(Q^{\pi_{\theta_{1}}}(s,a)-Q^{\pi_{\theta_{2}}}(s,a)\right)
(ii)\displaystyle\overset{\text{(ii)}}{\leq} [RmaxL1γ+cΘ|𝒜|Lπ(1+logρσ1+11ρ)]θ1θ2,\displaystyle\left[\frac{R_{\max}L}{1-\gamma}+c_{\Theta}|\mathcal{A}|L_{\pi}\left(1+\lceil\log_{\rho}\sigma^{-1}\rceil+\frac{1}{1-\rho}\right)\right]\left\lVert\theta_{1}-\theta_{2}\right\rVert,

where (i) follows from Assumption 1 and the boundedness of Qπθ(s,a)Q^{\pi_{\theta}}(s,a), and (ii) follows from (19).

Appendix C Proof of Theorem 1

In this appendix we will first introduce some useful lemmas in Appendix C.1, and then provide technical proofs in Appendix C.2.

C.1 Supporting Lemmas

We first provide two lemmas to deal with the bias of the gradient estimation.

Lemma 6.

For any x,y\in\mathbb{R}^{d} and any positive diagonal matrix V\in\mathbb{R}^{d\times d}, we have

xTVyTr(V)xy.x^{T}Vy\leq Tr(V)\left\lVert x\right\rVert_{\infty}\left\lVert y\right\rVert_{\infty}. (21)
Proof.

Since VV is a diagonal matrix with Vi,i>0,i[d]V_{i,i}>0,\forall i\in[d], we have

xTVy\displaystyle x^{T}Vy =i=1dVi,ixiyi(i)i=1dVi,ixy=Tr(V)xy,\displaystyle=\sum_{i=1}^{d}V_{i,i}x_{i}y_{i}\overset{\text{(i)}}{\leq}\sum_{i=1}^{d}V_{i,i}\left\lVert x\right\rVert_{\infty}\left\lVert y\right\rVert_{\infty}=Tr(V)\left\lVert x\right\rVert_{\infty}\left\lVert y\right\rVert_{\infty},

where (i) follows because V_{i,i}>0 and x_{i}y_{i}\leq\left\lVert x\right\rVert_{\infty}\left\lVert y\right\rVert_{\infty} for all i\in[d]. ∎
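As a quick numerical sanity check of (21) (not part of the proof), the inequality can be verified on random instances with the following Python snippet.

import numpy as np

rng = np.random.default_rng(0)
d = 5
x, y = rng.standard_normal(d), rng.standard_normal(d)
v = rng.uniform(0.1, 2.0, size=d)                    # diagonal entries of a positive diagonal matrix V
lhs = x @ (v * y)                                    # x^T V y
rhs = v.sum() * np.abs(x).max() * np.abs(y).max()    # Tr(V) * ||x||_inf * ||y||_inf
assert lhs <= rhs + 1e-12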

Lemma 7.

Fix a time t and any \tau<t. Initialize Algorithm 1 such that |g_{1,i}|\geq G_{0} for all i\in[d] for some G_{0}>0. Then, under Assumptions 1 and 2, we have

J(θtτ)𝔼[J~(θtτ;st,at)|tτ]G[σρτ+LπG2G0(k=tτt1αk+i=tτt1k=tτiαk)],\left\lVert\nabla J(\theta_{t-\tau})-\mathbb{E}\left[\nabla\tilde{J}(\theta_{t-\tau};s_{t},a_{t})|\mathcal{F}_{t-\tau}\right]\right\rVert\leq G_{\infty}\left[\sigma\rho^{\tau}+\frac{L_{\pi}G_{\infty}}{2G_{0}}\left(\sum_{k=t-\tau}^{t-1}\alpha_{k}+\sum_{i=t-\tau}^{t-1}\sum_{k=t-\tau}^{i}\alpha_{k}\right)\right], (22)

where G=cΘRmax1γG_{\infty}=\frac{c_{\Theta}R_{\max}}{1-\gamma}.

Proof.

For notational brevity, we denote O_{t}=(s_{t},a_{t}). Observe that

J(θtτ)𝔼[J~(θtτ;Ot)|tτ]\displaystyle\left\lVert\nabla J(\theta_{t-\tau})-\mathbb{E}\left[\nabla\tilde{J}(\theta_{t-\tau};O_{t})|\mathcal{F}_{t-\tau}\right]\right\rVert
=(s,a)Qπθtτθlog(πθtτ(a|s))dμθtτ(s,a)Qπθtτθlog(πθtτ(a|s))dP(st,at|tτ)\displaystyle\quad=\left\lVert\underset{(s,a)}{\int}Q^{\pi_{\theta_{t-\tau}}}\nabla_{\theta}\log(\pi_{\theta_{t-\tau}}(a|s))d\mu_{\theta_{t-\tau}}-\underset{(s,a)}{\int}Q^{\pi_{\theta_{t-\tau}}}\nabla_{\theta}\log(\pi_{\theta_{t-\tau}}(a|s))dP(s_{t},a_{t}|\mathcal{F}_{t-\tau})\right\rVert
(i)cΘRmax1γμθtτ(,)P(st,at|tτ)TV\displaystyle\quad\overset{\text{(i)}}{\leq}\frac{c_{\Theta}R_{\max}}{1-\gamma}\left\|\mu_{\theta_{t-\tau}}(\cdot,\cdot)-P(s_{t},a_{t}|\mathcal{F}_{t-\tau})\right\|_{TV}
=Gμθtτ(,)P(st,at|tτ)TV,\displaystyle\quad=G_{\infty}\left\|\mu_{\theta_{t-\tau}}(\cdot,\cdot)-P(s_{t},a_{t}|\mathcal{F}_{t-\tau})\right\|_{TV},

where (i) follows from Assumption 1 and the boundedness of Q^{\pi_{\theta}}. In the following, we bound \left\|\mu_{\theta_{t-\tau}}(\cdot,\cdot)-P(s_{t},a_{t}|\mathcal{F}_{t-\tau})\right\|_{TV}. To this end, we construct an auxiliary Markov chain in which the policy is frozen after time t-\tau. Specifically, in Algorithm 1 the actions are sampled from the time-varying policy \pi_{\theta_{t}}(a_{t}|s_{t}), which yields the Markov chain:

OtτOtτ+1OtOt+1.O_{t-\tau}\rightarrow O_{t-\tau+1}\cdots\rightarrow O_{t}\rightarrow O_{t+1}. (23)

Now, from time tτt-\tau, we repeatedly use the policy πθtτ\pi_{\theta_{t-\tau}}, which yields an auxiliary Markov chain:

OtτO~tτ+1O~tO~t+1,O_{t-\tau}\rightarrow\tilde{O}_{t-\tau+1}\cdots\rightarrow\tilde{O}_{t}\rightarrow\tilde{O}_{t+1}, (24)

where we denote O~t=(s~t,a~t)\tilde{O}_{t}=(\tilde{s}_{t},\tilde{a}_{t}) correspondingly. Clearly we have for k=tτ+1,,tk=t-\tau+1,\dots,t,

P(s~k=|θtτ,s~k1)=a𝒜πθtτ(a~k1=a|s~k1)(s~k=|a~k1=a,s~k1).P(\tilde{s}_{k}=\cdot|\theta_{t-\tau},\tilde{s}_{k-1})=\sum_{a\in\mathcal{A}}\pi_{\theta_{t-\tau}}(\tilde{a}_{k-1}=a|\tilde{s}_{k-1})\mathbb{P}(\tilde{s}_{k}=\cdot|\tilde{a}_{k-1}=a,\tilde{s}_{k-1}). (25)

Since the same policy is used from time t-\tau onward, the Markov chain in (24) is time-homogeneous over this interval and Assumption 2 applies to it. Therefore, we can bound \left\|\mu_{\theta_{t-\tau}}(\cdot,\cdot)-P(\tilde{s}_{t},\tilde{a}_{t}|\mathcal{F}_{t-\tau})\right\|_{TV} as

μθtτ(,)P(s~t,a~t|tτ)TVσρτ.\left\|\mu_{\theta_{t-\tau}}(\cdot,\cdot)-P(\tilde{s}_{t},\tilde{a}_{t}|\mathcal{F}_{t-\tau})\right\|_{TV}\leq\sigma\rho^{\tau}. (26)

Observe that

μθtτ(,)P(st,at|tτ)TV\displaystyle\left\|\mu_{\theta_{t-\tau}}(\cdot,\cdot)-P(s_{t},a_{t}|\mathcal{F}_{t-\tau})\right\|_{TV}
μθtτ(,)P(s~t,a~t|tτ)TV+P(s~t,a~t|tτ)P(st,at|tτ)TV.\displaystyle\quad\leq\left\|\mu_{\theta_{t-\tau}}(\cdot,\cdot)-P(\tilde{s}_{t},\tilde{a}_{t}|\mathcal{F}_{t-\tau})\right\|_{TV}+\left\|P(\tilde{s}_{t},\tilde{a}_{t}|\mathcal{F}_{t-\tau})-P(s_{t},a_{t}|\mathcal{F}_{t-\tau})\right\|_{TV}. (27)

It remains to deal with P(s~t,a~t|tτ)P(st,at|tτ)TV\left\|P(\tilde{s}_{t},\tilde{a}_{t}|\mathcal{F}_{t-\tau})-P(s_{t},a_{t}|\mathcal{F}_{t-\tau})\right\|_{TV}. We proceed as follows

P(s~t,a~t|tτ)P(st,at|tτ)TV\displaystyle\left\|P(\tilde{s}_{t},\tilde{a}_{t}|\mathcal{F}_{t-\tau})-P(s_{t},a_{t}|\mathcal{F}_{t-\tau})\right\|_{TV}
=P(s~t|tτ)πθtτ(a~t|s~t)P(st|tτ)πθt(at|st)TV\displaystyle\quad=\left\|P(\tilde{s}_{t}|\mathcal{F}_{t-\tau})\pi_{\theta_{t-\tau}}(\tilde{a}_{t}|\tilde{s}_{t})-P(s_{t}|\mathcal{F}_{t-\tau})\pi_{\theta_{t}}(a_{t}|s_{t})\right\|_{TV}
=12(s,a)|P(s~t=s|tτ)πθtτ(a~t=a|s~t=s)P(st=s|tτ)πθt(at=a|st=s)|dsda\displaystyle\quad=\frac{1}{2}\underset{(s,a)}{\int}\left|P(\tilde{s}_{t}=s|\mathcal{F}_{t-\tau})\pi_{\theta_{t-\tau}}(\tilde{a}_{t}=a|\tilde{s}_{t}=s)-P(s_{t}=s|\mathcal{F}_{t-\tau})\pi_{\theta_{t}}(a_{t}=a|s_{t}=s)\right|dsda
=12(s,a)|P(s~t=s|tτ)πθtτ(a~t=a|s~t=s)P(s~t=s|tτ)πθt(at=a|st=s)\displaystyle\quad=\frac{1}{2}\underset{(s,a)}{\int}\Big{|}P(\tilde{s}_{t}=s|\mathcal{F}_{t-\tau})\pi_{\theta_{t-\tau}}(\tilde{a}_{t}=a|\tilde{s}_{t}=s)-P(\tilde{s}_{t}=s|\mathcal{F}_{t-\tau})\pi_{\theta_{t}}(a_{t}=a|s_{t}=s)
+P(s~t=s|tτ)πθt(at=a|st=s)P(st=s|tτ)πθt(at=a|st=s)|dsda\displaystyle\quad\quad\ +P(\tilde{s}_{t}=s|\mathcal{F}_{t-\tau})\pi_{\theta_{t}}(a_{t}=a|s_{t}=s)-P(s_{t}=s|\mathcal{F}_{t-\tau})\pi_{\theta_{t}}(a_{t}=a|s_{t}=s)\Big{|}dsda
12(s,a)P(s~t=s|tτ)|πθtτ(a~t=a|s~t=s)πθt(at=a|st=s)|dads\displaystyle\quad\leq\frac{1}{2}\underset{(s,a)}{\int}P(\tilde{s}_{t}=s|\mathcal{F}_{t-\tau})\left|\pi_{\theta_{t-\tau}}(\tilde{a}_{t}=a|\tilde{s}_{t}=s)-\pi_{\theta_{t}}(a_{t}=a|s_{t}=s)\right|dads
+12s|P(s~t=s|tτ)P(st=s|tτ)|aπθt(at=a|st=s)dads\displaystyle\quad\quad\ +\frac{1}{2}\int_{s}\left|P(\tilde{s}_{t}=s|\mathcal{F}_{t-\tau})-P(s_{t}=s|\mathcal{F}_{t-\tau})\right|\int_{a}\pi_{\theta_{t}}(a_{t}=a|s_{t}=s)dads
(i)12Lπθtθtτ+12s|P(s~t=s|tτ)P(st=s|tτ)|ds\displaystyle\quad\overset{\text{(i)}}{\leq}\frac{1}{2}L_{\pi}\left\lVert\theta_{t}-\theta_{t-\tau}\right\rVert+\frac{1}{2}\int_{s}\left|P(\tilde{s}_{t}=s|\mathcal{F}_{t-\tau})-P(s_{t}=s|\mathcal{F}_{t-\tau})\right|ds
12Lπk=tτt1θk+1θk+12s|P(s~t=s|tτ)P(st=s|tτ)|ds\displaystyle\quad\leq\frac{1}{2}L_{\pi}\sum_{k=t-\tau}^{t-1}\left\lVert\theta_{k+1}-\theta_{k}\right\rVert+\frac{1}{2}\int_{s}\left|P(\tilde{s}_{t}=s|\mathcal{F}_{t-\tau})-P(s_{t}=s|\mathcal{F}_{t-\tau})\right|ds
(ii)LπG2G0k=tτt1αk+12s|P(s~t=s|tτ)P(st=s|tτ)|ds\displaystyle\quad\overset{\text{(ii)}}{\leq}\frac{L_{\pi}G_{\infty}}{2G_{0}}\sum_{k=t-\tau}^{t-1}\alpha_{k}+\frac{1}{2}\int_{s}\left|P(\tilde{s}_{t}=s|\mathcal{F}_{t-\tau})-P(s_{t}=s|\mathcal{F}_{t-\tau})\right|ds
=LπG2G0k=tτt1αk+P(s~t|tτ)P(st|tτ)TV,\displaystyle\quad=\frac{L_{\pi}G_{\infty}}{2G_{0}}\sum_{k=t-\tau}^{t-1}\alpha_{k}+\left\|P(\tilde{s}_{t}|\mathcal{F}_{t-\tau})-P(s_{t}|\mathcal{F}_{t-\tau})\right\|_{TV}, (28)

where (i) follows from Assumption 1 and (ii) follows from Lemma 9. It remains to bound the second term of the right hand side. To this end, we first write P(s~t=|tτ)P(\tilde{s}_{t}=\cdot|\mathcal{F}_{t-\tau}) as

P(s~t=|tτ)\displaystyle P(\tilde{s}_{t}=\cdot|\mathcal{F}_{t-\tau}) =sP(s~t1=s|tτ)P(s~t=|s~t1=s)ds\displaystyle=\int_{s}P(\tilde{s}_{t-1}=s|\mathcal{F}_{t-\tau})P(\tilde{s}_{t}=\cdot|\tilde{s}_{t-1}=s)ds
=sP(s~t1=s|tτ)aP(s~t=|s~t1=s,a~t1=a)πθtτ(a~t1=a|s~t1=s)dads.\displaystyle=\int_{s}P(\tilde{s}_{t-1}=s|\mathcal{F}_{t-\tau})\int_{a}P(\tilde{s}_{t}=\cdot|\tilde{s}_{t-1}=s,\tilde{a}_{t-1}=a)\pi_{\theta_{t-\tau}}(\tilde{a}_{t-1}=a|\tilde{s}_{t-1}=s)dads. (29)

Similarly, for P(st=|tτ)P(s_{t}=\cdot|\mathcal{F}_{t-\tau}) we have

P(st=|tτ)\displaystyle P(s_{t}=\cdot|\mathcal{F}_{t-\tau}) =sP(st1=s|tτ)P(st=|st1=s)ds\displaystyle=\int_{s}P(s_{t-1}=s|\mathcal{F}_{t-\tau})P(s_{t}=\cdot|s_{t-1}=s)ds
=sP(st1=s|tτ)aP(st=|st1=s,at1=a)πθt1(at1=a|st1=s)dads.\displaystyle=\int_{s}P(s_{t-1}=s|\mathcal{F}_{t-\tau})\int_{a}P(s_{t}=\cdot|s_{t-1}=s,a_{t-1}=a)\pi_{\theta_{t-1}}(a_{t-1}=a|s_{t-1}=s)dads. (30)

Then we obtain the following bound

|P(s~t=|tτ)P(st=|tτ)|\displaystyle\left|P(\tilde{s}_{t}=\cdot|\mathcal{F}_{t-\tau})-P(s_{t}=\cdot|\mathcal{F}_{t-\tau})\right|
=|sP(s~t1=s|tτ)aP(s~t=|s~t1=s,a~t1=a)πθtτ(a~t1=a|s~t1=s)dads\displaystyle\quad=\Bigg{|}\int_{s}P(\tilde{s}_{t-1}=s|\mathcal{F}_{t-\tau})\int_{a}P(\tilde{s}_{t}=\cdot|\tilde{s}_{t-1}=s,\tilde{a}_{t-1}=a)\pi_{\theta_{t-\tau}}(\tilde{a}_{t-1}=a|\tilde{s}_{t-1}=s)dads
sP(st1=s|tτ)aP(st=|st1=s,at1=a)πθt1(at1=a|st1=s)dads|\displaystyle\quad\qquad-\int_{s}P(s_{t-1}=s|\mathcal{F}_{t-\tau})\int_{a}P(s_{t}=\cdot|s_{t-1}=s,a_{t-1}=a)\pi_{\theta_{t-1}}(a_{t-1}=a|s_{t-1}=s)dads\Bigg{|}
sP(s~t1=s|tτ)|aP(s~t=|s~t1=s,a~t1=a)πθtτ(a~t1=a|s~t1=s)\displaystyle\quad\leq\int_{s}P(\tilde{s}_{t-1}=s|\mathcal{F}_{t-\tau})\Bigg{|}\int_{a}P(\tilde{s}_{t}=\cdot|\tilde{s}_{t-1}=s,\tilde{a}_{t-1}=a)\pi_{\theta_{t-\tau}}(\tilde{a}_{t-1}=a|\tilde{s}_{t-1}=s)
aP(st=|st1=s,at1=a)πθt1(at1=a|st1=s)|dads\displaystyle\quad\quad\ -\int_{a}P(s_{t}=\cdot|s_{t-1}=s,a_{t-1}=a)\pi_{\theta_{t-1}}(a_{t-1}=a|s_{t-1}=s)\Bigg{|}dads
+s|P(s~t1=s|tτ)P(st1=s|tτ)|\displaystyle\quad\quad\ +\int_{s}\Bigg{|}P(\tilde{s}_{t-1}=s|\mathcal{F}_{t-\tau})-P(s_{t-1}=s|\mathcal{F}_{t-\tau})\Bigg{|}\cdot
aP(st=|st1=s,at1=a)πθt1(at1=a|st1=s)dads\displaystyle\quad\quad\qquad\int_{a}P(s_{t}=\cdot|s_{t-1}=s,a_{t-1}=a)\pi_{\theta_{t-1}}(a_{t-1}=a|s_{t-1}=s)dads
sP(s~t1=s|tτ)\displaystyle\quad\leq\int_{s}P(\tilde{s}_{t-1}=s|\mathcal{F}_{t-\tau})\cdot
aP(st=|st1=s,at1=a)|πθtτ(a~t1=a|s~t1=s)πθt1(at1=a|st1=s)|dads\displaystyle\quad\quad\qquad\int_{a}P(s_{t}=\cdot|s_{t-1}=s,a_{t-1}=a)\left|\pi_{\theta_{t-\tau}}(\tilde{a}_{t-1}=a|\tilde{s}_{t-1}=s)-\pi_{\theta_{t-1}}(a_{t-1}=a|s_{t-1}=s)\right|dads
+s|P(s~t1=s|tτ)P(st1=s|tτ)|ds\displaystyle\quad\quad+\int_{s}\left|P(\tilde{s}_{t-1}=s|\mathcal{F}_{t-\tau})-P(s_{t-1}=s|\mathcal{F}_{t-\tau})\right|ds
sP(s~t1=s|tτ)a|πθtτ(a~t1=a|s~t1=s)πθt1(at1=a|st1=s)|dads\displaystyle\quad\leq\int_{s}P(\tilde{s}_{t-1}=s|\mathcal{F}_{t-\tau})\int_{a}\left|\pi_{\theta_{t-\tau}}(\tilde{a}_{t-1}=a|\tilde{s}_{t-1}=s)-\pi_{\theta_{t-1}}(a_{t-1}=a|s_{t-1}=s)\right|dads
+s|P(s~t1=s|tτ)P(st1=s|tτ)|ds\displaystyle\quad\quad+\int_{s}\left|P(\tilde{s}_{t-1}=s|\mathcal{F}_{t-\tau})-P(s_{t-1}=s|\mathcal{F}_{t-\tau})\right|ds
Lπθtτθt1+s|P(s~t1=s|tτ)P(st1=s|tτ)|ds\displaystyle\quad\leq L_{\pi}\left\lVert\theta_{t-\tau}-\theta_{t-1}\right\rVert+\int_{s}\left|P(\tilde{s}_{t-1}=s|\mathcal{F}_{t-\tau})-P(s_{t-1}=s|\mathcal{F}_{t-\tau})\right|ds
Lπθtτθt1+2P(s~t1|tτ)P(st1|tτ)TV.\displaystyle\quad\leq L_{\pi}\left\lVert\theta_{t-\tau}-\theta_{t-1}\right\rVert+2\left\|P(\tilde{s}_{t-1}|\mathcal{F}_{t-\tau})-P(s_{t-1}|\mathcal{F}_{t-\tau})\right\|_{TV}.

Combining the above, we obtain the recursion

P(s~t|tτ)P(st|tτ)TV\displaystyle\left\|P(\tilde{s}_{t}|\mathcal{F}_{t-\tau})-P(s_{t}|\mathcal{F}_{t-\tau})\right\|_{TV} =12s|P(s~t=s|tτ)P(st=s|tτ)|ds\displaystyle=\frac{1}{2}\int_{s}\left|P(\tilde{s}_{t}=s|\mathcal{F}_{t-\tau})-P(s_{t}=s|\mathcal{F}_{t-\tau})\right|ds
12Lπθtτθt1+P(s~t1|tτ)P(st1|tτ)TV.\displaystyle\leq\frac{1}{2}L_{\pi}\left\lVert\theta_{t-\tau}-\theta_{t-1}\right\rVert+\left\|P(\tilde{s}_{t-1}|\mathcal{F}_{t-\tau})-P(s_{t-1}|\mathcal{F}_{t-\tau})\right\|_{TV}. (31)

Applying (31) recursively yields

P(s~t|tτ)P(st|tτ)TV\displaystyle\left\|P(\tilde{s}_{t}|\mathcal{F}_{t-\tau})-P(s_{t}|\mathcal{F}_{t-\tau})\right\|_{TV} 12Lπi=tτt1θtτθi\displaystyle\leq\frac{1}{2}L_{\pi}\sum_{i=t-\tau}^{t-1}\left\lVert\theta_{t-\tau}-\theta_{i}\right\rVert
LπG2G0i=tτt1k=tτiαk.\displaystyle\leq\frac{L_{\pi}G_{\infty}}{2G_{0}}\sum_{i=t-\tau}^{t-1}\sum_{k=t-\tau}^{i}\alpha_{k}. (32)

Substituting (32) into (28) and using the fact that \tau\geq 1, we have

P(s~t,a~t|tτ)P(st,at|tτ)TVLπG2G0(k=tτt1αk+i=tτt1k=tτiαk).\left\|P(\tilde{s}_{t},\tilde{a}_{t}|\mathcal{F}_{t-\tau})-P(s_{t},a_{t}|\mathcal{F}_{t-\tau})\right\|_{TV}\leq\frac{L_{\pi}G_{\infty}}{2G_{0}}\left(\sum_{k=t-\tau}^{t-1}\alpha_{k}+\sum_{i=t-\tau}^{t-1}\sum_{k=t-\tau}^{i}\alpha_{k}\right). (33)

Combining (26) and (33), we obtain

\displaystyle\left\lVert\nabla J(\theta_{t-\tau})-\mathbb{E}\left[\nabla\tilde{J}(\theta_{t-\tau};O_{t})|\mathcal{F}_{t-\tau}\right]\right\rVert \displaystyle\leq G_{\infty}\left\|\mu_{\theta_{t-\tau}}(\cdot,\cdot)-P(s_{t},a_{t}|\mathcal{F}_{t-\tau})\right\|_{TV}
G[σρτ+LπG2G0(k=tτt1αk+i=tτt1k=tτiαk)].\displaystyle\leq G_{\infty}\left[\sigma\rho^{\tau}+\frac{L_{\pi}G_{\infty}}{2G_{0}}\left(\sum_{k=t-\tau}^{t-1}\alpha_{k}+\sum_{i=t-\tau}^{t-1}\sum_{k=t-\tau}^{i}\alpha_{k}\right)\right].

The next few lemmas are closely related to the update rule of AMSGrad.

Lemma 8.

(Zhou et al., 2018, Lemma A.1) Let \{g_{t},m_{t},\hat{v}_{t}\} for t=1,2,\dots be the sequences generated by Algorithm 1, and denote G_{\infty}=\frac{c_{\Theta}R_{\max}}{1-\gamma}. Given \left\lVert g_{t}\right\rVert\leq G_{\infty} as in Lemma 4, we have \left\lVert m_{t}\right\rVert\leq G_{\infty} and \left\lVert\hat{v}_{t}\right\rVert\leq G_{\infty}^{2}.

Lemma 9.

Let \{m_{t},\hat{v}_{t}\} for t=1,2,\dots be the sequences generated by Algorithm 1, and denote G_{\infty}=\frac{c_{\Theta}R_{\max}}{1-\gamma} as in Lemma 8. Then

\left\lVert\theta_{t+1}-\theta_{t}\right\rVert=\left\lVert\alpha_{t}\hat{V}_{t}^{-\frac{1}{2}}m_{t}\right\rVert\leq\frac{\alpha_{t}G_{\infty}}{G_{0}}. (34)
Proof.

The proof follows directly from Lemma 8 and the fact that \hat{v}_{t,i}\geq\hat{v}_{0,i}\geq G_{0}^{2} for all t and all i:

\displaystyle\left\lVert\theta_{t+1}-\theta_{t}\right\rVert=\left\lVert\alpha_{t}\hat{V}_{t}^{-\frac{1}{2}}m_{t}\right\rVert=\alpha_{t}\sqrt{\sum_{i=1}^{d}\frac{m_{t,i}^{2}}{\hat{v}_{t,i}}}\leq\frac{\alpha_{t}}{G_{0}}\left\lVert m_{t}\right\rVert\leq\frac{\alpha_{t}G_{\infty}}{G_{0}}. (35)
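To illustrate the per-step bound (34), the following Python snippet (a hypothetical sketch, not the paper's implementation) performs one AMSGrad-style update with the second-moment estimate kept above G_{0}^{2} and checks that the iterate moves by at most \alpha_{t}G_{\infty}/G_{0}; all variable names, the initialization, and the descent orientation of the step are assumptions made only for this illustration.

import numpy as np

def amsgrad_step(theta, m_prev, v_hat_prev, g, alpha, beta1, beta2):
    # One AMSGrad-style update (sketch): theta_{t+1} = theta_t - alpha_t * V_hat_t^{-1/2} m_t.
    m = beta1 * m_prev + (1.0 - beta1) * g              # first-moment estimate m_t
    v = beta2 * v_hat_prev + (1.0 - beta2) * g ** 2     # second-moment estimate v_t
    v_hat = np.maximum(v_hat_prev, v)                   # elementwise max keeps v_hat_t non-decreasing
    theta_next = theta - alpha * m / np.sqrt(v_hat)     # descent sign chosen to match (39)-(40)
    return theta_next, m, v_hat

rng = np.random.default_rng(1)
d, alpha, beta1, beta2, G0 = 4, 0.01, 0.9, 0.99, 0.5
theta = rng.standard_normal(d)
m, v_hat = np.zeros(d), G0 ** 2 * np.ones(d)            # initialization with v_hat_{0,i} >= G0^2
g = rng.standard_normal(d)                              # stand-in for the stochastic gradient g_t
G_inf = np.linalg.norm(g)                               # here ||g_t|| plays the role of G_inf
theta_next, m, v_hat = amsgrad_step(theta, m, v_hat, g, alpha, beta1, beta2)
assert np.linalg.norm(theta_next - theta) <= alpha * G_inf / G0 + 1e-12   # bound (34)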

Lemma 10.

(Zhou et al., 2018, Lemma A.2) Let \alpha_{t} be the stepsize in Algorithm 1 and \beta_{1},\beta_{2} be the constant hyper-parameters with \beta_{1}<\beta_{2}. Then we have

t=1T𝔼V^t12mt2d(1β1)(1β2)(1β1/β2).\sum_{t=1}^{T}\mathbb{E}\left\lVert\hat{V}_{t}^{-\frac{1}{2}}m_{t}\right\rVert^{2}\leq\frac{d(1-\beta_{1})}{(1-\beta_{2})(1-\beta_{1}/\beta_{2})}. (36)

In addition, taking β1=0\beta_{1}=0 yields mt=gtm_{t}=g_{t}. Thus we further have

t=1T𝔼V^t12gt2d1β2.\sum_{t=1}^{T}\mathbb{E}\left\lVert\hat{V}_{t}^{-\frac{1}{2}}g_{t}\right\rVert^{2}\leq\frac{d}{1-\beta_{2}}. (37)
Proof.

The result follows by specializing the proof of Lemma A.2 in Zhou et al., (2018) to the case p=\frac{1}{2}, q=1; we refer the reader there for more details. ∎

Since the update rule of AMSGrad is involved, it is useful to introduce an auxiliary sequence z_{t}. Defining \theta_{0}=\theta_{1}, we let, for t\geq 1,

z_{t}=\theta_{t}+\frac{\beta_{1}}{1-\beta_{1}}(\theta_{t}-\theta_{t-1}). (38)

The following lemma captures the property of ztz_{t} and its connection with θt\theta_{t}.

Lemma 11.

(Zhou et al., 2018, Lemmas A.3-A.5) Given z_{t} defined in (38), we have for t\geq 2

zt+1zt=β11β1(αt1V^t112αtV^t12)mt1αtV^t12gt,z_{t+1}-z_{t}=\frac{\beta_{1}}{1-\beta_{1}}\left(\alpha_{t-1}\hat{V}_{t-1}^{-\frac{1}{2}}-\alpha_{t}\hat{V}_{t}^{-\frac{1}{2}}\right)m_{t-1}-\alpha_{t}\hat{V}_{t}^{-\frac{1}{2}}g_{t}, (39)

and for t=1t=1,

z2z1=α1V^112g1.z_{2}-z_{1}=-\alpha_{1}\hat{V}_{1}^{-\frac{1}{2}}g_{1}. (40)

Further, we can bound zt+1zt\left\lVert z_{t+1}-z_{t}\right\rVert as

zt+1ztαtV^t12gt+β11β1θt+1θt,\left\lVert z_{t+1}-z_{t}\right\rVert\leq\left\lVert\alpha_{t}\hat{V}_{t}^{-\frac{1}{2}}g_{t}\right\rVert+\frac{\beta_{1}}{1-\beta_{1}}\left\lVert\theta_{t+1}-\theta_{t}\right\rVert, (41)

and we can also bound J(zt)J(θt)\left\lVert\nabla J(z_{t})-\nabla J(\theta_{t})\right\rVert as

J(zt)J(θt)cJβ11β1θt+1θt.\left\lVert\nabla J(z_{t})-\nabla J(\theta_{t})\right\rVert\leq\frac{c_{J}\beta_{1}}{1-\beta_{1}}\left\lVert\theta_{t+1}-\theta_{t}\right\rVert. (42)
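As a further numerical sanity check (again not part of the paper), the identities (39) and (40) can be verified for the auxiliary sequence z_{t}; the Python snippet below assumes a descent-oriented update \theta_{t+1}=\theta_{t}-\alpha_{t}\hat{V}_{t}^{-1/2}m_{t} with m_{0}=0 and the convention \theta_{0}=\theta_{1}, and all numerical values are illustrative.

import numpy as np

rng = np.random.default_rng(2)
d, T, beta1, beta2, G0 = 3, 6, 0.9, 0.99, 0.5
alpha = {t: 0.01 / np.sqrt(t) for t in range(1, T + 1)}      # any positive stepsizes

theta = {1: rng.standard_normal(d)}
theta[0] = theta[1]                                          # convention theta_0 = theta_1
m, v_hat, g = {0: np.zeros(d)}, {0: G0 ** 2 * np.ones(d)}, {}
for t in range(1, T + 1):
    g[t] = rng.standard_normal(d)                            # stand-in for the gradient estimate g_t
    m[t] = beta1 * m[t - 1] + (1 - beta1) * g[t]
    v_hat[t] = np.maximum(v_hat[t - 1], beta2 * v_hat[t - 1] + (1 - beta2) * g[t] ** 2)
    theta[t + 1] = theta[t] - alpha[t] * m[t] / np.sqrt(v_hat[t])

# z_t = theta_t + beta1/(1-beta1) * (theta_t - theta_{t-1}), as in (38).
z = {t: theta[t] + beta1 / (1 - beta1) * (theta[t] - theta[t - 1]) for t in range(1, T + 2)}
for t in range(2, T + 1):                                    # identity (39) for t >= 2
    rhs = (beta1 / (1 - beta1)
           * (alpha[t - 1] / np.sqrt(v_hat[t - 1]) - alpha[t] / np.sqrt(v_hat[t])) * m[t - 1]
           - alpha[t] / np.sqrt(v_hat[t]) * g[t])
    assert np.allclose(z[t + 1] - z[t], rhs)
assert np.allclose(z[2] - z[1], -alpha[1] / np.sqrt(v_hat[1]) * g[1])    # identity (40)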

C.2 Proof of Theorem 1

In the remainder, we complete the proof in the following three steps.

Step 1: Establishing a convergent sequence

First, we observe that J(θ)\nabla J(\theta) is Lipschitz continuous according to Lemma 1. Then we have

J(zt+1)\displaystyle J(z_{t+1})\leq J(zt)+J(zt)T(zt+1zt)+cJ2zt+1zt2\displaystyle J(z_{t})+\nabla J(z_{t})^{T}(z_{t+1}-z_{t})+\frac{c_{J}}{2}\left\lVert z_{t+1}-z_{t}\right\rVert^{2}
=\displaystyle= J(zt)+J(θt)T(zt+1zt)+(J(zt)J(θt))T(zt+1zt)+cJ2zt+1zt2\displaystyle J(z_{t})+\nabla J(\theta_{t})^{T}(z_{t+1}-z_{t})+(\nabla J(z_{t})-\nabla J(\theta_{t}))^{T}(z_{t+1}-z_{t})+\frac{c_{J}}{2}\left\lVert z_{t+1}-z_{t}\right\rVert^{2}
=\displaystyle= J(zt)+J(θt)T[β11β1(αt1V^t112αtV^t12)mt1αtV^t12gt]\displaystyle J(z_{t})+\nabla J(\theta_{t})^{T}\left[\frac{\beta_{1}}{1-\beta_{1}}\left(\alpha_{t-1}\hat{V}_{t-1}^{-\frac{1}{2}}-\alpha_{t}\hat{V}_{t}^{-\frac{1}{2}}\right)m_{t-1}-\alpha_{t}\hat{V}_{t}^{-\frac{1}{2}}g_{t}\right]
+(J(zt)J(θt))T(zt+1zt)+cJ2zt+1zt2\displaystyle+(\nabla J(z_{t})-\nabla J(\theta_{t}))^{T}(z_{t+1}-z_{t})+\frac{c_{J}}{2}\left\lVert z_{t+1}-z_{t}\right\rVert^{2}
=\displaystyle= J(zt)+β11β1J(θt)T(αt1V^t112αtV^t12)mt1T1αtJ(θt)TV^t12gtT2\displaystyle J(z_{t})+\underbrace{\frac{\beta_{1}}{1-\beta_{1}}\nabla J(\theta_{t})^{T}\left(\alpha_{t-1}\hat{V}_{t-1}^{-\frac{1}{2}}-\alpha_{t}\hat{V}_{t}^{-\frac{1}{2}}\right)m_{t-1}}_{T_{1}}\underbrace{-\alpha_{t}\nabla J(\theta_{t})^{T}\hat{V}_{t}^{-\frac{1}{2}}g_{t}}_{T_{2}}
+(J(zt)J(θt))T(zt+1zt)T3+cJ2zt+1zt2T4.\displaystyle+\underbrace{(\nabla J(z_{t})-\nabla J(\theta_{t}))^{T}(z_{t+1}-z_{t})}_{T_{3}}+\underbrace{\frac{c_{J}}{2}\left\lVert z_{t+1}-z_{t}\right\rVert^{2}}_{T_{4}}. (43)

Next we bound the terms T_{1},T_{2},T_{3},T_{4}. The term T_{1} can be bounded as

T1\displaystyle T_{1} (i)β11β1Tr(αt1V^t112αtV^t12)J(θt)mt1\displaystyle\overset{\text{(i)}}{\leq}\frac{\beta_{1}}{1-\beta_{1}}Tr\left(\alpha_{t-1}\hat{V}_{t-1}^{-\frac{1}{2}}-\alpha_{t}\hat{V}_{t}^{-\frac{1}{2}}\right)\left\lVert\nabla J(\theta_{t})\right\rVert_{\infty}\left\lVert m_{t-1}\right\rVert_{\infty}
=β11β1G2Tr(αt1V^t112αtV^t12)\displaystyle=\frac{\beta_{1}}{1-\beta_{1}}G_{\infty}^{2}Tr\left(\alpha_{t-1}\hat{V}_{t-1}^{-\frac{1}{2}}-\alpha_{t}\hat{V}_{t}^{-\frac{1}{2}}\right)
=β11β1G2[Tr(αt1V^t112)Tr(αtV^t12)],\displaystyle=\frac{\beta_{1}}{1-\beta_{1}}G_{\infty}^{2}\left[Tr\left(\alpha_{t-1}\hat{V}_{t-1}^{-\frac{1}{2}}\right)-Tr\left(\alpha_{t}\hat{V}_{t}^{-\frac{1}{2}}\right)\right],

where (i) follows from Lemma 6.

The term T_{2} is the key term to handle under non-i.i.d. sampling, where a non-zero bias arises in the gradient estimation. We bound this term as

T2=\displaystyle T_{2}= αt1J(θt)TV^t112gt+J(θt)T(αt1V^t112αtV^t12)gt\displaystyle-\alpha_{t-1}\nabla J(\theta_{t})^{T}\hat{V}_{t-1}^{-\frac{1}{2}}g_{t}+\nabla J(\theta_{t})^{T}\left(\alpha_{t-1}\hat{V}_{t-1}^{-\frac{1}{2}}-\alpha_{t}\hat{V}_{t}^{-\frac{1}{2}}\right)g_{t}
(i)\displaystyle\overset{\text{(i)}}{\leq} αt1J(θt)TV^t112gt+Tr(αt1V^t112αtV^t12)J(θt)gt\displaystyle-\alpha_{t-1}\nabla J(\theta_{t})^{T}\hat{V}_{t-1}^{-\frac{1}{2}}g_{t}+Tr\left(\alpha_{t-1}\hat{V}_{t-1}^{-\frac{1}{2}}-\alpha_{t}\hat{V}_{t}^{-\frac{1}{2}}\right)\left\lVert\nabla J(\theta_{t})\right\rVert_{\infty}\left\lVert g_{t}\right\rVert_{\infty}
=\displaystyle= αt1J(θt)TV^t112gt+G2[Tr(αt1V^t112)Tr(αtV^t12)]\displaystyle-\alpha_{t-1}\nabla J(\theta_{t})^{T}\hat{V}_{t-1}^{-\frac{1}{2}}g_{t}+G_{\infty}^{2}\left[Tr\left(\alpha_{t-1}\hat{V}_{t-1}^{-\frac{1}{2}}\right)-Tr\left(\alpha_{t}\hat{V}_{t}^{-\frac{1}{2}}\right)\right]
=\displaystyle= αt1J(θt)TV^t112J(θt)+αt1J(θt)TV^t112(J(θt)gt)\displaystyle-\alpha_{t-1}\nabla J(\theta_{t})^{T}\hat{V}_{t-1}^{-\frac{1}{2}}\nabla J(\theta_{t})+\alpha_{t-1}\nabla J(\theta_{t})^{T}\hat{V}_{t-1}^{-\frac{1}{2}}(\nabla J(\theta_{t})-g_{t})
+G2[Tr(αt1V^t112)Tr(αtV^t12)]\displaystyle+G_{\infty}^{2}\left[Tr\left(\alpha_{t-1}\hat{V}_{t-1}^{-\frac{1}{2}}\right)-Tr\left(\alpha_{t}\hat{V}_{t}^{-\frac{1}{2}}\right)\right]
(ii)\displaystyle\overset{\text{(ii)}}{\leq} αt1GJ(θt)2+αt1J(θt)TV^t112(J(θt)gt)\displaystyle-\frac{\alpha_{t-1}}{G_{\infty}}\left\lVert\nabla J(\theta_{t})\right\rVert^{2}+\alpha_{t-1}\nabla J(\theta_{t})^{T}\hat{V}_{t-1}^{-\frac{1}{2}}(\nabla J(\theta_{t})-g_{t})
+G2[Tr(αt1V^t112)Tr(αtV^t12)],\displaystyle+G_{\infty}^{2}\left[Tr\left(\alpha_{t-1}\hat{V}_{t-1}^{-\frac{1}{2}}\right)-Tr\left(\alpha_{t}\hat{V}_{t}^{-\frac{1}{2}}\right)\right],

where (i) follows from Lemma 6, and (ii) follows because \hat{V}_{t} is a positive diagonal matrix whose entries are bounded as in Lemma 8.

Next we bound the term T3T_{3} as

T3\displaystyle T_{3} (i)J(zt)J(θt)zt+1zt\displaystyle\overset{\text{(i)}}{\leq}\left\lVert\nabla J(z_{t})-\nabla J(\theta_{t})\right\rVert\left\lVert z_{t+1}-z_{t}\right\rVert
(ii)(αtV^t12gt+β11β1θt+1θt)cJβ11β1θt+1θt\displaystyle\overset{\text{(ii)}}{\leq}\left(\left\lVert\alpha_{t}\hat{V}_{t}^{-\frac{1}{2}}g_{t}\right\rVert+\frac{\beta_{1}}{1-\beta_{1}}\left\lVert\theta_{t+1}-\theta_{t}\right\rVert\right)\frac{c_{J}\beta_{1}}{1-\beta_{1}}\left\lVert\theta_{t+1}-\theta_{t}\right\rVert
=cJβ11β1θt+1θtαtV^t12gt+cJ(β11β1)2θt+1θt2\displaystyle=\frac{c_{J}\beta_{1}}{1-\beta_{1}}\left\lVert\theta_{t+1}-\theta_{t}\right\rVert\left\lVert\alpha_{t}\hat{V}_{t}^{-\frac{1}{2}}g_{t}\right\rVert+c_{J}\left(\frac{\beta_{1}}{1-\beta_{1}}\right)^{2}\left\lVert\theta_{t+1}-\theta_{t}\right\rVert^{2}
=cJαtV^t12gtcJβ11β1θt+1θt+cJ(β11β1)2θt+1θt2\displaystyle=\sqrt{c_{J}}\left\lVert\alpha_{t}\hat{V}_{t}^{-\frac{1}{2}}g_{t}\right\rVert\cdot\frac{\sqrt{c_{J}}\beta_{1}}{1-\beta_{1}}\left\lVert\theta_{t+1}-\theta_{t}\right\rVert+c_{J}\left(\frac{\beta_{1}}{1-\beta_{1}}\right)^{2}\left\lVert\theta_{t+1}-\theta_{t}\right\rVert^{2}
(iii)cJαtV^t12gt2+2cJ(β11β1)2θt+1θt2\displaystyle\overset{\text{(iii)}}{\leq}c_{J}\left\lVert\alpha_{t}\hat{V}_{t}^{-\frac{1}{2}}g_{t}\right\rVert^{2}+2c_{J}\left(\frac{\beta_{1}}{1-\beta_{1}}\right)^{2}\left\lVert\theta_{t+1}-\theta_{t}\right\rVert^{2}
=(iv)cJαtV^t12gt2+2cJ(β11β1)2αtV^t12mt2,\displaystyle\overset{\text{(iv)}}{=}c_{J}\left\lVert\alpha_{t}\hat{V}_{t}^{-\frac{1}{2}}g_{t}\right\rVert^{2}+2c_{J}\left(\frac{\beta_{1}}{1-\beta_{1}}\right)^{2}\left\lVert\alpha_{t}\hat{V}_{t}^{-\frac{1}{2}}m_{t}\right\rVert^{2},

where (i) follows from the Cauchy-Schwarz inequality, (ii) follows from Lemma 11, (iii) follows because xy\leq x^{2}+y^{2}, and (iv) follows from the update rule of Algorithm 1.

Last, we bound the term T4T_{4} as

T4\displaystyle T_{4} (i)cJ2[αtV^t12gt+β11β1θt+1θt]2\displaystyle\overset{\text{(i)}}{\leq}\frac{c_{J}}{2}\left[\left\lVert\alpha_{t}\hat{V}_{t}^{-\frac{1}{2}}g_{t}\right\rVert+\frac{\beta_{1}}{1-\beta_{1}}\left\lVert\theta_{t+1}-\theta_{t}\right\rVert\right]^{2}
(ii)cJαtV^t12gt2+cJ(β11β1)2θt+1θt2\displaystyle\overset{\text{(ii)}}{\leq}c_{J}\left\lVert\alpha_{t}\hat{V}_{t}^{-\frac{1}{2}}g_{t}\right\rVert^{2}+c_{J}\left(\frac{\beta_{1}}{1-\beta_{1}}\right)^{2}\left\lVert\theta_{t+1}-\theta_{t}\right\rVert^{2}
=cJαtV^t12gt2+cJ(β11β1)2αtV^t12mt2,\displaystyle=c_{J}\left\lVert\alpha_{t}\hat{V}_{t}^{-\frac{1}{2}}g_{t}\right\rVert^{2}+c_{J}\left(\frac{\beta_{1}}{1-\beta_{1}}\right)^{2}\left\lVert\alpha_{t}\hat{V}_{t}^{-\frac{1}{2}}m_{t}\right\rVert^{2},

where (i) follows from Lemma 11, and (ii) follows due to the fact (x+y)22x2+2y2(x+y)^{2}\leq 2x^{2}+2y^{2}.

Substituting the upper bounds on the terms T_{1},T_{2},T_{3} and T_{4} into (43) yields

J(zt+1)\displaystyle J(z_{t+1})\leq J(zt)αt1GJ(θt)2+αt1J(θt)TV^t112(J(θt)gt)\displaystyle J(z_{t})-\frac{\alpha_{t-1}}{G_{\infty}}\left\lVert\nabla J(\theta_{t})\right\rVert^{2}+\alpha_{t-1}\nabla J(\theta_{t})^{T}\hat{V}_{t-1}^{-\frac{1}{2}}(\nabla J(\theta_{t})-g_{t})
+11β1G2[Tr(αt1V^t112)Tr(αtV^t12)]\displaystyle+\frac{1}{1-\beta_{1}}G_{\infty}^{2}\left[Tr\left(\alpha_{t-1}\hat{V}_{t-1}^{-\frac{1}{2}}\right)-Tr\left(\alpha_{t}\hat{V}_{t}^{-\frac{1}{2}}\right)\right]
+2cJαtV^t12gt2+3cJ(β11β1)2αtV^t12mt2.\displaystyle+2c_{J}\left\lVert\alpha_{t}\hat{V}_{t}^{-\frac{1}{2}}g_{t}\right\rVert^{2}+3c_{J}\left(\frac{\beta_{1}}{1-\beta_{1}}\right)^{2}\left\lVert\alpha_{t}\hat{V}_{t}^{-\frac{1}{2}}m_{t}\right\rVert^{2}.

Next we rearrange the above inequality and take expectation over all the randomness to obtain

αt1G𝔼[J(θt)2]\displaystyle\frac{\alpha_{t-1}}{G_{\infty}}\mathbb{E}\left[\left\lVert\nabla J(\theta_{t})\right\rVert^{2}\right]\leq (𝔼[J(zt)]𝔼[J(zt+1)])+G21β1(𝔼[Tr(αt1V^t112)]𝔼[Tr(αtV^t12)])\displaystyle\left(\mathbb{E}[J(z_{t})]-\mathbb{E}[J(z_{t+1})]\right)+\frac{G_{\infty}^{2}}{1-\beta_{1}}\left(\mathbb{E}\left[Tr\left(\alpha_{t-1}\hat{V}_{t-1}^{-\frac{1}{2}}\right)\right]-\mathbb{E}\left[Tr\left(\alpha_{t}\hat{V}_{t}^{-\frac{1}{2}}\right)\right]\right)
+2cJ𝔼αtV^t12gt2+3cJ(β11β1)2𝔼αtV^t12mt2\displaystyle+2c_{J}\mathbb{E}\left\lVert\alpha_{t}\hat{V}_{t}^{-\frac{1}{2}}g_{t}\right\rVert^{2}+3c_{J}\left(\frac{\beta_{1}}{1-\beta_{1}}\right)^{2}\mathbb{E}\left\lVert\alpha_{t}\hat{V}_{t}^{-\frac{1}{2}}m_{t}\right\rVert^{2}
+αt1𝔼[J(θt)TV^t112(J(θt)gt)]\displaystyle+\alpha_{t-1}\mathbb{E}\left[\nabla J(\theta_{t})^{T}\hat{V}_{t-1}^{-\frac{1}{2}}(\nabla J(\theta_{t})-g_{t})\right]
=\displaystyle= (𝔼[J(zt)]𝔼[J(zt+1)])+G21β1(𝔼[Tr(αt1V^t112)]𝔼[Tr(αtV^t12)])\displaystyle\left(\mathbb{E}[J(z_{t})]-\mathbb{E}[J(z_{t+1})]\right)+\frac{G_{\infty}^{2}}{1-\beta_{1}}\left(\mathbb{E}\left[Tr\left(\alpha_{t-1}\hat{V}_{t-1}^{-\frac{1}{2}}\right)\right]-\mathbb{E}\left[Tr\left(\alpha_{t}\hat{V}_{t}^{-\frac{1}{2}}\right)\right]\right)
+2cJ𝔼αtV^t12gt2+3cJ(β11β1)2𝔼αtV^t12mt2\displaystyle+2c_{J}\mathbb{E}\left\lVert\alpha_{t}\hat{V}_{t}^{-\frac{1}{2}}g_{t}\right\rVert^{2}+3c_{J}\left(\frac{\beta_{1}}{1-\beta_{1}}\right)^{2}\mathbb{E}\left\lVert\alpha_{t}\hat{V}_{t}^{-\frac{1}{2}}m_{t}\right\rVert^{2}
+αt1𝔼[𝔼[J(θt)TV^t112(J(θt)gt)|tq]]\displaystyle+\alpha_{t-1}\mathbb{E}\left[\mathbb{E}\left[\nabla J(\theta_{t})^{T}\hat{V}_{t-1}^{-\frac{1}{2}}(\nabla J(\theta_{t})-g_{t})\rvert\mathcal{F}_{t}^{q}\right]\right]
=\displaystyle= (𝔼[J(zt)]𝔼[J(zt+1)])+G21β1(𝔼[Tr(αt1V^t112)]𝔼[Tr(αtV^t12)])\displaystyle\left(\mathbb{E}[J(z_{t})]-\mathbb{E}[J(z_{t+1})]\right)+\frac{G_{\infty}^{2}}{1-\beta_{1}}\left(\mathbb{E}\left[Tr\left(\alpha_{t-1}\hat{V}_{t-1}^{-\frac{1}{2}}\right)\right]-\mathbb{E}\left[Tr\left(\alpha_{t}\hat{V}_{t}^{-\frac{1}{2}}\right)\right]\right)
+2cJ𝔼αtV^t12gt2+3cJ(β11β1)2𝔼αtV^t12mt2\displaystyle+2c_{J}\mathbb{E}\left\lVert\alpha_{t}\hat{V}_{t}^{-\frac{1}{2}}g_{t}\right\rVert^{2}+3c_{J}\left(\frac{\beta_{1}}{1-\beta_{1}}\right)^{2}\mathbb{E}\left\lVert\alpha_{t}\hat{V}_{t}^{-\frac{1}{2}}m_{t}\right\rVert^{2}
+αt1𝔼[J(θt)TV^t112(J(θt)J~(θt;Ot))],\displaystyle+\alpha_{t-1}\mathbb{E}\left[\nabla J(\theta_{t})^{T}\hat{V}_{t-1}^{-\frac{1}{2}}(\nabla J(\theta_{t})-\nabla\tilde{J}(\theta_{t};O_{t}))\right], (44)

where in the last equality we denote O_{t}=(s_{t},a_{t}) for brevity; this notation is used in the sequel as well. We emphasize that the filtration \mathcal{F}_{t}^{q} contains all the samples up to time t except those used for estimating Q^{\pi_{\theta_{t}}}(s_{t},a_{t}). Thus we have \mathbb{E}[g_{t}\rvert\mathcal{F}_{t}^{q}]=\nabla\tilde{J}(\theta_{t};O_{t}), where the expectation is taken over the randomness in the EstQ algorithm.

Step 2: Bounding the bias term

In the following, we bound the bias term 𝔼[J(θt)TV^t112(J(θt)J~(θt;Ot))]\mathbb{E}\left[\nabla J(\theta_{t})^{T}\hat{V}_{t-1}^{-\frac{1}{2}}(\nabla J(\theta_{t})-\nabla\tilde{J}(\theta_{t};O_{t}))\right]. Observe that

𝔼[\displaystyle\mathbb{E}\left[\right. J(θt)TV^t112(J(θt)J~(θt;Ot))]\displaystyle\left.\nabla J(\theta_{t})^{T}\hat{V}_{t-1}^{-\frac{1}{2}}(\nabla J(\theta_{t})-\nabla\tilde{J}(\theta_{t};O_{t}))\right]
=\displaystyle= 𝔼[i=1d1v^t1,iiJ(θt)(iJ(θt)J~i(θt;Ot))]\displaystyle\mathbb{E}\left[\sum_{i=1}^{d}\frac{1}{\sqrt{\hat{v}_{t-1,i}}}\nabla_{i}J(\theta_{t})\cdot(\nabla_{i}J(\theta_{t})-\nabla\tilde{J}_{i}(\theta_{t};O_{t}))\right]
=\displaystyle= 𝔼[i=1d1v^t1,i[iJ(θtτ)(iJ(θtτ)J~i(θtτ;Ot))+iJ(θtτ)(J~i(θtτ;Ot)J~i(θt;Ot))\displaystyle\mathbb{E}\left[\sum_{i=1}^{d}\frac{1}{\sqrt{\hat{v}_{t-1,i}}}\left[\nabla_{i}J(\theta_{t-\tau})\cdot(\nabla_{i}J(\theta_{t-\tau})-\nabla\tilde{J}_{i}(\theta_{t-\tau};O_{t}))+\nabla_{i}J(\theta_{t-\tau})\cdot(\nabla\tilde{J}_{i}(\theta_{t-\tau};O_{t})-\nabla\tilde{J}_{i}(\theta_{t};O_{t}))\right.\right.
+iJ(θtτ)(iJ(θt)iJ(θtτ))+(iJ(θt)iJ(θtτ))(iJ(θt)J~i(θt;Ot))]]\displaystyle+\left.\left.\nabla_{i}J(\theta_{t-\tau})\cdot(\nabla_{i}J(\theta_{t})-\nabla_{i}J(\theta_{t-\tau}))+(\nabla_{i}J(\theta_{t})-\nabla_{i}J(\theta_{t-\tau}))\cdot(\nabla_{i}J(\theta_{t})-\nabla\tilde{J}_{i}(\theta_{t};O_{t}))\right]\right]
=\displaystyle= 𝔼[J(θtτ)TV^t112(J(θtτ)J~(θtτ;Ot))T5+J(θtτ)TV^t112(J~(θtτ;Ot)J~(θt;Ot))T6\displaystyle\mathbb{E}\left[\underbrace{\nabla J(\theta_{t-\tau})^{T}\hat{V}_{t-1}^{-\frac{1}{2}}(\nabla J(\theta_{t-\tau})-\nabla\tilde{J}(\theta_{t-\tau};O_{t}))}_{T5}+\underbrace{\nabla J(\theta_{t-\tau})^{T}\hat{V}_{t-1}^{-\frac{1}{2}}(\nabla\tilde{J}(\theta_{t-\tau};O_{t})-\nabla\tilde{J}(\theta_{t};O_{t}))}_{T_{6}}\right.
+J(θtτ)TV^t112(J(θt)J(θtτ))T7+(J(θt)J(θtτ))TV^t112(J(θt)J~(θt;Ot))T8].\displaystyle\left.+\underbrace{\nabla J(\theta_{t-\tau})^{T}\hat{V}_{t-1}^{-\frac{1}{2}}(\nabla J(\theta_{t})-\nabla J(\theta_{t-\tau}))}_{T_{7}}+\underbrace{(\nabla J(\theta_{t})-\nabla J(\theta_{t-\tau}))^{T}\hat{V}_{t-1}^{-\frac{1}{2}}(\nabla J(\theta_{t})-\nabla\tilde{J}(\theta_{t};O_{t}))}_{T_{8}}\right].

It turns out that the terms T_{6},T_{7},T_{8} are easier to bound, while the term T_{5} is the key to bounding the bias. For clarity of presentation, we first bound the terms T_{6},T_{7},T_{8}.

To bound the term T6T_{6}, we have

T6=\displaystyle T_{6}= J(θtτ)TV^t112(J~(θtτ;Ot)J~(θt;Ot))\displaystyle\nabla J(\theta_{t-\tau})^{T}\hat{V}_{t-1}^{-\frac{1}{2}}(\nabla\tilde{J}(\theta_{t-\tau};O_{t})-\nabla\tilde{J}(\theta_{t};O_{t}))
=\displaystyle= (J(θtτ)TV^t114)(V^t114(J~(θtτ;Ot)J~(θt;Ot)))\displaystyle\left(\nabla J(\theta_{t-\tau})^{T}\hat{V}_{t-1}^{-\frac{1}{4}}\right)\cdot\left(\hat{V}_{t-1}^{-\frac{1}{4}}(\nabla\tilde{J}(\theta_{t-\tau};O_{t})-\nabla\tilde{J}(\theta_{t};O_{t}))\right)
(i)\displaystyle\overset{\text{(i)}}{\leq} J(θtτ)TV^t114V^t114(J~(θtτ;Ot)J~(θt;Ot))\displaystyle\left\lVert\nabla J(\theta_{t-\tau})^{T}\hat{V}_{t-1}^{-\frac{1}{4}}\right\rVert\cdot\left\lVert\hat{V}_{t-1}^{-\frac{1}{4}}(\nabla\tilde{J}(\theta_{t-\tau};O_{t})-\nabla\tilde{J}(\theta_{t};O_{t}))\right\rVert
(ii)\displaystyle\overset{\text{(ii)}}{\leq} 1G0J(θtτ)(J~(θtτ;Ot)J~(θt;Ot))\displaystyle\frac{1}{G_{0}}\left\lVert\nabla J(\theta_{t-\tau})\right\rVert\cdot\left\lVert(\nabla\tilde{J}(\theta_{t-\tau};O_{t})-\nabla\tilde{J}(\theta_{t};O_{t}))\right\rVert
(iii)\displaystyle\overset{\text{(iii)}}{\leq} GG0cJ~θtθtτ\displaystyle\frac{G_{\infty}}{G_{0}}\cdot c_{\tilde{J}}\left\lVert\theta_{t}-\theta_{t-\tau}\right\rVert
(iv)\displaystyle\overset{\text{(iv)}}{\leq} GG0cJ~k=tτt1θk+1θk\displaystyle\frac{G_{\infty}}{G_{0}}\cdot c_{\tilde{J}}\sum_{k=t-\tau}^{t-1}\left\lVert\theta_{k+1}-\theta_{k}\right\rVert
(v)\displaystyle\overset{\text{(v)}}{\leq} GG0cJ~GG0k=tτt1αk,\displaystyle\frac{G_{\infty}}{G_{0}}\cdot\frac{c_{\tilde{J}}G_{\infty}}{G_{0}}\sum_{k=t-\tau}^{t-1}\alpha_{k},

where (i) follows from Cauchy-Schwarz inequality, (ii) follows because v^t,iG02\hat{v}_{t,i}\geq G_{0}^{2}, (iii) follows from Lemma 1 and Lemma 8, (iv) follows by the triangle inequality, and (v) follows due to Lemma 9.

Similarly, we can bound the term T7T_{7} as follows:

T7=\displaystyle T_{7}= J(θtτ)TV^t112(J(θtτ)J(θt))\displaystyle\nabla J(\theta_{t-\tau})^{T}\hat{V}_{t-1}^{-\frac{1}{2}}(\nabla J(\theta_{t-\tau})-\nabla J(\theta_{t}))
\displaystyle\leq J(θtτ)TV^t114V^t114(J(θtτ)J(θt))\displaystyle\left\lVert\nabla J(\theta_{t-\tau})^{T}\hat{V}_{t-1}^{-\frac{1}{4}}\right\rVert\cdot\left\lVert\hat{V}_{t-1}^{-\frac{1}{4}}(\nabla J(\theta_{t-\tau})-\nabla J(\theta_{t}))\right\rVert
\displaystyle\leq 1G0J(θtτ)(J(θtτ)J(θt))\displaystyle\frac{1}{G_{0}}\left\lVert\nabla J(\theta_{t-\tau})\right\rVert\cdot\left\lVert(\nabla J(\theta_{t-\tau})-\nabla J(\theta_{t}))\right\rVert
\displaystyle\leq GG0cJθtθtτ\displaystyle\frac{G_{\infty}}{G_{0}}\cdot c_{J}\left\lVert\theta_{t}-\theta_{t-\tau}\right\rVert
\displaystyle\leq GG0cJk=tτt1θk+1θk\displaystyle\frac{G_{\infty}}{G_{0}}\cdot c_{J}\sum_{k=t-\tau}^{t-1}\left\lVert\theta_{k+1}-\theta_{k}\right\rVert
\displaystyle\leq GG0cJGG0k=tτt1αk.\displaystyle\frac{G_{\infty}}{G_{0}}\cdot\frac{c_{J}G_{\infty}}{G_{0}}\sum_{k=t-\tau}^{t-1}\alpha_{k}.

Next, we bound the term T8T_{8} and obtain

T8=\displaystyle T_{8}= (J(θt)J(θtτ))TV^t112(J(θt)J~(θt;Ot))\displaystyle(\nabla J(\theta_{t})-\nabla J(\theta_{t-\tau}))^{T}\hat{V}_{t-1}^{-\frac{1}{2}}(\nabla J(\theta_{t})-\nabla\tilde{J}(\theta_{t};O_{t}))
\displaystyle\leq (J(θt)J(θtτ))TV^t114V^t114(J(θt)J~(θt;Ot))\displaystyle\left\lVert(\nabla J(\theta_{t})-\nabla J(\theta_{t-\tau}))^{T}\hat{V}_{t-1}^{-\frac{1}{4}}\right\rVert\cdot\left\lVert\hat{V}_{t-1}^{-\frac{1}{4}}(\nabla J(\theta_{t})-\nabla\tilde{J}(\theta_{t};O_{t}))\right\rVert
\displaystyle\leq 1G0J(θtτ)J(θt)J(θt)J~(θt;Ot)\displaystyle\frac{1}{G_{0}}\left\lVert\nabla J(\theta_{t-\tau})-\nabla J(\theta_{t})\right\rVert\cdot\left\lVert\nabla J(\theta_{t})-\nabla\tilde{J}(\theta_{t};O_{t})\right\rVert
\displaystyle\leq 1G0(J(θt)+J~(θt;Ot))cJθtθtτ\displaystyle\frac{1}{G_{0}}\left(\left\lVert\nabla J(\theta_{t})\right\rVert+\left\lVert\nabla\tilde{J}(\theta_{t};O_{t})\right\rVert\right)\cdot c_{J}\left\lVert\theta_{t}-\theta_{t-\tau}\right\rVert
\displaystyle\leq 2GG0cJGG0k=tτt1αk.\displaystyle\frac{2G_{\infty}}{G_{0}}\cdot\frac{c_{J}G_{\infty}}{G_{0}}\sum_{k=t-\tau}^{t-1}\alpha_{k}.

Last, it remains to bound the term 𝔼[T5]\mathbb{E}[T_{5}]. Observe that 𝔼[T5]=𝔼[𝔼[T5|tτ]]\mathbb{E}[T_{5}]=\mathbb{E}[\mathbb{E}[T_{5}|\mathcal{F}_{t-\tau}]]. We first deal with 𝔼[T5|tτ]\mathbb{E}[T_{5}|\mathcal{F}_{t-\tau}] as

𝔼[T5|tτ]=\displaystyle\mathbb{E}[T_{5}|\mathcal{F}_{t-\tau}]= 𝔼[J(θtτ)TV^t112(J(θtτ)J~(θtτ;Ot))|tτ]\displaystyle\mathbb{E}\left[\nabla J(\theta_{t-\tau})^{T}\hat{V}_{t-1}^{-\frac{1}{2}}(\nabla J(\theta_{t-\tau})-\nabla\tilde{J}(\theta_{t-\tau};O_{t}))|\mathcal{F}_{t-\tau}\right]
(i)\displaystyle\overset{\text{(i)}}{\leq} Tr(𝔼[V^t112|tτ])J(θtτ)J(θtτ)𝔼[J~(θtτ;Ot)|tτ]\displaystyle Tr\left(\mathbb{E}\left[\hat{V}_{t-1}^{-\frac{1}{2}}\rvert\mathcal{F}_{t-\tau}\right]\right)\left\lVert\nabla J(\theta_{t-\tau})\right\rVert_{\infty}\cdot\left\lVert\nabla J(\theta_{t-\tau})-\mathbb{E}\left[\nabla\tilde{J}(\theta_{t-\tau};O_{t})|\mathcal{F}_{t-\tau}\right]\right\rVert_{\infty}
\displaystyle\leq dG0J(θtτ)J(θtτ)𝔼[J~(θtτ;Ot)|tτ]\displaystyle\frac{d}{G_{0}}\left\lVert\nabla J(\theta_{t-\tau})\right\rVert\cdot\left\lVert\nabla J(\theta_{t-\tau})-\mathbb{E}\left[\nabla\tilde{J}(\theta_{t-\tau};O_{t})|\mathcal{F}_{t-\tau}\right]\right\rVert
\displaystyle\leq dGG0J(θtτ)𝔼[J~(θtτ;Ot)|tτ],\displaystyle\frac{dG_{\infty}}{G_{0}}\cdot\left\lVert\nabla J(\theta_{t-\tau})-\mathbb{E}\left[\nabla\tilde{J}(\theta_{t-\tau};O_{t})|\mathcal{F}_{t-\tau}\right]\right\rVert,

where (i) follows from Lemma 6. The key to bounding the bias is thus the term \left\lVert\nabla J(\theta_{t-\tau})-\mathbb{E}\left[\nabla\tilde{J}(\theta_{t-\tau};O_{t})|\mathcal{F}_{t-\tau}\right]\right\rVert, which has been bounded in Lemma 7.

To conclude, for a fixed \tau<t, the bias term can be bounded as:

𝔼\displaystyle\mathbb{E} [J(θt)TV^t112(J(θt)J~(θt;Ot))]\displaystyle\left[\nabla J(\theta_{t})^{T}\hat{V}_{t-1}^{-\frac{1}{2}}(\nabla J(\theta_{t})-\nabla\tilde{J}(\theta_{t};O_{t}))\right]
\displaystyle\leq GG0[(3cJ+cJ~)GG0k=tτt1αk+dG[σρτ+LπG2G0(k=tτt1αk+i=tτt1k=tτiαk)]].\displaystyle\frac{G_{\infty}}{G_{0}}\left[\frac{(3c_{J}+c_{\tilde{J}})G_{\infty}}{G_{0}}\sum_{k=t-\tau}^{t-1}\alpha_{k}+dG_{\infty}\left[\sigma\rho^{\tau}+\frac{L_{\pi}G_{\infty}}{2G_{0}}\left(\sum_{k=t-\tau}^{t-1}\alpha_{k}+\sum_{i=t-\tau}^{t-1}\sum_{k=t-\tau}^{i}\alpha_{k}\right)\right]\right]. (45)

Step 3: Establishing convergence to a stationary point

For the case with a constant stepsize \alpha_{t}=\alpha, we choose \tau^{*}=\min\{\tau:\sigma\rho^{\tau}\leq\alpha\}. To take the summation over the time steps, notice that the bound in (45) holds only when t>\tau^{*}, and hence we separate the summation of the bias term into two parts as follows:

t=1T\displaystyle\sum_{t=1}^{T} αG𝔼[J(θt)2]\displaystyle\frac{\alpha}{G_{\infty}}\mathbb{E}\left[\left\lVert\nabla J(\theta_{t})\right\rVert^{2}\right]
\displaystyle\leq t=1T(𝔼[J(zt)]𝔼[J(zt+1)])+G21β1t=1T(𝔼[Tr(αV^t112)]𝔼[Tr(αV^t12)])\displaystyle\sum_{t=1}^{T}\left(\mathbb{E}[J(z_{t})]-\mathbb{E}[J(z_{t+1})]\right)+\frac{G_{\infty}^{2}}{1-\beta_{1}}\sum_{t=1}^{T}\left(\mathbb{E}\left[Tr\left(\alpha\hat{V}_{t-1}^{-\frac{1}{2}}\right)\right]-\mathbb{E}\left[Tr\left(\alpha\hat{V}_{t}^{-\frac{1}{2}}\right)\right]\right)
+2cJt=1T𝔼αV^t12gt2+3cJ(β11β1)2t=1T𝔼αV^t12mt2\displaystyle+2c_{J}\sum_{t=1}^{T}\mathbb{E}\left\lVert\alpha\hat{V}_{t}^{-\frac{1}{2}}g_{t}\right\rVert^{2}+3c_{J}\left(\frac{\beta_{1}}{1-\beta_{1}}\right)^{2}\sum_{t=1}^{T}\mathbb{E}\left\lVert\alpha\hat{V}_{t}^{-\frac{1}{2}}m_{t}\right\rVert^{2}
+t=1Tα𝔼[J(θt)TV^t112(J(θt)J~(θt;Ot))]\displaystyle+\sum_{t=1}^{T}\alpha\mathbb{E}\left[\nabla J(\theta_{t})^{T}\hat{V}_{t-1}^{-\frac{1}{2}}(\nabla J(\theta_{t})-\nabla\tilde{J}(\theta_{t};O_{t}))\right]
=\displaystyle= t=1T(𝔼[J(zt)]𝔼[J(zt+1)])+G21β1t=1T(𝔼[Tr(αV^t112)]𝔼[Tr(αV^t12)])\displaystyle\sum_{t=1}^{T}\left(\mathbb{E}[J(z_{t})]-\mathbb{E}[J(z_{t+1})]\right)+\frac{G_{\infty}^{2}}{1-\beta_{1}}\sum_{t=1}^{T}\left(\mathbb{E}\left[Tr\left(\alpha\hat{V}_{t-1}^{-\frac{1}{2}}\right)\right]-\mathbb{E}\left[Tr\left(\alpha\hat{V}_{t}^{-\frac{1}{2}}\right)\right]\right)
+2cJt=1T𝔼αV^t12gt2+3cJ(β11β1)2t=1T𝔼αV^t12mt2\displaystyle+2c_{J}\sum_{t=1}^{T}\mathbb{E}\left\lVert\alpha\hat{V}_{t}^{-\frac{1}{2}}g_{t}\right\rVert^{2}+3c_{J}\left(\frac{\beta_{1}}{1-\beta_{1}}\right)^{2}\sum_{t=1}^{T}\mathbb{E}\left\lVert\alpha\hat{V}_{t}^{-\frac{1}{2}}m_{t}\right\rVert^{2}
+t=1τα𝔼[J(θt)TV^t112(J(θt)J~(θt;Ot))]+t=τ+1Tα𝔼[J(θt)TV^t112(J(θt)J~(θt;Ot))],\displaystyle+\sum_{t=1}^{\tau^{*}}\alpha\mathbb{E}\left[\nabla J(\theta_{t})^{T}\hat{V}_{t-1}^{-\frac{1}{2}}(\nabla J(\theta_{t})-\nabla\tilde{J}(\theta_{t};O_{t}))\right]+\sum_{t=\tau^{*}+1}^{T}\alpha\mathbb{E}\left[\nabla J(\theta_{t})^{T}\hat{V}_{t-1}^{-\frac{1}{2}}(\nabla J(\theta_{t})-\nabla\tilde{J}(\theta_{t};O_{t}))\right], (46)

Then applying (45) to (46) yields

t=1TαG𝔼[J(θt)2]\displaystyle\sum_{t=1}^{T}\frac{\alpha}{G_{\infty}}\mathbb{E}\left[\left\lVert\nabla J(\theta_{t})\right\rVert^{2}\right]\leq t=1T(𝔼[J(zt)]𝔼[J(zt+1)])+G21β1t=1T(𝔼[Tr(αV^t112)]𝔼[Tr(αV^t12)])\displaystyle\sum_{t=1}^{T}\left(\mathbb{E}[J(z_{t})]-\mathbb{E}[J(z_{t+1})]\right)+\frac{G_{\infty}^{2}}{1-\beta_{1}}\sum_{t=1}^{T}\left(\mathbb{E}\left[Tr\left(\alpha\hat{V}_{t-1}^{-\frac{1}{2}}\right)\right]-\mathbb{E}\left[Tr\left(\alpha\hat{V}_{t}^{-\frac{1}{2}}\right)\right]\right)
+2cJt=1T𝔼αV^t12gt2+3cJ(β11β1)2t=1T𝔼αV^t12mt2\displaystyle+2c_{J}\sum_{t=1}^{T}\mathbb{E}\left\lVert\alpha\hat{V}_{t}^{-\frac{1}{2}}g_{t}\right\rVert^{2}+3c_{J}\left(\frac{\beta_{1}}{1-\beta_{1}}\right)^{2}\sum_{t=1}^{T}\mathbb{E}\left\lVert\alpha\hat{V}_{t}^{-\frac{1}{2}}m_{t}\right\rVert^{2}
+t=1Tα𝔼[J(θt)TV^t112(J(θt)J~(θt;Ot))]\displaystyle+\sum_{t=1}^{T}\alpha\mathbb{E}\left[\nabla J(\theta_{t})^{T}\hat{V}_{t-1}^{-\frac{1}{2}}(\nabla J(\theta_{t})-\nabla\tilde{J}(\theta_{t};O_{t}))\right]
(i)\displaystyle\overset{\text{(i)}}{\leq} 𝔼[J(z1)]+G21β1𝔼[Tr(α1V^112)]+2dcJα21β2+3dcJβ12α2(1β1)(1β2)(1β1/β2)\displaystyle\mathbb{E}[J(z_{1})]+\frac{G_{\infty}^{2}}{1-\beta_{1}}\mathbb{E}\left[Tr\left(\alpha_{1}\hat{V}_{1}^{-\frac{1}{2}}\right)\right]+\frac{2dc_{J}\alpha^{2}}{1-\beta_{2}}+\frac{3dc_{J}\beta_{1}^{2}\alpha^{2}}{(1-\beta_{1})(1-\beta_{2})(1-\beta_{1}/\beta_{2})}
+t=1τα𝔼[J(θt)TV^t112(J(θt)J~(θt;Ot))]\displaystyle+\sum_{t=1}^{\tau^{*}}\alpha\mathbb{E}\left[\nabla J(\theta_{t})^{T}\hat{V}_{t-1}^{-\frac{1}{2}}(\nabla J(\theta_{t})-\nabla\tilde{J}(\theta_{t};O_{t}))\right]
+αGG0t=τ+1T[(3cJ+cJ~)GτG0α+dG(σρτ+LπG(2τ+(τ)2)2G0α)]\displaystyle+\frac{\alpha G_{\infty}}{G_{0}}\sum_{t=\tau^{*}+1}^{T}\left[\frac{(3c_{J}+c_{\tilde{J}})G_{\infty}\tau^{*}}{G_{0}}\alpha+dG_{\infty}\left(\sigma\rho^{\tau^{*}}+\frac{L_{\pi}G_{\infty}(2\tau^{*}+(\tau^{*})^{2})}{2G_{0}}\alpha\right)\right]
(ii)\displaystyle\overset{\text{(ii)}}{\leq} 𝔼[J(z1)]+αdG2G0(1β1)+2dcJα21β2+3dcJβ12α2(1β1)(1β2)(1β1/β2)\displaystyle\mathbb{E}[J(z_{1})]+\frac{\alpha dG_{\infty}^{2}}{G_{0}(1-\beta_{1})}+\frac{2dc_{J}\alpha^{2}}{1-\beta_{2}}+\frac{3dc_{J}\beta_{1}^{2}\alpha^{2}}{(1-\beta_{1})(1-\beta_{2})(1-\beta_{1}/\beta_{2})}
+t=1τα𝔼[J(θt)TV^t112(J(θt)J~(θt;Ot))]\displaystyle+\sum_{t=1}^{\tau^{*}}\alpha\mathbb{E}\left[\nabla J(\theta_{t})^{T}\hat{V}_{t-1}^{-\frac{1}{2}}(\nabla J(\theta_{t})-\nabla\tilde{J}(\theta_{t};O_{t}))\right]
+α2GG0[(3cJ+cJ~)GτG0+dG(1+LπG(2τ+(τ)2)2G0)](Tτ)\displaystyle+\frac{\alpha^{2}G_{\infty}}{G_{0}}\left[\frac{(3c_{J}+c_{\tilde{J}})G_{\infty}\tau^{*}}{G_{0}}+dG_{\infty}\left(1+\frac{L_{\pi}G_{\infty}(2\tau^{*}+(\tau^{*})^{2})}{2G_{0}}\right)\right](T-\tau^{*})
(iii)\displaystyle\overset{\text{(iii)}}{\leq} 𝔼[J(z1)]+αdG2G0(1β1)+dcJα2(3β12+2(1β1)(1β1/β2))(1β1)(1β2)(1β1/β2)+2ατG0G2\displaystyle\mathbb{E}[J(z_{1})]+\frac{\alpha dG_{\infty}^{2}}{G_{0}(1-\beta_{1})}+\frac{dc_{J}\alpha^{2}(3\beta_{1}^{2}+2(1-\beta_{1})(1-\beta_{1}/\beta_{2}))}{(1-\beta_{1})(1-\beta_{2})(1-\beta_{1}/\beta_{2})}+\frac{2\alpha\tau^{*}}{G_{0}}G_{\infty}^{2}
+α2GG0[(3cJ+cJ~)GτG0+dG(1+LπG(2τ+(τ)2)2G0)](Tτ),\displaystyle+\frac{\alpha^{2}G_{\infty}}{G_{0}}\left[\frac{(3c_{J}+c_{\tilde{J}})G_{\infty}\tau^{*}}{G_{0}}+dG_{\infty}\left(1+\frac{L_{\pi}G_{\infty}(2\tau^{*}+(\tau^{*})^{2})}{2G_{0}}\right)\right](T-\tau^{*}), (47)

where (i) follows from Lemma 10, (ii) follows from the definition of τ\tau^{*} and (iii) follows since

J(θt)TV^t112(J(θt)J~(θt;Ot))\displaystyle\nabla J(\theta_{t})^{T}\hat{V}_{t-1}^{-\frac{1}{2}}(\nabla J(\theta_{t})-\nabla\tilde{J}(\theta_{t};O_{t})) J(θt)TV^t114V^t114(J(θt)J~(θt))\displaystyle\leq\left\lVert\nabla J(\theta_{t})^{T}\hat{V}_{t-1}^{-\frac{1}{4}}\right\rVert\cdot\left\lVert\hat{V}_{t-1}^{-\frac{1}{4}}(\nabla J(\theta_{t})-\nabla\tilde{J}(\theta_{t}))\right\rVert
1G0J(θt)(J(θt)+J~(θt;Ot))\displaystyle\leq\frac{1}{G_{0}}\left\lVert\nabla J(\theta_{t})\right\rVert\left(\left\lVert\nabla J(\theta_{t})\right\rVert+\left\lVert\nabla\tilde{J}(\theta_{t};O_{t})\right\rVert\right)
2G0G2.\displaystyle\leq\frac{2}{G_{0}}G_{\infty}^{2}.

Finally, we complete the proof by multiplying both sides of (47) by G_{\infty}/(\alpha T), which yields

mint[T]𝔼[J(θt)2]1Tt=1T𝔼[J(θt)2]C1T+αC2(Tτ)TC1T+αC2,\displaystyle\underset{t\in[T]}{\min}\mathbb{E}\left[\left\lVert\nabla J(\theta_{t})\right\rVert^{2}\right]\leq\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\left[\left\lVert\nabla J(\theta_{t})\right\rVert^{2}\right]\leq\frac{C_{1}}{T}+\frac{\alpha C_{2}(T-\tau^{*})}{T}\leq\frac{C_{1}}{T}+\alpha C_{2},

where

C1=G𝔼[J(z1)]α+dG3G0(1β1)+dcJαG(3β12+2(1β1)(1β1/β2))(1β1)(1β2)(1β1/β2)+2GτG0G2,\displaystyle C_{1}=\frac{G_{\infty}\mathbb{E}[J(z_{1})]}{\alpha}+\frac{dG_{\infty}^{3}}{G_{0}(1-\beta_{1})}+\frac{dc_{J}\alpha G_{\infty}(3\beta_{1}^{2}+2(1-\beta_{1})(1-\beta_{1}/\beta_{2}))}{(1-\beta_{1})(1-\beta_{2})(1-\beta_{1}/\beta_{2})}+\frac{2G_{\infty}\tau^{*}}{G_{0}}G_{\infty}^{2},
C2=G2G0[(3cJ+cJ~)GτG0+dG(1+LπG(2τ+(τ)2)2G0)].\displaystyle C_{2}=\frac{G_{\infty}^{2}}{G_{0}}\left[\frac{(3c_{J}+c_{\tilde{J}})G_{\infty}\tau^{*}}{G_{0}}+dG_{\infty}\left(1+\frac{L_{\pi}G_{\infty}(2\tau^{*}+(\tau^{*})^{2})}{2G_{0}}\right)\right].
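To make the structure of this bound concrete, the following minimal sketch evaluates C_{1}/T+\alpha C_{2} for hypothetical values of C_{1}, C_{2}, and \alpha; it only illustrates the decay at rate 1/T toward a neighborhood of radius \alpha C_{2} and does not reproduce the actual problem-dependent constants.

def constant_stepsize_bound(C1, C2, alpha, T):
    """Evaluate the bound C1 / T + alpha * C2 from the constant-stepsize analysis."""
    return C1 / T + alpha * C2

# Hypothetical constants, chosen only for illustration.
C1, C2, alpha = 50.0, 10.0, 1e-2
for T in (10, 100, 1000, 10000):
    print(T, constant_stepsize_bound(C1, C2, alpha, T))
# The bound decreases toward the floor alpha * C2 = 0.1 as T grows.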

Appendix D Proof of Theorem 2

The proof of Theorem 2 starts from steps similar to those of Theorem 1; the difference begins from (C.2). Here the stepsize \alpha_{t} is no longer constant. We divide both sides of (C.2) by \alpha_{t-1} and obtain

1G\displaystyle\frac{1}{G_{\infty}} 𝔼[J(θt)2]\displaystyle\mathbb{E}\left[\left\lVert\nabla J(\theta_{t})\right\rVert^{2}\right]
(𝔼[J(zt)]𝔼[J(zt+1)])αt1+G2αt1(1β1)(𝔼[Tr(αt1V^t112)]𝔼[Tr(αtV^t12)])\displaystyle\leq\frac{\left(\mathbb{E}[J(z_{t})]-\mathbb{E}[J(z_{t+1})]\right)}{\alpha_{t-1}}+\frac{G_{\infty}^{2}}{\alpha_{t-1}(1-\beta_{1})}\left(\mathbb{E}\left[Tr\left(\alpha_{t-1}\hat{V}_{t-1}^{-\frac{1}{2}}\right)\right]-\mathbb{E}\left[Tr\left(\alpha_{t}\hat{V}_{t}^{-\frac{1}{2}}\right)\right]\right)
+2cJαt1𝔼αtV^t12gt2+3cJαt1(β11β1)2𝔼αtV^t12mt2\displaystyle\quad+\frac{2c_{J}}{\alpha_{t-1}}\mathbb{E}\left\lVert\alpha_{t}\hat{V}_{t}^{-\frac{1}{2}}g_{t}\right\rVert^{2}+\frac{3c_{J}}{\alpha_{t-1}}\left(\frac{\beta_{1}}{1-\beta_{1}}\right)^{2}\mathbb{E}\left\lVert\alpha_{t}\hat{V}_{t}^{-\frac{1}{2}}m_{t}\right\rVert^{2}
+𝔼[J(θt)TV^t112(J(θt)J~(θt;Ot))]\displaystyle\quad+\mathbb{E}\left[\nabla J(\theta_{t})^{T}\hat{V}_{t-1}^{-\frac{1}{2}}(\nabla J(\theta_{t})-\nabla\tilde{J}(\theta_{t};O_{t}))\right]
:=ftft+1αt1+2cJαt1𝔼αtV^t12gt2+3cJαt1(β11β1)2𝔼αtV^t12mt2\displaystyle:=\frac{f_{t}-f_{t+1}}{\alpha_{t-1}}+\frac{2c_{J}}{\alpha_{t-1}}\mathbb{E}\left\lVert\alpha_{t}\hat{V}_{t}^{-\frac{1}{2}}g_{t}\right\rVert^{2}+\frac{3c_{J}}{\alpha_{t-1}}\left(\frac{\beta_{1}}{1-\beta_{1}}\right)^{2}\mathbb{E}\left\lVert\alpha_{t}\hat{V}_{t}^{-\frac{1}{2}}m_{t}\right\rVert^{2}
+𝔼[J(θt)TV^t112(J(θt)J~(θt;Ot))],\displaystyle\quad+\mathbb{E}\left[\nabla J(\theta_{t})^{T}\hat{V}_{t-1}^{-\frac{1}{2}}(\nabla J(\theta_{t})-\nabla\tilde{J}(\theta_{t};O_{t}))\right],

where

ft=𝔼[J(zt)]+G21β1𝔼[Tr(αt1V^t112)].f_{t}=\mathbb{E}[J(z_{t})]+\frac{G_{\infty}^{2}}{1-\beta_{1}}\mathbb{E}\left[Tr\left(\alpha_{t-1}\hat{V}_{t-1}^{-\frac{1}{2}}\right)\right].

We choose \tau^{*}=\min\{\tau:\sigma\rho^{\tau}\leq\alpha_{T}\}. Again we choose \tau=t if t\leq\tau^{*} and \tau=\tau^{*} if t>\tau^{*}. If t>\tau^{*}, then the choice of \alpha_{t} yields

𝔼[J(θt)TV^t112(J(θt)J~(θt;Ot))]\displaystyle\mathbb{E}\left[\nabla J(\theta_{t})^{T}\hat{V}_{t-1}^{-\frac{1}{2}}(\nabla J(\theta_{t})-\nabla\tilde{J}(\theta_{t};O_{t}))\right]
GG0[(3cJ+cJ~)GG0k=tτt1αk+dG[σρτ+LπG2G0(k=tτt1αk+i=tτt1k=tτiαk)]]\displaystyle\quad\leq\frac{G_{\infty}}{G_{0}}\left[\frac{(3c_{J}+c_{\tilde{J}})G_{\infty}}{G_{0}}\sum_{k=t-\tau^{*}}^{t-1}\alpha_{k}+dG_{\infty}\left[\sigma\rho^{\tau^{*}}+\frac{L_{\pi}G_{\infty}}{2G_{0}}\left(\sum_{k=t-\tau^{*}}^{t-1}\alpha_{k}+\sum_{i=t-\tau^{*}}^{t-1}\sum_{k=t-\tau^{*}}^{i}\alpha_{k}\right)\right]\right]
GG0[(3cJ+cJ~)GG0ατtτ+dG(αT+LπG(1+τ)2G0ατtτ)],\displaystyle\quad\leq\frac{G_{\infty}}{G_{0}}\left[\frac{(3c_{J}+c_{\tilde{J}})G_{\infty}}{G_{0}}\frac{\alpha\tau^{*}}{\sqrt{t-\tau^{*}}}+dG_{\infty}\left(\frac{\alpha}{\sqrt{T}}+\frac{L_{\pi}G_{\infty}(1+\tau^{*})}{2G_{0}}\frac{\alpha\tau^{*}}{\sqrt{t-\tau^{*}}}\right)\right],

where the last inequality follows because

\sum_{k=t-\tau^{*}}^{t-1}\alpha_{k}=\alpha\sum_{k=t-\tau^{*}}^{t-1}\frac{1}{\sqrt{k}}\leq\alpha\cdot\tau^{*}\cdot\frac{1}{\sqrt{t-\tau^{*}}}=\frac{\alpha\tau^{*}}{\sqrt{t-\tau^{*}}}. (48)
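As a quick numerical sanity check of (48), the sketch below compares the two sides for a few hypothetical pairs (t,\tau^{*}) and a hypothetical base stepsize \alpha; all numerical values are assumptions for illustration only.

import math

alpha = 0.1  # hypothetical base stepsize in alpha_t = alpha / sqrt(t)

def lhs(t, tau):
    # sum_{k = t - tau}^{t - 1} alpha / sqrt(k)
    return sum(alpha / math.sqrt(k) for k in range(t - tau, t))

def rhs(t, tau):
    # alpha * tau / sqrt(t - tau)
    return alpha * tau / math.sqrt(t - tau)

for t, tau in [(10, 3), (100, 10), (1000, 25)]:
    assert lhs(t, tau) <= rhs(t, tau)
    print(t, tau, round(lhs(t, tau), 4), round(rhs(t, tau), 4))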

By taking the summation over time steps, we obtain

1Gt=1T\displaystyle\frac{1}{G_{\infty}}\sum_{t=1}^{T} 𝔼[J(θt)2]\displaystyle\mathbb{E}\left[\left\lVert\nabla J(\theta_{t})\right\rVert^{2}\right]
(i)\displaystyle\overset{\text{(i)}}{\leq} t=1Tftft+1αt1+2cJαt=1T𝔼V^t12gt2+3cJαt=1T(β11β1)2𝔼V^t12mt2\displaystyle\sum_{t=1}^{T}\frac{f_{t}-f_{t+1}}{\alpha_{t-1}}+2c_{J}\alpha\sum_{t=1}^{T}\mathbb{E}\left\lVert\hat{V}_{t}^{-\frac{1}{2}}g_{t}\right\rVert^{2}+3c_{J}\alpha\sum_{t=1}^{T}\left(\frac{\beta_{1}}{1-\beta_{1}}\right)^{2}\mathbb{E}\left\lVert\hat{V}_{t}^{-\frac{1}{2}}m_{t}\right\rVert^{2}
+t=1τ𝔼[J(θt)TV^t112(J(θt)J~(θt;Ot))]\displaystyle+\sum_{t=1}^{\tau^{*}}\mathbb{E}\left[\nabla J(\theta_{t})^{T}\hat{V}_{t-1}^{-\frac{1}{2}}(\nabla J(\theta_{t})-\nabla\tilde{J}(\theta_{t};O_{t}))\right]
+αGG0t=τ+1T[(3cJ+cJ~)GG0τtτ+dG(1T+LπG(1+τ)2G0τtτ)]\displaystyle+\frac{\alpha G_{\infty}}{G_{0}}\sum_{t=\tau^{*}+1}^{T}\left[\frac{(3c_{J}+c_{\tilde{J}})G_{\infty}}{G_{0}}\frac{\tau^{*}}{\sqrt{t-\tau^{*}}}\!+\!dG_{\infty}\left(\frac{1}{\sqrt{T}}\!+\!\frac{L_{\pi}G_{\infty}(1+\tau^{*})}{2G_{0}}\frac{\tau^{*}}{\sqrt{t-\tau^{*}}}\right)\right]
(ii)\displaystyle\overset{\text{(ii)}}{\leq} t=1Tftft+1αt1+2dcJα1β2+3dcJβ12α(1β1)(1β2)(1β1/β2)\displaystyle\sum_{t=1}^{T}\frac{f_{t}-f_{t+1}}{\alpha_{t-1}}+\frac{2dc_{J}\alpha}{1-\beta_{2}}+\frac{3dc_{J}\beta_{1}^{2}\alpha}{(1-\beta_{1})(1-\beta_{2})(1-\beta_{1}/\beta_{2})}
+t=1τ𝔼[J(θt)TV^t112(J(θt)J~(θt;Ot))]\displaystyle+\sum_{t=1}^{\tau^{*}}\mathbb{E}\left[\nabla J(\theta_{t})^{T}\hat{V}_{t-1}^{-\frac{1}{2}}(\nabla J(\theta_{t})-\nabla\tilde{J}(\theta_{t};O_{t}))\right]
+αGG0[2(3cJ+cJ~)GτTτG0+dG(T+LπG(τ+(τ)2)TτG0)]\displaystyle+\frac{\alpha G_{\infty}}{G_{0}}\left[\frac{2(3c_{J}+c_{\tilde{J}})G_{\infty}\tau^{*}\sqrt{T-\tau^{*}}}{G_{0}}\!+\!dG_{\infty}\left(\sqrt{T}\!+\!\frac{L_{\pi}G_{\infty}(\tau^{*}+(\tau^{*})^{2})\sqrt{T-\tau^{*}}}{G_{0}}\right)\right]
\displaystyle\leq t=1Tftft+1αt1+2dcJα1β2+3dcJβ12α(1β1)(1β2)(1β1/β2)+2τG0G2\displaystyle\sum_{t=1}^{T}\frac{f_{t}-f_{t+1}}{\alpha_{t-1}}+\frac{2dc_{J}\alpha}{1-\beta_{2}}+\frac{3dc_{J}\beta_{1}^{2}\alpha}{(1-\beta_{1})(1-\beta_{2})(1-\beta_{1}/\beta_{2})}+\frac{2\tau^{*}}{G_{0}}G_{\infty}^{2}
+αGG0[2(3cJ+cJ~)GτTτG0+dG(T+LπG(τ+(τ)2)TτG0)]\displaystyle+\frac{\alpha G_{\infty}}{G_{0}}\left[\frac{2(3c_{J}+c_{\tilde{J}})G_{\infty}\tau^{*}\sqrt{T-\tau^{*}}}{G_{0}}\!+\!dG_{\infty}\left(\sqrt{T}\!+\!\frac{L_{\pi}G_{\infty}(\tau^{*}+(\tau^{*})^{2})\sqrt{T-\tau^{*}}}{G_{0}}\right)\right]
\displaystyle\leq t=1Tftft+1αt1+2dcJα1β2+3dcJβ12α(1β1)(1β2)(1β1/β2)+2τG0G2\displaystyle\sum_{t=1}^{T}\frac{f_{t}-f_{t+1}}{\alpha_{t-1}}+\frac{2dc_{J}\alpha}{1-\beta_{2}}+\frac{3dc_{J}\beta_{1}^{2}\alpha}{(1-\beta_{1})(1-\beta_{2})(1-\beta_{1}/\beta_{2})}+\frac{2\tau^{*}}{G_{0}}G_{\infty}^{2}
+αGG0[2(3cJ+cJ~)GτG0+dG(1+LπG(τ+(τ)2)G0)]T,\displaystyle+\frac{\alpha G_{\infty}}{G_{0}}\left[\frac{2(3c_{J}+c_{\tilde{J}})G_{\infty}\tau^{*}}{G_{0}}\!+\!dG_{\infty}\left(1\!+\!\frac{L_{\pi}G_{\infty}(\tau^{*}+(\tau^{*})^{2})}{G_{0}}\right)\right]\cdot\sqrt{T}, (49)

where (i) follows since αt\alpha_{t} is decreasing and from the definition of τ\tau^{*}, and (ii) follows from Lemma 10.

For clarity we bound t=1Tftft+1αt1\sum_{t=1}^{T}\frac{f_{t}-f_{t+1}}{\alpha_{t-1}} separately. The key observation is that ftf_{t} is uniformly bounded as

ft=𝔼[J(zt)]+G21β1𝔼[Tr(αt1V^t112)]Rmax1γ+αdG2G0(1β1).f_{t}=\mathbb{E}[J(z_{t})]+\frac{G_{\infty}^{2}}{1-\beta_{1}}\mathbb{E}\left[Tr\left(\alpha_{t-1}\hat{V}_{t-1}^{-\frac{1}{2}}\right)\right]\leq\frac{R_{\max}}{1-\gamma}+\frac{\alpha dG_{\infty}^{2}}{G_{0}(1-\beta_{1})}. (50)

Thus

t=1Tftft+1αt1\displaystyle\sum_{t=1}^{T}\frac{f_{t}-f_{t+1}}{\alpha_{t-1}} =f1α0+t=2Tft(1αt11αt2)fT+1αT1\displaystyle=\frac{f_{1}}{\alpha_{0}}+\sum_{t=2}^{T}f_{t}\left(\frac{1}{\alpha_{t-1}}-\frac{1}{\alpha_{t-2}}\right)-\frac{f_{T+1}}{\alpha_{T-1}}
f1α0+(Rmax1γ+αdG2G0(1β1))t=2T(1αt11αt2)\displaystyle\leq\frac{f_{1}}{\alpha_{0}}+\left(\frac{R_{\max}}{1-\gamma}+\frac{\alpha dG_{\infty}^{2}}{G_{0}(1-\beta_{1})}\right)\sum_{t=2}^{T}\left(\frac{1}{\alpha_{t-1}}-\frac{1}{\alpha_{t-2}}\right)
f1α0+(Rmax1γ+αdG2G0(1β1))/αT1\displaystyle\leq\frac{f_{1}}{\alpha_{0}}+\left(\frac{R_{\max}}{1-\gamma}+\frac{\alpha dG_{\infty}^{2}}{G_{0}(1-\beta_{1})}\right)/\alpha_{T-1}
=f1α+(Rmax1γ+αdG2G0(1β1))T1α.\displaystyle=\frac{f_{1}}{\alpha}+\left(\frac{R_{\max}}{1-\gamma}+\frac{\alpha dG_{\infty}^{2}}{G_{0}(1-\beta_{1})}\right)\frac{\sqrt{T-1}}{\alpha}. (51)
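The summation-by-parts step above can also be verified numerically. The sketch below checks the identity \sum_{t=1}^{T}\frac{f_{t}-f_{t+1}}{\alpha_{t-1}}=\frac{f_{1}}{\alpha_{0}}+\sum_{t=2}^{T}f_{t}\left(\frac{1}{\alpha_{t-1}}-\frac{1}{\alpha_{t-2}}\right)-\frac{f_{T+1}}{\alpha_{T-1}} on random data, taking \alpha_{t}=\alpha/\sqrt{t} for t\geq 1 and \alpha_{0}=\alpha as an assumption consistent with the last equality above.

import math
import random

random.seed(0)
alpha, T = 0.5, 50
f = [random.uniform(0.0, 5.0) for _ in range(T + 2)]          # f[1], ..., f[T+1] are used
a = [alpha] + [alpha / math.sqrt(t) for t in range(1, T + 1)]  # a[0] = alpha_0, a[t] = alpha_t

lhs = sum((f[t] - f[t + 1]) / a[t - 1] for t in range(1, T + 1))
rhs = (f[1] / a[0]
       + sum(f[t] * (1.0 / a[t - 1] - 1.0 / a[t - 2]) for t in range(2, T + 1))
       - f[T + 1] / a[T - 1])
assert abs(lhs - rhs) < 1e-8
print(lhs, rhs)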

Finally, we complete the proof by substituting (51) into (49) and multiplying both sides of (49) by G_{\infty}/T, which yields

\displaystyle\underset{t\in[T]}{\min}\mathbb{E}\left[\left\lVert\nabla J(\theta_{t})\right\rVert^{2}\right]\leq\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\left[\left\lVert\nabla J(\theta_{t})\right\rVert^{2}\right]\leq\frac{C_{1}}{T}+\frac{C_{2}}{\sqrt{T}},

where

C1\displaystyle C_{1} =f1Gα+2dcJαG1β2+3dcJβ12αG(1β1)(1β2)(1β1/β2)+2τGG0G2\displaystyle=\frac{f_{1}G_{\infty}}{\alpha}+\frac{2dc_{J}\alpha G_{\infty}}{1-\beta_{2}}+\frac{3dc_{J}\beta_{1}^{2}\alpha G_{\infty}}{(1-\beta_{1})(1-\beta_{2})(1-\beta_{1}/\beta_{2})}+\frac{2\tau^{*}G_{\infty}}{G_{0}}G_{\infty}^{2}
C2\displaystyle C_{2} =RmaxGα(1γ)+dG3G0(1β1)+αG3G0[2(3cJ+cJ~)τG0+d(1+LπG(τ+(τ)2)G0)].\displaystyle=\frac{R_{\max}G_{\infty}}{\alpha(1-\gamma)}+\frac{dG_{\infty}^{3}}{G_{0}(1-\beta_{1})}+\frac{\alpha G_{\infty}^{3}}{G_{0}}\left[\frac{2(3c_{J}+c_{\tilde{J}})\tau^{*}}{G_{0}}\!+\!d\left(1+\frac{L_{\pi}G_{\infty}(\tau^{*}+(\tau^{*})^{2})}{G_{0}}\right)\right].

Appendix E Proof of Proposition 1

In the following, we show how to adapt the proof techniques used in analyzing PG-AMSGrad to PG-SGD. We first reduce Lemma 7 to the vanilla PG case.

Lemma 12.

Fix time tt and any τ<t\tau<t. Suppose Assumptions 1 and 2 hold for PG-SGD. Then we have

J(θtτ)𝔼[J~(θtτ;st,at)|tτ]G[σρτ+LπG2(k=tτt1αk+i=tτt1k=tτiαk)],\left\lVert\nabla J(\theta_{t-\tau})-\mathbb{E}\left[\nabla\tilde{J}(\theta_{t-\tau};s_{t},a_{t})|\mathcal{F}_{t-\tau}\right]\right\rVert\leq G_{\infty}\left[\sigma\rho^{\tau}+\frac{L_{\pi}G_{\infty}}{2}\left(\sum_{k=t-\tau}^{t-1}\alpha_{k}+\sum_{i=t-\tau}^{t-1}\sum_{k=t-\tau}^{i}\alpha_{k}\right)\right], (52)

where G=cΘRmax1γG_{\infty}=\frac{c_{\Theta}R_{\max}}{1-\gamma}.

Proof.

Since the major part of the proof is similar to that of Lemma 7, we only emphasize the steps that differ.

For notational brevity, we denote O_{t}=(s_{t},a_{t}). Then we still have

J(θtτ)𝔼[J~(θtτ;Ot)|tτ]Gμθtτ(,)P(st,at|tτ)TV.\displaystyle\left\lVert\nabla J(\theta_{t-\tau})-\mathbb{E}\left[\nabla\tilde{J}(\theta_{t-\tau};O_{t})|\mathcal{F}_{t-\tau}\right]\right\rVert\leq G_{\infty}\left\|\mu_{\theta_{t-\tau}}(\cdot,\cdot)-P(s_{t},a_{t}|\mathcal{F}_{t-\tau})\right\|_{TV}. (53)

Then we use the steps similar to those in building an auxiliary Markov chain (24) which is generated by the policy πθtτ\pi_{\theta_{t-\tau}} from time tτt-\tau. Similarly to (C.1), we have

μθtτ(,)P(st,at|tτ)TV\displaystyle\left\|\mu_{\theta_{t-\tau}}(\cdot,\cdot)-P(s_{t},a_{t}|\mathcal{F}_{t-\tau})\right\|_{TV}
μθtτ(,)P(s~t,a~t|tτ)TV+P(s~t,a~t|tτ)P(st,at|tτ)TV\displaystyle\quad\leq\left\|\mu_{\theta_{t-\tau}}(\cdot,\cdot)-P(\tilde{s}_{t},\tilde{a}_{t}|\mathcal{F}_{t-\tau})\right\|_{TV}+\left\|P(\tilde{s}_{t},\tilde{a}_{t}|\mathcal{F}_{t-\tau})-P(s_{t},a_{t}|\mathcal{F}_{t-\tau})\right\|_{TV}
(i)σρτ+12Lπi=tτt1θi+1θi+12Lπi=tτt1k=tτiθk+1θk,\displaystyle\quad\overset{\text{(i)}}{\leq}\sigma\rho^{\tau}+\frac{1}{2}L_{\pi}\sum_{i=t-\tau}^{t-1}\left\lVert\theta_{i+1}-\theta_{i}\right\rVert+\frac{1}{2}L_{\pi}\sum_{i=t-\tau}^{t-1}\sum_{k=t-\tau}^{i}\left\lVert\theta_{k+1}-\theta_{k}\right\rVert, (54)

where (i) follows from (26), (C.1) and (C.1). Observe that in PG-SGD, for any tt we have,

θt+1θt=αtgtGαt.\left\lVert\theta_{t+1}-\theta_{t}\right\rVert=\alpha_{t}\left\lVert g_{t}\right\rVert\leq G_{\infty}\alpha_{t}.

Thus, we complete the proof by substituting the above bound into (54) and then into (53). ∎

Proof of Proposition 1: Following from the Lipschitz condition of J(θ)\nabla J(\theta) in Lemma 1, we obtain

J(θt+1)\displaystyle J(\theta_{t+1})\leq J(θt)+J(θt)T(θt+1θt)+cJ2θt+1θt2\displaystyle J(\theta_{t})+\nabla J(\theta_{t})^{T}(\theta_{t+1}-\theta_{t})+\frac{c_{J}}{2}\left\lVert\theta_{t+1}-\theta_{t}\right\rVert^{2}
=\displaystyle= J(θt)αtJ(θt)Tgt+cJ2θt+1θt2\displaystyle J(\theta_{t})-\alpha_{t}\nabla J(\theta_{t})^{T}g_{t}+\frac{c_{J}}{2}\left\lVert\theta_{t+1}-\theta_{t}\right\rVert^{2}
=\displaystyle= J(θt)αtJ(θt)TJ(θt)+αtJ(θt)T(J(θt)gt)+cJ2θt+1θt2.\displaystyle J(\theta_{t})-\alpha_{t}\nabla J(\theta_{t})^{T}\nabla J(\theta_{t})+\alpha_{t}\nabla J(\theta_{t})^{T}(\nabla J(\theta_{t})-g_{t})+\frac{c_{J}}{2}\left\lVert\theta_{t+1}-\theta_{t}\right\rVert^{2}.

Then we rearrange the above inequality and take expectation over all the randomness to have

αt𝔼[J(θt)2]\displaystyle\alpha_{t}\mathbb{E}\left[\left\lVert\nabla J(\theta_{t})\right\rVert^{2}\right]\leq 𝔼[J(θt)]𝔼[J(θt+1)]+αt𝔼[J(θt)T(J(θt)gt)]+cJ2𝔼[θt+1θt2]\displaystyle\mathbb{E}[J(\theta_{t})]-\mathbb{E}[J(\theta_{t+1})]+\alpha_{t}\mathbb{E}\left[\nabla J(\theta_{t})^{T}(\nabla J(\theta_{t})-g_{t})\right]+\frac{c_{J}}{2}\mathbb{E}\left[\left\lVert\theta_{t+1}-\theta_{t}\right\rVert^{2}\right]
\displaystyle\leq 𝔼[J(θt)]𝔼[J(θt+1)]+αt𝔼[J(θt)T(J(θt)gt)]+cJG22αt2\displaystyle\mathbb{E}[J(\theta_{t})]-\mathbb{E}[J(\theta_{t+1})]+\alpha_{t}\mathbb{E}\left[\nabla J(\theta_{t})^{T}(\nabla J(\theta_{t})-g_{t})\right]+\frac{c_{J}G_{\infty}^{2}}{2}\alpha_{t}^{2}
=\displaystyle= 𝔼[J(θt)]𝔼[J(θt+1)]+αt𝔼[𝔼[J(θt)T(J(θt)gt)|tq]]+cJG22αt2\displaystyle\mathbb{E}[J(\theta_{t})]-\mathbb{E}[J(\theta_{t+1})]+\alpha_{t}\mathbb{E}\left[\mathbb{E}\left[\nabla J(\theta_{t})^{T}(\nabla J(\theta_{t})-g_{t})\rvert\mathcal{F}_{t}^{q}\right]\right]+\frac{c_{J}G_{\infty}^{2}}{2}\alpha_{t}^{2}
=\displaystyle= 𝔼[J(θt)]𝔼[J(θt+1)]+αt𝔼[J(θt)T(J(θt)J~(θt;Ot))]+cJG22αt2,\displaystyle\mathbb{E}[J(\theta_{t})]-\mathbb{E}[J(\theta_{t+1})]+\alpha_{t}\mathbb{E}\left[\nabla J(\theta_{t})^{T}(\nabla J(\theta_{t})-\nabla\tilde{J}(\theta_{t};O_{t}))\right]+\frac{c_{J}G_{\infty}^{2}}{2}\alpha_{t}^{2}, (55)

where the filtration \mathcal{F}_{t}^{q} contains all the samples up to time t except the samples used for estimating Q^{\pi_{\theta_{t}}}(s_{t},a_{t}) in the EstQ algorithm. Thus we have \mathbb{E}[g_{t}\rvert\mathcal{F}_{t}^{q}]=\nabla\tilde{J}(\theta_{t};O_{t}), where the expectation is taken over the randomness in the EstQ algorithm. Observe that if \tau<t, we have

𝔼[J(θt)T(J(θt)J~(θt;Ot))]\displaystyle\mathbb{E}\left[\nabla J(\theta_{t})^{T}(\nabla J(\theta_{t})-\nabla\tilde{J}(\theta_{t};O_{t}))\right]
=𝔼[J(θtτ)T(J(θtτ)J~(θtτ;Ot))T1+J(θtτ)T(J~(θtτ;Ot)J~(θt;Ot))T2\displaystyle\quad=\mathbb{E}\left[\underbrace{\nabla J(\theta_{t-\tau})^{T}(\nabla J(\theta_{t-\tau})-\nabla\tilde{J}(\theta_{t-\tau};O_{t}))}_{T1}+\underbrace{\nabla J(\theta_{t-\tau})^{T}(\nabla\tilde{J}(\theta_{t-\tau};O_{t})-\nabla\tilde{J}(\theta_{t};O_{t}))}_{T_{2}}\right.
+J(θtτ)T(J(θt)J(θtτ))T3+(J(θt)J(θtτ))T(J(θt)J~(θt;Ot))T4].\displaystyle\quad\quad\left.+\underbrace{\nabla J(\theta_{t-\tau})^{T}(\nabla J(\theta_{t})-\nabla J(\theta_{t-\tau}))}_{T_{3}}+\underbrace{(\nabla J(\theta_{t})-\nabla J(\theta_{t-\tau}))^{T}(\nabla J(\theta_{t})-\nabla\tilde{J}(\theta_{t};O_{t}))}_{T_{4}}\right].

Next we bound the terms T1,T2,T3T_{1},T_{2},T_{3} and T4T_{4}.

Observe that 𝔼[T1]=𝔼[𝔼[T1|tτ]]\mathbb{E}[T_{1}]=\mathbb{E}[\mathbb{E}[T_{1}|\mathcal{F}_{t-\tau}]]. We deal with 𝔼[T1|tτ]\mathbb{E}[T_{1}|\mathcal{F}_{t-\tau}] and have

𝔼[T1|tτ]=\displaystyle\mathbb{E}[T_{1}|\mathcal{F}_{t-\tau}]= 𝔼[J(θtτ)T(J(θtτ)J~(θtτ;Ot))|tτ]\displaystyle\mathbb{E}\left[\nabla J(\theta_{t-\tau})^{T}(\nabla J(\theta_{t-\tau})-\nabla\tilde{J}(\theta_{t-\tau};O_{t}))|\mathcal{F}_{t-\tau}\right]
\displaystyle\leq J(θtτ)J(θtτ)𝔼[J~(θtτ;Ot)|tτ]\displaystyle\left\lVert\nabla J(\theta_{t-\tau})\right\rVert\cdot\left\lVert\nabla J(\theta_{t-\tau})-\mathbb{E}\left[\nabla\tilde{J}(\theta_{t-\tau};O_{t})|\mathcal{F}_{t-\tau}\right]\right\rVert
\displaystyle\leq GJ(θtτ)𝔼[J~(θtτ;Ot)|tτ]\displaystyle G_{\infty}\cdot\left\lVert\nabla J(\theta_{t-\tau})-\mathbb{E}\left[\nabla\tilde{J}(\theta_{t-\tau};O_{t})|\mathcal{F}_{t-\tau}\right]\right\rVert
\displaystyle\leq G2[σρτ+LπG2(k=tτt1αk+i=tτt1k=tτiαk)].\displaystyle G_{\infty}^{2}\left[\sigma\rho^{\tau}+\frac{L_{\pi}G_{\infty}}{2}\left(\sum_{k=t-\tau}^{t-1}\alpha_{k}+\sum_{i=t-\tau}^{t-1}\sum_{k=t-\tau}^{i}\alpha_{k}\right)\right].

Next we bound T2T_{2} and obtain

T2=\displaystyle T_{2}= J(θtτ)T(J~(θtτ;Ot)J~(θt;Ot))\displaystyle\nabla J(\theta_{t-\tau})^{T}(\nabla\tilde{J}(\theta_{t-\tau};O_{t})-\nabla\tilde{J}(\theta_{t};O_{t}))
\displaystyle\leq J(θtτ)(J~(θtτ;Ot)J~(θt;Ot))\displaystyle\left\lVert\nabla J(\theta_{t-\tau})\right\rVert\cdot\left\lVert(\nabla\tilde{J}(\theta_{t-\tau};O_{t})-\nabla\tilde{J}(\theta_{t};O_{t}))\right\rVert
\displaystyle\leq GcJ~θtθtτ\displaystyle G_{\infty}\cdot c_{\tilde{J}}\left\lVert\theta_{t}-\theta_{t-\tau}\right\rVert
\displaystyle\leq GcJ~k=tτt1θk+1θk\displaystyle G_{\infty}\cdot c_{\tilde{J}}\sum_{k=t-\tau}^{t-1}\left\lVert\theta_{k+1}-\theta_{k}\right\rVert
\displaystyle\leq cJ~G2k=tτt1αk.\displaystyle c_{\tilde{J}}G_{\infty}^{2}\sum_{k=t-\tau}^{t-1}\alpha_{k}.

Similarly, we can bound term T3T_{3} as

T3=\displaystyle T_{3}= J(θtτ)T(J(θtτ)J(θt))\displaystyle\nabla J(\theta_{t-\tau})^{T}(\nabla J(\theta_{t-\tau})-\nabla J(\theta_{t}))
\displaystyle\leq J(θtτ)(J(θtτ)J(θt))\displaystyle\left\lVert\nabla J(\theta_{t-\tau})\right\rVert\cdot\left\lVert(\nabla J(\theta_{t-\tau})-\nabla J(\theta_{t}))\right\rVert
\displaystyle\leq GcJθtθtτ\displaystyle G_{\infty}\cdot c_{J}\left\lVert\theta_{t}-\theta_{t-\tau}\right\rVert
\displaystyle\leq GcJk=tτt1θk+1θk\displaystyle G_{\infty}\cdot c_{J}\sum_{k=t-\tau}^{t-1}\left\lVert\theta_{k+1}-\theta_{k}\right\rVert
\displaystyle\leq cJG2k=tτt1αk.\displaystyle c_{J}G_{\infty}^{2}\sum_{k=t-\tau}^{t-1}\alpha_{k}.

Last, we bound term T4T_{4} and obtain

T4=\displaystyle T_{4}= (J(θt)J(θtτ))T(J(θt)J~(θt;Ot))\displaystyle(\nabla J(\theta_{t})-\nabla J(\theta_{t-\tau}))^{T}(\nabla J(\theta_{t})-\nabla\tilde{J}(\theta_{t};O_{t}))
\displaystyle\leq J(θtτ)J(θt)J(θt)J~(θt;Ot)\displaystyle\left\lVert\nabla J(\theta_{t-\tau})-\nabla J(\theta_{t})\right\rVert\cdot\left\lVert\nabla J(\theta_{t})-\nabla\tilde{J}(\theta_{t};O_{t})\right\rVert
\displaystyle\leq (J(θt)+J~(θt;Ot))cJθtθtτ\displaystyle\left(\left\lVert\nabla J(\theta_{t})\right\rVert+\left\lVert\nabla\tilde{J}(\theta_{t};O_{t})\right\rVert\right)\cdot c_{J}\left\lVert\theta_{t}-\theta_{t-\tau}\right\rVert
\displaystyle\leq 2cJG2k=tτt1αk.\displaystyle 2c_{J}G_{\infty}^{2}\sum_{k=t-\tau}^{t-1}\alpha_{k}.

Thus for any τ<t\tau<t, we have

𝔼[J(θt)T(J(θt)J~(θt;Ot))]\displaystyle\mathbb{E}\left[\nabla J(\theta_{t})^{T}(\nabla J(\theta_{t})-\nabla\tilde{J}(\theta_{t};O_{t}))\right]
G2[(3cJ+cJ~)k=tτt1αk+σρτ+LπG2(k=tτt1αk+i=tτt1k=tτiαk)].\displaystyle\quad\leq G_{\infty}^{2}\left[(3c_{J}+c_{\tilde{J}})\sum_{k=t-\tau}^{t-1}\alpha_{k}+\sigma\rho^{\tau}+\frac{L_{\pi}G_{\infty}}{2}\left(\sum_{k=t-\tau}^{t-1}\alpha_{k}+\sum_{i=t-\tau}^{t-1}\sum_{k=t-\tau}^{i}\alpha_{k}\right)\right]. (56)

Proof of convergence under constant stepsize:

We first consider a constant stepsize αt=α\alpha_{t}=\alpha for t=1,,Tt=1,\dots,T. Choose τ=min{τ:σρτα}\tau^{*}=\min\{\tau:\sigma\rho^{\tau}\leq\alpha\}. We rearrange (55) and take the summation over the time steps to obtain

αt=1T𝔼[J(θt)2]\displaystyle\alpha\sum_{t=1}^{T}\mathbb{E}\left[\left\lVert\nabla J(\theta_{t})\right\rVert^{2}\right]\leq t=1T(𝔼[J(θt)]𝔼[J(θt+1)])+t=1TcJG22α2+αt=1T𝔼[J(θt)T(J(θt)J~(θt;Ot))]\displaystyle\sum_{t=1}^{T}\left(\mathbb{E}[J(\theta_{t})]-\mathbb{E}[J(\theta_{t+1})]\right)+\sum_{t=1}^{T}\frac{c_{J}G_{\infty}^{2}}{2}\alpha^{2}+\alpha\sum_{t=1}^{T}\mathbb{E}\left[\nabla J(\theta_{t})^{T}(\nabla J(\theta_{t})-\nabla\tilde{J}(\theta_{t};O_{t}))\right]
\displaystyle\leq J(θ1)+cJG2α22T+αt=1τ𝔼[J(θt)T(J(θt)J~(θt;Ot))]\displaystyle J(\theta_{1})+\frac{c_{J}G_{\infty}^{2}\alpha^{2}}{2}T+\alpha\sum_{t=1}^{\tau^{*}}\mathbb{E}\left[\nabla J(\theta_{t})^{T}(\nabla J(\theta_{t})-\nabla\tilde{J}(\theta_{t};O_{t}))\right]
+αt=τ+1T𝔼[J(θt)T(J(θt)J~(θt;Ot))]\displaystyle+\alpha\sum_{t=\tau^{*}+1}^{T}\mathbb{E}\left[\nabla J(\theta_{t})^{T}(\nabla J(\theta_{t})-\nabla\tilde{J}(\theta_{t};O_{t}))\right]
(i)\displaystyle\overset{\text{(i)}}{\leq} J(θ1)+cJG2α22T+2αG2τ+αt=τ+1T𝔼[J(θt)T(J(θt)J~(θt;Ot))]\displaystyle J(\theta_{1})+\frac{c_{J}G_{\infty}^{2}\alpha^{2}}{2}T+2\alpha G_{\infty}^{2}\tau^{*}+\alpha\sum_{t=\tau^{*}+1}^{T}\mathbb{E}\left[\nabla J(\theta_{t})^{T}(\nabla J(\theta_{t})-\nabla\tilde{J}(\theta_{t};O_{t}))\right]
(ii)\displaystyle\overset{\text{(ii)}}{\leq} J(θ1)+cJG2α22T+2αG2τ+α2G2t=τ+1T[(3cJ+cJ~)τ+1+LπGτ(2+τ)2],\displaystyle J(\theta_{1})+\frac{c_{J}G_{\infty}^{2}\alpha^{2}}{2}T+2\alpha G_{\infty}^{2}\tau^{*}+\alpha^{2}G_{\infty}^{2}\sum_{t=\tau^{*}+1}^{T}\left[(3c_{J}+c_{\tilde{J}})\tau^{*}+1+\frac{L_{\pi}G_{\infty}\tau^{*}(2+\tau^{*})}{2}\right],

where (i) follows because J(θt)T(J(θt)J~(θt;Ot))J(θt)J(θt)J~(θt;Ot)2G2\nabla J(\theta_{t})^{T}(\nabla J(\theta_{t})-\nabla\tilde{J}(\theta_{t};O_{t}))\leq\left\lVert\nabla J(\theta_{t})\right\rVert\cdot\left\lVert\nabla J(\theta_{t})-\nabla\tilde{J}(\theta_{t};O_{t})\right\rVert\leq 2G_{\infty}^{2}, and (ii) follows from (56). Then we divide both sides of the above inequality by αT\alpha T, and have

mint[T]𝔼[J(θt)2]\displaystyle\underset{t\in[T]}{\min}\mathbb{E}\left[\left\lVert\nabla J(\theta_{t})\right\rVert^{2}\right] 1Tt=1T𝔼[J(θt)2]\displaystyle\leq\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\left[\left\lVert\nabla J(\theta_{t})\right\rVert^{2}\right]
J(θ1)/α+2G2τT+αG2[cJ2+(3cJ+cJ~)τ+1+LπG(2τ+(τ)2)2].\displaystyle\leq\frac{J(\theta_{1})/\alpha+2G_{\infty}^{2}\tau^{*}}{T}+\alpha G_{\infty}^{2}\left[\frac{c_{J}}{2}+(3c_{J}+c_{\tilde{J}})\tau^{*}+1+\frac{L_{\pi}G_{\infty}(2\tau^{*}+(\tau^{*})^{2})}{2}\right].

Proof of convergence under diminishing stepsize:

Now we consider a diminishing stepsize αt=1γt\alpha_{t}=\frac{1-\gamma}{\sqrt{t}} for t=1,,Tt=1,\dots,T. Choose τ=min{τ:σρταT=1γT}\tau^{*}=\min\{\tau:\sigma\rho^{\tau}\leq\alpha_{T}=\frac{1-\gamma}{\sqrt{T}}\}. We rearrange (55) and take the summation over the time steps to obtain

t=1T\displaystyle\sum_{t=1}^{T} 𝔼[J(θt)2]\displaystyle\mathbb{E}\left[\left\lVert\nabla J(\theta_{t})\right\rVert^{2}\right]
\displaystyle\leq t=1T𝔼[J(θt)]𝔼[J(θt+1)]αt+cJG22t=1Tαt+t=1T𝔼[J(θt)T(J(θt)J~(θt;Ot))]\displaystyle\sum_{t=1}^{T}\frac{\mathbb{E}[J(\theta_{t})]-\mathbb{E}[J(\theta_{t+1})]}{\alpha_{t}}+\frac{c_{J}G_{\infty}^{2}}{2}\sum_{t=1}^{T}\alpha_{t}+\sum_{t=1}^{T}\mathbb{E}\left[\nabla J(\theta_{t})^{T}(\nabla J(\theta_{t})-\nabla\tilde{J}(\theta_{t};O_{t}))\right]
\displaystyle\leq J(θ1)α1+t=2T𝔼[J(θt)](1αt1αt1)+cJG22t=1Tαt+t=1T𝔼[J(θt)T(J(θt)J~(θt;Ot))]\displaystyle\frac{J(\theta_{1})}{\alpha_{1}}+\sum_{t=2}^{T}\mathbb{E}[J(\theta_{t})]\left(\frac{1}{\alpha_{t}}-\frac{1}{\alpha_{t-1}}\right)+\frac{c_{J}G_{\infty}^{2}}{2}\sum_{t=1}^{T}\alpha_{t}+\sum_{t=1}^{T}\mathbb{E}\left[\nabla J(\theta_{t})^{T}(\nabla J(\theta_{t})-\nabla\tilde{J}(\theta_{t};O_{t}))\right]
\displaystyle\leq J(θ1)α1+Rmax1γt=2T(1αt1αt1)+cJG22t=1Tαt+t=1T𝔼[J(θt)T(J(θt)J~(θt;Ot))]\displaystyle\frac{J(\theta_{1})}{\alpha_{1}}+\frac{R_{\max}}{1-\gamma}\sum_{t=2}^{T}\left(\frac{1}{\alpha_{t}}-\frac{1}{\alpha_{t-1}}\right)+\frac{c_{J}G_{\infty}^{2}}{2}\sum_{t=1}^{T}\alpha_{t}+\sum_{t=1}^{T}\mathbb{E}\left[\nabla J(\theta_{t})^{T}(\nabla J(\theta_{t})-\nabla\tilde{J}(\theta_{t};O_{t}))\right]
\displaystyle\leq J(θ1)α1+Rmax1γ1αT+cJG22t=1Tαt+t=1T𝔼[J(θt)T(J(θt)J~(θt;Ot))]\displaystyle\frac{J(\theta_{1})}{\alpha_{1}}+\frac{R_{\max}}{1-\gamma}\cdot\frac{1}{\alpha_{T}}+\frac{c_{J}G_{\infty}^{2}}{2}\sum_{t=1}^{T}\alpha_{t}+\sum_{t=1}^{T}\mathbb{E}\left[\nabla J(\theta_{t})^{T}(\nabla J(\theta_{t})-\nabla\tilde{J}(\theta_{t};O_{t}))\right]
\displaystyle\leq J(θ1)α1+RmaxT(1γ)2+(1γ)cJG2T+t=1τ𝔼[J(θt)T(J(θt)J~(θt;Ot))]\displaystyle\frac{J(\theta_{1})}{\alpha_{1}}+\frac{R_{\max}\sqrt{T}}{(1-\gamma)^{2}}+(1-\gamma)c_{J}G_{\infty}^{2}\sqrt{T}+\sum_{t=1}^{\tau^{*}}\mathbb{E}\left[\nabla J(\theta_{t})^{T}(\nabla J(\theta_{t})-\nabla\tilde{J}(\theta_{t};O_{t}))\right]
+t=τ+1T𝔼[J(θt)T(J(θt)J~(θt;Ot))]\displaystyle+\sum_{t=\tau^{*}+1}^{T}\mathbb{E}\left[\nabla J(\theta_{t})^{T}(\nabla J(\theta_{t})-\nabla\tilde{J}(\theta_{t};O_{t}))\right]
(i)\displaystyle\overset{\text{(i)}}{\leq} J(θ1)α1+RmaxT(1γ)2+(1γ)cJG2T+2(1γ)G2τ\displaystyle\frac{J(\theta_{1})}{\alpha_{1}}+\frac{R_{\max}\sqrt{T}}{(1-\gamma)^{2}}+(1-\gamma)c_{J}G_{\infty}^{2}\sqrt{T}+2(1-\gamma)G_{\infty}^{2}\tau^{*}
+(1γ)G2t=τ+1T[(3cJ+cJ~)τtτ+1T+LπG2(1+τ)τtτ]\displaystyle+(1-\gamma)G_{\infty}^{2}\sum_{t=\tau^{*}+1}^{T}\left[\frac{(3c_{J}+c_{\tilde{J}})\tau^{*}}{\sqrt{t-\tau^{*}}}+\frac{1}{\sqrt{T}}+\frac{L_{\pi}G_{\infty}}{2}\cdot\frac{(1+\tau^{*})\tau^{*}}{\sqrt{t-\tau^{*}}}\right]
\displaystyle\leq J(θ1)1γ+RmaxT(1γ)2+(1γ)G2[2τ+[cJ+2(3cJ+cJ~)τ+1+LπG(τ+(τ)2)]T],\displaystyle\frac{J(\theta_{1})}{1-\gamma}+\frac{R_{\max}\sqrt{T}}{(1-\gamma)^{2}}+(1-\gamma)G_{\infty}^{2}\left[2\tau^{*}+\left[c_{J}+2(3c_{J}+c_{\tilde{J}})\tau^{*}+1+L_{\pi}G_{\infty}(\tau^{*}+(\tau^{*})^{2})\right]\sqrt{T}\right],

where (i) follows from (48). Then we divide both sides of the above inequality by TT, and have

mint[T]𝔼[J(θt)2]1Tt=1T𝔼[J(θt)2]\displaystyle\underset{t\in[T]}{\min}\mathbb{E}\left[\left\lVert\nabla J(\theta_{t})\right\rVert^{2}\right]\leq\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\left[\left\lVert\nabla J(\theta_{t})\right\rVert^{2}\right]
J(θ1)+2(1γ)2G2τ(1γ)T+Rmax+(1γ)3G2[cJ+2(3cJ+cJ~)τ+1+LπG(τ+(τ)2)](1γ)2T.\displaystyle\quad\leq\frac{J(\theta_{1})+2(1-\gamma)^{2}G_{\infty}^{2}\tau^{*}}{(1-\gamma)T}+\frac{R_{\max}+(1-\gamma)^{3}G_{\infty}^{2}\left[c_{J}+2(3c_{J}+c_{\tilde{J}})\tau^{*}+1+L_{\pi}G_{\infty}(\tau^{*}+(\tau^{*})^{2})\right]}{(1-\gamma)^{2}\sqrt{T}}.

Appendix F Proof of Lemma 3

We bound the expectation of the bias by constructing an auxiliary Markov chain and applying standard techniques from information theory. Before deriving the bound, we first introduce some technical lemmas.

Lemma 13.

Given Assumption 4, the gradient gtg_{t} in Algorithm 2 is uniformly bounded as follows,

gtgtRmax+(1+γ)D.\left\lVert g_{t}\right\rVert_{\infty}\leq\left\lVert g_{t}\right\rVert\leq R_{\max}+(1+\gamma)D_{\infty}. (57)
Proof.

The proof follows easily from the boundedness of ϕ\phi and θ\theta.

\displaystyle\left\lVert g_{t}\right\rVert=\displaystyle\left\lVert\left(\phi(s_{t})^{T}\theta-r_{t}-\gamma\phi(s^{\prime}_{t})^{T}\theta\right)\phi(s_{t})\right\rVert
\displaystyle\leq\displaystyle\left\lVert\phi(s_{t})^{T}\theta-r_{t}-\gamma\phi(s^{\prime}_{t})^{T}\theta\right\rVert\cdot\left\lVert\phi(s_{t})\right\rVert
\displaystyle\overset{\text{(i)}}{\leq}\displaystyle\left\lVert\phi(s_{t})^{T}\theta\right\rVert+R_{\max}+\gamma\left\lVert\phi(s^{\prime}_{t})^{T}\theta\right\rVert
(ii)\displaystyle\overset{\text{(ii)}}{\leq} Rmax+(1+γ)D,\displaystyle R_{\max}+(1+\gamma)D_{\infty},

where (i) follows since the reward function is uniformly bounded and (ii) follows from Assumption 3 and Assumption 4. ∎
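A minimal numerical illustration of Lemma 13 is given below. As assumptions for the sketch, features are drawn with \left\lVert\phi(s)\right\rVert\leq 1 and \theta is drawn from a Euclidean ball of radius D_{\infty}, so that |\phi(s)^{T}\theta|\leq D_{\infty}; rewards are drawn from [-R_{\max},R_{\max}]. All numerical values are hypothetical.

import numpy as np

rng = np.random.default_rng(0)
d, gamma, R_max, D_inf = 5, 0.95, 1.0, 2.0

def ball_vector(radius):
    # A random vector with Euclidean norm at most `radius`.
    v = rng.normal(size=d)
    return radius * v / max(1.0, np.linalg.norm(v))

for _ in range(1000):
    phi_s, phi_sp = ball_vector(1.0), ball_vector(1.0)           # current and next-state features
    theta = ball_vector(D_inf)                                   # so |phi^T theta| <= D_inf
    r = rng.uniform(-R_max, R_max)                               # bounded reward
    g = (phi_s @ theta - r - gamma * (phi_sp @ theta)) * phi_s   # TD update direction
    assert np.linalg.norm(g) <= R_max + (1 + gamma) * D_inf + 1e-12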

Lemma 14.

(Zhou et al., 2018, Lemma A.1) Let \{g_{t},m_{t},\hat{v}_{t}\} for t=1,2,\dots be the sequences generated by Algorithm 2 and denote G_{\infty}=R_{\max}+(1+\gamma)D_{\infty}. Given \left\lVert g_{t}\right\rVert\leq G_{\infty} as in Lemma 13, we have \left\lVert\bar{g}_{t}\right\rVert\leq G_{\infty},\left\lVert m_{t}\right\rVert\leq G_{\infty},\left\lVert\hat{v}_{t}\right\rVert\leq G_{\infty}^{2}.

Lemma 15.

(Bhandari et al., 2018, Lemma 11) Let \xi(\theta;s,a,s^{\prime})=(g(\theta;s,a,s^{\prime})-\bar{g}(\theta))^{T}(\theta-\theta^{\star}). Fix (s,a,s^{\prime}). Then \xi is uniformly bounded by

|ξ(θ;s,a,s)|2DG,θ𝒟,\lvert\xi(\theta;s,a,s^{\prime})\rvert\leq 2D_{\infty}G_{\infty},\quad\forall\theta\in\mathcal{D},

and it is Lipschitz continuous as given by

|ξ(θ;s,a,s)ξ(θ;s,a,s)|2((1+γ)D+G)θθ2,θ,θ𝒟.\lvert\xi(\theta;s,a,s^{\prime})-\xi(\theta^{\prime};s,a,s^{\prime})\rvert\leq 2((1+\gamma)D_{\infty}+G_{\infty})\left\lVert\theta-\theta^{\prime}\right\rVert_{2},\quad\forall\theta,\theta^{\prime}\in\mathcal{D}.
Lemma 16.

(Bhandari et al., 2018, Lemma 9) Consider two random variables X and Y such that

Xstst+τY,X\rightarrow s_{t}\rightarrow s_{t+\tau}\rightarrow Y, (58)

for fixed t and \tau>0. Suppose Assumption 2 holds. Let X^{\prime},Y^{\prime} be independent copies drawn from the marginal distributions of X and Y, that is, \mathbb{P}(X^{\prime}=\cdot,Y^{\prime}=\cdot)=\mathbb{P}(X=\cdot)\mathbb{P}(Y=\cdot). Then, for any bounded function v, we have

|𝔼[v(X,Y)]𝔼[v(X,Y)]|2v(σρτ),\lvert\mathbb{E}[v(X,Y)]-\mathbb{E}[v(X^{\prime},Y^{\prime})]\rvert\leq 2\left\lVert v\right\rVert_{\infty}(\sigma\rho^{\tau}),

where σ\sigma and ρ\rho are defined in Assumption 2.

Remark 1.

The notation X\rightarrow Z\rightarrow Y indicates that the random variables X and Y are independent conditioned on Z; this is standard notation in information theory.

Proof of Lemma 3: Now we bound the bias of the gradient estimation. We first develop the connection between \xi(\theta_{t};s_{t},a_{t},s_{t+1}) and \xi(\theta_{t-\tau};s_{t},a_{t},s_{t+1}) using Lemma 15. For notational brevity, we define the random tuple O_{t}=(s_{t},a_{t},s_{t+1}), so that g_{t}=g(\theta_{t},O_{t}). We denote

ξt(θ):=ξ(θ,Ot)=(g(θ,Ot)g¯(θ))T(θθ),\xi_{t}(\theta):=\xi(\theta,O_{t})=(g(\theta,O_{t})-\bar{g}(\theta))^{T}(\theta-\theta^{\star}),

where g¯(θ)=𝔼[g(θ,Ot)]\bar{g}(\theta)=\mathbb{E}[g(\theta,O_{t})] as defined in (16).

Next, we apply Lemma 9 to obtain

\left\lVert\theta_{t}-\theta_{t-\tau}\right\rVert_{2}\leq\sum_{i=t-\tau}^{t-1}\left\lVert\theta_{i+1}-\theta_{i}\right\rVert_{2}\leq\frac{G_{\infty}}{G_{0}}\sum_{i=t-\tau}^{t-1}\alpha_{i}.

Thus we can relate ξt(θt)\xi_{t}(\theta_{t}) and ξt(θtτ)\xi_{t}(\theta_{t-\tau}) by using the Lipschitz property in Lemma 15 as follows.

ξt(θt)ξt(θtτ)\displaystyle\xi_{t}(\theta_{t})-\xi_{t}(\theta_{t-\tau})\leq |ξt(θt)ξt(θtτ)|\displaystyle\lvert\xi_{t}(\theta_{t})-\xi_{t}(\theta_{t-\tau})\rvert
\displaystyle\leq 2((1+γ)D+G)θtθtτ\displaystyle 2((1+\gamma)D_{\infty}+G_{\infty})\left\lVert\theta_{t}-\theta_{t-\tau}\right\rVert
\displaystyle\leq 2((1+\gamma)D_{\infty}+G_{\infty})\cdot\frac{G_{\infty}}{G_{0}}\sum_{i=t-\tau}^{t-1}\alpha_{i}. (59)

Next, we seek to bound 𝔼[ξt(θtτ)]\mathbb{E}[\xi_{t}(\theta_{t-\tau})] using Lemma 16. Observe that given any deterministic θ𝒟\theta\in\mathcal{D}, we have

𝔼[ξ(θ,Ot)]=(𝔼[g(θ,Ot)]g¯(θ))T(θθ)=0.\mathbb{E}[\xi(\theta,O_{t})]=(\mathbb{E}[g(\theta,O_{t})]-\bar{g}(\theta))^{T}(\theta-\theta^{\star})=0.

Since \theta_{0} is non-random, we have \mathbb{E}[\xi(\theta_{0},O_{t})]=0. Now we are ready to bound \mathbb{E}[\xi(\theta_{t-\tau},O_{t})] with Lemma 16 by constructing a random process satisfying (58). To do so, consider random variables \theta^{\prime}_{t-\tau} and O^{\prime}_{t} drawn independently from the marginal distributions of \theta_{t-\tau} and O_{t}, so that \mathbb{P}(\theta^{\prime}_{t-\tau}=\cdot,O^{\prime}_{t}=\cdot)=\mathbb{P}(\theta_{t-\tau}=\cdot)\mathbb{P}(O_{t}=\cdot). Further, we can obtain \mathbb{E}[\xi(\theta^{\prime}_{t-\tau},O^{\prime}_{t})]=\mathbb{E}\left[\mathbb{E}[\xi(\theta^{\prime}_{t-\tau},O^{\prime}_{t})|\theta^{\prime}_{t-\tau}]\right]=0 since \theta^{\prime}_{t-\tau} and O^{\prime}_{t} are independent. Combining Lemma 15 and Lemma 16 yields

𝔼[ξ(θtτ,Ot)]2(2DG)(σρτ).\mathbb{E}[\xi(\theta_{t-\tau},O_{t})]\leq 2(2D_{\infty}G_{\infty})(\sigma\rho^{\tau}). (60)

Finally, we are ready to bound the bias. First, we take the expectation on both sides of (59) and obtain

\mathbb{E}[\xi_{t}(\theta_{t})]\leq\mathbb{E}[\xi_{t}(\theta_{t-\tau})]+2((1+\gamma)D_{\infty}+G_{\infty})\cdot\frac{G_{\infty}}{G_{0}}\sum_{i=t-\tau}^{t-1}\alpha_{i}.

Let τ=min{τ:σρταT}\tau^{*}=\min\{\tau:\sigma\rho^{\tau}\leq\alpha_{T}\}. When tτt\leq\tau^{*}, we choose τ=t\tau=t and have

𝔼[ξt(θt)]\displaystyle\mathbb{E}[\xi_{t}(\theta_{t})]\leq 𝔼[ξt(θ0)]+2((1+γ)D+G)GG0tα0\displaystyle\mathbb{E}[\xi_{t}(\theta_{0})]+2((1+\gamma)D_{\infty}+G_{\infty})\cdot\frac{G_{\infty}}{G_{0}}\cdot t\alpha_{0}
\displaystyle\leq 2((1+γ)D+G)GG0τα0.\displaystyle 2((1+\gamma)D_{\infty}+G_{\infty})\cdot\frac{G_{\infty}}{G_{0}}\tau^{*}\alpha_{0}.

When t>τt>\tau^{*}, we choose τ=τ\tau=\tau^{*} and have

\displaystyle\mathbb{E}[\xi_{t}(\theta_{t})]\leq \mathbb{E}[\xi_{t}(\theta_{t-\tau^{*}})]+2((1+\gamma)D_{\infty}+G_{\infty})\cdot\frac{G_{\infty}}{G_{0}}\sum_{i=t-\tau^{*}}^{t-1}\alpha_{i}
\displaystyle\overset{\text{(i)}}{\leq} 4D_{\infty}G_{\infty}(\sigma\rho^{\tau^{*}})+2((1+\gamma)D_{\infty}+G_{\infty})\cdot\frac{G_{\infty}}{G_{0}}\sum_{i=t-\tau^{*}}^{t-1}\alpha_{i}
\displaystyle\overset{\text{(ii)}}{\leq} 4D_{\infty}G_{\infty}\alpha_{T}+2((1+\gamma)D_{\infty}+G_{\infty})\cdot\frac{G_{\infty}}{G_{0}}\sum_{i=t-\tau^{*}}^{t-1}\alpha_{i},

where (i) follows from (60), and (ii) follows due to the definition of the mixing time.
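Throughout these arguments, \tau^{*}=\min\{\tau:\sigma\rho^{\tau}\leq\alpha_{T}\} is the mixing-time threshold. As a side illustration, the sketch below computes it by direct search for hypothetical values of \sigma, \rho, and the target stepsize, and also prints the closed-form value \lceil\log(\alpha/\sigma)/\log\rho\rceil, showing that \tau^{*} grows only logarithmically in 1/\alpha; all numbers are assumptions for illustration only.

import math

def mixing_time_threshold(sigma, rho, alpha):
    """Smallest tau with sigma * rho**tau <= alpha, for sigma > 0 and 0 < rho < 1."""
    tau = 0
    while sigma * rho**tau > alpha:
        tau += 1
    return tau

# Hypothetical mixing constants and target stepsizes, for illustration only.
sigma, rho = 2.0, 0.9
for alpha in (1e-1, 1e-2, 1e-3):
    tau_star = mixing_time_threshold(sigma, rho, alpha)
    print(alpha, tau_star, math.ceil(math.log(alpha / sigma) / math.log(rho)))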

Appendix G Proof of Theorem 3

In contrast to the regret bound for AMSGrad in conventional optimization Reddi et al., (2018), our analysis here focuses on the convergence rate. In fact, a slight modification of our proof also yields the convergence rate of AMSGrad for conventional strongly convex optimization, which can be of independent interest. Moreover, we provide results under a constant stepsize and under Markovian sampling, neither of which has been studied in Reddi et al., (2018).

To proceed with the proof, we first observe that

\theta_{t+1}=\Pi_{\mathcal{D},\hat{V}_{t}^{1/4}}\left(\theta_{t}-\alpha_{t}\hat{V}_{t}^{-\frac{1}{2}}m_{t}\right)=\underset{\theta\in\mathcal{D}}{\arg\min}\left\lVert\hat{V}_{t}^{1/4}\left(\theta_{t}-\alpha_{t}\hat{V}_{t}^{-\frac{1}{2}}m_{t}-\theta\right)\right\rVert.
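For concreteness, the following is a minimal sketch of one such update. As an illustrative assumption, the feasible set \mathcal{D} is taken to be the box [-D_{\infty},D_{\infty}]^{d}, under which the \hat{V}_{t}^{1/4}-weighted projection separates across coordinates and reduces to clipping. The m_{t} and \hat{v}_{t} recursions follow the standard AMSGrad form, with a constant \beta_{1} in place of the time-varying \beta_{1t} as a simplification, and the positive initialization of \hat{v} plays the role of the lower bound G_{0}^{2} used in the analysis.

import numpy as np

def amsgrad_step(theta, m, v, v_hat, g, alpha, beta1, beta2, D_inf):
    """One AMSGrad-style update with a weighted projection onto the box [-D_inf, D_inf]^d."""
    m = beta1 * m + (1 - beta1) * g          # first-moment estimate m_t
    v = beta2 * v + (1 - beta2) * g**2       # second-moment estimate v_t
    v_hat = np.maximum(v_hat, v)             # hat{v}_t, kept nondecreasing
    theta = theta - alpha * m / np.sqrt(v_hat)
    theta = np.clip(theta, -D_inf, D_inf)    # weighted projection for a box-shaped D
    return theta, m, v, v_hat

# Hypothetical dimension and hyperparameters, for illustration only.
d = 4
theta, m, v = np.zeros(d), np.zeros(d), np.zeros(d)
v_hat = 1e-2 * np.ones(d)                    # positive floor, so sqrt(v_hat) >= some G_0 > 0
g = np.array([0.3, -0.1, 0.8, 0.0])          # one stochastic (semi-)gradient sample
theta, m, v, v_hat = amsgrad_step(theta, m, v, v_hat, g, alpha=0.1, beta1=0.9, beta2=0.99, D_inf=2.0)
print(theta)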

Clearly Π𝒟,V^t1/4(θ)=θ\Pi_{\mathcal{D},\hat{V}_{t}^{1/4}}(\theta^{\star})=\theta^{\star} due to Assumption 4. We start from the update of θt\theta_{t} when t2t\geq 2.

V^t1/4(θt+1θ)2\displaystyle\left\lVert\hat{V}_{t}^{1/4}(\theta_{t+1}-\theta^{\star})\right\rVert^{2} =Π𝒟,V^t1/4V^t1/4(θtθαtV^t12mt)2\displaystyle=\left\lVert\Pi_{\mathcal{D},\hat{V}_{t}^{1/4}}\hat{V}_{t}^{1/4}\left(\theta_{t}-\theta^{\star}-\alpha_{t}\hat{V}_{t}^{-\frac{1}{2}}m_{t}\right)\right\rVert^{2}
V^t1/4(θtθαtV^t12mt)2\displaystyle\leq\left\lVert\hat{V}_{t}^{1/4}\left(\theta_{t}-\theta^{\star}-\alpha_{t}\hat{V}_{t}^{-\frac{1}{2}}m_{t}\right)\right\rVert^{2}
=V^t1/4(θtθ)2+αtV^t1/4mt22αt(θtθ)Tmt\displaystyle=\left\lVert\hat{V}_{t}^{1/4}(\theta_{t}-\theta^{\star})\right\rVert^{2}+\left\lVert\alpha_{t}\hat{V}_{t}^{-1/4}m_{t}\right\rVert^{2}-2\alpha_{t}(\theta_{t}-\theta^{\star})^{T}m_{t}
=V^t1/4(θtθ)2+αtV^t1/4mt22αt(θtθ)T(β1tmt1+(1β1t)gt)\displaystyle=\left\lVert\hat{V}_{t}^{1/4}(\theta_{t}-\theta^{\star})\right\rVert^{2}+\left\lVert\alpha_{t}\hat{V}_{t}^{-1/4}m_{t}\right\rVert^{2}-2\alpha_{t}(\theta_{t}-\theta^{\star})^{T}(\beta_{1t}m_{t-1}+(1-\beta_{1t})g_{t})
(i)V^t1/4(θtθ)2+αtV^t1/4mt2+αtβ1t(1αtV^t1/4(θtθ)2+αtV^t1/4mt12)\displaystyle\overset{\text{(i)}}{\leq}\left\lVert\hat{V}_{t}^{1/4}(\theta_{t}-\theta^{\star})\right\rVert^{2}+\left\lVert\alpha_{t}\hat{V}_{t}^{-1/4}m_{t}\right\rVert^{2}+\alpha_{t}\beta_{1t}\left(\frac{1}{\alpha_{t}}\left\lVert\hat{V}_{t}^{1/4}(\theta_{t}-\theta^{\star})\right\rVert^{2}+\alpha_{t}\left\lVert\hat{V}_{t}^{-1/4}m_{t-1}\right\rVert^{2}\right)
2αt(1β1t)(θtθ)Tgt\displaystyle\quad-2\alpha_{t}(1-\beta_{1t})(\theta_{t}-\theta^{\star})^{T}g_{t}
(ii)V^t1/4(θtθ)2+αtV^t1/4mt2+β1tV^t1/4(θtθ)2+αt2β1tV^t11/4mt12\displaystyle\overset{\text{(ii)}}{\leq}\left\lVert\hat{V}_{t}^{1/4}(\theta_{t}-\theta^{\star})\right\rVert^{2}+\left\lVert\alpha_{t}\hat{V}_{t}^{-1/4}m_{t}\right\rVert^{2}+\beta_{1t}\left\lVert\hat{V}_{t}^{1/4}(\theta_{t}-\theta^{\star})\right\rVert^{2}+\alpha_{t}^{2}\beta_{1t}\left\lVert\hat{V}_{t-1}^{-1/4}m_{t-1}\right\rVert^{2}
2αt(1β1t)(θtθ)Tgt,\displaystyle\quad-2\alpha_{t}(1-\beta_{1t})(\theta_{t}-\theta^{\star})^{T}g_{t},

where (i) follows from Cauchy-Schwarz inequality, and (ii) holds because v^t+1,iv^t,i,t,i\hat{v}_{t+1,i}\geq\hat{v}_{t,i},\forall t,\forall i.

Next, we take the expectation over all samples used up to time step t on both sides, which preserves the inequality. Since we consider the Markovian sampling case, the gradient estimate is biased; that is,

𝔼[(θtθ)Tgt]=𝔼[(θtθ)Tg¯t]+𝔼[(θtθ)T(gtg¯t)]𝔼[(θtθ)Tg¯t],\mathbb{E}\left[(\theta_{t}-\theta^{\star})^{T}g_{t}\right]=\mathbb{E}\left[(\theta_{t}-\theta^{\star})^{T}\bar{g}_{t}\right]+\mathbb{E}\left[(\theta_{t}-\theta^{\star})^{T}(g_{t}-\bar{g}_{t})\right]\neq\mathbb{E}\left[(\theta_{t}-\theta^{\star})^{T}\bar{g}_{t}\right], (61)

where \bar{g}_{t}:=\bar{g}(\theta_{t}); this notation will be used in the sequel for simplicity.

Thus we have

𝔼V^t1/4(θt+1θ)2\displaystyle\mathbb{E}\left\lVert\hat{V}_{t}^{1/4}(\theta_{t+1}-\theta^{\star})\right\rVert^{2}
𝔼V^t1/4(θtθ)2+αt2𝔼V^t1/4mt2+β1t𝔼V^t1/4(θtθ)2+αt2β1t𝔼V^t11/4mt12\displaystyle\leq\mathbb{E}\left\lVert\hat{V}_{t}^{1/4}(\theta_{t}-\theta^{\star})\right\rVert^{2}+\alpha_{t}^{2}\mathbb{E}\left\lVert\hat{V}_{t}^{-1/4}m_{t}\right\rVert^{2}+\beta_{1t}\mathbb{E}\left\lVert\hat{V}_{t}^{1/4}(\theta_{t}-\theta^{\star})\right\rVert^{2}+\alpha_{t}^{2}\beta_{1t}\mathbb{E}\left\lVert\hat{V}_{t-1}^{-1/4}m_{t-1}\right\rVert^{2}
2αt(1β1t)𝔼[(θtθ)Tgt]\displaystyle\quad-2\alpha_{t}(1-\beta_{1t})\mathbb{E}\left[(\theta_{t}-\theta^{\star})^{T}g_{t}\right]
=𝔼V^t1/4(θtθ)2+αt2𝔼V^t1/4mt2+β1t𝔼V^t1/4(θtθ)2+αt2β1t𝔼V^t11/4mt12\displaystyle=\mathbb{E}\left\lVert\hat{V}_{t}^{1/4}(\theta_{t}-\theta^{\star})\right\rVert^{2}+\alpha_{t}^{2}\mathbb{E}\left\lVert\hat{V}_{t}^{-1/4}m_{t}\right\rVert^{2}+\beta_{1t}\mathbb{E}\left\lVert\hat{V}_{t}^{1/4}(\theta_{t}-\theta^{\star})\right\rVert^{2}+\alpha_{t}^{2}\beta_{1t}\mathbb{E}\left\lVert\hat{V}_{t-1}^{-1/4}m_{t-1}\right\rVert^{2}
2αt(1β1t)𝔼[(θtθ)Tg¯t]+2αt(1β1t)𝔼[(θtθ)T(g¯tgt)]\displaystyle\quad-2\alpha_{t}(1-\beta_{1t})\mathbb{E}\left[(\theta_{t}-\theta^{\star})^{T}\bar{g}_{t}\right]+2\alpha_{t}(1-\beta_{1t})\mathbb{E}\left[(\theta_{t}-\theta^{\star})^{T}(\bar{g}_{t}-g_{t})\right]
(i)𝔼V^t1/4(θtθ)2+αt2𝔼V^t1/4mt2+β1t𝔼V^t1/4(θtθ)2+αt2β1t𝔼V^t11/4mt12\displaystyle\overset{\text{(i)}}{\leq}\mathbb{E}\left\lVert\hat{V}_{t}^{1/4}(\theta_{t}-\theta^{\star})\right\rVert^{2}+\alpha_{t}^{2}\mathbb{E}\left\lVert\hat{V}_{t}^{-1/4}m_{t}\right\rVert^{2}+\beta_{1t}\mathbb{E}\left\lVert\hat{V}_{t}^{1/4}(\theta_{t}-\theta^{\star})\right\rVert^{2}+\alpha_{t}^{2}\beta_{1t}\mathbb{E}\left\lVert\hat{V}_{t-1}^{-1/4}m_{t-1}\right\rVert^{2}
2αtc(1β1t)𝔼θtθ2+2αt(1β1t)𝔼[(θtθ)T(g¯tgt)]\displaystyle\quad-2\alpha_{t}c(1-\beta_{1t})\mathbb{E}\left\lVert\theta_{t}-\theta^{\star}\right\rVert^{2}+2\alpha_{t}(1-\beta_{1t})\mathbb{E}\left[(\theta_{t}-\theta^{\star})^{T}(\bar{g}_{t}-g_{t})\right]
(ii)𝔼V^t1/4(θtθ)2+αt2𝔼V^t1/4mt2+β1t𝔼V^t1/4(θtθ)2+αt2β1𝔼V^t11/4mt12\displaystyle\overset{\text{(ii)}}{\leq}\mathbb{E}\left\lVert\hat{V}_{t}^{1/4}(\theta_{t}-\theta^{\star})\right\rVert^{2}+\alpha_{t}^{2}\mathbb{E}\left\lVert\hat{V}_{t}^{-1/4}m_{t}\right\rVert^{2}+\beta_{1t}\mathbb{E}\left\lVert\hat{V}_{t}^{1/4}(\theta_{t}-\theta^{\star})\right\rVert^{2}+\alpha_{t}^{2}\beta_{1}\mathbb{E}\left\lVert\hat{V}_{t-1}^{-1/4}m_{t-1}\right\rVert^{2}
2αtc(1β1)𝔼θtθ2+2αt(1β1t)𝔼[(θtθ)T(gtg¯t)]\displaystyle\quad-2\alpha_{t}c(1-\beta_{1})\mathbb{E}\left\lVert\theta_{t}-\theta^{\star}\right\rVert^{2}+2\alpha_{t}(1-\beta_{1t})\mathbb{E}\left[(\theta_{t}-\theta^{\star})^{T}(g_{t}-\bar{g}_{t})\right]
(iii)𝔼V^t1/4(θtθ)2+αt2𝔼V^t1/4mt2+GD2β1t+αt2β1𝔼V^t11/4mt12\displaystyle\overset{\text{(iii)}}{\leq}\mathbb{E}\left\lVert\hat{V}_{t}^{1/4}(\theta_{t}-\theta^{\star})\right\rVert^{2}+\alpha_{t}^{2}\mathbb{E}\left\lVert\hat{V}_{t}^{-1/4}m_{t}\right\rVert^{2}+G_{\infty}D_{\infty}^{2}\beta_{1t}+\alpha_{t}^{2}\beta_{1}\mathbb{E}\left\lVert\hat{V}_{t-1}^{-1/4}m_{t-1}\right\rVert^{2}
2αtc(1β1)𝔼θtθ2+2αt(1β1t)𝔼[(θtθ)T(gtg¯t)],\displaystyle\quad-2\alpha_{t}c(1-\beta_{1})\mathbb{E}\left\lVert\theta_{t}-\theta^{\star}\right\rVert^{2}+2\alpha_{t}(1-\beta_{1t})\mathbb{E}\left[(\theta_{t}-\theta^{\star})^{T}(g_{t}-\bar{g}_{t})\right],

where (i) follows from Lemma 2 and because 1β1t>01-\beta_{1t}>0, (ii) follows because β1t<β1<1\beta_{1t}<\beta_{1}<1 and 𝔼θtθ2>0\mathbb{E}\left\lVert\theta_{t}-\theta^{\star}\right\rVert^{2}>0, and (iii) follows from V^t1/4(θtθ)2V^t1/422θtθ2GD2\left\lVert\hat{V}_{t}^{1/4}(\theta_{t}-\theta^{\star})\right\rVert^{2}\leq\left\lVert\hat{V}_{t}^{1/4}\right\rVert_{2}^{2}\left\lVert\theta_{t}-\theta^{\star}\right\rVert^{2}\leq G_{\infty}D_{\infty}^{2} by Lemma 14 and Assumption 4. By rearranging the terms in the above inequality and taking the summation over time steps, we have

2c(1\displaystyle 2c(1 β1)t=2T𝔼θtθ2\displaystyle-\beta_{1})\sum_{t=2}^{T}\mathbb{E}\left\lVert\theta_{t}-\theta^{\star}\right\rVert^{2}
t=2T1αt(𝔼V^t1/4(θtθ)2𝔼V^t1/4(θt+1θ)2)+t=2Tβ1tGD2αt\displaystyle\leq\sum_{t=2}^{T}\frac{1}{\alpha_{t}}\left(\mathbb{E}\left\lVert\hat{V}_{t}^{1/4}(\theta_{t}-\theta^{\star})\right\rVert^{2}\!-\!\mathbb{E}\left\lVert\hat{V}_{t}^{1/4}(\theta_{t+1}-\theta^{\star})\right\rVert^{2}\right)+\sum_{t=2}^{T}\frac{\beta_{1t}G_{\infty}D_{\infty}^{2}}{\alpha_{t}}
+t=2Tαt𝔼V^t1/4mt2+t=2Tαtβ1𝔼V^t11/4mt12\displaystyle\quad+\sum_{t=2}^{T}\alpha_{t}\mathbb{E}\left\lVert\hat{V}_{t}^{-1/4}m_{t}\right\rVert^{2}+\sum_{t=2}^{T}\alpha_{t}\beta_{1}\mathbb{E}\left\lVert\hat{V}_{t-1}^{-1/4}m_{t-1}\right\rVert^{2}
+2t=2T(1β1t)𝔼[(θtθ)T(gtg¯t)]\displaystyle\quad+2\sum_{t=2}^{T}(1-\beta_{1t})\mathbb{E}\left[(\theta_{t}-\theta^{\star})^{T}(g_{t}-\bar{g}_{t})\right]
(i)t=2T1αt(𝔼V^t1/4(θtθ)2𝔼V^t1/4(θt+1θ)2)+t=2Tβ1tGD2αt\displaystyle\overset{\text{(i)}}{\leq}\sum_{t=2}^{T}\frac{1}{\alpha_{t}}\left(\mathbb{E}\left\lVert\hat{V}_{t}^{1/4}(\theta_{t}-\theta^{\star})\right\rVert^{2}\!-\!\mathbb{E}\left\lVert\hat{V}_{t}^{1/4}(\theta_{t+1}-\theta^{\star})\right\rVert^{2}\right)+\sum_{t=2}^{T}\frac{\beta_{1t}G_{\infty}D_{\infty}^{2}}{\alpha_{t}}
+t=2Tαt𝔼V^t1/4mt2+t=2Tαt1β1𝔼V^t11/4mt12\displaystyle\quad+\sum_{t=2}^{T}\alpha_{t}\mathbb{E}\left\lVert\hat{V}_{t}^{-1/4}m_{t}\right\rVert^{2}+\sum_{t=2}^{T}\alpha_{t-1}\beta_{1}\mathbb{E}\left\lVert\hat{V}_{t-1}^{-1/4}m_{t-1}\right\rVert^{2}
+2t=2T(1β1t)𝔼[(θtθ)T(gtg¯t)]\displaystyle\quad+2\sum_{t=2}^{T}(1-\beta_{1t})\mathbb{E}\left[(\theta_{t}-\theta^{\star})^{T}(g_{t}-\bar{g}_{t})\right]
t=2T1αt(𝔼V^t1/4(θtθ)2𝔼V^t1/4(θt+1θ)2)+t=2Tβ1tGD2αt\displaystyle\leq\sum_{t=2}^{T}\frac{1}{\alpha_{t}}\left(\mathbb{E}\left\lVert\hat{V}_{t}^{1/4}(\theta_{t}-\theta^{\star})\right\rVert^{2}\!-\!\mathbb{E}\left\lVert\hat{V}_{t}^{1/4}(\theta_{t+1}-\theta^{\star})\right\rVert^{2}\right)+\sum_{t=2}^{T}\frac{\beta_{1t}G_{\infty}D_{\infty}^{2}}{\alpha_{t}}
+(1+β1)t=1Tαt𝔼V^t1/4mt2+2t=2T(1β1t)𝔼[(θtθ)T(gtg¯t)],\displaystyle\quad+(1+\beta_{1})\sum_{t=1}^{T}\alpha_{t}\mathbb{E}\left\lVert\hat{V}_{t}^{-1/4}m_{t}\right\rVert^{2}+2\sum_{t=2}^{T}(1-\beta_{1t})\mathbb{E}\left[(\theta_{t}-\theta^{\star})^{T}(g_{t}-\bar{g}_{t})\right],

where (i) follows from \alpha_{t}\leq\alpha_{t-1}. By further expanding the first term on the right-hand side of the last inequality, we can then bound the sum as

2c(1\displaystyle 2c(1 β1)t=2T𝔼θtθ2\displaystyle-\beta_{1})\sum_{t=2}^{T}\mathbb{E}\left\lVert\theta_{t}-\theta^{\star}\right\rVert^{2}
t=2T1αt𝔼(V^t1/4(θtθ)2V^t1/4(θt+1θ)2)+t=2Tβ1tGD2αt\displaystyle\leq\sum_{t=2}^{T}\frac{1}{\alpha_{t}}\mathbb{E}\left(\left\lVert\hat{V}_{t}^{1/4}(\theta_{t}-\theta^{\star})\right\rVert^{2}-\left\lVert\hat{V}_{t}^{1/4}(\theta_{t+1}-\theta^{\star})\right\rVert^{2}\right)+\sum_{t=2}^{T}\frac{\beta_{1t}G_{\infty}D_{\infty}^{2}}{\alpha_{t}}
+(1+β1)t=1Tαt𝔼V^t1/4mt2+2t=2T𝔼[(θtθ)T(gtg¯t)]\displaystyle\quad+(1+\beta_{1})\sum_{t=1}^{T}\alpha_{t}\mathbb{E}\left\lVert\hat{V}_{t}^{-1/4}m_{t}\right\rVert^{2}+2\sum_{t=2}^{T}\mathbb{E}\left[(\theta_{t}-\theta^{\star})^{T}(g_{t}-\bar{g}_{t})\right]
=𝔼V^21/4(θ2θ)2α2+t=3T𝔼(V^t1/4(θtθ)2αtV^t11/4(θtθ)2αt1)\displaystyle=\frac{\mathbb{E}\left\lVert\hat{V}_{2}^{1/4}(\theta_{2}-\theta^{\star})\right\rVert^{2}}{\alpha_{2}}+\sum_{t=3}^{T}\mathbb{E}\left(\frac{\left\lVert\hat{V}_{t}^{1/4}(\theta_{t}-\theta^{\star})\right\rVert^{2}}{\alpha_{t}}-\frac{\left\lVert\hat{V}_{t-1}^{1/4}(\theta_{t}-\theta^{\star})\right\rVert^{2}}{\alpha_{t-1}}\right)
𝔼V^T1/4(θT+1θ)2αT+t=2Tβ1tGD2αt+(1+β1)t=1Tαt𝔼V^t1/4mt2\displaystyle\quad-\frac{\mathbb{E}\left\lVert\hat{V}_{T}^{1/4}(\theta_{T+1}-\theta^{\star})\right\rVert^{2}}{\alpha_{T}}+\sum_{t=2}^{T}\frac{\beta_{1t}G_{\infty}D_{\infty}^{2}}{\alpha_{t}}+(1+\beta_{1})\sum_{t=1}^{T}\alpha_{t}\mathbb{E}\left\lVert\hat{V}_{t}^{-1/4}m_{t}\right\rVert^{2}
+2t=2T𝔼[(θtθ)T(gtg¯t)]\displaystyle\quad+2\sum_{t=2}^{T}\mathbb{E}\left[(\theta_{t}-\theta^{\star})^{T}(g_{t}-\bar{g}_{t})\right]
=𝔼V^21/4(θ2θ)2α2+t=3T𝔼(i=1dv^t,i1/2(θt,iθi)2αti=1dv^t1,i1/2(θt,iθi)2αt1)\displaystyle=\frac{\mathbb{E}\left\lVert\hat{V}_{2}^{1/4}(\theta_{2}-\theta^{\star})\right\rVert^{2}}{\alpha_{2}}+\sum_{t=3}^{T}\mathbb{E}\left(\frac{\sum_{i=1}^{d}\hat{v}_{t,i}^{1/2}(\theta_{t,i}-\theta_{i}^{\star})^{2}}{\alpha_{t}}-\frac{\sum_{i=1}^{d}\hat{v}_{t-1,i}^{1/2}(\theta_{t,i}-\theta_{i}^{\star})^{2}}{\alpha_{t-1}}\right)
\displaystyle\quad-\frac{\mathbb{E}\left\lVert\hat{V}_{T}^{1/4}(\theta_{T+1}-\theta^{\star})\right\rVert^{2}}{\alpha_{T}}+\sum_{t=2}^{T}\frac{\beta_{1t}G_{\infty}D_{\infty}^{2}}{\alpha_{t}}+(1+\beta_{1})\sum_{t=1}^{T}\alpha_{t}\mathbb{E}\left\lVert\hat{V}_{t}^{-1/4}m_{t}\right\rVert^{2}
+2t=2T𝔼[(θtθ)T(gtg¯t)]\displaystyle\quad+2\sum_{t=2}^{T}\mathbb{E}\left[(\theta_{t}-\theta^{\star})^{T}(g_{t}-\bar{g}_{t})\right]
=𝔼V^21/4(θ2θ)2α2+t=3Ti=1d𝔼(θt,iθi)2(v^t,i1/2αtv^t1,i1/2αt1)\displaystyle=\frac{\mathbb{E}\left\lVert\hat{V}_{2}^{1/4}(\theta_{2}-\theta^{\star})\right\rVert^{2}}{\alpha_{2}}+\sum_{t=3}^{T}\sum_{i=1}^{d}\mathbb{E}(\theta_{t,i}-\theta_{i}^{\star})^{2}\left(\frac{\hat{v}_{t,i}^{1/2}}{\alpha_{t}}-\frac{\hat{v}_{t-1,i}^{1/2}}{\alpha_{t-1}}\right)
𝔼V^T1/4(θT+1θ)2αT+t=2Tβ1tGD2αt+(1+β1)t=1Tαt𝔼V^t1/4mt2\displaystyle\quad-\frac{\mathbb{E}\left\lVert\hat{V}_{T}^{1/4}(\theta_{T+1}-\theta^{\star})\right\rVert^{2}}{\alpha_{T}}+\sum_{t=2}^{T}\frac{\beta_{1t}G_{\infty}D_{\infty}^{2}}{\alpha_{t}}+(1+\beta_{1})\sum_{t=1}^{T}\alpha_{t}\mathbb{E}\left\lVert\hat{V}_{t}^{-1/4}m_{t}\right\rVert^{2}
+2t=2T𝔼[(θtθ)T(gtg¯t)].\displaystyle\quad+2\sum_{t=2}^{T}\mathbb{E}\left[(\theta_{t}-\theta^{\star})^{T}(g_{t}-\bar{g}_{t})\right].

Next, we further bound the above term as

2c(1\displaystyle 2c(1 β1)t=2T𝔼θtθ2\displaystyle-\beta_{1})\sum_{t=2}^{T}\mathbb{E}\left\lVert\theta_{t}-\theta^{\star}\right\rVert^{2}
(i)𝔼V^21/4(θ2θ)2α2+D2t=3Ti=1d𝔼(v^t,i1/2αtv^t1,i1/2αt1)\displaystyle\overset{\text{(i)}}{\leq}\frac{\mathbb{E}\left\lVert\hat{V}_{2}^{1/4}(\theta_{2}-\theta^{\star})\right\rVert^{2}}{\alpha_{2}}+D_{\infty}^{2}\sum_{t=3}^{T}\sum_{i=1}^{d}\mathbb{E}\left(\frac{\hat{v}_{t,i}^{1/2}}{\alpha_{t}}-\frac{\hat{v}_{t-1,i}^{1/2}}{\alpha_{t-1}}\right)
𝔼V^T1/4(θT+1θ)2αT+t=2Tβ1tGD2αt+(1+β1)t=1Tαt𝔼V^t1/4mt2\displaystyle\quad-\frac{\mathbb{E}\left\lVert\hat{V}_{T}^{1/4}(\theta_{T+1}-\theta^{\star})\right\rVert^{2}}{\alpha_{T}}+\sum_{t=2}^{T}\frac{\beta_{1t}G_{\infty}D_{\infty}^{2}}{\alpha_{t}}+(1+\beta_{1})\sum_{t=1}^{T}\alpha_{t}\mathbb{E}\left\lVert\hat{V}_{t}^{-1/4}m_{t}\right\rVert^{2}
+2t=2T𝔼[(θtθ)T(gtg¯t)]\displaystyle\quad+2\sum_{t=2}^{T}\mathbb{E}\left[(\theta_{t}-\theta^{\star})^{T}(g_{t}-\bar{g}_{t})\right]
𝔼V^21/4(θ2θ)2α2+D2i=1d𝔼v^T,i1/2αT+t=2Tβ1tGD2αt+(1+β1)t=1Tαt𝔼V^t1/4mt2\displaystyle\leq\frac{\mathbb{E}\left\lVert\hat{V}_{2}^{1/4}(\theta_{2}-\theta^{\star})\right\rVert^{2}}{\alpha_{2}}+D_{\infty}^{2}\sum_{i=1}^{d}\mathbb{E}\frac{\hat{v}_{T,i}^{1/2}}{\alpha_{T}}+\sum_{t=2}^{T}\frac{\beta_{1t}G_{\infty}D_{\infty}^{2}}{\alpha_{t}}+(1+\beta_{1})\sum_{t=1}^{T}\alpha_{t}\mathbb{E}\left\lVert\hat{V}_{t}^{-1/4}m_{t}\right\rVert^{2}
+2t=2T𝔼[(θtθ)T(gtg¯t)],\displaystyle\quad+2\sum_{t=2}^{T}\mathbb{E}\left[(\theta_{t}-\theta^{\star})^{T}(g_{t}-\bar{g}_{t})\right], (62)

where (i) follows from Assumption 4 and because v^t,i1/2αtv^t1,i1/2αt1\frac{\hat{v}_{t,i}^{1/2}}{\alpha_{t}}\geq\frac{\hat{v}_{t-1,i}^{1/2}}{\alpha_{t-1}}.

Since we consider the case with a constant stepsize \alpha_{t}=\alpha, we continue to bound (62) as follows

2c(1\displaystyle 2c(1 β1)t=2T𝔼θtθ2\displaystyle-\beta_{1})\sum_{t=2}^{T}\mathbb{E}\left\lVert\theta_{t}-\theta^{\star}\right\rVert^{2}
𝔼V^21/4(θ2θ)2α2+D2i=1d𝔼v^T,i1/2αT+t=2Tβ1tGD2αt+(1+β1)t=1Tαt𝔼V^t1/4mt2\displaystyle\leq\frac{\mathbb{E}\left\lVert\hat{V}_{2}^{1/4}(\theta_{2}-\theta^{\star})\right\rVert^{2}}{\alpha_{2}}+D_{\infty}^{2}\sum_{i=1}^{d}\mathbb{E}\frac{\hat{v}_{T,i}^{1/2}}{\alpha_{T}}+\sum_{t=2}^{T}\frac{\beta_{1t}G_{\infty}D_{\infty}^{2}}{\alpha_{t}}+(1+\beta_{1})\sum_{t=1}^{T}\alpha_{t}\mathbb{E}\left\lVert\hat{V}_{t}^{-1/4}m_{t}\right\rVert^{2}
+2t=2T𝔼[(θtθ)T(gtg¯t)]\displaystyle\quad+2\sum_{t=2}^{T}\mathbb{E}\left[(\theta_{t}-\theta^{\star})^{T}(g_{t}-\bar{g}_{t})\right]
(i)GD2α+GD2α+β1λGD2α(1λ)+α(1+β1)G2G0T+2t=2T𝔼[(θtθ)T(gtg¯t)]\displaystyle\overset{\text{(i)}}{\leq}\frac{G_{\infty}D_{\infty}^{2}}{\alpha}+\frac{G_{\infty}D_{\infty}^{2}}{\alpha}+\frac{\beta_{1}\lambda G_{\infty}D_{\infty}^{2}}{\alpha(1-\lambda)}+\frac{\alpha(1+\beta_{1})G_{\infty}^{2}}{G_{0}}\cdot T+2\sum_{t=2}^{T}\mathbb{E}\left[(\theta_{t}-\theta^{\star})^{T}(g_{t}-\bar{g}_{t})\right]
=2GD2α+β1λGD2α(1λ)+α(1+β1)G2G0T+2t=2T𝔼[(θtθ)T(gtg¯t)],\displaystyle=\frac{2G_{\infty}D_{\infty}^{2}}{\alpha}+\frac{\beta_{1}\lambda G_{\infty}D_{\infty}^{2}}{\alpha(1-\lambda)}+\frac{\alpha(1+\beta_{1})G_{\infty}^{2}}{G_{0}}\cdot T+2\sum_{t=2}^{T}\mathbb{E}\left[(\theta_{t}-\theta^{\star})^{T}(g_{t}-\bar{g}_{t})\right], (63)

where (i) follows because

V^t1/4mt2=i=1dmt,i2v^t,i1G0i=1dmt,i2G2G0.\displaystyle\left\lVert\hat{V}_{t}^{-1/4}m_{t}\right\rVert^{2}=\sum_{i=1}^{d}\frac{m_{t,i}^{2}}{\sqrt{\hat{v}_{t,i}}}\leq\frac{1}{G_{0}}\sum_{i=1}^{d}m_{t,i}^{2}\leq\frac{G_{\infty}^{2}}{G_{0}}.

Then we are ready to obtain the upper bound by applying Lemma 3. Choosing τ=t\tau=t if tτt\leq\tau^{*} and τ=τ\tau=\tau^{*} if t>τt>\tau^{*}, we obtain

2c\displaystyle 2c (1β1)t=2T𝔼θtθ2\displaystyle(1-\beta_{1})\sum_{t=2}^{T}\mathbb{E}\left\lVert\theta_{t}-\theta^{\star}\right\rVert^{2}
2GD2α+β1λGD2α(1λ)+α(1+β1)G2G0T+2t=2T𝔼[(θtθ)T(gtg¯t)]\displaystyle\leq\frac{2G_{\infty}D_{\infty}^{2}}{\alpha}+\frac{\beta_{1}\lambda G_{\infty}D_{\infty}^{2}}{\alpha(1-\lambda)}+\frac{\alpha(1+\beta_{1})G_{\infty}^{2}}{G_{0}}\cdot T+2\sum_{t=2}^{T}\mathbb{E}\left[(\theta_{t}-\theta^{\star})^{T}(g_{t}-\bar{g}_{t})\right]
2GD2α+β1λGD2α(1λ)+α(1+β1)G2G0T+4t=2τ[((1+γ)D+G)GG0τα]\displaystyle\leq\frac{2G_{\infty}D_{\infty}^{2}}{\alpha}+\frac{\beta_{1}\lambda G_{\infty}D_{\infty}^{2}}{\alpha(1-\lambda)}+\frac{\alpha(1+\beta_{1})G_{\infty}^{2}}{G_{0}}\cdot T+4\sum_{t=2}^{\tau^{*}}\left[((1+\gamma)D_{\infty}+G_{\infty})\cdot\frac{G_{\infty}}{G_{0}}\tau^{*}\alpha\right]
+4t=τ+1T[2DGα+((1+γ)D+G)GG0τα]\displaystyle\quad+4\sum_{t=\tau^{*}+1}^{T}\left[2D_{\infty}G_{\infty}\alpha+((1+\gamma)D_{\infty}+G_{\infty})\cdot\frac{G_{\infty}}{G_{0}}\tau^{*}\alpha\right]
2GD2α+β1λGD2α(1λ)+α(1+β1)G2G0T+4((1+γ)D+G)GG0(τ)2α\displaystyle\leq\frac{2G_{\infty}D_{\infty}^{2}}{\alpha}+\frac{\beta_{1}\lambda G_{\infty}D_{\infty}^{2}}{\alpha(1-\lambda)}+\frac{\alpha(1+\beta_{1})G_{\infty}^{2}}{G_{0}}\cdot T+4((1+\gamma)D_{\infty}+G_{\infty})\cdot\frac{G_{\infty}}{G_{0}}(\tau^{*})^{2}\alpha
+4[2DGα+((1+γ)D+G)GG0τα](Tτ).\displaystyle\quad+4\left[2D_{\infty}G_{\infty}\alpha+((1+\gamma)D_{\infty}+G_{\infty})\cdot\frac{G_{\infty}}{G_{0}}\tau^{*}\alpha\right](T-\tau^{*}).

Finally, applying Jensen’s inequality yields

𝔼θoutθ21Tt=1T𝔼θtθ2C1T+αC2,\displaystyle\mathbb{E}\left\lVert\theta_{out}-\theta^{\star}\right\rVert^{2}\leq\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\left\lVert\theta_{t}-\theta^{\star}\right\rVert^{2}\leq\frac{C_{1}}{T}+\alpha C_{2}, (64)

where

\displaystyle C_{1}=\frac{G_{\infty}D_{\infty}^{2}}{\alpha c(1-\beta_{1})}+\frac{\beta_{1}\lambda G_{\infty}D_{\infty}^{2}}{2\alpha c(1-\lambda)(1-\beta_{1})}+2((1+\gamma)D_{\infty}+G_{\infty})\cdot\frac{G_{\infty}}{cG_{0}(1-\beta_{1})}(\tau^{*})^{2}\alpha,
\displaystyle C_{2}=\frac{4D_{\infty}G_{\infty}}{c(1-\beta_{1})}+\frac{2G_{\infty}\tau^{*}((1+\gamma)D_{\infty}+G_{\infty})}{cG_{0}(1-\beta_{1})}+\frac{(1+\beta_{1})G_{\infty}^{2}}{2cG_{0}(1-\beta_{1})}.

Appendix H Proof of Theorem 4

We first provide the following useful lemma.

Lemma 17.

Let αt=αt\alpha_{t}=\frac{\alpha}{\sqrt{t}} and β1t=β1λt\beta_{1t}=\beta_{1}\lambda^{t} for t=1,2,t=1,2,\dots. Then

\sum_{t=1}^{T}\frac{\beta_{1t}}{\alpha_{t}}\leq\frac{\beta_{1}}{\alpha(1-\lambda)^{2}}. (65)
Proof.

The proof follows from standard bounds on geometric series:

\sum_{t=1}^{T}\frac{\beta_{1t}}{\alpha_{t}}=\sum_{t=1}^{T}\frac{\beta_{1t}\sqrt{t}}{\alpha}\leq\sum_{t=1}^{T}\frac{\beta_{1}\lambda^{t-1}t}{\alpha}=\frac{\beta_{1}}{\alpha(1-\lambda)}\left(\sum_{t=1}^{T}\lambda^{t-1}-T\lambda^{T}\right)\leq\frac{\beta_{1}}{\alpha(1-\lambda)^{2}}. (66)
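As a quick numerical sanity check (not part of the proof), the bound (65) can be verified for arbitrary test values of $\alpha$, $\beta_{1}$, and $\lambda\in(0,1)$:

# Numerically check sum_{t=1}^T beta_{1t}/alpha_t <= beta_1/(alpha*(1-lambda)^2)
# with alpha_t = alpha/sqrt(t) and beta_{1t} = beta_1*lambda**t.
import math

def lhs(T, alpha, beta1, lam):
    return sum(beta1 * lam**t * math.sqrt(t) / alpha for t in range(1, T + 1))

def rhs(alpha, beta1, lam):
    return beta1 / (alpha * (1.0 - lam) ** 2)

alpha, beta1, lam = 0.1, 0.9, 0.5   # arbitrary test values
for T in (10, 100, 1000):
    assert lhs(T, alpha, beta1, lam) <= rhs(alpha, beta1, lam)
    print(T, round(lhs(T, alpha, beta1, lam), 4), round(rhs(alpha, beta1, lam), 4))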

Proof of Theorem 4: The proof starts with steps similar to those of Theorem 3. The difference begins from (G), where we now consider a diminishing stepsize $\alpha_{t}=\frac{\alpha}{\sqrt{t}}$. We then have

2c(1-\beta_{1})\sum_{t=2}^{T}\mathbb{E}\left\lVert\theta_{t}-\theta^{\star}\right\rVert^{2}
\leq\frac{\mathbb{E}\left\lVert\hat{V}_{2}^{1/4}(\theta_{2}-\theta^{\star})\right\rVert^{2}}{\alpha_{2}}+D_{\infty}^{2}\sum_{i=1}^{d}\mathbb{E}\frac{\hat{v}_{T,i}^{1/2}}{\alpha_{T}}+\sum_{t=2}^{T}\frac{\beta_{1t}G_{\infty}D_{\infty}^{2}}{\alpha_{t}}+(1+\beta_{1})\sum_{t=1}^{T}\alpha_{t}\mathbb{E}\left\lVert\hat{V}_{t}^{-1/4}m_{t}\right\rVert^{2}
\quad+2\sum_{t=2}^{T}\mathbb{E}\left[(\theta_{t}-\theta^{\star})^{T}(g_{t}-\bar{g}_{t})\right]
\overset{\text{(i)}}{\leq}\frac{G_{\infty}D_{\infty}^{2}}{\alpha_{2}}+\frac{G_{\infty}D_{\infty}^{2}\sqrt{T}}{\alpha}+\frac{\beta_{1}G_{\infty}D_{\infty}^{2}}{\alpha(1-\lambda)^{2}}+\frac{2\alpha(1+\beta_{1})G_{\infty}^{2}}{G_{0}}\cdot\sqrt{T}+2\sum_{t=2}^{T}\mathbb{E}\left[(\theta_{t}-\theta^{\star})^{T}(g_{t}-\bar{g}_{t})\right],

where (i) follows from Assumption 4, and Lemmas 14 and 17.
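In particular, the coefficient $2\alpha\sqrt{T}$ in front of the last deterministic term follows from the bound $\lVert\hat{V}_{t}^{-1/4}m_{t}\rVert^{2}\leq G_{\infty}^{2}/G_{0}$ established in the proof of Theorem 3, together with the elementary estimate

\sum_{t=1}^{T}\alpha_{t}=\sum_{t=1}^{T}\frac{\alpha}{\sqrt{t}}\leq\alpha\left(1+\int_{1}^{T}\frac{ds}{\sqrt{s}}\right)=\alpha\left(2\sqrt{T}-1\right)\leq 2\alpha\sqrt{T}.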

As before, we apply Lemma 3 to obtain the upper bound. Choosing $\tau=t$ if $t\leq\tau^{*}$ and $\tau=\tau^{*}$ if $t>\tau^{*}$, we obtain

2c(1-\beta_{1})\sum_{t=2}^{T}\mathbb{E}\left\lVert\theta_{t}-\theta^{\star}\right\rVert^{2}
\leq\frac{G_{\infty}D_{\infty}^{2}}{\alpha_{2}}+\frac{G_{\infty}D_{\infty}^{2}\sqrt{T}}{\alpha}+\frac{\beta_{1}G_{\infty}D_{\infty}^{2}}{\alpha(1-\lambda)^{2}}+\frac{2\alpha(1+\beta_{1})G_{\infty}^{2}}{G_{0}}\cdot\sqrt{T}+2\sum_{t=2}^{T}\mathbb{E}\left[(\theta_{t}-\theta^{\star})^{T}(g_{t}-\bar{g}_{t})\right]
\leq\frac{G_{\infty}D_{\infty}^{2}}{\alpha_{2}}+\frac{G_{\infty}D_{\infty}^{2}\sqrt{T}}{\alpha}+\frac{\beta_{1}G_{\infty}D_{\infty}^{2}}{\alpha(1-\lambda)^{2}}+\frac{2\alpha(1+\beta_{1})G_{\infty}^{2}}{G_{0}}\cdot\sqrt{T}+4\sum_{t=2}^{\tau^{*}}\left[((1+\gamma)D_{\infty}+G_{\infty})\cdot\frac{G_{\infty}}{G_{0}}\tau^{*}\alpha\right]
\quad+4\sum_{t=\tau^{*}+1}^{T}\left[2D_{\infty}G_{\infty}\frac{\alpha}{\sqrt{T}}+((1+\gamma)D_{\infty}+G_{\infty})\cdot\frac{G_{\infty}}{G_{0}}\frac{\tau^{*}\alpha}{\sqrt{t-\tau^{*}}}\right]
\leq\frac{G_{\infty}D_{\infty}^{2}}{\alpha_{2}}+\frac{G_{\infty}D_{\infty}^{2}\sqrt{T}}{\alpha}+\frac{\beta_{1}G_{\infty}D_{\infty}^{2}}{\alpha(1-\lambda)^{2}}+\frac{2\alpha(1+\beta_{1})G_{\infty}^{2}}{G_{0}}\cdot\sqrt{T}+4\left[((1+\gamma)D_{\infty}+G_{\infty})\cdot\frac{G_{\infty}}{G_{0}}\alpha(\tau^{*})^{2}\right]
\quad+4\left[2D_{\infty}G_{\infty}\alpha\sqrt{T}+((1+\gamma)D_{\infty}+G_{\infty})\cdot\frac{2G_{\infty}\tau^{*}\alpha\sqrt{T-\tau^{*}}}{G_{0}}\right].
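For completeness, the last step bounds the Markovian-bias term via the standard estimate

\sum_{t=\tau^{*}+1}^{T}\frac{1}{\sqrt{t-\tau^{*}}}=\sum_{s=1}^{T-\tau^{*}}\frac{1}{\sqrt{s}}\leq 1+\int_{1}^{T-\tau^{*}}\frac{ds}{\sqrt{s}}\leq 2\sqrt{T-\tau^{*}},

and the stepsize term via $(T-\tau^{*})/\sqrt{T}\leq\sqrt{T}$; the $(\tau^{*})^{2}$ term again follows by counting summands, as in the proof of Theorem 3.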

Finally, applying Jensen’s inequality completes the proof:

\mathbb{E}\left\lVert\theta_{out}-\theta^{\star}\right\rVert^{2}\leq\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\left\lVert\theta_{t}-\theta^{\star}\right\rVert^{2}\leq\frac{C_{1}}{\sqrt{T}}+\frac{C_{2}}{T},

where

C_{1}=\frac{G_{\infty}D_{\infty}^{2}}{2c\alpha(1-\beta)}+\frac{\alpha(1+\beta_{1})G_{\infty}^{2}}{cG_{0}(1-\beta)}+\frac{4\alpha D_{\infty}G_{\infty}}{c(1-\beta)}+\frac{4\tau^{*}\alpha G_{\infty}((1+\gamma)D_{\infty}+G_{\infty})}{cG_{0}(1-\beta)},
C_{2}=\frac{G_{\infty}D_{\infty}^{2}}{\sqrt{2}c\alpha(1-\beta)}+\frac{\beta G_{\infty}D_{\infty}^{2}}{2c\alpha(1-\lambda)^{2}(1-\beta)}+\frac{2G_{\infty}\alpha(\tau^{*})^{2}((1+\gamma)D_{\infty}+G_{\infty})}{cG_{0}(1-\beta)}.
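To make the setting of Theorem 4 concrete, the following is a minimal, self-contained sketch (an illustrative instantiation under our own assumptions, not the paper's exact pseudocode; in particular, any projection step used in the analysis is omitted) of a TD(0) update with AMSGrad-style adaptation, run on a single Markovian trajectory with the diminishing schedules $\alpha_{t}=\alpha/\sqrt{t}$ and $\beta_{1t}=\beta_{1}\lambda^{t}$ analyzed above. The random-walk chain, reward, and feature map below are hypothetical choices made purely for illustration.

# Minimal TD(0)-with-AMSGrad sketch under the Theorem 4 stepsize/momentum schedules.
import numpy as np

rng = np.random.default_rng(0)
n_states, d = 5, 3
phi = rng.normal(size=(n_states, d))           # hypothetical linear features phi(s)
gamma, alpha, beta1, beta2, lam = 0.9, 0.1, 0.9, 0.999, 0.5
G0 = 1e-2                                      # lower bound on sqrt(v_hat), mirroring G_0

theta = np.zeros(d)                            # linear value-function parameter
m, v = np.zeros(d), np.zeros(d)                # first / second moment estimates
v_hat = np.full(d, G0 ** 2)                    # AMSGrad "max" second moment
s = 2                                          # current state of the Markovian trajectory

for t in range(1, 5001):
    # One Markovian transition: a random walk on {0,...,n_states-1} with a toy reward.
    s_next = (s + rng.choice([-1, 1])) % n_states
    r = -abs(s_next - n_states // 2)

    # TD(0) semi-gradient with linear function approximation.
    delta = r + gamma * phi[s_next] @ theta - phi[s] @ theta
    g = -delta * phi[s]                        # pseudo-gradient; descent step moves theta along delta*phi(s)

    # AMSGrad moment updates with the diminishing schedules of Theorem 4.
    alpha_t = alpha / np.sqrt(t)
    beta1_t = beta1 * lam ** t
    m = beta1_t * m + (1.0 - beta1_t) * g
    v = beta2 * v + (1.0 - beta2) * g ** 2
    v_hat = np.maximum(v_hat, v)               # keeps v_hat nondecreasing (and >= G0^2)
    theta = theta - alpha_t * m / np.sqrt(v_hat)

    s = s_next

print("final theta:", theta)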