Non-asymptotic Convergence of Adam-type Reinforcement Learning Algorithms under Markovian Sampling
Abstract
Despite the wide applications of Adam in reinforcement learning (RL), the theoretical convergence of Adam-type RL algorithms has not been established. This paper provides the first such convergence analysis for two fundamental RL algorithms of policy gradient (PG) and temporal difference (TD) learning that incorporate AMSGrad updates (a standard alternative to Adam in theoretical analyses), referred to as PG-AMSGrad and TD-AMSGrad, respectively. Moreover, our analysis focuses on Markovian sampling for both algorithms. We show that under general nonlinear function approximation, PG-AMSGrad with a constant stepsize converges to a neighborhood of a stationary point at the rate of (where $T$ denotes the number of iterations), and with a diminishing stepsize converges exactly to a stationary point at the rate of . Furthermore, under linear function approximation, TD-AMSGrad with a constant stepsize converges to a neighborhood of the global optimum at the rate of , and with a diminishing stepsize converges exactly to the global optimum at the rate of . Our study develops new techniques for analyzing Adam-type RL algorithms under Markovian sampling.
1 Introduction
Reinforcement learning (RL) studies how an agent learns a policy by interacting with its environment to maximize the cumulative reward of a task. RL has achieved tremendous success in various applications such as playing video games Mnih et al., (2013), bipedal walking Castillo et al., (2019), and online advertising Pednault et al., (2002), to name a few. In general, there are two widely used classes of RL algorithms: policy-based methods and value-function-based methods.
For the first class, policy gradient (PG) Sutton et al., (2000) is a basic algorithm that has motivated many advanced policy-based algorithms, including actor-critic Konda and Tsitsiklis, (2000), DPG Silver et al., (2014), TRPO Schulman et al., (2015), PPO Schulman et al., (2017), etc. The idea of the vanilla PG Sutton et al., (2000) is to parameterize the policy and optimize a target cumulative reward function by (stochastic) gradient descent. Both asymptotic convergence and finite-time (i.e., non-asymptotic) analyses have been established for PG in various scenarios, which will be further discussed in Section 1.2.
For the second class of value-function-based algorithms, temporal difference (TD) learning Sutton, (1988) is a fundamental algorithm that has motivated more advanced algorithms such as Q-learning Watkins and Dayan, (1992), SARSA Rummery and Niranjan, (1994), etc. The vanilla TD Sutton, (1988) typically parameterizes the value function of an unknown policy and iteratively finds the true value function or its estimator by following the (projected) Bellman operation, which can be interpreted as analogous to a stochastic gradient descent (SGD) update. Both asymptotic convergence and finite-time analyses have been established for TD in various scenarios, which will be discussed in Section 1.2.
Despite extensive exploration, all the existing theoretical studies of PG and TD have focused on algorithms that adopt SGD-type updates without adaptation of the stepsize. In practice, however, the adaptive momentum estimation (Adam) method Kingma and Ba, (2015) has been more commonly used in RL Bello et al., (2017); Stooke and Abbeel, (2018). It is so far unclear whether RL algorithms that incorporate Adam-type updates have provable convergence guarantees. The goal of this paper is to provide the first non-asymptotic analysis that characterizes the convergence rate of Adam-type PG and TD algorithms. In addition, we develop new technical tools to analyze Adam-type algorithms under Markovian sampling. Such analysis is not available in the existing studies of Adam-type algorithms in optimization, which usually assume independent and identically distributed (i.i.d.) sampling.
1.1 Our Contribution
The main contribution of the paper lies in establishing the first non-asymptotic analysis for the Adam-type PG and TD algorithms under the Markovian sampling model.
From a technical standpoint, although the existing analysis of Adam-type algorithms in conventional optimization provides useful tools, analyzing such algorithms in RL problems poses new challenges. (1) Samples in TD learning are drawn from an unknown transition kernel in a non-i.i.d. manner, and hence a bias arises in the gradient estimation. Such a bias also depends on the AMSGrad updates, which further complicates the analysis. We develop techniques to bound the bias by the stepsize even under adaptive momentum updates. (2) The sampling process in PG (with a nonlinear approximation architecture) is not only non-i.i.d., but also follows time-varying parametric policies. Thus, it is even more challenging to bound such a bias under the AMSGrad updates. In response, we develop new techniques to characterize the bias error, which can also be controlled by the stepsize.
As for the results, we provide the first convergence guarantee for RL algorithms (including PG and TD) incorporating the update rule of AMSGrad (referred to as PG-AMSGrad and TD-AMSGrad, respectively), and our techniques also lead to an improved result for the vanilla PG. (1) First, we show that under general nonlinear function approximation, PG-AMSGrad with a constant stepsize (here, the "stepsize" refers to the basic stepsize $\alpha$ in the AMSGrad update (4); the overall learning rate of the algorithm is determined by $\alpha$, the hyper-parameters $\beta_1$ and $\beta_2$, and the first and second moments of the gradients as given in (1)-(4), and is hence adaptive during the AMSGrad iterations) converges to a neighborhood of a stationary point at a rate of (where $T$ denotes the number of iterations), and with a diminishing stepsize converges exactly to a stationary point at a rate of . (2) Furthermore, under linear function approximation, TD-AMSGrad with a constant stepsize converges to a neighborhood of the global optimum at a rate of , and with a diminishing stepsize converges exactly to the global optimum at a rate of . (3) In addition, by adapting our technical tools to analyze the vanilla PG with the SGD update under Markovian sampling, we obtain an orderwise better computational complexity than the existing works.
1.2 Related Work
Given the rapidly growing body of theoretical studies on RL, we review only the most relevant works below.
Convergence analysis of PG: Asymptotic convergence of PG based on stochastic approximation (SA) was established in Williams, (1992); Baxter and Bartlett, (2001); Sutton et al., (2000); Kakade, (2002); Pirotta et al., (2015); Tadić et al., (2017). In specific RL problems such as the LQR, PG has been proved to converge to the optimal policy Fazel et al., (2018); Malik et al., (2019); Tu and Recht, (2019). Under specific policy function approximation classes, Bhandari and Russo, (2019); Agarwal et al., (2019) also showed that PG can find the optimal policy. Under general nonconvex approximation, Shen et al., (2019); Papini et al., (2018, 2019); Xu et al., 2019a; Xu et al., (2020) characterized the convergence rate of PG and variance-reduced PG to a stationary point in the finite-horizon scenario, and Zhang et al., (2019); Karimi et al., (2019) provided the convergence rate of PG in the infinite-horizon scenario. Wang et al., (2020) studied PG with neural network function approximation in an overparameterized regime. Convergence analysis has also been established for variants of PG, such as actor-critic algorithms (Bhatnagar et al., 2008, 2009; Bhatnagar, 2010; Luo et al., 2019; Yang et al., 2019; Fu et al., 2020), TRPO (Shani et al., 2020), TRPO/PPO (Liu et al., 2019), etc. This paper studies the infinite-horizon scenario, but focuses on the Adam-type PG, which has not been studied in the literature. Our analysis is also applicable to other variants of PG.
Convergence analysis of TD: Originally proposed in Sutton, (1988), TD learning with function approximation aroused great interest in analyzing its convergence. While a general TD may not converge as pointed out in Baird, (1995); Györfi and Walk, (1996), the authors in Tsitsiklis and Van Roy, (1997) provided conditions to ensure asymptotic convergence of TD with linear function approximation under i.i.d. sampling. Other results on asymptotic convergence using the tools from linear SA were provided in Kushner and Yin, (2003); Benveniste et al., (2012). Non-asymptotic convergence was established for TD under i.i.d. sampling in, e.g., Dalal et al., (2018); Bhandari et al., (2018); Lakshminarayanan and Szepesvari, (2018), and under Markovian sampling in, e.g., Bhandari et al., (2018); Srikant and Ying, (2019); Hu and Syed, (2019). The convergence rate of TD with nonlinear function approximation has recently been studied in Cai et al., (2019) for overparameterized neural networks using i.i.d. samples. In contrast to the aforementioned work on TD with the SGD updates, this paper studies Adam-type TD under Markovian sampling.
Adaptive reinforcement learning algorithms: Adaptivity has been applied to RL algorithms to improve their performance. (Shani et al., 2020) used an adaptive proximity term to study the convergence of TRPO. An adaptive batch size was adopted to improve the policy performance (Papini et al., 2017) and to reduce the variance (Ji et al., 2019) of PG. The aforementioned papers did not study how adaptive learning rates affect the performance of PG or TD. More recently, a concurrent work (Sun et al., 2020), which was posted a few days after our paper, provided an analysis of TD(0) and TD($\lambda$) when incorporating adaptive gradient descent (AdaGrad) updates. In contrast, this paper focuses on a more popular class of adaptive algorithms, namely Adam-type methods, and provides the first convergence guarantee when they are applied to PG and TD.
Convergence analysis of Adam-type algorithms in conventional optimization: Adam was proposed in Kingma and Ba, (2015) for speeding up the training of deep neural networks, but the vanilla Adam was shown not to converge in Reddi et al., (2018). Instead, AMSGrad was proposed as a slightly modified version to justify the theoretical performance of Adam, and its regret bounds were characterized in Reddi et al., (2018); Tran and Phong, (2019) for online convex optimization. Recently, Adam/AMSGrad was proved to converge to a stationary point for certain general nonconvex optimization problems Zou et al., 2019a; Zhou et al., (2018); Chen et al., 2019a. Our study provides the first convergence guarantee for Adam-type algorithms in RL settings, where Markovian sampling poses the key difference and challenge in our analysis compared with conventional optimization.
2 Preliminary
In this section, we provide the necessary background for the problems that we study in this paper.
2.1 Markov Decision Process
We consider the standard RL setting, where an agent interacts with a (possibly stochastic) environment (e.g., a process or system dynamics). This interaction is usually modeled as a discrete-time discounted Markov decision process (MDP), described by a tuple $(\mathcal S, \mathcal A, \mathsf P, r, \gamma, \xi)$, where $\mathcal S$ is the state space, $\mathcal A$ is the action space, and $\mathsf P$ is the probability kernel for the state transitions, e.g., $\mathsf P(s'\,|\,s,a)$ denotes the probability distribution of the next state $s'$ given the current state $s$ and action $a$. In addition, $r:\mathcal S\times\mathcal A\to\mathbb R$ is the reward function mapping state-action pairs to a bounded subset of $\mathbb R$, $\gamma\in(0,1)$ is the discount factor, and $\xi$ denotes the initial state distribution. The agent's decision is captured by the policy $\pi(\cdot\,|\,s)$, which characterizes the density function over the action space at the state $s$. We denote $\mu_\pi$ as the stationary distribution of the transition kernel for a given policy $\pi$. In addition, we define the $\gamma$-discounted stationary visitation distribution of the policy $\pi$ as $\nu_\pi(s) := (1-\gamma)\sum_{t=0}^{\infty}\gamma^t\,\Pr(s_t = s\,|\,s_0\sim\xi,\pi)$. Further, we denote $\nu_\pi(s,a) := \nu_\pi(s)\,\pi(a\,|\,s)$ as the (discounted) state-action visitation distribution.
In this paper, we assume that the state space $\mathcal S$ is countably infinite and the action space $\mathcal A$ is finite with possibly large cardinality $|\mathcal A|$. Hence, we estimate the policy and the value function corresponding to a certain unknown policy by parameterized function classes, as we introduce in Sections 3.1 and 4.1, respectively.
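To make the objects in this tuple concrete, the following minimal Python sketch stores a small tabular MDP and samples a trajectory under a fixed policy. The class, method, and variable names are illustrative conventions of ours and do not appear in the paper.

```python
import numpy as np

class TabularMDP:
    """Minimal container for a finite MDP tuple (S, A, P, r, gamma, xi) (illustrative only)."""

    def __init__(self, P, r, gamma, xi, rng=None):
        self.P = np.asarray(P)        # transition kernel, shape (S, A, S)
        self.r = np.asarray(r)        # bounded reward table, shape (S, A)
        self.gamma = gamma            # discount factor in (0, 1)
        self.xi = np.asarray(xi)      # initial state distribution, shape (S,)
        self.rng = rng or np.random.default_rng(0)

    def reset(self):
        # Draw an initial state from xi.
        return self.rng.choice(len(self.xi), p=self.xi)

    def step(self, s, a):
        # Sample the next state from P(.|s, a) and return it with the reward r(s, a).
        s_next = self.rng.choice(self.P.shape[2], p=self.P[s, a])
        return s_next, self.r[s, a]

def sample_trajectory(mdp, policy, horizon):
    """Roll out `horizon` steps under `policy`, an (S, A) array of action probabilities."""
    s, path = mdp.reset(), []
    for _ in range(horizon):
        a = mdp.rng.choice(policy.shape[1], p=policy[s])
        s_next, rew = mdp.step(s, a)
        path.append((s, a, rew))
        s = s_next
    return path
```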
2.2 Update Rule of AMSGrad
Although Adam Kingma and Ba, (2015) has achieved great success in practice since it was proposed, it has been shown not to converge even in simple convex settings Reddi et al., (2018). Instead, a slightly modified version called AMSGrad Reddi et al., (2018) is widely used to understand the success of adaptive momentum optimization algorithms. Given a stochastic gradient $g_t$ at time $t$, the generic form of AMSGrad is given by
$m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t,$   (1)
$v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t \odot g_t,$   (2)
$\hat v_t = \max(\hat v_{t-1},\, v_t),$   (3)
$\theta_{t+1} = \theta_t - \alpha_t\, m_t / \sqrt{\hat v_t},$   (4)
where $\alpha_t$ is the stepsize, and $\beta_1,\beta_2\in[0,1)$ are two hyper-parameters. In addition, $m_t$ and $v_t$ given in (1) and (2) are viewed as estimates of the first moment and the second moment of the gradient, respectively, which play important roles in adapting the learning rate as in (4). Compared to Adam, the main difference of AMSGrad lies in (3), which guarantees the sequence $\{\hat v_t\}$ to be non-decreasing, whereas Adam does not enforce this. This difference is considered to be a central reason for the non-convergent behavior of Adam Reddi et al., (2018); Chen et al., 2019a .
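As a concrete reference for the update rule, the following Python sketch runs (1)-(4) against a generic stochastic-gradient oracle. It is a minimal illustration under our own naming conventions; the small constant `eps` is a standard numerical safeguard that is not part of (4), and the toy noisy quadratic objective is ours.

```python
import numpy as np

def amsgrad(grad_fn, theta0, alpha=0.01, beta1=0.9, beta2=0.999, eps=1e-8, T=1000):
    """Generic AMSGrad loop implementing updates (1)-(4); `grad_fn(theta, t)` returns g_t."""
    theta = np.asarray(theta0, dtype=float)
    m = np.zeros_like(theta)          # first-moment estimate, Eq. (1)
    v = np.zeros_like(theta)          # second-moment estimate, Eq. (2)
    v_hat = np.zeros_like(theta)      # running element-wise max, Eq. (3)
    for t in range(T):
        g = grad_fn(theta, t)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        v_hat = np.maximum(v_hat, v)  # keeps v_hat non-decreasing (key difference from Adam)
        theta = theta - alpha * m / (np.sqrt(v_hat) + eps)  # Eq. (4)
    return theta

# Toy usage: minimize a noisy quadratic f(theta) = ||theta||^2 / 2.
rng = np.random.default_rng(0)
noisy_quad_grad = lambda theta, t: theta + 0.1 * rng.standard_normal(theta.shape)
print(amsgrad(noisy_quad_grad, np.ones(3)))
```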
2.3 Notations
We use $\|x\|$ to denote the $\ell_2$-norm of a vector $x$, and use $\|x\|_\infty$ to denote the infinity norm. When $x, y$ are both vectors, $x/y$, $x^2$, $\sqrt{x}$, and $\max(x,y)$ are all calculated in an element-wise manner, which are used in the update of AMSGrad. We also denote $\lfloor x\rfloor$ as the integer such that $\lfloor x\rfloor \le x < \lfloor x\rfloor + 1$.
3 Convergence of PG-AMSGrad under Markovian Sampling
In this section, we study the convergence of an Adam-type policy gradient algorithm (PG-AMSGrad) with nonlinear function approximation and under non-i.i.d. sampling.
3.1 Policy Gradient and PG-AMSGrad
Consider a reinforcement learning problem that aims to find a policy maximizing the expected cumulative reward. We assume that the policy is parameterized by $\theta\in\mathbb R^d$ and forms a policy class $\{\pi_\theta:\theta\in\mathbb R^d\}$, which in general is a nonlinear function class. The policy gradient method is usually used to solve the following infinite-horizon optimization problem:
$\max_{\theta\in\mathbb{R}^d}\ J(\theta) := \mathbb{E}\Big[\sum_{t=0}^{\infty}\gamma^t\, r(s_t,a_t)\Big], \quad s_0\sim\xi,\ a_t\sim\pi_\theta(\cdot\,|\,s_t),\ s_{t+1}\sim\mathsf{P}(\cdot\,|\,s_t,a_t).$   (5)
The gradient of $J(\theta)$ with respect to $\theta$ is captured by the policy gradient theorem for infinite-horizon MDPs with the discounted reward Sutton et al., (2000), and is given by
$\nabla J(\theta) = \frac{1}{1-\gamma}\,\mathbb{E}_{(s,a)\sim\nu_{\pi_\theta}}\big[Q_{\pi_\theta}(s,a)\,\nabla_\theta\log\pi_\theta(a\,|\,s)\big],$   (6)
where the expectation is taken over the discounted state-action visitation distribution $\nu_{\pi_\theta}$, and $Q_{\pi_\theta}(s,a)$ denotes the Q-function for an initial state-action pair $(s,a)$ defined as
$Q_{\pi_\theta}(s,a) := \mathbb{E}\Big[\sum_{t=0}^{\infty}\gamma^t\, r(s_t,a_t)\,\Big|\,s_0=s,\ a_0=a,\ \pi_\theta\Big].$
In addition, we refer to $\psi_\theta(s,a):=\nabla_\theta\log\pi_\theta(a\,|\,s)$ as the score function corresponding to the policy $\pi_\theta$.
Since the transition probability is unknown, the policy gradient in (6) needs to be estimated via sampling. The Q-function and the score function are typically estimated by independent samples. First, at each time $t$, we draw a sample trajectory to provide an estimated Q-function $\hat Q_{\pi_{\theta_t}}$ based on the algorithm EstQ Zhang et al., (2019) (see Algorithm 3 in Appendix A for details). Such an estimator has been shown to be unbiased Zhang et al., (2019). That is, if we use $\mathbb{E}_Q$ to denote the expectation over the randomness (including the samples and the horizon) in EstQ, then we have
$\mathbb{E}_Q\big[\hat Q_{\pi_{\theta_t}}(s,a)\big] = Q_{\pi_{\theta_t}}(s,a).$   (7)
Second, we take samples to estimate the score function by following the policy $\pi_{\theta_t}$ and the modified transition kernel proposed in Konda, (2002), which is given by
$\widehat{\mathsf P}(s'\,|\,s,a) = \gamma\,\mathsf P(s'\,|\,s,a) + (1-\gamma)\,\xi(s'),$
where $\xi$ denotes the initial distribution and $\mathsf P$ is the transition kernel of the original MDP. It has been shown in Konda, (2002) that sampling with this transition kernel guarantees the induced Markov chain to converge to the (discounted) state-action visitation distribution.
Then, the gradient estimator to approximate $\nabla J(\theta_t)$ at time $t$ is given by
$\hat g(\theta_t) = \frac{1}{1-\gamma}\,\hat Q_{\pi_{\theta_t}}(s_t,a_t)\,\nabla_\theta\log\pi_{\theta_t}(a_t\,|\,s_t).$   (8)
We then apply such a gradient estimator to update the policy parameter by the AMSGrad update given in (1)-(4), and obtain PG-AMSGrad as in Algorithm 1.
We note that the gradient estimator obtained in (8) is biased, because the score function is estimated by a sequence of Markovian samples. We will show that such a biased gradient estimator is in fact computationally more efficient than the unbiased estimator used in the existing literature Zhang et al., (2019) (see Section 3.4). Our main technical novelty here lies in developing techniques to analyze the biased estimator under the AMSGrad update for PG.
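A hedged sketch of how the pieces of Algorithm 1 fit together is given below: the estimator (8) is formed from an assumed EstQ-style oracle and the score function, the parameter is updated by ascent with the AMSGrad rule (1)-(4), and the next sample is drawn from the resetting kernel described above. The interfaces (`env`, `sample_action`, `log_pi_grad`, `est_q`) and the hyper-parameter defaults are our assumptions rather than the paper's specification.

```python
import numpy as np

def pg_amsgrad(env, log_pi_grad, sample_action, est_q, theta0, gamma=0.99,
               alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, T=10_000, rng=None):
    """Sketch of PG-AMSGrad: biased estimator (8) plugged into the AMSGrad updates (1)-(4).

    Assumed interfaces: `env.reset()` samples s ~ xi; `env.step(s, a)` returns (s_next, reward);
    `sample_action(theta, s)` draws a ~ pi_theta(.|s); `log_pi_grad(theta, s, a)` is the score;
    `est_q(theta, s, a)` returns an (assumed unbiased) Q estimate, e.g. from an EstQ-style rollout.
    """
    rng = rng or np.random.default_rng(0)
    theta = np.asarray(theta0, dtype=float)
    m = np.zeros_like(theta); v = np.zeros_like(theta); v_hat = np.zeros_like(theta)
    s = env.reset()
    for t in range(T):
        a = sample_action(theta, s)
        q_hat = est_q(theta, s, a)                            # rollout-based Q estimate
        g = q_hat * log_pi_grad(theta, s, a) / (1 - gamma)    # gradient estimator (8)
        # AMSGrad updates (1)-(4); eps is a numerical safeguard only.
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        v_hat = np.maximum(v_hat, v)
        theta = theta + alpha * m / (np.sqrt(v_hat) + eps)    # ascent, since J is maximized
        # Resetting kernel: with probability 1 - gamma restart from xi, otherwise follow P,
        # so that (s, a) is asymptotically drawn from the discounted visitation distribution.
        if rng.random() < 1 - gamma:
            s = env.reset()
        else:
            s, _ = env.step(s, a)
    return theta
```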
3.2 Technical Assumptions
In the following, we specify some technical assumptions in our convergence analysis.
We consider a general class of parameterized policy functions that satisfy the following assumption.
Assumption 1.
Assume that the parameterized policy $\pi_\theta(a\,|\,s)$ is differentiable with respect to $\theta$, and the score function $\psi_\theta(s,a)=\nabla_\theta\log\pi_\theta(a\,|\,s)$ corresponding to $\pi_\theta$ exists. In addition, we assume both the policy function and the score function are Lipschitz continuous with parameters $L_\pi$ and $L_\psi$, respectively, i.e., for all $\theta_1,\theta_2\in\mathbb{R}^d$ and all $s\in\mathcal S$, $a\in\mathcal A$,
$|\pi_{\theta_1}(a\,|\,s) - \pi_{\theta_2}(a\,|\,s)| \le L_\pi\,\|\theta_1-\theta_2\|, \qquad \|\psi_{\theta_1}(s,a) - \psi_{\theta_2}(s,a)\| \le L_\psi\,\|\theta_1-\theta_2\|.$
Further, we also assume that the score function is uniformly bounded by a constant $C_\psi$ for any $\theta$, i.e., $\|\psi_\theta(s,a)\|\le C_\psi$ for all $s\in\mathcal S$ and $a\in\mathcal A$.
This assumption is standard in the existing literature that studies PG with nonconvex function approximation Zhang et al., (2019); Xu et al., 2019a ; Papini et al., (2018).
In Algorithm 1, we sample a data trajectory using the transition kernel $\widehat{\mathsf P}$ and the policy $\pi_{\theta_t}$. Such a sequence of samples is non-i.i.d. and follows a Markovian distribution. We assume that the MDP and the policies we consider satisfy the following standard mixing property.
Assumption 2.
For any $\theta$, there exist constants $\kappa>0$ and $\rho\in(0,1)$ such that
$\sup_{s\in\mathcal S}\ d_{TV}\big(\mathbb P(s_t\in\cdot\,|\,s_0=s,\pi_\theta),\ \mu_{\pi_\theta}\big)\ \le\ \kappa\,\rho^{t},\qquad \forall\, t\ge 0,$
where $d_{TV}(\mu,\nu)$ denotes the total-variation norm (i.e., the total-variation distance between two probability measures $\mu$ and $\nu$).
This assumption holds for irreducible and aperiodic Markov chains Mitrophanov, (2005), and is widely adopted in the theoretical analysis of RL algorithms under Markovian sampling settings Bhandari et al., (2018); Chen et al., 2019b ; Zou et al., 2019b ; Karimi et al., (2019).
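As a quick numerical illustration of this geometric mixing property, the following toy two-state chain (chosen by us, not taken from the paper) shows the total-variation distance to the stationary distribution decaying geometrically:

```python
import numpy as np

# Toy 2-state chain: TV distance to the stationary distribution decays geometrically.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
mu = np.array([2 / 3, 1 / 3])              # stationary distribution of P
dist = np.array([1.0, 0.0])                # start deterministically in state 0
for t in range(1, 6):
    dist = dist @ P
    tv = 0.5 * np.abs(dist - mu).sum()     # total-variation distance at time t
    print(f"t={t}: TV={tv:.4f}")           # decays like C * rho^t with rho = 0.7 here
```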
3.3 Convergence of PG-AMSGrad
In this section, we provide the convergence analysis of PG-AMSGrad as given in Algorithm 1. We first consider the case with a constant stepsize, and then provide the result with a diminishing stepsize.
Although AMSGrad has been studied in conventional optimization, our analysis of PG-AMSGrad mainly deals with the following new challenges arising in RL. First, samples here are generated via an MDP and distributed in a non-i.i.d. fashion. Thus the gradient estimator is biased and we need to control the bias with a certain upper bound scaled by the stepsize. Second, the sampling distribution also changes over time, which causes additional complication. Thus, our technical development mainly handles the above two challenges under the adaptive momentum update rule of AMSGrad. We next provide the convergence results that we obtain and relegate all the proofs to the appendices.
We first provide the Lipschitz properties for the true policy gradient and its estimator, which are useful for establishing the convergence. Recall that in Algorithm 1, the gradient estimator $\hat g(\theta_t)$ at time $t$ is obtained by using the Q-function estimator $\hat Q_{\pi_{\theta_t}}$ generated by the EstQ algorithm (see Appendix A). Note that $\hat Q_{\pi_{\theta_t}}(s,a)$ is an unbiased estimator of $Q_{\pi_{\theta_t}}(s,a)$ for all $(s,a)$ Zhang et al., (2019), and the samples used for this estimation are independent of those used in the other steps of PG-AMSGrad except the initial sample. Taking the expectation over the randomness in EstQ at time $t$ (denoted as $\mathbb{E}_Q$), we obtain an estimator $\bar g(\theta_t)$ defined as
$\bar g(\theta_t) := \mathbb{E}_Q\big[\hat g(\theta_t)\big] = \frac{1}{1-\gamma}\,Q_{\pi_{\theta_t}}(s_t,a_t)\,\nabla_\theta\log\pi_{\theta_t}(a_t\,|\,s_t).$   (9)
We next obtain the Lipschitz properties of $\nabla J(\theta)$ and $\bar g(\theta)$ in the following lemma.
Lemma 1.
(Lipschitz property of policy gradient) Under Assumptions 1 and 2, the policy gradient $\nabla J(\theta)$ defined in (6) is Lipschitz continuous with a parameter $L_J$, i.e., for all $\theta_1,\theta_2\in\mathbb{R}^d$,
$\|\nabla J(\theta_1) - \nabla J(\theta_2)\| \le L_J\,\|\theta_1 - \theta_2\|,$   (10)
where the constant $L_J$ depends only on the discount factor $\gamma$, the reward bound, and the constants in Assumptions 1 and 2. Further, the policy gradient estimator $\bar g(\theta)$ defined in (9) is also Lipschitz continuous with a parameter $L_{\bar g}$, i.e., for all $\theta_1,\theta_2\in\mathbb{R}^d$,
$\|\bar g(\theta_1) - \bar g(\theta_2)\| \le L_{\bar g}\,\|\theta_1 - \theta_2\|,$   (11)
where $L_{\bar g}$ is determined by the same set of constants.
The following theorem characterizes the convergence of PG-AMSGrad with a constant stepsize. Recall that the stepsize refers to the parameter in the AMSGrad update (4), not the overall learning rate of the algorithm.
Theorem 1.
Theorem 1 indicates that under the constant stepsize, PG-AMSGrad converges to a neighborhood of a stationary point at a rate of . The size of the neighborhood can be controlled by the stepsize $\alpha$. One can observe that $\alpha$ controls a tradeoff between the convergence rate and the convergence accuracy. Decreasing $\alpha$ improves the convergence accuracy, but slows down the convergence, since the corresponding coefficient contains $\alpha$ in the denominator. To balance such a tradeoff, we set the stepsize . In this case, the mixing time becomes and thus PG-AMSGrad converges to a stationary point with a rate of .
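To visualize why the stepsize controls this tradeoff, the following is a minimal LaTeX sketch under an assumed generic bound shape; the constants $C_1$, $C_2$ and the exact dependence on the mixing time $\tau(\alpha)$ are illustrative placeholders and need not match the precise statement of Theorem 1.

```latex
% Assumed generic shape of a constant-stepsize bound under geometric mixing
% (placeholder constants C_1, C_2; consult Theorem 1 for the precise statement):
\[
  \min_{t\le T}\ \mathbb{E}\,\big\|\nabla J(\theta_t)\big\|^2
  \;\lesssim\;
  \underbrace{\frac{C_1}{\alpha T}}_{\text{optimization error}}
  \;+\;
  \underbrace{C_2\,\alpha\,\tau(\alpha)}_{\text{Markovian-sampling bias}},
  \qquad \tau(\alpha)=\mathcal{O}\big(\log(1/\alpha)\big).
\]
% With a constant \alpha, the first term vanishes as 1/T while the second remains a
% fixed-size neighborhood; choosing \alpha = 1/\sqrt{T} balances the two terms, since
% C_1/(\alpha T) = C_1/\sqrt{T} and C_2\,\alpha\,\tau(\alpha) = \mathcal{O}(\log T/\sqrt{T}).
```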
In the following, we adopt a diminishing stepsize to eliminate the convergence error and obtain the exact convergence.
Theorem 2.
Under the diminishing stepsize, PG-AMSGrad can converge exactly to a stationary point. Since , the convergence rate is given by .
Theorems 1 and 2 indicate that under both a constant stepsize and diminishing stepsize, PG-AMSGrad finds a stationary point efficiently with guaranteed convergence. However, there is a tradeoff between the convergence rate and the convergence accuracy. With a constant stepsize, PG-AMSGrad can converge faster but only to a neighborhood of a stationary point whose size is controlled by the stepsize, whereas a diminishing stepsize yields a better convergence accuracy, but attains a lower convergence rate.
3.4 Implication on Vanilla PG under Markovian Data
Although our focus in this paper is on the Adam-type PG, slight modification of our analysis yields the convergence rate of the vanilla PG under infinite horizon Markovian sampling. In the following, we first present such results and then compare them with two recent studies Zhang et al., (2019); Karimi et al., (2019) on the same model to illustrate the novelty of our analysis and benefit of our sampling strategy.
We consider the vanilla PG algorithm that uses the same gradient estimator and sampling strategy as those of Algorithm 1, but adopts the SGD update (i.e., $\theta_{t+1}=\theta_t+\alpha_t\,\hat g(\theta_t)$) rather than the AMSGrad update. We refer to such an algorithm as PG-SGD. The following proposition characterizes the convergence rate of PG-SGD.
Proposition 1.
We next compare Proposition 1 with two recent studies on the same non-i.i.d. model. First, Karimi et al., (2019) studied infinite-horizon PG with a biased gradient estimator. In their analysis, they did not bound the gradient bias using the stepsize, and hence their convergence has a non-zero error even with a diminishing stepsize. In contrast, we obtain a fine-grained bound on the bias and characterize its dependence on the stepsize. We show that PG exactly converges to a stationary point under the diminishing stepsize.
Another closely related study on the infinite-horizon PG under non-i.i.d. sampling was by Zhang et al., (2019), but their algorithm adopts an unbiased gradient estimator at the cost of using more samples. As a comparison, Proposition 1 indicates that PG-SGD with a biased gradient estimator attains a better convergence rate and accuracy. More specifically, under the constant stepsize, (Zhang et al.,, 2019, Corollary 4.4) showed that their PG algorithm converges with an error bound of , whereas PG-SGD with a biased gradient estimator achieves a much smaller error bound by taking . Similarly, under the diminishing stepsize, (Zhang et al.,, 2019, Theorem 4.3) showed that their PG algorithm converges at a rate of , whereas our PG-SGD converges at a rate of , which is much faster since is usually close to 1, and is considered to be less influential in practice.
4 Convergence of TD-AMSGrad under Markovian Sampling
In this section, we adopt AMSGrad to TD learning and analyze its convergence under Markovian sampling. The proof techniques of bounding the bias and the nature of the convergence are very different from those of PG-AMSGrad.
4.1 TD Learning and TD-AMSGrad
Policy evaluation is a fundamental task in RL, and often plays a critical role in other algorithms such as the PG method that we study in Section 3. The goal of policy evaluation is to obtain an accurate estimate of the expected cumulative reward, known as the value function, for a given policy $\pi$, defined as
$V_\pi(s) := \mathbb{E}\Big[\sum_{t=0}^{\infty}\gamma^t\, r(s_t,a_t)\,\Big|\, s_0 = s\Big],\quad a_t\sim\pi(\cdot\,|\,s_t),\ s_{t+1}\sim\mathsf P(\cdot\,|\,s_t,a_t).$
The Bellman operator $T_\pi$ applied to a value function $V$ is defined by, for every state $s\in\mathcal S$,
$(T_\pi V)(s) = \mathbb{E}_{a\sim\pi(\cdot|s),\ s'\sim\mathsf P(\cdot|s,a)}\big[r(s,a) + \gamma\, V(s')\big].$   (12)
The key to approximating the value function is the observation that it satisfies the projected Bellman equation when projected onto some convex space. Under function approximation, the value function is parameterized by $\theta\in\mathbb{R}^d$ and denoted by $V_\theta$. As in many recent finite-time analyses of TD Bhandari et al., (2018); Xu et al., 2019b; Srikant and Ying, (2019), we consider the linear approximation class of the value function defined as
$V_\theta(s) = \phi(s)^\top\theta,$   (13)
where $\theta\in\mathbb{R}^d$, and $\phi(s) = (\phi_1(s),\dots,\phi_d(s))^\top$ is a vector function with dimension $d$ whose elements represent the nonlinear kernel (feature) functions. Then the vanilla TD algorithm follows a stochastic iterative method given by
$\theta_{t+1} = \theta_t + \alpha_t\, g_t(\theta_t),$   (14)
where $\alpha_t$ is the stepsize, and $g_t(\theta_t)$ is defined as
$g_t(\theta) := \big(r(s_t,a_t) + \gamma\,\phi(s_{t+1})^\top\theta - \phi(s_t)^\top\theta\big)\,\phi(s_t).$   (15)
Here, $g_t(\theta)$ serves as a stochastic pseudo-gradient, and is an estimator of the full pseudo-gradient $\bar g(\theta)$ given by
$\bar g(\theta) := \mathbb{E}_{s\sim\mu_\pi,\ a\sim\pi(\cdot|s),\ s'\sim\mathsf P(\cdot|s,a)}\big[\big(r(s,a) + \gamma\,\phi(s')^\top\theta - \phi(s)^\top\theta\big)\,\phi(s)\big],$   (16)
where the expectation is taken over the stationary distribution $\mu_\pi$ of the states. We note that $\bar g(\theta)$ is not the gradient of a loss function, but plays a role similar to that of the gradient in the gradient descent algorithm.
Then TD-AMSGrad is obtained by replacing the update (14) by the AMSGrad update given in (1)-(4) as in Algorithm 2.
As seen in Algorithm 2, the state-action pairs are sampled as a single trajectory under the transition kernel $\mathsf P$ with the unknown policy $\pi$. Therefore, the samples along the trajectory are dependent, and hence we need to analyze the convergence of TD-AMSGrad under Markovian sampling.
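For concreteness, the following Python sketch mirrors the structure of Algorithm 2: each transition of the Markovian trajectory yields the pseudo-gradient (15), which is fed into the AMSGrad update (1)-(4), followed by a projection onto a bounded parameter ball (cf. the bounded-domain assumption in Section 4.2). The sampler interface, the feature map, the projection radius, and the hyper-parameter defaults are assumptions of the sketch.

```python
import numpy as np

def td_amsgrad(sample_transition, phi, theta0, s0, gamma=0.99, alpha=1e-2,
               beta1=0.9, beta2=0.999, eps=1e-8, radius=10.0, T=10_000):
    """Sketch of TD-AMSGrad with linear value approximation V_theta(s) = phi(s)^T theta.

    `sample_transition(s)` returns (reward, s_next) from one step of the Markovian
    trajectory under the evaluated policy; `phi(s)` returns the feature vector of s.
    """
    theta = np.asarray(theta0, dtype=float)
    m = np.zeros_like(theta); v = np.zeros_like(theta); v_hat = np.zeros_like(theta)
    s = s0
    for _ in range(T):
        reward, s_next = sample_transition(s)
        feat, feat_next = np.asarray(phi(s)), np.asarray(phi(s_next))
        # Pseudo-gradient (15): TD error times the current feature vector.
        g = (reward + gamma * feat_next @ theta - feat @ theta) * feat
        # AMSGrad updates (1)-(4); eps is a numerical safeguard only.
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        v_hat = np.maximum(v_hat, v)
        theta = theta + alpha * m / (np.sqrt(v_hat) + eps)
        # Projection onto a ball of radius `radius` (assumed bounded domain, Section 4.2).
        norm = np.linalg.norm(theta)
        if norm > radius:
            theta *= radius / norm
        s = s_next
    return theta
```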
4.2 Technical Assumptions
In this section, we introduce some standard technical assumptions for our analysis.
We first give the following standard assumption on the kernel function Tsitsiklis and Van Roy, (1997); Bhandari et al., (2018); Xu et al., 2019b ; Chen et al., 2019b .
Assumption 3.
For any state $s\in\mathcal S$, the kernel function is uniformly bounded, i.e., $\|\phi(s)\|\le 1$. In addition, we define the feature matrix $\Phi$ as the matrix whose $s$-th row is the feature vector $\phi(s)^\top$, and assume that the columns of $\Phi$ are linearly independent.
The boundedness assumption is mild since we can always normalize the kernel functions.
We next define the following norm of the value function: $\|V_\theta\|_D := \sqrt{V_\theta^\top D\, V_\theta}$, where $D$ is a diagonal matrix whose diagonal entries are given by the stationary distribution $\mu_\pi$, and $\Sigma := \Phi^\top D\,\Phi = \mathbb{E}_{s\sim\mu_\pi}\big[\phi(s)\phi(s)^\top\big]$ is the steady-state feature covariance matrix.
The next assumption of the bounded domain is standard in theoretical analysis of the Adam-type algorithms Reddi et al., (2018); Tran and Phong, (2019).
Assumption 4.
The domain $\mathcal D$ of the approximation parameters is a bounded ball that contains the TD fixed point $\theta^*$ (defined in Section 4.3). That is, there exists $R>0$ such that $\|\theta^*\|\le R$ and $\|\theta\|\le R$ for any $\theta\in\mathcal D$.
The boundedness parameter $R$ can be chosen as discussed in Bhandari et al., (2018).
4.3 Convergence of TD-AMSGrad
In the following, we provide the convergence results for TD-AMSGrad with linear function approximation under Markovian sampling.
First consider the full pseudo-gradient $\bar g(\theta)$ in (16). We define $\theta^*$ as the point satisfying $\bar g(\theta^*)=0$, i.e., the fixed point of the projected Bellman operator. Then $\theta^*$ is the unique such point under Assumption 3, following from the contraction property of the projected Bellman operator Tsitsiklis and Van Roy, (1997). The following lemma is useful for our convergence analysis.
Lemma 2.
Lemma 2 indicates that the update of TD learning exhibits a property similar to strong convexity in optimization.
Since our analysis is under Markovian sampling, a non-zero bias arises in our convergence analysis when estimating the pseudo-gradient, i.e., in general $\mathbb{E}[g_t(\theta)]\neq\bar g(\theta)$. The following lemma provides an upper bound on such a bias error for TD-AMSGrad.
Lemma 3.
The following theorem provides the convergence of TD-AMSGrad under a constant stepsize coupled with diminishing hyper-parameters in the AMSGrad update.
Theorem 3.
In Theorem 3, the coefficients in the bound are time-independent constants. Therefore, under the choice of the stepsize and hyper-parameters in the theorem, Algorithm 2 converges to a neighborhood of the global optimum at a rate of . The size of the neighborhood is controlled by the stepsize $\alpha$. Similarly to Theorem 1, we can balance the tradeoff between the convergence rate and the convergence accuracy, i.e., the size of the neighborhood, by setting the stepsize , which yields convergence to the global optimal solution at the rate of .
Next, we provide the convergence result with a diminishing stepsize in the following theorem.
Theorem 4.
Comparing with Theorem 3 and observing , we conclude that TD-AMSGrad with the diminishing stepsize converges exactly to the global optimum at the rate of , rather than only to a neighborhood.
5 Conclusion
This paper provides the first convergence analysis of Adam-type RL algorithms under Markovian sampling. For PG-AMSGrad with nonlinear function approximation of the policy, we show that the algorithm converges to a stationary point with a diminishing stepsize. With a constant stepsize, we show that PG-AMSGrad converges only to a neighborhood of a stationary point, but at a faster rate. Furthermore, we also provide finite-time convergence results for TD-AMSGrad to the global optimum under proper choices of the stepsize.
Several future directions along this topic are interesting. For example, it is important to study the optimality of the convergence results of PG-AMSGrad. Extending the analysis to more general value function approximation and establishing convergence results for constant hyper-parameters in TD-AMSGrad are also of interest.
Acknowledgements
The work was supported in part by the U.S. National Science Foundation under Grants CCF-1761506, ECCS-1818904, CCF-1909291 and CCF-1900145.
References
- Agarwal et al., (2019) Agarwal, A., Kakade, S. M., Lee, J. D., and Mahajan, G. (2019). Optimality and approximation with policy gradient methods in Markov decision processes. arXiv preprint arXiv:1908.00261.
- Baird, (1995) Baird, L. (1995). Residual algorithms: Reinforcement learning with function approximation. In Machine Learning Proceedings 1995, pages 30–37. Elsevier.
- Baxter and Bartlett, (2001) Baxter, J. and Bartlett, P. L. (2001). Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 15:319–350.
- Bello et al., (2017) Bello, I., Zoph, B., Vasudevan, V., and Le, Q. V. (2017). Neural optimizer search with reinforcement learning. In International Conference on Machine Learning (ICML), pages 459–468.
- Benveniste et al., (2012) Benveniste, A., Métivier, M., and Priouret, P. (2012). Adaptive algorithms and stochastic approximations, volume 22. Springer Science & Business Media.
- Bhandari and Russo, (2019) Bhandari, J. and Russo, D. (2019). Global optimality guarantees for policy gradient methods. arXiv preprint arXiv:1906.01786.
- Bhandari et al., (2018) Bhandari, J., Russo, D., and Singal, R. (2018). A finite time analysis of temporal difference learning with linear function approximation. In Conference on Learning Theory (COLT).
- Bhatnagar, (2010) Bhatnagar, S. (2010). An actor-critic algorithm with function approximation for discounted cost constrained markov decision processes. Systems & Control Letters, 59(12):760–766.
- Bhatnagar et al., (2008) Bhatnagar, S., Ghavamzadeh, M., Lee, M., and Sutton, R. S. (2008). Incremental natural actor-critic algorithms. In Advances in Neural Information Processing Systems (NeurIPS), pages 105–112.
- Bhatnagar et al., (2009) Bhatnagar, S., Sutton, R. S., Ghavamzadeh, M., and Lee, M. (2009). Natural actor-critic algorithms. Automatica, 45(11):2471–2482.
- Cai et al., (2019) Cai, Q., Yang, Z., Lee, J. D., and Wang, Z. (2019). Neural temporal-difference learning converges to global optima. In Advances in Neural Information Processing Systems (NeurIPS), pages 11312–11322.
- Castillo et al., (2019) Castillo, G. A., Weng, B., Hereid, A., Wang, Z., and Zhang, W. (2019). Reinforcement learning meets hybrid zero dynamics: A case study for rabbit. In 2019 International Conference on Robotics and Automation (ICRA), pages 284–290.
- (13) Chen, X., Liu, S., Sun, R., and Hong, M. (2019a). On the convergence of a class of Adam-type algorithms for non-convex optimization. In International Conference on Learning Representations (ICLR).
- (14) Chen, Z., Zhang, S., Doan, T. T., Maguluri, S. T., and Clarke, J.-P. (2019b). Finite-time analysis of Q-learning with linear function approximation. arXiv preprint arXiv:1905.11425.
- Dalal et al., (2018) Dalal, G., Szörényi, B., Thoppe, G., and Mannor, S. (2018). Finite sample analyses for TD(0) with function approximation. In Thirty-Second AAAI Conference on Artificial Intelligence.
- Fazel et al., (2018) Fazel, M., Ge, R., Kakade, S., and Mesbahi, M. (2018). Global convergence of policy gradient methods for the linear quadratic regulator. In International Conference on Machine Learning (ICML), pages 1467–1476.
- Fu et al., (2020) Fu, Z., Yang, Z., Chen, Y., and Wang, Z. (2020). Actor-critic provably finds nash equilibria of linear-quadratic mean-field games. In International Conference on Learning Representations (ICLR).
- Györfi and Walk, (1996) Györfi, L. and Walk, H. (1996). On the averaged stochastic approximation for linear regression. SIAM Journal on Control and Optimization, 34(1):31–61.
- Hu and Syed, (2019) Hu, B. and Syed, U. (2019). Characterizing the exact behaviors of temporal difference learning algorithms using Markov jump linear system theory. In Advances in Neural Information Processing Systems (NeurIPS), pages 8477–8488.
- Ji et al., (2019) Ji, K., Wang, Z., Zhou, Y., and Liang, Y. (2019). Faster stochastic algorithms via history-gradient aided batch size adaptation. arXiv preprint arXiv:1910.09670.
- Kakade, (2002) Kakade, S. M. (2002). A natural policy gradient. In Advances in Neural Information Processing Systems (NeurIPS), pages 1531–1538.
- Karimi et al., (2019) Karimi, B., Miasojedow, B., Moulines, E., and Wai, H.-T. (2019). Non-asymptotic analysis of biased stochastic approximation scheme. In Conference on Learning Theory (COLT).
- Kingma and Ba, (2015) Kingma, D. P. and Ba, J. (2015). Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR).
- Konda, (2002) Konda, V. (2002). Actor-critic algorithms. PhD thesis, Massachusetts Institute of Technology.
- Konda and Tsitsiklis, (2000) Konda, V. R. and Tsitsiklis, J. N. (2000). Actor-critic algorithms. In Advances in Neural Information Processing Systems (NeurIPS), pages 1008–1014.
- Kushner and Yin, (2003) Kushner, H. and Yin, G. G. (2003). Stochastic approximation and recursive algorithms and applications, volume 35. Springer Science & Business Media.
- Lakshminarayanan and Szepesvari, (2018) Lakshminarayanan, C. and Szepesvari, C. (2018). Linear stochastic approximation: How far does constant step-size and iterate averaging go? In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 1347–1355.
- Liu et al., (2019) Liu, B., Cai, Q., Yang, Z., and Wang, Z. (2019). Neural trust region/proximal policy optimization attains globally optimal policy. In Advances in Neural Information Processing Systems (NeurIPS), pages 10564–10575.
- Luo et al., (2019) Luo, Y., Yang, Z., Wang, Z., and Kolar, M. (2019). Natural actor-critic converges globally for hierarchical linear quadratic regulator. arXiv preprint arXiv:1912.06875.
- Malik et al., (2019) Malik, D., Pananjady, A., Bhatia, K., Khamaru, K., Bartlett, P., and Wainwright, M. (2019). Derivative-free methods for policy optimization: Guarantees for linear quadratic systems. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 2916–2925.
- Mitrophanov, (2005) Mitrophanov, A. Y. (2005). Sensitivity and convergence of uniformly ergodic Markov chains. Journal of Applied Probability, 42(4):1003–1014.
- Mnih et al., (2013) Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. (2013). Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.
- Papini et al., (2018) Papini, M., Binaghi, D., Canonaco, G., Pirotta, M., and Restelli, M. (2018). Stochastic variance-reduced policy gradient. In International Conference on Machine Learning (ICML), pages 4026–4035.
- Papini et al., (2017) Papini, M., Pirotta, M., and Restelli, M. (2017). Adaptive batch size for safe policy gradients. In Advances in Neural Information Processing Systems (NeurIPS), pages 3591–3600.
- Papini et al., (2019) Papini, M., Pirotta, M., and Restelli, M. (2019). Smoothing policies and safe policy gradients. arXiv preprint arXiv:1905.03231.
- Pednault et al., (2002) Pednault, E., Abe, N., and Zadrozny, B. (2002). Sequential cost-sensitive decision making with reinforcement learning. In Proceedings of the eighth ACM SIGKDD International conference on Knowledge Discovery and Data Mining, pages 259–268.
- Pirotta et al., (2015) Pirotta, M., Restelli, M., and Bascetta, L. (2015). Policy gradient in lipschitz Markov decision processes. Machine Learning, 100(2-3):255–283.
- Reddi et al., (2018) Reddi, S. J., Kale, S., and Kumar, S. (2018). On the convergence of Adam and beyond. In International Conference on Learning Representations (ICLR).
- Rummery and Niranjan, (1994) Rummery, G. A. and Niranjan, M. (1994). On-line Q-learning using connectionist systems, volume 37.
- Schulman et al., (2015) Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. (2015). Trust region policy optimization. In International Conference on Machine Learning (ICML), pages 1889–1897.
- Schulman et al., (2017) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
- Shani et al., (2020) Shani, L., Efroni, Y., and Mannor, S. (2020). Adaptive trust region policy optimization: Global convergence and faster rates for regularized MDPs. In Thirty-Fourth AAAI Conference on Artificial Intelligence.
- Shen et al., (2019) Shen, Z., Ribeiro, A., Hassani, H., Qian, H., and Mi, C. (2019). Hessian aided policy gradient. In International Conference on Machine Learning (ICML), pages 5729–5738.
- Silver et al., (2014) Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., and Riedmiller, M. (2014). Deterministic policy gradient algorithms. In International Conference on Machine Learning (ICML), pages 387–395.
- Srikant and Ying, (2019) Srikant, R. and Ying, L. (2019). Finite-time error bounds for linear stochastic approximation and TD learning. In Conference on Learning Theory (COLT), pages 2803–2830.
- Stooke and Abbeel, (2018) Stooke, A. and Abbeel, P. (2018). Accelerated methods for deep reinforcement learning. arXiv preprint arXiv:1803.02811.
- Sun et al., (2020) Sun, T., Shen, H., Chen, T., and Li, D. (2020). Adaptive temporal difference learning with linear function approximation. arXiv preprint arXiv:2002.08537.
- Sutton, (1988) Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9–44.
- Sutton et al., (2000) Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems (NeurIPS), pages 1057–1063.
- Tadić et al., (2017) Tadić, V. B., Doucet, A., et al. (2017). Asymptotic bias of stochastic gradient search. The Annals of Applied Probability, 27(6):3255–3304.
- Tran and Phong, (2019) Tran, P. T. and Phong, L. T. (2019). On the convergence proof of AMSGrad and a new version. IEEE Access, 7:61706–61716.
- Tsitsiklis and Van Roy, (1997) Tsitsiklis, J. N. and Van Roy, B. (1997). An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5):674–690.
- Tu and Recht, (2019) Tu, S. and Recht, B. (2019). The gap between model-based and model-free methods on the linear quadratic regulator: An asymptotic viewpoint. In Conference on Learning Theory (COLT), pages 3036–3083.
- Wang et al., (2020) Wang, L., Cai, Q., Yang, Z., and Wang, Z. (2020). Neural policy gradient methods: Global optimality and rates of convergence. In International Conference on Learning Representations (ICLR).
- Watkins and Dayan, (1992) Watkins, C. J. and Dayan, P. (1992). Q-learning. Machine Learning, 8(3-4):279–292.
- Williams, (1992) Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256.
- (57) Xu, P., Gao, F., and Gu, Q. (2019a). An improved convergence analysis of stochastic variance-reduced policy gradient. In International Conference on Uncertainty in Artificial Intelligence (UAI).
- Xu et al., (2020) Xu, P., Gao, F., and Gu, Q. (2020). Sample efficient policy gradient methods with recursive variance reduction. In International Conference on Learning Representations (ICLR).
- (59) Xu, T., Zou, S., and Liang, Y. (2019b). Two time-scale off-policy td learning: Non-asymptotic analysis over markovian samples. In Advances in Neural Information Processing Systems (NeurIPS), pages 10633–10643.
- Yang et al., (2019) Yang, Z., Chen, Y., Hong, M., and Wang, Z. (2019). Provably global convergence of actor-critic: A case for linear quadratic regulator with ergodic cost. In Advances in Neural Information Processing Systems (NeurIPS), pages 8351–8363.
- Zhang et al., (2019) Zhang, K., Koppel, A., Zhu, H., and Başar, T. (2019). Global convergence of policy gradient methods to (almost) locally optimal policies. arXiv preprint arXiv:1906.08383.
- Zhou et al., (2018) Zhou, D., Tang, Y., Yang, Z., Cao, Y., and Gu, Q. (2018). On the convergence of adaptive gradient methods for nonconvex optimization. arXiv preprint arXiv:1808.05671.
- (63) Zou, F., Shen, L., Jie, Z., Zhang, W., and Liu, W. (2019a). A sufficient condition for convergences of Adam and RMSProp. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 11127–11135.
- (64) Zou, S., Xu, T., and Liang, Y. (2019b). Finite-sample analysis for SARSA with linear function approximation. In Advances in Neural Information Processing Systems (NeurIPS), pages 8665–8675.
Supplementary Materials
Appendix A EstQ algorithm
We introduce the unbiased Q-function estimating algorithm EstQ below.
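As a complement to Algorithm 3, the following is a minimal Python sketch of one standard construction of an unbiased Q-estimate via a geometrically sampled rollout length. The environment interface is assumed, and the exact EstQ procedure of Zhang et al. (2019) may differ in its details (e.g., in how the discounting is split between the horizon distribution and the reward weights).

```python
import numpy as np

def est_q(env, sample_action, theta, s, a, gamma, rng=None):
    """Unbiased estimate of Q_{pi_theta}(s, a) via a geometric rollout horizon.

    With H ~ Geometric(1 - gamma) (support {1, 2, ...}), summing the first H rewards of a
    rollout started at (s, a) gives E[sum] = sum_t gamma^t E[r_t] = Q(s, a), since
    P(H >= t + 1) = gamma^t.  Assumed interface: `env.step(s, a)` returns (s_next, reward)
    and `sample_action(theta, s)` draws a ~ pi_theta(.|s).
    """
    rng = rng or np.random.default_rng()
    horizon = rng.geometric(1 - gamma)   # P(H = h) = (1 - gamma) * gamma**(h - 1)
    q_hat, state, action = 0.0, s, a
    for _ in range(horizon):
        state_next, reward = env.step(state, action)
        q_hat += reward
        state, action = state_next, sample_action(theta, state_next)
    return q_hat
```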
Appendix B Proof of Lemma 1
For notational simplicity, we denote and in the sequel. We first introduce two technical lemmas which are useful for the proof of Lemma 1.
Lemma 4.
The gradient estimator is uniformly bounded, that is,
(17) |
Similarly, we also have
Proof.
The proof proceeds from the definition of the gradient estimator.
where (i) follows from Assumption 1 and from the fact that the reward function is uniformly bounded. The proof of the second claim is similar to the first one. ∎
Lemma 5.
(Zou et al., 2019b, Lemma 3) For any , under Assumption 2 we have
Proof of Lemma 1: To derive the Lipschitz continuity of , we first use the policy gradient theorem Sutton et al., (2000) which gives
where is denoted as the discounted visitation distribution. We further proceed the proof of Lipschitz continuity of as
(18) |
where (i) follows from Assumption 1 and boundedness of , and (ii) follows since . In the following, we bound the last two terms in the right hand side. First, we show that is also Lipschitz continuous with respect to . To see that, we have
where is the distribution of the state-action trajectory corresponding to the transition probability and the policy . Although is different from the state-action visitation distribution , both of them share the same property. That is, for a different , is determined by the same transition kernel and only differs in the policy. Therefore, Lemma 5 applies to two distributions and , which yields
(19) |
Appendix C Proof of Theorem 1
In this appendix we will first introduce some useful lemmas in Appendix C.1, and then provide technical proofs in Appendix C.2.
C.1 Supporting Lemmas
We first provide two lemmas to deal with the bias of the gradient estimation.
Lemma 6.
Given and a positive diagonal matrix , then
(21) |
Proof.
Since is a diagonal matrix with , we have
where (i) follows because . ∎
Lemma 7.
Proof.
For notational brevity we denote . Observe that
where (i) follows from the boundedness of . In the following, we bound . To this end, we need to build a new Markov chain in which the policy is fixed after time . To be specific, in Algorithm 1, we sample actions via a changing policy , which yields a Markov chain:
(23) |
Now, from time , we repeatedly use the policy , which yields an auxiliary Markov chain:
(24) |
where we denote correspondingly. Clearly we have for ,
(25) |
Since we use the same policy from time , the Markov chain given in (24) in this time slot is uniformly ergodic, and thus satisfies Assumption 2. Therefore, we can bound as
(26) |
Observe that
(27) |
It remains to deal with . We proceed as follows
(28) |
where (i) follows from Assumption 1 and (ii) follows from Lemma 9. It remains to bound the second term of the right hand side. To this end, we first write as
(29) |
Similarly, for we have
(30) |
Then we obtain the following bound
Then we have a dynamical form as
(31) |
Applying (31) recursively yields
(32) |
∎
The next few lemmas are closely related to the update rule of AMSGrad.
Lemma 8.
Lemma 9.
Let for be sequences generated by Algorithm 1 and denote . Then
(34) |
Proof.
Lemma 10.
Proof.
We refer the reader to the proof of Lemma A.2 in Zhou et al., (2018) for more details by reducing their proof to the case where . ∎
Since the update rule of AMSGrad is complicated, it is often useful to introduce an auxiliary sequence . If we define , then for , let
(38) |
The following lemma captures the property of and its connection with .
C.2 Proof of Theorem 1
In the remainder, we complete this proof by taking the following three steps.
Step 1: Establishing a convergent sequence
First, we observe that is Lipschitz continuous according to Lemma 1. Then we have
(43) |
The term is the key to deal with under non-i.i.d. sampling, where a non-zero bias arises in the gradient estimation. We bound this term as
where (i) follows from Lemma 6 and (ii) follows because is positive diagonal matrix and each entry is bounded as in Lemma 8.
Next we bound the term as
where (i) follows from Cauchy-Schwarz inequality, (ii) follows from Lemma 11, (iii) follows because and (iv) follows by the update rule of Algorithm 1.
Substituting the upper bounds of the terms and in (C.2) yields
Next we rearrange the above inequality and take expectation over all the randomness to obtain
(44) |
where in the last equation we denote for brevity, and this notation is used in the sequel as well. We emphasize that the filtration contains all the samples up to time except the samples for estimating . Thus we have where the expectation is taken over the randomness in EstQ algorithm.
Step 2: Bounding the bias term
In the following, we bound the bias term . Observe that
It turns out that the terms are easier to bound and the term is the key to bound the bias. For the clarity of presentation, we first bound the terms .
To bound the term , we have
where (i) follows from Cauchy-Schwarz inequality, (ii) follows because , (iii) follows from Lemma 1 and Lemma 8, (iv) follows by the triangle inequality, and (v) follows due to Lemma 9.
Similarly, we can bound the term as follows:
Next, we bound the term and obtain
Last, it remains to bound the term . Observe that . We first deal with as
where (i) follows from Lemma 6. The key to bound the bias is to bound the term , which has been done in Lemma 7.
To conclude, the bias term can be bounded for a fixed as:
(45) |
Step 3: Establishing convergence to a stationary point
For the case with a constant stepsize , we choose . To take the summation over the time steps, notice that the bound in (C.2) holds only when , and hence we separate the summation of the bias term into two parts as follows:
(46) |
Then applying (C.2) to (C.2) yields
(47) |
where (i) follows from Lemma 10, (ii) follows from the definition of and (iii) follows since
Appendix D Proof of Theorem 2
The proof of Theorem 2 starts from steps similar to those of Theorem 1. The difference starts from (C.2). Here we consider a stepsize that is not constant. Then we divide both sides of (C.2) by and obtain
where
We choose . Again we choose if and if . If , then choice of yields
where the last inequality follows because
(48) |
By taking the summation over time steps, we obtain
(49) |
where (i) follows since is decreasing and from the definition of , and (ii) follows from Lemma 10.
For clarity we bound separately. The key observation is that is uniformly bounded as
(50) |
Thus
(51) |
Appendix E Proof of Proposition 1
In the following, we show how to adapt the proof techniques in analyzing PG-AMSGrad to PG-SGD. We first reduce Lemma 7 to the vanilla PG case.
Proof.
Since the major part of the proof is similar to that of Lemma 7, we only emphasize the steps that differ.
For notational brevity, we denote . Then we still have
(53) |
Then we use the steps similar to those in building an auxiliary Markov chain (24) which is generated by the policy from time . Similarly to (C.1), we have
(54) |
where (i) follows from (26), (C.1) and (C.1). Observe that in PG-SGD, for any we have,
Thus, we complete the proof by substituting the above observation into (54) and then into (53).
∎
Proof of Proposition 1: Following from the Lipschitz condition of in Lemma 1, we obtain
Then we rearrange the above inequality and take expectation over all the randomness to have
(55) |
where the filtration contains all the samples up to time except the samples for estimating in the EstQ algorithm. Thus we have where the expectation is taken over the randomness in EstQ algorithm. Observe that if , we have
Next we bound the terms and .
Observe that . We deal with and have
Next we bound and obtain
Similarly, we can bound term as
Last, we bound term and obtain
Thus for any , we have
(56) |
Proof of convergence under constant stepsize:
We first consider a constant stepsize for . Choose . We rearrange (55) and take the summation over the time steps to obtain
where (i) follows because , and (ii) follows from (56). Then we divide both sides of the above inequality by , and have
Proof of convergence under diminishing stepsize:
Appendix F Proof of Lemma 3
We bound the expectation of the bias by constructing a new Markov chain and applying useful techniques from information theory. Before deriving the bound, we first introduce some technical lemmas.
Proof.
Lemma 14.
Lemma 15.
(Bhandari et al., 2018, Lemma 11) Let . Fix . Then is uniformly bounded by
and it is Lipschitz continuous as given by
Lemma 16.
Remark 1.
The notation indicates that the random variable and are independent conditioned on , which is a standard notation in information theory.
Proof of Lemma 3: Now we bound the bias of the gradient estimation. We first develop the connection between and using Lemma 15. For notational brevity, we define a random tuple and clearly . We denote
where as defined in (16).
Next, we apply Lemma 9 to obtain
Thus we can relate and by using the Lipschitz property in Lemma 15 as follows.
(59) |
Next, we seek to bound using Lemma 16. Observe that given any deterministic , we have
Since is non-random, we have . Now we are ready to bound with Lemma 16 via constructing a random process satisfying (58). To do so, consider random variables and drawn independently from the marginal distribution of and , thus . Further, we can obtain since and are independent. Combining Lemma 15 and Lemma 16 yields
(60) |
Appendix G Proof of Theorem 3
Different from the regret-bound analysis of AMSGrad in conventional optimization Reddi et al., (2018), our analysis here focuses on the convergence rate. In fact, a slight modification of our proof also provides the convergence rate of AMSGrad for conventional strongly convex optimization, which can be of independent interest. Moreover, we provide results under a constant stepsize and under Markovian sampling, neither of which has been studied in Reddi et al., (2018).
To proceed with the proof, we first observe that
Clearly due to Assumption 4. We start from the update of when .
where (i) follows from Cauchy-Schwarz inequality, and (ii) holds because .
Next, we take the expectation over all samples used up to time step on both sides, which still preserves the inequality. Since we consider the Markovian sampling case, the estimation of the gradient is biased. That is,
(61) |
where and this notation will be used in the remainder for simplicity.
Thus we have
where (i) follows from Lemma 2 and because , (ii) follows because and , and (iii) follows from Lemma 14 and Assumption 4. By rearranging the terms in the above inequality and taking the summation over time steps, we have
where (i) follows from . By further implementing the first term in the right-hand side of the last inequality, we can then bound the sum as
Next, we further bound the above term as
(62) |
where (i) follows from Assumption 4 and because .
Since we consider the case with a constant stepsize , we continue to bound (G) as follows
(63) |
where (i) follows because
Then we are ready to obtain the upper bound by applying Lemma 3. Choosing if and if , we obtain
Finally, applying Jensen’s inequality yields
(64) |
where
Appendix H Proof of Theorem 4
We first provide the following useful lemma.
Lemma 17.
Let and for . Then
(65) |
Proof.
The proof follows from taking the standard sum of geometric sequences.
(66) |
∎
Proof of Theorem 4: The proof starts with steps similar to those of Theorem 3. The difference begins from (G), where we now consider a diminishing stepsize . We then have
Then we are ready to obtain the upper bound by applying Lemma 3. Choosing if and if , we obtain
Finally, applying Jensen’s inequality completes our proof as
where