
Sample Efficient Reinforcement Learning with REINFORCE

Junzi Zhang1, Jongho Kim2, Brendan O’Donoghue3, Stephen Boyd2
Abstract

Policy gradient methods are among the most effective methods for large-scale reinforcement learning, and their empirical success has prompted several works that develop the foundation of their global convergence theory. However, prior works have either required exact gradients or state-action visitation measure based mini-batch stochastic gradients with a diverging batch size, which limit their applicability in practical scenarios. In this paper, we consider classical policy gradient methods that compute an approximate gradient with a single trajectory or a fixed size mini-batch of trajectories under soft-max parametrization and log-barrier regularization, along with the widely-used REINFORCE gradient estimation procedure. By controlling the number of “bad” episodes and resorting to the classical doubling trick, we establish an anytime sub-linear high probability regret bound as well as almost sure global convergence of the average regret with an asymptotically sub-linear rate. These provide the first set of global convergence and sample efficiency results for the well-known REINFORCE algorithm and contribute to a better understanding of its performance in practice.

1 Introduction

In this paper, we study the global convergence rates of the REINFORCE algorithm (Williams 1992) for episodic reinforcement learning. REINFORCE is a vanilla policy gradient method that computes a stochastic approximate gradient with a single trajectory or a fixed-size mini-batch of trajectories using a particular choice of gradient estimator, where we use 'vanilla' here to disambiguate the method from more exotic variants such as natural policy gradient methods. REINFORCE and its variants are among the most widely used policy gradient methods in practice due to their good empirical performance and implementation simplicity (Mnih and Gregor 2014; Gu et al. 2015; Zoph and Le 2016; Rennie et al. 2017; Guu et al. 2017; Johnson et al. 2017; Yi et al. 2018; Kool, van Hoof, and Welling 2018, 2020). Related methods include the actor-critic family (Konda and Tsitsiklis 2003; Mnih et al. 2016) and deterministic and trust-region based variants (Silver et al. 2014; Schulman et al. 2017, 2015).

The theoretical results for policy gradient methods have, until recently, been restricted to convergence to local stationary points (Agarwal et al. 2019). Lately, a series of works have established global convergence results. These recent developments cover a broad range of issues including global optimality characterization (Fazel et al. 2018; Bhandari and Russo 2019), convergence rates (Zhang et al. 2019; Mei et al. 2020; Bhandari and Russo 2020; Cen et al. 2020), the use of function approximation (Agarwal et al. 2019; Wang et al. 2019; Fu, Yang, and Wang 2020), and efficient exploration (Agarwal et al. 2020) (for more details, see the related work section, which we defer to Appendix E due to space limits). Nevertheless, prior work on vanilla policy gradient methods either requires exact and deterministic policy gradients or only guarantees convergence up to $\Theta(1/M^{p})$ with a fixed mini-batch size $M>0$ of trajectories collected when performing a single update (where $p>0$ is $1/2$ in most cases), while global convergence is only achieved when the batch size $M$ goes to infinity. By contrast, practical implementations of policy gradient methods typically use either a single or a fixed number of sample trajectories, which tends to perform well. In addition, prior theoretical results (for general MDPs) have used state-action visitation measure based gradient estimators (see e.g., (Wang et al. 2019, (3.10))), which are typically not used in practice.

The main purpose of this paper is to bridge this gap between theory and practice. We do this in two major ways. First, we derive performance bounds for the case of a fixed mini-batch size, rather than requiring diverging size. Second, we remove the need for the state-action visitation measure based gradient, instead using the REINFORCE gradient estimator. It is nontrivial to go from a diverging mini-batch size to a fixed one. In fact, by allowing for an arbitrarily large batch size, existing works in the literature were able to make use of IID samples to decouple the analysis into deterministic gradient descent/ascent and error control of stochastic gradient estimations. In contrast, with a single trajectory or a fixed batch size, such a decoupling is no longer feasible. In addition, the state-action visitation measure based gradient estimations are unbiased and unbounded, while REINFORCE gradient estimations are biased and bounded. Hence a key to the analysis is to deal with the bias while making better use of the boundedness. Our analysis not only addresses these challenges, but also leads to convergence results in almost sure and high probability senses, which are stronger than the expected convergence results that dominate the literature (for vanilla policy gradient methods). We also emphasize that the goal of this work is to provide a deeper understanding of a widely used algorithm, REINFORCE, with little or no modifications, rather than tweaking it to achieve near-optimal performance bounds. Lastly, our analysis is not the complete picture and several open questions about the performance of policy gradient methods remain. We discuss these issues in the conclusion.

1.1 Contribution

Our major contribution can be summarized as follows. We establish the first set of global convergence results for the REINFORCE algorithm. In particular, we establish an anytime sub-linear high probability regret bound as well as almost sure global convergence of the average regret with an asymptotically sub-linear rate for REINFORCE, showing that the algorithm is sample efficient (i.e., with polynomial/non-exponential complexity). To our knowledge, these (almost sure and high probability) results are stronger than existing global convergence results for (vanilla) policy gradient methods in the literature. Moreover, our convergence results remove the non-vanishing $\Theta(1/M^{p})$ term (with $M>0$ being the mini-batch size of the trajectories and $p>0$ being some constant exponent) and hence show for the first time that policy gradient estimates based on a single trajectory or a finite number of trajectories also enjoy global convergence properties. Finally, the widely-used REINFORCE gradient estimation procedure is studied, as opposed to the state-action visitation measure based estimators typically studied in the literature but rarely used in practice.

2 Problem setting and preliminaries

Below we begin with our problem setting and some preliminaries on MDPs and policy optimization. For brevity we restrict ourselves to the stationary infinite-horizon discounted setting. We briefly discuss potential extensions beyond this setting in §6.

2.1 Problem setting

We consider a finite MDP $\mathcal{M}$, which is characterized by a finite state space $\mathcal{S}=\{1,\dots,S\}$, a finite action space $\mathcal{A}=\{1,\dots,A\}$, a transition probability $p$ (with $p(s^{\prime}|s,a)$ being the probability of transitioning to state $s^{\prime}$ given the current state $s$ and action $a$), a reward function $r$ (with $r(s,a)$ being the instantaneous reward when taking action $a$ at state $s$), a discount factor $\gamma\in[0,1)$ and an initial state distribution $\rho\in\Delta(\mathcal{S})$. Here $\Delta(\mathcal{X})$ denotes the probability simplex over a finite set $\mathcal{X}$. A (stationary, stochastic) policy $\pi$ is a mapping from $\mathcal{S}$ to $\Delta(\mathcal{A})$. We will use $\pi(a|s)$, $\pi(s,a)$ or $\pi_{s,a}$ interchangeably to denote the probability of taking action $a$ at state $s$ following policy $\pi$. The policy $\pi$ can also be viewed as an $SA$-dimensional vector in

\[
\Pi=\Big\{\pi\in\mathbf{R}^{SA}\,\Big|\,\sum_{a=1}^{A}\pi_{s,a}=1\ (\forall s\in\mathcal{S}),\ \pi_{s,a}\geq 0\ (\forall s\in\mathcal{S},\,a\in\mathcal{A})\Big\}. \tag{1}
\]

Notice that here we use the double indices $s$ and $a$ for notational convenience. We use $\pi(s,\cdot)\in\mathbf{R}^{A}$ to denote the sub-vector $(\pi(s,1),\dots,\pi(s,A))$. We also assume that $r(s,a)$ is deterministic for any $s\in\mathcal{S}$ and $a\in\mathcal{A}$ for simplicity, although our results hold for any $r$ with an almost sure uniform bound. Here $r$ can be similarly viewed as an $SA$-dimensional vector. Without loss of generality, we assume that $r(s,a)\in[0,1]$ for all $s\in\mathcal{S}$ and $a\in\mathcal{A}$, which is a common assumption (Jaksch, Ortner, and Auer 2010; Agarwal et al. 2019; Mei et al. 2020; Even-Dar and Mansour 2003; Jin et al. 2018). We also assume that $\rho$ is component-wise positive, as is assumed in (Bhandari and Russo 2019).

Given a policy $\pi\in\Pi$, the expected cumulative reward of the MDP is defined as

\[
F(\pi)=\mathbf{E}\sum_{t=0}^{\infty}\gamma^{t}r(s_{t},a_{t}), \tag{2}
\]

where $s_{0}\sim\rho$, $a_{t}\sim\pi(\cdot|s_{t})$, $s_{t+1}\sim p(\cdot|s_{t},a_{t})$ for all $t\geq 0$, and the goal is to find a policy $\pi$ which solves the following optimization problem:

\[
\text{maximize}_{\pi\in\Pi}\quad F(\pi). \tag{3}
\]

Any policy $\pi^{\star}\in\mathop{\rm argmax}_{\pi\in\Pi}F(\pi)$ is said to be optimal, and the corresponding optimal objective value is denoted as $F^{\star}=F(\pi^{\star})$. Note that in the literature, $F(\pi)$ is also commonly written as $V_{\rho}^{\pi}$ and referred to as the value function. Here we hide the dependency on $\rho$ as it is fixed throughout the paper.
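To make the objective (2)--(3) concrete, here is a minimal Python sketch that estimates $F(\pi)$ by truncated Monte Carlo rollouts on a small synthetic tabular MDP; the transition tensor P, reward matrix r, rollout horizon and all function names are our own illustrative choices, not part of the paper.

import numpy as np

def rollout_return(P, r, rho, pi, gamma, H, rng):
    """One truncated rollout: s_0 ~ rho, a_t ~ pi(.|s_t), s_{t+1} ~ p(.|s_t, a_t)."""
    S, A = r.shape
    s = rng.choice(S, p=rho)
    G = 0.0
    for t in range(H):
        a = rng.choice(A, p=pi[s])
        G += (gamma ** t) * r[s, a]
        s = rng.choice(S, p=P[s, a])
    return G

def estimate_F(P, r, rho, pi, gamma, H=200, n_rollouts=1000, seed=0):
    """Monte Carlo estimate of F(pi), truncated at horizon H (bias at most gamma^H/(1-gamma))."""
    rng = np.random.default_rng(seed)
    return np.mean([rollout_return(P, r, rho, pi, gamma, H, rng)
                    for _ in range(n_rollouts)])

# Tiny synthetic MDP (illustrative): S = 3 states, A = 2 actions.
rng = np.random.default_rng(1)
S, A = 3, 2
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] is a distribution over next states
r = rng.uniform(0.0, 1.0, size=(S, A))       # rewards in [0, 1], as assumed above
rho = np.ones(S) / S                         # component-wise positive initial distribution
pi = np.ones((S, A)) / A                     # uniform policy
print(estimate_F(P, r, rho, pi, gamma=0.9))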

2.2 Vanilla policy gradient method and REINFORCE algorithm

When the transition probability $p$ and reward $r$ are fully known, problem (3) reduces to solving an MDP, in which case various classical algorithms are available, including value iteration and policy iteration (Bertsekas 2017). In this paper, we consider the episodic reinforcement learning setting in which the agent accesses $p$ and $r$ by interacting with the environment over successive episodes, i.e., the agent accesses the environment in the form of a $\rho$-restart model (Shani, Efroni, and Mannor 2019), which is commonly adopted in the policy gradient literature (Kakade et al. 2003). In addition, we focus on the REINFORCE algorithm, a representative policy gradient method.

Policy parametrization and surrogate objectives.

Here we consider parametrizing the policy with parameter $\theta\in\Theta$, i.e., $\pi_{\theta}:\Theta\rightarrow\Pi$, and take the following (regularized) optimization problem as an approximation to (3):

\[
\text{maximize}_{\theta\in\Theta}\quad L_{\lambda}(\theta)=F(\pi_{\theta})+\lambda R(\theta), \tag{4}
\]

where $\lambda\geq 0$ and $R:\Theta\rightarrow\mathbf{R}$ is a differentiable regularization term that improves convergence, to be specified later. Although our ultimate goal is still to solve the original problem (3), this regularized optimization problem is a useful surrogate, and our approach will be to tackle problem (4) with progressively smaller regularization penalties $\lambda$, thereby converging to a solution of the actual problem we care about.

Policy gradient method.

In each episode $n$, the policy gradient method directly performs an online stochastic gradient ascent update on a surrogate objective $L_{\lambda^{n}}(\theta)$, i.e.,

\[
\theta^{n+1}=\theta^{n}+\alpha^{n}\widehat{\nabla}_{\theta}L_{\lambda^{n}}(\theta^{n}), \tag{5}
\]

where $\alpha^{n}$ is the step-size and $\lambda^{n}$ is the regularization parameter. Here the stochastic gradient $\widehat{\nabla}_{\theta}L_{\lambda^{n}}(\theta^{n})$ is computed by sampling a single trajectory $\tau^{n}$ following policy $\pi_{\theta^{n}}$ from $\mathcal{M}$ with the initial state distribution $\rho$. Here $\tau^{n}=(s_{0}^{n},a_{0}^{n},r_{0}^{n},s_{1}^{n},a_{1}^{n},r_{1}^{n},\dots,s_{H^{n}}^{n},a_{H^{n}}^{n},r_{H^{n}}^{n})$, where $H^{n}$ is a finite (and potentially random) stopping time of the trajectory (to be specified below), $s_{0}^{n}\sim\rho$, $a_{t}^{n}\sim\pi_{\theta^{n}}(\cdot|s_{t}^{n})$, $s_{t+1}^{n}\sim p(\cdot|s_{t}^{n},a_{t}^{n})$ and $r_{t}^{n}=r(s_{t}^{n},a_{t}^{n})$ for all $t=0,\dots,H^{n}$. We summarize the generic policy gradient method (with single-trajectory gradient estimates) in Algorithm 1. An extension to mini-batch scenarios will be discussed in §5. As is always (implicitly) assumed in the literature on episodic reinforcement learning (e.g., cf. (Marbach and Tsitsiklis 2001)), given the current policy, we assume that the sampled trajectory is conditionally independent of all previous policies and trajectories.

Algorithm 1 Policy Gradient Method with Single Trajectory Estimates
1:  Input: initial parameter $\theta^{0}$, step-sizes $\alpha^{n}$ and regularization parameters $\lambda^{n}$ ($n\geq 0$).
2:  for $n=0,1,\dots$ do
3:     Choose $H^{n}$, sample trajectory $\tau^{n}$ from $\mathcal{M}$ following policy $\pi_{\theta^{n}}$, and compute an approximate gradient $\widehat{\nabla}_{\theta}L_{\lambda^{n}}(\theta^{n})$ of $L_{\lambda^{n}}$ using trajectory $\tau^{n}$.
4:     Update $\theta^{n+1}=\theta^{n}+\alpha^{n}\widehat{\nabla}_{\theta}L_{\lambda^{n}}(\theta^{n})$.
5:  end for
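For reference, the following is a minimal Python sketch of the loop in Algorithm 1, with the gradient estimator left abstract; the callable names (grad_estimator, alpha, lam, horizon) are hypothetical placeholders, and the schedules are left unspecified since the analyzed choices only appear in §4.

def policy_gradient(theta0, grad_estimator, alpha, lam, horizon, n_episodes):
    """Generic Algorithm 1: theta^{n+1} = theta^n + alpha^n * grad-hat L_{lambda^n}(theta^n).

    grad_estimator(theta, lam, H) is assumed to sample one trajectory of length H under
    pi_theta and return a stochastic gradient of L_lambda at theta (e.g., REINFORCE, cf. §3.2).
    alpha, lam, horizon are callables mapping the episode index n to the step-size,
    regularization parameter, and truncation length, respectively.
    """
    theta = theta0.copy()
    for n in range(n_episodes):
        g = grad_estimator(theta, lam(n), horizon(n))
        theta = theta + alpha(n) * g
    return theta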

REINFORCE algorithm.

There are several ways of choosing the stochastic gradient operator $\widehat{\nabla}_{\theta}$ in the policy gradient method, and the well-known REINFORCE algorithm (Williams 1992) corresponds to a specific family of estimators based on the policy gradient theorem (Sutton et al. 2000) (cf. §3). Other common alternatives include zeroth-order/random search (Fazel et al. 2018; Malik et al. 2018) and actor-critic (Konda and Tsitsiklis 2003) approximations. One may also choose to parametrize the policy as a mapping from the parameter space to a specific action, which would then result in deterministic policy gradient approximations (Silver et al. 2014).

Although our main goal is to study the REINFORCE algorithm, our analysis indeed holds for rather generic stochastic gradient estimates. In the next section, we introduce the (mild) assumptions needed for our convergence analysis and the detailed gradient estimation procedures in the REINFORCE algorithm, and then verify that the assumptions do hold for these gradient estimations.

2.3 Phased learning and performance criteria

Phased learning.

To facilitate the exposition below, we divide the optimization in Algorithm 1 into successive phases $l=0,1,\dots$, each with length $T_{l}>0$. We then fix the regularization coefficient $\lambda^{l}$ within each phase $l\geq 0$. In addition, a post-processing step is enforced at the end of each phase to produce the initialization of the next phase. The resulting algorithm is described in Algorithm 2. Here the trajectory is denoted as $\tau^{l,k}=(s_{0}^{l,k},a_{0}^{l,k},r_{0}^{l,k},\dots,s_{H^{l,k}}^{l,k},a_{H^{l,k}}^{l,k},r_{H^{l,k}}^{l,k})$, and we will refer to $\theta^{l,k}$ as the $(l,k)$-th iterate hereafter. The post-processing function is required to guarantee that the resulting policy $\pi_{\theta}$ is lower bounded by a pre-specified tolerance $\epsilon_{\rm pp}\in(0,1/A]$ to ensure that the regularization is bounded (cf. Algorithm 3 for a formal description and §3.1 for an example realization).

Note that here the $k$-th episode in phase $l$ corresponds to the $n$-th episode in the original indexing with $n=\sum_{j=0}^{l-1}T_{j}+k$. For notational compactness below, for $\mathcal{T}=\{T_{j}\}_{j=0}^{\infty}$, we define $B_{\mathcal{T}}:\mathbf{Z}_{+}\times\mathbf{Z}_{+}\rightarrow\mathbf{Z}_{+}$, where $B_{\mathcal{T}}(l,k)=\sum_{j=0}^{l-1}T_{j}+k$ maps the double index $(l,k)$ to the corresponding original episode number, with $\mathop{\bf dom}B_{\mathcal{T}}=\{(l,k)\in\mathbf{Z}_{+}\times\mathbf{Z}_{+}\,|\,0\leq k\leq T_{l}-1\}$. The mapping $B_{\mathcal{T}}$ is a bijection and we denote its inverse by $G_{\mathcal{T}}$.
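For concreteness, the index maps $B_{\mathcal{T}}$ and $G_{\mathcal{T}}$ could be realized as in the following small sketch (our own helper code; the doubling schedule $T_{l}=2^{l}$ used later in Theorem 6 serves as the example).

def B(l, k, T):
    """Map the (l, k)-th iterate to its original episode number: sum_{j < l} T_j + k."""
    assert 0 <= k < T(l)
    return sum(T(j) for j in range(l)) + k

def G(n, T):
    """Inverse map: recover (l, k) from the original episode number n."""
    l = 0
    while n >= T(l):
        n -= T(l)
        l += 1
    return l, n

T = lambda l: 2 ** l          # doubling phase lengths, as in Theorem 6
assert G(B(3, 5, T), T) == (3, 5)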

Algorithm 2 Phased Policy Gradient Method
1:  Input: initial parameter $\tilde{\theta}^{0,0}$, step-sizes $\alpha^{l,k}$, regularization parameters $\lambda^{l}$, phase lengths $T_{l}$ ($l,\,k\geq 0$) and post-processing tolerance $\epsilon_{\rm pp}\in(0,1/A]$.
2:  Set $\theta^{0,0}=\texttt{PostProcess}(\tilde{\theta}^{0,0},\epsilon_{\rm pp})$.
3:  for phase $l=0,1,2,\dots$ do
4:     for episode $k=0,1,\dots,T_{l}-1$ do
5:        Choose $H^{l,k}$, sample trajectory $\tau^{l,k}$ from $\mathcal{M}$ following policy $\pi_{\theta^{l,k}}$, and compute an approximate gradient $\widehat{\nabla}_{\theta}L_{\lambda^{l}}(\theta^{l,k})$ of $L_{\lambda^{l}}$ using trajectory $\tau^{l,k}$.
6:        Update $\theta^{l,k+1}=\theta^{l,k}+\alpha^{l,k}\widehat{\nabla}_{\theta}L_{\lambda^{l}}(\theta^{l,k})$.
7:     end for
8:     Set $\theta^{l+1,0}=\texttt{PostProcess}(\theta^{l,T_{l}},\epsilon_{\rm pp})$.
9:  end for
Algorithm 3 $\texttt{PostProcess}(\theta,\epsilon_{\rm pp})$
  Input: $\epsilon_{\rm pp}\in(0,1/A]$, $\theta\in\Theta$.
  Return $\theta^{\prime}$ (near $\theta$) such that $\pi_{\theta^{\prime}}(s,a)\geq\epsilon_{\rm pp}$ for each $(s,a)\in\mathcal{S}\times\mathcal{A}$.

Performance criteria.

The criterion we adopt to evaluate the performance of Algorithm 2 is regret. For any $N\geq 0$, the regret up to episode $N$ is defined as the cumulative sub-optimality of the policy over these episodes. Formally, we define

\[
\mathbf{regret}(N)=\sum_{\{(l,k)\,|\,B_{\mathcal{T}}(l,k)\leq N\}}F^{\star}-\hat{F}^{l,k}(\pi_{\theta^{l,k}}). \tag{6}
\]

Here the summation is over all $(l,k)$-th iterates whose corresponding original episode numbers are smaller than or equal to $N$, and $\hat{F}^{l,k}(\pi_{\theta^{l,k}})=\mathbf{E}_{l,k}\sum_{t=0}^{H^{l,k}}\gamma^{t}r(s_{t}^{l,k},a_{t}^{l,k})$, where $s_{0}^{l,k}\sim\rho$, $a_{t}^{l,k}\sim\pi_{\theta^{l,k}}(\cdot|s_{t}^{l,k})$, $s_{t+1}^{l,k}\sim p(\cdot|s_{t}^{l,k},a_{t}^{l,k})$ for all $t\geq 0$, and $\mathbf{E}_{l,k}$ denotes the conditional expectation given the $(l,k)$-th iterate $\theta^{l,k}$. Notice that the regret defined above takes into account the fact that the trajectories are stopped/truncated to have finite horizons $H^{l,k}$, which characterizes the actual loss caused by sampling the trajectories in line 5 of Algorithm 2. A similar regret definition for the episodic (discounted) reinforcement learning setting considered here is adopted in (Fu, Yang, and Wang 2020). We remark that all our regret bounds remain correct up to lower order terms when we replace $\hat{F}^{l,k}$ with $F$ or an expectation-free version.

Similarly, we also define the single-phase version of regret as follows. The regret up to episode $K\in\{0,\dots,T_{l}-1\}$ in phase $l$ is defined as

\[
\mathbf{regret}_{l}(K)=\sum_{k=0}^{K}F^{\star}-\hat{F}^{l,k}(\pi_{\theta^{l,k}}). \tag{7}
\]

Notice that (6) and (7) are connected via

\[
\mathbf{regret}(N)=\sum_{l=0}^{l_{N}-1}\mathbf{regret}_{l}(T_{l}-1)+\mathbf{regret}_{l_{N}}(k_{N}), \tag{8}
\]

where $(l_{N},k_{N})=G_{\mathcal{T}}(N)$.

We provide high probability regret bounds in §4. We remark that a regret bound of the form $\mathbf{regret}(N)/(N+1)\leq R$ (for some $R>0$) immediately implies that $\min_{l,k:\,B_{\mathcal{T}}(l,k)\leq N}F^{\star}-F(\pi_{\theta^{l,k}})\leq R$, where the latter is also a commonly adopted performance criterion in the literature (Agarwal et al. 2019; Wang et al. 2019).

3 Assumptions and REINFORCE gradients

3.1 Assumptions

Here we list a few fundamental assumptions that we require for our analysis.

Assumption 1 (Setting).

The regularization term is a log-barrier, i.e.,

\[
R(\theta)=\tfrac{1}{SA}\sum_{s\in\mathcal{S},a\in\mathcal{A}}\log(\pi_{\theta}(s,a)),
\]

and the policy is parametrized to be a soft-max, i.e., $\pi_{\theta}(s,a)=\frac{\exp(\theta_{s,a})}{\sum_{a^{\prime}\in\mathcal{A}}\exp(\theta_{s,a^{\prime}})}$, with $\Theta=\mathbf{R}^{SA}$.

The first assumption concerns the form of the policy parametrization and the regularization. Notice that the regularization term here can also be seen as a relative entropy/KL regularization (with a uniform reference policy) (Agarwal et al. 2019). Such regularization terms are also widely adopted in practice (although typically with variations) (Peters, Mülling, and Altun 2010; Schulman, Chen, and Abbeel 2017).

With Assumption 1, the post-processing function in Algorithm 3 can, for example, be realized by first calculating $\hat{\pi}=\epsilon_{\rm pp}{\bf 1}+(1-A\epsilon_{\rm pp})\pi_{\theta}$ and then returning $\theta^{\prime}$ with $\theta^{\prime}_{s,a}=\log\hat{\pi}_{s,a}+c_{s}$. Here $\bf 1$ is the all-ones vector and $c_{s}\in\mathbf{R}$ ($s=1,\dots,S$) are arbitrary real numbers.
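Under Assumption 1, the soft-max parametrization, the log-barrier regularizer and the PostProcess realization described above can be sketched as follows (illustrative Python; the function names are ours, and the constants $c_{s}$ are simply set to zero).

import numpy as np

def softmax_policy(theta):
    """pi_theta(s, a) = exp(theta_{s,a}) / sum_{a'} exp(theta_{s,a'}), row-wise over actions."""
    z = theta - theta.max(axis=1, keepdims=True)   # shift rows for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def log_barrier(theta):
    """R(theta) = (1/(S*A)) * sum_{s,a} log pi_theta(s, a)."""
    return np.log(softmax_policy(theta)).mean()

def post_process(theta, eps_pp):
    """Return theta' with pi_{theta'}(s, a) >= eps_pp: mix with the uniform policy, then take logs."""
    A = theta.shape[1]
    assert 0 < eps_pp <= 1.0 / A
    pi_hat = eps_pp + (1.0 - A * eps_pp) * softmax_policy(theta)
    return np.log(pi_hat)       # the additive constants c_s are taken to be zero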

Assumption 2 (Policy gradient estimator).

There exist constants $C,\,C_{1},\,C_{2},\,M_{1},\,M_{2}>0$ such that for all $l,\,k\geq 0$, we have $\|\widehat{\nabla}_{\theta}L_{\lambda^{l}}(\theta^{l,k})\|_{2}\leq C_{1}$ almost surely and

\[
\begin{split}
&\nabla_{\theta}L_{\lambda^{l}}(\theta^{l,k})^{T}\,\mathbf{E}_{l,k}\widehat{\nabla}_{\theta}L_{\lambda^{l}}(\theta^{l,k})\geq C_{2}\|\nabla_{\theta}L_{\lambda^{l}}(\theta^{l,k})\|_{2}^{2}-\delta_{l,k},\\
&\mathbf{E}_{l,k}\|\widehat{\nabla}_{\theta}L_{\lambda^{l}}(\theta^{l,k})\|_{2}^{2}\leq M_{1}+M_{2}\|\nabla_{\theta}L_{\lambda^{l}}(\theta^{l,k})\|_{2}^{2},
\end{split}
\]

where $\sum_{k=0}^{T_{l}-1}\delta_{l,k}^{2}\leq C$ for all $l\geq 0$. In addition, $H^{l,k}\geq\log_{1/\gamma}(k+1)$ for all $l,\,k\geq 0$.

The second assumption requires that the gradient estimates are almost surely bounded, nearly unbiased and satisfy a bounded second-order moment growth condition. This is a slight generalization of standard assumptions in the stochastic gradient descent literature (Bottou, Curtis, and Nocedal 2018). Additionally, we also require that the trajectory lengths $H^{l,k}$ grow at least logarithmically in $k$ to control the loss of rewards due to truncation. For notational simplicity, hereafter we omit the trajectory sampling (i.e., $s_{0}\sim\rho$, $a_{t}^{l,k}\sim\pi_{\theta^{l,k}}(\cdot|s_{t}^{l,k})$, $s_{t+1}^{l,k}\sim p(\cdot|s_{t}^{l,k},a_{t}^{l,k})$ for all $t\geq 0$) when we write $\mathbf{E}_{l,k}$.

Notice that Assumption 2 immediately holds if $\widehat{\nabla}_{\theta}L_{\lambda^{l}}(\theta^{l,k})$ is unbiased and has a bounded second-order moment. We have implicitly assumed that $L_{\lambda}$ is differentiable, which is justified by the following result:

Proposition 1 ((Agarwal et al. 2019, Lemma E.4)).

Under Assumption 1, $L_{\lambda}$ is strongly smooth with parameter $\beta_{\lambda}=\frac{8}{(1-\gamma)^{3}}+\frac{2\lambda}{S}$, i.e., $\|\nabla_{\theta}L_{\lambda}(\theta)-\nabla_{\theta}L_{\lambda}(\theta^{\prime})\|_{2}\leq\beta_{\lambda}\|\theta-\theta^{\prime}\|_{2}$ for any $\theta,\,\theta^{\prime}\in\mathbf{R}^{SA}$.

3.2 REINFORCE gradient estimations

Now we introduce REINFORCE gradient estimation with baselines, and specify the hyper-parameters under which the technical Assumption 2 holds when operating under the setting of Assumption 1.

REINFORCE gradient estimation with log-barrier regularization takes the following form:

\[
\begin{split}
\widehat{\nabla}_{\theta}L_{\lambda^{l}}(\theta^{l,k})=&\sum_{t=0}^{\lfloor\beta H^{l,k}\rfloor}\gamma^{t}\big(\widehat{Q}^{l,k}(s_{t}^{l,k},a_{t}^{l,k})-b(s_{t}^{l,k})\big)\nabla_{\theta}\log\pi_{\theta^{l,k}}(a_{t}^{l,k}|s_{t}^{l,k})\\
&+\tfrac{\lambda^{l}}{SA}\sum_{s\in\mathcal{S},a\in\mathcal{A}}\nabla_{\theta}\log\pi_{\theta^{l,k}}(a|s),
\end{split} \tag{9}
\]

where $\beta\in(0,1)$, $\widehat{Q}^{l,k}(s_{t}^{l,k},a_{t}^{l,k})=\sum_{t^{\prime}=t}^{H^{l,k}}\gamma^{t^{\prime}-t}r_{t^{\prime}}^{l,k}$, and the second term above corresponds to the gradient of the regularization term $\lambda^{l}R(\theta)$. Notice that here the outer summation only runs up to $\lfloor\beta H^{l,k}\rfloor$, which ensures that $\widehat{Q}^{l,k}(s_{t}^{l,k},a_{t}^{l,k})$ is sufficiently accurate. Here $b:\mathcal{S}\rightarrow\mathbf{R}$ is called the baseline, and is required to be independent of the trajectory $\tau^{l,k}$ (Agarwal, Jiang, and Kakade 2019, §4.1). The purpose of subtracting $b$ from the approximate $Q$-values is to (potentially) reduce the variance of the "plain" REINFORCE gradient estimation, which corresponds to the case $b=0$.
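The following sketch computes the estimate (9) in the tabular soft-max case from one sampled trajectory, assuming a constant baseline for simplicity; it reuses softmax_policy from the sketch in §3.1, and the closed form used for the regularization gradient, $\tfrac{\lambda^{l}}{SA}\sum_{s,a}\nabla_{\theta}\log\pi_{\theta}(a|s)=\tfrac{\lambda^{l}}{SA}(1-A\,\pi_{\theta})$ entry-wise, follows from the soft-max parametrization.

import numpy as np
# softmax_policy is the helper from the sketch in Section 3.1.

def grad_log_pi(pi, s, a):
    """Gradient of log pi_theta(a|s) w.r.t. theta for the soft-max parametrization:
    row s equals e_a - pi(s, .), all other rows are zero."""
    g = np.zeros_like(pi)
    g[s] = -pi[s]
    g[s, a] += 1.0
    return g

def reinforce_gradient(theta, traj, gamma, lam, beta=0.5, baseline=0.0):
    """Estimate (9): discounted sum of (Q-hat - b) * grad log pi over t <= floor(beta*H),
    plus the gradient of lam * R(theta). traj is a list of (s_t, a_t, r_t) tuples of length H+1."""
    S, A = theta.shape
    pi = softmax_policy(theta)
    H = len(traj) - 1
    rewards = np.array([r for (_, _, r) in traj])
    g = np.zeros_like(theta)
    for t in range(int(beta * H) + 1):
        s_t, a_t, _ = traj[t]
        # reward-to-go Q-hat(s_t, a_t) = sum_{t' >= t} gamma^{t'-t} r_{t'}
        q_hat = np.sum(rewards[t:] * gamma ** np.arange(H + 1 - t))
        g += (gamma ** t) * (q_hat - baseline) * grad_log_pi(pi, s_t, a_t)
    # gradient of lam * R(theta) = (lam/(S*A)) * sum_{s,a} grad log pi_theta(a|s)
    g += lam / (S * A) * (1.0 - A * pi)
    return g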

With this we have the following result, the proof of which can be found in the Appendix.

Lemma 2.

Suppose that Assumption 1 holds, $\beta\in(0,1)$, and that for all $l,\,k\geq 0$, $\lambda^{l}\leq\bar{\lambda}$ and

\[
H^{l,k}\geq\tfrac{2\log_{1/\gamma}\left(\frac{8(k+1)}{(1-\gamma)^{3}}\right)}{3\min\{\beta,1-\beta\}}\;\big(=\Theta(\log(k+1))\big). \tag{10}
\]

Assume in addition that $|b(s)|\leq B$ for any $s\in\mathcal{S}$, where $B>0$ is a constant. Then for the gradient estimation (9), Assumption 2 holds with

\[
\begin{split}
&C=16\Big(\tfrac{1}{(1-\gamma)^{2}}+\bar{\lambda}\Big)^{2},\quad C_{1}=\tfrac{2(1+B(1-\gamma))}{(1-\gamma)^{2}}+2\bar{\lambda},\\
&C_{2}=1,\quad M_{1}=\tfrac{32}{(1-\gamma)^{4}}+\bar{V}_{b},\quad M_{2}=2,
\end{split}
\]

and $\delta_{l,k}=\Big(\tfrac{2}{(1-\gamma)^{2}}+2\bar{\lambda}\Big)(k+1)^{-\frac{2}{3}}$ for all $l,\,k\geq 0$. Here $\bar{V}_{b}\in\Big[0,4\Big(\tfrac{1+B(1-\gamma)}{(1-\gamma)^{2}}+\bar{\lambda}\Big)^{2}\Big]$ is the uniform upper bound on the variances of policy gradient estimations of the form (9).

This result extends without modification to non-stationary baselines $b_{t}^{l,k}$, as long as each $b_{t}^{l,k}$ is independent of trajectory $\tau^{l,k}$ and $|b_{t}^{l,k}(s)|\leq B$ for any $t,\,l,\,k\geq 0$. Note that the explicit upper bound on $\bar{V}_{b}$ is pessimistic, and in practice $\bar{V}_{b}$ is usually much smaller than $\bar{V}_{0}$ with appropriate choices of baselines (e.g., the adaptive reinforcement baselines (Williams 1992; Zhao et al. 2011)), although the latter has a smaller upper bound as stated in Lemma 2.
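As one concrete (and purely illustrative) choice of such a baseline, the sketch below maintains a per-state exponential moving average of observed reward-to-go values, updated only between episodes so that independence from the current trajectory is preserved; the class name, decay factor and clipping constant are our own assumptions.

import numpy as np

class MovingAverageBaseline:
    """A per-state baseline b(s): an exponential moving average of reward-to-go values,
    updated only after each episode so that b is independent of the current trajectory."""

    def __init__(self, n_states, decay=0.9, b_max=10.0):
        self.b = np.zeros(n_states)
        self.decay = decay
        self.b_max = b_max                      # keeps |b(s)| <= B as assumed in Lemma 2

    def __call__(self, s):
        return self.b[s]

    def update(self, traj, gamma):
        """Update from a finished trajectory given as a list of (s_t, a_t, r_t) tuples."""
        H = len(traj) - 1
        rewards = np.array([r for (_, _, r) in traj])
        for t, (s, _, _) in enumerate(traj):
            q_hat = np.sum(rewards[t:] * gamma ** np.arange(H + 1 - t))
            self.b[s] = self.decay * self.b[s] + (1 - self.decay) * q_hat
        np.clip(self.b, -self.b_max, self.b_max, out=self.b)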

4 Main convergence results

4.1 Preliminary tools

We first present some preliminary tools for our analysis.

Non-convexity and control of “bad” episodes.

One of the key difficulties in applying policy gradient methods to solve an MDP problem to global optimality is that problem (3) is in general non-convex (Agarwal et al. 2019). Fortunately, we have the following result, which connects the gradient of the surrogate objective $L_{\lambda}$ with the global optimality gap of the original optimization problem (3).

Proposition 3 ((Agarwal et al. 2019, Theorem 5.3)).

Under Assumption 1, for any $\epsilon>0$, suppose that we have $\|\nabla_{\theta}L_{\lambda}(\theta)\|_{2}\leq\epsilon$ and that $\epsilon\leq\lambda/(2SA)$. Then $F^{\star}-F(\pi_{\theta})\leq\frac{2\lambda}{1-\gamma}\left\|\frac{d_{\rho}^{\pi^{\star}}}{\rho}\right\|_{\infty}$.

Here for any policy $\pi\in\Pi$, $d_{\rho}^{\pi}(s)=(1-\gamma)\sum_{t=0}^{\infty}\gamma^{t}\mathbf{Prob}_{\pi}(s_{t}=s|s_{0}\sim\rho)$ is the discounted state visitation distribution, where $\mathbf{Prob}_{\pi}(s_{t}=s|s_{0}\sim\rho)$ is the probability of arriving at $s$ in step $t$ starting from $s_{0}\sim\rho$ and following policy $\pi$ in $\mathcal{M}$. In addition, the division in $d_{\rho}^{\pi^{\star}}/\rho$ is component-wise.

Now, motivated by Proposition 3, when analyzing the regret up to episode $K$ in phase $l$, we define the following set of "bad episodes":

\[
I^{+}=\{k\in\{0,\dots,K\}\,|\,\|\nabla_{\theta}L_{\lambda^{l}}(\theta^{l,k})\|_{2}\geq\lambda^{l}/(2SA)\}.
\]

Then one can show that for any $\epsilon>0$, if we choose $\lambda^{l}=\frac{\epsilon(1-\gamma)}{2}$, we have $F^{\star}-F(\pi_{\theta^{l,k}})\leq\|d_{\rho}^{\pi^{\star}}/\rho\|_{\infty}\,\epsilon$ for any $k\in\{0,\dots,K\}\backslash I^{+}$, while $F^{\star}-F(\pi_{\theta^{l,k}})\leq 1/(1-\gamma)$ holds trivially for $k\in I^{+}$ due to the assumption that the rewards are between $0$ and $1$. We then establish a sub-linear (in $K$) bound on the size of $I^{+}$, which serves as the key stepping stone for the overall sub-linear regret bound. The details of these arguments can be found in the Appendix.

Doubling trick.

The second tool is a classical doubling trick that is commonly adopted in the design of online learning algorithms (Besson and Kaufmann 2018; Basei, Guo, and Hu 2020), which can be used to stitch together the regret over multiple learning phases in Algorithm 2.

Notice that Proposition 3 suggests that for any pre-specified tolerance $\epsilon$, one can select $\lambda$ proportional to $\epsilon$ and then run (stochastic) gradient ascent to drive $F^{\star}-F(\pi_{\theta})$ below the tolerance. To obtain the eventual convergence and regret bound in the long run, we apply the doubling trick, which specifies a growing phase length sequence with $T_{l+1}\approx 2T_{l}$ in Algorithm 2 and a suitably decaying sequence of regularization parameters $\{\lambda^{l}\}_{l=0}^{\infty}$.
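Concretely, the schedule used in Theorem 6 below can be generated as in this sketch: phase lengths $T_{l}=2^{l}$, tolerances $\epsilon^{l}=T_{l}^{-1/6}$, regularization $\lambda^{l}=\epsilon^{l}(1-\gamma)/2$, and a step-size constant taken (as one admissible choice) to be $1/(2\beta_{\lambda^{l}})$.

import math

def phase_schedule(l, gamma, S):
    """Hyper-parameters for phase l under the doubling trick of Theorem 6 (one admissible choice)."""
    T_l = 2 ** l                                             # T_{l+1} = 2 * T_l
    eps_l = T_l ** (-1.0 / 6.0)
    lam_l = eps_l * (1.0 - gamma) / 2.0
    beta_lam = 8.0 / (1.0 - gamma) ** 3 + 2.0 * lam_l / S    # smoothness constant (Proposition 1)
    C_alpha = 1.0 / (2.0 * beta_lam)                         # upper endpoint of the allowed range
    alpha = lambda k: C_alpha / (math.sqrt(k + 3) * math.log2(k + 3))
    return T_l, lam_l, alpha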

From high probability to almost sure convergence.

The last tool is the observation that an arbitrary anytime sub-linear high probability regret bound with logarithmic dependence on $1/\delta$ immediately leads to almost sure convergence of the average regret with a corresponding asymptotic rate. Although this observation seems to be informally well-known in the theoretical computer science community, we provide a compact formal statement below to keep the paper self-contained.

Lemma 4.

Suppose that for all $\delta\in(0,1)$, with probability at least $1-\delta$, for all $N\geq 0$, we have

\[
\mathbf{regret}(N)\leq d_{1}(N+c)^{d_{2}}(\log(N/\delta))^{d_{3}}+d_{4}(\log N)^{d_{5}} \tag{11}
\]

for some constants $c,\,d_{1},\,d_{3},\,d_{4},\,d_{5}\geq 0$ and $d_{2}\in[0,1)$. Then we also have

\[
\mathbf{Prob}\left(\text{$\exists$ $\bar{N}\in\mathbf{Z}_{+}$ such that $A_{N}$ holds for all $N\geq\bar{N}$}\right)=1,
\]

where the events $A_{N}=\{\mathbf{regret}(N)/(N+1)\leq(*)\}$, and

\[
(*)=d_{1}N^{-(1-d_{2})}\left(1+\tfrac{c}{N}\right)^{d_{2}}(3\log N)^{d_{3}}+\tfrac{d_{4}(\log N)^{d_{5}}}{N}.
\]

To put it another way, we have

\[
\lim_{N\rightarrow\infty}\mathbf{regret}(N)/(N+1)=0\quad\text{almost surely}
\]

with an asymptotic rate of $(*)$.

Notice that here we restrict the right-hand side of (11) to a rather specific form simply because our bounds below are all of this form. However, similar results hold for much more general forms of bounds.

4.2 Regret analysis

In this section, we establish the regret bound of Algorithm 2, when used with the REINFORCE gradient estimator from §3.2. We begin by bounding the regret of a single phase and then use the doubling trick to combine these into the overall regret bound.

Single phase analysis.

We begin by bounding the regret, defined in (7), of each phase in Algorithm 2. Note that a single phase in Algorithm 2 is exactly Algorithm 1 terminated in episode $T_{l}$, with $\lambda^{n}=\lambda^{l}$ for all $n\geq 0$ and $\theta^{0}=\theta^{l,0}$. Also notice that for a given phase $l\geq 0$, in order for Theorem 5 below to hold, we actually only need the conditions in Assumption 2 to be satisfied for this specific $l$.

Theorem 5.

Under Assumptions 1 and 2, for phase $l\geq 0$ suppose that we choose $\alpha^{l,k}=C_{l,\alpha}\frac{1}{\sqrt{k+3}\log_{2}(k+3)}$ for some $C_{l,\alpha}\in(0,C_{2}/(M_{2}\beta_{\lambda^{l}})]$. Then for any $\epsilon>0$, if we choose $\lambda^{l}=\frac{\epsilon(1-\gamma)}{2}$, then for all $\delta\in(0,1)$, with probability at least $1-\delta$, for any $K\in\{0,\dots,T_{l}-1\}$, we have

\[
\begin{split}
\mathbf{regret}_{l}(K)\leq&\ U_{1}\tfrac{\sqrt{K+1}\,\log_{2}(K+3)\sqrt{\log(2/\delta)}}{\epsilon^{2}}\\
&+(K+1)\Big\|\tfrac{d_{\rho}^{\pi^{\star}}}{\rho}\Big\|_{\infty}\epsilon+\tfrac{2\gamma}{1-\gamma}\log(K+3).
\end{split} \tag{12}
\]

Here the constant $U_{1}$ only depends on the underlying MDP $\mathcal{M}$, the phase initialization $\theta^{l,0}$ and the constants $C,\,C_{1},\,C_{2},\,M_{1},\,C_{l,\alpha},\,\lambda^{l}$.

The proof as well as a more formal statement of Theorem 5 with details of the constants (cf. Theorem 9) are deferred to the Appendix. Here the constant $\beta_{\lambda}$ is the smoothness constant from Proposition 1. We remark that when $\epsilon$ is fixed, the regret bound (12) can be seen as a sub-linear (in $K$ as $K\rightarrow\infty$) regret term plus an error term $(K+1)\epsilon+\frac{2\gamma}{1-\gamma}\log(K+3)$. Alternatively, one can interpret it as follows:

\[
\dfrac{\mathbf{regret}_{l}(K)}{K+1}\leq\ U_{1}\dfrac{\log_{2}(K+3)\sqrt{\log(2/\delta)}}{\sqrt{K+1}\,\epsilon^{2}}+\dfrac{2\gamma}{1-\gamma}\dfrac{\log(K+3)}{K+1}+\Big\|\tfrac{d_{\rho}^{\pi^{\star}}}{\rho}\Big\|_{\infty}\epsilon.
\]

Namely, the average regret in phase $l$ converges to a constant multiple of the pre-specified tolerance $\epsilon$ at a sub-linear rate (as $K\rightarrow\infty$).

Overall regret bound.

Now we stitch together the single phase regret bounds established above to obtain the overall regret bound of Algorithm 2, with the help of the doubling trick. This leads to the following theorem.

Theorem 6 (Regret for REINFORCE).

Under Assumption 1, suppose that for each $l\geq 0$ we choose $\alpha^{l,k}=C_{l,\alpha}\frac{1}{\sqrt{k+3}\log_{2}(k+3)}$, with $C_{l,\alpha}\in[1/(2\beta_{\bar{\lambda}}),1/(2\beta_{\lambda^{l}})]$ and $\bar{\lambda}=\frac{1-\gamma}{2}$, and choose $T_{l}=2^{l}$, $\epsilon^{l}=T_{l}^{-1/6}=2^{-l/6}$, $\lambda^{l}=\frac{\epsilon^{l}(1-\gamma)}{2}$ and $\epsilon_{\rm pp}=1/(2A)$. In addition, suppose that (9) is adopted to evaluate $\widehat{\nabla}_{\theta}L_{\lambda^{l}}(\theta^{l,k})$, with $\beta\in(0,1)$, $|b(s)|\leq B$ for any $s\in\mathcal{S}$ (where $B>0$ is a constant), and that (10) holds for $H^{l,k}$ for all $l,\,k\geq 0$. Then for any $\delta\in(0,1)$, with probability at least $1-\delta$, for any $N\geq 0$, we have

\[
\mathbf{regret}(N)=O\Big(\Big(\tfrac{S^{2}A^{2}}{(1-\gamma)^{7}}+\Big\|\tfrac{d_{\rho}^{\pi^{\star}}}{\rho}\Big\|_{\infty}\Big)N^{\frac{5}{6}}\big(\log\tfrac{N}{\delta}\big)^{\frac{5}{2}}\Big). \tag{13}
\]

In addition, we have

\[
\lim_{N\rightarrow\infty}\mathbf{regret}(N)/(N+1)=0\quad\text{almost surely} \tag{14}
\]

with asymptotic rate $O\Big(\Big(\frac{S^{2}A^{2}}{(1-\gamma)^{7}}+\Big\|\frac{d_{\rho}^{\pi^{\star}}}{\rho}\Big\|_{\infty}\Big)N^{-\frac{1}{6}}(\log N)^{\frac{5}{2}}\Big)$.

Note that the almost sure convergence (14) is immediately implied by the high probability bound (13) via Lemma 4. Here, for clarity, we have restricted the statement to the case where the REINFORCE gradient estimation from §3.2 is used. A more general counterpart can be found in Appendix B.3, from which Theorem 6 follows immediately. See also Appendix C for a more formal statement of the regret bound (cf. Corollary 11) for REINFORCE with detailed constants.

Notice that compared to the single-phase regret bound in (12), the overall regret bound in (13) no longer depends on a pre-specified tolerance $\epsilon>0$. This is due to the adaptivity of the regularization parameter sequence. Also notice that here we have followed the convention of the reinforcement learning literature and made all the problem-dependent quantities (e.g., $\gamma,\,S,\,A$, etc.) explicit in the big-$O$ notation.

One crucial difference between our regret bound and those in the existing literature on vanilla policy gradient methods in general MDP settings (which are sometimes not stated in the form of regret, but can be easily deduced from their proofs in those cases) is that the previous results either require exact and deterministic updates or contain a non-vanishing $\Theta(1/M^{p})$ term, with $M$ being the mini-batch size (of the trajectories) and $p>0$ being some exponent (with a typical value of $1/2$). By removing such non-vanishing terms, we obtain the first sub-linear regret bound for model-free vanilla policy gradient methods with finite mini-batch sizes.

5 Extension to mini-batch updates

We now consider extending our previous results to the mini-batch setting by modifying Algorithm 2 as follows. First, in each inner iteration, instead of sampling only one trajectory in line 5, we sample $M\geq 1$ independent trajectories $\tau^{l,k}_{1},\dots,\tau^{l,k}_{M}$ from $\mathcal{M}$ following policy $\pi_{\theta^{l,k}}$ and then compute an approximate gradient $\widehat{\nabla}_{\theta}^{(i)}L_{\lambda^{l}}(\theta^{l,k})$ ($i=1,\dots,M$) using each of these $M$ trajectories. We then modify the update in line 6 to

\[
\theta^{l,k+1}=\theta^{l,k}+\alpha^{l,k}\frac{1}{M}\sum_{i=1}^{M}\widehat{\nabla}_{\theta}^{(i)}L_{\lambda^{l}}(\theta^{l,k}).
\]

See Algorithm 4 in Appendix D for a formal description of the modified algorithm.
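In code, the mini-batch modification amounts to averaging $M$ independent single-trajectory estimates before each update, as in this sketch; sample_traj is a hypothetical sampler and reinforce_gradient refers to the illustrative helper sketched in §3.2.

import numpy as np

def minibatch_update(theta, sample_traj, alpha_lk, lam_l, gamma, M, beta=0.5, baseline=0.0):
    """One inner iteration of the mini-batch variant: average M independent REINFORCE estimates.

    sample_traj(theta) is assumed to sample one trajectory under pi_theta (with the required
    horizon) and return it as a list of (s_t, a_t, r_t) tuples.
    """
    grads = [reinforce_gradient(theta, sample_traj(theta), gamma, lam_l, beta, baseline)
             for _ in range(M)]
    return theta + alpha_lk * np.mean(grads, axis=0)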

Regret with mini-batches.

Notice that since each inner iteration (in Algorithm 4) now consists of $M$ episodes, we need to slightly modify the definition of the regret up to episode $N$ ($N\geq 0$) as follows:

\[
\begin{split}
\mathbf{regret}(N;M)=&\sum_{\left\{(l,k)\,|\,B_{\mathcal{T}}(l,k)\leq\lfloor\frac{N}{M}\rfloor-1\right\}}M\big(F^{\star}-\hat{F}^{l,k}(\pi_{\theta^{l,k}})\big)\\
&+\left(N-M\left\lfloor\tfrac{N}{M}\right\rfloor\right)\big(F^{\star}-\hat{F}^{l_{N,M},k_{N,M}}(\pi_{\theta^{l_{N,M},k_{N,M}}})\big),
\end{split} \tag{15}
\]

where $(l_{N,M},k_{N,M})=G_{\mathcal{T}}(\lfloor N/M\rfloor)$ and $\hat{F}^{l,k}(\pi_{\theta^{l,k}})$ is the same as in (6). The above definition accounts for the fact that each of the $M$ episodes in an inner iteration/step $(l,k)$ corresponds to the same iterate $\theta^{l,k}$ and hence has the same contribution to the regret. The second term on the right-hand side accounts for the contribution of the (remaining) $N-M\lfloor N/M\rfloor$ episodes (among a total of $M$ episodes) in inner iteration/step $(l_{N,M},k_{N,M})$.

Then the following regret bound can be established.

Corollary 7 (Regret for mini-batch REINFORCE).

Under Assumption 1, suppose that for each $l\geq 0$ we choose $\alpha^{l,k}=C_{l,\alpha}\frac{1}{\sqrt{k+3}\log_{2}(k+3)}$, with $C_{l,\alpha}\in[1/(2\beta_{\bar{\lambda}}),1/(2\beta_{\lambda^{l}})]$ and $\bar{\lambda}=\frac{1-\gamma}{2}$, and choose $T_{l}=2^{l}$, $\epsilon^{l}=T_{l}^{-1/6}=2^{-l/6}$, $\lambda^{l}=\frac{\epsilon^{l}(1-\gamma)}{2}$ and $\epsilon_{\rm pp}=1/(2A)$. In addition, suppose that the assumptions in Lemma 12 hold (note that Assumption 1 and $\lambda^{l}\leq\bar{\lambda}$ already hold automatically under the other assumptions). Then for any $\delta\in(0,1)$, with probability at least $1-\delta$, jointly for all episodes $N\geq 0$, we have (for the mini-batch Algorithm 4)

\[
\begin{split}
\mathbf{regret}(N;M)=O\Big(&\Big(\tfrac{S^{2}A^{2}}{(1-\gamma)^{7}}+\Big\|\tfrac{d_{\rho}^{\pi^{\star}}}{\rho}\Big\|_{\infty}\Big)\\
&\times(M^{\frac{1}{6}}+M^{-\frac{5}{6}})(N+M)^{\frac{5}{6}}(\log(N/\delta))^{\frac{5}{2}}+\tfrac{M(\log N)^{2}}{1-\gamma}\Big).
\end{split}
\]

In addition, we also have

\[
\lim_{N\rightarrow\infty}\mathbf{regret}(N;M)/(N+1)=0\quad\text{almost surely}
\]

with an asymptotic rate of

\[
\begin{split}
O\Big(&\Big(\tfrac{S^{2}A^{2}}{(1-\gamma)^{7}}+\Big\|\tfrac{d_{\rho}^{\pi^{\star}}}{\rho}\Big\|_{\infty}\Big)\\
&\times(M^{\frac{1}{6}}+M^{-\frac{5}{6}})N^{-\frac{1}{6}}\left(1+\tfrac{M}{N}\right)^{\frac{5}{6}}(\log N)^{\frac{5}{2}}+\tfrac{M(\log N)^{2}}{(1-\gamma)N}\Big).
\end{split}
\]

Again, we note that the almost sure convergence above is directly implied by the high probability bound via Lemma 4. The proof and a more formal statement of this corollary (cf. Corollary 13) can be found in Appendix D. In particular, when $M=1$, the bound above reduces to (13). In addition, we can see that there might be a trade-off between the terms $M^{1/6}$ and $M^{-5/6}$. The intuition behind this is a trade-off between lower variance with larger batch sizes and more frequent updates with smaller batch sizes.

6 Conclusion and open problems

In this work, we establish the global convergence rates of practical policy gradient algorithms with a fixed size mini-batch of trajectories combined with REINFORCE gradient estimation.

Although in §4 and §5 we only instantiate the bounds for the REINFORCE gradient estimators, we note that our general results (in particular, Theorem 10 in Appendix B.3) can be easily applied to other gradient estimators (e.g., actor-critic and state-action visitation measure based estimators) as well, as long as one can verify the existence of the constants in Assumption 2 in a similar way to Lemma 2. In addition, one can also easily derive sample complexity results as by-products of our analysis. In fact, our proof of Theorem 5 immediately implies an $\tilde{O}(1/\epsilon^{4})$ sample complexity bound (for Algorithm 1 with REINFORCE gradient estimators and a constant regularization parameter) for any pre-specified tolerance $\epsilon>0$, where we use $\tilde{O}$ to denote the big-$O$ notation with logarithmic terms suppressed. We have focused only on regret in this paper mainly for clarity.

It is also relatively straightforward to extend our results to finite-horizon non-stationary settings, in which the soft-max policy parametrization has dimension $SAH$ and different policy gradient estimators can be adopted (without trajectory truncation), with $H$ being the horizon of each episode. In this case, it is also easy to rewrite the regret bound as a function of the total number of time steps $T\leq HN$, where $N$ is the total number of episodes. Other straightforward extensions include refined convergence to stationary points (in both the almost sure and high probability senses, with no requirement of large batch sizes), and inexact convergence results when $\delta_{l,k}$ (cf. Assumption 2) is not square summable (e.g., when $H^{l,k}$ is fixed or does not grow sufficiently fast).

There are also several open problems that may be resolved by combining the techniques introduced in this paper with existing results in the literature. Firstly, it would be desirable to remove the "exploration" assumption that the initial distribution $\rho$ is component-wise positive. This may be achieved by combining our results with the policy cover technique in (Agarwal et al. 2020) or the optimistic bonus tricks in (Cai et al. 2019; Efroni et al. 2020). Secondly, the bounds in our paper are likely far from optimal (i.e., sharp). Hence it would be desirable to either refine our analysis or apply our techniques to accelerated policy gradient methods (e.g., IS-MBPG (Huang et al. 2020)) to obtain better global convergence rates and/or last-iterate convergence. Thirdly, it would be very interesting to see whether global convergence results still hold for REINFORCE when the relative entropy regularization term used in this paper is replaced with the entropy regularization term commonly adopted in practice. The answer is affirmative when exact gradients are available (Mei et al. 2020; Cen et al. 2020), but it remains unknown how these results might be generalized to the stochastic settings of our paper. We conjecture that entropy regularization leads to better global convergence rates and, since it is uniformly bounded, can remove the need for the PostProcess steps in Algorithm 2. Finally, one may also consider relaxing the uniform bound assumption on the rewards $r$ to a sub-Gaussian assumption, introducing function approximation, and extending our results to natural policy gradient and actor-critic methods as well as more modern policy gradient methods like DPG, PPO and TRPO.

Acknowledgment

We would like to thank Anran Hu for pointing out a mistake in the proof of an early draft of this paper. We thank Guillermo Angeris, Shane Barratt, Haotian Gu, Xin Guo, Yusuke Kikuchi, Bennet Meyers, Xinyue Shen, Mahan Tajrobehkar, Jonathan Tuck, Jiaming Wang and Xiaoli Wei for their feedback on some preliminary results in this paper. We thank Junyu Zhang for several detailed and fruitful discussions of the draft. We also thank the anonymous (meta-)reviewers for the great comments and suggestions. Jongho Kim is supported by Samsung Scholarship.

References

  • Abbasi-Yadkori et al. (2019) Abbasi-Yadkori, Y.; Bartlett, P.; Bhatia, K.; Lazic, N.; Szepesvari, C.; and Weisz, G. 2019. Politex: Regret bounds for policy iteration using expert prediction. In International Conference on Machine Learning, 3692–3702.
  • Agarwal et al. (2020) Agarwal, A.; Henaff, M.; Kakade, S.; and Sun, W. 2020. PC-PG: Policy Cover Directed Exploration for Provable Policy Gradient Learning. arXiv preprint arXiv:2007.08459 .
  • Agarwal, Jiang, and Kakade (2019) Agarwal, A.; Jiang, N.; and Kakade, S. 2019. Reinforcement Learning: Theory and Algorithms. Technical report, Department of Computer Science, University of Washington.
  • Agarwal et al. (2019) Agarwal, A.; Kakade, S.; Lee, J.; and Mahajan, G. 2019. Optimality and approximation with policy gradient methods in Markov decision processes. arXiv preprint arXiv:1908.00261 .
  • Basei, Guo, and Hu (2020) Basei, M.; Guo, X.; and Hu, A. 2020. Linear Quadratic Reinforcement Learning: Sublinear Regret in the Episodic Continuous-Time Framework. arXiv preprint arXiv:2006.15316 .
  • Baxter and Bartlett (2001) Baxter, J.; and Bartlett, P. 2001. Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research 15: 319–350.
  • Bertsekas (2017) Bertsekas, D. 2017. Dynamic programming and optimal control, volume II. Athena scientific Belmont, MA, 4th edition.
  • Besson and Kaufmann (2018) Besson, L.; and Kaufmann, E. 2018. What doubling tricks can and can’t do for multi-armed bandits. arXiv preprint arXiv:1803.06971 .
  • Bhandari and Russo (2019) Bhandari, J.; and Russo, D. 2019. Global optimality guarantees for policy gradient methods. arXiv preprint arXiv:1906.01786 .
  • Bhandari and Russo (2020) Bhandari, J.; and Russo, D. 2020. A Note on the Linear Convergence of Policy Gradient Methods. arXiv preprint arXiv:2007.11120 .
  • Bottou, Curtis, and Nocedal (2018) Bottou, L.; Curtis, F.; and Nocedal, J. 2018. Optimization methods for large-scale machine learning. SIAM Review 60(2): 223–311.
  • Cai et al. (2019) Cai, Q.; Yang, Z.; Jin, C.; and Wang, Z. 2019. Provably efficient exploration in policy optimization. arXiv preprint arXiv:1912.05830 .
  • Carmona, Laurière, and Tan (2019) Carmona, R.; Laurière, M.; and Tan, Z. 2019. Linear-quadratic mean-field reinforcement learning: convergence of policy gradient methods. arXiv preprint arXiv:1910.04295 .
  • Cen et al. (2020) Cen, S.; Cheng, C.; Chen, Y.; Wei, Y.; and Chi, Y. 2020. Fast Global Convergence of Natural Policy Gradient Methods with Entropy Regularization. arXiv preprint arXiv:2007.06558 .
  • Efroni et al. (2020) Efroni, Y.; Shani, L.; Rosenberg, A.; and Mannor, S. 2020. Optimistic Policy Optimization with Bandit Feedback. arXiv preprint arXiv:2002.08243 .
  • Even-Dar, Kakade, and Mansour (2004) Even-Dar, E.; Kakade, S.; and Mansour, Y. 2004. Experts in a Markov decision process. Advances in neural information processing systems 17: 401–408.
  • Even-Dar, Kakade, and Mansour (2009) Even-Dar, E.; Kakade, S.; and Mansour, Y. 2009. Online Markov decision processes. Mathematics of Operations Research 34(3): 726–736.
  • Even-Dar and Mansour (2003) Even-Dar, E.; and Mansour, Y. 2003. Learning rates for Q-learning. Journal of machine learning Research 5(Dec): 1–25.
  • Fazel et al. (2018) Fazel, M.; Ge, R.; Kakade, S.; and Mesbahi, M. 2018. Global convergence of policy gradient methods for the linear quadratic regulator. arXiv preprint arXiv:1801.05039 .
  • Fei et al. (2020) Fei, Y.; Yang, Z.; Wang, Z.; and Xie, Q. 2020. Dynamic regret of policy optimization in non-stationary environments. arXiv preprint arXiv:2007.00148 .
  • Fu et al. (2019) Fu, Z.; Yang, Z.; Chen, Y.; and Wang, Z. 2019. Actor-critic provably finds Nash equilibria of linear-quadratic mean-field games. arXiv preprint arXiv:1910.07498 .
  • Fu, Yang, and Wang (2020) Fu, Z.; Yang, Z.; and Wang, Z. 2020. Single-Timescale Actor-Critic Provably Finds Globally Optimal Policy. arXiv preprint arXiv:2008.00483 .
  • Glynn (1986) Glynn, P. 1986. Stochastic approximation for Monte Carlo optimization. In Proceedings of the 18th conference on Winter simulation, 356–365.
  • Gu et al. (2015) Gu, S.; Levine, S.; Sutskever, I.; and Mnih, A. 2015. MuProp: Unbiased backpropagation for stochastic neural networks. arXiv preprint arXiv:1511.05176 .
  • Guo et al. (2020) Guo, X.; Hu, A.; Xu, R.; and Zhang, J. 2020. A General Framework for Learning Mean-Field Games. arXiv preprint arXiv:2003.06069 .
  • Guu et al. (2017) Guu, K.; Pasupat, P.; Liu, E.; and Liang, P. 2017. From language to programs: Bridging reinforcement learning and maximum marginal likelihood. arXiv preprint arXiv:1704.07926 .
  • Huang et al. (2020) Huang, F.; Gao, S.; Pei, J.; and Huang, H. 2020. Momentum-Based Policy Gradient Methods. arXiv preprint arXiv:2007.06680 .
  • Jaksch, Ortner, and Auer (2010) Jaksch, T.; Ortner, R.; and Auer, P. 2010. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research 11(Apr): 1563–1600.
  • Jin et al. (2018) Jin, C.; Allen-Zhu, Z.; Bubeck, S.; and Jordan, M. 2018. Is Q-learning provably efficient? In Advances in Neural Information Processing Systems, 4863–4873.
  • Jin, Schmitt, and Wen (2020) Jin, Z.; Schmitt, J.; and Wen, Z. 2020. On the Analysis of Model-free Methods for the Linear Quadratic Regulator. arXiv preprint arXiv:2007.03861 .
  • Johnson et al. (2017) Johnson, J.; Hariharan, B.; Van Der Maaten, L.; Hoffman, J.; Fei-Fei, L.; Lawrence Zitnick, C.; and Girshick, R. 2017. Inferring and executing programs for visual reasoning. In Proceedings of the IEEE International Conference on Computer Vision, 2989–2998.
  • Kakade et al. (2003) Kakade, S.; et al. 2003. On the sample complexity of reinforcement learning. Ph.D. thesis, University of London London, England.
  • Klenke (2013) Klenke, A. 2013. Probability Theory: A Comprehensive Course. Springer Science & Business Media.
  • Konda and Tsitsiklis (2003) Konda, V.; and Tsitsiklis, J. 2003. On actor-critic algorithms. SIAM journal on Control and Optimization 42(4): 1143–1166.
  • Kool, van Hoof, and Welling (2018) Kool, W.; van Hoof, H.; and Welling, M. 2018. Attention, learn to solve routing problems! arXiv preprint arXiv:1803.08475 .
  • Kool, van Hoof, and Welling (2020) Kool, W.; van Hoof, H.; and Welling, M. 2020. Estimating gradients for discrete random variables by sampling without replacement. arXiv preprint arXiv:2002.06043 .
  • Liu et al. (2019) Liu, B.; Cai, Q.; Yang, Z.; and Wang, Z. 2019. Neural proximal/trust region policy optimization attains globally optimal policy. arXiv preprint arXiv:1906.10306 .
  • Ma et al. (2016) Ma, Y.; Zhao, T.; Hatano, K.; and Sugiyama, M. 2016. An online policy gradient algorithm for Markov decision processes with continuous states and actions. Neural Computation 28(3): 563–593.
  • Malik et al. (2018) Malik, D.; Pananjady, A.; Bhatia, K.; Khamaru, K.; Bartlett, P.; and Wainwright, M. 2018. Derivative-free methods for policy optimization: Guarantees for linear quadratic systems. arXiv preprint arXiv:1812.08305 .
  • Marbach and Tsitsiklis (2001) Marbach, P.; and Tsitsiklis, J. 2001. Simulation-based optimization of Markov reward processes. IEEE Transactions on Automatic Control 46(2): 191–209.
  • Mazumdar et al. (2019) Mazumdar, E.; Ratliff, L.; Jordan, M.; and Sastry, S. 2019. Policy-Gradient Algorithms Have No Guarantees of Convergence in Linear Quadratic Games. arXiv preprint arXiv:1907.03712 .
  • Mei et al. (2020) Mei, J.; Xiao, C.; Szepesvari, C.; and Schuurmans, D. 2020. On the Global Convergence Rates of Softmax Policy Gradient Methods. arXiv preprint arXiv:2005.06392 .
  • Mnih and Gregor (2014) Mnih, A.; and Gregor, K. 2014. Neural variational inference and learning in belief networks. arXiv preprint arXiv:1402.0030 .
  • Mnih et al. (2016) Mnih, V.; Badia, A.; Mirza, M.; Graves, A.; Lillicrap, T.; Harley, T.; Silver, D.; and Kavukcuoglu, K. 2016. Asynchronous methods for deep reinforcement learning. In International conference on machine learning, 1928–1937.
  • Neu et al. (2010) Neu, G.; Antos, A.; György, A.; and Szepesvári, C. 2010. Online Markov decision processes under bandit feedback. In Advances in Neural Information Processing Systems, 1804–1812.
  • Neu, Jonsson, and Gómez (2017) Neu, G.; Jonsson, A.; and Gómez, V. 2017. A unified view of entropy-regularized Markov decision processes. arXiv preprint arXiv:1705.07798 .
  • Papini et al. (2018) Papini, M.; Binaghi, D.; Canonaco, G.; Pirotta, M.; and Restelli, M. 2018. Stochastic variance-reduced policy gradient. arXiv preprint arXiv:1806.05618 .
  • Peters, Mülling, and Altun (2010) Peters, J.; Mülling, K.; and Altun, Y. 2010. Relative entropy policy search. In AAAI, volume 10, 1607–1612. Atlanta.
  • Pham et al. (2020) Pham, N.; Nguyen, L.; Phan, D.; Nguyen, P.; van Dijk, M.; and Tran-Dinh, Q. 2020. A Hybrid Stochastic Policy Gradient Algorithm for Reinforcement Learning. arXiv preprint arXiv:2003.00430 .
  • Rennie et al. (2017) Rennie, S.; Marcheret, E.; Mroueh, Y.; Ross, J.; and Goel, V. 2017. Self-critical sequence training for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 7008–7024.
  • Ryu and Boyd (2016) Ryu, E.; and Boyd, S. 2016. Primer on monotone operator methods. Appl. Comput. Math 15(1): 3–43.
  • Schulman, Chen, and Abbeel (2017) Schulman, J.; Chen, X.; and Abbeel, P. 2017. Equivalence between policy gradients and soft Q-learning. arXiv preprint arXiv:1704.06440 .
  • Schulman et al. (2015) Schulman, J.; Levine, S.; Abbeel, P.; Jordan, M.; and Moritz, P. 2015. Trust region policy optimization. In International conference on machine learning, 1889–1897.
  • Schulman et al. (2017) Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; and Klimov, O. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 .
  • Shani, Efroni, and Mannor (2019) Shani, L.; Efroni, Y.; and Mannor, S. 2019. Adaptive trust region policy optimization: Global convergence and faster rates for regularized MDPs. arXiv preprint arXiv:1909.02769 .
  • Shen et al. (2019) Shen, Z.; Ribeiro, A.; Hassani, H.; Qian, H.; and Mi, C. 2019. Hessian aided policy gradient. In International Conference on Machine Learning, 5729–5738.
  • Silver et al. (2014) Silver, D.; Lever, G.; Heess, N.; Degris, T.; Wierstra, D.; and Riedmiller, M. 2014. Deterministic Policy Gradient Algorithms. In Proceedings of the 31st International Conference on Machine Learning, volume 32(1), 387–395.
  • Sutton and Barto (2018) Sutton, R.; and Barto, A. 2018. Reinforcement Learning: An Introduction. MIT press.
  • Sutton et al. (2000) Sutton, R.; McAllester, D.; Singh, S.; and Mansour, Y. 2000. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, 1057–1063.
  • Wang et al. (2019) Wang, L.; Cai, Q.; Yang, Z.; and Wang, Z. 2019. Neural policy gradient methods: Global optimality and rates of convergence. arXiv preprint arXiv:1909.01150 .
  • Williams (1992) Williams, R. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning 8(3-4): 229–256.
  • Xiong et al. (2020) Xiong, H.; Xu, T.; Liang, Y.; and Zhang, W. 2020. Non-asymptotic Convergence of Adam-type Reinforcement Learning Algorithms under Markovian Sampling. arXiv preprint arXiv:2002.06286 .
  • Xu, Gao, and Gu (2019) Xu, P.; Gao, F.; and Gu, Q. 2019. Sample efficient policy gradient methods with recursive variance reduction. arXiv preprint arXiv:1909.08610 .
  • Xu, Gao, and Gu (2020) Xu, P.; Gao, F.; and Gu, Q. 2020. An improved convergence analysis of stochastic variance-reduced policy gradient. In Uncertainty in Artificial Intelligence, 541–551. PMLR.
  • Yi et al. (2018) Yi, K.; Wu, J.; Gan, C.; Torralba, A.; Kohli, P.; and Tenenbaum, J. 2018. Neural-symbolic VQA: Disentangling reasoning from vision and language understanding. In Advances in neural information processing systems, 1031–1042.
  • Yuan et al. (2020) Yuan, H.; Lian, X.; Liu, J.; and Zhou, Y. 2020. Stochastic Recursive Momentum for Policy Gradient Methods. arXiv preprint arXiv:2003.04302 .
  • Zhang et al. (2020) Zhang, J.; Koppel, A.; Bedi, A.; Szepesvari, C.; and Wang, M. 2020. Variational Policy Gradient Method for Reinforcement Learning with General Utilities. arXiv preprint arXiv:2007.02151 .
  • Zhang et al. (2019) Zhang, K.; Koppel, A.; Zhu, H.; and Başar, T. 2019. Global convergence of policy gradient methods to (almost) locally optimal policies. arXiv preprint arXiv:1906.08383 .
  • Zhang, Yang, and Basar (2019) Zhang, K.; Yang, Z.; and Basar, T. 2019. Policy optimization provably converges to Nash equilibria in zero-sum linear quadratic games. In Advances in Neural Information Processing Systems, 11602–11614.
  • Zhao, Li, and Wen (2019) Zhao, M.; Li, Y.; and Wen, Z. 2019. A Stochastic Trust-Region Framework for Policy Optimization. arXiv preprint arXiv:1911.11640 .
  • Zhao et al. (2011) Zhao, T.; Hachiya, H.; Niu, G.; and Sugiyama, M. 2011. Analysis and improvement of policy gradient estimation. In Advances in Neural Information Processing Systems, 262–270.
  • Zoph and Le (2016) Zoph, B.; and Le, Q. 2016. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578 .

Appendix

In this appendix, we provide detailed proofs and formal statements of the results in the main text. For notational simplicity, we sometimes abbreviate “almost sure” as “a.s.” or omit it altogether whenever it is clear from the context. Also notice that, as is implicitly assumed throughout the literature on episodic reinforcement learning (see, e.g., (Marbach and Tsitsiklis 2001)), given the current policy, the sampled trajectory is conditionally independent of all previous policies and trajectories.

Big-OO notation.

We first clarify the precise definitions of the Big-OO notation used in our statements. Let f:𝐙+×RdR+f:\mathbf{Z}_{+}\times{\mbox{\bf R}}^{d}\rightarrow{\mbox{\bf R}}_{+} be a function of the total number of episodes NN and all problem and algorithm dependent quantities (written jointly as a vector) URdU\in{\mbox{\bf R}}^{d}. Similarly, let g:𝐙+×RdR+g:\mathbf{Z}_{+}\times{\mbox{\bf R}}^{d}\rightarrow{\mbox{\bf R}}_{+} be a function of NN and some (subset of) problem and algorithm dependent quantities U0Rd0U_{0}\in{\mbox{\bf R}}^{d_{0}}, with d0dd_{0}\leq d. Then we write f(N;U)=O(g(N;U0))f(N;U)=O(g(N;U_{0})) to indicate that there exist constants W>0W>0 and N0𝐙+N_{0}\in\mathbf{Z}_{+} (independent of NN and U0U_{0}), such that f(N;U)Wg(N;U0)f(N;U)\leq Wg(N;U_{0}) for all NN0N\geq N_{0}.
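As a concrete illustration of this convention (using the form of the REINFORCE regret bound established in Appendix C below), take f(N;U)=\mathbf{regret}(N) and let U_{0} collect (S,A,1/(1-\gamma),\|d_{\rho}^{\pi^{\star}}/\rho\|_{\infty},1/\delta). Then the statement

\mathbf{regret}(N)=O\left(\left(\frac{S^{2}A^{2}}{(1-\gamma)^{7}}+\left\|\frac{d_{\rho}^{\pi^{\star}}}{\rho}\right\|_{\infty}\right)N^{\frac{5}{6}}(\log(N/\delta))^{\frac{5}{2}}\right)

means precisely that there exist an absolute constant W>0 and an integer N_{0}\in\mathbf{Z}_{+}, both independent of N and U_{0}, such that

\mathbf{regret}(N)\leq W\left(\frac{S^{2}A^{2}}{(1-\gamma)^{7}}+\left\|\frac{d_{\rho}^{\pi^{\star}}}{\rho}\right\|_{\infty}\right)N^{\frac{5}{6}}(\log(N/\delta))^{\frac{5}{2}}\quad\text{for all }N\geq N_{0}.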

Appendix A Proofs for REINFORCE gradient estimations

A.1 Proof of Lemma 2 (with b=0b=0)

Proof.

We verify the three groups of conditions in Assumption 2 in order. For simplicity of exposition, we first restrict to the case b=0b=0, i.e., when no baseline is incorporated.

Gradient estimation boundedness. Firstly, notice that since r(s,a)[0,1]r(s,a)\in[0,1], we have Q^l,k(stl,k,atl,k)1/(1γ)\hat{Q}^{l,k}(s_{t}^{l,k},a_{t}^{l,k})\leq 1/(1-\gamma). And by the soft-max parametrization in Assumption 1, we have

θlogπθl,k(a|s)=𝟏s,a𝐄aπθl,k(|s)𝟏s,a,\nabla_{\theta}\log\pi_{\theta^{l,k}}(a|s)={\bf 1}_{s,a}-\mathbf{E}_{a^{\prime}\sim\pi_{\theta^{l,k}}(\cdot|s)}{\bf 1}_{s,a^{\prime}},

where the vector 𝟏s,aRSA{\bf 1}_{s,a}\in{\mbox{\bf R}}^{SA} has all zero entries apart from the one corresponding to the state-action pair (s,a)(s,a). Hence θlogπθl,k(a|s)22\|\nabla_{\theta}\log\pi_{\theta^{l,k}}(a|s)\|_{2}\leq 2 for any (s,a)𝒮×𝒜(s,a)\in\mathcal{S}\times\mathcal{A}, and we see that

^θLλl(θl,k)211γt=0γtθlogπθl,k(atl,k|stl,k)2+λ¯SAs𝒮,a𝒜θlogπθl,k(a|s)22/(1γ)2+2λ¯,a.s.\begin{split}\left\|\widehat{\nabla}_{\theta}L_{\lambda^{l}}(\theta^{l,k})\right\|_{2}&\leq\dfrac{1}{1-\gamma}\sum_{t=0}^{\infty}\gamma^{t}\|\nabla_{\theta}\log\pi_{\theta^{l,k}}(a_{t}^{l,k}|s_{t}^{l,k})\|_{2}+\dfrac{\bar{\lambda}}{SA}\sum_{s\in\mathcal{S},a\in\mathcal{A}}\|\nabla_{\theta}\log\pi_{\theta^{l,k}}(a|s)\|_{2}\\ &\leq 2/(1-\gamma)^{2}+2\bar{\lambda},\quad\text{a.s.}\end{split} (16)

Hence we can take C1=2/(1γ)2+2λ¯C_{1}=2/(1-\gamma)^{2}+2\bar{\lambda}.
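To make the bound above concrete, the following is a minimal numerical sketch (our own illustration, not code from the paper; the array shapes and function names are ours) of the soft-max score function and of the fact that its 2-norm never exceeds 2.

import numpy as np

def softmax_policy(theta, s):
    # theta has shape (S, A); row s holds the logits of pi_theta(.|s).
    logits = theta[s]
    z = np.exp(logits - logits.max())
    return z / z.sum()

def grad_log_pi(theta, s, a):
    # Returns the score vector 1_{s,a} - E_{a'~pi_theta(.|s)} 1_{s,a'} as a vector in R^{SA}.
    S, A = theta.shape
    g = np.zeros((S, A))
    g[s, a] += 1.0
    g[s, :] -= softmax_policy(theta, s)
    return g.reshape(-1)

theta = np.random.randn(4, 3)          # S = 4 states, A = 3 actions (illustrative)
g = grad_log_pi(theta, s=2, a=1)
assert np.linalg.norm(g, 2) <= 2.0     # matches the bound on the score norm above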

Validation of near-unbiasedness. Secondly, notice that

𝐄l,k^θLλl(θl,k)=𝐄l,k(t=0βHl,kγt𝐄l,k(Q^l,k(stl,k,atl,k)|stl,k,atl,k)θlogπθl,k(atl,k|stl,k))+λlSAs𝒮,a𝒜θlogπθl,k(a|s)=J1+J2+J3,\begin{split}\mathbf{E}_{l,k}\widehat{\nabla}_{\theta}L_{\lambda^{l}}(\theta^{l,k})=&\,\mathbf{E}_{l,k}\left(\sum_{t=0}^{\lfloor\beta H^{l,k}\rfloor}\gamma^{t}\mathbf{E}_{l,k}\left(\hat{Q}^{l,k}(s_{t}^{l,k},a_{t}^{l,k})\Big{|}s_{t}^{l,k},a_{t}^{l,k}\right)\nabla_{\theta}\log\pi_{\theta^{l,k}}(a_{t}^{l,k}|s_{t}^{l,k})\right)\\ &+\dfrac{\lambda^{l}}{SA}\sum\nolimits_{s\in\mathcal{S},a\in\mathcal{A}}\nabla_{\theta}\log\pi_{\theta^{l,k}}(a|s)=\,J_{1}+J_{2}+J_{3},\end{split}

where

J1=𝐄l,k(t=0γt𝐄l,k(t=tγttrtl,k|stl,k,atl,k)θlogπθl,k(atl,k|stl,k))+λlSAs𝒮,a𝒜θlogπθl,k(a|s),\begin{split}J_{1}=&\,\mathbf{E}_{l,k}\left(\sum_{t=0}^{\infty}\gamma^{t}\mathbf{E}_{l,k}\left(\sum\nolimits_{t^{\prime}=t}^{\infty}\gamma^{t^{\prime}-t}r_{t^{\prime}}^{l,k}\Big{|}s_{t}^{l,k},a_{t}^{l,k}\right)\nabla_{\theta}\log\pi_{\theta^{l,k}}(a_{t}^{l,k}|s_{t}^{l,k})\right)\\ &+\dfrac{\lambda^{l}}{SA}\sum\nolimits_{s\in\mathcal{S},a\in\mathcal{A}}\nabla_{\theta}\log\pi_{\theta^{l,k}}(a|s),\end{split}
J2=𝐄l,k(t=βHl,k+1γt𝐄l,k(t=tγttrtl,k|stl,k,atl,k)θlogπθl,k(atl,k|stl,k)),J3=𝐄l,k(t=0βHl,kγt𝐄l,k(t=Hl,k+1γttrtl,k|stl,k,atl,k)θlogπθl,k(atl,k|stl,k)).\begin{split}J_{2}&=-\mathbf{E}_{l,k}\left(\sum_{t=\lfloor\beta H^{l,k}\rfloor+1}^{\infty}\gamma^{t}\mathbf{E}_{l,k}\left(\sum\nolimits_{t^{\prime}=t}^{\infty}\gamma^{t^{\prime}-t}r_{t^{\prime}}^{l,k}\Big{|}s_{t}^{l,k},a_{t}^{l,k}\right)\nabla_{\theta}\log\pi_{\theta^{l,k}}(a_{t}^{l,k}|s_{t}^{l,k})\right),\\ J_{3}&=-\mathbf{E}_{l,k}\left(\sum_{t=0}^{\lfloor\beta H^{l,k}\rfloor}\gamma^{t}\mathbf{E}_{l,k}\left(\sum\nolimits_{t^{\prime}=H^{l,k}+1}^{\infty}\gamma^{t^{\prime}-t}r_{t^{\prime}}^{l,k}\Big{|}s_{t}^{l,k},a_{t}^{l,k}\right)\nabla_{\theta}\log\pi_{\theta^{l,k}}(a_{t}^{l,k}|s_{t}^{l,k})\right).\end{split}

By (Agarwal, Jiang, and Kakade 2019, Theorem 4.6), we have

J1=𝐄l,k(t=0γtQπθl,k(stl,k,atl,k)θlogπθl,k(atl,k|stl,k))+λlSAs𝒮,a𝒜θlogπθk(a|s)=θLλl(θl,k).\begin{split}J_{1}=\mathbf{E}_{l,k}&\left(\sum_{t=0}^{\infty}\gamma^{t}Q^{\pi_{\theta^{l,k}}}(s_{t}^{l,k},a_{t}^{l,k})\nabla_{\theta}\log\pi_{\theta^{l,k}}(a_{t}^{l,k}|s_{t}^{l,k})\right)\\ &+\dfrac{\lambda^{l}}{SA}\sum_{s\in\mathcal{S},a\in\mathcal{A}}\nabla_{\theta}\log\pi_{\theta^{k}}(a|s)\\ =\nabla_{\theta}&L_{\lambda^{l}}(\theta^{l,k}).\end{split} (17)

Here for any πΠ\pi\in\Pi,

Qπ(s,a)=𝐄(t=0γtr(st,at)|s0=s,a0=a),Q^{\pi}(s,a)=\mathbf{E}\left(\sum\nolimits_{t=0}^{\infty}\gamma^{t}r(s_{t},a_{t})\Big{|}s_{0}=s,\,a_{0}=a\right),

with atπ(st,),st+1p(|st,at),t>0\,a_{t}\sim\pi(s_{t},\cdot),\,s_{t+1}\sim p(\cdot|s_{t},a_{t}),\,\forall t>0.

Since r(s,a)[0,1]r(s,a)\in[0,1], we also have

J22\displaystyle\|J_{2}\|_{2} 11γt=βHl,k+1γtθlogπθl,k(atl,k|stl,k)2\displaystyle\leq\dfrac{1}{1-\gamma}\sum_{t=\lfloor\beta H^{l,k}\rfloor+1}^{\infty}\gamma^{t}\|\nabla_{\theta}\log\pi_{\theta^{l,k}}(a_{t}^{l,k}|s_{t}^{l,k})\|_{2}
2γβHl,k/(1γ)2,\displaystyle\leq 2\gamma^{\beta H^{l,k}}/(1-\gamma)^{2},

and similarly

J32t=0βHl,kγtt=Hl,k+1γttθlogπθk(atl,k|stl,k)2t=0βHl,kγt×2γ(1β)Hl,k/(1γ)2γ(1β)Hl,k/(1γ)2.\begin{split}\|J_{3}\|_{2}&\leq\sum_{t=0}^{\lfloor\beta H^{l,k}\rfloor}\gamma^{t}\sum_{t^{\prime}=H^{l,k}+1}^{\infty}\gamma^{t^{\prime}-t}\|\nabla_{\theta}\log\pi_{\theta^{k}}(a_{t}^{l,k}|s_{t}^{l,k})\|_{2}\\ &\leq\sum_{t=0}^{\lfloor\beta H^{l,k}\rfloor}\gamma^{t}\times 2\gamma^{(1-\beta)H^{l,k}}/(1-\gamma)\\ &\leq 2\gamma^{(1-\beta)H^{l,k}}/(1-\gamma)^{2}.\end{split}

Hence for any η0>0\eta_{0}>0, by taking

Hl,k1+η0(2+η0)min{β,1β}log1/γ(4(2+η0)/(1+η0)(k+1)(1γ)(4+2η0)/(1+η0))(=Θ(log(k+1))),H^{l,k}\geq\frac{1+\eta_{0}}{(2+\eta_{0})\min\{\beta,1-\beta\}}\log_{1/\gamma}\left(\frac{4^{(2+\eta_{0})/(1+\eta_{0})}(k+1)}{(1-\gamma)^{(4+2\eta_{0})/(1+\eta_{0})}}\right)(=\Theta(\log(k+1))),

we have Hl,klog1/γ(k+1)H^{l,k}\geq\log_{1/\gamma}(k+1), and that

𝐄l,k^θLλl(θl,k)θLλl(θl,k)24γmin{β,1β}Hl,k(1γ)2(k+1)1+η02+η0,\left\|\mathbf{E}_{l,k}\widehat{\nabla}_{\theta}L_{\lambda^{l}}(\theta^{l,k})-\nabla_{\theta}L_{\lambda^{l}}(\theta^{l,k})\right\|_{2}\leq\dfrac{4\gamma^{\min\{\beta,1-\beta\}H^{l,k}}}{(1-\gamma)^{2}}\leq(k+1)^{-\frac{1+\eta_{0}}{2+\eta_{0}}}, (18)

which implies that

θLλl(θl,k)T𝐄l,k^θLλl(θl,k)=θLλl(θl,k)22+(𝐄l,k^θLλl(θl,k)θLλl(θl,k))TθLλl(θl,k)θLλl(θl,k)22θLλl(θl,k)2(k+1)1+η02+η0θLλl(θl,k)22(2(1γ)2+2λ¯)(k+1)1+η02+η0,\begin{split}\nabla_{\theta}&L_{\lambda^{l}}(\theta^{l,k})^{T}\mathbf{E}_{l,k}\widehat{\nabla}_{\theta}L_{\lambda^{l}}(\theta^{l,k})\\ &=\|\nabla_{\theta}L_{\lambda^{l}}(\theta^{l,k})\|_{2}^{2}+\left(\mathbf{E}_{l,k}\widehat{\nabla}_{\theta}L_{\lambda^{l}}(\theta^{l,k})-\nabla_{\theta}L_{\lambda^{l}}(\theta^{l,k})\right)^{T}\nabla_{\theta}L_{\lambda^{l}}(\theta^{l,k})\\ &\geq\|\nabla_{\theta}L_{\lambda^{l}}(\theta^{l,k})\|_{2}^{2}-\|\nabla_{\theta}L_{\lambda^{l}}(\theta^{l,k})\|_{2}(k+1)^{-\frac{1+\eta_{0}}{2+\eta_{0}}}\\ &\geq\|\nabla_{\theta}L_{\lambda^{l}}(\theta^{l,k})\|_{2}^{2}-\left(\dfrac{2}{(1-\gamma)^{2}}+2\bar{\lambda}\right)(k+1)^{-\frac{1+\eta_{0}}{2+\eta_{0}}},\end{split}

where the last two steps use the Cauchy–Schwarz inequality, (18), and the fact that, by (17),

θLλl(θl,k)2t=0γt𝐄l,k(Qπθl,k(stl,k,atl,k)θlogπθl,k(atl,k|stl,k)2)+λ¯SAs𝒮,a𝒜θlogπθl,k(a|s)22/(1γ)2+2λ¯.\begin{split}\|\nabla_{\theta}&L_{\lambda^{l}}(\theta^{l,k})\|_{2}\\ &\leq\sum_{t=0}^{\infty}\gamma^{t}\mathbf{E}_{l,k}\left(Q^{\pi_{\theta}^{l,k}}(s_{t}^{l,k},a_{t}^{l,k})\left\|\nabla_{\theta}\log\pi_{\theta^{l,k}}(a_{t}^{l,k}|s_{t}^{l,k})\right\|_{2}\right)+\dfrac{\bar{\lambda}}{SA}\sum_{s\in\mathcal{S},a\in\mathcal{A}}\|\nabla_{\theta}\log\pi_{\theta^{l,k}}(a|s)\|_{2}\\ &\leq 2/(1-\gamma)^{2}+2\bar{\lambda}.\end{split}

Hence we can take C2=1C_{2}=1 and δl,k=(2(1γ)2+2λ¯)(k+1)1+η02+η0\delta_{l,k}=\left(\frac{2}{(1-\gamma)^{2}}+2\bar{\lambda}\right)(k+1)^{-\frac{1+\eta_{0}}{2+\eta_{0}}}. Thus we have

k=0Tl1δl,k2\displaystyle\sum_{k=0}^{T_{l}-1}\delta_{l,k}^{2} =(2(1γ)2+2λ¯)2k=0(k+1)2+2η02+η0\displaystyle=\left(\frac{2}{(1-\gamma)^{2}}+2\bar{\lambda}\right)^{2}\sum_{k=0}^{\infty}(k+1)^{-\frac{2+2\eta_{0}}{2+\eta_{0}}}
8(1(1γ)2+λ¯)2(1+1η0),\displaystyle\leq 8\left(\frac{1}{(1-\gamma)^{2}}+\bar{\lambda}\right)^{2}\left(1+\dfrac{1}{\eta_{0}}\right),

and hence we can take C=8(1(1γ)2+λ¯)2(1+1η0)C=8\left(\frac{1}{(1-\gamma)^{2}}+\bar{\lambda}\right)^{2}\left(1+\frac{1}{\eta_{0}}\right). Notice that for notational simplicity, we have taken η0=1\eta_{0}=1 in the statement of the proposition.
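As a numerical sanity check on the horizon choice above and the resulting bias bound (18), the following small script (ours, not part of the proof; the values of γ, β and η₀ are arbitrary illustrative choices) computes the smallest integer horizon allowed by the displayed condition and verifies (18).

import math

def horizon(k, gamma, beta, eta0=1.0):
    # Smallest integer H^{l,k} satisfying the displayed lower bound.
    m = min(beta, 1.0 - beta)
    coeff = (1.0 + eta0) / ((2.0 + eta0) * m)
    arg = 4.0 ** ((2.0 + eta0) / (1.0 + eta0)) * (k + 1) \
        / (1.0 - gamma) ** ((4.0 + 2.0 * eta0) / (1.0 + eta0))
    return math.ceil(coeff * math.log(arg) / math.log(1.0 / gamma))

gamma, beta, eta0 = 0.9, 0.5, 1.0      # illustrative values only
for k in [0, 10, 1000]:
    H = horizon(k, gamma, beta, eta0)
    m = min(beta, 1.0 - beta)
    bias = 4.0 * gamma ** (m * H) / (1.0 - gamma) ** 2
    assert H >= math.log(k + 1) / math.log(1.0 / gamma)        # H^{l,k} >= log_{1/gamma}(k+1)
    assert bias <= (k + 1) ** (-(1.0 + eta0) / (2.0 + eta0))   # inequality (18)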

Validation of bounded second-order moment growth. Finally, we bound the second-order moment of the policy gradient. In the following, for a random vector X=(X1,,Xn)RnX=(X_{1},\dots,X_{n})\in{\mbox{\bf R}}^{n}, we define 𝐕𝐚𝐫X=i=1n𝐯𝐚𝐫Xi\mathop{\bf Var}X=\sum_{i=1}^{n}\mathop{\bf var}X_{i}, and similarly 𝐕𝐚𝐫l,kX=i=1n𝐯𝐚𝐫l,kXi\mathop{\bf Var}_{l,k}X=\sum_{i=1}^{n}\mathop{\bf var}_{l,k}X_{i}, where 𝐯𝐚𝐫l,k\mathop{\bf var}_{l,k} denotes the conditional variance given the (l,k)(l,k)-th iteration θl,k\theta^{l,k}. Now define the constant V¯\bar{V} as the uniform upper bound on the variance of the policy gradient vector, i.e.,

V¯=supH0,θΘ,λ[0,λ¯]𝐕𝐚𝐫(t=0βHγtQ^(st,at)θlogπθ(at|st)+λSAs𝒮,a𝒜θlogπθ(a|s)),\bar{V}=\sup_{H\geq 0,\,\theta\in\Theta,\,\lambda\in[0,\bar{\lambda}]}\mathop{\bf Var}\left(\sum_{t=0}^{\lfloor\beta H\rfloor}\gamma^{t}\widehat{Q}(s_{t},a_{t})\nabla_{\theta}\log\pi_{\theta}(a_{t}|s_{t})+\dfrac{\lambda}{SA}\sum_{s\in\mathcal{S},a\in\mathcal{A}}\nabla_{\theta}\log\pi_{\theta}(a|s)\right),

where τ=(s0,a0,r0,,sH,aH,rH)\tau=(s_{0},a_{0},r_{0},\dots,s_{H},a_{H},r_{H}) is sampled from \mathcal{M} following policy πθ\pi_{\theta}, and Q^(st,at)=t=tHγttrt\hat{Q}(s_{t},a_{t})=\sum_{t^{\prime}=t}^{H}\gamma^{t^{\prime}-t}r_{t^{\prime}}.

Then we have 𝐕𝐚𝐫l,k^θLλl(θl,k)V¯\mathop{\bf Var}_{l,k}\widehat{\nabla}_{\theta}L_{\lambda^{l}}(\theta^{l,k})\leq\bar{V} for any l,k0l,\,k\geq 0 by definition. In addition, since for any random vector XRnX\in{\mbox{\bf R}}^{n},

𝐕𝐚𝐫Xi=1n𝐄Xi2=𝐄X22,\mathop{\bf Var}X\leq\sum_{i=1}^{n}\mathbf{E}X_{i}^{2}=\mathbf{E}\|X\|_{2}^{2},

we have by the same argument as (16) that

V¯\displaystyle\bar{V} 𝐄t=0βHγtQ^(st,at)θlogπθ(at|st)+λSAs𝒮,a𝒜θlogπθ(a|s)22\displaystyle\leq\mathbf{E}\left\|\sum_{t=0}^{\lfloor\beta H\rfloor}\gamma^{t}\widehat{Q}(s_{t},a_{t})\nabla_{\theta}\log\pi_{\theta}(a_{t}|s_{t})+\dfrac{\lambda}{SA}\sum_{s\in\mathcal{S},a\in\mathcal{A}}\nabla_{\theta}\log\pi_{\theta}(a|s)\right\|_{2}^{2}
4(1+λ¯(1γ)2)2(1γ)4.\displaystyle\leq\dfrac{4(1+\bar{\lambda}(1-\gamma)^{2})^{2}}{(1-\gamma)^{4}}.

Finally, since for any random vector XRnX\in{\mbox{\bf R}}^{n},

𝐄l,kX2=𝐄l,ki=1nXi2=i=1n((𝐄l,kXi)2+𝐯𝐚𝐫l,kXi)=𝐄l,kX22+𝐕𝐚𝐫l,kX,\mathbf{E}_{l,k}\|X\|^{2}=\mathbf{E}_{l,k}\sum\nolimits_{i=1}^{n}X_{i}^{2}=\sum\nolimits_{i=1}^{n}\left((\mathbf{E}_{l,k}X_{i})^{2}+\mathop{\bf var}\nolimits_{l,k}X_{i}\right)=\|\mathbf{E}_{l,k}X\|_{2}^{2}+\mathop{\bf Var}\nolimits_{l,k}X,

we have

𝐄l,k^θLλl(θl,k)22J1+J2+J322+V¯2J122+2(J22+J32)2+V¯2θLλl(θl,k)22+32γ2min{β,1β}Hk(1γ)4+V¯2θLλl(θl,k)22+32(1γ)4+V¯,\begin{split}\mathbf{E}_{l,k}\left\|\widehat{\nabla}_{\theta}L_{\lambda^{l}}(\theta^{l,k})\right\|_{2}^{2}&\leq\|J_{1}+J_{2}+J_{3}\|_{2}^{2}+\bar{V}\\ &\leq 2\|J_{1}\|_{2}^{2}+2(\|J_{2}\|_{2}+\|J_{3}\|_{2})^{2}+\bar{V}\\ &\leq 2\|\nabla_{\theta}L_{\lambda^{l}}(\theta^{l,k})\|_{2}^{2}+\dfrac{32\gamma^{2\min\{\beta,1-\beta\}H^{k}}}{(1-\gamma)^{4}}+\bar{V}\\ &\leq 2\|\nabla_{\theta}L_{\lambda^{l}}(\theta^{l,k})\|_{2}^{2}+\dfrac{32}{(1-\gamma)^{4}}+\bar{V},\end{split}

and hence we can take M2=2M_{2}=2 and M1=32/(1γ)4+V¯M_{1}=32/(1-\gamma)^{4}+\bar{V}. This completes our proof. ∎

A.2 Proof of Lemma 2

Proof.

The proof is nearly identical to the case when b=0b=0 above. Hence we only outline the proof while highlighting the differences.

Firstly, similar to (16), we have

^θLλl(θl,k)2(11γ+B)21γ+2λ¯a.s.,\left\|\widehat{\nabla}_{\theta}L_{\lambda^{l}}(\theta^{l,k})\right\|_{2}\leq\left(\frac{1}{1-\gamma}+B\right)\frac{2}{1-\gamma}+2\bar{\lambda}\quad\text{a.s.,}

and hence we can take C1=2+2B(1γ)(1γ)2+2λ¯C_{1}=\frac{2+2B(1-\gamma)}{(1-\gamma)^{2}}+2\bar{\lambda}.

Secondly, by the proof of (Agarwal, Jiang, and Kakade 2019, Lemma 4.10), we have

𝐄l,k(t=0βHl,kγtb(stl,k)θlogπθl,k(atl,k|stl,k))=0.\begin{split}\mathbf{E}_{l,k}\left(\sum_{t=0}^{\lfloor\beta H^{l,k}\rfloor}\gamma^{t}b(s_{t}^{l,k})\nabla_{\theta}\log\pi_{\theta^{l,k}}(a_{t}^{l,k}|s_{t}^{l,k})\right)=0.\end{split} (19)

Hence 𝐄l,k^θLλl(θl,k)\mathbf{E}_{l,k}\widehat{\nabla}_{\theta}L_{\lambda^{l}}(\theta^{l,k}) is the same as in the b=0b=0 case above, and we can take

C2=1C=16(1(1γ)2+λ¯)2,δl,k=(2(1γ)2+2λ¯)(k+1)23Hl,k3log1/γ(8(k+1)(1γ)3)2min{β,1β}.\begin{split}&\text{$C_{2}=1$,\quad$C=16\left(\frac{1}{(1-\gamma)^{2}}+\bar{\lambda}\right)^{2}$,}\\ &\text{$\delta_{l,k}=\left(\frac{2}{(1-\gamma)^{2}}+2\bar{\lambda}\right)(k+1)^{-\frac{2}{3}}$, \quad$H^{l,k}\geq\frac{3\log_{1/\gamma}\left(\frac{8(k+1)}{(1-\gamma)^{3}}\right)}{2\min\{\beta,1-\beta\}}$.}\end{split}

Finally, by definition, V¯b\bar{V}_{b} can be written explicitly as

V¯b=supH0,θΘ,λ[0,λ¯]𝐕𝐚𝐫(t=0βHγt(Q^(st,at)b(st))θlogπθ(at|st)+λSAs𝒮,a𝒜θlogπθ(a|s)),\bar{V}_{b}=\sup_{H\geq 0,\,\theta\in\Theta,\,\lambda\in[0,\bar{\lambda}]}\mathop{\bf Var}\left(\sum_{t=0}^{\lfloor\beta H\rfloor}\gamma^{t}(\widehat{Q}(s_{t},a_{t})-b(s_{t}))\nabla_{\theta}\log\pi_{\theta}(a_{t}|s_{t})+\dfrac{\lambda}{SA}\sum_{s\in\mathcal{S},a\in\mathcal{A}}\nabla_{\theta}\log\pi_{\theta}(a|s)\right),

where τ=(s0,a0,r0,,sH,aH,rH)\tau=(s_{0},a_{0},r_{0},\dots,s_{H},a_{H},r_{H}) is sampled from \mathcal{M} following policy πθ\pi_{\theta}, and Q^(st,at)=t=tHγttrt\hat{Q}(s_{t},a_{t})=\sum_{t^{\prime}=t}^{H}\gamma^{t^{\prime}-t}r_{t^{\prime}}.

Hence, similarly to the bound on V¯\bar{V} in the proof for b=0b=0 above, we have V¯b4(1+B(1γ)(1γ)2+λ¯)2\bar{V}_{b}\leq 4\left(\frac{1+B(1-\gamma)}{(1-\gamma)^{2}}+\bar{\lambda}\right)^{2}, M2=2M_{2}=2 and M1=32/(1γ)4+V¯bM_{1}=32/(1-\gamma)^{4}+\bar{V}_{b}. ∎
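For concreteness, below is a hedged Python sketch (ours; the trajectory format and helper names are our own assumptions, not the paper's code) of the truncated REINFORCE estimator with a baseline analyzed above: it accumulates the terms γ^t (Q̂(s_t,a_t) − b(s_t)) ∇_θ log π_θ(a_t|s_t) for t up to ⌊βH⌋ and adds the log-barrier term (λ/(SA)) Σ_{s,a} ∇_θ log π_θ(a|s).

import numpy as np

def grad_log_pi(theta, s, a):
    # Soft-max score: 1_{s,a} - E_{a'~pi_theta(.|s)} 1_{s,a'}, flattened to R^{SA}.
    S, A = theta.shape
    logits = theta[s]
    pi_s = np.exp(logits - logits.max()); pi_s /= pi_s.sum()
    g = np.zeros((S, A)); g[s, a] += 1.0; g[s, :] -= pi_s
    return g.reshape(-1)

def reinforce_gradient(theta, traj, gamma, beta, lam, baseline=lambda s: 0.0):
    # traj is a list of (s_t, a_t, r_t) tuples of length H + 1.
    S, A = theta.shape
    H = len(traj) - 1
    g = np.zeros(S * A)
    for t in range(int(np.floor(beta * H)) + 1):
        s_t, a_t, _ = traj[t]
        q_hat = sum(gamma ** (tp - t) * traj[tp][2] for tp in range(t, H + 1))
        g += gamma ** t * (q_hat - baseline(s_t)) * grad_log_pi(theta, s_t, a_t)
    for s in range(S):                 # log-barrier regularization term
        for a in range(A):
            g += lam / (S * A) * grad_log_pi(theta, s, a)
    return g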

Appendix B Proofs for convergence analysis

Firstly, we state a simple result from elementary analysis, which will be used repeatedly in our proof below.

Lemma 8.

Let xk=1k+3log2(k+3)x_{k}=\frac{1}{\sqrt{k+3}\log_{2}(k+3)} for k0k\geq 0. Then we have

k=K1K2xkK2K1+1K2+3log2(K2+3),k=0xk4k=0xk21.\begin{split}&\sum_{k=K_{1}}^{K_{2}}x_{k}\geq\dfrac{K_{2}-K_{1}+1}{\sqrt{K_{2}+3}\log_{2}(K_{2}+3)},\quad\sum_{k=0}^{\infty}x_{k}^{4}\leq\sum_{k=0}^{\infty}x_{k}^{2}\leq 1.\end{split}
Proof.

The first inequality follows immediately from the fact that xkx_{k} is monotonically decreasing in kk. The second inequality follows by noticing that xk<1x_{k}<1 for any k0k\geq 0, and that

k=0xk201(x+2)(log2(x+2))2𝑑x=ln2<1.\sum_{k=0}^{\infty}x_{k}^{2}\leq\int_{0}^{\infty}\dfrac{1}{(x+2)(\log_{2}(x+2))^{2}}dx=\ln 2<1.

This completes the proof. ∎
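The following quick numerical check (ours, not part of the proof; the truncation length and the choice of K₁, K₂ are arbitrary) evaluates truncations of the sums in Lemma 8 and confirms both displayed inequalities.

import math

x = [1.0 / (math.sqrt(k + 3) * math.log2(k + 3)) for k in range(200000)]
assert sum(v ** 4 for v in x) <= sum(v ** 2 for v in x) <= 1.0     # second chain of inequalities
K1, K2 = 10, 100
assert sum(x[K1:K2 + 1]) >= (K2 - K1 + 1) / (math.sqrt(K2 + 3) * math.log2(K2 + 3))  # first inequality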

B.1 Proof of Lemma 4

Proof.

The proof is a direct application of the well-known Borel-Cantelli lemma (Klenke 2013). Let δN=1/N2\delta_{N}=1/N^{2} and define the events {A¯N}N0\{\bar{A}_{N}\}_{N\geq 0} as

A¯N={𝐫𝐞𝐠𝐫𝐞𝐭(N)>d1(N+c)d2(log(N/δN))d3+d4(logN)d5}.\bar{A}_{N}=\{\mathbf{regret}(N)>d_{1}(N+c)^{d_{2}}(\log(N/\delta_{N}))^{d_{3}}+d_{4}(\log N)^{d_{5}}\}.

Then 𝐏𝐫𝐨𝐛(A¯N)δN\mathbf{Prob}(\bar{A}_{N})\leq\delta_{N}, so that N=1𝐏𝐫𝐨𝐛(A¯N)N=11/N2<\sum_{N=1}^{\infty}\mathbf{Prob}(\bar{A}_{N})\leq\sum_{N=1}^{\infty}1/N^{2}<\infty. Hence, by the Borel–Cantelli lemma, we have

𝐏𝐫𝐨𝐛(A¯N occurs infinitely often)=0.\mathbf{Prob}(\bar{A}_{N}\text{ occurs infinitely often})=0.

Finally, by noticing that the complement of A¯N\bar{A}_{N} is a subset of ANA_{N}, the proof is complete. ∎

B.2 Proof of Theorem 5

Theorem 9 (Formal statement of Theorem 5).

Under Assumptions 1 and 2, for phase l0l\geq 0 suppose that we choose αl,k=Cl,α1k+3log2(k+3)\alpha^{l,k}=C_{l,\alpha}\frac{1}{\sqrt{k+3}\log_{2}(k+3)} for some Cl,α(0,C2/(M2βλl)]C_{l,\alpha}\in(0,C_{2}/(M_{2}\beta_{\lambda^{l}})]. Then for any ϵ>0\epsilon>0, if we choose λl=ϵ(1γ)2\lambda^{l}=\frac{\epsilon(1-\gamma)}{2}, then for any δ(0,1)\delta\in(0,1), with probability at least 1δ1-\delta, for any K{0,,Tl1}K\in\{0,\dots,T_{l}-1\}, we have

𝐫𝐞𝐠𝐫𝐞𝐭l(K)4(Dl+2Cllog(2/δ))(1γ)C2Elϵ2K+1log2(K+3)+ϵ(K+1)dρπρ+γ+γlog(K+1)1γ.\begin{split}\mathbf{regret}_{l}(K)\leq&\,\dfrac{4(D_{l}+\sqrt{2C_{l}\log(2/\delta)})}{(1-\gamma)C_{2}E_{l}\epsilon^{2}}\sqrt{K+1}\log_{2}(K+3)+\epsilon(K+1)\left\|\dfrac{d_{\rho}^{\pi^{\star}}}{\rho}\right\|_{\infty}\\ &+\frac{\gamma+\gamma\log(K+1)}{1-\gamma}.\end{split} (20)

Here El=Cl,α(1γ)216S2A2E_{l}=\frac{C_{l,\alpha}(1-\gamma)^{2}}{16S^{2}A^{2}}, and

Cl=32C12Cl,α2(1(1γ)2+λl)2+βλl2C14Cl,α42,Dl=CCl,α2+βλlM1Cl,α2+FLλl(θl,0).\begin{split}C_{l}&=32C_{1}^{2}C_{l,\alpha}^{2}\left(\dfrac{1}{(1-\gamma)^{2}}+\lambda^{l}\right)^{2}+\dfrac{\beta_{\lambda^{l}}^{2}C_{1}^{4}C_{l,\alpha}^{4}}{2},\\ D_{l}&=CC_{l,\alpha}^{2}+\beta_{\lambda^{l}}M_{1}C_{l,\alpha}^{2}+F^{\star}-L_{\lambda^{l}}(\theta^{l,0}).\end{split}
Proof.

The proof consists of two parts. In the first part, we establish an upper bound on the weighted sum of the gradient norms of the iterates in the current phase. The second part then uses this bound to establish an upper bound on the phase regret.

Bounding the weighted sum of gradient norms.

By Proposition 1 and an equivalent characterization of strong smoothness (cf. (Ryu and Boyd 2016, Appendix)), we have

Lλl(θl,k+1)(Lλl(θl,k))θLλl(θl,k)T(θl,k+1θl,k)+βλl2θl,k+1θl,k22=αl,kθLλl(θl,k)T^θLλl(θl,k)+βλl(αl,k)22^θLλl(θl,k)22Yl,k.\begin{split}-L_{\lambda^{l}}(\theta^{l,k+1})-(-L_{\lambda^{l}}(\theta^{l,k}))&\leq-\nabla_{\theta}L_{\lambda^{l}}(\theta^{l,k})^{T}(\theta^{l,k+1}-\theta^{l,k})+\dfrac{\beta_{\lambda^{l}}}{2}\|\theta^{l,k+1}-\theta^{l,k}\|_{2}^{2}\\ &=\underbrace{-\alpha^{l,k}\nabla_{\theta}L_{\lambda^{l}}(\theta^{l,k})^{T}\widehat{\nabla}_{\theta}L_{\lambda^{l}}(\theta^{l,k})+\dfrac{\beta_{\lambda^{l}}(\alpha^{l,k})^{2}}{2}\|\widehat{\nabla}_{\theta}L_{\lambda^{l}}(\theta^{l,k})\|_{2}^{2}}_{Y_{l,k}}.\end{split}

Let Zl,k=Yl,k𝐄l,k[Yl,k]Z_{l,k}=Y_{l,k}-\mathbf{E}_{l,k}[Y_{l,k}]. Then the above inequality implies that

Lλl(θl,k)Lλl(θl,k+1)αl,kθLλl(θl,k)T𝐄l,k^θLλl(θl,k)+βλl(αl,k)22𝐄l,k^θLλl(θl,k)22+Zl,kαl,k(C2θLλl(θl,k)22δl,k)+βλl(αl,k)22(M1+M2θLλl(θl,k)22)+Zl,k=αl,k(C2M2βλlαl,k/2)θLλl(θl,k)22+αl,kδl,k+βλlM1(αl,k)22+Zl,kC2αl,k2θLλl(θl,k)22+αl,kδl,k+βλlM1(αl,k)22+Zl,k.\begin{split}L_{\lambda^{l}}&(\theta^{l,k})-L_{\lambda^{l}}(\theta^{l,k+1})\\ \leq&-\alpha^{l,k}\nabla_{\theta}L_{\lambda^{l}}(\theta^{l,k})^{T}\mathbf{E}_{l,k}\widehat{\nabla}_{\theta}L_{\lambda^{l}}(\theta^{l,k})+\dfrac{\beta_{\lambda^{l}}(\alpha^{l,k})^{2}}{2}\mathbf{E}_{l,k}\|\widehat{\nabla}_{\theta}L_{\lambda^{l}}(\theta^{l,k})\|_{2}^{2}+Z_{l,k}\\ \leq&-\alpha^{l,k}\left(C_{2}\|\nabla_{\theta}L_{\lambda^{l}}(\theta^{l,k})\|_{2}^{2}-\delta_{l,k}\right)+\dfrac{\beta_{\lambda^{l}}(\alpha^{l,k})^{2}}{2}\left(M_{1}+M_{2}\|\nabla_{\theta}L_{\lambda^{l}}(\theta^{l,k})\|_{2}^{2}\right)+Z_{l,k}\\ =&-\alpha^{l,k}(C_{2}-M_{2}\beta_{\lambda^{l}}\alpha^{l,k}/2)\|\nabla_{\theta}L_{\lambda^{l}}(\theta^{l,k})\|_{2}^{2}+\alpha^{l,k}\delta_{l,k}+\dfrac{\beta_{\lambda^{l}}M_{1}(\alpha^{l,k})^{2}}{2}+Z_{l,k}\\ \leq&-\dfrac{C_{2}\alpha^{l,k}}{2}\|\nabla_{\theta}L_{\lambda^{l}}(\theta^{l,k})\|_{2}^{2}+\alpha^{l,k}\delta_{l,k}+\dfrac{\beta_{\lambda^{l}}M_{1}(\alpha^{l,k})^{2}}{2}+Z_{l,k}.\end{split} (21)

Now define Xl,K=k=0K1Zl,kX_{l,K}=\sum_{k=0}^{K-1}Z_{l,k} (with Xl,0=0X_{l,0}=0), then

𝐄(Xl,K+1|l,K)=k=0K1Zl,k+𝐄(Yl,K𝐄l,KYl,K|l,K)=Xl,K.\mathbf{E}(X_{l,K+1}|\mathcal{F}_{l,K})=\sum_{k=0}^{K-1}Z_{l,k}+\mathbf{E}(Y_{l,K}-\mathbf{E}_{l,K}Y_{l,K}|\mathcal{F}_{l,K})=X_{l,K}. (22)

Here l,K\mathcal{F}_{l,K} is the filtration up to episode KK in phase ll, i.e., the σ\sigma-algebra generated by all iterations {θ0,0,,θ0,T0,,θl,0,,θl,K}\{\theta^{0,0},\dots,\theta^{0,T_{0}},\dots,\theta^{l,0},\dots,\theta^{l,K}\} up to the (l,K)(l,K)-th one. Notice that the second equality makes use of the fact that given the current policy, the correspondingly sampled trajectory is conditionally independent of all previous policies and trajectories.

In addition, for any K1K\geq 1,

|Xl,KXl,K1|=|Zl,K1|αl,K1θLλl(θl,K1)2𝐄l,K1^θLλl(θl,K1)^θLλl(θl,K1)2+βλl(αl,K1)22|𝐄l,K1^θLλl(θl,K1)22^θLλl(θl,K1)22|2C1(2(1γ)2+2λl)αl,K1+βλl2C12(αl,K1)2cl,K.\begin{split}|X_{l,K}-X_{l,K-1}|=&\,|Z_{l,K-1}|\leq\alpha^{l,K-1}\|\nabla_{\theta}L_{\lambda^{l}}(\theta^{l,K-1})\|_{2}\|\mathbf{E}_{l,K-1}\widehat{\nabla}_{\theta}L_{\lambda^{l}}(\theta^{l,K-1})-\widehat{\nabla}_{\theta}L_{\lambda^{l}}(\theta^{l,K-1})\|_{2}\\ &+\dfrac{\beta_{\lambda^{l}}(\alpha^{l,K-1})^{2}}{2}\left|\mathbf{E}_{l,K-1}\|\widehat{\nabla}_{\theta}L_{\lambda^{l}}(\theta^{l,K-1})\|_{2}^{2}-\|\widehat{\nabla}_{\theta}L_{\lambda^{l}}(\theta^{l,K-1})\|_{2}^{2}\right|\\ \leq&\,\underbrace{2C_{1}\left(\dfrac{2}{(1-\gamma)^{2}}+2\lambda^{l}\right)\alpha^{l,K-1}+\dfrac{\beta_{\lambda^{l}}}{2}C_{1}^{2}(\alpha^{l,K-1})^{2}}_{c_{l,K}}.\end{split}

Here we use the fact that

θLλl(θl,K1)22/(1γ)2+2λl,\|\nabla_{\theta}L_{\lambda^{l}}(\theta^{l,K-1})\|_{2}\leq 2/(1-\gamma)^{2}+2\lambda^{l},

which follows from the same argument as (16). The above inequality on |Xl,KXl,K1||X_{l,K}-X_{l,K-1}| also implies that 𝐄|Xl,K|<\mathbf{E}|X_{l,K}|<\infty, which, together with (22), implies that Xl,KX_{l,K} is a martingale. Notice that although Xl,KX_{l,K} is only defined for K=0,,TlK=0,\dots,T_{l}, we can augment the sequence by setting Xl,K=Xl,TlX_{l,K}=X_{l,T_{l}} and l,K=l,Tl\mathcal{F}_{l,K}=\mathcal{F}_{l,T_{l}} for all K>TlK>T_{l}, and it is clear that (22) and 𝐄|Xl,K|<\mathbf{E}|X_{l,K}|<\infty also hold for KTlK\geq T_{l}. When we say that Xl,KX_{l,K} is a martingale, we are referring to this (infinite-length) augmented sequence of random variables.

Now by the definition of αl,k\alpha^{l,k}, it is easy to see that K=1Tlcl,K2Cl<\sum_{K=1}^{T_{l}}c_{l,K}^{2}\leq C_{l}<\infty, where

Cl=32C12Cl,α2(1(1γ)2+λl)2+βλl2C14Cl,α42.C_{l}=32C_{1}^{2}C_{l,\alpha}^{2}\left(\dfrac{1}{(1-\gamma)^{2}}+\lambda^{l}\right)^{2}+\dfrac{\beta_{\lambda^{l}}^{2}C_{1}^{4}C_{l,\alpha}^{4}}{2}. (23)

Hence, by the Azuma–Hoeffding inequality, for any c>0c>0,

𝐏𝐫𝐨𝐛(|Xl,Tl|c)2ec2/(2Cl).\mathbf{Prob}(|X_{l,T_{l}}|\geq c)\leq 2e^{-c^{2}/(2C_{l})}. (24)
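For completeness, the version of the Azuma–Hoeffding inequality invoked here is the standard one: if (X_{K})_{K\geq 0} is a martingale with X_{0}=0 and |X_{K}-X_{K-1}|\leq c_{K} almost surely for every K\geq 1, then for any c>0 and T\geq 1,

\mathbf{Prob}\left(|X_{T}|\geq c\right)\leq 2\exp\left(-\frac{c^{2}}{2\sum_{K=1}^{T}c_{K}^{2}}\right),

and (24) follows by bounding \sum_{K=1}^{T_{l}}c_{l,K}^{2} from above by C_{l}.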

Then by summing up the inequalities (21) from k=0k=0 to KK, we obtain that

C22k=0Kαl,kθLλl(θl,k)22C22k=0Tl1αl,kθLλl(θl,k)22k=0αl,kδl,k+βλlM1k=0(αl,k)22+k=0Tl1Zl,k+supθΘLλl(θ)Lλl(θl,0)k=0(αl,k)2k=0δl,k2+βλlM12k=0(αl,k)2+Xl,Tl+FLλl(θl,0)CCl,α2+βλlM1Cl,α2+FLλl(θl,0)Dl+Xl,Tl,\begin{split}\dfrac{C_{2}}{2}&\sum_{k=0}^{K}\alpha^{l,k}\|\nabla_{\theta}L_{\lambda^{l}}(\theta^{l,k})\|_{2}^{2}\leq\dfrac{C_{2}}{2}\sum_{k=0}^{T_{l}-1}\alpha^{l,k}\|\nabla_{\theta}L_{\lambda^{l}}(\theta^{l,k})\|_{2}^{2}\\ &\leq\sum_{k=0}^{\infty}\alpha^{l,k}\delta_{l,k}+\dfrac{\beta_{\lambda^{l}}M_{1}\sum_{k=0}^{\infty}(\alpha^{l,k})^{2}}{2}+\sum_{k=0}^{T_{l}-1}Z_{l,k}+\sup_{\theta\in\Theta}L_{\lambda^{l}}(\theta)-L_{\lambda^{l}}(\theta^{l,0})\\ &\leq\sum_{k=0}^{\infty}(\alpha^{l,k})^{2}\sum_{k=0}^{\infty}\delta_{l,k}^{2}+\dfrac{\beta_{\lambda^{l}}M_{1}}{2}\sum_{k=0}^{\infty}(\alpha^{l,k})^{2}+X_{l,T_{l}}+F^{\star}-L_{\lambda^{l}}(\theta^{l,0})\\ &\leq\underbrace{CC_{l,\alpha}^{2}+\beta_{\lambda^{l}}M_{1}C_{l,\alpha}^{2}+F^{\star}-L_{\lambda^{l}}(\theta^{l,0})}_{D_{l}}+X_{l,T_{l}},\end{split} (25)

where we use the fact that the regularization term R(θ)0R(\theta)\leq 0 for all θΘ\theta\in\Theta.

Hence we have

k=0Kαl,kθLλl(θl,k)222(Dl+Xl,Tl)C2.\sum_{k=0}^{K}\alpha^{l,k}\|\nabla_{\theta}L_{\lambda^{l}}(\theta^{l,k})\|_{2}^{2}\leq\frac{2(D_{l}+X_{l,T_{l}})}{C_{2}}. (26)

Bounding the phase regret.

We now establish the regret bound in phase ll using (26).

Fix l0l\geq 0 and K{0,,Tl1}K\in\{0,\dots,T_{l}-1\}. Let

I+={k{0,,K}|θLλl(θl,k)2λl/(2SA)}.I^{+}=\{k\in\{0,\dots,K\}\,|\,\|\nabla_{\theta}L_{\lambda^{l}}(\theta^{l,k})\|_{2}\geq\lambda^{l}/(2SA)\}.

For simplicity, assume for now that |I+|>0|I^{+}|>0. Then since αl,k\alpha^{l,k} is decreasing in kk, we have

2(Dl+Xl,Tl)/C2(λl)24S2A2k=K|I+|+1Kαl,k=ϵ2Cl,α(1γ)216S2A2Elk=K|I+|+1K1k+3log2(k+3)(By Lemma 8)Elϵ2|I+|K+3log2(K+3).\begin{split}2(D_{l}+X_{l,T_{l}})/C_{2}&\geq\dfrac{(\lambda^{l})^{2}}{4S^{2}A^{2}}\sum_{k=K-|I^{+}|+1}^{K}\alpha^{l,k}\\ &=\epsilon^{2}\underbrace{\frac{C_{l,\alpha}(1-\gamma)^{2}}{16S^{2}A^{2}}}_{E_{l}}\sum_{k=K-|I^{+}|+1}^{K}\frac{1}{\sqrt{k+3}\log_{2}(k+3)}\\ (\text{By Lemma \ref{simple_alpha_bound}})&\geq E_{l}\epsilon^{2}\dfrac{|I^{+}|}{\sqrt{K+3}\log_{2}(K+3)}.\end{split}

Hence we have (by the simple fact that K+32K+1\sqrt{K+3}\leq 2\sqrt{K+1} for any K0K\geq 0)

|I+|4(Dl+Xl,Tl)C2Elϵ2K+1log2(K+3).|I^{+}|\leq\dfrac{4(D_{l}+X_{l,T_{l}})}{C_{2}E_{l}\epsilon^{2}}\sqrt{K+1}\log_{2}(K+3). (27)

Now by Proposition 3 and the choice of λl\lambda^{l}, we have that for any kI+k\notin I^{+},

FF(πθl,k)dρπ/ρϵ.F^{\star}-F(\pi_{\theta^{l,k}})\leq\|d_{\rho}^{\pi^{\star}}/\rho\|_{\infty}\epsilon.

Since for any πΠ\pi\in\Pi, F(π)[0,1/(1γ)]F(\pi)\in[0,1/(1-\gamma)], we have FF(π)1/(1γ)F^{\star}-F(\pi)\leq 1/(1-\gamma). Hence by (27), we have

kKFF(πθl,k)=kI+FF(πθl,k)+kI+FF(πθl,k)|I+|/(1γ)+(K+1|I+|)dρπ/ρϵ4(Dl+Xl,Tl)(1γ)C2Elϵ2K+1log2(K+3)+(K+1)dρπ/ρϵ.\begin{split}\sum\nolimits_{k\leq K}F^{\star}&-F(\pi_{\theta^{l,k}})\\ &=\sum_{k\in I^{+}}F^{\star}-F(\pi_{\theta^{l,k}})+\sum_{k\notin I^{+}}F^{\star}-F(\pi_{\theta^{l,k}})\\ &\leq|I^{+}|/(1-\gamma)+\left(K+1-|I^{+}|\right)\|d_{\rho}^{\pi^{\star}}/\rho\|_{\infty}\epsilon\\ &\leq\dfrac{4(D_{l}+X_{l,T_{l}})}{(1-\gamma)C_{2}E_{l}\epsilon^{2}}\sqrt{K+1}\log_{2}(K+3)+(K+1)\|d_{\rho}^{\pi^{\star}}/\rho\|_{\infty}\epsilon.\end{split}

This immediately implies that

𝐫𝐞𝐠𝐫𝐞𝐭l(K)=kKFF(πθl,k)+kKF(πθl,k)F^l,k(πθl,k)kKFF(πθl,k)+kK𝐄l,kt=Hl,k+1γtr(stl,k,atl,k)kKFF(πθl,k)+kKγ/(k+1)1γ4(Dl+Xl,Tl)(1γ)C2Elϵ2K+1log2(K+3)+(K+1)dρπρϵ+γ+γlog(K+1)1γ.\begin{split}&\mathbf{regret}_{l}(K)=\sum\nolimits_{k\leq K}F^{\star}-F(\pi_{\theta^{l,k}})+\sum\nolimits_{k\leq K}F(\pi_{\theta^{l,k}})-\hat{F}^{l,k}(\pi_{\theta^{l,k}})\\ &\leq\sum\nolimits_{k\leq K}F^{\star}-F(\pi_{\theta^{l,k}})+\sum\nolimits_{k\leq K}\mathbf{E}_{l,k}\sum\nolimits_{t=H^{l,k}+1}^{\infty}\gamma^{t}r(s_{t}^{l,k},a_{t}^{l,k})\\ &\leq\sum\nolimits_{k\leq K}F^{\star}-F(\pi_{\theta^{l,k}})+\sum_{k\leq K}\frac{\gamma/(k+1)}{1-\gamma}\\ &\leq\dfrac{4(D_{l}+X_{l,T_{l}})}{(1-\gamma)C_{2}E_{l}\epsilon^{2}}\sqrt{K+1}\log_{2}(K+3)+(K+1)\left\|\dfrac{d_{\rho}^{\pi^{\star}}}{\rho}\right\|_{\infty}\epsilon+\frac{\gamma+\gamma\log(K+1)}{1-\gamma}.\end{split} (28)

Now if |I+|=0|I^{+}|=0, then we immediately have that

𝐫𝐞𝐠𝐫𝐞𝐭l(K)(K+1)dρπρϵ+γ+γlog(K+1)1γ,\mathbf{regret}_{l}(K)\leq(K+1)\left\|\dfrac{d_{\rho}^{\pi^{\star}}}{\rho}\right\|_{\infty}\epsilon+\frac{\gamma+\gamma\log(K+1)}{1-\gamma},

and hence (28) always holds.

Finally, by (24), we have that with probability at least 1δ1-\delta, for all K{0,,Tl1}K\in\{0,\dots,T_{l}-1\},

𝐫𝐞𝐠𝐫𝐞𝐭l(K)4(Dl+2Cllog(2/δ))(1γ)C2Elϵ2K+1log2(K+3)+ϵ(K+1)dρπρ+γ+γlog(K+1)1γ.\begin{split}&\mathbf{regret}_{l}(K)\\ &\leq\dfrac{4(D_{l}+\sqrt{2C_{l}\log(2/\delta)})}{(1-\gamma)C_{2}E_{l}\epsilon^{2}}\sqrt{K+1}\log_{2}(K+3)+\epsilon(K+1)\left\|\dfrac{d_{\rho}^{\pi^{\star}}}{\rho}\right\|_{\infty}+\frac{\gamma+\gamma\log(K+1)}{1-\gamma}.\end{split}

This completes our proof. ∎

B.3 Overall regret bound for general policy gradient estimators

In this section, we state and prove the overall regret bound for general policy gradient estimators, which generalizes Theorem 6 for REINFORCE gradient estimators.

Theorem 10 (General regret bound).

Under Assumptions 1 and 2, suppose that for each l0l\geq 0, we choose αl,k=Cl,α1k+3log2(k+3)\alpha^{l,k}=C_{l,\alpha}\frac{1}{\sqrt{k+3}\log_{2}(k+3)} for some Cl,α[C¯α,C2/(M2βλl)]C_{l,\alpha}\in[\underline{C}^{\alpha},C_{2}/(M_{2}\beta_{\lambda^{l}})], with C¯α(0,C2/(M2βλ¯)]\underline{C}^{\alpha}\in(0,C_{2}/(M_{2}\beta_{\bar{\lambda}})] and λ¯=1γ2\bar{\lambda}=\frac{1-\gamma}{2}. In addition, suppose that we specify T01T_{0}\geq 1, choose Tl=2lT0T_{l}=2^{l}T_{0}, ϵl=Tl1/6\epsilon^{l}=T_{l}^{-1/6} and λl=ϵl(1γ)2\lambda^{l}=\frac{\epsilon^{l}(1-\gamma)}{2} for each l0l\geq 0. Then we have for any δ(0,1)\delta\in(0,1), with probability at least 1δ1-\delta, for any N0N\geq 0, we have

𝐫𝐞𝐠𝐫𝐞𝐭(N)R¯1(N)+R¯2(N)=O(N5/6(log(N/δ))5/2),\begin{split}\mathbf{regret}(N)\leq\bar{R}_{1}(N)+\bar{R}_{2}(N)=O(N^{5/6}(\log(N/\delta))^{5/2}),\end{split} (29)

where

R¯1(N)=(4(D¯+2C¯((log2(N+1)+2)log2+log(1/δ)))(1γ)C2E¯+dρπρ)×(N+T0)56(log2(2N+2T0+1))2,R¯2(N)=(log2(N+1)+1)(γ+γlog(N+T0))1γ.\begin{split}\bar{R}_{1}(N)=&\,\left(\dfrac{4(\bar{D}+\sqrt{2\bar{C}((\log_{2}(N+1)+2)\log 2+\log(1/\delta))})}{(1-\gamma)C_{2}\underline{E}}+\left\|\dfrac{d_{\rho}^{\pi^{\star}}}{\rho}\right\|_{\infty}\right)\\ &\,\times(N+T_{0})^{\frac{5}{6}}(\log_{2}(2N+2T_{0}+1))^{2},\\ \bar{R}_{2}(N)=&\,\frac{(\log_{2}(N+1)+1)(\gamma+\gamma\log(N+T_{0}))}{1-\gamma}.\end{split} (30)

Here the constants E¯=C¯α(1γ)216S2A2\underline{E}=\frac{\underline{C}^{\alpha}(1-\gamma)^{2}}{16S^{2}A^{2}},

D¯=CC¯α2+βλ¯M1C¯α2+11γ+log(1/ϵpp),\bar{D}=C\bar{C}_{\alpha}^{2}+\beta_{\bar{\lambda}}M_{1}\bar{C}_{\alpha}^{2}+\frac{1}{1-\gamma}+\log(1/\epsilon_{\rm pp}),
C¯=32C12C¯α2(1(1γ)2+λ¯)2+βλ¯2C14C¯α42,\bar{C}=32C_{1}^{2}\bar{C}_{\alpha}^{2}\left(\frac{1}{(1-\gamma)^{2}}+\bar{\lambda}\right)^{2}+\frac{\beta_{\bar{\lambda}}^{2}C_{1}^{4}\bar{C}_{\alpha}^{4}}{2},

with C¯α=C2(1γ)38M2\bar{C}_{\alpha}=\frac{C_{2}(1-\gamma)^{3}}{8M_{2}}.

In addition, we also have

limN𝐫𝐞𝐠𝐫𝐞𝐭(N)/(N+1)=0almost surely\lim_{N\rightarrow\infty}\mathbf{regret}(N)/(N+1)=0\quad\text{almost surely}

with an asymptotic rate of O(N1/6(logN)5/2)O(N^{-1/6}(\log N)^{5/2}).
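To illustrate how the hyper-parameters in Theorem 10 fit together, here is a minimal sketch (ours; the numerical values of T₀, C_alpha and gamma below are placeholders, and in the theorem C_{l,α} must lie in the stated interval depending on C₂, M₂ and β_{λ^l}) of the per-phase schedule: doubling phase lengths, the per-phase accuracy and regularization weight, and the step-size rule.

import math

def schedule(l, k, gamma, T0=1, C_alpha=0.05):
    T_l = 2 ** l * T0                                  # doubling phase lengths T_l = 2^l T_0
    eps_l = T_l ** (-1.0 / 6.0)                        # per-phase accuracy epsilon^l
    lam_l = eps_l * (1.0 - gamma) / 2.0                # log-barrier weight lambda^l
    alpha_lk = C_alpha / (math.sqrt(k + 3) * math.log2(k + 3))   # step size alpha^{l,k}
    return T_l, eps_l, lam_l, alpha_lk

for l in range(4):
    print(schedule(l, k=0, gamma=0.99))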

Remark 1.

Notice that the constant E¯\underline{E} is a uniform lower bound of ElE_{l} (l0l\geq 0), while the constants D¯\bar{D} and C¯\bar{C} are uniform upper bounds of DlD_{l} and ClC_{l} (l0l\geq 0), respectively. Here the constants El,Dl,ClE_{l},\,D_{l},\,C_{l} are those defined in Theorem 9.

Remark 2.

In the big-OO notation above, we have (temporarily) hidden the problem dependent quantities, which will be made explicit when we specialize the results to the REINFORCE gradient estimation below.

Proof.

We first prove the high probability result. By (20) and the choices of ϵl\epsilon^{l} and λl\lambda^{l}, we have that for any phase l0l\geq 0, with probability at least 1δ/2l+11-\delta/2^{l+1}, for all K{0,,Tl1}K\in\{0,\dots,T_{l}-1\},

𝐫𝐞𝐠𝐫𝐞𝐭l(K)(4(D¯+2C¯((l+2)log2+log(1/δ)))(1γ)C2E¯+dρπρ)Tl5/6log2(Tl+2)+γ+γlogTl1γ.\begin{split}&\mathbf{regret}_{l}(K)\\ &\leq\left(\dfrac{4(\bar{D}+\sqrt{2\bar{C}((l+2)\log 2+\log(1/\delta))})}{(1-\gamma)C_{2}\underline{E}}+\left\|\dfrac{d_{\rho}^{\pi^{\star}}}{\rho}\right\|_{\infty}\right)T_{l}^{5/6}\log_{2}(T_{l}+2)+\frac{\gamma+\gamma\log T_{l}}{1-\gamma}.\end{split}

where E¯=C¯α(1γ)216S2A2\underline{E}=\frac{\underline{C}^{\alpha}(1-\gamma)^{2}}{16S^{2}A^{2}},

D¯=CC¯α2+βλ¯M1C¯α2+11γ+log(1/ϵpp),\bar{D}=C\bar{C}_{\alpha}^{2}+\beta_{\bar{\lambda}}M_{1}\bar{C}_{\alpha}^{2}+\frac{1}{1-\gamma}+\log(1/\epsilon_{\rm pp}),
C¯=32C12C¯α2(1(1γ)2+λ¯)2+βλ¯2C14C¯α42,\bar{C}=32C_{1}^{2}\bar{C}_{\alpha}^{2}\left(\frac{1}{(1-\gamma)^{2}}+\bar{\lambda}\right)^{2}+\frac{\beta_{\bar{\lambda}}^{2}C_{1}^{4}\bar{C}_{\alpha}^{4}}{2},

with C¯α=C2(1γ)38M2\bar{C}_{\alpha}=\frac{C_{2}(1-\gamma)^{3}}{8M_{2}} and λ¯=1γ2\bar{\lambda}=\frac{1-\gamma}{2}. Here we used the fact that ϵl1\epsilon^{l}\leq 1, which then implies that λlλ¯\lambda^{l}\leq\bar{\lambda} and

8(1γ)3βλlβλ¯=8(1γ)3+2λ¯S.\frac{8}{(1-\gamma)^{3}}\leq\beta_{\lambda^{l}}\leq\beta_{\bar{\lambda}}=\frac{8}{(1-\gamma)^{3}}+\frac{2\bar{\lambda}}{S}.

We also used the fact that FF(π)1/(1γ)F^{\star}-F(\pi)\leq 1/(1-\gamma) for any πΠ\pi\in\Pi and that by the definition of PostProcess, Rλl(πθl,0)logϵppR_{\lambda^{l}}(\pi_{\theta^{l,0}})\geq\log\epsilon_{\rm pp}.

Now recall that for any N0N\geq 0, we have

𝐫𝐞𝐠𝐫𝐞𝐭(N)=l=0lN1𝐫𝐞𝐠𝐫𝐞𝐭l(Tl1)+𝐫𝐞𝐠𝐫𝐞𝐭lN(kN)l=0lN𝐫𝐞𝐠𝐫𝐞𝐭l(Tl1),\begin{split}\mathbf{regret}(N)&=\sum_{l=0}^{l_{N}-1}\mathbf{regret}_{l}(T_{l}-1)+\mathbf{regret}_{l_{N}}(k_{N})\\ &\leq\sum_{l=0}^{l_{N}}\mathbf{regret}_{l}(T_{l}-1),\end{split}

where (lN,kN)=G𝒯(N)(l_{N},k_{N})=G_{\mathcal{T}}(N). In addition, by the choices of TlT_{l}, we have that for any 0kTl10\leq k\leq T_{l}-1,

B𝒯(l,k)\displaystyle B_{\mathcal{T}}(l,k) =j=0l1Tj+k\displaystyle=\sum_{j=0}^{l-1}T_{j}+k
=(2l1)T0+k\displaystyle=(2^{l}-1)T_{0}+k
(2l1)T0.\displaystyle\geq(2^{l}-1)T_{0}.

Hence for any N0N\geq 0, we have lNlog2(NT0+1)log2(N+1)l_{N}\leq\log_{2}\left(\frac{N}{T_{0}}+1\right)\leq\log_{2}(N+1).

Thus we have that with probability at least 1l=0lNδ/2l+11δ1-\sum_{l=0}^{l_{N}}\delta/2^{l+1}\geq 1-\delta, for any N0N\geq 0,

𝐫𝐞𝐠𝐫𝐞𝐭(N)(lN+1)(R^1(N)+R^2(N)),\begin{split}&\mathbf{regret}(N)\leq(l_{N}+1)(\hat{R}_{1}(N)+\hat{R}_{2}(N)),\end{split}

where

R^1(N)=(4(D¯+2C¯((lN+2)log2+log(1/δ)))(1γ)C2E¯+dρπρ)(N+T0)56log2(N+T0+2),R^2(N)=γ+γlog(N+T0)1γ.\begin{split}\hat{R}_{1}(N)&=\left(\dfrac{4(\bar{D}+\sqrt{2\bar{C}((l_{N}+2)\log 2+\log(1/\delta))})}{(1-\gamma)C_{2}\underline{E}}+\left\|\dfrac{d_{\rho}^{\pi^{\star}}}{\rho}\right\|_{\infty}\right)(N+T_{0})^{\frac{5}{6}}\log_{2}(N+T_{0}+2),\\ \hat{R}_{2}(N)&=\frac{\gamma+\gamma\log(N+T_{0})}{1-\gamma}.\end{split}

Finally, noticing that lN+1log2(N+1)+1log2(2N+2T0+1)l_{N}+1\leq\log_{2}(N+1)+1\leq\log_{2}(2N+2T_{0}+1), we have

(lN+1)R^1(N)(4(D¯+2C¯((log2(N+1)+2)log2+log(1/δ)))(1γ)C2E¯+dρπρ)×(N+T0)56(log2(2N+2T0+1))2,(lN+1)R^2(N)(log2(N+1)+1)(γ+γlog(N+T0))1γ,\begin{split}(l_{N}+1)\hat{R}_{1}(N)\leq&\,\left(\dfrac{4(\bar{D}+\sqrt{2\bar{C}((\log_{2}(N+1)+2)\log 2+\log(1/\delta))})}{(1-\gamma)C_{2}\underline{E}}+\left\|\dfrac{d_{\rho}^{\pi^{\star}}}{\rho}\right\|_{\infty}\right)\\ &\,\times(N+T_{0})^{\frac{5}{6}}(\log_{2}(2N+2T_{0}+1))^{2},\\ (l_{N}+1)\hat{R}_{2}(N)\leq&\,\frac{(\log_{2}(N+1)+1)(\gamma+\gamma\log(N+T_{0}))}{1-\gamma},\end{split}

which immediately imply (29) and (30). Notice that here we used the fact that log2(N+1)+1log2(2N+2T0+1)\log_{2}(N+1)+1\leq\log_{2}(2N+2T_{0}+1) (since T01T_{0}\geq 1), and that TlN+1N+T0T_{l}\leq N+1\leq N+T_{0} for all l=0,,lN1l=0,\dots,l_{N}-1 and TlN=2lNT0N+T0T_{l_{N}}=2^{l_{N}}T_{0}\leq N+T_{0}.

Finally, by invoking Lemma 4, we immediately obtain the almost sure convergence result. This completes our proof. ∎
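The index bookkeeping used in the proof above, namely the map B_𝒯(l,k)=(2^l−1)T₀+k from a (phase, step) pair to a global episode index and its inverse G_𝒯(N)=(l_N,k_N), can be sketched as follows; this is our own illustration with T₀=1, not code from the paper.

import math

def B(l, k, T0=1):
    # Global episode index of step k in phase l: sum_{j<l} T_j + k with T_j = 2^j T0.
    return (2 ** l - 1) * T0 + k

def G(N, T0=1):
    # Inverse map: the phase l_N containing episode N and the within-phase index k_N.
    l = 0
    while (2 ** (l + 1) - 1) * T0 <= N:
        l += 1
    return l, N - (2 ** l - 1) * T0

for N in [0, 1, 5, 100]:
    l, k = G(N)
    assert B(l, k) == N and 0 <= k < 2 ** l            # T_l = 2^l when T0 = 1
    assert l <= math.log2(N + 1)                       # l_N <= log2(N/T0 + 1)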

Appendix C Formal statement of REINFORCE regret bound

Here we provide a slightly more formal restatement of Theorem 6, with details about the constants in the big-OO notation in the main text. Recall that our goal there is to specialize (and slightly simplify) the regret bound in Theorem 10 to the case when the REINFORCE gradient estimation in §3.2 is adopted to evaluate ^θLλl(θl,k)\widehat{\nabla}_{\theta}L_{\lambda^{l}}(\theta^{l,k}). In particular, we have the following corollary. The proof simply combines Lemma 2 (with λlλ¯\lambda^{l}\leq\bar{\lambda} by their definitions in Theorem 6 or Corollary 11 below) and Theorem 10, plugs in the specific choices of the hyper-parameters and the constants in Lemma 2, and performs some elementary algebraic simplifications; it is hence omitted.

Corollary 11 (Formal statement of Theorem 6).

Under Assumption 1, suppose that for each l0l\geq 0, we choose αl,k=Cl,α1k+3log2(k+3)\alpha^{l,k}=C_{l,\alpha}\frac{1}{\sqrt{k+3}\log_{2}(k+3)}, with Cl,α[C¯α,1/(2βλl)]C_{l,\alpha}\in[\underline{C}^{\alpha},1/(2\beta_{\lambda^{l}})], C¯α(0,1/(2βλ¯)]\underline{C}^{\alpha}\in(0,1/(2\beta_{\bar{\lambda}})] and λ¯=1γ2\bar{\lambda}=\frac{1-\gamma}{2}, and choose Tl=2lT_{l}=2^{l}, ϵl=Tl1/6=2l/6\epsilon^{l}=T_{l}^{-1/6}=2^{-l/6}, λl=ϵl(1γ)2\lambda^{l}=\frac{\epsilon^{l}(1-\gamma)}{2} and ϵpp=1/(2A)\epsilon_{\rm pp}=1/(2A). In addition, suppose that (9) is adopted to evaluate ^θLλl(θl,k)\widehat{\nabla}_{\theta}L_{\lambda^{l}}(\theta^{l,k}), with β(0,1)\beta\in(0,1), |b(s)|B|b(s)|\leq B for any s𝒮s\in\mathcal{S} (where B>0B>0 is a constant), and that (10) holds for Hl,kH^{l,k} for all l,k0l,\,k\geq 0. Then we have for any δ(0,1)\delta\in(0,1), with probability at least 1δ1-\delta, for any N0N\geq 0, we have

𝐫𝐞𝐠𝐫𝐞𝐭(N)R~1(N)+R~2(N),\begin{split}\mathbf{regret}(N)\leq\tilde{R}_{1}(N)+\tilde{R}_{2}(N),\end{split}

where

R~1(N)=(4(D~+2C~((log2(N+1)+2)log2+log(1/δ)))(1γ)E¯+dρπρ)×(N+1)56(log2(2N+3))2,R~2(N)=γ(log2(N+1)+1)21γ.\begin{split}\tilde{R}_{1}(N)=&\,\left(\dfrac{4(\tilde{D}+\sqrt{2\tilde{C}((\log_{2}(N+1)+2)\log 2+\log(1/\delta))})}{(1-\gamma)\underline{E}}+\left\|\dfrac{d_{\rho}^{\pi^{\star}}}{\rho}\right\|_{\infty}\right)\\ &\times(N+1)^{\frac{5}{6}}(\log_{2}(2N+3))^{2},\\ \tilde{R}_{2}(N)=&\,\frac{\gamma(\log_{2}(N+1)+1)^{2}}{1-\gamma}.\end{split}

Here the constants are E¯=C¯α(1γ)216S2A2\underline{E}=\frac{\underline{C}^{\alpha}(1-\gamma)^{2}}{16S^{2}A^{2}}, and

D~=(1γ)6(1(1γ)2+λ¯)2+1256(1γ)6βλ¯(32(1γ)4+V¯b)+11γ+log(2A),C~=βλ¯2(1γ)12((1+B(1γ))(1γ)2+λ¯)48192+12(1γ)6((1+B(1γ))(1γ)2+λ¯)4.\begin{split}\tilde{D}&=(1-\gamma)^{6}\left(\frac{1}{(1-\gamma)^{2}}+\bar{\lambda}\right)^{2}+\frac{1}{256}(1-\gamma)^{6}\beta_{\bar{\lambda}}\left(\frac{32}{(1-\gamma)^{4}}+\bar{V}_{b}\right)+\frac{1}{1-\gamma}+\log(2A),\\ \tilde{C}&=\frac{\beta_{\bar{\lambda}}^{2}(1-\gamma)^{12}\left(\frac{(1+B(1-\gamma))}{(1-\gamma)^{2}}+\bar{\lambda}\right)^{4}}{8192}+\frac{1}{2}(1-\gamma)^{6}\left(\frac{(1+B(1-\gamma))}{(1-\gamma)^{2}}+\bar{\lambda}\right)^{4}.\end{split}

Here V¯b\bar{V}_{b} is the variance bound defined in Lemma 2.

Suppose in addition that we specify C¯α=1/(2βλ¯)\underline{C}^{\alpha}=1/(2\beta_{\bar{\lambda}}). Then we can simplify the regret bound into the following form:

𝐫𝐞𝐠𝐫𝐞𝐭(N)=O((S2A2(1γ)7+dρπρ)N56(log(N/δ))52).\mathbf{regret}(N)=O\left(\left(\frac{S^{2}A^{2}}{(1-\gamma)^{7}}+\left\|\dfrac{d_{\rho}^{\pi^{\star}}}{\rho}\right\|_{\infty}\right)N^{\frac{5}{6}}(\log(N/\delta))^{\frac{5}{2}}\right).

In addition, we also have

limN𝐫𝐞𝐠𝐫𝐞𝐭(N)/(N+1)=0almost surely\lim_{N\rightarrow\infty}\mathbf{regret}(N)/(N+1)=0\quad\text{almost surely}

with an asymptotic rate of O((S2A2(1γ)7+dρπ/ρ)N16(logN)52)O\left(\left(\frac{S^{2}A^{2}}{(1-\gamma)^{7}}+\left\|d_{\rho}^{\pi^{\star}}/\rho\right\|_{\infty}\right)N^{-\frac{1}{6}}(\log N)^{\frac{5}{2}}\right).

Remark 3.

Notice that here (and below), with the specific choices of algorithm hyper-parameters and gradient estimators we are finally able to make all the problem dependent quantities (e.g., γ,S,A\gamma,\,S,\,A, etc.) explicit in the big-OO notation, which is consistent with the convention of the reinforcement learning literature. Here the only hidden quantities are some absolute constants.

Appendix D Mini-batch phased policy gradient method

Here we formalize the mini-batch version of Algorithm 2 described at the beginning of §5 as Algorithm 4, and provide a formal statement as well as a proof for Corollary 7.

Firstly, we have the following lemma, which transfers guarantees on ^θ(i)Lλl(θl,k)\widehat{\nabla}_{\theta}^{(i)}L_{\lambda^{l}}(\theta^{l,k}) (i=1,,Mi=1,\dots,M) to the averaged gradient estimation 1Mi=1M^θ(i)Lλl(θl,k)\frac{1}{M}\sum_{i=1}^{M}\widehat{\nabla}_{\theta}^{(i)}L_{\lambda^{l}}({\theta^{l,k}}). The proof follows directly from the fact that the variance of the sum of independent random variables is the sum of the variances, and is thus omitted.

Lemma 12.

Suppose that each ^θ(i)Lλl(θl,k)\widehat{\nabla}_{\theta}^{(i)}L_{\lambda^{l}}(\theta^{l,k}) (i=1,,Mi=1,\dots,M) is computed using (9) with the corresponding trajectory, and that the same assumptions as in Lemma 2 hold. Then Assumption 2 also holds for ^θLλl(θl,k)=1Mi=1M^θ(i)Lλl(θl,k)\widehat{\nabla}_{\theta}L_{\lambda^{l}}(\theta^{l,k})=\frac{1}{M}\sum_{i=1}^{M}\widehat{\nabla}_{\theta}^{(i)}L_{\lambda^{l}}({\theta^{l,k}}) with the same constants C,C1,C2,M2,δl,kC,C_{1},C_{2},M_{2},\delta_{l,k} and V¯b\bar{V}_{b} as in Lemma 2, while M1=32(1γ)4+V¯bMM_{1}=\frac{32}{(1-\gamma)^{4}}+\frac{\bar{V}_{b}}{M}.
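A toy numerical illustration (ours; the “gradient estimates” below are synthetic Gaussian perturbations of a fixed vector, not actual REINFORCE estimates) of the variance scaling in Lemma 12: averaging M independent estimates leaves the mean unchanged and divides the summed variance by roughly M.

import numpy as np

rng = np.random.default_rng(0)
true_grad = np.array([1.0, -2.0, 0.5])

def one_estimate():
    # Stand-in for a single-trajectory gradient estimate: the true gradient plus noise.
    return true_grad + rng.normal(scale=2.0, size=3)

M, trials = 16, 5000
estimates = np.array([[one_estimate() for _ in range(M)] for _ in range(trials)])
var_single = estimates[:, 0, :].var(axis=0).sum()      # effectively M = 1
var_avg = estimates.mean(axis=1).var(axis=0).sum()     # averaged over the mini-batch
print(var_single / var_avg)                            # ratio is roughly M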

Now we are ready to state and prove (a more formal version of) Corollary 7.

Corollary 13 (Formal statement of Corollary 7).

Under Assumption 1, suppose that for each l0l\geq 0, we choose αl,k=Cl,α1k+3log2(k+3)\alpha^{l,k}=C_{l,\alpha}\frac{1}{\sqrt{k+3}\log_{2}(k+3)}, with Cl,α[C¯α,1/(2βλl)]C_{l,\alpha}\in[\underline{C}^{\alpha},1/(2\beta_{\lambda^{l}})], C¯α(0,1/(2βλ¯)]\underline{C}^{\alpha}\in(0,1/(2\beta_{\bar{\lambda}})] and λ¯=1γ2\bar{\lambda}=\frac{1-\gamma}{2}, and choose Tl=2lT_{l}=2^{l}, ϵl=Tl1/6=2l/6\epsilon^{l}=T_{l}^{-1/6}=2^{-l/6}, λl=ϵl(1γ)2\lambda^{l}=\frac{\epsilon^{l}(1-\gamma)}{2} and ϵpp=1/(2A)\epsilon_{\rm pp}=1/(2A). In addition, suppose that the assumptions in Lemma 12 hold (note that Assumption 1 and λlλ¯\lambda^{l}\leq\bar{\lambda} already automatically hold by the other assumptions). Then we have for any δ(0,1)\delta\in(0,1), with probability at least 1δ1-\delta, for any NM1N\geq M\geq 1, we have that (for the mini-batch Algorithm 4)

𝐫𝐞𝐠𝐫𝐞𝐭(NM;M)R~1(N;M)+R~2(N;M),\begin{split}\mathbf{regret}(N-M;M)\leq\tilde{R}_{1}(N;M)+\tilde{R}_{2}(N;M),\end{split}

where

R~1(N;M)=(4(D~M+2C~((log2(N/M)+2)log2+log(1/δ)))(1γ)E¯+dρπρ)×M16N56(log2(2(N/M)+1))2,R~2(N;M)=γM(log2(N/M)+1)21γ.\begin{split}\tilde{R}_{1}(N;M)=&\left(\dfrac{4(\tilde{D}_{M}+\sqrt{2\tilde{C}((\log_{2}(N/M)+2)\log 2+\log(1/\delta))})}{(1-\gamma)\underline{E}}+\left\|\dfrac{d_{\rho}^{\pi^{\star}}}{\rho}\right\|_{\infty}\right)\\ &\,\times M^{\frac{1}{6}}N^{\frac{5}{6}}(\log_{2}(2(N/M)+1))^{2},\\ \tilde{R}_{2}(N;M)=&\frac{\gamma M(\log_{2}(N/M)+1)^{2}}{1-\gamma}.\end{split}

Here the constants E¯\underline{E} and C~\tilde{C} are the same as in Corollary 11, while

D~M=(1γ)6(1(1γ)2+λ¯)2+1256(1γ)6βλ¯(32(1γ)4+V¯bM)+11γ+log(2A).\tilde{D}_{M}=(1-\gamma)^{6}\left(\frac{1}{(1-\gamma)^{2}}+\bar{\lambda}\right)^{2}+\frac{1}{256}(1-\gamma)^{6}\beta_{\bar{\lambda}}\left(\frac{32}{(1-\gamma)^{4}}+\frac{\bar{V}_{b}}{M}\right)+\frac{1}{1-\gamma}+\log(2A).

Here V¯b\bar{V}_{b} is the variance bound defined in Lemma 2.

Suppose in addition that we specify C¯α=1/(2βλ¯)\underline{C}^{\alpha}=1/(2\beta_{\bar{\lambda}}). Then we can simplify the regret bound into the following simple form:

𝐫𝐞𝐠𝐫𝐞𝐭(N;M)=O((S2A2(1γ)7+dρπρ)(M16+M56)(N+M)56(log(N/δ))52+M(logN)21γ).\begin{split}&\mathbf{regret}(N;M)\\ &=O\left(\left(\dfrac{S^{2}A^{2}}{(1-\gamma)^{7}}+\left\|\dfrac{d_{\rho}^{\pi^{\star}}}{\rho}\right\|_{\infty}\right)(M^{\frac{1}{6}}+M^{-\frac{5}{6}})(N+M)^{\frac{5}{6}}(\log(N/\delta))^{\frac{5}{2}}+\dfrac{M(\log N)^{2}}{1-\gamma}\right).\end{split}

In addition, we also have

limN𝐫𝐞𝐠𝐫𝐞𝐭(N;M)/(N+1)=0almost surely\lim_{N\rightarrow\infty}\mathbf{regret}(N;M)/(N+1)=0\quad\text{almost surely}

with an asymptotic rate of

O((S2A2(1γ)7+dρπρ)(M16+M56)N16(1+MN)56(logN)52+M(logN)2(1γ)N).O\left(\left(\dfrac{S^{2}A^{2}}{(1-\gamma)^{7}}+\left\|\dfrac{d_{\rho}^{\pi^{\star}}}{\rho}\right\|_{\infty}\right)(M^{\frac{1}{6}}+M^{-\frac{5}{6}})N^{-\frac{1}{6}}\left(1+\dfrac{M}{N}\right)^{\frac{5}{6}}(\log N)^{\frac{5}{2}}+\dfrac{M(\log N)^{2}}{(1-\gamma)N}\right).
Proof of Corollary 7.

By the definition of 𝐫𝐞𝐠𝐫𝐞𝐭(N;M)\mathbf{regret}(N;M), we immediately see that

𝐫𝐞𝐠𝐫𝐞𝐭(N;M)M𝐫𝐞𝐠𝐫𝐞𝐭(N/M),\mathbf{regret}(N;M)\leq M\mathbf{regret}(\lfloor N/M\rfloor), (31)

where 𝐫𝐞𝐠𝐫𝐞𝐭(J)\mathbf{regret}(J) (J0J\geq 0) is the original regret (6) in the mini-batch setting, defined with respect to the total number of inner iterations/steps (rather than episodes, and hence not multiplied by a factor of MM). More precisely, we have that for any J0J\geq 0,

𝐫𝐞𝐠𝐫𝐞𝐭(J)={(l,k)|B𝒯(l,k)J}FF(πθl,k).\mathbf{regret}(J)=\sum\nolimits_{\{(l,k)|B_{\mathcal{T}}(l,k)\leq J\}}F^{\star}-F(\pi_{\theta^{l,k}}).

Now by Lemma 12 and Theorem 10, and following the same simplification as is done in Corollary 11, we have that for any J0J\geq 0,

𝐫𝐞𝐠𝐫𝐞𝐭(J)(4(D~M+2C~((log2(J+1)+2)log2+log(1/δ)))(1γ)E¯+dρπρ)(J+1)56(log2(2J+3))2+γ(log2(J+1)+1)21γ,\begin{split}&\mathbf{regret}(J)\\ &\leq\left(\dfrac{4(\tilde{D}_{M}+\sqrt{2\tilde{C}((\log_{2}(J+1)+2)\log 2+\log(1/\delta))})}{(1-\gamma)\underline{E}}+\left\|\dfrac{d_{\rho}^{\pi^{\star}}}{\rho}\right\|_{\infty}\right)(J+1)^{\frac{5}{6}}(\log_{2}(2J+3))^{2}\\ &\quad+\frac{\gamma(\log_{2}(J+1)+1)^{2}}{1-\gamma},\end{split}

where the constants are as stated in the corollary above.

The proof is then complete by plugging the bound of 𝐫𝐞𝐠𝐫𝐞𝐭(J)\mathbf{regret}(J) above into (31) and invoking Lemma 4. ∎

Algorithm 4 Mini-Batch Phased Policy Gradient Method
1:  Input: initial parameter θ~0,0\tilde{\theta}^{0,0}, step-sizes αl,k\alpha^{l,k}, regularization parameters λl\lambda^{l}, phase lengths TlT_{l} (l,k0l,\,k\geq 0), post-processing tolerance ϵpp\epsilon_{\rm pp} and batch size M>0M>0.
2:  Set θ0,0=PostProcess(θ~0,0,ϵpp)\theta^{0,0}=\texttt{PostProcess}(\tilde{\theta}^{0,0},\epsilon_{\rm pp}).
3:  for phase l=0,1,2,l=0,1,2,\dots do
4:     for step k=0,1,,Tl1k=0,1,\dots,T_{l}-1 do
5:        Choose Hl,kH^{l,k}, sample IID trajectories {τil,k}i=1M\{\tau_{i}^{l,k}\}_{i=1}^{M} (each with horizon Hl,kH^{l,k}) from \mathcal{M} following policy πθl,k\pi_{\theta^{l,k}}, and compute an approximate gradient ^θ(i)Lλl(θl,k)\widehat{\nabla}_{\theta}^{(i)}L_{\lambda^{l}}({\theta^{l,k}}) of LλlL_{\lambda^{l}} for each trajectory τil,k\tau_{i}^{l,k} (i=1,,Mi=1,\dots,M).
6:        Update θl,k+1=θl,k+αl,k1Mi=1M^θ(i)Lλl(θl,k)\theta^{l,k+1}=\theta^{l,k}+\alpha^{l,k}\frac{1}{M}\sum_{i=1}^{M}\widehat{\nabla}_{\theta}^{(i)}L_{\lambda^{l}}({\theta^{l,k}}).
7:     end for
8:     Set θl+1,0=PostProcess(θl,Tl,ϵpp)\theta^{l+1,0}=\texttt{PostProcess}(\theta^{l,T_{l}},\epsilon_{\rm pp}).
9:  end for
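For readers who prefer code, below is a schematic Python rendering of Algorithm 4 (ours; the environment interaction, PostProcess and the per-trajectory gradient estimator are passed in as placeholder callables, so this is a sketch of the control flow rather than a full implementation).

import numpy as np

def minibatch_phased_pg(theta0, post_process, sample_trajectory, grad_estimate,
                        alpha, lam, T, horizon, eps_pp, M, num_phases):
    theta = post_process(theta0, eps_pp)                             # line 2
    for l in range(num_phases):                                      # line 3
        for k in range(T(l)):                                        # line 4
            H = horizon(l, k)                                        # line 5
            trajs = [sample_trajectory(theta, H) for _ in range(M)]
            grads = [grad_estimate(theta, tau, lam(l)) for tau in trajs]
            theta = theta + alpha(l, k) * np.mean(grads, axis=0)     # line 6
        theta = post_process(theta, eps_pp)                          # line 8
    return theta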

Appendix E Related work

Policy gradient methods are a large family of algorithms for reinforcement learning that directly operate on the agent policy, rather than on the action-value function (Glynn 1986; Sutton and Barto 2018). Examples of policy gradient methods include REINFORCE (Williams 1992), A3C (Mnih et al. 2016), DPG (Silver et al. 2014), PPO (Schulman et al. 2017), and TRPO (Schulman et al. 2015), to name just a few. These methods seek to directly maximize the cumulative reward as a function of the policy; they are straightforward to implement and amenable to function approximation. The asymptotic convergence of (some) policy gradient methods to a stationary point has long been established (Sutton et al. 2000; Konda and Tsitsiklis 2003; Marbach and Tsitsiklis 2001; Baxter and Bartlett 2001). The rate of such convergence is also known and has been improved more recently, with the help of variance reduction (Papini et al. 2018; Xu, Gao, and Gu 2020, 2019), Hessian information (Shen et al. 2019) and momentum techniques (Xiong et al. 2020; Yuan et al. 2020; Pham et al. 2020; Huang et al. 2020). In contrast, until recently, the global convergence of (vanilla) policy gradient methods (like REINFORCE) had not been established unless unrealistic assumptions, such as concavity of the expected cumulative reward function (Ma et al. 2016), were imposed. The only exceptions are TRPO (Neu, Jonsson, and Gómez 2017) and the soft-max natural policy gradient method with fully known models (Agarwal et al. 2019), which happen to be equivalent to the MDP Expert algorithms (Even-Dar, Kakade, and Mansour 2004, 2009; Neu et al. 2010).

In the past two years, a line of research on the global convergence theory for (both vanilla and natural) policy gradient methods has emerged. By using a gradient domination property of the cumulative reward, global convergence of (both vanilla and natural) policy gradient methods is first established for linear-quadratic regulators (Fazel et al. 2018). For general Markov Decision Processes (MDPs), (Zhang et al. 2019) establishes convergence to approximately locally optimal (i.e., second-order stationary) solutions for vanilla policy gradient methods. The global optimality of stationary points for general MDPs is shown in (Bhandari and Russo 2019), and rates of convergence towards globally optimal solutions for (both vanilla and natural) policy gradient methods with (neural network) function approximation are derived in (Agarwal et al. 2019; Wang et al. 2019). These convergence results are then improved by several very recent works focusing on exact gradient settings. In particular, (Mei et al. 2020) focuses on the more practically relevant soft-max parametrization and vanilla policy gradient methods and improves the results of (Agarwal et al. 2019) by removing the requirement of the relative entropy regularization and obtaining faster convergence rates; (Bhandari and Russo 2020) obtains linear convergence for a general class of policy gradient methods; (Cen et al. 2020) shows local quadratic convergence of natural policy gradient methods; and (Zhang et al. 2020) extends the results to reinforcement learning with general utilities. For more modern policy gradient methods, (Zhao, Li, and Wen 2019) establishes the asymptotic global convergence of TRPO, while (Liu et al. 2019) further derives the global convergence rates for PPO and TRPO. These rates are then improved in (Shani, Efroni, and Mannor 2019) for TRPO with adaptive regularization terms. Very recently, (Fu, Yang, and Wang 2020) extends these results to obtain the global convergence rates of single-timescale actor-critic methods with PPO actor updates, and (Agarwal et al. 2020) derives global convergence rates of a new policy gradient algorithm combining natural policy gradient methods with a policy cover technique and shows that the algorithm enjoys better exploration behavior and hence removes the need for access to a fully supported initial distribution ρ\rho, which is assumed in most other works on global convergence of policy gradient methods (including our work). All the above works either require exact and deterministic updates or mini-batch updates with a diverging mini-batch size.

Lately, (Jin, Schmitt, and Wen 2020) studies vanilla policy gradient methods using the REINFORCE gradient estimators computed with a single trajectory in each episode and obtains high probability sample complexity results, but the setting is restricted to linear-quadratic regulators and their bounds have polynomial dependency on 1/δ1/\delta (in contrast to our logarithmic dependency on 1/δ1/\delta), where δ\delta is the probability that the bounds are violated. The authors of (Abbasi-Yadkori et al. 2019) study natural policy gradient methods with a general high probability estimation oracle for state-action value functions (i.e., QQ-functions) in the average reward settings, and establish high probability regret bounds for these algorithms. Finally, we remark that there are also some recent results on the global convergence rates of natural policy gradient methods in adversarial settings (with full-information feedback) (Cai et al. 2019), model-based natural policy gradient methods (Efroni et al. 2020) as well as extensions to non-stationary (Fei et al. 2020) and multi-agent game settings (Zhang, Yang, and Basar 2019; Mazumdar et al. 2019; Carmona, Laurière, and Tan 2019; Fu et al. 2019; Guo et al. 2020), which are beyond the scope of this paper.