
Sample Efficient Reinforcement Learning with REINFORCE

Junzi Zhang1, Jongho Kim2, Brendan O’Donoghue3, Stephen Boyd2
Abstract

Policy gradient methods are among the most effective methods for large-scale reinforcement learning, and their empirical success has prompted several works that develop the foundation of their global convergence theory. However, prior works have either required exact gradients or state-action visitation measure based mini-batch stochastic gradients with a diverging batch size, which limit their applicability in practical scenarios. In this paper, we consider classical policy gradient methods that compute an approximate gradient with a single trajectory or a fixed size mini-batch of trajectories under soft-max parametrization and log-barrier regularization, along with the widely-used REINFORCE gradient estimation procedure. By controlling the number of “bad” episodes and resorting to the classical doubling trick, we establish an anytime sub-linear high probability regret bound as well as almost sure global convergence of the average regret with an asymptotically sub-linear rate. These provide the first set of global convergence and sample efficiency results for the well-known REINFORCE algorithm and contribute to a better understanding of its performance in practice.

1 Introduction

In this paper, we study the global convergence rates of the REINFORCE algorithm (Williams 1992) for episodic reinforcement learning. REINFORCE is a vanilla policy gradient method that computes a stochastic approximate gradient with a single trajectory or a fixed-size mini-batch of trajectories using a particular choice of gradient estimator, where we use 'vanilla' here to disambiguate the method from more exotic variants such as natural policy gradient methods. REINFORCE and its variants are among the most widely used policy gradient methods in practice due to their good empirical performance and implementation simplicity (Mnih and Gregor 2014; Gu et al. 2015; Zoph and Le 2016; Rennie et al. 2017; Guu et al. 2017; Johnson et al. 2017; Yi et al. 2018; Kool, van Hoof, and Welling 2018, 2020). Related methods include the actor-critic family (Konda and Tsitsiklis 2003; Mnih et al. 2016) and deterministic and trust-region based variants (Silver et al. 2014; Schulman et al. 2017, 2015).

The theoretical results for policy gradient methods have, until recently, been restricted to convergence to local stationary points (Agarwal et al. 2019). Lately, a series of works have established global convergence results. These recent developments cover a broad range of issues including global optimality characterization (Fazel et al. 2018; Bhandari and Russo 2019), convergence rates (Zhang et al. 2019; Mei et al. 2020; Bhandari and Russo 2020; Cen et al. 2020), the use of function approximation (Agarwal et al. 2019; Wang et al. 2019; Fu, Yang, and Wang 2020), and efficient exploration (Agarwal et al. 2020) (for more details, see the related work section, which we defer to Appendix E due to space limits). Nevertheless, prior work on vanilla policy gradient methods either requires exact and deterministic policy gradients or only guarantees convergence up to $\Theta(1/M^{p})$ with a fixed mini-batch size $M>0$ of trajectories collected when performing a single update (where $p>0$ is $1/2$ in most cases), while global convergence is only achieved when the batch size $M$ goes to infinity. By contrast, practical implementations of policy gradient methods typically use either a single or a fixed number of sample trajectories, which tends to perform well. In addition, prior theoretical results (for general MDPs) have used state-action visitation measure based gradient estimators (see e.g., (Wang et al. 2019, (3.10))), which are typically not used in practice.

The main purpose of this paper is to bridge this gap between theory and practice. We do this in two major ways. First, we derive performance bounds for the case of a fixed mini-batch size, rather than requiring diverging size. Second, we remove the need for the state-action visitation measure based gradient, instead using the REINFORCE gradient estimator. It is nontrivial to go from a diverging mini-batch size to a fixed one. In fact, by allowing for an arbitrarily large batch size, existing works in the literature were able to make use of IID samples to decouple the analysis into deterministic gradient descent/ascent and error control of stochastic gradient estimations. In contrast, with a single trajectory or a fixed batch size, such a decoupling is no longer feasible. In addition, the state-action visitation measure based gradient estimations are unbiased and unbounded, while REINFORCE gradient estimations are biased and bounded. Hence a key to the analysis is to deal with the bias while making better use of the boundedness. Our analysis not only addresses these challenges, but also leads to convergence results in almost sure and high probability senses, which are stronger than the expected convergence results that dominate the literature (for vanilla policy gradient methods). We also emphasize that the goal of this work is to provide a deeper understanding of a widely used algorithm, REINFORCE, with little or no modifications, rather than tweaking it to achieve near-optimal performance bounds. Lastly, our analysis is not the complete picture and several open questions about the performance of policy gradient methods remain. We discuss these issues in the conclusion.

1.1 Contribution

Our major contribution can be summarized as follows. We establish the first set of global convergence results for the REINFORCE algorithm. In particular, we establish an anytime sub-linear high probability regret bound as well as almost sure global convergence of the average regret with an asymptotically sub-linear rate for REINFORCE, showing that the algorithm is sample efficient (i.e., with polynomial/non-exponential complexity). To our knowledge, these (almost sure and high probability) results are stronger than existing global convergence results for (vanilla) policy gradient methods in the literature. Moreover, our convergence results remove the non-vanishing $\Theta(1/M^{p})$ term (with $M>0$ being the mini-batch size of the trajectories and $p>0$ being some constant exponent) and hence show for the first time that policy gradient estimates based on a single trajectory or a finite number of trajectories also enjoy global convergence properties. Finally, the widely-used REINFORCE gradient estimation procedure is studied, as opposed to the state-action visitation measure based estimators typically studied in the literature but rarely used in practice.

2 Problem setting and preliminaries

Below we begin with our problem setting and some preliminaries on MDPs and policy optimization. For brevity we restrict ourselves to the stationary infinite-horizon discounted setting. We briefly discuss potential extensions beyond this setting in §6.

2.1 Problem setting

We consider a finite MDP $\mathcal{M}$, which is characterized by a finite state space $\mathcal{S}=\{1,\dots,S\}$, a finite action space $\mathcal{A}=\{1,\dots,A\}$, a transition probability $p$ (with $p(s^{\prime}|s,a)$ being the probability of transitioning to state $s^{\prime}$ given the current state $s$ and action $a$), a reward function $r$ (with $r(s,a)$ being the instantaneous reward when taking action $a$ at state $s$), a discount factor $\gamma\in[0,1)$ and an initial state distribution $\rho\in\Delta(\mathcal{S})$. Here $\Delta(\mathcal{X})$ denotes the probability simplex over a finite set $\mathcal{X}$. A (stationary, stochastic) policy $\pi$ is a mapping from $\mathcal{S}$ to $\Delta(\mathcal{A})$. We will use $\pi(a|s)$, $\pi(s,a)$ or $\pi_{s,a}$ interchangeably to denote the probability of taking action $a$ at state $s$ following policy $\pi$. The policy $\pi$ can also be viewed as an $SA$-dimensional vector in

\[
\Pi=\Big\{\pi\in\mathbf{R}^{SA}\,\Big|\,\sum_{a=1}^{A}\pi_{s,a}=1\ (\forall s\in\mathcal{S}),\ \pi_{s,a}\geq 0\ (\forall s\in\mathcal{S},\,a\in\mathcal{A})\Big\}. \tag{1}
\]

Notice that here we use the double indices $s$ and $a$ for notational convenience. We use $\pi(s,\cdot)\in\mathbf{R}^{A}$ to denote the sub-vector $(\pi(s,1),\dots,\pi(s,A))$. We also assume that $r(s,a)$ is deterministic for any $s\in\mathcal{S}$ and $a\in\mathcal{A}$ for simplicity, although our results hold for any $r$ with an almost sure uniform bound. Here $r$ can be similarly viewed as an $SA$-dimensional vector. Without loss of generality, we assume that $r(s,a)\in[0,1]$ for all $s\in\mathcal{S}$ and $a\in\mathcal{A}$, which is a common assumption (Jaksch, Ortner, and Auer 2010; Agarwal et al. 2019; Mei et al. 2020; Even-Dar and Mansour 2003; Jin et al. 2018). We also assume that $\rho$ is component-wise positive, as is assumed in (Bhandari and Russo 2019).

Given a policy $\pi\in\Pi$, the expected cumulative reward of the MDP is defined as

\[
F(\pi)=\mathbf{E}\sum_{t=0}^{\infty}\gamma^{t}r(s_{t},a_{t}), \tag{2}
\]

where $s_{0}\sim\rho$, $a_{t}\sim\pi(\cdot|s_{t})$, $s_{t+1}\sim p(\cdot|s_{t},a_{t})$ for all $t\geq 0$, and the goal is to find a policy $\pi$ which solves the following optimization problem:

\[
\text{maximize}_{\pi\in\Pi}\quad F(\pi). \tag{3}
\]

Any policy $\pi^{\star}\in\mathop{\rm argmax}_{\pi\in\Pi}F(\pi)$ is said to be optimal, and the corresponding optimal objective value is denoted as $F^{\star}=F(\pi^{\star})$. Note that in the literature, $F(\pi)$ is also commonly written as $V_{\rho}^{\pi}$ and referred to as the value function. Here we hide the dependency on $\rho$ as it is fixed throughout the paper.
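To make the objective (2)--(3) concrete, here is a minimal Python sketch that estimates $F(\pi)$ by truncated Monte Carlo rollouts on a small synthetic tabular MDP; the transition tensor P, reward matrix r, rollout horizon and all function names are our own illustrative choices, not part of the paper.

import numpy as np

def rollout_return(P, r, rho, pi, gamma, H, rng):
    """One truncated rollout: s_0 ~ rho, a_t ~ pi(.|s_t), s_{t+1} ~ p(.|s_t, a_t)."""
    S, A = r.shape
    s = rng.choice(S, p=rho)
    G = 0.0
    for t in range(H):
        a = rng.choice(A, p=pi[s])
        G += (gamma ** t) * r[s, a]
        s = rng.choice(S, p=P[s, a])
    return G

def estimate_F(P, r, rho, pi, gamma, H=200, n_rollouts=1000, seed=0):
    """Monte Carlo estimate of F(pi), truncated at horizon H (bias at most gamma^H/(1-gamma))."""
    rng = np.random.default_rng(seed)
    return np.mean([rollout_return(P, r, rho, pi, gamma, H, rng)
                    for _ in range(n_rollouts)])

# Tiny synthetic MDP (illustrative): S = 3 states, A = 2 actions.
rng = np.random.default_rng(1)
S, A = 3, 2
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] is a distribution over next states
r = rng.uniform(0.0, 1.0, size=(S, A))       # rewards in [0, 1], as assumed above
rho = np.ones(S) / S                         # component-wise positive initial distribution
pi = np.ones((S, A)) / A                     # uniform policy
print(estimate_F(P, r, rho, pi, gamma=0.9))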

2.2 Vanilla policy gradient method and REINFORCE algorithm

When the transition probability $p$ and reward $r$ are fully known, problem (3) reduces to solving an MDP, in which case various classical algorithms are available, including value iteration and policy iteration (Bertsekas 2017). In this paper, we consider the episodic reinforcement learning setting in which the agent accesses $p$ and $r$ by interacting with the environment over successive episodes, i.e., the agent accesses the environment in the form of a $\rho$-restart model (Shani, Efroni, and Mannor 2019), which is commonly adopted in the policy gradient literature (Kakade et al. 2003). In addition, we focus on the REINFORCE algorithm, a representative policy gradient method.

Policy parametrization and surrogate objectives.

Here we consider parametrizing the policy with parameter $\theta\in\Theta$, i.e., $\pi_{\theta}:\Theta\rightarrow\Pi$, and take the following (regularized) optimization problem as an approximation to (3):

\[
\text{maximize}_{\theta\in\Theta}\quad L_{\lambda}(\theta)=F(\pi_{\theta})+\lambda R(\theta), \tag{4}
\]

where $\lambda\geq 0$ and $R:\Theta\rightarrow\mathbf{R}$ is a differentiable regularization term that improves convergence, to be specified later. Although our ultimate goal is still to solve the original problem (3), this regularized optimization problem is a useful surrogate, and our approach will be to tackle problem (4) with progressively smaller regularization penalties $\lambda$, thereby converging to a solution of the actual problem we care about.

Policy gradient method.

In each episode $n$, the policy gradient method directly performs an online stochastic gradient ascent update on a surrogate objective $L_{\lambda^{n}}(\theta)$, i.e.,

\[
\theta^{n+1}=\theta^{n}+\alpha^{n}\widehat{\nabla}_{\theta}L_{\lambda^{n}}(\theta^{n}), \tag{5}
\]

where $\alpha^{n}$ is the step-size and $\lambda^{n}$ is the regularization parameter. Here the stochastic gradient $\widehat{\nabla}_{\theta}L_{\lambda^{n}}(\theta^{n})$ is computed by sampling a single trajectory $\tau^{n}$ following policy $\pi_{\theta^{n}}$ from $\mathcal{M}$ with the initial state distribution $\rho$. Here $\tau^{n}=(s_{0}^{n},a_{0}^{n},r_{0}^{n},s_{1}^{n},a_{1}^{n},r_{1}^{n},\dots,s_{H^{n}}^{n},a_{H^{n}}^{n},r_{H^{n}}^{n})$, where $H^{n}$ is a finite (and potentially random) stopping time of the trajectory (to be specified below), $s_{0}^{n}\sim\rho$, $a_{t}^{n}\sim\pi_{\theta^{n}}(\cdot|s_{t}^{n})$, $s_{t+1}^{n}\sim p(\cdot|s_{t}^{n},a_{t}^{n})$ and $r_{t}^{n}=r(s_{t}^{n},a_{t}^{n})$ for all $t=0,\dots,H^{n}$. We summarize the generic policy gradient method (with single-trajectory gradient estimates) in Algorithm 1. An extension to mini-batch scenarios will be discussed in §5. As is always (implicitly) assumed in the literature on episodic reinforcement learning (e.g., cf. (Marbach and Tsitsiklis 2001)), given the current policy, we assume that the sampled trajectory is conditionally independent of all previous policies and trajectories.

Algorithm 1 Policy Gradient Method with Single Trajectory Estimates
1:  Input: initial parameter $\theta^{0}$, step-sizes $\alpha^{n}$ and regularization parameters $\lambda^{n}$ ($n\geq 0$).
2:  for $n=0,1,\dots$ do
3:     Choose $H^{n}$, sample trajectory $\tau^{n}$ from $\mathcal{M}$ following policy $\pi_{\theta^{n}}$, and compute an approximate gradient $\widehat{\nabla}_{\theta}L_{\lambda^{n}}(\theta^{n})$ of $L_{\lambda^{n}}$ using trajectory $\tau^{n}$.
4:     Update $\theta^{n+1}=\theta^{n}+\alpha^{n}\widehat{\nabla}_{\theta}L_{\lambda^{n}}(\theta^{n})$.
5:  end for
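For reference, the following is a minimal Python sketch of the loop in Algorithm 1, with the gradient estimator left abstract; the callable names (grad_estimator, alpha, lam, horizon) are hypothetical placeholders, and the schedules are left unspecified since the analyzed choices only appear in §4.

def policy_gradient(theta0, grad_estimator, alpha, lam, horizon, n_episodes):
    """Generic Algorithm 1: theta^{n+1} = theta^n + alpha^n * grad-hat L_{lambda^n}(theta^n).

    grad_estimator(theta, lam, H) is assumed to sample one trajectory of length H under
    pi_theta and return a stochastic gradient of L_lambda at theta (e.g., REINFORCE, cf. §3.2).
    alpha, lam, horizon are callables mapping the episode index n to the step-size,
    regularization parameter, and truncation length, respectively.
    """
    theta = theta0.copy()
    for n in range(n_episodes):
        g = grad_estimator(theta, lam(n), horizon(n))
        theta = theta + alpha(n) * g
    return theta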

REINFORCE algorithm.

There are several ways of choosing the stochastic gradient operator $\widehat{\nabla}_{\theta}$ in the policy gradient method, and the well-known REINFORCE algorithm (Williams 1992) corresponds to a specific family of estimators based on the policy gradient theorem (Sutton et al. 2000) (cf. §3). Other common alternatives include zeroth-order/random search (Fazel et al. 2018; Malik et al. 2018) and actor-critic (Konda and Tsitsiklis 2003) approximations. One may also choose to parametrize the policy as a mapping from the parameter space to a specific action, which would then result in deterministic policy gradient approximations (Silver et al. 2014).

Although our main goal is to study the REINFORCE algorithm, our analysis indeed holds for rather generic stochastic gradient estimates. In the next section, we introduce the (mild) assumptions needed for our convergence analysis and the detailed gradient estimation procedures in the REINFORCE algorithm, and then verify that the assumptions do hold for these gradient estimations.

2.3 Phased learning and performance criteria

Phased learning.

To facilitate the exposition below, we divide the optimization in Algorithm 1 into successive phases $l=0,1,\dots$, each with length $T_{l}>0$. We then fix the regularization coefficient $\lambda^{l}$ within each phase $l\geq 0$. In addition, a post-processing step is enforced at the end of each phase to produce the initialization of the next phase. The resulting algorithm is described in Algorithm 2. Here the trajectory is denoted as $\tau^{l,k}=(s_{0}^{l,k},a_{0}^{l,k},r_{0}^{l,k},\dots,s_{H^{l,k}}^{l,k},a_{H^{l,k}}^{l,k},r_{H^{l,k}}^{l,k})$, and we will refer to $\theta^{l,k}$ as the $(l,k)$-th iterate hereafter. The post-processing function is required to guarantee that the resulting policy $\pi_{\theta}$ is lower bounded by a pre-specified tolerance $\epsilon_{\rm pp}\in(0,1/A]$ to ensure that the regularization is bounded (cf. Algorithm 3 for a formal description and §3.1 for an example realization).

Note that here the $k$-th episode in phase $l$ corresponds to the $n$-th episode in the original indexing with $n=\sum_{j=0}^{l-1}T_{j}+k$. For notational compactness below, for $\mathcal{T}=\{T_{j}\}_{j=0}^{\infty}$, we define $B_{\mathcal{T}}:\mathbf{Z}_{+}\times\mathbf{Z}_{+}\rightarrow\mathbf{Z}_{+}$, where $B_{\mathcal{T}}(l,k)=\sum_{j=0}^{l-1}T_{j}+k$ maps the double index $(l,k)$ to the corresponding original episode number, with $\mathop{\bf dom}B_{\mathcal{T}}=\{(l,k)\in\mathbf{Z}_{+}\times\mathbf{Z}_{+}\,|\,0\leq k\leq T_{l}-1\}$. The mapping $B_{\mathcal{T}}$ is a bijection and we denote its inverse by $G_{\mathcal{T}}$.
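For concreteness, the index maps $B_{\mathcal{T}}$ and $G_{\mathcal{T}}$ could be realized as in the following small sketch (our own helper code; the doubling schedule $T_{l}=2^{l}$ used later in Theorem 6 serves as the example).

def B(l, k, T):
    """Map the (l, k)-th iterate to its original episode number: sum_{j < l} T_j + k."""
    assert 0 <= k < T(l)
    return sum(T(j) for j in range(l)) + k

def G(n, T):
    """Inverse map: recover (l, k) from the original episode number n."""
    l = 0
    while n >= T(l):
        n -= T(l)
        l += 1
    return l, n

T = lambda l: 2 ** l          # doubling phase lengths, as in Theorem 6
assert G(B(3, 5, T), T) == (3, 5)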

Algorithm 2 Phased Policy Gradient Method
1:  Input: initial parameter $\tilde{\theta}^{0,0}$, step-sizes $\alpha^{l,k}$, regularization parameters $\lambda^{l}$, phase lengths $T_{l}$ ($l,\,k\geq 0$) and post-processing tolerance $\epsilon_{\rm pp}\in(0,1/A]$.
2:  Set $\theta^{0,0}=\texttt{PostProcess}(\tilde{\theta}^{0,0},\epsilon_{\rm pp})$.
3:  for phase $l=0,1,2,\dots$ do
4:     for episode $k=0,1,\dots,T_{l}-1$ do
5:        Choose $H^{l,k}$, sample trajectory $\tau^{l,k}$ from $\mathcal{M}$ following policy $\pi_{\theta^{l,k}}$, and compute an approximate gradient $\widehat{\nabla}_{\theta}L_{\lambda^{l}}(\theta^{l,k})$ of $L_{\lambda^{l}}$ using trajectory $\tau^{l,k}$.
6:        Update $\theta^{l,k+1}=\theta^{l,k}+\alpha^{l,k}\widehat{\nabla}_{\theta}L_{\lambda^{l}}(\theta^{l,k})$.
7:     end for
8:     Set $\theta^{l+1,0}=\texttt{PostProcess}(\theta^{l,T_{l}},\epsilon_{\rm pp})$.
9:  end for
Algorithm 3 $\texttt{PostProcess}(\theta,\epsilon_{\rm pp})$
  Input: $\epsilon_{\rm pp}\in(0,1/A]$, $\theta\in\Theta$.
  Return $\theta^{\prime}$ (near $\theta$) such that $\pi_{\theta^{\prime}}(s,a)\geq\epsilon_{\rm pp}$ for each $(s,a)\in\mathcal{S}\times\mathcal{A}$.

Performance criteria.

The criterion we adopt to evaluate the performance of Algorithm 2 is regret. For any $N\geq 0$, the regret up to episode $N$ is defined as the cumulative sub-optimality of the policy over these episodes. Formally, we define

\[
\mathbf{regret}(N)=\sum_{\{(l,k)\,|\,B_{\mathcal{T}}(l,k)\leq N\}}F^{\star}-\hat{F}^{l,k}(\pi_{\theta^{l,k}}). \tag{6}
\]

Here the summation is over all $(l,k)$-th iterates whose corresponding original episode numbers are smaller than or equal to $N$, and $\hat{F}^{l,k}(\pi_{\theta^{l,k}})=\mathbf{E}_{l,k}\sum_{t=0}^{H^{l,k}}\gamma^{t}r(s_{t}^{l,k},a_{t}^{l,k})$, where $s_{0}^{l,k}\sim\rho$, $a_{t}^{l,k}\sim\pi_{\theta^{l,k}}(\cdot|s_{t}^{l,k})$, $s_{t+1}^{l,k}\sim p(\cdot|s_{t}^{l,k},a_{t}^{l,k})$ for all $t\geq 0$, and $\mathbf{E}_{l,k}$ denotes the conditional expectation given the $(l,k)$-th iterate $\theta^{l,k}$. Notice that the regret defined above takes into account the fact that the trajectories are stopped/truncated to have finite horizons $H^{l,k}$, which characterizes the actual loss caused by sampling the trajectories in line 5 of Algorithm 2. A similar regret definition for the episodic (discounted) reinforcement learning setting considered here is adopted in (Fu, Yang, and Wang 2020). We remark that all our regret bounds remain correct up to lower order terms when we replace $\hat{F}^{l,k}$ with $F$ or an expectation-free version.

Similarly, we also define the single-phase version of regret as follows. The regret up to episode $K\in\{0,\dots,T_{l}-1\}$ in phase $l$ is defined as

\[
\mathbf{regret}_{l}(K)=\sum_{k=0}^{K}F^{\star}-\hat{F}^{l,k}(\pi_{\theta^{l,k}}). \tag{7}
\]

Notice that (6) and (7) are connected via

\[
\mathbf{regret}(N)=\sum_{l=0}^{l_{N}-1}\mathbf{regret}_{l}(T_{l}-1)+\mathbf{regret}_{l_{N}}(k_{N}), \tag{8}
\]

where $(l_{N},k_{N})=G_{\mathcal{T}}(N)$.

We provide high probability regret bounds in §4. We remark that a regret bound of the form $\mathbf{regret}(N)/(N+1)\leq R$ (for some $R>0$) immediately implies that $\min_{l,k:\,B_{\mathcal{T}}(l,k)\leq N}F^{\star}-F(\pi_{\theta^{l,k}})\leq R$, where the latter is also a commonly adopted performance criterion in the literature (Agarwal et al. 2019; Wang et al. 2019).

3 Assumptions and REINFORCE gradients

3.1 Assumptions

Here we list a few fundamental assumptions that we require for our analysis.

Assumption 1 (Setting).

The regularization term is a log-barrier, i.e.,

\[
R(\theta)=\tfrac{1}{SA}\sum_{s\in\mathcal{S},a\in\mathcal{A}}\log(\pi_{\theta}(s,a)),
\]

and the policy is parametrized to be a soft-max, i.e., $\pi_{\theta}(s,a)=\frac{\exp(\theta_{s,a})}{\sum_{a^{\prime}\in\mathcal{A}}\exp(\theta_{s,a^{\prime}})}$, with $\Theta=\mathbf{R}^{SA}$.

The first assumption concerns the form of the policy parametrization and the regularization. Notice that the regularization term here can also be seen as a relative entropy/KL regularization (with a uniform reference policy) (Agarwal et al. 2019). Such regularization terms are also widely adopted in practice (although typically with variations) (Peters, Mülling, and Altun 2010; Schulman, Chen, and Abbeel 2017).

With Assumption 1, the post-processing function in Algorithm 3 can, for example, be realized by first calculating $\hat{\pi}=\epsilon_{\rm pp}{\bf 1}+(1-A\epsilon_{\rm pp})\pi_{\theta}$ and then returning $\theta^{\prime}$ with $\theta^{\prime}_{s,a}=\log\hat{\pi}_{s,a}+c_{s}$. Here $\bf 1$ is the all-ones vector and $c_{s}\in\mathbf{R}$ ($s=1,\dots,S$) are arbitrary real numbers.
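Under Assumption 1, the soft-max parametrization, the log-barrier regularizer and the PostProcess realization described above can be sketched as follows (illustrative Python; the function names are ours, and the constants $c_{s}$ are simply set to zero).

import numpy as np

def softmax_policy(theta):
    """pi_theta(s, a) = exp(theta_{s,a}) / sum_{a'} exp(theta_{s,a'}), row-wise over actions."""
    z = theta - theta.max(axis=1, keepdims=True)   # shift rows for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def log_barrier(theta):
    """R(theta) = (1/(S*A)) * sum_{s,a} log pi_theta(s, a)."""
    return np.log(softmax_policy(theta)).mean()

def post_process(theta, eps_pp):
    """Return theta' with pi_{theta'}(s, a) >= eps_pp: mix with the uniform policy, then take logs."""
    A = theta.shape[1]
    assert 0 < eps_pp <= 1.0 / A
    pi_hat = eps_pp + (1.0 - A * eps_pp) * softmax_policy(theta)
    return np.log(pi_hat)       # the additive constants c_s are taken to be zero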

Assumption 2 (Policy gradient estimator).

There exist constants $C,\,C_{1},\,C_{2},\,M_{1},\,M_{2}>0$ such that for all $l,\,k\geq 0$, we have $\|\widehat{\nabla}_{\theta}L_{\lambda^{l}}(\theta^{l,k})\|_{2}\leq C_{1}$ almost surely and

\[
\begin{split}
&\nabla_{\theta}L_{\lambda^{l}}(\theta^{l,k})^{T}\,\mathbf{E}_{l,k}\widehat{\nabla}_{\theta}L_{\lambda^{l}}(\theta^{l,k})\geq C_{2}\|\nabla_{\theta}L_{\lambda^{l}}(\theta^{l,k})\|_{2}^{2}-\delta_{l,k},\\
&\mathbf{E}_{l,k}\|\widehat{\nabla}_{\theta}L_{\lambda^{l}}(\theta^{l,k})\|_{2}^{2}\leq M_{1}+M_{2}\|\nabla_{\theta}L_{\lambda^{l}}(\theta^{l,k})\|_{2}^{2},
\end{split}
\]

where $\sum_{k=0}^{T_{l}-1}\delta_{l,k}^{2}\leq C$ for all $l\geq 0$. In addition, $H^{l,k}\geq\log_{1/\gamma}(k+1)$ for all $l,\,k\geq 0$.

The second assumption requires that the gradient estimates are almost surely bounded, nearly unbiased and satisfy a bounded second-order moment growth condition. This is a slight generalization of standard assumptions in the stochastic gradient descent literature (Bottou, Curtis, and Nocedal 2018). Additionally, we also require that the trajectory lengths $H^{l,k}$ grow at least logarithmically in $k$ to control the loss of rewards due to truncation. For notational simplicity, hereafter we omit the trajectory sampling (i.e., $s_{0}\sim\rho$, $a_{t}^{l,k}\sim\pi_{\theta^{l,k}}(\cdot|s_{t}^{l,k})$, $s_{t+1}^{l,k}\sim p(\cdot|s_{t}^{l,k},a_{t}^{l,k})$ for all $t\geq 0$) when we write $\mathbf{E}_{l,k}$.

Notice that Assumption 2 immediately holds if $\widehat{\nabla}_{\theta}L_{\lambda^{l}}(\theta^{l,k})$ is unbiased and has a bounded second-order moment. We have implicitly assumed that $L_{\lambda}$ is differentiable, which is justified by the following result:

Proposition 1 ((Agarwal et al. 2019, Lemma E.4)).

Under Assumption 1, $L_{\lambda}$ is strongly smooth with parameter $\beta_{\lambda}=\frac{8}{(1-\gamma)^{3}}+\frac{2\lambda}{S}$, i.e., $\|\nabla_{\theta}L_{\lambda}(\theta)-\nabla_{\theta}L_{\lambda}(\theta^{\prime})\|_{2}\leq\beta_{\lambda}\|\theta-\theta^{\prime}\|_{2}$ for any $\theta,\,\theta^{\prime}\in\mathbf{R}^{SA}$.

3.2 REINFORCE gradient estimations

Now we introduce REINFORCE gradient estimation with baselines, and specify the hyper-parameters under which the technical Assumption 2 holds when operating under the setting of Assumption 1.

REINFORCE gradient estimation with log-barrier regularization takes the following form:

\[
\begin{split}
\widehat{\nabla}_{\theta}L_{\lambda^{l}}(\theta^{l,k})=&\sum_{t=0}^{\lfloor\beta H^{l,k}\rfloor}\gamma^{t}\big(\widehat{Q}^{l,k}(s_{t}^{l,k},a_{t}^{l,k})-b(s_{t}^{l,k})\big)\nabla_{\theta}\log\pi_{\theta^{l,k}}(a_{t}^{l,k}|s_{t}^{l,k})\\
&+\tfrac{\lambda^{l}}{SA}\sum_{s\in\mathcal{S},a\in\mathcal{A}}\nabla_{\theta}\log\pi_{\theta^{l,k}}(a|s),
\end{split} \tag{9}
\]

where $\beta\in(0,1)$, $\widehat{Q}^{l,k}(s_{t}^{l,k},a_{t}^{l,k})=\sum_{t^{\prime}=t}^{H^{l,k}}\gamma^{t^{\prime}-t}r_{t^{\prime}}^{l,k}$, and the second term above corresponds to the gradient of the regularization term $\lambda^{l}R(\theta)$. Notice that here the outer summation only runs up to $\lfloor\beta H^{l,k}\rfloor$, which ensures that $\widehat{Q}^{l,k}(s_{t}^{l,k},a_{t}^{l,k})$ is sufficiently accurate. Here $b:\mathcal{S}\rightarrow\mathbf{R}$ is called the baseline, and is required to be independent of the trajectory $\tau^{l,k}$ (Agarwal, Jiang, and Kakade 2019, §4.1). The purpose of subtracting $b$ from the approximate $Q$-values is to (potentially) reduce the variance of the "plain" REINFORCE gradient estimation, which corresponds to the case $b=0$.
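The following sketch computes the estimate (9) in the tabular soft-max case from one sampled trajectory, assuming a constant baseline for simplicity; it reuses softmax_policy from the sketch in §3.1, and the closed form used for the regularization gradient, $\tfrac{\lambda^{l}}{SA}\sum_{s,a}\nabla_{\theta}\log\pi_{\theta}(a|s)=\tfrac{\lambda^{l}}{SA}(1-A\,\pi_{\theta})$ entry-wise, follows from the soft-max parametrization.

import numpy as np
# softmax_policy is the helper from the sketch in Section 3.1.

def grad_log_pi(pi, s, a):
    """Gradient of log pi_theta(a|s) w.r.t. theta for the soft-max parametrization:
    row s equals e_a - pi(s, .), all other rows are zero."""
    g = np.zeros_like(pi)
    g[s] = -pi[s]
    g[s, a] += 1.0
    return g

def reinforce_gradient(theta, traj, gamma, lam, beta=0.5, baseline=0.0):
    """Estimate (9): discounted sum of (Q-hat - b) * grad log pi over t <= floor(beta*H),
    plus the gradient of lam * R(theta). traj is a list of (s_t, a_t, r_t) tuples of length H+1."""
    S, A = theta.shape
    pi = softmax_policy(theta)
    H = len(traj) - 1
    rewards = np.array([r for (_, _, r) in traj])
    g = np.zeros_like(theta)
    for t in range(int(beta * H) + 1):
        s_t, a_t, _ = traj[t]
        # reward-to-go Q-hat(s_t, a_t) = sum_{t' >= t} gamma^{t'-t} r_{t'}
        q_hat = np.sum(rewards[t:] * gamma ** np.arange(H + 1 - t))
        g += (gamma ** t) * (q_hat - baseline) * grad_log_pi(pi, s_t, a_t)
    # gradient of lam * R(theta) = (lam/(S*A)) * sum_{s,a} grad log pi_theta(a|s)
    g += lam / (S * A) * (1.0 - A * pi)
    return g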

With this we have the following result, the proof of which can be found in the Appendix.

Lemma 2.

Suppose that Assumption 1 holds, $\beta\in(0,1)$, and that for all $l,\,k\geq 0$, $\lambda^{l}\leq\bar{\lambda}$ and

\[
H^{l,k}\geq\tfrac{2\log_{1/\gamma}\left(\frac{8(k+1)}{(1-\gamma)^{3}}\right)}{3\min\{\beta,1-\beta\}}\;\big(=\Theta(\log(k+1))\big). \tag{10}
\]

Assume in addition that $|b(s)|\leq B$ for any $s\in\mathcal{S}$, where $B>0$ is a constant. Then for the gradient estimation (9), Assumption 2 holds with

\[
\begin{split}
&C=16\Big(\tfrac{1}{(1-\gamma)^{2}}+\bar{\lambda}\Big)^{2},\quad C_{1}=\tfrac{2(1+B(1-\gamma))}{(1-\gamma)^{2}}+2\bar{\lambda},\\
&C_{2}=1,\quad M_{1}=\tfrac{32}{(1-\gamma)^{4}}+\bar{V}_{b},\quad M_{2}=2,
\end{split}
\]

and $\delta_{l,k}=\Big(\tfrac{2}{(1-\gamma)^{2}}+2\bar{\lambda}\Big)(k+1)^{-\frac{2}{3}}$ for all $l,\,k\geq 0$. Here $\bar{V}_{b}\in\Big[0,4\Big(\tfrac{1+B(1-\gamma)}{(1-\gamma)^{2}}+\bar{\lambda}\Big)^{2}\Big]$ is the uniform upper bound on the variances of policy gradient estimations of the form (9).

This result extends without modification to non-stationary baselines $b_{t}^{l,k}$, as long as each $b_{t}^{l,k}$ is independent of trajectory $\tau^{l,k}$ and $|b_{t}^{l,k}(s)|\leq B$ for any $t,\,l,\,k\geq 0$. Note that the explicit upper bound on $\bar{V}_{b}$ is pessimistic, and in practice $\bar{V}_{b}$ is usually much smaller than $\bar{V}_{0}$ with appropriate choices of baselines (e.g., the adaptive reinforcement baselines (Williams 1992; Zhao et al. 2011)), although the latter has a smaller upper bound as stated in Lemma 2.
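As one concrete (and purely illustrative) choice of such a baseline, the sketch below maintains a per-state exponential moving average of observed reward-to-go values, updated only between episodes so that independence from the current trajectory is preserved; the class name, decay factor and clipping constant are our own assumptions.

import numpy as np

class MovingAverageBaseline:
    """A per-state baseline b(s): an exponential moving average of reward-to-go values,
    updated only after each episode so that b is independent of the current trajectory."""

    def __init__(self, n_states, decay=0.9, b_max=10.0):
        self.b = np.zeros(n_states)
        self.decay = decay
        self.b_max = b_max                      # keeps |b(s)| <= B as assumed in Lemma 2

    def __call__(self, s):
        return self.b[s]

    def update(self, traj, gamma):
        """Update from a finished trajectory given as a list of (s_t, a_t, r_t) tuples."""
        H = len(traj) - 1
        rewards = np.array([r for (_, _, r) in traj])
        for t, (s, _, _) in enumerate(traj):
            q_hat = np.sum(rewards[t:] * gamma ** np.arange(H + 1 - t))
            self.b[s] = self.decay * self.b[s] + (1 - self.decay) * q_hat
        np.clip(self.b, -self.b_max, self.b_max, out=self.b)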

4 Main convergence results

4.1 Preliminary tools

We first present some preliminary tools for our analysis.

Non-convexity and control of “bad” episodes.

One of the key difficulties in applying policy gradient methods to solve an MDP problem to global optimality is that problem (3) is in general non-convex (Agarwal et al. 2019). Fortunately, we have the following result, which connects the gradient of the surrogate objective $L_{\lambda}$ with the global optimality gap of the original optimization problem (3).

Proposition 3 ((Agarwal et al. 2019, Theorem 5.3)).

Under Assumption 1, for any $\epsilon>0$, suppose that we have $\|\nabla_{\theta}L_{\lambda}(\theta)\|_{2}\leq\epsilon$ and that $\epsilon\leq\lambda/(2SA)$. Then $F^{\star}-F(\pi_{\theta})\leq\frac{2\lambda}{1-\gamma}\left\|\frac{d_{\rho}^{\pi^{\star}}}{\rho}\right\|_{\infty}$.

Here for any policy $\pi\in\Pi$, $d_{\rho}^{\pi}(s)=(1-\gamma)\sum_{t=0}^{\infty}\gamma^{t}\mathbf{Prob}_{\pi}(s_{t}=s|s_{0}\sim\rho)$ is the discounted state visitation distribution, where $\mathbf{Prob}_{\pi}(s_{t}=s|s_{0}\sim\rho)$ is the probability of arriving at $s$ in step $t$ starting from $s_{0}\sim\rho$ and following policy $\pi$ in $\mathcal{M}$. In addition, the division in $d_{\rho}^{\pi^{\star}}/\rho$ is component-wise.

Now, motivated by Proposition 3, when analyzing the regret up to episode $K$ in phase $l$, we define the following set of "bad episodes":

\[
I^{+}=\{k\in\{0,\dots,K\}\,|\,\|\nabla_{\theta}L_{\lambda^{l}}(\theta^{l,k})\|_{2}\geq\lambda^{l}/(2SA)\}.
\]

Then one can show that for any $\epsilon>0$, if we choose $\lambda^{l}=\frac{\epsilon(1-\gamma)}{2}$, we have $F^{\star}-F(\pi_{\theta^{l,k}})\leq\|d_{\rho}^{\pi^{\star}}/\rho\|_{\infty}\,\epsilon$ for any $k\in\{0,\dots,K\}\backslash I^{+}$, while $F^{\star}-F(\pi_{\theta^{l,k}})\leq 1/(1-\gamma)$ holds trivially for $k\in I^{+}$ due to the assumption that the rewards are between $0$ and $1$. We then establish a sub-linear (in $K$) bound on the size of $I^{+}$, which serves as the key stepping stone for the overall sub-linear regret bound. The details of these arguments can be found in the Appendix.

Doubling trick.

The second tool is a classical doubling trick that is commonly adopted in the design of online learning algorithms (Besson and Kaufmann 2018; Basei, Guo, and Hu 2020), which can be used to stitch together the regret over multiple learning phases in Algorithm 2.

Notice that Proposition 3 suggests that for any pre-specified tolerance $\epsilon$, one can select $\lambda$ proportional to $\epsilon$ and then run (stochastic) gradient ascent to drive $F^{\star}-F(\pi_{\theta})$ below the tolerance. To obtain the eventual convergence and regret bound in the long run, we apply the doubling trick, which specifies a growing phase length sequence with $T_{l+1}\approx 2T_{l}$ in Algorithm 2 and a suitably decaying sequence of regularization parameters $\{\lambda^{l}\}_{l=0}^{\infty}$.
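Concretely, the schedule used in Theorem 6 below can be generated as in this sketch: phase lengths $T_{l}=2^{l}$, tolerances $\epsilon^{l}=T_{l}^{-1/6}$, regularization $\lambda^{l}=\epsilon^{l}(1-\gamma)/2$, and a step-size constant taken (as one admissible choice) to be $1/(2\beta_{\lambda^{l}})$.

import math

def phase_schedule(l, gamma, S):
    """Hyper-parameters for phase l under the doubling trick of Theorem 6 (one admissible choice)."""
    T_l = 2 ** l                                             # T_{l+1} = 2 * T_l
    eps_l = T_l ** (-1.0 / 6.0)
    lam_l = eps_l * (1.0 - gamma) / 2.0
    beta_lam = 8.0 / (1.0 - gamma) ** 3 + 2.0 * lam_l / S    # smoothness constant (Proposition 1)
    C_alpha = 1.0 / (2.0 * beta_lam)                         # upper endpoint of the allowed range
    alpha = lambda k: C_alpha / (math.sqrt(k + 3) * math.log2(k + 3))
    return T_l, lam_l, alpha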

From high probability to almost sure convergence.

The last tool is the observation that an arbitrary anytime sub-linear high probability regret bound with logarithmic dependence on $1/\delta$ immediately leads to almost sure convergence of the average regret with a corresponding asymptotic rate. Although this observation seems to be informally well-known in the theoretical computer science community, we provide a compact formal statement below to keep the paper self-contained.

Lemma 4.

Suppose that for all $\delta\in(0,1)$, with probability at least $1-\delta$, for all $N\geq 0$, we have

\[
\mathbf{regret}(N)\leq d_{1}(N+c)^{d_{2}}(\log(N/\delta))^{d_{3}}+d_{4}(\log N)^{d_{5}} \tag{11}
\]

for some constants $c,\,d_{1},\,d_{3},\,d_{4},\,d_{5}\geq 0$ and $d_{2}\in[0,1)$. Then we also have

\[
\mathbf{Prob}\left(\text{$\exists$ $\bar{N}\in\mathbf{Z}_{+}$ such that $A_{N}$ holds for all $N\geq\bar{N}$}\right)=1,
\]

where the events $A_{N}=\{\mathbf{regret}(N)/(N+1)\leq(*)\}$, and

\[
(*)=d_{1}N^{-(1-d_{2})}\left(1+\tfrac{c}{N}\right)^{d_{2}}(3\log N)^{d_{3}}+\tfrac{d_{4}(\log N)^{d_{5}}}{N}.
\]

To put it another way, we have

\[
\lim_{N\rightarrow\infty}\mathbf{regret}(N)/(N+1)=0\quad\text{almost surely}
\]

with an asymptotic rate of $(*)$.

Notice that here we restrict the right-hand side of (11) to a rather specific form simply because our bounds below are all of this form. However, similar results hold for much more general forms of bounds.

4.2 Regret analysis

In this section, we establish the regret bound of Algorithm 2, when used with the REINFORCE gradient estimator from §3.2. We begin by bounding the regret of a single phase and then use the doubling trick to combine these into the overall regret bound.

Single phase analysis.

We begin by bounding the regret, defined in (7), of each phase in Algorithm 2. Note that a single phase in Algorithm 2 is exactly Algorithm 1 terminated in episode $T_{l}$, with $\lambda^{n}=\lambda^{l}$ for all $n\geq 0$ and $\theta^{0}=\theta^{l,0}$. Also notice that for a given phase $l\geq 0$, in order for Theorem 5 below to hold, we actually only need the conditions in Assumption 2 to be satisfied for this specific $l$.

Theorem 5.

Under Assumptions 1 and 2, for phase $l\geq 0$ suppose that we choose $\alpha^{l,k}=C_{l,\alpha}\frac{1}{\sqrt{k+3}\log_{2}(k+3)}$ for some $C_{l,\alpha}\in(0,C_{2}/(M_{2}\beta_{\lambda^{l}})]$. Then for any $\epsilon>0$, if we choose $\lambda^{l}=\frac{\epsilon(1-\gamma)}{2}$, then for all $\delta\in(0,1)$, with probability at least $1-\delta$, for any $K\in\{0,\dots,T_{l}-1\}$, we have

\[
\begin{split}
\mathbf{regret}_{l}(K)\leq&\ U_{1}\tfrac{\sqrt{K+1}\,\log_{2}(K+3)\sqrt{\log(2/\delta)}}{\epsilon^{2}}\\
&+(K+1)\Big\|\tfrac{d_{\rho}^{\pi^{\star}}}{\rho}\Big\|_{\infty}\epsilon+\tfrac{2\gamma}{1-\gamma}\log(K+3).
\end{split} \tag{12}
\]

Here the constant $U_{1}$ only depends on the underlying MDP $\mathcal{M}$, the phase initialization $\theta^{l,0}$ and the constants $C,\,C_{1},\,C_{2},\,M_{1},\,C_{l,\alpha},\,\lambda^{l}$.

The proof as well as a more formal statement of Theorem 5 with details of the constants (cf. Theorem 9) are deferred to the Appendix. Here the constant $\beta_{\lambda}$ is the smoothness constant from Proposition 1. We remark that when $\epsilon$ is fixed, the regret bound (12) can be seen as a sub-linear (in $K$ as $K\rightarrow\infty$) regret term plus an error term $(K+1)\epsilon+\frac{2\gamma}{1-\gamma}\log(K+3)$. Alternatively, one can interpret it as follows:

\[
\dfrac{\mathbf{regret}_{l}(K)}{K+1}\leq\ U_{1}\dfrac{\log_{2}(K+3)\sqrt{\log(2/\delta)}}{\sqrt{K+1}\,\epsilon^{2}}+\dfrac{2\gamma}{1-\gamma}\dfrac{\log(K+3)}{K+1}+\Big\|\tfrac{d_{\rho}^{\pi^{\star}}}{\rho}\Big\|_{\infty}\epsilon.
\]

Namely, the average regret in phase $l$ converges to a constant multiple of the pre-specified tolerance $\epsilon$ at a sub-linear rate (as $K\rightarrow\infty$).

Overall regret bound.

Now we stitch together the single phase regret bounds established above to obtain the overall regret bound of Algorithm 2, with the help of the doubling trick. This leads to the following theorem.

Theorem 6 (Regret for REINFORCE).

Under Assumption 1, suppose that for each $l\geq 0$ we choose $\alpha^{l,k}=C_{l,\alpha}\frac{1}{\sqrt{k+3}\log_{2}(k+3)}$, with $C_{l,\alpha}\in[1/(2\beta_{\bar{\lambda}}),1/(2\beta_{\lambda^{l}})]$ and $\bar{\lambda}=\frac{1-\gamma}{2}$, and choose $T_{l}=2^{l}$, $\epsilon^{l}=T_{l}^{-1/6}=2^{-l/6}$, $\lambda^{l}=\frac{\epsilon^{l}(1-\gamma)}{2}$ and $\epsilon_{\rm pp}=1/(2A)$. In addition, suppose that (9) is adopted to evaluate $\widehat{\nabla}_{\theta}L_{\lambda^{l}}(\theta^{l,k})$, with $\beta\in(0,1)$, $|b(s)|\leq B$ for any $s\in\mathcal{S}$ (where $B>0$ is a constant), and that (10) holds for $H^{l,k}$ for all $l,\,k\geq 0$. Then for any $\delta\in(0,1)$, with probability at least $1-\delta$, for any $N\geq 0$, we have

\[
\mathbf{regret}(N)=O\Big(\Big(\tfrac{S^{2}A^{2}}{(1-\gamma)^{7}}+\Big\|\tfrac{d_{\rho}^{\pi^{\star}}}{\rho}\Big\|_{\infty}\Big)N^{\frac{5}{6}}\big(\log\tfrac{N}{\delta}\big)^{\frac{5}{2}}\Big). \tag{13}
\]

In addition, we have

\[
\lim_{N\rightarrow\infty}\mathbf{regret}(N)/(N+1)=0\quad\text{almost surely} \tag{14}
\]

with asymptotic rate $O\Big(\Big(\frac{S^{2}A^{2}}{(1-\gamma)^{7}}+\Big\|\frac{d_{\rho}^{\pi^{\star}}}{\rho}\Big\|_{\infty}\Big)N^{-\frac{1}{6}}(\log N)^{\frac{5}{2}}\Big)$.

Note that the almost sure convergence (14) is immediately implied by the high probability bound (13) via Lemma 4. Here, for clarity, we have restricted the statement to the case where the REINFORCE gradient estimation from §3.2 is used. A more general counterpart can be found in Appendix B.3, from which Theorem 6 follows immediately. See also Appendix C for a more formal statement of the regret bound (cf. Corollary 11) for REINFORCE with detailed constants.

Notice that compared to the single-phase regret bound in (12), the overall regret bound in (13) no longer depends on a pre-specified tolerance $\epsilon>0$. This is due to the adaptivity of the regularization parameter sequence. Also notice that here we have followed the convention of the reinforcement learning literature and made all the problem-dependent quantities (e.g., $\gamma,\,S,\,A$, etc.) explicit in the big-$O$ notation.

One crucial difference between our regret bound and those in the existing literature on vanilla policy gradient methods in general MDP settings (which are sometimes not stated in the form of regret, but can be easily deduced from their proofs in those cases) is that the previous results either require exact and deterministic updates or contain a non-vanishing $\Theta(1/M^{p})$ term, with $M$ being the mini-batch size (of the trajectories) and $p>0$ being some exponent (with a typical value of $1/2$). By removing such non-vanishing terms, we obtain the first sub-linear regret bound for model-free vanilla policy gradient methods with finite mini-batch sizes.

5 Extension to mini-batch updates

We now consider extending our previous results to the mini-batch setting by modifying Algorithm 2 as follows. First, in each inner iteration, instead of sampling only one trajectory in line 5, we sample $M\geq 1$ independent trajectories $\tau^{l,k}_{1},\dots,\tau^{l,k}_{M}$ from $\mathcal{M}$ following policy $\pi_{\theta^{l,k}}$ and then compute an approximate gradient $\widehat{\nabla}_{\theta}^{(i)}L_{\lambda^{l}}(\theta^{l,k})$ ($i=1,\dots,M$) using each of these $M$ trajectories. We then modify the update in line 6 to

\[
\theta^{l,k+1}=\theta^{l,k}+\alpha^{l,k}\frac{1}{M}\sum_{i=1}^{M}\widehat{\nabla}_{\theta}^{(i)}L_{\lambda^{l}}(\theta^{l,k}).
\]

See Algorithm 4 in Appendix D for a formal description of the modified algorithm.
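In code, the mini-batch modification amounts to averaging $M$ independent single-trajectory estimates before each update, as in this sketch; sample_traj is a hypothetical sampler and reinforce_gradient refers to the illustrative helper sketched in §3.2.

import numpy as np

def minibatch_update(theta, sample_traj, alpha_lk, lam_l, gamma, M, beta=0.5, baseline=0.0):
    """One inner iteration of the mini-batch variant: average M independent REINFORCE estimates.

    sample_traj(theta) is assumed to sample one trajectory under pi_theta (with the required
    horizon) and return it as a list of (s_t, a_t, r_t) tuples.
    """
    grads = [reinforce_gradient(theta, sample_traj(theta), gamma, lam_l, beta, baseline)
             for _ in range(M)]
    return theta + alpha_lk * np.mean(grads, axis=0)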

Regret with mini-batches.

Notice that since each inner iteration (in Algorithm 4) now consists of $M$ episodes, we need to slightly modify the definition of the regret up to episode $N$ ($N\geq 0$) as follows:

\[
\begin{split}
\mathbf{regret}(N;M)=&\sum_{\left\{(l,k)\,|\,B_{\mathcal{T}}(l,k)\leq\lfloor\frac{N}{M}\rfloor-1\right\}}M\big(F^{\star}-\hat{F}^{l,k}(\pi_{\theta^{l,k}})\big)\\
&+\left(N-M\left\lfloor\tfrac{N}{M}\right\rfloor\right)\big(F^{\star}-\hat{F}^{l_{N,M},k_{N,M}}(\pi_{\theta^{l_{N,M},k_{N,M}}})\big),
\end{split} \tag{15}
\]

where $(l_{N,M},k_{N,M})=G_{\mathcal{T}}(\lfloor N/M\rfloor)$ and $\hat{F}^{l,k}(\pi_{\theta^{l,k}})$ is the same as in (6). The above definition accounts for the fact that each of the $M$ episodes in an inner iteration/step $(l,k)$ corresponds to the same iterate $\theta^{l,k}$ and hence has the same contribution to the regret. The second term on the right-hand side accounts for the contribution of the (remaining) $N-M\lfloor N/M\rfloor$ episodes (among a total of $M$ episodes) in inner iteration/step $(l_{N,M},k_{N,M})$.

Then the following regret bound can be established.

Corollary 7 (Regret for mini-batch REINFORCE).

Under Assumption 1, suppose that for each $l\geq 0$ we choose $\alpha^{l,k}=C_{l,\alpha}\frac{1}{\sqrt{k+3}\log_{2}(k+3)}$, with $C_{l,\alpha}\in[1/(2\beta_{\bar{\lambda}}),1/(2\beta_{\lambda^{l}})]$ and $\bar{\lambda}=\frac{1-\gamma}{2}$, and choose $T_{l}=2^{l}$, $\epsilon^{l}=T_{l}^{-1/6}=2^{-l/6}$, $\lambda^{l}=\frac{\epsilon^{l}(1-\gamma)}{2}$ and $\epsilon_{\rm pp}=1/(2A)$. In addition, suppose that the assumptions in Lemma 12 hold (note that Assumption 1 and $\lambda^{l}\leq\bar{\lambda}$ already hold automatically under the other assumptions). Then for any $\delta\in(0,1)$, with probability at least $1-\delta$, jointly for all episodes $N\geq 0$, we have (for the mini-batch Algorithm 4)

\[
\begin{split}
\mathbf{regret}(N;M)=O\Big(&\Big(\tfrac{S^{2}A^{2}}{(1-\gamma)^{7}}+\Big\|\tfrac{d_{\rho}^{\pi^{\star}}}{\rho}\Big\|_{\infty}\Big)\\
&\times(M^{\frac{1}{6}}+M^{-\frac{5}{6}})(N+M)^{\frac{5}{6}}(\log(N/\delta))^{\frac{5}{2}}+\tfrac{M(\log N)^{2}}{1-\gamma}\Big).
\end{split}
\]

In addition, we also have

\[
\lim_{N\rightarrow\infty}\mathbf{regret}(N;M)/(N+1)=0\quad\text{almost surely}
\]

with an asymptotic rate of

\[
\begin{split}
O\Big(&\Big(\tfrac{S^{2}A^{2}}{(1-\gamma)^{7}}+\Big\|\tfrac{d_{\rho}^{\pi^{\star}}}{\rho}\Big\|_{\infty}\Big)\\
&\times(M^{\frac{1}{6}}+M^{-\frac{5}{6}})N^{-\frac{1}{6}}\left(1+\tfrac{M}{N}\right)^{\frac{5}{6}}(\log N)^{\frac{5}{2}}+\tfrac{M(\log N)^{2}}{(1-\gamma)N}\Big).
\end{split}
\]

Again, we note that the almost sure convergence above is directly implied by the high probability bound via Lemma 4. The proof and a more formal statement of this corollary (cf. Corollary 13) can be found in Appendix D. In particular, when $M=1$, the bound above reduces to (13). In addition, we can see that there might be a trade-off between the terms $M^{1/6}$ and $M^{-5/6}$. The intuition behind this is a trade-off between lower variance with larger batch sizes and more frequent updates with smaller batch sizes.

6 Conclusion and open problems

In this work, we establish the global convergence rates of practical policy gradient algorithms with a fixed size mini-batch of trajectories combined with REINFORCE gradient estimation.

Although in §4 and §5 we only instantiate the bounds for the REINFORCE gradient estimators, we note that our general results (in particular, Theorem 10 in Appendix B.3) can be easily applied to other gradient estimators (e.g., actor-critic and state-action visitation measure based estimators) as well, as long as one can verify the existence of the constants in Assumption 2 in a similar way to Lemma 2. In addition, one can also easily derive sample complexity results as by-products of our analysis. In fact, our proof of Theorem 5 immediately implies an $\tilde{O}(1/\epsilon^{4})$ sample complexity bound (for Algorithm 1 with REINFORCE gradient estimators and a constant regularization parameter) for any pre-specified tolerance $\epsilon>0$, where we use $\tilde{O}$ to denote the big-$O$ notation with logarithmic terms suppressed. We have focused only on regret in this paper mainly for clarity.

It is also relatively straightforward to extend our results to finite-horizon non-stationary settings, in which the soft-max policy parametrization has dimension $SAH$ and different policy gradient estimators can be adopted (without trajectory truncation), with $H$ being the horizon of each episode. In this case, it is also easy to rewrite the regret bound as a function of the total number of time steps $T\leq HN$, where $N$ is the total number of episodes. Other straightforward extensions include refined convergence to stationary points (in both the almost sure and high probability senses, with no requirement of large batch sizes), and inexact convergence results when $\delta_{l,k}$ (cf. Assumption 2) is not square summable (e.g., when $H^{l,k}$ is fixed or does not grow sufficiently fast).

There are also several open problems that may be resolved by combining the techniques introduced in this paper with existing results in the literature. Firstly, it would be desirable to remove the "exploration" assumption that the initial distribution $\rho$ is component-wise positive. This may be achieved by combining our results with the policy cover technique in (Agarwal et al. 2020) or the optimistic bonus tricks in (Cai et al. 2019; Efroni et al. 2020). Secondly, the bounds in our paper are likely far from optimal (i.e., sharp). Hence it would be desirable to either refine our analysis or apply our techniques to accelerated policy gradient methods (e.g., IS-MBPG (Huang et al. 2020)) to obtain better global convergence rates and/or last-iterate convergence. Thirdly, it would be very interesting to see whether global convergence results still hold for REINFORCE when the relative entropy regularization term used in this paper is replaced with the entropy regularization term commonly adopted in practice. The answer is affirmative when exact gradients are available (Mei et al. 2020; Cen et al. 2020), but it remains unknown how these results might be generalized to the stochastic settings of our paper. We conjecture that entropy regularization leads to better global convergence rates and, since it is uniformly bounded, can remove the need for the PostProcess steps in Algorithm 2. Finally, one may also consider relaxing the uniform bound assumption on the rewards $r$ to a sub-Gaussian assumption, introducing function approximation, and extending our results to natural policy gradient and actor-critic methods as well as more modern policy gradient methods like DPG, PPO and TRPO.

Acknowledgment

We would like to thank Anran Hu for pointing out a mistake in the proof of an early draft of this paper. We thank Guillermo Angeris, Shane Barratt, Haotian Gu, Xin Guo, Yusuke Kikuchi, Bennet Meyers, Xinyue Shen, Mahan Tajrobehkar, Jonathan Tuck, Jiaming Wang and Xiaoli Wei for their feedback on some preliminary results in this paper. We thank Junyu Zhang for several detailed and fruitful discussions of the draft. We also thank the anonymous (meta-)reviewers for the great comments and suggestions. Jongho Kim is supported by Samsung Scholarship.

References

  • Abbasi-Yadkori et al. (2019) Abbasi-Yadkori, Y.; Bartlett, P.; Bhatia, K.; Lazic, N.; Szepesvari, C.; and Weisz, G. 2019. Politex: Regret bounds for policy iteration using expert prediction. In International Conference on Machine Learning, 3692–3702.
  • Agarwal et al. (2020) Agarwal, A.; Henaff, M.; Kakade, S.; and Sun, W. 2020. PC-PG: Policy Cover Directed Exploration for Provable Policy Gradient Learning. arXiv preprint arXiv:2007.08459 .
  • Agarwal, Jiang, and Kakade (2019) Agarwal, A.; Jiang, N.; and Kakade, S. 2019. Reinforcement Learning: Theory and Algorithms. Technical report, Department of Computer Science, University of Washington.
  • Agarwal et al. (2019) Agarwal, A.; Kakade, S.; Lee, J.; and Mahajan, G. 2019. Optimality and approximation with policy gradient methods in Markov decision processes. arXiv preprint arXiv:1908.00261 .
  • Basei, Guo, and Hu (2020) Basei, M.; Guo, X.; and Hu, A. 2020. Linear Quadratic Reinforcement Learning: Sublinear Regret in the Episodic Continuous-Time Framework. arXiv preprint arXiv:2006.15316 .
  • Baxter and Bartlett (2001) Baxter, J.; and Bartlett, P. 2001. Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research 15: 319–350.
  • Bertsekas (2017) Bertsekas, D. 2017. Dynamic programming and optimal control, volume II. Athena scientific Belmont, MA, 4th edition.
  • Besson and Kaufmann (2018) Besson, L.; and Kaufmann, E. 2018. What doubling tricks can and can’t do for multi-armed bandits. arXiv preprint arXiv:1803.06971 .
  • Bhandari and Russo (2019) Bhandari, J.; and Russo, D. 2019. Global optimality guarantees for policy gradient methods. arXiv preprint arXiv:1906.01786 .
  • Bhandari and Russo (2020) Bhandari, J.; and Russo, D. 2020. A Note on the Linear Convergence of Policy Gradient Methods. arXiv preprint arXiv:2007.11120 .
  • Bottou, Curtis, and Nocedal (2018) Bottou, L.; Curtis, F.; and Nocedal, J. 2018. Optimization methods for large-scale machine learning. SIAM Review 60(2): 223–311.
  • Cai et al. (2019) Cai, Q.; Yang, Z.; Jin, C.; and Wang, Z. 2019. Provably efficient exploration in policy optimization. arXiv preprint arXiv:1912.05830 .
  • Carmona, Laurière, and Tan (2019) Carmona, R.; Laurière, M.; and Tan, Z. 2019. Linear-quadratic mean-field reinforcement learning: convergence of policy gradient methods. arXiv preprint arXiv:1910.04295 .
  • Cen et al. (2020) Cen, S.; Cheng, C.; Chen, Y.; Wei, Y.; and Chi, Y. 2020. Fast Global Convergence of Natural Policy Gradient Methods with Entropy Regularization. arXiv preprint arXiv:2007.06558 .
  • Efroni et al. (2020) Efroni, Y.; Shani, L.; Rosenberg, A.; and Mannor, S. 2020. Optimistic Policy Optimization with Bandit Feedback. arXiv preprint arXiv:2002.08243 .
  • Even-Dar, Kakade, and Mansour (2004) Even-Dar, E.; Kakade, S.; and Mansour, Y. 2004. Experts in a Markov decision process. Advances in neural information processing systems 17: 401–408.
  • Even-Dar, Kakade, and Mansour (2009) Even-Dar, E.; Kakade, S.; and Mansour, Y. 2009. Online Markov decision processes. Mathematics of Operations Research 34(3): 726–736.
  • Even-Dar and Mansour (2003) Even-Dar, E.; and Mansour, Y. 2003. Learning rates for Q-learning. Journal of machine learning Research 5(Dec): 1–25.
  • Fazel et al. (2018) Fazel, M.; Ge, R.; Kakade, S.; and Mesbahi, M. 2018. Global convergence of policy gradient methods for the linear quadratic regulator. arXiv preprint arXiv:1801.05039 .
  • Fei et al. (2020) Fei, Y.; Yang, Z.; Wang, Z.; and Xie, Q. 2020. Dynamic regret of policy optimization in non-stationary environments. arXiv preprint arXiv:2007.00148 .
  • Fu et al. (2019) Fu, Z.; Yang, Z.; Chen, Y.; and Wang, Z. 2019. Actor-critic provably finds Nash equilibria of linear-quadratic mean-field games. arXiv preprint arXiv:1910.07498 .
  • Fu, Yang, and Wang (2020) Fu, Z.; Yang, Z.; and Wang, Z. 2020. Single-Timescale Actor-Critic Provably Finds Globally Optimal Policy. arXiv preprint arXiv:2008.00483 .
  • Glynn (1986) Glynn, P. 1986. Stochastic approximation for Monte Carlo optimization. In Proceedings of the 18th conference on Winter simulation, 356–365.
  • Gu et al. (2015) Gu, S.; Levine, S.; Sutskever, I.; and Mnih, A. 2015. MuProp: Unbiased backpropagation for stochastic neural networks. arXiv preprint arXiv:1511.05176 .
  • Guo et al. (2020) Guo, X.; Hu, A.; Xu, R.; and Zhang, J. 2020. A General Framework for Learning Mean-Field Games. arXiv preprint arXiv:2003.06069 .
  • Guu et al. (2017) Guu, K.; Pasupat, P.; Liu, E.; and Liang, P. 2017. From language to programs: Bridging reinforcement learning and maximum marginal likelihood. arXiv preprint arXiv:1704.07926 .
  • Huang et al. (2020) Huang, F.; Gao, S.; Pei, J.; and Huang, H. 2020. Momentum-Based Policy Gradient Methods. arXiv preprint arXiv:2007.06680 .
  • Jaksch, Ortner, and Auer (2010) Jaksch, T.; Ortner, R.; and Auer, P. 2010. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research 11(Apr): 1563–1600.
  • Jin et al. (2018) Jin, C.; Allen-Zhu, Z.; Bubeck, S.; and Jordan, M. 2018. Is Q-learning provably efficient? In Advances in Neural Information Processing Systems, 4863–4873.
  • Jin, Schmitt, and Wen (2020) Jin, Z.; Schmitt, J.; and Wen, Z. 2020. On the Analysis of Model-free Methods for the Linear Quadratic Regulator. arXiv preprint arXiv:2007.03861 .
  • Johnson et al. (2017) Johnson, J.; Hariharan, B.; Van Der Maaten, L.; Hoffman, J.; Fei-Fei, L.; Lawrence Zitnick, C.; and Girshick, R. 2017. Inferring and executing programs for visual reasoning. In Proceedings of the IEEE International Conference on Computer Vision, 2989–2998.
  • Kakade et al. (2003) Kakade, S.; et al. 2003. On the sample complexity of reinforcement learning. Ph.D. thesis, University of London London, England.
  • Klenke (2013) Klenke, A. 2013. Probability Theory: A Comprehensive Course. Springer Science & Business Media.
  • Konda and Tsitsiklis (2003) Konda, V.; and Tsitsiklis, J. 2003. On actor-critic algorithms. SIAM journal on Control and Optimization 42(4): 1143–1166.
  • Kool, van Hoof, and Welling (2018) Kool, W.; van Hoof, H.; and Welling, M. 2018. Attention, learn to solve routing problems! arXiv preprint arXiv:1803.08475 .
  • Kool, van Hoof, and Welling (2020) Kool, W.; van Hoof, H.; and Welling, M. 2020. Estimating gradients for discrete random variables by sampling without replacement. arXiv preprint arXiv:2002.06043 .
  • Liu et al. (2019) Liu, B.; Cai, Q.; Yang, Z.; and Wang, Z. 2019. Neural proximal/trust region policy optimization attains globally optimal policy. arXiv preprint arXiv:1906.10306 .
  • Ma et al. (2016) Ma, Y.; Zhao, T.; Hatano, K.; and Sugiyama, M. 2016. An online policy gradient algorithm for Markov decision processes with continuous states and actions. Neural Computation 28(3): 563–593.
  • Malik et al. (2018) Malik, D.; Pananjady, A.; Bhatia, K.; Khamaru, K.; Bartlett, P.; and Wainwright, M. 2018. Derivative-free methods for policy optimization: Guarantees for linear quadratic systems. arXiv preprint arXiv:1812.08305 .
  • Marbach and Tsitsiklis (2001) Marbach, P.; and Tsitsiklis, J. 2001. Simulation-based optimization of Markov reward processes. IEEE Transactions on Automatic Control 46(2): 191–209.
  • Mazumdar et al. (2019) Mazumdar, E.; Ratliff, L.; Jordan, M.; and Sastry, S. 2019. Policy-Gradient Algorithms Have No Guarantees of Convergence in Linear Quadratic Games. arXiv preprint arXiv:1907.03712 .
  • Mei et al. (2020) Mei, J.; Xiao, C.; Szepesvari, C.; and Schuurmans, D. 2020. On the Global Convergence Rates of Softmax Policy Gradient Methods. arXiv preprint arXiv:2005.06392 .
  • Mnih and Gregor (2014) Mnih, A.; and Gregor, K. 2014. Neural variational inference and learning in belief networks. arXiv preprint arXiv:1402.0030 .
  • Mnih et al. (2016) Mnih, V.; Badia, A.; Mirza, M.; Graves, A.; Lillicrap, T.; Harley, T.; Silver, D.; and Kavukcuoglu, K. 2016. Asynchronous methods for deep reinforcement learning. In International conference on machine learning, 1928–1937.
  • Neu et al. (2010) Neu, G.; Antos, A.; György, A.; and Szepesvári, C. 2010. Online Markov decision processes under bandit feedback. In Advances in Neural Information Processing Systems, 1804–1812.
  • Neu, Jonsson, and Gómez (2017) Neu, G.; Jonsson, A.; and Gómez, V. 2017. A unified view of entropy-regularized Markov decision processes. arXiv preprint arXiv:1705.07798 .
  • Papini et al. (2018) Papini, M.; Binaghi, D.; Canonaco, G.; Pirotta, M.; and Restelli, M. 2018. Stochastic variance-reduced policy gradient. arXiv preprint arXiv:1806.05618 .
  • Peters, Mülling, and Altun (2010) Peters, J.; Mülling, K.; and Altun, Y. 2010. Relative entropy policy search. In AAAI, volume 10, 1607–1612. Atlanta.
  • Pham et al. (2020) Pham, N.; Nguyen, L.; Phan, D.; Nguyen, P.; van Dijk, M.; and Tran-Dinh, Q. 2020. A Hybrid Stochastic Policy Gradient Algorithm for Reinforcement Learning. arXiv preprint arXiv:2003.00430 .
  • Rennie et al. (2017) Rennie, S.; Marcheret, E.; Mroueh, Y.; Ross, J.; and Goel, V. 2017. Self-critical sequence training for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 7008–7024.
  • Ryu and Boyd (2016) Ryu, E.; and Boyd, S. 2016. Primer on monotone operator methods. Appl. Comput. Math 15(1): 3–43.
  • Schulman, Chen, and Abbeel (2017) Schulman, J.; Chen, X.; and Abbeel, P. 2017. Equivalence between policy gradients and soft Q-learning. arXiv preprint arXiv:1704.06440 .
  • Schulman et al. (2015) Schulman, J.; Levine, S.; Abbeel, P.; Jordan, M.; and Moritz, P. 2015. Trust region policy optimization. In International conference on machine learning, 1889–1897.
  • Schulman et al. (2017) Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; and Klimov, O. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 .
  • Shani, Efroni, and Mannor (2019) Shani, L.; Efroni, Y.; and Mannor, S. 2019. Adaptive trust region policy optimization: Global convergence and faster rates for regularized MDPs. arXiv preprint arXiv:1909.02769 .
  • Shen et al. (2019) Shen, Z.; Ribeiro, A.; Hassani, H.; Qian, H.; and Mi, C. 2019. Hessian aided policy gradient. In International Conference on Machine Learning, 5729–5738.
  • Silver et al. (2014) Silver, D.; Lever, G.; Heess, N.; Degris, T.; Wierstra, D.; and Riedmiller, M. 2014. Deterministic Policy Gradient Algorithms. In Proceedings of the 31st International Conference on Machine Learning, volume 32(1), 387–395.
  • Sutton and Barto (2018) Sutton, R.; and Barto, A. 2018. Reinforcement Learning: An Introduction. MIT press.
  • Sutton et al. (2000) Sutton, R.; McAllester, D.; Singh, S.; and Mansour, Y. 2000. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, 1057–1063.
  • Wang et al. (2019) Wang, L.; Cai, Q.; Yang, Z.; and Wang, Z. 2019. Neural policy gradient methods: Global optimality and rates of convergence. arXiv preprint arXiv:1909.01150 .
  • Williams (1992) Williams, R. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning 8(3-4): 229–256.
  • Xiong et al. (2020) Xiong, H.; Xu, T.; Liang, Y.; and Zhang, W. 2020. Non-asymptotic Convergence of Adam-type Reinforcement Learning Algorithms under Markovian Sampling. arXiv preprint arXiv:2002.06286 .
  • Xu, Gao, and Gu (2019) Xu, P.; Gao, F.; and Gu, Q. 2019. Sample efficient policy gradient methods with recursive variance reduction. arXiv preprint arXiv:1909.08610 .
  • Xu, Gao, and Gu (2020) Xu, P.; Gao, F.; and Gu, Q. 2020. An improved convergence analysis of stochastic variance-reduced policy gradient. In Uncertainty in Artificial Intelligence, 541–551. PMLR.
  • Yi et al. (2018) Yi, K.; Wu, J.; Gan, C.; Torralba, A.; Kohli, P.; and Tenenbaum, J. 2018. Neural-symbolic VQA: Disentangling reasoning from vision and language understanding. In Advances in neural information processing systems, 1031–1042.
  • Yuan et al. (2020) Yuan, H.; Lian, X.; Liu, J.; and Zhou, Y. 2020. Stochastic Recursive Momentum for Policy Gradient Methods. arXiv preprint arXiv:2003.04302 .
  • Zhang et al. (2020) Zhang, J.; Koppel, A.; Bedi, A.; Szepesvari, C.; and Wang, M. 2020. Variational Policy Gradient Method for Reinforcement Learning with General Utilities. arXiv preprint arXiv:2007.02151 .
  • Zhang et al. (2019) Zhang, K.; Koppel, A.; Zhu, H.; and Başar, T. 2019. Global convergence of policy gradient methods to (almost) locally optimal policies. arXiv preprint arXiv:1906.08383 .
  • Zhang, Yang, and Basar (2019) Zhang, K.; Yang, Z.; and Basar, T. 2019. Policy optimization provably converges to Nash equilibria in zero-sum linear quadratic games. In Advances in Neural Information Processing Systems, 11602–11614.
  • Zhao, Li, and Wen (2019) Zhao, M.; Li, Y.; and Wen, Z. 2019. A Stochastic Trust-Region Framework for Policy Optimization. arXiv preprint arXiv:1911.11640 .
  • Zhao et al. (2011) Zhao, T.; Hachiya, H.; Niu, G.; and Sugiyama, M. 2011. Analysis and improvement of policy gradient estimation. In Advances in Neural Information Processing Systems, 262–270.
  • Zoph and Le (2016) Zoph, B.; and Le, Q. 2016. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578 .

Appendix

In this appendix, we provide detailed proofs and formal statements of the results in the main text. For notational simplicity, we sometimes abbreviate “almost sure” as “a.s.” or omit it altogether whenever it is clear from the context. Also notice that, as is implicitly assumed throughout the literature on episodic reinforcement learning (see, e.g., (Marbach and Tsitsiklis 2001)), given the current policy, the sampled trajectory is conditionally independent of all previous policies and trajectories.

Big-OO notation.

We first clarify the precise definitions of the Big-OO notation used in our statements. Let f:𝐙+×RdR+f:\mathbf{Z}_{+}\times{\mbox{\bf R}}^{d}\rightarrow{\mbox{\bf R}}_{+} be a function of the total number of episodes NN and all problem and algorithm dependent quantities (written jointly as a vector) URdU\in{\mbox{\bf R}}^{d}. Similarly, let g:𝐙+×RdR+g:\mathbf{Z}_{+}\times{\mbox{\bf R}}^{d}\rightarrow{\mbox{\bf R}}_{+} be a function of NN and some (subset of) problem and algorithm dependent quantities U0Rd0U_{0}\in{\mbox{\bf R}}^{d_{0}}, with d0dd_{0}\leq d. Then we write f(N;U)=O(g(N;U0))f(N;U)=O(g(N;U_{0})) to indicate that there exist constants W>0W>0 and N0𝐙+N_{0}\in\mathbf{Z}_{+} (independent of NN and U0U_{0}), such that f(N;U)Wg(N;U0)f(N;U)\leq Wg(N;U_{0}) for all NN0N\geq N_{0}.
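As a concrete illustration of this convention (using the form of the REINFORCE regret bound established in Appendix C below), take f(N;U)=\mathbf{regret}(N) and let U_{0} collect (S,A,1/(1-\gamma),\|d_{\rho}^{\pi^{\star}}/\rho\|_{\infty},1/\delta). Then the statement

\mathbf{regret}(N)=O\left(\left(\frac{S^{2}A^{2}}{(1-\gamma)^{7}}+\left\|\frac{d_{\rho}^{\pi^{\star}}}{\rho}\right\|_{\infty}\right)N^{\frac{5}{6}}(\log(N/\delta))^{\frac{5}{2}}\right)

means precisely that there exist an absolute constant W>0 and an integer N_{0}\in\mathbf{Z}_{+}, both independent of N and U_{0}, such that

\mathbf{regret}(N)\leq W\left(\frac{S^{2}A^{2}}{(1-\gamma)^{7}}+\left\|\frac{d_{\rho}^{\pi^{\star}}}{\rho}\right\|_{\infty}\right)N^{\frac{5}{6}}(\log(N/\delta))^{\frac{5}{2}}\quad\text{for all }N\geq N_{0}.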

Appendix A Proofs for REINFORCE gradient estimations

A.1 Proof of Lemma 2 (with b=0b=0)

Proof.

We verify the three groups of conditions in Assumption 2 in order. For simplicity of exposition, we first restrict to the case b=0b=0, i.e., when no baseline is incorporated.

Gradient estimation boundedness. Firstly, notice that since r(s,a)[0,1]r(s,a)\in[0,1], we have Q^l,k(stl,k,atl,k)1/(1γ)\hat{Q}^{l,k}(s_{t}^{l,k},a_{t}^{l,k})\leq 1/(1-\gamma). And by the soft-max parametrization in Assumption 1, we have

θlogπθl,k(a|s)=𝟏s,a𝐄aπθl,k(|s)𝟏s,a,\nabla_{\theta}\log\pi_{\theta^{l,k}}(a|s)={\bf 1}_{s,a}-\mathbf{E}_{a^{\prime}\sim\pi_{\theta^{l,k}}(\cdot|s)}{\bf 1}_{s,a^{\prime}},

where the vector 𝟏s,aRSA{\bf 1}_{s,a}\in{\mbox{\bf R}}^{SA} has all zero entries apart from the one corresponding to the state-action pair (s,a)(s,a). Hence θlogπθl,k(a|s)22\|\nabla_{\theta}\log\pi_{\theta^{l,k}}(a|s)\|_{2}\leq 2 for any (s,a)𝒮×𝒜(s,a)\in\mathcal{S}\times\mathcal{A}, and we see that

^θLλl(θl,k)211γt=0γtθlogπθl,k(atl,k|stl,k)2+λ¯SAs𝒮,a𝒜θlogπθl,k(a|s)22/(1γ)2+2λ¯,a.s.\begin{split}\left\|\widehat{\nabla}_{\theta}L_{\lambda^{l}}(\theta^{l,k})\right\|_{2}&\leq\dfrac{1}{1-\gamma}\sum_{t=0}^{\infty}\gamma^{t}\|\nabla_{\theta}\log\pi_{\theta^{l,k}}(a_{t}^{l,k}|s_{t}^{l,k})\|_{2}+\dfrac{\bar{\lambda}}{SA}\sum_{s\in\mathcal{S},a\in\mathcal{A}}\|\nabla_{\theta}\log\pi_{\theta^{l,k}}(a|s)\|_{2}\\ &\leq 2/(1-\gamma)^{2}+2\bar{\lambda},\quad\text{a.s.}\end{split} (16)

Hence we can take C1=2/(1γ)2+2λ¯C_{1}=2/(1-\gamma)^{2}+2\bar{\lambda}.
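To make the bound above concrete, the following is a minimal numerical sketch (our own illustration, not code from the paper; the array shapes and function names are ours) of the soft-max score function and of the fact that its 2-norm never exceeds 2.

import numpy as np

def softmax_policy(theta, s):
    # theta has shape (S, A); row s holds the logits of pi_theta(.|s).
    logits = theta[s]
    z = np.exp(logits - logits.max())
    return z / z.sum()

def grad_log_pi(theta, s, a):
    # Returns the score vector 1_{s,a} - E_{a'~pi_theta(.|s)} 1_{s,a'} as a vector in R^{SA}.
    S, A = theta.shape
    g = np.zeros((S, A))
    g[s, a] += 1.0
    g[s, :] -= softmax_policy(theta, s)
    return g.reshape(-1)

theta = np.random.randn(4, 3)          # S = 4 states, A = 3 actions (illustrative)
g = grad_log_pi(theta, s=2, a=1)
assert np.linalg.norm(g, 2) <= 2.0     # matches the bound on the score norm above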

Validation of near-unbiasedness. Secondly, notice that

𝐄l,k^θLλl(θl,k)=𝐄l,k(t=0βHl,kγt𝐄l,k(Q^l,k(stl,k,atl,k)|stl,k,atl,k)θlogπθl,k(atl,k|stl,k))+λlSAs𝒮,a𝒜θlogπθl,k(a|s)=J1+J2+J3,\begin{split}\mathbf{E}_{l,k}\widehat{\nabla}_{\theta}L_{\lambda^{l}}(\theta^{l,k})=&\,\mathbf{E}_{l,k}\left(\sum_{t=0}^{\lfloor\beta H^{l,k}\rfloor}\gamma^{t}\mathbf{E}_{l,k}\left(\hat{Q}^{l,k}(s_{t}^{l,k},a_{t}^{l,k})\Big{|}s_{t}^{l,k},a_{t}^{l,k}\right)\nabla_{\theta}\log\pi_{\theta^{l,k}}(a_{t}^{l,k}|s_{t}^{l,k})\right)\\ &+\dfrac{\lambda^{l}}{SA}\sum\nolimits_{s\in\mathcal{S},a\in\mathcal{A}}\nabla_{\theta}\log\pi_{\theta^{l,k}}(a|s)=\,J_{1}+J_{2}+J_{3},\end{split}

where

J1=𝐄l,k(t=0γt𝐄l,k(t=tγttrtl,k|stl,k,atl,k)θlogπθl,k(atl,k|stl,k))+λlSAs𝒮,a𝒜θlogπθl,k(a|s),\begin{split}J_{1}=&\,\mathbf{E}_{l,k}\left(\sum_{t=0}^{\infty}\gamma^{t}\mathbf{E}_{l,k}\left(\sum\nolimits_{t^{\prime}=t}^{\infty}\gamma^{t^{\prime}-t}r_{t^{\prime}}^{l,k}\Big{|}s_{t}^{l,k},a_{t}^{l,k}\right)\nabla_{\theta}\log\pi_{\theta^{l,k}}(a_{t}^{l,k}|s_{t}^{l,k})\right)\\ &+\dfrac{\lambda^{l}}{SA}\sum\nolimits_{s\in\mathcal{S},a\in\mathcal{A}}\nabla_{\theta}\log\pi_{\theta^{l,k}}(a|s),\end{split}
J2=𝐄l,k(t=βHl,k+1γt𝐄l,k(t=tγttrtl,k|stl,k,atl,k)θlogπθl,k(atl,k|stl,k)),J3=𝐄l,k(t=0βHl,kγt𝐄l,k(t=Hl,k+1γttrtl,k|stl,k,atl,k)θlogπθl,k(atl,k|stl,k)).\begin{split}J_{2}&=-\mathbf{E}_{l,k}\left(\sum_{t=\lfloor\beta H^{l,k}\rfloor+1}^{\infty}\gamma^{t}\mathbf{E}_{l,k}\left(\sum\nolimits_{t^{\prime}=t}^{\infty}\gamma^{t^{\prime}-t}r_{t^{\prime}}^{l,k}\Big{|}s_{t}^{l,k},a_{t}^{l,k}\right)\nabla_{\theta}\log\pi_{\theta^{l,k}}(a_{t}^{l,k}|s_{t}^{l,k})\right),\\ J_{3}&=-\mathbf{E}_{l,k}\left(\sum_{t=0}^{\lfloor\beta H^{l,k}\rfloor}\gamma^{t}\mathbf{E}_{l,k}\left(\sum\nolimits_{t^{\prime}=H^{l,k}+1}^{\infty}\gamma^{t^{\prime}-t}r_{t^{\prime}}^{l,k}\Big{|}s_{t}^{l,k},a_{t}^{l,k}\right)\nabla_{\theta}\log\pi_{\theta^{l,k}}(a_{t}^{l,k}|s_{t}^{l,k})\right).\end{split}

By (Agarwal, Jiang, and Kakade 2019, Theorem 4.6), we have

J1=𝐄l,k(t=0γtQπθl,k(stl,k,atl,k)θlogπθl,k(atl,k|stl,k))+λlSAs𝒮,a𝒜θlogπθk(a|s)=θLλl(θl,k).\begin{split}J_{1}=\mathbf{E}_{l,k}&\left(\sum_{t=0}^{\infty}\gamma^{t}Q^{\pi_{\theta^{l,k}}}(s_{t}^{l,k},a_{t}^{l,k})\nabla_{\theta}\log\pi_{\theta^{l,k}}(a_{t}^{l,k}|s_{t}^{l,k})\right)\\ &+\dfrac{\lambda^{l}}{SA}\sum_{s\in\mathcal{S},a\in\mathcal{A}}\nabla_{\theta}\log\pi_{\theta^{k}}(a|s)\\ =\nabla_{\theta}&L_{\lambda^{l}}(\theta^{l,k}).\end{split} (17)

Here for any πΠ\pi\in\Pi,

Qπ(s,a)=𝐄(t=0γtr(st,at)|s0=s,a0=a),Q^{\pi}(s,a)=\mathbf{E}\left(\sum\nolimits_{t=0}^{\infty}\gamma^{t}r(s_{t},a_{t})\Big{|}s_{0}=s,\,a_{0}=a\right),

with atπ(st,),st+1p(|st,at),t>0\,a_{t}\sim\pi(s_{t},\cdot),\,s_{t+1}\sim p(\cdot|s_{t},a_{t}),\,\forall t>0.

Since r(s,a)[0,1]r(s,a)\in[0,1], we also have

J22\displaystyle\|J_{2}\|_{2} 11γt=βHl,k+1γtθlogπθl,k(atl,k|stl,k)2\displaystyle\leq\dfrac{1}{1-\gamma}\sum_{t=\lfloor\beta H^{l,k}\rfloor+1}^{\infty}\gamma^{t}\|\nabla_{\theta}\log\pi_{\theta^{l,k}}(a_{t}^{l,k}|s_{t}^{l,k})\|_{2}
2γβHl,k/(1γ)2,\displaystyle\leq 2\gamma^{\beta H^{l,k}}/(1-\gamma)^{2},

and similarly

J32t=0βHl,kγtt=Hl,k+1γttθlogπθk(atl,k|stl,k)2t=0βHl,kγt×2γ(1β)Hl,k/(1γ)2γ(1β)Hl,k/(1γ)2.\begin{split}\|J_{3}\|_{2}&\leq\sum_{t=0}^{\lfloor\beta H^{l,k}\rfloor}\gamma^{t}\sum_{t^{\prime}=H^{l,k}+1}^{\infty}\gamma^{t^{\prime}-t}\|\nabla_{\theta}\log\pi_{\theta^{k}}(a_{t}^{l,k}|s_{t}^{l,k})\|_{2}\\ &\leq\sum_{t=0}^{\lfloor\beta H^{l,k}\rfloor}\gamma^{t}\times 2\gamma^{(1-\beta)H^{l,k}}/(1-\gamma)\\ &\leq 2\gamma^{(1-\beta)H^{l,k}}/(1-\gamma)^{2}.\end{split}

Hence for any η0>0\eta_{0}>0, by taking

Hl,k1+η0(2+η0)min{β,1β}log1/γ(4(2+η0)/(1+η0)(k+1)(1γ)(4+2η0)/(1+η0))(=Θ(log(k+1))),H^{l,k}\geq\frac{1+\eta_{0}}{(2+\eta_{0})\min\{\beta,1-\beta\}}\log_{1/\gamma}\left(\frac{4^{(2+\eta_{0})/(1+\eta_{0})}(k+1)}{(1-\gamma)^{(4+2\eta_{0})/(1+\eta_{0})}}\right)(=\Theta(\log(k+1))),

we have Hl,klog1/γ(k+1)H^{l,k}\geq\log_{1/\gamma}(k+1), and that

𝐄l,k^θLλl(θl,k)θLλl(θl,k)24γmin{β,1β}Hl,k(1γ)2(k+1)1+η02+η0,\left\|\mathbf{E}_{l,k}\widehat{\nabla}_{\theta}L_{\lambda^{l}}(\theta^{l,k})-\nabla_{\theta}L_{\lambda^{l}}(\theta^{l,k})\right\|_{2}\leq\dfrac{4\gamma^{\min\{\beta,1-\beta\}H^{l,k}}}{(1-\gamma)^{2}}\leq(k+1)^{-\frac{1+\eta_{0}}{2+\eta_{0}}}, (18)

which implies that

θLλl(θl,k)T𝐄l,k^θLλl(θl,k)=θLλl(θl,k)22+(𝐄l,k^θLλl(θl,k)θLλl(θl,k))TθLλl(θl,k)θLλl(θl,k)22θLλl(θl,k)2(k+1)1+η02+η0θLλl(θl,k)22(2(1γ)2+2λ¯)(k+1)1+η02+η0,\begin{split}\nabla_{\theta}&L_{\lambda^{l}}(\theta^{l,k})^{T}\mathbf{E}_{l,k}\widehat{\nabla}_{\theta}L_{\lambda^{l}}(\theta^{l,k})\\ &=\|\nabla_{\theta}L_{\lambda^{l}}(\theta^{l,k})\|_{2}^{2}+\left(\mathbf{E}_{l,k}\widehat{\nabla}_{\theta}L_{\lambda^{l}}(\theta^{l,k})-\nabla_{\theta}L_{\lambda^{l}}(\theta^{l,k})\right)^{T}\nabla_{\theta}L_{\lambda^{l}}(\theta^{l,k})\\ &\geq\|\nabla_{\theta}L_{\lambda^{l}}(\theta^{l,k})\|_{2}^{2}-\|\nabla_{\theta}L_{\lambda^{l}}(\theta^{l,k})\|_{2}(k+1)^{-\frac{1+\eta_{0}}{2+\eta_{0}}}\\ &\geq\|\nabla_{\theta}L_{\lambda^{l}}(\theta^{l,k})\|_{2}^{2}-\left(\dfrac{2}{(1-\gamma)^{2}}+2\bar{\lambda}\right)(k+1)^{-\frac{1+\eta_{0}}{2+\eta_{0}}},\end{split}

where the last two steps use the Cauchy–Schwarz inequality, (18), and the fact that, by (17),

θLλl(θl,k)2t=0γt𝐄l,k(Qπθl,k(stl,k,atl,k)θlogπθl,k(atl,k|stl,k)2)+λ¯SAs𝒮,a𝒜θlogπθl,k(a|s)22/(1γ)2+2λ¯.\begin{split}\|\nabla_{\theta}&L_{\lambda^{l}}(\theta^{l,k})\|_{2}\\ &\leq\sum_{t=0}^{\infty}\gamma^{t}\mathbf{E}_{l,k}\left(Q^{\pi_{\theta}^{l,k}}(s_{t}^{l,k},a_{t}^{l,k})\left\|\nabla_{\theta}\log\pi_{\theta^{l,k}}(a_{t}^{l,k}|s_{t}^{l,k})\right\|_{2}\right)+\dfrac{\bar{\lambda}}{SA}\sum_{s\in\mathcal{S},a\in\mathcal{A}}\|\nabla_{\theta}\log\pi_{\theta^{l,k}}(a|s)\|_{2}\\ &\leq 2/(1-\gamma)^{2}+2\bar{\lambda}.\end{split}

Hence we can take C2=1C_{2}=1 and δl,k=(2(1γ)2+2λ¯)(k+1)1+η02+η0\delta_{l,k}=\left(\frac{2}{(1-\gamma)^{2}}+2\bar{\lambda}\right)(k+1)^{-\frac{1+\eta_{0}}{2+\eta_{0}}}. Thus we have

k=0Tl1δl,k2\displaystyle\sum_{k=0}^{T_{l}-1}\delta_{l,k}^{2} =(2(1γ)2+2λ¯)2k=0(k+1)2+2η02+η0\displaystyle=\left(\frac{2}{(1-\gamma)^{2}}+2\bar{\lambda}\right)^{2}\sum_{k=0}^{\infty}(k+1)^{-\frac{2+2\eta_{0}}{2+\eta_{0}}}
8(1(1γ)2+λ¯)2(1+1η0),\displaystyle\leq 8\left(\frac{1}{(1-\gamma)^{2}}+\bar{\lambda}\right)^{2}\left(1+\dfrac{1}{\eta_{0}}\right),

and hence we can take C=8(1(1γ)2+λ¯)2(1+1η0)C=8\left(\frac{1}{(1-\gamma)^{2}}+\bar{\lambda}\right)^{2}\left(1+\frac{1}{\eta_{0}}\right). Notice that for notational simplicity, we have taken η0=1\eta_{0}=1 in the statement of the proposition.
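As a numerical sanity check on the horizon choice above and the resulting bias bound (18), the following small script (ours, not part of the proof; the values of γ, β and η₀ are arbitrary illustrative choices) computes the smallest integer horizon allowed by the displayed condition and verifies (18).

import math

def horizon(k, gamma, beta, eta0=1.0):
    # Smallest integer H^{l,k} satisfying the displayed lower bound.
    m = min(beta, 1.0 - beta)
    coeff = (1.0 + eta0) / ((2.0 + eta0) * m)
    arg = 4.0 ** ((2.0 + eta0) / (1.0 + eta0)) * (k + 1) \
        / (1.0 - gamma) ** ((4.0 + 2.0 * eta0) / (1.0 + eta0))
    return math.ceil(coeff * math.log(arg) / math.log(1.0 / gamma))

gamma, beta, eta0 = 0.9, 0.5, 1.0      # illustrative values only
for k in [0, 10, 1000]:
    H = horizon(k, gamma, beta, eta0)
    m = min(beta, 1.0 - beta)
    bias = 4.0 * gamma ** (m * H) / (1.0 - gamma) ** 2
    assert H >= math.log(k + 1) / math.log(1.0 / gamma)        # H^{l,k} >= log_{1/gamma}(k+1)
    assert bias <= (k + 1) ** (-(1.0 + eta0) / (2.0 + eta0))   # inequality (18)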

Validation of bounded second-order moment growth. Finally, we bound the second-order moment of the policy gradient. In the following, for a random vector X=(X1,,Xn)RnX=(X_{1},\dots,X_{n})\in{\mbox{\bf R}}^{n}, we define 𝐕𝐚𝐫X=i=1n𝐯𝐚𝐫Xi\mathop{\bf Var}X=\sum_{i=1}^{n}\mathop{\bf var}X_{i}, and similarly 𝐕𝐚𝐫l,kX=i=1n𝐯𝐚𝐫l,kXi\mathop{\bf Var}_{l,k}X=\sum_{i=1}^{n}\mathop{\bf var}_{l,k}X_{i}, where 𝐯𝐚𝐫l,k\mathop{\bf var}_{l,k} denotes the conditional variance given the (l,k)(l,k)-th iteration θl,k\theta^{l,k}. Now define the constant V¯\bar{V} as the uniform upper bound on the variance of the policy gradient vector, i.e.,

V¯=supH0,θΘ,λ[0,λ¯]𝐕𝐚𝐫(t=0βHγtQ^(st,at)θlogπθ(at|st)+λSAs𝒮,a𝒜θlogπθ(a|s)),\bar{V}=\sup_{H\geq 0,\,\theta\in\Theta,\,\lambda\in[0,\bar{\lambda}]}\mathop{\bf Var}\left(\sum_{t=0}^{\lfloor\beta H\rfloor}\gamma^{t}\widehat{Q}(s_{t},a_{t})\nabla_{\theta}\log\pi_{\theta}(a_{t}|s_{t})+\dfrac{\lambda}{SA}\sum_{s\in\mathcal{S},a\in\mathcal{A}}\nabla_{\theta}\log\pi_{\theta}(a|s)\right),

where τ=(s0,a0,r0,,sH,aH,rH)\tau=(s_{0},a_{0},r_{0},\dots,s_{H},a_{H},r_{H}) is sampled from \mathcal{M} following policy πθ\pi_{\theta}, and Q^(st,at)=t=tHγttrt\hat{Q}(s_{t},a_{t})=\sum_{t^{\prime}=t}^{H}\gamma^{t^{\prime}-t}r_{t^{\prime}}.

Then we have 𝐕𝐚𝐫l,k^θLλl(θl,k)V¯\mathop{\bf Var}_{l,k}\widehat{\nabla}_{\theta}L_{\lambda^{l}}(\theta^{l,k})\leq\bar{V} for any l,k0l,\,k\geq 0 by definition. In addition, since for any random vector XRnX\in{\mbox{\bf R}}^{n},

𝐕𝐚𝐫Xi=1n𝐄Xi2=𝐄X22,\mathop{\bf Var}X\leq\sum_{i=1}^{n}\mathbf{E}X_{i}^{2}=\mathbf{E}\|X\|_{2}^{2},

we have by the same argument as (16) that

V¯\displaystyle\bar{V} 𝐄t=0βHγtQ^(st,at)θlogπθ(at|st)+λSAs𝒮,a𝒜θlogπθ(a|s)22\displaystyle\leq\mathbf{E}\left\|\sum_{t=0}^{\lfloor\beta H\rfloor}\gamma^{t}\widehat{Q}(s_{t},a_{t})\nabla_{\theta}\log\pi_{\theta}(a_{t}|s_{t})+\dfrac{\lambda}{SA}\sum_{s\in\mathcal{S},a\in\mathcal{A}}\nabla_{\theta}\log\pi_{\theta}(a|s)\right\|_{2}^{2}
4(1+λ¯(1γ)2)2(1γ)4.\displaystyle\leq\dfrac{4(1+\bar{\lambda}(1-\gamma)^{2})^{2}}{(1-\gamma)^{4}}.

Finally, since for any random vector XRnX\in{\mbox{\bf R}}^{n},

𝐄l,kX2=𝐄l,ki=1nXi2=i=1n((𝐄l,kXi)2+𝐯𝐚𝐫l,kXi)=𝐄l,kX22+𝐕𝐚𝐫l,kX,\mathbf{E}_{l,k}\|X\|^{2}=\mathbf{E}_{l,k}\sum\nolimits_{i=1}^{n}X_{i}^{2}=\sum\nolimits_{i=1}^{n}\left((\mathbf{E}_{l,k}X_{i})^{2}+\mathop{\bf var}\nolimits_{l,k}X_{i}\right)=\|\mathbf{E}_{l,k}X\|_{2}^{2}+\mathop{\bf Var}\nolimits_{l,k}X,

we have

𝐄l,k^θLλl(θl,k)22J1+J2+J322+V¯2J122+2(J22+J32)2+V¯2θLλl(θl,k)22+32γ2min{β,1β}Hk(1γ)4+V¯2θLλl(θl,k)22+32(1γ)4+V¯,\begin{split}\mathbf{E}_{l,k}\left\|\widehat{\nabla}_{\theta}L_{\lambda^{l}}(\theta^{l,k})\right\|_{2}^{2}&\leq\|J_{1}+J_{2}+J_{3}\|_{2}^{2}+\bar{V}\\ &\leq 2\|J_{1}\|_{2}^{2}+2(\|J_{2}\|_{2}+\|J_{3}\|_{2})^{2}+\bar{V}\\ &\leq 2\|\nabla_{\theta}L_{\lambda^{l}}(\theta^{l,k})\|_{2}^{2}+\dfrac{32\gamma^{2\min\{\beta,1-\beta\}H^{k}}}{(1-\gamma)^{4}}+\bar{V}\\ &\leq 2\|\nabla_{\theta}L_{\lambda^{l}}(\theta^{l,k})\|_{2}^{2}+\dfrac{32}{(1-\gamma)^{4}}+\bar{V},\end{split}

and hence we can take M2=2M_{2}=2 and M1=32/(1γ)4+V¯M_{1}=32/(1-\gamma)^{4}+\bar{V}. This completes our proof. ∎

A.2 Proof of Lemma 2

Proof.

The proof is nearly identical to the case when b=0b=0 above. Hence we only outline the proof while highlighting the differences.

Firstly, similar to (16), we have

^θLλl(θl,k)2(11γ+B)21γ+2λ¯a.s.,\left\|\widehat{\nabla}_{\theta}L_{\lambda^{l}}(\theta^{l,k})\right\|_{2}\leq\left(\frac{1}{1-\gamma}+B\right)\frac{2}{1-\gamma}+2\bar{\lambda}\quad\text{a.s.,}

and hence we can take C1=2+2B(1γ)(1γ)2+2λ¯C_{1}=\frac{2+2B(1-\gamma)}{(1-\gamma)^{2}}+2\bar{\lambda}.

Secondly, by the proof of (Agarwal, Jiang, and Kakade 2019, Lemma 4.10), we have

𝐄l,k(t=0βHl,kγtb(stl,k)θlogπθl,k(atl,k|stl,k))=0.\begin{split}\mathbf{E}_{l,k}\left(\sum_{t=0}^{\lfloor\beta H^{l,k}\rfloor}\gamma^{t}b(s_{t}^{l,k})\nabla_{\theta}\log\pi_{\theta^{l,k}}(a_{t}^{l,k}|s_{t}^{l,k})\right)=0.\end{split} (19)

Hence 𝐄l,k^θLλl(θl,k)\mathbf{E}_{l,k}\widehat{\nabla}_{\theta}L_{\lambda^{l}}(\theta^{l,k}) is the same as in the b=0b=0 case above, and we can take

C2=1C=16(1(1γ)2+λ¯)2,δl,k=(2(1γ)2+2λ¯)(k+1)23Hl,k3log1/γ(8(k+1)(1γ)3)2min{β,1β}.\begin{split}&\text{$C_{2}=1$,\quad$C=16\left(\frac{1}{(1-\gamma)^{2}}+\bar{\lambda}\right)^{2}$,}\\ &\text{$\delta_{l,k}=\left(\frac{2}{(1-\gamma)^{2}}+2\bar{\lambda}\right)(k+1)^{-\frac{2}{3}}$, \quad$H^{l,k}\geq\frac{3\log_{1/\gamma}\left(\frac{8(k+1)}{(1-\gamma)^{3}}\right)}{2\min\{\beta,1-\beta\}}$.}\end{split}

Finally, by definition, V¯b\bar{V}_{b} can be written explicitly as

V¯b=supH0,θΘ,λ[0,λ¯]𝐕𝐚𝐫(t=0βHγt(Q^(st,at)b(st))θlogπθ(at|st)+λSAs𝒮,a𝒜θlogπθ(a|s)),\bar{V}_{b}=\sup_{H\geq 0,\,\theta\in\Theta,\,\lambda\in[0,\bar{\lambda}]}\mathop{\bf Var}\left(\sum_{t=0}^{\lfloor\beta H\rfloor}\gamma^{t}(\widehat{Q}(s_{t},a_{t})-b(s_{t}))\nabla_{\theta}\log\pi_{\theta}(a_{t}|s_{t})+\dfrac{\lambda}{SA}\sum_{s\in\mathcal{S},a\in\mathcal{A}}\nabla_{\theta}\log\pi_{\theta}(a|s)\right),

where τ=(s0,a0,r0,,sH,aH,rH)\tau=(s_{0},a_{0},r_{0},\dots,s_{H},a_{H},r_{H}) is sampled from \mathcal{M} following policy πθ\pi_{\theta}, and Q^(st,at)=t=tHγttrt\hat{Q}(s_{t},a_{t})=\sum_{t^{\prime}=t}^{H}\gamma^{t^{\prime}-t}r_{t^{\prime}}.

Hence, similarly to the bound on V¯\bar{V} in the proof for b=0b=0 above, we have V¯b4(1+B(1γ)(1γ)2+λ¯)2\bar{V}_{b}\leq 4\left(\frac{1+B(1-\gamma)}{(1-\gamma)^{2}}+\bar{\lambda}\right)^{2}, M2=2M_{2}=2 and M1=32/(1γ)4+V¯bM_{1}=32/(1-\gamma)^{4}+\bar{V}_{b}. ∎
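For concreteness, below is a hedged Python sketch (ours; the trajectory format and helper names are our own assumptions, not the paper's code) of the truncated REINFORCE estimator with a baseline analyzed above: it accumulates the terms γ^t (Q̂(s_t,a_t) − b(s_t)) ∇_θ log π_θ(a_t|s_t) for t up to ⌊βH⌋ and adds the log-barrier term (λ/(SA)) Σ_{s,a} ∇_θ log π_θ(a|s).

import numpy as np

def grad_log_pi(theta, s, a):
    # Soft-max score: 1_{s,a} - E_{a'~pi_theta(.|s)} 1_{s,a'}, flattened to R^{SA}.
    S, A = theta.shape
    logits = theta[s]
    pi_s = np.exp(logits - logits.max()); pi_s /= pi_s.sum()
    g = np.zeros((S, A)); g[s, a] += 1.0; g[s, :] -= pi_s
    return g.reshape(-1)

def reinforce_gradient(theta, traj, gamma, beta, lam, baseline=lambda s: 0.0):
    # traj is a list of (s_t, a_t, r_t) tuples of length H + 1.
    S, A = theta.shape
    H = len(traj) - 1
    g = np.zeros(S * A)
    for t in range(int(np.floor(beta * H)) + 1):
        s_t, a_t, _ = traj[t]
        q_hat = sum(gamma ** (tp - t) * traj[tp][2] for tp in range(t, H + 1))
        g += gamma ** t * (q_hat - baseline(s_t)) * grad_log_pi(theta, s_t, a_t)
    for s in range(S):                 # log-barrier regularization term
        for a in range(A):
            g += lam / (S * A) * grad_log_pi(theta, s, a)
    return g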

Appendix B Proofs for convergence analysis

Firstly, we state a simple result from elementary analysis, which will be used repeatedly in our proof below.

Lemma 8.

Let xk=1k+3log2(k+3)x_{k}=\frac{1}{\sqrt{k+3}\log_{2}(k+3)} for k0k\geq 0. Then we have

k=K1K2xkK2K1+1K2+3log2(K2+3),k=0xk4k=0xk21.\begin{split}&\sum_{k=K_{1}}^{K_{2}}x_{k}\geq\dfrac{K_{2}-K_{1}+1}{\sqrt{K_{2}+3}\log_{2}(K_{2}+3)},\quad\sum_{k=0}^{\infty}x_{k}^{4}\leq\sum_{k=0}^{\infty}x_{k}^{2}\leq 1.\end{split}
Proof.

The first inequality follows immediately from the fact that xkx_{k} is monotonically decreasing in kk. The second inequality follows by noticing that xk<1x_{k}<1 for any k0k\geq 0, and that

k=0xk201(x+2)(log2(x+2))2𝑑x=ln2<1.\sum_{k=0}^{\infty}x_{k}^{2}\leq\int_{0}^{\infty}\dfrac{1}{(x+2)(\log_{2}(x+2))^{2}}dx=\ln 2<1.

This completes the proof. ∎
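The following quick numerical check (ours, not part of the proof; the truncation length and the choice of K₁, K₂ are arbitrary) evaluates truncations of the sums in Lemma 8 and confirms both displayed inequalities.

import math

x = [1.0 / (math.sqrt(k + 3) * math.log2(k + 3)) for k in range(200000)]
assert sum(v ** 4 for v in x) <= sum(v ** 2 for v in x) <= 1.0     # second chain of inequalities
K1, K2 = 10, 100
assert sum(x[K1:K2 + 1]) >= (K2 - K1 + 1) / (math.sqrt(K2 + 3) * math.log2(K2 + 3))  # first inequality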

B.1 Proof of Lemma 4

Proof.

The proof is a direct application of the well-known Borel-Cantelli lemma (Klenke 2013). Let δN=1/N2\delta_{N}=1/N^{2} and define the events {A¯N}N0\{\bar{A}_{N}\}_{N\geq 0} as

A¯N={𝐫𝐞𝐠𝐫𝐞𝐭(N)>d1(N+c)d2(log(N/δN))d3+d4(logN)d5}.\bar{A}_{N}=\{\mathbf{regret}(N)>d_{1}(N+c)^{d_{2}}(\log(N/\delta_{N}))^{d_{3}}+d_{4}(\log N)^{d_{5}}\}.

Then 𝐏𝐫𝐨𝐛(A¯N)δN\mathbf{Prob}(\bar{A}_{N})\leq\delta_{N}, so that N=1𝐏𝐫𝐨𝐛(A¯N)N=11/N2<\sum_{N=1}^{\infty}\mathbf{Prob}(\bar{A}_{N})\leq\sum_{N=1}^{\infty}1/N^{2}<\infty. Hence, by the Borel–Cantelli lemma, we have

𝐏𝐫𝐨𝐛(A¯N occurs infinitely often)=0.\mathbf{Prob}(\bar{A}_{N}\text{ occurs infinitely often})=0.

Finally, by noticing that the complement of A¯N\bar{A}_{N} is a subset of ANA_{N}, the proof is complete. ∎

B.2 Proof of Theorem 5

Theorem 9 (Formal statement of Theorem 5).

Under Assumptions 1 and 2, for phase l0l\geq 0 suppose that we choose αl,k=Cl,α1k+3log2(k+3)\alpha^{l,k}=C_{l,\alpha}\frac{1}{\sqrt{k+3}\log_{2}(k+3)} for some Cl,α(0,C2/(M2βλl)]C_{l,\alpha}\in(0,C_{2}/(M_{2}\beta_{\lambda^{l}})]. Then for any ϵ>0\epsilon>0, if we choose λl=ϵ(1γ)2\lambda^{l}=\frac{\epsilon(1-\gamma)}{2}, then for any δ(0,1)\delta\in(0,1), with probability at least 1δ1-\delta, for any K{0,,Tl1}K\in\{0,\dots,T_{l}-1\}, we have

𝐫𝐞𝐠𝐫𝐞𝐭l(K)4(Dl+2Cllog(2/δ))(1γ)C2Elϵ2K+1log2(K+3)+ϵ(K+1)dρπρ+γ+γlog(K+1)1γ.\begin{split}\mathbf{regret}_{l}(K)\leq&\,\dfrac{4(D_{l}+\sqrt{2C_{l}\log(2/\delta)})}{(1-\gamma)C_{2}E_{l}\epsilon^{2}}\sqrt{K+1}\log_{2}(K+3)+\epsilon(K+1)\left\|\dfrac{d_{\rho}^{\pi^{\star}}}{\rho}\right\|_{\infty}\\ &+\frac{\gamma+\gamma\log(K+1)}{1-\gamma}.\end{split} (20)

Here El=Cl,α(1γ)216S2A2E_{l}=\frac{C_{l,\alpha}(1-\gamma)^{2}}{16S^{2}A^{2}}, and

Cl=32C12Cl,α2(1(1γ)2+λl)2+βλl2C14Cl,α42,Dl=CCl,α2+βλlM1Cl,α2+FLλl(θl,0).\begin{split}C_{l}&=32C_{1}^{2}C_{l,\alpha}^{2}\left(\dfrac{1}{(1-\gamma)^{2}}+\lambda^{l}\right)^{2}+\dfrac{\beta_{\lambda^{l}}^{2}C_{1}^{4}C_{l,\alpha}^{4}}{2},\\ D_{l}&=CC_{l,\alpha}^{2}+\beta_{\lambda^{l}}M_{1}C_{l,\alpha}^{2}+F^{\star}-L_{\lambda^{l}}(\theta^{l,0}).\end{split}
Proof.

The proof consists of two parts. In the first part, we establish an upper bound on the weighted sum of the gradient norms of the iterates in the current phase. The second part then uses this bound to establish an upper bound on the phase regret.

Bounding the weighted sum of gradient norms.

By Proposition 1 and an equivalent characterization of strong smoothness (cf. (Ryu and Boyd 2016, Appendix)), we have

Lλl(θl,k+1)(Lλl(θl,k))θLλl(θl,k)T(θl,k+1θl,k)+βλl2θl,k+1θl,k22=αl,kθLλl(θl,k)T^θLλl(θl,k)+βλl(αl,k)22^θLλl(θl,k)22Yl,k.\begin{split}-L_{\lambda^{l}}(\theta^{l,k+1})-(-L_{\lambda^{l}}(\theta^{l,k}))&\leq-\nabla_{\theta}L_{\lambda^{l}}(\theta^{l,k})^{T}(\theta^{l,k+1}-\theta^{l,k})+\dfrac{\beta_{\lambda^{l}}}{2}\|\theta^{l,k+1}-\theta^{l,k}\|_{2}^{2}\\ &=\underbrace{-\alpha^{l,k}\nabla_{\theta}L_{\lambda^{l}}(\theta^{l,k})^{T}\widehat{\nabla}_{\theta}L_{\lambda^{l}}(\theta^{l,k})+\dfrac{\beta_{\lambda^{l}}(\alpha^{l,k})^{2}}{2}\|\widehat{\nabla}_{\theta}L_{\lambda^{l}}(\theta^{l,k})\|_{2}^{2}}_{Y_{l,k}}.\end{split}

Let Zl,k=Yl,k𝐄l,k[Yl,k]Z_{l,k}=Y_{l,k}-\mathbf{E}_{l,k}[Y_{l,k}]. Then the above inequality implies that

Lλl(θl,k)Lλl(θl,k+1)αl,kθLλl(θl,k)T𝐄l,k^θLλl(θl,k)+βλl(αl,k)22𝐄l,k^θLλl(θl,k)22+Zl,kαl,k(C2θLλl(θl,k)22δl,k)+βλl(αl,k)22(M1+M2θLλl(θl,k)22)+Zl,k=αl,k(C2M2βλlαl,k/2)θLλl(θl,k)22+αl,kδl,k+βλlM1(αl,k)22+Zl,kC2αl,k2θLλl(θl,k)22+αl,kδl,k+βλlM1(αl,k)22+Zl,k.\begin{split}L_{\lambda^{l}}&(\theta^{l,k})-L_{\lambda^{l}}(\theta^{l,k+1})\\ \leq&-\alpha^{l,k}\nabla_{\theta}L_{\lambda^{l}}(\theta^{l,k})^{T}\mathbf{E}_{l,k}\widehat{\nabla}_{\theta}L_{\lambda^{l}}(\theta^{l,k})+\dfrac{\beta_{\lambda^{l}}(\alpha^{l,k})^{2}}{2}\mathbf{E}_{l,k}\|\widehat{\nabla}_{\theta}L_{\lambda^{l}}(\theta^{l,k})\|_{2}^{2}+Z_{l,k}\\ \leq&-\alpha^{l,k}\left(C_{2}\|\nabla_{\theta}L_{\lambda^{l}}(\theta^{l,k})\|_{2}^{2}-\delta_{l,k}\right)+\dfrac{\beta_{\lambda^{l}}(\alpha^{l,k})^{2}}{2}\left(M_{1}+M_{2}\|\nabla_{\theta}L_{\lambda^{l}}(\theta^{l,k})\|_{2}^{2}\right)+Z_{l,k}\\ =&-\alpha^{l,k}(C_{2}-M_{2}\beta_{\lambda^{l}}\alpha^{l,k}/2)\|\nabla_{\theta}L_{\lambda^{l}}(\theta^{l,k})\|_{2}^{2}+\alpha^{l,k}\delta_{l,k}+\dfrac{\beta_{\lambda^{l}}M_{1}(\alpha^{l,k})^{2}}{2}+Z_{l,k}\\ \leq&-\dfrac{C_{2}\alpha^{l,k}}{2}\|\nabla_{\theta}L_{\lambda^{l}}(\theta^{l,k})\|_{2}^{2}+\alpha^{l,k}\delta_{l,k}+\dfrac{\beta_{\lambda^{l}}M_{1}(\alpha^{l,k})^{2}}{2}+Z_{l,k}.\end{split} (21)

Now define Xl,K=k=0K1Zl,kX_{l,K}=\sum_{k=0}^{K-1}Z_{l,k} (with Xl,0=0X_{l,0}=0), then

𝐄(Xl,K+1|l,K)=k=0K1Zl,k+𝐄(Yl,K𝐄l,KYl,K|l,K)=Xl,K.\mathbf{E}(X_{l,K+1}|\mathcal{F}_{l,K})=\sum_{k=0}^{K-1}Z_{l,k}+\mathbf{E}(Y_{l,K}-\mathbf{E}_{l,K}Y_{l,K}|\mathcal{F}_{l,K})=X_{l,K}. (22)

Here l,K\mathcal{F}_{l,K} is the filtration up to episode KK in phase ll, i.e., the σ\sigma-algebra generated by all iterations {θ0,0,,θ0,T0,,θl,0,,θl,K}\{\theta^{0,0},\dots,\theta^{0,T_{0}},\dots,\theta^{l,0},\dots,\theta^{l,K}\} up to the (l,K)(l,K)-th one. Notice that the second equality makes use of the fact that given the current policy, the correspondingly sampled trajectory is conditionally independent of all previous policies and trajectories.

In addition, for any K1K\geq 1,

|Xl,KXl,K1|=|Zl,K1|αl,K1θLλl(θl,K1)2𝐄l,K1^θLλl(θl,K1)^θLλl(θl,K1)2+βλl(αl,K1)22|𝐄l,K1^θLλl(θl,K1)22^θLλl(θl,K1)22|2C1(2(1γ)2+2λl)αl,K1+βλl2C12(αl,K1)2cl,K.\begin{split}|X_{l,K}-X_{l,K-1}|=&\,|Z_{l,K-1}|\leq\alpha^{l,K-1}\|\nabla_{\theta}L_{\lambda^{l}}(\theta^{l,K-1})\|_{2}\|\mathbf{E}_{l,K-1}\widehat{\nabla}_{\theta}L_{\lambda^{l}}(\theta^{l,K-1})-\widehat{\nabla}_{\theta}L_{\lambda^{l}}(\theta^{l,K-1})\|_{2}\\ &+\dfrac{\beta_{\lambda^{l}}(\alpha^{l,K-1})^{2}}{2}\left|\mathbf{E}_{l,K-1}\|\widehat{\nabla}_{\theta}L_{\lambda^{l}}(\theta^{l,K-1})\|_{2}^{2}-\|\widehat{\nabla}_{\theta}L_{\lambda^{l}}(\theta^{l,K-1})\|_{2}^{2}\right|\\ \leq&\,\underbrace{2C_{1}\left(\dfrac{2}{(1-\gamma)^{2}}+2\lambda^{l}\right)\alpha^{l,K-1}+\dfrac{\beta_{\lambda^{l}}}{2}C_{1}^{2}(\alpha^{l,K-1})^{2}}_{c_{l,K}}.\end{split}

Here we use the fact that

θLλl(θl,K1)22/(1γ)2+2λl,\|\nabla_{\theta}L_{\lambda^{l}}(\theta^{l,K-1})\|_{2}\leq 2/(1-\gamma)^{2}+2\lambda^{l},

which follows from the same argument as (16). The above inequality on |Xl,KXl,K1||X_{l,K}-X_{l,K-1}| also implies that 𝐄|Xl,K|<\mathbf{E}|X_{l,K}|<\infty, which, together with (22), implies that Xl,KX_{l,K} is a martingale. Notice that although Xl,KX_{l,K} is only defined for K=0,,TlK=0,\dots,T_{l}, we can augment the sequence by setting Xl,K=Xl,TlX_{l,K}=X_{l,T_{l}} and l,K=l,Tl\mathcal{F}_{l,K}=\mathcal{F}_{l,T_{l}} for all K>TlK>T_{l}, and it is clear that (22) and 𝐄|Xl,K|<\mathbf{E}|X_{l,K}|<\infty also hold for KTlK\geq T_{l}. When we say that Xl,KX_{l,K} is a martingale, we are referring to this (infinite-length) augmented sequence of random variables.

Now by the definition of αl,k\alpha^{l,k}, it is easy to see that K=1Tlcl,K2Cl<\sum_{K=1}^{T_{l}}c_{l,K}^{2}\leq C_{l}<\infty, where

Cl=32C12Cl,α2(1(1γ)2+λl)2+βλl2C14Cl,α42.C_{l}=32C_{1}^{2}C_{l,\alpha}^{2}\left(\dfrac{1}{(1-\gamma)^{2}}+\lambda^{l}\right)^{2}+\dfrac{\beta_{\lambda^{l}}^{2}C_{1}^{4}C_{l,\alpha}^{4}}{2}. (23)

Hence, by the Azuma–Hoeffding inequality, for any c>0c>0,

𝐏𝐫𝐨𝐛(|Xl,Tl|c)2ec2/(2Cl).\mathbf{Prob}(|X_{l,T_{l}}|\geq c)\leq 2e^{-c^{2}/(2C_{l})}. (24)
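For completeness, the version of the Azuma–Hoeffding inequality invoked here is the standard one: if (X_{K})_{K\geq 0} is a martingale with X_{0}=0 and |X_{K}-X_{K-1}|\leq c_{K} almost surely for every K\geq 1, then for any c>0 and T\geq 1,

\mathbf{Prob}\left(|X_{T}|\geq c\right)\leq 2\exp\left(-\frac{c^{2}}{2\sum_{K=1}^{T}c_{K}^{2}}\right),

and (24) follows by bounding \sum_{K=1}^{T_{l}}c_{l,K}^{2} from above by C_{l}.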

Then by summing up the inequalities (21) from k=0k=0 to KK, we obtain that

C22k=0Kαl,kθLλl(θl,k)22C22k=0Tl1αl,kθLλl(θl,k)22k=0αl,kδl,k+βλlM1k=0(αl,k)22+k=0Tl1Zl,k+supθΘLλl(θ)Lλl(θl,0)k=0(αl,k)2k=0δl,k2+βλlM12k=0(αl,k)2+Xl,Tl+FLλl(θl,0)CCl,α2+βλlM1Cl,α2+FLλl(θl,0)Dl+Xl,Tl,\begin{split}\dfrac{C_{2}}{2}&\sum_{k=0}^{K}\alpha^{l,k}\|\nabla_{\theta}L_{\lambda^{l}}(\theta^{l,k})\|_{2}^{2}\leq\dfrac{C_{2}}{2}\sum_{k=0}^{T_{l}-1}\alpha^{l,k}\|\nabla_{\theta}L_{\lambda^{l}}(\theta^{l,k})\|_{2}^{2}\\ &\leq\sum_{k=0}^{\infty}\alpha^{l,k}\delta_{l,k}+\dfrac{\beta_{\lambda^{l}}M_{1}\sum_{k=0}^{\infty}(\alpha^{l,k})^{2}}{2}+\sum_{k=0}^{T_{l}-1}Z_{l,k}+\sup_{\theta\in\Theta}L_{\lambda^{l}}(\theta)-L_{\lambda^{l}}(\theta^{l,0})\\ &\leq\sum_{k=0}^{\infty}(\alpha^{l,k})^{2}\sum_{k=0}^{\infty}\delta_{l,k}^{2}+\dfrac{\beta_{\lambda^{l}}M_{1}}{2}\sum_{k=0}^{\infty}(\alpha^{l,k})^{2}+X_{l,T_{l}}+F^{\star}-L_{\lambda^{l}}(\theta^{l,0})\\ &\leq\underbrace{CC_{l,\alpha}^{2}+\beta_{\lambda^{l}}M_{1}C_{l,\alpha}^{2}+F^{\star}-L_{\lambda^{l}}(\theta^{l,0})}_{D_{l}}+X_{l,T_{l}},\end{split} (25)

where we use the fact that the regularization term R(θ)0R(\theta)\leq 0 for all θΘ\theta\in\Theta.

Hence we have

k=0Kαl,kθLλl(θl,k)222(Dl+Xl,Tl)C2.\sum_{k=0}^{K}\alpha^{l,k}\|\nabla_{\theta}L_{\lambda^{l}}(\theta^{l,k})\|_{2}^{2}\leq\frac{2(D_{l}+X_{l,T_{l}})}{C_{2}}. (26)

Bounding the phase regret.

We now establish the regret bound in phase ll using (26).

Fix l0l\geq 0 and K{0,,Tl1}K\in\{0,\dots,T_{l}-1\}. Let

I+={k{0,,K}|θLλl(θl,k)2λl/(2SA)}.I^{+}=\{k\in\{0,\dots,K\}\,|\,\|\nabla_{\theta}L_{\lambda^{l}}(\theta^{l,k})\|_{2}\geq\lambda^{l}/(2SA)\}.

For simplicity, assume for now that |I+|>0|I^{+}|>0. Then since αl,k\alpha^{l,k} is decreasing in kk, we have

2(Dl+Xl,Tl)/C2(λl)24S2A2k=K|I+|+1Kαl,k=ϵ2Cl,α(1γ)216S2A2Elk=K|I+|+1K1k+3log2(k+3)(By Lemma 8)Elϵ2|I+|K+3log2(K+3).\begin{split}2(D_{l}+X_{l,T_{l}})/C_{2}&\geq\dfrac{(\lambda^{l})^{2}}{4S^{2}A^{2}}\sum_{k=K-|I^{+}|+1}^{K}\alpha^{l,k}\\ &=\epsilon^{2}\underbrace{\frac{C_{l,\alpha}(1-\gamma)^{2}}{16S^{2}A^{2}}}_{E_{l}}\sum_{k=K-|I^{+}|+1}^{K}\frac{1}{\sqrt{k+3}\log_{2}(k+3)}\\ (\text{By Lemma \ref{simple_alpha_bound}})&\geq E_{l}\epsilon^{2}\dfrac{|I^{+}|}{\sqrt{K+3}\log_{2}(K+3)}.\end{split}

Hence we have (by the simple fact that K+32K+1\sqrt{K+3}\leq 2\sqrt{K+1} for any K0K\geq 0)

|I+|4(Dl+Xl,Tl)C2Elϵ2K+1log2(K+3).|I^{+}|\leq\dfrac{4(D_{l}+X_{l,T_{l}})}{C_{2}E_{l}\epsilon^{2}}\sqrt{K+1}\log_{2}(K+3). (27)

Now by Proposition 3 and the choice of λl\lambda^{l}, we have that for any kI+k\notin I^{+},

FF(πθl,k)dρπ/ρϵ.F^{\star}-F(\pi_{\theta^{l,k}})\leq\|d_{\rho}^{\pi^{\star}}/\rho\|_{\infty}\epsilon.

Since for any πΠ\pi\in\Pi, F(π)[0,1/(1γ)]F(\pi)\in[0,1/(1-\gamma)], we have FF(π)1/(1γ)F^{\star}-F(\pi)\leq 1/(1-\gamma). Hence by (27), we have

kKFF(πθl,k)=kI+FF(πθl,k)+kI+FF(πθl,k)|I+|/(1γ)+(K+1|I+|)dρπ/ρϵ4(Dl+Xl,Tl)(1γ)C2Elϵ2K+1log2(K+3)+(K+1)dρπ/ρϵ.\begin{split}\sum\nolimits_{k\leq K}F^{\star}&-F(\pi_{\theta^{l,k}})\\ &=\sum_{k\in I^{+}}F^{\star}-F(\pi_{\theta^{l,k}})+\sum_{k\notin I^{+}}F^{\star}-F(\pi_{\theta^{l,k}})\\ &\leq|I^{+}|/(1-\gamma)+\left(K+1-|I^{+}|\right)\|d_{\rho}^{\pi^{\star}}/\rho\|_{\infty}\epsilon\\ &\leq\dfrac{4(D_{l}+X_{l,T_{l}})}{(1-\gamma)C_{2}E_{l}\epsilon^{2}}\sqrt{K+1}\log_{2}(K+3)+(K+1)\|d_{\rho}^{\pi^{\star}}/\rho\|_{\infty}\epsilon.\end{split}

This immediately implies that

𝐫𝐞𝐠𝐫𝐞𝐭l(K)=kKFF(πθl,k)+kKF(πθl,k)F^l,k(πθl,k)kKFF(πθl,k)+kK𝐄l,kt=Hl,k+1γtr(stl,k,atl,k)kKFF(πθl,k)+kKγ/(k+1)1γ4(Dl+Xl,Tl)(1γ)C2Elϵ2K+1log2(K+3)+(K+1)dρπρϵ+γ+γlog(K+1)1γ.\begin{split}&\mathbf{regret}_{l}(K)=\sum\nolimits_{k\leq K}F^{\star}-F(\pi_{\theta^{l,k}})+\sum\nolimits_{k\leq K}F(\pi_{\theta^{l,k}})-\hat{F}^{l,k}(\pi_{\theta^{l,k}})\\ &\leq\sum\nolimits_{k\leq K}F^{\star}-F(\pi_{\theta^{l,k}})+\sum\nolimits_{k\leq K}\mathbf{E}_{l,k}\sum\nolimits_{t=H^{l,k}+1}^{\infty}\gamma^{t}r(s_{t}^{l,k},a_{t}^{l,k})\\ &\leq\sum\nolimits_{k\leq K}F^{\star}-F(\pi_{\theta^{l,k}})+\sum_{k\leq K}\frac{\gamma/(k+1)}{1-\gamma}\\ &\leq\dfrac{4(D_{l}+X_{l,T_{l}})}{(1-\gamma)C_{2}E_{l}\epsilon^{2}}\sqrt{K+1}\log_{2}(K+3)+(K+1)\left\|\dfrac{d_{\rho}^{\pi^{\star}}}{\rho}\right\|_{\infty}\epsilon+\frac{\gamma+\gamma\log(K+1)}{1-\gamma}.\end{split} (28)

Now if |I+|=0|I^{+}|=0, then we immediately have that

𝐫𝐞𝐠𝐫𝐞𝐭l(K)(K+1)dρπρϵ+γ+γlog(K+1)1γ,\mathbf{regret}_{l}(K)\leq(K+1)\left\|\dfrac{d_{\rho}^{\pi^{\star}}}{\rho}\right\|_{\infty}\epsilon+\frac{\gamma+\gamma\log(K+1)}{1-\gamma},

and hence (28) always holds.

Finally, by (24), we have that with probability at least 1δ1-\delta, for all K{0,,Tl1}K\in\{0,\dots,T_{l}-1\},

𝐫𝐞𝐠𝐫𝐞𝐭l(K)4(Dl+2Cllog(2/δ))(1γ)C2Elϵ2K+1log2(K+3)+ϵ(K+1)dρπρ+γ+γlog(K+1)1γ.\begin{split}&\mathbf{regret}_{l}(K)\\ &\leq\dfrac{4(D_{l}+\sqrt{2C_{l}\log(2/\delta)})}{(1-\gamma)C_{2}E_{l}\epsilon^{2}}\sqrt{K+1}\log_{2}(K+3)+\epsilon(K+1)\left\|\dfrac{d_{\rho}^{\pi^{\star}}}{\rho}\right\|_{\infty}+\frac{\gamma+\gamma\log(K+1)}{1-\gamma}.\end{split}

This completes our proof. ∎

B.3 Overall regret bound for general policy gradient estimators

In this section, we state and prove the overall regret bound for general policy gradient estimators, which generalizes Theorem 6 for REINFORCE gradient estimators.

Theorem 10 (General regret bound).

Under Assumptions 1 and 2, suppose that for each l0l\geq 0, we choose αl,k=Cl,α1k+3log2(k+3)\alpha^{l,k}=C_{l,\alpha}\frac{1}{\sqrt{k+3}\log_{2}(k+3)} for some Cl,α[C¯α,C2/(M2βλl)]C_{l,\alpha}\in[\underline{C}^{\alpha},C_{2}/(M_{2}\beta_{\lambda^{l}})], with C¯α(0,C2/(M2βλ¯)]\underline{C}^{\alpha}\in(0,C_{2}/(M_{2}\beta_{\bar{\lambda}})] and λ¯=1γ2\bar{\lambda}=\frac{1-\gamma}{2}. In addition, suppose that we specify T01T_{0}\geq 1, choose Tl=2lT0T_{l}=2^{l}T_{0}, ϵl=Tl1/6\epsilon^{l}=T_{l}^{-1/6} and λl=ϵl(1γ)2\lambda^{l}=\frac{\epsilon^{l}(1-\gamma)}{2} for each l0l\geq 0. Then we have for any δ(0,1)\delta\in(0,1), with probability at least 1δ1-\delta, for any N0N\geq 0, we have

𝐫𝐞𝐠𝐫𝐞𝐭(N)R¯1(N)+R¯2(N)=O(N5/6(log(N/δ))5/2),\begin{split}\mathbf{regret}(N)\leq\bar{R}_{1}(N)+\bar{R}_{2}(N)=O(N^{5/6}(\log(N/\delta))^{5/2}),\end{split} (29)

where

R¯1(N)=(4(D¯+2C¯((log2(N+1)+2)log2+log(1/δ)))(1γ)C2E¯+dρπρ)×(N+T0)56(log2(2N+2T0+1))2,R¯2(N)=(log2(N+1)+1)(γ+γlog(N+T0))1γ.\begin{split}\bar{R}_{1}(N)=&\,\left(\dfrac{4(\bar{D}+\sqrt{2\bar{C}((\log_{2}(N+1)+2)\log 2+\log(1/\delta))})}{(1-\gamma)C_{2}\underline{E}}+\left\|\dfrac{d_{\rho}^{\pi^{\star}}}{\rho}\right\|_{\infty}\right)\\ &\,\times(N+T_{0})^{\frac{5}{6}}(\log_{2}(2N+2T_{0}+1))^{2},\\ \bar{R}_{2}(N)=&\,\frac{(\log_{2}(N+1)+1)(\gamma+\gamma\log(N+T_{0}))}{1-\gamma}.\end{split} (30)

Here the constants E¯=C¯α(1γ)216S2A2\underline{E}=\frac{\underline{C}^{\alpha}(1-\gamma)^{2}}{16S^{2}A^{2}},

D¯=CC¯α2+βλ¯M1C¯α2+11γ+log(1/ϵpp),\bar{D}=C\bar{C}_{\alpha}^{2}+\beta_{\bar{\lambda}}M_{1}\bar{C}_{\alpha}^{2}+\frac{1}{1-\gamma}+\log(1/\epsilon_{\rm pp}),
C¯=32C12C¯α2(1(1γ)2+λ¯)2+βλ¯2C14C¯α42,\bar{C}=32C_{1}^{2}\bar{C}_{\alpha}^{2}\left(\frac{1}{(1-\gamma)^{2}}+\bar{\lambda}\right)^{2}+\frac{\beta_{\bar{\lambda}}^{2}C_{1}^{4}\bar{C}_{\alpha}^{4}}{2},

with C¯α=C2(1γ)38M2\bar{C}_{\alpha}=\frac{C_{2}(1-\gamma)^{3}}{8M_{2}}.

In addition, we also have

limN𝐫𝐞𝐠𝐫𝐞𝐭(N)/(N+1)=0almost surely\lim_{N\rightarrow\infty}\mathbf{regret}(N)/(N+1)=0\quad\text{almost surely}

with an asymptotic rate of O(N1/6(logN)5/2)O(N^{-1/6}(\log N)^{5/2}).
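To illustrate how the hyper-parameters in Theorem 10 fit together, here is a minimal sketch (ours; the numerical values of T₀, C_alpha and gamma below are placeholders, and in the theorem C_{l,α} must lie in the stated interval depending on C₂, M₂ and β_{λ^l}) of the per-phase schedule: doubling phase lengths, the per-phase accuracy and regularization weight, and the step-size rule.

import math

def schedule(l, k, gamma, T0=1, C_alpha=0.05):
    T_l = 2 ** l * T0                                  # doubling phase lengths T_l = 2^l T_0
    eps_l = T_l ** (-1.0 / 6.0)                        # per-phase accuracy epsilon^l
    lam_l = eps_l * (1.0 - gamma) / 2.0                # log-barrier weight lambda^l
    alpha_lk = C_alpha / (math.sqrt(k + 3) * math.log2(k + 3))   # step size alpha^{l,k}
    return T_l, eps_l, lam_l, alpha_lk

for l in range(4):
    print(schedule(l, k=0, gamma=0.99))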

Remark 1.

Notice that the constant E¯\underline{E} is a uniform lower bound of ElE_{l} (l0l\geq 0), while the constants D¯\bar{D} and C¯\bar{C} are uniform upper bounds of DlD_{l} and ClC_{l} (l0l\geq 0), respectively. Here the constants El,Dl,ClE_{l},\,D_{l},\,C_{l} are those defined in Theorem 9.

Remark 2.

In the big-OO notation above, we have (temporarily) hidden the problem dependent quantities, which will be made explicit when we specialize the results to the REINFORCE gradient estimation below.

Proof.

We first prove the high probability result. By (20) and the choices of ϵl\epsilon^{l} and λl\lambda^{l}, we have that for any phase l0l\geq 0, with probability at least 1δ/2l+11-\delta/2^{l+1}, for all K{0,,Tl1}K\in\{0,\dots,T_{l}-1\},

𝐫𝐞𝐠𝐫𝐞𝐭l(K)(4(D¯+2C¯((l+2)log2+log(1/δ)))(1γ)C2E¯+dρπρ)Tl5/6log2(Tl+2)+γ+γlogTl1γ.\begin{split}&\mathbf{regret}_{l}(K)\\ &\leq\left(\dfrac{4(\bar{D}+\sqrt{2\bar{C}((l+2)\log 2+\log(1/\delta))})}{(1-\gamma)C_{2}\underline{E}}+\left\|\dfrac{d_{\rho}^{\pi^{\star}}}{\rho}\right\|_{\infty}\right)T_{l}^{5/6}\log_{2}(T_{l}+2)+\frac{\gamma+\gamma\log T_{l}}{1-\gamma}.\end{split}

where E¯=C¯α(1γ)216S2A2\underline{E}=\frac{\underline{C}^{\alpha}(1-\gamma)^{2}}{16S^{2}A^{2}},

D¯=CC¯α2+βλ¯M1C¯α2+11γ+log(1/ϵpp),\bar{D}=C\bar{C}_{\alpha}^{2}+\beta_{\bar{\lambda}}M_{1}\bar{C}_{\alpha}^{2}+\frac{1}{1-\gamma}+\log(1/\epsilon_{\rm pp}),
C¯=32C12C¯α2(1(1γ)2+λ¯)2+βλ¯2C14C¯α42,\bar{C}=32C_{1}^{2}\bar{C}_{\alpha}^{2}\left(\frac{1}{(1-\gamma)^{2}}+\bar{\lambda}\right)^{2}+\frac{\beta_{\bar{\lambda}}^{2}C_{1}^{4}\bar{C}_{\alpha}^{4}}{2},

with C¯α=C2(1γ)38M2\bar{C}_{\alpha}=\frac{C_{2}(1-\gamma)^{3}}{8M_{2}} and λ¯=1γ2\bar{\lambda}=\frac{1-\gamma}{2}. Here we used the fact that ϵl1\epsilon^{l}\leq 1, which then implies that λlλ¯\lambda^{l}\leq\bar{\lambda} and

8(1γ)3βλlβλ¯=8(1γ)3+2λ¯S.\frac{8}{(1-\gamma)^{3}}\leq\beta_{\lambda^{l}}\leq\beta_{\bar{\lambda}}=\frac{8}{(1-\gamma)^{3}}+\frac{2\bar{\lambda}}{S}.

We also used the fact that FF(π)1/(1γ)F^{\star}-F(\pi)\leq 1/(1-\gamma) for any πΠ\pi\in\Pi and that by the definition of PostProcess, Rλl(πθl,0)logϵppR_{\lambda^{l}}(\pi_{\theta^{l,0}})\geq\log\epsilon_{\rm pp}.

Now recall that for any N0N\geq 0, we have

𝐫𝐞𝐠𝐫𝐞𝐭(N)=l=0lN1𝐫𝐞𝐠𝐫𝐞𝐭l(Tl1)+𝐫𝐞𝐠𝐫𝐞𝐭lN(kN)l=0lN𝐫𝐞𝐠𝐫𝐞𝐭l(Tl1),\begin{split}\mathbf{regret}(N)&=\sum_{l=0}^{l_{N}-1}\mathbf{regret}_{l}(T_{l}-1)+\mathbf{regret}_{l_{N}}(k_{N})\\ &\leq\sum_{l=0}^{l_{N}}\mathbf{regret}_{l}(T_{l}-1),\end{split}

where (lN,kN)=G𝒯(N)(l_{N},k_{N})=G_{\mathcal{T}}(N). In addition, by the choices of TlT_{l}, we have that for any 0kTl10\leq k\leq T_{l}-1,

B𝒯(l,k)\displaystyle B_{\mathcal{T}}(l,k) =j=0l1Tj+k\displaystyle=\sum_{j=0}^{l-1}T_{j}+k
=(2l1)T0+k\displaystyle=(2^{l}-1)T_{0}+k
(2l1)T0.\displaystyle\geq(2^{l}-1)T_{0}.

Hence for any N0N\geq 0, we have lNlog2(NT0+1)log2(N+1)l_{N}\leq\log_{2}\left(\frac{N}{T_{0}}+1\right)\leq\log_{2}(N+1).

Thus we have that with probability at least 1l=0lNδ/2l+11δ1-\sum_{l=0}^{l_{N}}\delta/2^{l+1}\geq 1-\delta, for any N0N\geq 0,

𝐫𝐞𝐠𝐫𝐞𝐭(N)(lN+1)(R^1(N)+R^2(N)),\begin{split}&\mathbf{regret}(N)\leq(l_{N}+1)(\hat{R}_{1}(N)+\hat{R}_{2}(N)),\end{split}

where

R^1(N)=(4(D¯+2C¯((lN+2)log2+log(1/δ)))(1γ)C2E¯+dρπρ)(N+T0)56log2(N+T0+2),R^2(N)=γ+γlog(N+T0)1γ.\begin{split}\hat{R}_{1}(N)&=\left(\dfrac{4(\bar{D}+\sqrt{2\bar{C}((l_{N}+2)\log 2+\log(1/\delta))})}{(1-\gamma)C_{2}\underline{E}}+\left\|\dfrac{d_{\rho}^{\pi^{\star}}}{\rho}\right\|_{\infty}\right)(N+T_{0})^{\frac{5}{6}}\log_{2}(N+T_{0}+2),\\ \hat{R}_{2}(N)&=\frac{\gamma+\gamma\log(N+T_{0})}{1-\gamma}.\end{split}

Finally, noticing that lN+1log2(N+1)+1log2(2N+2T0+1)l_{N}+1\leq\log_{2}(N+1)+1\leq\log_{2}(2N+2T_{0}+1), we have

(lN+1)R^1(N)(4(D¯+2C¯((log2(N+1)+2)log2+log(1/δ)))(1γ)C2E¯+dρπρ)×(N+T0)56(log2(2N+2T0+1))2,(lN+1)R^2(N)(log2(N+1)+1)(γ+γlog(N+T0))1γ,\begin{split}(l_{N}+1)\hat{R}_{1}(N)\leq&\,\left(\dfrac{4(\bar{D}+\sqrt{2\bar{C}((\log_{2}(N+1)+2)\log 2+\log(1/\delta))})}{(1-\gamma)C_{2}\underline{E}}+\left\|\dfrac{d_{\rho}^{\pi^{\star}}}{\rho}\right\|_{\infty}\right)\\ &\,\times(N+T_{0})^{\frac{5}{6}}(\log_{2}(2N+2T_{0}+1))^{2},\\ (l_{N}+1)\hat{R}_{2}(N)\leq&\,\frac{(\log_{2}(N+1)+1)(\gamma+\gamma\log(N+T_{0}))}{1-\gamma},\end{split}

which immediately imply (29) and (30). Notice that here we used the fact that log2(N+1)+1log2(2N+2T0+1)\log_{2}(N+1)+1\leq\log_{2}(2N+2T_{0}+1) (since T01T_{0}\geq 1), and that TlN+1N+T0T_{l}\leq N+1\leq N+T_{0} for all l=0,,lN1l=0,\dots,l_{N}-1 and TlN=2lNT0N+T0T_{l_{N}}=2^{l_{N}}T_{0}\leq N+T_{0}.

Finally, by invoking Lemma 4, we immediately obtain the almost sure convergence result. This completes our proof. ∎
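The index bookkeeping used in the proof above, namely the map B_𝒯(l,k)=(2^l−1)T₀+k from a (phase, step) pair to a global episode index and its inverse G_𝒯(N)=(l_N,k_N), can be sketched as follows; this is our own illustration with T₀=1, not code from the paper.

import math

def B(l, k, T0=1):
    # Global episode index of step k in phase l: sum_{j<l} T_j + k with T_j = 2^j T0.
    return (2 ** l - 1) * T0 + k

def G(N, T0=1):
    # Inverse map: the phase l_N containing episode N and the within-phase index k_N.
    l = 0
    while (2 ** (l + 1) - 1) * T0 <= N:
        l += 1
    return l, N - (2 ** l - 1) * T0

for N in [0, 1, 5, 100]:
    l, k = G(N)
    assert B(l, k) == N and 0 <= k < 2 ** l            # T_l = 2^l when T0 = 1
    assert l <= math.log2(N + 1)                       # l_N <= log2(N/T0 + 1)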

Appendix C Formal statement of REINFORCE regret bound

Here we provide a slightly more formal restatement of Theorem 6, with details about the constants in the big-OO notation in the main text. Recall that our goal there is to specialize (and slightly simplify) the regret bound in Theorem 10 to the case when the REINFORCE gradient estimation in §3.2 is adopted to evaluate ^θLλl(θl,k)\widehat{\nabla}_{\theta}L_{\lambda^{l}}(\theta^{l,k}). In particular, we have the following corollary. The proof simply combines Lemma 2 (with λlλ¯\lambda^{l}\leq\bar{\lambda} by their definitions in Theorem 6 or Corollary 11 below) and Theorem 10, plugs in the specific choices of the hyper-parameters and the constants in Lemma 2, and performs some elementary algebraic simplifications; it is hence omitted.

Corollary 11 (Formal statement of Theorem 6).

Under Assumption 1, suppose that for each l0l\geq 0, we choose αl,k=Cl,α1k+3log2(k+3)\alpha^{l,k}=C_{l,\alpha}\frac{1}{\sqrt{k+3}\log_{2}(k+3)}, with Cl,α[C¯α,1/(2βλl)]C_{l,\alpha}\in[\underline{C}^{\alpha},1/(2\beta_{\lambda^{l}})], C¯α(0,1/(2βλ¯)]\underline{C}^{\alpha}\in(0,1/(2\beta_{\bar{\lambda}})] and λ¯=1γ2\bar{\lambda}=\frac{1-\gamma}{2}, and choose Tl=2lT_{l}=2^{l}, ϵl=Tl1/6=2l/6\epsilon^{l}=T_{l}^{-1/6}=2^{-l/6}, λl=ϵl(1γ)2\lambda^{l}=\frac{\epsilon^{l}(1-\gamma)}{2} and ϵpp=1/(2A)\epsilon_{\rm pp}=1/(2A). In addition, suppose that (9) is adopted to evaluate ^θLλl(θl,k)\widehat{\nabla}_{\theta}L_{\lambda^{l}}(\theta^{l,k}), with β(0,1)\beta\in(0,1), |b(s)|B|b(s)|\leq B for any s𝒮s\in\mathcal{S} (where B>0B>0 is a constant), and that (10) holds for Hl,kH^{l,k} for all l,k0l,\,k\geq 0. Then we have for any δ(0,1)\delta\in(0,1), with probability at least 1δ1-\delta, for any N0N\geq 0, we have

𝐫𝐞𝐠𝐫𝐞𝐭(N)R~1(N)+R~2(N),\begin{split}\mathbf{regret}(N)\leq\tilde{R}_{1}(N)+\tilde{R}_{2}(N),\end{split}

where

R~1(N)=(4(D~+2C~((log2(N+1)+2)log2+log(1/δ)))(1γ)E¯+dρπρ)×(N+1)56(log2(2N+3))2,R~2(N)=γ(log2(N+1)+1)21γ.\begin{split}\tilde{R}_{1}(N)=&\,\left(\dfrac{4(\tilde{D}+\sqrt{2\tilde{C}((\log_{2}(N+1)+2)\log 2+\log(1/\delta))})}{(1-\gamma)\underline{E}}+\left\|\dfrac{d_{\rho}^{\pi^{\star}}}{\rho}\right\|_{\infty}\right)\\ &\times(N+1)^{\frac{5}{6}}(\log_{2}(2N+3))^{2},\\ \tilde{R}_{2}(N)=&\,\frac{\gamma(\log_{2}(N+1)+1)^{2}}{1-\gamma}.\end{split}

Here the constants are E¯=C¯α(1γ)216S2A2\underline{E}=\frac{\underline{C}^{\alpha}(1-\gamma)^{2}}{16S^{2}A^{2}}, and

D~=(1γ)6(1(1γ)2+λ¯)2+1256(1γ)6βλ¯(32(1γ)4+V¯b)+11γ+log(2A),C~=βλ¯2(1γ)12((1+B(1γ))(1γ)2+λ¯)48192+12(1γ)6((1+B(1γ))(1γ)2+λ¯)4.\begin{split}\tilde{D}&=(1-\gamma)^{6}\left(\frac{1}{(1-\gamma)^{2}}+\bar{\lambda}\right)^{2}+\frac{1}{256}(1-\gamma)^{6}\beta_{\bar{\lambda}}\left(\frac{32}{(1-\gamma)^{4}}+\bar{V}_{b}\right)+\frac{1}{1-\gamma}+\log(2A),\\ \tilde{C}&=\frac{\beta_{\bar{\lambda}}^{2}(1-\gamma)^{12}\left(\frac{(1+B(1-\gamma))}{(1-\gamma)^{2}}+\bar{\lambda}\right)^{4}}{8192}+\frac{1}{2}(1-\gamma)^{6}\left(\frac{(1+B(1-\gamma))}{(1-\gamma)^{2}}+\bar{\lambda}\right)^{4}.\end{split}

Here V¯b\bar{V}_{b} is the variance bound defined in Lemma 2.

Suppose in addition that we specify C¯α=1/(2βλ¯)\underline{C}^{\alpha}=1/(2\beta_{\bar{\lambda}}). Then we can simplify the regret bound into the following form:

𝐫𝐞𝐠𝐫𝐞𝐭(N)=O((S2A2(1γ)7+dρπρ)N56(log(N/δ))52).\mathbf{regret}(N)=O\left(\left(\frac{S^{2}A^{2}}{(1-\gamma)^{7}}+\left\|\dfrac{d_{\rho}^{\pi^{\star}}}{\rho}\right\|_{\infty}\right)N^{\frac{5}{6}}(\log(N/\delta))^{\frac{5}{2}}\right).

In addition, we also have

limN𝐫𝐞𝐠𝐫𝐞𝐭(N)/(N+1)=0almost surely\lim_{N\rightarrow\infty}\mathbf{regret}(N)/(N+1)=0\quad\text{almost surely}

with an asymptotic rate of O((S2A2(1γ)7+dρπ/ρ)N16(logN)52)O\left(\left(\frac{S^{2}A^{2}}{(1-\gamma)^{7}}+\left\|d_{\rho}^{\pi^{\star}}/\rho\right\|_{\infty}\right)N^{-\frac{1}{6}}(\log N)^{\frac{5}{2}}\right).

Remark 3.

Notice that here (and below), with the specific choices of algorithm hyper-parameters and gradient estimators we are finally able to make all the problem dependent quantities (e.g., γ,S,A\gamma,\,S,\,A, etc.) explicit in the big-OO notation, which is consistent with the convention of the reinforcement learning literature. Here the only hidden quantities are some absolute constants.

Appendix D Mini-batch phased policy gradient method

Here we formalize the mini-batch version of Algorithm 2 described at the beginning of §5 as Algorithm 4, and provide a formal statement as well as a proof for Corollary 7.

Firstly, we have the following lemma, which transfers guarantees on ^θ(i)Lλl(θl,k)\widehat{\nabla}_{\theta}^{(i)}L_{\lambda^{l}}(\theta^{l,k}) (i=1,,Mi=1,\dots,M) to the averaged gradient estimation 1Mi=1M^θ(i)Lλl(θl,k)\frac{1}{M}\sum_{i=1}^{M}\widehat{\nabla}_{\theta}^{(i)}L_{\lambda^{l}}({\theta^{l,k}}). The proof follows directly from the fact that the variance of the sum of independent random variables is the sum of the variances, and is thus omitted.

Lemma 12.

Suppose that each ^θ(i)Lλl(θl,k)\widehat{\nabla}_{\theta}^{(i)}L_{\lambda^{l}}(\theta^{l,k}) (i=1,,Mi=1,\dots,M) is computed using (9) with the corresponding trajectory, and that the same assumptions as in Lemma 2 hold. Then Assumption 2 also holds for ^θLλl(θl,k)=1Mi=1M^θ(i)Lλl(θl,k)\widehat{\nabla}_{\theta}L_{\lambda^{l}}(\theta^{l,k})=\frac{1}{M}\sum_{i=1}^{M}\widehat{\nabla}_{\theta}^{(i)}L_{\lambda^{l}}({\theta^{l,k}}) with the same constants C,C1,C2,M2,δl,kC,C_{1},C_{2},M_{2},\delta_{l,k} and V¯b\bar{V}_{b} as in Lemma 2, while M1=32(1γ)4+V¯bMM_{1}=\frac{32}{(1-\gamma)^{4}}+\frac{\bar{V}_{b}}{M}.
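A toy numerical illustration (ours; the “gradient estimates” below are synthetic Gaussian perturbations of a fixed vector, not actual REINFORCE estimates) of the variance scaling in Lemma 12: averaging M independent estimates leaves the mean unchanged and divides the summed variance by roughly M.

import numpy as np

rng = np.random.default_rng(0)
true_grad = np.array([1.0, -2.0, 0.5])

def one_estimate():
    # Stand-in for a single-trajectory gradient estimate: the true gradient plus noise.
    return true_grad + rng.normal(scale=2.0, size=3)

M, trials = 16, 5000
estimates = np.array([[one_estimate() for _ in range(M)] for _ in range(trials)])
var_single = estimates[:, 0, :].var(axis=0).sum()      # effectively M = 1
var_avg = estimates.mean(axis=1).var(axis=0).sum()     # averaged over the mini-batch
print(var_single / var_avg)                            # ratio is roughly M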

Now we are ready to state and prove (a more formal version of) Corollary 7.

Corollary 13 (Formal statement of Corollary 7).

Under Assumption 1, suppose that for each l0l\geq 0, we choose αl,k=Cl,α1k+3log2(k+3)\alpha^{l,k}=C_{l,\alpha}\frac{1}{\sqrt{k+3}\log_{2}(k+3)}, with Cl,α[C¯α,1/(2βλl)]C_{l,\alpha}\in[\underline{C}^{\alpha},1/(2\beta_{\lambda^{l}})], C¯α(0,1/(2βλ¯)]\underline{C}^{\alpha}\in(0,1/(2\beta_{\bar{\lambda}})] and λ¯=1γ2\bar{\lambda}=\frac{1-\gamma}{2}, and choose Tl=2lT_{l}=2^{l}, ϵl=Tl1/6=2l/6\epsilon^{l}=T_{l}^{-1/6}=2^{-l/6}, λl=ϵl(1γ)2\lambda^{l}=\frac{\epsilon^{l}(1-\gamma)}{2} and ϵpp=1/(2A)\epsilon_{\rm pp}=1/(2A). In addition, suppose that the assumptions in Lemma 12 hold (note that Assumption 1 and λlλ¯\lambda^{l}\leq\bar{\lambda} already automatically hold by the other assumptions). Then we have for any δ(0,1)\delta\in(0,1), with probability at least 1δ1-\delta, for any NM1N\geq M\geq 1, we have that (for the mini-batch Algorithm 4)

𝐫𝐞𝐠𝐫𝐞𝐭(NM;M)R~1(N;M)+R~2(N;M),\begin{split}\mathbf{regret}(N-M;M)\leq\tilde{R}_{1}(N;M)+\tilde{R}_{2}(N;M),\end{split}

where

R~1(N;M)=(4(D~M+2C~((log2(N/M)+2)log2+log(1/δ)))(1γ)E¯+dρπρ)×M16N56(log2(2(N/M)+1))2,R~2(N;M)=γM(log2(N/M)+1)21γ.\begin{split}\tilde{R}_{1}(N;M)=&\left(\dfrac{4(\tilde{D}_{M}+\sqrt{2\tilde{C}((\log_{2}(N/M)+2)\log 2+\log(1/\delta))})}{(1-\gamma)\underline{E}}+\left\|\dfrac{d_{\rho}^{\pi^{\star}}}{\rho}\right\|_{\infty}\right)\\ &\,\times M^{\frac{1}{6}}N^{\frac{5}{6}}(\log_{2}(2(N/M)+1))^{2},\\ \tilde{R}_{2}(N;M)=&\frac{\gamma M(\log_{2}(N/M)+1)^{2}}{1-\gamma}.\end{split}

Here the constants E¯\underline{E} and C~\tilde{C} are the same as in Corollary 11, while

D~M=(1γ)6(1(1γ)2+λ¯)2+1256(1γ)6βλ¯(32(1γ)4+V¯bM)+11γ+log(2A).\tilde{D}_{M}=(1-\gamma)^{6}\left(\frac{1}{(1-\gamma)^{2}}+\bar{\lambda}\right)^{2}+\frac{1}{256}(1-\gamma)^{6}\beta_{\bar{\lambda}}\left(\frac{32}{(1-\gamma)^{4}}+\frac{\bar{V}_{b}}{M}\right)+\frac{1}{1-\gamma}+\log(2A).

Here V¯b\bar{V}_{b} is the variance bound defined in Lemma 2.

Suppose in addition that we specify C¯α=1/(2βλ¯)\underline{C}^{\alpha}=1/(2\beta_{\bar{\lambda}}). Then we can simplify the regret bound into the following simple form:

𝐫𝐞𝐠𝐫𝐞𝐭(N;M)=O((S2A2(1γ)7+dρπρ)(M16+M56)(N+M)56(log(N/δ))52+M(logN)21γ).\begin{split}&\mathbf{regret}(N;M)\\ &=O\left(\left(\dfrac{S^{2}A^{2}}{(1-\gamma)^{7}}+\left\|\dfrac{d_{\rho}^{\pi^{\star}}}{\rho}\right\|_{\infty}\right)(M^{\frac{1}{6}}+M^{-\frac{5}{6}})(N+M)^{\frac{5}{6}}(\log(N/\delta))^{\frac{5}{2}}+\dfrac{M(\log N)^{2}}{1-\gamma}\right).\end{split}

In addition, we also have

limN𝐫𝐞𝐠𝐫𝐞𝐭(N;M)/(N+1)=0almost surely\lim_{N\rightarrow\infty}\mathbf{regret}(N;M)/(N+1)=0\quad\text{almost surely}

with an asymptotic rate of

O((S2A2(1γ)7+dρπρ)(M16+M56)N16(1+MN)56(logN)52+M(logN)2(1γ)N).O\left(\left(\dfrac{S^{2}A^{2}}{(1-\gamma)^{7}}+\left\|\dfrac{d_{\rho}^{\pi^{\star}}}{\rho}\right\|_{\infty}\right)(M^{\frac{1}{6}}+M^{-\frac{5}{6}})N^{-\frac{1}{6}}\left(1+\dfrac{M}{N}\right)^{\frac{5}{6}}(\log N)^{\frac{5}{2}}+\dfrac{M(\log N)^{2}}{(1-\gamma)N}\right).
Proof of Corollary 7.

By the definition of 𝐫𝐞𝐠𝐫𝐞𝐭(N;M)\mathbf{regret}(N;M), we immediately see that

𝐫𝐞𝐠𝐫𝐞𝐭(N;M)M𝐫𝐞𝐠𝐫𝐞𝐭(N/M),\mathbf{regret}(N;M)\leq M\mathbf{regret}(\lfloor N/M\rfloor), (31)

where 𝐫𝐞𝐠𝐫𝐞𝐭(J)\mathbf{regret}(J) (J0J\geq 0) is the original regret (6) in the mini-batch setting, defined with respect to the total number of inner iterations/steps (rather than episodes, and hence not multiplied by a factor of MM). More precisely, we have that for any J0J\geq 0,

𝐫𝐞𝐠𝐫𝐞𝐭(J)={(l,k)|B𝒯(l,k)J}FF(πθl,k).\mathbf{regret}(J)=\sum\nolimits_{\{(l,k)|B_{\mathcal{T}}(l,k)\leq J\}}F^{\star}-F(\pi_{\theta^{l,k}}).

Now by Lemma 12 and Theorem 10, and following the same simplification as is done in Corollary 11, we have that for any J0J\geq 0,

𝐫𝐞𝐠𝐫𝐞𝐭(J)(4(D~M+2C~((log2(J+1)+2)log2+log(1/δ)))(1γ)E¯+dρπρ)(J+1)56(log2(2J+3))2+γ(log2(J+1)+1)21γ,\begin{split}&\mathbf{regret}(J)\\ &\leq\left(\dfrac{4(\tilde{D}_{M}+\sqrt{2\tilde{C}((\log_{2}(J+1)+2)\log 2+\log(1/\delta))})}{(1-\gamma)\underline{E}}+\left\|\dfrac{d_{\rho}^{\pi^{\star}}}{\rho}\right\|_{\infty}\right)(J+1)^{\frac{5}{6}}(\log_{2}(2J+3))^{2}\\ &\quad+\frac{\gamma(\log_{2}(J+1)+1)^{2}}{1-\gamma},\end{split}

where the constants are as stated in the corollary above.

The proof is then complete by plugging the bound of 𝐫𝐞𝐠𝐫𝐞𝐭(J)\mathbf{regret}(J) above into (31) and invoking Lemma 4. ∎

Algorithm 4 Mini-Batch Phased Policy Gradient Method
1:  Input: initial parameter θ~0,0\tilde{\theta}^{0,0}, step-sizes αl,k\alpha^{l,k}, regularization parameters λl\lambda^{l}, phase lengths TlT_{l} (l,k0l,\,k\geq 0), post-processing tolerance ϵpp\epsilon_{\rm pp} and batch size M>0M>0.
2:  Set θ0,0=PostProcess(θ~0,0,ϵpp)\theta^{0,0}=\texttt{PostProcess}(\tilde{\theta}^{0,0},\epsilon_{\rm pp}).
3:  for phase l=0,1,2,l=0,1,2,\dots do
4:     for step k=0,1,,Tl1k=0,1,\dots,T_{l}-1 do
5:        Choose Hl,kH^{l,k}, sample IID trajectories {τil,k}i=1M\{\tau_{i}^{l,k}\}_{i=1}^{M} (each with horizon Hl,kH^{l,k}) from \mathcal{M} following policy πθl,k\pi_{\theta^{l,k}}, and compute an approximate gradient ^θ(i)Lλl(θl,k)\widehat{\nabla}_{\theta}^{(i)}L_{\lambda^{l}}({\theta^{l,k}}) of LλlL_{\lambda^{l}} for each trajectory τil,k\tau_{i}^{l,k} (i=1,,Mi=1,\dots,M).
6:        Update θl,k+1=θl,k+αl,k1Mi=1M^θ(i)Lλl(θl,k)\theta^{l,k+1}=\theta^{l,k}+\alpha^{l,k}\frac{1}{M}\sum_{i=1}^{M}\widehat{\nabla}_{\theta}^{(i)}L_{\lambda^{l}}({\theta^{l,k}}).
7:     end for
8:     Set θl+1,0=PostProcess(θl,Tl,ϵpp)\theta^{l+1,0}=\texttt{PostProcess}(\theta^{l,T_{l}},\epsilon_{\rm pp}).
9:  end for
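For readers who prefer code, below is a schematic Python rendering of Algorithm 4 (ours; the environment interaction, PostProcess and the per-trajectory gradient estimator are passed in as placeholder callables, so this is a sketch of the control flow rather than a full implementation).

import numpy as np

def minibatch_phased_pg(theta0, post_process, sample_trajectory, grad_estimate,
                        alpha, lam, T, horizon, eps_pp, M, num_phases):
    theta = post_process(theta0, eps_pp)                             # line 2
    for l in range(num_phases):                                      # line 3
        for k in range(T(l)):                                        # line 4
            H = horizon(l, k)                                        # line 5
            trajs = [sample_trajectory(theta, H) for _ in range(M)]
            grads = [grad_estimate(theta, tau, lam(l)) for tau in trajs]
            theta = theta + alpha(l, k) * np.mean(grads, axis=0)     # line 6
        theta = post_process(theta, eps_pp)                          # line 8
    return theta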

Appendix E Related work

Policy gradient methods are a large family of algorithms for reinforcement learning that directly operate on the agent policy, rather than on the action-value function (Glynn 1986; Sutton and Barto 2018). Examples of policy gradient methods include REINFORCE (Williams 1992), A3C (Mnih et al. 2016), DPG (Silver et al. 2014), PPO (Schulman et al. 2017), and TRPO (Schulman et al. 2015), to name just a few. These methods seek to directly maximize the cumulative reward as a function of the policy; they are straightforward to implement and amenable to function approximation. The asymptotic convergence of (some) policy gradient methods to a stationary point has long been established (Sutton et al. 2000; Konda and Tsitsiklis 2003; Marbach and Tsitsiklis 2001; Baxter and Bartlett 2001). The rate of such convergence is also known and has been improved more recently, with the help of variance reduction (Papini et al. 2018; Xu, Gao, and Gu 2020, 2019), Hessian information (Shen et al. 2019) and momentum techniques (Xiong et al. 2020; Yuan et al. 2020; Pham et al. 2020; Huang et al. 2020). In contrast, until recently, the global convergence of (vanilla) policy gradient methods (like REINFORCE) had not been established unless unrealistic assumptions, such as concavity of the expected cumulative reward function (Ma et al. 2016), were imposed. The only exceptions are TRPO (Neu, Jonsson, and Gómez 2017) and the soft-max natural policy gradient method with fully known models (Agarwal et al. 2019), which happen to be equivalent to the MDP Expert algorithms (Even-Dar, Kakade, and Mansour 2004, 2009; Neu et al. 2010).

In the past two years, a line of research on the global convergence theory for (both vanilla and natural) policy gradient methods has emerged. By using a gradient domination property of the cumulative reward, global convergence of (both vanilla and natural) policy gradient methods is first established for linear-quadratic regulators (Fazel et al. 2018). For general Markov Decision Processes (MDPs), (Zhang et al. 2019) establishes convergence to approximately locally optimal (i.e., second-order stationary) solutions for vanilla policy gradient methods. The global optimality of stationary points for general MDPs is shown in (Bhandari and Russo 2019), and rates of convergence towards globally optimal solutions for (both vanilla and natural) policy gradient methods with (neural network) function approximation are derived in (Agarwal et al. 2019; Wang et al. 2019). These convergence results are then improved by several very recent works focusing on exact gradient settings. In particular, (Mei et al. 2020) focuses on the more practically relevant soft-max parametrization and vanilla policy gradient methods and improves the results of (Agarwal et al. 2019) by removing the requirement of the relative entropy regularization and obtaining faster convergence rates; (Bhandari and Russo 2020) obtains linear convergence for a general class of policy gradient methods; (Cen et al. 2020) shows local quadratic convergence of natural policy gradient methods; and (Zhang et al. 2020) extends the results to reinforcement learning with general utilities. For more modern policy gradient methods, (Zhao, Li, and Wen 2019) establishes the asymptotic global convergence of TRPO, while (Liu et al. 2019) further derives the global convergence rates for PPO and TRPO. These rates are then improved in (Shani, Efroni, and Mannor 2019) for TRPO with adaptive regularization terms. Very recently, (Fu, Yang, and Wang 2020) extends these results to obtain the global convergence rates of single-timescale actor-critic methods with PPO actor updates, and (Agarwal et al. 2020) derives global convergence rates of a new policy gradient algorithm combining natural policy gradient methods with a policy cover technique and shows that the algorithm enjoys better exploration behavior and hence removes the need for access to a fully supported initial distribution ρ\rho, which is assumed in most other works on global convergence of policy gradient methods (including our work). All the above works either require exact and deterministic updates or mini-batch updates with a diverging mini-batch size.

Lately, (Jin, Schmitt, and Wen 2020) studies vanilla policy gradient methods using the REINFORCE gradient estimators computed with a single trajectory in each episode and obtains high probability sample complexity results, but the setting is restricted to linear-quadratic regulators and their bounds have polynomial dependency on 1/δ1/\delta (in contrast to our logarithmic dependency on 1/δ1/\delta), where δ\delta is the probability that the bounds are violated. The authors of (Abbasi-Yadkori et al. 2019) study natural policy gradient methods with a general high probability estimation oracle for state-action value functions (i.e., QQ-functions) in the average reward settings, and establish high probability regret bounds for these algorithms. Finally, we remark that there are also some recent results on the global convergence rates of natural policy gradient methods in adversarial settings (with full-information feedback) (Cai et al. 2019), model-based natural policy gradient methods (Efroni et al. 2020) as well as extensions to non-stationary (Fei et al. 2020) and multi-agent game settings (Zhang, Yang, and Basar 2019; Mazumdar et al. 2019; Carmona, Laurière, and Tan 2019; Fu et al. 2019; Guo et al. 2020), which are beyond the scope of this paper.