
Regret Analysis of Policy Gradient Algorithm for Infinite Horizon Average Reward Markov Decision Processes

Qinbo Bai*, Washim Uddin Mondal*, Vaneet Aggarwal (* equal contribution)
Abstract

In this paper, we consider an infinite horizon average reward Markov Decision Process (MDP). Unlike existing works in this setting, our approach employs a general policy gradient-based algorithm and does not require a linear MDP structure. We propose a policy gradient-based algorithm and show its global convergence property. We then prove that the proposed algorithm has $\tilde{\mathcal{O}}(T^{3/4})$ regret. To the best of our knowledge, this is the first work to establish a regret bound for a general parameterized policy gradient algorithm in the average reward setting.

Introduction

Reinforcement Learning (RL) describes a class of problems where a learner repeatedly interacts with an unknown environment with the intention of maximizing the cumulative sum of rewards. This model has found applications in a wide array of areas, ranging from networking to transportation to epidemic control (Geng et al. 2020; Al-Abbasi, Ghosh, and Aggarwal 2019; Ling, Mondal, and Ukkusuri 2023). RL problems are typically analysed via three distinct setups: episodic, infinite horizon discounted reward, and infinite horizon average reward. Among these, the infinite horizon average reward setup holds particular significance in real-world applications (including those mentioned above) due to its alignment with many practical scenarios and its ability to capture essential long-term behaviors. However, scalable algorithms in this setup have not been widely studied. This paper provides an algorithm for the infinite horizon average reward setup with general parameterized policies that yields sub-linear regret guarantees. To the best of our knowledge, this result is the first of its kind in the average reward setting.

There are two major approaches to solving an RL problem. The first one, known as the model-based approach, involves constructing an estimate of the transition probabilities of the underlying Markov Decision Process (MDP). This estimate is subsequently leveraged to derive policies (Auer, Jaksch, and Ortner 2008; Agrawal and Jia 2017; Ouyang et al. 2017; Fruit et al. 2018). It is worth noting that model-based techniques encounter a significant challenge: these algorithms demand substantial memory to house the model parameters. Consequently, their practical application is hindered when dealing with large state spaces. An alternative strategy is referred to as model-free algorithms. These methods either directly estimate the policy function or maintain an estimate of the $Q$ function, which are subsequently employed for policy generation (Mnih et al. 2015; Schulman et al. 2015; Mnih et al. 2016). The advantage of these algorithms lies in their adaptability to handle large state spaces.

In the average reward MDP, which is the setting considered in our paper, one of the key performance indicators of an algorithm is the expected regret. It has been theoretically demonstrated in (Auer, Jaksch, and Ortner 2008) that the expected regret of any algorithm for a broad class of MDPs is lower bounded by $\Omega(\sqrt{T})$, where $T$ denotes the length of the time horizon. Many model-based algorithms, such as (Auer, Jaksch, and Ortner 2008; Agrawal and Jia 2017), achieve this bound. Unfortunately, the above algorithms are designed to be applicable solely in the tabular setup. Recently, (Wei et al. 2021) proposed a model-based algorithm for the linear MDP setup that is shown to achieve the optimal regret bound. On the other hand, (Wei et al. 2020) proposed a model-free $Q$-estimation-based algorithm that achieves the optimal regret in the tabular setup.

One way to extend algorithms beyond the tabular setting is via policy parameterization. Here, the policies are indexed by parameters (via, for example, neural networks), and learning is carried out by updating these parameters using some update rule (such as gradient ascent). Such algorithms are referred to as policy gradient (PG) algorithms. Interestingly, the analysis of PG algorithms is typically restricted to the discounted reward setup. For example, (Agarwal et al. 2021) characterized the sample complexity of PG and Natural PG (NPG) with softmax and tabular parameterization. Sample complexity results for general parameterization are given by (Liu et al. 2020; Ding et al. 2020). However, the sub-linear regret analysis of a PG-based algorithm with general parameterization in the average reward setup has, to the best of our knowledge, not been studied in the literature. This paper aims to bridge this gap by addressing this crucial problem.

Challenges and Contribution

We propose a PG-based algorithm with general parameterization in the average reward setup and establish a sublinear regret bound for the proposed algorithm. In particular, within the class of ergodic MDPs, we first show that our PG-based algorithm achieves an average optimality error of $\tilde{\mathcal{O}}(T^{-\frac{1}{4}})$. Utilizing this convergence result, we establish that the algorithm achieves a regret bound of $\tilde{\mathcal{O}}(T^{3/4})$.

Despite the availability of sample complexity analyses of PG algorithms in the discounted reward setup, obtaining a sublinear regret bound for their average reward counterpart is quite difficult. This is because in the average reward case, the value function estimators, which are crucial for estimating the gradient, can become unbounded, unlike their discounted reward counterparts. Indeed, the sample complexity results in discounted MDPs are often associated with a $\frac{1}{1-\gamma}$ factor (where $\gamma$ denotes the discount factor, which is $1$ in the average reward case), indicating that a naive adaptation of these estimators will not work for the average reward setup. Also, discounted setups typically assume access to a simulator to generate unbiased value estimates. On the contrary, our paper deals with a single sample trajectory and does not assume the availability of a simulator. To obtain a good estimator of the gradient, we design an epoch-based algorithm where the length of each epoch is $H$. The algorithm estimates the value functions within a given epoch by sampling rewards of sub-trajectories of length $N$ that are at least $N$ distance apart. The separation between these sub-trajectories ensures that their reward samples are sufficiently independent. The key challenge of this paper is to bound a second-order term related to the variance between the estimated gradient and the true gradient. We show that by judiciously controlling the growth rates of $H$ and $N$ with $T$, it is possible to obtain a gradient estimator whose variance decreases asymptotically.

Related Works

As discussed in the introduction, the reinforcement learning problem has recently been widely studied in the episodic setting and the infinite horizon discounted reward setting. For example, (Jin et al. 2018) proposed the model-free UCB-Q learning algorithm and showed a $\mathcal{O}(\sqrt{T})$ regret in the episodic setting. In the discounted reward setting, (Ding et al. 2020) achieved $\mathcal{O}(\epsilon^{-2})$ sample complexity for the softmax parametrization using the Natural Policy Gradient algorithm, whereas (Mondal and Aggarwal 2023; Fatkhullin et al. 2023) exhibited the same complexity for the general parameterization. However, the regret analysis and the global convergence of the infinite horizon average reward case are much less investigated.

Algorithm | Regret | Ergodic | Model-free | Setting
UCRL2 (Auer, Jaksch, and Ortner 2008) | $\tilde{O}(DS\sqrt{AT})$ | No | No | Tabular
PSRL (Agrawal and Jia 2017) | $\tilde{O}(DS\sqrt{AT})$ | No | No | Tabular
OPTIMISTIC Q-LEARNING (Wei et al. 2020) | $\tilde{O}(T^{2/3})$ | No | Yes | Tabular
MDP-OOMD (Wei et al. 2020) | $\tilde{O}(\sqrt{T})$ | Yes | Yes | Tabular
FOPO (Wei et al. 2021)* | $\tilde{O}(\sqrt{T})$ | No | No | Linear MDP
OLSVI.FH (Wei et al. 2021) | $\tilde{O}(T^{3/4})$ | No | No | Linear MDP
MDP-EXP2 (Wei et al. 2021) | $\tilde{O}(\sqrt{T})$ | Yes | No | Linear MDP
This paper | $\tilde{O}(T^{\frac{3}{4}})$ | Yes | Yes | General parametrization
Lower bound (Auer, Jaksch, and Ortner 2008) | $\Omega(\sqrt{DSAT})$ | N/A | N/A | N/A
* FOPO is computationally inefficient.
Table 1: This table summarizes the state-of-the-art model-based and model-free algorithms available in the literature for average reward MDPs. We note that ours is the first work to analyze the regret for average reward MDPs with general parametrization.

For infinite horizon average reward MDPs, (Auer, Jaksch, and Ortner 2008) proposed the model-based Upper Confidence Reinforcement Learning (UCRL2) algorithm and established that it obeys a $\tilde{\mathcal{O}}(\sqrt{T})$ regret bound. (Agrawal and Jia 2017) proposed posterior sampling-based approaches for average reward MDPs. (Wei et al. 2020) proposed the optimistic Q-learning algorithm, which connects the discounted and average reward settings to show $\mathcal{O}(T^{3/4})$ regret in the weakly communicating average reward case, as well as an online mirror descent algorithm which achieves $\mathcal{O}(\sqrt{T})$ regret in the ergodic setting. For the linear MDP setting, (Wei et al. 2021) proposed three algorithms, including the MDP-EXP2 algorithm which achieves $O(\sqrt{T})$ regret under the ergodicity assumption. These works are summarized in Table 1. We note that the assumption of a weakly communicating MDP is the minimal assumption needed to obtain sub-linear regret. However, it is much more challenging to work with this assumption in the general parametrized setting for the following reasons. Firstly, there is no guarantee that the state distribution converges to the stationary distribution exponentially fast, which is the property required to show that the value functions are bounded by the mixing time. Secondly, it is unclear how to obtain an asymptotically unbiased estimate of the policy gradient. Thus, we assume an ergodic MDP in this work, following other works in the literature (Pesquerel and Maillard 2022; Gong and Wang 2020). MDPs with constraints have also been recently studied in the model-based (Agarwal, Bai, and Aggarwal 2022b, a), model-free tabular (Wei, Liu, and Ying 2022; Chen, Jain, and Luo 2022), and linear MDP setups (Ghosh, Zhou, and Shroff 2022).

However, all of the above algorithms are designed for the tabular setting or under the linear MDP assumption, and none of them uses a PG algorithm with general parametrization. In this paper, we propose a PG algorithm for ergodic MDPs with general parametrization and analyze its regret. Our algorithm can, therefore, be applied beyond the tabular and linear MDP settings.

Formulation

In this paper, we consider an infinite horizon reinforcement learning problem with the average reward criterion, modeled by a Markov Decision Process (MDP) written as a tuple $\mathcal{M}=(\mathcal{S},\mathcal{A},r,P,\rho)$ where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space of size $A$, $r:\mathcal{S}\times\mathcal{A}\rightarrow[0,1]$ is the reward function, $P:\mathcal{S}\times\mathcal{A}\rightarrow\Delta^{|\mathcal{S}|}$ is the state transition function where $\Delta^{|\mathcal{S}|}$ denotes the probability simplex of dimension $|\mathcal{S}|$, and $\rho:\mathcal{S}\rightarrow[0,1]$ is the initial distribution of states. A policy $\pi:\mathcal{S}\rightarrow\Delta^{|\mathcal{A}|}$ decides the distribution of the action to be taken given the current state. For a given policy $\pi$, we define the long-term average reward as follows.

J_{\rho}^{\pi}\triangleq\lim\limits_{T\rightarrow\infty}\frac{1}{T}\mathbf{E}\bigg{[}\sum_{t=0}^{T-1}r(s_{t},a_{t})\bigg{|}s_{0}\sim\rho\bigg{]} (1)

where the expectation is taken over all state-action trajectories generated by following the action execution process $a_{t}\sim\pi(\cdot|s_{t})$ and the state transition rule $s_{t+1}\sim P(\cdot|s_{t},a_{t})$, $\forall t\in\{0,1,\cdots\}$. To simplify notation, we shall drop the dependence on $\rho$ whenever there is no confusion. We consider a parametrized class of policies $\Pi$, each element of which is indexed by a $\mathrm{d}$-dimensional parameter $\theta\in\Theta$ where $\Theta\subset\mathbb{R}^{\mathrm{d}}$. Our goal is to solve the following optimization problem.

\max_{\theta\in\Theta}~J^{\pi_{\theta}}\triangleq J(\theta) (2)

A policy $\pi_{\theta}$ induces a transition function $P^{\pi_{\theta}}:\mathcal{S}\rightarrow\Delta^{|\mathcal{S}|}$ as $P^{\pi_{\theta}}(s,s^{\prime})=\sum_{a\in\mathcal{A}}P(s^{\prime}|s,a)\pi_{\theta}(a|s)$, $\forall s,s^{\prime}\in\mathcal{S}$. If $\mathcal{M}$ is such that for every policy $\pi$, the induced function $P^{\pi}$ is irreducible and aperiodic, then $\mathcal{M}$ is called ergodic.

Assumption 1.

The MDP $\mathcal{M}$ is ergodic.

Ergodicity is commonly assumed in the analysis of MDPs (Pesquerel and Maillard 2022; Gong and Wang 2020). It is well known that if $\mathcal{M}$ is ergodic, then $\forall\theta\in\Theta$, there exists a unique stationary distribution $d^{\pi_{\theta}}\in\Delta^{|\mathcal{S}|}$ defined as,

d^{\pi_{\theta}}(s)=\lim_{T\rightarrow\infty}\dfrac{1}{T}\left[\sum_{t=0}^{T-1}\mathrm{Pr}(s_{t}=s|s_{0}\sim\rho,\pi_{\theta})\right] (3)

Note that under the assumption of ergodicity, $d^{\pi_{\theta}}$ is independent of the initial distribution $\rho$ and satisfies $P^{\pi_{\theta}}d^{\pi_{\theta}}=d^{\pi_{\theta}}$. In this case, we can write the average reward as follows.

\begin{split}&J(\theta)=\mathbf{E}_{s\sim d^{\pi_{\theta}},a\sim\pi_{\theta}(\cdot|s)}[r(s,a)]=(d^{\pi_{\theta}})^{T}r^{\pi_{\theta}}\\ &\text{where}~r^{\pi_{\theta}}(s)\triangleq\sum_{a\in\mathcal{A}}r(s,a)\pi_{\theta}(a|s),~\forall s\in\mathcal{S}\end{split} (4)
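To make the quantities in (3) and (4) concrete, the following minimal Python sketch (our own illustration, not part of the proposed algorithm; the transition tensor, rewards, and policy below are hypothetical) computes the induced transition matrix $P^{\pi_{\theta}}$, its stationary distribution $d^{\pi_{\theta}}$, and the average reward $J(\theta)$ for a small tabular MDP.

import numpy as np

def stationary_distribution(P_pi, tol=1e-10, max_iter=10_000):
    """Power-iterate a row-stochastic matrix P_pi to its stationary distribution."""
    S = P_pi.shape[0]
    d = np.full(S, 1.0 / S)
    for _ in range(max_iter):
        d_next = d @ P_pi                          # one step of the induced chain
        if np.abs(d_next - d).sum() < tol:
            break
        d = d_next
    return d_next

def average_reward(P, r, pi):
    """P: (S,A,S) transition tensor, r: (S,A) rewards, pi: (S,A) policy."""
    P_pi = np.einsum('sap,sa->sp', P, pi)          # induced transition matrix P^{pi}
    r_pi = np.einsum('sa,sa->s', r, pi)            # expected reward per state r^{pi}
    d_pi = stationary_distribution(P_pi)           # stationary distribution d^{pi}
    return d_pi @ r_pi                             # J(theta) = (d^{pi})^T r^{pi}

# Hypothetical 2-state, 2-action example
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.3, 0.7], [0.6, 0.4]]])
r = np.array([[0.0, 1.0], [0.5, 0.2]])
pi = np.array([[0.5, 0.5], [0.5, 0.5]])
print(average_reward(P, r, pi))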

Hence, the average reward $J(\theta)$ is also independent of the initial distribution $\rho$. Furthermore, $\forall\theta\in\Theta$, there exists a function $Q^{\pi_{\theta}}:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}$ such that the following Bellman equation is satisfied $\forall(s,a)\in\mathcal{S}\times\mathcal{A}$.

Q^{\pi_{\theta}}(s,a)=r(s,a)-J(\theta)+\mathbf{E}_{s^{\prime}\sim P(\cdot|s,a)}\left[V^{\pi_{\theta}}(s^{\prime})\right] (5)

where the state value function $V^{\pi_{\theta}}:\mathcal{S}\rightarrow\mathbb{R}$ is defined as,

V^{\pi_{\theta}}(s)=\sum_{a\in\mathcal{A}}\pi_{\theta}(a|s)Q^{\pi_{\theta}}(s,a),~\forall s\in\mathcal{S} (6)

Note that if (5) is satisfied by $Q^{\pi_{\theta}}$, then it is also satisfied by $Q^{\pi_{\theta}}+c$ for any arbitrary constant $c$. To define these functions uniquely, we assume that $\sum_{s\in\mathcal{S}}d^{\pi_{\theta}}(s)V^{\pi_{\theta}}(s)=0$. In this case, $V^{\pi_{\theta}}(s)$ can be written as follows $\forall s\in\mathcal{S}$.

\begin{split}V^{\pi_{\theta}}(s)&=\sum_{t=0}^{\infty}\sum_{s^{\prime}\in\mathcal{S}}\left[(P^{\pi_{\theta}})^{t}(s,s^{\prime})-d^{\pi_{\theta}}(s^{\prime})\right]r^{\pi_{\theta}}(s^{\prime})\\ &=\mathbf{E}_{\theta}\left[\sum_{t=0}^{\infty}r(s_{t},a_{t})-J(\theta)\bigg{|}s_{0}=s\right]\end{split} (7)

where $\mathbf{E}_{\theta}[\cdot]$ denotes expectation over all trajectories induced by the policy $\pi_{\theta}$. Similarly, $\forall(s,a)\in\mathcal{S}\times\mathcal{A}$, $Q^{\pi_{\theta}}(s,a)$ can be uniquely written as,

Q^{\pi_{\theta}}(s,a)=\mathbf{E}_{\theta}\left[\sum_{t=0}^{\infty}r(s_{t},a_{t})-J(\theta)\bigg{|}s_{0}=s,a_{0}=a\right] (8)

Additionally, we define the advantage function $A^{\pi_{\theta}}:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}$ such that $\forall(s,a)\in\mathcal{S}\times\mathcal{A}$,

A^{\pi_{\theta}}(s,a)\triangleq Q^{\pi_{\theta}}(s,a)-V^{\pi_{\theta}}(s) (9)

Ergodicity also implies the existence of a finite mixing time. In particular, if $\mathcal{M}$ is ergodic, then the mixing time is defined as follows.

Definition 1.

The mixing time of an MDP $\mathcal{M}$ with respect to a policy parameter $\theta$ is defined as,

t_{\mathrm{mix}}^{\theta}\triangleq\min\left\{t\geq 1\big{|}\left\lVert(P^{\pi_{\theta}})^{t}(s,\cdot)-d^{\pi_{\theta}}\right\rVert\leq\dfrac{1}{4},\forall s\in\mathcal{S}\right\} (10)

We also define $t_{\mathrm{mix}}\triangleq\sup_{\theta\in\Theta}t^{\theta}_{\mathrm{mix}}$ as the overall mixing time. In this paper, $t_{\mathrm{mix}}$ is finite due to ergodicity.

The mixing time measures how quickly the MDP approaches its stationary distribution when the same policy is executed repeatedly. We also define the hitting time as follows.

Definition 2.

The hitting time of an MDP $\mathcal{M}$ with respect to a policy parameter $\theta$ is defined as,

t_{\mathrm{hit}}^{\theta}\triangleq\max_{s\in\mathcal{S}}\dfrac{1}{d^{\pi_{\theta}}(s)} (11)

We also define $t_{\mathrm{hit}}\triangleq\sup_{\theta\in\Theta}t^{\theta}_{\mathrm{hit}}$ as the overall hitting time. In this paper, $t_{\mathrm{hit}}$ is finite due to ergodicity.
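As an illustration of Definitions 1 and 2, the following sketch (our own, assuming the induced transition matrix $P^{\pi_{\theta}}$ and its stationary distribution $d^{\pi_{\theta}}$ are available, e.g., from the previous snippet, and interpreting the norm in (10) as the total variation distance) computes $t^{\theta}_{\mathrm{mix}}$ and $t^{\theta}_{\mathrm{hit}}$ for a tabular chain.

import numpy as np

def mixing_time(P_pi, d_pi, t_max=10_000):
    """Smallest t with max_s ||(P_pi)^t(s,.) - d_pi||_TV <= 1/4 (cf. Definition 1)."""
    P_t = np.eye(P_pi.shape[0])
    for t in range(1, t_max + 1):
        P_t = P_t @ P_pi                                 # (P_pi)^t
        tv = 0.5 * np.abs(P_t - d_pi).sum(axis=1).max()  # worst-case TV distance
        if tv <= 0.25:
            return t
    return None  # did not mix within t_max steps

def hitting_time(d_pi):
    """t_hit^theta = max_s 1 / d_pi(s) (Definition 2)."""
    return 1.0 / d_pi.min()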

Let $J^{*}\triangleq\sup_{\theta\in\Theta}J(\theta)$. For a given MDP $\mathcal{M}$ and a time horizon $T$, the regret of an algorithm $\mathbb{A}$ is defined as follows.

\mathrm{Reg}_{T}(\mathbb{A},\mathcal{M})\triangleq\sum_{t=0}^{T-1}\left(J^{*}-r(s_{t},a_{t})\right) (12)

where the action $a_{t}$, $t\in\{0,1,\cdots\}$, is chosen by the algorithm $\mathbb{A}$ based on the trajectory up to time $t$, and the state $s_{t+1}$ is obtained by following the state transition function $P$. Wherever there is no confusion, we shall simplify the notation of regret to $\mathrm{Reg}_{T}$. The goal of maximizing $J(\cdot)$ can be accomplished by designing an algorithm that minimizes the regret.

Algorithm

In this section, we discuss a policy-gradient-based algorithm for the average reward RL setting. For simplicity, we assume that the set of all policy parameters is $\Theta=\mathbb{R}^{\mathrm{d}}$. The standard policy gradient algorithm updates the policy parameter $\theta$ as follows $\forall k\in\{1,2,\cdots\}$, starting with an initial guess $\theta_{1}$.

\theta_{k+1}=\theta_{k}+\alpha\nabla_{\theta}J(\theta_{k}) (13)

where $\alpha$ is the learning rate. The following result is well-known in the literature (Sutton et al. 1999).

Lemma 1.

The gradient of the long-term average reward can be expressed as follows.

\nabla_{\theta}J(\theta)=\mathbf{E}_{s\sim d^{\pi_{\theta}},a\sim\pi_{\theta}(\cdot|s)}\bigg{[}A^{\pi_{\theta}}(s,a)\nabla_{\theta}\log\pi_{\theta}(a|s)\bigg{]} (14)

Typically, we have access neither to the state transition function $P$, which is needed to compute the required expectation, nor to the functions $V^{\pi_{\theta}}$, $Q^{\pi_{\theta}}$. In the absence of this knowledge, computing the gradient becomes difficult. In the subsequent discussion, we demonstrate how the gradient can be estimated using sampled trajectories. Our policy gradient-based algorithm is described in Algorithm 1.

The algorithm proceeds in multiple epochs with the length of each epoch being $H=16t_{\mathrm{hit}}t_{\mathrm{mix}}\sqrt{T}(\log T)^{2}$. Observe that the algorithm is assumed to be aware of $T$. This assumption can be easily relaxed by invoking the well-known doubling trick (Lattimore and Szepesvári 2020). We also assume that the values of $t_{\mathrm{mix}}$ and $t_{\mathrm{hit}}$ are known to the algorithm. Similar presumptions have been used in the previous literature (Wei et al. 2020). In the $k$th epoch, the algorithm generates a trajectory of length $H$, denoted as $\mathcal{T}_{k}=\{(s_{t},a_{t})\}_{t=(k-1)H}^{kH-1}$, by following the policy $\pi_{\theta_{k}}$. We utilise the policy parameter $\theta_{k}$ and the trajectory $\mathcal{T}_{k}$ in Algorithm 2 to compute the estimates $\hat{V}^{\pi_{\theta_{k}}}(s)$ and $\hat{Q}^{\pi_{\theta_{k}}}(s,a)$ for a given state-action pair $(s,a)$. The algorithm searches the trajectory $\mathcal{T}_{k}$ to locate disjoint sub-trajectories of length $N=4t_{\mathrm{mix}}(\log T)$ that start with the given state $s$ and are at least $N$ distance apart. Let $i$ be the number of such sub-trajectories and the sum of rewards in the $j$th such sub-trajectory be $y_{j}$. Then $\hat{V}^{\pi_{\theta_{k}}}(s)$ is computed as,

\hat{V}^{\pi_{\theta_{k}}}(s)=\dfrac{1}{i}\sum_{j=1}^{i}y_{j} (15)

The sub-trajectories are kept at least $N$ distance apart to ensure that the samples $\{y_{j}\}_{j=1}^{i}$ are fairly independent. The estimate $\hat{Q}^{\pi_{\theta_{k}}}(s,a)$, on the other hand, is given as,

\hat{Q}^{\pi_{\theta_{k}}}(s,a)=\dfrac{1}{\pi_{\theta_{k}}(a|s)}\left[\dfrac{1}{i}\sum_{j=1}^{i}y_{j}\mathrm{1}(a_{\tau_{j}}=a)\right] (16)

where $\tau_{j}$ is the starting time of the $j$th chosen sub-trajectory. Finally, the advantage value is estimated as,

\hat{A}^{\pi_{\theta_{k}}}(s,a)=\hat{Q}^{\pi_{\theta_{k}}}(s,a)-\hat{V}^{\pi_{\theta_{k}}}(s) (17)

This allows us to compute an estimate of the policy gradient as follows.

\omega_{k}\triangleq\hat{\nabla}_{\theta}J(\theta_{k})=\dfrac{1}{H}\sum_{t=t_{k}}^{t_{k+1}-1}\hat{A}^{\pi_{\theta_{k}}}(s_{t},a_{t})\nabla_{\theta}\log\pi_{\theta_{k}}(a_{t}|s_{t}) (18)

where $t_{k}=(k-1)H$ is the starting time of the $k$th epoch. The policy parameters are updated via (20). In the following lemma, we show that $\hat{A}^{\pi_{\theta_{k}}}(s,a)$ is a good estimator of $A^{\pi_{\theta_{k}}}(s,a)$.

Lemma 2.

The following inequalities hold $\forall k$, $\forall(s,a)$ and sufficiently large $T$.

\begin{split}\mathbf{E}&\bigg{[}\bigg{(}\hat{A}^{\pi_{\theta_{k}}}(s,a)-A^{\pi_{\theta_{k}}}(s,a)\bigg{)}^{2}\bigg{]}\\ &\leq\mathcal{O}\left(\dfrac{t_{\mathrm{hit}}N^{3}\log T}{H\pi_{\theta_{k}}(a|s)}\right)=\mathcal{O}\left(\dfrac{t_{\mathrm{mix}}^{2}(\log T)^{2}}{\sqrt{T}\pi_{\theta_{k}}(a|s)}\right)\end{split} (19)

Lemma 2 establishes that the $L_{2}$ error of our proposed estimator can be bounded above as $\tilde{\mathcal{O}}(1/\sqrt{T})$. As we shall see later, this result can be used to bound the estimation error of the gradient. It is worthwhile to point out that $\hat{V}^{\pi_{\theta_{k}}}(s)$ and $\hat{Q}^{\pi_{\theta_{k}}}(s,a)$, defined in (15) and (16) respectively, may not themselves be good estimators of their target quantities, although their difference is. We would also like to mention that our Algorithm 2 is inspired by Algorithm 2 of (Wei et al. 2020). The main difference is that we take the epoch length to be $H=\tilde{\mathcal{O}}(\sqrt{T})$ while in (Wei et al. 2020), it was chosen to be $\tilde{\mathcal{O}}(1)$. This extra $\sqrt{T}$ factor makes the estimation error a decreasing function of $T$.

Algorithm 1 Parameterized Policy Gradient
1:  Input: Initial parameter $\theta_{1}$, learning rate $\alpha$, initial state $s_{0}\sim\rho(\cdot)$, episode length $H$
2:  $K=T/H$
3:  for $k\in\{1,\cdots,K\}$ do
4:     $\mathcal{T}_{k}\leftarrow\emptyset$
5:     for $t\in\{(k-1)H,\cdots,kH-1\}$ do
6:        Execute $a_{t}\sim\pi_{\theta_{k}}(\cdot|s_{t})$, receive reward $r(s_{t},a_{t})$ and observe $s_{t+1}$
7:        $\mathcal{T}_{k}\leftarrow\mathcal{T}_{k}\cup\{(s_{t},a_{t})\}$
8:     end for
9:     for $t\in\{(k-1)H,\cdots,kH-1\}$ do
10:        Using Algorithm 2 and $\mathcal{T}_{k}$, compute $\hat{A}^{\pi_{\theta_{k}}}(s_{t},a_{t})$
11:     end for
12:     Using (18), compute $\omega_{k}$
13:     Update parameters as
\theta_{k+1}=\theta_{k}+\alpha\omega_{k} (20)
14:  end for
Algorithm 2 Advantage Estimation
1:  Input: Trajectory $(s_{t_{1}},a_{t_{1}},\ldots,s_{t_{2}},a_{t_{2}})$, state $s$, action $a$, and policy parameter $\theta$
2:  Initialize: $i\leftarrow 0$, $\tau\leftarrow t_{1}$
3:  Define: $N=4t_{\mathrm{mix}}\log_{2}T$
4:  while $\tau\leq t_{2}-N$ do
5:     if $s_{\tau}=s$ then
6:        $i\leftarrow i+1$
7:        $\tau_{i}\leftarrow\tau$
8:        $y_{i}=\sum_{t=\tau}^{\tau+N-1}r(s_{t},a_{t})$
9:        $\tau\leftarrow\tau+2N$
10:     else
11:        $\tau\leftarrow\tau+1$
12:     end if
13:  end while
14:  if $i>0$ then
15:     $\hat{V}(s)=\dfrac{1}{i}\sum_{j=1}^{i}y_{j}$,
16:     $\hat{Q}(s,a)=\dfrac{1}{\pi_{\theta}(a|s)}\left[\dfrac{1}{i}\sum_{j=1}^{i}y_{j}\mathrm{1}(a_{\tau_{j}}=a)\right]$
17:  else
18:     $\hat{V}(s)=0$, $\hat{Q}(s,a)=0$
19:  end if
20:  return $\hat{Q}(s,a)-\hat{V}(s)$
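For concreteness, the following Python sketch mirrors Algorithm 2 and the gradient estimate (18). It is our own illustration under simplifying assumptions not stated in the paper: states and actions are integers, the trajectory is a list of (state, action, reward) triples, the policy probability and its score function $\nabla_{\theta}\log\pi_{\theta}(a|s)$ are supplied as callables, and $t_{\mathrm{mix}}$ (hence $N$) is assumed known.

import numpy as np

def estimate_advantage(traj, s, a, pi_sa, N):
    """Algorithm 2: traj is a list of (s_t, a_t, r_t); pi_sa = pi_theta(a|s)."""
    i, tau, ys, starts = 0, 0, [], []
    while tau <= len(traj) - 1 - N:
        if traj[tau][0] == s:
            i += 1
            starts.append(tau)
            ys.append(sum(r for (_, _, r) in traj[tau:tau + N]))  # y_i
            tau += 2 * N                 # keep sub-trajectories at least N apart
        else:
            tau += 1
    if i == 0:
        return 0.0
    V_hat = np.mean(ys)
    Q_hat = np.mean([y * (traj[t][1] == a) for y, t in zip(ys, starts)]) / pi_sa
    return Q_hat - V_hat                 # advantage estimate, cf. (17)

def gradient_estimate(traj, policy_prob, score, N):
    """Eq. (18): average A_hat(s_t,a_t) * grad log pi(a_t|s_t) over the epoch."""
    grads = []
    for (s, a, _) in traj:
        A_hat = estimate_advantage(traj, s, a, policy_prob(s, a), N)
        grads.append(A_hat * score(s, a))
    return np.mean(grads, axis=0)        # omega_k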

Global Convergence Analysis

In this section, we show that our proposed Algorithm 1 converges globally. This essentially means that the parameters $\{\theta_{k}\}_{k=1}^{\infty}$ are such that the sequence $\{J(\theta_{k})\}_{k=1}^{\infty}$, in a certain sense, approaches the optimal average reward $J^{*}$. Such convergence will later be useful in bounding the regret of our algorithm. Before delving into the analysis, we first state a few assumptions that are needed to establish the results.

Assumption 2.

The log-likelihood function is $G$-Lipschitz and $B$-smooth. Formally, $\forall\theta,\theta_{1},\theta_{2}\in\Theta$, $\forall(s,a)\in\mathcal{S}\times\mathcal{A}$,

\|\nabla_{\theta}\log\pi_{\theta}(a|s)\|\leq G (21)
\|\nabla_{\theta}\log\pi_{\theta_{1}}(a|s)-\nabla_{\theta}\log\pi_{\theta_{2}}(a|s)\|\leq B\|\theta_{1}-\theta_{2}\|
Remark 1.

The Lipschitz and smoothness properties of the log-likelihood are quite common in the policy gradient literature (Agarwal et al. 2020; Zhang et al. 2021; Liu et al. 2020). Such properties can also be verified for simple parameterizations such as the Gaussian policy.
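As a brief illustration (our own, under the additional assumptions of bounded features, actions, and parameters, which are not stated in the paper), consider a Gaussian policy with a linearly parameterized mean, $\pi_{\theta}(a|s)=\mathcal{N}(a;\theta^{\top}\phi(s),\sigma^{2})$, with $\|\phi(s)\|\leq 1$, $|a|\leq a_{\max}$, and $\|\theta\|\leq R$. Then

\nabla_{\theta}\log\pi_{\theta}(a|s)=\frac{a-\theta^{\top}\phi(s)}{\sigma^{2}}\,\phi(s)

so that $\left\lVert\nabla_{\theta}\log\pi_{\theta}(a|s)\right\rVert\leq\frac{a_{\max}+R}{\sigma^{2}}\triangleq G$, and

\left\lVert\nabla_{\theta}\log\pi_{\theta_{1}}(a|s)-\nabla_{\theta}\log\pi_{\theta_{2}}(a|s)\right\rVert=\frac{|(\theta_{1}-\theta_{2})^{\top}\phi(s)|}{\sigma^{2}}\left\lVert\phi(s)\right\rVert\leq\frac{1}{\sigma^{2}}\left\lVert\theta_{1}-\theta_{2}\right\rVert

which gives $B=\frac{1}{\sigma^{2}}$.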

One can immediately see that by combining Assumption 2 with Lemma 2 and using the definition of the gradient estimator given in (18), we arrive at the following important result.

Lemma 3.

The following relation holds $\forall k$.

\mathbf{E}\left[\left\lVert\omega_{k}-\nabla_{\theta}J(\theta_{k})\right\rVert^{2}\right]\leq\mathcal{O}\left(\dfrac{AG^{2}t_{\mathrm{mix}}^{2}(\log T)^{2}}{\sqrt{T}}\right) (22)

Lemma 3 claims that the error in estimating the gradient can be bounded above as $\tilde{\mathcal{O}}(1/\sqrt{T})$. This result will be used in proving the global convergence of our algorithm.

Assumption 3.

Define the transferred function approximation error

\begin{split}L_{d_{\rho}^{\pi^{*}},\pi^{*}}(\omega^{*}_{\theta},\theta&)=\mathbf{E}_{s\sim d_{\rho}^{\pi^{*}}}\mathbf{E}_{a\sim\pi^{*}(\cdot|s)}\bigg{[}\\ &\bigg{(}\nabla_{\theta}\log\pi_{\theta}(a|s)\cdot\omega^{*}_{\theta}-A^{\pi_{\theta}}(s,a)\bigg{)}^{2}\bigg{]}\end{split} (23)

where $\pi^{*}$ is the optimal policy and $\omega^{*}_{\theta}$ is given as

\begin{split}\omega^{*}_{\theta}=\arg&\min_{\omega\in\mathbb{R}^{\mathrm{d}}}~\mathbf{E}_{s\sim d_{\rho}^{\pi_{\theta}}}\mathbf{E}_{a\sim\pi_{\theta}(\cdot|s)}\bigg{[}\\ &\bigg{(}\nabla_{\theta}\log\pi_{\theta}(a|s)\cdot\omega-A^{\pi_{\theta}}(s,a)\bigg{)}^{2}\bigg{]}\end{split} (24)

We assume that the error satisfies $L_{d_{\rho}^{\pi^{*}},\pi^{*}}(\omega^{*}_{\theta},\theta)\leq\epsilon_{\mathrm{bias}}$ for any $\theta\in\Theta$, where $\epsilon_{\mathrm{bias}}$ is a positive constant.

Remark 2.

The transferred function approximation error, defined by (23) and (24), quantifies the expressivity of the policy class under consideration. It has been shown that the softmax parameterization (Agarwal et al. 2021) and the linear MDP structure (Jin et al. 2020) admit $\epsilon_{\mathrm{bias}}=0$. When the policy is parameterized by a restricted class that does not contain all policies, $\epsilon_{\mathrm{bias}}$ turns out to be strictly positive. However, for a rich neural network parameterization, $\epsilon_{\mathrm{bias}}$ is small (Wang et al. 2019). A similar assumption has been adopted in (Liu et al. 2020) and (Agarwal et al. 2021).

Remark 3.

It is to be mentioned that $\omega^{*}_{\theta}$ defined in (24) can be alternatively written as,

\omega^{*}_{\theta}=F(\theta)^{\dagger}\mathbf{E}_{s\sim d^{\pi_{\theta}}}\mathbf{E}_{a\sim\pi_{\theta}(\cdot|s)}\left[\nabla_{\theta}\log\pi_{\theta}(a|s)A^{\pi_{\theta}}(s,a)\right]

where $\dagger$ symbolizes the Moore-Penrose pseudoinverse operation and $F(\theta)$ is the Fisher information matrix defined below.

\begin{split}F(\theta)=\mathbf{E}_{s\sim d^{\pi_{\theta}}}&\mathbf{E}_{a\sim\pi_{\theta}(\cdot|s)}\left[\right.\\ &\left.\nabla_{\theta}\log\pi_{\theta}(a|s)(\nabla_{\theta}\log\pi_{\theta}(a|s))^{T}\right]\end{split} (25)
Assumption 4.

There exists a constant $\mu_{F}>0$ such that $F(\theta)-\mu_{F}I_{\mathrm{d}}$ is positive semidefinite, where $I_{\mathrm{d}}$ denotes the $\mathrm{d}\times\mathrm{d}$ identity matrix.

Assumption 4 is also commonly used in policy gradient analyses (Liu et al. 2020). It is satisfied by the Gaussian policy with a linearly parameterized mean.
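To illustrate (24) and Remark 3, the following sketch (our own; it assumes access to sampled score vectors and advantage values, which in practice would come from the estimators above) computes a sample-based counterpart of $\omega^{*}_{\theta}=F(\theta)^{\dagger}\mathbf{E}\left[\nabla_{\theta}\log\pi_{\theta}(a|s)A^{\pi_{\theta}}(s,a)\right]$.

import numpy as np

def npg_direction(scores, advantages):
    """scores: (n, d) array of grad log pi(a_i|s_i); advantages: (n,) array.

    Returns omega* = F_hat^dagger g_hat, where F_hat is the empirical Fisher
    matrix and g_hat the empirical policy gradient (cf. (24) and Remark 3).
    """
    n = scores.shape[0]
    F_hat = scores.T @ scores / n           # empirical Fisher information matrix
    g_hat = scores.T @ advantages / n       # empirical policy gradient
    return np.linalg.pinv(F_hat) @ g_hat    # Moore-Penrose pseudoinverse

# Hypothetical usage with random data
rng = np.random.default_rng(0)
scores = rng.normal(size=(1000, 5))
advantages = rng.normal(size=1000)
omega_star = npg_direction(scores, advantages)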

In the discounted reward setup, one key result is the performance difference lemma. In the average reward setting, it takes the form stated below.

Lemma 4.

For any two policies $\pi_{\theta}$ and $\pi_{\theta^{\prime}}$, the difference in performance is given by

J(\theta)-J(\theta^{\prime})=\mathbf{E}_{s\sim d^{\pi_{\theta}}}\mathbf{E}_{a\sim\pi_{\theta}(\cdot|s)}\big{[}A^{\pi_{\theta^{\prime}}}(s,a)\big{]} (26)

Using Lemma 4, we present a general framework for the convergence analysis of the policy gradient algorithm in the average reward case, as stated below. This is inspired by the convergence analysis of (Liu et al. 2020) for discounted reward MDPs.

Lemma 5.

Suppose a general gradient ascent algorithm updates the policy parameter in the following way.

\theta_{k+1}=\theta_{k}+\alpha\omega_{k} (27)

When Assumptions 2, 3, and 4 hold, we have the following inequality for any $K$.

\begin{split}&J^{*}-\frac{1}{K}\sum_{k=1}^{K}J(\theta_{k})\leq\sqrt{\epsilon_{\mathrm{bias}}}+\frac{G}{K}\sum_{k=1}^{K}\|(\omega_{k}-\omega^{*}_{k})\|\\ &+\frac{B\alpha}{2K}\sum_{k=1}^{K}\|\omega_{k}\|^{2}+\frac{1}{\alpha K}\mathbf{E}_{s\sim d^{\pi^{*}}}[KL(\pi^{*}(\cdot|s)\|\pi_{\theta_{1}}(\cdot|s))]\end{split} (28)

where $\omega^{*}_{k}:=\omega^{*}_{\theta_{k}}$ and $\omega^{*}_{\theta_{k}}$ is defined in (24), $J^{*}=J(\theta^{*})$, and $\pi^{*}=\pi_{\theta^{*}}$ where $\theta^{*}$ is the optimal parameter.

Lemma 5 bounds the optimality error of any gradient ascent algorithm as a function of intermediate gradient norms. Note the presence of $\epsilon_{\mathrm{bias}}$ in the upper bound. Clearly, for a severely restricted policy class where $\epsilon_{\mathrm{bias}}$ is significant, the optimality bound becomes poor. Consider the expectation of the second term in (28). Note that,

\begin{split}&\bigg{(}\frac{1}{K}\sum_{k=1}^{K}\mathbf{E}\|\omega_{k}-\omega^{*}_{k}\|\bigg{)}^{2}\leq\frac{1}{K}\sum_{k=1}^{K}\mathbf{E}\bigg{[}\|\omega_{k}-\omega^{*}_{k}\|^{2}\bigg{]}\\ &=\frac{1}{K}\sum_{k=1}^{K}\mathbf{E}\bigg{[}\|\omega_{k}-F(\theta_{k})^{\dagger}\nabla_{\theta}J(\theta_{k})\|^{2}\bigg{]}\\ &\leq\frac{2}{K}\sum_{k=1}^{K}\mathbf{E}\bigg{[}\|\omega_{k}-\nabla_{\theta}J(\theta_{k})\|^{2}\bigg{]}\\ &\quad+\frac{2}{K}\sum_{k=1}^{K}\mathbf{E}\bigg{[}\|\nabla_{\theta}J(\theta_{k})-F(\theta_{k})^{\dagger}\nabla_{\theta}J(\theta_{k})\|^{2}\bigg{]}\\ &\overset{(a)}{\leq}\frac{2}{K}\sum_{k=1}^{K}\mathbf{E}\bigg{[}\|\omega_{k}-\nabla_{\theta}J(\theta_{k})\|^{2}\bigg{]}\\ &\quad+\frac{2}{K}\sum_{k=1}^{K}\left(1+\dfrac{1}{\mu_{F}^{2}}\right)\mathbf{E}\bigg{[}\|\nabla_{\theta}J(\theta_{k})\|^{2}\bigg{]}\end{split} (29)

where $(a)$ uses Assumption 4. The expectation of the third term in (28) can be bounded as follows.

\begin{split}\dfrac{1}{K}\sum_{k=1}^{K}\mathbf{E}\bigg{[}\|\omega_{k}\|^{2}\bigg{]}&\leq\dfrac{1}{K}\sum_{k=1}^{K}\mathbf{E}\bigg{[}\|\omega_{k}-\nabla_{\theta}J(\theta_{k})\|^{2}\bigg{]}\\ &+\dfrac{1}{K}\sum_{k=1}^{K}\mathbf{E}\left[\left\lVert\nabla_{\theta}J(\theta_{k})\right\rVert^{2}\right]\end{split} (30)

In both (29) and (30), the terms related to $\left\lVert\omega_{k}-\nabla_{\theta}J(\theta_{k})\right\rVert^{2}$ are bounded by Lemma 3. To bound the term $\left\lVert\nabla_{\theta}J(\theta_{k})\right\rVert^{2}$, we use the following lemma.

Lemma 6.

Let $J(\cdot)$ be $L$-smooth and $\alpha=\frac{1}{4L}$. Then the following inequality holds.

\dfrac{1}{K}\sum_{k=1}^{K}\left\lVert\nabla J(\theta_{k})\right\rVert^{2}\leq\dfrac{16L}{K}+\dfrac{16}{3K}\sum_{k=1}^{K}\|\nabla J(\theta_{k})-\omega_{k}\|^{2}

Using Lemma 3, we obtain the following inequality.

\dfrac{1}{K}\sum_{k=1}^{K}\left\lVert\nabla_{\theta}J(\theta_{k})\right\rVert^{2}\leq\mathcal{\tilde{O}}\left(\dfrac{AG^{2}t_{\mathrm{mix}}^{2}+Lt_{\mathrm{mix}}t_{\mathrm{hit}}}{\sqrt{T}}\right) (31)

Applying Lemma 3 and (31) in (30), we finally get,

\dfrac{1}{K}\sum_{k=1}^{K}\mathbf{E}\bigg{[}\|\omega_{k}\|^{2}\bigg{]}\leq\mathcal{\tilde{O}}\left(\dfrac{AG^{2}t_{\mathrm{mix}}^{2}+Lt_{\mathrm{mix}}t_{\mathrm{hit}}}{\sqrt{T}}\right) (32)

Similarly, using (29), we deduce the following.

\frac{1}{K}\sum_{k=1}^{K}\mathbf{E}\|\omega_{k}-\omega^{*}_{k}\|\leq\mathcal{\tilde{O}}\left(\dfrac{\sqrt{A}Gt_{\mathrm{mix}}+\sqrt{Lt_{\mathrm{mix}}t_{\mathrm{hit}}}}{T^{\frac{1}{4}}(1+1/\mu_{F})^{-1}}\right) (33)

Inequalities (32) and (33) lead to the following result.

Theorem 1.

Let $\{\theta_{k}\}_{k=1}^{K}$ be defined as in Lemma 5. If Assumptions 1, 2, 3, and 4 hold, $J(\cdot)$ is $L$-smooth, and $\alpha=\frac{1}{4L}$, then the following inequality holds for $K=T/H$ where $T$ is sufficiently large and $H=16t_{\mathrm{mix}}t_{\mathrm{hit}}\sqrt{T}(\log_{2}T)^{2}$.

\begin{split}&J^{*}-\frac{1}{K}\sum_{k=1}^{K}\mathbf{E}\left[J(\theta_{k})\right]\leq\mathcal{\tilde{O}}\left(\dfrac{ABG^{2}t_{\mathrm{mix}}^{2}+BLt_{\mathrm{mix}}t_{\mathrm{hit}}}{L\sqrt{T}}\right)\\ &+\mathcal{\tilde{O}}\left(\dfrac{\sqrt{A}G^{2}t_{\mathrm{mix}}+G\sqrt{Lt_{\mathrm{mix}}t_{\mathrm{hit}}}}{T^{\frac{1}{4}}(1+1/\mu_{F})^{-1}}\right)+\sqrt{\epsilon_{\mathrm{bias}}}\\ &+\mathcal{\tilde{O}}\left(\dfrac{Lt_{\mathrm{mix}}t_{\mathrm{hit}}}{\sqrt{T}}\mathbf{E}_{s\sim d^{\pi^{*}}}[KL(\pi^{*}(\cdot|s)\|\pi_{\theta_{1}}(\cdot|s))]\right)\end{split} (34)

Theorem 1 shows that the sequence $\{J(\theta_{k})\}_{k=1}^{K}$ generated by Algorithm 1 converges to $J^{*}$ at a rate of $\mathcal{O}(T^{-\frac{1}{4}}+\sqrt{\epsilon_{\mathrm{bias}}})$. Alternatively, one can say that in order to achieve an optimality error of $\epsilon+\sqrt{\epsilon_{\mathrm{bias}}}$, it is sufficient to choose $T=\mathcal{O}\left(\epsilon^{-4}\right)$. This matches the state-of-the-art sample complexity bound for the policy gradient algorithm with general parameterization in the discounted reward setup (Liu et al. 2020).
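To make the last statement explicit (a one-line rearrangement; $c$ below is a stand-in for the problem-dependent constants and logarithmic factors hidden in the $\tilde{\mathcal{O}}$ notation), if the dominant term of (34) behaves as $cT^{-\frac{1}{4}}$, then

cT^{-\frac{1}{4}}\leq\epsilon\iff T\geq\left(\frac{c}{\epsilon}\right)^{4},~\text{i.e.,}~T=\mathcal{O}(\epsilon^{-4})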

Regret Analysis

In this section, we demonstrate how the convergence analysis of the previous section can be used to bound the expected regret of our proposed algorithm. Note that the regret can be decomposed as follows.

\begin{split}&\mathrm{Reg}_{T}=\sum_{t=0}^{T-1}\left(J^{*}-r(s_{t},a_{t})\right)\\ &=H\sum_{k=1}^{K}\left(J^{*}-J({\theta_{k}})\right)+\sum_{k=1}^{K}\sum_{t\in\mathcal{I}_{k}}\left(J(\theta_{k})-r(s_{t},a_{t})\right)\end{split} (35)

where $\mathcal{I}_{k}\triangleq\{(k-1)H,\cdots,kH-1\}$. Note that the expectation of the first term in (35) can be bounded using Theorem 1. The expectation of the second term can be expressed as follows,

\begin{split}&\mathbf{E}\left[\sum_{k=1}^{K}\sum_{t\in\mathcal{I}_{k}}\left(J(\theta_{k})-r(s_{t},a_{t})\right)\right]\\ &\overset{(a)}{=}\mathbf{E}\left[\sum_{k=1}^{K}\sum_{t\in\mathcal{I}_{k}}\mathbf{E}_{s^{\prime}\sim P(\cdot|s_{t},a_{t})}[V^{\pi_{\theta_{k}}}(s^{\prime})]-Q^{\pi_{\theta_{k}}}(s_{t},a_{t})\right]\\ &\overset{(b)}{=}\mathbf{E}\left[\sum_{k=1}^{K}\sum_{t\in\mathcal{I}_{k}}V^{\pi_{\theta_{k}}}(s_{t+1})-V^{\pi_{\theta_{k}}}(s_{t})\right]\\ &=\mathbf{E}\left[\sum_{k=1}^{K}V^{\pi_{\theta_{k}}}(s_{kH})-V^{\pi_{\theta_{k}}}(s_{(k-1)H})\right]\\ &=\underbrace{\mathbf{E}\left[\sum_{k=1}^{K-1}V^{\pi_{\theta_{k+1}}}(s_{kH})-V^{\pi_{\theta_{k}}}(s_{kH})\right]}_{\triangleq P}\\ &+\underbrace{\mathbf{E}\left[V^{\pi_{\theta_{K}}}(s_{T})-V^{\pi_{\theta_{0}}}(s_{0})\right]}_{\triangleq Q}\end{split} (36)

where $(a)$ follows from the Bellman equation and $(b)$ utilises the following facts: $\mathbf{E}[V^{\pi_{\theta_{k}}}(s_{t+1})]=\mathbf{E}_{s^{\prime}\sim P(\cdot|s_{t},a_{t})}[V^{\pi_{\theta_{k}}}(s^{\prime})]$ and $\mathbf{E}[V^{\pi_{\theta_{k}}}(s_{t})]=\mathbf{E}[Q^{\pi_{\theta_{k}}}(s_{t},a_{t})]$. The term $P$ in (36) can be bounded using Lemma 7 (stated below). Moreover, the term $Q$ can be upper bounded as $\mathcal{O}(t_{\mathrm{mix}})$, as clarified in the appendix.

Lemma 7.

If Assumptions 1 and 2 hold, then for $K=T/H$ where $H=16t_{\mathrm{mix}}t_{\mathrm{hit}}\sqrt{T}(\log_{2}T)^{2}$, the following inequalities are true $\forall k$, $\forall(s,a)\in\mathcal{S}\times\mathcal{A}$ and sufficiently large $T$.

(a)~|\pi_{\theta_{k+1}}(a|s)-\pi_{\theta_{k}}(a|s)|\leq G\pi_{\bar{\theta}_{k}}(a|s)\left\lVert\theta_{k+1}-\theta_{k}\right\rVert
(b)~\sum_{k=1}^{K}\mathbf{E}|J(\theta_{k+1})-J(\theta_{k})|\leq\mathcal{\tilde{O}}\left(\dfrac{\alpha GC}{t_{\mathrm{hit}}}T^{\frac{1}{4}}\right)
(c)~\sum_{k=1}^{K}\mathbf{E}|V^{\pi_{\theta_{k+1}}}(s_{k})-V^{\pi_{\theta_{k}}}(s_{k})|\leq\mathcal{\tilde{O}}\left(\dfrac{\alpha GCt_{\mathrm{mix}}}{t_{\mathrm{hit}}}T^{\frac{1}{4}}\right)

where $\bar{\theta}_{k}$ denotes some convex combination of $\theta_{k}$ and $\theta_{k+1}$, $C\triangleq\sqrt{A}Gt_{\mathrm{mix}}+\sqrt{Lt_{\mathrm{mix}}t_{\mathrm{hit}}}$, and $\{s_{k}\}_{k=1}^{K}$ is an arbitrary sequence of states.

Lemma 7 can be interpreted as a stability result for our algorithm. It essentially states that the policy parameters are updated such that the average difference between consecutive average rewards and value functions decreases with the horizon $T$. Using the above result, we now prove our regret guarantee.

Theorem 2.

If Assumptions 1, 2, 3, and 4 hold, $J(\cdot)$ is $L$-smooth, and $T$ is sufficiently large, then our proposed Algorithm 1 achieves the following expected regret bound with learning rate $\alpha=\frac{1}{4L}$.

\begin{split}&\mathbf{E}\left[\mathrm{Reg}_{T}\right]\leq T\sqrt{\epsilon_{\mathrm{bias}}}+\mathcal{O}(t_{\mathrm{mix}})+\mathcal{\tilde{O}}\left(\dfrac{BC^{2}}{L}\sqrt{T}\right)\\ &+\mathcal{\tilde{O}}\left(GC\left(1+\dfrac{1}{\mu_{F}}\right)T^{\frac{3}{4}}\right)+\mathcal{\tilde{O}}\left(\dfrac{GCt_{\mathrm{mix}}}{Lt_{\mathrm{hit}}}T^{\frac{1}{4}}\right)\\ &+\mathcal{\tilde{O}}\left(Lt_{\mathrm{mix}}t_{\mathrm{hit}}\mathbf{E}_{s\sim d^{\pi^{*}}}[KL(\pi^{*}(\cdot|s)\|\pi_{\theta_{1}}(\cdot|s))]\sqrt{T}\right)\end{split} (37)

where the term $C$ is the same as defined in Lemma 7.

Theorem 2 shows that the expected regret of Algorithm 1 is bounded by $\tilde{\mathcal{O}}(T^{\frac{3}{4}}+T\sqrt{\epsilon_{\mathrm{bias}}})$. It also shows how other parameters such as $t_{\mathrm{mix}}$ and $t_{\mathrm{hit}}$ influence the regret. Note that the term related to $\epsilon_{\mathrm{bias}}$ is unavoidable in our setting due to the incompleteness of the general parameterized policy class.

Conclusion

In this paper, we proposed an algorithm based on the vanilla policy gradient for reinforcement learning in the infinite horizon average reward setting. Unlike recent works in this setting, which require the MDP to be tabular or have a linear structure, we assume a general parametrization of the policy. We show that the proposed algorithm converges to a neighborhood of the global optimum at rate $\mathcal{O}(T^{-1/4})$, which matches the result for vanilla policy gradient with general parametrization in the discounted reward setting. We use this convergence result to further show that our algorithm achieves a regret of $\tilde{\mathcal{O}}(T^{\frac{3}{4}})$.

We note that this paper unveils numerous promising directions for future research. These avenues encompass exploring the possibility of relaxing the assumption of ergodic MDPs to weakly communicating MDPs, refining regret bounds for enhanced performance, deriving more robust lower bounds for the general parametrization, and extending the problem domain to incorporate constraints.

Acknowledgements

This work was supported in part by the U.S. National Science Foundation under Grant CCF-2149588 and Cisco Inc.

References

  • Agarwal et al. (2020) Agarwal, A.; Kakade, S. M.; Lee, J. D.; and Mahajan, G. 2020. Optimality and Approximation with Policy Gradient Methods in Markov Decision Processes. In Proceedings of Thirty Third Conference on Learning Theory, volume 125 of Proceedings of Machine Learning Research, 64–66. PMLR.
  • Agarwal et al. (2021) Agarwal, A.; Kakade, S. M.; Lee, J. D.; and Mahajan, G. 2021. On the theory of policy gradient methods: Optimality, approximation, and distribution shift. The Journal of Machine Learning Research, 22(1): 4431–4506.
  • Agarwal, Bai, and Aggarwal (2022a) Agarwal, M.; Bai, Q.; and Aggarwal, V. 2022a. Concave Utility Reinforcement Learning with Zero-Constraint Violations. Transactions on Machine Learning Research.
  • Agarwal, Bai, and Aggarwal (2022b) Agarwal, M.; Bai, Q.; and Aggarwal, V. 2022b. Regret guarantees for model-based reinforcement learning with long-term average constraints. In Uncertainty in Artificial Intelligence, 22–31. PMLR.
  • Agrawal and Jia (2017) Agrawal, S.; and Jia, R. 2017. Optimistic posterior sampling for reinforcement learning: worst-case regret bounds. Advances in Neural Information Processing Systems, 30.
  • Al-Abbasi, Ghosh, and Aggarwal (2019) Al-Abbasi, A. O.; Ghosh, A.; and Aggarwal, V. 2019. Deeppool: Distributed model-free algorithm for ride-sharing using deep reinforcement learning. IEEE Transactions on Intelligent Transportation Systems, 20(12): 4714–4727.
  • Auer, Jaksch, and Ortner (2008) Auer, P.; Jaksch, T.; and Ortner, R. 2008. Near-optimal regret bounds for reinforcement learning. Advances in neural information processing systems, 21.
  • Chen, Jain, and Luo (2022) Chen, L.; Jain, R.; and Luo, H. 2022. Learning infinite-horizon average-reward Markov decision process with constraints. In International Conference on Machine Learning, 3246–3270. PMLR.
  • Ding et al. (2020) Ding, D.; Zhang, K.; Basar, T.; and Jovanovic, M. 2020. Natural Policy Gradient Primal-Dual Method for Constrained Markov Decision Processes. In Advances in Neural Information Processing Systems, volume 33, 8378–8390. Curran Associates, Inc.
  • Dorfman and Levy (2022) Dorfman, R.; and Levy, K. Y. 2022. Adapting to mixing time in stochastic optimization with Markovian data. In International Conference on Machine Learning, 5429–5446. PMLR.
  • Fatkhullin et al. (2023) Fatkhullin, I.; Barakat, A.; Kireeva, A.; and He, N. 2023. Stochastic policy gradient methods: Improved sample complexity for fisher-non-degenerate policies. arXiv:2302.01734.
  • Fruit et al. (2018) Fruit, R.; Pirotta, M.; Lazaric, A.; and Ortner, R. 2018. Efficient bias-span-constrained exploration-exploitation in reinforcement learning. In International Conference on Machine Learning, 1578–1586. PMLR.
  • Geng et al. (2020) Geng, N.; Lan, T.; Aggarwal, V.; Yang, Y.; and Xu, M. 2020. A multi-agent reinforcement learning perspective on distributed traffic engineering. In 2020 IEEE 28th International Conference on Network Protocols (ICNP), 1–11. IEEE.
  • Ghosh, Zhou, and Shroff (2022) Ghosh, A.; Zhou, X.; and Shroff, N. 2022. Achieving Sub-linear Regret in Infinite Horizon Average Reward Constrained MDP with Linear Function Approximation. In The Eleventh International Conference on Learning Representations.
  • Gong and Wang (2020) Gong, H.; and Wang, M. 2020. A Duality Approach for Regret Minimization in Average-Award Ergodic Markov Decision Processes. In Learning for Dynamics and Control, 862–883. PMLR.
  • Jin et al. (2018) Jin, C.; Allen-Zhu, Z.; Bubeck, S.; and Jordan, M. I. 2018. Is Q-Learning Provably Efficient? In Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc.
  • Jin et al. (2020) Jin, C.; Yang, Z.; Wang, Z.; and Jordan, M. I. 2020. Provably efficient reinforcement learning with linear function approximation. In Proceedings of Thirty Third Conference on Learning Theory, volume 125 of Proceedings of Machine Learning Research, 2137–2143. PMLR.
  • Lattimore and Szepesvári (2020) Lattimore, T.; and Szepesvári, C. 2020. Bandit algorithms. Cambridge University Press.
  • Ling, Mondal, and Ukkusuri (2023) Ling, L.; Mondal, W. U.; and Ukkusuri, S. V. 2023. Cooperating Graph Neural Networks with Deep Reinforcement Learning for Vaccine Prioritization. arXiv preprint arXiv:2305.05163.
  • Liu et al. (2020) Liu, Y.; Zhang, K.; Basar, T.; and Yin, W. 2020. An improved analysis of (variance-reduced) policy gradient and natural policy gradient methods. Advances in Neural Information Processing Systems, 33: 7624–7636.
  • Mnih et al. (2016) Mnih, V.; Badia, A. P.; Mirza, M.; Graves, A.; Lillicrap, T.; Harley, T.; Silver, D.; and Kavukcuoglu, K. 2016. Asynchronous methods for deep reinforcement learning. In International conference on machine learning, 1928–1937. PMLR.
  • Mnih et al. (2015) Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M.; Fidjeland, A. K.; Ostrovski, G.; et al. 2015. Human-level control through deep reinforcement learning. nature, 518(7540): 529–533.
  • Mondal and Aggarwal (2023) Mondal, W. U.; and Aggarwal, V. 2023. Improved Sample Complexity Analysis of Natural Policy Gradient Algorithm with General Parameterization for Infinite Horizon Discounted Reward Markov Decision Processes. arXiv preprint arXiv:2310.11677.
  • Ouyang et al. (2017) Ouyang, Y.; Gagrani, M.; Nayyar, A.; and Jain, R. 2017. Learning unknown markov decision processes: A thompson sampling approach. Advances in neural information processing systems, 30.
  • Pesquerel and Maillard (2022) Pesquerel, F.; and Maillard, O.-A. 2022. IMED-RL: Regret optimal learning of ergodic Markov decision processes. In NeurIPS 2022-Thirty-sixth Conference on Neural Information Processing Systems.
  • Schulman et al. (2015) Schulman, J.; Levine, S.; Abbeel, P.; Jordan, M.; and Moritz, P. 2015. Trust region policy optimization. In International conference on machine learning, 1889–1897. PMLR.
  • Suttle et al. (2023) Suttle, W. A.; Bedi, A.; Patel, B.; Sadler, B. M.; Koppel, A.; and Manocha, D. 2023. Beyond Exponentially Fast Mixing in Average-Reward Reinforcement Learning via Multi-Level Monte Carlo Actor-Critic. In International Conference on Machine Learning, 33240–33267. PMLR.
  • Sutton et al. (1999) Sutton, R. S.; McAllester, D.; Singh, S.; and Mansour, Y. 1999. Policy gradient methods for reinforcement learning with function approximation. Advances in neural information processing systems, 12.
  • Wang et al. (2019) Wang, L.; Cai, Q.; Yang, Z.; and Wang, Z. 2019. Neural Policy Gradient Methods: Global Optimality and Rates of Convergence. arXiv:1909.01150.
  • Wei et al. (2021) Wei, C.-Y.; Jahromi, M. J.; Luo, H.; and Jain, R. 2021. Learning infinite-horizon average-reward mdps with linear function approximation. In International Conference on Artificial Intelligence and Statistics, 3007–3015. PMLR.
  • Wei et al. (2020) Wei, C.-Y.; Jahromi, M. J.; Luo, H.; Sharma, H.; and Jain, R. 2020. Model-free reinforcement learning in infinite-horizon average-reward markov decision processes. In International conference on machine learning, 10170–10180. PMLR.
  • Wei, Liu, and Ying (2022) Wei, H.; Liu, X.; and Ying, L. 2022. A provably-efficient model-free algorithm for infinite-horizon average-reward constrained Markov decision processes. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, 3868–3876.
  • Zhang et al. (2021) Zhang, J.; Ni, C.; Yu, Z.; Szepesvari, C.; and Wang, M. 2021. On the Convergence and Sample Efficiency of Variance-Reduced Policy Gradient Method. Advances in Neural Information Processing Systems, 34: 2228–2240.

Proofs for the Section of Global Convergence Analysis

Proof of Lemma 1

Proof.

Using (6), we arrive at the following.

\nabla_{\theta}V^{\pi_{\theta}}(s)=\nabla_{\theta}\bigg{(}\sum_{a}\pi_{\theta}(a|s)Q^{\pi_{\theta}}(s,a)\bigg{)} (38)
=\sum_{a}\bigg{(}\nabla_{\theta}\pi_{\theta}(a|s)\bigg{)}Q^{\pi_{\theta}}(s,a)+\sum_{a}\pi_{\theta}(a|s)\nabla_{\theta}Q^{\pi_{\theta}}(s,a)
\overset{(a)}{=}\sum_{a}\pi_{\theta}(a|s)\bigg{(}\nabla_{\theta}\log\pi_{\theta}(a|s)\bigg{)}Q^{\pi_{\theta}}(s,a)+\sum_{a}\pi_{\theta}(a|s)\nabla_{\theta}\bigg{(}r(s,a)-J(\theta)+\sum_{s^{\prime}}P(s^{\prime}|s,a)V^{\pi_{\theta}}(s^{\prime})\bigg{)}
=\sum_{a}\pi_{\theta}(a|s)\bigg{(}\nabla_{\theta}\log\pi_{\theta}(a|s)\bigg{)}Q^{\pi_{\theta}}(s,a)+\sum_{a}\pi_{\theta}(a|s)\bigg{(}\sum_{s^{\prime}}P(s^{\prime}|s,a)\nabla_{\theta}V^{\pi_{\theta}}(s^{\prime})\bigg{)}-\nabla_{\theta}J(\theta)

where step $(a)$ is a consequence of $\nabla_{\theta}\log\pi_{\theta}=\frac{\nabla\pi_{\theta}}{\pi_{\theta}}$ and the Bellman equation. Multiplying both sides by $d^{\pi_{\theta}}(s)$, taking a sum over $s\in\mathcal{S}$, and rearranging the terms, we obtain the following.

\begin{split}\nabla_{\theta}J(\theta)&=\sum_{s}d^{\pi_{\theta}}(s)\nabla_{\theta}J(\theta)\\ &=\sum_{s}d^{\pi_{\theta}}(s)\sum_{a}\pi_{\theta}(a|s)\bigg{(}\nabla_{\theta}\log\pi_{\theta}(a|s)\bigg{)}Q^{\pi_{\theta}}(s,a)\\ &+\sum_{s}d^{\pi_{\theta}}(s)\sum_{a}\pi_{\theta}(a|s)\bigg{(}\sum_{s^{\prime}}P(s^{\prime}|s,a)\nabla_{\theta}V^{\pi_{\theta}}(s^{\prime})\bigg{)}-\sum_{s}d^{\pi_{\theta}}(s)\nabla_{\theta}V^{\pi_{\theta}}(s)\\ &=\mathbf{E}_{s\sim d^{\pi_{\theta}},a\sim\pi_{\theta}(\cdot|s)}\bigg{[}Q^{\pi_{\theta}}(s,a)\nabla_{\theta}\log\pi_{\theta}(a|s)\bigg{]}+\sum_{s}d^{\pi_{\theta}}(s)\sum_{s^{\prime}}P^{\pi_{\theta}}(s^{\prime}|s)\nabla_{\theta}V^{\pi_{\theta}}(s^{\prime})-\sum_{s}d^{\pi_{\theta}}(s)\nabla_{\theta}V^{\pi_{\theta}}(s)\\ &\overset{(a)}{=}\mathbf{E}_{s\sim d^{\pi_{\theta}},a\sim\pi_{\theta}(\cdot|s)}\bigg{[}Q^{\pi_{\theta}}(s,a)\nabla_{\theta}\log\pi_{\theta}(a|s)\bigg{]}+\sum_{s^{\prime}}d^{\pi_{\theta}}(s^{\prime})\nabla_{\theta}V^{\pi_{\theta}}(s^{\prime})-\sum_{s}d^{\pi_{\theta}}(s)\nabla_{\theta}V^{\pi_{\theta}}(s)\\ &=\mathbf{E}_{s\sim d^{\pi_{\theta}},a\sim\pi_{\theta}(\cdot|s)}\bigg{[}Q^{\pi_{\theta}}(s,a)\nabla_{\theta}\log\pi_{\theta}(a|s)\bigg{]}\end{split} (39)

where $(a)$ uses the fact that $d^{\pi_{\theta}}$ is a stationary distribution. Note that,

\mathbf{E}_{s\sim d^{\pi_{\theta}},a\sim\pi_{\theta}(\cdot|s)}\bigg{[}V^{\pi_{\theta}}(s)\nabla\log\pi_{\theta}(a|s)\bigg{]} (40)
=\mathbf{E}_{s\sim d^{\pi_{\theta}}}\left[\sum_{a\in\mathcal{A}}V^{\pi_{\theta}}(s)\nabla_{\theta}\pi_{\theta}(a|s)\right]
=\mathbf{E}_{s\sim d^{\pi_{\theta}}}\bigg{[}V^{\pi_{\theta}}(s)\nabla_{\theta}\left(\sum_{a\in\mathcal{A}}\pi_{\theta}(a|s)\right)\bigg{]}=\mathbf{E}_{s\sim d^{\pi_{\theta}}}\bigg{[}V^{\pi_{\theta}}(s)\nabla_{\theta}(1)\bigg{]}=0

We can, therefore, replace the function $Q^{\pi_{\theta}}$ in the policy gradient with the advantage function $A^{\pi_{\theta}}(s,a)=Q^{\pi_{\theta}}(s,a)-V^{\pi_{\theta}}(s)$, $\forall(s,a)\in\mathcal{S}\times\mathcal{A}$. Thus,

\nabla_{\theta}J(\theta)=\mathbf{E}_{s\sim d^{\pi_{\theta}},a\sim\pi_{\theta}(\cdot|s)}\bigg{[}A^{\pi_{\theta}}(s,a)\nabla_{\theta}\log\pi_{\theta}(a|s)\bigg{]} (41)

Proof of Lemma 2

Proof.

The proof runs along the lines of (Wei et al. 2020, Lemma 6). Consider the $k$th epoch and, for notational convenience, denote $\pi_{\theta_{k}}$ as $\pi$. Let $M$ be the number of disjoint sub-trajectories of length $N$ that start with the state $s$ and are at least $N$ distance apart (found by Algorithm 2). Let $y_{k,i}$ be the sum of observed rewards of the $i$th sub-trajectory and $\tau_{i}$ be its starting time. The advantage function estimate is,

\hat{A}^{\pi}(s,a)=\begin{cases}\dfrac{1}{\pi(a|s)}\left[\dfrac{1}{M}\sum_{i=1}^{M}y_{k,i}\mathrm{1}(a_{\tau_{i}}=a)\right]-\dfrac{1}{M}\sum_{i=1}^{M}y_{k,i}~&\text{if}~M>0\\ 0~&\text{if}~M=0\end{cases} (42)

Note the following,

𝐄[yk,i|sτi=s,aτi=a]=r(s,a)+𝐄[t=τi+1τi+Nr(st,at)|sτi=s,aτi=a]=r(s,a)+sP(s|s,a)𝐄[t=τi+1τi+Nr(st,at)|sτi+1=s]=r(s,a)+sP(s|s,a)[j=0N1(Pπ)j(s,)]Trπ=r(s,a)+sP(s|s,a)[j=0N1(Pπ)j(s,)dπ]Trπ+N(dπ)Trπ=(a)r(s,a)+sP(s|s,a)[j=0(Pπ)j(s,)dπ]Trπ+NJπsP(s|s,a)[j=N(Pπ)j(s,)dπ]TrπETπ(s,a)=(b)r(s,a)+sP(s|s,a)Vπ(s)+NJπETπ(s,a)=(c)Qπ(s,a)+(N+1)JπETπ(s,a)\displaystyle\begin{split}&\mathbf{E}\left[y_{k,i}\bigg{|}s_{\tau_{i}}=s,a_{\tau_{i}}=a\right]\\ &=r(s,a)+\mathbf{E}\left[\sum_{t=\tau_{i}+1}^{\tau_{i}+N}r(s_{t},a_{t})\bigg{|}s_{\tau_{i}}=s,a_{\tau_{i}}=a\right]\\ &=r(s,a)+\sum_{s^{\prime}}P(s^{\prime}|s,a)\mathbf{E}\left[\sum_{t=\tau_{i}+1}^{\tau_{i}+N}r(s_{t},a_{t})\bigg{|}s_{\tau_{i}+1}=s^{\prime}\right]\\ &=r(s,a)+\sum_{s^{\prime}}P(s^{\prime}|s,a)\left[\sum_{j=0}^{N-1}(P^{\pi})^{j}(s^{\prime},\cdot)\right]^{T}r^{\pi}\\ &=r(s,a)+\sum_{s^{\prime}}P(s^{\prime}|s,a)\left[\sum_{j=0}^{N-1}(P^{\pi})^{j}(s^{\prime},\cdot)-d^{\pi}\right]^{T}r^{\pi}+N(d^{\pi})^{T}r^{\pi}\\ &\overset{(a)}{=}r(s,a)+\sum_{s^{\prime}}P(s^{\prime}|s,a)\left[\sum_{j=0}^{\infty}(P^{\pi})^{j}(s^{\prime},\cdot)-d^{\pi}\right]^{T}r^{\pi}+NJ^{\pi}-\underbrace{\sum_{s^{\prime}}P(s^{\prime}|s,a)\left[\sum_{j=N}^{\infty}(P^{\pi})^{j}(s^{\prime},\cdot)-d^{\pi}\right]^{T}r^{\pi}}_{\triangleq\mathrm{E}^{\pi}_{T}(s,a)}\\ &\overset{(b)}{=}r(s,a)+\sum_{s^{\prime}}P(s^{\prime}|s,a)V^{\pi}(s^{\prime})+NJ^{\pi}-\mathrm{E}^{\pi}_{T}(s,a)\\ &\overset{(c)}{=}Q^{\pi}(s,a)+(N+1)J^{\pi}-\mathrm{E}^{\pi}_{T}(s,a)\end{split} (43)

where $(a)$ follows from the definition of $J^{\pi}$ as given in (4), $(b)$ is an application of the definition of $V^{\pi}$ given in (7), and $(c)$ follows from the Bellman equation. Define the following quantity.

\delta^{\pi}(s,T)\triangleq\sum_{t=N}^{\infty}\left\lVert(P^{\pi})^{t}(s,\cdot)-d^{\pi}\right\rVert_{1}~~\text{where}~N=4t_{\mathrm{mix}}(\log_{2}T) \qquad (44)

Using Lemma 9, we get $\delta^{\pi}(s,T)\leq\frac{1}{T^{3}}$, which implies $|\mathrm{E}^{\pi}_{T}(s,a)|\leq\frac{1}{T^{3}}$. Observe that,

\begin{split}
&\mathbf{E}\left[\left(\dfrac{1}{\pi(a|s)}y_{k,i}\mathrm{1}(a_{\tau_{i}}=a)-y_{k,i}\right)\bigg|s_{\tau_{i}}=s\right]\\
&=\mathbf{E}\left[y_{k,i}\big|s_{\tau_{i}}=s,a_{\tau_{i}}=a\right]-\sum_{a^{\prime}}\pi(a^{\prime}|s)\mathbf{E}\left[y_{k,i}\big|s_{\tau_{i}}=s,a_{\tau_{i}}=a^{\prime}\right]\\
&=Q^{\pi}(s,a)+(N+1)J^{\pi}-\mathrm{E}^{\pi}_{T}(s,a)-\sum_{a^{\prime}}\pi(a^{\prime}|s)\left[Q^{\pi}(s,a^{\prime})+(N+1)J^{\pi}-\mathrm{E}^{\pi}_{T}(s,a^{\prime})\right]\\
&=Q^{\pi}(s,a)-V^{\pi}(s)-\left[\mathrm{E}^{\pi}_{T}(s,a)-\sum_{a^{\prime}}\pi(a^{\prime}|s)\mathrm{E}_{T}^{\pi}(s,a^{\prime})\right]\\
&=A^{\pi}(s,a)-\Delta^{\pi}_{T}(s,a)
\end{split} \qquad (45)

where $\Delta^{\pi}_{T}(s,a)\triangleq\mathrm{E}^{\pi}_{T}(s,a)-\sum_{a^{\prime}}\pi(a^{\prime}|s)\mathrm{E}_{T}^{\pi}(s,a^{\prime})$. Using the bound on $\mathrm{E}^{\pi}_{T}(s,a)$, one can show that $|\Delta_{T}^{\pi}(s,a)|\leq\frac{2}{T^{3}}$, which implies,

\left|\mathbf{E}\left[\left(\dfrac{1}{\pi(a|s)}y_{k,i}\mathrm{1}(a_{\tau_{i}}=a)-y_{k,i}\right)\bigg|s_{\tau_{i}}=s\right]-A^{\pi}(s,a)\right|\leq|\Delta_{T}^{\pi}(s,a)|\leq\dfrac{2}{T^{3}} \qquad (46)

Note that (46) cannot be directly used to bound the bias of $\hat{A}^{\pi}(s,a)$. This is because the random variable $M$ is correlated with the reward variables $\{y_{k,i}\}_{i=1}^{M}$. To decorrelate these two random variables, imagine an MDP where the state distribution rejuvenates to the stationary distribution $d^{\pi}$ exactly $N$ time steps after the completion of a sub-trajectory. In other words, if a sub-trajectory starts at $\tau_{i}$ and ends at $\tau_{i}+N$, then the system `rests' for an additional $N$ steps before rejuvenating with the state distribution $d^{\pi}$ at $\tau_{i}+2N$. Clearly, the wait time between the rejuvenation after the $(i-1)$th sub-trajectory and the start of the $i$th sub-trajectory is $w_{i}=\tau_{i}-(\tau_{i-1}+2N)$, $i>1$. Let $w_{1}$ be the time between the start of the $k$th epoch and the start of the first sub-trajectory. Note that,

$(a)$ $w_{1}$ only depends on the initial state $s_{(k-1)H}$ and the induced transition function $P^{\pi}$,

$(b)$ $w_{i}$, where $i>1$, depends on the stationary distribution $d^{\pi}$ and the induced transition function $P^{\pi}$,

$(c)$ $M$ only depends on $\{w_{1},w_{2},\cdots\}$ as the other segments of the epoch have fixed length $2N$.

Clearly, in this imaginary MDP, the sequence $\{w_{1},w_{2},\cdots\}$, and hence $M$, is independent of $\{y_{k,1},y_{k,2},\cdots\}$. Let $\mathbf{E}^{\prime}$ denote the expectation operator and $\mathrm{Pr}^{\prime}$ the probability of events in this imaginary system. Define the following.

\Delta_{i}\triangleq\dfrac{y_{k,i}\mathrm{1}(a_{\tau_{i}}=a)}{\pi(a|s)}-y_{k,i}-A^{\pi}(s,a)+\Delta^{\pi}_{T}(s,a) \qquad (47)

where $\Delta^{\pi}_{T}(s,a)$ is defined in (45). Note that we have suppressed the dependence on $T$, $s$, $a$, and $\pi$ while defining $\Delta_{i}$ to reduce clutter. Using (45), one can write $\mathbf{E}^{\prime}\left[\Delta_{i}|\{w_{i}\}\right]=0$. Moreover,

\begin{split}
&\mathbf{E}^{\prime}\left[\left(\hat{A}^{\pi}(s,a)-A^{\pi}(s,a)\right)^{2}\right]\\
&=\mathbf{E}^{\prime}\left[\left(\hat{A}^{\pi}(s,a)-A^{\pi}(s,a)\right)^{2}\Big|M>0\right]\times\mathrm{Pr}^{\prime}(M>0)+\left(A^{\pi}(s,a)\right)^{2}\times\mathrm{Pr}^{\prime}(M=0)\\
&=\mathbf{E}^{\prime}\left[\left(\dfrac{1}{M}\sum_{i=1}^{M}\Delta_{i}-\Delta_{T}^{\pi}(s,a)\right)^{2}\Big|M>0\right]\times\mathrm{Pr}^{\prime}(M>0)+\left(A^{\pi}(s,a)\right)^{2}\times\mathrm{Pr}^{\prime}(M=0)\\
&\leq 2\mathbf{E}_{\{w_{i}\}}^{\prime}\left[\mathbf{E}^{\prime}\left[\left(\dfrac{1}{M}\sum_{i=1}^{M}\Delta_{i}\right)^{2}\Big|\{w_{i}\}\right]\Big|w_{1}\leq H-N\right]\times\mathrm{Pr}^{\prime}(w_{1}\leq H-N)+2\left(\Delta_{T}^{\pi}(s,a)\right)^{2}+\left(A^{\pi}(s,a)\right)^{2}\times\mathrm{Pr}^{\prime}(M=0)\\
&\overset{(a)}{\leq}2\mathbf{E}_{\{w_{i}\}}^{\prime}\left[\dfrac{1}{M^{2}}\sum_{i=1}^{M}\mathbf{E}^{\prime}\left[\Delta_{i}^{2}\big|\{w_{i}\}\right]\Big|w_{1}\leq H-N\right]\times\mathrm{Pr}^{\prime}(w_{1}\leq H-N)+\dfrac{8}{T^{6}}+\left(A^{\pi}(s,a)\right)^{2}\times\mathrm{Pr}^{\prime}(M=0)
\end{split} \qquad (48)

where $(a)$ uses the bound $|\Delta_{T}^{\pi}(s,a)|\leq\frac{2}{T^{3}}$ derived in (46), and the fact that $\{\Delta_{i}\}$ are zero-mean independent random variables conditioned on $\{w_{i}\}$. Note that $|y_{k,i}|\leq N$ almost surely, $|A^{\pi}(s,a)|\leq\mathcal{O}(t_{\mathrm{mix}})$ via Lemma 8, and $|\Delta^{\pi}_{T}(s,a)|\leq\frac{2}{T^{3}}$ as shown in (46). Combining these, we get $\mathbf{E}^{\prime}[|\Delta_{i}|^{2}\big|\{w_{i}\}]\leq\mathcal{O}(N^{2}/\pi(a|s))$ (see the definition of $\Delta_{i}$ in (47)). Plugging this bound into (48), we get the following result.

\mathbf{E}^{\prime}\left[\left(\hat{A}^{\pi}(s,a)-A^{\pi}(s,a)\right)^{2}\right]\leq 2\mathbf{E}^{\prime}\left[\dfrac{1}{M}\bigg|w_{1}\leq H-N\right]\mathcal{O}\left(\dfrac{N^{2}}{\pi(a|s)}\right)+\dfrac{8}{T^{6}}+\mathcal{O}(t_{\mathrm{mix}}^{2})\times\mathrm{Pr}^{\prime}(w_{1}>H-N) \qquad (49)

Note that one can use Lemma 10 to bound the following violation probability.

\mathrm{Pr}^{\prime}(w_{1}>H-N)\leq\left(1-\dfrac{3d^{\pi}(s)}{4}\right)^{4t_{\mathrm{hit}}\sqrt{T}(\log T)-1}\overset{(a)}{\leq}\left(1-\dfrac{3d^{\pi}(s)}{4}\right)^{\frac{4}{d^{\pi}(s)}\log T}\leq\dfrac{1}{T^{3}} \qquad (50)

where $(a)$ follows from the fact that $4t_{\mathrm{hit}}\sqrt{T}(\log_{2}T)-1\geq\frac{4}{d^{\pi}(s)}\log_{2}T$ for sufficiently large $T$. Finally, note that if $M<M_{0}$, where $M_{0}$ is defined as,

M_{0}\triangleq\dfrac{H-N}{2N+\dfrac{4N\log T}{d^{\pi}(s)}} \qquad (51)

then there exists at least one $w_{i}$ that is larger than $4N\log_{2}T/d^{\pi}(s)$, which can happen with at most the following probability according to Lemma 10.

\mathrm{Pr}^{\prime}\left(M<M_{0}\right)\leq\left(1-\dfrac{3d^{\pi}(s)}{4}\right)^{\frac{4\log T}{d^{\pi}(s)}}\leq\dfrac{1}{T^{3}} \qquad (52)

The above probability bound can be used to obtain the following result,

\begin{split}\mathbf{E}^{\prime}\left[\dfrac{1}{M}\bigg|M>0\right]=\dfrac{\sum_{m=1}^{\infty}\frac{1}{m}\mathrm{Pr}^{\prime}(M=m)}{\mathrm{Pr}^{\prime}(M>0)}&\leq\dfrac{1\times\mathrm{Pr}^{\prime}(M\leq M_{0})+\dfrac{1}{M_{0}}\mathrm{Pr}^{\prime}(M>M_{0})}{\mathrm{Pr}^{\prime}(M>0)}\\&\leq\dfrac{\dfrac{1}{T^{3}}+\dfrac{2N+\dfrac{4N\log T}{d^{\pi}(s)}}{H-N}}{1-\dfrac{1}{T^{3}}}\leq\mathcal{O}\left(\dfrac{N\log T}{Hd^{\pi}(s)}\right)\end{split} \qquad (53)

Injecting (50) and (53) into (49) (noting that the events $\{M>0\}$ and $\{w_{1}\leq H-N\}$ coincide), we finally obtain the following.

\mathbf{E}^{\prime}\left[\left(\hat{A}^{\pi}(s,a)-A^{\pi}(s,a)\right)^{2}\right]\leq\mathcal{O}\left(\dfrac{N^{3}\log T}{Hd^{\pi}(s)\pi(a|s)}\right)=\mathcal{O}\left(\dfrac{N^{3}t_{\mathrm{hit}}\log T}{H\pi(a|s)}\right)=\mathcal{O}\left(\dfrac{t^{2}_{\mathrm{mix}}(\log T)^{2}}{\sqrt{T}\pi(a|s)}\right) \qquad (54)

Eq. (54) shows that our desired inequality is satisfied in the imaginary system. We now need a mechanism to translate this result to our actual MDP. Notice that we can write $(\hat{A}^{\pi}(s,a)-A^{\pi}(s,a))^{2}=f(X)$ where $X=(M,\tau_{1},\mathcal{T}_{1},\cdots,\tau_{M},\mathcal{T}_{M})$ and $\mathcal{T}_{i}=(a_{\tau_{i}},s_{\tau_{i}+1},a_{\tau_{i}+1},\cdots,s_{\tau_{i}+N},a_{\tau_{i}+N})$. We have,

\dfrac{\mathbf{E}[f(X)]}{\mathbf{E}^{\prime}[f(X)]}=\dfrac{\sum_{X}f(X)\mathrm{Pr}(X)}{\sum_{X}f(X)\mathrm{Pr}^{\prime}(X)}\leq\max_{X}\dfrac{\mathrm{Pr}(X)}{\mathrm{Pr}^{\prime}(X)} \qquad (55)

The last inequality uses the non-negativity of $f(\cdot)$. Observe that, for a fixed sequence $X$, we have,

\begin{split}\mathrm{Pr}(X)=&\mathrm{Pr}(\tau_{1})\times\mathrm{Pr}(\mathcal{T}_{1}|\tau_{1})\times\mathrm{Pr}(\tau_{2}|\tau_{1},\mathcal{T}_{1})\times\mathrm{Pr}(\mathcal{T}_{2}|\tau_{2})\times\cdots\\&\times\mathrm{Pr}(\tau_{M}|\tau_{M-1},\mathcal{T}_{M-1})\times\mathrm{Pr}(\mathcal{T}_{M}|\tau_{M})\times\mathrm{Pr}(s_{t}\neq s,\forall t\in[\tau_{M}+2N,kH-N]|\tau_{M},\mathcal{T}_{M}),\end{split} \qquad (56)
\begin{split}\mathrm{Pr}^{\prime}(X)=&\mathrm{Pr}(\tau_{1})\times\mathrm{Pr}(\mathcal{T}_{1}|\tau_{1})\times\mathrm{Pr}^{\prime}(\tau_{2}|\tau_{1},\mathcal{T}_{1})\times\mathrm{Pr}(\mathcal{T}_{2}|\tau_{2})\times\cdots\\&\times\mathrm{Pr}^{\prime}(\tau_{M}|\tau_{M-1},\mathcal{T}_{M-1})\times\mathrm{Pr}(\mathcal{T}_{M}|\tau_{M})\times\mathrm{Pr}(s_{t}\neq s,\forall t\in[\tau_{M}+2N,kH-N]|\tau_{M},\mathcal{T}_{M}),\end{split} \qquad (57)

Thus, the difference between $\mathrm{Pr}(X)$ and $\mathrm{Pr}^{\prime}(X)$ arises because $\mathrm{Pr}(\tau_{i+1}|\tau_{i},\mathcal{T}_{i})\neq\mathrm{Pr}^{\prime}(\tau_{i+1}|\tau_{i},\mathcal{T}_{i})$, $\forall i\in\{1,\cdots,M-1\}$. The ratio of these two terms can be bounded as follows,

\begin{split}\dfrac{\mathrm{Pr}(\tau_{i+1}|\tau_{i},\mathcal{T}_{i})}{\mathrm{Pr}^{\prime}(\tau_{i+1}|\tau_{i},\mathcal{T}_{i})}&=\dfrac{\sum_{s^{\prime}\neq s}\mathrm{Pr}(s_{\tau_{i}+2N}=s^{\prime}|\tau_{i},\mathcal{T}_{i})\times\mathrm{Pr}(s_{t}\neq s,\forall t\in[\tau_{i}+2N,\tau_{i+1}-1],s_{\tau_{i+1}}=s|s_{\tau_{i}+2N}=s^{\prime})}{\sum_{s^{\prime}\neq s}\mathrm{Pr}^{\prime}(s_{\tau_{i}+2N}=s^{\prime}|\tau_{i},\mathcal{T}_{i})\times\mathrm{Pr}(s_{t}\neq s,\forall t\in[\tau_{i}+2N,\tau_{i+1}-1],s_{\tau_{i+1}}=s|s_{\tau_{i}+2N}=s^{\prime})}\\&\leq\max_{s^{\prime}}\dfrac{\mathrm{Pr}(s_{\tau_{i}+2N}=s^{\prime}|\tau_{i},\mathcal{T}_{i})}{\mathrm{Pr}^{\prime}(s_{\tau_{i}+2N}=s^{\prime}|\tau_{i},\mathcal{T}_{i})}\\&=\max_{s^{\prime}}\left(1+\dfrac{\mathrm{Pr}(s_{\tau_{i}+2N}=s^{\prime}|\tau_{i},\mathcal{T}_{i})-d^{\pi}(s^{\prime})}{d^{\pi}(s^{\prime})}\right)\overset{(a)}{\leq}\max_{s^{\prime}}\left(1+\dfrac{1}{T^{3}d^{\pi}(s^{\prime})}\right)\leq 1+\dfrac{t_{\mathrm{hit}}}{T^{3}}\leq 1+\dfrac{1}{T^{2}}\end{split} \qquad (58)

where $(a)$ is a consequence of Lemma 9. We have,

\dfrac{\mathrm{Pr}(X)}{\mathrm{Pr}^{\prime}(X)}\leq\left(1+\dfrac{1}{T^{2}}\right)^{M}\leq e^{\frac{M}{T^{2}}}\overset{(a)}{\leq}e^{\frac{1}{T}}\leq\mathcal{O}\left(1+\dfrac{1}{T}\right) \qquad (59)

where $(a)$ uses the fact that $M\leq T$. Combining (55) and (59), we get,

\begin{split}\mathbf{E}\left[\left(\hat{A}^{\pi}(s,a)-A^{\pi}(s,a)\right)^{2}\right]&\leq\mathcal{O}\left(1+\dfrac{1}{T}\right)\mathbf{E}^{\prime}\left[\left(\hat{A}^{\pi}(s,a)-A^{\pi}(s,a)\right)^{2}\right]\\&\overset{(a)}{\leq}\mathcal{O}\left(\dfrac{t^{2}_{\mathrm{mix}}(\log T)^{2}}{\sqrt{T}\pi(a|s)}\right)\end{split} \qquad (60)

where $(a)$ follows from (54). This concludes the lemma. ∎
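The change-of-measure step (55) used above can be illustrated in isolation by the following toy computation (our own example with arbitrary discrete distributions; none of these quantities come from the paper).

```python
# Toy illustration (arbitrary distributions, not from the paper) of the
# change-of-measure bound (55): for non-negative f,
#   E_P[f] / E_P'[f] <= max_x P(x) / P'(x).
import numpy as np

rng = np.random.default_rng(2)
n = 6
P = rng.random(n);  P /= P.sum()       # plays the role of Pr (actual system)
Pp = rng.random(n); Pp /= Pp.sum()     # plays the role of Pr' (imaginary system)
f = rng.random(n)                      # any non-negative function

print((P @ f) / (Pp @ f), "<=", np.max(P / Pp))
```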

Proof of Lemma 3

Proof.

Recall from Eq. (18) that,

\omega_{k}=\dfrac{1}{H}\sum_{t=t_{k}}^{t_{k+1}-1}\hat{A}^{\pi_{\theta_{k}}}(s_{t},a_{t})\nabla_{\theta}\log\pi_{\theta_{k}}(a_{t}|s_{t}) \qquad (61)

Define the following quantity,

\bar{\omega}_{k}=\dfrac{1}{H}\sum_{t=t_{k}}^{t_{k+1}-1}A^{\pi_{\theta_{k}}}(s_{t},a_{t})\nabla_{\theta}\log\pi_{\theta_{k}}(a_{t}|s_{t}) \qquad (62)

where $t_{k}=(k-1)H$ is the starting time of the $k$th epoch. Note that the true gradient is given by,

\nabla_{\theta}J(\theta_{k})=\mathbf{E}_{s\sim d^{\pi_{\theta_{k}}},a\sim\pi_{\theta_{k}}(\cdot|s)}\left[A^{\pi_{\theta_{k}}}(s,a)\nabla_{\theta}\log\pi_{\theta_{k}}(a|s)\right] \qquad (63)
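As an illustration of how the estimate in (61) is assembled from the samples of epoch $k$, consider the following sketch (assumed interface: `adv_hat` stands in for the estimator (42) and `grad_log_pi` for the score function of the parameterized policy; both names are hypothetical).

```python
# Minimal sketch (assumed interface) of how (61) is assembled: average
# A-hat(s_t, a_t) * grad log pi(a_t|s_t) over the H samples of epoch k.
# `adv_hat` and `grad_log_pi` are hypothetical callables standing in for the
# estimator (42) and the score function of the parameterized policy.
import numpy as np

def estimate_omega(states, actions, adv_hat, grad_log_pi):
    """states, actions: the H state-action pairs visited in epoch k."""
    omega = np.zeros_like(np.asarray(grad_log_pi(states[0], actions[0]), dtype=float))
    for s, a in zip(states, actions):
        omega += adv_hat(s, a) * np.asarray(grad_log_pi(s, a), dtype=float)
    return omega / len(states)
```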

Moreover, using Assumption 2 and Lemma 8, one can show that $|A^{\pi_{\theta_{k}}}(s,a)\nabla_{\theta}\log\pi_{\theta_{k}}(a|s)|\leq\mathcal{O}(t_{\mathrm{mix}}G)$, $\forall(s,a)\in\mathcal{S}\times\mathcal{A}$, which implies $|\nabla_{\theta}J(\theta_{k})|\leq\mathcal{O}(t_{\mathrm{mix}}G)$. Applying Lemma 12, we therefore arrive at,

\mathbf{E}\left[\left\lVert\bar{\omega}_{k}-\nabla_{\theta}J(\theta_{k})\right\rVert^{2}\right]\leq\mathcal{O}\left(G^{2}t^{2}_{\mathrm{mix}}\log T\right)\times\mathcal{O}\left(\dfrac{t_{\mathrm{mix}}\log T}{H}\right)=\mathcal{O}\left(\dfrac{G^{2}t_{\mathrm{mix}}^{2}}{t_{\mathrm{hit}}\sqrt{T}}\right) \qquad (64)

We would like to point out that Lemma 12 was also used in (Suttle et al. 2023) to analyze their actor-critic algorithm. Finally, the difference $\mathbf{E}\left\lVert\omega_{k}-\bar{\omega}_{k}\right\rVert^{2}$ can be bounded as follows.

\begin{split}\mathbf{E}\left\lVert\omega_{k}-\bar{\omega}_{k}\right\rVert^{2}&=\mathbf{E}\left[\left\lVert\dfrac{1}{H}\sum_{t=t_{k}}^{t_{k+1}-1}\hat{A}^{\pi_{\theta_{k}}}(s_{t},a_{t})\nabla_{\theta}\log\pi_{\theta_{k}}(a_{t}|s_{t})-\dfrac{1}{H}\sum_{t=t_{k}}^{t_{k+1}-1}A^{\pi_{\theta_{k}}}(s_{t},a_{t})\nabla_{\theta}\log\pi_{\theta_{k}}(a_{t}|s_{t})\right\rVert^{2}\right]\\&\overset{(a)}{\leq}\dfrac{G^{2}}{H}\sum_{t=t_{k}}^{t_{k+1}-1}\mathbf{E}\left[\left(\hat{A}^{\pi_{\theta_{k}}}(s_{t},a_{t})-A^{\pi_{\theta_{k}}}(s_{t},a_{t})\right)^{2}\right]\\&\leq\dfrac{G^{2}}{H}\sum_{t=t_{k}}^{t_{k+1}-1}\mathbf{E}\left[\sum_{a}\pi_{\theta_{k}}(a|s_{t})\mathbf{E}\left[\left(\hat{A}^{\pi_{\theta_{k}}}(s_{t},a)-A^{\pi_{\theta_{k}}}(s_{t},a)\right)^{2}\bigg|s_{t}\right]\right]\\&\overset{(b)}{\leq}\mathcal{O}\left(\dfrac{AG^{2}t^{2}_{\mathrm{mix}}(\log T)^{2}}{\sqrt{T}}\right)\end{split} \qquad (65)

where $(a)$ follows from Assumption 2 and Jensen's inequality, whereas $(b)$ follows from Lemma 2. Combining (64) and (65), we conclude the result. ∎

Proof of Lemma 4

Proof.

Using Lemma 11, we see that

J^{\pi}-J^{\pi^{\prime}} =\sum_{s}\sum_{a}d^{\pi}(s)(\pi(a|s)-\pi^{\prime}(a|s))Q^{\pi^{\prime}}(s,a) \qquad (66)
=\sum_{s}\sum_{a}d^{\pi}(s)\pi(a|s)Q^{\pi^{\prime}}(s,a)-\sum_{s}d^{\pi}(s)V^{\pi^{\prime}}(s)
\overset{(a)}{=}\sum_{s}\sum_{a}d^{\pi}(s)\pi(a|s)Q^{\pi^{\prime}}(s,a)-\sum_{s}\sum_{a}d^{\pi}(s)\pi(a|s)V^{\pi^{\prime}}(s)
=\sum_{s}\sum_{a}d^{\pi}(s)\pi(a|s)[Q^{\pi^{\prime}}(s,a)-V^{\pi^{\prime}}(s)]
=\mathbf{E}_{s\sim d^{\pi}}\mathbf{E}_{a\sim\pi(\cdot|s)}\big[A^{\pi^{\prime}}(s,a)\big]

Step $(a)$ follows from the fact that $\sum_{a}\pi(a|s)=1$, $\forall s\in\mathcal{S}$. ∎
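The identity proved above can be verified numerically on a toy example. The sketch below (our own illustration with a random tabular MDP; all names are assumptions, and the bias function is computed by solving the average-reward evaluation equation directly) reproduces (66) up to machine precision.

```python
# Toy verification of (66) (illustration only: a random tabular MDP; all names
# here are assumptions). The bias function is obtained by solving the
# average-reward evaluation equation h = r^pi - J^pi + P^pi h with d^pi h = 0.
import numpy as np

rng = np.random.default_rng(1)
S, A = 5, 3
P = rng.random((S, A, S)); P /= P.sum(-1, keepdims=True)
r = rng.random((S, A))

def random_policy():
    pi = rng.random((S, A))
    return pi / pi.sum(-1, keepdims=True)

def stationary(P_pi):
    evals, evecs = np.linalg.eig(P_pi.T)
    d = np.real(evecs[:, np.argmin(np.abs(evals - 1))])
    return d / d.sum()

def avg_reward_and_advantage(pi):
    P_pi = np.einsum('sa,sap->sp', pi, P)
    r_pi = (pi * r).sum(-1)
    d = stationary(P_pi)
    J = d @ r_pi
    h = np.linalg.solve(np.eye(S) - P_pi + np.outer(np.ones(S), d), r_pi - J)
    Q = r - J + np.einsum('sap,p->sa', P, h)
    return J, d, Q - h[:, None]          # J^pi, d^pi, A^pi(s, a)

pi, pi_prime = random_policy(), random_policy()
J_pi, d_pi, _ = avg_reward_and_advantage(pi)
J_prime, _, A_prime = avg_reward_and_advantage(pi_prime)

lhs = J_pi - J_prime
rhs = np.einsum('s,sa,sa->', d_pi, pi, A_prime)
print(abs(lhs - rhs))                    # should be ~0 up to machine precision
```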

Proof of Lemma 5

Proof.

We start with the definition of KL divergence.

\mathbf{E}_{s\sim d^{\pi^{*}}}[KL(\pi^{*}(\cdot|s)\|\pi_{\theta_{k}}(\cdot|s))-KL(\pi^{*}(\cdot|s)\|\pi_{\theta_{k+1}}(\cdot|s))] \qquad (67)
=\mathbf{E}_{s\sim d^{\pi^{*}}}\mathbf{E}_{a\sim\pi^{*}(\cdot|s)}\bigg[\log\frac{\pi_{\theta_{k+1}}(a|s)}{\pi_{\theta_{k}}(a|s)}\bigg]
\overset{(a)}{\geq}\mathbf{E}_{s\sim d^{\pi^{*}}}\mathbf{E}_{a\sim\pi^{*}(\cdot|s)}[\nabla_{\theta}\log\pi_{\theta_{k}}(a|s)\cdot(\theta_{k+1}-\theta_{k})]-\frac{B}{2}\|\theta_{k+1}-\theta_{k}\|^{2}
=\alpha\mathbf{E}_{s\sim d^{\pi^{*}}}\mathbf{E}_{a\sim\pi^{*}(\cdot|s)}[\nabla_{\theta}\log\pi_{\theta_{k}}(a|s)\cdot\omega_{k}]-\frac{B\alpha^{2}}{2}\|\omega_{k}\|^{2}
=\alpha\mathbf{E}_{s\sim d^{\pi^{*}}}\mathbf{E}_{a\sim\pi^{*}(\cdot|s)}[\nabla_{\theta}\log\pi_{\theta_{k}}(a|s)\cdot\omega^{*}_{k}]+\alpha\mathbf{E}_{s\sim d^{\pi^{*}}}\mathbf{E}_{a\sim\pi^{*}(\cdot|s)}[\nabla_{\theta}\log\pi_{\theta_{k}}(a|s)\cdot(\omega_{k}-\omega^{*}_{k})]-\frac{B\alpha^{2}}{2}\|\omega_{k}\|^{2}
=\alpha[J^{*}-J(\theta_{k})]+\alpha\mathbf{E}_{s\sim d^{\pi^{*}}}\mathbf{E}_{a\sim\pi^{*}(\cdot|s)}[\nabla_{\theta}\log\pi_{\theta_{k}}(a|s)\cdot\omega^{*}_{k}]-\alpha[J^{*}-J(\theta_{k})]
\quad+\alpha\mathbf{E}_{s\sim d^{\pi^{*}}}\mathbf{E}_{a\sim\pi^{*}(\cdot|s)}[\nabla_{\theta}\log\pi_{\theta_{k}}(a|s)\cdot(\omega_{k}-\omega^{*}_{k})]-\frac{B\alpha^{2}}{2}\|\omega_{k}\|^{2}
\overset{(b)}{=}\alpha[J^{*}-J(\theta_{k})]+\alpha\mathbf{E}_{s\sim d^{\pi^{*}}}\mathbf{E}_{a\sim\pi^{*}(\cdot|s)}\bigg[\nabla_{\theta}\log\pi_{\theta_{k}}(a|s)\cdot\omega^{*}_{k}-A^{\pi_{\theta_{k}}}(s,a)\bigg]
\quad+\alpha\mathbf{E}_{s\sim d^{\pi^{*}}}\mathbf{E}_{a\sim\pi^{*}(\cdot|s)}[\nabla_{\theta}\log\pi_{\theta_{k}}(a|s)\cdot(\omega_{k}-\omega^{*}_{k})]-\frac{B\alpha^{2}}{2}\|\omega_{k}\|^{2}
\overset{(c)}{\geq}\alpha[J^{*}-J(\theta_{k})]-\alpha\sqrt{\mathbf{E}_{s\sim d^{\pi^{*}}}\mathbf{E}_{a\sim\pi^{*}(\cdot|s)}\bigg[\bigg(\nabla_{\theta}\log\pi_{\theta_{k}}(a|s)\cdot\omega^{*}_{k}-A^{\pi_{\theta_{k}}}(s,a)\bigg)^{2}\bigg]}
\quad-\alpha\mathbf{E}_{s\sim d^{\pi^{*}}}\mathbf{E}_{a\sim\pi^{*}(\cdot|s)}\|\nabla_{\theta}\log\pi_{\theta_{k}}(a|s)\|_{2}\|\omega_{k}-\omega^{*}_{k}\|-\frac{B\alpha^{2}}{2}\|\omega_{k}\|^{2}
\overset{(d)}{\geq}\alpha[J^{*}-J(\theta_{k})]-\alpha\sqrt{\epsilon_{\mathrm{bias}}}-\alpha G\|\omega_{k}-\omega^{*}_{k}\|-\frac{B\alpha^{2}}{2}\|\omega_{k}\|^{2}

where step $(a)$ holds by Assumption 2, step $(b)$ holds by Lemma 4, step $(c)$ uses the convexity of the function $f(x)=x^{2}$ (via Jensen's inequality) together with the Cauchy-Schwarz inequality, and step $(d)$ follows from Assumptions 2 and 3. Rearranging the terms, we have

\begin{split}J^{*}-J(\theta_{k})&\leq\sqrt{\epsilon_{\mathrm{bias}}}+G\|\omega_{k}-\omega^{*}_{k}\|+\frac{B\alpha}{2}\|\omega_{k}\|^{2}\\&+\frac{1}{\alpha}\mathbf{E}_{s\sim d^{\pi^{*}}}[KL(\pi^{*}(\cdot|s)\|\pi_{\theta_{k}}(\cdot|s))-KL(\pi^{*}(\cdot|s)\|\pi_{\theta_{k+1}}(\cdot|s))]\end{split} \qquad (68)

Summing from $k=1$ to $K$, using the non-negativity of the KL divergence, and dividing the resulting expression by $K$, we get the desired result. ∎

Proof of Lemma 6

Proof.

By the $L$-smoothness of the objective function, we have

\begin{split}J(\theta_{k+1})&\geq J(\theta_{k})+\left<\nabla J(\theta_{k}),\theta_{k+1}-\theta_{k}\right>-\frac{L}{2}\|\theta_{k+1}-\theta_{k}\|^{2}\\&\overset{(a)}{=}J(\theta_{k})+\alpha\nabla J(\theta_{k})^{T}\omega_{k}-\frac{L\alpha^{2}}{2}\left\lVert\omega_{k}\right\rVert^{2}\\&=J(\theta_{k})+\alpha\left\lVert\nabla J(\theta_{k})\right\rVert^{2}-\alpha\langle\nabla J(\theta_{k})-\omega_{k},\nabla J(\theta_{k})\rangle-\frac{L\alpha^{2}}{2}\|(\omega_{k}-\nabla J(\theta_{k}))+\nabla J(\theta_{k})\|^{2}\\&\overset{(b)}{\geq}J(\theta_{k})+\alpha\left\lVert\nabla J(\theta_{k})\right\rVert^{2}-\frac{\alpha}{2}\|\nabla J(\theta_{k})-\omega_{k}\|^{2}-\frac{\alpha}{2}\|\nabla J(\theta_{k})\|^{2}\\&\quad-L\alpha^{2}\|\nabla J(\theta_{k})-\omega_{k}\|^{2}-L\alpha^{2}\|\nabla J(\theta_{k})\|^{2}\\&=J(\theta_{k})+\left(\frac{\alpha}{2}-L\alpha^{2}\right)\left\lVert\nabla J(\theta_{k})\right\rVert^{2}-\left(\frac{\alpha}{2}+L\alpha^{2}\right)\|\nabla J(\theta_{k})-\omega_{k}\|^{2}\end{split} \qquad (69)

where step (a) holds from the update rule $\theta_{k+1}=\theta_{k}+\alpha\omega_{k}$ and step (b) holds due to the Cauchy-Schwarz inequality (via $2\langle x,y\rangle\leq\|x\|^{2}+\|y\|^{2}$) together with the bound $\|x+y\|^{2}\leq 2\|x\|^{2}+2\|y\|^{2}$. Rearranging the terms yields the following.

\left\lVert\nabla J(\theta_{k})\right\rVert^{2}\leq\frac{J(\theta_{k+1})-J(\theta_{k})+(\frac{\alpha}{2}+L\alpha^{2})\|\nabla J(\theta_{k})-\omega_{k}\|^{2}}{\frac{\alpha}{2}-L\alpha^{2}} \qquad (70)

Choosing $\alpha=\frac{1}{4L}$ and summing over $k\in\{1,2,\cdots,K\}$ results in the following.

\dfrac{1}{K}\sum_{k=1}^{K}\left\lVert\nabla J(\theta_{k})\right\rVert^{2}\leq\dfrac{16L}{K}\left[J(\theta_{K})-J(\theta_{0})\right]+\dfrac{3}{K}\sum_{k=1}^{K}\|\nabla J(\theta_{k})-\omega_{k}\|^{2} \qquad (71)

Using $|J(\theta_{k})|\leq 1$, which follows from the bounded reward, we conclude the result. ∎
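As a quick sanity check on the constants appearing in (71), the following snippet (assuming sympy is available; purely illustrative) substitutes $\alpha=\frac{1}{4L}$ into (70).

```python
# Quick symbolic check (assumes sympy is available; illustration only) of the
# constants obtained by substituting alpha = 1/(4L) into (70).
import sympy as sp

L = sp.symbols('L', positive=True)
alpha = 1 / (4 * L)
denom = alpha / 2 - L * alpha**2
print(sp.simplify(1 / denom))                            # 16*L
print(sp.simplify((alpha / 2 + L * alpha**2) / denom))   # 3
```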

Proofs for the Section of Regret Analysis

Proof of Lemma 7

Proof.

Using Taylor's expansion, we can write the following $\forall(s,a)\in\mathcal{S}\times\mathcal{A}$ and $\forall k$.

\begin{split}|\pi_{\theta_{k+1}}(a|s)-\pi_{\theta_{k}}(a|s)|&=\left|(\theta_{k+1}-\theta_{k})^{T}\nabla_{\theta}\pi_{\bar{\theta}_{k}}(a|s)\right|\\&=\pi_{\bar{\theta}_{k}}(a|s)\left|(\theta_{k+1}-\theta_{k})^{T}\nabla_{\theta}\log\pi_{\bar{\theta}_{k}}(a|s)\right|\\&\leq\pi_{\bar{\theta}_{k}}(a|s)\left\lVert\theta_{k+1}-\theta_{k}\right\rVert\left\lVert\nabla_{\theta}\log\pi_{\bar{\theta}_{k}}(a|s)\right\rVert\overset{(a)}{\leq}G\pi_{\bar{\theta}_{k}}(a|s)\left\lVert\theta_{k+1}-\theta_{k}\right\rVert\end{split} \qquad (72)

where $\bar{\theta}_{k}$ is some convex combination of $\theta_{k}$ and $\theta_{k+1}$ and $(a)$ follows from Assumption 2. This concludes the first statement. Applying (72) and Lemma 11, we obtain,

\begin{split}\sum_{k=1}^{K}\mathbf{E}\left|J(\theta_{k+1})-J(\theta_{k})\right|&=\sum_{k=1}^{K}\mathbf{E}\left|\sum_{s,a}d^{\pi_{\theta_{k+1}}}(s)(\pi_{\theta_{k+1}}(a|s)-\pi_{\theta_{k}}(a|s))Q^{\pi_{\theta_{k}}}(s,a)\right|\\&\leq\sum_{k=1}^{K}\mathbf{E}\left[\sum_{s,a}d^{\pi_{\theta_{k+1}}}(s)\left|\pi_{\theta_{k+1}}(a|s)-\pi_{\theta_{k}}(a|s)\right|\left|Q^{\pi_{\theta_{k}}}(s,a)\right|\right]\\&\leq G\sum_{k=1}^{K}\mathbf{E}\left[\sum_{s,a}d^{\pi_{\theta_{k+1}}}(s)\pi_{\bar{\theta}_{k}}(a|s)\|\theta_{k+1}-\theta_{k}\||Q^{\pi_{\theta_{k}}}(s,a)|\right]\\&\overset{(a)}{\leq}G\alpha\sum_{k=1}^{K}\mathbf{E}\left[\underbrace{\sum_{s,a}d^{\pi_{\theta_{k+1}}}(s)\pi_{\bar{\theta}_{k}}(a|s)}_{=1}\|\omega_{k}\|\cdot 6t_{\mathrm{mix}}\right]\\&=6G\alpha t_{\mathrm{mix}}\sum_{k=1}^{K}\mathbf{E}\left\lVert\omega_{k}\right\rVert\\&\overset{(b)}{\leq}6G\alpha t_{\mathrm{mix}}\sqrt{K}\left(\sum_{k=1}^{K}\mathbf{E}\left\lVert\omega_{k}\right\rVert^{2}\right)^{\frac{1}{2}}\overset{(c)}{\leq}\mathcal{\tilde{O}}\left(\dfrac{\alpha GT^{\frac{1}{4}}}{t_{\mathrm{hit}}}\left[\sqrt{A}Gt_{\mathrm{mix}}+\sqrt{Lt_{\mathrm{mix}}t_{\mathrm{hit}}}\right]\right)\end{split} \qquad (73)

Inequality $(a)$ uses Lemma 8 and the update rule $\theta_{k+1}=\theta_{k}+\alpha\omega_{k}$. Step $(b)$ holds by the Cauchy-Schwarz and Jensen inequalities, whereas $(c)$ can be derived using (32) and substituting $K=T/H$. This establishes the second statement. Next, recall from (4) that for any policy $\pi_{\theta}$, $r^{\pi_{\theta}}(s)\triangleq\sum_{a}\pi_{\theta}(a|s)r(s,a)$. Note that, for any policy parameter $\theta$ and any state $s\in\mathcal{S}$, the following holds.

V^{\pi_{\theta}}(s)=\sum_{t=0}^{\infty}\left<(P^{\pi_{\theta}})^{t}(s,\cdot)-d^{\pi_{\theta}},r^{\pi_{\theta}}\right>=\sum_{t=0}^{N-1}\left<(P^{\pi_{\theta}})^{t}(s,\cdot),r^{\pi_{\theta}}\right>-NJ(\theta)+\sum_{t=N}^{\infty}\left<(P^{\pi_{\theta}})^{t}(s,\cdot)-d^{\pi_{\theta}},r^{\pi_{\theta}}\right>. \qquad (74)

Define the following quantity.

\delta^{\pi_{\theta}}(s,T)\triangleq\sum_{t=N}^{\infty}\left\lVert(P^{\pi_{\theta}})^{t}(s,\cdot)-d^{\pi_{\theta}}\right\rVert_{1}~~\text{where}~N=4t_{\mathrm{mix}}(\log_{2}T) \qquad (75)

Lemma 9 states that, for sufficiently large $T$, we have $\delta^{\pi_{\theta}}(s,T)\leq\frac{1}{T^{3}}$ for any policy $\pi_{\theta}$ and state $s$. Combining this result with the fact that the reward function is bounded in $[0,1]$, we obtain,

\begin{split}&\sum_{k=1}^{K}\mathbf{E}|V^{\pi_{\theta_{k+1}}}(s_{k})-V^{\pi_{\theta_{k}}}(s_{k})|\\&\leq\sum_{k=1}^{K}\mathbf{E}\left|\sum_{t=0}^{N-1}\left<(P^{\pi_{\theta_{k+1}}})^{t}(s_{k},\cdot)-(P^{\pi_{\theta_{k}}})^{t}(s_{k},\cdot),r^{\pi_{\theta_{k+1}}}\right>\right|+\sum_{k=1}^{K}\mathbf{E}\left|\sum_{t=0}^{N-1}\left<(P^{\pi_{\theta_{k}}})^{t}(s_{k},\cdot),r^{\pi_{\theta_{k+1}}}-r^{\pi_{\theta_{k}}}\right>\right|\\&\quad+N\sum_{k=1}^{K}\mathbf{E}|J(\theta_{k+1})-J(\theta_{k})|+\frac{2K}{T^{3}}\\&\overset{(a)}{\leq}\sum_{k=1}^{K}\sum_{t=0}^{N-1}\mathbf{E}\left\lVert\left((P^{\pi_{\theta_{k+1}}})^{t}-(P^{\pi_{\theta_{k}}})^{t}\right)r^{\pi_{\theta_{k+1}}}\right\rVert_{\infty}+\sum_{k=1}^{K}\sum_{t=0}^{N-1}\mathbf{E}\left\lVert r^{\pi_{\theta_{k+1}}}-r^{\pi_{\theta_{k}}}\right\rVert_{\infty}\\&\quad+\mathcal{\tilde{O}}\left(\dfrac{\alpha GT^{\frac{1}{4}}t_{\mathrm{mix}}}{t_{\mathrm{hit}}}\left[\sqrt{A}Gt_{\mathrm{mix}}+\sqrt{Lt_{\mathrm{mix}}t_{\mathrm{hit}}}\right]\right)\end{split} \qquad (76)

where $(a)$ follows from (73) and the substitution $N=4t_{\mathrm{mix}}(\log_{2}T)$. For the first term, note that,

\begin{split}&\left\lVert((P^{\pi_{\theta_{k+1}}})^{t}-(P^{\pi_{\theta_{k}}})^{t})r^{\pi_{\theta_{k+1}}}\right\rVert_{\infty}\\&\leq\left\lVert P^{\pi_{\theta_{k+1}}}((P^{\pi_{\theta_{k+1}}})^{t-1}-(P^{\pi_{\theta_{k}}})^{t-1})r^{\pi_{\theta_{k+1}}}\right\rVert_{\infty}+\left\lVert(P^{\pi_{\theta_{k+1}}}-P^{\pi_{\theta_{k}}})(P^{\pi_{\theta_{k}}})^{t-1}r^{\pi_{\theta_{k+1}}}\right\rVert_{\infty}\\&\overset{(a)}{\leq}\left\lVert((P^{\pi_{\theta_{k+1}}})^{t-1}-(P^{\pi_{\theta_{k}}})^{t-1})r^{\pi_{\theta_{k+1}}}\right\rVert_{\infty}+\max_{s}\left\lVert P^{\pi_{\theta_{k+1}}}(s,\cdot)-P^{\pi_{\theta_{k}}}(s,\cdot)\right\rVert_{1}\end{split} \qquad (77)

Inequality $(a)$ holds since every row of $P^{\pi_{\theta_{k}}}$ sums to $1$ and $\left\lVert(P^{\pi_{\theta_{k}}})^{t-1}r^{\pi_{\theta_{k+1}}}\right\rVert_{\infty}\leq 1$. Moreover, invoking (72) and the parameter update rule $\theta_{k+1}=\theta_{k}+\alpha\omega_{k}$, we get,

\begin{split}\max_{s}\|P^{\pi_{\theta_{k+1}}}(s,\cdot)-P^{\pi_{\theta_{k}}}(s,\cdot)\|_{1}&=\max_{s}\sum_{s^{\prime}}\left|\sum_{a}(\pi_{\theta_{k+1}}(a|s)-\pi_{\theta_{k}}(a|s))P(s^{\prime}|s,a)\right|\\&\leq\max_{s}\sum_{s^{\prime}}\sum_{a}\left|\pi_{\theta_{k+1}}(a|s)-\pi_{\theta_{k}}(a|s)\right|P(s^{\prime}|s,a)\\&\leq G\left\lVert\theta_{k+1}-\theta_{k}\right\rVert\max_{s}\sum_{s^{\prime}}\sum_{a}\pi_{\bar{\theta}_{k}}(a|s)P(s^{\prime}|s,a)\\&=G\left\lVert\theta_{k+1}-\theta_{k}\right\rVert=\alpha G\left\lVert\omega_{k}\right\rVert\end{split}

Plugging the above result into (77) and using a recursive argument, we get,

\begin{split}\left\lVert((P^{\pi_{\theta_{k+1}}})^{t}-(P^{\pi_{\theta_{k}}})^{t})r^{\pi_{\theta_{k+1}}}\right\rVert_{\infty}&\leq\sum_{t^{\prime}=1}^{t}\max_{s}\left\lVert P^{\pi_{\theta_{k+1}}}(s,\cdot)-P^{\pi_{\theta_{k}}}(s,\cdot)\right\rVert_{1}\\&\leq\sum_{t^{\prime}=1}^{t}\alpha G\left\lVert\omega_{k}\right\rVert\leq\alpha tG\left\lVert\omega_{k}\right\rVert\end{split}

Finally, we have

\begin{split}\sum_{k=1}^{K}\sum_{t=0}^{N-1}&\mathbf{E}\left\lVert((P^{\pi_{\theta_{k+1}}})^{t}-(P^{\pi_{\theta_{k}}})^{t})r^{\pi_{\theta_{k+1}}}\right\rVert_{\infty}\\&\leq\sum_{k=1}^{K}\sum_{t=0}^{N-1}\alpha tG\,\mathbf{E}\left\lVert\omega_{k}\right\rVert\\&\leq\mathcal{O}(\alpha GN^{2})\sum_{k=1}^{K}\mathbf{E}\left\lVert\omega_{k}\right\rVert\\&\leq\mathcal{O}(\alpha GN^{2}\sqrt{K})\left(\sum_{k=1}^{K}\mathbf{E}\left\lVert\omega_{k}\right\rVert^{2}\right)^{\frac{1}{2}}\\&\overset{(a)}{\leq}\mathcal{\tilde{O}}\left(\dfrac{\alpha GT^{\frac{1}{4}}t_{\mathrm{mix}}}{t_{\mathrm{hit}}}\left[\sqrt{A}Gt_{\mathrm{mix}}+\sqrt{Lt_{\mathrm{mix}}t_{\mathrm{hit}}}\right]\right)\end{split} \qquad (78)

where $(a)$ follows from (32). Moreover, notice that,

\begin{split}\sum_{k=1}^{K}\sum_{t=0}^{N-1}\mathbf{E}\left\lVert r^{\pi_{\theta_{k+1}}}-r^{\pi_{\theta_{k}}}\right\rVert_{\infty}&\leq\sum_{k=1}^{K}\sum_{t=0}^{N-1}\mathbf{E}\left[\max_{s}\left|\sum_{a}(\pi_{\theta_{k+1}}(a|s)-\pi_{\theta_{k}}(a|s))r(s,a)\right|\right]\\&\overset{(a)}{\leq}\alpha GN\sum_{k=1}^{K}\mathbf{E}\left\lVert\omega_{k}\right\rVert\\&\leq\alpha GN\sqrt{K}\left(\sum_{k=1}^{K}\mathbf{E}\left\lVert\omega_{k}\right\rVert^{2}\right)^{\frac{1}{2}}\\&\overset{(b)}{\leq}\mathcal{\tilde{O}}\left(\dfrac{\alpha GT^{\frac{1}{4}}}{t_{\mathrm{hit}}}\left[\sqrt{A}Gt_{\mathrm{mix}}+\sqrt{Lt_{\mathrm{mix}}t_{\mathrm{hit}}}\right]\right)\end{split} \qquad (79)

where $(a)$ follows from (72) and the parameter update rule $\theta_{k+1}=\theta_{k}+\alpha\omega_{k}$, while $(b)$ is a consequence of (32). Combining (76), (78), and (79), we establish the third statement. ∎

Proof of Theorem 2

Recall the decomposition of the regret in Eq. (35) and take the expectation

\begin{split}\mathbf{E}[\mathrm{Reg}_{T}]&=\mathbf{E}\left[\sum_{t=0}^{T-1}\left(J^{*}-r(s_{t},a_{t})\right)\right]=H\sum_{k=1}^{K}\mathbf{E}\left[J^{*}-J({\theta_{k}})\right]+\mathbf{E}\left[\sum_{k=1}^{K}\sum_{t\in\mathcal{I}_{k}}\left(J(\theta_{k})-r(s_{t},a_{t})\right)\right]\\&=H\sum_{k=1}^{K}\mathbf{E}\left[J^{*}-J({\theta_{k}})\right]+\mathbf{E}\left[\sum_{k=1}^{K-1}V^{\pi_{\theta_{k+1}}}(s_{kH})-V^{\pi_{\theta_{k}}}(s_{kH})\right]+\mathbf{E}\left[V^{\pi_{\theta_{K}}}(s_{T})-V^{\pi_{\theta_{0}}}(s_{0})\right]\end{split} \qquad (80)

and using the result in (34), Lemma 7 and Lemma 8, we get,

\begin{split}\mathbf{E}[\mathrm{Reg}_{T}]&\leq\mathcal{\tilde{O}}\left(\dfrac{ABG^{2}t_{\mathrm{mix}}^{2}+BLt_{\mathrm{mix}}t_{\mathrm{hit}}}{L}\sqrt{T}\right)+\mathcal{\tilde{O}}\left(\left[\sqrt{A}G^{2}t_{\mathrm{mix}}+G\sqrt{Lt_{\mathrm{mix}}t_{\mathrm{hit}}}\right]\left(1+\dfrac{1}{\mu_{F}}\right)T^{\frac{3}{4}}\right)+T\sqrt{\epsilon_{\mathrm{bias}}}\\&\quad+\mathcal{\tilde{O}}\left(Lt_{\mathrm{mix}}t_{\mathrm{hit}}\mathbf{E}_{s\sim d^{\pi^{*}}}[KL(\pi^{*}(\cdot|s)\|\pi_{\theta_{1}}(\cdot|s))]\sqrt{T}\right)+\mathcal{\tilde{O}}\left(\dfrac{t_{\mathrm{mix}}}{Lt_{\mathrm{hit}}}\left[\sqrt{A}G^{2}t_{\mathrm{mix}}+G\sqrt{Lt_{\mathrm{mix}}t_{\mathrm{hit}}}\right]T^{\frac{1}{4}}\right)+\mathcal{O}(t_{\mathrm{mix}})\end{split} \qquad (81)

Some Auxiliary Lemmas for the Proofs

Lemma 8.

(Wei et al. 2020, Lemma 14) For any ergodic MDP with mixing time $t_{\mathrm{mix}}$, the following holds $\forall(s,a)\in\mathcal{S}\times\mathcal{A}$ and for any policy $\pi$.

(a)~|V^{\pi}(s)|\leq 5t_{\mathrm{mix}},\qquad(b)~|Q^{\pi}(s,a)|\leq 6t_{\mathrm{mix}}
Lemma 9.

(Wei et al. 2020, Corollary 13.2) Let $\delta^{\pi}(\cdot,T)$ be defined as written below for an arbitrary policy $\pi$.

\delta^{\pi}(s,T)\triangleq\sum_{t=N}^{\infty}\left\lVert(P^{\pi})^{t}(s,\cdot)-d^{\pi}\right\rVert_{1},~\forall s\in\mathcal{S}~~\text{where}~N=4t_{\mathrm{mix}}(\log_{2}T) \qquad (82)

If $t_{\mathrm{mix}}<T/4$, the following inequality holds $\forall s\in\mathcal{S}$: $\delta^{\pi}(s,T)\leq\frac{1}{T^{3}}$.

Lemma 10.

(Wei et al. 2020, Lemma 16) Let $\mathcal{I}=\{t_{1}+1,t_{1}+2,\cdots,t_{2}\}$ be a certain period of an epoch $k$ of Algorithm 2 with length $N$. Then for any $s$, the probability that the algorithm never visits $s$ in $\mathcal{I}$ is upper bounded by

\left(1-\frac{3d^{\pi_{\theta_{k}}}(s)}{4}\right)^{\left\lfloor\frac{|\mathcal{I}|}{N}\right\rfloor} \qquad (83)
Lemma 11.

(Wei et al. 2020, Lemma 15) For two different policies $\pi$ and $\pi^{\prime}$, the difference of the objective function $J$ is

J^{\pi}-J^{\pi^{\prime}}=\sum_{s}\sum_{a}d^{\pi}(s)(\pi(a|s)-\pi^{\prime}(a|s))Q^{\pi^{\prime}}(s,a) \qquad (84)
Lemma 12.

(Dorfman and Levy 2022, Lemma A.6) Let $\theta\in\Theta$ be a policy parameter. Fix a trajectory $z=\{(s_{t},a_{t},r_{t},s_{t+1})\}_{t\in\mathbb{N}}$ generated by following the policy $\pi_{\theta}$ starting from some initial state $s_{0}\sim\rho$. Let $\nabla L(\theta)$ be the gradient that we wish to estimate over $z$, and let $l(\theta,\cdot)$ be a function such that $\mathbf{E}_{z\sim d^{\pi_{\theta}},\pi_{\theta}}[l(\theta,z)]=\nabla L(\theta)$. Assume that $\left\lVert l(\theta,z)\right\rVert,\left\lVert\nabla L(\theta)\right\rVert\leq G_{L}$, $\forall\theta\in\Theta$, $\forall z\in\mathcal{S}\times\mathcal{A}\times\mathbb{R}\times\mathcal{S}$. Define $l^{Q}=\frac{1}{Q}\sum_{i=1}^{Q}l(\theta,z_{i})$. If $P=2t_{\mathrm{mix}}\log T$, then the following holds as long as $Q\leq T$,

\mathbf{E}\left[\left\lVert l^{Q}-\nabla L(\theta)\right\rVert^{2}\right]\leq\mathcal{O}\left(G_{L}^{2}\log\left(PQ\right)\dfrac{P}{Q}\right) \qquad (85)
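A minimal sketch of the batched estimator $l^{Q}$ appearing in Lemma 12 is given below (assumed names; this is only an illustration of the averaging, not the construction of (Dorfman and Levy 2022)).

```python
# Minimal sketch (assumed names; not the construction of Dorfman and Levy 2022)
# of the batched estimator in Lemma 12: l^Q = (1/Q) sum_i l(theta, z_i).
import numpy as np

def batched_estimate(l, theta, samples):
    """l(theta, z): per-sample estimate of grad L(theta); samples: z_1,...,z_Q."""
    return np.mean([l(theta, z) for z in samples], axis=0)
```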