
Exploration by Maximizing Rényi Entropy for Reward-Free RL Framework

Chuheng Zhang*  Yuanying Cai*  Longbo Huang  Jian Li
*These authors contributed equally to this work.
Abstract

Exploration is essential for reinforcement learning (RL). To face the challenges of exploration, we consider a reward-free RL framework that completely separates exploration from exploitation and brings new challenges for exploration algorithms. In the exploration phase, the agent learns an exploratory policy by interacting with a reward-free environment and collects a dataset of transitions by executing the policy. In the planning phase, the agent computes a good policy for any reward function based on the dataset, without further interacting with the environment. This framework is suitable for the meta RL setting where there are many reward functions of interest. In the exploration phase, we propose to maximize the Rényi entropy over the state-action space and justify this objective theoretically. The success of using Rényi entropy as the objective stems from its encouragement of exploring the hard-to-reach state-actions. We further deduce a policy gradient formulation for this objective and design a practical exploration algorithm that can deal with complex environments. In the planning phase, we solve for good policies given arbitrary reward functions using a batch RL algorithm. Empirically, we show that our exploration algorithm is effective and sample efficient, and results in superior policies for arbitrary reward functions in the planning phase.

1 Introduction

The trade-off between exploration and exploitation is at the core of reinforcement learning (RL). Designing efficient exploration algorithms, while being a highly nontrivial task, is essential to success in many RL tasks (Burda et al. 2019b; Ecoffet et al. 2019). Hence, it is natural to ask the following high-level question: What can we achieve by pure exploration? To address this question, several settings related to meta reinforcement learning (meta RL) have been proposed (see e.g., Wang et al. 2016; Duan et al. 2016; Finn, Abbeel, and Levine 2017). One common setting in meta RL is to learn a model in a reward-free environment in the meta-training phase, and use the learned model as the initialization for fast adaptation to new tasks in the meta-testing phase (Eysenbach et al. 2019; Gupta et al. 2018; Nagabandi et al. 2019). Since the agent still needs to explore the environment under the new tasks in the meta-testing phase (the number of new samples needed varies across tasks), it is less clear how to evaluate the effectiveness of the exploration in the meta-training phase. Another setting is to learn a policy in a reward-free environment and test the policy under a task with a specific reward function (such as the score in Montezuma’s Revenge) without further training on the task (Burda et al. 2019b; Ecoffet et al. 2019; Burda et al. 2019a). However, there is no guarantee that the algorithm has fully explored the transition dynamics of the environment unless we test the learned model for arbitrary reward functions. Recently, Jin et al. (2020) proposed a reward-free RL framework that fully decouples exploration and exploitation. Further, they designed a provably efficient algorithm that conducts a finite number of steps of reward-free exploration and returns near-optimal policies for arbitrary reward functions. However, their algorithm is designed for the tabular case and can hardly be extended to continuous or high-dimensional state spaces since it constructs a policy for each state.

The reward-free RL framework is as follows: First, a set of policies is trained to explore the dynamics of the reward-free environment in the exploration phase (i.e., the meta-training phase). Then, a dataset of trajectories is collected by executing the learned exploratory policies. In the planning phase (i.e., the meta-testing phase), an arbitrary reward function is specified and a batch RL algorithm (Lange, Gabel, and Riedmiller 2012; Fujimoto, Meger, and Precup 2019) is applied to solve for a good policy solely based on the dataset, without further interaction with the environment. This framework is suitable for scenarios where there are many reward functions of interest or where the reward is designed offline to elicit desired behavior. The key to success in this framework is to obtain a dataset with good coverage over all possible situations in the environment using as few samples as possible, which in turn requires the exploratory policy to fully explore the environment.

Several methods that encourage various forms of exploration have been developed in the reinforcement learning literature. The maximum entropy framework (Haarnoja et al. 2017) maximizes the cumulative reward while also maximizing the entropy over the action space conditioned on each state. This framework results in several efficient and robust algorithms, such as soft Q-learning (Schulman, Chen, and Abbeel 2017; Nachum et al. 2017), SAC (Haarnoja et al. 2018) and MPO (Abdolmaleki et al. 2018). On the other hand, maximizing the state space coverage may result in better exploration. Various kinds of objectives/regularizations are used for better exploration of the state space, including information-theoretic metrics (Houthooft et al. 2016; Mohamed and Rezende 2015; Eysenbach et al. 2019) (especially the entropy of the state space, e.g., Hazan et al. (2019); Islam et al. (2019)), the prediction error of a dynamical model (Burda et al. 2019a; Pathak et al. 2017; de Abril and Kanai 2018), the state visitation count (Burda et al. 2019b; Bellemare et al. 2016; Ostrovski et al. 2017) or other heuristic signals such as novelty (Lehman and Stanley 2011; Conti et al. 2018), surprise (Achiam and Sastry 2016) or curiosity (Schmidhuber 1991).

In practice, it is desirable to obtain a single exploratory policy instead of a set of policies whose cardinality may be as large as that of the state space (e.g., Jin et al. 2020; Misra et al. 2019). Our exploration algorithm outputs a single exploratory policy for the reward-free RL framework by maximizing the Rényi entropy over the state-action space in the exploration phase. In particular, we demonstrate the advantage of using the state-action space instead of the state space via a simple example (see Section 3 and Figure 1). Moreover, Rényi entropy generalizes a family of entropies, including the commonly used Shannon entropy. We justify the use of Rényi entropy as the objective function theoretically by providing an upper bound on the number of samples in the dataset that ensures a near-optimal policy is obtained for any reward function in the planning phase.

Further, we derive a gradient ascent update rule for maximizing the Rényi entropy over the state-action space. The derived update rule is similar to the vanilla policy gradient update with the reward replaced by a function of the discounted stationary state-action distribution of the current policy. We use a variational autoencoder (VAE) (Kingma and Welling 2014) as the density model to estimate this distribution. The corresponding reward changes over iterations, which makes it hard to accurately estimate a value function under the current reward. To address this issue, we propose to estimate a state value function using off-policy data with the reward relabeled by the current density model. This enables us to efficiently update the policy in a stable way using PPO (Schulman et al. 2017). Afterwards, we collect a dataset by executing the learned policy. In the planning phase, when a reward function is specified, we augment the dataset with the rewards and use a batch RL algorithm, batch constrained deep Q-learning (BCQ) (Fujimoto, Meger, and Precup 2019; Fujimoto et al. 2019), to plan for a good policy under the reward function. We conduct experiments on several environments with discrete, continuous or high-dimensional state spaces. The experimental results indicate that our algorithm is effective, sample efficient and robust in the exploration phase, and achieves good performance under the reward-free RL framework.

Our contributions are summarized as follows:

  • (Section 3) We consider the reward-free RL framework that separates exploration from exploitation and therefore places higher requirements on the exploration algorithm. To efficiently explore under this framework, we propose a novel objective that maximizes the Rényi entropy over the state-action space in the exploration phase and justify this objective theoretically.

  • (Section 4) We propose a practical algorithm based on a derived policy gradient formulation to maximize the Rényi entropy over the state-action space for the reward-free RL framework.

  • (Section 5) We conduct a wide range of experiments and the results indicate that our algorithm is efficient and robust in the exploration phase and results in superior performance in the downstream planning phase for arbitrary reward functions.

2 Preliminary

A reward-free environment can be formulated as a controlled Markov process (CMP)¹ $(\mathcal{S},\mathcal{A},\mathbb{P},\mu,\gamma)$, which specifies the state space $\mathcal{S}$ with $S=|\mathcal{S}|$, the action space $\mathcal{A}$ with $A=|\mathcal{A}|$, the transition dynamics $\mathbb{P}$, the initial state distribution $\mu\in\Delta^{\mathcal{S}}$ and the discount factor $\gamma$. A (stationary) policy $\pi_{\theta}:\mathcal{S}\to\Delta^{\mathcal{A}}$ parameterized by $\theta$ specifies the probability of choosing the actions on each state. The stationary discounted state visitation distribution (or simply the state distribution) under the policy $\pi$ is defined as $d_{\mu}^{\pi}(s):=(1-\gamma)\sum_{t=0}^{\infty}\gamma^{t}\text{Pr}(s_{t}=s|s_{0}\sim\mu;\pi)$. The stationary discounted state-action visitation distribution (or simply the state-action distribution) under the policy $\pi$ is defined as $d_{\mu}^{\pi}(s,a):=(1-\gamma)\sum_{t=0}^{\infty}\gamma^{t}\text{Pr}(s_{t}=s,a_{t}=a|s_{0}\sim\mu;\pi)$. Unless otherwise stated, we use $d_{\mu}^{\pi}$ to denote the state-action distribution. We also use $d_{s_{0},a_{0}}^{\pi}$ to denote the distribution started from the state-action pair $(s_{0},a_{0})$ instead of the initial state distribution $\mu$. (¹Although some literature uses CMP to refer to an MDP, we use CMP to denote a Markov process (Silver 2015) equipped with a control structure in this paper.)
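
For concreteness, the discounted state-action distribution of a stationary policy can be computed exactly in the tabular case by solving a linear system. The following is a minimal sketch of ours (not from the paper), assuming the CMP is given as a transition tensor P[s, a, s'] and the policy as a matrix pi[s, a]:

    import numpy as np

    def state_action_distribution(P, pi, mu, gamma):
        """Compute d_mu^pi(s, a) for a tabular CMP.

        P:  transition tensor of shape (S, A, S), P[s, a, s'] = P(s' | s, a)
        pi: policy matrix of shape (S, A), rows sum to one
        mu: initial state distribution of shape (S,)
        """
        S, A, _ = P.shape
        # Transition matrix over state-action pairs: M[(s,a),(s',a')] = P(s'|s,a) * pi(a'|s')
        M = (P.reshape(S * A, S)[:, :, None] * pi[None, :, :]).reshape(S * A, S * A)
        # Initial state-action distribution nu(s, a) = mu(s) * pi(a|s)
        nu = (mu[:, None] * pi).reshape(S * A)
        # d = (1 - gamma) * sum_t gamma^t * nu^T M^t = (1 - gamma) * nu^T (I - gamma M)^{-1}
        d = (1 - gamma) * np.linalg.solve(np.eye(S * A) - gamma * M.T, nu)
        return d.reshape(S, A)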

When a reward function $r:\mathcal{S}\times\mathcal{A}\to\mathbb{R}$ is specified, the CMP becomes a Markov decision process (MDP) (Sutton and Barto 2018) $(\mathcal{S},\mathcal{A},\mathbb{P},r,\mu,\gamma)$. The objective for the MDP is to find a policy $\pi_{\theta}$ that maximizes the expected cumulative reward $J(\theta;r):=\mathbb{E}_{s_{0}\sim\mu}[V^{\pi_{\theta}}(s_{0};r)]$, where $V^{\pi}(s;r)=\mathbb{E}[\sum_{t=0}^{\infty}\gamma^{t}r(s_{t},a_{t})|s_{0}=s;\pi]$ is the state value function. We indicate the dependence on $r$ to emphasize that there may be more than one reward function of interest. The policy gradient for this objective is $\nabla_{\theta}J(\theta;r)=\mathbb{E}_{(s,a)\sim d^{\pi_{\theta}}_{\mu}}[\nabla_{\theta}\log\pi_{\theta}(a|s)\,Q^{\pi_{\theta}}(s,a;r)]$ (Williams 1992), where $Q^{\pi}(s,a;r)=\mathbb{E}[\sum_{t=0}^{\infty}\gamma^{t}r(s_{t},a_{t})|s_{0}=s,a_{0}=a;\pi]=\frac{1}{1-\gamma}\langle d_{s,a}^{\pi},r\rangle$ is the Q function.

The Rényi entropy of a distribution $d\in\Delta^{\mathcal{X}}$ is defined as $H_{\alpha}(d):=\frac{1}{1-\alpha}\log\left(\sum_{x\in\mathcal{X}}d^{\alpha}(x)\right)$, where $d(x)$ is the probability mass or the probability density at $x$ (and the summation becomes an integral in the latter case). When $\alpha\to 0$, Rényi entropy becomes Hartley entropy and equals the logarithm of the size of the support of $d$. When $\alpha\to 1$, Rényi entropy becomes Shannon entropy $H_{1}(d):=-\sum_{x\in\mathcal{X}}d(x)\log d(x)$ (Bromiley, Thacker, and Bouhova-Thacker 2004; Sanei, Borzadaran, and Amini 2016).
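
As a quick reference, here is a small numerical sketch of ours (not part of the paper) that evaluates the Rényi entropy of a discrete distribution, treating $\alpha=1$ as the Shannon limit and $\alpha=0$ as the Hartley limit:

    import numpy as np

    def renyi_entropy(d, alpha):
        """Renyi entropy H_alpha of a discrete distribution d (1-D array summing to one)."""
        d = np.asarray(d, dtype=float)
        d = d[d > 0]                                  # zero-probability atoms do not contribute
        if np.isclose(alpha, 1.0):                    # limit alpha -> 1: Shannon entropy
            return float(-np.sum(d * np.log(d)))
        if np.isclose(alpha, 0.0):                    # limit alpha -> 0: Hartley entropy
            return float(np.log(d.size))              # log of the support size
        return float(np.log(np.sum(d ** alpha)) / (1.0 - alpha))

    # Example: the uniform distribution maximizes H_alpha for every alpha.
    print(renyi_entropy([0.25, 0.25, 0.25, 0.25], 0.5))   # = log(4)
    print(renyi_entropy([0.7, 0.1, 0.1, 0.1], 0.5))       # < log(4)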

Given a distribution $d\in\Delta^{\mathcal{X}}$, a density model $P_{\phi}:\mathcal{X}\to\mathbb{R}$ parameterized by $\phi$ gives a probability density estimate of $d$ based on samples drawn from $d$. The variational auto-encoder (VAE) (Kingma and Welling 2014) is a popular density model that maximizes the variational lower bound (ELBO) of the log-likelihood. Specifically, VAE maximizes a lower bound of $\mathbb{E}_{x\sim d}[\log P_{\phi}(x)]$, i.e., $\max_{\phi}\mathbb{E}_{x\sim d,z\sim q_{\phi}(\cdot|x)}[\log p_{\phi}(x|z)]-\mathbb{E}_{x\sim d}[D_{\text{KL}}(q_{\phi}(\cdot|x)\,\|\,p(z))]$, where $q_{\phi}$ and $p_{\phi}$ are the encoder and the decoder respectively and $p(z)$ is a prior distribution over the latent variable $z$.
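
The sketch below (ours, in PyTorch, with assumed network sizes and a unit-variance Gaussian decoder) illustrates how the ELBO decomposes into a reconstruction term and a KL term; the per-sample ELBO is what can later serve as a log-density estimate of $d_{\mu}^{\pi}$:

    import torch
    import torch.nn as nn

    class VAE(nn.Module):
        def __init__(self, x_dim, z_dim=8, hidden=128):
            super().__init__()
            self.enc = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU())
            self.enc_mu = nn.Linear(hidden, z_dim)
            self.enc_logvar = nn.Linear(hidden, z_dim)
            self.dec = nn.Sequential(nn.Linear(z_dim, hidden), nn.ReLU(), nn.Linear(hidden, x_dim))

        def elbo(self, x):
            """Per-sample ELBO, a lower bound on log P_phi(x) (used as the density estimate)."""
            h = self.enc(x)
            mu, logvar = self.enc_mu(h), self.enc_logvar(h)
            z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)      # reparameterization trick
            recon = self.dec(z)
            # Gaussian decoder with unit variance => log p(x|z) is a squared error up to constants
            log_px_given_z = -0.5 * ((x - recon) ** 2).sum(dim=-1)
            # KL(q(z|x) || N(0, I)) in closed form
            kl = -0.5 * (1 + logvar - mu ** 2 - logvar.exp()).sum(dim=-1)
            return log_px_given_z - kl

    # Training step sketch: maximize the ELBO over a batch of state-action samples.
    # vae = VAE(x_dim=obs_dim + act_dim); loss = -vae.elbo(batch).mean(); loss.backward()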

3 The objective for the exploration phase

The objective for the exploration phase under the reward-free RL framework is to induce an informative and compact dataset: The informative condition is that the dataset should have good coverage such that the planning phase generates good policies for arbitrary reward functions. The compact condition is that the size of the dataset should be as small as possible while still ensuring successful planning. In this section, we show that the Rényi entropy over the state-action space (i.e., $H_{\alpha}(d_{\mu}^{\pi})$) is a good objective function for the exploration phase. We first demonstrate the advantage of maximizing the state-action space entropy over maximizing the state space entropy with a toy example. Then, we provide a motivation for using Rényi entropy by analyzing a deterministic setting. Finally, we provide an upper bound on the number of samples needed in the dataset for successful planning if we have access to a policy that maximizes the Rényi entropy over the state-action space. For ease of analysis, we assume the state-action space is discrete in this section and derive a practical algorithm that deals with continuous state-action spaces in the next section.

Figure 1: A deterministic five-state MDP with two actions. With discount factor $\gamma=1$, a policy that maximizes the entropy of the discounted state visitation distribution does not visit all the transitions, while a policy that maximizes the entropy of the discounted state-action visitation distribution visits all the transitions. Covering all the transitions is important for the reward-free RL framework. The width of the arrows indicates the visitation frequency under different policies.

Why maximize the entropy over the state-action space? We demonstrate the advantage of maximizing the entropy over the state-action space with the toy example shown in Figure 1. The example is an MDP with two actions and five states. The first action always drives the agent back to the first state, while the second action moves the agent to the next state. For simplicity of presentation, we consider the case with discount factor $\gamma=1$; other values are similar. The policy that maximizes the entropy of the state distribution is a deterministic policy that takes the actions shown in red. The dataset obtained by executing this policy is poor: the planning algorithm fails when, in the planning phase, a sparse reward is assigned to one of the state-action pairs that this policy visits with zero probability (e.g., a reward function $r(s,a)$ that equals $1$ on $(s_{2},a_{1})$ and $0$ otherwise). In contrast, a policy that maximizes the entropy of the state-action distribution avoids this problem. For example, when executing the policy that maximizes the Rényi entropy with $\alpha=0.5$ over the state-action space, the expected number of samples needed for the induced dataset to contain all the state-action pairs is only about 44 (cf. Appendix The toy example in Figure 1). Note that, when the transition dynamics is known to be deterministic, a dataset containing all the state-action pairs is sufficient for the planning algorithm to obtain an optimal policy since the full transition dynamics is then known.

Why use Rényi entropy? We analyze a deterministic setting where the transition dynamics is known to be deterministic. In this setting, the objective for the framework can be expressed as a specific objective function for the exploration phase. This objective function is hard to optimize directly but motivates us to use Rényi entropy as a surrogate.

We define $n:=SA$ as the cardinality of the state-action space. Given an exploratory policy $\pi$, we assume the dataset is collected in a way such that the transitions in the dataset can be treated as i.i.d. samples from $d_{\mu}^{\pi}$, where $d_{\mu}^{\pi}$ is the state-action distribution induced by the policy $\pi$.

In the deterministic setting, we can recover the exact transition dynamics of the environment from a dataset of transitions that contains all the $n$ state-action pairs. Such a dataset leads to successful planning for arbitrary reward functions, and therefore satisfies the informative condition. In order to obtain such a dataset that is also compact, we stop collecting samples once the dataset contains all the $n$ pairs. Given the distribution $d_{\mu}^{\pi}=(d_{1},\cdots,d_{n})\in\Delta^{n}$ from which we collect samples, the average size of the dataset is $G(d_{\mu}^{\pi})$, where

$$G(d_{\mu}^{\pi})=\int_{0}^{\infty}\left(1-\prod_{i=1}^{n}\left(1-e^{-d_{i}t}\right)\right)dt, \qquad (1)$$

which is a result of the coupon collector’s problem (Flajolet, Gardy, and Thimonier 1992). Accordingly, the objective for the exploration phase can be expressed as $\min_{\pi\in\Pi}G(d_{\mu}^{\pi})$, where $\Pi$ is the set of all feasible policies. (Notice that, due to the limitation of the transition dynamics, the feasible state-action distributions form only a subset of $\Delta^{n}$.) We show the contour of this function on $\Delta^{n}$ in Figure 2a. We can see that, when any component of the distribution $d_{\mu}^{\pi}$ approaches zero, $G(d_{\mu}^{\pi})$ increases rapidly, forming a barrier indicated by the black arrow. This barrier prevents the agent from visiting a state-action with vanishing probability, and thus encourages the agent to explore the hard-to-reach state-actions.
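
Equation (1) can be evaluated numerically for a given distribution (this is one way to check the $G$ values reported in the appendix); below is a short sketch of ours using SciPy, intended only for evaluating $G$, not for use as an optimization objective:

    import numpy as np
    from scipy.integrate import quad

    def expected_cover_time(d):
        """G(d): expected number of i.i.d. samples from d until every atom has appeared (Eq. (1))."""
        d = np.asarray(d, dtype=float)
        integrand = lambda t: 1.0 - np.prod(1.0 - np.exp(-d * t))
        value, _ = quad(integrand, 0.0, np.inf)
        return value

    print(expected_cover_time([1/3, 1/3, 1/3]))       # uniform: moderate cover time
    print(expected_cover_time([0.01, 0.495, 0.495]))  # a near-zero component blows G(d) up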

However, this function involves an improper integral which is hard to handle, and therefore cannot be directly used as an objective function in the exploration phase. One common choice for a tractable objective function is Shannon entropy, i.e., $\max_{\pi\in\Pi}H_{1}(d_{\mu}^{\pi})$ (Hazan et al. 2019; Islam et al. 2019), whose contour is shown in Figure 2b. We can see that Shannon entropy may still be large when some component of $d_{\mu}^{\pi}$ approaches zero (e.g., the black arrow indicates a case where the entropy is relatively large but $d_{1}\to 0$). Therefore, maximizing Shannon entropy may result in a policy that visits some state-action pair with vanishing probability. We find that the contour of Rényi entropy (shown in Figure 2c/d) aligns better with $G(d_{\mu}^{\pi})$ and alleviates this problem. In general, the policy $\pi$ that maximizes Rényi entropy may correspond to a smaller $G(d_{\mu}^{\pi})$ than that resulting from maximizing Shannon entropy for different CMPs (more evidence can be found in Appendix An example to illustrate the advantage of using Rényi entropy).

Figure 2: The contours of different objective functions on $\Delta^{n}$ with $n=3$. $G(d)$ is the average size of the transition dataset sampled from the distribution $d$ to ensure successful planning. Compared with Shannon entropy ($H_{1}$), the contour of Rényi entropy ($H_{0.5}$, $H_{0.1}$) aligns better with $G$ since Rényi entropy rapidly decreases when any component of $d$ approaches zero (indicated by the arrow).

Theoretical justification for the objective function. Next, we formally justify the use of Rényi entropy over the state-action space with the following theorem. For ease of analysis, we consider a standard episodic setting: The MDP has a finite planning horizon $H$ and stochastic dynamics $\mathbb{P}$, with the objective to maximize the cumulative reward $J(\pi;r):=\mathbb{E}_{s_{1}\sim\mu}[V^{\pi}(s_{1};r)]$, where $V^{\pi}(s;r):=\mathbb{E}_{\mathbb{P}}\left[\sum_{h=1}^{H}r_{h}(s_{h},a_{h})|s_{1}=s,\pi\right]$. We assume the reward function $r_{h}:\mathcal{S}\times\mathcal{A}\to[0,1],\forall h\in[H]$ is deterministic. A (non-stationary, stochastic) policy $\pi:\mathcal{S}\times[H]\to\Delta^{\mathcal{A}}$ specifies the probability of choosing the actions on each state and on each step. The state-action distribution induced by $\pi$ on the $h$-th step is $d_{h}^{\pi}(s,a):=\text{Pr}(s_{h}=s,a_{h}=a|s_{1}\sim\mu;\pi)$.

Theorem 3.1.

Denote by $\varpi$ a set of policies $\{\pi^{(h)}\}_{h=1}^{H}$, where $\pi^{(h)}:\mathcal{S}\times[H]\to\Delta^{\mathcal{A}}$ and $\pi^{(h)}\in\arg\max_{\pi}H_{\alpha}(d_{h}^{\pi})$. Construct a dataset $\mathcal{M}$ with $M$ trajectories, each of which is collected by first uniformly randomly choosing a policy $\pi$ from $\varpi$ and then executing the policy $\pi$. Assume

Mc(H2SAϵ)2(β+1)HAlog(SAHpϵ),M\geq c\left(\dfrac{H^{2}SA}{\epsilon}\right)^{2(\beta+1)}\frac{H}{A}\log\left(\dfrac{SAH}{p\epsilon}\right),

where $\beta=\frac{\alpha}{2(1-\alpha)}$ and $c>0$ is an absolute constant. Then, there exists a planning algorithm such that, for any reward function $r$, with probability at least $1-p$, the output policy $\hat{\pi}$ of the planning algorithm based on $\mathcal{M}$ is $3\epsilon$-optimal, i.e., $J(\pi^{*};r)-J(\hat{\pi};r)\leq 3\epsilon$, where $J(\pi^{*};r)=\max_{\pi}J(\pi;r)$.

We provide the proof in Appendix Proof of Theorem 3.1. The theorem justifies Rényi entropy with small $\alpha$ as a proper objective function for the exploration phase, since the number of samples needed to ensure successful planning is bounded when $\alpha$ is small. When $\alpha\to 1$, the bound becomes infinity. The algorithm proposed by Jin et al. (2020) requires sampling $\tilde{O}(H^{5}S^{2}A/\epsilon^{2})$ trajectories, where $\tilde{O}$ hides a logarithmic factor, which matches our bound when $\alpha\to 0$. However, they construct a policy for each state on each step, whereas we only need $H$ policies, which easily adapts to the non-tabular case.
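
To get a feel for how the bound scales with $\alpha$, the following sketch of ours simply evaluates its right-hand side for a few values of $\alpha$; the absolute constant $c$ is unspecified in the theorem, so it is set to a placeholder value of 1 here:

    import math

    def sample_bound(H, S, A, eps, p, alpha, c=1.0):
        """Right-hand side of the bound in Theorem 3.1 (c is an unknown absolute constant)."""
        beta = alpha / (2.0 * (1.0 - alpha))           # diverges as alpha -> 1
        return c * (H**2 * S * A / eps) ** (2 * (beta + 1)) * (H / A) * math.log(S * A * H / (p * eps))

    for alpha in (0.0, 0.25, 0.5):
        print(alpha, f"{sample_bound(H=10, S=20, A=5, eps=0.1, p=0.1, alpha=alpha):.3e}")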

4 Method

In this section, we develop an algorithm for the non-tabular case. In the exploration phase, we update the policy $\pi$ to maximize $H_{\alpha}(d_{\mu}^{\pi})$. We first deduce a gradient ascent update rule which is similar to vanilla policy gradient with the reward replaced by a function of the state-action distribution of the current policy. Afterwards, we estimate the state-action distribution using VAE. We also estimate a value function and update the policy using PPO, which is more sample efficient and robust than vanilla policy gradient. Then, we obtain a dataset by collecting samples from the learned policy. In the planning phase, we use a popular batch RL algorithm, BCQ, to plan for a good policy when the reward function is specified. One may also use other batch RL algorithms. We show the pseudocode of the process in Algorithm 1, the details of which are described in the following paragraphs.

Figure 3: Experiments on MultiRooms. a) Illustration of the MultiRooms environment and the state visitation distribution of the policies learned by different algorithms. b) The number of unique state-action pairs visited by running the policy for 2k steps in each iteration in the exploration phase. c) The normalized cumulative reward of the policy computed in the planning phase (i.e., normalized by the corresponding optimal cumulative reward), averaged over different reward functions. The lines indicate the average across five independent runs and the shaded areas indicate the standard deviation.

Policy gradient formulation. Let us first consider the gradient of the objective function $H_{\alpha}(d_{\mu}^{\pi})$, where the policy $\pi$ is approximated by a policy network with parameters denoted as $\theta$. We omit the dependency of $\pi$ on $\theta$. The gradient of the objective function is

$$\nabla_{\theta}H_{\alpha}(d_{\mu}^{\pi}) \propto \dfrac{\alpha}{1-\alpha}\,\mathbb{E}_{(s,a)\sim d_{\mu}^{\pi}}\Big[\nabla_{\theta}\log\pi(a|s)\Big(\tfrac{1}{1-\gamma}\langle d_{s,a}^{\pi},(d_{\mu}^{\pi})^{\alpha-1}\rangle+(d_{\mu}^{\pi}(s,a))^{\alpha-1}\Big)\Big]. \qquad (2)$$

As a special case, when $\alpha\to 1$, the Rényi entropy becomes the Shannon entropy and the gradient turns into

$$\nabla_{\theta}H_{1}(d_{\mu}^{\pi}) = \mathbb{E}_{(s,a)\sim d_{\mu}^{\pi}}\Big[\nabla_{\theta}\log\pi(a|s)\Big(\tfrac{1}{1-\gamma}\langle d_{s,a}^{\pi},-\log d_{\mu}^{\pi}\rangle-\log d_{\mu}^{\pi}(s,a)\Big)\Big]. \qquad (3)$$

Due to space limits, the derivation can be found in Appendix Policy gradient of state-action space entropy. There are two terms in the gradients. The first term equals $\mathbb{E}_{(s,a)\sim d_{\mu}^{\pi}}\left[\nabla_{\theta}\log\pi(a|s)Q^{\pi}(s,a;r)\right]$ with the reward $r(s,a)=(d_{\mu}^{\pi}(s,a))^{\alpha-1}$ or $r(s,a)=-\log d_{\mu}^{\pi}(s,a)$, which resembles the policy gradient (of cumulative reward) for a standard MDP. This term encourages the policy to choose the actions that lead to rarely visited state-action pairs. In a similar way, the second term resembles the policy gradient of the instant reward $\mathbb{E}_{(s,a)\sim d_{\mu}^{\pi}}\left[\nabla_{\theta}\log\pi(a|s)r(s,a)\right]$ and encourages the agent to choose the actions that are rarely selected on the current step. The second term in (3) equals $\nabla_{\theta}\mathbb{E}_{s\sim d_{\mu}^{\pi}}[H_{1}(\pi(\cdot|s))]$. For stability, we also replace the second term in (2) with $\nabla_{\theta}\mathbb{E}_{s\sim d_{\mu}^{\pi}}[H_{1}(\pi(\cdot|s))]$, since we found empirically that this term leads to more stable performance as it does not suffer from the high variance induced by the estimation of $d_{\mu}^{\pi}$ (see also Appendix Practical implementation). Accordingly, we update the policy based on the following formula, where $\eta>0$ is a hyperparameter:

$$\nabla_{\theta}H_{\alpha}(d_{\mu}^{\pi}) \approx \begin{cases}\nabla_{\theta}J(\theta;r=(d_{\mu}^{\pi})^{\alpha-1}) & 0<\alpha<1\\ \nabla_{\theta}J(\theta;r=-\log d_{\mu}^{\pi}) & \alpha=1\end{cases} \;+\; \eta\,\nabla_{\theta}\mathbb{E}_{s\sim d_{\mu}^{\pi}}[H_{1}(\pi(\cdot|s))]. \qquad (4)$$

Discussion. Islam et al. (2019) motivate the agent to explore by maximizing the Shannon entropy over the state space, resulting in an intrinsic reward $r(s)=-\log d_{\mu}^{\pi}(s)$ which is similar to ours when $\alpha\to 1$. Bellemare et al. (2016) use an intrinsic reward $r(s,a)=\hat{N}(s,a)^{-1/2}$ where $\hat{N}(s,a)$ is an estimation of the visit count of $(s,a)$. Our algorithm with $\alpha=1/2$ induces a similar reward.
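
In practice, the intrinsic reward in (4) is computed from the density model's estimate of $d_{\mu}^{\pi}$. A minimal sketch of ours follows, where log_density is assumed to be the per-sample ELBO returned by the VAE and the clipping floor is our own numerical-stability assumption:

    import numpy as np

    def intrinsic_reward(log_density, alpha, floor=-20.0):
        """Intrinsic reward from an estimated log-density of the current state-action distribution.

        log_density: array of log P_phi(s, a) values (e.g., the per-sample ELBO)
        alpha:       Renyi parameter; alpha = 1 recovers the Shannon-entropy reward
        """
        log_density = np.maximum(np.asarray(log_density, dtype=float), floor)  # clip tiny densities
        if np.isclose(alpha, 1.0):
            return -log_density                          # r = -log d(s, a)
        return np.exp((alpha - 1.0) * log_density)       # r = d(s, a)^(alpha - 1)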

Algorithm 1 Maximizing the state-action space Rényi entropy for the reward-free RL framework
1: Input: The number of iterations in the exploration phase $T$; the size of the dataset $M$; the parameter of Rényi entropy $\alpha$.
2: Initialize: Replay buffer $\mathcal{D}=\emptyset$ that stores the samples from the last $n$ iterations; density model $P_{\phi}$ (VAE); value function $V_{\psi}$; policy network $\pi_{\theta}$.
3: $\triangleright$ Exploration phase (MaxRenyi)
4: for $i=1$ to $T$ do
5:     Sample $\mathcal{D}_{i}$ using $\pi_{\theta}$ and add it to $\mathcal{D}$
6:     Update $P_{\phi}$ to estimate the state-action distribution based on $\mathcal{D}_{i}$
7:     Update $V_{\psi}$ to minimize $\mathcal{L}_{\psi}(\mathcal{D})$ defined in (5)
8:     Update $\pi_{\theta}$ to minimize $\mathcal{L}_{\theta}(\mathcal{D}_{i})$ defined in (6)
9: $\triangleright$ Collect the dataset
10: Roll out the exploratory policy $\pi_{\theta}$, collect $M$ trajectories and store them in $\mathcal{M}$
11: $\triangleright$ Planning phase
12: Reveal a reward function $r:\mathcal{S}\times\mathcal{A}\to\mathbb{R}$
13: Construct a labeled dataset $\overline{\mathcal{M}}:=\{(s,a,r(s,a))\}$
14: Plan for a policy $\pi_{r}=\text{BCQ}(\overline{\mathcal{M}})$

Sample collection. To estimate $d_{\mu}^{\pi}$ for calculating the underlying reward, we collect samples in the following way (cf. Line 5 in Algorithm 1): In the $i$-th iteration, we sample $m$ trajectories. In each trajectory, we terminate the rollout on each step with probability $1-\gamma$. In this way, we obtain a set of trajectories $\mathcal{D}_{i}=\{(s_{j1},a_{j1}),\cdots,(s_{jN_{j}},a_{jN_{j}})\}_{j=1}^{m}$ where $N_{j}$ is the length of the $j$-th trajectory. Then, we can use VAE to estimate $d_{\mu}^{\pi}$ based on $\mathcal{D}_{i}$, i.e., using ELBO as the density estimation (cf. Line 6 in Algorithm 1).
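
A sketch of this sampling scheme (ours; env and policy are assumed Gym-style interfaces), where each rollout is truncated on every step with probability $1-\gamma$ so that the visited pairs are approximately distributed according to $d_{\mu}^{\pi}$:

    import random

    def collect_trajectories(env, policy, m, gamma):
        """Collect m trajectories, terminating each step with probability 1 - gamma."""
        batch = []
        for _ in range(m):
            trajectory, s = [], env.reset()
            while True:
                a = policy(s)                      # sample an action from pi_theta(. | s)
                trajectory.append((s, a))
                s, _, done, _ = env.step(a)        # reward-free: the reward is ignored
                if done or random.random() < 1.0 - gamma:
                    break                          # geometric truncation with success prob 1 - gamma
            batch.append(trajectory)
        return batch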

Value function. Instead of performing vanilla policy gradient, we update the policy using PPO, which is more robust and sample efficient. However, the underlying reward function changes across iterations, which makes it hard to incrementally learn the value function that is used to reduce variance in PPO. We propose to train a value function network using relabeled off-policy data. In the $i$-th iteration, we maintain a replay buffer $\mathcal{D}=\mathcal{D}_{i}\cup\mathcal{D}_{i-1}\cup\cdots\cup\mathcal{D}_{i-n+1}$ that stores the trajectories of the last $n$ iterations (cf. Line 5 in Algorithm 1). Next, we calculate the reward for each state-action pair in $\mathcal{D}$ with the latest density estimator $P_{\phi}$, i.e., we assign $r=(P_{\phi}(s,a))^{\alpha-1}$ or $r=-\log P_{\phi}(s,a)$ for each $(s,a)\in\mathcal{D}$. Based on these rewards, we can estimate the target value $V^{\text{targ}}(s)$ for each state $s\in\mathcal{D}$ using the truncated TD($\lambda$) estimator (Sutton and Barto 2018), which balances bias and variance (see the details in Appendix Value function estimator). Then, we train the value function network $V_{\psi}$ (where $\psi$ is the parameter) to minimize the mean squared error w.r.t. the target values:

$$\mathcal{L}_{\psi}(\mathcal{D})=\mathbb{E}_{s\sim\mathcal{D}}\left[\left(V_{\psi}(s)-V^{\text{targ}}(s)\right)^{2}\right]. \qquad (5)$$
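
As an illustration of how the targets $V^{\text{targ}}(s)$ in (5) can be formed from relabeled rewards, here is a generic TD($\lambda$) return computation for a single trajectory (ours; the exact truncation used by the paper is described in its appendix, so this is only an assumed variant):

    import numpy as np

    def td_lambda_targets(rewards, values, gamma, lam):
        """Lambda-return targets for one trajectory.

        rewards: relabeled intrinsic rewards r_t = (P_phi(s_t, a_t))^(alpha-1) or -log P_phi(s_t, a_t)
        values:  V_psi(s_t) for t = 0..T (one extra bootstrap value for the state after the last step)
        """
        T = len(rewards)
        targets = np.zeros(T)
        next_target = values[T]                    # bootstrap from the value after the final step
        for t in reversed(range(T)):
            # Interpolate between the one-step TD target and the longer lambda-return
            targets[t] = rewards[t] + gamma * ((1 - lam) * values[t + 1] + lam * next_target)
            next_target = targets[t]
        return targets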

Policy network. In each iteration, we update the policy network to maximize the following objective function that is used in PPO:

$$-\mathcal{L}_{\theta}(\mathcal{D}_{i})= \mathbb{E}_{s,a\sim\mathcal{D}_{i}}\left[\min\left(\tfrac{\pi_{\theta}(a|s)}{\pi_{\text{old}}(a|s)}\hat{A}(s,a),\;g(\varepsilon,\hat{A}(s,a))\right)\right] + \eta\,\mathbb{E}_{s\sim\mathcal{D}_{i}}[H_{1}(\pi_{\theta}(\cdot|s))], \qquad (6)$$

where $g(\varepsilon,A)=\begin{cases}(1+\varepsilon)A & A\geq 0\\ (1-\varepsilon)A & A<0\end{cases}$ and $\hat{A}(s,a)$ is the advantage estimated using GAE (Schulman et al. 2016) and the learned value function $V_{\psi}$.
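
A minimal PyTorch sketch of ours for the clipped surrogate in (6); logp, logp_old, adv and entropy are assumed to be precomputed batch tensors:

    import torch

    def ppo_loss(logp, logp_old, adv, entropy, clip_eps=0.2, eta=0.01):
        """Negative of the objective in Eq. (6): clipped surrogate plus an entropy bonus."""
        ratio = torch.exp(logp - logp_old)                          # pi_theta(a|s) / pi_old(a|s)
        clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
        surrogate = torch.min(ratio * adv, clipped)                 # equals min(ratio * A, g(eps, A))
        return -(surrogate.mean() + eta * entropy.mean())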

Figure 4: Experiments on Montezuma’s Revenge. a) The episodic return under the extrinsic game reward over the course of training in the exploration phase. b) The trajectory of the agent executing the policy trained with the corresponding reward function in the planning phase. Best viewed in color.

5 Experiments

We first conduct experiments on the MultiRooms environment from minigrid (Chevalier-Boisvert, Willems, and Pal 2018) to compare the performance of MaxRenyi with several baseline exploration algorithms in the exploration phase and the downstream planning phase. We also compare the performance of MaxRenyi with different values of $\alpha$, and show the advantage of using a PPO-based algorithm and maximizing the entropy over the state-action space by comparing with ablated versions. Then, we conduct experiments on a set of Atari (with image-based observations) (Machado et al. 2018) and Mujoco (Todorov, Erez, and Tassa 2012) tasks to show that our algorithm outperforms the baselines in complex environments for the reward-free RL framework. We also provide detailed results to investigate why our algorithm succeeds. More experiments and the detailed experiment settings/hyperparameters can be found in Appendix Experiment settings, hyperparameters and more experiments.

Experiments on MultiRooms. The observation in MultiRooms is the first-person perspective of the agent, which is high-dimensional and partially observable, and the actions are turning left/right, moving forward and opening the door (cf. Figure 3a above). This environment is hard to explore for standard RL algorithms due to sparse reward. In the exploration phase, the agent has to navigate and open the doors to explore through a series of rooms. In the planning phase, we randomly assign a goal state-action pair, reward the agent upon reaching this state-action, and then train a policy with this reward function. We compare our exploration algorithm MaxRenyi with ICM (Pathak et al. 2017) and RND (Burda et al. 2019b) (which use different prediction errors as the intrinsic reward), MaxEnt (Hazan et al. 2019) (which maximizes the state space entropy), and an ablated version of our algorithm, MaxRenyi(VPG), which updates the policy directly by vanilla policy gradient using (2) and (3).

For the exploration phase, we show the performance of different algorithms in Figure 3b. First, we see that the variants of MaxRenyi perform better than the baseline algorithms. Specifically, MaxRenyi is more stable than ICM and RND, which explore well at the start but later degenerate when the agent becomes familiar with all the states. Second, we observe that MaxRenyi performs better than MaxRenyi(VPG) for different values of $\alpha$, indicating that MaxRenyi benefits from adopting PPO, which reduces variance with a value function and updates the policy conservatively. Third, MaxRenyi with $\alpha=0.5$ performs better than with $\alpha=1$, which is consistent with our theory. However, we observe that MaxRenyi with $\alpha=0$ is more likely to be unstable empirically and results in slightly worse performance. Finally, we show the visitation frequency of the exploratory policies resulting from different algorithms in Figure 3a. We see that MaxRenyi visits the states more uniformly compared with the baseline algorithms, especially the hard-to-reach states such as the corners of the rooms.

For the planning phase, we collect datasets of different sizes by executing different exploratory policies, and use the datasets to compute policies with different reward functions using BCQ. We show the performance of the resulting policies in Figure 3c. First, we observe that the datasets generated by running random policies do not lead to successful planning, indicating the importance of learning a good exploratory policy in this framework. Second, a dataset with only 8k samples leads to successful planning (with a normalized cumulative reward larger than 0.8) using MaxRenyi, whereas a dataset with 16k samples is needed to succeed in the planning phase when using ICM, RND or MaxEnt. This illustrates that MaxRenyi leads to better performance in the planning phase (i.e., attains good policies with fewer samples) than the previous exploration algorithms.

Further, we compare MaxRenyi, which maximizes the entropy over the state-action space, with two ablated versions: “State + bonus”, which maximizes the entropy over the state space with an entropy bonus (i.e., using $P_{\phi}$ to estimate the state distribution in Algorithm 1), and “State”, which maximizes the entropy over the state space (i.e., additionally setting $\eta=0$ in (6)). Notice that the reward-free RL framework requires that a good policy is obtained for arbitrary reward functions in the planning phase. Therefore, we show the lowest normalized cumulative reward of the planned policies across different reward functions for the algorithms in Table 1. We see that maximizing the entropy over the state-action space results in better performance for this framework.

M =            2k      4k      8k      16k
MaxRenyi       0.14    0.52    0.87    0.91
State + bonus  0.14    0.48    0.78    0.87
State          0.07    0.15    0.41    0.57
Table 1: The lowest normalized cumulative reward of the planned policies across different reward functions, averaged over five runs. We use $\alpha=0.5$ in the algorithms.

Experiments on Atari and Mujoco. In the exploration phase, we run different exploration algorithms in the reward-free environment of Atari (Mujoco) for 200M (10M) steps and collect a dataset with 100M (5M) samples by executing the learned policy. In the planning phase, a policy is computed offline to maximize the performance under the extrinsic game reward using the batch RL algorithm based on the dataset. We show the performance of the planned policies in Table 2. We can see that MaxRenyi performs well on a range of tasks with high-dimensional observations or continuous state-actions for the reward-free RL framework.

             MaxRenyi   RND     ICM     Random
Montezuma    8,100      4,740   380     0
Venture      900        840     240     0
Gravitar     1,850      1,890   940     340
PrivateEye   6,820      3,040   100     100
Solaris      2,192      1,416   1,844   1,508
Hopper       973        819     704     78
HalfCheetah  1,466      1,083   700     54
Table 2: The performance of the planned policies under the reward-free RL framework corresponding to different exploratory algorithms, averaged over five runs. Random indicates using a random policy as the exploratory policy. We use $\alpha=0.5$ in MaxRenyi.

We also provide detailed results for Montezuma’s Revenge and Hopper. For Montezuma’s Revenge, we show the result for the exploration phase in Figure 4a. We observe that although MaxRenyi learns an exploratory policy in a reward-free environment, it achieves reasonable performance under the extrinsic game reward and performs better than RND and ICM. We also provide the number of visited rooms over the course of training in Appendix Experiment settings, hyperparameters and more experiments, which demonstrates that our algorithm performs better in terms of state space coverage as well. In the planning phase, we design two different sparse reward functions that only reward the agent if it goes through Room 3 or Room 7 (to the next room). We show the trajectories of the policies planned with the two reward functions in red and blue respectively in Figure 4b. We see that although the reward functions are sparse, the agent chooses the correct path (e.g., opening the correct door in Room 1 with the only key) and successfully reaches the specified room. This indicates that our algorithm generates good policies for different reward functions based on offline planning in complex environments. For Hopper, we show the t-SNE (Maaten and Hinton 2008) plot of the state-actions randomly sampled from the trajectories generated by different policies in Figure 5. We can see that MaxRenyi generates state-actions that are distributed more uniformly than those of RND and overlap those from the expert policy. This indicates that the good performance of our algorithm in the planning phase results from the better coverage of our exploratory policy.

Figure 5: t-SNE plot of the state-actions randomly sampled from the trajectories generated by different policies in Hopper. The random policy chooses actions randomly. The expert policy is trained using PPO for 2M steps with the extrinsic reward. The RND and MaxRenyi policies are trained using the corresponding algorithms without extrinsic reward.

6 Conclusion

In this paper, we consider a reward-free RL framework, which is useful when there are multiple reward functions of interest or when we design reward functions offline. In this framework, an exploratory policy is learned by interacting with a reward-free environment in the exploration phase and generates a dataset. In the planning phase, when the reward function is specified, a policy is computed offline to maximize the corresponding cumulative reward using the batch RL algorithm based on the dataset. We propose a novel objective function, the Rényi entropy over the state-action space, for the exploration phase. We theoretically justify this objective and design a practical algorithm to optimize for this objective. In the experiments, we show that our exploration algorithm is effective under this framework, while being more sample efficient and more robust than the previous exploration algorithms.

Acknowledgement

Chuheng Zhang and Jian Li are supported in part by the National Natural Science Foundation of China Grant 61822203, 61772297, 61632016, 61761146003, and the Zhongguancun Haihua Institute for Frontier Information Technology, Turing AI Institute of Nanjing and Xi’an Institute for Interdisciplinary Information Core Technology. Yuanying Cai and Longbo Huang are supported in part by the National Natural Science Foundation of China Grant 61672316, and the Zhongguancun Haihua Institute for Frontier Information Technology, China.

References

  • Abdolmaleki et al. (2018) Abdolmaleki, A.; Springenberg, J. T.; Tassa, Y.; Munos, R.; Heess, N.; and Riedmiller, M. 2018. Maximum a posteriori policy optimisation. In Proceedings of the 6th International Conference on Learning Representations.
  • Achiam and Sastry (2016) Achiam, J.; and Sastry, S. 2016. Surprise-based intrinsic motivation for deep reinforcement learning. In Deep RL Workshop, Advances in Neural Information Processing Systems.
  • Agarwal et al. (2019) Agarwal, A.; Kakade, S. M.; Lee, J. D.; and Mahajan, G. 2019. Optimality and approximation with policy gradient methods in markov decision processes. arXiv preprint arXiv:1908.00261 .
  • Auer, Cesa-Bianchi, and Fischer (2002) Auer, P.; Cesa-Bianchi, N.; and Fischer, P. 2002. Finite-time analysis of the multiarmed bandit problem. Machine Learning 47(2-3): 235–256.
  • Bellemare et al. (2016) Bellemare, M.; Srinivasan, S.; Ostrovski, G.; Schaul, T.; Saxton, D.; and Munos, R. 2016. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems, 1471–1479.
  • Brockman et al. (2016) Brockman, G.; Cheung, V.; Pettersson, L.; Schneider, J.; Schulman, J.; Tang, J.; and Zaremba, W. 2016. OpenAI Gym.
  • Bromiley, Thacker, and Bouhova-Thacker (2004) Bromiley, P.; Thacker, N.; and Bouhova-Thacker, E. 2004. Shannon entropy, Renyi entropy, and information. Statistics and Inf. Series (2004-004) .
  • Burda et al. (2019a) Burda, Y.; Edwards, H.; Pathak, D.; Storkey, A.; Darrell, T.; and Efros, A. A. 2019a. Large-scale study of curiosity-driven learning. In Proceedings of the 7th International Conference on Learning Representations.
  • Burda et al. (2019b) Burda, Y.; Edwards, H.; Storkey, A.; and Klimov, O. 2019b. Exploration by random network distillation. In Proceedings of the 7th International Conference on Learning Representations.
  • Chevalier-Boisvert, Willems, and Pal (2018) Chevalier-Boisvert, M.; Willems, L.; and Pal, S. 2018. Minimalistic Gridworld Environment for OpenAI Gym. https://github.com/maximecb/gym-minigrid.
  • Conti et al. (2018) Conti, E.; Madhavan, V.; Such, F. P.; Lehman, J.; Stanley, K.; and Clune, J. 2018. Improving exploration in evolution strategies for deep reinforcement learning via a population of novelty-seeking agents. In Advances in Neural Information Processing Systems, 5027–5038.
  • Dann, Lattimore, and Brunskill (2017) Dann, C.; Lattimore, T.; and Brunskill, E. 2017. Unifying PAC and regret: Uniform PAC bounds for episodic reinforcement learning. In Advances in Neural Information Processing Systems, 5713–5723.
  • de Abril and Kanai (2018) de Abril, I. M.; and Kanai, R. 2018. Curiosity-driven reinforcement learning with homeostatic regulation. In Proceedings of the International Joint Conference on Neural Networks (IJCNN), 1–6. IEEE.
  • Duan et al. (2016) Duan, Y.; Schulman, J.; Chen, X.; Bartlett, P. L.; Sutskever, I.; and Abbeel, P. 2016. RL2: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779 .
  • Ecoffet et al. (2019) Ecoffet, A.; Huizinga, J.; Lehman, J.; Stanley, K. O.; and Clune, J. 2019. Go-explore: a new approach for hard-exploration problems. arXiv preprint arXiv:1901.10995 .
  • Eysenbach et al. (2019) Eysenbach, B.; Gupta, A.; Ibarz, J.; and Levine, S. 2019. Diversity is all you need: Learning skills without a reward function. In Proceedings of the 7th International Conference on Learning Represention.
  • Finn, Abbeel, and Levine (2017) Finn, C.; Abbeel, P.; and Levine, S. 2017. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70, 1126–1135. JMLR.
  • Flajolet, Gardy, and Thimonier (1992) Flajolet, P.; Gardy, D.; and Thimonier, L. 1992. Birthday paradox, coupon collectors, caching algorithms and self-organizing search. Discrete Applied Mathematics 39(3): 207–229.
  • Fujimoto et al. (2019) Fujimoto, S.; Conti, E.; Ghavamzadeh, M.; and Pineau, J. 2019. Benchmarking Batch Deep Reinforcement Learning Algorithms. In Deep RL Workshop, Advances in Neural Information Processing Systems.
  • Fujimoto, Meger, and Precup (2019) Fujimoto, S.; Meger, D.; and Precup, D. 2019. Off-Policy Deep Reinforcement Learning without Exploration. In Proceedings of the 36th International Conference on Machine Learning, volume 97, 2052–2062. JMLR.
  • Gupta et al. (2018) Gupta, A.; Eysenbach, B.; Finn, C.; and Levine, S. 2018. Unsupervised meta-learning for reinforcement learning. arXiv preprint arXiv:1806.04640 .
  • Haarnoja et al. (2017) Haarnoja, T.; Tang, H.; Abbeel, P.; and Levine, S. 2017. Reinforcement learning with deep energy-based policies. In Proceedings of the 34th International Conference on Machine Learning, volume 70, 1352–1361. JMLR.
  • Haarnoja et al. (2018) Haarnoja, T.; Zhou, A.; Abbeel, P.; and Levine, S. 2018. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the 35th International Conference on Machine Learning, volume 80, 1861–1870. JMLR.
  • Hazan et al. (2019) Hazan, E.; Kakade, S.; Singh, K.; and Van Soest, A. 2019. Provably Efficient Maximum Entropy Exploration. In Proceedings of the 36th International Conference on Machine Learning, volume 97, 2681–2691. JMLR.
  • Houthooft et al. (2016) Houthooft, R.; Chen, X.; Duan, Y.; Schulman, J.; De Turck, F.; and Abbeel, P. 2016. Vime: Variational information maximizing exploration. In Advances in Neural Information Processing Systems, 1109–1117.
  • Islam et al. (2019) Islam, R.; Seraj, R.; Bacon, P.-L.; and Precup, D. 2019. Entropy Regularization with Discounted Future State Distribution in Policy Gradient Methods. In Optimization Foundations of RL Workshop, Advances in Neural Information Processing Systems.
  • Jin et al. (2020) Jin, C.; Krishnamurthy, A.; Simchowitz, M.; and Yu, T. 2020. Reward-Free Exploration for Reinforcement Learning. In Proceedings of the 37th International Conference on Machine Learning. JMLR.
  • Kingma and Welling (2014) Kingma, D. P.; and Welling, M. 2014. Auto-encoding variational bayes. In Proceedings of the 2nd International Conference on Learning Representations.
  • Lai and Robbins (1985) Lai, T. L.; and Robbins, H. 1985. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics 6(1): 4–22.
  • Lange, Gabel, and Riedmiller (2012) Lange, S.; Gabel, T.; and Riedmiller, M. 2012. Batch reinforcement learning. In Reinforcement learning, 45–73. Springer.
  • Lehman and Stanley (2011) Lehman, J.; and Stanley, K. O. 2011. Novelty search and the problem with objectives. In Genetic programming theory and practice IX, 37–56. Springer.
  • Maaten and Hinton (2008) Maaten, L. v. d.; and Hinton, G. 2008. Visualizing data using t-SNE. Journal of machine learning research 9(Nov): 2579–2605.
  • Machado et al. (2018) Machado, M. C.; Bellemare, M. G.; Talvitie, E.; Veness, J.; Hausknecht, M. J.; and Bowling, M. 2018. Revisiting the Arcade Learning Environment: Evaluation Protocols and Open Problems for General Agents. Journal of Artificial Intelligence Research 61: 523–562.
  • Misra et al. (2019) Misra, D.; Henaff, M.; Krishnamurthy, A.; and Langford, J. 2019. Kinematic State Abstraction and Provably Efficient Rich-Observation Reinforcement Learning. arXiv preprint arXiv:1911.05815 .
  • Mohamed and Rezende (2015) Mohamed, S.; and Rezende, D. J. 2015. Variational information maximisation for intrinsically motivated reinforcement learning. In Advances in Neural Information Processing Systems, 2125–2133.
  • Nachum et al. (2017) Nachum, O.; Norouzi, M.; Xu, K.; and Schuurmans, D. 2017. Bridging the gap between value and policy based reinforcement learning. In Advances in Neural Information Processing Systems, 2775–2785.
  • Nagabandi et al. (2019) Nagabandi, A.; Clavera, I.; Liu, S.; Fearing, R. S.; Abbeel, P.; Levine, S.; and Finn, C. 2019. Learning to adapt in dynamic, real-world environments through meta-reinforcement learning. In Proceedings of the 7th International Conference on Learning Representations.
  • Ostrovski et al. (2017) Ostrovski, G.; Bellemare, M. G.; van den Oord, A.; and Munos, R. 2017. Count-based exploration with neural density models. In Proceedings of the 34th International Conference on Machine Learning, volume 70, 2721–2730. JMLR.
  • Pathak et al. (2017) Pathak, D.; Agrawal, P.; Efros, A. A.; and Darrell, T. 2017. Curiosity-driven exploration by self-supervised prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 16–17.
  • Puterman (2014) Puterman, M. L. 2014. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons.
  • Sanei, Borzadaran, and Amini (2016) Sanei, M.; Borzadaran, G. R. M.; and Amini, M. 2016. Renyi entropy in continuous case is not the limit of discrete case. Mathematical Sciences and Applications E-Notes 4.
  • Schmidhuber (1991) Schmidhuber, J. 1991. A possibility for implementing curiosity and boredom in model-building neural controllers. In Proceedings of the international conference on simulation of adaptive behavior: From animals to animats, 222–227.
  • Schulman, Chen, and Abbeel (2017) Schulman, J.; Chen, X.; and Abbeel, P. 2017. Equivalence between policy gradients and soft q-learning. arXiv preprint arXiv:1704.06440 .
  • Schulman et al. (2016) Schulman, J.; Moritz, P.; Levine, S.; Jordan, M.; and Abbeel, P. 2016. High-dimensional continuous control using generalized advantage estimation. In Proceedings of the 4th International Conference of Learning Representations.
  • Schulman et al. (2017) Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; and Klimov, O. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 .
  • Silver (2015) Silver, D. 2015. Lecture 2: Markov Decision Processes. UCL. Retrieved from www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/MDP.pdf.
  • Strehl and Littman (2008) Strehl, A. L.; and Littman, M. L. 2008. An analysis of model-based interval estimation for Markov decision processes. Journal of Computer and System Sciences 74(8): 1309–1331.
  • Sutton and Barto (2018) Sutton, R. S.; and Barto, A. G. 2018. Reinforcement Learning: An Introduction. MIT press Cambridge.
  • Todorov, Erez, and Tassa (2012) Todorov, E.; Erez, T.; and Tassa, Y. 2012. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, 5026–5033. IEEE.
  • Wang et al. (2016) Wang, J. X.; Kurth-Nelson, Z.; Tirumala, D.; Soyer, H.; Leibo, J. Z.; Munos, R.; Blundell, C.; Kumaran, D.; and Botvinick, M. 2016. Learning to reinforcement learn. arXiv preprint arXiv:1611.05763 .
  • Williams (1992) Williams, R. J. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8(3-4): 229–256.

The toy example in Figure 1

In this example, the initial state is fixed to be $s_{1}$. Due to the simplicity of this toy example, we can solve for the policy that maximizes the Rényi entropy of the state-action distribution with $\gamma\to 1$ and $\alpha=0.5$. The optimal policy is

$$\pi=\begin{bmatrix}0.321&0.276&0.294&0.401&0.381\\ 0.679&0.724&0.706&0.599&0.619\end{bmatrix},$$

where $\pi_{i,j}$ represents the probability of choosing $a_{i}$ on $s_{j}$. The corresponding state-action distribution is

$$d_{\mu}^{\pi}=\begin{bmatrix}0.107&0.062&0.047&0.045&0.065\\ 0.226&0.162&0.113&0.067&0.106\end{bmatrix},$$

where the $(i,j)$-th element represents the probability density of the state-action distribution on $(s_{j},a_{i})$. Using equation (1), we obtain $G(d_{\mu}^{\pi})=43.14$, which is the expected number of samples that must be collected from this distribution before the dataset contains all the state-action pairs.

An example to illustrate the advantage of using Rényi entropy

Although we did not theoretically justify in the main text that Rényi entropy serves as a better surrogate for $G$ than Shannon entropy, here we illustrate that this may hold in general. Specifically, we try to show that the corresponding $G$ of the policy that maximizes Rényi entropy with small $\alpha$ is always smaller than that of the policy that maximizes Shannon entropy. Notice that, when the dynamics is known to be deterministic, $G$ indicates how many samples are needed in expectation to ensure successful planning. In the stochastic setting, $G$ also serves as a good indicator since it is a lower bound on the number of samples needed for successful planning.

To show that Rényi entropy is a good surrogate function, we use brute-force search to check whether there exists a CMP on which Shannon entropy is better than Rényi entropy in terms of $G$. As an illustration, we only consider CMPs with two states and two actions. Specifically, for this family of CMPs, there are 5 free parameters to be searched: 4 parameters for the transition dynamics and 1 parameter for the initial state distribution. We set the search step size to 0.1 for each parameter and use $\alpha=0.1$, $\gamma=0.9$. The result shows that, out of the $10^{5}$ CMPs searched, there does not exist any CMP on which Shannon entropy is better than Rényi entropy in terms of $G$.

To further investigate, we consider a specific CMP. The transition matrix $P\in\mathbb{R}^{SA\times S}$ of this CMP is

$$P=\begin{bmatrix}1.0&0.0\\ 0.9&0.1\\ 1.0&0.0\\ 0.4&0.6\end{bmatrix},$$

where the rows correspond to $(s_{1},a_{1}),(s_{1},a_{2}),(s_{2},a_{1}),(s_{2},a_{2})$ respectively and the columns correspond to the next states ($s_{1}$ and $s_{2}$ respectively). The initial state distribution of this CMP is

$$\mu=\begin{bmatrix}1.0&0.0\end{bmatrix}.$$

For this CMP, the $G$ value of the policy that maximizes Shannon entropy is 88, while the values are only 39 and 47 for the policies that maximize Rényi entropy with $\alpha=0.1$ and $\alpha=0.5$ respectively. The optimal $G$ value is 32. This indicates that we may need 2x more samples to accurately estimate the dynamics of the CMP when maximizing Shannon entropy. We observe that $s_{2}$ is much harder to reach than $s_{1}$ on this CMP: the initial state is always $s_{1}$ and taking action $a_{1}$ always leads back to $s_{1}$. In order to visit $s_{2}$ more, policies should take $a_{2}$ with a high probability on $s_{1}$. In fact, the policy that maximizes Shannon entropy takes $a_{2}$ on $s_{1}$ with a probability of only 0.58, while the policies that maximize Rényi entropy with $\alpha=0.1$ and $\alpha=0.5$ take $a_{2}$ on $s_{1}$ with probabilities 0.71 and 0.64 respectively.

This example illustrates a property that holds widely across CMPs: a good objective should encourage the policy to visit the hard-to-reach state-action pairs as frequently as possible, and Rényi entropy does a better job than Shannon entropy in this respect.

Proof of Theorem 3.1

We provide Theorem 3.1 to justify our objective that maximizes the entropy over the state-action space in the exploration phase for the reward-free RL framework. In the following analysis, we consider the standard episodic and tabular case as follows: The MDP is defined as \langle\mathcal{S},\mathcal{A},H,\mathbb{P},r,\mu\rangle, where \mathcal{S} is the state space with S=|\mathcal{S}|, \mathcal{A} is the action space with A=|\mathcal{A}|, H is the length of each episode, \mathbb{P}=\{\mathbb{P}_{h}:h\in[H]\} is the set of unknown transition matrices, r=\{r_{h}:h\in[H]\} is the set of reward functions and \mu is the unknown initial state distribution. We denote by \mathbb{P}_{h}(s^{\prime}|s,a) the transition probability from (s,a) to s^{\prime} on the h-th step. We assume that the instant reward r_{h}(s,a)\in[0,1] for taking action a on state s on the h-th step is deterministic. A policy \pi:\mathcal{S}\times[H]\to\Delta^{\mathcal{A}} specifies the probability of choosing the actions on each state and on each step. We denote the set of all such policies as \Pi.

The agent with policy \pi interacts with the MDP as follows: At the start of each episode, an initial state s_{1} is sampled from the distribution \mu. Next, at each step h\in[H], the agent first observes the state s_{h}\in\mathcal{S} and then selects an action a_{h} from the distribution \pi(s_{h},h). The environment returns the reward r_{h}(s_{h},a_{h})\in[0,1] and transits to the next state s_{h+1}\in\mathcal{S} according to the probability \mathbb{P}_{h}(\cdot|s_{h},a_{h}). The state-action distribution induced by \pi on the h-th step is d_{h}^{\pi}(s,a):=\text{Pr}(s_{h}=s,a_{h}=a|s_{1}\sim\mu;\pi).

The state value function for a policy \pi is defined as V_{h}^{\pi}(s;r):=\mathbb{E}\left[\sum_{t=h}^{H}r_{t}(s_{t},a_{t})|s_{h}=s,\pi\right]. The objective is to find a policy \pi that maximizes the cumulative reward J(\pi;r):=\mathbb{E}_{s_{1}\sim\mu}[V_{1}^{\pi}(s_{1};r)].

The proof goes through in a similar way to that of Jin et al. (2020).

Algorithm 2 The planning algorithm

Input: A dataset of transitions \mathcal{M}; the reward function r; the level of accuracy \epsilon.

1: for all (s,a,s^{\prime},h)\in\mathcal{S}\times\mathcal{A}\times\mathcal{S}\times[H] do
2:     N_{h}(s,a,s^{\prime})\leftarrow\sum_{(s_{h},a_{h},s_{h+1})\in\mathcal{M}}\mathbb{I}[s_{h}=s,a_{h}=a,s_{h+1}=s^{\prime}]
3:     N_{h}(s,a)\leftarrow\sum_{s^{\prime}}N_{h}(s,a,s^{\prime})
4:     \hat{\mathbb{P}}_{h}(s^{\prime}|s,a)\leftarrow N_{h}(s,a,s^{\prime})/N_{h}(s,a)
5: \hat{\pi}\leftarrow\text{APPROXIMATE-MDP-SOLVER}(\hat{\mathbb{P}},r,\epsilon)
6: return \hat{\pi}
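For concreteness, lines 1-4 of Algorithm 2 can be transcribed directly. The sketch below is ours for illustration and assumes the dataset is stored as integer-indexed tuples (h, s, a, s'); the pseudocode leaves unvisited (s,a,h) triples implicit, so we default them to a uniform estimate.

```python
import numpy as np

def estimate_transitions(dataset, S, A, H):
    """Empirical transition tensor P_hat[h, s, a, s'] from transition tuples
    (h, s, a, s_next), mirroring lines 1-4 of Algorithm 2."""
    counts = np.zeros((H, S, A, S))
    for h, s, a, s_next in dataset:
        counts[h, s, a, s_next] += 1
    totals = counts.sum(axis=-1, keepdims=True)
    # Divide counts by totals where data exists; fall back to uniform otherwise.
    return np.where(totals > 0, counts / np.maximum(totals, 1), 1.0 / S)
```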
Proof of Theorem 3.1.

Given the dataset \mathcal{M} specified in the theorem, a reward function r and the level of accuracy \epsilon, we use the planning algorithm shown in Algorithm 2 to obtain a near-optimal policy \hat{\pi}. The planning algorithm first estimates a transition matrix \hat{\mathbb{P}} based on \mathcal{M} and then solves for a policy based on the estimated transition matrix \hat{\mathbb{P}} and the reward function r.

Define \hat{J}(\pi;r):=\mathbb{E}_{s_{1}\sim\mu}[\hat{V}^{\pi}_{1}(s_{1};r)], where \hat{V} is the value function on the MDP with the transition dynamics \hat{\mathbb{P}}. We first decompose the error into the following terms:

J(\pi^{*};r)-J(\hat{\pi};r)\leq\underbrace{|\hat{J}(\pi^{*};r)-J(\pi^{*};r)|}_{\text{evaluation error 1}}+\underbrace{(\hat{J}(\pi^{*};r)-\hat{J}(\hat{\pi}^{*};r))}_{\leq 0\text{ by definition}}+\underbrace{(\hat{J}(\hat{\pi}^{*};r)-\hat{J}(\hat{\pi};r))}_{\text{optimization error}}+\underbrace{|\hat{J}(\hat{\pi};r)-J(\hat{\pi};r)|}_{\text{evaluation error 2}}

Then, the proof of the theorem goes as follows:

To bound the two evaluation error terms, we first present Lemma .4 to show that the policy which maximizes Rényi entropy is able to visit the state-action space reasonably uniformly, by leveraging the convexity of the feasible region in the state-action distribution space (Lemma .2). Then, with Lemma .4 and the concentration inequality in Lemma .5, we show that the two evaluation error terms can be bounded by \epsilon for any policy and any reward function (Lemma .7).

To bound the optimization error term, we use the natural policy gradient (NPG) algorithm as the APPROXIMATE-MDP-SOLVER in Algorithm 2 to solve for a near-optimal policy on the estimated \hat{\mathbb{P}} and the given reward function r. Finally, we apply the optimization bound for the NPG algorithm (Agarwal et al. 2019) to bound the optimization error term (Lemma .8).

Definition .1 (\mathcal{K}_{h}).

\mathcal{K}_{h} is defined as the feasible region in the state-action distribution space, \mathcal{K}_{h}:=\{d_{h}^{\pi}:\pi\in\Pi\}\subset\Delta^{n}, \forall h\in[H], where n=SA is the cardinality of the state-action space.

Hazan et al. (2019) proved the convexity of such feasible regions in the infinite-horizon and discounted setting. For completeness, we provide a similar proof of the convexity of \mathcal{K}_{h} in the episodic setting.

Lemma .2.

[Convexity of \mathcal{K}_{h}] \mathcal{K}_{h} is convex. Namely, for any \pi_{1},\pi_{2}\in\Pi and h\in[H], denoting d_{1}=d_{h}^{\pi_{1}} and d_{2}=d_{h}^{\pi_{2}}, for any \lambda\in[0,1] there exists a policy \pi\in\Pi such that d:=\lambda d_{1}+(1-\lambda)d_{2}=d_{h}^{\pi}\in\mathcal{K}_{h}.

Proof of Lemma .2.

Define a mixture policy \pi^{\prime} that first chooses from \{\pi_{1},\pi_{2}\} with probability \lambda and 1-\lambda respectively at the start of each episode, and then executes the chosen policy through the episode. Define the state distribution (or the state-action distribution) for this mixture policy on each layer as d_{h}^{\pi^{\prime}},\forall h\in[H]. Obviously, d=d_{h}^{\pi^{\prime}}. For any mixture policy \pi^{\prime}, we can construct a policy \pi:\mathcal{S}\times[H]\to\Delta^{\mathcal{A}} with \pi(a|s,h)=\dfrac{d_{h}^{\pi^{\prime}}(s,a)}{d_{h}^{\pi^{\prime}}(s)} such that d_{h}^{\pi}=d_{h}^{\pi^{\prime}} (Puterman 2014). In this way, we find a policy \pi such that d_{h}^{\pi}=d. ∎

Similar to Jin et al. (2020), we define \delta-significance for state-action pairs on each step and show that the policy that maximizes Rényi entropy visits the significant state-action pairs reasonably uniformly.

Definition .3 (\delta-significance).

A state-action pair (s,a) on the h-th step is \delta-significant if there exists a policy \pi\in\Pi such that the probability of reaching (s,a) following \pi is at least \delta, i.e., \max_{\pi}d_{h}^{\pi}(s,a)\geq\delta.

Recall the way we construct the dataset \mathcal{M}: with the set of policies \varpi:=\{\pi^{(h)}\}_{h=1}^{H}, where \pi^{(h)}:\mathcal{S}\times[H]\to\Delta^{\mathcal{A}} and \pi^{(h)}\in\arg\max_{\pi}H_{\alpha}(d_{h}^{\pi}),\forall h\in[H], we first uniformly randomly choose a policy \pi from \varpi at the start of each episode, and then execute \pi through the episode. Note that there is a set of policies that maximize the Rényi entropy on the h-th layer, since the policy on the subsequent layers does not affect the entropy on the h-th layer. We denote the induced state-action distribution on the h-th step as \mu_{h}, i.e., the dataset \mathcal{M} can be regarded as being sampled from \mu_{h} for each step h\in[H]. Therefore, \mu_{h}(s,a)=\frac{1}{H}\sum_{t=1}^{H}d_{h}^{\pi^{(t)}}(s,a),\forall s\in\mathcal{S},a\in\mathcal{A},h\in[H].

Lemma .4.

If 0<\alpha<1, then

\forall\delta\text{-significant }(s,a,h),\ \dfrac{\max_{\pi}d_{h}^{\pi}(s,a)}{\mu_{h}(s,a)}\leq\dfrac{SAH}{\delta^{\alpha/(1-\alpha)}}. (7)
Proof of Lemma .4.

For any \delta-significant (s,a,h), consider the policy \pi^{\prime}\in\arg\max_{\pi}d_{h}^{\pi}(s,a). Denote x:=d_{h}^{\pi^{\prime}}\in\Delta(\mathcal{S}\times\mathcal{A}). We can treat x as a vector with n:=SA dimensions and use the first dimension to represent (s,a), i.e., x_{1}=d_{h}^{\pi^{\prime}}(s,a). Denote z:=d_{h}^{\pi^{(h)}}. Since \pi^{(h)}\in\arg\max_{\pi}H_{\alpha}(d_{h}^{\pi}), we have z=\arg\max_{d\in\mathcal{K}_{h}}H_{\alpha}(d). By Lemma .2, y_{\lambda}:=(1-\lambda)x+\lambda z\in\mathcal{K}_{h},\forall\lambda\in[0,1]. Since H_{\alpha}(\cdot) is concave over \Delta^{n} when 0<\alpha<1 and is maximized over this segment at \lambda=1, H_{\alpha}(y_{\lambda}) is non-decreasing as \lambda increases from 0 to 1, i.e.,

\dfrac{\partial H_{\alpha}(y_{\lambda})}{\partial\lambda}\geq 0,\quad\forall\lambda\in[0,1].

In particular, this holds at \lambda=1, i.e., when y_{\lambda}=z, so we have

\dfrac{\partial H_{\alpha}(y_{\lambda})}{\partial\lambda}\Big{|}_{\lambda=1}=\frac{\alpha}{1-\alpha}\frac{\sum_{i=1}^{n}\left((1-\lambda)x_{i}+\lambda z_{i}\right)^{\alpha-1}(z_{i}-x_{i})}{\sum_{i=1}^{n}\left((1-\lambda)x_{i}+\lambda z_{i}\right)^{\alpha}}\Big{|}_{\lambda=1}=\frac{\alpha}{1-\alpha}\frac{\sum_{i=1}^{n}z_{i}^{\alpha-1}(z_{i}-x_{i})}{\sum_{i=1}^{n}z_{i}^{\alpha}}\propto\sum_{i=1}^{n}z_{i}^{\alpha}-\sum_{i=1}^{n}z_{i}^{\alpha-1}x_{i}\geq 0.

Combining this with the fact that x_{i},z_{i} are non-negative, we have

\sum_{i=1}^{n}z_{i}^{\alpha}\geq\sum_{i=1}^{n}x_{i}z_{i}^{\alpha-1}=\sum_{i=1}^{n}\left(\dfrac{x_{i}}{z_{i}}\right)^{1-\alpha}x_{i}^{\alpha}\geq\left(\frac{x_{1}}{z_{1}}\right)^{1-\alpha}x_{1}^{\alpha}.

Noting that x_{1}\geq\delta>0, we have

\dfrac{x_{1}}{z_{1}}\leq\left(\dfrac{\sum_{i=1}^{n}z_{i}^{\alpha}}{x_{1}^{\alpha}}\right)^{1/(1-\alpha)}\leq\left(\dfrac{n(\frac{1}{n})^{\alpha}}{\delta^{\alpha}}\right)^{1/(1-\alpha)}=\dfrac{n}{\delta^{\alpha/(1-\alpha)}}. (8)

Here the second inequality in (8) uses \sum_{i=1}^{n}z_{i}^{\alpha}\leq n(\frac{1}{n})^{\alpha}, which holds since the sum is maximized at the uniform distribution when 0<\alpha<1. Since \mu_{h}(s,a)=\frac{1}{H}\sum_{t=1}^{H}d_{h}^{\pi^{(t)}}(s,a),\forall s\in\mathcal{S},a\in\mathcal{A},h\in[H], we have \mu_{h}(s,a)\geq\frac{1}{H}d_{h}^{\pi^{(h)}}(s,a)=\frac{1}{H}z_{1}. Using the fact that x_{1}=\max_{\pi}d_{h}^{\pi}(s,a), we obtain the inequality in (7). ∎

When \alpha\to 0^{+}, we have

\forall\delta\text{-significant }(s,a,h),\ \dfrac{\max_{\pi}d_{h}^{\pi}(s,a)}{\mu_{h}(s,a)}\leq SAH.

When \alpha=\frac{1}{2},

\forall\delta\text{-significant }(s,a,h),\ \dfrac{\max_{\pi}d_{h}^{\pi}(s,a)}{\mu_{h}(s,a)}\leq SAH/\delta.
Lemma .5.

Suppose \hat{\mathbb{P}}_{h} is the empirical transition matrix estimated from M samples drawn i.i.d. from \mu_{h} and \mathbb{P}_{h} is the true transition matrix. Then with probability at least 1-p, for any h\in[H] and any value function G:\mathcal{S}\to[0,H], we have:

\mathbb{E}_{(s,a)\sim\mu_{h}}|(\hat{\mathbb{P}}_{h}-\mathbb{P}_{h})G(s,a)|^{2}\leq O\left(\dfrac{H^{2}S}{M}\log\left(\dfrac{HM}{p}\right)\right). (9)
Proof of Lemma .5.

The lemma is proved in a similar way to Lemma C.2 in Jin et al. (2020), using a concentration inequality and a union bound. The proof differs in that we do not need to count the different events on the action space in the union bound, since the state-action distribution \mu_{h} is given here (instead of only the state distribution). As a result, a factor of A is removed from the logarithmic term compared with their lemma. ∎

Lemma .6 (Lemma E.15 in Dann, Lattimore, and Brunskill (2017)).

For any two MDPs M^{\prime} and M^{\prime\prime} with rewards r^{\prime} and r^{\prime\prime} and transition probabilities \mathbb{P}^{\prime} and \mathbb{P}^{\prime\prime}, the difference in values V^{\prime} and V^{\prime\prime} with respect to the same policy \pi is

V^{\prime}_{h}(s)-V^{\prime\prime}_{h}(s)=\mathbb{E}_{M^{\prime\prime},\pi}\left[\left.\sum_{i=h}^{H}[r^{\prime}_{i}(s_{i},a_{i})-r^{\prime\prime}_{i}(s_{i},a_{i})+(\mathbb{P}^{\prime}_{i}-\mathbb{P}^{\prime\prime}_{i})V^{\prime}_{i+1}(s_{i},a_{i})]\right|s_{h}=s\right].
Lemma .7.

Under the conditions of Theorem 3.1, with probability at least 1-p, for any reward function r and any policy \pi, we have:

|\hat{J}(\pi;r)-J(\pi;r)|\leq\epsilon. (10)
Proof of Lemma .7.

The proof is similar to that of Lemma 3.6 in Jin et al. (2020) but differs in several details, so we provide it for completeness. The proof is based on the following observations: 1) the total contribution of all insignificant state-action pairs is small; 2) \hat{\mathbb{P}} is reasonably accurate for significant state-action pairs due to the concentration inequality in Lemma .5.

Using Lemma .6 on M and \hat{M} with the true transition dynamics \mathbb{P} and the estimated transition dynamics \hat{\mathbb{P}} respectively, we have

|\hat{J}(\pi;r)-J(\pi;r)|=\left|\mathbb{E}_{s_{1}\sim\mathbb{P}_{1}}\left[\hat{V}_{1}^{\pi}(s_{1};r)-V_{1}^{\pi}(s_{1};r)\right]\right|\leq\left|\mathbb{E}_{\pi}\sum_{h=1}^{H}(\hat{\mathbb{P}}_{h}-\mathbb{P}_{h})\hat{V}_{h+1}^{\pi}(s_{h},a_{h})\right|\leq\mathbb{E}_{\pi}\sum_{h=1}^{H}\left|(\hat{\mathbb{P}}_{h}-\mathbb{P}_{h})\hat{V}_{h+1}^{\pi}(s_{h},a_{h})\right|.

Let \mathcal{S}_{h}^{\delta}=\{(s,a):\max_{\pi}d_{h}^{\pi}(s,a)\geq\delta\} be the set of \delta-significant state-action pairs on the h-th step. We have

\mathbb{E}_{\pi}|(\hat{\mathbb{P}}_{h}-\mathbb{P}_{h})\hat{V}_{h+1}^{\pi}(s_{h},a_{h})|\leq\underbrace{\sum_{(s,a)\in\mathcal{S}_{h}^{\delta}}|(\hat{\mathbb{P}}_{h}-\mathbb{P}_{h})\hat{V}^{\pi}_{h+1}(s,a)|d_{h}^{\pi}(s,a)}_{\xi_{h}}+\underbrace{\sum_{(s,a)\notin\mathcal{S}_{h}^{\delta}}|(\hat{\mathbb{P}}_{h}-\mathbb{P}_{h})\hat{V}^{\pi}_{h+1}(s,a)|d_{h}^{\pi}(s,a)}_{\zeta_{h}}.

By the definition of \delta-significant (s,a) pairs, we have

\zeta_{h}\leq H\sum_{(s,a)\notin\mathcal{S}_{h}^{\delta}}d_{h}^{\pi}(s,a)\leq H\sum_{(s,a)\notin\mathcal{S}_{h}^{\delta}}\delta\leq SAH\delta.

As for \xi_{h}, by the Cauchy-Schwarz inequality and Lemma .4, we have

\xi_{h}\leq\left[\sum_{(s,a)\in\mathcal{S}_{h}^{\delta}}|(\hat{\mathbb{P}}_{h}-\mathbb{P}_{h})\hat{V}_{h+1}^{\pi}(s,a)|^{2}d_{h}^{\pi}(s,a)\right]^{\frac{1}{2}}\leq\left[\sum_{(s,a)\in\mathcal{S}_{h}^{\delta}}|(\hat{\mathbb{P}}_{h}-\mathbb{P}_{h})\hat{V}_{h+1}^{\pi}(s,a)|^{2}\frac{SAH}{\delta^{\alpha/(1-\alpha)}}\mu_{h}(s,a)\right]^{\frac{1}{2}}\leq\left[\frac{SAH}{\delta^{\alpha/(1-\alpha)}}\mathbb{E}_{\mu_{h}}|(\hat{\mathbb{P}}_{h}-\mathbb{P}_{h})\hat{V}_{h+1}^{\pi}(s,a)|^{2}\right]^{\frac{1}{2}}.

By Lemma .5, we have

\mathbb{E}_{(s,a)\sim\mu_{h}}|(\hat{\mathbb{P}}_{h}-\mathbb{P}_{h})\hat{V}_{h+1}^{\pi}(s,a)|^{2}\leq O\left(\frac{H^{2}S}{M}\log\left(\frac{HM}{p}\right)\right).

Thus, we have

\mathbb{E}_{\pi}|(\hat{\mathbb{P}}_{h}-\mathbb{P}_{h})\hat{V}_{h+1}^{\pi}(s_{h},a_{h})|\leq\xi_{h}+\zeta_{h}\leq O\left(\sqrt{\frac{H^{3}S^{2}A}{M\delta^{\alpha/(1-\alpha)}}\log\left(\frac{HM}{p}\right)}\right)+SAH\delta.

Therefore, combining the inequalities above, we have

|\mathbb{E}_{s_{1}\sim\mathbb{P}_{1}}[\hat{V}_{1}^{\pi}(s_{1};r)-V_{1}^{\pi}(s_{1};r)]|\leq\mathbb{E}_{\pi}\sum_{h=1}^{H}|(\hat{\mathbb{P}}_{h}-\mathbb{P}_{h})\hat{V}_{h+1}^{\pi}(s_{h},a_{h})|\leq O\left(\sqrt{\frac{H^{5}S^{2}A}{M\delta^{\alpha/(1-\alpha)}}\log\left(\frac{HM}{p}\right)}\right)+H^{2}SA\delta\leq O\left(H^{2}SA\left(\frac{\sqrt{B}}{\delta^{\beta}}+\delta\right)\right),

where \beta=\frac{\alpha}{2(1-\alpha)} and B=\frac{H}{AM}\log(\frac{HM}{p}). Choosing \delta=(\sqrt{B})^{\frac{1}{\beta+1}}, it suffices for |\hat{J}(\pi;r)-J(\pi;r)|\leq\epsilon to set

M\geq O\left(\left(\frac{H^{2}SA}{\epsilon}\right)^{2(\beta+1)}\frac{H}{A}\log\left(\frac{SAH}{\epsilon p}\right)\right).

In particular, for \alpha\to 0^{+}, we have

|\hat{J}(\pi;r)-J(\pi;r)|\leq O\left(\sqrt{\dfrac{H^{5}S^{2}A}{M}\log\left(\dfrac{HM}{p}\right)}\right)+H^{2}SA\delta.

To establish (10), we can choose \delta=\epsilon/(2SAH^{2}) and set M\geq c\dfrac{H^{5}S^{2}A}{\epsilon^{2}}\log\left(\dfrac{SAH}{p\epsilon}\right) for a sufficiently large absolute constant c. ∎

We use the natural policy gradient (NPG) algorithm as the approximate MDP solver.

Lemma .8 (Agarwal et al. 2019; Jin et al. 2020).

With learning rate \eta and iteration number T, the output policy \pi^{(T)} of the NPG algorithm satisfies the following:

J(\pi^{*};r)-J(\pi^{(T)};r)\leq\dfrac{H\log A}{\eta T}+\eta H^{2} (11)

By choosing \eta=\sqrt{\log A/(HT)} and T=4H^{3}\log A/\epsilon^{2}, the policy \pi^{(T)} returned by NPG is \epsilon-optimal.
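To see why these choices suffice, substitute \eta=\sqrt{\log A/(HT)} into (11):

\dfrac{H\log A}{\eta T}+\eta H^{2}=\sqrt{\dfrac{H^{3}\log A}{T}}+\sqrt{\dfrac{H^{3}\log A}{T}}=2\sqrt{\dfrac{H^{3}\log A}{T}},

which is at most \epsilon whenever T\geq 4H^{3}\log A/\epsilon^{2}.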

Policy gradient of state-action space entropy

In the following derivation, we use summation notation over the states or the state-action pairs; the results extend directly to continuous state-action spaces by replacing summations with integrals. Note that, although the Rényi entropy for continuous distributions is not the limit of the Rényi entropy for discrete distributions (since they differ by a constant (Sanei, Borzadaran, and Amini 2016)), their gradients share the same form. Therefore, the derivation still holds for the continuous case.

Lemma .1 (Gradient of discounted state visitation distribution).

A policy \pi is parameterized by \theta. The gradient of the discounted state visitation distribution induced by the policy w.r.t. the parameter \theta is

\nabla_{\theta}d_{\mu}^{\pi}(s^{\prime})=\frac{1}{1-\gamma}\mathbb{E}_{(s,a)\sim d_{\mu}^{\pi}}[\nabla_{\theta}\log\pi(a|s)d_{s,a}^{\pi}(s^{\prime})] (12)
Proof.

The discounted state visitation distribution starting from a state-action pair can be written recursively in terms of the distribution starting from the next state-action pair:

d_{s_{0},a_{0}}^{\pi}(s^{\prime})=(1-\gamma)\mathbb{I}(s_{0}=s^{\prime})+\gamma\sum_{s_{1},a_{1}}P^{\pi}(s_{1},a_{1}|s_{0},a_{0})d_{s_{1},a_{1}}^{\pi}(s^{\prime}),

where P^{\pi}(s^{\prime},a^{\prime}|s,a):=\mathbb{P}(s^{\prime}|s,a)\pi(a^{\prime}|s^{\prime}).

The proof then unrolls this recursion on the discounted state visitation distribution as follows:

\nabla_{\theta}d^{\pi}_{s_{0},a_{0}}(s^{\prime})=\sum_{s_{1},a_{1}}\gamma\nabla_{\theta}\left[P(s_{1}|s_{0},a_{0})\pi(a_{1}|s_{1})d_{s_{1},a_{1}}^{\pi}(s^{\prime})\right]=\sum_{s_{1},a_{1}}\gamma P(s_{1}|s_{0},a_{0})\left[\nabla_{\theta}\pi(a_{1}|s_{1})d_{s_{1},a_{1}}^{\pi}(s^{\prime})+\pi(a_{1}|s_{1})\nabla_{\theta}d_{s_{1},a_{1}}^{\pi}(s^{\prime})\right]
=\sum_{s_{1},a_{1}}\gamma P^{\pi}(s_{1},a_{1}|s_{0},a_{0})\left[\nabla_{\theta}\log\pi(a_{1}|s_{1})d_{s_{1},a_{1}}^{\pi}(s^{\prime})+\nabla_{\theta}d_{s_{1},a_{1}}^{\pi}(s^{\prime})\right]\quad\text{(unroll)}
=\sum_{s,a}\sum_{t=1}^{\infty}\gamma^{t}P(s_{t}=s,a_{t}=a|s_{0},a_{0};\pi)\nabla_{\theta}\log\pi(a|s)d_{s,a}^{\pi}(s^{\prime})
=\frac{1}{1-\gamma}\mathbb{E}_{(s,a)\sim d_{s_{0},a_{0}}^{\pi}}[\nabla_{\theta}\log\pi(a|s)d_{s,a}^{\pi}(s^{\prime})]-\nabla_{\theta}\log\pi(a_{0}|s_{0})d_{s_{0},a_{0}}^{\pi}(s^{\prime}).

Finally,

\nabla_{\theta}d_{\mu}^{\pi}(s^{\prime})=\mathbb{E}_{s_{0}\sim\mu,a_{0}\sim\pi(\cdot|s_{0})}[\nabla_{\theta}\log\pi(a_{0}|s_{0})d_{s_{0},a_{0}}^{\pi}(s^{\prime})+\nabla_{\theta}d_{s_{0},a_{0}}^{\pi}(s^{\prime})]=\frac{1}{1-\gamma}\mathbb{E}_{(s,a)\sim d_{\mu}^{\pi}}[\nabla_{\theta}\log\pi(a|s)d_{s,a}^{\pi}(s^{\prime})]. ∎

Lemma .2 (Policy gradient of the Shannon entropy over the state-action space).

Given a policy \pi parameterized by \theta, the gradient of H_{1}(d_{\mu}^{\pi}) w.r.t. the parameter \theta is as follows:

\nabla_{\theta}H_{1}(d_{\mu}^{\pi})=\mathbb{E}_{(s,a)\sim d_{\mu}^{\pi}}\left[\nabla_{\theta}\log\pi(a|s)\left(\frac{1}{1-\gamma}\langle d_{s,a}^{\pi},-\log d_{\mu}^{\pi}\rangle-\log d_{\mu}^{\pi}(s,a)\right)\right]. (13)
Proof.

First, the gradient of the Shannon entropy is \nabla_{d}H_{1}(d)=-\log d-c, where d is the state-action distribution and c is a constant whose value is immaterial here since \sum_{s,a}\nabla_{\theta}d_{\mu}^{\pi}(s,a)=0. Then, we observe that \nabla_{\theta}d_{\mu}^{\pi}(s,a)=\nabla_{\theta}[d_{\mu}^{\pi}(s)\pi(a|s)]=\nabla_{\theta}d_{\mu}^{\pi}(s)\pi(a|s)+d_{\mu}^{\pi}(s,a)\nabla_{\theta}\log\pi(a|s).

Using the standard chain rule and Lemma .1, we can obtain the policy gradient:

\nabla_{\theta}H_{1}(d_{\mu}^{\pi})=\sum_{s,a}\nabla_{\theta}d_{\mu}^{\pi}(s,a)\dfrac{\partial H_{1}(d_{\mu}^{\pi})}{\partial d_{\mu}^{\pi}(s,a)}=\sum_{s,a}\nabla_{\theta}d_{\mu}^{\pi}(s,a)(-\log d_{\mu}^{\pi}(s,a)-c)=-\sum_{s,a}\nabla_{\theta}d_{\mu}^{\pi}(s,a)\log d_{\mu}^{\pi}(s,a)
=-\frac{1}{1-\gamma}\sum_{s^{\prime},a^{\prime}}d_{\mu}^{\pi}(s^{\prime},a^{\prime})\nabla_{\theta}\log\pi(a^{\prime}|s^{\prime})\sum_{s,a}d_{s^{\prime},a^{\prime}}^{\pi}(s,a)\log d_{\mu}^{\pi}(s,a)-\sum_{s,a}d_{\mu}^{\pi}(s,a)\nabla_{\theta}\log\pi(a|s)\log d_{\mu}^{\pi}(s,a)
=\mathbb{E}_{(s,a)\sim d_{\mu}^{\pi}}\left[\nabla_{\theta}\log\pi(a|s)\left(\frac{1}{1-\gamma}\langle d_{s,a}^{\pi},-\log d_{\mu}^{\pi}\rangle-\log d_{\mu}^{\pi}(s,a)\right)\right],

where \langle d_{s,a}^{\pi},-\log d_{\mu}^{\pi}\rangle is the cross-entropy between the two distributions. We use the fact \mathbb{E}_{x\sim p}\left[c\nabla_{\theta}\log p(x)\right]=0 in the first line. ∎

Lemma .3 (Policy gradient of the Rényi entropy over the state-action space).

Given a policy \pi parameterized by \theta and 0<\alpha<1, the gradient of H_{\alpha}(d_{\mu}^{\pi}) w.r.t. the parameter \theta is as follows:

\nabla_{\theta}H_{\alpha}(d_{\mu}^{\pi})\propto\frac{\alpha}{1-\alpha}\mathbb{E}_{(s,a)\sim d_{\mu}^{\pi}}\left[\nabla_{\theta}\log\pi(a|s)\left(\frac{1}{1-\gamma}\langle d_{s,a}^{\pi},(d_{\mu}^{\pi})^{\alpha-1}\rangle+\left(d_{\mu}^{\pi}(s,a)\right)^{\alpha-1}\right)\right]. (14)
Proof.

First, we note that the gradient of Rényi entropy is

\nabla_{d}H_{\alpha}(d)=\dfrac{\alpha}{1-\alpha}\dfrac{d^{\alpha-1}-c}{\sum_{i=1}^{n}d_{i}^{\alpha}},

where c is a constant normalizer. Given a vector d=(d_{1},\cdots,d_{n})^{T} and a real number a, we use d^{a} to denote the vector (d_{1}^{a},\cdots,d_{n}^{a})^{T}. Similarly, using the chain rule and Lemma .1, we have

\nabla_{\theta}H_{\alpha}(d_{\mu}^{\pi})=\sum_{s,a}\nabla_{\theta}d_{\mu}^{\pi}(s,a)\dfrac{\partial H_{\alpha}(d_{\mu}^{\pi})}{\partial d_{\mu}^{\pi}(s,a)}\propto\frac{\alpha}{1-\alpha}\sum_{s,a}\nabla_{\theta}d_{\mu}^{\pi}(s,a)\left[(d_{\mu}^{\pi}(s,a))^{\alpha-1}-c\right]=\frac{\alpha}{1-\alpha}\sum_{s,a}\nabla_{\theta}d_{\mu}^{\pi}(s,a)(d_{\mu}^{\pi}(s,a))^{\alpha-1}
=\frac{\alpha}{1-\alpha}\bigg{(}\frac{1}{1-\gamma}\sum_{s^{\prime},a^{\prime}}d_{\mu}^{\pi}(s^{\prime},a^{\prime})\nabla_{\theta}\log\pi(a^{\prime}|s^{\prime})\sum_{s,a}d_{s^{\prime},a^{\prime}}^{\pi}(s,a)(d_{\mu}^{\pi}(s,a))^{\alpha-1}+\sum_{s,a}d_{\mu}^{\pi}(s,a)\nabla_{\theta}\log\pi(a|s)(d_{\mu}^{\pi}(s,a))^{\alpha-1}\bigg{)}
=\frac{\alpha}{1-\alpha}\mathbb{E}_{(s,a)\sim d_{\mu}^{\pi}}\left[\nabla_{\theta}\log\pi(a|s)\left(\frac{1}{1-\gamma}\langle d_{s,a}^{\pi},(d_{\mu}^{\pi})^{\alpha-1}\rangle+\left(d_{\mu}^{\pi}(s,a)\right)^{\alpha-1}\right)\right].

We should note that when \alpha\to 0, \nabla_{d}H_{\alpha}(d)\to 0; and when \alpha\to 1, \nabla_{d}H_{\alpha}(d)\to-\log d. In particular, we can write down the case \alpha=\frac{1}{2}:

\nabla_{\theta}H_{\alpha}(d_{\mu}^{\pi})=\mathbb{E}_{s,a\sim d_{\mu}^{\pi}}\left[\nabla_{\theta}\log\pi(a|s)\left(\frac{1}{1-\gamma}\langle d_{s,a}^{\pi},\sqrt{\dfrac{1}{d_{\mu}^{\pi}}}\rangle+\sqrt{\dfrac{1}{d_{\mu}^{\pi}(s,a)}}\right)\right] (15)

This leads to a policy gradient update with reward \propto(d_{\mu}^{\pi})^{-1/2}. This inverse-square-root style reward is frequently used as an exploration bonus in standard RL problems (Strehl and Littman 2008; Bellemare et al. 2016) and is similar to the UCB algorithm (Auer, Cesa-Bianchi, and Fischer 2002; Lai and Robbins 1985) in the bandit problem.

Practical implementation

Recall the policy gradient formulations of the entropies on a CMP derived in (13) and (14). We observe that if we set the reward function r(s,a)=(d_{\mu}^{\pi}(s,a))^{\alpha-1} (when 0<\alpha<1) or r(s,a)=-\log d_{\mu}^{\pi}(s,a) (when \alpha\to 1), these policy gradient formulations can be compared with the policy gradient for a standard MDP: the first term resembles the policy gradient of the cumulative reward, \mathbb{E}_{(s,a)\sim d_{\mu}^{\pi}}[\nabla_{\theta}\log\pi(a|s)Q^{\pi}(s,a)], and the second term resembles the policy gradient of the instant reward, \mathbb{E}_{(s,a)\sim d_{\mu}^{\pi}}[\nabla_{\theta}\log\pi(a|s)r(s,a)].

For the first term, a wide range of existing policy-gradient-based algorithms can be used to estimate it. Since algorithms such as PPO do not preserve the scale of the policy gradient, we introduce a coefficient \eta to adjust the scale between this term and the second term.

For the second term, when \alpha=1, it can be reduced to

\mathbb{E}_{(s,a)\sim d_{\mu}^{\pi}}[\nabla_{\theta}\log\pi(a|s)(-\log d_{\mu}^{\pi}(s,a))]=\nabla_{\theta}\mathbb{E}_{s\sim d_{\mu}^{\pi}}[H_{1}(\pi(\cdot|s))],

where H_{1}(\pi(\cdot|s)) denotes the Shannon entropy of the action distribution conditioned on the state sample s. Although the two sides are equivalent in expectation, the right-hand side leads to smaller variance when estimated from samples, for two reasons. First, the left-hand side is calculated with the values of a density estimator (that estimates d_{\mu}^{\pi}) on the state-action samples, and the density estimator suffers from high variance; in contrast, the right-hand side does not involve density estimation. Second, to estimate the expectation with samples, the left-hand side uses both the state samples and the action samples, while the right-hand side uses only the state samples and the action distributions conditioned on these states, which further reduces the variance of the estimate. When 0<\alpha<1, since the second term in equation (2) can hardly be reduced to such a simple form, we still use \nabla_{\theta}\mathbb{E}_{s\sim d_{\mu}^{\pi}}[H_{1}(\pi(\cdot|s))] as the approximation.
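The following is a minimal REINFORCE-style sketch of the resulting surrogate objective; our PPO-based implementation is more involved, and the function and argument names here are ours for illustration. The intrinsic returns are assumed to be sums of r(s,a)=\hat{p}(s,a)^{\alpha-1} computed along the collected trajectories from a separate density estimate \hat{p} of d_{\mu}^{\pi}.

```python
import torch

def exploration_loss(policy, states, actions, intrinsic_returns, eta=1e-4):
    """Surrogate loss whose gradient approximates the two terms discussed above.

    policy(states) is assumed to return a torch.distributions.Categorical over
    actions; intrinsic_returns are the sums of intrinsic rewards along the
    collected trajectories, playing the role of Q^pi(s, a) in the first term.
    """
    dist = policy(states)
    log_probs = dist.log_prob(actions)

    # First term: standard policy-gradient surrogate with the intrinsic return.
    pg_term = (log_probs * intrinsic_returns.detach()).mean()

    # Second term: approximated by the entropy of pi(.|s), as discussed above
    # (the approximation is exact only in the limit alpha -> 1).
    entropy_term = dist.entropy().mean()

    # Gradient ascent on pg_term + eta * entropy_term, i.e., minimize its negative.
    return -(pg_term + eta * entropy_term)
```

The coefficient eta plays the same balancing role as the hyperparameter \eta=10^{-4} listed in the experiment settings below.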

Value function estimator

In this section, we provide the details of the truncated TD(\lambda) estimator (Sutton and Barto 2018) that we use in our algorithm. Recall that in the i-th iteration, we maintain the replay buffer \mathcal{D}=\mathcal{D}_{i}\cup\mathcal{D}_{i-1}\cup\cdots\cup\mathcal{D}_{i-n+1} as \mathcal{D}=\{(s_{j1},a_{j1}),\cdots,(s_{jN_{j}},a_{jN_{j}})\}_{j=1}^{nm}, where N_{j} is the length of the j-th trajectory and nm is the total number of trajectories in the replay buffer. Correspondingly, the reward calculated using the latest density estimator P_{\phi} is denoted as r_{jk}=(P_{\phi}(s_{jk},a_{jk}))^{\alpha-1} or r_{jk}=-\log P_{\phi}(s_{jk},a_{jk}) for each (s_{jk},a_{jk})\in\mathcal{D}, \forall j\in[nm],k\in[N_{j}].

The truncated n-step value function estimator can be written as:

V^{\text{nstep}}_{p}(s_{jk}):=\sum_{\tau=k}^{h-1}r_{j\tau}+V_{\psi}(s_{jh}),\quad\text{with }h=\min(N_{j},k+p),

where V_{\psi} is the current value function network with parameter \psi. Notice that, since the discount rate is already taken into account when we collect samples, we estimate the undiscounted sum here. When p\geq N_{j}-k (i.e., h=N_{j}), this corresponds to the Monte Carlo estimator, which has high variance. When p=1 (i.e., h=k+1), this corresponds to the TD(0) estimator, which has large bias. To balance variance and bias, we use the truncated TD(\lambda) estimator, which is a weighted sum of the truncated n-step estimators with different p, i.e.,

V^{\text{targ}}_{\lambda}(s_{jk}):=(1-\lambda)\sum_{p=1}^{P-1}\lambda^{p-1}V^{\text{nstep}}_{p}(s_{jk})+\lambda^{P-1}V^{\text{nstep}}_{P}(s_{jk}),

where \lambda\in[0,1] and P\in\mathbb{N}^{+} are hyperparameters.
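A minimal NumPy sketch of these targets for a single trajectory is given below; the variable names are ours, and the actual implementation operates on batches of trajectories drawn from the replay buffer.

```python
import numpy as np

def truncated_td_lambda_targets(rewards, values, lam=0.95, P=15):
    """Compute V^targ_lambda for every step of one trajectory.

    rewards[k] is the intrinsic reward at step k and values[k] = V_psi(s_k)
    from the current value network, for k = 0, ..., N-1 (0-indexed)."""
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    N = len(rewards)
    targets = np.zeros(N)
    for k in range(N):
        # Truncated n-step estimates V^nstep_p(s_k) for p = 1, ..., P.
        nstep = np.zeros(P)
        for p in range(1, P + 1):
            h = min(N - 1, k + p)                 # index of the bootstrap state
            nstep[p - 1] = rewards[k:h].sum() + values[h]
        # Weighted combination: (1-lam) * sum_{p<P} lam^{p-1} V_p + lam^{P-1} V_P.
        weights = (1 - lam) * lam ** np.arange(P - 1)
        targets[k] = weights @ nstep[:P - 1] + lam ** (P - 1) * nstep[-1]
    return targets
```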

Experiment settings, hyperparameters and more experiments.

.1 Experiment settings and hyperparameters

Throughout the experiments, we set the hyperparameters as follows: the coefficient that balances the two terms in the policy gradient formulation is \eta=10^{-4}; the parameter for computing GAE and the target state values is \lambda=0.95; the number of steps for computing the target state values is P=15; the learning rate for the policy network is 4\times 10^{-3}; and the learning rate for the VAE and the value function is 10^{-3}.

In Pendulum and FourRooms (see Appendix .2), we use \gamma=0.995 and store the samples of the latest 10 iterations in the replay buffer. To estimate the corresponding Rényi entropy of the policy during training in the top of Figure 6, we sample several trajectories in each iteration. In each trajectory, we terminate the trajectory at each step with probability 1-\gamma. For \alpha=0.5 and 1, we sample 1000 trajectories to estimate the distribution and then calculate the Rényi entropy over the state-action space. For \alpha=0, we sample 50 groups of trajectories. In the i-th group, we sample 20 trajectories and count the number of unique state-action pairs visited, C_{i}. We use \frac{1}{50}\sum_{i=1}^{50}\log C_{i} as the estimate of the Rényi entropy when \alpha=0. The corresponding entropy of the optimal policy is estimated in the same manner. The bottom figures in Figure 6 are plotted using 100 trajectories, each of which is also collected in the same manner.
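A sketch of this estimation procedure for \alpha>0 is given below (the \alpha=0 case uses the grouped unique-count estimate described above and is omitted); rollout() is a hypothetical function returning one trajectory of discretized (s,a) indices.

```python
import numpy as np

def renyi_entropy_from_counts(counts, alpha):
    """Renyi entropy of the empirical distribution given visit counts."""
    p = np.asarray(counts, dtype=float)
    p = p[p > 0] / p.sum()
    if np.isclose(alpha, 1.0):
        return -(p * np.log(p)).sum()            # Shannon limit
    return np.log((p ** alpha).sum()) / (1 - alpha)

def estimate_entropy(rollout, alpha, gamma=0.995, n_traj=1000, rng=np.random):
    """Monte Carlo estimate of H_alpha(d_mu^pi): each trajectory is cut at a
    geometric length (continue with probability gamma after every step), so the
    visit counts follow the discounted state-action distribution."""
    counts = {}
    for _ in range(n_traj):
        for sa in rollout():
            counts[sa] = counts.get(sa, 0) + 1
            if rng.random() > gamma:             # terminate with probability 1 - gamma
                break
    return renyi_entropy_from_counts(list(counts.values()), alpha)
```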

In MultiRooms, we use \gamma=0.995 and store the samples of the latest 10 iterations in the replay buffer. We use the MiniGrid-MultiRoom-N6-v0 task from MiniGrid. We fix the room configuration within a single run of the algorithm and use different room configurations in different runs. The four actions are turning left/right, moving forward and opening the door. In the planning phase, we randomly pick 20 grids and construct the corresponding reward functions. Each reward function gives the agent a reward r=1 if the agent reaches the corresponding state and a reward r=0 otherwise. We show the means and the standard deviations over five runs with different random seeds in the figures.

For Atari games, we use \gamma=0.995 and store the samples of the latest 20 iterations in the replay buffer. For Mujoco tasks, we use \gamma=0.995 and only store the samples of the current iteration. In Figure 4b, the two reward functions in the planning phase are designed as follows: assume the rooms are numbered in the default order by the environment (cf. the labels in Figure 4b). Under the first reward function, when the agent goes from Room 3 to Room 9 (the room below Room 3) for the first time, we assign a unit reward to the agent; otherwise, we assign zero reward. Similarly, under the second reward function, we give a reward to the agent when it goes from Room 7 to Room 13 (the room below Room 7).

We will release the source code for these experiments once the paper is published.

.2 Convergence of MaxRenyi

Convergence is one of the most critical properties of an algorithm. The algorithm based on maximizing Shannon entropy has a convergence guarantee, which leverages existing results on the convergence of standard RL algorithms together with the similar form of the objective to the standard RL problem (noting that H_{1}(d_{\mu}^{\pi})=\langle d_{\mu}^{\pi},-\log d_{\mu}^{\pi}\rangle and J(\pi)=\frac{1}{1-\gamma}\langle d_{\mu}^{\pi},r\rangle) (Hazan et al. 2019). On the other hand, MaxRenyi does not have such a guarantee, not only due to the form of Rényi entropy but also due to the techniques we use. Therefore, we present a set of experiments to examine whether our exploration algorithm MaxRenyi empirically converges to near-optimal policies in terms of the entropy over the state-action space. To answer this question, we compare the entropy induced by the agent during training with MaxRenyi against the maximum entropy the agent can achieve. We implement MaxRenyi on two simple environments, Pendulum and FourRooms, where we can solve for the maximum entropy by brute-force search. Pendulum from OpenAI Gym (Brockman et al. 2016) has a continuous state-action space; for this task, the entropy is estimated by discretizing the state space into 16\times 16 grids and the action space into two discrete actions. FourRooms is an 11\times 11 grid world environment with four actions and deterministic transitions. We show the results in Figure 6. The results indicate that our exploration algorithm approaches the optimum in terms of the corresponding state-action space entropy with different \alpha in both discrete and continuous settings.

Figure 6: Top. The corresponding Rényi entropy of the state-action distribution induced by the policies during training using our exploration algorithm with different \alpha. The dashed lines indicate the corresponding maximum entropies the agent can achieve. Bottom. The discounted state distributions of the learned policies compared with the random policy.

.3 More results for Atari games and Mujoco tasks

The reward-free RL framework contains two phases, the exploration phase and the planning phase. Due to space limits, we only show the final performance of the algorithms (i.e., the performance of the agent planned based on the data collected using the learned exploratory policy) in the main text. Here, we also show the performance of the exploration algorithms in the exploration phase (i.e., the performance for the reward-free exploration setting). We show the results in Table 3. First, the scores for the exploration phase are measured using the extrinsic game rewards of the tasks; however, the exploration algorithms are trained in the reward-free exploration setting (i.e., they do not receive such extrinsic game rewards). The results indicate that, although MaxRenyi is designed for the reward-free RL framework, it outperforms other exploration algorithms in the reward-free exploration setting. Then, the average scores for the planning phase are the same as those in Table 1. In addition, we present the standard deviations over five random seeds. The results indicate that the stability of MaxRenyi is comparable to RND and ICM while achieving better performance on average.

Phase Task MaxRenyi RND ICM Random
Exploration phase Montezuma 7,540(2,210) 4,220(736) 1,140(1,232) 0
Venture 460(273) 740(546) 120(147) 0
Gravitar 1,420(627) 1,010(886) 580(621) 173
PrivateEye 6,480(7,813) 6,220(7,505) 100(0) 25
Solaris 1,284(961) 1,276(712) 1,136(928) 1,236
Hopper 207(628) 72(92) 86(75) 31
HalfCheetah 456(328) 137(126) 184(294) -82
Planning phase Montezuma 8,100(1,124) 4,740(2,216) 380(146) 0(0)
Venture 900(228) 840(307) 240(80) 0(0)
Gravitar 1,850(579) 1,890(665) 940(281) 340(177)
PrivateEye 6,820(7,058) 3,040(5,880) 100(0) 100(0)
Solaris 2,192(1,547) 1,416(734) 1,844(173) 1,508(519)
Hopper 973(159) 819(87) 704(59) 78(36)
HalfCheetah 1,466(216) 1,083(377) 700(488) 54(50)
Table 3: The performance of the exploratory policies (Exploration phase) and the planned policies under the reward-free RL framework (Planning phase) corresponding to different exploration algorithms. Random indicates using a random policy as the exploratory policy. We use \alpha=0.5 in MaxRenyi. The numbers before and inside the parentheses indicate the means and the standard deviations over five random seeds respectively.

.4 More results for Montezuma’s Revenge

Although the extrinsic game reward in Montezuma’s Revenge aligns with the coverage of the state space, the number of rooms visited in one episode may be a more direct measure for an exploration algorithm. Therefore, we also provide curves of the number of rooms visited during the reward-free exploration in Figure 7. Comparing with Figure 4a, we observe that, although MaxRenyi visits only slightly more rooms than RND, it yields higher returns by visiting more scoring states. This is possibly because our algorithm encourages the agent to visit the hard-to-reach states more.

Figure 7: The average number of rooms visited in an episode during the reward-free exploration.

Moreover, our algorithm is less likely to fail under different reward functions in the planning phase. There is a failure mode for the planning phase in Montezuma’s Revenge: for example, with the reward for going across Room 3, the trained agent may first go to Room 7 to obtain an extra key and then return to open the door toward Room 0, which is not the shortest path. Out of 5 runs, our algorithm fails 2 times, while RND fails in all 5 runs.