
Exploration by Maximizing Rényi Entropy for Reward-Free RL Framework

Chuheng Zhang*  Yuanying Cai*  Longbo Huang  Jian Li
*These authors contributed equally to this work.
Abstract

Exploration is essential for reinforcement learning (RL). To face the challenges of exploration, we consider a reward-free RL framework that completely separates exploration from exploitation and brings new challenges for exploration algorithms. In the exploration phase, the agent learns an exploratory policy by interacting with a reward-free environment and collects a dataset of transitions by executing the policy. In the planning phase, the agent computes a good policy for any reward function based on the dataset, without further interacting with the environment. This framework is suitable for the meta RL setting where there are many reward functions of interest. In the exploration phase, we propose to maximize the Rényi entropy over the state-action space and justify this objective theoretically. The success of using Rényi entropy as the objective stems from its encouragement of exploring the hard-to-reach state-actions. We further deduce a policy gradient formulation for this objective and design a practical exploration algorithm that can deal with complex environments. In the planning phase, we solve for good policies given arbitrary reward functions using a batch RL algorithm. Empirically, we show that our exploration algorithm is effective and sample efficient, and results in superior policies for arbitrary reward functions in the planning phase.

1 Introduction

The trade-off between exploration and exploitation is at the core of reinforcement learning (RL). Designing efficient exploration algorithms, while being a highly nontrivial task, is essential to success in many RL tasks (Burda et al. 2019b; Ecoffet et al. 2019). Hence, it is natural to ask the following high-level question: What can we achieve by pure exploration? To address this question, several settings related to meta reinforcement learning (meta RL) have been proposed (see e.g., Wang et al. 2016; Duan et al. 2016; Finn, Abbeel, and Levine 2017). One common setting in meta RL is to learn a model in a reward-free environment in the meta-training phase, and use the learned model as the initialization for fast adaptation to new tasks in the meta-testing phase (Eysenbach et al. 2019; Gupta et al. 2018; Nagabandi et al. 2019). Since the agent still needs to explore the environment under the new tasks in the meta-testing phase (the number of new samples needed varies across tasks), it is less clear how to evaluate the effectiveness of the exploration in the meta-training phase. Another setting is to learn a policy in a reward-free environment and test the policy under a task with a specific reward function (such as the score in Montezuma’s Revenge) without further training on the task (Burda et al. 2019b; Ecoffet et al. 2019; Burda et al. 2019a). However, there is no guarantee that the algorithm has fully explored the transition dynamics of the environment unless we test the learned model for arbitrary reward functions. Recently, Jin et al. (2020) proposed a reward-free RL framework that fully decouples exploration and exploitation. Further, they designed a provably efficient algorithm that conducts a finite number of steps of reward-free exploration and returns near-optimal policies for arbitrary reward functions. However, their algorithm is designed for the tabular case and can hardly be extended to continuous or high-dimensional state spaces since it constructs a policy for each state.

The reward-free RL framework is as follows: First, a set of policies is trained to explore the dynamics of the reward-free environment in the exploration phase (i.e., the meta-training phase). Then, a dataset of trajectories is collected by executing the learned exploratory policies. In the planning phase (i.e., the meta-testing phase), an arbitrary reward function is specified and a batch RL algorithm (Lange, Gabel, and Riedmiller 2012; Fujimoto, Meger, and Precup 2019) is applied to solve for a good policy solely based on the dataset, without further interaction with the environment. This framework is suitable for scenarios where there are many reward functions of interest or where the reward is designed offline to elicit desired behavior. The key to success in this framework is to obtain a dataset with good coverage over all possible situations in the environment using as few samples as possible, which in turn requires the exploratory policy to fully explore the environment.

Several methods that encourage various forms of exploration have been developed in the reinforcement learning literature. The maximum entropy framework (Haarnoja et al. 2017) maximizes the cumulative reward while also maximizing the entropy over the action space conditioned on each state. This framework results in several efficient and robust algorithms, such as soft Q-learning (Schulman, Chen, and Abbeel 2017; Nachum et al. 2017), SAC (Haarnoja et al. 2018) and MPO (Abdolmaleki et al. 2018). On the other hand, maximizing the state space coverage may result in better exploration. Various kinds of objectives/regularizations are used for better exploration of the state space, including information-theoretic metrics (Houthooft et al. 2016; Mohamed and Rezende 2015; Eysenbach et al. 2019) (especially the entropy of the state space, e.g., Hazan et al. (2019); Islam et al. (2019)), the prediction error of a dynamical model (Burda et al. 2019a; Pathak et al. 2017; de Abril and Kanai 2018), the state visitation count (Burda et al. 2019b; Bellemare et al. 2016; Ostrovski et al. 2017) or other heuristic signals such as novelty (Lehman and Stanley 2011; Conti et al. 2018), surprise (Achiam and Sastry 2016) or curiosity (Schmidhuber 1991).

In practice, it is desirable to obtain a single exploratory policy instead of a set of policies whose cardinality may be as large as that of the state space (e.g., Jin et al. 2020; Misra et al. 2019). Our exploration algorithm outputs a single exploratory policy for the reward-free RL framework by maximizing the Rényi entropy over the state-action space in the exploration phase. In particular, we demonstrate the advantage of using the state-action space instead of the state space via a simple example (see Section 3 and Figure 1). Moreover, Rényi entropy generalizes a family of entropies, including the commonly used Shannon entropy. We justify the use of Rényi entropy as the objective function theoretically by providing an upper bound on the number of samples in the dataset that ensures a near-optimal policy is obtained for any reward function in the planning phase.

Further, we derive a gradient ascent update rule for maximizing the Rényi entropy over the state-action space. The derived update rule is similar to the vanilla policy gradient update with the reward replaced by a function of the discounted stationary state-action distribution of the current policy. We use a variational autoencoder (VAE) (Kingma and Welling 2014) as the density model to estimate this distribution. The corresponding reward changes over iterations, which makes it hard to accurately estimate a value function under the current reward. To address this issue, we propose to estimate a state value function using off-policy data with the reward relabeled by the current density model. This enables us to efficiently update the policy in a stable way using PPO (Schulman et al. 2017). Afterwards, we collect a dataset by executing the learned policy. In the planning phase, when a reward function is specified, we augment the dataset with the rewards and use a batch RL algorithm, batch constrained deep Q-learning (BCQ) (Fujimoto, Meger, and Precup 2019; Fujimoto et al. 2019), to plan for a good policy under the reward function. We conduct experiments on several environments with discrete, continuous or high-dimensional state spaces. The experimental results indicate that our algorithm is effective, sample efficient and robust in the exploration phase, and achieves good performance under the reward-free RL framework.

Our contributions are summarized as follows:

  • (Section 3) We consider the reward-free RL framework that separates exploration from exploitation and therefore places higher requirements on the exploration algorithm. To efficiently explore under this framework, we propose a novel objective that maximizes the Rényi entropy over the state-action space in the exploration phase and justify this objective theoretically.

  • (Section 4) We propose a practical algorithm based on a derived policy gradient formulation to maximize the Rényi entropy over the state-action space for the reward-free RL framework.

  • (Section 5) We conduct a wide range of experiments and the results indicate that our algorithm is efficient and robust in the exploration phase and results in superior performance in the downstream planning phase for arbitrary reward functions.

2 Preliminary

A reward-free environment can be formulated as a controlled Markov process (CMP)¹ $(\mathcal{S},\mathcal{A},\mathbb{P},\mu,\gamma)$, which specifies the state space $\mathcal{S}$ with $S=|\mathcal{S}|$, the action space $\mathcal{A}$ with $A=|\mathcal{A}|$, the transition dynamics $\mathbb{P}$, the initial state distribution $\mu\in\Delta^{\mathcal{S}}$ and the discount factor $\gamma$. A (stationary) policy $\pi_{\theta}:\mathcal{S}\to\Delta^{\mathcal{A}}$ parameterized by $\theta$ specifies the probability of choosing the actions on each state. The stationary discounted state visitation distribution (or simply the state distribution) under the policy $\pi$ is defined as $d_{\mu}^{\pi}(s):=(1-\gamma)\sum_{t=0}^{\infty}\gamma^{t}\text{Pr}(s_{t}=s|s_{0}\sim\mu;\pi)$. The stationary discounted state-action visitation distribution (or simply the state-action distribution) under the policy $\pi$ is defined as $d_{\mu}^{\pi}(s,a):=(1-\gamma)\sum_{t=0}^{\infty}\gamma^{t}\text{Pr}(s_{t}=s,a_{t}=a|s_{0}\sim\mu;\pi)$. Unless otherwise stated, we use $d_{\mu}^{\pi}$ to denote the state-action distribution. We also use $d_{s_{0},a_{0}}^{\pi}$ to denote the distribution started from the state-action pair $(s_{0},a_{0})$ instead of the initial state distribution $\mu$. (¹Although some literature uses CMP to refer to an MDP, we use CMP to denote a Markov process (Silver 2015) equipped with a control structure in this paper.)
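
For concreteness, the discounted state-action distribution of a stationary policy can be computed exactly in the tabular case by solving a linear system. The following is a minimal sketch of ours (not from the paper), assuming the CMP is given as a transition tensor P[s, a, s'] and the policy as a matrix pi[s, a]:

    import numpy as np

    def state_action_distribution(P, pi, mu, gamma):
        """Compute d_mu^pi(s, a) for a tabular CMP.

        P:  transition tensor of shape (S, A, S), P[s, a, s'] = P(s' | s, a)
        pi: policy matrix of shape (S, A), rows sum to one
        mu: initial state distribution of shape (S,)
        """
        S, A, _ = P.shape
        # Transition matrix over state-action pairs: M[(s,a),(s',a')] = P(s'|s,a) * pi(a'|s')
        M = (P.reshape(S * A, S)[:, :, None] * pi[None, :, :]).reshape(S * A, S * A)
        # Initial state-action distribution nu(s, a) = mu(s) * pi(a|s)
        nu = (mu[:, None] * pi).reshape(S * A)
        # d = (1 - gamma) * sum_t gamma^t * nu^T M^t = (1 - gamma) * nu^T (I - gamma M)^{-1}
        d = (1 - gamma) * np.linalg.solve(np.eye(S * A) - gamma * M.T, nu)
        return d.reshape(S, A)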

When a reward function $r:\mathcal{S}\times\mathcal{A}\to\mathbb{R}$ is specified, the CMP becomes a Markov decision process (MDP) (Sutton and Barto 2018) $(\mathcal{S},\mathcal{A},\mathbb{P},r,\mu,\gamma)$. The objective for the MDP is to find a policy $\pi_{\theta}$ that maximizes the expected cumulative reward $J(\theta;r):=\mathbb{E}_{s_{0}\sim\mu}[V^{\pi_{\theta}}(s_{0};r)]$, where $V^{\pi}(s;r)=\mathbb{E}[\sum_{t=0}^{\infty}\gamma^{t}r(s_{t},a_{t})|s_{0}=s;\pi]$ is the state value function. We indicate the dependence on $r$ to emphasize that there may be more than one reward function of interest. The policy gradient for this objective is $\nabla_{\theta}J(\theta;r)=\mathbb{E}_{(s,a)\sim d^{\pi_{\theta}}_{\mu}}[\nabla_{\theta}\log\pi_{\theta}(a|s)\,Q^{\pi_{\theta}}(s,a;r)]$ (Williams 1992), where $Q^{\pi}(s,a;r)=\mathbb{E}[\sum_{t=0}^{\infty}\gamma^{t}r(s_{t},a_{t})|s_{0}=s,a_{0}=a;\pi]=\frac{1}{1-\gamma}\langle d_{s,a}^{\pi},r\rangle$ is the Q function.

The Rényi entropy of a distribution $d\in\Delta^{\mathcal{X}}$ is defined as $H_{\alpha}(d):=\frac{1}{1-\alpha}\log\left(\sum_{x\in\mathcal{X}}d^{\alpha}(x)\right)$, where $d(x)$ is the probability mass or the probability density at $x$ (and the summation becomes an integral in the latter case). When $\alpha\to 0$, Rényi entropy becomes Hartley entropy and equals the logarithm of the size of the support of $d$. When $\alpha\to 1$, Rényi entropy becomes Shannon entropy $H_{1}(d):=-\sum_{x\in\mathcal{X}}d(x)\log d(x)$ (Bromiley, Thacker, and Bouhova-Thacker 2004; Sanei, Borzadaran, and Amini 2016).
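
As a quick reference, here is a small numerical sketch of ours (not part of the paper) that evaluates the Rényi entropy of a discrete distribution, treating $\alpha=1$ as the Shannon limit and $\alpha=0$ as the Hartley limit:

    import numpy as np

    def renyi_entropy(d, alpha):
        """Renyi entropy H_alpha of a discrete distribution d (1-D array summing to one)."""
        d = np.asarray(d, dtype=float)
        d = d[d > 0]                                  # zero-probability atoms do not contribute
        if np.isclose(alpha, 1.0):                    # limit alpha -> 1: Shannon entropy
            return float(-np.sum(d * np.log(d)))
        if np.isclose(alpha, 0.0):                    # limit alpha -> 0: Hartley entropy
            return float(np.log(d.size))              # log of the support size
        return float(np.log(np.sum(d ** alpha)) / (1.0 - alpha))

    # Example: the uniform distribution maximizes H_alpha for every alpha.
    print(renyi_entropy([0.25, 0.25, 0.25, 0.25], 0.5))   # = log(4)
    print(renyi_entropy([0.7, 0.1, 0.1, 0.1], 0.5))       # < log(4)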

Given a distribution $d\in\Delta^{\mathcal{X}}$, a density model $P_{\phi}:\mathcal{X}\to\mathbb{R}$ parameterized by $\phi$ gives a probability density estimate of $d$ based on samples drawn from $d$. The variational auto-encoder (VAE) (Kingma and Welling 2014) is a popular density model that maximizes the variational lower bound (ELBO) of the log-likelihood. Specifically, VAE maximizes a lower bound of $\mathbb{E}_{x\sim d}[\log P_{\phi}(x)]$, i.e., $\max_{\phi}\mathbb{E}_{x\sim d,z\sim q_{\phi}(\cdot|x)}[\log p_{\phi}(x|z)]-\mathbb{E}_{x\sim d}[D_{\text{KL}}(q_{\phi}(\cdot|x)\,\|\,p(z))]$, where $q_{\phi}$ and $p_{\phi}$ are the encoder and the decoder respectively and $p(z)$ is a prior distribution over the latent variable $z$.
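
The sketch below (ours, in PyTorch, with assumed network sizes and a unit-variance Gaussian decoder) illustrates how the ELBO decomposes into a reconstruction term and a KL term; the per-sample ELBO is what can later serve as a log-density estimate of $d_{\mu}^{\pi}$:

    import torch
    import torch.nn as nn

    class VAE(nn.Module):
        def __init__(self, x_dim, z_dim=8, hidden=128):
            super().__init__()
            self.enc = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU())
            self.enc_mu = nn.Linear(hidden, z_dim)
            self.enc_logvar = nn.Linear(hidden, z_dim)
            self.dec = nn.Sequential(nn.Linear(z_dim, hidden), nn.ReLU(), nn.Linear(hidden, x_dim))

        def elbo(self, x):
            """Per-sample ELBO, a lower bound on log P_phi(x) (used as the density estimate)."""
            h = self.enc(x)
            mu, logvar = self.enc_mu(h), self.enc_logvar(h)
            z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)      # reparameterization trick
            recon = self.dec(z)
            # Gaussian decoder with unit variance => log p(x|z) is a squared error up to constants
            log_px_given_z = -0.5 * ((x - recon) ** 2).sum(dim=-1)
            # KL(q(z|x) || N(0, I)) in closed form
            kl = -0.5 * (1 + logvar - mu ** 2 - logvar.exp()).sum(dim=-1)
            return log_px_given_z - kl

    # Training step sketch: maximize the ELBO over a batch of state-action samples.
    # vae = VAE(x_dim=obs_dim + act_dim); loss = -vae.elbo(batch).mean(); loss.backward()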

3 The objective for the exploration phase

The objective for the exploration phase under the reward-free RL framework is to induce an informative and compact dataset: The informative condition is that the dataset should have good coverage such that the planning phase generates good policies for arbitrary reward functions. The compact condition is that the size of the dataset should be as small as possible while still ensuring successful planning. In this section, we show that the Rényi entropy over the state-action space (i.e., $H_{\alpha}(d_{\mu}^{\pi})$) is a good objective function for the exploration phase. We first demonstrate the advantage of maximizing the state-action space entropy over maximizing the state space entropy with a toy example. Then, we provide a motivation for using Rényi entropy by analyzing a deterministic setting. Finally, we provide an upper bound on the number of samples needed in the dataset for successful planning if we have access to a policy that maximizes the Rényi entropy over the state-action space. For ease of analysis, we assume the state-action space is discrete in this section and derive a practical algorithm that deals with continuous state-action spaces in the next section.

Figure 1: A deterministic five-state MDP with two actions. With discount factor $\gamma=1$, a policy that maximizes the entropy of the discounted state visitation distribution does not visit all the transitions, while a policy that maximizes the entropy of the discounted state-action visitation distribution visits all the transitions. Covering all the transitions is important for the reward-free RL framework. The width of the arrows indicates the visitation frequency under different policies.

Why maximize the entropy over the state-action space? We demonstrate the advantage of maximizing the entropy over the state-action space with the toy example shown in Figure 1. The example is an MDP with two actions and five states. The first action always drives the agent back to the first state, while the second action moves the agent to the next state. For simplicity of presentation, we consider the case with discount factor $\gamma=1$; other values are similar. The policy that maximizes the entropy of the state distribution is a deterministic policy that takes the actions shown in red. The dataset obtained by executing this policy is poor: the planning algorithm fails when, in the planning phase, a sparse reward is assigned to one of the state-action pairs that this policy visits with zero probability (e.g., a reward function $r(s,a)$ that equals $1$ on $(s_{2},a_{1})$ and $0$ otherwise). In contrast, a policy that maximizes the entropy of the state-action distribution avoids this problem. For example, when executing the policy that maximizes the Rényi entropy with $\alpha=0.5$ over the state-action space, the expected number of samples needed for the induced dataset to contain all the state-action pairs is only about 44 (cf. Appendix The toy example in Figure 1). Note that, when the transition dynamics is known to be deterministic, a dataset containing all the state-action pairs is sufficient for the planning algorithm to obtain an optimal policy since the full transition dynamics is then known.

Why use Rényi entropy? We analyze a deterministic setting where the transition dynamics is known to be deterministic. In this setting, the objective for the framework can be expressed as a specific objective function for the exploration phase. This objective function is hard to optimize directly but motivates us to use Rényi entropy as a surrogate.

We define $n:=SA$ as the cardinality of the state-action space. Given an exploratory policy $\pi$, we assume the dataset is collected in a way such that the transitions in the dataset can be treated as i.i.d. samples from $d_{\mu}^{\pi}$, where $d_{\mu}^{\pi}$ is the state-action distribution induced by the policy $\pi$.

In the deterministic setting, we can recover the exact transition dynamics of the environment from a dataset of transitions that contains all the $n$ state-action pairs. Such a dataset leads to successful planning for arbitrary reward functions, and therefore satisfies the informative condition. In order to obtain such a dataset that is also compact, we stop collecting samples once the dataset contains all the $n$ pairs. Given the distribution $d_{\mu}^{\pi}=(d_{1},\cdots,d_{n})\in\Delta^{n}$ from which we collect samples, the average size of the dataset is $G(d_{\mu}^{\pi})$, where

$$G(d_{\mu}^{\pi})=\int_{0}^{\infty}\left(1-\prod_{i=1}^{n}\left(1-e^{-d_{i}t}\right)\right)dt, \qquad (1)$$

which is a result of the coupon collector’s problem (Flajolet, Gardy, and Thimonier 1992). Accordingly, the objective for the exploration phase can be expressed as $\min_{\pi\in\Pi}G(d_{\mu}^{\pi})$, where $\Pi$ is the set of all feasible policies. (Notice that, due to the limitation of the transition dynamics, the feasible state-action distributions form only a subset of $\Delta^{n}$.) We show the contour of this function on $\Delta^{n}$ in Figure 2a. We can see that, when any component of the distribution $d_{\mu}^{\pi}$ approaches zero, $G(d_{\mu}^{\pi})$ increases rapidly, forming a barrier indicated by the black arrow. This barrier prevents the agent from visiting a state-action with vanishing probability, and thus encourages the agent to explore the hard-to-reach state-actions.
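
Equation (1) can be evaluated numerically for a given distribution (this is one way to check the $G$ values reported in the appendix); below is a short sketch of ours using SciPy, intended only for evaluating $G$, not for use as an optimization objective:

    import numpy as np
    from scipy.integrate import quad

    def expected_cover_time(d):
        """G(d): expected number of i.i.d. samples from d until every atom has appeared (Eq. (1))."""
        d = np.asarray(d, dtype=float)
        integrand = lambda t: 1.0 - np.prod(1.0 - np.exp(-d * t))
        value, _ = quad(integrand, 0.0, np.inf)
        return value

    print(expected_cover_time([1/3, 1/3, 1/3]))       # uniform: moderate cover time
    print(expected_cover_time([0.01, 0.495, 0.495]))  # a near-zero component blows G(d) up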

However, this function involves an improper integral which is hard to handle, and therefore cannot be directly used as an objective function in the exploration phase. One common choice for a tractable objective function is Shannon entropy, i.e., $\max_{\pi\in\Pi}H_{1}(d_{\mu}^{\pi})$ (Hazan et al. 2019; Islam et al. 2019), whose contour is shown in Figure 2b. We can see that Shannon entropy may still be large when some component of $d_{\mu}^{\pi}$ approaches zero (e.g., the black arrow indicates a case where the entropy is relatively large but $d_{1}\to 0$). Therefore, maximizing Shannon entropy may result in a policy that visits some state-action pair with vanishing probability. We find that the contour of Rényi entropy (shown in Figure 2c/d) aligns better with $G(d_{\mu}^{\pi})$ and alleviates this problem. In general, the policy $\pi$ that maximizes Rényi entropy may correspond to a smaller $G(d_{\mu}^{\pi})$ than that resulting from maximizing Shannon entropy for different CMPs (more evidence can be found in Appendix An example to illustrate the advantage of using Rényi entropy).

Figure 2: The contours of different objective functions on $\Delta^{n}$ with $n=3$. $G(d)$ is the average size of the transition dataset sampled from the distribution $d$ to ensure successful planning. Compared with Shannon entropy ($H_{1}$), the contour of Rényi entropy ($H_{0.5}$, $H_{0.1}$) aligns better with $G$ since Rényi entropy rapidly decreases when any component of $d$ approaches zero (indicated by the arrow).

Theoretical justification for the objective function. Next, we formally justify the use of Rényi entropy over the state-action space with the following theorem. For ease of analysis, we consider a standard episodic setting: The MDP has a finite planning horizon $H$ and stochastic dynamics $\mathbb{P}$, with the objective to maximize the cumulative reward $J(\pi;r):=\mathbb{E}_{s_{1}\sim\mu}[V^{\pi}(s_{1};r)]$, where $V^{\pi}(s;r):=\mathbb{E}_{\mathbb{P}}\left[\sum_{h=1}^{H}r_{h}(s_{h},a_{h})|s_{1}=s,\pi\right]$. We assume the reward function $r_{h}:\mathcal{S}\times\mathcal{A}\to[0,1],\forall h\in[H]$ is deterministic. A (non-stationary, stochastic) policy $\pi:\mathcal{S}\times[H]\to\Delta^{\mathcal{A}}$ specifies the probability of choosing the actions on each state and on each step. The state-action distribution induced by $\pi$ on the $h$-th step is $d_{h}^{\pi}(s,a):=\text{Pr}(s_{h}=s,a_{h}=a|s_{1}\sim\mu;\pi)$.

Theorem 3.1.

Denote by $\varpi$ a set of policies $\{\pi^{(h)}\}_{h=1}^{H}$, where $\pi^{(h)}:\mathcal{S}\times[H]\to\Delta^{\mathcal{A}}$ and $\pi^{(h)}\in\arg\max_{\pi}H_{\alpha}(d_{h}^{\pi})$. Construct a dataset $\mathcal{M}$ with $M$ trajectories, each of which is collected by first uniformly randomly choosing a policy $\pi$ from $\varpi$ and then executing the policy $\pi$. Assume

Mc(H2SAϵ)2(β+1)HAlog(SAHpϵ),M\geq c\left(\dfrac{H^{2}SA}{\epsilon}\right)^{2(\beta+1)}\frac{H}{A}\log\left(\dfrac{SAH}{p\epsilon}\right),

where $\beta=\frac{\alpha}{2(1-\alpha)}$ and $c>0$ is an absolute constant. Then, there exists a planning algorithm such that, for any reward function $r$, with probability at least $1-p$, the output policy $\hat{\pi}$ of the planning algorithm based on $\mathcal{M}$ is $3\epsilon$-optimal, i.e., $J(\pi^{*};r)-J(\hat{\pi};r)\leq 3\epsilon$, where $J(\pi^{*};r)=\max_{\pi}J(\pi;r)$.

We provide the proof in Appendix Proof of Theorem 3.1. The theorem justifies Rényi entropy with small $\alpha$ as a proper objective function for the exploration phase, since the number of samples needed to ensure successful planning is bounded when $\alpha$ is small. When $\alpha\to 1$, the bound becomes infinity. The algorithm proposed by Jin et al. (2020) requires sampling $\tilde{O}(H^{5}S^{2}A/\epsilon^{2})$ trajectories, where $\tilde{O}$ hides a logarithmic factor, which matches our bound when $\alpha\to 0$. However, they construct a policy for each state on each step, whereas we only need $H$ policies, which easily adapts to the non-tabular case.
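
To get a feel for how the bound scales with $\alpha$, the following sketch of ours simply evaluates its right-hand side for a few values of $\alpha$; the absolute constant $c$ is unspecified in the theorem, so it is set to a placeholder value of 1 here:

    import math

    def sample_bound(H, S, A, eps, p, alpha, c=1.0):
        """Right-hand side of the bound in Theorem 3.1 (c is an unknown absolute constant)."""
        beta = alpha / (2.0 * (1.0 - alpha))           # diverges as alpha -> 1
        return c * (H**2 * S * A / eps) ** (2 * (beta + 1)) * (H / A) * math.log(S * A * H / (p * eps))

    for alpha in (0.0, 0.25, 0.5):
        print(alpha, f"{sample_bound(H=10, S=20, A=5, eps=0.1, p=0.1, alpha=alpha):.3e}")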

4 Method

In this section, we develop an algorithm for the non-tabular case. In the exploration phase, we update the policy $\pi$ to maximize $H_{\alpha}(d_{\mu}^{\pi})$. We first deduce a gradient ascent update rule which is similar to vanilla policy gradient with the reward replaced by a function of the state-action distribution of the current policy. Afterwards, we estimate the state-action distribution using VAE. We also estimate a value function and update the policy using PPO, which is more sample efficient and robust than vanilla policy gradient. Then, we obtain a dataset by collecting samples from the learned policy. In the planning phase, we use a popular batch RL algorithm, BCQ, to plan for a good policy when the reward function is specified. One may also use other batch RL algorithms. We show the pseudocode of the process in Algorithm 1, the details of which are described in the following paragraphs.

Figure 3: Experiments on MultiRooms. a) Illustration of the MultiRooms environment and the state visitation distribution of the policies learned by different algorithms. b) The number of unique state-action pairs visited by running the policy for 2k steps in each iteration in the exploration phase. c) The normalized cumulative reward of the policy computed in the planning phase (i.e., normalized by the corresponding optimal cumulative reward), averaged over different reward functions. The lines indicate the average across five independent runs and the shaded areas indicate the standard deviation.

Policy gradient formulation. Let us first consider the gradient of the objective function $H_{\alpha}(d_{\mu}^{\pi})$, where the policy $\pi$ is approximated by a policy network with parameters denoted as $\theta$. We omit the dependency of $\pi$ on $\theta$. The gradient of the objective function is

$$\nabla_{\theta}H_{\alpha}(d_{\mu}^{\pi}) \propto \dfrac{\alpha}{1-\alpha}\,\mathbb{E}_{(s,a)\sim d_{\mu}^{\pi}}\Big[\nabla_{\theta}\log\pi(a|s)\Big(\tfrac{1}{1-\gamma}\langle d_{s,a}^{\pi},(d_{\mu}^{\pi})^{\alpha-1}\rangle+(d_{\mu}^{\pi}(s,a))^{\alpha-1}\Big)\Big]. \qquad (2)$$

As a special case, when $\alpha\to 1$, the Rényi entropy becomes the Shannon entropy and the gradient turns into

$$\nabla_{\theta}H_{1}(d_{\mu}^{\pi}) = \mathbb{E}_{(s,a)\sim d_{\mu}^{\pi}}\Big[\nabla_{\theta}\log\pi(a|s)\Big(\tfrac{1}{1-\gamma}\langle d_{s,a}^{\pi},-\log d_{\mu}^{\pi}\rangle-\log d_{\mu}^{\pi}(s,a)\Big)\Big]. \qquad (3)$$

Due to space limits, the derivation can be found in Appendix Policy gradient of state-action space entropy. There are two terms in the gradients. The first term equals $\mathbb{E}_{(s,a)\sim d_{\mu}^{\pi}}\left[\nabla_{\theta}\log\pi(a|s)Q^{\pi}(s,a;r)\right]$ with the reward $r(s,a)=(d_{\mu}^{\pi}(s,a))^{\alpha-1}$ or $r(s,a)=-\log d_{\mu}^{\pi}(s,a)$, which resembles the policy gradient (of cumulative reward) for a standard MDP. This term encourages the policy to choose the actions that lead to rarely visited state-action pairs. In a similar way, the second term resembles the policy gradient of the instant reward $\mathbb{E}_{(s,a)\sim d_{\mu}^{\pi}}\left[\nabla_{\theta}\log\pi(a|s)r(s,a)\right]$ and encourages the agent to choose the actions that are rarely selected on the current step. The second term in (3) equals $\nabla_{\theta}\mathbb{E}_{s\sim d_{\mu}^{\pi}}[H_{1}(\pi(\cdot|s))]$. For stability, we also replace the second term in (2) with $\nabla_{\theta}\mathbb{E}_{s\sim d_{\mu}^{\pi}}[H_{1}(\pi(\cdot|s))]$, since we found empirically that this term leads to more stable performance as it does not suffer from the high variance induced by the estimation of $d_{\mu}^{\pi}$ (see also Appendix Practical implementation). Accordingly, we update the policy based on the following formula, where $\eta>0$ is a hyperparameter:

$$\nabla_{\theta}H_{\alpha}(d_{\mu}^{\pi}) \approx \begin{cases}\nabla_{\theta}J(\theta;r=(d_{\mu}^{\pi})^{\alpha-1}) & 0<\alpha<1\\ \nabla_{\theta}J(\theta;r=-\log d_{\mu}^{\pi}) & \alpha=1\end{cases} \;+\; \eta\,\nabla_{\theta}\mathbb{E}_{s\sim d_{\mu}^{\pi}}[H_{1}(\pi(\cdot|s))]. \qquad (4)$$

Discussion. Islam et al. (2019) motivate the agent to explore by maximizing the Shannon entropy over the state space, resulting in an intrinsic reward $r(s)=-\log d_{\mu}^{\pi}(s)$ which is similar to ours when $\alpha\to 1$. Bellemare et al. (2016) use an intrinsic reward $r(s,a)=\hat{N}(s,a)^{-1/2}$ where $\hat{N}(s,a)$ is an estimation of the visit count of $(s,a)$. Our algorithm with $\alpha=1/2$ induces a similar reward.
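
In practice, the intrinsic reward in (4) is computed from the density model's estimate of $d_{\mu}^{\pi}$. A minimal sketch of ours follows, where log_density is assumed to be the per-sample ELBO returned by the VAE and the clipping floor is our own numerical-stability assumption:

    import numpy as np

    def intrinsic_reward(log_density, alpha, floor=-20.0):
        """Intrinsic reward from an estimated log-density of the current state-action distribution.

        log_density: array of log P_phi(s, a) values (e.g., the per-sample ELBO)
        alpha:       Renyi parameter; alpha = 1 recovers the Shannon-entropy reward
        """
        log_density = np.maximum(np.asarray(log_density, dtype=float), floor)  # clip tiny densities
        if np.isclose(alpha, 1.0):
            return -log_density                          # r = -log d(s, a)
        return np.exp((alpha - 1.0) * log_density)       # r = d(s, a)^(alpha - 1)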

Algorithm 1 Maximizing the state-action space Rényi entropy for the reward-free RL framework
1: Input: The number of iterations in the exploration phase $T$; the size of the dataset $M$; the parameter of Rényi entropy $\alpha$.
2: Initialize: Replay buffer $\mathcal{D}=\emptyset$ that stores the samples from the last $n$ iterations; density model $P_{\phi}$ (VAE); value function $V_{\psi}$; policy network $\pi_{\theta}$.
3: $\triangleright$ Exploration phase (MaxRenyi)
4: for $i=1$ to $T$ do
5:     Sample $\mathcal{D}_{i}$ using $\pi_{\theta}$ and add it to $\mathcal{D}$
6:     Update $P_{\phi}$ to estimate the state-action distribution based on $\mathcal{D}_{i}$
7:     Update $V_{\psi}$ to minimize $\mathcal{L}_{\psi}(\mathcal{D})$ defined in (5)
8:     Update $\pi_{\theta}$ to minimize $\mathcal{L}_{\theta}(\mathcal{D}_{i})$ defined in (6)
9: $\triangleright$ Collect the dataset
10: Roll out the exploratory policy $\pi_{\theta}$, collect $M$ trajectories and store them in $\mathcal{M}$
11: $\triangleright$ Planning phase
12: Reveal a reward function $r:\mathcal{S}\times\mathcal{A}\to\mathbb{R}$
13: Construct a labeled dataset $\overline{\mathcal{M}}:=\{(s,a,r(s,a))\}$
14: Plan for a policy $\pi_{r}=\text{BCQ}(\overline{\mathcal{M}})$

Sample collection. To estimate $d_{\mu}^{\pi}$ for calculating the underlying reward, we collect samples in the following way (cf. Line 5 in Algorithm 1): In the $i$-th iteration, we sample $m$ trajectories. In each trajectory, we terminate the rollout on each step with probability $1-\gamma$. In this way, we obtain a set of trajectories $\mathcal{D}_{i}=\{(s_{j1},a_{j1}),\cdots,(s_{jN_{j}},a_{jN_{j}})\}_{j=1}^{m}$ where $N_{j}$ is the length of the $j$-th trajectory. Then, we can use VAE to estimate $d_{\mu}^{\pi}$ based on $\mathcal{D}_{i}$, i.e., using ELBO as the density estimation (cf. Line 6 in Algorithm 1).
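
A sketch of this sampling scheme (ours; env and policy are assumed Gym-style interfaces), where each rollout is truncated on every step with probability $1-\gamma$ so that the visited pairs are approximately distributed according to $d_{\mu}^{\pi}$:

    import random

    def collect_trajectories(env, policy, m, gamma):
        """Collect m trajectories, terminating each step with probability 1 - gamma."""
        batch = []
        for _ in range(m):
            trajectory, s = [], env.reset()
            while True:
                a = policy(s)                      # sample an action from pi_theta(. | s)
                trajectory.append((s, a))
                s, _, done, _ = env.step(a)        # reward-free: the reward is ignored
                if done or random.random() < 1.0 - gamma:
                    break                          # geometric truncation with success prob 1 - gamma
            batch.append(trajectory)
        return batch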

Value function. Instead of performing vanilla policy gradient, we update the policy using PPO, which is more robust and sample efficient. However, the underlying reward function changes across iterations, which makes it hard to incrementally learn the value function that is used to reduce variance in PPO. We propose to train a value function network using relabeled off-policy data. In the $i$-th iteration, we maintain a replay buffer $\mathcal{D}=\mathcal{D}_{i}\cup\mathcal{D}_{i-1}\cup\cdots\cup\mathcal{D}_{i-n+1}$ that stores the trajectories of the last $n$ iterations (cf. Line 5 in Algorithm 1). Next, we calculate the reward for each state-action pair in $\mathcal{D}$ with the latest density estimator $P_{\phi}$, i.e., we assign $r=(P_{\phi}(s,a))^{\alpha-1}$ or $r=-\log P_{\phi}(s,a)$ for each $(s,a)\in\mathcal{D}$. Based on these rewards, we can estimate the target value $V^{\text{targ}}(s)$ for each state $s\in\mathcal{D}$ using the truncated TD($\lambda$) estimator (Sutton and Barto 2018), which balances bias and variance (see the details in Appendix Value function estimator). Then, we train the value function network $V_{\psi}$ (where $\psi$ is the parameter) to minimize the mean squared error w.r.t. the target values:

$$\mathcal{L}_{\psi}(\mathcal{D})=\mathbb{E}_{s\sim\mathcal{D}}\left[\left(V_{\psi}(s)-V^{\text{targ}}(s)\right)^{2}\right]. \qquad (5)$$
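
As an illustration of how the targets $V^{\text{targ}}(s)$ in (5) can be formed from relabeled rewards, here is a generic TD($\lambda$) return computation for a single trajectory (ours; the exact truncation used by the paper is described in its appendix, so this is only an assumed variant):

    import numpy as np

    def td_lambda_targets(rewards, values, gamma, lam):
        """Lambda-return targets for one trajectory.

        rewards: relabeled intrinsic rewards r_t = (P_phi(s_t, a_t))^(alpha-1) or -log P_phi(s_t, a_t)
        values:  V_psi(s_t) for t = 0..T (one extra bootstrap value for the state after the last step)
        """
        T = len(rewards)
        targets = np.zeros(T)
        next_target = values[T]                    # bootstrap from the value after the final step
        for t in reversed(range(T)):
            # Interpolate between the one-step TD target and the longer lambda-return
            targets[t] = rewards[t] + gamma * ((1 - lam) * values[t + 1] + lam * next_target)
            next_target = targets[t]
        return targets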

Policy network. In each iteration, we update the policy network to maximize the following objective function that is used in PPO:

$$-\mathcal{L}_{\theta}(\mathcal{D}_{i})= \mathbb{E}_{s,a\sim\mathcal{D}_{i}}\left[\min\left(\tfrac{\pi_{\theta}(a|s)}{\pi_{\text{old}}(a|s)}\hat{A}(s,a),\;g(\varepsilon,\hat{A}(s,a))\right)\right] + \eta\,\mathbb{E}_{s\sim\mathcal{D}_{i}}[H_{1}(\pi_{\theta}(\cdot|s))], \qquad (6)$$

where $g(\varepsilon,A)=\begin{cases}(1+\varepsilon)A & A\geq 0\\ (1-\varepsilon)A & A<0\end{cases}$ and $\hat{A}(s,a)$ is the advantage estimated using GAE (Schulman et al. 2016) and the learned value function $V_{\psi}$.
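
A minimal PyTorch sketch of ours for the clipped surrogate in (6); logp, logp_old, adv and entropy are assumed to be precomputed batch tensors:

    import torch

    def ppo_loss(logp, logp_old, adv, entropy, clip_eps=0.2, eta=0.01):
        """Negative of the objective in Eq. (6): clipped surrogate plus an entropy bonus."""
        ratio = torch.exp(logp - logp_old)                          # pi_theta(a|s) / pi_old(a|s)
        clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
        surrogate = torch.min(ratio * adv, clipped)                 # equals min(ratio * A, g(eps, A))
        return -(surrogate.mean() + eta * entropy.mean())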

Figure 4: Experiments on Montezuma’s Revenge. a) The episodic return under the extrinsic game reward over the course of training in the exploration phase. b) The trajectory of the agent executing the policy trained with the corresponding reward function in the planning phase. Best viewed in color.

5 Experiments

We first conduct experiments on the MultiRooms environment from minigrid (Chevalier-Boisvert, Willems, and Pal 2018) to compare the performance of MaxRenyi with several baseline exploration algorithms in the exploration phase and the downstream planning phase. We also compare the performance of MaxRenyi with different values of $\alpha$, and show the advantage of using a PPO-based algorithm and maximizing the entropy over the state-action space by comparing with ablated versions. Then, we conduct experiments on a set of Atari (with image-based observations) (Machado et al. 2018) and Mujoco (Todorov, Erez, and Tassa 2012) tasks to show that our algorithm outperforms the baselines in complex environments for the reward-free RL framework. We also provide detailed results to investigate why our algorithm succeeds. More experiments and the detailed experiment settings/hyperparameters can be found in Appendix Experiment settings, hyperparameters and more experiments.

Experiments on MultiRooms. The observation in MultiRooms is the first-person perspective of the agent, which is high-dimensional and partially observable, and the actions are turning left/right, moving forward and opening the door (cf. Figure 3a above). This environment is hard to explore for standard RL algorithms due to sparse reward. In the exploration phase, the agent has to navigate and open the doors to explore through a series of rooms. In the planning phase, we randomly assign a goal state-action pair, reward the agent upon reaching this state-action, and then train a policy with this reward function. We compare our exploration algorithm MaxRenyi with ICM (Pathak et al. 2017) and RND (Burda et al. 2019b) (which use different prediction errors as the intrinsic reward), MaxEnt (Hazan et al. 2019) (which maximizes the state space entropy), and an ablated version of our algorithm, MaxRenyi(VPG), which updates the policy directly by vanilla policy gradient using (2) and (3).

For the exploration phase, we show the performance of different algorithms in Figure 3b. First, we see that the variants of MaxRenyi perform better than the baseline algorithms. Specifically, MaxRenyi is more stable than ICM and RND, which explore well at the start but later degenerate when the agent becomes familiar with all the states. Second, we observe that MaxRenyi performs better than MaxRenyi(VPG) for different values of $\alpha$, indicating that MaxRenyi benefits from adopting PPO, which reduces variance with a value function and updates the policy conservatively. Third, MaxRenyi with $\alpha=0.5$ performs better than with $\alpha=1$, which is consistent with our theory. However, we observe that MaxRenyi with $\alpha=0$ is more likely to be unstable empirically and results in slightly worse performance. Finally, we show the visitation frequency of the exploratory policies resulting from different algorithms in Figure 3a. We see that MaxRenyi visits the states more uniformly compared with the baseline algorithms, especially the hard-to-reach states such as the corners of the rooms.

For the planning phase, we collect datasets of different sizes by executing different exploratory policies, and use the datasets to compute policies with different reward functions using BCQ. We show the performance of the resulting policies in Figure 3c. First, we observe that the datasets generated by running random policies do not lead to successful planning, indicating the importance of learning a good exploratory policy in this framework. Second, a dataset with only 8k samples leads to successful planning (with a normalized cumulative reward larger than 0.8) using MaxRenyi, whereas a dataset with 16k samples is needed to succeed in the planning phase when using ICM, RND or MaxEnt. This illustrates that MaxRenyi leads to better performance in the planning phase (i.e., attains good policies with fewer samples) than the previous exploration algorithms.

Further, we compare MaxRenyi, which maximizes the entropy over the state-action space, with two ablated versions: “State + bonus”, which maximizes the entropy over the state space with an entropy bonus (i.e., using $P_{\phi}$ to estimate the state distribution in Algorithm 1), and “State”, which maximizes the entropy over the state space (i.e., additionally setting $\eta=0$ in (6)). Notice that the reward-free RL framework requires that a good policy is obtained for arbitrary reward functions in the planning phase. Therefore, we show the lowest normalized cumulative reward of the planned policies across different reward functions for the algorithms in Table 1. We see that maximizing the entropy over the state-action space results in better performance for this framework.

M =            2k      4k      8k      16k
MaxRenyi       0.14    0.52    0.87    0.91
State + bonus  0.14    0.48    0.78    0.87
State          0.07    0.15    0.41    0.57
Table 1: The lowest normalized cumulative reward of the planned policies across different reward functions, averaged over five runs. We use $\alpha=0.5$ in the algorithms.

Experiments on Atari and Mujoco. In the exploration phase, we run different exploration algorithms in the reward-free environment of Atari (Mujoco) for 200M (10M) steps and collect a dataset with 100M (5M) samples by executing the learned policy. In the planning phase, a policy is computed offline to maximize the performance under the extrinsic game reward using the batch RL algorithm based on the dataset. We show the performance of the planned policies in Table 2. We can see that MaxRenyi performs well on a range of tasks with high-dimensional observations or continuous state-actions for the reward-free RL framework.

             MaxRenyi   RND     ICM     Random
Montezuma    8,100      4,740   380     0
Venture      900        840     240     0
Gravitar     1,850      1,890   940     340
PrivateEye   6,820      3,040   100     100
Solaris      2,192      1,416   1,844   1,508
Hopper       973        819     704     78
HalfCheetah  1,466      1,083   700     54
Table 2: The performance of the planned policies under the reward-free RL framework corresponding to different exploratory algorithms, averaged over five runs. Random indicates using a random policy as the exploratory policy. We use $\alpha=0.5$ in MaxRenyi.

We also provide detailed results for Montezuma’s Revenge and Hopper. For Montezuma’s Revenge, we show the result for the exploration phase in Figure 4a. We observe that although MaxRenyi learns an exploratory policy in a reward-free environment, it achieves reasonable performance under the extrinsic game reward and performs better than RND and ICM. We also provide the number of visited rooms over the course of training in Appendix Experiment settings, hyperparameters and more experiments, which demonstrates that our algorithm performs better in terms of state space coverage as well. In the planning phase, we design two different sparse reward functions that only reward the agent if it goes through Room 3 or Room 7 (to the next room). We show the trajectories of the policies planned with the two reward functions in red and blue respectively in Figure 4b. We see that although the reward functions are sparse, the agent chooses the correct path (e.g., opening the correct door in Room 1 with the only key) and successfully reaches the specified room. This indicates that our algorithm generates good policies for different reward functions based on offline planning in complex environments. For Hopper, we show the t-SNE (Maaten and Hinton 2008) plot of the state-actions randomly sampled from the trajectories generated by different policies in Figure 5. We can see that MaxRenyi generates state-actions that are distributed more uniformly than those of RND and overlap those from the expert policy. This indicates that the good performance of our algorithm in the planning phase results from the better coverage of our exploratory policy.

Figure 5: t-SNE plot of the state-actions randomly sampled from the trajectories generated by different policies in Hopper. The random policy chooses actions randomly. The expert policy is trained using PPO for 2M steps with the extrinsic reward. The RND and MaxRenyi policies are trained using the corresponding algorithms without extrinsic reward.

6 Conclusion

In this paper, we consider a reward-free RL framework, which is useful when there are multiple reward functions of interest or when we design reward functions offline. In this framework, an exploratory policy is learned by interacting with a reward-free environment in the exploration phase and generates a dataset. In the planning phase, when the reward function is specified, a policy is computed offline to maximize the corresponding cumulative reward using the batch RL algorithm based on the dataset. We propose a novel objective function, the Rényi entropy over the state-action space, for the exploration phase. We theoretically justify this objective and design a practical algorithm to optimize for this objective. In the experiments, we show that our exploration algorithm is effective under this framework, while being more sample efficient and more robust than the previous exploration algorithms.

Acknowledgement

Chuheng Zhang and Jian Li are supported in part by the National Natural Science Foundation of China Grant 61822203, 61772297, 61632016, 61761146003, and the Zhongguancun Haihua Institute for Frontier Information Technology, Turing AI Institute of Nanjing and Xi’an Institute for Interdisciplinary Information Core Technology. Yuanying Cai and Longbo Huang are supported in part by the National Natural Science Foundation of China Grant 61672316, and the Zhongguancun Haihua Institute for Frontier Information Technology, China.

References

  • Abdolmaleki et al. (2018) Abdolmaleki, A.; Springenberg, J. T.; Tassa, Y.; Munos, R.; Heess, N.; and Riedmiller, M. 2018. Maximum a posteriori policy optimisation. In Proceedings of the 6th International Conference on Learning Representations.
  • Achiam and Sastry (2016) Achiam, J.; and Sastry, S. 2016. Surprise-based intrinsic motivation for deep reinforcement learning. In Deep RL Workshop, Advances in Neural Information Processing Systems.
  • Agarwal et al. (2019) Agarwal, A.; Kakade, S. M.; Lee, J. D.; and Mahajan, G. 2019. Optimality and approximation with policy gradient methods in markov decision processes. arXiv preprint arXiv:1908.00261 .
  • Auer, Cesa-Bianchi, and Fischer (2002) Auer, P.; Cesa-Bianchi, N.; and Fischer, P. 2002. Finite-time analysis of the multiarmed bandit problem. Machine Learning 47(2-3): 235–256.
  • Bellemare et al. (2016) Bellemare, M.; Srinivasan, S.; Ostrovski, G.; Schaul, T.; Saxton, D.; and Munos, R. 2016. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems, 1471–1479.
  • Brockman et al. (2016) Brockman, G.; Cheung, V.; Pettersson, L.; Schneider, J.; Schulman, J.; Tang, J.; and Zaremba, W. 2016. OpenAI Gym.
  • Bromiley, Thacker, and Bouhova-Thacker (2004) Bromiley, P.; Thacker, N.; and Bouhova-Thacker, E. 2004. Shannon entropy, Renyi entropy, and information. Statistics and Inf. Series (2004-004) .
  • Burda et al. (2019a) Burda, Y.; Edwards, H.; Pathak, D.; Storkey, A.; Darrell, T.; and Efros, A. A. 2019a. Large-scale study of curiosity-driven learning. In Proceedings of the 7th International Conference on Learning Representations.
  • Burda et al. (2019b) Burda, Y.; Edwards, H.; Storkey, A.; and Klimov, O. 2019b. Exploration by random network distillation. In Proceedings of the 7th International Conference on Learning Representations.
  • Chevalier-Boisvert, Willems, and Pal (2018) Chevalier-Boisvert, M.; Willems, L.; and Pal, S. 2018. Minimalistic Gridworld Environment for OpenAI Gym. https://github.com/maximecb/gym-minigrid.
  • Conti et al. (2018) Conti, E.; Madhavan, V.; Such, F. P.; Lehman, J.; Stanley, K.; and Clune, J. 2018. Improving exploration in evolution strategies for deep reinforcement learning via a population of novelty-seeking agents. In Advances in Neural Information Processing Systems, 5027–5038.
  • Dann, Lattimore, and Brunskill (2017) Dann, C.; Lattimore, T.; and Brunskill, E. 2017. Unifying PAC and regret: Uniform PAC bounds for episodic reinforcement learning. In Advances in Neural Information Processing Systems, 5713–5723.
  • de Abril and Kanai (2018) de Abril, I. M.; and Kanai, R. 2018. Curiosity-driven reinforcement learning with homeostatic regulation. In Proceedings of the International Joint Conference on Neural Networks (IJCNN), 1–6. IEEE.
  • Duan et al. (2016) Duan, Y.; Schulman, J.; Chen, X.; Bartlett, P. L.; Sutskever, I.; and Abbeel, P. 2016. RL2: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779 .
  • Ecoffet et al. (2019) Ecoffet, A.; Huizinga, J.; Lehman, J.; Stanley, K. O.; and Clune, J. 2019. Go-explore: a new approach for hard-exploration problems. arXiv preprint arXiv:1901.10995 .
  • Eysenbach et al. (2019) Eysenbach, B.; Gupta, A.; Ibarz, J.; and Levine, S. 2019. Diversity is all you need: Learning skills without a reward function. In Proceedings of the 7th International Conference on Learning Represention.
  • Finn, Abbeel, and Levine (2017) Finn, C.; Abbeel, P.; and Levine, S. 2017. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70, 1126–1135. JMLR.
  • Flajolet, Gardy, and Thimonier (1992) Flajolet, P.; Gardy, D.; and Thimonier, L. 1992. Birthday paradox, coupon collectors, caching algorithms and self-organizing search. Discrete Applied Mathematics 39(3): 207–229.
  • Fujimoto et al. (2019) Fujimoto, S.; Conti, E.; Ghavamzadeh, M.; and Pineau, J. 2019. Benchmarking Batch Deep Reinforcement Learning Algorithms. In Deep RL Workshop, Advances in Neural Information Processing Systems.
  • Fujimoto, Meger, and Precup (2019) Fujimoto, S.; Meger, D.; and Precup, D. 2019. Off-Policy Deep Reinforcement Learning without Exploration. In Proceedings of the 36th International Conference on Machine Learning, volume 97, 2052–2062. JMLR.
  • Gupta et al. (2018) Gupta, A.; Eysenbach, B.; Finn, C.; and Levine, S. 2018. Unsupervised meta-learning for reinforcement learning. arXiv preprint arXiv:1806.04640 .
  • Haarnoja et al. (2017) Haarnoja, T.; Tang, H.; Abbeel, P.; and Levine, S. 2017. Reinforcement learning with deep energy-based policies. In Proceedings of the 34th International Conference on Machine Learning, volume 70, 1352–1361. JMLR.
  • Haarnoja et al. (2018) Haarnoja, T.; Zhou, A.; Abbeel, P.; and Levine, S. 2018. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the 35th International Conference on Machine Learning, volume 80, 1861–1870. JMLR.
  • Hazan et al. (2019) Hazan, E.; Kakade, S.; Singh, K.; and Van Soest, A. 2019. Provably Efficient Maximum Entropy Exploration. In Proceedings of the 36th International Conference on Machine Learning, volume 97, 2681–2691. JMLR.
  • Houthooft et al. (2016) Houthooft, R.; Chen, X.; Duan, Y.; Schulman, J.; De Turck, F.; and Abbeel, P. 2016. Vime: Variational information maximizing exploration. In Advances in Neural Information Processing Systems, 1109–1117.
  • Islam et al. (2019) Islam, R.; Seraj, R.; Bacon, P.-L.; and Precup, D. 2019. Entropy Regularization with Discounted Future State Distribution in Policy Gradient Methods. In Optimization Foundations of RL Workshop, Advances in Neural Information Processing Systems.
  • Jin et al. (2020) Jin, C.; Krishnamurthy, A.; Simchowitz, M.; and Yu, T. 2020. Reward-Free Exploration for Reinforcement Learning. In Proceedings of the 37th International Conference on Machine Learning. JMLR.
  • Kingma and Welling (2014) Kingma, D. P.; and Welling, M. 2014. Auto-encoding variational bayes. In Proceedings of the 2nd International Conference on Learning Representations.
  • Lai and Robbins (1985) Lai, T. L.; and Robbins, H. 1985. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics 6(1): 4–22.
  • Lange, Gabel, and Riedmiller (2012) Lange, S.; Gabel, T.; and Riedmiller, M. 2012. Batch reinforcement learning. In Reinforcement learning, 45–73. Springer.
  • Lehman and Stanley (2011) Lehman, J.; and Stanley, K. O. 2011. Novelty search and the problem with objectives. In Genetic programming theory and practice IX, 37–56. Springer.
  • Maaten and Hinton (2008) Maaten, L. v. d.; and Hinton, G. 2008. Visualizing data using t-SNE. Journal of machine learning research 9(Nov): 2579–2605.
  • Machado et al. (2018) Machado, M. C.; Bellemare, M. G.; Talvitie, E.; Veness, J.; Hausknecht, M. J.; and Bowling, M. 2018. Revisiting the Arcade Learning Environment: Evaluation Protocols and Open Problems for General Agents. Journal of Artificial Intelligence Research 61: 523–562.
  • Misra et al. (2019) Misra, D.; Henaff, M.; Krishnamurthy, A.; and Langford, J. 2019. Kinematic State Abstraction and Provably Efficient Rich-Observation Reinforcement Learning. arXiv preprint arXiv:1911.05815 .
  • Mohamed and Rezende (2015) Mohamed, S.; and Rezende, D. J. 2015. Variational information maximisation for intrinsically motivated reinforcement learning. In Advances in Neural Information Processing Systems, 2125–2133.
  • Nachum et al. (2017) Nachum, O.; Norouzi, M.; Xu, K.; and Schuurmans, D. 2017. Bridging the gap between value and policy based reinforcement learning. In Advances in Neural Information Processing Systems, 2775–2785.
  • Nagabandi et al. (2019) Nagabandi, A.; Clavera, I.; Liu, S.; Fearing, R. S.; Abbeel, P.; Levine, S.; and Finn, C. 2019. Learning to adapt in dynamic, real-world environments through meta-reinforcement learning. In Proceedings of the 7th International Conference on Learning Representations.
  • Ostrovski et al. (2017) Ostrovski, G.; Bellemare, M. G.; van den Oord, A.; and Munos, R. 2017. Count-based exploration with neural density models. In Proceedings of the 34th International Conference on Machine Learning, volume 70, 2721–2730. JMLR.
  • Pathak et al. (2017) Pathak, D.; Agrawal, P.; Efros, A. A.; and Darrell, T. 2017. Curiosity-driven exploration by self-supervised prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 16–17.
  • Puterman (2014) Puterman, M. L. 2014. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons.
  • Sanei, Borzadaran, and Amini (2016) Sanei, M.; Borzadaran, G. R. M.; and Amini, M. 2016. Renyi entropy in continuous case is not the limit of discrete case. Mathematical Sciences and Applications E-Notes 4.
  • Schmidhuber (1991) Schmidhuber, J. 1991. A possibility for implementing curiosity and boredom in model-building neural controllers. In Proceedings of the international conference on simulation of adaptive behavior: From animals to animats, 222–227.
  • Schulman, Chen, and Abbeel (2017) Schulman, J.; Chen, X.; and Abbeel, P. 2017. Equivalence between policy gradients and soft q-learning. arXiv preprint arXiv:1704.06440 .
  • Schulman et al. (2016) Schulman, J.; Moritz, P.; Levine, S.; Jordan, M.; and Abbeel, P. 2016. High-dimensional continuous control using generalized advantage estimation. In Proceedings of the 4th International Conference of Learning Representations.
  • Schulman et al. (2017) Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; and Klimov, O. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 .
  • Silver (2015) Silver, D. 2015. Lecture 2: Markov Decision Processes. UCL. Retrieved from www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/MDP.pdf.
  • Strehl and Littman (2008) Strehl, A. L.; and Littman, M. L. 2008. An analysis of model-based interval estimation for Markov decision processes. Journal of Computer and System Sciences 74(8): 1309–1331.
  • Sutton and Barto (2018) Sutton, R. S.; and Barto, A. G. 2018. Reinforcement Learning: An Introduction. MIT press Cambridge.
  • Todorov, Erez, and Tassa (2012) Todorov, E.; Erez, T.; and Tassa, Y. 2012. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, 5026–5033. IEEE.
  • Wang et al. (2016) Wang, J. X.; Kurth-Nelson, Z.; Tirumala, D.; Soyer, H.; Leibo, J. Z.; Munos, R.; Blundell, C.; Kumaran, D.; and Botvinick, M. 2016. Learning to reinforcement learn. arXiv preprint arXiv:1611.05763 .
  • Williams (1992) Williams, R. J. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8(3-4): 229–256.

The toy example in Figure 1

In this example, the initial state is fixed to be $s_{1}$. Due to the simplicity of this toy example, we can solve for the policy that maximizes the Rényi entropy of the state-action distribution with $\gamma\to 1$ and $\alpha=0.5$. The optimal policy is

$$\pi=\begin{bmatrix}0.321&0.276&0.294&0.401&0.381\\ 0.679&0.724&0.706&0.599&0.619\end{bmatrix},$$

where $\pi_{i,j}$ represents the probability of choosing $a_{i}$ on $s_{j}$. The corresponding state-action distribution is

$$d_{\mu}^{\pi}=\begin{bmatrix}0.107&0.062&0.047&0.045&0.065\\ 0.226&0.162&0.113&0.067&0.106\end{bmatrix},$$

where the $(i,j)$-th element represents the probability density of the state-action distribution on $(s_{j},a_{i})$. Using equation (1), we obtain $G(d_{\mu}^{\pi})=43.14$, which is the expected number of samples that must be collected from this distribution before the dataset contains all the state-action pairs.

An example to illustrate the advantage of using Rényi entropy

Although we did not theoretically justify in the main text that Rényi entropy serves as a better surrogate for $G$ than Shannon entropy, here we illustrate that this may hold in general. Specifically, we try to show that the corresponding $G$ of the policy that maximizes Rényi entropy with small $\alpha$ is always smaller than that of the policy that maximizes Shannon entropy. Notice that, when the dynamics is known to be deterministic, $G$ indicates how many samples are needed in expectation to ensure successful planning. In the stochastic setting, $G$ also serves as a good indicator since it is a lower bound on the number of samples needed for successful planning.

To show that Rényi entropy is a good surrogate function, we use brute-force search to check whether there exists a CMP on which Shannon entropy is better than Rényi entropy in terms of $G$. As an illustration, we only consider CMPs with two states and two actions. Specifically, for this family of CMPs, there are 5 free parameters to be searched: 4 parameters for the transition dynamics and 1 parameter for the initial state distribution. We set the search step size to 0.1 for each parameter and use $\alpha=0.1$, $\gamma=0.9$. The result shows that, out of the $10^{5}$ CMPs searched, there does not exist any CMP on which Shannon entropy is better than Rényi entropy in terms of $G$.

To further investigate, we consider a specific CMP. The transition matrix $P\in\mathbb{R}^{SA\times S}$ of this CMP is

$$P=\begin{bmatrix}1.0&0.0\\ 0.9&0.1\\ 1.0&0.0\\ 0.4&0.6\end{bmatrix},$$

where the rows correspond to $(s_{1},a_{1}),(s_{1},a_{2}),(s_{2},a_{1}),(s_{2},a_{2})$ respectively and the columns correspond to the next states ($s_{1}$ and $s_{2}$ respectively). The initial state distribution of this CMP is

$$\mu=\begin{bmatrix}1.0&0.0\end{bmatrix}.$$

For this CMP, the $G$ value of the policy that maximizes Shannon entropy is 88, while the values are only 39 and 47 for the policies that maximize Rényi entropy with $\alpha=0.1$ and $\alpha=0.5$ respectively. The optimal $G$ value is 32. This indicates that we may need 2x more samples to accurately estimate the dynamics of the CMP when maximizing Shannon entropy. We observe that $s_{2}$ is much harder to reach than $s_{1}$ on this CMP: the initial state is always $s_{1}$ and taking action $a_{1}$ always leads back to $s_{1}$. In order to visit $s_{2}$ more, policies should take $a_{2}$ with a high probability on $s_{1}$. In fact, the policy that maximizes Shannon entropy takes $a_{2}$ on $s_{1}$ with a probability of only 0.58, while the policies that maximize Rényi entropy with $\alpha=0.1$ and $\alpha=0.5$ take $a_{2}$ on $s_{1}$ with probabilities 0.71 and 0.64 respectively.

This example illustrates a property that holds widely across CMPs: a good objective should encourage the policy to visit the hard-to-reach state-action pairs as frequently as possible, and Rényi entropy does a better job than Shannon entropy in this respect.

Proof of Theorem 3.1

We provide Theorem 3.1 to justify our objective that maximizes the entropy over the state-action space in the exploration phase for the reward-free RL framework. In the following analysis, we consider the standard episodic and tabular case as follows: The MDP is defined as \langle\mathcal{S},\mathcal{A},H,\mathbb{P},r,\mu\rangle, where \mathcal{S} is the state space with S=|\mathcal{S}|, \mathcal{A} is the action space with A=|\mathcal{A}|, H is the length of each episode, \mathbb{P}=\{\mathbb{P}_{h}:h\in[H]\} is the set of unknown transition matrices, r=\{r_{h}:h\in[H]\} is the set of reward functions and \mu is the unknown initial state distribution. We denote by \mathbb{P}_{h}(s^{\prime}|s,a) the transition probability from (s,a) to s^{\prime} on the h-th step. We assume that the instant reward r_{h}(s,a)\in[0,1] for taking action a on state s on the h-th step is deterministic. A policy \pi:\mathcal{S}\times[H]\to\Delta^{\mathcal{A}} specifies the probability of choosing the actions on each state and on each step. We denote the set of all such policies as \Pi.

The agent with policy \pi interacts with the MDP as follows: At the start of each episode, an initial state s_{1} is sampled from the distribution \mu. Next, at each step h\in[H], the agent first observes the state s_{h}\in\mathcal{S} and then selects an action a_{h} from the distribution \pi(s_{h},h). The environment returns the reward r_{h}(s_{h},a_{h})\in[0,1] and transits to the next state s_{h+1}\in\mathcal{S} according to the probability \mathbb{P}_{h}(\cdot|s_{h},a_{h}). The state-action distribution induced by \pi on the h-th step is d_{h}^{\pi}(s,a):=\text{Pr}(s_{h}=s,a_{h}=a|s_{1}\sim\mu;\pi).

The state value function for a policy \pi is defined as V_{h}^{\pi}(s;r):=\mathbb{E}\left[\sum_{t=h}^{H}r_{t}(s_{t},a_{t})|s_{h}=s,\pi\right]. The objective is to find a policy \pi that maximizes the cumulative reward J(\pi;r):=\mathbb{E}_{s_{1}\sim\mu}[V_{1}^{\pi}(s_{1};r)].

The proof goes through in a similar way to that of Jin et al. (2020).

Algorithm 2 The planning algorithm

Input: A dataset of transitions \mathcal{M}; the reward function r; the level of accuracy \epsilon.

1: for all (s,a,s^{\prime},h)\in\mathcal{S}\times\mathcal{A}\times\mathcal{S}\times[H] do
2:     N_{h}(s,a,s^{\prime})\leftarrow\sum_{(s_{h},a_{h},s_{h+1})\in\mathcal{M}}\mathbb{I}[s_{h}=s,a_{h}=a,s_{h+1}=s^{\prime}]
3:     N_{h}(s,a)\leftarrow\sum_{s^{\prime}}N_{h}(s,a,s^{\prime})
4:     \hat{\mathbb{P}}_{h}(s^{\prime}|s,a)\leftarrow N_{h}(s,a,s^{\prime})/N_{h}(s,a)
5: \hat{\pi}\leftarrow\text{APPROXIMATE-MDP-SOLVER}(\hat{\mathbb{P}},r,\epsilon)
6: return \hat{\pi}
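For concreteness, lines 1-4 of Algorithm 2 can be transcribed directly. The sketch below is ours for illustration and assumes the dataset is stored as integer-indexed tuples (h, s, a, s'); the pseudocode leaves unvisited (s,a,h) triples implicit, so we default them to a uniform estimate.

```python
import numpy as np

def estimate_transitions(dataset, S, A, H):
    """Empirical transition tensor P_hat[h, s, a, s'] from transition tuples
    (h, s, a, s_next), mirroring lines 1-4 of Algorithm 2."""
    counts = np.zeros((H, S, A, S))
    for h, s, a, s_next in dataset:
        counts[h, s, a, s_next] += 1
    totals = counts.sum(axis=-1, keepdims=True)
    # Divide counts by totals where data exists; fall back to uniform otherwise.
    return np.where(totals > 0, counts / np.maximum(totals, 1), 1.0 / S)
```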
Proof of Theorem 3.1.

Given the dataset \mathcal{M} specified in the theorem, a reward function r and the level of accuracy \epsilon, we use the planning algorithm shown in Algorithm 2 to obtain a near-optimal policy \hat{\pi}. The planning algorithm first estimates a transition matrix \hat{\mathbb{P}} based on \mathcal{M} and then solves for a policy based on the estimated transition matrix \hat{\mathbb{P}} and the reward function r.

Define \hat{J}(\pi;r):=\mathbb{E}_{s_{1}\sim\mu}[\hat{V}^{\pi}_{1}(s_{1};r)], where \hat{V} is the value function on the MDP with the transition dynamics \hat{\mathbb{P}}. We first decompose the error into the following terms:

J(\pi^{*};r)-J(\hat{\pi};r)\leq\underbrace{|\hat{J}(\pi^{*};r)-J(\pi^{*};r)|}_{\text{evaluation error 1}}+\underbrace{(\hat{J}(\pi^{*};r)-\hat{J}(\hat{\pi}^{*};r))}_{\leq 0\text{ by definition}}+\underbrace{(\hat{J}(\hat{\pi}^{*};r)-\hat{J}(\hat{\pi};r))}_{\text{optimization error}}+\underbrace{|\hat{J}(\hat{\pi};r)-J(\hat{\pi};r)|}_{\text{evaluation error 2}}

Then, the proof of the theorem goes as follows:

To bound the two evaluation error terms, we first present Lemma .4 to show that the policy which maximizes Rényi entropy is able to visit the state-action space reasonably uniformly, by leveraging the convexity of the feasible region in the state-action distribution space (Lemma .2). Then, with Lemma .4 and the concentration inequality in Lemma .5, we show that the two evaluation error terms can be bounded by \epsilon for any policy and any reward function (Lemma .7).

To bound the optimization error term, we use the natural policy gradient (NPG) algorithm as the APPROXIMATE-MDP-SOLVER in Algorithm 2 to solve for a near-optimal policy on the estimated \hat{\mathbb{P}} and the given reward function r. Finally, we apply the optimization bound for the NPG algorithm (Agarwal et al. 2019) to bound the optimization error term (Lemma .8).

Definition .1 (\mathcal{K}_{h}).

\mathcal{K}_{h} is defined as the feasible region in the state-action distribution space, \mathcal{K}_{h}:=\{d_{h}^{\pi}:\pi\in\Pi\}\subset\Delta^{n}, \forall h\in[H], where n=SA is the cardinality of the state-action space.

Hazan et al. (2019) proved the convexity of such feasible regions in the infinite-horizon and discounted setting. For completeness, we provide a similar proof of the convexity of \mathcal{K}_{h} in the episodic setting.

Lemma .2.

[Convexity of \mathcal{K}_{h}] \mathcal{K}_{h} is convex. Namely, for any \pi_{1},\pi_{2}\in\Pi and h\in[H], denoting d_{1}=d_{h}^{\pi_{1}} and d_{2}=d_{h}^{\pi_{2}}, for any \lambda\in[0,1] there exists a policy \pi\in\Pi such that d:=\lambda d_{1}+(1-\lambda)d_{2}=d_{h}^{\pi}\in\mathcal{K}_{h}.

Proof of Lemma .2.

Define a mixture policy \pi^{\prime} that first chooses from \{\pi_{1},\pi_{2}\} with probability \lambda and 1-\lambda respectively at the start of each episode, and then executes the chosen policy through the episode. Define the state distribution (or the state-action distribution) for this mixture policy on each layer as d_{h}^{\pi^{\prime}},\forall h\in[H]. Obviously, d=d_{h}^{\pi^{\prime}}. For any mixture policy \pi^{\prime}, we can construct a policy \pi:\mathcal{S}\times[H]\to\Delta^{\mathcal{A}} with \pi(a|s,h)=\dfrac{d_{h}^{\pi^{\prime}}(s,a)}{d_{h}^{\pi^{\prime}}(s)} such that d_{h}^{\pi}=d_{h}^{\pi^{\prime}} (Puterman 2014). In this way, we find a policy \pi such that d_{h}^{\pi}=d. ∎

Similar to Jin et al. (2020), we define \delta-significance for state-action pairs on each step and show that the policy that maximizes Rényi entropy visits the significant state-action pairs reasonably uniformly.

Definition .3 (\delta-significance).

A state-action pair (s,a) on the h-th step is \delta-significant if there exists a policy \pi\in\Pi such that the probability of reaching (s,a) following \pi is at least \delta, i.e., \max_{\pi}d_{h}^{\pi}(s,a)\geq\delta.

Recall the way we construct the dataset \mathcal{M}: with the set of policies \varpi:=\{\pi^{(h)}\}_{h=1}^{H}, where \pi^{(h)}:\mathcal{S}\times[H]\to\Delta^{\mathcal{A}} and \pi^{(h)}\in\arg\max_{\pi}H_{\alpha}(d_{h}^{\pi}),\forall h\in[H], we first uniformly randomly choose a policy \pi from \varpi at the start of each episode, and then execute \pi through the episode. Note that there is a set of policies that maximize the Rényi entropy on the h-th layer, since the policy on the subsequent layers does not affect the entropy on the h-th layer. We denote the induced state-action distribution on the h-th step as \mu_{h}, i.e., the dataset \mathcal{M} can be regarded as being sampled from \mu_{h} for each step h\in[H]. Therefore, \mu_{h}(s,a)=\frac{1}{H}\sum_{t=1}^{H}d_{h}^{\pi^{(t)}}(s,a),\forall s\in\mathcal{S},a\in\mathcal{A},h\in[H].

Lemma .4.

If 0<\alpha<1, then

\forall\delta\text{-significant }(s,a,h),\ \dfrac{\max_{\pi}d_{h}^{\pi}(s,a)}{\mu_{h}(s,a)}\leq\dfrac{SAH}{\delta^{\alpha/(1-\alpha)}}. (7)
Proof of Lemma .4.

For any \delta-significant (s,a,h), consider the policy \pi^{\prime}\in\arg\max_{\pi}d_{h}^{\pi}(s,a). Denote x:=d_{h}^{\pi^{\prime}}\in\Delta(\mathcal{S}\times\mathcal{A}). We can treat x as a vector with n:=SA dimensions and use the first dimension to represent (s,a), i.e., x_{1}=d_{h}^{\pi^{\prime}}(s,a). Denote z:=d_{h}^{\pi^{(h)}}. Since \pi^{(h)}\in\arg\max_{\pi}H_{\alpha}(d_{h}^{\pi}), we have z=\arg\max_{d\in\mathcal{K}_{h}}H_{\alpha}(d). By Lemma .2, y_{\lambda}:=(1-\lambda)x+\lambda z\in\mathcal{K}_{h},\forall\lambda\in[0,1]. Since H_{\alpha}(\cdot) is concave over \Delta^{n} when 0<\alpha<1 and is maximized over this segment at \lambda=1, H_{\alpha}(y_{\lambda}) is non-decreasing as \lambda increases from 0 to 1, i.e.,

\dfrac{\partial H_{\alpha}(y_{\lambda})}{\partial\lambda}\geq 0,\quad\forall\lambda\in[0,1].

In particular, this holds at \lambda=1, i.e., when y_{\lambda}=z, so we have

\dfrac{\partial H_{\alpha}(y_{\lambda})}{\partial\lambda}\Big{|}_{\lambda=1}=\frac{\alpha}{1-\alpha}\frac{\sum_{i=1}^{n}\left((1-\lambda)x_{i}+\lambda z_{i}\right)^{\alpha-1}(z_{i}-x_{i})}{\sum_{i=1}^{n}\left((1-\lambda)x_{i}+\lambda z_{i}\right)^{\alpha}}\Big{|}_{\lambda=1}=\frac{\alpha}{1-\alpha}\frac{\sum_{i=1}^{n}z_{i}^{\alpha-1}(z_{i}-x_{i})}{\sum_{i=1}^{n}z_{i}^{\alpha}}\propto\sum_{i=1}^{n}z_{i}^{\alpha}-\sum_{i=1}^{n}z_{i}^{\alpha-1}x_{i}\geq 0.

Combining this with the fact that x_{i},z_{i} are non-negative, we have

\sum_{i=1}^{n}z_{i}^{\alpha}\geq\sum_{i=1}^{n}x_{i}z_{i}^{\alpha-1}=\sum_{i=1}^{n}\left(\dfrac{x_{i}}{z_{i}}\right)^{1-\alpha}x_{i}^{\alpha}\geq\left(\frac{x_{1}}{z_{1}}\right)^{1-\alpha}x_{1}^{\alpha}.

Noting that x_{1}\geq\delta>0, we have

\dfrac{x_{1}}{z_{1}}\leq\left(\dfrac{\sum_{i=1}^{n}z_{i}^{\alpha}}{x_{1}^{\alpha}}\right)^{1/(1-\alpha)}\leq\left(\dfrac{n(\frac{1}{n})^{\alpha}}{\delta^{\alpha}}\right)^{1/(1-\alpha)}=\dfrac{n}{\delta^{\alpha/(1-\alpha)}}. (8)

Here the second inequality in (8) uses \sum_{i=1}^{n}z_{i}^{\alpha}\leq n(\frac{1}{n})^{\alpha}, which holds since the sum is maximized at the uniform distribution when 0<\alpha<1. Since \mu_{h}(s,a)=\frac{1}{H}\sum_{t=1}^{H}d_{h}^{\pi^{(t)}}(s,a),\forall s\in\mathcal{S},a\in\mathcal{A},h\in[H], we have \mu_{h}(s,a)\geq\frac{1}{H}d_{h}^{\pi^{(h)}}(s,a)=\frac{1}{H}z_{1}. Using the fact that x_{1}=\max_{\pi}d_{h}^{\pi}(s,a), we obtain the inequality in (7). ∎

When \alpha\to 0^{+}, we have

\forall\delta\text{-significant }(s,a,h),\ \dfrac{\max_{\pi}d_{h}^{\pi}(s,a)}{\mu_{h}(s,a)}\leq SAH.

When \alpha=\frac{1}{2},

\forall\delta\text{-significant }(s,a,h),\ \dfrac{\max_{\pi}d_{h}^{\pi}(s,a)}{\mu_{h}(s,a)}\leq SAH/\delta.
Lemma .5.

Suppose \hat{\mathbb{P}}_{h} is the empirical transition matrix estimated from M samples drawn i.i.d. from \mu_{h} and \mathbb{P}_{h} is the true transition matrix. Then with probability at least 1-p, for any h\in[H] and any value function G:\mathcal{S}\to[0,H], we have:

\mathbb{E}_{(s,a)\sim\mu_{h}}|(\hat{\mathbb{P}}_{h}-\mathbb{P}_{h})G(s,a)|^{2}\leq O\left(\dfrac{H^{2}S}{M}\log\left(\dfrac{HM}{p}\right)\right). (9)
Proof of Lemma .5.

The lemma is proved in a similar way to Lemma C.2 in Jin et al. (2020), using a concentration inequality and a union bound. The proof differs in that we do not need to count the different events on the action space in the union bound, since the state-action distribution \mu_{h} is given here (instead of only the state distribution). As a result, a factor of A is removed from the logarithmic term compared with their lemma. ∎

Lemma .6 (Lemma E.15 in Dann, Lattimore, and Brunskill (2017)).

For any two MDPs M^{\prime} and M^{\prime\prime} with rewards r^{\prime} and r^{\prime\prime} and transition probabilities \mathbb{P}^{\prime} and \mathbb{P}^{\prime\prime}, the difference in values V^{\prime} and V^{\prime\prime} with respect to the same policy \pi is

V^{\prime}_{h}(s)-V^{\prime\prime}_{h}(s)=\mathbb{E}_{M^{\prime\prime},\pi}\left[\left.\sum_{i=h}^{H}[r^{\prime}_{i}(s_{i},a_{i})-r^{\prime\prime}_{i}(s_{i},a_{i})+(\mathbb{P}^{\prime}_{i}-\mathbb{P}^{\prime\prime}_{i})V^{\prime}_{i+1}(s_{i},a_{i})]\right|s_{h}=s\right].
Lemma .7.

Under the conditions of Theorem 3.1, with probability at least 1-p, for any reward function r and any policy \pi, we have:

|\hat{J}(\pi;r)-J(\pi;r)|\leq\epsilon. (10)
Proof of Lemma .7.

The proof is similar to that of Lemma 3.6 in Jin et al. (2020) but differs in several details, so we provide it for completeness. The proof is based on the following observations: 1) the total contribution of all insignificant state-action pairs is small; 2) \hat{\mathbb{P}} is reasonably accurate for significant state-action pairs due to the concentration inequality in Lemma .5.

Using Lemma .6 on M and \hat{M} with the true transition dynamics \mathbb{P} and the estimated transition dynamics \hat{\mathbb{P}} respectively, we have

|\hat{J}(\pi;r)-J(\pi;r)|=\left|\mathbb{E}_{s_{1}\sim\mathbb{P}_{1}}\left[\hat{V}_{1}^{\pi}(s_{1};r)-V_{1}^{\pi}(s_{1};r)\right]\right|\leq\left|\mathbb{E}_{\pi}\sum_{h=1}^{H}(\hat{\mathbb{P}}_{h}-\mathbb{P}_{h})\hat{V}_{h+1}^{\pi}(s_{h},a_{h})\right|\leq\mathbb{E}_{\pi}\sum_{h=1}^{H}\left|(\hat{\mathbb{P}}_{h}-\mathbb{P}_{h})\hat{V}_{h+1}^{\pi}(s_{h},a_{h})\right|.

Let \mathcal{S}_{h}^{\delta}=\{(s,a):\max_{\pi}d_{h}^{\pi}(s,a)\geq\delta\} be the set of \delta-significant state-action pairs on the h-th step. We have

\mathbb{E}_{\pi}|(\hat{\mathbb{P}}_{h}-\mathbb{P}_{h})\hat{V}_{h+1}^{\pi}(s_{h},a_{h})|\leq\underbrace{\sum_{(s,a)\in\mathcal{S}_{h}^{\delta}}|(\hat{\mathbb{P}}_{h}-\mathbb{P}_{h})\hat{V}^{\pi}_{h+1}(s,a)|d_{h}^{\pi}(s,a)}_{\xi_{h}}+\underbrace{\sum_{(s,a)\notin\mathcal{S}_{h}^{\delta}}|(\hat{\mathbb{P}}_{h}-\mathbb{P}_{h})\hat{V}^{\pi}_{h+1}(s,a)|d_{h}^{\pi}(s,a)}_{\zeta_{h}}.

By the definition of \delta-significant (s,a) pairs, we have

\zeta_{h}\leq H\sum_{(s,a)\notin\mathcal{S}_{h}^{\delta}}d_{h}^{\pi}(s,a)\leq H\sum_{(s,a)\notin\mathcal{S}_{h}^{\delta}}\delta\leq SAH\delta.

As for \xi_{h}, by the Cauchy-Schwarz inequality and Lemma .4, we have

\xi_{h}\leq\left[\sum_{(s,a)\in\mathcal{S}_{h}^{\delta}}|(\hat{\mathbb{P}}_{h}-\mathbb{P}_{h})\hat{V}_{h+1}^{\pi}(s,a)|^{2}d_{h}^{\pi}(s,a)\right]^{\frac{1}{2}}\leq\left[\sum_{(s,a)\in\mathcal{S}_{h}^{\delta}}|(\hat{\mathbb{P}}_{h}-\mathbb{P}_{h})\hat{V}_{h+1}^{\pi}(s,a)|^{2}\frac{SAH}{\delta^{\alpha/(1-\alpha)}}\mu_{h}(s,a)\right]^{\frac{1}{2}}\leq\left[\frac{SAH}{\delta^{\alpha/(1-\alpha)}}\mathbb{E}_{\mu_{h}}|(\hat{\mathbb{P}}_{h}-\mathbb{P}_{h})\hat{V}_{h+1}^{\pi}(s,a)|^{2}\right]^{\frac{1}{2}}.

By Lemma .5, we have

\mathbb{E}_{(s,a)\sim\mu_{h}}|(\hat{\mathbb{P}}_{h}-\mathbb{P}_{h})\hat{V}_{h+1}^{\pi}(s,a)|^{2}\leq O\left(\frac{H^{2}S}{M}\log\left(\frac{HM}{p}\right)\right).

Thus, we have

\mathbb{E}_{\pi}|(\hat{\mathbb{P}}_{h}-\mathbb{P}_{h})\hat{V}_{h+1}^{\pi}(s_{h},a_{h})|\leq\xi_{h}+\zeta_{h}\leq O\left(\sqrt{\frac{H^{3}S^{2}A}{M\delta^{\alpha/(1-\alpha)}}\log\left(\frac{HM}{p}\right)}\right)+SAH\delta.

Therefore, combining the inequalities above, we have

|\mathbb{E}_{s_{1}\sim\mathbb{P}_{1}}[\hat{V}_{1}^{\pi}(s_{1};r)-V_{1}^{\pi}(s_{1};r)]|\leq\mathbb{E}_{\pi}\sum_{h=1}^{H}|(\hat{\mathbb{P}}_{h}-\mathbb{P}_{h})\hat{V}_{h+1}^{\pi}(s_{h},a_{h})|\leq O\left(\sqrt{\frac{H^{5}S^{2}A}{M\delta^{\alpha/(1-\alpha)}}\log\left(\frac{HM}{p}\right)}\right)+H^{2}SA\delta\leq O\left(H^{2}SA\left(\frac{\sqrt{B}}{\delta^{\beta}}+\delta\right)\right),

where \beta=\frac{\alpha}{2(1-\alpha)} and B=\frac{H}{AM}\log(\frac{HM}{p}). Choosing \delta=(\sqrt{B})^{\frac{1}{\beta+1}}, it suffices for |\hat{J}(\pi;r)-J(\pi;r)|\leq\epsilon to set

M\geq O\left(\left(\frac{H^{2}SA}{\epsilon}\right)^{2(\beta+1)}\frac{H}{A}\log\left(\frac{SAH}{\epsilon p}\right)\right).

In particular, for \alpha\to 0^{+}, we have

|\hat{J}(\pi;r)-J(\pi;r)|\leq O\left(\sqrt{\dfrac{H^{5}S^{2}A}{M}\log\left(\dfrac{HM}{p}\right)}\right)+H^{2}SA\delta.

To establish (10), we can choose \delta=\epsilon/(2SAH^{2}) and set M\geq c\dfrac{H^{5}S^{2}A}{\epsilon^{2}}\log\left(\dfrac{SAH}{p\epsilon}\right) for a sufficiently large absolute constant c. ∎

We use the natural policy gradient (NPG) algorithm as the approximate MDP solver.

Lemma .8 (Agarwal et al. 2019; Jin et al. 2020).

With learning rate \eta and iteration number T, the output policy \pi^{(T)} of the NPG algorithm satisfies the following:

J(\pi^{*};r)-J(\pi^{(T)};r)\leq\dfrac{H\log A}{\eta T}+\eta H^{2} (11)

By choosing \eta=\sqrt{\log A/(HT)} and T=4H^{3}\log A/\epsilon^{2}, the policy \pi^{(T)} returned by NPG is \epsilon-optimal.
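To see why these choices suffice, substitute \eta=\sqrt{\log A/(HT)} into (11):

\dfrac{H\log A}{\eta T}+\eta H^{2}=\sqrt{\dfrac{H^{3}\log A}{T}}+\sqrt{\dfrac{H^{3}\log A}{T}}=2\sqrt{\dfrac{H^{3}\log A}{T}},

which is at most \epsilon whenever T\geq 4H^{3}\log A/\epsilon^{2}.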

Policy gradient of state-action space entropy

In the following derivation, we use summation notation over the states or the state-action pairs; the results extend directly to continuous state-action spaces by replacing summations with integrals. Note that, although the Rényi entropy for continuous distributions is not the limit of the Rényi entropy for discrete distributions (since they differ by a constant (Sanei, Borzadaran, and Amini 2016)), their gradients share the same form. Therefore, the derivation still holds for the continuous case.

Lemma .1 (Gradient of discounted state visitation distribution).

A policy \pi is parameterized by \theta. The gradient of the discounted state visitation distribution induced by the policy w.r.t. the parameter \theta is

\nabla_{\theta}d_{\mu}^{\pi}(s^{\prime})=\frac{1}{1-\gamma}\mathbb{E}_{(s,a)\sim d_{\mu}^{\pi}}[\nabla_{\theta}\log\pi(a|s)d_{s,a}^{\pi}(s^{\prime})] (12)
Proof.

The discounted state visitation distribution starting from a state-action pair can be written recursively in terms of the distribution starting from the next state-action pair:

d_{s_{0},a_{0}}^{\pi}(s^{\prime})=(1-\gamma)\mathbb{I}(s_{0}=s^{\prime})+\gamma\sum_{s_{1},a_{1}}P^{\pi}(s_{1},a_{1}|s_{0},a_{0})d_{s_{1},a_{1}}^{\pi}(s^{\prime}),

where P^{\pi}(s^{\prime},a^{\prime}|s,a):=\mathbb{P}(s^{\prime}|s,a)\pi(a^{\prime}|s^{\prime}).

The proof then unrolls this recursion on the discounted state visitation distribution as follows:

\nabla_{\theta}d^{\pi}_{s_{0},a_{0}}(s^{\prime})=\sum_{s_{1},a_{1}}\gamma\nabla_{\theta}\left[P(s_{1}|s_{0},a_{0})\pi(a_{1}|s_{1})d_{s_{1},a_{1}}^{\pi}(s^{\prime})\right]=\sum_{s_{1},a_{1}}\gamma P(s_{1}|s_{0},a_{0})\left[\nabla_{\theta}\pi(a_{1}|s_{1})d_{s_{1},a_{1}}^{\pi}(s^{\prime})+\pi(a_{1}|s_{1})\nabla_{\theta}d_{s_{1},a_{1}}^{\pi}(s^{\prime})\right]
=\sum_{s_{1},a_{1}}\gamma P^{\pi}(s_{1},a_{1}|s_{0},a_{0})\left[\nabla_{\theta}\log\pi(a_{1}|s_{1})d_{s_{1},a_{1}}^{\pi}(s^{\prime})+\nabla_{\theta}d_{s_{1},a_{1}}^{\pi}(s^{\prime})\right]\quad\text{(unroll)}
=\sum_{s,a}\sum_{t=1}^{\infty}\gamma^{t}P(s_{t}=s,a_{t}=a|s_{0},a_{0};\pi)\nabla_{\theta}\log\pi(a|s)d_{s,a}^{\pi}(s^{\prime})
=\frac{1}{1-\gamma}\mathbb{E}_{(s,a)\sim d_{s_{0},a_{0}}^{\pi}}[\nabla_{\theta}\log\pi(a|s)d_{s,a}^{\pi}(s^{\prime})]-\nabla_{\theta}\log\pi(a_{0}|s_{0})d_{s_{0},a_{0}}^{\pi}(s^{\prime}).

Finally,

\nabla_{\theta}d_{\mu}^{\pi}(s^{\prime})=\mathbb{E}_{s_{0}\sim\mu,a_{0}\sim\pi(\cdot|s_{0})}[\nabla_{\theta}\log\pi(a_{0}|s_{0})d_{s_{0},a_{0}}^{\pi}(s^{\prime})+\nabla_{\theta}d_{s_{0},a_{0}}^{\pi}(s^{\prime})]=\frac{1}{1-\gamma}\mathbb{E}_{(s,a)\sim d_{\mu}^{\pi}}[\nabla_{\theta}\log\pi(a|s)d_{s,a}^{\pi}(s^{\prime})]. ∎

Lemma .2 (Policy gradient of the Shannon entropy over the state-action space).

Given a policy \pi parameterized by \theta, the gradient of H_{1}(d_{\mu}^{\pi}) w.r.t. the parameter \theta is as follows:

\nabla_{\theta}H_{1}(d_{\mu}^{\pi})=\mathbb{E}_{(s,a)\sim d_{\mu}^{\pi}}\left[\nabla_{\theta}\log\pi(a|s)\left(\frac{1}{1-\gamma}\langle d_{s,a}^{\pi},-\log d_{\mu}^{\pi}\rangle-\log d_{\mu}^{\pi}(s,a)\right)\right]. (13)
Proof.

First, the gradient of the Shannon entropy is \nabla_{d}H_{1}(d)=-\log d-c, where d is the state-action distribution and c is a constant whose value is immaterial here since \sum_{s,a}\nabla_{\theta}d_{\mu}^{\pi}(s,a)=0. Then, we observe that \nabla_{\theta}d_{\mu}^{\pi}(s,a)=\nabla_{\theta}[d_{\mu}^{\pi}(s)\pi(a|s)]=\nabla_{\theta}d_{\mu}^{\pi}(s)\pi(a|s)+d_{\mu}^{\pi}(s,a)\nabla_{\theta}\log\pi(a|s).

Using the standard chain rule and Lemma .1, we can obtain the policy gradient:

\nabla_{\theta}H_{1}(d_{\mu}^{\pi})=\sum_{s,a}\nabla_{\theta}d_{\mu}^{\pi}(s,a)\dfrac{\partial H_{1}(d_{\mu}^{\pi})}{\partial d_{\mu}^{\pi}(s,a)}=\sum_{s,a}\nabla_{\theta}d_{\mu}^{\pi}(s,a)(-\log d_{\mu}^{\pi}(s,a)-c)=-\sum_{s,a}\nabla_{\theta}d_{\mu}^{\pi}(s,a)\log d_{\mu}^{\pi}(s,a)
=-\frac{1}{1-\gamma}\sum_{s^{\prime},a^{\prime}}d_{\mu}^{\pi}(s^{\prime},a^{\prime})\nabla_{\theta}\log\pi(a^{\prime}|s^{\prime})\sum_{s,a}d_{s^{\prime},a^{\prime}}^{\pi}(s,a)\log d_{\mu}^{\pi}(s,a)-\sum_{s,a}d_{\mu}^{\pi}(s,a)\nabla_{\theta}\log\pi(a|s)\log d_{\mu}^{\pi}(s,a)
=\mathbb{E}_{(s,a)\sim d_{\mu}^{\pi}}\left[\nabla_{\theta}\log\pi(a|s)\left(\frac{1}{1-\gamma}\langle d_{s,a}^{\pi},-\log d_{\mu}^{\pi}\rangle-\log d_{\mu}^{\pi}(s,a)\right)\right],

where \langle d_{s,a}^{\pi},-\log d_{\mu}^{\pi}\rangle is the cross-entropy between the two distributions. We use the fact \mathbb{E}_{x\sim p}\left[c\nabla_{\theta}\log p(x)\right]=0 in the first line. ∎

Lemma .3 (Policy gradient of the Rényi entropy over the state-action space).

Given a policy \pi parameterized by \theta and 0<\alpha<1, the gradient of H_{\alpha}(d_{\mu}^{\pi}) w.r.t. the parameter \theta is as follows:

\nabla_{\theta}H_{\alpha}(d_{\mu}^{\pi})\propto\frac{\alpha}{1-\alpha}\mathbb{E}_{(s,a)\sim d_{\mu}^{\pi}}\left[\nabla_{\theta}\log\pi(a|s)\left(\frac{1}{1-\gamma}\langle d_{s,a}^{\pi},(d_{\mu}^{\pi})^{\alpha-1}\rangle+\left(d_{\mu}^{\pi}(s,a)\right)^{\alpha-1}\right)\right]. (14)
Proof.

First, we note that the gradient of Rényi entropy is

\nabla_{d}H_{\alpha}(d)=\dfrac{\alpha}{1-\alpha}\dfrac{d^{\alpha-1}-c}{\sum_{i=1}^{n}d_{i}^{\alpha}},

where c is a constant normalizer. Given a vector d=(d_{1},\cdots,d_{n})^{T} and a real number a, we use d^{a} to denote the vector (d_{1}^{a},\cdots,d_{n}^{a})^{T}. Similarly, using the chain rule and Lemma .1, we have

\nabla_{\theta}H_{\alpha}(d_{\mu}^{\pi})=\sum_{s,a}\nabla_{\theta}d_{\mu}^{\pi}(s,a)\dfrac{\partial H_{\alpha}(d_{\mu}^{\pi})}{\partial d_{\mu}^{\pi}(s,a)}\propto\frac{\alpha}{1-\alpha}\sum_{s,a}\nabla_{\theta}d_{\mu}^{\pi}(s,a)\left[(d_{\mu}^{\pi}(s,a))^{\alpha-1}-c\right]=\frac{\alpha}{1-\alpha}\sum_{s,a}\nabla_{\theta}d_{\mu}^{\pi}(s,a)(d_{\mu}^{\pi}(s,a))^{\alpha-1}
=\frac{\alpha}{1-\alpha}\bigg{(}\frac{1}{1-\gamma}\sum_{s^{\prime},a^{\prime}}d_{\mu}^{\pi}(s^{\prime},a^{\prime})\nabla_{\theta}\log\pi(a^{\prime}|s^{\prime})\sum_{s,a}d_{s^{\prime},a^{\prime}}^{\pi}(s,a)(d_{\mu}^{\pi}(s,a))^{\alpha-1}+\sum_{s,a}d_{\mu}^{\pi}(s,a)\nabla_{\theta}\log\pi(a|s)(d_{\mu}^{\pi}(s,a))^{\alpha-1}\bigg{)}
=\frac{\alpha}{1-\alpha}\mathbb{E}_{(s,a)\sim d_{\mu}^{\pi}}\left[\nabla_{\theta}\log\pi(a|s)\left(\frac{1}{1-\gamma}\langle d_{s,a}^{\pi},(d_{\mu}^{\pi})^{\alpha-1}\rangle+\left(d_{\mu}^{\pi}(s,a)\right)^{\alpha-1}\right)\right].

We should note that when \alpha\to 0, \nabla_{d}H_{\alpha}(d)\to 0; and when \alpha\to 1, \nabla_{d}H_{\alpha}(d)\to-\log d. In particular, we can write down the case \alpha=\frac{1}{2}:

\nabla_{\theta}H_{\alpha}(d_{\mu}^{\pi})=\mathbb{E}_{s,a\sim d_{\mu}^{\pi}}\left[\nabla_{\theta}\log\pi(a|s)\left(\frac{1}{1-\gamma}\langle d_{s,a}^{\pi},\sqrt{\dfrac{1}{d_{\mu}^{\pi}}}\rangle+\sqrt{\dfrac{1}{d_{\mu}^{\pi}(s,a)}}\right)\right] (15)

This leads to a policy gradient update with reward \propto(d_{\mu}^{\pi})^{-1/2}. This inverse-square-root style reward is frequently used as an exploration bonus in standard RL problems (Strehl and Littman 2008; Bellemare et al. 2016) and is similar to the UCB algorithm (Auer, Cesa-Bianchi, and Fischer 2002; Lai and Robbins 1985) in the bandit problem.

Practical implementation

Recall the policy gradient formulations of the entropies on a CMP derived in (13) and (14). We observe that if we set the reward function r(s,a)=(d_{\mu}^{\pi}(s,a))^{\alpha-1} (when 0<\alpha<1) or r(s,a)=-\log d_{\mu}^{\pi}(s,a) (when \alpha\to 1), these policy gradient formulations can be compared with the policy gradient for a standard MDP: the first term resembles the policy gradient of the cumulative reward, \mathbb{E}_{(s,a)\sim d_{\mu}^{\pi}}[\nabla_{\theta}\log\pi(a|s)Q^{\pi}(s,a)], and the second term resembles the policy gradient of the instant reward, \mathbb{E}_{(s,a)\sim d_{\mu}^{\pi}}[\nabla_{\theta}\log\pi(a|s)r(s,a)].

For the first term, a wide range of existing policy-gradient-based algorithms can be used to estimate it. Since algorithms such as PPO do not preserve the scale of the policy gradient, we introduce a coefficient \eta to adjust the scale between this term and the second term.

For the second term, when \alpha=1, it can be reduced to

\mathbb{E}_{(s,a)\sim d_{\mu}^{\pi}}[\nabla_{\theta}\log\pi(a|s)(-\log d_{\mu}^{\pi}(s,a))]=\nabla_{\theta}\mathbb{E}_{s\sim d_{\mu}^{\pi}}[H_{1}(\pi(\cdot|s))],

where H_{1}(\pi(\cdot|s)) denotes the Shannon entropy of the action distribution conditioned on the state sample s. Although the two sides are equivalent in expectation, the right-hand side leads to smaller variance when estimated from samples, for two reasons. First, the left-hand side is calculated with the values of a density estimator (that estimates d_{\mu}^{\pi}) on the state-action samples, and the density estimator suffers from high variance; in contrast, the right-hand side does not involve density estimation. Second, to estimate the expectation with samples, the left-hand side uses both the state samples and the action samples, while the right-hand side uses only the state samples and the action distributions conditioned on these states, which further reduces the variance of the estimate. When 0<\alpha<1, since the second term in equation (2) can hardly be reduced to such a simple form, we still use \nabla_{\theta}\mathbb{E}_{s\sim d_{\mu}^{\pi}}[H_{1}(\pi(\cdot|s))] as the approximation.
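The following is a minimal REINFORCE-style sketch of the resulting surrogate objective; our PPO-based implementation is more involved, and the function and argument names here are ours for illustration. The intrinsic returns are assumed to be sums of r(s,a)=\hat{p}(s,a)^{\alpha-1} computed along the collected trajectories from a separate density estimate \hat{p} of d_{\mu}^{\pi}.

```python
import torch

def exploration_loss(policy, states, actions, intrinsic_returns, eta=1e-4):
    """Surrogate loss whose gradient approximates the two terms discussed above.

    policy(states) is assumed to return a torch.distributions.Categorical over
    actions; intrinsic_returns are the sums of intrinsic rewards along the
    collected trajectories, playing the role of Q^pi(s, a) in the first term.
    """
    dist = policy(states)
    log_probs = dist.log_prob(actions)

    # First term: standard policy-gradient surrogate with the intrinsic return.
    pg_term = (log_probs * intrinsic_returns.detach()).mean()

    # Second term: approximated by the entropy of pi(.|s), as discussed above
    # (the approximation is exact only in the limit alpha -> 1).
    entropy_term = dist.entropy().mean()

    # Gradient ascent on pg_term + eta * entropy_term, i.e., minimize its negative.
    return -(pg_term + eta * entropy_term)
```

The coefficient eta plays the same balancing role as the hyperparameter \eta=10^{-4} listed in the experiment settings below.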

Value function estimator

In this section, we provide the details of the truncated TD(\lambda) estimator (Sutton and Barto 2018) that we use in our algorithm. Recall that in the i-th iteration, we maintain the replay buffer \mathcal{D}=\mathcal{D}_{i}\cup\mathcal{D}_{i-1}\cup\cdots\cup\mathcal{D}_{i-n+1} as \mathcal{D}=\{(s_{j1},a_{j1}),\cdots,(s_{jN_{j}},a_{jN_{j}})\}_{j=1}^{nm}, where N_{j} is the length of the j-th trajectory and nm is the total number of trajectories in the replay buffer. Correspondingly, the reward calculated using the latest density estimator P_{\phi} is denoted as r_{jk}=(P_{\phi}(s_{jk},a_{jk}))^{\alpha-1} or r_{jk}=-\log P_{\phi}(s_{jk},a_{jk}) for each (s_{jk},a_{jk})\in\mathcal{D}, \forall j\in[nm],k\in[N_{j}].

The truncated n-step value function estimator can be written as:

V^{\text{nstep}}_{p}(s_{jk}):=\sum_{\tau=k}^{h-1}r_{j\tau}+V_{\psi}(s_{jh}),\quad\text{with }h=\min(N_{j},k+p),

where V_{\psi} is the current value function network with parameter \psi. Notice that, since the discount rate is already taken into account when we collect samples, we estimate the undiscounted sum here. When p\geq N_{j}-k (i.e., h=N_{j}), this corresponds to the Monte Carlo estimator, which has high variance. When p=1 (i.e., h=k+1), this corresponds to the TD(0) estimator, which has large bias. To balance variance and bias, we use the truncated TD(\lambda) estimator, which is a weighted sum of the truncated n-step estimators with different p, i.e.,

V^{\text{targ}}_{\lambda}(s_{jk}):=(1-\lambda)\sum_{p=1}^{P-1}\lambda^{p-1}V^{\text{nstep}}_{p}(s_{jk})+\lambda^{P-1}V^{\text{nstep}}_{P}(s_{jk}),

where \lambda\in[0,1] and P\in\mathbb{N}^{+} are hyperparameters.
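A minimal NumPy sketch of these targets for a single trajectory is given below; the variable names are ours, and the actual implementation operates on batches of trajectories drawn from the replay buffer.

```python
import numpy as np

def truncated_td_lambda_targets(rewards, values, lam=0.95, P=15):
    """Compute V^targ_lambda for every step of one trajectory.

    rewards[k] is the intrinsic reward at step k and values[k] = V_psi(s_k)
    from the current value network, for k = 0, ..., N-1 (0-indexed)."""
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    N = len(rewards)
    targets = np.zeros(N)
    for k in range(N):
        # Truncated n-step estimates V^nstep_p(s_k) for p = 1, ..., P.
        nstep = np.zeros(P)
        for p in range(1, P + 1):
            h = min(N - 1, k + p)                 # index of the bootstrap state
            nstep[p - 1] = rewards[k:h].sum() + values[h]
        # Weighted combination: (1-lam) * sum_{p<P} lam^{p-1} V_p + lam^{P-1} V_P.
        weights = (1 - lam) * lam ** np.arange(P - 1)
        targets[k] = weights @ nstep[:P - 1] + lam ** (P - 1) * nstep[-1]
    return targets
```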

Experiment settings, hyperparameters and more experiments.

.1 Experiment settings and hyperparameters

Throughout the experiments, we set the hyperparameters as follows: the coefficient that balances the two terms in the policy gradient formulation is \eta=10^{-4}; the parameter for computing GAE and the target state values is \lambda=0.95; the number of steps for computing the target state values is P=15; the learning rate for the policy network is 4\times 10^{-3}; and the learning rate for the VAE and the value function is 10^{-3}.

In Pendulum and FourRooms (see Appendix .2), we use \gamma=0.995 and store the samples of the latest 10 iterations in the replay buffer. To estimate the corresponding Rényi entropy of the policy during training in the top of Figure 6, we sample several trajectories in each iteration. In each trajectory, we terminate the trajectory at each step with probability 1-\gamma. For \alpha=0.5 and 1, we sample 1000 trajectories to estimate the distribution and then calculate the Rényi entropy over the state-action space. For \alpha=0, we sample 50 groups of trajectories. In the i-th group, we sample 20 trajectories and count the number of unique state-action pairs visited, C_{i}. We use \frac{1}{50}\sum_{i=1}^{50}\log C_{i} as the estimate of the Rényi entropy when \alpha=0. The corresponding entropy of the optimal policy is estimated in the same manner. The bottom figures in Figure 6 are plotted using 100 trajectories, each of which is also collected in the same manner.
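A sketch of this estimation procedure for \alpha>0 is given below (the \alpha=0 case uses the grouped unique-count estimate described above and is omitted); rollout() is a hypothetical function returning one trajectory of discretized (s,a) indices.

```python
import numpy as np

def renyi_entropy_from_counts(counts, alpha):
    """Renyi entropy of the empirical distribution given visit counts."""
    p = np.asarray(counts, dtype=float)
    p = p[p > 0] / p.sum()
    if np.isclose(alpha, 1.0):
        return -(p * np.log(p)).sum()            # Shannon limit
    return np.log((p ** alpha).sum()) / (1 - alpha)

def estimate_entropy(rollout, alpha, gamma=0.995, n_traj=1000, rng=np.random):
    """Monte Carlo estimate of H_alpha(d_mu^pi): each trajectory is cut at a
    geometric length (continue with probability gamma after every step), so the
    visit counts follow the discounted state-action distribution."""
    counts = {}
    for _ in range(n_traj):
        for sa in rollout():
            counts[sa] = counts.get(sa, 0) + 1
            if rng.random() > gamma:             # terminate with probability 1 - gamma
                break
    return renyi_entropy_from_counts(list(counts.values()), alpha)
```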

In MultiRooms, we use \gamma=0.995 and store the samples of the latest 10 iterations in the replay buffer. We use the MiniGrid-MultiRoom-N6-v0 task from MiniGrid. We fix the room configuration within a single run of the algorithm and use different room configurations in different runs. The four actions are turning left/right, moving forward and opening the door. In the planning phase, we randomly pick 20 grids and construct the corresponding reward functions. Each reward function gives the agent a reward r=1 if the agent reaches the corresponding state and a reward r=0 otherwise. We show the means and the standard deviations over five runs with different random seeds in the figures.

For Atari games, we use \gamma=0.995 and store the samples of the latest 20 iterations in the replay buffer. For Mujoco tasks, we use \gamma=0.995 and only store the samples of the current iteration. In Figure 4b, the two reward functions in the planning phase are designed as follows: assume the rooms are numbered in the default order by the environment (cf. the labels in Figure 4b). Under the first reward function, when the agent goes from Room 3 to Room 9 (the room below Room 3) for the first time, we assign a unit reward to the agent; otherwise, we assign zero reward. Similarly, under the second reward function, we give a reward to the agent when it goes from Room 7 to Room 13 (the room below Room 7).

We will release the source code for these experiments once the paper is published.

.2 Convergence of MaxRenyi

Convergence is one of the most critical properties of an algorithm. The algorithm based on maximizing Shannon entropy has a convergence guarantee, which leverages existing results on the convergence of standard RL algorithms together with the similar form of the objective to the standard RL problem (noting that H_{1}(d_{\mu}^{\pi})=\langle d_{\mu}^{\pi},-\log d_{\mu}^{\pi}\rangle and J(\pi)=\frac{1}{1-\gamma}\langle d_{\mu}^{\pi},r\rangle) (Hazan et al. 2019). On the other hand, MaxRenyi does not have such a guarantee, not only due to the form of Rényi entropy but also due to the techniques we use. Therefore, we present a set of experiments to examine whether our exploration algorithm MaxRenyi empirically converges to near-optimal policies in terms of the entropy over the state-action space. To answer this question, we compare the entropy induced by the agent during training with MaxRenyi against the maximum entropy the agent can achieve. We implement MaxRenyi on two simple environments, Pendulum and FourRooms, where we can solve for the maximum entropy by brute-force search. Pendulum from OpenAI Gym (Brockman et al. 2016) has a continuous state-action space; for this task, the entropy is estimated by discretizing the state space into 16\times 16 grids and the action space into two discrete actions. FourRooms is an 11\times 11 grid world environment with four actions and deterministic transitions. We show the results in Figure 6. The results indicate that our exploration algorithm approaches the optimum in terms of the corresponding state-action space entropy with different \alpha in both discrete and continuous settings.

Figure 6: Top. The corresponding Rényi entropy of the state-action distribution induced by the policies during training using our exploration algorithm with different \alpha. The dashed lines indicate the corresponding maximum entropies the agent can achieve. Bottom. The discounted state distributions of the learned policies compared with the random policy.

.3 More results for Atari games and Mujoco tasks

The reward-free RL framework contains two phases, the exploration phase and the planning phase. Due to space limits, we only show the final performance of the algorithms (i.e., the performance of the agent planned based on the data collected using the learned exploratory policy) in the main text. Here, we also show the performance of the exploration algorithms in the exploration phase (i.e., the performance for the reward-free exploration setting). We show the results in Table 3. First, the scores for the exploration phase are measured using the extrinsic game rewards of the tasks; however, the exploration algorithms are trained in the reward-free exploration setting (i.e., they do not receive such extrinsic game rewards). The results indicate that, although MaxRenyi is designed for the reward-free RL framework, it outperforms other exploration algorithms in the reward-free exploration setting. Then, the average scores for the planning phase are the same as those in Table 1. In addition, we present the standard deviations over five random seeds. The results indicate that the stability of MaxRenyi is comparable to RND and ICM while achieving better performance on average.

Phase Task MaxRenyi RND ICM Random
Exploration phase Montezuma 7,540(2,210) 4,220(736) 1,140(1,232) 0
Venture 460(273) 740(546) 120(147) 0
Gravitar 1,420(627) 1,010(886) 580(621) 173
PrivateEye 6,480(7,813) 6,220(7,505) 100(0) 25
Solaris 1,284(961) 1,276(712) 1,136(928) 1,236
Hopper 207(628) 72(92) 86(75) 31
HalfCheetah 456(328) 137(126) 184(294) -82
Planning phase Montezuma 8,100(1,124) 4,740(2,216) 380(146) 0(0)
Venture 900(228) 840(307) 240(80) 0(0)
Gravitar 1,850(579) 1,890(665) 940(281) 340(177)
PrivateEye 6,820(7,058) 3,040(5,880) 100(0) 100(0)
Solaris 2,192(1,547) 1,416(734) 1,844(173) 1,508(519)
Hopper 973(159) 819(87) 704(59) 78(36)
HalfCheetah 1,466(216) 1,083(377) 700(488) 54(50)
Table 3: The performance of the exploratory policies (Exploration phase) and the planned policies under the reward-free RL framework (Planning phase) corresponding to different exploration algorithms. Random indicates using a random policy as the exploratory policy. We use \alpha=0.5 in MaxRenyi. The numbers before and inside the parentheses indicate the means and the standard deviations over five random seeds respectively.

.4 More results for Montezuma’s Revenge

Although the extrinsic game reward in Montezuma’s Revenge aligns with the coverage of the state space, the number of rooms visited in one episode may be a more direct measure for an exploration algorithm. Therefore, we also provide curves of the number of rooms visited during the reward-free exploration in Figure 7. Comparing with Figure 4a, we observe that, although MaxRenyi visits only slightly more rooms than RND, it yields higher returns by visiting more scoring states. This is possibly because our algorithm encourages the agent to visit the hard-to-reach states more.

Figure 7: The average number of rooms visited in an episode during the reward-free exploration.

Moreover, our algorithm is less likely to fail under different reward functions in the planning phase. There is a failure mode for the planning phase in Montezuma’s Revenge: for example, with the reward for going across Room 3, the trained agent may first go to Room 7 to obtain an extra key and then return to open the door toward Room 0, which is not the shortest path. Out of 5 runs, our algorithm fails 2 times, while RND fails in all 5 runs.