Exploration in Model-based Reinforcement Learning with Randomized Reward
Abstract
(This manuscript was completed in September 2021 while both authors worked at Baidu Cognitive Computing Lab.) Model-based Reinforcement Learning (MBRL) has been widely adopted due to its sample efficiency. However, existing worst-case regret analyses typically require optimistic planning, which is not realistic in general. In contrast, motivated by the theory, empirical studies utilize ensembles of models, which achieve state-of-the-art performance on various testing environments. Such a deviation between theory and empirical study leads us to ask whether a randomized model ensemble guarantees optimism, and hence the optimal worst-case regret. This paper partially answers this question from the perspective of reward randomization, a scarcely explored direction of exploration in MBRL. We show that under the kernelized nonlinear regulator (KNR) model, reward randomization guarantees a partial optimism, which further yields a near-optimal worst-case regret in terms of the number of interactions. We further extend our theory to generalized function approximation and identify conditions for reward randomization to attain provably efficient exploration. Correspondingly, we propose concrete examples of efficient reward randomization. To the best of our knowledge, our analysis establishes the first worst-case regret analysis of randomized MBRL with function approximation.
1 Introduction
Reinforcement learning (RL) (sutton2018reinforcement) aims to learn the optimal policy by iteratively interacting with the environment. Model-based reinforcement learning (MBRL) (osband2014model; luo2019algorithmic; ha2018world; sun2019model; kaiser2020model; ayoub2020model; kakade2020information) achieves this goal by fitting a model of the environment to the observations and obtaining the policy from the fitted model. Combined with deep learning, MBRL has achieved tremendous success in real-world tasks, including video games (ha2018world; kaiser2020model) and control tasks (watter2015embed; williams2015model; chua2018deep; hafner2019dream; song2021pc).
A key factor in the success of MBRL is sample efficiency. In terms of theoretical analysis, such sample efficiency is characterized by the regret of MBRL. Previous analyses suggest that, when incorporated with exploration strategies, MBRL enjoys a near-optimal regret (jaksch2010near; ayoub2020model; kakade2020information), where $T$ is the total number of interactions with the environment. However, previous provably efficient exploration typically utilizes optimistic planning (jaksch2010near; luo2019algorithmic; ayoub2020model; kakade2020information). Such an exploration strategy requires (i) identifying a confidence set of models, which captures the uncertainty in model estimation, and then (ii) conducting optimistic planning by searching for the policy that maximizes the value over all possible models within the confidence set. The key to the success of optimistic planning is the principle of optimism in the face of uncertainty (jaksch2010near). Intuitively, optimistic planning encourages the agent to explore less visited areas, hence improving the sample efficiency of the corresponding RL algorithm. While step (i) is realizable with ensemble techniques, step (ii) is in general impossible to implement, as it requires solving an optimization problem over a possibly continuous space of models. As an alternative, previous empirical studies (chua2018deep; pathak2017curiosity; pathak2019self) typically borrow ideas from optimistic planning and from the study of Thompson sampling (TS) based algorithms (osband2014model). A common empirical approach is to utilize model ensembles to capture the uncertainty of model estimation. Such ensembles are further utilized in planning through TS (chua2018deep) or bonus construction (pathak2017curiosity; pathak2019self). Unlike optimistic planning, such approaches typically do not have a worst-case regret guarantee. Nevertheless, they attain state-of-the-art performance in various testing environments. Such a deviation between theory and practice motivates us to propose this question:
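To make the contrast concrete, the following minimal Python sketch (not from the paper; all function and class names are hypothetical) juxtaposes idealized optimistic planning, which maximizes jointly over models and policies, with the ensemble-based Thompson-sampling surrogate commonly used in practice.

```python
# Hypothetical sketch: idealized optimistic planning vs. ensemble-based
# Thompson-sampling planning. Not the paper's algorithm.
import random
from typing import Callable, Sequence


def optimistic_planning(confidence_set: Sequence, policies: Sequence,
                        plan_with_model: Callable) -> object:
    """Idealized optimism: jointly maximize value over models and policies.

    Intractable in general, since the confidence set is typically a
    continuum of models rather than a finite collection.
    """
    best_value, best_policy = float("-inf"), None
    for model in confidence_set:        # in theory: a sup over a continuous model set
        for policy in policies:         # in theory: a sup over all policies
            value = plan_with_model(model, policy)
            if value > best_value:
                best_value, best_policy = value, policy
    return best_policy


def thompson_sampling_planning(ensemble: Sequence, policies: Sequence,
                               plan_with_model: Callable) -> object:
    """Common empirical surrogate: sample one ensemble member, then plan against it."""
    model = random.choice(ensemble)     # randomization replaces the sup over models
    return max(policies, key=lambda policy: plan_with_model(model, policy))
```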
Does a randomized model ensemble guarantee optimism, and hence the optimal worst-case regret?
In this paper, we provide a partial answer to the above question via reward randomization, a relatively less studied approach to exploration in MBRL. We initiate our analysis under the kernelized nonlinear regulator (KNR) transition model (mania2022active; kakade2020information; song2021pc) with known reward functions. We propose PlanEx, which conducts exploration by iteratively planning with the fitted transition model and a randomized reward function. We further show that PlanEx attains a near-optimal regret. A key observation behind reward randomization is a notion of partial optimism (russo2019worst; zanette2020frequentist), which ensures that a sufficient amount of interactions is devoted to exploration under the optimism principle. Motivated by the analysis under the KNR transition model, we extend PlanEx to general function approximation with calibrated models (curi2020efficient; kidambi2021mobile) and propose a generic design principle for reward randomization. We further propose concrete examples of valid reward randomization and demonstrate their effectiveness theoretically. We highlight that the proposed reward randomization can be easily implemented on top of model ensembles. In addition, the reward randomization is highly modular and can be incorporated with various state-of-the-art baselines.
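As a rough illustration of the algorithmic pattern described above, the following Python sketch outlines a PlanEx-style loop under our reading of the text. The callables `fit_transition_model`, `randomize_reward`, `plan`, and `rollout` are hypothetical placeholders; the actual randomization distribution and planner are specified later in the paper.

```python
# Hedged sketch of the reward-randomization loop; not the paper's pseudocode.
# All callables are hypothetical placeholders supplied by the user.
def plan_ex(env, num_episodes, fit_transition_model, randomize_reward, plan, rollout):
    dataset, policy = [], None
    for _ in range(num_episodes):
        model = fit_transition_model(dataset)            # e.g., KNR fit or model ensemble
        reward_tilde = randomize_reward(model, dataset)  # perturb the known reward with
                                                         # uncertainty-scaled noise
        policy = plan(model, reward_tilde)               # ordinary planning; no inner
                                                         # maximization over models
        dataset += rollout(env, policy)                  # collect one more episode of data
    return policy
```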
Contribution. Our work provides a partial solution to the question we raised. Specifically, we investigate reward randomization and propose PlanEx. Our contributions are as follows.

•

We propose PlanEx, a novel exploration algorithm for MBRL with a worst-case regret guarantee that is realizable with general function parameterizations.

•

We show that PlanEx has a near-optimal worst-case regret under the KNR dynamics. To the best of our knowledge, our analysis establishes the first worst-case regret analysis of randomized MBRL with function approximation.
Related Work. Our work is closely related to the regret analysis of MBRL and online control (osband2014model; luo2019algorithmic; sun2019model; lu2019information; ayoub2020model; kakade2020information; curi2020efficient; agarwal2020flambe; song2021pc). ayoub2020model propose the value-target regression (VTR) algorithm, which focuses on the aspects of the transition model that are relevant to RL. agarwal2020flambe propose FLAMBE, a provably efficient MBRL algorithm under the linear MDP setting (jin2020provably; yang2019sample). kakade2020information propose LC3, an online control algorithm under the KNR dynamics (mania2022active) that attains the optimal worst-case regret. Both ayoub2020model and kakade2020information utilize optimistic planning for exploration, which is in general intractable. In contrast, we utilize reward randomization for exploration, which also attains the optimal worst-case regret and is highly tractable. To attain tractable optimistic planning, curi2020efficient design HUCRL, which introduces an extra state deviation variable into the planning optimization. In contrast, planning with a randomized reward does not introduce extra variables into the optimization. luo2019algorithmic optimize a lower bound of the value function, which avoids explicit uncertainty quantification. Recent works also utilize reward bonuses to attain optimistic planning (kidambi2021mobile; song2021pc). song2021pc propose PC-MLP, which constructs the bonus by estimating the policy cover (agarwal2020pc) and is computationally tractable. As a comparison, PC-MLP requires extra sampling to estimate the covariance matrix of the policy cover. As a consequence, PC-MLP does not attain the optimal $\sqrt{T}$-regret. In contrast, PlanEx does not require extra sampling to construct the bonus and achieves the $\sqrt{T}$-regret. In addition, to attain a tractable realization, PC-MLP utilizes different features for model fitting and bonus construction, which is inconsistent with its theoretical analysis. In contrast, the implementation of PlanEx is consistent with our theoretical analysis under the calibrated model assumption. Previous works also study provably efficient model-free exploration algorithms with function approximation. See, e.g., jiang2017contextual; jin2020provably; du2020good; wang2020reward; cai2020provably; agarwal2020pc; modi2021model and references therein for this line of research.
Our analysis is inspired by the recent progress in the worst-case regret analysis of randomized RL algorithms (russo2019worst; pacchiano2020optimism; zanette2020frequentist; ishfaq2021randomized). Our optimism analysis is inspired by russo2019worst and zanette2020frequentist. russo2019worst propose the first worst-case regret analysis of the randomized least-squares value iteration (RLSVI) algorithm under the tabular setting. zanette2020frequentist extend the analysis of RLSVI to truncated linear function approximation under general state spaces. ishfaq2021randomized analyze randomized Q-learning with both linear and general function approximation. In contrast, we focus on randomized MBRL algorithms. pacchiano2020optimism analyze the worst-case regret of MBRL with both reward and transition randomization. We remark that both pacchiano2020optimism and ishfaq2021randomized require drawing multiple samples for each state-action pair and further maximizing over all the randomized reward functions in planning. In contrast, we only require one sample at each time step and do not need further maximization. In addition, pacchiano2020optimism focus on the tabular setting with finite state and action spaces, whereas we consider the generic setting with function approximation.
2 Background
2.1 Reinforcement Learning
In this paper, we model the environment by an episodic MDP $(\mathcal{S}, \mathcal{A}, H, r, P)$. Here $\mathcal{S}$ and $\mathcal{A}$ are the state and action spaces, $H$ is the length of each episode, $r = \{r_h\}_{h \in [H]}$ are the bounded reward functions for $h \in [H]$, and $P = \{P_h\}_{h \in [H]}$ is the transition kernel, which defines the transition probability $P_h(s' \mid s, a)$ for all $(s, a, s') \in \mathcal{S} \times \mathcal{A} \times \mathcal{S}$ and $h \in [H]$.

Interaction Procedure. An agent with a set of policies $\pi = \{\pi_h\}_{h \in [H]}$ interacts with the environment as follows. The agent starts from a fixed initial state $s_1$. Iteratively, upon reaching the state $s_h$ at step $h$, the agent takes the action $a_h = \pi_h(s_h)$. The agent then receives the reward $r_h(s_h, a_h)$, and the environment transits into the next state $s_{h+1}$ according to the probability $P_h(\cdot \mid s_h, a_h)$. The process ends when the agent reaches the state $s_{H+1}$.
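For concreteness, the interaction protocol above corresponds to the following short Python sketch (illustrative only; `policy`, `reward`, and `transition` are hypothetical stand-ins for $\pi_h$, $r_h$, and $P_h$).

```python
# Illustrative episodic interaction loop; not from the paper.
# `policy`, `reward`, and `transition` stand in for pi_h, r_h, and P_h.
def run_episode(initial_state, policy, reward, transition, horizon):
    state, total_reward = initial_state, 0.0
    for h in range(1, horizon + 1):
        action = policy(h, state)                  # a_h = pi_h(s_h)
        total_reward += reward(h, state, action)   # collect r_h(s_h, a_h)
        state = transition(h, state, action)       # s_{h+1} ~ P_h(. | s_h, a_h)
    return total_reward                            # episode ends upon reaching s_{H+1}
```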
To describe the expected cumulative reward, for each policy $\pi$, we introduce the action-value functions $\{Q^\pi_h\}_{h \in [H]}$ defined as follows,

$Q^\pi_h(s, a) = \mathbb{E}\bigl[\textstyle\sum_{h'=h}^{H} r_{h'}(s_{h'}, a_{h'}) \,\big|\, s_h = s, a_h = a\bigr],$   (1)

where $a_{h'} = \pi_{h'}(s_{h'})$ and $s_{h'+1} \sim P_{h'}(\cdot \mid s_{h'}, a_{h'})$ for all $h \leq h' \leq H$. Similarly, we define the value functions $\{V^\pi_h\}_{h \in [H]}$ as follows,