Ensemble Bootstrapping for Q-Learning
Abstract
Q-learning (QL), a common reinforcement learning algorithm, suffers from over-estimation bias due to the maximization term in the optimal Bellman operator. This bias may lead to sub-optimal behavior. Double-Q-learning tackles this issue by utilizing two estimators, yet results in an under-estimation bias. Similar to the over-estimation in Q-learning, in certain scenarios the under-estimation bias may degrade performance. In this work, we introduce a new bias-reduced algorithm called \algfull (\alg), a natural extension of Double-Q-learning to ensembles. We analyze our method both theoretically and empirically. Theoretically, we prove that \alg-like updates yield lower MSE when estimating the maximal mean of a set of independent random variables. Empirically, we show that there exist domains where both over- and under-estimation result in sub-optimal performance. Finally, we demonstrate the superior performance of a deep RL variant of \alg over other deep QL algorithms on a suite of ATARI games.
1 Introduction
In recent years, reinforcement learning (RL) algorithms have impacted a vast range of real-world tasks, such as robotics (andrychowicz2020learning; kober2013reinforcement), power management (xiong2018power), autonomous control (bellemare2020autonomous), traffic control (abdulhai2003traffic1; wiering2000multi), and more (mahmud2018applications; luong2019applications2). These achievements are made possible by the ability of RL agents to learn a behavior policy via interaction with the environment while simultaneously maximizing the obtained reward.
A common family of RL algorithms is Q-learning (watkins1992q, QL) based algorithms, which focus on learning the value function. The value represents the expected, discounted reward-to-go that the agent will obtain. In particular, such methods learn the optimal policy via an iterative maximizing bootstrapping procedure. A well-known property of QL is that this procedure results in a positive estimation bias. hasselt2010double and van2016deep argue that over-estimating the real value may negatively impact the performance of the obtained policy. They propose Double Q-learning (DQL) and show that, as opposed to QL, it has a negative estimation bias.
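As a brief, illustrative example of the source of this bias (ours, for intuition only): consider a state $s$ with two actions whose true values are both zero, and whose estimates are corrupted by independent standard Gaussian noise, $\hat{Q}(s,a_i)=\epsilon_i$ with $\epsilon_i\sim\mathcal{N}(0,1)$. The bootstrapping target maximizes over the estimates, and
\[
\mathbb{E}\Big[\max_i \hat{Q}(s,a_i)\Big] = \mathbb{E}\big[\max(\epsilon_1,\epsilon_2)\big] = \frac{1}{\sqrt{\pi}} \approx 0.56 > 0 = \max_i Q(s,a_i).
\]
More generally, $\mathbb{E}[\max_i \hat{Q}(s,a_i)] \ge \max_i \mathbb{E}[\hat{Q}(s,a_i)]$ by Jensen's inequality, as the maximum is convex; the double estimator avoids this over-estimation by decoupling action selection from value estimation, at the cost of a negative bias.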
The harmful effects of any bias on value learning are widely known. Notably, the deadly triad (sutton2018reinforcement; van2018deep) identifies bootstrapping with biased values as one of the major causes of divergence when learning value functions. This has been shown empirically in multiple works (van2016deep; van2018deep). Although this phenomenon is well known, most works focus on avoiding a positive estimation bias rather than on minimizing the magnitude of the bias itself (van2016deep; wang2016dueling; hessel2018rainbow).
In this work, we tackle the estimation bias in an underestimation setting. We begin, similar to hasselt2010double, by analyzing the estimation bias when observing a set of i.i.d. random variables (RVs). We consider three schemes for estimating the maximal mean of these RVs. (i) The single estimator, corresponding to a QL-like estimation scheme, which uses all samples both to select the RV with the maximal mean and to estimate it. (ii) The double estimator, corresponding to DQL, which splits the samples into two equal-sized sets: one for determining the index of the RV with the maximal mean and the other for estimating the mean of the selected RV. (iii) The ensemble estimator, our proposed method, which splits the samples into multiple disjoint sets; a single set is used for identifying the RV with the maximal mean, whereas all remaining sets are used for estimating the mean of the selected RV.
We start by showing that the ensemble estimator can be seen as a double estimator that splits the samples unequally (\cref{prop:proxy}). This enables, for instance, devoting fewer samples to identifying the maximal index and more to estimating its mean. We prove that the ensemble estimator under-estimates the maximal mean (\cref{lemma:EE underestimation}). We also show, both theoretically (\cref{prop:split_ratio}) and empirically (\cref{fig:MSE as func of r}), that a double estimator that splits the samples equally is sub-optimal in terms of the magnitude of the estimation error.
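To make the three estimation schemes concrete, the following is a minimal Python sketch (our illustration, not the paper's code) that estimates the maximal mean of a set of independent Gaussian RVs with each scheme; the particular means, noise level, sample budget, and ensemble size are illustrative assumptions.
\begin{lstlisting}[language=Python]
import numpy as np

def single_estimator(samples):
    # QL-like: all samples are used both to pick the maximal index
    # and to estimate its mean.
    return samples.mean(axis=1).max()

def double_estimator(samples):
    # DQL-like: one half selects the maximal index,
    # the other half estimates the mean of the selected RV.
    half = samples.shape[1] // 2
    idx = samples[:, :half].mean(axis=1).argmax()
    return samples[idx, half:].mean()

def ensemble_estimator(samples, k=5):
    # One of k disjoint subsets selects the maximal index; the remaining
    # k - 1 subsets estimate the mean of the selected RV.
    subsets = np.array_split(samples, k, axis=1)
    idx = subsets[0].mean(axis=1).argmax()
    return np.concatenate([s[idx] for s in subsets[1:]]).mean()

rng = np.random.default_rng(0)
true_means = np.array([0.5, 0.4, 0.3, 0.2])   # illustrative; maximal mean is 0.5
n_samples, n_trials = 100, 20000
errors = {"single": [], "double": [], "ensemble": []}
for _ in range(n_trials):
    samples = rng.normal(true_means[:, None], 1.0, size=(len(true_means), n_samples))
    for name, est in [("single", single_estimator),
                      ("double", double_estimator),
                      ("ensemble", ensemble_estimator)]:
        errors[name].append(est(samples) - true_means.max())
for name, err in errors.items():
    err = np.asarray(err)
    print(f"{name:8s} bias={err.mean():+.3f}  MSE={(err ** 2).mean():.4f}")
\end{lstlisting}
In this sketch the single estimator exhibits a positive bias while the double and ensemble estimators exhibit a negative one; varying $k$ changes the fraction of samples devoted to selecting the maximal index versus estimating its mean, which is the unequal-split view discussed above, and with $k=2$ the scheme coincides with the double estimator.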
Following this theoretical analysis, we extend the ensemble estimator to the multi-step reinforcement learning case. We call this extension \algfull (\alg). We analyze \alg in a tabular meta chain MDP and on a set of ATARI environments using deep neural networks. In the tabular case, we show that as the ensemble size grows, \alg minimizes the magnitude of the estimation bias. Moreover, when coupled with deep neural networks, we observe that \alg obtains superior performance on a set of ATARI domains when compared to Deep Q-Networks (DQN) and Double-DQN (DDQN).
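In the RL setting, the same idea yields a simple tabular update. The sketch below is our illustrative pseudocode in the spirit of \alg, not the paper's reference implementation: it maintains $K$ Q-tables, lets the updated member choose the greedy next action, and evaluates that action with the average of the remaining members; the member-selection rule and hyper-parameters are assumptions.
\begin{lstlisting}[language=Python]
import numpy as np

def ebql_style_update(Q, k, s, a, r, s_next, done, alpha=0.1, gamma=0.99):
    # Q has shape (K, n_states, n_actions): K independent Q-tables.
    # Member k selects the greedy next action; the average of the
    # remaining members evaluates it (our assumption on the exact rule).
    K = Q.shape[0]
    if done:
        target = r
    else:
        a_star = Q[k, s_next].argmax()
        others = [j for j in range(K) if j != k]
        target = r + gamma * Q[others, s_next, a_star].mean()
    Q[k, s, a] += alpha * (target - Q[k, s, a])

# Toy usage: K = 5 tables for an MDP with 10 states and 2 actions;
# the updated member k can, e.g., be drawn uniformly at each step.
Q = np.zeros((5, 10, 2))
ebql_style_update(Q, k=3, s=0, a=1, r=1.0, s_next=4, done=False)
\end{lstlisting}
With $K=2$, a single remaining member evaluates the greedy action chosen by the other, recovering a Double-Q-learning-style update; the behavior policy can, for example, act $\epsilon$-greedily with respect to the ensemble mean $\frac{1}{K}\sum_j Q^{(j)}$.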
Our contributions are as follows:
• We analyze the problem of estimating the maximum expectation over independent random variables. We prove that the estimation mean-squared error (MSE) can be reduced using the ensemble estimator. In addition, we show that obtaining the minimal MSE requires utilizing more than two ensemble members.
• Drawing inspiration from the above, we introduce \algfull (\alg) and show that it reduces the bootstrapping estimation bias.
• We show that \alg is superior to both Q-learning and Double-Q-learning, both in a tabular setting and when coupled with deep neural networks (ATARI).
2 Preliminaries
2.1 Model Free Reinforcement Learning
A reinforcement learning (sutton2018reinforcement) agent faces a sequential decision-making problem in some unknown or partially known environment. We focus on environments in which the action space is discrete. The agent interacts with the environment as follows. At each time $t$, the agent observes the environment state $s_t \in \mathcal{S}$ and selects an action $a_t \in \mathcal{A}$. Then, it receives a scalar reward $r_t$ and transitions to the next state $s_{t+1}$ according to some (unknown) transition kernel $P(s_{t+1} \mid s_t, a_t)$. A policy $\pi : \mathcal{S} \to \mathcal{A}$ is a function that determines which action the agent should take at each state, and the sampled performance of a policy, a random variable starting from state $s$, is denoted by