
Leiden Institute of Advanced Computer Science
Leiden University, Leiden
The Netherlands

[email protected]

High-Accuracy Model-Based Reinforcement Learning, a Survey

Aske Plaat    Walter Kosters    Mike Preuss
Abstract

Deep reinforcement learning has shown remarkable success in the past few years. Highly complex sequential decision making problems from game playing and robotics have been solved with deep model-free methods. Unfortunately, the sample complexity of model-free methods is often high. To reduce the number of environment samples, model-based reinforcement learning creates an explicit model of the environment dynamics.

Achieving high model accuracy is a challenge in high-dimensional problems. In recent years, a diverse landscape of model-based methods has been introduced to improve model accuracy, using methods such as uncertainty modeling, model-predictive control, latent models, and end-to-end learning and planning. Some of these methods succeed in achieving high accuracy at low sample complexity, most do so either in a robotics or in a games context. In this paper, we survey these methods; we explain in detail how they work and what their strengths and weaknesses are. We conclude with a research agenda for future work to make the methods more robust and more widely applicable to other applications.

Keywords:
Model-based reinforcement learning · Latent models · Deep learning · Machine learning · Planning

1 Introduction

Recent breakthroughs in game playing and robotics have shown the power of deep reinforcement learning, for example, by learning to play Atari and Go from scratch or by learning to fly an acrobatic model helicopter (Mnih et al., 2015; Silver et al., 2016; Abbeel et al., 2007). Unfortunately, for most applications the sample efficiency is low (Silver et al., 2016; LeCun et al., 2015), and achieving faster learning is a major topic in current research. Model-based methods can achieve faster learning by making an internal dynamics model of the environment. By then using this dynamics model for policy updates, the number of necessary environment samples can be reduced substantially (Sutton, 1991).

The success of the model-based approach hinges on the accuracy of this dynamics model—there is a trade-off between accuracy and sample complexity, especially in models with many parameters that require many samples to prevent overfitting (LeCun et al., 2015; Talvitie, 2015). The challenge for the methods in this survey is to train a high-accuracy dynamics model for high-dimensional problems with low sample complexity.

Reinforcement learning finds solutions for sequential decision problems (see Figure 2). Where model-free methods learn the best action in each state, model-based methods go a step further: the next-state transition function in each state is learned. Model-based methods capture the core of complex decision sequences, and models may also be applicable to related environments (Risi and Preuss, 2020; Torrado et al., 2018), for transfer learning.

The contribution of this survey is to give an in-depth overview of recent high-accuracy model-based methods for high-dimensional problems. We present a taxonomy based on application type, learning method, and planning method, and while improving model accuracy is difficult, successful methods are reported for game playing and visuo-motor control (although rarely both at the same time). We describe how and why the methods work—we do note, however, that the computational cost is still high, and that the outcomes of experiments are often sensitive to the choice of hyperparameters. We close with a research agenda to improve reproducibility, to further improve accuracy, and to make methods more widely applicable.

The field of deep model-based reinforcement learning is quite active. Excellent works with background information exist for reinforcement learning (Sutton and Barto, 2018) and deep learning (Goodfellow et al., 2016). Previous surveys provide an overview of the uses of classic (tabular) model-based methods (Deisenroth et al., 2013; Kober et al., 2013; Kaelbling et al., 1996). The purpose of the current survey is to focus on deep learning methods. Other relevant surveys into model-based reinforcement learning are (Justesen et al., 2019; Polydoros and Nalpantidis, 2017; Hui, 2018; Wang et al., 2019; Çalışır and Pehlivanoğlu, 2019; Moerland et al., 2020b).

Section 2 provides necessary background and the MDP formalism for reinforcement learning. Section 3 then surveys the field. Section 4 provides a discussion reflecting on the different approaches and provides open problems and research directions for future work. Section 5 concludes the survey.

Figure 1: Reinforcement learning: the agent performs action $a$ on the environment, which provides a new state $s'$ and reward $r'$

2 Background

The goal of reinforcement learning is to learn optimal behavior for a certain environment, maximizing expected cumulative future reward. This goal is reached after a sequence of decisions is taken, actions that can be different for each state; the best sequence of actions solves a sequential decision making problem. Reinforcement learning draws inspiration from human and animal learning (Hamrick, 2019; Kahneman, 2011; Anthony et al., 2017; Duan et al., 2016), where behavioral adaptation by reward and punishment is studied.

In contrast, supervised learning methods can be used when a dataset of labeled training pairs $(x,y)$ is present. The reinforcement learning paradigm, however, does not assume the presence of such a dataset, but derives the ground truths from an external environment, see Figure 1. The environment provides a new state $s'$ and its reward $r'$ for every action $a$ that the agent tries in a certain state $s$ (Sutton and Barto, 2018). In this way, as many (state-action, reward) pairs can be generated as needed.[1]

[1] A dataset is static. In reinforcement learning the choice of actions may depend on the rewards that are returned during the learning process, giving rise to a dynamic, potentially unstable, learning process.

2.1 Formalizing Model-Based Reinforcement Learning

Reinforcement learning problems are often modeled as a Markov decision process, MDP (Bishop, 2006; Sutton and Barto, 2018). First we introduce state, action, transition, and reward. Then we introduce trajectory, policy, and value. Finally, we discuss model-based and model-free solution approaches.

Figure 2: Backup Diagram (Sutton and Barto, 2018). Maximizing the reward for state $s$ is performed by following the transition function to find the next state $s'$. Note that the policy $\pi(s,a)$ tells the first half of this transition, going from $s \rightarrow a$; the transition function $T_a(s,s')$ completes the transition, going from $s \rightarrow s'$ (via $a$).

A Markov decision process is a 4-tuple $(S,A,T_a,R_a)$ where $S$ is a finite set of states and $A$ is a finite set of actions; $A_s \subseteq A$ is the set of actions available from state $s$. Furthermore, $T_a$ is the transition function: $T_a(s,s')$ is the probability that action $a$ in state $s$ at time $t$ will lead to state $s'$ at time $t+1$. Finally, $R_a(s,s')$ is the immediate reward received after transitioning from state $s$ to state $s'$ due to action $a$.

A policy $\pi$ is a stochastic or deterministic function mapping states to actions. The goal of the agent is to learn a policy that maximizes the value function, the expected cumulative discounted sum of rewards $V(s)=\mathbb{E}_{\tau}\big[\sum_{t=0}^{T}\gamma^{t}r_{t}\big]$ in a trajectory $\tau$, with $\gamma$ a discount parameter in an episode with $T$ steps. The optimal policy $\pi^{\star}$ contains the solution to a sequential decision problem: a prescription of which action must be taken in each state.
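As a concrete illustration of this value definition, the following minimal Python sketch (our own illustration, not code from any surveyed paper) computes the discounted return of a finite trajectory of rewards:

import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Return sum_t gamma^t * r_t for a finite list of rewards."""
    g = 0.0
    for r in reversed(rewards):   # work backwards: g_t = r_t + gamma * g_{t+1}
        g = r + gamma * g
    return g

# example: a three-step episode with rewards 1, 0, 2
print(discounted_return([1.0, 0.0, 2.0]))  # 1 + 0.99*0 + 0.99^2 * 2 = 2.9602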

Figure 3: Model-Free Learning

In model-free reinforcement learning only the environment knows the transition model $T_a(\cdot)$ that computes the next-state distribution over $s'$; the policy is learned directly from environment feedback $r'$ (Figure 3). In contrast, in model-based reinforcement learning, the agent constructs its own version of the transition model, and the policy can be learned from the environment feedback and with the help of the local transition model. Figure 2 shows a visual diagram of states, actions, and transitions.

The function $V^{\pi}(s)$ is called the state-value function. Some algorithms compute the policy $\pi$ directly, and some first compute the function $V^{\pi}(s)$. For continuous or stochastic problems, direct policy-based methods often work best; for discrete or deterministic problems, value-based methods are most often used (Kaelbling et al., 1996). A third approach combines the best of value and policy methods: actor-critic (Sutton and Barto, 2018; Konda and Tsitsiklis, 2000; Mnih et al., 2016). In the remainder of the paper we will see that the distinction between continuous/discrete action spaces and policy-based/value-based algorithms plays an important role (see also Table 2).

Closely related to the value function is the state-action value function $Q^{\pi}(s,a)$. This $Q$-function gives the expected sum of discounted rewards for action $a$ in state $s$, and then afterwards following policy $\pi$. The optimal policy can be found by recursively choosing the argmax action with $Q(s,a)=V^{\star}(s)$ in each state. This relationship is given by $\pi^{*}=\max_{\pi}V^{\pi}(s)=\max_{\pi,a}Q^{\pi}(s,a)$.
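To make the relation between $Q$, $V$, and the greedy policy concrete, here is a minimal tabular sketch; the state and action sets and the numbers are made up for illustration:

import numpy as np

Q = np.array([[1.0, 3.0],          # Q[s, a] for 2 states and 2 actions (made-up values)
              [0.5, 0.2]])

V = Q.max(axis=1)                  # V*(s) = max_a Q*(s, a)
greedy_policy = Q.argmax(axis=1)   # pi*(s) = argmax_a Q*(s, a)
print(V, greedy_policy)            # [3.  0.5] [1 0]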

There are many algorithms to find optimal policies (Hessel et al., 2017). Algorithms that use the agent’s transition function directly to find the next state are called planning algorithms; algorithms that use the environment to find the next state are called learning algorithms. We now briefly discuss classical model-free learning and planning approaches, before we survey model-based algorithms in more depth in the next section.

In deep learning, functions such as the policy function $\pi$ are approximated by the parameters (or weights) $\theta$ of a deep neural network, and are written as $\pi_{\theta}$, to distinguish them from classical tabular policies.

2.2 Model-Free Learning

When the agent does not have access to the transition or reward model, then the policy function has to be learned by querying the environment to find the reward for the action in a certain state. Learning the policy or value function in this way is called model-free learning, see Figure 3.

Recall that the policy is a mapping of states to (best) actions. Each time a new reward is returned by the environment, the policy can be improved: the best action for the state is updated to reflect the new information. Algorithm 1 shows the high-level steps of model-free reinforcement learning (later on the algorithms become more elaborate).

Algorithm 1 Model-Free Learning
repeat
     Sample env $E$ to generate data $D=(s,a,r',s')$
     Use $D$ to update policy $\pi(s,a)$
until $\pi$ converges

Model-free reinforcement learning is the most basic form of reinforcement learning (Kaelbling et al., 1996; Deisenroth et al., 2013; Kober et al., 2013). It has been successfully applied to a range of challenging problems (Mnih et al., 2015; Abbeel et al., 2007). In model-free reinforcement learning a policy is learned from the ground up through interactions with the environment.

The goal of classic model-free learning is to find the optimal policy for the environment; the goal of deep model-free learning is to find a policy function that generalizes well to states from the environment that have not been seen during training. A secondary goal is to do so with good sample efficiency: to use as few environment samples as possible.

Model-free learning follows the current behavior policy $\pi$ in selecting the action to try, deciding between exploration of new actions and exploitation of known good actions with a selection rule such as $\epsilon$-greedy. Exploration is essentially blind, and learning the policy and value often takes many samples, millions in current experiments (Mnih et al., 2015; Wang et al., 2019).

A well-known model-free reinforcement learning algorithm is Q-learning (Watkins, 1989). Algorithms such as Q-learning were developed in a classical tabular setting. Deep neural networks have been used with success in model-free learning, in domains in which samples can be generated cheaply and quickly, such as in Atari video games (Mnih et al., 2015). Deep model-free algorithms such as Deep Q-Network (DQN) (Mnih et al., 2013) and Proximal Policy Optimization, PPO (Schulman et al., 2017) have become quite popular. PPO is an algorithm that computes the policy directly; DQN finds the value function first (Section 2.1).
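As an illustration of tabular model-free learning, the following sketch shows a basic Q-learning loop with an $\epsilon$-greedy behavior policy. The environment interface (reset/step returning state, reward, done) is an assumption made for illustration, not code from the surveyed papers:

import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection (explore vs. exploit)
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s2, r, done = env.step(a)          # sample the environment
            # model-free temporal-difference update
            Q[s, a] += alpha * (r + gamma * np.max(Q[s2]) - Q[s, a])
            s = s2
    return Q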

Model-free methods select actions in a straightforward manner, without using a separately learned local transition model. An advantage of this straightforward action selection is that they can find global optima without suffering from selection bias due to model imperfections. Model-based methods may not always be able to find policies as good as model-free methods can.

A disadvantage is that interaction with the environment can be costly. Especially when the environment involves the real world, as in real-world robot interaction, sampling should be minimized, both for reasons of cost and to prevent wear of the robot arm. In virtual environments, on the other hand, model-free approaches have been quite successful (Mnih et al., 2015).

An overview of model-free reinforcement learning can be found in (Sutton and Barto, 2018; Çalışır and Pehlivanoğlu, 2019; Kaelbling et al., 1996).

2.3 Planning

When an agent has an internal transition model, then planning algorithms can use it to find the optimal policy, by selecting actions in states, looking ahead, and backing up reward values, see Figure 2 and Figure 4. Planning algorithms require access to an explicit model. In the deterministic case the transition model provides the next state for each of the possible actions in the states; it is a function $s'=T_a(s)$. In the stochastic case, it provides the probability distribution $T_a(s,s')$. The reward model provides the immediate reward for transitioning from state $s$ to state $s'$ after taking action $a$, backing up the value from the child state to the parent state (see the backup diagram in Figure 2). The policy function $\pi(s,a)$ concerns the top layer of the diagram, from $s$ to $a$. The transition function $T_a(s,s')$ covers both layers, from $s$ to $s'$. The transition function defines a space of states in which the planning algorithm can search for the optimal policy $\pi^{\star}$ and value $V^{\star}$.

Figure 4: Planning
Algorithm 2 Value Iteration
Initialize $V(s)$ to arbitrary values
repeat
     for all $s$ do
          for all $a$ do
               $Q[s,a]=\sum_{s'}T_{a}(s,s')(R_{a}(s,s')+\gamma V(s'))$
          end for
          $V[s]=\max_{a}(Q[s,a])$
     end for
until $V$ converges
return $V$

A basic planning approach is Bellman’s dynamic programming (Bellman, 1957, 2013), a recursive traversal method of the state and action space. Value iteration is a straightforward dynamic programming algorithm. The pseudo-code for value iteration is shown in Algorithm 2 (Alpaydin, 2020). It traverses all actions in all states, computing the value of the entire state space, until the value function converges.
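The following minimal Python sketch renders Algorithm 2 for a tabular MDP and additionally returns the greedy policy; the transition tensor T[a, s, s2] and reward tensor R[a, s, s2] are assumed inputs:

import numpy as np

def value_iteration(T, R, gamma=0.95, tol=1e-6):
    """T[a, s, s2] = transition probability, R[a, s, s2] = immediate reward."""
    n_actions, n_states, _ = T.shape
    V = np.zeros(n_states)
    while True:
        # Q[s, a] = sum_s2 T_a(s, s2) * (R_a(s, s2) + gamma * V(s2))
        Q = np.einsum('asz,asz->sa', T, R + gamma * V)
        V_new = Q.max(axis=1)                  # V[s] = max_a Q[s, a]
        if np.max(np.abs(V_new - V)) < tol:    # "until V converges"
            return V_new, Q.argmax(axis=1)     # value function and greedy policy
        V = V_new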

When the agent has an accurate model, planning algorithms can be used to find the best policy. This approach is sample efficient, since a policy is found without further interaction with the environment. A sampling action performed in an environment is irreversible, since state changes of the environment cannot be undone by the agent in the reinforcement learning paradigm. In contrast, a planning action taken in the agent’s local transition model is reversible (Moerland et al., 2020a, 2018). A planning agent can backtrack, a sampling agent cannot. The ability to backtrack is especially useful for trying alternatives that improve on local solutions: local solutions can be found easily by sampling, but finding global optima efficiently requires the ability to backtrack out of a local optimum.

2.4 Model-Based Learning

It is now time to look at model-based reinforcement learning, where the policy and value function are learned by both sampling and planning. Recall that the environment samples return $(s',r')$ pairs when the agent selects action $a$ in state $s$. This means that we can learn the transition model $T_a(s,s')$ and the reward model $R_a(s,s')$, for example by supervised learning, since all information is present. When the transition and reward model are present in the agent, they can then be used with planning to update the policy and value functions as often as we like, using the local functions without any further sampling of the environment (although we might want to continue sampling to further improve our models). This alternative approach of finding the policy and the value is called model-based learning, see Algorithm 3 and Figure 5.
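As a sketch of the model-learning step, the snippet below estimates a tabular transition and reward model from logged samples by counting and averaging; this simple maximum-likelihood estimator is our own illustration, not the specific estimator used in any surveyed paper:

import numpy as np

def fit_tabular_model(samples, n_states, n_actions):
    """samples: list of (s, a, r, s2) tuples gathered from the environment."""
    counts = np.zeros((n_actions, n_states, n_states))
    reward_sum = np.zeros((n_actions, n_states, n_states))
    for s, a, r, s2 in samples:
        counts[a, s, s2] += 1
        reward_sum[a, s, s2] += r
    # T_a(s, s2): empirical transition probabilities
    T = counts / np.maximum(counts.sum(axis=2, keepdims=True), 1)
    # R_a(s, s2): average observed reward per transition
    R = reward_sum / np.maximum(counts, 1)
    return T, R

The learned $T$ and $R$ can then be handed to a planner such as the value iteration sketch above, which is exactly the learn-then-plan loop of Algorithm 3.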

Figure 5: Model-Based Reinforcement Learning
Figure 6: Dyna’s Model-Based Imagination
Algorithm 3 Model-Based Reinforcement Learning
repeat
     Sample env $E$ to generate data $D=(s,a,r',s')$
     Use $D$ to learn $T_{a}(s,s')$ and $R_{a}(s,s')$
     Use $T,R$ to update policy $\pi(s,a)$ by planning
until $\pi$ converges

Why would we want to go this convoluted learning-and-planning route, which may even introduce model bias, if the environment samples can teach us the optimal policy and value directly? The reason is that the convoluted route may be more sample efficient. In model-free learning a sample is used once to optimize the policy and then thrown away; in model-based learning the sample is used to learn a transition model, which can then be used many times in planning to optimize the policy. The sample is used more efficiently by the agent.

A well-known classic model-based approach is imagination, which was introduced by Sutton (1990, 1991) in the Dyna system, long before deep learning was used widely. Dyna uses the samples to update the policy function directly (model-free learning) and also uses the samples to learn a transition model, to augment the model-free environment-samples with the model-based imagined “samples.” Imagination is a hybrid algorithm that uses both model-based planning and model-free learning to improve the behavior policy. Figure 6 illustrates the working of the Dyna approach. Algorithm 4 shows the steps of the algorithm (compared to Algorithm 3, the second line, which updates the policy directly from the samples, is new; it comes from Algorithm 1). Note how the policy is updated twice in each iteration, by environment sampling, and by transition planning. More details are shown in Algorithm 5 (Sutton, 1990).

Algorithm 4 Dyna’s Model-Based Imagination
repeat
     Sample env $E$ to generate data $D=(s,a,r',s')$
     Use $D$ to update policy $\pi(s,a)$
     Use $D$ to learn $T_{a}(s,s')$ and $R_{a}(s,s')$
     Use $T,R$ to update policy $\pi(s,a)$ by planning
until $\pi$ converges
Algorithm 5 Dyna-Q: Classic learning and planning with a Q-function-based dynamics model (Sutton, 1990)
Initialize $Q(s,a)\rightarrow\mathbb{R}$ randomly
Initialize $M(s,a)\rightarrow\mathbb{R}\times S$ randomly ▷ Model
repeat
     Select $s\in S$ randomly
     $a\leftarrow\pi(s)$ ▷ $\pi(s)$ can be $\epsilon$-greedy$(s)$ based on $Q$
     $(s',r)\leftarrow E(s,a)$ ▷ Learn new state and reward from environment
     $Q(s,a)\leftarrow Q(s,a)+\alpha\cdot[r+\gamma\cdot\max_{a'}Q(s',a')-Q(s,a)]$
     $M(s,a)\leftarrow(s',r)$
     for $n=1,\dots,N$ do
          Select $\hat{s}$ and $\hat{a}$ randomly
          $(s',r)\leftarrow M(\hat{s},\hat{a})$ ▷ Plan imagined state and reward from model
          $Q(\hat{s},\hat{a})\leftarrow Q(\hat{s},\hat{a})+\alpha\cdot[r+\gamma\cdot\max_{a'}Q(s',a')-Q(\hat{s},\hat{a})]$
     end for
until $Q$ converges
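A minimal Python sketch in the spirit of Algorithm 5 is shown below. It follows an episodic environment loop rather than the random state selection of the pseudocode, and the environment interface and dictionary-based model are assumptions made for illustration:

import numpy as np

def dyna_q(env, n_states, n_actions, episodes=200, n_planning=10,
           alpha=0.1, gamma=0.95, epsilon=0.1):
    Q = np.zeros((n_states, n_actions))
    model = {}                                   # M(s, a) -> (s', r), learned from experience
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = (np.random.randint(n_actions) if np.random.rand() < epsilon
                 else int(np.argmax(Q[s])))
            s2, r, done = env.step(a)            # real environment sample
            Q[s, a] += alpha * (r + gamma * np.max(Q[s2]) - Q[s, a])
            model[(s, a)] = (s2, r)              # update the learned model
            for _ in range(n_planning):          # imagined "samples" from the model
                (ps, pa), (ps2, pr) = list(model.items())[np.random.randint(len(model))]
                Q[ps, pa] += alpha * (pr + gamma * np.max(Q[ps2]) - Q[ps, pa])
            s = s2
    return Q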

After these introductory words, we are now ready to take a deeper look into recent concrete deep model-based reinforcement learning methods.

3 Survey of Model-Based Deep Reinforcement Learning

The success of model-based reinforcement learning in high-dimensional problems depends on the accuracy of the dynamics model. The model is typically used by planning algorithms for multiple sequential predictions, and errors in predictions accumulate quickly with each step. Model-based reinforcement learning is an active field, and many papers have been published that document progress towards improving model-accuracy. The methods that are proposed in the papers are quite diverse.

We will now present our taxonomy. The taxonomy distinguishes three aspects: (1) application type, (2) learning method, and (3) planning method. Table 1 summarizes the taxonomy, which is the basis of the remainder of this survey. We will now describe the methods, explaining how they fit together by going through the learning, planning, and application that they use.

Application | Learning | Planning
Discrete | CNN/LSTM | End-to-end learning and planning
Discrete | Latent models | Trajectory rollouts
Continuous | Latent models | Trajectory rollouts
Continuous | Uncertainty modeling | Model-predictive control
Continuous | Ensemble models |
Table 1: Taxonomy: Application, Learning, Planning

First, we will describe the way in which the model is learned, and how the accuracy of the model is improved. Among the approaches are uncertainty modeling such as Gaussian processes and ensembles, and convolutional neural networks or latent models (sometimes integrated in end-to-end learning and planning).

Second, we will describe the way in which the model is subsequently used by the planner to improve the behavior policy (Figure 6). These methods aim to reduce the impact of planning with inaccurate models. Among the methods are planning with (short) trajectories, model-predictive control, and end-to-end learning and planning.

Finally, we will describe applications on which model-based methods have been successful. The effectiveness of model-based methods depends on whether they fit the application domain in which they are used, and on further aspects of the application. There are two main types of applications: those with continuous action spaces, and those with discrete action spaces. For continuous action spaces, simulated physics robotics in MuJoCo is a favorite test bed (Todorov et al., 2012). For discrete action spaces many researchers use mazes or blocks puzzles. For large, high-dimensional problems the Arcade Learning Environment is used, where the input consists of the screen pixels, and the output actions are the joystick movements (Bellemare et al., 2013). We will use this taxonomy to categorize and understand the recent literature on high-accuracy model-based reinforcement learning. We list some of the papers in Table 2,

Name | Learning | Planning | Application
PILCO (Deisenroth and Rasmussen, 2011) | Uncertainty | Trajectory | Pendulum
iLQG (Tassa et al., 2012) | Uncertainty | MPC | Small
GPS (Levine and Abbeel, 2014) | Uncertainty | Trajectory | Small
SVG (Heess et al., 2015) | Uncertainty | Trajectory | Small
Local Model (Gu et al., 2016) | Uncertainty | Trajectory | MuJoCo
Visual Foresight (Finn and Levine, 2017) | Video Prediction | MPC | Manipulation
PETS (Chua et al., 2018) | Ensemble | MPC | MuJoCo
MVE (Feinberg et al., 2018) | Ensemble | Trajectory | MuJoCo
Meta Policy (Clavera et al., 2018) | Ensemble | Trajectory | MuJoCo
Policy Optim (Janner et al., 2019) | Ensemble | Trajectory | MuJoCo
PlaNet (Hafner et al., 2018) | Latent | MPC | MuJoCo
Dreamer (Hafner et al., 2019) | Latent | Trajectory | MuJoCo
Plan2Explore (Sekar et al., 2020) | Latent | Trajectory | MuJoCo
Video-prediction (Oh et al., 2015) | Latent | Trajectory | Atari
VPN (Oh et al., 2017) | Latent | Trajectory | Atari
SimPLe (Kaiser et al., 2019) | Latent | Trajectory | Atari
Dreamer-v2 (Hafner et al., 2020) | Latent | Trajectory | Atari
MuZero (Schrittwieser et al., 2020) | Latent | e2e/MCTS | Atari/Go
VIN (Tamar et al., 2016) | CNN | e2e | Mazes
VProp (Nardelli et al., 2018) | CNN | e2e | Mazes
Planning (Guez et al., 2019) | CNN/LSTM | e2e | Mazes
TreeQN (Farquhar et al., 2018) | Latent | e2e | Mazes
I2A (Weber et al., 2017) | Latent | e2e | Mazes
Predictron (Silver et al., 2017b) | Latent | e2e | Mazes
World Model (Ha and Schmidhuber, 2018b) | Latent | e2e | Car Racing
Table 2: Overview of High-Accuracy Model-Based Reinforcement Learning Methods; Top: Continuous/Policy-based, Bottom: Discrete/Value-based

which provides an overview of many of the methods that we discuss in this survey. We will explain the main issues and challenges in the field step by step, using the taxonomy as guideline, illustrating solutions to these issues and challenges with approaches from the papers from the table.

Figure 7: Influence of Model-Based Deep Reinforcement Learning Approaches; Top: Continuous/Policy-based (MuJoCo), Bottom: Discrete/Value-based (Mazes, Atari); Red: Uncertainty, Blue: Ensemble, Green: Latent Models, Dashed: end-to-end, Bold: Large Problems

Figure 7 illustrates how the approaches of the papers influence each other. Note that, as is often the case in reinforcement learning, the influence has two origins: policy-based methods for continuous action spaces (robotics, upper part), and value-based methods for discrete action spaces (games, lower part). The colors in the figure refer to approaches that are also listed in Table 2.

Let us now start with the taxonomy. We begin with learning, next is planning, and finally applications.

3.1 Learning

The transition model is what gives model-based reinforcement learning its name. The accuracy of the model is of great importance: planning with inaccurate models will not improve the policy much, planning with a biased model will even harm the policy, and performance of model-based methods will then be worse than the model-free baseline (Gu et al., 2016). Getting high-accuracy models with few samples is challenging when the model has many parameters, since, in order to prevent overfitting, we would need many environment observations (high sample complexity).

In this section we will describe techniques that have been developed to improve model accuracy. We will discuss:

  1. uncertainty modeling,
  2. ensemble methods,
  3. latent models.

For environments with continuous action spaces and non-determinism, such as robotics, uncertainty modeling and ensembles have shown progress. Latent models were developed in both continuous and discrete action spaces.

3.1.1 Uncertainty Modeling

One of the shortcomings of conventional reinforcement learning methods is that they only focus on the expected value, ignoring the variance of values. This is problematic when few samples are taken for each trajectory $\tau$. Uncertainty modeling methods have been developed to counter this problem. Gaussian processes can learn simple processes with good sample efficiency, although for high-dimensional problems they need many samples. They have been used for probabilistic inference to learn control (Deisenroth and Rasmussen, 2011) in the PILCO system. This system was effective on Cartpole and Mountain car (Figure 11), but does not scale to larger problems.

A related method uses nonlinear least-squares optimization (Tassa et al., 2012). Here the model learner uses quadratic approximation on the reward function, which is then used with linear approximation of the transition function. With further enhancements this method was able to teach a humanoid robot how to stand up (see Figure 9).

We can also sample from a trajectory distribution optimized for cost, and use that to train the policy with a policy-based method (Levine and Koltun, 2013). Then we can optimize policies with the aid of locally-linear models and a stochastic trajectory optimizer. This approach, called Guided policy search (GPS), has been shown to train complex policies with thousands of parameters on MuJoCo tasks such as swimming, hopping, and walking. Alternatively, we can compute value gradients along the real environment trajectories, instead of planned ones, and re-parameterize the trajectory through sampling, to mitigate learned model inaccuracy (Heess et al., 2015). This was done by Stochastic value gradients (SVG) with global neural network value function approximators.

Learning arm and hand manipulation directly from video camera input is a challenging problem in robotics. The camera image provides a high-dimensional input and increases the problem size and the complexity of the subsequent manipulation task substantially. Finn and Levine (2017) and Ebert et al. (2018) introduce a method called Visual foresight. This system uses a training procedure where data is sampled according to a probability distribution. Concurrently, a video prediction model is trained. This model generates a sequence of future frames based on an image and a sequence of actions, as in GPS. At test time, the least-cost sequence of actions is selected in a model-predictive control planning framework (see Section 3.2.2). This approach is able to perform multi-object manipulation, pushing, picking and placing, and cloth-folding tasks (which adds the difficulty of material that changes shape as it is being manipulated).

3.1.2 Ensembles

Ensemble methods, such as a random forest of decision trees (Ho, 1995), are widely used in machine learning (Bishop, 2006), and they are also used for controlling uncertainty in high-dimensional modeling. Ensemble methods mitigate variance and improve performance by running algorithms multiple times. They are used with success in model-based deep reinforcement learning as well. Chua et al. (2018) combine uncertainty-aware modeling with sampling-based uncertainty propagation, creating a method called Probabilistic ensembles with trajectory sampling, PETS. (This approach is described in the next section, see Algorithm 6.) An ensemble of probabilistic neural network models is used by Nagabandi et al. (2018). Ensembles perform well; performance on pusher, reacher, and half-cheetah (see Figure 9) is reported to approach asymptotic model-free baselines such as PPO (Schulman et al., 2017). Ensembles of probabilistic networks (Chua et al., 2018) are also used with short rollouts, where the model horizon is shorter than the task horizon (Janner et al., 2019). Results have been reported for hopper, walker, and half-cheetah, again matching the performance of model-free approaches.
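The core mechanics of an ensemble dynamics model can be sketched as follows. The bootstrapped linear models and the disagreement measure are illustrative assumptions; real systems such as PETS use ensembles of probabilistic neural networks instead:

import numpy as np

def fit_ensemble(X, Y, k=5, rng=np.random.default_rng(0)):
    """Fit k linear dynamics models s' ~ W [s, a] on bootstrap resamples of the data."""
    models = []
    n = len(X)
    for _ in range(k):
        idx = rng.integers(0, n, size=n)                 # bootstrap resample
        W, *_ = np.linalg.lstsq(X[idx], Y[idx], rcond=None)
        models.append(W)
    return models

def predict_with_uncertainty(models, x):
    """Return mean prediction and ensemble disagreement (std) for input x = [s, a]."""
    preds = np.stack([x @ W for W in models])
    return preds.mean(axis=0), preds.std(axis=0)         # disagreement ~ model uncertainty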

The ensemble approach is related to meta-learning, where we try to speed up learning a new task by learning from previous, related, tasks (Brazdil et al., 2008; Hospedales et al., 2020; Huisman et al., 2021). MAML is a popular meta-learning approach (Finn et al., 2017) that attempts to learn a network initialization $x$ such that for any task $M_k$ the policy attains maximum performance after one policy gradient step. The MAML approach can be used to improve model accuracy by learning an ensemble of dynamics models and by then meta-optimizing the policy for adaptation in each of the learned models (Clavera et al., 2018). Results indicate that such meta-learning of a policy over an ensemble of learned models indeed approaches the level of performance of model-free methods with substantially better sample complexity.

3.1.3 Latent Models

The next group of methods that we describe are the latent models. Central to all our approaches is the need for improvement of model accuracy in complex, high-dimensional, problems. The main challenge to achieve high accuracy is to overcome the size of the high-dimensional state space. The idea behind latent models is that in most high-dimensional environments there are elements that are less important, such as background trees that never move, that have little or no relation with the reward of the agent’s actions. The goal of latent models is to abstract away these unimportant elements of the input space, reducing the effective dimensionality of the space. They do so by learning the relation between the elements of the input and the reward. When we focus our learning mechanism on the changes in observations that are correlated with changes in these values, then we can improve the efficiency of learning high-dimensional problems greatly. Latent models thus learn a smaller representation, smaller than the observation space. Planning takes place in this smaller representation space.

The value prediction network (VPN) was introduced by Oh et al. (2015, 2017) to achieve this goal. They ask the question in their paper: “What if we could predict future rewards and values directly without predicting future observations?” and describe a network architecture and learning method for such focused value prediction models. The core idea is not to learn directly in actual observation space, but first to transform the actual state representation to a smaller latent representation model, also known as an abstract model. The other functions, such as value, reward, and next-state, then work with the smaller latent representations, instead of the actual high-dimensional states. By training all functions based on the values (Grimm et al., 2020), planning and learning occur in a space where states are encouraged only to contain the elements that influence value changes. In VPN the latent model consists of four networks: an encoding function, a reward function, a value function, and a transition function. All functions are parameterized with their own set of parameters (Figure 8). Latent space is lower-dimensional, and training and planning become more efficient. The figure shows a single-step rollout, planning one step ahead, as in Dyna-Q (Algorithm 5).

Figure 8: Architecture of latent model (Oh et al., 2017)

The training of the networks can in principle be performed with any value-based reinforcement learning algorithm. Oh et al. (2017) report results with $n$-step Q-learning and temporal difference search (Silver et al., 2012).
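To make the four-network structure concrete, the following sketch shows a VPN-style latent one-step lookahead in plain numpy; the tiny linear "networks" and the dimensions are placeholders for illustration, not the actual VPN architecture:

import numpy as np

rng = np.random.default_rng(0)
obs_dim, latent_dim, n_actions = 64, 8, 4

# Four parameterized functions, here simply random linear maps as stand-ins
W_enc = rng.normal(size=(obs_dim, latent_dim))              # encoding: observation -> latent state
W_val = rng.normal(size=(latent_dim,))                      # value: latent state -> V
W_rew = rng.normal(size=(n_actions, latent_dim))            # reward: latent state, action -> r
W_trans = rng.normal(size=(n_actions, latent_dim, latent_dim))  # transition: latent state, action -> next latent

def one_step_lookahead(observation):
    z = observation @ W_enc                                  # abstract (latent) state
    # evaluate every action one step ahead in latent space, as in a single-step rollout
    q = np.array([W_rew[a] @ z + 0.99 * ((W_trans[a] @ z) @ W_val)
                  for a in range(n_actions)])
    return int(np.argmax(q)), q

best_action, q_values = one_step_lookahead(rng.normal(size=obs_dim))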

VPN (Oh et al., 2017) showed impressive results on Atari games such as Pacman and Seaquest, outperforming model-free DQN (Mnih et al., 2015), and outperforming observation-based planning in stochastic domains. Subsequently, many other works have been published that further improved results (Kaiser et al., 2019; Hafner et al., 2018, 2019, 2020; Sekar et al., 2020; Silver et al., 2017b; Ha and Schmidhuber, 2018b). Many of these latent-model approaches are complicated designs, with multiple neural networks, and different learning and planning algorithms.

The latent-model approach is related to world models, a term used by Ha and Schmidhuber (2018a, b). World models are inspired by the manner in which humans are thought to construct a mental model of the world in which we live. World models are often generative recurrent neural networks that are trained unsupervised using a variational autoencoder (Kingma and Welling, 2013, 2019; Goodfellow et al., 2014) and a recurrent network. They learn a compressed spatial and temporal representation of the environment. In world models multiple neural networks are used, for a vision model, a memory model, and a controller (Ha and Schmidhuber, 2018b). By using features extracted from the world model as inputs to the agent, a compact and simple policy can be trained to solve a task, and planning occurs in the compressed or simplified world. World models have been applied by Ha and Schmidhuber (2018a, b) to a car racing game and to VizDoom (Kempka et al., 2016). The term world model actually goes back to 1990, when it was used by Schmidhuber (1990b). Latent models are also related to dimensionality reduction (Van Der Maaten et al., 2009).

The architecture of latent models, or world models, is elaborate. The dynamics model typically includes an observation model, a representation model, a transition model, and a value or reward model (Karl et al., 2016; Buesing et al., 2018; Doerr et al., 2018). The task of the observation model is to reduce the high-dimensional world into a lower-dimensional world, to allow more efficient planning. Often a variational autoencoder or LSTM is used.

The Arcade learning environment is one of the main benchmarks in reinforcement learning. The high-dimensionality of Atari video input has long been problematic for model-based reinforcement learning. Latent models were instrumental in reducing the dimensionality of Atari, producing the first successes for model-based approaches on this major benchmark.

Related to the VPN approach (Oh et al., 2015, 2017) is other work that uses latent models on Atari, such as that of Kaiser et al. (2019), which is aimed at video prediction, outperforming model-free baselines (Hessel et al., 2017) and reaching comparable accuracy with up to an order of magnitude better sample efficiency. The approach by Kaiser et al. (2019) uses a variational autoencoder (VAE) to process input frames, conditioned on the actions of the agent, to learn the world model, using PPO (Schulman et al., 2017). The policy $\pi$ is then improved by planning inside the reduced world model, with short rollouts. This behavior policy $\pi$ then determines the actions $a$ to be used for learning from the environment.

Latent models are also used on continuous MuJoCo problems. Here the work by Hafner et al. (2018, 2019) on the PlaNet and Dreamer systems is noteworthy, including the application of their work back to Atari (Hafner et al., 2020), where it achieved human-level performance. PlaNet uses a Recurrent state space model (RSSM) that consists of a transition model, an observation model, a variational encoder, and a reward model (Karl et al., 2016; Buesing et al., 2018; Doerr et al., 2018). Based on these models a Model-predictive control agent is used to adapt its plan, replanning at each step (Richards, 2005). The RSSM is used by a Cross entropy method search (Botev et al. (2013), CEM) for the best action sequence. In contrast to model-free approaches, no explicit policy or value function network is used; the policy is implemented as MPC planning with the best sequence of future actions. PlaNet is tested on continuous tasks and reaches performance that is close to strong model-free algorithms. A further system, called Dreamer (Hafner et al., 2019), builds on PlaNet. Using an actor-critic approach (Mnih et al., 2016) and backpropagating value gradients through predicted sequences of compact model states, the improved system solves a diverse collection of continuous problems from the DeepMind control suite (Tassa et al., 2018), see Figure 9. Dreamer is also applied to discrete problems from the Arcade learning environment, and to few-shot learning (Sekar et al., 2020). A further improvement achieved human-level performance on 55 Atari games, a first for a model-based approach (Hafner et al., 2020), showing that the latent model approach is well-suited for high-dimensional problems.

3.2 Planning

After the transition model has been learned, it will be used with a planning algorithm so that the behavior policy can be improved. Since the transition model will contain some inaccuracy, the challenge is to find a planning algorithm that performs well despite the inaccuracies. We describe three groups of methods that have been developed for planning algorithms to cope with inaccurate models. These are:

  1. trajectory rollouts,
  2. model-predictive control,
  3. end-to-end learning and planning.

Trajectory rollouts and model-predictive control have been shown to work for both continuous and discrete action spaces; end-to-end learning and planning has been developed in the context of discrete action spaces (mazes and games).

Of the three planning methods, we start with trajectory rollouts.

3.2.1 Trajectory Rollouts

As we saw in Section 2.1, methods for continuous action spaces typically sample full trajectory rollouts to get stable actions. At each planning step, the transition model $T_a(s) \rightarrow s'$ computes the new state, using the reward to update the policy. Due to the inaccuracies of the internal model, planning algorithms that perform many steps will quickly accumulate model errors (Gu et al., 2016). Full rollouts of long and inaccurate trajectories are therefore problematic. We can reduce the impact of accumulated model errors by not planning too far ahead. For example, Gu et al. (2016) perform experiments with locally linear models that roll out planning trajectories of length 5 to 10. This reportedly works well for MuJoCo tasks gripper and reacher.
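A short imagined rollout under a learned model can be sketched as below; the model, policy, and value-function call signatures are assumptions, and the truncated horizon keeps model errors from accumulating, as discussed above:

def imagined_rollout(model, policy, value_fn, s, horizon=5, gamma=0.99):
    """Roll the learned model forward for a few steps and bootstrap with a value estimate.

    model(s, a)  -> (s_next, r_pred)   # learned transition and reward model
    policy(s)    -> a                  # current behavior policy
    value_fn(s)  -> V(s)               # model-free value estimate for the tail
    """
    ret, discount = 0.0, 1.0
    for _ in range(horizon):                 # short horizon: limit model-error accumulation
        a = policy(s)
        s, r = model(s, a)
        ret += discount * r
        discount *= gamma
    return ret + discount * value_fn(s)      # MVE-style: model-based head, model-free tail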

In their work on model-based value expansion (MVE), Feinberg et al. (2018) also allow imagination to a fixed depth; value estimates are split into a near-future model-based component and a distant-future model-free component. They experiment with model horizons of 1, 2, and 10. They find that 10 generally performs best on typical MuJoCo tasks such as swimmer, walker, and cheetah. The sample complexity in their experiments is better than that of model-free methods such as DDPG (Silver et al., 2014). Similarly good results are reported by Janner et al. (2019) and Kalweit and Boedecker (2017); both approaches use a model horizon that is much shorter than the task horizon.

3.2.2 Model-Predictive Control

Taking the idea of using shorter trajectories for planning than for learning further, we arrive at Model-predictive control (MPC) (Kwon et al., 1983; Garcia et al., 1989). Model-predictive control is a well-known approach in process engineering, used to control complex processes with frequent re-planning over a limited time horizon. Model-predictive control uses the fact that while many real-world processes are not linear, they are approximately linear over a small operating range. Applications are found in the automotive industry and in aerospace, for example in terrain-following and obstacle-avoidance algorithms (Kamyar and Taheri, 2014). In optimal control, four MPC approaches are identified: linear model MPC, nonlinear prediction model MPC, explicit control law MPC, and robust MPC to deal with disturbances (Garcia et al., 1989). In this survey, we focus on how the principle of continuous replanning with a rolling planning horizon performs in nonlinear model-based reinforcement learning.

It is instructive to compare the MPC and linear quadratic regulator (LQR) approaches, since both methods come from the field of optimal control. MPC computes the target function with a small time window that rolls forward as new information comes in; it is dynamic. LQR computes the target function in a single episode, using all available information; it is static. We observe that in model-based reinforcement learning MPC is used in the planning part, with the behavior policy $\pi$ being the target and the transition function $T_a(\cdot)$ the input; for LQR the transition function $T_a(\cdot)$ is the target, and the environment samples $(s_t,r_t)$ are the input. Thus, one could conceivably use both MPC and LQR, the first as planning and the second as learning algorithm, in a model-based approach.

An iterative form of LQG has indeed been used together with MPC on a smaller MuJoCo problem (Tassa et al., 2012), achieving good results. MPC used step-by-step real-time local optimization; Tassa et al. (2012) used many further improvements to the trajectory optimization, physics engine, and cost function to achieve good performance.

MPC has also been used in other model learning approaches. Both Finn and Levine (2017); Ebert et al. (2018) use a form of MPC in the planning for their Visual foresight robotic manipulation system (that we have seen in a previous section). The MPC part uses a model that generates the corresponding sequence of future frames based on an image to select the least-cost sequence of actions.

Algorithm 6 PETS MPC (Chua et al., 2018)
Initialize data $D$ with a random controller for one trial
for Trial $k=1$ to $K$ do
    Train a PE dynamics model $\widetilde{T}$ given $D$
    for Time $t=0$ to TaskHorizon do
         for Actions sampled $\vec{a}_{t:t+T}\sim\text{CEM}(\cdot)$, 1 to NSamples do
              Propagate state particles $\vec{s}_{\tau}^{p}$ using TS and $\widetilde{T}|\{D,\vec{a}_{t:t+T}\}$
              Evaluate actions as $\sum_{\tau=t}^{t+T}\tfrac{1}{P}\sum_{p=1}^{P}r(\vec{s}_{\tau}^{p},\vec{a}_{\tau})$
              Update $\text{CEM}(\cdot)$ distribution
         end for
         Execute first action $\vec{a}^{*}_{t}$ (only) from optimal actions $\vec{a}^{*}_{t:t+T}$
         Record outcome: $D\leftarrow D\cup\{\vec{s}_{t},\vec{a}^{*}_{t},\vec{s}_{t+1}\}$.
    end for
end for

Another approach uses ensemble models for learning the transition model, while using MPC for planning. PETS (Chua et al., 2018) uses probabilistic ensembles (Lakshminarayanan et al., 2017) for learning. In MPC fashion only the first action from the CEM-optimized sequence is used, re-planning at every time-step (see Algorithm 6).
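The inner loop of such an MPC planner, a cross-entropy method search over action sequences, can be sketched as follows. The learned-model interface, the Gaussian search distribution, and all dimensions are illustrative assumptions rather than the exact PETS implementation; in particular, a single deterministic model stands in here for the probabilistic ensemble with trajectory sampling:

import numpy as np

def cem_mpc_action(model, reward_fn, s0, horizon=12, n_samples=200,
                   n_elite=20, n_iters=5, act_dim=2, rng=np.random.default_rng(0)):
    """Return the first action of the best action sequence found by CEM planning."""
    mu = np.zeros((horizon, act_dim))                 # mean of the action-sequence distribution
    sigma = np.ones((horizon, act_dim))               # std of the action-sequence distribution
    for _ in range(n_iters):
        seqs = rng.normal(mu, sigma, size=(n_samples, horizon, act_dim))
        returns = np.zeros(n_samples)
        for i, seq in enumerate(seqs):                # evaluate each sequence in the learned model
            s = s0
            for a in seq:
                s = model(s, a)
                returns[i] += reward_fn(s, a)
        elite = seqs[np.argsort(returns)[-n_elite:]]  # keep the best sequences
        mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mu[0]                                      # MPC: execute only the first action, then replan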

MPC is a simple and effective planning method that is well-suited for model inaccuracy, by restricting the planning horizon. MPC is used with success in model-based reinforcement learning, with high-variance or complex transition models. MPC has also been used with success in combination with latent models (Hafner et al., 2018; Kaiser et al., 2019).

3.2.3 End-to-End Learning and Planning

Our third planning approach is different: it integrates planning with learning in a fully differentiable algorithm. Let us see how this works.

Model-based reinforcement learning consists of two distinct functions: learning the transition model and planning with the model to improve the behavior policy (Figure 5). In classical, tabular, reinforcement learning, both learning and planning functions are designed by hand (Sutton and Barto, 2018). In deep reinforcement learning, one of these functions is approximated by deep learning—the model learning—while the planner is still hand-written. End-to-end learning and planning breaks this classical planning barrier. End-to-end approaches integrate the planning into deep learning, using differentiable planning algorithms, extending the backpropagation fully from reward to observation in all parts of the model-based approach.

How can a neural network learn to plan? While conceptually exciting and appealing, there are challenges to overcome. Among them are finding suitable differentiable planning algorithms and the increase in computational training complexity, since now the planner must also be learned.

The idea of planning by gradient descent has existed for some time; several authors explored learning approximations of state transition dynamics in neural networks (Kelley, 1960; Schmidhuber, 1990a; Ilin et al., 2007). Neural networks are typically used to transform and filter, to learn selection and classification tasks. A planner unrolls a state, computes values, using selection and value aggregation, and backtracks to try another state. Although counter-intuitive at first, these operations are not that different from what classic neural networks perform. A progression of papers has published methods on how this can be achieved.

We will start at the beginning, with convolutional neural networks (CNN) and value iteration. We will see how the iterations of value iteration can be implemented in the layers of a convolutional neural network (CNN). Next, two variations of this method are presented, and a way to implement planning with convolutional LSTM modules. All these approaches implement differentiable, trainable, planning algorithms, that can generalize to different inputs. The later methods use elaborate schemes with latent models so that the learning can be applied to different application domains.

Let us start to see what is possible with a CNN. A CNN can be used to implement value iteration. This was first shown by Tamar et al. (2016), who introduced value iteration networks (VIN). The core idea is that value iteration (VI, see Algorithm 2) can be implemented step-by-step by a multi-layer convolutional network: each layer does a step of lookahead. In this way VI is implemented in a CNN. The VI iterations for the Q-action-value function are rolled out in the network layers $Q$ with $A$ channels. Through backpropagation the model learns the value function. The aim is to learn a general model that can navigate in unseen environments.
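One such VI layer can be sketched in plain numpy as below. For brevity the reward and value maps are summed rather than stacked as input channels, and the per-action 3x3 kernels stand in for the learned convolution weights of a real VIN:

import numpy as np

def vin_layer(V, R, kernels):
    """One value-iteration step as a convolution: Q_a = K_a * (R + V), V <- max_a Q_a.

    V, R    : (H, W) value and reward maps over the grid
    kernels : (A, 3, 3) one 3x3 filter per abstract action
    """
    H, W = V.shape
    padded = np.pad(R + V, 1)                      # zero-pad the border for the 3x3 convolution
    Q = np.empty((len(kernels), H, W))
    for a, k in enumerate(kernels):
        for i in range(H):
            for j in range(W):
                Q[a, i, j] = np.sum(k * padded[i:i+3, j:j+3])
    return Q.max(axis=0)                           # backup: V[s] = max_a Q[s, a]

# stacking this layer k times corresponds to k steps of lookahead, trained end-to-end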

VIN can be used for discrete and continuous path planning, and has been tried in grid world problems and natural language tasks. VIN has achieved generalization of finding shortest paths in unseen mazes. However, a limitation of VIN is that the number of layers of the CNN restricts the number of planning steps, restricting VINs to small and low-dimensional domains. Follow-up studies focus on making end-to-end learning and planning more generally applicable. Schleich et al. (2019) extend VINs by adding abstraction, and Srinivas et al. (2018) introduce universal planning networks, UPN, which generalize to modified robot morphologies. Value propagation (Nardelli et al., 2018) uses a hierarchical structure to generalize end-to-end methods to large problems. TreeQN (Farquhar et al., 2018) incorporates a recursive tree structure in the network, modeling the different functions of an MDP explicitly. TreeQN is applied to Sokoban and nine Atari games.

A further step is to model more complex planning algorithms, such as Monte Carlo Tree Search (MCTS), a successful planning algorithm (Coulom, 2006; Browne et al., 2012). This has been achieved to a certain extent by Guez et al. (2018), who implement many elements of MCTS in MCTSnets, and by Guez et al. (2019). In the latter method planning is learned with a general recurrent architecture consisting of LSTMs and a convolutional network (Schmidhuber, 1990b) in the form of a stack of ConvLSTM modules (Xingjian et al., 2015). The architecture was used on Sokoban and boxworld (Zambaldi et al., 2018), and was able to perform full planning steps. Future work should investigate how to achieve sample efficiency with this architecture.

The question whether model-based planning can be learned by a neural network has been studied by Pascanu et al. (2017), who showed that imagination-based planning steps can indeed be learned for a small game with an LSTM. Related to this, the imagination-augmented agents (I2A) architecture has been designed as a fully end-to-end differentiable architecture for model-based imagination and model-free reinforcement learning (Weber et al., 2017). It consists of an LSTM-based encoder (Chiappa et al., 2017; Buesing et al., 2018), a ConvLSTM rollout module, and a standard CNN-based model-free path. The policy improvement algorithm is A3C. Weber et al. (2017) report that on Sokoban and Pacman I2A performs better than model-free learning and MCTS. I2A has been specifically designed to handle model imperfections well and uses a manager or meta-controller to choose between rolling out actions in the environment or by imagination (Hamrick et al., 2017).

In VIN there is a tight connection between the network architecture and the application structure. One way to remedy this restriction is with a latent model, such as the ones that were discussed earlier. One of the first attempts is the Predictron (Silver et al., 2017b), where the familiar four elements appear: a representation model, a transition model, a reward model, and a value model. The goal of the latent model is to perform value prediction (not state prediction), including being able to encode special events such as “staying alive” or “reaching the next room.” Predictron performs limited-horizon rollouts, and has been applied to procedurally generated mazes.

One of the main success stories of model-based reinforcement learning is AlphaZero (Silver et al., 2017a, 2018). AlphaZero combines planning and learning in a highly successful way. Inspired by this approach, a fully differentiable version of this architecture has been introduced by Schrittwieser et al. (2020), named MuZero. This system is able to learn the rules and to learn to play games as different as Atari, chess, and Go, purely from the environment, with end-to-end learning and planning. The MuZero architecture is based on Predictron, with an abstract model consisting of a representation, transition, reward, and prediction function (policy and value). For planning, MuZero uses an explicitly coded version of MCTS that uses policy and value input from the network (Rosin, 2011), but that is executed separately from the network.

End-to-end planning and learning has shown impressive results, but there are still open questions concerning the applicability to different applications, and especially the scalability to larger problems.

3.3 Applications

After we have discussed in some depth the learning and planning methods, we must look at the third element of the taxonomy: the applications. We will see that the type of application plays an important role in the success of the learning and planning methods.

Model-based reinforcement learning is applied to sequential decision problems. Which types of sequential decision problems can we distinguish? Two main application areas are robotics and games. The actions in robotics are continuous, and the environment is non-deterministic. The actions in games are typically discrete and the environment is often deterministic. We will describe four application areas:

  1. continuous action space
     (a) small tasks
     (b) large tasks
  2. discrete action space
     (a) low-dimensional input
     (b) high-dimensional input

3.3.1 Continuous Actions

Sequential decision problems are well-suited to model robotic actions, such as how to move the joints in a robotic arm to pour a cup of tea, how to move the joints of a humanoid figure to stand up when lying down, and how to develop gaits of a four-legged animal. The action space of such problems is continuous since the angles over which robotic joints move span a continuous range of values. Furthermore, the environment in which robots operate mimics the real world, and is non-deterministic. Things move, objects do not always respond in a predictable fashion, and unexpected situations arise.

In reinforcement learning, where agent algorithms are trained by the feedback on their many actions, working with real robots would get prohibitively expensive due to wear. Most reinforcement learning systems use physics simulations such as offered by MuJoCo (Todorov et al., 2012). MuJoCo allows the creation of experiments that provide environments for an agent. Tasks can range from small to large.

Figure 9: DeepMind Control Suite. Top: Acrobot, Ball-in-cup, Cart-pole, Cheetah, Finger, Fish, Hopper. Bottom: Humanoid, Manipulator, Pendulum, Point-mass, Reacher, Swimmer (6 and 15 links), Walker
Figure 10: Half-Cheetah

Small

MuJoCo tasks differ in difficulty, depending on how many joints or degrees of freedom are modeled, and which task is being learned. Figure 9 shows some of the tasks that have been modeled in MuJoCo as part of the DeepMind Control Suite (Tassa et al., 2018). Some of the small tasks are ball-in-cup and reacher. The iterative quadratic non-linear optimization method iLQG (Tassa et al., 2012) is able to teach a humanoid to stand up, and Guided policy search (Levine and Koltun, 2013) and Stochastic value gradients (Heess et al., 2015) can learn tasks such as swimmer, reacher, half-cheetah (Figure 10) and walker. Also ensemble methods such as PETS, MVE, and meta ensembles achieve good results on these applications  (Gu et al., 2016; Chua et al., 2018; Feinberg et al., 2018; Clavera et al., 2018; Janner et al., 2019).

Large

MuJoCo has enabled progress in model-based reinforcement learning on small continuous tasks, and also on larger tasks, such as developing gaits for a four-legged robotic animal or scaling an obstacle course. PlaNet (Hafner et al., 2018), Dreamer (Hafner et al., 2019), and MBPO (Janner et al., 2019) achieve good results on these more complicated MuJoCo tasks by using latent models to reduce the dimensionality.
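The following minimal sketch illustrates the core idea behind such latent-model methods: observations are encoded into a compact latent state, and trajectories are then "imagined" entirely inside the latent model, without calling the simulator. The encoder and latent transition are random placeholders standing in for the trained networks of methods such as PlaNet or Dreamer.

import numpy as np

# Latent-space "imagination" rollout (sketch with random placeholder networks).
rng = np.random.default_rng(0)
OBS_DIM, LATENT_DIM, ACTION_DIM = 64, 8, 2
W_enc = rng.normal(size=(LATENT_DIM, OBS_DIM)) * 0.1
W_lat = rng.normal(size=(LATENT_DIM, LATENT_DIM + ACTION_DIM)) * 0.1
w_rew = rng.normal(size=LATENT_DIM) * 0.1

def encode(observation):
    # Compress a high-dimensional observation into a compact latent state.
    return np.tanh(W_enc @ observation)

def latent_step(latent, action):
    # Predict the next latent state and reward without touching the simulator.
    nxt = np.tanh(W_lat @ np.concatenate([latent, action]))
    return nxt, float(w_rew @ nxt)

def imagine(observation, policy, horizon=15):
    # Roll out a policy entirely inside the latent model ("in imagination").
    z, total = encode(observation), 0.0
    for _ in range(horizon):
        a = policy(z)
        z, r = latent_step(z, a)
        total += r
    return total

# A Dreamer-style agent would improve its policy from such imagined returns;
# here we merely evaluate a random policy.
random_policy = lambda z: rng.uniform(-1, 1, size=ACTION_DIM)
imagined_return = imagine(np.ones(OBS_DIM), random_policy)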

3.3.2 Discrete Actions

There is a long tradition in reinforcement learning of teaching computers to play complicated games and puzzles (Plaat, 2020). Games and puzzles are often played on a board with discrete squares. Actions in such games are discrete: a move to square e3 is not a move to square e4. The environments are also deterministic; we assume that pieces do not move by themselves.

Most games that are used in deep model-based reinforcement learning papers fall into this category. More complex games, such as partial information (card games such as poker (Brown and Sandholm, 2019)) or games with multiple actors (real-time strategy video games such as StarCraft (Vinyals et al., 2019; Ontanón et al., 2013; Wong et al., 2021)) are not used in the approaches that we survey here.

Figure 11: Cart Pole and Mountain Car

Low-Dimensional

Among the low-dimensional applications used in model-based reinforcement learning are simple pendulum-like problems such as Cartpole and Mountain car, where the challenge is to reverse engineer the laws of impulse and gravity (Figure 11). The action space consists of two discrete actions, push left or push right; the environment is continuous and deterministic. PILCO (Deisenroth and Rasmussen, 2011) achieves good results on the pendulum task with Gaussian process modeling and gradient-based planning.
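For reference, the interaction loop for such a task looks roughly as follows, assuming the classic OpenAI Gym interface (Brockman et al., 2016); newer Gym and Gymnasium releases return additional values from reset() and step(), so the unpacking below may need adjustment.

import gym

env = gym.make("CartPole-v1")
observation = env.reset()
total_reward, done = 0.0, False
while not done:
    action = env.action_space.sample()                 # 0 = push left, 1 = push right
    observation, reward, done, info = env.step(action)
    total_reward += reward
env.close()
print("random-policy return:", total_reward)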

Figure 12: Taxi in a Grid world (Dietterich, 1998; Kansal and Martin, 2018)

Perhaps the most frequently used low-dimensional application area is the grid world, in which various navigation tasks are tested (Figure 12). VIN (Tamar et al., 2016), VProp (Nardelli et al., 2018), and the Predictron (Silver et al., 2017b) use maze navigation to test their approaches to integrating end-to-end learning and planning. Grid worlds and mazes can be designed and scaled in different forms and sizes, making them well suited for testing new ideas.
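The planning computation that these architectures embed as a differentiable module is ordinary value iteration on the grid. The following small sketch makes that computation explicit; the maze layout, rewards, and discount factor are illustrative choices of ours.

import numpy as np

# Explicit value iteration on a tiny grid world; VIN-style architectures
# approximate this computation with convolutional layers.
GRID = np.array([
    [0,  0, 0,  0],
    [0, -1, 0, -1],   # -1 marks walls
    [0,  0, 0,  0],
])
GOAL, GAMMA = (2, 3), 0.95
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right

def value_iteration(n_iters=50):
    V = np.zeros(GRID.shape)
    for _ in range(n_iters):
        new_V = np.zeros_like(V)
        for r in range(GRID.shape[0]):
            for c in range(GRID.shape[1]):
                if GRID[r, c] == -1 or (r, c) == GOAL:
                    continue                              # walls and the goal keep value 0
                candidates = []
                for dr, dc in MOVES:
                    nr, nc = r + dr, c + dc
                    if not (0 <= nr < GRID.shape[0] and 0 <= nc < GRID.shape[1]) or GRID[nr, nc] == -1:
                        nr, nc = r, c                     # bumping into a wall: stay put
                    step_reward = 1.0 if (nr, nc) == GOAL else -0.04
                    candidates.append(step_reward + GAMMA * V[nr, nc])
                new_V[r, c] = max(candidates)             # Bellman optimality backup
        V = new_V
    return V

print(np.round(value_iteration(), 2))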

Figure 13: Sokoban Puzzle (Chao, 2013)

Other low-dimensional games are board games such as chess and Go. These games are low-dimensional (their input has few attributes, in contrast to a mega-pixel image) but nevertheless have a large state space. Finding good policies for chess and Go was one of the most challenging feats in reinforcement learning (Campbell et al., 2002; Silver et al., 2016; Plaat, 2020). Block puzzles such as Sokoban (Figure 13) are also often used to test reinforcement learning methods. Sokoban is a block-pushing puzzle that derives much of its complexity from the fact that the agent can push a box but cannot pull it back to undo a mistake, giving rise to many dead-ends that are hard to detect. It has been used by I2A (Weber et al., 2017) and by MCTS network planning approaches (Guez et al., 2019) that implement planning by unrolling steps within a neural network.

High-dimensional

Most recent successes in model-free reinforcement learning have been achieved on high-dimensional problems, such as the Arcade Learning Environment (Bellemare et al., 2013); see, for example, Mnih et al. (2015) and Hessel et al. (2017). Atari games were popular video games in arcades in the 1980s. Figure 14 shows a screenshot of Q*bert, a typical Atari arcade game.

Figure 14: Q*bert, Example Game from the Arcade Learning Environment

Deep learning methods are well suited to process high-dimensional inputs. The challenge for model-based methods is to learn accurate models with a low number of observations. Latent model approaches have been tried for Atari with some success (Oh et al., 2015, 2017; Kaiser et al., 2019; Hafner et al., 2020). The MuZero approach was even able to learn the rules of very different games: Atari, Go, shogi (Japanese chess) and chess (Schrittwieser et al., 2020).

4 Discussion and Outlook

We have now discussed in depth our taxonomy of application, learning method, and planning method. We have seen different innovative approaches, all aiming to achieve similar goals. Let us discuss how well they succeed.

Model-based reinforcement learning promises high accuracy at low sample complexity. Sutton’s work on imagination, in which a transition model is learned from environment samples and then used to generate extra “imagined” samples for the policy at no additional environment cost, clearly illustrates this aspect of model-based reinforcement learning. The transition model acts as a multiplier on the amount of information that is extracted from each environment sample, as the agent builds up its own model of the environment.
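Tabular Dyna-Q (Sutton, 1990, 1991) is the simplest instance of this imagination idea. The sketch below shows how each real transition both updates the value function and feeds a learned model that then generates extra, imagined updates. The environment interface (reset() and step() returning next state, reward, and a done flag) is an assumption for illustration; terminal handling inside the planning loop is omitted for brevity.

import random
from collections import defaultdict

def dyna_q(env, n_actions, episodes=100, planning_steps=20,
           alpha=0.1, gamma=0.95, epsilon=0.1):
    Q = defaultdict(float)     # Q[(state, action)]
    model = {}                 # model[(state, action)] = (reward, next_state)

    def greedy(s):
        return max(range(n_actions), key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = random.randrange(n_actions) if random.random() < epsilon else greedy(s)
            s_next, r, done = env.step(a)

            # (1) Direct reinforcement learning from the real sample.
            bootstrap = 0.0 if done else gamma * max(Q[(s_next, b)] for b in range(n_actions))
            Q[(s, a)] += alpha * (r + bootstrap - Q[(s, a)])

            # (2) Model learning: remember what the environment did.
            model[(s, a)] = (r, s_next)

            # (3) Planning: extra "imagined" updates from the model, at no environment cost.
            for _ in range(planning_steps):
                (ps, pa), (pr, ps_next) = random.choice(list(model.items()))
                target = pr + gamma * max(Q[(ps_next, b)] for b in range(n_actions))
                Q[(ps, pa)] += alpha * (target - Q[(ps, pa)])

            s = s_next
    return Q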

Another, and perhaps more important, aspect is generalization performance. Model-based reinforcement learning builds a dynamics model of the environment. This model can be used multiple times, not only for the same problem instance, but also for new problem instances and for variations. By learning the state-to-state transition model and the reward model, model-based reinforcement learning captures the essence of a domain, whereas model-free methods only learn the best response actions. Model-based reinforcement learning may thus be better suited for transfer learning and for long sequential decision making problems. It is the difference between learning how to respond to certain actions of a difficult boss, and knowing the boss.

4.1 The Challenge

Classical tabular approaches and Gaussian process approaches have been quite successful in achieving low sample complexity for problems of modest complexity (Sutton and Barto, 2018; Deisenroth et al., 2013; Kober et al., 2013). The topic of the current survey, however, is deep models for large, high-dimensional problems with complex, non-linear, and discontinuous functions. These application domains pose a problem for classical approaches.

The main challenge that the model-based reinforcement learning algorithms in this survey address is the following. For high-dimensional tasks, the curse of dimensionality causes data to be sparse and variance to be high. Deep methods tend to overfit on small sample sizes, while model-free methods use many millions of environment samples. For high-dimensional problems, the accuracy of transition models is therefore under pressure, and model-based methods that use poor models make poor planning predictions far into the future (Talvitie, 2015).

The challenge is therefore to learn deep, high-dimensional transition functions from limited data that are accurate, or that account for model uncertainty, and to plan over these models to achieve policy and value functions that are as accurate as, or better than, those of model-free methods.

4.2 Taxonomy

In discussing our survey, we first summarize the taxonomy. Model-based methods use two main algorithms: (1) an algorithm to learn the model, and (2) an algorithm to plan the policy based on the model (Figure 5). Our taxonomy is based on application and algorithm characteristics (Table 3): application, learning, and planning. First, we distinguish continuous versus discrete action spaces (robotics versus games). Second, we look at the algorithms. For the model-learning algorithms we consider latent models, uncertainty modeling, and ensembles for continuous action spaces, and latent models and end-to-end learning and planning for discrete action spaces. For the planning algorithms we consider trajectory rollouts and model-predictive control for continuous action spaces, and end-to-end learning and planning and trajectory rollouts for discrete action spaces.

Application    Learning                Planning
Discrete       CNN/LSTM                End-to-end learning and planning
               Latent models           Trajectory rollouts
Continuous     Latent models           Trajectory rollouts
               Uncertainty modeling    Model-predictive control
               Ensemble models
Table 3: Taxonomy: Application, Learning, Planning

4.3 Outcome

We can now consider the question whether finding methods that achieve high-accuracy models with low sample complexity has been solved. This is the central research question that many researchers have worked on. Unfortunately, authors have used different tasks within benchmark suites, or even different benchmarks, which makes comparisons between publications challenging. A study by Wang et al. (2019) reimplemented many methods for continuous problems to perform a fair comparison. They found that ensemble methods and model-predictive control indeed achieve good results on MuJoCo tasks, and do so in significantly fewer time steps than model-free methods: typically 200k time steps for model-based versus 1 million for model-free. However, they also found that although the sample complexity is lower, the wall-clock time can be higher, with model-free methods such as PPO (Schulman et al., 2017) and SAC (Haarnoja et al., 2018a, b) being much faster for some problems. The accuracy of the resulting policy varies greatly across problems, as does the sensitivity to hyperparameter values; in other words, results are brittle.

Taking these caveats into consideration, in general we conclude that the papers that we survey report that, for high-dimensional problems, model-based methods do indeed approach an accuracy as high as model-free baselines, with substantially fewer environment samples. Therefore we conclude that the methods that we survey overcome the difficulties posed by overfitting. Indeed, we have seen that new classes of applications have become possible, both in continuous action spaces—learning to perform complex robotic behaviors—and in discrete action spaces—learning the rules of Atari, chess, shogi, and Go.

4.4 Further Challenges

Although high accuracy at lower sample complexity has been achieved in some important applications, and although quite a few results are impressive, challenges remain. To start, the algorithms that have been developed are quite complex. Latent models, end-to-end learning and planning, and uncertainty modeling are complex algorithms that require effort to understand, and even more to implement correctly in new applications. In addition, the learned transition models are used in single problems only; few results are reported in which they are used in a transfer learning or meta-learning setting (Brazdil et al., 2008; Hospedales et al., 2020; Huisman et al., 2021), with the exception of Sekar et al. (2020).

Furthermore, reproducibility is a challenge due to the use of different benchmarks, and the high sensitivity to differences in hyperparameter values leads to brittle results, making reproducibility even more difficult. Finally, we note that the continuous problems that are solved appear to be of lower dimensionality than some of the discrete problems.

4.5 Research Agenda

The good results are promising for future work. Based on the results that have been achieved, and the challenges that remain, we come to the following research agenda.

We note that different approaches were developed for different applications. Reproducibility of results is a challenge: different hyperparameters can have a large influence on performance. Furthermore, many different benchmarks are in use in the field, and the complexity of some continuous benchmarks is lower than that of the discrete ones. Reproducibility and benchmarking are therefore the first item on our research agenda.

We also note that end-to-end learning and planning is a complex approach that does not work on all applications and requires a large computational effort; applying it to large problems is still a challenge. The second item on our research agenda is to develop end-to-end learning and planning further, to make it more efficient and applicable to more and different applications.

We further note that latent models and world models are also complex; different approaches use different types of modules, sometimes consisting of submodules. The third item on our research agenda is to integrate latent models with end-to-end learning and planning. As fourth item, we would like to simplify latent models and standardize them, if possible across different applications, and, as fifth item, we wish to apply latent models to higher-dimensional continuous problems.

Finally, as sixth item, we add meta learning and transfer learning experiments for model-based reinforcement learning to our research agenda.

In summary, to improve the accuracy and applicability of model-based methods we suggest to work on the following:

  1. Improve the reproducibility of model-based reinforcement learning, standardize benchmarking, and improve robustness (hyperparameters)

  2. Improve the efficiency of integrated end-to-end learning and planning, and improve its applicability to more and larger applications

  3. Integrate latent models and end-to-end learning and planning

  4. Simplify the latent model architecture across different applications

  5. Apply latent models to higher-dimensional continuous problems

  6. Use model-based reinforcement learning transition models for meta and transfer learning

5 Conclusion

Deep learning has revolutionized reinforcement learning. The new methods allow us to approach more complicated problems than before. Control and decision making tasks involving high dimensional visual input have come within reach.

Model-based methods offer the advantage of lower sample complexity than model-free methods, because agents learn their own transition model of the environment. However, traditional methods such as Gaussian processes, which work well on moderately complex problems with few samples, do not perform well on high-dimensional problems. High-capacity deep models, in turn, may need many samples before they become accurate, and finding methods that generalize well with low sample complexity has been difficult.

In the last five years many new methods have been devised, and great successes have been achieved in model-based deep reinforcement learning. This survey summarizes the main ideas of recent papers in a taxonomy based on applications and algorithms. Latent models condense complex problems into compact latent representations that are easier to learn, improving the accuracy of the model; limited-horizon planning reduces the impact of low-accuracy models; and end-to-end methods integrate learning and planning in one fully differentiable approach.

The Arcade Learning Environment has been one of the main benchmarks in model-free reinforcement learning, starting off the recent interest in the field with the work by Mnih et al. (2013, 2015). The high dimensionality of Atari video input has long been problematic for model-based reinforcement learning. Latent models were instrumental in reducing this dimensionality, producing the first successes for model-based approaches on this major benchmark. Despite this success, challenges remain. In the discussion we mentioned open problems for each of the approaches, where we expect worthwhile future work to occur. Impressive results have been reported; future work can be expected in transfer learning with latent models, and in the interplay of latent models with end-to-end learning on larger problems.

Benchmarks in the field have also had to keep up, progressing from single-agent grid worlds to high-dimensional games and complicated camera-arm manipulation tasks. Reproducibility and benchmarking studies are of great importance for real progress. In real-time strategy games, model-based methods are being combined with multi-agent, hierarchical, and evolutionary approaches, allowing the study of collaboration, competition, and negotiation.

Model-based deep reinforcement learning is a vibrant field of AI with a long history that predates deep learning. The field is blessed with a high degree of activity, an open culture, clear benchmarks, shared code-bases (Bellemare et al., 2013; Brockman et al., 2016; Vinyals et al., 2017; Tassa et al., 2018), and a quick turnaround of ideas. We hope that this survey contributes to keeping the barrier of entry low.

Acknowledgments

We thank the members of the Leiden Reinforcement Learning Group, and especially Thomas Moerland and Mike Huisman, for many discussions and insights.

References

  • Abbeel et al. [2007] Pieter Abbeel, Adam Coates, Morgan Quigley, and Andrew Y Ng. An application of reinforcement learning to aerobatic helicopter flight. In Advances in Neural Information Processing Systems, pages 1–8, 2007.
  • Alpaydin [2020] Ethem Alpaydin. Introduction to machine learning, Third edition. MIT Press, 2020.
  • Anthony et al. [2017] Thomas Anthony, Zheng Tian, and David Barber. Thinking fast and slow with deep learning and tree search. In Advances in Neural Information Processing Systems, pages 5360–5370, 2017.
  • Bellemare et al. [2013] Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The Arcade Learning Environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
  • Bellman [1957, 2013] Richard Bellman. Dynamic programming. Courier Corporation, 1957, 2013.
  • Bishop [2006] Christopher M Bishop. Pattern recognition and machine learning. Information science and statistics. Springer Verlag, Heidelberg, 2006.
  • Botev et al. [2013] Zdravko I Botev, Dirk P Kroese, Reuven Y Rubinstein, and Pierre L’Ecuyer. The cross-entropy method for optimization. In Handbook of statistics, volume 31, pages 35–59. Elsevier, 2013.
  • Brazdil et al. [2008] Pavel Brazdil, Christophe Giraud Carrier, Carlos Soares, and Ricardo Vilalta. Metalearning: Applications to data mining. Springer Science & Business Media, 2008.
  • Brockman et al. [2016] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
  • Brown and Sandholm [2019] Noam Brown and Tuomas Sandholm. Superhuman AI for multiplayer poker. Science, 365(6456):885–890, 2019.
  • Browne et al. [2012] Cameron B Browne, Edward Powley, Daniel Whitehouse, Simon M Lucas, Peter I Cowling, Philipp Rohlfshagen, Stephen Tavener, Diego Perez, Spyridon Samothrakis, and Simon Colton. A survey of Monte Carlo Tree Search methods. IEEE Transactions on Computational Intelligence and AI in Games, 4(1):1–43, 2012.
  • Buesing et al. [2018] Lars Buesing, Theophane Weber, Sébastien Racaniere, SM Eslami, Danilo Rezende, David P Reichert, Fabio Viola, Frederic Besse, Karol Gregor, Demis Hassabis, et al. Learning and querying fast generative models for reinforcement learning. arXiv preprint arXiv:1802.03006, 2018.
  • Çalışır and Pehlivanoğlu [2019] Sinan Çalışır and Meltem Kurt Pehlivanoğlu. Model-free reinforcement learning algorithms: A survey. In 2019 27th Signal Processing and Communications Applications Conference (SIU), pages 1–4, 2019.
  • Campbell et al. [2002] Murray Campbell, A Joseph Hoane Jr, and Feng-hsiung Hsu. Deep Blue. Artificial Intelligence, 134(1-2):57–83, 2002.
  • Chao [2013] Yang Chao. Sokoban.org, 2013.
  • Chiappa et al. [2017] Silvia Chiappa, Sébastien Racaniere, Daan Wierstra, and Shakir Mohamed. Recurrent environment simulators. arXiv preprint arXiv:1704.02254, 2017.
  • Chua et al. [2018] Kurtland Chua, Roberto Calandra, Rowan McAllister, and Sergey Levine. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems, pages 4754–4765, 2018.
  • Clavera et al. [2018] Ignasi Clavera, Jonas Rothfuss, John Schulman, Yasuhiro Fujita, Tamim Asfour, and Pieter Abbeel. Model-based reinforcement learning via meta-policy optimization. arXiv preprint arXiv:1809.05214, 2018.
  • Coulom [2006] Rémi Coulom. Efficient selectivity and backup operators in Monte-Carlo Tree Search. In International Conference on Computers and Games, pages 72–83. Springer, 2006.
  • Deisenroth and Rasmussen [2011] Marc Deisenroth and Carl E Rasmussen. Pilco: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 465–472, 2011.
  • Deisenroth et al. [2013] Marc Peter Deisenroth, Gerhard Neumann, and Jan Peters. A survey on policy search for robotics. In Foundations and Trends in Robotics 2, pages 1–142. Now Publishers, 2013.
  • Dietterich [1998] Thomas G Dietterich. The maxq method for hierarchical reinforcement learning. In ICML, volume 98, pages 118–126. Citeseer, 1998.
  • Doerr et al. [2018] Andreas Doerr, Christian Daniel, Martin Schiegg, Duy Nguyen-Tuong, Stefan Schaal, Marc Toussaint, and Sebastian Trimpe. Probabilistic recurrent state-space models. arXiv preprint arXiv:1801.10395, 2018.
  • Duan et al. [2016] Yan Duan, John Schulman, Xi Chen, Peter L Bartlett, Ilya Sutskever, and Pieter Abbeel. RL2: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779, 2016.
  • Ebert et al. [2018] Frederik Ebert, Chelsea Finn, Sudeep Dasari, Annie Xie, Alex Lee, and Sergey Levine. Visual foresight: Model-based deep reinforcement learning for vision-based robotic control. arXiv preprint arXiv:1812.00568, 2018.
  • Farquhar et al. [2018] Gregory Farquhar, Tim Rocktäschel, Maximilian Igl, and SA Whiteson. TreeQN and ATreeC: Differentiable tree planning for deep reinforcement learning. International Conference on Learning Representations, 2018.
  • Feinberg et al. [2018] Vladimir Feinberg, Alvin Wan, Ion Stoica, Michael I Jordan, Joseph E Gonzalez, and Sergey Levine. Model-based value estimation for efficient model-free reinforcement learning. arXiv preprint arXiv:1803.00101, 2018.
  • Finn and Levine [2017] Chelsea Finn and Sergey Levine. Deep visual foresight for planning robot motion. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 2786–2793. IEEE, 2017.
  • Finn et al. [2017] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-Agnostic Meta-Learning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400, 2017.
  • Garcia et al. [1989] Carlos E Garcia, David M Prett, and Manfred Morari. Model predictive control: Theory and practice—a survey. Automatica, 25(3):335–348, 1989.
  • Goodfellow et al. [2014] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
  • Goodfellow et al. [2016] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT Press, Cambridge, 2016.
  • Grimm et al. [2020] Christopher Grimm, André Barreto, Satinder Singh, and David Silver. The value equivalence principle for model-based reinforcement learning. arXiv preprint arXiv:2011.03506, 2020.
  • Gu et al. [2016] Shixiang Gu, Timothy Lillicrap, Ilya Sutskever, and Sergey Levine. Continuous deep Q-learning with model-based acceleration. In International Conference on Machine Learning, pages 2829–2838, 2016.
  • Guez et al. [2018] Arthur Guez, Théophane Weber, Ioannis Antonoglou, Karen Simonyan, Oriol Vinyals, Daan Wierstra, Rémi Munos, and David Silver. Learning to search with MCTSnets. arXiv preprint arXiv:1802.04697, 2018.
  • Guez et al. [2019] Arthur Guez, Mehdi Mirza, Karol Gregor, Rishabh Kabra, Sébastien Racanière, Théophane Weber, David Raposo, Adam Santoro, Laurent Orseau, Tom Eccles, et al. An investigation of model-free planning. arXiv preprint arXiv:1901.03559, 2019.
  • Ha and Schmidhuber [2018a] David Ha and Jürgen Schmidhuber. Recurrent world models facilitate policy evolution. In Advances in Neural Information Processing Systems, pages 2450–2462, 2018a.
  • Ha and Schmidhuber [2018b] David Ha and Jürgen Schmidhuber. World models. arXiv preprint arXiv:1803.10122, 2018b.
  • Haarnoja et al. [2018a] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International conference on machine learning, pages 1861–1870. PMLR, 2018a.
  • Haarnoja et al. [2018b] Tuomas Haarnoja, Aurick Zhou, Kristian Hartikainen, George Tucker, Sehoon Ha, Jie Tan, Vikash Kumar, Henry Zhu, Abhishek Gupta, Pieter Abbeel, et al. Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905, 2018b.
  • Hafner et al. [2018] Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning latent dynamics for planning from pixels. arXiv preprint arXiv:1811.04551, 2018.
  • Hafner et al. [2019] Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination. arXiv preprint arXiv:1912.01603, 2019.
  • Hafner et al. [2020] Danijar Hafner, Timothy Lillicrap, Mohammad Norouzi, and Jimmy Ba. Mastering atari with discrete world models. arXiv preprint arXiv:2010.02193, 2020.
  • Hamrick [2019] Jessica B Hamrick. Analogues of mental simulation and imagination in deep learning. Current Opinion in Behavioral Sciences, 29:8–16, 2019.
  • Hamrick et al. [2017] Jessica B Hamrick, Andrew J Ballard, Razvan Pascanu, Oriol Vinyals, Nicolas Heess, and Peter W Battaglia. Metacontrol for adaptive imagination-based optimization. arXiv preprint arXiv:1705.02670, 2017.
  • Heess et al. [2015] Nicolas Heess, Gregory Wayne, David Silver, Timothy Lillicrap, Tom Erez, and Yuval Tassa. Learning continuous control policies by stochastic value gradients. In Advances in Neural Information Processing Systems, pages 2944–2952, 2015.
  • Hessel et al. [2017] Matteo Hessel, Joseph Modayil, Hado Van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver. Rainbow: Combining improvements in deep reinforcement learning. arXiv preprint arXiv:1710.02298, 2017.
  • Ho [1995] Tin Kam Ho. Random decision forests. In Proceedings of 3rd international conference on document analysis and recognition, volume 1, pages 278–282. IEEE, 1995.
  • Hospedales et al. [2020] Timothy Hospedales, Antreas Antoniou, Paul Micaelli, and Amos Storkey. Meta-learning in neural networks: A survey. arXiv preprint arXiv:2004.05439, 2020.
  • Hui [2018] Jonathan Hui. Model-based reinforcement learning https://medium.com/@jonathan_hui/rl-model-based-reinforcement-learning-3c2b6f0aa323. Medium post, 2018.
  • Huisman et al. [2021] Mike Huisman, Jan N. van Rijn, and Aske Plaat. A survey of deep meta-learning. Artificial Intelligence Review, 2021.
  • Ilin et al. [2007] Roman Ilin, Robert Kozma, and Paul J Werbos. Efficient learning in cellular simultaneous recurrent neural networks—the case of maze navigation problem. In 2007 IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning, pages 324–329, 2007.
  • Janner et al. [2019] Michael Janner, Justin Fu, Marvin Zhang, and Sergey Levine. When to trust your model: Model-based policy optimization. In Advances in Neural Information Processing Systems, pages 12498–12509, 2019.
  • Justesen et al. [2019] Niels Justesen, Philip Bontrager, Julian Togelius, and Sebastian Risi. Deep learning for video game playing. IEEE Transactions on Games, 12(1):1–20, 2019.
  • Kaelbling et al. [1996] Leslie Pack Kaelbling, Michael L Littman, and Andrew W Moore. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237–285, 1996.
  • Kahneman [2011] Daniel Kahneman. Thinking, fast and slow. Farrar, Straus and Giroux, 2011.
  • Kaiser et al. [2019] Lukasz Kaiser, Mohammad Babaeizadeh, Piotr Milos, Blazej Osinski, Roy H Campbell, Konrad Czechowski, Dumitru Erhan, Chelsea Finn, Piotr Kozakowski, Sergey Levine, et al. Model-based reinforcement learning for Atari. arXiv preprint arXiv:1903.00374, 2019.
  • Kalweit and Boedecker [2017] Gabriel Kalweit and Joschka Boedecker. Uncertainty-driven imagination for continuous deep reinforcement learning. In Conference on Robot Learning, pages 195–206, 2017.
  • Kamyar and Taheri [2014] Reza Kamyar and Ehsan Taheri. Aircraft optimal terrain/threat-based trajectory planning and control. Journal of Guidance, Control, and Dynamics, 37(2):466–483, 2014.
  • Kansal and Martin [2018] Satwik Kansal and Brendan Martin. Learn data science webpage., 2018. URL https://www.learndatasci.com/tutorials/reinforcement-q-learning-scratch-python-openai-gym/.
  • Karl et al. [2016] Maximilian Karl, Maximilian Soelch, Justin Bayer, and Patrick Van der Smagt. Deep variational Bayes filters: Unsupervised learning of state space models from raw data. arXiv preprint arXiv:1605.06432, 2016.
  • Kelley [1960] Henry J Kelley. Gradient theory of optimal flight paths. American Rocket Society Journal, 30(10):947–954, 1960.
  • Kempka et al. [2016] Michał Kempka, Marek Wydmuch, Grzegorz Runc, Jakub Toczek, and Wojciech Jaśkowski. VizDoom: A Doom-based AI research platform for visual reinforcement learning. In 2016 IEEE Conference on Computational Intelligence and Games, pages 1–8, 2016.
  • Kingma and Welling [2013] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
  • Kingma and Welling [2019] Diederik P Kingma and Max Welling. An introduction to variational autoencoders. arXiv preprint arXiv:1906.02691, 2019.
  • Kober et al. [2013] Jens Kober, J Andrew Bagnell, and Jan Peters. Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, 32(11):1238–1274, 2013.
  • Konda and Tsitsiklis [2000] Vijay R Konda and John N Tsitsiklis. Actor–critic algorithms. In Advances in Neural Information Processing Systems, pages 1008–1014, 2000.
  • Kwon et al. [1983] W Hi Kwon, AM Bruckstein, and T Kailath. Stabilizing state-feedback design via the moving horizon method. International Journal of Control, 37(3):631–643, 1983.
  • Lakshminarayanan et al. [2017] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, pages 6402–6413, 2017.
  • LeCun et al. [2015] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436, 2015.
  • Levine and Abbeel [2014] Sergey Levine and Pieter Abbeel. Learning neural network policies with guided policy search under unknown dynamics. In Advances in Neural Information Processing Systems, pages 1071–1079, 2014.
  • Levine and Koltun [2013] Sergey Levine and Vladlen Koltun. Guided policy search. In International Conference on Machine Learning, pages 1–9, 2013.
  • Mnih et al. [2013] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
  • Mnih et al. [2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
  • Mnih et al. [2016] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937, 2016.
  • Moerland et al. [2018] Thomas M Moerland, Joost Broekens, Aske Plaat, and Catholijn M Jonker. A0c: Alpha zero in continuous action space. arXiv preprint arXiv:1805.09613, 2018.
  • Moerland et al. [2020a] Thomas M Moerland, Joost Broekens, and Catholijn M Jonker. A framework for reinforcement learning and planning. arXiv preprint arXiv:2006.15009, 2020a.
  • Moerland et al. [2020b] Thomas M Moerland, Joost Broekens, and Catholijn M Jonker. Model-based reinforcement learning: A survey. arXiv preprint arXiv:2006.16712, 2020b.
  • Nagabandi et al. [2018] Anusha Nagabandi, Gregory Kahn, Ronald S Fearing, and Sergey Levine. Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 7559–7566, 2018.
  • Nardelli et al. [2018] Nantas Nardelli, Gabriel Synnaeve, Zeming Lin, Pushmeet Kohli, Philip HS Torr, and Nicolas Usunier. Value propagation networks. arXiv preprint arXiv:1805.11199, 2018.
  • Oh et al. [2015] Junhyuk Oh, Xiaoxiao Guo, Honglak Lee, Richard L Lewis, and Satinder Singh. Action-conditional video prediction using deep networks in Atari games. In Advances in Neural Information Processing Systems, pages 2863–2871, 2015.
  • Oh et al. [2017] Junhyuk Oh, Satinder Singh, and Honglak Lee. Value prediction network. In Advances in Neural Information Processing Systems, pages 6118–6128, 2017.
  • Ontanón et al. [2013] Santiago Ontanón, Gabriel Synnaeve, Alberto Uriarte, Florian Richoux, David Churchill, and Mike Preuss. A survey of real-time strategy game AI research and competition in StarCraft. IEEE Transactions on Computational Intelligence and AI in Games, 5(4):293–311, 2013.
  • Pascanu et al. [2017] Razvan Pascanu, Yujia Li, Oriol Vinyals, Nicolas Heess, Lars Buesing, Sebastien Racanière, David Reichert, Théophane Weber, Daan Wierstra, and Peter Battaglia. Learning model-based planning from scratch. arXiv preprint arXiv:1707.06170, 2017.
  • Plaat [2020] Aske Plaat. Learning to Play: Reinforcement Learning and Games. Springer Verlag, Heidelberg, See https://learningtoplay.net, 2020.
  • Polydoros and Nalpantidis [2017] Athanasios S Polydoros and Lazaros Nalpantidis. Survey of model-based reinforcement learning: Applications on robotics. Journal of Intelligent & Robotic Systems, 86(2):153–173, 2017.
  • Richards [2005] Arthur George Richards. Robust constrained model predictive control. PhD thesis, Massachusetts Institute of Technology, 2005.
  • Risi and Preuss [2020] Sebastian Risi and Mike Preuss. From Chess and Atari to StarCraft and Beyond: How Game AI is Driving the World of AI. KI-Künstliche Intelligenz, pages 1–11, 2020.
  • Rosin [2011] Christopher D Rosin. Multi-armed bandits with episode context. Annals of Mathematics and Artificial Intelligence, 61(3):203–230, 2011.
  • Schleich et al. [2019] Daniel Schleich, Tobias Klamt, and Sven Behnke. Value iteration networks on multiple levels of abstraction. arXiv preprint arXiv:1905.11068, 2019.
  • Schmidhuber [1990a] Jürgen Schmidhuber. An on-line algorithm for dynamic reinforcement learning and planning in reactive environments. In 1990 IJCNN International Joint Conference on Neural Networks, pages 253–258. IEEE, 1990a.
  • Schmidhuber [1990b] Jürgen Schmidhuber. Making the world differentiable: On using self-supervised fully recurrent neural networks for dynamic reinforcement learning and planning in non-stationary environments. 1990b.
  • Schrittwieser et al. [2020] Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, et al. Mastering atari, go, chess and shogi by planning with a learned model. Nature, 588(7839):604–609, 2020.
  • Schulman et al. [2017] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  • Sekar et al. [2020] Ramanan Sekar, Oleh Rybkin, Kostas Daniilidis, Pieter Abbeel, Danijar Hafner, and Deepak Pathak. Planning to explore via self-supervised world models. arXiv preprint arXiv:2005.05960, 2020.
  • Silver et al. [2012] David Silver, Richard S Sutton, and Martin Müller. Temporal-difference search in computer Go. Machine Learning, 87(2):183–219, 2012.
  • Silver et al. [2014] David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. Deterministic policy gradient algorithms. 2014.
  • Silver et al. [2016] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484, 2016.
  • Silver et al. [2017a] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel, and Demis Hassabis. Mastering the game of Go without human knowledge. Nature, 550(7676):354, 2017a.
  • Silver et al. [2017b] David Silver, Hado van Hasselt, Matteo Hessel, Tom Schaul, Arthur Guez, Tim Harley, Gabriel Dulac-Arnold, David Reichert, Neil Rabinowitz, Andre Barreto, et al. The predictron: End-to-end learning and planning. In Proceedings of the 34th International Conference on Machine Learning, pages 3191–3199, 2017b.
  • Silver et al. [2018] David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lillicrap, Karen Simonyan, and Demis Hassabis. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science, 362(6419):1140–1144, 2018.
  • Srinivas et al. [2018] Aravind Srinivas, Allan Jabri, Pieter Abbeel, Sergey Levine, and Chelsea Finn. Universal planning networks. arXiv preprint arXiv:1804.00645, 2018.
  • Sutton [1990] Richard S Sutton. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Machine learning proceedings 1990, pages 216–224. Elsevier, 1990.
  • Sutton [1991] Richard S Sutton. Dyna, an integrated architecture for learning, planning, and reacting. ACM Sigart Bulletin, 2(4):160–163, 1991.
  • Sutton and Barto [2018] Richard S Sutton and Andrew G Barto. Reinforcement learning, An Introduction, Second Edition. MIT Press, 2018.
  • Talvitie [2015] Erik Talvitie. Agnostic system identification for monte carlo planning. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
  • Tamar et al. [2016] Aviv Tamar, Yi Wu, Garrett Thomas, Sergey Levine, and Pieter Abbeel. Value iteration networks. In Adv. in Neural Information Processing Systems, pages 2154–2162, 2016.
  • Tassa et al. [2012] Yuval Tassa, Tom Erez, and Emanuel Todorov. Synthesis and stabilization of complex behaviors through online trajectory optimization. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 4906–4913, 2012.
  • Tassa et al. [2018] Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas, David Budden, Abbas Abdolmaleki, Josh Merel, Andrew Lefrancq, et al. Deepmind control suite. arXiv preprint arXiv:1801.00690, 2018.
  • Todorov et al. [2012] Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 5026–5033, 2012.
  • Torrado et al. [2018] Ruben Rodriguez Torrado, Philip Bontrager, Julian Togelius, Jialin Liu, and Diego Perez-Liebana. Deep reinforcement learning for general video game ai. In 2018 IEEE Conference on Computational Intelligence and Games (CIG), pages 1–8. IEEE, 2018.
  • Van Der Maaten et al. [2009] Laurens Van Der Maaten, Eric Postma, Jaap Van den Herik, et al. Dimensionality reduction: a comparative. J Mach Learn Res, 10(66-71):13, 2009.
  • Vinyals et al. [2017] Oriol Vinyals, Timo Ewalds, Sergey Bartunov, Petko Georgiev, Alexander Sasha Vezhnevets, Michelle Yeo, Alireza Makhzani, Heinrich Küttler, John Agapiou, Julian Schrittwieser, et al. Starcraft II: A new challenge for reinforcement learning. arXiv preprint arXiv:1708.04782, 2017.
  • Vinyals et al. [2019] Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. Grandmaster level in starcraft ii using multi-agent reinforcement learning. Nature, 575(7782):350–354, 2019.
  • Wang et al. [2019] Tingwu Wang, Xuchan Bao, Ignasi Clavera, Jerrick Hoang, Yeming Wen, Eric Langlois, Shunshi Zhang, Guodong Zhang, Pieter Abbeel, and Jimmy Ba. Benchmarking model-based reinforcement learning. preprint arXiv:1907.02057, 2019.
  • Watkins [1989] Christopher JCH Watkins. Learning from delayed rewards. PhD thesis, King’s College, Cambridge, 1989.
  • Weber et al. [2017] Théophane Weber, Sébastien Racanière, David Reichert, Lars Buesing, Arthur Guez, Danilo Jimenez Rezende, Adria Puigdomenech Badia, Oriol Vinyals, Nicolas Heess, Yujia Li, et al. Imagination-augmented agents for deep reinforcement learning. In Advances in Neural Information Processing Systems, pages 5690–5701, 2017.
  • Wong et al. [2021] Annie Wong, Thomas Bäck, Anna V. Kononova, and Aske Plaat. Multiagent deep reinforcement learning: Challenges and directions towards human-like approaches. Artificial Intelligence Review, 2021.
  • Xingjian et al. [2015] SHI Xingjian, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, and Wang-chun Woo. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In Advances in Neural Information Processing Systems, pages 802–810, 2015.
  • Zambaldi et al. [2018] Vinicius Zambaldi, David Raposo, Adam Santoro, Victor Bapst, Yujia Li, Igor Babuschkin, Karl Tuyls, David Reichert, Timothy Lillicrap, Edward Lockhart, et al. Relational deep reinforcement learning. arXiv preprint arXiv:1806.01830, 2018.