
Improving Generalization in Meta-RL with Imaginary Tasks from Latent Dynamics Mixture

Suyoung Lee
KAIST
[email protected]
Sae-Young Chung
KAIST
[email protected]
Corresponding author.
Abstract

The generalization ability of most meta-reinforcement learning (meta-RL) methods is largely limited to test tasks that are sampled from the same distribution used to sample training tasks. To overcome this limitation, we propose Latent Dynamics Mixture (LDM), which trains a reinforcement learning agent with imaginary tasks generated from mixtures of learned latent dynamics. By training a policy on mixture tasks along with the original training tasks, LDM allows the agent to prepare during training for unseen test tasks and prevents the agent from overfitting to the training tasks. LDM significantly outperforms standard meta-RL methods in test returns on gridworld navigation and MuJoCo tasks where we strictly separate the training task distribution from the test task distribution.

1 Introduction

Overfitting and lack of generalization ability have been identified as among the most critical problems of deep reinforcement learning (RL) [5, 8, 30, 34, 38, 47, 52]. A number of meta-reinforcement learning (meta-RL) methods have proposed solutions to these problems by meta-training a policy that easily adapts to unseen but similar tasks. Meta-RL trains an agent on multiple sample tasks to construct an inductive bias over the structure shared across tasks. Most meta-RL works evaluate their agents on test tasks that are sampled from the same distribution used to sample training tasks. Therefore, the vulnerability of meta-RL to test-time distribution shift is hardly revealed [12, 26, 29, 30].

One major category of meta-RL is gradient-based meta-RL, which learns an initialization of a model such that a few steps of policy gradient are sufficient to attain good performance in a new task [9, 36, 39, 40, 55]. Most of these methods require many test-time rollouts for adaptation, which may be costly in real environments. Moreover, these methods typically use feedforward networks, which makes online adaptation within a rollout difficult.

Another major category of meta-RL is context-based meta-RL, which tries to learn the tasks’ structures by utilizing recurrent or memory-augmented models [6, 13, 23, 25, 32, 35, 49, 56]. A context-based meta-RL agent encodes its collected experience into a context, and the policy conditioned on the context is trained to maximize the return. These methods have difficulty generalizing to unseen out-of-distribution (OOD) tasks for two main reasons. (1) The process of encoding unseen task dynamics into a context does not generalize well. (2) Even if the unseen dynamics are well encoded, a policy that has never been trained conditioned on the unseen context cannot interpret the context to output optimal actions.

We propose Latent Dynamics Mixture (LDM), a novel meta-RL method that overcomes the aforementioned limitations and generalizes to strictly unseen test tasks without any additional test-time updates. LDM is based on variational Bayes-adaptive meta-RL, which meta-learns approximate inference on a latent belief distribution over multiple reward and transition dynamics [56]. We generate imaginary tasks using mixtures of the training tasks’ meta-learned latent beliefs. Provided with these imaginary tasks during training, the agent can train its context encoder and the policy conditioned on contexts of unseen tasks that may appear during testing. Since LDM prepares for testing during training, it does not require additional gradient adaptation during testing.

For example, let there be four types of training tasks, each of which requires the agent to move east, north, west, or south, respectively. By mixing the two tasks of moving east and north, we may create a new task of moving northeast. By mixing the training tasks with different weights, we may create tasks with goals in any direction.

We evaluate LDM and other meta-RL methods on the gridworld navigation task and MuJoCo meta-RL tasks, where we completely separate the distributions of training tasks and test tasks. We show that LDM, without any prior knowledge of the distribution of test tasks during training, achieves superior test returns compared to other meta-RL methods.

2 Problem Setup

Figure 1: The gridworld example for the problem setup. (a) Training MDPs $\mathcal{M}_{\textrm{train}}$: a goal is located at one of the 18 shaded states. During training, the agent has to navigate to discover the unknown goal position randomly sampled at the beginning of each task. (b) Test MDPs $\mathcal{M}_{\textrm{test}}$: a test goal is located at one of the 27 shaded states, disjoint from the set of training goals in (a). The agent does not have access to the tasks in $\mathcal{M}_{\textrm{test}}$ during training. (c) Mean returns achieved for each task at the $N$-th episode. The solid and dashed lines represent examples of optimal paths with and without knowledge of the true goals, respectively. The $\circ$ and $\times$ symbols represent the initial state and the hidden goal state, respectively.

Our work is motivated by the meta-learning setting of variBAD [56]; we therefore follow most of its problem setup and notation, with the key difference that the test and training task distributions are strictly disjoint in our setup. A Markov decision process (MDP) $M=(\mathcal{S},\mathcal{A},R,T,T_{0},\gamma,H)$ consists of a set of states $\mathcal{S}$, a set of actions $\mathcal{A}$, a reward function $R(r_{t+1}|s_{t},a_{t},s_{t+1})$, a transition function $T(s_{t+1}|s_{t},a_{t})$, an initial state distribution $T_{0}(s_{0})$, a discount factor $\gamma$, and a time horizon $H$.

During meta-training, a task (or a batch of tasks) is sampled from a distribution $p(M)$ over the set of MDPs $\mathcal{M}$ at every iteration. Each MDP $M_{k}=(\mathcal{S},\mathcal{A},R_{k},T_{k},T_{0},\gamma,H)$ has an individual reward function $R_{k}$ (e.g., goal location) and transition function $T_{k}$ (e.g., amount of friction), while sharing some general structure. We assume that the agent does not have access to the task index $k$, which determines the MDP. At meta-test time, standard meta-RL methods evaluate agents on tasks sampled from the same distribution $p$ that is used to sample the training tasks. To evaluate the generalization ability of agents in environments unseen during training, we split $\mathcal{M}$ into two strictly disjoint training and test sets of MDPs, i.e., $\mathcal{M}=\mathcal{M}_{\textrm{train}}\cup\mathcal{M}_{\textrm{test}}$ and $\mathcal{M}_{\textrm{train}}\cap\mathcal{M}_{\textrm{test}}=\emptyset$. The agent does not have any prior information about $\mathcal{M}_{\textrm{test}}$ and cannot interact with $\mathcal{M}_{\textrm{test}}$ during training.

Since the MDP is initially unknown, the best the agent can do is to update its belief $b_{t}(R,T)$ about the environment according to its experience $\tau_{:t}=\{s_{0},a_{0},r_{1},s_{1},a_{1},r_{2},\ldots,s_{t}\}$. In the Bayesian RL formulation, the agent’s belief about the reward and transition dynamics at timestep $t$ can be formalized as a posterior over the MDP given the agent’s trajectory, $b_{t}(R,T)=p(R,T|\tau_{:t})$. By augmenting the state with the belief, a Bayes-Adaptive MDP (BAMDP) can be constructed [7]. The agent’s goal in a BAMDP is to maximize the expected return while exploring the environment by minimizing the uncertainty about the initially unknown MDP.

The inference and posterior update problem in a BAMDP can be solved by combining meta-learning and approximate variational inference [56]. An inference model encodes the experience into a low-dimensional stochastic latent variable $m$ to represent the posterior belief over the MDPs. (We use the terms context, latent belief, and latent (dynamics) model interchangeably to denote $m$.) The reward and transition dynamics can then be formulated as functions shared across MDPs: $R(r_{t+1}|s_{t},a_{t},s_{t+1};m)$ and $T(s_{t+1}|s_{t},a_{t};m)$. The problem of computing the posterior $p(R,T|\tau_{:t})$ thus reduces to inferring the posterior $p(m|\tau_{:t})$ over $m$. By conditioning the policy on the posterior $p(m|\tau_{:t})$, an approximately Bayes-optimal policy can be obtained.
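For reference, the per-timestep variational objective maximized in variBAD [56] takes roughly the following form (a hedged restatement; the exact conditioning and choice of prior follow the original paper):

$$\mathcal{L}_{t}(\phi,\theta)=\mathbb{E}_{q_{\phi}(m|\tau_{:t})}\left[\log p_{\theta}(\tau_{:H^{+}}|m)\right]-\textrm{KL}\left(q_{\phi}(m|\tau_{:t})\,\|\,p(m)\right),$$

where $q_{\phi}(m|\tau_{:t})$ is the encoder’s approximate posterior after $t$ steps, the reconstruction term decodes the full trajectory $\tau_{:H^{+}}$ (with $H^{+}$ the total number of steps across the task’s rollout episodes; see Sections 3.2 and 3.4), and the prior $p(m)$ is a standard Gaussian or the previous posterior $q_{\phi}(m|\tau_{:t-1})$.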

Refer to Figure 1 for the gridworld navigation example, which is the same as the example used in variBAD [56] except that the number of cells is increased and the task set $\mathcal{M}$ is divided into disjoint $\mathcal{M}_{\textrm{train}}$ and $\mathcal{M}_{\textrm{test}}$. A Bayes-optimal agent for a task in $\mathcal{M}_{\textrm{train}}$ first assigns a uniform prior to the goal states of $\mathcal{M}_{\textrm{train}}$ (Figure 1a) and then explores these states until it discovers a goal state, as in the dashed path in Figure 1a. If this agent, trained to solve the tasks in $\mathcal{M}_{\textrm{train}}$ only, is asked to solve a task in $\mathcal{M}_{\textrm{test}}$ without any prior knowledge (Figure 1b), the best it can do at first is to maintain its initial belief learned in $\mathcal{M}_{\textrm{train}}$ and explore the goal states of $\mathcal{M}_{\textrm{train}}$. Once the agent realizes that no goal of $\mathcal{M}_{\textrm{train}}$ is present, it could start exploring the states that have not been visited (i.e., $\mathcal{M}-\mathcal{M}_{\textrm{train}}$) and discover an unseen goal state in $\mathcal{M}_{\textrm{test}}$. However, it is unlikely that an agent trained only in $\mathcal{M}_{\textrm{train}}$ will accurately encode its experience into beliefs for unseen tasks and, conditioned on the unseen context, efficiently explore the goal states of $\mathcal{M}_{\textrm{test}}$ without any prior knowledge or test-time adaptation.

3 Latent Dynamics Mixture

Figure 2: Imaginary task generation from latent dynamics mixture. We train $n$ normal workers and a mixture worker in parallel. Each normal worker $W^{(i)}$ trains a policy network and a latent dynamics network on its sampled MDP $M^{(i)}\in\mathcal{M}_{\textrm{train}}$. A mixture latent model $\hat{m}_{t}$ is generated as a weighted sum of the normal workers’ latent models $m_{t}^{(i)}$. All workers share a single policy network and a single latent dynamics network. We feed this mixture belief to the latent dynamics network’s learned decoder to generate a new reward $\hat{r}_{t}$ and construct an imaginary task $\hat{M}$.

Our work aims to train an agent that prepares during training for unseen test tasks such as those in Figure 1b. We provide the agent during training with imaginary tasks created from mixtures of the training tasks’ latent beliefs. By training the agent to solve the imaginary tasks, the agent learns to encode unseen dynamics and to produce an optimal policy given beliefs not only for tasks in $\mathcal{M}_{\textrm{train}}$ but also for more general tasks that may appear during testing.

Refer to Figure 2 for an overview of the entire process. We train $n$ normal workers $W^{(1)},\ldots,W^{(n)}$ and a mixture worker $\hat{W}$ in parallel. For ease of explanation, we first focus on the case with only one mixture worker. At the beginning of every iteration, we sample $n$ MDPs $M^{(1)},\ldots,M^{(n)}$ from $\mathcal{M}_{\textrm{train}}$ and assign each MDP $M^{(i)}$ to the corresponding normal worker $W^{(i)}$. All normal and mixture workers share a single policy network and a single latent dynamics network. Normal workers train the shared policy network and latent dynamics network using true rewards from the sampled MDPs. The mixture worker trains the policy network with imaginary rewards produced by the learned decoder given mixture beliefs.

3.1 Policy Network

Any type of recurrent network that can encode the past trajectory into a belief state $b_{t}$ is sufficient for the policy network. We use an $\textrm{RL}^{2}$ [6] type of policy network (Figure 2). Each normal worker $W^{(i)}$ trains a recurrent Encoder-p (parameterized by $\phi_{p}$) and a feedforward policy network (parameterized by $\psi$) to maximize the return for its assigned MDP $M^{(i)}$. The mixture worker trains the same policy network to maximize the return in an imaginary task $\hat{M}$, where the imaginary reward $\hat{r}_{t}$ comes from the decoder given the mixture model $\hat{m}_{t}$. Any online RL algorithm can be used to train the policy network. We use A2C for the gridworld and PPO [37] for the MuJoCo tasks to optimize $\phi_{p}$ and $\psi$ end-to-end.
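A minimal PyTorch sketch of such an $\textrm{RL}^{2}$-style policy network is given below. It is illustrative only: the module names are ours, the embedding and hidden sizes are taken from the MuJoCo column of Table 3, the value head and RL losses are omitted, and the output head is shown for a discrete action space.

```python
import torch
import torch.nn as nn

class RL2Policy(nn.Module):
    """Encoder-p (GRU over embedded (s, a, r) tuples) + feedforward policy head."""

    def __init__(self, state_dim, action_dim, hidden_size=128):
        super().__init__()
        self.state_emb = nn.Linear(state_dim, 32)    # state encoding
        self.action_emb = nn.Linear(action_dim, 16)  # action encoding
        self.reward_emb = nn.Linear(1, 16)           # reward encoding
        self.gru = nn.GRU(32 + 16 + 16, hidden_size, batch_first=True)
        self.policy_head = nn.Sequential(            # feedforward policy (psi)
            nn.Linear(hidden_size, 128), nn.Tanh(), nn.Linear(128, action_dim)
        )

    def forward(self, states, actions, rewards, h=None):
        # states: (B, T, state_dim), actions: (B, T, action_dim), rewards: (B, T, 1)
        x = torch.cat([self.state_emb(states),
                       self.action_emb(actions),
                       self.reward_emb(rewards)], dim=-1)
        belief, h = self.gru(x, h)          # belief state b_t at every timestep
        logits = self.policy_head(belief)   # action logits conditioned on b_t
        return logits, belief, h
```

Here the embeddings and the GRU correspond to $\phi_{p}$ and the head to $\psi$; both are updated end-to-end by the A2C or PPO loss.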

3.2 Latent Dynamics Network

We use the same network structure and training method as the VAE introduced in variBAD [56] for our latent dynamics network (Figure 2). The only difference is that the policy network and the latent dynamics network do not share an encoder. Therefore, Encoder-v (parameterized by $\phi_{v}$) of the latent dynamics network does not need to output the context necessary for the policy; it only needs to encode the MDP dynamics into a low-dimensional stochastic latent embedding $m$. The latent dynamics model $m$ changes over time as the agent explores an MDP (denoted as $m_{t}$), but converges as the agent collects sufficient information to infer the dynamics of the current MDP. The latent dynamics network is not involved in the workers’ action selection. We store trajectories from $\mathcal{M}_{\textrm{train}}$ in a buffer and use samples from the buffer to train the latent dynamics network offline. Each normal worker trains the latent dynamics network to decode the entire trajectory, including the future, to allow inference about unseen future transitions. In this work, we focus on MDPs where only the reward dynamics varies [9, 10, 12, 13, 17, 26] and only train the reward decoder (parameterized by $\theta_{R}$), as in [56]. The parameters $\phi_{v}$ and $\theta_{R}$ are optimized end-to-end to maximize the ELBO [20] using the reparameterization trick.
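The following sketch outlines one update of such a latent dynamics network. It is a simplification under stated assumptions: the class and argument names are ours, the prior is taken to be $\mathcal{N}(0,I)$ with a fixed KL weight rather than variBAD’s exact formulation, only the final belief is used to decode the trajectory, and an MSE reconstruction loss is shown (the gridworld decoder uses BCE; cf. Table 3).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentDynamics(nn.Module):
    """Encoder-v (GRU -> mu, logvar of m_t) and a reward decoder R(r | s', m)."""

    def __init__(self, obs_dim, latent_dim=5, hidden=128):
        super().__init__()
        self.encoder = nn.GRU(obs_dim, hidden, batch_first=True)
        self.to_mu = nn.Linear(hidden, latent_dim)
        self.to_logvar = nn.Linear(hidden, latent_dim)
        self.reward_decoder = nn.Sequential(
            nn.Linear(obs_dim + latent_dim, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 1))

    def elbo_loss(self, traj, next_states, rewards, kl_weight=0.1):
        # traj: (B, T, obs_dim) transition features; rewards: (B, T, 1)
        h, _ = self.encoder(traj)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        m = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization

        # Decode the rewards of the entire trajectory from the (here: final) belief.
        m_last = m[:, -1:, :].expand(-1, next_states.size(1), -1)
        r_pred = self.reward_decoder(torch.cat([next_states, m_last], dim=-1))

        recon = F.mse_loss(r_pred, rewards)                            # reconstruction
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())  # KL to N(0, I)
        return recon + kl_weight * kl
```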

3.3 Imaginary Task Generation from Latent Dynamics Mixture

While training the policy network of the normal workers $W^{(1)},\ldots,W^{(n)}$ in parallel, we generate an imaginary latent model $\hat{m}_{t}$ as a randomly weighted sum of the normal workers’ latent models $m_{t}^{(1)},\ldots,m_{t}^{(n)}$:

$$\hat{m}_{t}=\sum_{i=1}^{n}\alpha^{(i)}m_{t}^{(i)}\quad\textrm{and}\quad\alpha^{(1)},\ldots,\alpha^{(n)}\sim\beta\cdot\mathrm{Dirichlet}(1,\ldots,1)-\frac{\beta-1}{n}. \qquad (1)$$

The $\alpha^{(i)}$’s are random mixture weights multiplied with each latent model $m_{t}^{(i)}$. At the beginning of every iteration, when the normal workers are assigned to new MDPs, we also sample new mixture weights that are fixed for that iteration. Many distributions are suitable for sampling mixture weights; we use the Dirichlet distribution in Equation 1. $\beta$ is a hyperparameter that controls the mixture’s degree of extrapolation. The sum of the mixture weights equals 1 regardless of $\beta$. If $\beta=1$, all $\alpha^{(i)}$’s are bounded between 0 and 1, and the mixture model becomes a convex combination of the training models. If $\beta>1$, the resulting mixture model may express extrapolated dynamics of the training tasks. Among the values $\{0.5,1.0,1.5,2.0,2.5\}$ that we tried, we find $\beta=1.0$ works best for most of our experiments. Refer to the extrapolation results in Section 5.1.2, where $\beta$ greater than 1 can be effective.

A mixture worker interacts with an MDP sampled from $\mathcal{M}_{\textrm{train}}$, but we replace the environment reward $r_{t}$ with the imaginary reward $\hat{r}_{t}$ to construct a mixture task $\hat{M}$. The mixture worker trains the policy network to maximize the return for the imaginary task $\hat{M}$. We expect the imaginary task $\hat{M}$ to share some common structure with the training tasks because it is generated using the decoder that is trained to fit the training tasks’ reward dynamics. On the other hand, the decoder can generate unseen rewards because we feed it unseen mixture beliefs. The mixture worker trains only the policy network, not the decoder, with the imaginary dynamics of $\hat{M}$.
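A small sketch of the mixture step, combining Equation 1 with the imaginary-reward substitution. The helper names and the `reward_decoder` interface are illustrative; the released implementation additionally samples a separate weight vector per latent dimension (see Appendix A.1).

```python
import numpy as np
import torch

def sample_mixture_weights(n, beta=1.0, rng=None):
    """Equation 1: the weights always sum to 1; beta > 1 allows negative
    weights, i.e. extrapolation outside the convex hull of training latents."""
    rng = rng or np.random.default_rng()
    return beta * rng.dirichlet(np.ones(n)) - (beta - 1.0) / n

def imaginary_reward(reward_decoder, next_state, latents, alphas):
    """Replace a mixture worker's environment reward with the decoder output
    given the mixture belief m_hat_t = sum_i alpha^(i) * m_t^(i)."""
    alphas = torch.as_tensor(alphas, dtype=torch.float32)
    m_hat = torch.einsum("i,id->d", alphas, latents)   # latents: (n, latent_dim)
    with torch.no_grad():                              # decoder is not trained on M_hat
        return reward_decoder(torch.cat([next_state, m_hat], dim=-1))
```

The weights are drawn once per iteration and kept fixed while the mixture worker rolls out its $N$ episodes.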

Dropout of state and action input for the decoder

As the latent dynamics network is trained on the tasks in $\mathcal{M}_{\textrm{train}}$, we find that the decoder easily overfits to the state and action observations, ignoring the latent model $m$. Returning to the gridworld example, if we train the decoder with the tasks in $\mathcal{M}_{\textrm{train}}$ (Figure 1a) and feed the decoder one of the goal states in $\mathcal{M}_{\textrm{test}}$, the decoder refers only to the next-state input $s_{i+1}$ and always returns zero reward regardless of the latent model $m$ (Figure 4a). We therefore apply dropout with rate $p_{\textrm{drop}}$ to all inputs of the decoder except the latent model $m$. This forces the decoder to refer to the latent model when predicting the reward and to generate general mixture tasks. Refer to Appendix E for ablations on dropout.

Training the decoder with a single-step regression loss is generally less complex than training the policy network with a multi-step policy-gradient loss. Therefore, the decoder can be trained stably even with input dropout and generalizes better than the meta-trained policy. Refer to Appendix B for empirical results on the test-time generalization ability of the latent dynamics network with dropout.
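A sketch of this input dropout (a hypothetical wrapper; layer sizes are placeholders, and only the next-state input is shown since that is the only observation input the gridworld decoder uses):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DropoutRewardDecoder(nn.Module):
    """Reward decoder whose state (and, where present, action) inputs are
    dropped out during training so that it cannot ignore the latent model m."""

    def __init__(self, obs_dim, latent_dim, p_drop=0.7, hidden=32):
        super().__init__()
        self.p_drop = p_drop
        self.net = nn.Sequential(
            nn.Linear(obs_dim + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, next_state, m):
        # Dropout is applied to the observation input only; m is never dropped.
        next_state = F.dropout(next_state, p=self.p_drop, training=self.training)
        return self.net(torch.cat([next_state, m], dim=-1))
```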

3.4 Implementation Details

Multiple episodes of the same task in one iteration

Following the setting of variBAD [56], we define an iteration as a sequence of $N$ episodes of the same task and train the agent to act Bayes-optimally within the $N$ rollout episodes (i.e., $H^{+}=N\times H$ steps). After every $N$ episodes, new tasks are sampled from $\mathcal{M}_{\textrm{train}}$ for the normal workers, and new mixture weights are sampled for the mixture worker. This lets us compare our method to other meta-RL methods that are designed to maximize the return after rollouts of many episodes.

Multiple mixture workers

We may train more than one mixture worker at the same time by sampling different sets of mixture weights for different mixture workers. Increasing the ratio of mixture workers to normal workers may help the agent generalize to unseen tasks faster, but the normal workers may then require more iterations to learn optimal policies in $\mathcal{M}_{\textrm{train}}$. We train $n=14$ normal workers and $\hat{n}=2$ mixture workers in parallel unless otherwise stated. Refer to Appendix D for an empirical analysis of the ratio of workers.

4 Related Work

Meta-Reinforcement Learning

Although context-based meta-RL methods require a large amount of data for meta-training, they can learn within the task and make online adaptations. $\textrm{RL}^{2}$ [6] is the simplest yet effective context-based model-free meta-RL method; it utilizes a recurrent network to encode the experience into a policy. PEARL [35] integrates an off-policy meta-RL method with online probabilistic filtering of latent task variables to achieve high meta-training efficiency. MAML [9] learns an initialization such that a few steps of policy gradient are enough for the agent to adapt to a new test task. E-MAML and ProMP [36, 40] extend MAML by proposing exploration strategies for collecting rollouts for adaptation.

Bayesian Reinforcement Learning

Bayesian RL quantifies the uncertainty, or the posterior belief, over the MDPs using past experience. Conditioned on the environment’s uncertainty, a Bayes-optimal policy can strike the optimal balance between exploration and exploitation to maximize the return during training [1, 11, 19, 28]. In a Bayes-adaptive MDP (BAMDP), where the agent augments the state space of the MDP with its belief, it is almost impossible to find the optimal policy due to the unknown parameterization and the intractable belief update. VariBAD [56] proposes an approximate but tractable solution that combines meta-learning and approximate variational inference. However, it is restricted to settings where the training and test task distributions are almost the same. Furthermore, the learned latent model is only used as additional information for the policy. LDM uses the learned latent model more actively by creating mixture tasks that train the policy for more general test tasks outside the training task distribution.

Curriculum, Goal and Task Generation

The idea of generating new tasks for RL is not new. Florensa et al. [10] propose generative adversarial training to generate goals. Gupta et al. [12] propose an automatic task design process based on mutual information. SimOpt [3] learns the randomization of simulation parameters based on a few real-world rollouts. POET and enhanced POET [45, 46] generate the terrain for a 2D walker given access to the environment parameters. Dream to Control [14] solves a long-horizon task using latent imagination. BIRD [54] learns from imaginary trajectories by maximizing the mutual information between imaginary and real trajectories. Chandak et al. [2] train an agent to forecast future tasks in non-stationary MDPs. Most of these works require prior knowledge of or control over the environment parameters, or the pool of generated tasks is restricted to the training task distribution.

Data-augmentation for Reinforcement Learning

Many image augmentation and regularization techniques such as random convolution, random shift, l2-regularization, dropout, batch normalization, and noise injection have been shown to improve the generalization of RL [5, 16, 22, 24, 33, 50]. Mixreg [44], which applies the idea of mixup [53] to RL, generates new training data as convex interpolations of input observations and output rewards. LDM can be thought of as a data-augmentation method in the sense that it generates mixture tasks using data from the training tasks. However, LDM generates a new task in the latent space with the latent model that fully encodes the MDPs’ dynamics. Instead of pre-defined, heuristic augmentation techniques, we generate mixture tasks using the learned decoder, which contains the shared structure of the MDPs.

Out-of-distribution Meta-Reinforcement Learning

Some recent works aim to generalize RL to OOD tasks [8]. MIER [30] relabels past trajectories in a buffer at test time to generate synthetic training data suited to the test MDP. FLAP [31] learns a shared linear representation of the policy, so that adapting to a new task only requires predicting a set of linear weights. MetaGenRL [21] meta-learns an RL objective to train a randomly initialized policy on a test task. Most of these methods require experience from the test task for additional training or network updates. AdMRL [26] performs adversarial virtual training with varying rewards; it assumes a known reward space and parameterization, whereas LDM meta-learns the task structure without such assumptions.

5 Experiments

We evaluate LDM and other meta-RL methods on the gridworld example (Figure 1) and three MuJoCo meta-RL tasks [42]. We slightly modify the standard MuJoCo tasks by splitting the task space $\mathcal{M}$ into disjoint $\mathcal{M}_{\textrm{train}}$ and $\mathcal{M}_{\textrm{test}}$. We use $\textrm{RL}^{2}$ [6] and variBAD [56] as baselines representing context-based meta-RL methods. LDM without mixture training and the latent dynamics network reduces to $\textrm{RL}^{2}$; LDM without mixture training reduces to variBAD if the policy and latent dynamics networks share an encoder. We use E-MAML [40] and ProMP [36] as baselines representing gradient-based meta-RL methods. We implement an $\textrm{RL}^{2}$-based Mixreg [44] to evaluate the difference between generating mixture tasks in the latent space and in the observation space. Refer to Appendix A for implementation details and hyperparameters.

All methods are trained on tasks sampled uniformly at random from $\mathcal{M}_{\textrm{train}}$, except for the oracle methods, which are trained on tasks in the entire task set $\mathcal{M}=\mathcal{M}_{\textrm{train}}\cup\mathcal{M}_{\textrm{test}}$. Note that the oracle performance is only for reference and cannot, in principle, be achieved by any non-oracle method. Non-oracle methods without prior knowledge of $\mathcal{M}_{\textrm{test}}$ require additional exploration during testing to experience the changes from $\mathcal{M}_{\textrm{train}}$, whereas the oracle agent can exploit its knowledge of the test distribution. In the gridworld example, the oracle agent can search for goals in $\mathcal{M}_{\textrm{test}}$ before navigating to the outermost goal states of $\mathcal{M}_{\textrm{train}}$. Therefore the main focus should be on the relative improvement of LDM compared to the non-oracle methods, bridging the gap between the oracle and non-oracle methods. We report mean results over 8 random seeds and apply a moving average of window size 5 for all main experiments (4 seeds for ablations). Shaded areas indicate standard deviations in all plots.

5.1 Gridworld Navigation

Figure 3: Results for the gridworld task evaluated at the $N$-th episode in terms of the mean returns in $\mathcal{M}_{\textrm{train}}$ and $\mathcal{M}_{\textrm{test}}$, and the number of tasks in $\mathcal{M}_{\textrm{test}}$ in which the agent fails to reach the goal.
Experimental setup

We use the gridworld navigation task introduced in Figure 1. The agent is allowed 5 actions: up, down, left, right, and stay. The reward is 1 for reaching or staying at the hidden goal and $-0.1$ for all other transitions. Each episode lasts $H=30$ steps. All baselines are given $N=4$ rollout episodes for a fixed task, except for ProMP and E-MAML, which are given $N=20$ rollouts; these choices of $N$ follow the baselines’ reference implementations. The time horizon is set carefully so that the agent cannot visit all states within the first episode but can visit them within two episodes. When a rollout ends, the agent is reset to the origin. The optimal policy is to search for the hidden goal and stay at the goal, or return to the goal as quickly as possible. After $N$ episodes, a new task is sampled from $\mathcal{M}_{\textrm{train}}$ uniformly at random. For this task, the reward decoder requires only the latent model and the next state as input. We apply dropout with rate $p_{\textrm{drop}}=0.7$ to the next-state input of the reward decoder.
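For concreteness, the reward structure and episode budget just described amount to the following (a toy sketch; the actual environment follows the variBAD gridworld code):

```python
ACTIONS = ["up", "down", "left", "right", "stay"]
H, N = 30, 4                  # steps per episode, episodes per task
H_PLUS = N * H                # total steps before a new task is sampled

def gridworld_reward(agent_pos, goal_pos):
    """+1 for reaching or staying at the hidden goal, -0.1 otherwise."""
    return 1.0 if agent_pos == goal_pos else -0.1
```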

Results

Except for ProMP and E-MAML, which use feedforward networks, all the baseline methods achieve the same optimal performance in $\mathcal{M}_{\textrm{train}}$ at the $N$-th episode (Figure 3). However, our method outperforms the non-oracle baselines in $\mathcal{M}_{\textrm{test}}$. Out of the 27 tasks in $\mathcal{M}_{\textrm{test}}$, our agent succeeds in visiting the goals in 19 tasks on average. Although LDM still fails to reach 8 test goals on average, considering that there is no prior information on the test distribution and that the optimal policy for $\mathcal{M}_{\textrm{train}}$ (dashed path in Figure 1a) does not visit most of the goal states in $\mathcal{M}_{\textrm{test}}$, the improvement of LDM over RL2 and variBAD is significant. RL2 achieves a high test return initially, but as the policy overfits to the training tasks, its test return decays. VariBAD achieves high returns in $\mathcal{M}_{\textrm{train}}$ but fails to generalize in most of the tasks in $\mathcal{M}_{\textrm{test}}$. Mixreg performs better than RL2 but does not match LDM.

Refer to Figure 8a for an example of LDM’s test-time behavior on a task in $\mathcal{M}_{\textrm{test}}$ that was introduced in Figure 1a. The agent searches for a goal in $\mathcal{M}_{\textrm{train}}$ first. Once it has explored all the goal states of $\mathcal{M}_{\textrm{train}}$, it starts to search for the goals in $\mathcal{M}_{\textrm{test}}$. From the second episode on, the agent heads directly to the goal based on the context built during the first episode. Note that the initial prior is nonzero even for the goals in $\mathcal{M}_{\textrm{test}}$ due to the dropout applied to the decoder’s state input.

5.1.1 Tasks generated by LDM

Figure 4: Examples of mixture tasks generated by LDM. First row: mixture weights $\alpha^{(i)}$ multiplied with the latent models $m_{H^{+}}^{(i)}$ at the end of each training task (Equation 1). When the same training task is sampled multiple times, we plot the sum of its weights. Second row (reward map): decoder output for each next state conditioned on the mixture weights from the first row. A cross mark denotes the state with the maximum mixture reward.

We present an empirical analysis in Figure 4 to verify that LDM indeed generates meaningful new tasks that help solve the test tasks. Without prior knowledge of or interaction with $\mathcal{M}_{\textrm{test}}$, it is impossible to create exactly the same reward maps as $\mathcal{M}_{\textrm{test}}$. However, the reward maps in Figures 4c, 4d, and 4e are sufficient to induce exploration toward the goal states of $\mathcal{M}_{\textrm{test}}$. Due to the dropout applied to the next-state input of the decoder, the decoder assigns high rewards to some goals that belong to $\mathcal{M}_{\textrm{test}}$. Note that without dropout the decoder does not generalize and therefore assigns high rewards only to the training goals (Figure 4a).

5.1.2 Extrapolation ability of LDM

To demonstrate that LDM is not restricted to target tasks inside the convex hull of the training tasks, we design a gridworld-extrapolation task as in Figure 5. This task is similar to the gridworld task in Section 5.1, except that we shift the outermost cells of $\mathcal{M}_{\textrm{train}}$ inward to construct the extrapolation tasks $\mathcal{M}_{\textrm{test2}}$ (Figure 5c). Refer to Figure 6 for the final return of each task when we train LDM ($p_{\textrm{drop}}=0.5$) with different values of $\beta$. For small values $\beta=1.0$ and $\beta=1.5$, LDM focuses on training the interpolation tasks in $\mathcal{M}_{\textrm{test1}}$. As $\beta$ increases, the returns for tasks in $\mathcal{M}_{\textrm{test1}}$ decrease, but the returns for tasks in $\mathcal{M}_{\textrm{test2}}$ increase.

Figure 5: Gridworld-extrapolation task: (a) $\mathcal{M}_{\textrm{train}}$, (b) $\mathcal{M}_{\textrm{test1}}$, (c) $\mathcal{M}_{\textrm{test2}}$.
Figure 6: Returns of each task for different extrapolation levels $\beta$ on the gridworld-extrapolation task. Mean return for each task in $\mathcal{M}$ at the $N$-th episode, averaged over 4 random seeds.

5.2 MuJoCo

Figure 7: Mean returns at the $N$-th episode in $\mathcal{M}_{\textrm{eval}}$ for the three MuJoCo tasks.
Experimental setup

We evaluate our agent and the baselines on three standard MuJoCo domains (Ant-direction, Ant-goal, and Half-cheetah-velocity) to verify how effective LDM is for continuous control tasks. We evaluate the agents on a fixed set of evaluation tasks $\mathcal{M}_{\textrm{eval}}\subset\mathcal{M}_{\textrm{test}}$ for each domain to ensure the test results are not affected by the sampling of evaluation tasks (Table 1). For the MuJoCo tasks we report the results of PEARL [35], MQL [8], and MIER [30] with the number of rollouts per iteration ($N$) equal to 3, 11, and 3, respectively. ProMP and E-MAML are given $N=20$ rollouts and all other methods are given $N=2$ rollouts (following the reference implementations). Each episode of every MuJoCo task lasts $H=200$ steps. Because PEARL, MQL, and MIER use an off-policy RL algorithm, their performance converges with much less training data; for these three baselines, we therefore report the converged asymptotic performance after 5 million steps of environment interaction. We set LDM’s $p_{\textrm{drop}}=0.5$ for all MuJoCo tasks.

Table 1: Sets of training, test, and evaluation tasks for the MuJoCo domains; $k\in\{0,1,2,3\}$.
Ant-direction (target angle $\theta$): $\mathcal{M}_{\textrm{train}}$: $\theta=90^{\circ}\times k$; $\mathcal{M}_{\textrm{test}}$: $\theta=90^{\circ}\times k+45^{\circ}$; $\mathcal{M}_{\textrm{eval}}$: $\theta=90^{\circ}\times k+45^{\circ}$.
Ant-goal (goal radius $r$ and angle $\theta$): $\mathcal{M}_{\textrm{train}}$: $r\in[0.0,1.0)\cup[2.5,3.0)$, $\theta\in[0^{\circ},360^{\circ})$; $\mathcal{M}_{\textrm{test}}$: $r\in[1.0,2.5)$, $\theta\in[0^{\circ},360^{\circ})$; $\mathcal{M}_{\textrm{eval}}$: $r=1.75$, $\theta=90^{\circ}\times k$.
Half-cheetah-velocity (target velocity $v$): $\mathcal{M}_{\textrm{train}}$: $v\in[0.0,0.5)\cup[3.0,3.5)$; $\mathcal{M}_{\textrm{test}}$: $v\in[0.5,3.0)$; $\mathcal{M}_{\textrm{eval}}$: $v\in\{0.75,1.25,1.75,2.25,2.75\}$.
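The disjoint splits in Table 1 can be sampled as in the sketch below (illustrative helpers; angles are in degrees, and the uniform choice between the two training bands is our simplification).

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_ant_direction(split):
    """Training angles are multiples of 90 deg; test angles are offset by 45 deg,
    so the two sets never overlap."""
    k = int(rng.integers(0, 4))
    return 90 * k + (45 if split == "test" else 0)

def sample_cheetah_velocity(split):
    """Train on the two extreme velocity bands, test on the band in between."""
    if split == "train":
        low, high = [(0.0, 0.5), (3.0, 3.5)][int(rng.integers(0, 2))]
    else:
        low, high = 0.5, 3.0
    return float(rng.uniform(low, high))
```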
Ant-direction

We construct the Ant-direction task based on the example discussed in the introduction. $\theta$ denotes the angle of the target direction from the origin. A major component of the reward is the dot product of the target direction and the agent’s velocity.

Ant-goal

We construct the Ant-goal task similarly to the gridworld task: the agent has to reach an unknown goal position. Unlike the gridworld and Ant-direction tasks, the training task distribution $\mathcal{M}_{\textrm{train}}$ (shaded region of Figure 8c) is continuous. $r$ and $\theta$ denote the radius and angle of the goal from the origin, respectively. A major component of the reward is the negative taxicab distance between the agent and the goal. We set $\beta=2$ because LDM with $\beta=1$ mostly generates goals near the origin due to the symmetry of the training tasks. Refer to Appendix C for a detailed analysis of the choice of $\beta$. One may argue that the training and test tasks are not disjoint because the agent passes through the goal states of $\mathcal{M}_{\textrm{test}}$ momentarily while learning to solve $\mathcal{M}_{\textrm{train}}$. However, task inference with the unseen rewards from $\mathcal{M}_{\textrm{test}}$ cannot be learned during training, and neither can the policy of staying at the goals of $\mathcal{M}_{\textrm{test}}$ until the end of the time horizon.

Half-cheetah-velocity

We train the Half-cheetah agent to match a target velocity sampled from the two distributions at the extremes and test on target velocities in between. This task is relatively easier than the previous Ant tasks due to the reduced dimension of the task space. Therefore we train more mixture workers than in the other tasks: $n=12$ normal workers and $\hat{n}=4$ mixture workers in parallel. Refer to Appendix D for additional results with different numbers of mixture workers. The target velocity is $v$, where the velocity is measured as the change of position per second (or 20 environment steps). A major component of the reward is the negative value of the difference between the target velocity and the agent’s velocity. Similarly to Ant-goal, although the test velocities in $\mathcal{M}_{\textrm{test}}$ are achieved momentarily in the process of reaching a target velocity in $[3.0,3.5)$, the agent does not learn to maintain the target velocities of $\mathcal{M}_{\textrm{test}}$ while training in $\mathcal{M}_{\textrm{train}}$.

Figure 8: Example trajectories of the agents in $\mathcal{M}_{\textrm{eval}}$. We illustrate the behavior at the $N$-th episode as colored paths. The targets of $\mathcal{M}_{\textrm{eval}}$ are indicated as dashed lines or cross marks. (a) Sample trajectory of LDM over 4 rollout episodes; the red shade denotes the decoder output for each state conditioned on the online context. (b) Ant-direction, 4 trajectories from $\mathcal{M}_{\textrm{eval}}$. (c) Ant-goal, 4 trajectories from $\mathcal{M}_{\textrm{eval}}$. (d) Half-cheetah-velocity, 5 trajectories from $\mathcal{M}_{\textrm{eval}}$.
Results

Refer to Figure 7 for the test returns in $\mathcal{M}_{\textrm{eval}}$. LDM outperforms the non-oracle baselines at the $N$-th episode. VariBAD oracle achieves the best test returns on all MuJoCo tests, revealing the strength of variBAD’s task inference ability for tasks seen during training. However, the non-oracle variBAD has difficulty generalizing to $\mathcal{M}_{\textrm{test}}$. The performance of Mixreg is lower than that of RL2, which reveals the limitation of mixing in a complex observation space. In all MuJoCo tasks, the agent can infer the true task dynamics using the reward at every timestep, even at the first step of an episode. Therefore gradient-based methods with feedforward networks also make progress, unlike in the gridworld task. Because PEARL is designed to collect data for the first two episodes, it achieves low returns before the $N$-th episode. Even after accumulating context, PEARL’s policy is not prepared for the unseen contexts in Ant-direction and Half-cheetah-velocity. Note that MQL and MIER require additional buffer-based relabeling or retraining at test time after collecting some rollouts of the test task. LDM prepares in advance during training, so it requires neither collection of test rollouts nor extra buffer-based training during testing.

Because LDM is not trained in $\mathcal{M}_{\textrm{test}}$, it cannot solve the target tasks optimally from the beginning of each task in the Ant-direction and Ant-goal domains, unlike variBAD oracle (Figures 8b and 8c). However, LDM reaches the target direction or goal closely enough after a sufficiently small amount of exploration, unlike RL2. For the Half-cheetah-velocity task, LDM produces well-separated policies compared to RL2 (Figure 8d). Refer to Appendix F for additional experimental results on the test returns at the first rollout episode, the training returns, and sample trajectories on training tasks.

6 Conclusion

We propose Latent Dynamics Mixture to improve the generalization ability of meta-RL by training the policy with generated mixture tasks. Our method outperforms baseline meta-RL methods in experiments with strictly disjoint training and test task distributions, even reaching the oracle performance in some tasks. Because our latent dynamics network and the task generation process are independent of the policy network, we expect LDM to make orthogonal contributions when combined not only with RL2 but with most other meta-RL methods.

We believe that our work can be a starting point for many interesting future works. For example, instead of the heuristic weight sampling used to generate the mixture, we may incorporate OOD generation techniques in the latent space. Another extension is to train the latent dynamics network to decode not only the reward but also the state transition dynamics. Then we could generate a purely imaginary mixture task without additional environment interaction.

Acknowledgments

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) [2021R1A2C2007518]. We thank Dong Hoon Lee, Minguk Jang, and Dong Geun Shin for constructive feedback and discussions.

References

  • Cassandra et al. [1994] A. R. Cassandra, L. P. Kaelbling, and M. L. Littman. Acting optimally in partially observable stochastic domains. In National Conference on Artificial Intelligence, 1994.
  • Chandak et al. [2020] Y. Chandak, G. Theocharous, S. Shankar, M. White, S. Mahadevan, and P. Thomas. Optimizing for the future in non-stationary mdps. In International Conference on Machine Learning, (ICML), pages 119:1414–1425, 2020.
  • Chebotar et al. [2019] Y. Chebotar, A. Handa, V. Makoviychuk, M. Macklin, J. Issac, N. D. Ratliff, and D. Fox. Closing the sim-to-real loop: Adapting simulation randomization with real world experience. In International Conference on Robotics and Automation, (ICRA), pages 8973–8979, 2019.
  • Chen et al. [2018] B. Chen, W. Deng, and H. Shen. Virtual class enhanced discriminative embedding learning. In Advances in Neural Information Processing Systems (NeurIPS), pages 31:1942–1952, 2018.
  • Cobbe et al. [2019] K. Cobbe, O. Klimov, C. Hesse, T. Kim, and J. Schulman. Quantifying generalization in reinforcement learning. In International Conference on Machine Learning, (ICML), pages 97:1282–1289, 2019.
  • Duan et al. [2016] Y. Duan, J. Schulman, X. Chen, P. L. Bartlett, I. Sutskever, and P. Abbeel. $\textrm{RL}^{2}$: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv: 1611.02779, 2016.
  • Duff [2002] M. O. Duff. Optimal Learning: Computational Procedures for Bayes-adaptive Markov Decision Processes. University of Massachusetts at Amherst, 2002.
  • Fakoor et al. [2020] R. Fakoor, P. Chaudhari, S. Soatto, and A. J. Smola. Meta-q-learning. In International Conference on Learning Representations, (ICLR), 2020.
  • Finn et al. [2017] C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, (ICML), pages 70:1126–1135, 2017.
  • Florensa et al. [2018] C. Florensa, D. Held, X. Geng, and P. Abbeel. Automatic goal generation for reinforcement learning agents. In International Conference on Machine Learning (ICML), pages 80:1515–1528, 2018.
  • Ghavamzadeh et al. [2015] M. Ghavamzadeh, S. Mannor, J. Pineau, and A. Tamar. Bayesian reinforcement learning: A survey. Found. Trends Mach. Learn., pages 8(5–6):359–483, 2015.
  • Gupta et al. [2018a] A. Gupta, B. Eysenbach, C. Finn, and S. Levine. Unsupervised meta-learning for reinforcement learning. arXiv preprint arXiv: 1806.04640, 2018a.
  • Gupta et al. [2018b] A. Gupta, R. Mendonca, Y. Liu, P. Abbeel, and S. Levine. Meta-reinforcement learning of structured exploration strategies. In Advances in Neural Information Processing Systems (NeurIPS), pages 31:5302–5311, 2018b.
  • Hafner et al. [2020] D. Hafner, T. P. Lillicrap, J. Ba, and M. Norouzi. Dream to control: Learning behaviors by latent imagination. In International Conference on Learning Representations, (ICLR), 2020.
  • Humplik et al. [2019] J. Humplik, A. Galashov, L. Hasenclever, P. A. Ortega, Y. W. Teh, and N. Heess. Meta reinforcement learning as task inference. arXiv preprint arXiv: 1905.06424, 2019.
  • Igl et al. [2019] M. Igl, K. Ciosek, Y. Li, S. Tschiatschek, C. Zhang, S. Devlin, and K. Hofmann. Generalization in reinforcement learning with selective noise injection and information bottleneck. In Neural Information Processing Systems (NeurIPS), pages 32:13978–13990, 2019.
  • Jabri et al. [2019] A. Jabri, K. Hsu, B. Eysenbach, A. Gupta, S. Levine, and C. Finn. Unsupervised curricula for visual meta-reinforcement learning. In Advances in Neural Information Processing Systems (NeurIPS), volume 32, 2019.
  • Kaddour et al. [2020] J. Kaddour, S. Sæmundsson, and M. P. Deisenroth. Probabilistic active meta-learning. In Advances in Neural Information Processing Systems (NeurIPS), pages 33:20813–20822, 2020.
  • Kaelbling et al. [1998] L. P. Kaelbling, M. L. Littman, and A. R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101:99–134, 1998.
  • Kingma and Welling [2014] D. P. Kingma and M. Welling. Auto-encoding variational bayes. In International Conference on Learning Representations (ICLR), 2014.
  • Kirsch et al. [2020] L. Kirsch, S. van Steenkiste, and J. Schmidhuber. Improving generalization in meta reinforcement learning using learned objectives. In International Conference on Learning Representations, (ICLR), 2020.
  • Laskin et al. [2020] M. Laskin, K. Lee, A. Stooke, L. Pinto, P. Abbeel, and A. Srinivas. Reinforcement learning with augmented data. In Neural Information Processing Systems (NeurIPS), pages 33:19884–19895, 2020.
  • Lee et al. [2020a] A. X. Lee, A. Nagabandi, P. Abbeel, and S. Levine. Stochastic latent actor-critic: Deep reinforcement learning with a latent variable model. In Advances in Neural Information Processing Systems (NeurIPS), pages 33:741–752, 2020a.
  • Lee et al. [2020b] K. Lee, K. Lee, J. Shin, and H. Lee. Network randomization: A simple technique for generalization in deep reinforcement learning. In International Conference on Learning Representations, (ICLR), 2020b.
  • Lee et al. [2020c] K. Lee, Y. Seo, S. Lee, H. Lee, and J. Shin. Context-aware dynamics model for generalization in model-based reinforcement learning. In International Conference on Machine Learning, (ICML), pages 119:5757–5766, 2020c.
  • Lin et al. [2020] Z. Lin, G. Thomas, G. Yang, and T. Ma. Model-based adversarial meta-reinforcement learning. In Advances in Neural Information Processing Systems (NeurIPS), pages 33:10161–10173, 2020.
  • Liu et al. [2019] S. Liu, A. J. Davison, and E. Johns. Self-supervised generalisation with meta auxiliary learning. In Neural Information Processing Systems (NeurIPS), pages 32:1679–1689, 2019.
  • Martin [1967] J. J. Martin. Bayesian Decision Problems and Markov Chains. Wiley, 1967.
  • Mehta et al. [2020] B. Mehta, T. Deleu, S. C. Raparthy, C. J. Pal, and L. Paull. Curriculum in gradient-based meta-reinforcement learning. arXiv preprint arXiv:2002.07956, 2020.
  • Mendonca et al. [2020] R. Mendonca, X. Geng, C. Finn, and S. Levine. Meta-reinforcement learning robust to distributional shift via model identification and experience relabeling. arXiv preprint arXiv: 2006.07178, 2020.
  • Peng et al. [2021] M. Peng, B. Zhu, and J. Jiao. Linear representation meta-reinforcement learning for instant adaptation. arXiv preprint arXiv: 2101.04750, 2021.
  • Raileanu et al. [2020a] R. Raileanu, M. Goldstein, A. Szlam, and R. Fergus. Fast adaptation to new environments via policy-dynamics value functions. In International Conference on Machine Learning, (ICML), pages 119:7920–7931, 2020a.
  • Raileanu et al. [2020b] R. Raileanu, M. Goldstein, D. Yarats, I. Kostrikov, and R. Fergus. Automatic data augmentation for generalization in deep reinforcement learning. arXiv preprint arXiv:2006.12862, 2020b.
  • Rajeswaran et al. [2017] A. Rajeswaran, K. Lowrey, E. Todorov, and S. M. Kakade. Towards generalization and simplicity in continuous control. In Advances in Neural Information Processing Systems (NeurIPS), pages 30:6550–6561, 2017.
  • Rakelly et al. [2019] K. Rakelly, A. Zhou, C. Finn, S. Levine, and D. Quillen. Efficient off-policy meta-reinforcement learning via probabilistic context variables. In International Conference on Machine Learning, (ICML), pages 97:5331–5340, 2019.
  • Rothfuss et al. [2019] J. Rothfuss, D. Lee, I. Clavera, T. Asfour, and P. Abbeel. Promp: Proximal meta-policy search. In International Conference on Learning Representations, (ICLR), 2019.
  • Schulman et al. [2017] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv: 1707.06347, 2017.
  • Song et al. [2020a] X. Song, Y. Jiang, S. Tu, Y. Du, and B. Neyshabur. Observational overfitting in reinforcement learning. In International Conference on Learning Representations, (ICLR), 2020a.
  • Song et al. [2020b] Y. Song, A. Mavalankar, W. Sun, and S. Gao. Provably efficient model-based policy adaptation. In International Conference on Machine Learning (ICML), pages 119:9088–9098, 2020b.
  • Stadie et al. [2018] B. C. Stadie, G. Yang, R. Houthooft, X. Chen, Y. Duan, Y. Wu, P. Abbeel, and I. Sutskever. The importance of sampling in meta-reinforcement learning. In Advances in Neural Information Processing Systems (NeurIPS), pages 31:9280–9290, 2018.
  • Tobin et al. [2017] J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. In International Conference on Intelligent Robots and Systems (IROS), pages 23–30, 2017.
  • Todorov et al. [2012] E. Todorov, T. Erez, and Y. Tassa. Mujoco: A physics engine for model-based control. In International Conference on Intelligent Robots and Systems (IROS), pages 5026–5033, 2012.
  • Wang et al. [2016] J. X. Wang, Z. Kurth-Nelson, D. Tirumala, H. Soyer, J. Z. Leibo, R. Munos, C. Blundell, D. Kumaran, and M. Botvinick. Learning to reinforcement learn. In Annual Meeting of the Cognitive Science Community (CogSci), 2016.
  • Wang et al. [2020a] K. Wang, B. Kang, J. Shao, and J. Feng. Improving generalization in reinforcement learning with mixture regularization. In Advances in Neural Information Processing Systems (NeurIPS), pages 33:7968–7978, 2020a.
  • Wang et al. [2019] R. Wang, J. Lehman, J. Clune, and K. O. Stanley. Paired open-ended trailblazer (POET): endlessly generating increasingly complex and diverse learning environments and their solutions. arXiv preprint arXiv: 1901.01753, 2019.
  • Wang et al. [2020b] R. Wang, J. Lehman, A. Rawal, J. Zhi, Y. Li, J. Clune, and K. Stanley. Enhanced poet: Open-ended reinforcement learning through unbounded invention of learning challenges and their solutions. In International Conference on Machine Learning, (ICML), pages 119:9940–9951, 2020b.
  • Whiteson et al. [2011] S. Whiteson, B. Tanner, M. E. Taylor, and P. Stone. Protecting against evaluation overfitting in empirical reinforcement learning. In IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), pages 120–127, 2011.
  • Xie et al. [2020] A. Xie, J. Harrison, and C. Finn. Deep reinforcement learning amidst lifelong non-stationarity. arXiv preprint arXiv: 2006.10701, 2020.
  • Yang et al. [2020] J. Yang, B. K. Petersen, H. Zha, and D. Faissol. Single episode policy transfer in reinforcement learning. In International Conference on Learning Representations, (ICLR), 2020.
  • Yarats et al. [2021] D. Yarats, I. Kostrikov, and R. Fergus. Image augmentation is all you need: Regularizing deep reinforcement learning from pixels. In International Conference on Learning Representations, (ICLR), 2021.
  • Yin et al. [2020] M. Yin, G. Tucker, M. Zhou, S. Levine, and C. Finn. Meta-learning without memorization. In International Conference on Learning Representations, (ICLR), 2020.
  • Zhang et al. [2018a] C. Zhang, O. Vinyals, R. Munos, and S. Bengio. A study on overfitting in deep reinforcement learning. arXiv preprint arXiv: 1804.06893, 2018a.
  • Zhang et al. [2018b] H. Zhang, M. Cissé, Y. N. Dauphin, and D. Lopez-Paz. mixup: Beyond empirical risk minimization. In International Conference on Learning Representations, (ICLR), 2018b.
  • Zhu et al. [2020] G. Zhu, M. Zhang, H. Lee, and C. Zhang. Bridging imagination and reality for model-based deep reinforcement learning. In Advances in Neural Information Processing Systems (NeurIPS), pages 33:8993–9006, 2020.
  • Zintgraf et al. [2019] L. M. Zintgraf, K. Shiarlis, V. Kurin, K. Hofmann, and S. Whiteson. Fast context adaptation via meta-learning. In International Conference on Machine Learning, (ICML), pages 97:7693–7702, 2019.
  • Zintgraf et al. [2020] L. M. Zintgraf, K. Shiarlis, M. Igl, S. Schulze, Y. Gal, K. Hofmann, and S. Whiteson. Varibad: A very good method for bayes-adaptive deep RL via meta-learning. In International Conference on Learning Representations, (ICLR), 2020.

Checklist

  1. For all authors…
     (a) Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope? [Yes]
     (b) Did you describe the limitations of your work? [Yes] In the conclusion section.
     (c) Did you discuss any potential negative societal impacts of your work? [N/A]
     (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
  2. If you are including theoretical results…
     (a) Did you state the full set of assumptions of all theoretical results? [N/A]
     (b) Did you include complete proofs of all theoretical results? [N/A]
  3. If you ran experiments…
     (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplementary material or as a URL)? [Yes] As a URL in the Appendix.
     (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] In the Appendix.
     (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes]
     (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] In the Appendix.
  4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets…
     (a) If your work uses existing assets, did you cite the creators? [Yes] Source codes of the baselines.
     (b) Did you mention the license of the assets? [Yes] In the supplementary material.
     (c) Did you include any new assets either in the supplementary material or as a URL? [Yes] Links to the baseline codes in the Appendix.
     (d) Did you discuss whether and how consent was obtained from people whose data you’re using/curating? [N/A]
     (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]
  5. If you used crowdsourcing or conducted research with human subjects…
     (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
     (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
     (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]

Appendix A Implementation Details

A.1 Baselines

RL2 and variBAD

We use the open-source reference implementation of variBAD at https://github.com/lmzintgraf/varibad to report the results of RL2 and variBAD. All gridworld and MuJoCo environments used for our experiments are based on this implementation. We modify the environments to contain separate $\mathcal{M}_{\textrm{train}}$ and $\mathcal{M}_{\textrm{test}}$; the oracle versions of RL2 and variBAD use the original environment $\mathcal{M}$. We keep all the network structures and hyperparameters of the reference implementation except for the gridworld task, where we increase the GRU hidden size from 64 to 128 and the GRU output size from 5 to 32 because our gridworld ($7\times 7$) has more cells than the gridworld used in variBAD ($5\times 5$).

LDM

We implement LDM based on the implementation of variBAD. Refer to Table 3 for the hyperparameters used to train LDM. Most of the network structures and hyperparameters follow the reference implementation of variBAD: the policy network of LDM is from the RL2 implementation of variBAD, and the latent dynamics network is from the VAE part of variBAD. VariBAD uses a multi-head structure (one head per goal state in $\mathcal{M}_{\textrm{train}}$) with a binary cross-entropy (BCE) loss for the decoder output in the gridworld task. Because LDM needs to generate rewards for tasks outside $\mathcal{M}_{\textrm{train}}$ as well, we modify the decoder to a general single-head structure. The latent model $m$ is multidimensional, so we sample the weight corresponding to each dimension independently; the weights in Figure 4a represent the mean weights over all dimensions. Refer to our reference implementation at https://github.com/suyoung-lee/LDM.

Mixreg

The original Mixreg is based on Deep Q-Network (DQN). We therefore implement a variant of Mixreg based on RL2 by modifying the code we use for RL2. We use the same Dirichlet mixture weights used for LDM and multiply the weights with the states and rewards to generate mixture tasks. We keep the ratio between true tasks and mixture tasks at 14:2.

ProMP and E-MAML

We use the open-source reference implementation of ProMP at https://github.com/jonasrothfuss/ProMP for ProMP and E-MAML. We only modify the environments to contain separate $\mathcal{M}_{\textrm{train}}$ and $\mathcal{M}_{\textrm{test}}$ and otherwise keep the reference implementation setup.

PEARL, MQL, and MIER

We use the open-source reference implementations of PEARL at https://github.com/katerakelly/oyster, MQL at https://github.com/amazon-research/meta-q-learning, and MIER at https://github.com/russellmendonca/mier_public. We only modify the environments to contain separate $\mathcal{M}_{\textrm{train}}$ and $\mathcal{M}_{\textrm{test}}$ and otherwise keep the reference implementation setup. For the MuJoCo tasks, we report the performance at 5 million steps as the asymptotic performance.

A.2 Runtime Comparison

We report the average runtime spent to train Half-cheetah-velocity for 5e7 environment steps (5e6 steps for PEARL) in Table 2. We ran multiple experiments simultaneously on our machine (Nvidia TITAN X), so the numbers should be read as a relative ordering of training cost rather than as exact measurements.

Table 2: Mean runtime to train Half-cheetah-velocity.

Method             LDM   Mixreg   RL2   variBAD   ProMP   E-MAML   PEARL
Runtime (hours)     31       28    25        10       2        2      25

ProMP and E-MAML require the least training time because they do not use recurrent networks. VariBAD requires less training time than RL2 because variBAD does not backpropagate the policy network's gradient through the recurrent encoder. LDM requires more training time than RL2 because LDM trains the policy network and a separate latent dynamics network.

A.3 Hyperparameters

Table 3: Hyperparameters of LDM.

RL algorithm
  Gridworld: A2C (Epsilon 1.0e-5, Discount 0.95, Max grad norm 0.5, Value loss coeff. 0.5, Entropy coeff. 0.01, GAE parameter tau 0.95)
  MuJoCo: PPO (Batch size 3200, 4 minibatches, Max grad norm 0.5, Clip parameter 0.1, Value loss coeff. 0.5, Entropy coeff. 0.01)
Number of steps of a rollout episode (H): Gridworld 30; MuJoCo 200
Number of rollout episodes per iteration (N): Gridworld 4; MuJoCo 2
Extrapolation level (\beta): Gridworld 1.0; MuJoCo 1.0 (2.0 for Ant-goal)
Decoder input dropout rate (p_{\textrm{drop}}): Gridworld 0.7; MuJoCo 0.5
Number of parallel processes
  Normal workers (n): Gridworld 14; MuJoCo 14 (12 for Cheetah-vel)
  Mixture workers (\hat{n}): Gridworld 2; MuJoCo 2 (4 for Cheetah-vel)

Policy Network
  Encoder-p (\phi_{p})
    State encoding: Gridworld 1 FC layer, 32 dim; MuJoCo 1 FC layer, 32 dim
    Action encoding: Gridworld 0 dim; MuJoCo 1 FC layer, 16 dim
    Reward encoding: Gridworld 1 FC layer, 8 dim; MuJoCo 1 FC layer, 16 dim
    GRU: 128 hidden size (both)
    GRU output (b_{t}) size: Gridworld 32; MuJoCo 128
  Policy (\pi_{\psi}): Gridworld 1 FC layer, 32 nodes; MuJoCo 1 FC layer, 128 nodes
  Activation: tanh (both)
  Learning rate: 7.0e-4 (both)

Latent Dynamics Network
  Encoder-v (\phi_{v})
    State encoding: Gridworld 1 FC layer, 32 dim; MuJoCo 1 FC layer, 32 dim
    Action encoding: Gridworld 0 dim; MuJoCo 1 FC layer, 16 dim
    Reward encoding: Gridworld 1 FC layer, 8 dim; MuJoCo 1 FC layer, 16 dim
    GRU: 128 hidden size (both)
  Task embedding (m_{t}) size, sampled from the GRU output dimension: Gridworld 32; MuJoCo 5 for Ant-direction and Cheetah-vel, 10 for Ant-goal
  Reward decoder (\theta_{R}): Gridworld 2 FC layers, 32 and 32 nodes; MuJoCo 2 FC layers, 64 and 32 nodes
  Decoder loss function: Gridworld BCE; MuJoCo MSE
  Activation: ReLU (both)
  Learning rate: 0.001 (both)
  Buffer size: Gridworld 100000; MuJoCo 10000

Appendix B Test-time Generalization of the Latent Model

Figure 9: Latent dynamics network's learned latent models on 45 tasks in the gridworld and 48 tasks in Ant-goal.
(a) Gridworld: sample tasks.
(b) Gridworld: t-SNE plot of the test-time latent model m_{H^{+}}. The cross marks denote the tasks that the LDM agent fails to visit.
(c) Ant-goal: sample tasks.
(d) Ant-goal: t-SNE plot of the test-time latent model m_{H^{+}}. The number on each point times 45^{\circ} denotes the target goal's angle from the origin. Latent models with red numbers belong to \mathcal{M}_{\textrm{test}}.

We empirically demonstrate that the structure of the test tasks is well reflected in the latent models although the latent dynamics network is not trained on \mathcal{M}_{\textrm{test}} and \hat{M} (Figure 9). For each task in the gridworld, we collect the latent model at the last step (t=H^{+}). We then reduce the dimension of the collected latent models to two via t-SNE (Figure 9b). The latent models of the tasks in \mathcal{M}_{\textrm{test}} lie between the inner subset and the outer subset of the training tasks. Similarly, we evaluate the latent dynamics model in Ant-goal on the 48 tasks shown in Figure 9c, where r\in\{0.5,1.0,1.5,2.0,2.5,3.0\} and \theta\in\{0^{\circ},45^{\circ},90^{\circ},135^{\circ},180^{\circ},225^{\circ},270^{\circ},315^{\circ}\}. Although LDM is not trained on the tasks with r=1.5 and r=2.0, their latent models lie between the training tasks' latent models (Figure 9d). On the other hand, the policy network cannot be trained stably with a large input dropout (RL2 dropout in Figure 13). These empirical results support our claim that the latent dynamics network with dropout generalizes to unseen test dynamics even though the policy alone cannot, and therefore that mixtures of the latent models can generate tasks similar to the test tasks.
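The dimensionality reduction itself is a standard t-SNE embedding; a minimal sketch with scikit-learn is shown below, where the array of collected latent models is a placeholder.

```python
import numpy as np
from sklearn.manifold import TSNE

# Placeholder: latent models m_{H^+} collected at the last step of each of the
# 45 gridworld evaluation tasks (latent dimension 32, see Table 3).
latent_models = np.random.randn(45, 32)

# Reduce to two dimensions for plotting (Figure 9b / 9d style).
embedding = TSNE(n_components=2, perplexity=10, init="pca",
                 random_state=0).fit_transform(latent_models)  # shape (45, 2)
```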

Appendix C Analysis on the Ant-goal Task (Extrapolation Level \beta)

At the beginning of an iteration, we sample n=14 training tasks from \mathcal{M}_{\textrm{train}} on the Ant-goal task. Because we sample a sufficiently large number of training tasks, a mixture task's goal is located near the origin with high probability if we set \beta=1.0 (Figure 10). We therefore use \beta=2.0, which effectively improves the test returns on the Ant-goal task.
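The following is a minimal sketch of the goal-mixing procedure used for Figure 10; treating \beta as a factor that scales the Dirichlet weights away from the uniform mixture is a simplifying assumption here, and the function name and shapes are placeholders rather than the exact code of the reference implementation.

```python
import numpy as np

def mixture_goals(goals, n_hat=2, beta=1.0, rng=None):
    """Sketch: mix n training-goal coordinates into n_hat mixture goals.

    goals: (n, 2) array of goal coordinates sampled from M_train.
    beta = 1.0 gives a plain Dirichlet-weighted convex combination, which
    concentrates mixture goals near the mean of the training goals; beta > 1
    extrapolates the weights around the uniform mixture and pushes the
    mixture goals outward (assumed interpretation of the extrapolation level).
    """
    rng = np.random.default_rng() if rng is None else rng
    n = goals.shape[0]
    w = rng.dirichlet(np.ones(n), size=n_hat)   # (n_hat, n), rows sum to 1
    w = 1.0 / n + beta * (w - 1.0 / n)          # extrapolate around uniform weights
    return w @ goals                            # (n_hat, 2) mixture goal coordinates
```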

Figure 10: Expected mixture goals for different extrapolation levels \beta on the Ant-goal task. We sample training tasks from \mathcal{M}_{\textrm{train}} for 200 iterations. For each iteration, we sample n=14 goals for the normal workers, then mix the goal coordinates of the training tasks using the Dirichlet weights to generate \hat{n}=2 mixture goals. We plot the coordinates of the training and mixture goals.

Appendix D Number of Mixture Workers \hat{n}

Refer to Figure 11 for the returns in \mathcal{M}_{\textrm{eval}} and \mathcal{M}_{\textrm{eval-train}} (Table 4) on the Half-cheetah-velocity task for different values of \hat{n}. We report the results for \hat{n}\in\{2,3,4,5\} and keep the total number of workers fixed at 16. For all values of \hat{n}, LDM outperforms RL2 in test returns at the beginning of training. However, the test return of LDM decays as the training tasks dominate the policy updates. If \hat{n} is small, training of the policy is easily dominated by the training tasks, and the test return quickly converges to that of RL2. If \hat{n} is too large, the normal workers have difficulty learning the optimal policy for the training tasks (\hat{n}=5 in Figure 11, second row). We use \hat{n}=4, which strikes a balance between the two.

Figure 11: Training and test returns for different numbers of mixture workers \hat{n} on Half-cheetah-velocity, using 4 random seeds.

Appendix E Ablations on Dropout

E.1 Amount of Dropout p_{\textrm{drop}}

We report the performance of LDM with different dropout rates p_{\textrm{drop}}\in\{0.0,0.5,0.7,0.9\} in Figure 12. LDM without input dropout (p_{\textrm{drop}}=0.0) slightly outperforms RL2 at the end of training, but the improvement is insignificant. The test performance improves as the dropout rate increases. However, if the rate is too large (p_{\textrm{drop}}=0.9), it becomes difficult to train the decoder and the performance decreases.

Figure 12: Results of LDM with different dropout rates on the gridworld task. Evaluated at the N-th episode, using 4 random seeds.

E.2 Dropout on Other Baselines

To demonstrate that dropout alone is not sufficient to achieve better generalization, we evaluate RL2 dropout and variBAD dropout, both with p_{\textrm{drop}}=0.7, in Figure 13. RL2 does not use a decoder, so we apply dropout to the state input of the policy encoder. RL2 dropout cannot be trained stably, since training the policy network with a multi-step policy gradient loss is much more complex than training the decoder with a single-step regression loss. For variBAD dropout, we apply dropout to the state input of the decoder (same as LDM). VariBAD shares a single encoder between the VAE and the policy network and does not backpropagate the policy loss to the encoder. Therefore, the encoding, trained through a decoder with dropout, may prevent the policy network from training stably. We also evaluate LDM with a shared encoder, where we use a single encoder instead of separate encoder-p and encoder-v, and we again encounter instability in policy training when the encoding trained for a decoder with dropout is used for the policy. With separate encoders, even when the VAE is not perfectly trained due to the dropout, the policy is unaffected.
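For reference, the kind of decoder-input dropout discussed above can be written as the following PyTorch sketch; the module, the layer sizes (taken loosely from Table 3's gridworld column), and the state dimension are placeholders, not the exact decoder of the reference implementation.

```python
import torch
import torch.nn as nn

class RewardDecoderWithInputDropout(nn.Module):
    """Sketch: reward decoder that applies dropout only to its state input."""

    def __init__(self, latent_dim=32, state_dim=2, hidden=32, p_drop=0.7):
        super().__init__()
        self.state_dropout = nn.Dropout(p=p_drop)     # zeroes p_drop of the state features
        self.net = nn.Sequential(
            nn.Linear(latent_dim + state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),                     # predicted reward (a logit if BCE is used)
        )

    def forward(self, m, state):
        state = self.state_dropout(state)             # dropout on the state input only
        return self.net(torch.cat([m, state], dim=-1))
```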

Figure 13: Results of LDM, RL2, and variBAD trained with and without dropout on the gridworld task. Evaluated at the N-th episode, using 4 random seeds.

Appendix F Additional Experimental Results

F.1 Gridworld First Rollout Episode

The non-oracle methods spend most of the timesteps of the first episode exploring the goal states of \mathcal{M}_{\textrm{train}}. Therefore their mean returns in \mathcal{M}_{\textrm{test}} are significantly lower than those at the N-th episode (Figure 3). The oracle methods explore the goal states of \mathcal{M}_{\textrm{test}} before exploring the outermost states of \mathcal{M}_{\textrm{train}} during the first episode. Therefore the mean returns of the oracle methods in \mathcal{M}_{\textrm{train}} are lower than those of the non-oracle methods.

Figure 14: Results of the gridworld task evaluated at the first episode in terms of the mean returns in \mathcal{M}_{\textrm{train}} and \mathcal{M}_{\textrm{test}}, and the number of tasks in \mathcal{M}_{\textrm{test}} in which the agent fails to reach the goal.

F.2 MuJoCo First Rollout Episode

Since the task can be inferred from the reward at any timestep of a rollout, the performance of LDM at the first episode is nearly the same as that at the N-th episode (Figure 7). ProMP, E-MAML, and PEARL need to collect trajectories until the N-th episode.

Figure 15: Mean returns at the first episode in \mathcal{M}_{\textrm{eval}} of the three MuJoCo tasks.

F.3 MuJoCo Training Results

As we defined \mathcal{M}_{\textrm{eval}}\subset\mathcal{M}_{\textrm{test}}, we define \mathcal{M}_{\textrm{eval-train}}\subset\mathcal{M}_{\textrm{train}} to report the training results (Table 4).

Table 4: \mathcal{M}_{\textrm{eval-train}} for the MuJoCo tasks, with k\in\{0,1,2,3\}.

Ant-direction:          \theta = 90^{\circ}\times k
Ant-goal:               r\in\{0.50, 2.75\}, \theta = 90^{\circ}\times k
Half-cheetah-velocity:  v\in\{0.25, 3.25\}

Refer to Figure 16 for the mean returns on the training tasks and Figure 17 for the trajectories. LDM achieves higher training returns than its baseline RL2 on all tasks, although LDM devotes a portion of the training steps to mixture tasks. LDM achieves the best training performance on the Ant-direction task, whereas RL2's performance gradually decreases. When there are only a few training tasks, RL2-based methods often collapse into a single mode and fail to construct sharp decision boundaries between tasks (Figure 17a). VariBAD achieves high training returns in Ant-direction and Ant-goal, although its test returns in \mathcal{M}_{\textrm{test}} are low.

Figure 16: Mean return in \mathcal{M}_{\textrm{eval-train}} of the three MuJoCo tasks.
(a) Ant-direction, mean return of the 4 tasks in \mathcal{M}_{\textrm{eval-train}}.
(b) Ant-goal, mean return of the 8 tasks in \mathcal{M}_{\textrm{eval-train}}.
(c) Half-cheetah-velocity, mean return of the 2 tasks in \mathcal{M}_{\textrm{eval-train}}.

Figure 17: Example trajectories of the agents in \mathcal{M}_{\textrm{eval-train}} of the MuJoCo tasks. We illustrate the behavior at the N-th episode as colored paths. The targets of \mathcal{M}_{\textrm{eval-train}} are indicated as dashed lines or cross marks.
(a) Ant-direction, 4 trajectories from \mathcal{M}_{\textrm{eval-train}} in each plot. The target directions are dashed lines.
(b) Ant-goal, 8 trajectories from \mathcal{M}_{\textrm{eval-train}} for each method (4 in each plot). The cross marks are the goal positions.
(c) Half-cheetah-velocity, 2 trajectories from \mathcal{M}_{\textrm{eval-train}} in each plot.