Cerenaut, [email protected]
https://cerenaut.ai
Augmenting Replay in World Models for Continual Reinforcement Learning
Abstract
Continual RL requires an agent to learn new tasks without forgetting previous ones, while improving on both past and future tasks. The most common approaches use model-free algorithms, where replay buffers can help mitigate catastrophic forgetting but often struggle with scalability due to large memory requirements. Biologically inspired replay suggests replaying experiences to a world model, aligning with model-based RL, as opposed to the common setting of replay in model-free algorithms. Model-based RL offers benefits for continual RL by leveraging knowledge of the environment independently of the policy. We introduce WMAR (World Models with Augmented Replay), a model-based RL algorithm with a memory-efficient distribution-matching replay buffer. WMAR extends the well-known DreamerV3 algorithm, which employs a simple FIFO buffer and was not tested in continual RL. We evaluated WMAR and DreamerV3 with equally sized replay buffers on two scenarios: tasks with shared structure using OpenAI Procgen and tasks without shared structure using the Atari benchmark. WMAR demonstrated favourable properties for continual RL on metrics for forgetting as well as skill transfer to past and future tasks. Compared to DreamerV3, WMAR showed slight benefits on tasks with shared structure and substantially better forgetting characteristics on tasks without shared structure. Our results suggest that model-based RL with a memory-efficient replay buffer can be an effective approach to continual RL, justifying further research.
Keywords:
World models · Model-based RL · Continual RL · Continual Learning · Replay
1 Introduction
Typically, RL focuses on a single unchanging task. Continual RL, however, presents tasks sequentially, reflecting many real-world scenarios [13]. Continual RL poses the well-known challenge of catastrophic forgetting [17], where an agent forgets old tasks when learning new ones. Another challenge is to use prior knowledge to learn new tasks more efficiently. One approach to continual RL is to store experiences of all tasks to prevent catastrophic forgetting [14], often through a replay buffer. However, this requires very high storage capacity, reducing the scalability and accessibility of the algorithm [19].
The inspiration for many replay-based methods comes from Complementary Learning Systems (CLS) [9, 13], which describes learning in mammalian brains. The hippocampus memorises recent observations and replays them to the neocortex, which is a slow statistical learner. Replay is interleaved with new experiences, thus mitigating catastrophic forgetting. Interestingly, the neocortex is understood to form a world model whose purpose is to predict the consequences of our actions [16]. Although CLS describes replay to a world model, traditionally replay in RL is used in model-free RL to improve the policy, rather than a world model directly.
The idea of world models has been exploited in model-based RL algorithms, where a world model predicts the effects of actions on the environment [4, 6], most recently in the state-of-the-art DreamerV1 to V3 [5, 7, 8]. These have been applied to continual RL by several authors [18, 10, 12, 20]. World models are an intuitive choice for exploiting replay buffers, as they naturally support off-policy learning.
In this paper, we take DreamerV3 [8], add a memory-efficient replay buffer [11], and apply it to continual RL. Hence, we present World Models with Augmented Replay (WMAR). We applied WMAR to two settings. In the first, each task has a distinct environment and reward function; this is the most common setting in continual RL, and Atari games are often used. In the second, there is commonality between tasks, so learnt knowledge can be leveraged to perform subsequent tasks. This is referred to as tasks with ‘shared structure’ [13]. Often a video game is used, but conditions such as movement dynamics or the spacing of features in the environment change over time, e.g. [21]. Many potential real-world applications fall within the ‘shared structure’ setting, for example a robot assistant that should acquire new and related tasks within a home, like cleaning with a broom and then a mop. We used OpenAI Procgen (procedurally generated games) for tasks with ‘shared structure’, and Atari games for tasks ‘without shared structure’. Our analysis is not restricted to catastrophic forgetting but also includes backward and forward transfer of skills between tasks. We aim to achieve the following qualities of a successful CL agent [13, 2]:
• Stability: avoid forgetting and losing performance on previously learnt tasks.
• Backward transfer: increase performance on previously learnt tasks after training on new, similar tasks.
• Plasticity: learn new and dissimilar tasks without slowing down compared to learning them independently.
• Forward transfer: learn new, similar tasks faster compared to learning them independently.
• Scalability: low memory and computational requirements.
• Task IDs: do not rely on task identifiers, which are often unavailable.
The primary contributions include: a) applying model-based RL to continual RL for tasks with and without shared structure and b) augmenting the replay buffer of DreamerV3 with a memory-efficient long-term distribution matching buffer. For background and related work, see S.LABEL:app:background.
2 World models with augmented replay (WMAR)
WMAR extends DreamerV3, which has achieved state-of-the-art performance on several single-GPU RL benchmarks. WMAR consists of three primary components: a world model for modelling the environment, an actor-critic controller for acting on the environment, and an augmented replay buffer for storing past experiences. The replay buffer is used to train the world model, and the world model simulates ‘dreamed’ experiences, which are used to train the controller. A world model’s ability to simulate the environment enables off-policy learning and data augmentation, which are beneficial in continual learning where direct environment interaction may be limited. We hypothesise that maintaining the world model’s accuracy across tasks will help preserve performance on past environments while adapting to new ones. This approach does not require explicit task identifiers, potentially allowing for more flexible adaptation to changing environments.
The key components are described below, and more details are found in S.LABEL:app:world_models. The source code is available at https://github.com/cerenaut/wmar.
2.1 World model
A Recurrent State-Space Model (RSSM) [6] predicts environment dynamics (Figure 1). It maintains a deterministic hidden state and models a stochastic representation of the next state given current and previous observations and actions. Dynamics are modelled with a Gated Recurrent Unit (GRU) network, predicting the deterministic state $h_t$ and the stochastic state $z_t$. The stochastic state is either inferred by a variational autoencoder or predicted by the dynamics model for open-loop prediction in dreaming, where stochastic state posteriors are unavailable.
We used a standard GRU with Tanh activation for the recurrent component. The stochastic state comprises 32 discrete stochastic units, each with 32 categorical classes, following the architecture of DreamerV3.
The world model state at timestep $t$ is the concatenation of $h_t$ and $z_t$, forming a Markovian representation of the environment state. The world model is trained to reconstruct input images and rewards. KL balancing [7] helps to model the transition between states.
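For illustration, the following is a minimal sketch of a single RSSM step in PyTorch. The class and attribute names (MiniRSSM, prior_head, post_head, step) are assumptions for this sketch and do not correspond to the released WMAR code; the straight-through categorical sampling mirrors the DreamerV3-style 32 x 32 discrete latent described above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MiniRSSM(nn.Module):
    """Simplified RSSM step: deterministic GRU state h_t plus a
    32x32 categorical stochastic state z_t (DreamerV3-style)."""

    def __init__(self, embed_dim=256, action_dim=6, hidden=512, groups=32, classes=32):
        super().__init__()
        self.groups, self.classes = groups, classes
        stoch = groups * classes
        self.gru = nn.GRUCell(stoch + action_dim, hidden)      # h_t = f(h_{t-1}, z_{t-1}, a_{t-1})
        self.prior_head = nn.Linear(hidden, stoch)              # predicts z_t from h_t (dreaming)
        self.post_head = nn.Linear(hidden + embed_dim, stoch)   # infers z_t from h_t and obs embedding

    def _sample(self, logits):
        # Straight-through sample of one class per categorical group.
        logits = logits.view(-1, self.groups, self.classes)
        sample = F.one_hot(
            torch.distributions.Categorical(logits=logits).sample(), self.classes
        ).float()
        probs = logits.softmax(-1)
        return (sample + probs - probs.detach()).flatten(1)

    def step(self, h, z, action, embed=None):
        h = self.gru(torch.cat([z, action], -1), h)
        if embed is None:                                      # open-loop: use the prior (dreaming)
            z = self._sample(self.prior_head(h))
        else:                                                  # closed-loop: use the posterior
            z = self._sample(self.post_head(torch.cat([h, embed], -1)))
        return h, z, torch.cat([h, z], -1)                     # model state s_t = [h_t, z_t]
```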


2.2 Actor critic
The actor and critic (S.LABEL:app:network_arch) are MLPs that map the model state to actions and value estimates. They are trained entirely on trajectories generated stochastically by the world model, a process referred to as ‘dreaming’. The actor and critic are trained on-policy using REINFORCE [23]. Inevitable changes in the model state space are not an issue, since imagined trajectories are cheap to generate and do not require interaction with the actual environment, allowing this process to be run to convergence.
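The sketch below illustrates a REINFORCE-style actor loss on a single imagined trajectory, with the critic as a baseline. It is a simplification under assumed tensor inputs (log_probs, values, rewards produced while rolling the policy forward inside the world model), not the exact DreamerV3/WMAR objective, which additionally uses lambda-returns and return normalisation.

```python
import torch

def reinforce_losses(log_probs, values, rewards, discount=0.997,
                     entropy=None, entropy_scale=3e-4):
    """REINFORCE actor loss and squared-error critic loss on a dreamed
    trajectory of length T (all inputs are 1-D tensors of shape [T])."""
    T = rewards.shape[0]
    running = values[-1].detach()          # bootstrap from the critic at the horizon
    returns = []
    for t in reversed(range(T)):           # discounted returns along the imagined rollout
        running = rewards[t] + discount * running
        returns.append(running)
    returns = torch.stack(list(reversed(returns)))

    advantages = (returns - values).detach()            # critic as baseline
    actor_loss = -(log_probs * advantages).mean()
    if entropy is not None:                              # fixed-entropy regularisation (Section 2.4)
        actor_loss = actor_loss - entropy_scale * entropy.mean()
    critic_loss = 0.5 * ((values - returns.detach()) ** 2).mean()
    return actor_loss, critic_loss
```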
2.3 Augmented replay buffer
The augmented replay buffer (Figure 2) comprises a short-term FIFO buffer (as used in DreamerV3) and a long-term global distribution matching buffer. They are equally sized and used in parallel, and data from both are uniformly sampled for each training minibatch (S.Algorithm LABEL:alg:combined_buffers). A key objective is to minimise memory requirements and hence the size of the replay buffer. While Dreamer maintained a single buffer containing the last 1 million (1,000,000) observations, we empirically chose a size of 262,000 observations for each buffer, resulting in an augmented buffer that is significantly smaller (roughly half the size), without noticeably affecting performance. We also introduced spliced rollouts, a simple and sometimes necessary alternative to storing entire episodes, enabling smaller buffer sizes.
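The following is a minimal sketch of how a minibatch can be drawn from the two buffers. The function name, the even 50/50 mix, and the list-of-chunks buffer representation are assumptions for illustration; the exact sampling scheme is given in S.Algorithm LABEL:alg:combined_buffers.

```python
import random

def sample_minibatch(fifo_buffer, ltdm_buffer, batch_size, seq_len):
    """Draw a minibatch of training sequences, mixing the short-term FIFO
    buffer and the long-term distribution-matching (LTDM) buffer.
    Each buffer is assumed to be a list of spliced rollout chunks (length 512)."""
    batch = []
    for i in range(batch_size):
        source = fifo_buffer if i % 2 == 0 else ltdm_buffer  # even mix of the two buffers
        chunk = random.choice(source)                        # uniform over stored chunks
        start = random.randint(0, len(chunk) - seq_len)      # random window within the chunk
        batch.append(chunk[start:start + seq_len])
    return batch
```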

Short-term FIFO buffer
The FIFO buffer has a capacity of 262,000 samples and contains the most recent rollout observations. It allows the world model to train on all incoming experiences and biases the training samples toward those collected more recently, thus improving convergence on the current task.
Long-term global distribution matching (LTDM) buffer
Matching the global training distribution, even with limited capacity in the replay buffer, can reduce catastrophic forgetting [11]. Therefore, we used a long-term global distribution matching buffer, also with a capacity of 262,000 samples. It contains a uniform random subset of 512 spliced rollouts. Reservoir sampling was used: each rollout chunk is assigned a random value as its key in a size-limited priority queue, preserving the experiences with the highest key values and discarding the remainder.
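A minimal sketch of this reservoir-sampling buffer is given below. The class name LTDMBuffer and its methods are assumptions for illustration; keeping the chunks with the largest random keys yields a uniform random subset of all chunks seen so far.

```python
import heapq
import random

class LTDMBuffer:
    """Long-term distribution-matching buffer via reservoir sampling (sketch)."""

    def __init__(self, max_chunks=512):
        self.max_chunks = max_chunks
        self.heap = []        # min-heap of (key, insertion_id, chunk)
        self._counter = 0     # tie-breaker so chunks themselves are never compared

    def add(self, chunk):
        key = random.random()                    # random priority key for this chunk
        item = (key, self._counter, chunk)
        self._counter += 1
        if len(self.heap) < self.max_chunks:
            heapq.heappush(self.heap, item)
        elif key > self.heap[0][0]:              # evict the chunk with the smallest key
            heapq.heapreplace(self.heap, item)

    def sample(self):
        return random.choice(self.heap)[2]       # uniform over retained chunks
```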
Spliced rollouts
Restricting the size of the replay buffer can cause it to fill with only a few episodes. Containing too few unique episodes may make the training data highly unrepresentative of the general environment states, reducing the world model’s accuracy and likely causing a subsequent loss of performance from the actor. This is especially relevant for the long-term global distribution matching buffer, which may store relatively few samples from each environment. Therefore, while typical implementations store complete rollouts, we spliced rollouts into smaller chunks of 512 steps. Operating over spliced rollouts rather than entire rollouts guarantees the granularity of data sampling. Remaining partial chunks with fewer than 512 states were concatenated before the next episode, with an appropriate reset flag indicating the start of a new episode. To improve the efficiency of the rollout process, we truncated episodes after a fixed number of steps, so that the final number of environment steps was identical in every training iteration. We found that spliced rollouts were a simple way to control granularity in the distribution matching buffer and did not adversely affect performance.
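The sketch below shows one way rollouts can be spliced into fixed-size chunks, with the leftover tail carried over and concatenated before the next episode; reset flags stored alongside the steps mark episode boundaries. The function name and the (observations, flags) representation are assumptions, not the WMAR API.

```python
def splice_rollout(observations, is_first, chunk_size=512, carry=None):
    """Split one episode into chunks of `chunk_size` steps.

    observations: list of per-step records for the episode.
    is_first:     parallel list of reset flags; True marks an episode start.
    carry:        leftover (obs, flags) shorter than chunk_size from the previous
                  episode, prepended here so no steps are dropped."""
    obs, flags = list(observations), list(is_first)
    if carry is not None:
        prev_obs, prev_flags = carry
        obs, flags = prev_obs + obs, prev_flags + flags

    chunks = []
    for start in range(0, len(obs), chunk_size):
        piece = obs[start:start + chunk_size]
        piece_flags = flags[start:start + chunk_size]
        if len(piece) == chunk_size:
            chunks.append((piece, piece_flags))
        else:
            return chunks, (piece, piece_flags)   # remainder waits for the next episode
    return chunks, None
```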
2.4 Task-agnostic exploration
In tasks without shared structure, there are significant differences between environments in terms of task dynamics, visuals, and reward magnitudes, which poses a significant challenge for CL, especially without task IDs. Exploring new environments is difficult, as policies already trained on previous tasks may lack the randomness required to adequately explore the state space of a new task. We addressed this challenge using fixed-entropy regularisation (as in DreamerV3) for training the world model. In addition, during actor-critic training we used predetermined reward scales for individual environments, based on single-task baselines. This mitigated the exploration challenge without adding an exploration-orientated learning system such as Plan2Explore [22].
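A static, linear reward scale can be applied at data-collection time, for example with an environment wrapper. The sketch below uses a Gymnasium RewardWrapper as an assumed interface, and the scale value is a placeholder rather than one of the scales used in the paper (those are listed in S.Table LABEL:tab:atari_rew_scales).

```python
import gymnasium as gym

class ScaledReward(gym.RewardWrapper):
    """Applies a predetermined, static, linear reward scale for one environment."""

    def __init__(self, env, scale):
        super().__init__(env)
        self.scale = scale          # chosen from the single-task baseline's return magnitude

    def reward(self, reward):
        return self.scale * reward  # keeps returns across tasks at comparable magnitudes
```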
3 Experiments
3.1 Dataset
We evaluated WMAR on a set of challenging OpenAI Procgen and Atari environments [3, 1]. These are commonly used RL benchmarks, can be extremely similar in the shared-structure case (Procgen) or dissimilar where there is no shared structure (Atari), and were computationally feasible on the available hardware. We define shared structure as tasks that have similar environment states and observations, action dynamics, and rewards. We applied visual perturbations to create differences between tasks. To evaluate performance on tasks without shared structure, we selected four Atari tasks.
3.2 Baselines
To measure the efficacy of the augmented replay buffer (i.e., adding a long-term distribution matching buffer with spliced rollouts), we compared WMAR and DreamerV3 with an equal memory allowance for replay buffers. For the most direct comparison, we implemented our own version of DreamerV3, referred to as DV3′, and augmented its replay buffer to create WMAR. To validate DV3′, we compared it to an open-source implementation of DreamerV3 (https://github.com/danijar/dreamerv3); see S.LABEL:app:validation.
We also ran single-task baselines: WMAR and a random agent. The single-task baselines were used for the evaluation metrics.
3.3 Evaluation metrics
We evaluated agents using task reward and, following Kessler et al. [12], extended the evaluation to forgetting and forward transfer. Forgetting informs us about backward transfer and stability, and forward transfer informs us about plasticity.
We evaluated performance by normalising episodic reward to two single-task baselines: WMAR and a random agent. The normalised comparison allowed us to measure the relative performance, providing a natural evaluation of forgetting and forward transfer for all environments in each CL suite.
3.3.1 Normalised rewards
We define an ordered suite of $N$ tasks $T_1, \dots, T_N$. Performance (average episodic reward) on task $i$ after $s$ environment steps is denoted $p_i(s)$ in single-task experiments and $p_i^{CL}(s)$ in CL. Agents were trained on each task for $S$ environment steps in both single-task and CL experiments. For each task $i$, we calculated the normalised reward using the average episodic rewards of the single-task random policy, $p_i^{rand}$, and of the single-task trained agent after $S$ environment steps, $p_i(S)$:
$$\hat{p}_i(s) = \frac{p_i^{CL}(s) - p_i^{rand}}{p_i(S) - p_i^{rand}} \qquad (1)$$
A normalised score of 0 corresponds to random performance and a score of 1 corresponds to the performance when trained on only that task (training on only one task is usually much easier than learning the same task in a continual learning setting).
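For concreteness, the normalisation can be computed as follows; the function name and argument names are illustrative only.

```python
def normalised_reward(reward_cl, reward_random, reward_single_task):
    """Eq. (1): maps the random policy's reward to 0 and the single-task
    baseline's final reward to 1."""
    return (reward_cl - reward_random) / (reward_single_task - reward_random)
```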
3.3.2 Forgetting (Backward transfer)
Forgetting for each task is the difference between its performance immediately after training on that task and its performance at the end of training on all tasks. Average forgetting over all tasks is defined as:
$$F = \frac{1}{N} \sum_{i=1}^{N} \left[ \hat{p}_i(iS) - \hat{p}_i(NS) \right] \qquad (2)$$
where $iS$ is the total number of environment steps at which training on task $i$ finishes (tasks are presented in order, each trained for $S$ steps).
A lower value for forgetting is indicative of improved stability and a better continual learning method. A negative value for forgetting would imply that the agent has managed to gain performance on earlier tasks, thus exhibiting backward transfer.
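A sketch of this computation is shown below, assuming a score matrix built from evaluations taken at each task boundary; the data layout and function name are assumptions for illustration.

```python
def average_forgetting(scores):
    """Sketch of Eq. (2). scores[i][j] is the normalised reward on task i
    evaluated immediately after training on task j finishes (0-indexed),
    so scores[i][-1] is the score at the end of the whole CL run."""
    n = len(scores)
    return sum(scores[i][i] - scores[i][-1] for i in range(n)) / n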
3.3.3 Forward transfer
The forward transfer for a task is the normalised difference between performance in the CL and single-task experiments. The average over all tasks is defined as:
$$FT = \frac{1}{N} \sum_{i=1}^{N} FT_i, \qquad FT_i = \frac{\mathrm{AUC}_i^{CL} - \mathrm{AUC}_i}{1 - \mathrm{AUC}_i} \qquad (3)$$
where
$$\mathrm{AUC}_i^{CL} = \frac{1}{S} \int_{(i-1)S}^{iS} \hat{p}_i(s) \, ds \qquad (4)$$
$$\mathrm{AUC}_i = \frac{1}{S} \int_{0}^{S} \frac{p_i(s) - p_i^{rand}}{p_i(S) - p_i^{rand}} \, ds \qquad (5)$$
The larger the forward transfer, the better the continual learning method. A positive value implies effective use of knowledge learnt from previous environments and, as a result, accelerated learning in the current environment. When tasks are unrelated, no positive forward transfer is expected; in this case, a forward transfer of 0 represents optimal plasticity, and negative values indicate that previous tasks create a barrier to learning newer tasks.
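The per-task computation can be sketched as below, approximating the integrals in Eqs. (4) and (5) by the mean of normalised-reward curves sampled at regular intervals; the curve representation and function name are assumptions for illustration.

```python
def forward_transfer(cl_curve, single_curve):
    """Sketch of Eqs. (3)-(5) for one task: each curve lists the normalised reward
    sampled at regular intervals over the S steps spent training on that task
    (in the CL run and the single-task run respectively)."""
    auc_cl = sum(cl_curve) / len(cl_curve)               # approximates Eq. (4)
    auc_single = sum(single_curve) / len(single_curve)   # approximates Eq. (5)
    return (auc_cl - auc_single) / (1.0 - auc_single)    # per-task term of Eq. (3)

# Average forward transfer is the mean of forward_transfer over all tasks.
```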
3.4 Experimental setup
All WMAR experiments were run on a single Nvidia A40 40GB GPU, with every single-task benchmark completing within 0.25 days and every continual learning benchmark within 1 day of wall time, making these experiments widely accessible and reproducible across research labs. For more details, see S.LABEL:app:extime.
4 Results
4.1 Tasks without shared structure – Atari
We chose a subset of Atari environments in which the agent could achieve reasonable performance with less training, and followed [15] in using sticky actions. The environments were presented in a fixed, arbitrarily chosen order, as testing every permutation would have been prohibitively expensive.
Normalised performance is plotted in Figure 3, and single-task results used for reward scaling are given in S.LABEL:app:baseline. The predetermined reward scales are shown in S.Table LABEL:tab:atari_rew_scales. WMAR was capable of continual RL, whereas it is clear that DV3′ was not; it only performed well on a task while actively training on it.
Forgetting (Backward transfer)
DV3′ struggled to retain performance, losing nearly all knowledge of previous tasks after learning each new one. In contrast, WMAR maintained much of its performance on previously learnt tasks, demonstrating significantly improved stability. This highlights the importance of the distribution matching buffer in preventing forgetting, especially when tasks differ greatly. Even though the probability of drawing samples from the first task decreased to 12.5% by the end of training (the FIFO buffer then contains only recent tasks, while the distribution matching buffer holds roughly a quarter of its samples from each of the four tasks, so an even mix of the two buffers gives 0.5 × 0.25 = 12.5%), WMAR retained much of its performance on the early tasks. However, this was not universal; for instance, ‘Crazy Climber’ showed substantial forgetting and the worst overall performance, even with the buffer in place.
Forward transfer
In general, WMAR maintained its ability to learn new tasks, reaching close to single-task performance on all of them. The one exception was the 3rd task, ‘Crazy Climber’, which reached approximately half the normalised performance of earlier tasks. DV3′, on the other hand, learnt slightly less effectively on each subsequent task. However, DV3′ did not suffer on ‘Crazy Climber’, and on the 4th and last task, ‘Frostbite’, its performance was well above single-task performance. As a result, the average forward transfer for DV3′ was higher than for WMAR.


Model | Avg. forgetting (lower is better) | Avg. fwd. transfer (higher is better)
---|---|---
WMAR | |
DV3′ | |
4.2 Tasks with shared structure – CoinRun
Normalised performance is shown in Figure 4, and single-task results used for reward scaling are given in S.LABEL:app:baseline. The terminology for CoinRun environments is: NB = no background, RT = restricted themes, and MA = monochrome assets.
Although the CoinRun environments have identical mechanics, they exhibit both subtle and substantial visual differences, posing significant challenges. Nevertheless, WMAR could learn continually and displayed forward and backward transfer of skills between environments. DV3′ showed similar characteristics, but its performance dropped on the first task, ‘CoinRun’, after training on it (forgetting), and on the last task before training on it (lacking forward transfer). The two methods had approximately equal peak performance on each task; however, WMAR was noticeably more consistent than DV3′. Overall, WMAR showed more consistent performance across tasks, with fewer fluctuations during training, which may be a desirable property in practical applications.
Forgetting (Backward transfer)
WMAR exhibited desirable forgetting characteristics (backward transfer), whereas DV3′ displayed very little backward transfer; see Table 2.
Forward transfer
Both WMAR and DV3′ had good forward transfer properties (Table 2); the augmented buffer did not make a noticeable difference. We found significant forward transfer on all tasks, greatly reducing the time required to reach a given performance. This is notable because the tasks were constructed so that later task levels (aspects of a task) could differ from earlier tasks, making generalisation more challenging.


Model | Avg. forgetting (lower is better) | Avg. fwd. transfer (higher is better)
---|---|---
WMAR | |
DV3′ | |
5 Discussion
WMAR had superior performance with the same memory requirements compared to DV3′. On tasks without shared structure, WMAR exhibited almost no forgetting, while DV3′ exhibited almost complete forgetting. On tasks with shared structure, WMAR showed noticeably reduced forgetting. In addition, WMAR was able to learn multiple tasks with highly varying reward magnitudes, without task IDs, which is an important and difficult challenge. As RL algorithms often consume substantial compute, we emphasise the benefit of lower computational and memory costs. Most RL is based on model-free approaches, where experience replay is used to improve learning of the policy. Our results confirm that replay to a world model in a model-based approach is also a strong foundation for continual RL.
Our methods are largely orthogonal to previous state-of-the-art approaches to combating catastrophic forgetting, such as EWC and P&C, which operate over network parameters, and CLEAR, which uses replay but typically operates within model-free approaches and uses behaviour cloning of the policy from environment input to action output.
Generalisation
Forward and backward transfer were far better for tasks with shared structure than for those without. We hypothesise that this was because the world model’s convolutional feature extractor could benefit from commonality in the visual domain only. It may have suffered from more abstract changes, such as inverting the colours or permuting the controls between tasks, leading to decreased overall performance.
Reward scaling without task id’s
While WMAR was sufficiently robust to learn multiple tasks without forgetting even when reward scales were somewhat different, we found that this property did not hold when reward scales differed greatly, in which case only the task with the higher reward scale was learnt. Approximate reward scaling allowed the agent to learn multiple tasks in this situation. We hypothesise that the differing rewards and subsequent returns cause poorly scaled advantages when training the actor, resulting in the actor only learning the tasks with the highest returns. Experiments with automatic scaling of advantages through non-linear squashing transformations hurt learning on individual tasks, so the static, linear reward transformation was used.
Memory capacity
Despite the benefits and improved memory efficiency of WMAR, a key limitation of any buffer-based method is its finite capacity. As more tasks are encountered and previous tasks are not revisited, an increasing number of samples from previous tasks will inevitably be lost, leading to forgetting.
5.1 Limitations and future work
This study provides a strong justification for future work; however, the experiments were limited in scope. They could be expanded (e.g., as in [5, 12]) by increasing the length of task sequences, increasing the number of environment steps, testing on more environments, and comparing to additional reference models.
Testing all task permutations is prohibitively expensive; therefore, we used a fixed task order. However, task ordering can result in significant differences in CL results (e.g., Appendix G in [20]). Future work could randomise the order (as a proxy for testing all permutations) over a larger number of seeds.
Another area for future work is to improve WMAR by combining it with existing techniques such as the behaviour cloning used in CLEAR. Such an adaptation could counter shifts in the latent distribution that occur as the world model continues to train while the actor is frozen.
6 Conclusions
We extended a well-known model-based world model architecture, DreamerV3, with an augmented replay buffer, and applied it to the problem of continual RL in two scenarios: tasks with and without shared structure, i.e. commonalities between tasks that could be leveraged by an agent. WMAR and DV3’ were set with the same memory budget (same-sized buffers) and were compared. Performance was evaluated for forward and backward transfer, in addition to the common practice of measuring only forgetting. We found that model-based agents are capable of continual learning on both task types. The augmented replay buffer of WMAR conferred a minor benefit in tasks with shared structure and substantial improvements in tasks without shared structure. The results suggest that model-based RL using a world model with a memory-efficient replay buffer can be an effective and practical approach to continual RL, justifying future work.
References
- [1] Bellemare, M.G., Naddaf, Y., Veness, J., Bowling, M.: The Arcade Learning Environment: An Evaluation Platform for General Agents. Journal of Artificial Intelligence Research 47, 253–279 (Jun 2013)
- [2] Chen, Z., Liu, B.: Continual Learning and Catastrophic Forgetting. In: Lifelong Machine Learning, pp. 55–75. Springer International Publishing, Cham (2018)
- [3] Cobbe, K., Hesse, C., Hilton, J., Schulman, J.: Leveraging Procedural Generation to Benchmark Reinforcement Learning. In: Proceedings of the 37th International Conference on Machine Learning. pp. 2048–2056. PMLR (Nov 2020), iSSN: 2640-3498
- [4] Ha, D., Schmidhuber, J.: World Models. arXiv preprint arXiv:1803.10122 (Mar 2018), arXiv:1803.10122 [cs, stat]
- [5] Hafner, D., Lillicrap, T., Ba, J., Norouzi, M.: Dream to Control: Learning Behaviors by Latent Imagination (Mar 2020), arXiv:1912.01603 [cs]
- [6] Hafner, D., Lillicrap, T., Fischer, I., Villegas, R., Ha, D., Lee, H., Davidson, J.: Learning Latent Dynamics for Planning from Pixels. In: Chaudhuri, K., Salakhutdinov, R. (eds.) Proceedings of the 36th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 97, pp. 2555–2565. PMLR (Jun 2019)
- [7] Hafner, D., Lillicrap, T., Norouzi, M., Ba, J.: Mastering Atari with Discrete World Models (Feb 2022), arXiv:2010.02193 [cs, stat]
- [8] Hafner, D., Pasukonis, J., Ba, J., Lillicrap, T.: Mastering Diverse Domains through World Models (Jan 2023), arXiv:2301.04104 [cs, stat]
- [9] Hassabis, D., Kumaran, D., Summerfield, C., Botvinick, M.: Neuroscience-Inspired Artificial Intelligence. Neuron 95(2), 245–258 (Jul 2017), publisher: Elsevier
- [10] Huang, Y., Xie, K., Bharadhwaj, H., Shkurti, F.: Continual model-based reinforcement learning with hypernetworks. In: 2021 IEEE International Conference on Robotics and Automation (ICRA). pp. 799–805 (2021). https://doi.org/10.1109/ICRA48506.2021.9560793
- [11] Isele, D., Cosgun, A.: Selective Experience Replay for Lifelong Learning. Proceedings of the AAAI Conference on Artificial Intelligence 32(1) (Apr 2018), number: 1
- [12] Kessler, S., Ostaszewski, M., Bortkiewicz, M., Żarski, M., Wolczyk, M., Parker-Holder, J., Roberts, S.J., Miłoś, P.: The Effectiveness of World Models for Continual Reinforcement Learning. In: Proceedings of The 2nd Conference on Lifelong Learning Agents. pp. 184–204. PMLR (Nov 2023), ISSN: 2640-3498
- [13] Khetarpal, K., Riemer, M., Rish, I., Precup, D.: Towards Continual Reinforcement Learning: A Review and Perspectives. Journal of Artificial Intelligence Research 75, 1401–1476 (Dec 2022)
- [14] Lipton, Z.C., Azizzadenesheli, K., Kumar, A., Li, L., Gao, J., Deng, L.: Combating Reinforcement Learning’s Sisyphean Curse with Intrinsic Fear (Mar 2018), arXiv:1611.01211 [cs, stat]
- [15] Machado, M.C., Bellemare, M.G., Talvitie, E., Veness, J., Hausknecht, M., Bowling, M.: Revisiting the Arcade Learning Environment: Evaluation Protocols and Open Problems for General Agents. Journal of Artificial Intelligence Research 61, 523–562 (Mar 2018)
- [16] Mathis, M.W.: The neocortical column as a universal template for perception and world-model learning. Nature Reviews Neuroscience 24(1), 3–3 (Jan 2023), number: 1 Publisher: Nature Publishing Group
- [17] McCloskey, M., Cohen, N.J.: Catastrophic Interference in Connectionist Networks: The Sequential Learning Problem. In: Bower, G.H. (ed.) Psychology of Learning and Motivation, vol. 24, pp. 109–165. Academic Press (Jan 1989)
- [18] Nagabandi, A., Finn, C., Levine, S.: Deep online learning via meta-learning: Continual adaptation for model-based rl. arXiv preprint arXiv:1812.07671 (2018)
- [19] OpenAI, Berner, C., Brockman, G., Chan, B., Cheung, V., Dębiak, P., Dennison, C., Farhi, D., Fischer, Q., Hashme, S., Hesse, C., Józefowicz, R., Gray, S., Olsson, C., Pachocki, J., Petrov, M., Pinto, H.P.d.O., Raiman, J., Salimans, T., Schlatter, J., Schneider, J., Sidor, S., Sutskever, I., Tang, J., Wolski, F., Zhang, S.: Dota 2 with Large Scale Deep Reinforcement Learning (Dec 2019)
- [20] Rahimi-Kalahroudi, A., Rajendran, J., Momennejad, I., van Seijen, H., Chandar, S.: Replay Buffer with Local Forgetting for Adapting to Local Environment Changes in Deep Model-Based Reinforcement Learning (Mar 2023)
- [21] Riemer, M., Cases, I., Ajemian, R., Liu, M., Rish, I., Tu, Y., Tesauro, G.: Learning to Learn without Forgetting by Maximizing Transfer and Minimizing Interference (May 2019), arXiv:1810.11910 [cs, stat]
- [22] Sekar, R., Rybkin, O., Daniilidis, K., Abbeel, P., Hafner, D., Pathak, D.: Planning to Explore via Self-Supervised World Models. In: Proceedings of the 37th International Conference on Machine Learning. pp. 8583–8592. PMLR (Nov 2020), iSSN: 2640-3498
- [23] Williams, R.J.: Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning 8, 229–256 (1992), publisher: Springer