On robustness of generative representations against catastrophic forgetting
Abstract
Catastrophic forgetting of previously learned knowledge while learning new tasks is a widely observed limitation of contemporary neural networks. Although many continual learning methods are proposed to mitigate this drawback, the main question remains unanswered: what is the root cause of catastrophic forgetting? In this work, we aim at answering this question by posing and validating a set of research hypotheses related to the specificity of representations built internally by neural models. More specifically, we design a set of empirical evaluations that compare the robustness of representations in discriminative and generative models against catastrophic forgetting. We observe that representations learned by discriminative models are more prone to catastrophic forgetting than their generative counterparts, which sheds new light on the advantages of developing generative models for continual learning. Finally, our work opens new research pathways and possibilities to adopt generative models in continual learning beyond mere replay mechanisms.
1 Introduction
Neural networks are widely used across many real-life applications, ranging from image recognition [20] to natural language processing [24]. Nevertheless, neural models used in those applications assume independent and identically distributed training data - an assumption rarely met in practice. As a result, contemporary neural network models are prone to catastrophic forgetting [5] - a well-known limitation of neural networks that results in the erosion of previously learned knowledge. Continual learning is a field of machine learning that aims at addressing this pitfall of neural models through constant adaptation to new data. The majority of works in this field focus on developing methods for mitigating the effects of catastrophic forgetting [16]. These methods can be grouped into three categories – regularization-based [31, 8], rehearsal methods [25, 2, 15, 4, 22, 19], and methods using dynamic architectures [12, 11, 6, 10, 27, 21, 30]. Nevertheless, recent findings [17, 23] show that it is possible to surpass well-established continual learning methods across popular benchmarks with heuristic-based baselines. The surprising effectiveness of these baselines indicates that the root cause of catastrophic forgetting is yet to be discovered, and we follow this intuition in our work.

We investigate catastrophic forgetting with methodological rigor: we state and empirically validate research hypotheses that shed new light on this phenomenon. We build upon the works of [3, 18], where the roots of forgetting are analyzed, and the findings of [23, 26], which analyze the effectiveness of intuitive solutions to this problem. Contrary to previous works, we postulate to look at continual learning from the perspective of the internal representations of neural networks and analyze their impact on the final performance of the continually learned model. Inspired by [23], we argue that the dynamics of catastrophic forgetting depend on the task at hand; yet contrary to that work, we analyze this observation from the perspective of internal neural representations rather than the peculiarities of continual learning tasks.
To summarize, the main contribution of our work is the statement of the following research hypotheses, along with their empirical validation:
Hypothesis 1 – Representations learned by autoencoders and variational autoencoders are less prone to catastrophic forgetting than representations of discriminative models.
Hypothesis 2 – Autoencoders and variational autoencoders learn more transferable features than discriminative models.
Moreover, the results from our experiments provide an explanation for the effect observed in [23], where the authors argue that continual reconstruction tasks do not suffer from catastrophic forgetting. Our experiments show that this observation can be explained from the perspective of generative representations. (Throughout this work, we refer to the hidden representations of autoencoders or variational autoencoders as generative representations. Similarly, we refer to the reconstruction task as the generative task. Although autoencoders are not strictly generative models, we use this term to highlight the difference in the objective of the particular model: in the reconstruction task the aim is to generate a sample, while in classification the aim is to discriminate between samples.) This suggests that the lack of catastrophic forgetting is not exclusively linked to the continual reconstruction task.
Last but not least, our experiments show that it is possible to achieve an average accuracy above 90% on popular continual learning benchmarks such as MNIST and FashionMNIST without any specific mechanism to overcome catastrophic forgetting. This raises the question of whether these datasets and the currently used evaluation protocols should still be used for benchmarking novel continual learning methods, as they no longer pose a significant challenge (see the results in Table 1).
2 Related works
Following [16], continual learning methods can be grouped into three categories: regularization, dynamic architectures, and rehearsal.
Regularization
Methods in this group add a regularization term to the training objective that penalizes changes to parameters deemed important for previously learned tasks [31, 8].
Dynamic architectures
Here, the solution comprises many substructures, which are usually trained in isolation for specific tasks. On top of these structures, a separate mechanism decides which substructure to use during evaluation. In [21, 30, 29], new structural elements are added to the model for each new task, so the memory footprint of the whole model grows with the number of tasks. In [12, 11, 6, 10, 27], a single large model is considered, from which independent submodels are selected for subsequent tasks.
Rehearsal
Methods in this group try to prevent forgetting by retraining the model on a combination of new and previous data examples. The methods presented in [19, 1] employ a memory buffer to store all, or the most relevant, previous data examples. To overcome the scalability issues of the standard buffer approach, the authors of [22] propose to replace the buffer with a generative model. This method introduced a general scheme that was further extended in several approaches based on various generative models [25, 2, 15, 4].
While most works develop new methods that are more robust to catastrophic forgetting, only a few investigate the phenomenon itself [3, 14, 13, 18]. Specifically, in [18] the authors analyze the relationship between the semantic similarity of tasks and the magnitude of catastrophic forgetting. In [3], the authors show that the effect of catastrophic forgetting diminishes with prolonged exposure to the domain, which suggests that catastrophic forgetting could be an artifact of immature systems. Another approach [14] investigates the problem with the tools of Explainable Artificial Intelligence (XAI), comparing the effects of catastrophic forgetting across different layers of a CNN.
Our work is directly influenced by Thai et al. [23], who claim that networks trained on continual reconstruction tasks do not suffer from catastrophic forgetting. We show that the different dynamics of forgetting in reconstruction and classification tasks are linked to the representations of the data constructed by the neural networks. Specifically, we argue that learning representations through generative modeling is naturally better aligned with continual learning.
3 Methodology
To examine and compare representations from different models, we need a fair method to train these models and collect their respective representations. To that end, we consider three types of models: a discriminative model (D), a generative model based on an autoencoder (AE), and a generative model based on a variational autoencoder [7] (VAE). To make the comparison fair, these networks share the same architecture, except for the last layer, which is defined by the model's objective. In the case of the discriminative task, the last layer has one output neuron per class to discriminate between the classes. In the generative task, the final layer outputs vectors of the same size as the input data. Depending on the model, we train the networks with different objective functions: D is trained with cross-entropy, AE minimizes the MSE, and VAE optimizes the ELBO. As shown in Fig. 2, we train all models on the same training sequence but with different objectives. The index of a particular task is denoted by i, where i = 1, ..., N. To obtain the representations of data for a particular model, we feed the data to the model and collect the activations from its penultimate layer. We use A_i^j to denote the activations of a model after finishing task j for input data from task i.
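For concreteness, the three objectives can be written in their standard forms; the symbols below (x, x̂, y, ŷ, q_φ, p_θ) are our own notation and are not taken from the original text:

$$\mathcal{L}_{D} = -\sum_{c} y_{c}\,\log \hat{y}_{c}, \qquad \mathcal{L}_{AE} = \lVert x - \hat{x}\rVert_2^{2}, \qquad \mathcal{L}_{VAE} = -\,\mathbb{E}_{q_\phi(z\mid x)}\!\left[\log p_\theta(x\mid z)\right] + D_{\mathrm{KL}}\!\left(q_\phi(z\mid x)\,\Vert\,p(z)\right),$$

i.e., D minimizes the cross-entropy between labels y and predictions ŷ, AE minimizes the squared reconstruction error, and VAE minimizes the negative ELBO (reconstruction term plus KL divergence to the prior).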

Datasets
We use three commonly used continual learning benchmarks to create continuous sequences of tasks: splitMNIST, splitFashionMNIST, and splitCIFAR. We follow the typical formulation of the continual learning task and split each dataset into five disjoint subsets with two classes per task.
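As an illustration, a minimal sketch of how such a task sequence can be built (assuming PyTorch and torchvision are available; the helper name is ours); splitFashionMNIST and splitCIFAR are constructed analogously:

```python
# Build splitMNIST: five tasks with two classes each (0/1, 2/3, ..., 8/9).
from torch.utils.data import Subset
from torchvision import datasets, transforms

def build_split_mnist(root="./data", train=True):
    base = datasets.MNIST(root, train=train, download=True,
                          transform=transforms.ToTensor())
    tasks = []
    for t in range(5):
        classes = {2 * t, 2 * t + 1}           # task t holds classes {2t, 2t+1}
        idx = [i for i, y in enumerate(base.targets.tolist()) if y in classes]
        tasks.append(Subset(base, idx))
    return tasks                               # list of five disjoint datasets

train_tasks = build_split_mnist(train=True)
val_tasks = build_split_mnist(train=False)
```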
Architectures
We use a simple autoencoder architecture composed of three fully connected encoding layers and three fully connected decoding layers for all tasks, with ReLU activations after every layer except the last. For the generative task, the output layer is a single fully connected layer that maps the penultimate layer's activations to a vector of the same size as the input data. For the classification task, we use a multi-head output layer with two neurons per task. For the MNIST and FashionMNIST datasets, we use an architecture with (512, 256, 8, 256, 512) neurons in the hidden layers. For the CIFAR dataset, we use the same architecture with a bottleneck of 128 neurons.
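The sketch below shows one way to read this description (class and function names are our own, not the paper's): both variants share the (512, 256, 8, 256, 512) fully connected backbone and differ only in the output layer.

```python
# Shared backbone with a reconstruction head (AE) or a multi-head classifier (D).
import torch.nn as nn

def backbone(in_dim=784, bottleneck=8):
    sizes = [in_dim, 512, 256, bottleneck, 256, 512]
    layers = []
    for a, b in zip(sizes[:-1], sizes[1:]):
        layers += [nn.Linear(a, b), nn.ReLU()]
    return nn.Sequential(*layers)              # outputs penultimate activations

class AEModel(nn.Module):
    def __init__(self, in_dim=784, bottleneck=8):
        super().__init__()
        self.backbone = backbone(in_dim, bottleneck)
        self.head = nn.Linear(512, in_dim)     # reconstruction head, no ReLU

    def forward(self, x):
        return self.head(self.backbone(x.flatten(1)))

class DiscriminativeModel(nn.Module):
    def __init__(self, in_dim=784, bottleneck=8, n_tasks=5):
        super().__init__()
        self.backbone = backbone(in_dim, bottleneck)
        self.heads = nn.ModuleList([nn.Linear(512, 2) for _ in range(n_tasks)])

    def forward(self, x, task_id):
        return self.heads[task_id](self.backbone(x.flatten(1)))
```

A VAE variant would additionally split the bottleneck into mean and log-variance heads and sample the latent code; we omit it here for brevity.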
3.1 Hypothesis 1
3.1.1 Discriminative abilities of representations
First, to test our hypothesis, we look at the problem of catastrophic forgetting from the accuracy perspective, as accuracy is the most widely adopted measure of the catastrophic forgetting phenomenon. To measure the robustness of representations to catastrophic forgetting, we train the models D, AE, and VAE on identical data sequences and collect their representations from the penultimate layer for all data splits. These representations are used to train a set of linear classifiers. Although we collect the representations for all data splits throughout the training, we train each classifier only when its corresponding task is active. This means that during the first task we train only the first classifier, and the rest remain untrained. After finishing the first task, we freeze the first classifier for the rest of the training. This way, any degradation of its performance can be attributed exclusively to forgetting the representations of the first task, and we can directly measure the amount of forgetting in the model.
More precisely, after each epoch of training on task i, for data of the form (X_i, Y_i), where X_i denotes the respective data split, we collect the representations A_i^i of model D from its penultimate layer for X_i and train a softmax regression classifier c_i on a dataset of the form (A_i^i, Y_i), where i denotes the present task. Next, the trained softmax classifiers are evaluated on the validation dataset after extracting features with the respective backbones. Note that in the above approach, classifiers c_j for j > i remain untrained. Since classifier c_i is frozen after completing the i-th task, any loss of its performance may only be attributed to the drift of the representations from the penultimate layer. Therefore, the smaller the loss of the classifier's performance, the more resilient the features are to catastrophic forgetting. The same procedure is applied to the models AE and VAE, as shown in Fig. 2.
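A minimal sketch of this probing protocol, under our reading and reusing the hypothetical model classes from the architecture sketch above (which expose a backbone attribute):

```python
# Train a softmax probe on penultimate-layer features of its task, then freeze
# it and track how its accuracy degrades as the backbone trains on later tasks.
import torch
import torch.nn as nn

@torch.no_grad()
def collect_representations(model, loader, device="cpu"):
    feats, labels = [], []
    for x, y in loader:
        feats.append(model.backbone(x.flatten(1).to(device)).cpu())
        labels.append(y)
    return torch.cat(feats), torch.cat(labels)

def train_softmax_probe(feats, labels, n_classes=2, epochs=20, lr=1e-2):
    # For 2-way probes, labels are assumed to be remapped to {0, 1} within the
    # task (e.g., y % 2 under the class split used in the dataset sketch).
    probe = nn.Linear(feats.shape[1], n_classes)
    opt = torch.optim.SGD(probe.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        nn.functional.cross_entropy(probe(feats), labels).backward()
        opt.step()
    for p in probe.parameters():               # frozen once its task is over
        p.requires_grad_(False)
    return probe

@torch.no_grad()
def probe_accuracy(probe, feats, labels):
    return (probe(feats).argmax(dim=1) == labels).float().mean().item()
```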

Fig. 3 depicts the results of this experiment. Columns represent the results for D, AE, and VAE, respectively. Starting from the top, the rows present the results for MNIST, FashionMNIST, and CIFAR. Different colors denote the accuracy on different tasks. As one can see in the left column of Fig. 3, for all three benchmarks, the performance of classifiers trained on representations of the discriminative model suddenly drops after new tasks are introduced and degrades further to the region of random guessing (depicted as a dashed line). In contrast, for both generative models the performance is stable throughout the whole training sequence, suggesting that the representations of the tested generative models are almost immune to catastrophic forgetting, especially in the case of simpler datasets such as MNIST and FashionMNIST. In the case of CIFAR-10, the amount of forgetting is considerable for the AE model. However, the degradation of performance is more gradual than in the case of the discriminative model. This suggests that forgetting in generative models has a different nature than forgetting in discriminative models. The right column shows that the representations of the VAE model suffer the least in the case of CIFAR-10. Explaining this difference requires further analysis, and we foresee it as future work.
Surprisingly, on MNIST and FashionMNIST it is possible to achieve an average accuracy above 90% without any dedicated mechanism to overcome catastrophic forgetting. This raises the question of whether these datasets and evaluation protocols should be used to benchmark novel methods in continual learning.
The above experiment supports the first hypothesis through the lens of accuracy, the most widely adopted measure of catastrophic forgetting. However, such an approach may not be fully informative, as generative representations can have poor discriminative abilities. To address this limitation, we propose to measure catastrophic forgetting with an index of representation similarity.
3.1.2 Centered Kernel Alignment (CKA)
To further examine the validity of our hypothesis that representations learned by generative models are less susceptible to catastrophic forgetting, we investigate the evolution of network representations over time. Since the previous experiment is directly linked to the discriminative abilities of the representations, here we use a task-agnostic measure to directly estimate the drift of the representations during continual training. For that purpose, we use the well-established Centered Kernel Alignment (CKA) [9], defined as:
$$\mathrm{CKA}(K, L) = \frac{\mathrm{HSIC}(K, L)}{\sqrt{\mathrm{HSIC}(K, K)\,\mathrm{HSIC}(L, L)}} \qquad (1)$$
where K = XX^T and L = YY^T are the Gram matrices of the two sets of representations X and Y, and HSIC denotes the Hilbert-Schmidt Independence Criterion. CKA takes values from 0 to 1, where 1 means identical representations. Using CKA, it is possible to estimate the similarity of representations obtained for different datasets or of different dimensionality. We measure the similarity between representations of the same network collected at different moments of its training. Specifically, to analyze the evolution of representations from model D, we collect the reference representations A_i^i just after finishing task i. Then, to measure the relative drift of the representations of task i, we compute the CKA index between the reference representations A_i^i and the current representations A_i^j, for j > i. We follow this procedure for each task in the training sequence. The analogous procedure is applied to the representations of the models AE and VAE.
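A minimal sketch of the linear variant of CKA following Kornblith et al. [9] (function names are ours); X and Y hold representations of the same examples, one row per sample:

```python
# Linear CKA: centered Gram matrices compared with a normalized HSIC ratio.
import numpy as np

def center_gram(gram):
    n = gram.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    return centering @ gram @ centering

def linear_cka(X, Y):
    K = center_gram(X @ X.T)                   # K = X X^T
    L = center_gram(Y @ Y.T)                   # L = Y Y^T
    hsic_kl = (K * L).sum()                    # unnormalized HSIC(K, L)
    return hsic_kl / np.sqrt((K * K).sum() * (L * L).sum())

# Example: drift of task-1 representations between two training checkpoints,
# e.g. linear_cka(A_1_after_task_1, A_1_after_task_3) with hypothetical arrays.
```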

Fig. 4 presents the results of the experiment. In the case of discriminative representations (top row), we can see an abrupt change of representations right after introducing a new task on all tested datasets. This is in line with the results of the previous experiment, which show that most of the performance is lost during the next task. In the following tasks, the representations stabilize, but they are no longer valuable, since classifying with them is not better than random guessing (see Fig. 3). The results from the middle row of Fig. 4 show that the representations of the AE model evolve only slightly during the training on consecutive tasks and remain similar to the reference representations. These results directly support Hypothesis 1. The representations of the VAE model change significantly during the first two tasks of the simpler datasets (MNIST and FashionMNIST). This indicates that it takes more time for the VAE model to learn stable features. On CIFAR-10, the amount of forgetting of the VAE representations is almost equal to that of the AE representations, and in both cases it is almost negligible. This may suggest that the more complex the data is, the more general the features learned by the generative models are.
Additionally, the dynamics of change are less chaotic for the generative models, as the similarity of representations degrades monotonically with each task. This is in stark contrast to the discriminative representations, which evolve chaotically. For instance, on the FashionMNIST dataset (middle column, top row), the representations of the first task change drastically during the second and third tasks, then stabilize on the fourth task, and unexpectedly become more similar during the last task. This may suggest that although forgetting occurs in the generative case, it has a gradual form and is more predictable than in the discriminative case.
3.2 Hypothesis 2
To validate the second hypothesis, which states that autoencoders and variational autoencoders learn more transferable features than discriminative models, we change the experimental protocol and train the models D, AE, and VAE only on the first task of the learning sequence. We adopt this approach as we are interested in measuring the transferability of the learned features to future tasks.
After the training on the first task is finished, we follow the procedure visualized in Fig. 2 and collect representations for all data splits. Next, we train all classifiers on their respective data splits and evaluate them on the validation datasets. Because the backbone of the neural network is trained only on the first task, we thereby measure the transferability of the features learned during the first task to the other tasks (the results of this experiment are presented in Table 1). To gain further insights, we also train a single classifier on all collected representations; this way, for each dataset, the task is a full 10-way classification. The results of this experiment tell us how general and useful the features learned during the first task are for classifying the whole dataset. They are reported in Table 2.
Table 1: Average accuracy (± std) of per-task classifiers trained on representations extracted from a backbone trained only on the first task.

| Model | MNIST | FashionMNIST | CIFAR-10 |
|---|---|---|---|
| AE | 96.8 ± 5.4 | 92.1 ± 7.2 | 80.5 ± 2.0 |
| VAE | 88.9 ± 6.7 | 94.2 ± 5.5 | 78.9 ± 3.0 |
Table 2: Average accuracy (± std) of a single 10-way classifier trained on representations of all data splits, with the backbone trained only on the first task.

| Model | MNIST | FashionMNIST | CIFAR-10 |
|---|---|---|---|
| AE | 82.9 ± 2.4 | 74.6 ± 1.6 | 44.4 ± 1.0 |
| VAE | 58.6 ± 5.2 | 61.1 ± 0.9 | 37.4 ± 0.2 |
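The 10-way probe behind Table 2 can be sketched as follows, reusing the hypothetical collect_representations, train_softmax_probe, and probe_accuracy helpers from the probing sketch in Section 3.1.1; the only change is that all splits are pooled and the probe sees the full label space.

```python
# Pool representations from every split (original 0-9 labels) and fit one probe.
import torch

def pooled_representations(model, loaders):
    feats, labels = zip(*(collect_representations(model, dl) for dl in loaders))
    return torch.cat(feats), torch.cat(labels)

def ten_way_transfer(model, train_loaders, val_loaders):
    tr_f, tr_y = pooled_representations(model, train_loaders)
    va_f, va_y = pooled_representations(model, val_loaders)
    probe = train_softmax_probe(tr_f, tr_y, n_classes=10)
    return probe_accuracy(probe, va_f, va_y)
```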
The superiority of generative representations in terms of transferability is visible on all considered benchmarks. Although classifiers learned on representations from the VAE model perform worse than classifiers trained on AE representations, both approaches obtain significantly better results than classifiers trained on representations of the D model. In the case of the AE model, the classifiers reach a nearly perfect average accuracy of 96.8% and 92.1% on MNIST and FashionMNIST, respectively. The results of the classifiers trained on the joint datasets (Table 2) are even more surprising. It turns out that the representations learned by the AE model on one task generalize well beyond that task, as evidenced by average accuracies of 82.9%, 74.6%, and 44.4% in the 10-way classification tasks on MNIST, FashionMNIST, and CIFAR-10, respectively. On the other hand, the poor performance of classifiers trained on the representations learned by the discriminative model confirms that representations learned this way do not generalize beyond the current task. This is expected, as the discriminative model aims to find a set of features that merely separates the current data, resulting in features that are useless for upcoming tasks. We postulate that this poor transferability of features is the main reason for catastrophic forgetting in discriminative models.
Guided by the above results, we design an additional experiment in which we analyze how training on the first task impacts the reconstruction quality on the other tasks. To that end, we carry out experiments only with the generative models AE and VAE. During training on the first task, we evaluate the model on the reconstruction task with data samples from the other tasks. In contrast to the previous experiments, we do not perform any additional finetuning before evaluating the model. To evaluate the model on a different task, we feed the data from that task and compute the corresponding reconstruction loss.
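A sketch of this evaluation loop under the same assumptions as the earlier sketches (the model is the hypothetical AEModel trained only on task 1; val_loaders holds validation loaders for all tasks):

```python
# Zero-shot reconstruction evaluation: no finetuning on the evaluated tasks.
import torch
import torch.nn.functional as F

@torch.no_grad()
def reconstruction_loss(model, loader, device="cpu"):
    total, count = 0.0, 0
    for x, _ in loader:
        x = x.flatten(1).to(device)
        total += F.mse_loss(model(x), x, reduction="sum").item()
        count += x.numel()
    return total / count                        # mean per-pixel squared error

def evaluate_all_tasks(model, val_loaders):
    # one loss per task; only task 1 data has ever been used for training
    return [reconstruction_loss(model, dl) for dl in val_loaders]
```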

Fig. 5 presents the results of this experiment. The two rows present the results for the AE and VAE models, and the columns correspond to MNIST, FashionMNIST, and CIFAR-10, respectively. In each plot, the blue curve represents the loss on the first task, the only one the model is trained on. All plots share the same interesting tendency: the reconstruction loss on all tasks decreases together with the loss on the first task. This shows that the unvisited tasks directly benefit from the training on the first task. In other words, the representations learned during the first task are useful for the following tasks.
On MNIST and FashionMNIST, there is a significant gap between the loss on the current task (blue curve) and the other tasks. The gap is larger for the VAE model, which is in line with the results from the previous experiments suggesting that VAE-based models need more complex datasets to develop general features. This is clearly visible in the rightmost column, presenting the results for CIFAR-10, where the gap does not exist. This may seem counter-intuitive, as CIFAR-10 is the most challenging dataset in this group. However, the more complex the training data is, the more general the features learned by the generative model are.
With these experiments, we provide an intuitive explanation of the phenomenon observed by Thai et al. [23], where the authors suggest that the continual reconstruction task does not suffer from catastrophic forgetting. We argue that this is due to the generality of the features learned by the generative model during the first task: as can be observed in Fig. 5, the features learned by the autoencoder on the first task are also useful for the upcoming tasks, so the model directly benefits from previously learned features, there is no interference between tasks, and catastrophic forgetting does not occur.
4 Discussion
In this work, we stated two hypotheses concerning different types of representations and their impact on catastrophic forgetting, and we carried out experiments to empirically validate them. The experiments show that the dynamics of catastrophic forgetting differ between discriminative and generative representations. In particular, the results in Fig. 4 show that the forgetting of generative representations is gradual and monotonic. These properties suggest that it might be possible to model the effect of catastrophic forgetting in generative models precisely. To do so, however, one has to measure the effects of catastrophic forgetting exactly, and the proxy tasks currently used for that purpose, such as classification, hinder the analysis. Since it becomes increasingly evident that catastrophic forgetting is not a homogeneous phenomenon and that its character depends on the task or learning paradigm, we foresee the need for task-agnostic measures of catastrophic forgetting. Such work would greatly benefit the community and give unprecedented insights into the nature of the phenomenon.
From a practical point of view, this paper offers a new understanding of the learning dynamics of autoencoders and variational autoencoders in the continual scenario and of their ability to generalize the obtained knowledge to future tasks. These results can be directly adapted to the currently used generative rehearsal methods, making them more robust or even immune to catastrophic forgetting.
Last, this work is a first step towards a deeper understanding of the similarities and dissimilarities of generative and discriminative representations. While several papers discuss this relation, e.g., [28], many questions remain unanswered, especially in the context of continual learning. How can one use the mechanisms of generative learning to obtain better representations for discriminative tasks in continual learning? Is the generic nature of generative representations the only explanation for the lack of catastrophic forgetting? We leave answering these and other questions as future work.
References
- [1] Aljundi, R., Belilovsky, E., Tuytelaars, T., Charlin, L., Caccia, M., Lin, M., Page-Caccia, L.: Online Continual Learning with Maximally Interfered Retrieval. In: NeurIPS (2019)
- [2] Caccia, L., Belilovsky, E., Caccia, M., Pineau, J.: Online Learned Continual Compression with Adaptive Quantization Modules. In: ICML (2020)
- [3] Davidson, G., Mozer, M.C.: Sequential mastery of multiple visual tasks: Networks naturally learn to learn and forget to forget. In: CVPR (2020)
- [4] Deja, K., Wawrzyński, P., Marczak, D., Masarczyk, W., Trzciński, T.: Binplay: A binary latent autoencoder for generative replay continual learning. In: IJCNN (2021)
- [5] French, R.M.: Catastrophic forgetting in connectionist networks. Trends in cognitive sciences (1999)
- [6] Golkar, S., Kagan, M., Cho, K.: Continual Learning via Neural Pruning. In: Neuro AI. Workshop at NeurIPS (2019)
- [7] Kingma, D.P., Welling, M.: Auto-Encoding Variational Bayes. In: ICLR (2014)
- [8] Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A.A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., Hassabis, D., Clopath, C., Kumaran, D., Hadsell, R.: Overcoming catastrophic forgetting in neural networks. PNAS (2017)
- [9] Kornblith, S., Norouzi, M., Lee, H., Hinton, G.: Similarity of neural network representations revisited. In: ICML (2019)
- [10] Mallya, A., Davis, D., Lazebnik, S.: Piggyback: Adapting a Single Network to Multiple Tasks by Learning to Mask Weights. In: ECCV (2018)
- [11] Mallya, A., Lazebnik, S.: Packnet: Adding Multiple Tasks to a Single Network by Iterative Pruning. In: CVPR (2018)
- [12] Masse, N.Y., Grant, G.D., Freedman, D.J.: Alleviating catastrophic forgetting using context-dependent gating and synaptic stabilization. PNAS (2018)
- [13] Nguyen, C.V., Achille, A., Lam, M., Hassner, T., Mahadevan, V., Soatto, S.: Toward understanding catastrophic forgetting in continual learning. arXiv (2019)
- [14] Nguyen, G., Chen, S., Do, T., Jun, T.J., Choi, H., Kim, D.: Dissecting catastrophic forgetting in continual learning by deep visualization. arXiv (2020)
- [15] von Oswald, J., Henning, C., Sacramento, J., Grewe, B.F.: Continual learning with hypernetworks. arXiv preprint arXiv:1906.00695 (2019)
- [16] Parisi, G.I., Kemker, R., Part, J.L., Kanan, C., Wermter, S.: Continual lifelong learning with neural networks: A review. Neural Networks (2019)
- [17] Prabhu, A., Torr, P., Dokania, P.: Gdumb: A simple approach that questions our progress in continual learning. In: ECCV (2020)
- [18] Ramasesh, V.V., Dyer, E., Raghu, M.: Anatomy of catastrophic forgetting: Hidden representations and task semantics. In: ICLR (2021)
- [19] Rolnick, D., Ahuja, A., Schwarz, J., Lillicrap, T., Wayne, G.: Experience Replay for Continual Learning. In: NeurIPS (2019)
- [20] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: ImageNet Large Scale Visual Recognition Challenge. IJCV (2015)
- [21] Rusu, A.A., Rabinowitz, N.C., Desjardins, G., Soyer, H., Kirkpatrick, J., Kavukcuoglu, K., Pascanu, R., Hadsell, R.: Progressive Neural Networks (2016), arXiv:1606.04671
- [22] Shin, H., Lee, J.K., Kim, J., Kim, J.: Continual Learning with Deep Generative Replay. In: NeurIPS (2017)
- [23] Thai, A., Stojanov, S., Rehg, I., Rehg, J.M.: Does continual learning = catastrophic forgetting? arXiv (2021)
- [24] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L.u., Polosukhin, I.: Attention is all you need. In: NeurIPS (2017)
- [25] van de Ven, G.M., Tolias, A.S.: Generative replay with feedback connections as a general strategy for continual learning (2018), arXiv:1809.10635
- [26] Verwimp, E., Lange, M.D., Tuytelaars, T.: Rehearsal revealed: The limits and merits of revisiting samples in continual learning (2021)
- [27] Wortsman, M., Ramanujan, V., Liu, R., Kembhavi, A., Rastegari, M., Yosinski, J., Farhadi, A.: Supermasks in Superposition. In: NeurIPS (2020)
- [28] Wu, Y.N., Gao, R., Han, T., Zhu, S.C.: A tale of three probabilistic families: Discriminative, descriptive and generative models (2018)
- [29] Xu, J., Zhu, Z.: Reinforced Continual Learning. In: NeurIPS (2018)
- [30] Yoon, J., Yang, E., Lee, J., Hwang, S.J.: Lifelong Learning with Dynamically Expandable Networks. In: ICLR (2018)
- [31] Zenke, F., Poole, B., Ganguli, S.: Continual Learning Through Synaptic Intelligence. In: ICML (2017)