D-CBRS: Accounting For Intra-Class Diversity in Continual Learning

Abstract

Continual learning – accumulating knowledge from a sequence of learning experiences – is an important yet challenging problem. In this paradigm, the model’s performance for previously encountered instances may substantially drop as additional data are seen. When dealing with class-imbalanced data, forgetting is further exacerbated. Prior work has proposed replay-based approaches which aim at reducing forgetting by intelligently storing instances for future replay. Although Class-Balancing Reservoir Sampling (CBRS) has been successful in dealing with imbalanced data, the intra-class diversity has not been accounted for, implicitly assuming that each instance of a class is equally informative. We present Diverse-CBRS (D-CBRS), an algorithm that allows us to consider within class diversity when storing instances in the memory. Our results show that D-CBRS outperforms state-of-the-art memory management continual learning algorithms on data sets with considerable intra-class diversity.

Index Terms— Continual learning, lifelong learning, catastrophic forgetting, class-incremental learning

1 Introduction

Continual learning examines the problem of learning from data streams by accommodating new knowledge while retaining previously learned experiences [1]. In such an incremental learning setting, a model is supposed to learn new tasks, domains, or classes with incoming data. For example, in class-incremental learning, analogous to human experience, incoming streams continuously introduce new classes (i.e., knowledge) that are expected to be learned [2, 3].

Yet, unsurprisingly, continual learning leads to forgetting: as we encounter new classes, the model’s performance may substantially degrade regarding previously acquired patterns [4, 5, 6]. Current approaches that seek to reduce catastrophic forgetting for class-incremental learning primarily include rehearsal-based approaches [7, 8] which store samples that belong to previously observed classes and replay them while receiving a data stream and regularization-based approaches [9, 10] which introduce regularization terms to loss functions to emphasize existing classes.

Most rehearsal-based methods use either ring buffer [7] or reservoir sampling [11] as memory management. Because these methods rely on storing and replaying previously seen data, when dealing with imbalanced datasets, the replay is overwhelmed by the “overrepresented” classes. Thus, it exacerbates catastrophic forgetting even further [12].

Recently, a novel approach was proposed for managing memory in the presence of imbalanced data, named as Class-Balancing Reservoir Sampling (CBRS) [13]. In a nutshell, CBRS keeps the memory balanced in terms of the number of instances per class. In this way, CBRS alleviates the possible forgetting issues regarding underrepresented classes.

We posit that catastrophic forgetting, in addition to class imbalance, can also be substantially exacerbated by overlooking the intra-class diversity. While CBRS keeps the classes in memory balanced, it does not account for the intra-class diversity because it treats every data instance as equally informative or representative of the respective class. Hence, building upon CBRS, our motivation lies in finding a new way to incorporate the intra-class diversity when storing incoming data in the memory. Our proposed method, Diverse Class-Balancing Reservoir Sampling (D-CBRS), in addition to maintaining the class balance, also maximizes the intra-class diversity by storing more diverse instances in the memory. We conduct experiments to compare D-CBRS with state-of-the-art memory management approaches on two different datasets. In our experimental setup, the results show that D-CBRS outperforms the other approaches. We further discuss the limitations of our approach, and propose future improvements. Our findings indicate that accounting for the intra-class diversity presents a new research direction in the continual learning space.

Refer to caption — Fig. 1: (a) Sample of imbalanced and intra-class diverse incoming data distribution with 5 classes. Cluster 1 and cluster 2 indicate the intra-class diversity for each class (i.e., each class consists of two different clusters of instances). Reservoir (b) and CBRS (c) store substantially less instances of the minority clusters in memory compared to our proposed D-CBRS (d).

2 PROPOSED METHOD AND RELATION TO PRIOR WORK

The proposed method builds upon CBRS. For this reason, we first describe CBRS, its notation, and how it achieves class balance. Subsequently, we delve into D-CBRS, show where it differs from CBRS, and how it preserves the within-class diversity. Finally, we delineate the steps for training a model with replay under class incremental learning.

2.1 CBRS

CBRS is an algorithm that determines which data instances will be stored in the memory for future replay. While storing the data samples, it also seeks to keep the classes in the memory as balanced as possible.

Notation. The size of a given memory is noted with m. A filled memory refers to all m spots being occupied with data instances. Once the memory is filled, CBRS labels the largest class in the memory (in terms of the number of instances) as a full class. Once a class is labeled as full, it remains so.

Algorithm. CBRS is characterized by two distinct stages. The first stage lasts until memory is filled. During the first stage, all the incoming instances are stored in the memory. In the second stage, which starts after the memory is filled, when a data instance $(x_{i},y_{i})$ is encountered, CBRS first checks whether the instance is a member of a full class. If so, the instance will be replaced by one of the other instances which belongs to class $y_{i}$ with probability $\frac{m_{c}}{n_{c}}$ , where $m_{c}$ is the number of instances of class $y_{i}$ in memory, $n_{c}$ is the total number of instances of class $y_{i}$ received so far. If the instance does not fall into full classes, the new instance is stored in the memory by removing randomly an instance from a full class.

2.2 D-CBRS

As mentioned earlier, CBRS does not account for the within-class diversity as it treats every data instance as equally relevant and informative of the class. To address this problem, our proposed method aims to incorporate information regarding diversity within classes. The main idea behind D-CRBS is that it clusters instances within each class and keeps track of the number of instances of each cluster within each class. In this way, it ensures that instances from smaller clusters, which often hold rich information for inter-class discriminatory purposes, will be stored. Consequently, the proposed approach alleviates the forgetting problem since it will keep such rare instances in the memory. Figure 1 illustrates this point.

Algorithm. Similar to CBRS, initially D-CBRS stores all instances until the memory is filled. The difference between CBRS and D-CBRS starts in the second stage where the removal of instances from the memory occurs in CBRS. When the memory is filled, an incoming instance $(x_{i},y_{i})$ can either belong to a full class or not. For the first case, if $y_{i}$ is not one of the full classes, D-CBRS stores the sample by replacing with a randomly picked instance from the largest cluster of any of the full classes in the memory. In the second case, when $y_{i}$ is a member of a full class, D-CBRS first checks if the incoming instance belongs to the largest cluster of that class $y_{i}$ . If so, the new instance takes a randomly chosen instance’s place in that cluster with probability $\frac{m_{c}}{n_{c}}$ , where $m_{c}$ is the number of instances of the cluster in memory and $n_{c}$ is the total number of instances of the cluster encountered up to that point. Otherwise, it will be replaced by a randomly selected instance from the largest cluster of class $y_{i}$ . The pseudo-code for D-CBRS can be found in Algorithm 1. Note that in our experiments, we used the K-means clustering algorithm [14, 15] since its implementation is straightforward and it works reasonably well for MNIST and F-MNIST datasets that we will use later. However, D-CBRS can work with other clustering algorithms, including spectral clustering [16].

input : data stream arriving sequentially

{(x_{i},y_{i})}^{n}_{i=1}

output : updated memory

3for $i=1,\ldots,n$ do

4 if memory is not filled then

5 store (

x_{i},y_{i}

)

6 else

7 if $y_{i}$ is not a full class then

8 cluster a randomly picked full class choose an instance randomly from largest cluster of the full class and replace by (

x_{i},y_{i}

)

9 else

10 cluster full class

y_{i}

(x_{p},y_{p})\leftarrow\text{pick an instance randomly}

12 from largest cluster of the full class

13 if $x_{i}$ belongs to largest cluster then

m_{c}\leftarrow\text{number of currently}

15 stored instances belongs to

16 the largest cluster of

y_{i}

n_{c}\leftarrow\text{number of instances}

18 seen so far belongs to

19 the largest cluster of

y_{i}

u\leftarrow{Uniform(0,1)}

21 if $u\leq m_{c}/n_{c}$ then

22 replace (

x_{p},y_{p}

) by (

x_{i},y_{i}

)

23 else

24 replace (

x_{p},y_{p}

) by (

x_{i},y_{i}

)

Algorithm 1 D-CBRS

2.3 Training with replay

In class-incremental scenarios, streams of instances from each class arrive sequentially and each class appears at a time only. Assume that $S$ is the set of streams $\{s_{1},s_{2},...s_{n}\}$ , where $s_{i}$ consists of class $i$ instances. The model will be trained from $s_{1}$ to $s_{n}$ . The model is then expected to remember all classes.

In rehearsal approaches, it is assumed that there is a limited memory space to store samples during training and the model is allowed to remember previously seen classes, along with replaying the memory for subsequent training steps.

For the training phase, we follow the same steps described in [7, 17], where CBRS originated from. Implementation of the algorithm is as follows: each incoming stream is split to batches. Then each batch is concatenated with a sample batch stemming from the memory. The obtained batch is used to train the model. We repeat the concatenation step $n_{b}$ times, for each of which we concatenate the same incoming batch with a different sampled batch from the memory. Thus, the model is updated $n_{b}$ times when a new batch stream is received. This type of replay reduces catastrophic forgetting further because the model sees both the new batch multiple times and random batches from the memory in each update step. Lastly, the memory will be updated according to incoming stream instances.

Once the training phase is completed, we evaluate the performance of the model on an unseen test set, which consists of all the classes of the original dataset.

3 EXPERIMENTS

In this section, we describe our experimental setup and show the results of our algorithm and its comparison with existing memory management algorithms for rehearsal-based continual learning scenarios. Following CBRS, we use MNIST [18] and Fashion-MNIST [19] datasets in this section.

3.1 Experimental Setup

3.1.1 Memory Management Approaches

We compare D-CBRS with standard memory management algorithms for rehearsal-based approaches including reservoir sampling (RS) and class-balancing reservoir sampling (CBRS). Since CBRS outperforms Gradient-Space Sampling (GSS) [17], another memory management approach, we do not include GSS in our experiments. Although RS is also surpassed by CBRS, we include it in our comparisons as CBRS has evolved from it and it provides a useful baseline. Figure 1 shows the sample distribution and memory allocation for each of the memory management approaches.

RS takes an input stream of data, then returns a random subset of instances from it. Each sample is stored with probability $\frac{size}{n}$ , where size is the memory size and n is the total number of observed samples so far.

CBRS follows a similar strategy as that of RS while keeping classes stored in the memory balanced. Detail explanation can be found in Section 2.1.

3.1.2 Simulating Class Imbalance

We followed the technique described in the CBRS paper [13] to generate class imbalance scenarios. As suggested, we start by defining a set of retention factors, i.e., what percentage of instances for each class from the original dataset will be used in the training phase. The used retention factor set for our experiments is as follows:

r=\{0.01,0.05,0.1,0.3,1\}.

(1)

Second, we randomly choose a retention factor from $r$ for each class in the dataset without replacement. If the number of classes is greater than the defined retention factor set size, once all the instances of the retention set are used we reinitialize our retention factor set as $r$ . We ensure that identical class-imbalanced datasets are used for evaluating all memory management methods.

3.1.3 Simulating Intra-class Diversity

To see the effect of the intra-class diversity on the memory management methods, we simulate the diversity by randomly grouping together original classes of the original dataset to form more diverse classes. For example, given a dataset with $10$ classes and our target number of classes $5$ , we create a new class by combining randomly two classes and updating the instances of those classes by labeling the new class. When we do this for all classes, we will have 5 new classes and each one of them will consist of instances that belong to two old classes. In this way, we make sure that there will be a sufficient level of diversity within the newly-created classes.

Each independent run may produce a different dataset depending on which old classes are combined together. Once sampled, all the methods are evaluated on the same dataset for a single run.

3.1.4 Models and Hyperparameters

Since the two studied datasets are simple enough not to require a more complex model, and also to compare the results better with CBRS, we used the exact same model: a multi-layer perceptron (MLP) with 2 hidden layers, 250 neurons per hidden layer, and the ReLU activation function.

We use Adam [20] optimizer with a learning rate of $0.001$ and cross-entropy loss. Following CBRS, we assign batch size as $10$ and the number of steps per batch as $5$ . We fix memory size as $m=500$ samples in our experiments.

4 Results

In this section, we present the results of our approach in three different setups.

4.1 Base Case – Class Imbalance Only

In the first setup, we simulated only class imbalance in the dataset as described in Section 3.1.2. Since the two datasets are relatively simple to learn and do not have substantial intra-class diversity, this served as a check on whether D-CBRS achieves the same accuracy as CBRS when only class imbalance exists in the dataset. As expected, our results from 5 different random runs show that D-CBRS achieves approximately similar accuracy as CBRS on both MNIST and F-MNIST data sets (See Table 1 (a)).

4.2 Omniscient Case – Class Imbalance & Intra-Class Diversity With Labels

In the second case, we ran the algorithms on imbalanced and diverse datasets (simulated as described in Section 3.1.3). To understand the maximum accuracy that D-CBRS would achieve if it could perfectly predict diversity of data instances, we allowed D-CBRS access to actual class labels. For example, in the MNIST case, if class $A$ was new class from digits 2 and 9, D-CBRS would be aware of the labels of the two composing classes. In this way, we were able to disentangle the accuracy of the clustering algorithm from the accuracy of our proposed method. The results (Table 1 (b)) show (1) how CBRS suffers when intra-class diversity is present and (2) the substantial improvement in terms of accuracy that D-CBRS would be capable of if it could query an oracle regarding intra-class diversity. In other words, this experiment shows the significance of leveraging powerful clustering algorithms in our framework.

Method	Dataset
	Imbalanced (a)		Imbalanced & Div. (b)
	MNIST	F-MNIST	MNIST	F-MNIST
Reservoir	66.9 ± 3.8	62.3 ± 2.2	66.3 ± 1.2	64.7 ± 2.2
CBRS	82.8 ± 3.5	75.0 ± 2.1	76.8 ± 2.3	70.3 ± 1.9
D-CBRS	80.6 ± 3.2	73.1 ± 3.1	84.6 ± 0.4	76.9 ± 1.2

Table 1: Results of 5 streams with 95% confidence interval of the accuracy on the test set for (a) Base Case and (b) Omniscient Case with 5 total resulting merged classes.

4.3 Realistic Case – Class Imbalance & Intra-Class Diversity Without Labels

Realistically, D-CBRS has to rely on a diversity scoring algorithm, such as a clustering one, since it cannot have access to a “diversity scoring oracle”. In this case, the performance of D-CBRS is dependent on the performance of the underlying clustering algorithm (e.g., K-means in our experiments). Our results show that D-CBRS outperforms CBRS even with a relatively simple algorithm as K-means; see Table 2. While D-CBRS outperforms CBRS when intra-class diversity is present, it should be noted that it does so at a computational cost – the clustering algorithm’s running complexity. We show intra-class diversity to be an important aspect of continual learning, and suggest that future work should focus on developing algorithms that improve both the performance of the diversity scoring and the computational complexity of D-CBRS.

Method	Imbalanced & Diverse Dataset
	5 Classes		3 Classes
	MNIST	F-MNIST	MNIST	F-MNIST
Reservoir	66.3 ± 1.6	65.5 ± 3.9	73.7 ± 3.8	75.8 ± 5.5
CBRS	75.2 ± 3.0	70.9 ± 3.7	72.7 ± 5.2	72.8 ± 5.4
D-CBRS	79.7 ± 1.1	72.9 ± 3.0	80.6 ± 3.5	78.0 ± 4.5

Table 2: Results of 5 runs with 95% confidence interval of the accuracy on the test set for Realistic Case with (a) 5 and (b) 3 total resulting merged classes.

5 Conclusion

Previous work in continual learning has proposed algorithms for dealing with imbalanced data distributions. In this work, we showed that in addition to class imbalance, intra-class diversity is also a factor that substantially affects performance and the catastrophic forgetting phenomenon. Thus, it needs to be accounted for in the design of memory management algorithms for class-incremental learning settings. We proposed D-CBRS as an approach to consider the intra-class diversity and class imbalance in such settings. Our experimental results showed that D-CBRS outperforms previous approaches when the distribution of the data is imbalanced and diverse. Future research directions involve improving efficiency and accuracy of D-CBRS, exploring more complex data sets, and understanding tradeoffs for varying memory sizes.

References

[1] Z. Chen, B. Liu, R. Brachman, P. Stone, and F. Rossi, Lifelong Machine Learning, Morgan & Claypool Publishers, 2nd edition, 2018.
[2] G. M. van de Ven and A. S. Tolias, “Three scenarios for continual learning,” 2019.
[3] L. Smith and M. Gasser, “The Development of Embodied Cognition: Six Lessons from Babies,” Artificial Life, vol. 11, no. 1-2, pp. 13–29, 01 2005.
[4] R. M. French, “Catastrophic forgetting in connectionist networks,” Trends in Cognitive Sciences, vol. 3, no. 4, pp. 128–135, 1999.
[5] M. McCloskey and N. J. Cohen, “Catastrophic interference in connectionist networks: The sequential learning problem,” Psychology of Learning and Motivation - Advances in Research and Theory, vol. 24, no. C, pp. 109–165, 1989.
[6] I. J. Goodfellow, M. Mirza, D. Xiao, A. Courville, and Y. Bengio, “An empirical investigation of catastrophic forgetting in gradient-based neural networks,” 2015.
[7] A. Chaudhry, M. Rohrbach, M. Elhoseiny, T. Ajanthan, P. Dokania, P. H. S. Torr, and M. Ranzato, “Continual learning with tiny episodic memories,” CoRR, vol. abs/1902.10486, 2019.
[8] R. Aljundi, L. Caccia, E. Belilovsky, M. Caccia, M. Lin, L. Charlin, and T. Tuytelaars, “Online continual learning with maximally interfered retrieval,” CoRR, vol. abs/1908.04742, 2019.
[9] Z. Li and D. Hoiem, “Learning without forgetting,” CoRR, vol. abs/1606.09282, 2016.
[10] J. Kirkpatrick, R. Pascanu, N. C. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, D. Hassabis, C. Clopath, D. Kumaran, and R. Hadsell, “Overcoming catastrophic forgetting in neural networks,” CoRR, vol. abs/1612.00796, 2016.
[11] J. S. Vitter, “Random sampling with a reservoir,” ACM Trans. Math. Softw., vol. 11, no. 1, pp. 37–57, mar 1985.
[12] Y. Wu, Y. Chen, L. Wang, Y. Ye, Z. Liu, Y. Guo, and Y. Fu, “Large scale incremental learning,” 2019.
[13] A. Chrysakis and M. F. Moens, “Online continual learning from imbalanced data,” in Proceedings of the 37th International Conference on Machine Learning, Hal Daumé III and Aarti Singh, Eds. 13–18 Jul 2020, vol. 119 of Proceedings of Machine Learning Research, pp. 1952–1961, PMLR.
[14] S. Lloyd, “Least squares quantization in pcm,” IEEE Transactions on Information Theory, vol. 28, no. 2, pp. 129–137, 1982.
[15] F. Pourkamali-Anaraki and S. Becker, “Preconditioned data sparsification for big data with applications to PCA and K-means,” IEEE Transactions on Information Theory, vol. 63, no. 5, pp. 2954–2974, 2017.
[16] F. Pourkamali-Anaraki, “Scalable spectral clustering with Nyström approximation: Practical and theoretical aspects,” IEEE Open Journal of Signal Processing, vol. 1, pp. 242–256, 2020.
[17] R. Aljundi, M. Lin, B. Goujaud, and Y. Bengio, “Gradient based sample selection for online continual learning,” 2019.
[18] L. Deng, “The mnist database of handwritten digit images for machine learning research,” IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 141–142, 2012.
[19] H. Xiao, K. Rasul, and R. Vollgraf, “Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms,” 2017.
[20] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” 2017.