
Class-Incremental Lifelong Learning in Multi-Label Classification

Kaile Du    Linyan Li    Fan Lyu    Fuyuan Hu    Zhenping Xia    Fenglei Xu
Abstract

Existing studies of class-incremental lifelong learning consider only single-label data, which limits their adaptation to multi-label data. This paper studies Lifelong Multi-Label (LML) classification, which builds an online class-incremental classifier on a sequential multi-label classification data stream. In LML classification, training on data with partial labels may cause more serious catastrophic forgetting of old classes. To solve this problem, we propose an Augmented Graph Convolutional Network (AGCN) with an Augmented Correlation Matrix (ACM) built across sequential partial-label tasks. Results on two benchmarks show that the method is effective for LML classification and for reducing forgetting.


1 Introduction

Class-incremental learning (Rebuffi et al., 2017) constructs a unified, evolvable classifier that learns new classes online from a sequential image data stream and performs multi-class classification over all seen classes. For privacy, storage and computational-efficiency reasons, the training data of old tasks are often unavailable in lifelong learning when new tasks arrive, and the new task data carry labels only for the new classes. Thus, catastrophic forgetting (Kirkpatrick et al., 2017), i.e., training on new tasks may overwrite old knowledge with new knowledge, is the main challenge of Lifelong Single-Label (LSL) classification. To address catastrophic forgetting, state-of-the-art methods for class-incremental lifelong learning can be categorized into regularization-based methods, such as EWC (Kirkpatrick et al., 2017), LwF (Li & Hoiem, 2017) and LIMIT (Zhou et al., 2022), and rehearsal-based methods, such as AGEM (Chaudhry et al., 2019), ER (Rolnick et al., 2019), PRS (Kim et al., 2020) and SCR (Mai et al., 2021). However, most lifelong learning studies consider only single-labelled input data, which significantly limits their practical applicability.

This paper studies how to learn classes sequentially from new LML classification tasks. As shown in Fig. 1, given testing images, an LML model can continuously recognize multiple labels as new classes are learned. LML classification is challenging not only because of catastrophic forgetting, but also because of partial labels in the current task: the training data are annotated only with the current classes, although they may contain objects belonging to past tasks' classes. Few lifelong learning algorithms are designed specifically for LML classification to address this challenge.

Figure 1: The inference of LML classification. Given multi-label images for testing, the model can recognize more labels by learning more incremental classes (labels with predicted probability greater than the threshold 0.7 are output).

Figure 2: The framework of AGCN. The training data for task $t$ are fed into the CNN block, and the graph node embeddings and the ACM are input to the AGCN block. After each task has been trained, we save the expert blocks to provide soft labels during the next task's training.

Inspired by recent research on label relationships in multi-label learning (Chen et al., 2019, 2021), we consider building label relationships across tasks, i.e., a label correlation matrix. However, because of the partial label problem, it is difficult to construct class relationships directly from statistics. We propose AGCN, a novel solution to LML classification that deals with the partial label problem. First, an auto-updated expert network is designed to generate predictions for the old tasks; these predictions, as soft labels, represent the old classes and are used to construct the ACM. Then, the AGCN receives the dynamic ACM and correlates the label spaces of both the old and new tasks, which continually supports the multi-label prediction. Moreover, to further mitigate forgetting of both seen classes and class relationships, a distillation loss and a relationship-preserving loss are designed for class-level and relationship-level forgetting, respectively. We construct two multi-label image datasets, Split-COCO and Split-WIDE, based on MS-COCO and NUS-WIDE, respectively. The results show that our AGCN achieves state-of-the-art performance in LML classification. Our code is available at https://github.com/Kaile-Du/AGCN.

2 Methodology

2.1 Lifelong multi-label learning

In this study, each sample is trained only once, in the form of a data stream. We are given $T$ recognition tasks with training datasets $\{\mathcal{D}^{1}_{\text{trn}},\cdots,\mathcal{D}^{T}_{\text{trn}}\}$ and test datasets $\{\mathcal{D}^{1}_{\text{tst}},\cdots,\mathcal{D}^{T}_{\text{tst}}\}$. The $t$-th task introduces new, task-specific classes to be trained, namely $\mathcal{C}^{t}$. The goal is to build a multi-label classifier that discriminates an increasing number of classes. We denote $\mathcal{C}_{\text{seen}}^{t}=\bigcup_{n=1}^{t}\mathcal{C}^{n}$ as the seen classes at task $t$, where $\mathcal{C}_{\text{seen}}^{t}$ contains the old class set $\mathcal{C}_{\text{seen}}^{t-1}$ and the new class set $\mathcal{C}^{t}$, that is, $\mathcal{C}_{\text{seen}}^{t}=\mathcal{C}_{\text{seen}}^{t-1}\cup\mathcal{C}^{t}$. Note that during the testing phase, the ground-truth labels for LML classification cover all seen classes $\mathcal{C}_{\text{seen}}^{t}$.
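
To make this setup concrete, the following minimal Python sketch (hypothetical helper names, not part of the paper's code) shows how the seen class set $\mathcal{C}_{\text{seen}}^{t}$ grows as tasks arrive:

```python
from typing import List, Set

def seen_classes(task_class_sets: List[Set[int]], t: int) -> Set[int]:
    """C_seen^t = C^1 U ... U C^t for a 1-indexed task id t."""
    seen: Set[int] = set()
    for class_set in task_class_sets[:t]:
        seen |= class_set
    return seen

# Hypothetical toy stream: two tasks of four disjoint classes each.
tasks = [{0, 1, 2, 3}, {4, 5, 6, 7}]
assert seen_classes(tasks, 1) == {0, 1, 2, 3}
assert seen_classes(tasks, 2) == {0, 1, 2, 3, 4, 5, 6, 7}
```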

2.2 Augmented Correlation Matrix

A correlation matrix is often built in multi-label learning (Chen et al., 2019, 2021) to model label relationships. Our Augmented Correlation Matrix (ACM) provides the label relationships among all seen classes $\mathcal{C}^{t}_{\text{seen}}$ and is augmented to capture intra- and inter-task label dependencies. Most existing multi-label learning algorithms (Chen et al., 2019, 2021) construct the label correlation matrix $\mathbf{A}$ from hard label statistics over the class set $\mathcal{C}$: $\mathbf{A}_{ij}=P(\mathcal{C}_{i}\mid\mathcal{C}_{j})|_{i\neq j}$. We construct the ACM $\mathbf{A}^{t}$ for task $t>1$ in an online fashion to approximate these statistics:

$$\mathbf{A}^{t}=\begin{bmatrix}\mathbf{A}^{t-1}&\mathbf{R}^{t}\\ \mathbf{Q}^{t}&\mathbf{B}^{t}\end{bmatrix}=\begin{bmatrix}\text{Old-Old}&\text{Old-New}\\ \text{New-Old}&\text{New-New}\end{bmatrix}, \tag{1}$$

in which the four block matrices $\mathbf{A}^{t-1}$, $\mathbf{B}^{t}$, $\mathbf{R}^{t}$ and $\mathbf{Q}^{t}$ represent the intra- and inter-task label relationships between old and old, new and new, old and new, as well as new and old classes, respectively. For the first task, $\mathbf{A}^{1}=\mathbf{B}^{1}$. For $t>1$, $\mathbf{A}^{t}\in\mathbb{R}^{|\mathcal{C}_{\text{seen}}^{t}|\times|\mathcal{C}_{\text{seen}}^{t}|}$. Note that the block $\mathbf{A}^{t-1}$ (Old-Old) is inherited from the old task, so we focus on how to compute the other three blocks of the ACM.
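
As an illustration of Eq. (1), here is a minimal NumPy sketch (a bookkeeping assumption of ours, not the authors' implementation) that assembles the augmented matrix from its four blocks:

```python
import numpy as np

def assemble_acm(A_prev: np.ndarray, R: np.ndarray,
                 Q: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Stack the blocks of Eq. (1):
    A_prev (Old-Old): (n_old, n_old)   R (Old-New): (n_old, n_new)
    Q      (New-Old): (n_new, n_old)   B (New-New): (n_new, n_new)
    """
    return np.block([[A_prev, R],
                     [Q,      B]])
```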

New-New block ($\mathbf{B}^{t}\in\mathbb{R}^{|\mathcal{C}^{t}|\times|\mathcal{C}^{t}|}$). This block captures the intra-task label relationships among the new classes. The conditional probabilities in $\mathbf{B}^{t}$ are calculated from the hard label statistics of the training dataset, as in common multi-label learning:

$$\mathbf{B}^{t}_{ij}=P(\mathcal{C}^{t}_{i}\in\mathcal{C}^{t}\mid\mathcal{C}^{t}_{j}\in\mathcal{C}^{t})=\frac{N_{ij}}{N_{j}}, \tag{2}$$

where $N_{ij}$ is the number of examples containing both classes $\mathcal{C}^{t}_{i}$ and $\mathcal{C}^{t}_{j}$, and $N_{j}$ is the number of examples containing class $\mathcal{C}^{t}_{j}$. Because the data arrive as an online stream, $N_{ij}$ and $N_{j}$ are accumulated and updated at each step of the training process.
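
A minimal sketch of this online accumulation follows (variable names and the zero-count guard are ours):

```python
import numpy as np

class NewNewBlock:
    """Online accumulation of the hard-label statistics behind Eq. (2)."""

    def __init__(self, n_new: int):
        self.N_ij = np.zeros((n_new, n_new))  # co-occurrence counts of classes i and j
        self.N_j = np.zeros(n_new)            # occurrence counts of class j

    def update(self, y_batch: np.ndarray) -> None:
        """y_batch: (batch_size, n_new) binary labels of the current mini-batch."""
        self.N_ij += y_batch.T @ y_batch
        self.N_j += y_batch.sum(axis=0)

    def matrix(self) -> np.ndarray:
        """B^t with B_ij = N_ij / N_j (column j normalised by the count of class j)."""
        return self.N_ij / np.maximum(self.N_j[None, :], 1.0)
```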

Old-New block ($\mathbf{R}^{t}\in\mathbb{R}^{|\mathcal{C}_{\text{seen}}^{t-1}|\times|\mathcal{C}^{t}|}$). Given an image $\mathbf{x}$, the predicted probability $\hat{z}_{i}$ generated by the expert can be regarded as the soft label of the $i$-th old class (see Eq. (7)). Thus, the product $\hat{z}_{i}y_{j}$ can be regarded as a surrogate for the co-occurrence of ${\mathcal{C}^{t-1}_{\text{seen}}}_{i}$ and $\mathcal{C}^{t}_{j}$, and $\sum_{\mathbf{x}}\hat{z}_{i}y_{j}$ denotes its online mini-batch accumulation. Thus, we have

$$\mathbf{R}^{t}_{ij}=P\left({\mathcal{C}^{t-1}_{\text{seen}}}_{i}\in\mathcal{C}^{t-1}_{\text{seen}}\mid\mathcal{C}^{t}_{j}\in\mathcal{C}^{t}\right)=\frac{\sum_{\mathbf{x}}\hat{z}_{i}\,y_{j}}{N_{j}}. \tag{3}$$

New-Old block ($\mathbf{Q}^{t}\in\mathbb{R}^{|\mathcal{C}^{t}|\times|\mathcal{C}_{\text{seen}}^{t-1}|}$). Based on Bayes' rule, we obtain this block by

$$\mathbf{Q}^{t}_{ji}=P(\mathcal{C}^{t}_{j}\mid{\mathcal{C}^{t-1}_{\text{seen}}}_{i})=\frac{P({\mathcal{C}^{t-1}_{\text{seen}}}_{i}\mid\mathcal{C}^{t}_{j})\,P(\mathcal{C}^{t}_{j})}{P({\mathcal{C}^{t-1}_{\text{seen}}}_{i})}=\frac{\mathbf{R}^{t}_{ij}N_{j}}{\sum_{\mathbf{x}}\hat{z}_{i}}. \tag{4}$$

Finally, the ACM is constructed online from the soft label statistics produced by the auto-updated expert network and the hard label statistics of the training data.
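
The following sketch summarises the online accumulation behind Eqs. (3) and (4); it assumes `z_hat` are the expert's soft labels for old classes and `y` are the hard labels of the current task (names and zero-guarding are ours):

```python
import numpy as np

class InterTaskBlocks:
    """Online accumulation behind Eq. (3) (Old-New) and Eq. (4) (New-Old)."""

    def __init__(self, n_old: int, n_new: int):
        self.soft_cooc = np.zeros((n_old, n_new))  # running sum of z_hat_i * y_j
        self.soft_count = np.zeros(n_old)          # running sum of z_hat_i
        self.N_j = np.zeros(n_new)                 # hard counts of new classes

    def update(self, z_hat: np.ndarray, y: np.ndarray) -> None:
        """z_hat: (batch, n_old) expert soft labels; y: (batch, n_new) binary labels."""
        self.soft_cooc += z_hat.T @ y
        self.soft_count += z_hat.sum(axis=0)
        self.N_j += y.sum(axis=0)

    def R(self) -> np.ndarray:
        """Eq. (3): R_ij = (sum_x z_hat_i * y_j) / N_j."""
        return self.soft_cooc / np.maximum(self.N_j[None, :], 1.0)

    def Q(self) -> np.ndarray:
        """Eq. (4): Q_ji = R_ij * N_j / sum_x z_hat_i = soft_cooc_ij / sum_x z_hat_i."""
        return self.soft_cooc.T / np.maximum(self.soft_count[None, :], 1e-12)
```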

2.3 Augmented Graph Convolutional Network

The ACM stores auto-updated dependencies among all seen classes. With the established ACM, we can leverage a Graph Convolutional Network (GCN) to assist the CNN prediction, as in Eq. (6). We propose an Augmented Graph Convolutional Network (AGCN) to manage the augmented fully-connected graph. AGCN learns to map this label graph into a set of inter-dependent object classifiers. It is a two-layer stacked graph model, similar to ML-GCN (Chen et al., 2019). Based on the ACM $\mathbf{A}^{t}$, AGCN can capture class-incremental dependencies in an online way. Let the graph nodes be initialized by GloVe embeddings (Pennington et al., 2014), denoted $\mathbf{H}^{t,0}\in\mathbb{R}^{|\mathcal{C}_{\text{seen}}^{t}|\times d}$, where $d$ is the embedding dimensionality. The graph representation $\mathbf{H}^{t}\in\mathbb{R}^{|\mathcal{C}_{\text{seen}}^{t}|\times D}$ in task $t$ is computed by:

$$\mathbf{H}^{t}=\text{AGCN}(\mathbf{A}^{t},\mathbf{H}^{t,0}). \tag{5}$$

As shown in Fig. 2, together with a CNN feature extractor, the multiple labels of an image $\mathbf{x}$ are predicted by

$$\mathbf{\hat{y}}=\sigma\left(\text{AGCN}(\mathbf{A}^{t},\mathbf{H}^{t,0})\otimes\text{CNN}(\mathbf{x})\right), \tag{6}$$

where $\mathbf{A}^{t}$ denotes the ACM and $\mathbf{H}^{t,0}$ is the initialized graph node embedding. The prediction is $\mathbf{\hat{y}}=[\mathbf{\hat{y}}_{\text{old}}~\mathbf{\hat{y}}_{\text{new}}]$, where $\mathbf{\hat{y}}_{\text{old}}\in\mathbb{R}^{|\mathcal{C}_{\text{seen}}^{t-1}|}$ is for the old classes and $\mathbf{\hat{y}}_{\text{new}}\in\mathbb{R}^{|\mathcal{C}^{t}|}$ is for the new classes. We train the current task for classification using the cross-entropy loss.
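
A minimal PyTorch sketch of Eqs. (5) and (6), in the spirit of ML-GCN; the layer widths, the LeakyReLU activation and the absence of adjacency re-normalisation are our assumptions rather than the paper's exact configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AGCNSketch(nn.Module):
    """Two stacked graph convolution layers mapping node embeddings H^{t,0}
    into per-class classifiers, guided by the ACM A^t (Eq. (5))."""

    def __init__(self, emb_dim: int = 300, hidden_dim: int = 1024, feat_dim: int = 2048):
        super().__init__()
        self.gc1 = nn.Linear(emb_dim, hidden_dim, bias=False)
        self.gc2 = nn.Linear(hidden_dim, feat_dim, bias=False)

    def forward(self, A: torch.Tensor, H0: torch.Tensor) -> torch.Tensor:
        # A: (|C_seen|, |C_seen|) ACM, H0: (|C_seen|, emb_dim) GloVe node embeddings
        H = F.leaky_relu(A @ self.gc1(H0), negative_slope=0.2)
        return A @ self.gc2(H)                     # (|C_seen|, feat_dim)

def predict(agcn: AGCNSketch, A: torch.Tensor, H0: torch.Tensor,
            cnn_features: torch.Tensor) -> torch.Tensor:
    """Eq. (6): combine the graph classifiers with CNN features, then apply sigmoid."""
    W = agcn(A, H0)                                # one classifier per seen class
    return torch.sigmoid(cnn_features @ W.t())     # (batch, |C_seen|) probabilities
```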

To mitigate class-level catastrophic forgetting, inspired by distillation-based lifelong learning methods (Li & Hoiem, 2017; Zhou et al., 2022), we construct auto-updated expert networks consisting of CNN$_{\text{xpt}}$ and AGCN$_{\text{xpt}}$. The expert parameters are fixed after each task has been trained and are auto-updated as new tasks are learned. Based on the expert, we construct the distillation loss as

$$\ell_{\text{dst}}(\mathbf{\hat{z}},\mathbf{\hat{y}}_{\text{old}})=-\sum_{i=1}^{|\mathcal{C}_{\text{seen}}^{t-1}|}\left[\hat{z}_{i}\log\left(\hat{y}_{i}\right)+\left(1-\hat{z}_{i}\right)\log\left(1-\hat{y}_{i}\right)\right], \tag{7}$$

where $\mathbf{\hat{z}}=\sigma\left(\text{AGCN}_{\text{xpt}}(\mathbf{A}^{t-1},\mathbf{H}^{t-1,0})\otimes\text{CNN}_{\text{xpt}}(\mathbf{x})\right)$ can be treated as the soft labels representing the prediction on the old classes. The $i$-th element $\hat{z}_{i}$ of $\mathbf{\hat{z}}$ represents the probability that the image $\mathbf{x}$ contains the $i$-th class.
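
A hedged PyTorch sketch of Eq. (7); the clamping and batch averaging are our additions for numerical stability and are not specified in the paper:

```python
import torch

def distillation_loss(z_hat: torch.Tensor, y_hat_old: torch.Tensor,
                      eps: float = 1e-7) -> torch.Tensor:
    """Eq. (7): binary cross-entropy between the expert's soft labels z_hat and the
    current model's old-class predictions y_hat_old, both of shape (batch, n_old)."""
    y_hat_old = y_hat_old.clamp(eps, 1 - eps)
    per_example = -(z_hat * y_hat_old.log()
                    + (1 - z_hat) * (1 - y_hat_old).log()).sum(dim=1)
    return per_example.mean()   # averaged over the mini-batch (our assumption)
```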

To mitigate relationship-level forgetting, we constantly preserve the established relationships across the sequential tasks. The graph node embedding is independent of the label co-occurrence and can be stored as a teacher to avoid forgetting the label relationships. Suppose the learned embedding after task $t$ is stored as $\mathbf{G}^{t}=\text{AGCN}_{\text{xpt}}(\mathbf{A}^{t},\mathbf{H}^{t,0})$, $t>1$. We propose a relationship-preserving loss as a constraint on the class relationships:

$$\ell_{\text{gph}}(\mathbf{G}^{t-1},\mathbf{H}^{t})=\sum^{|\mathcal{C}_{\text{seen}}^{t-1}|}_{i=1}\left\|\mathbf{G}^{t-1}_{i}-\mathbf{H}^{t}_{i}\right\|^{2}. \tag{8}$$

By minimizing $\ell_{\text{gph}}$ with this partial constraint on the old node embeddings, the changes of the AGCN parameters are limited. Thus, the forgetting of the established label relationships is alleviated as LML classification progresses. The final loss for model training is defined as

$$\ell=\lambda_{1}\ell_{\text{cls}}(\mathbf{y},\mathbf{\hat{y}}_{\text{new}})+\lambda_{2}\ell_{\text{dst}}(\mathbf{\hat{z}},\mathbf{\hat{y}}_{\text{old}})+\lambda_{3}\ell_{\text{gph}}(\mathbf{G}^{t-1},\mathbf{H}^{t}), \tag{9}$$

where $\ell_{\text{cls}}$ is the classification loss, $\ell_{\text{dst}}$ is used to mitigate class-level forgetting and $\ell_{\text{gph}}$ is used to reduce relationship-level forgetting. $\lambda_{1}$, $\lambda_{2}$ and $\lambda_{3}$ are the loss weights for $\ell_{\text{cls}}$, $\ell_{\text{dst}}$ and $\ell_{\text{gph}}$. Extensive ablation studies are conducted for $\ell_{\text{gph}}$ after all relationships are built.
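
A sketch of Eqs. (8) and (9); the default loss weights below are the values selected in Sec. 3.4, and the function signatures are our own:

```python
import torch

def relationship_preserving_loss(G_prev: torch.Tensor, H_t: torch.Tensor) -> torch.Tensor:
    """Eq. (8): squared distance between the stored embeddings G^{t-1} of the old
    classes and the first |C_seen^{t-1}| rows of the current embeddings H^t."""
    n_old = G_prev.size(0)
    return ((G_prev - H_t[:n_old]) ** 2).sum()

def total_loss(l_cls: torch.Tensor, l_dst: torch.Tensor, l_gph: torch.Tensor,
               lambdas=(0.07, 0.93, 1e5)) -> torch.Tensor:
    """Eq. (9): weighted sum of the classification, distillation and graph losses."""
    l1, l2, l3 = lambdas
    return l1 * l_cls + l2 * l_dst + l3 * l_gph
```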

3 Experiments

Figure 3: mAP (%) changes on two benchmarks.
Table 1: We report 3 main metrics (%) for LML after the whole data stream is seen once on Split-WIDE and Split-COCO.

Method       | Split-WIDE           | Split-COCO
             | mAP↑   CF1↑   OF1↑   | mAP↑   CF1↑   OF1↑
Multi-Task   | 66.17  61.45  71.57  | 65.85  61.79  66.27
Fine-Tuning  | 20.33  19.10  35.72  |  9.83  10.54  28.83
 Forgetting↓ | 40.85  31.20  15.10  | 58.04  63.54  20.60
EWC          | 22.03  22.78  35.70  | 12.20  12.50  29.67
 Forgetting↓ | 34.86  28.18  15.17  | 45.61  55.44  19.85
LwF          | 29.46  29.64  42.69  | 19.95  21.69  40.68
 Forgetting↓ | 20.26  18.99   5.73  | 41.16  39.85  11.43
AGEM         | 32.47  33.28  38.93  | 23.31  27.25  37.94
 Forgetting↓ | 16.42  15.71   9.73  | 34.52  18.92  12.94
ER           | 34.03  34.94  39.37  | 25.03  30.54  38.38
 Forgetting↓ | 15.15  11.80   8.61  | 33.46  17.28  12.34
PRS          | 37.93  21.12  15.64  | 28.81  18.40  13.86
 Forgetting↓ | 13.59  51.09  62.90  | 30.90  54.36  52.51
SCR          | 35.34  35.47  41.92  | 25.75  30.63  39.10
 Forgetting↓ | 14.26  10.17   8.04  | 32.02  15.98  11.96
AGCN (Ours)  | 41.12  38.27  43.27  | 34.11  35.49  42.37
 Forgetting↓ | 11.22   5.43   4.28  | 23.71  14.79   8.16

Table 2: Ablation studies (%) for the ACM $\mathbf{A}^{t}$ used to model intra- and inter-task label relationships on Split-COCO.

Line | $\mathbf{A}^{t-1}$ & $\mathbf{B}^{t}$ | $\mathbf{R}^{t}$ & $\mathbf{Q}^{t}$ | mAP↑   CF1↑   OF1↑
1    | √                                     | ×                                   | 31.52  30.37  34.87
2    | √                                     | √                                   | 34.11  35.49  42.37

3.1 Dataset construction

Split-COCO. We choose the 40 most frequent concepts from the 80 classes of MS-COCO (Lin et al., 2014) to construct Split-COCO, which has 65,082 examples for training and 27,173 examples for validation. The 40 classes are split into 10 different tasks, and each task contains 4 classes.

Split-WIDE. NUS-WIDE (Chua et al., 2009) is larger in scale than MS-COCO. Following (Jiang & Li, 2017), we choose the 21 most frequent concepts from the 81 classes of NUS-WIDE to construct Split-WIDE, which has 144,858 examples for training and 41,146 examples for validation. We split Split-WIDE into 7 tasks, where each task contains 3 classes.
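
A small sketch of how such class-incremental splits can be generated (the class ordering is an assumption; the paper does not specify it):

```python
from typing import List

def split_into_tasks(class_ids: List[int], classes_per_task: int) -> List[List[int]]:
    """Split an ordered class list into equally sized incremental tasks, e.g. the
    40 Split-COCO classes into 10 tasks of 4, or the 21 Split-WIDE classes into 7 tasks of 3."""
    return [class_ids[i:i + classes_per_task]
            for i in range(0, len(class_ids), classes_per_task)]

assert len(split_into_tasks(list(range(40)), 4)) == 10   # Split-COCO
assert len(split_into_tasks(list(range(21)), 3)) == 7    # Split-WIDE
```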

3.2 Evaluation metrics

Multi-label evaluation. Following traditional multi-label learning (Chen et al., 2019; Kim et al., 2020; Chen et al., 2021), we report the three most important multi-label metrics: mAP, CF1 and OF1. Forgetting measure (Chaudhry et al., 2018). For each of the three multi-label metrics, this score is the difference between the score obtained when the task was first trained and the final score after the whole data stream has been seen.
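
A minimal sketch of the forgetting measure as used here, per task and per metric (the sign convention follows the description above):

```python
def forgetting(score_when_first_trained: float, final_score: float) -> float:
    """Drop of a metric (mAP, CF1 or OF1) on a task between the moment it was
    first trained and after the whole data stream has been seen; lower is better."""
    return score_when_first_trained - final_score

# Hypothetical example: a task reaches 45.0 mAP right after training and 30.0 at the end.
assert forgetting(45.0, 30.0) == 15.0
```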

Table 3: AGCN ablation studies (%) for loss weights and the relationship-preserving loss on Split-COCO.

$\lambda_{1}$  $\lambda_{2}$  $\lambda_{3}$ | mAP↑   CF1↑   OF1↑
0.05           0.95           0             | 29.90  31.80  37.12
 Forgetting↓                                | 29.24  24.88  19.67
0.07           0.93           0             | 30.99  32.03  39.31
 Forgetting↓                                | 28.28  22.55  13.88
0.09           0.91           0             | 29.71  32.71  38.91
 Forgetting↓                                | 29.97  21.79  16.49
0.07           0.93           $10^{4}$      | 33.05  33.31  41.04
 Forgetting↓                                | 26.41  20.99  11.38
0.07           0.93           $10^{5}$      | 34.11  35.49  42.37
 Forgetting↓                                | 23.71  14.79   8.16
0.07           0.93           $10^{6}$      | 33.71  33.05  42.62
 Forgetting↓                                | 25.69  21.30   7.89

Figure 4: ACM visualization on Split-WIDE and Split-COCO.

3.3 Results

Multi-Task is the performance upper bound, and Fine-Tuning is the performance lower bound. In Tab. 1, our method outperforms the other state-of-the-art methods on all three metrics, as well as on the forgetting values evaluated after task $T$. On Split-COCO, AGCN outperforms the best state-of-the-art method, PRS, by a large margin in mAP (34.11% vs. 28.81%). AGCN also performs better than the others on Split-WIDE (41.12% vs. 37.93%), suggesting that it is effective on the large-scale multi-label dataset. Fig. 3 illustrates the mAP changes as tasks are learned on the two benchmarks; the proposed AGCN is better than the other state-of-the-art methods throughout the whole LML process.

The final ACM visualization is shown in Fig. 4. Dependencies between highly correlated classes receive larger weights than those between irrelevant classes, which means the intra- and inter-task relationships can be well constructed even though the old class data are unavailable.

3.4 Ablation studies

ACM effectiveness. In Tab. 2, even without building the relationships across old and new tasks, the performance of AGCN (Line 1) is already better than that of the non-AGCN methods, for example, 31.52% vs. 28.81% in mAP. This means that intra-task label relationships alone are already effective for LML. When the inter-task block matrices $\mathbf{R}^{t}$ and $\mathbf{Q}^{t}$ are also used, AGCN with both intra- and inter-task relationships (Line 2) performs even better on all three metrics, which means the inter-task relationships further enhance multi-label recognition.

Hyperparameter selection. We then analyze the influence of the loss weights and the relationship-preserving loss on Split-COCO, as shown in Tab. 3. With $\lambda_{1}=0.07$ and $\lambda_{2}=0.93$, the performance is better than with the other settings. Adding the relationship-preserving loss $\ell_{\text{gph}}$ brings further gains, which means mitigating the catastrophic forgetting of relationships is crucial for LML classification. We select the best $\lambda_{3}$ as the hyper-parameter, i.e., $\lambda_{3}=10^{5}$, for LML classification.

4 Conclusion

LML classification is a new paradigm of lifelong learning. Its key challenges are constructing label relationships and reducing catastrophic forgetting to improve overall performance. In this paper, a novel AGCN based on an auto-updated expert mechanism is proposed to address these challenges. We construct a label correlation matrix with soft labels generated by an expert network, and we mitigate relationship forgetting with a proposed relationship-preserving loss. In general, AGCN connects previous and current tasks over all seen classes in LML classification. Extensive experiments demonstrate that AGCN captures label dependencies well and effectively mitigates catastrophic forgetting, thereby achieving better classification performance.

References

  • Chaudhry et al. (2018) Chaudhry, A., Dokania, P. K., Ajanthan, T., and Torr, P. H. Riemannian walk for incremental learning: Understanding forgetting and intransigence. In Proceedings of the European Conference on Computer Vision, pp.  532–547, 2018.
  • Chaudhry et al. (2019) Chaudhry, A., Ranzato, M., Rohrbach, M., and Elhoseiny, M. Efficient lifelong learning with A-GEM. In Proceedings of the International Conference on Learning Representations, 2019.
  • Chen et al. (2021) Chen, Z., Wei, X.-S., Wang, P., and Guo, Y. Learning graph convolutional networks for multi-label recognition and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.
  • Chen et al. (2019) Chen, Z.-M., Wei, X.-S., Wang, P., and Guo, Y. Multi-label image recognition with graph convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.  5177–5186, 2019.
  • Chua et al. (2009) Chua, T.-S., Tang, J., Hong, R., Li, H., Luo, Z., and Zheng, Y. NUS-WIDE: A real-world web image database from National University of Singapore. In Proceedings of the ACM International Conference on Image and Video Retrieval, pp.  1–9, 2009.
  • Jiang & Li (2017) Jiang, Q.-Y. and Li, W.-J. Deep cross-modal hashing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.  3232–3240, 2017.
  • Kim et al. (2020) Kim, C. D., Jeong, J., and Kim, G. Imbalanced continual learning with partitioning reservoir sampling. In Proceedings of the European Conference on Computer Vision, pp.  411–428, 2020.
  • Kirkpatrick et al. (2017) Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A. A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526, 2017.
  • Li & Hoiem (2017) Li, Z. and Hoiem, D. Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(12):2935–2947, 2017.
  • Lin et al. (2014) Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision, pp.  740–755, 2014.
  • Mai et al. (2021) Mai, Z., Li, R., Kim, H., and Sanner, S. Supervised contrastive replay: Revisiting the nearest class mean classifier in online class-incremental continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  3589–3599, 2021.
  • Pennington et al. (2014) Pennington, J., Socher, R., and Manning, C. D. GloVe: Global vectors for word representation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp.  1532–1543, 2014.
  • Rebuffi et al. (2017) Rebuffi, S.-A., Kolesnikov, A., Sperl, G., and Lampert, C. H. iCaRL: Incremental classifier and representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.  2001–2010, 2017.
  • Rolnick et al. (2019) Rolnick, D., Ahuja, A., Schwarz, J., Lillicrap, T., and Wayne, G. Experience replay for continual learning. Advances in Neural Information Processing Systems, 32:350–360, 2019.
  • Zhou et al. (2022) Zhou, D.-W., Ye, H.-J., and Zhan, D.-C. Few-shot class-incremental learning by sampling multi-phase tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.