
Class-Incremental Lifelong Learning in Multi-Label Classification

Kaile Du    Linyan Li    Fan Lyu    Fuyuan Hu    Zhenping Xia    Fenglei Xu
Abstract

Existing studies of class-incremental lifelong learning consider only single-label data, which limits their adaptation to multi-label data. This paper studies Lifelong Multi-Label (LML) classification, which builds an online class-incremental classifier on a sequential multi-label classification data stream. In LML classification, training on data with partial labels may cause more serious catastrophic forgetting of old classes. To solve this problem, we propose an Augmented Graph Convolutional Network (AGCN) with an Augmented Correlation Matrix (ACM) built across sequential partial-label tasks. Results on two benchmarks show that the method is effective for LML classification and for reducing forgetting.


1 Introduction

Class-incremental learning (Rebuffi et al., 2017) constructs a unified, evolvable classifier that learns new classes online from a sequential image data stream and performs multi-class classification over all seen classes. For privacy, storage and computational-efficiency reasons, the training data of old tasks are often unavailable in lifelong learning when new tasks arrive, and the new task data carry labels only for the new classes. Thus, catastrophic forgetting (Kirkpatrick et al., 2017), i.e., training on new tasks may overwrite old knowledge with new knowledge, is the main challenge of Lifelong Single-Label (LSL) classification. To address catastrophic forgetting, state-of-the-art methods for class-incremental lifelong learning can be categorized into regularization-based methods, such as EWC (Kirkpatrick et al., 2017), LwF (Li & Hoiem, 2017) and LIMIT (Zhou et al., 2022), and rehearsal-based methods, such as AGEM (Chaudhry et al., 2019), ER (Rolnick et al., 2019), PRS (Kim et al., 2020) and SCR (Mai et al., 2021). However, most lifelong learning studies consider only single-labelled input data, which significantly limits their practical applicability.

This paper studies how to learn classes sequentially from new LML classification tasks. As shown in Fig. 1, given testing images, an LML model can continuously recognize multiple labels as new classes are learned. LML classification is challenging not only because of catastrophic forgetting, but also because of partial labels in the current task: the training data are annotated only with the current classes, although they may contain objects belonging to past tasks' classes. Few lifelong learning algorithms are designed specifically for LML classification to address this challenge.

Figure 1: The inference of LML classification. Given multi-label images for testing, the model can recognize more labels by learning more incremental classes (labels with predicted probability greater than the threshold 0.7 are output).

Figure 2: The framework of AGCN. The training data for task $t$ are fed into the CNN block, and the graph node embeddings and the ACM are input to the AGCN block. After each task has been trained, we save the expert blocks to provide soft labels during the next task's training.

Inspired by recent research on label relationships in multi-label learning (Chen et al., 2019, 2021), we consider building label relationships across tasks, i.e., a label correlation matrix. However, because of the partial label problem, it is difficult to construct class relationships directly from statistics. We propose AGCN, a novel solution to LML classification that deals with the partial label problem. First, an auto-updated expert network is designed to generate predictions for the old tasks; these predictions, as soft labels, represent the old classes and are used to construct the ACM. Then, the AGCN receives the dynamic ACM and correlates the label spaces of both the old and new tasks, which continually supports the multi-label prediction. Moreover, to further mitigate forgetting of both seen classes and class relationships, a distillation loss and a relationship-preserving loss are designed for class-level and relationship-level forgetting, respectively. We construct two multi-label image datasets, Split-COCO and Split-WIDE, based on MS-COCO and NUS-WIDE, respectively. The results show that our AGCN achieves state-of-the-art performance in LML classification. Our code is available at https://github.com/Kaile-Du/AGCN.

2 Methodology

2.1 Lifelong multi-label learning

In this study, each sample is trained only once, in the form of a data stream. We are given $T$ recognition tasks with training datasets $\{\mathcal{D}^{1}_{\text{trn}},\cdots,\mathcal{D}^{T}_{\text{trn}}\}$ and test datasets $\{\mathcal{D}^{1}_{\text{tst}},\cdots,\mathcal{D}^{T}_{\text{tst}}\}$. The $t$-th task introduces new, task-specific classes to be trained, namely $\mathcal{C}^{t}$. The goal is to build a multi-label classifier that discriminates an increasing number of classes. We denote $\mathcal{C}_{\text{seen}}^{t}=\bigcup_{n=1}^{t}\mathcal{C}^{n}$ as the seen classes at task $t$, where $\mathcal{C}_{\text{seen}}^{t}$ contains the old class set $\mathcal{C}_{\text{seen}}^{t-1}$ and the new class set $\mathcal{C}^{t}$, that is, $\mathcal{C}_{\text{seen}}^{t}=\mathcal{C}_{\text{seen}}^{t-1}\cup\mathcal{C}^{t}$. Note that during the testing phase, the ground-truth labels for LML classification cover all seen classes $\mathcal{C}_{\text{seen}}^{t}$.
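
To make this setup concrete, the following minimal Python sketch (hypothetical helper names, not part of the paper's code) shows how the seen class set $\mathcal{C}_{\text{seen}}^{t}$ grows as tasks arrive:

```python
from typing import List, Set

def seen_classes(task_class_sets: List[Set[int]], t: int) -> Set[int]:
    """C_seen^t = C^1 U ... U C^t for a 1-indexed task id t."""
    seen: Set[int] = set()
    for class_set in task_class_sets[:t]:
        seen |= class_set
    return seen

# Hypothetical toy stream: two tasks of four disjoint classes each.
tasks = [{0, 1, 2, 3}, {4, 5, 6, 7}]
assert seen_classes(tasks, 1) == {0, 1, 2, 3}
assert seen_classes(tasks, 2) == {0, 1, 2, 3, 4, 5, 6, 7}
```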

2.2 Augmented Correlation Matrix

A correlation matrix is often built in multi-label learning (Chen et al., 2019, 2021) to model label relationships. Our Augmented Correlation Matrix (ACM) provides the label relationships among all seen classes $\mathcal{C}^{t}_{\text{seen}}$ and is augmented to capture intra- and inter-task label dependencies. Most existing multi-label learning algorithms (Chen et al., 2019, 2021) construct the label correlation matrix $\mathbf{A}$ from hard label statistics over the class set $\mathcal{C}$: $\mathbf{A}_{ij}=P(\mathcal{C}_{i}\mid\mathcal{C}_{j})|_{i\neq j}$. We construct the ACM $\mathbf{A}^{t}$ for task $t>1$ in an online fashion to approximate these statistics:

$$\mathbf{A}^{t}=\begin{bmatrix}\mathbf{A}^{t-1}&\mathbf{R}^{t}\\ \mathbf{Q}^{t}&\mathbf{B}^{t}\end{bmatrix}=\begin{bmatrix}\text{Old-Old}&\text{Old-New}\\ \text{New-Old}&\text{New-New}\end{bmatrix}, \tag{1}$$

in which the four block matrices $\mathbf{A}^{t-1}$, $\mathbf{B}^{t}$, $\mathbf{R}^{t}$ and $\mathbf{Q}^{t}$ represent the intra- and inter-task label relationships between old and old, new and new, old and new, as well as new and old classes, respectively. For the first task, $\mathbf{A}^{1}=\mathbf{B}^{1}$. For $t>1$, $\mathbf{A}^{t}\in\mathbb{R}^{|\mathcal{C}_{\text{seen}}^{t}|\times|\mathcal{C}_{\text{seen}}^{t}|}$. Note that the block $\mathbf{A}^{t-1}$ (Old-Old) is inherited from the old task, so we focus on how to compute the other three blocks of the ACM.
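
As an illustration of Eq. (1), here is a minimal NumPy sketch (a bookkeeping assumption of ours, not the authors' implementation) that assembles the augmented matrix from its four blocks:

```python
import numpy as np

def assemble_acm(A_prev: np.ndarray, R: np.ndarray,
                 Q: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Stack the blocks of Eq. (1):
    A_prev (Old-Old): (n_old, n_old)   R (Old-New): (n_old, n_new)
    Q      (New-Old): (n_new, n_old)   B (New-New): (n_new, n_new)
    """
    return np.block([[A_prev, R],
                     [Q,      B]])
```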

New-New block ($\mathbf{B}^{t}\in\mathbb{R}^{|\mathcal{C}^{t}|\times|\mathcal{C}^{t}|}$). This block captures the intra-task label relationships among the new classes. The conditional probabilities in $\mathbf{B}^{t}$ are calculated from the hard label statistics of the training dataset, as in common multi-label learning:

$$\mathbf{B}^{t}_{ij}=P(\mathcal{C}^{t}_{i}\in\mathcal{C}^{t}\mid\mathcal{C}^{t}_{j}\in\mathcal{C}^{t})=\frac{N_{ij}}{N_{j}}, \tag{2}$$

where $N_{ij}$ is the number of examples containing both classes $\mathcal{C}^{t}_{i}$ and $\mathcal{C}^{t}_{j}$, and $N_{j}$ is the number of examples containing class $\mathcal{C}^{t}_{j}$. Because the data arrive as an online stream, $N_{ij}$ and $N_{j}$ are accumulated and updated at each step of the training process.
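
A minimal sketch of this online accumulation follows (variable names and the zero-count guard are ours):

```python
import numpy as np

class NewNewBlock:
    """Online accumulation of the hard-label statistics behind Eq. (2)."""

    def __init__(self, n_new: int):
        self.N_ij = np.zeros((n_new, n_new))  # co-occurrence counts of classes i and j
        self.N_j = np.zeros(n_new)            # occurrence counts of class j

    def update(self, y_batch: np.ndarray) -> None:
        """y_batch: (batch_size, n_new) binary labels of the current mini-batch."""
        self.N_ij += y_batch.T @ y_batch
        self.N_j += y_batch.sum(axis=0)

    def matrix(self) -> np.ndarray:
        """B^t with B_ij = N_ij / N_j (column j normalised by the count of class j)."""
        return self.N_ij / np.maximum(self.N_j[None, :], 1.0)
```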

Old-New block ($\mathbf{R}^{t}\in\mathbb{R}^{|\mathcal{C}_{\text{seen}}^{t-1}|\times|\mathcal{C}^{t}|}$). Given an image $\mathbf{x}$, the predicted probability $\hat{z}_{i}$ generated by the expert can be regarded as the soft label of the $i$-th old class (see Eq. (7)). Thus, the product $\hat{z}_{i}y_{j}$ can be regarded as a surrogate for the co-occurrence of ${\mathcal{C}^{t-1}_{\text{seen}}}_{i}$ and $\mathcal{C}^{t}_{j}$, and $\sum_{\mathbf{x}}\hat{z}_{i}y_{j}$ denotes its online mini-batch accumulation. Thus, we have

$$\mathbf{R}^{t}_{ij}=P\left({\mathcal{C}^{t-1}_{\text{seen}}}_{i}\in\mathcal{C}^{t-1}_{\text{seen}}\mid\mathcal{C}^{t}_{j}\in\mathcal{C}^{t}\right)=\frac{\sum_{\mathbf{x}}\hat{z}_{i}\,y_{j}}{N_{j}}. \tag{3}$$

New-Old block ($\mathbf{Q}^{t}\in\mathbb{R}^{|\mathcal{C}^{t}|\times|\mathcal{C}_{\text{seen}}^{t-1}|}$). Based on Bayes' rule, we obtain this block by

$$\mathbf{Q}^{t}_{ji}=P(\mathcal{C}^{t}_{j}\mid{\mathcal{C}^{t-1}_{\text{seen}}}_{i})=\frac{P({\mathcal{C}^{t-1}_{\text{seen}}}_{i}\mid\mathcal{C}^{t}_{j})\,P(\mathcal{C}^{t}_{j})}{P({\mathcal{C}^{t-1}_{\text{seen}}}_{i})}=\frac{\mathbf{R}^{t}_{ij}N_{j}}{\sum_{\mathbf{x}}\hat{z}_{i}}. \tag{4}$$

Finally, the ACM is constructed online from the soft label statistics produced by the auto-updated expert network and the hard label statistics of the training data.
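
The following sketch summarises the online accumulation behind Eqs. (3) and (4); it assumes `z_hat` are the expert's soft labels for old classes and `y` are the hard labels of the current task (names and zero-guarding are ours):

```python
import numpy as np

class InterTaskBlocks:
    """Online accumulation behind Eq. (3) (Old-New) and Eq. (4) (New-Old)."""

    def __init__(self, n_old: int, n_new: int):
        self.soft_cooc = np.zeros((n_old, n_new))  # running sum of z_hat_i * y_j
        self.soft_count = np.zeros(n_old)          # running sum of z_hat_i
        self.N_j = np.zeros(n_new)                 # hard counts of new classes

    def update(self, z_hat: np.ndarray, y: np.ndarray) -> None:
        """z_hat: (batch, n_old) expert soft labels; y: (batch, n_new) binary labels."""
        self.soft_cooc += z_hat.T @ y
        self.soft_count += z_hat.sum(axis=0)
        self.N_j += y.sum(axis=0)

    def R(self) -> np.ndarray:
        """Eq. (3): R_ij = (sum_x z_hat_i * y_j) / N_j."""
        return self.soft_cooc / np.maximum(self.N_j[None, :], 1.0)

    def Q(self) -> np.ndarray:
        """Eq. (4): Q_ji = R_ij * N_j / sum_x z_hat_i = soft_cooc_ij / sum_x z_hat_i."""
        return self.soft_cooc.T / np.maximum(self.soft_count[None, :], 1e-12)
```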

2.3 Augmented Graph Convolutional Network

The ACM stores auto-updated dependencies among all seen classes. With the established ACM, we can leverage a Graph Convolutional Network (GCN) to assist the CNN prediction, as in Eq. (6). We propose an Augmented Graph Convolutional Network (AGCN) to manage the augmented fully-connected graph. AGCN learns to map this label graph into a set of inter-dependent object classifiers. It is a two-layer stacked graph model, similar to ML-GCN (Chen et al., 2019). Based on the ACM $\mathbf{A}^{t}$, AGCN can capture class-incremental dependencies in an online way. Let the graph nodes be initialized by GloVe embeddings (Pennington et al., 2014), denoted $\mathbf{H}^{t,0}\in\mathbb{R}^{|\mathcal{C}_{\text{seen}}^{t}|\times d}$, where $d$ is the embedding dimensionality. The graph representation $\mathbf{H}^{t}\in\mathbb{R}^{|\mathcal{C}_{\text{seen}}^{t}|\times D}$ in task $t$ is computed by:

$$\mathbf{H}^{t}=\text{AGCN}(\mathbf{A}^{t},\mathbf{H}^{t,0}). \tag{5}$$

As shown in Fig. 2, together with a CNN feature extractor, the multiple labels of an image $\mathbf{x}$ are predicted by

$$\mathbf{\hat{y}}=\sigma\left(\text{AGCN}(\mathbf{A}^{t},\mathbf{H}^{t,0})\otimes\text{CNN}(\mathbf{x})\right), \tag{6}$$

where $\mathbf{A}^{t}$ denotes the ACM and $\mathbf{H}^{t,0}$ is the initialized graph node embedding. The prediction is $\mathbf{\hat{y}}=[\mathbf{\hat{y}}_{\text{old}}~\mathbf{\hat{y}}_{\text{new}}]$, where $\mathbf{\hat{y}}_{\text{old}}\in\mathbb{R}^{|\mathcal{C}_{\text{seen}}^{t-1}|}$ is for the old classes and $\mathbf{\hat{y}}_{\text{new}}\in\mathbb{R}^{|\mathcal{C}^{t}|}$ is for the new classes. We train the current task for classification using the cross-entropy loss.
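
A minimal PyTorch sketch of Eqs. (5) and (6), in the spirit of ML-GCN; the layer widths, the LeakyReLU activation and the absence of adjacency re-normalisation are our assumptions rather than the paper's exact configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AGCNSketch(nn.Module):
    """Two stacked graph convolution layers mapping node embeddings H^{t,0}
    into per-class classifiers, guided by the ACM A^t (Eq. (5))."""

    def __init__(self, emb_dim: int = 300, hidden_dim: int = 1024, feat_dim: int = 2048):
        super().__init__()
        self.gc1 = nn.Linear(emb_dim, hidden_dim, bias=False)
        self.gc2 = nn.Linear(hidden_dim, feat_dim, bias=False)

    def forward(self, A: torch.Tensor, H0: torch.Tensor) -> torch.Tensor:
        # A: (|C_seen|, |C_seen|) ACM, H0: (|C_seen|, emb_dim) GloVe node embeddings
        H = F.leaky_relu(A @ self.gc1(H0), negative_slope=0.2)
        return A @ self.gc2(H)                     # (|C_seen|, feat_dim)

def predict(agcn: AGCNSketch, A: torch.Tensor, H0: torch.Tensor,
            cnn_features: torch.Tensor) -> torch.Tensor:
    """Eq. (6): combine the graph classifiers with CNN features, then apply sigmoid."""
    W = agcn(A, H0)                                # one classifier per seen class
    return torch.sigmoid(cnn_features @ W.t())     # (batch, |C_seen|) probabilities
```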

To mitigate class-level catastrophic forgetting, inspired by distillation-based lifelong learning methods (Li & Hoiem, 2017; Zhou et al., 2022), we construct auto-updated expert networks consisting of CNN$_{\text{xpt}}$ and AGCN$_{\text{xpt}}$. The expert parameters are fixed after each task has been trained and are auto-updated as new tasks are learned. Based on the expert, we construct the distillation loss as

$$\ell_{\text{dst}}(\mathbf{\hat{z}},\mathbf{\hat{y}}_{\text{old}})=-\sum_{i=1}^{|\mathcal{C}_{\text{seen}}^{t-1}|}\left[\hat{z}_{i}\log\left(\hat{y}_{i}\right)+\left(1-\hat{z}_{i}\right)\log\left(1-\hat{y}_{i}\right)\right], \tag{7}$$

where $\mathbf{\hat{z}}=\sigma\left(\text{AGCN}_{\text{xpt}}(\mathbf{A}^{t-1},\mathbf{H}^{t-1,0})\otimes\text{CNN}_{\text{xpt}}(\mathbf{x})\right)$ can be treated as the soft labels representing the prediction on the old classes. The $i$-th element $\hat{z}_{i}$ of $\mathbf{\hat{z}}$ represents the probability that the image $\mathbf{x}$ contains the $i$-th class.
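
A hedged PyTorch sketch of Eq. (7); the clamping and batch averaging are our additions for numerical stability and are not specified in the paper:

```python
import torch

def distillation_loss(z_hat: torch.Tensor, y_hat_old: torch.Tensor,
                      eps: float = 1e-7) -> torch.Tensor:
    """Eq. (7): binary cross-entropy between the expert's soft labels z_hat and the
    current model's old-class predictions y_hat_old, both of shape (batch, n_old)."""
    y_hat_old = y_hat_old.clamp(eps, 1 - eps)
    per_example = -(z_hat * y_hat_old.log()
                    + (1 - z_hat) * (1 - y_hat_old).log()).sum(dim=1)
    return per_example.mean()   # averaged over the mini-batch (our assumption)
```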

To mitigate relationship-level forgetting, we constantly preserve the established relationships across the sequential tasks. The graph node embedding is independent of the label co-occurrence and can be stored as a teacher to avoid forgetting the label relationships. Suppose the learned embedding after task $t$ is stored as $\mathbf{G}^{t}=\text{AGCN}_{\text{xpt}}(\mathbf{A}^{t},\mathbf{H}^{t,0})$, $t>1$. We propose a relationship-preserving loss as a constraint on the class relationships:

$$\ell_{\text{gph}}(\mathbf{G}^{t-1},\mathbf{H}^{t})=\sum^{|\mathcal{C}_{\text{seen}}^{t-1}|}_{i=1}\left\|\mathbf{G}^{t-1}_{i}-\mathbf{H}^{t}_{i}\right\|^{2}. \tag{8}$$

By minimizing $\ell_{\text{gph}}$ with this partial constraint on the old node embeddings, the changes of the AGCN parameters are limited. Thus, the forgetting of the established label relationships is alleviated as LML classification progresses. The final loss for model training is defined as

$$\ell=\lambda_{1}\ell_{\text{cls}}(\mathbf{y},\mathbf{\hat{y}}_{\text{new}})+\lambda_{2}\ell_{\text{dst}}(\mathbf{\hat{z}},\mathbf{\hat{y}}_{\text{old}})+\lambda_{3}\ell_{\text{gph}}(\mathbf{G}^{t-1},\mathbf{H}^{t}), \tag{9}$$

where $\ell_{\text{cls}}$ is the classification loss, $\ell_{\text{dst}}$ is used to mitigate class-level forgetting and $\ell_{\text{gph}}$ is used to reduce relationship-level forgetting. $\lambda_{1}$, $\lambda_{2}$ and $\lambda_{3}$ are the loss weights for $\ell_{\text{cls}}$, $\ell_{\text{dst}}$ and $\ell_{\text{gph}}$. Extensive ablation studies are conducted for $\ell_{\text{gph}}$ after all relationships are built.
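
A sketch of Eqs. (8) and (9); the default loss weights below are the values selected in Sec. 3.4, and the function signatures are our own:

```python
import torch

def relationship_preserving_loss(G_prev: torch.Tensor, H_t: torch.Tensor) -> torch.Tensor:
    """Eq. (8): squared distance between the stored embeddings G^{t-1} of the old
    classes and the first |C_seen^{t-1}| rows of the current embeddings H^t."""
    n_old = G_prev.size(0)
    return ((G_prev - H_t[:n_old]) ** 2).sum()

def total_loss(l_cls: torch.Tensor, l_dst: torch.Tensor, l_gph: torch.Tensor,
               lambdas=(0.07, 0.93, 1e5)) -> torch.Tensor:
    """Eq. (9): weighted sum of the classification, distillation and graph losses."""
    l1, l2, l3 = lambdas
    return l1 * l_cls + l2 * l_dst + l3 * l_gph
```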

3 Experiments

Figure 3: mAP (%) changes on two benchmarks.
Table 1: We report 3 main metrics (%) for LML after the whole data stream is seen once on Split-WIDE and Split-COCO.

Method       | Split-WIDE           | Split-COCO
             | mAP↑   CF1↑   OF1↑   | mAP↑   CF1↑   OF1↑
Multi-Task   | 66.17  61.45  71.57  | 65.85  61.79  66.27
Fine-Tuning  | 20.33  19.10  35.72  |  9.83  10.54  28.83
 Forgetting↓ | 40.85  31.20  15.10  | 58.04  63.54  20.60
EWC          | 22.03  22.78  35.70  | 12.20  12.50  29.67
 Forgetting↓ | 34.86  28.18  15.17  | 45.61  55.44  19.85
LwF          | 29.46  29.64  42.69  | 19.95  21.69  40.68
 Forgetting↓ | 20.26  18.99   5.73  | 41.16  39.85  11.43
AGEM         | 32.47  33.28  38.93  | 23.31  27.25  37.94
 Forgetting↓ | 16.42  15.71   9.73  | 34.52  18.92  12.94
ER           | 34.03  34.94  39.37  | 25.03  30.54  38.38
 Forgetting↓ | 15.15  11.80   8.61  | 33.46  17.28  12.34
PRS          | 37.93  21.12  15.64  | 28.81  18.40  13.86
 Forgetting↓ | 13.59  51.09  62.90  | 30.90  54.36  52.51
SCR          | 35.34  35.47  41.92  | 25.75  30.63  39.10
 Forgetting↓ | 14.26  10.17   8.04  | 32.02  15.98  11.96
AGCN (Ours)  | 41.12  38.27  43.27  | 34.11  35.49  42.37
 Forgetting↓ | 11.22   5.43   4.28  | 23.71  14.79   8.16

Table 2: Ablation studies (%) for the ACM $\mathbf{A}^{t}$ used to model intra- and inter-task label relationships on Split-COCO.

Line | $\mathbf{A}^{t-1}$ & $\mathbf{B}^{t}$ | $\mathbf{R}^{t}$ & $\mathbf{Q}^{t}$ | mAP↑   CF1↑   OF1↑
1    | √                                     | ×                                   | 31.52  30.37  34.87
2    | √                                     | √                                   | 34.11  35.49  42.37

3.1 Dataset construction

Split-COCO. We choose the 40 most frequent concepts from the 80 classes of MS-COCO (Lin et al., 2014) to construct Split-COCO, which has 65,082 examples for training and 27,173 examples for validation. The 40 classes are split into 10 different tasks, and each task contains 4 classes.

Split-WIDE. NUS-WIDE (Chua et al., 2009) is larger in scale than MS-COCO. Following (Jiang & Li, 2017), we choose the 21 most frequent concepts from the 81 classes of NUS-WIDE to construct Split-WIDE, which has 144,858 examples for training and 41,146 examples for validation. We split Split-WIDE into 7 tasks, where each task contains 3 classes.
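
A small sketch of how such class-incremental splits can be generated (the class ordering is an assumption; the paper does not specify it):

```python
from typing import List

def split_into_tasks(class_ids: List[int], classes_per_task: int) -> List[List[int]]:
    """Split an ordered class list into equally sized incremental tasks, e.g. the
    40 Split-COCO classes into 10 tasks of 4, or the 21 Split-WIDE classes into 7 tasks of 3."""
    return [class_ids[i:i + classes_per_task]
            for i in range(0, len(class_ids), classes_per_task)]

assert len(split_into_tasks(list(range(40)), 4)) == 10   # Split-COCO
assert len(split_into_tasks(list(range(21)), 3)) == 7    # Split-WIDE
```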

3.2 Evaluation metrics

Multi-label evaluation. Following traditional multi-label learning (Chen et al., 2019; Kim et al., 2020; Chen et al., 2021), we report the three most important multi-label metrics: mAP, CF1 and OF1. Forgetting measure (Chaudhry et al., 2018). For each of the three multi-label metrics, this score is the difference between the score obtained when the task was first trained and the final score after the whole data stream has been seen.
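
A minimal sketch of the forgetting measure as used here, per task and per metric (the sign convention follows the description above):

```python
def forgetting(score_when_first_trained: float, final_score: float) -> float:
    """Drop of a metric (mAP, CF1 or OF1) on a task between the moment it was
    first trained and after the whole data stream has been seen; lower is better."""
    return score_when_first_trained - final_score

# Hypothetical example: a task reaches 45.0 mAP right after training and 30.0 at the end.
assert forgetting(45.0, 30.0) == 15.0
```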

Table 3: AGCN ablation studies (%) for loss weights and the relationship-preserving loss on Split-COCO.

$\lambda_{1}$  $\lambda_{2}$  $\lambda_{3}$ | mAP↑   CF1↑   OF1↑
0.05           0.95           0             | 29.90  31.80  37.12
 Forgetting↓                                | 29.24  24.88  19.67
0.07           0.93           0             | 30.99  32.03  39.31
 Forgetting↓                                | 28.28  22.55  13.88
0.09           0.91           0             | 29.71  32.71  38.91
 Forgetting↓                                | 29.97  21.79  16.49
0.07           0.93           $10^{4}$      | 33.05  33.31  41.04
 Forgetting↓                                | 26.41  20.99  11.38
0.07           0.93           $10^{5}$      | 34.11  35.49  42.37
 Forgetting↓                                | 23.71  14.79   8.16
0.07           0.93           $10^{6}$      | 33.71  33.05  42.62
 Forgetting↓                                | 25.69  21.30   7.89

Figure 4: ACM visualization on Split-WIDE and Split-COCO.

3.3 Results

Multi-Task is the performance upper bound, and Fine-Tuning is the performance lower bound. In Tab. 1, our method outperforms the other state-of-the-art methods on all three metrics, as well as on the forgetting values evaluated after task $T$. On Split-COCO, AGCN outperforms the best state-of-the-art method, PRS, by a large margin in mAP (34.11% vs. 28.81%). AGCN also performs better than the others on Split-WIDE (41.12% vs. 37.93%), suggesting that it is effective on the large-scale multi-label dataset. Fig. 3 illustrates the mAP changes as tasks are learned on the two benchmarks; the proposed AGCN is better than the other state-of-the-art methods throughout the whole LML process.

The final ACM visualization is shown in Fig. 4. Dependencies between highly correlated classes receive larger weights than those between irrelevant classes, which means the intra- and inter-task relationships can be well constructed even though the old class data are unavailable.

3.4 Ablation studies

ACM effectiveness. In Tab. 2, even without building the relationships across old and new tasks, the performance of AGCN (Line 1) is already better than that of the non-AGCN methods, for example, 31.52% vs. 28.81% in mAP. This means that intra-task label relationships alone are already effective for LML. When the inter-task block matrices $\mathbf{R}^{t}$ and $\mathbf{Q}^{t}$ are also used, AGCN with both intra- and inter-task relationships (Line 2) performs even better on all three metrics, which means the inter-task relationships further enhance multi-label recognition.

Hyperparameter selection. We then analyze the influence of the loss weights and the relationship-preserving loss on Split-COCO, as shown in Tab. 3. With $\lambda_{1}=0.07$ and $\lambda_{2}=0.93$, the performance is better than with the other settings. Adding the relationship-preserving loss $\ell_{\text{gph}}$ brings further gains, which means mitigating the catastrophic forgetting of relationships is crucial for LML classification. We select the best $\lambda_{3}$ as the hyper-parameter, i.e., $\lambda_{3}=10^{5}$, for LML classification.

4 Conclusion

LML classification is a new paradigm of lifelong learning. Its key challenges are constructing label relationships and reducing catastrophic forgetting to improve overall performance. In this paper, a novel AGCN based on an auto-updated expert mechanism is proposed to address these challenges. We construct a label correlation matrix with soft labels generated by an expert network, and we mitigate relationship forgetting with a proposed relationship-preserving loss. In general, AGCN connects previous and current tasks over all seen classes in LML classification. Extensive experiments demonstrate that AGCN captures label dependencies well and effectively mitigates catastrophic forgetting, thereby achieving better classification performance.

References

  • Chaudhry et al. (2018) Chaudhry, A., Dokania, P. K., Ajanthan, T., and Torr, P. H. Riemannian walk for incremental learning: Understanding forgetting and intransigence. In Proceedings of the European Conference on Computer Vision, pp.  532–547, 2018.
  • Chaudhry et al. (2019) Chaudhry, A., Ranzato, M., Rohrbach, M., and Elhoseiny, M. Efficient lifelong learning with A-GEM. In Proceedings of the International Conference on Learning Representations, 2019.
  • Chen et al. (2021) Chen, Z., Wei, X.-S., Wang, P., and Guo, Y. Learning graph convolutional networks for multi-label recognition and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.
  • Chen et al. (2019) Chen, Z.-M., Wei, X.-S., Wang, P., and Guo, Y. Multi-label image recognition with graph convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.  5177–5186, 2019.
  • Chua et al. (2009) Chua, T.-S., Tang, J., Hong, R., Li, H., Luo, Z., and Zheng, Y. NUS-WIDE: A real-world web image database from National University of Singapore. In Proceedings of the ACM International Conference on Image and Video Retrieval, pp.  1–9, 2009.
  • Jiang & Li (2017) Jiang, Q.-Y. and Li, W.-J. Deep cross-modal hashing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.  3232–3240, 2017.
  • Kim et al. (2020) Kim, C. D., Jeong, J., and Kim, G. Imbalanced continual learning with partitioning reservoir sampling. In Proceedings of the European Conference on Computer Vision, pp.  411–428, 2020.
  • Kirkpatrick et al. (2017) Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A. A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526, 2017.
  • Li & Hoiem (2017) Li, Z. and Hoiem, D. Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(12):2935–2947, 2017.
  • Lin et al. (2014) Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision, pp.  740–755, 2014.
  • Mai et al. (2021) Mai, Z., Li, R., Kim, H., and Sanner, S. Supervised contrastive replay: Revisiting the nearest class mean classifier in online class-incremental continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  3589–3599, 2021.
  • Pennington et al. (2014) Pennington, J., Socher, R., and Manning, C. D. GloVe: Global vectors for word representation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp.  1532–1543, 2014.
  • Rebuffi et al. (2017) Rebuffi, S.-A., Kolesnikov, A., Sperl, G., and Lampert, C. H. iCaRL: Incremental classifier and representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.  2001–2010, 2017.
  • Rolnick et al. (2019) Rolnick, D., Ahuja, A., Schwarz, J., Lillicrap, T., and Wayne, G. Experience replay for continual learning. Advances in Neural Information Processing Systems, 32:350–360, 2019.
  • Zhou et al. (2022) Zhou, D.-W., Ye, H.-J., and Zhan, D.-C. Few-shot class-incremental learning by sampling multi-phase tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.