ER-FSL: Experience Replay with Feature Subspace Learning
for Online Continual Learning

Huiwei Lin, Harbin Institute of Technology, Shenzhen, China, [email protected]
(2024)
Abstract.

Online continual learning (OCL) involves deep neural networks retaining knowledge from old data while adapting to new data, which is accessible only once. A critical challenge in OCL is catastrophic forgetting, reflected in reduced model performance on old data. Existing replay-based methods mitigate forgetting by replaying buffered samples from old data and learning current samples of new data. In this work, we dissect existing methods and empirically discover that learning and replaying in the same feature space is not conducive to addressing the forgetting issue: the learned features associated with old data are readily changed by the features related to new data due to data imbalance, which leads to forgetting. Based on this observation, we explore learning and replaying in different feature spaces. Learning in a feature subspace is sufficient to capture novel knowledge from new data, while replaying in a larger feature space provides more room to maintain historical knowledge from old data. To this end, we propose a novel OCL approach called experience replay with feature subspace learning (ER-FSL). Firstly, ER-FSL divides the entire feature space into multiple subspaces, with each subspace used to learn current samples. Moreover, it introduces a subspace reuse mechanism to address situations where no blank subspaces exist. Secondly, ER-FSL replays previous samples using an accumulated space comprising all learned subspaces. Extensive experiments on three datasets demonstrate the superiority of ER-FSL over various state-of-the-art methods.

neural networks, online continual learning, image classification
Corresponding Author.
Published in: Proceedings of the 32nd ACM International Conference on Multimedia (MM ’24), 28 October - 1 November 2024, Melbourne, Australia. Submission ID: 3868. CCS concepts: Computing methodologies → Cross-validation; Computing methodologies → Image representations.

1. Introduction

Figure 1. The comparison of existing studies and our work. (a) An example of existing studies. Top: the model trains all samples in the same feature space; Bottom: the change of samples in the feature space using existing methods. (b) An example of our method. Top: different from existing studies, the model learns current samples in a feature subspace while replaying buffered samples in a larger one (e.g., feature whole-space). Bottom: the change of samples in the feature space using our approach.

Online continual learning (OCL) is a significant problem in deep neural networks. Benefiting from offline learning on vast amounts of data, deep neural networks have demonstrated exceptional performance across various application fields, especially in the multimedia domain (Xu et al., 2023; Hao et al., 2021; Shen et al., 2021). However, they cannot continually learn as humans do. As new data accumulates, a model that continues to use conventional training strategies is highly susceptible to catastrophic forgetting (CF), in which the model’s performance on previously learned data deteriorates significantly after learning new data. Thus, OCL emerges as a solution to enable models to continually acquire novel knowledge from new data while retaining historical knowledge from old data. Moreover, the data can be accessed only once in an online fashion, which adds complexity to the OCL problem.

Among all methods for continual learning, replay-based methods are highly suitable for OCL to address CF problems. In this family of methods, a memory buffer is utilized to save and replay partial old data. We provide an example of OCL in a class-incremental scenario, where the model first learns classes of “dog” and “airplane”, and then learns classes of “cat” and “ship”. As demonstrated at the top of Figure 1 (a), replay-based methods allow the model to continuously learn current samples of new data and replay buffered samples of old data. Building on this foundation, some methods have been proposed to select more important current samples for storage (Aljundi et al., 2019b; Jin et al., 2021) while replaying more critical buffered samples (Aljundi et al., 2019a). At the same time, other methods (Lin et al., 2023b) are proposed to improve the training process for more effective learning.

In this work, we dissect existing methods and empirically discover that learning and replaying in the same feature space is not conducive to addressing the forgetting issue. On the one hand, training the model on new data disrupts the embedding of old data in the feature space through gradient descent. Without replaying, the current samples dominate gradient propagation during OCL training. As a result, the model learns features that correctly identify new data, but in doing so it changes the features of old data. This makes the old data indistinguishable in the feature space and causes the model to forget. On the other hand, although existing replay-based methods can preserve some features related to old data, changes to the features of old data remain inevitable. Because current samples outnumber buffered samples, gradient descent is still dominated by current samples. When learning and replaying occur in the same feature space, the model tends to focus more on the features of new data. For clarity, we decompose the original synchronous learning and review process. As depicted at the bottom of Figure 1 (a), the samples of “dog” and “airplane” become indistinguishable in the feature space after the model has learned the samples of “cat” and “ship”. Even after replaying, some samples of “dog” and “airplane” are still indistinguishable.

With this inspiration, we explore a novel strategy that learns current samples and replays buffered samples using different feature spaces. As described at the top of Figure 1 (b), the model learns current samples (blue) in a feature subspace (Wang et al., 2020; Jiang et al., 2022; Zhu and Koniusz, 2022) and replays buffered samples (green) in the feature whole-space. This simple yet effective strategy ensures the model’s generalization ability while improving its anti-forgetting ability. On the one hand, learning current samples in a feature subspace is sufficient for the model to capture novel knowledge. On the other hand, replaying buffered samples in a larger feature space provides more features of old data for the model to retain historical knowledge. The main idea is illustrated at the bottom of Figure 1 (b). By learning in the feature subspace, the model can effectively distinguish the new data but may struggle to recognize old data. However, by replaying in the larger space, the model can retain more features of old data, thereby improving its recognition ability for old data. Consequently, the old samples that are challenging to separate in the low-dimensional space (i.e., feature subspace) can now be effectively handled in a high-dimensional space (i.e., feature whole-space), significantly alleviating the forgetting issue.

To this end, we develop a straightforward yet highly effective replay-based approach called experience replay with feature subspace learning (ER-FSL) for OCL. The fundamental motivation of ER-FSL is to employ different feature spaces for learning and replaying. Specifically, it divides the overall feature space into multiple subspaces, with each subspace used to learn a new task. Simultaneously, all learned subspaces collectively form an accumulated feature space for replaying buffered samples. This process can be divided into three primary components. 1) In the learning component, the model uses a feature subspace to learn current samples and ensure its generalization ability. If there is no blank subspace, the model uses a subspace reuse mechanism to select subspaces for learning future tasks. 2) In the replaying component, buffered samples are replayed within the accumulated feature space, helping the model remember more features associated with old data. 3) In the testing component, the model trained in this way can accurately identify more unknown samples within the accumulated feature space.

Our main contributions can be summarized as follows:

  • 1)

    We theoretically analyze the role of feature spaces in existing methods and explore a novel strategy for utilizing separate feature spaces during learning and replaying processes. To the best of our knowledge, this work represents the first investigation into utilizing different embedding feature spaces for the replay-based OCL approaches.

  • 2)

    We propose a novel OCL framework called ER-FSL to mitigate the forgetting problem by addressing the changing of old features. The primary operation involves selecting a feature subspace for learning current samples and replaying buffered samples within the accumulated feature space.

  • 3)

    We conduct extensive experiments on three datasets for image classification, and the empirical results consistently demonstrate the superiority of ER-FSL over various state-of-the-art methods. We also investigate the benefits of each component by ablation studies. The source code is available at https://github.com/FelixHuiweiLin/ER-FSL.

2. Related Work

2.1. Continual Learning

Since continual learning arises in various scenarios (Hu et al., 2022; Zhang et al., 2022c; Yang et al., 2022; Zhang et al., 2022a) of deep neural networks, its related research is quite extensive. In addition to work on the innovation and application of continual learning methods, analysis (Pham et al., 2021; Mirzadeh et al., 2020) and overview (Masana et al., 2022; Mai et al., 2022) studies have also attracted much attention.

Continual learning (De Lange et al., 2021; Masana et al., 2022), also known as lifelong learning (Liu and Mazumder, 2021) or incremental learning (Wang et al., 2023b), is a machine learning paradigm that trains models on a continuous stream of new data. It aims to ensure the model’s generalization ability (Ghunaim et al., 2023; Brahma and Rai, 2023) and anti-forgetting ability (Dong et al., 2023, 2021) at the same time. Existing methods can be generally divided into three categories. 1) Architecture-based methods (Qin et al., 2021; Yan et al., 2021; Hu et al., 2023) overcome the CF problem with dynamic networks or static networks. Dynamic networks gradually expand the model’s network structure as the number of samples increases during the continual learning process, while static networks keep their structure unchanged and allocate parameters selectively. 2) Regularization-based methods (Wang et al., 2023a; Pelosin et al., 2022; Akyürek et al., 2021) constrain the optimization process of the model using an additional regularization term. The design of this regularization term can be based on variations in parameters during the training process or on knowledge distillation. 3) Replay-based methods save true old data (Lopez-Paz and Ranzato, 2017; Sun et al., 2023; Luo et al., 2023; Zhou et al., 2022) or generate pseudo old data (Cui et al., 2021; Qi et al., 2022) to replay with new data. Some feature replay methods (Toldo and Ozay, 2022) have also been proposed. The proposed ER-FSL in this work is a novel replay-based method.

2.2. Online Continual Learning

OCL is a specialized area within continual learning that emphasizes effective learning from a single pass through an online data stream, where tasks or information are introduced incrementally over time. It plays a critical role in scenarios requiring continual knowledge evolution and adaptation to new information (Mai et al., 2022).

Existing OCL methods are mainly replay-based, with the exception of AOP (Guo et al., 2022a). A variety of ER-based methods have been proposed for OCL due to the effectiveness of a replay-based method called experience replay (ER) (Rolnick et al., 2019). Some approaches select more valuable samples for storing (Aljundi et al., 2019b; Jin et al., 2021; He and Zhu, 2021) and replaying (Aljundi et al., 2019a; Shim et al., 2021; Wang et al., 2022; Prabhu et al., 2020). Other approaches (Mai et al., 2021; Caccia et al., 2022; Zhang et al., 2022b; Guo et al., 2022b; Gu et al., 2022; Lin et al., 2023b; Chrysakis, 2023; Wei et al., 2023; Guo et al., 2023; Wang et al., 2023c; Liang and Li, 2023; Lin et al., 2023a) are model update strategies that focus on improving the training process. Both SS-IL (Ahn et al., 2021) and ER-ACE (Caccia et al., 2022) propose different cross-entropy loss functions for learning new data and reviewing old data to alleviate catastrophic forgetting. Subsequently, PCR (Lin et al., 2023b) analyzes and integrates these two types of methods from the perspective of gradient propagation, while LODE (Liang and Li, 2023) approaches the integration from the angle of decomposing the loss function. Both of them significantly enhance the performance of the original methods. Furthermore, the performance of the latest methods (Guo et al., 2022b; Wei et al., 2023) depends on multiple data augmentation operations, since data augmentation helps improve the performance of the model (Zhu et al., 2021).

The proposed ER-FSL introduces a novel model update strategy for OCL. Different from existing strategies that learn and replay within the same feature space, the proposed ER-FSL embeds features in different spaces for current samples and buffered samples. This differentiation allows for a more effective enhancement of the model’s anti-forgetting capabilities compared to existing methods.

3. Problem Definition and Analysis

3.1. Problem Definition

Taking a class-incremental scenario as an example, OCL generally considers a single-pass data stream and divides it into a sequence of $T$ learning tasks as $\mathcal{D}=\{\mathcal{D}_{1},...,\mathcal{D}_{T}\}$, where each task $\mathcal{D}_{t}=\{\bm{x},y\}_{1}^{N_{t}}$ contains $N_{t}$ labeled samples. $y\in\mathcal{C}_{t}$ is the class label of sample $\bm{x}$, where $\mathcal{C}_{t}$ is the set of task-specific classes. Different tasks contain unique classes, and all of the learned classes are denoted as $\mathcal{C}_{1:t}=\bigcup_{k=1}^{t}\mathcal{C}_{k}$. The model is a neural network, consisting of a feature extractor $\bm{z}=h(\bm{x};\bm{\theta})$ and a classifier $f(\bm{z};\bm{W})=\bm{W}\cdot\bm{z}$ for the sample $\bm{x}$. $\bm{z}=[z_{1},z_{2},...,z_{d}]$ is a $d$-dimensional feature vector, $\bm{\theta}$ and $\bm{W}=[\bm{w}_{1},\bm{w}_{2},...,\bm{w}_{c}]$ are learnable parameters, and $\bm{w}_{c}=[w_{1}^{c},w_{2}^{c},...,w_{d}^{c}]$ is a $d$-dimensional prototype vector for class $c$. OCL aims to train a unified model on data seen only once while performing well on both new and old classes.

At the beginning, the model can only access each mini-batch of current samples $\mathcal{B}\subset\mathcal{D}_{t}$ once in the training process of each task. Such a training strategy is known as finetune, where the model learns without any anti-forgetting operations. Based on $\bm{z}=h(\bm{x};\bm{\theta})$, its objective loss function can be denoted as

$$L=E_{(\bm{x},y)\sim\mathcal{B}}\left[-\log\left(\frac{\exp(\bm{w}_{y}\cdot\bm{z})}{\sum_{c\in\mathcal{C}_{1:t}}\exp(\bm{w}_{c}\cdot\bm{z})}\right)\right]. \tag{1}$$

Subsequently, a memory buffer $\mathcal{M}$ is utilized to store a small subset of observed data for replay-based methods, such as ER (Rolnick et al., 2019). To alleviate the forgetting problem, a mini-batch of buffered samples $\mathcal{B}_{\mathcal{M}}\subset\mathcal{M}$ is drawn from the memory buffer and then trained alongside current samples. Therefore, the loss function defined as Equation (1) can be improved to

$$L=E_{(\bm{x},y)\sim\mathcal{B}\cup\mathcal{B}_{\mathcal{M}}}\left[-\log\left(\frac{\exp(\bm{w}_{y}\cdot\bm{z})}{\sum_{c\in\mathcal{C}_{1:t}}\exp(\bm{w}_{c}\cdot\bm{z})}\right)\right]. \tag{2}$$

Finally, for each unknown sample $\bm{x}$, the model categorizes it as the class with the highest prediction probability

$$y^{*}=\mathop{\arg\max}_{c}\frac{\exp(\bm{w}_{c}\cdot\bm{z})}{\sum_{j\in\mathcal{C}_{1:t}}\exp(\bm{w}_{j}\cdot\bm{z})},\quad c\in\mathcal{C}_{1:t}. \tag{3}$$
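The finetune loss, the ER loss, and the prediction rule defined in Equations (1)-(3) can be illustrated with a minimal NumPy sketch. It assumes the features $\bm{z}$ have already been extracted and stores the prototypes as a matrix of shape (classes, dimensions); the function names and the random toy data are hypothetical and are not taken from the released code.

```python
import numpy as np

def softmax_cross_entropy(W, Z, y):
    """Mean cross-entropy as in Eq. (1)/(2).
    W: (C, d) class prototypes, Z: (n, d) features, y: (n,) integer labels."""
    logits = Z @ W.T                                        # inner products w_c . z
    logits = logits - logits.max(axis=1, keepdims=True)     # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(y)), y].mean()

def predict(W, Z):
    """Eq. (3): the class with the highest softmax probability (same argmax as the logits)."""
    return (Z @ W.T).argmax(axis=1)

rng = np.random.default_rng(0)
d, C = 8, 4
W = rng.normal(size=(C, d))
# Toy stand-ins: a current mini-batch B (new classes 2, 3) and a buffered mini-batch B_M (old classes 0, 1).
Z_cur, y_cur = rng.normal(size=(10, d)), rng.integers(2, 4, size=10)
Z_buf, y_buf = rng.normal(size=(10, d)), rng.integers(0, 2, size=10)

loss_finetune = softmax_cross_entropy(W, Z_cur, y_cur)                   # Eq. (1): current samples only
loss_er = softmax_cross_entropy(W, np.vstack([Z_cur, Z_buf]),
                                np.concatenate([y_cur, y_buf]))          # Eq. (2): current and buffered samples
y_star = predict(W, Z_cur)                                               # Eq. (3)
```

In this sketch the only difference between finetune and ER is the set of samples the expectation is taken over, which matches the relation between Equations (1) and (2).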
(a) The decomposed inner-product for buffered samples after learning the first task in a general way.

(b) The decomposed inner-product for buffered samples after learning the second task using finetune (38.9%).

(c) The decomposed inner-product for buffered samples after learning the second task using ER (54.2%).

(d) The decomposed inner-product for buffered samples after learning the second task using our way (62.4%).

(e) The performance in new classes of two tasks when learning in different subspaces by finetune.

(f) The performance on all classes when learning in different feature subspaces by ER.

Figure 2. The analysis results on CIFAR10 with 2 learning tasks when the memory buffer size is 1000.

3.2. Problem Exploration

To further address the forgetting problem of the model, we analyze existing methods, explore their shortcomings, and find corresponding solutions. Specifically, we divide the CIFAR10 dataset that contains 10 classes into two tasks, each containing 5 classes, and conduct OCL analysis experiments. The model used in the experiments is ResNet18 (He et al., 2016), where the dimension of $\bm{z}$ is 512. All analysis results are demonstrated in Figure 2.

The imbalanced data, through gradient descent, results in the model focusing more on features that distinguish new classes and ignoring features that recognize old classes. In the class-incremental scenario, the CF phenomenon of the model is manifested as the biased prediction $\bm{W}\cdot\bm{z}$, where the learned model tends to classify most samples into new classes. Specifically, when training on a sample $\bm{x}$ of class $y$, the gradient of the loss with respect to its feature can be expressed as

$$\frac{\partial L}{\partial\bm{z}}=(p_{y}-1)\bm{w}_{y}+\sum_{c\neq y}p_{c}\bm{w}_{c}, \tag{4}$$

where $p_{c}$ is the softmax probability predicted for class $c$. A gradient descent step therefore moves the feature $\bm{z}$ of the sample closer to the prototype $\bm{w}_{y}$ of class $y$ while pushing it away from the prototypes $\bm{w}_{c}$ of the other classes. As a result, the inner-product $\bm{w}_{y}\cdot\bm{z}$ will be larger than the others $\bm{w}_{c}\cdot\bm{z}$ in the prediction. By decomposing $\bm{w}_{c}\cdot\bm{z}$ into $[w_{1}^{c}z_{1},w_{2}^{c}z_{2},...,w_{d}^{c}z_{d}]$, we find that the value of each dimension represents a certain feature and its importance to the sample. The greater the importance of a feature, the higher its corresponding value, and consequently, the greater its contribution to the prediction. When training the model with Equation (1), all gradients are generated by the new classes. The model can only focus on features that distinguish new classes and ignore features related to old classes. Although this situation can be alleviated using Equation (2), the changing of features related to old classes is inevitable. Since the number of current samples is still higher than the number of buffered samples, the gradient is primarily influenced by new classes within the same feature space.
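As a sanity check on this reasoning, the sketch below compares the standard softmax cross-entropy gradient with respect to the feature, $(p_{y}-1)\bm{w}_{y}$ plus the sum of $p_{c}\bm{w}_{c}$ over all $c\neq y$, against a finite-difference estimate, and also computes the per-dimension decomposition of the inner product. The toy sizes and function names are assumptions for illustration only.

```python
import numpy as np

def loss(W, z, y):
    """Cross-entropy of a single sample with feature z and label y; W is (C, d)."""
    logits = W @ z
    return -(logits[y] - np.log(np.exp(logits).sum()))

def grad_z(W, z, y):
    """Analytic gradient w.r.t. z: (p_y - 1) w_y + sum over c != y of p_c w_c."""
    logits = W @ z
    p = np.exp(logits - logits.max())
    p /= p.sum()
    p[y] -= 1.0          # coefficient (p_y - 1) for the true class, p_c for the others
    return W.T @ p

rng = np.random.default_rng(0)
C, d, y = 4, 6, 1
W = rng.normal(size=(C, d))
z = rng.normal(size=d)

analytic = grad_z(W, z, y)
eps = 1e-6
numeric = np.array([(loss(W, z + eps * e, y) - loss(W, z - eps * e, y)) / (2 * eps)
                    for e in np.eye(d)])

# Decomposed inner product: entry [c, i] is w_i^c * z_i, and each row sums to w_c . z.
contrib = W * z
```

Plotting a row of `contrib` for an old-class prototype against a row for a new-class prototype is one plausible way to obtain per-dimension curves like those described for Figure 2, although the exact plotting procedure used in the paper is not specified here.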

To validate this view, we calculate the decomposed inner-product between the features of buffered samples and the prototypes of old classes (red) and new classes (blue). Figure 2 (a) shows the results after the model learns the first task. It can be seen that the values of the decomposed inner-product for old classes (red) are larger, and the model can recognize most of the buffered samples. Figure 2 (b) and (c) show the results after learning the second task with finetune (Equation (1)) and with ER (Equation (2)), respectively. If the model learns the second task using finetune, the gradient is produced by current samples from new classes. The original features (red) the model learned are changed, and the model pays more attention to the features related to new classes (blue). Hence, most of the buffered samples belonging to old classes are classified into new classes due to the biased prediction. Although existing replay-based methods such as ER can improve the values for old classes (as seen in Figure 2 (c)), the average value for new classes (0.0013) is still higher than the one for old classes (-0.0016). Therefore, simply learning and replaying in the same feature space does not help the model address the forgetting problem. This motivates us to investigate a new question: Why not learn current samples and replay buffered samples in different feature spaces?

3.3. Feasibility Analysis

Analysis for Learning. Learning current samples across the feature whole-space is not necessary. As illustrated in Figure 2 (a) and (b), only a subset of the features are useful for new classes within the feature whole-space. Because $\bm{w}_{c}\cdot\bm{z}=\sum_{i=1}^{d}w^{c}_{i}z_{i}$, the significance of these features is smoothed out by the sum over all dimensions, potentially impacting model performance, especially in larger feature spaces. To address this issue, we introduce a scaling factor to regulate the dimensions of the feature and prototype vectors in the finetune setting. The model’s performance on new classes for each task is reported in Figure 2 (e). The findings show that reducing the dimensions of the feature space with factors of different scales significantly improves the model’s performance on new classes. Meanwhile, the smaller the scale of the feature space, the smaller the fluctuation of the model.

Analysis for Replaying. Replaying buffered samples with a larger feature space than the one for learning can improve the ability of anti-forgetting for the model. As seen in Figure 2 (b) and (c), due to imbalanced data, features related to old classes generally exhibit a phenomenon of weaker importance. With larger feature space, the model can memorize more features associated with old classes in the additional space, further improving the performance of old classes. We use different feature subspaces for the model to learn current samples while replaying in the feature whole-space, and the results are stated in Figure 2 (f). The results demonstrate that with a higher dimensional feature space, the accuracy of the model on old classes has been effectively improved. As the scale changes, there is little room for improvement in this performance. This is because, for old data, too many features are not necessary either. Furthermore, we also calculate its decomposed inner-product for buffered samples when the scale is 0.3 and report the results in Figure 2 (d). As seen in the left part, these features (indexes 1-150) are used for distinguishing new classes, since the values for new classes are larger. In the right part, these additional features (indexes 151-512), which are used to replay buffered samples, tend to correctly categorize most buffered samples as old classes.

Summary of Analyses. After conducting these analyses, we have synthesized our key findings as follows: (1) Learning current samples and replaying buffered samples within the same feature space is adverse to overcoming the forgetting problem. (2) Learning current samples within a feature subspace is sufficient to ensure the generalization ability of the model. (3) Replaying buffered samples within a larger feature space can leverage more features associated with old classes, thereby improving the anti-forgetting ability of the model. Therefore, adopting a strategy of learning within a feature subspace while replaying within a larger feature space presents a viable approach to enhancing the model’s performance.

4. Methodology

Motivated by these discoveries, we develop a novel OCL framework called experience replay with feature subspace learning (ER-FSL). As shown in Figure 3, our framework consists of a CNN-based feature extractor and a classifier. The entire workflow can be divided into the following three modules.

4.1. Memory Buffer Module

The setting of a memory buffer ($\mathcal{M}$) is critical for the model’s performance in OCL. First, the size of the memory buffer is fixed throughout the entire training process of OCL. Second, reservoir sampling is used to screen current samples and determine whether they are stored in the memory buffer. As a random sampling algorithm, it extracts a portion of samples from a large set while ensuring that the probability of selecting each sample is equal. Third, random sampling retrieves buffered samples from the buffer for replaying.
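The buffer behaviour described above can be sketched with the classic reservoir sampling procedure (Algorithm R) plus uniform random retrieval. This is a minimal illustration using only the Python standard library; the function names `reservoir_update` and `retrieve` are hypothetical, and the released implementation may organize the buffer differently.

```python
import random

def reservoir_update(buffer, capacity, sample, n_seen, rng):
    """Offer one stream sample to a fixed-size buffer.
    n_seen is the number of stream samples observed before this one.
    Every sample seen so far ends up stored with equal probability."""
    if len(buffer) < capacity:
        buffer.append(sample)               # buffer not full yet: always store
    else:
        j = rng.randint(0, n_seen)          # uniform over n_seen + 1 positions (inclusive bounds)
        if j < capacity:
            buffer[j] = sample              # overwrite a stored sample
    return n_seen + 1

def retrieve(buffer, batch_size, rng):
    """Uniformly sample a replay mini-batch without replacement."""
    return rng.sample(buffer, min(batch_size, len(buffer)))

rng = random.Random(0)
memory, n_seen = [], 0
for sample in range(100):                   # integers stand in for (x, y) pairs
    n_seen = reservoir_update(memory, 10, sample, n_seen, rng)
replay_batch = retrieve(memory, 5, rng)
```

Because the replacement probability shrinks as more samples are seen, the buffer stays an unbiased sample of the whole stream without ever knowing the stream length in advance, which suits the single-pass OCL setting.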

4.2. Continual Training Module

The training phase of ER-FSL plays a crucial role in maintaining the model’s generalization ability and anti-forgetting capability. The model can not only quickly learn novel knowledge from current samples, but also ensure the retention of historical knowledge using buffered samples as much as possible. Its objective function is

$$L_{ER-FSL}=(1-\gamma)L_{c}+\gamma L_{b}, \tag{5}$$

where $\gamma$ is a scale factor to balance a learning component $L_{c}$ and a replaying component $L_{b}$. It encompasses a balanced optimization approach that incorporates learning novel knowledge while preserving historical knowledge.

The learning component $L_{c}$ is a general cross-entropy loss function only associated with current samples. It learns current samples of new classes using a feature subspace, where the novel knowledge of distinguishing new classes is saved in the subspace. Its loss function can be denoted as

$$L_{c}=E_{(\bm{x},y)\sim\mathcal{B}}\left[-\log\left(\frac{\exp(\bm{w}_{y}^{s}\cdot\bm{z}^{s})}{\sum_{c\in\mathcal{C}_{1:t}}\exp(\bm{w}_{c}^{s}\cdot\bm{z}^{s})}\right)\right]. \tag{6}$$

Here, $\bm{z}^{s}=\bm{z}\cdot\bm{S}$ is the embedding of sample $\bm{x}$ and $\bm{w}_{c}^{s}=\bm{w}_{c}\cdot\bm{S}$ is the prototype of class $c$ in the feature subspace (the blue elements in Figure 3). $\bm{S}$ is the diagonal matrix defined as

$$\bm{S}=\begin{bmatrix}s_{11}&\cdots&0\\ \vdots&\ddots&\vdots\\ 0&\cdots&s_{dd}\end{bmatrix},\quad s_{ii}=\begin{cases}1, & i\in[(t-1)k,\,tk)\\ 0, & \text{otherwise}\end{cases}. \tag{7}$$

It means that ER-FSL divides a $d$-dim feature space into $T$ $k$-dim subspaces, where the $t$-th subspace is used to learn task $t$.
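A minimal sketch of the subspace selection in Equation (7) follows, assuming 0-based dimension indices and a 1-based task index $t$, so that task $t$ owns dimensions $(t-1)k$ through $tk-1$. Only the diagonal of $\bm{S}$ is stored, since multiplying by a diagonal 0/1 matrix is the same as an element-wise mask; the function name is hypothetical.

```python
import numpy as np

def subspace_mask(t, k, d):
    """Diagonal of S in Eq. (7): ones on the k dimensions owned by task t (1-based)."""
    mask = np.zeros(d)
    mask[(t - 1) * k : t * k] = 1.0
    return mask

d, k = 12, 4
z = np.arange(1.0, d + 1)                 # toy feature vector
s = subspace_mask(t=2, k=k, d=d)          # task 2 owns dimensions 4..7
z_s = z * s                               # same result as z @ np.diag(s)
```

With `d = 12` and `k = 4` there are three blank subspaces, so a fourth task would have no unused dimensions left, which is the situation the subspace reuse mechanism is meant to handle.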

Figure 3. The overview of our ER-FSL framework.

However, the overall size of the model’s feature space is typically fixed, even as the number of new tasks increases. After a certain number of new tasks, the model cannot allocate a blank subspace for learning additional tasks. Hence, it is necessary to select a portion of space from the previously learned space for new tasks.

Based on the classifier $\bm{W}\cdot\bm{z}$, the contribution of the parameters $\bm{W}$ for the $d$-th dimension is $[w_{d}^{1}z_{d},w_{d}^{2}z_{d},...,w_{d}^{c}z_{d}]$. This equals $[w_{d}^{1},w_{d}^{2},...,w_{d}^{c}]\cdot z_{d}$, where $[w_{d}^{1},w_{d}^{2},...,w_{d}^{c}]$ is the $d$-th dimension of $\bm{W}$. If the variance of $[w_{d}^{1},w_{d}^{2},...,w_{d}^{c}]$ is larger, the features on this dimension can provide richer information to distinguish between sample classes. It means that the subspaces on dimensions with small variances contribute less and can be selected for learning new tasks. Hence, when there is no blank subspace, the subspace reuse mechanism selects the subspace as

$$\bm{S}=\begin{bmatrix}s_{11}&\cdots&0\\ \vdots&\ddots&\vdots\\ 0&\cdots&s_{dd}\end{bmatrix},\quad s_{ii}=\begin{cases}1, & i\in\mathcal{K}\\ 0, & \text{otherwise}\end{cases}, \tag{8}$$

where $\mathcal{K}$ is a subset of feature space indexes. For every element $\mathcal{K}[j]$ ($j\in[1,k]$) of $\mathcal{K}$, the $\mathcal{K}[j]$-th dimension of $\bm{W}$ is among the $k$ dimensions with the smallest variance.
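The reuse rule can be sketched as follows: compute, for each feature dimension, the variance of the classifier weights across classes, then pick the $k$ dimensions with the smallest variance. The sketch assumes $\bm{W}$ is stored with shape (classes, dimensions) and that the variance is taken over all rows of $\bm{W}$; whether the original implementation restricts it to already-seen classes is not stated, so treat that as an open assumption. The function name is hypothetical.

```python
import numpy as np

def reuse_mask(W, k):
    """Diagonal of S in Eq. (8): ones on the k dimensions of W with the smallest variance.
    W has shape (C, d); the variance is computed across the C class prototypes."""
    variances = W.var(axis=0)                 # one variance per feature dimension
    selected = np.argsort(variances)[:k]      # index set K
    mask = np.zeros(W.shape[1])
    mask[selected] = 1.0
    return mask

# Toy prototypes for 2 classes and 4 dimensions: column i holds [-a_i, +a_i],
# so its variance is a_i ** 2 and dimensions 1 and 3 are the least informative.
W = np.outer(np.array([-1.0, 1.0]), np.array([3.0, 0.1, 2.0, 0.2]))
s = reuse_mask(W, k=2)
```

Dimensions whose weights barely differ between classes contribute almost the same amount to every logit, so overwriting them is expected to cost little discriminative information.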

The replaying component $L_{b}$ is also a general cross-entropy loss function only related to buffered samples. It replays buffered samples using an accumulated feature space, which consists of all learned feature subspaces. The model uses the additional space to store features related to old data, improving the model’s memory level of old knowledge. The loss function is denoted as

$$L_{b}=E_{(\bm{x},y)\sim\mathcal{B}_{\mathcal{M}}}\left[-\log\left(\frac{\exp(\bm{w}_{y}^{a}\cdot\bm{z}^{a})}{\sum_{c\in\mathcal{C}_{1:t}}\exp(\bm{w}_{c}^{a}\cdot\bm{z}^{a})}\right)\right], \tag{9}$$

where $\bm{z}^{a}=\bm{z}\cdot\bm{A}$ is the embedding of sample $\bm{x}$ and $\bm{w}_{c}^{a}=\bm{w}_{c}\cdot\bm{A}$ is the prototype of class $c$ in the accumulated feature space (the green and blue elements in Figure 3). $\bm{A}$ is the diagonal matrix defined as

$$\bm{A}=\begin{bmatrix}a_{11}&\cdots&0\\ \vdots&\ddots&\vdots\\ 0&\cdots&a_{dd}\end{bmatrix},\quad a_{ii}=\begin{cases}1, & i\in[0,\,tk)\\ 0, & \text{otherwise}\end{cases}. \tag{10}$$
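Putting the learning component, the replaying component, and the balance factor together, the combined objective can be sketched as below. The sketch works on pre-extracted features, represents the diagonal matrices by 0/1 mask vectors, and caps the accumulated space at $d$ dimensions once $tk$ exceeds $d$; all function names and the toy data are hypothetical rather than taken from the released code.

```python
import numpy as np

def masked_cross_entropy(W, Z, y, mask):
    """Cross-entropy where features and prototypes are restricted to mask == 1 dimensions."""
    logits = (Z * mask) @ (W * mask).T
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(y)), y].mean()

def accumulated_mask(t, k, d):
    """Diagonal of A in Eq. (10): ones on dimensions [0, t*k), capped at d."""
    mask = np.zeros(d)
    mask[: min(t * k, d)] = 1.0
    return mask

def er_fsl_loss(W, Z_cur, y_cur, Z_buf, y_buf, s_mask, a_mask, gamma):
    """Eq. (5): (1 - gamma) * L_c + gamma * L_b."""
    L_c = masked_cross_entropy(W, Z_cur, y_cur, s_mask)   # Eq. (6), feature subspace
    L_b = masked_cross_entropy(W, Z_buf, y_buf, a_mask)   # Eq. (9), accumulated space
    return (1.0 - gamma) * L_c + gamma * L_b

rng = np.random.default_rng(0)
d, k, t, C = 12, 4, 2, 4
W = rng.normal(size=(C, d))
Z_cur, y_cur = rng.normal(size=(10, d)), rng.integers(2, 4, size=10)   # new classes
Z_buf, y_buf = rng.normal(size=(10, d)), rng.integers(0, 2, size=10)   # old classes

s_mask = np.zeros(d)
s_mask[(t - 1) * k : t * k] = 1.0          # Eq. (7) for task t
a_mask = accumulated_mask(t, k, d)
total = er_fsl_loss(W, Z_cur, y_cur, Z_buf, y_buf, s_mask, a_mask, gamma=0.5)
```

Note that the subspace of the current task is always contained in the accumulated space, so replay also touches the dimensions being learned while additionally using the dimensions of earlier tasks.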

4.3. Continual Testing Module

The testing component predicts testing samples with the learned model. Similar to the replaying component, each testing sample obtains its class probability distribution in the feature space used so far, i.e., the accumulated feature space. It is then classified as

$$\hat{y}=\mathop{\arg\max}_{c}\frac{\exp(\bm{w}_{c}^{a}\cdot\bm{z}^{a})}{\sum_{j\in\mathcal{C}_{1:t}}\exp(\bm{w}_{j}^{a}\cdot\bm{z}^{a})},\quad c\in\mathcal{C}_{1:t}. \tag{11}$$
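The prediction rule can be sketched in a few lines: mask the feature and the prototypes to the accumulated dimensions, then take the arg max over the classes seen so far. The helper name, the `seen_classes` argument, and the toy numbers are hypothetical and chosen only to show that dimensions outside the accumulated space do not influence the prediction.

```python
import numpy as np

def predict_accumulated(W, Z, t, k, seen_classes):
    """Eq. (11): arg max over seen classes, using only dimensions [0, t*k)."""
    d = W.shape[1]
    a = np.zeros(d)
    a[: min(t * k, d)] = 1.0
    logits = (Z * a) @ (W * a).T                 # w_c^a . z^a for every class
    seen = np.asarray(seen_classes)
    return seen[logits[:, seen].argmax(axis=1)]  # map back to original class ids

# Toy example: 4 classes, d = 6, k = 2, after task t = 2 (dimensions 0..3 are in use).
W = np.zeros((4, 6))
W[0, 0] = 1.0
W[1, 2] = 1.0
W[3, 4] = 10.0                                   # large weight on a dimension not yet in use
Z = np.array([[0.1, 0.0, 1.0, 0.0, 5.0, 0.0]])

pred = predict_accumulated(W, Z, t=2, k=2, seen_classes=[0, 1, 2, 3])
unmasked = (Z @ W.T).argmax(axis=1)              # for comparison: prediction in the whole space
```

Since the softmax is monotonic in the logits, taking the arg max of the masked inner products gives the same class as taking the arg max of the probabilities in Equation (11).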

The process of this framework is described in Algorithm 1. To begin with, a fixed-size memory buffer is used to save current samples (line 10) and replay previous samples (line 5). Then, the continual training module (lines 5-9) overcomes the forgetting problem by learning current samples and replaying previous samples in different feature spaces. Finally, the continual testing module (lines 14-16) predicts unknown instances by the accumulated feature space.

For clarity, we illustrate the subspace and accumulated space for different tasks in Figure 4. For the first task, $\bm{z}^{s}$ and $\bm{z}^{a}$ are shown as the green elements. Similarly, for the second task, $\bm{z}^{s}$ is shown as the blue elements, and $\bm{z}^{a}$ is shown as the concatenation of the green and blue elements. Besides, given the third task, $\bm{z}^{s}$ is shown as the gray elements, and $\bm{z}^{a}$ is shown as the concatenation of the green, blue, and gray elements. Finally, given the fourth task, ER-FSL adopts a subspace reuse mechanism since no new subspace is available for new data. Hence, $\bm{z}^{s}$ is shown as the orange elements.

Algorithm 1 ER-FSL
Input: Dataset $\mathcal{D}=\{\mathcal{D}_{t}\}_{t=1}^{T}$, Learning Rate $\lambda$, Scale $\gamma$
Output: Network Parameters $\bm{\Phi}=\{\bm{\theta},\bm{W}\}$
1:  Initialize: Memory Buffer $\mathcal{M}\leftarrow\{\}$
2:  for $\mathcal{D}_{t}\subset\mathcal{D}$ do
3:     // Continual Training
4:     for $\mathcal{B}\in\mathcal{D}_{t}$ do
5:        $\mathcal{B}_{\mathcal{M}}\leftarrow MemoryRetrieval(\mathcal{M})$
6:        $L_{c}\leftarrow E_{(\bm{x},y)\sim\mathcal{B}}[-\log(\frac{\exp(\bm{w}_{y}^{s}\cdot\bm{z}^{s})}{\sum_{c\in\mathcal{C}_{1:t}}\exp(\bm{w}_{c}^{s}\cdot\bm{z}^{s})})]$
7:        $L_{b}\leftarrow E_{(\bm{x},y)\sim\mathcal{B}_{\mathcal{M}}}[-\log(\frac{\exp(\bm{w}_{y}^{a}\cdot\bm{z}^{a})}{\sum_{c\in\mathcal{C}_{1:t}}\exp(\bm{w}_{c}^{a}\cdot\bm{z}^{a})})]$
8:        $L\leftarrow(1-\gamma)L_{c}+\gamma L_{b}$
9:        $\bm{\theta}\leftarrow\bm{\theta}-\lambda\nabla_{\bm{\theta}}L$
10:        $\mathcal{M}\leftarrow MemoryUpdate(\mathcal{M},\mathcal{B})$
11:     end for
12:     // Continual Testing
13:     $m\leftarrow$ number of unknown instances
14:     for $i\in\{1,2,...,m\}$ do
15:        $\hat{y}\leftarrow\mathop{\arg\max}_{c}\frac{\exp(\bm{w}_{c}^{a}\cdot\bm{z}^{a})}{\sum_{j\in\mathcal{C}_{1:t}}\exp(\bm{w}_{j}^{a}\cdot\bm{z}^{a})},\quad c\in\mathcal{C}_{1:t}$
16:     end for
17:     return $\bm{\Phi}$
18:  end for
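The loss computation in lines 6-8 of Algorithm 1 can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: features are taken as already extracted, the subspace and accumulated space are given as hypothetical `(start, stop)` index ranges, and `weights` holds one classifier row per class seen so far. All function names are illustrative.

```python
import math


def cross_entropy(logits, label):
    """Negative log-softmax probability of `label`, computed stably."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_norm - logits[label]


def logits_in(weights, feature, start, stop):
    """Dot product of each class row with the feature, restricted to [start, stop)."""
    return [sum(w[k] * feature[k] for k in range(start, stop)) for w in weights]


def er_fsl_loss(weights, current, replay, sub, acc, gamma):
    """L = (1 - gamma) * L_c + gamma * L_b.

    current / replay: lists of (feature, label) pairs.
    sub / acc: (start, stop) ranges of the subspace and the accumulated space
    (ASSUMPTION: both are contiguous index ranges).
    """
    l_c = sum(cross_entropy(logits_in(weights, z, *sub), y) for z, y in current) / len(current)
    l_b = sum(cross_entropy(logits_in(weights, z, *acc), y) for z, y in replay) / len(replay)
    return (1.0 - gamma) * l_c + gamma * l_b


# Toy example: two classes, four feature dimensions.
weights = [[2.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
current = [([1.0, 0.0, 0.0, 0.0], 0)]   # learned in the subspace [0, 2)
replay = [([1.0, 0.0, 0.0, 0.0], 1)]    # replayed in the accumulated space [0, 4)
loss = er_fsl_loss(weights, current, replay, sub=(0, 2), acc=(0, 4), gamma=0.5)
```

Setting `gamma` to 0 or 1 recovers the learning-only or replay-only loss, which makes the role of the scale easy to check on toy inputs.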
Figure 4. The feature space of different tasks.
Table 1. Final Accuracy Rate (higher is better). The best scores are in boldface, and the second-best scores are underlined.
Datasets [sample size] | Split CIFAR10 (%) [32×32] | Split CIFAR100 (%) [32×32] | Split MiniImageNet (%) [84×84]
Buffer | 100 | 200 | 500 | 1000 | 500 | 1000 | 2000 | 5000 | 500 | 1000 | 2000 | 5000
IID | 55.4±2.5 | 17.0±0.9 | 14.5±0.9
IID++ (Caccia et al., 2022) | 66.1±2.6 | 27.0±2.5 | 21.2±1.8
FINE-TUNE | 16.6±1.4 | 5.4±0.5 | 4.4±0.4
ER (NeurIPS2019) (Rolnick et al., 2019) | 35.5±2.0 | 38.8±3.4 | 39.9±4.0 | 43.2±5.6 | 12.6±1.4 | 15.7±1.1 | 17.6±1.2 | 16.8±1.5 | 10.6±1.0 | 12.0±1.2 | 13.9±1.2 | 13.9±2.3
GSS (NeurIPS2019) (Aljundi et al., 2019b) | 32.2±3.1 | 37.1±3.6 | 38.9±3.3 | 43.6±2.9 | 12.9±1.3 | 16.1±0.7 | 17.2±0.9 | 17.9±1.2 | 10.4±1.1 | 12.3±1.0 | 14.0±0.8 | 14.6±0.9
MIR (NeurIPS2019) (Aljundi et al., 2019a) | 37.2±3.6 | 41.6±3.9 | 43.5±3.9 | 47.7±4.5 | 14.9±1.1 | 17.3±1.6 | 17.8±1.7 | 18.4±1.3 | 10.9±0.8 | 11.5±0.8 | 14.0±1.7 | 14.1±0.8
ER-WA (CVPR2020) (Zhao et al., 2020) | 36.6±2.4 | 39.2±4.4 | 39.4±4.9 | 42.9±4.4 | 16.9±1.0 | 19.8±1.3 | 19.2±1.7 | 17.8±1.9 | 11.2±1.6 | 13.4±1.3 | 14.5±0.8 | 15.0±1.4
DER++ (NeurIPS2020) (Buzzega et al., 2020) | 39.1±3.1 | 41.9±3.7 | 42.1±4.4 | 45.7±3.0 | 15.4±0.9 | 18.0±1.3 | 18.7±1.9 | 18.7±1.8 | 11.0±1.2 | 11.9±1.5 | 12.0±1.8 | 11.1±1.6
GMED (NeurIPS2021) (Jin et al., 2021) | 34.8±4.1 | 40.3±3.8 | 42.1±3.5 | 46.9±3.2 | 14.7±2.9 | 17.3±2.4 | 20.7±2.1 | 24.1±2.3 | 12.1±1.2 | 13.1±1.3 | 16.4±1.8 | 17.6±1.7
ASER (AAAI2021) (Shim et al., 2021) | 32.8±2.0 | 37.5±3.2 | 41.6±3.7 | 40.8±3.7 | 13.0±0.9 | 15.9±1.5 | 17.5±1.4 | 18.0±0.9 | 9.7±0.7 | 12.1±1.3 | 14.6±1.0 | 14.5±2.0
SS-IL (ICCV2021) (Ahn et al., 2021) | 37.1±2.1 | 42.2±3.3 | 46.2±2.6 | 47.6±2.3 | 21.6±0.6 | 23.0±1.3 | 24.7±1.8 | 24.9±1.2 | 16.7±1.2 | 19.3±1.2 | 20.1±1.6 | 23.3±1.2
SCR (CVPR-W2021) (Mai et al., 2021) | 35.7±2.6 | 48.5±1.9 | 56.1±1.3 | 57.6±2.2 | 11.1±0.4 | 13.9±0.4 | 14.6±1.1 | 15.7±1.0 | 10.3±0.7 | 12.7±1.2 | 14.5±0.3 | 15.9±0.6
ER-DVC (CVPR2022) (Gu et al., 2022) | 32.6±3.3 | 36.1±4.4 | 37.5±4.2 | 40.0±5.6 | 14.4±1.4 | 16.4±1.6 | 18.3±1.3 | 18.4±1.8 | 12.1±0.9 | 13.7±1.4 | 16.0±1.5 | 16.8±2.0
ER-ACE (ICLR2022) (Caccia et al., 2022) | 37.6±2.7 | 43.6±2.1 | 49.7±2.2 | 50.9±3.0 | 17.1±1.1 | 20.8±1.4 | 21.8±1.7 | 23.9±1.4 | 13.7±1.1 | 15.2±1.3 | 17.9±1.3 | 18.3±1.2
OCM (ICML2022) (Guo et al., 2022b) | 48.5±2.2 | 53.0±2.3 | 58.0±2.2 | 61.3±2.8 | 14.3±0.8 | 17.7±1.4 | 21.0±1.5 | 22.7±0.9 | 11.8±0.6 | 13.3±1.5 | 16.8±0.4 | 18.2±1.0
OBC (ICLR2023) (Chrysakis, 2023) | 39.2±1.2 | 45.1±2.2 | 50.5±2.3 | 51.8±2.3 | 18.5±1.2 | 21.5±0.8 | 23.1±1.7 | 23.8±1.6 | 12.3±0.6 | 14.9±1.5 | 17.2±1.8 | 18.3±1.9
PCR (CVPR2023) (Lin et al., 2023b) | 40.9±4.1 | 47.8±2.6 | 52.2±2.9 | 55.8±3.5 | 21.7±0.9 | 25.7±0.9 | 27.6±1.4 | 29.9±0.8 | 17.7±1.0 | 19.5±1.5 | 23.4±1.5 | 25.0±2.4
ER-LODE (NeurIPS2023) (Liang and Li, 2023) | 41.0±1.4 | 46.3±1.5 | 51.5±1.7 | 53.5±2.4 | 18.8±1.4 | 21.6±0.9 | 23.1±1.7 | 24.7±1.7 | 15.0±1.3 | 16.6±1.1 | 18.5±1.6 | 19.5±1.1
ER-CBA (ICCV2023) (Wang et al., 2023c) | 37.9±2.8 | 42.6±3.6 | 44.3±2.3 | 48.6±2.6 | 13.5±0.9 | 17.3±1.1 | 21.3±1.2 | 25.6±0.8 | 12.0±1.1 | 13.8±0.9 | 16.5±0.4 | 18.6±0.7
ER-FSL (Ours) | 46.9±2.7 | 52.7±1.0 | 58.3±1.5 | 61.5±2.1 | 23.3±0.7 | 26.6±0.8 | 29.2±1.1 | 32.1±0.9 | 17.5±0.1 | 21.0±0.9 | 23.6±1.4 | 27.2±1.2
Figure 5. Average accuracy rate on observed learning tasks on three datasets when the memory buffer size is 1000: (a) Split CIFAR10, (b) Split CIFAR100, (c) Split MiniImageNet.

5. Performance Evaluation

5.1. Evaluation Setup

Evaluation Datasets. We conduct experiments on three datasets. Split CIFAR10 (Krizhevsky et al., 2009) is divided into 5 tasks, each comprising 2 classes. Split CIFAR100 (Krizhevsky et al., 2009) and Split MiniImageNet (Vinyals et al., 2016) are both organized into 10 tasks, each consisting of 10 classes.

Evaluation Metrics. Following (Shim et al., 2021), we compute the average accuracy rate $A_{i}$ at the $i$-th task as

(12) $A_{i}=\frac{1}{i}\sum_{j=1}^{i}a_{i,j},$

where $a_{i,j}$ ($j\leq i$) is the accuracy evaluated on the $j$-th task after the network has learned the first $i$ tasks. Over all $T$ tasks, $A_{T}$ is the final accuracy rate.
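The metric in Equation (12) is straightforward to compute from a lower-triangular accuracy matrix. The sketch below uses a hypothetical matrix `acc` with made-up values, where `acc[i - 1][j - 1]` stores $a_{i,j}$; the function name is illustrative.

```python
def average_accuracy(acc, i):
    """A_i = (1 / i) * sum_{j=1..i} a_{i,j}, with 1-based task index i.

    acc[i - 1][j - 1] holds a_{i,j}: accuracy on task j after learning tasks 1..i.
    """
    row = acc[i - 1]
    return sum(row[:i]) / i


# Hypothetical accuracies for a three-task run (values are illustrative only).
acc = [
    [0.9],
    [0.6, 0.8],
    [0.3, 0.5, 0.7],
]

per_task = [average_accuracy(acc, i) for i in range(1, 4)]
final_accuracy = per_task[-1]  # A_T with T = 3
```

The last entry of `per_task` corresponds to the final accuracy rate, while the whole list is the kind of curve plotted against the number of observed tasks.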

Implementation Details. Similar to the recent work (Wang et al., 2023c), we utilize ResNet18 as the feature extractor. All classes in three datasets are shuffled. The model processes 10 current samples alongside 10 buffered samples in each training step. Additionally, we employ a combination of various augmentation operations to generate the augmented samples for all methods. Hyperparameters are selected based on a validation set comprising 10% of the training set. During the training phase, the network, initially randomly initialized, is trained using the SGD optimizer with a learning rate of 0.1.

5.2. Overall Performance

In this section, we conduct experiments to compare the overall performance of ER-FSL with various state-of-the-art baselines. We aim to gain insights into the strengths and weaknesses of ER-FSL.

Table 1 reports the final average accuracy on the three datasets. All reported scores are averaged over 10 runs and given with a 95% confidence interval. The results show that ER-FSL achieves the best overall performance: it attains the best score in 10 of the 12 experimental scenarios, and its advantage is most pronounced on Split CIFAR100 and Split MiniImageNet. For example, ER-FSL outperforms the strongest baseline, PCR, by 1.6, 0.9, 1.6, and 2.2 percentage points on Split CIFAR100 when the memory buffer size is 500, 1000, 2000, and 5000, respectively. ER-FSL is not optimal on Split CIFAR10 when the buffer is small (100 or 200 samples), because this dataset has fewer classes and the buffer holds fewer samples; OCM addresses this issue using additional data augmentation operations.
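For readers who want to reproduce the "mean ± interval" format, the sketch below shows one common way to summarize 10 runs. The section does not say how the 95% confidence interval is computed, so a two-sided Student-t interval is assumed here; the run scores are made up and the function name is hypothetical.

```python
import math
import statistics

# Two-sided 95% Student-t critical value for 9 degrees of freedom (10 runs).
T_CRIT_95_DF9 = 2.262


def mean_and_ci95(scores):
    """Return (mean, half_width) for exactly 10 runs.

    ASSUMPTION: a t-based interval, mean +/- t * s / sqrt(n), with sample std s.
    """
    if len(scores) != 10:
        raise ValueError("this sketch hard-codes the critical value for 10 runs")
    mean = statistics.mean(scores)
    half_width = T_CRIT_95_DF9 * statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, half_width


# Hypothetical final accuracies from 10 runs (illustrative values only).
runs = [25.0, 27.0] * 5
mean, half_width = mean_and_ci95(runs)
summary = f"{mean:.1f}±{half_width:.1f}"
```

A different interval construction (for example, a normal approximation) would change the half-width slightly, which is why the assumption is stated explicitly.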

We also compare ER-FSL with CBA (Wang et al., 2023c) under the experimental setup described in that work, where the learning rate is 0.03 rather than the 0.1 used in our other experiments. As shown in Table 2, ER-FSL achieves higher final accuracy than CBA in all four settings and a markedly lower forgetting rate on Split CIFAR10, while the forgetting rates on Split CIFAR100 are comparable.

Figure 5 shows the performance of several effective approaches after each task on all datasets. ER-FSL outperforms the other baselines for most of the learning process. On Split CIFAR100 and Split MiniImageNet in particular, its advantage becomes increasingly evident as the number of tasks grows. For instance, ER-FSL does not surpass PCR in the first few tasks but achieves the best accuracy in the remaining tasks, as shown in Figure 5 (b) and (c).

Table 2. Final Accuracy Rate (higher is better)/Final Forgetting Rate (lower is better) under the setting in CBA (Wang et al., 2023c).
Datasets | Split CIFAR10 | Split CIFAR100
Buffer | 200 | 500 | 2000 | 5000
CBA | 44.31/27.55 | 49.63/19.99 | 26.90/9.41 | 29.09/8.05
ER-FSL | 51.41/15.51 | 58.43/12.98 | 28.58/9.7 | 30.14/8.25
Table 3. Final Accuracy Rate (higher is better) for ablation study on Split CIFAR100 when the buffer size is 1000.
Index | 1 | 2 | 3 | 4 | 5 | 6 | 7
Setting | ER-FSL | $L_{c}$ | $L_{b}$ | $\hat{y}$ | $\bm{S}$ | Inversion | $\gamma$
All classes | 26.6 | 17.5 | 12.6 | 7.2 | 24.7 | 17.8 | 24.7
New classes | 33.8 | 10.5 | 38.2 | 71.6 | 28.4 | 18.6 | 29.5
Old classes | 25.7 | 18.2 | 9.7 | 0.0 | 24.2 | 17.8 | 24.2
Figure 6. The performance of ER-FSL on Split CIFAR100 (buffer size=1000) with different values of $\gamma$.

5.3. Ablation Study

We conduct ablation experiments to analyze the contribution of the components and design choices in ER-FSL; the results are reported in Table 3. First, the learning component $L_{c}$ is necessary to ensure the model's generalization ability. Without it (Index 2), the model's overall performance decreases significantly. Second, the replaying component $L_{b}$ counteracts forgetting, as the accuracy on old classes is low when it is removed (Index 3). Finally, if the model is tested in the same subspace as $\bm{z}^{s}$, it retains nothing about old data (Index 4). Therefore, the model should predict unknown instances in the accumulated space $\bm{z}^{a}$.

Some subtler settings are also important. First, it is necessary to assign a different feature subspace to each task. If we use a fixed $S$ to select the same subspace for every task, the performance of the model decreases (Index 5). Second, the spaces used for replaying and learning cannot be swapped, since the inverted version (Index 6, replaying in a feature subspace and learning in a larger space) performs worse than the original one (Index 1). Third, the scale $\gamma$ is vital to balance novel and historical knowledge. Without it, Equation (5) becomes $L_{c}+L_{b}$, which does not achieve the best results (Index 7). Moreover, we report the performance of the model with different values of $\gamma$ in Figure 6. The results indicate that a suitable $\gamma$ enables the model to perform better on both old and new data.

Furthermore, we conduct experiments on Split CIFAR100 with different subspace sizes and report the results in Table 4. Firstly, the performance of ER-FSL generally improves as the subspace size increases up to 100, since a larger space can capture more useful features. Secondly, the best subspace size is 100, which triggers the subspace reuse mechanism. This means that the results of ER-FSL in Table 1 could be better; however, for a fair comparison with other methods, we restrict our method to a feature subspace of size 51, so that the 10 tasks fill the entire feature space (size = 512). Thirdly, although the performance tends to decrease when the size increases further, the decline is not significant.
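The point at which reuse becomes necessary follows from simple arithmetic on the numbers above: a feature space of size 512, 10 tasks on Split CIFAR100, and one subspace per task. The sketch below assumes non-overlapping subspaces, so the number of blank subspaces is the integer quotient of the two sizes; the function name is hypothetical.

```python
FEATURE_DIM = 512   # size of the entire feature space
NUM_TASKS = 10      # Split CIFAR100 is organized into 10 tasks


def needs_reuse(sub_size, feature_dim=FEATURE_DIM, num_tasks=NUM_TASKS):
    """True when there are fewer non-overlapping subspaces than tasks.

    ASSUMPTION: subspaces do not overlap, so feature_dim // sub_size are available.
    """
    return num_tasks > feature_dim // sub_size


sizes = [10, 20, 30, 40, 50, 51, 100, 200, 300, 400]
reuse_flags = {size: needs_reuse(size) for size in sizes}
# Size 51 is the largest tested size without reuse: 10 * 51 = 510 <= 512.
```

Under this assumption, every tested size up to 51 fits all 10 tasks without reuse, and sizes from 100 upward do not.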

5.4. Complexity Analysis

The comparison between existing methods and ER-FSL in terms of computation and memory complexity is given in Table 5. We measure computation complexity using $C_{\mathcal{D}}$ (Ghunaim et al., 2023), which is determined by the relative training FLOPs, with the computation complexity of ER set to 1. Since ER-FSL, ER-ACE, and PCR only modify the loss of ER, their computational complexities are also 1. In contrast, OCM relies heavily on massive data augmentation operations, making its relative complexity 16. Memory complexity is measured as relative memory, with the memory complexity of ER set to 1. OCM has a complexity of 2 because it saves an additional model for knowledge distillation, while OBC uses two classifiers, making its complexity greater than 1. The memory complexity of ER-FSL is 1. Thus, ER-FSL matches the ER baseline in both computation and memory cost.

Table 4. The performance of ER-FSL on Split CIFAR100 with different sizes of subspaces (buffer size=1000).
Subspace size | 10 | 20 | 30 | 40 | 50 | 51 (ours) | 100 | 200 | 300 | 400
Subspace reuse | no | no | no | no | no | no | yes | yes | yes | yes
Final accuracy | 25.3 | 26.1 | 26.9 | 27.0 | 26.9 | 26.6 | 27.6 | 27.0 | 25.9 | 24.8
Table 5. The computation and memory complexity.
Metric | ER | ER-ACE | OBC | OCM | PCR | Ours
Computation ($C_{\mathcal{D}}$) | 1 | 1 | 1.5 | 16 | 1 | 1
Memory (Model) | 1 | 1 | >1 | 2 | 1 | 1

6. Conclusion

In this paper, we develop a simple yet effective OCL method called ER-FSL to alleviate catastrophic forgetting (CF). By examining how features change, we find that learning and replaying in the same feature space does not help the model resist forgetting. Based on this observation, we propose ER-FSL, which learns and replays in different feature spaces for OCL. It divides the entire feature space into multiple feature subspaces, where each subspace is used to learn one new task. Meanwhile, it replays previous samples using an accumulated feature space, which consists of all learned feature subspaces. Extensive experiments on three datasets demonstrate the superiority of ER-FSL over various state-of-the-art baselines.

References

  • Ahn et al. (2021) Hongjoon Ahn, Jihwan Kwak, Subin Lim, Hyeonsu Bang, Hyojun Kim, and Taesup Moon. 2021. Ss-il: Separated softmax for incremental learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 844–853.
  • Akyürek et al. (2021) Afra Feyza Akyürek, Ekin Akyürek, Derry Wijaya, and Jacob Andreas. 2021. Subspace Regularizers for Few-Shot Class Incremental Learning. In International Conference on Learning Representations.
  • Aljundi et al. (2019a) Rahaf Aljundi, Eugene Belilovsky, Tinne Tuytelaars, Laurent Charlin, Massimo Caccia, Min Lin, and Lucas Page-Caccia. 2019a. Online Continual Learning with Maximal Interfered Retrieval. Advances in Neural Information Processing Systems 32 (2019), 11849–11860.
  • Aljundi et al. (2019b) Rahaf Aljundi, Min Lin, Baptiste Goujaud, and Yoshua Bengio. 2019b. Gradient based sample selection for online continual learning. Advances in neural information processing systems 32 (2019).
  • Brahma and Rai (2023) Dhanajit Brahma and Piyush Rai. 2023. A Probabilistic Framework for Lifelong Test-Time Adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 3582–3591.
  • Buzzega et al. (2020) Pietro Buzzega, Matteo Boschini, Angelo Porrello, Davide Abati, and Simone Calderara. 2020. Dark experience for general continual learning: a strong, simple baseline. Advances in neural information processing systems 33 (2020), 15920–15930.
  • Caccia et al. (2022) Lucas Caccia, Rahaf Aljundi, Nader Asadi, Tinne Tuytelaars, Joelle Pineau, and Eugene Belilovsky. 2022. New Insights on Reducing Abrupt Representation Change in Online Continual Learning. In International Conference on Learning Representations.
  • Chrysakis (2023) Aristotelis Chrysakis. 2023. Online Bias Correction for Task-Free Continual Learning. In The Eleventh International Conference on Learning Representations.
  • Cui et al. (2021) Bo Cui, Guyue Hu, and Shan Yu. 2021. DeepCollaboration: Collaborative Generative and Discriminative Models for Class Incremental Learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 1175–1183.
  • De Lange et al. (2021) Matthias De Lange, Rahaf Aljundi, Marc Masana, Sarah Parisot, Xu Jia, Aleš Leonardis, Gregory Slabaugh, and Tinne Tuytelaars. 2021. A continual learning survey: Defying forgetting in classification tasks. IEEE transactions on pattern analysis and machine intelligence 44, 7 (2021), 3366–3385.
  • Dong et al. (2021) Jiahua Dong, Yang Cong, Gan Sun, Bingtao Ma, and Lichen Wang. 2021. I3dol: Incremental 3d object learning without catastrophic forgetting. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 6066–6074.
  • Dong et al. (2023) Jiahua Dong, Wenqi Liang, Yang Cong, and Gan Sun. 2023. Heterogeneous Forgetting Compensation for Class-Incremental Learning. arXiv preprint arXiv:2308.03374 (2023).
  • Ghunaim et al. (2023) Yasir Ghunaim, Adel Bibi, Kumail Alhamoud, Motasem Alfarra, Hasan Abed Al Kader Hammoud, Ameya Prabhu, Philip HS Torr, and Bernard Ghanem. 2023. Real-time evaluation in online continual learning: A new hope. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 11888–11897.
  • Gu et al. (2022) Yanan Gu, Xu Yang, Kun Wei, and Cheng Deng. 2022. Not Just Selection, but Exploration: Online Class-Incremental Continual Learning via Dual View Consistency. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 7442–7451.
  • Guo et al. (2022a) Yiduo Guo, Wenpeng Hu, Dongyan Zhao, and Bing Liu. 2022a. Adaptive Orthogonal Projection for Batch and Online Continual Learning. Proceedings of AAAI-2022 2 (2022).
  • Guo et al. (2022b) Yiduo Guo, Bing Liu, and Dongyan Zhao. 2022b. Online Continual Learning through Mutual Information Maximization. In International Conference on Machine Learning. PMLR, 8109–8126.
  • Guo et al. (2023) Yiduo Guo, Bing Liu, and Dongyan Zhao. 2023. Dealing with Cross-Task Class Discrimination in Online Continual Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 11878–11887.
  • Hao et al. (2021) Zhiwei Hao, Yong Luo, Han Hu, Jianping An, and Yonggang Wen. 2021. Data-free ensemble knowledge distillation for privacy-conscious multimedia model compression. In Proceedings of the 29th ACM International Conference on Multimedia. 1803–1811.
  • He and Zhu (2021) Jiangpeng He and Fengqing Zhu. 2021. Online continual learning for visual food classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 2337–2346.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 770–778.
  • Hu et al. (2022) Hexiang Hu, Ozan Sener, Fei Sha, and Vladlen Koltun. 2022. Drinking from a firehose: Continual learning with web-scale natural language. IEEE Transactions on Pattern Analysis and Machine Intelligence (2022).
  • Hu et al. (2023) Zhiyuan Hu, Yunsheng Li, Jiancheng Lyu, Dashan Gao, and Nuno Vasconcelos. 2023. Dense network expansion for class incremental learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 11858–11867.
  • Jiang et al. (2022) Weisen Jiang, James Kwok, and Yu Zhang. 2022. Subspace learning for effective meta-learning. In International Conference on Machine Learning. PMLR, 10177–10194.
  • Jin et al. (2021) Xisen Jin, Arka Sadhu, Junyi Du, and Xiang Ren. 2021. Gradient-based Editing of Memory Examples for Online Task-free Continual Learning. Advances in Neural Information Processing Systems 34 (2021).
  • Krizhevsky et al. (2009) Alex Krizhevsky, Geoffrey Hinton, et al. 2009. Learning multiple layers of features from tiny images. (2009).
  • Liang and Li (2023) Yan-Shuo Liang and Wu-Jun Li. 2023. Loss Decoupling for Task-Agnostic Continual Learning. In Thirty-seventh Conference on Neural Information Processing Systems.
  • Lin et al. (2023a) Huiwei Lin, Shanshan Feng, Baoquan Zhang, Hongliang Qiao, Xutao Li, and Yunming Ye. 2023a. UER: A Heuristic Bias Addressing Approach for Online Continual Learning. In Proceedings of the 31st ACM International Conference on Multimedia. 96–104.
  • Lin et al. (2023b) Huiwei Lin, Baoquan Zhang, Shanshan Feng, Xutao Li, and Yunming Ye. 2023b. PCR: Proxy-based Contrastive Replay for Online Class-Incremental Continual Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 24246–24255.
  • Liu and Mazumder (2021) Bing Liu and Sahisnu Mazumder. 2021. Lifelong and continual learning dialogue systems: learning during conversation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 15058–15063.
  • Lopez-Paz and Ranzato (2017) David Lopez-Paz and Marc’Aurelio Ranzato. 2017. Gradient episodic memory for continual learning. Advances in neural information processing systems 30 (2017).
  • Luo et al. (2023) Zilin Luo, Yaoyao Liu, Bernt Schiele, and Qianru Sun. 2023. Class-incremental exemplar compression for class-incremental learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 11371–11380.
  • Mai et al. (2022) Zheda Mai, Ruiwen Li, Jihwan Jeong, David Quispe, Hyunwoo Kim, and Scott Sanner. 2022. Online continual learning in image classification: An empirical survey. Neurocomputing 469 (2022), 28–51.
  • Mai et al. (2021) Zheda Mai, Ruiwen Li, Hyunwoo Kim, and Scott Sanner. 2021. Supervised Contrastive Replay: Revisiting the Nearest Class Mean Classifier in Online Class-Incremental Continual Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 3589–3599.
  • Masana et al. (2022) Marc Masana, Xialei Liu, Bartłomiej Twardowski, Mikel Menta, Andrew D Bagdanov, and Joost Van De Weijer. 2022. Class-incremental learning: survey and performance evaluation on image classification. IEEE Transactions on Pattern Analysis and Machine Intelligence 45, 5 (2022), 5513–5533.
  • Mirzadeh et al. (2020) Seyed Iman Mirzadeh, Mehrdad Farajtabar, Dilan Gorur, Razvan Pascanu, and Hassan Ghasemzadeh. 2020. Linear Mode Connectivity in Multitask and Continual Learning. In International Conference on Learning Representations.
  • Pelosin et al. (2022) Francesco Pelosin, Saurav Jha, Andrea Torsello, Bogdan Raducanu, and Joost van de Weijer. 2022. Towards exemplar-free continual learning in vision transformers: an account of attention, functional and weight regularization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 3820–3829.
  • Pham et al. (2021) Quang Pham, Chenghao Liu, and HOI Steven. 2021. Continual normalization: Rethinking batch normalization for online continual learning. In International Conference on Learning Representations.
  • Prabhu et al. (2020) Ameya Prabhu, Philip HS Torr, and Puneet K Dokania. 2020. Gdumb: A simple approach that questions our progress in continual learning. In European conference on computer vision. Springer, 524–540.
  • Qi et al. (2022) Daiqing Qi, Handong Zhao, and Sheng Li. 2022. Better Generative Replay for Continual Federated Learning. In The Eleventh International Conference on Learning Representations.
  • Qin et al. (2021) Qi Qin, Wenpeng Hu, Han Peng, Dongyan Zhao, and Bing Liu. 2021. Bns: Building network structures dynamically for continual learning. Advances in Neural Information Processing Systems 34 (2021), 20608–20620.
  • Rolnick et al. (2019) David Rolnick, Arun Ahuja, Jonathan Schwarz, Timothy Lillicrap, and Gregory Wayne. 2019. Experience replay for continual learning. Advances in Neural Information Processing Systems 32 (2019).
  • Shen et al. (2021) Meng Shen, Huaizheng Zhang, Yixin Cao, Fan Yang, and Yonggang Wen. 2021. Missing data imputation for solar yield prediction using temporal multi-modal variational auto-encoder. In Proceedings of the 29th ACM International Conference on Multimedia. 2558–2566.
  • Shim et al. (2021) Dongsub Shim, Zheda Mai, Jihwan Jeong, Scott Sanner, Hyunwoo Kim, and Jongseong Jang. 2021. Online class-incremental continual learning with adversarial shapley value. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 9630–9638.
  • Sun et al. (2023) Zhicheng Sun, Yadong Mu, and Gang Hua. 2023. Regularizing Second-Order Influences for Continual Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 20166–20175.
  • Toldo and Ozay (2022) Marco Toldo and Mete Ozay. 2022. Bring evanescent representations to life in lifelong class incremental learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 16732–16741.
  • Vinyals et al. (2016) Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. 2016. Matching networks for one shot learning. Advances in neural information processing systems 29 (2016), 3630–3638.
  • Wang et al. (2023c) Quanziang Wang, Renzhen Wang, Yichen Wu, Xixi Jia, and Deyu Meng. 2023c. CBA: Improving Online Continual Learning via Continual Bias Adaptor. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 19082–19092.
  • Wang et al. (2023a) Wenjin Wang, Yunqing Hu, Qianglong Chen, and Yin Zhang. 2023a. Task Difficulty Aware Parameter Allocation & Regularization for Lifelong Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 7776–7785.
  • Wang et al. (2023b) Yabin Wang, Zhiheng Ma, Zhiwu Huang, Yaowei Wang, Zhou Su, and Xiaopeng Hong. 2023b. Isolation and impartial aggregation: A paradigm of incremental learning without interference. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 37. 10209–10217.
  • Wang et al. (2020) Zheng Wang, Feiping Nie, Lai Tian, Rong Wang, and Xuelong Li. 2020. Discriminative Feature Selection via A Structured Sparse Subspace Learning Module.. In IJCAI. 3009–3015.
  • Wang et al. (2022) Zhenyi Wang, Li Shen, Le Fang, Qiuling Suo, Tiehang Duan, and Mingchen Gao. 2022. Improving task-free continual learning by distributionally robust memory evolution. In International Conference on Machine Learning. PMLR, 22985–22998.
  • Wei et al. (2023) Yujie Wei, Jiaxin Ye, Zhizhong Huang, Junping Zhang, and Hongming Shan. 2023. Online Prototype Learning for Online Continual Learning. arXiv preprint arXiv:2308.00301 (2023).
  • Xu et al. (2023) Rui Xu, Yong Luo, Han Hu, Bo Du, Jialie Shen, and Yonggang Wen. 2023. Rethinking the Localization in Weakly Supervised Object Localization. In Proceedings of the 31st ACM International Conference on Multimedia. 5484–5494.
  • Yan et al. (2021) Shipeng Yan, Jiangwei Xie, and Xuming He. 2021. Der: Dynamically expandable representation for class incremental learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 3014–3023.
  • Yang et al. (2022) Guanglei Yang, Enrico Fini, Dan Xu, Paolo Rota, Mingli Ding, Moin Nabi, Xavier Alameda-Pineda, and Elisa Ricci. 2022. Uncertainty-aware contrastive distillation for incremental semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2022).
  • Zhang et al. (2022a) Weixia Zhang, Dingquan Li, Chao Ma, Guangtao Zhai, Xiaokang Yang, and Kede Ma. 2022a. Continual learning for blind image quality assessment. IEEE Transactions on Pattern Analysis and Machine Intelligence (2022).
  • Zhang et al. (2022c) Xikun Zhang, Dongjin Song, and Dacheng Tao. 2022c. Hierarchical prototype networks for continual graph representation learning. IEEE Transactions on Pattern Analysis and Machine Intelligence (2022).
  • Zhang et al. (2022b) Yaqian Zhang, Bernhard Pfahringer, Eibe Frank, Albert Bifet, Nick Jin Sean Lim, and Yunzhe Jia. 2022b. A simple but strong baseline for online continual learning: Repeated augmented rehearsal. Advances in Neural Information Processing Systems 35 (2022), 14771–14783.
  • Zhao et al. (2020) Bowen Zhao, Xi Xiao, Guojun Gan, Bin Zhang, and Shu-Tao Xia. 2020. Maintaining discrimination and fairness in class incremental learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 13208–13217.
  • Zhou et al. (2022) Da-Wei Zhou, Qi-Wei Wang, Han-Jia Ye, and De-Chuan Zhan. 2022. A Model or 603 Exemplars: Towards Memory-Efficient Class-Incremental Learning. In The Eleventh International Conference on Learning Representations.
  • Zhu et al. (2021) Fei Zhu, Zhen Cheng, Xu-yao Zhang, and Cheng-lin Liu. 2021. Class-Incremental Learning via Dual Augmentation. Advances in Neural Information Processing Systems 34 (2021).
  • Zhu and Koniusz (2022) Hao Zhu and Piotr Koniusz. 2022. EASE: Unsupervised discriminant subspace learning for transductive few-shot learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 9078–9088.