
SRIL: Selective Regularization for Class-Incremental Learning

Jisu Han   Jaemin Na   Wonjun Hwang
Ajou University
{jisu3709, osial46, wjhwang}@ajou.ac.kr
Abstract

Human intelligence gradually accepts new information and accumulates knowledge throughout the lifespan. However, deep learning models suffer from a catastrophic forgetting phenomenon, where they forget previous knowledge when acquiring new information. Class-Incremental Learning aims to create an integrated model that balances plasticity and stability to overcome this challenge. In this paper, we propose a selective regularization method that accepts new knowledge while maintaining previous knowledge. We first introduce an asymmetric feature distillation method for old and new classes inspired by cognitive science, using the gradient of classification and knowledge distillation losses to determine whether to perform pattern completion or pattern separation. We also propose a method to selectively interpolate the weight of the previous model for a balance between stability and plasticity, and we adjust whether to transfer through model confidence to ensure the performance of the previous class and enable exploratory learning. We validate the effectiveness of the proposed method, which surpasses the performance of existing methods through extensive experimental protocols using CIFAR-100, ImageNet-Subset, and ImageNet-Full. Our code will be available at https://github.com/anonymous.

1 Introduction

The successful development of deep learning has created many businesses and has been applied to the real world. However, many studies on deep learning have been evaluated in limited experimental settings. To achieve human-level goals, changes in real-world environments must be taken into account. In this respect, continual learning has recently been receiving a lot of attention from the artificial intelligence community. Continual learning is a methodology for continuously learning in real-world situations where visual characteristics change, such as robot vision and autonomous driving systems [34]. Although recent studies on continual learning have shown promising results, they still suffer from the problem of losing previous knowledge in the process of learning new knowledge. This phenomenon is called catastrophic forgetting [11, 27], and it results in performance degradation of deep learning models. One of the major concerns of recent continual learning approaches is mitigating the catastrophic forgetting problem to prevent such performance deterioration.

Figure 1: The illustration above shows our simple intuition of class-incremental learning. Our intuition is that in the process of learning a new class, there are new representations to learn (e.g., beak) to improve the overall classification performance, and there are representations that need to be differentiated while preserving previous knowledge (e.g., fabric and eyes). The histogram below shows the change in each channel of the intermediate feature when a model pretrained on 50 classes learns 1 additional class. Unlike previous feature distillation methods, our method yields a few dominant channels for pattern separation.

Class-Incremental Learning (CIL) is a special case of continual learning that seeks to learn new classes that have not been seen before while preserving knowledge of old classes. To this end, recent works have grafted knowledge distillation techniques onto CIL and achieved considerable results. The major approaches for applying knowledge distillation to CIL can be divided into response-based methods [31, 4, 42], relation-based methods [37, 7], and feature-based methods [15, 8, 19]. Both the response-based and relation-based methods transfer the dark knowledge of the old model while considering the different characteristics of the semantic information through asymmetric learning strategies for old and new data [3, 12]. On the other hand, feature-based methods transfer the representation of the old model directly, which makes it difficult to consider the different characteristics of the old and new classes.

In this paper, we introduce a selective regularization method named SRIL that leverages knowledge distillation and weight interpolation. First, we propose an asymmetric feature distillation method called Gradient-based Feature Distillation. Our intuition originates from pattern completion and pattern separation in cognitive science [30, 33]. Pattern completion integrates a new stimulus into memory by treating it as existing knowledge, whereas pattern separation makes a memory distinct by treating it as different from what is already stored. We intend to achieve a realignment of knowledge by adopting an asymmetric learning strategy for the previously learned classes and the newly learned classes in terms of pattern completion and pattern separation. In the process of learning a new class, we utilize the gradients of the knowledge distillation and classification losses to determine whether knowledge distillation is beneficial or harmful. We then generate a mask over the channels of each intermediate feature and use it to apply feature distillation to the channels that conflict between the old and new classes. Selective feature distillation through the masks generated by this process accounts for the different characteristics of the old and new classes by taking an asymmetric strategy. Figure 1 illustrates the difference between symmetric feature distillation and our asymmetric feature distillation: our method allows large feature changes in a few dominant channels for pattern separation and small feature changes in the remaining channels for pattern completion.

Meanwhile, the stability-plasticity dilemma [28, 39], forgetting old knowledge while learning new knowledge or failing to accept new knowledge in order not to forget old knowledge, is a well-known problem in CIL. To deal with this issue, we introduce a Confidence-aware Weight Interpolation, which determines whether the model retains old knowledge well and applies regularization selectively according to that judgment. Specifically, we determine whether the new model maintains previous knowledge through the confidence of the new model on old data. Existing weight interpolation methods are used to improve the generalization performance of a model by approximating an ensemble [41, 40]. In contrast, we keep the new model close to the old model in the weight space so that the upper bound of the loss on the old data does not increase, which ensures model stability.
However, forcing the model to stay close in the weight space can also constrain the learning of new knowledge. Therefore, if the new model's confidence on the old data exceeds a threshold defined relative to the old model's confidence, we remove the regularization to enable exploratory learning and to balance stability and plasticity. Overall, our contributions are as follows:

  • We propose a new asymmetric Gradient-based Feature Distillation (GFD) to consider feature characteristics between old and new classes.

  • We propose Confidence-aware Weight Interpolation (CWI) to strike a balance between stability and plasticity. CWI ensures the stability of the model and enables exploratory learning by operating selectively.

  • We achieve comparable performance to the recent state-of-the-art methods and validate our methods through various experimental settings in CIFAR-100, ImageNet-Subset, and ImageNet-Full.

Figure 2: Overview of the selective regularization framework SRIL. Our framework consists of two methods: GFD and CWI. (i) GFD calculates the cosine similarity between the gradients of the classification loss and the knowledge distillation loss to generate a binary mask that applies knowledge distillation when the gradients point in the same direction and skips it when they point in different directions. (ii) CWI determines whether to apply weight interpolation based on the confidence of the current model on old data.

2 Related Work

Class-Incremental Learning. CIL methods are largely classified into three types according to how they address the problem. (i) Regularization based methods regularize the difference between the parameters of the previous network and the current network to maintain a low loss area for previous tasks [20, 5, 1]. (ii) Dynamic architecture methods continuously add sub-networks or expand the network dynamically through pruning and retraining [32, 43, 17]. (iii) Replay based methods allow a limited amount of memory composed of data from previous tasks. Representative studies transfer the dark knowledge of previously trained networks using knowledge distillation [31, 4, 15] or improve performance through efficient memory management strategies [26, 25].

Knowledge Distillation. Knowledge distillation (KD) approaches have been proposed to improve the performance of lightweight models by transferring knowledge from large models to small models [14]. Similarly, many studies in CIL address the catastrophic forgetting problem by transferring the knowledge of the previous model through KD when learning new data. LwF [23] was the first work to apply KD to CIL, using KD so that the output of the model follows the output of the previously trained model. iCaRL [31] uses an exemplar set to store a small number of old data samples and applies KD to new data to maintain the representation. LUCIR [15] proposes feature distillation in the embedding space rather than on the prediction logits. PODNet [8] applies distillation not only in the embedding space but also to the intermediate features. GeoDL [35] additionally considers the learning path on a sub-dimension in the KD process through a regularization term on the geodesic between features in the embedding space. AFC [19] defines the gradient magnitude of each channel in the intermediate features as its importance and applies differentiated feature distillation to each channel.

Weight Space Ensemble. Ensembling is a basic method used to improve the performance of a model in deep learning. However, existing ensemble methods are accompanied by an increase in model size or additional cost in the inference process. SWA [18] showed that by averaging weights periodically along the trajectory of SGD in a single model, the model achieves better generalization and converges to wide minima. WiSE-FT [41] demonstrated that a weight space ensemble with the zero-shot model improves the robustness of the model during fine-tuning. In a recent study on model soups [40], the performance of the model was improved without additional inference time by applying a weight space ensemble over multiple fine-tuned models. Inspired by these weight space ensemble methods, there have been attempts to apply them in CIL: they either transfer the knowledge of new classes to the old model by updating the old model via an exponential moving average [36], or apply consistent weight interpolation throughout the entire learning process [10].

3 Method

3.1 Problem Definition

CIL aims to improve performance for all classes by starting from learning a limited number of classes and sequentially adding new classes. These sequential stages are called tasks. In the CIL scenario, we have a sequential training dataset $D^{t}=\{x^{t}_{i},y^{t}_{i}\}^{N}_{i=1}$ and an exemplar set $\mathcal{E}^{t}=\{x^{1:t-1}_{j},y^{1:t-1}_{j}\}^{R}_{j=1}$ containing limited data from the previous tasks, where $t$, $x$ and $y$ denote the task, the input image and the label, respectively, and $1{:}t$ denotes the range from the $1$-st task to the $t$-th task. The classes of each task are disjoint in the CIL scenario, i.e., $y^{1:t-1}\cap y^{t}=\varnothing$. Since our goal is to maintain the performance of previous tasks while learning new tasks, the optimization problem is defined as follows:

$$\arg\min_{\theta}\ \mathbb{E}_{(x,y)\sim D^{1:t}}\left[\mathcal{L}\left(x,y;\theta\right)\right]\quad \mathrm{s.t.}\ \ \mathbb{E}_{(x,y)\sim D^{t}\cup\mathcal{E}^{t}}\left[\mathcal{L}\left(x,y;\theta\right)\right]<\epsilon, \qquad (1)$$

where $\mathcal{L}$ is the loss function (e.g., cross-entropy) and $\theta$ denotes the parameters of the model. Since the model converges while learning from the current task data and the exemplar set, there exists a small number $\epsilon$ that upper-bounds this loss.
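For concreteness, the sketch below shows one way such a task stream and exemplar memory could be organized. It is a minimal, illustrative sketch rather than the authors' code: the function and class names are hypothetical, and random sampling stands in for the herding selection used later in the paper.

```python
from collections import defaultdict
import random

def split_classes_into_tasks(num_classes=100, initial=50, per_task=10):
    """Return disjoint class lists: one initial task, then fixed-size increments."""
    classes = list(range(num_classes))
    tasks = [classes[:initial]]
    for start in range(initial, num_classes, per_task):
        tasks.append(classes[start:start + per_task])
    return tasks

class ExemplarMemory:
    """Keeps at most `per_class` stored samples for every class seen so far."""
    def __init__(self, per_class=20):
        self.per_class = per_class
        self.store = defaultdict(list)          # class id -> list of samples

    def update(self, samples_by_class):
        # The paper uses herding selection [31]; random choice stands in here.
        for cls, samples in samples_by_class.items():
            self.store[cls] = random.sample(samples, min(self.per_class, len(samples)))

    def all_samples(self):
        return [(x, cls) for cls, xs in self.store.items() for x in xs]
```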

3.2 Gradient-based Feature Distillation

We use the gradients of the classification and KD losses for pattern completion and separation, inspired by [9, 44]. We consider that each channel contains information about patterns in the features of the model; therefore, we decide whether to apply feature distillation to each channel based on the similarity between the gradients of the classification and KD losses. To achieve this, we build a binary mask $M$ whose activation is determined by the cosine similarity of the two gradients.

$$M_{l,c}=\begin{cases}1, & \mathrm{if}\ \cos(\nabla_{z_{l,c}}\mathcal{L}_{kd},\nabla_{z_{l,c}}\mathcal{L}_{cls})\geq 0\\ 0, & \mathrm{otherwise}\end{cases} \qquad (2)$$

where $\cos(\cdot)$ is the cosine similarity. $\nabla_{z_{l,c}}\mathcal{L}_{kd}$ and $\nabla_{z_{l,c}}\mathcal{L}_{cls}$ are the gradients of the KD and classification losses with respect to channel $c$ of the features $z_{l}$ of the $l$-th layer, respectively. The binary mask $M$ is activated when the directions of the two gradients computed on the new class data agree; if the mask is activated, feature distillation is applied. Using the new class data, pattern completion is achieved through KD when the gradients of the KD and classification losses have the same direction for a channel. On the other hand, applying KD can inhibit classification performance when the two gradients have different directions [9]. Therefore, in this case, we rule out KD to achieve pattern separation. Nonetheless, not regularizing a particular channel at all may harm the stability of the model. Additionally, in the process of pattern separation, maintaining the existing pattern on the old classes supports a clean differentiation from existing knowledge. To achieve this, we apply the complement of the binary mask to the old class data while applying the binary mask to the new class data.

Finally, the objective for our GFD is defined as follows:

$$\mathcal{L}_{gfd}=M\cdot\mathcal{L}_{fd}(x_{new})+\left(1-M\right)\cdot\mathcal{L}_{fd}(x_{old}), \qquad (3)$$

where,

$$\mathcal{L}_{fd}=\sum_{l=1}^{L}\sum_{c=1}^{C}\left\|z^{t}_{l,c}-z^{t-1}_{l,c}\right\|_{F}^{2}, \qquad (4)$$

in which $x_{new}$ and $x_{old}$ are new class data and old class data, respectively. $L$ is the total number of layers and $C$ is the total number of channels in each layer. $\left\|\cdot\right\|_{F}$ is the Frobenius norm and $\mathcal{L}_{fd}$ is a channel-wise feature distillation loss; we normalize each feature for learning stability.
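To make the masking step concrete, the following is a minimal PyTorch-style sketch of how the mask of Eq. (2) and the masked distillation of Eqs. (3)-(4) could be computed; it is not the authors' implementation. The batch layout (an `is_old` flag marking exemplar samples), the assumption that the new-class losses are scalars built from the same graph as the features, and the omission of the per-feature normalization mentioned above are all simplifications for illustration.

```python
import torch
import torch.nn.functional as F

def gfd_loss(feats_t, feats_prev, is_old, kd_loss_new, cls_loss_new):
    """feats_t / feats_prev: lists of intermediate features z_l with shape
    [B, C, H, W] from the current (t) and previous (t-1) models on one batch;
    is_old: bool tensor [B] marking exemplar (old-class) samples;
    kd_loss_new / cls_loss_new: scalar losses computed on the new-class samples."""
    total = feats_t[0].new_zeros(())
    for z_t, z_prev in zip(feats_t, feats_prev):
        # Gradients of the KD and classification losses w.r.t. the feature map (Eq. 2).
        g_kd, = torch.autograd.grad(kd_loss_new, z_t, retain_graph=True)
        g_cls, = torch.autograd.grad(cls_loss_new, z_t, retain_graph=True)
        # Per-channel cosine similarity, pooled over batch and spatial positions.
        sim = F.cosine_similarity(g_kd.transpose(0, 1).flatten(1),
                                  g_cls.transpose(0, 1).flatten(1), dim=1)   # [C]
        mask = (sim >= 0).float()                                            # M_{l,c}
        # Channel-wise squared feature difference (Eq. 4), kept per sample.
        fd = (z_t - z_prev.detach()).pow(2).sum(dim=(2, 3))                  # [B, C]
        fd_new = fd[~is_old].mean(0) if (~is_old).any() else fd.new_zeros(fd.size(1))
        fd_old = fd[is_old].mean(0) if is_old.any() else fd.new_zeros(fd.size(1))
        # Eq. 3: the mask gates new-class samples, its complement gates old-class ones.
        total = total + (mask * fd_new + (1.0 - mask) * fd_old).sum()
    return total
```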

3.3 Confidence-aware Weight Interpolation

Under the stability-plasticity dilemma [28, 39] in CIL, simultaneously improving stability and plasticity is a challenging problem. However, continual learning that balances stability and plasticity is essential for improving overall performance. Regarding stability-plasticity, previous methods [8, 19] have had difficulty ensuring enough stability to yield satisfactory performance improvements, despite performing feature distillation that directly transfers the representation. To overcome these limitations, we interpolate the weights of the old model into the new model, thereby preventing loss changes on the old data and ensuring stability for the previous tasks. In addition, we pursue plasticity through exploratory learning once the stability of the model is sufficiently guaranteed.

Consequently, we introduce a confidence-aware weight interpolation to improve both stability and plasticity. As the model learns the new data, we interpolate the weights of the new model with the weights of the old model to prevent the model from leaving the low-error region due to a large parameter difference from the old model. We update the parameters of the new model $\theta^{t}$ using the parameters of the old model $\theta^{t-1}$, which is expressed as:

$$\theta^{t}\leftarrow\beta\theta^{t}+(1-\beta)\theta^{t-1}, \qquad (5)$$

where $\beta\in[0,1]$ is an interpolation parameter. However, weight interpolation can adversely affect plasticity if it is applied continuously. To solve this problem, we use the confidence on the old data to enable exploratory learning without weight interpolation once the model has sufficient knowledge about the old data. We adaptively adjust the interpolation parameter $\beta$ to apply weight interpolation when the confidence gap between the old model and the new model on the old data is above a threshold, and not to apply it when the gap is below the threshold. Therefore, the interpolation parameter $\beta$ is defined as follows:

$$\beta=\begin{cases}\alpha, & \mathrm{if}\ conf(x_{old};\theta^{t-1})-conf(x_{old};\theta^{t})\geq\delta\\ 1, & \mathrm{otherwise}\end{cases} \qquad (6)$$

in which the confidence is $conf(x;\theta)=\mathbb{E}\left[p_{y}(x;\theta)\right]$ and $\alpha\in[0,1]$ is a hyperparameter, where $p_{y}(x;\theta)$ is the predicted probability of the ground truth $y$ and $\delta$ is the threshold. Since the prediction distribution changes as the number of classes increases, we set $\delta=\lambda_{th}\delta^{t}$, where $\lambda_{th}$ is a hyperparameter and the adaptive factor is $\delta^{t}=\left(n^{t}\right)^{2}/n^{1:t}$, with $n^{t}$ the number of classes in task $t$.
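As a concrete illustration, the snippet below is a minimal PyTorch-style sketch of Eqs. (5)-(6). The confidence estimate, the fixed `delta` argument (which in the paper is scheduled as $\lambda_{th}\delta^{t}$), and the in-place parameter update are illustrative assumptions rather than the authors' implementation; the models are assumed to return logits from their usual forward call.

```python
import torch

@torch.no_grad()
def confidence(model, x_old, y_old):
    """Mean predicted probability of the ground-truth class on old-class data."""
    probs = torch.softmax(model(x_old), dim=1)
    return probs.gather(1, y_old.unsqueeze(1)).mean()

@torch.no_grad()
def confidence_aware_interpolation(new_model, old_model, x_old, y_old,
                                   alpha=0.995, delta=0.05):
    gap = confidence(old_model, x_old, y_old) - confidence(new_model, x_old, y_old)
    beta = alpha if gap >= delta else 1.0                  # Eq. 6
    if beta < 1.0:
        # Eq. 5: theta_t <- beta * theta_t + (1 - beta) * theta_{t-1}
        for p_new, p_old in zip(new_model.parameters(), old_model.parameters()):
            p_new.mul_(beta).add_(p_old, alpha=1.0 - beta)
```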

Figure 3: Difference in confidence over training iterations. Our method pursues stability through weight interpolation during the unstable initial learning phase when new classes are added to the model. In the later learning phase, once performance on previously learned data is stable, it pursues exploratory learning on the new classes by removing the weight interpolation regularization.

Figure 3 shows the difference in confidence between the old and new models on the previous 50 classes while an additional 10 classes are learned from a model trained on 50 classes of CIFAR-100, with CWI applied. In the initial learning phase, prediction performance on the old classes degrades while the new class data are learned, but the confidence gap on the old classes is reduced through feature distillation and weight interpolation. In the later learning phase, when the confidence gap between the two models falls below the threshold, a certain level of confidence is maintained while exploratory learning proceeds on the new class data without weight interpolation.

Algorithm 1 SRIL: Selective Regularization Algorithm
Input: training set $D^{t}$, exemplar set $\mathcal{E}^{t}$, old model's parameters $\theta^{t-1}$, new model's parameters $\theta^{t}$, interpolation parameter $\alpha$, distill factor $\lambda^{t}_{gfd}$, learning rate $\gamma$
Output: $\theta^{t}$
$\theta^{t}\leftarrow\theta^{t-1}$   // initialize the new model
for $e$ in $\{1,\cdots,E\}$ do
     sample a mini-batch $(x,y)\sim D^{t}\cup\mathcal{E}^{t}$
     compute the confidence on the old classes $(x_{old},y_{old})$
     if $conf(x_{old};\theta^{t-1})-conf(x_{old};\theta^{t})\geq\delta$ then
          $\theta^{t}\leftarrow\alpha\theta^{t}+(1-\alpha)\theta^{t-1}$
     end if
     initialize the binary mask $M$ to zero
     compute the gradients on the new classes $(x_{new},y_{new})$
     for $l$ in $\{1,\cdots,L\}$, $c$ in $\{1,\cdots,C\}$ do
          if $\cos(\nabla_{z_{l,c}}\mathcal{L}_{kd},\nabla_{z_{l,c}}\mathcal{L}_{cls})\geq 0$ then
               $M_{l,c}=1$
          end if
     end for
     $\mathcal{L}_{gfd}=M\cdot\mathcal{L}_{fd}(x_{new})+\left(1-M\right)\cdot\mathcal{L}_{fd}(x_{old})$
     $\mathcal{L}_{total}=\mathcal{L}_{cls}+\lambda^{t}_{gfd}\mathcal{L}_{gfd}$
     compute the gradient $\nabla\mathcal{L}_{total}$
     update the parameters $\theta^{t}\leftarrow\theta^{t}-\gamma\nabla\mathcal{L}_{total}$
end for
update the exemplar set from $D^{t}\cup\mathcal{E}^{t}$
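For readers who prefer code, the following is a condensed PyTorch-style rendering of one iteration of Algorithm 1, reusing the `gfd_loss` and `confidence_aware_interpolation` sketches above. The `forward_with_features` method, the plain cross-entropy and KL terms standing in for the LSC and KD losses, and the `is_old` batch flag are assumptions for illustration, not the authors' API.

```python
import torch
import torch.nn.functional as F

def train_step(new_model, old_model, x, y, is_old, optimizer, lambda_gfd_t,
               alpha=0.995, delta=0.05):
    """x, y: one mini-batch sampled from D^t union E^t; is_old marks exemplar samples.
    Models are assumed to expose forward_with_features(x) -> (logits, [z_1, ..., z_L])
    in addition to a standard forward that returns logits."""
    # (i) Confidence-aware weight interpolation on the old-class part of the batch.
    if is_old.any():
        confidence_aware_interpolation(new_model, old_model, x[is_old], y[is_old],
                                       alpha=alpha, delta=delta)

    logits, feats_t = new_model.forward_with_features(x)
    with torch.no_grad():
        logits_prev, feats_prev = old_model.forward_with_features(x)

    cls_loss = F.cross_entropy(logits, y)                     # stands in for L_lsc
    # Losses on the new-class samples only, used to build the GFD mask (Eq. 2).
    cls_new = F.cross_entropy(logits[~is_old], y[~is_old])
    n_old = logits_prev.size(1)                               # classes seen before task t
    kd_new = F.kl_div(F.log_softmax(logits[~is_old, :n_old], dim=1),
                      F.softmax(logits_prev[~is_old], dim=1), reduction='batchmean')

    # (ii) Gradient-based feature distillation (Eqs. 2-4) and total loss (Eq. 7).
    loss = cls_loss + lambda_gfd_t * gfd_loss(feats_t, feats_prev, is_old, kd_new, cls_new)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```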

3.4 Overall Framework

Our SRIL consists of Gradient-based Feature Distillation and Confidence-aware Weight Interpolation. GFD is a KD method that considers the different representational characteristics of the old and new classes. CWI balances stability and plasticity. Our overall loss is as follows:

$$\mathcal{L}_{total}=\mathcal{L}_{lsc}+\lambda^{t}_{gfd}\mathcal{L}_{gfd}, \qquad (7)$$

in which the distill factor is $\lambda^{t}_{gfd}=\lambda_{gfd}\cdot\lambda^{t}$, where $\lambda_{gfd}$ is a hyperparameter and the adaptation factor is $\lambda^{t}=\sqrt{n^{1:t}/n^{t}}$. $\lambda^{t}$ is a widely used adaptation factor in CIL [15, 8, 35, 19]. We also use the local similarity classifier (LSC) following previous works [8, 19]. LSC allows $K$ proxies for each class, and the logit for each class $\mathsf{c}$ is obtained as a weighted average over its $K$ proxies.

$$s_{\mathsf{c},k}=\frac{\exp\left\langle\phi_{\mathsf{c},k},z\right\rangle}{\sum_{i}\exp\left\langle\phi_{\mathsf{c},i},z\right\rangle},\qquad \hat{\mathbf{y}}_{\mathsf{c}}=\sum_{k}s_{\mathsf{c},k}\left\langle\phi_{\mathsf{c},k},z\right\rangle, \qquad (8)$$

where $\phi$ is a classifier weight, and $s_{\mathsf{c},k}$ and $\hat{\mathbf{y}}_{\mathsf{c}}$ denote the probability assigned to each proxy and the prediction for each class, respectively. We use the LSC loss [8], which is based on the NCA loss [29], as follows:

$$\mathcal{L}_{lsc}=\left[-\log\frac{\exp\left(\eta(\hat{\mathbf{y}}_{y}-\varepsilon)\right)}{\sum_{i\neq y}\exp\left(\eta\hat{\mathbf{y}}_{i}\right)}\right]_{+}, \qquad (9)$$

where $\eta$ and $\varepsilon$ are a learnable parameter and a small margin, respectively, and the hinge $[\cdot]_{+}$ denotes $\max(0,\cdot)$.
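To illustrate Eqs. (8)-(9), the following is a minimal PyTorch-style sketch of a local similarity classifier and the hinged NCA-style loss. The L2 normalization of features and proxies (as in PODNet), the proxy count $K=10$, and the margin value are illustrative assumptions, not the exact configuration used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LSCClassifier(nn.Module):
    """K proxies per class; the class logit is a softmax-weighted average of the
    proxy similarities, as in Eq. (8)."""
    def __init__(self, feat_dim, num_classes, k=10):
        super().__init__()
        self.proxies = nn.Parameter(torch.randn(num_classes, k, feat_dim))
        self.eta = nn.Parameter(torch.tensor(1.0))          # learnable scale eta

    def forward(self, z):                                   # z: [B, feat_dim]
        z = F.normalize(z, dim=1)
        phi = F.normalize(self.proxies, dim=2)              # [C, K, D]
        sims = torch.einsum('bd,ckd->bck', z, phi)          # <phi_{c,k}, z>
        weights = sims.softmax(dim=2)                       # s_{c,k}
        return (weights * sims).sum(dim=2)                  # y_hat: [B, C]

def lsc_loss(y_hat, y, eta, eps=0.1):
    """Eq. (9): hinged NCA-style loss with scale eta and small margin eps."""
    true_logit = y_hat.gather(1, y.unsqueeze(1)).squeeze(1)             # y_hat_y
    neg = y_hat.masked_fill(F.one_hot(y, y_hat.size(1)).bool(), float('-inf'))
    loss = -(eta * (true_logit - eps)) + torch.logsumexp(eta * neg, dim=1)
    return loss.clamp(min=0).mean()                                     # hinge [.]_+
```

A training step would then call something like `lsc_loss(classifier(z), y, classifier.eta)` in place of the plain cross-entropy term used in the earlier sketches.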

4 Experiment

To verify the experimental validity, we compare performance with state-of-the-art replay-based methods that address the same problem as ours and use KD. We perform experiments using the small-scale CIFAR-100 dataset and the large-scale ImageNet dataset, both of which are commonly used in CIL. The exemplar set stores 20 fixed samples for each class selected by herding [31]. We report results for both CNN predictions and nearest-mean-of-exemplars classification [31], denoted as CNN and NME, respectively.

Table 1: Comparison of average accuracy (%) between state-of-the-art methods using exemplars and our proposed SRIL on CIFAR-100. The best accuracy is indicated in bold.
CIFAR-100                50 tasks               25 tasks               10 tasks               5 tasks
New classes per task     1                      2                      5                      10
iCaRL [31]               44.20±0.98             50.60±1.06             53.78±1.16             58.08±0.59
BiC [42]                 47.09±1.48             48.96±1.03             53.21±1.01             56.86±0.46
LUCIR (NME) [15]         48.57±0.37             56.82±0.19             60.83±0.70             63.63±0.87
LUCIR (CNN) [15]         49.30±0.32             57.57±0.23             61.22±0.69             64.01±0.91
PODNet (NME) [8]         61.40±0.68             62.71±1.26             64.03±1.30             64.48±1.32
PODNet (CNN) [8]         57.98±0.46             60.72±1.36             63.19±1.16             64.83±0.98
GeoDL [35]               52.28±3.91             60.21±0.46             63.61±0.81             65.34±1.05
DDE [16]                 -                      -                      64.12±1.40             65.42±0.72
AANet [24]               -                      62.31±1.02             64.31±0.90             66.31±0.87
AFC (NME) [19]           62.58±1.02             64.06±0.73             64.29±0.92             65.82±0.88
AFC (CNN) [19]           62.18±0.57             63.89±0.93             64.98±0.87             66.49±0.81
CSCCT [2]                58.80±1.92             61.10±1.12             63.72±1.06             -
SRIL (NME)               \textbf{63.84±0.98}    \textbf{64.87±0.91}    \textbf{66.25±1.16}    \textbf{67.13±1.03}
SRIL (CNN)               63.64±1.02             64.36±1.16             65.11±0.82             66.21±0.89

4.1 Experimental Setup

Dataset and protocol. We evaluate our method on CIFAR-100 [22], ImageNet-Subset [6, 15, 8] and ImageNet-Full [6]. CIFAR-100 consists of 60,000 images of 32×32 pixels over 100 classes, with 500 training images and 100 test images per class. ImageNet is a large-scale classification dataset of 1,000 classes, including 1.28 million training images and a 50k test set. ImageNet-Full uses all classes of ImageNet, and ImageNet-Subset is a dataset that randomly selects 100 of the 1,000 classes. We use the same random seed and class order as previous methods [8, 19] for a fair comparison. Following the CIL setting, we first learn a portion of the full class set and then gradually learn a fixed number of classes per task. Feature-based distillation methods transfer the representation directly, so the representation of the previously learned model is important; therefore, they show better performance than response-based distillation in small-task incremental learning settings, where about half of the total classes are learned in advance and a small number of classes are then added gradually [15, 8, 19].

Implementation details. For reproducibility, we conduct our experiments based on the PODNet [8] code. All experiments are conducted on TITAN Xp GPUs, and two GPUs are used for ImageNet-Full. We follow the experimental setup of PODNet [8]. In the CIFAR-100 experiments, we use the ResNet-32 [13] architecture. We train the model for 160 epochs with SGD with a momentum of 0.9, a batch size of 128, and a weight decay of 0.0005. We use a cosine annealing learning rate scheduler with an initial learning rate of 0.1. The hyperparameters are $\alpha=0.995$, $\lambda_{th}=0.1$, and $\lambda_{gfd}=2$. For ImageNet-Subset and ImageNet-Full, we use the ResNet-18 [13] architecture. We train the model for 90 epochs with SGD with a momentum of 0.9 and a cosine annealing learning rate scheduler with an initial learning rate of 0.1 and 0.05, respectively. We use a batch size of 64 and a weight decay of 0.0001. $\alpha$ and $\lambda_{th}$ are set to 0.999 and 0.1, and $\lambda_{gfd}$ is set to 5 and 7 for ImageNet-Subset and ImageNet-Full, respectively.

4.2 Main Results

CIFAR-100. We compare the average accuracy with the state-of-the-art methods that use exemplars and KD [31, 42, 15, 8, 35, 16, 24, 19, 2] in Table 1. For methods built on another framework, the results of GeoDL [35], DDE [16], AANet [24] and CSCCT [2] are based on PODNet (CNN) [8]. We learn 50 classes in the 0-th task and add a fixed number of classes in each subsequent task. With the NME classifier, we obtain performance improvements of 0.81 to 1.96 percent points in all experimental settings. With the CNN classifier, the 5-task setting is 0.28 percent points below the closest state-of-the-art, but the 10-task, 25-task and 50-task settings show improvements of 0.13 to 1.46 percent points over the state-of-the-art method.

ImageNet-Subset/Full. Unlike CIFAR-100, ImageNet uses only the CNN classifier for all experimental results, based on the empirical observation that the performance of the kNN-based NME classifier is poor due to the large number of classes. For the same reason, previous studies also report only CNN results for ImageNet experiments [8, 19]. Also, for methods that are applied on top of other frameworks, we report the results based on the framework with the best performance. Table 2 shows the results of our ImageNet experiments. On ImageNet-Subset, our method achieves state-of-the-art in all experiments, with performance 0.78 to 1.73 percent points higher than the nearest best performance. On ImageNet-Full, we achieve state-of-the-art by a margin of 0.07 percent points in the 5-task experiment and fall 0.15 percent points below the state-of-the-art in the 10-task experiment.

Table 2: Comparison of average accuracy (%) between state-of-the-art methods and SRIL on ImageNet-Subset and ImageNet-Full. The best accuracy is represented in bold, and the second-best accuracy is underlined.
                     ImageNet-Subset                                                            ImageNet-Full
                     50 tasks            25 tasks            10 tasks            5 tasks             10 tasks            5 tasks
Classes per task     1                   2                   5                   10                  50                  100
iCaRL [31]           54.97               54.56               60.90               65.56               46.72               51.36
BiC [42]             46.49               59.65               65.14               68.97               44.31               45.72
LUCIR [15]           55.44               60.81               65.83               67.07               59.92               64.34
Mnemonics [26]       -                   69.74               71.37               72.58               63.01               64.54
PODNet [8]           62.48               68.31               74.33               75.54               64.13               66.95
DDE [16]             -                   -                   75.41               76.71               64.71               66.42
GeoDL [35]           -                   71.72               73.55               73.87               64.46               65.23
AANet [24]           -                   71.78               75.58               \underline{76.96}   64.85               67.73
CSCCT [2]            -                   68.91               74.35               76.41               -                   -
AFC [19]             \underline{72.08}   \underline{73.34}   \underline{75.75}   76.84               \textbf{67.02}      \underline{68.90}
SRIL (Ours)          \textbf{72.86}      \textbf{75.33}      \textbf{77.28}      \textbf{78.57}      \underline{66.87}   \textbf{68.97}

4.3 Ablation studies

Figure 4: Comparison of CKA similarity with other feature distillation methods.

Representation similarity analysis. Figure 4 shows the centered kernel alignment (CKA) [21] similarity of the intermediate features between the 0-th task model and the 50-th task model trained on CIFAR-100. Both axes refer to the outputs of the intermediate layers to which feature distillation is applied, for the 0-th task model and the 50-th task model, respectively. The diagonal of the heatmap is the CKA similarity for the outputs of the same layer; therefore, the closer the diagonal components are to 1, the higher the representation similarity of the two models. As methods using feature distillation, we compare PODNet [8] and AFC [19] with our proposed method, SRIL. As a result, SRIL has higher CKA similarity than the existing methods. In particular, for the last intermediate feature closest to the embedding, the CKA similarities of PODNet, AFC and SRIL are 0.715, 0.885 and 0.992, respectively. T0 Acc at the top of Figure 4 denotes the accuracy of the 50-th task model on the 0-th task data. The T0 Acc values are 47.60%, 54.90%, and 60.92%, respectively, and SRIL achieves the best accuracy. This confirms that SRIL maintains performance on the old classes and has high stability by preserving the representation.
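For reference, linear CKA between two sets of activations can be computed as in the short sketch below, following the formulation in [21]; the feature matrices are assumed to be pre-extracted activations of the compared layers, flattened to [num_samples, dim].

```python
import torch

def linear_cka(X, Y):
    """Linear CKA between feature matrices X: [n, d1] and Y: [n, d2]."""
    X = X - X.mean(dim=0, keepdim=True)          # center each feature dimension
    Y = Y - Y.mean(dim=0, keepdim=True)
    cross = (Y.t() @ X).norm(p='fro').pow(2)     # ||Y^T X||_F^2
    norm_x = (X.t() @ X).norm(p='fro')
    norm_y = (Y.t() @ Y).norm(p='fro')
    return (cross / (norm_x * norm_y)).item()
```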

Figure 5: Comparison of stability and plasticity with other replay-based methods.

Stability-Plasticity analysis. We analyze the Forgetting Measure (FM) [5] and the Intransigence Measure (IM) [5] to determine how the stability and plasticity of SRIL differ from other feature distillation methods in Figure 5. FM measures stability and is defined as the average difference, over all tasks, between the maximum accuracy achieved by previous models and the accuracy of the current model. A lower FM indicates higher stability. IM measures plasticity and is defined as the average difference in accuracy on new data between a model learned without regularization and the regularized model, over all tasks. A lower IM indicates higher plasticity. As a result, PODNet [8] has strengths in plasticity, while AFC [19] and SRIL have strengths in stability. Due to the stability-plasticity dilemma, no method is best for both FM and IM; our method strikes a good balance between stability and plasticity and achieves the best average accuracy. The stability-plasticity analysis explains the results of the main experiments: our method attains high stability by using two selective regularization methods, feature distillation and weight interpolation, and improves overall performance by maintaining performance on old tasks. As a result, there is a large performance gain in the small-task setting, where classes are added a few at a time, while on ImageNet-Full we achieve performance similar to existing methods.
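As a reference for how these measures are computed, the sketch below follows the definitions in [5] over a matrix of per-task accuracies; the accuracy matrix and the unregularized reference accuracies are assumed to have been collected beforehand, and variable names are illustrative.

```python
def forgetting_measure(acc):
    """acc[l][j]: accuracy (%) on task j after training task l, for l, j in 0..T-1.
    FM averages, over old tasks, the drop from the best earlier accuracy."""
    T = len(acc)
    drops = [max(acc[l][j] for l in range(T - 1)) - acc[T - 1][j] for j in range(T - 1)]
    return sum(drops) / len(drops)

def intransigence_measure(acc, ref):
    """ref[k]: accuracy on task k of a model trained without the regularization terms.
    IM averages the gap between that reference and the regularized model on each new task."""
    T = len(acc)
    return sum(ref[k] - acc[k][k] for k in range(T)) / T
```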

Effect of each component. We perform an ablation study to examine the effectiveness of each component of our method in Table 3. Separated feature distillation (SFD) is a general feature distillation applied separately to the old and new classes, and is identical to GFD without the mask. In the experiments, GFD shows an overall performance improvement when the mask is used, and CWI shows a large performance improvement in the 50-task experiment, which has small tasks. In addition, the highest performance is obtained when the two methods are used together. In particular, in the case of the second column, stability and plasticity are not balanced due to a lack of stability when only feature distillation is used; by balancing stability and plasticity through CWI, we achieve a significant performance gain of 4.43 percent points.

Table 3: Effect of each component with 50 tasks and 5 tasks on CIFAR-100.
                     50 tasks                                  5 tasks
SFD   GFD   CWI      NME                  CNN                  NME                  CNN
                     35.80±2.53           40.02±1.80           46.75±0.87           45.76±0.88
                     62.48±0.50           58.81±0.85           66.07±0.77           65.29±0.69
                     62.89±1.03           59.21±0.67           66.41±1.14           65.34±0.75
                     63.30±0.58           63.27±1.11           66.12±0.83           65.73±1.40
                     \textbf{63.84±0.98}  \textbf{63.64±1.02}  \textbf{67.13±1.03}  \textbf{66.21±0.89}
Figure 6: Sensitivity analysis for each hyperparameter with 5 tasks and 50 tasks on CIFAR-100. The lineplot and barplot represent average accuracy and forgetting measure, respectively.

Hyperparameter sensitivity analysis. In Figure 6, we investigate our hyperparameters $\lambda_{gfd}$ for the gradient-based feature distillation, and $\lambda_{th}$ and $\alpha$ for the confidence-aware weight interpolation on CIFAR-100. As $\lambda_{gfd}$ increases, the forgetting measure decreases while the average accuracy remains similar. FM increases as $\lambda_{th}$ and $\alpha$ increase, which means that stability decreases and plasticity increases with these two hyperparameters. When the two values are infinity and 1, respectively, weight interpolation is not used at all, and the performance deteriorates the most. Our method can control the balance between plasticity and stability through these three hyperparameters. Based on these results, we determine the final hyperparameters.

Table 4: Effect of memory budget on CIFAR-100 with 50 tasks. *Reproduced using the official code.
Memory per class     5                    10                   20                   50
iCaRL [31]           16.44                28.57                44.20                48.29
BiC [42]             20.84                21.97                47.09                55.01
LUCIR (NME) [15]     21.81                41.92                48.57                56.09
LUCIR (CNN) [15]     22.17                42.70                49.30                57.02
PODNet (NME) [8]     48.37                57.20                61.40                62.27
PODNet (CNN) [8]     35.59                48.57                57.98                63.69
AFC (NME) [19]       44.33                57.27                62.33                63.83
AFC (CNN) [19]       44.66                55.87                62.18                65.07
SRIL (NME)           \textbf{54.09±1.45}  \textbf{61.99±0.66}  \textbf{63.84±0.98}  65.31±1.30
SRIL (CNN)           50.23±1.05           60.47±1.31           63.64±1.02           \textbf{65.31±1.01}
Figure 7: t-SNE visualization of the embedded features of the 0-th task data for the model trained up to the 50-th task on CIFAR-100.

Effect of memory budget. We conduct experiments with exemplar sets storing 5, 10, 20, and 50 samples per class to examine the dependence on the memory budget in Table 4. Although our method adopts an asymmetric learning strategy for current task data and the exemplar set, it achieves state-of-the-art for all memory sizes and shows robust performance even for small exemplar sizes. For the most challenging setting with a memory size of 5 per class, we achieve significant performance improvements of 5.72 and 5.57 percent points for NME and CNN, respectively.

Feature visualization. We visualize the embedded features of the 0-th task data for the model trained up to the 50-th task on CIFAR-100 using t-SNE [38] in Figure 7. Compared to previous work [8], our method constructs more cohesive clusters for each class. These results show that our method works successfully with the kNN-based NME classifier and small exemplar sizes.

Effect of initial task size. To investigate the effect of the initial representation, we experiment by reducing the initial task size, the number of classes in the 0-th task, from 50 to 20 in steps of 10 in Table 5. For all experiments, our method achieves state-of-the-art. The good performance even when the initial representation lacks diversity demonstrates that our method is robust to the initial representation of the model.

Table 5: Effect of varying initial task sizes on CIFAR-100 with 1 class per task.
                     80 tasks             70 tasks             60 tasks             50 tasks
Initial task size    20                   30                   40                   50
iCaRL [31]           41.28                43.38                44.35                44.20
BiC [42]             40.95                42.27                45.18                47.09
LUCIR (NME) [15]     40.81                46.80                46.71                48.57
LUCIR (CNN) [15]     41.69                47.85                47.51                49.30
PODNet (NME) [8]     49.03                55.30                57.89                61.40
PODNet (CNN) [8]     47.68                52.88                55.42                57.98
AFC (NME) [19]       51.31                57.05                60.06                62.58
AFC (CNN) [19]       52.90                57.61                60.27                62.18
SRIL (NME)           \textbf{53.21±1.42}  58.30±1.49           61.21±1.38           \textbf{63.84±0.98}
SRIL (CNN)           53.18±1.11           \textbf{58.93±1.84}  \textbf{61.29±1.64}  63.64±1.02

5 Conclusion

In this paper, we proposed a selective regularization method that accepts new knowledge while preserving previous knowledge. We introduced an asymmetric feature distillation approach that uses the gradients of the classification and knowledge distillation losses to decide whether to perform pattern completion or separation. Furthermore, we proposed a confidence-aware weight interpolation method to improve the balance between stability and plasticity in class-incremental learning. We achieved performance competitive with state-of-the-art methods, and extensive experiments demonstrate the effectiveness of our method in a variety of scenarios.

References

  • [1] Rahaf Aljundi, Francesca Babiloni, Mohamed Elhoseiny, Marcus Rohrbach, and Tinne Tuytelaars. Memory aware synapses: Learning what (not) to forget. In Proceedings of the European conference on computer vision (ECCV), pages 139–154, 2018.
  • [2] Arjun Ashok, KJ Joseph, and Vineeth N Balasubramanian. Class-incremental learning with cross-space clustering and controlled transfer. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXVII, pages 105–122. Springer, 2022.
  • [3] Lucas Caccia, Rahaf Aljundi, Nader Asadi, Tinne Tuytelaars, Joelle Pineau, and Eugene Belilovsky. New insights on reducing abrupt representation change in online continual learning. In International Conference on Learning Representations, 2022.
  • [4] Francisco M Castro, Manuel J Marin-Jimenez, Nicolas Guil, Cordelia Schmid, and Karteek Alahari. End-to-end incremental learning. In Proceedings of the European Conference on Computer Vision (ECCV), pages 233–248, 2018.
  • [5] Arslan Chaudhry, Puneet K Dokania, Thalaiyasingam Ajanthan, and Philip HS Torr. Riemannian walk for incremental learning: Understanding forgetting and intransigence. In ECCV, 2018.
  • [6] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009.
  • [7] Songlin Dong, Xiaopeng Hong, Xiaoyu Tao, Xinyuan Chang, Xing Wei, and Yihong Gong. Few-shot class-incremental learning via relation knowledge distillation. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 1255–1263, 2021.
  • [8] Arthur Douillard, Matthieu Cord, Charles Ollion, Thomas Robert, and Eduardo Valle. Podnet: Pooled outputs distillation for small-tasks incremental learning. In Proceedings of the IEEE European Conference on Computer Vision (ECCV), 2020.
  • [9] Yunshu Du, Wojciech M Czarnecki, Siddhant M Jayakumar, Mehrdad Farajtabar, Razvan Pascanu, and Balaji Lakshminarayanan. Adapting auxiliary losses using gradient similarity. arXiv preprint arXiv:1812.02224, 2018.
  • [10] Steven Vander Eeckt et al. Weight averaging: A simple yet effective method to overcome catastrophic forgetting in automatic speech recognition. ICASSP, 2023.
  • [11] I. J. Goodfellow, M. Mirza, D. Xiao, A. Courville, and Y. Bengio. An empirical investigation of catastrophic forgetting in gradient-based neural networks. ArXiv e-prints, dec 2013.
  • [12] Yiduo Guo, Bing Liu, and Dongyan Zhao. Online continual learning through mutual information maximization. In International Conference on Machine Learning, pages 8109–8126. PMLR, 2022.
  • [13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [14] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
  • [15] Saihui Hou, Xinyu Pan, Chen Change Loy, Zilei Wang, and Dahua Lin. Learning a unified classifier incrementally via rebalancing. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
  • [16] Xinting Hu, Kaihua Tang, Chunyan Miao, Xian-Sheng Hua, and Hanwang Zhang. Distilling causal effect of data in class-incremental learning. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, pages 3957–3966, 2021.
  • [17] Ching-Yi Hung, Cheng-Hao Tu, Cheng-En Wu, Chien-Hung Chen, Yi-Ming Chan, and Chu-Song Chen. Compacting, picking and growing for unforgetting continual learning. In Advances in Neural Information Processing Systems, pages 13647–13657, 2019.
  • [18] P Izmailov, AG Wilson, D Podoprikhin, D Vetrov, and T Garipov. Averaging weights leads to wider optima and better generalization. In 34th Conference on Uncertainty in Artificial Intelligence 2018, UAI 2018, pages 876–885, 2018.
  • [19] Minsoo Kang, Jaeyoo Park, and Bohyung Han. Class-incremental learning by knowledge distillation with adaptive feature consolidation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16071–16080, 2022.
  • [20] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proc. of the national academy of sciences, 2017.
  • [21] Simon Kornblith, Mohammad Norouzi, Honglak Lee, and Geoffrey Hinton. Similarity of neural network representations revisited. In International Conference on Machine Learning, pages 3519–3529. PMLR, 2019.
  • [22] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
  • [23] Zhizhong Li and Derek Hoiem. Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
  • [24] Yaoyao Liu, Bernt Schiele, and Qianru Sun. Adaptive aggregation networks for class-incremental learning. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, pages 2544–2553, 2021.
  • [25] Yaoyao Liu, Bernt Schiele, and Qianru Sun. Rmm: Reinforced memory management for class-incremental learning. Advances in Neural Information Processing Systems, 34:3478–3490, 2021.
  • [26] Yaoyao Liu, Yuting Su, An-An Liu, Bernt Schiele, and Qianru Sun. Mnemonics training: Multi-class incremental learning without forgetting. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, pages 12245–12254, 2020.
  • [27] Michael McCloskey and Neal J Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of learning and motivation, volume 24, pages 109–165. Elsevier, 1989.
  • [28] Martial Mermillod, Aurélia Bugaiska, and Patrick Bonin. The stability-plasticity dilemma: Investigating the continuum from catastrophic forgetting to age-limited learning effects, 2013.
  • [29] Yair Movshovitz-Attias, Alexander Toshev, Thomas K Leung, Sergey Ioffe, and Saurabh Singh. No fuss distance metric learning using proxies. In Proceedings of the IEEE international conference on computer vision, pages 360–368, 2017.
  • [30] Randall C O’Reilly and James L McClelland. Hippocampal conjunctive encoding, storage, and recall: Avoiding a trade-off. Hippocampus, 4(6):661–682, 1994.
  • [31] Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl, and Christoph H Lampert. icarl: Incremental classifier and representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2001–2010, 2017.
  • [32] A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, and R. Hadsell. Progressive neural networks. ArXiv e-prints, jun 2016.
  • [33] Adam Santoro. Reassessing pattern separation in the dentate gyrus, 2013.
  • [34] Khadija Shaheen, Muhammad Abdullah Hanif, Osman Hasan, and Muhammad Shafique. Continual learning for real-world autonomous systems: Algorithms, challenges and frameworks. Journal of Intelligent & Robotic Systems, 105(1):9, 2022.
  • [35] Christian Simon, Piotr Koniusz, and Mehrtash Harandi. On learning the geodesic path for incremental learning. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, pages 1591–1600, 2021.
  • [36] Zafir Stojanovski, Karsten Roth, and Zeynep Akata. Momentum-based weight interpolation of strong zero-shot models for continual learning. In NeurIPS 2022 Workshop on Distribution Shifts: Connecting Methods and Applications, 2022.
  • [37] Xiaoyu Tao, Xinyuan Chang, Xiaopeng Hong, Xing Wei, and Yihong Gong. Topology-preserving class-incremental learning. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XIX 16, pages 254–270. Springer, 2020.
  • [38] Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(11), 2008.
  • [39] Pieter Verbeke and Tom Verguts. Learning to synchronize: How biological agents can couple neural task modules for dealing with the stability-plasticity dilemma. PLoS computational biology, 15(8):e1006604, 2019.
  • [40] Mitchell Wortsman, Gabriel Ilharco, Samir Ya Gadre, Rebecca Roelofs, Raphael Gontijo-Lopes, Ari S Morcos, Hongseok Namkoong, Ali Farhadi, Yair Carmon, Simon Kornblith, et al. Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. In International Conference on Machine Learning, pages 23965–23998. PMLR, 2022.
  • [41] Mitchell Wortsman, Gabriel Ilharco, Jong Wook Kim, Mike Li, Simon Kornblith, Rebecca Roelofs, Raphael Gontijo Lopes, Hannaneh Hajishirzi, Ali Farhadi, Hongseok Namkoong, et al. Robust fine-tuning of zero-shot models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7959–7971, 2022.
  • [42] Yue Wu, Yinpeng Chen, Lijuan Wang, Yuancheng Ye, Zicheng Liu, Yandong Guo, and Yun Fu. Large scale incremental learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 374–382, 2019.
  • [43] J Yoon, E Yang, J Lee, and SJ Hwang. Lifelong learning with dynamically expandable networks. In International Conference on Learning Representations, ICLR, 2018.
  • [44] Yichen Zhu and Yi Wang. Student customized knowledge distillation: Bridging the gap between student and teacher. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5057–5066, 2021.