Online Class Incremental Learning on Stochastic Blurry Task Boundary
via Mask and Visual Prompt Tuning

Jun-Yeong Moon^∗ Keon-Hee Park^∗ Jung Uk Kim^† Gyeong-Moon Park
Kyung Hee University, Yongin, Republic of Korea
{moonjunyyy, pgh2874, ju.kim, gmpark}@khu.ac.kr
^∗ Equally contributed
^† Corresponding author

Abstract

Continual learning aims to learn a model from a continuous stream of data, but it typically assumes a fixed number of data samples and tasks with clear task boundaries. In real-world scenarios, however, the number of input samples and tasks changes constantly in a statistical way rather than a static way. Although recently introduced incremental learning scenarios with blurry task boundaries partly address these issues, they still do not fully reflect the statistical properties of real-world situations because they fix the ratio of disjoint and blurry samples. In this paper, we propose a new Stochastic incremental Blurry task boundary scenario, called Si-Blurry, which reflects the stochastic properties of the real world. We find that there are two major challenges in the Si-Blurry scenario: (1) intra- and inter-task forgettings and (2) class imbalance. To alleviate them, we introduce Mask and Visual Prompt tuning (MVP). In MVP, to address the intra- and inter-task forgetting issues, we propose a novel instance-wise logit masking and a contrastive visual prompt tuning loss. Both help our model discern the classes to be learned in the current batch, which in turn consolidates previous knowledge. In addition, to alleviate the class imbalance problem, we introduce a new gradient similarity-based focal loss and adaptive feature scaling, which ease overfitting to the major classes and underfitting to the minor classes. Extensive experiments show that our proposed MVP significantly outperforms existing state-of-the-art methods in our challenging Si-Blurry scenario. The code is available at https://github.com/moonjunyyy/Si-Blurry

1 Introduction

Figure 1: (a) Visualization of the i-Blurry scenario [18].

Continual learning involves constantly learning from a stream of data while having limited access to previously seen information. In this scenario, unlike humans, who can retain and apply their prior knowledge to new situations, modern deep neural networks suffer from catastrophic forgetting [30, 11]. To overcome this challenge, various approaches are being explored [32, 31, 13, 21, 8]. However, these traditional continual learning scenarios assume clear task boundaries, so that tasks can be told apart from the input data, which is not the case in the real world. In real-world applications, clear task boundaries are often absent and access to data is limited to small portions at a time. This is referred to as online learning with blurry task boundaries [3].

There are many cases in which classes emerge or disappear, such as in the stock market or e-commerce. To address this, the i-Blurry scenario [18] has recently been proposed, which combines disjoint continual learning and blurry task-free continual learning. Although i-Blurry somewhat alleviates the above issue, it does not fully capture the complexity of real-world data, because it keeps a fixed number of classes in every task. Figure 1(a) shows the i-Blurry scenario, which contains a static number of classes in each task. In real-world scenarios, the number of classes and tasks varies dynamically, as illustrated in Figure 1(c) and 1(d). That is, as samples of a specific class continuously disappear from or appear in the data stream, the data distribution changes dynamically.

To reflect the dynamic distribution of real-world data, we propose a novel Stochastic incremental Blurry (Si-Blurry) scenario. We adopt a stochastic approach to imitate the chaotic nature of the real world. As shown in Figure 1(b), our Si-Blurry scenario can simulate not only newly emerging or disappearing data but also irregularly changing data distributions. In the Si-Blurry scenario, we find two main reasons for performance degradation: (1) intra- and inter-task forgettings, and (2) class imbalance. First, the continuous change of classes between batches causes intra- and inter-task forgettings, which make it difficult for the model to retain previously learned knowledge. Second, ignorance of minor classes and overfitting to major classes worsen the class imbalance problem in the Si-Blurry scenario. Minor-class ignorance arises when the few samples that belong to minor classes receive insufficient consideration during training, and overfitting to major classes biases the model toward the large number of samples that belong to major classes or disjoint classes.

To deal with the aforementioned problems, we propose a novel online continual learning method called Mask and Visual Prompt tuning (MVP). We propose instance-wise logit masking and contrastive visual prompt tuning loss to alleviate intra- and inter-task forgettings by making classification easier and allowing prompts to learn the knowledge for each task effectively. Moreover, we propose a gradient similarity-based focal loss to prevent the problem of minor class ignorance. This method boosts learning of the ignored samples of minor classes in a batch, so that the samples for minor classes can be considered intensively. We also propose adaptive feature scaling to address the problem of overfitting to major classes. This method measures the marginal benefit [7] of learning from a sample and prevents our model from learning already sufficiently trained samples.

We summarize our main contributions as follows:

• We introduce a new incremental learning scenario, coined Si-Blurry, which simulates a more realistic continual learning setting in which neural networks continually learn new classes online while the task boundary varies stochastically.

• We propose an instance-wise logit masking and a contrastive visual prompt tuning loss to protect the model from intra-task and inter-task forgettings.

• To solve the class imbalance problem, we propose a new gradient similarity-based focal loss and adaptive feature scaling, which address minor-class ignorance and overfitting to major classes, respectively.

• Experimentally, our method significantly outperforms existing methods on CIFAR-100, Tiny-ImageNet, and ImageNet-R, which supports that it addresses the problems of Si-Blurry.

2 Related Work

2.1 Disjoint Continual Learning

Class Incremental Learning (CIL) [12] assumes that each task contains distinct classes that do not overlap with those of other tasks and that a class observed once in a task never appears again in subsequent tasks. CIL methods are categorized into (1) regularization-based, (2) replay-based, (3) parameter isolation, and (4) prompt-based methods. Regularization-based methods use previous knowledge to regularize the network while training on new tasks [23, 17, 4, 27, 42]. Replay-based methods store a few samples from old tasks and replay them in the new task to mitigate catastrophic forgetting [34, 38, 2, 6, 29]. Parameter isolation methods expand the network or allocate a sub-network within a single network to each task [37, 43, 1, 36, 44]. Prompt-based methods, originally proposed in natural language processing (NLP) for transfer learning, attach a set of learnable parameters, named prompts, to a frozen pre-trained model [40, 39, 15].

2.2 Blurry Continual Learning

Blurry Continual Learning [33, 3] assumes that no new classes appear after the first task, even though classes overlap across tasks. A blurry setup has three requirements. First, each task is streamed sequentially. Second, the major class of each task is different. Last, a model can leverage only a small portion of data from the previous task. A blurry setup seems realistic, but it is hard to apply to real-world scenarios, where observing new classes is commonplace. i-Blurry [18] proposes a more realistic setting that combines a blurry scenario with a class-incremental setting. However, the i-Blurry scenario is also limited in reflecting real-world scenarios because (1) the same number of new classes appears in every task, and (2) new classes and blurry classes have the same proportion in every task. This is why we propose a stochastic incremental blurry scenario, which focuses on the stochastic property critical to real-world scenarios.

2.3 Class Imbalance in Continual Learning

Class imbalance, also known as the long-tail problem, means that classes are not represented equally in a classification task. Class imbalance is common in the real world and can degrade prediction performance in classification problems. In continual learning, replay-based methods suffer severe catastrophic forgetting due to the imbalance between stored old samples and streamed new samples [42]. To address this problem, existing methods use gradient information to retain the knowledge of prior tasks during training [29, 2], manage episodic memory to enhance model performance by sampling effective samples [5, 28, 41], or calibrate the bias [42]. However, in a blurry scenario, the number of incoming samples differs per class, which causes a class imbalance problem. Class imbalance within a task biases the model toward disjoint classes and major classes, which hampers the training of the minority classes.

3 Stochastic Incremental Blurry Scenario

3.1 Scenario Configuration

In a real-world scenario, the quantity of input data and the number of tasks tend to change in a stochastic manner. To simulate this, we propose the Stochastic incremental-Blurry (Si-Blurry) scenario. Following [18], we divide the classes into two categories using the disjoint class ratio: disjoint classes and blurry classes. As shown in Figure 2, we randomly assign each blurry class to a blurry task ( $T^{B}$ ) and each disjoint class to a disjoint task ( $T^{D}$ ). In blurry tasks, we gather a portion of the samples given by the blurry sample ratio and randomly distribute it across the tasks. This makes the classes of the tasks overlap, which blurs the explicit task boundary. Each task with a stochastic blurry task boundary, $T^{B+D}$ , consists of $T^{B}$ and $T^{D}$ . Figure 1(b) shows an example of the distribution of Si-Blurry. Because the Si-Blurry task is stochastic, the batches become more diverse and imbalanced. The resulting lack of explicit task boundaries and the data imbalance pose significant challenges to existing continual learning methods. We define two problems that are exacerbated in Si-Blurry in the following subsections.
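The scenario configuration can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the authors' sampler: the function name `si_blurry_split`, the uniform random choice of a home task for every class, and the per-sample reassignment of the blurry portion are hypothetical choices made here for illustration.

```python
import random
from collections import defaultdict


def si_blurry_split(labels, num_tasks, disjoint_class_ratio,
                    blurry_sample_ratio, seed=0):
    """Assign each sample index to a task (hypothetical sketch).

    labels: list of integer class labels, one per sample.
    Returns (task_of_sample, disjoint_classes, blurry_classes).
    """
    rng = random.Random(seed)
    classes = sorted(set(labels))
    rng.shuffle(classes)

    # Split the classes into disjoint and blurry groups by the disjoint class ratio.
    n_disjoint = round(len(classes) * disjoint_class_ratio)
    disjoint_classes = set(classes[:n_disjoint])
    blurry_classes = set(classes[n_disjoint:])

    # Assumption: every class draws its home task uniformly at random, so the
    # number of classes per task is itself random (and may be zero).
    home_task = {c: rng.randrange(num_tasks) for c in classes}
    task_of_sample = [home_task[y] for y in labels]

    # Gather a blurry_sample_ratio portion of the blurry-class samples and
    # scatter it over random tasks, which blurs the task boundaries.
    blurry_idx = [i for i, y in enumerate(labels) if y in blurry_classes]
    rng.shuffle(blurry_idx)
    n_moved = round(len(blurry_idx) * blurry_sample_ratio)
    for i in blurry_idx[:n_moved]:
        task_of_sample[i] = rng.randrange(num_tasks)

    return task_of_sample, disjoint_classes, blurry_classes


if __name__ == "__main__":
    labels = [c for c in range(10) for _ in range(20)]  # 10 classes x 20 samples
    tasks, disjoint, blurry = si_blurry_split(labels, 5, 0.5, 0.1)
    per_task = defaultdict(set)
    for t, y in zip(tasks, labels):
        per_task[t].add(y)
    for t in sorted(per_task):
        print(t, sorted(per_task[t]))
```

In this sketch a disjoint class stays within a single task, while a blurry class may leak into any task, so both the number of classes per task and the class proportions vary from run to run.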

3.2 Intra- and Inter-Task Forgettings

First, intra-task forgetting can be interpreted as inter-batch forgetting. This problem is also present in joint training but is largely addressed by randomizing the sample order in each batch. However, in online learning scenarios, each sample is presented only once, making it impossible to apply this strategy. Additionally, the stochastic nature of Si-Blurry creates more diverse batches, thereby intensifying the issue of intra-task forgetting. Intra-task forgetting can severely limit the ability of the model to learn and generalize well in the face of evolving and dynamic data distributions.

Inter-task forgetting, which refers to the phenomenon of losing previously learned knowledge of a task due to changes in class distribution, is a major challenge in online continual learning. Si-Blurry, which simulates the complexity of real-world data in a stochastic manner, does not have clear task boundaries like those in conventional continual learning, which makes this problem difficult to handle. The stochastic change in class distribution in Si-Blurry exacerbates it. Effective strategies are needed that enable the model to learn continually from data with varying class distributions without losing previously acquired knowledge or suffering catastrophic forgetting.

3.3 Class Imbalance

Minor-class ignorance and overfitting to major classes cause the class imbalance problem. Minor-class ignorance arises when the number of samples per class in a batch varies, which results in an imbalanced weighting of the loss. This loss imbalance leads the model to neglect the minority samples in a batch, so that they are poorly represented during training. Overfitting to major classes is, conversely, a phenomenon where a class with a large number of samples degrades model performance because the model acquires knowledge that does not contribute to generalization.

These are particularly important issues in the context of Si-Blurry, where no explicit task boundaries exist, and continual learning requires a model to capture the knowledge from a wide range of samples. Finding a solution to this problem is essential for ensuring that the model remains effective in recognizing and classifying all samples, regardless of the number of samples per class, and that it continues to learn and improve over time.

4 Mask and Visual Prompt Tuning (MVP)

4.1 Preliminary and Problem Formulation

In the Si-Blurry scenario, the model learns from only a few samples at a time because it cannot access the whole current training data. Given the accessible data $\mathcal{B}=\left\{\mathbf{x_{i}},y_{i}\right\}_{i=1}^{N}$ , where $\mathbf{x_{i}}\in\mathcal{X}$ , $y_{i}\in\mathcal{Y}$ , we reshape the samples into flattened patches in $\mathbb{R}^{L\times(S^{2}\times C)}$ and feed them into the frozen pre-trained model $f:\mathbb{R}^{L\times(S^{2}\times C)}\to\mathbb{R}^{L\times D}$ , where $L$ , $S$ , $C$ , and $D$ represent the token length, patch size, number of channels, and embedding dimension, respectively. A linear classifier $W\in\mathbb{R}^{D\times\left|\mathcal{Y}\right|}$ is trainable.

Previous studies demonstrate the effectiveness of using knowledge from a pre-trained model and tuning a small number of parameters [40, 39] for continual learning. To this end, we adopt the prompt tuning method [22] for online continual learning. Similarly to DualPrompt [39], we utilize a pre-trained Vision Transformer (ViT) as a feature extractor for the query. We match the query with the key to apply the contrastive visual prompt tuning loss and select the prompt.

4.2 Instance-wise Logit Masking

Existing prompt methods assume the explicit task boundary and require information on the task boundary for training, which is not feasible in Si-Blurry, where no explicit task boundary exists. The cross-entropy loss is highly effective in optimizing classification models, but it requires a proper comparison target to acquire sufficient knowledge. Unlike traditional joint training, the absence of comparison constantly leads to forgetting in online continual learning. To address this problem causing intra- and inter-task forgettings, we propose a new instance-wise logit masking technique.

To complement the prompt-based continual learning approach and further enhance the performance of the model, we introduce a learnable mask paired with prompts that helps the model to learn more intra-relevant and easier learning goals. Since the key-value mechanism is used to select each prompt, where a feature extracted by the pre-trained model serves as a query, each prompt can be responsible for a certain region of the feature space in which classes have similar extracted features. As illustrated in Figure 3, we apply the mask to logit using an element-wise product and train the mask, and then calculate cross-entropy loss which makes the mask divide the tasks into easier classification tasks. The logit masking provides the model with a scaled gradient during back-propagation, which protects the knowledge that has been sufficiently trained and encourages the learning of classes to be learned.
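The masking step for a single instance can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, the prompt is assumed to be selected by the highest cosine similarity between its key and the query, and the mask is assumed to be multiplied onto the logits as is, without any squashing function.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def masked_cross_entropy(query, logits, label, keys, masks):
    """Cross-entropy on logits masked by the mask paired with the selected prompt.

    query : feature from the frozen pre-trained model, length D
    logits: classifier outputs for this instance, length |Y|
    keys  : P prompt keys, each of length D
    masks : P learnable masks, each of length |Y| (one per prompt)
    Returns (loss, index of the selected prompt).
    """
    # Assumption: select the prompt whose key is most similar to the query.
    selected = max(range(len(keys)),
                   key=lambda p: cosine_similarity(keys[p], query))
    # Element-wise product of the logits and the selected mask.
    masked = [z * m for z, m in zip(logits, masks[selected])]
    # Numerically stable log-sum-exp for the softmax normalizer.
    top = max(masked)
    log_norm = top + math.log(sum(math.exp(z - top) for z in masked))
    return log_norm - masked[label], selected


if __name__ == "__main__":
    keys = [[1.0, 0.0], [0.0, 1.0]]
    masks = [[1.0, 1.0, 1.0], [0.0, 0.0, 0.0]]
    print(masked_cross_entropy([1.0, 0.0], [2.0, 0.0, 0.0], 0, keys, masks))
```

Because the masked logit of class c is the product of the mask entry and the raw logit, the chain rule scales the gradient that reaches the raw logit by that mask entry, which is the scaled gradient mentioned above.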

4.3 Contrastive Visual Prompt Tuning Loss

The logit mask assumes that each key enables the prompts to learn similar knowledge. We empirically find that the existing prompt-based method [40] converges to a single point, which renders the query selection mechanism inaccurate and meaningless. Moreover, the keys are updated continuously, which causes forgetting. To overcome these challenges and leverage the benefits of prompt-based continual learning, we propose a novel loss function, called Contrastive Visual Prompt Tuning Loss. We formulate this loss term as follows:

		$\displaystyle s_{p}=\sum^{P}_{p=1}\sum^{B}_{q=1}{\mathrm{exp}\left(\delta\left(\textbf{k}_{p},\textbf{q}_{q}\right)/(\mathcal{C}_{p}+1)\right)},$
		$\displaystyle s_{n}=\sum^{P}_{p=1}\sum^{P}_{q=1}{\mathrm{exp}\left(\delta\left(\textbf{k}_{p},\textbf{k}_{q}\right)/(\mathcal{C}_{p}+1)\right)},$
		$\displaystyle\mathcal{L}_{CVPT}\,=\,-\mathrm{log}\,\frac{s_{n}}{s_{p}+s_{n}},$		(1)

where $\delta$ is the cosine distance, $P$ denotes the size of the prompt pool, $\textbf{q}_{n}\in\mathbb{R}^{D}$ indicates the $n^{th}$ query feature, $\textbf{k}_{n}\in\mathbb{R}^{D}$ denotes the key of the $n^{th}$ prompt, and $\mathcal{C}_{n}$ is the number of times the $n^{th}$ prompt has been selected. In $\mathcal{L}_{CVPT}$ , $(\mathcal{C}_{p}+1)$ plays the role of a temperature that controls the softness. If $\mathcal{C}_{n}$ is large, the effect of the loss on the key lessens. As illustrated in Figure 3, $\mathcal{L}_{CVPT}$ increases the distances between keys. Also, as a prompt is trained, its key becomes heavier, that is, harder to move, which ensures consistency in key selection. The instance-wise logit masking coupled with $\mathcal{L}_{CVPT}$ can prevent inter-task and intra-task forgetting by ensuring that each prompt claims its own responsible region and preserves its knowledge.
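Eq. (1) can be evaluated directly, as in the following sketch. It rests on two assumptions that the text does not state: the cosine distance is taken as one minus the cosine similarity, and $B$ is read as the number of query features in the current batch. The function names are hypothetical.

```python
import math


def cosine_distance(u, v):
    # Assumption: cosine distance = 1 - cosine similarity.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def cvpt_loss(keys, queries, counts):
    """Eq. (1). keys: P x D, queries: B x D, counts: P selection counts."""
    # s_p: key-to-query terms, each divided by the temperature C_p + 1 of the key.
    s_p = sum(math.exp(cosine_distance(k, q) / (c + 1))
              for k, c in zip(keys, counts) for q in queries)
    # s_n: key-to-key terms over all ordered pairs, including a key with
    # itself, exactly as in the double sum of Eq. (1).
    s_n = sum(math.exp(cosine_distance(k, other) / (c + 1))
              for k, c in zip(keys, counts) for other in keys)
    return -math.log(s_n / (s_p + s_n))


if __name__ == "__main__":
    keys = [[1.0, 0.0], [0.0, 1.0]]
    queries = [[1.0, 0.0]]
    print(cvpt_loss(keys, queries, counts=[0, 0]))
```

Since the selection count enters only through the divisor, a frequently selected key contributes flatter exponentials and therefore receives smaller updates from this loss.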

4.4 Gradient Similarity-based Focal Loss

Since a task in the blurry setup comprises imbalanced classes, it is challenging for the model to extract the knowledge of all the observed classes. Due to the stochastic nature of Si-Blurry, we cannot guarantee a minimum number of samples for minor classes. To mitigate the aforementioned minor class ignorance, we propose a Gradient Similarity-based Focal loss $\mathcal{L}_{GSF}$ (GSF loss). It focuses on the loss from ignored samples by leveraging ignore scores $\mathrm{Score}^{ign}$ . The ignore score $\mathrm{Score}^{ign}_{i}$ denotes how much a sample $\mathbf{x_{i}}$ is ignored by other samples during training. We use the cosine distance between the gradient vector of each sample $\nabla W_{y_{i}}(f(\mathbf{x_{i}}))\in\mathbb{R}^{D}$ and the averaged gradient vector $\nabla W_{y_{i}}(f(\mathcal{B}))\in\mathbb{R}^{D}$ from the accessible data $\mathcal{B}$ to yield the ignore score for a sample $\textbf{x}_{i}$ . We formulate the ignore score and the GSF loss as:

		$\displaystyle\nabla W_{y_{i}}(f(\mathcal{B}))=\frac{1}{\left|\mathcal{B}\right|}\sum_{(\textbf{x},y)\in\mathcal{B}}\nabla W_{y_{i}}(f(\textbf{x})),$
		$\displaystyle\mathrm{Score}_{i}^{ign}=\delta\left(\nabla W_{y_{i}}\left(f\left(\textbf{x}_{i}\right)\right),\,\nabla W_{y_{i}}\left(f\left(\mathcal{B}\right)\right)\right),$		(2)
		$\displaystyle\mathcal{L}_{GSF}=\frac{1}{\left|\mathcal{B}\right|}\sum_{i=1}^{\left|\mathcal{B}\right|}\left(\mathrm{Score}_{i}^{ign}\right)^{\gamma}\cdot\mathcal{L}_{CE}\left(\widehat{y}_{i},y_{i}\right),$		(3)

where $(\textbf{x},y)\in\mathcal{B}$ is a training sample, $\mathcal{L}_{CE}$ is the cross-entropy loss, and $W_{y_{i}}$ denotes the classifier weights of the corresponding label. Eq. 2 gives the ignore score for a sample $\textbf{x}_{i}$ . A high $\mathrm{Score}^{ign}$ implies that the model has difficulty extracting knowledge from the sample, whereas a low $\mathrm{Score}^{ign}$ implies the opposite. Leveraging $\mathrm{Score}^{ign}$ , we can emphasize the loss from the ignored samples and capture more knowledge of minor classes than before, as illustrated in Figure 3. Eq. 3 gives the GSF loss, which takes the amount of ignorance into account. In [24], the focal loss dynamically scales the cross-entropy loss according to the confidence in the correct class. Our proposed GSF loss also scales the cross-entropy loss dynamically, but it does so according to the ignore score, a degree of ignorance that is conceptually different from confidence. The GSF loss can mitigate the class ignorance problem, from which minor classes suffer most, and enables balanced class learning.
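Eqs. (2) and (3) can be sketched for a linear softmax classifier as follows. The sketch makes several assumptions that should be read as such: the gradient $\nabla W_{y_{i}}$ is interpreted as the gradient of each sample's cross-entropy with respect to the classifier column of label $y_{i}$ (for logits $W^{\top}f(\textbf{x})$ this is the feature scaled by the predicted probability minus the one-hot target), the cosine distance is taken as one minus the cosine similarity, and the score is clamped at zero to guard against rounding. The function names and the default `gamma` are hypothetical.

```python
import math


def softmax(z):
    top = max(z)
    e = [math.exp(v - top) for v in z]
    s = sum(e)
    return [v / s for v in e]


def cosine_distance(u, v):
    # Assumption: cosine distance = 1 - cosine similarity; the epsilon
    # avoids a division by zero when a gradient vanishes.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm + 1e-12)


def gsf_loss(features, labels, W, gamma=2.0):
    """features: N x D, labels: N ints, W: D x |Y| given as a list of rows.

    Returns (loss, ignore_scores).
    """
    n, d = len(features), len(features[0])
    num_classes = len(W[0])
    probs = []
    for h in features:
        logits = [sum(h[k] * W[k][c] for k in range(d))
                  for c in range(num_classes)]
        probs.append(softmax(logits))

    def grad(j, c):
        # Gradient of sample j's cross-entropy w.r.t. column c of W.
        coeff = probs[j][c] - (1.0 if labels[j] == c else 0.0)
        return [coeff * v for v in features[j]]

    total, scores = 0.0, []
    for i in range(n):
        c = labels[i]
        g_i = grad(i, c)
        g_all = [grad(j, c) for j in range(n)]
        g_mean = [sum(g[k] for g in g_all) / n for k in range(d)]
        score = max(0.0, cosine_distance(g_i, g_mean))   # Eq. (2)
        ce = -math.log(probs[i][c])
        total += (score ** gamma) * ce                   # summand of Eq. (3)
        scores.append(score)
    return total / n, scores


if __name__ == "__main__":
    feats = [[1.0, 0.0]] * 3 + [[0.0, 1.0]]   # three major, one minor sample
    loss, scores = gsf_loss(feats, [0, 0, 0, 1], [[0.0, 0.0], [0.0, 0.0]])
    print(loss, scores)
```

In the toy batch of the usage example, the single minor-class sample obtains a larger ignore score than the three major-class samples, so its cross-entropy term is weighted more heavily.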

4.5 Adaptive Feature Scaling

In online learning, the model cannot access all the data of the current task, only a few samples at a time. In our Si-Blurry scenario, each task suffers from class imbalance: when the model accesses training data, there are no or few samples of minor classes. This makes the model overfit to major classes and newly streamed disjoint classes. To mitigate the overfitting problem caused by class imbalance, we propose Adaptive Feature Scaling (AFS), which expands or contracts the feature vector according to the marginal benefit score $\mathrm{Score}^{MB}$ . Using $\mathrm{Score}^{MB}$ , the model can learn new knowledge from the accessible data while preserving the knowledge from inaccessible prior data.

As prior works [9, 25, 26] suggest, the similarity between the feature vector and the weights from the last fully connected layer relates to the prediction when the model is trained by cross-entropy loss with softmax function. We propose a marginal benefit score $\mathrm{Score}^{MB}$ that represents how similar the feature vector is with the weights of the corresponding label. Leveraging this, we estimate the marginal benefit from the given instance and adjust the model updates by the given instance. We can calculate $\mathrm{Score}^{MB}$ as follows:

	$\displaystyle\mathrm{Score}^{MB}_{i}=\delta\left(f\left(\textbf{x}_{i}\right),W_{y_{i}}\right)\ +\textit{m},$		(4)
	$\displaystyle\mathbf{h_{i}}=\frac{f({\textbf{x}_{i}})}{\mathrm{Score}^{MB}_{i}},$		(5)
	$\displaystyle\widehat{y}_{i}=W(\mathbf{h_{i}})$		(6)

where m is a margin, and $\mathbf{h_{i}}$ is the feature vector scaled by $\mathrm{Score}^{MB}_{i}$ . A high $\mathrm{Score}^{MB}$ implies that the given sample has a large marginal benefit. In this case, dividing by $\mathrm{Score}^{MB}$ shrinks the feature vector, which increases the expected loss value, and the enlarged expected loss makes the model learn enough knowledge from the given sample. In contrast, a low $\mathrm{Score}^{MB}$ implies that the given sample has little marginal benefit for the model. In this case, the feature vector is expanded, which decreases the expected loss value, and the model is trained less on the sample due to the curtailed expected loss. In this way, we estimate the marginal benefit that can be extracted from a sample and scale the feature vector up or down accordingly.
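Eqs. (4) to (6) amount to a few lines of arithmetic, sketched below for one sample. The sketch assumes that the cosine distance is one minus the cosine similarity; the function names and the default margin of 0.5 are hypothetical and chosen only for illustration.

```python
import math


def cosine_distance(u, v):
    # Assumption: cosine distance = 1 - cosine similarity.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def adaptive_feature_scaling(feature, label, W, margin=0.5):
    """feature: length D, W: D x |Y| given as a list of rows.

    Returns (marginal benefit score, scaled feature, logits).
    """
    d, num_classes = len(feature), len(W[0])
    # Classifier weights of the sample's label (column `label` of W).
    w_label = [W[k][label] for k in range(d)]
    score_mb = cosine_distance(feature, w_label) + margin       # Eq. (4)
    scaled = [v / score_mb for v in feature]                    # Eq. (5)
    logits = [sum(scaled[k] * W[k][c] for k in range(d))        # Eq. (6)
              for c in range(num_classes)]
    return score_mb, scaled, logits


if __name__ == "__main__":
    W = [[1.0, 0.0], [0.0, 1.0]]
    # Feature aligned with its label's weights: low score, expanded feature.
    print(adaptive_feature_scaling([2.0, 0.0], 0, W))
    # Feature orthogonal to its label's weights: high score, contracted feature.
    print(adaptive_feature_scaling([2.0, 0.0], 1, W))
```

With this margin, a feature that already points along the weights of its label is enlarged, while a feature far from the weights of its label is shrunk, which matches the expansion and contraction described above.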

To overcome the class imbalance problem, we propose two components that seem similar: gradient similarity-based focal loss (GSF) and adaptive feature scaling (AFS). Although GSF and AFS look similar, their main roles are different. GSF emphasizes learning minor classes to tackle the class imbalance problem in the task. AFS regularizes learning major classes to address the overfitting problem.

Finally, we train our model in an end-to-end manner. The total loss for our method is defined as:

	$\displaystyle\mathcal{L}_{\mathrm{total}}=(1-\alpha)\mathcal{L}_{\mathrm{CE}}+\alpha\mathcal{L}_{GSF}+\mathcal{L}_{CVPT},$		(7)

where $\mathcal{L}_{\mathrm{CE}}$ is cross entropy loss with instance-wise logit masking. We use hyperparameter $\alpha$ for the balanced training and $\gamma$ at the gradient similarity-based focal loss to scale the ignore score.

Buffer Size	Method	CIFAR-100 $A_{\mathrm{AUC}}$	CIFAR-100 $A_{\mathrm{Last}}$	Tiny-ImageNet $A_{\mathrm{AUC}}$	Tiny-ImageNet $A_{\mathrm{Last}}$	ImageNet-R $A_{\mathrm{AUC}}$	ImageNet-R $A_{\mathrm{Last}}$
0	Finetuning	19.71±3.39	10.42±4.92	15.50±0.74	10.42±4.92	7.51±3.94	2.29±0.85
0	Linear Probing	49.69±6.09	23.07±7.33	42.15±2.79	21.97±6.43	29.24±1.26	16.87±3.14
0	LwF [23]	55.51±3.49	36.53±10.96	49.00±1.52	27.47±7.59	31.61±1.53	20.62±3.67
0	L2P [40]	57.08±4.43	41.63±12.73	52.09±1.92	35.05±5.73	29.65±1.63	19.55±4.78
0	DualPrompt [39]	67.07±4.16	56.82±3.49	66.09±2.00	48.72±3.41	40.11±1.27	29.24±4.63
0	MVP (Ours)	68.10±4.91	62.59±2.38	68.95±1.33	52.78±2.08	40.60±1.21	31.96±3.07
500	ER [35]	65.57±4.77	60.68±1.15	59.46±1.81	40.60±2.71	40.31±1.33	28.85±1.43
500	EWC++ [17]	34.54±5.19	25.62±3.35	55.05±1.75	34.88±3.65	18.62±1.00	11.36±2.40
500	RM [3]	40.86±3.32	23.94±0.61	31.96±0.80	7.43±0.27	18.31±1.09	4.14±0.18
500	CLIB [18]	69.68±2.20	67.16±0.72	60.11±1.53	48.97±1.48	37.18±1.52	29.51±0.98
500	MVP-R (Ours)	76.06±4.22	79.32±1.28	76.52±0.73	65.19±0.58	49.07±1.47	44.17±1.72
2,000	ER [35]	69.86±4.08	71.81±0.69	66.75±1.13	55.07±1.28	45.74±1.35	38.13±0.32
2,000	EWC++ [17]	47.75±5.35	46.93±1.44	64.92±1.21	53.04±1.53	30.20±1.31	21.28±1.88
2,000	RM [3]	53.27±3.00	65.51±0.55	47.26±1.13	44.55±0.37	27.88±1.29	24.25±0.99
2,000	CLIB [18]	71.53±2.61	72.09±0.49	65.47±0.76	56.87±0.54	42.69±1.30	35.43±0.38
2,000	MVP-R (Ours)	78.65±3.59	84.42±0.44	80.67±0.75	74.34±0.32	52.47±1.45	50.54±2.08

Method	Mask	Cont	Memory = 0 $A_{\mathrm{AUC}}$	Memory = 0 $A_{\mathrm{Last}}$	Memory = 2,000 $A_{\mathrm{AUC}}$	Memory = 2,000 $A_{\mathrm{Last}}$
Baseline	-	-	67.07±4.16	56.82±3.49	75.26±5.02	80.72±0.83
MVP (Ours)	✓	-	67.64±3.81	58.52±3.28	77.83±0.35	84.26±0.04
MVP (Ours)	-	✓	66.37±4.59	58.63±1.18	76.67±1.98	83.32±0.40
MVP (Ours)	✓	✓	68.08±6.46	60.20±3.28	77.85±0.04	84.28±0.15

Method	GSF	AFS	Memory = 0 $A_{\mathrm{AUC}}$	Memory = 0 $A_{\mathrm{Last}}$	Memory = 2,000 $A_{\mathrm{AUC}}$	Memory = 2,000 $A_{\mathrm{Last}}$
Baseline	-	-	67.07±4.16	56.82±3.49	75.26±5.02	80.72±0.83
MVP (Ours)	✓	-	67.45±5.21	56.11±3.20	77.34±2.16	83.75±0.53
MVP (Ours)	-	✓	67.45±3.78	57.93±2.11	77.86±2.09	84.31±0.20
MVP (Ours)	✓	✓	67.66±3.47	58.28±2.95	78.28±3.67	84.41±0.21

Method	Mask	Cont	GSF	AFS	Memory = 0 $A_{\mathrm{AUC}}$	Memory = 0 $A_{\mathrm{Last}}$	Memory = 2,000 $A_{\mathrm{AUC}}$	Memory = 2,000 $A_{\mathrm{Last}}$
Baseline	-	-	-	-	67.07±4.16	56.82±3.49	75.26±5.02	80.72±0.53
MVP (Ours)	✓	✓	-	-	68.08±6.46	60.20±3.28	77.85±0.04	84.28±0.15
MVP (Ours)	-	-	✓	✓	67.66±3.47	58.28±2.95	78.28±3.67	84.41±0.21
MVP (Ours)	✓	✓	✓	✓	68.10±4.91	62.59±2.38	78.65±3.59	84.42±0.44

Case	i-Blurry CLIB [18]	i-Blurry DP [39]	i-Blurry MVP-R (2,000)	Si-Blurry CLIB [18]	Si-Blurry DP [39]	Si-Blurry MVP-R (2,000)
Best case	72.56	67.51	84.69	72.91	61.68	84.89
Worst case	71.86	62.57	83.68	71.78	53.68	83.80
Average	72.12±0.38	64.90±1.96	84.44±0.43	72.09±0.49	56.82±3.49	84.42±0.44

Method	Disjoint Class Ratio 0 $A_{\mathrm{AUC}}$	Disjoint Class Ratio 0 $A_{\mathrm{Last}}$	Disjoint Class Ratio 50 $A_{\mathrm{AUC}}$	Disjoint Class Ratio 50 $A_{\mathrm{Last}}$	Disjoint Class Ratio 100 $A_{\mathrm{AUC}}$	Disjoint Class Ratio 100 $A_{\mathrm{Last}}$
DualPrompt [39]	68.85±2.77	72.31±9.19	67.07±4.16	56.82±3.49	71.45±1.67	48.68±3.47
MVP (Ours)	67.86±2.62	73.83±8.34	68.10±4.91	62.59±2.38	73.35±2.63	53.40±5.49

Method	Blurry Sample Ratio 10 $A_{\mathrm{AUC}}$	Blurry Sample Ratio 10 $A_{\mathrm{Last}}$	Blurry Sample Ratio 30 $A_{\mathrm{AUC}}$	Blurry Sample Ratio 30 $A_{\mathrm{Last}}$	Blurry Sample Ratio 50 $A_{\mathrm{AUC}}$	Blurry Sample Ratio 50 $A_{\mathrm{Last}}$
DualPrompt [39]	67.07±4.16	56.82±3.49	70.58±2.05	59.47±7.38	68.08±5.56	49.93±2.82
MVP (Ours)	68.10±4.91	62.59±2.38	71.10±2.10	63.02±6.68	70.58±2.05	59.47±7.38

Method	$\gamma$	$A_{\mathrm{AUC}}$	$A_{\mathrm{Last}}$
Baseline	-	67.07±4.16	56.82±3.49
MVP (Ours)	0.5	67.25±5.08	60.39±1.55
MVP (Ours)	1.0	67.45±5.05	60.95±1.61
MVP (Ours)	1.5	67.52±5.11	61.05±1.37
MVP (Ours)	2.0	68.10±4.91	62.59±2.38
MVP (Ours)	2.5	67.62±5.17	61.11±1.55

$\alpha$	Memory = 0 $A_{\mathrm{AUC}}$	Memory = 0 $A_{\mathrm{Last}}$	Memory = 2,000 $A_{\mathrm{AUC}}$	Memory = 2,000 $A_{\mathrm{Last}}$
-	40.11±1.27	29.24±4.63	49.00±2.06	37.96±0.34
0.1	40.38±1.67	31.63±3.39	52.13±0.14	50.50±3.11
0.3	40.52±1.59	31.81±3.66	52.14±0.28	50.51±2.76
0.5	40.60±1.21	31.96±3.07	52.47±1.45	50.54±2.08
0.7	40.53±1.01	31.56±2.05	52.28±2.34	50.43±1.53

Method	Forgetting
Baseline	46.61±5.30
+ GSF,AFS	45.35±4.40
+ Cont,Mask	39.98±4.02
+ Cont,Mask,GSF,AFS	39.68±3.98

Memory Size	Methods	$A_{\mathrm{Last}}$ (↑)	$\mathrm{Forgetting}$ (↓)
0	FineTuning	10.42±4.92	45.11±5.98
0	LwF [23]	36.53±10.96	56.43±12.91
0	L2P [40]	41.63±12.73	55.46±13.15
0	DualPrompt [39]	56.82±3.49	40.35±1.25
0	MVP (Ours)	62.59±2.38	34.63±2.46
500	ER [35]	60.68±1.15	28.85±3.51
500	EWC++ [17]	25.62±3.35	47.16±9.72
500	RM [3]	23.94±0.61	24.28±2.90
500	CLIB [18]	67.16±0.72	15.45±0.94
500	MVP (Ours)	79.32±1.28	14.57±1.60
2,000	ER [35]	71.81±0.69	15.45±0.94
2,000	EWC++ [17]	46.93±1.44	28.75±7.58
2,000	RM [3]	65.51±0.55	9.50±1.49
2,000	CLIB [18]	72.09±0.49	8.07±0.98
2,000	MVP (Ours)	84.42±0.44	8.79±1.49

Method	Memory = 500 $A_{\mathrm{AUC}}$	Memory = 500 $A_{\mathrm{Last}}$	Memory = 2,000 $A_{\mathrm{AUC}}$	Memory = 2,000 $A_{\mathrm{Last}}$
L2P	69.91±1.49	56.58±0.64	75.24±0.82	68.73±0.80
DualPrompt	75.07±1.01	62.12±1.50	79.76±0.47	72.09±0.80
MVP-R (Ours)	76.52±0.73	65.19±0.58	80.67±0.75	74.34±0.32

Method	TFLOPs	Training time (s/iter)
CLIB	69.6	11.590
DualPrompt	4.37	0.906
MVP (Ours)	4.19	0.882

