DesCLIP: Robust Continual Adaptation via General Attribute Descriptions for Pretrained Vision-Language Models

Chiyuan He, Zihuan Qiu, Fanman Meng, Linfeng Xu, Qingbo Wu, Hongliang Li Corresponding author: Fanman Meng. The authors are with the School of Information and Communication Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China (e-mail: [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected].

Abstract

Continual adaptation of vision-language models (VLMs) focuses on leveraging cross-modal pretrained knowledge to incrementally adapt for expanding downstream tasks and datasets, while tackling the challenge of knowledge forgetting. Existing research often focuses on connecting visual features with specific class text in downstream tasks, overlooking the latent relationships between general and specialized knowledge. Our findings reveal that forcing models to optimize inappropriate visual-text matches exacerbates forgetting of VLMs. To tackle this issue, we propose DesCLIP, which leverages general attribute (GA) descriptions to guide the understanding of specific class objects, enabling VLMs to establish robust vision-GA-class trilateral associations rather than relying solely on vision-class connections. Specifically, we introduce a language assistant to generate concrete GA description candidates via proper request prompts. Then, an anchor-based embedding filter is designed to obtain highly relevant GA description embeddings, which are leveraged as the paired text embeddings for visual-textual instance matching, thereby tuning the visual encoder. Correspondingly, the class text embeddings are gradually calibrated to align with these shared GA description embeddings. Extensive experiments demonstrate the advancements and efficacy of our proposed method, with comprehensive empirical evaluations highlighting its superior performance compared to existing pretrained and VLM-based continual learning methods.

Index Terms:

Continual learning, vision-language model, general attribute description, knowledge forgetting

I Introduction

In recent years, deep models pretrained on large-scale datasets have achieved remarkable success across visual, linguistic, and multi-modal domains. Pretrained vision-language models (VLMs), exemplified by CLIP [1] and ALIGN [2], have demonstrated substantial promise in handling open-vocabulary tasks. Despite their strong zero-shot capabilities in common domains, VLMs often underperform on specialized tasks, such as distinguishing low-quality images or identifying fine-grained object categories. Consequently, significant efforts [3, 4, 5, 6, 7, 8] have focused on adapting VLMs on downstream datasets to adapt to these new tasks. However, as the demand for and volume of data continue to grow, incorporating previous models and data for joint training imposes substantial storage and computational overhead. Considering the significant cost of repeatedly training foundation models, exploring continual learning (CL) becomes particularly valuable in this context.

Recently, Zheng et al. [9] and Zhang et al. [10] have highlighted the risk of losing existing generalization capabilities when adjusting VLMs with generic knowledge to specialized domain models. This adjustment may result in the model losing effectiveness on prior tasks and lacking potential for optimization on subsequent tasks. This phenomenon, known as catastrophic forgetting in the field of continual learning, is particularly pronounced o n VLMs. Unlike conventional scenarios [11, 12, 13, 14], catastrophic forgetting o f VLMs impacts not only the task-specific knowledge of previously learned tasks but also the extensive pretrained knowledge, presenting significant challenges in adjusting VLMs for continual tasks.

Refer to caption — Figure 1: (a) Existing methods: learning to match unfamiliar specific classes leads to a risk of forgetting. (b) Ours: learning to construct connections with highly relevant general attributes and gradually calibrate class-text embeddings.

Over the past few years, research [16, 10, 17, 18, 9] has explored how VLMs can adapt to CL tasks. Conventional adapting approaches [3, 4, 7, 19, 20] have shown limited effectiveness in adapting pretrained VLMs for incremental tasks. This limitation is primarily due to their reliance on shared structures and prompts across all tasks, which often results in forgetting of previously learned knowledge when accommodating new information. To address this issue, Wang et al. [16] proposed AttriCLIP, a sparse prompt selection mechanism that selects abstract attributes with high visual relevance to enrich textual hint prompts. However, these prompts depend solely on image conditions and lack association with class-relevant information, which restricts the effectiveness. Other works [21, 22, 10] have focused on selectively updating parameters of pretrained models to better support incremental continual tasks. More recently, approaches incorporating additional structures [17, 18] with sparse mechanisms have shown promise in mitigating conflicts with prior knowledge and alleviating forgetting of VLMs. Additionally, Zheng et al. [9] and Yu et al. [23] use the additional reference datasets to perform knowledge distillation, effectively mitigating the forgetting of generic knowledge.

Although these studies have demonstrated some effectiveness in mitigating knowledge forgetting of VLMs, they can largely be regarded as adaptations of traditional and pretrained-vision-model-based CL methods, tailored to fit the VLM framework. Yet, they fail to fully exploit the robust cross-modal knowledge associations established during the VLM pretraining phase. For instance, these approaches overlook the visual-textual associations established via pretraining, instead relying heavily on rudimentary textual prompts (e.g., A photo of a [CLS], [XXX] [CLS]) to correlate with visual information. This introduces a risk of forgetting in unfamiliar downstream tasks. For example, CLIP may struggle with unfamiliar categories like “European garden spider,” resulting in a weak correspondence between its visual and textual representations. Forcing this association can lead to overfitting, which in turn accelerates the forgetting of pretrained and previously learned knowledge. In Fig. 2, we demonstrate that beginning with unfamiliar tasks (characterized by low image-text similarity confidence) or from familiar tasks yields markedly different evaluation results on selected subtasks of ImageNet [24]. Fig. 2 (b) underscores that forcibly aligning unfamiliar visual-text pairs hinders knowledge retention of VLMs, detracting from subsequent task learning. Additionally, Fig. 2 (c) uses centered kernel alignment (CKA) [15] to analyze representation similarity of continually tuned CLIP to a pretrained CLIP. It can be observed that matching unfamiliar classes further disrupts integrity of pretrained representation, hindering general knowledge preservation.

Another line of research [25, 26, 27, 28] have attempted to enrich textual descriptions of specific classes to assist VLMs in understanding objects in downstream tasks. However, the descriptions generated lack interaction with visual information or fail to provide class-specific representative details, making it difficult to ensure a reliable understanding of the objects.

To overcome these limitations, we emphasize the transfer from generalized knowledge to specialized insights. To our knowledge, existing research has not focused on addressing forgetting by establishing robust associations between general attributes and specialized downstream class. Our approach guides the continual learning process by exploring the context encoding ability of the language branch, forming strong links between general and specialized knowledge. As shown in Fig. 1, we advocate for visual representations that align closely with highly-relevant general attribute (GA) embeddings, which are well-known to VLMs, instead of relying on naive class-text embeddings. This prevents the risk of overfitting to unfamiliar classes, as such overfitting can lead to knowledge forgetting. By gradually calibrating text embeddings to align with shared GA embeddings, we form GA-class associations for these incremental downstream tasks. In essence, we redirect the focus from conventional vision-class text connections to establishing robust vision-GA-class trilateral associations, enabling a more effective knowledge transfer that significantly mitigates forgetting of VLMs. In summary, our main contributions are as follows:

•

We revisit the continual learning of VLMs, focusing on the incremental transfer from generalized to specialized knowledge. By introducing concrete descriptions of general attributes (GAs), we establish more robust vision-GA-class trilateral associations during downstream incremental phases, effectively mitigating forgetting caused by inappropriate visual-text matching.
•

We propose an anchor-based embedding filter to identify and retain GA description embeddings highly relevant to visual representations. Building on this, we introduce a GA-Guided progressive visual-textual alignment scheme to guide the learning process.
•

Our method introduces no additional overhead in terms of model structure, data replay, or feature rehearsal storage. Extensive experiments on CIFAR100 [29], ImageNet [24], and CUB-200 [30] demonstrate the exceptional performance of our approach. Thorough ablation studies and analyses further corroborate its effectiveness.

II Related Work

II-A Continual Learning

Continual Learning (CL) investigates how deep models can incrementally learn knowledge. Existing CL research can be categorized into several types based on the strategies they employ. Among these, regularization-based methods [11, 31, 32] introduce regularization terms during model training to penalize forgetting old knowledge. These regularization terms can either focus on protecting model parameters [31] or on output distributions [11, 33] (e.g., knowledge distillation). Dynamic network-based methods [34, 35, 36, 37] aim to learn the model by introducing new structures for new tasks while preserving old knowledge, although this incurs substantial overhead as model parameters increase with the number of tasks. Recently, replay-based methods have become increasingly common. Data replay methods [38, 37] assist models in retaining old knowledge by recalling a small number of real samples. Additionally, some methods [39, 40, 41] recall old knowledge by storing sample features and the distributions of these features. However, replay-based methods introduce storage costs and require repetitive computation for old data.

In recent years, studies such as [42, 43, 44, 45] have predominantly focused on integrating additional components for incremental tasks, such as learnable prompts [44, 45, 46, 47] and adapters [42, 43], into pretrained models. This integration necessitates the development of methods for selecting and evaluating the relevance of these components to ensure both their appropriateness and compatibility with the pretrained model. However, a significant limitation arises as the number of tasks increases: the associated computational and storage costs grow substantially, posing challenges to scalability and efficiency.

II-B Vision-Language Models

With advancements in pre-training techniques, large-scale foundation models [1, 48, 49, 50, 51] have significantly impacted the industry. For instance, Vision-Language Models such as Contrastive Language-Image Pretraining (CLIP) [1] and Adversarially Learned Inference for Image-Text Matching (ALIGN) [2] have demonstrated remarkable zero-shot capabilities for general tasks. However, despite being pre-trained on over 400 million image-text pairs, CLIP still face challenges in specific downstream tasks, such as accurately identifying certain types of vehicles and lizards.

To better adapt VLMs for downstream tasks, various text prompt-based fine-tuning methods [3, 4, 6, 7, 52] have been proposed, which can enhance VLM performance on specific tasks. In more complex scenarios, learnable prompts can be inserted into intermediate layers [8] to incorporate more sophisticated general knowledge. Additionally, the integration of adapter structures [53, 19, 20] has also been shown to be an effective strategy. Other approaches [54, 55, 56, 57] focus on the representation alignment of VLMs and aim to improve the transfer of general knowledge. Although these methods demonstrate excellent performance in CLIP transfer tasks, they are inherently unsuitable for incrementally learning, as the additional learnable structures cannot effectively mitigate catastrophic forgetting.

II-C Continual Adaptation for VLMs

Investigating the continual learning and adaptation of VLMs for diverse downstream tasks holds significant value, as it reduces data storage requirements and computational redundancy while addressing the challenge of inaccessible previous data. It is crucial to protect the model’s gerenic pretrained knowledge and previously learned knowledge. The full fine-tuning strategies discussed in II-A will lead to significant forgetting of pre-trained knowledge, which is a notable distinction between pre-trained foundation models (e.g., CLIP) and small-scale deep models. Additionally, frameworks such as CoOp [3] and CoOpOp [4] have been shown to have limited adjustment capabilities for VLMs in incremental tasks due to their reliance on shared structures and contextual prompts across all tasks, leading to forgetting old knowledge during the process of fitting new knowledge. To solve this, Wang et al. [16] introduced AttriCLIP, which establishes a shared attribute bank for all tasks and selects suitable contexts based on visual images to bridge the gap between images and text. Yu et al. [18] proposed using a mixture of experts (MoE) framework to adapt knowledge for different tasks, decoupling the model’s zero-shot capabilities from its specialized task abilities. From the perspective of parameter sparse updating, efforts from SPG [22], SparseCL [21], and SPU [10] have aimed to update VLM parameters selectively by employing appropriate “important parameter” selection patterns; for example, SPU selects more important parameters for updates based on the gradients accumulated by batches. Additionally, Zheng et al. [9] and Yu et al. [23] proposed the use of additional reference datasets to facilitate knowledge distillation in a VLM, effectively mitigating the forgetting of generic knowledge.

III Methodology

III-A Preliminaries

1) Continual Learning Formulation: A sequence of task datasets is denoted as $\{\mathcal{D}_{1},\mathcal{D}_{2},\ldots,\mathcal{D}_{T}\}$ . During training on task $\mathcal{D}_{t}$ ( $t\in\{1,2,\ldots,T\}$ ), access to data from previous tasks $\{\mathcal{D}_{1},\mathcal{D}_{2},\ldots,\mathcal{D}_{t-1}\}$ is either highly restricted or entirely unavailable. In class-incremental learning (CIL), datasets for different tasks are introduced sequentially. Each task $t$ is associated with a unique set of classes $\mathcal{C}_{t}=\{C_{t,1},C_{t,2},\ldots,C_{t,|\mathcal{C}_{t}|}\}$ , where $|\mathcal{C}_{t}|$ denotes the number of classes in task $t$ . The classes associated with different tasks are disjoint:

\mathcal{C}_{t}\cap\mathcal{C}_{t^{\prime}}=\emptyset,\quad\forall t\neq t^{\prime},\,t,t^{\prime}\in\{1,2,\ldots,T\}.

(1)

2) CLIP for Incremental Tasks: CLIP [1] comprises an image encoder $\mathcal{F}_{\Theta}(\cdot)$ and a text encoder $\mathcal{T}(\cdot)$ . Specifically, an image $x\in\mathbb{R}^{H\times W\times 3}$ and a text prompt, referred to as the rudimentary prompt $\mathbf{RP}_{y}$ , are input into $\mathcal{F}_{\Theta}(\cdot)$ and $\mathcal{T}(\cdot)$ , respectively, producing a visual embedding $\mathbf{z}\in\mathbb{R}^{D}$ and a text embedding $\mathbf{w}_{y}\in\mathbb{R}^{D}$ :

\mathbf{z}=\mathcal{F}_{\Theta}(x),\quad\mathbf{w}_{y}=\mathcal{T}(\mathbf{RP}_{y}).

(2)

Here, $\mathbf{RP}_{y}$ is derived from hand-crafted prompts, typically following a template such as “A photo of a [CLS],” where [CLS] represents the specific class name. The probability of classifying a test image $x$ as class $y_{i}$ is computed using the softmax function:

p\left(y_{i}\mid x\right)=\frac{\exp\left(\left\langle\frac{\mathbf{z}}{\|\mathbf{z}\|},\frac{\mathbf{w}_{y_{i}}}{\|\mathbf{w}_{y_{i}}\|}\right\rangle/\tau\right)}{\sum_{k=1}^{K}\exp\left(\left\langle\frac{\mathbf{z}}{\|\mathbf{z}\|},\frac{\mathbf{w}_{y_{k}}}{\|\mathbf{w}_{y_{k}}\|}\right\rangle/\tau\right)},

(3)

where $\tau$ is the temperature parameter, $\mathbf{w}_{y_{k}}$ is the class text embedding derived from the rudimentary prompt of the $k$ -th class, and $K$ denotes the total number of downstream classes.

Building on this architecture, ContinualCLIP [58] tackles the challenge of continual learning with a training-free approach. For each new task $t$ , the text embedding set $\mathbf{W}_{t}$ is expanded to incorporate embeddings for the new task’s classes. At task $t$ , the updated text embedding set is defined as:

\mathbf{W}_{t}=\bigcup_{j=1}^{t}\bigcup_{k=1}^{|C_{j}|}\mathbf{w}_{j,k},

(4)

where $\mathbf{w}_{j,k}$ denotes the text embedding for the $k$ -th class of task $j$ encountered so far. Consequently, the prediction for a test image $x$ after task $t$ is computed as:

p\left(y_{i}\mid x\right)=\frac{\exp\left(\left\langle\frac{\mathbf{z}}{\|\mathbf{z}\|},\frac{\mathbf{w}_{y_{i}}}{\|\mathbf{w}_{y_{i}}\|}\right\rangle/\tau\right)}{\sum_{\mathbf{w}_{j,k}\in\mathbf{W}_{t}}\exp\left(\left\langle\frac{\mathbf{z}}{\|\mathbf{z}\|},\frac{\mathbf{w}_{j,k}}{\|\mathbf{w}_{j,k}\|}\right\rangle/\tau\right)}.

(5)

III-B Overview of proposed DesCLIP

The overall architecture of our proposed framework is shown in Fig. 3. Within our framework, the CLIP’s textual encoder $\mathcal{T}(\cdot)$ remains fixed and comprises two input branches: one processes rudimentary prompts derived from basic class names, while the other generates a diverse pool of general attribute (GA) description candidates via a language assistant. To obtain highly relevant visual-GA text pairs, we introduce the anchor-based embedding filter (AEF). The AEF identifies the most relevant attribute description embeddings from the candidate pool with respect to the current visual features. These filtered embeddings are then paired with visual features to compute a class-agnostic instance-matching loss, which is utilized to fine-tune the visual encoder $\mathcal{F}_{\Theta}(\cdot)$ . Concurrently, the text embeddings are gradually calibrated to align with shared attribute embeddings, further enhancing the consistency among representations of vision, GA, and downstream classes.

III-C General Attribute Description Generation

CLIP establish robust visual-textual associations during the pre-training phase through instance-level contrastive learning. However, most existing research overlooks this foundational capability, conventionally relying on fixed, hand-crafted templates combined with class names as prompts to derive “prior” classification weights via the textual encoder. Although Wang et al. [16] introduced an “attribute bank” to enable attribute sharing across different task instances, this approach lacks intrinsic relevance to specific classes. For instance, attributes such as “white” and “grass” fail to provide meaningful distinctions between classes like “cat” and “dog”.

To address this limitation, we propose the use of a language assistant to generate rich, contextually relevant attribute descriptions for specific classes. The language assistant utilizes an advanced large language model (LLM) with a generalized understanding of downstream task entities. Drawing inspiration from [25, 26, 28], we design a variety of describe-request prompts (DRPs) to guide the language assistant in generating visually relevant attribute descriptions. Examples of basic DRPs include:

•

Q: Describe what does a/an [CLS] look like?
•

Q: Describe a/an [CLS]’s attribute features.
•

Q: Describe a/an [CLS]’s outlook features.

Additionally, more complex prompts are designed to produce discriminative attribute descriptions, such as:

•

Q: Describe what are some attribute characteristics of [CLS] compared with other [P-CLS], visually.
•

Q: Describe what kind of [P-CLS] a/an [CLS] is, visually.

Here, [P-CLS] refers to the parent class of [CLS]. Fine-grained DRPs are also employed for tasks with a known general scope, such as identifying objects within the “birds” supercategory:

•

Describe a/an [CLS]’s attributes from its beak, eyes, body, belly, tail, wings, breast, etc.

The DRP-guided general attribute description generation are illustrated in Fig. 4. The language assistant generates $n_{dsc}$ attribute description candidates (DCs) for the $k$ -th class of incremental task $t$ , denoted as:

\mathbf{DC}_{\{t,k\}}=\bigg{\{}\mathbf{DC}_{\{t,k\},1},\mathbf{DC}_{\{t,k\},2},\ldots,\mathbf{DC}_{\{t,k\},n_{dsc}}\bigg{\}},

(6)

where $k\in\{1,2,\ldots,|C_{t}|\}$ . These DCs are then embedded using the textual encoder $\mathcal{T}(\cdot)$ to produce attribute embedding candidates (ECs):

\begin{split}\mathbf{EC}_{\{t,k\}}&=\mathcal{T}\Big{(}\mathbf{DC}_{\{t,k\}}\Big{)}\\ &=\bigg{\{}\mathbf{EC}_{\{t,k\},1},\mathbf{EC}_{\{t,k\},2},\ldots,\mathbf{EC}_{\{t,k\},n_{dsc}}\bigg{\}}.\end{split}

(7)

Each element of $\mathbf{EC}_{\{t,k\}}$ has the same dimension as the rudimentary text embedding $\mathbf{w}_{t,k}$ .

III-D Anchor-based Embedding Filter

The generated embedding candidates of GAs do not always align with visual representations due to potential domain discrepancies or unrelated information in the text descriptions produced by the language assistant. To establish robust vision-GA associations, we propose an anchor-based embedding filter (AEF) mechanism to refine the embedding candidates. This mechanism identifies candidates that sufficiently match the visual representations, enabling the construction of approximate image-text pairs tailored to the specific requirements of the task.

For a training sample $(x_{i},y_{i})$ , where $y_{i}=c=\{t,k\}$ is assumed, the label $c$ is considered to correspond to the $k$ -th class of incremental task $t$ , with $k\in\{1,2,\dots,|C_{t}|\}$ . As illustrated in Fig. 3, the inputs to AEF include the visual features $\mathbf{z}_{i}=\mathcal{F}_{\Theta}(x_{i})$ , the rudimentary text embedding $\mathbf{w}_{c}=\mathcal{T}(\textbf{RP}_{y_{i}})$ , and the embedding candidates $\mathbf{EC}_{c}$ . The cosine similarity between the visual features $\mathbf{z}_{i}$ and the rudimentary text embedding $\mathbf{w}_{c}$ is calculated as:

CS_{i}^{c}=\left\langle\frac{\mathbf{z}_{i}}{\|\mathbf{z}_{i}\|},\frac{\mathbf{w}_{c}}{\|\mathbf{w}_{c}\|}\right\rangle.

(8)

Subsequently, the similarity scores between the visual features and each embedding candidate in $\mathbf{EC}_{c}$ are calculated:

\textbf{EC\_S}_{c}^{i,j}=\left\langle\frac{\mathbf{z}_{i}}{\|\mathbf{z}_{i}\|},\frac{\mathbf{EC}_{c,j}}{\|\mathbf{EC}_{c,j}\|}\right\rangle,

(9)

where $j\in\{1,2,\dots,n_{dsc}\}$ . To mitigate the risk of overfitting in the CLIP’s visual encoder, visual features with low relevance to either class text or GA descriptions should be filtered out. Hence, a condition $\chi(\mathbf{z}_{i})$ is defined to filter visual features as:

\chi(\mathbf{z}_{i})=\begin{cases}1,&\text{if}~{}~{}\max\big{(}CS_{i}^{c},\max_{j}\textbf{EC\_S}_{c}^{i,j}\big{)}>\delta_{d},\\ 0,&\text{otherwise}.\end{cases}

(10)

Here, $\delta_{d}$ is a predefined threshold. The similarity between retained visual features and the rudimentary text embedding is used to define the anchor threshold $AT_{i}^{c}$ :

AT_{i}^{c}=\big{\{}CS_{i}^{c}\;|\;\chi(\mathbf{z}_{i})=1\big{\}}.

(11)

We posit that embedding candidates in $\mathbf{EC}_{c}$ exhibiting a similarity score surpassing a threshold $\gamma$ (relative to the anchor threshold $AT_{i}^{c}$ ) are more consistent with the visual features of the current sample. These candidates are filtered as follows:

\mathbf{FE}^{i}=\bigg{\{}\textbf{EC}_{c,j}\;|\;\textbf{EC\_S}_{c}^{i,j}>AT_{i}^{c}+\gamma\bigg{\}},

(12)

where $j\in\{1,2,\dots,n_{dsc}\}$ , and sorted with descending order according to $\textbf{EC\_S}_{c}^{i,j}$ . To reduce the potential influence of domain discrepancies arising from contextual descriptions, we further restrict the selection process by exclusively retaining attribute description sentences in $\mathbf{DC}_{c}$ that explicitly include the class name as a noun (i.e., [CLS]+[GA]).

III-E General Attribute-Guided Progressive Visual-Textual Alignment

This section introduces the methodology for optimizing the visual and textual branches of CLIP in incremental tasks, leveraging the filtered embeddings identified for relevant training samples. Within the CLIP architecture, the optimization focuses on the visual encoder $\mathcal{F}_{\Theta}(\cdot)$ (specifically, the initial MLP layers within each Transformer block [10]) and the rudimentary text embeddings introduced for the current task.

1) Instance Matching: To align the visual features with highly relevant embedded GA descriptions, we select the most closest textual representation $\mathbf{h}_{i}$ as the paired text embedding:

\mathbf{h}_{i}=\mathbf{FE}^{i}[0].

(13)

The instance matching loss $\mathcal{L}_{\text{IM}}$ is computed across the batch as:

\mathcal{L}_{\text{IM}}=\mathbb{E}_{i\in P}\Big{[}-\log\frac{\mathbf{M}_{i,i}}{\mathbf{M}_{i,i}+\sum_{j\in P,j\neq i}\mathbf{M}_{i,j}}\Big{]},

(14)

where

\mathbf{M}_{i,j}=\exp\Big{(}\left\langle\frac{\mathbf{z}_{i}}{\|\mathbf{z}_{i}\|},\frac{\mathbf{h}_{j}}{\|\mathbf{h}_{j}\|}\right\rangle/\tilde{\tau}\Big{)},

(15)

\mathbf{M}_{i,i}=\exp\Big{(}\left\langle\frac{\mathbf{z}_{i}}{\|\mathbf{z}_{i}\|},\frac{\mathbf{h}_{i}}{\|\mathbf{h}_{i}\|}\right\rangle/\tilde{\tau}\Big{)}.

(16)

Here, $\tilde{\tau}$ denotes an elevated temperature scaling factor, defined as $\tilde{\tau}=10\tau$ . The set $P$ represents the indices of valid samples within a batch of size $B$ and is specified as:

P=\Big{\{}i\mid i\in\{1,\dots,B\},\,\chi(\mathbf{z}_{i})=1\,,\mathbf{FE}^{i}\neq\emptyset\Big{\}},

(17)

where $\chi(\mathbf{z}_{i})$ is the condition defined in III-D. For each incremental task $t$ , we tune the model using a contrastive learning framework similar to the pretraining strategy of CLIP [1]. To mitigate forgetting, we adhere to a “nearest matching” principle, aligning visual features with GA description embeddings exhibiting higher correlations. This approach minimizes the risk of overfitting visual features to specific class text embeddings, maintaining both generalization and previous knowledge.

2) Text Embedding Calibration: General attribute descriptions play a pivotal role in guiding the calibration of text embeddings to achieve better alignment with visual representations. This alignment is particularly crucial because the original rudimentary text embeddings often exhibit weak correlations with visual features in “unfamiliar” downstream tasks. Such misalignment can lead to overfitting in the visual branch of the VLM to class labels, thereby exacerbating forgetting. To mitigate this issue, we propose a weight-shifting mechanism that calibrates the rudimentary text embeddings for the classes in task $t$ . This mechanism repositions the text embeddings toward representative attributes shared across the corresponding visual features, fostering stronger alignment between shared general attributes and class-specific text embeddings. Specifically, we define a shifting weight $\mathbf{s}_{t,k}\in\mathbb{R}^{D}$ , and a shift transformation $\Psi(\cdot,\mathbf{s}_{t,k})$ for the calibration of rudimentary text embedding $\mathbf{w}_{t,k}$ , where $\{t,k\}$ representing the $k-$ th class of incremental task $t$ , $k\in\{1,2,...,|C_{t}|\}$ . The calibrated text embedding $\mathbf{w^{\prime}}_{t,k}$ can be obtained as:

\mathbf{w^{\prime}}_{t,k}=\Psi(\mathbf{w}_{t,k},\mathbf{s}_{t,k})=\frac{\mathbf{w}_{t,k}}{\|\mathbf{w}_{t,k}\|}+\alpha\cdot\mathbf{s}_{t,k}.

(18)

The key to text embedding calibration is to ensure a strong correlation with the visual representations of the class while preventing an excessive focus on any single attribute text. Therefore, $\mathbf{w}^{\prime}_{t,k}$ should be aligned with $\mathbf{FE}^{i}$ , which is filtered based on the visual features $\mathbf{z}_{i}$ of the class $c=\{t,k\}$ :

TABLE I: Comparison of different methods on the CIFAR100 [29], ImageNet-Subset [24], and CUB-200 [30] datasets under various settings. ‘UB’ denotes the upper bound achieved through joint training.

Method	CIFAR100 [29]						ImageNet-Subset [24]						CUB-200 [30]
	$T=5$		$T=10$		$T=20$		$T=5$		$T=10$		$T=20$		$T=5$		$T=10$		$T=20$
	Last	Avg	Last	Avg	Last	Avg	Last	Avg	Last	Avg	Last	Avg	Last	Avg	Last	Avg	Last	Avg
Seq FT	-	-	50.6	66.9	-	-	-	-	58.8	73.8	-	-	-	-	40.8	64.6	-	-
LwF [11]	-	-	56.1	74.8	-	-	-	-	56.4	75.0	-	-	-	-	52.1	71.5	-	-
EWC [31]	-	-	64.4	78.4	-	-	-	-	70.3	80.2	-	-	-	-	64.4	78.6	-	-
CoOp [3]	78.8	85.0	75.1	83.0	78.2	85.3	78.8	81.8	78.0	81.4	74.9	80.6	53.1	63.7	41.8	54.3	49.1	58.4
CoOpOp [4]	76.5	80.0	76.5	76.8	70.0	73.2	68.8	76.5	62.4	71.3	61.2	70.8	51.9	66.7	53.4	66.9	50.2	65.8
ContinualCLIP [58]	72.8	-	72.8	-	72.8	-	74.6	-	74.6	-	74.6	-	60.9	-	60.9	-	60.9	-
L2P [44]	-	-	70.2	79.6	-	-	-	-	71.1	78.4	-	-	-	-	69.9	77.8	-	-
DualPrompt [44]	-	-	72.0	81.8	-	-	-	-	71.7	79.8	-	-	-	-	64.5	75.9	-	-
SLCA [59]	80.2	87.2	80.2	87.6	80.2	87.6	82.3	86.0	80.3	84.4	80.2	85.4	75.5	78.9	73.9	80.0	71.1	79.4
AttriCLIP [16]	81.9	86.8	80.9	86.3	79.6	86.0	79.2	84.3	78.5	81.8	77.4	82.1	62.8	72.2	52.5	66.4	57.1	67.4
MoE-Adapter [18]	83.8	88.0	82.1	87.9	81.0	86.8	81.2	85.9	82.9	85.9	82.6	86.2	78.4	82.5	75.1	81.4	73.9	80.5
SPU [10]	84.5	89.1	82.9	88.2	81.2	86.8	82.8	86.1	82.4	85.5	81.9	85.5	78.8	82.8	76.2	81.8	73.9	79.8
TaskRes-CL [5]	84.2	88.3	81.2	87.3	79.9	86.4	82.9	86.3	82.5	85.7	82.1	85.8	75.4	78.9	75.0	80.4	72.7	78.3
DesCLIP (Ours)	85.9	90.0	84.5	90.1	82.9	88.8	84.3	87.6	84.2	87.3	83.2	87.1	81.3	84.5	78.4	83.8	75.3	81.7
UB	90.0	-	90.0	-	90.0	-	87.6	-	87.6	-	87.6	-	84.5	-	84.5	-	84.5	-

TABLE II: Comparison of various methods on CIFAR100 [29] and CUB-200 [30] (

T=10

) with CLIP ViT-B/16 [1] backbone. ‘#FR’ represents using feature replay. ‘#AS’ represents using additional structure.

Method			CIFAR100 [29]			CUB-200 [30]
	#FR	#AS	Last	Avg	C.	Last	Avg	C.
Seq FT	✗	✗	46.3	-	24.2	45.7	-	39.7
ContinualCLIP[58]	✗	✗	68.3	-	63.6	55.1	-	63.6
L2P [44]	✗	✓	64.5	72.9	41.8	66.3	75.5	43.8
SLCA [59]	✓	✗	71.5	79.8	59.1	50.6	58.4	62.7
AttriCLIP [16]	✗	✓	67.0	77.8	60.3	50.8	65.4	63.1
SPU [10]	✗	✗	75.8	84.3	58.7	67.8	75.6	59.8
TaskRes-CL [5]	✗	✗	75.6	83.2	63.6	67.1	73.8	63.6
MoE-Adapter [18]	✗	✓	77.8	84.9	62.2	68.3	76.2	62.9
RAPF [60]	✓	✓	78.5	85.1	54.0	73.2	80.4	52.6
DesCLIP (Ours)	✗	✗	79.1	85.7	62.0	72.0	78.6	62.5

\mathcal{L}_{\text{TA}}^{i}=\mathbb{E}_{\mathbf{u}\in\mathbf{FE}^{i}}\Big{[}\beta\cdot\Big{\|}\frac{\mathbf{w}^{\prime}_{t,k}}{\|\mathbf{w}^{\prime}_{t,k}\|}-\frac{\mathbf{u}}{\|\mathbf{u}\|}\Big{\|}_{2}+\Big{(}1-\left\langle\frac{\mathbf{w}^{\prime}_{t,k}}{\|\mathbf{w}^{\prime}_{t,k}\|},\frac{\mathbf{u}}{\|\mathbf{u}\|}\right\rangle\Big{)}\Big{]},

(19)

where $\beta$ is a parameter that determines whether the alignment of text embeddings places greater emphasis on their absolute distance in the text space (e.g., Euclidean distance) or on the directional consistency. Text alignment loss $\mathcal{L}_{\text{TA}}$ of the current batch is: $\mathcal{L}_{\text{TA}}=\sum_{i\in P}\mathcal{L}_{\text{TA}}^{i}.$

Since, in the context of continual learning, data from previous tasks cannot be revisited during the current task, the weights $\mathbf{w^{\prime}}_{t}=\{\mathbf{w^{\prime}}_{t,1},\mathbf{w^{\prime}}_{t,2},...,\mathbf{w^{\prime}}_{t,|C_{t}|}\}$ are calibrated solely within the scope of the current task $t$ . Consequently, only the shifting weights $\mathbf{s}_{t}=\{\mathbf{s}_{t,1},\mathbf{s}_{t,2},...,\mathbf{s}_{t,|C_{t}|}\}$ associated with the current task are learnable, whereas the shifting weights $\mathbf{s}_{0:t-1}=\{\mathbf{s}_{0},\mathbf{s}_{1},...,\mathbf{s}_{t-1}\}$ from previous tasks remain fixed.

3) Reconstructed Intra-task Classification: To ensure alignment between text embeddings and the visual branch during optimization, we reconstruct the classification loss for task $t$ . This loss constrains the calibration of text embeddings to remain within the low-loss region of the classification space for the current task, which is a critical prerequisite. Specifically, for each visual feature $\mathbf{z}_{i}$ in a batch, its similarity to all calibrated text embeddings for the current task is computed to generate predicted logits. These logits are aligned with the ground truth label $y_{i}$ , and the prediction classification loss is calculated as:

\mathcal{L}_{\text{RIC}}=\mathbb{E}_{i\in\{1,2,...,B\}}\Bigg{[}\log\frac{-\exp\Big{(}\left\langle\frac{\mathbf{z}_{i}}{\|\mathbf{z}_{i}\|},\frac{\mathbf{w^{\prime}}_{y_{i}}}{\|\mathbf{w^{\prime}}_{y_{i}}\|}\right\rangle/\tau\Big{)}}{\sum_{k=1}^{|C_{t}|}\exp\Big{(}\left\langle\frac{\mathbf{z}_{i}}{\|\mathbf{z}_{i}\|},\frac{\mathbf{w^{\prime}}_{t,k}}{\|\mathbf{w^{\prime}}_{t,k}\|}\right\rangle/\tau\Big{)}}\Bigg{]},

(20)

where $\mathbf{w^{\prime}}_{y_{i}}$ represents the calibrated text embedding for the class corresponding to $\mathbf{z}_{i}$ .

4) Optimization: To achieve optimal performance, the total loss for batch optimization is defined as:

\min_{\Theta,\mathbf{s}_{t}}\mathcal{L}=\lambda_{\text{IM}}\cdot\mathcal{L}_{\text{IM}}+\lambda_{\text{TA}}\cdot\mathcal{L}_{\text{TA}}+\lambda_{\text{RIC}}\cdot\mathcal{L}_{\text{RIC}},

(21)

where $\lambda_{\text{IM}}$ , $\lambda_{\text{TA}}$ , and $\lambda_{\text{RIC}}$ are balancing factors that control the contributions of the respective loss terms.

III-F Inference Stage

The GA descriptions and embeddings we introduce do not participate in the inference stage of the CLIP after each training phase, which effectively avoids additional storage overhead and eliminates any increase in inference time. Consequently, the model obtained through our method maintains identical parameter size and inference time during the inference stage as the original zero-shot CLIP. In inference stage after task $t$ , we leverage the calibrated text embeddings of all seen classes: $\mathbf{W^{\prime}}_{t}=\bigcup_{j=1}^{t}\bigcup_{k=1}^{|C_{j}|}\mathbf{w^{\prime}}_{j,k}.$ Hence, the probability of predicting the testing image $x$ as the class $y_{i}$ can be expressed as:

p\left(y_{i}\mid x\right)=\frac{\exp\left(\left\langle\frac{\mathbf{z}}{\|\mathbf{z}\|},\frac{\mathbf{w^{\prime}}_{y_{i}}}{\|\mathbf{w^{\prime}}_{y_{i}}\|}\right\rangle/\tau\right)}{\sum_{\mathbf{w}^{\prime}_{j,k}\in\mathbf{W^{\prime}}_{t}}\exp\left(\left\langle\frac{\mathbf{z}}{\|\mathbf{z}\|},\frac{\mathbf{w}^{\prime}_{j,k}}{\|\mathbf{w}^{\prime}_{j,k}\|}\right\rangle/\tau\right)}.

(22)

IV Experiments

IV-A Setup

1) Datasets: The evaluation experiments for continual learning are conducted on CIFAR100 [29], ImageNet-Subset [24] and CUB-200 [30]. ImageNet-full [24] is utilized as a control set to evaluate the retention of the pretrained generalization knowledge in CLIP.

•

CIFAR100 [29] introduced a dataset comprising 60,000 images, each with a resolution of $32\times 32$ , distributed across 100 distinct classes. Each class includes 500 training samples and 100 testing samples. CIFAR100 has become a widely recognized benchmark for evaluating continual learning methods. Its low-resolution images pose a significant challenge for VLMs that are pre-trained on high-resolution datasets, highlighting the difficulty of adapting to such datasets in continual learning scenarios.
•

ImageNet-Subset [24]. Following the class configuration proposed in [32], we select the challenging ImageNet-Subset benchmark for method evaluation. This subset consists of fine-grained animal images from 100 categories that are unfamiliar to VLMs and prone to misclassification due to their subtle differences.
•

Caltech-UCSD Birds-200 (CUB-200) [30] comprises 11,788 bird images distributed across 200 categories. The subtle variations among bird images pose significant challenges for VLMs to achieve accurate identification. Moreover, the relatively small size of the dataset further intensifies the difficulty, particularly in few-shot scenarios.
•

ImageNet-full [24] is a large-scale image dataset containing over 1.2 million images across 1,000 classes, widely used for pretraining and evaluating visual representation learning models.

2) Metrics: To evaluate the continual learning performance of classification models, we employ two primary metrics: ‘Last’ and ‘Avg’. ‘Last’ denotes the average accuracy across all classes after the model has completed training on the final task. ‘Avg’ represents the mean incremental accuracy calculated over all tasks the model has learned thus far. In addition, The control set accuracy ‘C.’ [10] is employed to evaluate the retention of CLIP’s zero-shot generalization knowledge after continual learning, assessed on ImageNet-full [24] (1000 classes in total).

3) Competitors: We compare our method against baseline and state-of-the-art methods for continual learning VLMs. These include ContinualCLIP [58] under zero-shot conditions; non-continual adapting methods such as CoOp [3] and CoOpOp [4]; and conventional continual learning method such as LwF [11] and EWC [31]. Additionally, we evaluate VLM-specific continual learning methods, including AttriCLIP [16], TaskRes-CL [5], SPU [10], MoE-Adapter [18], and RAPF [60]. For a broader comparison, we also incorporate techniques tailored to address continual learning in visual-only pre-trained models, such as L2P [44], DualPrompt [45], and SLCA [59].

4) Implementation Details: All experiments are conducted on two NVIDIA GeForce RTX 3090 GPUs. Following [16], we adopt the pre-trained CLIP ViT-L/14 [1] as the backbone. To ensure a fair comparison, no data replay strategies are used during the continual learning process of the VLM.

The model is trained for 10 epochs on the datasets for each incremental task. Stochastic Gradient Descent (SGD) [61] is employed as the optimizer, using a cosine learning rate decay schedule with a batch size of 32. The learning rate for the MLPs’ weights in the CLIP’s visual encoder is set to $1\times 10^{-5}$ , while the learning rate for the text shifting weights is set to 0.1. The coefficient $\alpha$ for shifting weight addition is set to 0.1. Loss balancing factors are set as $\lambda_{\text{IM}}=2.0$ , $\lambda_{\text{TA}}=0.5$ , and $\lambda_{\text{RIC}}=1.0$ by default. For the fine-grained CUB-200 dataset [30], $\lambda_{\text{IM}}$ is adjusted to 15.0. To filter out visual features that fail to intuitively reflect attribute or class information, $\delta_{d},\gamma$ are estimated based on downstream tasks and set to 0.20, 0.015; 0.25, 0.03; and 0.30, 0.015 for CIFAR100 [29], ImageNet-Subset [24], and CUB-200 [30], respectively. Additionally, we generate $n_{dsc}=30$ attribute descriptions for each class using the language assistant GPT-4 [50].

IV-B Comparison with State-of-the-art Methods

TABLE III: Ablation study of each component on Split CUB-200 [30].

Settings	RIC	IM	TA	$T=10$		$T=20$
Settings	RIC	IM	TA	$\Delta$ Last	$\Delta$ Avg	$\Delta$ Last	$\Delta$ Avg
Zero-shot	-	-	-	60.9	66.0	60.9	66.0
V.E.(full)	✓			-7.8	-7.3	-8.2	-7.9
V.E.	✓			-2.1	-2.1	-2.7	-2.6
V.E.	✓	✓		+3.6	+3.9	+2.0	+2.2
V.E.+TEC	✓	✓		+12.5	+13.1	+11.4	+12.0
only TEC	✓			+14.1	+14.4	+11.8	+12.3
V.E.+TEC	✓	✓	✓	+17.5	+17.8	+14.4	+15.7

TABLE IV: Ablation study of different parameter selections on ImageNet-Subset [24] and CUB-200 [30].

$\lambda_{\textbf{IM}}$	threshold param $\gamma$				ImageNet-Sub [24]		CUB-200 [30]
$\lambda_{\textbf{IM}}$	0	0.015	0.03	0.05	Last	Avg	Last	Avg
0			✓		82.8	86.0	-	-
1.0			✓		83.1	86.7	-	-
2.0	✓				82.9	86.4	-	-
2.0		✓			83.5	86.9	-	-
2.0			✓		84.2	87.3	76.9	82.0
2.0				✓	83.8	87.1	-	-
4.0			✓		82.5	84.9	-	-
0		✓			-	-	75.4	81.5
10.0		✓			-	-	77.8	82.8
15.0	✓				-	-	77.6	82.0
15.0		✓			81.8	85.2	78.4	83.8
15.0			✓		-	-	78.0	83.5
20.0		✓			-	-	78.2	83.3

TABLE V: Results of different types of

\Psi(\mathbf{w},\mathbf{s})

on Split CUB-200 [30].

\varphi(\mathbf{w})

represents using class-wise adapters [62] on rudimentary text embeddings.

$\Psi(\mathbf{w},\mathbf{s})$	$T=10$		$T=20$
$\Psi(\mathbf{w},\mathbf{s})$	Last	Avg	Last	Avg
$\mathbf{w}$	64.5	69.9	62.9	68.2
$\mathbf{w}+\alpha\cdot\varphi(\mathbf{w})$	68.9	74.2	66.7	72.5
$\mathbf{w}+\alpha\cdot\mathbf{s}$	77.5	82.9	74.8	81.0
$\mathbf{w}/\\|\mathbf{w}\\|+\alpha\cdot\varphi(\mathbf{w}/\\|\mathbf{w}\\|)$	70.7	75.8	68.1	73.2
$\mathbf{w}/\\|\mathbf{w}\\|+\alpha\cdot\mathbf{s}$	78.4	83.8	75.3	81.7

The accuracy results for continual learning, including ‘Last’ and ‘Avg’, are summarized in Table I. These results are derived from comprehensive experiments conducted on multiple datasets under varying incremental task settings. Additionally, Fig. 5 depicts the forgetting curves, comparing recent methods and providing a detailed evaluation of accuracy after each incremental task.

1) Performance on Coarse Dataset: CIFAR100 [29] is selected as the coarse dataset, where VLMs generally perform well on common classes but may struggle with blurred images. As shown in Table I, our method consistently achieves superior performance across all incremental task stages, obtaining an average accuracy (‘Avg’) of 90.1% under the setting of $T=10$ . Notably, our approach surpasses ContinualCLIP [58] by a significant margin of 11.6% on the ‘Last’ metric. Moreover, it outperforms TaskRes-CL [5], MoE-Adapter [18], and SPU [10] by +3.3%, +2.4% and +1.6%, respectively.

2) Performance on Fine-grained Datasets: Table I demonstrate that our method consistently achieves the best performance across all incremental task stages on fine-grained datasets ImageNet-Subset [24] and CUB-200 [30]. On ImageNet-Subset, our method outperforms TaskRes-CL [5], SPU [10], and AttriCLIP [16] by +1.7%, +1.8%, and +5.7%, respectively. For CUB-200, our approach achieves the highest task accuracy at every stage. With $T=10$ , our method surpasses SLCA by +4.5%, SPU by +2.2%, and TaskRes-CL by +3.4% on the ‘Last’ metric.

3) Performance with CLIP ViT-B/16: Table II presents the continual learning performance across 10 tasks utilizing the efficient backbone CLIP ViT-B/16 [1]. Without relying on replay mechanisms or incorporating additional architectural components, our method almost outperforms state-of-the-art approaches on both CIFAR100 [29] and CUB-200 [30]. Notably, while achieving outstanding performance on CIFAR100 and CUB-200, our method maintains accuracies of 62.0 $\%$ and 62.5 $\%$ on the control set ImageNet-full [24], closely matching the original pretrained CLIP accuracy of 63.6 $\%$ .

4) Zero-shot Degradation or Improvement? We propose a metric to evaluate the relative accuracy changes for all classes with respect to the zero-shot CLIP model after incremental tasks. Specifically, we calculate the class-wise accuracy difference between the zero-shot CLIP and the trained CLIP model after half of the incremental tasks (50 classes), denoted as $\Delta_{50|(50,zs)}$ , and present the results in Fig. 6 (Upper). Our method achieves the highest averaged $\Delta_{50|(50,zs)}$ , reaching +9.3. This demonstrates that our approach experiences the least zero-shot degradation while achieving superior performance, attributed to the more robust visual-textual relationships established during continual learning compared to the original vision-class text relationships.

Furthermore, we calculate the accuracy difference between the ‘after all-class incremental learning’ state and the ‘only learning on the first-50 classes’ state, denoted as $\Delta_{50|(100,50)}$ , as shown in Fig. 6 (Bottom). Our method achieves an averaged $\Delta_{50|(100,50)}$ of +0.02, indicating effective knowledge transfer from previous tasks. This improvement can be attributed to the establishment of a strong connection between general visual-textual knowledge during continual learning.

5) Few-shot Performance: Fig. 7 illustrates the continual learning performance in few-shot scenarios on the CUB-200 dataset [30]. When the number of training samples per task is reduced from ‘full’ to ‘1/5’, our method maintains a competitive advantage, with an average performance drop of only -6.4. This decline is significantly smaller compared to TaskRes-CL’s -7.3 and SLCA’s -18.0, showcasing the robustness of our approach under challenging conditions.

IV-C Ablation Study

1) Effectiveness of Each Component: We conduct detailed ablation studies on various components of our DesCLIP framework. As shown in Table III, the zero-shot model is used as the baseline, and the relative improvements or declines are reported. It is evident that a fully fine-tuned visual encoder (V.E.(full)) suffers from catastrophic forgetting. Without data replay, merely fine-tuning the visual encoder (V.E.) partially (tuning the first MLPs in each Transformer block) proves inadequate for continual learning [10]. Based on V.E., by integrating instance matching (IM) with visually filtered attribute description embeddings, we achieve a relative improvement of 2 $\sim$ 3% over the zero-shot baseline. This improvement highlights the effectiveness of IM in mitigating forgetting, which typically arises from overfitting specific downstream task classes. Furthermore, the impact of text embedding calibration (TEC) is notable. The combination of V.E.+TEC+IM+TA, leveraging the filtered attribute description embeddings, yields the best performance, showcasing the synergy between these components.

2) Parameter Selection: Table IV presents the 10-task performance under different parameter configurations. Our analysis reveals that the effectiveness of the instance matching strength, $\lambda_{\textbf{IM}}$ , is highly sensitive to the specific task context. Stronger instance matching proves advantageous in scenarios requiring finer granularity, such as CUB-200 [30]. On the other hand, the filtering of attribute description embeddings plays a crucial role. Excessively conservative filtering (e.g., $\gamma=0$ ) or overly stringent filtering (e.g., $\gamma=0.05$ ) leads to noticeable performance degradation.

3) Types of Text Embedding Calibration: We demonstrate the comparisons of different types of TEC in Table V. “ $\Psi(\mathbf{w},\mathbf{s})=\mathbf{w}/\|\mathbf{w}\|+\alpha\cdot\mathbf{s}$ ” is the optimal solution, which is an calibration of the rudimentary text embedding in cosine space, which can best approximate the alignment of the representative shared attribute description embedding.

4) Generated Attribute Descriptions: We analyze the impact of the generated description amount on the performance of continual learning, as shown in Fig. 8. For coarse CIFAR100 [29], we only employed the auxiliary prompt “Please maintain diversity between descriptions” to generate $n_{dsc}$ attribute descriptions for each class. Our findings indicated that the model performed optimally when $n_{dsc}$ is set to 30. In contrast, for the fine-grained CUB-200 [30], we have discovered that incorporating additional auxiliary prompts, “Describe a/an [CLS]’s attributes from its beak, eyes, body, belly, tail, wings, breast, etc.” results in improved performance (red star in Fig. 8) even with a smaller $n_{dsc}$ . This approach enhances the quality of the descriptions, provided that the language assistant possesses sufficient knowledge of the object’s details.

IV-D Visualizations

1) Description of Compliance and Noncompliance: In Fig. 9, we present examples visual image along with attribute description sentences that match and do not match the attribute descriptions filtered by AEF. In particular, different instances on the left, also from ‘Cardinal’, select the similar attribute feature description “a stout, bright orange beak” Furthermore, the highly matched attribute “crimson red feathers that cover its entire body” filtered by the first instance is not represented in the second instance.

2) GA Descriptions Closest to Calibrated Text Embeddings: In Fig. 10, we present the descriptions that are most closely aligned with the calibrated text embeddings. For each class, we showcase three of the top five closest attribute descriptions across several classes. Notably, some common visual attribute features are reflected in these descriptions, such as “slender, pointed wings; streaked brown and white feathers; a clean white belly; and darker hues on the wings and tail” for the “Brown Creeper” category.

3) Filtered-out Visual Instances: As shown in Fig. 11, it can be observed that, compared to the retained visual instances, the filtered visual instances lack relevance to both general attributes and class-specific information.

4) t-SNE Visualization: Fig. 12 presents the t-SNE visualizations of CLIP’s visual representations and text embeddings of each class. We employed distribution alignment to transform text embeddings into the visual representation space, rather than using the original text embedding space. Intuitively, compared to Zero-shot, TaskRes-CL [5] adjusts text embeddings in downstream incremental tasks without optimizing visual representations; SPU [10] optimizes visual representations but struggles to align visual-text representations for unfamiliar classes. In contrast, our method achieves superior alignment of visual-text representations in incremental tasks.

V Limitations

Observed in experimental trials, The effectiveness of our DesCLIP framework hinges on the language assistant (or real experts) being knowledgeable about the general attribute features of objects. This can pose challenges in certain applications that require indirect reasoning, such as StanfordCars [63], where it is difficult to accurately describe the representative features of a vehicle model associated with a specific license plate. Additionally, there are high standards for the quality of attribute descriptions; inappropriate prompts can introduce domain bias, ultimately hindering the ability of the Anchor-based embedding filter (AEF) to select highly relevant attribute features.

VI Conclusions

Current research for VLM-based continual learning predominantly emphasizes to connect visual inputs and specific new-class text in downstream tasks, frequently neglecting the latent relationship between general knowledge and specialized knowledge for VLMs. In this paper, we propose DesCLIP, a framework that harnesses general attribute (GA) descriptions to enhance VLMs in establishing robust vision-GA-class text associations. By going beyond the traditional connections between visual inputs and class texts, DesCLIP employs a language assistant to generate candidates for attribute descriptions through tailored prompting. Additionally, we implemented an anchor-based embedding filter (AEF) to extract highly relevant description embeddings, which serve as paired text embeddings for instance matching (IM). Addtionally, we perform text embedding calibration (TEC) which allows for the progressive calibration of rudimentary text embeddings to align with representative GA representations. Our extensive experiments validate the effectiveness and advancements of DesCLIP, demonstrating its superior performance over existing pretrained and VLM-based continual learning methods.

References

[1] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International conference on machine learning. PMLR, 2021, pp. 8748–8763.
[2] C. Jia, Y. Yang, Y. Xia, Y.-T. Chen, Z. Parekh, H. Pham, Q. Le, Y.-H. Sung, Z. Li, and T. Duerig, “Scaling up visual and vision-language representation learning with noisy text supervision,” in International conference on machine learning. PMLR, 2021, pp. 4904–4916.
[3] K. Zhou, J. Yang, C. C. Loy, and Z. Liu, “Learning to prompt for vision-language models,” International Journal of Computer Vision, vol. 130, no. 9, pp. 2337–2348, 2022.
[4] ——, “Conditional prompt learning for vision-language models,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 16 816–16 825.
[5] T. Yu, Z. Lu, X. Jin, Z. Chen, and X. Wang, “Task residual for tuning vision-language models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 10 899–10 909.
[6] Y. Xing, Q. Wu, D. Cheng, S. Zhang, G. Liang, P. Wang, and Y. Zhang, “Dual modality prompt tuning for vision-language pre-trained model,” IEEE Transactions on Multimedia, vol. 26, pp. 2056–2068, 2024.
[7] H. Yao, R. Zhang, and C. Xu, “Visual-language prompt tuning with knowledge-guided context optimization,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 6757–6767.
[8] ——, “Tcp: Textual-based class-aware prompt tuning for visual-language model,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 23 438–23 448.
[9] Z. Zheng, M. Ma, K. Wang, Z. Qin, X. Yue, and Y. You, “Preventing zero-shot transfer degradation in continual learning of vision-language models,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 19 125–19 136.
[10] W. Zhang, P. Janson, R. Aljundi, and M. Elhoseiny, “Overcoming generic knowledge loss with selective parameter update,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 24 046–24 056.
[11] Z. Li and D. Hoiem, “Learning without forgetting,” IEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 12, pp. 2935–2947, 2017.
[12] W. Li, B.-B. Gao, B. Xia, J. Wang, J. Liu, Y. Liu, C. Wang, and F. Zheng, “Cross-modal alternating learning with task-aware representations for continual learning,” IEEE Transactions on Multimedia, vol. 26, pp. 5911–5924, 2024.
[13] Q. Wang, R. Wang, Y. Li, D. Wei, H. Wang, K. Ma, Y. Zheng, and D. Meng, “Relational experience replay: Continual learning by adaptively tuning task-wise relationship,” IEEE Transactions on Multimedia, vol. 26, pp. 9683–9698, 2024.
[14] Y. Cui, W. Deng, X. Xu, Z. Liu, Z. Liu, M. Pietikäinen, and L. Liu, “Uncertainty-guided semi-supervised few-shot class-incremental learning with knowledge distillation,” IEEE Transactions on Multimedia, vol. 25, pp. 6422–6435, 2023.
[15] S. Kornblith, M. Norouzi, H. Lee, and G. Hinton, “Similarity of neural network representations revisited,” in International conference on machine learning. PMLR, 2019, pp. 3519–3529.
[16] R. Wang, X. Duan, G. Kang, J. Liu, S. Lin, S. Xu, J. Lü, and B. Zhang, “Attriclip: A non-incremental learner for incremental knowledge learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 3654–3663.
[17] J. Xiang, T. Tao, Y. Gu, T. Shu, Z. Wang, Z. Yang, and Z. Hu, “Language models meet world models: Embodied experiences enhance language models,” Advances in neural information processing systems, vol. 36, 2024.
[18] J. Yu, Y. Zhuge, L. Zhang, P. Hu, D. Wang, H. Lu, and Y. He, “Boosting continual learning of vision-language models via mixture-of-experts adapters,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 23 219–23 230.
[19] O. Pantazis, G. Brostow, K. Jones, and O. Mac Aodha, “Svl-adapter: Self-supervised adapter for vision-language pretrained models,” arXiv preprint arXiv:2210.03794, 2022.
[20] Y. Xin, J. Du, Q. Wang, Z. Lin, and K. Yan, “Vmt-adapter: Parameter-efficient transfer learning for multi-task dense scene understanding,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 14, 2024, pp. 16 085–16 093.
[21] Z. Wang, Z. Zhan, Y. Gong, G. Yuan, W. Niu, T. Jian, B. Ren, S. Ioannidis, Y. Wang, and J. Dy, “Sparcl: Sparse continual learning on the edge,” Advances in Neural Information Processing Systems, vol. 35, pp. 20 366–20 380, 2022.
[22] T. Konishi, M. Kurokawa, C. Ono, Z. Ke, G. Kim, and B. Liu, “Parameter-level soft-masking for continual learning,” in International Conference on Machine Learning. PMLR, 2023, pp. 17 492–17 505.
[23] Y.-C. Yu, C.-P. Huang, J.-J. Chen, K.-P. Chang, Y.-H. Lai, F.-E. Yang, and Y.-C. F. Wang, “Select and distill: Selective dual-teacher knowledge transfer for continual learning on vision-language models,” in European Conference on Computer Vision. Springer, 2025, pp. 219–236.
[24] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE conference on computer vision and pattern recognition. Ieee, 2009, pp. 248–255.
[25] S. Pratt, I. Covert, R. Liu, and A. Farhadi, “What does a platypus look like? generating customized prompts for zero-shot image classification,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 15 691–15 701.
[26] C. Yi, L. Ren, D.-C. Zhan, and H.-J. Ye, “Leveraging cross-modal neighbor representation for improved clip classification,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 27 402–27 411.
[27] K. Song, H. Ma, B. Zou, H. Zhang, and W. Huang, “Fd-align: feature discrimination alignment for fine-tuning pre-trained models in few-shot learning,” Advances in Neural Information Processing Systems, vol. 36, 2024.
[28] O. Saha, G. Van Horn, and S. Maji, “Improved zero-shot classification by adapting vlms with text descriptions,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 17 542–17 552.
[29] A. Krizhevsky, G. Hinton et al., “Learning multiple layers of features from tiny images,” Toronto, ON, Canada, 2009.
[30] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie, “The caltech-ucsd birds-200-2011 dataset,” California Institute of Technology, 2011.
[31] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska et al., “Overcoming catastrophic forgetting in neural networks,” Proceedings of the national academy of sciences, vol. 114, no. 13, pp. 3521–3526, 2017.
[32] A. Douillard, M. Cord, C. Ollion, T. Robert, and E. Valle, “Podnet: Pooled outputs distillation for small-tasks incremental learning,” in Computer vision–ECCV 2020: 16th European conference, Glasgow, UK, August 23–28, 2020, proceedings, part XX 16. Springer, 2020, pp. 86–102.
[33] S.-A. Rebuffi, A. Kolesnikov, G. Sperl, and C. H. Lampert, “icarl: Incremental classifier and representation learning,” in Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, 2017, pp. 2001–2010.
[34] S. Yan, J. Xie, and X. He, “Der: Dynamically expandable representation for class incremental learning,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 3014–3023.
[35] M. Boschini, L. Bonicelli, P. Buzzega, A. Porrello, and S. Calderara, “Class-incremental continual learning into the extended der-verse,” IEEE transactions on pattern analysis and machine intelligence, vol. 45, no. 5, pp. 5497–5512, 2022.
[36] F.-Y. Wang, D.-W. Zhou, H.-J. Ye, and D.-C. Zhan, “Foster: Feature boosting and compression for class-incremental learning,” in European conference on computer vision. Springer, 2022, pp. 398–414.
[37] D.-W. Zhou, Q.-W. Wang, H.-J. Ye, and D.-C. Zhan, “A model or 603 exemplars: Towards memory-efficient class-incremental learning,” arXiv preprint arXiv:2205.13218, 2022.
[38] R. Wang, Y. Bao, B. Zhang, J. Liu, W. Zhu, and G. Guo, “Anti-retroactive interference for lifelong learning,” in European Conference on Computer Vision. Springer, 2022, pp. 163–178.
[39] F. Zhu, Z. Cheng, X.-Y. Zhang, and C.-l. Liu, “Class-incremental learning via dual augmentation,” Advances in Neural Information Processing Systems, vol. 34, pp. 14 306–14 318, 2021.
[40] F. Zhu, X.-Y. Zhang, C. Wang, F. Yin, and C.-L. Liu, “Prototype augmentation and self-supervision for incremental learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 5871–5880.
[41] G. Petit, A. Popescu, H. Schindler, D. Picard, and B. Delezoide, “Fetril: Feature translation for exemplar-free class-incremental learning,” in Proceedings of the IEEE/CVF winter conference on applications of computer vision, 2023, pp. 3911–3920.
[42] D.-W. Zhou, H.-L. Sun, H.-J. Ye, and D.-C. Zhan, “Expandable subspace ensemble for pre-trained model-based class-incremental learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 23 554–23 564.
[43] D.-W. Zhou, Z.-W. Cai, H.-J. Ye, D.-C. Zhan, and Z. Liu, “Revisiting class-incremental learning with pre-trained models: Generalizability and adaptivity are all you need,” arXiv preprint arXiv:2303.07338, 2023.
[44] Z. Wang, Z. Zhang, C.-Y. Lee, H. Zhang, R. Sun, X. Ren, G. Su, V. Perot, J. Dy, and T. Pfister, “Learning to prompt for continual learning,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 139–149.
[45] Z. Wang, Z. Zhang, S. Ebrahimi, R. Sun, H. Zhang, C.-Y. Lee, X. Ren, G. Su, V. Perot, J. Dy et al., “Dualprompt: Complementary prompting for rehearsal-free continual learning,” in European Conference on Computer Vision. Springer, 2022, pp. 631–648.
[46] J. S. Smith, L. Karlinsky, V. Gutta, P. Cascante-Bonilla, D. Kim, A. Arbelle, R. Panda, R. Feris, and Z. Kira, “Coda-prompt: Continual decomposed attention-based prompting for rehearsal-free continual learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 11 909–11 919.
[47] Z. Gao, J. Cen, and X. Chang, “Consistent prompting for rehearsal-free continual learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 28 463–28 473.
[48] J. Li, D. Li, C. Xiong, and S. Hoi, “Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation,” in International conference on machine learning. PMLR, 2022, pp. 12 888–12 900.
[49] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo et al., “Segment anything,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 4015–4026.
[50] OpenAI, “Gpt-4 technical report,” 2023. [Online]. Available: https://openai.com/research/gpt-4
[51] J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds et al., “Flamingo: a visual language model for few-shot learning,” Advances in neural information processing systems, vol. 35, pp. 23 716–23 736, 2022.
[52] W. Zhang, L. Wu, Z. Zhang, T. Yu, C. Ma, X. Jin, X. Yang, and W. Zeng, “Unleash the power of vision-language models by visual attention prompt and multi-modal interaction,” IEEE Transactions on Multimedia, pp. 1–13, 2024.
[53] X. Li, D. Lian, Z. Lu, J. Bai, Z. Chen, and X. Wang, “Graphadapter: Tuning vision-language models with dual knowledge graph,” Advances in Neural Information Processing Systems, vol. 36, 2024.
[54] Y. Zhang, C. Zhang, K. Yu, Y. Tang, and Z. He, “Concept-guided prompt learning for generalization in vision-language models,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 7, 2024, pp. 7377–7386.
[55] W. Wu, Z. Sun, Y. Song, J. Wang, and W. Ouyang, “Transferring vision-language models for visual recognition: A classifier perspective,” International Journal of Computer Vision, vol. 132, no. 2, pp. 392–409, 2024.
[56] G. Luo, Y. Zhou, T. Ren, S. Chen, X. Sun, and R. Ji, “Cheap and quick: Efficient vision-language instruction tuning for large language models,” Advances in Neural Information Processing Systems, vol. 36, 2024.
[57] Y. Gao, J. Liu, Z. Xu, J. Zhang, K. Li, R. Ji, and C. Shen, “Pyramidclip: Hierarchical feature alignment for vision-language model pretraining,” Advances in neural information processing systems, vol. 35, pp. 35 959–35 970, 2022.
[58] V. Thengane, S. Khan, M. Hayat, and F. Khan, “Clip model is an efficient continual learner,” arXiv preprint arXiv:2210.03114, 2022.
[59] G. Zhang, L. Wang, G. Kang, L. Chen, and Y. Wei, “Slca: Slow learner with classifier alignment for continual learning on a pre-trained model,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 19 148–19 158.
[60] L. Huang, X. Cao, H. Lu, and X. Liu, “Class-incremental learning with clip: Adaptive representation adjustment and parameter fusion,” in European Conference on Computer Vision. Springer, 2025, pp. 214–231.
[61] L. Bottou, “Large-scale machine learning with stochastic gradient descent,” in Proceedings of the 19th International Conference on Computational Statistics (COMPSTAT). Springer, 2010, pp. 177–186.
[62] P. Gao, S. Geng, R. Zhang, T. Ma, R. Fang, Y. Zhang, H. Li, and Y. Qiao, “Clip-adapter: Better vision-language models with feature adapters,” International Journal of Computer Vision, vol. 132, no. 2, pp. 581–595, 2024.
[63] J. Krause, M. Stark, J. Deng, and L. Fei-Fei, “3d object representations for fine-grained categorization,” in Proceedings of the IEEE international conference on computer vision workshops, 2013, pp. 554–561.