
LVP-CLIP: Revisiting CLIP for Continual Learning with Label Vector Pool

Yue Ma
Syracuse University
[email protected]
   Huantao Ren
Syracuse University
[email protected]
   Boyu Wang
Syracuse University
[email protected]
   Jingang Jin
Syracuse University
[email protected]
   Senem Velipasalar
Syracuse University
[email protected]
   Qinru Qiu
Syracuse University
[email protected]
Abstract

Continual learning aims to update a model so that it can sequentially learn new tasks without forgetting previously acquired knowledge. Recent continual learning approaches often leverage the vision-language model CLIP for its high-dimensional feature space and cross-modality feature matching. Traditional CLIP-based classification methods identify the most similar text label for a test image by comparing their embeddings. However, these methods are sensitive to the quality of text phrases and less effective for classes lacking meaningful text labels. In this work, we rethink CLIP-based continual learning and introduce the concept of Label Vector Pool (LVP). LVP replaces text labels with training images as similarity references, eliminating the need for ideal text descriptions. We present three variations of LVP and evaluate their performance on class- and domain-incremental learning tasks. Leveraging CLIP’s high-dimensional feature space, LVP learning algorithms are task-order invariant. New knowledge does not modify old knowledge, hence there is minimal forgetting. Different tasks can be learned independently and in parallel with low computational and memory demands. Experimental results show that the proposed LVP-based methods outperform the current state-of-the-art baseline by a significant margin of 40.7%.

1 Introduction

Refer to caption
Figure 1: Comparison with traditional CLIP-based approaches. While traditional methods compare similarity between the encoded test image and text labels, our approach evaluates similarity between image embeddings directly and makes the text encoder play an auxiliary role when possible.

Deep neural networks trained by using supervised learning have achieved remarkable accuracy in classification tasks. Their effectiveness relies on the assumption that the training data distribution fully and accurately represents the testing data. However, in real-world scenarios, data samples are often not available all at once. New classes of knowledge are discovered sequentially and corresponding training data arrives in stages. Continual learning addresses this by incrementally training a model to effectively learn new tasks without catastrophic forgetting [21] of previously acquired knowledge. Recent continual learning approaches frequently utilize the vision-language model CLIP (Contrastive Language-Image Pretraining) [23] due to its high-dimensional feature space and ability to match features across different modalities.

CLIP bridges the gap between language and vision through contrastive learning, enabling zero-shot classification by matching a list of text embeddings to the test image embedding. The primary challenge of CLIP lies in the vast search space for the optimal text, since different text inputs can yield highly different results. Recent research works have focused on improving the text embedding quality. For example, PointClip [41] leverages human knowledge and experience to design text descriptions for point cloud data. PointClipV2 [46] uses language models, like GPT-3 [4], to generate enhanced text. Methods like CoOp [45] and CoCoOp [44] further improve performance by incorporating trainable parameters into the text encoder. Yet, all these approaches still classify the input image by searching for the best matching text embedding.

L2P [36] first introduces the prompt-pool on pre-trained ViT [8] for continual learning, leveraging trainable prompts inside a pool to retain knowledge from different tasks. Following works [35, 33, 28, 30] improve L2P by either redesigning the prompts and prompt pools or refining the matching procedure between the prompts and embeddings. A challenge for prompt-pool-based methods is the uncertainty in the matching procedure between the test embedding and the prompt key. Although various optimization objectives are designed to improve matching, there is no guarantee that the optimal prompt will be selected for a given test image. If an incorrect prompt key is chosen, leading to the selection of wrong prompts, the result is more likely to be inaccurate, since the prompt may introduce biased information instead of providing helpful knowledge. Furthermore, as the same set of prompts is updated across different tasks, forgetting is unavoidable.

In this work, we revisit image classification and continual learning by introducing a novel concept, referred to as the Label Vector Pool (LVP). As shown in Fig. 1, instead of searching for the optimal text label phrases and relying only on their embeddings for similarity comparison, we utilize the image embeddings generated from training images as references. The proposed approach, utilizing LVP, enables CLIP-based incremental learning models to reduce if not remove their dependency on the quality of label phrases. LVP allows CLIP to adapt to a wide range of datasets, particularly those with classes difficult to describe in text or class names that have no semantic meaning, such as “ZIL103” or “ZSU234”, etc. Furthermore, feature vectors from the same modality tend to cluster more closely than those from different modalities, leading to improved classification accuracy.

Since we use CLIP as the foundation model and leverage its embedding capabilities, we refer to our method as LVP-CLIP, which is a completely different approach compared to prompt-pool-based methods. Instead of constructing a prompt pool that provides additional features for the classifier, we consolidate the knowledge of each class into a single vector and store it directly in the pool. This single vector can be formed from training image embeddings or as a combination of image and text embeddings. Given a test image, we simply calculate the similarity between its embedding and each vector in LVP, then select the class with the highest similarity.

Most existing class-incremental and domain-incremental learning algorithms assume that the upper bound of the class count is known in advance. This assumption is necessary because those methods typically use an MLP-based classifier, where the number of classes determines the output layer size. When the number of categories exceeds the classifier’s predefined output dimension, the MLP must be expanded to accommodate the larger output and the entire model must be fine-tuned with data from all previously learned tasks. While a low upper bound results in frequent adjustments to the MLP size and parameter fine-tuning, a high upper bound leads to over-provisioning. In contrast, our proposed LVP-CLIP imposes no restriction on the number of classes. Its memory cost grows linearly at a very low rate as the number of classes increases. In fact, we demonstrate its scalability by applying it to a cross-dataset, cross-domain incremental learning task, Cross-Task Incremental Learning (CTIL), which includes 595 classes, as discussed in detail in Sec. 5.

LVP-CLIP avoids performance degradation as the number of tasks increases, since learning new information does not modify previously stored knowledge. In other words, it treats incremental learning and batch learning equivalently. As a similarity-based approach, it simply searches for the embedding vector in the feature space that is most similar to the input vector. The high dimensionality of its feature space provides substantial memory capacity. The main contributions of this work include the following:

  • We propose the concept of Label Vector Pool (LVP) by revisiting the classification procedure of CLIP. As a similarity-based approach like CLIP, LVP uses image embeddings or a mixture of image and text embeddings as references, and is hence more flexible and less biased toward any one modality than CLIP.

  • We present three variations of LVP, namely LVP-CLIP-I, LVP-CLIP-IT and LVP-CLIP-C. We evaluate our methods in class-incremental, domain-incremental and cross-task incremental settings and outperform SOTA methods.

  • The proposed LVP-based incremental learning has orders of magnitude lower computational complexity in learning and 2x lower inference complexity compared to baselines.

  • The performance of LVP-based learning scales to a large number of tasks. We demonstrate that LVP has outstanding memory capacity and learning capability using an experimental setting that combines 4 commonly used incremental learning datasets with 595 classes in total.

2 Related Work

Conventional solutions for continual learning. Existing continual learning algorithms can be classified into three categories, namely regularization-based, architecture-based, and rehearsal-based methods. Regularization-based methods [1, 17, 40, 12, 11, 39] constrain the trainable parameters by limiting the learning rate of parameters that are important for old tasks. Although these methods do not require additional memory for replay buffers or additional model parameters, they have limited performance on complex datasets. Architecture-based methods [15, 38, 43, 42, 34, 26, 20] create separate parameters for each task to bypass catastrophic forgetting. Rehearsal-based methods [5, 6, 2, 3, 37, 25] maintain a buffer of data from past tasks. During optimization for new tasks, the model is also trained on buffered data from previous tasks to mitigate forgetting. The performance of these methods is limited by the buffer size, and as the buffer size decreases the performance declines sharply. Additionally, these methods are challenging to apply when data is private or cannot be stored [27]. Our LVP-CLIP addresses the continual learning challenge by extracting class knowledge directly from CLIP without relying on a rehearsal buffer.

Prompt-based continual learning. Almost all recent works that achieve noteworthy performance in continual learning use a prompt-based architecture. After L2P [36] first proposed visual prompting, which constructs a prompt pool for continual learning tasks, prompt-pool-based methods have become the dominant approach due to their strong performance. DualPrompt [35] improves the pool design with task-invariant prompts (G-Prompt) for general knowledge and task-specific prompts (E-Prompt) for expert knowledge. S-Prompts [33] designs domain-specific prompts and uses a K-NN operation as a domain identifier during inference. AttriClip [32] adapts the prompt pool for CLIP by maintaining a prompt pool for the text encoder. Prompt-pool-based methods may suffer from the mismatching problem, since the wrong prompt can be selected during inference. In addition, prompts add computational complexity in both training and inference. Neither challenge applies to LVP. All of these recent continual learning works rely on pre-trained models, such as ViT [8] and CLIP [23]. Hence, we adopt the same foundation model for feature extraction and embedding.

3 Preliminaries

For clarity, in the rest of the paper we use a tilde ($\tilde{\cdot}$) and a hat ($\hat{\cdot}$) to mark symbols associated with the training and testing sets, respectively. Superscripts denote the index of classes or tasks, and subscripts denote the index of training/testing instances.

3.1 Continual Learning

We consider a sequence of tasks $S=\{S^{1},\cdots,S^{M}\}$, where $M$ is the total number of tasks. Each task is a set $S^{t}=\{(x_{1}^{t},y_{1}^{t}),\cdots,(x_{n^{t}}^{t},y_{n^{t}}^{t})\}$, $t\in[1,\ldots,M]$, where $(x_{i}^{t},y_{i}^{t})$, $i\in[1,\ldots,n^{t}]$, is a pair containing the input $x_{i}^{t}\in X$ and its corresponding label $y_{i}^{t}\in Y$, and $n^{t}$ is the total number of samples. The goal of continual learning is to train a model $f(\theta):X\rightarrow Y$ continuously over time on a set of tasks, arriving sequentially, such that it learns the new tasks without forgetting the old ones.

Task-, Domain- and Class-Incremental Learning (TIL, DIL, CIL) are the three main scenarios of continual learning. Domain-incremental learning assumes that the number and labels of classes remain consistent across tasks, and the only difference among tasks is the distribution of the input data. Both task- and class-incremental learning address the scenario where each task introduces a distinct set of new classes to be learned. Task-incremental learning assumes a known task identity at inference time, whereas class-incremental learning does not make such an assumption.

3.2 CLIP

CLIP is a vision-language model trained on text-image pairs. It consists of a text encoder $g_{\theta_{t}}(\cdot)$ and an image encoder $f_{\theta_{i}}(\cdot)$. Given a sentence $txt$ and an image $img\in\mathbb{R}^{H\times W\times C}$, the text and image embeddings are $g_{\theta_{t}}(txt)=T$ and $f_{\theta_{i}}(img)=I$, respectively, where $T,I\in\mathbb{R}^{D}$ and $D$ denotes the dimension of the embeddings. Given a dataset containing $K$ classes, $txt$ is a phrase like “a photo of a [$y$]”, where $y\in[1,\ldots,K]$ is the index of the label and [$y$] denotes the class name of the label. The probability of labeling the test image $img$ with the class $y$ is computed as:

p(y|img)=\frac{\exp(\langle T^{y},I\rangle)}{\sum^{K}_{k=1}\exp(\langle T^{k},I\rangle)}, \qquad (1)

where $\langle\cdot,\cdot\rangle$ denotes the similarity function. Three commonly used similarity functions are the L1 norm, the L2 norm and cosine similarity, denoted by $\langle\cdot,\cdot\rangle_{L1}$, $\langle\cdot,\cdot\rangle_{L2}$ and $\langle\cdot,\cdot\rangle_{Cos}$, respectively. To classify an image, the class with the highest probability is chosen, i.e.,

\hat{y}=\operatorname*{argmax}_{k}\,p(k|img), \quad k\in[1,\ldots,K]. \qquad (2)
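For concreteness, the sketch below illustrates this zero-shot classification pipeline. It assumes the publicly released `clip` package and the ViT-L/14 backbone used later in the paper; the class names and prompt template are placeholders.

```python
# Minimal sketch of CLIP zero-shot classification (Eqs. 1-2), assuming the
# publicly released "clip" package and the ViT-L/14 backbone; class names
# and the prompt template are placeholders.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14", device=device)

class_names = ["cat", "dog", "airplane"]               # K class names (illustrative)
prompts = [f"a photo of a {c}" for c in class_names]   # txt for each class

with torch.no_grad():
    T = model.encode_text(clip.tokenize(prompts).to(device))  # (K, D) text embeddings
    T = T / T.norm(dim=-1, keepdim=True)

def classify(image_pil):
    """Return the predicted class index for a PIL image (Eq. 2)."""
    with torch.no_grad():
        I = model.encode_image(preprocess(image_pil).unsqueeze(0).to(device))  # (1, D)
        I = I / I.norm(dim=-1, keepdim=True)
        probs = (I @ T.t()).softmax(dim=-1)   # cosine similarity <T^k, I>, then Eq. 1
        return probs.argmax(dim=-1).item()
```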

4 Methodology

In this section, we first introduce LVP, a generalization of CLIP’s classification procedure. Then, we design three continual learning methods utilizing LVP, namely LVP-CLIP-I, LVP-CLIP-IT and LVP-CLIP-C. Finally, we describe the loss functions used to train these methods.

4.1 Label Vector Pool (LVP)

As discussed in Sec. 3.2, CLIP classifies an input image by comparing the distance between the image embedding and a list of text embeddings that represent class labels. The query vector with unknown ID is compared against a group of key vectors with known IDs, and the query vector is assigned the ID of the key vector with the highest similarity. Following this concept, we define a label vector $L\in\mathbb{R}^{D}$ as a vector with a known ID, such as a class name, domain name, or task ID. An LVP for class $k$ is a set of label vectors with ID $k$, denoted as $L^{k}=\{L^{k}_{1},L^{k}_{2},\cdots,L^{k}_{P^{k}}\}$, where $L^{k}_{i}\in\mathbb{R}^{D}$, $i\in[1,\ldots,P^{k}]$, $k\in[1,\ldots,K]$, and $P^{k}$ is the pool size for class $k$. The similarity between an image embedding $I$ and $L^{k}$ is computed as the maximum similarity between $I$ and all instances in $L^{k}$:

\langle L^{k},I\rangle=\max\big(\langle L^{k}_{1},I\rangle,\langle L^{k}_{2},I\rangle,\cdots,\langle L^{k}_{P^{k}},I\rangle\big). \qquad (3)

The probability that a testing image $img$ belongs to class $y$ can be computed as:

p(y|img)=\frac{\exp(\langle L^{y},I\rangle)}{\sum^{K}_{k=1}\exp(\langle L^{k},I\rangle)}. \qquad (4)

As can be seen, Eq. 1 is a special case of Eq. 4, where the LVP contains only one labeled vector, namely the text embedding of the class name, i.e., $L^{k}=\{T^{k}\}$ and $P^{k}=1$. We denote the computational complexity of classifying one image as $O$ and measure it by the number of times the similarity function is evaluated, i.e., $O=\sum_{k=1}^{K}P^{k}$. If all classes have the same pool size $P$, the complexity simplifies to $O=P\times K$.
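The following sketch illustrates Eqs. 3 and 4 for generic label vector pools. It assumes embeddings have already been extracted and L2-normalized so that a dot product equals cosine similarity; all names are illustrative.

```python
# Sketch of LVP-based classification (Eqs. 3-4). Assumes embeddings are already
# extracted and L2-normalized, so a dot product equals cosine similarity;
# lvp[k] is a (P_k, D) array for class k and I is a (D,) test embedding.
import numpy as np

def lvp_similarity(lvp_k, I):
    """<L^k, I>: maximum similarity over all labeled vectors in the class-k pool (Eq. 3)."""
    return float(np.max(lvp_k @ I))

def lvp_classify(lvp, I):
    """Pick the class whose pool contains the most similar labeled vector (Eq. 4)."""
    sims = np.array([lvp_similarity(lvp_k, I) for lvp_k in lvp])
    probs = np.exp(sims - sims.max())
    probs /= probs.sum()              # softmax over classes
    return int(np.argmax(probs))
```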

Our LVP-based approach is a general framework for similarity-based classification and extends CLIP in two ways: (1) the test image is no longer restricted to comparison with text embeddings; it can be compared with any labeled vector of matching dimension; (2) each class can be represented by one or multiple labeled vectors, which may be obtained from different modalities.

Refer to caption
Figure 2: The hypothesis is that embeddings in the same modality should be more similar to each other. The training image embedding $\tilde{I}_{1}$ is expected to be more similar to the test image embedding $\hat{I}_{1}$ than the text embeddings $T_{1}$ and $T_{2}$, i.e., $\langle\hat{I}_{1},\tilde{I}_{1}\rangle>\langle\hat{I}_{1},T_{1}\rangle$ and $\langle\hat{I}_{1},\tilde{I}_{1}\rangle>\langle\hat{I}_{1},T_{2}\rangle$.

4.2 Motivation for Using Image LVP

The training of CLIP ensures that matching images and text phrases are mapped to nearby locations in the feature space. Intuitively, we expect that embeddings of inputs from the same modality (e.g., image-to-image or text-to-text) will be closer to each other than those from different modalities, as illustrated in Fig. 2. As discussed in Sec. 4.1, the task of classification is to identify the labeled vector most similar to the query. This raises an interesting question: can we use image embeddings by themselves in the LVP instead of text embeddings?

LVP source 10% 30% 50% 70% 100% text
P 50 150 250 350 500 1
O 5000 15000 25000 35000 50000 100
Acc. 71.0 74.8 76.1 76.9 78.2 73.3
Table 1: Testing accuracy using embeddings of the training images (columns 2-6) and label text (column 7) as the LVP on CIFAR100. The percentage values in the first row represent the proportion of the training set included in the LVP. The second (P) and third (O) rows show the corresponding LVP size and the computational complexity of testing a single image.
Refer to caption
Figure 3: Framework of LVP-CLIP. First, the concept of LVP is illustrated. Second, three realizations of LVP are shown: LVP-T, LVP-I and LVP-IT. LVP-T, which corresponds to zero-shot classification, is the LVP generated from the text encoder. Our proposed LVP-I is the mean of the image embeddings of each class in the training set. LVP-IT is obtained as a combination of LVP-T and LVP-I with the task-specific trainable parameters $\alpha,\beta$ of each class. In addition, LVP-C is a classifier optimized on the LVP.
Refer to caption
Figure 4: Distributions of the same feature across different classes in the CIFAR100 training set. We examine the first 4 feature distributions for 3 classes, denoted as $E^{k}_{i}$, $i\in[1,4]$, $k\in[1,3]$. Panels (a) to (d) show the distributions of $E^{k}_{1}$ to $E^{k}_{4}$ for these three classes. The vertical lines represent the mean values of each distribution. As shown, all features approximately follow a Gaussian distribution, with different combinations of means across different classes.

We verified this idea on the CIFAR100 [14] dataset by using the embeddings of the training images as the LVP. The experimental results are shown in Tab. 1. When 30% of the training data is used as the LVP, the classification accuracy already surpasses that achieved by using the text labels as the LVP. With the entire training set used as the LVP, testing accuracy improves by almost 5% compared to using text labels. Since we use CLIP as the foundation model and leverage its powerful embedding capability, we refer to our approach as LVP-CLIP. The drawback of LVP-CLIP is that the computational and memory complexity increases as the pool size $P$ grows, which is addressed next.

4.3 Designing More Efficient LVP

To address the computational and memory complexity while maintaining performance, we propose three different designs using LVP: (i) LVP-CLIP-I uses only image embeddings in the LVP, and reduces the pool size PP to 1, so that its inference complexity is the same as the zero-shot learning while providing better accuracy; (ii) LVP-CLIP-IT extends the LVP by combining text embeddings with image embeddings; (iii) Unlike LVP-CLIP-I and LVP-CLIP-IT, LVP-CLIP-C does not use the similarity function to compare the test image with LVP. It trains a classifier using information stored in LVP and applies it for classification.

The overall framework for the three variations of LVP-CLIP is shown in Fig. 3. The traditional CLIP-based classifier compares the encoded test image with the text-embedding of the class label. This can be considered as a special case of LVP, where the label vector is the text embedding. Therefore, we also refer to it as LVP-CLIP-T.

Reducing LVP Size (LVP-CLIP-I). As shown in Sec. 4.2, using the whole training set as the LVP is effective, yet expensive in terms of computation and memory. If we can keep only one vector in the LVP per class, which one would best represent the whole training set of that class? Each image embedding $I^{k}$ is a set of image features, $I^{k}=\{E^{k}_{1},E^{k}_{2},\cdots,E^{k}_{D}\}$, where $E^{k}_{i}\in\mathbb{R}$. We plot the distributions of these features within the same class and compare them across different example classes in Fig. 4. In the figure, the superscript of a feature indicates the class index and the subscript the feature index. As can be seen, within a class, each feature roughly follows a Gaussian distribution. Across classes, the features have different combinations of means. Therefore, we use the mean value as the representative label vector of each class,

L^{k}=\{\bar{\tilde{I}}^{k}\}, \text{ and } \bar{\tilde{I}}^{k}=\frac{1}{\tilde{N}^{k}}\sum^{\tilde{N}^{k}}_{i=1}\tilde{I}^{k}_{i}, \qquad (5)

where $\tilde{I}^{k}_{i}\in\mathbb{R}^{D}$ is a training image embedding of class $k\in[1,K]$ and $\tilde{N}^{k}$ is the total number of training images of class $k$. As shown in Tab. 2, by using the average embedding of the training data of each class as the LVP, LVP-CLIP-I not only reduces the inference complexity but also improves the accuracy by about 2%.
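A minimal sketch of LVP-CLIP-I, under the assumption that the CLIP image embeddings of the training set have already been computed and that class labels are indexed 0..K-1; the L1 distance is used at inference as in Sec. 5, and all names are illustrative.

```python
# Minimal sketch of LVP-CLIP-I (Eq. 5). Assumes CLIP image embeddings of the
# training set are precomputed and class labels are indexed 0..K-1; inference
# uses the L1 distance, as in Sec. 5. All names are illustrative.
import numpy as np

def build_lvp_i(train_embeds, train_labels):
    """Return a (K, D) pool with one mean embedding per class."""
    classes = np.unique(train_labels)            # assumed to be 0..K-1
    return np.stack([train_embeds[train_labels == k].mean(axis=0) for k in classes])

def classify_lvp_i(pool, I):
    """L1 'similarity': the smallest distance is the most similar label vector."""
    dists = np.abs(pool - I).sum(axis=1)
    return int(np.argmin(dists))
```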

text 100% LVP-I LVP-IT LVP-C
P 1 500 1 1 1
O 100 50000 100 100 -
Acc. 73.3 78.2 80.1 82.0 81.0
Table 2: Testing accuracy of different LVP-CLIP methods on CIFAR100. (Cosine similarity is used here.)

Enrich Image Embedding Representation with Text (LVP-CLIP-IT). The success of CLIP indicates that text embeddings and image embeddings follow similar distributions and have very high correspondence. In cases where the distribution of the training images differs from that of the testing images, information from another modality, i.e., text, may serve as an unbiased prompt. Therefore, with LVP-CLIP-IT, we design the LVP as a weighted combination of the text embedding and the average embedding $\bar{\tilde{I}}^{k}$ of the training images as follows:

L^{k}=\{IT^{k}\}, \text{ and } IT^{k}=\alpha^{k}\times T^{k}+\beta^{k}\times\bar{\tilde{I}}^{k}, \qquad (6)

where $\alpha^{k},\beta^{k}\in\mathbb{R}^{D}$ are trainable vectors that balance the text embedding and the image embedding, respectively. Experimental results in Tab. 2 show that combining the image and text embeddings improves the performance by another 2% over LVP-CLIP-I.
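A small sketch of how the IT label vector of Eq. 6 could be formed for one class; the dimension D = 768 and the initial values 0.5 and 1 follow the implementation details in Sec. 5, and all variable names are illustrative.

```python
# Sketch of the IT label vector of Eq. 6 for one class. D = 768 and the initial
# values 0.5 / 1.0 follow the implementation details in Sec. 5; names are illustrative.
import torch

D = 768
alpha_k = torch.nn.Parameter(torch.full((D,), 0.5))   # trainable, initialized to 0.5
beta_k = torch.nn.Parameter(torch.ones(D))            # trainable, initialized to 1.0

def it_label_vector(T_k, I_bar_k):
    """IT^k = alpha^k * T^k + beta^k * mean image embedding (element-wise)."""
    return alpha_k * T_k + beta_k * I_bar_k
```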

4.4 LVP with a Classifier (LVP-CLIP-C)

With LVP-CLIP-I and LVP-CLIP-IT, the embedding vectors belonging to different classes are stored, and classification is performed by comparing the embedding of a test image against each stored labeled vector using a similarity function. This process treats all feature dimensions as equally important. To handle cases where some features should carry more weight than others in the classification process, we build a simple linear classifier that maps an embedding vector to a class prediction, $f_{\theta}(\cdot):\mathbb{R}^{D}\rightarrow[0,1]^{K}$, and train it using data only from the LVP. For a test image embedding $\hat{I}$, the prediction is:

\text{class}=\operatorname*{argmax}(f_{\theta}(\hat{I})). \qquad (7)

4.5 Optimization Objective

Among the three variations of LVP, LVP-CLIP-I has no trainable parameters; it relies solely on averaging the embeddings of the training images. LVP-CLIP-IT and LVP-CLIP-C, on the other hand, have a small set of trainable parameters.

For LVP-CLIP-IT, we set $\alpha,\beta$ as task-specific parameters to avoid forgetting and train them with a cross-entropy loss. Given task $t\in[1,M]$ and class $k^{t}\in[1,K^{t}]$, where $M$ and $K^{t}$ are the total number of tasks and the number of classes in task $t$, respectively, the loss function used to train $\alpha$ and $\beta$ is:

\mathcal{L}=\mathbb{E}\Big[-\log\frac{\exp(\langle IT^{y},I\rangle)}{\sum^{K^{t}}_{k^{t}=1}\exp(\langle IT^{k^{t}},I\rangle)}\Big], \quad t\in[1,\ldots,M]. \qquad (8)

For each new task, a new set of $\alpha$ and $\beta$ is optimized.
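A possible training loop for the task-specific $\alpha$ and $\beta$ of Eq. 8 is sketched below. It assumes frozen, precomputed text and class-mean image embeddings for the current task and a loader of (image embedding, label) batches; names, the number of epochs and the loader are illustrative, while the SGD optimizer and learning rate follow Sec. 5.

```python
# Possible training loop for the task-specific alpha/beta of Eq. 8. Assumes
# frozen, precomputed embeddings for the current task: T (K_t, D) text
# embeddings, I_bar (K_t, D) class-mean image embeddings, and a loader that
# yields (image embedding, label) batches. Names and hyperparameters are
# illustrative except where Sec. 5 states them (SGD, lr = 1e-4).
import torch
import torch.nn.functional as F

def train_alpha_beta(T, I_bar, loader, epochs=5, lr=1e-4):
    K_t, D = T.shape
    alpha = torch.nn.Parameter(torch.full((K_t, D), 0.5))
    beta = torch.nn.Parameter(torch.ones(K_t, D))
    opt = torch.optim.SGD([alpha, beta], lr=lr)
    for _ in range(epochs):
        for I, y in loader:                       # I: (B, D), y: (B,)
            IT = F.normalize(alpha * T + beta * I_bar, dim=-1)   # label vectors, Eq. 6
            logits = F.normalize(I, dim=-1) @ IT.t()             # cosine <IT^k, I>
            loss = F.cross_entropy(logits, y)                    # Eq. 8
            opt.zero_grad()
            loss.backward()
            opt.step()
    return alpha.detach(), beta.detach()
```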

The parameter $\theta$ of LVP-CLIP-C is shared among all tasks. Since it is trained using the labeled vectors stored in the LVP, it is retrained on the current LVP every time a new task is added. The process is similar to experience replay, so it also does not suffer from forgetting. Again, a cross-entropy loss is used for training. Given the LVP $L^{k}$, $k\in[1,K]$, of each class, the loss function can be written as:

\mathcal{L}=\mathbb{E}\Big[-\log\frac{\exp(L^{y})}{\sum^{K}_{k=1}\exp(L^{k})}\Big]. \qquad (9)
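The sketch below shows one way to realize this objective: a single linear layer trained only on the stored label vectors and retrained from the current pool whenever a new task arrives. The stopping threshold mirrors the 0.05-0.1 loss target mentioned in Sec. 5; all names are illustrative.

```python
# Sketch of LVP-CLIP-C: a single linear layer trained only on the labeled
# vectors stored in the LVP and retrained from the current pool whenever a new
# task is added. The stopping threshold mirrors the ~0.05-0.1 loss target of
# Sec. 5; pool is a (K, D) tensor with one label vector per class (illustrative).
import torch
import torch.nn.functional as F

def train_lvp_c(pool, lr=0.01, loss_target=0.05, max_steps=10000):
    K, D = pool.shape
    clf = torch.nn.Linear(D, K)
    labels = torch.arange(K)                      # class k is the label of vector L^k
    opt = torch.optim.Adam(clf.parameters(), lr=lr)
    for _ in range(max_steps):
        loss = F.cross_entropy(clf(pool), labels)
        if loss.item() <= loss_target:
            break
        opt.zero_grad()
        loss.backward()
        opt.step()
    return clf

# Inference, Eq. 7: predicted class = clf(test_image_embedding).argmax(dim=-1)
```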

4.6 Discussion on Complexity and Performance

For all three LVP variants, the size of the LVP grows as new tasks are learned, albeit at a very slow rate. For each class, we only need one or a few labeled vectors, depending on the difficulty and distribution of the dataset. Taking ImageNet100 as an example, each labeled vector $L\in\mathbb{R}^{D}$ has size $D=768$, which is only 0.5% of the size of an image ($3\times224\times224$). Even for 100 classes, the overall size of the LVP is 76,800 floats, equivalent to the size of 0.5 images. The same amount of memory would be required to store the text embeddings of the classes if CLIP were used to classify input images.

Furthermore, LVP-CLIP is task-order invariant, since each LVP is generated independently. As a result, its performance is not affected by task order. Additionally, since new label vectors do not modify existing ones, LVP-CLIP exhibits minimal forgetting. The independence of the label vectors allows different classes to be learned in parallel and makes it simple to merge multiple classes, a distinct advantage over existing approaches. Finally, the LVP is not a rehearsal buffer, since it does not store raw images, making it less vulnerable to user privacy leakage.

5 Experiments

We evaluate LVP-CLIP in three experimental settings: (i) class-incremental learning (CIL), (ii) domain-incremental learning (DIL), and (iii) cross-task incremental learning (CTIL). We compare LVP-CLIP with SOTA methods across various categories under commensurate experimental settings. Additionally, we perform extensive ablation studies to gain deeper insights into our approach.

Implementation details. We use frozen text and image encoders throughout the experiments. For the image encoder, ViT-L/14 [23] is used as the backbone in all experiments. While most CLIP-based works use cosine similarity to calculate the distance between two embeddings, we have found that the L1 distance works better for image-image comparisons and is much simpler to compute. Therefore, we use L1 similarity for LVP-CLIP-I and cosine similarity for LVP-CLIP-IT. The parameter $\alpha$ in LVP-CLIP-IT is initialized to 0.5 for all datasets, while $\beta$ is consistently initialized to 1. We use SGD as the optimizer to train $\alpha$ and $\beta$, with a learning rate of 0.0001.

For LVP-CLIP-C, the classifier is trained using labeled vectors from LVP-IT, except for CORe50, where no semantically unique labels are provided and LVP-I vectors are used instead. The classifier is trained with the ADAM optimizer, using a learning rate of 0.01. Training is stopped when the loss reaches approximately 0.05-0.1. None of the three proposed variants of LVP-CLIP requires knowledge of the total number of classes in advance.

For the CIL experiments, we generate one label vector for each class; therefore, $P^{k}=1$, $k\in[1,\ldots,K]$. For the DIL experiments, we generate a label vector for each domain of a class, so $P^{k}$ equals the number of domains.

An upper-bound of classification accuracy is obtained for each experiment by assuming that all training data were available upfront and a classifier is trained based on the complete dataset by using features extracted by the CLIP image encoder.

We use average testing accuracy (higher is better) as our metric [19]. After all tasks are learned, the overall accuracy is calculated by averaging the accuracy of each task. In all the experiments except CTIL, the testing data for different tasks is of equal size, ensuring that the average test accuracy is not biased toward any specific task.

5.1 Class Incremental Learning

We first evaluate our method on two popular 2D image datasets, namely CIFAR100 [14] and ImageNet100 [7]. Following the setup in [32], we select 100 classes from the original ImageNet. Both CIFAR100 and ImageNet100 are divided into 10 tasks, with each task consisting of 10 classes.

We compare our methods with two rehearsal-based methods, iCaRL [24] and ARI [31], and three CLIP-based methods, CoOp [45], Continual-CLIP [29] and AttriCLIP [32]. For a fair comparison, all methods are implemented using ViT-L [8], except for iCaRL and ARI, which are implemented with ResNet [10].

In Tab. 3, the best and 2nd-best performances are shown in bold and with underline, respectively. For ImageNet100, all variants of LVP-CLIP outperform other CLIP-based and experience-replay methods with significant margins. Specifically, LVP-CLIP-IT and LVP-CLIP-C provide 9.2% and 9.3% improvement, respectively, over the best performing CLIP-based method AttriCLIP. On CIFAR100, LVP-CLIP-IT has 0.6% higher accuracy than AttriCLIP.

Method Buffer size CIFAR100 [14] ImageNet100 [7]
iCaRL [24] 20/class 49.5* 59.5*
ARI [31] 20/class 80.9* 79.3*
CoOp [45] 10/class 67.6* 79.3*
Cont.-CLIP [29] 0 66.7* 75.4*
AttriCLIP [32] 0 81.4* 83.3*
LVP-CLIP-I 0 80.2 91.8
LVP-CLIP-IT 0 82.0 92.5
LVP-CLIP-C 0 81.1 92.6
Upper-bound - 86.5 96.0
Table 3: Testing accuracy on CIFAR100 and ImageNet100. Data with * is obtained from [32].

5.2 Domain Incremental Learning

For these experiments, we use two popular public datasets, namely DomainNet [22] and CORe50 [18]. DomainNet includes objects from 345 classes. Each object is represented by images spanning six domains: clipart, infograph, painting, quickdraw, real-world and sketch. Each domain has its own dedicated training and testing datasets. Therefore, it has 6 training tasks and 6 testing tasks.

The CORe50 dataset, on the other hand, consists of 50 classes across 11 domains. Of these, 8 domains are used for training data, presented sequentially one at a time as 8 tasks, while 3 domains are reserved for testing. Importantly, the testing domains are not part of the training process, making CORe50 also suitable as a domain adaptation dataset.

The performance is measured as the average accuracy over all testing domains. For every class, we generate a label vector for each trained domain, resulting in an LVP of size 6 for DomainNet and 8 for CORe50.

We compare our method with two regularization-based approaches, EWC [13] and LwF [16], a rehearsal-based method, ER [9], and two prompt-pool-based methods, L2P [36] and S-Prompts [33]. For a fair comparison, all methods are implemented with ViT-L [8]. As seen in Tab. 4, on both the DomainNet and CORe50 datasets, the LVP-CLIP variants provide the best and second-best performance, outperforming all the baselines.

Method Buffer size DomainNet [22] CORe50 [18]
EWC [13] 0 60.0 75.8
LwF [16] 0 61.4 77.6
ER [9] 50/class 64.3 79.5
L2P [36] 0 67.6 79.7
S-Prompts [33] 0 69.7 85.2
LVP-CLIP-I 0 70.1 86.1
LVP-CLIP-IT 0 70.9 -
LVP-CLIP-C 0 68.6 89.6
Upper-bound - 75.9 99.0
Table 4: Testing accuracy on the DomainNet and CORe50 datasets. There is no LVP-CLIP-IT result for CORe50 because the dataset does not have distinguishable text labels for different classes.
Method CIFAR100 ImageNet100 DomainNet CORe50 CF100 + IN100 + DN + CR50 Ideal Difference
Tasks train/test 10/10 10/10 6/6 8/3 34/29 - -
L2P [36] 88.3 82.3 67.6 79.7 37.4 81.1 -43.7
DualPrompt [35] 86.5 85.4 71.8 84.3 40.3 82.9 -42.6
LVP-CLIP-I 80.2 91.8 70.1 86.1 80.9 82.7 -1.8
LVP-CLIP-IT 82.0 92.5 70.9 - 81.0 83.7 -2.7
LVP-CLIP-C 81.1 92.6 68.6 89.6 79.4 83.4 -4.0
Upper-bound 86.5 96.0 75.9 99.0 87.1 88.9 -1.8
Table 5: Testing accuracy on CIFAR100 + ImageNet100 + DomainNet + CORe50. Bold is the best and underline is the second best.

5.3 Cross-Task Incremental Learning

We present the Cross-Task Incremental Learning (CTIL) experimental setting, which performs incremental learning over multiple datasets without task-identity information. Here, a task refers to the unit of continual learning, encompassing both class-incremental and domain-incremental learning across datasets. Our motivation is to show how well a method performs in a more realistic setting where new tasks must be learned continuously. CTIL presents a significant challenge in continual learning, as it requires a model to perform both CIL and DIL. We conduct experiments on a four-dataset CTIL benchmark, which combines CIFAR100, ImageNet100, DomainNet and CORe50, resulting in a total of 595 (100+100+345+50) distinct object classes, divided into 34 (10+10+6+8) training tasks and 29 (10+10+6+3) testing tasks. The 34 training tasks are processed sequentially in a random order.

Since this setting is too challenging for most existing incremental learning frameworks, we chose only L2P [36] and DualPrompt [35] as the baseline methods for comparison. We also present the performance of our methods and the baselines on each individual dataset. In the second-to-last column of Tab. 5, we show the “Ideal” performance of each technique on the combined dataset. This is calculated as the weighted average of its accuracies on the individual datasets, weighted by the number of test tasks in each. For example, the “Ideal” score for LVP-CLIP-I is calculated as $(80.2\times10+91.8\times10+70.1\times6+86.1\times3)/(10+10+6+3)=82.7$. The difference between the Ideal and the actual accuracy is shown in the last column. As a reference, the upper bound of the classification accuracy is also provided, obtained by training a classifier on all the training data with a ViT-L-based backbone.

It is not surprising that the actual accuracy is consistently lower than the ideal accuracy. This is due to two main reasons: (i) as the number of classes grows, their separation in the feature space diminishes, making it harder to distinguish them; and (ii) as tasks are trained sequentially, earlier tasks are increasingly susceptible to forgetting. Since LVP-CLIP-I does not modify previously learned knowledge, it largely avoids forgetting. The 1.8% drop in accuracy is mainly due to more congested feature distributions. LVP-CLIP-C shows a slightly higher degradation (4.0%), likely because its simple linear classifier does not work well in a congested feature space. Overall, the LVP-CLIP-based approaches closely approximate the ideal performance, indicating that the object classes in these four datasets remain separable in the high-dimensional feature space. This also indicates that the significant 42% accuracy drop observed in L2P and DualPrompt is primarily attributed to forgetting.

Although prompt-pool-based approaches show slightly better performance on certain datasets (e.g., CIFAR100 and DomainNet), they suffer greatly from forgetting as the number of classes increases. Moreover, DualPrompt and L2P require roughly twice as much computation at inference as LVP-CLIP. In addition, LVP-CLIP-based learning only requires forward propagation, whereas L2P and DualPrompt rely on backpropagation, making LVP-CLIP far more efficient in terms of learning complexity.

5.4 Ablation Studies

Ablation studies are conducted to assess the effect of various design variables on performance. All studies were performed on CIFAR100.

Text Prompt Quality. The impact of different text prompts is shown in Tab. 6. Here, we use LVP-CLIP-T to denote the traditional CLIP-based classification, where the text (T) embedding serves as the label vector for similarity comparison. Two slightly different text phrases are used for LVP-CLIP-IT and LVP-CLIP-T, and their performances are compared. As can be seen, traditional CLIP is highly sensitive to text prompt quality, making it essential to optimize the prompt. In contrast, the proposed LVP-CLIP-IT is far less susceptible to text quality, since it uses the text only as supplementary information alongside image features.

Text “[cls]” “a photo of a [cls]”
LVP-CLIP-T 65.9 73.3
LVP-CLIP-IT 81.3 82.0
Table 6: Ablation study on the impact of text prompt qualities.

Initialization of $\alpha$. The performance of LVP-CLIP-IT is sensitive to the initial value of $\alpha$, as shown in Tab. 7. It is good practice to set $\alpha$ to 0.5 when the text quality is good and to decrease it as the text quality declines.

Initialization of $\alpha$ 0.1 0.3 0.5 0.7 1.0
LVP-CLIP-IT 80.9 81.6 82.0 81.9 81.7
Table 7: Ablation study on the impact of $\alpha$’s initial value.

Training dataset size. The effect of the training dataset size on LVP-CLIP is shown in Tab. 8. Since the labeled vector is the average embedding of the training images, a larger training set generally improves the label vector’s approximation of the distribution mean, provided the data is randomly sampled. However, it appears that around 150 training images per class are sufficient to obtain a relatively good estimate of the mean. As expected, with a small number of training images, LVP-CLIP-IT significantly outperforms LVP-CLIP-I and LVP-CLIP-C, due to the additional information provided by the text embedding. For more experimental results and analysis, please refer to the supplementary file.

Fraction of training set 1% 3% 5% 10% 30% 50% 100%
train data/class 5 15 25 50 150 250 500
LVP-CLIP-I 66.2 74.9 77.3 78.8 80.0 79.9 80.2
LVP-CLIP-IT 74.2 79.3 80.3 81.0 81.7 81.8 82.0
LVP-CLIP-C 67.3 75.5 77.8 79.6 80.8 81.1 81.1
Table 8: Ablation study on the number of training images per class on CIFAR100.

6 Conclusion

In this paper, we have introduced a novel concept, the Label Vector Pool (LVP), which enables incremental learning without forgetting by harnessing the powerful feature extraction and encoding capabilities of CLIP. LVP blurs the distinction between batch learning and incremental learning, since it does not modify the learned model to accommodate new knowledge. This approach offers a highly cost-effective solution for incremental learning, providing several times the memory capacity of traditional methods and superb scalability to large datasets. It significantly outperforms the SOTA approaches in a cross-dataset, mixed class- and domain-incremental learning setting with 595 classes.

References

  • Aljundi et al. [2018] Rahaf Aljundi, Francesca Babiloni, Mohamed Elhoseiny, Marcus Rohrbach, and Tinne Tuytelaars. Memory aware synapses: Learning what (not) to forget. In Proceedings of the European Conference on Computer Vision (ECCV), 2018.
  • Aljundi et al. [2019a] Rahaf Aljundi, Eugene Belilovsky, Tinne Tuytelaars, Laurent Charlin, Massimo Caccia, Min Lin, and Lucas Page-Caccia. Online continual learning with maximal interfered retrieval. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2019a.
  • Aljundi et al. [2019b] Rahaf Aljundi, Min Lin, Baptiste Goujaud, and Yoshua Bengio. Gradient based sample selection for online continual learning. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2019b.
  • Brown [2020] Tom B Brown. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.
  • Buzzega et al. [2020] Pietro Buzzega, Matteo Boschini, Angelo Porrello, Davide Abati, and Simone Calderara. Dark experience for general continual learning: a strong, simple baseline. In Advances in Neural Information Processing Systems, pages 15920–15930. Curran Associates, Inc., 2020.
  • Cha et al. [2021] Hyuntak Cha, Jaeho Lee, and Jinwoo Shin. Co2l: Contrastive continual learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 9516–9525, 2021.
  • Deng et al. [2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009.
  • Dosovitskiy et al. [2021] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale, 2021.
  • Hayes et al. [2019] Tyler L. Hayes, Nathan D. Cahill, and Christopher Kanan. Memory efficient experience replay for streaming learning. In 2019 International Conference on Robotics and Automation (ICRA), page 9769–9776. IEEE Press, 2019.
  • He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • Ke et al. [2020] Zixuan Ke, Bing Liu, and Xingchang Huang. Continual learning of a mixed sequence of similar and dissimilar tasks. In Advances in Neural Information Processing Systems, pages 18493–18504. Curran Associates, Inc., 2020.
  • Kirkpatrick et al. [2017a] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526, 2017a.
  • Kirkpatrick et al. [2017b] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences, 114(13):3521–3526, 2017b.
  • Krizhevsky et al. [2009] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
  • Li et al. [2019] Xilai Li, Yingbo Zhou, Tianfu Wu, Richard Socher, and Caiming Xiong. Learn to grow: A continual structure learning framework for overcoming catastrophic forgetting. In Proceedings of the 36th International Conference on Machine Learning, pages 3925–3934. PMLR, 2019.
  • Li and Hoiem [2017] Zhizhong Li and Derek Hoiem. Learning without forgetting. IEEE transactions on pattern analysis and machine intelligence, 40(12):2935–2947, 2017.
  • Li and Hoiem [2018] Zhizhong Li and Derek Hoiem. Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(12):2935–2947, 2018.
  • Lomonaco and Maltoni [2017] Vincenzo Lomonaco and Davide Maltoni. Core50: a new dataset and benchmark for continuous object recognition. In Conference on robot learning, pages 17–26. PMLR, 2017.
  • Lopez-Paz and Ranzato [2017] David Lopez-Paz and Marc' Aurelio Ranzato. Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2017.
  • Mallya and Lazebnik [2018] Arun Mallya and Svetlana Lazebnik. Packnet: Adding multiple tasks to a single network by iterative pruning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • McCloskey and Cohen [1989] Michael McCloskey and Neal J. Cohen. Catastrophic interference in connectionist networks: The sequential learning problem, 1989.
  • Peng et al. [2019] Xingchao Peng, Qinxun Bai, Xide Xia, Zijun Huang, Kate Saenko, and Bo Wang. Moment matching for multi-source domain adaptation. In Proceedings of the IEEE International Conference on Computer Vision, pages 1406–1415, 2019.
  • Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
  • Rebuffi et al. [2017] Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl, and Christoph H Lampert. icarl: Incremental classifier and representation learning. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 2001–2010, 2017.
  • Rolnick et al. [2019] David Rolnick, Arun Ahuja, Jonathan Schwarz, Timothy Lillicrap, and Gregory Wayne. Experience replay for continual learning. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2019.
  • Serra et al. [2018] Joan Serra, Didac Suris, Marius Miron, and Alexandros Karatzoglou. Overcoming catastrophic forgetting with hard attention to the task. In Proceedings of the 35th International Conference on Machine Learning, pages 4548–4557. PMLR, 2018.
  • Shokri and Shmatikov [2015] Reza Shokri and Vitaly Shmatikov. Privacy-preserving deep learning. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, page 1310–1321, New York, NY, USA, 2015. Association for Computing Machinery.
  • Smith et al. [2023] James Seale Smith, Leonid Karlinsky, Vyshnavi Gutta, Paola Cascante-Bonilla, Donghyun Kim, Assaf Arbelle, Rameswar Panda, Rogerio Feris, and Zsolt Kira. Coda-prompt: Continual decomposed attention-based prompting for rehearsal-free continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11909–11919, 2023.
  • Thengane et al. [2022] Vishal Thengane, Salman Khan, Munawar Hayat, and Fahad Khan. Clip model is an efficient continual learner. arXiv preprint arXiv:2210.03114, 2022.
  • Wang et al. [2024] Boyu Wang, Yue Ma, and Qinru Qiu. Prompt-based domain incremental learning with modular classification layer. In ECAI 2024. IOS Press, 2024.
  • Wang et al. [2022a] Runqi Wang, Yuxiang Bao, Baochang Zhang, Jianzhuang Liu, Wentao Zhu, and Guodong Guo. Anti-retroactive interference for lifelong learning. In European Conference on Computer Vision, pages 163–178. Springer, 2022a.
  • Wang et al. [2023] Runqi Wang, Xiaoyue Duan, Guoliang Kang, Jianzhuang Liu, Shaohui Lin, Songcen Xu, Jinhu Lü, and Baochang Zhang. Attriclip: A non-incremental learner for incremental knowledge learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3654–3663, 2023.
  • Wang et al. [2022b] Yabin Wang, Zhiwu Huang, and Xiaopeng Hong. S-prompts learning with pre-trained transformers: An occam’s razor for domain incremental learning. In Advances in Neural Information Processing Systems, pages 5682–5695. Curran Associates, Inc., 2022b.
  • Wang et al. [2020] Zifeng Wang, Tong Jian, Kaushik Chowdhury, Yanzhi Wang, Jennifer Dy, and Stratis Ioannidis. Learn-prune-share for lifelong learning. In 2020 IEEE International Conference on Data Mining (ICDM), pages 641–650, 2020.
  • Wang et al. [2022c] Zifeng Wang, Zizhao Zhang, Sayna Ebrahimi, Ruoxi Sun, Han Zhang, Chen-Yu Lee, Xiaoqi Ren, Guolong Su, Vincent Perot, Jennifer Dy, and Tomas Pfister. Dualprompt: Complementary prompting for rehearsal-free continual learning. In Computer Vision – ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXVI, page 631–648, Berlin, Heidelberg, 2022c. Springer-Verlag.
  • Wang et al. [2022d] Zifeng Wang, Zizhao Zhang, Chen-Yu Lee, Han Zhang, Ruoxi Sun, Xiaoqi Ren, Guolong Su, Vincent Perot, Jennifer Dy, and Tomas Pfister. Learning to prompt for continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 139–149, 2022d.
  • Wu et al. [2019] Yue Wu, Yinpeng Chen, Lijuan Wang, Yuancheng Ye, Zicheng Liu, Yandong Guo, and Yun Fu. Large scale incremental learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
  • Yan et al. [2021] Shipeng Yan, Jiangwei Xie, and Xuming He. Der: Dynamically expandable representation for class incremental learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3014–3023, 2021.
  • Yin et al. [2020] Hongxu Yin, Pavlo Molchanov, Jose M. Alvarez, Zhizhong Li, Arun Mallya, Derek Hoiem, Niraj K. Jha, and Jan Kautz. Dreaming to distill: Data-free knowledge transfer via deepinversion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
  • Zenke et al. [2017] Friedemann Zenke, Ben Poole, and Surya Ganguli. Continual learning through synaptic intelligence. In Proceedings of the 34th International Conference on Machine Learning, pages 3987–3995. PMLR, 2017.
  • Zhang et al. [2022] Renrui Zhang, Ziyu Guo, Wei Zhang, Kunchang Li, Xupeng Miao, Bin Cui, Yu Qiao, Peng Gao, and Hongsheng Li. Pointclip: Point cloud understanding by clip. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8552–8562, 2022.
  • Zhao et al. [2022] Tingting Zhao, Zifeng Wang, Aria Masoomi, and Jennifer Dy. Deep bayesian unsupervised lifelong learning. Neural Networks, 149:95–106, 2022.
  • Zhou et al. [2023] Da-Wei Zhou, Qi-Wei Wang, Han-Jia Ye, and De-Chuan Zhan. A model or 603 exemplars: Towards memory-efficient class-incremental learning, 2023.
  • Zhou et al. [2022a] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Conditional prompt learning for vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16816–16825, 2022a.
  • Zhou et al. [2022b] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models. International Journal of Computer Vision, 130(9):2337–2348, 2022b.
  • Zhu et al. [2023] Xiangyang Zhu, Renrui Zhang, Bowei He, Ziyu Guo, Ziyao Zeng, Zipeng Qin, Shanghang Zhang, and Peng Gao. Pointclip v2: Prompting clip and gpt for powerful 3d open-world learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2639–2650, 2023.

Supplementary Material

7 Ablation study on the effect of the similarity function

The performance of various similarity functions is compared in Tab. 9. LVP-CLIP-T, which is based only on text embeddings (and corresponds to the traditional CLIP approach), is very sensitive to the choice of the similarity function. It performs extremely poorly with L1 similarity, and cosine similarity gives much better performance than the L1 and L2 distances. Because of the poor match between the text embedding and the L1 similarity, LVP-CLIP-IT shows a similar pattern, i.e., it gives the lowest performance with L1 similarity and favors cosine similarity.

LVP-CLIP-I, on the other hand, has very stable performance under different similarity functions, with L1 slightly outperforming the other two. Since L1 is also much easier to calculate than cosine similarity, we adopt it as the similarity function whenever the labeled vector is derived from image embeddings; otherwise, cosine similarity is used.

Similarity functions L1 L2 Cos
LVP-CLIP-T 0.1 65.9 73.3
LVP-CLIP-I 80.2 80.0 80.1
LVP-CLIP-IT 76.8 81.8 82.0
Table 9: Ablation study on the effect of the similarity function, performed on CIFAR100 dataset.
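For reference, the three similarity functions compared above can be written as follows (L1 and L2 are negated distances so that higher always means more similar); this is a generic sketch, not code from the paper.

```python
# Generic definitions of the three similarity functions compared in Tab. 9
# (L1 and L2 are negated distances so that larger always means more similar).
# This is an illustrative sketch, not code from the paper.
import numpy as np

def sim_l1(a, b):
    return -np.abs(a - b).sum()

def sim_l2(a, b):
    return -np.linalg.norm(a - b)

def sim_cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```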

8 Memory size of LVP

The memory size of the label vector pool (LVP) for each dataset is shown in Tab. 10. We report the size in different units, including the number of floating-point numbers and the equivalent number of images with dimensions 3x224x224 pixels. The total number of floating-point numbers needed to store the LVP can be calculated as $P\times K\times D$, where $P$ is the pool size, $K$ is the total number of classes, and $D=768$. As can be seen, even with the four datasets combined, the memory needed for the LVP is equivalent to only 13.6 images (or 8.2MB).

CF100 [14] IN100 [7] DN [22] CR50 [18] CF100+IN100+DN+CR50
pool size P 1 1 6 8 mixed(1,1,6,8)
total class K 100 100 345 50 595
float number 76,800 76,800 1,589,760 307,200 2,050,560
images 0.5 0.5 10.6 2.0 13.6
Bytes 0.3MB 0.3MB 6.4MB 1.2MB 8.2MB
Table 10: Memory size required for LVP in terms of floating point numbers and the equivalent image size.

9 Dataset without semantic labels

As discussed in the main paper, our proposed LVP enables CLIP [23]-based classification without relying solely on text embeddings. This is especially useful for datasets that lack meaningful text labels, such as CORe50. As shown in Fig. 5, the CORe50 dataset has ten categories, represented by the 10 columns. Each category includes five distinct instances, shown in 5 rows, and each instance is considered an individual class. Hence, every small image in Fig. 5 is a unique class. These classes are labeled o1, o2, ..., o50, with no inherent semantic meaning. Creating a set of meaningful semantic labels to distinguish these instances is challenging, so classifying the images by comparing their features to text embeddings of the labels becomes impractical. With the proposed LVP, however, the LVP-I embeddings can be easily generated from the training images, enabling accurate classification.

Refer to caption
Figure 5: Images and labels from the CORe50 [18] dataset. There are a total of 50 classes but only 10 object names. Each object has five different instances as five classes. Since the class names are very close to each other as text, it is nearly impossible to separate them by zero-shot learning.

10 Unique advantages of LVP-CLIP

As illustrated in Fig. 6, parallel learning and retraining-free continual learning are two unique advantages of LVP-CLIP that most previous works cannot achieve. LVP-CLIP does not assume that the total number of classes is known in advance and can learn new tasks by simply concatenating the label vector pools of each task. Moreover, since the LVPs of different tasks are completely independent of one another, they can be generated on different machines in parallel.
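A minimal sketch of this concatenation step, assuming each machine produces a pool that maps class names to lists of label vectors (the representation and names are illustrative):

```python
# Minimal sketch of the retraining-free merge illustrated in Fig. 6, assuming
# each machine produces a pool that maps a class name to a list of label
# vectors (illustrative representation).
def merge_pools(*pools):
    merged = {}
    for pool in pools:
        for cls, vectors in pool.items():
            merged.setdefault(cls, []).extend(vectors)   # existing vectors are never modified
    return merged

# e.g. combined = merge_pools(lvp_cifar100, lvp_imagenet100, lvp_domainnet, lvp_core50)
```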

Refer to caption
Figure 6: Illustration of the parallelizability and retraining-free continual learning ability of LVP-CLIP. Four machines conduct the experiments independently and in parallel, storing the LVPs of each dataset. By simply concatenating all of the LVPs, continual learning over the four datasets is achieved. Moreover, as new tasks arrive, the concatenation is simply repeated to store the knowledge from the new tasks.

11 Cross-Task Incremental Learning

Fig. 7 shows the t-SNE visualization of the label vector pools generated during cross-task incremental learning (CTIL). As can be seen, the LVPs of different datasets are well separated in the feature space, with the exception of the ImageNet100 and DomainNet datasets.

Tab. 11 provides a detailed comparison between the ideal and actual performance of the three LVP-CLIP variants for each learning task. Ideal performance is defined as the test accuracy for each task when the four datasets in the CTIL setting are learned and tested independently. Entries highlighted in red indicate tasks where the ideal and actual performance are closely aligned (within a difference of 0.1). As shown in Fig. 7, the LVPs of ImageNet100 and DomainNet are intermixed and not well separated. This explains the larger offsets observed between the ideal and actual performance on the ten IN100 test tasks and the DN-5 task (the ‘real’ domain) compared to the other test tasks. It is clear that the distribution of the ImageNet100 dataset is close to the ‘real’ domain of DomainNet.

Refer to caption
Figure 7: The t-SNE visualization of LVP-I and LVP-IT. Thanks to the remarkable feature extraction of CLIP, different datasets are well separated in the feature space, with the exception of ImageNet100 and DomainNet.
Test Tasks CF100-1 CF100-2 CF100-3 CF100-4 CF100-5 CF100-6 CF100-7 CF100-8 CF100-9 CF100-10
Ideal-I 81.9 80.7 81.1 81.2 79.3 79.6 74.9 80.3 83.8 79.3
Ideal-IT 85.7 81.8 82.0 82.7 81.1 79.5 77.1 80.9 87.3 81.6
Ideal-C 83.1 81.4 82.5 79.9 80.2 79.3 75.3 80.5 86.1 81.1
LVP-CLIP-I 82.0 81.0 81.0 81.0 79.3 79.6 74.9 80.3 83.8 79.2
LVP-CLIP-IT 85.3 79.3 81.1 82.5 80.5 79.1 76.5 80.5 85.5 81.0
LVP-CLIP-C 84.7 79.4 81.7 80.8 76.6 78.5 73.6 80.6 84.5 79.1
Test Tasks IN100-1 IN100-2 IN100-3 IN100-4 IN100-5 IN100-6 IN100-7 IN100-8 IN100-9 IN100-10
Ideal-I 93.0 84.2 88.2 97.0 94.4 92.0 92.0 90.6 91.2 95.4
Ideal-IT 93.2 86.2 90.2 96.4 94.6 92.0 93.6 90.2 92.6 96.2
Ideal-C 94.4 85.4 90.8 96.2 94.8 91.6 93.4 90.0 93.6 95.6
LVP-CLIP-I 89.0 81.0 87.2 94.6 84.0 88.6 84.0 85.0 86.2 85.6
LVP-CLIP-IT 90.0 82.6 89.4 95.2 85.6 86.8 81.2 81.8 83.2 84.6
LVP-CLIP-C 90.2 77.8 83.6 92.0 81.2 81.2 81.2 83.8 80.0 81.2
Test Tasks DN-1 DN-2 DN-3 DN-4 DN-5 DN-6 CR50-1 CR50-2 CR50-3 ALL
Ideal-I 82.2 56.1 76.0 46.1 87.0 73.3 87.0 85.1 86.3 82.7
Ideal-IT 82.4 58.6 77.1 42.1 88.2 74.5 - - - 83.7
Ideal-C 80.1 60.7 75.5 33.5 87.7 74.5 90.3 88.8 89.6 83.3
LVP-CLIP-I 82.2 56.1 75.9 46.1 86.0 73.3 87.0 85.1 86.3 80.9
LVP-CLIP-IT 81.9 58.2 77.0 42.1 86.0 74.2 86.6 84.2 86.1 81.0
LVP-CLIP-C 79.9 59.8 74.3 31.0 84.4 73.9 89.6 87.5 89.5 79.4
Table 11: Results of all the cross-task incremental learning experiments. The ideal result is the test accuracy of each test task when learning and testing are performed on the corresponding dataset independently. The LVP-CLIP rows report the accuracy of each test task in the four-dataset setting. Numbers whose offset from the ideal performance is less than or equal to 0.1 are highlighted in red, indicating nearly zero forgetting.

12 Classes in ImageNet100

We have selected 100 classes from ImageNet [7] following [32]. The label ID and class names of the 100 classes are as follows: [15, ‘American robin’], [45, ‘Gila monster’], [54, ‘eastern hog-nosed snake’], [57, ‘garter snake’], [64, ‘green mamba’], [72, ‘European garden spider’], [90, ‘lorikeet’], [99, ‘goose’], [119, ‘rock crab’], [120, ‘fiddler crab’], [122, ‘American lobster’], [131, ‘little blue heron’], [137, ‘American coot’],[151, ‘Chihuahua’], [155, ‘Shih Tzu’], [157, ‘Papillon’], [158, ‘toy terrier’], [166, ‘Treeing Walker Coonhound’], [167, ‘English foxhound’], [169, ‘borzoi’], [176, ‘Saluki’], [180, ‘American Staffordshire Terrier’], [209, ‘Chesapeake Bay Retriever’], [211, ‘Vizsla’], [222, ‘Kuvasz’], [228, ‘Komondor’], [234, ‘Rottweiler’], [236, ‘Dobermann’], [242, ‘Boxer’], [246, ‘Great Dane’], [267, ‘Standard Poodle’], [268, ‘Mexican hairless dog [xoloitzcuintli]’], [272, ‘coyote’], [275, ‘African wild dog’], [277, ‘red fox’],[281, ‘tabby cat’],[299, ‘meerkat’],[305, ‘dung beetle’], [313, ‘stick insect’], [317, ‘leafhopper’], [331, ‘hare’], [342, ‘wild boar’], [368, ‘gibbon’], [374, ‘langur’], [407, ‘ambulance’], [421, ‘baluster handrail’],[431, ‘bassinet’], [449, ‘boathouse’], [452, ‘poke bonnet’], [455, ‘bottle cap’], [479, ‘car wheel’], [494, ‘bell or wind chime’], [498, ‘movie theater’], [503, ‘cocktail shaker’], [508, ‘computer keyboard’], [544, ‘Dutch oven’], [560, ‘football helmet’], [570, ‘gas mask or respirator’], [592, ‘hard disk drive’],[593, ‘harmonica’], [599, ‘honeycomb’], [606, ‘clothes iron’], [608, ‘jeans’], [619, ‘lampshade’],[620, ‘laptop computer’], [653, ‘milk can’], [659, ‘mixing bowl’], [662, ‘modem’], [665, ‘moped’], [667, ‘graduation cap’], [674, ‘mousetrap’], [682, ‘obelisk’],[703, ‘park bench’], [708, ‘pedestal’], [717, ‘pickup truck’], [724, ‘pirate ship’],[748, ‘purse’], [758, ‘fishing casting reel’], [765, ‘rocking chair’], [766, ‘rotisserie’],[772, ‘safety pin’], [775, ‘sarong’], [796, ‘balaclava ski mask’], [798, ‘slide rule’],[830, ‘stretcher’], [854, ‘front curtain’], [857, ‘throne’], [858, ‘tile roof’], [872, ‘tripod’],[876, ‘hot tub’], [882, ‘vacuum cleaner’], [904, ‘window screen’], [908, ‘airplane wing’], [936, ‘cabbage’], [938, ‘cauliflower’], [953, ‘pineapple’], [959, ‘carbonara’],[960, ‘chocolate syrup’], [993, ‘gyromitra’], [994, ‘stinkhorn mushroom’]]