
Selective Vision-Language Subspace Projection for Few-shot CLIP

Xingyu Zhu (University of Science and Technology of China, Hefei, Anhui, China), Beier Zhu (Nanyang Technological University, Singapore), Yi Tan (University of Science and Technology of China, Hefei, Anhui, China), Shuo Wang🖂 (University of Science and Technology of China, Hefei, Anhui, China), Yanbin Hao (University of Science and Technology of China, Hefei, Anhui, China), and Hanwang Zhang (Nanyang Technological University, Singapore)
(2024)
Abstract.

Vision-language models such as CLIP can map data from different modalities into a unified feature space, enabling zero/few-shot inference by measuring the similarity between given images and texts. However, most existing methods overlook the modality gap in CLIP’s encoded features, manifested as text and image features lying far apart from each other, which limits classification performance. To tackle this issue, we introduce a method called Selective Vision-Language Subspace Projection (SSP), which incorporates local image features and utilizes them as a bridge to enhance the alignment between image-text pairs. Specifically, our SSP framework comprises two parallel modules: a vision projector and a language projector. Both projectors utilize local image features to span the respective subspaces for images and texts, thereby projecting the image and text features into their respective subspaces to achieve alignment. Moreover, our approach entails only training-free matrix calculations and can be seamlessly integrated into advanced CLIP-based few-shot learning frameworks. Extensive experiments on 11 datasets demonstrate SSP’s superior text-image alignment capabilities, outperforming the state-of-the-art alignment methods. The code is available at https://github.com/zhuhsingyuu/SSP

CLIP, Few-shot, Alignment, Subspace projection
Xingyu Zhu and Beier Zhu contributed equally to this work
🖂 Shuo Wang is the corresponding author.
copyright: acmlicensed; journal year: 2024; doi: 10.1145/3664647.3680885; conference: Proceedings of the 30th ACM International Conference on Multimedia, October 28 - November 1, 2024, Melbourne, Australia; isbn: 979-8-4007-0686-8/24/10; ccs: Computing methodologies, Computer vision representations

1. Introduction

Figure 1. An illustration of modality gaps conducted on ImageNet (Deng et al., 2009) with ViT-B/32. (a) Text and image features from CLIP lie in different cones. (b) Text and image features aligned by SSP lie almost in the same cone. (c) Comparisons of distribution metrics for CLIP and SSP.

The topic of pre-trained vision-language models (VLMs) (Jia et al., 2021; Li et al., 2021) has attracted significant research interest due to their exceptional performance in various multimodal tasks (Zhang et al., 2022a; Ye et al., 2023; Guzhov et al., 2022; Yu et al., 2023a; Lu et al., 2023; Guo et al., 2024; Lu et al., 2024). Among these, CLIP (Contrastive Language-Image Pretraining) (Radford et al., 2021) stands out in classification tasks, especially in zero/few-shot scenarios (Wang et al., 2023; Zhu et al., 2024; Wang et al., 2020, 2022, 2024). For a given test image, CLIP computes the similarity between the image feature and the text features of all class labels, each phrased as ‘‘a photo of a [class name]’’. It then assigns the class with the highest similarity as the predicted label.

Figure 2. Comparisons of class activation maps: CLIP’s encoded features may concentrate on unrelated or noisy regions, as discussed in (Li et al., 2023b), while our SSP-aligned features primarily focus on foreground objects.

Recently, prompt tuning (Zhou et al., 2022b, a) and adapter tuning (Gao et al., 2024; Zhang et al., 2022b; Udandarao et al., 2022) techniques have been applied to CLIP to further explore its generalization ability in few-shot tasks. These methods add learnable embeddings or adapter layers to either the visual or textual encoder of CLIP. While they improve performance, they overlook the modality gaps that exist in multi-modal models. As discussed in (Ouali et al., 2023; Liang et al., 2022; Udandarao, 2022) and illustrated in Figure 1(a), modality gaps refer to embeddings from different modalities, e.g., CLIP’s encoded text features (pink cone) and image features (blue cone), being located in two separate regions of a hypersphere. To better understand these gaps, we use the von Mises-Fisher (vMF) distribution (Banerjee et al., 2005; Du et al., 2024) to fit the text and image distributions, since features of both modalities are normalized to unit length and lie on a unit hypersphere; the vMF parameters $\bm{\mu}$ and $\kappa$ are analogous to the mean and standard deviation of a Gaussian distribution, respectively (details of the vMF formulation can be found in the Supplementary Material). Figure 1(c) summarizes the results: the original text and image features show different distributions, e.g., a large angle between $\bm{\mu}_{\text{tex}}$ and $\bm{\mu}_{\text{vis}}$ and a large $\ell_{1}$ distance between $\kappa_{\text{tex}}$ and $\kappa_{\text{vis}}$. This phenomenon is unexpected, as paired text-image features are optimized during the pre-training stage to be closely located to each other while being separated from other paired image-text features. To further investigate these gaps, we visualize activation maps by calculating the similarity between the text features and the local image features (feature map) in Figure 2; the second row shows that the text features focus on unrelated regions rather than just the foreground objects, which again demonstrates the gap between the text and image features. In general, several factors can cause modality gaps, broadly categorized as follows: (1) differences between the downstream dataset distribution and the training data distribution (Zavras et al., 2024), (2) different random model initializations leading to different feature cones (Liang et al., 2022), and (3) the CLIP model not incorporating the fine-grained relationship between text tokens and image patches (Fu et al., 2022; Liu et al., 2020). While these works analyze the modality gaps, they do not concentrate on CLIP’s few-shot generalization capability.
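The gap statistics reported in Figure 1(c) can be reproduced with a few lines of tensor code. The following is a minimal sketch (ours, not the authors’ released code) that fits a vMF distribution to L2-normalized CLIP text and image features and compares the two modalities; the concentration estimate uses the standard approximation from Banerjee et al. (2005), and all function and variable names are illustrative.

```python
import torch

def vmf_fit(feats: torch.Tensor):
    """feats: (n, d) unit-norm features; returns mean direction mu and concentration kappa."""
    d = feats.shape[1]
    resultant = feats.mean(dim=0)                        # mean resultant vector
    r_bar = resultant.norm()                             # mean resultant length in (0, 1)
    mu = resultant / r_bar                               # mean direction on the hypersphere
    kappa = r_bar * (d - r_bar ** 2) / (1 - r_bar ** 2)  # Banerjee et al. (2005) approximation
    return mu, kappa

def modality_gap(text_feats: torch.Tensor, image_feats: torch.Tensor):
    """Cosine between the two mean directions and the |kappa_tex - kappa_vis| difference."""
    mu_tex, kappa_tex = vmf_fit(text_feats)
    mu_vis, kappa_vis = vmf_fit(image_feats)
    return torch.dot(mu_tex, mu_vis).item(), abs(kappa_tex - kappa_vis).item()
```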

Based on the above observations and analysis, we propose a training-free method called Selective Vision-Language Subspace Projection (SSP) that leverages local image features as a bridge to align image and text features. Our method aims to reduce the modality gap so that paired text-image features no longer lie far apart from each other, as shown in Figure 1(b) (the pink cone and blue cone are closer). Specifically, our SSP consists of two parallel modules, namely a vision projector and a language projector. In the vision projector module, we select regions of the local image features (also known as feature maps), extracted from CLIP’s visual encoder before the attention pooling layer, to span a unified vision subspace that captures the common structural and textural elements present in images; the selected regions are those exhibiting the highest cosine similarity to the global image features. All image features are then projected onto this vision subspace to achieve alignment. The language projector module operates in a similar manner, where we use regions of the local image features to construct a language subspace for each class. Subsequently, the text features extracted from CLIP are aligned by projecting them onto their respective language subspaces. During the inference stage, all projected features are employed to classify the test image. The effectiveness of SSP can be seen from the metric results in Figure 1(c): by narrowing the modality gap, the text-image features aligned by SSP exhibit a higher cosine similarity (from 0.7755 to 0.8325) and a smaller difference in $\kappa$, and their distributions are closer as measured by the KL divergence. Besides, the visualized activation maps in the last row of Figure 2 show that text features processed by SSP mainly focus on the foreground object. Overall, our SSP method involves only training-free matrix calculations and enhances CLIP’s capability in few-shot scenarios by improving the alignment between paired text-image features.

The main contributions of our SSP are summarized as follows:

  1. We propose a training-free method, SSP, to reduce the modality gaps in CLIP’s encoded features, which improves CLIP’s few-shot generalization ability.

  2. We design two parallel projectors, namely the vision projector and the language projector, which leverage regions of the local image features to align the image and text features, respectively, through projection.

  3. Our designed projectors are training-free and involve only matrix calculations, which not only simplifies the implementation but also reduces the computational overhead.

  4. Our SSP is flexible and can be applied to various CLIP-based methods, improving their performance across diverse benchmarks, even for state-of-the-art methods, e.g., an average accuracy improvement of 0.63% over APE (Zhu et al., 2023d) in the 16-shot setting.

Figure 3. The training images and extended labels are fed into the frozen visual encoder and textual encoder to extract features, respectively. Subsequently, the related local image features (feature maps) are employed to construct the vision subspace and language subspaces, which are used to align the extracted image and text features through subspace projection. Finally, the projected test feature, along with the projected training features, is fed into the classification framework to predict the result.

2. Related Work

2.1. VLMs Pre-training

VLMs have made significant advances in recent years (Zhang et al., 2023; Liu et al., 2024). These models bridge the vision and language modalities and are typically pre-trained on large datasets. Given image-text pairs, VLMs use an image encoder and a text encoder to extract image and text features, and then learn the vision-language correlation through certain pre-training objectives. Moreover, a range of VLMs, including CLIP (Contrastive Language-Image Pre-training) (Radford et al., 2021; Li et al., 2023b), CoCa (Contrastive Captioners) (Yu et al., 2022), and BLIP (Bootstrapping Language-Image Pre-training) (Li et al., 2022), can be leveraged for various downstream tasks (Wang et al., 2018; Guo et al., 2019, 2024; Yan et al., 2023a, b), such as object recognition (Zhang et al., 2022b; Zhu et al., 2023a), object detection (Gu et al., 2022), and image captioning (Mokady et al., 2021; Barraco et al., 2022). For instance, CLIP is pre-trained on a vast dataset of web-based image-text pairs and learns to align their representations through a contrastive loss, which enables it to recognize unseen data by matching the embeddings of any given images and texts. In this paper, we aim to use CLIP to address few-shot classification problems.

2.2. Prompt Tuning of VLMs

The concept of prompt tuning was initially introduced in the natural language processing area; it refers to treating a fixed part of the text input as learnable embeddings and fine-tuning these parameters on the downstream task data (Zhou et al., 2022b, a; Lu et al., 2022; Zhu et al., 2023a, b, c). Several works have applied this tuning technique to VLMs. CoOp (Zhou et al., 2022b) uses learnable word embeddings to generate context prompts automatically, eliminating the need for manual prompt templates. CoCoOp (Zhou et al., 2022a) can be considered an extension of CoOp, as it proposes a meta-network that learns image-conditioned features added to the prompt embeddings to further enhance the model’s generalization ability. ProDA (Lu et al., 2022) aims to learn the distribution of the prompt embeddings. ProGrad (Zhu et al., 2023a) uses zero-shot prediction results to guide the model’s gradient updates, preventing conflicts between few-shot models and general knowledge while mitigating overfitting. However, although the above-mentioned methods enhance the classification performance of CLIP in few-shot scenarios, they introduce learnable parameters and increase training cost. In contrast, our SSP involves only matrix calculations and does not introduce any learnable parameters.

2.3. Adapter Tuning of VLMs

Adapter tuning techniques are applied to VLMs (Gao et al., 2024; Zhang et al., 2022b; Zhu et al., 2023d; Udandarao et al., 2022) to enhance downstream generalization ability by freezing the original model parameters and updating only the parameters of the added adapter modules. CLIP-Adapter (Gao et al., 2024) utilizes a fully connected layer to adapt the features output by the frozen CLIP. Tip-Adapter (Zhang et al., 2022b) leverages a cache model that measures the relations between image and text features to construct a classifier from few-shot training data. APE (Zhu et al., 2023d) is an enhanced version of Tip-Adapter, selecting the most discriminative feature channels via statistical analysis. SuS-X (Udandarao et al., 2022) relies on the category names of the training set to generate image samples with Stable Diffusion (Rombach et al., 2022). Cross-Modal Adapter (CMA) (Jiang et al., 2022) achieves cross-modal interaction by sharing adapter weights between the two modalities. While the aforementioned methods adapt text-image features for downstream tasks, they do not account for the modality gaps. Our SSP method strives to reduce these gaps to better align the paired text-image features and can be integrated into the aforementioned methods.

3. Methodology

In this section, we first give a brief overview of the zero-shot inference ability of CLIP in Section 3.1. We then detail our vision projector in Section 3.2 and our language projector in Section 3.3. Following that, we present the classification process of SSP in Section 3.4. Lastly, we conduct a comparative analysis in Section 3.5.

3.1. Preliminaries

The CLIP model possesses the ability to perform zero-shot classification by extending the ‘‘[class name]’’ to the template ‘‘a photo of a [class name]’’. The image feature $\bm{f}\in\mathbb{R}^{d}$ and the class-extended text feature $\bm{t}_{i}\in\mathbb{R}^{d}$ are then extracted by CLIP’s visual encoder and textual encoder, respectively. The classification is determined by the cosine similarity $\langle\bm{f},\bm{t}_{i}\rangle$:

(1) $\displaystyle p(\bm{t}_{y}|\bm{f})=\frac{\exp(\langle\bm{f},\bm{t}_{y}\rangle)}{\sum_{i=1}^{N}\exp(\langle\bm{f},\bm{t}_{i}\rangle)},$

where $N$ represents the number of classes. In the few-shot setting of CLIP, there are $N$ classes, each containing $K$-shot training images. In our method, apart from the encoded image feature $\bm{f}$ and the text feature $\bm{t}$, we also exploit the local image features $\bm{x}\in\mathbb{R}^{hw\times d}$, where $h$ and $w$ represent the height and width of the local feature map, respectively. The overview of our method is depicted in Figure 3. We select local image features that exhibit strong correlations, as measured by cosine similarity, with the image and text features, respectively. We then utilize these selected local features to construct the vision subspace and language subspaces. In the inference stage, the test image feature is projected into the vision and language subspaces, and the projected training text-image features together with the projected test feature are sent to the classifier to predict the result.
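As a reference point for the later projectors, the zero-shot prediction in Eq. (1) amounts to a softmax over cosine similarities. The snippet below is a minimal sketch under the assumption that the image feature and the $N$ class-extended text features have already been extracted and L2-normalized; the temperature argument mirrors CLIP’s usual logit scaling and is not part of Eq. (1).

```python
import torch

def zero_shot_probs(f: torch.Tensor, T: torch.Tensor, temperature: float = 100.0) -> torch.Tensor:
    """f: (d,) unit-norm image feature; T: (N, d) unit-norm text features.
    Returns p(t_y | f) over the N classes as in Eq. (1)."""
    logits = temperature * (T @ f)   # cosine similarities <f, t_i>, scaled by CLIP's logit scale
    return logits.softmax(dim=-1)    # softmax over classes
```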

3.2. Vision Projector

In the vision projector module, the cosine similarity between the image feature and the local image features is calculated for each sample in the training set as follows:

(2) $\bm{s}_{i,j}=\bm{f}_{i,j}\cdot\bm{x}_{i,j}^{\mathrm{T}},\ i\in[1,N],\ j\in[1,K].$

Here, $\bm{f}_{i,j}$ is the image feature of the $j$-th sample from class $i$, and $\bm{x}_{i,j}$ denotes the corresponding local image features. $K$ indicates the number of samples within each class. $\bm{s}_{i,j}\in\mathbb{R}^{hw}$ denotes the vector of similarity scores, with each element indicating the correlation between the image feature and the local region feature. Previous studies (Li et al., 2023b; Miyai et al., 2023) have demonstrated that local image features play a crucial role in capturing object and semantic information for vision and language, respectively. Consequently, a higher value in $\bm{s}_{i,j}$ indicates that the corresponding local region contains more discriminative information for correct classification, while a smaller value suggests that the region contains less useful or irrelevant information for the corresponding class. Based on these insights, we utilize the regions with the top-$Q$ largest values to construct the vision subspace:

(3) $\hat{\bm{x}}_{i,j}=\bm{x}_{i,j}[n,:]\in\mathbb{R}^{Q\times d},\ n=\{n_{1},n_{2},\cdots,n_{Q}\},$

where $Q$ is a hyper-parameter validated in Section 4.2.1, and $n=\{n_{1},n_{2},\cdots,n_{Q}\}$ denotes the index set of the top-$Q$ largest values in $\bm{s}_{i,j}$. By aggregating these local features across all training samples and concatenating them with $\bm{f}_{i,j}$, we obtain:

(4) $\hat{\bm{X}}=[\bm{f}_{1,1},\hat{\bm{x}}_{1,1},\cdots,\bm{f}_{i,j},\hat{\bm{x}}_{i,j},\cdots,\bm{f}_{N,K},\hat{\bm{x}}_{N,K}]^{\mathrm{T}}\in\mathbb{R}^{NK(Q+1)\times d}.$

$\hat{\bm{X}}$ comprises the informative features of vision patterns, and we decompose $\hat{\bm{X}}$ by Singular Value Decomposition (SVD):

(5) $\bm{U}_{\text{vis}}\bm{\Sigma}_{\text{vis}}\bm{V}^{\mathrm{T}}_{\text{vis}}={\rm SVD}(\hat{\bm{X}}),$

where $\bm{U}_{\text{vis}}\in\mathbb{R}^{NK(Q+1)\times NK(Q+1)}$ and $\bm{V}_{\text{vis}}\in\mathbb{R}^{d\times d}$ are the left and right singular vectors, respectively, corresponding to the singular values $\bm{\Sigma}_{\text{vis}}\in\mathbb{R}^{NK(Q+1)\times d}$ sorted in descending order. We employ the principal vectors of $\bm{V}_{\text{vis}}$, denoted as $\hat{\bm{V}}_{\text{vis}}$, to span the visual subspace (Smith et al., 2017), and the corresponding projection matrix is calculated as (Guerci and Bergin, 2002; Yuan and Gan, 2017; Zhu et al., 2020):

(6) $\bm{P}_{\text{vis}}=\hat{\bm{V}}_{\text{vis}}\hat{\bm{V}}_{\text{vis}}^{\mathrm{T}},$

where $\bm{P}_{\text{vis}}=\bm{P}_{\text{vis}}^{\mathrm{T}}\in\mathbb{R}^{d\times d}$. With the constructed vision subspace and projection matrix, we align the image features via subspace projection:

(7) $\hat{\bm{F}}=\bm{P}_{\text{vis}}\bm{F},$

where $\bm{F}=[\bm{f}_{1,1},\cdots,\bm{f}_{N,K}]\in\mathbb{R}^{d\times NK}$ refers to all training image features, and the projected image features $\hat{\bm{F}}$ are then utilized in the classification process.
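To make the pipeline of Eqs. (2)-(7) concrete, the following is a minimal sketch of the vision projector written by us (not the released code), assuming pre-extracted, L2-normalized global features F of shape (NK, d) and local features X of shape (NK, hw, d); Q and the number r of kept singular vectors are hyper-parameters (the paper uses Q = 40 and 900 singular vectors for d = 1024).

```python
import torch

def vision_projector(F: torch.Tensor, X: torch.Tensor, Q: int = 40, r: int = 900):
    """F: (NK, d) global image features; X: (NK, hw, d) local image features."""
    # Eq. (2): similarity between each global feature and its own local features
    s = torch.einsum("nd,npd->np", F, X)                                       # (NK, hw)
    # Eq. (3): keep the top-Q local features per sample
    idx = s.topk(Q, dim=1).indices                                             # (NK, Q)
    x_hat = torch.gather(X, 1, idx.unsqueeze(-1).expand(-1, -1, X.shape[-1]))  # (NK, Q, d)
    # Eq. (4): stack each global feature with its selected local features
    X_hat = torch.cat([F.unsqueeze(1), x_hat], dim=1).reshape(-1, F.shape[-1])  # (NK(Q+1), d)
    # Eqs. (5)-(6): SVD and projection matrix from the leading right singular vectors
    _, _, Vh = torch.linalg.svd(X_hat, full_matrices=False)
    V_hat = Vh[:r].T                                                           # (d, r)
    P_vis = V_hat @ V_hat.T                                                    # (d, d), symmetric
    # Eq. (7): project all training image features onto the vision subspace
    F_hat = F @ P_vis                                                          # row-wise P_vis @ f
    return P_vis, F_hat
```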

3.3. Language Projector

The language projector module serves the same purpose as the vision projector and follows a similar procedure. The cosine similarity between the text feature and the local image features is calculated as:

(8) $\bm{z}_{i,j}=\bm{t}_{i}\cdot\bm{x}_{i,j}^{\mathrm{T}},\ i\in[1,N],\ j\in[1,K],$

where $\bm{t}_{i}$ is the text feature of the $i$-th class, and $\bm{x}_{i,j}$ denotes the local image features of the $j$-th image of class $i$. $\bm{z}_{i,j}\in\mathbb{R}^{hw}$ stores the similarity scores between the text feature and the local image features. Larger values in $\bm{z}_{i,j}$ imply that the corresponding local regions are highly semantically related to $\bm{t}_{i}$, and we select the regions with the top-$C$ largest values to construct the language subspace:

(9) $\tilde{\bm{x}}_{i,j}=\bm{x}_{i,j}[m,:]\in\mathbb{R}^{C\times d},\ m=\{m_{1},m_{2},\cdots,m_{C}\},$

where $m=\{m_{1},m_{2},\cdots,m_{C}\}$ represents the indices of the top-$C$ largest values in $\bm{z}_{i,j}$. For each text feature $\bm{t}_{i}$, there are $K$ sets of selected local image features belonging to that class. Consequently, we can gather a total of $K\times C+1$ features for each class, denoted as $\tilde{\bm{X}}_{i}=[\bm{t}_{i},\tilde{\bm{x}}_{i,1},\cdots,\tilde{\bm{x}}_{i,K}]^{\mathrm{T}}\in\mathbb{R}^{(K\times C+1)\times d}$. In contrast to the unified vision subspace, we construct a language subspace for each class based on $\{\tilde{\bm{X}}_{i}\}_{i=1}^{N}$ via SVD, as shown below:

(12) $\bm{U}_{\text{tex}}^{i}\bm{\Sigma}_{\text{tex}}^{i}(\bm{V}_{\text{tex}}^{i})^{\mathrm{T}}={\rm SVD}(\tilde{\bm{X}}_{i}),\ i\in[1,N],$

where $\bm{U}_{\text{tex}}^{i}$ and $\bm{V}_{\text{tex}}^{i}$ are the left and right singular vectors, respectively, corresponding to the singular values $\bm{\Sigma}_{\text{tex}}^{i}$ sorted in descending order. $\tilde{\bm{V}}_{\text{tex}}^{i}$ denotes the primary singular vectors of $\bm{V}_{\text{tex}}^{i}$, which form the basis of the $i$-th language subspace. The corresponding projection matrix is $\bm{P}^{i}_{\text{tex}}=\tilde{\bm{V}}_{\text{tex}}^{i}(\tilde{\bm{V}}_{\text{tex}}^{i})^{\mathrm{T}}\in\mathbb{R}^{d\times d}$. Subsequently, each text feature is projected by its corresponding projection matrix as follows:

(13) $\tilde{\bm{T}}=[\tilde{\bm{t}}_{1},\cdots,\tilde{\bm{t}}_{i},\cdots,\tilde{\bm{t}}_{N}]^{\mathrm{T}}=[\bm{P}^{1}_{\text{tex}}\bm{t}_{1},\cdots,\bm{P}^{i}_{\text{tex}}\bm{t}_{i},\cdots,\bm{P}^{N}_{\text{tex}}\bm{t}_{N}]^{\mathrm{T}}\in\mathbb{R}^{N\times d}.$

The projected text features $\tilde{\bm{T}}$ and the projected image features $\hat{\bm{F}}$ belonging to the same class lie close to each other, as depicted in Figure 1(b). These paired text-image features are then utilized to classify test images during the inference stage.
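A minimal sketch of the language projector in Eqs. (8)-(13), under the same assumptions as the vision-projector sketch above; here the local features X are grouped by class as (N, K, hw, d), C = 40 and r = 900 follow the paper’s settings, and everything else is illustrative.

```python
import torch

def language_projector(T: torch.Tensor, X: torch.Tensor, C: int = 40, r: int = 900):
    """T: (N, d) text features; X: (N, K, hw, d) local image features grouped by class."""
    N, K, hw, d = X.shape
    P_tex = torch.empty(N, d, d, dtype=T.dtype)
    T_tilde = torch.empty_like(T)
    for i in range(N):
        # Eq. (8): similarity between the class-i text feature and every local feature
        z = X[i] @ T[i]                                                       # (K, hw)
        # Eq. (9): top-C local features per training image of class i
        idx = z.topk(C, dim=1).indices                                        # (K, C)
        x_tilde = torch.gather(X[i], 1, idx.unsqueeze(-1).expand(-1, -1, d))  # (K, C, d)
        # Stack the text feature with the K*C selected local features (K*C + 1 rows)
        X_tilde = torch.cat([T[i].unsqueeze(0), x_tilde.reshape(-1, d)], dim=0)
        # Eq. (12): per-class SVD; keep the leading right singular vectors
        _, _, Vh = torch.linalg.svd(X_tilde, full_matrices=False)
        V_tilde = Vh[:r].T
        P_tex[i] = V_tilde @ V_tilde.T                                        # per-class projector
        # Eq. (13): project the class-i text feature onto its language subspace
        T_tilde[i] = P_tex[i] @ T[i]
    return P_tex, T_tilde
```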

Figure 4. The main differences between our SSP and other adapter-based methods (Zhang et al., 2022b; Zhu et al., 2023d; Ouali et al., 2023).

3.4. Classification Process

Our SSP does not involve any parameter training and can seamlessly build on top of existing adapter-based methods, such as LFA (Ouali et al., 2023), Tip (Zhang et al., 2022b), and APE (Zhu et al., 2023d), to achieve improved classification results, as summarized in Table 2 and Table 3. Specifically, a test image feature $\bm{f}_{\text{test}}\in\mathbb{R}^{d}$ is projected into the vision subspace and the language subspaces, respectively. The vision subspace projection is straightforwardly calculated as:

(14) $\bm{f}_{\text{test}}^{\text{vis}}=\bm{P}_{\text{vis}}\bm{f}_{\text{test}},$

where $\bm{f}_{\text{test}}^{\text{vis}}$ denotes the projected feature. For the language subspace projection, however, we face a challenge: we lack categorical information to determine which language projection matrix to use. To address this, we rely on projection theory (Yuan and Gan, 2017; Zhu et al., 2020) and choose the language projection matrix that minimizes the $\ell_{2}$ norm of the orthogonal projection residual:

(18) $\mathop{\arg\min}_{i}\ \|(\bm{I}-\bm{P}^{i}_{\text{tex}})\bm{f}_{\text{test}}\|_{2}^{2},\ i\in[1,N],\quad \bm{f}_{\text{test}}^{\text{tex}}=\bm{P}^{i}_{\text{tex}}\bm{f}_{\text{test}}.$

If $\bm{f}_{\text{test}}$ lies entirely in the $i$-th language subspace, the projection operation does not change the test feature, i.e., $\bm{f}_{\text{test}}=\bm{P}^{i}_{\text{tex}}\bm{f}_{\text{test}}$. After aligning the test feature via our vision and language subspace projections, the aligned test feature, together with the aligned image and text features, is fed into various classification frameworks to obtain the prediction. For example, if we choose LFA (Ouali et al., 2023) as the classifier, we calculate the transformation matrix $\bm{W}$ based on the aligned image features $\hat{\bm{F}}$ and text features $\tilde{\bm{T}}$, and then use $\bm{f}_{\text{test}}^{\text{tex}}$ to compute the final classification logits, given by $\tilde{\bm{T}}\cdot\bm{W}\cdot\bm{f}_{\text{test}}^{\text{tex}}$. If we choose Tip (Zhang et al., 2022b) or APE (Zhu et al., 2023d) as the classification framework, we utilize both $\bm{f}_{\text{test}}^{\text{vis}}$ and $\bm{f}_{\text{test}}^{\text{tex}}$ to calculate the prediction results, given by $\hat{\bm{F}}\cdot\bm{f}_{\text{test}}^{\text{vis}}+\tilde{\bm{T}}\cdot\bm{f}_{\text{test}}^{\text{vis}}$.
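The inference step can be sketched as follows. This is our illustrative code, not the released implementation: it assumes the outputs of the two projector sketches above, folds in the one-hot label matrix L of the cache features (as in Figure 4) so the two logit terms share the class dimension, and omits the specific affinity weighting that Tip/APE apply on top.

```python
import torch

def ssp_inference(f_test: torch.Tensor, P_vis: torch.Tensor, P_tex: torch.Tensor,
                  F_hat: torch.Tensor, T_tilde: torch.Tensor, L: torch.Tensor):
    """f_test: (d,); P_vis: (d, d); P_tex: (N, d, d);
    F_hat: (NK, d) aligned cache features; T_tilde: (N, d); L: (NK, N) one-hot labels."""
    # Eq. (14): project the test feature onto the vision subspace
    f_vis = P_vis @ f_test
    # Eq. (18): pick the language subspace with the smallest orthogonal projection residual
    residuals = torch.stack([(f_test - P @ f_test).norm() for P in P_tex])
    f_tex = P_tex[residuals.argmin()] @ f_test            # used, e.g., with the LFA classifier
    # Simplified Tip-/APE-style combination from Section 3.4: cache term plus text term
    cache_logits = (F_hat @ f_vis) @ L                    # (NK,) affinities folded into N classes
    logits = cache_logits + T_tilde @ f_vis               # (N,) final class logits
    return logits, f_tex
```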

3.5. Analysis of SSP

Our method focuses on aligning the image and text features and can easily be built on top of various classifiers as described above. This sets our SSP approach apart from adapter-based methods such as Tip (Zhang et al., 2022b), APE (Zhu et al., 2023d), and LFA (Ouali et al., 2023). Although these methods operate by adapting the features encoded by CLIP, they can also be viewed as classification frameworks due to their direct calculations between the training and testing samples. As shown in Figure 4, the adapter-based methods are depicted in the green box at the bottom. They employ various techniques to calculate the relationship among $\bm{T}$, $\bm{F}$, $\bm{L}$, and the test sample $\bm{f}_{\text{test}}$ to achieve classification, where $\bm{L}$ denotes the one-hot labels of $\bm{F}$. In contrast, our method, illustrated in the gray section at the top, leverages the local image features $\bm{X}$ as a bridge to align the image features, text features, and test features, resulting in $\tilde{\bm{T}}$, $\hat{\bm{F}}$, $\bm{f}_{\text{test}}^{\text{vis}}$, and $\bm{f}_{\text{test}}^{\text{tex}}$. These aligned features are substituted for the original ones in the computations to derive the classification results.

4. Experiments

Figure 5. Comparison of classification accuracy and similarity measurements when varying the number of selected local image features under the 16-shot setting: (a) and (b) classification results using only the vision subspace projection and only the language subspace projection, respectively; (c) and (d) the trend of similarity between text features and the local features selected by image features, and between image features and the local features selected by text features, respectively.
Figure 6. The accuracy (%) of the classifiers and the projection errors when varying the number of singular vectors: (a) and (b) construction of the vision projection matrix and the language projection matrix under different numbers of shots, respectively; (c) and (d) the vision and language projection errors under the 16-shot setting, respectively.

4.1. Experimental Settings

Datasets. We evaluate our method on 11 widely used image classification benchmarks covering a diverse range of object, scene, texture, and fine-grained categories: ImageNet (Deng et al., 2009), Caltech101 (Fei-Fei et al., 2004), DTD (Cimpoi et al., 2014), EuroSAT (Helber et al., 2019), FGVCAircraft (Maji et al., 2013), Flowers102 (Nilsback and Zisserman, 2008), Food101 (Bossard et al., 2014), OxfordPets (Parkhi et al., 2012), StanfordCars (Krause et al., 2013), SUN397 (Xiao et al., 2010), and UCF101 (Soomro et al., 2012). Additionally, following (Zhou et al., 2022a), we adopt the ImageNet variants ImageNetV2 (Recht et al., 2019), ImageNet-Sketch (Wang et al., 2019), ImageNet-A (Hendrycks et al., 2021b), and ImageNet-R (Hendrycks et al., 2021a) to assess out-of-distribution generalization ability.
Implementation Details. Our experiments are conducted using ResNet-50 (RN50) as the visual encoder unless otherwise specified. The encoder weights remain frozen during both the training and inference stages. We follow the data preprocessing protocol of CLIP, including random cropping, resizing, and random horizontal flipping. To incorporate textual information, we insert the class names into the standard templates (e.g., “a photo of [class name]”). We set $Q=40$ and $C=40$ for region selection. The number of preserved principal components is set to 900 for both the vision and language subspace construction. To ensure a fair comparison, we use the same classification approach as the compared methods; when comparing with APE, we utilize the textual prompts provided by CuPL (Pratt et al., 2022).

Training Details. When compared with LFA (Ouali et al., 2023), our approach follows the same fine-tuning stage described in (Ouali et al., 2023), using the AdamW (Kingma and Ba, 2015) optimizer with a cosine scheduler, a learning rate of 5e-4, and a weight decay of 5e-4, for 50-200 iterations.

4.2. Ablation Study

In this section, we use RN50 on ImageNet (Deng et al., 2009) to evaluate the contributions and effectiveness of the different components of our approach.

Table 1. The classification accuracy (%) with different projection operations under different shot settings.
  $\bm{P}_{\text{vis}}$ $\bm{P}_{\text{tex}}^{i}$ $K$=16 $K$=8 $K$=4 $K$=2 $K$=1
– – 58.83 (CLIP)
✓ – 61.89 61.83 61.67 61.48 61.32
– ✓ 62.44 62.36 62.14 61.93 61.95
✓ ✓ 64.34 63.65 63.09 62.53 61.95
 
Table 2. The results of different visual encoders on ImageNet under a 16-shot setting. G.A. stands for GraphAdapter.
  Models RN50 RN101 ViT-B/32 ViT-B/16
CoOp (CVPR22) 62.95 66.60 66.85 71.92
TaskRes (CVPR23) 64.75 67.70 68.20 73.07
G.A. (NeurIPS23) 64.94 67.87 68.47 73.40
Tip (ECCV22) 62.03 65.19 65.87 70.82
Tip + SSP 62.75(+0.72) 65.73(+0.54) 66.56(+0.69) 71.56(+0.74)
LFA (ICCV23) 63.65 67.16 67.63 72.61
LFA + SSP 64.34(+0.69) 68.03(+0.87) 68.17(+0.54) 73.04(+0.43)
APE (ICCV23) 63.02 69.48 69.31 74.27
APE + SSP 63.33(+0.31) 69.83(+0.35) 69.84(+0.53) 74.79(+0.52)
 
Table 3. The classification accuracy (%) comparison on few-shot learning, i.e., 1-/2-/4-/8-/16-shot, across 11 datasets. The results for LFA, Tip, and APE are from our re-implementation based on their public open-source projects. The dataset abbreviations are F102. (Flowers102), Euro. (EuroSAT), F101. (Food101), SUN. (SUN397), C101. (Caltech101), UCF. (UCF101), and ImgN. (ImageNet).
  Method Pets F102. FGVC DTD Euro. Cars F101. SUN. C101. UCF. ImgN. Avg.
CLIP 85.77 66.14 17.28 42.32 37.56 55.61 77.31 58.52 86.29 61.46 58.18 58.77
16-shot
LFA (ICCV23) 86.75 94.56 35.86 66.35 84.13 73.58 76.32 71.32 92.68 77.00 63.65 74.75
LFA + SSP 86.40 95.13 35.10 67.85 84.83 73.08 77.79 69.95 93.35 78.11 64.34 75.08 \uparrow0.34
Tip (ECCV22) 88.10 89.93 29.88 60.70 70.59 66.61 77.88 66.82 90.63 70.68 62.03 70.35
Tip + SSP 88.83 90.62 30.18 62.23 73.62 67.19 77.94 67.01 91.56 70.95 62.75 71.17 \uparrow0.82
APE (ICCV23) 87.33 91.19 32.46 65.78 77.79 70.36 78.44 68.94 91.97 76.74 63.02 73.09
APE + SSP 88.12 91.51 32.79 67.91 78.33 70.77 78.51 69.14 92.05 78.46 63.33 73.72 \uparrow0.63
8-shot
LFA (ICCV23) 84.63 91.80 29.40 59.57 76.54 67.79 76.4 69.88 91.36 74.09 61.38 71.17
LFA + SSP 84.96 92.89 30.18 62.00 78.52 69.58 77.58 69.95 91.85 75.05 62.65 72.29 \uparrow1.12
Tip (ECCV22) 86.94 88.23 25.53 58.39 67.95 63.06 77.69 65.58 89.94 68.44 61.44 68.47
Tip + SSP 87.19 88.63 27.78 58.98 72.28 63.89 77.75 65.68 90.91 69.28 62.22 69.51 \uparrow1.04
APE (ICCV23) 86.97 90.78 28.38 63.65 75.04 65.86 77.71 67.90 91.60 70.34 62.63 70.99
APE + SSP 87.35 91.03 28.86 65.60 75.16 67.28 77.88 67.77 91.72 72.67 62.64 71.63 \uparrow0.64
4-shot
LFA (ICCV23) 83.51 89.40 24.39 55.14 70.74 63.19 77.83 67.71 89.86 69.18 57.80 68.07
LFA + SSP 84.06 89.65 26.88 57.86 72.31 63.70 77.60 68.65 90.87 71.45 59.91 69.36 \uparrow1.29
Tip (ECCV22) 86.48 83.80 22.11 53.90 65.54 61.23 77.52 64.23 89.09 66.19 61.00 66.46
Tip + SSP 86.81 84.13 23.67 54.79 67.21 61.47 77.64 64.27 90.14 67.80 61.98 67.26 \uparrow0.80
APE (ICCV23) 85.72 87.66 25.14 60.46 73.48 65.19 77.31 66.87 91.56 69.15 62.46 69.55
APE + SSP 86.24 87.86 24.92 61.35 74.35 65.38 77.58 66.95 91.65 70.10 62.53 69.90 \uparrow0.36
2-shot
LFA (ICCV23) 81.60 81.00 19.38 49.29 61.27 56.51 64.58 61.93 88.76 65.93 55.18 62.31
LFA + SSP 82.09 82.26 22.77 54.37 63.3 58.81 65.39 63.19 89.13 67.09 57.82 64.20 \uparrow1.89
Tip (ECCV22) 86.92 79.01 21.21 49.59 61.42 58.00 77.50 62.72 88.64 64.76 60.95 64.61
Tip + SSP 87.03 79.5 22.71 50.77 62.36 59.11 77.62 62.84 89.01 66.09 61.82 65.35 \uparrow0.74
APE (ICCV23) 85.20 83.68 23.55 54.67 71.89 61.25 77.62 65.94 89.94 66.14 62.38 67.48
APE + SSP 86.07 83.80 23.64 57.57 72.37 61.96 77.27 66.30 90.63 66.43 62.41 68.04 \uparrow0.61
1-shot
LFA (ICCV23) 79.61 76.00 16.26 45.09 60.10 50.81 77.29 58.55 85.19 59.00 52.46 60.03
LFA + SSP 82.45 77.18 19.68 51.65 60.94 56.09 77.30 61.87 87.87 66.59 56.11 63.43 \uparrow3.40
Tip (ECCV22) 86.02 73.12 18.96 46.10 54.41 57.37 77.42 61.31 87.06 62.75 60.69 62.29
Tip + SSP 86.32 76.05 19.74 46.81 59.17 57.60 77.58 61.49 88.76 63.02 61.71 63.48 \uparrow1.19
APE (ICCV23) 85.04 79.98 21.03 54.31 67.59 60.25 77.07 64.55 89.66 63.36 62.04 65.90
APE + SSP 85.34 79.51 20.64 54.73 68.63 60.29 77.22 64.36 90.26 63.34 62.05 66.03 \uparrow0.14
 
Table 4. The classification accuracy (%) comparison on out-of-distribution test data under 16-shot setting.
  Source Target
Method ImageNet ImageNet-A ImageNet-V2 ImageNet-R ImageNet-Sketch Avg. OOD Avg.
CLIP 58.16 21.83 51.41 56.15 33.37 44.18 40.69
CoOp (IJCV22) 63.33 23.06 55.40 56.60 34.67 46.61 42.43
CoCoOp (CVPR22) 62.81 23.32 55.72 57.74 34.48 46.81 42.82
LFA (ICCV23) 63.88 24.31 55.79 58.13 34.37 47.29 43.15
Ours 64.34 24.53 56.04 58.96 34.58 47.69 43.53 \uparrow0.38
 

4.2.1. The selection of local image features.

The selected regions of the local image features are crucial in our method, as they determine which salient regions of the image contribute to the projection subspaces. These selected local features should contain discriminative patterns relevant to the category while excluding irrelevant areas. To identify the optimal number of regions for the vision subspace and the language subspace, we vary $Q$ and $C$ from 15 to 49, respectively. The results are depicted in Figure 5(a) and Figure 5(b). We observe that as $Q$ and $C$ increase, the average accuracy first increases and then decreases, with the highest accuracy achieved in the range of 35-45. Furthermore, we analyze the disparity between the regions selected for images and for texts. To this end, we calculate the similarity between the text features and the regions selected based on image features, as well as the similarity between the image features and the regions selected based on text features. The results are shown in Figure 5(c) and Figure 5(d). In Figure 5(c), as the selected regions become more complete (blue curve), the similarity between the selected region features and the text label feature (red curve) decreases. Similarly, in Figure 5(d), as the number of selected regions increases, the similarity between the selected region features and the image features (blue curve) also decreases. These observations indicate that the regions responding to image features and to text features are different, and the selection should be performed separately for each of them.

4.2.2. The effects of subspace construction.

We investigate the effect of using different numbers of singular vectors in constructing the vision projection matrix ($\bm{P}_{\text{vis}}$) and the language projection matrices ($\bm{P}_{\text{tex}}^{i}$). Considering a feature dimension of $d=1024$, we vary the number of singular vectors from 450 to 1000. The results are presented in Figure 6(a) and Figure 6(b). First, we observe that using a small number of singular vectors in both the vision and language subspace constructions leads to poor classification performance. Specifically, for the vision subspace, when the number exceeds 700, the results converge across all shot settings, indicating that 700 singular vectors are sufficient to span the entire vision subspace. For the language subspace, the projection operation brings noticeable improvements, particularly in low-shot settings; when the number of singular vectors ranges from 900 to 1000, the accuracy converges across all settings, suggesting that 900-1000 singular vectors can span the entire language subspace. Additionally, the results in Figure 6(c) and Figure 6(d) show that the vision subspace projection error decreases significantly once the number of singular vectors exceeds 700, while the language subspace projection error decreases gradually. This is because the projection matrix becomes nearly an identity matrix (e.g., $\bm{P}_{\text{tex}}^{i}\approx\bm{I}$), making the projection operation almost trivial ($\bm{f}_{\text{test}}=\bm{I}\cdot\bm{f}_{\text{test}}$) and driving the projection error close to zero ($(\bm{I}-\bm{P}_{\text{tex}}^{i})\bm{f}_{\text{test}}\approx\bm{0}$). Based on the above analysis, we employ 900 singular vectors for constructing both the vision and language subspaces.

4.2.3. The effects of the vision and language projectors.

We analyze the impact of the vision projector and the language projector individually. The performance across different shot settings, summarized in Table 1, shows that both the vision subspace projection and the language subspace projection lead to improvements over the original CLIP. Specifically, under $K=16$, vision projection yields a 3.61% accuracy improvement, while language projection achieves a 3.01% accuracy improvement. Furthermore, combining both projection operations yields the best performance, as indicated by the last row of Table 1.

4.3. Performance Comparisons

We conduct a comprehensive comparison of our method with recent approaches, including CoOp (Zhou et al., 2022b), TaskRes (Yu et al., 2023b), GraphAdapter (Li et al., 2023a), LFA (Ouali et al., 2023), APE (Zhu et al., 2023d), and Tip (Zhang et al., 2022b), across 11 image classification datasets. We first perform experiments using different visual encoders, including RN50, RN101, ViT-B/32, and ViT-B/16, on ImageNet under the 16-shot setting. The results, summarized in Table 2, indicate that our SSP method leads to performance improvements over Tip, LFA, and APE. Notably, when built upon the APE method, our SSP outperforms all other methods. Furthermore, we extend our comparison to different shot settings on the various datasets, with the results reported in Table 3. While our method does not achieve the highest performance on every individual dataset, it consistently outperforms the other methods in terms of the average accuracy across all datasets, demonstrating the robustness of our SSP method. Specifically, compared to LFA, our method achieves a significant average accuracy improvement of 3.4% in the 1-shot setting. Furthermore, even when compared to APE, our method consistently exhibits slightly better performance across all shot settings.

4.4. Domain Generalization Comparisons

We conduct experiments using ImageNet as the source dataset, which provides 16-shot training images for each category, and test our method on four ImageNet variants: ImageNetV2 (Recht et al., 2019), ImageNet-Sketch (Wang et al., 2019), ImageNet-A (Hendrycks et al., 2021b), and ImageNet-R (Hendrycks et al., 2021a). These test datasets share the same class labels as ImageNet but exhibit domain gaps. The classification results are summarized in Table 4. Our method achieves a notable improvement of 0.73% over LFA on the ImageNet-R dataset and an average performance improvement of 0.38% across all out-of-distribution (OOD) datasets. These results demonstrate that our SSP is robust under domain shift.

Table 5. A comparison of training duration and parameter counts on the ImageNet dataset, utilizing an RN50 visual encoder on a single A100 GPU under the 16-shot setting.
  Method Time Params. Acc.
SSP (only Vis. Proj.) 4.4s - 61.84
SSP (only Lang. Proj.) 2Min1s - 62.44
LFA 3Min9s 1.05 M 63.65
+SSP 4Min5s 1.05 M 64.34(+0.69)
Tip 5Min10s - 62.03
+SSP 6Min15s - 62.75(+0.72)
APE 5Min50s - 63.02
+SSP 6Min50s - 63.33(+0.31)
 

4.5. Analysis of Computation Efficiency

We compare the computation time and parameter counts of our SSP and other approaches on the ImageNet dataset under the 16-shot setting. The results in Table 5 show that our SSP does not introduce any learnable parameters and incurs computation time mainly in matrix operations. To align the image features, SSP requires a single SVD operation, which typically takes only a few seconds and is negligible. For text feature alignment, the number of SVD operations depends on the total number of categories in the dataset (e.g., 1000 categories in ImageNet). When combined with LFA, Tip, or APE, our SSP requires roughly one extra minute of computation but consistently improves the classification accuracy, e.g., by 0.72% over Tip.

5. Conclusion

In this paper, we have proposed a method named Selective Vision-Language Subspace Projection (SSP) to align the different modality features extracted by pre-trained CLIP through subspace projection. This alignment strengthens the generalization ability of CLIP in few-shot scenarios. Our SSP can be seamlessly integrated into various classification frameworks and does not introduce any additional learnable parameters. Extensive experiments validate the effectiveness of the proposed method. In future work, we plan to explore more general alignment techniques by incorporating other pre-trained models, such as large language models and diffusion models.

6. Acknowledgments

The work was supported by the National Natural Science Foundation of China (Grant No. 62202439). This work was also supported by the advanced computing resources provided by the Supercomputing Center of the USTC.

References

  • Banerjee et al. (2005) Arindam Banerjee, Inderjit S Dhillon, Joydeep Ghosh, Suvrit Sra, and Greg Ridgeway. 2005. Clustering on the Unit Hypersphere using von Mises-Fisher Distributions. JMLR (2005).
  • Barraco et al. (2022) Manuele Barraco, Marcella Cornia, Silvia Cascianelli, Lorenzo Baraldi, and Rita Cucchiara. 2022. The Unreasonable Effectiveness of CLIP Features for Image Captioning: An Experimental Analysis. In CVPR Workshops.
  • Bossard et al. (2014) Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. 2014. Food-101 - Mining Discriminative Components with Random Forests. In ECCV.
  • Cimpoi et al. (2014) Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. 2014. Describing Textures in the Wild. In CVPR.
  • Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In CVPR.
  • Du et al. (2024) Chaoqun Du, Yulin Wang, Shiji Song, and Gao Huang. 2024. Probabilistic Contrastive Learning for Long-Tailed Visual Recognition. CoRR abs/2403.06726 (2024).
  • Fei-Fei et al. (2004) Li Fei-Fei, Rob Fergus, and Pietro Perona. 2004. Learning Generative Visual Models from Few Training Examples: An Incremental Bayesian Approach Tested on 101 Object Categories. In CVPR Workshops.
  • Fu et al. (2022) Jinmiao Fu, Shaoyuan Xu, Huidong Liu, Yang Liu, Ning Xie, Chien-Chih Wang, Jia Liu, Yi Sun, and Bryan Wang. 2022. CMA-CLIP: Cross-Modality Attention Clip for Text-Image Classification. In ICIP.
  • Gao et al. (2024) Peng Gao, Shijie Geng, Renrui Zhang, Teli Ma, Rongyao Fang, Yongfeng Zhang, Hongsheng Li, and Yu Qiao. 2024. CLIP-Adapter: Better Vision-Language Models with Feature Adapters. IJCV (2024).
  • Gu et al. (2022) Xiuye Gu, Tsung-Yi Lin, Weicheng Kuo, and Yin Cui. 2022. Open-vocabulary Object Detection via Vision and Language Knowledge Distillation. In ICLR.
  • Guerci and Bergin (2002) JR Guerci and JS Bergin. 2002. Principal components, covariance matrix tapers, and the subspace leakage problem. IEEE TAES (2002).
  • Guo et al. (2024) Dan Guo, Kun Li, Bin Hu, Yan Zhang, and Meng Wang. 2024. Benchmarking Micro-action Recognition: Dataset, Method, and Application. IEEE Trans. Circuits Syst. Video Technol. (2024).
  • Guo et al. (2019) Dan Guo, Shuo Wang, Qi Tian, and Meng Wang. 2019. Dense Temporal Convolution Network for Sign Language Translation. In IJCAI.
  • Guzhov et al. (2022) Andrey Guzhov, Federico Raue, Jörn Hees, and Andreas Dengel. 2022. Audioclip: Extending Clip to Image, Text and Audio. In ICASSP.
  • Helber et al. (2019) Patrick Helber, Benjamin Bischke, Andreas Dengel, and Damian Borth. 2019. EuroSAT: A Novel Dataset and Deep Learning Benchmark for Land Use and Land Cover Classification. IEEE J-STARS (2019).
  • Hendrycks et al. (2021a) Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, Dawn Song, Jacob Steinhardt, and Justin Gilmer. 2021a. The Many Faces of Robustness: A Critical Analysis of Out-of-Distribution Generalization. In ICCV.
  • Hendrycks et al. (2021b) Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. 2021b. Natural Adversarial Examples. CVPR (2021).
  • Jia et al. (2021) Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. 2021. Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision. In ICML.
  • Jiang et al. (2022) Haojun Jiang, Jianke Zhang, Rui Huang, Chunjiang Ge, Zanlin Ni, Jiwen Lu, Jie Zhou, Shiji Song, and Gao Huang. 2022. Cross-Modal Adapter for Text-Video Retrieval. CoRR abs/2211.09623 (2022).
  • Kingma and Ba (2015) Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In ICLR.
  • Krause et al. (2013) Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 2013. 3D Object Representations for Fine-Grained Categorization. In ICCV Workshops.
  • Li et al. (2022) Junnan Li, Dongxu Li, Caiming Xiong, and Steven C. H. Hoi. 2022. BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. In ICML.
  • Li et al. (2021) Junnan Li, Ramprasaath R. Selvaraju, Akhilesh Gotmare, Shafiq R. Joty, Caiming Xiong, and Steven Chu-Hong Hoi. 2021. Align before Fuse: Vision and Language Representation Learning with Momentum Distillation. In NeurIPS.
  • Li et al. (2023a) Xin Li, Dongze Lian, Zhihe Lu, Jiawang Bai, Zhibo Chen, and Xinchao Wang. 2023a. GraphAdapter: Tuning Vision-Language Models With Dual Knowledge Graph. In NeurIPS.
  • Li et al. (2023b) Yi Li, Hualiang Wang, Yiqun Duan, and Xiaomeng Li. 2023b. CLIP Surgery for Better Explainability with Enhancement in Open-Vocabulary Tasks. CoRR abs/2304.05653 (2023).
  • Liang et al. (2022) Weixin Liang, Yuhui Zhang, Yongchan Kwon, Serena Yeung, and James Y. Zou. 2022. Mind the Gap: Understanding the Modality Gap in Multi-modal Contrastive Representation Learning. In NeurIPS.
  • Liu et al. (2020) Fenglin Liu, Xian Wu, Shen Ge, Xiaoyu Zhang, Wei Fan, and Yuexian Zou. 2020. Bridging the Gap between Vision and Language Domains for Improved Image Captioning. In ACM Multimedia.
  • Liu et al. (2024) Fan Liu, Tianshu Zhang, Wenwen Dai, Wenwen Cai, Xiaocong Zhou, and Delong Chen. 2024. Few-shot Adaptation of Multi-modal Foundation Models: A Survey. CoRR abs/2401.01736 (2024).
  • Lu et al. (2023) Jinda Lu, Shuo Wang, Xinyu Zhang, Yanbin Hao, and Xiangnan He. 2023. Semantic-based Selection, Synthesis, and Supervision for Few-shot Learning. In ACM Multimedia.
  • Lu et al. (2022) Yuning Lu, Jianzhuang Liu, Yonggang Zhang, Yajing Liu, and Xinmei Tian. 2022. Prompt Distribution Learning. In CVPR.
  • Maji et al. (2013) Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew B. Blaschko, and Andrea Vedaldi. 2013. Fine-Grained Visual Classification of Aircraft. CoRR abs/1306.5151 (2013).
  • Miyai et al. (2023) Atsuyuki Miyai, Qing Yu, Go Irie, and Kiyoharu Aizawa. 2023. LoCoOp: Few-Shot Out-of-Distribution Detection via Prompt Learning. CoRR abs/2306.01293 (2023).
  • Mokady et al. (2021) Ron Mokady, Amir Hertz, and Amit H. Bermano. 2021. ClipCap: CLIP Prefix for Image Captioning. CoRR abs/2111.09734 (2021).
  • Nilsback and Zisserman (2008) Maria-Elena Nilsback and Andrew Zisserman. 2008. Automated Flower Classification over a Large Number of Classes. In ICVGIP.
  • Ouali et al. (2023) Yassine Ouali, Adrian Bulat, Brais Martínez, and Georgios Tzimiropoulos. 2023. Black Box Few-Shot Adaptation for Vision-Language models. In ICCV.
  • Parkhi et al. (2012) Omkar M. Parkhi, Andrea Vedaldi, Andrew Zisserman, and C. V. Jawahar. 2012. Cats and dogs. In CVPR.
  • Pratt et al. (2022) Sarah M. Pratt, Rosanne Liu, and Ali Farhadi. 2022. What does a platypus look like? Generating customized prompts for zero-shot image classification. CoRR abs/2209.03320 (2022).
  • Radford et al. (2021) Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning Transferable Visual Models From Natural Language Supervision. In ICML.
  • Recht et al. (2019) Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. 2019. Do ImageNet Classifiers Generalize to ImageNet?. In ICML.
  • Rombach et al. (2022) Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. High-Resolution Image Synthesis with Latent Diffusion Models. In CVPR.
  • Smith et al. (2017) Samuel L. Smith, David H. P. Turban, Steven Hamblin, and Nils Y. Hammerla. 2017. Offline bilingual word vectors, orthogonal transformations and the inverted softmax. In ICLR.
  • Soomro et al. (2012) Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. 2012. UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild. CoRR abs/1212.0402 (2012).
  • Udandarao (2022) Vishaal Udandarao. 2022. Understanding and fixing the modality gap in vision-language models. Master’s thesis, University of Cambridge (2022).
  • Udandarao et al. (2022) Vishaal Udandarao, Ankush Gupta, and Samuel Albanie. 2022. SuS-X: Training-Free Name-Only Transfer of Vision-Language Models. CoRR abs/2211.16198 (2022).
  • Wang et al. (2019) Haohan Wang, Songwei Ge, Zachary Lipton, and Eric P Xing. 2019. Learning Robust Global Representations by Penalizing Local Predictive Power. In NeurIPS.
  • Wang et al. (2018) Shuo Wang, Dan Guo, Wengang Zhou, Zheng-Jun Zha, and Meng Wang. 2018. Connectionist Temporal Fusion for Sign Language Translation. In ACM Multimedia.
  • Wang et al. (2024) Shuo Wang, Jinda Lu, Haiyang Xu, Yanbin Hao, and Xiangnan He. 2024. Feature Mixture on Pre-Trained Model for Few-Shot Learning. IEEE Trans. Image Process. (2024).
  • Wang et al. (2020) Shuo Wang, Jun Yue, Jianzhuang Liu, Qi Tian, and Meng Wang. 2020. Large-Scale Few-Shot Learning via Multi-modal Knowledge Discovery. In ECCV.
  • Wang et al. (2022) Shuo Wang, Xinyu Zhang, Yanbin Hao, Chengbing Wang, and Xiangnan He. 2022. Multi-directional Knowledge Transfer for Few-Shot Learning. In ACM Multimedia.
  • Wang et al. (2023) Zhicai Wang, Yanbin Hao, Tingting Mu, Ouxiang Li, Shuo Wang, and Xiangnan He. 2023. Bi-Directional Distribution Alignment for Transductive Zero-Shot Learning. In CVPR.
  • Xiao et al. (2010) Jianxiong Xiao, James Hays, Krista A. Ehinger, Aude Oliva, and Antonio Torralba. 2010. SUN database: Large-scale scene recognition from abbey to zoo. In CVPR.
  • Yan et al. (2023a) Rui Yan, Lingxi Xie, Xiangbo Shu, Liyan Zhang, and Jinhui Tang. 2023a. Progressive Instance-Aware Feature Learning for Compositional Action Recognition. IEEE Trans. Pattern Anal. Mach. Intell. (2023).
  • Yan et al. (2023b) Rui Yan, Lingxi Xie, Jinhui Tang, Xiangbo Shu, and Qi Tian. 2023b. HiGCIN: Hierarchical Graph-Based Cross Inference Network for Group Activity Recognition. IEEE Trans. Pattern Anal. Mach. Intell. (2023).
  • Ye et al. (2023) Zhenhui Ye, Rongjie Huang, Yi Ren, Ziyue Jiang, Jinglin Liu, Jinzheng He, Xiang Yin, and Zhou Zhao. 2023. CLAPSpeech: Learning Prosody from Text Context with Contrastive Language-Audio Pre-Training. In ACL.
  • Yu et al. (2023a) Jiarui Yu, Haoran Li, Yanbin Hao, Jinmeng Wu, Tong Xu, Shuo Wang, and Xiangnan He. 2023a. How Can Contrastive Pre-training Benefit Audio-Visual Segmentation? A Study from Supervised and Zero-shot Perspectives. In BMVC.
  • Yu et al. (2022) Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, and Yonghui Wu. 2022. CoCa: Contrastive Captioners are Image-Text Foundation Models. CoRR abs/2205.01917 (2022).
  • Yu et al. (2023b) Tao Yu, Zhihe Lu, Xin Jin, Zhibo Chen, and Xinchao Wang. 2023b. Task Residual for Tuning Vision-Language Models. In CVPR.
  • Yuan and Gan (2017) Xiaolei Yuan and Lu Gan. 2017. Robust adaptive beamforming via a novel subspace method for interference covariance matrix reconstruction. Signal Processing (2017).
  • Zavras et al. (2024) Angelos Zavras, Dimitrios Michail, Begüm Demir, and Ioannis Papoutsis. 2024. Mind the Modality Gap: Towards a Remote Sensing Vision-Language Model via Cross-modal Alignment. CoRR abs/2402.09816 (2024).
  • Zhang et al. (2023) Jingyi Zhang, Jiaxing Huang, Sheng Jin, and Shijian Lu. 2023. Vision-Language Models for Vision Tasks: A Survey. CoRR abs/2304.00685 (2023).
  • Zhang et al. (2022a) Renrui Zhang, Ziyu Guo, Wei Zhang, Kunchang Li, Xupeng Miao, Bin Cui, Yu Qiao, Peng Gao, and Hongsheng Li. 2022a. PointCLIP: Point Cloud Understanding by CLIP. In CVPR.
  • Zhang et al. (2022b) Renrui Zhang, Wei Zhang, Rongyao Fang, Peng Gao, Kunchang Li, Jifeng Dai, Yu Qiao, and Hongsheng Li. 2022b. Tip-Adapter: Training-Free Adaption of CLIP for Few-Shot Classification. In ECCV.
  • Zhou et al. (2022a) Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. 2022a. Conditional Prompt Learning for Vision-Language Models. In CVPR.
  • Zhou et al. (2022b) Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. 2022b. Learning to Prompt for Vision-Language Models. IJCV (2022).
  • Zhu et al. (2023a) Beier Zhu, Yulei Niu, Yucheng Han, Yue Wu, and Hanwang Zhang. 2023a. Prompt-aligned Gradient for Prompt Tuning. In ICCV.
  • Zhu et al. (2023b) Beier Zhu, Yulei Niu, Saeil Lee, Minhoe Hur, and Hanwang Zhang. 2023b. Debiased Fine-Tuning for Vision-Language Models by Prompt Regularization. AAAI.
  • Zhu et al. (2023c) Beier Zhu, Kaihua Tang, Qianru Sun, and Hanwang Zhang. 2023c. Generalized Logit Adjustment: Calibrating Fine-tuned Models by Removing Label Bias in Foundation Models. In NeurIPS.
  • Zhu et al. (2024) Xingyu Zhu, Shuo Wang, Jinda Lu, Yanbin Hao, Haifeng Liu, and Xiangnan He. 2024. Boosting Few-Shot Learning via Attentive Feature Regularization. In AAAI.
  • Zhu et al. (2020) Xingyu Zhu, Xu Xu, and Zhongfu Ye. 2020. Robust adaptive beamforming via subspace for interference covariance matrix reconstruction. Signal Processing (2020).
  • Zhu et al. (2023d) Xiangyang Zhu, Renrui Zhang, Bowei He, Aojun Zhou, Dong Wang, Bin Zhao, and Peng Gao. 2023d. Not All Features Matter: Enhancing Few-shot CLIP with Adaptive Prior Refinement. In ICCV.