
Uniformly Distributed Category Prototype-Guided Vision-Language Framework for Long-Tail Recognition

Siming Fu (Alibaba Group, Hangzhou, China), Xiaoxuan He (Alibaba Group, Hangzhou, China), Xinpeng Ding (Hong Kong University of Science and Technology, Hong Kong, China), Yuchen Cao (Zhejiang University, Hangzhou, China), and Hualiang Wang (Hong Kong University of Science and Technology, Hong Kong, China)
(2023)
Abstract.

Recently, large-scale pre-trained vision-language models have shown benefits for alleviating class imbalance in long-tailed recognition. However, the long-tailed data distribution can corrupt the representation space, where the distance between head and tail categories is much larger than the distance between two tail categories. This uneven feature space distribution causes the model to exhibit unclear and inseparable decision boundaries on the uniformly distributed test set, which lowers its performance. To address these challenges, we propose a uniformly distributed category prototype-guided vision-language framework that effectively mitigates the feature space bias caused by data imbalance. Specifically, we generate a set of category prototypes uniformly distributed on a hypersphere. A category prototype-guided mechanism for image-text matching drives the features of different classes to converge to these distinct, uniformly distributed prototypes, which preserves a uniform distribution in the feature space and improves class boundaries. Additionally, our proposed irrelevant text filtering and attribute enhancement module allows the model to ignore irrelevant noisy text and focus more on key attribute information, thereby enhancing the robustness of our framework. In the image recognition fine-tuning stage, to address the positive bias problem of the learnable classifier, we design a class feature prototype-guided classifier, which compensates for the performance of tail classes while maintaining the performance of head classes. Our method outperforms previous vision-language methods for long-tailed learning by a large margin and achieves state-of-the-art performance.

long-tailed learning, vision-language framework, uniformly distributed, irrelevant text filter
Copyright: ACM licensed. Journal year: 2023. Conference: Proceedings of the 31st ACM International Conference on Multimedia (MM '23), October 29-November 3, 2023, Ottawa, ON, Canada. Price: 15.00. DOI: 10.1145/3581783.3611904. ISBN: 979-8-4007-0108-5/23/10. CCS: Computing methodologies, Computer vision.

1. Introduction

Real-world data often exhibits a long-tailed distribution, where a few head classes contain the majority of the data, while most tail classes have very few samples. This phenomenon is unfavorable for the practical application of deep learning models. As a result, many works have attempted to alleviate class imbalance from different perspectives, such as re-sampling training data (Kang et al., 2019; Mahajan et al., 2018; Fu et al., 2022), re-weighting loss functions (Elkan, 2001; Zhou and Liu, 2005; Sun et al., 2007; Wang et al., 2022), or using transfer learning methods (Wang et al., 2017; Chu et al., 2020; Wang et al., 2021b). Although these works have made significant contributions, they are still limited by relying solely on image patterns to solve the problem, ignoring intermodal relationships.

Undeniably, there are inherent connections between images and text descriptions of the same category, especially when it comes to visual concepts and attributes. Unlike image patterns, which typically capture specific low-level features of objects or scenes such as shape, color, and texture, text patterns can be summarized by experts as prior knowledge, which provides great help for model training when there are not enough images to learn a general classification representation for recognition. Recent advancements in vision-language models have produced effective methods for acquiring robust representations that interconnect the image and text modalities for long-tailed recognition. VL-LTR (Tian et al., 2022) and BALLAD (Ma et al., 2021) employ the multimodal potential of vision-language frameworks to devise powerful techniques that mitigate class imbalance.

Figure 1. A toy feature distribution comparison (left) before and (right) after applying the uniformly distributed category prototype-guided image-text matching for three classes within ImageNet-LT. The elephant represents the head category, while the other two illustrate tail categories. Prior to optimization, the representation space is dominated by the head category, leading to poor uniformity in learned category prototypes. Our approach improves performance by enhancing uniformity. Markers: circles indicate the representation space, triangles are text and image features, and pentagrams are category prototypes. The qualitative experiment visualization is shown in Appendix B.

The aforementioned studies (Tian et al., 2022; Ma et al., 2021) provide valuable perspectives on the capabilities of vision-language models in augmenting recognition performance for long-tailed datasets. However, these studies overlook a critical phenomenon: when the data distribution is imbalanced, the model’s training process tends to emphasize the uniformity loss between head and tail categories while neglecting the uniformity loss between two tail categories (Li et al., 2022a). As illustrated in the left panel of Fig. 1, the representation space is predominantly occupied by the head category (elephant), resulting in a substantially larger distance between the head and tail categories compared to the distance between the two tail categories. The greater the imbalance of long-tailed data, the greater the bias in the feature space distribution and the more uneven the distribution becomes. This uneven feature space distribution causes the model to exhibit unclear and inseparable decision boundaries on the uniformly distributed test set, which lowers its performance and is the main reason why vision-language models applied to long-tailed recognition fail to achieve satisfactory performance. Furthermore, we notice that the text contains a significant amount of noise; the representation learned by the model may incorporate these irrelevant details, degrading the quality and semantic relevance of the representation. This can introduce inaccuracies and ambiguities into the inference process, thus misleading the model. Last but not least, in the image recognition process, because head classes have significantly more examples, classifiers are induced to focus on them to optimize evaluation metrics and accuracy. This leads to poor performance on tail classes, known as the positive bias of the classifier.

To address the aforementioned challenges, we propose a uniformly distributed category prototype-guided vision-language model framework, which effectively avoids the collapse of feature space representation quality and model performance caused by scarce data in tail categories. Specifically, we leverage the “anchor text” to obtain uniformly distributed initialization category feature prototypes in the hypersphere feature space, and use these pre-generated prototypes as category targets. Furthermore, we design a visual-language representation framework based on category feature prototype contrastive learning, which is different from the conventional approach of directly aligning text and image features in contrastive learning. Instead, we match the cross-modalities features of corresponding categories to their target category prototypes during the stage of image-text matching. This ensures that the distribution of image and text features in the feature space is uniform and avoids being dominated and biased by the head categories. As depicted in the right panel of Fig. 1, the acquired category prototypes demonstrate a uniform distribution, which consequently results in an enhancement of the model’s performance. In addition, based on the learned category feature prototypes, we design an extraneous text filter and attribute enhancement module that enables the model to focus on key textual attributes while avoiding the learning of trivial correlations between image and text pairs. In the final stage of image recognition fine-tuning, we design a class feature prototype-guided classifier to alleviate the positive bias issue inherent in conventional learnable classifiers (Kang et al., 2019). The contributions of this paper can be summarized as follows:

  • We propose a uniformly distributed category prototype-guided visual-linguistic framework, in which the category prototype-guided mechanism for image-text matching guarantees a uniform distribution of image and text features in the feature space, preventing domination and bias by head categories.

  • We propose the irrelevant text filter and an attribute enhancement module that allows the model to focus on key textual attributes while avoiding the learning of trivial correlations between image and text pairs.

  • We design the category prototype-guided recognition classifier to alleviate positive bias in the inference process of head categories.

  • Our proposed method achieves state-of-the-art performance on long-tailed distributed datasets, including ImageNet-LT, Places-LT, and iNaturalist 2018.


2. Related Work

Long-tailed Recognition. Real-world data usually has an unbalanced distribution. To address the long-tailed class imbalance, numerous deep long-tailed learning studies have been conducted in recent years, covering class re-balancing, information augmentation, and module improvement. Class re-balancing seeks to counteract the negative influence brought by the class imbalance in training sample numbers. Seesaw loss (Wang et al., 2021d) re-balances positive and negative gradients for each class with two re-weighting factors, i.e., mitigation and compensation. Label-Distribution-Aware Margin (LDAM) (Cao et al., 2019) enforces class-dependent margin factors for different classes based on their training label frequencies, which encourages tail classes to have larger margins. The vMF classifier (Wang et al., 2022) optimizes the class-wise features via vMF distributions to obtain uniform and aligned representations. Information augmentation seeks to introduce additional information into model training so that model performance can be improved for long-tailed learning. Routing Diverse Experts (RIDE) (Wang et al., 2020) introduced a knowledge distillation method that reduces the parameters of the multi-expert model by learning a student network with fewer experts. MiSLAS (Zhong et al., 2021) proposed using data mixup to enhance representation learning in the decoupled scheme. Module improvement methods refine the network modules used in long-tailed learning. DRO-LT (Samuel and Chechik, 2021) extended prototypical contrastive learning with distributionally robust optimization (Goh and Sim, 2010), which makes the learned model more robust to distribution shift. Notably, the aforementioned methods predominantly address long-tailed recognition by considering only the image modality, while seldom incorporating the text modality into the problem.

Vision-Language Model. Vision-language pre-training (VLP) is typically directed by specific vision-language objectives (Radford et al., 2021; Yao et al., 2021; Li et al., 2022c, 2023), facilitating the learning of image-text correspondences from extensive image-text pair datasets (Schuhmann et al., 2022, 2021). CLIP (Radford et al., 2021) adopts an image-text contrastive objective, which learns by attracting paired images and texts while repelling unrelated ones in the embedding space. SimVLM (Wang et al., 2021c) considerably streamlines VLP by exclusively leveraging language modeling objectives on weakly aligned image-text pairs. ALIGN (Jia et al., 2021) introduces a straightforward dual-encoder architecture that learns to align visual and language representations from noisy image and text data, which can subsequently be applied to vision-only or vision-language task transfers. Furthermore, BLIP (Li et al., 2022b) presents a novel VLP framework that flexibly adapts to both vision-language understanding and generation tasks.

There have been several vision-language approaches designed for long-tailed recognition. BALLAD (Ma et al., 2021) first continues pre-training the vision-language backbone through contrastive learning on a specific long-tailed target dataset and employs an additional adapter layer to enhance the representations of tail classes. VL-LTR (Tian et al., 2022) presents a new visual linguistic framework for long-tailed visual recognition, which includes a class-wise text-image pre-training and a language-guided recognition head. However, they focus on leveraging the multi-modality capability of the vision-language framework, ignoring that long-tailed distribution can still corrupt the representation space.

3. The Proposed Method

Figure 2. Overview of our proposed category-feature prototype guided image-text matching process. Using a prototype-based contrastive loss function, we align the image and text features of the same category with category feature prototypes in the hypersphere space, which ensures an efficient representation of the feature space. To further enhance this process, we have also created a module to filter out irrelevant text and enhance attributes, mitigating noise interference in the text. Together, these solutions mitigate issues in long-tailed recognition through balanced feature space and robust textual representations.

3.1. Uniformly Distributed Category Prototype-Guided Image-Text Matching

In representation learning, an ideal characteristic is the attainment of prototypes with a uniform distribution (Li et al., 2022a; Cui et al., 2021). This property signifies that supervised contrastive learning should converge to prototypes with distinct classes that are uniformly dispersed on a hypersphere. A uniform prototype maximizes inter-class distances within the feature space, thereby optimizing the margin and enhancing the generalization capability of long-tailed recognition models. It has been observed that the attribute variability stemming from imbalanced data distribution in long-tailed recognition primarily contributes to reduced prediction confidence for tail classes. To tackle this issue, we devise an image and text matching mechanism founded upon the category prototype, which exhibits a uniform distribution on a hypersphere. The category prototype embodies the aforementioned class features. Our proposed method effectively mitigates feature space bias induced by imbalanced data distribution between head and tail classes, diverging from the traditional approach of directly comparing text and image information.

Initialization of Uniformly Distributed Category Prototypes. In contrastive learning, ideal prototypes should be uniformly situated on the unit hypersphere $\mathbb{S}^{d-1}=\{\boldsymbol{u}\in\mathbb{R}^{d}:\|\boldsymbol{u}\|=1\}$ within the feature space. The robust representational capacity of the CLIP model enables the generation of uniformly distributed category features, which effectively represent distinct categories. Inspired by this, we employ the CLIP model to pre-determine the optimal location for each category prototype in the hyperspherical feature space. Specifically, we utilize an “anchor text” $T$ for each category, generated from the template “a photo of a $\{label\}$”, where $\{label\}$ is substituted with the category name. By feeding the anchor text into the pre-trained CLIP model, we obtain the “anchor text feature” and use it to initialize the category feature prototype as follows:

(1) $\boldsymbol{c}^{k}=\mathcal{L}_{\mathrm{enc}}\left(T\right),$

where $\boldsymbol{c}^{k}$ represents the prototype for category $k$, and $\mathcal{L}_{\mathrm{enc}}$ denotes the language encoder of the pre-trained CLIP model. We employ $\boldsymbol{c}^{k}$ as the target against which text and image information is matched, promoting the learning of more discriminative features by the text and image encoders. Furthermore, these initialized category feature prototypes are uniformly distributed on the hypersphere, preventing the encoders from developing a preference for high-frequency categories. Note that we use the pre-trained CLIP encoder in this process; we investigate its impact in Appendix C.
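The initialization can be reproduced in a few lines. The following is a minimal sketch, assuming OpenAI's open-source `clip` package and a placeholder list of class names; it is not the authors' released implementation.

```python
import torch
import torch.nn.functional as F
import clip  # pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("RN50", device=device)  # ResNet-50 CLIP backbone, as used in the paper

class_names = ["tiger", "elephant", "goldfish"]  # placeholder class names
anchor_texts = [f"a photo of a {name}" for name in class_names]

with torch.no_grad():
    tokens = clip.tokenize(anchor_texts).to(device)
    prototypes = model.encode_text(tokens).float()   # (K, d) anchor text features
    prototypes = F.normalize(prototypes, dim=-1)     # project onto the unit hypersphere

# prototypes[k] plays the role of the initial category prototype c^k of Eq. (1)
```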

Uniformly Distributed Category Prototype-guided mechanism for image-text matching. Within vision-language frameworks, the prevalent method for learning feature representations involves directly aligning image and text features in the feature space. However, in long-tailed datasets, sample imbalances within tail categories may disrupt the uniformity of feature prototypes on the hypersphere, resulting in performance deterioration. To tackle this challenge, we introduce an innovative approach based on category prototype contrastive learning to promote the alignment of image and text features. During the image-text matching process, a batch of images $\mathcal{I}=\{I_{i}\}_{i=1}^{N}$ and corresponding text sentences $\mathcal{T}=\{T_{i}\}_{i=1}^{N}$ are randomly sampled, where $N$ denotes the batch size. These images and texts are subsequently encoded into feature representations utilizing the visual encoder $\mathcal{V}_{\mathrm{enc}}(\cdot)$ and language encoder $\mathcal{L}_{\mathrm{enc}}(\cdot)$, respectively:

(2) $\boldsymbol{z}_{i}^{I}=\mathcal{V}_{\mathrm{enc}}\left(I_{i}\right),\quad\boldsymbol{z}_{i}^{T}=\mathcal{L}_{\mathrm{enc}}\left(T_{i}\right),$

where the vectors $\boldsymbol{z}_{i}^{I}$ and $\boldsymbol{z}_{i}^{T}$ are both normalized. The category prototype contrastive loss function $\mathcal{L}_{\mathrm{PC}}$ can be defined as:

(3) $\mathcal{L}_{\mathrm{PC}}=-\log\bigg[\frac{\exp\left(\boldsymbol{z}_{i}^{I}\cdot\boldsymbol{c}^{i}/\tau\right)}{\sum_{j=1}^{K}\exp\left(\boldsymbol{z}_{i}^{I}\cdot\boldsymbol{c}^{j}/\tau\right)}\bigg]-\log\bigg[\frac{\exp\left(\boldsymbol{z}_{i}^{T}\cdot\boldsymbol{c}^{i}/\tau\right)}{\sum_{j=1}^{K}\exp\left(\boldsymbol{z}_{i}^{T}\cdot\boldsymbol{c}^{j}/\tau\right)}\bigg],$

where the parameter $\tau$, initialized to 0.07 (Khosla et al., 2020), is a hyper-parameter. The image-text matching process with the category prototype contrastive loss encourages each class’s image and text features to align with its corresponding uniformly distributed category prototype on the hypersphere. This alignment prevents the long-tailed distribution from breaking the uniformity of the feature space, thereby improving the model’s representation quality on the long-tailed data. It encourages image and text features to move towards their corresponding category prototype while avoiding proximity to other category prototypes.
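As a concrete illustration, a sketch of Eq. (3) in PyTorch might look as follows; the tensor names and batching convention are our assumptions rather than the authors' code.

```python
import torch
import torch.nn.functional as F

def prototype_contrastive_loss(z_img, z_txt, prototypes, labels, tau=0.07):
    """
    z_img:      (N, d) L2-normalized image features z_i^I
    z_txt:      (N, d) L2-normalized text features  z_i^T
    prototypes: (K, d) L2-normalized category prototypes c^k
    labels:     (N,)   category index of each image-text pair
    """
    logits_img = z_img @ prototypes.t() / tau   # (N, K) similarities z_i^I . c^j / tau
    logits_txt = z_txt @ prototypes.t() / tau   # (N, K) similarities z_i^T . c^j / tau
    # each modality is pulled toward its own category prototype and pushed away from the others
    return F.cross_entropy(logits_img, labels) + F.cross_entropy(logits_txt, labels)
```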

Moreover, the adaptation of category prototypes to the semantic and stylistic characteristics of the given dataset is crucial for enhancing the performance of the model. Thus, we propose a unique dynamic updating scheme for category feature prototypes that enables them to adjust to the imbalanced distribution of data in both the head and tail of the dataset. This method prevents the significant displacement of tail category feature prototypes, which would disrupt the uniformity of the prototypes. We implement an exponential moving average (EMA) algorithm (Li et al., 2019) with class label frequency to update the category feature prototypes dynamically:

(4) $\boldsymbol{c}^{k}\leftarrow m\boldsymbol{c}^{k}+(1-m)\big[\boldsymbol{z}_{i}^{T}+\pi_{k}\cdot\boldsymbol{z}_{i}^{I}\big]/(\pi_{k}+1),\quad\forall i\in\left\{i\mid\hat{y}_{i}=k\right\},$

where $\pi$ denotes the sample frequency vector ($\pi_{k}=N_{k}/N$) and $m$ represents the momentum coefficient, which is typically set to 0.999. This algorithm ensures that the category feature prototypes remain stable and accurate, regardless of the variations in data distribution among different classes. The usage of class label frequencies allows for the prototypes to be updated more slowly with regard to the image information of the less frequently occurring classes, which helps prevent significant prototype deviations caused by scarce tail class image information.
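A minimal sketch of this frequency-aware EMA update is given below; averaging the update over the batch samples of each class and re-normalizing the prototype onto the hypersphere are our assumptions, not details stated in the paper.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def update_prototypes(prototypes, z_img, z_txt, labels, class_freq, m=0.999):
    """
    prototypes: (K, d) current category prototypes c^k
    class_freq: (K,)   sample frequencies pi_k = N_k / N
    """
    for k in labels.unique():
        idx = labels == k
        pi_k = class_freq[k]
        # tail classes (small pi_k) let image features move the prototype more slowly
        new = (z_txt[idx] + pi_k * z_img[idx]) / (pi_k + 1.0)
        prototypes[k] = m * prototypes[k] + (1.0 - m) * new.mean(dim=0)  # assumed batch average
        prototypes[k] = F.normalize(prototypes[k], dim=-1)               # keep c^k on the hypersphere
    return prototypes
```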

Figure 3. Comparison of text-image correlation coefficients (left) before and (right) after the application of the extraneous text filtering and attribute enhancement module for the tiger category in ImageNet-LT. It illustrates the text-image correlation coefficients for six paragraphs of text and a specific image of the corresponding category. The adjustment of weights for generic topics like ”morphology,” ”source,” and ”root” highlights more relevant and specific attributes.
Figure 4. Comparison of learned classifier (left) and category prototype (right) accuracies with respect to the amount of class-specific training data available. Validation accuracy from a model trained on the ImageNet-LT.

The total loss function for image-text matching. During the image-text matching training phase, $\boldsymbol{z}_{i}^{I}$ and $\boldsymbol{z}_{j}^{T}$ are both $d$-dimensional normalized vectors in the joint multimodal space. The overall loss function $\mathcal{L}_{\mathrm{Total}}$ is defined as:

(5) $\mathcal{L}_{\mathrm{Total}}=\mathcal{L}_{\mathrm{ccl}}+\lambda\mathcal{L}_{\mathrm{PC}},\qquad\mathcal{L}_{\mathrm{ccl}}=-\frac{1}{\left|\mathcal{T}_{i}^{+}\right|}\sum_{T_{j}\in\mathcal{T}_{i}^{+}}\log\frac{\exp\left(\boldsymbol{z}_{i}^{I}\cdot\boldsymbol{z}_{j}^{T}/\tau\right)}{\sum_{T_{k}\in\mathcal{T}}\exp\left(\boldsymbol{z}_{i}^{I}\cdot\boldsymbol{z}_{k}^{T}/\tau\right)}-\frac{1}{\left|\mathcal{I}_{i}^{+}\right|}\sum_{I_{j}\in\mathcal{I}_{i}^{+}}\log\frac{\exp\left(\boldsymbol{z}_{i}^{T}\cdot\boldsymbol{z}_{j}^{I}/\tau\right)}{\sum_{I_{k}\in\mathcal{I}}\exp\left(\boldsymbol{z}_{i}^{T}\cdot\boldsymbol{z}_{k}^{I}/\tau\right)},$

where $\mathcal{L}_{\mathrm{ccl}}$ and $\mathcal{L}_{\mathrm{PC}}$ represent the conventional category-level contrastive learning loss function and our category feature prototype loss function, respectively. $\lambda$ is the weight of the category feature prototype loss; for a detailed analysis of $\lambda$, please refer to Appendix D. $\mathcal{T}_{i}^{+}$ represents the subset of $\mathcal{T}$ in which each text shares the same category as the image $I_{i}$. Similarly, all images in $\mathcal{I}_{i}^{+}$ share the same category as the text $T_{i}$.
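A sketch of the full matching objective of Eq. (5) is shown below; it reuses the `prototype_contrastive_loss` function sketched earlier, and the function names and batch-mean reduction are our assumptions.

```python
import torch
import torch.nn.functional as F

def class_level_contrastive_loss(z_img, z_txt, labels, tau=0.07):
    """Category-level contrastive term L_ccl: positives are all pairs sharing a label."""
    sim_it = z_img @ z_txt.t() / tau                      # (N, N) image-to-text similarities
    sim_ti = sim_it.t()                                   # text-to-image similarities
    pos = (labels[:, None] == labels[None, :]).float()    # membership masks for T_i^+ / I_i^+
    log_p_it = sim_it - torch.logsumexp(sim_it, dim=1, keepdim=True)
    log_p_ti = sim_ti - torch.logsumexp(sim_ti, dim=1, keepdim=True)
    loss_it = -(pos * log_p_it).sum(1) / pos.sum(1)
    loss_ti = -(pos * log_p_ti).sum(1) / pos.sum(1)
    return (loss_it + loss_ti).mean()

def total_matching_loss(z_img, z_txt, prototypes, labels, lam=0.5, tau=0.07):
    # L_Total = L_ccl + lambda * L_PC (Eq. 5); lam = 0.5 follows the ablation in Appendix D
    return class_level_contrastive_loss(z_img, z_txt, labels, tau) \
        + lam * prototype_contrastive_loss(z_img, z_txt, prototypes, labels, tau)
```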

Table 1. Results on ImageNet-LT in terms of accuracy (%). \dagger indicates the corresponding backbone is initialized with CLIP (Radford et al., 2021) weights. * represents the results of the reproduction of the method.
      Backbone       Method       Accuracy (%)
      Many       Med.       Few       All
      ResNext-50       SSD (Li et al., 2021)       66.8       53.1       35.4       56.0
      ResNext-50       TADE (Zhang et al., 2021a)       66.5       57.0       43.5       58.8
      ResNext-50       RIDE (4 Experts) (Wang et al., 2021a)       68.2       53.8       36.0       56.8
      ResNext-50       DisAlign (Zhang et al., 2021c)       62.7       52.1       31.4       53.4
      ResNext-101       PaCo (Cui et al., 2021)       68.2       58.7       41.0       60.0
      ResNext-101       Res-LT (Cui et al., 2022)       63.3       53.3       40.3       55.1
      ResNext-152       $\tau$-norm (Kang et al., 2019)       62.2       50.1       35.8       52.8
      ResNext-152       NCM (Kang et al., 2019)       60.3       49.0       33.6       51.3
      ResNext-152       cRT (Kang et al., 2019)       64.7       49.1       29.4       52.4
      ResNext-152       LWS (Kang et al., 2019)       63.5       50.4       34.2       53.3
      ResNet-50\dagger       $\tau$-norm (Kang et al., 2019)       60.9       48.4       33.8       51.2
      ResNet-50\dagger       NCM (Kang et al., 2019)       58.9       46.6       31.1       49.2
      ResNet-50\dagger       cRT (Kang et al., 2019)       63.3       47.2       27.8       50.8
      ResNet-50\dagger       LWS (Kang et al., 2019)       62.2       48.6       31.8       51.5
      ResNet-50\dagger       vMF (Wang et al., 2022)       70.0       58.4       40.9       60.5
      ResNet-50\dagger       Zero-Shot CLIP (Radford et al., 2021)       60.8       59.3       58.6       59.8
      ResNet-50\dagger       Baseline       74.4       56.9       34.5       60.5
      ResNet-50\dagger       BALLAD (Ma et al., 2021)       71.0       66.3       59.5       67.2
      ResNet-50\dagger       VL-LTR (Tian et al., 2022)       77.8       67.0       50.8       70.1
      ResNet-50\dagger       VL-LTR* (Tian et al., 2022)       75.7       68.6       59.5       70.1
      ResNet-50\dagger       Ours       76.4       69.5       60.2       70.9

3.2. Filtering Extraneous Text and Enhancing Attributes

Owing to the noise present in internet text data, the extraction of core attribute semantics can be disrupted, leading to significant disparities between text and visual feature expressions. Consequently, we propose a text noise filtering module to augment our framework’s robustness against noisy text data. Certain generic topics found in text descriptions, such as “morphology,” “source,” and “root,” may be irrelevant to classification scenarios and could impact the modal alignment process within the feature space. To enhance and improve text feature representations, we introduce an extraneous text filtering and attribute enhancement module. Specifically, while constructing image-text pairs, we randomly sample $M$ description segments from the text corpus for an image of class $k$ and filter the candidate text using the category prototype feature. We reconstruct the text features by calculating the similarity score between the category prototype feature $\boldsymbol{c}^{k}$ and each text feature $\boldsymbol{z}^{T,i}$:

(6) $\boldsymbol{z}^{T}=\sum_{i=1}^{M}\frac{e^{\boldsymbol{z}^{T,i}\cdot\boldsymbol{c}^{k}}}{\sum_{j=1}^{M}e^{\boldsymbol{z}^{T,j}\cdot\boldsymbol{c}^{k}}}\cdot\boldsymbol{z}^{T,i}.$

The reconstruction of the text feature representation via similarity score computation enables the model to focus on more critical information, thereby avoiding the learning of fixed and trivial correlations in specific image-text pairs. As illustrated in Fig. 3, for the tiger category (a tail class) of ImageNet-LT, the example demonstrates how reconstructing noisy text through similarity-based filtering enables the model to concentrate on the most vital information. Greater emphasis is placed on attribute details pertaining to taxonomy, ensuring that the model does not internalize a spurious correlation between the text and tiger images.
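The re-weighting of Eq. (6) amounts to a softmax over prototype-text similarities; a minimal sketch for one image's $M$ candidate segments is given below (names are assumptions).

```python
import torch

def filter_and_enhance_text(text_feats, prototype):
    """
    text_feats: (M, d) features of the M sampled description segments z^{T,i}
    prototype:  (d,)   category prototype c^k of the image's class
    returns:    (d,)   reconstructed text feature z^T of Eq. (6)
    """
    scores = text_feats @ prototype          # similarity of each segment to c^k
    weights = torch.softmax(scores, dim=0)   # segments on generic topics receive low weight
    return (weights[:, None] * text_feats).sum(dim=0)
```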

Table 2. Results on Places-LT in terms of accuracy (%). \dagger indicates the corresponding backbone is initialized with CLIP (Radford et al., 2021) weights. * represents the results of the reproduction of the method.
        Backbone         Method         Accuracy (%)
        Many         Med.         Few         All
        ResNext-152         TADE (Zhang et al., 2021a)         40.4         43.2         36.8         40.9
        ResNext-152         PaCo (Cui et al., 2021)         36.1         47.9         35.3         41.2
        ResNext-152         Res-LT (Cui et al., 2022)         39.8         43.6         31.4         39.8
        ResNext-152         $\tau$-norm (Kang et al., 2019)         37.8         40.7         31.8         37.9
        ResNext-152         NCM (Kang et al., 2019)         40.4         37.1         27.3         36.4
        ResNext-152         cRT (Kang et al., 2019)         42.0         37.6         24.9         36.7
        ResNext-152         LWS (Kang et al., 2019)         40.6         39.1         28.6         37.6
        ResNet-50\dagger         $\tau$-norm (Kang et al., 2019)         34.5         31.4         23.6         31.0
        ResNet-50\dagger         NCM (Kang et al., 2019)         37.1         30.6         19.9         30.8
        ResNet-50\dagger         cRT (Kang et al., 2019)         38.5         29.7         17.6         30.5
        ResNet-50\dagger         LWS (Kang et al., 2019)         36.0         32.1         20.7         31.3
        ResNet-50\dagger         Zero-Shot CLIP (Radford et al., 2021)         37.5         37.5         40.1         38.0
        ResNet-50\dagger         Baseline         50.8         38.6         22.7         39.7
        ResNet-50\dagger         BALLAD         46.7         48.0         42.7         46.5
        ResNet-50\dagger         VL-LTR (Tian et al., 2022)         51.9         47.2         38.4         48.0
        ResNet-50\dagger         VL-LTR* (Tian et al., 2022)         50.0         48.3         44.2         48.0
        ResNet-50\dagger         Ours         50.8         48.8         44.6         48.7

3.3. Enhancing Image Recognition through Category Prototype-Guided Recognition

In the image recognition stage, taking the linear classifier as an example, the classifier weights are defined as $\omega_{i},\ i=0,\ldots,K$, and the probability of a sample belonging to the $i$-th class is given by:

(7) $P^{l}_{i}=\frac{\exp(\omega_{i}\cdot\boldsymbol{z}^{I}+b_{i})}{\sum_{j=0}^{K}\exp(\omega_{j}\cdot\boldsymbol{z}^{I}+b_{j})}.$

However, due to the long-tailed data distribution, as shown in Fig. 4 (left), the classification accuracy of the linear classifier (with the number of samples per class arranged in descending order on the x-axis) is significantly higher for the head classes than for the tail classes, known as the positive bias of the classifier towards the head classes (Wang et al., 2022).

The classifier based on class feature prototypes uses the distance between the sample features and category prototypes to classify the samples. The formula for classifying samples can be expressed as follows:

(8) $P_{i}^{p}=\frac{\exp\left(\boldsymbol{z}^{I}\cdot\boldsymbol{c}^{i}\right)}{\sum_{j=1}^{K}\exp\left(\boldsymbol{z}^{I}\cdot\boldsymbol{c}^{j}\right)}.$

Fig. 4 (right) shows the classification results based on the category prototypes (with the number of samples per class on the x-axis). Compared to Fig. 4 (left), the performance difference between classes is smaller and the positive bias towards the head classes is reduced. Inspired by this phenomenon, we propose a classifier structure guided by the class feature prototypes. Specifically, we combine the category feature prototypes $\boldsymbol{c}^{k}$ pre-trained on the long-tailed dataset with the learnable classifier. The probability of a sample belonging to class $i$ is calculated as follows:

(9) $P_{i}=\alpha\cdot P_{i}^{p}+(1-\alpha)\cdot P_{i}^{l},$

where $P_{i}^{p}$ is the output of the prototype classifier and $P_{i}^{l}$ is the output of the learnable classifier. Our designed classifier thus balances the head-class strength of the learnable classifier with the tail-class compensation provided by the prototype classifier. We investigate the impact of $\alpha$ in Appendix E.
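A sketch of such a combined classifier head is shown below; the module layout and buffer handling are our assumptions, and alpha = 0.8 follows the ablation in Appendix E.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrototypeGuidedClassifier(nn.Module):
    """Combines a learnable linear classifier (Eq. 7) with a prototype classifier (Eq. 8) as in Eq. (9)."""

    def __init__(self, prototypes, alpha=0.8):
        super().__init__()
        num_classes, dim = prototypes.shape
        self.linear = nn.Linear(dim, num_classes)        # learnable classifier, prone to head-class bias
        self.register_buffer("prototypes", prototypes)   # frozen category prototypes c^k
        self.alpha = alpha

    def forward(self, z_img):
        p_linear = F.softmax(self.linear(z_img), dim=-1)              # P^l
        p_proto = F.softmax(z_img @ self.prototypes.t(), dim=-1)      # P^p, similarity to prototypes
        return self.alpha * p_proto + (1.0 - self.alpha) * p_linear   # Eq. (9)
```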

4. Experiments

In this section, a series of experiments were conducted to validate the effectiveness of the proposed method in image classification tasks, including an analysis of experiments and ablation studies.

4.1. Datasets and setup

Three long-tailed recognition benchmark datasets were used in this study, including ImageNet-LT (Liu et al., 2019), Places-LT (Zhou et al., 2017), and iNaturalist 2018 (Van Horn et al., 2018). More details about the datasets are given in Appendix A.

4.2. Experimental Results on ImageNet-LT

Experimental Setup. To validate the effectiveness of our proposed method, we conducted extensive experiments on ImageNet-LT (Liu et al., 2019). Following the settings of previous work (Tang et al., 2020; Hong et al., 2021), we employed ResNet-50 (He et al., 2016) as the visual encoder and a 12-layer Transformer (Radford et al., 2019) as the language encoder. More details are given in Appendix A.

To make a fair comparison, we also built a baseline method that only used visual modality, which had the same settings as our proposed method except that the baseline model was directly initialized with CLIP pre-trained weights and fine-tuned for 100 epochs. In addition, we also reproduced and reported the performance of some representative methods such as $\tau$-normalized, cRT, NCM, and LWS (Kang et al., 2019), which were all initialized with CLIP pre-trained weights.

Performance comparison on the ImageNet-LT validation set. This section presents the performance comparison of our proposed method on the ImageNet-LT dataset. To ensure a fair comparison, we used various methods from the deep long-tailed learning survey (Zhang et al., 2021b). Tab. 1 shows the results of our model, which outperforms traditional visual methods that use similar visual encoders (i.e., backbone networks). For instance, when using ResNet-50 (He et al., 2016) as the backbone network, our method achieved an overall accuracy of 70.9%, which is 10.4 percentage points higher than the baseline method (70.9% vs. 60.5%). Furthermore, our method outperformed the previously best-performing methods VL-LTR (Tian et al., 2022) and BALLAD (Ma et al., 2021) by 0.8 percentage points (70.9% vs. 70.1%) and 3.7 percentage points (70.9% vs. 67.2%), respectively. In addition, from the perspective of tail category accuracy, our method also performed well, achieving a 0.7 percentage point improvement over the second-best method (Tian et al., 2022) (60.2% vs. 59.5%).

4.3. Results on Places-LT Dataset

Performance comparison on the Places-LT validation set. Tab. 2 presents results on the Places-LT dataset using ResNet-50. Our model achieved an overall accuracy of 48.7%, surpassing other state-of-the-art methods by at least 7.5 percentage points (48.7% compared to 41.2%), including PaCo (Cui et al., 2021), TADE (Zhang et al., 2021a), and ResLT (Cui et al., 2022), which all use ResNet-152 (He et al., 2016) as the backbone network. Compared to other CLIP-based methods such as BALLAD (Ma et al., 2021) and VL-LTR (Tian et al., 2022), our proposed method still exhibited competitive performance (48.7% compared to 46.5% and 48.0%, respectively).

Table 3. Results on iNaturalist 2018 in terms of accuracy (%). \dagger indicates the backbone is initialized with CLIP (Radford et al., 2021) weights.
Backbone Method Accuracy (%)
ResNext-50 SSD (Li et al., 2021) 71.5
ResNext-50 TADE (Zhang et al., 2021a) 72.9
ResNext-50 RIDE (4 Experts) (Wang et al., 2021a) 72.6
ResNext-50 PaCo (Cui et al., 2021) 73.2
ResNext-50 Res-LT (Cui et al., 2022) 72.3
ResNext-50 $\tau$-norm (Kang et al., 2019) 69.3
ResNext-50 NCM (Kang et al., 2019) 63.1
ResNext-50 cRT (Kang et al., 2019) 67.6
ResNext-50 LWS (Kang et al., 2019) 69.5
ResNet-50\dagger $\tau$-norm (Kang et al., 2019) 71.2
ResNet-50\dagger NCM (Kang et al., 2019) 65.3
ResNet-50\dagger cRT (Kang et al., 2019) 69.9
ResNet-50\dagger LWS (Kang et al., 2019) 71.0
ResNet-50\dagger Zero-Shot CLIP (Radford et al., 2021) 3.4
ResNet-50\dagger Baseline 72.6
ResNet-50\dagger VL-LTR (Tian et al., 2022) 74.6
ResNet-50\dagger Ours 75.2

4.4. iNaturalist 2018 Experiment Results

Performance Comparison on the iNaturalist 2018 Validation Set. Tab. 3 shows the top-1 accuracy of different methods on the iNaturalist 2018 dataset. We found that when using ResNet-50 (He et al., 2016) as the backbone network, our model achieved an overall accuracy of 75.2%, surpassing previous methods with the same backbone network by at least 0.6 percentage points.

4.5. Ablation studies

We conducted ablation studies on ImageNet-LT to further clarify the choice of hyperparameters in our method and the effect of each component. All experiments were conducted under the same settings as Section 4.2 for a fair comparison. In addition, we provide a t-SNE visualization of the feature maps produced by our method in Appendix F.

Impacts of category-feature prototype initialization strategies. We studied the effectiveness of the “anchor text” selection strategy by replacing each category’s “a photo of a {label}” text form with a “cut-off” strategy. In this strategy, we simply select the first six sentences from the text description as the anchor text input to the pre-trained CLIP model to generate category features and then cluster them to form category feature prototypes. As shown in the first and second rows of Tab. 4, the model using the “a photo of a {label}” text form achieved 1.4% higher overall accuracy (70.9% vs. 69.5%) than the model using the “cut-off” text form. This demonstrates the effectiveness of the “a photo of a {label}” text form in initializing category feature prototypes.

Effectiveness of the extraneous text filter module. The effectiveness of the irrelevant text filtering module was verified by comparing against a model that did not use the module. As shown in Tab. 4 rows 1 and 3, the model without the irrelevant text filtering module suffered a 1.1% decrease in overall accuracy compared to the model with the module (70.9% vs. 69.8%). This demonstrates that the irrelevant text filtering module can effectively reconstruct the text information and improve the model’s attention to important attributes in the text.

Effectiveness of the category prototype contrastive loss function. We also tested the effectiveness of the contrastive loss function for the category feature prototype by removing it during the text-image feature alignment stage and using a regular category-level contrastive loss function instead. As shown in Tab. 5 rows 1 and 2, the category feature prototype contrastive loss function significantly improved the model’s performance, by approximately 3.8% (70.9% vs. 67.1%). This indicates that the category feature prototype contrastive loss function can effectively ensure the uniformity of the model’s feature space and prevent the degradation of feature representation quality due to the long-tailed data distribution.

Table 4. Ablation experiments on the effectiveness of Prototypes Initialization strategy and extraneous text filter module under ImageNet-LT.
Prototypes Initialization Text Filter Accuracy
label \checkmark 70.9
cut off \checkmark 69.5
label - 69.8
Table 5. Ablation experiments on the effectiveness of each component in our method under ImageNet-LT. ”Pro Cl” represents the category prototype-guided recognition classifier module in the image recognition stage.
Text-image Matching Image Recognition Accuracy
w/o $\mathcal{L}_{\mathrm{PC}}$ w/ $\mathcal{L}_{\mathrm{PC}}$ w/o Pro Cl w/ Pro Cl
- \checkmark - \checkmark 70.9
\checkmark - - \checkmark 67.1
- \checkmark \checkmark - 65.9

Effectiveness of the category feature prototype-guided classifier design module. Finally, we tested the effectiveness of the category feature prototype-guided classifier by comparing it with a standalone category feature prototype classifier and a linear classifier. As shown in Tab. 5 rows 1 and 3, the proposed category feature prototype-guided classifier improved the overall accuracy by 5.0% compared to the standalone linear classifier (70.9% vs. 65.9%). These results demonstrate that the category feature prototype-guided classifier design module can effectively prevent positive bias in the classifier and highlight the effectiveness of the module.

5. Conclusion

In this paper, we propose a novel visual-linguistic representation framework based on the category prototype guidance to alleviate the long-tail problem. In the text-image modality matching training stage, we design a category prototype contrastive loss function to align text and image features onto category feature prototypes uniformly distributed in feature space. Additionally, our proposed irrelevant text filtering and attribute enhancement module allows the model to ignore the irrelevant noisy text and focus more on key attribute information, thereby enhancing the robustness of our framework. In the image recognition fine-tuning stage, we design a classifier structure based on the category feature prototype, which compensates for the performance of tail classes while maintaining the performance of head classes. Our proposed method achieves state-of-the-art performance on long-tailed distributed datasets, including ImageNet-LT, Places-LT, and iNaturalist 2018.

Acknowledgements

We want to extend our heartfelt thanks to everyone who has provided their suggestions regarding our manuscript. Their insightful comments and constructive feedback have significantly contributed to the improvement of this paper.

References

  • Cao et al. (2019) Kaidi Cao, Colin Wei, Adrien Gaidon, Nikos Arechiga, and Tengyu Ma. 2019. Learning imbalanced datasets with label-distribution-aware margin loss. Advances in Neural Information Processing Systems 32 (2019), 1567–1578.
  • Chu et al. (2020) Peng Chu, Xiao Bian, Shaopeng Liu, and Haibin Ling. 2020. Feature space augmentation for long-tailed data. In European Conference on Computer Vision. Springer, 694–710.
  • Cui et al. (2022) Jiequan Cui, Shu Liu, Zhuotao Tian, Zhisheng Zhong, and Jiaya Jia. 2022. Reslt: Residual learning for long-tailed recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (2022), 3695–3706.
  • Cui et al. (2021) Jiequan Cui, Zhisheng Zhong, Shu Liu, Bei Yu, and Jiaya Jia. 2021. Parametric Contrastive Learning. arXiv:2107.12028 [cs.CV]
  • Elkan (2001) Charles Elkan. 2001. The foundations of cost-sensitive learning. In International joint conference on artificial intelligence, Vol. 17. Lawrence Erlbaum Associates Ltd, 973–978.
  • Fu et al. (2022) Siming Fu, Huanpeng Chu, Xiaoxuan He, Hualiang Wang, Zhenyu Yang, and Haoji Hu. 2022. Meta-Prototype Decoupled Training for Long-tailed Learning. In Proceedings of the Asian Conference on Computer Vision (ACCV). 569–585.
  • Goh and Sim (2010) Joel Goh and Melvyn Sim. 2010. Distributionally robust optimization and its tractable approximations. Operations research 58, 4-part-1 (2010), 902–917.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 770–778.
  • Hong et al. (2021) Youngkyu Hong, Seungju Han, Kwanghee Choi, Seokjun Seo, Beomsu Kim, and Buru Chang. 2021. Disentangling label distribution for long-tailed visual recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 6626–6636.
  • Jia et al. (2021) Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. 2021. Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning. PMLR, 4904–4916.
  • Kang et al. (2019) B. Kang, S. Xie, M. Rohrbach, Z. Yan, A. Gordo, J. Feng, and Y. Kalantidis. 2019. Decoupling Representation and Classifier for Long-Tailed Recognition. arXiv:1910.09217 (2019).
  • Khosla et al. (2020) Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. 2020. Supervised contrastive learning. Advances in Neural Information Processing Systems 33 (2020), 18661–18673.
  • Li et al. (2022b) Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. 2022b. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning. PMLR, 12888–12900.
  • Li et al. (2022a) Tianhong Li, Peng Cao, Yuan Yuan, Lijie Fan, Yuzhe Yang, Rogerio S Feris, Piotr Indyk, and Dina Katabi. 2022a. Targeted supervised contrastive learning for long-tailed recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 6918–6928.
  • Li et al. (2021) Tianhao Li, Limin Wang, and Gangshan Wu. 2021. Self supervision to distillation for long-tailed visual recognition. In The IEEE International Conference on Computer Vision. 630–639.
  • Li et al. (2019) Xia Li, Zhisheng Zhong, Jianlong Wu, Yibo Yang, Zhouchen Lin, and Hong Liu. 2019. Expectation-Maximization Attention Networks for Semantic Segmentation. In The IEEE International Conference on Computer Vision. 9167–9176.
  • Li et al. (2023) Yi Li, Hualiang Wang, Yiqun Duan, and Xiaomeng Li. 2023. Clip surgery for better explainability with enhancement in open-vocabulary tasks. arXiv preprint arXiv:2304.05653 (2023).
  • Li et al. (2022c) Yi Li, Hualiang Wang, Yiqun Duan, Hang Xu, and Xiaomeng Li. 2022c. Exploring Visual Interpretability for Contrastive Language-Image Pre-training. arXiv preprint arXiv:2209.07046 (2022).
  • Liu et al. (2019) Ziwei Liu, Zhongqi Miao, Xiaohang Zhan, Jiayun Wang, Boqing Gong, and Stella X. Yu. 2019. Large-Scale Long-Tailed Recognition in an Open World. In Proceedings of the IEEE conference on computer vision and pattern recognition. 2537–2546.
  • Loshchilov and Hutter (2016) Ilya Loshchilov and Frank Hutter. 2016. Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983 (2016).
  • Ma et al. (2021) Teli Ma, Shijie Geng, Mengmeng Wang, Jing Shao, Jiasen Lu, Hongsheng Li, Peng Gao, and Yu Qiao. 2021. A simple long-tailed recognition baseline via vision-language model. arXiv preprint arXiv:2111.14745 (2021).
  • Mahajan et al. (2018) Dhruv Mahajan, Ross Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, and Laurens Van Der Maaten. 2018. Exploring the limits of weakly supervised pretraining. In European Conference on Computer Vision. 181–196.
  • Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems 26 (2013), 3111–3119.
  • Radford et al. (2021) Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning. PMLR, 8748–8763.
  • Radford et al. (2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog 1, 8 (2019), 9.
  • Samuel and Chechik (2021) Dvir Samuel and Gal Chechik. 2021. Distributional robustness loss for long-tail learning. In The IEEE International Conference on Computer Vision. 9495–9504.
  • Schuhmann et al. (2022) Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. 2022. Laion-5b: An open large-scale dataset for training next generation image-text models. arXiv preprint arXiv:2210.08402 (2022).
  • Schuhmann et al. (2021) Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. 2021. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114 (2021).
  • Sun et al. (2007) Yanmin Sun, Mohamed S Kamel, Andrew KC Wong, and Yang Wang. 2007. Cost-sensitive boosting for classification of imbalanced data. Pattern Recognition 40, 12 (2007), 3358–3378.
  • Tang et al. (2020) Kaihua Tang, Jianqiang Huang, and Hanwang Zhang. 2020. Long-tailed classification by keeping the good and removing the bad momentum causal effect. Advances in Neural Information Processing Systems 33 (2020), 1513–1524.
  • Tian et al. (2022) Changyao Tian, Wenhai Wang, Xizhou Zhu, Jifeng Dai, and Yu Qiao. 2022. Vl-ltr: Learning class-wise visual-linguistic representation for long-tailed visual recognition. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXV. Springer, 73–91.
  • Touvron et al. (2021) Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. 2021. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning. PMLR, 10347–10357.
  • Van Horn et al. (2018) Grant Van Horn, Oisin Mac Aodha, Yang Song, Yin Cui, Chen Sun, Alex Shepard, Hartwig Adam, Pietro Perona, and Serge Belongie. 2018. The inaturalist species classification and detection dataset. In Proceedings of the IEEE conference on computer vision and pattern recognition. 8769–8778.
  • Wang et al. (2022) Hualiang Wang, Siming Fu, Xiaoxuan He, Hangxiang Fang, Zuozhu Liu, and Haoji Hu. 2022. Towards Calibrated Hyper-Sphere Representation via Distribution Overlap Coefficient for Long-Tailed Learning. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXIV. Springer, 179–196.
  • Wang et al. (2021b) Jianfeng Wang, Thomas Lukasiewicz, Xiaolin Hu, Jianfei Cai, and Zhenghua Xu. 2021b. Rsg: A simple but effective module for learning imbalanced datasets. In Proceedings of the IEEE conference on computer vision and pattern recognition. 3784–3793.
  • Wang et al. (2021d) Jiaqi Wang, Wenwei Zhang, Yuhang Zang, Yuhang Cao, Jiangmiao Pang, Tao Gong, Kai Chen, Ziwei Liu, Chen Change Loy, and Dahua Lin. 2021d. Seesaw loss for long-tailed instance segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition. 9695–9704.
  • Wang and Isola (2020) Tongzhou Wang and Phillip Isola. 2020. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In International Conference on Machine Learning. PMLR, 9929–9939.
  • Wang et al. (2021a) Xudong Wang, Long Lian, Zhongqi Miao, Ziwei Liu, and Stella Yu. 2021a. Long-tailed Recognition by Routing Diverse Distribution-Aware Experts. In International Conference on Learning Representations.
  • Wang et al. (2020) Xudong Wang, Long Lian, Zhongqi Miao, Ziwei Liu, and Stella X Yu. 2020. Long-tailed recognition by routing diverse distribution-aware experts. arXiv preprint arXiv:2010.01809 (2020).
  • Wang et al. (2017) Yu-Xiong Wang, Deva Ramanan, and Martial Hebert. 2017. Learning to model the tail. Advances in neural information processing systems 30 (2017), 7032–7042.
  • Wang et al. (2021c) Zirui Wang, Jiahui Yu, Adams Wei Yu, Zihang Dai, Yulia Tsvetkov, and Yuan Cao. 2021c. Simvlm: Simple visual language model pretraining with weak supervision. arXiv preprint arXiv:2108.10904 (2021).
  • Yao et al. (2021) Lewei Yao, Runhui Huang, Lu Hou, Guansong Lu, Minzhe Niu, Hang Xu, Xiaodan Liang, Zhenguo Li, Xin Jiang, and Chunjing Xu. 2021. FILIP: fine-grained interactive language-image pre-training. arXiv preprint arXiv:2111.07783 (2021).
  • Zhang et al. (2021c) Songyang Zhang, Zeming Li, Shipeng Yan, Xuming He, and Jian Sun. 2021c. Distribution alignment: A unified framework for long-tail visual recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 2361–2370.
  • Zhang et al. (2021a) Yifan Zhang, Bryan Hooi, Lanqing Hong, and Jiashi Feng. 2021a. Test-Agnostic Long-Tailed Recognition by Test-Time Aggregating Diverse Experts with Self-Supervision. arXiv preprint arXiv:2107.09249 (2021).
  • Zhang et al. (2021b) Yifan Zhang, Bingyi Kang, Bryan Hooi, Shuicheng Yan, and Jiashi Feng. 2021b. Deep long-tailed learning: A survey. arXiv preprint arXiv:2110.04596 (2021).
  • Zhong et al. (2021) Zhisheng Zhong, Jiequan Cui, Shu Liu, and Jiaya Jia. 2021. Improving calibration for long-tailed recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 16489–16498.
  • Zhou et al. (2017) Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. 2017. Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 40, 6 (2017), 1452–1464.
  • Zhou and Liu (2005) Zhi-Hua Zhou and Xu-Ying Liu. 2005. Training cost-sensitive neural networks with methods addressing the class imbalance problem. IEEE Transactions on knowledge and data engineering 18, 1 (2005), 63–77.

Appendix A Datasets and setup.

DATASETS. Three challenging long-tailed visual recognition benchmark datasets were used in this study, including ImageNet-LT (Liu et al., 2019), Places-LT (Zhou et al., 2017), and iNaturalist 2018 (Van Horn et al., 2018). ImageNet-LT, which consists of 1000 categories, was constructed by sampling a subset of ImageNet-2012 according to the Pareto distribution (power value $\alpha=6$). The training set contains 115.8K images, with the number of images per category ranging from 1280 to 5. The validation set and test set are balanced, containing 20K and 50K images, respectively. Similar to ImageNet-LT, Places-LT is a long-tailed version of the large-scale scene classification dataset Places, with 365 categories and 62.5K images, with the cardinality of categories ranging from 5 to 4980. iNaturalist 2018 is a real-world natural long-tailed dataset consisting of 8142 fine-grained species. The training set contains 437.5K images, with an imbalance factor of 500. The official validation set containing three images per category was used to test the proposed method. The validation set was used to select hyperparameters, and the test set was used to report numerical results.

Following the setting of VL-LTR (Tian et al., 2022), we also collected category-level textual descriptions for these three datasets. The textual descriptions were mainly obtained from Wikipedia, an open-source online encyclopedia containing millions of free articles. We first used the original class names as initial queries to obtain the best-matching entries on Wikipedia. After cleaning and filtering out some obviously irrelevant parts of these entries, such as “References” or “External Links,” we split the remaining parts into sentences to form the original text candidate set for each category. It should be noted that some categories had relatively few sentences, so we added 80 additional prompt sentences for each category to alleviate the problem of data scarcity. These prompt sentences were generated in the form of “a photo of a {label}” based on the prompt template provided in prior work (Radford et al., 2021).

Experimental Setup. To validate the effectiveness of our proposed method, we conducted extensive experiments on ImageNet-LT (Liu et al., 2019) using PyTorch. Following the settings of previous works (Tang et al., 2020) (Hong et al., 2021), we employed ResNet-50 (He et al., 2016) as the visual encoder and a 12-layer Transformer (Radford et al., 2019) as the language encoder.

All models were optimized using the AdamW optimizer with a momentum of 0.9 and weight decay of $5\times10^{-2}$. We used the same data augmentation method as previous works (Touvron et al., 2021). In the pre-training stage, the maximum length of text tokens was set to 77 (including [SOS] and [EOS] tokens), and we loaded the pre-trained weights of CLIP (Radford et al., 2021). The initial learning rate was set to $5\times10^{-5}$ and followed cosine annealing (Loshchilov and Hutter, 2016). The model was trained for 50 epochs in the first stage of cross-modal feature alignment with a batch size of 256. In the second-stage image recognition, we set the batch size to 128 and fine-tuned the model for another 50 epochs. We set the initial learning rate to $1\times10^{-3}$ and still used cosine annealing for decay. Unless otherwise specified, we used an input size of 224 × 224 and the square-root sampling strategy (Mahajan et al., 2018; Mikolov et al., 2013) in both stages.
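A minimal sketch of the first-stage optimizer and schedule described above is given below; `encoders` is a placeholder stand-in for the combined visual and language encoders, and the per-batch training step is elided.

```python
import torch
import torch.nn as nn

encoders = nn.Linear(1024, 1024)  # placeholder for the CLIP visual + language encoders

# AdamW with beta1 = 0.9 ("momentum of 0.9"), weight decay 5e-2, initial lr 5e-5
optimizer = torch.optim.AdamW(encoders.parameters(), lr=5e-5,
                              betas=(0.9, 0.999), weight_decay=5e-2)
# cosine annealing over the 50 first-stage epochs
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)

for epoch in range(50):
    # ... one epoch of image-text matching, minimizing L_Total over batches of 256 ...
    scheduler.step()
```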

We also evaluated our proposed method on the Places-LT dataset, which includes images from different domains (Zhou et al., 2017). The experimental setup for Places-LT is the same as for ImageNet-LT.

We further tested our approach on iNaturalist 2018. Following the conventional practice (Touvron et al., 2021), we trained our model for 100 epochs in the first stage and then 360 epochs for image classification in the second stage. The initial learning rates for the pre-training and fine-tuning stages were set to $5\times10^{-4}$ and $2\times10^{-5}$, respectively. The same training stages and settings were used for the baseline model as for our approach. All other experimental settings were the same as those for ImageNet-LT.

Appendix B The qualitative experiment analysis of the toy feature distribution.

To further support the qualitative analysis of the toy feature distribution shown in Fig. 1, in Tab. 6 we compare the alignment and neighborhood uniformity of the method before and after implementing the uniformly distributed category prototype-guided image-text matching strategy on ImageNet-LT. The method without this strategy is named the baseline (the same as the “Baseline” in Tab. 1). Similar to (Wang and Isola, 2020), we define alignment under our setting as the average distance between samples from the same class, where $F_{i}$ is the set of features from class $i$:

(10) $\mathbf{A}=\frac{1}{C}\sum_{i=1}^{C}\frac{1}{\left|F_{i}\right|^{2}}\sum_{v_{j},v_{k}\in F_{i}}\left\|v_{j}-v_{k}\right\|_{2}.$

We define neighborhood uniformity as the average distance to the top-$k$ closest class centers of each class:

(11) $\mathbf{U}_{k}=\frac{1}{Ck}\sum_{i=1}^{C}\min_{j_{1},\cdots,j_{k}}\left(\sum_{l=1}^{k}\left\|c_{i}-c_{j_{l}}\right\|_{2}\right),$

where $j_{1},\ldots,j_{k}$ represent different classes, and $c_{i}$ is the center of the samples from class $i$ on the hypersphere: $c_{i}=\frac{\sum_{v_{j}\in F_{i}}v_{j}}{\|\sum_{v_{j}\in F_{i}}v_{j}\|_{2}}$. The results underscore several advantageous properties of our method:

1.) Our method achieves superior neighborhood uniformity across all class splits and maintains better alignment than the baseline.

2.) The neighborhood uniformity (the average uniformity of the closest 10 classes) is 0.09 higher than that of the baseline. Furthermore, the baseline is even worse on tail classes, while our method maintains consistent neighborhood uniformity across all classes. This demonstrates the effectiveness of our method in keeping all classes separate from each other, thereby allowing for clearer decision boundaries between classes.

These results suggest that our method is more effective for handling the long-tailed distribution of the ImageNet-LT dataset. Our method may provide superior image-text matching performance under such circumstances, maintaining consistency and clear decision boundaries among all classes.
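For reference, the two metrics can be computed along the following lines; the function names and the use of normalized class centers are our assumptions.

```python
import torch
import torch.nn.functional as F

def alignment(feats, labels):
    """Eq. (10): average pairwise distance within each class, averaged over classes."""
    vals = []
    for c in labels.unique():
        f = feats[labels == c]                     # F_i: features of class i
        vals.append(torch.cdist(f, f).mean())      # (1/|F_i|^2) * sum of pairwise distances
    return torch.stack(vals).mean()

def neighborhood_uniformity(feats, labels, k=10):
    """Eq. (11): mean distance from each class center to its k closest class centers."""
    centers = torch.stack([F.normalize(feats[labels == c].sum(0), dim=-1)
                           for c in labels.unique()])   # class centers c_i on the hypersphere
    dists = torch.cdist(centers, centers)
    dists.fill_diagonal_(float("inf"))                  # exclude the class itself
    nearest, _ = dists.topk(k, dim=1, largest=False)    # the k closest other class centers
    return nearest.mean()
```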

Table 6. Implementing the uniformly distributed category prototype-guided image-text matching strategy achieves better alignment and neighborhood uniformity than the baseline on ImageNet-LT. The $k$ for neighborhood uniformity is set to 10. \uparrow indicates larger is better, whereas \downarrow indicates smaller is better.
Metric Methods ImageNet-LT
Many Med. Few All
$\mathrm{A}$ $\downarrow$ Baseline 0.64 0.63 0.67 0.64
Ours 0.59 0.61 0.62 0.60
$\mathrm{U}_{10}$ $\uparrow$ Baseline 0.95 0.93 0.92 0.94
Ours 1.04 1.03 1.01 1.03
Acc. $\uparrow$ Baseline 74.4 56.9 34.5 60.5
Ours 76.4 69.5 60.2 70.9
Table 7. Ablation experiments on the effectiveness of Prototypes Initialization strategy with the pre-trained clip weight under ImageNet-LT.
Prototypes Initialization weight Accuracy
label pre-trained CLIP 70.9
label randomly initialized 63.8

Appendix C Impact of the pre-trained CLIP weights for the initialization of uniformly distributed category prototypes.

To investigate the effect of leveraging CLIP pre-trained weights in the initialization of uniformly distributed category prototypes, we conduct an experiment in which we initialize the category prototypes with randomly initialized weights. A comparison between the two entries in Tab. 7 indicates that initializing the category prototypes with CLIP pre-trained weights confers a clear advantage. In addition, the large margin suggests that the CLIP pre-trained weights generate uniformly distributed category prototypes, whereas the randomly initialized weights generate biased category prototypes, leading to performance degradation.

Appendix D Ablation study of $\lambda$ in Eq. 5.

Moreover, in Eq. 5, we introduce a scaling hyperparameter $\lambda$, which governs the contribution of $\mathcal{L}_{\mathrm{PC}}$. Initially, $\lambda$ is set to 0.1; we then vary $\lambda$ within the range of 0.1 to 0.9 with a stride of 0.2 and perform ablation experiments with the resulting five values, as shown in Fig. 5. In general, a larger $\lambda$ reflects a higher level of confidence in adjusting the text-image matching process. The optimal $\lambda$ for ImageNet-LT is established at 0.5.

Appendix E Ablation study of $\alpha$ in Eq. 9.

Our method involves a crucial hyper-parameter, $\alpha$, in Eq. 9, which serves to balance the contributions of the two classification prediction terms $P_{i}^{p}$ and $P_{i}^{l}$. We evaluate the hyper-parameter $\alpha$ across the following values: $\alpha\in\{0.2,0.4,0.6,0.8,1.0\}$. We investigate the sensitivity of the model’s accuracy to these different $\alpha$ values. Fig. 5 demonstrates the influence of the trade-off parameter $\alpha$ on accuracy. The results indicate that optimally combining $P_{i}^{p}$ and $P_{i}^{l}$ with $\alpha$ set at 0.8 yields the best performance.

Appendix F t-SNE visualization and cross-modality similarity map analysis.

To further demonstrate the effectiveness of our method, we show the t-SNE visualization in Fig. 6. As shown in Fig. 6, the representation space of VL-LTR is dominated by the head category, leading to poor uniformity in learned category prototypes. Our approach improves performance by enhancing uniformity.

Figure 5. Ablation study on hyper-parameters.
Figure 6. t-SNE visualization comparison of (left) VL-LTR (Tian et al., 2022) and (right) ours on ImageNet-LT. We randomly extract fifty features from six categories in the test set.