
Supervised Contrastive Learning on Blended Images for Long-tailed Recognition

Minki Jeong  Changick Kim
Korea Advanced Institute of Science and Technology
{rhm033, changick}@kaist.ac.kr
Abstract

Real-world data often follow a long-tailed distribution, where the number of samples varies widely across training classes. The imbalanced data form a biased feature space, which degrades the performance of the recognition model. In this paper, we propose a novel long-tailed recognition method to balance the latent feature space. First, we introduce a MixUp-based data augmentation technique to reduce the bias of the long-tailed data. Furthermore, we propose a new supervised contrastive learning method for blended images, named Supervised contrastive learning on Mixed Classes (SMC). SMC creates a set of positives based on the class labels of the original images and weights these positives in the training loss by the combination ratio of the mixed images. SMC with the class-mixture-based loss explores a more diverse data space, enhancing the generalization capability of the model. Extensive experiments on various benchmarks show the effectiveness of our one-stage training method.

1 Introduction

Real-world data can be long-tail distributed, with the number of samples varying drastically from class to class. This imbalance arises for various reasons, such as data collection costs or scarcity of data. Training a neural network with imbalanced data is likely to produce an overfitted network that is biased toward the major (head) classes and performs poorly on the minor (tail) classes.

Figure 1: The concept of SMC. Data augmentation reduces the imbalance between head and tail classes (fountain and broccoli in this example) without strong regularization. Moreover, our supervised contrastive learning method on blended images improves recognition performance.

Re-sampling and re-weighting methods have been the major approaches to the data imbalance problem. Re-sampling methods [4, 1, 21, 32] artificially balance the training data by assigning different sampling weights to the training samples: when creating a mini-batch, they oversample tail-class instances or undersample head-class instances. However, re-sampling can worsen overfitting due to the limited semantic diversity of the sampled data. Re-weighting methods [2, 7, 25, 16, 14] modify the training loss for the imbalanced setting by assigning different weights to training instances, adjusting each sample's impact on the model. Although re-weighting improves recognition of tail-class samples, the relatively reduced weights on head-class samples degrade head-class recognition [10].

Several recent studies apply the generalization capability of supervised contrastive learning (SCL) to long-tailed recognition [9, 15]. SCL has been shown to improve generalization in various applications. While SCL can strengthen the feature extractor under long-tailed conditions, it often produces a head-biased feature space due to the nature of the long-tailed distribution [9]. Several long-tailed recognition studies [9, 15, 34] therefore introduce regularization to balance the feature space. However, strong regularization can interfere with the class representation capability of the feature extractor and thereby degrade recognition performance.

In this paper, we propose a novel long-tailed recognition method that forms a balanced feature space with SCL. The main idea is to reduce the class imbalance by mixing training samples instead of applying strong regularization. Data mixing increases the semantic diversity of tail-class information during training. Moreover, the randomly sampled combination coefficient expands the semantic space available for training, as illustrated in Fig. 1. This broader semantic space provides richer information than a set of pre-defined class data, which strengthens the classification power of the model. Among the various data mixing methods, CutMix [30] performs well in long-tailed recognition [22]. However, the random cropping operation in CutMix may discard foreground class information, as illustrated in Fig. 2. We therefore use ResizeMix [24], which omits random cropping of the foreground patch and thus preserves the semantic information of the foreground class. Furthermore, we introduce a new SCL method, named Supervised contrastive learning on Mixed Classes (SMC), that accounts for the semantic information of the original classes in long-tailed recognition. Because mixed data contain features of multiple classes, the positive and negative pairs must be redefined for the supervised contrastive approach. We define three categories of positive pairs for mixed data and weight them according to the combination ratio of the mixed samples; the weighted loss function is used to train the network. We also propose a new data augmentation method for supervised contrastive learning on class-mixed samples. We observed that random cropping on blended data to create positive pairs can discard the semantic information of the training samples. Based on this observation, we augment the images before mixing rather than augmenting the mixed images, and then create blended images from the augmented images. This approach retains the original class information of the mixtures, which leads to better performance.

We evaluate SMC on three long-tailed recognition benchmarks: CIFAR-100-LT [3], ImageNet-LT [18], and iNaturalist 2018 [28]. Our method achieves state-of-the-art performance on these benchmarks. In addition, we provide an in-depth analysis of SMC that justifies its design.

To summarize, our contribution is three-fold:

  • We suggest a new regularization method to reduce the class imbalance for long-tailed recognition. Using blended data provides more diversity to the network, increasing the generalization capability of the model.

  • We propose a novel supervised contrastive learning method, named Supervised contrastive learning on Mixed Classes (SMC), for long-tailed recognition. The data mixtures improve the generalization capability of the model. In addition, we suggest positive and negative pair definitions and a new data augmentation method for SMC. Our proposals maintain the original class information of a mixed sample to balance the feature space.

  • We provide an in-depth experimental analysis of SMC. Our method achieves state-of-the-art performance on various long-tailed recognition benchmarks. Moreover, empirical evaluations on our proposals justify the contribution of our method.

Figure 2: Comparison of the two data augmentation methods. Due to the cropping operation, the semantic information of foreground images could be discarded (chameleon in this example). We replace the cropping with resizing to preserve the semantic information.

2 Related work

2.1 Long-tailed recognition

Re-sampling methods create a subset of training data consisting of class-balanced samples. Oversampling [4, 1] and undersampling [21, 32] methods have been proposed to achieve this balance. Oversampling approaches match the number of samples per class by repeatedly drawing tail-class instances; conversely, undersampling methods omit head-class samples to equalize the label distribution. Training with the balanced subset reduces the defects caused by class imbalance. However, the lack of diversity in the tail-class data remains unsolved and leaves the network prone to overfitting. Re-weighting methods compute per-sample weights for the training loss. The weights can be derived from the label distribution [2, 7, 25] or from instance-level scoring functions [16, 14]. The weighted loss decreases the dominance of head information, which introduces a trade-off between head- and tail-class recognition performance. Using multiple experts [29, 13] also performs well: the experts compensate for each other to boost recognition. More recently, MixUp-based data augmentation has attracted attention in long-tailed recognition. Remix [6] introduces class-distribution-aware labeling for image mixtures. Park et al. [22] propose a MixUp-based data augmentation method, named CMO, to increase the information diversity of head-biased data; the variance-enhanced training data induce a more generalized feature space and better recognition performance. Our method is in line with Remix and CMO in using data augmentation to adjust the biased distribution of the training data. The difference is that SMC exploits the generalization capability of SCL on top of the mixtures.

2.2 Supervised contrastive learning on long-tailed recognition

Supervised contrastive learning [11] uses class labels to define the positive and negative pairs of contrastive learning. Positive pairs are pulled closer together, while negative pairs are pushed apart. SCL performs well in various applications [11, 17, 33]. In long-tailed recognition, however, applying SCL directly yields a head-biased feature space and a biased classifier that can hinder the network [9]. To reduce this bias, KCL [9] screens the positive pairs during training. Li et al. [15] introduce TSC, a two-stage learning method that regularizes features toward uniformly pre-defined class center points. BCL [34] uses the classifier weights as additional class information for SCL. Our concept is inspired by these feature regularization approaches; the difference is that our method balances the feature space using augmented data without strong regularization.

2.3 MixUp-based data augmentation

MixUp [31] was proposed for more robust training of neural networks; it shows that training on mixed instances corresponds to vicinal risk minimization. Yun et al. [30] argue that the linear combination of images in MixUp creates unnatural, ambiguous images that can degrade network performance. They propose CutMix, which replaces the linear combination with random cropping and pasting to create locally meaningful images. Qin et al. [24] point out the possible information loss of the random cropping in CutMix and introduce ResizeMix, which uses image resizing instead. The high generalization capability of MixUp-based augmentations benefits various computer vision applications [19, 22]. Recently, SDMP [26] applied MixUp to self-supervised learning; its main idea, similar to ours, is to use ResizeMix to create positives for self-supervised contrastive learning. The difference between SDMP and SMC is that our method targets supervised contrastive learning with class labels on long-tailed recognition, whereas SDMP focuses on self-supervised learning.

3 Proposed method

3.1 Class imbalance and data mixing

The imbalance of the latent feature space is the main challenge in applying SCL to long-tailed recognition. Applying SCL without accounting for class imbalance results in a head-biased feature space, which hurts tail-class recognition. Previous studies add regularization to the network to balance the feature space. However, we suspect that strong regularization can lead to suboptimal results, since tail classes still have small variances. We therefore use MixUp-based [31] data augmentation to expand the variety of the training data. As illustrated in Fig. 1, the interpolated data span a broader data space than the original data, which increases the generalization capability of the network. Furthermore, using blended data relaxes the information imbalance of the long-tailed data.

Among the variants of MixUp, CutMix [30] performs well in a previous long-tailed recognition study [22]. However, the random cropping in CutMix may discard critical information of the foreground classes, as illustrated in Fig. 2. Although samples with removed information can prevent overfitting, this loss is undesirable because CutMix assumes the presence of multiple class features in the blended image. We therefore replace the random cropping with a resizing operation to secure the foreground class information, which leads to a better generalization capability of the trained model. For foreground and background images $x_{f}$ and $x_{b}$ with labels $y_{f}$ and $y_{b}$, the mixed image $\tilde{x}$ and its label $\tilde{y}$ are defined as follows:

\tilde{x} = \bm{M}\odot R(x_{f}) + (\bm{1}-\bm{M})\odot x_{b},    (1)
\tilde{y} = \lambda y_{f} + (1-\lambda)y_{b},

where $\bm{M}\in\{0,1\}^{W\times H}$ is a binary mask for the images, $R(x_{f})$ is the resized and padded foreground, $\odot$ denotes the Hadamard product, and $\lambda\sim\mathrm{Beta}(\alpha,\alpha)$ is the combination ratio sampled from the beta distribution. We normalize the combination ratio to the range $[0.2, 0.8]$ so that both class features are preserved in the mixed output, which increases the diversity of the training data.
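To make the blending concrete, here is a minimal PyTorch sketch of Eq. (1); the paste-based formulation (equivalent to masking a resized-and-padded foreground), the area-based choice of patch size, and all names are our illustrative assumptions rather than the authors' released code.

```python
import torch
import torch.nn.functional as F

def resize_mix(x_f, x_b, lam):
    """Blend a resized foreground x_f into a background x_b (Eq. 1).

    x_f, x_b: image tensors of shape (C, H, W); lam: combination ratio.
    The whole foreground is resized (never cropped), so its content survives.
    """
    _, H, W = x_b.shape
    # Patch side lengths chosen so the patch covers ~lam * H * W pixels.
    ph = max(1, int(H * lam ** 0.5))
    pw = max(1, int(W * lam ** 0.5))
    patch = F.interpolate(x_f.unsqueeze(0), size=(ph, pw),
                          mode="bilinear", align_corners=False).squeeze(0)
    # Random paste location; this plays the role of the binary mask M.
    top = int(torch.randint(0, H - ph + 1, (1,)))
    left = int(torch.randint(0, W - pw + 1, (1,)))
    mixed = x_b.clone()
    mixed[:, top:top + ph, left:left + pw] = patch
    lam_eff = (ph * pw) / (H * W)  # actual area ratio, used for the soft label
    return mixed, lam_eff

if __name__ == "__main__":
    x_f, x_b = torch.rand(3, 32, 32), torch.rand(3, 32, 32)  # toy CIFAR-sized images
    y_f = F.one_hot(torch.tensor(3), num_classes=100).float()
    y_b = F.one_hot(torch.tensor(7), num_classes=100).float()
    lam = float(torch.distributions.Beta(1.0, 1.0).sample())
    lam = min(max(lam, 0.2), 0.8)  # restricted range used in the paper
    x_mix, lam_eff = resize_mix(x_f, x_b, lam)
    y_mix = lam_eff * y_f + (1.0 - lam_eff) * y_b  # soft label of Eq. (1)
```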

Figure 3: The three positive types of SMC. If a mixed image shares a class with another mixed image, we define the two as a positive pair. Foreground-shared pairs share the foreground class, and background-shared pairs share the background class. A cross-shared pair shares at least one class but is neither a foreground-shared nor a background-shared pair. The positives help the network reduce the imbalance of its feature space.

The next issue is how to sample the foregrounds and backgrounds. On long-tailed data, uniform sampling would oversample head instances, which is undesirable in terms of data diversity. Inspired by the previous study [22], we sample background images with uniform probability, while foreground images are drawn with a weighted sampler that picks tail-class instances more frequently. The weighted foreground sampling balances head- and tail-class information in the mixture distribution. The sampling probability of the $k$-th class is defined as follows:

q(k) = \frac{n_{k}^{-\gamma}}{\sum_{l=1}^{C} n_{l}^{-\gamma}},    (2)

where $n_{k}$ is the number of samples in the $k$-th class, $C$ denotes the number of training classes, and $\gamma$ is a hyperparameter that determines the sampling strategy. We set $\gamma$ to one following the previous study [22].
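One possible implementation of the sampler in Eq. (2) is sketched below; the helper names and the use of PyTorch's WeightedRandomSampler are our assumptions.

```python
import numpy as np
from torch.utils.data import WeightedRandomSampler

def class_sampling_probs(class_counts, gamma=1.0):
    """q(k) of Eq. (2): inverse-frequency class sampling probabilities."""
    counts = np.asarray(class_counts, dtype=np.float64)
    w = counts ** (-gamma)
    return w / w.sum()

def foreground_sampler(labels, class_counts, gamma=1.0):
    """Sampler that draws class k with probability q(k).

    labels: integer array of per-sample class indices. Dividing q(k) by n_k
    spreads each class's probability uniformly over its samples.
    """
    labels = np.asarray(labels)
    counts = np.asarray(class_counts, dtype=np.float64)
    per_sample = class_sampling_probs(class_counts, gamma)[labels] / counts[labels]
    return WeightedRandomSampler(per_sample.tolist(),
                                 num_samples=len(labels), replacement=True)
```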

3.2 Supervised contrastive learning on mixed data

SCL requires defining positives and negatives for the training samples: positives share the same class label as the target sample, and negatives carry different labels. However, this definition is inappropriate for mixed data because a blended sample contains information from multiple classes. We therefore define three positive pair types, as illustrated in Fig. 3: foreground-shared pairs, background-shared pairs, and cross-shared pairs. Foreground-shared and background-shared pairs share the foreground and background classes of a mixed image, respectively. A cross-shared pair shares at least one class but is neither a foreground-shared nor a background-shared pair. The negative set consists of the remaining samples.
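As a sketch under our own naming, the three positive sets can be derived directly from the foreground and background labels (`fg`, `bg`) of a mini-batch of mixed samples:

```python
import torch

def positive_masks(fg, bg):
    """Boolean (N, N) masks for the three positive types of a mini-batch.

    fg, bg: (N,) tensors holding the original foreground / background class
    labels of each mixed sample; self-pairs (the diagonal) are excluded.
    """
    eye = torch.eye(fg.numel(), dtype=torch.bool, device=fg.device)
    f_shared = (fg[:, None] == fg[None, :]) & ~eye   # foregrounds match
    b_shared = (bg[:, None] == bg[None, :]) & ~eye   # backgrounds match
    cross = (fg[:, None] == bg[None, :]) | (bg[:, None] == fg[None, :])
    # shares at least one class, but is neither fg-shared nor bg-shared
    c_shared = cross & ~f_shared & ~b_shared & ~eye
    return f_shared, b_shared, c_shared
```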

Figure 4: Data augmentation on the mixtures. Since the random cropping on mixed images could remove the semantic information of the original images (drum in this example), we augment the component images before mixing to preserve both class features.

Previous contrastive learning methods augment the target image to produce positives [5, 11]. Among the various augmentation operations, random cropping is a crucial factor for generalization capability [5]. However, random cropping on a mixture risks losing information from the original images, as illustrated in Fig. 4. We therefore augment the original images before blending to secure diversity for contrastive learning:

\tilde{x} = \bm{M}\odot R(\tilde{x}_{f}) + (\bm{1}-\bm{M})\odot\tilde{x}_{b},    (3)
\tilde{y} = \lambda y_{f} + (1-\lambda)y_{b},

where $\tilde{x}_{f}$ and $\tilde{x}_{b}$ are the augmented foreground and background images, respectively. This approach preserves the information of the foreground classes while benefiting from the rich generalization capability of image mixing.
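A sketch of this augment-before-mix pipeline, reusing `resize_mix` from the earlier sketch; the SimCLR-style transform list and the CIFAR-sized crop are our assumptions:

```python
import torchvision.transforms as T

# Random cropping is applied to each *source* image before blending (Eq. 3),
# so the crop can never delete content the mixing step has already composed.
augment = T.Compose([
    T.RandomResizedCrop(32, scale=(0.2, 1.0)),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
])

def augmented_mix(img_f, img_b, lam):
    """img_f, img_b: PIL images. Returns the blended tensor of Eq. (3) and
    the effective combination ratio, via the earlier resize_mix sketch."""
    return resize_mix(augment(img_f), augment(img_b), lam)
```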

Figure 5: Illustration of SMC. The MixUp-based augmentation creates more diverse data. The blended data are used for training the classifier and the MLP head with the corresponding positives and negatives.

Finally, the SMC loss $L_{SMC}$ on a set of features of augmented images $\tilde{x}_{i}\in\tilde{\mathcal{X}}$ is defined as follows:

L_{f} = -\frac{1}{|\tilde{\mathcal{X}}|}\sum_{\tilde{x}_{i}\in\tilde{\mathcal{X}}}\frac{1}{|\tilde{\mathcal{X}}_{i}^{f}|}\sum_{\tilde{x}_{f}\in\tilde{\mathcal{X}}_{i}^{f}}\log\frac{\exp(\tilde{x}_{i}\tilde{x}_{f}/\tau)}{\sum_{\tilde{x}_{j}\in\tilde{\mathcal{X}}_{i}}\exp(\tilde{x}_{i}\tilde{x}_{j}/\tau)},    (4)
L_{b} = -\frac{1}{|\tilde{\mathcal{X}}|}\sum_{\tilde{x}_{i}\in\tilde{\mathcal{X}}}\frac{1}{|\tilde{\mathcal{X}}_{i}^{b}|}\sum_{\tilde{x}_{b}\in\tilde{\mathcal{X}}_{i}^{b}}\log\frac{\exp(\tilde{x}_{i}\tilde{x}_{b}/\tau)}{\sum_{\tilde{x}_{j}\in\tilde{\mathcal{X}}_{i}}\exp(\tilde{x}_{i}\tilde{x}_{j}/\tau)},
L_{c} = -\frac{1}{|\tilde{\mathcal{X}}|}\sum_{\tilde{x}_{i}\in\tilde{\mathcal{X}}}\frac{1}{|\tilde{\mathcal{X}}_{i}^{c}|}\sum_{\tilde{x}_{c}\in\tilde{\mathcal{X}}_{i}^{c}}\log\frac{\exp(\tilde{x}_{i}\tilde{x}_{c}/\tau)}{\sum_{\tilde{x}_{j}\in\tilde{\mathcal{X}}_{i}}\exp(\tilde{x}_{i}\tilde{x}_{j}/\tau)},
L_{SMC} = w_{f}L_{f} + w_{b}L_{b} + w_{c}L_{c},

where $\tilde{\mathcal{X}}_{i}=\tilde{\mathcal{X}}\setminus\{\tilde{x}_{i}\}$ is the set of mini-batch features excluding $\tilde{x}_{i}$, and $|\tilde{\mathcal{X}}|$ denotes the mini-batch size. $\tilde{\mathcal{X}}_{i}^{f}$, $\tilde{\mathcal{X}}_{i}^{b}$, and $\tilde{\mathcal{X}}_{i}^{c}$ are the sets of foreground-shared, background-shared, and cross-shared pairs, respectively; $|\tilde{\mathcal{X}}_{i}^{f}|$, $|\tilde{\mathcal{X}}_{i}^{b}|$, and $|\tilde{\mathcal{X}}_{i}^{c}|$ are the sizes of the corresponding positive sets, and $\tau$ is a temperature parameter set to $0.1$. The loss weights $w_{f}=\lambda_{w}/1.5$, $w_{b}=(1-\lambda_{w})/1.5$, and $w_{c}=0.5/1.5$ follow $\lambda_{w}$, the combination ratio of the images in the pair, and are normalized so that they sum to one. The weights encode the semantic relationship between the images in a pair: positive pairs with a stronger relationship have a greater impact on the network than pairs with a weaker one.
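The following PyTorch sketch shows one way Eq. (4) could be computed on a mini-batch; it reuses the `positive_masks` sketch above, and for simplicity treats $\lambda_{w}$ as a single scalar shared across the batch, whereas the paper weights by the combination ratio of each pair. Function and variable names are ours.

```python
import torch

def smc_loss(z, fg, bg, lam_w, tau=0.1):
    """Sketch of the SMC loss in Eq. (4).

    z: (N, D) L2-normalized projections of the mixed images; fg, bg: (N,)
    original class labels; lam_w: combination ratio for the loss weights,
    treated here as one scalar shared across the batch for simplicity.
    """
    n = z.size(0)
    eye = torch.eye(n, dtype=torch.bool, device=z.device)
    logits = (z @ z.t()) / tau
    logits = logits.masked_fill(eye, float("-inf"))  # drop self-pairs
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)

    f_pos, b_pos, c_pos = positive_masks(fg, bg)  # from the earlier sketch

    def type_loss(pos):
        # average log-probability over each anchor's positives of one type;
        # anchors with no positives of that type are skipped
        n_pos = pos.sum(dim=1)
        has = n_pos > 0
        if not has.any():
            return z.new_zeros(())
        summed = log_prob.masked_fill(~pos, 0.0).sum(dim=1)
        return -(summed[has] / n_pos[has]).mean()

    w_f, w_b, w_c = lam_w / 1.5, (1.0 - lam_w) / 1.5, 0.5 / 1.5
    return w_f * type_loss(f_pos) + w_b * type_loss(b_pos) + w_c * type_loss(c_pos)
```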

3.3 Classification loss on mixed data

The cross-entropy function is a typical choice for training classifiers. In the long-tailed setting, however, it often yields a head-biased classifier due to the imbalanced training data [25]. Logit compensation methods [3, 20] reduce this bias by adjusting the classifier output with knowledge of the class distribution. We apply logit compensation to balance the classifier on the mixed data. Since the mixing augmentation increases the data variance, the mixtures help the classifier explore a broader data space. Furthermore, training the classifier and the feature extractor simultaneously yields one-stage training of the model. With the blended data, the classifier is trained with the following loss:

L_{BCE} = CE(f(\tilde{x}) + \bm{m}, \tilde{y}),    (5)

where $CE(\cdot,\cdot)$ is the cross-entropy function and $\bm{m}=\log(\mathbb{P}_{Y})$ is the logarithm of the class prior distribution $\mathbb{P}_{Y}$ used for the compensation. The prior vector adds larger values to head-class logits, which emphasizes tail-class instances during training.
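A minimal sketch of Eq. (5) for soft (mixed) labels, assuming `class_counts` holds the per-class training counts from which the prior $\mathbb{P}_{Y}$ is estimated; the function name is ours.

```python
import torch
import torch.nn.functional as F

def balanced_ce(logits, soft_targets, class_counts):
    """Logit-compensated cross-entropy of Eq. (5) for soft (mixed) labels.

    class_counts: (C,) float tensor of per-class training counts; the prior
    m = log(P_Y) raises head-class logits, so the loss emphasizes tails.
    """
    prior = torch.log(class_counts / class_counts.sum())  # m = log(P_Y)
    log_p = F.log_softmax(logits + prior, dim=1)          # softmax over f(x~) + m
    return -(soft_targets * log_p).sum(dim=1).mean()
```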

By combining the losses defined above, the total training loss is defined as follows:

L_{train} = L_{BCE} + \eta L_{SMC},    (6)

where $\eta$ is a weight hyperparameter. Figure 5 summarizes SMC.

4 Experimental analysis

4.1 Datasets and comparison metrics

We used CIFAR-100-LT [3], ImageNet-LT [18], and iNaturalist 2018 [28] to evaluate SMC. The datasets have different imbalance ratios. The imbalance ratio $\rho$ is defined as the ratio of the number of instances in the largest head class to that in the smallest tail class. CIFAR-100-LT is a subset of the CIFAR-100 dataset [12] that artificially excludes training data to form 100 long-tail distributed classes. We used three imbalance ratios for comparison: 100, 50, and 10. ImageNet-LT is a subset of the ImageNet 2012 dataset [27]; similar to CIFAR-100-LT, it selects training data to create a set of 1000 imbalanced classes. It contains 115.8K images and its imbalance ratio is 256. In contrast to the previous two datasets, iNaturalist 2018 is imbalanced from the data collection stage. It contains 437,513 training images of 8,142 classes, with an imbalance ratio of 500.

4.2 Implementation details

We followed the training strategy of previous long-tailed recognition studies [9, 15, 22] for a fair comparison. We used a ResNet-32 [8] backbone on CIFAR-100-LT and a ResNet-50 [8] backbone on ImageNet-LT and iNaturalist 2018. On CIFAR-100-LT, we followed the training strategy in [2]. For ImageNet-LT and iNaturalist 2018, the initial learning rate is set to 0.1. On ImageNet-LT, we train for 100 epochs with a mini-batch size of 128 and decay the learning rate by a factor of 0.1 at epochs 60 and 80. On iNaturalist 2018, we train for 200 epochs with a mini-batch size of 256 and decay the learning rate by a factor of 0.1 at epochs 75 and 160. For all datasets, we set the loss weight $\eta$ to 0.1 and the output dimension of the 2-layer MLP head for contrastive learning to 128. The $\alpha$ value of the Beta distribution is set to 1.0, and the hidden layer size of the head follows the input feature size. Please see the supplementary material for further details.

4.3 Comparison with the other methods

Since SMC combines the strengths of SCL and MixUp-based augmentation, we select SCL-based methods [9, 15, 34] and MixUp-based methods [6, 22], together with earlier long-tailed recognition methods [10, 7, 16, 2], as baselines. The comparison results on CIFAR-100-LT are in Table 1. SMC outperforms the baselines in all comparisons, implying that SMC is effective under various long-tailed conditions. The mixed samples enlarge the semantic information space of the training data, which improves the discriminative power of the network. Furthermore, the blended data apply a soft balancing regularization to the network, which reduces the bias in the feature space and improves recognition capability.

The performance comparisons on ImageNet-LT and iNaturalist 2018 are in Table 2 and Table 3, respectively. The overall recognition performance of SMC remains better than the baselines, demonstrating that SMC is still effective when the data distribution is complex. Compared with TSC, an SCL-based two-stage learning method, SMC performs better on medium- and few-shot classes, while its head-class performance is weaker. This shows that the broader training data space and soft regularization help low-shot classes but hurt classification on the major classes. Nevertheless, the gain on low-shot classes significantly improves the overall recognition capability. We leave the balance between many-shot and low-shot performance as an open problem.

Method          ρ=100   ρ=50   ρ=10
CE              38.3    43.9   55.7
BS [7]          38.6    44.6   57.1
Focal [16]      38.4    44.3   55.8
BS-Focal [7]    39.6    45.2   58.0
CE-DRW [2]      40.5    44.7   56.2
CE-DRS [2]      40.4    44.5   56.1
LDAM [2]        39.6    45.0   56.9
LDAM-DRW [2]    42.0    46.2   58.7
Remix [6]       45.8    49.5   59.2
CMO [22]        46.6    51.4   62.3
KCL [9]         42.8    46.3   57.6
TSC [15]        43.8    47.4   59.0
Ours (SMC)      48.9    52.3   62.5
Table 1: Classification accuracy (%) on the CIFAR-100-LT dataset. Please see the supplementary material for more details.
Method          Many   Medium   Few    All
τ-norm [10]     56.6   44.2     27.4   46.7
cRT [10]        58.8   44.0     26.1   47.3
LWS [10]        57.1   45.2     29.3   47.7
FCL [9]         61.4   47.0     28.2   49.8
Remix [6]       60.4   46.9     30.7   48.6
CMO [22]        62.0   49.1     36.7   52.3
KCL [9]         61.8   49.4     30.9   51.5
TSC [15]        63.5   49.7     30.4   52.4
Ours (SMC)      61.8   49.8     37.9   52.7
†BCL [34]       -      -        -      56.0
†Ours (SMC)     66.9   54.2     36.1   56.6
Table 2: Classification accuracy (%) on the ImageNet-LT dataset. † denotes results obtained with a different training setup, presented in [34]. Please see the supplementary material for further details.
Method          Many   Medium   Few    All
CE              72.2   63.0     57.2   61.7
cRT [10]        69.0   66.0     63.2   65.2
τ-norm [10]     65.6   65.3     65.9   65.6
LWS [10]        65.0   66.3     65.5   65.9
Remix [6]       -      -        -      70.5
†CMO [22]       67.7   69.3     70.7   69.7
KCL [9]         -      -        -      68.6
TSC [15]        72.6   70.6     67.8   69.7
Ours (SMC)      69.6   70.1     71.3   70.6
Table 3: Classification accuracy (%) on the iNaturalist 2018 dataset. † denotes the result with our training setup. Please see the supplementary material for more details.

4.4 Feature space analysis with contrastive loss

In this subsection, we evaluate the contribution of SMC by analyzing the feature space it creates. We use two metrics to assess the trained feature space: the inter-class score and the semantic similarity score.

The inter-class score $IS_{k}$ for the $k$-th class is defined as follows:

IS_{k} = \exp\left(-\frac{1}{C}\sum_{j=1}^{C}\left(\bm{c}_{k}-\bm{c}_{j}\right)/\tau^{\prime}\right),    (7)

where $C$ is the number of classes, $\bm{c}_{k}$ is the class feature vector computed as the average of the feature vectors of the $k$-th class, and $\tau^{\prime}$ is a temperature that scales the score values. We set $\tau^{\prime}$ to 10 for the comparisons. A lower inter-class score means the class centers are farther apart, so the network can distinguish the classes more easily.
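A sketch of Eq. (7) follows; since the summand $(\bm{c}_{k}-\bm{c}_{j})$ is a vector, we read it as the Euclidean distance $\|\bm{c}_{k}-\bm{c}_{j}\|_{2}$ between class centers, which is an assumption on our part.

```python
import torch

def inter_class_scores(centers, tau_prime=10.0):
    """Sketch of Eq. (7) with (c_k - c_j) read as the Euclidean distance.

    centers: (C, D) tensor of per-class mean feature vectors.
    """
    dists = torch.cdist(centers, centers)             # (C, C) ||c_k - c_j||_2
    return torch.exp(-dists.mean(dim=1) / tau_prime)  # one IS_k per class
```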

A well-trained feature space encodes the semantic information of the training classes: if two classes are semantically similar to each other, they are placed close together in the space. The semantic similarity score measures the agreement between the class semantic information and the latent space, i.e., the difference between the cosine similarities of the class semantic vectors and those of the class feature vectors. The semantic similarity score $SS$ is defined as follows:

SS = \frac{1}{C^{2}}\sum_{i=1}^{C}\sum_{j=1}^{C}\|\mathbf{S}_{i,j}^{s}-\mathbf{S}_{i,j}^{c}\|_{1},    (8)
\mathbf{S}_{i,j}^{s} = \frac{\bm{s}_{i}\bm{s}_{j}}{\|\bm{s}_{i}\|_{2}\cdot\|\bm{s}_{j}\|_{2}}, \qquad \mathbf{S}_{i,j}^{c} = \frac{\bm{c}_{i}\bm{c}_{j}}{\|\bm{c}_{i}\|_{2}\cdot\|\bm{c}_{j}\|_{2}},

where $\bm{s}_{k}$ is the semantic vector of the $k$-th class, $C$ is the number of classes, and $\mathbf{S}^{s}$ and $\mathbf{S}^{c}$ are the cosine similarity matrices of the class semantic vectors and the class center features, respectively. We use GloVe [23] to generate the semantic vectors of the classes from their names. A lower score implies that the trained feature space embeds the semantic knowledge of the classes.
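A sketch of Eq. (8), assuming the GloVe vectors of the class names and the class center features have already been computed; names are ours.

```python
import torch
import torch.nn.functional as F

def semantic_similarity_score(sem, centers):
    """Sketch of Eq. (8).

    sem:     (C, d)  class semantic vectors (e.g., GloVe vectors of class names),
    centers: (C, d') class center features from the trained network.
    """
    s = F.normalize(sem, dim=1)
    c = F.normalize(centers, dim=1)
    S_s, S_c = s @ s.t(), c @ c.t()   # cosine-similarity matrices
    return (S_s - S_c).abs().mean()   # mean L1 gap over all C^2 entries
```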

The evaluation results are in Table 4. For the comparison, we train a network that follows our training strategy but uses KCL as the contrastive loss. Our method forms a better latent space than the alternative, and the feature space of SMC is more similar to the semantic information space: SMC helps the network capture semantic information.

Method        Many    Medium   Few     All
Inter-class score
Resize-KCL    0.578   0.573    0.596   0.582
Resize-SMC    0.573   0.564    0.590   0.575
Semantic similarity score
Resize-KCL    0.591   0.581    0.620   0.596
Resize-SMC    0.578   0.568    0.609   0.584
Classification accuracy (%)
Resize-KCL    61.9    46.4     31.9    47.6
Resize-SMC    61.9    48.8     33.9    48.9
Table 4: The feature space analysis on CIFAR-100-LT (100). Lower scores and higher accuracy are better.

4.5 Component analysis

In this subsection, we present an in-depth empirical analysis of our method to validate our proposals. Table 5 displays the comparison results, and the detailed description of the analysis follows.

Normalized combination ratio in data blending. We normalize the combination ratio to the range $[0.2, 0.8]$ to preserve the foreground information and increase the data variance. We assess the effect of this restriction in the first block of Table 5. The constrained range helps retain the information of the original images, which enlarges the data variance. This observation emphasizes the impact of data diversity on the network.

Information loss in cropping. The second block of Table 5 compares different data mixing methods. Crop-SMC replaces the resizing operation in SMC with a random cropping operation. Resizing outperforms random cropping, which implies that resizing is more advantageous than random cropping for long-tailed recognition.

Data augmentation analysis. We measure the importance of random cropping for creating positive instances against alternative augmentations; the comparison result is in the third block of Table 5. In line with previous contrastive learning studies [5], random cropping dramatically enhances recognition performance. However, cropping mixed images carries the risk of information loss. Our augmentation method reduces this loss, which results in better performance.

Adaptive weights on positives. In Eq. 4, we weight positives based on their combination ratios. We assess the influence of the adaptive weights by replacing the weighted summation in Eq. 4 with alternative weighting schemes; the fourth block of Table 5 shows the result. Averaging denotes a simple average of the three losses, and assigning uses the class of the larger component as the label of the blended sample and then trains the network with naïve SCL. When the combination ratio is high, the mixture is more likely to carry the foreground information than the background information; the weights adjust the impact of each positive pair on the network accordingly, boosting the performance of the model. Furthermore, the lower performance of the assigning method underscores the need for dedicated positive and negative pair definitions for blended images.

Loss hyperparameter analysis. We set the loss weight hyperparameter $\eta$ in Eq. 6 to 0.1 in our experiments. The last block of Table 5 shows the performance dependence on this weight. A higher weight lets the contrastive objective dominate training, while a lower weight diminishes the effect of the contrastive term. Fully exploiting SMC requires balancing the two losses.

Comparisons                          Acc (%)
Range of the combination ratio
[0.0, 1.0]                           48.1
[0.2, 0.8] (proposed)                48.9
Information loss in cropping
Crop-SMC                             47.6
Resizing-SMC (proposed)              48.9
Augmentation method for positives
No cropping                          47.3
Cropping after mixing                47.8
Cropping before mixing (proposed)    48.9
Weights on positives
Averaging                            46.7
Assigning the larger class label     47.1
Weighted summation (proposed)        48.9
Loss hyperparameter
η = 0.05                             47.0
η = 0.1                              48.9
η = 0.2                              47.2
Table 5: Component analysis of SMC on CIFAR-100-LT (100).

4.6 Ensembling on SMC

The preceding comparisons were conducted on a single network. In this subsection, we show that SMC also benefits ensemble-based methods. We combine SMC with RIDE [29], a state-of-the-art ensemble-based long-tailed recognition method, and measure the performance as the number of experts in the ensemble changes. We add an MLP head for each expert and train the feature extractor with our training loss during the first training stage of RIDE. Table 6 shows the result: performance improves as the number of experts increases. This demonstrates the extensibility of SMC; our method can easily be merged with additional long-tailed recognition methodologies for better recognition capability.

Method           Many   Medium   Few    All
SMC (1 expert)   61.9   48.8     33.9   48.9
2 experts        67.5   50.9     27.1   49.6
3 experts        70.5   50.7     25.8   50.7
4 experts        71.6   50.6     32.3   52.5
Table 6: Classification accuracy (%) comparison of SMC with the multi-expert ensemble method on CIFAR-100-LT (100).
Figure 6: Visualization of information loss cases. The foreground and background classes are sundial and submarine, respectively. Although the restricted combination ratio and the resizing operation are intended to preserve the image information, information can still be removed in the mixture, which is undesirable from the perspective of data diversity.

4.7 Limitations

Our data augmentation method aims to preserve the semantic characteristics of the training data. Despite these efforts to reduce information loss in the training samples, critical information can still be lost. Figure 6 illustrates several failure cases. The merging operation with a mask can remove the background object, which is undesirable in terms of data diversity. Moreover, the resizing operation can make the object of interest in the foreground image so small that the geometric information of the foreground object is lost. We believe that the performance of SMC could be improved with a proper knowledge-preserving augmentation method.

5 Conclusion

In this paper, we have introduced a novel long-tailed recognition method named Supervised contrastive learning on Mixed Classes (SMC). Our method exploits the information of blended images to enhance the variety of the training data. The resizing operation on foregrounds helps the network explore a more diverse data space. Furthermore, we define positive and negative pairs for mixed images and propose a data augmentation method for contrastive learning of blended images in long-tailed recognition. Performance comparisons on various benchmarks and an in-depth analysis of our method validate our idea.

Potential negative societal impact. The diversified training data require longer training, and the additional supervised contrastive term requires more memory. Thus, SMC may need more computational resources than other methods, which could increase carbon emissions.

References

  • [1] Mateusz Buda, Atsuto Maki, and Maciej A Mazurowski. A systematic study of the class imbalance problem in convolutional neural networks. Neural networks, 106:249–259, 2018.
  • [2] Kaidi Cao, Colin Wei, Adrien Gaidon, Nikos Arechiga, and Tengyu Ma. Learning imbalanced datasets with label-distribution-aware margin loss. Advances in neural information processing systems, 32, 2019.
  • [3] Kaidi Cao, Colin Wei, Adrien Gaidon, Nikos Arechiga, and Tengyu Ma. Learning imbalanced datasets with label-distribution-aware margin loss. Advances in neural information processing systems, 32, 2019.
  • [4] Nitesh V Chawla, Kevin W Bowyer, Lawrence O Hall, and W Philip Kegelmeyer. Smote: synthetic minority over-sampling technique. Journal of artificial intelligence research, 16:321–357, 2002.
  • [5] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pages 1597–1607. PMLR, 2020.
  • [6] Hsin-Ping Chou, Shih-Chieh Chang, Jia-Yu Pan, Wei Wei, and Da-Cheng Juan. Remix: rebalanced mixup. In European Conference on Computer Vision, pages 95–110. Springer, 2020.
  • [7] Yin Cui, Menglin Jia, Tsung-Yi Lin, Yang Song, and Serge Belongie. Class-balanced loss based on effective number of samples. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9268–9277, 2019.
  • [8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [9] Bingyi Kang, Yu Li, Sa Xie, Zehuan Yuan, and Jiashi Feng. Exploring balanced feature spaces for representation learning. In International Conference on Learning Representations, 2021.
  • [10] Bingyi Kang, Saining Xie, Marcus Rohrbach, Zhicheng Yan, Albert Gordo, Jiashi Feng, and Yannis Kalantidis. Decoupling representation and classifier for long-tailed recognition. In International Conference on Learning Representations, 2019.
  • [11] Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. Supervised contrastive learning. Advances in Neural Information Processing Systems, 33, 2020.
  • [12] Alex Krizhevsky. Learning multiple layers of features from tiny images, 2009.
  • [13] Bolian Li, Zongbo Han, Haining Li, Huazhu Fu, and Changqing Zhang. Trustworthy long-tailed classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6970–6979, 2022.
  • [14] Bo Li, Yongqiang Yao, Jingru Tan, Gang Zhang, Fengwei Yu, Jianwei Lu, and Ye Luo. Equalized focal loss for dense long-tailed object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6990–6999, 2022.
  • [15] Tianhong Li, Peng Cao, Yuan Yuan, Lijie Fan, Yuzhe Yang, Rogerio S Feris, Piotr Indyk, and Dina Katabi. Targeted supervised contrastive learning for long-tailed recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6918–6928, 2022.
  • [16] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pages 2980–2988, 2017.
  • [17] Chen Liu, Yanwei Fu, Chengming Xu, Siqian Yang, Jilin Li, Chengjie Wang, and Li Zhang. Learning a few-shot embedding model with contrastive learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 8635–8643, 2021.
  • [18] Ziwei Liu, Zhongqi Miao, Xiaohang Zhan, Jiayun Wang, Boqing Gong, and Stella X Yu. Large-scale long-tailed recognition in an open world. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2537–2546, 2019.
  • [19] Puneet Mangla, Nupur Kumari, Abhishek Sinha, Mayank Singh, Balaji Krishnamurthy, and Vineeth N Balasubramanian. Charting the right manifold: Manifold mixup for few-shot learning. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pages 2218–2227, 2020.
  • [20] Aditya Krishna Menon, Sadeep Jayasumana, Ankit Singh Rawat, Himanshu Jain, Andreas Veit, and Sanjiv Kumar. Long-tail learning via logit adjustment. In International Conference on Learning Representations, 2020.
  • [21] Ajinkya More. Survey of resampling techniques for improving classification performance in unbalanced datasets. arXiv preprint arXiv:1608.06048, 2016.
  • [22] Seulki Park, Youngkyu Hong, Byeongho Heo, Sangdoo Yun, and Jin Young Choi. The majority can help the minority: Context-rich minority oversampling for long-tailed classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6887–6896, 2022.
  • [23] Jeffrey Pennington, Richard Socher, and Christopher D Manning. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1532–1543, 2014.
  • [24] Jie Qin, Jiemin Fang, Qian Zhang, Wenyu Liu, Xingang Wang, and Xinggang Wang. Resizemix: Mixing data with preserved object information and true labels. arXiv preprint arXiv:2012.11101, 2020.
  • [25] Jiawei Ren, Cunjun Yu, Xiao Ma, Haiyu Zhao, Shuai Yi, et al. Balanced meta-softmax for long-tailed visual recognition. Advances in neural information processing systems, 33:4175–4186, 2020.
  • [26] Sucheng Ren, Huiyu Wang, Zhengqi Gao, Shengfeng He, Alan Yuille, Yuyin Zhou, and Cihang Xie. A simple data mixing prior for improving self-supervised learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14595–14604, June 2022.
  • [27] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International journal of computer vision, 115(3):211–252, 2015.
  • [28] Grant Van Horn, Oisin Mac Aodha, Yang Song, Yin Cui, Chen Sun, Alex Shepard, Hartwig Adam, Pietro Perona, and Serge Belongie. The inaturalist species classification and detection dataset. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8769–8778, 2018.
  • [29] Xudong Wang, Long Lian, Zhongqi Miao, Ziwei Liu, and Stella Yu. Long-tailed recognition by routing diverse distribution-aware experts. In International Conference on Learning Representations, 2021.
  • [30] Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. Cutmix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE/CVF international conference on computer vision, pages 6023–6032, 2019.
  • [31] Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In International Conference on Learning Representations, 2018.
  • [32] Yongshun Zhang, Xiu-Shen Wei, Boyan Zhou, and Jianxin Wu. Bag of tricks for long-tailed visual recognition with deep convolutional neural networks. In Proceedings of the AAAI conference on artificial intelligence, volume 35, pages 3447–3455, 2021.
  • [33] Bing Zhao, Jun Li, and Hong Zhu. Codo: Contrastive learning with downstream background invariance for detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pages 4196–4201, June 2022.
  • [34] Jianggang Zhu, Zheng Wang, Jingjing Chen, Yi-Ping Phoebe Chen, and Yu-Gang Jiang. Balanced contrastive learning for long-tailed visual recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6908–6917, 2022.