
Contrastive Self-Supervised Learning with Hard Negative Pair Mining

Wentao Zhu    Hang Shang    Tingxun Lv    Chao Liao    Sen Yang    Ji Liu
Kuaishou Technology
Abstract

Recently, learning from vast unlabeled data, especially self-supervised learning, has emerged and attracted widespread attention. Self-supervised learning followed by supervised fine-tuning on a few labeled examples can significantly improve label efficiency and outperform standard supervised training using fully annotated data Chen et al. (2020b). In this work, we present a novel self-supervised deep learning paradigm based on online hard negative pair mining. Specifically, we design a student-teacher network to generate multiple views of the data for self-supervised learning and integrate hard negative pair mining into the training. We then derive a new triplet-like loss considering both positive sample pairs and mined hard negative sample pairs. Extensive experiments demonstrate the effectiveness of the proposed method and its components on ILSVRC-2012.

1 Introduction

Learning from large-scale unlabeled datasets has long been a hot topic in the computer vision community, because high-quality labels require laborious and costly annotation for each task, while a huge amount of unlabeled data is available from various data servers and sources. Un/self-supervised learning can effectively learn a task-agnostic representation from vast unlabeled data, and downstream tasks, such as image classification, can then be performed well by fine-tuning on a few task-specific labels. This strategy has become a mainstream pipeline for transformer-based self-supervised learning approaches Vaswani and others (2017). Recent advanced self-supervised learning achieves promising results and outperforms conventional fully supervised learning on image classification Chen et al. (2020b).

The key effort of general self-supervised learning approaches mainly focuses on pretext task construction Jing and Tian (2020). The pretext task can be designed as a predictive task Mathieu and others (2016), a generative task Bansal et al. (2018), a contrastive task Oord et al. (2018), or a combination of them. The supervision signal for the pretext task, i.e., the pseudo label, is typically produced by a pretext construction process that generally involves exhaustive multi-view construction to model various variations Qian et al. (2020). By solving the pretext task with a specific objective function, the network learns transferable visual features for various downstream tasks.

The study of conventional self-supervised learning methods mainly involves data-related pretext task design Zhu et al. (2021). Popular pretext tasks include colorizing grayscale images Zhang and others (2016), image inpainting Pathak and others (2016), solving image jigsaw puzzles Noroozi and Favaro (2016), etc. For video-related self-supervised learning approaches, the data-related pretext tasks can be sequence order verification Misra and others (2016), sequence sorting Lee et al. (2017), predicting the odd or unrelated element Fernando and others (2017), classifying clip order Xu and others (2019), etc.

The recent tremendous success of self-supervised learning is mainly driven by advanced learning strategies. The InfoNCE loss is widely adopted for contrastive learning; it maximizes a lower bound of mutual information based on the pseudo label in the pretext task Oord et al. (2018). SimCLR employs larger batch sizes, more training steps and compositions of data augmentations, and matches the performance of a fully supervised ResNet-50 simply by adding one additional linear classifier Chen et al. (2020a). Wu et al. (2018) maintain a large feature memory bank to store training image representations. MoCo builds a large and consistent dictionary through a dynamic queue and a momentum-updated encoder, which outperforms its supervised pretraining counterpart on detection and segmentation He and others (2020). SimSiam employs a stop-gradient operation in Siamese architectures to prevent collapsing solutions of self-supervised learning Chen and He (2020). SimCLRv2 employs big (deep and wide) networks during pretraining and fine-tuning, and achieves surprisingly good performance for semi-supervised learning on ImageNet Chen et al. (2020b). BYOL trains an online network to predict a target network representation of the same image, where the target network is a slow-moving average of the online network Grill and others (2020).

In this work, we propose a novel self-supervised learning paradigm by introducing an effective negative image pair mining strategy into the contrastive learning framework. Specifically, we introduce a student-teacher network into the contrastive learning framework to construct multi-view representations of the data. To effectively learn from unlabeled data in contrastive learning, we further construct negative image pairs by hard negative image pair mining. The overall objective function can be derived as a triplet-like loss facilitated by the collected positive and negative image pairs.

We conduct extensive experiments including linear evaluation, semi-supervised learning, transfer learning, and ablation studies to evaluate our method on the ImageNet dataset Russakovsky and others (2015). The proposed method achieves 77.1% top-1 accuracy using a ResNet-50 encoder for linear evaluation, which outperforms the previous state-of-the-art by 2.8%. For the semi-supervised learning task, our method with a ResNet-50 encoder obtains 73.4% top-1 accuracy, which outperforms the previous best result by 4.6% using 10% of the labels. For transfer learning with linear evaluation, our method with a ResNet-50 encoder achieves the best accuracy on six out of seven widely used transfer learning datasets, outperforming previous best results by 2.5% on average. More specifically, our major contributions are summarized as follows.

  • First, we build a student-teacher network to construct multi-view representations in the contrastive learning framework. The gradient of the student sub-network is blocked to ease the training difficulty and stabilize the training of self-supervised learning.

  • Second, we collect hard negative image pairs on-the-fly and add the hard negative image pairs into the training of contrastive self-supervised learning.

  • Third, extensive experiments demonstrate that the proposed contrastive self-supervised learning with hard negative pair mining outperforms previous state-of-the-art self-supervised learning approaches on linear evaluation, semi-supervised learning and transfer learning on the ImageNet dataset.

2 Related Work

The mainstream unsupervised/self-supervised learning literature generally involves two aspects: data/feature-related pretext tasks and loss functions He and others (2020). The data/feature-related pretext tasks can typically be constructed by a multi-view data/feature generation process Jing and Tian (2020). Through solving the pretext task, the deep network of self-supervised learning is expected to learn a good representation for downstream tasks. Loss function design can often improve the performance of self-supervised learning significantly. Our method focuses on a novel loss function built on a student-teacher network design. Next we discuss related studies with respect to these aspects.

Contrastive loss measures the similarity of image pairs in the feature space Hadsell et al. (2006). In the contrastive learning framework, the target can be defined and generated on-the-fly during training Hadsell et al. (2006). The recent significant success of self-supervised learning has witnessed the widespread adoption of contrastive learning Henaff (2020). Zhuang et al. (2019) train an embedding function to maximize a metric of local aggregation, causing similar data instances to move together in the embedding space while allowing dissimilar instances to separate. Contrastive multi-view learning trains deep networks by maximizing mutual information between different views of the same scene Tian et al. (2019).

The student-teacher network can be used to generate multi-view representations of unlabeled data. Temporal ensembling maintains an exponential moving average (EMA) prediction as the pseudo label for self-supervised training Laine and Aila (2016). Instead of averaging label predictions, mean-teacher uses EMA to update the model weights Tarvainen and Valpola (2017). MoCo further uses momentum to update the encoder for new keys on-the-fly and maintains a queue of keys in the contrastive learning framework He and others (2020). BYOL maintains a student-teacher network to yield multiple views of samples during training Grill and others (2020); without negative sample pairs, it achieves surprisingly good performance. Momentum teacher performs two independent momentum updates for the teacher's weights and the teacher's batch normalization statistics to maintain a stable training process Li and others (2021).

3 Method

Figure 1: The architecture of contrastive self-supervised learning with hard negative pair mining.

We employ a student-teacher network to construct two representational views of each sample, as illustrated in Fig. 1. On top of the student and teacher sub-networks, we construct both positive and negative sample pairs. Specifically, we consider the representations of the same sample from the student and teacher sub-networks as a positive pair, and we only retain the most similar pairs of different samples to construct the negative pairs, i.e., hard negative pairs. We block the gradient update of the student sub-network and employ an exponential moving average (EMA) to update its parameters, which stabilizes the self-supervised training.

3.1 Student-Teacher Network

Problem definition: Un/self-supervised learning tries to learn a good representation from a large-scale unlabeled dataset $\mathcal{D}=\{I_{1},I_{2},\cdots,I_{N}\}$, where each $I$ represents an image. For an image sampled from the dataset, $I_{i}\sim\mathcal{D}$, we obtain two representational views of $I_{i}$ by constructing a student sub-network $\mathcal{S}(\cdot;\theta_{S})$ and a teacher sub-network $\mathcal{T}(\cdot;\theta_{T})$ Shin (2020). To let the network learn various invariances, we employ advanced data augmentation $\mathcal{A}$, including color jittering, horizontal flipping, Gaussian blurring and random cropping, in the data generation process of the teacher sub-network. We then obtain two views of representations for image $I_{i}$ as

U_{i}=\mathcal{T}\left(\mathcal{A}(I_{i});\theta_{T}\right),\quad U_{i}^{\prime}=\mathcal{S}(I_{i};\theta_{S}),   (1)

where $U_{i}$ is the representation from the teacher sub-network and $U^{\prime}_{i}$ is the representation from the student sub-network.
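As a concrete illustration, the following PyTorch-style sketch shows how the two views of Equation 1 could be produced; the ResNet-50 encoders, the 256-dimensional output head, and the names (teacher, student, two_views) are illustrative assumptions rather than the released implementation.

```python
# A minimal sketch of Eq. (1); encoder choice and output dimension are assumptions.
import torch
import torchvision.models as models

teacher = models.resnet50(num_classes=256)   # T(.; theta_T), updated by back-propagation
student = models.resnet50(num_classes=256)   # S(.; theta_S), updated only by EMA (Sec. 3.3)

def two_views(images_aug, images):
    """Return (U, U') for a batch: the teacher encodes the augmented view A(I),
    the student encodes the original image I, following Eq. (1)."""
    u = teacher(images_aug)                  # U_i  = T(A(I_i); theta_T)
    with torch.no_grad():                    # gradient through the student is blocked
        u_prime = student(images)            # U'_i = S(I_i; theta_S)
    return u, u_prime
```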

Self-supervised learning builds pretext tasks from these unlabeled data. The generated representation views $U_{i}$ and $U^{\prime}_{i}$ from the teacher and student sub-networks can be considered a positive pair, which belongs to the same cluster. The proposed contrastive self-supervised learning tries to yield a compact representation for images of the same cluster by minimizing their normalized $L_{2}$ distance in the representational space. The intra-cluster distance can be defined as

\mathcal{L}_{1}=\mathbb{E}_{I_{i}\sim\mathcal{D}}\left[\left(\frac{U_{i}}{\|U_{i}\|_{\infty}}-\frac{U^{\prime}_{i}}{\|U^{\prime}_{i}\|_{\infty}}\right)^{2}\right],   (2)

where images are randomly sampled from the dataset, $I_{i}\sim\mathcal{D}$, and $\|\cdot\|_{\infty}$ is the infinity norm, i.e., the maximum absolute value of the elements of the vector.
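A minimal sketch of Equation 2 is given below; the infinity-norm normalization follows the text, while summing the squared differences over feature dimensions before averaging over the batch is an assumed reduction, as is the small epsilon added for numerical safety.

```python
import torch

def inf_normalize(x, eps=1e-8):
    """Divide each representation by its infinity norm (maximum absolute value)."""
    return x / (x.abs().max(dim=1, keepdim=True).values + eps)

def positive_pair_loss(u, u_prime):
    """L1 in Eq. (2): squared L2 distance between infinity-norm-normalized views."""
    diff = inf_normalize(u) - inf_normalize(u_prime)
    return (diff ** 2).sum(dim=1).mean()
```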

3.2 Hard Negative Pair Mining (HNPM)

It is not efficient to train a self-supervised network using only positive pairs of samples. Current self-supervised learning methods use large batch sizes Chen et al. (2020a), a memory bank Wu et al. (2018) or a large dynamic dictionary He and others (2020) to achieve promising results. Adding negative image pairs can significantly improve the training efficiency of a self-supervised learning model.

We heuristically construct negative pairs in the self-supervised learning framework by mining hard negative image pairs. For two different images $I_{i}$ and $I_{j}$, we measure their dissimilarity by the normalized $L_{2}$ distance in the representation space:

U_{j}=\mathcal{T}\left(\mathcal{A}(I_{j});\theta_{T}\right),\quad \text{DisSim}(U_{i}^{\prime},U_{j})=\left(\frac{U_{i}^{\prime}}{\|U_{i}^{\prime}\|_{\infty}}-\frac{U_{j}}{\|U_{j}\|_{\infty}}\right)^{2}.   (3)

There exists a large number of negative sample pairs. Hard samples have been widely shown to improve the performance of deep learning models Ren et al. (2015); Lin and others (2017). In our self-supervised learning framework, we define hard negative pairs as image pairs with small dissimilarity according to Equation 3, and we maximize the normalized $L_{2}$ distance, i.e., the dissimilarity, of negative image pairs. The contrastive loss for negative pairs can be derived as

\mathcal{L}_{2}=-\mathbb{E}_{I_{i}\sim\mathcal{D}}\Big[\log\Big(\sum_{I_{j}\in\tilde{\mathcal{B}}_{i}}\text{DisSim}(U_{i}^{\prime},U_{j})\Big)\Big],   (4)

where images are randomly sampled from the dataset, $I_{i}\sim\mathcal{D}$, and $\tilde{\mathcal{B}}_{i}$ is the hard negative sample set of the current batch $\mathcal{B}_{i}$ for image $I_{i}$. The hard negative sample set $\tilde{\mathcal{B}}_{i}$ is constructed as

\tilde{\mathcal{B}}_{i}=\{I_{j}\,|\,I_{j}\in\mathcal{B}_{i},\,I_{j}\neq I_{i},\,\text{DisSim}(U_{i}^{\prime},U_{j})\leq 1\}.   (5)

We construct hard negative pairs on-the-fly in training, which can be used to efficiently train the self-supervised network.
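The sketch below illustrates Equations 3-5 for one batch, reusing inf_normalize from the previous sketch; summing DisSim over feature dimensions and the small epsilon inside the logarithm are assumptions for numerical robustness.

```python
import torch

def dissim_matrix(u_prime, u):
    """d[i, j] = DisSim(U'_i, U_j) for all pairs in a batch (Eq. 3)."""
    a = inf_normalize(u_prime).unsqueeze(1)          # B x 1 x D
    b = inf_normalize(u).unsqueeze(0)                # 1 x B x D
    return ((a - b) ** 2).sum(dim=2)                 # B x B dissimilarity matrix

def hard_negative_loss(u_prime, u, eps=1e-8):
    """L2 in Eq. (4) over the hard negative set of Eq. (5)."""
    d = dissim_matrix(u_prime, u)
    batch = d.size(0)
    not_self = ~torch.eye(batch, dtype=torch.bool, device=d.device)
    hard = (d <= 1.0) & not_self                     # I_j != I_i and DisSim(U'_i, U_j) <= 1
    hard_sum = (d * hard).sum(dim=1)                 # sum of dissimilarities of hard pairs
    return -torch.log(hard_sum + eps).mean()         # eps avoids log(0) when no hard pair exists
```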

3.3 Network Update

To stabilize the training and avoid a collapsing solution in self-supervised learning Chen and He (2020), we block the gradient for the student sub-network $\mathcal{S}(\cdot;\theta_{S})$ and employ the exponential moving average (EMA) to update its parameters $\theta_{S}$ Tarvainen and Valpola (2017):

\theta_{S}\leftarrow\tau\theta_{S}+(1-\tau)\theta_{T},   (6)

where $\tau$ is a smoothing coefficient that tunes the update strength of the student sub-network.

In the back-propagation, we only use the gradient to update the parameters of the teacher sub-network. The overall loss function can be derived as

\mathcal{L}(\theta_{T})=\alpha_{1}\mathcal{L}_{1}+\alpha_{2}\mathcal{L}_{2},   (7)

where $0<\alpha_{1}<1$ and $0<\alpha_{2}<1$ are fixed coefficients that tune the trade-off between the intra-cluster loss and the inter-cluster loss. During back-propagation, we employ gradient clipping to stabilize the training.
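Putting the pieces together, one training step could look like the following sketch, which reuses the functions defined in the earlier sketches; the coefficients, optimizer, and clipping norm follow Section 3.5, and everything else is an illustrative assumption.

```python
import torch

optimizer = torch.optim.Adam(teacher.parameters(), lr=0.1)
alpha1, alpha2, tau = 0.8, 0.1, 0.5                  # values from Section 3.5

def train_step(images_aug, images):
    u, u_prime = two_views(images_aug, images)
    loss = alpha1 * positive_pair_loss(u, u_prime) \
         + alpha2 * hard_negative_loss(u_prime, u)   # overall loss, Eq. (7)

    optimizer.zero_grad()
    loss.backward()                                  # gradients reach the teacher only
    torch.nn.utils.clip_grad_norm_(teacher.parameters(), max_norm=1.0)
    optimizer.step()

    # EMA update of the student, Eq. (6): theta_S <- tau * theta_S + (1 - tau) * theta_T
    with torch.no_grad():
        for p_s, p_t in zip(student.parameters(), teacher.parameters()):
            p_s.mul_(tau).add_(p_t, alpha=1.0 - tau)
    return loss.item()
```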

3.4 Connection with InfoNCE and Stability

In our method, we employ hard negative pair mining (HNPM) to add negative image pairs to the training and use a normalized $L_{2}$ distance in the loss function. We will demonstrate that minimizing our loss is equivalent to minimizing the InfoNCE loss Oord et al. (2018). To simplify the analysis, we temporarily remove the hard negative pair mining mechanism in the derivation of the connection with InfoNCE.

The InfoNCE loss Oord et al. (2018) can be written as

\mathcal{L}_{NCE}=-\mathbb{E}_{I_{i}\sim\mathcal{D}}\Big[\log\frac{f_{k}(U_{i},U_{i}^{\prime})}{\sum_{I_{j}\in\mathcal{D}}f_{k}(U_{j},U_{i}^{\prime})}\Big],   (8)

where $U_{i}$ and $U_{i}^{\prime}$ are computed by the teacher and the student sub-network, respectively, and $f_{k}(\cdot,\cdot)$ models the mutual information between the encoded representations in InfoNCE; a similarity measure can be used as a surrogate to approximate the mutual information.

We define the similarity as the reciprocal of the normalized $L_{2}$ distance between the encoded representations. The InfoNCE loss can then be written as

\mathcal{L}_{NCE}\triangleq\mathbb{E}_{I_{i}\sim\mathcal{D}}\Big[\log\frac{\text{DisSim}(U_{i},U_{i}^{\prime})}{\sum_{I_{j}\in\mathcal{D}}\text{DisSim}(U_{j},U_{i}^{\prime})}\Big]=\mathbb{E}_{I_{i}\sim\mathcal{D}}\Big[\log\Big(\frac{U_{i}}{\|U_{i}\|_{\infty}}-\frac{U_{i}^{\prime}}{\|U_{i}^{\prime}\|_{\infty}}\Big)^{2}\Big]-\mathbb{E}_{I_{i}\sim\mathcal{D}}\Big[\log\Big(\sum_{I_{j}\in\mathcal{D}}\Big(\frac{U_{j}}{\|U_{j}\|_{\infty}}-\frac{U_{i}^{\prime}}{\|U_{i}^{\prime}\|_{\infty}}\Big)^{2}\Big)\Big].   (9)

The second part of the derived loss in Equation 9 is the same as our negative pair loss in Equation 4 if we temporarily neglect the hard negative sample pair mining for each batch. Minimizing the first part of Equation 9 is equivalent to minimizing $\mathbb{E}_{I_{i}\sim\mathcal{D}}\big[\big(\frac{U_{i}}{\|U_{i}\|_{\infty}}-\frac{U_{i}^{\prime}}{\|U_{i}^{\prime}\|_{\infty}}\big)^{2}\big]$, which is the positive pair loss in Equation 2. From the above derivation, we conclude that, with proper relaxation and assumptions, minimizing our loss is equivalent to minimizing the InfoNCE loss.

Next we demonstrate that hard negative pair mining (HNPM) leads to stable training. Without the trade-off factors $\alpha_{1}$ and $\alpha_{2}$, the loss can be written as

\mathcal{L}=\mathbb{E}_{I_{i}\sim\mathcal{D}}\Big[\Big(\frac{U_{i}}{\|U_{i}\|_{\infty}}-\frac{U_{i}^{\prime}}{\|U_{i}^{\prime}\|_{\infty}}\Big)^{2}\Big]-\mathbb{E}_{I_{i}\sim\mathcal{D}}\Big[\log\Big(\sum_{I_{j}\in\tilde{\mathcal{B}}_{i}}\Big(\frac{U_{j}}{\|U_{j}\|_{\infty}}-\frac{U_{i}^{\prime}}{\|U_{i}^{\prime}\|_{\infty}}\Big)^{2}\Big)\Big].   (10)

Without loss of generality, we remove the normalization constraint and denote $\frac{U_{i}}{\|U_{i}\|_{\infty}}$ as $U_{i}$:

\mathcal{L}=\mathbb{E}_{I_{i}\sim\mathcal{D}}\Big[(U_{i}-U_{i}^{\prime})^{2}-\log\Big(\sum_{I_{j}\in\tilde{\mathcal{B}}_{i}}(U_{j}-U_{i}^{\prime})^{2}\Big)\Big].   (11)

Hard negative pair mining (HNPM) always explores negative pairs with an $L_{2}$ distance smaller than 1, which guarantees that $(U_{j}-U_{i}^{\prime})^{2}$ is bounded by 1. We use $M$ to denote the upper bound of the negative pair loss:

|\mathcal{L}|\leq\mathbb{E}_{I_{i}\sim\mathcal{D}}\big[(U_{i}-U_{i}^{\prime})^{2}\big]+M.   (12)

Next we show that Equation 12 can be optimized stably and that its first part, i.e., the loss of positive pairs, can be decreased continually by escaping undesirable equilibria. If the model gets stuck in an undesirable equilibrium, the feature representation of the teacher sub-network can be denoted as $\mathbb{E}[U_{i}^{\prime}|U_{i}]$ from the update rule in Equation 6. The loss of positive pairs $\mathcal{L}_{P}$ can be derived as

\mathcal{L}_{P}=\mathbb{E}_{I_{i}\sim\mathcal{D}}\big[(U_{i}-U_{i}^{\prime})^{2}\big]=\mathbb{E}_{I_{i}\sim\mathcal{D}}\big[(\mathbb{E}[U_{i}^{\prime}|U_{i}]-U_{i}^{\prime})^{2}\big]=\mathbb{E}_{I_{i}\sim\mathcal{D}}\big[\mathrm{Var}(U_{i}^{\prime}|U_{i})\big].   (13)

Let $Z$ denote additional variability induced by stochasticity in the training dynamics. During training, there always exists a solution that leads to a lower loss and escapes the current equilibrium, because

\mathrm{Var}(U_{i}^{\prime}|U_{i},Z)\leq\mathrm{Var}(U_{i}^{\prime}|U_{i}).   (14)

From the above derivation, the learning is stable with the benefit of hard negative pair mining and student sub-network updating rule.

3.5 Implementation Details

Because of our advanced learning strategy, we do not use any pretrained model as the backbone in our implementation. To generate multi-view representations, we employ data augmentation to model various variations in different views.

We use residual networks as the student sub-network $\mathcal{S}(\cdot;\theta_{S})$ and the teacher sub-network $\mathcal{T}(\cdot;\theta_{T})$. For the two coefficients of the loss in Equation 7, $\alpha_{1}$ is set to 0.8 and $\alpha_{2}$ is set to 0.1. We employ gradient clipping in the back-propagation with a maximum gradient norm of 1.0. The Adam optimizer is used to minimize the loss in Equation 7. The batch size is 160. The learning rate is set to 0.1 with a cosine annealing schedule whose maximum number of iterations is 100. The smoothing coefficient $\tau$ in the student sub-network update of Equation 6 is set to 0.5.

We apply data augmentation for the teacher sub-network on-the-fly during training. We first apply color jittering with brightness of 0.8, contrast of 0.8, saturation of 0.8, and hue of 0.2 to a random 80% of training images in each batch. Then we convert a random 20% of images to grayscale and horizontally flip 50% of images. After that, we smooth a random 10% of images with a Gaussian kernel of size $3\times 3$ and standard deviation of $1.5\times 1.5$. Finally, we crop each image with a random crop scale in the range $[0.8,1.0]$. We normalize the RGB channels with mean $[0.485,0.456,0.406]$ and standard deviation $[0.229,0.224,0.225]$.
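A torchvision sketch of this augmentation pipeline is given below; it assumes tensor images scaled to [0, 1], a 224x224 output crop, and the standard torchvision transform classes as approximations of the described operations.

```python
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomApply(                               # color jittering on 80% of images
        [transforms.ColorJitter(brightness=0.8, contrast=0.8, saturation=0.8, hue=0.2)],
        p=0.8),
    transforms.RandomGrayscale(p=0.2),                    # 20% converted to grayscale
    transforms.RandomHorizontalFlip(p=0.5),               # 50% horizontally flipped
    transforms.RandomApply(                               # Gaussian blur on 10% of images
        [transforms.GaussianBlur(kernel_size=3, sigma=1.5)],
        p=0.1),
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),  # random crop, scale in [0.8, 1.0]
    transforms.Normalize(mean=[0.485, 0.456, 0.406],      # channel-wise normalization
                         std=[0.229, 0.224, 0.225]),
])
```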

4 Experiments

We conduct experiments to validate the performance of the proposed method on the ILSVRC-2012 dataset.

4.1 Linear Evaluation

Method Top-1 Top-5
CPCv2 Henaff (2020) 63.8 85.3
CMC Tian et al. (2019) 66.2 87.0
SimCLR Chen et al. (2020a) 69.3 89.0
MoCov2 Chen and others (2020) 71.1 N/A
SimCLRv2 Chen et al. (2020b) 71.7 N/A
InfoMin Aug. Tian et al. (2020) 73.0 91.1
BYOL Grill and others (2020) 74.3 91.6
Ours 77.1 93.7
Table 1: The accuracy comparison of self-supervised learning (SSL) approaches with the ResNet-50 encoder based on linear evaluation on the ImageNet dataset. The bold face denotes the best accuracy.
Method Depth Width Top-1 Top-5
CMC 50 2× 70.6 89.7
SimCLRv2 50 2× 75.6 N/A
BYOL 50 2× 77.4 93.6
Ours 50 2× 79.4 94.5
SimCLR 50 4× 76.5 93.2
BYOL 50 4× 78.6 94.2
Ours 50 4× 80.3 95.1
BYOL 200 2× 79.6 94.8
Ours 200 2× 81.9 96.4
Table 2: The accuracy (%) comparison of SSL methods with other ResNet encoders based on linear evaluation.

Linear evaluation assesses the accuracy of self-supervised learning (SSL) by freezing the SSL model and training a separate linear classifier on top of it Grill and others (2020); Kornblith et al. (2019); Zhang and others (2016). We compare our method with previous state-of-the-art approaches using the ResNet-50 encoder and other ResNet encoders on ImageNet in Table 1 and Table 2, respectively. The top-1 and top-5 accuracy are listed. With the standard ResNet-50 encoder He et al. (2016), our method obtains 77.1% top-1 accuracy and 93.7% top-5 accuracy, which outperform the previous state-of-the-art top-1 and top-5 results by 2.8% and 2.1%, respectively. Most surprisingly, our method achieves 0.6% higher accuracy than the 76.5% of the supervised baseline reported in SimCLR Chen et al. (2020a).

Table 2 reports the accuracy of self-supervised learning methods using deeper and wider ResNet encoders based on linear evaluation. Our method with ResNet-200 (2×) obtains 81.9% top-1 and 96.4% top-5 accuracy, which improve the previous best top-1 and top-5 accuracy by 2.3% and 1.6%, respectively. With ResNet-50 (2×) and ResNet-50 (4×) encoders, our method also achieves better accuracy than CMC Tian et al. (2019), SimCLRv2 Chen et al. (2020b) and BYOL Grill and others (2020) with the same encoders.
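For reference, a minimal sketch of the linear evaluation protocol is shown below: the SSL encoder is frozen and only a new linear classifier is trained on the labeled data; encoder, feature_dim, train_loader, and the optimizer settings are illustrative placeholders, not the exact protocol of the compared works.

```python
import torch
import torch.nn as nn

def linear_evaluation(encoder, feature_dim, train_loader, num_classes=1000, epochs=90):
    """Train a linear classifier on top of a frozen SSL encoder."""
    encoder.eval()
    for p in encoder.parameters():
        p.requires_grad = False                      # freeze the pretrained encoder

    classifier = nn.Linear(feature_dim, num_classes)
    optimizer = torch.optim.SGD(classifier.parameters(), lr=0.1, momentum=0.9)
    criterion = nn.CrossEntropyLoss()

    for _ in range(epochs):
        for images, labels in train_loader:
            with torch.no_grad():
                features = encoder(images)           # frozen features
            loss = criterion(classifier(features), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return classifier
```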

4.2 Semi-Supervised Learning

Method Top-1 (1%) Top-5 (1%) Top-1 (10%) Top-5 (10%)
SimCLR Chen et al. (2020a) 48.3 75.5 65.6 87.8
SimCLRv2 Chen et al. (2020b) 57.9 N/A 68.4 N/A
BYOL Grill and others (2020) 53.2 78.4 68.8 89.0
Ours 56.7 80.2 (1.8↑) 73.4 (4.6↑) 92.5 (3.5↑)
Table 3: The accuracy (%) comparison of SSL methods with the ResNet-50 encoder based on semi-supervised learning on ImageNet dataset.
Method Depth Width Params Top-1 (1%) Top-5 (1%) Top-1 (10%) Top-5 (10%)
SimCLR Chen et al. (2020a) 50 2× 94M 58.5 83.0 71.7 91.2
BYOL Grill and others (2020) 50 2× 94M 62.2 84.1 73.5 91.7
Ours 50 2× 94M 65.7 86.2 78.6 (5.1↑) 93.2 (1.5↑)
SimCLR Chen et al. (2020a) 50 4× 375M 63.0 85.8 74.4 92.6
BYOL Grill and others (2020) 50 4× 375M 69.1 87.9 75.7 92.5
Ours 50 4× 375M 70.3 89.9 78.9 (3.2↑) 95.5 (2.9↑)
BYOL Grill and others (2020) 200 2× 250M 71.2 87.9 77.7 92.5
Ours 200 2× 250M 76.5 90.3 80.7 (3.0↑) 95.4 (2.9↑)
SimCLRv2 distilled Chen et al. (2020b) 50 1× N/A 73.9 91.5 77.5 93.4
SimCLRv2 distilled Chen et al. (2020b) 50 2× N/A 75.9 93.0 80.2 95.0
SimCLRv2 self-distilled Chen et al. (2020b) 152 3× N/A 76.6 93.4 80.9 95.5
Ours 152 3× N/A 77.6 94.2 81.3 95.7
Table 4: The accuracy (%) comparison of SSL approaches with other ResNet encoders, including selective kernel convolution (SK), based on semi-supervised learning on the ImageNet dataset.

Semi-supervised learning can also be used to evaluate the accuracy of self-supervised learning (SSL) by fine-tuning the representation with a small subset of the training set Grill and others (2020). In this experiment, we use the fixed 1% and 10% splits of the ImageNet training set, the same as Grill and others (2020). We again use the top-1 and top-5 accuracy as the evaluation metrics. The comparisons using the ResNet-50 encoder and deeper and wider ResNet encoders are listed in Table 3 and Table 4, respectively. Our method achieves 80.2% top-5 accuracy with a ResNet-50 encoder, which improves the previous best result by 1.8% using only 1% of the training labels (Table 3). Using 10% of the training labels, our method achieves 73.4% top-1 and 92.5% top-5 accuracy, which improve the previous best top-1 and top-5 accuracy by 4.6% and 3.5%.

The results with ResNet encoders of various depths, widths, and selective kernel convolution Li and others (2019) configurations are listed in Table 4. Our method achieves the best top-1 and top-5 accuracy for all experimental configurations. Specifically, with the ResNet-50 (2×) encoder, our method achieves 65.7% and 78.6% top-1 accuracy using 1% and 10% of the training labels, which improves the previous best top-1 accuracy by 3.5% and 5.1%, respectively. With ResNet-200 (2×), our method obtains 76.5% and 80.7% top-1 accuracy using 1% and 10% of the training labels, which improves the accuracy of BYOL Grill and others (2020) by 5.3% and 3.0%.
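A sketch of the semi-supervised protocol follows: the pretrained encoder and a new classification head are fine-tuned end-to-end on a 1% or 10% labeled subset; the subset indices, batch size, learning rate, and epoch count are illustrative assumptions, and the encoder is assumed to output a flat feature vector.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Subset

def finetune_on_subset(encoder, feature_dim, dataset, subset_indices,
                       num_classes=1000, epochs=30, lr=0.01):
    """Fine-tune the whole network on a small labeled subset (1% or 10%)."""
    model = nn.Sequential(encoder, nn.Linear(feature_dim, num_classes))
    loader = DataLoader(Subset(dataset, subset_indices), batch_size=256, shuffle=True)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    criterion = nn.CrossEntropyLoss()

    model.train()
    for _ in range(epochs):
        for images, labels in loader:
            loss = criterion(model(images), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```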

4.3 Transfer Learning

Method Food101 CIFAR-10 SUN397 Cars Pets VOC 2007 Flowers
BYOL Grill and others (2020) 75.3 91.3 60.6 67.8 90.4 82.5 96.1
SimCLR Chen et al. (2020a) 68.4 90.6 58.8 50.3 83.6 80.5 91.2
Supervised-IN Chen et al. (2020a) 72.3 93.6 61.9 66.7 91.5 82.8 94.7
Ours 77.6 92.4 65.4 72.3 94.2 87.7 97.0
Table 5: The transfer learning accuracy (%) comparison of SSL approaches with the ResNet-50 encoder pretrained on ImageNet, based on linear evaluation.
Method Food101 CIFAR-10 SUN397 Cars Pets VOC 2007 Flowers
BYOL Grill and others (2020) 88.5 97.8 63.7 91.6 91.7 85.4 97.0
SimCLR Chen et al. (2020a) 88.2 97.7 63.5 91.3 89.2 84.1 97.0
Supervised-IN Chen et al. (2020a) 88.3 97.5 64.3 92.1 92.1 85.0 97.6
Ours 89.1 98.0 64.1 92.1 92.8 85.3 97.5
Table 6: The transfer learning accuracy (%) comparison of SSL approaches with the ResNet-50 encoder pretrained on ImageNet, based on fine-tuning.

Transfer learning is another widely used task to evaluate the accuracy of self-supervised learning (SSL) methods, as it measures the generalization ability of the learned SSL model. In practice, both linear evaluation, i.e., training only the last classification layer, and fine-tuning the whole network on the target dataset can be employed. The comparisons of transfer learning with linear evaluation and with fine-tuning are listed in Table 5 and Table 6, respectively.

For transfer learning with linear evaluation, our method achieves better accuracy than previous state-of-the-art approaches on six out of seven widely used transfer learning datasets in Table 5: our method improves accuracy by 2.3%, 3.5%, 4.5%, 2.7%, 4.9% and 0.9% on Food101, SUN397, Cars, Pets, VOC 2007 and Flowers, respectively. On average, the transfer learning accuracy of our method is 2.5% higher than the previous best results based on linear evaluation. For transfer learning with fine-tuning, our method achieves the best accuracy on four out of seven tasks in Table 6.

Figure 2: The loss (left) and accuracy (right) comparison w.r.t. training epochs for the ablation study of hard negative pair mining (HNPM) and blocking the gradient of the student sub-network, based on linear evaluation with the ResNet-200 (2×) encoder on ImageNet.

4.4 Ablation Study

$\tau$ 1.0 0.999 0.5 0.0
Top-1 (%) 24 73.4 77.1 49.1
Table 7: The effect of the smoothing coefficient $\tau$ in the exponential moving average, with the ResNet-50 encoder based on linear evaluation.

Coefficient $\tau$ in the update of the student sub-network. We investigate the accuracy of our method under linear evaluation with the ResNet-50 encoder with respect to the smoothing coefficient $\tau$ of the exponential moving average (EMA) in Table 7. The larger $\tau$ is, the smaller the update of the student sub-network. When $\tau$ is 0, the weights of the teacher sub-network are copied to the student sub-network at each step; when $\tau$ is 1, the student sub-network is never updated. We find that a moving average coefficient of 0.5 yields the best top-1 accuracy, 77.1%, under linear evaluation, and neither $\tau=0$ nor $\tau=1$ gives good performance.

Hard negative pair mining (HNPM). We conduct an ablation study on hard negative pair mining (HNPM) based on the linear evaluation task using the ResNet-200 (2×) encoder on the ImageNet dataset. Training with all negative pairs, i.e., without HNPM, is denoted as “w/o HNPM + block student gradient”, and our method trained with HNPM is denoted as “w/ HNPM + block student gradient”. The loss and accuracy comparison w.r.t. training epochs for the two methods is shown in Fig. 2. With hard negative pair mining, the training of our method is much more stable, and it achieves a lower loss and higher accuracy than training without hard negative pair mining.

Blocking the gradient of the student sub-network. We also conduct an ablation study on blocking the gradient of the student sub-network in Fig. 2. Training without blocking the gradient of the student sub-network is denoted as “w/ HNPM + student gradient”. Our method achieves a lower loss and higher accuracy than the variant that updates the student sub-network by gradients.

5 Conclusion

In this work, we introduce a self-supervised learning framework built on a student-teacher network with a contrastive loss. To increase the training efficiency, we add hard negative image pairs into the contrastive self-supervised learning paradigm. To stabilize the training and avoid a collapsing solution, we block the gradient of the student sub-network and update its parameters with an exponential moving average. We also conduct ablation studies to validate the effectiveness of each component. Extensive experiments demonstrate that our method achieves better performance than previous state-of-the-art approaches based on linear evaluation, semi-supervised learning and transfer learning on the ImageNet dataset.

References

  • Bansal et al. [2018] Aayush Bansal, Shugao Ma, Deva Ramanan, and Yaser Sheikh. Recycle-gan: Unsupervised video retargeting. In ECCV, pages 119–135, 2018.
  • Chen and He [2020] Xinlei Chen and Kaiming He. Exploring simple siamese representation learning. arXiv preprint arXiv:2011.10566, 2020.
  • Chen and others [2020] Xinlei Chen et al. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297, 2020.
  • Chen et al. [2020a] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709, 2020.
  • Chen et al. [2020b] Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, and Geoffrey Hinton. Big self-supervised models are strong semi-supervised learners. arXiv preprint arXiv:2006.10029, 2020.
  • Fernando and others [2017] Basura Fernando et al. Self-supervised video representation learning with odd-one-out networks. In CVPR, 2017.
  • Grill and others [2020] Jean-Bastien Grill et al. Bootstrap your own latent: A new approach to self-supervised learning. arXiv preprint arXiv:2006.07733, 2020.
  • Hadsell et al. [2006] Raia Hadsell, Sumit Chopra, and Yann LeCun. Dimensionality reduction by learning an invariant mapping. In CVPR, 2006.
  • He and others [2020] Kaiming He et al. Momentum contrast for unsupervised visual representation learning. In CVPR, 2020.
  • He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
  • Henaff [2020] Olivier Henaff. Data-efficient image recognition with contrastive predictive coding. In ICML, pages 4182–4192. PMLR, 2020.
  • Jing and Tian [2020] Longlong Jing and Yingli Tian. Self-supervised visual feature learning with deep neural networks: A survey. IEEE TPAMI, 2020.
  • Kornblith et al. [2019] Simon Kornblith, Jonathon Shlens, and Quoc V Le. Do better imagenet models transfer better? In CVPR, pages 2661–2671, 2019.
  • Laine and Aila [2016] Samuli Laine and Timo Aila. Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242, 2016.
  • Lee et al. [2017] Hsin-Ying Lee, Jia-Bin Huang, Maneesh Singh, and Ming-Hsuan Yang. Unsupervised representation learning by sorting sequences. In ICCV, 2017.
  • Li and others [2019] Xiang Li et al. Selective kernel networks. In CVPR, 2019.
  • Li and others [2021] Zeming Li et al. Momentum^2 teacher: Momentum teacher with momentum statistics for self-supervised learning. arXiv preprint arXiv:2101.07525, 2021.
  • Lin and others [2017] Tsung-Yi Lin et al. Focal loss for dense object detection. In ICCV, 2017.
  • Mathieu and others [2016] Michael Mathieu et al. Deep multi-scale video prediction beyond mean square error. In ICLR, 2016.
  • Misra and others [2016] Ishan Misra et al. Shuffle and learn: unsupervised learning using temporal order verification. In ECCV, 2016.
  • Noroozi and Favaro [2016] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV. Springer, 2016.
  • Oord et al. [2018] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
  • Pathak and others [2016] Deepak Pathak et al. Context encoders: Feature learning by inpainting. In CVPR, 2016.
  • Qian et al. [2020] Rui Qian, Tianjian Meng, Boqing Gong, Ming-Hsuan Yang, Huisheng Wang, Serge Belongie, and Yin Cui. Spatiotemporal contrastive video representation learning. arXiv preprint arXiv:2008.03800, 2020.
  • Ren et al. [2015] Shaoqing Ren, Kaiming He, Ross B Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In NIPS, 2015.
  • Russakovsky and others [2015] Olga Russakovsky et al. Imagenet large scale visual recognition challenge. IJCV, 2015.
  • Shin [2020] Minchul Shin. Semi-supervised learning with a teacher-student network for generalized attribute prediction. arXiv preprint arXiv:2007.06769, 2020.
  • Tarvainen and Valpola [2017] Antti Tarvainen and Harri Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In NeurIPS, 2017.
  • Tian et al. [2019] Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive multiview coding. arXiv preprint arXiv:1906.05849, 2019.
  • Tian et al. [2020] Yonglong Tian, Chen Sun, Ben Poole, Dilip Krishnan, Cordelia Schmid, and Phillip Isola. What makes for good views for contrastive learning. arXiv preprint arXiv:2005.10243, 2020.
  • Vaswani and others [2017] Ashish Vaswani et al. Attention is all you need. In NIPS, 2017.
  • Wu et al. [2018] Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance discrimination. In CVPR, 2018.
  • Xu and others [2019] Dejing Xu et al. Self-supervised spatiotemporal learning via video clip order prediction. In CVPR, 2019.
  • Zhang and others [2016] Richard Zhang et al. Colorful image colorization. In ECCV, 2016.
  • Zhu et al. [2021] Wentao Zhu, Yufang Huang, Daguang Xu, Zhen Qian, Wei Fan, and Xiaohui Xie. Test-time training for deformable multi-scale image registration. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 13618–13625. IEEE, 2021.
  • Zhuang et al. [2019] Chengxu Zhuang, Alex Lin Zhai, and Daniel Yamins. Local aggregation for unsupervised learning of visual embeddings. In ICCV, 2019.