Rethinking Evaluation Protocols of Visual Representations Learned via Self-supervised Learning

Jae-Hun Lee¹ Doyoung Yoon¹¹¹footnotemark: 1 ByeongMoon Ji¹¹¹footnotemark: 1 Kyungyul Kim¹ Sangheum Hwang^1,2

¹LG CNS AI Research, Seoul, South Korea
²Department of Data Science, Seoul National University of Science and Technology, Seoul, South Korea
{Jaehun.Lee, dy0916, jibm, kyungyul.kim, shwang}@lgcns.com Equal contributionCorresponding author

Abstract

Linear probing (LP) (and $k$ -NN) on the upstream dataset with labels (e.g., ImageNet) and transfer learning (TL) to various downstream datasets are commonly employed to evaluate the quality of visual representations learned via self-supervised learning (SSL). Although existing SSL methods have shown good performances under those evaluation protocols, we observe that the performances are very sensitive to the hyperparameters involved in LP and TL. We argue that this is an undesirable behavior since truly generic representations should be easily adapted to any other visual recognition task, i.e., the learned representations should be robust to the settings of LP and TL hyperparameters. In this work, we try to figure out the cause of performance sensitivity by conducting extensive experiments with state-of-the-art SSL methods. First, we find that input normalization for LP is crucial to eliminate performance variations according to the hyperparameters. Specifically, batch normalization before feeding inputs to a linear classifier considerably improves the stability of evaluation, and also resolves inconsistency of $k$ -NN and LP metrics. Second, for TL, we demonstrate that a weight decay parameter in SSL significantly affects the transferability of learned representations, which cannot be identified by LP or $k$ -NN evaluations on the upstream dataset. We believe that the findings of this study will be beneficial for the community by drawing attention to the shortcomings in the current SSL evaluation schemes and underscoring the need to reconsider them.

1 Introduction

Self-supervised learning (SSL) has emerged as a promising approach for learning generic visual representations with large amounts of unlabeled data. To learn such useful representations with unlabeled data, pretext tasks are typically defined as proxies for training objectives. Many studies have proposed effective learning strategies: contrastive learning that performs instance discrimination based on randomly augmented views [30, 5, 7, 2], a teacher-student framework that trains representations by using outputs of a momentum encoder as supervision [13, 11, 4, 21], and masked image modeling [15, 3, 23, 35] that aims to reconstruct randomly masked patches.

In previous studies, the quality of representations learned via SSL has been evaluated in terms of discriminability and transferability. The former focuses on how well high-level semantics of pretraining data (i.e., upstream data) are discriminated on the learned representation space. The latter addresses whether the learned representations are generally useful for other visual recognition datasets (i.e., downstream data). Ideally, it is desirable that truly generic representations have both of these properties: they should capture object-level concepts of upstream data, while being transferable to other downstream tasks.

Commonly used schemes for the evaluation of discriminability are linear probing (LP, a.k.a. linear evaluation) and $k$ -NN [4]. They perform linear or $k$ -NN classification with the learned representations frozen. Although these schemes are intuitive and widely accepted in the community, we observe that in the case of LP, considerable effort (e.g., hyperparameter tuning) has been devoted to training such a simple linear classifier. In other words, LP performance is very sensitive to the hyperparameters (results shown in Section 3). Furthermore, the rank of the reported performance measured by LP and $k$ -NN does not match (refer to Figure 1 in [14]). This makes it difficult to compare the discriminability of representations learned by existing SSL methods.

To evaluate transferability, transfer learning (TL), i.e., finetuning (FT) a pretrained SSL backbone with various downstream datasets, is typically employed. One can expect good classification performance across datasets if the pretrained features are considered as general-purpose visual representations. However, similar to LP, it is observed that the TL performance of the SSL backbone varies significantly depending on the hyperparameters involved in FT while that of a supervised pretrained backbone does not (results shown in Section 4). Since existing SSL methods are highly tuned by exploring a substantially large hyperparameter space, it is challenging to determine which SSL method can provide useful representations with broad applicability across diverse downstream tasks.

We claim that truly generic representations should be easily transferable and applicable to other visual recognition tasks. From a technical point of view, pretrained representations must be insensitive to the setting of hyperparameters employed in the evaluation protocols (i.e., LP and TL). To examine hyperparameter sensitivity, we conduct extensive experiments under cross-settings of hyperparameters. For example, we use LP or TL setting of DINO [4] to evaluate the corresponding performance of the MoCo v3 [7] model, and vice versa. Experimental results reveal that both LP and TL show a large performance gap depending on the setting of hyperparameters. Then, which factors contribute to the sensitivity of performance? Does this imply that the quality of representations from current SSL methods does not yet meet our expectations? Or are we missing something in the current evaluation protocols? In this work, we address these questions. It should be noted that our goal is not to compare the performance of existing SSL methods, but rather to identify the source of performance sensitivity under current evaluation schemes.

For LP, we find that distributions of representations obtained from each SSL method show a significant difference. A missing component in most previous SSL studies (except MAE [15]) is input normalization although it is a basic and indispensable preprocessing step for effective training. Our observations indicate that applying batch normalization (BN) before feeding inputs (i.e., SSL representations) to a linear classifier results in nearly eliminating performance variations across different hyperparameter settings. Furthermore, the inconsistency issue between LP and $k$ -NN evaluations is resolved: a method demonstrating superior LP performance (with BN) also yields better results in $k$ -NN evaluation. In TL through FT, it is identified that there exists a hidden hyperparameter in the upstream stage that governs the performance sensitivity in the downstream stage across different hyperparameter settings: surprisingly, a weight decay parameter in SSL plays a crucial role in determining the transferability of features. All SSL backbones exhibit considerable stability and robustness in their performance across diverse hyperparameter configurations when trained with a suitable weight decay parameter during pretraining. The degree of robustness is comparable to that of a supervised pretrained backbone, which is considered to have rich discriminative features. However, it is worth mentioning that determining the appropriate weight decay value during upstream pretraining can be challenging since this value does not have a significant impact on the discriminability evaluated by LP or $k$ -NN.

Our main contributions can be summarized as follows:

•

Based on our extensive experiments, we find that the performance of existing SSL methods is highly sensitive to the hyperparameters utilized in LP and TL, highlighting the inadequacy of current evaluation schemes in assessing the quality of SSL representations.
•

We find that the cause of performance variation in LP is unnormalized inputs, and demonstrate that applying BN can eliminate such variation and lead to a more rigorous discriminability comparison.
•

We demonstrate that a weight decay parameter in SSL controls the transferability of learned representations, despite having little effect on discriminability. This implies that the current TL protocol needs to be reconsidered, as sweeping a large hyperparameter space hinders a fair assessment of transferability.

2 Related Work

Self-supervised Learning (SSL).

The SSL approach involves formulating a pretext task that is capable of leveraging unlabeled data to generate visual representations. Through training on a pretext task, a model learns to capture the underlying structure of the input data. That is, the goal of SSL methods is to build a model capable of learning meaningful representations of the input data.

Contrastive learning such as SimCLR [5] and MoCo [16] is one of the most popular techniques. In contrastive learning, differently transformed views of the same original image are considered as positive pairs, while transformed views of different images are considered as negative pairs. There are some attempts to conduct SSL using the teacher-student framework like BYOL [13] and DINO [4]. These methods typically set the teacher model as a moving average of the student model, as opposed to the conventional teacher-student framework for knowledge distillation, which incorporates a pretrained model as the teacher. MAE [15] is one of the representative methods of the masked image modeling (MIM) approach. In MIM, an image is randomly masked and a model is trained to reconstruct the masked patches of the image. Recently, some advanced works such as iBOT [35], Mugs [36], and MSN [1] have introduced, focusing on training models with multiple pretext tasks. These works aim to enhance representations by leveraging the advantages of each SSL approach.

Evaluation Protocols for SSL.

The effectiveness of the learned representations via SSL has been assessed from the viewpoints of discriminability and transferability.

To evaluate discriminability, LP has been widely used across previous works, ranging from early-stage research such as Colorization [33] and RotNet [12] to more recent works like DINO, MoCo v3, MAE, iBOT, Mugs, and MSN. It is also employed for hyperparameter tuning in SSL since it is a relatively quick evaluation protocol. Despite the simplicity of LP, which involves training only a linear classifier, we can easily observe that training configurations of existing SSL methods for LP exhibit a wide range of variations. For example, fundamental hyperparameters like batch size and learning rate vary widely, and MAE employs BN to minimize the need for learning rate expolaration, while other approaches such as DINO, MoCo v3, and Mugs do not consider BN. Moreover, the class tokens from the last four layers of ViT-S are concatenated for LP in DINO, whereas others utilize only the final output features. In addition to LP, Caron et al. [4] also utilized $k$ -NN for simpler evaluation since it has only a single hyperparameter $k$ and no training is necessary. Following DINO, iBOT and Mugs also validated the quality of representations using the $k$ -NN evaluation scheme. However, based on the experimental results of those methods, it is noticeable that there is a rank mismatch in performance between the $k$ -NN and LP evaluation results.

Prior research has also assessed the transferability of representations based on TL, particularly FT, as an additional evaluation method with various downstream datasets. This is an important aspect since the aim of SSL is to acquire general-purpose visual representations that can be valuable for other unseen images. Regarding the hyperparameter settings for FT, some recent works such as DINO, iBOT, Mugs, and MSN have adopted the default setting of DeiT [28]. However, the DeiT setting requires an excessive amount of computation during FT, e.g., 1,000 training epochs on CIFAR-10/100 [20] and Stanford Cars [19], and employs very strong regularization techniques (e.g. Mixup [32], Cutmix [31], and Random Erasing [34]). On the other hand, MoCo v3 and MAE perform only 100 epochs of training and apply a higher learning rate compared to the DeiT setting. In short, similar to LP, the hyperparameter settings for FT in existing SSL methods exhibit notable variability and tend to be overly tailored to individual downstream datasets. As a result, it becomes challenging to conduct a comprehensive evaluation of whether the learned SSL backbones can serve as general-purpose feature extractors.

To the best of our knowledge, no study has yet highlighted the issues with the current evaluation settings, except for Gwilliam et al. [14], who observed that no SSL method emerges as a clear winner under the existing evaluation settings. To facilitate comparison across SSL methods, they analyzed the learned representations using metrics such as uniformity, tolerance, and centered kernel alignment (CKA). Unlike [14], our work focuses on identifying the source of performance variations under the existing evaluation protocols rather than comparing SSL methods, and provides recommendations for addressing evaluation issues that require attention.

		k-NN	LP setting
Method	Architecture	k=20	Paper	DINO	MoCo v3	MAE	DINO w/BN	MoCo v3 w/BN
SL (DeiT) [28]	ViT-S/16	79.3	79.8	79.43	78.85	79.23	79.33	78.83
DINO [4]	ViT-S/16	74.4	77.0	76.86	70.53	75.91	76.27	75.26
MoCo v3 [7]	ViT-S/16	68.2	73.2	47.84	73.20	72.91	73.10	72.10
Mugs [36]	ViT-S/16	75.4	78.9	77.94	72.42	77.67	77.53	76.03
iBOT [35]	ViT-S/16	74.9	77.9	77.83	71.48	76.87	77.18	75.90
MSN [1]	ViT-S/16	74.8	76.9	76.63	73.68	76.31	76.76	75.56
MAE [15]	ViT-B/16	27.3	68.0	32.30	59.09	67.27	67.40	67.35

Table 1:

{k}

-NN and LP classification results on ImageNet under cross-settings. We report top-1 accuracy for

{k}

-NN and LP evaluations on ImageNet validation dataset, considering BN. DeiT shows the most stable results. MoCo v3 and MAE show the most unstable results. Applying input normalization using BN resolves the performance instability issue.

3 Discriminability: Normalization Matters

In this section, we perform an empirical evaluation of k-NN and LP on various SSL methods with cross-setting experiments to investigate how evaluation results differ according to hyperparameter settings and to determine which factors in the training settings are critical for reliable performance evaluation.

Experimental settings.

We conducted experiments on six SSL methods: DINO [4], MoCo v3 [7], iBoT [35], Mugs [36], MAE [15], and MSN [1], using the Vision Transformer architecture (ViT) [10]. To ensure a fair comparison, we used ImageNet-1K pretrained official checkpoints of each method. Specifically, we utilized the ViT-S/16 model¹¹1ViT-S of MoCo v3 uses 12 attention heads, while others have 6 heads., and for MAE, we used the ViT-B model since the ViT-S checkpoint is currently unreleased. We also compared DeiT [28] as a representative model for supervised learning. Since the DeiT model is already trained for ImageNet-1K classification with class labels, the LP performance of DeiT could be regarded as an upper-bound reference.

To compare the performance across LP settings of these SSL checkpoints, we selected three hyperparameter settings; DINO, MoCo v3, and MAE settings. These settings were selected as representative training configurations for contrastive learning, teacher-student approaches, and masked image modeling, respectively, considering differences in several hyperparameters such as batch size, learning rate, etc. For the comparison of k-NN performance, we set k to 20 because we observed little performance variation when sweeping through the values of $\textit{k}\in\{5,10,20,50,100\}$ . More details on the datasets used, LP experiment settings, and k-NN experimental results are provided in the supplementary material (Sections A, B, and C, respectively).

3.1 LP Results with Cross-settings

Table 1 shows the results of each SSL method evaluated using k-NN and LP across DINO, MoCo v3, and MAE settings. Unsurprisingly, the supervised model (DeiT) shows the most stable performance across all evaluations, including k-NN and linear evaluation with different settings. On the other hand, SSL models show large differences between evaluations across different LP settings. When comparing the performance of each model based on k-NN and LP classification accuracy, there are cases where the rankings of k-NN and LP do not align. For example, if we rank k-NN performance from 1 to 7, the LP performance of each model with the MoCo v3 setting shows a ranking of 1-4-6-2-3-5-7. Based on these evaluation results, the selection of the superior model varies depending on the hyperparameter settings used for LP evaluation. However, it can be seen that the MAE setting shows similar performance across multiple SSL checkpoints including MAE and MoCo v3 models, and the k-NN accuracy rank aligns with that of LP under the MAE setting. Thus, even if there exists hyperparameter sensitivity in SSL models themselves, we can measure the performance consistently by employing the MAE setting for LP.

3.2 Input Normalization Effect on LP

Then, what is the underlying factor that contributes to the stable performance of the MAE setting? The MAE setting has one distinct feature: the application of BN before the linear classifier, which sets it apart from other settings. As shown in Figure 1, representations obtained from each SSL checkpoint show quite different distributions. For example, the features from the Mugs model have notably lower average values compared to those from other checkpoints, while the features produced by the MAE and MoCo v3 models show relatively wider distributions (i.e., high variance). In LP, represented as $\mathbf{y=wx+b}$ , differences in mean values can be adjusted by changing the bias vector $\mathbf{b}$ , but differences in variance require adjusting the scale of each weight in $\mathbf{w}$ , which can potentially affect hyperparameter sensitivity. However, when BN is applied, the features extracted from each model show substantially similar distributions, which can reduce the sensitivity.

To observe the performance stabilization effect of input normalization on the DINO and MoCo v3 settings, we apply BN, while keeping the learnable scale and shift parameters in BN fixed to 1 and 0, respectively. As shown in Table 1, applying BN to the DINO and MoCo v3 settings can improve overall performance and robustness. When comparing the performance of each model under different settings, it can be observed that there is a significant difference of 3–26% without BN, but after applying BN, the difference is reduced to around 1%. Despite the difference in various factors such as batch size, optimizer, and learning rate, similar performance is observed across all settings. By looking at k-NN versus LP plot in Figure 2, we can clearly identify how BN influences the performance difference among different LP settings. The use of BN helps to stabilize the hyperparameter sensitivity of SSL checkpoints, and also aligns the ranking of k-NN with that of LP.

In summary, LP is highly sensitive to hyperparameter settings when evaluating the discriminability of learned representations, making it difficult to draw meaningful conclusions. However, the use of BN can significantly reduce this instability. Based on these findings, we highly recommend the use of BN as a standard evaluation protocol for discriminability, applied before the classification layer.

4 Transferability: A Hidden Parameter Exists

		TL setting
Dataset	Method	Paper	DINO	MoCo v3	Short	Gap
CIFAR-100	SL (DeiT)	89.5	90.02	88.14	85.89	4.13
	DINO	90.5	90.21	72.46	57.40	32.81
	MoCo v3	-	88.79	89.06	80.88	8.18
	Mugs	91.8	91.36	88.06	86.03	5.33
	iBOT	90.7	90.60	68.73	51.49	39.11
	MSN	90.5	90.31	86.67	79.21	11.10
Flowers-102	SL (DeiT)	98.2	97.04	96.99	94.53	2.51
	DINO	98.5	98.24	37.42	86.61	60.82
	MoCo v3	-	83.73	94.67	86.61	10.94
	Mugs	98.8	97.97	96.06	94.58	3.39
	iBOT	98.6	98.55	31.57	55.05	66.98
	MSN	-	98.36	94.80	92.29	6.07
Stanford Cars	SL (DeiT)	92.1	88.76	88.46	80.27	8.49
	DINO	93.0	92.24	10.97	8.69	83.55
	MoCo v3	-	49.27	89.43	41.49	47.94
	Mugs	93.9	91.11	89.45	80.07	11.04
	iBOT	94.0	92.35	11.62	6.11	86.24
	MSN	-	92.71	80.24	7.98	84.73
iNaturalist19	SL (DeiT)	76.6	76.34	74.16	62.64	13.70
	DINO	78.2	78.75	56.04	36.92	41.83
	MoCo v3	-	76.01	72.05	58.65	17.36
	Mugs	79.8	78.65	75.38	63.77	14.88
	iBOT	78.5	77.89	53.00	37.18	40.71
	MSN	78.1	78.22	73.10	47.72	30.50

Table 2: TL results on downstream datasets. We report top-1 accuracy on various downstream validation datasets. We use the same pretrained checkpoints used in Table 1. Supervised backbone shows stable results regardless of hyperparameter settings, but some SSL backbones show unstable outcomes. “Gap” refers to the difference between the maximum and minimum values among three settings. Models with the lowest and highest “Gap” for each dataset are shown in bold and underlined, respectively.

In addition to LP and k-NN, TL is one of the popular evaluation protocols for SSL representations. While SSL aims to create models with rich representations of the upstream data domain, it is more important to focus on TL rather than LP and k-NN when evaluating the practical usability of these representations in real-world applications. However, similar to the findings in the previous section, TL hyperparameter settings and their performances vary significantly across SSL methods. In this section, we conduct extensive experiments to investigate this phenomenon and explore which properties of a pretrained model mainly affect its transferability.

Experimental settings.

We compared the same SSL checkpoints as in Section 3 except MAE which showed low LP performance. The datasets used for TL are CIFAR-100 [20], Oxford Flowers 102 [25], Stanford Cars [19], and iNaturalist 2019 [18]. Unlike in the evaluation of discriminability where the target domain for training and evaluation is matched for the supervised model, in the case of TL, the target domain is different for both supervised and SSL models. Therefore, the supervised model does not set a specific upper bound for the SSL models’ performance.

For cross-setting experiments on TL, by considering the learning rate, training epochs, and regularization, the DINO setting was selected as the representative of the slow-long-strong setting (slow learning by a low learning rate, long training epochs, and strong regularization), and the MoCo v3 setting was selected as the fast-short-strong one. Additionally, we configured the fast-short-weak setting named “Short” since long training with strong regularization (such as the slow-long-strong DINO setting) could lead to high performance regardless of the transferability of learned representations. More details on datasets and experimental settings are available in the supplementary material (Sections A and D, respectively)

	SSL setting (IN)	LP setting (IN)	TL setting (Cars)			TL setting (Flowers)			TL setting (CIFAR)
Method	Weight decay	MoCo v3 w/ BN	DINO	MoCo v3	Short	DINO	MoCo v3	Short	DINO	MoCo v3	Short
Mugs	0.04 $\,\to\,$ 0.4	73.98	91.48	73.85	44.98	97.17	44.71	91.75	89.49	86.80	83.80
	0.04 $\,\to\,$ 0.2*	74.46	91.21	89.00	74.04	97.11	96.78	95.04	89.21	88.20	86.20
	0.04	74.00	87.86	89.58	79.27	95.30	97.37	93.86	89.08	89.31	88.79
	0	70.56	46.44	84.99	68.91	64.63	80.22	89.19	84.46	88.24	86.75
DINO	0.04 $\,\to\,$ 0.4*	72.77	91.26	23.67	19.92	97.02	88.32	93.02	88.62	87.20	79.66
	0.04 $\,\to\,$ 0.2	73.09	90.80	86.31	73.39	97.06	96.11	94.98	88.39	87.74	85.49
	0.04	72.39	85.50	86.37	77.45	95.28	96.89	94.34	87.68	88.65	87.36
	0	67.65	58.08	85.14	69.65	78.26	89.01	91.35	82.55	86.45	85.14
MoCo v3	0.04 $\,\to\,$ 0.4	61.35	68.00	9.76	4.28	88.36	12.23	36.55	84.06	64.23	57.32
	0.04 $\,\to\,$ 0.2	64.90	69.53	49.61	5.50	89.80	22.36	77.86	84.75	82.97	73.02
	0.1*	65.37	68.01	85.35	8.92	87.41	84.53	90.67	85.34	85.56	81.48
	0.04	65.17	75.58	86.94	73.25	88.53	92.57	91.70	85.17	86.92	84.93
	0	62.96	65.84	84.69	69.59	89.92	93.46	89.06	84.66	86.79	84.68
MSN	0.04 $\,\to\,$ 0.4*	67.48	88.72	78.15	54.49	96.31	94.81	93.43	86.74	86.55	83.38
	0.04 $\,\to\,$ 0.2	66.60	85.97	81.17	70.78	95.30	95.27	93.71	86.49	86.98	85.05
	0.04	65.19	78.50	81.41	67.55	92.50	94.75	91.68	84.74	86.97	84.98
	0	63.96	64.35	78.85	62.75	87.69	90.97	90.99	83.64	87.06	84.62
iBOT	0.04 $\,\to\,$ 0.4*	74.25	91.98	55.66	21.77	97.40	39.45	92.30	89.42	66.73	76.49
	0.04 $\,\to\,$ 0.2	74.77	91.39	87.86	71.43	97.33	95.85	95.11	89.20	87.80	85.91
	0.04	74.23	87.80	89.52	81.09	95.90	97.43	95.30	88.92	89.38	88.79
	0	70.86	61.78	87.60	73.63	81.31	91.48	92.32	85.63	88.50	87.87

Table 3: Analysis of weight decay during upstream pretraining. We trained ViT-S/16 using each SSL method for 100 epochs with varying weight decay values, and report LP accuracy on ImageNet (IN) as well as TL classification results on Stanford Cars (Cars), Flower-102 (Flowers), and CIFAR-100 (CIFAR). The symbol ‘*’ denotes the default weight decay for each method used in the literature. The weight decay during SSL considerably affects the stability of TL performance whereas LP performance remains robust to changes in weight decay values. Considering the min-max difference among three settings, the most stable results are shown in bold, and the most unstable results are underlined.

4.1 TL Results with Cross-settings

Table 2 shows the performance of different SSL checkpoints on four benchmark datasets across various TL settings. It is challenging to identify a single best-performing SSL checkpoint because the model that achieves the highest performance varies across datasets. Notably, there is a significant performance gap of up to 86.24%, when examining the TL performance across multiple hyperparameter settings. Furthermore, unlike the LP’s MAE setting, there is no preferred single TL setting.

However, surprisingly, the SL (DeiT) model consistently shows stable and high performance unlike SSL models, with a performance gap of 2.51–13.70%. The inclusion of label supervision during training is likely to have helped the SL (DeiT) model to learn more discriminative representations, leading to increased robustness across various TL settings, particularly since the downstream task is also a classification task. Out of the various SSL models compared, only the Mugs model has achieved a similar level of stability comparable to the SL model. From this perspective, we believe that hyperparameter robustness should also be considered when comparing the performance of SSL models with that of SL models.

Based on our experimental results, we have confirmed that TL performance variations stem from the inherent nature of SSL checkpoints including model architectures and SSL training recipes rather than TL hyperparameter settings. The Mugs model learns three tasks simultaneously: instance, local-group, and group discrimination. From the ablation study on Mugs, we verified that neither architectural components nor losses contribute to the robust TL performance of the Mugs checkpoint (see the ablation results in the supplementary material, Section E). After careful investigation, we conclude that the weight decay value during upstream pretraining is a key factor that controls the transferability of SSL representations.

4.2 Weight Decay Governs Transferability

Weight decay is typically used for regularization. If the weight decay value is too small, the regularization effect is weak, while if it is too large, it can hinder training. The impact of SSL weight decay on training can be explained in similar but different ways. It is known that, in non-contrastive SSL based on two augmented views, SSL models learn both nuisance features created by the augmentation process and invariant features based on image content [29]. If the weight decay value is too small, the SSL model may learn nuisance features made by augmentation, leading to overfitting on the upstream data. On the other hand, if the weight decay value is too large, the SSL model may suffer from feature collapse as it fails to properly learn invariant features. The authors of [29] analyzed the impact of weight decay on upstream performance (i.e., discriminability), but not TL performance on downstream datasets.

Learning invariant features of the upstream data can be helpful from the perspective of upstream evaluation, but it may not necessarily be useful from the perspective of TL, as low- and mid-level statistical features are important for TL on various downstream datasets rather than high-level semantic features of the upstream data. Therefore, it is necessary to learn more comprehensive and rich general-purpose representations, rather than just the invariant features of image content.

In order to analyze the impact of SSL weight decay on TL, we experimented with several weight decay values, then measured the performance using the same TL settings as in Table 2. We tested various weight decay values such as 0.04 $\,\to\,$ 0.4 (the default in DINO, MSN, and iBOT), 0.04 $\,\to\,$ 0.2 (the default in Mugs), 0.04, and 0. Following DINO and Mugs, we employed cosine scheduling to increase weight decay values during training. Additionally, we used a weight decay value of 0.1 (the default in MoCo v3) specifically for the MoCo v3 method.

Table 3 summarizes experimental results with various SSL weight decay values on three downstream datasets, CIFAR-100, Oxford Flowers 102, and Stanford Cars. From the perspective of min-max performance differences across TL settings for each dataset, we observe that Mugs and DINO with the SSL weight decay of 0.04 $\,\to\,$ 0.2 exhibit a small difference of 2.07–17.17% and 2.08–17.41%, respectively. However, with the increased SSL weight decay of 0.04 $\,\to\,$ 0.4, the performance variations significantly increase to 5.69–52.46% and 8.70–71.34%, respectively. This finding is intriguing as it highlights that the performance of TL greatly varies when the SSL weight decay value is slightly changed. This pattern is observed across all SSL methods, for example, iBOT also shows considerable performance variations according to the SSL weight decay: the highest stability is achieved with the SSL weight decay value of 0.04 across all datasets. Interestingly, all models enjoy a high degree of transferability (i.e. stability) when the SSL weight decay value is set to 0.04 $\,\to\,$ 0.2 or 0.04.

However, such behavior cannot be noticed by comparing LP performance, e.g., iBOT shows LP performances of 74.25% and 74.23% for the SSL weight decay of 0.04 $\,\to\,$ 0.4 and 0.04, respectively, while the variation of TL performance is significantly small for the SSL weight decay of 0.04. Figure 3 shows the detailed results for Mugs. From this figure, we can easily observe that LP performances are similar despite changes in SSL weight decay, while TL performances do not. The Mugs model shows the highest stability with the SSL weight decay of 0.04. It should be noted again that, except for the SSL weight decay value, we used the default SSL settings for each method. Considering the fact that the stability/instability caused by SSL weight decay is not limited to a particular dataset or method, but appears consistently across all datasets and models, we can conclude that the performance robustness to TL settings (i.e., transferability) is significantly dominated by the SSL weight decay.

To summarize, evaluating the transferability of SSL representations requires cross-setting experiments rather than just comparing the highest LP and TL accuracy. The instability of SSL representations can only be confirmed through such experiments. Additionally, the weight decay hyperparameter used in upstream pretraining is critical in learning transferable and general-purpose visual representations, and thus, it requires attention when evaluating the transferability of SSL representations.

4.3 In-depth Analysis of Weight Decay Effect

From the perspective of analyzing how SSL weight decay affects TL sensitivity of SSL models, we conducted a comparative analysis of Mugs and DINO. Firstly, in terms of loss curves, we discuss the changes in training speed resulting from SSL weight decay. Secondly, in terms of loss landscape, we explore the impact of SSL weight decay on the loss landscape and how it can lead to robust TL. Lastly, using CKA, we analyze the impact of SSL weight decay on feature reuse from a transferability perspective. We selected the DINO setting (i.e., slow-long-strong) as a default since it provides the best-performing models for both Mugs and DINO on the downstream datasets (see Table 2).

Loss curve.

Figure 4 shows the changes in the convergence rate of validation losses for Stanford Cars during the TL process of the Mugs and DINO models trained for Table 3. Once again, it should be emphasized that all TL hyperparameters, including weight decay during TL, were fixed in order to isolate the effects of different SSL backbones pretrained with varying SSL weight decay values. In Figure 4, it can be noticed that a decrease in SSL weight decay results in a slower learning speed. In the cases of 0.04 $\,\to\,$ 0.2 and 0.04 $\,\to\,$ 0.4, the difference in training speed may not be significant and both achieve similar validation loss at convergence. However, for the cases of 0.04 and 0, it can be observed that the convergence rates are notably low. It should be noted that such different training behaviors are only visible in TL scenarios, as SSL backbones with different weight decay values show almost identical LP performance except for the case where weight decay is 0 (see Table 3). Additionally, faster convergence does not necessarily lead to better transferability. Although Mugs and DINO backbones with the SSL weight decay of 0.04 $\,\to\,$ 0.4 show faster training (and higher accuracy), these backbones do not guarantee a high level of transferability as shown in Table 3. As a result, validation loss curves alone do not provide sufficient information to estimate the transferability of SSL backbones.

wd 0.04 $\,\to\,$ 0.4 wd 0.04 $\,\to\,$ 0.2

DINO

Mugs

Figure 5: Validation loss landscapes depending on SSL weight decay. We use the models trained on Stanford Cars in Table 3. The loss landscape is measured at the end of lr warmup phase of TL. The weight decay of 0.04

\,\to\,

0.2 leads to a flatter loss landscape, indicating the robustness of TL hyperparameter settings.

Loss landscape.

Through the lens of loss landscape, we aim to investigate how SSL weight decay affects learning behavior in terms of generalization and TL. We employed the 2D loss landscape visualization method [22, 26] to visualize the loss landscape using the Stanford Cars validation dataset. Specifically, we visualized the checkpoint at the end of the warm-up phase, rather than at the end of training as done in [26]. Despite having similar validation loss curves, the models trained with the SSL weight decay of 0.04 $\,\to\,$ 0.2 show significantly flatter loss landscapes compared to those with 0.04 $\,\to\,$ 0.4 as can be seen in Figure 5. It is well-known that a flatter loss landscape results in improved generalization [22, 6, 26], and it appears that this is also the case for hyperparameter robustness. However, it is necessary to decrease SSL weight decay to an appropriate level, as excessively low values of SSL weight decay can make the convergence rate lower as seen in Figure 4. Therefore, we need to set an appropriate value for SSL weight decay to ensure robustness in the TL settings. While it is difficult to observe the robustness of hyperparamters through the loss curve or LP performance, the loss landscape provides good evidence to analyze this.

Feature reusability based on CKA.

In addition, we employed the CKA analysis, a method to intuitively measure the degree of transferability, to analyze the impact of SSL weight decay on feature reusability, which is commonly employed to examine feature similarity in the context of TL. For example, Neyshabur et al. [24] uses CKA similarity to support that if a model has good transferability, features of the model should be reusable. One of the strengths of TL is that it can improve both training speed and performance by reusing the meaningful representations learned from early–middle layers of a pretrained model [24]. As can be seen in Figure 6, both DINO and Mugs demonstrate an improvement in feature reusability, as measured by CKA, when the SSL weight decay value is set to 0.04 $\,\to\,$ 0.2, particularly in early to middle layers. This observation is consistent with our prior experimental and analytical results, and indicates that an appropriated weight decay value during SSL leads to robust performance across various TL hyperparameters, a flatter loss landscape, and high feature reusability.

5 Conclusion

In this work, we address the limitations of commonly used evaluation protocols for SSL representations. Through extensive experiments, we show that existing SSL methods exhibit significant performance variations in both LP and TL depending on the hyperparameters used. In LP, the variations are due to the lack of input normalization in the current evaluation scheme. Interestingly, we find that the cause of performance instability in TL is the weight decay parameter utilized in SSL pretraining, which cannot be detected by discriminability performance metrics like LP or k-NN. We believe that our findings shed light on the shortcomings of current evaluation schemes for SSL representations and call for a rethinking of these protocols.

Acknowledgement

The author SH was partly supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (NRF-2021R1C1C1011907).

References

[1] Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Florian Bordes, Pascal Vincent, Armand Joulin, Mike Rabbat, and Nicolas Ballas. Masked siamese networks for label-efficient learning. In European Conference on Computer Vision (ECCV), 2022.
[2] Philip Bachman, R Devon Hjelm, and William Buchwalter. Learning representations by maximizing mutual information across views. In Advances Neural Information Processing Systems (NeurIPS), 2019.
[3] Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei. BEit: BERT pre-training of image transformers. In International Conference on Learning Representations (ICLR), 2022.
[4] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jegou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In The IEEE/CVF International Conference on Computer Vision (ICCV), 2021.
[5] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning (ICLR), 2020.
[6] Xiangning Chen, Cho-Jui Hsieh, and Boqing Gong. When vision transformers outperform resnets without pre-training or strong data augmentations. In International Conference on Learning Representations (ICLR), 2022.
[7] Xinlei Chen, Saining Xie, and Kaiming He. An empirical study of training self-supervised vision transformers. In The IEEE/CVF International Conference on Computer Vision (ICCV), 2021.
[8] Ekin D Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V Le. Randaugment: Practical automated data augmentation with a reduced search space. In The IEEE/CVF conference on computer vision and pattern recognition (CVPR) workshops, 2020.
[9] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
[10] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations (ICLR), 2021.
[11] Spyros Gidaris, Andrei Bursuc, Gilles Puy, Nikos Komodakis, Matthieu Cord, and Patrick Pérez. Obow: Online bag-of-visual-words generation for self-supervised learning. In The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[12] Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. In International Conference on Learning Representations (ICLR), 2018.
[13] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, Bilal Piot, koray kavukcuoglu, Remi Munos, and Michal Valko. Bootstrap your own latent - a new approach to self-supervised learning. In Advances in Neural Information Processing Systems (NeurIPS, 2020.
[14] Matthew Gwilliam and Abhinav Shrivastava. Beyond supervised vs. unsupervised: Representative benchmarking and analysis of image representation learning. In The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
[15] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
[16] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[17] Elad Hoffer, Tal Ben-Nun, Itay Hubara, Niv Giladi, Torsten Hoefler, and Daniel Soudry. Augment your batch: Improving generalization through instance repetition. In The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[18] Grant Van Horn, Oisin Mac Aodha, Yang Song, Yin Cui, Chen Sun, Alexander Shepard, Hartwig Adam, Pietro Perona, and Serge J. Belongie. The inaturalist species classification and detection dataset. In The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[19] Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained categorization. In The IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, 2013.
[20] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, 2009.
[21] Chunyuan Li, Jianwei Yang, Pengchuan Zhang, Mei Gao, Bin Xiao, Xiyang Dai, Lu Yuan, and Jianfeng Gao. Efficient self-supervised vision transformers for representation learning. In International Conference on Learning Representations (ICLR), 2022.
[22] Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. Visualizing the loss landscape of neural nets. In Advances in Neural Information Processing Systems (NeurIPS), 2018.
[23] Zhaowen Li, Zhiyang Chen, Fan Yang, Wei Li, Yousong Zhu, Chaoyang Zhao, Rui Deng, Liwei Wu, Rui Zhao, Ming Tang, and Jinqiao Wang. Mst: Masked self-supervised transformer for visual representation. In Advances in Neural Information Processing Systems (NeurIPS), 2021.
[24] Behnam Neyshabur, Hanie Sedghi, and Chiyuan Zhang. What is being transferred in transfer learning? In Advances in Neural Information Processing Systems (NeurIPS), 2020.
[25] Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In Indian Conference on Computer Vision, Graphics & Image Processing, 2008.
[26] Namuk Park and Songkuk Kim. How do vision transformers work? In International Conference on Learning Representations (ICLR), 2022.
[27] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[28] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Herve Jegou. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning (ICLR), 2021.
[29] Xiang Wang, Xinlei Chen, Simon Shaolei Du, and Yuandong Tian. Towards demystifying representation learning with non-contrastive self-supervision. arXiv preprint arXiv:2110.04947, 2022.
[30] Zhirong Wu, Yuanjun Xiong, Stella X. Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance discrimination. In The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[31] Sangdoo Yun, Dongyoon Han, Sanghyuk Chun, Seong Joon Oh, Youngjoon Yoo, and Junsuk Choe. Cutmix: Regularization strategy to train strong classifiers with localizable features. In The IEEE/CVF International Conference on Computer Vision (ICCV), 2019.
[32] Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In International Conference on Learning Representations (ICLR), 2018.
[33] Richard Zhang, Phillip Isola, and Alexei A. Efros. Colorful image colorization. In European Conference on Computer Vision (ECCV), 2016.
[34] Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, and Yi Yang. Random erasing data augmentation. Proceedings of the AAAI Conference on Artificial Intelligence, 2020.
[35] Jinghao Zhou, Chen Wei, Huiyu Wang, Wei Shen, Cihang Xie, Alan Yuille, and Tao Kong. ibot: Image bert pre-training with online tokenizer. In International Conference on Learning Representations (ICLR), 2022.
[36] Pan Zhou, Yichen Zhou, Chenyang Si, Weihao Yu, Teck Khim Ng, and Shuicheng Yan. Mugs: A multi-granular self-supervised learning framework. arXiv preprint arXiv:2203.14415, 2022.

Appendix A Datasets

The ImageNet [9] contains 1.2M training images and 50K validation images with 1K classes, widely used as a benchmark dataset for self-supervised learning. CIFAR-100 [20] is a dataset for multi-class image classification. It consists of 50K training images and 10K test images of 32 ${\times}$ 32 resolution with 100 classes. Oxford Flower 102 [25] is a dataset for fine-grained image classification. It consists of 2K training images and 6K validation images with 102 flower categories. Each class contains a varying number of images, ranging from 40 to 258. The Stanford Cars [19] is a fine-grained dataset including 196 classes of cars. It contains 8K training images and 8K test images. The iNaturalist [18] is a dataset of natural fine-grained categories for image classification. Those categories belong to super-categories. iNaturalist 2018 contains 437K training images and 24K validation images with 8K classes. On the other hand, iNaturalist 2019 consists of 265K training images and 3K validation images with 1K classes, focusing on a smaller set of highly similar categories drawn from iNaturalist.

Appendix B LP Experimental Details

We examined the LP hyperparameters of popular SSL methods based on the descriptions in each paper and the official code released by the authors, as summarized in Table S1. The “Concat. ${l}$ last layers” category indicates the number of outputs from the last layer blocks that are utilized as inputs for the classifier. The “Patch token” option allows for the use of the average of patch tokens rather than the class token. The “BN layer” option represents whether to use batch normalization for inputs of a classifier. By default, all methods use an input image that is first resized to 256 ${\times}$ 256 and then randomly cropped to a size of 224 ${\times}$ 224 during training. For inference, images are resized to 256 ${\times}$ 256 and center-cropped to a size of 224 ${\times}$ 224 for input. All methods do not apply weight decay during LP training.

DINO / iBOT / Mugs

(ViT-S/16)

DINO / iBOT / Mugs

(ViT-B/16)

MoCo v3

(ViT-S/16, ViT-B/16)

MAE

(ViT-B/16)

MSN

(ViT-S/16, ViT-B/16)

Epochs

100

Batch size

1024

4096

16384

Optimizer

SGD

LARS

SGD

Learning rate (LR)

0.001 / 0.001 / 0.04

0.001 / 0.001 / 0.008

0.1

6.4

LR warm-up epochs

✗

10 epoch

✗

LR decay

CosineAnnealingLR

Weight decay

✗

Concat.

{l}

last layers

Patch token

✗

✓

✗

BN layer

✗

✓

Table S1: LP settings for the SSL methods. We can observe that the LP hyperparameter settings of each method are quite different. Note that all SSL methods conduct extensive hyperparameter searches to obtain these optimal configurations.

	k-NN
Method	Paper	k=5	k=10	k=20	k=50	k=100
SL (DeiT)	79.8	78.6	79.2	79.3	79.2	79.1
DINO	74.5	73.8	74.4	74.4	73.7	72.9
MoCo v3	-	66.9	68.0	68.2	67.3	66.4
Mugs	75.6	74.8	75.5	75.4	74.8	73.9
iBOT	75.2	74.4	75.2	74.9	74.3	73.6
MSN	-	74.3	74.7	74.8	74.0	73.1

Table S2: ImageNet k-NN classification accuracy. Best accuracy for each method is shown in bold.

Appendix C k-NN Results on ImageNet

k-NN is a frequently used classifier for evaluating discriminability. Specifically, the performance of k-NN classifiers is evaluated based on the features obtained from SSL backbones that are frozen. k-NN has only a single hyperparameter, i.e., the value of k. As shown in Table S2, although there are performance differences depending on the k value, these differences are very small. Based on the official ViT-S/16 checkpoints of various SSL methods, we observed that the best performance is achieved when the value of k is set to 10 or 20. Since both values of k yield similar results, we chose 20 as the default k value for all experiments.

Appendix D TL Experimental Details

We report the hyperparameters setting of SL/SSL methods for TL in Table S3. We used the hyperparameter settings described in the literature as a basis. For values that are not explicitly mentioned in the papers, we referred to the official codes. For unknown values, e.g., the random erasing probability of Mugs, we marked them as ‘?’. Mugs did not specify exact learning rate values except for iNat18 and iNat19, but mentioned that those values were chosen by searching within the set {7.50e-6, 1.50e-5, 3.00e-5, 7.50e-5, 1.50e-4}. The TL hyperparameter setting for Stanford Cars is not provided in the MoCo v3 paper. Considering that the TL settings for Stanford Cars and Flower-102 are the same in other SSL methods such as DINO and iBOT, the Flower-102 setting was used for the Stanford Cars experiments of MoCo v3. Note that many methods (e.g. DINO, Mugs, iBOT, and MSN) follow the hyperparameter settings of DeiT. That is, most methods are employing strong regularization techniques (e.g. label smoothing [27], repeated augmentation [17], Mixup [32], CutMix [31], and random erasing [34]), which boost TL performance regardless of the quality of representations. Therefore, in the “Short” TL setting, we exclude all of these regularizations and set a short training epoch to diminish the performance-boosting effect, thereby enabling us to assess the transferability of these SSL backbones.

	DeiT		DINO
Downstream dataset	CIFAR-10/100	Cars	CIFAR-10	CIFAR-100	iNat18	iNat19	Flowers	Cars
Epochs	1000	1000	1000	1000	300	300	1000	1000
Batch size	768	768	768	768	1024	1024	768	768
Optimizer	SGD	SGD	SGD	AdamW	AdamW	AdamW	AdamW	AdamW
Learning rate (LR)	1.00e-02	1.00e-02	5.00e-06	5.00e-06	5.00e-05	5.00e-05	5.00e-06	5.00e-06
LR decay	cosine	cosine	cosine	cosine	cosine	cosine	cosine	cosine
LR warm-up epochs	5	5	5	5	5	5	5	5
LR warm-up	1.00e-06	1.00e-06	1.00e-06	1.00e-06	1.00e-06	1.00e-06	1.00e-06	1.00e-06
Weight decay	1.00e-04	1.00e-04	0.05	0.05	0.05	0.05	0.05	0.05
Label smoothing	0.1	0.1	0.1	0.1	0.1	0.1	0.1	0.1
Drop path rate	0	0	0.1	0.1	0.1	0.1	0.1	0.1
Repeated Aug.	✓	✓	✓	✓	✓	✓	✓	✓
Rand Aug.	9 / 0.5	9 / 0.5	9 / 0.5	9 / 0.5	9 / 0.5	9 / 0.5	9 / 0.5	9 / 0.5
Mixup prob.	0.8	0.8	0.8	0.8	0.8	0.8	0.8	0.8
CutMix prob.	1	1	1	1	1	1	1	1
Erasing Prob.	✗	✗	✗	✗	0.25	0.25	✗	✗

	MoCoV3				Mugs
Downstream dataset	ImageNet	CIFAR-10	CIFAR-100	Flowers	CIFAR-10/100	iNat18	iNat19	Flowers	Cars
Epochs	150	100	100	100	1000	360	360	1000	1000
Batch size	1024	1024	1024	1024	768	768	768	768	768
Optimizer	AdamW	AdamW	AdamW	AdamW	AdamW	AdamW	AdamW	AdamW	AdamW
Learning rate (LR)	5.00e-04	3.00e-04	3.00e-04	3.00e-04	?	3.00e-05	7.50e-05	?	?
LR decay	cosine	cosine	cosine	cosine	cosine	cosine	cosine	cosine	cosine
LR warm-up epochs	3	3	3	3	5	5	5	5	5
LR warm-up	1.00e-06	1.00e-06	1.00e-06	1.00e-06	1.00e-06	1.00e-06	1.00e-06	1.00e-06	1.00e-06
Weight decay	0.05	0.1	0.1	0.1	0.05	0.05	0.05	0.05	0.05
Label smoothing	0.1	0.1	0.1	0.1	0.1	0.1	0.1	0.1	0.1
Drop path rate	0.1	0.1	0.1	0.1	0.1	0.1	0.1	0.1	0.1
Repeated Aug.	✓	✓	✓	✓	✓	✓	✓	✓	✓
Rand Aug.	9 / 0.5	9 / 0.5	9 / 0.5	9 / 0.5	9 / 0.5	9 / 0.5	9 / 0.5	9 / 0.5	9 / 0.5
Mixup prob.	0.8	0.8	0.5	0	0.8	0.8	0.8	0.8	0.8
CutMix prob.	1	1	1	0	1	1	1	1	1
Erasing Prob.	0.25	✗	✗	0.25	?	?	?	?	?

	iBOT					MSN				Short
Downstream dataset	CIFAR-10/100	iNat18	iNat19	Flowers	Cars	CIFAR-10	CIFAR-100	iNat18	iNat19	All
Epochs	1000	360	360	1000	1000	1000	1000	300	300	50
Batch size	768	768	768	768	768	768	768	1024	1024	64
Optimizer	AdamW	AdamW	AdamW	AdamW	AdamW	SGD	AdamW	AdamW	AdamW	Adam
Learning rate (LR)	7.50e-06	5.00e-05	2.50e-05	7.50e-06	7.50e-06	7.50e-05	7.50e-05	1.00e-04	1.00e-04	1.00e-04
LR decay	cosine	cosine	cosine	cosine	cosine	cosine	cosine	cosine	cosine	step
LR warm-up epochs	5	5	5	5	5	5	5	5	5	✗
LR warm-up	1.00e-06	1.00e-06	1.00e-06	1.00e-06	1.00e-06	1.00e-06	1.00e-06	1.00e-06	1.00e-06	✗
Weight decay	0.05	0.05	0.05	0.05	0.05	0.05	0.05	0.05	0.05	1.00e-06
Label smoothing	0.1	0.1	0.1	0.1	0.1	0.1	0.1	0.1	0.1	✗
Drop path rate	0.1	0.1	0.1	0.1	0.1	0.1	0.1	0.1	0.1	0.1
Repeated Aug.	✓	✓	✓	✓	✓	✓	✓	✓	✓	✗
Rand Aug.	9 / 0.5	9 / 0.5	9 / 0.5	9 / 0.5	9 / 0.5	9 / 0.5	9 / 0.5	9 / 0.5	9 / 0.5	✗
Mixup prob.	0.8	0.8	0.8	0.8	0.8	0.8	0.8	0.8	0.8	✗
CutMix prob.	1	1	1	1	1	1	1	1	1	✗
Erasing Prob.	0.25	0.1	0.1	0.25	0.25	✗	✗	0.25	0.25	✗

Table S3: TL settings for SL/SSL methods on ViT-S/16. The TL hyperparameter settings of each method and dataset are considerably different.

Appendix E Stability Analysis on Mugs [36]

Here, we present the results of an ablation study on Mugs [36]. We initially checked if a particular Mugs module contributes to stable TL performance over DINO. Subsequently, we investigated the components of SSL training settings that affect the transferability of SSL representations. Lastly, we conducted experiments to determine the influence of the hyperparameters utilized in the SSL stage on transferability.

E.1 Module analysis

The main contribution of Mugs is its ability to learn multi-granular features based on instance, local group, and group modules. Consequently, it is presumed that the newly introduced module is the most significant factor affecting stability. In the literature, the authors conducted a module removal ablation study concerning LP. In this section, we further examine the impact of these modules on TL performance. As shown in Table S4, while the LP and TL performances exhibit slight variations depending on hyperparameter settings, there is little difference in stability, except when solely using the local group module. This is due to the issue of utilizing less-trained features as neighbors when using the local group module alone, leading to a reduction in performance. Note that using the instance module alone can be viewed as an improved version of MoCo v2 based on ViT, but it differs from MoCo v3 in terms of memory for negative samples. Similarly, using the group module alone has exactly the same architecture as DINO. Based on the ablation results, it can be observed that Mugs shows stable TL performance even when trained only with the group module, which has the same model architecture as DINO. Therefore, we can infer that other factors contribute to the stable TL performance of Mugs.

Mugs module			LP setting			TL setting
Instance	Local group	Group	DINO	MoCo v3	MAE	DINO	MoCo v3	Short
✓	✓	✓	74.97	74.39	74.44	91.21	88.78	74.04
✗	✗	✓	74.49	73.79	73.84	90.78	88.37	72.31
✗	✓	✗	31.33	59.46	66.36	89.96	79.32	47.56
✗	✓	✓	75.04	74.35	74.52	91.16	88.75	75.76
✓	✗	✗	74.50	73.78	73.84	90.70	88.45	74.55
✓	✗	✓	75.43	74.61	74.67	90.92	88.79	74.92
✓	✓	✗	74.40	73.84	73.85	91.11	88.46	77.05

Table S4: Effect of Mugs modules. Each model is trained for 100 epochs with ViT-S/16. We report LP results on ImageNet and TL results on Stanford Cars. Except for the local group module only, there is no difference in terms of stability. Unstable results are underlined.

E.2 Ablation study on training settings: Mugs vs. DINO

In order to determine where the stability difference between ‘DINO’ and ‘Mugs with the group module only’ originates, we conducted additional analyses related to the training settings. Through a comparison of DINO and Mugs’ official codes, we discovered several differences in detail, including augmentation, weight normalization, and SSL hyperparameters. When comparing the augmentation schemes for DINO and Mugs, they have nearly identical configurations, but Mugs incorporates an extra strong augmentation technique known as RandAugment [8]. In terms of weight normalization, DINO employs weight normalization for the teacher network’s last layer, whereas Mugs does not. Moreover, Mugs and DINO have different SSL hyperparameters.

We sampled some combinations of settings for the ablation study because it would take too much time and resources to consider all combinations. Based on experimental results, as shown in Table S5, we noticed that DINO’s stability (i.e., transferability) considerably increased when trained under Mugs’ SSL hyperparameters while retaining all other settings identical to the original DINO.

Base	SSL setting			LP setting			TL setting
Base	Augmentation	Teacher WN	Hyperparams.	DINO	MoCo	MAE	DINO	MoCo V3	Short
Mugs	Mugs	Mugs	Mugs	74.49	73.79	73.84	90.78	88.37	72.31
	DINO	DINO	DINO	74.58	73.21	73.49	91.29	84.47	43.29
	DINO	DINO	Mugs	74.03	73.43	73.56	90.44	88.87	72.82
DINO	DINO	DINO	DINO	74.12	72.62	72.89	91.26	24.01	19.93
DINO	DINO	DINO	Mugs	74.30	73.78	73.94	90.49	87.23	67.15

Table S5: Ablation study on Mugs and DINO training settings. Each model is trained for 100 epochs with ViT-S/16. We report LP results on ImageNet and TL results on Stanford Cars. While the SSL hyperparameters of Mugs show stable results in both Mugs and DINO, those of DINO produce unstable results in both Mugs and DINO. Conversely, augmentations and teacher weight normalization do not appear to have a significant impact. Unstable results are underlined.

E.3 Ablation study on SSL hyperparameters: Mugs vs. DINO

We aim to analyze which elements of SSL hyperparameters have the greatest influence on transferability. Upon comparing DINO and Mugs, we discovered several distinct SSL hyperparameters, such as patch embedding learning rate, teacher softmax temperature, gradient clipping, learning rate, and weight decay. Note that hyperparameter settings differ depending on the network architecture. In this section, we investigate the SSL hyperparameters for ViT-S/16. For learning rate, both DINO and Mugs employ cosine scheduling and linear warmup strategies. The learning rate value of DINO is 5e-4 $\,\to\,$ 1e-5, while that of Mugs is 8e-4 $\,\to\,$ 1e-6. Similarly, cosine scheduling is used for weight decay. The weight decay value of DINO is 0.04 $\,\to\,$ 0.4, and that of Mugs is 0.04 $\,\to\,$ 0.2. Only Mugs incorporates a 20% reduction in patch embedding learning rate. DINO does not use gradient clipping, but Mugs use it with a threshold of 3.0. Lastly, DINO schedules the softmax temperature applied to the teacher’s outputs from 0.04 $\,\to\,$ 0.07, while Mugs employs a fixed value of 0.04.

As shown in Table S6, it can be observed that the weight decay parameter in the SSL stage has a significant impact on the transferability of SSL representations, i.e., adjusting the weight decay value from 0.04 $\,\to\,$ 0.4 to 0.04 $\,\to\,$ 0.2 considerably stabilizes DINO’s TL performance. Note that similar to the previous experiments, we sampled some combinations instead of considering all possible cases. It is worth noting that it is difficult to identify such an influence on TL performance based on the perspective of discriminability evaluated by LP.

Base setting	SSL HP setting					LP setting		TL setting
	Patch embedding LR	Gradient clipping	Learning rate	Weight decay	Teacher temperature	DINO	MoCo v3	DINO	MoCo v3	Short
Mugs	M	M	M	M	M	74.03	73.43	90.44	88.87	72.82
	D	D	D	D	D	74.58	73.21	91.29	84.47	43.29
	D	M	M	M	M	73.52	72.90	90.56	89.72	80.45
	M	D	M	M	M	73.84	73.16	90.93	88.54	75.95
	M	M	D	M	M	75.16	74.22	91.75	88.50	70.87
	M	M	M	D	M	74.94	73.74	91.48	80.80	44.99
DINO	D	D	D	D	D	74.12	72.62	91.26	24.01	19.93
	M	M	M	M	M	74.30	73.78	90.49	87.23	67.15
	M	M	D	D	M	74.78	73.65	91.58	9.45	13.34
	M	M	D	D	D	74.64	73.36	91.47	25.67	15.90
	D	D	M	M	M	73.57	73.03	90.32	87.84	71.42
	D	D	M	M	D	73.56	73.10	90.06	88.52	77.19
	M	M	M	M	D	74.23	73.69	90.60	88.38	63.28
	D	D	D	D	M	74.12	72.79	91.13	24.47	36.37
	D	D	M	D	D	73.05	71.95	90.98	11.67	34.45
	D	D	D	M	D	73.65	73.07	90.80	87.24	73.39

Table S6: Impact of hyperparameters in SSL upstream pretraining. D and M represent the settings of DINO and Mugs, respectively. It can be observed that SSL weight decay determines the robustness of TL performance. Unstable results are underlined.