
Three Guidelines You Should Know for Universally Slimmable Self-Supervised Learning

Yun-Hao Cao1,  Peiqin Sun2,  Shuchang Zhou2

1State Key Laboratory for Novel Software Technology, Nanjing University
2MEGVII Technology
[email protected], {sunpeiqin, zsc}@megvii.com
Corresponding author.
Abstract

We propose universally slimmable self-supervised learning (dubbed US3L) to achieve better accuracy-efficiency trade-offs for deploying self-supervised models across different devices. We observe that direct adaptation of self-supervised learning (SSL) to universally slimmable networks misbehaves, as the training process frequently collapses. We then discover that temporally consistent guidance is the key to the success of SSL for universally slimmable networks, and we propose three guidelines for the loss design to ensure this temporal consistency from a unified gradient perspective. Moreover, we propose dynamic sampling and group regularization strategies to simultaneously improve training efficiency and accuracy. Our US3L method has been empirically validated on both convolutional neural networks and vision transformers. By training only once and keeping one copy of weights, our method outperforms various state-of-the-art methods (individually trained or not) on benchmarks including recognition, object detection and instance segmentation. Our code is available at https://github.com/megvii-research/US3L-CVPR2023.

1 Introduction

Deep supervised learning has achieved great success in the last decade, but the drawback is that it relies heavily on a large set of annotated training data. Self-supervised learning (SSL) has gained popularity because of its ability to avoid the cost of annotating large-scale datasets. Since the emergence of contrastive learning [7], SSL has clearly gained momentum and several recent works [8, 14] have achieved comparable or even better performance than the supervised pretraining when transferring to downstream tasks. However, it remains challenging to deploy trained models for edge computing purposes, due to the limited memory, computation and storage capabilities of such devices.

To facilitate deployment, several model compression techniques have been proposed, including lightweight architecture design [29], knowledge distillation [20], network pruning [15], and quantization [33]. Among them, structured network pruning [25] is directly supported and accelerated by most current hardware and therefore the most studied. However, most structured pruning methods require fine-tuning to obtain a sub-network with a specific sparsity, and a single trained model cannot achieve instant and adaptive accuracy-efficiency trade-offs across different devices. To address this problem in the context of supervised learning, the family of slimmable networks (S-Net) and universally slimmable networks (US-Net) [32, 31, 2, 22] were proposed, which can switch freely among different widths by training only once.

Table 1: Comparisons between supervised classification and SimSiam under S-Net on CIFAR-100. The accuracy for SimSiam is under linear evaluation. '-' denotes that the model collapses.
Type          Method              Accuracy (%)
                                  1.0x   0.75x  0.5x   0.25x
Supervised    Individual          73.8   72.8   71.4   67.3
              S-Net [32]          71.9   71.7   70.8   66.2
              S-Net+Distill [31]  73.1   71.9   70.5   67.2
SimSiam [9]   Individual          65.2   64.0   60.6   51.2
              S-Net [32]          -      -      -      -
              S-Net+Distill [31]  46.9   46.9   46.7   45.3
              Ours                65.5   65.3   63.2   59.7

Driven by the success of slimmable networks, a question arises: can we train a self-supervised model that can run at arbitrary width? A naïve solution is to replace the supervised loss with a self-supervised loss within the US-Net framework. However, our empirical studies show that this solution does not work directly. Table 1 shows that the phenomenon in self-supervised scenarios is very different: the model collapses directly after applying the popular SSL method SimSiam [9] to slimmable networks [32]. Although using inplace distillation [31] for sub-networks prevents the model from collapsing, there is still a big gap between the results of S-Net+Distill and training each model individually for SimSiam. So why is the situation so different in SSL, and how can we further improve the performance (i.e., close the gap)?

In this paper, we present a unified perspective to explain the differences and propose corresponding measures to bridge the gap. From a unified gradient perspective, we find that the key is that the guidance to sub-networks should be consistent between iterations, and we analyze which components of SSL incur the temporal inconsistency problem and why US-Net works in supervised learning. Based on these theoretical analyses, we propose three guidelines for the loss design of US-Net training to ensure temporal consistency. As long as one of them is satisfied, US-Net can work well, whether in supervised or self-supervised scenarios. Moreover, considering the characteristics of SSL and the deficiencies of US-Net, we propose dynamic sampling and group regularization to reduce the training overhead while improving accuracy. Our main contributions are:

  • We discover significant differences between supervised and self-supervised learning when training US-Net. Based on these observations, we analyze and summarize three guidelines for the loss design of US-Net to ensure temporal consistency from a unified gradient perspective.

  • We propose a dynamic sampling strategy to reduce the training cost without sacrificing accuracy, which eases coping with the large data volumes in SSL.

  • We analyze how the training scheme of US-Net limits the model capacity and propose group regularization as a solution, giving different degrees of freedom to different channels.

  • We validate the effectiveness of our method on both CNNs and Vision Transformers (ViTs). Our method requires only one round of training and a single model, can exceed the results of training each model individually, and is comparable to knowledge distillation from pretrained teachers.

2 Related Works

Self-supervised Learning. To avoid time-consuming and expensive data annotations, many self-supervised methods have been proposed to learn visual representations from large-scale unlabeled images or videos [26, 4]. As the driving force of state-of-the-art SSL methods, contrastive learning has greatly improved the performance of representation learning in recent years [30]. Contrastive learning is a discriminative approach that aims at pulling similar samples closer and pushing dissimilar samples apart. SimCLR [7] and MoCo [16] both employ the contrastive loss function InfoNCE [30], which requires negative samples. BYOL [14] and SimSiam [9] discard negative sampling in contrastive learning by using an asymmetrical design.

To improve the accuracy-efficiency trade-off for self-supervised models, many works have been proposed. Fang et al. [13] proposed self-supervised knowledge distillation (SEED) for SSL with lightweight models. However, models at different widths (sparsities) must be trained individually, which incurs significant computational and storage overhead and is unsustainable for large data volumes. Moreover, it requires a pretrained teacher model while ours does not. Recently, SSQL [3] proposes to pretrain quantization-friendly self-supervised models to facilitate downstream deployment. Concurrent work DATA [5] proposes a neural architecture search (NAS) approach specialized for SSL. In contrast, we focus on structured pruning and we provide a unified theoretical explanation for the loss design of once-for-all training.

Slimmable Networks. Slimmable networks [32] are widely studied because of their ability to execute at different widths, permitting instant and adaptive accuracy-efficiency trade-offs at runtime. Later, [31] proposes universally slimmable networks (US-Net), which extend slimmable networks to run at arbitrary width. The follow-up work OFA [2] extends the sampling space of sub-networks to the depth and kernel size dimensions, but it also inherits the loss design of US-Net by combining the base loss and the inplace distillation loss. [6] explores finding an optimal sub-model from a vision transformer [11]. All of these works, however, were conducted under the supervised learning paradigm, whereas our method is self-supervised.

Figure 1: The proposed framework and our method for universally slimmable self-supervised learning: (a) the overall loss design, (b) loss choices satisfying the proposed guidelines, (c) the dynamic sampling strategy, and (d) group regularization.

3 Method

In this section, we begin with the notations and a brief review of previous works in Sec. 3.1. Then, we introduce our method in Sec. 3.2, which we call Universally Slimmable Self-Supervised Learning (US3L), as shown in Fig. 1. Finally, in Sec. 3.3 we show that temporal consistency of guidance is critical to the success of US-Net training by analyzing the stability of gradient updates of both self-supervised and supervised losses, and we propose three guidelines for the loss design to ensure this consistency.

3.1 Preliminary

In this subsection, we introduce two representative SSL methods SimSiam and SimCLR, as well as (universally) slimmable networks as preliminaries.

1) Self-supervised Losses. Let $\boldsymbol{x}_{i,1}$ and $\boldsymbol{x}_{i,2}$ denote two randomly augmented views from an input image $\boldsymbol{x}_{i}$. Let $f$ denote an encoder network consisting of a backbone (e.g., ResNet [19]) and a projection MLP head [7].

SimSiam [9] maximizes the similarity between two augmentations of one image. A prediction MLP head [14], denoted as $h$, transforms the output of one view and matches it to the other view. The output vectors for $\boldsymbol{x}_{i,1}$ are denoted as $\boldsymbol{z}_{i,1}\triangleq f(\boldsymbol{x}_{i,1})$ and $\boldsymbol{p}_{i,1}\triangleq h(f(\boldsymbol{x}_{i,1}))$, and $\boldsymbol{z}_{i,2}$ and $\boldsymbol{p}_{i,2}$ are defined similarly. The negative cosine similarity is defined as $D(\boldsymbol{p},\boldsymbol{z})\triangleq-\frac{\boldsymbol{p}}{\|\boldsymbol{p}\|_{2}}\cdot\frac{\boldsymbol{z}}{\|\boldsymbol{z}\|_{2}}$, and we assume both $\boldsymbol{z}$ and $\boldsymbol{p}$ have been $L_{2}$-normalized for simplicity in subsequent discussions. Let $SG(\cdot)$ denote the stop-gradient operation. Then, the loss function in SimSiam is:

L_{\text{MSE}}=\sum_{i}D(\boldsymbol{p}_{i,1},SG(\boldsymbol{z}_{i,2}))+D(\boldsymbol{p}_{i,2},SG(\boldsymbol{z}_{i,1}))\,. (1)

SimCLR [7] and MoCo [16] contrast with negative samples using InfoNCE [30] loss:

L_{\text{NCE}}=-\sum_{i}\log\frac{e^{\boldsymbol{z}_{i,1}\cdot\boldsymbol{z}_{i,2}}}{e^{\boldsymbol{z}_{i,1}\cdot\boldsymbol{z}_{i,2}}+\sum_{j\neq i,v\in\{1,2\}}e^{\boldsymbol{z}_{i,1}\cdot\boldsymbol{z}_{j,v}}}\,, (2)

where we omit the temperature parameter $\tau$ for simplicity.
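To make the preliminaries concrete, below is a minimal PyTorch sketch of the two losses in Eq. (1) and Eq. (2); the batching convention and the default temperature are our own assumptions rather than the exact reference implementations.

```python
import torch
import torch.nn.functional as F

def simsiam_loss(p1, p2, z1, z2):
    """Symmetric negative cosine similarity with stop-gradient on z, as in Eq. (1)."""
    def D(p, z):
        return -F.cosine_similarity(p, z.detach(), dim=-1).mean()
    return D(p1, z2) + D(p2, z1)

def info_nce_loss(z1, z2, temperature=0.5):
    """InfoNCE over a batch, as in Eq. (2); both views of every other image act as negatives."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    n = z1.shape[0]
    z = torch.cat([z1, z2], dim=0)                      # (2n, d)
    sim = z @ z.t() / temperature                       # (2n, 2n) similarity matrix
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float('-inf'))          # remove self-similarity
    # positives: view 1 of image i is matched to view 2 of image i, and vice versa
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)
```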

2) Slimmable Networks. Slimmable networks [32] are a class of networks that can execute at different widths. During training, only the smallest, the largest and a few randomly sampled networks are used to calculate the loss in each iteration, which is known as the sandwich rule. Further, inplace distillation [31] is introduced to improve performance, where the knowledge inside the largest network is transferred to sub-networks by using a distillation loss.

3.2 The Proposed Method

The following three subsections describe the components of our US3L method (Algorithm 1).

3.2.1 Loss Design

The general framework of our method is depicted in Fig. 1(a), in which the loss function is composed of base loss (for the base/largest network) and distillation loss (for sub-networks). By default, we use a momentum encoder to generate targets, InfoNCE as the base loss, and MSE as the distillation loss. Also, we show that we should use an auxiliary distillation head to mitigate the impact of the capacity difference between teacher and student. The overall loss function is

L=\underbrace{L_{\rm{NCE}}}_{L_{\rm{Base}}}\underbrace{-\sum_{i}\sum_{s}g(z_{i}^{s})\cdot z_{i}^{m}}_{L_{\rm{Distill}}}\,, (3)

where $g(\cdot)$ is an auxiliary distillation MLP head, and $z^{s}$ and $z^{m}$ are the outputs of the sub-network and the momentum encoder, respectively. Notice that Eq. (3) is not the only option for the loss design; any loss works well as long as it satisfies our guidelines (Fig. 1(b)), which will be discussed in Sec. 3.3.
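The sketch below shows one way Eq. (3) could be assembled in PyTorch, reusing `info_nce_loss` from the earlier snippet; the architecture of the auxiliary head $g(\cdot)$, the feature dimensions, and the single-view form of the distillation term are illustrative assumptions rather than the exact implementation (in practice the distillation term is symmetrized over the two views, as in Algorithm 1).

```python
import torch.nn as nn
import torch.nn.functional as F

class DistillHead(nn.Module):
    """Auxiliary distillation head g(.): a small MLP whose exact architecture is our assumption."""
    def __init__(self, dim=2048, hidden=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.BatchNorm1d(hidden),
                                 nn.ReLU(inplace=True), nn.Linear(hidden, dim))

    def forward(self, z):
        return self.net(z)

def us3l_loss(z_base_1, z_base_2, sub_outputs, z_momentum, g):
    """L = L_NCE(base) - sum_s g(z^s) . z^m, with unit-normalized vectors (Eq. (3))."""
    base = info_nce_loss(z_base_1, z_base_2)               # L_Base for the largest network
    target = F.normalize(z_momentum.detach(), dim=-1)      # z^m from the momentum encoder
    distill = 0.0
    for z_s in sub_outputs:                                # outputs of the sampled sub-networks
        distill = distill - (F.normalize(g(z_s), dim=-1) * target).sum(dim=-1).mean()
    return base + distill
```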

3.2.2 Dynamic Sampling

It is worth noting that Yu et al. [32] sampled four switches in each iteration throughout training, which is very time-consuming for SSL training. Therefore, we design a dynamic sampling strategy to reduce the training overhead while improving performance (Fig. 1(c)). First, we argue that it is unnecessary to introduce the training of sub-networks at the beginning. We believe that a good and consistent teacher is essential for the learning of sub-networks [1], so we only need to train the base network at the start. Second, the training of sub-networks should be gradual; specifically, the width of the smallest sub-network should be gradually reduced. By combining the two sampling strategies described above, we successfully reduce the sampling number $s$ from 4 to 3 (the theoretical minimum) without a performance drop (see the appendix for detailed analysis and results).

In our implementation, the training process is divided into two stages. In the first stage, only the largest network is trained (i.e., $s=1$). In the second stage, we sample the largest network, the smallest network and one random width ($s=3$), and the width of the smallest network is gradually reduced. For example, the sampling width range in the second stage begins with [0.75, 1.0], then [0.5, 1.0], and finally [0.25, 1.0].
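A minimal sketch of this two-stage schedule is given below, assuming four equal periods of length $T_{iters}/4$ as in Algorithm 1; the exact period boundaries are our reading of the description above, not a verbatim implementation.

```python
import random

def sample_widths(t, total_iters, w_max=1.0):
    """Return the list of widths to execute at iteration t under dynamic sampling."""
    period = max(total_iters // 4, 1)
    stage = min(t // period, 3)             # 0, 1, 2, 3
    if stage == 0:
        return [w_max]                      # first stage: only the largest network (s = 1)
    w_min = 1.0 - 0.25 * stage              # smallest width shrinks: 0.75 -> 0.5 -> 0.25
    random_w = random.uniform(w_min, w_max) # one randomly sampled width
    return [w_max, random_w, w_min]         # largest + random + smallest (s = 3)
```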

3.2.3 Group Regularization

Given two channels $k_{1}$ and $k_{2}$ ($k_{1}<k_{2}$), if channel $k_{2}$ is used in US-Net, then channel $k_{1}$ must also be used. In other words, the earlier channels are used more frequently than the later ones. Therefore, in the training of US-Net, the majority of the weights will be concentrated on the earlier channels to ensure the performance of sub-networks. However, such a weight distribution limits the base model's capacity and thus affects its performance. To address this problem, we propose group regularization, which gives more degrees of freedom (i.e., smaller regularization coefficients) to the later channels (Fig. 1(d)) so that their weights are more fully utilized. We divide the total $K$ channels into $G$ groups in order, with each group containing $K_{G}=\lfloor K/G\rfloor$ channels. Then we define:

L_{\text{GReg}}=\sum_{k=1}^{K}\lambda_{k}\lVert\mathbf{w}_{k}\rVert_{2}^{2}\,, (4)
\lambda_{k}=\lambda(1-\lfloor k/K_{G}\rfloor\alpha)\,, (5)

where $\mathbf{w}$ denotes the weight matrix, and we set $G=8$ and $\alpha=0.05$ throughout this paper. Notice that when $\alpha=0$, group regularization degenerates into the standard $L_{2}$ regularization. We also empirically demonstrate in the appendix that group regularization is tailored for US-Net.
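A short sketch of how the per-channel coefficients in Eq. (5) and the term in Eq. (4) could be computed for a single layer; we assume the grouped channels are the output channels (dimension 0) of a convolution or linear weight, and the summation over layers is omitted.

```python
import torch

def group_regularization(weight, base_lambda=1e-4, num_groups=8, alpha=0.05):
    """Group regularization for one weight tensor: later channel groups get smaller lambda_k."""
    K = weight.shape[0]                                   # number of output channels
    K_g = max(K // num_groups, 1)                         # channels per group, K_G in Eq. (5)
    k = torch.arange(K, device=weight.device)
    lambdas = base_lambda * (1.0 - (k // K_g) * alpha)    # lambda_k, Eq. (5)
    per_channel = weight.flatten(1).pow(2).sum(dim=1)     # ||w_k||_2^2 for each channel, Eq. (4)
    return (lambdas * per_channel).sum()
```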

Algorithm 1 The proposed US3L method
0:  Define the width range $R=[R_{\text{min}},R_{\text{max}}]$, e.g., $R_{\text{min}}=0.25$, $R_{\text{max}}=1.0$, and the period length $T_{p}=\lfloor T_{iters}/4\rfloor$.
1:  for $t=1,\dots,T_{iters}$ do
2:     Clear gradients, optimizer.zero_grad().
3:     Run the base network: $z_{1}=M(x_{1})$, $z_{2}=M(x_{2})$.
4:     Compute the base loss: $loss=L_{\rm{Base}}(z_{1},z_{2})+L_{\rm{GReg}}$.
5:     Detach the targets: $z^{t}_{1}=z_{1}$.detach(), $z^{t}_{2}=z_{2}$.detach().
6:     if $t>T_{p}$ then
7:        Dynamically adjust the range: $R_{\text{min}}=1-0.25\lfloor t/T_{p}\rfloor$.
8:        Randomly sample a width from $R$ as width samples.
9:        Add the smallest width $R_{\text{min}}$ to width samples.
10:       for width in width samples do
11:          Execute the sub-network $M^{\prime}$ at width, followed by the distillation head $g$: $z^{s}_{1}=g(M^{\prime}(x_{1}))$, $z^{s}_{2}=g(M^{\prime}(x_{2}))$.
12:          $loss \mathrel{+}= L_{\rm{Distill}}(z^{s}_{1},z^{t}_{2})+L_{\rm{Distill}}(z^{s}_{2},z^{t}_{1})$.
13:       end for
14:    end if
15:    Accumulate gradients, loss.backward().
16:    Update weights, optimizer.step().
17: end for

3.3 Three Guidelines for Loss Design

The special feature of US-Net training is the introduction of sub-network training (i.e., $L_{\text{Distill}}$), so the training stability of the sub-networks is very important. In this paper, we find that the key is to ensure the temporal consistency of the guidance for sub-networks. One image has different views in two adjacent iterations, which produce different outputs because the model has not converged and is unstable. We hope that the gradients generated by different views of the same image remain close between iterations (i.e., that they are robust to such changes and provide consistent guidance to sub-networks).

In the context of SSL, (1) and (2) can also be adapted for distillation, and we use $z^{s}$ and $z^{t}$ to denote the outputs of the sub-network and the base network, respectively. The losses are:

L_{\text{MSE-distill}}=-z_{i}^{s}\cdot z_{i}^{t}\,, (6)
L_{\text{NCE-distill}}=-z_{i}^{s}\cdot z_{i}^{t}+\log\sum_{k}e^{z_{i}^{s}\cdot z^{t}_{k}}\,. (7)
Lemma 3.1

MSE is not robust to changes in the output, whereas NCE is stabilized by distances from other samples.

Proof.  For (6), the derivative can be derived as follows:

\frac{\partial L}{\partial z_{i}^{s}}=-z_{i}^{t}\,. (8)

For (7), the derivative can be derived as follows:

\frac{\partial L}{\partial z^{s}_{i}}=-z_{i}^{t}+\sum_{j}\frac{e^{z^{s}_{i}\cdot z^{t}_{j}}}{\sum_{k}e^{z^{s}_{i}\cdot z^{t}_{k}}}z^{t}_{j}\triangleq-z_{i}^{t}+\sum_{j}P_{j}z^{t}_{j}\,. (9)

Hence, we see that for MSE, the gradient only depends on the output $z^{t}$. This output will be very unstable across iterations due to factors such as rapid model updates and image augmentations, resulting in temporal gradient instability. In contrast, the NCE loss is also stabilized by the distances from other samples (corresponding to the extra $\sum_{j}P_{j}z^{t}_{j}$ term). $\square$

To further illustrate Lemma 3.1, consider the following example (Fig. 2b, c). Assume that, due to image augmentations, all outputs are transformed by the same rotation matrix $w^{\theta}$ from the $h$-th to the $(h{+}1)$-th iteration (Fig. 2a). The gradient difference between iterations for MSE is:

\frac{\partial L^{(h+1)}}{\partial z^{s}_{i}}-\frac{\partial L^{(h)}}{\partial z^{s}_{i}}=(I-w^{\theta})z^{t}_{i}\,, (10)

where $I$ denotes the identity matrix. For InfoNCE, we have:

\frac{\partial L^{(h+1)}}{\partial z^{s}_{i}}-\frac{\partial L^{(h)}}{\partial z^{s}_{i}}=(I-w^{\theta})(z^{t}_{i}-\sum_{j}P_{j}z^{t}_{j})\,. (11)

If the output of the student is already aligned with the teacher, then we have $P_{i}\approx 1$, and thus it can be verified that $\lVert\frac{\partial L_{\text{NCE}}^{(h+1)}}{\partial z^{s}_{i}}-\frac{\partial L_{\text{NCE}}^{(h)}}{\partial z^{s}_{i}}\rVert_{2}<\lVert\frac{\partial L_{\text{MSE}}^{(h+1)}}{\partial z^{s}_{i}}-\frac{\partial L_{\text{MSE}}^{(h)}}{\partial z^{s}_{i}}\rVert_{2}$. That is, the NCE loss achieves better temporal stability than MSE for the learning of sub-networks.
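This comparison can also be checked numerically with a small toy example; the configuration below (four unit vectors on a circle and a 0.1-rad rotation) is our own illustrative choice and not taken from the paper.

```python
import math
import torch
import torch.nn.functional as F

theta = 0.1
R = torch.tensor([[math.cos(theta), -math.sin(theta)],
                  [math.sin(theta),  math.cos(theta)]])        # rotation matrix w^theta

angles = torch.tensor([0.0, 0.5, 1.0, 1.5]) * math.pi
z_t = torch.stack([angles.cos(), angles.sin()], dim=-1)        # teacher outputs z^t
z_s = z_t.clone()                                              # student already aligned with teacher
P = F.softmax(z_s @ z_t.t(), dim=-1)                           # weights P_j from Eq. (9)

diff_mse = z_t - z_t @ R.t()                                   # (I - w^theta) z^t_i,  Eq. (10)
resid = z_t - P @ z_t                                          # z^t_i - sum_j P_j z^t_j
diff_nce = resid - resid @ R.t()                               # Eq. (11)

print(diff_mse.norm(dim=-1))   # ~0.100 for every sample
print(diff_nce.norm(dim=-1))   # ~0.054, i.e. consistently smaller than the MSE case
```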

Figure 2: Illustration of feature changes and corresponding gradient changes under various settings. Best viewed in color.
Lemma 3.2

Supervised cross entropy (CE) is also stabilized by relative distance and temporal consistency is preserved.

Proof.  Let $y_{i}$ denote the label for $x_{i}$, $C$ denote the number of classes, and $\mathbf{w}\in\mathbb{R}^{d\times C}$ denote the weight matrix of the classification head ($d$ is the feature dimension). Then:

L_{\text{CE}}=-\log\frac{e^{\mathbf{w}_{y_{i}}^{T}\boldsymbol{z}_{i}}}{\sum_{j=1}^{C}e^{\mathbf{w}_{j}^{T}\boldsymbol{z}_{i}}}=-\mathbf{w}_{y_{i}}^{T}\boldsymbol{z}_{i}+\log\sum_{j=1}^{C}e^{\mathbf{w}_{j}^{T}\boldsymbol{z}_{i}}\,.

The derivative is as follows:

\frac{\partial{L_{\text{CE}}}}{\partial{\boldsymbol{z}_{i}}}=-\mathbf{w}_{y_{i}}+\sum_{j=1}^{C}\frac{e^{\mathbf{w}_{j}^{T}\boldsymbol{z}_{i}}}{\sum_{k=1}^{C}e^{\mathbf{w}_{k}^{T}{\boldsymbol{z}_{i}}}}\mathbf{w}_{j}\triangleq-\mathbf{w}_{y_{i}}+\sum\nolimits_{j}P_{j}\mathbf{w}_{j}\,.

The gradient is related not only to the target class weight $\mathbf{w}_{y_{i}}$ but also to the weights of the other classes through the $\sum_{j}P_{j}\mathbf{w}_{j}$ term, and the analysis is then similar to that of NCE above. $\square$

From Lemma 3.2 we can understand the huge difference between MSE-based SimSiam and CE-based supervised classification in Table 1. The key is that inconsistent outputs can make the temporal gradient updates for MSE very unstable.

Lemma 3.3

A momentum teacher will better preserve temporal consistency by producing slowly updating outputs. (It is a known fact that a momentum teacher reduces the degree of output change and helps the model train more stably [14].)

From the above analyses, we can summarize three guidelines below to ensure the temporal consistency of guidance:

1. The base loss is based on relative distances, so that the base network produces temporally consistent outputs.

2. The distillation loss is based on relative distances, so that it provides temporally consistent guidance for sub-networks.

3. A momentum teacher is used to produce stable guidance for sub-networks.

Experimental results in Sec. 4.5 further verify the effectiveness of the proposed three guidelines, and we empirically find that at least one of the three guidelines needs to be satisfied for the US-Net framework to work.

4 Experimental Results

We introduce the implementation details in Sec. 4.1. We experiment with CNNs in Sec. 4.2 and ViTs in Sec. 4.3 on CIFAR-100 [21] and CIFAR-10 [21], respectively. Then, we experiment on ImageNet [28] (IN) in Sec. 4.4 and we evaluate the transfer performance of ImageNet pretrained models on downstream recognition, object detection, and instance segmentation benchmarks. Finally, we investigate the effects of different components in our method in Sec. 4.5.

Table 2: Main results on CIFAR-100. '-' denotes that the model collapses. $n$ denotes the number of sub-models, and $n=9$ in this table. $T$ denotes the cost of training one model on CIFAR-100 for 400 epochs; we do not consider the effect of model size here because models of different sizes are encountered in each method. The best two results are bolded and underlined, respectively.
Backbone Method Once Pretrained Training Linear Accuracy (%)
Training Teacher Cost 1.0x 0.9x 0.8x 0.7x 0.6x 0.5x 0.4x 0.3x 0.25x
ResNet-18 SimCLR [7] ×\times ×\times nTnT 66.5 65.4 64.7 63.7 62.6 61.0 59.0 56.1 53.6
SimSiam [9] ×\times ×\times nTnT 66.5 65.4 64.6 63.5 62.6 60.0 58.3 54.9 52.4
BYOL [14] ×\times ×\times nTnT 66.8 66.0 65.6 65.3 63.0 62.1 59.5 56.0 54.3
SEED [13] ×\times BYOL R-50 nTnT 67.3 66.6 65.8 65.2 64.8 63.5 62.2 60.1 58.5
SEED-MSE ×\times BYOL R-18 nTnT 67.5 67.2 66.5 66.0 65.9 64.8 64.0 62.4 60.1
SEED-MSE ×\times BYOL R-50 nTnT 67.5 66.8 66.7 66.0 65.4 64.9 63.6 61.3 60.1
US [31]+SimCLR ×\times 4T4T 65.5 64.9 63.8 63.6 62.7 61.8 60.2 58.2 57.4
US [31]+SimSiam ×\times 4T4T 57.5 57.4 57.3 57.0 56.3 55.4 54.5 53.1 52.4
Ours ×\times 2.5𝑻\mathbf{2.5}\boldsymbol{T} 69.0 68.2 68.0 66.9 66.1 64.7 62.6 60.9 60.4
Ours (800ep) ×\times 5T5T 70.1 69.3 69.0 68.7 67.3 66.4 64.2 63.1 62.3
ResNet-50 BYOL [14] ×\times ×\times nTnT 67.0 66.7 66.5 66.3 66.0 64.9 63.8 62.1 61.2
SEED [13] ×\times BYOL R-50 nTnT 70.3 69.8 69.6 69.4 69.0 68.2 67.2 65.6 65.1
SEED-MSE ×\times BYOL R-50 nTnT 69.4 69.0 68.5 69.1 68.4 68.1 67.3 66.9 66.4
US [31]+SimCLR ×\times 4T4T 70.1 69.9 69.7 69.3 68.7 68.2 67.5 66.0 65.5
US [31]+SimSiam ×\times 4T4T 54.7 54.6 54.7 54.7 54.7 54.8 54.6 54.3 54.0
Ours ×\times 2.5𝑻\mathbf{2.5}\boldsymbol{T} 72.6 72.0 71.5 71.2 70.6 70.2 68.6 67.7 67.4
Ours (800ep) ×\times 5T5T 73.0 72.5 71.9 71.6 71.1 70.8 69.1 68.0 67.6
MobileNetv2 BYOL [14] ×\times ×\times nTnT 61.2 60.7 60.5 60.2 59.9 58.7 57.3 54.6 51.9
SEED-MSE ×\times BYOL R-50 nTnT 68.6 68.9 67.6 67.3 67.4 66.3 65.5 64.0 62.6
SEED-MSE ×\times BYOL MBv2 nTnT 63.8 63.5 63.8 63.6 63.6 63.3 62.7 62.1 59.8
US [31]+SimCLR ×\times 4T4T 56.2 56.0 55.3 55.0 54.8 54.3 54.0 53.2 52.2
US [31]+SimSiam ×\times 4T4T - - - - - - - - -
Ours ×\times 2.5𝑻\mathbf{2.5}\boldsymbol{T} 65.7 65.1 64.2 63.6 63.4 62.2 61.5 60.7 59.3

4.1 Implementation Details

Datasets. The main experiments are conducted on three benchmark datasets, i.e., CIFAR-10, CIFAR-100 and ImageNet. We also conduct transfer experiments on 5 recognition benchmarks as well as 2 detection benchmarks Pascal VOC 07&12 [12] and COCO2017 [24].

Backbones. In addition to the commonly used ResNet-50 [19], we also adopt 2 smaller networks, i.e., ResNet-18 and MobileNetv2 [29]. Moreover, we evaluate our method on vision transformers [11]. Sometimes we abbreviate ResNet-18/50 to R-18/50 and MobileNetv2 to MBv2.

Training details. We use SGD for pretraining, with a batch size of 512 and a base lr=0.5. The learning rate has a cosine decay schedule. The weight decay is 0.0001 and the SGD momentum is 0.9. We pretrain for 400 epochs on CIFAR-100 and 100 epochs on ImageNet unless otherwise specified.

4.2 Experiments on CIFAR

We compare our method with the state-of-the-art SSL methods BYOL [14], SimSiam [9] and SimCLR [7], and individually pretrain them to obtain models at different channel widths. Notice that this is not a fair comparison for our method, because they must pretrain different models for different widths (i.e., 9 pretrained models are needed for 9 widths), whereas ours is trained only once with one copy of weights. To better illustrate the effectiveness of our method, we also compare it with one strong baseline, SEED [13]. In addition to training each model separately, SEED also requires an additional pretrained teacher model. The original implementation of SEED uses an InfoNCE-based distillation loss, and we add one more variant, SEED-MSE (which uses an MSE distillation loss). We also directly adapt different SSL methods to US-Net [31] for comparison. We report the linear evaluation accuracy for the pretrained models of all methods under the same setting.

As shown in Table 2, our method consistently achieves higher accuracy than baseline methods. Taking R-18 as an example, our method achieves 2.2% and 6.1% higher accuracy than BYOL at widths of 1x and 0.25x, respectively, with less training cost. When compared with the strong baseline SEED, our method even performs better in most cases, with much less training cost and without additional dependencies. Also notice that the SEED-MSE variant performs even better than the original SEED, especially for small subnets, which is consistent with our analysis in Sec. 3.3. That is, when we have a pretrained teacher that can already generate consistent targets, it is sufficient to use MSE for distillation. We can observe similar trends and improvements for R-50. For MobileNetv2, our method outperforms individually trained BYOL and SEED-MSE (when using a pretrained MBv2 as the teacher). However, when a better teacher (e.g., R-50) is used, our method is inferior to SEED (the gap is within 3 points). Also, our method outperforms the US-Net baseline at all widths for all three backbones (the role of each component will be discussed in Sec. 4.5). Moreover, our method greatly reduces the training cost. The training time for individually trained methods is proportional to the number of sub-networks $n$ that need to be used, while ours is not affected by $n$. When compared with the original US-Net, we reduce the expected number of sampled networks in each iteration from 4 to 2.5 (see the appendix for more analyses).

In conclusion, our method outperforms various individually trained SSL algorithms and even achieves comparable accuracy with knowledge distillation, by only training once.

Table 3: Linear evaluation results for ViT on CIFAR-10.
Backbone   Method       Once Training  1.0x  0.75x  0.5x  0.25x
ViT-Tiny   MoCov3 [10]  ×              82.6  79.5   75.8  68.0
           US+MoCov3    ✓              79.8  79.4   77.6  76.4
           Ours         ✓              86.0  84.7   83.3  80.2
ViT-Small  MoCov3 [10]  ×              88.0  86.8   83.0  75.5
           US+MoCov3    ✓              88.2  87.5   86.3  84.9
           Ours         ✓              90.3  89.7   88.7  85.5

4.3 Experiments with Vision Transformer

To further demonstrate the efficacy of our method, we evaluate our method on vision transformers [11] (ViTs) and adopt the popular MoCov3 [10] as our baseline method. We employ the official code and experiment on CIFAR-10. To the best of our knowledge, we are the first to study slimmable self-supervised ViTs and we directly reduce the embedding dimension in all layers. As shown in Table 3, our method significantly outperforms individually trained MoCov3. Also, our method surpasses US+MoCov3 (adapt MoCov3 to US-Net) at all widths despite using less training cost. In conclusion, the results show that our method still works even for complex architectures such as vision transformers.

4.4 ImageNet and Transferring Experiments

In this subsection, we do unsupervised pretraining on the large-scale ImageNet training set without using labels. The linear evaluation results on ImageNet are shown in Table 4. Also, we evaluate the transfer ability of the learned representations on ImageNet later. We train one model for our method and 4 separate models for BYOL, each of which is trained for 100 epochs. As shown in Table 4, our US3L achieves higher accuracy than BYOL at all widths and our advantages become greater as the width shrinks.

Table 4: Linear evaluation results on ImageNet.
Backbone   Method   Once Training  1.0x  0.75x  0.5x  0.25x
ResNet-18  BYOL     ×              54.0  53.7   47.4  34.9
           US+BYOL  ✓              55.9  53.1   48.0  40.6
           Ours     ✓              56.9  54.5   48.7  40.7
ResNet-50  BYOL     ×              68.1  66.3   61.2  50.9
           US+BYOL  ✓              64.7  64.3   62.6  57.1
           Ours     ✓              68.4  66.7   63.4  57.7

We investigate the downstream object detection performance on Pascal VOC 07&12 in Fig. 3 and COCO2017 in Table 5. The detector is Faster R-CNN [27] for VOC and Mask R-CNN [18] for COCO (both with an FPN [23] backbone), following [3]. Fig. 3 shows that our method outperforms BYOL on Pascal VOC at a width of 1.0x. Also, as we decrease the width, our advantages over the baseline counterpart BYOL are further expanded: up to +2.6 and +18.5 AP$_{50}$ at widths of 0.5x and 0.25x, respectively. We can reach similar conclusions on COCO2017 from Table 5. Although our method achieves comparable accuracy to BYOL at 1.0x on COCO, we achieve +0.9 and +1.5 AP$^{\text{bb}}$ at 0.5x and 0.25x, respectively. Note that our improvements on COCO are not as large as those on VOC. This is because the amount of training data in COCO is large enough to close the gap between different pretrained models, as noted in [17].

We also transfer the ImageNet learned representations to 5 downstream recognition benchmarks in Table 6. As seen, our method improves a lot on all recognition benchmarks (except for pets) under linear evaluation.

Figure 3: Transfer results on Pascal VOC 07&12 under R50-FPN.
Table 5: Transfer results on COCO2017 object detection & instance segmentation under R-50 FPN.
Width  Method  AP^bb  AP^bb_50  AP^bb_75  AP^mk  AP^mk_50  AP^mk_75
1.0x   BYOL    37.9   57.8      40.9      33.2   54.3      35.0
       Ours    38.3   58.0      41.2      33.6   54.6      35.3
0.75x  BYOL    35.7   55.3      38.6      32.4   52.4      34.5
       Ours    36.2   55.8      39.0      32.8   52.9      35.0
0.5x   BYOL    32.6   51.5      35.2      29.9   48.7      31.7
       Ours    33.5   52.7      35.8      30.6   49.8      32.3
0.25x  BYOL    26.0   43.5      27.0      24.2   40.8      25.3
       Ours    27.5   45.0      29.3      25.5   42.4      26.9
Table 6: Transfer results on recognition benchmarks under linear evaluation. ‘C-10/100’ denotes ‘CIFAR-10/100’.
Net Width Params MACS Method Linear Accuracy (%)
C-10 C-100 Flowers Pets Dtd
R-50 1.0x 22.56M 4.11G BYOL 87.1 60.6 81.0 80.9 70.7
Ours 87.1 61.5 90.6 79.4 72.6
0.75x 14.77M 2.34G BYOL 83.6 52.8 82.9 74.2 66.3
Ours 84.4 56.9 89.7 78.0 71.1
0.5x 06.92M 1.06G BYOL 80.6 52.0 74.8 75.0 65.9
Ours 81.6 52.8 88.1 76.8 68.8
0.25x 01.99M 0.28G BYOL 75.9 46.2 75.4 64.7 61.4
Ours 78.9 49.9 84.4 74.0 64.9

4.5 Ablation Studies

In this subsection, we demonstrate the effectiveness of the proposed three guidelines as well as the proposed strategies.

Table 7: Ablation studies of the loss design under ResNet-18 on CIFAR-100. ‘-’ denotes the model collapses.
Base Loss Case Distill Auxiliary Momentum Target Linear Accuracy (%)
Loss Distill Head Base Network Sub Network 1.0x 0.9x 0.8x 0.7x 0.6x 0.5x 0.4x 0.3x 0.25x
MSE 1 ×\times ×\times ×\times ×\times - - - - - - - - -
2 MSE ×\times ×\times ×\times 57.5 57.4 57.3 57.0 56.3 55.4 54.5 53.1 52.4
3 MSE ×\times 64.7 64.7 64.5 64.3 63.9 62.6 61.3 59.7 59.3
4 MSE 65.4 65.0 64.8 64.5 63.8 62.7 61.1 59.8 58.9
5 InfoNCE ×\times ×\times ×\times 62.3 62.3 62.3 62.2 61.8 60.6 58.9 57.6 57.2
6 InfoNCE ×\times ×\times 63.7 63.8 63.7 63.6 63.1 62.0 60.6 59.3 58.2
7 InfoNCE ×\times 65.0 65.0 65.1 65.0 64.5 62.7 61.3 59.8 59.2
8 InfoNCE 65.5 65.5 65.6 65.0 64.6 63.2 61.6 60.2 59.7
InfoNCE 9 ×\times ×\times ×\times ×\times 64.8 64.0 63.2 62.0 60.8 59.8 57.4 55.1 54.2
10 MSE ×\times ×\times ×\times 65.0 64.4 63.1 62.3 61.9 60.3 58.3 57.1 56.6
11 MSE ×\times ×\times 65.8 65.0 64.4 63.4 62.7 61.8 59.8 58.5 57.6
12 MSE ×\times 66.9 66.3 65.7 64.9 63.8 62.9 61.6 59.5 59.1
13 MSE 67.7 67.2 66.5 66.0 65.1 64.3 62.5 60.5 59.6
14 InfoNCE ×\times ×\times ×\times 65.5 64.9 63.8 63.6 62.7 61.8 60.2 58.2 57.4
15 InfoNCE ×\times ×\times 64.7 64.5 64.0 63.6 62.3 61.4 59.8 58.4 57.9
16 InfoNCE ×\times 66.0 65.4 64.8 64.3 63.8 62.4 61.1 59.8 58.7
17 InfoNCE 67.4 66.0 66.1 65.6 64.7 64.0 62.2 60.2 59.5
Table 8: Ablation studies of our strategies on CIFAR-100.
Backbone Dynamic Group Linear Accuracy (%)
Sampling Reg. 1.0x 0.8x 0.6x 0.5x 0.3x 0.25x
R-18 ×\times ×\times 67.7 66.5 65.1 64.3 60.5 59.6
×\times 68.6 67.2 65.5 64.6 60.7 59.9
×\times 68.6 67.3 65.5 64.4 60.9 60.1
69.0 68.0 66.1 64.7 60.9 60.4
R-50 ×\times ×\times 71.0 70.6 70.0 69.1 67.2 66.8
×\times 71.8 71.1 70.2 69.3 67.3 67.2
×\times 71.9 71.1 70.0 69.6 67.7 67.5
72.6 71.5 70.6 70.2 67.7 67.4
MBv2 ×\times ×\times 62.9 62.0 61.5 60.4 59.6 58.7
×\times 64.7 63.3 62.3 61.7 60.7 59.2
×\times 64.0 63.2 62.1 61.4 60.2 59.0
65.7 64.2 63.4 62.2 60.7 59.3

4.5.1 Effectiveness of The Three Guidelines

We investigate various combinations of training loss, distillation head and momentum target in Table 7. The 'Base Loss' column represents the loss function for the base (i.e., largest) model. The 'Distill Loss' column represents the distillation loss for sub-networks ('×' means that sub-networks use the same loss as the base network, without distillation). The 'Auxiliary Distill Head' column indicates whether to use an additional head for distillation (i.e., $g(\cdot)$ in (3)). The 'Momentum Target' column indicates whether to maintain a momentum encoder of the base network, and the 'Base/Sub Network' columns indicate whether the base/sub networks use the output of the momentum encoder as the target. We have the following observations from Table 7:

  • As aforementioned, the model collapses if we use an individual MSE loss for sub-networks in case 1 (i.e., SimSiam). This phenomenon is completely different from the supervised case, where each sub-network can be trained with the cross-entropy loss alone and achieve good results. We conjecture that the base and sub-networks directly imitate their respective targets when using MSE, which brings instability to the entire model, as analyzed in Sec. 3.3.

  • Following US-Net, we use inplace distillation to guide sub-networks in case 2, and it solves the collapse problem. This is because all sub-networks are now aligned to the same target, which ensures cross-subnet consistency. Nevertheless, there is still a large gap (nearly 10 points) compared with individually trained self-supervised networks.

  • If we further use a momentum teacher, we find consistent improvements in Table 7 (e.g., case 3 vs. case 2). When comparing case 7 and case 6, we find that it is better to align all networks to the same target, which again confirms our analysis of guidance consistency. Consistency should exist not only between iterations but also across sub-networks (all sub-networks should use the same target).

  • The experimental results are in full agreement with the proposed three guidelines. Lemma 3.1 states that InfoNCE can obtain a more stable gradient than MSE. Notice that the loss function consists of the base and distillation losses. (i) When the base loss uses MSE, the output of the base network will be unstable, so the sub-networks need to use the InfoNCE loss for distillation to deal with this instability (case 5 vs. case 2). (ii) When the base loss uses InfoNCE, the base network can achieve better temporal stability. Hence, the sub-networks already get stable targets from the teacher, and using either MSE or InfoNCE for distillation (cases 10 & 14) can achieve good results. In short, in the absence of a momentum teacher, at least one of the base loss and the distillation loss should use InfoNCE to ensure stability.

  • The use of an auxiliary distillation head results in consistent improvements when we compare the last two rows of each block in Table 7 (e.g., case 8 vs. case 7).

4.5.2 Effectiveness of Our Strategies

We study the effect of our proposed strategies in Table 8 and the baseline is the best practice in Table 7 (i.e., case 13). We can have the following conclusions from Table 8:

  • Our dynamic sampling strategy significantly improves the accuracy of the base network as well as the sub-networks. Although the sub-networks are trained for fewer iterations in total (they are not trained at the start), we can still guarantee the performance of small sub-networks. This is because we obtain a better and more stable teacher, and distillation then speeds up the convergence of the sub-networks.

  • Our group regularization can also significantly improve the accuracy of the largest model while improving the accuracy of sub-networks. The experimental results verify our analysis in Sec. 3.2.3 that our grouping strategy can alleviate the problem of limited model capacity in US-Net.

We also study the intersection of the sandwich rule and dynamic sampling in Table 9. As seen, our dynamic sampling strategy can be used alone or combined with the sandwich rule, which brings improved performance for both scenarios. The results further demonstrate the universality and effectiveness of our dynamic sampling.

Table 9: Intersection of the sandwich rule and dynamic sampling on CIFAR-100 under ResNet-18 (linear accuracy, %).
Sandwich Rule  Dynamic Sampling  1.0x  0.8x  0.6x  0.5x  0.3x  0.25x
×              ×                 65.1  65.0  64.7  63.2  59.4  56.4
×              ✓                 67.4  67.3  65.9  64.7  59.9  58.7
✓              ×                 67.7  66.5  65.1  64.3  60.5  59.6
✓              ✓                 68.6  67.3  65.5  64.4  60.9  60.1

5 Conclusions

In this paper, we proposed a method called US3L for training universally slimmable self-supervised models. We provided theoretical analyses of the loss design and proposed three guidelines to ensure temporal consistency for US-Net training. Moreover, we proposed dynamic sampling and group regularization to address the problems of inefficient training and limited model capacity. Experiments on various benchmarks and architectures (both CNNs and ViTs) show that our method significantly outperforms various baselines. When transferring to various downstream tasks, our models exhibit significant advantages at different widths, with only one round of training and one copy of weights. In the future, we will try to compress along dimensions such as depth and kernel size and explore combinations with NAS methods.

References

  • [1] Lucas Beyer, Xiaohua Zhai, Amelie Royer, Larisa Markeeva, Rohan Anil, and Alexander Kolesnikov. Knowledge distillation: A good teacher is patient and consistent. In The IEEE Conference on Computer Vision and Pattern Recognition, pages 10925–10934, 2022.
  • [2] Han Cai, Chuang Gan, Tianzhe Wang, Zhekai Zhang, and Song Han. Once-for-all: Train one network and specialize it for efficient deployment. In The International Conference on Learning Representations, pages 1–14, 2020.
  • [3] Yun-Hao Cao, Peiqin Sun, Yechang Huang, Jianxin Wu, and Shuchang Zhou. Synergistic self-supervised and quantization learning. In The European Conference on Computer Vision, volume 13690 of LNCS, page 587–604. Springer, 2022.
  • [4] Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. Deep clustering for unsupervised learning of visual features. In The European Conference on Computer Vision, volume 11218 of LNCS, pages 132–149. Springer, 2018.
  • [5] Qing Chang, Junran Peng, Lingxi Xie, Jiajun Sun, Haoran Yin, Qi Tian, and Zhaoxiang Zhang. DATA: Domain-aware and task-aware self-supervised learning. In The IEEE Conference on Computer Vision and Pattern Recognition, pages 9841–9850, 2022.
  • [6] Arnav Chavan, Zhiqiang Shen, Zhuang Liu, Zechun Liu, Kwang-Ting Cheng, and Eric Xing. Vision transformer slimming: Multi-dimension searching in continuous optimization space. In The IEEE Conference on Computer Vision and Pattern Recognition, pages 4931–4941, 2022.
  • [7] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In The International Conference on Machine Learning, pages 1597–1607, 2020.
  • [8] Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297, 2020.
  • [9] Xinlei Chen and Kaiming He. Exploring simple siamese representation learning. In The IEEE Conference on Computer Vision and Pattern Recognition, pages 15750–15758, 2021.
  • [10] Xinlei Chen, Saining Xie, and Kaiming He. An empirical study of training self-supervised vision transformers. In The IEEE International Conference on Computer Vision, pages 9640–9649, 2021.
  • [11] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, pages 1–12, 2021.
  • [12] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338, 2010.
  • [13] Zhiyuan Fang, Jianfeng Wang, Lijuan Wang, Lei Zhang, Yezhou Yang, and Zicheng Liu. SEED: Self-supervised distillation for visual representation. In The International Conference on Learning Representations, pages 1–12, 2021.
  • [14] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H. Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, Bilal Piot, Koray Kavukcuoglu, Rémi Munos, and Michal Valko. Bootstrap your own latent: A new approach to self-supervised learning. In Advances in Neural Information Processing Systems, pages 21271–21284, 2020.
  • [15] Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. In The International Conference on Learning Representations, pages 1–14, 2016.
  • [16] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In The IEEE Conference on Computer Vision and Pattern Recognition, pages 9729–9738, 2020.
  • [17] Kaiming He, Ross Girshick, and Piotr Dollár. Rethinking imagenet pre-training. In The IEEE International Conference on Computer Vision, pages 4918–4927, 2019.
  • [18] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In The IEEE International Conference on Computer Vision, pages 2961–2969, 2017.
  • [19] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
  • [20] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
  • [21] Alex Krizhevsky and Geoffrey E. Hinton. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
  • [22] Changlin Li, Guangrun Wang, Bing Wang, Xiaodan Liang, Zhihui Li, and Xiaojun Chang. Dynamic slimmable network. In The IEEE Conference on Computer Vision and Pattern Recognition, pages 8607–8617, 2021.
  • [23] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In The IEEE Conference on Computer Vision and Pattern Recognition, pages 2117–2125, 2017.
  • [24] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In The European Conference on Computer Vision, volume 8693 of LNCS, pages 740–755. Springer, 2014.
  • [25] Jian-Hao Luo, Jianxin Wu, and Weiyao Lin. Thinet: A filter level pruning method for deep neural network compression. In The IEEE International Conference on Computer Vision, pages 5058–5066, 2017.
  • [26] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In The European Conference on Computer Vision, volume 9910 of LNCS, pages 69–84. Springer, 2016.
  • [27] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
  • [28] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
  • [29] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In The IEEE Conference on Computer Vision and Pattern Recognition, pages 4510–4520, 2018.
  • [30] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
  • [31] Jiahui Yu and Thomas Huang. Universally slimmable networks and improved training techniques. In The IEEE International Conference on Computer Vision, pages 1803–1811, 2019.
  • [32] Jiahui Yu, Linjie Yang, Ning Xu, Jianchao Yang, and Thomas Huang. Slimmable Neural Networks. In The International Conference on Learning Representations, pages 1–12, 2019.
  • [33] Shuchang Zhou, Yuxin Wu, Zekun Ni, Xinyu Zhou, He Wen, and Yuheng Zou. Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160, 2016.

Appendix A Training Details

Datasets. The statistics of the recognition benchmarks used in our paper are shown in Table 10.

Table 10: Statistics of the recognition benchmarks used in the paper.
Datasets # Category # Training # Testing
CIFAR-10 10 50,000 10,000
CIFAR-100 100 50,000 10,000
Flowers 102 2,040 6,149
Pets 37 3,680 3,669
DTD 47 3,760 1,880

Training details for SSL methods. Training details for SimCLR, BYOL, SimSiam and our US3L on CIFAR-100 are shown in Table 11.

Table 11: Training details for SimCLR, BYOL, SimSiam and our US3L on CIFAR-100 in Table 2, Table 7 and Table 8. $\tau$ denotes the temperature parameter, and $m$ denotes the momentum coefficient for the momentum network.
Method Settings
bs lr wd epochs optimizer lr sched. $\tau$ $m$ dim
SimSiam 512 0.1 5e-4 400 SGD cosine - - 2048
SimCLR 512 0.5 1e-4 400 SGD cosine 0.5 - 2048
BYOL 512 0.1 5e-4 400 SGD cosine - 0.99 2048
Ours 512 0.5 1e-4 400 SGD cosine 0.5 0.99 2048

Training details for linear evaluation and fine-tuning. For ImageNet linear evaluation, we follow the same settings in [9]. For linear evaluation on other datasets, we train for 100 epochs with lr initialized to 30.0, which is divided by 10 at the 60-th and 80-th epoch.

Source codes. We promise that all codes will be made publicly available upon acceptance of the paper.

Appendix B More Results

B.1 Transfer Results for ResNet-18

We plot the downstream object detection performance on Pascal VOC07&12 for ResNet-18 FPN in Fig. 4. Moreover, we present the transfer results on downstream recognition benchmarks for ResNet-18 in Table 12. The results show that our method is also effective for ResNet-18 when transferring to downstream object detection and recognition tasks.

Figure 4: Transfer results on Pascal VOC 07&12 under R18-FPN.
Table 12: Transfer results on recognition benchmarks under linear evaluation. ‘C-10/100’ denotes ‘CIFAR-10/100’.
Net Width Params MACS Method Linear Accuracy (%)
C-10 C-100 Flowers Pets Dtd
R-18 1.0x 11.69M 1.82G BYOL 76.6 48.6 83.0 71.1 64.8
Ours 77.9 52.6 84.9 71.2 65.2
0.75x 06.68M 1.05G BYOL 76.4 48.1 82.6 71.0 63.7
Ours 76.7 49.0 83.5 71.0 63.7
0.5x 03.06M 0.49G BYOL 74.6 46.9 81.4 67.6 61.7
Ours 75.2 47.4 82.0 67.3 62.6
0.25x 00.83M 0.14G BYOL 67.0 41.2 75.5 57.6 56.4
Ours 67.8 41.4 77.0 60.6 56.6

B.2 Ablation Studies of Dynamic Sampling

In this subsection, we conduct ablation studies of our dynamic sampling strategy and show how we successfully reduce the sampling number $s$ from 4 to 3 by using dynamic sampling while improving accuracy. We also compute the expected total forward number for $T$ iterations. Taking our dynamic sampling as an example, we train the largest model only in the first $\frac{T}{4}$ iterations and sample three sub-networks in the last $\frac{3T}{4}$ iterations; hence, the expected total forward number is:

1\times\frac{T}{4}+3\times\frac{3T}{4}=2.5T\,. (12)

As shown in Table 13, our dynamic sampling strategy achieves the best accuracy-efficiency trade-off. Notice that we also investigate the two components in our dynamic sampling strategy separately and we can clearly see that ‘max first’ reduces the training overhead (case 4) and ‘gradually reduce’ improves the accuracy (case 3).

Table 13: Ablation studies of the sampling strategy under ResNet-18 on CIFAR-100. $T$ denotes the total number of iterations. 'Max first' denotes whether to train only the largest network in early epochs. 'Gradually reduce' denotes whether to gradually reduce the width of the smallest network.
Case  Sampling number s  Expected forward number  Max first  Gradually reduce  1.0x  0.9x  0.8x  0.7x  0.6x  0.5x  0.4x  0.3x  0.25x
1     4                  4T                       ×          ×                 68.1  67.4  67.0  66.3  65.3  64.4  62.7  60.8  59.9
2     3                  3T                       ×          ×                 67.7  67.2  66.5  66.0  65.1  64.3  62.5  60.5  59.6
3     3                  3T                       ×          ✓                 68.5  67.9  67.2  66.4  65.3  64.5  62.6  61.0  60.2
4     3                  2.5T                     ✓          ×                 67.8  67.6  66.8  66.2  65.3  64.5  63.0  60.6  59.8
5     3                  2.5T                     ✓          ✓                 68.6  68.1  67.2  66.6  65.5  64.6  62.8  60.7  59.9

B.3 Hyper-parameter Studies of $G$ and $\alpha$

In this subsection, we study the choice of the hyper-parameters $G$ and $\alpha$ in our group regularization (Eq. (5)) in Table 14. Notice that when $\alpha=0$, group regularization is equivalent to the standard $L_{2}$ regularization (case 1). We used $G=8$ and $\alpha=0.05$ in the paper. We also present the max decay fraction $G\times\alpha$ (i.e., the decay rate for the last group). Table 14 shows that we achieve the best results when $G\times\alpha=40\%$ (cases 1 to 5). Then we keep $G\times\alpha=40\%$ and change $G$ to 4 or 16. In terms of the hyper-parameter $G$, we can see that $G=8$ (case 3) outperforms $G=4$ (case 6), and accuracy saturates and does not continue to increase beyond 8 ($G=16$, case 7). It is worth noting that if we set $\alpha$ to a negative value, which goes against our motivation (case 9), we no longer see the performance gains for large sub-networks as before. The results here further validate the effectiveness of our method and the correctness of our analysis in the paper.

Table 14: Hyper-parameter studies of $G$ and $\alpha$ in group regularization under ResNet-18 on CIFAR-100.
Case  Max decay fraction $G\times\alpha$  Group number $G$  Decay rate $\alpha$  1.0x 0.9x 0.8x 0.7x 0.6x 0.5x 0.4x 0.3x 0.25x  Avg.  1.0x diff
1 0% - 0 67.7 67.2 66.5 66.0 65.1 64.3 62.5 60.5 59.6 64.4 +0.0%
2 20% 8 0.025 68.0 67.4 66.3 66.0 65.2 64.0 62.6 61.0 60.3 64.5 +0.3%
3 40% 8 0.05 68.6 67.8 67.3 66.4 65.5 64.4 63.1 60.9 60.1 64.9 +0.9%
4 60% 8 0.075 68.2 67.6 66.9 66.1 65.3 64.2 62.7 61.0 60.2 64.7 +0.5%
5 80% 8 0.1 67.7 67.0 66.7 66.4 65.8 64.2 62.8 61.3 60.5 64.7 +0.0%
6 40% 4 0.1 68.0 67.4 66.6 66.1 65.1 63.8 63.0 61.6 60.6 64.7 +0.3%
7 40% 16 0.025 68.7 67.6 67.3 66.5 65.5 64.7 62.8 61.3 60.1 64.9 +1.0%
8 80% 16 0.05 67.9 67.3 66.7 66.3 65.5 64.2 63.1 61.1 60.5 64.7 +0.2%
9 -40% 8 -0.05 67.4 67.0 66.3 65.9 64.7 64.0 62.8 61.2 60.1 64.4 -0.3%
Table 15: Effect of group regularization in BYOL and SimCLR under ResNet-18 on CIFAR-100. Our group regularization strategy is tailored for US-Net.
Method Model Once Group Linear Accuracy (%)
Type Training Regularization 1.0x 0.9x 0.8x 0.7x 0.6x 0.5x 0.4x 0.3x 0.25x
BYOL individual ×\times ×\times 66.8 66.0 65.6 65.3 63.0 62.1 59.5 56.0 54.3
66.0 66.2 65.6 64.4 63.4 61.8 59.3 56.1 54.0
SimCLR individual ×\times ×\times 66.5 65.4 64.7 63.7 62.6 61.0 59.0 56.1 53.6
65.7 65.0 65.0 63.3 62.6 61.0 58.8 56.0 53.4
Ours US-Net ×\times 67.7 67.2 66.5 66.0 65.1 64.3 62.5 60.5 59.6
68.6 67.8 67.3 66.4 65.5 64.4 63.1 60.9 60.1

B.4 Group Regularization is Tailored for US-Net

In this subsection, we will demonstrate that our group regularization strategy is tailored for US-Net and our analysis in the paper is valid. We apply group regularization to common SSL methods which train each model individually. As shown in Table 15, the introduction of group regularization does not bring improvements for BYOL and SimCLR, which are individually trained. It shows that the group regularization is tailored for US-Net, and the improvement is due to our unique design rather than factors such as hyper-parameters.

B.5 Our US3L Can Run at Arbitrary Width

Note that we only reported the results of width at [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.25]x for CIFAR-100 and [1.0, 0.75, 0.5, 0.25]x for ImageNet due to limited space. Actually, the pretrained model of our US3L can run at any width within the predefined width range, by only training once. As a supplement, we present the results of more widths on CIFAR-100 in Table 16 and we can see that our pretrained model can achieve a good accuracy-efficiency trade-off.

Table 16: Results of our US3L method at different widths under ResNet-18 and ResNet-50 on CIFAR-100. Our US3L can run at arbitrary width and we only reported partial results as a representative in the paper due to limited space.
Method Backbone Linear Accuracy (%)
1.0x 0.95x 0.9x 0.85x 0.8x 0.75x 0.7x 0.65x 0.6x 0.55x 0.5x 0.45x 0.4x 0.35x 0.3x 0.275x 0.25x
Ours (800ep) R-18 70.1 69.6 69.3 69.2 69.0 68.4 68.7 68.0 67.3 66.7 66.4 65.4 64.2 63.6 63.1 63.1 62.3
R-50 73.0 72.9 72.5 72.1 71.9 71.6 71.6 71.2 71.1 71.0 70.8 69.9 69.1 68.3 68.0 67.8 67.6

B.6 Figures

As a supplement to Table 2 in the paper, we plot the results here to more intuitively see the advantages of our method. We compare with individually trained methods in Fig. 5 and compare with the US-Net baseline in Fig. 6.

Figure 5: Comparison with individually trained baselines on CIFAR-100. All scatter points are individually trained, whereas our method is trained only once (the red line).
Figure 6: Comparison with the original US-Net baseline on CIFAR-100. All methods are trained only once for 400 epochs.

B.7 Ablation Studies of Loss Design

We present more ablation results of loss design here in Table 17, as a supplement to Table 7 in the paper. We look more closely at the ‘Asymmetric Distill Head’ column, which indicates whether to use an additional head for distillation. Notice that there is already an asymmetrical head itself in MSE-based methods like SimSiam and BYOL. So ‘Share’ refers to sharing the asymmetrical head, and ‘New’ refers to distillation using a brand new head.

Table 17: Ablation studies of the loss design under ResNet-18 on CIFAR-100. ‘-’ denotes the model collapses.
Base Loss Case Distill Asymmetric Momentum Target Linear Accuracy (%)
Loss Distill Head Base model Sub model 1.0x 0.9x 0.8x 0.7x 0.6x 0.5x 0.4x 0.3x 0.25x
MSE 1 ×\times ×\times ×\times ×\times - - - - - - - - -
2 MSE ×\times ×\times ×\times - - - - - - - - -
3 MSE ✓(Share) ×\times ×\times 57.5 57.4 57.3 57.0 56.3 55.4 54.5 53.1 52.4
4 MSE ✓(Share) ×\times - - - - - - - - -
5 MSE ×\times ×\times - - - - - - - - -
6 MSE ×\times - - - - - - - - -
7 MSE ✓(Share) 64.7 64.7 64.5 64.3 63.9 62.6 61.3 59.7 59.3
8 MSE ✓(New) 65.4 65.0 64.8 64.5 63.8 62.7 61.1 59.8 58.9
9 InfoNCE ×\times ×\times ×\times 62.3 62.3 62.3 62.2 61.8 60.6 58.9 57.6 57.2
10 InfoNCE ✓(Share) ×\times ×\times 58.7 58.8 58.8 58.9 58.7 58.4 56.8 55.3 54.3
11 InfoNCE ✓(New) ×\times ×\times 61.5 61.4 61.6 61.6 61.1 60.3 58.7 57.1 56.3
12 InfoNCE ×\times ×\times 63.7 63.8 63.7 63.6 63.1 62.0 60.6 59.3 58.2
13 InfoNCE ✓(Share) ×\times - - - - - - - - -
14 InfoNCE ✓(New) ×\times 64.5 64.5 64.6 64.5 64.2 63.2 62.1 60.0 59.1
15 InfoNCE ×\times 65.0 65.0 65.1 65.0 64.5 62.7 61.3 59.8 59.2
16 InfoNCE ✓(Share) 65.0 64.9 64.9 64.4 64.1 62.8 61.1 60.0 59.5
17 InfoNCE ✓(New) 65.5 65.5 65.6 65.0 64.6 63.2 61.6 60.2 59.7
InfoNCE 18 ×\times ×\times ×\times ×\times 64.8 64.0 63.2 62.0 60.8 59.8 57.4 55.1 54.2
19 MSE ×\times ×\times ×\times 65.0 64.4 63.1 62.3 61.9 60.3 58.3 57.1 56.6
20 MSE ×\times ×\times 65.8 65.0 64.4 63.4 62.7 61.8 59.8 58.5 57.6
21 MSE ×\times 66.7 66.0 65.6 64.5 63.3 62.0 60.8 59.3 58.2
22 MSE ×\times 66.9 66.3 65.7 64.9 63.8 62.9 61.6 59.5 59.1
23 MSE 67.7 67.2 66.5 66.0 65.1 64.3 62.5 60.5 59.6
24 InfoNCE ×\times ×\times ×\times 65.5 64.9 63.8 63.6 62.7 61.8 60.2 58.2 57.4
25 InfoNCE ×\times ×\times 64.7 64.5 64.0 63.6 62.3 61.4 59.8 58.4 57.9
26 InfoNCE ×\times 66.1 66.0 65.4 64.4 63.4 62.3 60.8 59.1 58.6
27 InfoNCE ×\times 66.0 65.4 64.8 64.3 63.8 62.4 61.1 59.8 58.7
28 InfoNCE 67.4 66.0 66.1 65.6 64.7 64.0 62.2 60.2 59.5

Appendix C More Analysis

Lemma C.1

$s=3$ is the theoretical minimum number of samples for US-Net [31].

Proof.  First, from [31] we know the sandwich rule: the performance at any width is bounded by the performance of the models at the smallest and the largest widths. In other words, optimizing the lower and upper bounds of performance can implicitly optimize all sub-networks in a US-Net, which requires training the smallest and the largest networks in each iteration. To optimize for arbitrary widths, we additionally need at least one randomly sampled width per iteration. In conclusion, $s=3$ is the theoretical minimum number of samples for US-Net. $\square$