Multi-view Contrastive Learning for Online Knowledge Distillation
Abstract
Previous Online Knowledge Distillation (OKD) methods often mutually exchange probability distributions, but neglect the useful representational knowledge. We therefore propose Multi-view Contrastive Learning (MCL) for OKD to implicitly capture correlations among the feature embeddings encoded by multiple peer networks, which provide various views for understanding the input data instances. Benefiting from MCL, we can learn a more discriminative representation space for classification than previous OKD methods. Experimental results on image classification demonstrate that our MCL-OKD outperforms other state-of-the-art OKD methods by large margins without incurring additional inference cost. Code is available at https://github.com/winycg/MCL-OKD.
Index Terms— Online Knowledge Distillation, Multi-view Contrastive Learning, Image Classification
1 Introduction
Although modern convolutional neural networks achieve remarkable performance on image classification tasks, it remains difficult to deploy a high-performing model on resource-limited edge devices. Typical solutions include efficient architecture design [1, 2, 3], model pruning [4, 5], dynamic inference [6] and knowledge distillation [7]. The idea of Knowledge Distillation (KD) is to transfer useful information from an excellent yet cumbersome teacher model to a student model with low complexity. Alternatively, teacher-free Online Knowledge Distillation (OKD) [8] is a more practical approach for improving the performance of a given model. OKD frameworks typically rely on mutual and cooperative knowledge transfer among several student models trained from scratch.
Popular OKD methods [9, 10, 11] focus on mutually transferring instance-level class probability distributions among the student models, but neglect the more informative representational knowledge for online transfer. In this paper, we introduce Multi-view Contrastive Learning (MCL) to implicitly capture correlations among the representations of data instances encoded by multiple peer networks, where each peer network represents one view for understanding the input. In MCL, we maximize the agreement between representations of the same input instance across views, while pushing apart representations of input instances with different labels. Our motivation is inspired by how different people view the same object in the real world with individual understandings: their consensus is usually robust for discriminating the object, even though each person may also carry an individual inductive bias that acts as noise in his or her understanding. Here, the peer networks play the role of such a group of people, and we aim to model the person-invariant representations.
Similar to previous OKD methods [9, 10, 11], our training graph contains multiple identical networks, except that we add fully-connected layers that linearly transform representations into a contrastive embedding space, in which we apply a pair-wise contrastive loss among all peer networks. Moreover, we build an ensemble teacher from all online peer networks. The ensemble teacher transfers probabilistic knowledge to a designated student network, which is used for final deployment. Based on the above techniques, we name our framework MCL-OKD.
We conduct experiments on the image classification tasks of CIFAR-100 [12] and ImageNet [13] across widely used networks to compare MCL-OKD against other State-Of-The-Art (SOTA) OKD methods. The results show that MCL-OKD achieves the best performance for optimizing a given network. Further experiments on few-shot classification demonstrate the superiority of MCL-OKD for learning a discriminative feature space in metric-learning tasks.
2 Related Works
Contrastive learning. The core idea of contrastive learning is to contrast positive pairs against negative pairs in a feature embedding space. Many prior works define positive and negative pairs from two views. Deep InfoMax [14] matches the input with its output from the neural network encoder. Instance Discrimination [15] learns to contrast the current embedding with previous embeddings stored in an online memory bank. SimCLR [16] treats two different augmentations of the same data sample as two views and maximizes the consistency between them. Beyond two views, CMC [17] and AMDIM [18] perform contrastive multi-view coding across multiple sensory channels or independently augmented copies of the input image, respectively. For multi-view learning more broadly, views can also be derived from multi-modal signals [19] such as vision, sound and touch. In comparison, we implement MCL by leveraging multiple peer networks to encode the same data instance, rather than creating views from the data itself as in previous contrastive learning, which makes our method naturally suited to the joint-training scenario of OKD.
Online knowledge distillation. The seminal OKD method, Deep Mutual Learning (DML) [8], demonstrates that knowledge transfer between two peer student models during online training achieves notably better performance than independent training. Inspired by this insight, ONE [9] and CL-ILR [10] propose frameworks that share the low-level layers to reduce training complexity and perform knowledge transfer among branches of high-level layers. OKDDip [11] alleviates the homogenization problem of ONE by introducing two-level distillation and a self-attention mechanism. All of these methods operate on probabilistic outputs to perform OKD and differ only in the form of supervision. Our MCL-OKD further improves OKD from the perspective of representation learning.
3 Methodology
3.1 Distillation framework
As depicted in Fig. 1, $M$ peer networks participate in the distillation process during training, where each network consists of a CNN feature extractor and a linear classifier. For contrastive learning, we aim to optimize the embeddings after the Global Average Pooling (GAP) layer of each peer network. We linearly transform these embeddings, with $\ell_2$-normalization, into a shared contrastive embedding space with an embedding size of 128. Given an instance $x_i$, the generated contrastive embeddings are denoted as $\{v_m^i\}_{m=1}^{M}$, where $v_m^i$ is produced by the $m$-th peer network $f^m$. Similar to previous branch-based OKD [9], low-level layers can also be shared across the peer networks to reduce complexity and regularize training.
Training and deployment. At the training stage, we jointly optimize all $M$ peer networks together with their contrastive embedding layers. At the test stage, we discard the auxiliary networks $f^1, \dots, f^{M-1}$ and keep only the last network $f^M$, resulting in no additional inference cost.
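As a concrete illustration, the following minimal PyTorch sketch shows how one peer network with a contrastive embedding head could be structured. The class and argument names (`PeerNetwork`, `feat_dim`, `embed_dim`) are illustrative assumptions, not a reproduction of our released code.

```python
import torch.nn as nn
import torch.nn.functional as F

class PeerNetwork(nn.Module):
    """One peer: CNN feature extractor + linear classifier + projection head that
    maps the GAP feature into a 128-d l2-normalized contrastive embedding."""
    def __init__(self, backbone: nn.Module, feat_dim: int, num_classes: int, embed_dim: int = 128):
        super().__init__()
        self.backbone = backbone                     # any CNN returning pooled features
        self.classifier = nn.Linear(feat_dim, num_classes)
        self.proj = nn.Linear(feat_dim, embed_dim)   # contrastive embedding head

    def forward(self, x):
        feat = self.backbone(x)                      # (B, feat_dim) after global average pooling
        logits = self.classifier(feat)               # used for the CE / KD losses
        v = F.normalize(self.proj(feat), dim=1)      # l2-normalized 128-d embedding for MCL
        return logits, v

# M such peers are trained jointly; only the last one is kept at test time.
```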

3.2 Learning objectives
Learning from labels. Each network is trained with a Cross-Entropy (CE) loss between its predictive probability distribution and the hard labels. Given an instance $x$ with label $y$, the CE loss of the $m$-th network is:

$\mathcal{L}_{ce}^{m} = -\sum_{j=1}^{C}\mathbb{1}(j=y)\log p_{j}^{m}(x)$    (1)

where $\mathbb{1}(j=y)$ is an indicator that returns $1$ if $j=y$ and $0$ otherwise, and $p_{j}^{m}(x)$ is the class posterior computed from the logit vector $z^{m}=[z_{1}^{m},\dots,z_{C}^{m}]$ by softmax normalization:

$p_{j}^{m}(x)=\frac{\exp(z_{j}^{m})}{\sum_{k=1}^{C}\exp(z_{k}^{m})}$    (2)

Overall, the CE loss over the $M$ networks is $\mathcal{L}_{ce}=\sum_{m=1}^{M}\mathcal{L}_{ce}^{m}$.
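In code, Eqs. (1)-(2) summed over the peers amount to standard softmax cross-entropy applied per peer; a minimal sketch (with an illustrative function name) is:

```python
import torch.nn.functional as F

def total_ce_loss(logits_list, targets):
    """L_ce of Eqs. (1)-(2): softmax cross-entropy on each peer's logits,
    averaged over the mini-batch and summed over the M peer networks."""
    return sum(F.cross_entropy(logits, targets) for logits in logits_list)
```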
Distillation from an online teacher. We construct an online teacher by taking the naive ensemble of the predictive probability distributions of all $M$ networks, softened by a temperature $T$:

$p_{j}^{e}(x;T)=\frac{1}{M}\sum_{m=1}^{M}p_{j}^{m}(x;T)$    (3)

where $p_{j}^{e}(x;T)$ denotes the soft ensemble probability of the $j$-th class. The soft probability distribution of the $m$-th network is:

$p_{j}^{m}(x;T)=\frac{\exp(z_{j}^{m}/T)}{\sum_{k=1}^{C}\exp(z_{k}^{m}/T)}$    (4)

We transfer probabilistic knowledge from the online teacher to the final deployment network $f^{M}$; the KL divergence is thus used to align the soft predictions of the former and the latter:

$\mathcal{L}_{kd}=\sum_{j=1}^{C}p_{j}^{e}(x;T)\log\frac{p_{j}^{e}(x;T)}{p_{j}^{M}(x;T)}$    (5)
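A minimal PyTorch sketch of Eqs. (3)-(5) is shown below. It assumes the ensemble teacher is detached from the computation graph and the loss is scaled by $T^2$ as in Hinton et al. [7]; both are common implementation choices rather than details stated above.

```python
import torch
import torch.nn.functional as F

def kd_from_online_teacher(logits_list, T: float):
    """Eqs. (3)-(5): average the temperature-softened distributions of all peers
    to form the online teacher, then align the last (deployed) peer to it via KL."""
    soft = [F.softmax(z / T, dim=1) for z in logits_list]        # Eq. (4)
    teacher = torch.stack(soft, dim=0).mean(dim=0).detach()      # Eq. (3), teacher detached (assumption)
    log_student = F.log_softmax(logits_list[-1] / T, dim=1)
    # KL(teacher || student); T^2 keeps gradient magnitudes comparable (assumption, as in [7])
    return F.kl_div(log_student, teacher, reduction="batchmean") * (T ** 2)
```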
Multi-view contrastive learning. Consider a training set $\{(x_i, y_i)\}_{i=1}^{N}$ that contains $N$ instances from $C$ classes. We learn the relationships, derived from different networks, among the data instances: feature embeddings of the same data instance should be mutually close, while embeddings of two data instances with different classes should be far apart.
Given two networks $f^{a}$ and $f^{b}$ for illustration, the generated embeddings across the training set are $\{v_{a}^{i}\}_{i=1}^{N}$ and $\{v_{b}^{i}\}_{i=1}^{N}$, respectively. We define a positive pair as $(v_{a}^{i}, v_{b}^{i})$ and a negative pair as $(v_{a}^{i}, v_{b}^{j})$ with $y_{i}\neq y_{j}$. Contrastive learning encourages positive and negative pairs to achieve high and low cosine similarities, respectively. Given the embedding $v_{a}^{i}$ of instance $x_{i}$ from the fixed view $f^{a}$, we enumerate the corresponding positive embedding $v_{b}^{i}$ and $K$ negative embeddings from $f^{b}$. For ease of notation, $v_{b}^{j_{k}}$ denotes the $k$-th negative embedding relative to $v_{a}^{i}$. We regard the optimization as correctly classifying the positive against the $K$ sampled negatives, which approximates the full distribution over all negative embeddings and is inspired by Noise-Contrastive Estimation (NCE) [20].
The idea behind the NCE-based approximation is to cast the instance-level multi-class problem as a binary classification problem that discriminates positive pairs from negative pairs. Given the anchor $v_{a}^{i}$, the probability of matching a candidate embedding $v_{b}$ from view $f^{b}$ as the positive is:

$p(v_{b}\mid v_{a}^{i})=\frac{\exp(v_{a}^{i\top}v_{b}/\tau)}{Z}$    (6)

where $\tau$ is the temperature and $Z$ is a normalizing constant. Moreover, we define a uniform distribution for the probability of matching $v_{b}$ as a negative, i.e. $p_{n}(v_{b})=1/N$. Assume that negatives are sampled $K$ times more frequently than positives, i.e. every positive instance is sampled together with $K$ negative instances. Then the posterior probability that $v_{b}$ is drawn from the actual distribution of positive instances (denoted as $D=1$) is:

$h(v_{a}^{i},v_{b}) := p(D=1\mid v_{b};v_{a}^{i})=\frac{p(v_{b}\mid v_{a}^{i})}{p(v_{b}\mid v_{a}^{i})+K\,p_{n}(v_{b})}$    (7)

We minimize the negative log-likelihood over the positive pair $(v_{a}^{i},v_{b}^{i})$ and the negative pairs $\{(v_{a}^{i},v_{b}^{j_{k}})\}_{k=1}^{K}$, which yields the contrastive loss from $f^{a}$ to $f^{b}$:

$\mathcal{L}_{a\rightarrow b}=-\log h(v_{a}^{i},v_{b}^{i})-\sum_{k=1}^{K}\log\big(1-h(v_{a}^{i},v_{b}^{j_{k}})\big)$    (8)

where each negative embedding $v_{b}^{j_{k}}$ is randomly retrieved from an online memory bank [15] instead of being computed on the fly for every mini-batch, which allows us to efficiently obtain abundant negative instances for generating contrastive knowledge. Note that only $v_{a}^{i}$ and $v_{b}^{i}$ are computed in real time by $f^{a}$ and $f^{b}$ from the input instance $x_{i}$ in the mini-batch.
Symmetrically, we can also fix $f^{b}$ and enumerate over $f^{a}$, contrasting $v_{b}^{i}$ with the positive $v_{a}^{i}$ and negatives $\{v_{a}^{j_{k}}\}_{k=1}^{K}$, resulting in the contrastive loss $\mathcal{L}_{b\rightarrow a}$. The overall contrastive objective between $f^{a}$ and $f^{b}$ is thus $\mathcal{L}_{(a,b)}=\mathcal{L}_{a\rightarrow b}+\mathcal{L}_{b\rightarrow a}$. We further extend the contrastive loss from two views to multiple views, which captures more robust evidence for classification while discarding noise information. Specifically, we model fully-connected interactions among the views, leading to $M(M-1)/2$ pairwise relationships. The contrastive loss among $M$ networks is summarized as $\mathcal{L}_{mcl}=\sum_{a=1}^{M-1}\sum_{b=a+1}^{M}\mathcal{L}_{(a,b)}$.
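The sketch below illustrates Eqs. (6)-(8) and the pairwise summation over all views, assuming the $K$ negatives per peer have already been retrieved from the memory bank and treating $Z$ as a fixed constant. Function names, tensor shapes, and the small epsilon added for numerical stability are our own illustrative choices, not a reproduction of the released implementation.

```python
import torch

def contrastive_a_to_b(v_a, v_b_pos, v_b_neg, tau: float, Z: float, N: int):
    """Eqs. (6)-(8): NCE-style loss with f^a as the fixed (anchor) view.
    v_a, v_b_pos: (B, d) l2-normalized embeddings of the same instances from peers a and b.
    v_b_neg:      (B, K, d) negatives for peer b, e.g. drawn from the memory bank [15]."""
    K = v_b_neg.size(1)
    def h(scores):                                   # posterior of being a positive pair, Eq. (7)
        p = torch.exp(scores / tau) / Z              # matching probability, Eq. (6), with fixed Z
        return p / (p + K / N)                       # noise distribution p_n = 1/N
    pos = (v_a * v_b_pos).sum(dim=1)                 # (B,) dot products with the positives
    neg = torch.einsum('bd,bkd->bk', v_a, v_b_neg)   # (B, K) dot products with the negatives
    loss = -torch.log(h(pos) + 1e-7) - torch.log(1 - h(neg) + 1e-7).sum(dim=1)
    return loss.mean()

def mcl_loss(v_list, neg_list, tau, Z, N):
    """Sum the symmetric pairwise losses over all M(M-1)/2 view pairs.
    v_list[m]:   (B, d) mini-batch embeddings from peer m.
    neg_list[m]: (B, K, d) negatives retrieved for peer m."""
    M, total = len(v_list), 0.0
    for a in range(M):
        for b in range(a + 1, M):
            total = total + contrastive_a_to_b(v_list[a], v_list[b], neg_list[b], tau, Z, N) \
                          + contrastive_a_to_b(v_list[b], v_list[a], neg_list[a], tau, Z, N)
    return total
```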
Overall learning objective. We combine the above three objectives into our final objective:

$\mathcal{L}=\mathcal{L}_{ce}+\alpha\,\mathcal{L}_{kd}+\beta\,\mathcal{L}_{mcl}$    (9)

where $\alpha$ balances the contributions of hard and soft labels, and $\beta$ is a constant factor that rescales the magnitude of the contrastive loss.
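Putting the pieces together, Eq. (9) can be assembled from the helper sketches above; this is again an illustrative sketch, and the actual hyper-parameter values are left to the released code.

```python
def mcl_okd_loss(logits_list, targets, v_list, neg_list,
                 alpha, beta, T, tau, Z, N):
    """Eq. (9): cross-entropy over all peers + alpha * online-teacher KD
    + beta * multi-view contrastive loss (reusing the sketched helpers)."""
    return (total_ce_loss(logits_list, targets)
            + alpha * kd_from_online_teacher(logits_list, T)
            + beta * mcl_loss(v_list, neg_list, tau, Z, N))
```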
Table 1: Top-1 error rates (%) on CIFAR-100 (mean ± std over 3 runs). Values in parentheses are the error reductions relative to the best competing method.

Network | FLOPs | Baseline | DML [8] | CL-ILR [10] | ONE [9] | OKDDip [11] | MCL-OKD
---|---|---|---|---|---|---|---
DenseNet-40-12 | 0.07G | 29.17±0.15 | 27.34±0.36 | 27.38±0.47 | 29.01±0.08 | 28.75±0.63 | 26.04±0.25 (-1.30)
ResNet-32 | 0.07G | 28.91±0.31 | 24.92±0.12 | 25.40±0.06 | 25.74±0.19 | 25.76±0.29 | 24.52±0.26 (-0.40)
VGG-16 | 0.31G | 25.18±0.25 | 24.14±0.36 | 23.58±0.14 | 25.22±0.11 | 24.86±0.30 | 23.11±0.25 (-0.47)
ResNet-110 | 0.17G | 23.62±0.73 | 21.51±0.74 | 21.16±0.29 | 22.19±0.56 | 21.05±0.17 | 20.39±0.59 (-0.66)
HCGNet-A1 | 0.15G | 22.46±0.28 | 18.98±0.20 | 19.04±0.17 | 22.30±0.57 | 21.54±0.11 | 18.72±0.21 (-0.26)
4 Experiments
4.1 Dataset and setup
Image Classification. We use the CIFAR-100 [12] and ImageNet [13] benchmarks for evaluation. CIFAR-100 contains 50K training images and 10K test images from 100 classes at a 32×32 input resolution; ImageNet contains 1.2 million training images and 50K test images from 1000 classes at a 224×224 input resolution. The loss weights $\alpha$ and $\beta$ in Eq. (9), the hyper-parameters of Eq. (6) (set separately for CIFAR-100 and ImageNet following [11]), and the architecture-dependent temperature $T$ in Eq. (3) are given in our released code, where detailed experimental settings can be found. We use 4 networks for all OKD methods, i.e. $M=4$. Given a model, the training graph shares the first several stages and branches into separate copies of the last two stages. We report the mean error rate over 3 runs.
Few-shot learning. We use the miniImageNet [21] benchmark for few-shot classification. The prototypical network [22] is used as the backbone and also plays the role of a peer network in OKD. We use the standard data split following Snell et al. [22]. At the test stage, we report the average accuracy over 600 randomly sampled episodes with 95% confidence intervals. For logit-based OKD methods, we add an auxiliary global classifier over the class space of the training set to the original prototypical network, and perform learning of probabilistic outputs among the 4 peer networks.
Table 2: Top-1 error rates (%) on ImageNet. Values in parentheses are the reductions relative to the baseline.

Network | Baseline | MCL-OKD | MCL-OKD (Ens)
---|---|---|---
ResNet-34 | 25.43 | 24.64 (-0.79) | 23.26 (-2.17)
Table 3: Ensemble top-1 error rates (%) on CIFAR-100. Values in parentheses are the reductions relative to the best competing method.

Network | DML [8] | CL-ILR [10] | ONE [9] | OKDDip [11] | MCL-OKD
---|---|---|---|---|---
DenseNet-40-12 | 26.02 | 26.19 | 28.67 | 27.51 | 23.55 (-2.47)
ResNet-32 | 22.97 | 24.03 | 24.03 | 23.73 | 22.00 (-0.97)
VGG-16 | 23.27 | 22.96 | 25.12 | 24.52 | 22.36 (-0.60)
ResNet-110 | 19.12 | 18.66 | 20.23 | 19.40 | 18.29 (-0.37)
HCGNet-A1 | 17.86 | 18.35 | 21.64 | 20.97 | 17.54 (-0.32)
Table 4: Few-shot classification accuracy (%) on miniImageNet, averaged over 600 episodes with 95% confidence intervals.

Method | 5-Way 1-Shot | 5-Way 5-Shot
---|---|---
Baseline [22] | 49.10 ± 0.41 | 66.87 ± 0.33
RKD-D [23] | 49.66 ± 0.84 | 67.07 ± 0.67
RKD-DA [23] | 50.02 ± 0.83 | 68.16 ± 0.67
CL-ILR [10] | 50.75 ± 0.40 | 67.75 ± 0.32
ONE [9] | 50.67 ± 0.41 | 67.58 ± 0.33
OKDDip [11] | 50.60 ± 0.42 | 67.41 ± 0.33
MCL-OKD | 51.58 ± 0.41 | 69.49 ± 0.33
4.2 Results of OKD methods
Image Classification. Table 1 compares the performance of SOTA OKD methods across various networks: VGG [24], ResNet [25], DenseNet [26] and HCGNet [2]. Our MCL-OKD consistently outperforms the alternative methods DML [8], CL-ILR [10], ONE [9] and OKDDip [11] by large margins, which indicates that performing MCL on the representations of the peer networks is more effective than learning only from probability distributions, as previous OKD methods do. Compared to the previous SOTA OKDDip, MCL-OKD achieves an absolute 1.84% reduction of the error rate on average on CIFAR-100. Experiments on the more challenging ImageNet (Table 2) show that MCL-OKD outperforms the baseline by a 0.79% margin. As shown in Table 3, MCL-OKD also achieves the best ensemble error rate among the compared OKD methods when all peer networks are retained.
Few-shot learning. Table 4 compares the accuracy of SOTA KD and OKD methods on few-shot learning. MCL-OKD significantly outperforms the other OKD methods, which verifies that MCL on representations is more effective than learning from probabilistic outputs, especially for metric-learning tasks, owing to its ability to generate discriminative feature embeddings for instances from previously unseen classes. Moreover, MCL-OKD achieves better results than RKD [23], the SOTA KD method for few-shot learning, which requires a pre-trained teacher as a prerequisite.
Training complexity. Taking ResNet-34 on ImageNet as an example, MCL-OKD uses an extra 1.56 GFLOPs for the contrastive computation, about 16% of the 9.60 GFLOPs of vanilla training. In practice, we observed only a moderate increase in training time (1.23 hours/epoch vs. 1.52 hours/epoch on a single NVIDIA Tesla V100 GPU). The memory bank of each peer network needs about 600MB of memory to store all 128-d features, resulting in 2.4GB in total for 4 peer networks.
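Assuming the bank stores one 128-d float32 vector for each of the roughly 1.28M ImageNet training images, this figure is consistent with $1.28\times10^{6}\times 128\times 4\ \text{bytes}\approx 655\ \text{MB}$ per peer.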
5 Conclusion
We propose multi-view contrastive learning for OKD to learn a more powerful representation space that benefits from the mutual communication among peer networks. Experimental evidence demonstrates the superiority of learning informative feature representations, which makes MCL-OKD a preferable choice for model deployment in practice.
Acknowledgements. This work was supported by the Basic Research Reinforcement Project (2019-JCJQ-JJ-412).
References
- [1] Hui Zhu, Zhulin An, Chuanguang Yang, Kaiqiang Xu, Erhu Zhao, and Yongjun Xu, “Eena: efficient evolution of neural architecture,” in ICCV Workshops, 2019.
- [2] Chuanguang Yang, Zhulin An, Hui Zhu, Xiaolong Hu, Kun Zhang, Kaiqiang Xu, Chao Li, and Yongjun Xu, “Gated convolutional networks with hybrid connectivity for image classification,” in AAAI, 2020, pp. 12581–12588.
- [3] Hui Zhu, Zhulin An, Chuanguang Yang, Xiaolong Hu, Kaiqiang Xu, and Yongjun Xu, “Efficient search for the number of channels for convolutional neural networks,” in IJCNN, 2020.
- [4] Chuanguang Yang, Zhulin An, Chao Li, Boyu Diao, and Yongjun Xu, “Multi-objective pruning for cnns using genetic algorithm,” in ICANN, 2019, pp. 299–305.
- [5] Linhang Cai, Zhulin An, Chuanguang Yang, and Yongjun Xu, “Softer pruning, incremental regularization,” in ICPR, 2020.
- [6] Xiaolong Hu, Zhulin An, Chuanguang Yang, Hui Zhu, Kaiqiang Xu, and Yongjun Xu, “Drnet: Dissect and reconstruct the convolutional neural network via interpretable manners,” in ECAI, 2020.
- [7] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean, “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503.02531, 2015.
- [8] Ying Zhang, Tao Xiang, Timothy M Hospedales, and Huchuan Lu, “Deep mutual learning,” in CVPR, 2018, pp. 4320–4328.
- [9] Xiatian Zhu, Shaogang Gong, et al., “Knowledge distillation by on-the-fly native ensemble,” in NeurIPS, 2018, pp. 7517–7527.
- [10] Guocong Song and Wei Chai, “Collaborative learning for deep neural networks,” in NeurIPS, 2018, pp. 1832–1841.
- [11] Defang Chen, Jian-Ping Mei, Can Wang, Yan Feng, and Chun Chen, “Online knowledge distillation with diverse peers,” in AAAI, 2020, pp. 3430–3437.
- [12] Alex Krizhevsky, Geoffrey Hinton, et al., “Learning multiple layers of features from tiny images,” 2009.
- [13] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in CVPR, 2009, pp. 248–255.
- [14] R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Phil Bachman, Adam Trischler, and Yoshua Bengio, “Learning deep representations by mutual information estimation and maximization,” ICLR, 2019.
- [15] Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin, “Unsupervised feature learning via non-parametric instance discrimination,” in CVPR, 2018, pp. 3733–3742.
- [16] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton, “A simple framework for contrastive learning of visual representations,” in ICML, 2020, pp. 1597–1607.
- [17] Yonglong Tian, Dilip Krishnan, and Phillip Isola, “Contrastive multiview coding,” in ECCV, 2020, pp. 776–794.
- [18] Philip Bachman, R Devon Hjelm, and William Buchwalter, “Learning representations by maximizing mutual information across views,” in NeurIPS, 2019, pp. 15509–15519.
- [19] Linda Smith and Michael Gasser, “The development of embodied cognition: Six lessons from babies,” Artificial life, vol. 11, no. 1-2, pp. 13–29, 2005.
- [20] Michael Gutmann and Aapo Hyvärinen, “Noise-contrastive estimation: A new estimation principle for unnormalized statistical models,” in AISTATS, 2010, pp. 297–304.
- [21] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al., “Matching networks for one shot learning,” in NeurIPS, 2016, pp. 3630–3638.
- [22] Jake Snell, Kevin Swersky, and Richard Zemel, “Prototypical networks for few-shot learning,” in NeurIPS, 2017, pp. 4077–4087.
- [23] Wonpyo Park, Dongju Kim, Yan Lu, and Minsu Cho, “Relational knowledge distillation,” in CVPR, 2019, pp. 3967–3976.
- [24] Karen Simonyan and Andrew Zisserman, “Very deep convolutional networks for large-scale image recognition,” in ICLR, 2015.
- [25] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” in CVPR, 2016, pp. 770–778.
- [26] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger, “Densely connected convolutional networks,” in CVPR, 2017, pp. 4700–4708.