Improved Knowledge Distillation via Full Kernel Matrix Transfer

Qi Qian Alibaba Group, [email protected]    Hao Li Alibaba Group, [email protected]    Juhua Hu University of Washington, [email protected]
Abstract

Knowledge distillation is an effective way to compress models in deep learning. Given a large model (i.e., the teacher model), it aims to improve the performance of a compact model (i.e., the student model) by transferring information from the teacher. Various kinds of information for distillation have been studied. Recently, a number of works propose to transfer the pairwise similarity between examples to distill relative information. However, most efforts are devoted to developing different similarity measurements, while only a small matrix consisting of the examples within a mini-batch is transferred at each iteration, which can be inefficient for optimizing the pairwise similarity over the whole data set. In this work, we aim to transfer the full similarity matrix effectively. The main challenge comes from the size of the full matrix, which is quadratic in the number of examples. To address the challenge, we decompose the original full matrix with the Nyström method. By selecting appropriate landmark points, our theoretical analysis indicates that the loss for transfer can be further simplified. Concretely, we find that the difference between the full kernel matrices of the teacher and the student can be well bounded by that of the corresponding partial matrices, which consist only of similarities between the original examples and the landmark points. Compared with the full matrix, the size of the partial matrix is linear in the number of examples, which improves the efficiency of optimization significantly. The empirical study on benchmark data sets demonstrates the effectiveness of the proposed algorithm. Code is available at https://github.com/idstcv/KDA.

Keywords— knowledge distillation, full kernel matrix transfer

1 Introduction

With the development of deep learning, neural networks have made many computer vision tasks feasible on edge devices. These devices often have limited computation and storage resources. Therefore, neural networks with a small number of FLOPs and parameters are preferred. Many efforts have been devoted to improving the performance of neural networks under resource constraints [4, 7, 17]. Among the various strategies developed, knowledge distillation (KD) is a simple yet effective way to help train compact networks [7].

KD in deep learning aims to improve the performance of a small network (i.e., the student) with information from a large network (i.e., the teacher). Given a teacher, various information can be transferred to regularize the training of the student network. [7] transfers the label information to smooth the label space of the student network. [16] and [23] propose to transfer the information of intermediate layers to help training. [22] transfers the flow between layers as hints for student networks. [3] improves the performance of metric learning with the rank information from teacher models. Recently, a number of methods have been developed to transfer pairwise similarities between examples to student networks [10, 13, 14, 20], and different similarity functions have been investigated. In this work, we also focus on pairwise similarity matrix transfer, but we aim to improve the efficiency of the transfer rather than the similarity measurement.

Given two images 𝐱i\mathbf{x}_{i} and 𝐱j\mathbf{x}_{j}, their similarity can be computed as k(𝐱i,𝐱j)=ϕ(𝐱i),ϕ(𝐱j)k(\mathbf{x}_{i},\mathbf{x}_{j})=\langle\phi(\mathbf{x}_{i}),\phi(\mathbf{x}_{j})\rangle, where ϕ()\phi(\cdot) is the feature map associated with the kernel, which projects examples from the original space to a space that fits the problem better. In an appropriate space, a simple model (e.g., a linear model) can describe the data well [21]. Many efforts have been devoted to designing similarity functions, but the strategy for effective transfer is less investigated. Note that the size of the full similarity matrix is n×nn\times n, where nn is the total number of examples, while only a mini-batch of examples is accessible at each iteration in a standard neural network training pipeline. Most existing methods [10, 13, 14, 20] transfer only a partial matrix (i.e., r×rr\times r, where rr is the batch size) consisting of examples from a mini-batch, which can be inefficient for distilling the information of the full kernel matrix from the teacher. Therefore, many KD methods that transfer pairwise similarity have to include an additional KD loss [7] to reach the desired performance.

In this work, we propose to transfer the full kernel matrix between all examples from the teacher model to the student model efficiently. The main challenge comes from the size of the full matrix. With a large number of training examples, it becomes intractable to transfer the full matrix directly, especially for training neural networks with mini-batches. Therefore, we first apply the Nyström method [21] to obtain a low-rank approximation of the original matrix with landmark points. Then, we minimize the difference between the compact matrices that are calculated between original examples and landmark points, to transfer the information from the teacher effectively.

Compared with the full matrix of size 𝒪(n2)\mathcal{O}(n^{2}), the size of the compact matrix for transfer is only 𝒪(n)\mathcal{O}(n) in our method. Besides, the number of landmark points is significantly smaller than that of training examples and we can keep them in the network, which improves the efficiency of optimization. Considering that the selection of landmark points is important for approximating the original matrix, we propose to apply class centers as landmark points for better distillation. Our theoretical analysis shows that the difference between the original matrices of the teacher and the student can be well bounded by that of the corresponding partial matrices. Interestingly, according to our analysis, the conventional KD method in [7] that transfers the label information can be considered as a special case of our proposal with sub-optimal landmark points. This observation provides a new perspective to understand the efficacy of KD. We conduct extensive experiments on benchmark data sets. Our method achieves satisfactory performance without any additional KD loss, which confirms the effectiveness of transferring the full kernel matrix. The main contributions of this work can be summarized as follows.

  • Instead of transferring partial similarity matrices within mini-batches only, which can be inefficient for KD optimization, we propose to transfer the full kernel matrix from the teacher model to the student model.

  • Considering that it is intractable to transfer the full matrix directly due to its large size and the mini-batch setting in training, we propose to obtain a low-rank approximation and minimize the difference between the compact matrices for transferring.

  • More importantly, we provide a theoretical guarantee that the difference between original full matrices from teacher and student can be well bounded by that of the corresponding compact matrices, which also provides a new perspective to understand the efficacy of conventional KD.

  • Our experiments including extensive ablation studies on benchmark data sets further demonstrate our theory and the effectiveness of our proposed method.

2 Related Work

2.1 Knowledge distillation

Knowledge distillation has a long history in ensemble learning and has become popular for training small-sized neural networks [1, 3, 7, 10, 12, 13, 18, 19, 23]. Various algorithms have been developed to transfer different information from the teacher model to the student model. For example, [7] considers the final output of the teacher model as a soft label and regularizes the similarity between the label distribution output by the student model and the soft label from the teacher model. [23] transfers the attention maps from intermediate layers, which provides a way to explore more information from the teacher model. [1] proposes an information-theoretic framework for knowledge transfer, which formulates knowledge transfer as maximizing the mutual information between the teacher and the student networks.

Recently, a number of algorithms [10, 13, 14, 20] have been proposed to transfer pairwise similarity for knowledge distillation. However, they focus on developing similarity functions while transferring only the similarity information of pairs within a mini-batch, which can be inefficient for distillation. We, on the other hand, aim to transfer the full matrix directly to achieve better performance. Furthermore, we provide a theoretical analysis to demonstrate the effectiveness of the proposed method.

Besides classification, some methods are proposed for other tasks, e.g., detection [2] and metric learning [3]. We focus on classification in this work while the proposed method can be easily extended to metric learning that aims to optimize the performance of the embedding layer.

2.2 Nyström method

The Nyström method is an effective algorithm to obtain a low-rank approximation of a Gram matrix [21]. Given a full Gram matrix, it reconstructs the original one from randomly sampled columns. The data points corresponding to the selected columns are denoted as landmark points. The approximation error can be bounded even with randomly sampled landmark points. Later work showed that a careful sampling strategy can further improve the performance [5, 9, 24]. [5] proposes to sample landmark points with a data-dependent probability distribution rather than the uniform distribution. [9] and [24] demonstrate that using clustering centers as landmark points provides the best approximation among different strategies.

Note that the Nyström method was developed for unsupervised Gram matrix approximation, while we can access the label information in knowledge distillation. In this work, we provide an analysis of the selection criterion of landmark points for Gram matrix transfer and develop a supervised strategy accordingly. Besides, the Nyström method approximates a single Gram matrix, while we aim to transfer information between two Gram matrices.

3 Efficient Kernel Matrix Transfer

Given two images 𝐱i\mathbf{x}_{i} and 𝐱j\mathbf{x}_{j}, the similarity between them can be measured with a kernel function as k(𝐱i,𝐱j)=ϕ(𝐱i),ϕ(𝐱j)k(\mathbf{x}_{i},\mathbf{x}_{j})=\langle\phi(\mathbf{x}_{i}),\phi(\mathbf{x}_{j})\rangle, where ϕ(𝐱i)\phi(\mathbf{x}_{i}) is a projection function that projects examples from the original space to a space better for the target task.

In this work, we consider a certain layer in a neural network as a projection function, where the output of that layer can be considered as ϕ(𝐱)\phi(\mathbf{x}). We denote the student network as 𝒮\mathcal{S} and the teacher network as 𝒯\mathcal{T}. The features output from a certain layer of 𝒮\mathcal{S} and 𝒯\mathcal{T} are referred to as f𝒮(𝐱i)=𝐱𝒮if_{\mathcal{S}}(\mathbf{x}_{i})=\mathbf{x}^{i}_{\mathcal{S}} and f𝒯(𝐱i)=𝐱𝒯if_{\mathcal{T}}(\mathbf{x}_{i})=\mathbf{x}^{i}_{\mathcal{T}}, respectively, where the index of the layer is omitted for brevity. Then, the similarity between two images 𝐱i\mathbf{x}_{i} and 𝐱j\mathbf{x}_{j} in the Gram matrix can be computed by

K𝒮(𝐱i,𝐱j)=f𝒮(𝐱i),f𝒮(𝐱j)=𝐱𝒮i𝐱𝒮j;\displaystyle K_{\mathcal{S}}(\mathbf{x}_{i},\mathbf{x}_{j})=\langle f_{\mathcal{S}}(\mathbf{x}_{i}),f_{\mathcal{S}}(\mathbf{x}_{j})\rangle=\mathbf{x}^{i\top}_{\mathcal{S}}\mathbf{x}^{j}_{\mathcal{S}};
K𝒯(𝐱i,𝐱j)=f𝒯(𝐱i),f𝒯(𝐱j)=𝐱𝒯i𝐱𝒯j.\displaystyle\quad K_{\mathcal{T}}(\mathbf{x}_{i},\mathbf{x}_{j})=\langle f_{\mathcal{T}}(\mathbf{x}_{i}),f_{\mathcal{T}}(\mathbf{x}_{j})\rangle=\mathbf{x}^{i\top}_{\mathcal{T}}\mathbf{x}^{j}_{\mathcal{T}}.

Note that other complicated similarity functions can be applied between f(𝐱i)f(\mathbf{x}_{i}) and f(𝐱j)f(\mathbf{x}_{j}) as in existing methods [10, 13, 14, 20].

Let K𝒮K_{\mathcal{S}} and K𝒯K_{\mathcal{T}} denote the n×nn\times n Gram matrices from the student and teacher networks, respectively, where nn is the total number of images. We aim to transfer the full Gram matrix from the teacher model to the student model. The corresponding loss for knowledge distillation with Gram matrix transfer can be written as

(3.1) 𝒮,𝒯=K𝒮K𝒯F=X𝒮X𝒮X𝒯X𝒯F\displaystyle\ell_{\mathcal{S},\mathcal{T}}=\|K_{\mathcal{S}}-K_{\mathcal{T}}\|_{F}=\|X_{\mathcal{S}}^{\top}X_{\mathcal{S}}-X_{\mathcal{T}}^{\top}X_{\mathcal{T}}\|_{F}

where X𝒮d𝒮×nX_{\mathcal{S}}\in\mathbb{R}^{d_{\mathcal{S}}\times n} and X𝒯d𝒯×nX_{\mathcal{T}}\in\mathbb{R}^{d_{\mathcal{T}}\times n} denote the representations of the entire data set output from the corresponding layers of the student and teacher networks, respectively.
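As a concrete illustration of Eqn. 3.1, the PyTorch-style sketch below computes the naive full-matrix transfer loss. It is our own illustration (the variable names feat_s and feat_t are not from the paper) and assumes the features of all n examples are available at once, which is exactly what becomes impractical for large n.

    import torch

    def full_gram_transfer_loss(feat_s, feat_t):
        # feat_s: (n, d_s) student features (rows of X_S^T); feat_t: (n, d_t) teacher features
        k_s = feat_s @ feat_s.t()  # (n, n) student Gram matrix K_S
        k_t = feat_t @ feat_t.t()  # (n, n) teacher Gram matrix K_T
        return torch.linalg.norm(k_s - k_t)  # Frobenius norm of the difference

    # toy example: n = 8 examples, d_s = 4, d_t = 6
    loss = full_gram_transfer_loss(torch.randn(8, 4), torch.randn(8, 6))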

Minimizing this loss directly is intractable due to the large size of the Gram matrix, especially in the conventional training pipeline of deep learning, where only a mini-batch of examples is accessible in each iteration. A straightforward way is to optimize only the random pairs in each mini-batch as in [10, 13, 14, 20]. However, it can be sub-optimal and result in slow optimization of the transfer. Hence, we consider decomposing the Gram matrix and optimizing a low-rank approximation in lieu of the full Gram matrix.

3.1 Nyström Approximation

The Nyström method is widely applied to approximate Gram matrices [5, 9, 21, 24]. We briefly review it in this subsection.

Given an n×nn\times n Gram matrix KK, we can first randomly shuffle its columns (and rows accordingly) and rewrite it as

K=[WK12K12K22]K=\left[\begin{array}[]{cc}W&K_{12}^{\top}\\ K_{12}&K_{22}\end{array}\right]

where Wm×mW\in\mathbb{R}^{m\times m}. Then, a good approximation for KK can be obtained as K~=CW+C\widetilde{K}=CW^{+}C^{\top}, where C=[WK12]n×mC=\left[\begin{array}[]{c}W\\ K_{12}\end{array}\right]\in\mathbb{R}^{n\times m} and W+W^{+} denotes the pseudo inverse of WW [21].

Let KkK_{k} denote the best top-kk approximation of the Gram matrix KK, where kmk\leq m. The rank-kk approximation derived by the Nyström method can be computed as K~k=CWk+C\widetilde{K}_{k}=CW_{k}^{+}C^{\top}, where WkW_{k} denotes the best top-kk approximation of WW and Wk+W_{k}^{+} is the corresponding pseudo inverse. The quality of the approximation is characterized by the following theorem.

Theorem 1

[9] Let K~k\widetilde{K}_{k} denote the rank-kk Nyström approximation with mm columns that are sampled uniformly at random without replacement from KK. We have KK~kFKKkF+ε;ε=𝒪(1/m1/4)KF\|K-\widetilde{K}_{k}\|_{F}\leq\|K-K_{k}\|_{F}+\varepsilon;\quad\varepsilon=\mathcal{O}(1/m^{1/4})\|K\|_{F}.

The examples corresponding to the selected columns in CC are referred to as landmark points. Theorem 1 shows that the approximation is applicable even with randomly sampled landmark points.
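The construction above can be sketched in a few lines of PyTorch. The snippet below is only an illustrative implementation (ours, not the authors' code): landmark columns are sampled uniformly at random, and we take k=m so that the rank-k truncation of W is omitted.

    import torch

    def nystrom_approximation(K, m):
        # K: (n, n) Gram matrix; m: number of landmark columns to sample
        n = K.shape[0]
        idx = torch.randperm(n)[:m]                  # uniformly sampled landmark indices
        C = K[:, idx]                                # (n, m) similarities to landmark points
        W = K[idx][:, idx]                           # (m, m) block among landmark points
        return C @ torch.linalg.pinv(W) @ C.t()      # K_tilde = C W^+ C^T

    X = torch.randn(16, 200)                         # d x n features of n = 200 examples
    K = X.t() @ X
    K_tilde = nystrom_approximation(K, m=20)
    rel_err = torch.linalg.norm(K - K_tilde) / torch.linalg.norm(K)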

3.2 Gram Matrix Transfer

With the low-rank approximation, the Gram matrices from a certain layer of the student and teacher networks can be approximated as

K~𝒮k=C𝒮Wk𝒮+C𝒮;K~𝒯k=C𝒯Wk𝒯+C𝒯\widetilde{K}_{\mathcal{S}}^{k}=C_{\mathcal{S}}W_{k\mathcal{S}}^{+}C_{\mathcal{S}}^{\top};\quad\widetilde{K}_{\mathcal{T}}^{k}=C_{\mathcal{T}}W_{k\mathcal{T}}^{+}C_{\mathcal{T}}^{\top}

Compared with the original Gram matrix, the partial matrix CC has significantly fewer entries. Let D𝒮d𝒮×mD_{\mathcal{S}}\in\mathbb{R}^{d_{\mathcal{S}}\times m} and D𝒯d𝒯×mD_{\mathcal{T}}\in\mathbb{R}^{d_{\mathcal{T}}\times m} denote the landmark points for the student and teacher Gram matrices, respectively; then we have

C𝒮=X𝒮D𝒮,C𝒯=X𝒯D𝒯;C_{\mathcal{S}}=X_{\mathcal{S}}^{\top}D_{\mathcal{S}},\ C_{\mathcal{T}}=X_{\mathcal{T}}^{\top}D_{\mathcal{T}};
W𝒮=D𝒮D𝒮,W𝒯=D𝒯D𝒯.W_{\mathcal{S}}=D_{\mathcal{S}}^{\top}D_{\mathcal{S}},\ W_{\mathcal{T}}=D_{\mathcal{T}}^{\top}D_{\mathcal{T}}.

With Theorem 1, it is easy to show that transferring the compact matrix upper-bounds the difference between the original Gram matrices of the teacher and the student.

Corollary 1

With the Nyström approximation, we can bound the loss in Eqn. 3.1 by

K𝒮K𝒯F2ε+C𝒮Wk𝒮+C𝒮C𝒯Wk𝒯+C𝒯F\|K_{\mathcal{S}}-K_{\mathcal{T}}\|_{F}\leq 2\varepsilon+\|C_{\mathcal{S}}W_{k\mathcal{S}}^{+}C_{\mathcal{S}}^{\top}-C_{\mathcal{T}}W_{k\mathcal{T}}^{+}C_{\mathcal{T}}^{\top}\|_{F}

In Corollary 1, the partial Gram matrix is regularized with the pseudo inverse of WW. Computing the pseudo inverse is expensive, and it can introduce additional noise when the feature space of the student is unstable in the early stage of training. Besides, it still optimizes the difference between two n×nn\times n matrices.

By selecting appropriate landmark points, we can bound the original loss in Eqn. 3.1 solely with C𝒮C_{\mathcal{S}} and C𝒯C_{\mathcal{T}}, as in the following theorem.

Theorem 2

Assuming that C𝒮C_{\mathcal{S}} and C𝒯C_{\mathcal{T}} are bounded by a constant cc as C𝒮F,C𝒯Fc\|C_{\mathcal{S}}\|_{F},\|C_{\mathcal{T}}\|_{F}\leq c and the smallest eigenvalues in W𝒮W_{\mathcal{S}} and W𝒯W_{\mathcal{T}} are larger than 11, we can bound the loss in Eqn. 3.1 by

K𝒮K𝒯F2ε+𝒪(C𝒮C𝒯F)\|K_{\mathcal{S}}-K_{\mathcal{T}}\|_{F}\leq 2\varepsilon+\mathcal{O}(\|C_{\mathcal{S}}-C_{\mathcal{T}}\|_{F})
  • Proof.
    K𝒮K𝒯F=K𝒮K~𝒮k(K𝒯K~𝒯k)+K~𝒮kK~𝒯kF\displaystyle\|K_{\mathcal{S}}-K_{\mathcal{T}}\|_{F}=\|K_{\mathcal{S}}-\tilde{K}_{\mathcal{S}}^{k}-(K_{\mathcal{T}}-\tilde{K}_{\mathcal{T}}^{k})+\tilde{K}_{\mathcal{S}}^{k}-\tilde{K}_{\mathcal{T}}^{k}\|_{F}
    K𝒮K~𝒮kF+(K𝒯K~𝒯k)F+K~𝒮kK~𝒯kF\displaystyle\leq\|K_{\mathcal{S}}-\tilde{K}_{\mathcal{S}}^{k}\|_{F}+\|(K_{\mathcal{T}}-\tilde{K}_{\mathcal{T}}^{k})\|_{F}+\|\tilde{K}_{\mathcal{S}}^{k}-\tilde{K}_{\mathcal{T}}^{k}\|_{F}
    2ε+C𝒮Wk𝒮+C𝒮C𝒯Wk𝒯+C𝒯F\displaystyle\leq 2\varepsilon+\|C_{\mathcal{S}}W_{k\mathcal{S}}^{+}C_{\mathcal{S}}^{\top}-C_{\mathcal{T}}W_{k\mathcal{T}}^{+}C_{\mathcal{T}}^{\top}\|_{F}
    2ε+cWk𝒮+C𝒮Wk𝒯+C𝒯F+c(Wk𝒯+FC𝒮C𝒯F)\displaystyle\leq 2\varepsilon+c\|W_{k\mathcal{S}}^{+}C_{\mathcal{S}}^{\top}-W_{k\mathcal{T}}^{+}C_{\mathcal{T}}^{\top}\|_{F}+c(\|W_{k\mathcal{T}}^{+}\|_{F}\|C_{\mathcal{S}}-C_{\mathcal{T}}\|_{F})
    2ε+c2Wk𝒮+Wk𝒯+F+2c2C𝒮C𝒯F\displaystyle\leq 2\varepsilon+c^{2}\|W_{k\mathcal{S}}^{+}-W_{k\mathcal{T}}^{+}\|_{F}+2c^{2}\|C_{\mathcal{S}}-C_{\mathcal{T}}\|_{F}

    Then, we want to show that Wk𝒮+Wk𝒯+FC𝒮C𝒯F\|W_{k\mathcal{S}}^{+}-W_{k\mathcal{T}}^{+}\|_{F}\leq\|C_{\mathcal{S}}-C_{\mathcal{T}}\|_{F}. Note that Wk𝒮Wk𝒯FC𝒮C𝒯F\|W_{k\mathcal{S}}-W_{k\mathcal{T}}\|_{F}\leq\|C_{\mathcal{S}}-C_{\mathcal{T}}\|_{F} due to the fact that WkW_{k} is a partial matrix from CC. We can prove Wk𝒮+Wk𝒯+FWk𝒮Wk𝒯F\|W_{k\mathcal{S}}^{+}-W_{k\mathcal{T}}^{+}\|_{F}\leq\|W_{k\mathcal{S}}-W_{k\mathcal{T}}\|_{F} instead. Let

    Wk𝒮=j=1kαjμjμj;Wk𝒯=j=1kβjνjνjW_{k\mathcal{S}}=\sum_{j=1}^{k}\alpha_{j}\mu_{j}\mu_{j}^{\top};\quad W_{k\mathcal{T}}=\sum_{j=1}^{k}\beta_{j}\nu_{j}\nu_{j}^{\top}

    where α1αk>1\alpha_{1}\geq\cdots\geq\alpha_{k}>1 and β1βk>1\beta_{1}\geq\cdots\geq\beta_{k}>1, and Mk×kM\in\mathbb{R}^{k\times k}, where Mi,j=μiμi,νjνjM_{i,j}=\langle\mu_{i}\mu_{i}^{\top},\nu_{j}\nu_{j}^{\top}\rangle. We have

    Wk𝒮Wk𝒯F2=jαj2+jβj22s,tαsβtMs,t\|W_{k\mathcal{S}}-W_{k\mathcal{T}}\|_{F}^{2}=\sum_{j}\alpha_{j}^{2}+\sum_{j}\beta_{j}^{2}-2\sum_{s,t}\alpha_{s}\beta_{t}M_{s,t}
    Wk𝒮+Wk𝒯+F2=j1αj2+j1βj22s,t1αsβtMs,t\|W_{k\mathcal{S}}^{+}-W_{k\mathcal{T}}^{+}\|_{F}^{2}=\sum_{j}\frac{1}{\alpha_{j}^{2}}+\sum_{j}\frac{1}{\beta_{j}^{2}}-2\sum_{s,t}\frac{1}{\alpha_{s}\beta_{t}}M_{s,t}

    So

    Wk𝒮+Wk𝒯+F2Wk𝒮Wk𝒯F2\displaystyle\|W_{k\mathcal{S}}^{+}-W_{k\mathcal{T}}^{+}\|_{F}^{2}-\|W_{k\mathcal{S}}-W_{k\mathcal{T}}\|_{F}^{2}
    \displaystyle=\sum_{j}\frac{1}{\alpha_{j}^{2}}+\sum_{j}\frac{1}{\beta_{j}^{2}}-\sum_{j}\alpha_{j}^{2}-\sum_{j}\beta_{j}^{2}
    +2s,t(αsβt1αsβt)Ms,t\displaystyle+2\sum_{s,t}(\alpha_{s}\beta_{t}-\frac{1}{\alpha_{s}\beta_{t}})M_{s,t}

    According to the definition of MM, we have

    i,jkMi,j=μi(jνjνj)μi1;j,ikMi,j1\forall i,\quad\sum_{j}^{k}M_{i,j}=\mu_{i}(\sum_{j}\nu_{j}\nu_{j}^{\top})\mu_{i}^{\top}\leq 1;\quad\forall j,\quad\sum_{i}^{k}M_{i,j}\leq 1

    Since MM is a doubly sub-stochastic matrix (each row sum and each column sum is at most 11), we can show that

    s,t(αsβt1αsβt)Ms,ts(αsβs1αsβs)\sum_{s,t}(\alpha_{s}\beta_{t}-\frac{1}{\alpha_{s}\beta_{t}})M_{s,t}\leq\sum_{s}(\alpha_{s}\beta_{s}-\frac{1}{\alpha_{s}\beta_{s}})

    It can be proved by contradiction. Suppose the optimal solution MM^{*} achieves a larger result than the R.H.S.; then MM^{*} contains non-zero off-diagonal elements. Let j>ij>i denote the column index of the first such element (i.e., Mi,jM_{i,j}), and let k>ik>i denote the corresponding row index of a non-zero element in column ii (i.e., Mk,iM_{k,i}). Let Ai,j=αiβj1αiβjA_{i,j}=\alpha_{i}\beta_{j}-\frac{1}{\alpha_{i}\beta_{j}} and we have

    Ai,j+Ak,iAi,iAk,j\displaystyle A_{i,j}+A_{k,i}-A_{i,i}-A_{k,j}
    =αiβj1αiβj+αkβi1αkβi\displaystyle=\alpha_{i}\beta_{j}-\frac{1}{\alpha_{i}\beta_{j}}+\alpha_{k}\beta_{i}-\frac{1}{\alpha_{k}\beta_{i}}
    (αiβi1αiβi)(αkβj1αkβj)\displaystyle-(\alpha_{i}\beta_{i}-\frac{1}{\alpha_{i}\beta_{i}})-(\alpha_{k}\beta_{j}-\frac{1}{\alpha_{k}\beta_{j}})
    =(αiαk)(βiβj)+(1αi1αk)(1βi1βj)\displaystyle=-(\alpha_{i}-\alpha_{k})(\beta_{i}-\beta_{j})+(\frac{1}{\alpha_{i}}-\frac{1}{\alpha_{k}})(\frac{1}{\beta_{i}}-\frac{1}{\beta_{j}})
    =(αiαk)(βiβj)+(αiαk)(βiβj)αiαkβiβj<0\displaystyle=-(\alpha_{i}-\alpha_{k})(\beta_{i}-\beta_{j})+\frac{(\alpha_{i}-\alpha_{k})(\beta_{i}-\beta_{j})}{\alpha_{i}\alpha_{k}\beta_{i}\beta_{j}}<0

    This shows that reassigning the mass to the diagonal elements achieves a larger result than the assumed optimal assignment, which contradicts the assumption.

    With the optimal results from the assignment of MM, we obtain that

    Wk𝒮+Wk𝒯+F2Wk𝒮Wk𝒯F2jk(1αj1βj)2(αjβj)2\|W_{k\mathcal{S}}^{+}-W_{k\mathcal{T}}^{+}\|_{F}^{2}-\|W_{k\mathcal{S}}-W_{k\mathcal{T}}\|_{F}^{2}\leq\sum_{j}^{k}(\frac{1}{\alpha_{j}}-\frac{1}{\beta_{j}})^{2}-(\alpha_{j}-\beta_{j})^{2}

    For each term, it is easy to show that

    j,(1αj1βj)2(αjβj)20\forall j,\quad(\frac{1}{\alpha_{j}}-\frac{1}{\beta_{j}})^{2}-(\alpha_{j}-\beta_{j})^{2}\leq 0

    Therefore,

    Wk𝒮+Wk𝒯+FWk𝒮Wk𝒯FC𝒮C𝒯F\|W_{k\mathcal{S}}^{+}-W_{k\mathcal{T}}^{+}\|_{F}\leq\|W_{k\mathcal{S}}-W_{k\mathcal{T}}\|_{F}\leq\|C_{\mathcal{S}}-C_{\mathcal{T}}\|_{F}

        

Theorem 2 illustrates that minimizing the difference between the partial Gram matrices of the teacher and the student defined by landmark points can transfer the original Gram matrix from the teacher model effectively. Note that the partial matrices have the size of n×mn\times m, where mm is the number of landmark points. When mnm\ll n, the landmark points D𝒮D_{\mathcal{S}} and D𝒯D_{\mathcal{T}} can be kept in GPU memory as parameters of the loss function for optimization, which is much more efficient than transferring the original full Gram matrix. In addition, the simplified bound relies on the eigenvalues of the landmark matrices being sufficiently large; therefore, an appropriate selection of landmark points is crucial for efficient transfer. A per-mini-batch sketch of forming the partial matrices is given below, and we elaborate our landmark selection strategy in the next subsection.
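As a sketch of how the partial matrices could be handled in practice (our own illustration, not the authors' released code), the landmark points are kept as fixed tensors on the GPU, so that each mini-batch only needs to form an r×m slice of the partial matrices.

    import torch

    def partial_gram_difference(batch_feat_s, batch_feat_t, d_s, d_t):
        # batch_feat_s: (r, dim_s), batch_feat_t: (r, dim_t) mini-batch features
        # d_s: (m, dim_s), d_t: (m, dim_t) landmark points kept in GPU memory
        c_s = batch_feat_s @ d_s.t()   # (r, m) rows of C_S for this mini-batch
        c_t = batch_feat_t @ d_t.t()   # (r, m) rows of C_T for this mini-batch
        return torch.linalg.norm(c_s - c_t)

Accumulated over the mini-batches of one epoch, this covers the full n×m matrices without ever materializing an n×n Gram matrix.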

3.3 Landmark Selection

We consider a strategy that assigns a single landmark point to each class, which means m=Lm=L for an LL-class classification problem. The desired landmark points should capture the distribution of examples well while keeping sufficient diversity among themselves so that the eigenvalues of WW remain large. With the setting that each class has one landmark point, we can derive the selection criterion as follows.

Theorem 3

Given an arbitrary pair (i,j)(i,j), let 𝐝𝒮i\mathbf{d}_{\mathcal{S}}^{i} and 𝐝𝒯i\mathbf{d}_{\mathcal{T}}^{i} denote the corresponding landmark points for the ii-th example in the spaces of the student and teacher networks, respectively. Assuming the norms of 𝐱𝒮i,𝐱𝒯i,𝐱𝒮j,𝐱𝒯j\mathbf{x}^{i}_{\mathcal{S}},\mathbf{x}^{i}_{\mathcal{T}},\mathbf{x}^{j}_{\mathcal{S}},\mathbf{x}^{j}_{\mathcal{T}} are bounded by ee, we have

(3.2) K𝒮K𝒯Fnei𝐱𝒮i𝐝𝒮i+nei𝐱𝒯i𝐝𝒯iA\displaystyle\|K_{\mathcal{S}}-K_{\mathcal{T}}\|_{F}\leq\underbrace{ne\sum_{i}\|\mathbf{x}^{i\top}_{\mathcal{S}}-\mathbf{d}^{i\top}_{\mathcal{S}}\|+ne\sum_{i}\|\mathbf{x}^{i\top}_{\mathcal{T}}-\mathbf{d}^{i\top}_{\mathcal{T}}\|}_{A}
+i,j𝐝𝒮i𝐱𝒮j𝐝𝒯i𝐱𝒯jB\displaystyle+\underbrace{\sum_{i,j}\|\mathbf{d}^{i\top}_{\mathcal{S}}\mathbf{x}^{j}_{\mathcal{S}}-\mathbf{d}^{i\top}_{\mathcal{T}}\mathbf{x}^{j}_{\mathcal{T}}\|}_{B}
  • Proof.

    For an arbitrary pair, we have

    K𝒮(𝐱i,𝐱j)K𝒯(𝐱i,𝐱j)=𝐱𝒮i𝐱𝒮j𝐱𝒯i𝐱𝒯j\displaystyle\|K_{\mathcal{S}}(\mathbf{x}_{i},\mathbf{x}_{j})-K_{\mathcal{T}}(\mathbf{x}_{i},\mathbf{x}_{j})\|=\|\mathbf{x}^{i\top}_{\mathcal{S}}\mathbf{x}^{j}_{\mathcal{S}}-\mathbf{x}^{i\top}_{\mathcal{T}}\mathbf{x}^{j}_{\mathcal{T}}\|
    =(𝐱𝒮i𝐝𝒮i+𝐝𝒮i)𝐱𝒮j(𝐱𝒯i𝐝𝒯i+𝐝𝒯i)𝐱𝒯j\displaystyle=\|(\mathbf{x}^{i\top}_{\mathcal{S}}-\mathbf{d}^{i\top}_{\mathcal{S}}+\mathbf{d}^{i\top}_{\mathcal{S}})\mathbf{x}^{j}_{\mathcal{S}}-(\mathbf{x}^{i\top}_{\mathcal{T}}-\mathbf{d}^{i\top}_{\mathcal{T}}+\mathbf{d}^{i\top}_{\mathcal{T}})\mathbf{x}^{j}_{\mathcal{T}}\|
    (𝐱𝒮i𝐝𝒮i)𝐱𝒮j+(𝐱𝒯i𝐝𝒯i)x𝒯j\displaystyle\leq\|(\mathbf{x}^{i\top}_{\mathcal{S}}-\mathbf{d}^{i\top}_{\mathcal{S}})\mathbf{x}^{j}_{\mathcal{S}}\|+\|(\mathbf{x}^{i\top}_{\mathcal{T}}-\mathbf{d}^{i\top}_{\mathcal{T}})x^{j}_{\mathcal{T}}\|
    +𝐝𝒮i𝐱𝒮j𝐝𝒯i𝐱𝒯j\displaystyle+\|\mathbf{d}^{i\top}_{\mathcal{S}}\mathbf{x}^{j}_{\mathcal{S}}-\mathbf{d}^{i\top}_{\mathcal{T}}\mathbf{x}^{j}_{\mathcal{T}}\|
    e𝐱𝒮i𝐝𝒮i+e𝐱𝒯i𝐝𝒯i+𝐝𝒮i𝐱𝒮j𝐝𝒯i𝐱𝒯j\displaystyle\leq e\|\mathbf{x}^{i\top}_{\mathcal{S}}-\mathbf{d}^{i\top}_{\mathcal{S}}\|+e\|\mathbf{x}^{i\top}_{\mathcal{T}}-\mathbf{d}^{i\top}_{\mathcal{T}}\|+\|\mathbf{d}^{i\top}_{\mathcal{S}}\mathbf{x}^{j}_{\mathcal{S}}-\mathbf{d}^{i\top}_{\mathcal{T}}\mathbf{x}^{j}_{\mathcal{T}}\|

    We obtain the result by accumulating over all pairs.        

According to Theorem 3, we can find that the transfer loss comes from two aspects. Term AA in Eqn. 3.2 contains the distance from each example to its corresponding landmark point. Since the corresponding landmark point for 𝐱𝒮i\mathbf{x}_{\mathcal{S}}^{i} can be obtained by argmin1lL{𝐱𝒮i𝐝𝒮l}\arg\min_{1\leq l\leq L}\{\|\mathbf{x}^{i\top}_{\mathcal{S}}-\mathbf{d}^{l\top}_{\mathcal{S}}\|\}, we can rewrite the problem of minimizing the original term as

miniminl{𝐱𝒮i𝐝𝒮l}\min\sum_{i}\min_{l}\{\|\mathbf{x}^{i\top}_{\mathcal{S}}-\mathbf{d}^{l\top}_{\mathcal{S}}\|\}

Apparently, this objective is a standard clustering problem. Unlike the conventional Nyström method, which is often in an unsupervised learning setting, we can access the label information in knowledge distillation. When we set the number of clusters to be the number of classes, the landmark point becomes the center in each class and can be computed by averaging examples within the class as

(3.3) 𝐝𝒮l=1nli:yi=l𝐱𝒮i;𝐝𝒯l=1nli:yi=l𝐱𝒯i\displaystyle\mathbf{d}_{\mathcal{S}}^{l}=\frac{1}{n_{l}}\sum_{i:y_{i}=l}\mathbf{x}_{\mathcal{S}}^{i};\quad\mathbf{d}_{\mathcal{T}}^{l}=\frac{1}{n_{l}}\sum_{i:y_{i}=l}\mathbf{x}_{\mathcal{T}}^{i}

where yiy_{i} is the class label of the ii-th example and nln_{l} denotes the number of examples in the ll-th class. In addition, class centers preserve the diversity of the original examples, which leads to the large eigenvalues required by Theorem 2. We verify this assumption in the experiments, and sketch the computation of the class centers below.
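The class centers in Eqn. 3.3 can be computed with a simple per-class average. The sketch below is our own illustration; it assumes the features of the training set have been accumulated (e.g., during the previous epoch) and that every class appears at least once.

    import torch

    def class_centers(features, labels, num_classes):
        # features: (n, d) features from one network; labels: (n,) integer class labels
        d = features.shape[1]
        centers = torch.zeros(num_classes, d, device=features.device)
        counts = torch.zeros(num_classes, device=features.device)
        centers.index_add_(0, labels, features)                                   # per-class feature sums
        counts.index_add_(0, labels, torch.ones(labels.shape[0], device=features.device))
        return centers / counts.unsqueeze(1)                                      # (L, d) class-mean landmarks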

Term BB in Eqn. 3.2, in fact, captures the difference between the student and teacher Gram matrices defined by the landmark points, which is consistent with Theorem 2.

With landmark points D𝒮D_{\mathcal{S}} and D𝒯D_{\mathcal{T}} obtained by minimizing term AA, we can formulate the Knowledge Distillation problem by transferring the Approximated Gram matrix (KDA) as

(3.4) minX𝒮KDA(X𝒮D𝒮X𝒯D𝒯)\displaystyle\min_{X_{\mathcal{S}}}\ell_{\mathrm{KDA}}(X_{\mathcal{S}}^{\top}D_{\mathcal{S}}-X_{\mathcal{T}}^{\top}D_{\mathcal{T}})

where KDA(X𝒮D𝒮X𝒯D𝒯)=i=1,l=1i=n,l=L(𝐝𝒮l𝐱𝒮i𝐝𝒯l𝐱𝒯i)\ell_{\mathrm{KDA}}(X_{\mathcal{S}}^{\top}D_{\mathcal{S}}-X_{\mathcal{T}}^{\top}D_{\mathcal{T}})=\sum_{i=1,l=1}^{i=n,l=L}\ell(\mathbf{d}^{l\top}_{\mathcal{S}}\mathbf{x}^{i}_{\mathcal{S}}-\mathbf{d}^{l\top}_{\mathcal{T}}\mathbf{x}^{i}_{\mathcal{T}}) and we adopt ()\ell(\cdot) as the smoothed L1L_{1} loss for the stable optimization as

\ell(z)=\left\{\begin{array}[]{cc}|z|-0.5&|z|>1\\ 0.5z^{2}&o.w.\end{array}\right.

With the KDA loss, we propose a novel knowledge distillation algorithm that works in an alternating manner. In each epoch, we first compute the landmark points from the features of examples accumulated during the last epoch by Eqn. 3.3. Then, the KDA loss defined by the fixed landmark points in Eqn. 3.4 is optimized along with a standard cross-entropy loss for the student. The proposed algorithm is summarized in Alg. 1, and a condensed implementation sketch is given after it. Since at least one epoch must be spent on collecting features for computing the landmark points, we minimize the KDA loss only after HH epochs of training, where H1H\geq 1.

Algorithm 1 Knowledge Distillation by Approximated Kernel Matrix Transfer (KDA)
  Input: Data set {𝐱i}\{\mathbf{x}_{i}\}, a student model 𝒮\mathcal{S}, a teacher model 𝒯\mathcal{T}, total epochs TT, warm-up epochs HH
  Initialize {𝐝𝒮0}=\{\mathbf{d}_{\mathcal{S}}^{0}\}=\emptyset, {𝐝𝒯0}=\{\mathbf{d}_{\mathcal{T}}^{0}\}=\emptyset
  for t=1t=1 to HH do
     Optimize 𝒮\mathcal{S} without KDA loss
     Compute {𝐝𝒮t}\{\mathbf{d}_{\mathcal{S}}^{t}\}, {𝐝𝒯t}\{\mathbf{d}_{\mathcal{T}}^{t}\} as in Eqn. 3.3
  end for
  for t=H+1t=H+1 to TT do
     Optimize 𝒮\mathcal{S} with KDA loss defined on {𝐝𝒮t1}\{\mathbf{d}_{\mathcal{S}}^{t-1}\}, {𝐝𝒯t1}\{\mathbf{d}_{\mathcal{T}}^{t-1}\}
     Compute {𝐝𝒮t}\{\mathbf{d}_{\mathcal{S}}^{t}\}, {𝐝𝒯t}\{\mathbf{d}_{\mathcal{T}}^{t}\} as in Eqn. 3.3
  end for
  Output: 𝒮\mathcal{S}
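A condensed PyTorch sketch of Alg. 1 follows. It is our own approximation of the procedure under stated assumptions: both networks are assumed to return (logits, features) for the chosen layer, class_centers refers to the helper sketched in Section 3.3, and lambda_kda is a hypothetical weight balancing the KDA term against the cross-entropy loss.

    import torch
    import torch.nn.functional as F

    def kda_loss(feat_s, feat_t, centers_s, centers_t):
        # Eqn. 3.4: smoothed L1 loss applied elementwise to X_S^T D_S - X_T^T D_T
        return F.smooth_l1_loss(feat_s @ centers_s.t(), feat_t @ centers_t.t())

    def train_kda(student, teacher, loader, optimizer, T=120, H=5, num_classes=100, lambda_kda=1.0):
        centers_s = centers_t = None
        for epoch in range(1, T + 1):
            feats_s, feats_t, labels = [], [], []
            for x, y in loader:
                logits_s, feat_s = student(x)              # assumed (logits, features) interface
                with torch.no_grad():
                    _, feat_t = teacher(x)
                loss = F.cross_entropy(logits_s, y)
                if epoch > H:                              # add the KDA loss after warm-up epochs
                    loss = loss + lambda_kda * kda_loss(feat_s, feat_t, centers_s, centers_t)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
                feats_s.append(feat_s.detach())
                feats_t.append(feat_t)
                labels.append(y)
            # update landmark points (class centers, Eqn. 3.3) with this epoch's features
            centers_s = class_centers(torch.cat(feats_s), torch.cat(labels), num_classes)
            centers_t = class_centers(torch.cat(feats_t), torch.cat(labels), num_classes)
        return student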

3.4 Connection to Conventional KD

In the conventional KD method [7], only the outputs of the last layer of the teacher model are adopted for the student. With an appropriately chosen temperature, [7] shows that the KD loss can be written as

(X𝒮)=i𝐱𝒮i𝐱𝒯i2\ell(X_{\mathcal{S}})=\sum_{i}\|\mathbf{x}_{\mathcal{S}}^{i}-\mathbf{x}_{\mathcal{T}}^{i}\|^{2}

where 𝐱𝒮i,𝐱𝒯iC\mathbf{x}_{\mathcal{S}}^{i},\mathbf{x}_{\mathcal{T}}^{i}\in\mathbb{R}^{C} denote the logits before the SoftMax operator. With the identity matrix II, the equivalent formulation is

(X𝒮)=X𝒮IX𝒯IF2\ell(X_{\mathcal{S}})=\|X_{\mathcal{S}}^{\top}I-X_{\mathcal{T}}^{\top}I\|_{F}^{2}

Compared to the KDA loss in Eqn. 3.4, conventional KD can be considered as applying one-hot label vectors as landmark points to transfer the Gram matrix of the teacher network. However, it lacks the constraint on the similarity between each example and its corresponding landmark point illustrated in Theorem 3, which may degrade the performance of knowledge distillation.
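To make the connection concrete, the following toy sketch (ours, with random tensors in place of real logits and centers) contrasts the two choices of landmark points: the identity matrix used implicitly by conventional KD versus class centers used by KDA.

    import torch

    num_classes = 100
    logits_s = torch.randn(8, num_classes)     # student logits for a mini-batch
    logits_t = torch.randn(8, num_classes)     # teacher logits

    # Conventional KD: landmark points are one-hot vectors, i.e., D = I on the logits
    eye = torch.eye(num_classes)
    kd_term = torch.linalg.norm(logits_s @ eye - logits_t @ eye) ** 2

    # KDA view: landmark points are class centers in the same space (random here, for illustration)
    centers_s = torch.randn(num_classes, num_classes)
    centers_t = torch.randn(num_classes, num_classes)
    kda_term = torch.linalg.norm(logits_s @ centers_s.t() - logits_t @ centers_t.t())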

4 Experiments

We conduct experiments on two benchmark data sets to illustrate the effectiveness of the proposed KDA algorithm. We employ ResNet-34 (denoted as R34) [6] as the teacher network. ResNet-18 (denoted as R18), ResNet-18-0.5 (denoted as R18-0.5) and ShuffleNetV2 (denoted as SN) [11] are adopted as student networks, where ResNet-18-0.5 denotes ResNet-18 with half the number of channels. We apply standard stochastic gradient descent (SGD) with momentum to train the networks. Specifically, we set the mini-batch size to 256256, the momentum to 0.90.9 and the weight decay to 5e5e-44 in all experiments. The student models are trained for 120120 epochs. The initial learning rate is 0.10.1 with cosine decay, and H=5H=5 epochs are used for warm-up. All experiments are implemented on two V100 GPUs.

Three baseline knowledge distillation methods are included in the comparison

  • KD [7]: a conventional knowledge distillation method that constrains the KL-divergence between the output label distributions of the student and teacher networks.

  • AT [23]: a method that transfers the information from intermediate layers to accelerate the training of the student network.

  • RKD [13]: a recent representative work that regularizes the similarity matrices within a mini-batch between the student and teacher networks. It adopts the same similarity function as in our method. Although different ways were used to calculate the similarity (e.g., L2 in [10, 13], normalized inner product in [20] and RBF kernel in [14]), they provide similar performance.

Every algorithm minimizes a combined loss consisting of the distillation loss and the standard cross-entropy loss for classification. For RKD, we transfer the features before the last fully-connected (FC) layer for comparison. Note that AT transfers the attention map of the teacher, so we adopt the feature before the last pooling layer for distillation. Besides, we let “Baseline” denote the method that trains the student without information from the teacher. Our method is referred to as “KDA”. We search the best parameters for all methods in the comparison and keep the same parameters across different experiments.

4.1 Ablation Study

We perform the ablation study on CIFAR-100 [8]. This data set contains 100100 classes, where each class has 500500 images for training and 100100 for test. Each image is a color image of size 32×3232\times 32.

In this subsection, we set ResNet-34 as the teacher and ResNet-18 as the student. During the training, each image is first padded to be 40×4040\times 40, and then we randomly crop a 32×3232\times 32 image from it. Besides, random horizontal flipping is also adopted for data augmentation.

4.1.1 Effect of Landmark Points

First, we evaluate the strategy for generating landmark points. As illustrated in Corollary 1, the randomly selected landmark points can achieve a good performance. So we compare the KDA with class centers to that with random landmark points in Fig. 1. In this experiment, we adopt the features before the last FC layer for transfer.

(a) Overall Comparison  (b) Zoom In

Figure 1: Comparison of landmark points selection.

From Fig. 1, we can observe that with landmark points, the two variants of KDA perform significantly better than the baseline. It demonstrates that the Gram matrix is informative for training student models and that transferring the full matrix from the teacher helps improve the performance of the student. In addition, KDA with class centers as the landmark points shows the best performance among the different methods, which confirms the criterion suggested in Theorem 3. We use class centers as landmark points in the remaining experiments.

4.1.2 Effect of Full Matrix Transfer

Then, we compare the difference between the Gram matrices of the teacher and its student models. The performance of transferring a Gram matrix is measured by K𝒮K𝒯F/K𝒯F\|K_{\mathcal{S}}-K_{\mathcal{T}}\|_{F}/\|K_{\mathcal{T}}\|_{F}, which calculates the fraction of information that has not been transferred. We investigate features from two layers in the comparison: the one before the last FC layer and the one after the FC layer. The transfer performance of the different layers is illustrated in Fig. 2 (a) and (b), respectively.
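Although the ground-truth quantity involves an n×n matrix, it can be evaluated without materializing it, since \|X_{\mathcal{S}}^{\top}X_{\mathcal{S}}-X_{\mathcal{T}}^{\top}X_{\mathcal{T}}\|_{F}^{2}=\|X_{\mathcal{S}}X_{\mathcal{S}}^{\top}\|_{F}^{2}-2\|X_{\mathcal{T}}X_{\mathcal{S}}^{\top}\|_{F}^{2}+\|X_{\mathcal{T}}X_{\mathcal{T}}^{\top}\|_{F}^{2} only involves matrices of size d×d (or d_T×d_S). The sketch below uses this standard identity; the paper does not specify how the metric was computed, so this is only one possible implementation.

    import torch

    def relative_transfer_loss(feat_s, feat_t):
        # feat_s: (n, d_s) = X_S^T, feat_t: (n, d_t) = X_T^T for the same n examples
        g_ss = feat_s.t() @ feat_s            # (d_s, d_s) = X_S X_S^T
        g_tt = feat_t.t() @ feat_t            # (d_t, d_t) = X_T X_T^T
        g_ts = feat_t.t() @ feat_s            # (d_t, d_s) = X_T X_S^T
        diff_sq = (g_ss ** 2).sum() - 2 * (g_ts ** 2).sum() + (g_tt ** 2).sum()
        return torch.sqrt(diff_sq.clamp(min=0)) / torch.sqrt((g_tt ** 2).sum())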

(a) Layer before FC  (b) Layer after FC

Figure 2: Comparison of kernel transfer performance measured by K𝒮K𝒯F/K𝒯F\|K_{\mathcal{S}}-K_{\mathcal{T}}\|_{F}/\|K_{\mathcal{T}}\|_{F} (lower is better).

Fig. 2 (a) compares the performance of kernel transfer. First, it is obvious that both RKD and KDA are better than the baseline, which indicates that minimizing the difference between Gram matrices can effectively transfer appropriate information from the teacher. Second, RKD transfers only the similarity matrix defined by the examples in a mini-batch and shows a larger transfer loss than KDA. Considering the massive number of pairs, optimizing over all of them in RKD is intractable: the number of pairs can be up to 𝒪(n2)\mathcal{O}(n^{2}), while the number of pairs in a mini-batch is only 𝒪(r2)\mathcal{O}(r^{2}), where rr is the size of a mini-batch, so visiting all pairs just once requires at least 𝒪(n/r)\mathcal{O}(n/r) epochs.

On the contrary, the loss from KDA is only about 23%23\% of that from RKD. KDA optimizes the partial Gram matrix with landmark points, and the total number of pairs is linear in the number of original examples. Due to the small number of landmark points, the partial matrix CC is much more compact than the original one. For example, there are 50,00050,000 examples in CIFAR-100; when applying 100100 landmark points for distillation, the partial matrix contains only 0.2%0.2\% of the entries of the original one. Besides, since we keep the class centers in memory as the parameters of the loss function, the full Gram matrix can be approximated within a single epoch. Therefore, SGD can optimize the KDA loss sufficiently.

Then, we compare the performance of transfer after the last FC layer, as shown in Fig. 2 (b). For KDA, we compute the Gram matrix with the features before the SoftMax operator. From the comparison, we can observe that both KD and KDA have much smaller transfer loss than the baseline. As illustrated in the discussion of “Connection to Conventional KD”, conventional KD is equivalent to transferring the partial Gram matrix with one-hot landmark points; therefore, it can reduce the difference between teacher and student effectively. However, the landmark points adopted by KD fail to satisfy the property illustrated in Theorem 3. By adopting class centers as landmark points, KDA further reduces the transfer loss from 0.230.23 for KD to 0.20.2, which confirms the effectiveness of transferring the full Gram matrix with appropriate landmark points.

Figure 3: C𝒮C𝒯F/K𝒯F\|C_{\mathcal{S}}-C_{\mathcal{T}}\|_{F}/\|K_{\mathcal{T}}\|_{F} (i.e., our estimation) vs. K𝒮K𝒯F/K𝒯F\|K_{\mathcal{S}}-K_{\mathcal{T}}\|_{F}/\|K_{\mathcal{T}}\|_{F} (i.e., ground-truth).

Finally, we demonstrate that the difference between the partial Gram matrices is closely correlated with that between the full Gram matrices, as suggested in Theorem 2. Features from the layer before the last FC layer are adopted for evaluation. Fig. 3 illustrates how the ground-truth transfer loss K𝒮K𝒯F/K𝒯F\|K_{\mathcal{S}}-K_{\mathcal{T}}\|_{F}/\|K_{\mathcal{T}}\|_{F} and the estimated transfer loss C𝒮C𝒯F/K𝒯F\|C_{\mathcal{S}}-C_{\mathcal{T}}\|_{F}/\|K_{\mathcal{T}}\|_{F} change during training (values are scaled by 100×100\times for better visualization). Evidently, minimizing C𝒮C𝒯F\|C_{\mathcal{S}}-C_{\mathcal{T}}\|_{F} can reduce the gap between the original Gram matrices effectively, which is consistent with our theoretical analysis. Note that Theorem 2 assumes that the smallest eigenvalues of W𝒮W_{\mathcal{S}} and W𝒯W_{\mathcal{T}} are larger than 11. Since we adopt class centers as landmark points, W𝒮W_{\mathcal{S}} and W𝒯W_{\mathcal{T}} can be full-rank matrices. We show their smallest eigenvalues in Fig. 4; the smallest eigenvalue is larger than 10, which satisfies our assumption.

Figure 4: Smallest eigenvalues of W𝒮W_{\mathcal{S}} (i.e., student) and W𝒯W_{\mathcal{T}} (i.e., teacher).

4.1.3 Effect of Matrix WW

Corollary 1 implies a variant that uses the standard Nyström method, including the matrix WW, for transfer, while our proposal ignores WW and optimizes only the difference between C𝒮C_{\mathcal{S}} and C𝒯C_{\mathcal{T}} for efficiency. We compare our method to the variant with WW in Fig. 5, where features before the last FC layer are adopted for transfer. During the experiment, we find that using W𝒯+12W_{\mathcal{T}}^{+\frac{1}{2}} provides better performance. We can observe that our method without WW performs similarly to the variant with W𝒯+12W_{\mathcal{T}}^{+\frac{1}{2}}, which further supports our analysis in Theorem 2.

Figure 5: Comparison of our method and the method with matrix WW as in Corollary 1

4.1.4 Effect of #\#Centers per Class

When assigning landmark points, we set their number to be the number of classes, which avoids clustering in the implementation and constrains each class to have a single landmark point. The number of landmark points per class can be easily increased by clustering. We compare the variant with two centers per class in Fig. 6, where features before the last FC layer are adopted for comparison.

Figure 6: Comparison of different number of landmark points (i.e., centers) for each class.

We can observe that more centers for each class do not improve the performance significantly. This may be because the feature space is optimized with the cross-entropy loss. As illustrated in [15], the cross-entropy loss pushes all examples from the same class towards a single center. Therefore, assigning one landmark point for each class is an appropriate setting, which also simplifies the algorithm and improves the efficiency. We use one center per class in the following experiments.

Table 1: Comparison of accuracy (%\%) on CIFAR-100 when transferring the full Gram matrix from different layers (S is the student without KD and T is the teacher).
S | conv3_x | conv4_x | Before FC | After FC | T
77.2 | 77.7 | 78.1 | 79.6 | 79.4 | 80.3
Table 2: Comparison of accuracy (%\%) on CIFAR-100 (S is student without KD and T is teacher).
T | S | S (w/o KD) | Before Last FC: AT | RKD | KDA | After FC: KD | KDA | Combo: KDA | T
R34 | R18 | 77.2 | 78.1±\pm0.3 | 78.3±\pm0.2 | 79.6±\pm0.1 | 78.8±\pm0.2 | 79.4±\pm0.1 | 79.7±\pm0.1 | 80.3
R34 | R18-0.5 | 73.5 | 75.0±\pm0.1 | 74.3±\pm0.3 | 75.6±\pm0.3 | 74.8±\pm0.2 | 75.3±\pm0.2 | 75.9±\pm0.2 | 80.3
R34 | SN | 71.7 | 73.0±\pm0.1 | 72.5±\pm0.1 | 74.0±\pm0.3 | 72.9±\pm0.1 | 73.6±\pm0.1 | 74.2±\pm0.3 | 80.3
Table 3: Comparison of accuracy (%\%) on Tiny-ImageNet (S is student without KD and T is teacher).
T | S | S (w/o KD) | Before Last FC: AT | RKD | KDA | After FC: KD | KDA | Combo: KDA | T
R34 | R18 | 63.4 | 64.4±\pm0.1 | 63.9±\pm0.2 | 65.2±\pm0.3 | 64.9±\pm0.2 | 65.4±\pm0.1 | 65.5±\pm0.1 | 66.6
R34 | R18-0.5 | 60.3 | 61.0±\pm0.1 | 60.6±\pm0.2 | 61.7±\pm0.3 | 61.3±\pm0.1 | 61.9±\pm0.2 | 62.2±\pm0.2 | 66.6
R34 | SN | 60.6 | 61.3±\pm0.1 | 61.2±\pm0.2 | 62.0±\pm0.2 | 61.5±\pm0.1 | 62.3±\pm0.2 | 62.4±\pm0.1 | 66.6

4.1.5 Effect of Different Layers

Now, we illustrate the performance of transferring the Gram matrix from different layers. ResNet consists of 55 convolutional layer groups, and we compare the performance of the last 33 groups (i.e., “conv3_x”, “conv4_x” and “conv5_x”) and the layer after the last FC layer. The definition of the groups can be found in [6]. For each group, we adopt its last layer for transfer. Before transfer, we add a pooling layer to reduce the dimension of the feature map. Note that after pooling, the last layer of “conv5_x” becomes the layer before the last FC layer.

Table 1 shows the performance of transferring information from different layers. First, transferring information from the teacher always improves the performance of the student, which demonstrates the effectiveness of knowledge distillation. Besides, the information from later layers is more helpful for training the student. This is because later layers contain more semantic information that is closely related to the target task. We focus on the layers before and after the FC layer in the remaining experiments.

4.2 CIFAR-100

In this subsection, we compare the proposed KDA method to baseline methods on CIFAR-100. The results of different methods can be found in Table 2. All experiments are repeated 33 times and the average results with standard deviation are reported.

First, knowledge distillation methods outperform training the student network without a teacher, which shows that knowledge distillation can improve the performance of student models significantly. By transferring the full Gram matrix before the last FC layer, KDA surpasses RKD by 1.3%1.3\% when ResNet-34 is the teacher and ResNet-18 is the student. The observation is consistent with the comparison in the ablation study, which confirms that K𝒮K𝒯F/K𝒯F\|K_{\mathcal{S}}-K_{\mathcal{T}}\|_{F}/\|K_{\mathcal{T}}\|_{F} is an appropriate metric to evaluate the transfer loss of the Gram matrix. Moreover, KDA shows a significant improvement on different student networks, which further demonstrates the applicability of the proposed method.

Furthermore, when transferring the Gram matrix after the last FC layer, both KD and KDA demonstrate good performance compared with the baseline student model. This is because these methods transfer the Gram matrix with landmark points, which is efficient for optimization. Besides, KDA further improves the performance compared to KD. The superior performance of KDA demonstrates the effectiveness of the proposed strategy for generating appropriate landmark points.

Finally, compared to the baseline methods, KDA can distill the information from different layers with the same formulation in Eqn. 3.4. The proposed method provides a systematic perspective for understanding the family of knowledge distillation methods that aim to transfer the Gram matrix. When transferring the Gram matrices before and after the FC layer simultaneously, the performance of KDA can be slightly improved, as illustrated by “Combo” in Table 2.

4.3 Tiny-ImageNet

Then, we compare different methods on the Tiny-ImageNet data set (https://tiny-imagenet.herokuapp.com). There are 200200 classes in this data set, and each class provides 500500 images for training and 5050 for validation. We report the performance on the validation set. Since the images in Tiny-ImageNet are of size 64×6464\times 64, which is larger than those in CIFAR-100, we replace the random crop augmentation with a more aggressive version as in [6], and keep the other settings the same.

Table 3 summarizes the comparison. We can observe similar results to those on CIFAR-100. First, all methods using the information from a teacher model surpass the student model trained without a teacher by a large margin. Second, KDA outperforms the other methods regardless of the layer where the transfer happens. Note that CIFAR-100 and Tiny-ImageNet contain very different images, which demonstrates the applicability of the proposed algorithm in different real-world applications.

5 Conclusions

In this work, we investigate the knowledge distillation problem from the perspective of kernel matrix transfer. Considering that the number of entries in the full kernel matrix is quadratic in the number of training examples, we extend the Nyström method and propose a strategy to obtain the landmark points for efficient transfer. The proposed method not only improves the efficiency of transferring the kernel matrix, but also has a theoretical guarantee for its efficacy. Experiments on benchmark data sets verify the effectiveness of the proposed algorithm.

Besides the similarity function applied in this work, there are many other complicated functions adopted by other methods. Combining the proposed algorithm with different similarity functions for distillation can be our future work.

References

  • [1] S. Ahn, S. X. Hu, A. Damianou, N. D. Lawrence, and Z. Dai, Variational information distillation for knowledge transfer, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 9163–9171.
  • [2] G. Chen, W. Choi, X. Yu, T. X. Han, and M. Chandraker, Learning efficient object detection models with knowledge distillation, in Advances in Neural Information Processing Systems 30, 2017, pp. 742–751.
  • [3] Y. Chen, N. Wang, and Z. Zhang, Darkrank: Accelerating deep metric learning via cross sample similarities transfer, in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32, 2018.
  • [4] M. Courbariaux, Y. Bengio, and J. David, Binaryconnect: Training deep neural networks with binary weights during propagations, in Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, eds., 2015, pp. 3123–3131.
  • [5] P. Drineas, M. W. Mahoney, and N. Cristianini, On the nyström method for approximating a gram matrix for improved kernel-based learning., JMLR, 6 (2005).
  • [6] K. He, X. Zhang, S. Ren, and J. Sun, Deep residual learning for image recognition, in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
  • [7] G. Hinton, O. Vinyals, and J. Dean, Distilling the knowledge in a neural network, arXiv preprint arXiv:1503.02531, (2015).
  • [8] A. Krizhevsky, G. Hinton, et al., Learning multiple layers of features from tiny images, (2009).
  • [9] S. Kumar, M. Mohri, and A. Talwalkar, Sampling methods for the nyström method, JMLR, 13 (2012), pp. 981–1006.
  • [10] Y. Liu, J. Cao, B. Li, C. Yuan, W. Hu, Y. Li, and Y. Duan, Knowledge distillation via instance relationship graph, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 7096–7104.
  • [11] N. Ma, X. Zhang, H.-T. Zheng, and J. Sun, Shufflenet v2: Practical guidelines for efficient cnn architecture design, in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 116–131.
  • [12] S. I. Mirzadeh, M. Farajtabar, A. Li, N. Levine, A. Matsukawa, and H. Ghasemzadeh, Improved knowledge distillation via teacher assistant, in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, 2020, pp. 5191–5198.
  • [13] W. Park, D. Kim, Y. Lu, and M. Cho, Relational knowledge distillation, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 3967–3976.
  • [14] B. Peng, X. Jin, J. Liu, D. Li, Y. Wu, Y. Liu, S. Zhou, and Z. Zhang, Correlation congruence for knowledge distillation, in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 5007–5016.
  • [15] Q. Qian, L. Shang, B. Sun, J. Hu, H. Li, and R. Jin, Softtriple loss: Deep metric learning without triplet sampling, in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 6450–6458.
  • [16] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio, Fitnets: Hints for thin deep nets, in Proceedings of the 3rd International Conference on Learning Representations, 2015.
  • [17] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, Mobilenetv2: Inverted residuals and linear bottlenecks, in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 4510–4520.
  • [18] S. Srinivas and F. Fleuret, Knowledge transfer with jacobian matching, in International Conference on Machine Learning, PMLR, 2018, pp. 4723–4731.
  • [19] Y. Tian, D. Krishnan, and P. Isola, Contrastive representation distillation, in Proceedings of the 8th International Conference on Learning Representations, 2020.
  • [20] F. Tung and G. Mori, Similarity-preserving knowledge distillation, in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 1365–1374.
  • [21] C. K. I. Williams and M. W. Seeger, Using the nyström method to speed up kernel machines, in Advances in Neural Information Processing Systems 13, 2000, pp. 682–688.
  • [22] J. Yim, D. Joo, J. Bae, and J. Kim, A gift from knowledge distillation: Fast optimization, network minimization and transfer learning, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4133–4141.
  • [23] S. Zagoruyko and N. Komodakis, Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer, in Proceedings of the 5th International Conference on Learning Representations, 2017.
  • [24] K. Zhang, I. W. Tsang, and J. T. Kwok, Improved nyström low-rank approximation and error analysis, in Proceedings of the 25th international conference on Machine learning, 2008, pp. 1232–1239.