Improved Knowledge Distillation via Full Kernel Matrix Transfer

Qi Qian Alibaba Group, [email protected]    Hao Li Alibaba Group, [email protected]    Juhua Hu University of Washington, [email protected]
Abstract

Knowledge distillation is an effective way to compress models in deep learning. Given a large model (i.e., the teacher model), it aims to improve the performance of a compact model (i.e., the student model) by transferring information from the teacher. Various kinds of information for distillation have been studied. Recently, a number of works propose to transfer the pairwise similarity between examples to distill relative information. However, most efforts are devoted to developing different similarity measurements, while only a small matrix consisting of the examples within a mini-batch is transferred at each iteration, which can be inefficient for optimizing the pairwise similarity over the whole data set. In this work, we aim to transfer the full similarity matrix effectively. The main challenge comes from the size of the full matrix, which is quadratic in the number of examples. To address the challenge, we decompose the original full matrix with the Nyström method. By selecting appropriate landmark points, our theoretical analysis indicates that the loss for transfer can be further simplified. Concretely, we find that the difference between the full kernel matrices of the teacher and the student can be well bounded by that of the corresponding partial matrices, which consist only of similarities between the original examples and the landmark points. Compared with the full matrix, the size of the partial matrix is linear in the number of examples, which improves the efficiency of optimization significantly. The empirical study on benchmark data sets demonstrates the effectiveness of the proposed algorithm. Code is available at https://github.com/idstcv/KDA.

Keywords— knowledge distillation, full kernel matrix transfer

1 Introduction

With the development of deep learning, neural networks have made many computer vision tasks feasible on edge devices. These devices often have limited computation and storage resources. Therefore, neural networks with a small number of FLOPs and parameters are preferred. Many efforts have been devoted to improving the performance of neural networks under resource constraints [4, 7, 17]. Among the various strategies developed, knowledge distillation (KD) is a simple yet effective way to help train compact networks [7].

KD in deep learning aims to improve the performance of a small network (i.e., the student) with information from a large network (i.e., the teacher). Given a teacher, various information can be transferred to regularize the training of the student network. [7] transfers the label information to smooth the label space of the student network. [16] and [23] propose to transfer the information of intermediate layers to help training. [22] transfers the flow between layers as hints for student networks. [3] improves the performance of metric learning with the rank information from teacher models. Recently, a number of methods have been developed to transfer pairwise similarities between examples to student networks [10, 13, 14, 20], and different similarity functions have been investigated. In this work, we also focus on pairwise similarity matrix transfer, but we aim to improve the efficiency of the transfer rather than the similarity measurement.

Given two images 𝐱i\mathbf{x}_{i} and 𝐱j\mathbf{x}_{j}, their similarity can be computed as k(𝐱i,𝐱j)=ϕ(𝐱i),ϕ(𝐱j)k(\mathbf{x}_{i},\mathbf{x}_{j})=\langle\phi(\mathbf{x}_{i}),\phi(\mathbf{x}_{j})\rangle, where ϕ()\phi(\cdot) is the feature map associated with the kernel, which projects examples from the original space to a space that fits the problem better. In an appropriate space, a simple model (e.g., a linear model) can describe the data well [21]. Many efforts have been devoted to designing similarity functions, but the strategy for effective transfer is less investigated. Note that the size of the full similarity matrix is n×nn\times n, where nn is the total number of examples, while only a mini-batch of examples is accessible at each iteration in a standard neural network training pipeline. Most existing methods [10, 13, 14, 20] transfer only a partial matrix (i.e., r×rr\times r, where rr is the batch size) consisting of examples from a mini-batch, which can be inefficient for distilling the information of the full kernel matrix from the teacher. Therefore, many KD methods that transfer pairwise similarity have to include an additional KD loss [7] to reach the desired performance.

In this work, we propose to transfer the full kernel matrix between all examples from the teacher model to the student model efficiently. The main challenge comes from the size of the full matrix. With a large number of training examples, it becomes intractable to transfer the full matrix directly, especially for training neural networks with mini-batches. Therefore, we first apply the Nyström method [21] to obtain a low-rank approximation of the original matrix with landmark points. Then, we minimize the difference between the compact matrices that are calculated between original examples and landmark points, to transfer the information from the teacher effectively.

Compared with the full matrix of size 𝒪(n2)\mathcal{O}(n^{2}), the size of the compact matrix for transfer is only 𝒪(n)\mathcal{O}(n) in our method. Besides, the number of landmark points is significantly smaller than that of training examples and we can keep them in the network, which improves the efficiency of optimization. Considering that the selection of landmark points is important for approximating the original matrix, we propose to apply class centers as landmark points for better distillation. Our theoretical analysis shows that the difference between the original matrices of the teacher and the student can be well bounded by that of the corresponding partial matrices. Interestingly, according to our analysis, the conventional KD method in [7] that transfers the label information can be considered as a special case of our proposal with sub-optimal landmark points. This observation provides a new perspective to understand the efficacy of KD. We conduct extensive experiments on benchmark data sets. Our method achieves satisfactory performance without any additional KD loss, which confirms the effectiveness of transferring the full kernel matrix. The main contributions of this work can be summarized as follows.

  • Instead of transferring partial similarity matrices within mini-batches only, which can be inefficient for KD optimization, we propose to transfer the full kernel matrix from the teacher model to the student model.

  • Considering that it is intractable to transfer the full matrix directly due to its large size and the mini-batch setting in training, we propose to obtain a low-rank approximation and minimize the difference between the compact matrices for transferring.

  • More importantly, we provide a theoretical guarantee that the difference between original full matrices from teacher and student can be well bounded by that of the corresponding compact matrices, which also provides a new perspective to understand the efficacy of conventional KD.

  • Our experiments including extensive ablation studies on benchmark data sets further demonstrate our theory and the effectiveness of our proposed method.

2 Related Work

2.1 Knowledge distillation

Knowledge distillation has a long history in ensemble learning and has become popular for training small-sized neural networks [1, 3, 7, 10, 12, 13, 18, 19, 23]. Various algorithms have been developed to transfer different information from the teacher model to the student model. For example, [7] considers the final output of the teacher model as a soft label and regularizes the similarity between the label distribution output by the student model and the soft label from the teacher model. [23] transfers the attention maps from intermediate layers, which provides a way to explore more information from the teacher model. [1] proposes an information-theoretic framework for knowledge transfer, which formulates knowledge transfer as maximizing the mutual information between the teacher and the student networks.

Recently, a number of algorithms [10, 13, 14, 20] have been proposed to transfer pairwise similarity for knowledge distillation. However, they focus on developing similarity functions while transferring only the similarity information of pairs within a mini-batch, which can be inefficient for distillation. We, on the other hand, aim to transfer the full matrix directly to achieve better performance. Furthermore, we provide a theoretical analysis to demonstrate the effectiveness of the proposed method.

Besides classification, some methods are proposed for other tasks, e.g., detection [2] and metric learning [3]. We focus on classification in this work while the proposed method can be easily extended to metric learning that aims to optimize the performance of the embedding layer.

2.2 Nyström method

The Nyström method is an effective algorithm to obtain a low-rank approximation of a Gram matrix [21]. Given a full Gram matrix, it reconstructs the original one from randomly sampled columns. The data points corresponding to the selected columns are denoted as landmark points. The approximation error can be bounded even with randomly sampled landmark points. Later work showed that a careful sampling strategy can further improve the performance [5, 9, 24]. [5] proposes to sample landmark points with a data-dependent probability distribution rather than the uniform distribution. [9] and [24] demonstrate that using clustering centers as landmark points provides the best approximation among different strategies.

Note that the Nyström method was developed for unsupervised Gram matrix approximation, while we can access the label information in knowledge distillation. In this work, we provide an analysis of the selection criterion of landmark points for Gram matrix transfer and develop a supervised strategy accordingly. Besides, the Nyström method approximates a single Gram matrix, while we aim to transfer information between two Gram matrices.

3 Efficient Kernel Matrix Transfer

Given two images 𝐱i\mathbf{x}_{i} and 𝐱j\mathbf{x}_{j}, the similarity between them can be measured with a kernel function as k(𝐱i,𝐱j)=ϕ(𝐱i),ϕ(𝐱j)k(\mathbf{x}_{i},\mathbf{x}_{j})=\langle\phi(\mathbf{x}_{i}),\phi(\mathbf{x}_{j})\rangle, where ϕ(𝐱i)\phi(\mathbf{x}_{i}) is a projection function that projects examples from the original space to a space better for the target task.

In this work, we consider a certain layer in a neural network as a projection function, where the output of that layer can be considered as ϕ(𝐱)\phi(\mathbf{x}). We denote the student network as 𝒮\mathcal{S} and the teacher network as 𝒯\mathcal{T}. The features output from a certain layer of 𝒮\mathcal{S} and 𝒯\mathcal{T} are referred to as f𝒮(𝐱i)=𝐱𝒮if_{\mathcal{S}}(\mathbf{x}_{i})=\mathbf{x}^{i}_{\mathcal{S}} and f𝒯(𝐱i)=𝐱𝒯if_{\mathcal{T}}(\mathbf{x}_{i})=\mathbf{x}^{i}_{\mathcal{T}}, respectively, where the index of the layer is omitted for brevity. Then, the similarity between two images 𝐱i\mathbf{x}_{i} and 𝐱j\mathbf{x}_{j} in the Gram matrix can be computed by

K𝒮(𝐱i,𝐱j)=f𝒮(𝐱i),f𝒮(𝐱j)=𝐱𝒮i𝐱𝒮j;\displaystyle K_{\mathcal{S}}(\mathbf{x}_{i},\mathbf{x}_{j})=\langle f_{\mathcal{S}}(\mathbf{x}_{i}),f_{\mathcal{S}}(\mathbf{x}_{j})\rangle=\mathbf{x}^{i\top}_{\mathcal{S}}\mathbf{x}^{j}_{\mathcal{S}};
K𝒯(𝐱i,𝐱j)=f𝒯(𝐱i),f𝒯(𝐱j)=𝐱𝒯i𝐱𝒯j.\displaystyle\quad K_{\mathcal{T}}(\mathbf{x}_{i},\mathbf{x}_{j})=\langle f_{\mathcal{T}}(\mathbf{x}_{i}),f_{\mathcal{T}}(\mathbf{x}_{j})\rangle=\mathbf{x}^{i\top}_{\mathcal{T}}\mathbf{x}^{j}_{\mathcal{T}}.

Note that other complicated similarity functions can be applied between f(𝐱i)f(\mathbf{x}_{i}) and f(𝐱j)f(\mathbf{x}_{j}) as in existing methods [10, 13, 14, 20].

Let K𝒮K_{\mathcal{S}} and K𝒯K_{\mathcal{T}} denote the n×nn\times n Gram matrices from the student and teacher networks, respectively, where nn is the total number of images. We aim to transfer the full Gram matrix from the teacher model to the student model. The corresponding loss for knowledge distillation with Gram matrix transfer can be written as

(3.1) 𝒮,𝒯=K𝒮K𝒯F=X𝒮X𝒮X𝒯X𝒯F\displaystyle\ell_{\mathcal{S},\mathcal{T}}=\|K_{\mathcal{S}}-K_{\mathcal{T}}\|_{F}=\|X_{\mathcal{S}}^{\top}X_{\mathcal{S}}-X_{\mathcal{T}}^{\top}X_{\mathcal{T}}\|_{F}

where X𝒮d𝒮×nX_{\mathcal{S}}\in\mathbb{R}^{d_{\mathcal{S}}\times n} and X𝒯d𝒯×nX_{\mathcal{T}}\in\mathbb{R}^{d_{\mathcal{T}}\times n} denote the representations of the entire data set output from the corresponding layers of the student and teacher networks, respectively.
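As a concrete illustration of Eqn. 3.1, the PyTorch-style sketch below computes the naive full-matrix transfer loss. It is our own illustration (the variable names feat_s and feat_t are not from the paper) and assumes the features of all n examples are available at once, which is exactly what becomes impractical for large n.

    import torch

    def full_gram_transfer_loss(feat_s, feat_t):
        # feat_s: (n, d_s) student features (rows of X_S^T); feat_t: (n, d_t) teacher features
        k_s = feat_s @ feat_s.t()  # (n, n) student Gram matrix K_S
        k_t = feat_t @ feat_t.t()  # (n, n) teacher Gram matrix K_T
        return torch.linalg.norm(k_s - k_t)  # Frobenius norm of the difference

    # toy example: n = 8 examples, d_s = 4, d_t = 6
    loss = full_gram_transfer_loss(torch.randn(8, 4), torch.randn(8, 6))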

Minimizing this loss directly is intractable due to the large size of the Gram matrix, especially in the conventional training pipeline of deep learning, where only a mini-batch of examples is accessible in each iteration. A straightforward way is to optimize only the random pairs in each mini-batch as in [10, 13, 14, 20]. However, it can be sub-optimal and result in slow optimization of the transfer. Hence, we consider decomposing the Gram matrix and optimizing a low-rank approximation in lieu of the full Gram matrix.

3.1 Nyström Approximation

The Nyström method is widely applied to approximate Gram matrices [5, 9, 21, 24]. We briefly review it in this subsection.

Given an n×nn\times n Gram matrix KK, we can first randomly shuffle its columns (and rows accordingly) and rewrite it as

K=[WK12K12K22]K=\left[\begin{array}[]{cc}W&K_{12}^{\top}\\ K_{12}&K_{22}\end{array}\right]

where Wm×mW\in\mathbb{R}^{m\times m}. Then, a good approximation for KK can be obtained as K~=CW+C\widetilde{K}=CW^{+}C^{\top}, where C=[WK12]n×mC=\left[\begin{array}[]{c}W\\ K_{12}\end{array}\right]\in\mathbb{R}^{n\times m} and W+W^{+} denotes the pseudo inverse of WW [21].

Let KkK_{k} denote the best top-kk approximation of the Gram matrix KK, where kmk\leq m. The rank-kk approximation derived by the Nyström method can be computed as K~k=CWk+C\widetilde{K}_{k}=CW_{k}^{+}C^{\top}, where WkW_{k} denotes the best top-kk approximation of WW and Wk+W_{k}^{+} is the corresponding pseudo inverse. The quality of the approximation is characterized by the following theorem.

Theorem 1

[9] Let K~k\widetilde{K}_{k} denote the rank-kk Nyström approximation with mm columns that are sampled uniformly at random without replacement from KK. We have KK~kFKKkF+ε;ε=𝒪(1/m1/4)KF\|K-\widetilde{K}_{k}\|_{F}\leq\|K-K_{k}\|_{F}+\varepsilon;\quad\varepsilon=\mathcal{O}(1/m^{1/4})\|K\|_{F}.

The examples corresponding to the selected columns in CC are referred to as landmark points. Theorem 1 shows that the approximation is applicable even with randomly sampled landmark points.
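The construction above can be sketched in a few lines of PyTorch. The snippet below is only an illustrative implementation (ours, not the authors' code): landmark columns are sampled uniformly at random, and we take k=m so that the rank-k truncation of W is omitted.

    import torch

    def nystrom_approximation(K, m):
        # K: (n, n) Gram matrix; m: number of landmark columns to sample
        n = K.shape[0]
        idx = torch.randperm(n)[:m]                  # uniformly sampled landmark indices
        C = K[:, idx]                                # (n, m) similarities to landmark points
        W = K[idx][:, idx]                           # (m, m) block among landmark points
        return C @ torch.linalg.pinv(W) @ C.t()      # K_tilde = C W^+ C^T

    X = torch.randn(16, 200)                         # d x n features of n = 200 examples
    K = X.t() @ X
    K_tilde = nystrom_approximation(K, m=20)
    rel_err = torch.linalg.norm(K - K_tilde) / torch.linalg.norm(K)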

3.2 Gram Matrix Transfer

With the low-rank approximation, the Gram matrices from a certain layer of the student and teacher networks can be approximated as

K~𝒮k=C𝒮Wk𝒮+C𝒮;K~𝒯k=C𝒯Wk𝒯+C𝒯\widetilde{K}_{\mathcal{S}}^{k}=C_{\mathcal{S}}W_{k\mathcal{S}}^{+}C_{\mathcal{S}}^{\top};\quad\widetilde{K}_{\mathcal{T}}^{k}=C_{\mathcal{T}}W_{k\mathcal{T}}^{+}C_{\mathcal{T}}^{\top}

Compared with the original Gram matrix, the partial matrix CC has significantly fewer entries. Let D𝒮d𝒮×mD_{\mathcal{S}}\in\mathbb{R}^{d_{\mathcal{S}}\times m} and D𝒯d𝒯×mD_{\mathcal{T}}\in\mathbb{R}^{d_{\mathcal{T}}\times m} denote the landmark points for the student and teacher Gram matrices, respectively; then we have

C𝒮=X𝒮D𝒮,C𝒯=X𝒯D𝒯;C_{\mathcal{S}}=X_{\mathcal{S}}^{\top}D_{\mathcal{S}},\ C_{\mathcal{T}}=X_{\mathcal{T}}^{\top}D_{\mathcal{T}};
W𝒮=D𝒮D𝒮,W𝒯=D𝒯D𝒯.W_{\mathcal{S}}=D_{\mathcal{S}}^{\top}D_{\mathcal{S}},\ W_{\mathcal{T}}=D_{\mathcal{T}}^{\top}D_{\mathcal{T}}.

With Theorem 1, it is easy to show that transferring the compact matrix upper-bounds the difference between the original Gram matrices of the teacher and the student.

Corollary 1

With the Nyström approximation, we can bound the loss in Eqn. 3.1 by

K𝒮K𝒯F2ε+C𝒮Wk𝒮+C𝒮C𝒯Wk𝒯+C𝒯F\|K_{\mathcal{S}}-K_{\mathcal{T}}\|_{F}\leq 2\varepsilon+\|C_{\mathcal{S}}W_{k\mathcal{S}}^{+}C_{\mathcal{S}}^{\top}-C_{\mathcal{T}}W_{k\mathcal{T}}^{+}C_{\mathcal{T}}^{\top}\|_{F}

In Corollary 1, the partial Gram matrix is regularized with the pseudo inverse of WW. Computing the pseudo inverse is expensive, and it can introduce additional noise when the feature space of the student is unstable in the early stage of training. Besides, it still optimizes the difference between two n×nn\times n matrices.

By selecting appropriate landmark points, we can bound the original loss in Eqn. 3.1 solely with C𝒮C_{\mathcal{S}} and C𝒯C_{\mathcal{T}}, as in the following theorem.

Theorem 2

Assuming that C𝒮C_{\mathcal{S}} and C𝒯C_{\mathcal{T}} are bounded by a constant cc as C𝒮F,C𝒯Fc\|C_{\mathcal{S}}\|_{F},\|C_{\mathcal{T}}\|_{F}\leq c and the smallest eigenvalues in W𝒮W_{\mathcal{S}} and W𝒯W_{\mathcal{T}} are larger than 11, we can bound the loss in Eqn. 3.1 by

K𝒮K𝒯F2ε+𝒪(C𝒮C𝒯F)\|K_{\mathcal{S}}-K_{\mathcal{T}}\|_{F}\leq 2\varepsilon+\mathcal{O}(\|C_{\mathcal{S}}-C_{\mathcal{T}}\|_{F})
  • Proof.
    K𝒮K𝒯F=K𝒮K~𝒮k(K𝒯K~𝒯k)+K~𝒮kK~𝒯kF\displaystyle\|K_{\mathcal{S}}-K_{\mathcal{T}}\|_{F}=\|K_{\mathcal{S}}-\tilde{K}_{\mathcal{S}}^{k}-(K_{\mathcal{T}}-\tilde{K}_{\mathcal{T}}^{k})+\tilde{K}_{\mathcal{S}}^{k}-\tilde{K}_{\mathcal{T}}^{k}\|_{F}
    K𝒮K~𝒮kF+(K𝒯K~𝒯k)F+K~𝒮kK~𝒯kF\displaystyle\leq\|K_{\mathcal{S}}-\tilde{K}_{\mathcal{S}}^{k}\|_{F}+\|(K_{\mathcal{T}}-\tilde{K}_{\mathcal{T}}^{k})\|_{F}+\|\tilde{K}_{\mathcal{S}}^{k}-\tilde{K}_{\mathcal{T}}^{k}\|_{F}
    2ε+C𝒮Wk𝒮+C𝒮C𝒯Wk𝒯+C𝒯F\displaystyle\leq 2\varepsilon+\|C_{\mathcal{S}}W_{k\mathcal{S}}^{+}C_{\mathcal{S}}^{\top}-C_{\mathcal{T}}W_{k\mathcal{T}}^{+}C_{\mathcal{T}}^{\top}\|_{F}
    2ε+cWk𝒮+C𝒮Wk𝒯+C𝒯F+c(Wk𝒯+FC𝒮C𝒯F)\displaystyle\leq 2\varepsilon+c\|W_{k\mathcal{S}}^{+}C_{\mathcal{S}}^{\top}-W_{k\mathcal{T}}^{+}C_{\mathcal{T}}^{\top}\|_{F}+c(\|W_{k\mathcal{T}}^{+}\|_{F}\|C_{\mathcal{S}}-C_{\mathcal{T}}\|_{F})
    2ε+c2Wk𝒮+Wk𝒯+F+2c2C𝒮C𝒯F\displaystyle\leq 2\varepsilon+c^{2}\|W_{k\mathcal{S}}^{+}-W_{k\mathcal{T}}^{+}\|_{F}+2c^{2}\|C_{\mathcal{S}}-C_{\mathcal{T}}\|_{F}

    Then, we want to show that Wk𝒮+Wk𝒯+FC𝒮C𝒯F\|W_{k\mathcal{S}}^{+}-W_{k\mathcal{T}}^{+}\|_{F}\leq\|C_{\mathcal{S}}-C_{\mathcal{T}}\|_{F}. Note that Wk𝒮Wk𝒯FC𝒮C𝒯F\|W_{k\mathcal{S}}-W_{k\mathcal{T}}\|_{F}\leq\|C_{\mathcal{S}}-C_{\mathcal{T}}\|_{F} due to the fact that WkW_{k} is a partial matrix from CC. We can prove Wk𝒮+Wk𝒯+FWk𝒮Wk𝒯F\|W_{k\mathcal{S}}^{+}-W_{k\mathcal{T}}^{+}\|_{F}\leq\|W_{k\mathcal{S}}-W_{k\mathcal{T}}\|_{F} instead. Let

    Wk𝒮=j=1kαjμjμj;Wk𝒯=j=1kβjνjνjW_{k\mathcal{S}}=\sum_{j=1}^{k}\alpha_{j}\mu_{j}\mu_{j}^{\top};\quad W_{k\mathcal{T}}=\sum_{j=1}^{k}\beta_{j}\nu_{j}\nu_{j}^{\top}

    where α1αk>1\alpha_{1}\geq\cdots\geq\alpha_{k}>1 and β1βk>1\beta_{1}\geq\cdots\geq\beta_{k}>1, and Mk×kM\in\mathbb{R}^{k\times k}, where Mi,j=μiμi,νjνjM_{i,j}=\langle\mu_{i}\mu_{i}^{\top},\nu_{j}\nu_{j}^{\top}\rangle. We have

    Wk𝒮Wk𝒯F2=jαj2+jβj22s,tαsβtMs,t\|W_{k\mathcal{S}}-W_{k\mathcal{T}}\|_{F}^{2}=\sum_{j}\alpha_{j}^{2}+\sum_{j}\beta_{j}^{2}-2\sum_{s,t}\alpha_{s}\beta_{t}M_{s,t}
    Wk𝒮+Wk𝒯+F2=j1αj2+j1βj22s,t1αsβtMs,t\|W_{k\mathcal{S}}^{+}-W_{k\mathcal{T}}^{+}\|_{F}^{2}=\sum_{j}\frac{1}{\alpha_{j}^{2}}+\sum_{j}\frac{1}{\beta_{j}^{2}}-2\sum_{s,t}\frac{1}{\alpha_{s}\beta_{t}}M_{s,t}

    So

    Wk𝒮+Wk𝒯+F2Wk𝒮Wk𝒯F2\displaystyle\|W_{k\mathcal{S}}^{+}-W_{k\mathcal{T}}^{+}\|_{F}^{2}-\|W_{k\mathcal{S}}-W_{k\mathcal{T}}\|_{F}^{2}
    \displaystyle=\sum_{j}\frac{1}{\alpha_{j}^{2}}+\sum_{j}\frac{1}{\beta_{j}^{2}}-\sum_{j}\alpha_{j}^{2}-\sum_{j}\beta_{j}^{2}
    +2s,t(αsβt1αsβt)Ms,t\displaystyle+2\sum_{s,t}(\alpha_{s}\beta_{t}-\frac{1}{\alpha_{s}\beta_{t}})M_{s,t}

    According to the definition of MM, we have

    i,jkMi,j=μi(jνjνj)μi1;j,ikMi,j1\forall i,\quad\sum_{j}^{k}M_{i,j}=\mu_{i}(\sum_{j}\nu_{j}\nu_{j}^{\top})\mu_{i}^{\top}\leq 1;\quad\forall j,\quad\sum_{i}^{k}M_{i,j}\leq 1

    Since MM is a doubly sub-stochastic matrix (each row sum and each column sum is at most 11), we can show that

    s,t(αsβt1αsβt)Ms,ts(αsβs1αsβs)\sum_{s,t}(\alpha_{s}\beta_{t}-\frac{1}{\alpha_{s}\beta_{t}})M_{s,t}\leq\sum_{s}(\alpha_{s}\beta_{s}-\frac{1}{\alpha_{s}\beta_{s}})

    It can be proved by contradiction. Suppose the optimal solution MM^{*} achieves a larger result than the R.H.S.; then MM^{*} contains non-zero off-diagonal elements. Let j>ij>i denote the column index of the first such element (i.e., Mi,jM_{i,j}), and let k>ik>i denote the corresponding row index of a non-zero element in column ii (i.e., Mk,iM_{k,i}). Let Ai,j=αiβj1αiβjA_{i,j}=\alpha_{i}\beta_{j}-\frac{1}{\alpha_{i}\beta_{j}} and we have

    Ai,j+Ak,iAi,iAk,j\displaystyle A_{i,j}+A_{k,i}-A_{i,i}-A_{k,j}
    =αiβj1αiβj+αkβi1αkβi\displaystyle=\alpha_{i}\beta_{j}-\frac{1}{\alpha_{i}\beta_{j}}+\alpha_{k}\beta_{i}-\frac{1}{\alpha_{k}\beta_{i}}
    (αiβi1αiβi)(αkβj1αkβj)\displaystyle-(\alpha_{i}\beta_{i}-\frac{1}{\alpha_{i}\beta_{i}})-(\alpha_{k}\beta_{j}-\frac{1}{\alpha_{k}\beta_{j}})
    =(αiαk)(βiβj)+(1αi1αk)(1βi1βj)\displaystyle=-(\alpha_{i}-\alpha_{k})(\beta_{i}-\beta_{j})+(\frac{1}{\alpha_{i}}-\frac{1}{\alpha_{k}})(\frac{1}{\beta_{i}}-\frac{1}{\beta_{j}})
    =(αiαk)(βiβj)+(αiαk)(βiβj)αiαkβiβj<0\displaystyle=-(\alpha_{i}-\alpha_{k})(\beta_{i}-\beta_{j})+\frac{(\alpha_{i}-\alpha_{k})(\beta_{i}-\beta_{j})}{\alpha_{i}\alpha_{k}\beta_{i}\beta_{j}}<0

    This shows that reassigning the mass to the diagonal elements achieves a larger result than the assumed optimal assignment, which contradicts the assumption.

    With the optimal results from the assignment of MM, we obtain that

    Wk𝒮+Wk𝒯+F2Wk𝒮Wk𝒯F2jk(1αj1βj)2(αjβj)2\|W_{k\mathcal{S}}^{+}-W_{k\mathcal{T}}^{+}\|_{F}^{2}-\|W_{k\mathcal{S}}-W_{k\mathcal{T}}\|_{F}^{2}\leq\sum_{j}^{k}(\frac{1}{\alpha_{j}}-\frac{1}{\beta_{j}})^{2}-(\alpha_{j}-\beta_{j})^{2}

    For each term, it is easy to show that

    j,(1αj1βj)2(αjβj)20\forall j,\quad(\frac{1}{\alpha_{j}}-\frac{1}{\beta_{j}})^{2}-(\alpha_{j}-\beta_{j})^{2}\leq 0

    Therefore,

    Wk𝒮+Wk𝒯+FWk𝒮Wk𝒯FC𝒮C𝒯F\|W_{k\mathcal{S}}^{+}-W_{k\mathcal{T}}^{+}\|_{F}\leq\|W_{k\mathcal{S}}-W_{k\mathcal{T}}\|_{F}\leq\|C_{\mathcal{S}}-C_{\mathcal{T}}\|_{F}

        

Theorem 2 illustrates that minimizing the difference between the partial Gram matrices of the teacher and the student defined by landmark points can transfer the original Gram matrix from the teacher model effectively. Note that the partial matrices have the size of n×mn\times m, where mm is the number of landmark points. When mnm\ll n, the landmark points D𝒮D_{\mathcal{S}} and D𝒯D_{\mathcal{T}} can be kept in GPU memory as parameters of the loss function for optimization, which is much more efficient than transferring the original full Gram matrix. In addition, the simplified bound relies on the eigenvalues of the landmark matrices being sufficiently large; therefore, an appropriate selection of landmark points is crucial for efficient transfer. A per-mini-batch sketch of forming the partial matrices is given below, and we elaborate our landmark selection strategy in the next subsection.
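As a sketch of how the partial matrices could be handled in practice (our own illustration, not the authors' released code), the landmark points are kept as fixed tensors on the GPU, so that each mini-batch only needs to form an r×m slice of the partial matrices.

    import torch

    def partial_gram_difference(batch_feat_s, batch_feat_t, d_s, d_t):
        # batch_feat_s: (r, dim_s), batch_feat_t: (r, dim_t) mini-batch features
        # d_s: (m, dim_s), d_t: (m, dim_t) landmark points kept in GPU memory
        c_s = batch_feat_s @ d_s.t()   # (r, m) rows of C_S for this mini-batch
        c_t = batch_feat_t @ d_t.t()   # (r, m) rows of C_T for this mini-batch
        return torch.linalg.norm(c_s - c_t)

Accumulated over the mini-batches of one epoch, this covers the full n×m matrices without ever materializing an n×n Gram matrix.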

3.3 Landmark Selection

We consider a strategy that assigns a single landmark point to each class, which means m=Lm=L for an LL-class classification problem. The desired landmark points should capture the distribution of examples well while keeping sufficient diversity among themselves so that the eigenvalues of WW remain large. With the setting that each class has one landmark point, we can derive the selection criterion as follows.

Theorem 3

Given an arbitrary pair (i,j)(i,j), let 𝐝𝒮i\mathbf{d}_{\mathcal{S}}^{i} and 𝐝𝒯i\mathbf{d}_{\mathcal{T}}^{i} denote the corresponding landmark points for the ii-th example in the spaces of the student and teacher networks, respectively. Assuming the norms of 𝐱𝒮i,𝐱𝒯i,𝐱𝒮j,𝐱𝒯j\mathbf{x}^{i}_{\mathcal{S}},\mathbf{x}^{i}_{\mathcal{T}},\mathbf{x}^{j}_{\mathcal{S}},\mathbf{x}^{j}_{\mathcal{T}} are bounded by ee, we have

(3.2) K𝒮K𝒯Fnei𝐱𝒮i𝐝𝒮i+nei𝐱𝒯i𝐝𝒯iA\displaystyle\|K_{\mathcal{S}}-K_{\mathcal{T}}\|_{F}\leq\underbrace{ne\sum_{i}\|\mathbf{x}^{i\top}_{\mathcal{S}}-\mathbf{d}^{i\top}_{\mathcal{S}}\|+ne\sum_{i}\|\mathbf{x}^{i\top}_{\mathcal{T}}-\mathbf{d}^{i\top}_{\mathcal{T}}\|}_{A}
+i,j𝐝𝒮i𝐱𝒮j𝐝𝒯i𝐱𝒯jB\displaystyle+\underbrace{\sum_{i,j}\|\mathbf{d}^{i\top}_{\mathcal{S}}\mathbf{x}^{j}_{\mathcal{S}}-\mathbf{d}^{i\top}_{\mathcal{T}}\mathbf{x}^{j}_{\mathcal{T}}\|}_{B}
  • Proof.

    For an arbitrary pair, we have

    K𝒮(𝐱i,𝐱j)K𝒯(𝐱i,𝐱j)=𝐱𝒮i𝐱𝒮j𝐱𝒯i𝐱𝒯j\displaystyle\|K_{\mathcal{S}}(\mathbf{x}_{i},\mathbf{x}_{j})-K_{\mathcal{T}}(\mathbf{x}_{i},\mathbf{x}_{j})\|=\|\mathbf{x}^{i\top}_{\mathcal{S}}\mathbf{x}^{j}_{\mathcal{S}}-\mathbf{x}^{i\top}_{\mathcal{T}}\mathbf{x}^{j}_{\mathcal{T}}\|
    =(𝐱𝒮i𝐝𝒮i+𝐝𝒮i)𝐱𝒮j(𝐱𝒯i𝐝𝒯i+𝐝𝒯i)𝐱𝒯j\displaystyle=\|(\mathbf{x}^{i\top}_{\mathcal{S}}-\mathbf{d}^{i\top}_{\mathcal{S}}+\mathbf{d}^{i\top}_{\mathcal{S}})\mathbf{x}^{j}_{\mathcal{S}}-(\mathbf{x}^{i\top}_{\mathcal{T}}-\mathbf{d}^{i\top}_{\mathcal{T}}+\mathbf{d}^{i\top}_{\mathcal{T}})\mathbf{x}^{j}_{\mathcal{T}}\|
    (𝐱𝒮i𝐝𝒮i)𝐱𝒮j+(𝐱𝒯i𝐝𝒯i)x𝒯j\displaystyle\leq\|(\mathbf{x}^{i\top}_{\mathcal{S}}-\mathbf{d}^{i\top}_{\mathcal{S}})\mathbf{x}^{j}_{\mathcal{S}}\|+\|(\mathbf{x}^{i\top}_{\mathcal{T}}-\mathbf{d}^{i\top}_{\mathcal{T}})x^{j}_{\mathcal{T}}\|
    +𝐝𝒮i𝐱𝒮j𝐝𝒯i𝐱𝒯j\displaystyle+\|\mathbf{d}^{i\top}_{\mathcal{S}}\mathbf{x}^{j}_{\mathcal{S}}-\mathbf{d}^{i\top}_{\mathcal{T}}\mathbf{x}^{j}_{\mathcal{T}}\|
    e𝐱𝒮i𝐝𝒮i+e𝐱𝒯i𝐝𝒯i+𝐝𝒮i𝐱𝒮j𝐝𝒯i𝐱𝒯j\displaystyle\leq e\|\mathbf{x}^{i\top}_{\mathcal{S}}-\mathbf{d}^{i\top}_{\mathcal{S}}\|+e\|\mathbf{x}^{i\top}_{\mathcal{T}}-\mathbf{d}^{i\top}_{\mathcal{T}}\|+\|\mathbf{d}^{i\top}_{\mathcal{S}}\mathbf{x}^{j}_{\mathcal{S}}-\mathbf{d}^{i\top}_{\mathcal{T}}\mathbf{x}^{j}_{\mathcal{T}}\|

    We obtain the result by accumulating over all pairs.        

According to Theorem 3, we can find that the transfer loss comes from two aspects. Term AA in Eqn. 3.2 contains the distance from each example to its corresponding landmark point. Since the corresponding landmark point for 𝐱𝒮i\mathbf{x}_{\mathcal{S}}^{i} can be obtained by argmin1lL{𝐱𝒮i𝐝𝒮l}\arg\min_{1\leq l\leq L}\{\|\mathbf{x}^{i\top}_{\mathcal{S}}-\mathbf{d}^{l\top}_{\mathcal{S}}\|\}, we can rewrite the problem of minimizing the original term as

miniminl{𝐱𝒮i𝐝𝒮l}\min\sum_{i}\min_{l}\{\|\mathbf{x}^{i\top}_{\mathcal{S}}-\mathbf{d}^{l\top}_{\mathcal{S}}\|\}

Apparently, this objective is a standard clustering problem. Unlike the conventional Nyström method, which is often in an unsupervised learning setting, we can access the label information in knowledge distillation. When we set the number of clusters to be the number of classes, the landmark point becomes the center in each class and can be computed by averaging examples within the class as

(3.3) 𝐝𝒮l=1nli:yi=l𝐱𝒮i;𝐝𝒯l=1nli:yi=l𝐱𝒯i\displaystyle\mathbf{d}_{\mathcal{S}}^{l}=\frac{1}{n_{l}}\sum_{i:y_{i}=l}\mathbf{x}_{\mathcal{S}}^{i};\quad\mathbf{d}_{\mathcal{T}}^{l}=\frac{1}{n_{l}}\sum_{i:y_{i}=l}\mathbf{x}_{\mathcal{T}}^{i}

where yiy_{i} is the class label of the ii-th example and nln_{l} denotes the number of examples in the ll-th class. In addition, class centers preserve the diversity of the original examples, which leads to the large eigenvalues required by Theorem 2. We verify this assumption in the experiments, and sketch the computation of the class centers below.
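The class centers in Eqn. 3.3 can be computed with a simple per-class average. The sketch below is our own illustration; it assumes the features of the training set have been accumulated (e.g., during the previous epoch) and that every class appears at least once.

    import torch

    def class_centers(features, labels, num_classes):
        # features: (n, d) features from one network; labels: (n,) integer class labels
        d = features.shape[1]
        centers = torch.zeros(num_classes, d, device=features.device)
        counts = torch.zeros(num_classes, device=features.device)
        centers.index_add_(0, labels, features)                                   # per-class feature sums
        counts.index_add_(0, labels, torch.ones(labels.shape[0], device=features.device))
        return centers / counts.unsqueeze(1)                                      # (L, d) class-mean landmarks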

Term BB in Eqn. 3.2, in fact, captures the difference between the student and teacher Gram matrices defined by the landmark points, which is consistent with Theorem 2.

With landmark points D𝒮D_{\mathcal{S}} and D𝒯D_{\mathcal{T}} obtained by minimizing term AA, we can formulate the Knowledge Distillation problem by transferring the Approximated Gram matrix (KDA) as

(3.4) minX𝒮KDA(X𝒮D𝒮X𝒯D𝒯)\displaystyle\min_{X_{\mathcal{S}}}\ell_{\mathrm{KDA}}(X_{\mathcal{S}}^{\top}D_{\mathcal{S}}-X_{\mathcal{T}}^{\top}D_{\mathcal{T}})

where KDA(X𝒮D𝒮X𝒯D𝒯)=i=1,l=1i=n,l=L(𝐝𝒮l𝐱𝒮i𝐝𝒯l𝐱𝒯i)\ell_{\mathrm{KDA}}(X_{\mathcal{S}}^{\top}D_{\mathcal{S}}-X_{\mathcal{T}}^{\top}D_{\mathcal{T}})=\sum_{i=1,l=1}^{i=n,l=L}\ell(\mathbf{d}^{l\top}_{\mathcal{S}}\mathbf{x}^{i}_{\mathcal{S}}-\mathbf{d}^{l\top}_{\mathcal{T}}\mathbf{x}^{i}_{\mathcal{T}}) and we adopt ()\ell(\cdot) as the smoothed L1L_{1} loss for the stable optimization as

\ell(z)=\left\{\begin{array}[]{cc}|z|-0.5&|z|>1\\ 0.5z^{2}&o.w.\end{array}\right.

With the KDA loss, we propose a novel knowledge distillation algorithm that works in an alternating manner. In each epoch, we first compute the landmark points from the features of examples accumulated during the last epoch by Eqn. 3.3. Then, the KDA loss defined by the fixed landmark points in Eqn. 3.4 is optimized along with a standard cross-entropy loss for the student. The proposed algorithm is summarized in Alg. 1, and a condensed implementation sketch is given after it. Since at least one epoch must be spent on collecting features for computing the landmark points, we minimize the KDA loss only after HH epochs of training, where H1H\geq 1.

Algorithm 1 Knowledge Distillation by Approximated Kernel Matrix Transfer (KDA)
  Input: Data set {𝐱i}\{\mathbf{x}_{i}\}, a student model 𝒮\mathcal{S}, a teacher model 𝒯\mathcal{T}, total epochs TT, warm-up epochs HH
  Initialize {𝐝𝒮0}=\{\mathbf{d}_{\mathcal{S}}^{0}\}=\emptyset, {𝐝𝒯0}=\{\mathbf{d}_{\mathcal{T}}^{0}\}=\emptyset
  for t=1t=1 to HH do
     Optimize 𝒮\mathcal{S} without KDA loss
     Compute {𝐝𝒮t}\{\mathbf{d}_{\mathcal{S}}^{t}\}, {𝐝𝒯t}\{\mathbf{d}_{\mathcal{T}}^{t}\} as in Eqn. 3.3
  end for
  for t=H+1t=H+1 to TT do
     Optimize 𝒮\mathcal{S} with KDA loss defined on {𝐝𝒮t1}\{\mathbf{d}_{\mathcal{S}}^{t-1}\}, {𝐝𝒯t1}\{\mathbf{d}_{\mathcal{T}}^{t-1}\}
     Compute {𝐝𝒮t}\{\mathbf{d}_{\mathcal{S}}^{t}\}, {𝐝𝒯t}\{\mathbf{d}_{\mathcal{T}}^{t}\} as in Eqn. 3.3
  end for
  Output: 𝒮\mathcal{S}
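A condensed PyTorch sketch of Alg. 1 follows. It is our own approximation of the procedure under stated assumptions: both networks are assumed to return (logits, features) for the chosen layer, class_centers refers to the helper sketched in Section 3.3, and lambda_kda is a hypothetical weight balancing the KDA term against the cross-entropy loss.

    import torch
    import torch.nn.functional as F

    def kda_loss(feat_s, feat_t, centers_s, centers_t):
        # Eqn. 3.4: smoothed L1 loss applied elementwise to X_S^T D_S - X_T^T D_T
        return F.smooth_l1_loss(feat_s @ centers_s.t(), feat_t @ centers_t.t())

    def train_kda(student, teacher, loader, optimizer, T=120, H=5, num_classes=100, lambda_kda=1.0):
        centers_s = centers_t = None
        for epoch in range(1, T + 1):
            feats_s, feats_t, labels = [], [], []
            for x, y in loader:
                logits_s, feat_s = student(x)              # assumed (logits, features) interface
                with torch.no_grad():
                    _, feat_t = teacher(x)
                loss = F.cross_entropy(logits_s, y)
                if epoch > H:                              # add the KDA loss after warm-up epochs
                    loss = loss + lambda_kda * kda_loss(feat_s, feat_t, centers_s, centers_t)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
                feats_s.append(feat_s.detach())
                feats_t.append(feat_t)
                labels.append(y)
            # update landmark points (class centers, Eqn. 3.3) with this epoch's features
            centers_s = class_centers(torch.cat(feats_s), torch.cat(labels), num_classes)
            centers_t = class_centers(torch.cat(feats_t), torch.cat(labels), num_classes)
        return student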

3.4 Connection to Conventional KD

In the conventional KD method [7], only the outputs of the last layer of the teacher model are adopted for the student. With an appropriately chosen temperature, [7] shows that the KD loss can be written as

(X𝒮)=i𝐱𝒮i𝐱𝒯i2\ell(X_{\mathcal{S}})=\sum_{i}\|\mathbf{x}_{\mathcal{S}}^{i}-\mathbf{x}_{\mathcal{T}}^{i}\|^{2}

where 𝐱𝒮i,𝐱𝒯iC\mathbf{x}_{\mathcal{S}}^{i},\mathbf{x}_{\mathcal{T}}^{i}\in\mathbb{R}^{C} denote the logits before the SoftMax operator. With the identity matrix II, the equivalent formulation is

(X𝒮)=X𝒮IX𝒯IF2\ell(X_{\mathcal{S}})=\|X_{\mathcal{S}}^{\top}I-X_{\mathcal{T}}^{\top}I\|_{F}^{2}

Compared to the KDA loss in Eqn. 3.4, conventional KD can be considered as applying one-hot label vectors as landmark points to transfer the Gram matrix of the teacher network. However, it lacks the constraint on the similarity between each example and its corresponding landmark point illustrated in Theorem 3, which may degrade the performance of knowledge distillation.
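To make the connection concrete, the following toy sketch (ours, with random tensors in place of real logits and centers) contrasts the two choices of landmark points: the identity matrix used implicitly by conventional KD versus class centers used by KDA.

    import torch

    num_classes = 100
    logits_s = torch.randn(8, num_classes)     # student logits for a mini-batch
    logits_t = torch.randn(8, num_classes)     # teacher logits

    # Conventional KD: landmark points are one-hot vectors, i.e., D = I on the logits
    eye = torch.eye(num_classes)
    kd_term = torch.linalg.norm(logits_s @ eye - logits_t @ eye) ** 2

    # KDA view: landmark points are class centers in the same space (random here, for illustration)
    centers_s = torch.randn(num_classes, num_classes)
    centers_t = torch.randn(num_classes, num_classes)
    kda_term = torch.linalg.norm(logits_s @ centers_s.t() - logits_t @ centers_t.t())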

4 Experiments

We conduct experiments on two benchmark data sets to illustrate the effectiveness of the proposed KDA algorithm. We employ ResNet-34 (denoted as R34) [6] as the teacher network. ResNet-18 (denoted as R18), ResNet-18-0.5 (denoted as R18-0.5) and ShuffleNetV2 (denoted as SN) [11] are adopted as student networks, where ResNet-18-0.5 denotes ResNet-18 with half the number of channels. We apply standard stochastic gradient descent (SGD) with momentum to train the networks. Specifically, we set the mini-batch size to 256256, the momentum to 0.90.9 and the weight decay to 5e5e-44 in all experiments. The student models are trained for 120120 epochs. The initial learning rate is 0.10.1 with cosine decay, and H=5H=5 epochs are used for warm-up. All experiments are implemented on two V100 GPUs.

Three baseline knowledge distillation methods are included in the comparison

  • KD [7]: a conventional knowledge distillation method that constrains the KL-divergence between the output label distributions of the student and teacher networks.

  • AT [23]: a method that transfers the information from intermediate layers to accelerate the training of the student network.

  • RKD [13]: a recent representative work that regularizes the similarity matrices within a mini-batch between the student and teacher networks. It adopts the same similarity function as in our method. Although different ways were used to calculate the similarity (e.g., L2 in [10, 13], normalized inner product in [20] and RBF kernel in [14]), they provide similar performance.

Every algorithm minimizes a combined loss consisting of the distillation loss and the standard cross-entropy loss for classification. For RKD, we transfer the features before the last fully-connected (FC) layer for comparison. Note that AT transfers the attention map of the teacher, so we adopt the feature before the last pooling layer for distillation. Besides, we let “Baseline” denote the method that trains the student without information from the teacher. Our method is referred to as “KDA”. We search the best parameters for all methods in the comparison and keep the same parameters across different experiments.

4.1 Ablation Study

We perform the ablation study on CIFAR-100 [8]. This data set contains 100100 classes, where each class has 500500 images for training and 100100 for test. Each image is a color image of size 32×3232\times 32.

In this subsection, we set ResNet-34 as the teacher and ResNet-18 as the student. During the training, each image is first padded to be 40×4040\times 40, and then we randomly crop a 32×3232\times 32 image from it. Besides, random horizontal flipping is also adopted for data augmentation.

4.1.1 Effect of Landmark Points

First, we evaluate the strategy for generating landmark points. As illustrated in Corollary 1, the randomly selected landmark points can achieve a good performance. So we compare the KDA with class centers to that with random landmark points in Fig. 1. In this experiment, we adopt the features before the last FC layer for transfer.

(a) Overall Comparison  (b) Zoom In

Figure 1: Comparison of landmark points selection.

From Fig. 1, we can observe that with landmark points, the two variants of KDA perform significantly better than the baseline. It demonstrates that the Gram matrix is informative for training student models and that transferring the full matrix from the teacher helps improve the performance of the student. In addition, KDA with class centers as the landmark points shows the best performance among the different methods, which confirms the criterion suggested in Theorem 3. We use class centers as landmark points in the remaining experiments.

4.1.2 Effect of Full Matrix Transfer

Then, we compare the difference between the Gram matrices of the teacher and its student models. The performance of transferring a Gram matrix is measured by K𝒮K𝒯F/K𝒯F\|K_{\mathcal{S}}-K_{\mathcal{T}}\|_{F}/\|K_{\mathcal{T}}\|_{F}, which calculates the fraction of information that has not been transferred. We investigate features from two layers in the comparison: the one before the last FC layer and the one after the FC layer. The transfer performance of the different layers is illustrated in Fig. 2 (a) and (b), respectively.
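Although the ground-truth quantity involves an n×n matrix, it can be evaluated without materializing it, since \|X_{\mathcal{S}}^{\top}X_{\mathcal{S}}-X_{\mathcal{T}}^{\top}X_{\mathcal{T}}\|_{F}^{2}=\|X_{\mathcal{S}}X_{\mathcal{S}}^{\top}\|_{F}^{2}-2\|X_{\mathcal{T}}X_{\mathcal{S}}^{\top}\|_{F}^{2}+\|X_{\mathcal{T}}X_{\mathcal{T}}^{\top}\|_{F}^{2} only involves matrices of size d×d (or d_T×d_S). The sketch below uses this standard identity; the paper does not specify how the metric was computed, so this is only one possible implementation.

    import torch

    def relative_transfer_loss(feat_s, feat_t):
        # feat_s: (n, d_s) = X_S^T, feat_t: (n, d_t) = X_T^T for the same n examples
        g_ss = feat_s.t() @ feat_s            # (d_s, d_s) = X_S X_S^T
        g_tt = feat_t.t() @ feat_t            # (d_t, d_t) = X_T X_T^T
        g_ts = feat_t.t() @ feat_s            # (d_t, d_s) = X_T X_S^T
        diff_sq = (g_ss ** 2).sum() - 2 * (g_ts ** 2).sum() + (g_tt ** 2).sum()
        return torch.sqrt(diff_sq.clamp(min=0)) / torch.sqrt((g_tt ** 2).sum())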

(a) Layer before FC  (b) Layer after FC

Figure 2: Comparison of kernel transfer performance measured by K𝒮K𝒯F/K𝒯F\|K_{\mathcal{S}}-K_{\mathcal{T}}\|_{F}/\|K_{\mathcal{T}}\|_{F} (lower is better).

Fig. 2 (a) compares the performance of kernel transfer. First, it is obvious that both RKD and KDA are better than the baseline, which indicates that minimizing the difference between Gram matrices can effectively transfer appropriate information from the teacher. Second, RKD transfers only the similarity matrix defined by the examples in a mini-batch and shows a larger transfer loss than KDA. Considering the massive number of pairs, optimizing over all of them in RKD is intractable: the number of pairs can be up to 𝒪(n2)\mathcal{O}(n^{2}), while the number of pairs in a mini-batch is only 𝒪(r2)\mathcal{O}(r^{2}), where rr is the size of a mini-batch, so visiting all pairs just once requires at least 𝒪(n/r)\mathcal{O}(n/r) epochs.

On the contrary, the loss from KDA is only about 23%23\% of that from RKD. KDA optimizes the partial Gram matrix with landmark points, and the total number of pairs is linear in the number of original examples. Due to the small number of landmark points, the partial matrix CC is much more compact than the original one. For example, there are 50,00050,000 examples in CIFAR-100; when applying 100100 landmark points for distillation, the partial matrix contains only 0.2%0.2\% of the entries of the original one. Besides, since we keep the class centers in memory as the parameters of the loss function, the full Gram matrix can be approximated within a single epoch. Therefore, SGD can optimize the KDA loss sufficiently.

Then, we compare the performance of transfer after the last FC layer, as shown in Fig. 2 (b). For KDA, we compute the Gram matrix with the features before the SoftMax operator. From the comparison, we can observe that both KD and KDA have much smaller transfer loss than the baseline. As illustrated in the discussion of “Connection to Conventional KD”, conventional KD is equivalent to transferring the partial Gram matrix with one-hot landmark points; therefore, it can reduce the difference between teacher and student effectively. However, the landmark points adopted by KD fail to satisfy the property illustrated in Theorem 3. By adopting class centers as landmark points, KDA further reduces the transfer loss from 0.230.23 for KD to 0.20.2, which confirms the effectiveness of transferring the full Gram matrix with appropriate landmark points.

Figure 3: C𝒮C𝒯F/K𝒯F\|C_{\mathcal{S}}-C_{\mathcal{T}}\|_{F}/\|K_{\mathcal{T}}\|_{F} (i.e., our estimation) vs. K𝒮K𝒯F/K𝒯F\|K_{\mathcal{S}}-K_{\mathcal{T}}\|_{F}/\|K_{\mathcal{T}}\|_{F} (i.e., ground-truth).

Finally, we demonstrate that the difference between the partial Gram matrices is closely correlated with that between the full Gram matrices, as suggested in Theorem 2. Features from the layer before the last FC layer are adopted for evaluation. Fig. 3 illustrates how the ground-truth transfer loss K𝒮K𝒯F/K𝒯F\|K_{\mathcal{S}}-K_{\mathcal{T}}\|_{F}/\|K_{\mathcal{T}}\|_{F} and the estimated transfer loss C𝒮C𝒯F/K𝒯F\|C_{\mathcal{S}}-C_{\mathcal{T}}\|_{F}/\|K_{\mathcal{T}}\|_{F} change during training (values are scaled by 100×100\times for better visualization). Evidently, minimizing C𝒮C𝒯F\|C_{\mathcal{S}}-C_{\mathcal{T}}\|_{F} can reduce the gap between the original Gram matrices effectively, which is consistent with our theoretical analysis. Note that Theorem 2 assumes that the smallest eigenvalues of W𝒮W_{\mathcal{S}} and W𝒯W_{\mathcal{T}} are larger than 11. Since we adopt class centers as landmark points, W𝒮W_{\mathcal{S}} and W𝒯W_{\mathcal{T}} can be full-rank matrices. We show their smallest eigenvalues in Fig. 4; the smallest eigenvalue is larger than 10, which satisfies our assumption.

Figure 4: Smallest eigenvalues of W𝒮W_{\mathcal{S}} (i.e., student) and W𝒯W_{\mathcal{T}} (i.e., teacher).

4.1.3 Effect of Matrix WW

Corollary 1 implies a variant that uses the standard Nyström method, including the matrix WW, for transfer, while our proposal ignores WW and optimizes only the difference between C𝒮C_{\mathcal{S}} and C𝒯C_{\mathcal{T}} for efficiency. We compare our method to the variant with WW in Fig. 5, where features before the last FC layer are adopted for transfer. During the experiment, we find that using W𝒯+12W_{\mathcal{T}}^{+\frac{1}{2}} provides better performance. We can observe that our method without WW performs similarly to the variant with W𝒯+12W_{\mathcal{T}}^{+\frac{1}{2}}, which further supports our analysis in Theorem 2.

Figure 5: Comparison of our method and the method with matrix WW as in Corollary 1

4.1.4 Effect of #\#Centers per Class

When assigning landmark points, we set their number to be the number of classes, which avoids clustering in the implementation and constrains each class to have a single landmark point. The number of landmark points per class can be easily increased by clustering. We compare the variant with two centers per class in Fig. 6, where features before the last FC layer are adopted for comparison.

Figure 6: Comparison of different number of landmark points (i.e., centers) for each class.

We can observe that more centers for each class do not improve the performance significantly. This may be because the feature space is optimized with the cross-entropy loss. As illustrated in [15], the cross-entropy loss pushes all examples from the same class towards a single center. Therefore, assigning one landmark point for each class is an appropriate setting, which also simplifies the algorithm and improves the efficiency. We use one center per class in the following experiments.

Table 1: Comparison of accuracy (%\%) on CIFAR-100 when transferring the full Gram matrix from different layers (S is the student without KD and T is the teacher).
S | conv3_x | conv4_x | Before FC | After FC | T
77.2 | 77.7 | 78.1 | 79.6 | 79.4 | 80.3
Table 2: Comparison of accuracy (%\%) on CIFAR-100 (S is student without KD and T is teacher).
T | S | S (w/o KD) | Before Last FC: AT | RKD | KDA | After FC: KD | KDA | Combo: KDA | T
R34 | R18 | 77.2 | 78.1±\pm0.3 | 78.3±\pm0.2 | 79.6±\pm0.1 | 78.8±\pm0.2 | 79.4±\pm0.1 | 79.7±\pm0.1 | 80.3
R34 | R18-0.5 | 73.5 | 75.0±\pm0.1 | 74.3±\pm0.3 | 75.6±\pm0.3 | 74.8±\pm0.2 | 75.3±\pm0.2 | 75.9±\pm0.2 | 80.3
R34 | SN | 71.7 | 73.0±\pm0.1 | 72.5±\pm0.1 | 74.0±\pm0.3 | 72.9±\pm0.1 | 73.6±\pm0.1 | 74.2±\pm0.3 | 80.3
Table 3: Comparison of accuracy (%\%) on Tiny-ImageNet (S is student without KD and T is teacher).
T | S | S (w/o KD) | Before Last FC: AT | RKD | KDA | After FC: KD | KDA | Combo: KDA | T
R34 | R18 | 63.4 | 64.4±\pm0.1 | 63.9±\pm0.2 | 65.2±\pm0.3 | 64.9±\pm0.2 | 65.4±\pm0.1 | 65.5±\pm0.1 | 66.6
R34 | R18-0.5 | 60.3 | 61.0±\pm0.1 | 60.6±\pm0.2 | 61.7±\pm0.3 | 61.3±\pm0.1 | 61.9±\pm0.2 | 62.2±\pm0.2 | 66.6
R34 | SN | 60.6 | 61.3±\pm0.1 | 61.2±\pm0.2 | 62.0±\pm0.2 | 61.5±\pm0.1 | 62.3±\pm0.2 | 62.4±\pm0.1 | 66.6

4.1.5 Effect of Different Layers

Now, we illustrate the performance of transferring the Gram matrix from different layers. ResNet consists of 55 convolutional layer groups, and we compare the performance of the last 33 groups (i.e., “conv3_x”, “conv4_x” and “conv5_x”) and the layer after the last FC layer. The definition of the groups can be found in [6]. For each group, we adopt its last layer for transfer. Before transfer, we add a pooling layer to reduce the dimension of the feature map. Note that after pooling, the last layer of “conv5_x” becomes the layer before the last FC layer.

Table 1 shows the performance of transferring information from different layers. First, transferring information from the teacher always improves the performance of the student, which demonstrates the effectiveness of knowledge distillation. Besides, the information from later layers is more helpful for training the student. This is because later layers contain more semantic information that is closely related to the target task. We focus on the layers before and after the FC layer in the remaining experiments.

4.2 CIFAR-100

In this subsection, we compare the proposed KDA method to baseline methods on CIFAR-100. The results of different methods can be found in Table 2. All experiments are repeated 33 times and the average results with standard deviation are reported.

First, knowledge distillation methods outperform training the student network without a teacher, which shows that knowledge distillation can improve the performance of student models significantly. By transferring the full Gram matrix before the last FC layer, KDA surpasses RKD by 1.3%1.3\% when ResNet-34 is the teacher and ResNet-18 is the student. The observation is consistent with the comparison in the ablation study, which confirms that K𝒮K𝒯F/K𝒯F\|K_{\mathcal{S}}-K_{\mathcal{T}}\|_{F}/\|K_{\mathcal{T}}\|_{F} is an appropriate metric to evaluate the transfer loss of the Gram matrix. Moreover, KDA shows a significant improvement on different student networks, which further demonstrates the applicability of the proposed method.

Furthermore, when transferring the Gram matrix after the last FC layer, both KD and KDA demonstrate good performance compared with the baseline student model. This is because these methods transfer the Gram matrix with landmark points, which is efficient for optimization. Besides, KDA further improves the performance compared to KD. The superior performance of KDA demonstrates the effectiveness of the proposed strategy for generating appropriate landmark points.

Finally, compared to the baseline methods, KDA can distill the information from different layers with the same formulation in Eqn. 3.4. The proposed method provides a systematic perspective for understanding the family of knowledge distillation methods that aim to transfer the Gram matrix. When transferring the Gram matrices before and after the FC layer simultaneously, the performance of KDA can be slightly improved, as illustrated by “Combo” in Table 2.

4.3 Tiny-ImageNet

Then, we compare different methods on the Tiny-ImageNet data set (https://tiny-imagenet.herokuapp.com). There are 200200 classes in this data set, and each class provides 500500 images for training and 5050 for validation. We report the performance on the validation set. Since the images in Tiny-ImageNet are of size 64×6464\times 64, which is larger than those in CIFAR-100, we replace the random crop augmentation with a more aggressive version as in [6], and keep the other settings the same.

Table 3 summarizes the comparison. We can observe similar results to those on CIFAR-100. First, all methods using the information from a teacher model surpass the student model trained without a teacher by a large margin. Second, KDA outperforms the other methods regardless of the layer where the transfer happens. Note that CIFAR-100 and Tiny-ImageNet contain very different images, which demonstrates the applicability of the proposed algorithm in different real-world applications.

5 Conclusions

In this work, we investigate the knowledge distillation problem from the perspective of kernel matrix transfer. Considering that the number of entries in the full kernel matrix is quadratic in the number of training examples, we extend the Nyström method and propose a strategy to obtain the landmark points for efficient transfer. The proposed method not only improves the efficiency of transferring the kernel matrix, but also has a theoretical guarantee for its efficacy. Experiments on benchmark data sets verify the effectiveness of the proposed algorithm.

Besides the similarity function applied in this work, there are many other complicated functions adopted by other methods. Combining the proposed algorithm with different similarity functions for distillation can be our future work.

References

  • [1] S. Ahn, S. X. Hu, A. Damianou, N. D. Lawrence, and Z. Dai, Variational information distillation for knowledge transfer, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 9163–9171.
  • [2] G. Chen, W. Choi, X. Yu, T. X. Han, and M. Chandraker, Learning efficient object detection models with knowledge distillation, in Advances in Neural Information Processing Systems 30, 2017, pp. 742–751.
  • [3] Y. Chen, N. Wang, and Z. Zhang, Darkrank: Accelerating deep metric learning via cross sample similarities transfer, in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32, 2018.
  • [4] M. Courbariaux, Y. Bengio, and J. David, Binaryconnect: Training deep neural networks with binary weights during propagations, in Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, eds., 2015, pp. 3123–3131.
  • [5] P. Drineas, M. W. Mahoney, and N. Cristianini, On the nyström method for approximating a gram matrix for improved kernel-based learning., JMLR, 6 (2005).
  • [6] K. He, X. Zhang, S. Ren, and J. Sun, Deep residual learning for image recognition, in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
  • [7] G. Hinton, O. Vinyals, and J. Dean, Distilling the knowledge in a neural network, arXiv preprint arXiv:1503.02531, (2015).
  • [8] A. Krizhevsky, G. Hinton, et al., Learning multiple layers of features from tiny images, (2009).
  • [9] S. Kumar, M. Mohri, and A. Talwalkar, Sampling methods for the nyström method, JMLR, 13 (2012), pp. 981–1006.
  • [10] Y. Liu, J. Cao, B. Li, C. Yuan, W. Hu, Y. Li, and Y. Duan, Knowledge distillation via instance relationship graph, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 7096–7104.
  • [11] N. Ma, X. Zhang, H.-T. Zheng, and J. Sun, Shufflenet v2: Practical guidelines for efficient cnn architecture design, in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 116–131.
  • [12] S. I. Mirzadeh, M. Farajtabar, A. Li, N. Levine, A. Matsukawa, and H. Ghasemzadeh, Improved knowledge distillation via teacher assistant, in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, 2020, pp. 5191–5198.
  • [13] W. Park, D. Kim, Y. Lu, and M. Cho, Relational knowledge distillation, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 3967–3976.
  • [14] B. Peng, X. Jin, J. Liu, D. Li, Y. Wu, Y. Liu, S. Zhou, and Z. Zhang, Correlation congruence for knowledge distillation, in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 5007–5016.
  • [15] Q. Qian, L. Shang, B. Sun, J. Hu, H. Li, and R. Jin, Softtriple loss: Deep metric learning without triplet sampling, in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 6450–6458.
  • [16] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio, Fitnets: Hints for thin deep nets, in Proceedings of the 3rd International Conference on Learning Representations, 2015.
  • [17] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, Mobilenetv2: Inverted residuals and linear bottlenecks, in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 4510–4520.
  • [18] S. Srinivas and F. Fleuret, Knowledge transfer with jacobian matching, in International Conference on Machine Learning, PMLR, 2018, pp. 4723–4731.
  • [19] Y. Tian, D. Krishnan, and P. Isola, Contrastive representation distillation, in Proceedings of the 8th International Conference on Learning Representations, 2020.
  • [20] F. Tung and G. Mori, Similarity-preserving knowledge distillation, in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 1365–1374.
  • [21] C. K. I. Williams and M. W. Seeger, Using the nyström method to speed up kernel machines, in Advances in Neural Information Processing Systems 13, 2000, pp. 682–688.
  • [22] J. Yim, D. Joo, J. Bae, and J. Kim, A gift from knowledge distillation: Fast optimization, network minimization and transfer learning, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4133–4141.
  • [23] S. Zagoruyko and N. Komodakis, Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer, in Proceedings of the 5th International Conference on Learning Representations, 2017.
  • [24] K. Zhang, I. W. Tsang, and J. T. Kwok, Improved nyström low-rank approximation and error analysis, in Proceedings of the 25th international conference on Machine learning, 2008, pp. 1232–1239.