Precise Knowledge Transfer via Flow Matching

Shitong Shao Zhiqiang Shen Linrui Gong Huanran Chen Xu Dai

Abstract

In this paper, we propose a novel knowledge transfer framework that introduces continuous normalizing flows for progressive knowledge transformation and leverages multi-step sampling strategies to achieve precision knowledge transfer. We name this framework Knowledge Transfer with Flow Matching (FM-KT), which can be integrated with a metric-based distillation method with any form (e.g. vanilla KD, DKD, PKD and DIST) and a meta-encoder with any available architecture (e.g. CNN, MLP and Transformer). By introducing stochastic interpolants, FM-KD is readily amenable to arbitrary noise schedules (e.g., VP-ODE, VE-ODE, Rectified flow) for normalized flow path estimation. We theoretically demonstrate that the training objective of FM-KT is equivalent to minimizing the upper bound of the teacher feature map or logit negative log-likelihood. Besides, FM-KT can be viewed as a unique implicit ensemble method that leads to performance gains. By slightly modifying the FM-KT framework, FM-KT can also be transformed into an online distillation framework OFM-KT with desirable performance gains. Through extensive experiments on CIFAR-100, ImageNet-1k, and MS-COCO datasets, we empirically validate the scalability and state-of-the-art performance of our proposed methods among relevant comparison approaches.

Machine Learning, ICML

1 Introduction

Despite the remarkable achievements of deep neural networks, the dramatic increase in the number of parameters in recent years prevents their application to real-world scenarios. To solve this problem, knowledge distillation (Hinton et al., 2015) has been introduced for model compression in order to deploy lightweight models with desirable performance on mobile devices. One critical component of knowledge distillation is knowledge transfer, which aims to transfer knowledge from the teacher to the student, ensuring efficient student performance during runtime. The vast majority of existing research focuses on enhancing the capability of knowledge transfer, including how to design effective and efficient meta-encoders to transform the output (i.e. feature or logit) of the student in the high-dimensional space to match the corresponding output of the teacher (Chuanguang et al., 2021; Meng et al., 2022; Huang et al., 2022), and designing metric-based distillation methods to reduce the gap between the knowledge of the teacher and the knowledge of the student (Tao et al., 2022; Zhao et al., 2022).

Refer to caption — Figure 1: A highly scalable knowledge transfer framework FM-KT.

Motivation. The significant discrepancy in feature/logit distributions between the teacher and the student adversely impacts distillation performance, a challenge not fully mitigated by even well-designed meta-encoders and metric-based distillation methods (Gou et al., 2021). This difficulty stems from the inherent challenges of transferring complex features or logits between the teacher and the student in a single step, which often compromises reliability and precision. A promising strategy involves segmenting the distribution gap into multiple sub-parts and sequentially matching these distributions to facilitate gradual and accurate knowledge transfer (Huang et al., 2023; Yao et al., 2024). Thus, a question worth exploring is, “How to implement multi-step sampling to facilitate a progressive transformation, thereby achieving more effective and precise knowledge transfer?”

In this work, we take apart the features/logits of the teacher and the student as empirical distributions in an attempt to answer this question. From this perspective, diffusion models (Ho et al., 2020; Song et al., 2023c) and continuous normalized flows (CNFs) (Lipman et al., 2022; Albergo & Vanden-Eijnden, 2022) are suitable for implementing progressive transformation. Among them, diffusion models were employed by the prior study DiffKD (Huang et al., 2023) to transition from the student feature/logit distribution to the teacher counterpart. However, the inherent characteristics of diffusion models necessitate that one end of the distribution trajectory adheres to a Gaussian distribution. Consequently, this process requires transforming the student feature/logit distribution into a Gaussian approximation before it can be converted into the teacher feature/logit distribution during reverse sampling. Moreover, the complexity of this approach prevents the full exploitation of its potential in progressive transformation. In contrast, CNFs can directly map these two empirical distributions without accessing uncorrelated Gaussian distributions. This allows for the progressive transformation to reshape the student’s feature/logit density to the teacher’s with more fine-grained information.

Unfortunately, directly implementing this approach faces a critical challenge: popular flow matching (Liu et al., 2022; Lipman et al., 2022; Albergo & Vanden-Eijnden, 2022), akin to the evidence lower bound (ELBO) in diffusion models (Ho et al., 2020; Song et al., 2023a, c), tends to an unreliable alignment with a high probability space of the teacher’s knowledge. This approach inadvertently informs the student and meta-encoder about the target information the teacher’s knowledge during the training phase, which can be considered as “cheating”. This approach substantially undermines the generalization ability of the student.

To address this issue and realize precise knowledge transfer, we propose a novel framework, Knowledge Transfer with Flow Matching (FM-KT), to amend incorrect single output from the student through multi-step sampling. To be specific, we design a serial training paradigm with theoretical guarantees to avoid “cheating” in knowledge distillation, finally yielding a reliable meta-encoder for multi-step sampling during inference. It is worth mentioning that by changing the noise schedule (Kingma et al., 2021) in CNFs, FM-KT can consistently model various probabilistic flows, such as VP ODE (Liu et al., 2022; Song et al., 2023c), VE ODE (Liu et al., 2022; Song et al., 2023c), and Rectified flow (Liu et al., 2022; Lipman et al., 2022).

FM-KT is a versatile training paradigm for knowledge transfer with high scalability. As depicted in Fig. 1, FM-KT is comprised of a meta-encoder with any available architecture and a metric-based distillation method with any form, enabling both feature-based and logit-based distillation, and consequently enhancing the generalization ability of the student. Most importantly, it can be theoretically interpreted as an implicit ensemble algorithm when the noise schedule is set as Rectified flow. Notably, we propose a variant of FM-KT called FM-KT^Θ, which avoids additional computational overhead during inference. By introducing a metric function between the predicted velocity at each time point and the numerical solution derived from the final discrete sampling, FM-KT can be transformed into an online distillation algorithm OFM-KT. Our experiments, both qualitative and quantitative, demonstrate that our proposed methods enhance performance across both image classification and object detection tasks.

2 Preliminaries

Review the Knowledge Transfer.

Knowledge transfer plays an important role in knowledge distillation (Hinton et al., 2015), which aims to transfer the teacher’s knowledge to the student, thus enhancing the performance of the student. In classical knowledge distillation algorithms, a common and simple approach (Zagoruyko & Komodakis, 2016a; Ahn et al., 2019; Tung & Mori, 2019; Xu et al., 2020; Tian et al., 2019; Cao et al., 2022; Zhao et al., 2022; Tao et al., 2022) is to align the student’s feature/logit $X^{S}$ with the teacher’s feature/logit $X^{T}$ using two encoders: $g^{S}(\cdot)$ and $g^{T}(\cdot)$ . This can be expressed as $\min L(g^{S}(X^{S}),g^{T}(X^{T}))$ , where $L(\cdot,\cdot)$ refers to the distance metric function with any form. In some cases, $g^{S}(\cdot)$ and $g^{T}(\cdot)$ can be reduced to the identity function, making the supervision a direct matching between $X^{S}$ and $X^{T}$ . This is widely employed in logit-based distillation.

Recently, DiffKD and KDiffusion were introduced by Huang et al. (2023) and Yao et al. (2024), which employ a diffusion model (i.e. meta-encoder) $g_{v_{\theta}}(\cdot)$ to replace the original $g^{S}(\cdot)$ for more effective progressive transformation, which significantly enhances the generalization ability of the student. However, DiffKD does not immediately effectuate the transition from $X^{S}$ to $X^{T}$ . Instead, it first transforms $X^{S}$ into Gaussian noises and subsequently translates this noise to $X^{T}$ . This dual-stage transformation process might potentially be overly complex, substantially increasing the inference cost and hindering its widespread use. KDiffusion complicates its noise schedule, and the supervision of task information is sparse (i.e. $\mathcal{L}_{\textrm{guided}}$ in the original paper), leading to its relatively poor performance in ImageNet-1k.

Continuous Normalized Flows.

Given the couple $(X^{S},X^{T})\in(\mathbb{R}^{d},\mathbb{R}^{d})$ sampling from two empirical distributions $(\pi_{0},\pi_{1})$ , the time-dependent probability density function can be denoted as $\rho_{t}(Z):\mathbb{R}^{d}\times[0,1]\rightarrow\mathbb{R}^{d}$ , which satisfies $\rho_{0}(X^{T})=\rho_{1}(X^{S})+\int_{\rho_{1}(Z)}^{\rho_{0}(Z)}\partial\rho_{t}(Z)$ . Continuous normalized flows (CNFs) (Lipman et al., 2022) optimize $g_{v_{\theta}}(\cdot)$ by solving a flow matching problem:

\displaystyle\operatorname*{arg\,min}_{v_{\theta}}

\displaystyle\int_{0}^{1}\mathbb{E}[||\partial\rho_{t}(Z)/\partial t-g_{v_{\theta}}(Z_{t},t)||]dt.

(1)

In inference, the reverse sampling process can be achieved by solving the ordinary differential equation (ODE) $\frac{d\hat{Z}_{t}}{dt}=-g^{*}_{v_{\theta}}(\hat{Z}_{t},t)$ through the numerical integration with an initial condition $\hat{Z}_{1}\sim\pi_{1}$ and the optimized meta-encoder $g^{*}_{v_{\theta}}(\cdot)$ , which ultimately yields the synthesized data $\hat{Z}_{0}$ that is expected to satisfy $\hat{Z}_{0}\sim\pi_{0}$ . In this context, the trajectory $\{Z_{t}\}_{t}$ remains indeterminate in the absence of additional constraints, a factor that often leads to the collapse of the student during training. Drawing inspiration from diffusion models, which have demonstrated exceptional efficacy in image synthesis, the incorporation of prior forward processes (i.e., fixed probability flow paths) is equally essential and advantageous. Therefore, we utilize a noise schedule-like definition (Lu et al., 2022; Kingma et al., 2021) to model stochastic interpolants (Albergo & Vanden-Eijnden, 2022) ( $\alpha_{t}$ , $\sigma_{t}\in\mathbb{R}^{+}$ are differentiable functions of $t$ ):

\displaystyle Z_{t}=\alpha_{t}X^{S}+\sigma_{t}X^{T},\ s.t.\ \lim_{t\rightarrow 0}\alpha_{t}=0,\lim_{t\rightarrow 0}\sigma_{t}=1,\lim_{t\rightarrow 1}\sigma_{t}=0.

(2)

Note that withdrawing constraint $\lim_{t\rightarrow 1}\alpha_{t}=1$ is used to ensure that the VE ODE can be unified within our definition. This conversion is convenient since it only needs to modify the initial condition $Z_{1}$ from $X^{S}$ to $\alpha_{1}X^{S}$ .

Noise Schedules.

Different noise schedules affect the effectiveness of knowledge transfer. In this work, we consider three well-known noise schedules, namely VP ODE (Song et al., 2023c), VE ODE (Song et al., 2023c), and Rectified flow (Liu et al., 2022), to analyze which noise schedule is most beneficial for knowledge transfer. Among them, VP ODE and VE ODE are derived from VP SDE and VE SDE, respectively (Ho et al., 2020; Song & Ermon, 2019; Song et al., 2023c). Review these three noise schedules, which can be defined as follows:

VP ODE: $\alpha_{t}=\textrm{exp}(-\frac{1}{4}a(1-t)^{2}-\frac{1}{2}b(1-t))$ ; $\sigma_{t}=\sqrt{1-\alpha_{t}^{2}}$ , s.t. $\mathit{a=19.9}$ , $\mathit{b=0.1}$ .
VE ODE: $\alpha_{t}=a(\frac{b}{a})^{t}$ ; $\sigma_{t}=1$ , s.t. $\mathit{a=0.02}$ , $\mathit{b=100}$ .
Rectified flow: $\alpha_{t}=t$ ; $\sigma_{t}=1-t$ .

3 Methodology

In this section, we first present FM-KT, a novel method for multi-step sampling designed for precise knowledge transfer. Subsequently, a theoretical analysis of the reliability and effectiveness of FM-KT is given. Finally, we further introduce two variants: a lightweight offline knowledge distillation method FM-KT^Θ and an online knowledge distillation method OFM-KT.

3.1 Serial Training Paradigm

The significant challenge in implementing CNFs is the risk of “cheating”. This risk can hinder the distilled student from acquiring meaningful representations. Our empirical observation indicates that if we directly introduce Rectified Flow, it only achieves an accuracy of 3.42% with the WRN-40-2-WRN-16-2 pair on CIFAR-100. Furthermore, from a theoretical perspective, the ELBO of CNFs can be written as $\log p_{v_{\theta}}(Z_{0})\geq-\mathbb{E}_{q(Z_{1/N:1}|Z_{0})}[\log\frac{q(Z_{1}|Z_{0})}{p_{v_{\theta}}(Z_{1})p_{v_{\theta}}(Z_{0}|Z_{1/N})}+\sum_{i=1}^{N}\log\frac{q(Z_{(i-1)/N}|Z_{i/N},Z_{0})}{p_{v_{\theta}}(Z_{(i-1)/N}|Z_{i/N})}]$ . The condition $Z_{i/N}$ of $p_{v_{\theta}}(\cdot|Z_{i/N})$ in this equation can be obtained by stochastic interpolants, which necessarily incorporate the target information $X^{T}$ to be learnt, ultimately causing the student to fall into trivial solutions when training converges.

This requires us to modify the training paradigm of the original CNFs according to the properties of the knowledge transfer scenario. As illustrated in Fig. 2, we propose a serial training paradigm within FM-KT to address “cheating”, which can be denoted as (see Appendix O for derivation)

		$\displaystyle\mathcal{L}_{\textrm{FM-KT}}\!=\!\mathbb{E}[\frac{1}{N}\sum_{i=0}^{N-1}\!L(\mathcal{T}((\nabla_{t}\alpha_{t}Z_{1}\!\!-\!\!g_{v_{\theta}}(Z_{1\!-\!i/N},1-i/N))/\!-\!\nabla_{t}\sigma_{t})$		(3)
		$\displaystyle,X^{T})+\underbrace{L(\mathcal{T}((\nabla_{t}\alpha_{t}Z_{1}-g_{v_{\theta}}(Z_{1-i/N},1-i/N))/-\nabla_{t}\sigma_{t}),Y)}_{\textrm{match the ground truth label (optional)}}],$
		$\displaystyle\textrm{where}\quad Z_{1-i/N}=Z_{1-(i-1)/N}$
		$\displaystyle\quad\quad-g_{v_{\theta}}(Z_{1-(i-1)/N},1-(i-1)/N)/N,\quad s.t.\quad i\geq 1.$

The expectation in Eq. 3 is taken with respect to $(X^{S},X^{T},Y)$ , and $L(\cdot,\cdot)$ and $Y$ are the metric-based distillation method (i.e. the loss function) and the ground truth label, respectively. The initial state of the sampling is $Z_{1}=\alpha_{1}X^{S}$ . We define $N$ and $K$ as the number of sampling steps during training and inference, respectively. In our work, different values of $K$ are implemented using skip-step sampling of DDIM (Song et al., 2023a). The pseudo code of FM-KT can be found in Appendix A. It is guaranteed that $N$ does not exceed 8 (default $N$ as 8) in our experiments thereby avoiding a significant increase in computational cost. It is important to clarify that not only does the training of FM-KT need to be performed by serial, but the inference also relies on multi-step sampling. Intuitively, FM-KT is an interesting “time-for-accuracy” algorithm, which in some ways makes a trade-off between time cost and student performance even in inference.

We prove that minimizing $\mathcal{L}_{\textrm{FM-KT}}$ is closely equivalence to the minimization of the upper bound of the negative log-likelihood of $X^{T}$ as shown by the following Theorem 3.1. This theorem leads to efficient training and theoretically guarantees the rationality and practicality of FM-KT.

Theorem 3.1.

(Proof in Appendix B) Optimizing $\mathcal{L}_{\textrm{FM-KT}}$ not only avoids “cheating” by accessing $X^{T}$ during training, but also establishes an equivalence to the upper bound of the negative log-likelihood of $X^{T}$ .

3.2 Choice of Noise Schedule

Here, we examine two non-straight noise schedules, namely VP ODE and VE ODE, as well as one straight noise schedule referred to as Rectified flow to acquire the most desirable noise schedule in our study. Thus, we empirically conduct experiments with the WRN-40-2-WRN-16-2 pair on CIFAR-100 and present outcomes in Fig. 3. We discover that Rectified flow yields the best stability and effectiveness since the invariance of $\nabla_{t}\alpha_{t}$ and $\nabla_{t}\sigma_{t}$ at each time point ensures stability during training. Therefore, unless otherwise specified, Rectified flow is applied as the noise schedule by default in all our experiments.

3.3 Serve to Feature-/Logit-based Distillation

By simplistically integrating FM-KT into the standard distillation framework, it can serve to the majority of feature-/logit-based distillation algorithms. This introduction is straightforward; it involves replacing the loss function in FM-KT with suitable metric-based distillation approaches. Practically, FM-KT is strategically placed between different layers of the student to accomplish knowledge transfer. We give an example in Fig. 4. For feature-based distillation, FM-KT is inserted between the intermediate layers of the student, typically before the downsampling layer. This insertion does not alter the rest of the student architecture. Furthermore, for logit-based distillation, FM-KT replaces the original pooling layer, linear classification layer, or even the penultimate one or two layers (e.g. convolution, activation and normalization layers), to achieve logit-level matching. In our experiments, the unique replacement of the extra penultimate one or two layers is only used for the student on CIFAR-100 is MobileNetV2. Besides, as shown in Eq. 3, FM-KT can optionally add a new loss function by substituting the ground truth label for $X^{T}$ , enabling consistency with the classical logit-based distillation paradigm.

For complex distillation algorithms with learnable encoders, such as MasKD (Yang et al., 2022a), we can denote the entire algorithm as a loss function. Hence, it is plausible to replace $L(\cdot,\cdot)$ in FM-KT with these algorithms to enable “serve to feature-/logit-based distillation”. In this study, we focus on simple yet effective metric-based distillation methods, including vanilla KD, DKD, PKD, and DIST. The adaptation of more complex distillation algorithms is earmarked for future work, which will help further ascertain the robust applicability of FM-KT.

3.4 Approximate to Ensemble

We attribute the ability of FM-KT to efficiently realize knowledge transfer to its multi-step sampling enabled by numerical integration. In the training and inference phases, $N$ ( $\leq$ 8) is controlled to be small enough that FM-KT no longer satisfies the form of CNFs. However, in case the noise schedule is set as Rectified flow, Euler’s method can be rewritten as averaging multiple time-step outputs, which intuitively approximates an ensemble approach. For completeness, we provide in-depth theoretical support for this through the perspective of error analysis in Proposition 3.2.

Proposition 3.2.

(Proof in Appendix C) Assume the noise schedule is set as Rectified flow, FM-KT can be considered a unique implicit ensemble algorithm. The number of outputs used for ensemble is equivalent to the number of samplings.

As is well-known, some past methods (Lu et al., 2022; Song et al., 2023b) for error analysis in the sampling process of the diffusion model use absolute error bound, which achieves recursion and thus scaling of the accumulated error value. We discard the constraint on the absolute value and employ recursion and Taylor expansion in the derivation of Proposition 3.2. As a result, we obtain the interesting conclusion that the truncation error, which is supposed to be progressively scaled, makes the sampling process of the FM-KT a unique implicit ensemble approach in this proposition.

3.5 Lightweight FM-KT^Θ without Additional Inference Burden

The multi-step sampling of FM-KT introduces additional overhead during inference. To facilitate efficient deployment, we propose a streamlined variant FM-KT^Θ for logit-based distillation, which enhances the process by distilling $Z_{0}$ from FM-KT into the existing classification head (i.e. the original student’s classification head) $\mathcal{T}_{\textrm{vanilla}}(\cdot)$ , ensuring no extra inference cost. Essentially, this is a concept of progressive distillation, which enhances student performance by effectively reducing the gap between the teacher and the student. During training, we reformulate the loss function to accommodate this integration:

		$\displaystyle\mathcal{L}_{\textrm{FM-KT}^{\Theta}}=$		(4)
		$\displaystyle\mathbb{E}[L(\mathcal{T}_{\textrm{vanilla}}(X^{S}),\mathcal{T}(Z_{0}))+\alpha^{\Theta}L(\mathcal{T}_{\textrm{vanilla}}(X^{S}),Y)]+\mathcal{L}_{\textrm{FM-KT}},$		(4)

where the expectation is taken with respect to $(X^{S},X^{T},Y)$ , and $\alpha^{\Theta}$ refers to the balance weight. In inference, we can directly utilize $\mathcal{T}_{\textrm{vanilla}}(\cdot)$ to achieve prediction without going through $g_{v_{\theta}}(\cdot)$ and $\mathcal{T}(\cdot)$ to increase the sampling burden.

3.6 Translate to Online Knowledge Distillation

Numerous Online Knowledge Distillation (Online KD) algorithms essentially integrate the outputs of multiple branches, thus avoiding asynchronous updating of gradients and ultimately improving the generalization ability of the student. FM-KT and Online KD have different approaches but equally satisfactory results, which provides the feasibility for FM-KT to be converted to Online KD. In comparison to Offline Knowledge Distillation (Offline KD), Online KD doesn’t use an explicit teacher; instead, the teacher is represented by a weighted average of branches in the student. Similarly, we can achieve the goal “translate to Online KD” by simply replacing $X^{T}$ in Eq. 3 with the final result after sampling with Euler’s method. In detail, we first obtain the sampling result $Z_{0}$ by continuously calling Euler’s method $Z_{1-i/N}\!=\!Z_{1\!-\!(i\!-\!1)/N}\!-\!g_{v_{\theta}}(Z_{1\!-\!(i\!-\!1)/N},1\!-\!(i\!-\!1)/N)/N$ . Finally, we retain the portion of FM-KT that matches the ground truth label and add the Online KD loss to it

		$\displaystyle\mathcal{L}_{\textrm{OFM-KT}}=\mathbb{E}_{(X^{S},Y)}[\frac{1}{N}\sum_{i=0}^{N-1}$		(5)
		$\displaystyle\quad\underbrace{L(\mathcal{T}((\nabla_{t}\alpha_{t}Z_{1}-g_{v_{\theta}}(Z_{1-i/N},1-i/N))/-\nabla_{t}\sigma_{t}),Z_{0})}_{\textrm{the Online KD loss}}$
		$\displaystyle+\underbrace{L(\mathcal{T}((\nabla_{t}\alpha_{t}Z_{1}-g_{v_{\theta}}(Z_{1-i/N},1-i/N))/-\nabla_{t}\sigma_{t}),Y)}_{\textrm{match the ground truth label}}].$

The variant $\mathcal{L}_{\textrm{OFM-KT}}$ can be empirically understood as a novel Online KD algorithm OFM-KT. Compared with traditional Online KD algorithms including ONE (Lan et al., 2018), KDCL (Guo et al., 2020) and AHBF-OKD (Gong et al., 2023), OFM-KT has some unique characteristics, including the meta-encoder shares parameters at different time points, whereas traditional Online KD algorithms do not. Besides, the input of the meta-encoder in OFM-KT is different at different time points, and as $t\rightarrow 0$ , the input contains more target information. In contrast, the traditional Online KD has the same input for each branch. This means that OFM-KT achieves ensemble through various inputs instead of unshared parameters.

Teacher Student	Meta- Encoder	ResNet56 ResNet20	WRN-40-2 WRN-16-2	WRN-40-2 WRN-40-1	ResNet32 $\times$ 4 ResNet8 $\times$ 4	VGG13 VGG8	VGG13 MobileNetV2	WRN-40-2 ShuffleNetV1
Teacher	✗	73.24	75.61	75.61	79.42	74.64	75.61	75.61
Student	✗	69.06	73.26	71.98	72.50	70.36	64.60	70.50
ATKD	✗	70.55	74.08	72.77	73.44	71.43	59.40	72.73
SPKD	✗	69.67	73.83	72.43	72.94	72.68	66.30	74.52
CRD	✗	71.16	75.48	74.14	75.51	73.94	69.73	76.05
vanilla KD	✗	70.66	74.92	73.54	73.33	72.98	67.37	74.83
DKD	✗	71.97	76.24	74.81	76.32	74.68	69.73	76.70
DIST	✗	71.26	75.29	74.42	75.79	73.11	68.48	75.23
FM-KT^Θ	✗	72.20	75.98	74.99	76.52	74.82	69.90	77.19
DiffKD	✓	71.92	76.13	74.09	76.31	-	-	-
FM-KT ( $K$ =1)	✓	74.28	77.14	75.88	76.74	75.21	69.68	76.34
FM-KT ( $K$ =2)	✓	74.09	76.58	74.52	74.98	74.86	69.52	75.55
FM-KT ( $K$ =4)	✓	75.12	77.69	76.24	77.49	75.42	69.94	76.95
FM-KT ( $K$ =8)	✓	74.97	77.84	76.09	77.71	75.46	69.94	77.21

Table 1: Results of different Offline KD methods on CIFAR-100. Among them, ATKD, SPKD, CRD and DiffKD belong to feature-based distillation, while vanilla KD, DKD and DIST belong to logit-based distillation.

T-S Pair	Acc.	Tea.	Stu.	Vanilla KD	ReviewKD	DKD	DIST	FM-KT^Θ	DiffKD	KDiffusion	FM-KT ( $K$ =1)	FM-KT ( $K$ =2)	FM-KT ( $K$ =4)	FM-KT ( $K$ =8)
Meta-encoder	-	-	-	✗	✗	✗	✗	✗	✓	✓	✓	✓	✓	✓
R34-R18	Top-1	73.31	69.75	70.66	71.61	71.70	72.07	72.14	72.49	72.04	72.49	72.86	73.08	73.17
R34-R18	Top-5	91.42	89.08	89.88	90.51	90.41	90.42	90.44	90.71	90.53	90.83	91.00	91.12	91.18
R50-MBV1	Top-1	76.16	70.13	70.68	72.56	72.05	73.24	73.29	73.78	73.62	73.61	74.01	74.20	74.22
R50-MBV1	Top-5	92.86	89.49	90.30	91.00	91.05	91.12	91.15	91.48	91.82	91.36	91.71	91.84	91.81

Table 2: Results of different Offline KD methods on ImageNet-1k. “R34-R18” and “R50-MBV1” refer to “ResNet34-ResNet18 pair” and “ResNet50-MobileNetV1 pair”, respectively.

4 Experiments

We perform comparison and ablation experiments on CIFAR-100, ImageNet-1k and MS-COCO. The implementation details of FM-KT, FM-KT^Θ, and OFM-KT can be found in Appendix K, and the experimental result of MS-COCO can be found in Appendix G. Note that all normalization layers in the meta-encoder are not BatchNorm, because their inputs are various at different time points, so the statistics of the mean and variance will encounter difficulties, thereby causing training collapse. Moreover, we introduce a strategy named pair decoupling (PD), which is controlled by the hyperparameter dirac ratio $\beta_{d}$ , applied to shuffle part of the sample pairs in a batch. This approach is particularly effective for feature-based distillation in image classification tasks, and its detailed description and specific implementation can be found in Appendix D and A, respectively.

Architecture	Meta-encoder	ResNet32	ResNet110	VGG16	DenseNet40-2	MobileNetV2
Student	✗	71.28	76.21	74.32	71.03	59.79
CL	✓	72.33	78.83	74.33	71.45	60.63
ONE	✓	72.45	78.44	74.38	71.39	60.84
FFSD-C	✓	74.50	78.83	74.89	71.74	61.88
ABHF-OKD	✓	74.81	79.04	75.08	72.12	62.23
OFM-KT ( $K$ =1)	✓	72.86	79.49	75.07	73.12	63.62
OFM-KT ( $K$ =2)	✓	73.02	79.50	75.10	73.34	63.67
OFM-KT ( $K$ =4)	✓	73.10	79.45	75.09	73.40	63.63
OFM-KT ( $K$ =8)	✓	73.07	79.47	75.06	73.39	63.61

Table 3: Results of different Online KD methods on CIFAR-100. The metric is the Top-1 accuracy.

Architecture	Student	ONE	OKDDip	FFSD-C	ABHF-OKD	OFM-KT ( $K$ =1)	OFM-KT ( $K$ =2)	OFM-KT ( $K$ =4)	OFM-KT ( $K$ =8)
Meta-encoder	✗	✓	✓	✓	✓	✓	✓	✓	✓
ResNet18	69.75	70.55	70.63	70.15	70.72	71.38	71.52	71.56	71.56
ResNet34	73.24	74.10	74.40	74.20	74.53	74.16	74.20	74.20	74.20

Table 4: Results of different Online KD methods on ImageNet-1k. The metric is the Top-1 accuracy.

The impact of the normalization layer selection in the meta-encoder, where stages used for distillation in the feature-based scenario, and ideal configuration of dirac ratio $\beta_{d}$ can be found in the additional ablation experiments in Appendix E. By default, we set $\beta_{d}$ as 0.25 and use the 1st and 2nd last stages for feature-based distillation in image classification tasks.

Crucially, detailed results about the stronger teacher comparison, vision transformer comparison, and visualization of sampling trajectory, can be found in Appendix H, I and J, respectively.

4.1 Image Classification Comparison

Offline Knowledge Distillation.

On CIFAR-100, we conduct experiments on teacher-student pairs including ResNet56-ResNet20, WRN-40-2-WRN-16-2 (Zagoruyko & Komodakis, 2016b), WRN-40-2-WRN-40-1, ResNet32 $\times$ 4-ResNet8 $\times$ 4, VGG13-VGG8 (Szegedy et al., 2015), VGG13-MobileNetV2 (Sandler et al., 2018) and WRN-40-2-ShuffleNetV1 (Zhang et al., 2018) pairs. We compare FM-KT with state-of-the-art methods including ATKD (Zagoruyko & Komodakis, 2016a), SPKD, CRD, DiffKD, KDiffusion, vanilla KD, DKD and DIST, and present the results in Table 1. As shown in Table 1, FM-KT significantly outperforms prior KD methods with all pairs. Note that FM-KT improves the student performance on ResNet56-ResNet20, WRN-40-2-WRN-16-2, WRN-40-2-WRN-40-1 and VGG13-VGG8 pairs by 3.15%, 1.60%, 1.43% and 0.64%, respectively, compared with the best prior methods. Moreover, our lightweight variant FM-KT^Θ, which without additional computational cost in inference, achieves state-of-the-art performance across a wide range of teacher-student pairs. On ImageNet-1k, FM-KT treats DIST as its $L(\cdot,\cdot)$ (w.r.t. baseline). Compared with DIST, FM-KT exceeds DIST on ResNet34-ResNet18 and ResNet50-MobileNetV1 by 1.10% and 0.98%, respectively. Specifically, FM-KT demonstrates superior performance over similar algorithms, DiffKD and KDiffusion, showing a substantial margin of improvement on both ResNet34-ResNet18 and ResNet50-MobileNetV1 pairs. However, it should be noted that DiffKD introduces 11 additional convolutional layers in its encoder (considering its mentioned Diffusion Model and Noise Adapter), while in contrast, FM-KT employs 2-layer MLP with only 4 linear layers as its meta-encoder. Furthermore, the lightweight FM-KT^Θ outperforms all algorithms with no additional computational overhead in inference, validating its effectiveness and applicability.

Online Knowledge Distillation.

We present the results of the comparison between OFM-KT and prior state-of-the-art approaches CL (Song & Chai, 2018), ONE (Lan et al., 2018), OKDDip (Chen et al., 2020), FFSD-C (Li et al., 2022) and ABHF-OKD (Gong et al., 2023) in Table 3 and 4. Among them, Table 3 illustrates the experimental results on CIFAR-100. We can observe OFM-KT beats all comparison methods on ResNet110, VGG16, DenseNet40-2, and MobileNetV2. For the results on ImageNet-1k on Table 4, OFM-KT outperforms other methods on ResNet18, albeit lagging behind the optimal ABHF-OKD by a marginal 0.33% on ResNet34. Importantly, regarding both ResNet18 and ResNet34, OFM-KT necessitates merely two Number of Function Evaluations (NFEs) to attain the best results. This indicates that OFM-KT corresponds to Online KD, which is the aggregated outcome of two branches sharing parameters. Hence, this compellingly substantiates that OFM-KT is a potent Online KD algorithm.

4.2 Ablation Studies

The Number of Sampling Steps $K$ in Inference.

We can also call $K$ as NFEs, an important metric affecting the GPU latency during inference. Both FM-KT and OFM-KT have a similarity form to the diffusion models family (e.g. VE-SDE, VP-SDE, EDM (Karras et al., 2022)) and INN (Solodskikh et al., 2023), in that after obtaining the training weights, the NFEs can be modified at the time of inference to trade-off effective and efficiency. We can see from Table 1, 3, 2, 4 and Fig. 6 that increasing $K$ will improve the student performance, but in general $K\!=\!2$ will achieve quite satisfactory results. Note that the combination of PKD and Swin-Transformer on ImageNet-1k in Fig. 6 has a large difference in the results achieved by the different NFEs, but the best results are superior to the combination of PKD and MLP/CNN. This might be because Swin-Transformer does not have inductive bias (Park & Kim, 2021) and the specificity that PKD is a feature-based distillation method.

The Computational Cost Analysis. For computational burden during training, as illustrated in Fig. 7, applying a serial loss calculation $\mathcal{L}_{\textrm{FM-KT}}$ does not introduce excessive GPU latency during training. Even the most computationally demanding meta-encoder Transformer has less than double the GPU latency compared to vanilla KD. Additionally, the computational overhead comparison between FM-KT and DiffKD during inference can be found in Appendix L.

The Effectiveness of Optimization Objective. We present the validity of the FM-KT optimization objective in Fig. 5 to prevent misinterpretations due to the properties of the meta-encoder and the loss function themselves. As $K$ rises, we can observe that the performance improvement becomes increasingly clear. Note that 4.67% enhancement produced by the DIST+Swin-Transformer on the ResNet56-ResNet20 pair demonstrates that the characteristic of FM-KT – implicit ensemble – does result in performance gains.

The Ablation about the Loss Function and Meta-Encoder. The outcomes of this ablation study are summarized in Fig. 6. For the meta-encoder, it is demonstrated that the Swin-Transformer yields the most favorable results when combined with any loss function on CIFAR-100. Conversely, on ImageNet-1k, the amalgamation of MLP with DKD, DIST and PKD demonstrates superior performance. Moreover, for the loss function, DIST and DKD exhibit comparable and enhanced performance relative to PKD and vanilla KD across all student-teacher pairs.

5 Related Work

Knowledge distillation.

The main strategies of knowledge distillation are categorized into three: feature-based (Tian et al., 2019; Li et al., 2022; Zhang & Ma, 2020), logit-based (Tao et al., 2022; Zhao et al., 2022; Shen & Xing, 2022), and data-based distillations (Wang et al., 2022; Shao et al., 2023). Regardless of the approach, the knowledge transfer framework plays an important role in it. Thus, this paper aims to design a more desirable knowledge transfer framework that can serve both feature-based distillation as well as logit-based distillation.

Continuous Network Representation.

There are a number of architectures belonging to continuous network representation, such as RNN (Williams & Zipser, 1989), LSTM (S & J, 1997), Neural ODE (Chen et al., 2018), GflowNet family (Bengio et al., 2021; Zhang et al., 2022), diffusion model family (Karras et al., 2022; Ho et al., 2020; Song et al., 2023a), INN (Solodskikh et al., 2023), DiffKD (Huang et al., 2023) and KDiffusion (Yao et al., 2024). More discussion can be found in Appendix F.

6 Conclusion

We have proposed a highly scalable framework FM-KT, its lightweight variant FM-KT^Θ, and its online knowledge distillation variant OFM-KT for knowledge transfer. The design flexibility of FM-KT, FM-KT^Θ and OFM-KT allows them to be formulated utilizing a loss function of any form and a meta-encoder with any available architecture, making them adaptable for progressive transformation focused both on features and logits via arbitrary noise schedules. Theoretically, we have proven that the optimization objective of FM-KT is equivalent to minimizing the upper bound of the negative log-likelihood of the target (e.g. the teacher’s output). In future work, we aim to further explore the design space of FM-KT and extend its application to a broader scope of downstream tasks.

Impact Statement.

Our proposed FM-KT enhances computational efficiency in lightweight models on edge devices, enabling entities with limited computational resources, such as certain companies and laboratories, to effectively leverage knowledge from large-scale models. However, FM-KT also entails modifications to the student model’s architecture, posing potential challenges under certain deployment circumstances that are constrained by hardware specificity.

References

Achiam et al. (2023) Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
Ahn et al. (2019) Ahn, S., Hu, S. X., Damianou, A. C., Lawrence, N. D., and Dai, Z. Variational information distillation for knowledge transfer. In Computer Vision and Pattern Recognition, pp. 9163–9171, Long Beach, CA, USA, Jun. 2019. Computer Vision Foundation / IEEE.
Albergo & Vanden-Eijnden (2022) Albergo, M. S. and Vanden-Eijnden, E. Building normalizing flows with stochastic interpolants. In International Conference on Learning Representations, 2022.
Bengio et al. (2021) Bengio, E., Jain, M., Korablyov, M., Precup, D., and Bengio, Y. Flow network based generative models for non-iterative diverse candidate generation. In Neural Information Processing Systems, pp. 27381–27394, Virtual Event, Dec. 2021.
Brown et al. (2020) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
Cao et al. (2022) Cao, W., Zhang, Y., Gao, J., Cheng, A., Cheng, K., and Cheng, J. PKD: general distillation framework for object detectors via pearson correlation coefficient. In Neural Information Processing Systems, New Orleans, LA, USA, Dec. 2022.
Chen et al. (2020) Chen, D., Mei, J.-P., Wang, C., Feng, Y., and Chen, C. Online knowledge distillation with diverse peers. Association for the Advance of Artificial Intelligence, 34(04):3430–3437, 2020.
Chen et al. (2018) Chen, R. T., Rubanova, Y., Bettencourt, J., and Duvenaud, D. K. Neural ordinary differential equations. Neural Information Processing Systems, 31, 2018.
Chuanguang et al. (2021) Chuanguang, Y., Zhulin, A., Linhang, C., and Yongjun, X. Hierarchical self-supervised augmented knowledge distillation. In International Joint Conference on Artificial Intelligence, pp. 1217–1223, Virtual Event, Aug. 2021. IJCAI.
Contributors (2020) Contributors, M. Openmmlab’s image classification toolbox and benchmark. https://github.com/open-mmlab/mmclassification, 2020.
Cubuk et al. (2020) Cubuk, E. D., Zoph, B., Shlens, J., and Le, Q. V. Randaugment: Practical automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pp. 702–703, 2020.
Gong et al. (2023) Gong, L., Lin, S., Zhang, B., Shen, Y., Li, K., Qiao, R., Ren, B., Li, M., Yu, Z., and Ma, L. Adaptive hierarchy-branch fusion for online knowledge distillation. In Association for the Advancement of Artificial Intelligence, volume 37, pp. 7731–7739, Washington, DC, USA, Jun. 2023. AAAI.
Gou et al. (2021) Gou, J., Yu, B., Maybank, S. J., and Tao, D. Knowledge distillation: A survey. International Journal of Computer Vision, 129(6):1789–1819, 2021.
Guo et al. (2020) Guo, Q., Wang, X., Wu, Y., Yu, Z., Liang, D., Hu, X., and Luo, P. Online knowledge distillation via collaborative learning. In Computer Vision and Pattern Recognition, pp. 11020–11029, 2020.
Hinton et al. (2015) Hinton, G., Vinyals, O., and Dean, J. Distilling the knowledge in a neural network, 2015. URL https://arxiv.org/abs/1503.02531.
Ho et al. (2020) Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. In Neural Information Processing Systems, pp. 6840–6851, Virtual Event, Dec. 2020. NIPS.
Huang et al. (2023) Huang, T., Zhang, Y., Zheng, M., You, S., Wang, F., Qian, C., and Xu, C. Knowledge diffusion for distillation. arXiv preprint arXiv:2305.15712, 2023.
Huang et al. (2022) Huang, W., Peng, Z., Dong, L., Wei, F., Jiao, J., and Ye, Q. Generic-to-specific distillation of masked autoencoders. In Computer Vision and Pattern Recognition, New Orleans, LA, USA, Jun. 2022.
Karras et al. (2022) Karras, T., Aittala, M., Aila, T., and Laine, S. Elucidating the design space of diffusion-based generative models. arXiv preprint arXiv:2206.00364, 2022.
Kingma et al. (2021) Kingma, D., Salimans, T., Poole, B., and Ho, J. Variational diffusion models. Neural Information Processing Systems, 34:21696–21707, 2021.
Krizhevsky et al. (2009) Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. 2009.
Lan et al. (2018) Lan, X., Zhu, X., and Gong, S. Knowledge distillation by on-the-fly native ensemble. In Neural Information Processing Systems, pp. 7528–7538, Montréal Canada, Dec. 2018. MIT Press.
Lee et al. (2023) Lee, S., Kim, B., and Ye, J. C. Minimizing trajectory curvature of ode-based generative models. arXiv preprint arXiv:2301.12003, 2023.
Li et al. (2022) Li, S., Lin, M., Wang, Y., Wu, Y., Tian, Y., Shao, L., and Ji, R. Distilling a powerful student model via online knowledge distillation. IEEE Transactions on Neural Networks and Learning Systems, pp. 1–10, 2022.
Lin et al. (2014) Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. Microsoft coco: Common objects in context. In European Conference on Computer Vision, pp. 740–755. Springer, 2014.
Lipman et al. (2022) Lipman, Y., Chen, R. T., Ben-Hamu, H., Nickel, M., and Le, M. Flow matching for generative modeling. In International Conference on Learning Representations, 2022.
Liu et al. (2022) Liu, X., Gong, C., and Liu, Q. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022.
Liu et al. (2021) Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., and Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In International Conference on Computer Vision, pp. 10012–10022, 2021.
Lu et al. (2022) Lu, C., Zhou, Y., Bao, F., Chen, J., Li, C., and Zhu, J. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. In Neural Information Processing Systems, New Orleans, LA, USA, Nov.-Dec. 2022. NIPS.
Meng et al. (2022) Meng, C., Gao, R., Kingma, D. P., Ermon, S., Ho, J., and Salimans, T. On distillation of guided diffusion models. arXiv preprint arXiv:2210.03142, 2022.
Park & Kim (2021) Park, N. and Kim, S. How do vision transformers work? In International Conference on Learning Representations, Virtual Event, May 2021. Openreview.net.
Russakovsky et al. (2015) Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
S & J (1997) S, H. and J, S. Long short-term memory. Neural Computation, pp. 1375–1780, 1997.
Sandler et al. (2018) Sandler, M., Howard, A. G., Zhu, M., Zhmoginov, A., and Chen, L. Mobilenetv2: Inverted residuals and linear bottlenecks. In Computer Vision and Pattern Recognition, pp. 4510–4520, Salt Lake City, UT, USA, Jun. 2018. IEEE.
Shao et al. (2023) Shao, S., Chen, H., Huang, Z., Gong, L., Wang, S., and Wu, X. Teaching what you should teach: A data-based distillation method. In International Joint Conference on Artificial Intelligence, pp. 1351–1359, Macao, SAR, China, Aug. 2023. ijcai.org.
Shen & Xing (2022) Shen, Z. and Xing, E. A fast knowledge distillation framework for visual recognition. In European Conference on Computer Vision, pp. 673–690. Springer, 2022.
Solodskikh et al. (2023) Solodskikh, K., Kurbanov, A., Aydarkhanov, R., Zhelavskaya, I., Parfenov, Y., Song, D., and Lefkimmiatis, S. Integral neural networks. In Computer Vision and Pattern Recognition, pp. 16113–16122, Vancouver, BC, Jun. 2023. IEEE.
Song & Chai (2018) Song, G. and Chai, W. Collaborative learning for deep neural networks. In Neural Information Processing Systems, pp. 1837–1846, Montréal Canada, Dec. 2018. MIT Press.
Song et al. (2023a) Song, J., Meng, C., and Ermon, S. Denoising diffusion implicit models. In International Conference on Learning Representations, kigali, rwanda, May. 2023a. OpenReview.net.
Song & Ermon (2019) Song, Y. and Ermon, S. Generative modeling by estimating gradients of the data distribution. In Neural Information Processing Systems, pp. 11895–11907, Vancouver, BC, Canada, Dec. 2019.
Song et al. (2023b) Song, Y., Dhariwal, P., Chen, M., and Sutskever, I. Consistency models. arXiv preprint arXiv:2303.01469, 2023b.
Song et al. (2023c) Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, kigali, rwanda, May. 2023c. OpenReview.net.
Szegedy et al. (2015) Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. Going deeper with convolutions. In Computer Vision and Pattern Recognition, Boston, Massachusetts, Jun. 2015. IEEE.
Tao et al. (2022) Tao, H., Shan, Y., Fei, W., Chen, Q., and Xu, C. Knowledge distillation from a stronger teacher. In Advances in Neural Information Processing Systems, 2022.
Tian et al. (2019) Tian, Y., Krishnan, D., and Isola, P. Contrastive representation distillation. In International Conference on Learning Representations, 2019.
Touvron et al. (2021) Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., and Jégou, H. Training data-efficient image transformers & distillation through attention. In International conference on machine learning, pp. 10347–10357. PMLR, 2021.
Tung & Mori (2019) Tung, F. and Mori, G. Similarity-preserving knowledge distillation. In International Conference on Computer Vision, pp. 1365–1374, 2019.
Wang et al. (2022) Wang, H., Lohit, S., Jones, M. N., and Fu, Y. What makes a ”good” data augmentation in knowledge distillation - a statistical perspective. In Neural Information Processing Systems, volume 35, pp. 13456–13469, New Orleans, LA, USA, Dec. 2022. NIPS.
Wightman et al. (2021) Wightman, R., Touvron, H., and Jegou, H. Resnet strikes back: An improved training procedure in timm. In NeurIPS 2021 Workshop on ImageNet: Past, Present, and Future, 2021. URL https://openreview.net/forum?id=NG6MJnVl6M5.
Williams & Zipser (1989) Williams, R. J. and Zipser, D. A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1(2):270–280, 1989.
Xu et al. (2020) Xu, G., Liu, Z., Li, X., and Loy, C. C. Knowledge distillation meets self-supervision. In Vedaldi, A., Bischof, H., Brox, T., and Frahm, J.-M. (eds.), European Conference on Computer Vision, pp. 588–604, Cham, 2020. Springer.
Yang et al. (2022a) Yang, Z., Li, Z., Shao, M., Shi, D., Yuan, Z., and Yuan, C. Masked generative distillation. In European Conference Computer Vision, volume 13671, pp. 53–69, Tel Aviv, Israel, Oct. 2022a. Springer.
Yang et al. (2022b) Yang, Z., Li, Z., Zeng, A., Li, Z., Yuan, C., and Li, Y. Vitkd: Practical guidelines for vit feature knowledge distillation. arXiv preprint arXiv:2209.02432, 2022b.
Yao et al. (2024) Yao, X., Lu, F., Zhang, Y., Zhang, X., Zhao, W., and Yu, B. Progressively knowledge distillation via re-parameterizing diffusion reverse process. In Association for the Advance of Artificial Intelligence, 2024.
Zagoruyko & Komodakis (2016a) Zagoruyko, S. and Komodakis, N. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. In International Conference on Learning Representations, 2016a.
Zagoruyko & Komodakis (2016b) Zagoruyko, S. and Komodakis, N. Wide residual networks. In British Machine Vision Conference, York, UK, Spet. 2016b. BMVA Press.
Zhang et al. (2022) Zhang, D., Malkin, N., Liu, Z., Volokhova, A., Courville, A. C., and Bengio, Y. Generative flow networks for discrete probabilistic modeling. In International Conference on Machine Learning, volume 162, pp. 26412–26428, Baltimore, Maryland, USA, Jul. 2022. PMLR.
Zhang & Ma (2020) Zhang, L. and Ma, K. Improve object detection with feature-based knowledge distillation: Towards accurate and efficient detectors. In International Conference on Learning Representations, 2020.
Zhang et al. (2018) Zhang, X., Zhou, X., Lin, M., and Sun, J. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In Computer Vision and Pattern Recognition, pp. 6848–6856, 2018.
Zhao et al. (2022) Zhao, B., Cui, Q., Song, R., Qiu, Y., and Liang, J. Decoupled knowledge distillation. In Computer Vision and Pattern Recognition, pp. 11953–11962, June 2022.
Zhong et al. (2020) Zhong, Z., Zheng, L., Kang, G., Li, S., and Yang, Y. Random erasing data augmentation. In Association for the Advance of Artificial Intelligence, volume 34, pp. 13001–13008, 2020.
Zong et al. (2023) Zong, M., Qiu, Z., Ma, X., Yang, K., Liu, C., Hou, J., Yi, S., and Ouyang, W. Better teacher better student: Dynamic prior knowledge for knowledge distillation. In International Conference on Learning Representations, Kigali, Rwanda, May 2023. OpenReview.net. URL https://openreview.net/pdf?id=M0_sUuEyHs.

APPENDIX

Appendix A: Pseudo Code of FM-KT.
Appendix B: Theoretical Guarantees of FM-KT.
Appendix C: Link FM-KT to Ensemble.
Appendix D: Pair Decoupling.
Appendix E: Additional Ablation Experiment.
Appendix F: Additional Related Work on Continuous Network Representations.
Appendix G: Additional Object Detection Comparison.
Appendix H: Stronger Strategies and Stronger Teacher Comparison.
Appendix I: Vision Transformer Comparison.
Appendix J: Visualization of Sampling Trajectory.
Appendix K: Implementation Detail.
1. K.1: Training Strategies.
2. K.2: Loss Function and Meta-encoder.
Appendix L: Additional Training and Inference Computational Cost Discussion.
Appendix M: Best Meta-encoder Choice on ImageNet-1k.
Appendix N: Architecture-Sensitive Experiments between FM-KT and DiffKD.
Appendix O: Unify VP SDE, VE SDE and Rectified flow in FM-KT.
Appendix P: Limitation.

Appendix A Pseudo Code of FM-KT

For ease of understanding, we show the pseudo code of FM-KT in the Offline KD scenario. The implementation of FM-KT^Θ and OFM-KT only needs to modify the optimization objective as described in our main paper.

Algorithm 1 Pseudo code of FM-KT in a PyTorch-like style.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FlowMatchingModule(nn.Module):
    def __init__(self,...):
        super().__init__()
        self.meta_encoder:nn.Module = (...)
        self.metric_based_loss_function:nn.Module = (...)
        self.time_embed:nn.Module = nn.Linear(...)
        self.training_sampling:int = (...)  # the number of sampling steps during training
        self.shape_transformation_function:nn.Module = (...)
        self.dirac_ratio:float = (...)  # hyperparameter  $\beta_{d}$ , which belongs to [0,1]
        self.weight:float = (...)


    def forward(self, s_f, t_f=None, target=None, inference_sampling=1):
        # s_f: the feature/logit of the student
        # t_f: the feature/logit of the teacher
        # target: the logit-based ground truth label, only used for logit-based distillation
        # inference_sampling: the number of sampling steps during inference

        all_p_t_f = []
        if self.training:
            # Shuffle one-to-one teacher-student feature/logit pair
            if t_f is not None:
                l = int(self.dirac_ratio * t_f.shape[0])
                t_f[l:][torch.randperm(t_f.shape[0] - l)] = t_f[l:].clone()
            loss, x = 0., s_f
            indices = reversed(range(1, self.training_sampling + 1))
            # Calculate the FM-KT loss
            for i in indices:
                t = torch.ones(s_f.shape[0]) * i / self.training_sampling
                embed_t = self.time_embed(t)
                embed_x = x + embed_t
                velocity = self.meta_encoder(embed_x)
                x = x - velocity / self.training_sampling
                p_t_f = self.shape_transformation_function(s_f - velocity)
                all_p_t_f.append(p_t_f)
                loss += self.metric_based_loss_function(p_t_f, t_f)
                if target is not None:
                    loss += F.cross_entropy(p_t_f, target)
            loss *= (self.weight / self.training_sampling)
            return loss, torch.stack(all_p_t_f, 0).mean(0)
        else:
            x = s_f
            indices = reversed(range(1, inference_sampling + 1))
            for i in indices:
                t = torch.ones(s_f.shape[0]) * i / inference_sampling
                embed_t = self.time_embed(t)
                embed_x = x + embed_t
                velocity = self.meta_encoder(embed_x)
                x = x - velocity / inference_sampling
                all_p_t_f.append(self.shape_transformation_function(s_f - velocity))
            return 0., torch.stack(all_p_t_f, 0).mean(0)

Appendix B Theoretical Guarantees of FM-KT

FM-KT proposes a novel serial training paradigm in order to avoid “cheating” by accessing $X^{T}$ during training:

\displaystyle\mathcal{L}_{\textrm{FM-KT}}=\mathbb{E}_{(X^{S},X^{T})}\frac{1}{N}\sum_{i=1}^{N}||\mathcal{T}((\nabla_{t}\alpha_{t}Z_{1}-g_{v_{\theta}}(Z_{1-i/N},1-i/N))/-\nabla_{t}\sigma_{t})-X^{T}||_{2}^{2},

(6)

where $Z_{1}=\alpha_{1}X^{S}$ . Here we assume the loss function is $\ell_{2}$ -norm. Broadly speaking, the loss function used by FM-KT only needs to ensure that it can achieve the effect of minimizing the difference in distributions similar to Kullback-Leibler Divergence. When $i\geq 1$ , $Z_{1-i/N}=Z_{1-(i-1)/N}-g_{v_{\theta}}(Z_{1-(i-1)/N},1-(i-1)/N)/N$ . $\mathcal{T}(\cdot)$ is used for shape alignment to ensure the calculation of $\ell_{2}$ -norm. Let us define $q(\cdot|\cdot)$ as the predefined conditional probability, and $p_{v_{\theta}}(\cdot|\cdot)$ as the predicted conditional probability. We know the training objective of the classical diffusion probability model can be performed by minimizing the upper bound on negative log-likelihood:

\displaystyle-\log p_{v_{\theta}}(Z_{0})\leq\mathbb{E}_{q(Z_{1/N:1}|Z_{0})}\left[\log\frac{q(Z_{1}|Z_{0})}{p_{v_{\theta}}(Z_{1})p_{v_{\theta}}(Z_{0}|Z_{1/N})}+\sum_{i=1}^{N}\log\frac{q(Z_{(i-1)/N}|Z_{i/N},Z_{0})}{p_{v_{\theta}}(Z_{(i-1)/N}|Z_{i/N})}\right].

(7)

We can rewritten it as

		$\displaystyle-\log p_{v_{\theta}}(Z_{0})\leq\mathbb{E}_{q(Z_{1/N:1}\|Z_{0})}\left[\log\frac{q(Z_{1}\|Z_{0})}{p_{v_{\theta}}(Z_{1})p_{v_{\theta}}(Z_{0}\|Z_{1/N})}+\sum_{i=1}^{N}\log\frac{q(Z_{(i-1)/N}\|Z_{i/N},Z_{0})}{p_{v_{\theta}}(Z_{(i-1)/N}\|Z_{i/N})}\right]$		(8)
		$\displaystyle=\mathbb{E}_{q(Z_{1/N:1}\|Z_{0})}\left[\log\frac{q(Z_{1}\|Z_{0})}{p_{v_{\theta}}(Z_{1})p_{v_{\theta}}(Z_{0}\|Z_{1/N})}\right]+\sum_{i=1}^{N}\mathbb{E}_{q(Z_{i/N}\|Z_{0})}\mathbb{E}_{q(Z_{(i-1)/N}\|Z_{i/N},Z_{0})}\left[\log\frac{q(Z_{(i-1)/N}\|Z_{i/N},Z_{0})}{p_{v_{\theta}}(Z_{(i-1)/N}\|Z_{i/N})}\right]$
		$\displaystyle=\mathbb{E}_{q(Z_{1/N:1}\|Z_{0})}\left[\log\frac{q(Z_{1}\|Z_{0})}{p_{v_{\theta}}(Z_{1})p_{v_{\theta}}(Z_{0}\|Z_{1/N})}\right]+\sum_{i=1}^{N}\mathbb{E}_{\hat{Z}_{i/N}\sim\int p_{v_{\theta}}(Z_{i/N}\|Z_{1})q(Z_{1}\|Z_{0})dZ_{1}}$
		$\displaystyle\left[D_{\mathrm{KL}}(q(Z_{(i-1)/N}\|Z_{i/N},Z_{0})\|\|p_{v_{\theta}}(Z_{(i-1)/N}\|\hat{Z}_{i/N}))\right],\quad s.t.\quad\textrm{Law}(Z_{i/N})\stackrel{{\scriptstyle\sim}}{{=}}\textrm{Law}(\hat{Z}_{i/N})$
		$\displaystyle\approx\mathbb{E}_{q(Z_{1/N:1}\|Z_{0})}\left[\log\frac{q(Z_{1}\|Z_{0})}{p_{v_{\theta}}(Z_{1})p_{v_{\theta}}(Z_{0}\|Z_{1/N})}\right]+\sum_{i=1}^{N}\mathbb{E}_{\hat{Z}_{i/N}\sim\int p_{v_{\theta}}(Z_{i/N}\|Z_{1})q(Z_{1}\|Z_{0})dZ_{1}}$
		$\displaystyle\left[\|\|q(Z_{(i-1)/N}\|Z_{i/N},Z_{0})-p_{v_{\theta}}(Z_{(i-1)/N}\|\hat{Z}_{i/N})\|\|_{2}^{2}\right],\quad s.t.\quad\textrm{Law}(Z_{i/N})\stackrel{{\scriptstyle\sim}}{{=}}\textrm{Law}(\hat{Z}_{i/N}).$

For $i\!\geq\!1$ , if $\textrm{Law}(Z_{i/N})\stackrel{{\scriptstyle\sim}}{{=}}\textrm{Law}(\hat{Z}_{i/N})$ is guaranteed, then $\textrm{Law}(Z_{(i-1)/N})\stackrel{{\scriptstyle\sim}}{{=}}\textrm{Law}(\hat{Z}_{(i-1)/N})$ can also be guaranteed by optimizing $\mathbb{E}_{\hat{Z}_{i/N},Z_{1},Z_{0}}D_{\mathrm{KL}}(q(Z_{(i-1)/N}|Z_{i/N},Z_{0})||p_{v_{\theta}}(Z_{(i-1)/N}|\hat{Z}_{i/N}))$ in Eq. 8. Based on the prior condition $\textrm{Law}(Z_{1})\stackrel{{\scriptstyle\sim}}{{=}}\textrm{Law}(\hat{Z}_{1})$ , we can deduce $\{\textrm{Law}(Z_{i/N})\stackrel{{\scriptstyle\sim}}{{=}}\textrm{Law}(\hat{Z}_{i/N})\}_{i=0}^{N-1}$ sequentially by recursive method.

Note that this derivation via Bayes’ Theorem satisfies almost all noise schedules, such as Rectified flow (Liu et al., 2022; Lipman et al., 2022), VP ODE (Song et al., 2023c; Albergo & Vanden-Eijnden, 2022) and VE ODE (Song et al., 2023c; Albergo & Vanden-Eijnden, 2022). More details can be found in Appendix O. In fact, the upper bound in Eq. 8 is precisely the optimization objective of FM-KT. The main difference between Eq. 7 and Eq. 8 is that $\hat{Z}_{i/N}$ replaces ${Z}_{i/N}$ , and $\hat{Z}_{i/N}$ is obtained from the reverse sampling process. In this manner, the trajectory $\{\hat{Z}_{i/N}\}_{i}$ is obtained through the estimator $g_{v_{\theta}}(\cdot)$ and no longer contains the teacher knowledge $X^{T}$ , thereby enabling the distillation process to proceed normally.

Appendix C Link FM-KT to Ensemble

Ensemble is a method that trains multiple models, aggregates their outputs through voting, and produces a final prediction. In this section, we prove theoretically that FM-KT is essentially a unique implicit ensemble method under the assumption that its noise schedule is set as Rectified flow.

First, we define ODE in FM-KT as $X^{S}-X^{T}=\frac{dX_{t}}{dt}$ (for the convenience of derivation, this definition is slightly different from that in the main paper), so we need to fit $||\frac{dX_{t}}{dt}-g_{v_{\theta}}(X_{t},t)||_{2}^{2}$ , where $X_{t}=tX^{S}+(1-t)X^{T}$ , $t\sim\mathcal{U}[0,1]$ . In inference, this ODE solver defaults to Euler’s method in FM-KT, and the sampling must be discrete with $N$ steps because fitting continuous time steps $t$ consumes extensive computational costs. When the meta-encoder $g_{v_{\theta}}(\cdot)$ is at the optimal solution, we assume that its error from the true value can be expressed as a function of $x_{t}$ and $t$ , and that this function is at 1-Lipschitz.

Thus, we can define $\mathcal{H}(t)=\operatorname*{arg\,sup}_{X_{t}}\{||\frac{dX_{t}}{dt}-g_{v_{\theta}}(X_{t},t)||_{2}^{2}\}$ , then the truncation error $\mathcal{K}(t)$ can be defined as $\left[\frac{dX_{t}}{dt}-g_{v_{\theta}}(\mathcal{H}(t),t)\right]$ , which is also at 1-Lipschitz under the assumption $\mathcal{H}(t)$ is at 1-Lipschitz. After that, we also need to define the step number of sampling in inference. We set it as $K$ , so $dt$ is $1/K$ . Based on the aforementioned notations, we can analyse the truncation error by the recursive method.

For the sake of derivation convenience, we define $\{Z_{t}\}_{t}$ as the sampled trajectory in inference to distinguish it from $\{X_{t}\}_{t}$ in training. Thus, a step in sampling can be described as $Z_{1-(i+1)/K}\!=\!Z_{1-i/K}-g_{v_{\theta}}(Z_{1-{i/K}},1-i/K)dt$ , and $Z_{1-i/K}=X_{1-i/K}+\mathcal{E}(Z_{1-i/K})$ , where $\mathcal{E}(Z_{1-i/K})$ refers to the truncation error accumulated to a intermediate sample $Z_{1-i/K}$ in the sampling process. Note that $\mathcal{K}(t)$ and $\mathcal{E}(Z_{1-i/K})$ are not results of the norm, and therefore $\forall t$ and $\forall Z_{1-i/K}$ , this derivation does not need to satisfy that $\mathcal{K}(t)\geq 0$ and $\mathcal{E}(Z_{1-i/K})\geq 0$ . This approach avoids the accumulation of the truncation error due to $\ell_{2}$ -norm $\geq$ 0. We can derive the sample $Z_{1-(i-1)/K}$ in the next step by the derivation:

$\displaystyle Z_{1-(i+1)/K}$	$\displaystyle=Z_{1-i/K}-(1/K)g_{v_{\theta}}(Z_{1-i/K},1-i/K)$	(9)
	$\displaystyle=Z_{1-i/K}-(1/K)g_{v_{\theta}}(X_{1-i/K}+\mathcal{E}(Z_{1-i/K}),1-i/K)$
	$\displaystyle=X_{1-i/K}+\mathcal{E}(Z_{1-i/K})-(1/K)g_{v_{\theta}}(X_{1-i/K}+\mathcal{E}(Z_{1-i/K}),1-i/K)$
	$\displaystyle\approx X_{1-i/K}+\mathcal{E}(Z_{1-i/K})-(1/K)\left[g_{v_{\theta}}(X_{1-i/K},1-i/K)+\mathcal{E}(Z_{1-i/K})\nabla_{X_{t}}g_{v_{\theta}}(X_{1-i/K},1-i/K)\right],$
	$\displaystyle=X_{1-i/K}+\mathcal{E}(Z_{1-i/K})-(1/K)\left[g_{v_{\theta}}(X_{1-i/K},1-i/K)+\mathcal{E}(Z_{1-i/K})\psi(1-i/K)\right],$

where $\psi(t)=\nabla_{X_{t}}g_{v_{\theta}}(X_{t},t)$ . Then, Eq. 9 can continue to be derived as

$\displaystyle Z_{1-(i+1)/K}$	$\displaystyle\approx X_{1-(i+1)/K}+\mathcal{E}(Z_{1-i/K})+(1/K)\mathcal{K}(1-i/K)-(1/K)\mathcal{E}(Z_{1-i/K})\psi(1-i/K)$	(10)
	$\displaystyle=X_{1-(i+1)/K}+\mathcal{E}(Z_{1-i/K})[1-(1/K)\psi(1-i/K)]+(1/K)\mathcal{K}(1-i/K)$
$\displaystyle Z_{1-(i+1)/K}-X_{1-(i+1)/K}$	$\displaystyle=\mathcal{E}(Z_{1-i/K})[1-(1/K)\psi(1-i/K)]+(1/K)\mathcal{K}(1-i/K).$

Thus, $\mathcal{E}(Z_{1-(i+1)/K})=\mathcal{E}(Z_{1-i/K})[1-(1/K)\psi(1-i/K)]+(1/K)\mathcal{K}(1-i/K)$ . After that, the recursive method leads us to the following conclusions:

		$\displaystyle\mathcal{E}(Z_{1-1/K})=(1/K)\mathcal{K}(1)$		(11)
		$\displaystyle\mathcal{E}(Z_{1-2/K})=\mathcal{E}(Z_{1-1/K})(1-(1/K)\psi(1-1/K))+(1/K)\mathcal{K}(1-1/K)$
		$\displaystyle\vdots$
		$\displaystyle\mathcal{E}(Z_{0})=(1/K)\left[\sum_{i=0}^{K-1}\mathcal{K}(1-i/K)\right]+(1/K^{2})\left[\sum_{j=1}^{K-1}\left[\psi(1-j/K)\left(\sum_{i=0}^{j-1}\mathcal{K}(1-i/K)\right)\right]\right]+\mathcal{O}(1/K^{3}).$

Looking at the first term, we can see that the truncation error comes from summing $\mathcal{K}(\cdot)$ over all time points. When treating the error sampling as Monte Carlo sampling, with a sufficient number of samplings $K$ , it becomes possible for FM-KT to approximate ensemble methods and thus estimate the ground truth effectively.

Appendix D Pair Decoupling

In this section, we present pair decoupling (PD), a straightforward yet effective technique for enhancing performance in the feature-based distillation scenario of image classification using FM-KT. This method involves shuffling a subset of samples in a batch to achieve regularization, thereby preventing overfitting of the teacher’s refined low-level hierarchical features. Let $B$ , $C$ , $H$ and $W$ denote the batch size, the number of channels, the height of the feature map, and the width of the feature map, respectively. Given the teacher’s feature map $X^{T}\in\mathbb{R}^{B\times C\times H\times W}$ from a specific layer, PD is applied prior to all FM-KT related calculations. Implementing PD involves defining a hyperparameter, the dirac ratio $\beta_{d}$ , and perturbing $B\!-\!\left\lfloor{\beta_{d}}B\right\rfloor$ samples in the batch. The Pytorch code for this is provided in Appendix A. Specifically, PD selects the random $B\!-\!\left\lfloor{\beta_{d}}B\right\rfloor$ samples $X^{T}\left[0:B\!-\!\left\lfloor{\beta_{d}}B\right\rfloor\right]$ in a batch and then shuffles them:

X^{T}\left[0:B\!-\!\left\lfloor{\beta_{d}}B\right\rfloor\right]=\textbf{{shuffle}}\left(X^{T}\left[0:B\!-\!\left\lfloor{\beta_{d}}B\right\rfloor\right]\right).

We refer to the hyperparameter $\beta_{d}$ as “dirac ratio” because, following the PD operation, $\left\lfloor{\beta_{d}}B\right\rfloor$ samples are used to compute $\mathcal{L}_{\textrm{FM-KT}}$ . Here, $X^{T}$ and $X^{S}$ are treated as Dirac distributions with the objective of achieving one-to-one matching. Conversely, the remaining $B-\left\lfloor{\beta_{d}}B\right\rfloor$ samples are utilized in the computation of $\mathcal{L}_{\textrm{FM-KT}}$ , where $X^{T}$ and $X^{S}$ are considered as non-Dirac distributions, targeting many-to-many matching.

Due to the specificity of the feature-based distillation scenario for image classification, PD is specifically designed to avoid over-matching the refined low-level feature thus improving the final performance of the student. Meanwhile, our experiments in Appendix E empirically demonstrate that PD is effective only in the feature-based distillation scenario of image classification, whereas in other scenarios it rather degrades the performance. This is because matching the teacher’s feature/logit at a fine-grained level is closely related to the final performance of the student in the logit-based distillation scenario for image classification as well as the feature-based distillation scenario for object detection. In other words, in the feature-based distillation scenario for image classification, it does not imply that improving similarity between the student’s low-level feature and the teacher’s low-level feature will result in the greater classification accuracy of the student.

Appendix E Additional Ablation Experiment

Here, we experimentally substantiate some empirical findings on the topics of normalization layer selection in the meta-encoder, where stages used for distillation in the feature-based scenario, and ideal configuration of dirac ratio $\beta_{d}$ in different scenarios.

Normalization type	GroupNorm	BatchNorm
WRN-40-2 (T)	75.61	75.61
WRN-40-2 (S+Baseline)	-	73.26
WRN-40-2 (S+DIST)	-	75.29
WRN-16-2 (S+FM-KT $K$ =1)	75.58	1.00
WRN-16-2 (S+FM-KT $K$ =2)	75.85	1.00
WRN-16-2 (S+FM-KT $K$ =4)	75.87	1.10
WRN-16-2 (S+FM-KT $K$ =8)	75.87	1.43

Table 5: Experiments were conducted on the different normalization type of FM-KT on CIFAR-100. Note that in this table all the architecture of the meta-encoder and the form of the loss function are set as CNN and DIST, respectively.

Ablation experiments on various normalization operations reveal instability in the FM-KT training paradigm when using BatchNorm. As observed in Table 5, the accuracy achieved with BatchNorm as the normalization layer is approximately 1%, even when the training of the student remains stable (i.e., the loss is not NAN). This indicates that using BatchNorm in FM-KT introduces instability by computing the mean and variance of inputs at different time points during inference. It is essential to note that although BatchNorm is applied in DiffKD, this choice is justified as the student converges effectively with a sufficiently high number of layers in the Diffusion Model (i.e. meta-encoder) mentioned in their work. Similar results are obtained in our studies by replacing the meta-encoder in FM-KT with Diffusion Model in DiffKD.

In Fig. 8, we investigated the optimal stages for distillation in the feature-based scenario and the ideal configuration for the dirac ratio $\beta_{d}$ . As there are no specific distillation stages for the logit-based distillation, we designated it as “[0, 0, 0]” for clarity. Our observations indicate that in the feature-based distillation scenario, the distillation stage does not significantly affect the final outcomes. Meanwhile, the configuration “[1, 1, 1]” often underperforms compared with “[0, 1, 1]” and “[0, 0, 1]”. This observation aligns with the conclusions drawn from most of the prior feature-based distillation studies (Tung & Mori, 2019; Zagoruyko & Komodakis, 2016a; Zong et al., 2023). Moreover, for different values of $\beta_{d}$ , the settings $\beta_{d}$ =0.25 typically yields the best result in feature-based distillation, while $\beta_{d}$ =1.0 excels in the logit-based distillation. This implies that the PD technique is more effective in the feature-based distillation context for image classification, and less so in the logit-based distillation.

It is worth noting that our experiments on PD in object detection revealed that $\beta_{d}$ =1.0 and $\beta_{d}$ =0.75 yield comparable performance, whereas a decrease in $\beta_{d}$ results in diminished performance. Based on these findings, we recommend using $\beta_{d}$ =0.25 as the default in the feature-based distillation scenario for image classification and $\beta_{d}$ =1.0 in other contexts.

Method	Schedule	mAP	AP ${}_{\textrm{50}}$	AP ${}_{\textrm{75}}$	AP ${}_{\textrm{S}}$	AP ${}_{\textrm{M}}$	AP ${}_{\textrm{L}}$
Mask RCNN-Swin (T)	3 $\times$ +ms	48.2	69.8	52.8	32.1	51.8	62.7
Retina-Res50 (S)	2 $\times$	37.4	56.7	39.6	20.0	40.7	49.7
PKD	2 $\times$	41.3 (+3.9)	60.5	44.1	23.0	45.3	55.9
FM-KT ( $K$ =1)	2 $\times$	41.4 (+4.0)	60.6	44.0	22.5	45.6	55.7
FM-KT ( $K$ =4)	2 $\times$	41.4 (+4.0)	60.6	44.1	22.5	45.6	55.7
FasterRCNN-Res101 (T)	2 $\times$	39.8	60.1	43.3	22.5	43.6	52.8
FasterRCNN-Res50 (S)	2 $\times$	38.4	59.0	42.0	21.5	42.1	50.3
GID	2 $\times$	40.2 (+1.8)	60.7	43.8	22.7	44.0	53.2
FRS	2 $\times$	40.4 (+2.0)	60.8	44.0	23.2	44.4	53.1
FGD	2 $\times$	40.4 (+2.0)	60.7	44.3	22.8	44.5	53.5
PKD	2 $\times$	40.3 (+1.9)	60.8	44.0	22.9	44.5	53.1
FM-KT ( $K$ =1)	2 $\times$	40.4 (+2.0)	60.7	44.1	22.9	44.8	52.8
FM-KT ( $K$ =4)	2 $\times$	40.5 (+2.1)	60.7	44.2	22.9	44.8	52.9
FCOS-Res101 (T)	2 $\times$ +ms	41.2	60.4	44.2	24.7	45.3	52.7
Retina-Res50 (S)	1 $\times$	37.4	56.7	39.6	20.0	40.7	49.7
PKD	1 $\times$	40.3 (+2.9)	59.6	43.0	22.2	44.9	53.7
FM-KT ( $K$ =1)	1 $\times$	40.5 (+3.1)	59.9	43.6	22.5	45.0	53.5
FM-KT ( $K$ =4)	1 $\times$	40.5 (+3.1)	59.8	43.6	22.5	45.0	53.7

Table 6: Results of FM-KT with different detection frameworks on MS-COCO. “T” and “S” mean the “teacher” and “student” detector, respectively.

Appendix F Additional Related Work on Continuous Network Representations

With the development of deep learning, there are a number of architectures belonging to continuous network representation, such as RNN (Williams & Zipser, 1989), LSTM (S & J, 1997), Neural ODE (Chen et al., 2018), GflowNet family (Bengio et al., 2021; Zhang et al., 2022), diffusion model family (Song et al., 2023c; Karras et al., 2022; Ho et al., 2020; Song et al., 2023a), INN (Solodskikh et al., 2023), DiffKD (Huang et al., 2023) and KDiffusion (Yao et al., 2024). Here, we mainly emphasize the similarities and differences between our proposed FM-KT and these methods. In this way, we show the novelty of the FM-KT design and its advantages in application:

•

The application scenario of FM-KT is different from RNN, LSTM, Neural ODE, GflowNet family, diffusion model family and INN. Of these, only FM-KT, DiffKD and KDiffusion are applied to knowledge distillation in the form of continuous network representations.
•

RNN, LSTM, Neural ODE, GflowNet family, diffusion model family, INN, and FM-KT have a meta-encoder shared parameters. However, the difference is that the forward process (meaning the backward process in diffusion model family and FM-KT) in RNN, LSTM, and GflowNet family is unknown. Unlike Neural ODE, diffusion model family and FM-KT, there exists a human-designed sampling process (a.k.a predefined forward processes), which makes it impossible to use numerical integration to trade-off performance and efficiency.
•

INN primarily enables the continuous representation of convolutional operations (i.e. convolutional kernels), not the entire network. In contrast, the Neural ODE, diffusion model family, continuously represents the entire network.
•

The primary distinction between Neural ODE/FM-KT and the diffusion model family lies in their training paradigms. The diffusion model family is trained using unpaired samples, aiming to capture the entire data distribution. In contrast, Neural ODE/FM-KT utilizes paired samples, focusing on learning the Dirac distribution of the output.
•

The biggest difference between FM-KT and Neural ODE is that FM-KT has a deterministic a priori forward process to model the optimization objective of the intermediate points (i.e. $\{Z_{1-i/N}\}_{i}$ ), which ensures the stability of training. Neural ODE has no such a priori forward process, and simply expects the network itself to learn a continuous representation from input to output.

Appendix G Object Detection Comparison

The experimental results of object detection are presented in Table 6, where Mask RCNN-Swin-RetinaNet-Res50 pair represents the case of being distilled from a strong teacher, FasterRCNN-Res101-FasterRCNN-Res50 pair represents the homogeneous teacher-student pair, and FCOS-Res101-Retina-Res50 pair represents the heterogeneous teacher-student pair. We observe that FM-KT, which applies PKD as its loss function, shows improvement to some extent compared to the baseline FKD and achieves state-of-the-art performance across all teacher-student pairs. Note that knowledge transfer in object detection is facilitated by the high similarity between the feature maps of the student and the teacher. Consequently, the student’s mAP remains consistent for both $K$ = $4$ and $K$ = $8$ , so we do not present results for $K$ = $8$ .

Appendix H Stronger Strategies and Stronger Teacher Comparison

In recent years, with the advancement of deep learning, stronger training strategies and higher-quality foundational models have emerged. As a result, traditional distillation methods are no longer sufficient for capturing a superior student. In this context, we utilize the ResNet50 (with an accuracy of 80.1%) from TIMM (Wightman et al., 2021) training as a stronger teacher to distill the ResNet18. Simultaneously, we adopt some stronger strategies: the learning rate begins at 5e-4, the chosen optimizer is AdamW, the batch size is set as 1024, the number of training epochs is set as 350, the learning rate warms up over 3 epochs, and then decays at a rate of 0.9874 per epoch. For data augmentation, we employ a combination of RandomCrop, RandomClip, RandAugment (Cubuk et al., 2020), and RandomErasing (Zhong et al., 2020). It’s important to note that the loss function and the meta-encoder in FM-KT remain consistent with the main paper, being DIST and Swin-Transformer, respectively. Finally, the experimental results on ImageNet-1k are presented in Table 7.

	Teacher (ResNet50)	DIST	FM-KT ( $K$ =1)	FM-KT ( $K$ =2)	FM-KT ( $K$ =4)	FM-KT ( $K$ =8)
Top-1 Acc.	80.12%	72.89%	72.61%	73.11%	73.59%	73.71%

Table 7: Additional results of FM-KT in the stronger strategies and stronger teacher setting.

From Table 7, we can observe that FM-KT performs remarkably when both the teacher and the strategies are stronger. For instance, when $K$ =8, the student’s accuracy is 73.70%, which is 0.82% higher than the baseline DIST. This is a clear indication that FM-KT can be generalized to scenarios with strong strategies and a stronger teacher.

Appendix I Vision Transformer Comparison

Transformer has been brilliantly turned into an infrastructure for numerous computer vision (Liu et al., 2021) and natural language processing tasks (Brown et al., 2020; Achiam et al., 2023). In this section, we integrate FM-KT with the Vision Transformer (ViT) model DeiT (Touvron et al., 2021) and conduct comparative experiments between FM-KT and other prevalent knowledge distillation methods (e.g. ViTKD) tailored for ViT. For outcomes derived from all the relevant algorithms in our experiments, the training hyperparameters follow the basic settings outlined in MMClassification (Contributors, 2020). Additionally, we implement FM-KT considering DIST as the loss function and a 2-MLP as the meta-encoder. This implementation is simultaneously coupled with ViTKD (Yang et al., 2022b) to facilitate combined distillation. Ultimately, he experimental results on ImageNet-1k are presented in Table 8.

	Teacher (DeiT III-Small)	Student (DeiT-Tiny)	ViTKD	ViTKD+NKD	FM-KT ( $K$ =1)	FM-KT ( $K$ =2)	FM-KT ( $K$ =4)
Top-1 Acc.	81.35%	74.42%	76.06%	77.78%	77.84%	78.15%	78.15%

Table 8: Additional results of FM-KT in vision transformer comparison.

From Table 8, we can illustrate that FM-KT is apparently effective on the ViT setting. Specifically, FM-KT consistently surpasses the baseline DeiT-Tiny, ViTKD, and ViTKD+KD in the final performance with all feasible $K$ during inference.

Appendix J Visualization of Sampling Trajectory

To elucidate the sampling mechanism of FM-KT, we utilize the student obtained by training the ResNet34-ResNet18 pair on ImageNet-1k and visualize the student output’s sampling trajectory (i.e., $\{Z_{1-i/K}\}_{i=1}^{K}$ , where $K$ is set as 8) during inference. Note that the loss function and meta-encoder are set as DIST and Swin-Transformer in training, respectively. Since general visualization methods are designed for feature maps in intermediate layers, it’s challenging to demonstrate that better visualization directly correlates with improved image classification performance. Therefore, we employ the reliability histogram for the visualization of the sampling trajectory, thereby demonstrating that FM-KT, as an implicit ensemble method, indeed enhances the generalization ability of the student.

The reliability histogram typically plots predicted probabilities on the x-axis and the fraction of positives on the y-axis. Typically, the closer the predicted probability is to the fraction of positives, the better the student’s prediction. Therefore, the closer the peak of the student’s reliability histogram bin is to the diagonal, the stronger its generalization ability. Thus, it is clear from Fig. 9 that the reliability histogram is not well-presented at the beginning (i.e. $Z_{7/8}$ ) of the sampling trajectory. As $i$ in $Z_{1-i/K}$ gradually decreases, the representation of its reliability histogram improves, which indicates that both the generalization ability and the reliability of the student are enhanced.

Appendix K Implementation Detail

K.1 Training Strategies

We train on image classification datasets including CIFAR-100 (Krizhevsky et al., 2009) and ImageNet-1k (Russakovsky et al., 2015), and object detection dataset including MS-COCO (Lin et al., 2014). For Offline KD, the training strategy of image classification follows CRD (Tian et al., 2019) and DIST, while the training strategy of object detection follows PKD. Specifically, for CIFAR-100, the learning rate is 0.05 (when MobileNetV2 or ShuffleNetV1 as the student, the learning rate is 0.01), batch size is 64, total number of epochs is 240, and the learning rate is linearly reduced to 0.1 of its previous value at epochs 150, 180, and 210; for ImageNet-1k, the training learning rate is 0.1, batch size is 256, total number of epochs is 100, and the learning rate is linearly reduced to 0.1 of its previous value at epochs 30, 60, and 90; for MS-COCO, the training learning rate is 0.02, batch size is 16, total number of epochs is 24, and the learning rate is linearly reduced to 0.1 of its previous value at epochs 16 and 22. For Online KD, all hyperparameters settings follow AHBF-OKD (Gong et al., 2023) and are unchanged. For conviction, we report the mean test accuracy with 3 runs for all experimental results.

K.2 Loss Function and Meta-encoder

The loss weights of FM-KT and its variant OFM-KT are not explicitly set, and their values will follow the loss weight settings of the metric-based distillation method introduced by themselves. For instance, if FM-KT applies DIST as its $L(\cdot,\cdot)$ , the loss weights $\beta$ and $\gamma$ are both set to $2$ as mentioned in the original paper. For convenience of description, all forms “FM-KT ( $K$ =number)” or “OFM-KT ( $K$ =number)” refer to the corresponding algorithms that sampled “number” steps during inference.

For all comparative experiments on CIFAR-100, FM-KT and OFM-KT use Swin-Transformer as the meta-encoder and DIST as the metric-based distillation method, except for VGG13-VGG8 and VGG13-MobileNetV2 pairs in the Offline KD scenario. VGG13-VGG8 and VGG13-MobileNetV2 pairs in the Offline KD scenario use Swin-Transformer as the meta-encoder and DKD as the metric-based distillation method. For all comparative experiments on ImageNet-1k, in the Offline KD scenario, FM-KT uses MLP (i.e. 2-MLP) as the meta-encoder and DIST as the metric-based distillation method; in the Online KD scenario, OFM-KT uses Swin-Transformer as the meta-encoder and DIST as the metric-based distillation method.

For FM-KT^Θ, the loss function and meta-encoder are set to DKD and Swin-Transformer with all pairs on CIFAR-100; the loss function and meta-encoder are set to DIST and MLP with all pairs on ImageNet-1k; the balance weight $\alpha^{\Theta}$ is set as 1.0, 1.0 and 0.0 on all teacher-student pairs on CIFAR-100, ResNet34-ResNet18 pair on ImageNet-1k and ResNet50-MobileNetV2 pair on ImageNet-1k, respectively.

For object detection, unless otherwise specified, FM-KT uses CNN as the meta-encoder and PKD as the metric-based distillation method.

For the architecture of the meta-encoder, we adopt a task-specific setup. Swin-Transformer adopts one layer of [Swin Attention-Linear-ReLU-Linear] in the Offline KD scenario, and the number of heads is $4$ . In the Online KD scenario, if the student architecture is not ResNet18 then we add the same extra layer in the meta-encoder. CNN uses one layer of [SiLU-Conv-GroupNorm-SiLU-Conv] in the image classification datasets and two layer of [Depthwise Conv-LayerNorm-Pointwise Conv-GeLU-Pointwise Conv] in the object detection dataset. In image classification, the kernel size of first convolutional layer is 3 $\times$ 3, and the second layer is 1 $\times$ 1. And in object detection, the kernel size of the depthwise convolutional layer is 7 $\times$ 7. MLP adopts two layers of [Linear-ReLU-Linear] in the logit-based distillation scenario and one layer of [Linear-ReLU-Linear] in the feature-based distillation scenario. Besides, the shape transformation function $\mathcal{T}(\cdot)$ utilizes one layer of [Conv] or [Identity Function] (if no shape alignment is required) in the feature-based distillation scenario, and we use one layer of [AdaptAvgpool(1)-Linear] in the logit-based distillation scenario. Note that in the logit-based distillation scenario, FM-KT completes flow matching on the logit, so [AdaptAvgpool(1)-Linear] essentially represents the classification layer.

Appendix L Additional Training and Inference Computational Cost Discussion

Our proposed FM-KT, similar to DiffKD, incurs an additional computational burden during inference. However, our variant, FM-KT^Θ, differs in that it avoids this extra computational load during inference. This is achieved by transferring the knowledge in $Z_{0}$ (w.r.t. $t$ =0) in FM-KT to the vanilla classification head of the student. To provide a clear comparison of the computational costs of DiffKD, FM-KT, and FM-KT^Θ, we have conducted relevant measurements. The results are presented in Fig. 10. Notably, both FM-KT and FM-KT^Θ utilize logit-based distillation as the loss function (referred to as DIST) and employ a 2-layer MLP as the meta-encoder. Moreover, DiffKD adheres to the approach outlined in its original paper, employing both feature-based and logit-based distillation. The feature-based distillation in DiffKD, which relies on Bottleneck from ResNet, is implemented in the meta-encoder and is inserted into the backbone output feature before average pooling. Meanwhile, its logit-based distillation employs a 1-layer MLP as the meta-encoder and is inserted into the output logit of the classification head. As presented in Fig. 10, the computational overhead of DiffKD, in both training and inference, is drastically higher than that of FM-KT and FM-KT^Θ. Furthermore, FM-KT^Θ aligns with classical knowledge distillation algorithms in terms of computational cost during inference, offering additional savings in inference overhead compared to FM-KT.

Appendix M Best Meta-encoder Choice on ImageNet-1k

As illustrated in Figure 6, FM-KT achieves the highest effectiveness and efficiency on ImageNet-1k when implemented with MLP. Accordingly, this section presents the optimal performance of FM-KT by using MLP for meta-encoder on ImageNet-1k and examines the impact of varying the number of MLP layers on its performance.

Method	FM-KT	FM-KT	FM-KT	DiffKD
Meta-encoder	1-MLP	2-MLP	3-MLP	2-BottleNeck+ Conv+BN+MLP
ResNet34-ResNet18	72.48	73.17	73.28	72.49
ResNet50-MobileNetV1	73.74	74.22	74.28	73.78

Table 9: The influence of the number of layers in the meta-encoder (i.e. MLP) on student performance on ImageNet-1k. Note that all results from FM-KT are obtained when

K

=8.

The experimental results presented in Table 9 show that FM-KT outperforms DiffKD with 2-later MLP (i.e. 2-MLP). Furthermore, as detailed in Appendix L, the training and inference costs of FM-KT are nearly half those of DiffKD. This effectively demonstrates FM-KT’s capability to not only outperform DiffKD but also achieve state-of-the-art performance. Meanwhile, the performance of FM-KT improves as the number of layers in the MLP increases.

Appendix N Architecture-Sensitive Experiments between FM-KT and DiffKD

DiffKD uses FM-KT’s meta-encoder (i.e. 2-MLP)	FM-KT uses DiffKD’s meta-encoder (i.e. 2-Bottleneck+Conv+BN+MLP)	DiffKD uses DiffKD’s meta-encoder (i.e. 2-Bottleneck+Conv+BN+MLP)	FM-KT uses FM-KT’s meta-encoder (i.e. 2-MLP)
NAN	74.26%	73.78%	74.22%

Table 10: Comparison experimental result between FM-KT and DiffKD with ResNet50-MobileNetV1 pair on ImageNet-1k. Note that the results from FM-KT are obtained when

K

=8.

In order to know the sensitivity of FM-KT and DiffKD to architecture and for further fair comparisons, we perform DiffKD to use 2-MLP from FM-KT as its meta-encoder, and FM-KT to use 2-Bottleneck+Conv+BN+MLP from DiffKD as its meta-encoder. The experiments were then conducted on ImageNet-1k using ResNet50-MobileNetV1 pair. Unfortunately, when employing the logit-based distillation approach of DiffKD (following its official code and implementation (Huang et al., 2023)), its loss became NAN at epoch 1. However, in Table 10, we discover that the result (with $K$ =8) of FM-KT using 2-Bottleneck+Conv+BN+MLP from DiffKD as its meta-encoder, which significantly outperformed DiffKD.

Appendix O Unify VP SDE, VE SDE and Rectified flow in FM-KT

Vanilla diffusion processes such as VP SDE (Song et al., 2023c) and VE SDE (Song et al., 2023c) and vanilla continuous probability flows such as Rectified flow (Liu et al., 2022) can be transformed into our proposed FM-KT¹¹1Both VP SDE and VP SDE can be transformed into ODE form, referring to deterministic forward and backward processes.. Here, we give the derivation of FM-KT. All noise schedule can be written in the following form:

\displaystyle Z_{t}=\alpha_{t}X^{S}+\sigma_{t}X^{T},\ s.t.\ \lim_{t\rightarrow 0}\alpha_{t}=0,\lim_{t\rightarrow 0}\sigma_{t}=1,\lim_{t\rightarrow 1}\sigma_{t}=0.

(12)

The training paradigm of them can be denoted as

		$\displaystyle\operatorname*{arg\,min}_{v_{\theta}}\mathbb{E}_{(Z_{1},Z_{0},t)}\|\|g_{v_{\theta}}(Z_{t},t)-\nabla_{t}Z_{t}\|\|_{2}^{2}$		(13)
		$\displaystyle=\operatorname*{arg\,min}_{v_{\theta}}\mathbb{E}_{(Z_{1},Z_{0},t)}\|\|g_{v_{\theta}}(Z_{t},t)-(\nabla_{t}\alpha_{t}X^{S}+\nabla_{t}\sigma_{t}X^{T})\|\|_{2}^{2}.$		(13)

VP ODE:

(1) $\alpha_{t}=\textrm{exp}(-\frac{1}{4}a(1-t)^{2}-\frac{1}{2}b(1-t))$ ; (2) $\sigma_{t}=\sqrt{1-\alpha_{t}^{2}},\ s..t.\quad a=19.9,b=0.1$ .

VE ODE:

(1) $\alpha_{t}=a(\frac{b}{a})^{t}$ ; (2) $\sigma_{t}=1,\ s.t.\quad a=0.02,b=100$ .

Rectified flow:

(1) $\alpha_{t}=t$ ; (2) $\sigma_{t}=1-t$ .

Substituting $\alpha_{t}$ and $\sigma_{t}$ yields:

VP ODE:

\displaystyle\operatorname*{arg\,min}_{v_{\theta}}\mathbb{E}_{(Z_{1},Z_{0},t)}||g_{v_{\theta}}(Z_{t},t)-((\frac{1}{2}a(1-t)+\frac{1}{2}b)\alpha_{t}X^{S}-\frac{\alpha_{t}}{\sqrt{1-\alpha_{t}^{2}}}\alpha_{t}(\frac{1}{2}a(1-t)+\frac{1}{2}b)X^{T})||_{2}^{2}.

(14)

VE ODE:

\displaystyle\operatorname*{arg\,min}_{v_{\theta}}\mathbb{E}_{(Z_{1},Z_{0},t)}||g_{v_{\theta}}(Z_{t},t)-(\alpha_{t}[\log(b)-\log(a)]X^{S})||_{2}^{2}.

(15)

Rectified flow:

\displaystyle\operatorname*{arg\,min}_{v_{\theta}}\mathbb{E}_{(Z_{1},Z_{0},t)}||g_{v_{\theta}}(Z_{t},t)-(X^{S}-X^{T})||_{2}^{2}.

(16)

All forms can be transformed into serial training forms by Theorem 3.1 in our paper²²2For convenience, we ignore time steps here. It is worth noting that, due to the adaptability of step size in the Euler method, introducing this hyperparameter is entirely feasible.:

		$\displaystyle\mathcal{L}_{\textrm{FM-KT++}}=\mathbb{E}_{(X^{S},X^{T},Y)}\frac{1}{N}\sum_{i=0}^{N-1}L(\mathcal{T}((\nabla_{t}\alpha_{t}Z_{1}-g_{v_{\theta}}(Z_{1-i/N},1-i/N))/-\nabla_{t}\sigma_{t}),X^{T})$		(17)
		$\displaystyle+\underbrace{L(\mathcal{T}((\nabla_{t}\alpha_{t}Z_{1}-g_{v_{\theta}}(Z_{1-i/N},1-i/N))/-\nabla_{t}\sigma_{t}),Y)}_{\textrm{match the ground truth label (optional)}},$
		$\displaystyle\textrm{the sampling process:}\quad Z_{1-i/N}=Z_{1-(i-1)/N}-g_{v_{\theta}}(Z_{1-(i-1)/N},1-(i-1)/N)/N,\quad s.t.\quad i\geq 1,$

where $Z_{1}=\alpha_{1}X_{S}$ . Thus, the key to achieving knowledge transfer in knowledge distillation is not the difference between noise schedules, but the form of deterministic sampling in both the forward and backward processes and the serial training paradigm given in Theorem 3.1 of our paper.

Evaluation. In the practical implementation, since $\lim_{t\rightarrow 1}\nabla_{t}\alpha_{t}=+\infty$ in VP ODE, and considering that $\nabla_{t}\alpha_{t}$ and $\nabla_{t}\sigma_{t}$ show large variations at different $t$ in both VP ODE and VE ODE, both VP ODE and VE ODE are expressed in the forms of differentiations $\frac{\alpha_{t}-\alpha_{t-\Delta t}}{t-\Delta t}$ and $\frac{\sigma_{t}-\sigma_{t-\Delta t}}{t-\Delta t}$ . Since $\nabla_{t}\sigma_{t}\equiv 0$ in VE ODE cannot be divided, we modified $\sigma_{t}$ from $\sigma(t)=1$ to $\sigma(t)=1-0.1t$ . In addition, our experiments revealed instability in the flow loss of VE ODE and VP ODE training, necessitating the use of the learning rate warm-up technique (extending up to 20 epochs) for effective training. And $b$ in VE ODE is extra reduced to $10$ . The test accuracy per epoch for VP ODE, VE ODE and Rectified flow (i.e., the default form used in our paper) is illustrated in Fig. 11, which is the same as Fig. 3 in our main paper.

The experimental results in Fig. 11 are obtained on CIFAR-100 with WRN-40-2-WRN-16-2 pair. VP ODE, VE ODE, and Rectified flow all utilize 2-MLP as the meta-encoder, and DIST as the loss function (modifying the hyperparameter temperature to 1 for stable training). It can be observed that the training paradigm proposed in Eq. 17 is capable of effectively training all noise schedules. In particular, Rectified flow is comparatively more stable and efficient than VP ODE and VE ODE.

Discussion. Through the above analysis, coupled with our new insights, the reasons for adopting Rectified flow in our work are as follows: 1) Rectified flow is more effective and stable compared to VP ODE and VE ODE; 2) Rectified flow is simple for implementation and understanding; 3) in the derivation of approximating ensembles (i.e. Proposition 3.2), Rectified flow can be proved that the truncation error at each time step has the same impact on the ultimate error (i.e., equal weight); 4) Rectified flow enhances the student performance by its accelerated sampling property when NFE is very small. Specifically, Rectified flow has the ability to minimize the hessian matrix (Lee et al., 2023) with respect to $Z_{t}$ , which enables the estimation $g_{v_{\theta}}(Z_{t},t)$ of $dZ_{t}$ to also accurately estimate $\{dZ_{t-\Delta_{t}},dZ_{t-2\Delta_{t}},\cdots,dZ_{s}\}$ , where $t$ and $s$ refer to the source and target time points, respectively, ultimately reducing the truncation error of $Z_{t}+\int_{t}^{s}g_{v_{\theta}}(Z_{\tau},\tau)d\tau$ . This point is demonstrated by the experiments in relevant papers (Liu et al., 2022; Lee et al., 2023).

Appendix P Limitation

FM-KT demonstrates improved generalization capabilities relative to conventional knowledge distillation methods, yet it incurs a higher computational burden during inference. Moreover, FM-KT’s effectiveness in object detection is not as pronounced as in image classification. This discrepancy stems from the fact that, in image classification, flow matching with the teacher at the logit level often yields performance akin to the teacher’s. In contrast, in object detection, flow matching with the teacher at the FPN (Feature Pyramid Network) level does not directly translate to enhanced performance in the ultimate metric, mAP.

		$\displaystyle-\log p_{v_{\theta}}(Z_{0})\leq\mathbb{E}_{q(Z_{1/N:1}\|Z_{0})}\left[\log\frac{q(Z_{1}\|Z_{0})}{p_{v_{\theta}}(Z_{1})p_{v_{\theta}}(Z_{0}\|Z_{1/N})}+\sum_{i=1}^{N}\log\frac{q(Z_{(i-1)/N}\|Z_{i/N},Z_{0})}{p_{v_{\theta}}(Z_{(i-1)/N}\|Z_{i/N})}\right]$		(8)
		$\displaystyle=\mathbb{E}_{q(Z_{1/N:1}\|Z_{0})}\left[\log\frac{q(Z_{1}\|Z_{0})}{p_{v_{\theta}}(Z_{1})p_{v_{\theta}}(Z_{0}\|Z_{1/N})}\right]+\sum_{i=1}^{N}\mathbb{E}_{q(Z_{i/N}\|Z_{0})}\mathbb{E}_{q(Z_{(i-1)/N}\|Z_{i/N},Z_{0})}\left[\log\frac{q(Z_{(i-1)/N}\|Z_{i/N},Z_{0})}{p_{v_{\theta}}(Z_{(i-1)/N}\|Z_{i/N})}\right]$
		$\displaystyle=\mathbb{E}_{q(Z_{1/N:1}\|Z_{0})}\left[\log\frac{q(Z_{1}\|Z_{0})}{p_{v_{\theta}}(Z_{1})p_{v_{\theta}}(Z_{0}\|Z_{1/N})}\right]+\sum_{i=1}^{N}\mathbb{E}_{\hat{Z}_{i/N}\sim\int p_{v_{\theta}}(Z_{i/N}\|Z_{1})q(Z_{1}\|Z_{0})dZ_{1}}$
		$\displaystyle\left[D_{\mathrm{KL}}(q(Z_{(i-1)/N}\|Z_{i/N},Z_{0})\|\|p_{v_{\theta}}(Z_{(i-1)/N}\|\hat{Z}_{i/N}))\right],\quad s.t.\quad\textrm{Law}(Z_{i/N})\stackrel{{\scriptstyle\sim}}{{=}}\textrm{Law}(\hat{Z}_{i/N})$
		$\displaystyle\approx\mathbb{E}_{q(Z_{1/N:1}\|Z_{0})}\left[\log\frac{q(Z_{1}\|Z_{0})}{p_{v_{\theta}}(Z_{1})p_{v_{\theta}}(Z_{0}\|Z_{1/N})}\right]+\sum_{i=1}^{N}\mathbb{E}_{\hat{Z}_{i/N}\sim\int p_{v_{\theta}}(Z_{i/N}\|Z_{1})q(Z_{1}\|Z_{0})dZ_{1}}$
		$\displaystyle\left[\|\|q(Z_{(i-1)/N}\|Z_{i/N},Z_{0})-p_{v_{\theta}}(Z_{(i-1)/N}\|\hat{Z}_{i/N})\|\|_{2}^{2}\right],\quad s.t.\quad\textrm{Law}(Z_{i/N})\stackrel{{\scriptstyle\sim}}{{=}}\textrm{Law}(\hat{Z}_{i/N}).$

Precise Knowledge Transfer via Flow Matching

Abstract

1 Introduction

2 Preliminaries

Review the Knowledge Transfer.

Continuous Normalized Flows.

Noise Schedules.

3 Methodology

3.1 Serial Training Paradigm

Theorem 3.1.

3.2 Choice of Noise Schedule

3.3 Serve to Feature-/Logit-based Distillation

3.4 Approximate to Ensemble

Proposition 3.2.

3.5 Lightweight FM-KTΘ without Additional Inference Burden

3.6 Translate to Online Knowledge Distillation

4 Experiments

4.1 Image Classification Comparison

Offline Knowledge Distillation.

Online Knowledge Distillation.

4.2 Ablation Studies

5 Related Work

Knowledge distillation.

Continuous Network Representation.

6 Conclusion

Impact Statement.

References

Appendix A Pseudo Code of FM-KT

Appendix B Theoretical Guarantees of FM-KT

Appendix C Link FM-KT to Ensemble

Appendix D Pair Decoupling

Appendix E Additional Ablation Experiment

Appendix F Additional Related Work on Continuous Network Representations

Appendix G Object Detection Comparison

Appendix H Stronger Strategies and Stronger Teacher Comparison

Appendix I Vision Transformer Comparison

Appendix J Visualization of Sampling Trajectory

Appendix K Implementation Detail

K.1 Training Strategies

K.2 Loss Function and Meta-encoder

Appendix L Additional Training and Inference Computational Cost Discussion

Appendix M Best Meta-encoder Choice on ImageNet-1k

Appendix N Architecture-Sensitive Experiments between FM-KT and DiffKD

Appendix O Unify VP SDE, VE SDE and Rectified flow in FM-KT

VP ODE:

VE ODE:

Rectified flow:

Appendix P Limitation

3.5 Lightweight FM-KT^Θ without Additional Inference Burden