
Quantifying Knowledge Distillation Using
Partial Information Decomposition

Pasan Dissanayake, University of Maryland
Faisal Hamman, University of Maryland
Barproda Halder, University of Maryland
Ilia Sucholutsky, Princeton University
Qiuyi Zhang, Google Research
Sanghamitra Dutta, University of Maryland
Abstract

Knowledge distillation enables deploying complex machine learning models in resource-constrained environments by training a smaller student model to emulate the internal representations of a complex teacher model. However, the teacher’s representations can also encode nuisance or additional information not relevant to the downstream task. Distilling such irrelevant information can actually impede the performance of a capacity-limited student model. This observation motivates our primary question: What are the information-theoretic limits of knowledge distillation? To this end, we leverage Partial Information Decomposition to quantify and explain the transferred knowledge and knowledge left to distill for a downstream task. We theoretically demonstrate that the task-relevant transferred knowledge is succinctly captured by the measure of redundant information about the task between the teacher and student. We propose a novel multi-level optimization to incorporate redundant information as a regularizer, leading to our framework of Redundant Information Distillation (RID). RID leads to more resilient and effective distillation under nuisance teachers as it succinctly quantifies task-relevant knowledge rather than simply aligning student and teacher representations.

Accepted at the 28th International Conference on Artificial Intelligence and Statistics (AISTATS) 2025, Mai Khao, Thailand. 1 {pasand, fhamman, bhalder, sanghamd}@umd.edu; 2 [email protected]; 3 [email protected]

1 Introduction

Modern-day machine learning requires large amounts of compute for both training and inference. Knowledge distillation [1, 2] can be used to compress a complex machine learning model (the teacher) by distilling it into a relatively simpler model (the student). The term “distillation” in this context means obtaining some assistance from the teacher while training the student so that the student performs much better than when trained alone (see Figure 1). In its earliest forms, knowledge distillation involved the student trying to match the output logits of the teacher [1]. More advanced methods focus on distilling multiple intermediate representations of the teacher to the corresponding layers of the student [2, 3, 4, 5]. We also refer the reader to [6, 7] for surveys.

Information theory has been instrumental in both designing [3, 4] and explaining [8, 9] knowledge distillation techniques. However, less attention has been given to characterizing the fundamental limits of the process from an information-theoretic perspective. Our goal is to bridge this gap by first introducing new measures to quantify the “transferred knowledge” and the “knowledge to distill” for a teacher and a student model given a target downstream task. We bring in an emerging body of work called Partial Information Decomposition (PID) [10, 11, 12] to explain knowledge distillation. We define the knowledge to distill using the PID measure of “unique” information about the task that is available only with the teacher but not the student. It then follows that the transferred knowledge is succinctly quantified by the measure of “redundant” information that is common between the teacher and student.

We propose a multi-level optimization that maximizes redundant information (transferred knowledge) as a regularizer for more effective distillation. While PID has been explored in a few avenues of machine learning,

Figure 1: Knowledge Distillation: The teacher (a complex model) assists the student (usually a substantially simpler model) during its training. The distilled student can perform much better than a student trained independently without distillation under a similar training setup (i.e., hyper-parameters and data). The teacher may or may not have been trained for the same task as the student.

it has remained a challenge to maximize these measures as a regularizer since computing them itself requires solving an optimization. Our optimization leads to a novel knowledge distillation framework – Redundant Information Distillation (RID) – which precisely captures task-relevant knowledge and filters out the task-irrelevant information from the teacher. In summary, our main contributions are as follows:

•  Quantifying transferred knowledge and knowledge to distill: Given a downstream task, and a teacher and a student model, we formally define the knowledge to distill as the unique information in the teacher (Definition 3.1) and the transferred knowledge as the redundant information (Definition 3.2). Through examples and theoretical results (Theorem 3.1), we first show that redundant information succinctly captures task-related knowledge transferred to the student as opposed to the existing frameworks which directly align teacher (T) and student (S) representations, e.g., Variational Information Distillation (VID) [3] maximizes the mutual information I(T;S). Theorem 3.1 points out a fundamental limitation of the existing knowledge distillation frameworks for capacity-limited students: they blindly align the student and the teacher without precisely capturing the task-related knowledge.

•  Maximizing redundant information as a regularizer: To alleviate this limitation, we propose a strategy to incorporate redundant information as a regularizer during model distillation to precisely maximize the task-relevant transferred knowledge. We first circumvent the challenge of computing the redundant information measure proposed in [12] by utilizing the quantity termed intersection information defined in [13], which we prove (in Theorem 4.1) to be a lower-bound for redundant information. The significance of Theorem 4.1 is that it enables us to obtain an optimization formulation to maximize a lower-bound of redundant information without making distributional assumptions, a contribution that is also of independent interest outside the domain of knowledge distillation.

•  A novel knowledge distillation framework: We propose a new framework called Redundant Information Distillation (RID) whose distillation loss is tailored to maximize the redundant information (i.e., the transferred knowledge as per Definition 3.2). We carry out a number of experiments to demonstrate the advantage of this new framework over VID [3], an existing knowledge distillation framework that maximizes the mutual information between the teacher and the student representations (Section 5). Experiments are carried out on the CIFAR-10 and CIFAR-100 datasets, as well as in a transfer learning setting where the teacher is trained on the ImageNet dataset and distilled to the students on the CUB-200-2011 dataset. Our framework explains knowledge distillation and shows more resilience under less-informative nuisance teachers.

Related Works: Multi-layer knowledge distillation was introduced in FitNets [2]. Since then, a large number of techniques, based on different statistics derived for matching a teacher-student pair, have been proposed. In particular, [3, 4, 14, 15] leverage an information-theoretic perspective to arrive at a solution (also see the surveys [6, 7]). In this paper, we focus on VID [3] as a representative of the larger class of distillation frameworks which maximize I(T;S) as the distillation strategy. We also discuss Task-aware Layer-wise Distillation (TED) [5] as a framework that distills only task-related information. Specifically, [5] highlight the importance of distilling only the task-related information when there is a significant complexity gap between the teacher and the student. Towards this end, [16] point out the existence of non-distillable classes due to the unmatched capacity of the student model. We discuss some more related works on knowledge distillation [17, 18, 19, 20, 21] in Appendix A.

Information theory has also been instrumental in explaining the success of knowledge distillation. [9] utilize information bottleneck principles [22, 23] to explain how a teacher model may assist the student in learning relevant features quickly. [8] view the training process as systematically discarding knowledge from the input; accordingly, distillation helps the student to quickly learn what information to discard. Despite these attempts, we observe that there exists a gap in characterizing the fundamental limits of knowledge distillation, which we seek to address using the mathematical tool of PID.

PID is also beginning to generate interest in other areas of machine learning [24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41]. However, it has not been leveraged in the context of knowledge distillation before. Additionally, while most related works predominantly focus on efficiently computing PID, which itself requires solving an optimization over the joint distribution, e.g., [40, 28, 36, 42], there are limited works that further incorporate it as a regularizer during model training. [25] leverage Gaussian assumptions to obtain closed-form expressions for the PID terms, enabling them to use unique information as a regularizer during training for fairness (also see [38, 35] for more details on Gaussian PID). Our work makes novel connections between two notions of redundant information and demonstrates how PID can be integrated as a regularizer in a multi-level optimization without Gaussian assumptions, which could also be of independent interest outside the context of knowledge distillation.

2 Preliminaries

Background on PID: Partial Information Decomposition (PID), first introduced by [10], offers a way to decompose the joint information in two sources, say T and S, about another random variable Y

Figure 2: Partial Information Decomposition: The box denotes the total joint mutual information I(Y;T,S), which is decomposed into four non-negative terms named synergistic information Syn(Y:T,S), redundant information Red(Y:T,S), and the two unique information terms Uni(Y:T\S) and Uni(Y:S\T).

(i.e., I(Y;T,S), where I(A;B) denotes the mutual information between A and B [43]) into four components as follows:

  1. Unique information Uni(Y:T\S) and Uni(Y:S\T): the information about Y that each source uniquely contains;

  2. Redundant information Red(Y:T,S): the information about Y that both T and S share;

  3. Synergistic information Syn(Y:T,S): the information about Y that can be recovered only by using both T and S.

See Figure 2 for a graphical representation. These PID components satisfy the relationships given below:

I(Y;T,S) = Uni(Y:T\S) + Uni(Y:S\T) + Red(Y:T,S) + Syn(Y:T,S)   (1)
I(Y;T) = Uni(Y:T\S) + Red(Y:T,S)   (2)
I(Y;S) = Uni(Y:S\T) + Red(Y:T,S).   (3)

While this system of equations cannot be solved to arrive at a deterministic definition for each PID term, defining only one of the terms is sufficient to define the rest. Consequently, a wide array of definitions exists, each based on different desired properties [10, 12, 11, 13]. Among these, the definition proposed in [12] is motivated by an operational interpretation of unique information from decision theory. Moving on to the context of knowledge distillation, we map T to the teacher representation, S to the student representation, and Y to the downstream task that the student is being trained for. This makes I(Y;T) and I(Y;S) the total knowledge about Y in the teacher and in the student, respectively.

Notation and Problem Setting: We consider a layer-wise distillation scheme where the teacher representation T(X) is distilled into the student representation S_{η_s}(X), where X is the input. The target of the student is to predict the task Y from X. Both T(·) and S_{η_s}(·) are deterministic functions of X, and the randomness is due to the input being random. Note that the student representation depends on the parameters of the student network, denoted by η_s, and is hence written as S_{η_s}. However, when this parameterization and the dependence on X are irrelevant or obvious, we may omit both and simply write T and S. We denote the supports of Y, T, and S by 𝒴, 𝒯, and 𝒮, respectively. In general, upper-case letters denote random variables, except P and Q which represent probability distributions, C, H, W which stand for the representation dimensions, and K which indicates the number of layers distilled. Lowercase letters are used for vectors unless specified otherwise. Lowercase Greek letters denote the parameters of neural networks.

Knowledge distillation is achieved by modifying the student loss function to include a distillation loss term in addition to the ordinary task-related loss as follows:

L(η_s) = λ_1 L_ordinary(Y, Ŷ(X)) + λ_2 L_distill(Y, Ŷ(X), S_{η_s}, T).   (4)

Here, λ_1, λ_2 > 0. When the task at hand is classification, Y denotes the true class label and we use the cross-entropy loss, defined as L_CE(Y, Ŷ) = −E_{P_X}[ log P_{Ŷ(X)}(Y) ], as the ordinary task-related loss for the student. Here, Ŷ(X) is the student’s final prediction of Y. The teacher network is assumed to remain unmodified during the distillation process.
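For illustration, the combined objective in (4) can be written as the following minimal PyTorch-style sketch. This is our own illustrative code rather than the paper's implementation; distill_fn is a hypothetical placeholder for a framework-specific distillation term such as the VID or RID losses discussed later.

import torch.nn.functional as F

def student_objective(student_logits, targets, s_repr, t_repr, distill_fn,
                      lam1=1.0, lam2=1.0):
    # Eq. (4): ordinary task loss plus a distillation term comparing
    # the student and teacher representations.
    ordinary = F.cross_entropy(student_logits, targets)   # L_CE(Y, Y_hat(X))
    distill = distill_fn(s_repr, t_repr, targets)         # L_distill(Y, Y_hat, S, T)
    return lam1 * ordinary + lam2 * distill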

3 Explaining Knowledge Distillation

In this section, we propose information-theoretic metrics to quantify both the task-relevant information that is available in the teacher for distillation, and the amount of information that has already been transferred to the student. We mathematically demonstrate favorable properties of our proposed measures in comparison to other candidate measures. Our mathematical results highlight the limitations of existing knowledge distillation frameworks that naively align the student with the teacher with no regard for task-relevance.

Definition 3.1 (Knowledge to distill).

Let Y, S, and T be the target variable, the student’s intermediate representation, and the teacher’s intermediate representation, respectively. The knowledge to distill from T to S is defined as Uni(Y:T\S), the unique information about Y that is in T but not in S.

With the knowledge to distill defined as the unique information Uni(Y:T\S), we see that the more distillation happens, the more Uni(Y:T\S) shrinks. Note that under the knowledge distillation setting, the total knowledge of the teacher I(Y;T) is constant since the teacher is not modified during the process. Since I(Y;T) = Uni(Y:T\S) + Red(Y:T,S), we therefore propose Red(Y:T,S) as a measure of the knowledge that has already been transferred.

Definition 3.2 (Transferred knowledge).

Let Y, S, and T be the target variable, the student’s intermediate representation, and the teacher’s intermediate representation, respectively. The transferred knowledge from T to S is defined as Red(Y:T,S), the redundant information about Y between T and S.

We leverage the unique and redundant information definitions given by [12] for an exact quantification of these quantities.

Definition 3.3 (Unique and redundant information [12]).

Let P be the joint distribution of Y, T, and S, and let Δ be the set of all joint distributions over 𝒴 × 𝒯 × 𝒮. Then,

Uni(Y:T\S) := min_{Q ∈ Δ_P} I_Q(Y;T|S)   (5)
Red(Y:T,S) := I(Y;T) − min_{Q ∈ Δ_P} I_Q(Y;T|S)   (6)

where Δ_P = {Q ∈ Δ : Q(Y=y, T=t) = P(Y=y, T=t) and Q(Y=y, S=s) = P(Y=y, S=s) for all y ∈ 𝒴, t ∈ 𝒯, and s ∈ 𝒮}; i.e., Δ_P is the set of all joint distributions whose marginals over the pairs (Y,T) and (Y,S) equal those of P.

Comparison to Existing Approaches for Knowledge Distillation: A multitude of knowledge distillation frameworks exist which are based on maximizing the mutual information between the teacher and the student (i.e., I(T;S)) [3, 4, 14, 15]. While a distillation loss that maximizes I(T;S) can be helpful to the student when the teacher possesses task-related information, we show that it creates a tension with the ordinary loss when the teacher has little or no task-relevant information. Moreover, even when the teacher contains task-related information, the limited capacity of the student may hinder proper distillation when this kind of framework is used. The following examples provide critical insights, exposing the limitations of I(T;S). Our proposed measure Red(Y:T,S) resolves these cases in an explainable manner by succinctly capturing task-relevant knowledge.

Example 1: (Uninformative teacher) An uninformative teacher representation (i.e., T with I(Y;T) = 0) gives Uni(Y:T\S) = Red(Y:T,S) = 0 for any S, agreeing with intuition. Hence, an algorithm that maximizes exactly the transferred knowledge Red(Y:T,S) will have a zero gradient over this term. In contrast, algorithms that maximize the similarity between S and T quantified by I(T;S) will force S to mimic the uninformative teacher, causing performance worse than ordinary training without distillation. As a simplified example, let U_1, U_2 ~ Ber(0.5) and Y = U_1, T = U_2. Then, the teacher cannot predict the intended task Y. Note that, in this case, I(T;S) is not maximized when the student representation is S = Y. Instead, it is maximized when S = U_2.

Example 2: (Extra complex teacher) Let U_1 ~ Ber(0.2), U_2 ~ Ber(0.5) and Y = U_1, T = (U_1, U_2). Then, the teacher can completely predict the intended task Y. Assume the student is simpler than the teacher and has only one binary output. In this situation, I(T;S) is not maximized when S = U_1 because I((U_1,U_2); U_1) ≈ 0.72 < 1 = I((U_1,U_2); U_2), where the right-hand side is achieved when S = U_2. However, S = U_1 is a maximizer for Red(Y:T,S) (i.e., Red(Y:T,S) = Red(U_1:T,U_1) = I(Y;T)). Theorem 3.1 presents a more general case.
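The mutual information values in Example 2 follow from elementary entropy calculations; a quick numerical check (our own illustrative snippet, not from the paper) is given below.

from math import log2

def h2(p):
    # Binary entropy in bits.
    return -(p * log2(p) + (1 - p) * log2(1 - p))

# Example 2: Y = U1 ~ Ber(0.2), U2 ~ Ber(0.5), T = (U1, U2).
# Since U1 and U2 are independent, I((U1,U2); S) = H(S) for S in {U1, U2}.
print(h2(0.2))   # I((U1,U2); U1) ~= 0.72 bits  (S = U1: task-relevant, but lower I(T;S))
print(h2(0.5))   # I((U1,U2); U2) = 1 bit       (S = U2: task-irrelevant, but maximizes I(T;S))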

Theorem 3.1 formally exposes the limitations of maximizing I(T;S) for capacity-limited students, as I(T;S) does not emphasize task-related knowledge.

Theorem 3.1 (Teacher with nuisance).

Let T = (Z, G) where Z contains all the task-related information (i.e., I(Y;T) = I(Y;Z)) and G does not contain any information about the task (i.e., I(Y;G) = 0). Let the student be a capacity-limited model as defined by H(S) ≤ max{H(Z), H(G)}, where H(X) denotes the Shannon entropy of the random variable X. Then,

  (i) I(T;S) is maximized when S = Z if H(Z) > H(G), and S = G if H(Z) < H(G).   (7)

  (ii) Red(Y:T,S) is always maximized when S = Z.

The uninformative random variable G here can be seen as a stronger version of the nuisance defined in [44, Section 2.2]. In the above scenario, the task-related part of the student loss will be in tension with the distillation loss when H(Z) < H(G), in which case the distillation actually adversely affects the student. In contrast, a loss that maximizes Red(Y:T,S) will always be aligned with the task-related loss.

These examples show that the frameworks based on maximizing I(T;S) are not capable of selectively distilling the task-related information to the student. In an extreme case, they are not robust to being distilled from a corrupted teacher network. This is demonstrated in the experiments in Section 5.

It may appear that using the information gain (conditional mutual information) I(Y;T|S) = I(Y;T,S) − I(Y;S) as the measure for the amount of knowledge available to distill resolves cases similar to Example 1. However, Example 3 below provides a counterexample.

Example 3: (Effect of synergy) Consider a scenario similar to Example 1, where the teacher is uninformative regarding the task of interest. For example, let U_1, U_2 ~ Ber(0.5) and Y = U_1, T = U_1 ⊕ U_2, where ⊕ denotes the binary XOR operation. Suppose we were to use the conditional mutual information I(Y;T|S) as the measure of the amount of knowledge available to distill from the teacher. Then, I(Y;T|S) = H(Y) when S = U_2, indicating non-zero knowledge available to distill in the teacher. This is counterintuitive since, in this case, I(Y;T) = I(Y;S) = 0 and neither the teacher nor the student alone can be used to predict Y. In contrast, the proposed measures give Uni(Y:T\S) = Red(Y:T,S) = 0, indicating no available or transferred knowledge.
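For a concrete check of these claims, the optimization in Definition 3.3 can be solved by brute force for this toy distribution. The sketch below is our own illustrative code, using scipy's SLSQP solver with a small smoothing constant for numerical stability; a dedicated estimator such as the one in [28] is preferable for real representations.

import numpy as np
from scipy.optimize import minimize

# Example 3: Y = U1, T = U1 XOR U2, S = U2 with U1, U2 ~ Ber(0.5).
P = np.zeros((2, 2, 2))                                   # joint P[y, t, s]
for u1 in (0, 1):
    for u2 in (0, 1):
        P[u1, u1 ^ u2, u2] += 0.25

def cond_mi(q_flat):
    # I_Q(Y;T|S) in bits for a flattened joint array q[y, t, s].
    q = np.clip(q_flat, 0, None).reshape(P.shape) + 1e-12
    q_s = q.sum(axis=(0, 1), keepdims=True)
    q_ys = q.sum(axis=1, keepdims=True)
    q_ts = q.sum(axis=0, keepdims=True)
    return float((q * np.log2(q * q_s / (q_ys * q_ts))).sum())

def mi_yt(p):
    # I(Y;T) in bits from the (Y, T) marginal of p[y, t, s].
    p_yt = p.sum(axis=2) + 1e-12
    return float((p_yt * np.log2(p_yt / (p_yt.sum(1, keepdims=True) *
                                         p_yt.sum(0, keepdims=True)))).sum())

# Delta_P: joint distributions matching P on the (Y,T) and (Y,S) marginals.
cons = [{"type": "eq", "fun": lambda q: (q.reshape(P.shape).sum(axis=2) - P.sum(axis=2)).ravel()},
        {"type": "eq", "fun": lambda q: (q.reshape(P.shape).sum(axis=1) - P.sum(axis=1)).ravel()}]
res = minimize(cond_mi, P.ravel(), method="SLSQP",
               bounds=[(0.0, 1.0)] * P.size, constraints=cons)

print("I_P(Y;T|S) =", round(cond_mi(P.ravel()), 3))       # = H(Y) = 1 bit (misleadingly large)
print("Uni(Y:T\\S) =", round(res.fun, 3))                 # should approach 0
print("Red(Y:T,S) =", round(mi_yt(P) - res.fun, 3))       # I(Y;T) - Uni, also approaches 0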

Next, we present Theorem 3.2 which highlights some important properties of the proposed metrics. These properties indicate that the proposed measures agree well with our intuition for explaining distillation.

Theorem 3.2 (Properties).

The following properties hold for the knowledge to distill and the transferred knowledge, defined as in Definitions 3.1 and 3.2, respectively.

  (i) Uni(Y:T\S) and Red(Y:T,S) are non-negative.

  (ii) When Uni(Y:T\S) = 0, the teacher has zero knowledge available for distillation. At this point, the student has the maximum information that any one of the representations T or S has about Y; i.e., max{I(Y;T), I(Y;S)} = I(Y;S).

  (iii) For a given student representation S and any two teacher representations T_1 and T_2, if there exists a deterministic mapping h such that T_1 = h(T_2), then Uni(Y:T_1\S) ≤ Uni(Y:T_2\S).

4 A Framework For Maximizing Transferred Knowledge

(a) Phase 1: Train the teacher’s filters while the student is kept frozen.
(b) Phase 2: Train the student along with the student’s filters while the teacher is kept frozen.
Figure 3: Redundant Information Distillation (RID) framework: f_t(·) and f_s(·) are the teacher and the student filter outputs, respectively. A classification head g_t(·) is appended to the teacher’s filter during the first phase. Highlighted in amber are the components that are being updated in each phase.

In this section, we propose a distillation framework – Redundant Information Distillation (RID) – which maximizes the transferred knowledge Red(Y:T,S), with a focus on classification problems. We first show that our measure of transferred knowledge is lower-bounded by an alternative definition of redundant information (also called the intersection information, denoted by Red_∩(Y:T,S)). Next, we leverage this lower-bound to develop a multi-level optimization framework to selectively distill task-relevant knowledge.

Definition 4.1 (I_α measure [13]).
Red_∩(Y:T,S) = max_{P(Q|Y)} I(Y;Q)  such that  I(Y;Q | f_t(T)) = I(Y;Q | f_s(S)) = 0.   (8)

Next, we show in Theorem 4.1 that Red_∩(Y:T,S) is a lower-bound for Red(Y:T,S). The significance of this result is that maximizing the lower-bound Red_∩(Y:T,S) would also increase Red(Y:T,S).

Theorem 4.1 (Transferred knowledge lower bound).

For any three random variables Y, T, and S,

Red_∩(Y:T,S) ≤ Red(Y:T,S)   (9)

where Red_∩(Y:T,S) is defined as per Definition 4.1 and Red(Y:T,S) is defined in Definition 3.3.

Next, we discuss our proposed approach for maximizing the lower-bound Red_∩(Y:T,S) (also see Figure 3 and Algorithm 1 for our complete strategy). Our proposed framework is based on selecting Q in Definition 4.1 to be Q = f_t(T), and parameterizing f_t(·) and f_s(·) using small neural networks. To denote the parameterization, we will occasionally use the elaborated notation f_t(·; θ_t) and f_s(·; θ_s), where θ_t and θ_s denote the parameters of f_t and f_s, respectively. With the substitution of Q = f_t(T), Definition 4.1 results in the optimization problem given below:

max_{θ_t, θ_s, η_s} I(Y; f_t(T; θ_t))  such that  I(Y; f_t(T; θ_t) | f_s(S_{η_s}; θ_s)) = 0.   (P1)

We divide problem (P1) into two phases and employ gradient descent on two carefully designed loss functions to perform the optimization. In the first phase, we maximize the objective w.r.t. θ_t while θ_s and S are kept constant (recall that T is fixed in all cases because the teacher is not being trained during the process). For this, we append an additional classification head g_t(·; φ_t), parameterized by φ_t, to the teacher’s task-aware filter f_t. We then minimize the following loss function with respect to θ_t and φ_t:

L_t(θ_t, φ_t) = L_CE(Y, g_t(f_t(T; θ_t); φ_t)) + Σ_{c=1}^{C} Σ_{h=1}^{H} Σ_{w=1}^{W} E_{P_X}[ V_{c,h,w}^2 / σ_c ]   (10)

where V_{c,h,w} denotes the corresponding element of V = f_t(T(X); θ_t) − f_s(S(X); θ_s) ∈ ℝ^{C×H×W}. Here, C, H, and W are the number of channels, the height, and the width of the outputs of f_s and f_t. σ = [σ_1, …, σ_C]^T is a stand-alone vector of weights that are optimized in the second phase. Minimizing the cross-entropy term L_CE(Y, g_t(f_t(T; θ_t); φ_t)) of L_t(θ_t, φ_t) above amounts to maximizing I(Y; f_t(T; θ_t)). The second term prevents f_t(T) from diverging too far from f_s(S) during the process, so that the constraint I(Y; f_t(T; θ_t) | f_s(S; θ_s)) = 0 can be ensured.
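For concreteness, a PyTorch-style sketch of the phase-1 objective (10) is given below; the tensor shapes, the pooling inside the head g_t, and the handling of σ are our assumptions rather than the paper's exact implementation.

import torch
import torch.nn.functional as F

def phase1_loss(y, t_repr, s_repr, f_t, g_t, f_s, sigma):
    # Eq. (10): trains the teacher filter f_t and its head g_t; the student,
    # its filter f_s, and the per-channel weights sigma stay frozen here.
    q = f_t(t_repr)                                   # f_t(T; theta_t), shape (B, C, H, W)
    with torch.no_grad():
        s_out = f_s(s_repr)                           # f_s(S; theta_s), frozen in phase 1
    ce = F.cross_entropy(g_t(q), y)                   # pushes up I(Y; f_t(T))
    v = q - s_out                                     # V = f_t(T) - f_s(S)
    align = (v.pow(2) / sigma.view(1, -1, 1, 1)).sum(dim=(1, 2, 3)).mean()
    return ce + align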

During the second phase, we freeze θ_t and maximize the objective over θ_s, S_{η_s}, and σ. The loss function employed in this phase is as follows:

L(θ_s, σ, η_s) = λ_1 L_CE(Y, Ŷ_{η_s}) + λ_2 L_s(θ_s, σ, η_s)   (11)

where

L_s(θ_s, σ, η_s) = ||σ||^2 + Σ_{c=1}^{C} Σ_{h=1}^{H} Σ_{w=1}^{W} E_{P_X}[ V_{c,h,w}^2 / σ_c ]   (12)

and λ_1 and λ_2 are scalar hyperparameters which determine the prominence of ordinary learning (through L_CE(Y, Ŷ_{η_s})) and distillation (through L_s(θ_s, σ, η_s)). V and σ are as defined earlier. Ŷ_{η_s} denotes the final prediction of the student network.
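A matching sketch of the phase-2 objective (11)-(12) is shown below, again as our own illustration; σ is assumed to be kept positive, e.g., through a softplus parameterization.

import torch
import torch.nn.functional as F

def phase2_loss(y, student_logits, t_repr, s_repr, f_t, f_s, sigma, lam1, lam2):
    # Eqs. (11)-(12): trains the student (eta_s), its filter f_s (theta_s),
    # and the weights sigma, while the teacher filter f_t stays frozen.
    with torch.no_grad():
        q = f_t(t_repr)                               # frozen teacher filter output
    v = q - f_s(s_repr)                               # V = f_t(T) - f_s(S)
    weighted_mse = (v.pow(2) / sigma.view(1, -1, 1, 1)).sum(dim=(1, 2, 3)).mean()
    distill = sigma.pow(2).sum() + weighted_mse       # Eq. (12)
    return lam1 * F.cross_entropy(student_logits, y) + lam2 * distill   # Eq. (11)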

The first term of the loss function is the ordinary task-related loss. The next two terms correspond to the distillation loss, which is our focus in the following explanation. Consider phase 2 as an estimation problem that minimizes the σ-weighted mean squared error, where Q = f_t(T) is the estimand and f_s(·) is the estimator. The magnitudes of the positive weights σ are controlled using the term ||σ||^2. We observe that this optimization ensures I(Y;Q | f_s(S)) = 0 given that the following assumption holds.

Assumption: Let the estimation error be ε = f_t(T) − f_s(S). Assume I(ε; Y | f_s(S)) = 0, i.e., given the estimate, the estimation error is independent of Y.

With the above assumption, we see that

I(Y; Q | f_s(S)) = I(Y; f_s(S) + ε | f_s(S)) = I(Y; ε | f_s(S)) = 0.   (13)

Therefore, the constraint in problem (P1) is satisfied by this selection of random variables. Thus, together with the maximization of I(Y;Q) during phase 1, the proposed framework can be seen as performing the optimization in Definition 4.1 in two steps.

This, along with Theorem 4.1, completes our claim that the proposed framework maximizes a lower bound for the transferred knowledge. The RID loss terms can be extended to multiple layers as follows: replace the single term L_t(θ_t, φ_t) in (10) with a sum of terms Σ_{k=1}^{K} L_t^(k)(θ_t^(k), φ_t^(k)) corresponding to each pair of layers T^(k) and S_{η_s}^(k), where k = 1, …, K. Similarly, replace L_s(θ_s, σ, η_s) in (12) with Σ_{k=1}^{K} L_s^(k)(θ_s^(k), σ^(k), η_s). The framework is summarized in Algorithm 1. The advantage of RID over the VID framework [3], which maximizes I(T;S), can be observed in the experiments detailed in Section 5.

Remark.

We note that the Large Language Model (LLM) fine-tuning framework of Task-aware Layer-wise Distillation (TED) [5] shares intuitive similarities with RID regarding distilling task-related knowledge. However, they take a heuristic approach when designing the framework. In fact, our mathematical formulation can explain the success of TED as detailed in Appendix C. In addition to the application domain, the difference between TED and RID can mainly be attributed to the following:

  (i) During the first stage, TED trains both f_t(·) and f_s(·), whereas RID only trains f_t(·).

  (ii) In the second-stage loss, TED includes an ordinary mean squared error term, whereas RID includes a weighted (using σ) mean squared error term.

To the best of our knowledge, our work is the first to information-theoretically quantify the actual task-relevant transferred knowledge and formally incorporate it into an optimization.

Data: A dataset of samples of (X, Y); teacher model with intermediate representations T^(1), …, T^(K); hyperparameters λ_1, λ_2 > 0; # warm-up epochs n_w; # training epochs n; # steps per cycle q ≤ n; alternating ratio r (0 < r < 1)
Result: Trained student network parameterized with η_s
Initialize parameters θ_t^(k), θ_s^(k), φ_t^(k), and η_s;
for i ∈ {1, …, n_w} do
    minimize over {θ_t^(k), φ_t^(k)}:  Σ_{k=1}^{K} L_CE(Y, g_t^(k)(f_t^(k)(T^(k); θ_t^(k)); φ_t^(k)));
end for
for i ∈ {1, …, n} do
    if round(i/q) < q × r then
        minimize over {θ_t^(k), φ_t^(k)}:  Σ_{k=1}^{K} L_t(θ_t^(k), φ_t^(k));   /* see Eq. (10) */
    else
        minimize over {θ_s^(k), σ^(k), η_s}:  λ_1 L_CE(Y, Ŷ_{η_s}) + λ_2 Σ_{k=1}^{K} L_s(θ_s^(k), σ^(k), η_s);   /* see Eq. (12) */
    end if
end for
Algorithm 1: Redundant Information Distillation
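For clarity, the alternation in Algorithm 1 can be summarized by an epoch-level schedule. The snippet below reflects our reading of the cycle structure described in Section 5 (q epochs per cycle, the first r × q of them in phase 1) and is not the authors' code.

def rid_schedule(n_warmup, n_epochs, q, r):
    # Warm-up epochs train only the teacher filters/heads with cross-entropy;
    # afterwards each q-epoch cycle spends its first round(r*q) epochs on
    # phase 1 (Eq. 10) and the remaining epochs on phase 2 (Eq. 11).
    schedule = ["warmup"] * n_warmup
    for i in range(n_epochs):
        schedule.append("phase1" if (i % q) < round(r * q) else "phase2")
    return schedule

print(rid_schedule(n_warmup=2, n_epochs=8, q=4, r=0.5))
# ['warmup', 'warmup', 'phase1', 'phase1', 'phase2', 'phase2', 'phase1', 'phase1', 'phase2', 'phase2']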

5 Empirical Validation

Setup: We compare the performance of the proposed RID framework with (i) VID; and (ii) our adaptation of TED in this domain under two different conditions. In the first setting, the teacher network is fully trained with the complete training set, whereas in the second setting, the teacher is just randomly initialized without any training at all. Experiments are carried out on CIFAR-10 [45], CIFAR-100 [46], ImageNet [47] and CUB-200-2011 [48] datasets (with the last two being used in a transfer learning setup). Additionally, we train a student without any knowledge distillation, which we label as “BAS”. Table 1 provides details of the model architectures used in each setting.

Table 1: Model architectures
Dataset Teacher Student
CIFAR-10 WRN-(40,2) WRN-(16,1)
CIFAR-100 WRN-(28,10) WRN-(16,8)
ImageNet to CUB-200-2011 ResNet-34 ResNet-18

We distill three layers from the teacher to the corresponding student layers. In the case of RID, each teacher layer T^(k) has its own filter f_t^(k) parameterized with θ_t^(k). Student filters are parameterized in a similar manner. Moreover, each teacher filter f_t^(k)(·) has its own classification head g_t^(k)(·) parameterized with φ^(k). All the student representations are parameterized by the complete weight vector η_s. In the beginning, the teacher filters are trained for n_w warm-up epochs with just the cross-entropy loss Σ_{k=1}^{K} L_CE(Y, g_t^(k)(f_t^(k)(T^(k); θ_t^(k)); φ_t^(k))). Then, the optimization alternates between the first and the second phases, with each cycle taking q epochs in total. Within a cycle, phase 1 is carried out for r × q epochs, followed by phase 2 for the rest of the epochs (see Algorithm 1). Values of all the hyperparameters are given in Appendix D. TED is implemented similarly, except that now the student filters also have a classification head g_s^(k)(·) (see Appendix C). Since TED acts as a fine-tuning method, we use the BAS student as the base model to apply TED on.
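The per-layer components described above can be realized, for instance, as in the minimal PyTorch sketch below; the 1×1-convolution filters, the pooled linear head, and the softplus parameterization of σ are our assumptions, not the exact architectures used in the experiments.

import torch
import torch.nn as nn

class RIDLayerPair(nn.Module):
    # One distilled layer pair: teacher filter f_t^(k) with classification head
    # g_t^(k), student filter f_s^(k), and the per-channel weights sigma^(k).
    def __init__(self, channels, num_classes):
        super().__init__()
        self.f_t = nn.Conv2d(channels, channels, kernel_size=1)     # teacher filter
        self.f_s = nn.Conv2d(channels, channels, kernel_size=1)     # student filter
        self.g_t = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                 nn.Linear(channels, num_classes))  # teacher head
        self.raw_sigma = nn.Parameter(torch.zeros(channels))        # optimized in phase 2

    def sigma(self):
        # Keep the weights strictly positive.
        return nn.functional.softplus(self.raw_sigma) + 1e-6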

In order to evaluate the validity of Definitions 3.1 and 3.2, we compute the PID components [12] of the joint mutual information I(Y;T,S) of the innermost distilled layer using the estimation method given in [28]. See Appendix D for additional details.

Results: Figure 4 shows the classification accuracy on the CIFAR-10 dataset for each student model RID, VID, and TED when distilled with either a trained teacher or an untrained one. It also shows the classification accuracies of the baseline model BAS (without any distillation) and the trained teacher. Figures 5(a) and 5(b) present the PID values and the marginal mutual information estimates for the teacher and the student models for CIFAR-10 and CIFAR-100, respectively. Tables D.4 and D.4 in Appendix D present the classification accuracies for CIFAR-100 and the transfer learning setup ImageNet → CUB-200-2011, respectively.

Figure 4: Classification accuracy on the CIFAR-10 dataset for RID, VID, TED, and BAS when distilled using a trained (abbreviated “tt”, solid lines) and an untrained (abbreviated “ut”, dashed lines) teacher: The solid and dashed lines indicate the mean over three runs. Shaded areas represent the corresponding confidence regions: mean ± standard deviation. Colors correspond to the distillation method used.
(a) CIFAR-10
(b) CIFAR-100
Figure 5: Information atoms of I(Y;T,S) for BAS, VID, and RID when distilled using a trained and an untrained teacher: Values are shown for the innermost distilled layer. The first two rows show that when distilled from a trained teacher, the remaining amount of knowledge available in the teacher for distillation Uni(Y:T\S) decreases, whereas the already transferred knowledge Red(Y:T,S) increases. Observe from the third row how VID performs worse than both RID and BAS when the teacher is not trained.

Discussion: From Figure 4, we see that both RID and VID perform equally well when the teacher is trained. However, the performance degradation of VID under the untrained teacher is significantly larger than that of RID. This can be attributed to the VID student naively trying to mimic the teacher. RID leads to more resilient and effective distillation under nuisance teachers as it succinctly quantifies task-relevant knowledge rather than simply aligning student and teacher representations. TED, despite being rather unstable, performs very close to the baseline model in both cases. As evident from Figure 5, in the case of the trained teacher (whose I(Y;T) > 0), we observe that the knowledge available for distillation Uni(Y:T\S) in the teacher decreases with the increasing number of epochs. Consequently, the amount of knowledge transferred Red(Y:T,S) increases. When the teacher is not trained (i.e., I(Y;T) = 0), Uni(Y:T\S), Red(Y:T,S) ≈ 0 as expected. Both the BAS and RID models show an increase in I(Y;S) even when distilled from the untrained teacher. However, in this case, VID shows a very low I(Y;S) as expected, caused by the distillation loss forcing the student to mimic the teacher. A Python implementation of the experiments is available at https://github.com/pasandissanayake/kd-rid.

6 Conclusion

With the growing interest in knowledge distillation, our work provides critical insights into the explainability of knowledge distillation. We propose using Uni(Y:T\S) to quantify the knowledge available in a teacher model for distillation w.r.t. a student and a downstream task. This, in turn, leads to defining the amount of knowledge that has been distilled to a student as Red(Y:T,S). We show that knowledge distillation frameworks which rely on the mutual information between the teacher and the student representations have a fundamental problem: these frameworks force the student to mimic the teacher regardless of the usefulness of the teacher’s information for the task at hand. In contrast, through several examples we demonstrate that the proposed metrics can correctly characterize the amounts of knowledge available for distillation and the already transferred knowledge. Moreover, we show the advantage of the proposed metric by implementing a new distillation framework – Redundant Information Distillation (RID) – and comparing its performance with the existing technique VID [3]. While VID and RID perform similarly when the teacher is well-trained for the downstream task, VID’s performance degrades substantially when the teacher is not trained, whereas RID performs close to a student model trained without distillation.

Limitations and future work: While the RID framework uses an alternative definition for redundant information, computation of the exact Red(Y:T,S) during training can be computationally prohibitive due to the optimization over Δ_P. Moreover, characterizing the extent to which the assumption in Section 4 holds is not explored in this work. Extending the mathematical formulation in Section 4 to analyze other knowledge distillation frameworks is an interesting path for future research. Other potential research directions include: (i) distilling from an ensemble of teachers [49] in a way that the adverse effects of corrupted teachers are mitigated; (ii) dataset distillation [50]; or (iii) distillation for model reconstruction from counterfactual explanations [51]. Incorporating fundamentally different definitions for PID components, such as [41] which provides explicit formulae, as regularizers can also be interesting.

Acknowledgments: This work was supported in part by NSF CAREER Award 2340006 and Northrop Grumman Seed Grant.

References

  • [1] Geoffrey Hinton “Distilling the Knowledge in a Neural Network” In arXiv preprint arXiv:1503.02531, 2015
  • [2] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta and Yoshua Bengio “Fitnets: Hints for thin deep nets” In ICLR, 2015
  • [3] Sungsoo Ahn, Shell Xu Hu, Andreas Damianou, Neil D Lawrence and Zhenwen Dai “Variational information distillation for knowledge transfer” In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 9163–9171
  • [4] Yonglong Tian, Dilip Krishnan and Phillip Isola “Contrastive representation distillation” In ICLR, 2020
  • [5] Chen Liang, Simiao Zuo, Qingru Zhang, Pengcheng He, Weizhu Chen and Tuo Zhao “Less is more: Task-aware layer-wise distillation for language model compression” In International Conference on Machine Learning, 2023, pp. 20852–20867 PMLR
  • [6] Jianping Gou, Baosheng Yu, Stephen J Maybank and Dacheng Tao “Knowledge distillation: A survey” In International Journal of Computer Vision 129.6 Springer, 2021, pp. 1789–1819
  • [7] Ilia Sucholutsky et al. “Getting aligned on representational alignment” In arXiv preprint arXiv:2310.13018, 2023
  • [8] Quanshi Zhang, Xu Cheng, Yilan Chen and Zhefan Rao “Quantifying the knowledge in a DNN to explain knowledge distillation for classification” In IEEE Transactions on Pattern Analysis and Machine Intelligence 45.4 IEEE, 2022, pp. 5099–5113
  • [9] Chaofei Wang, Qisen Yang, Rui Huang, Shiji Song and Gao Huang “Efficient knowledge distillation from model checkpoints” In Advances in Neural Information Processing Systems 35, 2022, pp. 607–619
  • [10] Paul L Williams and Randall D Beer “Nonnegative decomposition of multivariate information” In arXiv preprint arXiv:1004.2515, 2010
  • [11] Virgil Griffith, Edwin K.. Chong, Ryan G. James, Christopher J. Ellison and James P. Crutchfield “Intersection Information Based on Common Randomness” In Entropy 16.4, 2014, pp. 1985–2000
  • [12] Nils Bertschinger, Johannes Rauh, Eckehard Olbrich, Jürgen Jost and Nihat Ay “Quantifying unique information” In Entropy 16.4 Multidisciplinary Digital Publishing Institute, 2014, pp. 2161–2183
  • [13] Virgil Griffith and Tracey Ho “Quantifying redundant information in predicting a target random variable” In Entropy 17.7 MDPI, 2015, pp. 4644–4653
  • [14] Liqun Chen, Dong Wang, Zhe Gan, Jingjing Liu, Ricardo Henao and Lawrence Carin “Wasserstein contrastive representation distillation” In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 16296–16305
  • [15] Roy Miles, Adrian Lopez Rodriguez and Krystian Mikolajczyk “Information theoretic representation distillation” In arXiv preprint arXiv:2112.00459, 2021
  • [16] Yichen Zhu, Ning Liu, Zhiyuan Xu, Xin Liu, Weibin Meng, Louis Wang, Zhicai Ou and Jian Tang “Teach less, learn more: On the undistillable classes in knowledge distillation” In Advances in Neural Information Processing Systems 35, 2022, pp. 32011–32024
  • [17] Souvik Kundu, Qirui Sun, Yao Fu, Massoud Pedram and Peter Beerel “Analyzing the confidentiality of undistillable teachers in knowledge distillation” In Advances in Neural Information Processing Systems 34, 2021, pp. 9181–9192
  • [18] Dae Young Park, Moon-Hyun Cha, Daesin Kim and Bohyung Han “Learning student-friendly teacher networks for knowledge distillation” In Advances in neural information processing systems 34, 2021, pp. 13292–13303
  • [19] Emanuel Ben-Baruch, Matan Karklinsky, Yossi Biton, Avi Ben-Cohen, Hussam Lawen and Nadav Zamir “It’s all in the head: Representation knowledge distillation through classifier sharing” In arXiv preprint arXiv:2201.06945, 2022
  • [20] Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang and Qun Liu “Tinybert: Distilling bert for natural language understanding” In arXiv preprint arXiv:1909.10351, 2019
  • [21] Wei Huang, Zhiliang Peng, Li Dong, Furu Wei, Jianbin Jiao and Qixiang Ye “Generic-to-specific distillation of masked autoencoders” In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 15996–16005
  • [22] Naftali Tishby, Fernando C Pereira and William Bialek “The information bottleneck method” In arXiv preprint physics/0004057, 2000
  • [23] Naftali Tishby and Noga Zaslavsky “Deep learning and the information bottleneck principle” In 2015 ieee information theory workshop (itw), 2015, pp. 1–5 IEEE
  • [24] Sanghamitra Dutta, Praveen Venkatesh, Piotr Mardziel, Anupam Datta and Pulkit Grover “An information-theoretic quantification of discrimination with exempt features” In Proceedings of the AAAI Conference on Artificial Intelligence, 2020, pp. 3825–3833
  • [25] Sanghamitra Dutta, Praveen Venkatesh, Piotr Mardziel, Anupam Datta and Pulkit Grover “Fairness under feature exemptions: Counterfactual and observational measures” In IEEE Transactions on Information Theory 67.10 IEEE, 2021, pp. 6675–6710
  • [26] Sanghamitra Dutta and Faisal Hamman “A review of partial information decomposition in algorithmic fairness and explainability” In Entropy 25.5 MDPI, 2023, pp. 795
  • [27] Faisal Hamman and Sanghamitra Dutta “Demystifying Local and Global Fairness Trade-offs in Federated Learning Using Partial Information Decomposition” In International Conference on Learning Representations, 2024
  • [28] Paul Pu Liang et al. “Quantifying & modeling multimodal interactions: An information decomposition framework” In Advances in Neural Information Processing Systems 36, 2024
  • [29] Paul Pu Liang et al. “Multimodal Learning Without Labeled Multimodal Data: Guarantees and Applications” In The Twelfth International Conference on Learning Representations, 2024
  • [30] F. Hamman and S. Dutta “A Unified View of Group Fairness Tradeoffs Using Partial Information Decomposition” In IEEE International Symposium on Information Theory (ISIT), 2024, pp. 214–219
  • [31] Tycho Tax, Pedro Mediano and Murray Shanahan “The partial information decomposition of generative neural network models” In Entropy 19.9 Multidisciplinary Digital Publishing Institute, 2017, pp. 474
  • [32] David A Ehrlich, Andreas C Schneider, Michael Wibral, Viola Priesemann and Abdullah Makkeh “Partial information decomposition reveals the structure of neural representations” In arXiv preprint arXiv:2209.10438, 2022
  • [33] Patricia Wollstadt, Sebastian Schmitt and Michael Wibral “A Rigorous Information-Theoretic Definition of Redundancy and Relevancy in Feature Selection Based on (Partial) Information Decomposition.” In J. Mach. Learn. Res. 24, 2023, pp. 131–1
  • [34] Salman Mohamadi, Gianfranco Doretto and Donald A Adjeroh “More Synergy, Less Redundancy: Exploiting Joint Mutual Information for Self-Supervised Learning” In arXiv preprint arXiv:2307.00651, 2023
  • [35] Praveen Venkatesh, Corbett Bennett, Sam Gale, Tamina Ramirez, Greggory Heller, Severine Durand, Shawn Olsen and Stefan Mihalas “Gaussian partial information decomposition: Bias correction and application to high-dimensional data” In Advances in Neural Information Processing Systems 36, 2024
  • [36] B. Halder, F. Hamman, P. Dissanayake, Q. Zhang, I. Sucholutsky and S. Dutta “Quantifying Spuriousness of Biased Datasets Using Partial Information Decomposition” In ICML Workshop on Data-centric Machine Learning Research (DMLR): Datasets for Foundation Models, 2024
  • [37] Shaurya Dewan, Rushikesh Zawar, Prakanshul Saxena, Yingshan Chang, Andrew Luo and Yonatan Bisk “Diffusion PID: Interpreting Diffusion via Partial Information Decomposition” In Advances in Neural Information Processing Systems 37, 2024, pp. 2045–2079
  • [38] Praveen Venkatesh and Gabriel Schamberg “Partial information decomposition via deficiency for multivariate gaussians” In 2022 IEEE International Symposium on Information Theory (ISIT), 2022, pp. 2892–2897 IEEE
  • [39] Chaitanya Goswami and Amanda Merkley “Analytically deriving Partial Information Decomposition for affine systems of stable and convolution-closed distributions” In Advances in Neural Information Processing Systems 37, 2024, pp. 86749–86835
  • [40] Michael Kleinman, Alessandro Achille, Stefano Soatto and Jonathan C Kao “Redundant information neural estimation” In Entropy 23.7 MDPI, 2021, pp. 922
  • [41] Aobo Lyu, Andrew Clark and Netanel Raviv “Explicit Formula for Partial Information Decomposition” In arXiv preprint arXiv:2402.03554, 2024
  • [42] Ari Pakman, Amin Nejatbakhsh, Dar Gilboa, Abdullah Makkeh, Luca Mazzucato, Michael Wibral and Elad Schneidman “Estimating the unique information of continuous variables” In Advances in neural information processing systems 34, 2021, pp. 20295–20307
  • [43] Thomas M. Cover and Joy A. Thomas “Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing)” USA: Wiley-Interscience, 2006
  • [44] Alessandro Achille and Stefano Soatto “Emergence of invariance and disentanglement in deep representations” In Journal of Machine Learning Research 19.50, 2018, pp. 1–34
  • [45] Alex Krizhevsky, Vinod Nair and Geoffrey Hinton “CIFAR-10 (Canadian Institute for Advanced Research)”, 2009 URL: http://www.cs.toronto.edu/~kriz/cifar.html
  • [46] Alex Krizhevsky, Vinod Nair and Geoffrey Hinton “CIFAR-100 (Canadian Institute for Advanced Research)”, 2009 URL: http://www.cs.toronto.edu/~kriz/cifar.html
  • [47] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li and Li Fei-Fei “Imagenet: A large-scale hierarchical image database” In 2009 IEEE conference on computer vision and pattern recognition, 2009, pp. 248–255 IEEE
  • [48] Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona and Serge Belongie “The Caltech-UCSD Birds-200-2011 Dataset” California Institute of Technology, 2011
  • [49] Andrey Malinin, Bruno Mlodozeniec and Mark Gales “Ensemble Distribution Distillation” In International Conference on Learning Representations, 2020
  • [50] I. Sucholutsky and M. Schonlau “Soft-Label Dataset Distillation and Text Dataset Distillation” In 2021 International Joint Conference on Neural Networks (IJCNN), 2021, pp. 1–8
  • [51] P. Dissanayake and S. Dutta “Model Reconstruction Using Counterfactual Explanations: A Perspective From Polytope Theory” In Advances in Neural Information Processing Systems (NeurIPS), 2024
  • [52] Pradeep Kr Banerjee, Eckehard Olbrich, Jürgen Jost and Johannes Rauh “Unique informations and deficiencies” In 2018 56th Annual Allerton Conference on Communication, Control, and Computing (Allerton), 2018, pp. 32–38 IEEE

Appendix A Additional Related Works

The nature of the teacher model plays an important role in distillation. [17] present a distillation scheme to distill from a teacher which has been intentionally made difficult-to-distill. They distill the penultimate layer of the teacher to several intermediate representations of the student through auxiliary student branches. [18] prepare the teacher during its training phase so that it can later be used for better distillation when training a student. This is done through training an augmented version of the teacher formed by adding several student branches, all of which are trained together. [19] propose two novel frameworks: Teacher-Head Knowledge Distillation (TH-KD) and Student-Head Knowledge Distillation (SH-KD). In TH-KD, the teacher’s classification head is attached to the student and the distillation loss is computed as a weighted sum of the discrepancies of predictions from the two heads of the student with respect to the teacher’s predictions. SH-KD involves three steps: (i) a student is trained in the conventional way; (ii) the classification head of the student is fixed to the teacher and the teacher is trained while keeping the classification head frozen; and (iii) this teacher model is used to train a new student using conventional distillation. The assumption is that the teacher trained with the student classification head will adapt the teacher to transfer the knowledge better suited to the capacity-limited student. [20] propose an interesting knowledge distillation framework for compressing the well-known BERT models into a much smaller version called TinyBERT. The process consists of two steps. During the first step, a generic TinyBERT is distilled from a pre-trained BERT model using a large corpus. In the second step, this TinyBERT model is further fine-tuned by distilling a fine-tuned BERT on a task-specific dataset. While the broad goal of [20] seems to align with ours, the approach is quite different: we intend to filter out task-specific information from a generalized teacher by defining a measure that precisely captures this whereas [20] focus on distilling from a task-specific (i.e., fine-tuned) teacher in an efficient manner. The work on distilling vision transformers by [21] also proceeds in two steps. Similar to [20], the first step focuses on distilling task-agnostic information using a pre-trained teacher within an encoder-decoder setup. In the second step, the decoder is abandoned and the task-specific information is transferred using a fine-tuned teacher. As pointed out in [21, Section 4.3], this framework can be seen as maximizing the mutual information (conditioned on the dataset being used) between the teacher and the student in each step (rather than quantifying task-specific information), and hence, can be conceptually categorized together with VID. In contrast, our RID framework focuses on quantifying and extracting task-specific information from a generalized teacher.

Appendix B Proofs

B.1 Proof of Theorem 3.1

See Theorem 3.1.

Proof.

To prove claim (i), observe that

I(T;S) = H(T) − H(T|S)   (14)
       = H(Z,G) − H(Z,G|S)   (15)
       = H(Z) + H(G) − H(Z,G|S).   (16)

Now, S = Z implies H(Z,G|S) = H(G), and S = G implies H(Z,G|S) = H(Z). Therefore,

I(T;S) = H(Z) if S = Z, and I(T;S) = H(G) if S = G.   (17)

Claim (i) follows from the above since I(T;S) ≤ H(S) ≤ max{H(Z), H(G)}.

To prove claim (ii), first observe that I(Y;T) = I(Y;Z) implies I(Y;G|Z) = 0. Now consider the conditional mutual information I(Y;T|S):

I(Y;T|S) = I(Y; Z, G | S)   (18)
         = I(Y; G | S) + I(Y; Z | G, S)   (19)

Note that the right-hand side above vanishes when S = Z. Therefore, S = Z implies I(Y;T|S) = 0. Now, since

Red(Y:T,S) = I(Y;T) − min_{Q ∈ Δ_P} I_Q(Y;T|S), where the minimum equals 0 (attained at Q = P) when S = Z,   (20)

and Red(Y:T,S) ≤ I(Y;T), setting S = Z achieves the maximum Red(Y:T,S). ∎

B.2 Proof of Theorem 3.2

See Theorem 3.2. The proof of the second property is given below:

Proof.

max{I(Y;T), I(Y;S)}
  = max{ Red(Y:T,S) + Uni(Y:T\S), Red(Y:T,S) + Uni(Y:S\T) }, where Uni(Y:T\S) = 0   (21)
  = Red(Y:T,S) + Uni(Y:S\T)   (22)
  = I(Y;S).   (23)

The first and the third properties directly follow from [12, Lemma 5] and [52, Lemma 31].

B.3 Proof of Lemma B.1

Lemma B.1.

Let Y, T, and S be any three random variables with supports 𝒴, 𝒯, and 𝒮, respectively, and let g(·) be a deterministic function with domain 𝒮. Then

I(Y; T | g(S), S) = I(Y; T | S).   (24)

Proof.

By applying the mutual information chain rule to I(Y; T, S, g(S)) we get

I(Y; T, S, g(S)) = I(Y;S) + I(Y;T|S) + I(Y; g(S) | T, S)   (25)
                 = I(Y;S) + I(Y;T|S) + H(g(S) | T, S) − H(g(S) | Y, T, S), where both entropy terms equal 0   (26)
                 = I(Y;S) + I(Y;T|S).   (27)

Also, from a different decomposition, we get

\begin{align}
I(Y;T,S,g(S)) &= I(Y;S)+\underbrace{I(Y;g(S)|S)}_{=0}+I(Y;T|g(S),S) \tag{28}\\
&= I(Y;S)+I(Y;T|g(S),S). \tag{29}
\end{align}

Combining the two right-hand sides yields the final result. ∎

B.4 Proof of Theorem 4.1

Restatement of Theorem 4.1.

Proof.

For a given set of random variables $Y$, $T$ and $S$, let $f_{t}^{*}(T)$ and $f_{s}^{*}(S)$ achieve the maximum $I(Y;Q)$ in Definition 4.1, i.e., $Red_{\cap}(Y:T,S)=I(Y;f_{t}^{*}(T))$ while $I(Y;f_{t}^{*}(T)|f_{s}^{*}(S))=0$. We first observe that $Red(Y:f_{t}^{*}(T),f_{s}^{*}(S))=Red_{\cap}(Y:T,S)=I(Y;f_{t}^{*}(T))$, as shown below:

\begin{align}
Red(Y:f_{t}^{*}(T),f_{s}^{*}(S))
&=I(Y;f_{t}^{*}(T))-\min_{Q\in\Delta_{P}}I_{Q}(Y;f_{t}^{*}(T)|f_{s}^{*}(S)) \tag{30}\\
&=I(Y;f_{t}^{*}(T))\quad(\because I(Y;f_{t}^{*}(T)|f_{s}^{*}(S))=0) \tag{31}\\
&=Red_{\cap}(Y:T,S). \tag{32}
\end{align}

Next, we show that $Red(Y:f_{t}^{*}(T),f_{s}^{*}(S))\leq Red(Y:T,S)$. In this regard, we use the following lemma due to [12].

Lemma B.2 (Lemma 25, [12]).

Let $X,Y,Z_{1},Z_{2},\dots,Z_{k}$ and $Z_{k+1}$ be a set of random variables. Then,

\begin{align}
Uni(X:Y\backslash Z_{1},Z_{2},\dots,Z_{k})\geq Uni(X:Y\backslash Z_{1},Z_{2},\dots,Z_{k},Z_{k+1}). \tag{33}
\end{align}

Consider the set of random variables $Y$, $f_{t}^{*}(T)$, $f_{s}^{*}(S)$ and $S$. From the above lemma we get

\begin{align}
Uni(Y:f_{t}^{*}(T)\backslash f_{s}^{*}(S))
&\geq Uni(Y:f_{t}^{*}(T)\backslash f_{s}^{*}(S),S) \tag{34}\\
&=I(Y;f_{t}^{*}(T))-I_{Q^{*}}(Y;f_{t}^{*}(T)|f_{s}^{*}(S),S), \tag{35}
\end{align}

where $Q^{*}=\arg\min_{Q\in\Delta_{P}}I_{Q}(Y;f_{t}^{*}(T)|f_{s}^{*}(S),S)$. Now, by applying Lemma B.1 to the right-hand side we arrive at

\begin{align}
Uni(Y:f_{t}^{*}(T)\backslash f_{s}^{*}(S))
&\geq I(Y;f_{t}^{*}(T))-I_{Q^{*}}(Y;f_{t}^{*}(T)|S) \tag{36}\\
&=Uni(Y:f_{t}^{*}(T)\backslash S). \tag{37}
\end{align}

Next, observe that the following chain of equivalences holds from Definition 3.3:

\begin{align}
&Uni(Y:f_{t}^{*}(T)\backslash f_{s}^{*}(S))\geq Uni(Y:f_{t}^{*}(T)\backslash S) \tag{38}\\
\iff\;&I(Y;f_{t}^{*}(T))-Uni(Y:f_{t}^{*}(T)\backslash f_{s}^{*}(S))\leq I(Y;f_{t}^{*}(T))-Uni(Y:f_{t}^{*}(T)\backslash S) \tag{39}\\
\iff\;&Red(Y:f_{t}^{*}(T),f_{s}^{*}(S))\leq Red(Y:f_{t}^{*}(T),S). \tag{40}
\end{align}

Noting that $Red(Y:A,B)$ is symmetric w.r.t. $A$ and $B$, we may apply the previous argument to the pair $Red(Y:f_{t}^{*}(T),S)$ and $Red(Y:T,S)$ to obtain

\begin{align}
Red(Y:f_{t}^{*}(T),f_{s}^{*}(S))\leq Red(Y:f_{t}^{*}(T),S)\leq Red(Y:T,S), \tag{41}
\end{align}

concluding the proof. ∎

Appendix C VID And TED Frameworks

C.1 Variational Information Distillation (VID)

The VID framework [3] is based on maximizing a variational lower bound on the mutual information $I(T;S)$. It finds a student representation $S$ which minimizes the following loss function:

\begin{align}
\mathcal{L}_{VID}(\eta_{s},\mu)=\mathcal{L}_{CE}(Y,\hat{Y}_{\eta_{s}})+\lambda\sum_{c=1}^{C}\sum_{h=1}^{H}\sum_{w=1}^{W}\Bigg(\log\sigma_{c}+\mathbb{E}_{P_{X}}\left[\frac{(T_{c,h,w}-\mu_{c,h,w}(S_{\eta_{s}}))^{2}}{2\sigma_{c}^{2}}\right]\Bigg). \tag{42}
\end{align}

Here, $C$, $H$ and $W$ are the number of channels, height and width of the representation $T$, respectively (i.e., $T\in\mathbb{R}^{C\times H\times W}$). $\mu$ is a deterministic function parameterized by a neural network and learned during the training process. $\sigma=[\sigma_{1},\dots,\sigma_{C}]^{T}$ is a vector of independent positive parameters, which is also learned during training. $\hat{Y}_{\eta_{s}}$ is the student model's final prediction of the target label $Y$.
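For concreteness, the following is a minimal PyTorch-style sketch of the per-layer variational term in (42). It is not the authors' implementation: it assumes the teacher and student feature maps share the same spatial size, realizes $\mu(\cdot)$ with a small $1\times 1$-convolutional network, and keeps $\sigma$ positive via a softplus parameterization (the names `VIDRegularizer` and `alpha` are illustrative).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VIDRegularizer(nn.Module):
    """Per-layer variational term of Eq. (42): Gaussian negative log-likelihood of the
    teacher feature T given mu(S), with one learned scale sigma_c per teacher channel."""

    def __init__(self, student_channels, teacher_channels):
        super().__init__()
        # mu(.) maps the student feature map onto the teacher's channel dimension
        self.mu = nn.Sequential(
            nn.Conv2d(student_channels, teacher_channels, kernel_size=1),
            nn.ReLU(),
            nn.Conv2d(teacher_channels, teacher_channels, kernel_size=1),
        )
        # unconstrained parameter; softplus keeps sigma positive during training
        self.alpha = nn.Parameter(torch.zeros(teacher_channels))

    def forward(self, s_feat, t_feat):
        # s_feat: (B, C_s, H, W), t_feat: (B, C_t, H, W) with matching H, W (assumed)
        sigma = F.softplus(self.alpha) + 1e-6
        var = (sigma ** 2).view(1, -1, 1, 1)
        pred = self.mu(s_feat)                      # estimate of T from S
        _, _, h, w = t_feat.shape
        log_term = torch.log(sigma).sum() * h * w   # sum_{c,h,w} log sigma_c
        sq_term = ((t_feat - pred) ** 2 / (2 * var)).sum(dim=(1, 2, 3)).mean()
        return log_term + sq_term
```

The overall loss in (42) is then obtained by adding $\lambda$ times this term, summed over the chosen layers, to the student's cross-entropy loss.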

C.2 Task-aware Layer-wise Distillation (TED)

The TED framework [5] fine-tunes a student in two stages. During the first stage, task-aware filters appended to the teacher and the student are trained with task-related heads while the teacher and student parameters are kept constant. In the second stage, the task-related heads are removed from the filters and the student is trained along with its task-aware filter, while the teacher and its task-aware filter are kept unchanged. We observe that each of these steps implicitly maximizes the redundant information under Definition 4.1. To see the relationship between the TED framework and the above definition of redundant information, let $Q$ be parameterized using the teacher's task-aware filter as $Q=f_{t}(T)$. Now consider the first-stage loss corresponding to the teacher's task-aware filter, given below:

\begin{align}
\mathcal{L}_{t}\left(T,\theta_{t}\right)=\mathbb{E}_{x\sim\mathcal{X}}\left[\ell(f_{t}(T;\theta_{t}))\right]. \tag{43}
\end{align}

Here, $\ell(\cdot)$ is the task-specific loss and $f_{t}$ is the task-aware filter parameterized by $\theta_{t}$. During the first stage, this loss is minimized over $\theta_{t}$. A similar loss corresponding to the student (i.e., $\mathbb{E}_{x\sim\mathcal{X}}\left[\ell(f_{s}(S;\theta_{s}))\right]$) is minimized in order to train the student's task-aware filter. Note that during this process, both $I(Y;f_{t}(T))$ and $I(Y;f_{s}(S))$ are increased.

During stage 2, the distillation loss given below is minimized over $\theta_{s}$ and $S$ while $\theta_{t}$ and $T$ are held constant.

\begin{align}
\mathcal{D}_{TED}\left(T,S\right)=\mathbb{E}_{x\sim\mathcal{X}}\left[\lVert f_{t}(T;\theta_{t})-f_{s}(S;\theta_{s})\rVert^{2}\right]. \tag{44}
\end{align}

Consider stage 2 as an estimation problem that minimizes the mean squared error, where $Q=f_{t}(T)$ is the estimand and $f_{s}(\cdot)$ is the estimator. We observe that this optimization ensures $I(Y;Q|f_{s}(S))=0$, given that the same assumption as in Section 4 holds. Following similar steps as in Section 4, we see that the TED framework maximizes a lower bound on the transferred knowledge, quantified as in Definition 3.2.
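As a rough illustration of the two stages described above (and not the TED authors' code), the following Python/PyTorch sketch uses hypothetical $1\times 1$-convolutional filters for $f_t$ and $f_s$ and a linear task head; it assumes the filtered teacher and student feature maps have matching shapes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative module sizes only; the backbone teacher/student networks are not shown.
f_t = nn.Conv2d(256, 128, kernel_size=1)   # teacher's task-aware filter (theta_t)
f_s = nn.Conv2d(64, 128, kernel_size=1)    # student's task-aware filter (theta_s)
g_t = nn.Linear(128, 10)                   # task head used only during stage 1

def stage1_filter_loss(t_feat, labels):
    # Eq. (43): train the filter (and head) with the task loss; the backbone is frozen.
    # An analogous loss trains f_s with its own head on the student features.
    pooled = f_t(t_feat).mean(dim=(2, 3))          # global average pooling
    return F.cross_entropy(g_t(pooled), labels)

def stage2_distill_loss(t_feat, s_feat):
    # Eq. (44): align the filtered representations; theta_t and T are held fixed.
    with torch.no_grad():
        target = f_t(t_feat)
    return F.mse_loss(f_s(s_feat), target)
```

In stage 1 only the filters and heads receive gradients; in stage 2 only the student and $f_s$ are updated, mirroring the frozen-teacher constraint in (44).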

The main differences of this scheme w.r.t. the RID framework are twofold. First, in RID we optimize $f_{t}(\cdot)$ in addition to $f_{s}(\cdot)$ and $S$ during stage 2. In contrast, TED does not modify the teacher's filter during the second stage. Second, the RID distillation loss employs a weighting parameter similar to that of VID.

Appendix D Empirical Validation

D.1 Datasets

We use the CIFAR-10 dataset [45] with 60,000 32×32 colour images belonging to 10 classes, with 6,000 images per class. The training set consists of 50,000 images (5,000 per class) and the test set consists of 10,000 images (1,000 per class). The PID values are evaluated over the same test set. Similarly, the CIFAR-100 dataset [46] contains 32×32 colour images belonging to 100 classes, with 600 images per class. The dataset is split into training and test sets with 500 and 100 images per class, respectively. While this split was used to train the teacher, the students were trained only on a subset of the training data with 100 samples per class. For the transfer learning setup, a teacher model initialized with weights pre-trained on the ImageNet dataset [47] was used. Students were distilled using the CUB-200-2011 dataset [48]. The ImageNet dataset contains 1,281,167 training images, 50,000 validation images and 100,000 test images belonging to 1,000 general object classes. The CUB-200-2011 dataset contains 11,788 images of 200 bird species.

D.2 Models and hyperparameters

CIFAR-10: Teacher models are WideResNet-(40,2) and student models are WideResNet-(16,1). For VID distillation, $\lambda$ was set to 100. The learning rate was 0.05 at the beginning and was reduced to 0.01 and 0.002 at the 150th and 200th epochs, respectively. Stochastic Gradient Descent with weight decay 0.0005, momentum 0.9, and Nesterov momentum enabled was used as the optimizer. We choose three intermediate layers: the outputs of the second, third and fourth convolutional blocks of both the teacher and student models. The function $\mu(\cdot)$ for each layer is parameterized using a sequential model with three convolutional layers, with ReLU activations and batch normalization between the layers. A similar architecture and training setup was used for the baseline (BAS, no distillation) and the RID models. In the case of the RID models, the filters $f_s(\cdot)$ and $f_t(\cdot)$ were parameterized using a 2-layer convolutional network with a batch normalization layer in the middle. The classification head $g_t(\cdot)$ is a linear layer. We set $n_w=30$, $q=30$, $r=1/4$ and the total number of epochs $n+n_w=300$. The teacher, baseline and VID models are trained for 300 epochs. In both VID and RID, the independent parameter vector $\sigma$ has a dimension equal to the number of channels in the outputs of the functions $\mu$, $f_s$ or $f_t$. All training was carried out on a computer with an AMD Ryzen Threadripper PRO 5975WX processor and an Nvidia RTX A4500 graphics card. The average training time per model is around 1.5 hours.
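For reference, a minimal PyTorch sketch of the optimizer and learning-rate schedule described above is given below; it is not the exact training script, and the `student` module is a stand-in for the WideResNet-(16,1) model.

```python
import torch
import torch.nn as nn

student = nn.Linear(10, 10)  # stand-in for the WideResNet-(16,1) student model
optimizer = torch.optim.SGD(student.parameters(), lr=0.05, momentum=0.9,
                            weight_decay=5e-4, nesterov=True)
# 0.05 -> 0.01 at epoch 150 and 0.01 -> 0.002 at epoch 200 (a factor of 0.2 each time)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[150, 200], gamma=0.2)

for epoch in range(300):
    ...                 # one training epoch over CIFAR-10 (omitted)
    scheduler.step()    # decay the learning rate at the milestone epochs
```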

CIFAR-100: Teacher models are WideResNet-(28,10) and student models are WideResNet-(16,8). For VID distillation, $\lambda$ was set to 100. The learning rate was 0.1 at the beginning and was multiplied by a factor of 0.2 at the 60th, 120th and 160th epochs. Stochastic Gradient Descent with weight decay 0.0005, momentum 0.9, and Nesterov momentum enabled was used as the optimizer. We choose three intermediate layers: the outputs of the second, third and fourth convolutional blocks of both the teacher and student models. The function $\mu(\cdot)$ for each layer is parameterized using a sequential model with three convolutional layers, with ReLU activations and batch normalization between the layers. A similar architecture and training setup was used for the baseline (BAS, no distillation) and the RID models. In the case of the RID models, the filters $f_s(\cdot)$ and $f_t(\cdot)$ were parameterized using a 2-layer convolutional network with a batch normalization layer in the middle. The classification head $g_t(\cdot)$ is a linear layer. We set $n_w=5$, $q=5$, $r=1/5$ and the total number of epochs $n+n_w=250$. The teacher, baseline and VID models are trained for 250 epochs. In both VID and RID, the independent parameter vector $\sigma$ has a dimension equal to the number of channels in the outputs of the functions $\mu$, $f_s$ or $f_t$. All training was carried out on a computer with an AMD EPYC 7763 64-Core processor and an Nvidia RTX A6000 Ada graphics card. The average training time per model is around 3.75 hours.

ImageNet $\rightarrow$ CUB-200-2011: The teacher is a ResNet-34 model initialized with weights pre-trained on ImageNet (from Torchvision). The students are ResNet-18 models. For VID distillation, $\lambda$ was set to 100. The learning rate was 0.1 at the beginning and was multiplied by a factor of 0.2 at the 150th and 200th epochs. Stochastic Gradient Descent with weight decay 0.0005, momentum 0.9, and Nesterov momentum enabled was used as the optimizer. We choose two intermediate layers: the outputs of the third and fourth convolutional blocks of both the teacher and student models. The function $\mu(\cdot)$ for each layer is parameterized using a sequential model with two convolutional layers, with ReLU activations and batch normalization between the layers. A similar architecture and training setup was used for the baseline (BAS, no distillation) and the RID models. In the case of the RID models, the filters $f_s(\cdot)$ and $f_t(\cdot)$ were parameterized using a 2-layer convolutional network with a batch normalization layer in the middle. The classification head $g_t(\cdot)$ is a linear layer. We set $n_w=20$, $q=20$, $r=1/5$ and the total number of epochs $n+n_w=250$. The teacher, baseline and VID models are trained for 250 epochs. In both VID and RID, the independent parameter vector $\sigma$ has a dimension equal to the number of channels in the outputs of the functions $\mu$, $f_s$ or $f_t$. All training was carried out on a computer with an AMD Ryzen Threadripper PRO 5975WX processor and an Nvidia RTX A4500 graphics card. The average training time per model is around 7 hours.

D.3 PID computation

We compute the PID components of the joint information $I(Y;T,S)$ of the innermost distilled layers, using the framework proposed in [28], as follows:

  1. Representations are individually flattened.
  2. Compute an $n_{\text{PCA}}$-component PCA on each set of representations.
  3. Cluster the representations into $n_{C}$ clusters to discretize them.
  4. Compute the joint distribution $p(Y,T,S)$.
  5. Compute the PID components using the joint distribution.

For CIFAR-10, we use $n_{\text{PCA}}=n_{C}=10$. For CIFAR-100, we use $n_{\text{PCA}}=5$ and $n_{C}=3$. For the transfer learning setting with ImageNet and CUB-200-2011, the computation was infeasible because the intermediate representation dimensionality is extremely large.
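A minimal sketch of steps 1–4 of this pipeline is given below, assuming NumPy arrays of integer labels and of intermediate representations; the function and variable names are illustrative, and step 5 (the PID estimator of [28]) is not shown.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def discretized_joint(y, t_feats, s_feats, n_pca=10, n_clusters=10, seed=0):
    """Steps 1-4: flatten, PCA, cluster, and estimate p(Y, T, S) as a histogram."""
    def discretize(feats):
        flat = feats.reshape(len(feats), -1)                    # step 1: flatten
        reduced = PCA(n_components=n_pca).fit_transform(flat)   # step 2: PCA
        return KMeans(n_clusters=n_clusters, random_state=seed,
                      n_init=10).fit_predict(reduced)           # step 3: cluster
    t_lab, s_lab = discretize(t_feats), discretize(s_feats)
    joint = np.zeros((y.max() + 1, n_clusters, n_clusters))
    for yi, ti, si in zip(y, t_lab, s_lab):                     # step 4: joint counts
        joint[yi, ti, si] += 1
    return joint / joint.sum()

# Step 5: the PID components are then computed from this joint distribution
# with the estimator of [28] (not shown here).
```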

D.4 Results

Table 2 presents the accuracy of each framework on the CIFAR-100 dataset. Table 3 presents the results of the transfer learning experiment conducted with a teacher trained on ImageNet and students distilled and evaluated on the CUB-200-2011 dataset. In each scenario, RID exhibits more resilience to the untrained teacher than VID.

Table 2: Accuracy of each framework on the CIFAR-100 dataset when the teacher is trained vs. not trained. A subset of the training data (100 samples per class) was used for distillation.
Framework Trained Untrained
RID 50% 41%
VID 67% 20%
BAS 43% 43%
Table 3: Accuracy of each framework in the transfer learning setting (ImageNet teacher \rightarrow CUB-200-2011 student) when the teacher is trained vs. not trained.
Framework Trained Untrained
RID 65% 33%
VID 70% 11%
BAS 36% 36%