Quantifying Knowledge Distillation Using
Partial Information Decomposition
Abstract
Knowledge distillation enables deploying complex machine learning models in resource-constrained environments by training a smaller student model to emulate the internal representations of a complex teacher model. However, the teacher's representations can also encode nuisance or additional information that is not relevant to the downstream task. Distilling such irrelevant information can actually impede the performance of a capacity-limited student model. This observation motivates our primary question: What are the information-theoretic limits of knowledge distillation? To this end, we leverage Partial Information Decomposition (PID) to quantify and explain the transferred knowledge and the knowledge left to distill for a downstream task. We theoretically demonstrate that the task-relevant transferred knowledge is succinctly captured by the measure of redundant information about the task between the teacher and student. We propose a novel multi-level optimization that incorporates redundant information as a regularizer, leading to our framework of Redundant Information Distillation (RID). RID leads to more resilient and effective distillation under nuisance teachers, as it succinctly quantifies task-relevant knowledge rather than simply aligning the student and teacher representations.
1 Introduction
Modern-day machine learning requires large amounts of compute for both training and inference. Knowledge distillation [1, 2] can be used to compress a complex machine learning model (the teacher) by distilling it into a relatively simpler model (the student). The term “distillation” in this context means obtaining some assistance from the teacher while training the student so that the student performs much better than when trained alone (see Figure 1). In its earliest forms, knowledge distillation involved the student trying to match the output logits of the teacher [1]. More advanced methods focus on distilling multiple intermediate representations of the teacher to the corresponding layers of the student [2, 3, 4, 5]. We also refer the reader to [6, 7] for surveys.
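To make the earliest form of distillation mentioned above concrete, the following is a minimal PyTorch sketch of logit matching with temperature-softened targets in the spirit of [1]. The temperature and weighting values are illustrative assumptions, not values prescribed by this paper.

```python
import torch
import torch.nn.functional as F

def logit_distillation_loss(student_logits, teacher_logits, labels,
                            temperature=4.0, alpha=0.9):
    """Hinton-style knowledge distillation: soft-target KL + hard-label CE.

    `temperature` and `alpha` are illustrative hyperparameter choices.
    """
    # Soften both distributions with the temperature.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=1)

    # KL divergence on the softened distributions; the T^2 factor keeps
    # gradient magnitudes comparable across temperatures.
    kd_term = F.kl_div(log_soft_student, soft_teacher,
                       reduction="batchmean") * (temperature ** 2)

    # Ordinary cross-entropy on the ground-truth labels.
    ce_term = F.cross_entropy(student_logits, labels)

    return alpha * kd_term + (1.0 - alpha) * ce_term
```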
Information theory has been instrumental in both designing [3, 4] and explaining [8, 9] knowledge distillation techniques. However, less attention has been given to characterizing the fundamental limits of the process from an information-theoretic perspective. Our goal is to bridge this gap by first introducing new measures to quantify the “transferred knowledge” and the “knowledge to distill” for a teacher and a student model given a target downstream task. We bring in an emerging body of work called Partial Information Decomposition (PID) [10, 11, 12] to explain knowledge distillation. We define the knowledge to distill using the PID measure of “unique” information about the task that is available only in the teacher but not in the student. It then follows that the transferred knowledge is succinctly quantified by the measure of “redundant” information that is common between the teacher and the student.
We propose a multi-level optimization that maximizes redundant information (transferred knowledge) as a regularizer for more effective distillation. While PID has been explored in a few avenues of machine learning,

it has remained a challenge to maximize these measures as a regularizer since computing them itself requires solving an optimization. Our optimization leads to a novel knowledge distillation framework – Redundant Information Distillation (RID) – which precisely captures task-relevant knowledge and filters out the task-irrelevant information from the teacher. In summary, our main contributions are as follows:
• Quantifying transferred knowledge and knowledge to distill: Given a downstream task Y, and a teacher and a student model, we formally define the knowledge to distill as the unique information in the teacher (Definition 3.1) and the transferred knowledge as the redundant information (Definition 3.2). Through examples and theoretical results (Theorem 3.1), we first show that redundant information succinctly captures the task-related knowledge transferred to the student, as opposed to existing frameworks that directly align the teacher (T) and student (S) representations, e.g., Variational Information Distillation (VID) [3] maximizes the mutual information I(T; S). Theorem 3.1 points out a fundamental limitation of existing knowledge distillation frameworks for capacity-limited students: they blindly align the student and the teacher without precisely capturing the task-related knowledge.
• Maximizing redundant information as a regularizer: To alleviate this limitation, we propose a strategy to incorporate redundant information as a regularizer during model distillation so as to precisely maximize the task-relevant transferred knowledge. We first circumvent the challenge of computing the redundant information measure proposed in [12] by utilizing the quantity termed intersection information defined in [13], which we prove (in Theorem 4.1) to be a lower bound for redundant information. The significance of Theorem 4.1 is that it enables us to obtain an optimization formulation to maximize a lower bound of redundant information without making distributional assumptions, a contribution that is also of independent interest outside the domain of knowledge distillation.
• A novel knowledge distillation framework: We propose a new framework called Redundant Information Distillation (RID) whose distillation loss is tailored to maximize the redundant information (i.e., the transferred knowledge as per Definition 3.2). We carry out a number of experiments to demonstrate the advantage of this new framework over VID [3], an existing knowledge distillation framework that maximizes the mutual information between the teacher and the student representations (Section 5). Experiments are carried out on the CIFAR-10 and CIFAR-100 datasets, as well as in a transfer learning setting where the teacher is trained on the ImageNet dataset and transferred to students over the CUB-200-2011 dataset. Our framework explains knowledge distillation and shows more resilience under less-informative nuisance teachers.
Related Works: Multi-layer knowledge distillation was introduced in FitNets [2]. Since then, a large number of techniques, based on different statistics for matching a teacher–student pair, have been proposed. In particular, [3, 4, 14, 15] leverage an information-theoretic perspective to arrive at a solution (also see surveys [6, 7]). In this paper, we focus on VID [3] as a representative of the larger class of distillation frameworks which maximize I(T; S) as the distillation strategy. We also discuss Task-aware Layer-wise Distillation (TED) [5] as a framework that filters the teacher's representations to retain task-related information. Specifically, [5] highlight the importance of distilling only the task-related information when there is a significant complexity gap between the teacher and the student. Relatedly, [16] point out the existence of non-distillable classes due to the unmatched capacity of the student model. We discuss some more related works on knowledge distillation [17, 18, 19, 20, 21] in Appendix A.
Information theory has also been instrumental in explaining the success of knowledge distillation. [9] utilize information bottleneck principles [22, 23] to explain how a teacher model may assist the student in learning relevant features quickly. [8] view the training process as systematically discarding knowledge from the input; accordingly, distillation helps the student quickly learn what information to discard. Despite these attempts, we observe that there exists a gap in characterizing the fundamental limits of knowledge distillation, which we seek to address using the mathematical tool of PID.
PID is also beginning to generate interest in other areas of machine learning [24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41]. However, it has not been leveraged in the context of knowledge distillation before. Additionally, while most related works predominantly focus on efficiently computing PID, e.g., [40, 28, 36, 42], which itself requires solving an optimization over the joint distribution, there are limited works that further incorporate it as a regularizer during model training. [25] leverage Gaussian assumptions to obtain closed-form expressions for the PID terms, enabling them to use unique information as a regularizer during training for fairness (also see [38, 35] for more details on Gaussian PID). Our work makes novel connections between two notions of redundant information and demonstrates how PID can be integrated as a regularizer in a multi-level optimization without Gaussian assumptions, which could also be of independent interest outside the context of knowledge distillation.
2 Preliminaries
Background on PID: Partial Information Decomposition (PID), first introduced by [10], offers a way to decompose the joint information in two sources, say A and B, about another random variable Y (i.e., I(Y; (A, B)), where I(·; ·) denotes mutual information [43]) into four components as follows:
1. Unique information Uni(Y: A\B) and Uni(Y: B\A): information about Y that each source uniquely contains;
2. Redundant information Red(Y: A, B): the information about Y that both A and B share;
3. Synergistic information Syn(Y: A, B): the information about Y that can be recovered only by using both A and B.
See Figure 2 for a graphical representation. These PID components satisfy the relationships given below:

I(Y; (A, B)) = Uni(Y: A\B) + Uni(Y: B\A) + Red(Y: A, B) + Syn(Y: A, B)    (1)
I(Y; A) = Uni(Y: A\B) + Red(Y: A, B)    (2)
I(Y; B) = Uni(Y: B\A) + Red(Y: A, B)    (3)
While this system of equations cannot be solved to arrive at a deterministic definition for each PID term, defining only one of the terms is sufficient to define the rest. Consequently, a wide array of definitions exists, each based on different desired properties [10, 12, 11, 13]. Among these, the definition proposed in [12] is motivated by an operational interpretation of unique information from decision theory. Moving to the context of knowledge distillation, we map the source A to be the teacher representation T, the source B to be the student representation S, and the target to be the downstream task Y. That makes I(Y; T) and I(Y; S) the total knowledge about Y that is in the teacher and in the student, respectively.
Notation and Problem Setting: We consider a layer-wise distillation scheme where the teacher representation T is distilled into the student representation S, both computed from the input X. The target of the student is to predict the task Y from X. Both T and S are deterministic functions of X, and the randomness is due to the input X being random. Note that the student representation depends on the parameters of the student network, denoted by θ, and hence may be written as S_θ. However, when this parameterization and the dependence on X are irrelevant or obvious, we omit both and simply write S and T. We denote the supports of T and S by 𝒯 and 𝒮, respectively. In general, upper-case letters denote random variables, except P and Q, which represent probability distributions, C, H, and W, which stand for the representation dimensions, and K, which indicates the number of layers distilled. Lowercase letters are used for vectors unless specified otherwise, and lowercase Greek letters denote the parameters of neural networks.
Knowledge distillation is achieved by modifying the student loss function to include a distillation loss term L_KD in addition to the ordinary task-related loss L_task as follows:

L_student = L_task + β · L_KD    (4)

Here, β > 0 is a scalar weight on the distillation term. When the task at hand is classification, Y denotes the true class label and we use the cross-entropy loss, L_CE(Ŷ, Y) = −log Ŷ_Y, as the ordinary task-related loss for the student, where Ŷ is the student's final prediction of Y and Ŷ_Y is the probability it assigns to the true class. The teacher network is assumed to remain unmodified during the distillation process.
3 Explaining Knowledge Distillation
In this section, we propose information-theoretic metrics to quantify both the task-relevant information that is available in the teacher for distillation and the amount of information that has already been transferred to the student. We mathematically demonstrate favorable properties of our proposed measures in comparison to other candidate measures. Our results highlight the limitations of existing knowledge distillation frameworks that naively align the student with the teacher with no regard for task relevance.
Definition 3.1 (Knowledge to distill).
Let Y, S, and T be the target variable, the student's intermediate representation, and the teacher's intermediate representation, respectively. The knowledge to distill from T to S is defined as Uni(Y: T\S), the unique information about Y that is in T but not in S.
With the knowledge to distill defined as the unique information Uni(Y: T\S), we see that the more distillation happens, the more Uni(Y: T\S) shrinks. Note that in the knowledge distillation setting, the total knowledge of the teacher, I(Y; T), is constant since the teacher is not modified during the process. Since I(Y; T) = Uni(Y: T\S) + Red(Y: T, S) by (2), we therefore propose Red(Y: T, S) as a measure for the knowledge that has already been transferred.
Definition 3.2 (Transferred knowledge).
Let Y, S, and T be the target variable, the student's intermediate representation, and the teacher's intermediate representation, respectively. The transferred knowledge from T to S is defined as Red(Y: T, S), the redundant information about Y between T and S.
We leverage the unique and redundant information definitions given by [12] for an exact quantification of these quantities.
Definition 3.3 (Unique and redundant information [12]).
Let P be the joint distribution of (Y, T, S), and let Δ be the set of all joint distributions over the joint support of (Y, T, S). Then,

Uni(Y: T\S) = min_{Q ∈ Δ_P} I_Q(Y; T | S)    (5)
Red(Y: T, S) = I(Y; T) − min_{Q ∈ Δ_P} I_Q(Y; T | S)    (6)

where Δ_P = {Q ∈ Δ : Q(Y = y, T = t) = P(Y = y, T = t) and Q(Y = y, S = s) = P(Y = y, S = s) for all y, t, s}, i.e., Δ_P is the set of all joint distributions whose marginals over the pairs (Y, T) and (Y, S) equal those of P, and I_Q denotes mutual information evaluated under the distribution Q.
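For small discrete variables, the optimization over Δ_P in Definition 3.3 can be carried out directly. The following is a brute-force illustration we add here (not the estimator used in the paper's experiments, which follows [28]); the function names and the use of scipy's SLSQP solver are our own choices for this sketch.

```python
import numpy as np
from scipy.optimize import minimize

def cond_mi_y_t_given_s(q):
    """I_Q(Y; T | S) in bits for a joint pmf q with axes ordered (y, t, s)."""
    eps = 1e-12
    q_s = q.sum(axis=(0, 1))        # Q(s)
    q_ys = q.sum(axis=1)            # Q(y, s)
    q_ts = q.sum(axis=0)            # Q(t, s)
    ratio = (q * q_s[None, None, :] + eps) / (q_ys[:, None, :] * q_ts[None, :, :] + eps)
    return float(np.sum(q * np.log2(ratio + eps)))

def broja_uni_and_red(p):
    """Uni(Y: T\\S) and Red(Y: T, S) per Definition 3.3 for a small discrete
    joint pmf p with axes (y, t, s). A brute-force sketch only."""
    ny, nt, ns = p.shape
    p_yt = p.sum(axis=2)            # fixed (Y, T) marginal
    p_ys = p.sum(axis=1)            # fixed (Y, S) marginal

    def objective(x):
        return cond_mi_y_t_given_s(x.reshape(ny, nt, ns))

    def marginal_constraints(x):
        q = x.reshape(ny, nt, ns)
        return np.concatenate([(q.sum(axis=2) - p_yt).ravel(),
                               (q.sum(axis=1) - p_ys).ravel()])

    x0 = p.ravel()                  # the true joint is always feasible
    res = minimize(objective, x0, method="SLSQP",
                   bounds=[(0.0, 1.0)] * p.size,
                   constraints=[{"type": "eq", "fun": marginal_constraints}],
                   options={"maxiter": 500, "ftol": 1e-10})
    uni_t = res.fun                 # Uni(Y: T\\S), Eq. (5)

    eps = 1e-12                     # I_P(Y; T) from the fixed (Y, T) marginal
    i_yt = float(np.sum(p_yt * np.log2(
        (p_yt + eps) / (p_yt.sum(1, keepdims=True) * p_yt.sum(0, keepdims=True) + eps))))
    red = i_yt - uni_t              # Red(Y: T, S), Eq. (6)
    return uni_t, red
```

For instance, if T = Y and S = Y (a fully informative teacher and student), the sketch returns Uni(Y: T\S) ≈ 0 and Red(Y: T, S) ≈ H(Y), as the definitions require.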
Comparison to Existing Approaches for Knowledge Distillation: A multitude of knowledge distillation frameworks exist which are based on maximizing the mutual information between the teacher and the student, i.e., I(T; S) [3, 4, 14, 15]. While a distillation loss that maximizes I(T; S) can be helpful to the student when the teacher possesses task-related information, we show that it creates a tension with the ordinary loss when the teacher has little or no task-relevant information. Moreover, even when the teacher contains task-related information, the limited capacity of the student may hinder proper distillation under this kind of framework. The following examples provide critical insights exposing the limitations of I(T; S). Our proposed measure resolves these cases in an explainable manner by succinctly capturing task-relevant knowledge.
Example 1: (Uninformative teacher) An uninformative teacher representation (i.e., one with I(Y; T) = 0) gives Red(Y: T, S) = 0 for any S, agreeing with intuition. Hence, an algorithm that maximizes exactly the transferred knowledge will have a zero gradient over this term. In contrast, algorithms that maximize the similarity between T and S as quantified by I(T; S) will force S to mimic the uninformative teacher, causing performance worse than ordinary training without distillation. As a simplified example, let the teacher representation T be independent of the task Y. Then, the teacher cannot predict the intended task Y. Note that, in this case, I(T; S) is not maximized when the student representation is fully task-relevant (e.g., S = Y); instead, it is maximized when S mimics T.
Example 2: (Extra-complex teacher) Let T = (Y, N), where N is a nuisance variable independent of Y. Then, the teacher can completely predict the intended task Y. Assume the student is simpler than the teacher and has only one binary output. In this situation, I(T; S) need not be maximized when S = Y: whenever H(N) > H(Y), a student that encodes the nuisance N attains a strictly larger I(T; S) than S = Y does. However, S = Y is a maximizer for Red(Y: T, S) (i.e., it achieves Red(Y: T, S) = H(Y)). Theorem 3.1 presents a more general case.
Theorem 3.1 formally exposes the limitations of maximizing I(T; S) for capacity-limited students, as I(T; S) does not emphasize task-related knowledge.
Theorem 3.1 (Teacher with nuisance).
Let T = (T1, T2), where T1 contains all the task-related information (i.e., I(Y; T) = I(Y; T1)) and T2 does not contain any information about the task (i.e., I(Y; T2) = 0). Let the student be a capacity-limited model, as defined by an upper bound on H(S), where H(·) denotes the Shannon entropy of a random variable. Then,

(i) I(T; S) is maximized when the condition given in (7) holds; this condition does not require the student representation S to carry any task-related information;

(ii) Red(Y: T, S) is always maximized by a student representation S that carries only task-related information.
The uninformative random variable T2 here can be seen as a stronger version of the nuisance defined in [44, Section 2.2]. In the above scenario, the task-related part of the student loss is in tension with the distillation loss whenever maximizing I(T; S) pushes S towards the nuisance T2, in which case the distillation actually adversely affects the student. In contrast, a loss that maximizes Red(Y: T, S) will always be aligned with the task-related loss.
These examples show that frameworks based on maximizing I(T; S) are not capable of selectively distilling the task-related information to the student. In the extreme case, they are not robust to being distilled from a corrupted teacher network. This is demonstrated in the experiments in Section 5.
It may appear that using the information gain (conditional mutual information) I(Y; T | S) as the measure of the amount of knowledge available to distill resolves cases similar to Example 1. However, Example 3 below provides a counterexample.
Example 3: (Effect of synergy) Consider a scenario similar to Example 1, where the teacher is uninformative regarding the task of interest. For example, let T and S be independent uniform bits and let Y = T ⊕ S, where ⊕ denotes the binary XOR operation. Suppose we were to consider the conditional mutual information I(Y; T | S) as the measure of the amount of knowledge available to distill from the teacher. Then, I(Y; T | S) = 1 bit, indicating non-zero knowledge available to distill in the teacher. This is counter-intuitive since, in this case, I(Y; T) = I(Y; S) = 0 and neither the teacher nor the student alone can be used to predict Y. In contrast, the proposed measures give Uni(Y: T\S) = Red(Y: T, S) = 0, indicating no available or transferred knowledge.
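The mutual-information values quoted in Example 3 can be verified with a few lines of numpy; the choice of T and S as i.i.d. uniform bits with Y = T ⊕ S is our instantiation of the XOR construction described above.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a (possibly multi-dimensional) pmf."""
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# Joint pmf p[y, t, s] for T, S i.i.d. uniform bits and Y = T XOR S.
p = np.zeros((2, 2, 2))
for t in (0, 1):
    for s in (0, 1):
        p[t ^ s, t, s] = 0.25

p_y, p_t, p_s = p.sum(axis=(1, 2)), p.sum(axis=(0, 2)), p.sum(axis=(0, 1))
p_yt, p_ys, p_ts = p.sum(axis=2), p.sum(axis=1), p.sum(axis=0)

i_y_t  = entropy(p_y) + entropy(p_t) - entropy(p_yt)    # I(Y;T)  = 0
i_y_s  = entropy(p_y) + entropy(p_s) - entropy(p_ys)    # I(Y;S)  = 0
i_y_ts = entropy(p_y) + entropy(p_ts) - entropy(p)      # I(Y;(T,S)) = 1 bit
i_y_t_given_s = i_y_ts - i_y_s                          # I(Y;T|S) = 1 bit

print(i_y_t, i_y_s, i_y_ts, i_y_t_given_s)              # -> 0.0 0.0 1.0 1.0
```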
Next, we present Theorem 3.2 which highlights some important properties of the proposed metrics. These properties indicate that the proposed measures agree well with our intuition for explaining distillation.
Theorem 3.2 (Properties).
The following properties hold for the knowledge to distill and the transferred knowledge, defined as in Definition 3.1 and Definition 3.2, respectively.

(i) Uni(Y: T\S) and Red(Y: T, S) are non-negative.

(ii) When Uni(Y: T\S) = 0, the teacher has zero knowledge available for distillation. At this point, the student has the maximum information that any one of the representations T or S has about Y; i.e., I(Y; S) = max{I(Y; T), I(Y; S)}.
(iii) For a given student representation S and any two teacher representations T and T′, if there exists a deterministic mapping g such that T′ = g(T), then Uni(Y: T′\S) ≤ Uni(Y: T\S).
4 A Framework For Maximizing Transferred Knowledge


In this section, we propose a distillation framework – Redundant Information Distillation (RID) – which maximizes the transferred knowledge, with a focus on classification problems. We first show that our measure of transferred knowledge is lower-bounded by an alternative definition of redundant information (also called the intersection information, denoted by I∩). Next, we leverage this lower bound to develop a multi-level optimization framework to selectively distill task-relevant knowledge.
Definition 4.1 (I∩ measure [13]).

I∩(Y: T, S) = max over random variables Q of I(Y; Q), subject to I(Y; Q | T) = 0 and I(Y; Q | S) = 0.    (8)
Next we show, in Theorem 4.1, that I∩(Y: T, S) is a lower bound for Red(Y: T, S). The significance of this result is that maximizing the lower bound would also increase Red(Y: T, S).
Theorem 4.1 (Transferred knowledge lower bound). For any random variables Y, T, and S, we have I∩(Y: T, S) ≤ Red(Y: T, S).
Next, we discuss our proposed approach for maximizing this lower bound (also see Figure 3 and Algorithm 1 for our complete strategy). Our proposed framework is based on selecting Q in Definition 4.1 to be f_ψ(T), the output of a task-aware filter applied to the teacher representation, and parameterizing this filter f along with a student-side filter g using small neural networks. To denote the parameterization, we will occasionally use the elaborated notation f_ψ and g_φ, where ψ and φ denote the parameters of f and g, respectively. With the substitution Q = f_ψ(T), Definition 4.1 results in the optimization problem given below:

maximize_{ψ, φ}  I(Y; f_ψ(T))   subject to   I(Y; f_ψ(T) | g_φ(S)) = 0    (P1)
We divide the problem (P1) into two phases and employ gradient descent on two carefully designed loss functions to perform the optimization. In the first phase, we maximize the objective w.r.t. ψ while φ and θ are kept constant (recall that T is fixed in all cases because the teacher is not being trained during the process). For this, we append an additional classification head h_ω, parameterized by ω, to the teacher's task-aware filter f_ψ. We then minimize the following loss function with respect to ψ and ω:
L1(ψ, ω) = L_CE( h_ω(f_ψ(T)), Y ) + (1/(C·H·W)) Σ_{c,h,w} σ_c ( f_ψ(T)_{c,h,w} − g_φ(S)_{c,h,w} )²    (10)
where σ_c denotes the corresponding element of the weight vector σ, and f_ψ(T)_{c,h,w} and g_φ(S)_{c,h,w} index the filter outputs. Here, C, H, and W are the number of channels, height, and width of the outputs of f_ψ and g_φ. σ is a stand-alone vector of positive weights that is optimized in the second phase. Minimizing the cross-entropy term of L1 above amounts to maximizing I(Y; f_ψ(T)). The second term prohibits f_ψ(T) from diverting too far from g_φ(S) during the process, so that the constraint in (P1) can be ensured.
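Before moving to the second phase, the following is a minimal PyTorch sketch of this phase-1 objective as reconstructed above; the module names (teacher_filter, student_filter, head) and the flatten-then-linear head are our own placeholders, not the released implementation.

```python
import torch
import torch.nn.functional as F

def rid_phase1_loss(teacher_feat, student_feat, labels,
                    teacher_filter, student_filter, head, log_sigma):
    """Phase 1 of RID (sketch of Eq. (10)): train the teacher-side task-aware
    filter and its classification head to be predictive of Y, while a
    sigma-weighted MSE keeps its output close to the student-side filter so
    that the constraint in (P1) remains enforceable. Only `teacher_filter`
    and `head` receive gradients in this phase."""
    f_t = teacher_filter(teacher_feat)                 # f_psi(T), shape (B, C, H, W)
    with torch.no_grad():
        g_s = student_filter(student_feat)             # g_phi(S), frozen in phase 1

    # Cross-entropy term: encourages f_psi(T) to be informative about Y.
    ce = F.cross_entropy(head(f_t.flatten(1)), labels)

    # sigma-weighted mean squared error; sigma (one positive weight per
    # channel) is held fixed here and optimized in phase 2.
    sigma = torch.exp(log_sigma).detach().view(1, -1, 1, 1)
    match = (sigma * (f_t - g_s) ** 2).mean()

    return ce + match
```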
During the second phase, we freeze ψ and ω and maximize the objective over θ, φ, and σ. The loss function employed in this phase is as follows:
L2(θ, φ, σ) = β1 · L_CE(Ŷ, Y) + β2 · Σ_{c=1}^{C} log(1/σ_c) + β2 · L_D    (11)

where

L_D = (1/(C·H·W)) Σ_{c,h,w} σ_c ( f_ψ(T)_{c,h,w} − g_φ(S_θ)_{c,h,w} )²    (12)
and β1 and β2 are scalar hyperparameters which determine the prominence of ordinary learning (through the cross-entropy term) and distillation (through the remaining terms). C, H, W, and σ are as defined earlier. Ŷ denotes the final prediction of the student network.
The first term of the loss function is the ordinary task-related loss. The next two terms correspond to the distillation loss, which is our focus in the following explanation. Consider phase 2 as an estimation problem that minimizes the σ-weighted mean squared error, where f_ψ(T) is the estimand and g_φ(S) is the estimator. The magnitudes of the positive weights σ_c are controlled using the log(1/σ_c) term. We observe that this optimization ensures I(Y; f_ψ(T) | g_φ(S)) = 0, given that the following assumption holds.
Assumption: Let the estimation error be E = f_ψ(T) − g_φ(S). Assume I(Y; E | g_φ(S)) = 0, i.e., given the estimate, the estimation error is independent of Y.
With the above assumption, we see that

I(Y; f_ψ(T) | g_φ(S)) = I(Y; g_φ(S) + E | g_φ(S)) = I(Y; E | g_φ(S)) = 0.    (13)
Therefore, the constraint in problem (P1) is satisfied by this selection of random variables, and, together with the maximization of I(Y; f_ψ(T)) during phase 1, the proposed framework can be seen as performing the optimization in Definition 4.1 in two steps.
This, along with Theorem 4.1, completes our claim that the proposed framework maximizes a lower bound for the transferred knowledge. The RID loss terms can be extended to multiple layers as follows: replace the single matching term in (10) with a sum of such terms, one for each distilled pair of layers, and similarly replace L_D in (12) with the sum of the corresponding per-layer terms. The framework is summarized in Algorithm 1, and a sketch of the resulting two-phase procedure is given below. The advantage of RID over the VID framework [3], which maximizes I(T; S), can be observed in the experiments detailed in Section 5.
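The following sketch covers the phase-2 objective (Eqs. (11)–(12) as reconstructed here) and the alternating schedule of Algorithm 1, under the same placeholder names as the phase-1 sketch; β1, β2, and the epoch counts are assumptions.

```python
import torch
import torch.nn.functional as F

def rid_phase2_loss(student_logits, teacher_feat, student_feat, labels,
                    teacher_filter, student_filter, log_sigma,
                    beta1=1.0, beta2=1.0):
    """Phase 2 of RID (sketch of Eqs. (11)-(12)): ordinary task loss plus a
    sigma-weighted regression of g_phi(S) onto the frozen f_psi(T); the
    log(1/sigma) term keeps the positive weights from collapsing to zero.
    Gradients flow to the student, its filter g_phi, and log_sigma."""
    with torch.no_grad():
        f_t = teacher_filter(teacher_feat)              # estimand f_psi(T), frozen
    g_s = student_filter(student_feat)                  # estimator g_phi(S)

    sigma = torch.exp(log_sigma).view(1, -1, 1, 1)      # positive per-channel weights
    task = F.cross_entropy(student_logits, labels)      # ordinary task loss
    weight_reg = -log_sigma.sum()                       # sum_c log(1/sigma_c)
    weighted_mse = (sigma * (f_t - g_s) ** 2).mean()    # single-layer version of Eq. (12)

    return beta1 * task + beta2 * (weight_reg + weighted_mse)


def run_rid(num_epochs, warmup_epochs, cycle_len, phase1_len,
            train_phase1_epoch, train_phase2_epoch):
    """Alternating schedule (sketch of Algorithm 1): warm up the teacher-side
    filters with cross-entropy only, then cycle between the two phases.
    `warmup_epochs`, `cycle_len`, and `phase1_len` are assumed hyperparameters."""
    for epoch in range(num_epochs):
        if epoch < warmup_epochs:
            train_phase1_epoch(epoch)                   # CE-only warm-up of f_psi
        elif (epoch - warmup_epochs) % cycle_len < phase1_len:
            train_phase1_epoch(epoch)                   # update psi, omega
        else:
            train_phase2_epoch(epoch)                   # update theta, phi, sigma
```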
Remark.
We note that the Large Language Model (LLM) fine-tuning framework of Task-aware Layer-wise Distillation (TED) [5] shares intuitive similarities with RID regarding distilling task-related knowledge. However, TED takes a heuristic approach when designing the framework. In fact, our mathematical formulation can explain the success of TED, as detailed in Appendix C. Beyond the application domain, the difference between TED and RID can mainly be attributed to the following:
(i) During the first stage, TED trains both the teacher-side and the student-side task-aware filters, whereas RID only trains the teacher-side filter f_ψ.

(ii) In the second-stage loss, TED includes an ordinary mean squared error term, whereas RID includes a weighted (using σ) mean squared error term.
To the best of our knowledge, our work is the first to information-theoretically quantify the actual task-relevant transferred knowledge and formally incorporate it into an optimization.
5 Empirical Validation
Setup: We compare the performance of the proposed RID framework with (i) VID and (ii) our adaptation of TED to this domain, under two different conditions. In the first setting, the teacher network is fully trained with the complete training set, whereas in the second setting, the teacher is just randomly initialized without any training at all. Experiments are carried out on the CIFAR-10 [45], CIFAR-100 [46], ImageNet [47], and CUB-200-2011 [48] datasets (with the last two used in a transfer learning setup). Additionally, we train a student without any knowledge distillation, which we label as “BAS”. Table 1 provides details of the model architectures used in each setting.
Dataset | Teacher | Student
--- | --- | ---
CIFAR-10 | WRN-(40,2) | WRN-(16,1)
CIFAR-100 | WRN-(28,10) | WRN-(16,8)
ImageNet to CUB-200-2011 | ResNet-34 | ResNet-18
We distill three layers from the teacher to the corresponding student layers. In the case of RID, each teacher layer has its own filter f_ψ with a separate parameter vector ψ, and the student filters g_φ are parameterized in a similar manner. Moreover, each teacher filter has its own classification head with parameters ω. All the student representations are parameterized by the complete student weight vector θ. In the beginning, the teacher filters are trained for a number of warm-up epochs with just the cross-entropy loss. Then, the optimization alternates between the first and the second phases, with each cycle taking a fixed number of epochs in total; within a cycle, phase 1 is carried out for the first few epochs, followed by phase 2 for the rest of the epochs (see Algorithm 1). Values of all the hyperparameters are given in Appendix D. TED is implemented similarly, except that the student filters also have classification heads (see Appendix C). Since TED acts as a fine-tuning method, we use the BAS student as the base model to apply TED on.
In order to evaluate the validity of Definitions 3.1 and 3.2, we compute the PID components [12] of the joint mutual information I(Y; (T, S)) at the innermost distilled layer using the estimation method given in [28]. See Appendix D for additional details.
Results: Figure 4 shows the classification accuracy on the CIFAR-10 dataset for each student model (RID, VID, and TED) when distilled with either a trained teacher or an untrained one. It also shows the classification accuracies of the baseline model BAS (without any distillation) and the trained teacher. Figures 5(a) and 5(b) present the PID values and the marginal mutual information estimates for the teacher and the student models for CIFAR-10 and CIFAR-100, respectively. The tables in Appendix D.4 present the classification accuracies for CIFAR-100 and for the transfer learning setup ImageNet → CUB-200-2011, respectively.



Discussion: From Figure 4, we see that both RID and VID perform equally well when the teacher is trained. However, the performance degradation of VID under the untrained teacher is significantly larger than that of RID. This can be attributed to the VID student naively trying to mimic the teacher. RID leads to more resilient and effective distillation under nuisance teachers as it succinctly quantifies task-relevant knowledge rather than simply aligning student and teacher representations. TED, despite being rather unstable, performs very close to the baseline model in both cases. As evident from Figure 5, in the case of the trained teacher (for which I(Y; T) is large), we observe that the knowledge available for distillation in the teacher, Uni(Y: T\S), decreases with the increasing number of epochs. Consequently, the amount of transferred knowledge, Red(Y: T, S), increases. When the teacher is not trained (i.e., I(Y; T) ≈ 0), both quantities remain near zero, as expected. Both the BAS and RID models show an increase in I(Y; S) even when distilled from the untrained teacher. However, in this case, VID shows a very low I(Y; S), as expected, caused by the distillation loss forcing S to mimic the teacher. A Python implementation of the experiments is available at https://github.com/pasandissanayake/kd-rid.
6 Conclusion
With the growing interest in knowledge distillation, our work provides critical insights into the explainability of knowledge distillation. We propose using the unique information Uni(Y: T\S) to quantify the knowledge available in a teacher model for distillation w.r.t. a student and a downstream task. This, in turn, leads to defining the amount of knowledge that has been distilled to the student as the redundant information Red(Y: T, S). We show that knowledge distillation frameworks which use the mutual information between the teacher and the student representations have a fundamental problem: these frameworks force the student to mimic the teacher regardless of the usefulness of the teacher's information for the task at hand. In contrast, through several examples we demonstrate that the proposed metrics correctly characterize the amounts of knowledge available for distillation and of already transferred knowledge. Moreover, we show the advantage of the proposed metric by implementing a new distillation framework – Redundant Information Distillation (RID) – and comparing its performance with the existing technique VID [3]. While VID and RID perform similarly when the teacher is well-trained for the downstream task, the performance of VID degrades significantly when the teacher is not trained, whereas RID still performs close to a student model trained without distillation.
Limitations and future work: While the RID framework uses an alternative definition for redundant information, computing the exact Red(Y: T, S) during training can be computationally prohibitive due to the optimization over Δ_P. Moreover, characterizing the extent to which the assumption in Section 4 holds is not explored in this work. Extending the mathematical formulation in Section 4 to analyze other knowledge distillation frameworks is an interesting path for future research. Other potential research directions include: (i) distilling from an ensemble of teachers [49] in a way that mitigates the adverse effects of corrupted teachers; (ii) dataset distillation [50]; and (iii) distillation for model reconstruction from counterfactual explanations [51]. Incorporating fundamentally different definitions of the PID components, such as [41] which provides explicit formulae, as regularizers can also be interesting.
Acknowledgments: This work was supported in part by NSF CAREER Award 2340006 and Northrop Grumman Seed Grant.
References
- [1] Geoffrey Hinton “Distilling the Knowledge in a Neural Network” In arXiv preprint arXiv:1503.02531, 2015
- [2] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta and Yoshua Bengio “Fitnets: Hints for thin deep nets” In ICLR, 2015
- [3] Sungsoo Ahn, Shell Xu Hu, Andreas Damianou, Neil D Lawrence and Zhenwen Dai “Variational information distillation for knowledge transfer” In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 9163–9171
- [4] Yonglong Tian, Dilip Krishnan and Phillip Isola “Contrastive representation distillation” In ICLR, 2020
- [5] Chen Liang, Simiao Zuo, Qingru Zhang, Pengcheng He, Weizhu Chen and Tuo Zhao “Less is more: Task-aware layer-wise distillation for language model compression” In International Conference on Machine Learning, 2023, pp. 20852–20867 PMLR
- [6] Jianping Gou, Baosheng Yu, Stephen J Maybank and Dacheng Tao “Knowledge distillation: A survey” In International Journal of Computer Vision 129.6 Springer, 2021, pp. 1789–1819
- [7] Ilia Sucholutsky et al. “Getting aligned on representational alignment” In arXiv preprint arXiv:2310.13018, 2023
- [8] Quanshi Zhang, Xu Cheng, Yilan Chen and Zhefan Rao “Quantifying the knowledge in a DNN to explain knowledge distillation for classification” In IEEE Transactions on Pattern Analysis and Machine Intelligence 45.4 IEEE, 2022, pp. 5099–5113
- [9] Chaofei Wang, Qisen Yang, Rui Huang, Shiji Song and Gao Huang “Efficient knowledge distillation from model checkpoints” In Advances in Neural Information Processing Systems 35, 2022, pp. 607–619
- [10] Paul L Williams and Randall D Beer “Nonnegative decomposition of multivariate information” In arXiv preprint arXiv:1004.2515, 2010
- [11] Virgil Griffith, Edwin K. P. Chong, Ryan G. James, Christopher J. Ellison and James P. Crutchfield “Intersection Information Based on Common Randomness” In Entropy 16.4, 2014, pp. 1985–2000
- [12] Nils Bertschinger, Johannes Rauh, Eckehard Olbrich, Jürgen Jost and Nihat Ay “Quantifying unique information” In Entropy 16.4 Multidisciplinary Digital Publishing Institute, 2014, pp. 2161–2183
- [13] Virgil Griffith and Tracey Ho “Quantifying redundant information in predicting a target random variable” In Entropy 17.7 MDPI, 2015, pp. 4644–4653
- [14] Liqun Chen, Dong Wang, Zhe Gan, Jingjing Liu, Ricardo Henao and Lawrence Carin “Wasserstein contrastive representation distillation” In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 16296–16305
- [15] Roy Miles, Adrian Lopez Rodriguez and Krystian Mikolajczyk “Information theoretic representation distillation” In arXiv preprint arXiv:2112.00459, 2021
- [16] Yichen Zhu, Ning Liu, Zhiyuan Xu, Xin Liu, Weibin Meng, Louis Wang, Zhicai Ou and Jian Tang “Teach less, learn more: On the undistillable classes in knowledge distillation” In Advances in Neural Information Processing Systems 35, 2022, pp. 32011–32024
- [17] Souvik Kundu, Qirui Sun, Yao Fu, Massoud Pedram and Peter Beerel “Analyzing the confidentiality of undistillable teachers in knowledge distillation” In Advances in Neural Information Processing Systems 34, 2021, pp. 9181–9192
- [18] Dae Young Park, Moon-Hyun Cha, Daesin Kim and Bohyung Han “Learning student-friendly teacher networks for knowledge distillation” In Advances in neural information processing systems 34, 2021, pp. 13292–13303
- [19] Emanuel Ben-Baruch, Matan Karklinsky, Yossi Biton, Avi Ben-Cohen, Hussam Lawen and Nadav Zamir “It’s all in the head: Representation knowledge distillation through classifier sharing” In arXiv preprint arXiv:2201.06945, 2022
- [20] Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang and Qun Liu “Tinybert: Distilling bert for natural language understanding” In arXiv preprint arXiv:1909.10351, 2019
- [21] Wei Huang, Zhiliang Peng, Li Dong, Furu Wei, Jianbin Jiao and Qixiang Ye “Generic-to-specific distillation of masked autoencoders” In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 15996–16005
- [22] Naftali Tishby, Fernando C Pereira and William Bialek “The information bottleneck method” In arXiv preprint physics/0004057, 2000
- [23] Naftali Tishby and Noga Zaslavsky “Deep learning and the information bottleneck principle” In 2015 ieee information theory workshop (itw), 2015, pp. 1–5 IEEE
- [24] Sanghamitra Dutta, Praveen Venkatesh, Piotr Mardziel, Anupam Datta and Pulkit Grover “An information-theoretic quantification of discrimination with exempt features” In Proceedings of the AAAI Conference on Artificial Intelligence, 2020, pp. 3825–3833
- [25] Sanghamitra Dutta, Praveen Venkatesh, Piotr Mardziel, Anupam Datta and Pulkit Grover “Fairness under feature exemptions: Counterfactual and observational measures” In IEEE Transactions on Information Theory 67.10 IEEE, 2021, pp. 6675–6710
- [26] Sanghamitra Dutta and Faisal Hamman “A review of partial information decomposition in algorithmic fairness and explainability” In Entropy 25.5 MDPI, 2023, pp. 795
- [27] Faisal Hamman and Sanghamitra Dutta “Demystifying Local and Global Fairness Trade-offs in Federated Learning Using Partial Information Decomposition” In International Conference on Learning Representations, 2024
- [28] Paul Pu Liang et al. “Quantifying & modeling multimodal interactions: An information decomposition framework” In Advances in Neural Information Processing Systems 36, 2024
- [29] Paul Pu Liang et al. “Multimodal Learning Without Labeled Multimodal Data: Guarantees and Applications” In The Twelfth International Conference on Learning Representations, 2024
- [30] F. Hamman and S. Dutta “A Unified View of Group Fairness Tradeoffs Using Partial Information Decomposition” In IEEE International Symposium on Information Theory (ISIT), 2024, pp. 214–219
- [31] Tycho Tax, Pedro Mediano and Murray Shanahan “The partial information decomposition of generative neural network models” In Entropy 19.9 Multidisciplinary Digital Publishing Institute, 2017, pp. 474
- [32] David A Ehrlich, Andreas C Schneider, Michael Wibral, Viola Priesemann and Abdullah Makkeh “Partial information decomposition reveals the structure of neural representations” In arXiv preprint arXiv:2209.10438, 2022
- [33] Patricia Wollstadt, Sebastian Schmitt and Michael Wibral “A Rigorous Information-Theoretic Definition of Redundancy and Relevancy in Feature Selection Based on (Partial) Information Decomposition.” In J. Mach. Learn. Res. 24, 2023, pp. 131–1
- [34] Salman Mohamadi, Gianfranco Doretto and Donald A Adjeroh “More Synergy, Less Redundancy: Exploiting Joint Mutual Information for Self-Supervised Learning” In arXiv preprint arXiv:2307.00651, 2023
- [35] Praveen Venkatesh, Corbett Bennett, Sam Gale, Tamina Ramirez, Greggory Heller, Severine Durand, Shawn Olsen and Stefan Mihalas “Gaussian partial information decomposition: Bias correction and application to high-dimensional data” In Advances in Neural Information Processing Systems 36, 2024
- [36] B. Halder, F. Hamman, P. Dissanayake, Q. Zhang, I. Sucholutsky and S. Dutta “Quantifying Spuriousness of Biased Datasets Using Partial Information Decomposition” In ICML Workshop on Data-centric Machine Learning Research (DMLR): Datasets for Foundation Models, 2024
- [37] Shaurya Dewan, Rushikesh Zawar, Prakanshul Saxena, Yingshan Chang, Andrew Luo and Yonatan Bisk “Diffusion PID: Interpreting Diffusion via Partial Information Decomposition” In Advances in Neural Information Processing Systems 37, 2024, pp. 2045–2079
- [38] Praveen Venkatesh and Gabriel Schamberg “Partial information decomposition via deficiency for multivariate gaussians” In 2022 IEEE International Symposium on Information Theory (ISIT), 2022, pp. 2892–2897 IEEE
- [39] Chaitanya Goswami and Amanda Merkley “Analytically deriving Partial Information Decomposition for affine systems of stable and convolution-closed distributions” In Advances in Neural Information Processing Systems 37, 2024, pp. 86749–86835
- [40] Michael Kleinman, Alessandro Achille, Stefano Soatto and Jonathan C Kao “Redundant information neural estimation” In Entropy 23.7 MDPI, 2021, pp. 922
- [41] Aobo Lyu, Andrew Clark and Netanel Raviv “Explicit Formula for Partial Information Decomposition” In arXiv preprint arXiv:2402.03554, 2024
- [42] Ari Pakman, Amin Nejatbakhsh, Dar Gilboa, Abdullah Makkeh, Luca Mazzucato, Michael Wibral and Elad Schneidman “Estimating the unique information of continuous variables” In Advances in neural information processing systems 34, 2021, pp. 20295–20307
- [43] Thomas M. Cover and Joy A. Thomas “Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing)” USA: Wiley-Interscience, 2006
- [44] Alessandro Achille and Stefano Soatto “Emergence of invariance and disentanglement in deep representations” In Journal of Machine Learning Research 19.50, 2018, pp. 1–34
- [45] Alex Krizhevsky, Vinod Nair and Geoffrey Hinton “CIFAR-10 (Canadian Institute for Advanced Research)”, 2009 URL: http://www.cs.toronto.edu/~kriz/cifar.html
- [46] Alex Krizhevsky, Vinod Nair and Geoffrey Hinton “CIFAR-100 (Canadian Institute for Advanced Research)”, 2009 URL: http://www.cs.toronto.edu/~kriz/cifar.html
- [47] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li and Li Fei-Fei “Imagenet: A large-scale hierarchical image database” In 2009 IEEE conference on computer vision and pattern recognition, 2009, pp. 248–255 IEEE
- [48] Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona and Serge Belongie “The Caltech-UCSD Birds-200-2011 Dataset” California Institute of Technology, 2011
- [49] Andrey Malinin, Bruno Mlodozeniec and Mark Gales “Ensemble Distribution Distillation” In International Conference on Learning Representations, 2020
- [50] I. Sucholutsky and M. Schonlau “Soft-Label Dataset Distillation and Text Dataset Distillation” In 2021 International Joint Conference on Neural Networks (IJCNN), 2021, pp. 1–8
- [51] P. Dissanayake and S. Dutta “Model Reconstruction Using Counterfactual Explanations: A Perspective From Polytope Theory” In Advances in Neural Information Processing Systems (NeurIPS), 2024
- [52] Pradeep Kr Banerjee, Eckehard Olbrich, Jürgen Jost and Johannes Rauh “Unique informations and deficiencies” In 2018 56th Annual Allerton Conference on Communication, Control, and Computing (Allerton), 2018, pp. 32–38 IEEE
Appendix A Additional Related Works
The nature of the teacher model plays an important role in distillation. [17] present a distillation scheme to distill from a teacher which has been intentionally made difficult-to-distill. They distill the penultimate layer of the teacher to several intermediate representations of the student through auxiliary student branches. [18] prepare the teacher during its training phase so that it can later be used for better distillation when training a student. This is done through training an augmented version of the teacher formed by adding several student branches, all of which are trained together. [19] propose two novel frameworks: Teacher-Head Knowledge Distillation (TH-KD) and Student-Head Knowledge Distillation (SH-KD). In TH-KD, the teacher’s classification head is attached to the student and the distillation loss is computed as a weighted sum of the discrepancies of predictions from the two heads of the student with respect to the teacher’s predictions. SH-KD involves three steps: (i) a student is trained in the conventional way; (ii) the classification head of the student is fixed to the teacher and the teacher is trained while keeping the classification head frozen; and (iii) this teacher model is used to train a new student using conventional distillation. The assumption is that the teacher trained with the student classification head will adapt the teacher to transfer the knowledge better suited to the capacity-limited student. [20] propose an interesting knowledge distillation framework for compressing the well-known BERT models into a much smaller version called TinyBERT. The process consists of two steps. During the first step, a generic TinyBERT is distilled from a pre-trained BERT model using a large corpus. In the second step, this TinyBERT model is further fine-tuned by distilling a fine-tuned BERT on a task-specific dataset. While the broad goal of [20] seems to align with ours, the approach is quite different: we intend to filter out task-specific information from a generalized teacher by defining a measure that precisely captures this whereas [20] focus on distilling from a task-specific (i.e., fine-tuned) teacher in an efficient manner. The work on distilling vision transformers by [21] also proceeds in two steps. Similar to [20], the first step focuses on distilling task-agnostic information using a pre-trained teacher within an encoder-decoder setup. In the second step, the decoder is abandoned and the task-specific information is transferred using a fine-tuned teacher. As pointed out in [21, Section 4.3], this framework can be seen as maximizing the mutual information (conditioned on the dataset being used) between the teacher and the student in each step (rather than quantifying task-specific information), and hence, can be conceptually categorized together with VID. In contrast, our RID framework focuses on quantifying and extracting task-specific information from a generalized teacher.
Appendix B Proofs
B.1 Proof of Theorem 3.1
See Theorem 3.1.
Proof.
To prove claim 1, observe that
(14)–(16)
Now, and . Therefore,
(17)
Claim 1 follows from the above since .
To prove claim 2, first observe that . Now consider the conditional mutual information :
(18)–(19)
Note that the right-hand side above vanishes when . Therefore, . Now since
(20)
and , setting achieves the maximum . ∎
B.2 Proof of Theorem 3.2
See Theorem 3.2. The proof of the second property is given below:
Proof.
I(Y; T) = Uni(Y: T\S) + Red(Y: T, S)    (by (2))    (21)
= Red(Y: T, S)    (since Uni(Y: T\S) = 0)    (22)
≤ I(Y; S)    (by (3) and the non-negativity of Uni(Y: S\T))    (23)
∎
B.3 Proof of Lemma B.1
Lemma B.1.
Let and be any three random variables with supports and respectively and be a deterministic function with domain . Then
(24)
Proof.
By applying the mutual information chain rule to we get
(25)–(27)
Also, from a different decomposition, we get
(28)–(29)
Combining the two right-hand sides yields the final result. ∎
B.4 Proof of Theorem 4.1
See Theorem 4.1.
Proof.
For a given set of random variables Y, T, and S, let Q* achieve the maximum in Definition 4.1, i.e., I∩(Y: T, S) = I(Y; Q*), while I(Y; Q* | T) = I(Y; Q* | S) = 0. We first observe the following:
(30)–(32)
Next, we upper-bound I(Y; Q*) by Red(Y: T, S). In this regard, we use the following lemma due to [12].
Lemma B.2 (Lemma 25, [12]).
Let and be a set of random variables. Then,
(33)
Consider the set of random variable and . From the above lemma we get
(34)–(35)
where . Now, by applying Lemma B.1 to the right-hand side we arrive at
(36)–(37)
Next, observe that the following line of arguments holds from Definition 3.3:
(38)–(40)
Noting that Red(Y: T, S) is symmetric w.r.t. T and S, we may apply the previous argument with the roles of T and S exchanged to obtain
(41)
concluding the proof. ∎
Appendix C VID And TED Frameworks
C.1 Variational Information Distillation (VID)
The VID framework [3] is based on maximizing a variational lower bound on the mutual information I(T; S). It finds a student representation S which minimizes the following loss function:
L_VID = L_CE(Ŷ, Y) + λ Σ_{c,h,w} [ log σ_c + ( T_{c,h,w} − μ_{c,h,w}(S) )² / σ_c² ]    (42)
Here, C, H, and W are the number of channels, height, and width of the representation, respectively (i.e., the teacher representation is a C × H × W tensor). μ(·) is a deterministic function parameterized using a neural network and learned during the training process. σ is a vector of independent positive parameters, which is also learned during the training process. Ŷ is the final prediction of the student model for the target label Y, and λ is a scalar weight on the distillation term.
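A minimal sketch of this variational term follows, with the teacher activation modeled as a Gaussian whose mean is a learned function of the student activation and with one variance per teacher channel; the 1x1-convolution parameterization and the dropped constants are our own simplifications of the setup in [3].

```python
import torch
import torch.nn as nn

class VIDLoss(nn.Module):
    """Variational lower bound on I(T; S) in the spirit of [3] (sketch):
    -log q(T | S) with q = N(mu(S), diag(sigma^2)), one variance per channel."""

    def __init__(self, student_channels, teacher_channels):
        super().__init__()
        # mu(.): small network mapping student features to the teacher's space.
        self.mu = nn.Sequential(
            nn.Conv2d(student_channels, teacher_channels, kernel_size=1),
            nn.ReLU(),
            nn.Conv2d(teacher_channels, teacher_channels, kernel_size=1),
        )
        # Log-variances, one per teacher channel, learned jointly.
        self.log_var = nn.Parameter(torch.zeros(teacher_channels))

    def forward(self, student_feat, teacher_feat):
        mean = self.mu(student_feat)
        log_var = self.log_var.view(1, -1, 1, 1)
        var = torch.exp(log_var)
        # Negative Gaussian log-likelihood per element (constants dropped).
        nll = 0.5 * (log_var + (teacher_feat - mean) ** 2 / var)
        return nll.mean()
```

In training, this term is scaled by λ and added to the student's cross-entropy loss, as in (42).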
C.2 Task-aware Layer-wise Distillation (TED)
The TED framework [5] fine-tunes a student in two stages. During the first stage, task-aware filters appended to the teacher and the student are trained with task-related heads while the student and the teacher parameters are kept constant. In the next step, the task-related heads are removed from the filters and the student is trained along with its task-aware filter while the teacher and its task-aware filter are kept unchanged. We observe that each of these steps implicitly maximizes the redundant information under Definition 4.1. To see the relationship between the TED framework and the above definition of redundant information, let Q be parameterized using the teacher's task-aware filter as Q = f_ψ(T). Now consider the first-stage loss corresponding to the teacher's task-aware filter, given below:
L_TED,1(ψ, ω) = L_task( h_ω(f_ψ(T)), Y )    (43)
Here, L_task is the task-specific loss and f_ψ is the task-aware filter parameterized by ψ, with h_ω its task-related head. During the first stage, this loss is minimized over ψ and ω. A similar loss corresponding to the student (i.e., with g_φ(S) in place of f_ψ(T)) is minimized in order to train the student's task-aware filter. Note that during this process, both I(Y; f_ψ(T)) and I(Y; g_φ(S)) are increased.
During stage 2, the distillation loss given below is minimized over θ and φ, while ψ and the teacher are held constant.
L_TED,2(θ, φ) = Σ_{c,h,w} ( f_ψ(T)_{c,h,w} − g_φ(S_θ)_{c,h,w} )²    (44)
Consider stage 2 as an estimation problem which minimizes the mean squared error, where f_ψ(T) is the estimand and g_φ(S) is the estimator. We observe that this optimization ensures I(Y; f_ψ(T) | g_φ(S)) = 0, given that the same assumption as in Section 4 holds. Following similar steps as in Section 4, we see that the TED framework maximizes a lower bound for the transferred knowledge, quantified as in Definition 3.2.
The main differences of this scheme w.r.t. the RID framework are two-fold. First, RID alternates between the two phases, so the teacher's task-aware filter f_ψ continues to be updated during training in addition to θ and φ, whereas TED does not modify the teacher's filter after its first stage. Second, the RID distillation loss employs a weighting vector σ similar to that of VID.
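To make the second contrast concrete, the two stage-2 matching terms differ only in the per-channel weighting; a minimal sketch under the reconstruction above, using the same placeholder names as the RID sketches:

```python
import torch

def ted_stage2_match(f_t, g_s):
    """TED (sketch of Eq. (44)): plain mean squared error between the
    teacher's and the student's task-aware filter outputs."""
    return ((f_t - g_s) ** 2).mean()

def rid_stage2_match(f_t, g_s, log_sigma):
    """RID (sketch of Eq. (12)): the same residual, but weighted per channel
    by learned positive weights sigma; the log(1/sigma) term keeps the
    weights from collapsing to zero."""
    sigma = torch.exp(log_sigma).view(1, -1, 1, 1)
    return (sigma * (f_t - g_s) ** 2).mean() - log_sigma.sum()
```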
Appendix D Empirical Validation
D.1 Datasets
We use the CIFAR-10 dataset [45] with 60,000 32x32 color images belonging to 10 classes, with 6,000 images per class. The training set consists of 50,000 images (5,000 per class) and the test set consists of 10,000 images (1,000 per class). The PID values are evaluated over the same test set. Similarly, the CIFAR-100 dataset [46] contains 32x32 color images belonging to 100 classes with 600 images from each class. The dataset is split into training and test sets with 500 and 100 images per class, respectively. While this split was used to train the teacher, the students were trained only on a subset of the training data with 100 samples per class. For the transfer learning setup, a teacher model initialized with pre-trained weights for the ImageNet dataset [47] was used. Students were distilled using the CUB-200-2011 dataset [48]. The ImageNet dataset contains 1,281,167 training images, 50,000 validation images, and 100,000 test images belonging to 1,000 general object classes. The CUB-200-2011 dataset contains 11,768 images of 200 bird species.
D.2 Models and hyperparameters
CIFAR-10: Teacher models are WideResNet-(40,2) and the student models are WideResNet-(16,1). For the VID distillation, the distillation weight λ was set to 100. The learning rate was 0.05 at the beginning and was reduced to 0.01 and 0.002 at the 150th and 200th epochs, respectively. Stochastic Gradient Descent with weight decay 0.0005 and momentum 0.9 (with Nesterov momentum enabled) was used as the optimizer. We choose three intermediate layers: the outputs of the second, third, and fourth convolutional blocks of both the teacher and student models. The function μ(·) for each layer is parameterized using a sequential model with three convolutional layers, with ReLU activations and batch normalization in between the layers. A similar architecture and training setup was used for the baseline (BAS, no distillation) and the RID models. In the case of the RID models, the filters f_ψ and g_φ were parameterized using 2-layer convolutional networks with a batch normalization layer in the middle, and the classification head is a linear layer. The number of warm-up epochs and the total number of training epochs for RID are set per dataset; the teacher, baseline, and VID models are trained for 300 epochs. In both VID and RID, the independent parameter vector σ has a dimension equal to the number of channels in the outputs of the functions μ or g_φ. All the training was carried out on a computer with an AMD Ryzen Threadripper PRO 5975WX processor and an Nvidia RTX A4500 graphics card. The average training time per model is around 1.5 hours.
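The optimizer and learning-rate schedule described above translate directly into standard PyTorch components; the 0.05 → 0.01 → 0.002 drops at epochs 150 and 200 correspond to MultiStepLR with a decay factor of 0.2. The helper name below is ours.

```python
import torch

def make_optimizer_and_scheduler(model):
    """CIFAR-10 training setup described above: SGD with Nesterov momentum,
    weight decay 5e-4, and step decays (x0.2) at epochs 150 and 200."""
    optimizer = torch.optim.SGD(model.parameters(), lr=0.05, momentum=0.9,
                                weight_decay=5e-4, nesterov=True)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                     milestones=[150, 200],
                                                     gamma=0.2)
    return optimizer, scheduler

# Usage: step the scheduler once after each training epoch.
# for epoch in range(300):
#     train_one_epoch(...)
#     scheduler.step()
```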
CIFAR-100: Teacher models are WideResNet-(28,10) and the student models are WideResNet-(16,8). For the VID distillation, the distillation weight λ was set to 100. The learning rate was 0.1 at the beginning and was multiplied by a factor of 0.2 at the 60th, 120th, and 160th epochs. Stochastic Gradient Descent with weight decay 0.0005 and momentum 0.9 (with Nesterov momentum enabled) was used as the optimizer. We choose three intermediate layers: the outputs of the second, third, and fourth convolutional blocks of both the teacher and student models. The function μ(·) for each layer is parameterized using a sequential model with three convolutional layers, with ReLU activations and batch normalization in between the layers. A similar architecture and training setup was used for the baseline (BAS, no distillation) and the RID models. In the case of the RID models, the filters f_ψ and g_φ were parameterized using 2-layer convolutional networks with a batch normalization layer in the middle, and the classification head is a linear layer. The number of warm-up epochs and the total number of training epochs for RID are set per dataset; the teacher, baseline, and VID models are trained for 250 epochs. In both VID and RID, the independent parameter vector σ has a dimension equal to the number of channels in the outputs of the functions μ or g_φ. All the training was carried out on a computer with an AMD EPYC 7763 64-core processor and an Nvidia RTX A6000 Ada graphics card. The average training time per model is around 3.75 hours.
ImageNet → CUB-200-2011: The teacher is a ResNet-34 model initialized with weights pre-trained on ImageNet (from Torchvision). The students are ResNet-18 models. For the VID distillation, the distillation weight λ was set to 100. The learning rate was 0.1 at the beginning and was multiplied by a factor of 0.2 at the 150th and 200th epochs. Stochastic Gradient Descent with weight decay 0.0005 and momentum 0.9 (with Nesterov momentum enabled) was used as the optimizer. We choose two intermediate layers: the outputs of the third and fourth convolutional blocks of both the teacher and student models. The function μ(·) for each layer is parameterized using a sequential model with two convolutional layers, with ReLU activations and batch normalization in between the layers. A similar architecture and training setup was used for the baseline (BAS, no distillation) and the RID models. In the case of the RID models, the filters f_ψ and g_φ were parameterized using 2-layer convolutional networks with a batch normalization layer in the middle, and the classification head is a linear layer. The number of warm-up epochs and the total number of training epochs for RID are set per dataset; the teacher, baseline, and VID models are trained for 250 epochs. In both VID and RID, the independent parameter vector σ has a dimension equal to the number of channels in the outputs of the functions μ or g_φ. All the training was carried out on a computer with an AMD Ryzen Threadripper PRO 5975WX processor and an Nvidia RTX A4500 graphics card. The average training time per model is around 7 hours.
D.3 PID computation
We compute the PID components of the joint information I(Y; (T, S)) of the innermost distilled layers, using the framework proposed in [28], as follows:

1. The representations are individually flattened.
2. A PCA with a fixed number of components is computed on each set of representations.
3. The representations are clustered into a fixed number of clusters to discretize them.
4. The empirical joint distribution over the task label and the two cluster assignments is computed.
5. The PID components are computed from this joint distribution.
The numbers of PCA components and clusters are chosen separately for CIFAR-10 and CIFAR-100; a sketch of this discretization pipeline is given below. For the transfer learning setting with ImageNet and CUB-200-2011, the computation was infeasible because the dimensionality of the intermediate representations is extremely large.
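The following sketch implements steps 1–4 above with scikit-learn; the component and cluster counts are illustrative assumptions. The resulting joint pmf can then be fed to a BROJA-type solver (such as the brute-force sketch in Section 3) or to the estimator of [28] to obtain the PID components.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def discretize_representations(teacher_feats, student_feats, labels,
                               n_components=8, n_clusters=10):
    """Build a discrete joint pmf over (Y, T, S) from continuous features.
    `n_components` and `n_clusters` are illustrative values."""
    def to_clusters(feats):
        flat = feats.reshape(feats.shape[0], -1)                      # 1. flatten per sample
        reduced = PCA(n_components=n_components).fit_transform(flat)  # 2. PCA
        return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(reduced)  # 3. cluster

    t_ids = to_clusters(teacher_feats)
    s_ids = to_clusters(student_feats)

    # 4. Empirical joint distribution over (Y, teacher cluster, student cluster).
    n_classes = int(labels.max()) + 1
    joint = np.zeros((n_classes, n_clusters, n_clusters))
    for y, t, s in zip(labels, t_ids, s_ids):
        joint[int(y), t, s] += 1.0
    joint /= joint.sum()
    return joint   # 5. PID components are then computed from this pmf
```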
D.4 Results
The first table below presents the accuracies of each framework on the CIFAR-100 dataset. The second table presents the results of the transfer learning experiment conducted using a teacher trained on ImageNet and students distilled and evaluated on the CUB-200-2011 dataset. In each scenario, RID exhibits more resilience to the untrained teacher compared to VID.
CIFAR-100:

Framework | Trained teacher | Untrained teacher
--- | --- | ---
RID | 50% | 41%
VID | 67% | 20%
BAS | 43% | 43%
ImageNet → CUB-200-2011:

Framework | Trained teacher | Untrained teacher
--- | --- | ---
RID | 65% | 33%
VID | 70% | 11%
BAS | 36% | 36%