
Teach Harder, Learn Poorer: Rethinking Hard Sample Distillation for GNN-to-MLP Knowledge Distillation

Lirong Wu, Yunfan Liu, Haitao Lin, Yufei Huang, and Stan Z. Li (Westlake University, Hangzhou, China)
(2024)
Abstract.

To bridge the gap between powerful Graph Neural Networks (GNNs) and lightweight Multi-Layer Perceptrons (MLPs), GNN-to-MLP Knowledge Distillation (KD) distills knowledge from a well-trained teacher GNN into a student MLP. In this paper, we revisit the knowledge samples (nodes) in teacher GNNs from the perspective of hardness and identify that hard sample distillation may be a major performance bottleneck of existing graph KD algorithms. GNN-to-MLP KD involves two different types of hardness: student-free knowledge hardness, which describes the inherent complexity of GNN knowledge, and student-dependent distillation hardness, which describes the difficulty of teacher-to-student distillation. However, most existing work focuses on only one of these aspects or conflates the two. This paper proposes a simple yet effective Hardness-aware GNN-to-MLP Distillation (HGMD) framework, which decouples the two hardnesses and estimates them using a non-parametric approach. Two hardness-aware distillation schemes (i.e., HGMD-weight and HGMD-mixup) are then proposed to distill hardness-aware knowledge from teacher GNNs into the corresponding nodes of student MLPs. As a non-parametric distillation framework, HGMD does not involve any additional learnable parameters beyond the student MLPs, yet it still outperforms most state-of-the-art competitors. HGMD-mixup improves over vanilla MLPs by 12.95% and outperforms its teacher GNNs by 2.48% averaged over seven real-world datasets. Code will be made public at https://github.com/LirongWu/HGMD.

Graph Neural Networks; Knowledge Distillation; Hard Samples
journalyear: 2024; copyright: rightsretained; conference: Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, October 21–25, 2024, Boise, ID, USA; booktitle: Proceedings of the 33rd ACM International Conference on Information and Knowledge Management (CIKM '24), October 21–25, 2024, Boise, ID, USA; doi: 10.1145/3627673.3679586; isbn: 979-8-4007-0436-9/24/10; ccs: Computing methodologies, Artificial intelligence

1. Introduction

Recently, the emerging Graph Neural Networks (GNNs) (Wu et al., 2020; Zhou et al., 2020; Wu et al., 2021, 2023c) have demonstrated their powerful capability in handling various graph-structured data. Benefiting from the powerful topology awareness enabled by message passing, GNNs have achieved great academic success. However, the neighborhood-fetching latency arising from data dependency in GNNs makes them less popular for practical deployment, especially in computationally constrained applications. In contrast, Multi-Layer Perceptrons (MLPs) are free from data dependencies among neighboring nodes and infer much faster than GNNs, but at the cost of suboptimal performance. To bridge these two worlds, GLNN (Zhang et al., 2022) proposes GNN-to-MLP Knowledge Distillation (KD), which extracts informative knowledge from a teacher GNN and then injects the knowledge into a student MLP.

Figure 1. (a) Accuracy rankings of three teacher GNNs (GCN, SAGE, and GAT) and the corresponding student MLPs on seven datasets. (b) Accuracy fluctuations of the teacher GCN and student MLP w.r.t. the temperature coefficient $\tau$ on the Cora dataset. (c) Histogram of the information entropy of GNN knowledge for those samples misclassified by student MLPs on the Cora dataset.

A long-standing intuition about knowledge distillation is "better teacher, better student". In other words, distillation from a better teacher is expected to yield a better student, since a better teacher can usually capture more informative knowledge from which the student can benefit. However, some recent work has challenged this intuition, arguing that it does not hold true in all cases, i.e., distillation from a larger teacher, typically with more parameters and higher accuracy, may be inferior to distillation from a smaller, less accurate teacher (Mirzadeh et al., 2020; Shen et al., 2021; Stanton et al., 2021; Zhu et al., 2022). To illustrate this, we show in Fig. 1(a) the rankings of three teacher GNNs, including Graph Convolutional Network (GCN) (Kipf and Welling, 2016), Graph Attention Network (GAT) (Veličković et al., 2017), and GraphSAGE (Hamilton et al., 2017), on seven datasets, as well as their corresponding distilled MLPs, from which we observe that GCN is the best teacher on the Arxiv dataset, but its distilled student MLP performs the poorest. Many previous works (Jafari et al., 2021; Zhu and Wang, 2021; Qiu et al., 2022) have delved into this issue, but most of them attribute this counter-intuitive observation to the capacity mismatch between the teacher and student models. In other words, a student with fewer parameters may fail to "understand" the high-order semantic knowledge captured by a teacher with numerous parameters.

However, we found that the above popular explanation from a model capacity perspective may hold true for KD in computer vision, but fails in the graph domain. For a teacher GCN and a student MLP with the same number of parameters (i.e., the same layer depth and width), we plot the accuracy fluctuations of the teacher and the distilled student with respect to the distillation temperature $\tau$ in Fig. 1(b). It can be seen that while the temperature $\tau$ does not affect the teacher's accuracy, it influences the knowledge hardness of GNNs, which in turn leads to different student accuracies.

Present Work. In this paper, we rethink what exactly the criteria for "better" knowledge samples (nodes) in teacher GNNs are, from the perspective of hardness rather than correctness, which has rarely been studied in previous works. The motivational experiment in Fig. 1(c) indicates that most GNN knowledge of samples misclassified by student MLPs is distributed in the high-entropy zones, which suggests that GNN knowledge samples with higher uncertainty are usually harder to distill correctly. Furthermore, we explore the roles played by GNN knowledge samples of different hardness during distillation and identify that hard sample distillation may be a major performance bottleneck of existing KD algorithms. As a result, to provide more additional supervision for the distillation of those hard samples, we propose a simple yet effective Hardness-aware GNN-to-MLP Distillation (HGMD) framework. The proposed framework first models both knowledge and distillation hardness in a non-parametric fashion, then extracts a hardness-aware subgraph (the harder, the larger) for each node separately, and finally applies two distillation schemes (HGMD-weight and HGMD-mixup) to distill subgraph-level knowledge from teacher GNNs into the corresponding nodes of the student MLPs.

The main contributions of this paper are as follows: (1) We are the first to identify that hard sample distillation is the main bottleneck that limits the performance of existing GNN-to-MLP KD algorithms, and more importantly, we have described in detail what it represents, what impact it has, and how to deal with it. (2) We decouple two different hardnesses, i.e., knowledge hardness and distillation hardness, and propose a non-parametric approach to estimate them. (3) We propose two distillation schemes based on the two decoupled hardnesses for hard sample distillation. Despite not involving any additional parameters, they are still comparable to or even better than most of the state-of-the-art competitors.

2. Related Work

2.1. GNN-to-GNN Knowledge Distillation

Recent years have witnessed the great success of GNNs in handling graph-structured data (Wu et al., 2022b, 2020). However, most existing GNNs share the de facto design that relies on message passing to aggregate features from neighborhoods, which may be one major source of latency in GNN inference. To address this problem, several previous works on graph distillation try to distill knowledge from large teacher GNNs to smaller student GNNs, termed as GNN-to-GNN knowledge distillation (KD) (Lassance et al., 2020; Zhang et al., 2020a; Wu et al., 2022a; Ren et al., 2021; Wu et al., 2024), including RDD (Zhang et al., 2020b), TinyGNN (Yan et al., 2020), LSP (Yang et al., 2020), GraphAKD (He et al., 2022), GNN-SD (Chen et al., 2020a), and FreeKD (Feng et al., 2022), etc. However, both teachers and students in the above works are GNNs, making these designs still suffer from the neighborhood-fetching latency arising from the data dependency in GNNs.

2.2. GNN-to-MLP Knowledge Distillation

To bridge the gaps between powerful GNNs and lightweight MLPs, the other branch of graph KD directly distills from teacher GNNs to lightweight student MLPs, termed GNN-to-MLP KD. For example, GLNN (Zhang et al., 2022) directly distills knowledge from teacher GNNs to vanilla MLPs by imposing KL-divergence between their logits. Instead, CPF (Yang et al., 2021) improves the performance of student MLPs by incorporating label propagation in MLPs, which may further burden the inference latency. Besides, FF-G2M (Wu et al., 2023b) factorizes GNN knowledge into low- and high-frequency components in the spectral domain and proposes a novel framework to distill both low- and high-frequency knowledge from teacher GNNs into student MLPs. Moreover, RKD (Wu et al., 2023a) takes into account the reliability of GNN knowledge and adopts a parameterized distribution fitting to filter out unreliable GNN knowledge. For more approaches to GNN-to-MLP KD, we refer the interested reader to a recent survey (Tian et al., 2023a). Despite the great progress, most of the existing methods have focused on how to make better use of those simple samples, while little effort has been made on those hard samples. However, we find in this paper that hard sample distillation may be a main bottleneck that limits the performance of existing GNN-to-MLP KD algorithms.

3. Methodology

3.1. Preliminary

3.1.1. Notations.

Given a graph $\mathcal{G}=(\mathcal{V},\mathcal{E})$, the node set and edge set are $\mathcal{V}=\{v_{1},v_{2},\cdots,v_{N}\}$ and $\mathcal{E}\subseteq\mathcal{V}\times\mathcal{V}$, respectively. In addition, $\mathbf{X}\in\mathbb{R}^{N\times d}$ and $\mathbf{A}\in[0,1]^{N\times N}$ denote the feature matrix and adjacency matrix, where each node $v_{i}\in\mathcal{V}$ is associated with a $d$-dimensional feature vector $\mathbf{x}_{i}\in\mathbb{R}^{d}$ and $\mathbf{A}_{i,j}=1$ iff $(v_{i},v_{j})\in\mathcal{E}$. We consider node classification in a transductive setting in which only a subset of nodes $\mathcal{V}_{L}\subseteq\mathcal{V}$ with corresponding labels $\mathcal{Y}_{L}$ is known; we denote the labeled set as $\mathcal{D}_{L}=(\mathcal{V}_{L},\mathcal{Y}_{L})$ and the unlabeled set as $\mathcal{D}_{U}=(\mathcal{V}_{U},\mathcal{Y}_{U})$, where $\mathcal{V}_{U}=\mathcal{V}\backslash\mathcal{V}_{L}$. The objective of GNN-to-MLP knowledge distillation is to first train a teacher GNN $\mathbf{Z}=f_{\theta}^{\mathcal{T}}(\mathbf{A},\mathbf{X})$ on the labeled data $\mathcal{D}_{L}$, and then distill knowledge from the teacher GNN into a student MLP $\mathbf{H}=f_{\gamma}^{\mathcal{S}}(\mathbf{X})$ by imposing the KL-divergence $\mathcal{D}_{KL}(\cdot,\cdot)$ on the node set $\mathcal{V}$, as follows

(1) $\mathcal{L}_{\mathrm{KD}}=\frac{1}{|\mathcal{V}|}\sum_{i\in\mathcal{V}}\mathcal{D}_{KL}\Big(\sigma\left(\mathbf{z}_{i}/\tau\right),\sigma\left(\mathbf{h}_{i}/\tau\right)\Big),$

where $\sigma(\cdot)=\operatorname{softmax}(\cdot)$ is the activation function, and $\tau$ is the distillation temperature. Besides, $\mathbf{z}_{i}$ and $\mathbf{h}_{i}$ are the node embeddings of node $v_{i}$ in $\mathbf{Z}$ and $\mathbf{H}$, respectively. Once distillation is done, the distilled MLP can be used to infer $y_{i}\in\mathcal{Y}_{U}$ for unlabeled data.
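To make this objective concrete, the following is a minimal PyTorch sketch of the vanilla distillation loss in Eq. (1); the tensor names `z` (teacher logits) and `h` (student logits) and the helper name `vanilla_kd_loss` are illustrative assumptions rather than the released code.

```python
import torch
import torch.nn.functional as F

def vanilla_kd_loss(z: torch.Tensor, h: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """Eq. (1): mean KL-divergence between softened teacher and student distributions.

    z, h: [N, C] teacher / student logits over C classes for the N nodes in V.
    """
    teacher = F.softmax(z / tau, dim=-1)          # soft targets from the teacher GNN
    log_student = F.log_softmax(h / tau, dim=-1)  # log-probabilities of the student MLP
    # KL(teacher || student), averaged over all nodes
    return F.kl_div(log_student, teacher, reduction="batchmean")
```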

3.1.2. Knowledge Hardness.

Inspired by the experiment in Fig. 1(c), where GNN knowledge samples with higher entropy are harder to be correctly distilled into the student MLPs, we use the information entropy $\mathcal{H}(\mathbf{z}_{i})$ of node $v_{i}$ as a measure of its knowledge hardness,

(2) $\mathcal{H}(\mathbf{z}_{i})=-\sum_{j}\sigma\big(\mathbf{z}_{i,j}/\tau\big)\log\big(\sigma\left(\mathbf{z}_{i,j}/\tau\right)\big).$

We default to using Eq. (2) for measuring the knowledge hardness in this paper and delay the definition of distillation hardness until Sec. 3.3. For more experimental results of using other more complex knowledge hardness metrics, please refer to Appendix D.
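As a reference, the entropy-based hardness of Eq. (2) can be computed with a few lines of PyTorch; this is a sketch with assumed tensor and helper names, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def knowledge_hardness(z: torch.Tensor, tau: float = 1.0, eps: float = 1e-12) -> torch.Tensor:
    """Eq. (2): information entropy of the softened teacher distribution, one value per node."""
    p = F.softmax(z / tau, dim=-1)             # [N, C] class probabilities of the teacher GNN
    return -(p * (p + eps).log()).sum(dim=-1)  # [N] knowledge hardness H(z_i)
```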

3.2. Bottleneck: Hard Sample Distillation

Recent years have witnessed the great success of knowledge distillation and a surge of related distillation techniques. As the research goes deeper, the rationality of "better teacher, better student" has been increasingly challenged. Many earlier works (Jafari et al., 2021; Son et al., 2021) have found that as the performance of the teacher model improves, the accuracy of the student model may unexpectedly get worse. Most of the existing works attribute this counter-intuitive observation to the capacity mismatch between the teacher and student models. In other words, a smaller student may have difficulty "understanding" the high-order semantic knowledge captured by a large teacher. Although this problem has been well studied in computer vision, little work has been devoted to whether it exists in graph knowledge distillation, what it arises from, and how to deal with it. In this paper, we make the same observation during GNN-to-MLP distillation, i.e., better teachers do not necessarily lead to better students (Fig. 1(a)), but we find that this has little to do with the popular idea of capacity mismatch. This is because, unlike common visual backbones with very deep layers in computer vision, GNNs tend to suffer from the undesired over-smoothing problem (Chen et al., 2020b; Yan et al., 2022) when stacked deeply. Therefore, most existing GNNs are shallow networks, making the effects of capacity mismatch negligible during GNN-to-MLP KD.

Figure 2. Classification accuracy of several representative GNN-to-MLP distillation methods for simple samples (bottom 50% hardness) and hard samples (top 50% hardness).

To explore the criteria for better GNN knowledge samples (nodes), we conduct an exploratory experiment to evaluate the roles played by GNN knowledge samples of different hardnesses during knowledge distillation. Specifically, we report in Fig. 2 the distillation accuracy of several representative methods for simple samples (bottom 50% hardness) and hard samples (top 50% hardness), as well as their overall accuracy. As can be seen from Fig. 2, those simple samples can be handled well by all methods, and the main difference in the performance of different distillation methods lies in their capability to handle those hard samples. In other words, hard sample distillation may be a major performance bottleneck of existing distillation algorithms. For example, FF-G2M improves the overall accuracy by 1.86% compared to GLNN, where hard samples contribute 3.27%, but simple samples contribute only 0.45%. Note that this phenomenon also exists in human education, where simple knowledge can be easily grasped by all students, and therefore teachers are encouraged to spend more effort on teaching hard knowledge. Based on the above observations, we believe that not only should we not ignore those hard samples, but we should provide them with more supervision in a hardness-based manner.

Figure 3. Illustration of the hardness-aware GNN-to-MLP KD (HGMD) framework, which consists of two main components: (1) hardness-aware subgraph extraction; and (2) two hardness-aware distillation schemes (HGMD-weight and HGMD-mixup).

3.3. Hardness-aware GNN-to-MLP Distillation

One previous work (Zhou et al., 2021) defined knowledge hardness as the cross entropy on labeled data and proposed to weigh the distillation losses among samples in a hardness-based manner. To extend it to the transductive setting for graphs in this paper, we adopt the information entropy defined in Eq. (2) instead of the cross entropy as the knowledge hardness, and derive a variant of it as follows,

(3) $\mathcal{L}_{\mathrm{KD}}=\frac{1}{|\mathcal{V}|}\sum_{i\in\mathcal{V}}\Big(1-e^{-\mathcal{H}(\mathbf{h}_{i})/\mathcal{H}(\mathbf{z}_{i})}\Big)\cdot\mathcal{D}_{KL}\Big(\widetilde{\mathbf{z}}_{i},\widetilde{\mathbf{h}}_{i}\Big),$

where $\widetilde{\mathbf{z}}_{i}=\sigma\left(\mathbf{z}_{i}/\tau\right)$ and $\widetilde{\mathbf{h}}_{i}=\sigma\left(\mathbf{h}_{i}/\tau\right)$. As far as GNN knowledge hardness is concerned, Eq. (3) reduces the weights of those hard samples with large knowledge hardness, i.e., higher $\mathcal{H}(\mathbf{z}_{i})$, while leaving those simple samples to dominate the optimization. However, Sec. 3.2 shows that not only should we not ignore those hard samples, but we should pay more attention to them by providing more supervision. To this end, we propose a novel GNN-to-MLP KD framework, namely HGMD, which extracts a hardness-aware subgraph (the harder, the larger) for each sample separately and then distills the subgraph-level knowledge into the corresponding nodes of student MLPs through two distillation schemes. A high-level overview of the HGMD framework is shown in Fig. 3.
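For comparison with the schemes introduced below, a minimal sketch of this hardness-weighted variant (Eq. (3)) could look as follows; the helper names are hypothetical.

```python
import torch
import torch.nn.functional as F

def _entropy(logits: torch.Tensor, tau: float) -> torch.Tensor:
    p = F.softmax(logits / tau, dim=-1)
    return -(p * p.clamp_min(1e-12).log()).sum(dim=-1)

def hardness_weighted_kd(z: torch.Tensor, h: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """Eq. (3): per-node KL(z_i || h_i) re-weighted by 1 - exp(-H(h_i) / H(z_i))."""
    z_soft = F.softmax(z / tau, dim=-1)
    log_h = F.log_softmax(h / tau, dim=-1)
    kl = (z_soft * (z_soft.clamp_min(1e-12).log() - log_h)).sum(dim=-1)            # [N]
    weight = 1.0 - torch.exp(-_entropy(h, tau) / _entropy(z, tau).clamp_min(1e-12))
    return (weight * kl).mean()
```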

3.3.1. Hardness-aware Subgraph Extraction

We estimate the distillation hardness based on the knowledge hardness of both the teacher and the student, and then model the probability that the neighbors of a target node are included in the corresponding subgraph based on the distillation hardness. To enable hardness-aware subgraph extraction, four heuristic factors that influence the distillation hardness and subgraph size should be considered: (1) A harder sample with higher knowledge hardness $\mathcal{H}(\mathbf{z}_{i})$ in teacher GNNs should be assigned a larger subgraph for more supervision. (2) A sample with high uncertainty $\mathcal{H}(\mathbf{h}_{i})$ in student MLPs requires a larger subgraph for more supervision. (3) A node $v_{j}\in\mathcal{N}_{i}$ with lower knowledge hardness $\mathcal{H}(\mathbf{z}_{j})$ has a higher probability of being included in the subgraph. (4) Nodes in the subgraph are expected to share similar label distributions with the target node $v_{i}$. Inspired by these heuristic factors, we model the probability $p_{j\rightarrow i}$ that a neighboring node $v_{j}\in\mathcal{N}_{i}$ of the target node $v_{i}$ is included in the subgraph based on the distillation hardness $r_{j\rightarrow i}$, defined as follows

(4) $p_{j\rightarrow i}=1-r_{j\rightarrow i},\ \text{where}\ \ r_{j\rightarrow i}=\exp\Big(-\eta\cdot\mathcal{D}(\mathbf{z}_{i},\mathbf{z}_{j})\cdot\frac{\sqrt{\mathcal{H}(\mathbf{h}_{i})\cdot\mathcal{H}(\mathbf{z}_{i})}}{\mathcal{H}(\mathbf{z}_{j})}\Big),$

where $\mathcal{D}(\mathbf{z}_{i},\mathbf{z}_{j})$ denotes the cosine similarity between $\mathbf{z}_{i}$ and $\mathbf{z}_{j}$, and we specify that $p_{i\rightarrow i}=1$. In addition, $\eta$ is a hyperparameter used to control the overall hardness sensitivity. In this paper, we adopt an exponentially decaying strategy to set the hyperparameter $\eta$. Extensive experiments are provided in Sec. 4.3 to demonstrate the effectiveness of such a non-parametric hardness estimation.
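Concretely, Eq. (4) can be evaluated for every directed edge $j\rightarrow i$ in a vectorized way, as in the sketch below; the edge-list layout (`edge_src` holding $j$, `edge_dst` holding $i$) and the final clamp to a valid probability range are assumptions for illustration, not the released implementation.

```python
import torch
import torch.nn.functional as F

def sampling_probability(z, h, edge_src, edge_dst, tau=1.0, eta=1.0):
    """Eq. (4): p_{j->i} = 1 - exp(-eta * D(z_i, z_j) * sqrt(H(h_i) * H(z_i)) / H(z_j))."""
    def entropy(logits):
        p = F.softmax(logits / tau, dim=-1)
        return -(p * p.clamp_min(1e-12).log()).sum(dim=-1)

    H_z, H_h = entropy(z), entropy(h)                            # [N] teacher / student hardness
    cos = F.cosine_similarity(z[edge_dst], z[edge_src], dim=-1)  # D(z_i, z_j) per edge
    r = torch.exp(-eta * cos * (H_h[edge_dst] * H_z[edge_dst]).sqrt()
                  / H_z[edge_src].clamp_min(1e-12))              # distillation hardness r_{j->i}
    return (1.0 - r).clamp(0.0, 1.0)                             # sampling probability p_{j->i}
```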

3.3.2. HGMD-weight

Based on the sampling probabilities modeled in Eq. (4), we can easily sample a hardness-aware subgraph $g_{i}$ with node set $\mathcal{V}^{g}_{i}=\{v_{j}\sim\operatorname{Bernoulli}(p_{j\rightarrow i})\ |\ j\in(\mathcal{N}_{i}\cup i)\}$ for each target node $v_{i}$ by Bernoulli sampling. Next, a key remaining issue is how to distill the subgraph-level knowledge from teacher GNNs into the corresponding nodes of student MLPs. A straightforward idea is to follow (Wu et al., 2023b) and perform many-to-one (multi-teacher) knowledge distillation by optimizing the following objective

(5) $\mathcal{L}_{\mathrm{KD}}^{\text{weight}}=\frac{1}{|\mathcal{V}|}\sum_{i\in\mathcal{V}}\frac{1}{|\mathcal{V}_{i}^{g}|}\sum_{j\in\mathcal{V}_{i}^{g}}p_{j\rightarrow i}\cdot\mathcal{D}_{KL}\Big(\widetilde{\mathbf{z}}_{j},\widetilde{\mathbf{h}}_{i}\Big).$

Compared to the loss weighting of Eq. (3), the strengths of HGMD-weight in Eq. (5) are four-fold: (1) it extends knowledge distillation from node-to-node single-teacher KD to subgraph-to-node multi-teacher KD, which introduces additional supervision; (2) it provides more supervision (i.e., larger subgraphs) for hard samples in a hardness-aware manner, rather than neglecting them by reducing their loss weights; (3) it inherits the benefit of loss weighting by assigning a large weight $p_{j\rightarrow i}$ to a sample $v_{j}$ with low hardness $\mathcal{H}(\mathbf{z}_{j})$ in the subgraph; (4) it takes into account not only the knowledge hardness of the target node but also that of the nodes in the subgraph and their similarities to the target, enjoying more contextual information. While the modification from Eq. (3) to Eq. (5) does not introduce any additional learnable parameters, it achieves a huge improvement as shown in the subsequent experiments.
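A minimal sketch of HGMD-weight (Bernoulli subgraph sampling followed by the weighted subgraph-to-node loss of Eq. (5)) is given below, assuming the edge list already contains self-loops with $p_{i\rightarrow i}=1$; the function and variable names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def hgmd_weight_loss(z, h, edge_src, edge_dst, p_edge, tau=1.0):
    """Eq. (5): each sampled subgraph node j serves as an extra teacher for the target node i."""
    mask = torch.bernoulli(p_edge).bool()              # hardness-aware Bernoulli sampling of V_i^g
    src, dst, p = edge_src[mask], edge_dst[mask], p_edge[mask]

    teacher = F.softmax(z[src] / tau, dim=-1)          # knowledge of subgraph node j
    log_student = F.log_softmax(h[dst] / tau, dim=-1)  # prediction of target node i
    kl = (teacher * (teacher.clamp_min(1e-12).log() - log_student)).sum(dim=-1)

    # average within each subgraph (1 / |V_i^g|), then over all target nodes (1 / |V|)
    n = z.size(0)
    size = torch.zeros(n, device=z.device).index_add_(0, dst, torch.ones_like(p))
    per_node = torch.zeros(n, device=z.device).index_add_(0, dst, p * kl)
    return (per_node / size.clamp_min(1.0)).mean()
```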

3.3.3. HGMD-mixup

Recently, mixup (Abu-El-Haija et al., 2019), as an important data augmentation technique, has achieved great success in various fields. Combining mixup with our HGMD framework enables the generation of more GNN knowledge variants as additional supervision for those hard samples, which may help to improve the generalizability of the distilled student model. Inspired by this, we propose another hardness-aware mixup scheme to distill the subgraph-level knowledge from GNNs into MLPs. Instead of mixing the samples randomly, we mix them by emphasizing the sample with a high probability $p_{j\rightarrow i}$. Formally, for each target sample $v_{i}$, a synthetic sample $\bm{u}_{i,j}$ ($v_{j}\in\mathcal{V}^{g}_{i}$) will be generated by

(6) $\bm{u}_{i,j}=\lambda\cdot p_{j\rightarrow i}\cdot\mathbf{z}_{j}+(1-\lambda\cdot p_{j\rightarrow i})\cdot\mathbf{z}_{i},\ \lambda\sim\operatorname{Beta}(\alpha,\alpha),$

where $\operatorname{Beta}(\alpha,\alpha)$ is a beta distribution parameterized by $\alpha$. For a node $v_{j}\in\mathcal{V}^{g}_{i}$ in the subgraph with lower hardness $\mathcal{H}(\mathbf{z}_{j})$ and higher similarity $\mathcal{D}(\mathbf{z}_{i},\mathbf{z}_{j})$, the synthetic sample $\bm{u}_{i,j}$ will be closer to $\mathbf{z}_{j}$. Finally, we can distill the knowledge of the synthetic samples $\{\bm{u}_{i,j}\}_{v_{j}\in\mathcal{V}_{i}^{g}}$ in the subgraph $g_{i}$ into the corresponding node $v_{i}$ of student MLPs by optimizing the following objective

(7) $\mathcal{L}_{\mathrm{KD}}^{\text{mixup}}=\frac{1}{|\mathcal{V}|}\sum_{i\in\mathcal{V}}\frac{1}{|\mathcal{V}_{i}^{g}|}\sum_{j\in\mathcal{V}_{i}^{g}}\mathcal{D}_{KL}\Big(\sigma\left(\mathbf{u}_{i,j}/\tau\right),\widetilde{\mathbf{h}}_{i}\Big).$

Compared to the weighting-based scheme (HGMD-weight) of Eq. (5), the mixup-based scheme (HGMD-mixup) generates more variants of GNN knowledge through mixup augmentation, which is more in line with our original intention of providing more additional supervision for knowledge distillation on hard samples.
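Putting Eqs. (6) and (7) together, HGMD-mixup may be sketched as follows; again, the tensor names, the edge-list layout, and the inclusion of self-loops are assumptions for illustration rather than the released implementation.

```python
import torch
import torch.nn.functional as F

def hgmd_mixup_loss(z, h, edge_src, edge_dst, p_edge, tau=1.0, alpha=0.4):
    """Eqs. (6)-(7): mix neighbor knowledge z_j into the target's z_i, then distill into h_i."""
    mask = torch.bernoulli(p_edge).bool()                       # hardness-aware subgraph sampling
    src, dst, p = edge_src[mask], edge_dst[mask], p_edge[mask]

    lam = torch.distributions.Beta(alpha, alpha).sample(p.shape).to(z.device)
    mix = (lam * p).unsqueeze(-1)                               # per-edge mixing coefficient
    u = mix * z[src] + (1.0 - mix) * z[dst]                     # Eq. (6): synthetic samples u_{i,j}

    teacher = F.softmax(u / tau, dim=-1)
    log_student = F.log_softmax(h[dst] / tau, dim=-1)
    kl = (teacher * (teacher.clamp_min(1e-12).log() - log_student)).sum(dim=-1)

    n = z.size(0)
    size = torch.zeros(n, device=z.device).index_add_(0, dst, torch.ones_like(kl))
    per_node = torch.zeros(n, device=z.device).index_add_(0, dst, kl)
    return (per_node / size.clamp_min(1.0)).mean()              # Eq. (7)
```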

3.4. Training Strategy

To achieve GNN-to-MLP knowledge distillation, we first pre-train the teacher GNNs with the classification loss $\mathcal{L}_{\mathrm{label}}$, as follows

(8) $\mathcal{L}_{\mathrm{label}}=\frac{1}{|\mathcal{V}_{L}|}\sum_{i\in\mathcal{V}_{L}}\operatorname{CE}\big(y_{i},\sigma(\mathbf{z}_{i})\big),$

where $\operatorname{CE}(\cdot)$ denotes the cross-entropy loss. We further distill knowledge from teacher GNNs into student MLPs with the objective,

(9) $\mathcal{L}_{\mathrm{total}}=\frac{\beta}{|\mathcal{V}_{L}|}\sum_{i\in\mathcal{V}_{L}}\operatorname{CE}\big(y_{i},\sigma(\mathbf{h}_{i})\big)+\big(1-\beta\big)\mathcal{L}_{\mathrm{KD}},$

where $\beta$ is a hyperparameter that trades off the classification and distillation losses. The pseudo-code of HGMD (taking HGMD-mixup as an example) is summarized in Algorithm 1.
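The overall student objective of Eq. (9) then combines the supervised cross-entropy on labeled nodes with one of the distillation losses above; a hedged sketch (with assumed argument names) is:

```python
import torch.nn.functional as F

def student_total_loss(h, labels, labeled_idx, kd_loss, beta=0.3):
    """Eq. (9): trade off the classification loss on V_L against the distillation term.

    h: [N, C] student logits; labels: ground-truth classes of the nodes in labeled_idx;
    kd_loss: the value of Eq. (5) or Eq. (7); beta: trade-off hyperparameter.
    """
    ce = F.cross_entropy(h[labeled_idx], labels)  # cross-entropy on the labeled set V_L
    return beta * ce + (1.0 - beta) * kd_loss
```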

Table 1. Accuracy ± std (%) on seven datasets, where three architectures are considered, and the best metrics are marked by bold.

| Teacher | Student | Cora | Citeseer | Pubmed | Photo | CS | Physics | ogbn-arxiv | Average |
|---|---|---|---|---|---|---|---|---|---|
| MLPs | Vanilla | 59.58±0.97 | 60.32±0.61 | 73.40±0.68 | 78.65±1.68 | 87.82±0.64 | 88.81±1.08 | 54.63±0.84 | - |
| GCN | - | 81.70±0.96 | 71.64±0.34 | 79.48±0.21 | 90.63±1.53 | 90.00±0.58 | 92.45±0.53 | 71.20±0.17 | - |
| | GLNN | 82.20±0.73 | 71.72±0.30 | 80.16±0.20 | 91.42±1.61 | 92.22±0.72 | 93.11±0.39 | 67.76±0.23 | - |
| | Loss-Weighting | 83.25±0.69 | 72.98±0.41 | 81.20±0.50 | 91.76±1.52 | 93.16±0.66 | 93.46±0.43 | 68.56±0.27 | - |
| | HGMD-weight | 84.42±0.54 | 74.42±0.50 | 81.86±0.44 | 92.94±1.37 | 93.93±0.33 | 94.09±0.56 | 70.76±0.19 | - |
| | Improv. | 2.22 | 2.70 | 1.70 | 1.52 | 1.71 | 0.98 | 3.00 | 1.98 |
| | HGMD-mixup | \textbf{84.66±0.47} | \textbf{74.62±0.40} | \textbf{82.02±0.45} | \textbf{93.33±1.31} | \textbf{94.16±0.32} | \textbf{94.27±0.63} | \textbf{71.09±0.21} | - |
| | Improv. | 2.46 | 2.90 | 1.86 | 1.91 | 1.94 | 1.16 | 3.33 | 2.22 |
| GraphSAGE | Vanilla | 82.02±0.94 | 71.76±0.49 | 79.36±0.45 | 90.56±1.69 | 89.29±0.77 | 91.97±0.91 | 71.06±0.27 | - |
| | GLNN | 81.86±0.88 | 71.52±0.54 | 80.32±0.38 | 91.34±1.46 | 92.00±0.57 | 92.82±0.93 | 68.30±0.19 | - |
| | Loss-Weighting | 83.16±0.76 | 72.30±0.47 | 80.92±0.46 | 91.63±1.31 | 92.84±0.60 | 93.28±0.72 | 69.04±0.22 | - |
| | HGMD-weight | 84.36±0.60 | \textbf{73.70±0.50} | 81.50±0.57 | 93.01±1.19 | 93.77±0.47 | \textbf{94.21±0.57} | 71.62±0.26 | - |
| | Improv. | 2.50 | 2.18 | 1.18 | 1.67 | 1.77 | 1.39 | 3.32 | 2.00 |
| | HGMD-mixup | \textbf{84.54±0.53} | 73.48±0.53 | \textbf{81.66±0.36} | \textbf{93.29±1.22} | \textbf{94.03±0.43} | 94.12±0.61 | \textbf{71.86±0.24} | - |
| | Improv. | 2.68 | 1.96 | 1.34 | 1.95 | 2.03 | 1.30 | 3.52 | 2.11 |
| GAT | Vanilla | 81.66±1.04 | 70.78±0.60 | 79.88±0.85 | 90.06±1.38 | 90.90±0.37 | 91.97±0.58 | 71.08±0.19 | - |
| | GLNN | 81.78±0.75 | 70.96±0.86 | 80.48±0.47 | 91.22±1.45 | 92.44±0.41 | 92.70±0.56 | 68.56±0.22 | - |
| | Loss-Weighting | 82.69±0.74 | 71.80±0.52 | 81.27±0.55 | 91.58±1.42 | 92.96±0.58 | 93.10±0.64 | 69.32±0.25 | - |
| | HGMD-weight | \textbf{84.22±0.77} | 73.10±0.83 | 82.02±0.59 | 93.18±0.47 | 94.09±1.33 | \textbf{94.29±0.56} | 71.76±0.26 | - |
| | Improv. | 2.44 | 2.14 | 1.54 | 1.96 | 1.65 | 1.59 | 3.20 | 2.07 |
| | HGMD-mixup | 84.02±0.65 | \textbf{73.18±0.79} | \textbf{82.16±0.64} | \textbf{93.43±1.26} | \textbf{94.20±0.27} | 94.19±0.43 | \textbf{72.31±0.20} | - |
| | Improv. | 2.24 | 2.22 | 1.68 | 2.21 | 1.76 | 1.49 | 3.75 | 2.19 |
Algorithm 1 Algorithm for HGMD-mixup
1: Input: Feature Matrix $\mathbf{X}$; Adjacency Matrix $\mathbf{A}$.
2: Output: Predicted labels $\mathcal{Y}_{U}$ and student MLPs $f_{\gamma}^{\mathcal{S}}(\cdot)$.
3: Randomly initializing the parameters of the teacher GNNs $f_{\theta}^{\mathcal{T}}(\cdot)$ and the student MLPs $f_{\gamma}^{\mathcal{S}}(\cdot)$.
4: Computing node embeddings $\{\mathbf{z}_{i}\}_{i=1}^{N}$ of the GNNs and pre-training the teacher GNNs until convergence by Eq. (8).
5: for $epoch \in \{1, 2, \cdots, E\}$ do
6:     Computing node embeddings $\{\mathbf{h}_{i}\}_{i=1}^{N}$ of the MLPs.
7:     Calculating sampling probabilities $\{p_{j\rightarrow i}\}_{i\in\mathcal{V},j\in\mathcal{N}_{i}}$ and extracting hardness-aware subgraphs $\{g_{i}\}_{i=1}^{N}$.
8:     Generating synthetic samples $\{\bm{u}_{i,j}\}_{i\in\mathcal{V},j\in\mathcal{V}_{i}^{g}}$ by Eq. (6).
9:     Calculating the mixup-based KD loss $\mathcal{L}_{\mathrm{KD}}^{\text{mixup}}$ by Eq. (7).
10:     Updating parameters of the student MLPs $f_{\gamma}^{\mathcal{S}}(\cdot)$ by Eq. (9).
11: end for
12: return Predicted labels $\mathcal{Y}_{U}$ and student MLPs $f_{\gamma}^{\mathcal{S}}(\cdot)$.

3.5. Parameters and Computational Complexity

Compared to previous GNN-to-MLP KD methods, such as RKD (Wu et al., 2023a), HGMD decouples and estimates knowledge and distillation hardness in a non-parametric fashion, which does not introduce any additional learnable parameters in the process of subgraph extraction and subgraph distillation. In terms of computational complexity, the time complexity of HGMD mainly comes from two parts: (1) GNN training $\mathcal{O}(|\mathcal{V}|dF+|\mathcal{E}|F)$ and (2) knowledge distillation $\mathcal{O}(|\mathcal{E}|F)$, where $d$ and $F$ are the dimensions of the input and hidden spaces. The total time complexity $\mathcal{O}(|\mathcal{V}|dF+|\mathcal{E}|F)$ is linear w.r.t. the number of nodes $|\mathcal{V}|$ and edges $|\mathcal{E}|$. This indicates that the time complexity of KD in HGMD is basically on par with GNN training and does not suffer from a high computational burden.

4. Experiments

In this paper, we evaluate HGMD on eight real-world datasets, including Cora (Sen et al., 2008), Citeseer (Giles et al., 1998), Pubmed (McCallum et al., 2000), Coauthor-CS, Coauthor-Physics, Amazon-Photo (Shchur et al., 2018), ogbn-arxiv (Hu et al., 2020), and ogbn-products (Hu et al., 2020). A statistical overview of these datasets is available in Appendix A. Besides, we defer the implementation details and hyperparameter settings for each dataset to Appendix B. In addition, we consider three common GNN architectures as GNN teachers, including GCN (Kipf and Welling, 2016), GraphSAGE (Hamilton et al., 2017), and GAT (Veličković et al., 2017), and comprehensively evaluate the two distillation schemes, HGMD-weight and HGMD-mixup. Furthermore, we also compare HGMD with two types of state-of-the-art graph distillation methods, including (1) GNN-to-GNN KD: LSP (Yang et al., 2020), TinyGNN (Yan et al., 2020), GraphAKD (He et al., 2022), RDD (Zhang et al., 2020b), FreeKD (Feng et al., 2022), and GNN-SD (Chen et al., 2020a); and (2) GNN-to-MLP KD: CPF (Yang et al., 2021), RKD-MLP (Anonymous, 2023), GLNN (Zhang et al., 2022), FF-G2M (Wu et al., 2023b), NOSMOG (Tian et al., 2023b), and KRD (Wu et al., 2023a).

Table 2. Accuracy ± std (%) of various GNN-to-GNN and GNN-to-MLP knowledge distillation algorithms in the transductive setting on eight real-world datasets, where bold and underline denote the best and second metrics on each dataset, respectively.

| Category | Method | Cora | Citeseer | Pubmed | Photo | CS | Physics | ogbn-arxiv | products |
|---|---|---|---|---|---|---|---|---|---|
| Vanilla | MLPs | 59.58±0.97 | 60.32±0.61 | 73.40±0.68 | 78.65±1.68 | 87.82±0.64 | 88.81±1.08 | 54.63±0.84 | 61.89±0.18 |
| Vanilla | GCNs | 81.70±0.96 | 71.64±0.34 | 79.48±0.21 | 90.63±1.53 | 90.00±0.58 | 92.45±0.53 | 71.20±0.17 | 75.42±0.28 |
| GNN-to-GNN | LSP | 82.70±0.43 | 72.68±0.62 | 80.86±0.50 | 91.74±1.42 | 92.56±0.45 | 92.85±0.46 | 71.57±0.25 | 74.18±0.41 |
| | GNN-SD | 82.54±0.36 | 72.34±0.55 | 80.52±0.37 | 91.83±1.58 | 91.92±0.51 | 93.22±0.66 | 70.90±0.23 | 73.90±0.23 |
| | GraphAKD | 83.71±0.77 | 72.68±0.71 | 80.96±0.39 | - | - | - | - | - |
| | TinyGNN | 83.10±0.53 | 73.24±0.72 | 81.20±0.44 | 92.03±1.49 | 93.78±0.38 | 93.70±0.56 | 72.18±0.27 | 74.76±0.30 |
| | RDD | 83.68±0.40 | 73.64±0.50 | 81.74±0.44 | 92.18±1.45 | \textbf{94.20±0.48} | \underline{94.14±0.39} | \underline{72.34±0.17} | 75.30±0.24 |
| | FreeKD | 83.84±0.47 | 73.92±0.47 | 81.48±0.38 | 92.38±1.54 | 93.65±0.43 | 93.87±0.48 | \textbf{72.50±0.29} | 75.84±0.25 |
| GNN-to-MLP | GLNN | 82.20±0.73 | 71.72±0.30 | 80.16±0.20 | 91.42±1.61 | 92.22±0.72 | 93.11±0.39 | 67.76±0.23 | 65.18±0.27 |
| | CPF | 83.56±0.48 | 72.98±0.60 | 81.54±0.47 | 91.70±1.50 | 93.42±0.48 | 93.47±0.41 | 69.05±0.18 | 68.80±0.24 |
| | RKD-MLP | 82.68±0.45 | 73.42±0.45 | 81.32±0.32 | 91.28±1.48 | 93.16±0.64 | 93.26±0.37 | 69.87±0.25 | 72.52±0.35 |
| | FF-G2M | 84.06±0.43 | 73.85±0.51 | 81.62±0.37 | 91.84±1.42 | 93.35±0.55 | 93.59±0.43 | 69.64±0.26 | 71.69±0.31 |
| | NOSMOG | 83.80±0.50 | 74.08±0.45 | 81.49±0.53 | \underline{93.18±1.20} | 93.54±0.98 | 93.61±0.58 | 71.20±0.24 | \underline{76.14±0.32} |
| | HGMD-weight | \underline{84.42±0.54} | \underline{74.42±0.50} | \underline{81.86±0.44} | 92.94±1.37 | 93.93±0.33 | 94.09±0.56 | 70.76±0.19 | 75.21±0.22 |
| | HGMD-mixup | \textbf{84.66±0.47} | \textbf{74.62±0.40} | \textbf{82.02±0.45} | \textbf{93.33±1.31} | \underline{94.16±0.32} | \textbf{94.27±0.63} | 71.09±0.21 | \textbf{76.25±0.18} |

4.1. Comparative Results

To evaluate the effectiveness of the HGMD framework, we compare its two instantiations, HGMD-weight and HGMD-mixup, with GLNN of Eq. (1) and Loss-Weighting of Eq. (3), respectively. The experiments are conducted on seven datasets with three different GNN architectures as teacher GNNs, where Improv. denotes the performance improvement with respect to GLNN. From the results reported in Table 1, we can make three observations: (1) Both HGMD-weight and HGMD-mixup perform much better than vanilla MLPs, GLNN, and Loss-Weighting on all seven datasets, especially on the large-scale ogbn-arxiv dataset. (2) Both HGMD-weight and HGMD-mixup are applicable to various types of teacher GNN architectures. For example, HGMD-mixup outperforms GLNN by 2.22% (GCN), 2.11% (SAGE), and 2.19% (GAT) averaged over seven datasets, respectively. (3) Overall, HGMD-mixup performs slightly better than HGMD-weight across various datasets, owing to the additional knowledge variants augmented by the hardness-aware mixup.

Furthermore, we compare HGMD-weight and HGMD-mixup with several state-of-the-art graph distillation methods, including both GNN-to-GNN and GNN-to-MLP KD. The experimental results reported in Table 2 show that (1) Despite being completely non-parametric methods, HGMD-weight and HGMD-mixup both perform much better than existing GNN-to-MLP baselines on 5 out of 8 datasets. (2) HGMD-weight and HGMD-mixup outperform those GNN-to-GNN baselines on four relatively small datasets (i.e., Cora, Citeseer, Pubmed, and Photo), and their performance is comparable to those GNN-to-GNN baselines on four relatively large datasets (i.e., CS, Physics, ogbn-arxiv, and ogbn-products). These observations indicate that distilled MLPs have the same expressive potential as teacher GNNs, and that being "parametric" is not a must for knowledge distillation. In addition, we also evaluate HGMD in the inductive setting, and the results are provided in Appendix C. A detailed comparison between HGMD and KRD using different knowledge hardness metrics can be found in Appendix D.

Table 3. Ablation study on the subgraph extraction and two distillation modules, where bold denotes the best metrics.

| Scheme | KD | SubGraph | Weight | Mixup | Cora | Citeseer | Pubmed | Photo | CS | Physics |
|---|---|---|---|---|---|---|---|---|---|---|
| Vanilla GCNs | | | | | 81.70±0.96 | 71.64±0.34 | 79.48±0.21 | 90.63±1.53 | 90.00±0.58 | 92.45±0.53 |
| GLNN | ✓ | | | | 82.20±0.73 | 71.72±0.30 | 80.16±0.20 | 91.42±1.61 | 92.22±0.72 | 93.11±0.39 |
| SubGraph-only | ✓ | ✓ | | | 83.78±0.55 | 74.26±0.69 | 81.46±0.45 | 92.58±1.58 | 93.48±0.53 | 93.80±0.63 |
| Weight-only | ✓ | | ✓ | | 83.56±0.72 | 73.96±0.46 | 81.18±0.64 | 92.14±1.37 | 93.23±0.46 | 93.50±0.67 |
| Mixup-only | ✓ | | | ✓ | 83.90±0.73 | 74.14±0.50 | 81.34±0.38 | 92.40±1.26 | 93.66±0.52 | 93.68±0.70 |
| HGMD-weight | ✓ | ✓ | ✓ | | 84.42±0.54 | 74.42±0.50 | 81.86±0.44 | 92.94±1.37 | 93.93±0.33 | 94.09±0.56 |
| HGMD-mixup | ✓ | ✓ | | ✓ | \textbf{84.66±0.47} | \textbf{74.62±0.40} | \textbf{82.02±0.45} | \textbf{93.33±1.31} | \textbf{94.16±0.32} | \textbf{94.27±0.63} |
Figure 4. Visualizations of three GNN knowledge samples of different hardness levels (Low / Middle / High) on (a) Cora, (b) Citeseer, (c) Photo, and (d) CS, where the node and edge colors indicate the hardness of knowledge samples and the sampling probability, and the color bars are on the right.

4.2. Ablation Study

To evaluate how hardness-aware subgraph extraction (SubGraph) and the two subgraph distillation strategies (weight and mixup) influence performance, we compare vanilla GCNs and GLNN with the following five schemes: (A) SubGraph-only: extract hardness-aware subgraphs and then distill their knowledge into the student MLP with equal loss weights; (B) Weight-only: take the full neighborhoods as subgraphs and then distill by hardness-aware weighting as in Eq. (5); (C) Mixup-only: take the full neighborhoods as subgraphs and then distill by hardness-aware mixup as in Eq. (7); (D) HGMD-weight; and (E) HGMD-mixup. We can observe from the experimental results reported in Table 3 that (1) SubGraph plays a very important role in improving performance, which illustrates the benefits of performing knowledge distillation at the subgraph level compared to the node level, as it provides more supervision for those hard samples in a hardness-aware manner. (2) Both hardness-aware weighting and mixup help improve performance, especially the latter. (3) Combining the two designs (subgraph extraction and subgraph distillation) further improves performance on top of each individual one on all six graph datasets.

4.3. Deep Analysis on Hardness Awareness

4.3.1. Case Study of Hardness-aware Subgraphs

To intuitively show what "hardness awareness" means, we select three GNN knowledge samples with different hardness levels from each of four datasets. Next, we mark the hardness of each knowledge sample as well as its neighboring nodes according to the color bar on the right, where a darker blue indicates a higher knowledge hardness. In addition, we use the edge color to denote the probability of the corresponding neighboring node being sampled into the hardness-aware subgraph, according to another color bar displayed on the right. We can make two observations from the results of Fig. 4: (1) For a given target node, neighboring nodes with lower hardness (lighter blue) tend to have a higher probability of being sampled into the subgraph. (2) A target node with higher hardness (darker blue) has a higher probability of having its neighboring nodes sampled. In other words, the sampling probability of neighboring nodes is actually a trade-off between their own hardness and the hardness of the target node. For example, when the hardness of the target node is 0.36, a neighboring node with a hardness of 0.61 is still unlikely to be sampled, as shown in Fig. 4(a); however, when the hardness of the target node is 1.73, even a neighboring node with a hardness of 1.52 has a high sampling probability.

4.3.2. 3D Histogram on Hardness and Similarity

We show in Fig. 5(a) and Fig. 5(b) the 3D histograms of the sampling probability of neighboring nodes with respect to their hardness, their cosine similarity to the target node, and the hardness of the target node, from which we can observe that: (1) As the hardness of a target node increases, the sampling probability of its neighboring nodes also increases. (2) Neighboring nodes with lower hardness have a higher probability of being sampled into the corresponding subgraph. (3) As the cosine similarity between neighboring nodes and the target node increases, their sampling probability also increases. However, when the hardness of the target node is high, an overly high similarity means that the hardness of the neighboring nodes will also be high, which in turn reduces the sampling probability; this is actually a trade-off between high similarity and low hardness.

Figure 5. (a)(b) 3D histograms of the sampling probability of neighboring nodes w.r.t. their hardness, the hardness of the target node, and their cosine similarity to the target on Cora. (c) Training curves for the average entropy of student MLPs and the average size of sampled subgraphs on Cora. (d) Ratio of connected nodes sampled symmetrically and asymmetrically among all edges.

4.3.3. Training Curves

We further report in Fig. 5(c) the average entropy of the nodes in student MLPs and the average size of sampled subgraphs during training on the Cora dataset. It can be seen from Fig. 5(c) that there exists an approximate resonance between the two curves. As the training progresses, the uncertainty of nodes in the student MLPs decreases and thus additional supervision required for hard sample distillation can be reduced accordingly.

4.3.4. Asymmetric Property of Subgraph Extraction

We statistically calculate, over all edges, the ratio of connected node pairs that are (and are not) simultaneously sampled into each other's subgraphs, referred to as symmetric and asymmetric sampling, respectively. The histogram in Fig. 5(d) shows that subgraph extraction is mostly asymmetric, especially for large-scale datasets. This is because subgraph extraction is performed in a hardness-aware manner, where low-hardness neighboring nodes of a high-hardness target node have a high sampling probability, but not vice versa. We believe that such an asymmetric property of subgraph extraction is a key aspect of the effectiveness of the HGMD framework, as it essentially transforms an undirected graph into a directed graph for processing.

5. Conclusion

In this paper, we thoroughly explore the knowledge hardness and distillation hardness for GNN-to-MLP knowledge distillation. We identify that hard sample distillation may be a major performance bottleneck of existing distillation algorithms. To address this problem, we propose a novel Hardness-aware GNN-to-MLP Distillation (HGMD) framework, which distills knowledge from teacher GNNs at the subgraph level (rather than the node level) in a hardness-aware manner to provide more supervision for those hard samples. Extensive experiments demonstrate the superiority of HGMD across various datasets and GNN architectures. Limitations still exist; for example, designing better hardness metrics or introducing additional learnable parameters for knowledge distillation may be promising directions for future work.

6. Acknowledgement

This work was supported by National Science and Technology Major Project of China (No. 2022ZD0115101), National Natural Science Foundation of China Project (No. U21A20427), and Project (No. WU2022A009) from the Center of Synthetic Biology and Integrated Bioengineering of Westlake University.

Appendix

A. Dataset Statistics

A total of eight real-world graph datasets are used in this paper to evaluate the proposed HGMD framework. An overview of the dataset characteristics is provided in Table A1. For the three small-scale datasets, including Cora, Citeseer, and Pubmed, we follow the data splitting strategy of (Kipf and Welling, 2016). For the three large-scale datasets, including Coauthor-CS, Coauthor-Physics, and Amazon-Photo, we follow (Zhang et al., 2022; Yang et al., 2021) and randomly split the data into train/val/test sets, where each random seed corresponds to a different data splitting. For the two large-scale ogbn-arxiv and ogbn-products datasets, we use the public data splits provided by the authors (Hu et al., 2020).

Table A1. Statistical information of the eight datasets.

| Dataset | Cora | Citeseer | Pubmed | Photo | CS | Physics | arxiv | products |
|---|---|---|---|---|---|---|---|---|
| # Nodes | 2,708 | 3,327 | 19,717 | 7,650 | 18,333 | 34,493 | 169,343 | 2,449,029 |
| # Edges | 5,278 | 4,614 | 44,324 | 119,081 | 81,894 | 247,962 | 1,166,243 | 61,859,140 |
| # Features | 1,433 | 3,703 | 500 | 745 | 6,805 | 8,415 | 128 | 100 |
| # Classes | 7 | 6 | 3 | 8 | 15 | 5 | 40 | 47 |

B. Implementation Details

The following hyperparameters are set the same for all datasets: epochs $E=500$, learning rate $lr=0.01$ (0.005 for ogbn-arxiv), weight decay $decay=$ 5e-4 (0.0 for ogbn-arxiv), and layer number $L=2$ (3 for Cora and ogbn-arxiv). The other dataset-specific hyperparameters are determined by the AutoML toolkit NNI with the following search spaces: hidden dimension $F=\{256,512,2048\}$, KD temperature $\tau=\{0.8,0.9,1.0,1.1,1.2\}$, loss weight $\beta=\{0.0,0.1,0.2,0.3,0.4,0.5\}$, and coefficient of the beta distribution $\alpha=\{0.3,0.4,0.5\}$. Moreover, the hyperparameter $\eta$ in Eq. (4) is initially set to $\{1,5,10\}$ and then decays exponentially with a decay step of 250. The experiments are implemented based on the DGL library (Wang et al., 2019) with an Intel(R) Xeon(R) Gold 6240R @ 2.40GHz CPU and an NVIDIA V100 GPU. For a fair comparison, the model with the highest validation accuracy is selected for testing. Besides, each set of experiments is run five times with different random seeds, and the averages are reported.

For all baselines, we did not directly copy the results from their original papers but reproduced them by distilling from the same teacher GNNs as in this paper, under the same settings and data splits. As we know, the performance of the distilled student MLPs depends heavily on the quality of teacher GNNs. However, we have no way to get the checkpoints of the teacher models used in previous baselines, i.e., we cannot guarantee that the student MLPs in all baselines are distilled from the same teacher GNNs. For the purpose of a fair comparison, we have to train teacher GNNs from scratch and then reproduce the results of previous baselines by distilling the knowledge from the SAME teacher GNNs. Therefore, even if we follow the default implementation and hyperparameters of these baselines exactly, there is no way to get identical results.

Table A2. Accuracy ± std (%) in the inductive setting on seven datasets, where HGMD-mixup outperforms GLNN by a wide margin and is comparable to NOSMOG and KRD, and where bold and underline denote the best and second metrics, respectively.

| Teacher | Student | Cora | Citeseer | Pubmed | Photo | CS | Physics | ogbn-arxiv |
|---|---|---|---|---|---|---|---|---|
| MLPs | - | 59.20±1.26 | 60.16±0.87 | 73.26±0.83 | 79.02±1.42 | 87.90±0.58 | 89.10±0.90 | 54.46±0.52 |
| GCNs | - | 79.30±0.49 | 71.46±0.36 | 78.10±0.51 | 89.32±1.63 | 90.07±0.60 | 92.05±0.78 | 70.88±0.35 |
| | GLNN | 71.24±0.55 | 70.76±0.30 | 80.16±0.73 | 89.92±1.34 | 92.08±0.98 | 92.89±0.88 | 60.92±0.31 |
| | KRD | 73.78±0.55 | 71.80±0.41 | 81.48±0.29 | 90.37±1.79 | 93.15±0.43 | 93.86±0.55 | 62.85±0.32 |
| | NOSMOG (w/o POS) | 73.18±0.45 | 72.40±0.51 | 80.84±0.46 | 90.37±1.14 | 92.87±0.53 | 93.56±0.61 | 62.88±0.30 |
| | HGMD-mixup (w/o POS) | \underline{73.92±0.47} | 73.05±0.34 | \textbf{81.78±0.59} | 91.10±1.59 | \textbf{93.65±0.64} | \underline{94.10±0.70} | 63.20±0.28 |
| | NOSMOG (w/ POS) | 73.64±0.53 | \underline{73.10±0.47} | 81.32±0.38 | \underline{91.26±1.49} | 93.47±0.71 | 93.94±0.65 | \textbf{71.48±0.35} |
| | HGMD-mixup (w/ POS) | \textbf{74.24±0.31} | \textbf{73.25±0.40} | \underline{81.67±0.46} | \textbf{91.32±1.71} | \underline{93.51±0.54} | \textbf{94.44±0.75} | \underline{70.24±0.48} |
Table A3. Comparison of KRD and HGMD-mixup with two knowledge hardness metrics under the transductive setting.

| Method | Knowledge Hardness | Cora | Citeseer | Pubmed | Photo | CS | Physics | ogbn-arxiv |
|---|---|---|---|---|---|---|---|---|
| MLPs | - | 59.58±0.97 | 60.32±0.61 | 73.40±0.68 | 78.65±1.68 | 87.82±0.64 | 88.81±1.08 | 54.63±0.84 |
| GCNs | - | 81.70±0.96 | 71.64±0.34 | 79.48±0.21 | 90.63±1.53 | 90.00±0.58 | 92.45±0.53 | 71.20±0.17 |
| KRD | Information Entropy | 83.87±0.51 | 74.12±0.47 | 81.24±0.31 | 91.75±1.46 | \underline{94.21±0.37} | 93.90±0.54 | 70.51±0.24 |
| HGMD-mixup | Information Entropy | \underline{84.66±0.47} | \underline{74.62±0.40} | \underline{82.02±0.45} | \underline{93.33±1.31} | 94.16±0.32 | 94.27±0.63 | \underline{71.09±0.21} |
| KRD | Invariant Entropy | 84.42±0.57 | \textbf{74.86±0.58} | 81.98±0.41 | 92.21±1.44 | 94.08±0.34 | \underline{94.30±0.46} | 70.92±0.21 |
| HGMD-mixup | Invariant Entropy | \textbf{84.89±0.35} | 74.48±0.41 | \textbf{82.30±0.28} | \textbf{93.57±1.17} | \textbf{94.47±0.49} | \textbf{94.54±0.55} | \textbf{71.48±0.27} |

C. Comparison under Inductive Setting

We compare HGMD-mixup with vanilla GCNs, GLNN (Zhang et al., 2022), KRD (Wu et al., 2023a), and NOSMOG (Tian et al., 2023b) in the inductive setting with GCNs as teacher GNNs. Considering the importance of node positional features (POS) in the inductive setting (as revealed by the ablation study in NOSMOG), we report the performance of HGMD-mixup and NOSMOG both w/ and w/o POS. We can observe from the results in Table A2 that (1) POS features play a crucial role, especially on the large-scale ogbn-arxiv dataset. (2) HGMD-mixup outperforms GLNN and KRD by a large margin and is comparable to NOSMOG both w/ and w/o POS features. Note that HGMD-mixup achieves such good performance with no additional learnable parameters introduced beyond the student MLPs.

D. Evaluation of Knowledge Hardness Metrics

In this paper, we default to the information entropy $\mathcal{H}(\mathbf{z}_{i})$, as defined in Eq. (2), as the measure of knowledge hardness. To further evaluate the effects of knowledge hardness metrics, we consider another, more complicated metric proposed by KRD (Wu et al., 2023a), namely Invariant Entropy, which defines the knowledge hardness of a GNN knowledge sample (node) $v_{i}$ by measuring the invariance of its information entropy to noise perturbations,

(A.1) $\rho_{i}=\frac{1}{\delta^{2}}\ \underset{\mathbf{X}^{\prime}\sim\mathcal{N}(\mathbf{X},\bm{\Sigma}(\delta))}{\mathbb{E}}\left[\left\|\mathcal{H}(\mathbf{z}^{\prime}_{i})-\mathcal{H}(\mathbf{z}_{i})\right\|^{2}\right],\ \text{where}\ \mathbf{Z}^{\prime}=f^{\mathcal{T}}_{\theta}(\mathbf{A},\mathbf{X}^{\prime})\ \text{and}\ \mathbf{Z}=f^{\mathcal{T}}_{\theta}(\mathbf{A},\mathbf{X}).$
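This metric can be estimated by Monte-Carlo sampling of the Gaussian perturbation, as in the hedged sketch below; `teacher(A, X)` is assumed to be a callable returning node logits, and the sample count is an illustrative choice.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def invariant_entropy(teacher, A, X, tau=1.0, delta=0.1, num_samples=10):
    """Eq. (A.1): sensitivity of each node's entropy to Gaussian feature noise (Monte-Carlo)."""
    def entropy(logits):
        p = F.softmax(logits / tau, dim=-1)
        return -(p * p.clamp_min(1e-12).log()).sum(dim=-1)

    H_clean = entropy(teacher(A, X))                          # H(z_i) on the clean features
    rho = torch.zeros_like(H_clean)
    for _ in range(num_samples):
        X_noisy = X + delta * torch.randn_like(X)             # X' ~ N(X, delta^2 * I)
        rho += (entropy(teacher(A, X_noisy)) - H_clean) ** 2  # squared entropy deviation
    return rho / (num_samples * delta ** 2)                   # (1 / delta^2) * E[...]
```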

From the experimental results in Table A3, we can make two observations: (1) Even the simplest information entropy metric is sufficient to yield state-of-the-art results. However, HGMD is also applicable to other knowledge hardness metrics beyond information entropy; finer and more complicated hardness metrics, such as Invariant Entropy, can lead to further performance gains for both KRD and HGMD-mixup. (2) We make a fair comparison between HGMD-mixup and KRD by using the same knowledge hardness metrics. Despite not involving any additional parameters in the distillation, HGMD-mixup outperforms KRD on 12 out of 14 metrics across the seven datasets.

References

  • Abu-El-Haija et al. (2019) Sami Abu-El-Haija, Bryan Perozzi, Amol Kapoor, Nazanin Alipourfard, Kristina Lerman, Hrayr Harutyunyan, Greg Ver Steeg, and Aram Galstyan. 2019. Mixhop: Higher-order graph convolutional architectures via sparsified neighborhood mixing. In international conference on machine learning. PMLR, 21–29.
  • Anonymous (2023) Anonymous. 2023. Double Wins: Boosting Accuracy and Efficiency of Graph Neural Networks by Reliable Knowledge Distillation. In Submitted to The Eleventh International Conference on Learning Representations. https://openreview.net/forum?id=NGIFt6BNvLe under review.
  • Chen et al. (2020b) Deli Chen, Yankai Lin, Wei Li, Peng Li, Jie Zhou, and Xu Sun. 2020b. Measuring and relieving the over-smoothing problem for graph neural networks from the topological view. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 3438–3445.
  • Chen et al. (2020a) Yuzhao Chen, Yatao Bian, Xi Xiao, Yu Rong, Tingyang Xu, and Junzhou Huang. 2020a. On self-distilling graph neural network. arXiv preprint arXiv:2011.02255 (2020).
  • Feng et al. (2022) Kaituo Feng, Changsheng Li, Ye Yuan, and Guoren Wang. 2022. FreeKD: Free-direction Knowledge Distillation for Graph Neural Networks. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 357–366.
  • Giles et al. (1998) C Lee Giles, Kurt D Bollacker, and Steve Lawrence. 1998. CiteSeer: An automatic citation indexing system. In Proceedings of the third ACM conference on Digital libraries. 89–98.
  • Hamilton et al. (2017) Will Hamilton, Zhitao Ying, and Jure Leskovec. 2017. Inductive representation learning on large graphs. In Advances in neural information processing systems. 1024–1034.
  • He et al. (2022) Huarui He, Jie Wang, Zhanqiu Zhang, and Feng Wu. 2022. Compressing deep graph neural networks via adversarial knowledge distillation. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 534–544.
  • Hu et al. (2020) Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. 2020. Open graph benchmark: Datasets for machine learning on graphs. arXiv preprint arXiv:2005.00687 (2020).
  • Jafari et al. (2021) Aref Jafari, Mehdi Rezagholizadeh, Pranav Sharma, and Ali Ghodsi. 2021. Annealing knowledge distillation. arXiv preprint arXiv:2104.07163 (2021).
  • Kipf and Welling (2016) Thomas N Kipf and Max Welling. 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016).
  • Lassance et al. (2020) Carlos Lassance, Myriam Bontonou, Ghouthi Boukli Hacene, Vincent Gripon, Jian Tang, and Antonio Ortega. 2020. Deep geometric knowledge distillation with graphs. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 8484–8488.
  • McCallum et al. (2000) Andrew Kachites McCallum, Kamal Nigam, Jason Rennie, and Kristie Seymore. 2000. Automating the construction of internet portals with machine learning. Information Retrieval 3, 2 (2000), 127–163.
  • Mirzadeh et al. (2020) Seyed Iman Mirzadeh, Mehrdad Farajtabar, Ang Li, Nir Levine, Akihiro Matsukawa, and Hassan Ghasemzadeh. 2020. Improved knowledge distillation via teacher assistant. In Proceedings of the AAAI conference on artificial intelligence, Vol. 34. 5191–5198.
  • Qiu et al. (2022) Zengyu Qiu, Xinzhu Ma, Kunlin Yang, Chunya Liu, Jun Hou, Shuai Yi, and Wanli Ouyang. 2022. Better teacher better student: Dynamic prior knowledge for knowledge distillation. arXiv preprint arXiv:2206.06067 (2022).
  • Ren et al. (2021) Yating Ren, Junzhong Ji, Lingfeng Niu, and Minglong Lei. 2021. Multi-task Self-distillation for Graph-based Semi-Supervised Learning. arXiv preprint arXiv:2112.01174 (2021).
  • Sen et al. (2008) Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Galligher, and Tina Eliassi-Rad. 2008. Collective classification in network data. AI magazine 29, 3 (2008), 93–93.
  • Shchur et al. (2018) Oleksandr Shchur, Maximilian Mumme, Aleksandar Bojchevski, and Stephan Günnemann. 2018. Pitfalls of graph neural network evaluation. arXiv preprint arXiv:1811.05868 (2018).
  • Shen et al. (2021) Zhiqiang Shen, Zechun Liu, Dejia Xu, Zitian Chen, Kwang-Ting Cheng, and Marios Savvides. 2021. Is label smoothing truly incompatible with knowledge distillation: An empirical study. arXiv preprint arXiv:2104.00676 (2021).
  • Son et al. (2021) Wonchul Son, Jaemin Na, Junyong Choi, and Wonjun Hwang. 2021. Densely guided knowledge distillation using multiple teacher assistants. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 9395–9404.
  • Stanton et al. (2021) Samuel Stanton, Pavel Izmailov, Polina Kirichenko, Alexander A Alemi, and Andrew G Wilson. 2021. Does knowledge distillation really work? Advances in Neural Information Processing Systems 34 (2021), 6906–6919.
  • Tian et al. (2023a) Yijun Tian, Shichao Pei, Xiangliang Zhang, Chuxu Zhang, and Nitesh V Chawla. 2023a. Knowledge Distillation on Graphs: A Survey. arXiv preprint arXiv:2302.00219 (2023).
  • Tian et al. (2023b) Yijun Tian, Chuxu Zhang, Zhichun Guo, Xiangliang Zhang, and Nitesh Chawla. 2023b. Learning MLPs on Graphs: A Unified View of Effectiveness, Robustness, and Efficiency. In The Eleventh International Conference on Learning Representations. https://openreview.net/forum?id=Cs3r5KLdoj
  • Veličković et al. (2017) Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2017. Graph attention networks. arXiv preprint arXiv:1710.10903 (2017).
  • Wang et al. (2019) Minjie Wang, Da Zheng, Zihao Ye, Quan Gan, Mufei Li, Xiang Song, Jinjing Zhou, Chao Ma, Lingfan Yu, Yu Gai, Tong He, George Karypis, Jinyang Li, and Zheng Zhang. 2019. Deep Graph Library: A Graph-Centric, Highly-Performant Package for Graph Neural Networks. arXiv preprint arXiv:1909.01315 (2019).
  • Wu et al. (2024) Lirong Wu, Haitao Lin, Zhangyang Gao, Guojiang Zhao, and Stan Z Li. 2024. A Teacher-Free Graph Knowledge Distillation Framework with Dual Self-Distillation. IEEE Transactions on Knowledge and Data Engineering (2024).
  • Wu et al. (2023b) Lirong Wu, Haitao Lin, Yufei Huang, Tianyu Fan, and Stan Z Li. 2023b. Extracting Low-/High- Frequency Knowledge from Graph Neural Networks and Injecting it into MLPs: An Effective GNN-to-MLP Distillation Framework. In Proceedings of the AAAI Conference on Artificial Intelligence.
  • Wu et al. (2022a) Lirong Wu, Haitao Lin, Yufei Huang, and Stan Z Li. 2022a. Knowledge Distillation Improves Graph Structure Augmentation for Graph Neural Networks. In Advances in Neural Information Processing Systems.
  • Wu et al. (2023a) Lirong Wu, Haitao Lin, Yufei Huang, and Stan Z Li. 2023a. Quantifying the Knowledge in GNNs for Reliable Distillation into MLPs. arXiv preprint arXiv:2306.05628 (2023).
  • Wu et al. (2023c) Lirong Wu, Haitao Lin, Zihan Liu, Zicheng Liu, Yufei Huang, and Stan Z Li. 2023c. Homophily-Enhanced Self-Supervision for Graph Structure Learning: Insights and Directions. IEEE Transactions on Neural Networks and Learning Systems (2023).
  • Wu et al. (2021) Lirong Wu, Haitao Lin, Cheng Tan, Zhangyang Gao, and Stan Z Li. 2021. Self-supervised learning on graphs: Contrastive, generative, or predictive. IEEE Transactions on Knowledge and Data Engineering (2021).
  • Wu et al. (2022b) Lirong Wu, Jun Xia, Zhangyang Gao, Haitao Lin, Cheng Tan, and Stan Z Li. 2022b. Graphmixup: Improving class-imbalanced node classification by reinforcement mixup and self-supervised context prediction. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 519–535.
  • Wu et al. (2020) Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and S Yu Philip. 2020. A comprehensive survey on graph neural networks. IEEE transactions on neural networks and learning systems (2020).
  • Yan et al. (2020) Bencheng Yan, Chaokun Wang, Gaoyang Guo, and Yunkai Lou. 2020. TinyGNN: Learning Efficient Graph Neural Networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1848–1856.
  • Yan et al. (2022) Yujun Yan, Milad Hashemi, Kevin Swersky, Yaoqing Yang, and Danai Koutra. 2022. Two sides of the same coin: Heterophily and oversmoothing in graph convolutional neural networks. In 2022 IEEE International Conference on Data Mining (ICDM). IEEE, 1287–1292.
  • Yang et al. (2021) Cheng Yang, Jiawei Liu, and Chuan Shi. 2021. Extract the Knowledge of Graph Neural Networks and Go Beyond it: An Effective Knowledge Distillation Framework. In Proceedings of the Web Conference 2021. 1227–1237.
  • Yang et al. (2020) Yiding Yang, Jiayan Qiu, Mingli Song, Dacheng Tao, and Xinchao Wang. 2020. Distilling knowledge from graph convolutional networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 7074–7083.
  • Zhang et al. (2020a) Hanlin Zhang, Shuai Lin, Weiyang Liu, Pan Zhou, Jian Tang, Xiaodan Liang, and Eric P Xing. 2020a. Iterative graph self-distillation. arXiv preprint arXiv:2010.12609 (2020).
  • Zhang et al. (2022) Shichang Zhang, Yozen Liu, Yizhou Sun, and Neil Shah. 2022. Graph-less Neural Networks: Teaching Old MLPs New Tricks Via Distillation. In International Conference on Learning Representations. https://openreview.net/forum?id=4p6_5HBWPCw
  • Zhang et al. (2020b) Wentao Zhang, Xupeng Miao, Yingxia Shao, Jiawei Jiang, Lei Chen, Olivier Ruas, and Bin Cui. 2020b. Reliable data distillation on graph convolutional network. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data. 1399–1414.
  • Zhou et al. (2021) Helong Zhou, Liangchen Song, Jiajie Chen, Ye Zhou, Guoli Wang, Junsong Yuan, and Qian Zhang. 2021. Rethinking soft labels for knowledge distillation: A bias-variance tradeoff perspective. arXiv preprint arXiv:2102.00650 (2021).
  • Zhou et al. (2020) Jie Zhou, Ganqu Cui, Shengding Hu, Zhengyan Zhang, Cheng Yang, Zhiyuan Liu, Lifeng Wang, Changcheng Li, and Maosong Sun. 2020. Graph neural networks: A review of methods and applications. AI Open 1 (2020), 57–81.
  • Zhu et al. (2022) Yichen Zhu, Ning Liu, Zhiyuan Xu, Xin Liu, Weibin Meng, Louis Wang, Zhicai Ou, and Jian Tang. 2022. Teach Less, Learn More: On the Undistillable Classes in Knowledge Distillation. In Advances in Neural Information Processing Systems.
  • Zhu and Wang (2021) Yichen Zhu and Yi Wang. 2021. Student customized knowledge distillation: Bridging the gap between student and teacher. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 5057–5066.