Boosting Graph Neural Networks via Adaptive Knowledge Distillation

Zhichun Guo¹, Chunhui Zhang², Yujie Fan³, Yijun Tian¹, Chuxu Zhang², Nitesh V. Chawla¹

Abstract

Graph neural networks (GNNs) have shown remarkable performance on diverse graph mining tasks. While sharing the same message passing framework, our study shows that different GNNs learn distinct knowledge from the same graph. This implies potential performance improvement by distilling the complementary knowledge from multiple models. However, knowledge distillation (KD) transfers knowledge from high-capacity teachers to a lightweight student, which deviates from our scenario: GNNs are often shallow. To transfer knowledge effectively, we need to tackle two challenges: how to transfer knowledge from compact teachers to a student with the same capacity; and, how to exploit student GNN’s own learning ability. In this paper, we propose a novel adaptive KD framework, called BGNN, which sequentially transfers knowledge from multiple GNNs into a student GNN. We also introduce an adaptive temperature module and a weight boosting module. These modules guide the student to the appropriate knowledge for effective learning. Extensive experiments have demonstrated the effectiveness of BGNN. In particular, we achieve up to 3.05% improvement for node classification and 6.35% improvement for graph classification over vanilla GNNs.

Introduction

Refer to caption — Figure 1: CKA similarity between graph representation at different layers of GNNs on Enzymes.

Recent years have witnessed the significant development of graph neural networks (GNNs). Various GNNs have been developed and applied to different graph mining tasks (Kipf and Welling 2017; Hamilton, Ying, and Leskovec 2017; Velickovic et al. 2018; Klicpera, Bojchevski, and Günnemann 2019; Xu et al. 2019; Wu et al. 2019; Jin et al. 2021; Guo et al. 2022a). Although most GNNs can be unified into the Message Passing Neural Networks (Gilmer et al. 2017), their learning abilities diverge (Xu et al. 2019; Balcilar et al. 2020). In our preliminary study, we observe that the graph representations learned by different GNNs are not similar, especially in deeper layers. It suggests that different GNNs may encode complementary knowledge due to their different aggregation schemes. Based on this observation, it is natural to ask: can we boost vanilla GNNs by effectively utilizing complementary knowledge learned by different GNNs from the same dataset?

An intuitive solution is to compose multiple models into an ensemble (Hansen and Salamon 1990; Breiman 2001) that would achieve better performance than each of its constituent models. However, ensemble is not always effective especially when the base classifiers are strong learners (Zhang et al. 2020). Thus, we seek a different approach to take advantage of knowledge from different GNNs: knowledge distillation (KD) (Hinton et al. 2015; Romero et al. 2014; Touvron et al. 2021), which distills information from one (teacher) model to another (student) model. However, KD is always accompanied by model compression (Yim et al. 2017; Heo et al. 2019; Yuan et al. 2019), where the teacher network is a high-capacity neural network, and the student network is a compact and fast-to-execute model. Standing by this situation, there could be a significant performance gap between students and teachers. But this kind of performance gap may not exist in our scenario: GNNs are all very shallow due to the oversmoothing issue (Zhao and Akoglu 2019; Li, Han, and Wu 2018; Alon and Yahav 2020). Hence, it is more difficult to distill extra knowledge from teacher GNNs to boost the student GNN. To achieve this goal, two major challenges arise: the first one is how to transfer knowledge from a teacher GNN into a student GNN with the same capacity that can produce the same even better performance (teaching effectiveness); the second one is how to push the student model to play the best role in learning by itself, which is ignored in the traditional KD where the student’s performance heavily relies on the teacher (learning ability).

In this work, we propose a novel framework, namely BGNN, which combines the knowledge from different GNNs in a “boosting” way to strengthen a vanilla GNN through knowledge distillation. To improve the teaching effectiveness, we propose two strategies to increase the useful knowledge transferred from the teachers to the student. One is the sequential training strategy, where the student is encouraged to focus on learning from one teacher at a time. This allows the student to learn diverse knowledge from individual GNNs. The other one is an adaptive temperature module. Unlike existing KD methods that use a uniform temperature for all samples, the temperature in BGNN is adjustable based on the teacher’s confidence in a specific sample. To enhance the learning ability, we develop a weight boosting module. This module redistributes the weight of samples, making the student GNN pay more attention to the misclassified samples. Our proposed BGNN is a general model which can be applied to both graph classification and node classification tasks. We conduct extensive experimental studies on both tasks, and the results demonstrate the superior performance of BGNN compared with a set of baseline methods.

To summarize, our contributions are listed as follows:

•

Through empirical study, we show that the representations learned by different GNNs are not similar, indicating that they encode different knowledge from the same input.
•

Motivated by our observation, we propose a novel framework BGNN that transfers knowledge from different GNNs in a “boosting” way to elevate a vanilla GNN.
•

Rather than using a uniform temperature for all samples, we design an adaptive temperature for each sample, which benefits the knowledge transfer from teacher to student.
•

Empirical results have demonstrated the effectiveness of BGNN. Particularly, we achieve up to 3.05% and 6.35% improvement over vanilla GNNs for node classification and graph classification, respectively.

Related Work and Background

Graph Neural Networks. Most GNNs follow a message-passing scheme, which consists of message, update, and readout functions to learn node embeddings by iteratively aggregating the information of its neighbors (Xu et al. 2019; Wu et al. 2019; Klicpera, Bojchevski, and Günnemann 2019). For example, GCN (Kipf and Welling 2017) simplifies graph convolutions, which takes averaged aggregation method to aggregate the neighbors’ information; GraphSage (Hamilton, Ying, and Leskovec 2017) fixes the number of sampled neighbors to perform aggregation; GAT (Velickovic et al. 2018) proposes an attention mechanism (Vaswani et al. 2017) to treat neighbors differently in aggregation. The aim of this work is not to design a new GNN architecture, but to propose a new framework to boost existing GNNs by leveraging the diverse learning abilities of different GNNs.

GNN Knowledge Distillation. There have been many models that apply KD framework on GNNs for better efficiency in different settings (Yang, Liu, and Shi 2021; Zheng et al. 2022; Zhang et al. 2020; Deng and Zhang 2021; Feng et al. 2022). For example, Yan et al. (Yan et al. 2020) proposed TinyGNN to distillate a large GNN to a small GNN. GLNN (Zhang et al. 2022) was proposed to distillate GNNs to MLP. All of these work distillate knowledge by penalizing the softened logit differences between a teacher and a student following (Hinton et al. 2015). Besides this vanilla KD, Yang et al. (Yang et al. 2020) proposed LSP, a local structure preserving based KD method in computer vision area, to transfer the knowledge effectively between different GCN models. Wang et al. (Wang et al. 2021) propose a novel multi-teacher KD method, MulDE, for link prediction based on knowledge graph embeddings. LLP (Guo et al. 2022b) is another KD framework specifically for link prediction tasks. In this work, we also take logits-based KD method to distillate knowledge but design two modules to increase useful knowledge transferred from the teachers to the student and play the best of the student model. Different from combining teachers’ knowledge in a parallel way in MulDE, we utilize sequential training strategy to combine different teacher models.

Background and Preliminary Study

Background

Notations. Let $\mathcal{G}=(\mathcal{V},\mathcal{E})$ denote a graph, where $\mathcal{V}$ stands for all nodes and $\mathcal{E}$ stands for all edges. Each node $v_{i}\in\mathcal{V}$ in the graph has a corresponding $D$ -dimensional feature vector $\boldsymbol{x}_{i}\in\mathbb{R}^{D}$ . There are $N$ nodes in the graph. The entire node features matrix is $\textbf{X}\in\mathbb{R}^{N\times D}$ .

Graph Neural Networks. A GNN iteratively updates node embeddings by aggregating information of its neighboring nodes. We initialize the embedding of node $v$ as $\boldsymbol{h}_{v}^{(0)}=\boldsymbol{x}_{v}$ . Its embedding in the $l$ -th layer is updated to $\boldsymbol{h}_{v}^{(l)}$ by aggregating its neighbors’ embedding, which is formulated as: $\boldsymbol{h}^{(l)}_{v}=\textsc{UPDATE}_{l}\big{(}\boldsymbol{h}^{(l-1)}_{v},\textsc{AGG}_{l}\big{(}\{\boldsymbol{h}^{(l-1)}_{u}:\forall u\in\mathcal{N}(v)\}\big{)}\big{)},$ where AGG and UPDATE are aggregation function and update function, respectively, $\mathcal{N}(v)$ denotes the neighbors of node $v$ . Furthermore, the whole graph representation can be computed based on all nodes’ representations as: $\boldsymbol{h}_{G}=\textsc{READOUT}(\{\boldsymbol{h}^{(l)}_{v}|v\in\mathcal{V}\}),$ where readout is a graph-level pooling function.

Node Classification. Node classification is a typical supervised learning task for GNNs. The target is to predict the label of unlabeled node $v$ in the graph. Let $\textbf{Y}\in\mathbb{R}^{N\times C}$ be the set of node labels. The ground truth of node $v$ will be $y_{v}$ , a $C$ -dimension one-hot vector.

Graph Classification. Graph classification is commonly used in chemistry tasks like molecular property prediction (Hu et al. 2019; Guo et al. 2021). Graph classification is to predict the graph properties. Here, the ground truth matrix $\textbf{Y}\in\mathbb{R}^{M\times C}$ is the set of graph labels, where $M$ and $C$ are the number of graphs and graph categories, respectively.

Preliminary Study on GNNs’ Representation

Next, we perform a preliminary study to answer the following question: do different GNNs encode different knowledge from the same input graphs? We train 4-layer GCN, GAT, and GraphSage on Enzymes in a supervised way. After the training, we utilize Centered Kernel Alignment (CKA) (Kornblith et al. 2019) as a similarity index to evaluate the relationship among different representations. The higher CKA means the compared representations are more similar. We take the average of all the embeddings at each layer as the representations of that layer. Figure 1 illustrates the CKA between representations of each layer learned from GCN, GAT, and GraphSage on Enzymes. We observe that the similarities between the learned representations at different layers in GCN, GAT, and GraphSage are diverse. For example, the CKA value between the representation from layer 1/2/3/4 of GCN and that from GAT is around 0.7/0.35/0.4/0.3. It indicates that different GNNs may encode different knowledge.

We posit that different aggregation schemes in these GNNs cause difference in the learned representations. In particular, GCN aggregates neighborhoods with predefined weights; GAT aggregates neighborhoods using learnable weights and GraphSage randomly samples neighbors during aggregation. Given such differences, it is promising to boost one GNN by incorporating the knowledge from other GNNs and it motivates us to design a framework that can take advantage of the diverse knowledge from different GNNs.

The Proposed Framework

In this section, we introduce the framework BGNN to boost GNNs by utilizing complementary knowledge from other GNNs. An illustration of the model framework is shown in Figure 2. Our framework adopts a sequential training strategy to encourage the student to focus on learning from one single teacher at a time. To adjust the information distilled from the teacher, we propose an adaptive temperature module to adjust the soft labels from teachers. Further, we propose a weight boosting mechanism to enhance the student model training.

Model Overview

To boost a GNN, we take it as the student and we aim to transfer diverse knowledge from other teacher GNNs into it. In this work, we utilize the KD method proposed by (Hinton et al. 2015), where a teacher’s knowledge is transferred to a student by encouraging the student model to imitate the teacher’s behavior. In our framework, we pre-train a teacher GNN (GNN_T) with ground-truth labels and keep its parameters fixed during KD. Then, we transfer the knowledge from GNN_T by letting the student GNN (GNN_S) optimize the soft cross-entropy loss between the student network’s logits $\boldsymbol{z}_{v}$ and the teachers’ logits $\boldsymbol{t}_{v}$ . Let $\tau_{v}$ be the temperature for node $v$ to soften the logits distribution of teacher GNN.

Then we incorporate the target of knowledge distillation into the training process of the student model by minimizing the following loss:

	$\displaystyle\mathcal{L}=\mathcal{L}_{\text{label}}+\lambda\mathcal{L}_{KD}=\mathcal{L}_{\text{label}}-\lambda\sum\nolimits_{v\in\mathcal{V}}\hat{\boldsymbol{y}}_{v}^{\mathrm{T}}\mathrm{log}(\hat{\boldsymbol{y}}_{v}^{\mathrm{S}}),$		(1)
	$\displaystyle\text{with}\quad\hat{\boldsymbol{y}}_{v}^{\mathrm{T}}=\mathrm{softmax}(\boldsymbol{t}_{v}/\tau_{v}),\;\hat{\boldsymbol{y}}_{v}^{\mathrm{S}}=\mathrm{softmax}(\boldsymbol{z}_{v}/\tau_{v}),$		(1)

where $\mathcal{L}_{\text{label}}$ is the supervised training loss w.r.t. ground-truth labels and $\lambda$ is a trade-off factor to balance their importance.

Following the above objective, we adopt a sequential process of distillation, where we can freely boost the student GNN with one teacher GNN or multiple teachers. Such a sequential manner encourages the student model to focus on the knowledge from one single teacher. In contrast, when using multiple teachers simultaneously, the student may receive mixed noisy signals which can harm the distillation process.

As illustrated in Figure 2(a), we first train a teacher GNN₁ using true labels and train a student GNN₂ with dual targets of predicting the true labels and matching the logits distribution of GNN₁. The logits distribution has been softened by our proposed adaptive temperature (Figure 2(b)) for each node and the weight of the nodes misclassified by the teacher GNN are boosted when predicting true labels (Figure 2(c)). The parameters of teacher GNN₁ are not updated when we train GNN₂. Such process is repeated for $p$ steps and the knowledge from GNN_p-1, $\ldots$ , GNN₁ can be transferred into the last student GNN_p.

Distillation via Adaptive Temeperature

Instead of leveraging KD for compressing GNN models, we aim to use KD for boosting the student by transferring knowledge between GNNs sharing the same capacity. To achieve this goal, we need a more powerful KD method to fully take advantage of the useful knowledge in the teacher GNN so as to produce better performance.

Analysis of KD. Before we introduce the detailed technique, we first analyze the reason behind KD’s success by comparing the gradient of $\mathcal{L}_{KD}$ and supervised loss. Specifically, for any single sample, we compute the gradient of $\mathcal{L}_{KD}$ for its $c$ -th output (Detailed derivation is in Section A of Appendix):

	$\displaystyle\frac{\partial{\mathcal{L}_{KD}}}{\partial{z_{v,c}}}$	$\displaystyle=-\frac{\partial}{\partial z_{v,c}}\sum\nolimits_{j=1}^{C}\hat{y}_{v,j}^{\mathrm{T}}\mathrm{log}(\hat{y}_{v,j}^{\mathrm{S}})=\frac{1}{\tau_{v}}(\hat{y}_{v,c}^{\mathrm{S}}-\hat{y}_{v,c}^{\mathrm{T}})$
		$\displaystyle=\frac{1}{\tau_{v}}\Big{(}\frac{e^{z_{v,c/\tau_{v}}}}{\sum\nolimits_{j=1}^{C}e^{z_{v,j}/\tau_{v}}}-\frac{e^{t_{v,c/\tau_{v}}}}{\sum_{j=1}^{C}e^{t_{v,j}/\tau_{v}}}\Big{)}.$		(2)

Let $*$ denote the true label for the single sample, i.e., $y_{v,*}=1$ . Then the gradient of KD loss for this sample is:

	$\displaystyle\frac{\partial{\mathcal{L}_{KD}}}{\partial{z_{v}}}$	$\displaystyle=\frac{1}{\tau_{v}}\sum\nolimits_{c=1}^{C}(\hat{y}_{v,c}^{\mathrm{S}}-\hat{y}_{v,c}^{\mathrm{T}})$
	$\displaystyle=\frac{1}{\tau_{v}}$	$\displaystyle\Big{(}(\hat{y}_{v,}^{\mathrm{S}}-\hat{y}_{v,}^{\mathrm{T}})+\sum\nolimits_{c=1,c\neq*}^{C}(\hat{y}_{v,c}^{\mathrm{S}}-\hat{y}_{v,c}^{\mathrm{T}})\Big{)}.$		(3)

In the above equation, the first term $(\hat{y}_{v,*}^{\mathrm{S}}-\hat{y}_{v,*}^{\mathrm{T}})$ corresponds to transferring the knowledge hidden in the distribution of the logits of the correct category; the second term is responsible for transferring the dark knowledge from the wrong categories. This dark knowledge contains important information about the similarity between categories, which is the key point for the success of KD (Hinton et al. 2015). We rewrite the first term as: $\hat{y}_{v,*}^{\mathrm{S}}-\hat{y}_{v,*}^{\mathrm{T}}y_{v,*}$ . Note that the gradient for the sample of the cross entropy between student logits $z_{v}$ and ground truth label is $\hat{y}_{v,*}^{\mathrm{S}}-y_{v,*}$ when $\tau_{v}$ is 1. We can view $\hat{y}_{v,*}^{\mathrm{T}}$ as the importance weight of the true label information. If the teacher is confident for this sample (i.e. $\hat{y}_{v,*}^{\mathrm{T}}\approx 1$ ), the ground truth related information will play a more important role in gradient computation. Thus, both true label information and similarity information hidden in the wrong categories contribute a lot to the success of KD. The confidence of the teacher is the key to balancing these two parts’ importance in KD training process.

Detailed Technique. From the above analysis, temperature directly controls the trade-off between true label knowledge and dark knowledge. Additionally, existing work (Zhang and Sabuncu 2020) has proved that temperature scaling helps to yield more calibrated models. To transfer more useful knowledge from teachers to students, we make the predefined hyper-parameter temperature adaptive for all nodes: each node is associated with an adaptive temperature based on the confidence of teachers for each node. Specifically, we take the entropy of the teacher’s logits $\boldsymbol{t}_{v}$ to measure the teachers’ confidence for each node. The lower entropy means more confident the teacher is for the specific node. Then we compute the temperature for each node with learnable parameters by the following formulation:

	$\displaystyle\mathrm{Confidence}(\boldsymbol{t}_{v})=-\sum\nolimits_{c=1}^{C}t_{v,c}log(t_{v,c}),$		(4)
	$\displaystyle\tau_{v}=\sigma\Big{(}\mathrm{MLP}\big{(}\mathrm{Confidence}(\boldsymbol{t}_{v})\big{)}\Big{)},$		(5)

where $\sigma$ represents $\mathrm{sigmoid}$ operation. We use $\tau_{v}=\tau_{v}\times(\tau_{max}-\tau_{min})+\tau_{min}$ to limit the temperature into a fixed range [ $\tau_{min},\tau_{max}$ ]. In the experiments, we discover that involving the distribution of the logits of the teacher in the temperature is more effective than only considering the teachers’ confidence after two training steps:

\tau_{v}=\sigma\Big{(}\mathrm{MLP}\big{(}\textsc{Concat}(\boldsymbol{t}_{v},\mathrm{Confidence}(\boldsymbol{t}_{v}))\big{)}\Big{)},

(6)

where Concat represents the concatenation function.

Weight Boosting

As we mentioned earlier, we aim to transfer knowledge from teachers to students with the same capacity. In this case, we should not only enhance the knowledge transferred from teachers to the student but also enhance the supervised training process of the student GNN itself. For the supervised training procedure, inspired by Adaboost algorithm (Freund, Schapire, and Abe 1999; Sun, Zhu, and Lin 2021), we propose to boost the weights of misclassified nodes by the teacher GNN, which encourages the student GNN to pay more attention to these misclassified samples and learn them better. For better efficiency, we drop the ensemble step of boosting and take the student GNN generated by the last training step for the latter prediction in test data. In specific, firstly, we initialize the node weights $w_{i}=1/N_{train},i=1,2,3,...,N_{train}$ on training set. Next, we pre-train a teacher GNN with samples of the training set in a supervised way. Then we boost the weights of training samples (i.e. graphs for graph classification and nodes for node classification) that are misclassified by GNN_T. Here, we apply SAMME.R (Hastie et al. 2009) to update the weight:

w_{i}=w_{i}\cdot\mathrm{exp}\Big{(}-\frac{C-1}{C}\boldsymbol{y}_{i}^{\top}\mathrm{log}\ \boldsymbol{p}_{i}\Big{)},i=1,2,...,N_{train},

(7)

where $\boldsymbol{p}_{i}=\mathrm{softmax}(\boldsymbol{z}_{i})$ .

Overall Objective. By incorporating $w_{i}$ into the supervised training process and combining KD training, the loss function of our method can be formulated as:

	$\displaystyle\mathcal{L}_{BGNN}=\mathcal{L}_{Label}+\lambda\mathcal{L}_{KD}$
	$\displaystyle=-\sum\nolimits_{i=1}^{N_{train}}w_{i}\ \boldsymbol{y}_{i}\mathrm{log}(\boldsymbol{p}_{i})-\lambda\sum_{v\in\mathcal{V}}\hat{\boldsymbol{y}}_{v}^{\mathrm{T}}\mathrm{log}(\hat{\boldsymbol{y}}_{v}^{\mathrm{S}}).$		(8)

Experiments

		Graph Classification			Node Classification
Student	Method	Collab	IMDB	Enzymes	Cora	Citeseer	Pubmed	A-Computers
GCN	NoKD	80.82 $\pm$ 0.99	77.20 $\pm$ 0.40	66.33 $\pm$ 1.94	82.25 $\pm$ 0.39	72.30 $\pm$ 0.21	79.40 $\pm{0.56}$	87.79 $\pm$ 0.04
	KD (Hinton et al. 2015)	81.24 $\pm$ 0.63	77.60 $\pm$ 0.77	65.00 $\pm$ 1.36	82.78 $\pm$ 0.42	72.59 $\pm$ 0.28	79.60 $\pm$ 0.67	87.13 $\pm$ 0.11
	LSP (Yang et al. 2020)	81.22 $\pm$ 0.29	77.99 $\pm$ 1.21	67.12 $\pm$ 1.11	82.29 $\pm$ 0.77	72.77 $\pm$ 0.34	78.93 $\pm$ 0.38	87.93 $\pm$ 0.72
	BGNN	82.73 $\pm$ 0.34	79.33 $\pm$ 0.47	69.44 $\pm$ 0.79	83.97 $\pm$ 0.17	73.87 $\pm$ 0.24	80.73 $\pm$ 0.25	89.58 $\pm$ 0.03
SAGE	NoKD	81.20 $\pm$ 0.55	77.60 $\pm$ 1.02	73.67 $\pm$ 5.52	81.18 $\pm$ 0.92	71.62 $\pm$ 0.33	78.00 $\pm$ 0.23	87.49 $\pm$ 1.40
	KD (Hinton et al. 2015)	80.34 $\pm$ 0.50	77.90 $\pm$ 1.14	71.67 $\pm$ 1.36	81.72 $\pm$ 1.04	72.58 $\pm$ 0.32	77.25 $\pm$ 0.39	89.34 $\pm$ 0.18
	LSP (Yang et al. 2020)	81.33 $\pm$ 0.31	78.23 $\pm$ 0.91	74.23 $\pm$ 2.16	82.00 $\pm$ 0.75	71.37 $\pm$ 0.34	77.40 $\pm$ 0.22	88.49 $\pm$ 0.72
	BGNN	82.67 $\pm$ 0.57	79.67 $\pm$ 0.47	78.33 $\pm$ 1.00	83.30 $\pm$ 0.35	73.90 $\pm$ 0.16	79.03 $\pm$ 0.09	89.85 $\pm$ 0.12
GAT	NoKD	79.28 $\pm$ 0.47	76.20 $\pm$ 1.33	74.33 $\pm$ 2.00	83.44 $\pm$ 0.17	72.46 $\pm$ 0.75	79.36 $\pm$ 0.19	87.98 $\pm$ 0.29
	KD (Hinton et al. 2015)	78.98 $\pm$ 1.01	77.80 $\pm$ 1.25	72.67 $\pm$ 2.36	83.50 $\pm$ 0.27	72.52 $\pm$ 0.20	78.91 $\pm$ 0.18	86.90 $\pm$ 0.25
	LSP (Yang et al. 2020)	78.99 $\pm$ 1.39	78.00 $\pm$ 0.93	74.43 $\pm$ 3.27	83.36 $\pm$ 0.57	71.90 $\pm$ 0.36	79.53 $\pm$ 0.26	86.90 $\pm$ 0.48
	BGNN	80.73 $\pm$ 0.25	79.33 $\pm$ 0.94	79.36 $\pm$ 2.81	84.63 $\pm$ 0.28	74.53 $\pm$ 0.29	79.83 $\pm$ 0.12	89.43 $\pm$ 0.11

Table 1: Classification performances under the single teacher setting. NoKD denotes the results of supervised training without KD. Bold text is used to highlight best results for each backbone. Shadow is to highlight best results for each dataset.

		Graph Classification			Node Classification
Student	Method	Collab	IMDB	Enzymes	Cora	Citeseer	Pubmed	A-Computers
GCN	BAN (Furlanello et al. 2018)	81.60 $\pm$ 0.40	78.50 $\pm$ 1.00	66.67 $\pm$ 1.67	83.17 $\pm$ 0.26	72.47 $\pm$ 0.21	79.87 $\pm$ 0.21	88.78 $\pm$ 0.05
	MulDE (Wang et al. 2021)	80.86 $\pm$ 1.18	77.50 $\pm$ 1.20	67.33 $\pm$ 0.90	82.53 $\pm$ 0.05	72.33 $\pm$ 0.09	78.73 $\pm$ 0.09	88.41 $\pm$ 0.12
	BGNN(s)	82.73 $\pm$ 0.34	79.33 $\pm$ 0.47	69.44 $\pm$ 0.79	83.97 $\pm$ 0.17	73.87 $\pm$ 0.24	80.73 $\pm$ 0.25	89.58 $\pm$ 0.03
	BGNN(m)-ST	82.87 $\pm$ 0.09	79.67 $\pm$ 0.27	71.12 $\pm$ 2.45	84.83 $\pm$ 0.25	73.60 $\pm$ 0.14	80.20 $\pm$ 0.08	89.03 $\pm$ 0.02
	BGNN(m)-TS	83.40 $\pm$ 0.15	79.00 $\pm$ 0.00	70.00 $\pm$ 1.36	84.40 $\pm$ 0.22	74.87 $\pm$ 0.25	80.90 $\pm$ 0.00	89.02 $\pm$ 0.03
SAGE	BAN (Furlanello et al. 2018)	81.80 $\pm$ 0.20	80.73 $\pm$ 0.20	70.00 $\pm$ 1.67	82.80 $\pm$ 0.31	73.10 $\pm$ 0.57	77.90 $\pm$ 0.08	89.73 $\pm$ 0.08
	MulDE (Wang et al. 2021)	81.00 $\pm$ 0.91	77.60 $\pm$ 1.50	77.00 $\pm$ 0.14	81.67 $\pm$ 0.09	68.83 $\pm$ 0.34	78.13 $\pm$ 0.34	88.05 $\pm$ 1.35
	BGNN(s)	82.67 $\pm$ 0.57	79.67 $\pm$ 0.47	78.33 $\pm$ 1.00	83.30 $\pm$ 0.35	73.90 $\pm$ 0.16	79.03 $\pm$ 0.09	89.85 $\pm$ 0.12
	BGNN(m)-TC	82.80 $\pm$ 0.28	79.67 $\pm$ 0.49	78.03 $\pm$ 0.49	83.90 $\pm$ 0.22	74.67 $\pm$ 0.33	79.23 $\pm$ 0.05	90.12 $\pm$ 0.05
	BGNN(m)-CT	83.30 $\pm$ 0.23	79.33 $\pm$ 0.79	77.78 $\pm$ 1.57	83.90 $\pm$ 0.00	74.40 $\pm$ 0.16	79.10 $\pm$ 0.08	90.40 $\pm$ 0.16
GAT	BAN (Furlanello et al. 2018)	80.10 $\pm$ 0.50	79.50 $\pm$ 0.50	76.83 $\pm$ 1.50	83.93 $\pm$ 0.21	72.90 $\pm$ 0.08	79.57 $\pm$ 0.12	87.66 $\pm$ 0.15
	MulDE (Wang et al. 2021)	79.40 $\pm$ 1.65	78.50 $\pm$ 0.43	77.50 $\pm$ 0.58	83.23 $\pm$ 0.33	71.23 $\pm$ 0.19	78.40 $\pm$ 0.08	88.08 $\pm$ 0.18
	BGNN(s)	80.73 $\pm$ 0.25	79.33 $\pm$ 0.94	79.36 $\pm$ 2.81	84.63 $\pm$ 0.28	74.53 $\pm$ 0.29	79.83 $\pm$ 0.12	89.43 $\pm$ 0.11
	BGNN(m)-SC	81.53 $\pm$ 0.41	81.33 $\pm$ 1.79	78.89 $\pm$ 0.79	84.77 $\pm$ 0.05	74.20 $\pm$ 0.08	80.30 $\pm$ 0.17	89.27 $\pm$ 0.04
	BGNN(m)-CS	81.80 $\pm$ 0.09	80.33 $\pm$ 0.94	80.68 $\pm$ 1.49	84.30 $\pm$ 0.22	74.70 $\pm$ 0.08	79.83 $\pm$ 0.05	89.07 $\pm$ 0.11
Ensemble (Hansen and Salamon 1990)		82.27 $\pm$ 0.09	78.67 $\pm$ 0.94	80.00 $\pm$ 3.60	82.63 $\pm$ 0.05	72.43 $\pm$ 0.26	78.17 $\pm$ 0.31	90.40 $\pm$ 0.30

Table 2: Classification performances under the multi teacher setting. For simplicity, we abbreviate GCN to C, GraphSage to S, and GAT to T. The order in which the abbreviation letters appear reflects their training order. For example, BGNN-ST indicates that GraphSage (S) and GAT (T) are the teachers in turn. We further denote the GNNs trained with single teacher as BGNN(s) and the GNNs trained with multiple teachers as BGNN(m). Best performances for each backbone and each dataset are marked with bold and shadow, respectively.

Experimental Setup

Datasets. We use seven datasets to conduct graph classification and node classification experiments. We follow the data split as in original papers (Sen et al. 2008; Namata et al. 2012) for Cora, Citeseer and Pubmed while the remaining datasets are randomly split using an empirical ratio. The more details are in Section B of Appendix.

BGNN Training. We select GraphSage (Hamilton, Ying, and Leskovec 2017), GCN (Kipf and Welling 2017) and GAT (Velickovic et al. 2018) as GNN backbones and examine the performance of BGNN with different combinations of teachers and students. For all the GNN backbones, we use two-layer models. For the single teacher setting, we use one GNN as the teacher and a different GNN as the student. For the multiple teacher setting, we permute the three GNNs in six orders, where the first two serve as teachers and the third one is the student. During the training, we clamp the adaptive temperature in the range from 1 to 4, which is a commonly used temperature range in KD. Implementation and experiment details are shown in Section C of Appendix.

Baselines. For the single teacher setting, we compare BGNN with the student GNN trained without a teacher and the student GNN trained with different teacher GNNs using KD (Hinton et al. 2015) and LSP (Yang et al. 2020). KD is the most commonly used framework to distillate the knowledge from teacher GNNs to student GNNs. LSP is the state-of-the-art KD framwork proposed for GNNs, which teaches the student based on local structure information instead of logits information. We set the temperature $\tau$ in KD to a commonly used value 4 (Stanton et al. 2021). For the multi teacher setting, we compare BGNN with two multi-teacher KD frameworks (i.e., BAN (Furlanello et al. 2018) and MulDE (Wang et al. 2021)). BAN (Furlanello et al. 2018) uses a similar sequential training strategy as ours. It follows the born-again idea (Breiman and Shang 1996) to train one teacher and multiple students sharing the same architecture, and the resulting model is an ensemble (Hansen and Salamon 1990) of all trained students. MulDE (Wang et al. 2021) trains the student by weighted-combining multiple teachers’ outputs and utilizing KD. Here, the teachers’ outputs are their generated logits for each node or graph. Additionally, we also include the ensemble that averages the prediction of supervised GCN, GraphSage and GAT as a baseline.

Evaluation. We evaluate the performance by classification accuracy. For graph classification tasks, we obtain the entire graph representation by sum pooling as recommended by (Xu et al. 2019). For each experiment, we report the average and standard deviation of the accuracy from ten training rounds.

How Does BGNN Boost the Vanilla GNNs?

This experiment examines whether our BGNN can boost the performance of the vanilla GNNs by transferring knowledge from multiple GNNs. Both the single teacher setting and the multiple teacher setting are considered.

For the single teacher setting, there are six possible teacher-student pairs: GCN-GAT, GCN-GraphSage, GraphSage-GCN, GraphSage-GAT, GAT-GCN and GAT-GraphSage. Each teacher-student pair is trained using our BGNN, KD and LSP. Note that for each student GNN, two teacher options are available. We report the better result of the two teachers for each of BGNN, KD and LSP, which is different from the setting of our ablation study. In Table 1, we can observe that BGNN outperforms the respective supervisedly trained GNN (NoKD) by significant margins for all datasets, regardless of the selection of student GNN. This confirms that our BGNN can boost the vanilla GNNs by transferring additional knowledge from other GNNs. However, we should also note that, the prediction power of the student GNN is still constrained by its own architecture. For example, for the Enzymes dataset, the original GAT without distillation (74.33%) outperforms the original GCN (66.33%) by a large margin. Although our BGNN can boost the performance of GCN from 66.33% to 69.44%, the boosted accuracy of GCN is still lower than that of the original GAT. This indicates that selecting an appropriate student GNN is still important.

For the multi-teacher setting, we examine the performance of BGNN with all the six permutations of the three GNNs. In Table 2, we find that BGNN(m) outperforms the respective BGNN(s) in most cases, and the supervised trained GNNs (NoKD) in all cases. This indicates that the extra teachers in the BGNN training successfully transfer additional knowledge into the student GNNs.

We further compare BGNN(m) with the average ensemble of the vanilla GNNs, which combines the knowledge of GNNs in a straightforward manner. In Table 2, BGNN(m) obtains better prediction results than those of the ensemble. The ensemble fails to leverage the mutual knowledge in the training stage, as it combines the prediction results directly. In contrast, our BGNN trains the student GNNs under the guidance of teachers. This combines the knowledge from teachers and students in a natural way, providing better prediction power to the student trained with BGNN.

Variant	Collab	IMDB	Enzymes	Cora	Citeseer	Pubmed	A-Computers
w/o Adaptive Temp	81.27 $\pm$ 0.25	79.00 $\pm$ 0.00	66.67 $\pm$ 1.36	83.30 $\pm$ 0.22	73.18 $\pm$ 0.23	80.10 $\pm$ 0.14	87.30 $\pm$ 0.06
w/o Weight Boosting	82.53 $\pm$ 0.09	77.67 $\pm$ 0.47	68.33 $\pm$ 1.36	83.17 $\pm$ 0.42	72.80 $\pm$ 0.17	79.80 $\pm$ 0.14	88.56 $\pm$ 0.00
BGNN	82.53 $\pm$ 0.47	79.33 $\pm$ 0.47	68.89 $\pm$ 0.79	83.70 $\pm$ 0.37	73.87 $\pm$ 0.24	80.73 $\pm$ 0.25	89.12 $\pm$ 0.01

Table 3: Results of fusion selections (BGNN, BGNN without adaptive temperature, and BGNN without weight boosting). Here, we use GAT as the teacher and GCN as the student, the same as for Figure 4 and Figure 5.

Does BGNN Distill Knowledge More Effectively?

For the single teacher setting, we compare our BGNN with KD (Hinton et al. 2015) and LSP (Yang et al. 2020). In Table 1, we observe that our method can outperform both KD and LSP on both graph classification and node classification tasks. This may credit to both the adaptive temperature and weight boosting modules.

For the multi-teacher setting, we compare BGNN with MulDE and BAN. MulDE combines the teacher models in parallel, while our BGNN and BAN combines the teacher models in a sequential manner. As shown in Table 2, our BGNN achieves the best performance for all the seven datasets. BGNN also performs better than the respective MulDE and BAN in almost all settings (20 out of 21 cases). In addition, we find that the sequential methods (BGNN and BAN) outperform the parallel method (MulDE) in general.

A possible explanation is that all GNNs play an important role in the parallel MulDE training at the same time, which may not exploit the full potential of the more powerful GNNs. On the contrary, the sequential methods trains a student with a single teacher at each step. This allows the student to focus on learning from one specific teacher GNN. In this way, the prediction power of a single powerful teacher GNN may be transferred to the student more effectively.

For the sequential methods, BAN uses the same architecture for both the teachers and the students. Therefore, it is not able to leverage information from different GNNs. In contrast, our BGNN achieves better results by combining knowledge of different models. Furthermore, it is worth mentioning that our model does not use the average ensemble step like BAN. This may lead to higher accuracy as well, assuming that the final student can inherit the power of all preceding GNNs.

How Does the Selection of Teacher GNNs Influence the Student GNNs?

We study the performance of the student GNNs when they are taught by different teachers. The results are shown in Figure 3. In 18 out of the 21 settings, the student learning from different architectures outperforms the respective ones learning from the same teacher GNNs. One exception is that, on Enzymes, the student GCN learning from GCN is more powerful than the GCNs learning from other GNNs. But for all the other datasets, we still find that GCN learns more from other GNNs. If we only consider the node classification task (the four datasets on the right part of each histogram), it becomes more obvious that a student may learn more from different architectures. In 11 out of the 12 settings, the students learning from the same architecture have the worst performance. However, for the multi-teacher setting, we do not find a clear clue of how to decide the order of teachers. As shown in Table 2, BGNN(m) usually outperforms BGNN(s), but it seems arbitrary when a specific order will be better.

Ablation Study

Does the adaptive temperature matter for knowledge distillation? In Table 3, we can see that the original BGNN performs better than its variation without the adaptive temperature (replacing the adaptive temperature with a fixed temperature). It means the adaptive temperature indeed helps BGNN learn more knowledge from the teacher GNN than the fixed temperature. Then, we further compare the adaptive temperature with the fixed temperatures from 1 to 10. Note that we do not train the models with the weight boosting to exclude its impact. In the left of Figure 4, we find that the adaptive temperature achieves the highest accuracy for all the seven datasets. This means that the adaptive temperature can make the best trade-off between the dark knowledge and true label information to maximize the transferred knowledge.

Does the weight boosting help? In Table 3, we observe that BGNN performs better than BGNN without the weight boosting, which confirms its effectiveness of weight boosting. We argue that the power originates from the improvement on the nodes mis-classified by the teacher GNN. This is verified by the right figure of Figure 5, which compares the performance with and without weight boosting on the mis-classified nodes. We see that BGNN obtains higher accuracy with weight boosting, because the weight boosting forces the student to pay more attention on these mis-classified nodes.

Conclusion

In this paper, we propose a novel KD framework to boost a single GNN by combining various knowledge from different GNNs. We develop a sequential KD training strategy to merge all knowledge. To transfer more useful knowledge from the teacher model to the student model, we propose an adaptive temperature for each node based on the teacher’s confidence. Additionally, our boosting weight module helps the student to pick up the knowledge missed by the teacher GNN. The effectiveness of our methods has been proved on both node classification and graph classification tasks.

Acknowledgements

This work was supported by National Science Foundation under the NSF Center for Computer Assisted Synthesis (C-CAS), grant number CHE-2202693. We thank all anonymous reviewers for their valuable comments and suggestions.

References

Alon and Yahav (2020) Alon, U.; and Yahav, E. 2020. On the bottleneck of graph neural networks and its practical implications. arXiv preprint arXiv:2006.05205.
Balcilar et al. (2020) Balcilar, M.; Renton, G.; Héroux, P.; Gaüzère, B.; Adam, S.; and Honeine, P. 2020. Analyzing the expressive power of graph neural networks in a spectral perspective. In ICLR.
Borgwardt et al. (2005) Borgwardt, K. M.; Ong, C. S.; Schönauer, S.; Vishwanathan, S.; Smola, A. J.; and Kriegel, H.-P. 2005. Protein function prediction via graph kernels. Bioinformatics.
Breiman (2001) Breiman, L. 2001. Statistical modeling: The two cultures (with comments and a rejoinder by the author). Statistical science.
Breiman and Shang (1996) Breiman, L.; and Shang, N. 1996. Born again trees. University of California, Berkeley, Berkeley, CA, Technical Report.
Clevert, Unterthiner, and Hochreiter (2015) Clevert, D.-A.; Unterthiner, T.; and Hochreiter, S. 2015. Fast and accurate deep network learning by exponential linear units (elus). arXiv preprint arXiv:1511.07289.
Deng and Zhang (2021) Deng, X.; and Zhang, Z. 2021. Graph-Free Knowledge Distillation for Graph Neural Networks. In IJCAI.
Feng et al. (2022) Feng, K.; Li, C.; Yuan, Y.; and Wang, G. 2022. FreeKD: Free-direction Knowledge Distillation for Graph Neural Networks. In KDD.
Fey and Lenssen (2019) Fey, M.; and Lenssen, J. E. 2019. Fast Graph Representation Learning with PyTorch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds.
Freund, Schapire, and Abe (1999) Freund, Y.; Schapire, R.; and Abe, N. 1999. A short introduction to boosting. Journal-Japanese Society For Artificial Intelligence.
Furlanello et al. (2018) Furlanello, T.; Lipton, Z.; Tschannen, M.; Itti, L.; and Anandkumar, A. 2018. Born again neural networks. In ICML.
Gilmer et al. (2017) Gilmer, J.; Schoenholz, S. S.; Riley, P. F.; Vinyals, O.; and Dahl, G. E. 2017. Neural message passing for quantum chemistry. In ICML.
Guo et al. (2022a) Guo, Z.; Nan, B.; Tian, Y.; Wiest, O.; Zhang, C.; and Chawla, N. V. 2022a. Graph-based Molecular Representation Learning. arXiv preprint arXiv:2207.04869.
Guo et al. (2022b) Guo, Z.; Shiao, W.; Zhang, S.; Liu, Y.; Chawla, N.; Shah, N.; and Zhao, T. 2022b. Linkless Link Prediction via Relational Distillation. arXiv preprint arXiv:2210.05801.
Guo et al. (2021) Guo, Z.; Zhang, C.; Yu, W.; Herr, J.; Wiest, O.; Jiang, M.; and Chawla, N. V. 2021. Few-shot graph learning for molecular property prediction. In WWW.
Hamilton, Ying, and Leskovec (2017) Hamilton, W.; Ying, Z.; and Leskovec, J. 2017. Inductive representation learning on large graphs. NeurIPS.
Hansen and Salamon (1990) Hansen, L. K.; and Salamon, P. 1990. Neural network ensembles. IEEE transactions on pattern analysis and machine intelligence.
Hastie et al. (2009) Hastie, T.; Rosset, S.; Zhu, J.; and Zou, H. 2009. Multi-class adaboost. Statistics and its Interface.
Heo et al. (2019) Heo, B.; Kim, J.; Yun, S.; Park, H.; Kwak, N.; and Choi, J. Y. 2019. A comprehensive overhaul of feature distillation. In ICCV.
Hinton et al. (2015) Hinton, G.; Vinyals, O.; Dean, J.; et al. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
Hu et al. (2019) Hu, W.; Liu, B.; Gomes, J.; Zitnik, M.; Liang, P.; Pande, V.; and Leskovec, J. 2019. Strategies for pre-training graph neural networks. arXiv preprint arXiv:1905.12265.
Jin et al. (2021) Jin, W.; Liu, X.; Zhao, X.; Ma, Y.; Shah, N.; and Tang, J. 2021. Automated self-supervised learning for graphs. arXiv preprint arXiv:2106.05470.
Kingma and Ba (2014) Kingma, D. P.; and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Kipf and Welling (2017) Kipf, T. N.; and Welling, M. 2017. Semi-supervised classification with graph convolutional networks. In ICLR.
Klicpera, Bojchevski, and Günnemann (2019) Klicpera, J.; Bojchevski, A.; and Günnemann, S. 2019. Predict then Propagate: Graph Neural Networks meet Personalized PageRank. In ICLR.
Kornblith et al. (2019) Kornblith, S.; Norouzi, M.; Lee, H.; and Hinton, G. 2019. Similarity of neural network representations revisited. In ICML.
Leskovec, Kleinberg, and Faloutsos (2005) Leskovec, J.; Kleinberg, J.; and Faloutsos, C. 2005. Graphs over time: densification laws, shrinking diameters and possible explanations. In KDD.
Li, Han, and Wu (2018) Li, Q.; Han, Z.; and Wu, X.-M. 2018. Deeper insights into graph convolutional networks for semi-supervised learning. In AAAI.
McAuley et al. (2015) McAuley, J.; Targett, C.; Shi, Q.; and Van Den Hengel, A. 2015. Image-based recommendations on styles and substitutes. In SIGIR.
Namata et al. (2012) Namata, G.; London, B.; Getoor, L.; Huang, B.; and Edu, U. 2012. Query-driven active surveying for collective classification. In International Workshop on Mining and Learning with Graphs.
Romero et al. (2014) Romero, A.; Ballas, N.; Kahou, S. E.; Chassang, A.; Gatta, C.; and Bengio, Y. 2014. Fitnets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550.
Sen et al. (2008) Sen, P.; Namata, G.; Bilgic, M.; Getoor, L.; Galligher, B.; and Eliassi-Rad, T. 2008. Collective classification in network data. AI magazine.
Shchur et al. (2018) Shchur, O.; Mumme, M.; Bojchevski, A.; and Günnemann, S. 2018. Pitfalls of graph neural network evaluation. arXiv preprint arXiv:1811.05868.
Stanton et al. (2021) Stanton, S.; Izmailov, P.; Kirichenko, P.; Alemi, A. A.; and Wilson, A. G. 2021. Does knowledge distillation really work? NeurIPS.
Sun, Zhu, and Lin (2021) Sun, K.; Zhu, Z.; and Lin, Z. 2021. Adagcn: Adaboosting graph convolutional networks into deep models. In ICLR.
Touvron et al. (2021) Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; and Jégou, H. 2021. Training data-efficient image transformers & distillation through attention. In ICML.
Vaswani et al. (2017) Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. NeurIPS.
Velickovic et al. (2018) Velickovic, P.; Cucurull, G.; Casanova, A.; Romero, A.; Lio, P.; and Bengio, Y. 2018. Graph attention networks. In ICLR.
Wang et al. (2021) Wang, K.; Liu, Y.; Ma, Q.; and Sheng, Q. Z. 2021. Mulde: Multi-teacher knowledge distillation for low-dimensional knowledge graph embeddings. In WWW.
Wu et al. (2019) Wu, F.; Souza, A.; Zhang, T.; Fifty, C.; Yu, T.; and Weinberger, K. 2019. Simplifying graph convolutional networks. In ICML.
Xu et al. (2019) Xu, K.; Hu, W.; Leskovec, J.; and Jegelka, S. 2019. How Powerful are Graph Neural Networks? In ICLR.
Yan et al. (2020) Yan, B.; Wang, C.; Guo, G.; and Lou, Y. 2020. Tinygnn: Learning efficient graph neural networks. In KDD.
Yanardag and Vishwanathan (2015) Yanardag, P.; and Vishwanathan, S. 2015. Deep graph kernels. In KDD.
Yang, Liu, and Shi (2021) Yang, C.; Liu, J.; and Shi, C. 2021. Extract the knowledge of graph neural networks and go beyond it: An effective knowledge distillation framework. In WWW.
Yang et al. (2020) Yang, Y.; Qiu, J.; Song, M.; Tao, D.; and Wang, X. 2020. Distilling knowledge from graph convolutional networks. In CVPR.
Yim et al. (2017) Yim, J.; Joo, D.; Bae, J.; and Kim, J. 2017. A gift from knowledge distillation: Fast optimization, network minimization and transfer learning. In CVPR.
You et al. (2017) You, S.; Xu, C.; Xu, C.; and Tao, D. 2017. Learning from multiple teacher networks. In KDD.
Yuan et al. (2019) Yuan, L.; Tay, F. E.; Li, G.; Wang, T.; and Feng, J. 2019. Revisit knowledge distillation: a teacher-free framework.
Zhang et al. (2022) Zhang, S.; Liu, Y.; Sun, Y.; and Shah, N. 2022. Graph-less neural networks: Teaching old mlps new tricks via distillation. In ICLR.
Zhang et al. (2020) Zhang, W.; Miao, X.; Shao, Y.; Jiang, J.; Chen, L.; Ruas, O.; and Cui, B. 2020. Reliable data distillation on graph convolutional network. In SIGMOD.
Zhang and Sabuncu (2020) Zhang, Z.; and Sabuncu, M. 2020. Self-distillation as instance-specific label smoothing. NeurIPS.
Zhao and Akoglu (2019) Zhao, L.; and Akoglu, L. 2019. Pairnorm: Tackling oversmoothing in gnns. arXiv preprint arXiv:1909.12223.
Zheng et al. (2022) Zheng, W.; Huang, E. W.; Rao, N.; Katariya, S.; Wang, Z.; and Subbian, K. 2022. Cold Brew: Distilling Graph Node Representations with Incomplete or Missing Neighborhoods. In ICLR.

Appendix A Derivation of Equation (2) in the Paper

In this section, we elaborate on the detailed derivation of Equation (2) in the paper. As the softened logit vector produced by the teacher $\hat{\boldsymbol{y}}_{v}^{\mathrm{T}}$ is independent to $z_{v,c}$ :

-\frac{\partial}{\partial z_{v,c}}\sum_{j=1}^{C}\hat{y}_{v,j}^{\mathrm{T}}\mathrm{log}(\hat{y}_{v,j}^{\mathrm{S}})=-\sum_{j=1}^{C}\hat{y}_{v,j}^{\mathrm{T}}\cdot\frac{\partial}{\partial z_{v,c}}\mathrm{log}(\hat{y}_{v,j}^{\mathrm{S}}).

(9)

Then we first expand the term $\mathrm{log}(\hat{y}_{v,j}^{\mathrm{S}})$ into:

	$\displaystyle\mathrm{log}(\hat{y}_{v,j}^{\mathrm{S}})$	$\displaystyle=\mathrm{log}(\frac{e^{z_{v,j}/\tau_{v}}}{\sum_{k=1}^{C}e^{z_{v,k}/\tau_{v}}})$
		$\displaystyle=z_{v,j}/\tau_{v}-\mathrm{log}(\sum_{k=1}^{C}e^{z_{v,k}/\tau_{v}}).$		(10)

Based on the above result, the partial derivative of Equation (9) becomes:

\frac{\partial}{\partial z_{v,c}}\mathrm{log}(\hat{y}_{v,j}^{\mathrm{S}})=\frac{\partial z_{v,j}/\tau_{v}}{\partial z_{v,c}}-\frac{\partial}{\partial z_{v,c}}\mathrm{log}(\sum_{k=1}^{C}e^{z_{v,k}/\tau_{v}}),

(11)

where the first term on the right hand side is:

\frac{\partial z_{v,j}/\tau_{v}}{\partial z_{v,c}}=\left\{\begin{array}[]{rcl}\frac{1}{\tau_{v}}&,&\text{if}\ j=c\\ 0&,&\text{otherwise}\end{array}\right..

(12)

We can concisely rewrite Equation (12) using the indicator function: $\frac{1}{\tau_{v}}\cdot 1\{\cdot\}$ : if the argument is true, the indicator function returns $1$ and the result of Equation (12) becomes $\frac{1}{\tau_{v}}$ ; otherwise, the result is $0$ in both cases. Applying this indicator function and the chain rule, we can rewrite Equation (11) as:

	$\displaystyle\frac{\partial}{\partial z_{v,c}}\mathrm{log}(\hat{y}_{v,j}^{\mathrm{S}})$
	$\displaystyle=\frac{1}{\tau_{v}}\cdot 1\{j=c\}$	$\displaystyle-\frac{1}{\sum_{k=1}^{C}e^{z_{v,k}/\tau_{v}}}\cdot\frac{\partial}{\partial z_{v,c}}(\sum_{k=1}^{C}e^{z_{v,k}/\tau_{v}}).$		(13)

Next, we can derive the partial derivative of the summation term in Equation (13) as follows:

	$\displaystyle\frac{\partial}{\partial z_{v,c}}(\sum_{k=1}^{C}e^{z_{v,k}/\tau_{v}})$
	$\displaystyle=\frac{\partial}{\partial z_{v,c}}(e^{z_{v,1}/\tau_{v}}+e^{z_{v,2}/\tau_{v}}+\cdots+e^{z_{v,C}/\tau_{v}})$
	$\displaystyle=\frac{\partial}{\partial z_{v,c}}(e^{z_{v,c}/\tau_{v}})=\frac{1}{\tau_{v}}(e^{z_{v,c}/\tau_{v}}).$		(14)

Plugging this result into Equation (13), we have:

	$\displaystyle\frac{\partial}{\partial z_{v,c}}\mathrm{log}(\hat{y}_{v,j}^{\mathrm{S}})$	$\displaystyle=\frac{1}{\tau_{v}}\cdot 1\{j=c\}-\frac{1}{\tau_{v}}\frac{e^{z_{v,c}/\tau_{v}}}{\sum_{k=1}^{C}e^{z_{v,k}/\tau_{v}}}$
		$\displaystyle=\frac{1}{\tau_{v}}\cdot 1\{j=c\}-\frac{1}{\tau_{v}}\hat{y}_{v,c}^{\mathrm{S}}.$		(15)

Therefore, the derivative of the KD loss, shown in Equation (9), can be written as:

	$\displaystyle-\frac{\partial}{\partial z_{v,c}}\sum_{j=1}^{C}\hat{y}_{v,j}^{\mathrm{T}}\mathrm{log}(\hat{y}_{v,j}^{\mathrm{S}})$
	$\displaystyle=-\sum_{j=1}^{C}\hat{y}_{v,j}^{\mathrm{T}}\cdot(\frac{1}{\tau_{v}}\cdot 1\{j=c\}-\frac{1}{\tau_{v}}\hat{y}_{v,c}^{\mathrm{S}})$		(16)
	$\displaystyle=\frac{1}{\tau_{v}}\sum_{j=1}^{C}\hat{y}_{v,j}^{\mathrm{T}}\hat{y}_{v,c}^{\mathrm{S}}-\sum_{j=1}^{C}\hat{y}_{v,j}^{\mathrm{T}}\cdot\frac{1}{\tau_{v}}\cdot 1\{j=c\},$		(17)

where $\sum_{j=1}^{C}\hat{y}_{v,j}^{\mathrm{T}}=1$ . Since the value of the indicator function is $1$ only when $j=c$ , the second term of the right hand side in Equation (17) becomes $\frac{1}{\tau_{v}}\hat{y}_{v,c}^{\mathrm{T}}$ . Thus, Equation (17) can be further simplified as:

	$\displaystyle-\frac{\partial}{\partial z_{v,c}}\sum_{j=1}^{C}\hat{y}_{v,j}^{\mathrm{T}}\mathrm{log}(\hat{y}_{v,j}^{\mathrm{S}})$	$\displaystyle=\frac{1}{\tau_{v}}\sum_{j=1}^{C}\hat{y}_{v,j}^{\mathrm{T}}\hat{y}_{v,c}^{\mathrm{S}}-\frac{1}{\tau_{v}}\hat{y}_{v,c}^{\mathrm{T}}$
		$\displaystyle=\frac{1}{\tau_{v}}\hat{y}_{v,c}^{\mathrm{S}}-\frac{1}{\tau_{v}}\hat{y}_{v,c}^{\mathrm{T}},$		(18)

which is exactly the derivation of Equation (2) in the paper: $\frac{\partial{\mathcal{L}_{KD}}}{\partial{z_{v,c}}}=\frac{1}{\tau_{v}}(\hat{y}_{v,c}^{\mathrm{S}}-\hat{y}_{v,c}^{\mathrm{T}})$ .

Note that the above derivation is used for the cross entropy between the prediction of teacher $\hat{y}_{v,c}^{\mathrm{T}}$ and that of the student $\hat{y}_{v,c}^{\mathrm{S}}$ . When the supervised loss between the ground-truth $y_{v}$ and the prediction of student $\hat{y}_{v,c}^{\mathrm{S}}$ is considered, it is sufficient to simply substitute $\hat{y}_{v,c}^{\mathrm{T}}$ in Equation (18) with $y_{v}$ . This result has been used directly in the Pre-analysis paragraphs of Section Distillation via Adaptive Temeperature in the paper.

Appendix B Dataset Details

The dataset statistics for graph classification and node classification can be found in Table 4. In the following, we introduce the detailed description of the datasets.

Task	Dataset	# Graphs	# Nodes	# Edges	# Features	# Classes
Graph	Enzymes	600	$\sim$ 32.6	$\sim$ 124.3	3	6
	IMDB-Binary	1,000	$\sim$ 19.8	$\sim$ 193.1	-	2
	Collab	5,000	$\sim$ 74.5	$\sim$ 4,914.4	-	3
Node	Cora	-	2,485	5,069	1,433	7
	Citeseer	-	2,110	3,668	3,703	6
	Pubmed	-	19,717	44,324	500	3
	A-computers	-	13,381	245,778	767	10

Table 4: Dataset statistics.

Graph Classification Datasets

Collab is a scientific collaboration dataset, incorporating three public collaboration datasets (Leskovec, Kleinberg, and Faloutsos 2005), namely High Energy Physics, Condensed Matter Physics, and Astro Physics. Each graph represents the ego-network of scholars in a research field. The graphs are classified into three classes based on the fields. We randomly split this dataset in our experiments. Since there is no node feature for this dataset, we provide synthetic node features using the one-hot degree transforms¹¹1https://pytorch-geometric.readthedocs.io/en/latest/
modules/datasets.html#torch_geometric.datasets.TUDataset.

IMDB (Yanardag and Vishwanathan 2015) is a movie collaboration dataset. Each graph represents the ego-network of an actor/actress, where each node represents one actor/actress and each edge indicates the co-appearance of two actors/actresses in a movie. The graphs are classified into two classes based on the genres of the movies: Action and Romance. We randomly split the dataset in our experiments. Same as the Collab dataset, we provide synthetic node features using the one-hot degree transforms.

Enzymes (Borgwardt et al. 2005) is a benchmark graph dataset. Each graph represents a protein tertiary structure. The node features are categorical labels. The graphs are classified based on six EC top-level classes. We randomly split the dataset in our experiments.

Node Classification Datasets

Cora (Sen et al. 2008) is a benchmark citation dataset. Each node represents a machine-learning paper and each edge represents the citation relationship between two papers. The nodes are associated with sparse bag-of-words feature vectors. Papers are classified into seven classes based on the research fields. We use the standard fixed splits for this dataset.

Citeseer (Sen et al. 2008) is another benchmark citation dataset, where each node represents a computer science paper. It has a similar configuration to Cora, but it has six classes and larger features for each node. We also use the standard fixed splits for this dataset.

Pubmed (Namata et al. 2012) is a citation dataset as well, where the papers are related to diabetes from the PubMed dataset. The node features are TF/IDF-weighted word frequencies. Nodes are classified into three classes based on the types of diabetes addressed in the paper. This dataset provides standard fixed splits, which we use in our experiments.

A-computers (Shchur et al. 2018) is extracted from Amazon co-purchase graph (McAuley et al. 2015), where each node represents a product and each edge indicates that the two products are frequently bought together. The node features are bag-of-words encoded product reviews. Products are classified into ten classes based on the product category. We randomly split this dataset in our experiments, as this dataset does not come with a standard fixed split.

Appendix C Additional Experimental Results

What’s the influence of $\lambda$ ? We conduct experiments with GAT as the teacher and GCN as the student. Figure 6 shows that the accuracy of BGNN is stable for most datasets except Enzymes, when the trade-off factor $\lambda$ varies from 0.1 to 1. It indicates that our method is insensitive to $\lambda$ in general.

Appendix D Implementation Details

Both our method BGNN and other baselines are implemented using PyTorch. Specifically, we use the PyG (Fey and Lenssen 2019) library for GNN algorithms, and Adam (Kingma and Ba 2014) for optimization with weight decay set to $5\times 10^{-4}$ . Other training settings vary across datasets, which are described as follows.

BGNN Training Details. We use 2-layer GCN (Kipf and Welling 2017), GraphSage (Hamilton, Ying, and Leskovec 2017) and GAT (Velickovic et al. 2018) with activation and dropout operation in intermediate layers as backbone GNNs. We use ReLU as the activation function for GCN and GraphSage, and use ELU activation (Clevert, Unterthiner, and Hochreiter 2015) for GAT. The hidden dimensions of different GNNs are selected per dataset to match the performance reported in the original publication under supervised settings. For GCN and GraphSage, we use the same configuration of hidden dimensions: 16 on Cora and Pubmed; 32 on IMDB, Enzymes, and Citeseer; 64 on Collab; and 128 on A-computers. For GAT, we build the first layer with 8 heads of 8-dimension on most datasets, except 4 heads of 8-dimension on collab and 8 heads of 16-dimension on Enzymes and A-computers.

We take mini-batch to train BGNN on graph classification datasets for memory efficiency. The batch size is 32 on all three datasets. We further add a batch normalization operation in the intermediate layers of GNNs for better prediction performance. We use full-batch on node classification datasets. What’s more, to obtain relatively stable results on graph classification tasks, we take the top 5 results over 10 rounds as the results for the corresponding tasks, which is applied on both BGNN and baselines. For the teacher and student GNNs, we set the number of layers and the hidden dimension of each layer to the same numbers as the corresponding supervised settings. The supervised settings are described in the previous paragraph. For BGNN, we conduct the hyperparameter search of the weights for $\lambda$ from [0.1, 0.5, 1, 5, 10], and the learning rate from [0.005, 0.01, 0.05].

Baseline Details. For BAN (Furlanello et al. 2018), LMTN (You et al. 2017) and KD (Hinton et al. 2015), we select the supervisedly trained GNNs as teachers in the same way as for BGNN. For Ensemble, we take average logits of three supervisedly trained GNNs for both node classification task and graph classification task. GNNs used in the baselines share the same architectures as those in BGNN.

Hardware Details. We run all experiments on a single NVIDIA P100 GPU with 16GB RAM, except the experiments using GAT (Velickovic et al. 2018) on Collab are run on a single NVIDIA V100 GPU with 32GB RAM.